## Indexing

References:
* [PyMuPDF level 3 chunking](https://pymupdf.readthedocs.io/en/latest/rag.html#preparing-data-for-chunking)
* [How to split text based on semantic similarity](https://python.langchain.com/docs/how_to/semantic-chunker/)
* [Semantic Chunking for RAG](https://medium.com/the-ai-forum/semantic-chunking-for-rag-f4733025d5f5)
* [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

In [1]:
pdf_doc = "./docs/2005.11401.pdf"

In [2]:
# level 2: Recursive Character Text Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    file_path=pdf_doc,
    mode="page",
    extract_tables="markdown",
)
documents = loader.load()

recursive_text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
recursive_chunks = recursive_text_splitter.split_documents(documents)

print(len(recursive_chunks))
print(recursive_chunks[0].page_content)

95
Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-parametric
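
To make the two parameters used above concrete, here is a minimal sketch of what `chunk_size` and `chunk_overlap` mean. It is a plain sliding-window splitter written for illustration only: the function name `sliding_window_chunks` is hypothetical, and unlike `RecursiveCharacterTextSplitter` it does not try to break on paragraph, line, or word boundaries first.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Illustrative helper, not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
window_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(window_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each window repeats the last four characters of the previous one, which is the role `chunk_overlap=200` plays at a larger scale in the cell above: text near a boundary appears in both neighbouring chunks.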


In [3]:
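# Save each chunk to its own file so the split points can be inspected
import pathlib

results_dir = pathlib.Path("./results")
results_dir.mkdir(parents=True, exist_ok=True)  # writing fails if the directory is missing

for i, chunk in enumerate(recursive_chunks):
    (results_dir / f"{i}.txt").write_text(chunk.page_content, encoding="utf-8")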

In [4]:
# level 3: document chunking (Markdown-aware)
import pymupdf4llm
from langchain.text_splitter import MarkdownTextSplitter

md_text = pymupdf4llm.to_markdown(pdf_doc)
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=200)
md_chunks = md_splitter.create_documents([md_text])

print(len(md_chunks))
print(md_chunks[2].page_content)

Processing ./docs/2005.11401.pdf...
104
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-specific architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pretrained models with a differentiable access mechanism to explicit non-parametric
memory have so far been only investigated for extractive downstream tasks. We
explore a general-purpose fine-tuning recipe for retrieval-augmented generation
(RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric
memory is a pre-trained seq2seq model and the non-parametric memory is a dense
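
The idea behind document chunking is that the document's own structure suggests the split points. As a rough illustration, the sketch below cuts a Markdown string at heading lines only. The helper `split_on_headings` and the sample text are hypothetical, and this is a simplification: `MarkdownTextSplitter` also enforces `chunk_size` and falls back to other separators, which this sketch does not attempt.

```python
import re

# A heading is a line that starts with one to six '#' followed by a space.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_on_headings(markdown: str) -> list[str]:
    """Return one section per heading, each heading kept with its body.

    Illustrative helper: it does not skip '#' lines inside fenced code blocks.
    """
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


sample_md = "Preamble text.\n\n# Title\n\nIntro.\n\n## Method\n\nDetails.\n"
sections = split_on_headings(sample_md)
for section in sections:
    print(repr(section))
# 'Preamble text.'
# '# Title\n\nIntro.'
# '## Method\n\nDetails.'
```

Running something like this on `md_text` would give one chunk per section, which can be far larger than 1000 characters; that is why a size-bounded splitter is used in the cell above.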


In [5]:
# level 4: semantic chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cpu"},
)

semantic_splitter = SemanticChunker(
    embedding,
    breakpoint_threshold_type="percentile",
)

semantic_chunks = semantic_splitter.create_documents([d.page_content for d in documents])

print(len(semantic_chunks))
print(semantic_chunks[0].page_content)

62
Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures.
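
The `"percentile"` breakpoint type can be pictured as follows: embed consecutive sentences, measure the cosine distance between each neighbouring pair, and start a new chunk wherever the distance exceeds a chosen percentile of all distances. The sketch below shows that rule with hand-made 2-D vectors standing in for real embeddings. Everything in it is an assumption for illustration (the function names, the toy vectors, and the 95th percentile cutoff), and it leaves out the sentence splitting and neighbour buffering that `SemanticChunker` performs.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_groups(sentences, vectors, q=95.0):
    """Group consecutive sentences, breaking where the distance is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, q)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current group
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


toy_sentences = [
    "RAG combines retrieval and generation.",
    "The retriever returns passages.",
    "The generator conditions on them.",
    "Bananas are yellow.",
]
# Hand-made stand-ins for embeddings: the last vector points elsewhere.
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0)]

groups = semantic_groups(toy_sentences, toy_vectors)
print(groups)
```

With these toy vectors only the last pair is far apart, so the first three sentences stay together and the fourth forms its own group. Because the threshold is relative to the document's own distance distribution, the number of chunks (62 above) depends on the text rather than on a fixed chunk size.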
