# RAG From Scratch: Indexing


## Preface: Chunking

We don't explicitly cover document chunking / splitting.

For an excellent review of document chunking, see this video from Greg Kamradt:

https://www.youtube.com/watch?v=8OJC21T2SL4

## Environment

`(1) Packages`

In [1]:
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

In [24]:
#! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube

`(2) LangSmith`

https://docs.smith.langchain.com/

In [3]:
import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = "key"

`(3) API Keys`

## Part 12: Multi-representation Indexing

Docs:

https://blog.langchain.dev/semi-structured-multi-modal-rag/

https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector

Paper:

https://arxiv.org/abs/2312.06648

In [4]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())



In [25]:
#! pip install langchain_huggingface

In [7]:
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.03,
)

chat_model = ChatHuggingFace(llm=llm)


In [8]:
documents = []

In [9]:
# Collapse every run of whitespace so each page becomes one clean string
for doc in docs:
    documents.append(" ".join(doc.page_content.split()))

In [10]:
import uuid # provides tools for generating UUIDs (Universally Unique Identifiers)

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

chain = (
    {"doc": lambda x: x[:2000]}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | chat_model
    | StrOutputParser()
)

summaries = chain.batch(documents, {"max_concurrency": 5})

In [11]:
summaries

["The document explains the concept of LLM (large language model) powered autonomous agents, where LLM acts as the core controller of the agent's brain. The system consists of several components: planning, reflection, and memory. Planning involves breaking down complex tasks into smaller subgoals, reflection enables self-criticism and self-reflection for improvement, and memory allows for both short-term and long-term storage of information. The system also includes the ability to use external tools for",
 'The article discusses the importance of high-quality human data for training machine learning models, as most of the labeled data comes from human annotation. It highlights the need for attention to detail and careful execution during data collection, as human raters contribute to data quality at each stage of the process. Task design, selection and training of raters, and collecting and aggregating data are all important steps that affect data quality. ML techniques can also be use

In [12]:
from langchain.storage import InMemoryByteStore  # in-memory storage layer for the parent documents
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever # retriever that combines multiple sources of data, allowing retrieval from both vector and byte stores
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries",  # unique identifier for a set of related documents or embeddings(namespace)
                     embedding_function=embeddings)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore, # for child chunks (summaries).
    byte_store=store, # for the parent documents.
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]  # one unique ID per parent document

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs))) # store multiple key-value pairs in the document store


In [13]:
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query,k=1)
sub_docs[0]

Document(metadata={'doc_id': 'c740431b-556d-470e-96a0-1a79948d726a'}, page_content="The document explains the concept of LLM (large language model) powered autonomous agents, where LLM acts as the core controller of the agent's brain. The system consists of several components: planning, reflection, and memory. Planning involves breaking down complex tasks into smaller subgoals, reflection enables self-criticism and self-reflection for improvement, and memory allows for both short-term and long-term storage of information. The system also includes the ability to use external tools for")

In [14]:
# `invoke` replaces the deprecated `get_relevant_documents`. The retriever
# searches the summaries and returns the linked full parent documents.
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]


"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n"

A related idea is the [parent document retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever).
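
To make the pattern above concrete without any external services, here is a minimal pure-Python sketch of multi-representation indexing. It is not how `MultiVectorRetriever` is implemented: the bag-of-words "embedding", the toy documents, and the `retrieve` helper are all assumptions chosen for illustration. The point it shows is that only the summaries are searched, while the shared `doc_id` maps each hit back to the full parent document.

```python
import math
import uuid
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical parent documents and hand-written summaries (illustration only).
parents = [
    "Full text of a long post about agents: planning, memory, tool use ...",
    "Full text of a long post about data quality: raters, labels, aggregation ...",
]
summaries = [
    "agents use planning memory and tools",
    "human raters and label quality for training data",
]

doc_ids = [str(uuid.uuid4()) for _ in parents]

# "Vector store": (summary embedding, doc_id) pairs. Only summaries are searched.
summary_index = [(bow(s), doc_id) for s, doc_id in zip(summaries, doc_ids)]
# "Doc store": doc_id -> full parent document. Only parents are returned.
docstore = dict(zip(doc_ids, parents))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank summaries against the query, then return the linked parent documents."""
    q = bow(query)
    ranked = sorted(summary_index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [docstore[doc_id] for _, doc_id in ranked[:k]]


result = retrieve("memory in agents")
print(result[0][:40])
```

The summary is what gets matched, but the caller receives the full parent text, which is the same split of responsibilities as the `vectorstore` and `byte_store` arguments in the cell above.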

## Part 13: RAPTOR

Deep dive video:

https://www.youtube.com/watch?v=jbGchdTL7d0

Paper:

https://arxiv.org/pdf/2401.18059.pdf

Full code:

https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb

## Part 14: ColBERT

RAGatouille makes it simple to use ColBERT.

ColBERT generates a contextually influenced vector for each token in the passages.

ColBERT similarly generates vectors for each token in the query.

Then, the score of each document is the sum, over all query tokens, of the maximum similarity between that query token's embedding and any of the document's token embeddings (the "MaxSim" operator).

See [here](https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag) and [here](https://python.langchain.com/docs/integrations/retrievers/ragatouille) and [here](https://til.simonwillison.net/llms/colbert-ragatouille).
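
The scoring rule described above can be sketched in a few lines of plain Python. This is an illustrative toy, not ColBERT's actual implementation: the two-dimensional vectors are made up, cosine similarity is assumed as the similarity function, and `maxsim_score` is a hypothetical helper name.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Made-up per-token embeddings (one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token is matched independently, a document scores highly only if it contains a good match for each part of the query, which is why `doc_a` outranks `doc_b` here.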

In [26]:
#! pip install -U ragatouille # RAGatouille: library for using ColBERT-style retrieval in RAG pipelines

In [16]:
from ragatouille import RAGPretrainedModel # wrapper for loading a pretrained ColBERT checkpoint
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


[Oct 16, 17:33:53] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


In [28]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts", # indicates that you want to extract the plain text content of the page.
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    data = response.json()

    # Extracting page content (None if the page has no extract)
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")

full_document = get_wikipedia_page("Hayao_Miyazaki")
print(len(full_document))

67531


In [29]:
full_document[:300]

'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regar'

In [19]:
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------

[Oct 16, 17:34:39] #> Creating directory .ragatouille/colbert/indexes/Miyazaki-123 
[Oct 16, 17:34:40] [0] 		 #> Encoding 121 passages..
[Oct 16, 17:36:03] [0] 		 avg_doclen_est = 131.39669799804688 	 len(local_sample) = 121
[Oct 16, 17:36:03] [0] 		 Creating 1,024 partitions.
[Oct 16, 17:36:03] [0] 		 *Estimated* 15,899 embeddings.
[Oct 16, 17:36:03] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki-123/plan.json ..
[Oct 16, 17:37:30] #> Optimizing IVF to store map from centroids to list of pids..
[Oct 16, 17:37:30] #> Building the emb2pid mapping..
[Oct 16, 17:37:30] len(emb2pid) = 15899
[Oct 16, 17:37:30] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki-123/ivf.pid.pt
Done indexing!

'.ragatouille/colbert/indexes/Miyazaki-123'

In [20]:
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results

Loading searcher for index Miyazaki-123 for the first time... This may take a few seconds
[Oct 16, 17:37:31] #> Loading codec...
[Oct 16, 17:37:31] #> Loading IVF...
[Oct 16, 17:37:31] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 16, 17:38:19] #> Loading doclens...
[Oct 16, 17:38:19] #> Loading codes and residuals...
[Oct 16, 17:38:19] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 16, 17:38:59] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What animation studio did Miyazaki found?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])



[{'content': '=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. Miyazaki named the studio after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert"; the name had been registered a year earlier.',
  'score': 25.885807037353516,
  'rank': 1,
  'document_id': '7165b56d-60d6-41c0-b684-876203864c77',
  'passage_id': 42},
 {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in

In [21]:
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")



[Document(metadata={}, page_content='=== Studio Ghibli ===\n\n\n==== Early films (1985–1995) ====\nFollowing the success of Nausicaä of the Valley of the Wind, Miyazaki and Takahata founded the animation production company Studio Ghibli on June 15, 1985, as a subsidiary of Tokuma Shoten, with offices in Kichijōji designed by Miyazaki. Miyazaki named the studio after the Caproni Ca.309 and the Italian word meaning "a hot wind that blows in the desert"; the name had been registered a year earlier.'),
 Document(metadata={}, page_content='Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City, Miyazaki expressed interest in manga and animation from an ear

In [22]:
from langchain_core.prompts import ChatPromptTemplate
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

prompt_template = """You are a helpful assistant. Your task is to give a response using this context:
{context}
Give an answer to this query: {query}
"""

def create_context(docs: list) -> str:
    # Join the retrieved passages, separated by blank lines, into one context string
    return "\n\n".join(doc.page_content for doc in docs)


prompt = ChatPromptTemplate.from_template(prompt_template)

chain = (
    {
        # Retrieve passages with the ColBERT retriever and format them as context
        'context': lambda inputs: create_context(retriever.invoke(inputs['question'])),
        'query': itemgetter('question')
    }
    | prompt
    | chat_model
    | StrOutputParser()
)

In [23]:
chain.invoke({'question': "What animation studio did Miyazaki found?"})



'Miyazaki founded the animation production company Studio Ghibli on June 15, 1985.'