## **This repo introduces RAG with ColBERT (RAGatouille)**

> Implementations with both LangChain and LlamaIndex



## 1- LangChain

### Install dependencies

In [1]:
!pip install -q -U ragatouille
!pip install -q langchain
!pip install -q langchain-openai
!pip install -q langchain-core
!pip install -q langchain-community
!pip install -q pypdf


### Load the model

In [2]:
from ragatouille import RAGPretrainedModel

model_name = "colbert-ir/colbertv2.0"
model = RAGPretrainedModel.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Document Loading

In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("ORCA.pdf")
pages = loader.load_and_split()

len(pages)

55

In [5]:
# Group all pages into one document

full_document = "".join(page.page_content for page in pages)

print(full_document)

Orca: Progressive Learning from Complex
Explanation Traces of GPT-4
Subhabrata Mukherjee∗†, Arindam Mitra∗
Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah
Microsoft Research
Abstract
Recent research has focused on enhancing the capability of smaller models
through imitation learning, drawing on the outputs generated by large
foundation models (LFMs). A number of issues impact the quality of these
models, ranging from limited imitation signals from shallow LFM outputs;
small scale homogeneous training data; and most notably a lack of rigorous
evaluation resulting in overestimating the small model’s capability as they
tend to learn to imitate the style, but not the reasoning process of LFMs . To
address these challenges, we develop Orca, a 13-billion parameter model
that learns to imitate the reasoning process of LFMs. Orca learns from
rich signals from GPT-4 including explanation traces; step-by-step thought
processes; and other complex instructions, guided by teacher assi
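
Concatenating pages with no separator can fuse the last word of one page with the first word of the next. A minimal sketch of a separator-aware join follows; the `Page` class is a hypothetical stand-in for the loader's page objects (only `page_content` is used), and the demo pages are made up.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate page texts, placing `separator` between consecutive pages."""
    return separator.join(page.page_content for page in pages)


# Made-up pages, for illustration only.
demo_pages = [Page("end of page one"), Page("Start of page two")]
print(join_pages(demo_pages))
```

With the real loader output, `join_pages(pages)` could replace the plain concatenation if page boundaries should stay visible in the text.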

### Embedding + index

In [6]:
model.index(
    collection = [full_document],
    index_name = 'orca',
    max_document_length = 512,
    split_documents = True,
    # The PLAID index is built without FAISS by default (see the warning in the output);
    # pass use_faiss = True to revert to the FAISS-based behaviour
    # use_faiss = True
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Apr 16, 14:34:25] #> Creating directory .ragatouille/colbert/indexes/orca 


[Apr 16, 14:34:27] [0] 		 #> Encoding 87 passages..
[Apr 16, 14:34:30] [0] 		 avg_doclen_est = 359.0804748535156 	 len(local_sample) = 87
[Apr 16, 14:34:30] [0] 		 Creating 2,048 partitions.
[Apr 16, 14:34:30] [0] 		 *Estimated* 31,240 embeddings.
[Apr 16, 14:34:30] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/orca/plan.json ..
used 19 iterations (0.5929s) to cluster 29678 items into 2048 clusters
[Apr 16, 14:34:31] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Apr 16, 14:36:03] Loading p


[Apr 16, 14:37:35] [0] 		 #> Encoding 87 passages..



[Apr 16, 14:37:37] #> Optimizing IVF to store map from centroids to list of pids..
[Apr 16, 14:37:37] #> Building the emb2pid mapping..
[Apr 16, 14:37:37] len(emb2pid) = 31240




[Apr 16, 14:37:37] #> Saved optimized IVF to .ragatouille/colbert/indexes/orca/ivf.pid.pt
Done indexing!





'.ragatouille/colbert/indexes/orca'
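
The log above reports 87 passages encoded from a single input document, which is the effect of `split_documents = True` together with `max_document_length = 512`. The sketch below only illustrates the idea of length-bounded splitting. It counts whitespace-separated words, whereas the library works on model tokens and may split differently, so the helper `split_into_chunks` is a hypothetical approximation and not RAGatouille's implementation.

```python
def split_into_chunks(text, max_tokens=512):
    """Split `text` into chunks of at most `max_tokens` whitespace tokens.

    Assumption: whitespace words stand in for model tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


# Made-up text, for illustration only.
demo_text = "one two three four five six seven"
chunks = split_into_chunks(demo_text, max_tokens=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

A smaller `max_document_length` should give more, shorter passages, and each passage is indexed and retrieved on its own.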

### Retrieval

In [7]:
results = model.search(query="What is instruction tuning?", k=3)

results

Loading searcher for index orca for the first time... This may take a few seconds
[Apr 16, 14:37:45] #> Loading codec...
[Apr 16, 14:37:45] #> Loading IVF...
[Apr 16, 14:37:45] #> Loading doclens...



[Apr 16, 14:37:45] #> Loading codes and residuals...




Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is instruction tuning?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2003,  7899, 17372,  1029,   102,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')






[{'content': "For multimodal tasks, instruction tuning has been used to generate\nsynthetic instruction-following data for language-image tasks, such as image captioning [ 23]\nand visual question answering [24].\nA wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family. As outlined in Section 1.1,\na significant drawback with all these works has been both limited task diversity, query\ncomplexity and small-scale training data in addition to limited evaluation overstating the\nbenefits of such approach.\n2.2 Role of System Instructions\nVanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs\nwith short and terse responses. Such responses when used to train smaller models, as in\nexisting works, give them limited ability to trace the reasoning process of the LFM. In
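
ColBERT ranks passages by late interaction: each query token embedding is matched with its most similar passage token embedding, and those maxima are summed into the passage score. The sketch below shows that scoring rule on made-up 2-dimensional vectors. It is a toy illustration under stated assumptions (plain dot products, no normalisation, no index compression), and the names `dot` and `maxsim_score` are hypothetical helpers, not part of RAGatouille.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token.

    Assumes `doc_vecs` is non-empty.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Made-up token embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.0, 0.2]]

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Because every query token contributes its own best match, a passage can score well by covering different query terms in different places.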

### LangChain retriever

In [8]:
retriever = model.as_langchain_retriever(k=3)

retriever.invoke("What is instruction tuning?")

[Document(page_content="For multimodal tasks, instruction tuning has been used to generate\nsynthetic instruction-following data for language-image tasks, such as image captioning [ 23]\nand visual question answering [24].\nA wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family. As outlined in Section 1.1,\na significant drawback with all these works has been both limited task diversity, query\ncomplexity and small-scale training data in addition to limited evaluation overstating the\nbenefits of such approach.\n2.2 Role of System Instructions\nVanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs\nwith short and terse responses. Such responses when used to train smaller models, as in\nexisting works, give them limited ability to trace the reasoning process of t

### Creating Chain

In [10]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('openai_key')
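
The cell above only works inside Google Colab. A hedged sketch of a fallback for other environments follows: it checks the environment variable first, then the Colab secret, then prompts. The helper name `get_openai_key` is hypothetical, and the secret name `openai_key` is taken from the cell above.

```python
import os
from getpass import getpass


def get_openai_key(secret_name="openai_key"):
    """Return the API key from the environment, Colab secrets, or a prompt."""
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        return userdata.get(secret_name)
    except ImportError:
        # Outside Colab: ask interactively without echoing the key.
        return getpass("OpenAI API key: ")
```

A possible use is `os.environ["OPENAI_API_KEY"] = get_openai_key()` in place of the Colab-only assignment.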

In [11]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """Answer the following question based on the provided context:

<STR>
{context}
</END>

Question: {input}"""
)

llm = ChatOpenAI()
document_chain = create_stuff_documents_chain(
    llm,
    prompt
) # the "stuff" chain inserts all retrieved documents into the prompt's {context}
retreival_chain = create_retrieval_chain(
    retriever,
    document_chain
)
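
Conceptually, "stuffing" means joining the text of the retrieved documents and substituting it into the prompt before the LLM is called. The sketch below imitates that step in plain Python with a simplified template. `Doc`, `stuff_prompt`, and the blank-line separator are assumptions for illustration, not LangChain's implementation.

```python
from dataclasses import dataclass

# Simplified version of the notebook prompt (assumption: delimiters omitted).
PROMPT_TEMPLATE = """Answer the following question based on the provided context:

{context}

Question: {input}"""


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def stuff_prompt(docs, question, separator="\n\n"):
    """Join all document texts and fill the prompt template."""
    context = separator.join(doc.page_content for doc in docs)
    return PROMPT_TEMPLATE.format(context=context, input=question)


# Made-up documents, for illustration only.
demo_prompt = stuff_prompt([Doc("First passage."), Doc("Second passage.")], "What is X?")
print(demo_prompt)
```

Since every retrieved passage ends up in one prompt, the retriever's `k` and the passage length together determine how much context the LLM receives.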

In [13]:
retreival_chain.invoke({"input": "What is instruction tuning?"})

{'input': 'What is instruction tuning?',
 'context': [Document(page_content="For multimodal tasks, instruction tuning has been used to generate\nsynthetic instruction-following data for language-image tasks, such as image captioning [ 23]\nand visual question answering [24].\nA wide range of works in recent times, including Alpaca [ 7], Vicuna [ 9], WizardLM [ 8] and\nKoala [14], have adopted instruction-tuning to train smaller language models with outputs\ngenerated from large foundation models from the GPT family. As outlined in Section 1.1,\na significant drawback with all these works has been both limited task diversity, query\ncomplexity and small-scale training data in addition to limited evaluation overstating the\nbenefits of such approach.\n2.2 Role of System Instructions\nVanilla instruction-tuning (refer to Figure 4 for examples) often uses input, response pairs\nwith short and terse responses. Such responses when used to train smaller models, as in\nexisting works, give the

In [14]:
response = retreival_chain.invoke({"input": "What is instruction tuning?"})
response["answer"]

'Instruction tuning is a technique that allows pre-trained language models to learn from input (natural language descriptions of the task) and response pairs. It has been applied to both language-only and multimodal tasks to improve the performance of models in various benchmarks. In the context provided, instruction tuning has been used to generate synthetic instruction-following data for language-image tasks such as image captioning and visual question answering.'
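
The chain returns a dictionary with the keys `input`, `context`, and `answer`, as seen in the output above. A small sketch for showing the answer next to a preview of its sources follows; `Doc` is a hypothetical stand-in for the retrieved documents, `summarize_response` is a hypothetical helper, and the demo values are made up.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def summarize_response(response, preview_chars=80):
    """Format the question, the answer, and a one-line preview per source."""
    lines = [
        f"Q: {response['input']}",
        f"A: {response['answer']}",
        "Sources:",
    ]
    for i, doc in enumerate(response["context"], start=1):
        # Collapse newlines and repeated spaces before truncating.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"  [{i}] {preview}")
    return "\n".join(lines)


# Made-up response, for illustration only.
demo = {
    "input": "What is X?",
    "context": [Doc("X is\na thing.")],
    "answer": "A thing.",
}
print(summarize_response(demo))
```

Printing the sources alongside the answer makes it easier to check whether the answer is actually supported by the retrieved passages.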

## 2- LlamaIndex

### Install dependencies

In [16]:
!pip install -q llama-index
!pip install -q llama-hub
!pip install -q llama-index-core
!pip install -q llama-index-llms-openai



### Document Loading

In [17]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_files = ['ORCA.pdf']
)
docs = reader.load_data()

docs

[Document(id_='b5d4b189-66ca-4da3-9e75-320d313eda0f', embedding=None, metadata={'page_label': '1', 'file_name': 'ORCA.pdf', 'file_path': 'ORCA.pdf', 'file_type': 'application/pdf', 'file_size': 1458242, 'creation_date': '2024-04-16', 'last_modified_date': '2024-04-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Orca: Progressive Learning from Complex\nExplanation Traces of GPT-4\nSubhabrata Mukherjee∗†, Arindam Mitra∗\nGanesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah\nMicrosoft Research\nAbstract\nRecent research has focused on enhancing the capability of smaller models\nthrough imitation learning, drawing on the outputs generated by large\nfoundation models (LFMs). A number of issues impact the quality of these\nmodels, ranging from l

### Load the model

In [18]:
from llama_index.core.llama_pack import download_llama_pack

RAGatouilleRetrieverPack = download_llama_pack(
    "RAGatouilleRetrieverPack", "./ragatouille_pack"
)

In [19]:
from llama_index.llms.openai import OpenAI # load llm

In [20]:
ragatouille_pack = RAGatouilleRetrieverPack(
    docs,
    llm = OpenAI(model = "gpt-3.5-turbo"),
    index_name = "orca_paper",
    top_k = 5,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Apr 16, 14:54:17] #> Creating directory .ragatouille/colbert/indexes/orca_paper 


[Apr 16, 14:54:18] [0] 		 #> Encoding 219 passages..
[Apr 16, 14:54:19] [0] 		 avg_doclen_est = 155.7305908203125 	 len(local_sample) = 219
[Apr 16, 14:54:19] [0] 		 Creating 2,048 partitions.
[Apr 16, 14:54:19] [0] 		 *Estimated* 34,104 embeddings.
[Apr 16, 14:54:19] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/orca_paper/plan.json ..
used 20 iterations (0.4308s) to cluster 32400 items into 2048 clusters
[0.034, 0.038, 0.032, 0.031, 0.032, 0.034, 0.035, 0.033, 0.032, 0.035, 0.033, 0.034, 0.031, 0.035, 0.034, 0.035, 0.031, 0.034, 0.031, 0.034, 


[Apr 16, 14:54:20] [0] 		 #> Encoding 219 passages..



[Apr 16, 14:54:21] #> Optimizing IVF to store map from centroids to list of pids..
[Apr 16, 14:54:21] #> Building the emb2pid mapping..
[Apr 16, 14:54:21] len(emb2pid) = 34105




[Apr 16, 14:54:21] #> Saved optimized IVF to .ragatouille/colbert/indexes/orca_paper/ivf.pid.pt





Done indexing!


In [21]:
response = ragatouille_pack.run("What is instruction tuning? ")


Loading searcher for index orca_paper for the first time... This may take a few seconds
[Apr 16, 14:54:38] #> Loading codec...
[Apr 16, 14:54:38] #> Loading IVF...
[Apr 16, 14:54:38] #> Loading doclens...



[Apr 16, 14:54:38] #> Loading codes and residuals...




Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is instruction tuning? , 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2003,  7899, 17372,  1029,   102,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')






In [22]:
response

Response(response='Instruction tuning is a technique that allows pre-trained language models to learn from input (natural language descriptions of the task) and response pairs. It involves training smaller models with pairs of user instructions, input, and corresponding outputs to improve their performance on various benchmarks. This technique has been applied to both language-only and multimodal tasks to enhance the zero-shot and few-shot capabilities of models.', source_nodes=[NodeWithScore(node=TextNode(id_='bdcfbc96-8c2a-45ce-8811-b7514fdc1154', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Given user instructions for a task and an input,\nthe system generates a response. Existing works like Alpaca [ 7], Vicuna [ 9] and variants\nfollow a similar template to train small models with ⟨{user instruction, input}, output ⟩.\n2 Preliminaries\n2.1 Instruction Tuning\nInstruction tuning [ 22] is a technique that allows 

In [23]:
print(response)

Instruction tuning is a technique that allows pre-trained language models to learn from input (natural language descriptions of the task) and response pairs. It involves training smaller models with pairs of user instructions, input, and corresponding outputs to improve their performance on various benchmarks. This technique has been applied to both language-only and multimodal tasks to enhance the zero-shot and few-shot capabilities of models.
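
Besides the answer text, the `Response` object shown earlier carries `source_nodes`, each wrapping a text node. The sketch below lists them with a score and a short preview. The two dataclasses are hypothetical stand-ins that mirror only the attributes used here (`node.text` and `score`), and the demo values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextNode:
    """Hypothetical stand-in exposing only `text`."""
    text: str


@dataclass
class NodeWithScore:
    """Hypothetical stand-in exposing only `node` and `score`."""
    node: TextNode
    score: Optional[float] = None


def list_sources(source_nodes, preview_chars=60):
    """Return one 'score  preview' line per source node."""
    rows = []
    for item in source_nodes:
        preview = " ".join(item.node.text.split())[:preview_chars]
        score = "n/a" if item.score is None else f"{item.score:.2f}"
        rows.append(f"{score}  {preview}")
    return rows


# Made-up nodes, for illustration only.
demo_nodes = [
    NodeWithScore(TextNode("Instruction tuning\nis a technique"), 21.5),
    NodeWithScore(TextNode("No score here")),
]
for row in list_sources(demo_nodes):
    print(row)
```

With the real object, `list_sources(response.source_nodes)` would be the analogous call, assuming the nodes expose the same attributes.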
