## Using RAGatouille with llama-hub loaders

Use any loader from llama-hub with RAGatouille!

In principle, anything listed at https://llamahub.ai/?tab=loaders can be loaded and its text passed to RAGatouille for indexing, which opens up a lot of data sources!
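
Every example in this notebook follows the same pattern: load documents, pull out each document's `.text`, and pass the resulting list of strings to `RAG.index`. As a minimal sketch of that middle step, the helper below (`documents_to_texts` is a hypothetical name, not part of llama-hub or RAGatouille) also drops empty entries, on the assumption that some loaders may return documents with no text. `SimpleNamespace` stands in for a loaded document here so the snippet runs without any loader installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def documents_to_texts(documents: Iterable[Any]) -> List[str]:
    """Collect the non-empty, stripped `.text` of each loaded document.

    Hypothetical helper: it only assumes each document exposes a `.text`
    attribute, as the loaders used in this notebook do.
    """
    texts = []
    for document in documents:
        text = (getattr(document, "text", "") or "").strip()
        if text:
            texts.append(text)
    return texts


# Stand-in documents; in the notebook these would come from loader.load_data(...)
fake_documents = [
    SimpleNamespace(text="  first abstract  "),
    SimpleNamespace(text=""),
    SimpleNamespace(text=None),
    SimpleNamespace(text="second abstract"),
]
print(documents_to_texts(fake_documents))
```

The returned list can be passed as the `collection` argument in the `RAG.index` calls below.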


In [1]:
# !pip install llama-hub
# !pip install arxiv
# !pip install semanticscholar

In [2]:
from ragatouille import RAGPretrainedModel



# PubMed Loader


In [3]:
from llama_index import download_loader

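# Load the pretrained ColBERTv2 checkpoint used for indexing and search
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Fetch the PubMed loader class from llama-hub
PubmedReader = download_loader("PubmedReader")

loader = PubmedReader()
documents = loader.load_data(search_query="covid 19 vaccine")

# RAGatouille indexes plain strings, so keep only the text of each document
list_vaccine_papers = [document.text for document in documents]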

RAG.index(
    collection=list_vaccine_papers,
    index_name="vaccine_papers",
    max_document_length=180,
    split_documents=True,
)

[Jan 14, 20:59:06] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783939&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783816&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783752&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783720&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783708&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783652&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783611&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783392&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783308&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10783278&db=pmc


[Jan 14, 20:59:22] #> Note: Output directory .ragatouille/colbert/indexes/vaccine_papers already exists


[Jan 14, 20:59:22] #> Will delete 10 files already at .ragatouille/colbert/indexes/vaccine_papers in 20 seconds...
#> Star



[Jan 14, 20:59:44] [0] 		 #> Encoding 605 passages..


100%|██████████| 10/10 [00:33<00:00,  3.36s/it]


[Jan 14, 21:00:18] [0] 		 avg_doclen_est = 133.9619903564453 	 len(local_sample) = 605
[Jan 14, 21:00:18] [0] 		 Creating 4,096 partitions.
[Jan 14, 21:00:18] [0] 		 *Estimated* 81,047 embeddings.
[Jan 14, 21:00:18] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/vaccine_papers/plan.json ..
Clustering 76995 points in 128D to 4096 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (4.44 s, search 4.38 s): objective=16826.5 imbalance=1.402 nsplit=0       
[0.035, 0.038, 0.037, 0.033, 0.033, 0.036, 0.035, 0.033, 0.034, 0.035, 0.034, 0.035, 0.035, 0.035, 0.034, 0.034, 0.032, 0.035, 0.033, 0.035, 0.035, 0.036, 0.036, 0.035, 0.033, 0.034, 0.036, 0.036, 0.035, 0.04, 0.034, 0.037, 0.035, 0.035, 0.034, 0.031, 0.036, 0.035, 0.036, 0.04, 0.036, 0.036, 0.035, 0.034, 0.033, 0.033, 0.035, 0.037, 0.035, 0.034, 0.033, 0.034, 0.035, 0.034, 0.034, 0.037, 0.039, 0.036, 0.039, 0.034, 0.033, 0.034, 0.036, 0.036, 0.037, 0.034, 0.035, 0.037, 0.033, 0.035, 0.035,

100%|██████████| 10/10 [00:35<00:00,  3.56s/it]
1it [00:36, 36.30s/it]
100%|██████████| 1/1 [00:00<00:00, 3509.88it/s]
100%|██████████| 4096/4096 [00:00<00:00, 307811.25it/s]


[Jan 14, 21:00:59] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 14, 21:00:59] #> Building the emb2pid mapping..
[Jan 14, 21:00:59] len(emb2pid) = 81047
[Jan 14, 21:00:59] #> Saved optimized IVF to .ragatouille/colbert/indexes/vaccine_papers/ivf.pid.pt
#> Joined...
Done indexing!
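
The log above shows that the index directory already existed and that its files were deleted before re-indexing. If the goal is to reuse an index built in an earlier run rather than rebuild it, a sketch along these lines may help. It assumes the on-disk layout seen in the log (`.ragatouille/colbert/indexes/<index_name>`) and uses `RAGPretrainedModel.from_index`. `index_dir` and `load_existing_index` are hypothetical helper names introduced for illustration.

```python
from pathlib import Path


def index_dir(index_name: str, root: str = ".ragatouille") -> Path:
    """Build the index path, assuming the layout shown in the indexing log."""
    return Path(root) / "colbert" / "indexes" / index_name


def load_existing_index(index_name: str):
    """Return a model loaded from an existing index, or None if it is absent."""
    path = index_dir(index_name)
    if not path.is_dir():
        return None
    # Imported lazily so the path helper works without ragatouille installed
    from ragatouille import RAGPretrainedModel

    return RAGPretrainedModel.from_index(str(path))


print(index_dir("vaccine_papers").as_posix())
```

When `load_existing_index("vaccine_papers")` returns `None`, fall back to the `RAG.index(...)` call shown earlier.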


In [7]:
k = 3
# retrieve the top-k passages from the indexed PubMed papers
results = RAG.search(query="efficacy of vaccine", k=k, index_name="vaccine_papers")
results

[{'content': 'A review of the challenges assessing the clinical efficacy of vaccines against SARS-CoV-2 Lancet Infect Dis. 2021 Feb 1 21 2 e26 35 https://pubmed.ncbi.nlm.nih.gov/33125914/ doi: 10.1016/S1473-3099(20)30773-8 33125914  8 World Health Organization. Coronavirus disease (COVID-19): Herd immunity, lockdowns and COVID-19 [Internet]. [cited 2022 Jun 22].',
  'score': 23.274442672729492,
  'rank': 1},
 {'content': 'Safety and efficacy of COVID-19 vaccines: a systematic review and meta-analysis of different vaccines at phase 3 Vaccines (Basel) 2021 9 989 34579226  26 Ismail II Salama S A systematic review of cases of CNS demyelination following COVID-19 vaccination J Neuroimmunol 2022 362 577765 34839149  27 Hsiao Y-T Tsai M-J Chen Y-H Acute transverse myelitis after COVID-19 vaccination Medicina (Kaunas) 2021 57 1010 34684047  28 Ismail II Salama S Association of CNS demyelination and COVID-19 infection: an updated systematic review J Neurol 2022 269 541 576 34386902  29 Román G
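
As the output above shows, `RAG.search` returns a list of dicts with `content`, `score`, and `rank` keys, and the raw passages can be long. A small formatting helper makes the results easier to scan. This is only a sketch: `format_results` is a hypothetical name, and the sample hit is abbreviated from the first result above.

```python
from typing import Dict, List


def format_results(results: List[Dict], width: int = 80) -> str:
    """Render one line per hit: rank, score (2 decimals), truncated snippet."""
    lines = []
    for hit in results:
        # Collapse newlines and repeated spaces so each hit stays on one line
        snippet = " ".join(hit["content"].split())
        if len(snippet) > width:
            snippet = snippet[: width - 3] + "..."
        lines.append(f"{hit['rank']}. [{hit['score']:.2f}] {snippet}")
    return "\n".join(lines)


# Abbreviated sample shaped like the search output above
sample_results = [
    {
        "content": "A review of the challenges assessing the clinical efficacy of vaccines against SARS-CoV-2",
        "score": 23.274442672729492,
        "rank": 1,
    }
]
print(format_results(sample_results, width=60))
```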

# Semantic Scholar Loader


In [None]:
from llama_hub.semanticscholar import SemanticScholarReader

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
query_space = "biases in language models"

s2reader = SemanticScholarReader()
documents = s2reader.load_data(
    query_space, limit=50, full_text=False
)  # abstracts only; install pypdf2 and set full_text=True to get full text

# iterate over documents and get "text" field from each document and store in list_documents
list_documents = [document.text for document in documents]

RAG.index(
    collection=list_documents,
    index_name="biases_llms",
    max_document_length=180,
    split_documents=True,
)



[Jan 14, 20:44:05] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...






[Jan 14, 20:44:07] #> Creating directory .ragatouille/colbert/indexes/biases_llms 


#> Starting...
[Jan 14, 20:44:09] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Jan 14, 20:44:09] [0] 		 #> Encoding 92 passages..


100%|██████████| 2/2 [00:05<00:00,  2.54s/it]


[Jan 14, 20:44:14] [0] 		 avg_doclen_est = 124.78260803222656 	 len(local_sample) = 92
[Jan 14, 20:44:14] [0] 		 Creating 1,024 partitions.
[Jan 14, 20:44:14] [0] 		 *Estimated* 11,479 embeddings.
[Jan 14, 20:44:14] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/biases_llms/plan.json ..
Clustering 10906 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.20 s, search 0.20 s): objective=2226.81 imbalance=1.447 nsplit=0       
[0.037, 0.037, 0.035, 0.032, 0.032, 0.035, 0.039, 0.033, 0.033, 0.035, 0.033, 0.033, 0.036, 0.04, 0.033, 0.035, 0.032, 0.034, 0.031, 0.035, 0.033, 0.036, 0.031, 0.037, 0.035, 0.034, 0.038, 0.033, 0.034, 0.035, 0.034, 0.039, 0.035, 0.033, 0.032, 0.031, 0.035, 0.034, 0.032, 0.036, 0.035, 0.036, 0.032, 0.034, 0.034, 0.032, 0.037, 0.036, 0.037, 0.035, 0.031, 0.037, 0.041, 0.036, 0.035, 0.033, 0.037, 0.035, 0.037, 0.034, 0.034, 0.036, 0.039, 0.036, 0.034, 0.035, 0.031, 0.034, 0.031, 0.034, 0.036, 0

100%|██████████| 2/2 [00:04<00:00,  2.47s/it]
1it [00:04,  4.98s/it]
100%|██████████| 1/1 [00:00<00:00, 5974.79it/s]
100%|██████████| 1024/1024 [00:00<00:00, 364567.29it/s]


[Jan 14, 20:44:19] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 14, 20:44:19] #> Building the emb2pid mapping..
[Jan 14, 20:44:19] len(emb2pid) = 11480
[Jan 14, 20:44:19] #> Saved optimized IVF to .ragatouille/colbert/indexes/biases_llms/ivf.pid.pt
#> Joined...
Done indexing!


In [None]:
k = 3
# search in 50 documents for relevant papers
results = RAG.search(query="demographic biases", k=k, index_name="biases_llms")
results

New index_name received! Updating current index_name (biases_llms) to biases_llms
Loading searcher for index biases_llms for the first time... This may take a few seconds
[Jan 14, 20:44:31] #> Loading codec...
[Jan 14, 20:44:31] #> Loading IVF...
[Jan 14, 20:44:31] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 14, 20:44:31] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3802.63it/s]

[Jan 14, 20:44:31] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 660.52it/s]

[Jan 14, 20:44:31] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Jan 14, 20:44:32] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . demographic biases, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1, 15982, 13827,  2229,   102,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





  'score': 24.941387176513672,
  'rank': 1},
 {'content': 'Occupational Biases in Norwegian and Multilingual Language Models In this paper we explore how a demographic distribution of occupations, along gender dimensions, is reflected in pre-trained language models. We give a descriptive assessment of the distribution of occupations, and investigate to what extent these are reflected in four Norwegian and two multilingual models. To this end, we introduce a set of simple bias probes, and perform five different tasks combining gendered pronouns, first names, and a set of occupations from the Norwegian statistics bureau. We show that language specific models obtain more accurate results, and are much closer to the real-world distribution of clearly gendered occupations. However, we see that none of the models have correct representations of the occupations that are demographically balanced between genders.',
  'score': 24.29305076599121,
  'rank': 2},
 {'content': 'Our experiments primar

# PDF Loader


In [10]:
from pathlib import Path
from llama_index import download_loader

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path("data/llama2.pdf"))



In [None]:
list_pdf_documents = [document.text for document in documents]


RAG.index(
    collection=list_pdf_documents,
    index_name="llama2",
    max_document_length=256,
    split_documents=True,
)





[Jan 14, 20:44:39] #> Creating directory .ragatouille/colbert/indexes/llama2 


#> Starting...
[Jan 14, 20:44:42] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Jan 14, 20:44:42] [0] 		 #> Encoding 413 passages..


100%|██████████| 7/7 [00:35<00:00,  5.05s/it]


[Jan 14, 20:45:18] [0] 		 avg_doclen_est = 175.64407348632812 	 len(local_sample) = 413
[Jan 14, 20:45:18] [0] 		 Creating 4,096 partitions.
[Jan 14, 20:45:18] [0] 		 *Estimated* 72,541 embeddings.
[Jan 14, 20:45:18] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/llama2/plan.json ..
Clustering 68914 points in 128D to 4096 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (4.18 s, search 4.13 s): objective=16924.9 imbalance=1.444 nsplit=0       
[0.036, 0.04, 0.036, 0.036, 0.035, 0.04, 0.039, 0.036, 0.036, 0.038, 0.037, 0.037, 0.036, 0.037, 0.037, 0.039, 0.035, 0.039, 0.035, 0.037, 0.037, 0.039, 0.037, 0.039, 0.034, 0.037, 0.039, 0.036, 0.038, 0.042, 0.037, 0.04, 0.039, 0.036, 0.037, 0.035, 0.039, 0.037, 0.037, 0.041, 0.039, 0.038, 0.036, 0.037, 0.037, 0.036, 0.036, 0.041, 0.037, 0.035, 0.037, 0.037, 0.038, 0.037, 0.036, 0.038, 0.041, 0.039, 0.044, 0.035, 0.037, 0.038, 0.039, 0.039, 0.039, 0.038, 0.037, 0.037, 0.035, 0.038, 0.039, 0.036, 

100%|██████████| 7/7 [00:37<00:00,  5.37s/it]
1it [00:38, 38.35s/it]
100%|██████████| 1/1 [00:00<00:00, 2763.05it/s]
100%|██████████| 4096/4096 [00:00<00:00, 266979.58it/s]


[Jan 14, 20:46:01] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 14, 20:46:01] #> Building the emb2pid mapping..
[Jan 14, 20:46:01] len(emb2pid) = 72541
[Jan 14, 20:46:01] #> Saved optimized IVF to .ragatouille/colbert/indexes/llama2/ivf.pid.pt
#> Joined...
Done indexing!


In [None]:
k = 3
# retrieve the top-k passages from the indexed PDF
results = RAG.search(query="Public releases of other llms", k=k, index_name="llama2")
results

[{'content': 'They enable interaction with humans through intuitive\nchat interfaces, which has led to rapid and widespread adoption among the general public.\nThecapabilitiesofLLMsareremarkableconsideringtheseeminglystraightforwardnatureofthetraining\nmethodology. Auto-regressivetransformersarepretrainedonanextensivecorpusofself-superviseddata,\nfollowed by alignment with human preferences via techniques such as Reinforcement Learning with Human\nFeedback(RLHF).Althoughthetrainingmethodologyissimple,highcomputationalrequirementshave\nlimited the development of LLMs to a few players. There have been public releases of pretrained LLMs\n(such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that\nmatch the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla\n(Hoffmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such\nasChatGPT,BARD,andClaude.',
  'score':
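
The passage retrieved from the PDF keeps the hard line breaks (`\n`) of the original page layout. One option, sketched below, is to normalize whitespace in each page's text before calling `RAG.index`. `normalize_whitespace` is a hypothetical helper, not part of llama-hub or RAGatouille. It cannot restore spaces that PDF extraction already dropped (as in `Thecapabilitiesof...`), and whether the cleanup changes retrieval quality has not been measured here.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace any run of whitespace (including newlines) with a single space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "They enable interaction with humans through intuitive\nchat interfaces, which has led to rapid and widespread adoption"
print(normalize_whitespace(raw))
```

It could be applied when building the collection, for example `[normalize_whitespace(document.text) for document in documents]`.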