## Retrieval Model Indexing

In this notebook, we use a pretrained ColBERT retrieval model to build an index from our documents (i.e., Wikipedia pages).

In [1]:
# importing packages
from ragatouille import RAGPretrainedModel
import requests

No CUDA runtime is found, using CUDA_HOME='/usr'


In [2]:
# loading the pretrained ColBERT retrieval model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")



[May 13, 01:45:11] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [3]:
# getting data from wikipedia using the API
def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.
    
    :param title: str - Title of the Wikipedia page.
    :return: str or None - Full text content of the page as a raw string, or None if the page has no extract.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"
    }

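    # A timeout keeps the notebook from hanging if the API does not respond
    response = requests.get(URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    data = response.json()

    # Extracting page content; a missing page has no 'extract' key, so this returns None
    page = next(iter(data['query']['pages'].values()))
    return page.get('extract')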
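
The parsing step in `get_wikipedia_page` can be checked offline, without calling the API. The sketch below applies the same extraction logic to two hand-written dictionaries that imitate the shape of the API's JSON response. The helper name `extract_text`, the page id, and the sample text are made up for illustration.

```python
def extract_text(data):
    """Return the 'extract' of the first page in an API-style response, or None."""
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), {})
    return page.get("extract")

# Hand-written stand-ins for API responses (illustrative values only)
found = {"query": {"pages": {"12345": {"title": "Example", "extract": "Example text."}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_text(found))
print(extract_text(missing))
```

Because a missing page yields `None`, it may be worth filtering out `None` values before passing documents to the indexer.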

In [4]:
# getting the data
mu_corpus = get_wikipedia_page('Manchester United F.C.')
other_titles = [
    'Manchester City F.C.',
    'Arsenal F.C.',
    'Chelsea F.C.',
    'Tottenham Hotspur F.C.',
    'Liverpool F.C.',
    'Premier League',
]
other_docs = [get_wikipedia_page(title) for title in other_titles]

In [5]:
# creating the index
RAG.index(
    collection=[mu_corpus], 
    document_ids=['EPL'],
    document_metadatas=[{"entity": "organization", "source": "wikipedia"}],
    index_name="EPL", 
    max_document_length=180, 
    split_documents=True,
    use_faiss=True,
    )



[May 13, 01:45:18] #> Note: Output directory .ragatouille/colbert/indexes/EPL already exists


[May 13, 01:45:18] #> Will delete 1 files already at .ragatouille/colbert/indexes/EPL in 20 seconds...




[May 13, 01:45:39] [0] 		 #> Encoding 91 passages..


100%|██████████| 3/3 [00:35<00:00, 11.67s/it]

[May 13, 01:46:14] [0] 		 avg_doclen_est = 125.38461303710938 	 len(local_sample) = 91
[May 13, 01:46:14] [0] 		 Creating 1,024 partitions.
[May 13, 01:46:14] [0] 		 *Estimated* 11,409 embeddings.
[May 13, 01:46:14] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/EPL/plan.json ..





Clustering 10840 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.53 s, search 0.50 s): objective=2343.59 imbalance=1.439 nsplit=0       
[0.039, 0.044, 0.037, 0.033, 0.038, 0.043, 0.035, 0.038, 0.035, 0.036, 0.031, 0.038, 0.037, 0.038, 0.036, 0.037, 0.029, 0.036, 0.034, 0.037, 0.036, 0.041, 0.036, 0.036, 0.038, 0.037, 0.036, 0.039, 0.036, 0.039, 0.038, 0.04, 0.042, 0.034, 0.038, 0.035, 0.038, 0.041, 0.04, 0.041, 0.038, 0.032, 0.036, 0.034, 0.037, 0.037, 0.034, 0.036, 0.038, 0.036, 0.035, 0.036, 0.035, 0.035, 0.039, 0.037, 0.042, 0.039, 0.039, 0.035, 0.036, 0.041, 0.038, 0.036, 0.04, 0.037, 0.038, 0.036, 0.034, 0.037, 0.038, 0.036, 0.039, 0.034, 0.037, 0.037, 0.034, 0.042, 0.037, 0.036, 0.038, 0.035, 0.038, 0.037, 0.034, 0.037, 0.038, 0.038, 0.036, 0.038, 0.036, 0.039, 0.036, 0.037, 0.035, 0.038, 0.041, 0.036, 0.035, 0.036, 0.04, 0.04, 0.038, 0.039, 0.037, 0.036, 0.036, 0.033, 0.039, 0.036, 0.038, 0.035, 0.037, 0.03, 0.037, 0.036,

0it [00:00, ?it/s]

[May 13, 01:46:15] [0] 		 #> Encoding 91 passages..


100%|██████████| 3/3 [00:34<00:00, 11.34s/it]
1it [00:34, 34.14s/it]
100%|██████████| 1/1 [00:00<00:00, 988.76it/s]

[May 13, 01:46:49] #> Optimizing IVF to store map from centroids to list of pids..
[May 13, 01:46:49] #> Building the emb2pid mapping..
[May 13, 01:46:49] len(emb2pid) = 11410



100%|██████████| 1024/1024 [00:00<00:00, 122625.76it/s]

[May 13, 01:46:49] #> Saved optimized IVF to .ragatouille/colbert/indexes/EPL/ivf.pid.pt
Done indexing!





'.ragatouille/colbert/indexes/EPL'
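
With `split_documents=True` and `max_document_length=180`, the single Manchester United page was split into 91 passages before encoding. The sketch below only illustrates the idea of splitting a long document into bounded-length passages. It counts whitespace-separated words, which is an assumption made for simplicity; RAGatouille's own splitter works on tokens and may respect sentence boundaries, so it will not produce the same passages. The function name `split_into_passages` is hypothetical.

```python
def split_into_passages(text, max_words=180):
    """Naively split text into consecutive passages of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# Toy document of 400 words
toy_document = " ".join(f"word{i}" for i in range(400))
passages = split_into_passages(toy_document, max_words=180)

for number, passage in enumerate(passages):
    print(number, len(passage.split()))
```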

### Updating the index

In [6]:
RAG.add_to_index(other_docs)

[May 13, 01:47:02] #> Loading codec...
[May 13, 01:47:02] #> Loading IVF...
[May 13, 01:47:02] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[May 13, 01:47:03] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 4644.85it/s]

[May 13, 01:47:03] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 253.97it/s]

[May 13, 01:47:03] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[May 13, 01:47:03] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[May 13, 01:47:03] #> Note: Output directory .ragatouille/colbert/indexes/EPL already exists






[May 13, 01:47:04] [0] 		 #> Encoding 574 passages..


100%|██████████| 18/18 [03:40<00:00, 12.27s/it]

[May 13, 01:50:45] [0] 		 avg_doclen_est = 128.39547729492188 	 len(local_sample) = 574
[May 13, 01:50:45] [0] 		 Creating 4,096 partitions.
[May 13, 01:50:45] [0] 		 *Estimated* 73,699 embeddings.
[May 13, 01:50:45] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/EPL/plan.json ..





used 20 iterations (83.6058s) to cluster 70015 items into 4096 clusters
[0.035, 0.037, 0.037, 0.032, 0.035, 0.035, 0.032, 0.032, 0.032, 0.034, 0.032, 0.033, 0.033, 0.037, 0.034, 0.035, 0.028, 0.034, 0.033, 0.034, 0.034, 0.036, 0.035, 0.033, 0.033, 0.033, 0.033, 0.034, 0.033, 0.034, 0.036, 0.034, 0.04, 0.032, 0.034, 0.032, 0.035, 0.036, 0.035, 0.038, 0.036, 0.032, 0.034, 0.033, 0.036, 0.033, 0.031, 0.035, 0.036, 0.032, 0.033, 0.033, 0.033, 0.032, 0.034, 0.034, 0.039, 0.037, 0.04, 0.032, 0.032, 0.036, 0.034, 0.034, 0.035, 0.035, 0.034, 0.034, 0.032, 0.034, 0.034, 0.031, 0.033, 0.035, 0.035, 0.034, 0.033, 0.036, 0.034, 0.036, 0.036, 0.035, 0.036, 0.035, 0.033, 0.035, 0.036, 0.037, 0.034, 0.036, 0.034, 0.036, 0.034, 0.036, 0.033, 0.036, 0.038, 0.033, 0.033, 0.036, 0.035, 0.039, 0.035, 0.036, 0.034, 0.033, 0.033, 0.034, 0.036, 0.033, 0.034, 0.035, 0.035, 0.033, 0.034, 0.034, 0.033, 0.035, 0.036, 0.033, 0.032, 0.032, 0.033, 0.035, 0.031, 0.03, 0.034, 0.035]


0it [00:00, ?it/s]

[May 13, 01:52:09] [0] 		 #> Encoding 574 passages..


100%|██████████| 18/18 [03:31<00:00, 11.75s/it]
1it [03:34, 214.15s/it]
100%|██████████| 1/1 [00:00<00:00, 562.62it/s]

[May 13, 01:55:43] #> Optimizing IVF to store map from centroids to list of pids..
[May 13, 01:55:43] #> Building the emb2pid mapping..
[May 13, 01:55:43] len(emb2pid) = 73699



100%|██████████| 4096/4096 [00:00<00:00, 128439.94it/s]

[May 13, 01:55:44] #> Saved optimized IVF to .ragatouille/colbert/indexes/EPL/ivf.pid.pt
Successfully updated index with 483 new documents!
 New index size: 574
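
The two indexing logs above report 1,024 partitions for an estimated 11,409 embeddings and 4,096 partitions for an estimated 73,699 embeddings. Both numbers are consistent with choosing the largest power of two not exceeding 16 times the square root of the estimated embedding count. The sketch below reproduces that arithmetic. Treat the formula as an assumption inferred from these logs rather than a documented guarantee; the function name `estimate_partitions` is hypothetical.

```python
import math

def estimate_partitions(num_embeddings_est):
    """Largest power of two <= 16 * sqrt(estimated number of embeddings)."""
    return 2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est)))

# Estimated embedding counts taken from the logs above
for estimated in (11_409, 73_699):
    print(estimated, estimate_partitions(estimated))
```

This also explains why adding documents triggered a full re-clustering: the larger collection calls for a different number of centroids.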





### Retrieving documents

In [7]:
k = 3  # number of passages to retrieve
results = RAG.search(query="When was Manchester United formed?", k=k)

Loading searcher for index EPL for the first time... This may take a few seconds
[May 13, 01:56:26] #> Loading codec...
[May 13, 01:56:26] #> Loading IVF...
[May 13, 01:56:26] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 5461.33it/s]

[May 13, 01:56:26] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 50.49it/s]

Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . When was Manchester United formed?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2043, 2001, 5087, 2142, 2719, 1029,  102,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
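
The tensors above show how the query is prepared: the token ids are wrapped in `101` (`[CLS]`) and `102` (`[SEP]`), a query marker with id `1` is placed after `[CLS]`, and the sequence is padded to 32 positions with `103` (`[MASK]`), while the mask marks only the real tokens with 1. The plain-Python sketch below reproduces that padding pattern from the ids printed in the output. It is an illustration of the pattern only, not ColBERT's tokenizer code, and the function name `pad_query` is hypothetical.

```python
CLS, Q_MARKER, SEP, MASK = 101, 1, 102, 103

def pad_query(token_ids, query_maxlen=32):
    """Wrap token ids with special tokens and pad with [MASK] up to query_maxlen."""
    body = list(token_ids)[: query_maxlen - 3]  # leave room for CLS, marker, SEP
    ids = [CLS, Q_MARKER] + body + [SEP]
    padding = query_maxlen - len(ids)
    mask = [1] * len(ids) + [0] * padding
    return ids + [MASK] * padding, mask

# Word-piece ids for "when was manchester united formed ?" as printed above
query_ids = [2043, 2001, 5087, 2142, 2719, 1029]
ids, mask = pad_query(query_ids)

print(ids)
print(mask)
```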






In [8]:
results

[{'content': "From 2012, some shares of the club were listed on the New York Stock Exchange, although the Glazer family retains overall ownership and control of the club.\n\n\n== History ==\n\n\n=== Early years (1878–1945) ===\n\nManchester United was formed in 1878 as Newton Heath LYR Football Club by the Carriage and Wagon department of the Lancashire and Yorkshire Railway (LYR) depot at Newton Heath. The team initially played games against other departments and railway companies, but on 20 November 1880, they competed in their first recorded match; wearing the colours of the railway company – green and gold – they were defeated 6–0 by Bolton Wanderers' reserve team. By 1888, the club had become a founding member of The Combination, a regional football league.",
  'score': 27.248620986938477,
  'rank': 1,
  'document_id': 'EPL',
  'passage_id': 4,
  'document_metadata': {'entity': 'organization', 'source': 'wikipedia'}},
 {'content': "In January 1902, with debts of £2,670 – equivalen