# Basic indexing and searching with RAGatouille

In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to:

- **Build an index from raw documents**
- **Search an index for relevant documents**
- **Load an index and the associated pretrained model to update or query it**

Please note: Indexing is currently not supported on Google Colab and Windows 10.

First, let's load up a pre-trained ColBERT model:

In [2]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

And that's all you need to do to load the model! All the config is now stored, and ready to be used for indexing.

## Creating an index

Let's index some documents now. We'll use data from Wikipedia to build our Miyazaki index, which will store everything you could ever want to know about Hayao Miyazaki('s Wikipedia page).

First, let's write a function to fetch the data from Wikipedia with a clear user-agent, to be a good netizen:

In [3]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.
    
    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"
    }

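    response = requests.get(URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    data = response.json()

    # Extract the content of the first (and only) page returned
    page = next(iter(data['query']['pages'].values()))
    return page.get('extract')  # None if the page has no extract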
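
The extraction step at the end of the function is easier to follow on a concrete payload. The sketch below uses two hand-written dictionaries that only mimic the shape of the API's JSON response (the page ID, titles and text are made up), and a hypothetical helper `extract_text` that repeats the same lookup:

```python
# Hand-written payloads that mimic the shape of the API response (values are made up).
found = {
    "query": {"pages": {"12345": {"pageid": 12345, "title": "Example", "extract": "Some text."}}}
}
missing = {
    "query": {"pages": {"-1": {"title": "No_such_page", "missing": ""}}}
}

def extract_text(data: dict):
    """Return the plain-text extract of the first page in the payload, or None."""
    # "pages" is keyed by page ID, so we take the first (and only) value.
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")

print(extract_text(found))    # Some text.
print(extract_text(missing))  # None
```

Since the function can return `None`, it is worth checking the result before passing it on to `len()` or to the indexer.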

And now, let's use it to fetch the page's content and check how long it is:

In [4]:
full_document = get_wikipedia_page("Hayao_Miyazaki")
len(full_document)

45093

That's a lot of characters! Thankfully, `RAGPretrainedModel.index()` relies on a `CorpusProcessor`, which takes in various pre-processing functions and applies them to your documents before embedding and indexing them.

By default, `CorpusProcessor` uses LlamaIndex's `SentenceSplitter`, with a chunk size defined by your index's max document length. `max_document_length` defaults to 256 tokens, but you can set it to whatever you like.
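
To give a feel for what sentence-based chunking under a length budget does, here is a deliberately simplified sketch. It is not LlamaIndex's `SentenceSplitter` nor RAGatouille's `CorpusProcessor`: the function name `split_into_chunks` is made up for illustration, and whitespace-separated words stand in for model tokens.

```python
import re

def split_into_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_tokens` words.

    Words are a rough stand-in for tokens; a real tokenizer counts differently.
    A single sentence longer than `max_tokens` still ends up in its own chunk.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n_words > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Miyazaki was born in 1941. He co-founded Studio Ghibli in 1985. He directed many films."
for chunk in split_into_chunks(sample, max_tokens=12):
    print(chunk)
# Miyazaki was born in 1941. He co-founded Studio Ghibli in 1985.
# He directed many films.
```

A smaller budget produces more, shorter passages, which is the trade-off being made when choosing `max_document_length`.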

Let's keep our information units small and go for 180 when creating our index:

In [5]:
RAG.index(collection=[full_document], index_name="Miyazaki", max_document_length=180, split_documents=True)



[Jan 06, 15:02:54] #> Note: Output directory .ragatouille/colbert/indexes/Miyazaki already exists


[Jan 06, 15:02:54] #> Will delete 10 files already at .ragatouille/colbert/indexes/Miyazaki in 20 seconds...
#> Starting...
[Jan 06, 15:03:19] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Jan 06, 15:03:21] [0] 		 #> Encoding 81 passages..


100%|██████████| 2/2 [00:08<00:00,  4.24s/it]


[Jan 06, 15:03:29] [0] 		 avg_doclen_est = 129.9629669189453 	 len(local_sample) = 81
[Jan 06, 15:03:29] [0] 		 Creating 1,024 partitions.
[Jan 06, 15:03:29] [0] 		 *Estimated* 10,527 embeddings.
[Jan 06, 15:03:29] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 10001 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.26 s, search 0.25 s): objective=2071.55 imbalance=1.471 nsplit=0       
[0.036, 0.038, 0.038, 0.036, 0.033, 0.035, 0.032, 0.035, 0.035, 0.034, 0.035, 0.038, 0.032, 0.038, 0.036, 0.036, 0.033, 0.036, 0.035, 0.035, 0.038, 0.037, 0.034, 0.035, 0.037, 0.034, 0.039, 0.035, 0.034, 0.037, 0.04, 0.037, 0.038, 0.035, 0.033, 0.033, 0.035, 0.032, 0.037, 0.038, 0.037, 0.039, 0.035, 0.031, 0.037, 0.033, 0.034, 0.036, 0.036, 0.034, 0.034, 0.035, 0.033, 0.034, 0.035, 0.036, 0.039, 0.039, 0.037, 0.032, 0.033, 0.035, 0.036, 0.033, 0.035, 0.033, 0.034, 0.035, 0.032, 0.034, 0.033, 0.035

0it [00:00, ?it/s]
  0%|          | 0/2 [00:00<?, ?it/s]

[Jan 06, 15:03:30] [0] 		 #> Encoding 81 passages..



 50%|█████     | 1/2 [00:07<00:07,  7.42s/it]
100%|██████████| 2/2 [00:09<00:00,  4.56s/it]
1it [00:09,  9.20s/it]
100%|██████████| 1/1 [00:00<00:00, 1460.41it/s]
100%|██████████| 1024/1024 [00:00<00:00, 221698.62it/s]


[Jan 06, 15:03:39] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 06, 15:03:39] #> Building the emb2pid mapping..
[Jan 06, 15:03:39] len(emb2pid) = 10527
[Jan 06, 15:03:39] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki/ivf.pid.pt
#> Joined...
Done indexing!


And that's our index created! It's already compressed and saved to disk, so you're ready to use it anywhere you want. By the way, the default behaviour of `index()` is to split documents, but if for any reason you'd like them to remain intact (if you've already preprocessed them, for example), you can set `split_documents=False` to bypass the splitting step!
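
If your documents are already chunked, the call might look like the sketch below. The wrapper `index_prechunked` is a hypothetical convenience function, not part of RAGatouille; it only forwards the `collection`, `index_name` and `split_documents` arguments already used in this example.

```python
def index_prechunked(rag, chunks: list[str], index_name: str):
    """Index `chunks` exactly as given, skipping the document-splitting step.

    `rag` is expected to be a loaded `RAGPretrainedModel` (or any object with a
    compatible `index()` method).
    """
    return rag.index(
        collection=chunks,
        index_name=index_name,
        split_documents=False,  # keep each chunk intact
    )

# Example usage (requires RAGatouille and a loaded model; the index name is made up):
# index_prechunked(RAG, my_chunks, index_name="Miyazaki_prechunked")
```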

Let's move on to querying our index now...

## Retrieving Documents

`RAGPretrainedModel` has just indexed our document, so the index is already loaded into it and ready to use! 

Searching is simple and straightforward. Let's say I have a single query:

In [6]:
k = 3 # How many documents you want to retrieve, defaults to 10, we set it to 3 here for readability
results = RAG.search(query="What animation studio did Miyazaki found?", k=k)
results

Loading searcher for index Miyazaki for the first time... This may take a few seconds
[Jan 06, 15:03:43] #> Loading codec...
[Jan 06, 15:03:43] #> Loading IVF...
[Jan 06, 15:03:43] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 1412.22it/s]

[Jan 06, 15:03:43] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 163.80it/s]


Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What animation studio did Miyazaki found?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])



[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
  'score': 25.90575408935547,
  'rank': 1},
 {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City in the E
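
Each result is a dictionary with `content`, `score` and `rank` keys, as the output above shows. The raw list is hard to skim, so here is a small, hypothetical helper (`summarize_results` is not part of RAGatouille) that prints one short line per hit:

```python
def summarize_results(results: list[dict], width: int = 60) -> list[str]:
    """Return one line per result: rank, score and the start of the passage."""
    lines = []
    for result in results:
        # Collapse newlines and repeated spaces, then keep the first `width` characters.
        snippet = " ".join(result["content"].split())[:width]
        lines.append(f"#{result['rank']} (score={result['score']:.2f}): {snippet}")
    return lines

# Example usage with the `results` list returned by `RAG.search()`:
# for line in summarize_results(results):
#     print(line)
```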

But is it efficient? Let's check how long it takes ColBERT to embed our query and retrieve documents. Because ColBERT's main retrieval approach relies on `maxsim`, a very efficient operation, searching through orders of magnitude more documents shouldn't take much longer:

In [7]:
%%timeit
RAG.search(query="What animation studio did Miyazaki found?")

65.3 ms ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
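
For intuition, `maxsim` scores a document by taking, for each query token embedding, its best similarity with any document token embedding, and summing those maxima. The toy sketch below uses made-up 2-dimensional vectors and plain Python; the real index works on 128-dimensional embeddings with optimized kernels, so treat this as an illustration of the scoring rule only.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query token vectors, of the best dot product with any document token vector."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Made-up token embeddings: two query tokens, and two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # contains an exact match for each query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.6
```

`doc_a` ranks higher because every query token finds a perfectly aligned document token.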


You can also batch queries, which will run faster if you've got many different queries to run at once. The output format is the same as for a single query, except that it's a list of lists, where the item at index `i` corresponds to the query at index `i`:

In [7]:
all_results = RAG.search(query=["What animation studio did Miyazaki found?", "Miyazaki son name"], k=k)
all_results

2it [00:00, 139.07it/s]


[[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
   'score': 25.90625,
   'rank': 1},
  {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City in the Empire
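
Because the batched output is aligned with the input by position, pairing each query with its hits is a matter of zipping the two lists. A minimal sketch, with a hypothetical helper name `results_by_query`:

```python
def results_by_query(queries: list[str], all_results: list[list[dict]]) -> dict[str, list[dict]]:
    """Map each query string to its own list of results, relying on positional alignment."""
    if len(queries) != len(all_results):
        raise ValueError("Expected exactly one result list per query")
    return dict(zip(queries, all_results))

# Example usage with the batched call above:
# queries = ["What animation studio did Miyazaki found?", "Miyazaki son name"]
# by_query = results_by_query(queries, RAG.search(query=queries, k=3))
# top_hit = by_query["Miyazaki son name"][0]
```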

And that's it for the basics of querying an index! You're now ready to index and retrieve documents with RAGatouille!

## Using an already-created index

In the examples above, we embedded documents into an index and queried it during the same session. But a key feature is **persistence**: indexing is the slowest part, and we don't want to have to do it every time!

Loading an already-created index is just as straightforward as creating one from scratch. First, we'll load up an instance of `RAGPretrainedModel` from the index, where the full configuration of the embedder is stored:

In [8]:
# This is the path to the index. We recommend keeping this path format when using RAGatouille somewhere else.
path_to_index = ".ragatouille/colbert/indexes/Miyazaki/"
RAG = RAGPretrainedModel.from_index(path_to_index)

And that's it! The index is now fully ready to be queried using `search()` as above.
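
In a script that may run both before and after the index exists, one possible pattern is "load if present, otherwise build". The sketch below is an assumption about how you might wire this up, not a RAGatouille feature: `load_or_build` and `build_fn` are hypothetical names, and the only library call it relies on is `from_index()`.

```python
from pathlib import Path

def load_or_build(model_cls, index_path: str, build_fn):
    """Load a persisted index if its directory exists, otherwise build a new one.

    `model_cls` must expose `from_index(path)`, as `RAGPretrainedModel` does.
    `build_fn` is a zero-argument callable that builds the index and returns the model.
    """
    if Path(index_path).is_dir():
        return model_cls.from_index(index_path)
    return build_fn()

# Example usage (requires RAGatouille; `build_miyazaki_index` is a hypothetical helper):
# RAG = load_or_build(
#     RAGPretrainedModel,
#     ".ragatouille/colbert/indexes/Miyazaki/",
#     build_miyazaki_index,
# )
```

Note that an existing directory does not guarantee a complete index, so this check is only a first line of defence.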

### Updating an index

Once you've loaded an existing index, you might want to add new documents to it. RAGatouille supports this via the `RAGPretrainedModel.add_to_index()` function. Due to the way ColBERT stores documents as bags-of-embeddings, there are cases where recreating the index is more efficient than updating it. You don't need to worry about this: the most efficient method is automatically used when you call `add_to_index()`.

Say you want to expand your index to cover more of Studio Ghibli, so let's get the studio's page into our index too!

In [9]:
new_documents = get_wikipedia_page("Studio_Ghibli")

RAG.add_to_index([new_documents])

[Jan 03, 17:24:37] #> Loading codec...
[Jan 03, 17:24:37] #> Loading IVF...
[Jan 03, 17:24:37] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 2593.88it/s]

[Jan 03, 17:24:37] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 527.12it/s]

New index_name received! Updating current index_name (Miyazaki) to Miyazaki


[Jan 03, 17:24:37] #> Note: Output directory .ragatouille/colbert/indexes/Miyazaki already exists







#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
[Jan 03, 17:24:42] [0] 		 #> Encoding 141 passages..
[Jan 03, 17:24:43] [0] 		 avg_doclen_est = 127.42552947998047 	 len(local_sample) = 141
[Jan 03, 17:24:43] [0] 		 Creating 2,048 partitions.
[Jan 03, 17:24:43] [0] 		 *Estimated* 17,966 embeddings.
[Jan 03, 17:24:43] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 17069 points in 128D to 2048 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 0 (0.16 s, search 0.16 s): objective=5644.62 imbalance=1.479 nsplit=0       



[Jan 03, 17:24:46] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

[Jan 03, 17:24:46] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.035, 0.038, 0.038, 0.034, 0.032, 0.034, 0.033, 0.035, 0.031, 0.033, 0.033, 0.035, 0.033, 0.034, 0.034, 0.038, 0.031, 0.032, 0.035, 0.034, 0.036, 0.034, 0.032, 0.034, 0.034, 0.032, 0.036, 0.033, 0.032, 0.035, 0.035, 0.037, 0.037, 0.033, 0.034, 0.033, 0.033, 0.034, 0.034, 0.036, 0.032, 0.036, 0.032, 0.032, 0.036, 0.032, 0.033, 0.037, 0.035, 0.034, 0.031, 0.033, 0.033, 0.034, 0.034, 0.035, 0.034, 0.037, 0.041, 0.032, 0.033, 0.033, 0.033, 0.031, 0.035, 0.034, 0.036, 0.034, 0.03, 0.033, 0.035, 0.033, 0.034, 0.034, 0.034, 0.033, 0.035, 0.034, 0.033, 0.032, 0.034, 0.036, 0.031, 0.036, 0.033, 0.034, 0.036, 0.034, 0.032, 0.039, 0.033, 0.035, 0.032, 0.037, 0.035, 0.035, 0.036, 0.033, 0.036, 0.034, 0.037, 0.039, 0.034, 0.032, 0.036, 0.034, 0.035, 0.033, 0

1it [00:00,  2.72it/s]
100%|██████████| 1/1 [00:00<00:00, 2264.74it/s]
100%|██████████| 2048/2048 [00:00<00:00, 120260.05it/s]


[Jan 03, 17:24:47] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 03, 17:24:47] #> Building the emb2pid mapping..
[Jan 03, 17:24:47] len(emb2pid) = 17967
[Jan 03, 17:24:47] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki/ivf.pid.pt
#> Joined...
Done indexing!
Successfully updated index with 60 new documents!
 New index size: 141


And again, that's it! The index has been updated with your new document set, and the updates are already persisted to disk. You're now ready to query it with `search()`!