# Basic indexing and searching with RAGatouille

In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to:

- **Build an index from raw documents**
- **Search an index for relevant documents**
- **Load an index and the associated pretrained model to update or query it.**

Please note: Indexing is currently not supported on Google Colab and Windows 10.

First, let's load up a pre-trained ColBERT model:

In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")



[Jan 25, 14:37:34] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




And that's all you need to do to load the model! All the config is now stored, and ready to be used for indexing.

## Creating an index

Let's index some documents now. We'll use data from Wikipedia to build our Miyazaki index, which will store all you could ever know about Hayao Miyazaki('s Wikipedia page).

First, let's write a function to fetch the data from Wikipedia with a clear user-agent, to be a good netizen:

In [2]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.
    
    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"
    }

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data['query']['pages'].values()))
    return page['extract'] if 'extract' in page else None
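
To see why the function returns `None` for a page that doesn't exist, here is a minimal offline sketch of the parsing step. The two payloads are hand-written, hypothetical examples shaped like the API's JSON response (pages keyed by page id, with `"-1"` for a missing title); they are not real API output, and `extract_text` is an illustrative helper name.

```python
# Hypothetical payloads shaped like MediaWiki "extracts" responses (not real API output).
found = {"query": {"pages": {"12345": {"pageid": 12345, "title": "Example", "extract": "Some text."}}}}
missing = {"query": {"pages": {"-1": {"title": "No_such_page", "missing": ""}}}}


def extract_text(data: dict):
    """Return the extract of the first page in the payload, or None if it has none."""
    # "pages" is keyed by page id, so we take the first (and only) value.
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")


found_text = extract_text(found)
missing_text = extract_text(missing)
print(found_text)
print(missing_text)
```

Since a missing page yields `None`, it's worth checking the return value before calling `len()` on it or passing it to the indexer.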

And now, let's use it to fetch the page's content and check how long it is:

In [3]:
full_document = get_wikipedia_page("Hayao_Miyazaki")
len(full_document)

45346

That's a lot of characters! Thankfully, `RAGPretrainedModel.index()` also relies on a `CorpusProcessor`! It takes in various pre-processing functions and applies them to your documents before embedding and indexing them.

By default, `CorpusProcessor` uses LlamaIndex's `SentenceSplitter`, with a chunk-size defined by your index's max document length. By default, `max_document_length` is 256 tokens, but you can set it to whatever you like.
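
To get a feel for what sentence-based splitting does, here is a rough, self-contained sketch. It is not LlamaIndex's `SentenceSplitter` and not what RAGatouille runs internally: it assumes a naive regex for sentence boundaries and counts whitespace-separated words as a stand-in for tokens. The name `split_into_chunks` is made up for this illustration.

```python
import re


def split_into_chunks(text: str, max_words: int = 180) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_words` words.

    Assumption: words approximate tokens. A real splitter counts model tokens.
    A single sentence longer than `max_words` is kept whole in its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
chunks = split_into_chunks(sample, max_words=6)
print(chunks)
```

The point is only that a smaller maximum length produces more, shorter passages, each of which gets its own set of embeddings.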

Let's keep our information units small and go for 180 when creating our index:

In [4]:
RAG.index(collection=[full_document], index_name="Miyazaki", max_document_length=180, split_documents=True)



[Jan 25, 14:38:27] #> Note: Output directory .ragatouille/colbert/indexes/Miyazaki already exists


[Jan 25, 14:38:27] #> Will delete 10 files already at .ragatouille/colbert/indexes/Miyazaki in 20 seconds...
#> Starting...
[Jan 25, 14:38:51] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Jan 25, 14:38:52] [0] 		 #> Encoding 81 passages..


100%|██████████| 2/2 [00:03<00:00,  1.56s/it]


[Jan 25, 14:38:55] [0] 		 avg_doclen_est = 129.74073791503906 	 len(local_sample) = 81
[Jan 25, 14:38:55] [0] 		 Creating 1,024 partitions.
[Jan 25, 14:38:55] [0] 		 *Estimated* 10,508 embeddings.
[Jan 25, 14:38:55] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 9984 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
[0.035, 0.035, 0.036, 0.032, 0.031, 0.036, 0.032, 0.034, 0.032, 0.036, 0.035, 0.037, 0.032, 0.035, 0.037, 0.037, 0.031, 0.033, 0.034, 0.034, 0.035, 0.035, 0.034, 0.036, 0.035, 0.032, 0.035, 0.031, 0.035, 0.036, 0.033, 0.034, 0.035, 0.033, 0.033, 0.032, 0.035, 0.033, 0.031, 0.038, 0.033, 0.038, 0.033, 0.03, 0.034, 0.035, 0.033, 0.034, 0.035, 0.033, 0.031, 0.032, 0.033, 0.034, 0.034, 0.036, 0.035, 0.035, 0.037, 0.03, 0.032, 0.032, 0.034, 0.032, 0.033, 0.034, 0.032, 0.036, 0.031, 0.032, 0.032, 0.032, 0.031, 0.031, 0.034, 0.032, 0.032, 0.036, 0.032, 0.033, 0.033, 0.036, 0.031, 0.037, 0.031

0it [00:00, ?it/s]
  0%|          | 0/2 [00:00<?, ?it/s]
 50%|█████     | 1/2 [00:02<00:02,  2.24s/it]
100%|██████████| 2/2 [00:02<00:00,  1.44s/it]
1it [00:02,  2.92s/it]
100%|██████████| 1/1 [00:00<00:00, 4544.21it/s]
100%|██████████| 1024/1024 [00:00<00:00, 318783.29it/s]


[Jan 25, 14:38:58] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 25, 14:38:58] #> Building the emb2pid mapping..
[Jan 25, 14:38:58] len(emb2pid) = 10509
[Jan 25, 14:38:58] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki/ivf.pid.pt

#> Joined...
Done indexing!


And that's our index created! It's already compressed and saved to disk, so you're ready to use it anywhere you want. By the way, the default behaviour of `index()` is to split documents, but if for any reason you'd like them to remain intact (if you've already preprocessed them, for example), you can set `split_documents=False` to bypass the splitting step!
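
If you'd rather control the splitting yourself, one option is to pre-chunk the text and pass the pieces with `split_documents=False`. The sketch below is one possible approach, not something RAGatouille provides: it assumes the plain-text extract marks sections with `== Heading ==` lines (as seen in the search results further down), and `split_on_headings` is a hypothetical helper.

```python
import re

# Matches a whole heading line such as "== Early life ==" or "=== Studio Ghibli ===".
HEADING = re.compile(r"^=+[ \t]*[^=\n]+?[ \t]*=+[ \t]*$", flags=re.MULTILINE)


def split_on_headings(text: str) -> list[str]:
    """Split a plain-text extract into sections, dropping the heading lines."""
    parts = HEADING.split(text)
    return [part.strip() for part in parts if part.strip()]


sample = (
    "Intro paragraph.\n\n\n"
    "== Early life ==\nBorn in Tokyo.\n\n\n"
    "=== Studio Ghibli ===\nFounded in 1985."
)
sections = split_on_headings(sample)
print(sections)

# With the model loaded as above, the sections could then be indexed as-is:
# RAG.index(collection=sections, index_name="Miyazaki", split_documents=False)
```

Keep in mind that with splitting disabled, keeping each passage within the model's maximum document length is up to you.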

Let's move on to querying our index now...

## Retrieving Documents

`RAGPretrainedModel` has just indexed our document, so the index is already loaded into it and ready to use! 

Searching is very simple and straightforward. Let's say we have a single query:

In [14]:
k = 3  # Number of documents to retrieve (defaults to 10); set to 3 here for readability
results = RAG.search(query="What animation studio did Miyazaki found?", k=k)
results

Loading searcher for index Miyazaki for the first time... This may take a few seconds
[Jan 25, 14:50:03] #> Loading codec...
[Jan 25, 14:50:03] #> Loading IVF...
[Jan 25, 14:50:03] #> Loading doclens...


100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 4583.94it/s]

[Jan 25, 14:50:03] #> Loading codes and residuals...



100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 289.60it/s]


Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What animation studio did Miyazaki found?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])



[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
  'score': 25.906383514404297,
  'rank': 1},
 {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, Japanese: [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo Ci

But is it efficient? Let's check how long it takes ColBERT to embed our query and retrieve documents. Because ColBERT's main retrieval approach relies on `maxsim`, a very efficient operation, searching through orders of magnitude more documents shouldn't take much longer:

In [7]:
%%timeit
RAG.search(query="What animation studio did Miyazaki found?")

65.3 ms ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
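
For intuition, here is a tiny pure-Python sketch of the `maxsim` scoring rule: for each query token embedding, take the highest similarity against any document token embedding, then sum those maxima. This only illustrates the scoring idea with made-up 2D vectors; the actual search runs over the compressed index with the compiled extensions shown in the logs, not this loop.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best match among the document tokens."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy embeddings, one vector per token (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[0.0, 1.0]]                          # matches only the second one

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

A document that covers more of the query's tokens scores higher, which is why `doc_a` outranks `doc_b` here.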


You can also batch queries, which will run faster if you've got many different queries to run at once. The output format is the same as for a single query, except it's a list of lists, where the item at index `i` corresponds to the query at index `i`:

In [7]:
all_results = RAG.search(query=["What animation studio did Miyazaki found?", "Miyazaki son name"], k=k)
all_results

2it [00:00, 139.07it/s]


[[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\n\n\n=== Studio Ghibli ===\n\n\n==== Early films (1985–1996) ====\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were inspired by Greek architecture and "European urbanistic templates".',
   'score': 25.90625,
   'rank': 1},
  {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\nBorn in Tokyo City in the Empire
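
Since the batched output lines up with the input queries by position, pairing them back up is a simple `zip`. The sketch below uses small hand-written placeholder results with the same keys as above (`content`, `score`, `rank`); the values are hypothetical, not real search output.

```python
queries = ["first query", "second query"]

# Placeholder results, shaped like the list-of-lists returned for a batch.
all_results = [
    [
        {"content": "passage A", "score": 25.9, "rank": 1},
        {"content": "passage B", "score": 20.1, "rank": 2},
    ],
    [
        {"content": "passage C", "score": 18.4, "rank": 1},
    ],
]

# Map each query to the content of its best-ranked passage.
top_hits = {
    query: hits[0]["content"]
    for query, hits in zip(queries, all_results)
    if hits  # skip queries that returned nothing
}
print(top_hits)
```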

And that's it for the basics of querying an index! You're now ready to index and retrieve documents with RAGatouille!

## Using an already-created index

In the examples above, we embedded documents into an index and queried it during the same session. But a key feature is **persistence**: indexing is the slowest part, and we don't want to have to redo it every time!

Loading an already-created index is just as straightforward as creating one from scratch. First, we'll load up an instance of `RAGPretrainedModel` from the index, where the full configuration of the embedder is stored:

In [6]:
# This is the path to the index. We recommend keeping this path format when using RAGatouille somewhere else.
path_to_index = ".ragatouille/colbert/indexes/Miyazaki/"
RAG = RAGPretrainedModel.from_index(path_to_index)

And that's it! The index is now fully ready to be queried using `search()` as above.
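
In a script, you may want to decide between loading and re-indexing depending on whether the index is already on disk. Here is a minimal sketch of such a check. It assumes the directory layout shown above and treats the presence of `plan.json` (which the indexing logs show being written) as a sign that an index exists; that heuristic and the `index_exists` helper are our own assumptions, not part of RAGatouille.

```python
import tempfile
from pathlib import Path


def index_exists(index_name: str, root: str = ".ragatouille/colbert/indexes") -> bool:
    """Heuristic check: does <root>/<index_name>/plan.json exist?"""
    return (Path(root) / index_name / "plan.json").is_file()


# Demonstration in a throwaway directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    before = index_exists("Miyazaki", root=tmp)
    (Path(tmp) / "Miyazaki").mkdir()
    (Path(tmp) / "Miyazaki" / "plan.json").write_text("{}")
    after = index_exists("Miyazaki", root=tmp)

print(before, after)
```

If the check passes, `RAGPretrainedModel.from_index(path_to_index)` can be used as shown above; otherwise, fall back to `index()`.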

### Updating an index

Once you've loaded an existing index, you might want to add new documents to it. RAGatouille supports this via the `RAGPretrainedModel.add_to_index()` function. Due to the way ColBERT stores documents as bags-of-embeddings, there are cases where recreating the index is more efficient than updating it -- you don't need to worry about it, the most efficient method is automatically used when you call `add_to_index()`.

Say we want to expand our coverage to more of Studio Ghibli, so let's get the studio's page into our index too!

In [7]:
new_documents = get_wikipedia_page("Studio_Ghibli")

RAG.add_to_index([new_documents])

[Jan 25, 14:39:43] #> Loading codec...
[Jan 25, 14:39:43] #> Loading IVF...
[Jan 25, 14:39:43] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 25, 14:39:43] #> Loading doclens...


100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 2037.06it/s]

[Jan 25, 14:39:43] #> Loading codes and residuals...



100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 482.71it/s]

[Jan 25, 14:39:43] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Jan 25, 14:39:44] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
New index_name received! Updating current index_name (Miyazaki) to Miyazaki


[Jan 25, 14:39:44] #> Note: Output directory .ragatouille/colbert/indexes/Miyazaki already exists


#> Starting...
[Jan 25, 14:39:48] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Jan 25, 14:39:50] [0] 		 #> Encoding 140 passages..


100%|██████████| 3/3 [00:05<00:00,  1.95s/it]


[Jan 25, 14:39:56] [0] 		 avg_doclen_est = 127.67142486572266 	 len(local_sample) = 140
[Jan 25, 14:39:56] [0] 		 Creating 2,048 partitions.
[Jan 25, 14:39:56] [0] 		 *Estimated* 17,873 embeddings.
[Jan 25, 14:39:56] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..
Clustering 16981 points in 128D to 2048 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
[0.037, 0.037, 0.036, 0.033, 0.03, 0.034, 0.032, 0.034, 0.033, 0.034, 0.034, 0.036, 0.034, 0.035, 0.034, 0.035, 0.031, 0.032, 0.032, 0.033, 0.034, 0.032, 0.032, 0.034, 0.034, 0.032, 0.037, 0.033, 0.032, 0.032, 0.034, 0.035, 0.036, 0.032, 0.032, 0.031, 0.031, 0.032, 0.034, 0.036, 0.034, 0.037, 0.031, 0.031, 0.034, 0.032, 0.031, 0.037, 0.032, 0.032, 0.03, 0.033, 0.033, 0.032, 0.033, 0.034, 0.035, 0.036, 0.039, 0.031, 0.034, 0.034, 0.032, 0.033, 0.033, 0.034, 0.035, 0.036, 0.03, 0.033, 0.035, 0.031, 0.033, 0.034, 0.036, 0.033, 0.034, 0.034, 0.034, 0.034, 0.034, 0.037, 0.031, 0.035, 0.03

0it [00:00, ?it/s]
  0%|          | 0/3 [00:00<?, ?it/s]
 33%|███▎      | 1/3 [00:02<00:04,  2.23s/it]
 67%|██████▋   | 2/3 [00:04<00:02,  2.24s/it]
100%|██████████| 3/3 [00:04<00:00,  1.66s/it]
1it [00:05,  5.03s/it]
100%|██████████| 1/1 [00:00<00:00, 4609.13it/s]
100%|██████████| 2048/2048 [00:00<00:00, 319008.23it/s]


[Jan 25, 14:40:01] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 25, 14:40:01] #> Building the emb2pid mapping..
[Jan 25, 14:40:01] len(emb2pid) = 17874
[Jan 25, 14:40:01] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki/ivf.pid.pt

#> Joined...
Done indexing!
Successfully updated index with 59 new documents!
 New index size: 140


And again, that's it! The index has been updated with your new document set, and the updates are already persisted to disk. You're now ready to query it with `search()`!
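
One practical note when adding several pages at once: `get_wikipedia_page()` returns `None` when a page has no extract, so it can help to tidy the list before handing it to `add_to_index()`. The helper below is a hypothetical convenience, not part of RAGatouille; it drops missing, empty, and exactly duplicated documents while preserving order.

```python
def clean_documents(docs):
    """Drop None/blank entries and exact duplicates, keeping the original order."""
    seen = set()
    cleaned = []
    for doc in docs:
        if not doc or not doc.strip():
            continue  # missing page or empty text
        if doc in seen:
            continue  # exact duplicate of an earlier document
        seen.add(doc)
        cleaned.append(doc)
    return cleaned


fetched = ["Studio Ghibli text", None, "   ", "Studio Ghibli text", "Another page"]
new_documents = clean_documents(fetched)
print(new_documents)

# With an index loaded as above:
# RAG.add_to_index(new_documents)
```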

## Uploading and using an index from Hugging Face


In [1]:
import os
HF_TOKEN = os.getenv("HF_TOKEN")

!huggingface-cli login --token $HF_TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/satyamtiwary/.cache/huggingface/token
Login successful


In [10]:
repo_name = "Technoculture/Miyazaki"

from ragatouille.models.utils import upload_index_and_model

In [16]:
upload_index_and_model(
    ".ragatouille/colbert",
    repo_name,
)

Path .ragatouille/colbert does not contain a valid ColBERT config!


ValueError: 

In [10]:
# Load from the local index path used earlier (`index_name` was never defined)
RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/Miyazaki/")

In [18]:
from pathlib import Path

export_path = "~/.rag"
str(Path(export_path) / "model")

'~/.rag/model'
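
Note that the `~` in the output above is still a literal character: `pathlib` does not expand it on its own. Whether downstream code expands it for you is something we can't assume, so a safer sketch is to call `expanduser()` explicitly before building the path:

```python
from pathlib import Path

export_path = "~/.rag"

# Resolve "~" to the current user's home directory before joining.
model_dir = Path(export_path).expanduser() / "model"
print(model_dir)
```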