In [2]:
#!pip install -q ragatouille

# Basic indexing and searching with RAGatouille

In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to:

- **Build an index from raw documents**
- **Search an index for relevant documents**
- **Load an index and the associated pretrained model to update or query it.**

Please note: Indexing is currently not supported on Google Colab and Windows 10.

First, let's load up a pre-trained ColBERT model:

In [3]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")



And that's all you need to do to load the model! All the config is now stored, and ready to be used for indexing.

## Creating an index

In [4]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.
    
    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "Wikipedia RAGatouille (Lx Yuan)"
    }

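    # A timeout avoids hanging forever; raise_for_status() surfaces HTTP errors early
    response = requests.get(URL, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    data = response.json()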

    # Extracting page content
    page = next(iter(data['query']['pages'].values()))
    return page['extract'] if 'extract' in page else None
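
To make the last two lines of the function easier to follow, here is a small sketch of the parsing step on its own. The helper name `extract_page_text` and both sample payloads are hypothetical; they only mimic the shape of the JSON that the function above reads (`query` → `pages` → one entry keyed by page ID).

```python
from typing import Any, Optional


def extract_page_text(data: dict[str, Any]) -> Optional[str]:
    """Return the plain-text extract of the first page in an API response, if any."""
    pages = data.get("query", {}).get("pages", {})
    if not pages:
        return None
    # "pages" is keyed by page ID, so take the first (and only) entry.
    page = next(iter(pages.values()))
    return page.get("extract")


# Hypothetical payloads shaped like the JSON returned for a single title.
found = {"query": {"pages": {"909036": {"title": "Elon Musk", "extract": "Elon Reeve Musk ..."}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_page_text(found))
print(extract_page_text(missing))
```

A page that does not exist has no `extract` key, which is why `get_wikipedia_page` can return `None`.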

And now, let's use it to fetch the page's content and preview its first 500 characters:

In [5]:
elon_content = get_wikipedia_page("Elon Musk")
print(elon_content[:500])

Elon Reeve Musk ( EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionair


That's only the first 500 characters, and the full page is far longer! Thankfully, `RAGPretrainedModel.index()` relies on a `CorpusProcessor`, which takes in various pre-processing functions and applies them to your documents before embedding and indexing them.

By default, `CorpusProcessor` uses LlamaIndex's `SentenceSplitter`, with a chunk size defined by your index's maximum document length. `max_document_length` defaults to 256 tokens, but you can set it to whatever you like.
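
To give an intuition for what length-bounded, sentence-aware splitting does, here is a rough sketch. It is not LlamaIndex's `SentenceSplitter` and not what RAGatouille runs internally: the function `split_into_chunks` is hypothetical, and it counts whitespace-separated words as a stand-in for model tokens.

```python
import re


def split_into_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words."""
    # Naive sentence boundary: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # A single over-long sentence is cut into max_tokens-sized pieces.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens:]
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine."
for chunk in split_into_chunks(sample, max_tokens=5):
    print(chunk)
```

A smaller limit yields more, shorter passages, which is the trade-off being made when lowering `max_document_length`.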

Let's keep our information units small and go for 180 when creating our index:

In [6]:
RAG.index(
    collection=[elon_content], 
    index_name="elon_musk", 
    max_document_length=180, 
    split_documents=True
)

________________________________________________________________________________
 This means that indexing will be slow. To make use of your GPU.
Please install `faiss-gpu` by running:
pip uninstall --y faiss-cpu & pip install faiss-gpu
 ________________________________________________________________________________
Will continue with CPU indexing in 5 seconds...


[Jan 11, 09:26:22] #> Note: Output directory .ragatouille/colbert/indexes/elon_musk already exists


[Jan 11, 09:26:22] #> Will delete 10 files already at .ragatouille/colbert/indexes/elon_musk in 20 seconds...
#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
[Jan 11, 09:26:46] [0] 		 #> Encoding 107 passages..
[Jan 11, 09:26:49] [0] 		 avg_doclen_est = 135.12149047851562 	 len(local_sample) = 107
[Jan 11, 09:26:49] [0] 		 Creating 1,024 partitions.
[Jan 11, 09:26:49] [0] 		 *Estimated* 14,457 embeddings.
[Jan 11, 09:26:49] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/elon_musk/plan.json ..
Clusteri



[Jan 11, 09:26:50] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

[Jan 11, 09:26:50] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.036, 0.04, 0.039, 0.038, 0.037, 0.042, 0.037, 0.037, 0.035, 0.039, 0.033, 0.036, 0.039, 0.04, 0.038, 0.041, 0.033, 0.035, 0.032, 0.036, 0.038, 0.037, 0.037, 0.039, 0.035, 0.036, 0.037, 0.038, 0.039, 0.038, 0.037, 0.045, 0.04, 0.035, 0.038, 0.033, 0.039, 0.037, 0.038, 0.043, 0.038, 0.035, 0.041, 0.041, 0.037, 0.034, 0.037, 0.042, 0.04, 0.038, 0.038, 0.039, 0.04, 0.038, 0.037, 0.037, 0.046, 0.038, 0.039, 0.037, 0.036, 0.041, 0.038, 0.037, 0.039, 0.038, 0.04, 0.04, 0.034, 0.039, 0.04, 0.037, 0.039, 0.038, 0.038, 0.034, 0.038, 0.045, 0.039, 0.04, 0.039, 0.039, 0.037, 0.037, 0.04, 0.037, 0.035, 0.039, 0.034, 0.047, 0.037, 0.04, 0.036, 0.041, 0.036, 0.04, 0.043, 0.034, 0.037, 0.036, 0.042, 0.041, 0.035, 0.036, 0.037, 0.038, 0.04, 0.035, 0.037, 0.036,

1it [00:00,  6.07it/s]
100%|██████████| 1/1 [00:00<00:00, 1889.33it/s]
100%|██████████| 1024/1024 [00:00<00:00, 103750.69it/s]


#> Joined...
Done indexing!


And that's our index created! It's already compressed and saved to disk, so you're ready to use it anywhere you want. By the way, the default behaviour of `index()` is to split documents, but if for any reason you'd like them to remain intact (if you've already preprocessed them, for example), you can set `split_documents=False` to bypass the splitting step!
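
If your passages are already prepared, the call might look like the sketch below. Only `index()` and its `collection`, `index_name` and `split_documents` arguments come from the cell above; the wrapper `index_presplit`, the index name `my_passages` and the `_RecordingModel` stand-in are hypothetical and exist only so the example runs without a model.

```python
def index_presplit(model, passages: list[str], index_name: str):
    """Index passages exactly as given, bypassing the built-in document splitter."""
    # Drop empty strings so that no blank passages get embedded.
    cleaned = [p.strip() for p in passages if p and p.strip()]
    return model.index(
        collection=cleaned,
        index_name=index_name,
        split_documents=False,  # keep each passage intact
    )


class _RecordingModel:
    """Stand-in for RAGPretrainedModel that only records the call (illustration only)."""

    def index(self, **kwargs):
        self.kwargs = kwargs
        return kwargs["index_name"]


demo = _RecordingModel()
index_presplit(demo, ["First passage.", "  ", "Second passage."], "my_passages")
print(demo.kwargs)
```

With a real model loaded as above, you would pass `RAG` instead of the stand-in.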

Let's move on to querying our index now...

## Retrieving Documents

`RAGPretrainedModel` has just indexed our document, so the index is already loaded into it and ready to use! 

Searching is simple and straightforward. Let's say we have a single query:

In [7]:
k = 3  # Number of documents to retrieve (defaults to 10); we use 3 here for readability
results = RAG.search(query="Who is Elon Musk?", k=k)
results

Loading searcher for index elon_musk for the first time... This may take a few seconds
[Jan 11, 09:27:22] #> Loading codec...
[Jan 11, 09:27:22] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 11, 09:27:22] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 11, 09:27:22] #> Loading IVF...
[Jan 11, 09:27:22] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3930.93it/s]

[Jan 11, 09:27:22] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 722.28it/s]

Searcher loaded!






#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Who is Elon Musk?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2040,  2003,  3449,  2239, 14163,  6711,  1029,   102,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')



[{'content': 'Elon Reeve Musk ( EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation.',
  'score': 27.78125,
  'rank': 1},
 {'content': "; owner, chairman and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index, and $254 billion according to Forbes, primarily from his ownership stakes in Tesla and SpaceX.A member of the wealthy South African Musk family, Elon was born in Pretoria and briefly attended the University of Pretoria before immigrating to Canada at age 18, acquiring citiz

But is it efficient? Let's check how long it takes ColBERT to embed our query and retrieve documents. Because ColBERT's main retrieval approach relies on `maxsim`, a very efficient operation, searching through orders of magnitude more documents shouldn't take much longer:

In [10]:
%%timeit
RAG.search(query="Who is Elon Musk?")

24 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
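
For readers unfamiliar with `maxsim`, the idea can be sketched in a few lines: each query token vector is matched with its most similar document token vector, and those best similarities are summed. The code below is a toy, pure-Python illustration using made-up 2-dimensional vectors and a plain dot product; the real implementation works on batched, compressed embeddings and is not shown here.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum, over query token vectors, of the best similarity to any document token vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional "token embeddings" (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [-1.0, 0.0]]

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Here `doc_a` scores higher because it contains a close match for both query vectors, while `doc_b` only partially matches each of them.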


You can also batch queries, which will run faster if you've got many different queries to run at once. The output format is the same as for a single query, except that it's a list of lists, where the item at index `i` corresponds to the query at index `i`:

In [11]:
all_results = RAG.search(
    query=[
        "What is SpaceX", 
        "What is durian?" # irrelevant query
    ], 
    k=k
)

2it [00:00, 112.43it/s]
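
Since the batched output lines up with the input queries by position, pairing them is a matter of zipping the two lists. The helper `top_hits` and the sample data below are hypothetical; the dictionaries only reuse the `content`, `score` and `rank` keys seen in the search output.

```python
from typing import Optional


def top_hits(queries: list[str], all_results: list[list[dict]]) -> dict[str, Optional[dict]]:
    """Map each query to its best-ranked result (or None if nothing came back)."""
    best: dict[str, Optional[dict]] = {}
    for query, results in zip(queries, all_results):
        best[query] = min(results, key=lambda r: r["rank"]) if results else None
    return best


# Made-up results in the same shape as the search output (content, score, rank).
queries = ["What is SpaceX", "What is durian?"]
fake_results = [
    [{"content": "passage A", "score": 23.5, "rank": 1},
     {"content": "passage B", "score": 21.0, "rank": 2}],
    [{"content": "passage C", "score": 5.7, "rank": 1}],
]

for query, hit in top_hits(queries, fake_results).items():
    print(f"{query!r}: score={hit['score']} content={hit['content']!r}")
```

Comparing the top score per query is a quick way to spot a query that the index cannot really answer.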


In [12]:
# What is SpaceX
all_results[0]

[{'content': 'In October 2002, eBay acquired PayPal for $1.5 billion, and that same year, with $100 million of the money he made, Musk founded SpaceX, a spaceflight services company. In 2004, he became an early investor in electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.). He became its chairman and product architect, assuming the position of CEO in 2008. In 2006, Musk helped create SolarCity, a solar-energy company that was acquired by Tesla in 2016 and became Tesla Energy. In 2013, he proposed a hyperloop high-speed vactrain transportation system. In 2015, he co-founded OpenAI, a nonprofit artificial intelligence research company.',
  'score': 23.59375,
  'rank': 1},
 {'content': 'In 2020, SpaceX launched its first crewed flight, the Demo-2, becoming the first private company to place astronauts into orbit and dock a crewed spacecraft with the ISS.\n\n\n==== Starlink ====\n\nIn 2015, SpaceX began development of the Starlink constellation of low-Earth-orbit satellites

In [13]:
# What is durian?
all_results[1]

[{'content': 'Time has listed Musk as one of the most influential people in the world on four occasions in 2010, 2013, 2018, and 2021. Musk was selected as Time\'s "Person of the Year" for 2021. Then Time editor-in-chief Edward Felsenthal wrote that "Person of the Year is a marker of influence, and few individuals have had more influence than Musk on life on Earth, and potentially life off Earth too". In February 2022, Musk was elected to the National Academy of Engineering. Following a tumultuous year of changes and controversies at X, The New Republic labeled Musk its 2023 Scoundrel of the Year.\n\n\n== Notes and references ==\n\n\n=== Notes ===\n\n\n=== Citations ===\n\n\n== Works cited ==',
  'score': 5.74609375,
  'rank': 1},
 {'content': 'Elon Reeve Musk ( EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; foun

And that's it for the basics of querying an index! You're now ready to index and retrieve documents with RAGatouille!

## Using an already-created index

In the examples above, we embedded documents into an index and queried it during the same session. But a key feature is **persistence**: indexing is the slowest part, and we don't want to have to do it every time!

Loading an already-created index is just as straightforward as creating one from scratch. First, we'll load an instance of `RAGPretrainedModel` from the index, where the full configuration of the embedder is stored:

In [17]:
# This is the path to the index. We recommend keeping this path format when using RAGatouille elsewhere.
path_to_index = ".ragatouille/colbert/indexes/elon_musk/"
RAG = RAGPretrainedModel.from_index(path_to_index)
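
If you load indexes by name in several places, a small helper can keep the path format consistent. This is only a sketch: the function `index_path` and its `root` parameter are hypothetical, and the directory layout is simply copied from the path used above.

```python
from pathlib import Path


def index_path(index_name: str, root: str = ".ragatouille") -> Path:
    """Build the on-disk location of a named index, following the layout seen above."""
    return Path(root) / "colbert" / "indexes" / index_name


path = index_path("elon_musk")
print(path.as_posix())
if not path.is_dir():
    # Nothing on disk yet: the index has to be created before it can be loaded.
    print("Index not found; build it with RAG.index(...) before calling from_index().")
```

Checking that the directory exists before loading gives a clearer message than a failure deep inside the loader.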

In [19]:
k = 3  # Number of documents to retrieve (defaults to 10); we use 3 here for readability
results = RAG.search(query="Tesla is ...?", k=k)
results

[{'content': 'At the same time, Musk refused to block Russian state media on Starlink, declaring himself "a free speech absolutist".\n\n\n=== Tesla ===\n\nTesla, Inc., originally Tesla Motors, was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning, who financed the company until the Series A round of funding. Both men played active roles in the company\'s early development prior to Musk\'s involvement. Musk led the Series A round of investment in February 2004; he invested $6.5 million, became the majority shareholder, and joined Tesla\'s board of directors as chairman.',
  'score': 21.03125,
  'rank': 1},
 {'content': 'As of 2019, Musk was the longest-tenured CEO of any automotive manufacturer globally. In 2021, Musk nominally changed his title to "Technoking" while retaining his position as CEO.\nTesla began delivery of an electric sports car, the Roadster, in 2008. With sales of about 2,500 vehicles, it was the first serial production all-electric car to use lithium-io

And that's it! The index is now fully ready to be queried using `search()` as above.

### Updating an index

Once you've loaded an existing index, you might want to add new documents to it. RAGatouille supports this via the `RAGPretrainedModel.add_to_index()` method. Due to the way ColBERT stores documents as bags-of-embeddings, there are cases where recreating the index is more efficient than updating it. You don't need to worry about this: the most efficient method is chosen automatically when you call `add_to_index()`.

Say you want to expand the index to cover another person, so let's fetch the Donald Trump page and add it to our index too!

In [20]:
new_documents = get_wikipedia_page("Donald Trump")

RAG.add_to_index([new_documents])

[Jan 11, 09:31:33] #> Loading codec...
[Jan 11, 09:31:33] #> Loading IVF...
[Jan 11, 09:31:33] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 2571.61it/s]

[Jan 11, 09:31:33] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 764.41it/s]

________________________________________________________________________________
 This means that indexing will be slow. To make use of your GPU.
Please install `faiss-gpu` by running:
pip uninstall --y faiss-cpu & pip install faiss-gpu
 ________________________________________________________________________________
Will continue with CPU indexing in 5 seconds...





New index_name received! Updating current index_name (elon_musk) to elon_musk


[Jan 11, 09:31:38] #> Note: Output directory .ragatouille/colbert/indexes/elon_musk already exists


#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
[Jan 11, 09:31:43] [0] 		 #> Encoding 285 passages..
[Jan 11, 09:31:45] [0] 		 avg_doclen_est = 131.62806701660156 	 len(local_sample) = 285
[Jan 11, 09:31:45] [0] 		 Creating 2,048 partitions.
[Jan 11, 09:31:45] [0] 		 *Estimated* 37,513 embeddings.
[Jan 11, 09:31:45] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/elon_musk/plan.json ..
Clustering 35639 points in 128D to 2048 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s




[Jan 11, 09:31:52] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...

[Jan 11, 09:31:52] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.037, 0.043, 0.04, 0.038, 0.037, 0.041, 0.035, 0.036, 0.035, 0.037, 0.037, 0.036, 0.037, 0.039, 0.038, 0.038, 0.032, 0.035, 0.034, 0.037, 0.038, 0.037, 0.039, 0.04, 0.034, 0.036, 0.039, 0.039, 0.039, 0.038, 0.035, 0.041, 0.041, 0.037, 0.037, 0.033, 0.041, 0.038, 0.039, 0.046, 0.039, 0.035, 0.041, 0.039, 0.036, 0.035, 0.038, 0.041, 0.041, 0.036, 0.038, 0.038, 0.037, 0.038, 0.04, 0.04, 0.044, 0.04, 0.045, 0.036, 0.035, 0.039, 0.038, 0.04, 0.041, 0.039, 0.039, 0.041, 0.036, 0.038, 0.04, 0.036, 0.037, 0.04, 0.037, 0.037, 0.038, 0.043, 0.041, 0.039, 0.039, 0.04, 0.037, 0.038, 0.04, 0.037, 0.036, 0.038, 0.036, 0.042, 0.037, 0.038, 0.037, 0.039, 0.036, 0.039, 0.042, 0.036, 0.037, 0.037, 0.04, 0.042, 0.035, 0.035, 0.038, 0.038, 0.038, 0.036, 0.04, 0.034

1it [00:00,  2.60it/s]
100%|██████████| 1/1 [00:00<00:00, 1707.78it/s]
100%|██████████| 2048/2048 [00:00<00:00, 109872.41it/s]


[Jan 11, 09:31:52] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 11, 09:31:52] #> Building the emb2pid mapping..
[Jan 11, 09:31:52] len(emb2pid) = 37514
[Jan 11, 09:31:52] #> Saved optimized IVF to .ragatouille/colbert/indexes/elon_musk/ivf.pid.pt
#> Joined...
Done indexing!
Successfully updated index with 178 new documents!
 New index size: 285


And again, that's it! The index has been updated with your new document set, and the updates are already persisted to disk. You're now ready to query it with `search()`!
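
When adding several pages at once, keep in mind that `get_wikipedia_page` returns `None` for a title with no extract. The sketch below filters such cases out before anything is handed to `add_to_index()`. The helper `collect_documents` and the `fake_pages` dictionary are hypothetical stand-ins so the example runs offline.

```python
from typing import Callable, Optional


def collect_documents(titles: list[str], fetch: Callable[[str], Optional[str]]) -> list[str]:
    """Fetch each title and keep only pages that returned non-empty text."""
    documents = []
    for title in titles:
        text = fetch(title)
        if text:  # skips None (page not found) and empty extracts
            documents.append(text)
        else:
            print(f"Skipping {title!r}: no text returned")
    return documents


# Stand-in fetcher for illustration; in the notebook, pass get_wikipedia_page instead.
fake_pages = {"Tesla, Inc.": "Tesla, Inc. is a company ...", "SpaceX": "SpaceX is ..."}
docs = collect_documents(["Tesla, Inc.", "No Such Page", "SpaceX"], fake_pages.get)
print(len(docs))
# With a loaded model: RAG.add_to_index(docs)
```

The resulting list can be passed to `add_to_index()` in a single call, just like the one-element list used above.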

In [29]:
%ls .ragatouille/colbert/indexes/

Miyazaki/  elon_musk/
