# Basic indexing and searching with RAGatouille

In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to:

- **Build an index from raw documents**
- **Search an index for relevant documents**
- **Load an index and the associated pretrained model to update or query it.**

Please note: Indexing is currently not supported on Google Colab and Windows 10.

First, let's load up a pre-trained ColBERT model:

In [None]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

And that's all you need to do to load the model! All the config is now stored, and ready to be used for indexing.

## Creating an index

Let's index some documents now. We'll use data from Wikipedia to build our Miyazaki-Index, which will store all you could ever know about Hayao Miyazaki('s Wikipedia page).

First, let's write a function to fetch the data from Wikipedia with a clear user-agent, to be a good netizen:

In [None]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.
    
    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"
    }

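    # Fail fast on network or HTTP errors instead of trying to parse them
    response = requests.get(URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Pages are keyed by page ID, so take the first (and only) entry
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")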
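
To see what the extraction step at the end of the function is doing, here's a small offline sketch. The two dictionaries only mimic the general shape of a parsed `action=query` response: the page IDs, titles and text are made-up placeholders, and `extract_text` is a hypothetical helper for illustration, not part of RAGatouille.

```python
from typing import Optional


def extract_text(data: dict) -> Optional[str]:
    """Return the plain-text extract from a parsed query response, or None."""
    # Pages are keyed by page ID, which we don't know in advance,
    # so we take the first (and only) entry.
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract")


# Placeholder response for a page that exists (made-up ID and text)
found = {"query": {"pages": {"12345": {"title": "Example", "extract": "Some text."}}}}
# Placeholder response for a page that doesn't exist: no "extract" key
missing = {"query": {"pages": {"-1": {"title": "Nope", "missing": ""}}}}

print(extract_text(found))
print(extract_text(missing))
```

The function returns `None` when no extract is available, so it's worth checking for that before indexing.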

And now, let's use it to fetch the page's content and check how long it is:

In [None]:
full_document = get_wikipedia_page("Hayao_Miyazaki")
len(full_document)

That's a lot of characters! Thankfully, `RAGPretrainedModel.index()` relies on a `CorpusProcessor`, which takes in various pre-processing functions and applies them to your documents before embedding and indexing them.

By default, `CorpusProcessor` uses LlamaIndex's `SentenceSplitter`, with a chunk size set by your index's maximum document length. `max_document_length` defaults to 256 tokens, but you can set it to whatever you like.
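
To make the idea of chunking concrete, here's a deliberately simplified stand-in. It is not what `SentenceSplitter` does (that one counts tokens and tries to keep sentences whole); this sketch just packs whitespace-separated words into fixed-size groups. `split_into_chunks` is a hypothetical helper, not a RAGatouille or LlamaIndex function.

```python
from typing import List


def split_into_chunks(text: str, max_length: int = 256) -> List[str]:
    """Pack whitespace-separated words into chunks of at most `max_length` words."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_length])
        for start in range(0, len(words), max_length)
    ]


chunks = split_into_chunks("one two three four five six seven", max_length=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

A smaller maximum length gives you more, shorter passages, which is the trade-off we're making when we lower `max_document_length`.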

Let's keep our information units small and go for 180 when creating our index. We'll also add an optional document ID and an optional metadata entry for our index:

In [None]:
RAG.index(
    documents=[full_document],
    document_ids=["miyazaki"],
    document_metadatas=[{"entity": "person", "source": "wikipedia"}],
    index_name="Miyazaki",
    max_document_length=180,
    split_documents=True,
)

And that's our index created! It's already compressed and saved to disk, so you're ready to use it anywhere you want. By the way, the default behaviour of `index()` is to split documents, but if for any reason you'd like them to remain intact (if you've already preprocessed them, for example), you can set `split_documents=False` to bypass it!

Let's move on to querying our index now...

## Retrieving Documents

`RAGPretrainedModel` has just indexed our document, so the index is already loaded into it and ready to use! 

Searching is simple and straightforward. Let's say I have a single query:

In [None]:
k = 3  # Number of documents to retrieve. Defaults to 10; we use 3 here for readability.
results = RAG.search(query="What animation studio did Miyazaki found?", k=k)
results

But is it efficient? Let's check how long it takes ColBERT to embed our query and retrieve documents. Because ColBERT's main retrieval approach relies on `maxsim`, a very efficient operation, searching through orders of magnitude more documents shouldn't take much longer:

In [None]:
%%timeit
RAG.search(query="What animation studio did Miyazaki found?")
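
For intuition about what `maxsim` computes: in ColBERT's late-interaction scoring, each query token embedding is matched with its most similar document token embedding, and those maxima are summed. Here's a tiny pure-Python sketch with made-up 2-dimensional vectors. Real ColBERT embeddings are larger and are compared in batched tensor operations, and `maxsim_score` is a hypothetical helper, not part of RAGatouille.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embeddings: List[Vector], doc_embeddings: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity with any document token."""
    return sum(
        max(dot(q, d) for d in doc_embeddings)
        for q in query_embeddings
    )


# Made-up toy embeddings: two query tokens, two candidate documents
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

`doc_a` scores higher because it contains a close match for each query token, whereas `doc_b` only offers a partial match for both.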

You can also batch queries, which will run faster if you've got many different queries to run at once. The output format is the same as for a single query, except it's a list of lists, where the item at index `i` corresponds to the query at index `i`:

In [None]:
all_results = RAG.search(query=["What animation studio did Miyazaki found?", "Miyazaki son name"], k=k)
all_results
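
Since the results line up with the queries by position, `zip` is a handy way to pair them back up. The sketch below uses a placeholder stand-in for `all_results` so it runs on its own; it assumes each result is a dictionary with a `content` field, so check the output of the cell above for the actual fields. `top_passages` is a hypothetical helper, and the passages and scores are dummy values.

```python
from typing import Dict, List

queries = ["What animation studio did Miyazaki found?", "Miyazaki son name"]

# Placeholder stand-in for a batched search output: one inner list per query.
# Keys and values are assumptions for illustration, not real search output.
placeholder_results = [
    [{"content": "passage A", "score": 0.0, "rank": 1}],
    [{"content": "passage B", "score": 0.0, "rank": 1}],
]


def top_passages(queries: List[str], all_results: List[List[dict]]) -> Dict[str, List[str]]:
    """Map each query to the text of its retrieved passages, in rank order."""
    return {
        query: [result["content"] for result in results]
        for query, results in zip(queries, all_results)
    }


for query, passages in top_passages(queries, placeholder_results).items():
    print(query, "->", passages)
```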

And that's it for the basics of querying an index! You're now ready to index and retrieve documents with RAGatouille!

## Using an already-created index

In the examples above, we embedded documents into an index and queried it during the same session. But a key feature is **persistence**: indexing is the slowest part, and we don't want to have to do it every time!

Loading an already-created index is just as straightforward as creating one from scratch. First, we'll load up an instance of `RAGPretrainedModel` from the index, where the full configuration of the embedder is stored:

In [None]:
# This is the path to the index. We recommend keeping this path format when using RAGatouille somewhere else.
path_to_index = ".ragatouille/colbert/indexes/Miyazaki/"
RAG = RAGPretrainedModel.from_index(path_to_index)
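
If you load indexes by name in several places, a small helper can keep the path format above consistent. This is only a sketch: `index_path` is a hypothetical helper, and it assumes your indexes live under the same `.ragatouille/colbert/indexes/` layout used in this example.

```python
from pathlib import Path


def index_path(index_name: str, root: str = ".ragatouille") -> Path:
    """Build the on-disk location of an index, following the layout used above."""
    return Path(root) / "colbert" / "indexes" / index_name


path = index_path("Miyazaki")
print(path.as_posix())  # .ragatouille/colbert/indexes/Miyazaki

# Checking that the directory exists before loading gives a clearer error
# than a failed load would.
if not path.is_dir():
    print(f"No index found at {path}; create it with index() first.")
```

You can then pass `str(path)` wherever the path string is expected.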

And that's it! The index is now fully ready to be queried using `search()` as above.

### Updating an index

Once you've loaded an existing index, you might want to add new documents to it. RAGatouille supports this via the `RAGPretrainedModel.add_to_index()` function. Due to the way ColBERT stores documents as bags-of-embeddings, there are cases where recreating the index is more efficient than updating it. You don't need to worry about this: the most efficient method is automatically used when you call `add_to_index()`.

Let's say we want to expand our index to cover more of Studio Ghibli. Let's get the studio's page into it too!

In [None]:
new_documents = get_wikipedia_page("Studio_Ghibli")

RAG.add_to_index([new_documents])

And again, that's it! The index has been updated with your new document set, and the updates are already persisted to disk. You're now ready to query it with `search()`!