# Reranking with RAGatouille

In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to **re-rank documents** retrieved by another retriever, such as **your existing RAG pipeline**.

First, as usual, let's load up a pre-trained ColBERT model:

In [1]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

[Jan 25, 18:45:59] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




Now that our model is loaded, we must build an index of our documents for our first retrieval step! In the real world, you'd likely already have some sort of pipeline doing this. We're going to emulate one here, using bge embeddings and Spotify's excellent `voyager` library.

In [2]:
from sentence_transformers import SentenceTransformer
from voyager import Index, Space

class MyExistingRetrievalPipeline:
    index: Index
    embedder: SentenceTransformer

    def __init__(self, embedder_name: str = "BAAI/bge-small-en-v1.5"):
        self.embedder = SentenceTransformer(embedder_name)
        self.collection_map = {}
        self.index = Index(
            Space.Cosine,
            num_dimensions=self.embedder.get_sentence_embedding_dimension(),
        )

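    def index_documents(self, documents: list[dict]) -> None:
        # There are very few documents in our example, so we don't bother with batching.
        # Each document is a dict whose 'content' key holds the chunk text.
        for document in documents:
            embedding = self.embedder.encode(document['content'])
            # add_item returns the ID voyager assigned, which we map back to the text
            self.collection_map[self.index.add_item(embedding)] = document['content']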

    def query(self, query: str, k: int = 10) -> list[str]:
        query_embedding = self.embedder.encode(query)
        to_return = []
        for idx in self.index.query(query_embedding, k=k)[0]:
            to_return.append(self.collection_map[idx])
        return to_return

In [3]:
existing_pipeline = MyExistingRetrievalPipeline()

Now that our mock of an existing pipeline is set up, let's index some documents with it! We'll re-use our favourite combo from the previous examples: `CorpusProcessor` and `get_wikipedia_page()`:

In [4]:
from ragatouille.utils import get_wikipedia_page
from ragatouille.data import CorpusProcessor

corpus_processor = CorpusProcessor()

documents = [get_wikipedia_page("Hayao Miyazaki"), get_wikipedia_page("Studio Ghibli"), get_wikipedia_page("Princess Mononoke"), get_wikipedia_page("Shrek")]
documents = corpus_processor.process_corpus(documents, chunk_size=200)

Now, let's add those to the voyager index so we can simulate a real query:

In [5]:
existing_pipeline.index_documents(documents)

In [6]:
query = "What's Gihbli's famous policy?"
raw_results = existing_pipeline.query(query, k=20)
raw_results

["Another defining feature is Hisaishi's unique use of leitmotif, rather than a singular song being associated with one character, the motif is the theme of the film. Hisaishi began using leitmotif in Ghibli films first in Howl's Moving Castle.\n\n\n== Criticism ==\nRayna Denison argues that, despite the feminist themes of Ghibli films, the studio has been reluctant to promote women within the company and regularly overworks its laborers, including on public holidays.Nathalie Pascaru and Maxim Tvorun-Dunn criticize the studio for what they see as the increasing commercialization of their films' iconography through plastic merchandise and tourist destinations, undermining the environmentalist themes of their films through industrial actions which are detrimental to the environment; as well as for turning their films' characters into decontextualized pop-culture references rather than multidimensional characters used to convey stories.",
 'In September 2023, Nippon TV announced that Stud

Oh! We can see in the results that the policy we're looking for is explained very clearly:

>   'The studio is also known for its strict "no-edits" policy in licensing their films abroad due to Nausicaä of the Valley of the Wind being heavily edited for the film\'s release in the United States as Warriors of the Wind.\n\n\n=== Independent era ===\nBetween 1999 and 2005, Studio Ghibli was a subsidiary brand of Tokuma Shoten; however, that partnership ended in April 2005, when Studio Ghibli was spun off from Tokuma Shoten and was re-established as an independent company with relocated headquarters.\nOn February 1, 2008, Toshio Suzuki stepped down from the position of Studio Ghibli president, which he had held since 2005, and Koji Hoshino (former president of Walt Disney Japan) took over. Suzuki said he wanted to improve films with his own hands as a producer, rather than demanding this from his employees.',

The problem is that it's ranked as the **14th** most relevant result! In a real RAG pipeline, this would often fall well outside the context you'd give to your LLM.
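Rather than counting positions by eye, a small helper can report where a passage landed in the retrieved list. This is only a minimal sketch: `find_rank` is a hypothetical name introduced here for illustration, not part of RAGatouille or `voyager`, and it assumes the results are plain strings, as returned by our mock pipeline.

```python
from typing import Optional


def find_rank(results: list[str], needle: str) -> Optional[int]:
    """Return the 1-based position of the first result containing `needle`.

    Returns None when no result contains it.
    """
    for position, passage in enumerate(results, start=1):
        if needle in passage:
            return position
    return None
```

Calling `find_rank(raw_results, '"no-edits" policy')` on the results above would give the position of the passage we care about.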

This is where ColBERT re-ranking comes into play. Let's use our previously loaded `RAGPretrainedModel` to re-rank the results of our existing pipeline:

In [7]:
RAG.rerank(query=query, documents=raw_results, k=5)

100%|██████████| 1/1 [00:01<00:00,  1.97s/it]


[{'content': 'The studio is also known for its strict "no-edits" policy in licensing their films abroad due to Nausicaä of the Valley of the Wind being heavily edited for the film\'s release in the United States as Warriors of the Wind.\n\n\n=== Independent era ===\nBetween 1999 and 2005, Studio Ghibli was a subsidiary brand of Tokuma Shoten; however, that partnership ended in April 2005, when Studio Ghibli was spun off from Tokuma Shoten and was re-established as an independent company with relocated headquarters.\nOn February 1, 2008, Toshio Suzuki stepped down from the position of Studio Ghibli president, which he had held since 2005, and Koji Hoshino (former president of Walt Disney Japan) took over. Suzuki said he wanted to improve films with his own hands as a producer, rather than demanding this from his employees.',
  'score': 15.333166122436523,
  'rank': 0,
  'result_index': 11},
 {'content': "Studio Ghibli, Inc. (Japanese: 株式会社スタジオジブリ, Hepburn: Kabushiki-gaisha Sutajio Jibur

And here it is! The relevant extract is now all the way at the top of the results, ready to be passed to the rest of your pipeline!

So why not just use `rerank()` on the whole index if it's so good? Well, you could, but it's not very efficient. ColBERT is an extremely fast querier, but it needs a pre-built index to be one. When you use ColBERT to rerank documents, it works index-free: it has to encode the query and every candidate document, then compare them, all on the fly. This is fine for a handful of documents on CPU or a few hundred on GPU, but the time taken keeps growing with every document you add, since each one must be encoded at query time!
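To make the index-free comparison more concrete, here is a toy sketch of ColBERT-style late-interaction (MaxSim) scoring in pure Python. It is not RAGatouille's implementation: the function names are hypothetical, and it assumes the per-token embeddings have already been computed, which in practice is the expensive part. The point is simply that every candidate document has to be scored against the query one by one.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token, keep its best match among the document tokens,
    then sum those best matches over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank_index_free(
    query_tokens: list[list[float]],
    docs_tokens: list[list[list[float]]],
    k: int,
) -> list[tuple[int, float]]:
    """Score every document against the query and return the top-k
    (document_index, score) pairs, best first."""
    scored = [(maxsim_score(query_tokens, doc), i) for i, doc in enumerate(docs_tokens)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

Because the loop in `rerank_index_free` touches every document, the work scales with the size of the candidate list, which is why it suits a short list of candidates better than a whole collection.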

Re-ranking the results of another retrieval method is a good compromise: it lets you leverage ColBERT's power without having to modify the rest of your pipeline. Just increase the `k` value of your retriever and let ColBERT rescore the results!
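Put together, the two stages can be wrapped in a single helper. The sketch below is one possible way to do it, not an official RAGatouille utility: `retrieve_then_rerank` is a hypothetical name, and it assumes the retriever exposes `query(query, k=...)` like our mock pipeline and the reranker exposes `rerank(query=..., documents=..., k=...)` as used earlier in this example.

```python
def retrieve_then_rerank(retriever, reranker, query: str, first_stage_k: int = 20, final_k: int = 5):
    """Over-retrieve with the first-stage retriever, then let the reranker
    pick the best `final_k` candidates."""
    # Stage 1: cast a wide net with the existing (cheap) retriever.
    candidates = retriever.query(query, k=first_stage_k)
    # Stage 2: rescore only those candidates with the (more expensive) reranker.
    return reranker.rerank(query=query, documents=candidates, k=final_k)
```

With the objects defined in this example, this would be called as `retrieve_then_rerank(existing_pipeline, RAG, query)`.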