[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LucaStrano/Experimental_RAG_Tech/blob/main/experimental_tech/2_compress_and_rerank.ipynb)

## Single Pass Rerank and Contextual Compression using Recursive Reranking

### Overview

This notebook demonstrates how to use a **Reranker Model** to perform both _Reranking_ and _Contextual Compression_ in a single pass. We will go over the **Intuition** behind the technique, the **Implementation** details and, finally, a short **Conclusion**.

### 1. Intuition

The use of a Reranker Model has become essential in most modern RAG pipelines, especially when dealing with large or complex datasets. It significantly improves the **Precision** of retrieved results by re-evaluating and reordering the initially retrieved documents based on deeper semantic understanding. A Reranker Model is typically involved in the following steps:

1. Use a fast **Embedding Model** to retrieve a set of $\text{top}_K$ candidate documents based on query similarity. Usually, $\text{top}_K$ is set to a relatively high number to ensure high **Recall**;

2. Use a **Reranker Model** to re-evaluate the $\text{top}_K$ documents and select the $\text{top}_N$ most relevant ones. $\text{top}_N$ is usually set to a much lower number to ensure high **Precision**.
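
The two steps above can be sketched as a small generic function. This is only an illustration: `cheap_score` and `precise_score` are hypothetical stand-ins for the embedding similarity and the reranker, and the toy scorers below are assumptions that exist solely to make the snippet runnable without downloading any model.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(query: str, corpus: list[str],
                         cheap_score: Scorer, precise_score: Scorer,
                         top_k: int, top_n: int) -> list[str]:
    # Stage 1: keep the top_k candidates according to the cheap scorer (favours recall)
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    # Stage 2: reorder those candidates with the precise scorer and keep top_n (favours precision)
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]

# Toy scorers (assumptions for illustration only, not real models)
def word_overlap(query: str, doc: str) -> float:
    # Number of distinct words shared by the query and the document
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def exact_phrase(query: str, doc: str) -> float:
    # 1.0 if the whole query appears verbatim in the document, else 0.0
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    "the capital of every region of italy",
    "rome is the capital of italy",
    "the capital of france is paris",
    "pasta recipes from italy",
]
top = retrieve_then_rerank("capital of italy", corpus,
                           word_overlap, exact_phrase, top_k=3, top_n=1)
print(top)  # ['rome is the capital of italy']
```

In this toy run the cheap scorer cannot separate the first two documents, while the stricter second stage picks the one that actually answers the query.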

You might wonder why a reranker model is necessary at all: after all, the initial retrieval step already returns a set of seemingly relevant documents. This is because embedding models, while effective for initial retrieval, rely on the **Encoder** Architecture which compresses the semantic meaning of the documents into fixed-size vectors. Relevance is then estimated using a _similarity function_ (such as cosine similarity) over the calculated vectors. While this approach is efficient, it can miss subtle semantic nuances and contextual cues of the original texts.
Reranker Models, on the other hand, use a **Cross-Encoder** architecture, which jointly processes the query and candidate documents at the _token level_, allowing for a more fine-grained understanding of their relationship. This process, while more computationally expensive, ensures a higher quality of the final results.
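
As a small illustration of the similarity function mentioned above, the following sketch computes cosine similarity between fixed-size vectors in plain Python. The three-dimensional vectors are made-up values, not real embeddings. Note that each vector is produced without ever seeing the other text, which is the limitation a Cross-Encoder avoids.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]   # hypothetical query embedding
doc_close = [2.0, 0.0, 2.0]   # same direction as the query
doc_far = [0.0, 1.0, 0.0]     # orthogonal to the query

sim_close = cosine_similarity(query_vec, doc_close)
sim_far = cosine_similarity(query_vec, doc_far)
print(sim_close, sim_far)
```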

To further enhance the quality of the results, we can also apply a **Contextual Compression** step. This step involves breaking down the retrieved documents into smaller, more manageable chunks. This allows us to not only select the most relevant documents but also to extract only the most relevant pieces from them, effectively compressing the context while retaining essential information.

The problem with this pipeline is that it now requires three separate steps: an initial retrieval step, a reranking step, and a compression step. Using traditional methods, this can be inefficient and highly time-consuming. What if we could combine both Reranking and Compression into a single step? This is where the **Recursive Reranking** technique comes into play. It works as follows:

1. Use a fast Embedding Model to retrieve a set of $\text{top}_K$ candidate documents;

2. Using a Reranker Model, calculate a relevance score for each sub-section of each document;

3. Use calculated sub-section scores to both rerank documents and select only the most relevant sub-sections of each reranked document.

### 2. Recursive Reranking

This section focuses on the **Preliminaries** and the **Implementation** of the Recursive Reranking Technique.

### 2.1 Preliminaries to Recursive Reranking

Let's start by installing the necessary dependencies. We will use the `chromadb` library to handle our vector database, and the `sentence-transformers` library to use our Reranker Model.

In [None]:
%conda install -c conda-forge sentence-transformers hf-xet chromadb
# %pip install -U sentence-transformers hf-xet chromadb

Let's first define our example documents that we will use throughout this notebook:

In [1]:
docs = [
"""
Italy, officially the Italian Republic, is a country in Southern and Western Europe.
It consists of a peninsula that extends into the Mediterranean Sea.
The Alps mountain range forms its northern boundary, while the Apennine Mountains run down the length of the peninsula.
The territory also includes well as nearly 800 islands, notably Sicily and Sardinia.
It is a country in Southern Europe with a population of approximately 60 million people.
""",

"""
The capital of Italy is Rome, which is also the largest city in the country.
Rome is known for its nearly 3,000 years of globally influential art, architecture, and culture.
The city is often referred to as the "Eternal City" and is famous for its ancient history, including landmarks such as the Colosseum and the Vatican.
It is the capital city of Italy and has a population of almost 3 million people.
""",

"""
Italy's history goes back to numerous Italic peoples—notably including the ancient Romans, 
who conquered the Mediterranean world during the Roman Republic and ruled it for centuries during the Roman Empire.
The Roman Empire was among the largest in history, wielding great economical, cultural, political, and military power.
""",

"""
France is a country in Western Europe, known for its rich history, culture, and influence.
The food in France is renowned worldwide, with dishes like coq au vin and ratatouille.
France has a world class cuisine and is famous for its wine, cheese, and pastries.
Regions like Bordeaux and Champagne are particularly well-known in the culinary world.
"""
]

We will use Chroma's in-memory vector database to simulate the initial retrieval step:

In [None]:
import chromadb
from uuid import uuid4

client = chromadb.Client()
collection = client.create_collection(name="italy")
collection.add(
    ids = [str(uuid4()) for _ in range(len(docs))],
    documents = docs,
)

Chroma automatically handles the creation of embeddings for the documents we add to the collection. By default, it uses the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, which is lightweight and efficient. Let's try querying our collection to see if it works:

In [28]:
query = "How many people live in Italy and what is the capital?"
results = collection.query(query_texts=[query], n_results=2)

for i, doc_id in enumerate(results['ids'][0]):  # `doc_id` avoids shadowing the built-in `id`
    print(f"Document {i+1}: {results['documents'][0][i]}")
    print(f"ID: {doc_id}\nDistance: {results['distances'][0][i]}")
    print("-" * 50)

Document 1: 
The capital of Italy is Rome, which is also the largest city in the country.
Rome is known for its nearly 3,000 years of globally influential art, architecture, and culture.
The city is often referred to as the "Eternal City" and is famous for its ancient history, including landmarks such as the Colosseum and the Vatican.
It is the capital city of Italy and has a population of almost 3 million people.

ID: 9b23cca7-2749-4ed1-9404-4edd035a1b8a
Distance: 0.6291064023971558
--------------------------------------------------
Document 2: 
Italy, officially the Italian Republic, is a country in Southern and Western Europe.
It consists of a peninsula that extends into the Mediterranean Sea.
The Alps mountain range forms its northern boundary, while the Apennine Mountains run down the length of the peninsula.
The territory also includes well as nearly 800 islands, notably Sicily and Sardinia.
It is a country in Southern Europe with a population of approximately 60 million people.


We get great results. Let's now introduce our Reranker Model. We will use the [`cross-encoder/ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model, which is a lightweight Cross-Encoder model designed for reranking tasks.

In [11]:
from sentence_transformers import CrossEncoder
import torch # comes with sentence-transformers

DEVICE = 'cuda' if torch.cuda.is_available() \
         else 'mps' if torch.backends.mps.is_available() \
         else 'cpu'
print(f"Using device: {DEVICE}")
rerank = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2', device=DEVICE)
print("✅ Reranker model loaded.")

Using device: mps
✅ Reranker model loaded.


Let's do a quick check and see how the model ranks our documents with an example query:

In [32]:
query = "What is the capital of Italy?"
rerank_results = rerank.predict([[query, doc] for doc in docs])
rerank_results

array([ 1.5755582,  9.614823 , -4.1138754, -8.50889  ], dtype=float32)

Unsurprisingly, the highest ranked document is the one that talks about Rome, which is the capital of Italy. The lowest ranked document is the one that talks about France, which is not relevant to the query at all. This is to be expected. 
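
To make the ordering explicit, the raw scores can be turned into a ranking of document indices. The sketch below copies the scores from the output above into a hypothetical `example_scores` list so that it runs on its own.

```python
# Scores copied from the output above, in the same order as `docs`
example_scores = [1.5755582, 9.614823, -4.1138754, -8.50889]

# Document indices sorted from most to least relevant
ranking = sorted(range(len(example_scores)),
                 key=lambda i: example_scores[i], reverse=True)
print(ranking)  # [1, 0, 2, 3]
```
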
Let’s now take the highest-ranked document and predict a relevance score for each of its sentences individually. This allows us to analyze the alignment between the query and different parts of the document in a more fine-grained way, rather than treating the whole document as a single block.

In [17]:
query = "What is the capital of Italy?"
sents = [sent.strip() for sent in docs[1].split('.') if sent.strip()]
sent_scores = rerank.predict([[query, sent] for sent in sents])

for i, (sent, score) in enumerate(zip(sents, sent_scores)):
    print(f"Sentence {i+1}: {sent}")
    print(f"Score: {score:.4f}")
    print("-" * 50)

print("Mean score of sentences:", sent_scores.mean())
print("Standard deviation of scores:", sent_scores.std())

Sentence 1: The capital of Italy is Rome, which is also the largest city in the country
Score: 9.2809
--------------------------------------------------
Sentence 2: Rome is known for its nearly 3,000 years of globally influential art, architecture, and culture
Score: -1.6989
--------------------------------------------------
Sentence 3: The city is often referred to as the "Eternal City" and is famous for its ancient history, including landmarks such as the Colosseum and the Vatican
Score: -7.3384
--------------------------------------------------
Sentence 4: It is the capital city of Italy and has a population of almost 3 million people
Score: 6.3446
--------------------------------------------------
Mean score of sentences: 1.6470724
Standard deviation of scores: 6.5626807


We can see that the model assigns the highest score to the sentence that is most relevant to the query. The mean score of the whole document is quite low, since the document contains several sentences that are not strictly relevant to the query. This is a common issue with long documents, where a large portion of the sentences may not be relevant at all. The standard deviation of the scores is also quite high. We should take both statistics into account when selecting documents and sentences.

### 2.2 Implementing Recursive Reranking

We are all set up! We can now implement the main logic of our Recursive Reranking Technique. The approach consists of the following steps:

1. Split each document into separate sentences, then use the Reranker Model to calculate a relevance score for each sentence with respect to the query;

2. Each document is then given a score equal to the mean of its $\text{score}_N$ highest sentence scores (long chunks may contain several non-relevant sentences, which would otherwise drag down the overall score of the document);

3. Select the $\text{top}_N$ documents based on their scores;

4. For each selected document, we perform Contextual Compression by selecting only the most relevant sentences using a simple **Static Filter**. Specifically, for document $d$, we select all sentences whose score satisfies:
$$\text{score} \geq \mu_d + \alpha \cdot \sigma_d$$
where $\mu_d$ is the mean and $\sigma_d$ is the standard deviation of sentence scores in document $d$, and $\alpha$ is a tunable hyperparameter controlling the strictness of the filter. 
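
As a worked example of steps 2 and 4, the sketch below applies both formulas to the four sentence scores printed in Section 2.1. The helper names `doc_score` and `static_filter` are hypothetical and used only for this illustration. `pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.

```python
from statistics import mean, pstdev

# Sentence scores copied from the output in Section 2.1
example_scores = [9.2809, -1.6989, -7.3384, 6.3446]

def doc_score(scores: list[float], score_n: int) -> float:
    # Step 2: mean of the score_n highest sentence scores
    top = sorted(scores, reverse=True)[:score_n]
    return sum(top) / len(top)

def static_filter(scores: list[float], alpha: float) -> list[int]:
    # Step 4: indices of sentences with score >= mean + alpha * std
    threshold = mean(scores) + alpha * pstdev(scores)
    return [i for i, s in enumerate(scores) if s >= threshold]

print(doc_score(example_scores, score_n=2))      # about 7.81, versus a plain mean of about 1.65
print(static_filter(example_scores, alpha=0.2))  # [0, 3]
print(static_filter(example_scores, alpha=0.9))  # [0]
```

With $\alpha = 0.2$ the threshold is roughly $1.65 + 0.2 \cdot 6.56 \approx 2.96$, so the first and fourth sentences survive. With $\alpha = 0.9$ it rises to about $7.55$ and only the first sentence is kept.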

We start by defining a function that will perform the initial retrieval step using the Chroma collection we created earlier:

In [33]:
def retrieve(query : str, k : int) -> list[str]:
    """
    Retrieve the top-k documents from the Chroma collection based on the query.
    """
    results = collection.query(query_texts=[query], n_results=k)
    return results['documents'][0]

Let's now define the main hyperparameters and implement the `recursive_rerank` function, which takes a `query` as a string and `docs` as a list of strings, and returns a list of reranked and compressed documents.

In [None]:
import numpy as np

SCORE_N = 2     # Number of top sentences to consider for document scoring
TOP_N = 2       # Number of top documents to select
ALPHA = 0.2     # Strength of Contextual Compression

def recursive_rerank(query: str, 
                     docs: list[str],
                     score_n : int = SCORE_N,
                     top_n : int = TOP_N,
                     alpha : float = ALPHA) -> list[str]:
    """
    Perform recursive reranking and contextual compression of documents in a single pass.
    """

    # Step 1: Calculate sentence scores
    all_sents = []
    sent_scores = []
    for doc in docs:
        # Naive split on periods; a library such as spaCy would give better sentence segmentation
        sents = [sent.strip() for sent in doc.split('.') if sent.strip()]
        all_sents.append(sents)
        scores = rerank.predict([[query, sent] for sent in sents])
        sent_scores.append(scores)

    # Step 2: Calculate document scores based on top score_N sentence scores
    doc_scores = []
    for scores in sent_scores:
        indx = min(score_n, len(scores))
        sorted_scores = sorted(scores, reverse=True)[:indx]
        doc_score = sum(sorted_scores) / indx
        doc_scores.append(doc_score)

    # Step 3: Select top N documents
    # We will use document indices to save space
    top_docs_indices = \
        sorted(
            range(len(doc_scores)), 
            key=lambda i: doc_scores[i], 
            reverse=True
        )[:min(top_n, len(doc_scores))]
    
    # Step 4: rerank and compress documents whose indices are in top_docs_indices
    filtered_docs = []
    for i in top_docs_indices:
        mean = np.mean(sent_scores[i])
        std_dev = np.std(sent_scores[i])
        filtered_sents = [sent for sent, score in zip(all_sents[i], sent_scores[i]) 
                          if score >= mean + alpha * std_dev]
        if filtered_sents:
            filtered_docs.append('.\n'.join(filtered_sents) + '.')

    return filtered_docs
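
Splitting on `'.'` is enough for the example documents, but it would also cut decimal numbers such as `3.14` and it ignores `?` and `!`. A slightly more careful splitter can be sketched with a regular expression. The helper `split_sentences` below is a hypothetical alternative, not part of the original implementation, and it still does not handle abbreviations such as "e.g.".

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after '.', '!' or '?' only when the punctuation is followed by whitespace,
    # so decimals like 3.14 stay intact
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]

example_sents = split_sentences("Pi is 3.14. Is it? Yes!")
print(example_sents)  # ['Pi is 3.14.', 'Is it?', 'Yes!']
```

Because this variant keeps the terminal punctuation, plugging it into `recursive_rerank` would also mean joining the kept sentences with `'\n'` rather than `'.\n'`.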

Let's finally test our implementation with an example query and see how it performs:

In [57]:
retrieve_query = "How many people live in Italy and what is the capital?"
docs = retrieve(retrieve_query, k=4)
reranked_docs = recursive_rerank(retrieve_query, docs)
for doc in reranked_docs:
    print(doc)
    print("-" * 50)

The capital of Italy is Rome, which is also the largest city in the country.
It is the capital city of Italy and has a population of almost 3 million people.
--------------------------------------------------
Italy, officially the Italian Republic, is a country in Southern and Western Europe.
It is a country in Southern Europe with a population of approximately 60 million people.
--------------------------------------------------


We get exactly what we want! The two reranked documents returned are the ones discussing Rome and the population of Italy, which are both aligned with the query. We also retained only the most relevant sentences from each document, effectively performing Contextual Compression. Let's test it once again with a different query:

In [75]:
retrieve_query = "Does france have good food?"
docs = retrieve(retrieve_query, k=4)
reranked_docs = recursive_rerank(retrieve_query, docs, alpha=0.2, top_n=3)
for doc in reranked_docs:
    print(doc)
    print("-" * 50)

The food in France is renowned worldwide, with dishes like coq au vin and ratatouille.
France has a world class cuisine and is famous for its wine, cheese, and pastries.
--------------------------------------------------
Italy, officially the Italian Republic, is a country in Southern and Western Europe.
It is a country in Southern Europe with a population of approximately 60 million people.
--------------------------------------------------
The capital of Italy is Rome, which is also the largest city in the country.
It is the capital city of Italy and has a population of almost 3 million people.
--------------------------------------------------


This time, the highest ranked document is the one that talks about France and its food. We can also control the strictness of the Contextual Compression step by adjusting the `alpha` parameter. Let's try the same query with a higher `alpha` value:

In [76]:
retrieve_query = "Does france have good food?"
docs = retrieve(retrieve_query, k=4)
reranked_docs = recursive_rerank(retrieve_query, docs, alpha=0.9, top_n=3)
for doc in reranked_docs:
    print(doc)
    print("-" * 50)

The food in France is renowned worldwide, with dishes like coq au vin and ratatouille.
France has a world class cuisine and is famous for its wine, cheese, and pastries.
--------------------------------------------------
It is a country in Southern Europe with a population of approximately 60 million people.
--------------------------------------------------
The capital of Italy is Rome, which is also the largest city in the country.
--------------------------------------------------


As you can see, the highest ranked document still keeps the same two sentences as before, while each of the other documents has been compressed to a single sentence.

### 3. Conclusion

You can find a diagram of the Recursive Reranking Technique [at this link](https://raw.githubusercontent.com/LucaStrano/Experimental_RAG_Tech/refs/heads/main/images/2_compress_and_rerank_diagram.png).

The Recursive Reranking Technique offers a powerful way to combine both Reranking and Contextual Compression in a single pass. This technique is particularly useful when dealing with noisy chunks and with large values of retrieval hyperparameters such as $\text{top}_K$.

The main advantages of this approach include:

- High efficiency, since it combines both Reranking and Contextual Compression in a single pass;

- Lower latency, since it avoids the need to perform multiple LLM calls to compress the retrieved documents.

This technique works best when paired with chunking techniques such as **Semantic Chunking** or **Proposition Chunking**. The Recursive Reranking function could also be enhanced by using a better (but more computationally intensive) Reranker Model or by considering windows of sentences instead of single sentences. This would allow for a wider understanding of the context, especially with ambiguous sentences where entities aren't directly mentioned (e.g., _The capital of Italy is Rome. It is a city containing..._).
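
The sentence-window idea could be sketched as follows. The helper `sentence_windows` is hypothetical: it groups consecutive sentences into overlapping windows, and each window would then be scored by the Reranker Model in place of a single sentence.

```python
def sentence_windows(sents: list[str], size: int = 2) -> list[str]:
    """Join each run of `size` consecutive sentences into one scoring unit."""
    if len(sents) <= size:
        # Too few sentences for a full window: score them all together
        return [" ".join(sents)] if sents else []
    return [" ".join(sents[i:i + size]) for i in range(len(sents) - size + 1)]

windows = sentence_windows([
    "The capital of Italy is Rome.",
    "It is a city of almost 3 million people.",
    "It is known as the Eternal City.",
], size=2)
for w in windows:
    print(w)
```

Here the first window pairs the pronoun "It" with the sentence that names Rome, so the reranker sees the entity it refers to.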