# Introduction to reranking: an advanced RAG technique

In this notebook, you will test an advanced RAG technique: using a reranker (CrossEncoder model) to improve the retrieval of news in a RAG pipeline.

The notebook is partially filled with code. You will complete it by writing the missing code, running evaluations, and comparing results.

# 📌 Objectives

By the end of this notebook, students will be able to:

1. **Create and Use a Synthetic Evaluation Dataset:**
   - Automatically generate natural language questions corresponding to news articles using an LLM.
   - Construct a test set to assess the quality of retrieved answers.

2. **Evaluate Baseline RAG Performance Without Reranking:**
   - Measure how often the original article appears in the top-k results using FAISS alone.
   - Record retrieval accuracy and position for each query.

3. **Apply a Cross-Encoder Reranker to Improve Retrieval:**
   - Use a pretrained CrossEncoder model to rerank top FAISS results.
   - Evaluate improvements in the ranking of correct articles after reranking.

4. **Compare Retrieval Performance With and Without Reranking:**
   - Compute and visualize differences in rank positions before and after reranking.
   - Analyze statistical improvements (e.g., average rank, frequency at position 0).

5. **Reflect on the Impact of Reranking in RAG Pipelines:**
   - Discuss how reranking enhances retrieval quality in a RAG context.
   - Identify scenarios where reranking may offer the most value, and propose further improvements.


## Install and import libraries
Run the following cell to install required libraries. These include `sentence-transformers` for embeddings and reranking, and `faiss-cpu` for vector similarity search.

In [1]:
%pip install sentence-transformers
%pip install faiss-cpu


In [24]:
from sentence_transformers import SentenceTransformer
import faiss
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import matplotlib.pyplot as plt

import os
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
DIR = "/content/drive/MyDrive/Colab Notebooks/Fintech/Pt4/"
os.chdir(DIR)

## Load S&P 500 news
We will work with a dataset of financial news headlines and summaries. You will:
 - Load the data
 - Convert the publication date column to datetime
 - Drop duplicate summaries

In [4]:
df_news = pd.read_csv('data/df_news.csv')
df_news['PUBLICATION_DATE'] = pd.to_datetime(df_news['PUBLICATION_DATE']).dt.date
display(df_news)
print(df_news.shape)

df_news.drop_duplicates('SUMMARY', inplace=True)
print(df_news.shape)

Unnamed: 0,TICKER,TITLE,SUMMARY,PUBLICATION_DATE,PROVIDER,URL
0,MMM,2 Dow Jones Stocks with Promising Prospects an...,The Dow Jones (^DJI) is made up of 30 of the m...,2025-05-29,StockStory,https://finance.yahoo.com/news/2-dow-jones-sto...
1,MMM,3 S&P 500 Stocks Skating on Thin Ice,The S&P 500 (^GSPC) is often seen as a benchma...,2025-05-27,StockStory,https://finance.yahoo.com/news/3-p-500-stocks-...
2,MMM,3M Rises 15.8% YTD: Should You Buy the Stock N...,"MMM is making strides in the aerospace, indust...",2025-05-22,Zacks,https://finance.yahoo.com/news/3m-rises-15-8-y...
3,MMM,Q1 Earnings Roundup: 3M (NYSE:MMM) And The Res...,Quarterly earnings results are a good time to ...,2025-05-22,StockStory,https://finance.yahoo.com/news/q1-earnings-rou...
4,MMM,3 Cash-Producing Stocks with Questionable Fund...,While strong cash flow is a key indicator of s...,2025-05-19,StockStory,https://finance.yahoo.com/news/3-cash-producin...
...,...,...,...,...,...,...
4866,ZTS,2 Dividend Stocks to Buy With $500 and Hold Fo...,Zoetis is a leading animal health company with...,2025-05-23,Motley Fool,https://www.fool.com/investing/2025/05/23/2-di...
4867,ZTS,Zoetis (NYSE:ZTS) Declares US$0.50 Dividend Pe...,Zoetis (NYSE:ZTS) recently affirmed a dividend...,2025-05-22,Simply Wall St.,https://finance.yahoo.com/news/zoetis-nyse-zts...
4868,ZTS,Jim Cramer on Zoetis (ZTS): “It Does Seem to B...,We recently published a list of Jim Cramer Tal...,2025-05-21,Insider Monkey,https://finance.yahoo.com/news/jim-cramer-zoet...
4869,ZTS,Zoetis (ZTS) Upgraded to Buy: Here's Why,Zoetis (ZTS) might move higher on growing opti...,2025-05-21,Zacks,https://finance.yahoo.com/news/zoetis-zts-upgr...


(4871, 6)
(3976, 6)


## Implement a FAISS Vector Store with Sentence Transformer embeddings
This section is already implemented. It uses a sentence transformer to encode news summaries into vector embeddings and stores them in a FAISS index.

We also define a simple class `FaissVectorStore` that allows for efficient retrieval.


In [5]:
df_news['EMBEDDED_TEXT'] = df_news['TITLE'] + ' : ' + df_news['SUMMARY']

In [6]:
model = SentenceTransformer('all-MiniLM-L6-v2')


In [7]:
# Load model and compute embeddings
text_embeddings = model.encode(df_news['SUMMARY'].tolist(), convert_to_numpy=True)

# Normalize embeddings to use cosine similarity (via inner product in FAISS)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Prepare metadata
documents = df_news['SUMMARY'].tolist()
metadata = [
    {
        'PUBLICATION_DATE': row['PUBLICATION_DATE'],
        'TICKER': row['TICKER'],
        'PROVIDER': row['PROVIDER']
    }
    for _, row in df_news.iterrows()
]

In [8]:
embedding_dim = text_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(embedding_dim)  # Cosine similarity via inner product
faiss_index.add(text_embeddings)
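
As a sanity check on the normalization step, here is a minimal NumPy sketch showing that the inner product of L2-normalized vectors equals their cosine similarity. This is why an inner-product index such as `IndexFlatIP` can be used for cosine search. The vectors are small made-up arrays, not the real embeddings, and all names in the block are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 4)).astype("float32")  # stand-in for text_embeddings
toy_query = rng.normal(size=(4,)).astype("float32")         # stand-in for a query embedding

# Cosine similarity computed from the raw (unnormalized) vectors
cosine = (toy_embeddings @ toy_query) / (
    np.linalg.norm(toy_embeddings, axis=1) * np.linalg.norm(toy_query)
)

# The same quantity as a plain inner product after L2 normalization
unit_embeddings = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_query = toy_query / np.linalg.norm(toy_query)
inner_product = unit_embeddings @ unit_query

# Brute-force top-k by inner product, highest score first
k = 3
top_k = np.argsort(-inner_product)[:k]

print(np.allclose(cosine, inner_product, atol=1e-5))
```

The brute-force `argsort` is only meant to illustrate the ranking idea on a toy array; the notebook itself relies on FAISS for the search.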

In [9]:
class FaissVectorStore:
    def __init__(self, model, index, embeddings, documents, metadata):
        self.model = model
        self.index = index
        self.embeddings = embeddings
        self.documents = documents
        self.metadata = metadata

    def search(self, query, k=5, metadata_filter=None):
        query_embedding = self.model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)

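        if metadata_filter:
            filtered_indices = [i for i, meta in enumerate(self.metadata) if metadata_filter(meta)]
            if not filtered_indices:
                return []
            filtered_embeddings = self.embeddings[filtered_indices]
            temp_index = faiss.IndexFlatIP(filtered_embeddings.shape[1])
            temp_index.add(filtered_embeddings)
            # Cap k so FAISS does not pad the result with -1 placeholders
            D, I = temp_index.search(query_embedding, min(k, len(filtered_indices)))
            # Map positions in the temporary index back to positions in the full corpus
            indices = [filtered_indices[i] for i in I[0]]
        else:
            D, I = self.index.search(query_embedding, k)
            indices = I[0]

        # FAISS returns one row of scores per query; keep the row for our single query
        D = D[0]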

        results = []
        for idx, sim in zip(indices, D):
            results.append((self.documents[idx], self.metadata[idx], float(sim)))


        return results

In [10]:
# Create FAISS-based store
faiss_store = FaissVectorStore(
    model=model,
    index=faiss_index,
    embeddings=text_embeddings,
    documents=documents,
    metadata=metadata
)
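
The `metadata_filter` argument of `search` is any callable that takes one metadata dict and returns a boolean. The sketch below shows the selection logic in isolation, using a few hand-written metadata entries with the same keys as the notebook's `metadata` list. The entries and the function name `zacks_since_may_22` are illustrative assumptions.

```python
from datetime import date

# Hand-written metadata entries with the same keys as the notebook's `metadata` list
toy_metadata = [
    {"PUBLICATION_DATE": date(2025, 5, 29), "TICKER": "MMM", "PROVIDER": "StockStory"},
    {"PUBLICATION_DATE": date(2025, 5, 22), "TICKER": "MMM", "PROVIDER": "Zacks"},
    {"PUBLICATION_DATE": date(2025, 5, 21), "TICKER": "ZTS", "PROVIDER": "Zacks"},
]


def zacks_since_may_22(meta):
    """Keep Zacks articles published on or after 2025-05-22."""
    return meta["PROVIDER"] == "Zacks" and meta["PUBLICATION_DATE"] >= date(2025, 5, 22)


# Same selection that `search` performs before building its temporary index
filtered_indices = [i for i, meta in enumerate(toy_metadata) if zacks_since_may_22(meta)]
print(filtered_indices)  # [1]
```

With the real store, the same callable would be passed as `faiss_store.search(question, k=5, metadata_filter=zacks_since_may_22)`.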

## Creating a dataset to evaluate the reranking

👉 **Instructions**:
- In this section, we will create an evaluation dataset for reranking by:
  - Sampling **100** distinct news articles from the full dataset.
  - Generating **one natural question** per article using GPT, where the expected answer is the original article.

✅ By the end of this section, you'll have a new DataFrame (`df_news_questions`) with:
- `NEWS`: the original summary
- `QUESTION`: the corresponding question generated using GPT

> ℹ️ The generated questions will simulate user queries in a RAG pipeline.

In [11]:
# CODE HERE
# Use as many coding cells as you need

df_sample = df_news[['SUMMARY']].drop_duplicates().sample(n=100, random_state=42).reset_index(drop=True)
df_sample = df_sample.rename(columns={'SUMMARY': 'NEWS'})
df_sample

Unnamed: 0,NEWS
0,"Broadcom, Arista Networks initiated: Wall Stre..."
1,Emerson Electric (EMR) has received quite a bi...
2,Ventas (VTR) reported earnings 30 days ago. Wh...
3,Nvidia's earnings call this week will be a mar...
4,Inflation-scarred American consumers are putti...
...,...
95,T. Rowe Price Group ( NASDAQ:TROW ) First Quar...
96,"Heartland Advisors, an investment management c..."
97,Key Insights Given the large stake in the stoc...
98,The study found that early intervention led to...


### Create OpenAI connector
You’ll use OpenAI’s GPT model to generate natural questions corresponding to each sampled news summary. These questions will be used to test the retrieval system.

✅ The cell below prompts for an API key with `getpass`, so the key is never stored in the notebook. You **must** use your own API key when running it.


In [12]:
from openai import OpenAI
import getpass

api_key = getpass.getpass("API key: ")
client = OpenAI(api_key=api_key)

response = client.responses.create(
    model="gpt-4o-mini",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)

API key: ··········
In a hidden glen under a starlit sky, a gentle unicorn brushed her shimmering mane against the petals of blooming moonflowers, putting the forest to sleep with dreams of magic and wonder.


### Using GPT to generate a question from a news summary
Following the instructions above, use GPT to create the evaluation dataset.

**Clarification:** The goal is to generate a natural question whose correct answer is the news summary. For example, if the news is about a company announcing layoffs, a good question could be: 'Which company recently announced job cuts in its tech division?'


In [13]:
prompt = """
Given the news headline provided below,
give me a question that would justify retrieving this specific news headline in a RAG system

News headlines:
{news}

Give me a question for which the answer is the news headline:
QUESTION

**Important**
dont answer anything else other than the question!
"""

In [14]:
# CODE HERE
# Use as many coding cells as you need

def build_prompt_with_news(news):
    return prompt.replace("{news}", news)

In [15]:
def generate_question(news):
    final_prompt = build_prompt_with_news(news)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": final_prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content.strip()

In [16]:
df_sample = df_news[['SUMMARY']].drop_duplicates().sample(n=100, random_state=42).reset_index(drop=True)
df_sample = df_sample.rename(columns={'SUMMARY': 'NEWS'})

In [17]:
df_sample['QUESTION'] = df_sample['NEWS'].apply(generate_question)
df_news_questions = df_sample

In [19]:
df_news_questions.head(15)

Unnamed: 0,NEWS,QUESTION
0,"Broadcom, Arista Networks initiated: Wall Stre...",What are some recent analyst calls regarding B...
1,Emerson Electric (EMR) has received quite a bi...,What recent developments have drawn attention ...
2,Ventas (VTR) reported earnings 30 days ago. Wh...,What recent developments can you share about V...
3,Nvidia's earnings call this week will be a mar...,What are the key points to consider regarding ...
4,Inflation-scarred American consumers are putti...,What are American consumers doing in response ...
5,Delivery service DoorDash (DASH) reported its ...,What were the key financial results and invest...
6,Strong capital efficiency and commercial growt...,What are the factors influencing investor sent...
7,Argus recently lowered the price target on Equ...,What recent action did Argus take regarding th...
8,The Zacks Internet software industry participa...,What are some companies in the Zacks Internet ...
9,This marks the ninth time in 37 years this uns...,How many times has this unstoppable company co...


## Evaluating RAG without and with reranking

You will now compare a basic RAG pipeline using FAISS with an enhanced version that includes reranking.


### RAG without reranking

👉 **Instructions**:
1. Implement a function that retrieves the **top 5** news summaries for a given question using your FAISS vector store.
2. For each `(QUESTION, NEWS)` pair:
   - Search using the `QUESTION`
   - Check if the corresponding `NEWS` appears in the top 5 retrieved summaries.
   - Record the **rank position** (from 0 to 4) in a new column `NO_RERANKER`.
   - If the news is **not found**, store `'not found'`.

✅ This step helps measure the baseline performance of your vector-based retrieval without reranking.

> 💡 Tip: You can store the retrieved results in a dictionary or list to avoid recomputation.
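
One possible way to follow the tip is to wrap the search call in a small cache and to isolate the rank lookup in a helper. This is a minimal sketch, not the expected solution: `rank_in_results`, `make_cached_search` and `toy_search` are hypothetical names, and `toy_search` is a stand-in for the vector store so that the block runs on its own.

```python
def rank_in_results(target, retrieved_texts):
    """Return the 0-based position of `target`, or 'not found'."""
    return retrieved_texts.index(target) if target in retrieved_texts else "not found"


def make_cached_search(search_fn):
    """Wrap a search function so each (query, k) pair is computed only once."""
    cache = {}

    def cached(query, k=5):
        key = (query, k)
        if key not in cache:
            cache[key] = search_fn(query, k)
        return cache[key]

    return cached


# Toy stand-in for the vector store: records its calls and returns fixed documents
calls = []


def toy_search(query, k):
    calls.append(query)
    corpus = ["doc a", "doc b", "doc c"]
    return corpus[:k]


cached_search = make_cached_search(toy_search)
first = cached_search("q1", k=2)
second = cached_search("q1", k=2)  # served from the cache, toy_search is not called again

print(rank_in_results("doc b", first))  # 1
print(rank_in_results("doc z", first))  # not found
print(len(calls))                       # 1
```

With the notebook's store, `search_fn` could be something like `lambda q, k: [doc for doc, _, _ in faiss_store.search(q, k=k)]`, since `search` returns `(document, metadata, similarity)` tuples.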

In [20]:
# CODE HERE
# Use as many coding cells as you need

no_reranker_ranks = []

for _, row in df_news_questions.iterrows():
    question = row['QUESTION']
    original_news = row['NEWS']

    retrieved = faiss_store.search(query=question, k=5)
    retrieved_news = [item[0] for item in retrieved]

    if original_news in retrieved_news:
        rank = retrieved_news.index(original_news)
    else:
        rank = "not found"

    no_reranker_ranks.append(rank)

In [21]:
df_news_questions['NO_RERANKER'] = no_reranker_ranks

In [33]:
df_news_questions.head(15)

Unnamed: 0,NEWS,QUESTION,NO_RERANKER,WITH_RERANKER
0,"Broadcom, Arista Networks initiated: Wall Stre...",What are some recent analyst calls regarding B...,0,0
1,Emerson Electric (EMR) has received quite a bi...,What recent developments have drawn attention ...,0,0
2,Ventas (VTR) reported earnings 30 days ago. Wh...,What recent developments can you share about V...,0,0
3,Nvidia's earnings call this week will be a mar...,What are the key points to consider regarding ...,0,0
4,Inflation-scarred American consumers are putti...,What are American consumers doing in response ...,0,0
5,Delivery service DoorDash (DASH) reported its ...,What were the key financial results and invest...,2,0
6,Strong capital efficiency and commercial growt...,What are the factors influencing investor sent...,0,0
7,Argus recently lowered the price target on Equ...,What recent action did Argus take regarding th...,0,0
8,The Zacks Internet software industry participa...,What are some companies in the Zacks Internet ...,0,0
9,This marks the ninth time in 37 years this uns...,How many times has this unstoppable company co...,0,0


### RAG with reranking
In this section, you will add reranking using a cross-encoder model (`ms-marco-MiniLM-L12-v2`) to improve retrieval.

👉 **Instructions**:
1. Retrieve the **top 100** summaries from FAISS for each question.
2. Rerank only those 100 summaries using the CrossEncoder model: `cross-encoder/ms-marco-MiniLM-L12-v2`.
3. Return the **top 5** reranked results.
4. For each `(QUESTION, NEWS)` pair, record the new rank (0–4) of the correct news in a new column: `WITH_RERANKER`.
5. Again, if the news is not found, store `'not found'`.

✅ This step lets you evaluate how much the reranker improves the relevance of retrieved results.

> ℹ️ The CrossEncoder scores each (question, summary) pair individually, so this step may take longer to run.
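
The retrieve-then-rerank pattern described above can be sketched independently of any model. In the block below, `toy_overlap_score` is a hypothetical stand-in scorer based on word overlap, used only so the example runs without downloading a model; it is not how a CrossEncoder computes relevance. The `rerank` helper is likewise an illustrative name.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Score every (question, candidate) pair, then keep the best `top_n`."""
    scores = [score_fn(question, doc) for doc in candidates]
    # Sort candidate positions by descending score
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]


def toy_overlap_score(question, doc):
    """Hypothetical stand-in scorer: number of shared lowercase tokens."""
    return len(set(question.lower().split()) & set(doc.lower().split()))


# Candidates in the order a first-stage retriever might have returned them
candidates = [
    "Berlin is well known for its museums.",
    "Paris is the capital of France.",
    "How many people live in Berlin is reported by the census.",
]

reranked = rerank("How many people live in Berlin?", candidates, toy_overlap_score, top_n=2)
print(reranked[0])
```

The first stage stays cheap because it only compares precomputed vectors, while the second stage spends more compute on a short candidate list. With a real CrossEncoder, the pairs would typically be scored in one batched `predict` call rather than one at a time.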



In [25]:
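from sentence_transformers import CrossEncoder

# Note: this rebinds `model`. `faiss_store` keeps its own reference to the
# SentenceTransformer, so FAISS retrieval is unaffected.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L12-v2')
# Score two (question, passage) pairs; a higher score means a more relevant passage
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
print(scores)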


[ 9.21891   -4.0780306]


In [26]:
# CODE HERE
# Use as many coding cells as you need

with_reranker_ranks = []

for _, row in df_news_questions.iterrows():
    question = row['QUESTION']
    original_news = row['NEWS']

    retrieved = faiss_store.search(query=question, k=100)
    retrieved_news = [item[0] for item in retrieved]

    rerank_inputs = [(question, doc) for doc in retrieved_news]

    scores = model.predict(rerank_inputs)

    sorted_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_5_indices = sorted_indices[:5]
    top_5_news = [retrieved_news[i] for i in top_5_indices]

    if original_news in top_5_news:
        rank = top_5_news.index(original_news)
    else:
        rank = "not found"

    with_reranker_ranks.append(rank)

In [34]:
df_news_questions['WITH_RERANKER'] = with_reranker_ranks
df_news_questions.head(15)

Unnamed: 0,NEWS,QUESTION,NO_RERANKER,WITH_RERANKER
0,"Broadcom, Arista Networks initiated: Wall Stre...",What are some recent analyst calls regarding B...,0,0
1,Emerson Electric (EMR) has received quite a bi...,What recent developments have drawn attention ...,0,0
2,Ventas (VTR) reported earnings 30 days ago. Wh...,What recent developments can you share about V...,0,0
3,Nvidia's earnings call this week will be a mar...,What are the key points to consider regarding ...,0,0
4,Inflation-scarred American consumers are putti...,What are American consumers doing in response ...,0,0
5,Delivery service DoorDash (DASH) reported its ...,What were the key financial results and invest...,2,0
6,Strong capital efficiency and commercial growt...,What are the factors influencing investor sent...,0,0
7,Argus recently lowered the price target on Equ...,What recent action did Argus take regarding th...,0,0
8,The Zacks Internet software industry participa...,What are some companies in the Zacks Internet ...,0,0
9,This marks the ninth time in 37 years this uns...,How many times has this unstoppable company co...,0,0


## Comparison and analysis

👉 **Instructions**:
- Analyze the impact of reranking using your results.
- Write short answers to the following:
  1. Did reranking improve the **average position** of the correct news?
  2. How often was the correct article at **position 0** with and without reranking?
- You may use:
  - Value counts (`.value_counts()`)
  - Descriptive statistics (`.mean()`, `.median()`)
  - Simple plots (e.g. bar charts or histograms)

✅ This is your opportunity to reflect on the performance of the reranker and think critically about retrieval quality.

> ✨ Optional: You can create a summary table comparing the overall accuracy and coverage between the two methods.
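
For the optional summary table, one possible set of metrics is the share of questions with the correct article at position 0, the share found anywhere in the top k (coverage), and the mean rank among found articles. The sketch below uses made-up rank lists, not the notebook's results, and `summarize_ranks` is a hypothetical helper name.

```python
from collections import Counter


def summarize_ranks(ranks):
    """Summarize a list of ranks (ints, or the string 'not found')."""
    found = [r for r in ranks if r != "not found"]
    counts = Counter(ranks)
    return {
        "top1_rate": counts[0] / len(ranks),
        "coverage_at_k": len(found) / len(ranks),
        "mean_rank_if_found": sum(found) / len(found) if found else float("nan"),
    }


# Made-up ranks for five questions, for illustration only
toy_no_reranker = [0, 0, 2, "not found", 1]
toy_with_reranker = [0, 0, 0, 4, 1]

summary_before = summarize_ranks(toy_no_reranker)
summary_after = summarize_ranks(toy_with_reranker)

print(summary_before)
print(summary_after)
```

In this toy case the top-1 rate and coverage both improve, yet the mean rank among found articles gets worse, because the newly found article enters the average at position 4. This is a reason to read the average position together with coverage. On the real data, the helper could be called as `summarize_ranks(df_news_questions['NO_RERANKER'].tolist())`.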

In [30]:
# CODE HERE
# Use as many coding cells as you need

df_eval = df_news_questions.copy()
df_eval['NO_RERANKER_NUM'] = pd.to_numeric(df_eval['NO_RERANKER'], errors='coerce')
df_eval['WITH_RERANKER_NUM'] = pd.to_numeric(df_eval['WITH_RERANKER'], errors='coerce')

In [31]:
avg_no = df_eval['NO_RERANKER_NUM'].mean()
avg_with = df_eval['WITH_RERANKER_NUM'].mean()

print(f"Average position WITHOUT reranker: {avg_no:.2f}")
print(f"Average position WITH reranker:    {avg_with:.2f}")

Average position WITHOUT reranker: 0.11
Average position WITH reranker:    0.11


In [32]:
count_no_0 = (df_eval['NO_RERANKER'] == 0).sum()
count_with_0 = (df_eval['WITH_RERANKER'] == 0).sum()

print(f"\nArticles at position 0 WITHOUT reranker: {count_no_0}")
print(f"Articles at position 0 WITH reranker:    {count_with_0}")


Articles at position 0 WITHOUT reranker: 89
Articles at position 0 WITH reranker:    92


### **Question 1.** Did reranking improve the **average position** of the correct news?



Not on this metric. Excluding the 'not found' cases, the average position is 0.11 both with and without the reranker. That average only covers articles that made it into the top 5, so it can stay flat even when retrieval improves: an article that moves from 'not found' to a low position in the top 5 enters the average and pulls it up. The position-0 count is the clearer signal here, rising from 89 to 92 out of 100, which suggests a small improvement in retrieval quality that the average rank alone does not show.

### **Question 2.** How often was the correct article at **position 0** with and without reranking?



Without reranking, the correct article appeared at position 0 in 89 out of 100 cases. With reranking, this increased to 92 out of 100. This shows that the reranker helped surface the relevant document to the top more frequently, especially in cases where the FAISS index alone didn't place it first.