# Re-Ranking

In Retrieval-Augmented Generation (RAG), reranking refines the initial set of retrieved documents by re-scoring each one against the input query with a slower but more accurate model, such as a cross-encoder. A cross-encoder reads the query and a document together and outputs a single relevance score, which lets it capture how well the document actually answers the query rather than relying on the similarity of two separately computed embeddings. The top of the reranked list is then passed to the generation model, so that the final output is based on the most relevant passages.

![Cross Encoder Image](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/CrossEncoder.png)


Read more in the [Sentence Transformers Retrieve & Re-Rank guide](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
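
The two-stage idea described above can be sketched without any model at all. The snippet below is a minimal illustration, not the notebook's implementation: `toy_overlap_score` is a hypothetical stand-in for the cross-encoder (it only counts shared tokens), and the candidate sentences are made up for the example.

```python
from typing import Callable, List, Tuple


def toy_overlap_score(query: str, document: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k candidates, best first."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up candidates, standing in for the output of a first-stage retriever.
candidates = [
    "Llamas are domesticated South American camelids.",
    "The model is pretrained with a context size of 32k tokens.",
    "Sparse mixture-of-experts layers route each token to a few experts.",
]

ranked = rerank("context size of the model", candidates, toy_overlap_score, top_k=2)
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```

In the rest of this notebook the scoring function is a real cross-encoder; the surrounding logic (score every pair, sort, keep the best few) stays the same.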

Here are the steps:
* [Loading the reranking model](#loading-the-reranking-model)
* [Loading retrieval results](#loading-retrieval-results)
* [Calculating the reranking scores](#calculating-the-re-ranking-scores)
* [Selecting the top 3 reranked documents](#selecting-top-3-reranked-documents)
* [Generating a reply from the reranked documents](#using-merged-results-to-generate-a-reply)

In [1]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

## Loading the Reranking model

In [2]:
from sentence_transformers import CrossEncoder 
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 384, padding_idx=0)
      (position_embeddings): Embedding(512, 384)
      (token_type_embeddings): Embedding(2, 384)
      (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-5): 6 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=384, out_features=384, bias=True)
              (key): Linear(in_features=384, out_features=384, bias=True)
              (value): Linear(in_features=384, out_features=384, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=384, out_features=384, bias=True)
              (LayerNorm): LayerNorm((384,), eps=1e-1

## Loading retrieval results

We will load the retrieval results from the previous Hybrid-Search notebook to avoid repeating the retrieval step. We can ignore the scores from the dense and sparse indexes, because the reranker calculates a new score from the query and the text of each document/chunk.

In [5]:
import json
hybrid_search_results = {}
with open('../data/dense_results.json') as f:
    dense_results = json.load(f)
    for doc in dense_results:
        hybrid_search_results[doc['id']] = doc
with open('../data/sparse_results.json') as f:
    sparse_results = json.load(f)
    for doc in sparse_results:
        hybrid_search_results[doc['id']] = doc
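
Because the merged results are keyed by document ID, a document returned by both the dense and the sparse search appears only once. The sketch below shows that behavior on in-memory sample data instead of the JSON files; `merge_by_id`, the sample texts, and the `score` field are hypothetical and only serve the illustration.

```python
def merge_by_id(*result_lists):
    """Merge result lists into one dict keyed by document ID.

    If the same ID occurs more than once, the later entry overwrites the earlier one.
    """
    merged = {}
    for results in result_lists:
        for doc in results:
            merged[doc["id"]] = doc
    return merged


# Hypothetical sample results; document 4 is returned by both searches.
dense_sample = [
    {"id": 15, "text": "chunk A", "score": 0.82},
    {"id": 4, "text": "chunk B", "score": 0.79},
]
sparse_sample = [
    {"id": 4, "text": "chunk B", "score": 11.3},
    {"id": 2, "text": "chunk C", "score": 9.8},
]

merged_sample = merge_by_id(dense_sample, sparse_sample)
print(list(merged_sample))         # IDs in order of first appearance
print(merged_sample[4]["score"])   # the sparse entry overwrote the dense one
```

Overwriting is harmless here because the retrieval scores are discarded before reranking; only the `text` of each document is used.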

In [4]:
hybrid_search_results

{15: {'id': 15,
  'text': "3.8 â Mixtral_8x7B 3.5 32 > $3.0 i] 228 fos a 2.0 0 5k 10k 15k 20k 25k 30k Context length Passkey Performance ry 3.8 â Mixtral_8x7B 3.5 0.8 32 > 0.6 $3.0 i] 228 04 fos 0.2 a 2.0 0.0 OK 4K 8K 12K 16K 20K 24K 28K 0 5k 10k 15k 20k 25k 30k Seq Len Context length Figure 4: Long range performance of Mixtral. (Left) Mixtral has 100% retrieval accuracy of the Passkey task regardless of the location of the passkey and length of the input sequence. (Right) The perplexity of Mixtral on the proof-pile dataset decreases monotonically as the context length increases.\n\nThe chunk discusses the long-range performance of the Mixtral model, demonstrating its ability to retrieve a passkey regardless of its location in a long input sequence, and showing that the model's perplexity on the proof-pile dataset decreases as the context length increases.",
  'metadata': {'title': 'Mixtral of Experts',
   'arxiv_id': '2401.04088',
   'references': ['1905.07830']}},
 4: {'id': 4,
  'te

In [6]:
# This is the query that we used for the retrieval of the above documents
query = "What is context size of Mixtral?"

## Calculating the re-ranking scores

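We use the `cross_encoder` to score each (query, document) pair. A higher score indicates a better match. As the output below shows, the scores are not limited to the range [0, 1], so they are best used to rank the documents against each other.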

In [7]:
pairs = [[query, doc['text']] for doc in hybrid_search_results.values()] 
scores = cross_encoder.predict(pairs) 

print(scores)

[ 5.065694   3.368832   7.1048393 -4.1161065 -4.375498  -5.261078
 -3.7225747  3.1854544  1.7966686 -2.5144243  2.5638714  1.6508546
  2.3361564 -3.0395935  3.08699   -2.4781275]
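
The scores printed above include negative values and values above 1, which suggests that they are unnormalized logits. If a value between 0 and 1 is easier to read or to threshold, a sigmoid can be applied. The sketch below is an optional post-processing step that is not part of the original notebook; it uses the first four scores from the output above as sample input. Because the sigmoid is monotonic, the ranking does not change.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# First four scores from the cross-encoder output above.
sample_logits = [5.065694, 3.368832, 7.1048393, -4.1161065]
probabilities = [sigmoid(x) for x in sample_logits]

for logit, prob in zip(sample_logits, probabilities):
    print(f"logit={logit:+.4f}  sigmoid={prob:.4f}")

# The order of the documents is the same before and after the transformation.
order_by_logit = sorted(range(len(sample_logits)), key=lambda i: sample_logits[i], reverse=True)
order_by_prob = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
print(order_by_logit == order_by_prob)
```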


## Selecting top 3 reranked documents

In [8]:
# Combine scores with corresponding document IDs
results_with_scores = [
    (doc_id, hybrid_search_results[doc_id]['text'], score)
    for doc_id, score in zip(hybrid_search_results.keys(), scores)
]

# Sort results by score in descending order and take the top 3
top_results = sorted(results_with_scores, key=lambda x: x[2], reverse=True)[:3]
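
The tuples above keep only the ID, text, and score. If the metadata (for example the paper title or arXiv ID) is needed later, such as for citing sources in the reply, one option is to attach the score to a copy of each document instead. The following is a sketch of that variant, not the approach used in this notebook; `attach_scores`, the `rerank_score` key, and the shortened sample texts are hypothetical.

```python
def attach_scores(docs_by_id, scores, top_k=3):
    """Return the top_k documents, best first, each copied with an added 'rerank_score' key.

    Assumes `scores` is in the same order as `docs_by_id.values()`.
    """
    ranked = [
        {**doc, "rerank_score": float(score)}  # shallow copy; the input dict is not modified
        for doc, score in zip(docs_by_id.values(), scores)
    ]
    ranked.sort(key=lambda doc: doc["rerank_score"], reverse=True)
    return ranked[:top_k]


# Shortened sample documents; the scores are the first three values printed earlier.
sample_docs = {
    15: {"id": 15, "text": "chunk A", "metadata": {"title": "Mixtral of Experts"}},
    4: {"id": 4, "text": "chunk B", "metadata": {"title": "Mixtral of Experts"}},
    2: {"id": 2, "text": "chunk C", "metadata": {"title": "Mixtral of Experts"}},
}
sample_scores = [5.065694, 3.368832, 7.1048393]

top_docs = attach_scores(sample_docs, sample_scores, top_k=2)
for doc in top_docs:
    print(doc["id"], doc["rerank_score"], doc["metadata"]["title"])
```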


In [9]:
# Print the top 3 results with their IDs and scores
for doc_id, text, score in top_results:
    print("ID:", doc_id)
    print("Score:", score)
    print("Document:", text)
    print("--------------------------------")


ID: 2
Score: 7.1048393
Document: expertsâ ) to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token. Mixtral is pretrained with multilingual data using a context size of 32k tokens. It either matches or exceeds the performance of Llama 2 70B and GPT-3.5, over several benchmarks. In particular, Mixture of Experts Layer i gating inputs af outputs router expert

This chunk describes the key architectural details of the Mixtral model, a sparse mixture-of-experts language model that outperforms larger models like Llama 2 70B and GPT-3.5 on various benchmarks.
--------------------------------
ID: 15
Score: 5.065694
Document: 3.8 â Mixtral_8x7B 3.5 32 > $3.0 i] 228 fos a 2.0 0 5k 10k 15k 20k 25k 30k Context length Passkey Performance ry 3.8 â Mixtral_8x7B 3.5 0.8 32 > 0.6 $3.0 i] 228 04 fos 0.2 a 2.0 0.0 OK 4K 8K 12K 16K 

## Using merged results to generate a reply

We can now take the top reranked documents and call the LLM to generate the reply to the user's query.

In [10]:
# define a variable to hold the search results for the generation model
search_results = [doc[1] for doc in top_results]

In [11]:
from dotenv import load_dotenv

load_dotenv()

True

In [12]:
# Now time to connect to the large language model
from openai import OpenAI

client = OpenAI()
model_name = "gpt-4.1-mini-2025-04-14"
completion = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a chatbot and a research expert. Your top priority is to help users understand research papers."},
        {"role": "user", "content": query},
        {"role": "assistant", "content": str(search_results)}
    ]
)

response_text = str(completion.choices[0].message.content)

In [13]:
print(f"Hybrid Search with Reranking Reply to {query}: {response_text}")

Hybrid Search with Reranking Reply to What is context size of Mixtral?: Mixtral uses a fully dense context length of 32,000 tokens (32k tokens).
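
The call above passes the retrieved passages as an `assistant` message. A common alternative is to place the passages in the `user` message together with the question, so that the model treats them as provided context rather than as its own earlier turn. The sketch below shows one such layout under that assumption; `build_messages` and the prompt wording are hypothetical, and the reply may differ from the one shown above.

```python
def build_messages(query, passages, system_prompt):
    """Build a chat message list with numbered context passages inside the user message."""
    context = "\n\n".join(
        f"[{number}] {passage}" for number, passage in enumerate(passages, start=1)
    )
    user_content = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


# Shortened sample passages standing in for `search_results`.
sample_passages = [
    "Mixtral is pretrained with multilingual data using a context size of 32k tokens.",
    "Figure 4: Long range performance of Mixtral.",
]
messages = build_messages(
    "What is context size of Mixtral?",
    sample_passages,
    "You are a research expert who helps users understand research papers.",
)
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create` in place of the three-message list used above.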
