# Rerankers

Rerankers have been a common component of retrieval pipelines for many years. They allow us to add a final "reranking" step to our retrieval pipelines — like with **R**etrieval **A**ugmented **G**eneration (RAG) — that can be used to dramatically optimize our retrieval pipelines and improve their accuracy.

In the example notebook we'll learn how to create retrieval pipelines with reranking using the [Cohere reranking model](https://txt.cohere.com/rerank/) (which is available for free).

To begin, we setup our prerequisite libraries.

In [1]:
!pip install -qU \
    pinecone-client==3.1.0 \
    cohere==4.27

## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) contains scraped data from many popular ArXiv papers centred around LLMs. Including papers from Llama 2, GPTQ, and the GPT-4 technical paper.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|██████████| 41584/41584 [00:01<00:00, 29843.31 examples/s]


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

We have 41.5K chunks, where each chunk is roughly the length of 1-2 paragraphs in length. Here is an example of a single record:

In [3]:
data[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

Format the data into the format we need, this will contain `id`, `text` (which we will embed), and `metadata`. For this use-case we don't need metadata but it can be useful to include so that if needed in the future we can make use of metadata filtering.

In [5]:
data = data.map(lambda x: {
    "id": f'{x["id"]}-{x["chunk-id"]}',
    "text": x["chunk"],
    "metadata": {
        "title": x["title"],
        "url": x["source"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
        "text": x["chunk"],
    }
})
# drop uneeded columns
data = data.remove_columns([
    "title", "summary", "source",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references",
    "doi", "chunk-id",
    "chunk"
])
dataset = data

Map:   0%|          | 0/41584 [00:00<?, ? examples/s]


KeyError: 'chunk-id'

In [6]:
dataset = data

We need to define an embedding model to create our embedding vectors for retrieval, for that we will be using OpenAI's text-embedding-ada-002. There is some cost associated with this model, so be aware of that (costs for running this notebook are <$1).

In [6]:
import os
import time
import numpy as np
from dotenv import load_dotenv
from datasets import load_dataset
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document

# Load environment variables from .env file (optional)
load_dotenv()



# Initialize HuggingFace embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

  from .autonotebook import tqdm as notebook_tqdm
  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


W0612 12:03:30.086000 30048 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


In [None]:


# Function to create LangChain documents from dataset
def prepare_documents(data):
    docs = []
    for record in dataset:
        metadata = record.get('metadata', {})
        doc = Document(
            page_content=record['text'],
            metadata={"id": record['id'], **metadata}
        )
        docs.append(doc)
    return docs

# Convert dataset to documents
documents = prepare_documents(dataset)

# Create vector store (FAISS) from documents
vectorstore = FAISS.from_documents(documents, hf_embeddings)

# Save FAISS index locally
vectorstore.save_local("faiss_index")


  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


W0611 16:29:14.791000 27788 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


In [32]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_groq import ChatGroq 
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import numpy as np
from langchain_community.vectorstores import FAISS 
import faiss 

hf_embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

try:
    # Load the vectorstore from the saved folder
    vectorstore = FAISS.load_local("faiss_index", hf_embeddings, allow_dangerous_deserialization=True)
    print("FAISS index loaded successfully from 'faiss_index'.")

    index = vectorstore.index
    docstore = vectorstore.docstore
    index_to_docstore_id = vectorstore.index_to_docstore_id

    docs = [doc.page_content for doc_id, doc in vectorstore.docstore._dict.items()]
    print(f"Extracted {len(docs)} documents from the loaded vectorstore.")

except Exception as e:
    exit()

def retrieve(query, k):
    query_embedding = hf_embeddings.embed_query(query)
    query_embedding = np.array(query_embedding).astype('float32').reshape(1, -1)

    faiss.normalize_L2(query_embedding)

    _, idx = index.search(query_embedding, k)

    results = []
    for i in idx[0]:
        doc_id = index_to_docstore_id[i]
        doc = docstore._dict[doc_id]
        results.append(doc.page_content.strip())

    return results



FAISS index loaded successfully from 'faiss_index'.
Extracted 41584 documents from the loaded vectorstore.


In [111]:
query = "can you explain why we would want to do rlhf?"
docs  = retrieve(query, k=25)

print("Retrieved documents:")
for i, doc in enumerate(docs, 1):
    print(f"\n--- Document {i} ---\n{doc}")

Retrieved documents:

--- Document 1 ---
preferences and values which are diﬃcult to capture by hard- coded reward functions.
RLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,
ranking two model generations for the same prompt. This data is then collected to learn a reward model
that predicts a scalar reward given any generated text. The r eward captures human preferences when
judging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient
algorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM
pre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not
be good enough. In such cases, RLHF is typically applied afte r an initial supervised ﬁne-tuning phase using
a small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;
Ouyang et al. ,2022;Stiennon et 

Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
import cohere

os.environ["COHERE_API_KEY"] = os.getenv("cohere_api_key") 
# init client
co = cohere.Client(os.environ["COHERE_API_KEY"])

rerank_docs = co.rerank(
    query=query, documents=docs , top_n=25, model="rerank-v3.5"
)

type(rerank_docs[0])

Creating an index, we set `dimension` equal to to dimensionality of Ada-002 (`1536`), and use a `metric` also compatible with Ada-002 (this can be either `cosine` or `dotproduct`). We also pass our `spec` to index initialization.

cohere.responses.rerank.RerankResult

We access the text content of the docs like so:

In [114]:
rerank_docs[0].document["text"]
for i, rerank_docs in enumerate(rerank_docs, 1):
    print(f"\n--- Document {i} ---\n{rerank_docs}")


--- Document 1 ---
RerankResult<document['text']: model to estimate the eventual performance of a larger RL policy. The slopes of these lines also
explain how RLHF training can produce such large effective gains in model size, and for example it
explains why the RLHF and context-distilled lines in Figure 1 are roughly parallel.
• One can ask a subtle, perhaps ill-deﬁned question about RLHF training – is it teaching the model
new skills or simply focusing the model on generating a sub-distribution of existing behaviors . We
might attempt to make this distinction sharp by associating the latter class of behaviors with the
region where RL reward remains linear inp
KL.
• To make some bolder guesses – perhaps the linear relation actually provides an upper bound on RL
reward, as a function of the KL. One might also attempt to extend the relation further by replacingp
KLwith a geodesic length in the Fisher geometry.
By making RL learning more predictable and by identifying new quantitative c

The reordered results look like so:

In [109]:
# Get the reranked texts
[res.document for res in rerank_docs.results]


[{'text': 'preferences and values which are diﬃcult to capture by hard- coded reward functions.\nRLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,\nranking two model generations for the same prompt. This data is then collected to learn a reward model\nthat predicts a scalar reward given any generated text. The r eward captures human preferences when\njudging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient\nalgorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM\npre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not\nbe good enough. In such cases, RLHF is typically applied afte r an initial supervised ﬁne-tuning phase using\na small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;\nOuyang et al. ,2022;Stiennon et al. ,2020).\nA succes

Let's write a function to allow us to more easily compare the original results vs. reranked results.

In [None]:
def compare(query: str, k: int = 25, top_n: int = 3):
    # Step 1: Retrieve from FAISS
    original_docs = retrieve(query, k=k)

    # Step 2: Rerank using Cohere
    rerank_results = co.rerank(
        query=query,
        documents=original_docs,
        top_n=top_n,
        model="rerank-v3.5"
    ).results

    # Step 3: Extract reranked text
    reranked_texts = [res.document["text"] for res in rerank_results]

    print(f"\n Query: {query}\n")
    
    print(f"--- Top {top_n} FAISS Results (Before Rerank) ---")
    for i, text in enumerate(original_docs[:top_n]):
        clean_text = text[:100].replace("\n", " ")
        print(f"{i+1:02d}: {clean_text}...")

    print(f"\n--- Top {top_n} Reranked Results (After Rerank) ---")
    for i, text in enumerate(reranked_texts):
        clean_text = text[:100].replace("\n", " ")
        print(f"{i+1:02d}: {clean_text}...")

    print(f"\n Rank Changes (FAISS → Rerank):")
    rank_map = {}
    for i, text in enumerate(reranked_texts):
        try:
            original_pos = original_docs.index(text)
            rank_map[i+1] = original_pos + 1
            print(f"Reranked #{i+1:02d} was FAISS #{original_pos+1:02d}")
        except ValueError:
            print(f"Reranked #{i+1:02d} not found in FAISS top-{k} results.")
            rank_map[i+1] = None

    return rank_map


---

In [90]:
compare(query, 25, 3)


🔎 Query: can you explain why we would want to do rlhf?

--- Top 3 FAISS Results (Before Rerank) ---
01: preferences and values which are diﬃcult to capture by hard- coded reward functions. RLHF works by u...
02: the output is generated. These models often use a ﬁxed input and output vocabulary, which prevents t...
03: Tom Brown and Jared Kaplan, and much of Anthropic’s technical staff contributed to the development o...

--- Top 3 Reranked Results (After Rerank) ---
01: model to estimate the eventual performance of a larger RL policy. The slopes of these lines also exp...
02: preferences and values which are diﬃcult to capture by hard- coded reward functions. RLHF works by u...
03: by being evasive [4]. Our second contribution is to release our dataset of 38,961 red team attacks f...

🔁 Rank Changes (FAISS → Rerank):
Reranked #01 was FAISS #10
Reranked #02 was FAISS #01
Reranked #03 was FAISS #19


{1: 10, 2: 1, 3: 19}

In [None]:
from dotenv import load_dotenv

load_dotenv()
groq_api_key = os.getenv("groq_api")

def generate_answer(question: str, contexts: list[str]) -> str:
    """
    Generates an answer to the user's question using the provided contexts
    and a Groq-hosted LLM via LangChain Expression Language (LCEL).

    Args:
        question (str): The user's question.
        contexts (list[str]): A list of relevant document contexts.

    Returns:
        str: The generated answer from the LLM.
    """
    # 1. Initialize the Groq Chat model
    llm = ChatGroq(
        model_name='gemma2-9b-it',
        temperature=0, # Keep temperature at 0 for more factual/less creative answers
        groq_api_key=groq_api_key
    )

    # 2. Define the RAG prompt template
    prompt_template = ChatPromptTemplate.from_messages([
        ("system", 
         "Answer the user question **only** with facts found in the context. "
         "If the answer is not in the context, state that you cannot answer from the provided information.\n\n"
         "Context:\n{context}"), # 'context' is the variable where retrieved docs will be injected
        ("user", "{question}")
    ])

    # This creates a chain that takes 'context' and 'question' as input,
    # formats them into the prompt, and sends to the LLM.
    # Note: `create_stuff_documents_chain` is more for LangChain's Document objects.
    # We are directly formatting the context string in the LCEL chain below.
    # For a simple RAG chain:
    generation_chain = (
        {
            "context": lambda x: "\n".join(f"- {c}" for c in x["contexts"]), # Format contexts from list of strings
            "question": RunnablePassthrough() # Pass the question through
        }
        | prompt_template
        | llm
        | StrOutputParser()
    )

    try:
        # Invoke the chain with the question and contexts
        result = generation_chain.invoke({"question": question, "contexts": contexts})
        return result.strip()
    except Exception as e:
        print(f"Error generating answer with Groq/LangChain: {e}")
        return "Error: Could not generate answer."


In [108]:
reranked_context_texts = []
for hit in rerank_docs.results:
    # Use hit.index to get the actual document content from the original list
    original_doc_text = docs[hit.index]
    reranked_context_texts.append(original_doc_text)

# Select only the top 2 reranked documents for the LLM's context
# This is where your 'top_context' (named 'final_contexts_for_llm' in the Canvas) is created
final_contexts_for_llm = reranked_context_texts[:25]

# Now, you can pass these top 2 contexts to your generation function:
answer = generate_answer(query, final_contexts_for_llm)
print(answer)

Error generating answer with Groq/LangChain: Connection error.
Error: Could not generate answer.


In [107]:
final_contexts_for_llm = doc[:25]

# Now, you can pass these top 2 contexts to your generation function:
answer = generate_answer(query, final_contexts_for_llm)
print(answer)

This document does not contain the answer to why we would want to do rlhf.
