[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/better-rag/00-rerankers.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/better-rag/00-rerankers.ipynb)

# Rerankers

Rerankers have been a common component of retrieval pipelines for many years. They allow us to add a final "reranking" step to our retrieval pipelines — like with **R**etrieval **A**ugmented **G**eneration (RAG) — that can be used to dramatically optimize our retrieval pipelines and improve their accuracy.

In the example notebook we'll learn how to create retrieval pipelines with reranking using the [Cohere reranking model](https://txt.cohere.com/rerank/) (which is available for free).

To begin, we setup our prerequisite libraries.

In [2]:
!pip install -qU \
    datasets==2.14.5 \
    openai==1.6.1 \
    pinecone-client==3.1.0 \
    cohere==4.27

[31mERROR: You must give at least one requirement to install (see "pip help install")[0m[31m
[0m

In [None]:
# !pip install openai==1.6.1 pinecone-client==3.1.0 cohere==4.27

## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) contains scraped data from many popular ArXiv papers centred around LLMs. Including papers from Llama 2, GPTQ, and the GPT-4 technical paper.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

We have 41.5K chunks, where each chunk is roughly the length of 1-2 paragraphs in length. Here is an example of a single record:

In [3]:
data[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

Format the data into the format we need, this will contain `id`, `text` (which we will embed), and `metadata`. For this use-case we don't need metadata but it can be useful to include so that if needed in the future we can make use of metadata filtering.

In [4]:
data = data.map(lambda x: {
    "id": f'{x["id"]}-{x["chunk-id"]}',
    "text": x["chunk"],
    "metadata": {
        "title": x["title"],
        "url": x["source"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
        "text": x["chunk"],
    }
})
# drop uneeded columns
data = data.remove_columns([
    "title", "summary", "source",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references",
    "doi", "chunk-id",
    "chunk"
])
data

Dataset({
    features: ['id', 'text', 'metadata'],
    num_rows: 41584
})

We need to define an embedding model to create our embedding vectors for retrieval, for that we will be using OpenAI's text-embedding-ada-002. There is some cost associated with this model, so be aware of that (costs for running this notebook are <$1).

In [4]:
# !pip install PyPDF2

Defaulting to user installation because normal site-packages is not writeable


In [1]:
from collections import namedtuple
from PyPDF2 import PdfReader

Page = namedtuple("Page", ["id", "page_content", "metadata"])

def pdf_reader(file_path):
    reader = PdfReader(file_path)
    pdf_pages = []
    for page_number, page in enumerate(reader.pages):
        page_content = page.extract_text().strip()
        if page_content:
            metadata = {"page_number": page_number}  # Add any additional metadata as needed
            pdf_pages.append(Page(id=page_number, page_content=page_content, metadata=metadata))
    return pdf_pages

file_path = '../data/RaptorContract.pdf'
pdf_pages = pdf_reader(file_path)

In [29]:
import os
import openai
import getpass  # platform.openai.com

# get API key from top-right dropdown on OpenAI website
openai.api_key = os.getenv("OPENAI_API_KEY") 
# or getpass.getpass("Enter your OpenAI API key: ")

embed_model = "text-embedding-ada-002"

Now we create our vector DB to store our vectors. For this we need to get a [free Pinecone API key](https://app.pinecone.io) — the API key can be found in the "API Keys" button found in the left navbar of the Pinecone dashboard.

In [3]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or getpass.getpass()

# configure client
pc = Pinecone(api_key=api_key)

  from tqdm.autonotebook import tqdm


Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [12]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Creating an index, we set `dimension` equal to to dimensionality of Ada-002 (`1536`), and use a `metric` also compatible with Ada-002 (this can be either `cosine` or `dotproduct`). We also pass our `spec` to index initialization.

In [13]:
import time

index_name = "rerankers"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Define embedding function with OpenAI:

In [36]:
def embed(batch: list) -> list[float]:
    # create embeddings (exponential backoff to avoid RateLimitError)
    for j in range(5):  # max 5 retries
        try:
            res = openai.embeddings.create(
                input=[item["text"] for item in batch],
                model=embed_model
            )
            passed = True
        except openai.RateLimitError:
            time.sleep(2 ** j)  # wait 2^j seconds before retrying
            print("Retrying...")
    if not passed:
        raise RuntimeError("Failed to create embeddings.")
    # get embeddings
    embeds = [record.embedding for record in res.data]
    return embeds

We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with OpenAI's `text-embedding-ada-002` built embeddings like so:

**⚠️ WARNING: Embedding costs for the full dataset as of 3 Jan 2024 is ~$5.70**

In [30]:
pdf_pages[0]

Page(id=0, page_content='[R&G Draft 12.__.2021] \n112923184_5  \n \nSTOCK PURCHASE AGREEMENT \nBY AND AMONG \n[BUYER], \n[TARGET COMPANY], \nTHE SELLERS LISTED ON SCHEDULE I HERETO \nAND  \nTHE SELLERS ’ REPRESENTATIVE NAMED HEREIN \nDated as of [●]  \n \n[This document is intended solely to facilitate discussions among the parties identified herein.  \nNeither this document nor such discussions are intended to create, nor will either or both be \ndeemed to create, a legally binding or enforceable offer or agreement of any type or nature, \nunless and until a definitive written agreement is executed and delivered by each of th e parties \nhereto. \n \nThis document shall be kept confidential pursuant to the terms of the Confidentiality \nAgreement entered into by the parties and, if applicable, its affiliates with respect to the subject \nmatter hereof.]', metadata={'page_number': 0})

In [43]:
from langchain.embeddings.openai import OpenAIEmbeddings
embed_mod = OpenAIEmbeddings(model="text-embedding-ada-002")

In [46]:
from tqdm.auto import tqdm

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(pdf_pages), batch_size)):
    passed = False
    # find end of batch
    i_end = min(len(pdf_pages), i + batch_size)
    # create batch
    batch = pdf_pages[i:i_end]
    # embeds = embed([page.page_content for page in batch])
    embeds = embed_mod.embed_documents([page.page_content for page in batch])
    to_upsert = list(zip([str(page.id) for page in batch], embeds, [page.metadata for page in batch]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

100%|██████████| 1/1 [00:30<00:00, 30.18s/it]


In [47]:
# from tqdm.auto import tqdm

# batch_size = 100  # how many embeddings we create and insert at once

# for i in tqdm(range(0, len(data), batch_size)):
#     passed = False
#     # find end of batch
#     i_end = min(len(data), i+batch_size)
#     # create batch
#     batch = data[i:i_end]
#     embeds = embed(batch["text"])
#     to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
#     # upsert to Pinecone
#     index.upsert(vectors=to_upsert)

Now let's test retrieval _without_ Cohere's reranking model.

In [69]:
from langchain.vectorstores import Pinecone

text_field = "page_number"  # the metadata field that contains our text

# Initialize the vector store object
vectorstore = Pinecone(index, embed_mod.embed_query, text_field)

query = "Under what circumstances and to what extent the Sellers are responsible for a breach of representations and warranties?"

# Perform similarity search
results = vectorstore.similarity_search(query, k=3)

# Retrieve the content texts for the similar documents
# Assuming results is the list of Document objects
similar_values = [result.page_content for result in results]

# Print the similar values
for value in similar_values:
    print(value)



54.0
66.0
64.0


In [None]:
query = "Under what circumstances and to what extent the Sellers are responsible for a breach of representations and warranties?"

# Perform similarity search
results = vectorstore.similarity_search(query, k=3)

Found document with no `page_content` key. Skipping.
Found document with no `page_content` key. Skipping.
Found document with no `page_content` key. Skipping.


In [62]:
from langchain.vectorstores import Pinecone

text_field = "page_content"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = Pinecone(
    index, embed_mod.embed_query, text_field)



In [64]:
query = "Under what circumstances and to what extent the Sellers are responsible for a breach of representations and warranties?"

# Perform similarity search
results = vectorstore.similarity_search(query, k=3)

Found document with no `page_content` key. Skipping.
Found document with no `page_content` key. Skipping.
Found document with no `page_content` key. Skipping.


In [48]:
def get_docs(query: str, top_k: int) -> list[str]:
    # encode query
    xq = embed([query])[0]
    # search pinecone index
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # get doc text
    docs = {x["metadata"]['page_content']: i for i, x in enumerate(res["matches"])}
    return docs

In [68]:
results

[Document(page_content='54.0'),
 Document(page_content='66.0'),
 Document(page_content='64.0')]

In [49]:
query = "can you explain why we would want to do rlhf?"
docs = get_docs(query, top_k=25)
print("\n---\n".join(docs.keys()))

TypeError: string indices must be integers

Good, but can we get better?

## Reranking Responses

We can easily get the responses we need when we include _many_ responses, but this doesn't work well with LLMs. The recall performance for LLMs [decreases as we add more into the context window](https://www.pinecone.io/blog/why-use-retrieval-instead-of-larger-context/) — we call this excessive filling of the context window _"context stuffing"_.

Fortunately reranking offers us a solution that helps us find those records that may not be within the top-3 results, and pull them into a smaller set of results to be given to the LLM.

We will use Cohere's rerank endpoint for this, to use it you will need a [Cohere API key](https://dashboard.cohere.com/api-keys). Once you have your key you use it to create authenticate your Cohere client like so:

In [14]:
import cohere

os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass.getpass()
# init client
co = cohere.Client(os.environ["COHERE_API_KEY"])

··········


Now we can rerank our results with `co.rerank`. Let's try it with our earlier results.

In [15]:
rerank_docs = co.rerank(
    query=query, documents=docs.keys(), top_n=25, model="rerank-english-v2.0"
)

This returns a list of `RerankResult` objects:

In [16]:
type(rerank_docs[0])

cohere.responses.rerank.RerankResult

We access the text content of the docs like so:

In [17]:
rerank_docs[0].document["text"]

'significant for 3 out of 7 datasets.\n28\nK New Tasks\nConstrained Generation We introduce “CommonGen-Hard," a more challenging extension of the\nCommonGen dataset (Lin et al., 2020), designed to test state-of-the-art language models’ advanced\ncommonsense reasoning, contextual understanding, and creative problem-solving. CommonGenHard requires models to generate coherent sentences incorporating 20-30 concepts, rather than only\nthe 3-5 related concepts given in CommonGen. SELF-REFINE focuses on iterative creation with\nintrospective feedback, making it suitable for evaluating the effectiveness of language models on the\nCommonGen-Hard task.\nAcronym Generation Acronym generation requires an iterative refinement process to create\nconcise and memorable representations of complex terms or phrases, involving tradeoffs between\nlength, ease of pronunciation, and relevance, and thus serves as a natural testbed for our approach.\nWe source a dataset of 250 acronyms4and manually prune it to

The reordered results look like so:

In [18]:
[docs[doc.document["text"]] for doc in rerank_docs]

[20,
 22,
 6,
 15,
 13,
 4,
 19,
 1,
 12,
 23,
 24,
 17,
 16,
 18,
 21,
 7,
 0,
 11,
 8,
 14,
 9,
 5,
 2,
 3,
 10]

Let's write a function to allow us to more easily compare the original results vs. reranked results.

In [19]:
def compare(query: str, top_k: int, top_n: int):
    # first get vec search results
    docs = get_docs(query, top_k=top_k)
    i2doc = {docs[doc]: doc for doc in docs.keys()}
    # rerank
    rerank_docs = co.rerank(
        query=query, documents=docs.keys(), top_n=top_n, model="rerank-english-v2.0"
    )
    original_docs = []
    reranked_docs = []
    # compare order change
    for i, doc in enumerate(rerank_docs):
        rerank_i = docs[doc.document["text"]]
        print(str(i)+"\t->\t"+str(rerank_i))
        if i != rerank_i:
            reranked_docs.append(f"[{rerank_i}]\n"+doc.document["text"])
            original_docs.append(f"[{i}]\n"+i2doc[i])
    for orig, rerank in zip(original_docs, reranked_docs):
        print("ORIGINAL:\n"+orig+"\n\nRERANKED:\n"+rerank+"\n\n---\n")

Beginning with our `"can you explain why we would want to do rlhf?"` query, let's take a look at the top-3 results with / without reranking:

In [20]:
compare(query, 25, 3)

0	->	20
1	->	22
2	->	6
ORIGINAL:
[0]
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi,
and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP
2020 , pp. 1823–1840, Online, November 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.ﬁndings-emnlp.165. URL https://aclanthology.org/2020.
findings-emnlp.165 .
Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith,
and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and antiexperts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers) , pp. 6691–6706, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclantholo

Both results from reranking provide many more reasons as to why we would want to use RLHF than the original records. Let's try another query:

In [21]:
compare("what is red teaming?", top_k=25, top_n=3)

0	->	11
1	->	22
2	->	17
ORIGINAL:
[0]
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi,
and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP
2020 , pp. 1823–1840, Online, November 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.ﬁndings-emnlp.165. URL https://aclanthology.org/2020.
findings-emnlp.165 .
Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith,
and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and antiexperts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers) , pp. 6691–6706, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthol

Again, the results provide more relevant responses when using reranking rather than the original search.

Don't forget to delete your index when you're done to save resources!

In [22]:
pc.delete_index(index_name)

---