# Introducing RAG

The fast adoption of text generation models led many users to ask the models questions and expect factual answers. While the models answered fluently and confidently, their answers were not always correct or up-to-date. This problem became known as model “hallucinations,” and one of the leading ways to reduce it is to build systems that retrieve relevant information and provide it to the LLM to help it generate more factual answers. This method, called retrieval-augmented generation (RAG), is one of the most popular applications of LLMs.

Three broad categories of language-model-based search systems are:

### Dense Retrieval

Dense retrieval systems rely on the concept of embeddings, the same concept we’ve encountered in the previous chapters, and turn the search problem into retrieving the nearest neighbors of the search query.

### ReRanking

Search systems are often pipelines of multiple steps. A reranking language model is one of these steps and is tasked with scoring the relevance of a subset of results against the query. The order of results is then changed based on these scores.

### RAG

The growing text generation capability of LLMs led to a new type of search system that includes a model that generates an answer in response to a query.

Generative search is a subset of a broader category of systems better called RAG systems. These are text generation systems that incorporate search capabilities to reduce hallucinations, increase factuality, and/or ground the generation model on a specific dataset.

A RAG system formulates an answer to a question and (preferably) cites its information sources.


### General path

We chunk a document before proceeding to embed each chunk. Those embedding vectors are then stored in the vector database and are ready for retrieval.

# Dense Retrieval Example

Let’s take a look at a dense retrieval example by using Cohere to search the Wikipedia page for the film Interstellar. In this example, we will do the following:

1. Get the text we want to make searchable and apply some light processing to chunk it into sentences.
2. Embed the sentences.
3. Build the search index.
4. Search and see the results.

In [23]:
from dotenv import load_dotenv
import cohere
import os
import numpy as np
import faiss
import pandas as pd
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm import tqdm


In [2]:
load_dotenv()

True

In [5]:
# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key=os.getenv("COHERE_API_KEY"))

### Break down the Interstellar Wikipedia text

In [6]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""

# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]

In [7]:
len(texts)

15

In [8]:
# Get the embeddings
response = co.embed(
  texts=texts,
  input_type="search_document",
).embeddings

In [11]:
embeds = np.array(response)

In [12]:
embeds.shape

(15, 4096)

## Building Search Index

Before we can search, we need to build a search index. An index stores the embeddings and is optimized to quickly retrieve the nearest neighbors even if we have a very large number of points.

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is useful for large-scale vector search such as nearest neighbor search, which is common in applications like recommendation systems, image retrieval, and semantic search.
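As a rough illustration of what an exact ("flat") L2 index computes, here is a minimal pure-Python sketch that compares a query against every stored vector and keeps the closest ones. The function name `flat_l2_search` and the 2-dimensional vectors are made up for this example; FAISS performs the same kind of exhaustive comparison far more efficiently, and its `IndexFlatL2` likewise reports squared Euclidean distances.

```python
def flat_l2_search(vectors, query, k=3):
    """Exact nearest-neighbor search by squared Euclidean distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Hypothetical toy "embeddings" (real ones have hundreds or thousands of dimensions)
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
distances, ids = flat_l2_search(vectors, query=[1.0, 0.75], k=2)
print(ids)        # [1, 3]
print(distances)  # [0.0625, 0.5625]
```

Because every stored vector is compared with the query, the cost grows linearly with the size of the archive, which is why approximate indexes become attractive at larger scales.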

In [14]:
dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.float32(embeds))

## Search the index

In [15]:
def search(query, number_of_results=3):
  
  # 1. Get the query's embedding
  query_embed = co.embed(texts=[query], 
                input_type="search_query",).embeddings[0]

  # 2. Retrieve the nearest neighbors
  distances , similar_item_ids = index.search(np.float32([query_embed]), number_of_results) 

  # 3. Format the results
  texts_np = np.array(texts) # Convert texts list to numpy for easier indexing
  results = pd.DataFrame(data={'texts': texts_np[similar_item_ids[0]], 
                              'distance': distances[0]})
  
  # 4. Print and return the results
  print(f"Query:'{query}'\nNearest neighbors:")
  return results

In [17]:
query = "how precise was the science"
results = search(query)
results

Query:'how precise was the science'
Nearest neighbors:


|   | texts | distance |
|---|-------|----------|
| 0 | It has also received praise from many astronom... | 10757.371094 |
| 1 | Caltech theoretical physicist and 2017 Nobel l... | 11566.136719 |
| 2 | Interstellar uses extensive practical and mini... | 11922.841797 |


In [18]:
for i,row in results.iterrows():
  print(f"{i+1}. '{row['texts']}'\n")

1. 'It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics'

2. 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar'

3. 'Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects'



We can verify that we would NOT have gotten the same results with classical keyword search. We’ll use the BM25 algorithm, which is one of the leading lexical search methods.

In [22]:
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

In [24]:
tokenized_corpus = []
for passage in tqdm(texts):
    tokenized_corpus.append(bm25_tokenizer(passage))

100%|██████████| 15/15 [00:00<00:00, 173318.35it/s]


In [25]:
bm25 = BM25Okapi(tokenized_corpus)

In [29]:
bm25.idf["science"]

1.6863989535702286

In [30]:
def keyword_search(query, top_k=3, num_candidates=15):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print(f"Top-{top_k} lexical search (BM25) hits")
    for hit in bm25_hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))


In [31]:
keyword_search(query = "how precise was the science")

Input question: how precise was the science
Top-3 lexical search (BM25) hits
	1.789	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	1.373	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.000	It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine


### Caveats of Dense Retrieval

In [32]:
query = "What is the mass of the moon?"
results = search(query)
results

Query:'What is the mass of the moon?'
Nearest neighbors:


|   | texts | distance |
|---|-------|----------|
| 0 | Cinematographer Hoyte van Hoytema shot it on 3... | 12854.445312 |
| 1 | The film had a worldwide gross over $677 milli... | 13301.009766 |
| 2 | It has also received praise from many astronom... | 13332.0 |


When the texts do not contain the answer, as with this query, dense retrieval still returns its nearest neighbors along with their distances. One possible heuristic for such cases is to set a threshold level, for example a maximum distance for relevance. Many search systems instead present the user with the best results they can find and leave it to the user to decide whether they are relevant.

Tracking whether users clicked on a result (and were satisfied by it) can improve future versions of the search system.
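
The threshold heuristic can be sketched as a simple filter over `(text, distance)` pairs like the results shown above. The cutoff value below is an assumption chosen only because it happens to separate the two example queries in this notebook; a usable threshold would have to be tuned on real queries for a given embedding model and index.

```python
def filter_by_distance(results, max_distance):
    """Keep only (text, distance) pairs at or below the threshold."""
    return [(text, dist) for text, dist in results if dist <= max_distance]


# Distances copied from the two searches above (texts abbreviated as printed)
science_results = [
    ("It has also received praise from many astronom...", 10757.371094),
    ("Caltech theoretical physicist and 2017 Nobel l...", 11566.136719),
    ("Interstellar uses extensive practical and mini...", 11922.841797),
]
moon_results = [
    ("Cinematographer Hoyte van Hoytema shot it on 3...", 12854.445312),
    ("The film had a worldwide gross over $677 milli...", 13301.009766),
    ("It has also received praise from many astronom...", 13332.0),
]

MAX_DISTANCE = 12500.0  # assumed cutoff, not a recommended value

print(len(filter_by_distance(science_results, MAX_DISTANCE)))  # 3
print(len(filter_by_distance(moon_results, MAX_DISTANCE)))     # 0
```

With this cutoff the off-topic moon query returns nothing, which lets the application say "no relevant results" instead of showing unrelated sentences.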

Another caveat of dense retrieval is when a user wants to find an exact match for a specific phrase. 

**That’s a case that’s perfect for keyword matching. That’s one reason why hybrid search, which includes both semantic search and keyword search, is advised instead of relying solely on dense retrieval.**
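
Hybrid search needs some way to merge the two result lists. One common option, which the text above does not prescribe and is shown here only as a sketch, is reciprocal rank fusion: each document earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The constant `k=60` is a conventional default, and the two rankings below are the positions in `texts` of the top three dense and BM25 results for "how precise was the science".

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = [12, 4, 7]    # indices into `texts` from the dense search
keyword_ranking = [0, 4, 1]   # indices into `texts` from the BM25 search

fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused[0])  # 4, the only sentence both methods returned
```

Because the method uses ranks rather than raw scores, it avoids having to put BM25 scores and vector distances on a common scale.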

- Dense retrieval systems also find it challenging to work properly in domains other than the ones they were trained on.

- What about questions whose answers span multiple sentences? What is the best way to chunk long texts?

## Chunking long texts


- Indexing one vector per document.
    - Chunking the document, embedding those chunks, and then aggregating the chunk embeddings into a single vector.
    - The usual method of aggregation here is to average those vectors.
    - A downside of this approach is that it results in a highly compressed vector that loses a lot of the information in the document.
- Indexing multiple vectors per document.
    - In this approach, we chunk the document into smaller pieces and embed those chunks.
    - Our search index then becomes that of chunk embeddings, not entire document embeddings.
    - The chunking approach is better because it has full coverage of the text.
    - Chunking methods:
        - Each sentence is a chunk. The issue here is this could be too granular and the vectors don’t capture enough of the context.
        - Each paragraph is a chunk. This is great if the text is made up of short paragraphs. Otherwise, it may be that every 3–8 sentences is a chunk.
        - Some chunks derive a lot of their meaning from the text around them. So we can incorporate some context via:
            - Adding the title of the document to the chunk.
            - **Overlapping Chunks**: Adding some of the text before and after them to the chunk. This way, the chunks can overlap so they include some surrounding text that also appears in adjacent chunks. 
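
The overlapping-chunks idea, together with the optional title prefix, can be sketched as a sliding window over sentences. The function name, the default sizes, and the `"title: text"` format are illustrative assumptions rather than a recommended configuration.

```python
def chunk_sentences(sentences, chunk_size=3, overlap=1, title=None):
    """Group sentences into chunks that share `overlap` sentences with their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        text = " ".join(sentences[start:start + chunk_size])
        if title:
            text = f"{title}: {text}"  # add document-level context to every chunk
        chunks.append(text)
        if start + chunk_size >= len(sentences):
            break  # the last window already reached the end of the document
    return chunks


sentences = ["A.", "B.", "C.", "D.", "E."]
print(chunk_sentences(sentences, chunk_size=3, overlap=1))
# ['A. B. C.', 'C. D. E.']
print(chunk_sentences(sentences, chunk_size=3, overlap=1, title="Interstellar"))
# ['Interstellar: A. B. C.', 'Interstellar: C. D. E.']
```

Sentence "C." appears in both chunks, so a question whose answer straddles the boundary can still be matched by a single chunk.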

### Nearest neighbor search versus vector databases


Once the query is embedded, we need to find the nearest vectors to it from our text archive.

<img src="imgs/search.png" alt="Cohere logo" width="400" height="200"/>

As you scale to millions of vectors and beyond, an optimized approach for retrieval is to rely on **approximate nearest neighbor search** libraries like Annoy or FAISS.

Another class of vector retrieval systems are vector databases like Weaviate or Pinecone. 
- A vector database allows you to add or delete vectors without having to rebuild the index. 
- They also provide ways to filter your search or customize it in ways beyond merely vector distances.

### Fine-tuning embedding models for dense retrieval


The process for this fine-tuning is to get training data composed of queries and relevant results.

Take the sentence “Interstellar premiered on October 26, 2014, in Los Angeles.” Two possible queries where this is a relevant result are:

- Relevant query 1: “Interstellar release date”
- Relevant query 2: “When did Interstellar premiere”

The fine-tuning process aims to make the embeddings of these queries close to the embedding of the resulting sentence. It also needs to see negative examples of queries that are not relevant to the sentence, for example:

- Irrelevant query: “Interstellar cast”
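
One way to express this objective, shown only as an illustrative sketch and not necessarily the loss used by any particular fine-tuning recipe, is a triplet margin loss over cosine similarities. The loss reaches zero once the sentence is more similar to the relevant query than to the irrelevant one by at least `margin`. The 2-dimensional vectors are invented stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(sentence, relevant_query, irrelevant_query, margin=0.2):
    """Penalize cases where the irrelevant query is (almost) as close as the relevant one."""
    positive = cosine_similarity(sentence, relevant_query)
    negative = cosine_similarity(sentence, irrelevant_query)
    return max(0.0, margin - positive + negative)


# Hypothetical embeddings before fine-tuning: the irrelevant query is closer
loss_before = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
# Hypothetical embeddings after fine-tuning: the relevant query is closer
loss_after = triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])

print(round(loss_before, 3))  # 1.2
print(loss_after)             # 0.0
```

Training adjusts the embedding model so that examples move from the first situation toward the second.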

# ReRanking

For organizations that have already built search systems, an easier way to incorporate language models is as a final step inside their search pipeline.

This step is tasked with changing the order of the search results based on relevance to the search query.

<img src="imgs/rerank.png" alt="Cohere logo" width="450" height="250"/>

A reranker takes in the search query and a number of search results, and returns the optimal ordering of these documents so the most relevant ones to the query are higher in ranking. 

In [33]:
texts

['Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan',
 'It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
 'Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind',
 'Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007',
 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar',
 'Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm',
 'Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',
 'Interstellar uses extensive practical a

In [34]:
query = "how precise was the science"
results = co.rerank(query=query, documents=texts, top_n=3, return_documents=True)
results.results

[RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics'), index=12, relevance_score=0.16981852),
 RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'), index=10, relevance_score=0.07004896),
 RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar'), index=4, relevance_score=0.0043994132)]

In this basic example, we passed our reranker all 15 of our documents.

However, in production, an index would have thousands or millions of entries, and we need to shortlist, say, one hundred or one thousand results and then present those to the reranker.

This shortlisting step is called the first stage of the search pipeline.

### First-Stage

The first-stage retriever can be keyword search, dense retrieval, or better yet—hybrid search that uses both of them. 

In [36]:
def keyword_and_reranking_search(query, top_k=3, num_candidates=10):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print(f"Top-{top_k} lexical search (BM25) hits")
    for hit in bm25_hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))

    #Add re-ranking
    docs = [texts[hit['corpus_id']] for hit in bm25_hits]

    print(f"\nTop-{top_k} hits by rank-API ({len(bm25_hits)} BM25 hits re-ranked)")
    results = co.rerank(query=query, documents=docs, top_n=top_k, return_documents=True)
    for hit in results.results:
        print("\t{:.3f}\t{}".format(hit.relevance_score, hit.document.text.replace("\n", " ")))

In [37]:
keyword_and_reranking_search(query = "how precise was the science")

Input question: how precise was the science
Top-3 lexical search (BM25) hits
	1.789	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	1.373	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.000	Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects

Top-3 hits by rank-API (10 BM25 hits re-ranked)
	0.004	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.004	Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind
	0.003	Brothers Christopher and Jonat

# Open source retrieval and reranking with sentence transformers

If you want to locally set up retrieval and reranking on your own machine, then you can use the Sentence Transformers library. 

## How reranking models work

One popular way of building LLM search rerankers is to present the query and each result to an LLM working as a cross-encoder. 

This means that a query and a possible result are presented to the model at the same time, allowing the model to view both texts before it assigns a relevance score.

All of the documents are processed simultaneously as a batch yet each document is evaluated against the query independently. 

This formulation of search as relevance scoring basically boils down to being a classification problem. Given those inputs, the model outputs a score from 0–1 where 0 is irrelevant and 1 is highly relevant.
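
The control flow described above can be sketched independently of any particular model: score every (query, document) pair on its own, then sort by score. In the sketch below the scorer is passed in as `score_fn`. The `toy_score` function is a deliberately crude stand-in based on word overlap, not a language model; in practice a cross-encoder would take its place, and everything else is assumed for illustration.

```python
def rerank(query, documents, score_fn, top_n=3):
    """Score each (query, document) pair independently, then sort by score."""
    scored = [(score_fn(query, doc), idx) for idx, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_n]]


def toy_score(query, document):
    """Stand-in scorer in [0, 1]: share of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


docs = [
    "It stars Matthew McConaughey and Anne Hathaway",
    "Praised for its scientific accuracy",
    "The science was praised as precise",
]
ranked = rerank("how precise was the science", docs, toy_score, top_n=2)
print(ranked)  # [(2, 0.8), (0, 0.0)]
```

Note that the word-overlap stand-in gives the second document a score of zero even though it is clearly relevant; reading the query and document together with a language model is what lets a real cross-encoder recognize that "scientific accuracy" answers a question about precise science.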

# Retrieval Evaluation Metrics

Evaluating search systems requires three major components: a text archive, a set of queries, and relevance judgments indicating which documents in the archive are relevant for each query. The queries and relevance judgments together form the test suite.

<img src="imgs/test_suite.png" alt="Cohere logo" width="450" height="250"/>

Let’s assume we pass query 1 to two different search systems and get two sets of results.

<img src="imgs/judgements.png" alt="Cohere logo" width="450" height="250"/>

### Metric: MAP

<img src="imgs/map.png" alt="Cohere logo" width="450" height="250"/>
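
As a minimal sketch of how mean average precision (MAP) can be computed, the code below first scores one ranked result list by averaging the precision at each position that holds a relevant document, then averages that value over all queries. It assumes the common convention of dividing by the total number of relevant documents for the query; the document ids and relevance judgments are invented for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the positions k that hold a relevant document."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical test suite with two queries
runs = [
    ([1, 2, 3], {1, 3}),  # relevant at positions 1 and 3 -> (1/1 + 2/3) / 2
    ([2, 1], {1}),        # relevant at position 2        -> (1/2) / 1
]
print(round(average_precision(*runs[0]), 4))   # 0.8333
print(round(mean_average_precision(runs), 4))  # 0.6667
```

A system that places relevant documents nearer the top of each list gets a higher score, which makes the metric suitable for comparing two systems on the same test suite.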

# Retrieval-Augmented Generation (RAG)

The leading method the industry turned to in order to remedy hallucinations is RAG.

RAG systems incorporate search capabilities in addition to generation capabilities.

## From Search to RAG

Let’s now turn our search system into a RAG system. We do that by adding an LLM to the end of the search pipeline. 

We present the question and the top retrieved documents to the LLM, and ask it to answer the question given the context provided by the search results. 

<img src="imgs/rag.png" alt="Cohere logo" width="450" height="250"/>

This generation step is called **grounded generation** because the retrieved relevant information we provide the LLM establishes a certain context that grounds the LLM in the domain we’re interested in.

In [51]:
from langchain import LlamaCpp
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import RetrievalQA

In [47]:
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="weights/Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)

llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized


In [48]:
# Embedding Model for converting text to numerical representations
embedding_model = HuggingFaceEmbeddings(
    model_name='thenlper/gte-small'
)

In [49]:
texts

['Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan',
 'It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
 'Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind',
 'Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007',
 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar',
 'Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm',
 'Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',
 'Interstellar uses extensive practical a

### Preparing the Vector Database

In [50]:
# Create a local vector database
db = FAISS.from_texts(texts, embedding_model)

### The RAG Prompt

In [52]:
# Create a prompt template
template = """<|user|>
Relevant information:
{context}

Provide a concise answer to the following question using the relevant information provided above:
{question}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# RAG Pipeline
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    chain_type_kwargs={
        "prompt": prompt
    },
    verbose=True
)

In [53]:
rag.invoke('Income generated')

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Income generated',
 'result': ' The Income generated from the film was over $677 million worldwide, making it the tenth-highest grossing film of 2014.'}

# Advanced RAG Techniques


### Query rewriting


If the RAG system is a chatbot, the preceding simple RAG implementation would likely struggle with the search step **if a question is too verbose** or if it refers to context in previous messages in the conversation.

This is why it’s a good idea to use an LLM to **rewrite the query** into one that aids the retrieval step in getting the right information. 

```text
User Question: “We have an essay due tomorrow. We have to write about some animal. I love penguins. I could write about them. But I could also write about dolphins. Are they animals? Maybe. Let’s do dolphins. Where do they live for example?”
```

After rewriting, this becomes:
```text
Query: “Where do dolphins live”
```
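
A rewriting step usually amounts to one extra LLM call with a dedicated prompt. The sketch below only assembles such a prompt; the instruction wording, the function name, and the message format are all hypothetical, and the call to the LLM itself is left out.

```python
REWRITE_INSTRUCTION = (
    "Rewrite the user's message as one short search query. "
    "Resolve references using the conversation so far. "
    "Return only the query."
)


def build_rewrite_prompt(user_message, history=()):
    """Assemble the text sent to the rewriting LLM (wording is an assumption)."""
    lines = [REWRITE_INSTRUCTION, ""]
    for turn in history:
        lines.append(f"Previous message: {turn}")
    lines.append(f"User message: {user_message}")
    lines.append("Search query:")
    return "\n".join(lines)


prompt = build_rewrite_prompt(
    "Let's do dolphins. Where do they live for example?",
    history=["We have an essay due tomorrow. We have to write about some animal."],
)
print(prompt)
```

The model's reply, ideally something like the short query shown above, is then passed to the retrieval step in place of the original message.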


### Multi-query RAG

The next improvement we can introduce is to extend the query rewriting to be able to search multiple queries if more than one is needed to answer a specific question. Take for example:

```text
User Question: “Compare the financial results of Nvidia in 2020 vs. 2023”
```

This can be split into:
```text
Query 1: “Nvidia 2020 financial results”

Query 2: “Nvidia 2023 financial results”
```
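
Once the question has been split, each query is searched separately and the results are merged before generation. The sketch below assumes a retriever is available as a function; `toy_search` and the placeholder archive are stand-ins invented for illustration and contain no real financial data.

```python
def multi_query_search(queries, search_fn, top_k=3):
    """Run each query on its own and merge the hits, skipping duplicates."""
    seen = set()
    merged = []
    for query in queries:
        for doc in search_fn(query)[:top_k]:
            if doc not in seen:
                seen.add(doc)
                merged.append((query, doc))
    return merged


# Placeholder archive (no real content)
ARCHIVE = [
    "Placeholder: Nvidia 2020 annual report summary",
    "Placeholder: Nvidia 2023 annual report summary",
    "Placeholder: Nvidia company overview",
]


def toy_search(query):
    """Stand-in retriever: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.lower().split())), doc) for doc in ARCHIVE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]


queries = ["Nvidia 2020 financial results", "Nvidia 2023 financial results"]
for query, doc in multi_query_search(queries, toy_search, top_k=1):
    print(f"{query!r} -> {doc!r}")
```

Keeping track of which query produced which document makes it easier to build a prompt that presents the 2020 and 2023 material side by side.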




### Multi-hop RAG

A more advanced question may require a series of sequential queries.

```text
User Question: “Who are the largest car manufacturers in 2023? Do they each make EVs or not?”
```

To answer this, the system must first search for:
```text
Step 1, Query 1: “largest car manufacturers 2023”
```
Then, after receiving the results, it searches for each manufacturer:
```text
Step 2, Query 1: “Toyota Motor Corporation electric vehicles”

Step 2, Query 2: “Volkswagen AG electric vehicles”

Step 2, Query 3: “Hyundai Motor Company electric vehicles”
```
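
The sequential pattern can be sketched as two rounds of search, where the second round's queries are derived from the first round's results. In a real system an LLM would write the follow-up queries; here `toy_follow_ups` and the canned search results are stand-ins that simply mirror the example above.

```python
def multi_hop_search(first_query, search_fn, make_follow_ups):
    """Run one query, derive follow-up queries from its results, then run those."""
    first_results = search_fn(first_query)
    follow_up_queries = make_follow_ups(first_results)
    follow_up_results = {q: search_fn(q) for q in follow_up_queries}
    return first_results, follow_up_results


# Stand-in retriever with canned results taken from the example above
CANNED_RESULTS = {
    "largest car manufacturers 2023": [
        "Toyota Motor Corporation",
        "Volkswagen AG",
        "Hyundai Motor Company",
    ],
}


def toy_search(query):
    return CANNED_RESULTS.get(query, [f"(placeholder results for '{query}')"])


def toy_follow_ups(results):
    """Stand-in for the LLM step that writes the next round of queries."""
    return [f"{name} electric vehicles" for name in results]


first, second = multi_hop_search(
    "largest car manufacturers 2023", toy_search, toy_follow_ups
)
print(list(second))
```

The second round cannot start until the first has finished, which is what distinguishes multi-hop RAG from running several independent queries in parallel.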


### Query Routing

An additional enhancement is to give the model the ability to search **multiple data sources**. 


We tell the model that if a question relates to topic A, it should search the information system for A; if it relates to topic B, it should search system B; and so on. This way we can keep a separate database schema for each topic, which makes the search easier.
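
A router can be sketched as a classification step followed by a lookup of the matching data source. The topics, the keyword rules in `toy_classifier`, and the source functions are all hypothetical; in practice the classification would typically be done by an LLM.

```python
def route_query(question, sources, classify_fn, default_topic):
    """Pick a data source for the question and search only that source."""
    topic = classify_fn(question)
    if topic not in sources:
        topic = default_topic  # fall back when the topic is not recognized
    return topic, sources[topic](question)


def toy_classifier(question):
    """Stand-in for an LLM call that names the topic of a question."""
    text = question.lower()
    if "vacation" in text or "salary" in text:
        return "hr"
    if "deploy" in text or "bug" in text:
        return "engineering"
    return "unknown"


# Hypothetical sources: each maps a question to results from its own system
sources = {
    "hr": lambda q: [f"HR wiki results for: {q}"],
    "engineering": lambda q: [f"Engineering docs results for: {q}"],
    "general": lambda q: [f"General search results for: {q}"],
}

print(route_query("How many vacation days do I get?", sources, toy_classifier, "general"))
print(route_query("What is our mission statement?", sources, toy_classifier, "general"))
```

The fallback source keeps the system useful when a question does not fit any of the known topics.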


### Agentic RAG

The data sources can also now be abstracted into tools. If we can search a source such as Notion, for example, then by the same token we should be able to post to Notion as well.

Cohere’s Command R+ excels at these tasks and is available as an open-weights model as well.


### RAG Evaluation

There are still ongoing developments in how to evaluate RAG models. A good paper to read on this topic is “Evaluating verifiability in generative search engines” (2023), which runs human evaluations on different generative search systems. It evaluates results along four axes:

- Fluency: Whether the generated text is fluent and cohesive.
- Perceived utility: Whether the generated answer is helpful and informative.
- Citation recall: The proportion of generated statements about the external world that are fully supported by their citations.
- Citation precision: The proportion of generated citations that support their associated statements.
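
Given per-statement and per-citation judgments, the two citation metrics reduce to simple proportions. The sketch below assumes the judgments have already been made (by human annotators, for instance); the field names and the example values are invented.

```python
def citation_recall(statements):
    """Share of generated statements that are fully supported by their citations."""
    return sum(s["fully_supported"] for s in statements) / len(statements)


def citation_precision(citations):
    """Share of generated citations that support their associated statement."""
    return sum(c["supports_statement"] for c in citations) / len(citations)


# Hypothetical judgments for one generated answer
statements = [
    {"fully_supported": True},
    {"fully_supported": True},
    {"fully_supported": True},
    {"fully_supported": False},
]
citations = [
    {"supports_statement": True},
    {"supports_statement": True},
    {"supports_statement": True},
    {"supports_statement": True},
    {"supports_statement": False},
]

print(citation_recall(statements))    # 0.75
print(citation_precision(citations))  # 0.8
```

Recall drops when statements lack adequate support, while precision drops when citations are attached that do not actually back the statement they accompany.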


While human evaluation is always preferred, there are approaches that attempt to automate these evaluations by having a capable **LLM act as a judge (called LLM-as-a-judge)**.

Ragas is a software library that does exactly this. It also scores some additional useful metrics like:

- Faithfulness
    - Whether the answer is consistent with the provided context
- Answer relevance
    - How relevant the answer is to the question