# RAG using Hybrid Retrievers and Reranking

**Introduction to RAG using Hybrid Retrievers and Reranking:**

Retrieval-Augmented Generation (RAG) is a framework where language models (LMs) are combined with information retrieval systems to improve the model's ability to generate contextually accurate and informative responses. Hybrid retrievers combine multiple retrieval methods (e.g., sparse and dense retrieval) to obtain relevant documents, which are then passed through a reranking system to prioritize the most useful information. This hybrid approach boosts the performance of LMs by ensuring they work with the most relevant external data, leading to more accurate and coherent outputs.

## Advantages:

- **Improved accuracy**: Combines multiple retrieval techniques for a more comprehensive set of relevant documents.
- **Efficient use of resources**: Sparse retrieval such as BM25 is cheap to run, so pairing it with dense retrieval widens coverage at little extra cost.
- **Better context understanding**: Reranking helps the model focus on the most relevant retrieved passages during generation.
- **Reduced hallucination**: Grounding answers in retrieved documents decreases the likelihood of generating factually incorrect information.

## Disadvantages:

- **Increased complexity**: Requires more sophisticated architecture and tuning.
- **Dependency on quality of retrieval**: Performance heavily depends on the quality of the documents retrieved. If the retrievers don't fetch relevant data, the generated output will still be subpar.
- **Computational cost**: Reranking and hybrid retrieval can lead to higher computational overhead.


### Step-1: Required Package Installation

These dependencies set up a complete environment for working on a RAG system using LangChain, along with embeddings, document retrieval, and generative models.

In [None]:
!pip install python-dotenv transformers==4.44.2 langchain==0.3.3 \
             langchain-community==0.3.0 \
             langchain-core==0.3.10 \
             langchain-text-splitters==0.3.0 \
             chroma-hnswlib==0.7.6 \
             chromadb==0.5.11 \
             accelerate==1.0.1 \
             pypdf \
             beautifulsoup4 \
             ipywidgets \
             langchain-groq \
             huggingface-hub==0.25.1 \
             langchain-huggingface==0.1.0 \
             InstructorEmbedding==1.0.1 \
             rank-bm25==0.2.2
!pip install sentence-transformers==2.2.2

### Step-2: Imports

These imports set up an environment that integrates document loading, embeddings, vector stores, and interactions with large language models (LLMs), making it suitable for building RAG (Retrieval-Augmented Generation) systems.

In [None]:
# Standard library imports
import os
import warnings
from pathlib import Path

# Third-party utility imports
import bs4
import ipywidgets as widgets
from dotenv import load_dotenv

warnings.filterwarnings("ignore")

# Chroma and related imports
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma

# Langchain related imports
import langchain
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

### Step-3: LLM Setup

In this step, we will be using the `llama-3.1-8b-instant` model from Groq. To access the model, you will need to create an API key and expose it as `GROQ_API_KEY`, for example in a `.env` file.
For the steps to generate your API key, visit the following link: [GROQ API_Key_Generation](https://github.com/AryanKarumuri/Gen-AI-Projects/blob/main/README.md#api-key-generation-guide)

In [None]:
load_dotenv()
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if GROQ_API_KEY:
    llm = ChatGroq(groq_api_key=GROQ_API_KEY, model_name="llama-3.1-8b-instant")
    #print(GROQ_API_KEY)
else:
    print("Add Groq API Key")
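
If `GROQ_API_KEY` is missing, the cell above only prints a reminder, so `llm` stays undefined and the failure surfaces later when the chain is built. A minimal sketch of a fail-fast check follows. `require_env` is a hypothetical helper, not part of `python-dotenv` or `langchain-groq`, and the `DEMO_API_KEY` variable exists only for this demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value


# Demonstration with a throwaway variable (hypothetical name).
os.environ["DEMO_API_KEY"] = "example-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same helper could be called with `"GROQ_API_KEY"` right after `load_dotenv()`, so a missing key stops execution at the point where it matters.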

### Step-4: Data Loading

The code below uses `WebBaseLoader` to scrape and load content from a specified web page. In this case, it fetches Lilian Weng's blog post "Extrinsic Hallucinations in LLMs".

- **web_paths**: Defines the URL to scrape (`https://lilianweng.github.io/posts/2024-07-07-hallucination/`).
- **bs_kwargs**: Passes additional parameters to BeautifulSoup for filtering the content, specifically extracting elements with classes like `post-content`, `post-title`, and `post-header`.
  
The code will load the relevant sections of the page based on these filters using `loader.load()`.


In [4]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2024-07-07-hallucination/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)

docs = loader.load()

### Step-5: Document Loading and Chunking

This code first checks whether any documents were successfully loaded. If no documents are found, it raises a `ValueError`.

- **`RecursiveCharacterTextSplitter`**: Breaks the loaded document into smaller chunks for easier processing by models. `chunk_size=1000` caps each chunk at 1000 characters, and `chunk_overlap=300` lets consecutive chunks share up to 300 characters so that context spanning a chunk boundary is not lost.
- **`text_splitter.split_documents(docs)`**: Splits the loaded documents (`docs`) into chunks and stores them in the `document_chunks` list.

This approach is useful for large documents that need to be processed in smaller pieces to fit within model input size limitations.


In [5]:
if not docs:
    raise ValueError("No documents loaded.")

# Splitting docs into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=300)
document_chunks = text_splitter.split_documents(docs)

print(document_chunks[0])

page_content='Extrinsic Hallucinations in LLMs
    
Date: July 7, 2024  |  Estimated Reading Time: 30 min  |  Author: Lilian Weng


Hallucination in large language models usually refers to the model generating unfaithful, fabricated, inconsistent, or nonsensical content. As a term, hallucination has been somewhat generalized to cases when the model makes mistakes. Here, I would like to narrow down the problem of hallucination to cases where the model output is fabricated and not grounded by either the provided context or world knowledge.
There are two types of hallucination:' metadata={'source': 'https://lilianweng.github.io/posts/2024-07-07-hallucination/'}
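
The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a simplified sketch, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` prefers to break at paragraph, line, and word boundaries, so its real chunk boundaries differ. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # each new chunk starts 700 characters later
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2400-character text so the window positions are easy to follow.
sample = "".join(str(i % 10) for i in range(2400))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-300:] == chunks[1][:300])
```

With a stride of 700 characters, the tail of one chunk reappears at the head of the next, which is the redundancy the overlap setting buys.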


### Step-7: Post-Processing

- **`format_docs(docs)`**: The function takes a list of documents (`docs`) and joins their `page_content` into a single string. The content is separated by two newline characters (`\n\n`) for better readability.

In [6]:
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
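
A quick way to see what `format_docs` produces is to call it on stand-in objects. In the sketch below, `FakeDoc` is a hypothetical substitute for LangChain's `Document` that carries only the `page_content` attribute the function reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the attribute format_docs needs."""
    page_content: str


def format_docs(docs):
    # Same logic as the notebook cell: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDoc("First chunk."), FakeDoc("Second chunk.")])
print(context)
```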

### Step-8: Getting Embeddings

- **`EMBEDDING_MODEL_NAME`**: Specifies the name of the pre-trained embedding model, `"hkunlp/instructor-large"`. INSTRUCTOR models are instruction-tuned text embedding models: each input is embedded together with a short task instruction.

- **`get_embeddings()`**: Initializes the `HuggingFaceInstructEmbeddings` with the given model name and provides instructions for how to represent documents and queries for retrieval.


In [18]:
# Pre-trained embedding model
EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"

# Embeddings
def get_embeddings():
    embeddings = HuggingFaceInstructEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        embed_instruction="Represent the document for retrieval:",
        query_instruction="Represent the question for retrieving supporting documents:",
    )
    return embeddings

### Step-9: Database Creation and DB Retriever

- **Database Creation** (`Chroma.from_documents()`): Creates a vector store using **Chroma**, which indexes the document chunks and stores their embeddings in a collection.
   
- **DB Retriever** (`vector_store.as_retriever()`): Converts the vector store into a retriever, allowing for efficient querying and retrieval of relevant documents based on vector similarity.


In [None]:
# Database Creation
vector_store = Chroma.from_documents(
    documents=document_chunks,       
    embedding=get_embeddings(),    
    collection_name="db_collection"  
)

# DB Retriever
retriever = vector_store.as_retriever()
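
Retrieval by vector similarity can be sketched without any vector database. The example below ranks toy three-dimensional vectors by cosine similarity. The vectors, the chunk IDs, and the `top_k` helper are all hypothetical, and cosine similarity is used only for illustration: the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "embeddings" (hypothetical values, real ones have hundreds of dimensions).
toy_index = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.1, 0.9, 0.0],
    "chunk_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, toy_index))
```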

### Step-10: Hybrid Retrievers Setup

This function defines a **hybrid retriever system** that combines multiple retrieval techniques to improve the quality of document retrieval. Here's a breakdown of each component:

1. **DB Retriever**: 
   - A retriever is created from a vector store (`vector_store.as_retriever()`). This typically represents a database of vectors (dense representations of documents) for semantic search.
     

2. **BM25 Retriever**: 
   - A **BM25 retriever** is initialized using the document chunks. BM25 is a sparse, keyword-based ranking method that scores documents by term frequency, inverse document frequency, and document length.
   - The number of top results (`k`) to return is set to 5.
     

3. **Contextual Compression Retriever**:
   - A **HuggingFaceCrossEncoder** loads the pre-trained reranking model (`BAAI/bge-reranker-base`), which scores each query-document pair.
   - The **CrossEncoderReranker** uses those scores to reorder the retrieved documents and keeps the top 3 (`top_n=3`).
   - The **ContextualCompressionRetriever** wraps the DB retriever with the reranker, so only the dense results are reranked before they reach the ensemble.


4. **Ensemble Retriever**:
   - An **EnsembleRetriever** is created by combining the **ContextualCompressionRetriever** and the **BM25 Retriever**. It merges their ranked lists with weighted Reciprocal Rank Fusion; the weights (0.7 and 0.3) set how much each retriever's ranking contributes to the final order.
     

The function returns the **ensemble retriever**, which will leverage both dense and sparse retrieval techniques to improve the accuracy and relevance of the search results.
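
To make the sparse side concrete, here is a minimal BM25 scorer in plain Python. It is a sketch under stated assumptions: it uses whitespace tokenization, the common parameters `k1=1.5` and `b=0.75`, and the `log(1 + ...)` form of the IDF term, which is not necessarily the exact variant implemented by `rank-bm25`. The function name `bm25_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "hallucination in large language models",
    "retrieval reduces hallucination in generation",
    "sampling methods for text generation",
]
scores = bm25_scores("retrieval hallucination", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, docs[best])
```

The second document wins because it matches both query terms, and the rarer term ("retrieval") carries the larger IDF weight. A document with no matching terms scores zero, which is why sparse retrieval benefits from being paired with a dense retriever.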


In [9]:
def hybrid_retrievers(document_chunks):
    # DB Retriever
    retriever = vector_store.as_retriever()
    
    # BM25 Retriever
    sparse_retriever = BM25Retriever.from_documents(document_chunks)
    sparse_retriever.k = 5

    # Adding Contextual Compression Retriever
    model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=model, top_n=3)
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=retriever
    )

    # Creating Ensemble Retriever
    ensemble_retriever = EnsembleRetriever(
        retrievers=[compression_retriever, sparse_retriever],
        weights=[0.7, 0.3]
    )
    
    return ensemble_retriever
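
The fusion step can be sketched as weighted Reciprocal Rank Fusion (RRF) in plain Python. This is an illustration of the idea rather than LangChain's implementation: the smoothing constant `c=60` is the value conventionally used for RRF and is an assumption here, and the chunk IDs and the `weighted_rrf` helper are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions (small rank) contribute more to the score.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of the reranked dense retriever and the BM25 retriever.
reranked_dense = ["chunk_a", "chunk_c", "chunk_b"]
sparse = ["chunk_b", "chunk_a", "chunk_d"]

fused = weighted_rrf([reranked_dense, sparse], weights=[0.7, 0.3])
print(fused)
```

Documents found by both retrievers accumulate score from each list, so `chunk_b` overtakes `chunk_c` even though the dense side ranked it lower.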

### Step-11: System Prompt

1. **System Prompt**: Defines the behavior of the assistant for question-answering tasks. It ensures that the assistant answers strictly based on the provided context and informs the user when the answer is unknown.

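2. **Prompt Template** (`ChatPromptTemplate.from_template()`): Wraps the prompt string in a chat prompt template. The placeholders `{context}` and `{question}` are replaced with actual values at runtime. Note that `from_template()` sends the whole string as a single human message rather than as a separate system message.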


In [10]:
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "You must answer questions strictly using the provided context. If you don't know the answer, say that you don't know. "
    "\n\n"
    "{context}"
    "\n\n"  # keep the retrieved context and the question visually separated
    "Question: {question}"
)

prompt = ChatPromptTemplate.from_template(system_prompt)
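
Placeholder substitution can be previewed with Python's built-in `str.format`, used here as a stand-in for the template object. The shortened template text and the sample values are hypothetical.

```python
# Shortened template with the same two placeholders as the notebook prompt.
template = (
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}"
)

filled = template.format(
    context="Chunk one.\n\nChunk two.",
    question="What is FactualityPrompt?",
)
print(filled)
```

Printing the filled prompt like this is a cheap way to check that the context and the question end up where the model expects them.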

### Step-12: RAG Chain

In [11]:
# Chain
rag_chain = (
    {"context": hybrid_retrievers(document_chunks) | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
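
In the chain above, the dictionary runs the hybrid retriever on the question and pipes the result through `format_docs` to build `context`, while `RunnablePassthrough()` forwards the question unchanged. The prompt, the LLM, and the output parser then run in sequence. The sketch below mirrors that data flow with ordinary functions. `fake_retriever` and `fake_llm` are hypothetical stand-ins that return canned values; none of these functions are LangChain APIs.

```python
def fake_retriever(question):
    """Stand-in for the hybrid retriever: returns canned chunk texts."""
    return ["Chunk about hallucination.", "Chunk about retrieval."]


def join_chunks(chunks):
    """Plays the role of format_docs."""
    return "\n\n".join(chunks)


def build_prompt(inputs):
    """Plays the role of the prompt template."""
    return f"{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    """Stand-in for the chat model: returns a message-like dict."""
    return {"content": f"(answer based on {len(prompt_text)} prompt characters)"}


def parse_output(message):
    """Plays the role of StrOutputParser: extract the text of the reply."""
    return message["content"]


def run_chain(question):
    # Both keys are computed from the same input, as in the dict step of the chain.
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return parse_output(fake_llm(build_prompt(inputs)))


answer = run_chain("What is this document about?")
print(answer)
```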

### Testing our Hybrid RAG

In [12]:
# Question
from pprint import pprint
pprint(rag_chain.invoke("What is this document about?"))

('Based on the provided context, I can answer your question:\n'
 '\n'
 'What is this document about?\n'
 '\n'
 'According to the context, this document is about Anti-Hallucination Methods '
 'for Language Models (LLMs), specifically discussing various methods to '
 'improve factuality and reduce hallucinations in LLMs, including '
 'Retrieval-augmented Generation (RAG), RARR, WebGPT, and fine-tuning for '
 'attribution and factuality.')


In [14]:
pprint(rag_chain.invoke("What is FactualityPrompt?"))

('FactualityPrompt is a benchmark consisting of both factual and nonfactual '
 'prompts, relying on Wikipedia documents or sentences as the knowledge base '
 'for factuality grounding.')


In [15]:
pprint(rag_chain.invoke("What are the observations on model hallucination behavior?"))

('According to the provided context, some interesting observations on model '
 'hallucination behavior are as follows:\n'
 '\n'
 '1. Error rates are higher for rarer entities in the task of biography '
 'generation.\n'
 '2. Error rates are higher for facts mentioned later in the generation.\n'
 '3. Using retrieval to ground the model generation significantly helps reduce '
 'hallucination.\n'
 '\n'
 'Additionally, there are some observations from the experiments, where dev '
 'set accuracy is considered a proxy for hallucinations. These include:\n'
 '\n'
 '1. Unknown examples are fitted substantially slower than Known examples.\n'
 '2. The best dev performance is obtained when the LLM fits the majority of '
 'the Known training examples but only a few of the Unknown ones. The model '
 'starts to hallucinate when it learns most of the Unknown examples.\n'
 '3. Among Known examples, MaybeKnown cases result in better overall '
 'performance, more essential than HighlyKnown ones.\n'
 '\n'


In [16]:
pprint(rag_chain.invoke("Explain about self RAG from the document."))

('Based on the provided context, I will explain about Self-RAG.\n'
 '\n'
 'Self-RAG, also known as "Self-reflective retrieval-augmented generation," is '
 'a framework that trains a Language Model (LM) end-to-end to learn to reflect '
 'on its own generation. It achieves this by outputting both task output and '
 'intermittent special reflection tokens.\n'
 '\n'
 'The Self-RAG model uses four types of reflection tokens: \n'
 '\n'
 '1. Retrieve: decides whether to run retrieval in parallel to get a set of '
 'documents.\n'
 '2. IsRel: whether the prompt and retrieved document are relevant.\n'
 '3. IsSup: whether the output text is supported by the retrieved document.\n'
 '4. IsUse: whether the output text is useful to the given prompt.\n'
 '\n'
 'Self-RAG trains a critic model and a generator model by prompting GPT-4 and '
 'then distills that into an in-house model to reduce inference cost.\n'
 '\n'
 'Self-RAG aims to improve the quality of the generated output by critiquing '
 'its ow

In [17]:
pprint(rag_chain.invoke("Explain the formula for sampling methods."))

('The provided context mentions the formula for sampling methods in the '
 'context of factual-nucleus sampling, which is as follows:\n'
 '\n'
 '\\[ p_t = \\max(\\omega, p \\cdot \\lambda^{t−1}) \\]\n'
 '\n'
 'where:\n'
 '\n'
 '- $p_t$ is the probability at the $t$-th token in the sentence.\n'
 '- $\\omega$ is a parameter to prevent the sampling from falling back to '
 'greedy sampling, which can hurt generation quality and diversity.\n'
 '- $p$ is the initial probability.\n'
 '- $\\lambda$ is a parameter that dynamically adapts the probability during '
 'sampling.')
