<a href="https://colab.research.google.com/github/Prakum14/Testfiles/blob/master/RAG_with_OpenAI_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG: Retrieval-Augmented Generation

**(with OpenAI LLMs)**

## Learning Objectives

At the end of the experiment, you will be able to:

1. Load documents
2. Split the documents into chunks
3. Embed the chunks and store them in a vector database
4. Retrieve the chunks relevant to a query
 * Addressing Diversity
 * Addressing Specificity
5. Connect to an LLM to get a final, grounded answer

## Introduction

> **RAG diagram:**
>
> <img src='https://drive.google.com/uc?id=1sCVvpsmtZEU1WSK1FFGMGHbEjrgtCNLi'>

> **Vector Store and Retrieval:**
>
> <img src='https://drive.google.com/uc?id=1_zX5gtSNrV8Qdx7Nz4_gMR8dCwvxCDS7' width=750px>

> **Embedding Model:**
>
> <img src='https://drive.google.com/uc?id=1HnvjGJ4HmpS-0wndpH-Q8cKMwIwWkTUe'>

> **Retrieval in Action:**
>
> <img src='https://drive.google.com/uc?id=1ry2TWFsewwqYP3Lw9muuPmbyuQqXwnYV' width=800px>

> **Example workflow with embedding model:**
>
><br>
>
> <img src='https://drive.google.com/uc?id=1zTuMMX54L2HrnmCYktTxVfMVrkIz8w15' width=600px>

### Install Dependencies

In [1]:
%%capture
# Install required libraries silently without showing output in the notebook
# (the %%capture cell magic must be the first line of the cell)
!pip -q install openai               # Install the OpenAI Python client to access OpenAI's models and APIs
!pip -q install langchain-openai      # Install the LangChain OpenAI integration to simplify using OpenAI with LangChain
!pip -q install langchain-core        # Install LangChain core components for building chains and applications
!pip -q install langchain-community   # Install LangChain community extensions for additional functionalities
!pip -q install sentence-transformers # Install Sentence-Transformers for using pre-trained sentence embeddings
!pip -q install langchain-huggingface # Install LangChain integration for Hugging Face models
!pip -q install langchain-chroma      # Install Chroma for working with vector databases in LangChain
!pip -q install chromadb              # Install ChromaDB, a library for storing and searching embeddings in a vector store
!pip -q install pypdf                 # Install PyPDF for extracting text from PDF documents

### Import Required Packages

In [2]:
# Importing necessary libraries for working with OpenAI, LangChain, and PDF data

import os                                           # For interacting with the operating system, managing file paths, etc.
import openai                                       # OpenAI Python client for accessing models like GPT
import numpy as np                                  # NumPy for numerical operations, especially with arrays and matrices
from langchain_community.document_loaders import PyPDFLoader  # LangChain's PyPDFLoader to load text from PDF files
from langchain_openai import ChatOpenAI             # LangChain's integration for using OpenAI's chat models (like GPT-3.5, GPT-4)
from langchain_chroma import Chroma                 # Chroma integration for vector stores to manage and search embeddings
from langchain_core.prompts import PromptTemplate    # LangChain's utility for creating prompt templates
from langchain_core.output_parsers import StrOutputParser  # To parse model output as a string
from langchain_core.runnables import RunnablePassthrough  # For passing data through LangChain's runnables without modification

#### **Provide your OpenAI API key**

In [6]:
# Importing Colab's userdata module to access stored secrets
from google.colab import userdata

# Fetching the OpenAI API key stored in Colab Secrets
api_key = userdata.get('OPENAI_API_KEY')  # <-- change this as per your secret's name

# Storing the API key in the environment variables for global access
os.environ['OPENAI_API_KEY'] = api_key

# Setting the OpenAI API key for the openai package to use
openai.api_key = os.getenv('OPENAI_API_KEY')

### Load LLM

In [7]:
# Importing the ChatOpenAI class from the langchain_openai module to work with OpenAI models
from langchain_openai import ChatOpenAI

# Loading the GPT-4o-mini model with temperature 0 to make the output as repeatable as possible
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [12]:
# Sending a query to the loaded model and invoking the model to answer the question
response = llm.invoke("What is the Capital of India?")

# Printing the response content from the model
print(response.content)


In [10]:
# Sending a query to the model asking for 5 points on how to learn programming
response = llm.invoke("How to learn programming? give 5 points")

# Printing the response content from the model, which will include the 5 points
print(response.content)


### **Loading the documents**

[PDF Loader](https://python.langchain.com/docs/how_to/document_loader_pdf/)

In [13]:
# UPLOAD the Docs first to this notebook, then run this cell

from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    PyPDFLoader("/content/pca_d1.pdf"),
    PyPDFLoader("/content/ens_d2.pdf"),
    PyPDFLoader("/content/ens_d2.pdf"),    # Loading duplicate documents on purpose
]

docs = []
for loader in loaders:
    docs.extend(loader.load())



In [None]:
len(docs)        # total number of pages loaded (7 for the documents above)

In [None]:
docs

In [None]:
print(docs[0].page_content)

### **Splitting of document**

[Recursively split by character](https://python.langchain.com/docs/how_to/recursive_text_splitter/)

[Split by character](https://python.langchain.com/docs/how_to/character_text_splitter/)

In [None]:
# Importing the RecursiveCharacterTextSplitter from langchain_text_splitters to split large text into smaller chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Initializing the RecursiveCharacterTextSplitter with chunk size and overlap parameters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,       # The maximum size of each chunk (500 characters)
    chunk_overlap = 50      # The number of characters that will overlap between consecutive chunks (50 characters)
)
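
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The helper `split_with_overlap` is a hypothetical name introduced for illustration, and it is not how `RecursiveCharacterTextSplitter` is implemented: the real splitter additionally prefers to cut at separators such as paragraph breaks, newlines, and spaces. The sketch only shows how consecutive chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.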

In [None]:
# Splitting the documents into smaller chunks using the defined text_splitter
splits = text_splitter.split_documents(docs)

# Printing the number of splits (chunks) generated
print(len(splits))

# Printing the length of the content of the first chunk
print(len(splits[0].page_content))

# Displaying the content of the first chunk
splits[0].page_content

In [None]:
splits[0]

### **Embeddings**

Let's take our splits and embed them.

In [None]:
from langchain_openai import OpenAIEmbeddings

# Initializing OpenAI's embedding model for generating embeddings for text using the 'text-embedding-3-small' model
embedding = OpenAIEmbeddings(model='text-embedding-3-small')

In [None]:
embedding

### **Understanding similarity search with a toy example**

In [None]:
sentence1 = "i like dogs"
sentence2 = "i like cats"
sentence3 = "the weather is ugly, too hot outside"

In [None]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [None]:
len(embedding1), len(embedding2), len(embedding3)

In [None]:
embedding1[:10]

In [None]:
import numpy as np

def cosine_similarity(vector1, vector2):
    # Ensure that the vectors are numpy arrays
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)

    # Calculate the dot product of the vectors
    dot_product = np.dot(vector1, vector2)

    # Calculate the magnitude (norm) of the vectors
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)

    # Compute cosine similarity
    if norm_vector1 == 0 or norm_vector2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (norm_vector1 * norm_vector2)


In [None]:
# Calculating the cosine similarity between the embeddings of three sentences
cosine_similarity(embedding1, embedding2), cosine_similarity(embedding1, embedding3), cosine_similarity(embedding2, embedding3)

### **Vectorstores**

In [None]:
# Light-weight and in memory
from langchain_chroma import Chroma

In [None]:
persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,                    # splits we created earlier
    embedding=embedding,
    persist_directory=persist_directory, # save the directory
)

In [None]:
print(vectordb._collection.count()) # same as number of splits

### **Similarity Search in Vector store**

In vector databases, algorithms for retrieving the chunks relevant to a query are usually based on **similarity search techniques**, primarily nearest neighbor search.

Here are some common approaches:

>**Approximate Nearest Neighbor (ANN) Search:** Vector databases frequently use ANN algorithms to improve efficiency when searching for vectors that are close to the query vector.
>
>Popular **ANN** algorithms and libraries include:
>
>1. HNSW (Hierarchical Navigable Small World Graph): A graph-based approach that finds approximate nearest neighbors using a multi-layered graph structure.
>
>2. Faiss: An open-source library developed by Facebook that implements various algorithms for fast similarity search, such as Product Quantization and the Inverted File index (IVF).
>
>3. Annoy (Approximate Nearest Neighbors Oh Yeah): Developed by Spotify, it uses a forest of random projection trees for approximate nearest neighbor search.
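
For comparison, the exact (brute-force) search that ANN methods approximate can be sketched in a few lines. This is an illustration under simple assumptions, not what Chroma or any of the libraries above run internally: `exact_top_k` is a hypothetical helper, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # avoid division by zero
    return dot / (norm_u * norm_v)


def exact_top_k(query_vec, vectors, k):
    """Score every stored vector against the query and return the k best indices."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_vectors = [
    [1.0, 0.0],   # same direction as the query
    [0.9, 0.1],   # almost the same direction
    [0.0, 1.0],   # orthogonal
    [-1.0, 0.0],  # opposite direction
]
top = exact_top_k([1.0, 0.0], toy_vectors, k=2)
print(top)  # [0, 1]
```

Exact search compares the query with every stored vector, so its cost grows linearly with the size of the index. ANN structures trade a small amount of accuracy for avoiding that full scan.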


In [None]:
question = "How do ensemble methods work?"

In [None]:
# k --> No. of Document object to return
docs = vectordb.similarity_search(question, k=6)

In [None]:
# Print the length of the documents retrieved and display the content of each document
print(len(docs))

# Iterate through each document in the retrieved list 'docs'
for i in range(len(docs)):
    # Print the content of each document's page
    print(docs[i].page_content)
    # Print a separator line for better readability between document contents
    print('='*140)

### **Edge cases where failure may happen**

1. Lack of diversity: Semantic search returns the most similar chunks but does not enforce diversity among them.

    - Notice that we're getting duplicate chunks (because of the duplicate `ens_d2.pdf` in the index). `docs[0]` and `docs[1]` are identical.

  **Addressing Diversity - MMR (Maximum Marginal Relevance)**

Maximum Marginal Relevance (MMR) is a method used to retrieve relevant items to a query while avoiding redundancy. It does this by ensuring a balance between relevancy and diversity in the items retrieved.
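
The balance between relevance and diversity can be sketched as a greedy selection loop. The code below is a minimal illustration, not LangChain's implementation: `mmr_select` is a hypothetical helper, the vectors are toy values, and `lambda_mult` (1.0 means relevance only, 0.0 means diversity only) is assumed here to play the same role as the parameter of that name in LangChain's MMR search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Greedy MMR: pick k candidate indices, trading relevance against redundancy."""
    remaining = list(range(len(candidate_vecs)))
    selected = []
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine(query_vec, candidate_vecs[idx])
            # Redundancy: similarity to the closest item already selected
            redundancy = max(
                (cosine(candidate_vecs[idx], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [
    [0.8, 0.6],   # relevant
    [0.8, 0.6],   # exact duplicate of candidate 0
    [0.6, -0.8],  # less relevant, but points in a different direction
]
picked = mmr_select(query_vec, candidate_vecs, k=2)
print(picked)  # [0, 2]
```

Plain similarity search would return candidates 0 and 1, which are duplicates. MMR penalises candidate 1 for being identical to the chunk already selected and picks candidate 2 instead.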

<img src='https://miro.medium.com/v2/resize:fit:828/format:webp/1*U-9mPt5tBfPBPrwC4_oD1w.png'>

In [None]:
# Perform a similarity search in the vector database for the given question
# The 'k=3' argument specifies that the top 3 most relevant documents should be returned (without using MMR)
docs = vectordb.similarity_search(question, k=3)

# Print the number of documents retrieved from the search
print(len(docs))

# Iterate through each of the retrieved documents and print their content
for i in range(len(docs)):
    # Print the content of each document
    print(docs[i].page_content)
    # Print a separator line for better readability between document contents
    print('='*140)

**Example 1. Addressing Diversity - MMR-Maximum Marginal Relevance**

In [None]:
# Perform a similarity search in the vector database for the given question, using Maximum Marginal Relevance (MMR)
# The 'k=3' argument specifies that the top 3 most relevant documents should be returned after applying MMR.
# The 'fetch_k=6' argument fetches 6 documents initially to apply MMR and filter the best 3 from them.
docs_with_mmr = vectordb.max_marginal_relevance_search(question, k=3, fetch_k=6)

# Print the number of documents retrieved after applying MMR
print(len(docs_with_mmr))

# Iterate through each of the retrieved documents and print their content
for i in range(len(docs_with_mmr)):
    # Print the content of each document
    print(docs_with_mmr[i].page_content)
    # Print a separator line for better readability between document contents
    print('='*140)

2. Lack of specificity: The question may concern one particular document, but the retrieved chunks may include information from other documents.

  **Addressing Specificity: Working with metadata - Manually**

  **Working with metadata using self-query retriever - Automatically**

**Example 2. Addressing Specificity: Working with metadata - Manually**

In [None]:
# Perform a similarity search in the vector database for the given question, without any filtering based on metadata.
# The 'k=5' argument specifies that the top 5 most relevant documents should be returned for the question.
question = "What is variance?"

docs = vectordb.similarity_search(question, k=5)

# Iterate through each of the retrieved documents and print their metadata
# Metadata contains information about the source or the origin of the document, e.g., the file from which it was fetched.
for doc in docs:
    # Print the metadata of each document to show the source details
    print(doc.metadata)

We can filter the results based on metadata.

In [None]:
# Perform a similarity search restricted by metadata.
# 'k=5' returns the top 5 most relevant chunks.
# 'filter' keeps only chunks whose metadata matches the given value. The question is about PCA,
# so we restrict the search to the PCA document ('/content/pca_d1.pdf').
question = "what is the role of variance in pca?"
docs = vectordb.similarity_search(
    question,
    k=5,
    filter={"source":'/content/pca_d1.pdf'}     # manually passing a metadata filter
)

# Iterate through each of the retrieved documents and print their metadata
# Metadata contains information about the source of the document, helping to trace the origin of the retrieved answer.
for doc in docs:
    # Print the metadata of each document to show the source details, filtered by the source attribute
    print(doc.metadata)
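
The effect of a metadata filter can be illustrated without a vector store. The sketch below is a toy model with hypothetical file names and placeholder chunk texts: `filtered_top_k` is not a LangChain function, and real vector stores usually apply the filter inside the index query rather than on an already ranked list.

```python
# Hypothetical search hits, already sorted from most to least similar
ranked_hits = [
    {"page_content": "chunk A", "metadata": {"source": "a.pdf", "page": 0}},
    {"page_content": "chunk B", "metadata": {"source": "b.pdf", "page": 2}},
    {"page_content": "chunk C", "metadata": {"source": "a.pdf", "page": 1}},
    {"page_content": "chunk D", "metadata": {"source": "b.pdf", "page": 0}},
]


def filtered_top_k(hits, k, metadata_filter):
    """Keep hits whose metadata matches every key/value pair, then take the first k."""
    matching = [
        hit for hit in hits
        if all(hit["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return matching[:k]


result = filtered_top_k(ranked_hits, k=2, metadata_filter={"source": "b.pdf"})
print([hit["page_content"] for hit in result])  # ['chunk B', 'chunk D']
```

Without the filter, the top two hits would be chunks A and B, which come from different files. With it, every returned chunk is guaranteed to come from the requested source.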

In [None]:
# Perform a max marginal relevance (MMR) search, combined with a metadata filter.
# 'k=2' returns 2 chunks after applying MMR.
# 'fetch_k=5' fetches 5 candidate chunks first, from which MMR selects the 2 to return.
# 'filter' restricts the search to chunks from the PCA document, since the question is about PCA.
docs_with_mmr = vectordb.max_marginal_relevance_search(
    question,
    k=2,               # Number of chunks to return after applying MMR
    fetch_k=5,         # Number of candidate chunks to fetch before applying MMR
    filter={"source":'/content/pca_d1.pdf'}  # Filter the chunks by the source metadata (pca_d1.pdf)
)

In [None]:
# Iterate over the documents returned by the MMR search
for i in range(len(docs_with_mmr)):
    # Print the page content of each document
    print(docs_with_mmr[i].page_content)
    # Print a separator line for clarity
    print('='*140)

[**Addressing Specificity - Automatically: Working with metadata using self-query retriever**](https://python.langchain.com/docs/how_to/self_query/)

### **Additional tricks: Compression**

Another approach for improving the quality of retrieved docs is compression. Information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

[Contextual compression](https://python.langchain.com/docs/how_to/contextual_compression/) is meant to fix this.
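
The idea can be illustrated with a toy compressor that keeps only the sentences sharing at least one word with the query. This is a deliberately crude stand-in: `compress` is a hypothetical helper, the sample text is made up, and keyword overlap is an assumption chosen for simplicity. The LangChain guide linked above describes compressors that typically rely on an LLM or on embeddings instead.

```python
import re


def compress(document: str, query: str) -> str:
    """Keep only the sentences that share at least one word with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence for sentence in sentences
        if query_terms & set(re.findall(r"\w+", sentence.lower()))
    ]
    return " ".join(kept)


doc_text = (
    "The workshop starts at nine. "
    "PCA projects data onto directions of maximum variance. "
    "Lunch is served in the hall."
)
compressed = compress(doc_text, "PCA variance")
print(compressed)  # PCA projects data onto directions of maximum variance.
```

Only the compressed text would be passed on to the LLM, which shortens the prompt and removes distracting content.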

## **Retrieval**

**[Vectorstore as a retriever](https://python.langchain.com/docs/how_to/vectorstore_retriever/)**

**Better Approach**

In [None]:
# Without MMR (Max Marginal Relevance)
question = "What is principal component analysis?"

# Initialize the retriever with search parameters to return 3 documents
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Retrieve documents based on the question
docs = retriever.invoke(question)

# Display the retrieved documents
docs

In [None]:
# With MMR (Max Marginal Relevance)
retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 5})

# Retrieve documents based on the question using MMR
docs = retriever.invoke(question)

# Display the retrieved documents
docs

## **Augmentation**

In [None]:
# PromptTemplate formats prompts for LLMs
from langchain_core.prompts import PromptTemplate

# StrOutputParser converts the LLM's output message into a plain string
from langchain_core.output_parsers import StrOutputParser

# RunnableParallel runs several runnables on the same input; RunnablePassthrough passes its input through unchanged
# Both are building blocks of LCEL (LangChain Expression Language)
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

In [None]:
# Build prompt template for the question-answering system
# {context} is filled with the retrieved chunks and {question} with the user's question;
# the model's answer continues after "Helpful Answer:".
# (Comments must stay outside the triple-quoted string, otherwise they become part of the prompt.)
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

# Create the PromptTemplate instance with the specified variables and template
QA_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)
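
To see what "augmentation" produces, the sketch below fills a shortened copy of the template using plain `str.format` as a stand-in for `PromptTemplate`. The names `toy_template` and `format_chunks` and the placeholder chunk texts are assumptions made for illustration. Joining the chunk texts before formatting is one common way to keep metadata and object representations out of the prompt.

```python
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def format_chunks(chunks):
    """Join the retrieved chunk texts into one context string."""
    return "\n\n".join(chunks)


retrieved = ["first retrieved chunk", "second retrieved chunk"]  # placeholder texts
prompt_text = toy_template.format(
    context=format_chunks(retrieved),
    question="What is PCA?",
)
print(prompt_text)
```

The printed text is what the LLM would receive: the instructions, then the retrieved context, then the question.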

## **Creating final RAG Chain**

> <img src='https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F63f8a8482c9ec06a8d7d1041514f87c06dd108a9-3442x942.png&w=3840&q=75' width=1200px>

[[Image source](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/)]

The figure above describes the LCEL flow using `RunnableParallel` and `RunnablePassthrough`.

A Runnable is a **unit of execution** in the LangChain framework. It represents a specific task or operation that can be performed.

Examples of Runnables include data transformations, computations, or any other operation that can be **expressed** in LCEL (LangChain Expression Language).

[RunnableLambda](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.RunnableLambda.html) is a LangChain abstraction that turns a Python function into a **pipe-compatible** Runnable.

[RunnablePassthrough](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.passthrough.RunnablePassthrough.html) on its own passes inputs through unchanged. It is typically **used in conjunction with [RunnableParallel](https://python.langchain.com/v0.1/docs/expression_language/interface/#parallelism)** to pass data through to a new key in the map.

The **RunnableParallel** object allows us to define multiple values and operations, and run them all in parallel.

The **RunnablePassthrough** object acts as a “passthrough”: it takes the input to the current component ('retrieval' in the figure above) and makes it available in the component's output under the “question” key or any other custom key.
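
The composition pattern described above can be mimicked in plain Python to show how the pieces fit together. Everything in this sketch is hypothetical: `Step`, `passthrough`, and `parallel` are toy stand-ins written for this illustration, not LangChain classes, and the toy `parallel` runs its steps one after another rather than concurrently.

```python
class Step:
    """Toy stand-in for a Runnable: wraps a function and supports the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that runs a first, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


def passthrough():
    """Return the input unchanged."""
    return Step(lambda value: value)


def parallel(steps):
    """Give the same input to every step and collect the results under their keys."""
    return Step(lambda value: {key: step.invoke(value) for key, step in steps.items()})


fake_retriever = Step(lambda question: [f"chunk about {question}"])
retrieval_sketch = parallel({"context": fake_retriever, "question": passthrough()})
to_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")

chain_sketch = retrieval_sketch | to_prompt
print(chain_sketch.invoke("PCA"))
# Context: ['chunk about PCA']
# Question: PCA
```

The parallel step turns a single question into a dictionary with `context` and `question` keys, which is the shape a two-variable prompt template expects as input.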

In [None]:
def get_context_info(question):
    # Create a retriever object from the vectordb instance
    # The retriever will use the "mmr" search type (Maximum Marginal Relevance) to fetch documents
    retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k": 5})

    # The 'invoke' method of the retriever is used to retrieve relevant documents based on the input question
    # 'k' specifies how many relevant documents to return after applying MMR, and 'fetch_k' controls how many documents are initially fetched
    docs = retriever.invoke(question)

    # Return the retrieved documents as the context for answering the question
    return docs

In [None]:
from langchain_core.runnables import RunnableLambda  # Importing the RunnableLambda class

# Create a RunnableParallel object that defines how to parallelize tasks
retrieval = RunnableParallel(
    {
        # The "context" task will execute a Lambda function that calls get_context_info on the "question" key from input x
        "context": RunnableLambda(lambda x: get_context_info(x["question"])),

        # The "question" task simply returns the value of "question" from the input x
        "question": RunnableLambda(lambda x: x["question"])
    }
)

In [None]:
retrieval.invoke({"question": "What is PCA ?"})

In [None]:
retrieval.invoke({"question": "How do ensemble methods work?"})

In [None]:
# RAG Chain

rag_chain = (retrieval                     # Retrieval
             | QA_PROMPT                   # Augmentation
             | llm                         # Generation
             | StrOutputParser()
             )

In [None]:
response = rag_chain.invoke({"question": "What is PCA ?"})

response

In [None]:
response = rag_chain.invoke({"question": "What is principal component analysis?"})

response

In [None]:
response = rag_chain.invoke({"question": "How do ensemble methods work?"})

print(response)

In [None]:
# For queries whose answers are not in the documents
response = rag_chain.invoke({"question": "Who is the CEO of OpenAI?"})

print(response)

[**Details of Chroma through LangChain**](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

## Reusing Vector DB

### **Download the vector DB**

In [None]:
# Zip the entire folder
!zip -r /content/docs.zip /content/docs

In [None]:
from google.colab import files
files.download("/content/docs.zip")

### **Upload the vector db from previous step and unzip**

In [None]:
!unzip /content/docs.zip  -d /

In [None]:
from langchain_chroma import Chroma  # Importing Chroma for vector database management
from langchain_openai import OpenAIEmbeddings  # Importing OpenAI Embeddings for text embeddings

# Initializing the OpenAI embedding model
embedding = OpenAIEmbeddings(model='text-embedding-3-small')

# Setting up the Chroma vector database
vectordb = Chroma(persist_directory = 'docs/chroma/',  # Path to persist the vector database on disk
                  embedding_function = embedding  # Passing the OpenAI embedding model as the function for generating vector embeddings
                  )

### **Re-ranking example with Open-source model**

* [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
* [MS MARCO Cross-Encoders](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html) for Re-ranking
  * **Usage with SentenceTransformers:** pre-trained models can be used like this:

In [None]:
# Define a query and some candidate sentences
query = "I love programming in Python."

# Some toy data representing candidate sentences/documents
candidates = [
    "Python is a great programming language.",
    "I enjoy long walks on the beach.",
    "Machine learning can be used to build models.",
    "I like writing code in Python.",
    "Artificial intelligence is fascinating."
]

In [None]:
Paragraph1 = candidates[0]
Paragraph2 = candidates[1]
Paragraph3 = candidates[2]

In [None]:
from sentence_transformers import CrossEncoder  # Importing the CrossEncoder class from sentence_transformers

model_name = 'cross-encoder/ms-marco-TinyBERT-L-2-v2'  # Defining the model name for the CrossEncoder

# Initializing the CrossEncoder model with the specified pre-trained model and a maximum sequence length of 512 tokens
model = CrossEncoder(model_name, max_length=512)

# Predicting relevance scores for pairs of query and paragraphs using the CrossEncoder model
scores = model.predict([(query, Paragraph1), (query, Paragraph2), (query, Paragraph3)])

# Printing the resulting relevance scores
print(scores)
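
Scoring is only half of re-ranking: the candidates still have to be sorted by score and truncated. The sketch below shows that step with a placeholder scorer so that it runs without downloading a model. `overlap_score` and `rerank` are hypothetical helpers, and word overlap is only a stand-in; in the notebook, the scores would come from the cross-encoder instead.

```python
def overlap_score(query, passage):
    """Placeholder scorer: fraction of query words that also occur in the passage."""
    query_words = set(query.lower().replace(".", "").split())
    passage_words = set(passage.lower().replace(".", "").split())
    return len(query_words & passage_words) / len(query_words)


def rerank(query, passages, score_fn, top_n):
    """Score every (query, passage) pair and return the top_n pairs, best first."""
    scored = [(score_fn(query, passage), passage) for passage in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


toy_query = "I love programming in Python."
toy_passages = [
    "Python is a great programming language.",
    "I enjoy long walks on the beach.",
    "Machine learning can be used to build models.",
    "I like writing code in Python.",
    "Artificial intelligence is fascinating.",
]
top_two = rerank(toy_query, toy_passages, overlap_score, top_n=2)
for score, passage in top_two:
    print(round(score, 2), passage)
# 0.6 I like writing code in Python.
# 0.4 Python is a great programming language.
```

In a retrieve-and-re-rank pipeline, the vector store first returns a larger candidate set cheaply, and the more expensive scorer is applied only to those candidates.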

* **Usage with Transformers**

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification  # Importing the required modules from transformers library
import torch  # Importing PyTorch for tensor operations

# Loading the pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Loading the tokenizer for the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenizing the input data (query and paragraphs) with padding and truncation options
features = tokenizer([query, query, query], [Paragraph1, Paragraph2, Paragraph3], padding=True, truncation=True, return_tensors="pt")

# Setting the model to evaluation mode (important for models that have dropout layers, etc.)
model.eval()

# Disable gradient calculation as we are in inference mode (to save memory and computations)
with torch.no_grad():
    # Getting the model's output logits (raw, unnormalized relevance scores; higher means more relevant)
    scores = model(**features).logits

    # Printing the raw logits (scores) for each input pair (query, paragraph)
    print(scores)