## Building a Retrieval-Augmented Generation (RAG) System with LangChain

### Introduction

In this notebook, we will learn how to build a Retrieval-Augmented Generation (RAG) system using LangChain in Python. RAG systems combine information retrieval and natural language generation to produce answers that are grounded in external knowledge bases. This approach is particularly useful when dealing with large documents or datasets where direct querying isn’t efficient or possible.
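To make the retrieve-then-generate idea concrete before bringing in any library, here is a toy sketch in plain Python. It is only an illustration: the word-overlap scoring stands in for embedding search, no language model is called, and the names retrieve, build_prompt, and knowledge_base are made up for this example.

```python
from typing import List


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks by how many words they share with the question."""
    question_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(question_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:k]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Place the retrieved chunks in front of the question."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"


knowledge_base = [
    "FAISS is a library for similarity search over dense vectors.",
    "Transformers rely on attention instead of recurrence.",
    "LangChain helps build applications around language models.",
]

question = "What is FAISS used for?"
top_chunks = retrieve(question, knowledge_base, k=1)
prompt_text = build_prompt(question, top_chunks)
print(prompt_text)  # in a real system, this prompt would be sent to a language model
```

The rest of the notebook follows the same two stages, with embeddings and a vector store doing the retrieval and a chat model doing the generation.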

### Objectives

- Understand the concept of Retrieval-Augmented Generation (RAG).
- Learn how to use LangChain to implement a RAG system.
- Implement the system step by step with guided TODO tasks.
- Test your implementation at each step.
- Provide helpful explanations and definitions.

Help

### Methods Used:

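- LangChain: A library for building language model applications.
- VectorStore (FAISS): A library for efficient similarity search and clustering of dense vectors, used here as the vector store.
- Embeddings: Vector representations of text that capture semantic meaning. The code in this notebook uses Google Generative AI embeddings; OpenAI Embeddings play the same role.
- Retrieval QA chain: Combines retrieval and question-answering over documents, built here with the LangChain Expression Language (LCEL).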

### Data Used

- I extracted some chapters of the Gen AI course as a txt file. 
- The goal of this notebook is to build a RAG system that can answer questions based on the content of these chapters.

## Step 1: Set Up Your Environment

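We need to import the required modules and load the API key. The code below uses Google's Gemini models, so the key is read from the GOOGLE_API_KEY environment variable (for example, from a .env file).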

In [1]:
# Import necessary libraries
import os
import sys
from typing import List

from dotenv import load_dotenv
# The older langchain.vectorstores, langchain.document_loaders and langchain.text_splitter
# import paths are deprecated in recent LangChain releases; these are their current homes.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [2]:
load_dotenv()
sys.path.append("../")

## Step 2: Load and Split Documents

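Load the document you want to use and split it into manageable chunks. The splitter below targets chunks of at most 500 characters, with up to 50 characters of overlap between consecutive chunks so that text near a boundary is not cut off from its context.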

In [3]:
# TODO: Load your document and split it into chunks
# Hint: Use TextLoader and RecursiveCharacterTextSplitter

filename = "../data/gen_ai_course.txt"
# Answer:
loader = TextLoader(filename)  
documents = loader.load() 

# Answer:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)  
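To see what chunk_size and chunk_overlap control, here is a simplified, standalone sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to split on separators such as paragraph and line breaks before falling back to smaller units); the function name chunk_text and the sample string are made up for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be between 0 and chunk_size - 1")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij"
for piece in chunk_text(sample, chunk_size=4, chunk_overlap=1):
    print(piece)
# prints: abcd, defg, ghij (each chunk starts with the last character of the previous one)
```

A larger overlap repeats more text across chunks, which costs storage and embedding calls but makes it less likely that a relevant passage is split in half.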


## Step 3: Create Embeddings and Build the VectorStore

Generate embeddings for each chunk and store them in a vector store for efficient retrieval.

In [4]:
from dotenv import load_dotenv
import os

load_dotenv()
google_api_key = os.getenv("GOOGLE_API_KEY")


In [5]:
# TODO: Create embeddings and store them in a VectorStore
# Hint: Use an embeddings class (GoogleGenerativeAIEmbeddings here; OpenAIEmbeddings also works) and FAISS
# Answer:
# Initialize the embeddings with the API key loaded above
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=google_api_key)

# Create the VectorStore with FAISS
vectorstore = FAISS.from_documents(docs, embeddings)
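The idea behind retrieval from a vector store can be sketched without FAISS: every chunk becomes a vector, the question becomes a vector, and the closest chunks are returned. The example below uses cosine similarity on hand-written two-dimensional vectors purely for illustration. Real embeddings have hundreds of dimensions, the metric FAISS actually uses depends on how the index is configured, and the names cosine_similarity, top_k, toy_chunks, and toy_vectors are assumptions of this sketch.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2) -> List[int]:
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), index) for index, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


# Hand-written toy vectors standing in for real embeddings
toy_chunks = ["attention mechanisms", "tokenization with BPE", "multi-head attention"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vector = [1.0, 0.0]  # pretend this is the embedded question

best = top_k(query_vector, toy_vectors, k=2)
print([toy_chunks[i] for i in best])
```

A brute-force loop like this is fine for a handful of chunks; a library such as FAISS exists to make the same kind of lookup fast over very large collections.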


## Step 4: Set Up the QA Chain using LCEL 

Create a chain that can retrieve relevant chunks and generate answers based on them.

In [6]:
# TODO: Create a retrieval QA chain
# Hint: Use a chat model (ChatGoogleGenerativeAI here; ChatOpenAI also works), create a prompt, and use StrOutputParser
# Hint: The chain should be an LCEL chain https://python.langchain.com/v0.1/docs/expression_language/get_started/
# See a full RAG prompt at https://smith.langchain.com/hub/rlm/rag-prompt

# Answer (Gemini):
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    max_tokens=None,
    timeout=20,
    max_retries=2,
)

def format_docs(docs: List[Document]) -> str:
    # Join the text of each chunk, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_messages(
    messages=[
        ("system", "You are a question-answering chatbot. Provide answers based on the provided documents."),
        ("human", "Question: {question}\n\nRelevant Documents:\n{formatted_docs}")
    ]
)

# Baseline: pass every chunk to the model. This does not query the vector store,
# so it only works while the whole document fits in the model's context window.
formatted_docs = format_docs(docs)
qa_chain = prompt | llm

# Retrieval-based variant: fetch only the chunks most similar to the question.
# Usage: rag_chain.invoke("your question") returns the answer as a string.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
rag_chain = (
    {"formatted_docs": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
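If the pipe syntax in prompt | llm looks unfamiliar, the following toy sketch shows the general mechanism: an object that defines the | operator can return a new object that runs one step and feeds its result to the next. This is not LangChain's implementation, only an illustration of the composition idea; the Step class, format_prompt, and fake_llm are hypothetical names, and no model is called.

```python
from typing import Any, Callable


class Step:
    """A tiny stand-in for a chain component: wraps a function and supports `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # Build a new step that runs this step first, then passes the result on
        return Step(lambda value: other.invoke(self.invoke(value)))


# First step: fill a template from a dict of inputs
format_prompt = Step(
    lambda inputs: f"Question: {inputs['question']}\n\nRelevant Documents:\n{inputs['formatted_docs']}"
)

# Second step: a fake "model" that just wraps the prompt it receives
fake_llm = Step(lambda prompt_text: f"ANSWER[{prompt_text}]")

toy_chain = format_prompt | fake_llm
print(toy_chain.invoke({"question": "What is RAG?", "formatted_docs": "(retrieved chunks go here)"}))
```

Because each composed object exposes the same invoke method, further steps such as an output parser can be appended with another | in the same way.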



## Step 5: Ask Questions and Get Answers

Test the system by asking a question.

In [7]:
# TODO: Ask a question to the QA chain
# Replace 'Your question here' with an actual question and run the qa_chain for this question

# Answer:
query_1 = "What is the main topic discussed in the document?"
result = qa_chain.invoke({
    "question": query_1,
    "formatted_docs": formatted_docs
}).content  # .content extracts the answer text from the returned AIMessage
print(result)


This document provides a comprehensive overview of Large Language Models (LLMs), focusing on their architecture, training, and applications, especially within the context of Generative AI.  It covers topics such as transformer models, pre-training and fine-tuning LLMs, retrieval augmented generation (RAG), and the use of tools and agents with LLMs.  It also touches upon the history of language models, leading up to the transformer architecture, and discusses various optimization techniques and evaluation methods.



## Step 6: Test Your Implementation with Different Questions

Try out different questions to see how the system performs.

In [8]:
# Replace 'Another question here' with your own question and run the qa_chain for this question

query = "Can you summarize the key points mentioned?"

# Without .content, print() shows the full AIMessage repr (content='...'), as in the output below
result = qa_chain.invoke({
    "question": query,
    "formatted_docs": formatted_docs
})
print(result)


content='This document covers a wide range of topics related to Large Language Models (LLMs), focusing on their construction, training, optimization, and applications.\n\n**Key Concepts:**\n\n* **Building LLMs:** This involves pre-training (using cross-entropy loss, tokenization (BPE), data preprocessing, scaling laws) and post-training (fine-tuning).  Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF) using reward models and Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO) are crucial for aligning LLMs with human preferences. Evaluation is done using datasets like IFEval, BBH, MMLU-Pro, and Math.\n* **Transformers Architecture:** The core of LLMs, replacing older architectures like RNNs and LSTMs. Key components include self-attention/cross-attention, multi-head attention, residual connections, layer normalization, feed-forward layers, softmax layer, and positional embeddings.  Attention mechanisms allow the model to weigh the imp

In [12]:
query_2 = "Can you summarize the Transformer's Architecture?"

result = qa_chain.invoke({
    "question": query_2,
    "formatted_docs": formatted_docs
})
print(result)

content="The Transformer architecture replaces recurrent networks with a self-attention mechanism.  Key components include:\n\n* **Self-Attention/Cross Attention:** This mechanism allows the model to weigh the importance of different parts of the input sequence when generating an output.  Self-attention focuses on relationships within a single sequence, while cross-attention relates two different sequences.\n* **Multi-Head Attention:**  This performs multiple attention calculations in parallel, allowing the model to capture different relationships between words.  These parallel calculations are then combined.\n* **Residual Connections & Layer Normalization:** Residual connections help mitigate vanishing gradients and keep information local. Layer normalization stabilizes hidden state dynamics.\n* **Feed Forward Layer:**  This consists of two linear transformations with a ReLU activation in between, applied to each position independently.  It helps the model learn complex non-linear rel

In [10]:
query_3 = "What is the main difference between a Transformer and an LSTM?"

result = qa_chain.invoke({
    "question": query_3,
    "formatted_docs": formatted_docs
})
print(result)

content='The main difference between a Transformer and an LSTM lies in how they process sequential data. LSTMs process data sequentially, maintaining a "memory" of past information using a cell state and gates to control information flow.  This sequential nature makes them slower for long sequences and susceptible to vanishing gradients, although LSTMs mitigate this issue better than standard RNNs.  Transformers, on the other hand, process data in parallel using the attention mechanism. This allows them to capture relationships between all words in a sequence simultaneously, leading to better performance on long sequences and faster training times.  The attention mechanism also allows Transformers to weigh the importance of different parts of the input sequence, making them more effective at capturing long-range dependencies.  While LSTMs were a significant improvement over standard RNNs, Transformers have largely superseded them for many NLP tasks due to their superior performance and

## Step 7: Improve the System

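You can experiment with different parameters, such as the chunk size and overlap used by the text splitter, the embedding model, the prompt wording, or the language model itself, and compare the answers you get for the same questions.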

## Conclusion

Congratulations! You’ve built a simple Retrieval-Augmented Generation system using LangChain. This system can retrieve relevant information from documents and generate answers to user queries.

### Help

- TextLoader: Loads text data from files.
- RecursiveCharacterTextSplitter: Splits text into smaller chunks for better processing.
- FAISS: A library for efficient similarity search of embeddings.
- RetrievalQA Chain: A chain that retrieves relevant documents and answers questions based on them.
- OpenAIEmbeddings: Generates embeddings that capture the semantic meaning of text. This notebook uses GoogleGenerativeAIEmbeddings in the same role.

## Help

In [11]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate([
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
])

prompt_value = template.invoke(
    {
        "name": "Bob",
        "user_input": "What is your name?"
    }
)

# Output:
# ChatPromptValue(
#    messages=[
#        SystemMessage(content='You are a helpful AI bot. Your name is Bob.'),
#        HumanMessage(content='Hello, how are you doing?'),
#        AIMessage(content="I'm doing well, thanks!"),
#        HumanMessage(content='What is your name?')
#    ]
#)