## Building a Retrieval-Augmented Generation (RAG) System with LangChain

### Introduction

In this notebook, we will learn how to build a Retrieval-Augmented Generation (RAG) system using LangChain in Python. RAG systems combine information retrieval and natural language generation to produce answers that are grounded in external knowledge bases. This approach is particularly useful when dealing with large documents or datasets where direct querying isn’t efficient or possible.

### Objectives

- Understand the concept of Retrieval-Augmented Generation (RAG).
- Learn how to use LangChain to implement a RAG system.
- Implement the system step by step with guided TODO tasks.
- Test your implementation at each step.
- Provide helpful explanations and definitions.


### Methods Used:

- LangChain: A library for building language model applications.
- VectorStore (FAISS): A tool for efficient similarity search and clustering of dense vectors.
- Google Generative AI Embeddings: Vector representations of text that capture semantic meaning.
- RetrievalQA Chain: Combines retrieval and question-answering over documents.

### Data Used

- The data is a set of chapters from the Gen AI course, extracted to a plain-text file.
- The goal of this notebook is to build a RAG system that can answer questions based on the content of these chapters.

## Step 1: Set Up Your Environment

We need to import the required modules and load the Google API key from the environment.

In [1]:
import sys
from dotenv import load_dotenv
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents.base import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from typing import List
import os

In [2]:
load_dotenv()
sys.path.append("../")

In [3]:
DB_PASSWORD = os.getenv("DB_PASSWORD")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
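
Before calling any API, it can help to confirm that the key was actually loaded, without echoing it into the notebook output. The following is a minimal sketch, assuming the key lives in the `GOOGLE_API_KEY` environment variable; `require_env` and `mask` are hypothetical helpers, not part of LangChain or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to display."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example with a dummy variable (hypothetical name and value, not a real key).
os.environ["DEMO_KEY"] = "abcdef123456"
print(mask(require_env("DEMO_KEY")))
```

In the notebook, the same check could be applied to `GOOGLE_API_KEY` right after `load_dotenv()`.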

## Step 2: Load and Split Documents

Load the document you want to use and split it into manageable chunks.

In [5]:
filename = "../data/gen_ai_course.txt"

In [6]:
loader = TextLoader(filename)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
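
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is an illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one. That overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.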

## Step 3: Create Embeddings and Build the VectorStore

Generate embeddings for each chunk and store them in a vector store for efficient retrieval.

In [4]:
import os
"GOOGLE_API_KEY" in os.environ  # confirm the key is loaded without printing its value

In [7]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

In [8]:
vectorstore = FAISS.from_documents(docs, embeddings) 
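
Conceptually, the vector store maps each chunk to a vector and returns the chunks whose vectors are closest to the query vector. The sketch below mimics that idea with a toy bag-of-words "embedding" and cosine similarity. It is a stand-in under loud assumptions: real embedding models produce dense learned vectors, FAISS uses optimized index structures, and `embed`, `cosine`, and `top_k` are hypothetical helpers rather than library functions.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, int]:
    """Toy embedding: lowercase word counts (a real model returns a dense float vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks by similarity to the query and keep the k best."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "Transformers rely on attention layers.",
    "FAISS performs similarity search over vectors.",
    "Retrieval augmented generation combines retrieval and generation.",
]
print(top_k("what is retrieval augmented generation", toy_chunks, k=1))
```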

## Step 4: Set Up the RetrievalQA Chain

Create a chain that can retrieve relevant chunks and generate answers based on them.

In [14]:
llm = GoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)

template = """Answer the question based only on the following context:
{context} 
Question: {question}
"""

# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt

prompt = ChatPromptTemplate.from_template(template)
retriever = vectorstore.as_retriever()  # retrieves the relevant documents

def format_docs(docs: List[Document]):
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)
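
The data flow inside such a chain can be sketched in plain Python: retrieve chunks, join them into a context string, fill the prompt, and call the model. Everything here is a stand-in chosen for illustration. `fake_retriever`, `fake_llm`, `format_chunks`, and `answer` are hypothetical and call no LangChain or Gemini API; only the template text and the joining logic mirror the cell above.

```python
from typing import Callable, List

TEMPLATE = """Answer the question based only on the following context:
{context}
Question: {question}
"""


def format_chunks(chunks: List[str]) -> str:
    """Join retrieved chunks with blank lines, as format_docs does for Documents."""
    return "\n\n".join(chunks)


def answer(
    question: str,
    retriever: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Retrieve, build the prompt, then ask the model."""
    chunks = retriever(question)
    prompt_text = TEMPLATE.format(context=format_chunks(chunks), question=question)
    return llm(prompt_text)


def fake_retriever(question: str) -> List[str]:
    # Stand-in for a vector store lookup: always returns the same two chunks.
    return ["RAG retrieves documents.", "RAG then generates an answer."]


def fake_llm(prompt_text: str) -> str:
    # Echo the prompt so the exact text a real model would receive is visible.
    return prompt_text


print(answer("What is RAG?", fake_retriever, fake_llm))
```

Echoing the prompt is a handy debugging trick with a real chain too: if the context block is empty or off-topic, the problem lies in retrieval rather than in the model.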

In [15]:
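question = "What's a Retrieval Augmented Generation?"

# Optional: inspect what the retriever returns for this question.
retrieved_docs = retriever.get_relevant_documents(question)
formatted_docs = format_docs(retrieved_docs)

# RetrievalQA runs the retriever itself and fills {context}, so only "query" is passed.
answer = qa_chain.run({"query": question})
print(f"Question: {question}\nAnswer: {answer}")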

Question: What's a Retrieval Augmented Generation?
Answer: Retrieval-Augmented Generation (RAG) is a framework that combines retrieval-based and generation-based models. It enhances language models by giving them access to external knowledge bases or documents during generation. This allows the model to generate more accurate and up-to-date information by retrieving relevant data, rather than relying only on its internal parameters.



## Step 5: Ask Questions and Get Answers

Test the system by asking a question.

In [21]:
query = "What is the main topic discussed in the document?"
# Only "query" is needed: RetrievalQA retrieves fresh context for each question.
result = qa_chain.run({"query": query})
print(f"Question: {query}\nAnswer: {result}")

Question: What is the main topic discussed in the document?
Answer: Retrieval Augmented Generation (RAG)



In [24]:
query = "What is the main topic discussed in the document?"
result = qa_chain({"query": query})
result = result["result"]
print(f"Question: {query}")
print(f"Answer: {result}")

Question: What is the main topic discussed in the document?
Answer: Retrieval Augmented Generation (RAG) is the main topic, including its basic architecture, information retrieval techniques, and enhancements like context enrichment and multi-faceted filtering.



## Step 6: Test Your Implementation with Different Questions

Try out different questions to see how the system performs.

In [25]:
query = "Can you summarize the key points about Trasnformers?"
result = qa_chain({"query": query})
result = result["result"]
print(f"Question: {query}")
print(f"Answer: {result}")

Question: Can you summarize the key points about Trasnformers?
Answer: Transformers use an encoder-decoder architecture.  Word embeddings are used to represent words as numerical vectors, capturing semantic relationships and providing context.  These embeddings can be generated from one-hot encodings.  The model aims to handle long sequences and considers the relative positions of words.  Training efficiency is a goal.



In [26]:
query = "What is a Trasnformer?"
result = qa_chain({"query": query})
result = result["result"]
print(f"Question: {query}")
print(f"Answer: {result}")

Question: What is a Trasnformer?
Answer: Word embeddings are used to transform words into numerical values (vectors) providing semantic relationships between them.  One-hot encoding is also mentioned but embeddings are favored due to their ability to capture semantic meaning.  The model uses these embeddings to understand the context of a sentence.  The example shows how the word "Transformers" is embedded and how its position in the sentence contributes to its meaning ("at the beginning of the sentence").  The model aims to be fast trainable, though specific training methods are not detailed.  The architecture described involves a feed-forward layer and multi-head attention mechanisms.



In [27]:
query = "What are the steps of self-attention mechanism?"
result = qa_chain({"query": query})
result = result["result"]
print(f"Question: {query}")
print(f"Answer: {result}")

Question: What are the steps of self-attention mechanism?
Answer: Based on the provided text, the steps of the self-attention mechanism are not explicitly listed as ordered steps. However, the following concepts and calculations are associated with it:

1. **Components:** Query (What am I looking for?), Key (What do I have?), and Value (What do I reveal to others?).

2. **Calculations with Values (V):**  The text shows examples of calculations involving values (V1, V2, V3, V4) being multiplied by different weights.  For example: `1x V1`, `0.97 x V1`, `0.33 x V2`, etc.  These calculations appear to be weighted sums of the values.

3. **Method 1 WV:**  A method involving multiplication with a matrix "WV" and subsequent calculations, including additions and multiplications with the values (V).

4. **Softmax:** A softmax operation is applied to a set of numbers (4, 70, 0, 85, -4, -10, 0, 0, 2, 0, 0, 0, 3, -3, 4, 5), dividing each by 128.  This suggests softmax is part of the process.

The 

## Step 7: Improve the System

You can experiment with different parameters, like adjusting the chunk size or using a different language model.

## Conclusion

Congratulations! You’ve built a simple Retrieval-Augmented Generation system using LangChain. This system can retrieve relevant information from documents and generate answers to user queries.

### Help

- TextLoader: Loads text data from files.
- RecursiveCharacterTextSplitter: Splits text into smaller chunks for better processing.
- FAISS: A library for efficient similarity search of embeddings.
- RetrievalQA Chain: A chain that retrieves relevant documents and answers questions based on them.
- GoogleGenerativeAIEmbeddings: Generates embeddings that capture the semantic meaning of text.

## Help

In [9]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate([
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
])

prompt_value = template.invoke(
    {
        "name": "Bob",
        "user_input": "What is your name?"
    }
)

# Output:
# ChatPromptValue(
#    messages=[
#        SystemMessage(content='You are a helpful AI bot. Your name is Bob.'),
#        HumanMessage(content='Hello, how are you doing?'),
#        AIMessage(content="I'm doing well, thanks!"),
#        HumanMessage(content='What is your name?')
#    ]
#)

messages=[SystemMessage(content='You are a helpful AI bot. Your name is Bob.', additional_kwargs={}, response_metadata={}), HumanMessage(content='Hello, how are you doing?', additional_kwargs={}, response_metadata={}), AIMessage(content="I'm doing well, thanks!", additional_kwargs={}, response_metadata={}), HumanMessage(content='What is your name?', additional_kwargs={}, response_metadata={})]
