## Building a Retrieval-Augmented Generation (RAG) System with LangChain

### Introduction

In this notebook, we will learn how to build a Retrieval-Augmented Generation (RAG) system using LangChain in Python. RAG systems combine information retrieval and natural language generation to produce answers that are grounded in external knowledge bases. This approach is particularly useful when dealing with large documents or datasets where direct querying isn’t efficient or possible.

### Objectives

- Understand the concept of Retrieval-Augmented Generation (RAG).
- Learn how to use LangChain to implement a RAG system.
- Implement the system step by step with guided TODO tasks.
- Test your implementation at each step.
- Provide helpful explanations and definitions.

Help

### Methods Used:

- LangChain: A framework for building applications on top of language models.
- VectorStore (FAISS): A library for efficient similarity search and clustering of dense vectors.
- Google Generative AI Embeddings: Vector representations of text that capture semantic meaning.
- Retrieval QA chain (LCEL): Combines retrieval and question answering over documents.

### Data Used

- I extracted some chapters of the Gen AI course into a txt file.
- The goal of this notebook is to build a RAG system that can answer questions based on the content of these chapters.

## Step 1: Set Up Your Environment

We need to import the required modules and set up the Google API key.

In [None]:
# Clone the course repository
!git clone https://github.com/BastinFlorian/GenAI-Dauphine-Course.git

Cloning into 'GenAI-Dauphine-Course'...
remote: Enumerating objects: 83, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 83 (delta 36), reused 71 (delta 26), pack-reused 0 (from 0)
Receiving objects: 100% (83/83), 4.47 MiB | 26.33 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [None]:
!pip install ipykernel==5.5.6

Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel==5.5.6)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2


In [None]:
!pip show ipykernel
!pip show google-colab

Name: ipykernel
Version: 6.29.5
Summary: IPython Kernel for Jupyter
Home-page: https://ipython.org
Author: 
Author-email: IPython Development Team <ipython-dev@scipy.org>
License: BSD 3-Clause License

In [None]:
# Install the requirements
!pip install -r /content/GenAI-Dauphine-Course/requirements.txt

Collecting jupyter>=1.0.0 (from -r /content/GenAI-Dauphine-Course/requirements.txt (line 4))
  Downloading jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting faiss-cpu (from -r /content/GenAI-Dauphine-Course/requirements.txt (line 8))
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting langchain-community (from -r /content/GenAI-Dauphine-Course/requirements.txt (line 9))
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting tiktoken>=0.7.0 (from -r /content/GenAI-Dauphine-Course/requirements.txt (line 10))
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting langchain-openai (from -r /content/GenAI-Dauphine-Course/requirements.txt (line 11))
  Downloading langchain_openai-0.2.10-py3-none-any.whl.metadata (2.6 kB)
Collecting llama-parse (from -r /content/GenAI-Dauphine-Course/requirements.txt (line 12))
  Dow

In [None]:
!pip install langchain-google-genai

Collecting langchain-google-genai
  Downloading langchain_google_genai-2.0.6-py3-none-any.whl.metadata (3.6 kB)
Downloading langchain_google_genai-2.0.6-py3-none-any.whl (41 kB)
Installing collected packages: langchain-google-genai
Successfully installed langchain-google-genai-2.0.6


In [None]:
# Import necessary libraries
import sys
from typing import List

from dotenv import load_dotenv
from langchain import hub
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
load_dotenv()
sys.path.append("../")

In [None]:
# Check that the required packages are installed
import sys
import os

# Report the Python and pip versions
print("Python version:", sys.version)
print("Pip version:", os.popen('pip --version').read())

# Import the libraries to confirm they are installed
try:
    from dotenv import load_dotenv
    import langchain
    from langchain_community.vectorstores import FAISS
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import TextLoader
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
    print("All libraries are installed.")
except ImportError as e:
    print(f"Import error for {e.name}")

Python version: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]
Pip version: pip 24.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)

All libraries are installed.


In [None]:
# Placeholder only: never commit a real key to a shared notebook
GOOGLE_API_KEY = "YOUR_GOOGLE_API_KEY"

In [None]:
import os
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
api_key = os.getenv("GOOGLE_API_KEY")
# Confirm the key is set without echoing the secret
print("GOOGLE_API_KEY set:", bool(api_key))

GOOGLE_API_KEY set: True


In [None]:
import os

from dotenv import load_dotenv

# Load environment variables from a .env file, if one exists
load_dotenv()

# Check that the key is available in the environment
google_key = os.getenv("GOOGLE_API_KEY")

if google_key:
    print("Google API key loaded successfully.")
else:
    print("Error: Google API key not loaded.")


Google API key loaded successfully.


In [None]:
import sys

# Check that the required path has been added
if "../" in sys.path:
    print("The path was added successfully.")
else:
    sys.path.append("../")
    print("Path added successfully.")

The path was added successfully.


In [None]:
!python --version

Python 3.10.12


## Step 2: Load and Split Documents

Load the document you want to use and split it into manageable chunks.

In [None]:
# TODO: Load your document and split it into chunks
# Hint: Use TextLoader and RecursiveCharacterTextSplitter

# Specify the filename
filename = "/content/GenAI-Dauphine-Course/data/gen_ai_course.txt"
# Load the document
loader = TextLoader(filename)
documents = loader.load()

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
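
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of overlapping fixed-size windows. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and sentences, so its chunks will differ. The helper name `chunk_text` and the sample text are assumptions made for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper, not the LangChain splitter's actual algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 1200-character sample so the window boundaries are easy to inspect
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.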

In [None]:
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Check that the key is loaded without printing the secret itself
google_api_key = os.getenv("GOOGLE_API_KEY")
print(f"Google API key loaded: {bool(google_api_key)}")

Google API key loaded: True


## Step 3: Create Embeddings and Build the VectorStore

Generate embeddings for each chunk and store them in a vector store for efficient retrieval.

In [None]:
# TODO: Create embeddings and store them in a VectorStore
# Hint: Use GoogleGenerativeAIEmbeddings(model=...) and FAISS
embeddings = ...
vectorstore = ...
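
The idea behind this step can be sketched without any API call: turn each chunk into a vector, then rank chunks by similarity to the query vector. The bag-of-words `embed` function below is a deliberately crude, hypothetical stand-in for a real embedding model, and `top_k` stands in for the vector store lookup. In the notebook, `GoogleGenerativeAIEmbeddings` and a FAISS index built with something like `FAISS.from_documents(docs, embeddings)` would take over these roles.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (assumption, not a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


chunks = [
    "retrieval finds relevant chunks in a vector store",
    "embeddings map text to dense vectors",
    "the weather is sunny today",
]
print(top_k("how are relevant chunks found by retrieval", chunks, k=1))
```

A real embedding model also matches paraphrases that share no words with the query, which word counts cannot do. That is the main reason to use learned embeddings here.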

## Step 4: Set Up the QA Chain using LCEL

Create a chain that can retrieve relevant chunks and generate answers based on them.

In [None]:
llm = ...  # Initialize ChatGoogleGenerativeAI with the required arguments

# Create a function to format documents for the prompt
def format_docs(docs: List[Document]):
    # Hint: Join the content of each document
    return ...  # Join the page content of docs into a string

# Hint: Define the prompt template with system and human messages. See help below
prompt = ...

# Hint: Format the documents using the function above
formatted_docs = ...

# Hint: Create the QA chain by combining the prompt and model
qa_chain = ...
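
The data flow of such a chain (format the retrieved chunks, fill the prompt, call the model) can be sketched in plain Python. Everything here is illustrative: `Doc` is a hypothetical stand-in for LangChain's `Document`, `build_prompt` uses an assumed prompt wording, and `fake_llm` is a stub that never calls an API. The LCEL version would compose the same stages with the `|` operator.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (page_content only)."""
    page_content: str


def format_docs(docs: list[Doc]) -> str:
    # Separate chunks with blank lines so they read as distinct passages
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(context: str, question: str) -> str:
    # Assumed wording; any instruction that grounds the answer in the context works
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def fake_llm(prompt: str) -> str:
    # Stub model: reports the prompt size instead of generating text
    return f"(stub answer based on a prompt of {len(prompt)} characters)"


def answer(question: str, retrieved: list[Doc]) -> str:
    # Same order as the chain: format -> prompt -> model
    return fake_llm(build_prompt(format_docs(retrieved), question))


retrieved = [Doc("RAG combines retrieval and generation."), Doc("FAISS stores vectors.")]
print(format_docs(retrieved))
print(answer("What is RAG?", retrieved))
```

Keeping each stage a small function makes it easy to print the intermediate prompt when an answer looks wrong.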


## Step 5: Ask Questions and Get Answers

Test the system by asking a question.

In [None]:
# TODO: Ask a question to the QA chain
# Keep the sample question below or replace it with your own, then run qa_chain on it

# Answer:
query = "What is the main topic discussed in the document?"
result = ...
print(result)

Ellipsis


## Step 6: Test Your Implementation with Different Questions

Try out different questions to see how the system performs.

In [None]:
# Keep the sample question below or replace it with your own, then run qa_chain on it

query = "Can you summarize the key points mentioned?"
result = ...
print(result)

Ellipsis


## Step 7: Improve the System

You can experiment with different parameters, like adjusting the chunk size or using a different language model.
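
When tuning the chunk size, it helps to estimate how many chunks (and therefore how many embedding calls) a setting produces. The sketch below assumes plain fixed-size windows with overlap. The real splitter cuts on separators, so treat the numbers as a rough guide. The function name `estimate_chunk_count`, the 10,000-character document length, and the 10% overlap are assumptions chosen for illustration.

```python
def estimate_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Rough number of overlapping fixed-size windows needed to cover n_chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    # Integer ceiling of (n_chars - chunk_overlap) / step
    return (n_chars - chunk_overlap + step - 1) // step


# Compare a few settings for a hypothetical 10,000-character document
for size in (250, 500, 1000):
    overlap = size // 10
    print(size, overlap, estimate_chunk_count(10_000, size, overlap))
```

Smaller chunks give more precise matches but more vectors to store and less context per retrieved passage, so the count is only one side of the trade-off.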

## Conclusion

Congratulations! You’ve built a simple Retrieval-Augmented Generation system using LangChain. This system can retrieve relevant information from documents and generate answers to user queries.

Help

- TextLoader: Loads text data from files.
- RecursiveCharacterTextSplitter: Splits text into smaller chunks for better processing.
- FAISS: A library for efficient similarity search of embeddings.
- Retrieval QA chain: A chain that retrieves relevant documents and answers questions based on them.
- GoogleGenerativeAIEmbeddings: Generates embeddings that capture the semantic meaning of text.

## Help

In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate([
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
])

prompt_value = template.invoke(
    {
        "name": "Bob",
        "user_input": "What is your name?"
    }
)

# Output:
# ChatPromptValue(
#    messages=[
#        SystemMessage(content='You are a helpful AI bot. Your name is Bob.'),
#        HumanMessage(content='Hello, how are you doing?'),
#        AIMessage(content="I'm doing well, thanks!"),
#        HumanMessage(content='What is your name?')
#    ]
#)

messages=[SystemMessage(content='You are a helpful AI bot. Your name is Bob.', additional_kwargs={}, response_metadata={}), HumanMessage(content='Hello, how are you doing?', additional_kwargs={}, response_metadata={}), AIMessage(content="I'm doing well, thanks!", additional_kwargs={}, response_metadata={}), HumanMessage(content='What is your name?', additional_kwargs={}, response_metadata={})]
