## Building a Retrieval-Augmented Generation (RAG) System with LangChain

### Introduction

In this notebook, we will learn how to build a Retrieval-Augmented Generation (RAG) system using LangChain in Python. RAG systems combine information retrieval and natural language generation to produce answers that are grounded in external knowledge bases. This approach is particularly useful when dealing with large documents or datasets where direct querying isn’t efficient or possible.

### Objectives

- Understand the concept of Retrieval-Augmented Generation (RAG).
- Learn how to use LangChain to implement a RAG system.
- Implement the system step by step with guided TODO tasks.
- Test your implementation at each step.
- Provide helpful explanations and definitions.

Help

### Methods Used

- LangChain: A library for building language model applications.
- VectorStore (FAISS): A tool for efficient similarity search and clustering of dense vectors.
- Google Generative AI Embeddings: Vector representations of text that capture semantic meaning.
- LLMChain with a ChatPromptTemplate: Combines a prompt and a chat model to answer questions over the supplied documents.

### Data Used

- I extracted some chapters of the Gen AI course into a single .txt file.
- The goal of this notebook is to build a RAG system that can answer questions based on the content of these chapters.

## Step 1: Set Up Your Environment

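We need to import the required modules and set up the API key. The code below uses Google Generative AI models, so the key is read from the `GOOGLE_API_KEY` environment variable.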

In [40]:
# Import necessary libraries
import sys
from dotenv import load_dotenv
from langchain import hub
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.vectorstores import FAISS  # community integrations live in langchain_community
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents.base import Document
from langchain_core.prompts import ChatPromptTemplate
from typing import List
from langchain.chains import LLMChain  # the top-level `from langchain import LLMChain` is deprecated

In [41]:
load_dotenv(dotenv_path=r'C:\Users\USER\Desktop\GENERATIVE AI DAUPHINE\GenAI-Dauphine-Course\env.template')
sys.path.append("../")

In [42]:
from dotenv import load_dotenv
import os

# Load the environment variables
success = load_dotenv(dotenv_path='.env.template')

if success:
    print("Environment variables loaded successfully.")
else:
    print("Error while loading the environment variables.")


Environment variables loaded successfully.


In [43]:
from dotenv import load_dotenv

# Load the environment variables
load_dotenv(dotenv_path='.env.template')

# Check that the variables were loaded correctly
import os
google_key = os.getenv("GOOGLE_API_KEY")

if google_key:
    print("Google API key loaded successfully.")
else:
    print("Error: Google API key not loaded.")
    # Print only the variable names for debugging; printing os.environ itself would expose secret values
    print("Current environment variable names:", sorted(os.environ))

Google API key loaded successfully.
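
The check above can be generalized to several variables at once. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper, and the demo uses an in-memory dictionary instead of the real environment so that no secret is read or printed.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to try without touching real credentials.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Demo with a fake environment (hypothetical names and values).
demo_env = {"GOOGLE_API_KEY": "dummy-value", "OTHER_KEY": ""}
missing = missing_env_vars(["GOOGLE_API_KEY", "OTHER_KEY", "ABSENT_KEY"], demo_env)
print("Missing variables:", missing)
```

Reporting only the names of missing variables keeps API keys out of the notebook output.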


## Step 2: Load and Split Documents

Load the document you want to use and split it into manageable chunks.

In [44]:
# Load your document and split it into chunks
# Hint: Use TextLoader and RecursiveCharacterTextSplitter

# Specify the filename
filename = r"C:\Users\USER\Desktop\GENERATIVE AI DAUPHINE\GenAI-Dauphine-Course\data\gen_ai_course.txt"

# Load the document
loader = TextLoader(filename, encoding='utf-8')
documents = loader.load()

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)

In [45]:
# Output the number of chunks created for verification
print(f"Number of chunks created: {len(docs)}")

Number of chunks created: 243
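
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-size splitting with overlap. It is an illustration only: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently because it first tries to split on separators such as paragraph and line breaks.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence
    that straddles a boundary still appears whole in at least one chunk
    (as long as it is shorter than the overlap).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap=50` plays in the cell above.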


## Step 3: Create Embeddings and Build the VectorStore

Generate embeddings for each chunk and store them in a vector store for efficient retrieval.

In [46]:
# Create embeddings and store them in a VectorStore
# Note: "text-embedding-ada-002" is an OpenAI model name, so it is not valid for GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
# Create a FAISS vector store for efficient similarity search
vector_store = FAISS.from_documents(docs, embeddings)

In [47]:
# Test with a sample text (French: "Nature is a precious treasure to preserve.")
text = "La nature est un trésor précieux à préserver."
embedding_result = embeddings.embed_query(text)

# Display the embedding
print(embedding_result)

[-0.005807802546769381, -0.016429657116532326, 0.013996670953929424, -0.030927123501896858, 0.054504286497831345, -0.015225336886942387, -0.010164381936192513, 0.016435004770755768, 0.044515300542116165, 0.024288734421133995, 0.03959793969988823, 0.031120670959353447, 0.006636400241404772, -0.019907498732209206, -0.010129876434803009, -0.062123462557792664, 0.013115126639604568, 0.050864387303590775, 0.020333316177129745, -0.020906612277030945, -0.010220176540315151, 0.04769838601350784, 0.012790014035999775, -0.0372074693441391, 0.048084769397974014, -0.03612648695707321, 0.00975894182920456, -0.05787339434027672, -0.04003853350877762, 0.05130202695727348, -0.06563553214073181, 0.015496921725571156, -0.06580051779747009, -0.026914644986391068, 0.01033852994441986, -0.047606322914361954, 0.027343878522515297, 0.03338739648461342, 0.04223555326461792, -0.0016030212864279747, -0.022290769964456558, 0.013075722381472588, -0.006281361449509859, -0.04385257139801979, 0.02721501886844635, 0.
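
To see how a vector store can rank chunks against a query embedding, here is a minimal pure-Python sketch of similarity search. It is an illustration under stated assumptions: the three-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and FAISS itself relies on optimized indexes and may use a different metric (such as L2 distance).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" (made-up values, far smaller than real ones).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.05, 0.0]
nearest = top_k(query_vector, toy_vectors, k=2)
print("Indices of the closest chunks:", nearest)
```

The chunks at the returned indices are the ones that would be passed to the language model as context.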

## Step 4: Set Up the QA Chain using LCEL

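Create a chain that can generate answers from relevant chunks. Note that the `LLMChain` below only connects the prompt to the model; it does not query `vector_store`, so any context has to be included in the `{query}` input.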
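
One way to ground the answer is to fold the retrieved text into the query before calling the chain. The sketch below is a hedged illustration of that step: `Chunk` is a hypothetical stand-in for LangChain's `Document`, `build_grounded_query` is a hypothetical helper, the instruction wording is an assumption, and `format_docs` mirrors the helper of the same name used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def format_docs(docs):
    """Join chunk texts with blank lines between them."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_grounded_query(question, docs):
    """Combine instructions, retrieved context, and the user question."""
    context = format_docs(docs)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    Chunk("RAG combines retrieval with generation."),
    Chunk("FAISS stores dense vectors."),
]
grounded_query = build_grounded_query("What does RAG combine?", retrieved)
print(grounded_query)
```

The resulting string could then be passed as the `query` input of the chain, with `retrieved` coming from a similarity search over the vector store.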

In [48]:
# Read the API key (a trailing comma here would turn the value into a one-element tuple)
google_api_key = os.getenv("GOOGLE_API_KEY")

In [60]:
# Initialize the Google Generative AI model
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",  # Available model
    temperature=0.7,        # Controls creativity
    max_output_tokens=300,  # Caps the length of the response
    timeout=15,             # Maximum time to wait for a response
    max_retries=3           # Number of attempts in case of failure
)

# Create a prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an AI assistant that emphasizes the importance of nature and preserving the environment."),
    ("human", "{query}")  # Placeholder for dynamic queries
])

# Create a function to format documents for the prompt
def format_docs(docs: List[Document]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)

# Set up the QA chain by combining the prompt and the model
qa_chain = LLMChain(
    llm=llm,
    prompt=prompt
)

# Ask the QA chain a question (French: "Why is it important to preserve nature?")
query = "Pourquoi est-il important de préserver la nature ?"
response = qa_chain.invoke({"query": query})["text"]

# Display the answer
print("Model response:")
print(response)

Model response:
Préserver la nature est crucial, non seulement pour notre propre survie, mais aussi pour le bien-être de la planète entière.  La nature est un système interconnecté et complexe dont nous dépendons entièrement.  Voici quelques raisons pour lesquelles sa préservation est primordiale :

* **Notre survie dépend de la nature:**  L'air que nous respirons, l'eau que nous buvons et la nourriture que nous mangeons proviennent tous de la nature.  Les écosystèmes naturels purifient l'air et l'eau, régulent le climat et pollinisent nos cultures. Détruire la nature, c'est scier la branche sur laquelle nous sommes assis.

* **La biodiversité est essentielle:** La nature abrite une incroyable diversité d'espèces animales et végétales.  Chaque espèce joue un rôle important dans l'équilibre des écosystèmes.  La perte de biodiversité fragilise ces écosystèmes, les rendant plus vulnérables aux maladies et aux changements climatiques.  Imaginez un monde sans le chant des oiseaux, le bo

## Step 5: Ask Questions and Get Answers

Test the system by asking a question.

In [64]:
# Example document to use with the model
docs = [
    Document(page_content="This document discusses the urgent need to protect natural ecosystems and biodiversity. It highlights the impact of human activities on the environment and proposes strategies to mitigate these effects."),
]

# Function to format the documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Format the documents
formatted_docs = format_docs(docs)

# Define a precise, direct question that embeds the document text
query = f"Can you summarize the main topic discussed in the following document, focusing on environmental conservation?\n\n{formatted_docs}"

# Query the QA chain with the added context
result = qa_chain.invoke({
    "query": query,
})

# Display only the model's answer
print(f"Answer: {result['text']}")


Answer: The document emphasizes the urgent need for environmental conservation by protecting our precious natural ecosystems and the rich biodiversity they harbor.  It underscores the detrimental impact of human activities and explores strategies to lessen our footprint and foster a healthier relationship with the natural world.  Essentially, the core message is a call to action to safeguard our planet's future.



## Step 6: Test Your Implementation with Different Questions

Try out different questions to see how the system performs.

In [68]:
# Replace the question below with your own and run the QA chain on it

query = "What is the impact of global warming?"
# Note: the prompt only defines {query}, so "formatted_docs" is passed along but never inserted into the prompt
result = qa_chain.invoke(
    {
        "query": query,
        "formatted_docs": formatted_docs,
    }
)

# Display only the model's answer
print(result['text'])  # Prints the generated response


Global warming, driven by human activities releasing greenhouse gases into the atmosphere, is significantly impacting our precious planet, affecting both natural systems and human society.  It's crucial to remember that the Earth's delicate ecosystems are interconnected, and changes in one area can have ripple effects across the globe.

Here are some key impacts:

**On Natural Systems:**

* **Rising Temperatures:**  The most direct impact is the increase in average global temperatures, both on land and in the oceans. This warming disrupts natural cycles and stresses ecosystems.  Think of the delicate balance of a forest, dependent on specific temperature ranges for its flora and fauna to thrive.
* **Melting Ice and Rising Sea Levels:**  As temperatures rise, glaciers and polar ice caps melt at an alarming rate, contributing to rising sea levels. This threatens coastal communities and vital habitats like coral reefs and mangrove forests, which are crucial for biodiversity.
* **Extreme W

## Step 7: Improve the System

You can experiment with different parameters, like adjusting the chunk size or using a different language model.
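
Before re-running the pipeline with a new chunk size, it can help to estimate how many chunks (and therefore embedding calls) to expect. The sketch below is a rough estimate that assumes fixed-size windows with overlap. `estimate_chunk_count` is a hypothetical helper and the 100,000-character length is a made-up example. The real `RecursiveCharacterTextSplitter` splits on separators, so actual counts will differ.

```python
def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Estimate the number of fixed-size windows needed to cover a text."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    remaining = text_length - chunk_size
    # Integer ceiling division: extra windows needed after the first one.
    return 1 + (remaining + step - 1) // step


# Hypothetical document length, in characters.
text_length = 100_000
estimates = {
    size: estimate_chunk_count(text_length, size, chunk_overlap=50)
    for size in (250, 500, 1000)
}
for size, count in estimates.items():
    print(f"chunk_size={size}: about {count} chunks")
```

Smaller chunks give more precise retrieval targets but more vectors to embed and search; larger chunks carry more context per hit at the cost of precision.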

* Using a different language model

In [71]:
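# Example of switching models
# Note: "gpt-4" is an OpenAI model and cannot be used with ChatGoogleGenerativeAI; pick another Gemini model instead
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",  # Replace with another Gemini model
    temperature=0.7,
    max_output_tokens=300,
    timeout=15,
    max_retries=3
)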

## Conclusion

Congratulations! You’ve built a simple Retrieval-Augmented Generation system using LangChain. This system can retrieve relevant information from documents and generate answers to user queries.

## Help

- TextLoader: Loads text data from files.
- RecursiveCharacterTextSplitter: Splits text into smaller chunks for better processing.
- FAISS: A library for efficient similarity search of embeddings.
- LLMChain: A chain that combines a prompt template with a language model to generate answers.
- GoogleGenerativeAIEmbeddings: Generates embeddings that capture the semantic meaning of text.


In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate([
    ("system", "You are a helpful AI bot. Your name is {name}."),
    ("human", "Hello, how are you doing?"),
    ("ai", "I'm doing well, thanks!"),
    ("human", "{user_input}"),
])

prompt_value = template.invoke(
    {
        "name": "Bob",
        "user_input": "What is your name?"
    }
)

# Output:
# ChatPromptValue(
#    messages=[
#        SystemMessage(content='You are a helpful AI bot. Your name is Bob.'),
#        HumanMessage(content='Hello, how are you doing?'),
#        AIMessage(content="I'm doing well, thanks!"),
#        HumanMessage(content='What is your name?')
#    ]
#)