# Retrieval Augmented Generation (RAG) using LangChain
    Description
Large Language Models (LLMs) are being integrated into computers, phones, and software applications, but they have one drawback: their knowledge is limited to their training data, and retraining them to add new information is slow and costly. Retrieval Augmented Generation (RAG) addresses this by letting you supply external data to an LLM at query time. In this notebook, you'll learn techniques for loading, processing, and retrieving external data for LLMs. You'll use vector databases, recent LLMs such as GPT-4o-mini, and the LangChain framework to create RAG applications. This notebook concludes with a chapter on Graph RAG, a variation on traditional RAG that uses graph databases for more reliable data retrieval.

- **Framework**: A framework is a predefined software infrastructure that provides a set of components, tools, and rules to simplify and accelerate application development.

### Loading Documents for RAG with LangChain

“Loading Documents for RAG with LangChain” usually refers to the step where you bring external data (PDFs, Word docs, web pages, databases, etc.) into a LangChain pipeline so it can be chunked, embedded, stored in a vector database, and later retrieved by the LLM.

In [5]:
# Install the LangChain framework packages first (pypdf is required by PyPDFLoader):
# pip install langchain langchain_community pypdf

from langchain_community.document_loaders import PyPDFLoader

# Path to the PDF, relative to the notebook's working directory
PATH_TO_PDF_FILE = "financial_report_desj.pdf"

# Create the loader (password is only needed for encrypted PDFs)
loader = PyPDFLoader(file_path=PATH_TO_PDF_FILE, password=None)

# Load the file (one Document per page), then inspect the first page
docs = loader.load()
print(docs[0])

page_content='Rapport financier
Deuxième trimestre de 2025
Le Mouvement Desjardins enregistre des excédents de 900 M$ 
pour le deuxième trimestre de 2025 et franchit le cap des 500 G$ d'actifs
MESSAGE DE LA DIRECTION
Lévis, le 12 août 2025 – Au terme du deuxième trimestre terminé le 30 juin 2025, le Mouvement Desjardins, plus grand groupe financier coopératif en 
Amérique du Nord, a enregistré des excédents avant ristournes aux membres de 900 M$, comparativement à 918 M$ pour la période correspondante 
de 2024. Cette diminution des excédents s'explique principalement par une hausse de la dotation à la provision pour pertes de crédit, en raison 
notamment de l'évolution défavorable des perspectives économiques, qui reflètent l'incidence potentielle des perturbations commerciales.  En 
contrepartie, le secteur Particuliers et Entreprises a bénéficié de la progression du revenu net d'intérêts liée principalement à l'augmentation du 
portefeuille de prêts qui a notamment permis de franchir

### Text splitting, embeddings, and vector storage

1️⃣ **Text Splitting**
Large documents are split into smaller chunks so LLMs can process them efficiently. Chunks can overlap to preserve context.

2️⃣ **Embeddings**
Each chunk is converted into a vector representation using an embedding model (e.g., OpenAI). Vectors capture the semantic meaning of the text.

3️⃣ **Vector Storage**

Vectors are stored in a vector database (FAISS, Chroma, Pinecone, etc.) for fast similarity search. When a query comes in, relevant chunks are retrieved based on vector similarity.
    
    RAG flow: Load PDF → Split text → Create embeddings → Store vectors → Retrieve relevant chunks → LLM generates answer.
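
The "retrieve relevant chunks" step can be pictured with a minimal, dependency-free sketch. The three-dimensional vectors and the chunk texts below are made up for illustration (real embedding models return hundreds or thousands of dimensions), and `top_k` is a hypothetical helper, not a LangChain function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical "vector store": (chunk text, made-up embedding) pairs
store = [
    ("Quarterly surplus earnings", [0.9, 0.1, 0.0]),
    ("Capital ratios", [0.1, 0.9, 0.1]),
    ("Weather forecast", [0.0, 0.1, 0.9]),
]

# Made-up embedding of the user's question
query = [0.8, 0.2, 0.0]
results = top_k(query, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A vector database such as FAISS or Chroma does the same kind of ranking, but with index structures that avoid comparing the query against every stored vector.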

🔹 **chunk_size**

The maximum length of each text chunk.

For example, chunk_size=75 means each chunk will be at most 75 characters long, because LangChain's character-based splitters measure length in characters by default.

🔹 **chunk_overlap**

The number of characters that are repeated between consecutive chunks.

With chunk_overlap=10, up to the last 10 characters of one chunk are repeated at the beginning of the next chunk.

This keeps context continuity across chunks.
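
To make the two parameters concrete, here is a simplified fixed-window sketch in plain Python. It is an assumption-laden illustration: `chunk_text` is a hypothetical helper, and unlike LangChain's splitters it ignores separators and simply slides a window over the characters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the continuity that chunk_overlap provides.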

In [6]:
from langchain_text_splitters import CharacterTextSplitter

# Text to split
text = '''RAG (retrieval augmented generation) is an advanced NLP model 
that combines retrieval mechanisms with generative capabilities. 
RAG aims to improve the accuracy and relevance of its outputs by grounding responses in precise, contextually appropriate data.'''

# Create a text splitter that splits on newlines
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=15, chunk_overlap=10)

# Split the text, then inspect the chunks and their lengths
chunks = text_splitter.split_text(text)
print(chunks)
print([len(chunk) for chunk in chunks])

Created a chunk of size 62, which is longer than the specified 15
Created a chunk of size 65, which is longer than the specified 15


['RAG (retrieval augmented generation) is an advanced NLP model', 'that combines retrieval mechanisms with generative capabilities.', 'RAG aims to improve the accuracy and relevance of its outputs by grounding responses in precise, contextually appropriate data.']
[61, 64, 127]


#### CharacterTextSplitter

- Very basic splitter.

- Splits text on a single separator string you specify (e.g. ".", "\n", " ").

- If a piece between two separators is longer than chunk_size, it is not cut: the piece is kept whole and a warning is logged, so chunks can exceed chunk_size (as in the output above).

- Good for simple use cases but not very “smart.”

#### RecursiveCharacterTextSplitter

- Much smarter splitter.

- Uses a list of separators (e.g. paragraphs "\n\n", then sentences ".", then words " ", etc.).

- It tries to split on the largest meaningful separator first (paragraphs), then smaller ones if needed.

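- Keeps chunks within chunk_size as long as the separator list allows it (ending the list with "" lets it fall back to cutting between characters), while trying to keep sentences and paragraphs intact.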

- Best choice for documents (PDFs, articles) because it preserves context better.
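
The "coarse separators first, finer ones only if needed" idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, it has no overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters, coarse separators first."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut between characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is still too long: retry with the finer separators
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks

sample = ("First paragraph is short.\n\n"
          "Second paragraph is quite a bit longer. It has two sentences.")
pieces = recursive_split(sample, ["\n\n", ". ", " ", ""], chunk_size=40)
print(pieces)
```

The short first paragraph survives intact, while the longer second paragraph is broken at its sentence boundary rather than in the middle of a word.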

    ⚡ Key Difference

split_text → gives you plain text chunks.

split_documents → gives you Document objects (text + metadata).

In [7]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF (one Document per page)
loader = PyPDFLoader(file_path="financial_report_desj.pdf")
document = loader.load()

# Try newlines first, then sentence ends, then spaces, then individual characters
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n", ".", " ", ""],
    chunk_size=75,
    chunk_overlap=10
)
# split_documents keeps each page's metadata on the chunks it produces
chunks = text_splitter.split_documents(documents=document)

# Length of each chunk's text in characters (spaces included)
print([len(chunk.page_content) for chunk in chunks])

[44, 58, 73, 23, 74, 66, 23, 73, 74, 7, 73, 69, 70, 68, 5, 70, 64, 19, 74, 15, 65, 69, 64, 18, 72, 74, 47, 62, 54, 36, 69, 71, 1, 74, 69, 22, 7, 74, 61, 64, 45, 47, 69, 71, 13, 22, 74, 54, 10, 74, 27, 42, 69, 73, 19, 35, 66, 70, 22, 71, 12, 1, 66, 72, 18, 27, 59, 57, 71, 22, 53, 71, 73, 13, 36, 72, 18, 74, 71, 10, 67, 65, 26, 64, 67, 21, 68, 37, 40, 67, 52, 1, 73, 68, 74, 10, 43, 66, 58, 54, 72, 13, 50, 56, 65, 70, 9, 70, 64, 38, 70, 37, 60, 59, 46, 34, 47, 51, 54, 50, 46, 30, 49, 71, 28, 41, 48, 56, 66, 60, 24, 71, 55, 58, 69, 64, 32, 55, 54, 41, 44, 70, 34, 65, 49, 56, 27, 72, 68, 17, 71, 12, 49, 23, 20, 31, 35, 18, 70, 73, 9, 1, 73, 58, 1, 68, 69, 18, 74, 29, 46, 65, 71, 13, 68, 25, 56, 73, 69, 1, 69, 69, 15, 74, 67, 74, 71, 14, 74, 69, 66, 24, 58, 68, 68, 29, 58, 74, 29, 69, 72, 1, 72, 70, 72, 70, 17, 71, 51, 1, 67, 59, 20, 70, 69, 21, 44, 75, 33, 51, 75, 23, 1, 72, 73, 60, 70, 30, 29, 70, 25, 22, 71, 66, 12, 70, 71, 8, 8, 74, 67, 9, 67, 68, 19, 38, 70, 40, 1, 18, 61, 74, 19, 46, 4

### Embedding and storing documents

    What are Embeddings?

- An embedding model converts text into a vector (list of numbers).

- These vectors capture the semantic meaning of text.

- Similar texts → vectors close in space.
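
As a rough intuition for "similar texts → vectors close in space", the sketch below uses a toy word-count embedding. This is only an assumption for illustration: the vocabulary and sentences are made up, and real embedding models are learned neural networks that capture meaning well beyond word overlap.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical five-word vocabulary, so every text becomes a 5-dimensional vector
vocabulary = ["loan", "interest", "credit", "rain", "sunny"]

finance_a = embed("Loan interest rose", vocabulary)
finance_b = embed("Credit and loan interest", vocabulary)
weather = embed("Sunny then rain", vocabulary)

# Euclidean distance: smaller means closer in space
print(math.dist(finance_a, finance_b))  # the two finance sentences
print(math.dist(finance_a, weather))    # finance vs. weather
```

The two finance sentences end up closer to each other than either is to the weather sentence, which is the property a vector store relies on during retrieval.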

In [8]:
import os

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Initialize the OpenAI embedding model (the key is read from the API_KEY environment variable)
embedding_model = OpenAIEmbeddings(
    api_key=os.getenv("API_KEY"), model="text-embedding-3-small")

# Create a Chroma vector store holding the embeddings.
# Note: this embeds the full pages in `document`; pass `chunks` instead to index the smaller chunks.
vector_store = Chroma.from_documents(
    documents=document, embedding=embedding_model, collection_metadata=None
)

### Building an LCEL retrieval chain

     LCEL retrieval chain:

- Load docs → bring in your data (PDF, text, etc.).

- Split docs → break into chunks with a splitter.

- Embed + store → convert chunks into vectors, save in a vector DB (e.g. FAISS).

- Retriever → searches relevant chunks for a query.

- LLM + Prompt → define how the LLM should use the retrieved chunks.

- Retrieval chain → ties retriever + LLM together into one pipeline.

- Invoke → you ask a question, chain fetches chunks + LLM answers.

👉 **Basically**: Question → Retrieve docs → Feed docs into LLM → Answer.
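
To show how the `|` operator can tie such steps into one pipeline, here is a toy sketch in plain Python. It is not how LangChain implements LCEL: the `Step` class, the `knowledge` dictionary, and the fake LLM are all hypothetical stand-ins that run without an API key.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # step_a | step_b builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Hypothetical "document store" keyed by topic
knowledge = {"capital": "Capital ratios are reported in Note 9."}

# Retrieve: keep texts whose key appears in the question
retrieve = Step(lambda question: {
    "question": question,
    "context": [text for key, text in knowledge.items() if key in question.lower()],
})
# Fill the prompt with the retrieved context and the question
fill_prompt = Step(lambda inputs: f"Context: {' '.join(inputs['context'])}\nQuestion: {inputs['question']}")
# Fake LLM: just echoes the prompt it received
fake_llm = Step(lambda prompt: f"(model reply to)\n{prompt}")

chain = retrieve | fill_prompt | fake_llm
answer = chain.invoke("Where are capital ratios reported?")
print(answer)
```

Calling `invoke` once on the composed chain runs every stage in order, which mirrors the Question → Retrieve → LLM → Answer flow described above.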

In [9]:
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

prompt = """
Use only the context provided to answer the following question. 
If you don't know the answer, reply that you are unsure.
Context: {context}
Question: {question}
"""

# Convert the string into a chat prompt template
prompt_template = ChatPromptTemplate.from_template(prompt)
llm = ChatOpenAI(api_key=os.getenv('API_KEY'),  model="gpt-4o-mini")
# Create an LCEL chain to test the prompt
chain = prompt_template | llm

# Invoke the chain on the inputs provided
print(chain.invoke({"context": "DataCamp's RAG course was created by Meri Nova and James Chapman!", "question": "Who created DataCamp's RAG course?"}))

content="DataCamp's RAG course was created by Meri Nova and James Chapman." additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 63, 'total_tokens': 79, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CNVMa4OHoSLjUHHTRT2mvnZmV8JUS', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--c7619b81-4685-442e-b0f7-0585404f7174-0' usage_metadata={'input_tokens': 63, 'output_tokens': 16, 'total_tokens': 79, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


### Building a Retrieval Chain

Create a retrieval chain using the **LangChain Expression Language (LCEL)**. This combines the vector store containing the embedded financial report you loaded earlier, a prompt template, and an LLM so we can begin talking to our documents.

- **Retriever** = finds the most relevant text chunks from your docs.

- **RunnablePassthrough()** = sends your question straight into the chain.

- **StrOutputParser()** = cleans the LLM’s reply into plain text.

- **Pipeline** = Question → Retrieve docs → Fill prompt → LLM → String answer.
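
The first stage of that pipeline sends the same question to two places at once: the retriever and a passthrough. The plain-Python sketch below imitates that fan-out. Everything in it is a hypothetical stand-in (`Doc`, `fake_retriever`, `run_parallel`); `format_docs` shows one common way to join retrieved texts into a single context string, which the notebook's own chain does not do.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def fake_retriever(question):
    """Pretend retriever: always returns the same two documents."""
    return [Doc("First retrieved chunk."), Doc("Second retrieved chunk.")]

def passthrough(value):
    """Return the input unchanged, like a passthrough step."""
    return value

def format_docs(docs):
    """Join the text of the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)

def run_parallel(mapping, value):
    """Call every function in the mapping with the same input."""
    return {key: func(value) for key, func in mapping.items()}

inputs = run_parallel(
    {"context": lambda q: format_docs(fake_retriever(q)), "question": passthrough},
    "What changed?",
)
print(inputs)
```

The resulting dictionary has exactly the two keys, `context` and `question`, that the prompt template expects.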

In [10]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Convert the vector store into a retriever that returns the 2 most similar documents
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 2})

# Create the LCEL retrieval chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

# Invoke the chain
print(chain.invoke("Qu'est ce qui s'est passé au 30 juin 2025 pour le mouvement Desjardins dans la note 9 - Gestion du capital  concernant le respect des exigences?"))

Le Mouvement Desjardins a maintenu une solide capitalisation en conformité avec les règles de Bâle III au 30 juin 2025. Ses ratios de fonds propres de la catégorie 1A et du total des fonds propres étaient respectivement de 22,9 % et de 25,5 %, comparativement à 22,2 % et 24,2 % au 31 décembre 2024.


### Conversational RAG Assistant for PDF Document Analysis

    Description 
A powerful AI-driven Streamlit application that allows you to upload multiple PDF files and interact with them through natural conversation.
Using LangChain and Retrieval-Augmented Generation (RAG), the assistant extracts, indexes, and understands document content to provide accurate, context-aware answers.

Easily compare reports, summarize sections, or explore insights across all your PDFs — all through simple, human-like chat.

    Application
Behind the scenes, the Conversational RAG Assistant works by letting you upload multiple PDFs, which are read page by page, split into smaller text chunks, and transformed into numerical embeddings that capture their meaning. These embeddings are stored in a FAISS vector index, allowing the system to quickly find the most relevant chunks when you ask a question. Your query is also embedded, compared to the stored vectors, and the top matching pieces of text are passed to an LLM (like GPT-4o-mini) along with your question to generate an accurate, context-aware answer. The app then displays this answer in Streamlit, showing the sources used and optional summaries such as four main titles extracted from the documents to give you an overview before chatting.
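
As one possible way to "show the sources used", the sketch below collects unique file and page pairs from the metadata of retrieved chunks. It assumes each chunk carries `source` and `page` metadata keys, as PDF loaders typically set; the file names are made up, and `summarize_sources` is a hypothetical helper rather than code from the app itself.

```python
def summarize_sources(retrieved_metadata):
    """Return unique (file, page) pairs in the order they were first seen."""
    seen = []
    for metadata in retrieved_metadata:
        entry = (metadata.get("source", "unknown"), metadata.get("page"))
        if entry not in seen:
            seen.append(entry)
    return seen

# Made-up metadata for three retrieved chunks; two come from the same page
retrieved_metadata = [
    {"source": "report_a.pdf", "page": 3},
    {"source": "report_b.pdf", "page": 0},
    {"source": "report_a.pdf", "page": 3},
]
sources = summarize_sources(retrieved_metadata)
for source, page in sources:
    print(f"{source}, page index {page}")
```

Deduplicating matters because several retrieved chunks often come from the same page, and listing that page once keeps the citation list readable.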

In [11]:
import os
import streamlit as st
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma

In [12]:
print(os.getcwd())  # Current working directory
print(os.listdir())  # Everything in the working directory of the Codespace
pdfs = [pdffile for pdffile in os.listdir() if pdffile.endswith(".pdf")]  # PDF files in the working directory
print(pdfs)

/workspaces/Retrieval-Augmented-Generation-LangChain-
['README.md', 'financial_report_desj.pdf', '.git', '.gitignore', 'rag_app.py', 'LICENSE', '.env', 'financial_report_desjardins_2025q2.pdf', 'Conversational PDF (RAG Application).ipynb']
['financial_report_desj.pdf', 'financial_report_desjardins_2025q2.pdf']


In [13]:
# Create the loader, then load the PDF (one Document per page)
pdf_loader = PyPDFLoader(
    file_path="financial_report_desj.pdf", password=None
)

pages = pdf_loader.load()

#print(pages[0])

# Split on paragraphs first, then lines, sentence ends, spaces, and finally characters
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " ", ""],
    chunk_size=100,
    chunk_overlap=15
)

chunks = text_splitter.split_documents(documents=pages)

embedding_model = OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY"),
                                   organization=None, 
                                   deployment=None, 
                                   dimensions=None,
                                   model="text-embedding-3-large"
                                   )

vector_embeddings = FAISS.from_documents(documents=chunks,
                                         embedding=embedding_model
                                         )

prompt_template = ChatPromptTemplate.from_template(
"""
You are a helpful assistant. Use the context below to answer the question.

Context:
{context}

Question:
{question}
"""
)

llm = ChatOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-mini"
)

retriever = vector_embeddings.as_retriever(search_type = "mmr", 
                                           search_kwargs={"k":2}
                                           )
chain_lcel = (
    {"context":retriever, "question":RunnablePassthrough()} |
    prompt_template |
    llm |
    StrOutputParser()
)
prompt_user = "In the economic environment and outlook section, What is the anticipated increase in Canada’s GDP in 2025 and 2026 ? "
chain_lcel.invoke(prompt_user)

"The anticipated increase in Canada's GDP is 1.4% in both 2025 and 2026."

In [14]:
import numpy as np
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd

# Reuse the FAISS store built in the previous cell instead of re-embedding every chunk.
# Pull the stored vectors back out of the FAISS index, one row per chunk.
vectors = []
for i in range(vector_embeddings.index.ntotal):
    vectors.append(vector_embeddings.index.reconstruct(i))
vectors = np.array(vectors)
print("Shape of embeddings:", vectors.shape)

tsne = TSNE(n_components=2, random_state=42)
vectors_2d = tsne.fit_transform(vectors)

df = pd.DataFrame(vectors_2d, columns=["x", "y"])
df["chunk"] = [f"Chunk {i}" for i in range(len(df))]

fig = px.scatter(df, x="x", y="y", text="chunk",
                 title="FAISS Embeddings Visualization (t-SNE)")
fig.update_traces(marker=dict(size=8, color='blue', opacity=0.7), textposition='top center')
fig.show()
 

Shape of embeddings: (6488, 3072)
