# Introduction 

After the fine-tuning arc I decided to discover how RAGs are woarking and how to implement them .

**What is a RAG ?**

A RAG stands for Retrievel augmented generative , it allows to add context and knowledge to an LLM , because of the fact that the knowledge base of an LLM is constrained by what was included in his training data. RAG allows us to intregrate external data as for example PDFs , pictures or others type of document. In addition to that this was for me the best concept to get initiated to Langchain .

# How does a RAG work ?

There are 5 main steps that we will need to understand to create a RAG .
- 1) Loading : we need to load our documents (pdfs,images,csv ...)
- 2) Spliting and chunking : we need to split these documents into smaller chunks 
- 3) Embedding : we need to embbed every chunk that we generated
- 4) Storing : we store these embeddings in a vector database 
- 5) Retrieving : we use the vector store as a retriever , where the LLM will be able to search to find answers to the user's query.

If you did not understand every word don't worry I will explain everything .

But before to dive deeper into details , I personnally like to have a global idea of how a concept is working , so for that here is a quick explanation of how a RAG works:  
- Let's say the input docs are pdfs , we load the pdfs split them into smaller units : chunks
- Then we take every chunk and we convert it into a vector of floats : this is what we call embedding
- We store these chunks in a Vectore Database (a vectore database is very quick)

After these steps , when a user write an input , the input text in converted into a vector of floats (**the embeddings**) and then we use the VectorDB to find the closest vectors (closest in the sens of meanning) . Once we found these vectors we reconvert them into text and send them to the llm with the initial request of the user : we added context to the llm who will answers with this new knowledge.

# 1) Loading  and chunking documents

For the beginning I will only load a pdf to see how it is working . For this we will need the module text_loader from the langchain_community framework .
I decided to use PyMuPdf framework because It extracts the text, completes with headers, lists, and can converts it to Markdown , what is LLM friendly and will allows us to have better results.

In [105]:
from langchain_community.document_loaders import  DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
from typing import List
import torch
import threading
import chromadb
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.runnables import RunnablePassthrough , RunnableLambda
from langchain_classic.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
import pprint

The thing here is that we want to extract all the pdf documents in a folder , it means that even if the folder contains others directories where there is pdf document we would be able to find them and use them. I wanted to do this to give the user more flexibility and allow him to organize his folder as he wants. 

For this we will use the DirectoryLoader class that allows us to specity : 
- **Path** 
- **Global pattern matching (glob)**
- **PdfLoader** 

In [None]:
separators=["\n\n", "\n", " ", "","\n\n"]
def load_and_chunk(path:str):
 """
 **path**  
 
 path where all the pdf are stocked  
 **returns** : List[Document] , int
 """
 assert os.listdir(path) , "The path does not contain any pdf"
 loader = DirectoryLoader(
    path= path,
    glob = "**/*.pdf",
    show_progress=True,
    recursive= True,
 )

 docs = loader.load_and_split(text_splitter=RecursiveCharacterTextSplitter(chunk_size = 100 , chunk_overlap = 10 , separators=separators))
 print(f" Successfully generated {len(docs)} chunks")
 return docs 



In [4]:
path = "./pdf_documents"
chunks , nb_chunks= load_and_chunk(path=path)
print(chunks[0].metadata)

100%|██████████| 1/1 [00:02<00:00,  2.54s/it]

 Successfully generated 86 chunks
{'source': 'pdf_documents/reglement.pdf'}





Okay now we have a list of documents where every document contains two elements:  
- Metadata (can be useful)
- Page_content : the content of the chunked part  

If you have followed my until here you may understand that we still have 3 steps to finalize our RAG. Let's see the third step

# 2) Step 3 : embbedings 

Now that we have our chunks , we want to convert them in vectors of float : these vectors are called **embeddings** . This transformation is done through an **embedding model** that aims to capture the meanning of the text/chunk . 

**But why do wee need this ?**  

We need this because we want to store the vectors in a Vector Database (VectoreDB) where we will be able to store and retrieve high-dimensional vector data . And what matters is the fact that in this DB vectors with a **close meanning** are located closer together in the vector space and when our RAG app will receives a user input, it will be embedded and used to query the database, returning the most similar documents.

For this I will use the embedding model : 
`intfloat/multilingual-e5-small`

In [24]:
# This util function will allow us to only load one time the model in our memory to avoid loading it every time 
# I used a lock to avoid race condition do this function
_model_cache = None
_model_lock = threading.Lock()
def get_model():
    global _model_cache
    if _model_cache is None:
        with _model_lock:  # lock for one thread 
            if _model_cache is None:  
                device = 'cuda' if torch.cuda.is_available() else 'cpu'
                _model_cache = SentenceTransformer('intfloat/multilingual-e5-small', device=device)
    return _model_cache

def get_embeddings(docs : List[Document]) :
    """ 
    **docs**  
    The list of chunks created by loading the pdfs  
    """
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print("Using cuda " if device == "cuda" else "Using the cpu")
    model = get_model()
    embeddings = model.encode([doc.page_content for doc in docs], convert_to_tensor=True, device=device)
    assert len(embeddings) >0 , "The document is empty: NO EMBEDDING GENERATED"
    print(f"Successfully generated {len(embeddings)} embeddings")
    #Convert embeddings to a list 
    embeddings_list = embeddings.tolist()
    return embeddings_list
    


Let's test this sample of code !

In [25]:
embedding_list = get_embeddings(chunks)

Using cuda 
Successfully generated 86 embeddings


# 3) Step 3 : storing embeddings in a VectoreDB  

For this part as I used `ChromaDB` framework for the VectorDB . ChromaDB is an open-source AI application db that gives us everything we need for retrieval :  
- Store embeddings and their metadata
- Vector search
- Full-text search
- Document storage
- Metadata filtering
- Multi-modal retrieval

At the beginning I wrote the code mainly with the ChromaDB library , It allows to quickly understand how Chroma works and its main commands. But in reality langchain has a package where he directly integrated chroma , it is : `langchain_chroma`.  

So you will find two version of this step , the one with chroma and the other one with langchain, read the one that you want (or both)
 !

In [7]:
def store_embeddings(chromaDbPath: str, name_collection: str, chunks : List[Document]):
    """
    **chromaDbPath** 
    
    The path of the persistent client , if it does not exists it creates a new one  
    
    **name_collection**  
    
    Name wanted to the collection that we want to create to stock our vectors  
    
    **chunks**  
    
    List of chunks generated by loading and chunking the documents  
    
    """
    client = chromadb.PersistentClient(path=chromaDbPath)
    assert client is not None , "Problem in getting/creating the client"
    
    #Create a collection 
    collection = client.get_or_create_collection(
        name= name_collection,
        embedding_function= None #We will use our embeddings
    )
    
    # Create Ids and get embeddings of the document
    ids = [f"chunk_{i}" for i in range(len(chunks))]
    embeddings = get_embeddings(chunks)
    metadata = [chunk.metadata for chunk in chunks]
    documents = [chunk.page_content for chunk in chunks]
    try:
        collection.add(
            ids = ids  ,
            embeddings = embeddings,
            metadatas=metadata,
            documents= documents
        )
    except Exception as e:
        print(" Error while adding to ChromaDB : ",e)
    print("Successfully added all the docs to the vector database")


In [8]:
chromDbPath = './chromaDB'
name_collection = "saad1"
store_embeddings(
    chromaDbPath= chromDbPath,
    name_collection= name_collection,
    chunks=chunks
)


Using cuda 
Successfully generated 86 embeddings
Successfully added all the docs to the vector database


In [9]:
def query_vectorDb(query: str,chromaDbPath: str, collection_name :str,n_results: int) -> str:
    
    assert len(query) >0 , "Query cannot be empty"
    #Load the client and the db
    client = chromadb.PersistentClient(path= chromaDbPath)
    collection = client.get_collection(
        name= collection_name,
        embedding_function=None
    )
    
    doc_query = [Document(metadata={},page_content=query)]
    model_name = 'intfloat/multilingual-e5-small'   
    embedded_query = get_embeddings(doc_query,model_name)
    result = collection.query(
        query_embeddings=embedded_query,
        n_results= n_results
    )
    
    return result

Now let try with `langchain-chroma` package .  


In [None]:
def store_embeddings_with_langchain(collection_name: str , persist_dir:str , chunks : List[Document]):
    
    """ 
    **collection_name**  
    Name of the collection that already exists or to create  
    
    **persist_dir**  
    Directory where collection and db are stocked  
    
    **chunks**  
    List of chunks (that under the hood are Documents) generated  
    
    **returns**  
    Chroma instance
    
    """
    if torch.cuda.is_available():
        model_kwargs = {"device": "cuda"}
        print("Using Cuda to generate embeddings")
    else:
        model_kwargs= "cpu"
        print("Using CPU to generate embeddings")
    embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-small", model_kwargs=model_kwargs)
    vector_store = Chroma(
        collection_name= collection_name,
        embedding_function= embeddings,
        persist_directory= persist_dir,
    )
    
    ids=  [f"chunk_{i}" for i in range(len(chunks))]
    try:
        vector_store.add_documents(documents=chunks,ids=ids)
    except Exception as e:
        print(" Error while adding to ChromaDB : ",e)
    print(f"Successfully added {len(chunks)} vectors in the vector database ")
    return vector_store

In [60]:
vector_store = store_embeddings_with_langchain(
    collection_name=name_collection,
    persist_dir=chromDbPath,
    chunks=chunks
)

Using Cuda to generate embeddings
Successfully added 86 vectors in the vector database 


In [85]:
vector_store.similarity_search("Danger")

[Document(id='chunk_17', metadata={'source': 'pdf_documents/reglement.pdf'}, page_content='1.2. Organisation'),
 Document(id='chunk_64', metadata={'source': 'pdf_documents/reglement.pdf'}, page_content='se voir sanctionnés. Les degrés de sanctions applicables sont présentés plus loin.'),
 Document(id='chunk_79', metadata={'source': 'pdf_documents/reglement.pdf'}, page_content='Les sanctions sont appliquées par le Bureau et varient selon la gravité de la faute. Elles peuvent'),
 Document(id='chunk_13', metadata={'source': 'pdf_documents/reglement.pdf'}, page_content='de redoublement et de césure.')]

Now the function to query our vectore store is less complex because Langchain do everything for us with the `Chroma Class`.
We will create a pipeline by using one of the most intersting features of Langchain : `chains` and more precisely , we will create a **LCEL**: which stands for <u>LangChain Expression Language</u>.  

And here you may ask yourself he is talking about chains , but what these chains are linking ? The answer is : **runnables**. 

My next post will be about Runnables in Langchain so here I will just try to give a quick explanation of how it's working and why it used .  

# Runnables 

A **runnable** in Langchain is a the basic block/component that can be linked to other object of type Runnables . In fact **Runnables** is an abstract class in LangChain that is herited by others components (PromptTemplate, ChatOpenAI ....) and it provides methods that allows Runnables objects to connect between them : it's a link in the chain . Runnables object can also be executed with `.invoke()` method.  

And this is the power of LangChain , it allows us to seamlessly integrate various components required for building workflows and ensures that all components follow the same set of rules and can easily connect, which simplifies the development process. In addition there Runnables primitives that supports different workflows : parallel , sequential and also conditional executions.  

Okay now we can see the functionnement of a **chain**:  
$$
chain = Runnables1 | Runnables2 | .... | RunnablesN
$$
The ` | ` operator here takes the output of the previous Runnables and pass it as an entry to the next one. 

Now let's code our final step by using chains !

In [101]:
def get_rag_answer(query: str , vector_store: Chroma ):
    
    load_dotenv()
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        temperature = 0.5,
        max_retries = 2
    )

    docs_retriever = RunnableLambda(lambda query: vector_store.similarity_search(query=query,k=6))
    docs_to_text = RunnableLambda(lambda docs: "\n\n".join(doc.page_content for doc in docs))
    query_passthrough = RunnablePassthrough()
    prompt_template = ChatPromptTemplate.from_template(
        
        """Use the following context to answer the question at the end. 
           You must be respectful and helpful, and answer in the language of the question.
           If you don't know the answer, say that you don't know.

           Context: {context}

           Question: {question}"""
    )
    prompt_runnable = RunnableLambda(lambda args: prompt_template.format_messages(context=args["context"], question=args["query"]))
    
    pipeline = (
        {
            "context": docs_retriever|docs_to_text,
            "query": query_passthrough
        }
        | prompt_runnable
        | llm
        | StrOutputParser()
    )
    
    answer = pipeline.invoke(query)
    return answer
            


In [106]:
query = 'De quoi parle le document ?'
vector_store = vector_store

answer = get_rag_answer(query,vector_store)
answer

"Le document parle d'une organisation ou d'une association, vraisemblablement liée à l'ENSEEIHT, qui a pour objectif de développer des liens entre les personnes qui y sont attachées. Il aborde également des sujets tels que le montant de la cotisation, les sanctions applicables et fait référence à des campagnes pour les années 2021-2022."

Everything is working now it's time to put together all of the functions that we wrote !

In [109]:
def ask_question(query:str):
    collection_name = "tcp"
    path = "./pdf_documents"
    persitent_dir = './chromaDB'
    
    chunks = load_and_chunk(path)
    vector_store = store_embeddings_with_langchain(
        collection_name,
        persitent_dir,
        chunks
    )
    answer = get_rag_answer(query,vector_store)
    pprint.pp(answer) 

In [110]:
answer = ask_question("De quoi parle le document?")

100%|██████████| 1/1 [00:00<00:00,  6.63it/s]


 Successfully generated 138 chunks
Using Cuda to generate embeddings
 Error while adding to ChromaDB :  'list' object has no attribute 'page_content'
Successfully added 2 vectors in the vector database 
("Je ne peux pas vous dire de quoi parle le document car aucun contexte n'a "
 'été fourni.')
