# Project : RAG

The goal of this project is to build a simple LLM-based RAG (retrieval-augmented generation) module. The hardest part is linking all the components together correctly. You are free to create your own database: anything can be RAGged, from research papers to songs, subtitles, books, and so on.

This project deliberately comes with little guidance: as engineers, you will dig into many existing methods and need to pick the best one. We want you to get familiar with the engineering world.

We will give you some recommendations. You will run into many issues during this project, some you have already seen during the labs and others that are new. So buckle up, read the docs, and RAG your data.

There are of course many tutorials on the internet that you could simply copy and paste to get a baseline.


## **We encourage you to version your code using GitHub.**


**Ideal Project Timeline:**

*   Talking and getting to know the modules. Discussing the choice of your LLMs and the environment you'll have. Choosing a first set of data to RAG. Setting up your GitHub (optional) (1h)
*   Setting up your first RAG chain using LangChain or another framework (2-3h)
*   Understanding the limitations of your RAG and finding enhancements to set up your 2nd RAG (2h)
*   Unveiling the unknown document and adapting your RAGs (2h)
*   Deploying it (optional)
*   Beginning your presentation (2h)
*   Presentation (5 min/group)


We'll evaluate your presentation quality, your RAG system's capability, and your progress throughout the project sessions.

**Students who missed the first session will start with a score of 0 in the progress part**. :-)


Presentation:
- The number of slides you can do is unlimited
- You only have 5 min to present your project. We will stop you at 5 min whether you've finished or not
- Your presentation should include:
  - a presentation of your workflow (Agile methodology, each member's role, ...)
  - a presentation of your final pipeline with all enhancements done
  - a proof of work of your RAG on your data and on the unknown data
  - the limitations you have and how to tackle them. Each overly obvious limitation (more GPUs, more RAM, ...) costs 1 point
  - if you've deployed your RAG, a scannable QR code to test it live.

#  Preliminaries : Some useful downloads

We list some useful frameworks that you could use to build your RAG.

In [None]:
!pip install -q pypdf python-dotenv
!pip install transformers
!pip install -q datasets loralib sentencepiece
!pip install -q einops accelerate langchain bitsandbytes
!pip install sentence_transformers
!pip install llama-index
!pip install --upgrade --quiet langchain langchain-openai faiss-cpu tiktoken



# I - Data Sourcing

Data Sourcing for RAG is really simple. Some questions to guide you:

* What data do we need?
* How can we correctly parse the data?
* Does the vector index provide a good representation for the vector database?
* For example, using FAISS, can you easily retrieve your document?

To begin, pick a simple story and store it in a vector database. You can choose between multiple vector stores (ChromaDB, Qdrant, ...).
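To make the "store, then retrieve" idea concrete before picking a real vector store, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: `TinyVectorStore`, `embed`, and `cosine` are hypothetical names, and the bag-of-words "embedding" is only a stand-in for a real embedding model. It does not reproduce the API of ChromaDB, Qdrant, or FAISS.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, texts):
        for t in texts:
            self.items.append((t, embed(t)))

    def search(self, query: str, k: int = 2):
        # Rank every stored text by similarity to the query, best first
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
store.add([
    "The fox lives in the forest.",
    "The baker sells bread every morning.",
    "A storm destroyed the old bridge.",
])
top = store.search("Who sells bread?", k=1)
print(top)
```

A real vector store replaces the word counts with dense embeddings and the sorted scan with an index, but the add/search contract stays the same.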


In [None]:
import re
import fitz  # PyMuPDF

CHOOSING A DATASET

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text of a PDF file and return it as a single string.
    """
    doc = fitz.open(pdf_path)
    all_text = ""

    for page in doc:
        all_text += page.get_text()

    doc.close()
    return all_text

# Replace the path below with the actual path to your PDF file
# pdf_path = '/content/GoodLuck.pdf'
pdf_path = '/content/Marketing responsale et Durable VF-1.pdf'
# Chaleya - by Arijit Singh and Shilpa Rao

text = extract_text_from_pdf(pdf_path)
print(text)

# Save the text to a .txt file
with open('output_text.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(text)

print("Text extraction is complete and the result is saved in 'output_text.txt'.")


1
Table des matières
Introduction ........ 2
1. Définition du marketing durable
........ 3
2. Importance et défis du marketing durable ........ 3
3. Les 3 piliers du développement durable ........ 5
3.1. Économique
........ 6
3.2. Social
........

DATA CLEANING

In [None]:
def clean_text(text):
    """
    Clean the extracted text.
    - Removes headers and footers.
    - Removes page numbers.
    - Removes special characters and repeated spaces.
    Accented letters and line breaks are kept, so French text survives and
    the result can still be split into paragraphs.
    """
    # Remove headers/footers (e.g., "Page | 123")
    cleaned_text = re.sub(r"Page\s*\|\s*\d+", " ", text)

    # Remove isolated page numbers (a line containing only digits)
    cleaned_text = re.sub(r"\n\d+\n", "\n", cleaned_text)

    # Replace special characters with spaces; \w also matches accented letters
    cleaned_text = re.sub(r"[^\w\s]", " ", cleaned_text)

    # Collapse runs of spaces and tabs into one space, leaving newlines intact
    cleaned_text = re.sub(r"[^\S\n]+", " ", cleaned_text)

    return cleaned_text
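The extracted text starts with a table of contents whose lines are mostly dot leaders, which carry no meaning for retrieval. As one possible extra cleaning step, the sketch below drops any line containing a long run of dots. The helper name `drop_toc_lines` and the threshold of five dots are assumptions, and the last line of the sample is made up for the demonstration.

```python
import re

# Assumption: a run of 5 or more dots marks a table-of-contents leader
DOT_LEADER = re.compile(r"\.{5,}")


def drop_toc_lines(text: str) -> str:
    """Remove lines that contain dot leaders, e.g. 'Introduction ........ 2'."""
    kept = [line for line in text.split("\n") if not DOT_LEADER.search(line)]
    return "\n".join(kept)


sample = (
    "Table des matières\n"
    "Introduction ............ 2\n"
    "3.1. Économique\n"
    "............ 6\n"
    "Le marketing durable vise..."  # made-up body line with a normal ellipsis
)
filtered = drop_toc_lines(sample)
print(filtered)
```

Headings whose page number sits on the following line (such as `3.1. Économique`) survive this filter, so a stricter rule may be needed depending on the document.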

DATA STRUCTURING

In [None]:
def structure_data(text):
    """
    Structure the text into logical units (paragraphs).
    """
    # Split the text into paragraphs on line breaks
    paragraphs = text.split("\n")

    # Strip leading and trailing whitespace from each paragraph
    paragraphs = [para.strip() for para in paragraphs]

    # Filter out empty or very short paragraphs
    paragraphs = [para for para in paragraphs if len(para) > 2]

    return paragraphs
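Splitting on line breaks yields many very short units, because a PDF line is rarely a full paragraph. A common alternative is fixed-size chunks with some overlap, so that a sentence cut at a boundary still appears whole in one chunk. The sketch below is one way to do it; `chunk_words` and its default sizes are assumptions to tune on your own data.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(demo, size=5, overlap=2)
for c in chunks:
    print(c)
```

Larger chunks give the LLM more context per hit but make the embedding less specific, so chunk size is worth treating as a parameter of your experiments.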

USING THE FUNCTIONS

In [None]:
# Text file extracted from the PDF
file_path = '/content/output_text.txt'

# Read the file
with open(file_path, 'r', encoding='utf-8') as f:
    raw_text = f.read()

# Cleaning
cleaned_text = clean_text(raw_text)

# Structuring (run on the raw text here)
#structured_data = structure_data(cleaned_text)
structured_data = structure_data(raw_text)

# Display the structured paragraphs
for paragraph in structured_data:
    print(paragraph)

Table des matières
Introduction ........ 2
1. Définition du marketing durable
........ 3
2. Importance et défis du marketing durable ........ 3
3. Les 3 piliers du développement durable ........ 5
3.1. Économique
........ 6
3.2. Social
........

DATA INDEXING

I chose FAISS, a library developed by Facebook AI Research, to handle the indexing.

In [None]:
!pip install faiss-cpu
!pip install -U sentence-transformers
!pip install gradio


In [None]:
from sentence_transformers import SentenceTransformer

transformer_model = "multi-qa-mpnet-base-dot-v1"
#transformer_model = 'all-MiniLM-L6-v2'

model = SentenceTransformer(transformer_model)

# Generate the embeddings for each paragraph
embeddings = model.encode(structured_data, convert_to_tensor=True, batch_size=64)  # Use a batch_size suited to your machine

# Check the shape of the embeddings
print(embeddings.shape)

torch.Size([885, 768])


In [None]:
import faiss
import numpy as np

# Convert embeddings to numpy array if not already in one
embeddings_np = embeddings.cpu().detach().numpy() if hasattr(embeddings, 'cpu') else np.array(embeddings)

# Determine the size of embeddings
d = embeddings_np.shape[1]

# Create a FAISS index - here we use a flat L2 index
index = faiss.IndexFlatL2(d)

# Add vectors to the index
index.add(embeddings_np)

print(f"Indexed {index.ntotal} vectors.")

Indexed 885 vectors.
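A flat L2 index performs an exact, exhaustive search: the query is compared with every stored vector and the closest ones are returned, scored by squared Euclidean distance. The pure-Python sketch below mimics that behaviour on toy 2-D vectors to show what `index.search` computes conceptually. The function `flat_l2_search` is a hypothetical illustration, not FAISS code.

```python
def flat_l2_search(vectors, query, k=3):
    """Exhaustive nearest-neighbour search; returns (squared_distances, indices), closest first."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
print(distances, indices)
```

The cost grows linearly with the number of vectors, which is fine for a few hundred paragraphs. Approximate indexes trade a little accuracy for speed on much larger collections.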


In [None]:
from sentence_transformers import SentenceTransformer

# `model` is the SentenceTransformer used to compute the embeddings
model = SentenceTransformer(transformer_model)

def search_with_context(query: str, k: int = 5):
    # Encode the query into a vector with the model
    query_vec = model.encode([query])  # Returns a NumPy array by default, so no tensor conversion is needed

    # Run the search with FAISS
    distances, indices = index.search(query_vec, k)

    # Retrieve the matching passages (FAISS pads missing results with -1)
    passages = [structured_data[idx] for idx in indices[0] if idx >= 0]
    return passages
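To understand the limits of a retriever, it helps to measure it on a few questions whose answers you already know. The sketch below computes a simple hit rate: the fraction of queries for which an expected snippet appears in the top-k passages. The names `hit_rate_at_k` and `fake_search` and the labelled pairs are assumptions for illustration. In the notebook you would pass `search_with_context` instead of the stand-in retriever.

```python
def hit_rate_at_k(search_fn, labelled_queries, k=5):
    """Fraction of queries whose expected snippet appears in at least one top-k passage."""
    if not labelled_queries:
        return 0.0
    hits = 0
    for query, expected in labelled_queries:
        passages = search_fn(query, k)
        if any(expected.lower() in p.lower() for p in passages):
            hits += 1
    return hits / len(labelled_queries)


# Stand-in retriever so the sketch runs on its own; swap in search_with_context
def fake_search(query, k=5):
    corpus = {
        "pillars": ["Les 3 piliers du développement durable"],
        "definition": ["Définition du marketing durable"],
    }
    return corpus.get(query, [])[:k]


labelled = [("pillars", "piliers"), ("definition", "défis"), ("other", "x")]
score = hit_rate_at_k(fake_search, labelled, k=5)
print(score)
```

Tracking this number while changing the chunking, the embedding model, or `k` gives a quick signal of whether an enhancement actually helps.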


# II - Module Creation

Depending on how you build your RAG module, you may or may not need an LLM. We encourage you to test some LLMs (Mistral, Llama, Falcon, Gemma, ...). However, be aware that you won't have enough space to run them on this Colab.

We highly recommend using LangChain to build your Q&A app.

Some questions to guide you:
* Which models are easily accessible?
* Is there any existing code to begin with?
* What about the prompts?
* What about the document parsing?
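On the prompt question, one common pattern is to give the model explicit instructions, the retrieved passages as numbered context, and then the question. The sketch below is a hedged example of that pattern. `build_prompt`, its wording, and the `max_chars` budget are assumptions, not a LangChain API.

```python
def build_prompt(question: str, passages, max_chars: int = 2000) -> str:
    """Assemble a grounded prompt: instructions, numbered context passages, then the question."""
    context_lines = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        if used + len(passage) > max_chars:
            break  # stop before the context budget is exceeded
        context_lines.append(f"[{i}] {passage}")
        used += len(passage)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt("What are the three pillars?", ["Passage A", "Passage B"])
print(prompt)
```

Numbering the passages also makes it possible to ask the model to cite which passage supports its answer.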


In [None]:
!pip install langchain
!pip install -q pypdf python-dotenv
!pip install transformers
!pip install -q datasets loralib sentencepiece
!pip install -q einops accelerate langchain bitsandbytes
!pip install sentence_transformers
!pip install llama-index
!pip install --upgrade --quiet langchain langchain-openai faiss-cpu tiktoken
!pip install --upgrade --quiet llama-cpp-python


In [None]:
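import os
from getpass import getpass

# Never hard-code an API token in a notebook; prompt for it (or load it from a secret) instead
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Hugging Face API token: ")
# Import the Hugging Face Hub API wrapper
from langchain import HuggingFaceHub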

In [None]:
# LLM choice: Mixtral (Mistral AI)
llm_repo = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# Randomness level: temperature
llm_huggingface = HuggingFaceHub(repo_id=llm_repo, model_kwargs={"temperature": 0.1, "max_length": 4096})

# III - Model Serving (Optional and for the best)

You can take inspiration from the first hands-on session. If you're feeling powerful, you can build something using FastAPI and push your deployed RAG to a simple github.io instance. Or just use the Gradio deployment framework.

In [None]:
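def ask_rag(question):
    # Retrieve the most relevant passages and join them into one context string
    passages = search_with_context(question)
    context = "\n".join(passages)
    full_query = f"{context}\nQuestion: {question}\nAnswer:"
    output = llm_huggingface.predict(full_query)
    return output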

In [None]:
!pip install --upgrade gradio



In [None]:
import gradio as gr

# Gradio interface creation
iface = gr.Interface(
    fn=ask_rag,
    inputs=gr.Textbox(lines=2, placeholder="Ask your question here..."),  # Multi-line text input
    outputs="text",
    title="RAG about PDF",
    description="Ask a question and get an answer based on the content of a PDF."
)

iface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://ee8b44a08ac05129a9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


