<a href="https://colab.research.google.com/github/thad75/OptionAI/blob/main/Project/Project_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project : RAG

The goal of this project is to create a simple LLM-based RAG module. The most complex part is linking all the components correctly. You are free to create your own database. Anything can be RAGged: research papers, songs, subtitles, books, ...

This project deliberately provides little guidance: as engineers, you will dig into many existing methods and will need to pick the best one. We want you to get familiar with the engineering world.

We will give you some recommendations. You will run into lots of issues during this project, some you've already seen during the labs, others that are new. So buckle up, read the docs, and RAG your data.

There are plenty of tutorials on the internet that you could copy and paste to get a baseline.


## **We encourage you to do code versioning using Github.**


**Ideal Project Timeline:**

*   Talking and getting to know the modules. Discussing the choice of your LLMs and the environment you'll have. Choosing a first set of data to RAG. Setting up your GitHub (optional) (1h)
*   Setting up your first RAG chain using LangChain or another framework (2-3h)
*   Understanding the limitations of your RAG and finding enhancements to set up your 2nd RAG (2h)
*   Unveiling the unknown document and adapting your RAGs (2h)
*   Deploying it (optional)
*   Beginning your presentation (2h)
*   Presentation (5 min/group)


We'll evaluate your presentation quality, your RAG system's capability, and your progress throughout the project sessions.

**Students who missed the first session will start with a score of 0 in the progress part**. :-)


Presentation:
- The number of slides you can do is unlimited
- You only have 5 min to present your project. We will stop you at 5 min whether you've finished or not
- Your presentation should include:
  - a presentation of your workflow (Agile methodology, who did what, ...)
  - a presentation of your final pipeline with all enhancements done
  - a proof of work of your RAG on your data, and on the unknown data
  - the limitations you have and how to tackle them. Each overly obvious limitation (more GPUs, more RAM, ...) costs you 1 point
  - if you've deployed your RAG, a scannable QR code to test it live.

#  Preliminaries : Some useful downloads

Here are some useful frameworks that you could use to build your RAG.

In [None]:
!pip install -q pypdf python-dotenv
!pip install transformers
!pip install -q datasets loralib sentencepiece
!pip install -q einops accelerate langchain bitsandbytes
!pip install sentence_transformers
!pip install llama-index
%pip install --upgrade --quiet langchain langchain-openai faiss-cpu tiktoken


# I - Data Sourcing

Data sourcing for RAG is fairly simple. Some questions to guide you:

* What data do we need? A: Anything from a simple text to a document corpus.
* How can we correctly parse the data? A: We can use the LangChain tools for splitting text and managing documents.
* Does the vector index provide a good representation for the vector database?
* For example, using FAISS, can you easily retrieve your document? A: With FAISS we can retrieve the chunks closest to a query and then pass them to our LLM as context.

To begin, pick a simple story and store it in a vector database. You can choose among several vector stores (ChromaDB, Qdrant, ...).
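
To make the idea of a vector store concrete, here is a minimal sketch that uses only the standard library. It is a toy stand-in, not FAISS or ChromaDB: the "embedding" is a plain word count and the names `embed`, `cosine` and `ToyVectorStore` are hypothetical. It only illustrates the store-then-search-by-similarity pattern that real vector stores implement with learned embeddings.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks them by similarity."""

    def __init__(self, texts):
        self.entries = [(t, embed(t)) for t in texts]

    def search(self, query: str, k: int = 2):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

story = [
    "Angoras are cats.",
    "Maine coon are cats.",
    "Dogs are not cats.",
    "Cats and dogs have fur.",
]
store = ToyVectorStore(story)
top = store.search("Do dogs have fur?", k=1)
print(top)
```

A real store replaces `embed` with a sentence-embedding model and the sorted scan with an index, but the interface (add texts, search by query, get the top `k`) stays the same.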


In [None]:
import re
from langchain_text_splitters import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS

We start by defining a simple data sample. It's important to start with a short data corpus that we know very well, so we can tell whether our LLM is giving correct answers to our questions.

In [None]:
test_document = "Angoras are cats. Maine coon are cats. Dogs are not cats. Cats and dogs have fur."

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

#texts = text_splitter.create_documents([test_document])

#text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(test_document)[0]

print(texts)

Angoras are cats. Maine coon are cats. Dogs are not cats. Cats and dogs have fur.
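
The test document is shorter than `chunk_size=100`, so it comes back as a single chunk. To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified sketch using fixed-size character windows. This is an assumption-laden illustration, not the algorithm of `RecursiveCharacterTextSplitter`, which splits on separators first; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

test_document = "Angoras are cats. Maine coon are cats. Dogs are not cats. Cats and dogs have fur."

# With the notebook's settings the whole document fits in one window
one_chunk = chunk_text(test_document, chunk_size=100, chunk_overlap=20)

# Smaller windows produce several chunks that share 10 characters with their neighbour
small_chunks = chunk_text(test_document, chunk_size=40, chunk_overlap=10)
for chunk in small_chunks:
    print(repr(chunk))
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole, or nearly whole, in at least one chunk.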


# II - Module Creation

Depending on how you build your RAG module, you may or may not need to host an LLM yourself. We encourage you to test several LLMs (Mistral, Llama, Falcon, Gemma, ...). Be aware, however, that you will not have enough space to run them on this Colab.

We highly recommend using LangChain to build your Q&A app.

Some Questions to guide you:
* What model is easily accessible?
* Is there any existing code to begin with?
* What about the prompts ?
* What about the document parsing ?


Since we rely entirely on the LangChain framework, we can also define the form of the prompts our models will expect, which here are questions:

In [None]:
# We write a simple question to try on our model
question = "Is a Maine coon a cat?"

## Choosing an LLM
We will try different LLMs that require different types of resources, and choose the one that best suits our Q&A app.

In [None]:
!pip install langchain
!pip install -q pypdf python-dotenv
!pip install transformers
!pip install -q datasets loralib sentencepiece
!pip install -q einops accelerate langchain bitsandbytes
!pip install sentence_transformers
!pip install llama-index
%pip install --upgrade --quiet langchain langchain-openai faiss-cpu tiktoken
%pip install --upgrade --quiet llama-cpp-python


In [None]:
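import os
from getpass import getpass

# Never hardcode an access token in a notebook: prompt for it at runtime instead
hf_token = getpass("Hugging Face access token: ")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token
os.environ["HUGGINGFACE_TOKEN"] = hf_token

from langchain_community.llms import HuggingFaceHub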

### Mixtral 8x7b
We begin by testing Mixtral 8x7B, a model from Mistral AI. We call the `mistralai/Mixtral-8x7B-Instruct-v0.1` checkpoint through the Hugging Face Hub, so inference runs on Hugging Face's servers rather than in the Colab runtime. This lets us use the model in our project with quick execution times, without loading it locally.

In [None]:
llm_repo = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_huggingface = HuggingFaceHub(repo_id=llm_repo, model_kwargs={"temperature": 0.1, "max_length": 256})

In [None]:
output=llm_huggingface.predict(question)
print(output)

Is a Maine coon a cat?

Yes, a Maine coon is a cat. It is a breed of domestic cat that originated in the United States. Maine coons are known for their large size, long fur, and bushy tails. They are generally friendly and good-natured animals, and they are often referred to as "gentle giants." Maine coons are a popular breed of cat, and they are often kept as pets.


In [None]:
output=llm_huggingface.predict("Is a Dog a Cat?")
print(output)

Is a Dog a Cat?

No, a dog is not a cat. Dogs and cats are two different species of animals. Dogs belong to the Canidae family, which includes wolves, foxes, and other types of dogs. Cats belong to the Felidae family, which includes lions, tigers, and other types of cats.

Dogs and cats have different physical characteristics, behaviors, and needs. Dogs are generally larger and more muscular than cats, and they
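
The two manual checks above can be turned into a small repeatable sanity check, which is handy when comparing several LLMs on a corpus you know well. This is only a sketch: `check_answers` and `fake_llm` are hypothetical names, `fake_llm` is a canned stand-in for the real model call, and matching on an expected phrase is a crude criterion.

```python
def check_answers(ask, cases):
    """Run each question through `ask` and report whether the expected phrase appears in the answer."""
    results = []
    for question, expected_phrase in cases:
        answer = ask(question)
        results.append((question, expected_phrase.lower() in answer.lower()))
    return results

# Hypothetical stand-in for the real model call
def fake_llm(question: str) -> str:
    canned = {
        "Is a Maine coon a cat?": "Yes, a Maine coon is a cat.",
        "Is a Dog a Cat?": "No, a dog is not a cat.",
    }
    return canned.get(question, "I don't know.")

cases = [
    ("Is a Maine coon a cat?", "is a cat"),
    ("Is a Dog a Cat?", "is not a cat"),
]
results = check_answers(fake_llm, cases)
for question, passed in results:
    print(f"{'PASS' if passed else 'FAIL'}: {question}")
```

To test a real model, pass its call in place of `fake_llm`, for example `check_answers(llm_huggingface.predict, cases)`.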


In [None]:
import re
import fitz  # PyMuPDF

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extract the text of a PDF file and return it as a single string.
    """
    doc = fitz.open(pdf_path)
    all_text = ""

    for page in doc:
        all_text += page.get_text()

    doc.close()
    return all_text

# Replace this with the actual path to your PDF file
pdf_path = '/content/data_test2.pdf'
text = extract_text_from_pdf(pdf_path)

# Save the text to a .txt file
with open('output_text.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(text)

print("Text extraction is complete and the result is saved in 'output_text.txt'.")


Text extraction is complete and the result is saved in 'output_text.txt'.


In [None]:
def clean_text(text):
    """
    Function to clean the extracted text.
    - Removes headers and footers.
    - Removes page numbers.
    - Removes special characters and multiple spaces.
    """
    # Removing headers/footers (e.g., "Page | 123")
    cleaned_text = re.sub(r"Page\s*\|\s*\d+", " ", text)

    # Removing isolated page numbers
    cleaned_text = re.sub(r"\n\d+\n", "\n", cleaned_text)

    # Removing special characters and replacing them with spaces
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", " ", cleaned_text)

    # Replacing multiple spaces with a single space
    cleaned_text = re.sub(r"\s+", " ", cleaned_text)

    return cleaned_text

In [None]:
def structure_data(text):
    """
    Function to structure the cleaned text into logical units (paragraphs).
    """
    # Split the text into paragraphs on line breaks
    paragraphs = text.split("\n")

    # Strip any whitespace at the start and end of each paragraph
    paragraphs = [para.strip() for para in paragraphs]

    # Filter out empty or very short paragraphs
    paragraphs = [para for para in paragraphs if len(para) > 40]

    return paragraphs

In [None]:
# text file extracted from the pdf
file_path = '/content/output_text.txt'

# read the file
with open(file_path, 'r', encoding='utf-8') as f:
    raw_text = f.read()

# Cleaning
cleaned_text = clean_text(raw_text)

# Structuring
structured_data = structure_data(cleaned_text)

# Example of displaying the first 5 structured paragraphs
for paragraph in structured_data[:5]:
    print(paragraph)

print(structured_data[:5])

Oliver has been writing a book in secret since 2008 His book deals with a fantastic world His book tells the story of three heroes The heroes are Tom Thomas and Alexis The book is called The Fate of 3 Friends
['Oliver has been writing a book in secret since 2008 His book deals with a fantastic world His book tells the story of three heroes The heroes are Tom Thomas and Alexis The book is called The Fate of 3 Friends']
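
The printed output is a single paragraph: `clean_text` replaces every run of whitespace, line breaks included, with one space, so `structure_data` has no `\n` left to split on. If one passage per line is preferred for retrieval, a possible variant keeps the line breaks. This is a sketch under that assumption; `clean_text_keep_paragraphs` is a hypothetical name and the sample text is made up from the printed output, since the PDF's real layout is not shown.

```python
import re

def clean_text_keep_paragraphs(text: str) -> str:
    """Variant of clean_text that collapses spaces and tabs but leaves line breaks in place."""
    cleaned = re.sub(r"Page\s*\|\s*\d+", " ", text)     # "Page | 123" headers/footers
    cleaned = re.sub(r"\n\d+\n", "\n", cleaned)         # isolated page numbers
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", cleaned)   # special characters
    cleaned = re.sub(r"[^\S\n]+", " ", cleaned)         # whitespace runs, except newlines
    cleaned = re.sub(r" ?\n ?", "\n", cleaned)          # trim spaces around line breaks
    return cleaned.strip()

# Hypothetical sample with line breaks and a stray page number
sample = (
    "Oliver has been writing a book in secret since 2008.\n"
    "12\n"
    "His book tells the story of three heroes: Tom, Thomas and Alexis.\n"
)
cleaned = clean_text_keep_paragraphs(sample)
paragraphs = [p for p in cleaned.split("\n") if len(p) > 40]
print(paragraphs)
```

With line breaks preserved, each sentence becomes its own passage, which gives the retriever something finer-grained than the whole document to rank.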


In [None]:
import faiss
from sentence_transformers import SentenceTransformer

# Embedding model used for both the passages and the queries
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Build a FAISS index over the structured passages (a flat exact-L2 index is assumed here)
passage_vecs = embedder.encode(structured_data)
index = faiss.IndexFlatL2(passage_vecs.shape[1])
index.add(passage_vecs)

def search_with_context(query: str, k: int = 5):
    # Encode the query into a vector (encode already returns a NumPy array)
    query_vec = embedder.encode([query])

    # Search the FAISS index, never asking for more passages than it holds
    distances, indices = index.search(query_vec, min(k, len(structured_data)))

    # Retrieve the matching passages
    passages = [structured_data[idx] for idx in indices[0]]
    return passages


In [None]:
def ask_rag(question):
    # Join the retrieved passages into a single block of text
    context = "\n".join(search_with_context(question))
    full_query = f"{context}\nQuestion: {question}\nAnswer:"
    output = llm_huggingface.predict(full_query)
    return output
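
How the retrieved passages are laid out in the prompt has a direct effect on the answer. As a possible refinement, here is a sketch of a prompt builder that numbers the passages, caps the context size and tells the model to stay within the context. `build_prompt` and `max_chars` are hypothetical, and the wording of the instruction is only one option among many.

```python
def build_prompt(question: str, passages: list[str], max_chars: int = 2000) -> str:
    """Join retrieved passages into a numbered context block and append the question."""
    lines = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        if used + len(passage) > max_chars:
            break  # keep the prompt within a rough character budget
        lines.append(f"[{i}] {passage}")
        used += len(passage)
    context = "\n".join(lines)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

passages = [
    "The book is called The Fate of 3 Friends",
    "The heroes are Tom Thomas and Alexis",
]
prompt = build_prompt("What is the book called?", passages)
print(prompt)
```

The resulting string can be passed to the model in place of the f-string built inside `ask_rag`.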

In [None]:
!pip install --upgrade gradio



In [None]:
import gradio as gr

# Make sure ask_rag is defined and uses the context and the LLM as described above

iface = gr.Interface(
    fn=ask_rag,
    inputs=gr.Textbox(lines=2, placeholder="Ask your question here..."),
    outputs="text",
    title="RAG on a PDF",
    description="Ask a question and get an answer based on the content of a PDF."
)

iface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://7415d38da39381c912.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




__________________________

In [None]:
from operator import itemgetter

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_community.embeddings import HuggingFaceHubEmbeddings

In [None]:
import os
from getpass import getpass

from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.llms import HuggingFaceHub

# Read the token from the environment, or prompt for it; never hardcode it
token = os.environ.get("HUGGINGFACEHUB_API_TOKEN") or getpass("Hugging Face access token: ")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = token  # HuggingFaceHubEmbeddings reads this variable

# HuggingFaceHub expects the token as `huggingfacehub_api_token`;
# an unknown `token=` field raises a pydantic ValidationError
llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    huggingfacehub_api_token=token,
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

vectorstore = FAISS.from_texts(
    ["harrison worked at kensho"], embedding=HuggingFaceHubEmbeddings()
)
retriever = vectorstore.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatHuggingFace(llm=llm)

In [None]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [None]:
chain.invoke("where did harrison work?")

In [None]:
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer in the following language: {language}
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | model
    | StrOutputParser()
)

In [None]:
chain.invoke({"question": "where did harrison work", "language": "italian"})

# III - Model Serving (Optional and for the best)

You can take inspiration from the first hands-on. If you're feeling powerful, you can build something using FastAPI and push your deployed RAG to a simple github.io instance. Or just use the Gradio deployment framework.
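
To show the shape of such a service without extra dependencies, here is a sketch of a JSON endpoint built on the standard library's `http.server`. It is deliberately not FastAPI, which would replace the handler class with a decorated route, and `answer_question`, `handle_payload`, `RagHandler` and `serve` are hypothetical names. The placeholder `answer_question` is where the notebook's `ask_rag` would be plugged in.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def answer_question(question: str) -> str:
    """Hypothetical placeholder: swap in the notebook's ask_rag here."""
    return f"(no RAG connected) You asked: {question}"

def handle_payload(raw_body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        question = json.loads(raw_body)["question"]
    except (ValueError, KeyError, TypeError):
        error = {"error": 'expected a JSON body with a "question" field'}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"answer": answer_question(str(question))}).encode()

class RagHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000):
    """Blocks until interrupted; run it from a script rather than a notebook cell."""
    HTTPServer(("0.0.0.0", port), RagHandler).serve_forever()
```

Calling `serve()` starts the server; a client then sends a POST request whose body is `{"question": "..."}` and receives `{"answer": "..."}`. Keeping the request logic in `handle_payload` makes it testable without opening a socket.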

# Sources
- LangChain documentation: https://python.langchain.com/docs/get_started/introduction
- Llama implementation on Google Colab: https://www.reddit.com/r/LocalLLaMA/comments/16xswej/llamacpp_on_t4_google_colab_unable_to_use_gpu/
- HuggingFace llm on Colab: https://www.aimletc.com/how-to-access-open-sourced-llms-from-huggingface-on-colab/