# Exercises XP Ninja
Last Updated: March 7th, 2025

## 👩‍🏫 👩🏿‍🏫 What You’ll learn
Optimize RAG systems for domain-specific knowledge.
Develop multimodal RAG systems integrating text and images.
Implement RAG systems with knowledge graph integration for enhanced reasoning.


## 🛠️ What you will create
An optimized RAG system for a specialized domain.
A multimodal RAG system capable of handling text and visual data.
A RAG system integrated with a knowledge graph for complex reasoning.


## Exercise 1: Optimizing RAG for Domain-Specific Knowledge
Objective: Develop a RAG system that effectively handles complex, domain-specific queries by fine-tuning retrieval and generation components.

Instructions:

Dataset: Use a specialized dataset, such as a collection of legal documents, medical research papers, or technical manuals. Choose a domain that requires deep understanding and precise information retrieval.
Advanced Retrieval:
Implement a hybrid retrieval approach combining dense retrieval (BERT embeddings) with sparse retrieval (BM25 or TF-IDF) to improve relevance.
Explore techniques for query expansion and re-ranking to refine search results.
Implement chunking strategies that are tailored to the dataset. For example, in legal documents, chunking should be based on paragraphs or sections, not arbitrary word counts.
Fine-Tuning:
Fine-tune a pre-trained BERT model on your domain-specific dataset to improve embedding quality.
Fine-tune a pre-trained GPT model on a question-answering dataset relevant to your domain to enhance generation accuracy.
Implement a mechanism that collects user feedback and then fine-tunes the system on that feedback.
Evaluation:
Develop a comprehensive evaluation metric that considers both retrieval accuracy (e.g., recall, precision) and generation quality (e.g., factual correctness, coherence).
Compare the performance of your optimized RAG system with a baseline RAG system using standard retrieval and generation components.
Bonus: Implement a system that cites its sources in the generated response.
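
The retrieval half of the evaluation step can be sketched in a few lines. This is a minimal illustration, not a full metric suite: the function name `precision_recall_at_k`, the document ids, and the relevance labels are all hypothetical, and precision is computed over the documents actually returned (some definitions divide by `k` instead).

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a single query."""
    top_k = list(retrieved_ids)[:k]          # ranked results, cut at k
    relevant = set(relevant_ids)             # ground-truth relevant documents
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the retriever returned documents 2, 3, 1 (in rank order)
# and an annotator marked documents 2 and 5 as relevant.
p, r = precision_recall_at_k([2, 3, 1], [2, 5], k=2)
print(f"precision@2={p:.2f}, recall@2={r:.2f}")
```

Averaging these values over a labeled query set gives one number per system, which is enough to compare an optimized retriever against a baseline.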

In [1]:
import os
import faiss
import torch
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from datasets import Dataset
import numpy as np

# 1. Dataset (simplified example)
legal_docs = [
    {"id": 1, "text": "Article 1 : Toute personne a droit au respect de sa vie privée."},
    {"id": 2, "text": "Article 2 : La liberté d'expression est garantie par la loi."},
    {"id": 3, "text": "Article 3 : Le droit à un procès équitable est fondamental."}
]
texts = [doc["text"] for doc in legal_docs]

# 2. Sparse Retrieval (TF-IDF)
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(texts)

# 3. Dense retrieval (sentence-transformers embeddings)
bert_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Encode straight to a float32 NumPy array so FAISS can normalize it in place
dense_embeddings = bert_model.encode(texts, convert_to_numpy=True).astype("float32")

# 4. Index the dense embeddings with FAISS
# On unit-length vectors, L2 distance ranks results the same way as cosine similarity
faiss.normalize_L2(dense_embeddings)
index = faiss.IndexFlatL2(dense_embeddings.shape[1])
index.add(dense_embeddings)

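# 5. Hybrid search: union of the dense (FAISS) and sparse (TF-IDF) top-k hits

def hybrid_search(query, top_k=2):
    # Dense hits: normalize the query the same way as the indexed vectors
    query_embedding = bert_model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(query_embedding)
    distances, indices = index.search(query_embedding, top_k)

    # Sparse hits: TF-IDF dot-product scores against every document
    sparse_scores = tfidf.transform([query]) @ tfidf_matrix.T
    sparse_indices = np.argsort(sparse_scores.toarray()[0])[::-1][:top_k]

    # Merge both hit lists; sorting keeps the context order deterministic
    combined = sorted(set(indices[0].tolist()) | set(sparse_indices.tolist()))
    final_results = [(i, texts[i]) for i in combined]
    return final_results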

# 6. Generation with a causal LLM (Falcon-7B-Instruct here; can be swapped for a fine-tuned model)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

# 7. RAG Pipeline
def answer_question(question):
    docs = hybrid_search(question, top_k=2)
    context = "\n".join([d[1] for d in docs])
    prompt = f"Réponds précisément à la question suivante en citant les articles de loi :\nContext : {context}\nQuestion : {question}\nRéponse : "
    response = text_generator(prompt, max_length=200, do_sample=True)[0]['generated_text']
    return response

# 8. Usage example
question = "Quels sont les droits fondamentaux garantis par la loi ?"
response = answer_question(question)
print(response)

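# 9. Simplified evaluation
def evaluate(query_list, expected_keywords):
    correct = 0
    for query, keyword in zip(query_list, expected_keywords):
        # The pipeline echoes the prompt, so keep only the text after the final "Réponse :" marker;
        # otherwise a keyword found in the retrieved context would count as a hit
        answer = answer_question(query).rsplit("Réponse :", 1)[-1]
        if keyword.lower() in answer.lower():
            correct += 1
    print(f"Accuracy: {correct}/{len(query_list)}")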

# 10. Feedback loop (simplified)
def feedback_loop(query, correct_answer):
    # Manual feedback: log wrong answers so they can be used for later retraining
    with open("feedback_log.txt", "a", encoding="utf-8") as f:
        f.write(f"Question: {query}\nExpected: {correct_answer}\n\n")

# Baseline: compare against TF-IDF alone or FAISS alone


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Fetching 2 files: 100%|██████████| 2/2 [00:49<00:00, 24.72s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.15s/it]
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_t

Réponds précisément à la question suivante en citant les articles de loi :
Context : Article 2 : La liberté d'expression est garantie par la loi.
Article 3 : Le droit à un procès équitable est fondamental.
Question : Quels sont les droits fondamentaux garantis par la loi ?
Réponse : (1) liberté d'expression, (2) droit à un procès équitable.


Analysis:

### **What works well**:

* **Working RAG pipeline**: the hybrid search retrieves the relevant articles (Articles 2 and 3).
* **Correct generation**: the answer is concise, on target, and legally correct with respect to the supplied context.
* **CUDA enabled**: generation runs on your 3090 GPU.

### **Technical warnings** (non-blocking):

* **`tqdm`**: warning about Jupyter/ipywidgets, which are not used here → safe to ignore.
* **`huggingface_hub` symlinks**: Windows prevents the cache optimization → minor impact (more disk space used).
* **`truncation`**: minor warning → for clean code, add `truncation=True`.
* **`max_length` vs `max_new_tokens`**: only `max_new_tokens` is taken into account → no real problem, but you could remove `max_length=200` for clarity.

### **Final interpretation**:

* **Functional RAG**, relevant answers, GPU used as expected.
* **A few purely cosmetic warnings** on Windows, with no impact on answer quality.
* **Correct final answer**, although the model is terse (short answers are typical with a small context).
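
The `hybrid_search` function in the cell above merges dense and sparse hits with a plain set union, so it does not rank them against each other. One possible way to add a ranking step is reciprocal rank fusion (RRF), sketched below. This is an illustration under stated assumptions, not what the notebook ran: the function name `reciprocal_rank_fusion` and the two example rankings are hypothetical, and `k=60` is simply a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document indices into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked high in
            # several lists accumulate the largest totals.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document index for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings over three documents (indices into `texts`)
dense_ranking = [1, 2, 0]   # e.g. order returned by the FAISS index
sparse_ranking = [2, 0, 1]  # e.g. order returned by TF-IDF
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # [2, 1, 0]
```

RRF only needs ranks, so it avoids having to put L2 distances and TF-IDF scores on a common scale.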


## Exercise 2: Building a Multimodal RAG System
Objective: Extend the RAG framework to incorporate multimodal data, such as images or diagrams, alongside text for enhanced information retrieval and generation.

Instructions:

Dataset: Create or use a dataset that includes both text and images/diagrams, such as scientific papers with figures, product descriptions with images, or educational materials with illustrations.
Multimodal Embeddings:
Use a pre-trained multimodal model (e.g., CLIP) to generate joint embeddings for text and images/diagrams.
Store the multimodal embeddings in a vector database that supports efficient similarity search.
Multimodal Retrieval:
Implement a retriever that takes a user query (text or image) and retrieves relevant documents and images based on their multimodal embeddings.
Implement a system that allows users to query with both text and images.
Multimodal Generation:
Modify the generator to handle both text and image/diagram inputs.
Explore techniques for generating text that incorporates information from both modalities, such as describing an image or explaining a diagram.
Evaluation:
Develop evaluation metrics that assess the system’s ability to retrieve and generate information from both text and images/diagrams.
Test the system with a variety of queries that combine text and visual information.
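
The multimodal embedding and retrieval steps can be illustrated without loading a real model. The sketch below assumes each document already has one text vector and one image vector from a CLIP-style encoder; the helper names (`l2_normalize`, `fuse`, `cosine_top_k`) and the toy 3-dimensional vectors are hypothetical stand-ins. It normalizes each modality before averaging, so a modality with larger raw norms does not dominate the fused vector.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype="float32")
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def fuse(text_emb, image_emb, text_weight=0.5):
    """Weighted average of per-modality unit vectors, renormalized."""
    fused = text_weight * l2_normalize(text_emb) + (1.0 - text_weight) * l2_normalize(image_emb)
    return l2_normalize(fused)


def cosine_top_k(query_emb, doc_embs, k=2):
    """Return (index, cosine score) for the k most similar documents."""
    scores = doc_embs @ query_emb  # dot product equals cosine on unit vectors
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for real encoder outputs (illustrative only)
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
image_embs = np.array([[10.0, 1.0, 0.0], [0.0, 5.0, 5.0], [1.0, 0.0, 1.0]])
docs = fuse(text_embs, image_embs)
query = l2_normalize(np.array([0.0, 1.0, 0.0]))
results = cosine_top_k(query, docs, k=2)
print(results)  # document 1 ranks first, then document 0
```

Changing `text_weight` is a simple knob for queries where one modality should count more than the other.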

In [3]:
!pip install open_clip_torch



In [6]:
import os
import torch
import numpy as np
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from datasets import load_dataset
import faiss
import open_clip

# 1. Load the MuLMS-Img dataset from Hugging Face
dataset = load_dataset("Timbrt/MuLMS-Img")
samples = []

# 2. Collect the texts (captions) and the images (PNG)
for item in dataset['train']:
    samples.append({
        "text": item["caption"],
        "image_path": item["img_path"]  # local path inside the datasets cache
    })

texts = [sample['text'] for sample in samples]
image_paths = [sample['image_path'] for sample in samples]

# 3. Multimodal embeddings with open_clip
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_e16', device=device)
clip_model.eval()  # inference mode
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Encode the texts
with torch.no_grad():
    text_tokens = tokenizer(texts)
    text_embeddings = clip_model.encode_text(text_tokens.to(device))

# Encode the images
images = [clip_preprocess(Image.open(p)).unsqueeze(0).to(device) for p in image_paths]
with torch.no_grad():
    image_embeddings = torch.cat([clip_model.encode_image(img) for img in images])

# Fuse text and image embeddings: normalize each modality first so neither dominates the average
text_embeddings = torch.nn.functional.normalize(text_embeddings, dim=-1)
image_embeddings = torch.nn.functional.normalize(image_embeddings, dim=-1)
combined_embeddings = (text_embeddings + image_embeddings) / 2
combined_embeddings = combined_embeddings.cpu().numpy().astype('float32')

# 4. FAISS indexing (unit-length vectors, so L2 ranking matches cosine ranking)
index = faiss.IndexFlatL2(combined_embeddings.shape[1])
faiss.normalize_L2(combined_embeddings)
index.add(combined_embeddings)

# 5. Multimodal search
def search_multimodal(query, image_path=None, top_k=2):
    with torch.no_grad():
        query_tokens = tokenizer([query])
        query_text_emb = clip_model.encode_text(query_tokens.to(device))
        if image_path:
            query_img = clip_preprocess(Image.open(image_path)).unsqueeze(0).to(device)
            query_img_emb = clip_model.encode_image(query_img)
            query_emb = (query_text_emb + query_img_emb) / 2
        else:
            query_emb = query_text_emb
    query_emb = query_emb.cpu().numpy().astype('float32')
    faiss.normalize_L2(query_emb)
    distances, indices = index.search(query_emb, top_k)
    return [(samples[i]['text'], samples[i]['image_path']) for i in indices[0]]

# 6. Text generation with a causal LLM (Falcon)
llm_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
llm_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
# Let the pipeline place the model: GPU 0 when CUDA is available, otherwise CPU (-1)
generator = pipeline("text-generation", model=llm_model, tokenizer=llm_tokenizer, device=0 if device == "cuda" else -1)

# 7. Full multimodal RAG pipeline
def multimodal_answer(query, image_path=None):
    results = search_multimodal(query, image_path)
    context = "\n".join([r[0] for r in results])
    prompt = f"Avec les informations suivantes provenant de textes et d'images :\n{context}\nRéponds à la question : {query}\nRéponse : "
    # max_new_tokens bounds only the generated continuation; max_length would also count the prompt tokens
    response = generator(prompt, max_new_tokens=200, do_sample=True)[0]['generated_text']
    return response

# 8. Example
question = "Explique le fonctionnement du diagramme de phase d'un alliage"
reponse = multimodal_answer(question)
print(reponse)

# 9. Dataset and paper references
print("Dataset used: MuLMS-Img: https://huggingface.co/datasets/Timbrt/MuLMS-Img")
print("Reference paper: Tarsi_SciOL_and_MuLMS-Img_Introducing_a_Large-Scale_Multimodal_Scientific_Dataset_and_WACV_2024_paper.pdf")
print("Link: https://openaccess.thecvf.com/content/WACV2024/papers/Tarsi_SciOL_and_MuLMS-Img_Introducing_a_Large-Scale_Multimodal_Scientific_Dataset_and_WACV_2024_paper.pdf")

# 10. Quick evaluation
def simple_eval(queries, expected_keywords):
    correct = 0
    for q, kw in zip(queries, expected_keywords):
        # The pipeline echoes the prompt, so score only the text after the final "Réponse :" marker
        answer = multimodal_answer(q).rsplit("Réponse :", 1)[-1]
        if kw.lower() in answer.lower():
            correct += 1
    print(f"Accuracy: {correct}/{len(queries)}")


Downloading data: 100%|██████████| 729/729 [00:17<00:00, 42.85files/s]
Downloading data:  29%|██▉       | 54/184 [00:01<00:02, 45.18files/s]HTTP Error 429 thrown while requesting HEAD https://huggingface.co/api/resolve-cache/datasets/Timbrt/MuLMS-Img/22b093f1981a68a2ddbc531f5076549071f02441/annotations%2Fdev%2Fimagespemfc_13-FigureS7-1d.json
Retrying in 1s [Retry 1/5].
HTTP Error 429 thrown while requesting HEAD https://huggingface.co/datasets/Timbrt/MuLMS-Img/resolve/22b093f1981a68a2ddbc531f5076549071f02441/annotations/dev/imagespoly_11-Figure4-1a.json
Retrying in 1s [Retry 1/5].
HTTP Error 429 thrown while requesting GET https://huggingface.co/api/resolve-cache/datasets/Timbrt/MuLMS-Img/22b093f1981a68a2ddbc531f5076549071f02441/

HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/resolve-cache/datasets/Timbrt/MuLMS-Img/22b093f1981a68a2ddbc531f5076549071f02441/annotations%2Fdev%2Fimagespemfc_13-FigureS3-1.json
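
The run above stopped on HTTP 429 (Too Many Requests) while the dataset files were being fetched, so none of the later steps in this cell executed. One common mitigation is to wrap the download in a retry loop with exponential backoff. The helper below is a hypothetical sketch (`retry_with_backoff` is not part of `datasets` or `huggingface_hub`), demonstrated on a simulated failure rather than a real download.

```python
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call fn(); on failure wait base_delay * 2**(attempt - 1) seconds and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky call: fails twice, then succeeds (illustration only)
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

delays = []
result = retry_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [1.0, 2.0]

# Possible use with the cell above (not executed here):
# dataset = retry_with_backoff(lambda: load_dataset("Timbrt/MuLMS-Img"), base_delay=30.0)
```

Files that were already downloaded stay in the local cache, so a later attempt normally resumes rather than starting over.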

## Exercise 3: Implementing a RAG System with Knowledge Graph Integration
Objective: Enhance the RAG system by integrating a knowledge graph to improve reasoning and contextual understanding.

Instructions:

Dataset and Knowledge Graph:
Choose a domain-specific dataset (e.g., a collection of scientific articles, historical documents, or financial reports).
Build a knowledge graph representing the entities and relationships within the dataset. You can use tools such as Neo4j or RDF libraries.
Knowledge Graph-Enhanced Retrieval:
Modify the retriever to incorporate knowledge graph information during the retrieval process.
Implement techniques for querying the knowledge graph based on the user query and the retrieved documents.
Use the knowledge graph to identify relevant entities and relationships that may not be explicitly mentioned in the text.
Knowledge Graph-Enhanced Generation:
Modify the generator to use knowledge graph information when generating responses.
Explore techniques for integrating knowledge graph facts into the generated text, such as providing explanations or making inferences.
Reasoning and Inference:
Implement reasoning capabilities that allow the RAG system to answer complex questions requiring multiple inference steps.
Use the knowledge graph to identify implicit relationships and generate more complete answers.
Evaluation:
Develop evaluation metrics that assess the system's ability to use the knowledge graph for retrieval and generation.
Test the system with complex queries that require reasoning and inference.
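
Multi-step inference over a knowledge graph can be prototyped in memory before any graph database is involved. The sketch below runs a breadth-first search over a handful of toy triples; the relation names (`ACQUIRED`, `HOLDS_STAKE_IN`, `PARTNERS_WITH`) and the helper names are hypothetical, and the entities mirror the sample sentences used in the code cell further down.

```python
from collections import deque

# Toy triples (head, relation, tail), invented for illustration
TRIPLES = [
    ("Alpha", "ACQUIRED", "Beta"),
    ("Beta", "HOLDS_STAKE_IN", "Gamma"),
    ("Gamma", "PARTNERS_WITH", "Delta"),
]


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    return adjacency


def find_path(triples, start, goal, max_hops=3):
    """Breadth-first search: shortest chain of triples from start to goal, or None."""
    adjacency = build_adjacency(triples)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


path = find_path(TRIPLES, "Alpha", "Gamma")
print(path)  # [('Alpha', 'ACQUIRED', 'Beta'), ('Beta', 'HOLDS_STAKE_IN', 'Gamma')]
```

The returned chain can be rendered as text and appended to the generator's context, which gives the model the intermediate hop (Beta) that no single document states together with both endpoints.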

In [8]:
!pip install py2neo




In [9]:
import os
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer
import faiss
from py2neo import Graph

# ✅ 1. Dataset: small business documents + Neo4j knowledge graph
# Simplified example: local text collection + Neo4j graph
samples = [
    {"id": "1", "text": "L'entreprise Alpha a racheté Beta en 2021."},
    {"id": "2", "text": "Beta détient une participation dans Gamma."},
    {"id": "3", "text": "Gamma a annoncé un partenariat avec Delta."}
]
texts = [sample["text"] for sample in samples]

# ✅ 2. Neo4j connection (prerequisite: Neo4j running locally)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# ✅ 3. Text embeddings
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.encode(texts, convert_to_tensor=False)
embeddings = np.array(embeddings).astype('float32')

# ✅ 4. FAISS Index
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
faiss.normalize_L2(embeddings)
index.add(embeddings)

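# ✅ 5. Hybrid retrieval: text + knowledge graph

def query_graph(user_query):
    # Match relations whose entity names appear in the question (a full question is
    # almost never a substring of an entity name, so the test goes this way round).
    # $q is a Cypher parameter, so user input is never spliced into the query text.
    query = """
    MATCH (e)-[r]->(e2)
    WHERE toLower($q) CONTAINS toLower(e.name) OR toLower($q) CONTAINS toLower(e2.name)
    RETURN e.name AS Entity1, type(r) AS Relation, e2.name AS Entity2
    LIMIT 5
    """
    results = graph.run(query, q=user_query).data()
    relations = [f"{r['Entity1']} --{r['Relation']}--> {r['Entity2']}" for r in results]
    return relations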

# ✅ 6. Knowledge-graph-enriched RAG retrieval
def retrieve_with_kg(query, top_k=2):
    q_emb = encoder.encode([query]).astype('float32')
    faiss.normalize_L2(q_emb)
    _, idx = index.search(q_emb, top_k)
    retrieved = [samples[i]["text"] for i in idx[0]]
    kg_info = query_graph(query)
    return retrieved, kg_info

# ✅ 7. Knowledge-graph-enriched generation with a causal LLM (Falcon)
tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct").cuda()
gen = pipeline("text-generation", model=model, tokenizer=tok, device=0)


def answer_with_kg(query):
    docs, kg = retrieve_with_kg(query)
    context = "\n".join(docs + kg)
    prompt = f"Voici des informations documentaires et du graphe de connaissance :\n{context}\nQuestion : {query}\nRéponse : "
    resp = gen(prompt, max_new_tokens=200, do_sample=True)[0]['generated_text']
    return resp

# ✅ 8. Usage example
q = "Qui contrôle Gamma ?"
print(answer_with_kg(q))

# ✅ 9. Manual evaluation with complex questions
def evaluate_rag_kg(queries, keywords):
    correct = 0
    for q, kw in zip(queries, keywords):
        # The pipeline echoes the prompt, so score only the text after the final "Réponse :" marker
        ans = answer_with_kg(q).rsplit("Réponse :", 1)[-1]
        if kw.lower() in ans.lower():
            correct += 1
    print(f"Score : {correct}/{len(queries)}")


ConnectionUnavailable: Cannot open connection to ConnectionProfile('bolt://localhost:7687')
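
The `ConnectionUnavailable` error means no Neo4j server accepted the connection at `bolt://localhost:7687`. Separately, the cell above never writes any nodes or relationships, so `query_graph` would return nothing even with a live connection. The sketch below shows one way to build seeding statements for the three sample sentences. The `Entity` label, the relation names, and the `merge_statement` helper are assumptions made for illustration; only the `name` property comes from the query in the cell. Cypher does not allow a relationship type to be passed as a parameter, so the type is validated against a strict pattern before being placed in the query text.

```python
import re

# Relationship types are interpolated into the query, so restrict them to safe identifiers
_REL_TYPE = re.compile(r"^[A-Z_][A-Z0-9_]*$")


def merge_statement(head, relation, tail):
    """Return (cypher, params) that idempotently creates head -[relation]-> tail."""
    if not _REL_TYPE.match(relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    cypher = (
        "MERGE (a:Entity {name: $head}) "
        "MERGE (b:Entity {name: $tail}) "
        f"MERGE (a)-[:{relation}]->(b)"
    )
    return cypher, {"head": head, "tail": tail}


# Hypothetical triples mirroring the three sample sentences
triples = [
    ("Alpha", "ACQUIRED", "Beta"),
    ("Beta", "HOLDS_STAKE_IN", "Gamma"),
    ("Gamma", "PARTNERS_WITH", "Delta"),
]
statements = [merge_statement(*t) for t in triples]
for cypher, params in statements:
    print(cypher, params)
    # With a running server: graph.run(cypher, params)
```

Because every clause uses `MERGE`, re-running the seeding step does not create duplicate nodes or relationships.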