# Lab: RAG

This notebook implements a basic RAG (retrieval-augmented generation) pipeline:
- The documents are chunked by paragraph.
- An embedding is computed for each chunk.
- For a given question, we embed the question and retrieve the N most relevant chunks.
- A text generation model (SmolLM2) turns the question plus the most relevant chunks into an answer.
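
The retrieval step in the list above can be sketched without any model. This is a minimal, dependency-free illustration, not the notebook's implementation: the 3-dimensional vectors, the toy chunk texts, and the helper names `cosine_similarity` and `top_n` are all hypothetical, chosen only to show how the N best chunks are selected by similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n(query_vec, chunk_vecs, n=2):
    """Return (index, score) pairs for the n chunks most similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


# Hypothetical toy data: real embeddings come from an embedding model.
toy_chunks = [
    "teacher of the RL course",
    "schedule of the NLP course",
    "RL course outline",
]
toy_chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.6]]
toy_query_vec = [0.9, 0.1, 0.0]

results = top_n(toy_query_vec, toy_chunk_vecs, n=2)
for index, score in results:
    print(f"{score:.2f}  {toy_chunks[index]}")
```

With a real embedding model the same ranking is usually computed as one matrix product over all chunks, which is what the cells further down do with NumPy.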

Download [this archive](https://drive.google.com/file/d/1TnfKs7bTwmpbXklbgiIBpdw7I_wJ5y9Y/view?usp=sharing).

In this lab, you will experiment with different ways of chunking and embedding the documents and the questions so that the RAG retrieves the most relevant documents. <br/>
You will also experiment with the prompt given to the text generator to get the best answers.
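
One chunking variant worth trying is a sliding window over paragraphs, so that each chunk keeps some of its neighbours' context. The sketch below is only an illustration under assumptions: the function name `window_chunks` and the parameters `size` and `stride` are hypothetical and not part of the notebook.

```python
def window_chunks(text, size=2, stride=1):
    """Group consecutive paragraphs into overlapping chunks.

    size   -- number of paragraphs per chunk (hypothetical parameter)
    stride -- how many paragraphs to advance between chunks
    """
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be >= 1")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    if len(paragraphs) <= size:
        return ["\n\n".join(paragraphs)]

    starts = list(range(0, len(paragraphs) - size + 1, stride))
    # Make sure the last paragraphs are covered even when the stride skips them.
    if starts[-1] + size < len(paragraphs):
        starts.append(len(paragraphs) - size)

    return ["\n\n".join(paragraphs[s:s + size]) for s in starts]


sample = "A\n\nB\n\nC\n\nD\n\nE"
overlapping = window_chunks(sample, size=2, stride=1)
disjoint = window_chunks(sample, size=2, stride=2)
print(overlapping)
print(disjoint)
```

A stride smaller than the window size yields overlapping chunks; larger windows give the embedding more context, at the cost of less focused chunks.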

Here is the [list of questions](https://drive.google.com/file/d/14hZ0hTx5dM1WgJYewZsn9BkHzEReq-pj/view?usp=sharing) that I will ask the RAG. <br/>
To hand in:
- The notebook of your RAG
- A CSV with the columns question,embedding,rag_reply
- A CSV with the columns chunk,embedding<br/>

The embedding must be the JSON encoding of a list of floats.<br/>
When I call `json.loads(embedding)`, I must get back a list of floats.
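
As a quick illustration of the expected format, here is a minimal round trip with the standard `json` module. The 4-value embedding is a hypothetical stand-in; real embeddings have as many values as the model's output dimension. NumPy arrays are not directly JSON-serializable, so they need a `.tolist()` call before `json.dumps`.

```python
import json

# Hypothetical 4-dimensional embedding, used only to illustrate the format.
embedding = [0.0117, -0.0083, 0.25, 1.0]

# The string to store in the CSV "embedding" column.
cell = json.dumps(embedding)

# What the grader will do with that cell.
restored = json.loads(cell)

print(type(restored), restored)
```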

In [50]:
import numpy as np

import pandas as pd
from pathlib import Path

# Data loading

In [51]:
path = Path("../data/raw/rag/")

In [52]:
texts = []
for filename in path.glob("*.md"):
    with open(filename) as f:
        texts.append(f.read())

texts[0]

'# Title: Introduction to Cybersecurity: Principles and Practices  \n\n**Teacher:** Professor Lydia Carter  \n\n**Description:**  \nThis course introduces the fundamentals of cybersecurity, focusing on protecting systems, networks, and data from cyber threats. Students will explore key topics such as cryptography, network security, ethical hacking, and risk management. Through practical labs and real-world case studies, students will gain hands-on experience in identifying vulnerabilities, implementing security measures, and understanding the legal and ethical aspects of cybersecurity.  \n\n**Prerequisites:**  \n- Basic knowledge of computer networks and operating systems  \n- Proficiency in at least one programming language (e.g., Python, Java, or C++)  \n- Completion of "Introduction to Computer Science" or equivalent  \n\n**Assessment:**  \n- Weekly quizzes and assignments (25%)  \n- Midterm exam: Fundamentals of cybersecurity (20%)  \n- Final project: Design and present a comprehen

# Chunk
## Basic

In [53]:
def parse_class(text):
    chunks = text.split("\n\n")
    title = chunks[0].replace("# Title: ", "")
    return {"title": title, "chunks": chunks}

In [54]:
def parse_class_add_title(text):
    chunks = text.split("\n\n")
    title = chunks[0].replace("# Title: ", "")
    return {"title": title, "chunks": [f"{title}: {chunk}" for chunk in chunks]}

In [55]:
chunks = sum((parse_class_add_title(txt)["chunks"] for txt in texts), [])

# Embedding

## BAAI's embedding

In [56]:
from FlagEmbedding import FlagModel

In [57]:
model = FlagModel(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=True,
)

In [58]:
corpus_embedding = model.encode(chunks)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [59]:
queries = [
    "Who is the reinforcement learning teacher?",
    "In what class will I learn game AI?",
]

In [60]:
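# Note: model.encode embeds the raw text; the query instruction configured above
# is only prepended when using model.encode_queries.
query_embedding = model.encode(queries)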

In [61]:
sim_scores = query_embedding @ corpus_embedding.T

In [62]:
for query, score in zip(queries, sim_scores):
    print(" ---- ")
    print("Query: ", query)
    indexes = np.argsort(score)[-5:]
    print("Sources:")
    for i, idx in enumerate(reversed(indexes)):
        if score[idx] > .5:
            print(f"{i+1} -- similarity {score[idx]:.2f} -- \"", chunks[idx], '"')
            
    

 ---- 
Query:  Who is the reinforcement learning teacher?
Sources:
1 -- similarity 0.80 -- " Foundations of Reinforcement Learning  : **Teacher:** Dr. Arjun Patel   "
2 -- similarity 0.74 -- " Foundations of Reinforcement Learning  : # Title: Foundations of Reinforcement Learning   "
3 -- similarity 0.71 -- " Foundations of Reinforcement Learning  : 2. **Tabular Methods**  
   - Dynamic programming approaches: Policy Iteration and Value Iteration  
   - Monte Carlo methods and Temporal-Difference (TD) Learning   "
4 -- similarity 0.71 -- " Foundations of Reinforcement Learning  : 4. **Policy-Based Methods**  
   - Policy Gradient methods and REINFORCE algorithm  
   - Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO)   "
5 -- similarity 0.71 -- " Foundations of Reinforcement Learning  : **Description:**  
This course explores the foundational principles and practical applications of reinforcement learning (RL), a branch of machine learning focused on decision-making a

# Eval retrieval: Mean Reciprocal Rank
The file [question_answer_short.csv](https://drive.google.com/file/d/1EB8IwGlqvpNy3oq7xyR2IzdqJDX8C_fr/view?usp=drive_link) contains a list of questions and, for each one, the text to find in the documents.<br/>
Any chunk containing the "text to find" counts as a correct chunk.
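
To make the metric concrete, here is a small worked sketch of Mean Reciprocal Rank on hypothetical toy rankings. The helper names and the chunk ids are illustrative assumptions; the `cutoff=5` default mirrors the choice made in `compute_mrr` below, where a correct chunk ranked beyond the top 5 scores 0.

```python
def reciprocal_rank(ranked_ids, relevant_ids, cutoff=5):
    """1 / rank of the first relevant chunk, or 0.0 if none is in the top `cutoff`."""
    for rank, chunk_id in enumerate(ranked_ids[:cutoff], start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_ranked, all_relevant, cutoff=5):
    """Average the reciprocal rank over all questions."""
    values = [
        reciprocal_rank(ranked, relevant, cutoff)
        for ranked, relevant in zip(all_ranked, all_relevant)
    ]
    return sum(values) / len(values)


# Hypothetical rankings for three questions (chunk ids, best first).
ranked = [[3, 1, 2], [0, 4, 7], [5, 6, 8]]
# Acceptable chunk ids per question; the third question has no hit.
relevant = [{3}, {4}, {9}]

# Reciprocal ranks are 1, 1/2 and 0, so the mean is 0.5.
score = mean_reciprocal_rank(ranked, relevant)
print(score)
```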

In [63]:
df = pd.read_csv(path / "question_answer_short.csv")

In [64]:
query_embedding = model.encode(list(df["question"]))

In [65]:
acceptable_chunks = []
for answer in df["answer"]:
    chunks_ok = set(i for i, chunk in enumerate(chunks) if answer in chunk)
    acceptable_chunks.append(chunks_ok)

In [66]:
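def compute_mrr(sim_score, acceptable_chunks):
    """Mean Reciprocal Rank, counting only correct chunks ranked in the top 5."""
    ranks = []
    for this_score, this_acceptable_chunks in zip(sim_score, acceptable_chunks):
        # Chunk indexes sorted from most to least similar
        indexes = reversed(np.argsort(this_score))
        # Rank (1-based) of the first acceptable chunk; infinite if there is none
        rank = next(
            (i + 1 for i, idx in enumerate(indexes) if idx in this_acceptable_chunks),
            float("inf"),
        )
        ranks.append(rank)

    return {
        # A rank beyond 5 contributes 0 to the score
        "score": sum(1 / r if r < 6 else 0 for r in ranks) / len(ranks),
        "ranks": ranks,
    }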

In [67]:
sim_scores = query_embedding @ corpus_embedding.T

In [68]:
res = compute_mrr(sim_scores, acceptable_chunks)
res["score"]

0.6

# Text generation

In [69]:
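def get_context(query, corpus, corpus_embeddings):
    query_embedding = model.encode([query])
    # Use the embeddings passed as argument, not the global variable
    sim_scores = query_embedding @ corpus_embeddings.T
    # Top-5 chunk indexes, in ascending order of similarity (most relevant last)
    indexes = list(np.argsort(sim_scores[0]))[-5:]
    return [corpus[i] for i in indexes]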

In [70]:
get_context("Which class will teach me to build a chatbot?", chunks, corpus_embedding)

['# Natural Language Processing (NLP) Fundamentals and Applications: 5. **Applications of NLP**\n  - Sentiment analysis and text classification\n  - Machine translation and summarization\n  - Chatbots and conversational agents',
 '# Natural Language Processing (NLP) Fundamentals and Applications: **Description:**\nThis course offers a comprehensive introduction to the field of Natural Language Processing (NLP), focusing on the computational techniques that allow machines to understand, interpret, and generate human language. You will learn about linguistic structures, text preprocessing, sentiment analysis, machine translation, and language modeling. Using hands-on projects and industry-relevant tools, this course provides a strong foundation in both traditional and modern NLP methods, including neural networks and transformers.',
 '# Natural Language Processing (NLP) Fundamentals and Applications: Whether you aim to pursue a career in AI or enhance your programming toolkit, this cours

## SmolLM2

In [71]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-360M-Instruct"
# checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
# checkpoint = "amd/Instella-3B"

device = "cpu"  # "cuda" for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_generator = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

In [72]:
def build_smoll_prompt(query, corpus, corpus_embedding):
    # Use the corpus passed as argument, not the global `chunks`
    context_str = "\n\n".join(get_context(query, corpus, corpus_embedding))

    prompt = f"""<|im_start|>system
You reply to the user's request using only context information.
Context information to answer "{query}" is below
------
Context:
{context_str}
------
You are a helpful assistant for a Computer Science university. You reply to students' questions about the courses that they can attend.
<|im_end|>
<|im_start|>user
{query}
<|im_end|>
<|im_start|>assistant
"""
    return prompt


In [73]:
def build_smoll_messages(query, chunks, corpus_embedding):
    context_str = "\n\n".join(get_context(query, chunks, corpus_embedding))

    messages = [
        {"role": "system", "content": f"""You reply to the user's request using only context information.
Context information to answer "{query}" is below
------
Context:
{context_str}
------
You are a helpful assistant for a Computer Science university. You reply to students' questions about the courses that they can attend.
"""},
        {"role": "user", "content": query},
    ]

    return messages


In [74]:
messages = build_smoll_messages("Who is the NLP teacher?", chunks, corpus_embedding)

# add_generation_prompt=True appends the assistant header so the model starts its own answer
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model_generator.generate(inputs, max_new_tokens=100, temperature=0.01, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You reply to the user's request using only context information.
Context information to answer "Who is the NLP teacher?" is below
------
Context:
# Natural Language Processing (NLP) Fundamentals and Applications: **Prerequisites:**
- Proficiency in Python programming
- Basic understanding of linear algebra and probability
- Successful completion of "Introduction to Machine Learning" or equivalent

# Natural Language Processing (NLP) Fundamentals and Applications: **Course Outline:**
1. **Introduction to NLP**
  - Key concepts and challenges
  - Overview of linguistic structure and grammar

# Natural Language Processing (NLP) Fundamentals and Applications: # Natural Language Processing (NLP) Fundamentals and Applications

# Natural Language Processing (NLP) Fundamentals and Applications: **Schedule Time:**
- Tuesdays and Thursdays: 10:00 AM - 11:30 AM
- Lab Sessions: Fridays 2:00 PM - 4:00 PM

# Natural Language Processing (NLP) Fundamentals and Applications: **Teacher

# Groq generator

In [75]:
from dotenv import load_dotenv
import os   
load_dotenv()
groq_api_key = os.getenv("API_KEY")

In [76]:
import openai


In [77]:
client = openai.OpenAI(
    api_key=groq_api_key,
    base_url="https://api.groq.com/openai/v1"
)

In [78]:
query = "What must I do to pass the NLP class?"

context_str = "\n\n".join(get_context(query, chunks, corpus_embedding))

prompt = f"""Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
If the answer is not in the context information, reply "I cannot answer that question".
Query: {query}
Answer:"""

# Generating the deliverable files

## Chunk information


In [79]:
# Display statistics about the chunks
import json

print("="*60)
print("CHUNK STATISTICS")
print("="*60)
print(f"\nTotal number of chunks: {len(chunks)}")
print("\nChunk size (in characters):")
print(f"  - Minimum: {min(len(chunk) for chunk in chunks)} characters")
print(f"  - Maximum: {max(len(chunk) for chunk in chunks)} characters")
print(f"  - Mean: {sum(len(chunk) for chunk in chunks) / len(chunks):.2f} characters")
print(f"  - Median: {sorted([len(chunk) for chunk in chunks])[len(chunks)//2]} characters")

print(f"\nEmbedding dimension: {corpus_embedding.shape[1]}")
print(f"Shape of corpus_embedding: {corpus_embedding.shape}")
print("="*60)


CHUNK STATISTICS

Total number of chunks: 98

Chunk size (in characters):
  - Minimum: 71 characters
  - Maximum: 608 characters
  - Mean: 226.24 characters
  - Median: 208 characters

Embedding dimension: 768
Shape of corpus_embedding: (98, 768)


## Example chunks


In [80]:
# Display a few example chunks
print("Example chunks (first 3):")
print("="*60)
for i, chunk in enumerate(chunks[:3]):
    print(f"\n[Chunk {i+1}] Size: {len(chunk)} characters")
    print(f"Content: {chunk[:200]}..." if len(chunk) > 200 else f"Content: {chunk}")
    print("-"*60)


Example chunks (first 3):

[Chunk 1] Size: 125 characters
Content: Introduction to Cybersecurity: Principles and Practices  : # Title: Introduction to Cybersecurity: Principles and Practices  
------------------------------------------------------------

[Chunk 2] Size: 96 characters
Content: Introduction to Cybersecurity: Principles and Practices  : **Teacher:** Professor Lydia Carter  
------------------------------------------------------------

[Chunk 3] Size: 540 characters
Content: Introduction to Cybersecurity: Principles and Practices  : **Description:**  
This course introduces the fundamentals of cybersecurity, focusing on protecting systems, networks, and data from cyber th...
------------------------------------------------------------


## Generating the chunks CSV with embeddings


In [81]:
# Build a DataFrame with the chunks and their embeddings
# The embedding must be in JSON format (list of floats)

chunks_data = []
for chunk, embedding in zip(chunks, corpus_embedding):
    chunks_data.append({
        'chunk': chunk,
        'embedding': json.dumps(embedding.tolist())  # Convert to JSON
    })

df_chunks = pd.DataFrame(chunks_data)
print(f"DataFrame created with {len(df_chunks)} chunks")
df_chunks.head()


DataFrame created with 98 chunks


Unnamed: 0,chunk,embedding
0,Introduction to Cybersecurity: Principles and ...,"[0.01177978515625, 0.00832366943359375, -0.014..."
1,Introduction to Cybersecurity: Principles and ...,"[0.0369873046875, -0.01201629638671875, -0.010..."
2,Introduction to Cybersecurity: Principles and ...,"[0.0277099609375, -0.012054443359375, 0.001070..."
3,Introduction to Cybersecurity: Principles and ...,"[0.01399993896484375, -0.01212310791015625, -0..."
4,Introduction to Cybersecurity: Principles and ...,"[0.021697998046875, -0.0242156982421875, -0.03..."


In [82]:
# Save the chunks CSV
output_chunks_path = Path("../data/processed/chunks_embeddings.csv")
df_chunks.to_csv(output_chunks_path, index=False)
print(f"✓ Chunks CSV saved: {output_chunks_path}")
print(f"  Number of rows: {len(df_chunks)}")


✓ Chunks CSV saved: ../data/processed/chunks_embeddings.csv
  Number of rows: 98


## Generating the questions CSV with embeddings and RAG answers


In [83]:
# Load the questions from question.csv
df_questions = pd.read_csv(path / "question.csv")
# Drop empty rows
df_questions = df_questions.dropna(subset=['question'])
df_questions = df_questions[df_questions['question'].str.strip() != '']

print(f"Number of questions to process: {len(df_questions)}")
df_questions.head(10)


Number of questions to process: 27


Unnamed: 0,question
0,Who is the reinforcement learning teacher?
1,In what class will I learn game A.I.?
2,What are the requirements to build a game A.I.?
3,How will I validate the reinforcement learning...
4,Which class will teach me to build a chatbot?
5,What are the requirements to build a chatbot?
6,What models do we use for text A.I.?
7,What are the applications of NLP?
8,What class will teach me to program a Arduino?
9,What IoT system can I build in class


In [84]:
# Function that generates a RAG answer with Groq
def generate_rag_answer(query, chunks, corpus_embedding, client):
    """
    Generate a RAG answer for a given question.
    
    Parameters:
    - query: the question asked
    - chunks: list of text chunks
    - corpus_embedding: embeddings of the chunks
    - client: OpenAI/Groq client
    
    Returns:
    - The answer generated by the model
    """
    # Retrieve the relevant context
    context_str = "\n\n".join(get_context(query, chunks, corpus_embedding))
    
    # Build the prompt
    prompt = f"""Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
If the answer is not in the context information, reply "I cannot answer that question".
Query: {query}
Answer:"""
    
    try:
        # Call the Groq API
        res = client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model="openai/gpt-oss-20b",
        )
        return res.choices[0].message.content
    except Exception as e:
        print(f"Error while generating the answer for '{query}': {e}")
        return f"Error: {e}"


In [85]:
# Generate the embeddings and answers for all questions
questions_list = df_questions['question'].tolist()

# Generate the embeddings for all questions
print("Generating the question embeddings...")
questions_embeddings = model.encode(questions_list)
print(f"✓ Embeddings generated for {len(questions_list)} questions")

# Generate the RAG answer for each question
print("\nGenerating the RAG answers...")
rag_answers = []
for i, question in enumerate(questions_list):
    print(f"  [{i+1}/{len(questions_list)}] Processing: {question[:60]}...")
    answer = generate_rag_answer(question, chunks, corpus_embedding, client)
    rag_answers.append(answer)

print(f"✓ {len(rag_answers)} answers generated")


Generating the question embeddings...
✓ Embeddings generated for 27 questions

Generating the RAG answers...
  [1/27] Processing: Who is the reinforcement learning teacher?...
  [2/27] Processing: In what class will I learn game A.I.?...
  [3/27] Processing: What are the requirements to build a game A.I.?...
  [4/27] Processing: How will I validate the reinforcement learning class?...
  [5/27] Processing: Which class will teach me to build a chatbot?...
  [6/27] Processing: What are the requirements to build a chatbot?...
  [7/27] Processing: What models do we use for text A.I.?...
  [8/27] Processing: What are the applications of NLP?...
  [9/27] Processing: What class will teach me to program a Arduino?...
  [10/27] Processing: What IoT system can I build in class...
  [11/27] Processing: What are the applications of IoT?...
  [12/27] Processing: How do I validate the IoT class?...
  [13/27] Processing: What language must I know to cod

In [86]:
# Build the DataFrame with questions, embeddings and answers
questions_data = []
for question, embedding, answer in zip(questions_list, questions_embeddings, rag_answers):
    questions_data.append({
        'question': question,
        'embedding': json.dumps(embedding.tolist()),  # Convert to JSON
        'rag_reply': answer
    })

df_questions_output = pd.DataFrame(questions_data)
print(f"DataFrame created with {len(df_questions_output)} questions")
df_questions_output.head()


DataFrame created with 27 questions


Unnamed: 0,question,embedding,rag_reply
0,Who is the reinforcement learning teacher?,"[-0.009368896484375, -0.0645751953125, -0.0037...",The reinforcement learning teacher is Dr. Arju...
1,In what class will I learn game A.I.?,"[-0.02490234375, 0.013763427734375, 0.04547119...",You’ll learn game AI in the **Foundations of R...
2,What are the requirements to build a game A.I.?,"[-0.0035305023193359375, 0.0216064453125, 0.04...",I cannot answer that question.
3,How will I validate the reinforcement learning...,"[0.0026454925537109375, -0.0254974365234375, -...",I cannot answer that question.
4,Which class will teach me to build a chatbot?,"[0.0225677490234375, -0.0487060546875, 0.00802...",The **Applications of NLP** class (section 5) ...


In [87]:
# Save the questions CSV
output_questions_path = Path("../data/processed/questions_embeddings_answers.csv")
df_questions_output.to_csv(output_questions_path, index=False)
print(f"✓ Questions CSV saved: {output_questions_path}")
print(f"  Number of rows: {len(df_questions_output)}")


✓ Questions CSV saved: ../data/processed/questions_embeddings_answers.csv
  Number of rows: 27


## Checking the generated files


In [88]:
# Check that the embeddings can be reloaded correctly
print("Checking the embedding format...")
print("\n1. Loading a chunk embedding:")
test_chunk_embedding = json.loads(df_chunks.iloc[0]['embedding'])
print(f"   ✓ Type: {type(test_chunk_embedding)}")
print(f"   ✓ Length: {len(test_chunk_embedding)}")
print(f"   ✓ First element: {test_chunk_embedding[0]} (type: {type(test_chunk_embedding[0])})")

print("\n2. Loading a question embedding:")
test_question_embedding = json.loads(df_questions_output.iloc[0]['embedding'])
print(f"   ✓ Type: {type(test_question_embedding)}")
print(f"   ✓ Length: {len(test_question_embedding)}")
print(f"   ✓ First element: {test_question_embedding[0]} (type: {type(test_question_embedding[0])})")

print("\n✓ The embeddings are in the right format!")


Checking the embedding format...

1. Loading a chunk embedding:
   ✓ Type: <class 'list'>
   ✓ Length: 768
   ✓ First element: 0.01177978515625 (type: <class 'float'>)

2. Loading a question embedding:
   ✓ Type: <class 'list'>
   ✓ Length: 768
   ✓ First element: -0.009368896484375 (type: <class 'float'>)

✓ The embeddings are in the right format!
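
The check above only looks at the first row held in memory. A stricter variant, sketched here with the standard library only, re-parses every row of the CSV text and verifies that each embedding is a list of floats of one common length. The function name `check_embedding_column` and the two-row sample CSV are hypothetical; on the real files one would pass the content of the saved CSV instead.

```python
import csv
import io
import json


def check_embedding_column(csv_text, column="embedding"):
    """Return (row_count, dimension) if every embedding is a list of floats
    of the same length; raise ValueError otherwise."""
    reader = csv.DictReader(io.StringIO(csv_text))
    dimensions = set()
    row_count = 0
    for row_number, row in enumerate(reader, start=1):
        values = json.loads(row[column])
        is_float_list = isinstance(values, list) and all(
            isinstance(v, float) for v in values
        )
        if not is_float_list:
            raise ValueError(f"row {row_number}: not a list of floats")
        dimensions.add(len(values))
        row_count += 1
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent dimensions: {sorted(dimensions)}")
    dimension = dimensions.pop() if dimensions else 0
    return row_count, dimension


# Hypothetical two-row CSV built in memory for the demonstration.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["chunk", "embedding"])
writer.writerow(["first chunk", json.dumps([0.1, 0.2, 0.3])])
writer.writerow(["second chunk", json.dumps([0.4, 0.5, 0.6])])

rows, dim = check_embedding_column(buffer.getvalue())
print(rows, dim)
```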


## Preview of the generated answers


In [89]:
# Display a few example questions and answers
print("Examples of generated questions and answers:")
print("="*80)

for i in range(min(5, len(df_questions_output))):
    row = df_questions_output.iloc[i]
    print(f"\n[Question {i+1}]")
    print(f"Q: {row['question']}")
    print(f"R: {row['rag_reply']}")
    print("-"*80)


Examples of generated questions and answers:

[Question 1]
Q: Who is the reinforcement learning teacher?
R: The reinforcement learning teacher is Dr. Arjun Patel.
--------------------------------------------------------------------------------

[Question 2]
Q: In what class will I learn game A.I.?
R: You’ll learn game AI in the **Foundations of Reinforcement Learning** class.
--------------------------------------------------------------------------------

[Question 3]
Q: What are the requirements to build a game A.I.?
R: I cannot answer that question.
--------------------------------------------------------------------------------

[Question 4]
Q: How will I validate the reinforcement learning class?
R: I cannot answer that question.
--------------------------------------------------------------------------------

[Question 5]
Q: Which class will teach me to build a chatbot?
R: The **Applications of NLP** class (section 5) covers chatbots and conversational agents, and the **Final Proj

## Final summary


In [93]:
# Final summary of the RAG
print("="*80)
print("RAG SUMMARY")
print("="*80)

print("\n CHUNKING CONFIGURATION:")
print(f"   • Method: split by paragraph (\\n\\n) with the title prepended")
print(f"   • Number of chunks: {len(chunks)}")
print(f"   • Mean size: {sum(len(chunk) for chunk in chunks) / len(chunks):.0f} characters")
print(f"   • Min/max size: {min(len(chunk) for chunk in chunks)}/{max(len(chunk) for chunk in chunks)} characters")

print("\n EMBEDDING:")
print(f"   • Model: BAAI/bge-base-en-v1.5")
print(f"   • Dimension: {corpus_embedding.shape[1]}")
print(f"   • Query instruction: 'Represent this sentence for searching relevant passages:'")

print("\n GENERATION:")
print(f"   • Model: openai/gpt-oss-20b (via Groq)")
print(f"   • Number of chunks retrieved: 5 (top-5)")
print(f"   • Strategy: context + prompt instructing the model to answer only from the context")

print("\n GENERATED FILES:")
print(f"   • {output_chunks_path}")
print(f"     → {len(df_chunks)} chunks with embeddings")
print(f"   • {output_questions_path}")
print(f"     → {len(df_questions_output)} questions with embeddings and answers")

print("\n Full RAG generated successfully!")
print("="*80)


RAG SUMMARY

 CHUNKING CONFIGURATION:
   • Method: split by paragraph (\n\n) with the title prepended
   • Number of chunks: 98
   • Mean size: 226 characters
   • Min/max size: 71/608 characters

 EMBEDDING:
   • Model: BAAI/bge-base-en-v1.5
   • Dimension: 768
   • Query instruction: 'Represent this sentence for searching relevant passages:'

 GENERATION:
   • Model: openai/gpt-oss-20b (via Groq)
   • Number of chunks retrieved: 5 (top-5)
   • Strategy: context + prompt instructing the model to answer only from the context

 GENERATED FILES:
   • ../data/processed/chunks_embeddings.csv
     → 98 chunks with embeddings
   • ../data/processed/questions_embeddings_answers.csv
     → 27 questions with embeddings and answers

 Full RAG generated successfully!


In [91]:
res = client.chat.completions.create(                                            
    messages=[{"role": "user", "content": prompt}],                              
    model="openai/gpt-oss-20b",                                                                 
)                                                                                

In [92]:
res.choices[0].message.content

'To pass the NLP course, you need to:\n\n1. **Complete all weekly coding assignments** – they count for 30\u202f% of your grade.  \n2. **Do well on the midterm exam** – worth 20\u202f% of the final grade.  \n3. **Finish the final project** (an end‑to‑end NLP application) – this is the largest component, accounting for 30\u202f%.  \n4. **Participate actively** in class discussions and code reviews – this contributes 20\u202f% of the grade.  \n\nIn addition, make sure you meet the prerequisites (Python programming, basic linear algebra and probability, and prior completion of an introductory ML course). Performing satisfactorily across all four areas will earn you a passing grade.'