# **Topic 2: Fine-tuned LLM for sentiment classification + contextual responses**

**Objective**


Build a mini-LLM that can:
* Classify an emotion (positive, negative, neutral),
* Provide an answer based on documents (RAG-style).


**Guided steps**
1. Choose the dataset: for example IMDB (film reviews). Review: encoding.pdf, Supervisé mesures perfs.pdf.
2. Load the model: use distilbert or roberta-small.
3. LoRA fine-tuning (simple):
   * Use the peft package for lightweight fine-tuning.
   * You can freeze the weights and fine-tune only a small part.
4. Build the RAG system:
   * Embedding with sentence-transformers.
   * Cosine similarity.
   * Generation via prompt + context.
   * Review: GenAI RAG 1.pdf, GenAI RAG 2.pdf.
5. Simple interface: a CLI that provides a text box and displays the answer; if time permits, try Streamlit.

# **STEP 1**

Installing the libraries

Importing the modules

Importing & sampling the data

## Step 1.1: installing the libraries


In [None]:
# Install the libraries (sentence-transformers is needed for the embeddings in Step 6)
!pip install -q torch transformers datasets accelerate peft scikit-learn numpy pandas tqdm openpyxl sentence-transformers


## Step 1.2: importing the modules

This cell is completed as the notebook progresses.

In [None]:
# Import the libraries / modules used in the script
import os  # File / system operations
import json  # Load / save JSON data files
import time  # Latency measurement
from datetime import datetime  # Timestamps
from collections import Counter  # Count repetitions

# Data
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF
from sklearn.neighbors import NearestNeighbors  # kNN search
from sklearn.metrics.pairwise import cosine_similarity  # Cosine similarity
from sklearn.model_selection import train_test_split  # Train/val split

# Hugging Face
from datasets import load_dataset  # Load the IMDB dataset
from transformers import (
    AutoTokenizer,  # Tokenization
    AutoModelForSequenceClassification,  # Classification
    AutoModel,  # Embedding
    AutoModelForCausalLM,  # Text generation model
    TrainingArguments,  # Training configuration
    Trainer,  # Training the model
    pipeline  # High-level inference helper
)

# LoRA fine-tuning
from peft import (
    LoraConfig,  # Configuration
    get_peft_model,  # Apply LoRA
    TaskType  # Specify the task type
)

# Progress bars
from tqdm import tqdm

# PyTorch
import torch
from torch.utils.data import Dataset  # For custom datasets

print("import successful")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")  # Check if GPU is available


import successful
PyTorch version: 2.9.0+cu126
CUDA available: True


## Step 1.3: loading and preparing the IMDB dataset

Sample 10,000 reviews, balanced between positive and negative.



In [None]:
# Load the data from Hugging Face
# Movie reviews with positive/negative labels
print("Loading IMDB dataset...")
from datasets import load_dataset, concatenate_datasets  # concatenate_datasets merges the sampled subsets
dataset = load_dataset("imdb")

# Retrieve the train and test splits
original_train_data = dataset["train"]  # Original training data
original_test_data = dataset["test"]  # Original test data

# --- Balanced sampling / training data ---
TRAIN_NEG_SAMPLES = 5000
TRAIN_POS_SAMPLES = 5000

# Separate the negative and positive reviews
train_neg = original_train_data.filter(lambda x: x['label'] == 0)
train_pos = original_train_data.filter(lambda x: x['label'] == 1)

# Select the reviews for the sample (first N of each class)
train_neg = train_neg.select(range(min(TRAIN_NEG_SAMPLES, len(train_neg))))
train_pos = train_pos.select(range(min(TRAIN_POS_SAMPLES, len(train_pos))))

# Combine and shuffle the subsets
train_data = concatenate_datasets([train_neg, train_pos])  # Hugging Face Datasets library
train_data = train_data.shuffle(seed=42)  # Fixed seed for reproducibility

# --- Balanced sampling / test data ---
TEST_NEG_SAMPLES = 1000
TEST_POS_SAMPLES = 1000

# Separate the negative and positive reviews of the test data
test_neg = original_test_data.filter(lambda x: x['label'] == 0)
test_pos = original_test_data.filter(lambda x: x['label'] == 1)

# Select the sample
test_neg = test_neg.select(range(min(TEST_NEG_SAMPLES, len(test_neg))))
test_pos = test_pos.select(range(min(TEST_POS_SAMPLES, len(test_pos))))

# Combine and shuffle the test sample
test_data = concatenate_datasets([test_neg, test_pos])  # Hugging Face Datasets library
test_data = test_data.shuffle(seed=42)

print(f"total extrait Training : {len(train_data)} (Negative: {TRAIN_NEG_SAMPLES}, Positive: {TRAIN_POS_SAMPLES})")
print(f"total extrait Test : {len(test_data)} (Negative: {TEST_NEG_SAMPLES}, Positive: {TEST_POS_SAMPLES})")

# Display a few review + label examples
for item in range(10):
    print("\nExemples de reviews:")
    print(f"Texte: {train_data[item]['text'][:200]}...")  # First 200 characters
    print(f"Label: {train_data[item]['label']} (0=negative, 1=positive)")

Loading IMDB dataset...



total extrait Training : 10000 (Negative: 5000, Positive: 5000)
total extrait Test : 2000 (Negative: 1000, Positive: 1000)

Exemples de reviews:
Texte: There are many kinds of reunion shows. One kind is where old actors are taken out of mothballs and set to recreate characters they haven't played for twenty or thirty years. These have mixed results. ...
Label: 1 (0=negative, 1=positive)

Exemples de reviews:
Texte: This isn't the best romantic comedy ever made, but it is certainly pretty nice and watchable. It's directed in an old-fashioned way and that works fine. Cybill Shepherd as Corinne isn't bad in her rol...
Label: 1 (0=negative, 1=positive)

Exemples de reviews:
Texte: I've always liked Fred MacMurray, and--although her career was tragically cut short--I think Carole Lombard is fun to watch. Pair these two major and attractive stars together, add top supporting player...
Label: 0 (0=negative, 1=positive)

Exemples de reviews:
Texte: Anna (Charlotte Burke), who is just on the ve
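
The balanced sampling used in this step can be illustrated on toy data with the standard library alone. This is a minimal sketch, not the notebook's code: `balanced_sample` and the toy records are hypothetical names introduced only for illustration, and the logic (take the first N records of each class, then shuffle with a fixed seed) mirrors what the cell above does with `filter`, `select` and `shuffle`.

```python
import random

def balanced_sample(records, per_class, seed=42):
    """Keep the first `per_class` records of each label, then shuffle them."""
    by_label = {}
    for rec in records:
        by_label.setdefault(rec["label"], []).append(rec)

    sample = []
    for label in sorted(by_label):
        sample.extend(by_label[label][:per_class])

    # A seeded generator makes the shuffle reproducible
    random.Random(seed).shuffle(sample)
    return sample

# Hypothetical toy corpus: 10 negative (label 0) and 10 positive (label 1) reviews
toy = [{"text": f"review {i}", "label": i % 2} for i in range(20)]
sample = balanced_sample(toy, per_class=3)

counts = {}
for rec in sample:
    counts[rec["label"]] = counts.get(rec["label"], 0) + 1
print(counts)
```

Taking the first N records per class is simple but not random; drawing N random indices per class would be an alternative if the order of the source split matters.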

# **STEP 2**: Classification model for positive (label=1) / negative (label=0) reviews: DistilBERT (a lighter version of BERT)



## Step 2.1: loading the tokenizer and the pre-trained model


In [None]:
# Model configuration
MODEL_NAME = "distilbert-base-uncased"
NUM_LABELS = 2  # Binary classification: positive (1) or negative (0)

# Load the tokenizer that converts the review text into tokens (numbers)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load the pre-trained classification model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

print("Modele et tokenizer chargés")
print(f"Paramètres du modèle: {sum(p.numel() for p in model.parameters()):,}")  # Total parameters



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Modele et tokenizer chargés
Paramètres du modèle: 66,955,010


## Step 2.2: tokenizing the dataset

Tokenize the training and test data.


In [None]:
# Function: tokenize the dataset
def tokenize_function(examples):
    """
    Tokenize examples.
    Args: examples: Dictionary with 'text' key containing review texts
    Returns: Tokenized inputs with 'input_ids', 'attention_mask', etc.
    """
    # - truncation=True: cut sequences longer than max_length
    # - padding=True: pad every sequence to the longest one in the batch
    # - max_length=512: maximum sequence length (the limit of BERT-style models)
    return tokenizer(
        examples["text"],
        truncation=True,
        padding=True,
        max_length=512
    )

# Tokenize the training data
print("Tokenizing training data...")
tokenized_train = train_data.map(
    tokenize_function,
    batched=True,  # Process in batches for speed
    remove_columns=["text"]  # Remove original text (we have tokens now)
)

print("Exemple de tokenization pour les données d'entraînement:")
print(f"Original text: {train_data[0]['text'][:100]}...")
print(f"Input IDs: {tokenized_train[0]['input_ids'][:20]}...")
print(f"Attention Mask: {tokenized_train[0]['attention_mask'][:20]}...")

# Apply tokenization to test data
print("\nTokenizing test data...")
tokenized_test = test_data.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

print("Exemple de tokenization pour les données de test:")
print(f"Original text: {test_data[0]['text'][:100]}...")
print(f"Input IDs: {tokenized_test[0]['input_ids'][:20]}...")
print(f"Attention Mask: {tokenized_test[0]['attention_mask'][:20]}...")

print("\n Tokenization terminée !")
print(f"Total training: {len(tokenized_train)}")
print(f"Total test: {len(tokenized_test)}")


Tokenizing training data...

Exemple de tokenization pour les données d'entraînement:
Original text: There are many kinds of reunion shows. One kind is where old actors are taken out of mothballs and s...
Input IDs: [101, 2045, 2024, 2116, 7957, 1997, 10301, 3065, 1012, 2028, 2785, 2003, 2073, 2214, 5889, 2024, 2579, 2041, 1997, 5820]...
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]...

Tokenizing test data...

Exemple de tokenization pour les données de test:
Original text: Linda Blair has been acting for forty years now, and while she will never escape the part of Regan M...
Input IDs: [101, 8507, 10503, 2038, 2042, 3772, 2005, 5659, 2086, 2085, 1010, 1998, 2096, 2016, 2097, 2196, 4019, 1996, 2112, 1997]...
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]...

 Tokenization terminée !
Total training: 10000
Total test: 2000
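
To make the effect of `truncation=True`, `padding=True` and `max_length` more concrete, here is a minimal sketch on toy token ids. It is not the Hugging Face implementation: `pad_and_truncate` is a hypothetical helper, the ids are made up, and special-token handling (a real tokenizer keeps the final `[SEP]` when truncating) is deliberately ignored.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy batch with sequences of length 4, 3 and 8 (ids are illustrative only)
toy_batch = [
    [101, 7, 8, 102],
    [101, 5, 102],
    [101, 1, 2, 3, 4, 5, 6, 102],
]
encoded = pad_and_truncate(toy_batch, max_length=6)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

The attention mask is what lets the model distinguish real tokens from padding, which is why the first 20 mask values printed above are all 1 for a long review.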


## Step 2.3: configuring LoRA for faster fine-tuning

LoRA reduces the number of parameters to train.


In [None]:
# Configure LoRA parameters
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence classification task
    r=8,  # Rank: lower = fewer parameters (8 is a good balance)
    lora_alpha=16,  # Scaling factor for LoRA weights
    lora_dropout=0.1,  # Dropout rate for LoRA layers
    target_modules=["q_lin", "v_lin"]  # Which layers to apply LoRA to
    # "q_lin" and "v_lin" are query and value linear layers in attention
)

# Apply LoRA to the model
# This freezes the base model weights and adds trainable LoRA adapters
print("Applying LoRA to model...")
model = get_peft_model(model, lora_config)

# Print trainable parameters
# With LoRA, we only train a tiny fraction of parameters!
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print("✓ LoRA applied successfully!")
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}% of total)")
print(f"Total parameters: {total_params:,}")


Applying LoRA to model...
✓ LoRA applied successfully!
Trainable parameters: 739,586 (1.09% of total)
Total parameters: 67,694,596
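
The printed counts can be reproduced by hand. The sketch below is one decomposition that is consistent with the numbers above, under two assumptions that the notebook does not state: that the classification head (`pre_classifier` and `classifier`) is trained alongside the LoRA adapters, and that PEFT keeps a trainable copy of that head next to the frozen original. The variable names are introduced here for illustration.

```python
hidden = 768            # DistilBERT hidden size
n_layers = 6            # Transformer blocks in distilbert-base-uncased
r = 8                   # LoRA rank from lora_config
targets_per_layer = 2   # q_lin and v_lin

# Each adapted 768x768 linear layer gets two small matrices: A (r x hidden) and B (hidden x r)
lora_params = n_layers * targets_per_layer * (r * hidden + hidden * r)

# Classification head: pre_classifier (768 -> 768) and classifier (768 -> 2), weights + biases
head_params = (hidden * hidden + hidden) + (hidden * 2 + 2)

base_params = 66_955_010  # Total printed when the base model was loaded

trainable = lora_params + head_params
# Assumption: the total counts the frozen base model, the adapters, and a trainable head copy
total = base_params + lora_params + head_params

print(f"LoRA adapter parameters: {lora_params:,}")
print(f"Head parameters: {head_params:,}")
print(f"Trainable: {trainable:,} ({100 * trainable / total:.2f}% of total)")
print(f"Total: {total:,}")
```

Under these assumptions the adapters themselves account for only 147,456 parameters; most of the trainable parameters come from the classification head.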


## Step 2.4: setting up the training process

We configure the training process: learning rate, batch size, number of epochs, etc.


In [None]:
# Training arguments
# These control how the model is trained
training_args = TrainingArguments(
    output_dir="./results",  # Where to save model checkpoints
    num_train_epochs=2,  # Number of training epochs (full passes through data); alternative value: 3
    per_device_train_batch_size=8,  # Batch size per device (small for memory)
    per_device_eval_batch_size=8,  # Evaluation batch size
    learning_rate=2e-4,  # Learning rate (how fast model learns)
    weight_decay=0.01,  # Regularization to prevent overfitting
    logging_dir="./logs",  # Where to save logs
    logging_steps=100,  # Log every 100 steps
    eval_strategy="epoch",  # Evaluate at end of each epoch
    save_strategy="epoch",  # Save model at end of each epoch
    load_best_model_at_end=True,  # Load best model after training
    fp16=torch.cuda.is_available(),  # Use mixed precision (faster, less memory) only when a GPU is available
    report_to="none"  # Don't report to external services
)

print("✓ Training configuration set up!")


✓ Training configuration set up!
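
As a quick sanity check on this configuration, the number of optimizer steps can be worked out from the dataset size. This is a minimal sketch assuming a single device and no gradient accumulation (neither is set explicitly in the notebook); the variable names are illustrative.

```python
import math

n_train = 10_000        # Size of the balanced training sample
batch_size = 8          # per_device_train_batch_size
epochs = 2              # num_train_epochs
logging_steps = 100     # logging_steps

# Assumption: one device, gradient_accumulation_steps = 1
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Loss values logged: {log_events}")
```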


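## Step 2.5: training and validation sets


No new split is made here: the balanced 10,000-review training sample is used as the training set and the balanced 2,000-review test sample is used as the validation set. We then create a Trainer object to handle the training process.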


In [None]:
## DO NOT REDO THE SPLIT: SECTION TO DELETE / START
# # Split tokenized training data into train and validation sets
# # 80% for training, 20% for validation
# train_val_split = tokenized_train.train_test_split(test_size=0.2, seed=42)

# train_dataset = train_val_split["train"]  # Training set
# val_dataset = train_val_split["test"]  # Validation set
### SECTION TO DELETE / END

# Assign the training and validation data from the balanced samples
# extracted from the IMDB dataset at the beginning
train_dataset = tokenized_train
val_dataset = tokenized_test

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

# Create a metric function for evaluation
# This computes accuracy during training
def compute_metrics(eval_pred):
    """
    Compute accuracy metric.

    Args:
        eval_pred: Tuple of (predictions, labels)

    Returns:
        Dictionary with accuracy score
    """
    predictions, labels = eval_pred
    # Get predicted class (0 or 1) from logits
    predictions = np.argmax(predictions, axis=1)
    # Calculate accuracy
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

# Create the Trainer
# Trainer handles all the training loop, evaluation, and saving
trainer = Trainer(
    model=model,  # The model to train
    args=training_args,  # Training configuration
    train_dataset=train_dataset,  # Training data
    eval_dataset=val_dataset,  # Validation data
    compute_metrics=compute_metrics,  # How to compute metrics
    tokenizer=tokenizer  # Tokenizer saved with the checkpoints (named `processing_class` in recent transformers releases)
)

print("✓ Trainer created and ready for training!")


Training samples: 10000
Validation samples: 2000
✓ Trainer created and ready for training!
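
The logic of `compute_metrics` can be checked by hand on a few made-up logits. The sketch below uses plain Python lists instead of NumPy so that it runs anywhere; `argmax` and `accuracy_from_logits` are hypothetical helpers, and the logits and labels are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Share of examples whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Four toy examples: columns are the scores for class 0 (negative) and class 1 (positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.7]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(f"accuracy = {toy_accuracy}")
```

The third example is the only miss: its logits favor class 1 while the label is 0.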


## Step 3: Train the Model

Now we train the model! This may take some time depending on your hardware. With LoRA, it should be faster than full fine-tuning.


In [None]:
# Train the model
# This will take several minutes depending on your hardware
print("Starting training...")
print("This may take a few minutes. Please wait...")

# Start training
trainer.train()

print("✓ Training complete!")

# Evaluate on validation set
print("\nEvaluating on validation set...")
eval_results = trainer.evaluate()
print(f"Validation Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Validation Loss: {eval_results['eval_loss']:.4f}")


Starting training...
This may take a few minutes. Please wait...


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.2885 | 0.254031 | 0.9025 |
| 2 | 0.215 | 0.257089 | 0.911 |


✓ Training complete!

Evaluating on validation set...


Validation Accuracy: 0.9025
Validation Loss: 0.2540
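
The final evaluation reports the epoch 1 figures (accuracy 0.9025, loss 0.2540) rather than the epoch 2 ones. This is consistent with `load_best_model_at_end=True`: when `metric_for_best_model` is not set, the Trainer compares checkpoints on validation loss, and epoch 1 has the lower loss. The sketch below replays that selection on the logged values; the list `epoch_logs` is a hand-copied illustration, not an object produced by the Trainer.

```python
# Values copied from the training log above
epoch_logs = [
    {"epoch": 1, "eval_loss": 0.254031, "eval_accuracy": 0.9025},
    {"epoch": 2, "eval_loss": 0.257089, "eval_accuracy": 0.911},
]

# Selection by validation loss (lower is better)
best_by_loss = min(epoch_logs, key=lambda e: e["eval_loss"])

# Selection by accuracy (higher is better), for comparison
best_by_accuracy = max(epoch_logs, key=lambda e: e["eval_accuracy"])

print(f"Best by loss: epoch {best_by_loss['epoch']}")
print(f"Best by accuracy: epoch {best_by_accuracy['epoch']}")
```

If accuracy is the metric of interest, passing `metric_for_best_model="accuracy"` to `TrainingArguments` should make the Trainer keep the epoch 2 checkpoint instead.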


## Step 4: Save the Fine-Tuned Model

We save the trained model so we can use it later without retraining.


In [None]:
# Save the fine-tuned model
MODEL_SAVE_PATH = "./fine_tuned_sentiment_model"

print(f"Saving model to {MODEL_SAVE_PATH}...")
trainer.save_model(MODEL_SAVE_PATH)
tokenizer.save_pretrained(MODEL_SAVE_PATH)

print("✓ Model saved successfully!")

# Also save the PEFT model explicitly: for a LoRA model, save_pretrained writes
# the adapter weights and config (not a full copy of the base model)
model.save_pretrained(MODEL_SAVE_PATH)


Saving model to ./fine_tuned_sentiment_model...
✓ Model saved successfully!


## Step 5: Test the Sentiment Classifier

Let's test our fine-tuned model on some example reviews to see if it works correctly.


In [None]:
# Create a sentiment analysis pipeline for easy inference
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available
)

# Test on some example reviews
test_reviews = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "Terrible movie, waste of time. Boring and poorly acted.",
    "It was okay, nothing special but not bad either.",
    "One of the best films I've ever seen. Highly recommended!",
    "Awful. Just awful. Don't watch this."
]

print("Testing sentiment classifier:\n")
for review in test_reviews:
    result = sentiment_pipeline(review)[0]
    label = "POSITIVE" if result["label"] == "LABEL_1" else "NEGATIVE"
    score = result["score"]
    print(f"Review: {review[:60]}...")
    print(f"  → {label} (confidence: {score:.3f})\n")


Device set to use cuda:0


Testing sentiment classifier:

Review: This movie was absolutely fantastic! I loved every minute of...
  → POSITIVE (confidence: 0.999)

Review: Terrible movie, waste of time. Boring and poorly acted....
  → NEGATIVE (confidence: 1.000)

Review: It was okay, nothing special but not bad either....
  → POSITIVE (confidence: 0.585)

Review: One of the best films I've ever seen. Highly recommended!...
  → POSITIVE (confidence: 0.997)

Review: Awful. Just awful. Don't watch this....
  → NEGATIVE (confidence: 0.992)
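
The objective lists a neutral class, but IMDB only has positive and negative labels, so the classifier always picks one of the two. The mixed review above gets POSITIVE with a confidence of only 0.585. One possible workaround, shown here as a hedged sketch rather than something the notebook implements, is to map low-confidence predictions to NEUTRAL. The function `three_way_label` and the threshold of 0.75 are hypothetical and would need tuning on labeled neutral examples.

```python
def three_way_label(binary_label, score, threshold=0.75):
    """Return NEUTRAL when the binary classifier is not confident enough."""
    if score < threshold:
        return "NEUTRAL"
    return binary_label

# (label, confidence) pairs copied from the output above
predictions = [
    ("POSITIVE", 0.999),
    ("NEGATIVE", 1.000),
    ("POSITIVE", 0.585),
    ("POSITIVE", 0.997),
    ("NEGATIVE", 0.992),
]

three_way = [three_way_label(label, score) for label, score in predictions]
print(three_way)
```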



## Step 6: Build the RAG System - Prepare Corpus and Embeddings

Now we build the RAG (Retrieval-Augmented Generation) system. This will:
1. Create embeddings of all reviews using Sentence-BERT (`all-MiniLM-L6-v2`)
2. L2-normalize the embeddings through the `normalize_embeddings` argument of `encode`
3. Implement kNN search for retrieval


In [None]:
# We use Sentence-BERT, which is designed specifically for semantic similarity.
# 'all-MiniLM-L6-v2' is a solid default: fast and accurate.
# 'all-mpnet-base-v2' is more accurate but slower (worth testing if needed).
from sentence_transformers import SentenceTransformer
import numpy as np  # Make sure numpy is imported

embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Chargement du modèle d'embedding optimisé : {embedding_model_name}")
embedding_model = SentenceTransformer(embedding_model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = embedding_model.to(device)
print(f"✓ Modèle d'embedding chargé sur {device}")

# --- Updated get_embeddings function ---
def get_embeddings(texts, batch_size=32):
    """Generate L2-normalized embeddings using SentenceTransformer."""
    if isinstance(texts, str):
        texts = [texts]

    # .encode() handles tokenization, padding, pooling and normalization
    embeddings = embedding_model.encode(
        texts,
        batch_size=batch_size,
        convert_to_tensor=False,  # NumPy arrays, as expected by scikit-learn's NearestNeighbors
        device=device,  # Use the selected device
        normalize_embeddings=True  # L2 normalization
    )

    # If the result is a tensor, convert it to numpy
    if isinstance(embeddings, torch.Tensor):
        embeddings = embeddings.cpu().numpy()

    # Make sure the result is float32 (standard for vector computations)
    return embeddings.astype(np.float32)

print("✓ Nouvelle fonction get_embeddings prête avec Sentence-BERT + Normalisation.")

# Get all review texts from training data
print("\nExtracting review texts...")
# Make sure train_data is defined before this block
corpus_texts = [item["text"] for item in train_data]
print(f"Corpus size: {len(corpus_texts)} reviews")

# Create embeddings for the corpus
print("\nCreating Sentence embeddings (this may take a few minutes)...")
# Uses the get_embeddings function defined above
corpus_embeddings = get_embeddings(corpus_texts)

print(f"✓ Embeddings created! Shape: {corpus_embeddings.shape}")

Chargement du modèle d'embedding optimisé : sentence-transformers/all-MiniLM-L6-v2
✓ Modèle d'embedding chargé sur cuda
✓ Nouvelle fonction get_embeddings prête avec Sentence-BERT + Normalisation.

Extracting review texts...
Corpus size: 10000 reviews

Creating Sentence embeddings (this may take a few minutes)...
✓ Embeddings created! Shape: (10000, 384)
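
L2 normalization matters because, for unit-length vectors, cosine similarity reduces to a plain dot product. The following sketch checks this on two small made-up vectors using only the standard library; the helper names are illustrative and the vectors have nothing to do with the real 384-dimensional embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def l2_normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity for vectors of arbitrary length."""
    return dot(a, b) / (norm(a) * norm(b))

a = [3.0, 4.0, 0.0]   # norm 5
b = [1.0, 2.0, 2.0]   # norm 3

cos_raw = cosine(a, b)                                  # (3 + 8) / (5 * 3) = 11/15
dot_normalized = dot(l2_normalize(a), l2_normalize(b))  # Same value via a dot product

print(f"cosine on raw vectors: {cos_raw:.6f}")
print(f"dot on normalized vectors: {dot_normalized:.6f}")
```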


In [None]:
import re  # Regular expressions (text cleaning)

# The system relies entirely on semantic search (embeddings + kNN).

# Build kNN index using cosine similarity
# kNN finds the most similar documents to a query
print("\nBuilding kNN index...")
knn_model = NearestNeighbors(
    n_neighbors=15,  # Default neighbor count; retrieve_documents overrides it with top_k
    metric="cosine"  # Cosine distance = 1 - cosine similarity
)
knn_model.fit(corpus_embeddings)  # Fit on the Sentence-BERT embeddings
print("✓ kNN index built!")

# Function to retrieve relevant documents
def retrieve_documents(query_text, top_k=10):
    """
    Retrieve top-k most similar documents to a query.
    Args:
        query_text: The query text
        top_k: Number of documents to retrieve
    Returns:
        List of dictionaries with text, similarity, and sentiment
    """
    # Get query embedding
    query_embedding = get_embeddings([query_text])

    # Find nearest neighbors
    distances, indices = knn_model.kneighbors(query_embedding, n_neighbors=top_k)

    # Get retrieved documents
    retrieved_docs = []
    for idx, dist in zip(indices[0], distances[0]):
        raw_text = corpus_texts[idx]

        # --- Sentiment lookup ---
        # Fetch the original label (0 or 1) through the index
        # (assumes train_data is still available)
        label_id = train_data[int(idx)]['label']
        sentiment_str = "POSITIVE" if label_id == 1 else "NEGATIVE"

        # --- Text cleaning ---
        clean_text = re.sub(r'<[^>]+>', ' ', raw_text)  # Strip HTML tags such as <br />
        clean_text = re.sub(r'\s+', ' ', clean_text).strip()  # Collapse whitespace

        retrieved_docs.append({
            "text": clean_text,
            "similarity": 1 - dist,  # Convert cosine distance back to similarity
            "sentiment": sentiment_str  # Attached as metadata
        })

    return retrieved_docs


Building kNN index...
✓ kNN index built!
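
What the kNN index computes can be reproduced by brute force on a toy corpus. The sketch below is not scikit-learn's implementation: `top_k_cosine` is a hypothetical helper that assumes all vectors are already L2-normalized, so that similarity is a dot product and cosine distance is `1 - similarity`, the same conversion used in `retrieve_documents`.

```python
def top_k_cosine(query, corpus, k):
    """Return (index, similarity, distance) for the k most similar corpus vectors."""
    # Vectors are assumed to be unit length, so the dot product is the cosine similarity
    sims = [sum(q * c for q, c in zip(query, doc)) for doc in corpus]
    order = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:k]
    return [(i, sims[i], 1.0 - sims[i]) for i in order]

# Toy unit-length "embeddings" in two dimensions
toy_corpus = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]
toy_query = [1.0, 0.0]

results = top_k_cosine(toy_query, toy_corpus, k=2)
for index, similarity, distance in results:
    print(f"doc {index}: similarity={similarity:.2f}, distance={distance:.2f}")
```

A brute-force scan like this is fine for 10,000 vectors; a dedicated index only becomes necessary for much larger corpora.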


In [None]:
# Retrieval test with the kNN model
query_test = "What did you think about movies with Brad Pitt?"

print(f"Requête : '{query_test}'")
print("-" * 50)

# Call the retrieval function, which uses the kNN index
# It converts the query into a vector and looks for the nearest neighbors
retrieved_docs = retrieve_documents(query_test, top_k=5)

# Safety check: make sure the retrieval worked (result is not None)
if retrieved_docs is not None:
    # Display the results
    for i, doc in enumerate(retrieved_docs, 1):
        print(f"\n📄 Résultat #{i}")
        print(f"   Score de similarité (Cosinus) : {doc['similarity']:.4f}")
        print(f"   Extrait du texte : \"{doc['text'][:250]}...\"")
else:
    print("\n⚠️ ERREUR : 'retrieved_docs' est vide (None).")
    print("Cela signifie que la fonction 'retrieve_documents' (cellule précédente) ne retourne rien.")
    print("Action requise : Ajoutez 'return retrieved_docs' à la fin de la fonction retrieve_documents et ré-exécutez sa cellule.")

Requête : 'What did you think about movies with Brad Pitt?'
--------------------------------------------------

📄 Résultat #1
   Score de similarité (Cosinus) : 0.5672
   Extrait du texte : "Upon seeing this film once again it appeared infinitely superior to me this time than the previous times I have viewed it. The acting is stunningly wonderful. The characters are very clearly drawn. Brad Pitt is simply superb as the errant son who reb..."

📄 Résultat #2
   Score de similarité (Cosinus) : 0.5651
   Extrait du texte : "Great drama with all the areas covered EXCEPT for screenlay which was too slow and should have shown more relevant scenes like Pitt's character interviewing the President,or Pitt getting murdered instead of just having it described to us.Scenes like ..."

📄 Résultat #3
   Score de similarité (Cosinus) : 0.5526
   Extrait du texte : "This movie was pretty absurd. There was a FEW funny parts. Its goes right in to the bin of movies in my memory where I thin

## Step 6.5: Save Corpus Data for CLI Use

We save the corpus texts and embeddings so they can be loaded by the CLI interface script.


In [None]:
# Save corpus texts and embeddings for CLI use
# This allows the CLI script to load the data without recomputing embeddings
print("Saving corpus data for CLI use...")

# Save corpus texts as numpy array (for easy loading)
np.save("corpus_texts.npy", np.array(corpus_texts, dtype=object))

# Save embeddings
np.save("corpus_embeddings.npy", corpus_embeddings)

print("✓ Corpus data saved!")
print("  - corpus_texts.npy")
print("  - corpus_embeddings.npy")
print("\nThese files can now be loaded by the CLI interface script.")


Saving corpus data for CLI use...
✓ Corpus data saved!
  - corpus_texts.npy
  - corpus_embeddings.npy

These files can now be loaded by the CLI interface script.
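
On the loading side, the texts were saved as a NumPy object array, which NumPy stores with pickle; `np.load` therefore needs `allow_pickle=True` for that file, while the float32 embeddings load without it. The sketch below demonstrates the round trip in a temporary directory with two made-up reviews; it is an illustration of what the CLI script might do, not the script itself.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for corpus_texts and corpus_embeddings
toy_texts = ["great film", "terrible plot"]
toy_embeddings = np.array([[0.6, 0.8], [1.0, 0.0]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    texts_path = os.path.join(tmp, "corpus_texts.npy")
    embeddings_path = os.path.join(tmp, "corpus_embeddings.npy")

    # Same save calls as in the notebook cell above
    np.save(texts_path, np.array(toy_texts, dtype=object))
    np.save(embeddings_path, toy_embeddings)

    # Object arrays are pickled, so allow_pickle=True is required here
    loaded_texts = np.load(texts_path, allow_pickle=True).tolist()
    # Plain numeric arrays load with the default settings
    loaded_embeddings = np.load(embeddings_path)

print(loaded_texts)
print(loaded_embeddings.shape, loaded_embeddings.dtype)
```

Because unpickling can execute arbitrary code, `allow_pickle=True` should only be used on files you created yourself.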


## Step 7: Custom text generation model based on the trained model

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load Flan-T5 for text generation
# Flan-T5 is an instruction-tuned model excellent for RAG tasks.
print("Loading Flan-T5 for text generation...")
generator_model_name = "google/flan-t5-base"
generator_tokenizer = AutoTokenizer.from_pretrained(generator_model_name)
generator_model = AutoModelForSeq2SeqLM.from_pretrained(generator_model_name)

# Move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generator_model = generator_model.to(device)
generator_model.eval()  # Evaluation mode

print(f"✓ Generator model loaded on {device}")

# Function to generate response using retrieved context
def generate_response(
    query,
    retrieved_docs,
    max_new_tokens=300,  # Raised from 150 to 300
    min_length=80,       # Exposed parameter (default: 80 tokens)
    num_context_docs=4,  # Raised from 2 to 4 to give the model more material
    max_doc_length=800   # Raised from 600 to 800
):
    # 1. Prepare Context WITHOUT Labels
    context_parts = []
    for doc in retrieved_docs[:num_context_docs]:
        # Take up to max_doc_length characters from each document
        context_parts.append(f"Review: {doc['text'][:max_doc_length]}")

    context = "\n\n".join(context_parts)

    # 2. Construct Optimized Prompt
    # Give the model a clear role and precise instructions
    if not context:
        # Special instruction if no context is available
        prompt = f"""You are an expert movie assistant. You were asked: '{query}'. However, no relevant reviews were found to answer this question. Please state that you cannot provide an answer based on available information.

Answer:"""
    else:
        # MODIFIED PROMPT: Encourage balanced output AND details
        prompt = f"""You are a helpful and balanced movie assistant. Use the reviews below to answer the user's question in detail.
Try to provide a balanced view. If the reviews mention both good and bad points, summarize both.
Avoid being overly negative unless the reviews are unanimously negative.
Expand on specific details mentioned in the reviews (acting, plot, effects).

Context Reviews:
{context}

User Question: {query}

Answer:"""

    # Tokenize prompt
    inputs = generator_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)

    # Generate response
    with torch.no_grad():
        outputs = generator_model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            min_length=min_length,  # Minimum output length passed by the caller
            num_return_sequences=1,
            temperature=0.8,  # Slightly raised for more varied output
            do_sample=True,
            no_repeat_ngram_size=2
        )

    # Decode generated text
    generated_text = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)

    return generated_text.strip()

Loading Flan-T5 for text generation...
✓ Generator model loaded on cuda
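
The context-building step of `generate_response` can be tested on its own, without loading any model. The sketch below isolates it in a hypothetical helper, `build_context`, with the same slicing logic (at most `num_context_docs` documents, each cut to `max_doc_length` characters); the toy documents are made up.

```python
def build_context(retrieved_docs, num_context_docs=4, max_doc_length=800):
    """Join the first documents into one context string, truncating each by characters."""
    parts = [
        f"Review: {doc['text'][:max_doc_length]}"
        for doc in retrieved_docs[:num_context_docs]
    ]
    return "\n\n".join(parts)

# Toy retrieval results (only the 'text' key is used here)
toy_docs = [
    {"text": "A" * 1000},
    {"text": "Short review."},
    {"text": "Third review, ignored below."},
]

toy_context = build_context(toy_docs, num_context_docs=2, max_doc_length=10)
print(toy_context)
print(repr(build_context([])))
```

Note that the limit is expressed in characters while the tokenizer truncates in tokens, so a long context can still be cut off when the prompt is tokenized.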


## Step 7.5: Testing the text generation model

In [None]:
# Test generation with LENGTH settings
print("\nTesting text generation with Flan-T5 (LONG ANSWER)...\n")
test_query = "What do people think about the special effects in the last movie of Brad Pitt?"

# Retrieval (now includes sentiment)
retrieved = retrieve_documents(test_query, top_k=10)

# Generate response (explicitly asking for a longer answer)
response = generate_response(
    test_query,
    retrieved,
    max_new_tokens=300,
    min_length=100,  # Force at least 100 tokens (~80 words)
    num_context_docs=5  # Use more reviews as context
)

print(f"Query: {test_query}")
print(f"\nGenerated Response (Length: {len(response.split())} words):")
print("-" * 50)
print(response)
print("-" * 50)


Testing text generation with Flan-T5 (LONG ANSWER)...

Query: What do people think about the special effects in the last movie of Brad Pitt?

Generated Response (Length: 86 words):
--------------------------------------------------
It's one of the finest. I saw it once again and it was a great movie. The actors are excellent and the scenery is beautiful throughout the film. But I can't say that the special effects are laughable. It is not as good as the rest of this movie and most reviews I have read. Overall, this was good. This movie contains lots of good special effect. And yes, the actors were excellent in the movie but there isn't enough action to make it worth watching.
--------------------------------------------------


## Step 8: Saving the models

In [None]:
# Save the embedding model
# Note: SentenceTransformer manages its own internal tokenizer and uses .save()
MODEL_EMBEDDING_PATH = "./embedding_model"

print(f"Saving embedding model to {MODEL_EMBEDDING_PATH}...")
embedding_model.save(MODEL_EMBEDDING_PATH)
print("✓ Embedding model saved successfully!")

# Save the generation model and its tokenizer
MODEL_GENERATOR_PATH = "./generator_model"

print(f"Saving generator model to {MODEL_GENERATOR_PATH}...")
generator_model.save_pretrained(MODEL_GENERATOR_PATH)
generator_tokenizer.save_pretrained(MODEL_GENERATOR_PATH)
print("✓ Generator model saved successfully!")

Saving embedding model to ./embedding_model...
✓ Embedding model saved successfully!
Saving generator model to ./generator_model...
✓ Generator model saved successfully!
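
Once the two directories exist, the models can be reloaded in a later session without downloading them again. The helper below is a minimal sketch, not part of the original notebook: the name `load_saved_models` is hypothetical, and it assumes the generator is a sequence-to-sequence model (Flan-T5 is) so that `AutoModelForSeq2SeqLM` applies.

```python
def load_saved_models(embedding_path="./embedding_model",
                      generator_path="./generator_model"):
    """Reload the models saved in Step 8 from their local directories.

    Hypothetical helper: the default paths mirror MODEL_EMBEDDING_PATH and
    MODEL_GENERATOR_PATH from the cell above.
    """
    # Imported lazily so that defining the helper does not require the libraries.
    from sentence_transformers import SentenceTransformer
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # SentenceTransformer accepts a local directory created by .save()
    embedding_model = SentenceTransformer(embedding_path)

    # The tokenizer and the model were both saved into the same directory
    generator_tokenizer = AutoTokenizer.from_pretrained(generator_path)
    generator_model = AutoModelForSeq2SeqLM.from_pretrained(generator_path)
    generator_model.eval()  # inference mode (disables dropout)

    return embedding_model, generator_model, generator_tokenizer
```

Calling `load_saved_models()` with no arguments should return the three objects in the order `embedding_model, generator_model, generator_tokenizer`, provided both directories are present.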


In [None]:
# import shutil
# import os

# # Define paths for the fine-tuned sentiment model and its tokenizer
# MODEL_SAVE_PATH = "./fine_tuned_sentiment_model"

# # 1. Compress the fine-tuned sentiment model directory
# model_dir = MODEL_SAVE_PATH
# zip_file_name_sentiment = "fine_tuned_sentiment_model.zip"

# # Create the zip archive
# print(f"Compressing directory '{model_dir}' into '{zip_file_name_sentiment}'...")
# shutil.make_archive(model_dir, 'zip', model_dir)
# print("✓ Sentiment model directory compressed successfully!")

# # 2. Download the sentiment model zip file
# from google.colab import files

# print(f"Downloading '{zip_file_name_sentiment}'...")
# files.download(zip_file_name_sentiment)
# print("✓ Sentiment model download started!")


# # Compress the embedding model directory
# embedding_model_dir = './embedding_model'
# embedding_zip_file_name = "embedding_model.zip"
# print(f"Compressing directory '{embedding_model_dir}' into '{embedding_zip_file_name}'...")
# shutil.make_archive(embedding_model_dir, 'zip', embedding_model_dir)
# print("✓ Embedding model directory compressed successfully!")

# # Compress the generation model directory
# generator_model_dir = './generator_model'
# generator_zip_file_name = "generator_model.zip"
# print(f"Compressing directory '{generator_model_dir}' into '{generator_zip_file_name}'...")
# shutil.make_archive(generator_model_dir, 'zip', generator_model_dir)
# print("✓ Generation model directory compressed successfully!")

# # Download the zip files of the embedding and generation models
# print(f"Downloading '{embedding_zip_file_name}'...")
# files.download(embedding_zip_file_name)

# print(f"Downloading '{generator_zip_file_name}'...")
# files.download(generator_zip_file_name)

# # Download the corpus_embeddings.npy and corpus_texts.npy files
# print("Downloading 'corpus_embeddings.npy'...")
# files.download('corpus_embeddings.npy')

# print("Downloading 'corpus_texts.npy'...")
# files.download('corpus_texts.npy')

# print("✓ Downloads started!")
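
The commented-out cell above repeats the same compress step for three directories. A small helper can factor that out. This is a sketch under assumptions: `zip_directory` is a hypothetical name, and the demo runs on a temporary directory with a placeholder `config.json` rather than on a real model. `shutil.make_archive` appends the `.zip` extension to the base name and returns the path of the archive it created.

```python
import shutil
import tempfile
from pathlib import Path


def zip_directory(source_dir, archive_base=None):
    """Zip the contents of source_dir and return the archive path.

    Hypothetical helper. By default the archive is written next to the
    directory, e.g. './embedding_model' -> 'embedding_model.zip'.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Directory not found: {source}")
    base = str(archive_base) if archive_base is not None else str(source)
    # make_archive adds the '.zip' suffix itself
    return shutil.make_archive(base, "zip", root_dir=str(source))


# Demo on a throwaway directory (placeholder file, not a real model)
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "embedding_model"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")

    archive_path = zip_directory(model_dir)
    archive_name = Path(archive_path).name
    archive_exists = Path(archive_path).exists()

print(archive_name, archive_exists)
```

In Colab, the returned path could then be passed to `files.download`, as in the cell above.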

## Step 9: Create Evaluation Functions

We create functions to evaluate the system on multiple metrics: accuracy, response length, repetitions, keyword presence, and latency.


In [None]:
# Evaluation functions

def count_repetitions(text, n=3):
    """
    Count repeated n-grams (phrases) in text.
    Higher repetition = lower quality.

    Args:
        text: Text to analyze
        n: Length of n-grams to check

    Returns:
        Number of repeated n-grams
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
    ngram_counts = Counter(ngrams)
    # Count n-grams that appear more than once
    repetitions = sum(1 for count in ngram_counts.values() if count > 1)
    return repetitions

def check_keyword_presence(text, keywords):
    """
    Check if important keywords from query appear in response.

    Args:
        text: Response text
        keywords: List of keywords to check

    Returns:
        Number of keywords found
    """
    text_lower = text.lower()
    found = sum(1 for keyword in keywords if keyword.lower() in text_lower)
    return found

def evaluate_response(query, response, true_label=None, start_time=None):
    """
    Evaluate a single response on multiple metrics.

    Args:
        query: Original query
        response: Generated response
        true_label: True sentiment label (if available)
        start_time: Start time for latency calculation

    Returns:
        Dictionary with evaluation metrics
    """
    metrics = {}

    # Response length
    metrics["response_length"] = len(response.split())

    # Repetitions
    metrics["repetitions"] = count_repetitions(response)

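    # Keyword presence (extract important words from query)
    # Note: punctuation is not stripped, so a token like "plot?" only matches a literal "plot?"
    query_words = [w for w in query.lower().split() if len(w) > 3]  # Words longer than 3 chars
    metrics["keywords_found"] = check_keyword_presence(response, query_words)

    # Latency: time elapsed since start_time (includes everything the caller ran since then)
    if start_time is not None:
        metrics["latency"] = time.time() - start_time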

    # Classification accuracy (if true label provided)
    if true_label is not None:
        result = sentiment_pipeline(query)[0]
        predicted_label = 1 if result["label"] == "LABEL_1" else 0
        metrics["classification_correct"] = (predicted_label == true_label)
        metrics["classification_confidence"] = result["score"]

    return metrics

print("✓ Evaluation functions created!")


✓ Evaluation functions created!
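
To make the repetition metric concrete, the sketch below applies the same n-gram logic as `count_repetitions` to a short made-up sentence, but returns the repeated trigrams themselves instead of only their number. The helper name `repeated_ngrams` and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def repeated_ngrams(text, n=3):
    """Return {ngram: count} for every n-gram that occurs more than once."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {ngram: count for ngram, count in counts.items() if count > 1}


# Made-up example sentence containing one repeated phrase
sample = "the movie was good and the movie was good overall"
repeated = repeated_ngrams(sample)

for ngram, count in repeated.items():
    print(" ".join(ngram), "->", count)

# len(repeated) is the value count_repetitions would return for this text
print("repetitions:", len(repeated))
```

The metric counts distinct repeated n-grams, not the total number of extra occurrences: a trigram that appears five times still adds only 1 to the score.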


## Step 10: Run End-to-End Evaluation

We test the complete system on 50 queries (10 distinct questions, each repeated 5 times) and collect all evaluation metrics.


In [None]:
# Create test queries for evaluation
# Mix of different types of queries
test_queries = [
    "What did people think about the plot?",
    "How was the acting?",
    "Was the movie entertaining?",
    "Did people like the special effects?",
    "What about the cinematography?",
    "How was the dialogue?",
    "Was the movie well-directed?",
    "Did people enjoy the soundtrack?",
    "How was the pacing?",
    "What did people think about the ending?",
] * 5  # Quick test on 50 queries (10 questions x 5)

# Run evaluation
print(f"Running evaluation on {len(test_queries)} queries (Quick Test)...")
print("This may take a minute...\n")

# Ensure sentiment_pipeline is defined
try:
    sentiment_pipeline
except NameError:
    print("sentiment_pipeline not found. Initializing it now...")
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1
    )
# -----------------------------------------------

results = []

for i, query in enumerate(tqdm(test_queries, desc="Evaluating")):
    start_time = time.time()

    # Retrieve relevant documents
    retrieved = retrieve_documents(query, top_k=5)

    # Generate response
    response = generate_response(query, retrieved)

    # --- FIX: analyze the sentiment of the RESPONSE, not of the question ---
    sentiment_result = sentiment_pipeline(response)[0]
    response_sentiment = "POSITIVE" if sentiment_result["label"] == "LABEL_1" else "NEGATIVE"

    # Evaluate
    metrics = evaluate_response(query, response, start_time=start_time)

    # Store results
    results.append({
        "query": query,
        "response_sentiment": response_sentiment,  # Store the sentiment of the response
        "sentiment_confidence": sentiment_result["score"],
        "response": response,
        **metrics  # Add all evaluation metrics
    })

# Convert to DataFrame for easy analysis
results_df = pd.DataFrame(results)

# Calculate summary statistics
print("\n" + "="*60)
print("EVALUATION RESULTS (Analyzed on Generated Responses)")
print("="*60)
print(f"\nTotal queries: {len(results_df)}")
print(f"Average response length: {results_df['response_length'].mean():.1f} words")
print(f"Average repetitions: {results_df['repetitions'].mean():.1f}")
print(f"Average keywords found: {results_df['keywords_found'].mean():.1f}")
print(f"Average latency: {results_df['latency'].mean():.2f} seconds")
print(f"\nGenerated Response Sentiment distribution:")
print(results_df['response_sentiment'].value_counts())

# Save results to CSV
results_df.to_csv("evaluation_results_optimized.csv", index=False)
print("\n✓ Results saved to 'evaluation_results_optimized.csv'")

Running evaluation on 50 queries (Quick Test)...
This may take a minute...



Evaluating: 100%|██████████| 50/50 [01:48<00:00,  2.17s/it]


EVALUATION RESULTS (Analyzed on Generated Responses)

Total queries: 50
Average response length: 66.1 words
Average repetitions: 0.0
Average keywords found: 0.5
Average latency: 2.17 seconds

Generated Response Sentiment distribution:
response_sentiment
NEGATIVE    28
POSITIVE    22
Name: count, dtype: int64

✓ Results saved to 'evaluation_results_optimized.csv'
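
One likely contributor to the low "Average keywords found" value is that the query is split on whitespace only, so the last word keeps its question mark (`"plot?"`) and rarely matches the response. The sketch below compares the notebook's extraction with a variant that strips punctuation. The names `extract_keywords` and `count_keywords` and the example response are illustrative assumptions, not part of the notebook.

```python
import string


def extract_keywords(query, min_length=4):
    """Lowercase the query, strip surrounding punctuation, keep longer words."""
    cleaned = (word.strip(string.punctuation) for word in query.lower().split())
    return [word for word in cleaned if len(word) >= min_length]


def count_keywords(text, keywords):
    """Count how many keywords occur as substrings of the text."""
    text_lower = text.lower()
    return sum(1 for keyword in keywords if keyword in text_lower)


query = "What did people think about the plot?"
response = "Most people found the plot predictable."  # made-up response

# Extraction as done in evaluate_response (punctuation kept)
raw_keywords = [w for w in query.lower().split() if len(w) > 3]
# Variant with punctuation stripped
clean_keywords = extract_keywords(query)

print("raw:  ", raw_keywords, "->", count_keywords(response, raw_keywords))
print("clean:", clean_keywords, "->", count_keywords(response, clean_keywords))
```

Switching to the stripped variant would change the reported metric, so the numbers above would need to be regenerated before comparing runs.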





## Step 11: Robustness Tests

We test the system with edge cases: vague prompts, contradictory prompts, and unrelated prompts.


In [None]:
# Robustness tests
print("="*60)
print("ROBUSTNESS TESTS")
print("="*60)

# Test 1: Vague prompt
print("\n1. VAGUE PROMPT TEST")
print("-" * 60)
vague_query = "What about it?"
print(f"Query: '{vague_query}'")
retrieved = retrieve_documents(vague_query, top_k=3)
response = generate_response(vague_query, retrieved)
print(f"Response: {response[:200]}...")
metrics = evaluate_response(vague_query, response)
print(f"Metrics: Length={metrics['response_length']}, Repetitions={metrics['repetitions']}")

# Test 2: Contradictory prompt
print("\n2. CONTRADICTORY PROMPT TEST")
print("-" * 60)
contradictory_query = "This movie is both amazing and terrible at the same time"
print(f"Query: '{contradictory_query}'")
sentiment_result = sentiment_pipeline(contradictory_query)[0]
print(f"Sentiment: {sentiment_result['label']} (confidence: {sentiment_result['score']:.3f})")
retrieved = retrieve_documents(contradictory_query, top_k=3)
response = generate_response(contradictory_query, retrieved)
print(f"Response: {response[:200]}...")

# Test 3: Unrelated prompt
print("\n3. UNRELATED PROMPT TEST")
print("-" * 60)
unrelated_query = "What is the weather like today?"
print(f"Query: '{unrelated_query}'")
retrieved = retrieve_documents(unrelated_query, top_k=3)
print(f"Retrieved documents similarity scores:")
for i, doc in enumerate(retrieved, 1):
    print(f"  {i}. {doc['similarity']:.3f}")
response = generate_response(unrelated_query, retrieved)
print(f"Response: {response[:200]}...")

print("\n" + "="*60)
print("Robustness tests complete!")
print("="*60)


ROBUSTNESS TESTS

1. VAGUE PROMPT TEST
------------------------------------------------------------
Query: 'What about it?'



Response: negative. If this movie had a good plot, it would have been better. But it was so bad that I was surprised to see two kids slip in and I didn't feel like I could give it 0 stars....
Metrics: Length=39, Repetitions=0

2. CONTRADICTORY PROMPT TEST
------------------------------------------------------------
Query: 'This movie is both amazing and terrible at the same time'
Sentiment: LABEL_1 (confidence: 0.961)



Response: It is not a bad movie. It doesn't have one redeeming value. The acting is decent and the plot is believable....

3. UNRELATED PROMPT TEST
------------------------------------------------------------
Query: 'What is the weather like today?'



Retrieved documents similarity scores:
  1. 0.261
  2. 0.235
  3. 0.228
Response: cold. No snow today. Snowy weather tonight. Sunny. Good to see Disney in a film with sex and humor. Unlike the other films, the film is more humour than gloom, and more engaging than the first two fil...

Robustness tests complete!
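
In the unrelated-prompt test, the best similarity score is only 0.261, yet the generator still produces an answer. One possible guard is to drop weakly matching documents and return a fallback message when none remain. The sketch below assumes each retrieved document is a dict with a `similarity` key (as used in the test above). The names `filter_by_similarity`, `answer_or_fallback`, and `FALLBACK_MESSAGE`, the stub generator, and the 0.3 threshold are all hypothetical; the threshold would need tuning on real queries.

```python
FALLBACK_MESSAGE = "I could not find relevant reviews for this question."


def filter_by_similarity(retrieved, min_similarity=0.3):
    """Keep only documents whose similarity reaches the threshold."""
    return [doc for doc in retrieved if doc["similarity"] >= min_similarity]


def answer_or_fallback(query, retrieved, generate_fn, min_similarity=0.3):
    """Generate an answer only when at least one document is relevant enough."""
    relevant = filter_by_similarity(retrieved, min_similarity)
    if not relevant:
        return FALLBACK_MESSAGE
    return generate_fn(query, relevant)


def stub_generate(query, docs):
    """Stand-in for generate_response so the sketch runs without a model."""
    return f"answer based on {len(docs)} document(s)"


# Scores taken from the unrelated-prompt output above; texts are placeholders
weather_docs = [
    {"text": "review A", "similarity": 0.261},
    {"text": "review B", "similarity": 0.235},
    {"text": "review C", "similarity": 0.228},
]
# Hypothetical on-topic retrieval with one strong match
movie_docs = [
    {"text": "review D", "similarity": 0.62},
    {"text": "review E", "similarity": 0.21},
]

weather_answer = answer_or_fallback("What is the weather like today?", weather_docs, stub_generate)
movie_answer = answer_or_fallback("How was the acting?", movie_docs, stub_generate)
print(weather_answer)
print(movie_answer)
```

In the notebook, `generate_response` could be passed as `generate_fn` in place of the stub.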
