# Question Answering – Model Evaluation

This notebook evaluates the performance of the fine-tuned model
on the SQuAD dataset (Exact Match, F1 score, and inference time).


## Objectives

- Load the fine-tuned model
- Evaluate performance on the validation set
- Compute the Exact Match and F1 metrics
- Measure inference time


In [1]:
from datasets import load_from_disk
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import evaluate
import numpy as np
import time
import torch
from sklearn.metrics import precision_recall_curve, roc_curve, auc
import matplotlib.pyplot as plt




## Loading the data


In [None]:
# Note: tokenized_datasets is not used for evaluation.
# We work directly with raw_datasets so that each example is tokenized
# with the same configuration as at inference time.

from datasets import load_dataset
raw_datasets = load_dataset("squad")

Generating train split: 100%|██████████| 87599/87599 [00:00<00:00, 490288.54 examples/s]
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 297506.30 examples/s]


## Loading the fine-tuned model


In [3]:
model_path = "outputs/checkpoints/distilbert/final"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_path)

model.eval()


Loading weights: 100%|██████████| 102/102 [00:00<00:00, 429.89it/s, Materializing param=qa_outputs.weight]                                     


DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSelfAttention(
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
     

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)



## SQuAD metrics


In [16]:
metric = evaluate.load("squad")


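## Inference function


In [6]:
def predict_with_score(example):
    """Predict the answer span for one tokenized example.

    Returns (start_idx, end_idx, score), where score is the sum of the
    start and end logits at the predicted positions (not a probability).
    """
    # Add a batch dimension and move the tensors to the model's device
    inputs = {
        "input_ids": torch.tensor(example["input_ids"]).unsqueeze(0).to(device),
        "attention_mask": torch.tensor(example["attention_mask"]).unsqueeze(0).to(device)
    }

    with torch.no_grad():
        outputs = model(**inputs)

    # Drop the batch dimension: one logit per input token
    start_logits = outputs.start_logits.squeeze(0)
    end_logits = outputs.end_logits.squeeze(0)

    # Most likely start and end positions, chosen independently
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits)

    score = start_logits[start_idx] + end_logits[end_idx]

    return start_idx.item(), end_idx.item(), score.item()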
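
The score returned by `predict_with_score` is a sum of raw logits, so it is an unbounded value rather than a probability. As an illustration only, the sketch below turns the two logit vectors into a probability-like confidence by multiplying the softmax probabilities of the chosen start and end positions. The helper names `softmax` and `span_confidence` are hypothetical and not part of the notebook; plain Python lists stand in for the tensors (for example `start_logits.tolist()`).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_confidence(start_logits, end_logits, start_idx, end_idx):
    """Hypothetical helper: P(start = start_idx) * P(end = end_idx)."""
    return softmax(start_logits)[start_idx] * softmax(end_logits)[end_idx]


# Two tokens with equal logits: each position has probability 0.5
print(span_confidence([0.0, 0.0], [0.0, 0.0], 0, 1))
```

The product treats the start and end positions as independent, which is a simplifying assumption of this sketch.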


## Evaluation on the validation set


### Methodology

- **Evaluation subset**: The evaluation runs on the first 500 examples of the validation set to reduce compute time while still giving an indicative estimate.

- **Single context window**: To limit compute time, each example is encoded as a single context window (max_length=384) with no sliding window. Longer inputs are truncated, so an answer located in the truncated part cannot be recovered, which can underestimate performance on longer contexts.

- **Exact Match on the predicted span**: Rather than comparing token positions, the predicted token span is decoded back to text and compared with the reference answer text.
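
To make the two metrics concrete, here is a minimal sketch modeled on the official SQuAD v1.1 evaluation logic: answers are normalized (lowercased, punctuation and English articles removed, whitespace collapsed), then compared either exactly or by token overlap, keeping the best score over the reference answers. The function names are illustrative; the notebook itself relies on `evaluate.load("squad")` rather than on this code.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list) -> int:
    """1 if the normalized prediction equals any normalized reference, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def f1_score(prediction: str, gold_answers: list) -> float:
    """Best token-level F1 between the prediction and any reference."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize_answer(gold).split()
        # Multiset intersection counts the tokens shared by both answers
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(exact_match("The Denver Broncos", ["Denver Broncos"]))
print(f1_score("Denver Broncos team", ["Denver Broncos"]))
```

Because of this normalization, a binary label built from a strip-only comparison with the first reference answer can be stricter than the Exact Match reported by the metric.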

In [7]:
n_samples = 500
validation_set = raw_datasets["validation"].select(range(min(n_samples, len(raw_datasets["validation"]))))


In [None]:
y_true = []
y_scores = []
predictions = []
references = []

for example in validation_set:
    # Tokenize the raw example
    encoded = tokenizer(
        example["question"],
        example["context"],
        truncation=True,
        max_length=384,
        return_tensors="pt"
    )
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**{k: v.to(device) for k, v in encoded.items()})
    
    start_logits = outputs.start_logits[0]
    end_logits = outputs.end_logits[0]
    
    start_pred = torch.argmax(start_logits).item()
    end_pred = torch.argmax(end_logits).item()
    
    # Fix invalid span
    if end_pred < start_pred:
        end_pred = start_pred
    
    score = start_logits[start_pred] + end_logits[end_pred]

    # Decode prediction
    prediction_text = tokenizer.decode(
        encoded["input_ids"][0][start_pred:end_pred + 1],
        skip_special_tokens=True
    )

    gold_text = example["answers"]["text"][0]

    # Exact Match against the first gold answer → binary label
    y_true.append(int(prediction_text.strip() == gold_text.strip()))
    y_scores.append(score.item())

    # Build prediction and reference lists for metric.compute()
    predictions.append({
        "id": example["id"],
        "prediction_text": prediction_text
    })
    
    references.append({
        "id": example["id"],
        "answers": example["answers"]
    })
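
The loop above picks the start and end positions independently and repairs an invalid span by collapsing it to a single token. A common alternative, sketched here under stated assumptions, is to search the top-scoring start/end pairs and keep the best pair that forms a valid span. The function name `best_span` and the defaults `top_k=20` and `max_answer_len=30` are illustrative choices, not values taken from this notebook; the inputs are plain lists of floats (for example `start_logits.tolist()`).

```python
def best_span(start_logits, end_logits, top_k=20, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span, or None."""
    # Indices of the top_k largest logits on each side
    top_starts = sorted(range(len(start_logits)),
                        key=start_logits.__getitem__, reverse=True)[:top_k]
    top_ends = sorted(range(len(end_logits)),
                      key=end_logits.__getitem__, reverse=True)[:top_k]

    best = None
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or are too long
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Independent argmax would give start=2, end=1, which is not a valid span
print(best_span([0.1, 0.2, 5.0, 0.3], [0.0, 4.0, 0.5, 3.0]))
```

This sketch does not exclude question tokens or special tokens from the candidates; a fuller version would restrict the search to context positions.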

## Results


In [None]:
results = metric.compute(predictions=predictions, references=references)
results

{'exact_match': 3.2, 'f1': 7.769248010515557}

## Measuring inference time


In [None]:
# Wall-clock time per example, tokenization included
start_time = time.perf_counter()

for example in validation_set:
    encoded = tokenizer(
        example["question"],
        example["context"],
        truncation=True,
        max_length=384,
        return_tensors="pt"
    )

    with torch.no_grad():
        outputs = model(**{k: v.to(device) for k, v in encoded.items()})

# Wait for pending GPU work before stopping the clock
if torch.cuda.is_available():
    torch.cuda.synchronize()

end_time = time.perf_counter()

# Average over the examples actually processed
avg_time = (end_time - start_time) / len(validation_set)
avg_time

0.016190597057342528
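
The figure above is a single average that includes tokenization and any one-time start-up cost. As a hedged sketch of a finer-grained measurement, the helper below runs a few untimed warm-up calls and then records one latency per call, so that a median can be reported alongside the mean. The name `measure_latency` and its parameters are hypothetical; `sync` is meant to receive something like `torch.cuda.synchronize` when a GPU is used.

```python
import statistics
import time


def measure_latency(fn, inputs, warmup=3, sync=None):
    """Time fn(x) for each x in inputs; return per-call latencies in seconds."""
    inputs = list(inputs)
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls, not timed

    latencies = []
    for x in inputs:
        if sync is not None:
            sync()  # finish earlier work before starting the clock
        t0 = time.perf_counter()
        fn(x)
        if sync is not None:
            sync()  # wait for asynchronous work started by fn
        latencies.append(time.perf_counter() - t0)
    return latencies


# Stand-in workload; in the notebook fn would tokenize and run the model
latencies = measure_latency(lambda n: sum(range(n)), [1000] * 20)
print(f"median: {statistics.median(latencies) * 1000:.3f} ms")
```

The median is less sensitive than the mean to occasional slow calls, which is why both are often reported together.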

## Conclusion

The fine-tuned model was evaluated on the SQuAD validation set.

In addition to the Exact Match and F1 metrics, the model was evaluated
with Precision, Recall, and AUC.

The ROC curve shows how well the model's confidence score separates
correct answers from incorrect ones as the confidence threshold
varies.
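
To illustrate what the AUC measures, the area under the ROC curve equals the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one, with ties counted as one half. The dependency-free sketch below computes it from binary labels and scores such as `y_true` and `y_scores`; the function name `roc_auc` is illustrative, and scikit-learn's `roc_auc_score` is the usual library route.

```python
import math


def roc_auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_scores) if t == 1]
    neg = [s for t, s in zip(y_true, y_scores) if t == 0]
    if not pos or not neg:
        return math.nan  # undefined when only one class is present

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a few hundred predictions but not for large evaluation sets.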


## Results summary


In [None]:
results_summary = {
    "model": "DistilBERT",
    "EM": results["exact_match"],
    "F1": results["f1"],
    "Inference_time_ms": avg_time * 1000
}

results_summary

{'model': 'DistilBERT',
 'EM': 3.2,
 'F1': 7.769248010515557,
 'Precision': np.float64(0.0020408163265306124),
 'Recall': np.float64(0.9979591836734694),
 'AUC': nan,
 'Inference_time_ms': 16.19059705734253}

In [23]:
import json

# Save the results as JSON
with open("outputs/results_distilbert.json", "w") as f:
    json.dump(results_summary, f, indent=2)

print("Results saved to outputs/results_distilbert.json")


Results saved to outputs/results_distilbert.json
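
One detail worth noting when a summary contains NaN values: by default `json.dump` writes them as a bare `NaN`, which strict JSON parsers reject. The sketch below, which assumes a flat dictionary of strings and numbers, converts numeric values to plain floats and replaces non-finite ones with `None` before saving. The helper name `to_json_safe` is hypothetical.

```python
import json
import math


def to_json_safe(summary):
    """Return a copy with plain floats, and None in place of NaN or infinity."""
    safe = {}
    for key, value in summary.items():
        if value is None or isinstance(value, (str, bool, int)):
            safe[key] = value
        else:
            number = float(value)  # also accepts NumPy scalar types
            safe[key] = number if math.isfinite(number) else None
    return safe


example = {"model": "DistilBERT", "EM": 3.2, "AUC": float("nan")}
# allow_nan=False makes json raise instead of emitting non-standard NaN
print(json.dumps(to_json_safe(example), allow_nan=False))
```

Writing `null` for an undefined metric keeps the file readable by any JSON parser while still signaling that the value is missing.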
