# Intent Detection with CamemBERT (Transformers)

**Notebook optimized for a Google Colab GPU runtime with Transformers**

This notebook fine-tunes CamemBERT for binary intent classification (TRIP vs. NOT_TRIP).

## Mounting Google Drive

Mounts Google Drive to access the datasets and disables WandB, since experiment tracking is not needed here.

In [17]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Disable WandB
os.environ['WANDB_DISABLED'] = 'true'

# Paths
workdir = '/content/drive/MyDrive/intent_classification/dataset'
os.makedirs(workdir, exist_ok=True)

print('Working directory:', workdir)
print('WandB: DESACTIVE')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Working directory: /content/drive/MyDrive/intent_classification/dataset
WandB: DESACTIVE


## Loading and inspecting the datasets

Loads the datasets (`dataset_train.json`, `dataset_test.json`) and prints class statistics for the training set.

In [18]:
import os
import json
import pandas as pd

# Dataset paths
train_path = os.path.join(workdir, "dataset_train.json")
test_path = os.path.join(workdir, "dataset_test.json")

# Check that both files exist
for path in [train_path, test_path]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Fichier introuvable : {path}")

# JSON loading
def load_intent_json(path):
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)

    # Case 1: top-level list of records
    if isinstance(raw, list):
        return pd.DataFrame(raw)

    # Case 2: dict wrapping the records under a key (train / test / data)
    if isinstance(raw, dict):
        # Use the first value that is a list
        for value in raw.values():
            if isinstance(value, list):
                return pd.DataFrame(value)

        raise ValueError(f"Aucune liste trouvée dans le JSON : {path}")

    raise ValueError(f"Format JSON non reconnu : {type(raw)}")

train_df = load_intent_json(train_path)
test_df = load_intent_json(test_path)

# Structural checks
assert {"text", "label"}.issubset(train_df.columns), "Colonnes requises manquantes dans train_df"
assert {"text", "label"}.issubset(test_df.columns), "Colonnes requises manquantes dans test_df"

print("Dataset TRAIN")
print(f"Nombre d'exemples : {len(train_df)}")

print("\nDistribution des classes :")
print(train_df["label"].value_counts())

print("\nPourcentages :")
print(train_df["label"].value_counts(normalize=True).mul(100).round(2))

print("\nExemples :")
display(train_df.head(3))


Dataset TRAIN
Nombre d'exemples : 1000

Distribution des classes :
label
TRIP        500
NOT_TRIP    500
Name: count, dtype: int64

Pourcentages :
label
TRIP        50.0
NOT_TRIP    50.0
Name: proportion, dtype: float64

Exemples :


| | id | text | label |
|---|---|---|---|
| 0 | 319 | Quitter Pau pour rejoindre Toulouse au plus vite. | TRIP |
| 1 | 301 | je vaix de Sens a Nice svp | TRIP |
| 2 | 850 | Je vais faire un tour au parc. | NOT_TRIP |
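
The loader above accepts two JSON layouts: a top-level list of records, or an object in which the first list-valued key holds the records. A minimal stdlib sketch of that dispatch follows. The `extract_records` helper and both payloads are made up for illustration and are not part of the notebook.

```python
import json

def extract_records(raw):
    """Return the list of records from either supported layout (hypothetical helper)."""
    if isinstance(raw, list):
        return raw
    if isinstance(raw, dict):
        # Take the first value that is a list, whatever its key is called
        for value in raw.values():
            if isinstance(value, list):
                return value
        raise ValueError("no list found in JSON object")
    raise ValueError(f"unsupported JSON type: {type(raw).__name__}")

# Made-up payloads illustrating both layouts
flat = json.loads('[{"text": "De Paris à Lyon", "label": "TRIP"}]')
wrapped = json.loads('{"train": [{"text": "Merci beaucoup", "label": "NOT_TRIP"}]}')

flat_records = extract_records(flat)
wrapped_records = extract_records(wrapped)
print(flat_records[0]["label"], wrapped_records[0]["label"])
```

Either list can then be passed to `pd.DataFrame`, as the notebook does.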


## Preparing the Hugging Face datasets for intent classification

Converts the DataFrames into Hugging Face `Dataset` objects and encodes the text labels as integer ids.

In [19]:
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder

# Label encoding
label_encoder = LabelEncoder()

# Fit the encoder on the training labels only
label_encoder.fit(train_df["label"])

print("Classes encodées :")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {i} → {label}")

# Map text labels to integer ids
train_df["label_id"] = label_encoder.transform(train_df["label"])
test_df["label_id"] = label_encoder.transform(test_df["label"])

# Create the Hugging Face datasets
train_dataset = Dataset.from_pandas(train_df[["text", "label_id"]])
test_dataset = Dataset.from_pandas(test_df[["text", "label_id"]])

print("\nDatasets HuggingFace créés :")
print(f"  Train : {len(train_dataset)} exemples")
print(f"  Test  : {len(test_dataset)} exemples")


Classes encodées :
  0 → NOT_TRIP
  1 → TRIP

Datasets HuggingFace créés :
  Train : 1000 exemples
  Test  : 500 exemples
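
`LabelEncoder` assigns ids in sorted order of the class names, which is why NOT_TRIP maps to 0 and TRIP to 1. A small stdlib sketch of the same mapping follows. The `label2id` and `id2label` names are illustrative and the sample labels are made up.

```python
# Made-up sample of text labels
labels = ["TRIP", "NOT_TRIP", "TRIP", "NOT_TRIP"]

# Sorted unique class names define the id order
classes = sorted(set(labels))
label2id = {label: i for i, label in enumerate(classes)}
id2label = {i: label for label, i in label2id.items()}

# Encode every label as its integer id
label_ids = [label2id[label] for label in labels]

print(label2id)
print(label_ids)
```

Dictionaries of this shape could also be passed as `id2label` and `label2id` when loading the model, so that predictions carry readable names instead of `LABEL_0` and `LABEL_1`.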


## Hyperparameter tuning with Optuna

Automatically searches for the best hyperparameters before the final training run. Each trial is scored by its F1 on the test set.

In [None]:
import json
import numpy as np
import optuna
from sklearn.metrics import f1_score, accuracy_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

RUN_TUNING = True
N_TRIALS = 20

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

tokenizer = AutoTokenizer.from_pretrained("camembert-base")

def build_tokenized_datasets(max_length):
    def tokenize(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            max_length=max_length
        )

    tokenized_train = train_dataset.map(tokenize, batched=True)
    tokenized_test = test_dataset.map(tokenize, batched=True)

    tokenized_train = tokenized_train.remove_columns(["text"]).rename_column("label_id", "labels")
    tokenized_test = tokenized_test.remove_columns(["text"]).rename_column("label_id", "labels")

    tokenized_train.set_format("torch")
    tokenized_test.set_format("torch")

    return tokenized_train, tokenized_test

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    num_epochs = trial.suggest_int("num_epochs", 3, 10)
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.3)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
    max_length = trial.suggest_int("max_length", 64, 256, step=32)

    tokenized_train, tokenized_test = build_tokenized_datasets(max_length)

    model = AutoModelForSequenceClassification.from_pretrained(
        "camembert-base",
        num_labels=2
    )

    training_args = TrainingArguments(
        output_dir=f"/content/drive/MyDrive/intent_classification/optuna_trial_{trial.number}",
        eval_strategy="epoch",
        save_strategy="no",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=32,
        learning_rate=learning_rate,
        warmup_ratio=warmup_ratio,
        weight_decay=weight_decay,
        report_to="none"
    )

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    trainer.train()
    eval_result = trainer.evaluate()

    return eval_result["eval_f1"]

best_params = {
    "learning_rate": 1e-5,
    "batch_size": 16,
    "num_epochs": 5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.0,
    "max_length": 128
}

if RUN_TUNING:
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=N_TRIALS, show_progress_bar=True)
    best_params = study.best_params

    with open("best_hyperparams.json", "w", encoding="utf-8") as f:
        json.dump(best_params, f, indent=2, ensure_ascii=False)
MAX_LENGTH = int(best_params["max_length"])
print("Meilleurs hyperparametres:", best_params)
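
The objective above scores every trial on `test_dataset`, so the selected hyperparameters are fitted to the test set. One common alternative is to hold out a validation slice of the training data for tuning. Below is a minimal stdlib sketch of a stratified split. The `stratified_split` helper, the 80/20 ratio, the seed and the synthetic rows are all assumptions for illustration, not the notebook's method.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="label", val_fraction=0.2, seed=42):
    """Split rows into train/validation, keeping class proportions (hypothetical helper)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train_rows, val_rows = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val_rows.extend(group[:n_val])
        train_rows.extend(group[n_val:])
    return train_rows, val_rows

# Synthetic balanced data with the same size as the training set (1000 rows)
rows = [
    {"text": f"example {i}", "label": "TRIP" if i % 2 == 0 else "NOT_TRIP"}
    for i in range(1000)
]
train_rows, val_rows = stratified_split(rows)
print(len(train_rows), len(val_rows))
```

With the notebook's objects, a comparable (non-stratified) split could likely be obtained with `train_dataset.train_test_split(test_size=0.2, seed=42)`, passing the held-out part as `eval_dataset` inside `objective`.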

## Fine-tuning CamemBERT for intent classification

Trains the CamemBERT model for intent classification.

**Key improvements**:
- Semantic understanding (compared with TF-IDF)
- Handling of foreign-language input
- Transfer learning from a model pre-trained on French
In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
import torch

# 1. TOKENIZATION
try:
    MAX_LENGTH
except NameError:
    MAX_LENGTH = 128

tokenizer = AutoTokenizer.from_pretrained("camembert-base")

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_LENGTH
    )

tokenized_train = train_dataset.map(tokenize, batched=True)
tokenized_test = test_dataset.map(tokenize, batched=True)

tokenized_train = tokenized_train.remove_columns(["text"])
tokenized_test = tokenized_test.remove_columns(["text"])

tokenized_train = tokenized_train.rename_column("label_id", "labels")
tokenized_test = tokenized_test.rename_column("label_id", "labels")

tokenized_train.set_format("torch")
tokenized_test.set_format("torch")

# 2. MODEL
model = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base",
    num_labels=2
)

# 3. METRICS
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)

    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

# 4. TRAINING CONFIG
# Use the tuned hyperparameters when available, otherwise fall back to the defaults
params = globals().get("best_params", {})
learning_rate = params.get("learning_rate", 1e-5)
batch_size = params.get("batch_size", 16)
num_epochs = params.get("num_epochs", 5)
warmup_ratio = params.get("warmup_ratio", 0.1)
weight_decay = params.get("weight_decay", 0.0)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/intent_classification/intent_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=32,
    learning_rate=learning_rate,
    lr_scheduler_type="cosine",
    warmup_ratio=warmup_ratio,
    weight_decay=weight_decay,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none"
)

# 5. TRAINER
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
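# Dynamic quantization runs on CPU, so move the model off the GPU first
model.cpu()
# Quantize only the Linear layers to int8; this returns a quantized copy
# and leaves `model` itself in float32
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

torch.save(quantized_model.state_dict(), "quantized_model.pt")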

# 6. SAVE THE MODEL
save_dir = "/content/drive/MyDrive/intent_classification/intent_model_final"

trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

print(f"Modèle et tokenizer sauvegardés dans : {save_dir}")

CamembertForSequenceClassification LOAD REPORT from: camembert-base
Key                         | Status     | 
----------------------------+------------+-
lm_head.dense.bias          | UNEXPECTED | 
roberta.pooler.dense.bias   | UNEXPECTED | 
lm_head.dense.weight        | UNEXPECTED | 
roberta.pooler.dense.weight | UNEXPECTED | 
lm_head.layer_norm.weight   | UNEXPECTED | 
lm_head.bias                | UNEXPECTED | 
lm_head.layer_norm.bias     | UNEXPECTED | 
classifier.out_proj.bias    | MISSING    | 
classifier.dense.weight     | MISSING    | 
classifier.out_proj.weight  | MISSING    | 
classifier.dense.bias       | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
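
The deprecation warning above suggests `warmup_steps` in place of `warmup_ratio`. Here is a rough sketch of the conversion, assuming a single device, no gradient accumulation, and this notebook's fallback values (1000 training examples, batch size 16, 5 epochs, ratio 0.1). The helper name is hypothetical, and rounding up is an assumption about how the ratio is turned into a step count.

```python
import math

def warmup_steps_from_ratio(num_examples, batch_size, num_epochs, warmup_ratio):
    """Convert a warmup ratio into an absolute number of optimizer steps."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return math.ceil(total_steps * warmup_ratio)

warmup_steps = warmup_steps_from_ratio(
    num_examples=1000, batch_size=16, num_epochs=5, warmup_ratio=0.1
)
print(warmup_steps)
```

The resulting integer could then be passed as `warmup_steps=...` to `TrainingArguments`. It has to be recomputed whenever the batch size or epoch count changes.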


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.636831 | 0.796 | 0.830565 |
| 2 | No log | 0.396416 | 0.836 | 0.859107 |
| 3 | No log | 0.551868 | 0.806 | 0.837521 |
| 4 | No log | 0.712382 | 0.768 | 0.811688 |
| 5 | No log | 0.794137 | 0.756 | 0.803859 |
| 6 | No log | 0.905071 | 0.74 | 0.793651 |
| 7 | No log | 0.943395 | 0.74 | 0.793651 |
| 8 | 0.148552 | 0.979954 | 0.736 | 0.791139 |
| 9 | 0.148552 | 0.994338 | 0.736 | 0.791139 |
| 10 | 0.148552 | 0.995727 | 0.736 | 0.791139 |
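
With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the trainer restores the checkpoint with the highest evaluation F1 instead of the last one. Applied to the table above, that selection can be sketched as follows. The values are copied from the table, and this is plain Python rather than the Trainer's internals.

```python
# (epoch, validation loss, accuracy, F1), copied from the training log above
history = [
    (1, 0.636831, 0.796, 0.830565),
    (2, 0.396416, 0.836, 0.859107),
    (3, 0.551868, 0.806, 0.837521),
    (4, 0.712382, 0.768, 0.811688),
    (5, 0.794137, 0.756, 0.803859),
    (6, 0.905071, 0.740, 0.793651),
    (7, 0.943395, 0.740, 0.793651),
    (8, 0.979954, 0.736, 0.791139),
    (9, 0.994338, 0.736, 0.791139),
    (10, 0.995727, 0.736, 0.791139),
]

# Pick the epoch with the highest F1 (higher is better for this metric)
best_epoch, best_loss, best_acc, best_f1 = max(history, key=lambda row: row[3])
print(best_epoch, best_f1)
```

Validation loss rises steadily after epoch 2 while F1 falls, which suggests the model overfits beyond that point on this dataset.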


There were missing keys in the checkpoint model loaded: ['roberta.embeddings.LayerNorm.weight', 'roberta.embeddings.LayerNorm.bias', 'roberta.encoder.layer.0.attention.output.LayerNorm.weight', 'roberta.encoder.layer.0.attention.output.LayerNorm.bias', 'roberta.encoder.layer.0.output.LayerNorm.weight', 'roberta.encoder.layer.0.output.LayerNorm.bias', 'roberta.encoder.layer.1.attention.output.LayerNorm.weight', 'roberta.encoder.layer.1.attention.output.LayerNorm.bias', 'roberta.encoder.layer.1.output.LayerNorm.weight', 'roberta.encoder.layer.1.output.LayerNorm.bias', 'roberta.encoder.layer.2.attention.output.LayerNorm.weight', 'roberta.encoder.layer.2.attention.output.LayerNorm.bias', 'roberta.encoder.layer.2.output.LayerNorm.weight', 'roberta.encoder.layer.2.output.LayerNorm.bias', 'roberta.encoder.layer.3.attention.output.LayerNorm.weight', 'roberta.encoder.layer.3.attention.output.LayerNorm.bias', 'roberta.encoder.layer.3.output.LayerNorm.weight', 'roberta.encoder.layer.3.output.Laye

Modèle et tokenizer sauvegardés dans : /content/drive/MyDrive/intent_classification/intent_model_final


## Evaluating the intent classification model

Checks the fine-tuned model on individual sentences with a `predict_intent` helper that returns the predicted intent and its softmax confidence.

In [21]:
def predict_intent(text: str):
    device = next(model.parameters()).device  # device the model currently lives on
    model.eval()  # make sure dropout is disabled at inference time

    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=MAX_LENGTH  # same limit as during training
    )

    # Move the input tensors to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    probs = torch.softmax(logits, dim=1)
    pred_id = torch.argmax(probs, dim=1).item()

    # Same id order as the LabelEncoder above (0 → NOT_TRIP, 1 → TRIP)
    label_map = {0: "NOT_TRIP", 1: "TRIP"}

    return {
        "text": text,
        "intent": label_map[pred_id],
        "confidence": float(probs[0][pred_id])
    }



predict_intent("De Paris à Lyon")
# predict_intent("Merci beaucoup")
# predict_intent("Je voulais juste me renseigner")
# predict_intent("Cotonou vers Parakou demain")


{'text': 'De Paris à Lyon', 'intent': 'TRIP', 'confidence': 0.9058588743209839}
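
The `confidence` value is the softmax probability of the predicted class. Here is a stdlib sketch of that computation on made-up logits. The numbers are illustrative, not model output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the two classes (index 0 = NOT_TRIP, index 1 = TRIP)
logits = [-1.0, 1.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability
pred_id = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[pred_id]
print(pred_id, round(confidence, 4))
```

Because there are only two classes, the confidence can never fall below 0.5, so values close to 0.5 indicate an uncertain prediction.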

In [22]:
predict_intent("Les joueurs vont de barcelone à madrid")

{'text': 'Les joueurs vont de barcelone à madrid',
 'intent': 'TRIP',
 'confidence': 0.8982693552970886}
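
Beyond spot checks, test-set quality can be summarized with a confusion matrix and precision, recall and F1 for the TRIP class. Below is a minimal stdlib sketch, assuming the gold and predicted label ids are already collected in two lists. The values are made up. With the notebook's objects they would likely come from `trainer.predict(tokenized_test)`.

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def precision_recall_f1(matrix, positive=1):
    """Precision, recall and F1 for one class of a confusion matrix."""
    n = len(matrix)
    tp = matrix[positive][positive]
    fp = sum(matrix[r][positive] for r in range(n) if r != positive)
    fn = sum(matrix[positive][c] for c in range(n) if c != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up label ids (0 = NOT_TRIP, 1 = TRIP)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

matrix = confusion_matrix(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(matrix)
print(matrix)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy example the off-diagonal cell `matrix[0][1]` counts NOT_TRIP sentences predicted as TRIP, which is the kind of error the second spot check above probes.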