# 🧠 Emotion Classifier – Fine-tuning DistilBERT

This project fine-tunes the `distilbert-base-uncased` model on the **Emotion** dataset to classify the emotion expressed in a text.

## 🎯 Objectives
- Use a pre-trained model (DistilBERT)
- Fine-tune it on the *Emotion* dataset
- Save the model and test it on a custom text

In [1]:
!pip install transformers datasets torch

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)
from datasets import load_dataset
import torch



## 📥 Step 3: Loading the *Emotion* dataset

We use the `emotion` dataset available on the 🤗 Hugging Face Hub.


In [2]:
print("Loading the Emotion dataset...")
dataset = load_dataset("emotion")

print("\nExample:")
print(dataset["train"][0])
print("\nAvailable labels:", dataset["train"].features["label"].names)

Loading the Emotion dataset...


Example:
{'text': 'i didnt feel humiliated', 'label': 0}

Available labels: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


## ✂️ Step 4: Tokenizing the texts

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

print("\nTokenizing...")
tokenized_datasets = dataset.map(tokenize_function, batched=True)


Tokenizing...
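
With `padding="max_length"`, every example is padded to the tokenizer's maximum length (512 tokens for `distilbert-base-uncased`), even though most sentences in this dataset are short. A common alternative is to pad each batch only to its longest sequence. The sketch below illustrates the difference in plain Python; `pad_batch` is a hypothetical helper and the token ids are illustrative values, not output from this notebook.

```python
MODEL_MAX_LENGTH = 512  # maximum sequence length of distilbert-base-uncased
PAD_ID = 0              # id of the [PAD] token in the BERT uncased vocabulary


def pad_batch(sequences, target_length=None, pad_id=PAD_ID):
    """Right-pad token-id lists to target_length, or to the longest sequence if None."""
    if target_length is None:
        target_length = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (target_length - len(seq)) for seq in sequences]


# Hypothetical token ids for two short sentences (illustrative values only).
batch = [[101, 1045, 2134, 102], [101, 1045, 2514, 3407, 2651, 102]]

fixed = pad_batch(batch, target_length=MODEL_MAX_LENGTH)  # behaves like padding="max_length"
dynamic = pad_batch(batch)                                 # pads to the longest sequence in the batch

print(len(fixed[0]), len(dynamic[0]))  # 512 6
```

In `transformers`, the per-batch variant would typically be obtained by tokenizing with `truncation=True` only and letting a `DataCollatorWithPadding(tokenizer)` do the padding at batch time. The results reported in this notebook were produced with `padding="max_length"`.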

## 🧹 Step 5: Preparing the datasets
We drop the raw `text` column and rename `label` to `labels`, the argument name the model's forward pass expects.

In [4]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]

## 🧠 Step 6: Loading the pre-trained model


In [5]:
num_labels = dataset["train"].features["label"].num_classes
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
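
The model above is loaded with `num_labels` only, so its configuration carries generic class names (`LABEL_0`, `LABEL_1`, ...), which is what the pipeline reports in the last step. One possible way to attach readable names is to build `id2label` / `label2id` dictionaries and pass them when loading the model. The sketch below assumes the label order printed in Step 3; `build_label_maps` is a hypothetical helper, not part of the original notebook.

```python
# Label names in the order printed by the dataset's `label` feature in Step 3.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def build_label_maps(names):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = {i: name for i, name in enumerate(names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(LABEL_NAMES)
print(id2label)

# These dictionaries could then be passed when loading the model, for example:
# AutoModelForSequenceClassification.from_pretrained(
#     "distilbert-base-uncased",
#     num_labels=len(LABEL_NAMES),
#     id2label=id2label,
#     label2id=label2id,
# )
```

The model in this notebook was trained without these mappings, so the outputs shown in the following steps keep the generic names.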


## ⚙️ Step 7: Configuring the training


In [6]:
import os
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir="./models/emotion-distilbert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    report_to="none"  # disable all external logging (wandb, tensorboard)
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument (transformers >= 4.46)
)
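
To get a feel for what this configuration implies, the sketch below works out the number of optimizer steps and the learning rate over time. It assumes a single device, no gradient accumulation, and the `Trainer` default of a linear schedule without warmup; `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
import math

NUM_TRAIN_EXAMPLES = 16000  # size of the train split
BATCH_SIZE = 16             # per_device_train_batch_size
NUM_EPOCHS = 2              # num_train_epochs
BASE_LR = 2e-5              # learning_rate

steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS


def linear_lr(step, total=total_steps, base_lr=BASE_LR, warmup=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


print(total_steps)  # 2000
print(linear_lr(0), linear_lr(1000), linear_lr(total_steps))
```

Under these assumptions the run takes 2000 steps, and the learning rate falls from `2e-5` at the start to half that value at the end of the first epoch and to zero at the end of training.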


## 🏋️ Step 8: Fine-tuning the model

In [7]:
print("\nStarting fine-tuning...")
trainer.train()


Starting fine-tuning...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2121        | 0.203702        |
| 2     | 0.1288        | 0.170425        |


TrainOutput(global_step=2000, training_loss=0.31605774092674255, metrics={'train_runtime': 1588.1602, 'train_samples_per_second': 20.149, 'train_steps_per_second': 1.259, 'total_flos': 4239259140096000.0, 'train_loss': 0.31605774092674255, 'epoch': 2.0})

## 📈 Step 9: Evaluation on the test set

In [8]:
print("\nEvaluating on the test set...")
metrics = trainer.evaluate()
print(metrics)


Evaluating on the test set...


{'eval_loss': 0.17042547464370728, 'eval_runtime': 33.0043, 'eval_samples_per_second': 60.598, 'eval_steps_per_second': 3.787, 'epoch': 2.0}
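
The metrics above contain only the loss, because no metric function was given to the `Trainer`. A minimal sketch of an accuracy metric is shown below. It assumes the `(logits, labels)` pair that the `Trainer` passes to its `compute_metrics` hook, and it is written in plain Python so it runs without extra dependencies; with NumPy one would usually write `np.argmax(logits, axis=-1)` instead. The toy logits are hypothetical values used only to exercise the function.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy check with hypothetical logits for three examples and two classes.
toy_logits = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.4]]
toy_labels = [1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`. No accuracy was measured in the run reported here.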


## 💾 Step 10: Saving the fine-tuned model

In [9]:
print("\nSaving the model...")
trainer.save_model("./models/emotion-distilbert")
tokenizer.save_pretrained("./models/emotion-distilbert")


Saving the model...


('./models/emotion-distilbert/tokenizer_config.json',
 './models/emotion-distilbert/special_tokens_map.json',
 './models/emotion-distilbert/vocab.txt',
 './models/emotion-distilbert/added_tokens.json',
 './models/emotion-distilbert/tokenizer.json')

## 💬 Step 11: Testing the model on a custom text

In [10]:
emotion_classifier = pipeline(
    "text-classification",
    model="./models/emotion-distilbert",
    tokenizer="./models/emotion-distilbert"
)

test_text = "I'm so happy and excited today!"
print("\nTest text:", test_text)
result = emotion_classifier(test_text)
print("Result:", result)


Device set to use cuda:0



Test text: I'm so happy and excited today!
Result: [{'label': 'LABEL_1', 'score': 0.99699866771698}]
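
The pipeline returns `LABEL_1` rather than an emotion name because the saved configuration has no label names. Given the label order printed in Step 3, index 1 corresponds to `joy`. The sketch below decodes a pipeline result after the fact; `decode_prediction` is a hypothetical helper, and it assumes the generic `LABEL_<id>` naming seen in the output above.

```python
# Label names in the order printed by the dataset's `label` feature in Step 3.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def decode_prediction(prediction, names=LABEL_NAMES):
    """Replace a generic 'LABEL_<id>' name with the matching emotion name."""
    class_id = int(prediction["label"].rsplit("_", 1)[1])
    return {"label": names[class_id], "score": prediction["score"]}


# The result printed by the pipeline above.
result = [{"label": "LABEL_1", "score": 0.99699866771698}]
print([decode_prediction(p) for p in result])
# [{'label': 'joy', 'score': 0.99699866771698}]
```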
