## Ma512 final project year 2025-2026

### Project Guideline
Use a pre-trained language model and fine-tune it on df_train to predict the sentiment of a text.

For this project, we chose **distilbert-base-uncased**, a small model that is well suited to the task.

### Data Preprocessing and mapping

This part prepares the text datasets for training a sentiment classification model with DistilBERT. It first loads the training and validation datasets from semicolon-separated CSV files. It then maps the six sentiment labels ("sadness", "joy", "love", "anger", "fear", "surprise") to the integers 0 to 5. The DistilBERT tokenizer converts each text into token IDs, padding or truncating it to a fixed length. Finally, the pandas DataFrames are converted into Hugging Face Dataset objects, the tokenizer is applied to both the training and validation sets, and the format is set to PyTorch tensors containing the input IDs, attention masks, and labels.
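
To make the padding and truncation step concrete, here is a toy sketch in plain Python. It is not the DistilBERT tokenizer: `pad_and_truncate`, the pad ID of 0, and the example token IDs are assumptions made for illustration. It only shows how every sequence ends up at the same length and how the attention mask marks real tokens with 1 and padding with 0.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration: force a list of token IDs to exactly max_length."""
    # Truncation: drop anything beyond max_length
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    # Attention mask: 1 for real tokens, 0 for the padding added below
    attention_mask = [1] * len(clipped) + [0] * n_pad
    # Padding: fill the remainder with pad_id
    input_ids = clipped + [pad_id] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs, one shorter and one longer than max_length
short = pad_and_truncate([101, 1045, 2514, 102], max_length=6)
long = pad_and_truncate([101, 1045, 2514, 2200, 9479, 2651, 102], max_length=6)
print(short)
print(long)
```

A real tokenizer also keeps the special end-of-sequence token when it truncates, which this sketch does not attempt to reproduce.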

In [1]:
import pandas as pd
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

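# Load the datasets (semicolon-separated)
# Note: read_csv treats the first line as a header by default; if the files
# have no header row, pass header=None so the first example is not dropped
df_train = pd.read_csv("train.csv", sep=';')
df_val = pd.read_csv("val.csv", sep=';')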

# Define the labels
labels = {"sadness": 0, "joy": 1, "love": 2, "anger": 3, "fear": 4, "surprise": 5}

df_train.columns = ['text', 'label']
df_val.columns = ['text', 'label']

# Convert labels to numbers
df_train['label'] = df_train['label'].map(labels)
df_val['label'] = df_val['label'].map(labels)

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the texts
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# Create Hugging Face datasets
train_dataset = Dataset.from_pandas(df_train[['text', 'label']])
val_dataset = Dataset.from_pandas(df_val[['text', 'label']])

# Apply the tokenizer
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Prepare the datasets for training
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

Map: 100%|██████████| 15999/15999 [00:09<00:00, 1705.87 examples/s]
Map: 100%|██████████| 1999/1999 [00:01<00:00, 1555.02 examples/s]


### Model Training

This part fine-tunes DistilBERT for sequence classification on the prepared datasets. It first loads the pre-trained model with a classification head whose number of output labels matches the sentiment labels defined earlier. It then sets up the training parameters with the TrainingArguments class, including the number of epochs, the batch sizes, the warmup steps, the weight decay, the logging frequency, and evaluation and checkpointing after each epoch. A Trainer object is then created from the model, the training arguments, and the training and validation datasets. Finally, the model is trained with the trainer.train() method. Checkpoints are written to **output_dir** during training, and the fine-tuned model and tokenizer are saved to **./emotion_distilbert** for later use.

The model used in this project was trained with this script, run in a Linux environment for better performance.

In [None]:
# Load the DistilBERT model for classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(labels))

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',         # directory where results are saved
    num_train_epochs=3,             # number of training epochs
    per_device_train_batch_size=32, # training batch size
    per_device_eval_batch_size=32,  # validation batch size
    warmup_steps=100,               # number of warmup steps
    weight_decay=0.01,              # weight decay for the optimizer
    logging_dir='./logs',           # directory for the logs
    logging_steps=10,               # logging frequency
    logging_first_step=True,        # log the first training step
    save_strategy="epoch",          # save the model after each epoch
    load_best_model_at_end=True,    # load the best model at the end
    eval_strategy="epoch",          # evaluate after each epoch
    disable_tqdm=False,             # make sure the progress bar is enabled
)

# Create the Trainer
trainer = Trainer(
    model=model,                         # model to train
    args=training_args,                  # training parameters
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # validation dataset
)

# Train the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained('./emotion_distilbert')
tokenizer.save_pretrained('./emotion_distilbert')
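
As a rough check on the training budget, the arithmetic below estimates how many optimizer steps this configuration implies. It assumes a single device, no gradient accumulation, a final partial batch that is kept rather than dropped, and the 15999 training examples reported by the tokenization progress bar. The variable names are illustrative only.

```python
import math

num_examples = 15999   # training rows reported by the Map progress bar
batch_size = 32        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
warmup_steps = 100     # warmup_steps

# The last, partial batch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Share of the run spent ramping the learning rate up
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
```

Under these assumptions the warmup covers only a small share of the run, so most steps are taken after the learning rate has finished ramping up.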


### Model Testing

This part loads the fine-tuned DistilBERT model and tokenizer to perform emotion prediction on new input sentences. It first loads the model and tokenizer from the saved directory. The emotion labels are then defined to match the training labels (e.g., "sadness", "joy", etc.). A predict_emotion function is created, which tokenizes the input text, performs a prediction using the model, and returns the predicted emotion by selecting the label with the highest probability. Finally, the function is tested on a series of example sentences, predicting the emotion for each and printing the results.
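
One way to keep the label order consistent between training and inference is to derive the list from the training dictionary instead of retyping it. The sketch below is a suggestion rather than what the notebook does; `label2id`, `id2label`, and `label_names` are names chosen here for illustration.

```python
# Same mapping as in the preprocessing cell
label2id = {"sadness": 0, "joy": 1, "love": 2, "anger": 3, "fear": 4, "surprise": 5}

# Invert it so that a predicted class ID maps back to its name
id2label = {idx: name for name, idx in label2id.items()}

# Ordered list, equivalent to the hand-written list in the testing cell
label_names = [id2label[i] for i in range(len(id2label))]
print(label_names)
```

Transformers model configurations also accept `id2label` and `label2id`, so the mapping could be stored with the saved model. That is an optional extension, not something this project relies on.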

In [4]:
# Load the fine-tuned tokenizer and model
model_path = './emotion_distilbert'
tokenizer = DistilBertTokenizer.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)

# List of possible emotions (must match the order of the labels defined during training)
labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Prediction function
def predict_emotion(text):
    # Tokenize the sentence
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    
    # Prediction
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        
    # Find the index of the class with the highest probability
    predicted_class_id = torch.argmax(logits, dim=-1).item()
    
    # Return the corresponding emotion
    return labels[predicted_class_id]

# Test on sentences of your choice
test_sentences = [
    "I feel like everything is going well", # Easy example for "joy"
    "I feel really lonely today.",          # Easy example for "sadness"
    "I love you so much.",                  # Easy example for "love"
    "I'm furious for not being listened to, it's like my opinions don't matter.",       # Complex example for "anger"
    "Every strange sound in the night makes me jump, I can't help but worry.",          # Complex example for "fear"
    "Suddenly, he did something that completely changed the game, it was astonishing.", # Complex example for "surprise"
    "It's like a heavy weight is on my shoulders, and I can't bear it anymore.",        # Complex example for "sadness"
    "I'm both excited and nervous for this new beginning, it's a whirlwind of emotions" # Complex example of mixed feelings
]

# Predict the emotion for each sentence
for sentence in test_sentences:
    emotion = predict_emotion(sentence)
    print(f"Sentence: {sentence}\nPredicted emotion: {emotion}\n")

Sentence: I feel like everything is going well
Predicted emotion: joy

Sentence: I feel really lonely today.
Predicted emotion: sadness

Sentence: I love you so much.
Predicted emotion: love

Sentence: I'm furious for not being listened to, it's like my opinions don't matter.
Predicted emotion: anger

Sentence: Every strange sound in the night makes me jump, I can't help but worry.
Predicted emotion: fear

Sentence: Suddenly, he did something that completely changed the game, it was astonishing.
Predicted emotion: surprise

Sentence: It's like a heavy weight is on my shoulders, and I can't bear it anymore.
Predicted emotion: anger

Sentence: I'm both excited and nervous for this new beginning, it's a whirlwind of emotions
Predicted emotion: fear
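
The last two predictions (anger for a sentence written as sadness, and a single label for a sentence with mixed feelings) suggest looking at the full probability distribution rather than only the top label. The sketch below applies a softmax to a hypothetical logits vector in plain Python. The numbers are invented for illustration and are not model output, and `softmax` and `rank_emotions` are helper names chosen here. Inside the notebook, `torch.softmax(logits, dim=-1)` would compute the same probabilities from the real logits.

```python
import math

labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_emotions(logits, top_k=3):
    """Return the top_k (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical logits for one sentence, NOT real model output
example_logits = [0.2, 1.9, -0.5, 0.1, 2.3, 0.4]
for name, prob in rank_emotions(example_logits):
    print(f"{name}: {prob:.2f}")
```

Because softmax preserves the ordering of the logits, the top-ranked label is the same one that the argmax in predict_emotion returns. The extra information is how close the runner-up labels are.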

