# Capstone – Emotionality of Tweets: DistilBERT Fine-Tuning


Notebook advisories

This notebook was developed with conceptual and implementation influence from Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, and the Kaggle notebook “Twitter Emotion Classification” by Andrey Shtrass.

AI Usage: AI assistance was used to help debug code, refine explanations, and assist comprehension of reinforcement learning concepts.

In [1]:
from pathlib import Path
from typing import List, Dict, Tuple
import sys

import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

from datasets import Dataset as HFDataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from src.data_utils import load_and_prepare_emotion_splits  # noqa: E402

PROJECT_ROOT



PosixPath('/Users/maine/School/Machine Learning 2/Final_Project_Code')

## 1. Configuration

In [2]:
# Configuration
DATA_DIR = PROJECT_ROOT / "data" / "primary_emotions"
print("Data directory:", DATA_DIR)

MODEL_NAME = "distilbert-base-uncased"

EMOTION_ID2NAME: Dict[int, str] = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise",
}

MAX_SEQ_LEN = 64
BATCH_SIZE = 16          # per device
NUM_EPOCHS = 3
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 0.01
RANDOM_STATE = 42

Data directory: /Users/maine/School/Machine Learning 2/Final_Project_Code/data/primary_emotions


## 2. Evaluation Helper

In [3]:
# Evaluation helper (same style as TF-IDF baseline)

def evaluate_predictions(
        y_true: List[int],
        y_pred: List[int],
        label_ids: List[int],
        label_names: List[str],
        split_name: str,
) -> None:
    """
    Print detailed classification metrics for a given split.
    """
    acc = accuracy_score(y_true, y_pred)
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        y_true,
        y_pred,
        average="macro",
        zero_division=0,
    )

    print(f"\n=== {split_name} Metrics ===")
    print(f"Accuracy:  {acc:.4f}")
    print(f"Macro P:   {precision_macro:.4f}")
    print(f"Macro R:   {recall_macro:.4f}")
    print(f"Macro F1:  {f1_macro:.4f}")

    print("\nPer-class report:")
    print(
        classification_report(
            y_true,
            y_pred,
            labels=label_ids,
            target_names=label_names,
            zero_division=0,
        )
    )

    cm = confusion_matrix(y_true, y_pred, labels=label_ids)
    print("Confusion matrix (rows = true, cols = pred):")
    print(cm)
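
The row/column convention printed by `evaluate_predictions` can be checked by hand on a tiny example. The following is a minimal pure-Python sketch of how such a matrix is counted; `build_confusion_matrix` and the toy labels are hypothetical and only illustrate the layout, while the notebook itself relies on scikit-learn's `confusion_matrix`.

```python
from typing import List


def build_confusion_matrix(
    y_true: List[int], y_pred: List[int], label_ids: List[int]
) -> List[List[int]]:
    """Count (true, predicted) pairs; rows index the true label, columns the prediction."""
    index = {label: pos for pos, label in enumerate(label_ids)}
    matrix = [[0] * len(label_ids) for _ in label_ids]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix


# Toy labels (not from the dataset): three classes, one true "1" predicted as "2".
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 2, 2]
toy_cm = build_confusion_matrix(toy_true, toy_pred, label_ids=[0, 1, 2])
for row in toy_cm:
    print(row)
```

The single off-diagonal count lands in row 1, column 2, which is how the larger matrices in the run output should be read: the row is the true emotion and the column is the predicted one.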

## 3. Building Hugging Face Datasets and Tokenizer

In [4]:
# ============================================================
# 3. Build HF Datasets and tokenizer
# ============================================================

def build_hf_datasets(
        max_length: int = MAX_SEQ_LEN,
) -> Tuple[HFDataset, HFDataset, HFDataset, AutoTokenizer]:
    """
    Build Hugging Face Datasets and tokenizer for DistilBERT fine-tuning.
    Returns the train/val/test datasets and the tokenizer.
    """
    # Load splits from shared helper
    splits = load_and_prepare_emotion_splits(DATA_DIR, normalize=True)
    X_train, y_train = splits.train_texts, splits.train_labels
    X_val, y_val = splits.val_texts, splits.val_labels
    X_test, y_test = splits.test_texts, splits.test_labels

    # Build raw HF datasets
    train_dict = {"text": X_train, "label": y_train}
    val_dict = {"text": X_val, "label": y_val}
    test_dict = {"text": X_test, "label": y_test}

    hf_train = HFDataset.from_dict(train_dict)
    hf_val = HFDataset.from_dict(val_dict)
    hf_test = HFDataset.from_dict(test_dict)

    # Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize_batch(batch):
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )

    # Map tokenizer over datasets
    hf_train = hf_train.map(tokenize_batch, batched=True)
    hf_val = hf_val.map(tokenize_batch, batched=True)
    hf_test = hf_test.map(tokenize_batch, batched=True)

    # Rename label -> labels for transformers
    hf_train = hf_train.rename_column("label", "labels")
    hf_val = hf_val.rename_column("label", "labels")
    hf_test = hf_test.rename_column("label", "labels")

    # Set format for PyTorch
    hf_train.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "labels"],
    )
    hf_val.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "labels"],
    )
    hf_test.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "labels"],
    )

    return hf_train, hf_val, hf_test, tokenizer
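
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each example, here is a simplified sketch on made-up token ids. `pad_or_truncate` is a hypothetical helper, and the pad id of 0 is an assumption about the `distilbert-base-uncased` vocabulary. The real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: id of [PAD] in the distilbert-base-uncased vocabulary


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Cut to max_length, then right-pad; the mask is 1 for real tokens, 0 for padding."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up ids, not real vocabulary entries.
short_example = pad_or_truncate([5, 6, 7], max_length=5)
long_example = pad_or_truncate([5, 6, 7, 8, 9, 10, 11], max_length=5)
print(short_example)
print(long_example)
```

Every example ends up with exactly `max_length` positions, so batches can be stacked into fixed-shape tensors. The cost is that short tweets carry padding the model must mask out.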

## 4. Trainer Metrics

In [5]:
# Metrics for Trainer

def compute_metrics(eval_pred):
    """
    Compute scalar metrics for HF Trainer from (logits, labels).
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    acc = accuracy_score(labels, preds)
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        labels,
        preds,
        average="macro",
        zero_division=0,
    )

    return {
        "accuracy": acc,
        "macro_precision": precision_macro,
        "macro_recall": recall_macro,
        "macro_f1": f1_macro,
    }
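
To make the `(logits, labels)` contract concrete, the sketch below walks through the argmax step on toy values in pure Python. `argmax_per_row` is a hypothetical stand-in for `np.argmax(logits, axis=-1)`, and the logits are invented for illustration rather than taken from the model.

```python
from typing import List


def argmax_per_row(logits: List[List[float]]) -> List[int]:
    """Return the index of the largest logit in each row (ties go to the first index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Three toy examples with six class scores each (one column per emotion id).
toy_logits = [
    [2.1, 0.3, -1.0, 0.0, 0.2, -0.5],  # largest score at index 0
    [0.1, 3.2, 1.5, -0.2, 0.0, 0.4],   # largest score at index 1
    [-0.3, 0.2, 0.1, 0.0, 1.9, 1.7],   # largest score at index 4
]
toy_labels = [0, 1, 5]

toy_preds = argmax_per_row(toy_logits)
toy_accuracy = sum(p == t for p, t in zip(toy_preds, toy_labels)) / len(toy_labels)
print(toy_preds, round(toy_accuracy, 4))
```

The third example is a near miss: the true class (index 5) has the second-highest score, but only the top-scoring class counts toward accuracy and the macro metrics.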

## 5. Fine-Tuning DistilBERT

Steps:

1. Build tokenized HF datasets and tokenizer.
2. Define label mappings (`id2label`, `label2id`).
3. Load `AutoModelForSequenceClassification` with:
   - `MODEL_NAME = "distilbert-base-uncased"`
   - `num_labels = 6`
   - `id2label`, `label2id`
4. Configure `TrainingArguments`:
   - Learning rate
   - Batch sizes
   - Number of epochs
   - Weight decay
   - Random seed
5. Instantiate `Trainer` with:
   - Model
   - TrainingArguments
   - Train and eval datasets
   - Tokenizer
   - `compute_metrics`
6. Call `trainer.train()` to fine-tune the model.
7. Evaluate with:
   - `trainer.evaluate()`
   - Custom detailed evaluation using `evaluate_predictions`.

In [6]:
# fine-tune DistilBERT and evaluate

def main() -> None:
    print("Data directory:", DATA_DIR)
    print("Building HF datasets and tokenizer...")

    train_ds, val_ds, test_ds, tokenizer = build_hf_datasets(max_length=MAX_SEQ_LEN)

    label_ids = sorted(EMOTION_ID2NAME.keys())
    label_names = [EMOTION_ID2NAME[i] for i in label_ids]

    id2label = {i: EMOTION_ID2NAME[i] for i in label_ids}
    label2id = {name: i for i, name in EMOTION_ID2NAME.items()}

    print("Using emotion mapping:")
    print(id2label)

    # Load pretrained model with correct label mapping
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(label_ids),
        id2label=id2label,
        label2id=label2id,
    )

    # Training arguments (compatible with older transformers)
    output_dir = PROJECT_ROOT / "models" / "distilbert_emotion"
    training_args = TrainingArguments(
        output_dir=str(output_dir),
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        weight_decay=WEIGHT_DECAY,
        logging_steps=100,
        seed=RANDOM_STATE,
        # NOTE: no evaluation_strategy/save_strategy/load_best_model_at_end/etc.
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    # Train
    print("Starting DistilBERT fine-tuning...")
    trainer.train()

    # HF evaluation on validation set (scalar metrics)
    print("\nHF Trainer evaluation on validation set:")
    print(trainer.evaluate(eval_dataset=val_ds))

    # Custom, detailed evaluation on val and test
    def collect_predictions(dataset: HFDataset) -> Tuple[List[int], List[int]]:
        preds_output = trainer.predict(dataset)
        logits = preds_output.predictions
        labels = preds_output.label_ids
        preds = np.argmax(logits, axis=-1)
        return labels.tolist(), preds.tolist()

    print("\nDetailed evaluation on validation set:")
    y_val_true, y_val_pred = collect_predictions(val_ds)
    evaluate_predictions(y_val_true, y_val_pred, label_ids, label_names, "Validation")

    print("\nDetailed evaluation on test set:")
    y_test_true, y_test_pred = collect_predictions(test_ds)
    evaluate_predictions(y_test_true, y_test_pred, label_ids, label_names, "Test")

## 6. Run Fine-Tuning

In [7]:
main()

Data directory: /Users/maine/School/Machine Learning 2/Final_Project_Code/data/primary_emotions
Building HF datasets and tokenizer...


Map: 100%|██████████| 16000/16000 [00:00<00:00, 55971.82 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 76911.72 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 76839.86 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using emotion mapping:
{0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}
Starting DistilBERT fine-tuning...




| Step | Training Loss |
|------|---------------|
| 100  | 1.4725 |
| 200  | 0.9272 |
| 300  | 0.5392 |
| 400  | 0.4133 |
| 500  | 0.3251 |
| 600  | 0.274 |
| 700  | 0.2556 |
| 800  | 0.2314 |
| 900  | 0.2432 |
| 1000 | 0.2175 |





HF Trainer evaluation on validation set:




{'eval_loss': 0.1469372808933258, 'eval_accuracy': 0.9415, 'eval_macro_precision': 0.9181008932111295, 'eval_macro_recall': 0.9182592317283181, 'eval_macro_f1': 0.918050752442631, 'eval_runtime': 3.54, 'eval_samples_per_second': 564.973, 'eval_steps_per_second': 35.311, 'epoch': 3.0}

Detailed evaluation on validation set:





=== Validation Metrics ===
Accuracy:  0.9415
Macro P:   0.9181
Macro R:   0.9183
Macro F1:  0.9181

Per-class report:
              precision    recall  f1-score   support

     sadness       0.96      0.97      0.96       550
         joy       0.97      0.95      0.96       704
        love       0.87      0.90      0.88       178
       anger       0.95      0.94      0.94       275
        fear       0.89      0.91      0.90       212
    surprise       0.87      0.84      0.86        81

    accuracy                           0.94      2000
   macro avg       0.92      0.92      0.92      2000
weighted avg       0.94      0.94      0.94      2000

Confusion matrix (rows = true, cols = pred):
[[534   1   0   6   9   0]
 [  3 670  23   2   2   4]
 [  3  15 160   0   0   0]
 [ 10   2   0 258   5   0]
 [  7   0   0   6 193   6]
 [  1   4   1   0   7  68]]

Detailed evaluation on test set:





=== Test Metrics ===
Accuracy:  0.9210
Macro P:   0.8725
Macro R:   0.8690
Macro F1:  0.8707

Per-class report:
              precision    recall  f1-score   support

     sadness       0.95      0.97      0.96       581
         joy       0.95      0.94      0.95       695
        love       0.83      0.84      0.83       159
       anger       0.92      0.91      0.91       275
        fear       0.87      0.88      0.87       224
    surprise       0.71      0.68      0.70        66

    accuracy                           0.92      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.92      0.92      0.92      2000

Confusion matrix (rows = true, cols = pred):
[[564   3   0   8   6   0]
 [  5 654  27   4   0   5]
 [  0  24 133   2   0   0]
 [ 13   3   0 250   9   0]
 [  6   0   0   9 196  13]
 [  3   3   0   0  15  45]]


## 7. Conclusion and Discussion

This notebook fine-tuned DistilBERT for 6-way emotion classification on tweets, using the same train/validation/test splits and label mapping as the TF–IDF + Logistic Regression and BiLSTM baselines. The model started from the pretrained `distilbert-base-uncased` encoder, added a small 6-class classification head, and was trained with the Hugging Face `Trainer` on tweets truncated/padded to 64 tokens.

On the held-out test set, DistilBERT reached about 92% accuracy and 0.87 macro-F1, clearly outperforming both the classical baseline (~0.73 macro-F1) and the BiLSTM (~0.80 macro-F1). The largest gains appeared in the harder, lower-frequency emotions such as love, fear, and especially surprise, where the transformer’s contextual representations helped close much of the gap. Common emotions like sadness and joy were already strong for all models, but DistilBERT still pushed their F1 scores into the mid-90s. The confusion matrix shows that the model is still far from perfect: it continues to mix related emotions (for example, joy with love and fear with surprise), and it only sees single tweets in isolation, without conversation history, user context, or any explicit modeling of sarcasm or irony. That limits how deep its understanding of emotional tone can be.
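
The headline test numbers can be recomputed directly from the test confusion matrix printed in Section 6. The sketch below copies that matrix and derives per-class F1, accuracy, and macro-F1 in pure Python; `per_class_f1` is a hypothetical helper, not part of the notebook. It should agree with the reported accuracy of 0.9210 and macro-F1 of 0.8707.

```python
from typing import List

# Test-set confusion matrix copied from the run output (rows = true, cols = pred).
TEST_CM = [
    [564, 3, 0, 8, 6, 0],
    [5, 654, 27, 4, 0, 5],
    [0, 24, 133, 2, 0, 0],
    [13, 3, 0, 250, 9, 0],
    [6, 0, 0, 9, 196, 13],
    [3, 3, 0, 0, 15, 45],
]
NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def per_class_f1(cm: List[List[int]]) -> List[float]:
    """F1 per class as 2*TP / (row sum + column sum), i.e. 2TP / (2TP + FN + FP)."""
    n = len(cm)
    scores = []
    for k in range(n):
        true_pos = cm[k][k]
        support = sum(cm[k])                          # TP + FN
        predicted = sum(cm[r][k] for r in range(n))   # TP + FP
        denom = support + predicted
        scores.append(2 * true_pos / denom if denom else 0.0)
    return scores


f1_scores = per_class_f1(TEST_CM)
accuracy = sum(TEST_CM[k][k] for k in range(len(TEST_CM))) / sum(map(sum, TEST_CM))
macro_f1 = sum(f1_scores) / len(f1_scores)

for name, score in zip(NAMES, f1_scores):
    print(f"{name:>8}: {score:.2f}")
print(f"accuracy: {accuracy:.4f}  macro-F1: {macro_f1:.4f}")
```

Because macro-F1 is an unweighted mean over the six classes, the 66-example surprise class pulls it down as much as the 695-example joy class could. That is why macro-F1 sits several points below accuracy here.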

These results are consistent with expectations for pretrained transformers on short-text classification: DistilBERT’s subword tokenization and deep contextual embeddings allow it to pick up more nuance in phrasing, though it is more computationally expensive than the baselines.

In the broader project, this DistilBERT model serves as the strongest baseline for reading emotional tone in tweets and provides a realistic picture of what current transformer-based text classifiers can and cannot do in this setting.