# Forecasting Sentiment for World Cup 2034 in Saudi Arabia Using a DistilBERT Model


- I evaluated the fine-tuned DistilBERT model using the `Trainer.evaluate()` method to get the final performance metrics.
- I printed both the final **accuracy** and **macro F1 score** to assess how well the model performs across all sentiment classes.
- I then used `Trainer.predict()` to generate predictions on the test dataset.
- I converted the raw logits into class labels using `argmax`.
- Finally, I printed a detailed classification report that shows the **precision**, **recall**, and **F1 score** for each sentiment class: **Negative**, **Neutral**, and **Positive**.
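
To make the logits-to-label step concrete, here is a minimal sketch on made-up values. The logits and the `id2label` dictionary are illustrative assumptions; only the class order (Negative, Neutral, Positive) comes from this notebook.

```python
import numpy as np

# Hypothetical logits for three comments, one column per class
logits = np.array([
    [ 2.1, 0.3, -1.0],
    [-0.5, 1.2,  0.4],
    [-1.3, 0.1,  2.7],
])

# Class order used in the report: 0 = Negative, 1 = Neutral, 2 = Positive
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}

# argmax over the class axis picks the highest-scoring class for each row
pred_ids = logits.argmax(axis=1)
pred_labels = [id2label[int(i)] for i in pred_ids]
print(pred_labels)
```

No softmax is needed before `argmax`: softmax preserves the ordering of the logits, so the predicted class is the same either way.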


In [None]:
# STEP 1: SETUP
# Enable GPU in Colab: Runtime > Change runtime type > GPU

# STEP 2: INSTALL DEPENDENCIES
!pip install -U transformers datasets accelerate scikit-learn

# STEP 3: IMPORT LIBRARIES
import os
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments
)

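# Disable Weights & Biases logging
# (this environment variable is deprecated in recent transformers releases;
#  report_to="none" in TrainingArguments is the supported alternative)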
os.environ["WANDB_DISABLED"] = "true"

# STEP 4: CHECK GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Running on:", "GPU" if device == "cuda" else "CPU")


In [None]:
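# STEP 5: LOAD DATA
# Option A: upload the CSV from your machine (it is saved to the current working directory)
from google.colab import files
uploaded = files.upload()

# Option B: read the CSV from Google Drive. This requires mounting Drive first:
#   from google.colab import drive; drive.mount("/content/drive")
# Replace the path if your file lives elsewhere
df = pd.read_csv("/content/drive/MyDrive/Project/Final_Thesis_Merged.csv")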

# STEP 6: CLEAN & PREPARE DATA
df = df.dropna(subset=["Rewritten Comment", "Sentiment"])
df["Sentiment"] = df["Sentiment"].astype(str).str.extract(r'(-?1|0)').astype(float)
df = df.dropna(subset=["Sentiment"])

# Map sentiment scores (-1, 0, 1) to class ids (0 = Negative, 1 = Neutral, 2 = Positive)
label_map = {-1: 0, 0: 1, 1: 2}
df["label"] = df["Sentiment"].map(label_map).astype(int)

# Split into train/test
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["Rewritten Comment"].tolist(),
    df["label"].tolist(),
    test_size=0.2,
    stratify=df["label"],
    random_state=42
)
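
The split above produces only a training set and a test set, and the `Trainer` further down evaluates on that test set after every epoch. A common alternative is to hold out a separate validation set for per-epoch evaluation and checkpoint selection. The sketch below shows one way to do that on toy data. The `*_demo` names, the toy texts, and the 70/10/20 proportions are assumptions for illustration, not part of the original pipeline.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real comments: 100 texts with a 60/30/10 class mix
texts = [f"comment {i}" for i in range(100)]
labels = [0] * 60 + [1] * 30 + [2] * 10

# First carve off the held-out test set (20% of all rows)
rest_texts, test_texts_demo, rest_labels, test_labels_demo = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Then split the remainder into train and validation.
# 0.125 of the remaining 80% equals 10% of the full data.
train_texts_demo, val_texts_demo, train_labels_demo, val_labels_demo = train_test_split(
    rest_texts, rest_labels, test_size=0.125, stratify=rest_labels, random_state=42
)

print(len(train_texts_demo), len(val_texts_demo), len(test_texts_demo))
```

With a split like this, the validation set would go to the `Trainer` as `eval_dataset`, and the test set would be used only once, in the final `predict()` call.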

In [None]:
# STEP 7: TOKENIZATION
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=256)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=256)

# STEP 8: DATASET WRAPPER
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = SentimentDataset(train_encodings, train_labels)
test_dataset = SentimentDataset(test_encodings, test_labels)


In [None]:
# STEP 9: LOAD MODEL
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
model.to(device)

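# STEP 10: METRICS
def compute_metrics(eval_pred):
    """Return accuracy and macro F1 for the Trainer's evaluation loop."""
    # NumPy arrays: logits has shape (n_samples, 3), labels has shape (n_samples,)
    logits, labels = eval_pred
    # Highest-scoring class per sample
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro")
    }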

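# STEP 11: TRAINING ARGUMENTS
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",             # evaluate at the end of every epoch
    save_strategy="epoch",             # must match eval_strategy for load_best_model_at_end
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,       # restore the best checkpoint after training
    metric_for_best_model="accuracy",
    report_to="none",                  # turn off W&B and other logging integrations
)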

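# STEP 12: TRAINER
# Note: the test split doubles as the evaluation set here, so the
# best-checkpoint selection in STEP 11 is made on the test data.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)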

# STEP 13: TRAIN
trainer.train()

# STEP 14: EVALUATE
results = trainer.evaluate()
print(f"\nFinal Accuracy: {results['eval_accuracy']:.4f}")
print(f"Final Macro F1 Score: {results['eval_f1']:.4f}")

# STEP 15: CLASSIFICATION REPORT
preds = trainer.predict(test_dataset).predictions
y_pred = torch.argmax(torch.tensor(preds), dim=1)
print("\nDetailed Report:\n")
print(classification_report(test_labels, y_pred.tolist(), target_names=["Negative", "Neutral", "Positive"]))


Saving Final_Thesis_Merged.csv to Final_Thesis_Merged.csv


  df = pd.read_csv("/content/drive/MyDrive/Project/Final_Thesis_Merged.csv")
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging

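| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.5370        | 0.514364        | 0.772762 | 0.744113 |
| 2     | 0.3404        | 0.514447        | 0.794947 | 0.760929 |
| 3     | 0.1641        | 0.813234        | 0.790398 | 0.764872 |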



Final Accuracy: 0.7949
Final Macro F1 Score: 0.7609

Detailed Report:

              precision    recall  f1-score   support

    Negative       0.83      0.89      0.86      8317
     Neutral       0.72      0.64      0.68      4219
    Positive       0.79      0.71      0.75      1753

    accuracy                           0.79     14289
   macro avg       0.78      0.75      0.76     14289
weighted avg       0.79      0.79      0.79     14289



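After three epochs of fine-tuning, my DistilBERT sentiment classifier reached its best accuracy in the second epoch. Because `load_best_model_at_end=True` restores that checkpoint, the final evaluation reports an accuracy of 79.49% and a macro F1 score of 0.7609. In the third epoch the training loss kept falling (0.34 to 0.16) while the validation loss rose from 0.51 to 0.81, which suggests the model was starting to overfit.

The classification report shows that the model performed best on negative comments, with an F1 score of 0.86 (precision 0.83, recall 0.89). Neutral comments were the hardest to classify: their F1 score of 0.68 is pulled down by low recall (0.64). Positive comments fell in between, with an F1 score of 0.75. Overall, the macro-average F1 of 0.76 indicates reasonably even performance across classes (per-class F1 between 0.68 and 0.86) despite the class imbalance in the test set: 8,317 negative, 4,219 neutral, and only 1,753 positive comments. The predominance of negative comments aligns with expectations for a sensitive topic like Saudi Arabia hosting the 2034 World Cup. Since the same test split was also used to select the best checkpoint, these figures are probably slightly optimistic.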
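
As a quick sanity check on the averages, the sketch below recomputes the macro and weighted F1 from the per-class values in the report, along with the accuracy of a trivial majority-class predictor. It uses the rounded two-decimal F1 values, so the results agree with the report only to about two decimals. The `report` dictionary and variable names are my own for illustration.

```python
# Per-class F1 and support copied from the classification report
report = {
    "Negative": {"f1": 0.86, "support": 8317},
    "Neutral":  {"f1": 0.68, "support": 4219},
    "Positive": {"f1": 0.75, "support": 1753},
}

total = sum(row["support"] for row in report.values())

# Macro average: every class counts equally
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: classes count in proportion to their support
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

# Accuracy of a trivial model that always predicts the largest class
majority_baseline = max(row["support"] for row in report.values()) / total

print(f"macro F1 ~ {macro_f1:.2f}, weighted F1 ~ {weighted_f1:.2f}")
print(f"majority-class baseline accuracy ~ {majority_baseline:.2f}")
```

The weighted average is higher than the macro average because the large Negative class, where the model does best, dominates it. The baseline gives a reference point for reading the 79% accuracy.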
