### **Student Information**
Name: Vincent Limardi

Student ID: 111006242

GitHub ID: Limardi

Kaggle name: vlimardi

Kaggle private scoreboard snapshot: 

![pic_ranking.png](./screenshot/pic_ranking.png)

---

# **Instructions**

For this lab we have divided the assignments into **three phases/parts**. The `first two phases` refer to the `exercises inside the Master notebooks` of the [DM2025-Lab2-Exercise Repo](https://github.com/difersalest/DM2025-Lab2-Exercise.git). The `third phase` refers to an `internal Kaggle competition` that we will run among all the Data Mining students. Together they add up to `100 points` of your grade. There are also some `bonus points` to be gained if you complete `extra exercises` in the lab **(bonus 15 pts)** and in the `Kaggle Competition report` **(bonus 5 pts)**.

**Environment recommendations to solve lab 2:**
- **Phase 1 exercises:** Training the models in this part needs a GPU. If your laptop does not have one, run the notebook in Colab or Kaggle for a faster experience; the exercises can still be solved on CPU, only more slowly.
- **Phase 2 exercises:** We use Gemini's API, so everything can be run on CPU without a problem.
- **Phase 3 exercises:** For the competition you will probably need a GPU to train your models, so it is recommended to use Colab or Kaggle if you don't have a laptop with a dedicated GPU.
- **Optional Ollama Notebook (not graded):** You need a GPU with at least 4 GB of VRAM and 16 GB of RAM to run the local open-source LLM models.

## **Phase 1 (30 pts):**

1. __Main Exercises (25 pts):__ Do the **take home exercises** from Sections: `1. Data Preparation` to `9. High-dimension Visualization: t-SNE and UMAP`, in the [DM2025-Lab2-Master-Phase_1 Notebook](https://github.com/difersalest/DM2025-Lab2-Exercise/blob/main/DM2025-Lab2-Master-Phase_1.ipynb). Total: `8 exercises`. Commit your code and submit the repository link to NTU Cool **`BEFORE the deadline (Nov. 3rd, 11:59 pm, Monday)`**

2. **Code Comments (5 pts):** **Tidy up the code in your notebook**. 

## **Phase 2 (30 pts):**

1. **Main Exercises (25 pts):** Do the remaining **take home exercises** from Section: `2. Large Language Models (LLMs)` in the [DM2025-Lab2-Master-Phase_2_Main Notebook](https://github.com/difersalest/DM2025-Lab2-Exercise/blob/main/DM2025-Lab2-Master-Phase_2_Main.ipynb). Total: `5 exercises required from sections 2.1, 2.2, 2.4 and 2.6`. Commit your code and submit the repository link to NTU Cool **`BEFORE the deadline (Nov. 24th, 11:59 pm, Monday)`**

2. **Code Comments (5 pts):** **Tidy up the code in your notebook**. 

3. **`Bonus (15 pts):`** Complete the bonus exercises in the [DM2025-Lab2-Master-Phase_2_Bonus Notebook](https://github.com/difersalest/DM2025-Lab2-Exercise/blob/main/DM2025-Lab2-Master-Phase_2_Bonus.ipynb) and [DM2025-Lab2-Master-Phase_2_Main Notebook](https://github.com/difersalest/DM2025-Lab2-Exercise/blob/main/DM2025-Lab2-Master-Phase_2_Main.ipynb) `where 2 exercises are counted as bonus from sections 2.3 and 2.5 in the main notebook`. Total: `7 exercises`. Commit your code and submit the repository link to NTU Cool **`BEFORE the deadline (Nov. 24th, 11:59 pm, Monday)`**

## **Phase 3 (40 pts):**

1. **Kaggle Competition Participation (30 pts):** Participate in the in-class **Kaggle Competition** regarding Emotion Recognition on Twitter by clicking this link: **[Data Mining Class Kaggle Competition](https://www.kaggle.com/t/3a2df4c6d6b4417e8bf718ed648d7554)**. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20 pts of the 30 pts in this competition participation part.

    - **Top 60%** (the 41st to 100th percentile): Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants and x is your rank. (E.g., if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67 points out of 30.)   
    Submit your last submission **`BEFORE the deadline (Nov. 24th, 11:59 pm, Monday)`**. Make sure to take a screenshot of your position at the end of the competition and store it as `pic_ranking.png` under the `pics` folder of this repository and rerun the cell **Student Information**.

2. **Competition Report (10 pts):** Fill in the report section inside this notebook in Markdown format; a template is provided below. Describe your work developing the model for the competition. The report should briefly describe the following elements: 
* Your preprocessing steps.
* The feature engineering steps.
* Explanation of your model.

* **`Bonus (5 pts):`**
    * Describe the previous steps in more detail.
    * Mention different things you tried.
    * Mention insights you gained. 

[Markdown Guide - Basic Syntax](https://www.markdownguide.org/basic-syntax/)

**`Things to note for Phase 3:`**

* **The code used for the competition should be in this Jupyter Notebook File** `DM2025-Lab2-Homework.ipynb`.

* **Push the code used for the competition to your repository**.

* **The code should have a clear separation for the same sections of the report, preprocessing, feature engineering and model explanation. Briefly comment your code for easier understanding, we provide a template at the end of this notebook.**

* Showing the kaggle screenshot of the ranking plus the code in this notebook will ensure the validity of your participation and the report to obtain the corresponding points.

After the competition ends you will have two more days to submit the `DM2025-Lab2-Homework.ipynb` with your report in markdown format and your code. Do everything **`BEFORE the deadline (Nov. 26th, 11:59 pm, Wednesday) to obtain 100% of the available points.`**

Upload your files to your repository then submit the link to it on the corresponding NTU Cool assignment.

## **Deadlines:**

![lab2_deadlines](./pics/lab2_deadlines.png)

---

Next you will find the template report with some simple markdown syntax explanations, use it to structure your content.

You can delete the syntax suggestions after you use them.

---

***

# **Project Report**



## 1. Model Development (10 pts Required)

**Syntax:** `##` creates a secondary heading (H2).

**Describe briefly each section, you can add graphs/charts to support your explanations.**

### 1.1 Preprocessing Steps

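* First, we map each emotion label to an integer ID (`anger` = 0, `fear` = 1, `joy` = 2, `sadness` = 3).
* Then we tokenize each sentence with the "**distilbert-base-uncased**" tokenizer, which turns the text into token IDs and an attention mask (the embeddings themselves are computed later, inside the model).
* Finally, we split the training data into training and validation sets with an **85:15 ratio**.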
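The label mapping and the 85:15 split can be sketched in plain Python as follows. This is an illustration only: the notebook imports `train_test_split` from scikit-learn for the split, while `stratified_split`, its seed, and the toy data here are hypothetical and stand in for it.

```python
import random
from collections import defaultdict

# Label mapping used in the competition code.
emotion_to_id = {"anger": 0, "fear": 1, "joy": 2, "sadness": 3}


def stratified_split(labels, val_ratio=0.15, seed=42):
    """Return (train_idx, val_idx), putting about val_ratio of each class in validation.

    Hypothetical helper for illustration; the seed and rounding are assumptions.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_ratio)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy data: 20 rows per emotion (made up for the example).
emotions = ["anger", "fear", "joy", "sadness"] * 20
labels = [emotion_to_id[e] for e in emotions]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting per class keeps the emotion proportions the same in both sets, which matters when some classes are rare.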

### 1.2 Feature Engineering Steps

* **Class Weight Computation**: Calculated balanced class weights using `sklearn.utils.class_weight.compute_class_weight` to address class imbalance in the dataset. This ensures minority classes (like `fear`) contribute equally to the loss function.
* **Tokenizer Features**: The DistilBERT tokenizer automatically generates:
    * `input_ids`: Numerical representation of tokens
    * `attention_mask`: Binary mask indicating real tokens vs padding
* **Subword Tokenization**: DistilBERT uses WordPiece tokenization which handles out-of-vocabulary words by breaking them into subword units.
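To make the class-weight step concrete, here is a minimal sketch of the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what `compute_class_weight('balanced', ...)` computes. The function name and the label counts below are made up for illustration and are not the competition data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy label distribution (hypothetical): class 1 is the rarest.
toy_labels = [0] * 30 + [1] * 10 + [2] * 40 + [3] * 20
weights = balanced_class_weights(toy_labels)
for class_id, weight in weights.items():
    print(class_id, round(weight, 3))
```

Rare classes get weights above 1 and frequent classes get weights below 1, so every class ends up with the same total weight in the loss.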

### 1.3 Explanation of Your Model

* **Model Architecture**: Used **DistilBERT** (`distilbert-base-uncased`), a distilled version of BERT with:
    * 66 million parameters (40% smaller than BERT)
    * 6 transformer layers (vs BERT's 12)
    * Retains 97% of BERT's performance while being 60% faster
* **Classification Head**: `AutoModelForSequenceClassification` with `num_labels=4` for 4-way emotion classification.
* **Custom Weighted Trainer**: Implemented `WeightedTrainer` class that uses weighted `CrossEntropyLoss` to handle class imbalance.
* **Training Configuration**:
    * Epochs: 2
    * Batch size: 32 (train), 64 (eval)
    * Learning rate: 2e-5
    * Warmup steps: 100
    * Weight decay: 0.01
    * Mixed precision training (FP16) enabled for faster training
* **Early Stopping**: Used `EarlyStoppingCallback` with `patience=2` based on weighted F1 score to prevent overfitting.
* **Evaluation Metrics**: Tracked accuracy and weighted F1 score during training.
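The effect of the weighted loss can be illustrated without PyTorch. The sketch below mirrors how `CrossEntropyLoss(weight=...)` averages with its default mean reduction: each sample's loss is multiplied by the weight of its true class, and the sum is divided by the sum of those weights. The logits and weights are made-up values, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, batch_labels, class_weights):
    """Weighted mean of per-sample cross-entropy, normalised by the summed weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        w = class_weights[label]
        weighted_sum += -w * math.log(probs[label])
        weight_total += w
    return weighted_sum / weight_total


# Two samples with identical logits: the first is classified correctly (class 0),
# the second belongs to a rarer class (class 1) and is misclassified.
batch_logits = [[5.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
batch_labels = [0, 1]

plain = weighted_cross_entropy(batch_logits, batch_labels, [1.0, 1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, batch_labels, [1.0, 3.0, 1.0, 1.0])
print(f"unweighted: {plain:.3f}  weighted: {weighted:.3f}")
```

With the larger weight on class 1, the mistake on the rare class dominates the batch loss, which pushes the optimiser to fix it.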


---

## 2. Bonus Section (5 pts Optional)

**Add more detail in previous sections**

### 2.1 Mention Different Things You Tried

* **Model Selection**: Initially considered `roberta-large` but switched to `distilbert-base-uncased` for significantly faster training (4x speedup) while maintaining competitive accuracy.
* **Phase 1 Approaches**: Explored traditional ML methods including:
    * Bag of Words with Decision Trees
    * Word2Vec embeddings
    * K-means clustering for text analysis
    * t-SNE/UMAP for visualization
* **Hyperparameter Tuning**: Experimented with:
    * Different max sequence lengths (128 vs 256)
    * Various batch sizes and learning rates
    * Different validation split ratios
* **LLM Exploration** (Bonus notebook): Explored using Gemini API for multi-modal prompting and tool calling capabilities.

### 2.2 Mention Insights You Gained

* **Model Size vs Speed Trade-off**: DistilBERT provides an excellent balance between performance and training time - achieving near-BERT accuracy in a fraction of the time.
* **Class Imbalance Matters**: Using weighted loss function significantly improved predictions for minority classes (`fear`, `anger`).
* **Transformer Efficiency**: Pre-trained transformer models dramatically outperform traditional Bag-of-Words approaches for emotion classification.
* **Early Stopping Importance**: Monitoring F1 score (rather than just accuracy) and implementing early stopping prevented overfitting and saved training time.
* **Tokenization Impact**: Proper handling of max_length and padding is crucial - too short truncates important context, too long wastes computation.
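The `max_length` trade-off can be seen in a small sketch of what padding and truncation do to a token sequence. This is a simplification: the real tokenizer also keeps the special tokens in place when truncating, and the token IDs and `pad_id=0` below are assumptions chosen for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it, and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 102], max_length=4)
print(short_ids, short_mask)  # padded: two wasted positions
print(long_ids, long_mask)    # truncated: the last two tokens are lost
```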
---

**`From here on starts the code section for the competition.`**

---

# **Competition Code**

## 1. Preprocessing Steps

In [None]:
### Add the code related to the preprocessing steps in cells inside this section
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
import os

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load data
if 'train_data_filtered' not in globals():
    train_data_filtered = pd.read_csv("./data/processed/kaggle_train.csv")
if 'test_data' not in globals():
    test_data = pd.read_csv("./data/processed/kaggle_test.csv")

emotion_to_id = {'anger': 0, 'fear': 1, 'joy': 2, 'sadness': 3}
id_to_emotion = {v: k for k, v in emotion_to_id.items()}

## 2. Feature Engineering Steps

In [None]:
### Add the code related to the feature engineering steps in cells inside this section
# Class weights
train_labels = train_data_filtered['emotion'].map(emotion_to_id).tolist()
class_weights = compute_class_weight('balanced', classes=np.unique(train_labels), y=train_labels)
class_weights = torch.FloatTensor(class_weights).to(device)

class EmotionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
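        self.max_length = max_length

    def __len__(self):
        # Required so the Trainer's DataLoader knows the dataset size
        return len(self.texts)
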
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer(
            text, truncation=True, padding='max_length',
            max_length=self.max_length, return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

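# Load the tokenizer here so the datasets below can use it
# (it is loaded again together with the model in Section 3).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# 85:15 train/validation split, as described in the report.
# Assumptions: the training CSV has a 'text' column like the test CSV;
# stratify and random_state are illustrative choices.
train_texts_split, val_texts_split, train_labels_split, val_labels_split = train_test_split(
    train_data_filtered['text'].tolist(), train_labels,
    test_size=0.15, random_state=42, stratify=train_labels
)

train_dataset = EmotionDataset(train_texts_split, train_labels_split, tokenizer)
val_dataset = EmotionDataset(val_texts_split, val_labels_split, tokenizer)
test_texts = test_data['text'].tolist()
# Dummy labels: the test set has no ground truth, only predictions are used
test_dataset = EmotionDataset(test_texts, [0]*len(test_texts), tokenizer)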

## 3. Model Implementation Steps

In [None]:
### Add the code related to the model implementation steps in cells inside this section
model_name = "distilbert-base-uncased"
print(f"\nUsing model: {model_name} (MUCH faster than roberta-large)")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4
)
model.to(device)

class WeightedTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights
    
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # torch.nn is referenced explicitly because `nn` is never imported
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_score': f1_score(labels, predictions, average='weighted')
    }

training_args = TrainingArguments(
    output_dir='./results/bert_fast',
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="f1_score",
    save_total_limit=1,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU
    dataloader_num_workers=0,
    report_to="none",
)

trainer = WeightedTrainer(
    class_weights=class_weights,
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

print("\n" + "="*80)
print("Training (should finish in 30-60 minutes)...")
print("="*80)
trainer.train()

predictions = trainer.predict(val_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)
accuracy = accuracy_score(val_labels_split, pred_labels)
print(f"\nValidation Accuracy: {accuracy:.4f}")

test_predictions = trainer.predict(test_dataset)
test_pred_labels = np.argmax(test_predictions.predictions, axis=1)
test_pred_emotions = [id_to_emotion[label] for label in test_pred_labels]

results_df = pd.DataFrame({
    'id': test_data['id'].values,
    'text': test_data['text'].values,
    'emotion': test_pred_emotions
})

output_dir = "./results/kaggle"
os.makedirs(output_dir, exist_ok=True)
submission_df = results_df[['id', 'emotion']].copy()
submission_file = f"{output_dir}/kaggle_submission_bert_fast.csv"
submission_df.to_csv(submission_file, index=False)
print(f"\n✓ Saved to: {submission_file}")
