### Student Information
Name:

Student ID:

GitHub ID:

Kaggle name:

Kaggle private scoreboard snapshot:

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) on Emotion Recognition on Twitter. Your score depends on your rank on the Private Leaderboard: 
    - **Bottom 40%**: Get 20 of the 30 points available for this section.

    - **Top 60%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants and x is your rank. (E.g., if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67 out of 30.)   
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention the different things you tried and the insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
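
The Kaggle scoring rule above can be checked with a few lines of Python. This is only a sketch: the function name `kaggle_section_score` is hypothetical, and it assumes that any rank greater than 0.6N falls into the flat 20-point band.

```python
def kaggle_section_score(num_participants: int, rank: int) -> float:
    """Points (out of 30) for the Kaggle section under the stated rule."""
    cutoff = 0.6 * num_participants
    if rank > cutoff:
        # Assumption: ranks past the cutoff receive the flat 20 points.
        return 20.0
    return (cutoff + 1 - rank) / cutoff * 10 + 20


print(round(kaggle_section_score(100, 3), 2))  # 29.67
```

With 100 participants, rank 1 gives the full 30 points and the score decreases linearly towards 20 as the rank approaches 60.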

#### Data preparation

In [None]:
import pandas as pd
import os

data_path = "dataset/"
df_data = pd.read_json(os.path.join(data_path, "tweets_DM.json"), lines=True)
df_data["_source"][0]

In [None]:
df_data.head()

In [None]:
df_source = pd.json_normalize(df_data["_source"])

In [None]:
df_source.head()

In [None]:
df_data = pd.concat([df_data, df_source], axis=1)
df_data = df_data.drop(columns=["_source"])

In [None]:
df_data.head()
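
The dotted column names that show up after flattening (for example `tweet.hashtags`) come from how `pd.json_normalize` handles nested dictionaries. The sketch below uses made-up records shaped like the `_source` column; the tweet ids and texts are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy records shaped like the `_source` column (values are made up).
records = [
    {"tweet": {"hashtags": ["happy"], "tweet_id": "0x000001", "text": "What a day! #happy"}},
    {"tweet": {"hashtags": [], "tweet_id": "0x000002", "text": "Not again..."}},
]

# Nested keys are joined with "." to form flat column names.
flat = pd.json_normalize(records)
print(list(flat.columns))
print(flat.loc[0, "tweet.text"])
```

Each nested key becomes its own column, and list values such as the hashtags are kept as Python lists inside a single cell.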

In [None]:
len(df_data["tweet.hashtags"])

In [None]:
# Rename all columns in a single call.
df_data = df_data.rename(
    columns={
        "tweet.hashtags": "hashtags",
        "tweet.tweet_id": "tweet_id",
        "tweet.text": "text",
        "_score": "score",
        "_index": "index",
        "_crawldate": "crawldate",
        "_type": "type",
    }
)

In [None]:
df_data.head()

In [None]:
df_identification = pd.read_csv(os.path.join(data_path, "data_identification.csv"))

In [None]:
df_data = pd.merge(df_data, df_identification, on="tweet_id")

In [None]:
df_data.head()

In [None]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning.
df_test = df_data[df_data["identification"] == "test"].copy()

In [None]:
df_emotion = pd.read_csv(os.path.join(data_path, "emotion.csv"))

In [None]:
df_emotion.head()

In [None]:
df_data = pd.merge(df_data, df_emotion, on="tweet_id")

In [None]:
df_data.head()

In [None]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning.
df_train = df_data[df_data["identification"] == "train"].copy()

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
# Collect the distinct emotion labels in order of first appearance.
emotion_list = df_train["emotion"].unique().tolist()
print(emotion_list)

In [None]:
emotion_list.append("None")

In [None]:
# Map each emotion label to an integer id, and back.
emotion_id = {emotion: i for i, emotion in enumerate(emotion_list)}
print(emotion_id)

id_emotion = {i: emotion for i, emotion in enumerate(emotion_list)}
print(id_emotion)

In [None]:
df_train["emotion_id"] = df_train["emotion"].apply(lambda x: emotion_id[x])
df_train.head()

In [None]:
df_test["emotion"] = "None"
df_test["emotion_id"] = df_test["emotion"].apply(lambda x: emotion_id[x])
df_test.head()

In [None]:
# Seed the split through pandas so it is reproducible
# (random.seed does not affect DataFrame.sample).
df_val = df_train.sample(frac=0.2, random_state=42)
df_train = df_train.drop(df_val.index)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
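
The split above samples 20% of the rows uniformly, so rare emotions can end up under- or over-represented in the validation set. One alternative, not used in this notebook, is to sample 20% within each emotion. The sketch below runs on a made-up toy frame; only the column name `emotion` matches the real data.

```python
import pandas as pd

# Toy labelled frame (made-up data): 10 "joy" rows and 5 "anger" rows.
df = pd.DataFrame(
    {
        "text": [f"tweet {i}" for i in range(15)],
        "emotion": ["joy"] * 10 + ["anger"] * 5,
    }
)

# Sample 20% of each emotion so the validation set mirrors the label mix.
val = df.groupby("emotion").sample(frac=0.2, random_state=42)
train = df.drop(val.index)

print(val["emotion"].value_counts().to_dict())
print(len(train), len(val))
```

Because sampling happens per group, the ratio between emotions in `val` matches the ratio in the full frame (up to rounding).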

In [None]:
def save_df(df, file_path):
    df.to_csv(file_path, index=False)

In [None]:
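# Save into data_path so the Training section can load the files from the same folder.
save_df(df_train, os.path.join(data_path, "train.csv"))
save_df(df_val, os.path.join(data_path, "val.csv"))
save_df(df_test, os.path.join(data_path, "test.csv"))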

#### Training

In [None]:
import pandas as pd
import os


def load_df(file_path):
    return pd.read_csv(file_path)


data_path = "dataset/"
df_train = load_df(os.path.join(data_path, "train.csv"))
df_val = load_df(os.path.join(data_path, "val.csv"))
df_test = load_df(os.path.join(data_path, "test.csv"))

In [None]:
### Begin Assignment Here
lr = 1e-5
batch_size = 64
epochs = 10
grad_clip = 1.0
bert = "google-bert/bert-base-uncased"

In [None]:
# Load model directly
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained(bert, cache_dir="cache/")
model = BertForSequenceClassification.from_pretrained(
    bert, num_labels=len(df_train["emotion_id"].unique()), cache_dir="cache/"
)

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader


class EmotionDataset(Dataset):
    def __init__(self, dataframe):
        # Use the provided dataframe
        self.data = dataframe[["text", "emotion_id"]]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if isinstance(idx, int):
            # Tokenize the text
            tokenized_text = tokenizer(
                self.data.iloc[idx]["text"],
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )
            sample = {
                "input_ids": tokenized_text["input_ids"].squeeze(0),
                "attention_mask": tokenized_text["attention_mask"].squeeze(0),
                "emotion_id": torch.tensor(
                    self.data.iloc[idx]["emotion_id"], dtype=torch.long
                ),
            }
            return sample
        else:
            raise TypeError("Index must be an integer")

In [None]:
def collate_fn(batch):
    """Stack per-sample tensors into batch tensors."""
    input_ids = [item["input_ids"] for item in batch]
    attention_mask = [item["attention_mask"] for item in batch]
    emotion_ids = [item["emotion_id"] for item in batch]

    return {
        "input_ids": torch.stack(input_ids),
        "attention_mask": torch.stack(attention_mask),
        # Labels are 0-dim tensors, so stacking yields a 1-D tensor of shape (batch,).
        "emotion_id": torch.stack(emotion_ids),
    }

In [None]:
train_dataset = EmotionDataset(df_train)
dl_train = DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn
)
validation_dataset = EmotionDataset(df_val)
dl_validation = DataLoader(
    validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)
test_dataset = EmotionDataset(df_test)
dl_test = DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)
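
With `padding="max_length"` and no explicit `max_length`, every tweet is padded to the tokenizer's maximum length (typically 512 tokens for `bert-base-uncased`), which is far longer than most tweets. A possible alternative, not used in this notebook, is to pad each batch only to its longest sequence. The sketch below assumes the dataset returns unpadded tensors and that the pad token id is 0; the helper name `collate_dynamic` and the sample token ids are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_dynamic(batch, pad_token_id=0):
    """Pad each batch only to the length of its longest sequence."""
    input_ids = pad_sequence(
        [item["input_ids"] for item in batch],
        batch_first=True,
        padding_value=pad_token_id,  # assumed pad id
    )
    attention_mask = pad_sequence(
        [item["attention_mask"] for item in batch],
        batch_first=True,
        padding_value=0,  # padded positions are masked out
    )
    emotion_id = torch.stack([item["emotion_id"] for item in batch])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "emotion_id": emotion_id,
    }


# Two made-up, unpadded samples of different lengths.
batch = [
    {
        "input_ids": torch.tensor([101, 7592, 102]),
        "attention_mask": torch.ones(3, dtype=torch.long),
        "emotion_id": torch.tensor(3),
    },
    {
        "input_ids": torch.tensor([101, 102]),
        "attention_mask": torch.ones(2, dtype=torch.long),
        "emotion_id": torch.tensor(1),
    },
]
out = collate_dynamic(batch)
print(out["input_ids"].shape)  # torch.Size([2, 3])
```

The shorter sample is padded with one extra token and its attention mask gets a trailing 0, so the model ignores the padded position.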

In [None]:
for batch in dl_train:
    print(batch)
    break

In [None]:
import torch


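class EmotionClassifier(torch.nn.Module):
    """Thin wrapper around a BertForSequenceClassification model."""

    def __init__(self, model, num_emotions=8):
        super().__init__()
        self.model = model
        self.dropout = torch.nn.Dropout(0.3)
        # Note: this layer is never called in forward(); the wrapped model's
        # own classification head produces the logits.
        self.classifier = torch.nn.Linear(768, num_emotions)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids, attention_mask=attention_mask)
        # Dropout is applied to the logits themselves; it is active only in
        # train() mode and is a no-op in eval() mode.
        logits = self.dropout(outputs.logits)
        return logits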

In [None]:
import torch.optim as optim
from torch.nn import CrossEntropyLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = EmotionClassifier(model).to(device)
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
loss_fn = CrossEntropyLoss()

In [None]:
id_emotion = {
    0: "anticipation",
    1: "sadness",
    2: "fear",
    3: "joy",
    4: "anger",
    5: "trust",
    6: "disgust",
    7: "surprise",
    8: "None",
}

In [None]:
from tqdm import tqdm
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from torchmetrics.classification import (
    MulticlassAccuracy,
    MulticlassF1Score,
    MulticlassPrecision,
    MulticlassRecall,
)

# Initialize metrics. The model predicts only the 8 real emotions, so the
# "None" placeholder (used for unlabeled test rows) is excluded from the count.
num_classes = len(id_emotion) - 1
accuracy = MulticlassAccuracy(num_classes=num_classes).to(device)
f1_score = MulticlassF1Score(num_classes=num_classes).to(device)
precision = MulticlassPrecision(num_classes=num_classes).to(device)
recall = MulticlassRecall(num_classes=num_classes).to(device)

# Set up optimizer and learning rate scheduler with warm-up
total_steps = len(dl_train) * epochs
warmup_steps = int(0.1 * total_steps)  # Warm-up for 10% of total steps

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

for epoch in range(epochs):
    model.train()
    total_loss = 0
    pbar = tqdm(dl_train, desc=f"Epoch {epoch + 1}")
    for batch in pbar:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        emotion_id = batch["emotion_id"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, emotion_id)
        total_loss += loss.item()

        # Update metrics
        predictions = torch.argmax(outputs, dim=-1)
        accuracy.update(predictions, emotion_id)
        f1_score.update(predictions, emotion_id)
        precision.update(predictions, emotion_id)
        recall.update(predictions, emotion_id)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()

        # Update learning rate scheduler
        scheduler.step()

        pbar.set_postfix({"loss": loss.item()})

    print(f"Training loss: {total_loss / len(dl_train)}")
    print(f"Training Accuracy: {accuracy.compute().item()}")
    print(f"Training F1 Score: {f1_score.compute().item()}")
    print(f"Training Precision: {precision.compute().item()}")
    print(f"Training Recall: {recall.compute().item()}")

    # Reset metrics
    accuracy.reset()
    f1_score.reset()
    precision.reset()
    recall.reset()

    # Validation phase
    model.eval()
    total_loss = 0
    pbar = tqdm(dl_validation, desc=f"Epoch {epoch + 1} Validation")
    with torch.no_grad():
        for batch in pbar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            emotion_id = batch["emotion_id"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs, emotion_id)
            total_loss += loss.item()

            # Update metrics
            predictions = torch.argmax(outputs, dim=-1)
            accuracy.update(predictions, emotion_id)
            f1_score.update(predictions, emotion_id)
            precision.update(predictions, emotion_id)
            recall.update(predictions, emotion_id)

    print(f"Validation loss: {total_loss / len(dl_validation)}")
    print(f"Validation Accuracy: {accuracy.compute().item()}")
    print(f"Validation F1 Score: {f1_score.compute().item()}")
    print(f"Validation Precision: {precision.compute().item()}")
    print(f"Validation Recall: {recall.compute().item()}")

    # Reset metrics
    accuracy.reset()
    f1_score.reset()
    precision.reset()
    recall.reset()
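
The training loop uses a linear schedule with warm-up: the learning rate ramps up from zero during the first 10% of the steps and then decays linearly back to zero. The pure-Python sketch below reproduces that multiplier as documented for this kind of schedule. It is not the library's code, the function name `linear_warmup_factor` is hypothetical, and the step counts are illustrative rather than taken from the real run.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5
total_steps, warmup_steps = 1000, 100  # illustrative values, not the real run
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_factor(step, warmup_steps, total_steps))
```

The multiplier reaches 1.0 exactly at the end of the warm-up (step 100 here), so the peak learning rate equals `base_lr`.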