<a href="https://colab.research.google.com/github/NoahYe123/EmotionClassification/blob/main/COMP_551_Assignment_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Authors: Yifan Du, Abiola Olaniyan, Noah Ye (Group 24)

# COMP 551 Assignment 4
Portions of code/models obtained from provided COMP 551 notebooks.

### Task 1: Preprocess dataset


In [None]:
!pip install datasets
!pip install xgboost


In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from torch.utils.data import Dataset
import torch

In [None]:
# Load the dataset
ds = load_dataset("google-research-datasets/go_emotions", "simplified")

# Convert to a DataFrame for easier processing
df = pd.DataFrame(ds['train'])

# Drop data points with more than one label
df['num_labels'] = df['labels'].apply(len)  # Count the number of labels per data point
df_single_label = df[df['num_labels'] == 1].drop(columns=['num_labels'])

# Calculate percentage
original_size = len(df)
simplified_size = len(df_single_label)
percentage_retained = (simplified_size / original_size) * 100

# Print results
print(f"Original dataset size: {original_size}")
print(f"Simplified dataset size: {simplified_size}")
print(f"Percentage of data retained: {percentage_retained:.2f}%")

filtered_df = pd.DataFrame(df_single_label)

# Split manually into train, validation, and test sets (e.g., 80% train, 10% val, 10% test)
train_df, temp_df = train_test_split(filtered_df, test_size=0.2, random_state=42)
validation_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Preprocess for Baseline and Naive Bayes Models
# TF-IDF Vectorizer for Baseline
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_baseline = tfidf_vectorizer.fit_transform(train_df['text'])
X_val_baseline = tfidf_vectorizer.transform(validation_df['text'])
X_test_baseline = tfidf_vectorizer.transform(test_df['text'])

# Count Vectorizer for Naive Bayes
count_vectorizer = CountVectorizer(max_features=5000)
X_train_nb = count_vectorizer.fit_transform(train_df['text'])
X_val_nb = count_vectorizer.transform(validation_df['text'])
X_test_nb = count_vectorizer.transform(test_df['text'])

# Labels
# Reset index for labels to ensure sequential indexing
y_train = torch.tensor([label[0] for label in train_df['labels'].reset_index(drop=True)])
y_val = torch.tensor([label[0] for label in validation_df['labels'].reset_index(drop=True)])
y_test = torch.tensor([label[0] for label in test_df['labels'].reset_index(drop=True)])

# Preprocess for LLM Models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_data(texts):
    return tokenizer(
        list(texts),  # Tokenizer expects a list of texts
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )

train_tokens = tokenize_data(train_df['text'])
validation_tokens = tokenize_data(validation_df['text'])
test_tokens = tokenize_data(test_df['text'])

# Final Outputs
print("Baseline TF-IDF Preprocessed Shapes:", X_train_baseline.shape, X_val_baseline.shape, X_test_baseline.shape)
print("Naive Bayes CountVectorizer Shapes:", X_train_nb.shape, X_val_nb.shape, X_test_nb.shape)
print("LLM Tokenized Shapes:", train_tokens['input_ids'].shape, validation_tokens['input_ids'].shape, test_tokens['input_ids'].shape)

Original dataset size: 43410
Simplified dataset size: 36308
Percentage of data retained: 83.64%
Baseline TF-IDF Preprocessed Shapes: (29046, 5000) (3631, 5000) (3631, 5000)
Naive Bayes CountVectorizer Shapes: (29046, 5000) (3631, 5000) (3631, 5000)
LLM Tokenized Shapes: torch.Size([29046, 128]) torch.Size([3631, 52]) torch.Size([3631, 47])
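The sizes printed above can be reproduced by hand. The sketch below is a minimal check, assuming the held-out portion is rounded up at each stage (which is how scikit-learn's `train_test_split` is understood to behave); the helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_total, test_fraction=0.2, val_share_of_holdout=0.5):
    """Return (train, val, test) sizes for a two-stage 80/10/10-style split."""
    # Stage 1: hold out `test_fraction` of the data, rounding the holdout up.
    n_holdout = math.ceil(n_total * test_fraction)
    n_train = n_total - n_holdout
    # Stage 2: divide the holdout between validation and test.
    n_test = math.ceil(n_holdout * (1 - val_share_of_holdout))
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

original_size = 43410    # rows in the GoEmotions train split
simplified_size = 36308  # rows left after dropping multi-label examples

print(f"Percentage of data retained: {simplified_size / original_size:.2%}")
print("train/val/test sizes:", split_sizes(simplified_size))
```

The resulting triple should agree with the row counts in the shapes printed by the preprocessing cell.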


In [None]:
# Load BERT with a classification head sized for the GoEmotions label set
# (AutoModelForSequenceClassification is already imported above)
num_labels = 28  # GoEmotions has 28 labels (27 emotions plus neutral)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_labels
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Task 2: Implement Naive Bayes & Fine-tune an LLM

**2.1: Multinomial Naive Bayes Model**

In [None]:
# Code taken from the Gaussian Naive Bayes classifier tutorial
def logsumexp(Z):                                                # dimension C x N
    Zmax = np.max(Z,axis=0)[None,:]                              # max over C
    log_sum_exp = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=0))
    return log_sum_exp


class MultiNomNB:

  def __init__(self):
    return

  def fit(self, x, y):
    y = np.asarray(y)                                     # accept torch tensors or lists as labels
    N, D = x.shape
    C = len(np.unique(y))                                 # number of classes (labels assumed to be 0..C-1)
    # one parameter for each feature conditioned on each class
    theta = np.zeros((C, D))                              # conditional word probabilities
    Nc = np.zeros(C)                                      # number of instances in class c
    # for each class, estimate theta from the smoothed relative word frequencies
    for c in range(C):
        x_c = x[y == c]                                   # slice all the documents from class c
        Nc[c] = x_c.shape[0]                              # number of documents in class c
        d_c = np.asarray(x_c.sum(axis=0)).ravel() + 1     # count of each word in class c, plus 1 (Laplace smoothing)
        total_c = np.sum(d_c)                             # smoothed total word count in class c
        theta[c, :] = d_c / total_c                       # smoothed estimate of theta

    self.theta = theta                                    # C x D
    self.pi = (Nc + 1) / (N + C)                          # Laplace-smoothed class prior (alpha_c = 1 for all c), derivable from a Dirichlet prior
    return self

  def predict(self, xt):
    Nt, D = xt.shape
    # for numerical stability we work in the log domain
    # the log-prior gets a trailing dimension so that it broadcasts over the
    # C x N log-likelihood matrix (one column per test point)
    log_prior = np.log(self.pi)[:, None]                                # C x 1
    # logarithm of the likelihood term for Multinomial: word counts times log(theta)
    log_likelihood = np.asarray(xt @ np.log(self.theta.T)).T            # C x N
    # posterior calculation, normalized over the classes (axis 0)
    log_posterior = log_prior + log_likelihood                          # C x N
    posterior = np.exp(log_posterior - logsumexp(log_posterior))

    return posterior                                                    # dimension C x N

  def evaluate_acc(self, y_true, y_pred):
    accuracy = np.mean(y_true == y_pred)
    print(f"Model accuracy: {accuracy}")

In [None]:
model = MultiNomNB()
model.fit(X_train_nb, y_train)
yh = model.predict(X_test_nb)
predicted_labels = np.argmax(yh, axis=0)

model.evaluate_acc(y_test.numpy() , predicted_labels)

Model accuracy: 0.4136601487193611
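To make the smoothing and the log-domain normalization concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes on a toy count matrix. The data, the helper names (`fit_multinomial_nb`, `predict_proba`), and the three-word vocabulary are all hypothetical; the sketch mirrors the idea of the class above rather than reproducing it.

```python
import math

def fit_multinomial_nb(X, y, num_classes, alpha=1.0):
    """Estimate log-priors and log word probabilities with add-alpha smoothing."""
    num_features = len(X[0])
    class_counts = [0] * num_classes
    word_counts = [[alpha] * num_features for _ in range(num_classes)]
    for row, label in zip(X, y):
        class_counts[label] += 1
        for d, count in enumerate(row):
            word_counts[label][d] += count
    n_docs = len(X)
    # Smoothed class prior: (N_c + 1) / (N + C)
    log_prior = [math.log((class_counts[c] + 1) / (n_docs + num_classes))
                 for c in range(num_classes)]
    log_theta = []
    for c in range(num_classes):
        total = sum(word_counts[c])
        log_theta.append([math.log(v / total) for v in word_counts[c]])
    return log_prior, log_theta

def predict_proba(x, log_prior, log_theta):
    """Posterior over classes for one document, normalized with log-sum-exp."""
    scores = [lp + sum(cnt * lt for cnt, lt in zip(x, row))
              for lp, row in zip(log_prior, log_theta)]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [math.exp(s - log_norm) for s in scores]

# Toy word-count matrix: 4 documents, 3 vocabulary words, 2 classes.
X_toy = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 2]]
y_toy = [0, 0, 1, 1]
log_prior, log_theta = fit_multinomial_nb(X_toy, y_toy, num_classes=2)

probs = predict_proba([3, 0, 0], log_prior, log_theta)
print("posterior:", probs, "-> predicted class", probs.index(max(probs)))
```

The key point is that the normalization runs over the classes for each document, so every posterior vector sums to one.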


**2.2: Fine-tuning BERT (the main model), with GPT-2 fine-tuned for the creativity component. NOTE: training takes about 10 minutes.**

In [None]:
# Pre-trained Model
class EmotionDataset(Dataset):
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.tokens["input_ids"][idx],
            "attention_mask": self.tokens["attention_mask"][idx],
            "labels": self.labels[idx],
        }

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

# Create datasets
train_dataset = EmotionDataset(train_tokens, y_train)
validation_dataset = EmotionDataset(validation_tokens, y_val)
test_dataset = EmotionDataset(test_tokens, y_test)

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=28  # GoEmotions has 28 labels (27 emotions plus neutral)
)


# Freeze base layers
for param in model.base_model.parameters():
    param.requires_grad = False

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)


# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics  # Add the metrics function here
)

# Train the model
trainer.train()

# Evaluate on test data
test_metrics = trainer.evaluate(test_dataset)
print("Test Metrics:", test_metrics)



# Save the fine-tuned model
model.save_pretrained("./fine_tuned_bert")
tokenizer.save_pretrained("./fine_tuned_bert")


# Load pre-trained BERT model without fine-tuning
pretrained_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=28  # Same number of labels as the fine-tuned model
)

# Initialize Trainer for the pre-trained model
pretrained_trainer = Trainer(
    model=pretrained_model,
    args=training_args,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Evaluate the pre-trained model on the test set
pretrained_metrics = pretrained_trainer.evaluate()
print("Pre-Trained Model Metrics:", pretrained_metrics)

# Comparison of results
print("Comparison of Models:")
print("Fine-Tuned Bert Model Metrics:", test_metrics)
print("Pre-Trained  Bert Model Metrics:", pretrained_metrics)



**2.2: Fine-Tuning the LLM and Visualizing Attention**

In [None]:
from datasets import load_dataset
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score
import torch
import matplotlib.pyplot as plt
import seaborn as sns


# Load the fine-tuned model
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("./fine_tuned_bert")

# Index of the test document (Reddit comment) to inspect
doc = 77

outputs = fine_tuned_model(
    input_ids=test_tokens["input_ids"][doc:doc + 1],  # Slice keeps the batch dimension for the selected document
    attention_mask=test_tokens["attention_mask"][doc:doc + 1],
    output_attentions=True
)

# Access attentions
attentions = outputs.attentions

layer_idx = 7  # Layer 8 (0-indexed)
head_idx = 2   # Head 3 (0-indexed)

# Extract attention scores
selected_attention = attentions[layer_idx][:, head_idx, :, :]  # Shape: (batch_size, seq_length, seq_length)

# Extract attention for the [CLS] token
cls_token_idx = 0
cls_attention_scores = selected_attention[0, cls_token_idx, :].detach().cpu().numpy()

# Decode tokens for the document
tokens = tokenizer.convert_ids_to_tokens(test_tokens["input_ids"][doc].detach().cpu().numpy())

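# Plot by position so that repeated tokens (e.g. [PAD]) each get their own bar
positions = list(range(len(tokens)))
plt.figure(figsize=(10, 5))
plt.bar(positions, cls_attention_scores)
plt.title(f"Attention Weights from [CLS] (Layer {layer_idx + 1}, Head {head_idx + 1})")
plt.xlabel("Tokens")
plt.ylabel("Attention Weight")
plt.xticks(positions, tokens, rotation=90)
plt.savefig(fname="Attention Weights.png")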


attention_matrix = selected_attention[0].detach().cpu().numpy()  # Shape: (seq_length, seq_length)
plt.figure(figsize=(12, 10))
sns.heatmap(attention_matrix, xticklabels=tokens, yticklabels=tokens, cmap="viridis", cbar=True)
plt.title(f"Attention Heatmap (Layer {layer_idx + 1}, Head {head_idx + 1})")
plt.xlabel("Tokens")
plt.ylabel("Tokens")
plt.xticks(rotation=90)
plt.savefig(fname="Attention Heatmap.png")
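Because the test set is padded to a common length, the selected document can end with `[PAD]` positions that clutter the plots. The sketch below shows one way to drop them before plotting, using the attention mask. The helper `drop_padding` and the token and score values are hypothetical placeholders, not output from the model.

```python
def drop_padding(tokens, scores, attention_mask):
    """Keep only the positions whose attention-mask entry is 1 (real tokens)."""
    kept = [(tok, score)
            for tok, score, mask in zip(tokens, scores, attention_mask)
            if mask == 1]
    kept_tokens = [tok for tok, _ in kept]
    kept_scores = [score for _, score in kept]
    return kept_tokens, kept_scores

# Hypothetical example values for a short, padded document.
example_tokens = ["[CLS]", "thanks", "a", "lot", "[SEP]", "[PAD]", "[PAD]"]
example_scores = [0.30, 0.25, 0.05, 0.20, 0.15, 0.03, 0.02]
example_mask = [1, 1, 1, 1, 1, 0, 0]

kept_tokens, kept_scores = drop_padding(example_tokens, example_scores, example_mask)
# Integer positions keep repeated tokens on separate bars when plotting.
positions = list(range(len(kept_tokens)))
print(list(zip(positions, kept_tokens, kept_scores)))
```

In the notebook, the same idea would apply to `tokens`, `cls_attention_scores`, and `test_tokens["attention_mask"][doc]`.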

**Creativity component: fine-tuning a GPT-2 model (DistilGPT2)**

In [None]:
from datasets import load_dataset
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score
import torch

# Load the dataset
ds = load_dataset("google-research-datasets/go_emotions", "simplified")

# Convert to a DataFrame for easier processing
df = pd.DataFrame(ds['train'])

# Drop data points with more than one label
df['num_labels'] = df['labels'].apply(len)  # Count the number of labels per data point
df_single_label = df[df['num_labels'] == 1].drop(columns=['num_labels'])

# Calculate percentage
original_size = len(df)
simplified_size = len(df_single_label)
percentage_retained = (simplified_size / original_size) * 100

# Print results
print(f"Original dataset size: {original_size}")
print(f"Simplified dataset size: {simplified_size}")
print(f"Percentage of data retained: {percentage_retained:.2f}%")

filtered_df = pd.DataFrame(df_single_label)

# Split manually into train, validation, and test sets (e.g., 80% train, 10% val, 10% test)
train_df, temp_df = train_test_split(filtered_df, test_size=0.2, random_state=42)
validation_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Labels
y_train = torch.tensor([label[0] for label in train_df['labels'].reset_index(drop=True)])
y_val = torch.tensor([label[0] for label in validation_df['labels'].reset_index(drop=True)])
y_test = torch.tensor([label[0] for label in test_df['labels'].reset_index(drop=True)])

# Preprocess for LLM Models
tokenizer = AutoTokenizer.from_pretrained("/opt/models/distilgpt2")

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Function to tokenize data
def tokenize_data(texts):
    return tokenizer(
        list(texts),  # Tokenizer expects a list of texts
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )

train_tokens = tokenize_data(train_df['text'])
validation_tokens = tokenize_data(validation_df['text'])
test_tokens = tokenize_data(test_df['text'])

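# Load the model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("/opt/models/distilgpt2", num_labels=len(set(y_train.tolist())))
model.resize_token_embeddings(len(tokenizer))
# Tell the model which token is padding so it classifies from the last real token of each sequence
model.config.pad_token_id = tokenizer.pad_token_id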

# Custom Dataset class
class EmotionDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

train_dataset = EmotionDataset(train_tokens, y_train)
val_dataset = EmotionDataset(validation_tokens, y_val)
test_dataset = EmotionDataset(test_tokens, y_test)

# Freeze all layers except the classification head
for param in model.base_model.parameters():
    param.requires_grad = False


# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=lambda p: {"accuracy": accuracy_score(p.label_ids, p.predictions.argmax(-1))}
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()
print("Evaluation results:", results)

# Final Outputs
print("LLM Tokenized Shapes:", train_tokens['input_ids'].shape, validation_tokens['input_ids'].shape, test_tokens['input_ids'].shape)
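GPT-2 has no padding token of its own, and a GPT-2 style classification head reads the hidden state of the last non-padding token in each sequence. The sketch below is a simplified illustration of that lookup, not the library's implementation. The helper name and the token IDs are hypothetical, with 50257 assumed to be the ID given to the newly added `[PAD]` token.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return the index of the last token that is not padding, or -1 if none."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    return -1  # hypothetical convention for a row made entirely of padding

PAD_ID = 50257  # assumed ID of the added [PAD] token
padded_row = [15496, 995, 0, PAD_ID, PAD_ID]  # hypothetical right-padded sequence

print("classify from position:", last_non_pad_index(padded_row, PAD_ID))
```

If the model does not know which ID is padding, it cannot skip these positions, which is one reason to keep the model configuration and the tokenizer in agreement about the pad token.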


**Multiple Baselines for Comparison: Random Forest (RF), XGBoost, and Softmax Regression (SR)**

In [None]:
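from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)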

# Train the Random Forest model on the training data
rf_model.fit(X_train_baseline, y_train.numpy())

# Evaluate on validation data
y_val_pred = rf_model.predict(X_val_baseline)

# Verify the number of unique classes
print(f"Unique classes in y_train: {len(set(y_train.numpy()))}")
print(f"Unique classes in y_val: {len(set(y_val.numpy()))}")
print(f"Unique classes in y_test: {len(set(y_test.numpy()))}")

# Adjust target_names and labels for validation report
val_report = classification_report(
    y_val.numpy(),
    y_val_pred,
    labels=list(range(28)),  # Explicitly specify label set
    target_names=[str(i) for i in range(28)],  # Adjust for 28 classes
    zero_division=0  # Avoid undefined metric warnings
)
print("Validation Classification Report for RF:")
print(val_report)

# Test the model on the test data
y_test_pred = rf_model.predict(X_test_baseline)

# Generate a classification report for the test data with corrected labels
test_report = classification_report(
    y_test.numpy(),
    y_test_pred,
    labels=list(range(28)),  # Explicitly specify label set
    target_names=[str(i) for i in range(28)],  # Adjust for 28 classes
    zero_division=0  # Avoid undefined metric warnings
)
print("Test Classification Report for RF:")
print(test_report)


# Initialize the Softmax Regression model
sf_model = LogisticRegression(
    multi_class='multinomial',  # Enables Softmax Regression
    solver='lbfgs',             # Recommended solver for multiclass problems
    max_iter=200,               # Maximum number of iterations
    random_state=42             # Ensures reproducibility
)

# Train the Softmax Regression model on the training data
sf_model.fit(X_train_baseline, y_train.numpy())

# Predict on validation data
y_val_pred_sf = sf_model.predict(X_val_baseline)

# Generate classification report for validation data
val_report_sf = classification_report(
    y_val.numpy(),
    y_val_pred_sf,
    labels=list(range(28)),         # Explicitly specify label set
    target_names=[str(i) for i in range(28)],  # Adjust for 28 classes
    zero_division=0                 # Avoid undefined metric warnings
)
print("Validation Classification Report for Softmax Regression:")
print(val_report_sf)

# Predict on test data
y_test_pred_sf = sf_model.predict(X_test_baseline)

# Generate classification report for test data
test_report_sf = classification_report(
    y_test.numpy(),
    y_test_pred_sf,
    labels=list(range(28)),         # Explicitly specify label set
    target_names=[str(i) for i in range(28)],  # Adjust for 28 classes
    zero_division=0                 # Avoid undefined metric warnings
)
print("Test Classification Report for Softmax Regression:")
print(test_report_sf)

# Uncomment and run on Colab to observe accuracies
# from xgboost import XGBClassifier
# # Initialize the XGBoost model
# xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
# xgb_model.fit(X_train_baseline, y_train.numpy())

# # Predict on validation data
# y_val_pred_xgb = xgb_model.predict(X_val_baseline)

# # Generate classification report for validation data
# val_report_xgb = classification_report(
#     y_val.numpy(),
#     y_val_pred_xgb,
#     labels=list(range(28)),         # Explicitly specify label set
#     target_names=[str(i) for i in range(28)],  # Adjust for 28 classes
#     zero_division=0                 # Avoid undefined metric warnings
# )
# print("Validation Classification Report for XGBoost:")
# print(val_report_xgb)

# # Predict on test data
# y_test_pred_xgb = xgb_model.predict(X_test_baseline)

# # Generate classification report for test data
# test_report_xgb = classification_report(
#     y_test.numpy(),
#     y_test_pred_xgb,
#     labels=list(range(28)),         # Explicitly specify label set
#     target_names=[str(i) for i in range(28)],  # Adjust for 28 classes
#     zero_division=0                 # Avoid undefined metric warnings
# )
# print("Test Classification Report for XGBoost:")
# print(test_report_xgb)
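The reports above pass `labels=list(range(28))` and `zero_division=0`, so classes that are never predicted (or never present) get a score of 0 instead of raising a warning. The following pure-Python sketch illustrates that convention on toy labels; `per_class_report` is a hypothetical helper and not a replacement for scikit-learn's `classification_report`.

```python
def per_class_report(y_true, y_pred, labels, zero_division=0.0):
    """Return {label: (precision, recall, support)} for each requested label."""
    report = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Fall back to `zero_division` when a ratio has an empty denominator.
        precision = tp / (tp + fp) if (tp + fp) else zero_division
        recall = tp / (tp + fn) if (tp + fn) else zero_division
        report[c] = (precision, recall, tp + fn)
    return report

# Toy labels: class 2 is never predicted and class 3 never appears at all.
toy_true = [0, 0, 1, 2]
toy_pred = [0, 1, 1, 1]
toy_report = per_class_report(toy_true, toy_pred, labels=[0, 1, 2, 3])

for label, (precision, recall, support) in toy_report.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} support={support}")
```

Listing every label explicitly keeps the report rows aligned across the validation and test sets even when a rare class is missing from one of them.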

### Task 3: Experiments

### 3.1 Performance of the fine-tuned LLM, the pre-trained LLM, and the baselines (SR, RF, XGBoost, and Naive Bayes)
## Baselines
Metrics of SR:<br>
Accuracy: 0.52<br>
Metrics of RF:<br>
Accuracy: 0.54<br>
Metrics of XGBoost:<br>
Accuracy: 0.56<br>
Metrics of Naive Bayes:<br>
Accuracy: 0.4131093362709997<br>


## LLM
### Metrics of the BERT model before fine-tuning

BERT Model<br>
Test Metrics:<br>
Loss (`eval_loss`): 3.4808437824249268<br>
Accuracy (`eval_accuracy`): 0.004406499586890664<br>
Evaluation time in seconds (`eval_runtime`): 3.5655<br>
Samples evaluated per second (`eval_samples_per_second`): 1018.357<br>
Evaluation steps per second (`eval_steps_per_second`): 127.33<br>
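As a rough reference point for these numbers, the snippet below computes what uniform guessing over the 28 classes would give. It is a minimal sketch that ignores class imbalance; the variable names are illustrative.

```python
import math

num_classes = 28

# Accuracy of picking one of the 28 labels uniformly at random.
chance_accuracy = 1 / num_classes
# Cross-entropy of a model that assigns probability 1/28 to every class.
uniform_loss = math.log(num_classes)

reported_loss = 3.4808437824249268       # eval_loss before fine-tuning
reported_accuracy = 0.004406499586890664  # eval_accuracy before fine-tuning

print(f"chance accuracy: {chance_accuracy:.4f} (reported: {reported_accuracy:.4f})")
print(f"uniform-guess loss: {uniform_loss:.4f} (reported: {reported_loss:.4f})")
```

The reported loss sits slightly above the uniform-guess value and the reported accuracy is below 1/28, which is consistent with a randomly initialized classification head that has not yet been trained.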

### Metrics of the BERT model after fine-tuning

BERT Model<br>
Test Metrics:<br>
Loss (`eval_loss`): 2.5959577560424805<br>
Accuracy (`eval_accuracy`): 0.3395758744147618<br>
Evaluation time in seconds (`eval_runtime`): 3.5926<br>
Samples evaluated per second (`eval_samples_per_second`): 1010.679<br>
Evaluation steps per second (`eval_steps_per_second`): 126.37<br>
Epoch (`epoch`): 3.0<br>

GPT2:<br>
Test Metrics:<br>
Loss (`eval_loss`): 2.628753185272217<br>
Accuracy (`eval_accuracy`): 0.3979619939410631<br>
Evaluation time in seconds (`eval_runtime`): 9.3724<br>
Samples evaluated per second (`eval_samples_per_second`): 387.413<br>
Evaluation steps per second (`eval_steps_per_second`): 387.413<br>
Epoch (`epoch`): 3.0<br>
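The throughput figures can be cross-checked against the evaluation setup. The sketch below recovers the approximate number of evaluated samples and the evaluation batch size from the reported values. It assumes the BERT runs used the `Trainer` default evaluation batch size of 8, since `per_device_eval_batch_size` was only set (to 1) for GPT2.

```python
# (eval_runtime, eval_samples_per_second, eval_steps_per_second) as reported above
runs = {
    "BERT (before fine-tuning)": (3.5655, 1018.357, 127.33),
    "BERT (after fine-tuning)": (3.5926, 1010.679, 126.37),
    "GPT2": (9.3724, 387.413, 387.413),
}

derived = {}
for name, (runtime, samples_per_s, steps_per_s) in runs.items():
    n_samples = round(runtime * samples_per_s)     # samples evaluated in total
    batch_size = round(samples_per_s / steps_per_s)  # samples per evaluation step
    derived[name] = (n_samples, batch_size)
    print(f"{name}: ~{n_samples} samples, batch size ~{batch_size}")
```

All three runs should come out at roughly the size of the test split, with the GPT2 run processing one sample per step, which helps explain its lower throughput.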


