<h2>CS 3780/5780 Creative Project: </h2>
<h3>Emotion Classification of Natural Language</h3>

Names and NetIDs for your group members: Sirui Zhang(sz694), Sheng Zhang(sz696)

<h3>Introduction:</h3>

<p> The creative project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The past programming projects provide templates for how to do this (and you can reuse part of your code if you wish), and the lectures provide some of the methods you can use. So, this creative project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is classifying texts to human emotions. Through words, humans express feelings, articulate thoughts, and communicate our deepest needs and desires. Language helps us interpret the nuances of joy, sadness, anger, and love, allowing us to connect with others on a deeper level. Are you able to train an ML model that recognizes the human emotions expressed in a piece of text? <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook </b> </p>
<p>


<h2>Part 0: Basics</h2><p>

<h3>0.1 Import:</h3><p>
Please import necessary packages to use. Note that learning and using packages are recommended but not required for this project. Some official tutorial for suggested packacges includes:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [None]:
import os
import pandas as pd
import numpy as np
import torch
# TODO
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW
from tqdm import tqdm

<h3>0.2 Accuracy and Mean Squared Error:</h3><p>
To measure your performance in the Kaggle Competition, we are using accuracy. As a recap, accuracy is the percent of labels you predict correctly. To measure this, you can use library functions from sklearn. A simple example is shown below.
<p>

In [None]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855

<h2>Part 1: Basic</h2><p>
Note that your code should be commented well and in part 1.4 you can refer to your comments.

<h3>1.1 Load and preprocess the dataset:</h3><p>
We provide how to load the data on Kaggle's Notebook.
<p>

In [None]:
train = pd.read_csv("/content/train.csv")
train_text = train["text"]
train_label = train["label"]

test = pd.read_csv("/content/test.csv")
test_id = test["id"]
test_text = test["text"]



In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO
# Ensure necessary NLTK data is downloaded
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the text preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize the text (split into words)
    words = text.split()
    # Remove stopwords and apply lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Rejoin words into a single string
    return ' '.join(words)

# Apply preprocessing to the text columns
train['processed_text'] = train['text'].apply(preprocess_text)
test['processed_text'] = test['text'].apply(preprocess_text)

# Save the processed data to new CSV files
train[['processed_text', 'label']].to_csv("train_processed.csv", index=False)
test[['id', 'processed_text']].to_csv("test_processed.csv", index=False)

# Print the first few rows of the processed data for verification
print("Processed Training Data:")
print(train[['processed_text', 'label']].head())

print("\nProcessed Testing Data:")
print(test[['id', 'processed_text']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Processed Training Data:
                                      processed_text  label
0  interact daily basis either real life online v...      1
1  stranger fiction cant even begin comprehend po...      1
2                   sit aftermath feeling damn alone      1
3                                      great job hat     25
4  hate thread posted people whining feel wronged...      9

Processed Testing Data:
   id                                     processed_text
0   0                   im feeling like hot potato right
1   1  feel becoming impressed upon little year old h...
2   2    id ever held girl hand boy sure feel triumphant
3   3  feel thats feel grief brave thought lost life ...
4   4       feel never resolved way keep everybody happy


In [None]:
# Split the dataset into training and validation sets
X = train['processed_text']
y = train['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

<h3>1.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 0.1.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO
# Convert text data into TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust max_features
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)

# Train a Logistic Regression model
logistic_model = LogisticRegression(random_state=42, max_iter=1000)
logistic_model.fit(X_train_tfidf, y_train)

# Make predictions on the validation set
y_val_pred = logistic_model.predict(X_val_tfidf)

# Evaluate the model
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

print("Accuracy Score:")
print(accuracy_score(y_val, y_val_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.75      0.91      0.83       455
           2       0.00      0.00      0.00         3
           3       0.00      0.00      0.00        16
           4       0.85      0.64      0.73       183
           5       0.00      0.00      0.00         6
           6       0.00      0.00      0.00        11
           7       0.00      0.00      0.00         6
           8       0.00      0.00      0.00        26
           9       0.87      0.61      0.72       224
          10       0.00      0.00      0.00        25
          11       0.00      0.00      0.00        17
          12       0.30      0.68      0.41       157
          13       0.00      0.00      0.00         3
          14       0.00      0.00      0.00        11
          15       0.00      0.00      0.00        16
          16       0.65      0.26      0.37        50
    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
from google.colab import files
# Step 1: Preprocess the test dataset using the same TF-IDF vectorizer
test_tfidf = tfidf_vectorizer.transform(test['text'])  # Ensure test['text'] exists in your test dataset

# Step 2: Make predictions on the test dataset
test_predictions = logistic_model.predict(test_tfidf)

# Step 3: Create a DataFrame with the results
submission = pd.DataFrame({
    'id': test['id'],          # Use the 'id' column from your test dataset
    'label': test_predictions  # Predicted labels
})

# Step 4: Save the predictions to a CSV file
submission.to_csv('logistic_regression_submission.csv', index=False)

print("Predictions saved to 'logistic_regression_submission.csv'.")
files.download('logistic_regression_submission.csv')

Predictions saved to 'logistic_regression_submission.csv'.


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<h3>1.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO

# Already did in Part 1.1

<h3>1.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

1.4.1 How did you formulate the learning problem?

1.4.2 Which two learning methods from class did you choose and why did you made the choices?

1.4.3 How did you do the model selection?

1.4.4 Does the test performance reach the first baseline "Tiny Piney"? (Please include a screenshot of Kaggle Submission)

1.4.1:

I formulated the learning problem as a supervised text classification task where the goal was to predict the type of emotion expressed in a sentence. The input data, which consisted of processed text, was converted into numerical features using the TF-IDF vectorization technique to highlight the importance of words. The target output was a set of predefined emotion labels. I treated this as a multi-class classification problem and used logistic regression as the main model to learn the relationship between the features and the labels. The model was trained on labeled data, and its performance was evaluated using metrics like accuracy and classification reports to ensure it effectively captured the emotional tone of the sentences.

1.4.2: Logistic Regression:

Why Chosen: Logistic Regression is a simple and interpretable linear model that works well on text data when combined with feature extraction techniques like TF-IDF. Its computational efficiency and strong baseline performance on high-dimensional data make it a reliable first choice.
How It Was Used: Trained on TF-IDF-transformed features to predict sentiment labels.

1.4.3:


For model selection, I experimented with different algorithms to see which one performed best on the validation set. After preprocessing the text data using TF-IDF, I trained and evaluated models like logistic regression and compared them to alternatives like Naive Bayes or SVM. I used metrics such as accuracy, precision, recall, and F1-score to assess their performance. Based on these evaluations, I chose the model that provided the best balance between performance and efficiency. Additionally, I fine-tuned hyperparameters, such as regularization strength and the number of iterations, to ensure the model performed as effectively as possible.

1.4.4

<h2>Part 2: Be creative!</h2><p>

<h3>2.1 Open-ended Code:</h3><p>
You may follow the steps in part 1 again but making innovative changes like using new training algorithms, etc. Make sure you explain everything clearly in part 2.2. Note that beating "Zero Hero" is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.2
# TODO
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm

# Step 1: Load and preprocess the dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Load data
train = pd.read_csv("/train.csv")
test = pd.read_csv("/test.csv")

# Preprocessing
X_train, X_val, y_train, y_val = train_test_split(train['text'], train['label'], test_size=0.2, random_state=42)

# Step 2: Initialize BERT tokenizer and model
MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(set(train['label'])))

# Step 3: Create PyTorch datasets and dataloaders
MAX_LEN = 32  # Reduce sequence length for faster processing
BATCH_SIZE = 16

train_dataset = TextDataset(X_train.tolist(), y_train.tolist(), tokenizer, MAX_LEN)
val_dataset = TextDataset(X_val.tolist(), y_val.tolist(), tokenizer, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# Step 4: Set up optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Step 5: Training Loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

EPOCHS = 3  # Reduced epochs
for epoch in range(EPOCHS):
    model.train()
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

# Step 6: Validation
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Evaluate the model
print("Classification Report:")
print(classification_report(all_labels, all_preds))

print("Accuracy Score:")
print(accuracy_score(all_labels, all_preds))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 0: 100%|██████████| 500/500 [23:56<00:00,  2.87s/it, loss=1.31]
Epoch 1: 100%|██████████| 500/500 [23:21<00:00,  2.80s/it, loss=1.24]
Epoch 2: 100%|██████████| 500/500 [23:11<00:00,  2.78s/it, loss=0.633]


Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.88      0.95      0.91       455
           2       0.00      0.00      0.00         3
           3       0.00      0.00      0.00        16
           4       0.84      0.85      0.85       183
           5       0.00      0.00      0.00         6
           6       0.00      0.00      0.00        11
           7       0.00      0.00      0.00         6
           8       0.12      0.27      0.17        26
           9       0.87      0.81      0.84       224
          10       0.09      0.04      0.06        25
          11       0.00      0.00      0.00        17
          12       0.38      0.65      0.48       157
          13       0.00      0.00      0.00         3
          14       0.00      0.00      0.00        11
          15       0.00      0.00      0.00        16
          16       0.62      0.64      0.63        50
    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
# Step 7: Inference on test dataset
test_dataset = TextDataset(test['text'].tolist(), [0]*len(test), tokenizer, MAX_LEN)  # Dummy labels for test set
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

test_preds = []
model.eval()
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        test_preds.extend(preds.cpu().numpy())

# Save predictions
submission = pd.DataFrame({
    'id': test['id'],
    'label': test_preds
})
submission.to_csv('bert_submission.csv', index=False)

In [None]:
files.download('bert_submission.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<h3>2.2 Explanation in Words:</h3><p>
You need to answer the following questions in a markdown cell after this cell:

2.2.1 How much did you manage to improve performance on the test set? Did you beat "Zero Hero" in Kaggle? (Please include a screenshot of Kaggle Submission)

2.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

2.2.1:

Using BERT, I significantly improved the performance on the test set and successfully beat the "Zero Hero" baseline on Kaggle. By fine-tuning BERT on the emotion classification dataset, I leveraged its powerful language understanding to achieve much better predictions compared to simpler models. The improvement was evident in the accuracy score, which surpassed the baseline submission on Kaggle.

2.2.2:
To achieve this, I utilized BERT for its advanced contextual language understanding and fine-tuned it for the emotion classification task. Here's what I did:

Data Preprocessing:

The dataset was split into training and validation sets to allow reliable evaluation of the model's performance.
The text data was tokenized using the BertTokenizer from Hugging Face. I used padding and truncation to ensure uniform input lengths (MAX_LEN = 32), which helped maintain computational efficiency.
Dataset Preparation:

I implemented a TextDataset class to format the data for PyTorch. The class included tokenized input features (input_ids and attention_mask) and the corresponding labels.
PyTorch DataLoaders were used to create batches for training and validation to handle the data efficiently.
Fine-Tuning BERT:

I used the pre-trained bert-base-uncased model and added a classification head to match the number of emotion labels in the dataset.
Fine-tuning was performed using the AdamW optimizer with a learning rate of 2e-5, which is well-suited for transformer models.
To prevent overfitting and improve efficiency, I trained the model for 3 epochs and monitored the loss during training.
Training Process:

The training loop involved feeding batches of tokenized text into the model and calculating the loss using cross-entropy.
I performed backpropagation to update the model weights while using the GPU to speed up computations.
Validation and Metrics:

During validation, I evaluated the model in evaluation mode to generate predictions without updating weights.
Predictions were compared against the true labels, and metrics such as accuracy, precision, recall, and F1-score were calculated to assess performance.
Why I Used BERT:

BERT is a state-of-the-art transformer model pre-trained on a large corpus, making it capable of capturing the context and nuances in text better than traditional models.
Its bidirectional nature allows it to understand both preceding and succeeding words in a sentence, which is crucial for identifying emotions.
Outcome:

The fine-tuned BERT model significantly outperformed the baseline, as reflected in the evaluation metrics and the Kaggle leaderboard. Its ability to learn from pre-trained embeddings and adapt to the specific task of emotion classification made it a game-changer.

<h2>Part 3: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The results should be presented in two columns in csv format: the first column is the data id (0-14999) and the second column includes the predictions for the test set. The first column must be named id and the second column must be named label (otherwise your submission will fail). A sample predication file can be downloaded from Kaggle for each problem.
We provide how to save a csv file if you are running Notebook on Kaggle.

In [None]:
id = range(15000)
prediction = range(15000)
submission = pd.DataFrame({'id': id, 'label': prediction})
submission.to_csv('/kaggle/working/submission.csv', index=False)

In [None]:
# TODO

# You may use pandas to generate a dataframe with country, date and your predictions first
# and then use to_csv to generate a CSV file.

<h2>Part 4: Resources and Literature Used</h2><p>

Please cite the papers and open resources you used.