<h2>CS 3780/5780 Creative Project: </h2>
<h3>Emotion Classification of Natural Language</h3>

Names and NetIDs for your group members: James Tu (jt737), Andrew Cheung (aec295)

<h3>Introduction:</h3>

<p> The creative project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The past programming projects provide templates for how to do this (and you can reuse part of your code if you wish), and the lectures provide some of the methods you can use. So, this creative project brings realism to how you will use machine learning in the real world.  </p>

<p>The task you will work on is classifying text by the human emotion it expresses. Through words, we express feelings, articulate thoughts, and communicate our deepest needs and desires. Language helps us interpret the nuances of joy, sadness, anger, and love, allowing us to connect with others on a deeper level. Can you train an ML model that recognizes the human emotions expressed in a piece of text? <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook.</b> </p>
<p>


<h2>Part 0: Basics</h2><p>

<h3>0.1 Import:</h3><p>
Please import the packages you need. Note that learning and using packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [2]:
import os
import pandas as pd
import numpy as np
import torch
# TODO

<h3>0.2 Accuracy and Mean Squared Error:</h3><p>
To measure your performance in the Kaggle Competition, we are using accuracy. As a recap, accuracy is the fraction of labels you predict correctly. You can compute it with library functions from sklearn; a simple example is shown below.
<p>

In [3]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855
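
As a rough cross-check of what `accuracy_score` computes, the sketch below counts the matching positions by hand for the same two lists. The helper name `manual_accuracy` is hypothetical and is not part of scikit-learn.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
acc = manual_accuracy(y_true, y_pred)
print(acc)  # 3 of the 7 labels match
```

Only the last three positions agree, so the result is 3/7, which matches the value returned by `accuracy_score`.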

<h2>Part 1: Basic</h2><p>
Note that your code should be commented well and in part 1.4 you can refer to your comments.

<h3>1.1 Load and preprocess the dataset:</h3><p>
The cell below shows how to load the data in a Kaggle Notebook.
<p>

In [4]:
# train = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/train.csv")
# train_text = train["text"]
# train_label = train["label"]

# test = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/test.csv")
# test_id = test["id"]
# test_text = test["text"]

In [5]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4

# Load the training data
train = pd.read_csv("../data/train.csv")
train_text = train["text"]
train_label = train["label"]

# Load the testing data
test = pd.read_csv("../data/test.csv")
test_id = test["id"]
test_text = test["text"]


In [6]:
train.head()

Unnamed: 0,text,label
0,i interact with on a daily basis either in rea...,1
1,Stranger than fiction. Can't even begin to com...,1
2,i sit here with the aftermath feeling so damn ...,1
3,Great job! Hats off to you.,25
4,i hate you threads posted by people just whini...,9


In [7]:
test.head()

Unnamed: 0,id,text
0,0,im feeling like a hot potato right now
1,1,i feel that are becoming impressed upon my lit...
2,2,id ever held any girls hand but boy did i sure...
3,3,i feel thats when i feel my grief over the bra...
4,4,i feel will never been resolved in a way to ke...


<h3>1.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 0.1.

### SVM Model

In [8]:
from sklearn.svm import SVC

### Naive Bayes Model

In [9]:
from sklearn.naive_bayes import MultinomialNB

### Boost Model

In [10]:
import xgboost as xgb

<h3>1.3 Training, Validation and Model Selection:</h3><p>
You need to either split your data into a training set and a validation set or perform cross-validation for model selection.
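
For the cross-validation option, a minimal sketch of how k folds can be built is given here in plain Python. The function name `k_fold_indices` and the toy sizes are assumptions for illustration; in practice a library routine such as `sklearn.model_selection.KFold` would normally be used instead.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each model would be trained on `train_idx` and scored on `val_idx`, and the k scores averaged. Compared with a single hold-out split, this uses every labeled example for validation once, at the cost of k training runs.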

### First we split our data into a training and validation set

In [11]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the labeled data as a validation set (kept in X_test / y_test below; this is not the Kaggle test set)
X_train, X_test, y_train, y_test = train_test_split(train_text, train_label, test_size=0.2, random_state=42)

### Next we vectorize the training text

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
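
To make the fit/transform distinction above concrete, here is a small pure-Python sketch of TF-IDF weighting. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which is how scikit-learn documents its defaults, but it uses a simplified whitespace tokenizer, no `max_features` cap, and invented toy sentences, so it is an illustration rather than a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_vectors(train_docs, docs):
    """Weight `docs` using idf statistics learned from `train_docs` only."""
    tokenized_train = [doc.lower().split() for doc in train_docs]  # simplified tokenizer (assumption)
    n_docs = len(tokenized_train)
    doc_freq = Counter(term for tokens in tokenized_train for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        # Terms never seen during fitting are dropped.
        counts = Counter(t for t in doc.lower().split() if t in idf)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors


train_docs = ["i feel happy today", "i feel sad", "great job"]  # toy data
val_docs = ["i feel great", "unseen words only"]
val_vectors = tfidf_vectors(train_docs, val_docs)
print(val_vectors[0])
print(val_vectors[1])  # empty: none of its terms were seen during fitting
```

The vocabulary and idf values come from the training documents only, which is why the validation text is passed through `transform` rather than `fit_transform`: statistics from held-out text never leak into the features.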

1.3.1 Training the SVM model

In [13]:
from sklearn.metrics import accuracy_score
# Train an SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_tfidf, y_train)

y_pred = svm_model.predict(X_test_tfidf)
accuracy_svm = accuracy_score(y_test, y_pred)

print(f'SVM Accuracy: {accuracy_svm}')

SVM Accuracy: 0.7005


1.3.2 Training the Naive Bayes Model

In [14]:
# Train a Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Predict and calculate accuracy for Naive Bayes model
y_pred_nb = nb_model.predict(X_test_tfidf)
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f'Naive Bayes Accuracy: {accuracy_nb}')


Naive Bayes Accuracy: 0.4525


1.3.3 Training the Boost Model

In [15]:
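# Train an XGBoost model
# Note: this XGBoost version ignores use_label_encoder (see the warning in the output), so the argument can be dropped
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')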
xgb_model.fit(X_train_tfidf, y_train)

# Predict and calculate accuracy for XGBoost model
y_pred_xgb = xgb_model.predict(X_test_tfidf)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f'XGBoost Accuracy: {accuracy_xgb}')

Parameters: { "use_label_encoder" } are not used.



XGBoost Accuracy: 0.705
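
One way to read the three validation accuracies printed above is to compare the gaps between models with the sampling noise of the estimate. The sketch below uses the binomial standard error `sqrt(p * (1 - p) / n)`. The validation-set size `n_val = 2000` is an assumption: the notebook does not state it, although it is consistent with a 20% hold-out and the printed accuracies.

```python
import math

# Validation accuracies as printed by the three training cells.
val_accuracy = {"SVM": 0.7005, "Naive Bayes": 0.4525, "XGBoost": 0.705}
n_val = 2000  # ASSUMPTION: validation-set size, not stated in the notebook

best_name = max(val_accuracy, key=val_accuracy.get)
best_acc = val_accuracy[best_name]
std_err = math.sqrt(best_acc * (1 - best_acc) / n_val)  # binomial standard error

for name, acc in sorted(val_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    gap = best_acc - acc
    verdict = "within one standard error" if gap <= std_err else "more than one standard error behind"
    print(f"{name}: {acc:.4f} (gap {gap:.4f}, {verdict})")
```

Under that assumption, XGBoost and the linear SVM are separated by less than one standard error, so a single split does not clearly rank them, while Naive Bayes is well behind both.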


<h3>1.4 Explanation in Words:</h3><p>
You need to answer the following questions in the markdown cell after this cell:

1.4.1 How did you formulate the learning problem?

1.4.2 Which two learning methods from class did you choose and why did you make these choices?

1.4.3 How did you do the model selection?

1.4.4 Does the test performance reach the first baseline "Tiny Piney"? (Please include a screenshot of Kaggle Submission)

<h2>Part 2: Be creative!</h2><p>

<h3>2.1 Open-ended Code:</h3><p>
You may follow the steps in part 1 again, but make innovative changes such as using new training algorithms. Make sure you explain everything clearly in part 2.2. Note that beating "Zero Hero" is only a small portion of this part. Any creative idea will receive most of the points as long as it is reasonable and clearly explained.

### BERT

In [16]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.2

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(set(train_label)))

# Tokenize the data
train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(X_test.tolist(), truncation=True, padding=True, max_length=128)

class EmotionDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = EmotionDataset(train_encodings, y_train.tolist())
test_dataset = EmotionDataset(test_encodings, y_test.tolist())

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
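
The warning above says the classification head starts from random weights, so before fine-tuning the model's predictions are roughly uniform over the classes. For a uniform prediction over K classes the cross-entropy loss is ln(K), which gives a quick sanity check for the first logged training loss. The class count of 28 below is an assumption: the notebook computes it as `len(set(train_label))` but never prints it, and label values up to 27 appear in the outputs.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction: -ln(1 / K) = ln(K)."""
    return math.log(num_classes)


num_labels = 28  # ASSUMPTION: number of emotion classes, not printed in the notebook
expected_initial_loss = uniform_cross_entropy(num_labels)
print(f"{expected_initial_loss:.4f}")  # 3.3322
```

A first logged loss near this value (the training log starts at 3.3425) suggests the label encoding and `num_labels` are at least plausible. A much larger starting loss would hint at a mismatch.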


### Train BERT model

In [17]:
# Train the model
trainer.train()

# Generate predictions for the held-out validation split (test_dataset wraps X_test / y_test, not the Kaggle test set)
predictions = trainer.predict(test_dataset)
bert_predictions = predictions.predictions.argmax(-1)
accuracy_bert = accuracy_score(y_test, bert_predictions)
print(f'BERT Accuracy: {accuracy_bert}')

  0%|          | 0/1500 [00:00<?, ?it/s]

{'loss': 3.3425, 'grad_norm': 7.687621593475342, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.02}
{'loss': 3.2893, 'grad_norm': 6.8066325187683105, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.04}
{'loss': 3.2764, 'grad_norm': 8.257126808166504, 'learning_rate': 3e-06, 'epoch': 0.06}
{'loss': 3.2433, 'grad_norm': 7.525886058807373, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.08}
{'loss': 3.2301, 'grad_norm': 8.627923011779785, 'learning_rate': 5e-06, 'epoch': 0.1}
{'loss': 3.1363, 'grad_norm': 7.310774803161621, 'learning_rate': 6e-06, 'epoch': 0.12}
{'loss': 3.1165, 'grad_norm': 6.796634674072266, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.14}
{'loss': 2.9596, 'grad_norm': 7.731834411621094, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.16}
{'loss': 2.8239, 'grad_norm': 8.486319541931152, 'learning_rate': 9e-06, 'epoch': 0.18}
{'loss': 2.7857, 'grad_norm': 7.00655460357666, 'learning_rate': 1e-05, 'epoch': 0.2}
{'loss': 2.6353, 'grad_norm': 5.38925123

KeyboardInterrupt: 
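
The learning rates in the log above rise by 1e-06 every 10 steps, which is what a linear warmup over `warmup_steps=500` produces. The sketch below reproduces those values under two assumptions: a base learning rate of 5e-05 (not set in `TrainingArguments` above, so taken to be the library default) and a linear decay to zero over the 1500 total steps shown in the progress bar. The function name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=1500):
    """Linear warmup to base_lr, then linear decay to zero (sketch)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


for step in (10, 20, 100, 500, 1000, 1500):
    print(step, linear_warmup_lr(step))
```

With these settings a third of the run is spent warming up. Since the run was interrupted around step 100, the model had only been trained at a small fraction of the peak learning rate.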

<h3>2.2 Explanation in Words:</h3><p>
You need to answer the following questions in a markdown cell after this cell:

2.2.1 How much did you manage to improve performance on the test set? Did you beat "Zero Hero" in Kaggle? (Please include a screenshot of Kaggle Submission)

2.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 3: Kaggle Submission</h2><p>
You need to generate a prediction CSV from your trained model using the following cell and submit the direct output of your code to Kaggle. The results should be presented in two columns in CSV format: the first column is the data id (0-14999) and the second column contains the predictions for the test set. The first column must be named id and the second column must be named label (otherwise your submission will fail). A sample prediction file can be downloaded from Kaggle for each problem.
The cell below shows how to save a CSV file if you are running the Notebook on Kaggle.

In [11]:
# id = range(15000)
# prediction = range(15000)
# submission = pd.DataFrame({'id': id, 'label': prediction})
# submission.to_csv('/kaggle/working/submission.csv', index=False)

In [17]:
# Vectorize the Kaggle test text once, reusing the TF-IDF vocabulary fitted on the training split
test_tfidf = vectorizer.transform(test_text)
svm_predictions = svm_model.predict(test_tfidf)
nb_predictions = nb_model.predict(test_tfidf)
xgb_predictions = xgb_model.predict(test_tfidf)

# Save predictions to CSV, taking the ids from test.csv instead of a hard-coded range
# (this also avoids shadowing the built-in id)
svm_submission = pd.DataFrame({'id': test_id, 'label': svm_predictions})
nb_submission = pd.DataFrame({'id': test_id, 'label': nb_predictions})
xgb_submission = pd.DataFrame({'id': test_id, 'label': xgb_predictions})

svm_submission.to_csv('../submission/svm_predictions.csv', index=False)
nb_submission.to_csv('../submission/nb_predictions.csv', index=False)
xgb_submission.to_csv('../submission/xgb_predictions.csv', index=False)

[27 16 21 ... 12  1  1]


In [16]:

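# Save BERT predictions to CSV: predict on the Kaggle test text (not the validation split); dummy labels only satisfy EmotionDataset
kaggle_encodings = tokenizer(test_text.tolist(), truncation=True, padding=True, max_length=128)
kaggle_dataset = EmotionDataset(kaggle_encodings, [0] * len(test_text))
bert_predictions = trainer.predict(kaggle_dataset).predictions.argmax(-1)
bert_submission = pd.DataFrame({'id': test_id, 'label': bert_predictions})
bert_submission.to_csv('../submission/bert_submission.csv', index=False)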

(15000,)
(15000, 2)
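
Because a submission fails if the columns are not named `id` and `label`, it can help to check the file before uploading. The sketch below is a hypothetical helper (`check_submission` is not part of Kaggle or pandas) that uses only the standard library. The requirement that ids run 0, 1, 2, ... in order is an assumption based on the id range given in the instructions.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["id", "label"]:
        return ["header must be exactly: id,label"]
    body = [row for row in rows[1:] if row]  # ignore blank lines
    problems = []
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    ids = [row[0] for row in body]
    if ids != [str(i) for i in range(len(body))]:  # ASSUMPTION: ids are 0..n-1 in order
        problems.append("ids must run 0, 1, 2, ... in order")
    return problems


good = "id,label\n0,27\n1,16\n2,21\n"  # toy example
bad = "index,prediction\n0,27\n"
print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=1))
```

For a real file, the text could be read with `open(path).read()` and checked with `expected_rows=15000`.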


<h2>Part 4: Resources and Literature Used</h2><p>

Please cite the papers and open resources you used.