<h2>CS 3780/5780 Creative Project: </h2>
<h3>Emotion Classification of Natural Language</h3>

Names and NetIDs for your group members: Alexia Adams (aa862),
Matthew Mentis-Cort (mam692)

<h3>Introduction:</h3>

<p> The creative project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The past programming projects provide templates for how to do this (and you can reuse part of your code if you wish), and the lectures provide some of the methods you can use. So, this creative project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is classifying texts to human emotions. Through words, humans express feelings, articulate thoughts, and communicate our deepest needs and desires. Language helps us interpret the nuances of joy, sadness, anger, and love, allowing us to connect with others on a deeper level. Are you able to train an ML model that recognizes the human emotions expressed in a piece of text? <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook </b> </p>
<p>


<h2>Part 0: Basics</h2><p>

<h3>0.1 Import:</h3><p>
Please import the packages you need. Note that learning and using packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [1]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
import re

<h3>0.2 Accuracy and Mean Squared Error:</h3><p>
To measure your performance in the Kaggle Competition, we are using accuracy. As a recap, accuracy is the percent of labels you predict correctly. To measure this, you can use library functions from sklearn. A simple example is shown below. 
<p>

In [2]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855
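
As a sanity check on the number above, accuracy can also be computed by hand as the fraction of positions where the prediction equals the label. This is a minimal sketch on the same toy labels, not a replacement for `accuracy_score`:

```python
import numpy as np

# Same toy labels as in the cell above
y_pred = np.array([3, 2, 1, 0, 1, 2, 3])
y_true = np.array([0, 1, 2, 3, 1, 2, 3])

# Accuracy is the fraction of positions where prediction and label agree
matches = (y_pred == y_true)
manual_accuracy = matches.mean()  # 3 of the 7 positions match

print(matches.sum(), "of", len(matches), "correct")
print(manual_accuracy)
```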

<h2>Part 1: Basic</h2><p>
Note that your code should be commented well and in part 1.4 you can refer to your comments.

<h3>1.1 Load and preprocess the dataset:</h3><p>
We provide how to load the data on Kaggle's Notebook.
<p>

In [3]:
# train = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/train.csv")
train = pd.read_csv("train.csv")
train_text = train["text"]
train_label = train["label"]

# test = pd.read_csv("/kaggle/input/cs-3780-5780-how-do-you-feel/test.csv")
test = pd.read_csv("test.csv")
test_id = test["id"]
test_text = test["text"]

We first take a look at the data we're given to decide how to put it
into our model.

In [4]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4

# Change all letters into lowercase
# def preprocess_words(text):
#   text = text.lower()
#   text = re.sub("[^\w\s]", " ", text)
#   text = text.split()
#   return text

# print(preprocess_words("Hello user123 245 doggies!...!hsdikj"))
train

Unnamed: 0,text,label
0,i interact with on a daily basis either in rea...,1
1,Stranger than fiction. Can't even begin to com...,1
2,i sit here with the aftermath feeling so damn ...,1
3,Great job! Hats off to you.,25
4,i hate you threads posted by people just whini...,9
...,...,...
9995,im feeling so shy,4
9996,Honestly if they were so worried about the tub...,20
9997,Don't wear out our [NAME]. We need him if this...,10
9998,Happy new year!,19


Since our inputs are sentences and we want to predict labels,
it makes sense to use a "bag of words" representation of
the data. Each sentence is represented by a
vector whose length is the number of words in
the vocabulary. Each entry is the number of times the
corresponding word occurs in the sentence (0 if it does not occur).

In [5]:
# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X1 = vectorizer.fit_transform(train_text.to_list())
X1 = X1.toarray()
# print(vectorizer.get_feature_names_out())
# print(X1)

test_word_counts = vectorizer.transform(test_text.to_list())
test_word_counts = test_word_counts.toarray()
# print(test_word_counts.shape)
# print(test_word_counts)
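
To make the representation concrete, here is a small pure-Python sketch of count vectors versus 0/1 indicator vectors. The two-sentence corpus and the whitespace tokenizer are assumptions for illustration only; `CountVectorizer` tokenizes differently (for example, its default pattern drops one-letter tokens). With default settings `CountVectorizer` stores counts, and passing `binary=True` would give the 0/1 variant instead.

```python
from collections import Counter

# Hypothetical two-sentence corpus, for illustration only
corpus = ["the cat sat on the mat", "the dog"]

# Simplified tokenizer: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in corpus]
vocab = sorted(set(word for doc in tokenized for word in doc))

def count_vector(tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

count_rows = [count_vector(doc) for doc in tokenized]
# Indicator version: 1 if the word occurs at all, 0 otherwise
binary_rows = [[1 if c > 0 else 0 for c in row] for row in count_rows]

print(vocab)
print(count_rows[0])   # "the" occurs twice in the first sentence
print(binary_rows[0])
```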


<h3>1.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 0.1.

<h5>Naive Bayes</h5>

We first try a Naive Bayes model for classifying the sentences.
It assumes that the words in a sentence are conditionally
independent given the label, which is a reasonable simplification
for bag-of-words features, and it directly estimates the
probability of each word given a class.

In [6]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# Naive Bayes

def naivebayesPY(x,y):
    """
    function probs = naivebayesPY(x,y);

    Computation of P(Y)
    Input:
        x : n input vectors of d dimensions (n,d)
        y : n labels (one of 0 to 27) (n,)

    Output:
    probs: [p(y=0), ..., p(y=27)], list of 28 class priors
    """
    
    n = float(len(y))
    probs = []
    for i in range(28):
        probs.append(np.count_nonzero(y == i) / n)
    return probs

# probs = naivebayesPY(X1,train_label)
# print(probs)

In [7]:
# def naivebayesPXY_mle(x,y):
#     """
#     function [posprob,negprob] = naivebayesPXY(x,y);
    
#     Computation of P(X|Y) -- Maximum Likelihood Estimate
#     Input:
#         x : n input vectors of d dimensions (n,d)
#         y : n labels (-1 or +1) (n,)
    
#     Output:
#     labelprobs: list of probability vectors of p(x|y=c) for c = 0..27
#     """
    
#     # MLE = num of times letter x occurs in examples of class y / num of training examples in class y
        
#     # indices of positive and negative examples
#     labelprobs = []
#     for i in range(28):
#         indices = np.argwhere(y == i).flatten()
#         mat = x[indices]
#         word_num = np.sum(mat, axis = 0)
#         my = np.sum(mat)
#         prob = word_num / my
#         labelprobs.append(prob)
    
#     return labelprobs

We then apply smoothing to avoid 0 probabilities:

In [8]:
def naivebayesPXY_smoothing(x,y):
    """
    function labelprobs = naivebayesPXY_smoothing(x,y);
    
    Computation of P(X|Y) -- Smoothing with Laplace estimate
    Input:
        x : n input vectors of d dimensions (n,d)
        y : n labels (one of 0 to 27) (n,)
    
    Output:
    labelprobs: list of 28 probability vectors p(x|y=c) for c = 0..27, each (d,)
    """

    labelprobs = []
    for i in range(28):
        indices = np.argwhere(y == i).flatten()
        mat = x[indices]
        my = np.sum(mat)
        counts = np.sum(mat, axis = 0)
        sz = x.shape[1]
        prob = (counts + 1) / (my + sz)
        labelprobs.append(prob)
    
    return labelprobs


# labelprobs_smooth = naivebayesPXY_smoothing(X1,train_label)
# print(labelprobs_smooth[0])
# print(np.log(labelprobs_smooth[0]))
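
The effect of the `(counts + 1) / (my + sz)` line is easiest to see on a tiny example. The counts below are made up for illustration: one class, a three-word vocabulary, and one word that never appears in that class.

```python
import numpy as np

# Hypothetical word counts for one class over a 3-word vocabulary
counts = np.array([3, 0, 1])
total = counts.sum()          # 4 words observed in this class
vocab_size = counts.shape[0]  # 3 words in the vocabulary

# Maximum likelihood estimate: the unseen word gets probability 0,
# so its log-probability would be -inf
mle = counts / total

# Laplace estimate: add one pseudo-count per vocabulary word
smoothed = (counts + 1) / (total + vocab_size)

print(mle)
print(smoothed)
print(smoothed.sum())
```

The smoothed vector still sums to 1, and every entry is strictly positive, so taking its logarithm is safe.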

In [9]:
def naivebayes(x,y,xtest,naivebayesPXY):
    """
    function logratios = naivebayes(x,y,xtest,naivebayesPXY);
    
    Computation of the log-odds of each class for one test point using Bayes Rule
    Input:
    x : n input vectors of d dimensions (n,d)
    y : n labels (one of 0 to 27) (n,)
    xtest: input vector of d dimensions (d,)
    naivebayesPXY: input function for getting conditional probabilities (naivebayesPXY_smoothing)
    
    Output:
    logratios: list of 28 values, log (P(Y = c|X=xtest)/P(Y != c|X=xtest)) for c = 0..27
    """

    labels_Y = naivebayesPY(x, y)   
    labelprobs = naivebayesPXY(x, y)
    
    # log joint per class: log P(Y=c) + sum over words of (count of word) * log P(word | Y=c)
    log_joint = np.array([
        np.log(labels_Y[i]) + np.sum(np.multiply(xtest, np.log(labelprobs[i])))
        for i in range(28)
    ])

    logratios = []
    for i in range(28):
        # P(Y != i, x) is a SUM of joint probabilities over the other classes,
        # so in log space it needs log-sum-exp rather than a sum of logs
        others = np.delete(log_joint, i)
        denominator = np.logaddexp.reduce(others)
        logratios.append(log_joint[i] - denominator)
    
    return logratios

# p_sm = naivebayes(X1, train_label, X1[0,:], naivebayesPXY_smoothing)
# print(p_sm)
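
Because each class score is a sum of many log-probabilities, the scores are large negative numbers, and combining several classes means adding probabilities, not logs. The sketch below uses made-up log joint scores for three classes to show why this is done with a log-sum-exp; the values and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical log joint scores log P(Y=c) + log P(x | Y=c) for 3 classes
log_joint = np.array([-1050.0, -1052.0, -1060.0])

# Direct exponentiation underflows to 0.0 for scores this negative
print(np.exp(log_joint))

# log-sum-exp computes log(sum_c exp(log_joint[c])) without underflow
log_evidence = np.logaddexp.reduce(log_joint)
log_posterior = log_joint - log_evidence
posterior = np.exp(log_posterior)

# Log-odds of class 0 against all other classes combined
log_odds_0 = log_joint[0] - np.logaddexp.reduce(log_joint[1:])

print(posterior)
print(log_odds_0)
```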

In [10]:
def naivebayesCL(x,y,naivebayesPXY):
    """
    function [ws,bs]=naivebayesCL(x,y,naivebayesPXY);
    Implementation of a multiclass Naive Bayes classifier with one linear score per class
    Input:
    x : n input vectors of d dimensions (n,d)
    y : n labels (one of 0 to 27) (n,)
    naivebayesPXY: input function for getting conditional probabilities (naivebayesPXY_smoothing OR naivebayesPXY_mle)

    Output:
    ws : list of 28 weight vectors, each of d dimensions (d,)
    bs : list of 28 biases (scalars)
    """
    
    n, d = x.shape
    
    # bias
    probs_Y = naivebayesPY(x, y)
    bs = []
    for i in range(28):
        numer = probs_Y[i]
        denom = 1 - probs_Y[i]
        b = np.log(numer / denom)
        bs.append(b)
    
    # weight (vector of d dimensions)
    pxy_probs = naivebayesPXY(x, y)
    ws = []
    for i in range(28):
        numerator = pxy_probs[i]
        denominator = 0
        for j in range(28):
           if i != j:
             denominator = denominator + pxy_probs[j]  
        w = np.log(numerator / denominator)
        ws.append(w)
        
    return ws, bs


# weights_sm,biases_sm = naivebayesCL(X1,train_label, naivebayesPXY_smoothing)

In [11]:
def classifyLinear(x,ws,bs):
    """
    function preds=classifyLinear(x,ws,bs);
    
    Make predictions with a multiclass linear classifier: each input is assigned the class with the highest score.
    Input:
    x : n input vectors of d dimensions (n,d)
    ws : 28 weight vectors of d dimensions, one per class (28,d)
    bs : 28 biases, one per class (28,)
    
    Output:
    preds: predicted class labels (n,)
    """
    ws = np.array(ws)
    bs = np.array(bs)
    # print(x.shape)
    # print(ws.shape)
    # print(bs.shape)
    val = np.argmax(x @ ws.T + bs, axis=1)
    return val
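
The scoring step `x @ ws.T + bs` followed by `argmax` can be checked by hand on a tiny case. The numbers below are made up (2 inputs, 3 features, 2 classes instead of 28) purely to show the shapes involved.

```python
import numpy as np

# Hypothetical inputs: 2 examples with 3 features each
x = np.array([[1, 0, 2],
              [0, 3, 0]])
# One weight vector and one bias per class (2 classes here)
ws = np.array([[0.5, -1.0, 0.2],
               [-0.5, 1.0, 0.0]])
bs = np.array([0.0, 0.1])

# scores[i, c] is the score of class c for example i; shape (2, 2)
scores = x @ ws.T + bs
# Each example is assigned the class with the highest score
preds = np.argmax(scores, axis=1)

print(scores)
print(preds)
```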

<h5>Logistic Regression Model</h5>

We then try a logistic regression model, since
logistic regression is designed for predicting categorical
outputs. Because we have more than two classes,
we use multiclass logistic regression rather than
the binary version.

In [12]:
# Logistic Regression model

# goal: use logistic regression to model y (labels) as a function of x (text)
# input: vector with one entry for each word in the vocabulary,
# holding the number of times that word occurs in the sentence
# output: label of text
log_reg_model = LogisticRegression(random_state=42, max_iter=10000)
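
For intuition, multiclass (multinomial) logistic regression turns one linear score per class into class probabilities with a softmax. The sketch below illustrates that idea on made-up scores; it is not a description of scikit-learn's internals, whose exact multiclass formulation depends on the solver and library version.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical linear scores for 2 examples and 3 classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)

# The predicted class is the one with the highest probability
pred = probs.argmax(axis=1)

print(probs)
print(pred)
```

Equal scores give a uniform distribution, and the ordering of the probabilities always matches the ordering of the scores.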

<h3>1.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [13]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4

# split the labeled data into training (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(X1, train_label, test_size=0.3, random_state=42)
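
Conceptually, a holdout split shuffles the row indices once and reserves a fixed fraction for validation. The helper below, `holdout_split`, is a hypothetical NumPy sketch of that idea; it does not reproduce the exact shuffling of `train_test_split`, so the same seed will select different rows.

```python
import numpy as np

def holdout_split(n_samples, val_fraction=0.3, seed=42):
    """Return (train_idx, val_idx) as disjoint arrays of row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)          # shuffled row indices
    n_val = int(round(val_fraction * n_samples))
    return order[n_val:], order[:n_val]

# Toy example with 10 rows: 7 go to training, 3 to validation
train_idx, val_idx = holdout_split(10)

print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split reproducible, which matters when two models are compared on the same validation rows.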

We first trained the Naive Bayes model on the training data:

In [14]:
# train Naive Bayes model on splits
p_sm = naivebayes(X_train, y_train, X_train[0,:], naivebayesPXY_smoothing)
weights_sm, biases_sm = naivebayesCL(X_train, y_train, naivebayesPXY_smoothing)

In [15]:
# calculate training and validation error
naive_bayes_train_preds = classifyLinear(X_train, weights_sm, biases_sm)
naive_bayes_val_preds = classifyLinear(X_val, weights_sm, biases_sm)

print('Training error (Smoothing with Laplace estimate): %.2f%%' % (100 *(naive_bayes_train_preds != y_train).mean()))
# accuracy_score expects (y_true, y_pred); the score is rounded to 2 decimals before printing
print(f"Training accuracy score: {100 * round(accuracy_score(y_train, naive_bayes_train_preds),2):.2f}%")

print('Validation error (Smoothing with Laplace estimate): %.2f%%' % (100 *(naive_bayes_val_preds != y_val).mean()))
print(f"Validation accuracy score: {100 * round(accuracy_score(y_val, naive_bayes_val_preds),2):.2f}%")

Training error (Smoothing with Laplace estimate): 38.56%
Training accuracy score: 61.00%
Validation error (Smoothing with Laplace estimate): 50.97%
Validation accuracy score: 49.00%


Then we trained the Logistic Regression model on the training data:

In [16]:
# train logistic regression model on splits
log_reg_model.fit(X_train, y_train)

In [17]:
# make training and validation predictions and calculate error and accuracy
log_reg_train_preds = log_reg_model.predict(X_train)
log_reg_val_preds = log_reg_model.predict(X_val)

# predict() already returns integer class labels, so no rounding is needed
print(f"Training error with logistic regression: \
{100 * (log_reg_train_preds != y_train).mean():.2f}%")
print(f"Accuracy score: \
{100 * round(accuracy_score(y_train, log_reg_train_preds),2):.2f}%")

print(f"Validation error with logistic regression: \
{100 * (log_reg_val_preds != y_val).mean():.2f}%")
print(f"Validation accuracy score: \
{100 * round(accuracy_score(y_val, log_reg_val_preds), 2):.2f}%")

Training error with logistic regression: 3.01%
Accuracy score: 97.00%
Validation error with logistic regression: 29.67%
Validation accuracy score: 70.00%


In [18]:
# Naive Bayes
# X_test_nb = vectorizer.transform(test_text.to_list()) 

# p_sm = naivebayes(X1, train_label, test_text, naivebayesPXY_smoothing)
# weights_test_nb,biases_test_nb = naivebayesCL(train_text,train_label, naivebayesPXY_smoothing)
# classifyLinear(test_text, weights_test_nb, biases_test_nb)

# classifyLinear(X_test_nb, weights_sm,biases_sm)

<h3>1.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

1.4.1 How did you formulate the learning problem?

The problem is that we have sentences of text that we aim
to classify into a finite number of classes. In other words,
our input is text, and our output is an integer from 0 to 27
representing the emotion class of that text. We use the "bag of words"
representation, where each piece of text is represented by a
vector of word counts over the training vocabulary.

1.4.2 Which two learning methods from class did you choose and why did you make these choices?

We chose a **Naive Bayes classifier** and a
**Multiclass Logistic Regression classifier**.
Naive Bayes was chosen because the model directly
estimates the probability of each word given a class.
Multiclass Logistic Regression was chosen because it
is designed to make predictions when the
output falls into one of several categories.

1.4.3 How did you do the model selection?

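We wanted to avoid having 0 valued probabilities,
so we used Laplace Smoothing for our Naive Bayes model.
We used a multiclass logistic regression because we have
28 emotions that we want to predict.
To compare the two models, we held out 30% of the training data
as a validation set. Logistic regression had a validation error of
29.67%, versus 50.97% for Naive Bayes, so we used logistic
regression for our Kaggle predictions.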

1.4.4 Does the test performance reach the first baseline "Tiny Piney"? (Please include a screenshot of Kaggle Submission)

The test performance does reach the first baseline.
**INSERT SCREENSHOT HERE**

<h2>Part 2: Be creative!</h2><p>

<h3>2.1 Open-ended Code:</h3><p>
You may follow the steps in part 1 again but making innovative changes like using new training algorithms, etc. Make sure you explain everything clearly in part 2.2. Note that beating "Zero Hero" is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [19]:
# Make sure you comment your code clearly and you may refer to these comments in the part 2.2
# Transformer
# Tutorial from: https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
# helpful for determining d_model: https://discuss.pytorch.org/t/embed-dim-must-be-divisible-by-num-heads/54394/3
# print(X1[0].shape)
# d_model = 512
# n_head = 16
# transformer_model = nn.Transformer(d_model=d_model, nhead=n_head, num_encoder_layers=12)
# print(transformer_model.d_model)

# truncate each sentence to use first `transformer_model.d_model` words
# src = torch.tensor(X1[:, :d_model])
# tgt = torch.tensor(X1[:, :d_model])
# print(src.shape)
# print(tgt.shape)
# src = torch.rand((10, 32, transformer_model.d_model))
# tgt = torch.rand((20, 32, transformer_model.d_model))
# out = transformer_model(src, tgt)
# out

<h5>Linear Regression Model</h5>

In [20]:
# Linear Regression model

# goal: use linear regression to model y(labels) as a function of x(text)
# input: vector with one entry for each word in sentence,
# 1 if word is in sentence and 0 if not
# output: label of text
# model2_X = train[['text']]
# print(X1.shape)
# print(train_label.shape)
# model = LinearRegression().fit(X1, train_label)
# print(model.coef_)
# print(model.coef_.shape)


In [21]:
# make test predictions and calculate accuracy
# linreg_train_preds = model.predict(X1)
# print(linreg_train_preds[:10])
# print(np.round(linreg_train_preds)[:10])
# linreg_test_preds = model.predict(test_word_counts)

# print(test[['text']].shape)
# print(f"Training error with linear regression: {100 * (np.round(linreg_train_preds) != train_label).mean()}")

<h3>2.2 Explanation in Words:</h3><p>
You need to answer the following questions in a markdown cell after this cell:

2.2.1 How much did you manage to improve performance on the test set? Did you beat "Zero Hero" in Kaggle? (Please include a screenshot of Kaggle Submission)

2.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

<h2>Part 3: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The results should be presented in two columns in CSV format: the first column is the data id (0-14999) and the second column contains the predictions for the test set. The first column must be named id and the second column must be named label (otherwise your submission will fail). A sample prediction file can be downloaded from Kaggle for each problem. 
We provide how to save a csv file if you are running Notebook on Kaggle.

In [22]:
id = range(len(test_id))
prediction = range(15000)  # placeholder predictions; replace with your model's output
submission = pd.DataFrame({'id': id, 'label': prediction})
# submission.to_csv('/kaggle/working/submission.csv', index=False)
# submission.to_csv('submission.csv', index=False)

In [None]:
# Log Reg predictions
id = range(len(test_id))
# submission_preds = log_reg_model.predict(test_word_counts[:, :X_train.shape[1]])
submission_preds = log_reg_model.predict(test_word_counts)
preds_df = pd.DataFrame({'id': id, 'label': submission_preds})
# preds_df.to_csv('submission.csv', index=False)
preds_df

Unnamed: 0,id,label
0,0,27
1,1,16
2,2,21
3,3,21
4,4,21
...,...,...
14995,14995,9
14996,14996,9
14997,14997,12
14998,14998,1
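
Before uploading, it can help to check that the saved file has exactly the two required columns, since writing a DataFrame without `index=False` adds an extra unnamed index column. The checker below, `check_submission`, is a hypothetical standard-library sketch, not part of the assignment or of Kaggle's tooling; for the real file, `expected_rows` would be 15000.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    problems = []
    if header != ["id", "label"]:
        problems.append(f"unexpected header: {header}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # The first column should contain each id from 0 to n-1 exactly once
    ids = sorted(int(row[0]) for row in rows)
    if ids != list(range(len(rows))):
        problems.append("ids are not exactly 0..n-1")
    return problems

# A well-formed toy file, and one with an extra index column
good = check_submission("id,label\n0,27\n1,16\n2,21\n", expected_rows=3)
bad = check_submission("Unnamed: 0,id,label\n0,0,27\n", expected_rows=1)

print(good)
print(bad)
```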


<h2>Part 4: Resources and Literature Used</h2><p>

Please cite the papers and open resources you used.