# Homework 2 (Due Thursday Dec 1, 6:29pm PST)

Please submit as a notebook in the format `HW2_FIRSTNAME_LASTNAME_USCID.ipynb` in a group chat to me and the TAs.

Your `USCID` is your 10-digit student ID.

### Part II. Emotion Classification (5 pts)

Use the `datasets/emotions_dataset.zip` (see the original Dataset source on [Kaggle](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp)) to build a classification model that predicts the emotion of a sentence. If you would like, you may classify only the top 4 emotions, and group all other classes as `Other`.

In order to earn full credit, you must:

* Show the performance of your model with `CountVectorizer`, `TfidfVectorizer`, `word2vec`, and `glove` embeddings.
    - for `word2vec`, make sure not to use the `en_core_web_sm` model (its vectors are not real embeddings)
* Perform text preprocessing (or explain why it was not necessary):
    - stopword removal
    - ngram tokenization
    - stemming/lemmatization
    - fuzzy matching / regex cleaning / etc. (as you deem necessary, but show that you analyzed the text to make your decision)
* Show **AUROC / F1 scores** on the holdout (test + validation) datasets.
* Include a brief discussion (2-3 sentences) of what could improve your model and why.

### 1. Importing Datasets 

In [2]:
import pandas as pd

In [3]:
train_df = pd.read_csv('../datasets/emotions/train.txt',header=None, names=['text'])
train_df[['text','emotion']] = train_df['text'].str.split(';',expand=True)
train_df['type'] = "train"
test_df = pd.read_csv('../datasets/emotions/test.txt',header=None, names=['text'])
test_df[['text','emotion']] = test_df['text'].str.split(';',expand=True)
test_df['type'] = "test"
val_df = pd.read_csv('../datasets/emotions/val.txt',header=None, names=['text'])
val_df[['text','emotion']] = val_df['text'].str.split(';',expand=True)
val_df['type'] = "val"

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([train_df, test_df, val_df], ignore_index=True)

df.head()

| | text | emotion | type |
|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | train |
| 1 | i can go from feeling so hopeless to so damned... | sadness | train |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | train |
| 3 | i am ever feeling nostalgic about the fireplac... | love | train |
| 4 | i am feeling grouchy | anger | train |


In [4]:
# Checking if all the data have been appended correctly
print(train_df.shape)
print(test_df.shape)
print(val_df.shape)
print(df.shape)

print(df['type'].unique())

(16000, 3)
(2000, 3)
(2000, 3)
(20000, 3)
['train' 'test' 'val']


### 2. Text Preprocessing


- **Stopword removal:** We remove stopwords so the model can focus on words that are relevant for predicting the emotion, rather than letting common recurring words drown out the signal. However, we take some words back out of the package's stopword list, because words such as didn't, haven't, or wasn't flip the meaning of the text to its opposite.
- **N-gram tokenization:** We use n-gram tokenization because some word combinations change the sentiment. For example, "didn't feel humiliated" means something different from "humiliated" alone, so we want to account for bigrams and trigrams.
- **Stemming/lemmatization:** We do not stem or lemmatize, because some words could lose their original sentiment when reduced to a root. For example, if "hope" and "hopeless" were both reduced to "hope", the sentiment of "hopeless" would be completely altered.
- **Fuzzy matching/regex:** After drawing multiple random samples and manually examining the text, we determined that the dataset has few spelling errors or other text-quality issues. Therefore, we do not perform fuzzy matching or regex cleaning.
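
The stopword and n-gram decisions above can be illustrated on a single sentence from the dataset. This is a minimal pure-Python sketch, not the notebook's pipeline: the tiny `STOPWORDS` set, the `NEGATIONS` set, and the helper names `remove_stopwords` and `ngrams` are hypothetical stand-ins chosen for illustration.

```python
# Toy stopword list (an assumption for illustration; the notebook uses NLTK's list).
STOPWORDS = {"i", "a", "the", "so", "to", "feel", "feeling", "didnt"}
# Negations we want to keep even if a stopword list contains them.
NEGATIONS = {"didnt", "havent", "wasnt"}


def remove_stopwords(text):
    """Drop stopwords but always keep negation words."""
    return [w for w in text.split() if w in NEGATIONS or w not in STOPWORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stopwords("i didnt feel humiliated")
print(tokens)             # ['didnt', 'humiliated']
print(ngrams(tokens, 2))  # ['didnt humiliated']
```

In scikit-learn, unigram-to-trigram features of this kind come from passing `ngram_range=(1, 3)` to `CountVectorizer` or `TfidfVectorizer`.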


In [5]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
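# Add corpus-specific stopwords: feel, feeling
stop = stopwords.words('english')
stop.extend(['feel','feeling'])

# Take negations back out of the stopword list so they survive filtering.
# Note: the dataset text has no apostrophes (e.g. "didnt"), and "didnt" is not in NLTK's list anyway
stop.remove('didn')
stop.remove('didn\'t')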

# https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe
df['text_clean'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

df.head(10)

| | text | emotion | type | text_clean |
|---|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | train | didnt humiliated |
| 1 | i can go from feeling so hopeless to so damned... | sadness | train | go hopeless damned hopeful around someone care... |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | train | im grabbing minute post greedy wrong |
| 3 | i am ever feeling nostalgic about the fireplac... | love | train | ever nostalgic fireplace know still property |
| 4 | i am feeling grouchy | anger | train | grouchy |
| 5 | ive been feeling a little burdened lately wasn... | sadness | train | ive little burdened lately wasnt sure |
| 6 | ive been taking or milligrams or times recomme... | surprise | train | ive taking milligrams times recommended amount... |
| 7 | i feel as confused about life as a teenager or... | fear | train | confused life teenager jaded year old man |
| 8 | i have been with petronas for years i feel tha... | joy | train | petronas years petronas performed well made hu... |
| 9 | i feel romantic too | love | train | romantic |


#### Word2Vec Classification Model

In [7]:
import spacy
# load the language model, but we disable the ner (named entity recognition) and parser (dependency parser)
# since we don't need them for our use case to speed things up
nlp = spacy.load('en_core_web_md', disable = ['ner', 'parser'])

import numpy as np
def process_text(text):
  """
  This function will use Spacy to perform stopword removal and lemmatization.
  """
  doc = nlp(text)
  processed_text = " ".join([token.lemma_ for token in doc if not token.is_stop])
  # this will get the word2vec embeddings for the processed text (the average of each token in the doc's word2vec embeddings)
  return np.array(nlp(processed_text).vector)



In [8]:
# use pandas' apply(...) method to apply this process_text function to each row's text field
df["vectors"] = df.text.apply(process_text)

In [9]:
X = np.array([vector for vector in df["vectors"]])
y = df["emotion"]

In [10]:
X.shape

(20000, 300)

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

print(f"Training dataset is {X_train.shape}")
print(f"Training target is {y_train.shape}")
print(f"Test dataset is {X_test.shape}")
print(f"Test target is {y_test.shape}")

from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train, y_train)
training_predictions = logistic_regression.predict(X_train)
training_predictions[:5] # these are our model's prediction for first 5 documents in the training dataset

from sklearn.metrics import confusion_matrix 
confusion_matrix(y_train, training_predictions)

# 65.3% accuracy on the training data using word2vec
print(logistic_regression.score(X_train, y_train))

# 62.7% accuracy on the test data
print(logistic_regression.score(X_test, y_test))

Training dataset is (15000, 300)
Training target is (15000,)
Test dataset is (5000, 300)
Test target is (5000,)
0.6534
0.6272


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [16]:
from sklearn.metrics import roc_auc_score
y_probabilities = logistic_regression.predict_proba(X_test)
roc_auc_score(y_test, y_probabilities, multi_class="ovo")

0.8674393049275498
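
For reference, `multi_class="ovo"` averages AUROC over pairs of classes. The building block is the two-class AUROC, which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes that quantity directly. The function name `pairwise_auc` and the toy scores are made up for illustration, and this is not a description of how scikit-learn implements the metric internally.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Two-class AUROC: P(positive score > negative score), ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy predicted probabilities for the "joy" class (made-up numbers).
joy_scores = [0.9, 0.7, 0.4]      # examples whose true label is joy
sadness_scores = [0.6, 0.2, 0.1]  # examples whose true label is sadness
print(pairwise_auc(joy_scores, sadness_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.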

#### Count Vectorizer Classification Model

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df["text_clean"])
y = df["emotion"].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, test_size=0.2)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# calculate accuracy on the held-out test split
print(f"Test accuracy: {np.mean(y_pred == y_test)}")

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Test accuracy: 0.89525


array([[ 513,   13,   22,    4,   31,    0],
       [  21,  361,   12,    5,   21,   20],
       [   8,    7, 1219,   46,   19,   11],
       [   3,    1,   55,  267,    4,    1],
       [  34,   13,   27,    3, 1098,    0],
       [   0,   22,    9,    0,    7,  123]])
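
The assignment also asks for F1 scores, which can be derived from the confusion matrix above without refitting anything. The sketch below is a pure-Python illustration under two assumptions: rows are true classes and columns are predicted classes (scikit-learn's convention), and the class order is alphabetical, which is how scikit-learn sorts string labels. The helper name `per_class_f1` is hypothetical.

```python
# Confusion matrix copied from the CountVectorizer output above
# (rows = true class, columns = predicted class).
cm = [
    [513, 13, 22, 4, 31, 0],
    [21, 361, 12, 5, 21, 20],
    [8, 7, 1219, 46, 19, 11],
    [3, 1, 55, 267, 4, 1],
    [34, 13, 27, 3, 1098, 0],
    [0, 22, 9, 0, 7, 123],
]
# Assumption: labels are in alphabetical order, as scikit-learn sorts them.
labels = ["anger", "fear", "joy", "love", "sadness", "surprise"]


def per_class_f1(matrix):
    """Return one F1 score per class from a square confusion matrix."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


f1_scores = per_class_f1(cm)
for label, score in zip(labels, f1_scores):
    print(f"{label:>8}: F1 = {score:.3f}")
print(f"macro F1 = {sum(f1_scores) / len(f1_scores):.3f}")
```

Given the labels rather than the matrix, `sklearn.metrics.f1_score(y_test, y_pred, average="macro")` should report the same macro figure.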

In [18]:
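from sklearn.metrics import roc_auc_score
y_probabilities = lr.predict_proba(X_test)

# AUROC must be scored against the true labels (y_test). The value recorded below was
# computed against y_pred, the model's own predictions, so it overstates performance.
roc_auc_score(y_test, y_probabilities, multi_class="ovo")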

0.9997672478362439

#### TF-IDF Vectorizer Classification Model

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df["text_clean"])
y = df["emotion"].values

X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, test_size=0.2)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# calculate accuracy on the held-out test split
print(f"Test accuracy: {np.mean(y_pred == y_test)}")

confusion_matrix(y_test, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Test accuracy: 0.86825


array([[ 464,    7,   39,    1,   56,    0],
       [  25,  338,   44,    2,   28,    4],
       [   5,    2, 1304,   18,   21,    3],
       [   3,    1,  105,  210,   13,    0],
       [  20,    9,   39,    1, 1092,    1],
       [   3,   28,   31,    1,   17,   65]])

In [21]:
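from sklearn.metrics import roc_auc_score
y_probabilities = lr.predict_proba(X_test)

# AUROC must be scored against the true labels (y_test). The value recorded below was
# computed against y_pred, the model's own predictions, so it overstates performance.
roc_auc_score(y_test, y_probabilities, multi_class="ovo")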

0.9979881006871151