***Objective: build a logistic regression classifier to perform text classification.***

# Preprocessing

In [131]:
# Load important library
import numpy as np
import pandas as pd

# Import NLP library
import nltk
from nltk.tokenize import word_tokenize

In [132]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [133]:
# reference: https://www.nltk.org/howto/stem.html
# Stemmer to stem each token
from nltk.stem.porter import *
stemmer = PorterStemmer()

In [134]:
# Load dataset
df_pool = pd.read_csv('CS173-published-sheet - Sheet1.csv')
df = df_pool.filter(like = "Sentences")
df_lex = df_pool.filter(like = "Lexicons")
# View first few columns
df.head()

Unnamed: 0,Sadness Sentences,Joy Sentences,Fear Sentences,Anger Sentences,Disgust Sentences,Sadness + Joy Sentences,Fear + Anger Sentences,Surprise + Disgust Sentences,Sadness + Joy + Fear Sentences
0,The devastating news of the child's abduction ...,It was a sunny summer morning and the laughter...,As he walked in the dead of night he could hea...,While driving his family to a restaurant a car...,She had forgotten to take out the trash before...,"When I visited my old childhood home, I felt a...",Tom was fearful of his bully and remained doci...,He opened the package of his food delivery and...,The parents watched their son leave for colleg...
1,The wretched people have chosen wrongly.,The youth are filled with zeal.,The abandoned youth could be heard yelling.,We must abolish it for its wrongful acts.,The mysterious slime emitted an abhorrent smell.,"They have advanced to the next round, despite ...",His animosity was made worse by the repeated\...,She was bewildered by their appalling behavior.,After seeing the group’s admirable performance...
2,"She was not fond of graveyards, let alone the\...",Ms. Smith taught the lesson to fidelity. The o...,"Leaning over the edge, I peered into the abyss...",Your actions make a mockery of ethics! I must ...,"His illness left his skin unnatural, like plas...","Despite finally winning, I feel robbed of clos...","What you have created is an abomination, a bon...",Enough of your dreadful raving... Did you forg...,So begins an endless journey with no destinati...
3,I’m feeling very anxious about going back to w...,The women’s soccer team was proud that they ac...,I’m afraid we might get trapped in the cave si...,"After getting yelled at for 30 minutes, Sam be...",I can’t believe how wasteful the food industry...,Although my heart aches because I will never b...,I hate Tony for wrongfully accusing me of stea...,When we looked under the couch we found the un...,She starts to weep because she knows that I mi...
4,Sometimes I feel like something is wrong with ...,It's crazy to see that Jane and Mike were able...,Ever since the doctors found a tumor in Jessie...,I cannot believe that you’re still holding a g...,I think it’s so nasty that he’s a grown man an...,It must have been tough to find out all the de...,It’s insane how Daniel was able to act normal ...,It’s concerning how Pearson isn’t in much dist...,It really makes you wonder what people are cap...


We will have two labels: {Joy, Sadness} \
Joy as positive class: 1 \
Sadness as negative class: 0

In [135]:
# labels
labels = ["Joy", "Sadness"]

## Dataset Partitioning

In [136]:
# Split the dataframe
# First 30 rows as training dataset
df_train = df.iloc[:30]

# Next 10 rows as validation dataset
# It seems that in this HW, validation dataset is not really used
# Because in our case, the testing dataset and validation dataset kind of the same thing
# They are both testing dataset with known label
df_validation = df.iloc[30:40]
# Drop index
df_validation.reset_index(drop=True, inplace=True)

# Next ~10 rows as testing dataset
df_test = df.iloc[40:]
# Drop index
df_test.reset_index(drop=True, inplace=True)

## Construct features

For each sentence, we will have 3 features \
x1: Counts of Joy lexicon from NRC emotion lexicon dict. in the document (sentences) \
x2: Counts of Sadness lexicon from NRC emotion lexicon dict. in the document \
x3: Total number of tokens in the document \
x4: 1, for biase term \

reference: https://colab.research.google.com/drive/1uuQ5nvel5SpD-9-t1PiOUTpDYXJ0BtI_ \
NRC Emotional Lexicon: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm \
Some code is copied from: https://colab.research.google.com/drive/1uuQ5nvel5SpD-9-t1PiOUTpDYXJ0BtI_?usp=sharing#scrollTo=z7iU-S7pKXns

### Input Data Lexicon approach

In [137]:
# Create a Lexicon count for each label
# First create a empty dictionary for each label
Dictionary_Lex = {}
for label in labels:
  Dictionary_Lex[label] = {}

# Loop though each column
for column in df_lex:
  # Loop through each label and see if that label is mentioned in the current column
  # This way, if lexicon has multiple label, it will be added multiple times as well
  for label in labels:
    if label in column:
      # Loop through every sentence
      # There is NaN, so we need dropna()
      for lexicons in df_lex[column].dropna():
        # Preprocessing the lexicons, by removing special characters
        lexicons = lexicons.replace("(","")
        lexicons = lexicons.replace(")","")
        lexicons = lexicons.replace(".","")
        # There are two styles of "," and ", "
        # We will first change all ", " to ","
        lexicons = lexicons.replace(", ",",")
        # Then we will remove ",", replace it with " "
        lexicon_list = lexicons.replace(","," ").split()
        # Loop through each word in the lexicon list
        for lexicon in lexicon_list:
          # If the word is not in the dictionary, add it by setting it to 1
          if lexicon not in Dictionary_Lex[label]:
            Dictionary_Lex[label][lexicon] = 1
          # If the word is in the dictionary already, + 1
          else:
            Dictionary_Lex[label][lexicon] += 1

In [138]:
# Convert the Dictionary to a pandas dataframe
Dictionary_Lex = pd.DataFrame(Dictionary_Lex)
# Replace all NaN with 0
Dictionary_Lex = Dictionary_Lex.fillna(0)
# View first couple rows
# The Dictionary is case sensitive btw
Dictionary_Lex.head()

Unnamed: 0,Joy,Sadness
child,1.0,0.0
laughter,1.0,0.0
sunny,1.0,0.0
youth,10.0,6.0
zeal,5.0,4.0


In [139]:
# Get Joy and Sadness Dictionary
Joy_Dictionary = Dictionary_Lex[Dictionary_Lex["Joy"] > 0].index.tolist()
# Apply stemming to each work
Joy_Dictionary = [stemmer.stem(word) for word in Joy_Dictionary]

Sadness_Dictionary = Dictionary_Lex[Dictionary_Lex["Sadness"] > 0].index.tolist()
Sadness_Dictionary = [stemmer.stem(word) for word in Sadness_Dictionary]

### NRC approach (used)

In [140]:
# Download NRC Emotional Lexicon
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/upshot-trump-emolex/data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt

File ‘NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt’ already there; not retrieving.



In [141]:
# We will first read in the NRC emotion lexicon dict
filepath = "NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
emolex_df = pd.read_csv(filepath,  names=["word", "emotion", "association"], skiprows=45, sep='\t', keep_default_na=False)
emolex_df.head()

Unnamed: 0,word,emotion,association
0,aback,anger,0
1,aback,anticipation,0
2,aback,disgust,0
3,aback,fear,0
4,aback,joy,0


In [142]:
# Get the Joy Lexicon and store in a list
Joy_Dictionary_NRC = emolex_df[(emolex_df.association == 1) & (emolex_df.emotion == 'joy')].word
Joy_Dictionary_NRC = Joy_Dictionary_NRC.tolist()
# Apply stemming
Joy_Dictionary_NRC = [stemmer.stem(word) for word in Joy_Dictionary_NRC]

In [143]:
# Get the Sadness Lexicon and store in a list
Sadness_Dictionary_NRC = emolex_df[(emolex_df.association == 1) & (emolex_df.emotion == 'sadness')].word
Sadness_Dictionary_NRC = Sadness_Dictionary_NRC.tolist()
# Apply stemming
Sadness_Dictionary_NRC = [stemmer.stem(word) for word in Sadness_Dictionary_NRC]

### Construct Feature Vector

In [144]:
# Function return the counts of Joy lexicon from NRC emotion lexicon dict. in the document
# Input is a tokenized Sentence
def find_x1(Sentence):
  count = 0
  # Loop through each token and find number of token in the Joy dictionary
  for token in Sentence:
    if token in Joy_Dictionary_NRC:
      count += 1
  return count

In [145]:
# Function return the counts of Sadness lexicon from NRC emotion lexicon dict. in the document
def find_x2(Sentence):
  count = 0
  # Loop through each token and find number of token in the Sadness dictionary
  for token in Sentence:
    if token in Sadness_Dictionary_NRC:
      count += 1
  return count

In [146]:
# Function return the total number of tokens in the document
def find_x3(Sentence):
  return len(Sentence)

In [147]:
# Function to construct feature vector given tokenized sentence
def construct_feature_vector(Sentence):
  x1 = find_x1(Sentence)
  x2 = find_x2(Sentence)
  x3 = find_x3(Sentence)
  x4 = 1
  return np.array([x1, x2, x3, x4])

## Construct gold-reference labels

In [148]:
# For the training Dataframe
# Create a list to store sentence with its corresponding label
# We will only keep sentence under "Joy" or "Sadness" label

Dictionary_Train = []
# Loop though each column
for column in df_train.columns:
  # Loop through each label and see if that label is mentioned in the current column
  # This way, if sentence has multiple label, it will be added multiple times as well
  for label in labels:
    if label in column:
      # Loop through every sentence
      # There is NaN, so we need dropna()
      for sentence in df_train[column].dropna():
        # Append the sentence label pair
        if (label == "Joy"):
          Dictionary_Train.append([sentence, 1])
        else:
          Dictionary_Train.append([sentence, 0])

In [149]:
# Convert the list to a pandas dataframe
Dictionary_Train = pd.DataFrame(Dictionary_Train, columns = ["Sentence", "Label"])
# Lower case each sentence
Dictionary_Train["Tokenized_Sentence"] = Dictionary_Train["Sentence"].str.lower()
# Tokenize each sentence
Dictionary_Train["Tokenized_Sentence"] = Dictionary_Train["Tokenized_Sentence"].apply(word_tokenize)
# Stem each token
Dictionary_Train["Tokenized_Sentence"] = Dictionary_Train["Tokenized_Sentence"].apply(lambda x: [stemmer.stem(y) for y in x])
Dictionary_Train.head()

Unnamed: 0,Sentence,Label,Tokenized_Sentence
0,The devastating news of the child's abduction ...,0,"[the, devast, news, of, the, child, 's, abduct..."
1,The wretched people have chosen wrongly.,0,"[the, wretch, peopl, have, chosen, wrongli, .]"
2,"She was not fond of graveyards, let alone the\...",0,"[she, wa, not, fond, of, graveyard, ,, let, al..."
3,I’m feeling very anxious about going back to w...,0,"[i, ’, m, feel, veri, anxiou, about, go, back,..."
4,Sometimes I feel like something is wrong with ...,0,"[sometim, i, feel, like, someth, is, wrong, wi..."


In [150]:
# For the validation Dataframe
# Create a list to store sentence with its corresponding label
# We will only keep sentence under "Joy" or "Sadness" label

Dictionary_Validation = []
# Loop though each column
for column in df_validation.columns:
  # Loop through each label and see if that label is mentioned in the current column
  # This way, if sentence has multiple label, it will be added multiple times as well
  for label in labels:
    if label in column:
      # Loop through every sentence
      # There is NaN, so we need dropna()
      for sentence in df_validation[column].dropna():
        # Append the sentence label pair
        if (label == "Joy"):
          Dictionary_Validation.append([sentence, 1])
        else:
          Dictionary_Validation.append([sentence, 0])

In [151]:
# Convert the list to a pandas dataframe
Dictionary_Validation = pd.DataFrame(Dictionary_Validation, columns = ["Sentence", "Label"])
# Lower case each sentence
Dictionary_Validation["Tokenized_Sentence"] = Dictionary_Validation["Sentence"].str.lower()
# Tokenize each sentence
Dictionary_Validation["Tokenized_Sentence"] = Dictionary_Validation["Tokenized_Sentence"].apply(word_tokenize)
# Stem each token
Dictionary_Validation["Tokenized_Sentence"] = Dictionary_Validation["Tokenized_Sentence"].apply(lambda x: [stemmer.stem(y) for y in x])
Dictionary_Validation.head()

Unnamed: 0,Sentence,Label,Tokenized_Sentence
0,The lonely and hungry puppy whined as he sat i...,0,"[the, lone, and, hungri, puppi, whine, as, he,..."
1,Whenever Alexa remembered how the love of her ...,0,"[whenev, alexa, rememb, how, the, love, of, he..."
2,As I stared into the lifeless eyes of the only...,0,"[as, i, stare, into, the, lifeless, eye, of, t..."
3,Gas became overpriced during the pandemic.\r,0,"[ga, becam, overpr, dure, the, pandem, .]"
4,The sickening realization finally dawned on Ma...,0,"[the, sicken, realiz, final, dawn, on, mariano..."


In [152]:
# For the testing Dataframe
# Create a list to store sentence with its corresponding label
# We will only keep sentence under "Joy" or "Sadness" label

Dictionary_Test = []
# Loop though each column
for column in df_test.columns:
  # Loop through each label and see if that label is mentioned in the current column
  # This way, if sentence has multiple label, it will be added multiple times as well
  for label in labels:
    if label in column:
      # Loop through every sentence
      # There is NaN, so we need dropna()
      for sentence in df_test[column].dropna():
        # Append the sentence label pair
        if (label == "Joy"):
          Dictionary_Test.append([sentence, 1])
        else:
          Dictionary_Test.append([sentence, 0])

In [153]:
# Convert the list to a pandas dataframe
Dictionary_Test = pd.DataFrame(Dictionary_Test, columns = ["Sentence", "Label"])
# Lower case each sentence
Dictionary_Test["Tokenized_Sentence"] = Dictionary_Test["Sentence"].str.lower()
# Tokenize each sentence
Dictionary_Test["Tokenized_Sentence"] = Dictionary_Test["Tokenized_Sentence"].apply(word_tokenize)
# Stem each token
Dictionary_Test["Tokenized_Sentence"] = Dictionary_Test["Tokenized_Sentence"].apply(lambda x: [stemmer.stem(y) for y in x])
Dictionary_Test.head()

Unnamed: 0,Sentence,Label,Tokenized_Sentence
0,"He sat alone by the window, lost in a pensive ...",0,"[he, sat, alon, by, the, window, ,, lost, in, ..."
1,Tim was unlucky. He had a traumatic experience...,0,"[tim, wa, unlucki, ., he, had, a, traumat, exp..."
2,"After his wife left him, he lost his job, and ...",0,"[after, hi, wife, left, him, ,, he, lost, hi, ..."
3,"The news caused my world to collapse, leaving ...",0,"[the, news, caus, my, world, to, collaps, ,, l..."
4,He was distraught by the horrors he saw during...,0,"[he, wa, distraught, by, the, horror, he, saw,..."


# Logistic Regression Classifier

In [154]:
# Sigmoid function
def sigmoid(z):
  # Yu mentioned about this line in discussion
  z = np.clip(z, -100, 100)
  return 1 / (1 + np.exp(-z))

In [155]:
# Create logistic regression classifier that computes p(y=1) where y is the document and 1 means Joy class
def logistic_regression_classifier(X, W):
  # Compute dot product of feature and weight vector
  z = np.dot(X, W)
  # Apply sigmoid function
  p = sigmoid(z)
  if p >= 0.5:
    # Positive label, same as Joy
    return 1
  else:
    # Negative label, same as Sadness
    return 0

In [156]:
# Create Cross Entropy Loss function
# Taken additional parameter y: True label
def CrossEntropyLoss(X, W, y):
  # Compute dot product of feature and weight vector
  z = np.dot(X, W)
  # Apply sigmoid function
  p = sigmoid(z)
  # Yu mentioned about this line in discussion
  # Add a small number here to prevent log(0)
  epsilon = 1e-10
  # Apply cross-entropy loss with epsilon
  # np.log uses natural log
  loss = - (y * np.log(p + epsilon) + (1 - y) * np.log(1 - p + epsilon))
  return loss

Find Cross Entropy Loss for the first example (of the validation dataframe)

In [157]:
# Get the first example/row from the training dataframe
# This line for our easier interpretation
Example_sentence = Dictionary_Train["Sentence"][0]
print(Example_sentence)
# Get tokenized_sentencec
Example_tokenized_sentence = Dictionary_Train["Tokenized_Sentence"][0]
# Get true label
Example_label = Dictionary_Train["Label"][0]
# Initialize all the weights to be zero
# Technically three weights and last parameter is: biase term
Example_weight_vector = np.array([0, 0, 0, 0])
# Construct feature vector
Example_feature_vector = construct_feature_vector(Example_tokenized_sentence)
print(Example_feature_vector)
# Apply loss function
Example_loss = CrossEntropyLoss(Example_feature_vector, Example_weight_vector, Example_label)
print(Example_loss)

The devastating news of the child's abduction left a
 solemn shadow over the family for the next month.
[ 1  2 20  1]
0.6931471803599453


# Learning & Optimization

In [158]:
# Implement Stochastic Gradient Descent (SGD)
# We need to input feature vector, weight vector, target true label, and a learning rate
# Output: updated weight vector
def SGD(X, W, y, learning_rate):
  # Compute dot product of feature and weight vector
  z = np.dot(X, W)
  # Apply sigmoid function
  p = sigmoid(z)
  # Calculate gradient
  # The last feature of X should be 1, its corresponding to update the biase term
  gradient = (p - y) * X
  # update weight vector
  W_update = W - learning_rate * gradient
  return W_update

# Get the best learning rate

Use the validation set to decide your best learning rate

In [159]:
learning_rates = [0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]

In [160]:
# We will try our model using one of the learning rates
# Then apply validation dataset on our trained model to get an average validation loss
for learning_rate in learning_rates:
  # Get a fresh weight vector, everything is 0
  curr_W = np.array([0, 0, 0, 0])
  # Loop through the training dataset
  # We will run 100 iterations
  for epoch in range(100):
    for index, row in Dictionary_Train.iterrows():
      # Current sentence
      curr_sentence = row["Tokenized_Sentence"]
      # Current label
      curr_label = row["Label"]
      # Current feature vector
      curr_feature_vector = construct_feature_vector(curr_sentence)
      # Update weight vector
      curr_W = SGD(curr_feature_vector, curr_W, curr_label, learning_rate)
  print("Updated weight vector:", curr_W)

  # Loop through the validation dataset
  # Get the average validation loss
  total_loss = 0
  for index, row in Dictionary_Validation.iterrows():
    # Current sentence
    curr_sentence = row["Tokenized_Sentence"]
    # Current label
    curr_label = row["Label"]
    # Current feature vector
    curr_feature_vector = construct_feature_vector(curr_sentence)
    # Current loss
    curr_loss = CrossEntropyLoss(curr_feature_vector, curr_W, curr_label)
    # Add to total loss
    total_loss += curr_loss
  # Get average loss
  average_loss = total_loss / len(Dictionary_Validation)

  # Loop through the validation dataset
  # Get the average accuracy
  total_accuracy = 0
  for index, row in Dictionary_Test.iterrows():
    # Current sentence
    curr_sentence = row["Tokenized_Sentence"]
    # Current label
    curr_label = row["Label"]
    # Current feature vector
    curr_feature_vector = construct_feature_vector(curr_sentence)
    # Get predicted label
    curr_predicted_label = logistic_regression_classifier(curr_feature_vector, curr_W)
    if (curr_predicted_label == curr_label):
      total_accuracy += 1
  # Get average accuracy
  average_accuracy = total_accuracy / len(Dictionary_Validation)

  print("For learning rate:", learning_rate, "average validation loss:", average_loss, "average accuarcy:", average_accuracy)

Updated weight vector: [ 0.04533601 -0.04577142 -0.00126744 -0.00122527]
For learning rate: 1e-05 average validation loss: 0.6688625432404323 average accuarcy: 0.5833333333333334
Updated weight vector: [ 0.1765058  -0.17824296 -0.01216666  0.0017689 ]
For learning rate: 5e-05 average validation loss: 0.6749872950961839 average accuarcy: 0.5666666666666667
Updated weight vector: [ 0.26572819 -0.27987007 -0.02383821  0.01365058]
For learning rate: 0.0001 average validation loss: 0.8030243597263288 average accuarcy: 0.55
Updated weight vector: [ 0.45134869 -0.56133993 -0.06660745  0.09317967]
For learning rate: 0.0005 average validation loss: 1.704189740034066 average accuarcy: 0.5333333333333333
Updated weight vector: [ 0.5130148  -0.69132816 -0.08851691  0.13128363]
For learning rate: 0.001 average validation loss: 2.2804731490847217 average accuarcy: 0.5
Updated weight vector: [ 0.69235888 -1.01891395 -0.17116025  0.3255062 ]
For learning rate: 0.005 average validation loss: 3.97851105

In [161]:
# Train actual Weight vector using learning rate 0.00001 on the training dataset
W = np.array([0, 0, 0, 0])
for epoch in range(200):
  # Loop through the training dataset
  for index, row in Dictionary_Train.iterrows():
    # Current sentence
    curr_sentence = row["Tokenized_Sentence"]
    # Current label
    curr_label = row["Label"]
    # Current feature vector
    curr_feature_vector = construct_feature_vector(curr_sentence)
    # Update weight vector
    W = SGD(curr_feature_vector, W, curr_label, 0.00001)

In [162]:
# Check our Weight Vector
W

array([ 0.08344806, -0.08487796, -0.00149924, -0.00216813])

# Evaluation

In [163]:
# Check our test dataset
Dictionary_Test.head()

Unnamed: 0,Sentence,Label,Tokenized_Sentence
0,"He sat alone by the window, lost in a pensive ...",0,"[he, sat, alon, by, the, window, ,, lost, in, ..."
1,Tim was unlucky. He had a traumatic experience...,0,"[tim, wa, unlucki, ., he, had, a, traumat, exp..."
2,"After his wife left him, he lost his job, and ...",0,"[after, hi, wife, left, him, ,, he, lost, hi, ..."
3,"The news caused my world to collapse, leaving ...",0,"[the, news, caus, my, world, to, collaps, ,, l..."
4,He was distraught by the horrors he saw during...,0,"[he, wa, distraught, by, the, horror, he, saw,..."


In [164]:
# Apply Logistic Regression to predict label
Dictionary_Test["Predicted_Label"] = Dictionary_Test["Tokenized_Sentence"].apply(lambda x: logistic_regression_classifier(construct_feature_vector(x), W))

In [165]:
# Check our test dataset with predicted_label
Dictionary_Test.head()

Unnamed: 0,Sentence,Label,Tokenized_Sentence,Predicted_Label
0,"He sat alone by the window, lost in a pensive ...",0,"[he, sat, alon, by, the, window, ,, lost, in, ...",0
1,Tim was unlucky. He had a traumatic experience...,0,"[tim, wa, unlucki, ., he, had, a, traumat, exp...",0
2,"After his wife left him, he lost his job, and ...",0,"[after, hi, wife, left, him, ,, he, lost, hi, ...",0
3,"The news caused my world to collapse, leaving ...",0,"[the, news, caus, my, world, to, collaps, ,, l...",0
4,He was distraught by the horrors he saw during...,0,"[he, wa, distraught, by, the, horror, he, saw,...",0


## Confusion matrix

In [166]:
# Confusion matrix library
from sklearn.metrics import confusion_matrix

In [167]:
# Now we can generate the confusion matrix since we have True label and Predicted Label for all sentence
# Generate a 2 x 2 confusion matrix
# reference: https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.confusion_matrix.html
y_true = Dictionary_Test["Label"]
y_pred = Dictionary_Test["Predicted_Label"]
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=[1, 0]), index = ["Joy", "Sadness"], columns = ["Joy", "Sadness"])

In [168]:
confusion_matrix_df

Unnamed: 0,Joy,Sadness
Joy,18,8
Sadness,9,17


## Precision, Accuracy, Recall, F-1

In [169]:
# Find number of True Negative for the given label
# Bascially sum of all number that row or column is not "label"
def find_TN(label):
  TN = 0
  for label_i in labels:
    for label_j in labels:
      # Column and index in same order anyways
      if label_i != label and label_j != label:
        TN += confusion_matrix_df.at[label_i, label_j]
  return TN

In [170]:
# Calculate accuracy, precision, recall, and F1-score
def calculate_performance(label):
  # Find total number of sentence
  total_sentence_number, _ = Dictionary_Test.shape

  # Find accuracy of the given label based on confusion matrix
  # TP + TN / total sentence number
  TN = find_TN(label)
  TP = confusion_matrix_df.at[label, label]
  accuracy = (TP + TN) / total_sentence_number

  # Find precision of the given label based on confusion matrix
  # TP/TP + FP
  FP = 0
  for label_j in labels:
    if label_j != label:
      # Same row, exclude when column == label
      FP += confusion_matrix_df.at[label, label_j]
  precision = TP / (TP + FP)

  # Find recall of the given label based on confusion matrix
  # TP/TP + FN
  FN = 0
  for label_i in labels:
    if label_i != label:
      # Same column, exclude when row == label
      FN += confusion_matrix_df.at[label_i, label]
  recall = TP / (TP + FN)

  # Find f1_score of the given label based on confusion matrix
  f1_score = (2 * precision * recall) / (precision + recall)
  return accuracy, precision, recall, f1_score

In [171]:
accuracy, precision, recall, f1_score = calculate_performance("Joy")
print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1-score =", f1_score)

Accuracy = 0.6730769230769231
Precision = 0.6923076923076923
Recall = 0.6666666666666666
F1-score = 0.6792452830188679


## New feature vector

For each sentence, we will have 5 features \
x1: Counts of Joy lexicon from NRC emotion lexicon dict. in the document (sentences) \
x2: Counts of Sadness lexicon from NRC emotion lexicon dict. in the document \
x3: Total number of tokens in the document \
x4: Counts of Joy lexicon from spreadsheet lexicon dict. in the document (sentences) \
x5: Counts of Sadness lexicon from spreadsheet lexicon dict. in the document (sentences) \
x6: 1, for biase term \

In [172]:
# Function return the counts of Joy lexicon from NRC emotion lexicon dict. in the document
# Input is a tokenized Sentence
def find_x4(Sentence):
  count = 0
  # Loop through each token and find number of token in the Joy dictionary
  for token in Sentence:
    if token in Joy_Dictionary:
      count += 1
  return count

In [173]:
# Function return the counts of Joy lexicon from NRC emotion lexicon dict. in the document
# Input is a tokenized Sentence
def find_x5(Sentence):
  count = 0
  # Loop through each token and find number of token in the Joy dictionary
  for token in Sentence:
    if token in Sadness_Dictionary:
      count += 1
  return count

In [174]:
# define new feature vector function
# Function to construct feature vector given tokenized sentence
def construct_new_feature_vector(Sentence):
  x1 = find_x1(Sentence)
  x2 = find_x2(Sentence)
  x3 = find_x3(Sentence)
  x4 = find_x4(Sentence)
  x5 = find_x5(Sentence)
  x6 = 1
  return np.array([x1, x2, x3, x4, x5, x6])

In [175]:
# Check new validation losses
for learning_rate in learning_rates:
  # Get a fresh weight vector, everything is 0
  curr_W = np.array([0, 0, 0, 0, 0, 0])
  # Loop through the training dataset
  # We will run 100 iterations
  for epoch in range(100):
    for index, row in Dictionary_Train.iterrows():
      # Current sentence
      curr_sentence = row["Tokenized_Sentence"]
      # Current label
      curr_label = row["Label"]
      # Current feature vector
      curr_feature_vector = construct_new_feature_vector(curr_sentence)
      # Update weight vector
      curr_W = SGD(curr_feature_vector, curr_W, curr_label, learning_rate)
  print("Updated weight vector:", curr_W)

  # Loop through the validation dataset
  # Get the average validation loss
  total_loss = 0
  for index, row in Dictionary_Validation.iterrows():
    # Current sentence
    curr_sentence = row["Tokenized_Sentence"]
    # Current label
    curr_label = row["Label"]
    # Current feature vector
    curr_feature_vector = construct_new_feature_vector(curr_sentence)
    # Current loss
    curr_loss = CrossEntropyLoss(curr_feature_vector, curr_W, curr_label)
    # Add to total loss
    total_loss += curr_loss
  # Get average loss
  average_loss = total_loss / len(Dictionary_Validation)

  # Loop through the validation dataset
  # Get the average accuracy
  total_accuracy = 0
  for index, row in Dictionary_Test.iterrows():
    # Current sentence
    curr_sentence = row["Tokenized_Sentence"]
    # Current label
    curr_label = row["Label"]
    # Current feature vector
    curr_feature_vector = construct_new_feature_vector(curr_sentence)
    # Get predicted label
    curr_predicted_label = logistic_regression_classifier(curr_feature_vector, curr_W)
    if (curr_predicted_label == curr_label):
      total_accuracy += 1
  # Get average accuracy
  average_accuracy = total_accuracy / len(Dictionary_Validation)

  print("For learning rate:", learning_rate, "average validation loss:", average_loss, "average accuarcy:", average_accuracy)

Updated weight vector: [ 0.045439   -0.04435558  0.00027824  0.01146507 -0.02729461 -0.00100543]
For learning rate: 1e-05 average validation loss: 0.6636539544200247 average accuarcy: 0.5666666666666667
Updated weight vector: [ 0.17623158 -0.15938944 -0.00840981  0.0527284  -0.10192012  0.00452845]
For learning rate: 5e-05 average validation loss: 0.6558455820368982 average accuarcy: 0.5666666666666667
Updated weight vector: [ 0.2644856  -0.23849606 -0.01974804  0.09252673 -0.15979306  0.01980718]
For learning rate: 0.0001 average validation loss: 0.7699396804451607 average accuarcy: 0.5666666666666667
Updated weight vector: [ 0.43817336 -0.47565175 -0.06207542  0.2457729  -0.33305337  0.10386745]
For learning rate: 0.0005 average validation loss: 1.6172773302650283 average accuarcy: 0.5166666666666667
Updated weight vector: [ 0.50308692 -0.59650636 -0.07924807  0.3022929  -0.42270736  0.1416206 ]
For learning rate: 0.001 average validation loss: 2.1159486970084993 average accuarcy: 0.

In [176]:
# Train actual Weight vector using learning rate 0.00001 on the training dataset
W_new = np.array([0, 0, 0, 0, 0, 0])
for epoch in range(200):
  # Loop through the training dataset
  for index, row in Dictionary_Train.iterrows():
    # Current sentence
    curr_sentence = row["Tokenized_Sentence"]
    # Current label
    curr_label = row["Label"]
    # Current feature vector
    curr_feature_vector = construct_new_feature_vector(curr_sentence)
    # Update weight vector
    W_new = SGD(curr_feature_vector, W_new, curr_label, 0.00001)

In [177]:
W_new

array([ 0.08345656, -0.08030586,  0.00091397,  0.02264099, -0.04838916,
       -0.00151496])

In [178]:
# Apply Logistic Regression to predict label
Dictionary_Test["New_Predicted_Label"] = Dictionary_Test["Tokenized_Sentence"].apply(lambda x: logistic_regression_classifier(construct_new_feature_vector(x), W_new))

In [179]:
# Construct new confusion matrix
y_true = Dictionary_Test["Label"]
y_pred = Dictionary_Test["New_Predicted_Label"]
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=[1, 0]), index = ["Joy", "Sadness"], columns = ["Joy", "Sadness"])

In [180]:
confusion_matrix_df

Unnamed: 0,Joy,Sadness
Joy,17,9
Sadness,8,18


In [181]:
# Get new evaluation
accuracy, precision, recall, f1_score = calculate_performance("Joy")
print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1-score =", f1_score)

Accuracy = 0.6730769230769231
Precision = 0.6538461538461539
Recall = 0.68
F1-score = 0.6666666666666666
