<a href="https://colab.research.google.com/github/Alonment/CSCI4962-Projects-In-ML-AI/blob/main/CSCI4962_HW4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Homework 4 (100 points)**

**Sequence Models**

In your project, you will pick a dataset and an associated problem that can be solved via sequence models. You must describe why you need sequence models to solve this problem. Include a link to the dataset source. Next, you should pick an RNN framework that you would use to solve this problem (This framework can be in Tensorflow, PyTorch or any other Python package.)

**Problem:** Predicting the sentiment of a tweet as either positive or negative.

**Dataset:** https://www.kaggle.com/kazanova/sentiment140

**EDA**

In [2]:
# Importing libraries
import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("drive/MyDrive/csci4962_hw4.csv", encoding = "ISO-8859-1")

In [3]:
df.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


The CSV file has no header row, so pandas has used the first record as the column names. Let us assign the column names manually, following the dataset description on Kaggle.


In [4]:
df.columns = ["target", "id", "date", "flag", "user", "text"]
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
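An alternative to renaming the columns after loading is to tell pandas up front that the file has no header row, so that no record is consumed as column names. The snippet below is a minimal sketch of that idea; the two records in it are made up for illustration and only mimic the six-field layout of the dataset.

```python
import io
import pandas as pd

# Two made-up records in the same six-field layout (not real dataset rows).
sample_csv = io.StringIO(
    '0,1,"Mon Apr 06 22:19:45 PDT 2009",NO_QUERY,user_a,"first example tweet"\n'
    '4,2,"Mon Apr 06 22:19:49 PDT 2009",NO_QUERY,user_b,"second example tweet"\n'
)

column_names = ["target", "id", "date", "flag", "user", "text"]

# header=None says the file has no header row; names= supplies the column names.
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.shape)
print(list(sample_df.columns))
```

With this approach both records survive as data rows, and the columns are named as soon as the file is read.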


In [5]:
df.isnull()

Unnamed: 0,target,id,date,flag,user,text
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
1599994,False,False,False,False,False,False
1599995,False,False,False,False,False,False
1599996,False,False,False,False,False,False
1599997,False,False,False,False,False,False
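The element-wise output above only shows a handful of rows, so it cannot confirm that the whole dataset is free of missing values. A more compact check is to count the missing entries per column. The sketch below uses a small made-up dataframe (not the real data) to show the idea.

```python
import pandas as pd

# Toy dataframe with one missing text value, for illustration only.
toy = pd.DataFrame({"target": [0, 4, 4], "text": ["sad", None, "happy"]})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole dataframe.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

Applied to the tweet dataframe, a result of all zeros would confirm that no column contains missing values.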


For our problem statement, we do not need the id, date, flag, or user columns, since we want to predict sentiment from the text of the tweet alone. Thus, let us remove them.

In [6]:
df = df.drop(columns = ["id", "date", "flag", "user"])
df

Unnamed: 0,target,text
0,0,is upset that he can't update his Facebook by ...
1,0,@Kenichan I dived many times for the ball. Man...
2,0,my whole body feels itchy and like its on fire
3,0,"@nationwideclass no, it's not behaving at all...."
4,0,@Kwesidei not the whole crew
...,...,...
1599994,4,Just woke up. Having no school is the best fee...
1599995,4,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,Happy 38th Birthday to my boo of alll time!!! ...


According to the dataset provider, the target value corresponds to the following sentiments: 0 -> negative, 2 -> neutral, 4 -> positive. Let us see if we can normalize these values in any way.

In [7]:
df[df["target"] == 0]

Unnamed: 0,target,text
0,0,is upset that he can't update his Facebook by ...
1,0,@Kenichan I dived many times for the ball. Man...
2,0,my whole body feels itchy and like its on fire
3,0,"@nationwideclass no, it's not behaving at all...."
4,0,@Kwesidei not the whole crew
...,...,...
799994,0,Sick Spending my day laying in bed listening ...
799995,0,Gmail is down?
799996,0,rest in peace Farrah! So sad
799997,0,@Eric_Urbane Sounds like a rival is flagging y...


In [8]:
df[df["target"] == 2]

Unnamed: 0,target,text


In [9]:
df[df["target"] == 4]

Unnamed: 0,target,text
799999,4,I LOVE @Health4UandPets u guys r the best!!
800000,4,im meeting up with one of my besties tonight! ...
800001,4,"@DaRealSunisaKim Thanks for the Twitter add, S..."
800002,4,Being sick can be really cheap when it hurts t...
800003,4,@LovesBrooklyn2 he has that effect on everyone
...,...,...
1599994,4,Just woke up. Having no school is the best fee...
1599995,4,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,Happy 38th Birthday to my boo of alll time!!! ...
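Filtering the dataframe once per label works, but the label distribution can also be read off in a single step by counting the distinct values in the target column. The following is a small sketch on made-up labels rather than the real dataset.

```python
import pandas as pd

# Made-up labels for illustration: two negative (0) and three positive (4) rows.
toy = pd.DataFrame({"target": [0, 0, 4, 4, 4]})

# Count how many rows carry each label; labels that never occur are simply absent.
label_counts = toy["target"].value_counts().sort_index()
print(label_counts)
```

A label such as 2 that never appears in the data would not show up in the counts at all.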


The dataset appears to be evenly split between negative and positive tweets, with no neutral tweets at all. Thus, let us relabel the targets as 0 for negative tweets and 1 for positive tweets. Also, note that the TextVectorization encoder provided by TensorFlow lowercases text and strips punctuation by default, leaving us less work to do on the preprocessing side.

Besides normalizing the labels, let us also remove every word containing "@" (mentions) or "#" (hashtags), since these often provide little insight into the sentiment of a tweet.

In [10]:
df["target"] = df["target"].replace(4, 1)

def removeMentionsAndHashtags(text: str) -> str:
  # Drop every whitespace-separated word that contains "@" or "#"
  return " ".join(word for word in str(text).split() if "@" not in word and "#" not in word)

df["text"] = df["text"].apply(removeMentionsAndHashtags)
df.head()

Unnamed: 0,target,text
0,0,is upset that he can't update his Facebook by ...
1,0,I dived many times for the ball. Managed to sa...
2,0,my whole body feels itchy and like its on fire
3,0,"no, it's not behaving at all. i'm mad. why am ..."
4,0,not the whole crew
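To make the effect of this filter concrete, the sketch below applies the same word-level rule to two made-up strings. The function name is a hypothetical stand-in that mirrors the logic of the cell above. Note that the rule drops any word containing "@" or "#", so things like e-mail addresses are removed as well.

```python
def remove_mentions_and_hashtags(text: str) -> str:
    """Drop every whitespace-separated word that contains '@' or '#'."""
    kept = [word for word in str(text).split() if "@" not in word and "#" not in word]
    return " ".join(kept)


# Made-up examples, not taken from the dataset.
tweet = "@friend loving this #sunny day"
email = "mail me at someone@example.com please"

cleaned_tweet = remove_mentions_and_hashtags(tweet)
cleaned_email = remove_mentions_and_hashtags(email)

print(cleaned_tweet)
print(cleaned_email)
```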


Now, let us split our data into train, validation (dev), and test sets.

In [11]:
from sklearn.model_selection import train_test_split

# Converts a pandas dataframe into a tensorflow dataset
def convert_to_tfds(dataframe):

  dataset = tf.data.Dataset.from_tensor_slices((dataframe['text'], dataframe['target']))
  dataset = dataset.shuffle(buffer_size=len(dataframe), seed=0)
  return dataset.batch(512).prefetch(tf.data.AUTOTUNE)

training_set = df.copy()

# Hold out 10% of the data as the dev set, then 10% of the remainder as the test set
train, dev = train_test_split(training_set, test_size=0.1, random_state = 0)
train, test = train_test_split(train, test_size = 0.1, random_state = 0)

ds_train = convert_to_tfds(train)
ds_dev = convert_to_tfds(dev)
ds_test = convert_to_tfds(test)
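Because the second split is taken from what remains after the first, the resulting proportions are roughly 81% train, 10% dev, and 9% test rather than 80/10/10. The helper below is a hypothetical sketch of that arithmetic only; it uses simple rounding and does not claim to reproduce the exact row counts that scikit-learn would produce.

```python
def split_sizes(n_rows: int, dev_fraction: float = 0.1, test_fraction: float = 0.1):
    """Approximate sizes of two successive hold-out splits."""
    n_dev = round(n_rows * dev_fraction)           # first split: dev set
    n_remaining = n_rows - n_dev
    n_test = round(n_remaining * test_fraction)    # second split: test set from the remainder
    n_train = n_remaining - n_test
    return n_train, n_dev, n_test


n_train, n_dev, n_test = split_sizes(1000)
print(n_train, n_dev, n_test)
```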

**Task 1 (60 points)**

> **Part 1 (35 points):** Implement your RNN either using an existing framework OR you can implement your own RNN cell structure. In either case, describe the structure of your RNN and the activation functions you are using for each time step and in the output layer. Define a metric you will use to measure the performance of your model (NOTE: Performance should be measured both for the validation set and the test set).


> **Part 2 (25 points):** Update your network from part 1 with either an LSTM or a GRU based cell structure. Re-do the training and performance evaluation. What are the major differences you notice? Why do you think those differences exist between the 2 implementations?







**PART 1**

We create our vocabulary by adapting a text vectorization layer on the training set.

In [12]:
# Encoder declaration

encoder = tf.keras.layers.TextVectorization()
encoder.adapt(ds_train.map(lambda text, label: text))

In [13]:
# Let's see how large of a vocabulary we're working with

len(encoder.get_vocabulary())

436408
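A vocabulary of this size is dominated by rare tokens, and every token gets its own row in the embedding table. One common way to keep the table smaller is to cap the vocabulary at the most frequent tokens; TextVectorization exposes a max_tokens argument for this. The function below is a plain-Python sketch of the idea on made-up sentences, with a hypothetical helper name. It reserves index 0 for padding and index 1 for out-of-vocabulary tokens, which is the convention TextVectorization follows.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, reserving two slots for padding and unknown words."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


# Made-up sentences for illustration.
vocab = build_vocabulary(["good good day", "bad day", "good night"], max_tokens=4)
print(vocab)
```

Tokens that fall outside the cap would all be mapped to the shared unknown index.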

**Model Architecture**

I decided to closely follow the architecture provided in https://www.tensorflow.org/text/tutorials/text_classification_rnn#setup since it is simple, easy to follow, and fulfills the task at hand.

Order of Layers: TextVectorization -> Embedding -> Bidirectional (SimpleRNN) -> Dense -> Dense

Each tweet is converted into a sequence of token indices, which the embedding layer maps to trainable vectors, one per word. The Bidirectional layer runs one SimpleRNN over the sequence forwards and another over it backwards, concatenates their final outputs, and sends the result to the two dense layers, which produce the classification.
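To illustrate what the bidirectional layer computes, here is a minimal NumPy sketch of a ReLU simple RNN run in both directions. It is not the Keras implementation: the weights are random placeholders, the input is a random stand-in for the embedding output, and masking is ignored. It only shows how two 64-unit final states become one 128-dimensional vector.

```python
import numpy as np


def simple_rnn_last_state(inputs, w_in, w_rec, bias):
    """Run a ReLU simple RNN over inputs of shape (time, features); return the final state."""
    state = np.zeros(w_rec.shape[0])
    for x_t in inputs:
        state = np.maximum(0.0, x_t @ w_in + state @ w_rec + bias)
    return state


rng = np.random.default_rng(0)
time_steps, embed_dim, units = 5, 64, 64

# Random stand-in for the embedding output of one five-word tweet.
embedded_tweet = rng.normal(size=(time_steps, embed_dim))


def init_weights():
    """Placeholder weights; each direction gets its own independent set."""
    return (
        rng.normal(scale=0.1, size=(embed_dim, units)),
        rng.normal(scale=0.1, size=(units, units)),
        np.zeros(units),
    )


forward_state = simple_rnn_last_state(embedded_tweet, *init_weights())
backward_state = simple_rnn_last_state(embedded_tweet[::-1], *init_weights())

# Concatenating both directions gives the vector passed to the dense layers.
sentence_vector = np.concatenate([forward_state, backward_state])
print(sentence_vector.shape)
```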

In [14]:
# Model Implementation

model = tf.keras.Sequential([
        encoder, # Text Vectorization Layer
        tf.keras.layers.Embedding( 
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            # Masking handles variable length tweets
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'), 
        tf.keras.layers.Dense(1) # Classification layer
])

In [15]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [17]:
history = model.fit(ds_train, epochs=2,
                    validation_data=ds_dev,
                    validation_steps = 30)

Epoch 1/2
Epoch 2/2


In [18]:
# Model evaluation

loss, accuracy = model.evaluate(ds_test)

print(f"Loss: {loss}, Accuracy: {accuracy}")


Loss: 0.41170719265937805, Accuracy: 0.8011319637298584


ReLU was used as the activation function at each time step and in the hidden dense layer. The final dense layer has no activation, so it outputs a logit, which the loss (configured with from_logits=True) interprets through a sigmoid. ReLU seemed to perform better than tanh in terms of both speed and convergence. The metric used for evaluation is accuracy, and the model reached a test accuracy of about 0.80. Since the model was trained for only two epochs, the accuracy could well increase with more training; however, even a single epoch proved computationally expensive and took a substantial amount of time. The validation and test accuracy are slightly lower than the training accuracy, which could indicate some overfitting.
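Since the final Dense(1) layer outputs a raw logit, a sigmoid is needed before a prediction can be read as a probability. The sketch below shows that conversion, the cross-entropy computed from a logit, and accuracy computed by thresholding the logit at 0, which is equivalent to thresholding the probability at 0.5. The function names and sample values are made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(logit: float, label: int) -> float:
    """Binary cross-entropy for one example, starting from a raw logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def accuracy_from_logits(logits, labels) -> float:
    """A logit above 0 corresponds to a probability above 0.5, i.e. a positive prediction."""
    predictions = [1 if z > 0 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits and labels.
sample_logits = [2.0, -1.0, 0.5, -3.0]
sample_labels = [1, 0, 0, 0]

print(sigmoid(0.0))
print(accuracy_from_logits(sample_logits, sample_labels))
```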

**PART 2**

In [19]:
# Model Implementation with LSTM cells

model = tf.keras.Sequential([
        encoder, # Text Vectorization Layer
        tf.keras.layers.Embedding( 
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            # Masking handles variable length tweets
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'), 
        tf.keras.layers.Dense(1) # Classification layer
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])



In [20]:
history = model.fit(ds_train, epochs=2,
                    validation_data=ds_dev,
                    validation_steps = 30)

Epoch 1/2
Epoch 2/2


In [21]:
# Model evaluation

loss, accuracy = model.evaluate(ds_test)

print(f"Loss: {loss}, Accuracy: {accuracy}")

Loss: 0.4225277900695801, Accuracy: 0.7948333621025085


The model architecture is exactly the same, other than the replacement of the SimpleRNN layer with an LSTM layer. Switching to the LSTM gave negligible, if not slightly worse, results in terms of accuracy (a test accuracy of about 0.795 versus 0.801), despite training taking practically twice as long as the simple RNN. The longer training time is expected, since an LSTM cell computes four gated transformations per time step where the simple RNN computes one. Part of it may also come from the activation='relu' setting, since on a GPU Keras only uses its fast cuDNN LSTM kernel when the default tanh activation is kept. The lack of improvement in accuracy could be explained by the fact that tweets are short, since they are usually used to express a single thought, moment, or idea in a quick and digestible manner. The primary advantage of LSTMs over traditional RNNs is carrying dependencies across long sequences, so they most likely had little to add here because our input sequences, the tweets, are short on average.
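One concrete difference between the two recurrent layers is their size. The sketch below counts trainable parameters under the standard layer definitions with bias terms: a simple RNN has one input matrix, one recurrent matrix, and one bias vector, while an LSTM has four such sets (three gates plus the candidate cell state). The helper names are hypothetical; the dimensions match the 64-dimensional embedding and 64 units used above.

```python
def simple_rnn_params(input_dim: int, units: int) -> int:
    """Input weights + recurrent weights + bias for a simple RNN layer."""
    return units * (input_dim + units + 1)


def lstm_params(input_dim: int, units: int) -> int:
    """An LSTM holds four sets of simple-RNN-sized weights."""
    return 4 * simple_rnn_params(input_dim, units)


embed_dim, units = 64, 64

# A bidirectional wrapper holds one independent copy of the layer per direction.
bidirectional_rnn = 2 * simple_rnn_params(embed_dim, units)
bidirectional_lstm = 2 * lstm_params(embed_dim, units)

print(bidirectional_rnn)
print(bidirectional_lstm)
```

Under these assumptions the LSTM layer carries four times as many recurrent parameters, which is consistent with it being more expensive to train.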

**Task 2 (40 points)**

In this task, use any of the pre-trained word embeddings. The Word2vec embedding link provided with the lecture notes can be useful to get started. Write your own code/function that uses these embeddings and outputs **cosine similarity** and a **dissimilarity** score for any pair of words. The dissimilarity score should be defined by you. You either can have your own idea of a dissimilarity score or refer to literature. In either case clearly describe how this score helps determine the dissimilarity between 2 words.

The pre-trained word embedding that I will be using is Wiki-words-500-with-normalization:

https://tfhub.dev/google/Wiki-words-500-with-normalization/2

In [71]:
# Loading the pre-trained word embedding

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/Wiki-words-500-with-normalization/2") # Wiki-words pretrained embedding that was trained on the English Wikipedia Corpus

In [73]:
def printSimilarity():
  x = str(input("Enter the first word: "))
  y = str(input("Enter the second word: "))
  embed_x = embed([x])[0].numpy()
  embed_y = embed([y])[0].numpy()

  similarity = np.dot(embed_x, embed_y) / (np.linalg.norm(embed_x) * np.linalg.norm(embed_y)) # Cosine similarity is given by (x . y)/(||x|| * ||y||)
  print(f"Cosine similarity between {x} and {y}: {similarity}")
  print(f"Dissimilarity between {x} and {y}: {1 - similarity}")


In [74]:
printSimilarity()

Enter the first word: dog
Enter the second word: cat
Cosine similarity between dog and cat: 0.7862539887428284
Dissimilarity between dog and cat: 0.21374601125717163


In [78]:
printSimilarity()

Enter the first word: car
Enter the second word: vehicle
Cosine similarity between car and vehicle: 0.6681612133979797
Dissimilarity between car and vehicle: 0.33183878660202026


In [80]:
printSimilarity()

Enter the first word: sky
Enter the second word: earth
Cosine similarity between sky and earth: 0.5910468101501465
Dissimilarity between sky and earth: 0.4089531898498535


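I defined the dissimilarity between the first word and the second word to simply be $1 - $ cosine similarity, which is commonly called the cosine distance. Intuitively, this makes sense because the two quantities are complements: the more similar two words are, the lower their dissimilarity. Since cosine similarity lies in $[-1, 1]$, the score lies in $[0, 2]$, where 0 means the two embeddings point in the same direction and larger values mean the words are less related. The definition also mirrors the complement rule $1 - p$ from probability, although cosine similarity is not itself a probability.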
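The behavior of this score at its extremes can be checked on small hand-made vectors. The sketch below uses toy 2-dimensional vectors rather than the actual word embeddings, and the function names are chosen for illustration.

```python
import numpy as np


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between u and v: (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_dissimilarity(u, v) -> float:
    """Dissimilarity defined as 1 - cosine similarity (cosine distance)."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_dissimilarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_dissimilarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_dissimilarity([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Parallel vectors give a score near 0, orthogonal vectors give 1, and opposite vectors give a score near 2, which matches the range described above.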