# Sentiment Analysis of Tweets

Sentiment analysis is performed on the `SemEval-2017` dataset to predict whether Tweets have a negative, positive or neutral sentiment.

### Objectives
This notebook applies several classical ML methods and an LSTM-based neural network to the sentiment analysis task. The inputs to all classical ML models are `Tf-Idf` vectors of each Tweet (after pre-processing). The LSTM model is trained using `GloVe` embeddings as the input features. The models are then compared to see which approach works best.

#### Models Trained
- Linear SVM
- Naive Bayes Classifier
- Logistic Regression
- LSTM

***Note***: Training time of a kernelized SVM scales at least quadratically with the number of training samples. It took too long to converge on the roughly 45,000 training samples at hand and was therefore dropped from the analysis.


### Import necessary packages

In [1]:
# Import necessary packages
import re
from os.path import join
import numpy as np
import os
from nltk.corpus import stopwords
from collections import Counter
from nltk import word_tokenize
import pickle

from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from imblearn.over_sampling import SMOTE

from torch import nn
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error

In [12]:
# Define test sets
testsets = ['twitter-test1.txt', 'twitter-test2.txt', 'twitter-test3.txt']

#Define paths for semeval data and GloVe Embedding data
semeval_path = './semeval-tweets'
glove_path = './data'

In [3]:
# Skeleton: Evaluation code for the test sets
def read_test(testset):
    '''
    read in the testset and return a dictionary of {<tweetid>:<sentiment>, ... }
    
    Args:
        testset: str, the file name of the testset to compare
    '''
    id_gts = {}
    with open(testset, 'r', encoding='utf8') as fh:
        for line in fh:
            fields = line.split('\t')
            tweetid = fields[0]
            gt = fields[1]

            id_gts[tweetid] = gt

    return id_gts


def confusion(id_preds, testset, classifier):
    '''
    print the confusion matrix of {'positive', 'negative', 'neutral'} between preds and testset
    
    Args:
        id_preds: a dictionary of predictions formatted as {<tweetid>:<sentiment>, ... }
        testset: str, the file name of the testset to compare
        classifier: str, the name of the classifier
    '''
    id_gts = read_test(testset)

    gts = []
    for m, c1 in id_gts.items():
        if c1 not in gts:
            gts.append(c1)

    gts = ['positive', 'negative', 'neutral']

    conf = {}
    for c1 in gts:
        conf[c1] = {}
        for c2 in gts:
            conf[c1][c2] = 0

    for tweetid, gt in id_gts.items():
        if tweetid in id_preds:
            pred = id_preds[tweetid]
        else:
            pred = 'neutral'
        conf[pred][gt] += 1

    print(''.ljust(12) + '  '.join(gts))

    for c1 in gts:
        print(c1.ljust(12), end='')
        for c2 in gts:
            if sum(conf[c1].values()) > 0:
                print('%.3f     ' % (conf[c1][c2] / float(sum(conf[c1].values()))), end='')
            else:
                print('0.000     ', end='')
        print('')

    print('')


def evaluate(id_preds, testset, classifier):
    '''
    print the macro-F1 score of {'positive', 'negative'} between preds and testset
    
    Args:
        id_preds: a dictionary of predictions formatted as {<tweetid>:<sentiment>, ... }
        testset: str, the file name of the testset to compare
        classifier: str, the name of the classifier
    '''
    id_gts = read_test(testset)

    acc_by_class = {}
    for gt in ['positive', 'negative', 'neutral']:
        acc_by_class[gt] = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}

    catf1s = {}

    ok = 0
    for tweetid, gt in id_gts.items():
        if tweetid in id_preds:
            pred = id_preds[tweetid]
        else:
            pred = 'neutral'

        if gt == pred:
            ok += 1
            acc_by_class[gt]['tp'] += 1
        else:
            acc_by_class[gt]['fn'] += 1
            acc_by_class[pred]['fp'] += 1

    catcount = 0
    itemcount = 0
    macro = {'p': 0, 'r': 0, 'f1': 0}
    micro = {'p': 0, 'r': 0, 'f1': 0}
    semevalmacro = {'p': 0, 'r': 0, 'f1': 0}

    microtp = 0
    microfp = 0
    microtn = 0
    microfn = 0
    for cat, acc in acc_by_class.items():
        catcount += 1

        microtp += acc['tp']
        microfp += acc['fp']
        microtn += acc['tn']
        microfn += acc['fn']

        p = 0
        if (acc['tp'] + acc['fp']) > 0:
            p = float(acc['tp']) / (acc['tp'] + acc['fp'])

        r = 0
        if (acc['tp'] + acc['fn']) > 0:
            r = float(acc['tp']) / (acc['tp'] + acc['fn'])

        f1 = 0
        if (p + r) > 0:
            f1 = 2 * p * r / (p + r)

        catf1s[cat] = f1

        n = acc['tp'] + acc['fn']

        macro['p'] += p
        macro['r'] += r
        macro['f1'] += f1

        if cat in ['positive', 'negative']:
            semevalmacro['p'] += p
            semevalmacro['r'] += r
            semevalmacro['f1'] += f1

        itemcount += n

    micro['p'] = float(microtp) / float(microtp + microfp)
    micro['r'] = float(microtp) / float(microtp + microfn)
    micro['f1'] = 2 * float(micro['p']) * micro['r'] / float(micro['p'] + micro['r'])

    semevalmacrof1 = semevalmacro['f1'] / 2

    print(testset + ' (' + classifier + '): %.3f' % semevalmacrof1)

### Load training set, dev set and testing set


In [4]:
# Load training set, dev set and testing set
data = {}
tweetids = {}
tweetgts = {}
tweets = {}

for dataset in ['twitter-training-data.txt'] + testsets + ['twitter-dev-data.txt']:
    data[dataset] = []
    tweets[dataset] = []
    tweetids[dataset] = []
    tweetgts[dataset] = []

    # write code to read in the datasets here
    with open(os.path.join(semeval_path, dataset)) as f:
        f_data = f.readlines()
    
    f_data = [x.strip() for x in f_data]
    data[dataset] = f_data
    tweets[dataset] = [x.split('\t')[2] for x in f_data]
    tweetids[dataset] = [x.split('\t')[0] for x in f_data]
    tweetgts[dataset] = [x.split('\t')[1] for x in f_data]

### Pre-processing
Each tweet is pre-processed to remove profile tags, hashtags and URLs.
Stemming/lemmatization is left out for now so that performance can be analysed without it.

In [5]:
#function to pre-process tweets from a given dataset
def pre_process(tweets_list):
    tweets_list = [x.lower() for x in tweets_list]
    
    #removing profile tags
    tweets_list = [re.sub(r'@\w+\s', '', x) for x in tweets_list]
    
    #removing URL's
    tweets_list = [re.sub(r'http[s]?://[\w./\-?]+', '', x) for x in tweets_list]
    
    #removing hashtags
    tweets_list = [re.sub(r'#\w+\s', '', x) for x in tweets_list]
    
    
    
    #print(len(tweets_list))
    for i,text in enumerate(tweets_list):
        processed = ''.join([x for x in text if x.isalnum() or x==' '])
        tweets_list[i] = processed
    #tweets_list = [''.join([x for text in tweets_list for x in text if x.isalnum() or x==' '])]
    
    return tweets_list


#pre-process all the tweets across datasets
for k in tweets:
    tweets[k] = pre_process(tweets[k])
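
The tag and hashtag patterns in `pre_process` end in `\s`, so a mention or hashtag at the very end of a tweet is not matched (its `@` or `#` is later dropped by the alphanumeric filter, but the word itself stays). The sketch below illustrates this on a made-up tweet and shows one possible variant. `strip_tags` is a hypothetical helper, not part of the notebook, and switching to it would change the features behind the results reported here.

```python
import re

def strip_tags(text):
    """Hypothetical variant: remove @mentions and #hashtags even at end of text."""
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    # collapse the whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = "great game tonight @espn #nba"  # made-up example tweet

# the two patterns used in pre_process, applied in the same order
original_style = re.sub(r'#\w+\s', '', re.sub(r'@\w+\s', '', sample))

print(repr(original_style))      # the trailing hashtag survives
print(repr(strip_tags(sample)))  # both tags are removed
```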

In [6]:
#get number of training samples
len(tweets['twitter-training-data.txt'])

45101

## Analysing Class distribution of Datasets

In [7]:
#train data
Counter(tweetgts['twitter-training-data.txt'])

Counter({'positive': 15986, 'negative': 8326, 'neutral': 20789})

In [None]:
train_vocab = set()
for tw in tweets['twitter-training-data.txt']:
    tokens = word_tokenize(tw)
    train_vocab.update(tokens)

In [9]:
len(train_vocab)

48090

### Oversampling with SMOTE to get a balanced dataset

From the analysis above, we see that the classes are imbalanced in the training set. The ratio is roughly `negative:positive:neutral = 1 : 1.9 : 2.5`. Two standard approaches to balancing a dataset are undersampling the majority classes and oversampling the minority classes. Since the training set is limited in size, undersampling is avoided and SMOTE is used to synthesise new minority-class samples from existing training samples.

The performance of the models will be compared on both original and SMOTE applied datasets.
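
SMOTE creates each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The following is a minimal NumPy sketch of that idea only, on made-up 2-D points. `smote_like` is a hypothetical helper and is not the `imblearn` implementation used in this notebook.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Generate n_new points on segments between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # pick one of its k nearest neighbours
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy data
new_points = smote_like(X_minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing samples, no new information is added; the resampled set only re-weights regions the minority class already occupies.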

In [10]:
#original class distribution
Counter(tweetgts['twitter-training-data.txt'])

Counter({'positive': 15986, 'negative': 8326, 'neutral': 20789})

In [12]:
tvec = TfidfVectorizer(stop_words=None, max_features=5000, min_df=5)
tfidf = tvec.fit_transform(tweets['twitter-training-data.txt'])
X_smote, y_smote = SMOTE(random_state=10).fit_resample(tfidf, tweetgts['twitter-training-data.txt'])

In [13]:
#class distribution after SMOTE
Counter(y_smote)

Counter({'positive': 20789, 'negative': 20789, 'neutral': 20789})

# Classifier Analysis

## 1. Logistic Regression with Tf-Idf features

In [24]:
tvec = TfidfVectorizer(stop_words=None, max_features=5000, min_df=5)
lr = LogisticRegression()

In [25]:
tfidf = tvec.fit_transform(tweets['twitter-training-data.txt'])

In [26]:
#tfidf_train = tvec.fit_transform(tweets['twitter-training-data.txt'])
#tfidf_val = tvec.fit_transform(tweets['twitter-training-data.txt'][-2000:])
lr_model = lr.fit(tfidf, tweetgts['twitter-training-data.txt'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [27]:
tfidf_test = tvec.transform(tweets['twitter-test1.txt'])
prediction = lr_model.predict(tfidf_test)
prediction

array(['neutral', 'negative', 'neutral', ..., 'positive', 'neutral',
       'neutral'], dtype='<U8')

In [28]:
print(confusion_matrix(tweetgts['twitter-test1.txt'],prediction))  
print(classification_report(tweetgts['twitter-test1.txt'],prediction))  
print(accuracy_score(tweetgts['twitter-test1.txt'], prediction))

[[ 166  322   69]
 [  26 1203  275]
 [  29  531  910]]
              precision    recall  f1-score   support

    negative       0.75      0.30      0.43       557
     neutral       0.59      0.80      0.68      1504
    positive       0.73      0.62      0.67      1470

    accuracy                           0.65      3531
   macro avg       0.69      0.57      0.59      3531
weighted avg       0.67      0.65      0.63      3531

0.6454262248654772
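
The per-class F1 scores in the report can be recomputed directly from the confusion matrix printed above (rows are true classes, columns are predicted classes). The sketch below does this with NumPy and also computes the average over the positive and negative classes only, which is what the `evaluate` function defined earlier prints; the variable names are illustrative.

```python
import numpy as np

labels = ['negative', 'neutral', 'positive']
# confusion matrix copied from the output above
cm = np.array([[166, 322, 69],
               [26, 1203, 275],
               [29, 531, 910]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp   # predicted as the class, but true class differs
fn = cm.sum(axis=1) - tp   # true class, but predicted as something else
f1 = 2 * tp / (2 * tp + fp + fn)

macro_f1 = float(f1.mean())  # all three classes weighted equally
semeval_f1 = float((f1[labels.index('positive')] + f1[labels.index('negative')]) / 2)

print(dict(zip(labels, f1.round(2))))
print(round(macro_f1, 2), round(semeval_f1, 3))
```

The two averages differ because the neutral class, which has the highest F1 here, is left out of the positive/negative average.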


#### Using the SMOTE dataset below, the F1 score for the negative class (the class the model above struggles with) improves by about 8 percentage points (0.43 to 0.51), while overall accuracy drops slightly (0.65 to 0.63).

In [29]:
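#Using SMOTE training data
#a fresh estimator is used so that lr_model (fit on the original data) is not overwritten
model_lr_smote = LogisticRegression().fit(X_smote, y_smote)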

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [30]:
prediction = model_lr_smote.predict(tfidf_test)
prediction

array(['neutral', 'negative', 'neutral', ..., 'positive', 'neutral',
       'neutral'], dtype='<U8')

In [31]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(tweetgts['twitter-test1.txt'],prediction))  
print(classification_report(tweetgts['twitter-test1.txt'],prediction))  
print(accuracy_score(tweetgts['twitter-test1.txt'], prediction))

[[ 255  230   72]
 [ 108 1007  389]
 [  87  413  970]]
              precision    recall  f1-score   support

    negative       0.57      0.46      0.51       557
     neutral       0.61      0.67      0.64      1504
    positive       0.68      0.66      0.67      1470

    accuracy                           0.63      3531
   macro avg       0.62      0.60      0.60      3531
weighted avg       0.63      0.63      0.63      3531

0.6321155480033984


## 2. Linear SVM with Tf-Idf features

In [32]:
#Linear SVM
lin_clf = svm.LinearSVC()
lin_svm_model = lin_clf.fit(tfidf, tweetgts['twitter-training-data.txt'])

In [33]:
#tfidf_test = tvec.transform(tweets['twitter-test1.txt'])
lin_svm_preds = lin_svm_model.predict(tfidf_test)
lin_svm_preds

array(['neutral', 'negative', 'neutral', ..., 'negative', 'neutral',
       'neutral'], dtype='<U8')

In [34]:
print(confusion_matrix(tweetgts['twitter-test1.txt'],lin_svm_preds))  
print(classification_report(tweetgts['twitter-test1.txt'],lin_svm_preds))  
print(accuracy_score(tweetgts['twitter-test1.txt'], lin_svm_preds))

[[ 194  289   74]
 [  43 1175  286]
 [  44  518  908]]
              precision    recall  f1-score   support

    negative       0.69      0.35      0.46       557
     neutral       0.59      0.78      0.67      1504
    positive       0.72      0.62      0.66      1470

    accuracy                           0.64      3531
   macro avg       0.67      0.58      0.60      3531
weighted avg       0.66      0.64      0.64      3531

0.6448598130841121


### Linear SVM (with class weights specified)
Linear SVM accepts a `class_weight` argument that sets a per-class penalty before training. This is another way to deal with imbalanced classes, especially if the minority class is of greater importance to us.

The results show a 7 percentage point improvement in the F1 score of the negative class (0.46 to 0.53) and a 2 point improvement in the overall macro-average F1 score (0.60 to 0.62), compared to the linear SVM model fit without `class_weight`.

In [38]:
lin_clf_weighted = svm.LinearSVC(class_weight={'negative':2, 'neutral':1, 'positive':1.5})
lin_svm_model_weighted = lin_clf_weighted.fit(tfidf, tweetgts['twitter-training-data.txt'])

In [39]:
lin_svm_preds = lin_svm_model_weighted.predict(tfidf_test)
lin_svm_preds

array(['neutral', 'negative', 'neutral', ..., 'negative', 'positive',
       'neutral'], dtype='<U8')

In [40]:
print(confusion_matrix(tweetgts['twitter-test1.txt'],lin_svm_preds))  
print(classification_report(tweetgts['twitter-test1.txt'],lin_svm_preds))  
print(accuracy_score(tweetgts['twitter-test1.txt'], lin_svm_preds))

[[ 254  222   81]
 [  79 1046  379]
 [  67  438  965]]
              precision    recall  f1-score   support

    negative       0.64      0.46      0.53       557
     neutral       0.61      0.70      0.65      1504
    positive       0.68      0.66      0.67      1470

    accuracy                           0.64      3531
   macro avg       0.64      0.60      0.62      3531
weighted avg       0.64      0.64      0.64      3531

0.6414613423959218
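
The class weights used in this section were chosen by hand. For comparison, scikit-learn's `class_weight='balanced'` option derives weights as `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training-set class counts shown earlier; it only computes the weights and does not retrain any model.

```python
# class counts of the training set, as printed by Counter(...) earlier
counts = {'positive': 15986, 'negative': 8326, 'neutral': 20789}

n_samples = sum(counts.values())
n_classes = len(counts)

# the 'balanced' heuristic: rarer classes receive proportionally larger weights
balanced = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, weight in balanced.items():
    print(label, round(weight, 2))
```

Relative to neutral, the negative class is weighted about 2.5 times higher under this heuristic, somewhat more than the hand-picked ratio of 2.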


### Linear SVM (using SMOTE enhanced train data)
With SMOTE-resampled training data, Linear SVM performance **degraded** relative to the class-weighted model fit on the original data (macro F1 0.60 vs 0.62, accuracy 0.62 vs 0.64).

In [35]:
lin_clf_smote = svm.LinearSVC()
lin_svm_model_smote = lin_clf_smote.fit(X_smote, y_smote)

In [36]:
lin_svm_preds = lin_svm_model_smote.predict(tfidf_test)
lin_svm_preds

array(['neutral', 'negative', 'neutral', ..., 'negative', 'neutral',
       'neutral'], dtype='<U8')

In [37]:
print(confusion_matrix(tweetgts['twitter-test1.txt'],lin_svm_preds))  
print(classification_report(tweetgts['twitter-test1.txt'],lin_svm_preds))  
print(accuracy_score(tweetgts['twitter-test1.txt'], lin_svm_preds))

[[281 202  74]
 [159 957 388]
 [108 408 954]]
              precision    recall  f1-score   support

    negative       0.51      0.50      0.51       557
     neutral       0.61      0.64      0.62      1504
    positive       0.67      0.65      0.66      1470

    accuracy                           0.62      3531
   macro avg       0.60      0.60      0.60      3531
weighted avg       0.62      0.62      0.62      3531

0.6207873123760974


## 3. Naive Bayes with Tf-Idf features

In [41]:
nb_model = naive_bayes.MultinomialNB()

In [42]:
nb_model.fit(tfidf, tweetgts['twitter-training-data.txt'])

MultinomialNB()

In [43]:
predictions_nb = nb_model.predict(tfidf_test)
predictions_nb

array(['neutral', 'positive', 'neutral', ..., 'positive', 'neutral',
       'neutral'], dtype='<U8')

In [44]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(tweetgts['twitter-test1.txt'],predictions_nb))  
print(classification_report(tweetgts['twitter-test1.txt'],predictions_nb))  
print(accuracy_score(tweetgts['twitter-test1.txt'], predictions_nb))

[[  52  403  102]
 [   5 1172  327]
 [   6  596  868]]
              precision    recall  f1-score   support

    negative       0.83      0.09      0.17       557
     neutral       0.54      0.78      0.64      1504
    positive       0.67      0.59      0.63      1470

    accuracy                           0.59      3531
   macro avg       0.68      0.49      0.48      3531
weighted avg       0.64      0.59      0.56      3531

0.5924667233078448


#### Naive Bayes trained on the original dataset performs very poorly on the minority negative class (recall 0.09, F1 0.17). Training the same model on the SMOTE dataset gives a significant improvement, as shown below: the macro-average F1 score goes up by 9 percentage points (0.48 to 0.57) and the F1 score for the negative class by 33 points (0.17 to 0.50).

In [45]:
nb_model_smote = naive_bayes.MultinomialNB()
nb_model_smote.fit(X_smote, y_smote)

MultinomialNB()

In [46]:
predictions_nb = nb_model_smote.predict(tfidf_test)
predictions_nb

array(['neutral', 'negative', 'neutral', ..., 'positive', 'positive',
       'neutral'], dtype='<U8')

In [47]:
print(confusion_matrix(tweetgts['twitter-test1.txt'],predictions_nb))  
print(classification_report(tweetgts['twitter-test1.txt'],predictions_nb))  
print(accuracy_score(tweetgts['twitter-test1.txt'], predictions_nb))

[[ 272  181  104]
 [ 136  796  572]
 [ 114  337 1019]]
              precision    recall  f1-score   support

    negative       0.52      0.49      0.50       557
     neutral       0.61      0.53      0.56      1504
    positive       0.60      0.69      0.64      1470

    accuracy                           0.59      3531
   macro avg       0.58      0.57      0.57      3531
weighted avg       0.59      0.59      0.59      3531

0.5910506938544322


#### Taken together, SMOTE raised the negative-class F1 score for all three models (Logistic Regression 0.43 to 0.51, Linear SVM 0.46 to 0.51, Naive Bayes 0.17 to 0.50), at the cost of slightly lower accuracy. For Linear SVM, class weights gave a better result than SMOTE.

## Building LSTM

A simple neural network built around an LSTM is designed to predict sentiments for this task (the code below stacks two LSTM layers via `num_layers=2`). The architecture can be extended later to train more complex models.

### Loading Glove Embeddings
Pre-trained 100 dimensional GloVe embeddings are loaded.

In [48]:
with open(join(glove_path, 'glove.6B.100d.txt'), encoding='utf8') as f:
    lines = f.readlines()

In [49]:
glove_vocab_size = len(lines)
glove_vocab_size

400000

In [50]:
words = []
word2idx = {}
vectors = np.zeros((glove_vocab_size, 100))

for ix, l in enumerate(lines):
    line = l.split()
    word = line[0]
    words.append(word)
    word2idx[word] = ix
    vect = np.array(line[1:]).astype(np.float32)
    vectors[ix] = vect
    
glove = {w: vectors[word2idx[w]] for w in words}
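
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. Since only a few thousand words are needed later, the file could also be read line by line and filtered while parsing. The sketch below shows this on a tiny in-memory stand-in for the file with 3-dimensional toy vectors; `load_glove_subset` is a hypothetical helper, not part of the notebook.

```python
import io
import numpy as np

def load_glove_subset(handle, wanted):
    """Parse GloVe-style lines, keeping only the words listed in `wanted`."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        word = parts[0]
        if word in wanted:
            vectors[word] = np.array(parts[1:]).astype(np.float32)
    return vectors

# toy stand-in for the real embedding file (made-up values)
toy_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\nmeh 0.0 0.0 0.0\n")
subset = load_glove_subset(toy_file, wanted={'good', 'bad'})
print(sorted(subset), subset['good'].shape)
```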

#### The next step is to identify the top 5000 words from the training vocabulary for use in the LSTM's embedding layer. This is done with the help of Tf-Idf: the 5000 words with the highest scores (essentially the highest relevance) are picked from the feature space.

In [52]:
tvec_lstm = TfidfVectorizer(stop_words=None, max_features=10000, min_df=5)
tfidf_lstm = tvec_lstm.fit_transform(tweets['twitter-training-data.txt'])

In [53]:
feature_names = tvec_lstm.get_feature_names_out()
feature_names

array(['00', '007', '01', ..., 'zuckerman', 'zulu', 'zumba'], dtype=object)

In [54]:
feature_array = np.array(feature_names)
tfidf_sorting = np.argsort(tfidf.toarray()).flatten()[::-1]

n = 5000
top_vocab = feature_array[tfidf_sorting][:n]

#display the top 100
top_vocab[:100]

array(['lack', '86', 'late', 'driver', 'fear', 'faint', 'italy', 'inter',
       'celebrity', 'celebs', 'celine', 'cell', 'cells', 'celtic', 'cena',
       'celtics', 'celebrations', 'cenas', 'censor', 'censors', 'cent',
       'center', 'central', 'celebrities', 'lifestyle', 'celebration',
       'cease', 'cb', 'cba', 'cbb', 'cbs', 'cc', 'cd', 'cdnpoli',
       'ceasefire', 'celebrating', 'cebu', 'cedar', 'celeb', 'celebrate',
       'celebrated', 'celebrates', 'centuries', 'centre', 'ceremony',
       'century', 'challenging', 'chamber', 'chamberlain', 'chambers',
       'champ', 'champagne', 'champion', 'champions', 'championship',
       'championships', 'champs', 'chance', 'chancellor', 'chances',
       'chandler', 'chanel', 'chalmers', 'challenges', 'ceo',
       'challenged', 'cavani', 'certain', 'certainly', 'certificate',
       'ces', 'cesc', 'cesena', 'cet', 'cfc', 'ch', 'chad', 'chain',
       'chair', 'chairman', 'challenge', 'cave', 'causes', 'causing',
       'carly', '

In [55]:
with open('./top_vocab.pkl', 'wb') as p_file:
    pickle.dump(top_vocab, p_file)
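
One caveat about the ranking cell above: `np.argsort` on a 2-D array sorts within each row, so after `.flatten()[::-1]` the first `n` indices reflect only the ordering of the last row (the last tweet) rather than a corpus-wide ranking. The cell also sorts `tfidf` (built with `max_features=5000`) while indexing the feature names of `tvec_lstm` (`max_features=10000`). A corpus-wide ranking could instead aggregate the scores per term, for example by summing each column. The sketch below contrasts the two on a made-up 3x3 score matrix; summing is one possible aggregation, chosen here as an assumption.

```python
import numpy as np

terms = np.array(['good', 'bad', 'meh'])
# made-up tf-idf scores: rows are documents, columns are terms
scores = np.array([[0.90, 0.0, 0.1],
                   [0.80, 0.0, 0.4],
                   [0.05, 0.3, 0.0]])

# row-wise argsort, flattened and reversed (the pattern used above)
row_wise = np.argsort(scores).flatten()[::-1]
print(terms[row_wise][:3])    # ordering of the last document only

# aggregate per term first, then sort once
per_term = scores.sum(axis=0)
global_order = np.argsort(per_term)[::-1]
print(terms[global_order])    # corpus-wide ordering
```

For a SciPy sparse matrix such as the output of `TfidfVectorizer`, the column sums can be obtained with `np.asarray(matrix.sum(axis=0)).ravel()` without calling `toarray()`.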

### Mapping top 5000 vocab to corresponding Glove embeddings

In [6]:
def get_glove_embeddings(top_vocab):
    matrix_len = 5000 #vocab size
    emb_dim = 100
    weights_matrix = np.zeros((matrix_len, emb_dim), dtype='float')
    words_found = 0
    glove_vocab = glove.keys()

    #initialise vocab->ix mapping with padding placeholder and unknown token
    emb_vocab_to_idx = {'':0, 'UNK':1}
    #initialise vocab list mapping with padding placeholder and unknown token
    emb_vocabulary = ['', 'UNK']

    #setting a random vector for UNK token
    weights_matrix[1] = np.random.normal(scale=0.6, size=(emb_dim, ))

    #since we are adding two custom tokens to vocab, the last two words from top vocab is dropped
    # to maintain vocab size of 5000. For the same reason, all indexing is offset by 2 below
    for i, word in enumerate(top_vocab[:-2]):
        #if word exists in Glove, use Glove embedding
        if word in glove_vocab: 
            weights_matrix[i+2] = glove[word]
            #keep count of how many words are actually present in Glove
            words_found += 1

        #if word not in Glove, initialise with a random vector
        else:
            weights_matrix[i+2] = np.random.normal(scale=0.6, size=(emb_dim, ))

        emb_vocab_to_idx[word] = i+2
        emb_vocabulary.append(word)
        
    return weights_matrix, emb_vocab_to_idx, emb_vocabulary, words_found

In [62]:
#try fetching the embeddings of the top vocab words
weights_matrix, emb_vocab_to_idx, emb_vocabulary, words_found = get_glove_embeddings(top_vocab)

In [63]:
words_found #majority of words are already present in Glove, this is good! (4824/5000 are present)

4824

### Encoding the tweets in Glove's embedding space

First, the size of the document vector to be fed into the neural network has to be determined. The tweets in the training data are analysed to figure this out.

In [64]:
all_lengths = []
for tweet in tweets['twitter-training-data.txt']:
    cur_length = len(word_tokenize(tweet))
    all_lengths.append(cur_length)

In [65]:
np.mean(all_lengths)

17.578834172191304

In [66]:
max(all_lengths)

35

#### The average tweet length in the training set is about 17.6 tokens and the longest tweet has 35 tokens
With this in mind, a document length of 20 is chosen: shorter tweets are zero-padded and longer ones are truncated to their first 20 tokens.

***Note***: Choosing a very large document length would just feed a lot of heavily padded vectors to the network and this will not help with its training/convergence.
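
One way to sanity-check a candidate document length is to measure the fraction of tweets that fit into it without truncation. The sketch below does this on a made-up list of token counts; with the notebook's data, `all_lengths` would take the place of `toy_lengths`. `coverage` is a hypothetical helper.

```python
import numpy as np

def coverage(lengths, n):
    """Fraction of documents with at most n tokens (no truncation needed)."""
    lengths = np.asarray(lengths)
    return float((lengths <= n).mean())

toy_lengths = [5, 12, 17, 18, 20, 22, 25, 35]  # made-up token counts

for n in (15, 20, 25, 35):
    print(n, coverage(toy_lengths, n))
```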

In [7]:
#encode tweets into embedding space with a document vector length of 20
def encode_tweet(text, vocab_conv_to_idx, N=20):
    tokenized = word_tokenize(text)
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([vocab_conv_to_idx.get(word, vocab_conv_to_idx["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]
    return encoded
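
To make the padding and truncation behaviour concrete, the sketch below applies the same encoding logic to a toy vocabulary. It is a simplified stand-in: `encode_tokens` is hypothetical, takes pre-split tokens (plain `str.split` instead of `word_tokenize`), and uses `N=5` so the result is easy to read.

```python
import numpy as np

def encode_tokens(tokens, vocab_to_idx, N=20):
    """Map tokens to indices, pad with 0 up to N, truncate beyond N."""
    encoded = np.zeros(N, dtype=int)
    ids = [vocab_to_idx.get(tok, vocab_to_idx['UNK']) for tok in tokens]
    length = min(N, len(ids))
    encoded[:length] = ids[:length]
    return encoded

toy_vocab = {'': 0, 'UNK': 1, 'good': 2, 'movie': 3}  # made-up vocabulary

short_tweet = encode_tokens('good movie tonight'.split(), toy_vocab, N=5)
long_tweet = encode_tokens('good good movie movie good movie good'.split(), toy_vocab, N=5)

print(short_tweet)  # unknown word maps to 1, remaining slots are padded with 0
print(long_tweet)   # tokens beyond position N are dropped
```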

In [68]:
#encoding the training data
train_X = []
for tweet in tweets['twitter-training-data.txt']:
    encoding = encode_tweet(tweet, emb_vocab_to_idx)
    train_X.append(encoding)

In [69]:
train_X[10] #sample encoded vector - the 1's correspond to 'UNK' tokens

array([2685,    1,    1,    1,    1, 3459,    1, 1474, 2795,    1,  651,
          1,    1, 2295, 3995,    1, 3459,  952,    1,    1])

In [70]:
#encoding the validation data
val_X = []
for tweet in tweets['twitter-dev-data.txt']:
    encoding = encode_tweet(tweet, emb_vocab_to_idx)
    val_X.append(encoding)

In [71]:
val_X[10]

array([3066, 2062, 4630,    1,    1,    1, 2693, 4448, 3673,    1,    1,
          1, 3092,    1, 4588,  936,    1,    1,    1,    1])

## Setting up Torch network

In [8]:
#defining custom dataset class to use for the dataloader
class twitterDataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        #mapping categorical labels to integers
        class_mapping = {'negative':0, 'neutral':1, 'positive':2}
        return torch.from_numpy(self.X[idx].astype(np.int32)), class_mapping.get(self.y[idx])

In [73]:
#create train/val datasets
train_ds = twitterDataset(train_X, tweetgts['twitter-training-data.txt'])
valid_ds = twitterDataset(val_X, tweetgts['twitter-dev-data.txt'])

In [74]:
#specifying batch and vocab size
batch_size = 2048
vocab_size = 5000

In [75]:
#creating dataloaders
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(valid_ds, batch_size=batch_size)

In [76]:
#display data from the first training batch
for x,y in train_dl:
    print(x,y)
    break

tensor([[   1,    1,    1,  ...,    0,    0,    0],
        [   1,    1, 3459,  ...,    0,    0,    0],
        [   1, 3377,    1,  ...,    1,    0,    0],
        ...,
        [   1,    1, 4578,  ...,    0,    0,    0],
        [   1, 3092,    1,  ...,    0,    0,    0],
        [   1, 1573,    1,  ...,    1, 2971,  926]], dtype=torch.int32) tensor([2, 2, 1,  ..., 1, 1, 2])


### Network Architecture

***Note***: A softmax layer seems intuitive for this multi-class classification task. However, `F.cross_entropy` (used in the training loop) applies log-softmax internally and expects raw scores, so the three output values of the linear layer are returned as the final output of the network; this also led to better convergence.
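
PyTorch's `F.cross_entropy`, used in the training loop, applies log-softmax to its input internally, so the network can return raw scores (logits). The NumPy sketch below, on made-up logits, illustrates two points: the loss computed from logits matches the negative log of the softmax probabilities, and the predicted class (argmax) is the same with or without softmax. The helper names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy computed directly from raw scores (log-sum-exp form)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])  # made-up scores
targets = np.array([0, 2])

probs = softmax(logits)
loss = cross_entropy_from_logits(logits, targets)
same_prediction = bool((logits.argmax(axis=1) == probs.argmax(axis=1)).all())

print(loss, same_prediction)
```

Adding an explicit softmax layer before such a loss would normalise the scores twice, which flattens the gradients.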

In [9]:
class LSTM_custom(torch.nn.Module) :
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, glove_weights) :
        '''
        Custom LSTM model incorporating pre-trained GloVe embeddings.

        Args:
            vocab_size: size of vocabulary in the embedding layer (5000 in this case)
            embedding_dim: GloVe_vector dimension
            hidden_dim: size of output from the hidden state of LSTM
        '''
        
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embeddings.weight.data.copy_(torch.from_numpy(glove_weights))
        
        # Freeze embedding layer so that pre-trained Glove embeddings don't get updated while training
        self.embeddings.weight.requires_grad = False
        
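        #dropout layer used in forward(); the rate is not stated in the original, 0.2 is an assumed value
        self.dropout = nn.Dropout(0.2)
        #specify the LSTM layer that follows the embedding layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, batch_first=True)
        #linear layer to compute three values corresponding to each of the three sentiments
        self.linear = nn.Linear(hidden_dim, 3)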
        
    def forward(self, x):
        '''
        Perform a single forward pass through the network
        
        Args:
            x: input to network
        
        Returns:
            torch_tensor: final output of the neural network after forward pass
        '''
        #encoded doc vector (seq_len=20) is fed to embedding layer
        x = self.embeddings(x)
        #drop out to try avoid overfitting
        x = self.dropout(x)
        #embeddings of size (seq_len=20, emb_dim=100) are generated and fed to lstm
        lstm_out, (ht, ct) = self.lstm(x)
        #the ht(hidden_state) output of the lstm is taken and fed to the final linear layer
        x = self.linear(ht[-1])
        
        #the scores for the three classes outputted by the linear layer is the final output of network
        return x

### Training Loop

In [10]:
def train_model(model, train_dl, val_dl, epochs=10, lr=0.001, patience=5):
    '''
    Function to train the PyTorch model
    
    Args:
        model: torch NN model
        train_dl: DataLoader for training data
        val_dl: DataLoader for validation data
        epochs: number of training epochs
        lr: learning rate
        patience: number of epochs to wait without improvement in validation score before halting training
    '''
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    comp = float('inf')
    
    #patience = 5
    #patience_window_train_loss = []
    #patience_window_val_loss = []
    best_train_loss, best_val_loss = float('inf'), float('inf')
    
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y in train_dl:
            x = x.long()
            y = y.long()
            y_pred = model(x)
            #print(y_pred)
            #print(y_pred.shape)
            #break
            optimizer.zero_grad()
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        val_loss, val_acc, val_rmse = validation_metrics(model, val_dl)
        if i % 5 == 1:
            print('Epoch-{} : '.format(i), "train loss %.3f, val loss %.3f, val accuracy %.3f, and val rmse %.3f" % (sum_loss/total, val_loss, val_acc, val_rmse))
            
        if i%patience == 0:
            if val_loss < best_val_loss:
                print('Epoch-{} Val Loss Improved! Saving Model to Disk.'.format(i))
                torch.save(model.state_dict(), './best_lstm_model.pt')
                best_val_loss = val_loss
                
            else:
                print('Val Loss did not improve within patience window. Stop Training.')
                return None

def validation_metrics (model, valid_dl):
    model.eval()
    correct = 0
    total = 0
    sum_loss = 0.0
    sum_rmse = 0.0
    for x, y in valid_dl:
        x = x.long()
        y = y.long()
        y_hat = model(x)
        loss = F.cross_entropy(y_hat, y)
        pred = torch.max(y_hat, 1)[1]
        correct += (pred == y).float().sum()
        total += y.shape[0]
        sum_loss += loss.item()*y.shape[0]
        sum_rmse += np.sqrt(mean_squared_error(pred, y.unsqueeze(-1)))*y.shape[0]
    return sum_loss/total, correct/total, sum_rmse/total
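
`train_model` compares the validation loss only at epochs that are multiples of `patience`. A more common formulation of early stopping counts consecutive epochs without improvement and stops once that count reaches `patience`. The sketch below shows this variant on a made-up list of validation losses; `early_stop_epoch` is a hypothetical helper and is not the logic used for the results in this notebook.

```python
def early_stop_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch of the best validation loss)."""
    best_loss = float('inf')
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it (a checkpoint would be saved here)
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# made-up validation losses, one per epoch
toy_losses = [0.95, 0.90, 0.86, 0.87, 0.88, 0.85, 0.86, 0.87, 0.89]
stop_epoch, best_epoch = early_stop_epoch(toy_losses, patience=3)
print(stop_epoch, best_epoch)
```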

## Model Training

In [79]:
lstm_model = LSTM_custom(vocab_size, embedding_dim=100, hidden_dim=50, glove_weights=weights_matrix)

In [80]:
train_model(lstm_model, train_dl, val_dl, epochs=200, lr=0.01, patience=10)

Epoch-0 Val Loss Improved! Saving Model to Disk.
Epoch-1 :  train loss 0.981, val loss 0.946, val accuracy 0.513, and val rmse 0.825
Epoch-6 :  train loss 0.867, val loss 0.855, val accuracy 0.593, and val rmse 0.689
Epoch-10 Val Loss Improved! Saving Model to Disk.
Epoch-11 :  train loss 0.835, val loss 0.844, val accuracy 0.604, and val rmse 0.714
Epoch-16 :  train loss 0.812, val loss 0.839, val accuracy 0.599, and val rmse 0.694
Val Loss did not improve within patience window. Stop Training.


In [81]:
#load best model
best_model = LSTM_custom(vocab_size, embedding_dim=100, hidden_dim=50, glove_weights=weights_matrix)
best_model.load_state_dict(torch.load('./best_lstm_model.pt'))

<All keys matched successfully>

In [82]:
validation_metrics(lstm_model, val_dl)

(0.8704016208648682, tensor(0.5750), 0.7671375365604267)

### Performance
The LSTM model overfit quite easily when trained without an early-stopping condition. Even with early stopping, its performance on the validation set does not seem to be a true reflection of generalization: performance on the test sets suffers, as shown in the cells below.

The `patience` window was therefore set quite aggressively, stopping training early to combat overfitting. Even so, performance on the test sets shows that the model has not learned to generalize, and there is definitely more work to be done here.
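
For reference, `train_model` compares the validation loss against the best value only once every `patience` epochs. A common alternative, sketched below, checks every epoch and stops after `patience` consecutive epochs without improvement. This is a minimal illustration rather than the code used in this notebook: the class `EarlyStopper`, its `min_delta` parameter, and the loss values are all hypothetical.

```python
class EarlyStopper:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest decrease that counts as an improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return (improved, should_stop) for the latest validation loss."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only
losses = [0.95, 0.90, 0.86, 0.87, 0.88, 0.89, 0.85]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    improved, should_stop = stopper.step(loss)
    # a real training loop would save the model checkpoint whenever `improved` is True
    if should_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

With these toy losses the stopper halts before seeing the final, lower value, which shows the usual trade-off: a short patience window limits overfitting but can also cut training off before a later improvement.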

# Training and Evaluating all models (summary)
This cell trains and evaluates the models discussed above so that they can be compared in one place. Performance is reported for each model on the three test sets (the validation set is NOT included). The metric shown for every model is the **macro F1 score**, which assigns equal importance to all three sentiment classes.
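
To make the metric concrete, the sketch below computes a macro F1 score by hand: one F1 per class, then an unweighted mean. It is written in plain Python with toy labels. The function `macro_f1` is hypothetical, and whether the notebook's `evaluate` helper averages over exactly these three classes is an assumption.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy labels, for illustration only
y_true = ['negative', 'neutral', 'neutral', 'positive', 'positive', 'positive']
y_pred = ['neutral', 'neutral', 'neutral', 'positive', 'positive', 'neutral']
score = macro_f1(y_true, y_pred, ['negative', 'neutral', 'positive'])
print(round(score, 3))
```

In this toy case four of six predictions are correct, yet the macro F1 is noticeably lower than the accuracy because the single negative tweet is missed and that class counts as much as the other two.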

In [11]:
#class mapping to convert int to categorical labels
class_map = {0:'negative', 1:'neutral', 2:'positive'}

#creating Tf-Idf representation of the train data as this is used across all models for training
tvec = TfidfVectorizer(stop_words=None, max_features=5000, min_df=5)
tfidf = tvec.fit_transform(tweets['twitter-training-data.txt'])

for classifier in ['NB', 'LR', 'SVM', 'LSTM']:
    for features in ['tfidf', 'glove']:
        # Skeleton: Creation and training of the classifiers
        if classifier == 'LR':
            if features != 'tfidf':
                continue
            print('Training ' + classifier)
            #instantiate LR model
            lr = LogisticRegression()
            #train model
            model = lr.fit(tfidf, tweetgts['twitter-training-data.txt'])
            
        elif classifier == 'SVM':
            if features != 'tfidf':
                continue
            print('Training ' + classifier)
            #instantiate SVM model
            lin_clf_weighted = svm.LinearSVC(class_weight={'negative':2, 'neutral':1, 'positive':1.5})
            #train model
            model = lin_clf_weighted.fit(tfidf, tweetgts['twitter-training-data.txt'])
            
        elif classifier == 'NB':
            if features != 'tfidf':
                continue
            print('Training ' + classifier)
            #instantiate NB model
            nb = naive_bayes.MultinomialNB()
            #train model
            model = nb.fit(tfidf, tweetgts['twitter-training-data.txt'])
            
        elif classifier == 'LSTM':
            # write the LSTM classifier here
            if features != 'glove':
                continue
            print('Training ' + classifier)
            
            #Read the Glove embeddings
            with open(join(glove_path,'glove.6B.100d.txt'), encoding='utf-8') as f:
                lines = f.readlines()
            print('Log: Glove data read from disk successfully')
                
            #Load Glove Embeddings into dictionary mapping word->embedding
            words = []
            word2idx = {}
            glove_vocab_size = len(lines)
            vectors = np.zeros((glove_vocab_size, 100))
            print('Log: Creating word->embedding dict for all words in Glove vocabulary...')
            for ix, l in enumerate(lines):
                line = l.split()
                word = line[0]
                words.append(word)
                word2idx[word] = ix
                vect = np.array(line[1:]).astype(np.float32)
                vectors[ix] = vect
            #Glove embedding dictionary
            glove = {w: vectors[word2idx[w]] for w in words}
                
            #use Tf-Idf to help choose the 5000 most relevant words in the training vocabulary
            print('Log: Fitting Tf-Idf vectoriser to train data')
            tvec_lstm = TfidfVectorizer(stop_words=None, max_features=10000, min_df=5)
            tfidf_lstm = tvec_lstm.fit_transform(tweets['twitter-training-data.txt'])
            
            
            
            '''                     WARNING!!!!
            The lines below, which convert the Tf-Idf matrix from CSR to dense form, consume a lot of RAM.
            On my machine with 16GB of RAM, they only just ran. If the kernel dies due to a memory
            error, please set 'memory_issue_flag' to True below so that the pickled list of top vocabulary
            words is loaded instead.
            '''
            memory_issue_flag = False
            
            if not memory_issue_flag:
                #get list of all features(words) in Tf-Idf feature space
                print('Log: Converting Tf-Idf to dense representation')
                feature_names = tvec_lstm.get_feature_names_out()
                feature_array = np.array(feature_names)
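                #rank words by their total Tf-Idf score over the training set (highest first);
                #the scores must come from tfidf_lstm so that the indices line up with feature_array
                tfidf_sorting = np.argsort(tfidf_lstm.toarray().sum(axis=0))[::-1]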

                #fetching top 5000 words
                n = 5000
                top_vocab = feature_array[tfidf_sorting][:n]
                
            else:
                print('Log: Loading top vocab words from pickle file to avoid memory issue')
                with open('./top_vocab.pkl', 'rb') as p_file:
                    top_vocab = pickle.load(p_file)
            
            print('Log: Identified top 5000 words from train set vocabulary')
            
            #helper function to get Glove embeddings for top words(i.e our custom vocabulary)
            print('Log: Creating embedding matrix using Glove vectors for our custom top vocabulary')
            weights_matrix, emb_vocab_to_idx, emb_vocabulary, _ = get_glove_embeddings(top_vocab)
            
            #encoding the training data as vectors of indices to index into custom vocabulary glove embeddings
            print('Log: Encoding Train data')
            train_X = []
            for tweet in tweets['twitter-training-data.txt']:
                encoding = encode_tweet(tweet, emb_vocab_to_idx)
                train_X.append(encoding)
                
            #encoding the validation data
            print('Log: Encoding Validation data')
            val_X = []
            for tweet in tweets['twitter-dev-data.txt']:
                encoding = encode_tweet(tweet, emb_vocab_to_idx)
                val_X.append(encoding)
                
            #specifying batch size for datasets/dataloaders
            batch_size = 2048
            vocab_size = 5000
                
            #creating dataset and dataloader for training
            train_ds = twitterDataset(train_X, tweetgts['twitter-training-data.txt'])
            valid_ds = twitterDataset(val_X, tweetgts['twitter-dev-data.txt'])
            
            #creating dataloaders
            print('Log: Creating data loaders for train and validation data')
            train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
            val_dl = DataLoader(valid_ds, batch_size=batch_size)
            
            #instantiate custom LSTM model using class defined earlier
            print('Log: Instantiating custom LSTM model')
            lstm_model = LSTM_custom(vocab_size, embedding_dim=100, hidden_dim=50, glove_weights=weights_matrix)
            #train network using helper function defined earlier
            print('Log: Starting LSTM training')
            train_model(lstm_model, train_dl, val_dl, epochs=50, lr=0.01)
            print('Log: Training Complete')
            
            #load the best model from disk
            print('Log: Loading best trained model from disk')
            best_model = LSTM_custom(vocab_size, embedding_dim=100, hidden_dim=50, glove_weights=weights_matrix)
            best_model.load_state_dict(torch.load('./best_lstm_model.pt'))
            #set model to evaluation mode so that layers such as dropout behave deterministically during inference
            best_model.eval()
            
        else:
            print('Unknown classifier name: ' + classifier)
            continue

        # Prediction performance of the classifiers
        for testset in testsets:
            id_preds = {}
            # write the prediction and evaluation code here
            
            #Inference for LSTM
            if classifier == 'LSTM':
                #getting encodings for tweets in test set
                X_test = []
                for tweet in tweets[testset]:
                    encoding = encode_tweet(tweet, emb_vocab_to_idx)
                    X_test.append(encoding)
                
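                #creating dataloader for test set (no shuffling, so predictions stay aligned with tweetids)
                test_ds = twitterDataset(X_test, tweetgts[testset])
                test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
                
                #collect predictions from every batch rather than keeping only the last one
                predictions = []
                with torch.no_grad():
                    for x, _ in test_dl:
                        x = x.long()
                        #run inference on batch
                        y_hat = best_model(x)
                        preds = torch.max(y_hat, 1)[1].tolist()
                        predictions.extend(class_map[p] for p in preds)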
            
            #For all other models
            else:
                tfidf_test = tvec.transform(tweets[testset])
                predictions = model.predict(tfidf_test)
            
            
            for idd, pred in zip(tweetids[testset], predictions):
                id_preds[idd] = pred

            testset_name = testset
            testset_path = join('semeval-tweets', testset_name)
            evaluate(id_preds, testset_path, features + '-' + classifier)

Training NB
semeval-tweets/twitter-test1.txt (tfidf-NB): 0.398
semeval-tweets/twitter-test2.txt (tfidf-NB): 0.430
semeval-tweets/twitter-test3.txt (tfidf-NB): 0.404
Training LR


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


semeval-tweets/twitter-test1.txt (tfidf-LR): 0.547
semeval-tweets/twitter-test2.txt (tfidf-LR): 0.554
semeval-tweets/twitter-test3.txt (tfidf-LR): 0.511
Training SVM
semeval-tweets/twitter-test1.txt (tfidf-SVM): 0.599
semeval-tweets/twitter-test2.txt (tfidf-SVM): 0.602
semeval-tweets/twitter-test3.txt (tfidf-SVM): 0.564
Training LSTM
Log: Glove data read from disk successfully
Log: Creating word->embedding dict for all words in Glove vocabulary...
Log: Fitting Tf-Idf vectoriser to train data
Log: Converting Tf-Idf to dense representation
Log: Identified top 5000 words from train set vocabulary
Log: Creating embedding matrix using Glove vectors for our custom top vocabulary
Log: Encoding Train data
Log: Encoding Validation data
Log: Creating data loaders for train and validation data
Log: Instantiating custom LSTM model
Log: Starting LSTM training
Epoch-0 Val Loss Improved! Saving Model to Disk.
Epoch-1 :  train loss 1.001, val loss 0.961, val accuracy 0.501, and val rmse 0.729
Epoch-5 

# TODO
Debug why the LSTM model overfits so easily and train a more robust model. LSTMs are known to perform very well on NLP tasks, and performance here certainly has room for improvement.