# Sentiment Analysis of COVID-19 tweets

Sanket Mayuresh Bhave, Dept. of Computer Science, Colorado State University

sanket.bhave@colostate.edu

# Motivation and Background

Sentiment analysis is very important as it helps to understand people's sentiments and thoughts on a particular topic. This is visible through various examples like sentiment analysis of user reviews on an online retailing website, etc. Sentiment analysis can also be applied on various statements made on social media platforms like twitter. People react on social media and this can be helpful to understand people's sentiments about a topic, situation or incidence. This project aims to do sentiment analysis of such user tweets. These tweets are between the time frame February 2020 till March 2020 and related to COVID-19 pandemic. This has various applications like it helps understand public opinion and hence influences decision making. This can also be useful to decide how much moderation needs to be applied on social media to curtail misinformation. People's sentiments can be used to determine the next course of action and what the government needs to focus on for health and public management. 

This project can help build models that can do automatic sentiment analysis of user tweets and make the job easier for the policy-makers. Further, it also analyzes different models and their performance on categorizing the tweets based on their sentiments. 


Another motivation for this problem is to get the hands dirty on different important models not explored fully in the class. Yet another aim is to handle a dataset and build a model in a data scientist way by following all the steps- preprocessing, training, testing and reporting the results, on a huge dataset.

For this task, a large-scale sentiment dataset, COVID-Senti has been used. Its use has been demostrated in [1] and this project aims to apply different categorization techniques on this dataset. The dataset contains 90K tweets with 6280 tweets labeled as positive, 16,335 as negative and 67,835 as neutral. Thus, the number of neutral tweets is larger than that of positive or negative. In the paper [1], the authors have used the TextBlob tool to label the tweets as positive, negative or neutral. 

We will start our analysis now

In [1]:
import pandas as pd
import os
import numpy as np
from collections import defaultdict
from math import ceil
from random import Random
from sklearn.metrics import classification_report
from gensim.utils import simple_preprocess
import string
import nltk

  from .autonotebook import tqdm as notebook_tqdm


The dataset is downloaded in CSV format in the file named COVIDSenti.csv. It can be found at [2]. We read the data into a pandas dataframe.

In [2]:
covid_senti = pd.read_csv("COVIDSenti-main/COVIDSenti.csv")

# Preprocessing step

As a part of preprocesing, several techniques are applied.

1] First step is to remove the hyperlinks from the tweets. Many tweets contain hyperlinks that cite many websites. This information is not useful for us as it does not add any additional information for sentiment analysis. <br>
2] The next step is to remove the @ mentions from the tweets. The reason is the same- it does not add any significant information on a sentence's sentiment. <br>
3] and 4] The next steps are to remove the newlines and #tags from the tweets. Here, we won't remove the word associated with the # but only the special character #. It means, if a tweets contains #COVID, it will be converted to just COVID (no '#'). <br>
5] The next step is to remove all the punctuations as they don't add any useful information for our models to work. <br>
6] Further, all the stop words are removed from the data.<br>
7] The final step is to tokenize the tweets using simple_preprocess() function from gensim. simple_preprocess() lowercases and tokenizes the documents into words. This step is very important as every tweet is a vector of words and hence needs to be tokenized for the same.<br>

In [3]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
table = str.maketrans(dict.fromkeys(string.punctuation))

covid_senti['processed_tweet'] = covid_senti['tweet'].str.replace('http[^\s]*',"")
covid_senti['processed_tweet'] = covid_senti['tweet'].str.replace('@[^\s]*',"")
covid_senti = covid_senti.replace(r'\n', '', regex=True)
covid_senti = covid_senti.replace(r'#', '', regex=True)
covid_senti['processed_tweet'] = covid_senti['processed_tweet'].str.replace('[^\w\s]','')
covid_senti['processed_tweet'] = covid_senti['processed_tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
covid_senti['tokenized_tweet'] = [simple_preprocess(line, deacc=True) for line in covid_senti['processed_tweet']]
# covid_senti['tokenized_tweet'] = [[word.replace('\n', '') for word in line] for line in covid_senti['tokenized_tweet']]
# covid_senti['tokenized_tweet'] = [[word.replace('#', '') for word in line] for line in covid_senti['tokenized_tweet']]
# covid_senti['tokenized_tweet'] = [[word.lower() for word in line] for line in covid_senti['tokenized_tweet']]
# covid_senti['tokenized_tweet'] = [[word.translate(table) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti.to_csv('df.csv')
# covid_senti['tokenized_tweet'] = [[covid_senti['tokenized_tweet'].apply(lambda x: word for word in line if word not in stop)] for line in  covid_senti['tokenized_tweet']] 

# print(covid_senti['tokenized_tweet'].sample(n=10))

[nltk_data] Downloading package stopwords to
[nltk_data]     /s/chopin/a/grad/sanket96/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  covid_senti['processed_tweet'] = covid_senti['tweet'].str.replace('http[^\s]*',"")
  covid_senti['processed_tweet'] = covid_senti['tweet'].str.replace('@[^\s]*',"")
  covid_senti['processed_tweet'] = covid_senti['processed_tweet'].str.replace('[^\w\s]','')


Below is the final step of preprocessing. Here word lemmatization is done. Word lemmatization is a process to associate different forms of the word to a single lemma or to its dictionary form. This is also a vital preprocessing step. The main aim behind this is to reduce the vocabulary size. For example, the words building and built will become a single word "build". Also, lemmatization is chosen over stemming as lemmatization is more accurate than the later.

In [4]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
covid_senti['lemmatized_tweet'] = [[wordnet_lemmatizer.lemmatize(word) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['lemmatized_tweet'].sample(n=10))

[nltk_data] Downloading package wordnet to
[nltk_data]     /s/chopin/a/grad/sanket96/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Split of the dataset

Here, the dataset is divided into train and test set. 80% of the data is used for training while 20% for testing. Below are the train and test sets.

In [5]:
mask = np.random.rand(len(covid_senti)) < 0.8
covid_senti_train = covid_senti[mask]
covid_senti_test = covid_senti[~mask]

The number of samples in each class for the train and test sets

In [6]:
covid_senti_train["label"].value_counts()

neu    53876
neg    13097
pos     4974
Name: label, dtype: int64

In [7]:
covid_senti_test["label"].value_counts()

neu    13509
neg     3238
pos     1306
Name: label, dtype: int64

# Naive Bayes Classifier

Now, trying the Naive Bayes Classifier on the dataset. This is the same as the one used in PA1. Also, below is the table of results.

In [8]:
class NaiveBayes():

    def __init__(self):
        # be sure to use the right class_dict for each data set
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        self.feature_dict = {}
        self.prior = np.zeros(len(self.class_dict))
        self.likelihood = None
    '''
    Trains a multinomial Naive Bayes classifier on a training set.
    Specifically, fills in self.prior and self.likelihood such that:
    self.prior[class] = log(P(class))
    self.likelihood[class][feature] = log(P(feature|class))
    '''
    def train(self, train_set):
        self.feature_dict = self.select_features(train_set)
        # iterate over training documents
        self.likelihood = np.zeros((len(self.class_dict), len(self.feature_dict)))
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        for word in self.feature_dict:
            for class_name in self.class_dict:
                self.likelihood[self.class_dict.get(class_name)][self.feature_dict.get(word)] = np.log(((word_count.get((word,
                                                class_name), 0) + 1)/(total_words_per_class[class_name] + len(vocabulary))))
        for class_name in self.class_dict:
            self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''
    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        # iterate over testing documents
        for index, row in dev_set.iterrows():
            class_name = row['label']
            # create feature vectors for each document
            word_count = {}
            true_labels.append(self.class_dict[class_name])
            data = str(row['lemmatized_tweet'])
            for word in data:
                if word in self.feature_dict:
                    word_count[word] = 1 + word_count.get(word, 0)
            feature_vector = np.zeros((len(self.feature_dict), 1))
            for i, word in enumerate(self.feature_dict):
                feature_vector[i] = word_count.get(word, 0)
            self.prior = np.reshape(self.prior, (self.prior.shape[0], 1))
            probability = self.prior + np.matmul(self.likelihood, feature_vector)
            pred_labels.append(np.argmax(probability))
                # get most likely class
        # print(dict(results))
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''
    def evaluate(self, results):
        # you may find this helpful
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))
    '''
    Performs feature selection.
    Returns a dictionary of features.
    '''
    def select_features(self, train_set):
        # almost any method of feature selection is fine here
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        likelihood_ratio = {}
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        prob_class = np.zeros((3, 1))
        for i, class_name in enumerate(self.class_dict):
            prob_class[i] = (doc_per_class[class_name] / sum(doc_per_class.values()))
        for word in vocabulary:
            class_probs = [1] * len(self.class_dict)
            for i, class_name in enumerate(self.class_dict):
                class_probs[i] = (word_count.get((word,
                                      class_name), 0) + 1) / (total_words_per_class[class_name] + len(vocabulary))
                class_probs[i] = class_probs[i] / prob_class[i]
            likelihood_ratio[word] = (1 / class_probs[0]) * (1 / class_probs[1]) * (1 / class_probs[2])
        #likelihood_ratio_pos = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1], reverse=True))
        likelihood_ratio = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1]))
        words = []
        words.extend(list(likelihood_ratio.keys())[:750])
        #words.extend(list(likelihood_ratio_pos.keys())[:750])
        # for class_name in self.class_dict:
        #     self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
        features = {}
        for i, word in enumerate(words):
            features[word] = i
        return features


if __name__ == '__main__':
    nb = NaiveBayes()
    # make sure these point to the right directories
    nb.train(covid_senti_train)
    # nb.train('movie_reviews_small/train')
    results = nb.test(covid_senti_test)
    # results = nb.test('movie_reviews_small/test')
    nb.evaluate(results)


              precision    recall  f1-score   support

         neg       0.00      0.00      0.00      3238
         pos       0.00      0.00      0.00      1306
         neu       0.75      1.00      0.86     13509

    accuracy                           0.75     18053
   macro avg       0.25      0.33      0.29     18053
weighted avg       0.56      0.75      0.64     18053



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The overall accuracy achieved by Naive Bayes (75%) is satisfactory. But, the per-class precision is 0 which is not acceptable. It means Naive Bayes is biased entirely towards the "neu" class. This is expected as the number of samples labeled as "neu" is more than the other two classes

# Logistic Regression Classifier

Now, let's try Logistic Regression clasifier from our PA2.

In [9]:
# CS542 Fall 2021 Programming Assignment 2
# Logistic Regression Classifier

'''
Computes the logistic function.
'''


def sigma(z):
    return 1 / (1 + np.exp(-z))


class LogisticRegression():

    def __init__(self, n_features=400):
        # be sure to use the right class_dict for each data set
        self.theta = None
        self.n_features = n_features
        self.feature_dict = None
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        # use of self.feature_dict is optional for this assignment
        self.feature_dict = self.select_features(covid_senti_train)

    '''
    Loads a dataset. Specifically, returns a list of filenames, and dictionaries
    of classes and documents such that:
    classes[filename] = class of the document
    documents[filename] = feature vector for the document (use self.featurize)
    '''

    def select_features(self, data_set):
        feature_count = {}
        for index, row in data_set.iterrows():
            data = str(row['lemmatized_tweet']).split()
            for word in data:
                feature_count[word] = 1 + feature_count.get(word, 0)

        feature_count = list(dict(sorted(feature_count.items(), key=lambda v: v[1], reverse=True)).keys())[:500]
        features = {}

        for i, word in enumerate(feature_count):
            features[word] = i
        return features

    def load_data(self, data_set):
        filenames = []
        classes = dict()
        documents = dict()
        # iterate over documents
        for index, row in data_set.iterrows():
            # your code here
            # BEGIN STUDENT CODE
            # if os.path.isfile(os.path.join(root, name)):
            class_name = row['label']
            classes[index] = self.class_dict[class_name]
            documents[index] = self.featurize(row['lemmatized_tweet'])
            # END STUDENT CODE
        return classes, documents

    '''
    Given a document (as a list of words), returns a feature vector.
    Note that the last element of the vector, corresponding to the bias, is a
    "dummy feature" with value 1.
    '''

    def featurize(self, document):
        vector = np.zeros(self.n_features + 1)
        # BEGIN STUDENT CODE
        for word in document:
            if word in self.feature_dict:
                if word not in w2v_model.wv.key_to_index:
                    vector.extend([0] * 500)
                else:
                    vector.extend(w2v_model.wv[word])
        # END STUDENT CODE
        vector[-1] = 1
        return vector

    '''
    Trains a logistic regression classifier on a training set.
    '''

    def train(self, train_set, batch_size=3, n_epochs=1, eta=0.1):
        # if train_set == "movie_reviews_small/train":
        #     self.feature_dict = {'fast': 0, 'couple': 1, 'shoot': 2, 'fly': 3}
        # else:
        #     self.feature_dict = self.select_features(train_set)
        # self.n_features = len(self.feature_dict)
        self.theta = np.zeros(self.n_features + 1)  # weights (and bias)
        classes, documents = self.load_data(train_set)
        n_minibatches = ceil(len(train_set) / batch_size)
        for epoch in range(n_epochs):
            print("Epoch {:} out of {:}".format(epoch + 1, n_epochs))
            loss = 0
            for i in range(n_minibatches):
                # list of filenames in minibatch
                minibatch = train_set[i * batch_size: (i + 1) * batch_size]
                # BEGIN STUDENT CODE
                # create and fill in matrix x and vector y
                x = np.zeros((len(minibatch), self.n_features + 1))
                y = np.zeros(len(minibatch))
                k = 0
                for j, row in minibatch.iterrows():
                    x[k][:] = documents[j]
                    y[k] = classes[j]
                    k += 1
                # compute y_hat
                y_hat = sigma(np.dot(x, self.theta))
                # update loss
                loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
                # compute gradient
                gradient = np.dot(x.T, np.subtract(y_hat, y)) / len(minibatch)
                # update weights (and bias)
                self.theta = self.theta - (eta * gradient)
                # END STUDENT CODE
            loss /= len(train_set)
            print("Average Train Loss: {}".format(loss))
            # randomize order
            #Random(epoch).shuffle(train_set)

    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''

    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        classes, documents = self.load_data(dev_set)
        for index, row in dev_set.iterrows():
            # BEGIN STUDENT CODE
            # get most likely class (recall that P(y=1|x) = y_hat)
            true_labels.append(classes[index])
            prediction = sigma(np.dot(documents[index], self.theta))
            pred_label = 1 if prediction > 0.5 else 0
            pred_labels.append(pred_label)
            # END STUDENT CODE
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''

    def evaluate(self, results):
        # you can copy and paste your code from PA1 here
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))


if __name__ == '__main__':
    lr = LogisticRegression(n_features=750)
    # make sure these point to the right directories
    batch_size = [1, 2, 3, 8, 16, 32]
    n_epochs = [1, 5, 10, 20, 30, 40]
    eta = [0.025, 0.05, 0.1, 0.2, 0.4]

    # code for grid search
#     for b in batch_size:
#         for n in n_epochs:
#             for ler in eta:
#                 lr.train(covid_senti_train, batch_size=b, n_epochs=n, eta=ler)
#                 results = lr.test(covid_senti_test)
#                 lr.evaluate(results)
#                 print("Accuracy is for batch size: ", b, ", n_epochs: ", n, "eta: ", ler)

    # best features from grid search
    lr.train(covid_senti_train, batch_size=3, n_epochs=40, eta=0.05)
    results = lr.test(covid_senti_test)
    # lr.train('movie_reviews_small/train', batch_size=3, n_epochs=1, eta=0.1)
    # results = lr.test('movie_reviews_small/test')
    lr.evaluate(results)


Epoch 1 out of 40


  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))


Average Train Loss: nan
Epoch 2 out of 40
Average Train Loss: nan
Epoch 3 out of 40
Average Train Loss: nan
Epoch 4 out of 40
Average Train Loss: nan
Epoch 5 out of 40
Average Train Loss: nan
Epoch 6 out of 40
Average Train Loss: nan
Epoch 7 out of 40
Average Train Loss: nan
Epoch 8 out of 40
Average Train Loss: nan
Epoch 9 out of 40
Average Train Loss: nan
Epoch 10 out of 40
Average Train Loss: nan
Epoch 11 out of 40
Average Train Loss: nan
Epoch 12 out of 40
Average Train Loss: nan
Epoch 13 out of 40
Average Train Loss: nan
Epoch 14 out of 40
Average Train Loss: nan
Epoch 15 out of 40
Average Train Loss: nan
Epoch 16 out of 40
Average Train Loss: nan
Epoch 17 out of 40
Average Train Loss: nan
Epoch 18 out of 40
Average Train Loss: nan
Epoch 19 out of 40
Average Train Loss: nan
Epoch 20 out of 40
Average Train Loss: nan
Epoch 21 out of 40
Average Train Loss: nan
Epoch 22 out of 40
Average Train Loss: nan
Epoch 23 out of 40
Average Train Loss: nan
Epoch 24 out of 40
Average Train Loss:

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


From the results, it is visible that the accuracy is very poor with Logisitic Regression.

# CNN Classifier

Now, we will try CNN as our next model. Here, the features extracted from the WordNet classifier will be used to train the CNN model.

In [7]:
import torch
from torch import nn, optim
import gensim
import time
import torch.nn.functional as F

Below is the Word2Vec model from gensim

In [8]:
from gensim.models import Word2Vec
tweets = list(covid_senti['lemmatized_tweet'].values)
tweets.append(['pad'])
w2v_model = Word2Vec(tweets, min_count = 1, vector_size = 500, workers = 3, window = 3, sg = 1)

As our model understands only numbers, we will assign a number to each of our labels here. In the above cases, it was internally done in the code through the class_dict dictionary. Here, we will do it explicitly and add a new column to our dataframe.

In [9]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
covid_senti['label_num'] = covid_senti.label.replace(class_dict)
covid_senti

Unnamed: 0,tweet,label,processed_tweet,tokenized_tweet,lemmatized_tweet,label_num
0,Coronavirus | Human Coronavirus Types | CDC ht...,neu,Coronavirus Human Coronavirus Types CDC httpst...,"[coronavirus, human, coronavirus, types, cdc]","[coronavirus, human, coronavirus, type, cdc]",2
1,"@shehryar_taseer That‚Äôs üíØ true , Corona v...",neu,ThatÄôs üíØ true Corona virus swine flue Bird ...,"[thataos, uiø, true, corona, virus, swine, flu...","[thataos, uiø, true, corona, virus, swine, flu...",2
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg,TLDR Not SARS possibly new coronavirus Difficu...,"[tldr, not, sars, possibly, new, coronavirus, ...","[tldr, not, sars, possibly, new, coronavirus, ...",0
3,Disease outbreak news from the WHO: Middle Eas...,neu,Disease outbreak news WHO Middle East respirat...,"[disease, outbreak, news, who, middle, east, r...","[disease, outbreak, news, who, middle, east, r...",2
4,China - Media: WSJ says sources tell them myst...,neu,China Media WSJ says sources tell mystery pneu...,"[china, media, wsj, says, sources, tell, myste...","[china, medium, wsj, say, source, tell, myster...",2
...,...,...,...,...,...,...
89995,@C_Racing48 The flu has a 2% death rate.. the ...,neu,The flu 2 death rate coronavirus 3 fine 3 risk...,"[the, flu, death, rate, coronavirus, fine, ris...","[the, flu, death, rate, coronavirus, fine, ris...",2
89996,@realDonaldTrump We already know that but you‚...,neg,We already know youÄôre idiot bungled Coronavi...,"[we, already, know, youaore, idiot, bungled, c...","[we, already, know, youaore, idiot, bungled, c...",0
89997,First coronavirus case reported in St. Joseph ...,neu,First coronavirus case reported St Joseph Coun...,"[first, coronavirus, case, reported, st, josep...","[first, coronavirus, case, reported, st, josep...",2
89998,"If you ate ants when you were a child, you‚Äôr...",neu,If ate ants child youÄôre immune coronavirus,"[if, ate, ants, child, youaore, immune, corona...","[if, ate, ant, child, youaore, immune, coronav...",2


Again dividing our dataset into train and test sets

In [14]:
mask = np.random.rand(len(covid_senti)) < 0.8
covid_senti_train = covid_senti[mask]
covid_senti_test = covid_senti[~mask]
covid_senti_train["label"].value_counts()

neu    53870
neg    13101
pos     5061
Name: label, dtype: int64

In [15]:
covid_senti_test["label"].value_counts()

neu    13515
neg     3234
pos     1219
Name: label, dtype: int64

Below function return a vector representation of a sentence. That is, given a sentence, it looks for the index to every word and appends it to a list. The final list contains index for every word in the sentence. This list is then returned as a tensor. This index is a mapping to its actual embedding which will be looked in the first layer of the CNN

In [16]:
max_len = covid_senti['lemmatized_tweet'].map(len).max()
def make_word_2_vec(sentence):
    padding_idx = w2v_model.wv.key_to_index['pad']
    padded_X = [padding_idx for i in range(max_len)]
    i = 0
    for word in sentence:
        if word not in w2v_model.wv.key_to_index:
            padded_X[i] = 0
            print(word)
        else:
            padded_X[i] = w2v_model.wv.key_to_index[word]
        i += 1
    return torch.tensor(padded_X, dtype=torch.long).view(1, -1)

Below is the CNN model. This is taken from the lecture as it is. The aim is to evaluate the model on the tweet dataset. It has an embedding layer wherein it looks for word embeddings. Then, it has five convolution layers followed by a linear layer.

In [17]:
EMBEDDING_SIZE = 500
NUM_FILTERS = 10

class CnnTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, window_sizes=(1,2,3,5)):
        super(CnnTextClassifier, self).__init__()
        weights = w2v_model.wv
        # With pretrained embeddings
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights.vectors),
                                                      padding_idx=w2v_model.wv.key_to_index['pad'])
        # Without pretrained embeddings
        # self.embedding = nn.Embedding(vocab_size, EMBEDDING_SIZE)

        self.convs = nn.ModuleList([
                                   nn.Conv2d(1, NUM_FILTERS, [window_size, EMBEDDING_SIZE],
                                             padding=(window_size - 1, 0))
                                   for window_size in window_sizes
        ])

        self.fc = nn.Linear(NUM_FILTERS * len(window_sizes), num_classes)

    def forward(self, x):
        x = self.embedding(x)

        # Apply a convolution + max_pool layer for each window size
        x = torch.unsqueeze(x, 1)
        xs = []
        for conv in self.convs:
            x2 = torch.tanh(conv(x))
            x2 = torch.squeeze(x2, -1)
            x2 = F.max_pool1d(x2, x2.size(2))
            xs.append(x2)
        x = torch.cat(xs, 2)

        # FC
        x = x.view(x.size(0), -1)
        logits = self.fc(x)

        probs = F.softmax(logits, dim = 1)

        return probs

Now, we train our CNN model

In [19]:
NUM_CLASSES = 3
VOCAB_SIZE = len(w2v_model.wv.key_to_index)

cnn_model = CnnTextClassifier(vocab_size=VOCAB_SIZE, num_classes=NUM_CLASSES)
# cnn_model.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)
num_epochs = 10

# Open the file for writing loss
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
loss_file_name = 'cnn_class_big_loss_with_padding.csv'
losses = []
cnn_model.train()
for epoch in range(num_epochs):
    start_time = time.time()
    print("Epoch " + str(epoch + 1))
    train_loss = 0
    for index, row in covid_senti_train.iterrows():
        # Clearing the accumulated gradients
        optimizer.zero_grad()

        # Make the bag of words vector for stemmed tokens 
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
       
        # Forward pass to get output
        probs = cnn_model(bow_vec)

        # Get the target label
        target = torch.tensor([class_dict[row['label']]], dtype=torch.long)

        # Calculate Loss: softmax --> cross entropy loss
        loss = loss_function(probs, target)
        train_loss += loss.item()

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()


    # if index == 0:
    #     continue
    print("Epoch completed in: %.4f seconds" % (time.time()-start_time))
    print(str((epoch+1)) + "," + str(train_loss / len(covid_senti_train)))
    print('\n')
    train_loss = 0

torch.save(cnn_model, 'cnn_big_model_500_with_padding.pth')

Epoch 1
Epoch completed in: 169.0479 seconds
1,0.8036848827393875


Epoch 2
Epoch completed in: 178.3237 seconds
2,0.803582663023501


Epoch 3
Epoch completed in: 182.0451 seconds
3,0.803582663023501




Let's evaluate the model on test set

In [20]:
from sklearn.metrics import classification_report
predictions = []
correct = []
cnn_model.eval()

with torch.no_grad():
    results = defaultdict(dict)
    for index, row in covid_senti_test.iterrows():
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
        probs = cnn_model(bow_vec)
        correct.append(row['label_num'])
        _, predicted = torch.max(probs.data, 1)
        predictions.append(predicted.numpy()[0])
target_names = ['neg', 'pos', 'neu']
print(classification_report(predictions, correct, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.00      0.00      0.00         0
         pos       0.00      0.00      0.00         0
         neu       1.00      0.75      0.86     17968

    accuracy                           0.75     17968
   macro avg       0.33      0.25      0.29     17968
weighted avg       1.00      0.75      0.86     17968



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The results with CNN are same as that with Naive Bayes.

# BERT

Now, let's try on BERT. BERT is a very famous model released in 2018. It is based on Transformer having 12 Encoder layers and 12 Attention heads. It has around 110M parameters. Here, the BERT model is fine tuned for our dataset. We first split the dataset into train and test with 20% reserved for testing. 

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(covid_senti.index.values, 
                                                    covid_senti.label_num.values, test_size=0.2,
                                                   stratify=covid_senti.label_num.values)

Here, we will mark each row as "train" or "test" based on which partition it is. This will help us in the future.

In [11]:
covid_senti['data_type'] = ['not_set'] * covid_senti.shape[0]

In [12]:
covid_senti.loc[X_train, 'data_type'] = 'train'
covid_senti.loc[X_test, 'data_type'] = 'test'

In [13]:
from transformers import BertTokenizer
import torch

torch.cuda.empty_cache()

Below cells intialize and train a BERT uncased model. First, we will use th BERT Tokenizer to convert our dataset into the format BERT expects. This tokenizer will pad our inputs to length 512. The sequences larger than 512 are truncated. Also, a [SEP] token is placed between two sentences and a [CLS] token at the start of the sentences. The tokenizer also converts the words into the embeddings. Also, the tokenizer creates an attention mask that differentiates word token from padded tokens by marking them 1 and 0 repsectively.

In [14]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [15]:
encoded_data_train = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



The below cells assigns labels, attention masks and input ids as per the train and test set.

In [16]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

This is the actual BERT model

In [17]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Defining the Dataloader for our train and test set and starting the training proces..

In [18]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

In [19]:
train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)

test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

In [20]:
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

BERT gets fine-tuned well if AdamW optimizer is used. In AdamW, weight decay and and learning rate are optimized separately. lr=1e-5 is the defualt learning rate for BERT

In [22]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [23]:
model = model.cuda()
loss_total = 0
model.train()
start = time.time()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
print("Time for training: ", time.time()-start)
torch.save(model.state_dict(), 'model.pth')

Epoch: 1 / 3, Step: 0 / 2250 Loss: 1.1775
Epoch: 1 / 3, Step: 1 / 2250 Loss: 1.1513
Epoch: 1 / 3, Step: 2 / 2250 Loss: 1.1006
Epoch: 1 / 3, Step: 3 / 2250 Loss: 0.9597
Epoch: 1 / 3, Step: 4 / 2250 Loss: 1.0195
Epoch: 1 / 3, Step: 5 / 2250 Loss: 0.9269
Epoch: 1 / 3, Step: 6 / 2250 Loss: 0.8980
Epoch: 1 / 3, Step: 7 / 2250 Loss: 0.8771
Epoch: 1 / 3, Step: 8 / 2250 Loss: 0.8009
Epoch: 1 / 3, Step: 9 / 2250 Loss: 0.8886
Epoch: 1 / 3, Step: 10 / 2250 Loss: 0.7702
Epoch: 1 / 3, Step: 11 / 2250 Loss: 0.6703
Epoch: 1 / 3, Step: 12 / 2250 Loss: 0.7361
Epoch: 1 / 3, Step: 13 / 2250 Loss: 0.7624
Epoch: 1 / 3, Step: 14 / 2250 Loss: 1.0131
Epoch: 1 / 3, Step: 15 / 2250 Loss: 0.5220
Epoch: 1 / 3, Step: 16 / 2250 Loss: 0.8398
Epoch: 1 / 3, Step: 17 / 2250 Loss: 0.6348
Epoch: 1 / 3, Step: 18 / 2250 Loss: 0.7459
Epoch: 1 / 3, Step: 19 / 2250 Loss: 0.7515
Epoch: 1 / 3, Step: 20 / 2250 Loss: 0.6863
Epoch: 1 / 3, Step: 21 / 2250 Loss: 0.8617
Epoch: 1 / 3, Step: 22 / 2250 Loss: 0.5856
Epoch: 1 / 3, Step: 2

In [24]:
model.load_state_dict(torch.load('model.pth'))
model = model.cuda()
model.eval()
results = defaultdict(dict)
predictions, true_vals = [], []
start = time.time()
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)
print("Time for inference: ", time.time()-start)

Time for inference:  65.6012110710144


In [38]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [39]:
from sklearn.metrics import classification_report
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.93      0.95      0.94      3267
         pos       0.91      0.91      0.91      1256
         neu       0.98      0.98      0.98     13477

    accuracy                           0.97     18000
   macro avg       0.94      0.95      0.94     18000
weighted avg       0.97      0.97      0.97     18000



The performance by BERT is excellent with 97% overall accuracy and satisfactory per-class precision

# DistilBERT

DistilBert is a lighter version of BERT. This model is built by knowledge distillation technique wherein a small model is trained to reproduce the behavior of a larger model. Accordingly, it has 40% less parameters than BERT and runs 60% faster while retaining 95% of the BERT-base-uncased performance[3]. Here, we are planning to compare the results of BERT with DistilBERT and try to get hands on both.

In [25]:
from transformers import DistilBertConfig,DistilBertTokenizer,DistilBertModel
distil_berttokenizer = DistilBertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizer'.


In [26]:
encoded_data_train = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

In [27]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

In [28]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['bert.encoder.layer.5.output.LayerNorm.weight', 'bert.encoder.layer.11.attention.output.LayerNorm.weight', 'bert.encoder.layer.2.intermediate.dense.weight', 'bert.encoder.layer.6.output.dense.bias', 'bert.encoder.layer.7.output.dense.weight', 'bert.encoder.layer.4.attention.self.key.bias', 'bert.encoder.layer.6.output.LayerNorm.bias', 'bert.encoder.layer.11.intermediate.dense.bias', 'bert.encoder.layer.1.attention.self.value.weight', 'bert.encoder.layer.5.intermediate.dense.weight', 'bert.encoder.layer.9.attention.output.LayerNorm.bias', 'bert.encoder.layer.10.output.LayerNorm.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.7.output.LayerNorm.weight', 'bert.encoder.layer.0.attent

In [29]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)
test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [30]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [31]:
model = model.cuda()
loss_total = 0
model.train()
start = time.time()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
print("Time for training: ", time.time()-start)
torch.save(model.state_dict(), 'distilmodel.pth')

Epoch: 1 / 3, Step: 0 / 2250 Loss: 1.0783
Epoch: 1 / 3, Step: 1 / 2250 Loss: 0.8679
Epoch: 1 / 3, Step: 2 / 2250 Loss: 0.6708
Epoch: 1 / 3, Step: 3 / 2250 Loss: 0.7652
Epoch: 1 / 3, Step: 4 / 2250 Loss: 0.3557
Epoch: 1 / 3, Step: 5 / 2250 Loss: 0.7270
Epoch: 1 / 3, Step: 6 / 2250 Loss: 0.7649
Epoch: 1 / 3, Step: 7 / 2250 Loss: 0.8280
Epoch: 1 / 3, Step: 8 / 2250 Loss: 1.0821
Epoch: 1 / 3, Step: 9 / 2250 Loss: 0.9098
Epoch: 1 / 3, Step: 10 / 2250 Loss: 0.9553
Epoch: 1 / 3, Step: 11 / 2250 Loss: 0.4044
Epoch: 1 / 3, Step: 12 / 2250 Loss: 0.7077
Epoch: 1 / 3, Step: 13 / 2250 Loss: 0.5611
Epoch: 1 / 3, Step: 14 / 2250 Loss: 0.6160
Epoch: 1 / 3, Step: 15 / 2250 Loss: 0.8396
Epoch: 1 / 3, Step: 16 / 2250 Loss: 0.5527
Epoch: 1 / 3, Step: 17 / 2250 Loss: 0.6816
Epoch: 1 / 3, Step: 18 / 2250 Loss: 0.5158
Epoch: 1 / 3, Step: 19 / 2250 Loss: 0.5346
Epoch: 1 / 3, Step: 20 / 2250 Loss: 0.5446
Epoch: 1 / 3, Step: 21 / 2250 Loss: 0.8661
Epoch: 1 / 3, Step: 22 / 2250 Loss: 0.9522
Epoch: 1 / 3, Step: 2

In [33]:
model.load_state_dict(torch.load('distilmodel.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
start = time.time()
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)
print("Time for inference: ", time.time()-start)

Time for inference:  62.421045541763306


In [50]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [51]:
from sklearn.metrics import classification_report
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.85      0.84      0.85      3267
         pos       0.68      0.79      0.73      1256
         neu       0.94      0.93      0.94     13477

    accuracy                           0.91     18000
   macro avg       0.83      0.86      0.84     18000
weighted avg       0.91      0.91      0.91     18000



Although DistilBERT is a lighter version of BERT, it is giving less accuracy on our dataset. The accuracy given by BERT is 97% while that by DistilBERT is 91%.

# Evaluation on Part A

The dataset on which we worked above has been divided into three parts [2]. After analysing the performance of different Machine Learning algorithms on the huge dataset, let's analyze the performace of the dataset on individual parts.

Let's start on Part A. Part A of this dataset contains tweets related to government actions against COVID-19. The evaluation of this part (preprocessing and processing) is same as that of the entire dataset as above.

In [4]:
covid_senti = pd.read_csv("COVIDSenti-main/COVIDSenti-A.csv")
covid_senti["label"].value_counts()

neu    22949
neg     5083
pos     1968
Name: label, dtype: int64

It contains 22949 samples marked as neutral, 5083 as positive and 1968 as negative.

## Preprocessing

In [5]:
from gensim.utils import simple_preprocess
import string

table = str.maketrans(dict.fromkeys(string.punctuation))

covid_senti['tokenized_tweet'] = [simple_preprocess(line, deacc=True) for line in covid_senti['tweet']]
covid_senti['tokenized_tweet'] = [[word.replace('\n', '') for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.replace('#', '') for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.lower() for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.translate(table) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[''.join(filter(lambda x: not word.startswith('https'), word)) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[''.join(filter(lambda x: not word.startswith('@'), word)) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['tokenized_tweet'].sample(n=10))

In [6]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
covid_senti['lemmatized_tweet'] = [[wordnet_lemmatizer.lemmatize(word) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['lemmatized_tweet'].sample(n=10))

[nltk_data] Downloading package wordnet to
[nltk_data]     /s/chopin/a/grad/sanket96/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
mask = np.random.rand(len(covid_senti)) < 0.8
covid_senti_train = covid_senti[mask]
covid_senti_test = covid_senti[~mask]

## Naive Bayes

In [56]:
class NaiveBayes():

    def __init__(self):
        # be sure to use the right class_dict for each data set
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        self.feature_dict = {}
        self.prior = np.zeros(len(self.class_dict))
        self.likelihood = None
    '''
    Trains a multinomial Naive Bayes classifier on a training set.
    Specifically, fills in self.prior and self.likelihood such that:
    self.prior[class] = log(P(class))
    self.likelihood[class][feature] = log(P(feature|class))
    '''
    def train(self, train_set):
        self.feature_dict = self.select_features(train_set)
        # iterate over training documents
        self.likelihood = np.zeros((len(self.class_dict), len(self.feature_dict)))
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        for word in self.feature_dict:
            for class_name in self.class_dict:
                self.likelihood[self.class_dict.get(class_name)][self.feature_dict.get(word)] = np.log(((word_count.get((word,
                                                class_name), 0) + 1)/(total_words_per_class[class_name] + len(vocabulary))))
        for class_name in self.class_dict:
            self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''
    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        # iterate over testing documents
        for index, row in dev_set.iterrows():
            class_name = row['label']
            # create feature vectors for each document
            word_count = {}
            true_labels.append(self.class_dict[class_name])
            data = str(row['lemmatized_tweet'])
            for word in data:
                if word in self.feature_dict:
                    word_count[word] = 1 + word_count.get(word, 0)
            feature_vector = np.zeros((len(self.feature_dict), 1))
            for i, word in enumerate(self.feature_dict):
                feature_vector[i] = word_count.get(word, 0)
            self.prior = np.reshape(self.prior, (self.prior.shape[0], 1))
            probability = self.prior + np.matmul(self.likelihood, feature_vector)
            pred_labels.append(np.argmax(probability))
                # get most likely class
        # print(dict(results))
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''
    def evaluate(self, results):
        # you may find this helpful
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))
    '''
    Performs feature selection.
    Returns a dictionary of features.
    '''
    def select_features(self, train_set):
        # almost any method of feature selection is fine here
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        likelihood_ratio = {}
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        prob_class = np.zeros((3, 1))
        for i, class_name in enumerate(self.class_dict):
            prob_class[i] = (doc_per_class[class_name] / sum(doc_per_class.values()))
        for word in vocabulary:
            class_probs = [1] * len(self.class_dict)
            for i, class_name in enumerate(self.class_dict):
                class_probs[i] = (word_count.get((word,
                                      class_name), 0) + 1) / (total_words_per_class[class_name] + len(vocabulary))
                class_probs[i] = class_probs[i] / prob_class[i]
            likelihood_ratio[word] = (1 / class_probs[0]) * (1 / class_probs[1]) * (1 / class_probs[2])
        #likelihood_ratio_pos = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1], reverse=True))
        likelihood_ratio = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1]))
        words = []
        words.extend(list(likelihood_ratio.keys())[:750])
        #words.extend(list(likelihood_ratio_pos.keys())[:750])
        # for class_name in self.class_dict:
        #     self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
        features = {}
        for i, word in enumerate(words):
            features[word] = i
        return features


if __name__ == '__main__':
    nb = NaiveBayes()
    # make sure these point to the right directories
    nb.train(covid_senti_train)
    # nb.train('movie_reviews_small/train')
    results = nb.test(covid_senti_test)
    # results = nb.test('movie_reviews_small/test')
    nb.evaluate(results)


              precision    recall  f1-score   support

         neg       0.00      0.00      0.00      1037
         pos       0.00      0.00      0.00       382
         neu       0.76      1.00      0.86      4526

    accuracy                           0.76      5945
   macro avg       0.25      0.33      0.29      5945
weighted avg       0.58      0.76      0.66      5945



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Logistic Regression classifier

In [57]:
# CS542 Fall 2021 Programming Assignment 2
# Logistic Regression Classifier

'''
Computes the logistic function.
'''


def sigma(z):
    return 1 / (1 + np.exp(-z))


class LogisticRegression():

    def __init__(self, n_features=400):
        # be sure to use the right class_dict for each data set
        self.theta = None
        self.n_features = n_features
        self.feature_dict = None
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        # use of self.feature_dict is optional for this assignment
        self.feature_dict = self.select_features(covid_senti_train)

    '''
    Loads a dataset. Specifically, returns a list of filenames, and dictionaries
    of classes and documents such that:
    classes[filename] = class of the document
    documents[filename] = feature vector for the document (use self.featurize)
    '''

    def select_features(self, data_set):
        feature_count = {}
        for index, row in data_set.iterrows():
            data = str(row['lemmatized_tweet']).split()
            for word in data:
                feature_count[word] = 1 + feature_count.get(word, 0)

        feature_count = list(dict(sorted(feature_count.items(), key=lambda v: v[1], reverse=True)).keys())[:500]
        features = {}

        for i, word in enumerate(feature_count):
            features[word] = i
        return features

    def load_data(self, data_set):
        filenames = []
        classes = dict()
        documents = dict()
        # iterate over documents
        for index, row in data_set.iterrows():
            # your code here
            # BEGIN STUDENT CODE
            # if os.path.isfile(os.path.join(root, name)):
            class_name = row['label']
            classes[index] = self.class_dict[class_name]
            documents[index] = self.featurize(row['lemmatized_tweet'])
            # END STUDENT CODE
        return classes, documents

    '''
    Given a document (as a list of words), returns a feature vector.
    Note that the last element of the vector, corresponding to the bias, is a
    "dummy feature" with value 1.
    '''

    def featurize(self, document):
        vector = np.zeros(self.n_features + 1)
        # BEGIN STUDENT CODE
        for word in document:
            if word in self.feature_dict:
                if word not in w2v_model.wv.key_to_index:
                    vector.extend([0] * 500)
                else:
                    vector.extend(w2v_model.wv[word])
        # END STUDENT CODE
        vector[-1] = 1
        return vector

    '''
    Trains a logistic regression classifier on a training set.
    '''

    def train(self, train_set, batch_size=3, n_epochs=1, eta=0.1):
        # if train_set == "movie_reviews_small/train":
        #     self.feature_dict = {'fast': 0, 'couple': 1, 'shoot': 2, 'fly': 3}
        # else:
        #     self.feature_dict = self.select_features(train_set)
        # self.n_features = len(self.feature_dict)
        self.theta = np.zeros(self.n_features + 1)  # weights (and bias)
        classes, documents = self.load_data(train_set)
        n_minibatches = ceil(len(train_set) / batch_size)
        for epoch in range(n_epochs):
            print("Epoch {:} out of {:}".format(epoch + 1, n_epochs))
            loss = 0
            for i in range(n_minibatches):
                # list of filenames in minibatch
                minibatch = train_set[i * batch_size: (i + 1) * batch_size]
                # BEGIN STUDENT CODE
                # create and fill in matrix x and vector y
                x = np.zeros((len(minibatch), self.n_features + 1))
                y = np.zeros(len(minibatch))
                k = 0
                for j, row in minibatch.iterrows():
                    x[k][:] = documents[j]
                    y[k] = classes[j]
                    k += 1
                # compute y_hat
                y_hat = sigma(np.dot(x, self.theta))
                # update loss
                loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
                # compute gradient
                gradient = np.dot(x.T, np.subtract(y_hat, y)) / len(minibatch)
                # update weights (and bias)
                self.theta = self.theta - (eta * gradient)
                # END STUDENT CODE
            loss /= len(train_set)
            print("Average Train Loss: {}".format(loss))
            # randomize order
            #Random(epoch).shuffle(train_set)

    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''

    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        classes, documents = self.load_data(dev_set)
        for index, row in dev_set.iterrows():
            # BEGIN STUDENT CODE
            # get most likely class (recall that P(y=1|x) = y_hat)
            true_labels.append(classes[index])
            prediction = sigma(np.dot(documents[index], self.theta))
            pred_label = 1 if prediction > 0.5 else 0
            pred_labels.append(pred_label)
            # END STUDENT CODE
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''

    def evaluate(self, results):
        # you can copy and paste your code from PA1 here
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))


if __name__ == '__main__':
    lr = LogisticRegression(n_features=750)
    # make sure these point to the right directories
    batch_size = [1, 2, 3, 8, 16, 32]
    n_epochs = [1, 5, 10, 20, 30, 40]
    eta = [0.025, 0.05, 0.1, 0.2, 0.4]

    # code for grid search
#     for b in batch_size:
#         for n in n_epochs:
#             for ler in eta:
#                 lr.train(covid_senti_train, batch_size=b, n_epochs=n, eta=ler)
#                 results = lr.test(covid_senti_test)
#                 lr.evaluate(results)
#                 print("Accuracy is for batch size: ", b, ", n_epochs: ", n, "eta: ", ler)

    # best features from grid search
    lr.train(covid_senti_train, batch_size=3, n_epochs=40, eta=0.05)
    results = lr.test(covid_senti_test)
    # lr.train('movie_reviews_small/train', batch_size=3, n_epochs=1, eta=0.1)
    # results = lr.test('movie_reviews_small/test')
    lr.evaluate(results)


Epoch 1 out of 40


  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))


Average Train Loss: nan
Epoch 2 out of 40
Average Train Loss: nan
Epoch 3 out of 40
Average Train Loss: nan
Epoch 4 out of 40
Average Train Loss: nan
Epoch 5 out of 40
Average Train Loss: nan
Epoch 6 out of 40
Average Train Loss: nan
Epoch 7 out of 40
Average Train Loss: nan
Epoch 8 out of 40
Average Train Loss: nan
Epoch 9 out of 40
Average Train Loss: nan
Epoch 10 out of 40
Average Train Loss: nan
Epoch 11 out of 40
Average Train Loss: nan
Epoch 12 out of 40
Average Train Loss: nan
Epoch 13 out of 40
Average Train Loss: nan
Epoch 14 out of 40
Average Train Loss: nan
Epoch 15 out of 40
Average Train Loss: nan
Epoch 16 out of 40
Average Train Loss: nan
Epoch 17 out of 40
Average Train Loss: nan
Epoch 18 out of 40
Average Train Loss: nan
Epoch 19 out of 40
Average Train Loss: nan
Epoch 20 out of 40
Average Train Loss: nan
Epoch 21 out of 40
Average Train Loss: nan
Epoch 22 out of 40
Average Train Loss: nan
Epoch 23 out of 40
Average Train Loss: nan
Epoch 24 out of 40
Average Train Loss:

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## CNN classifier

In [8]:
import torch
from torch import nn, optim
import gensim
import time
import torch.nn.functional as F

In [9]:
from gensim.models import Word2Vec
tweets = list(covid_senti['lemmatized_tweet'].values)
tweets.append(['pad'])
w2v_model = Word2Vec(tweets, min_count = 1, vector_size = 500, workers = 3, window = 3, sg = 1)

In [10]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
covid_senti['label_num'] = covid_senti.label.replace(class_dict)
covid_senti

Unnamed: 0,tweet,label,tokenized_tweet,lemmatized_tweet,label_num
0,Coronavirus | Human Coronavirus Types | CDC ht...,neu,"[coronavirus, human, coronavirus, types, cdc, ...","[coronavirus, human, coronavirus, type, cdc, ,...",2
1,"@shehryar_taseer That‚Äôs üíØ true , \nCorona...",neu,"[shehryartaseer, that, aos, uiø, true, corona,...","[shehryartaseer, that, aos, uiø, true, corona,...",2
2,"TLDR: Not SARS, possibly new coronavirus. Diff...",neg,"[tldr, not, sars, possibly, new, coronavirus, ...","[tldr, not, sars, possibly, new, coronavirus, ...",0
3,Disease outbreak news from the WHO: Middle Eas...,neu,"[disease, outbreak, news, from, the, who, midd...","[disease, outbreak, news, from, the, who, midd...",2
4,China - Media: WSJ says sources tell them myst...,neu,"[china, media, wsj, says, sources, tell, them,...","[china, medium, wsj, say, source, tell, them, ...",2
...,...,...,...,...,...
29995,CDC: Re-test confirms Westerdam cruise ship pa...,neu,"[cdc, re, test, confirms, westerdam, cruise, s...","[cdc, re, test, confirms, westerdam, cruise, s...",2
29996,Two doctors die of coronavirus within 24 hours...,neu,"[two, doctors, die, of, coronavirus, within, h...","[two, doctor, die, of, coronavirus, within, ho...",2
29997,BEIJING - The lockdown of Guo Jing's neighbour...,neu,"[beijing, the, lockdown, of, guo, jing, neighb...","[beijing, the, lockdown, of, guo, jing, neighb...",2
29998,#CoronavirusOutbreak in #Balochistan !!\n#CPEC...,neu,"[in, balochistan, cpec, route, to, spread, cor...","[in, balochistan, cpec, route, to, spread, cor...",2


In [61]:
max_len = covid_senti['lemmatized_tweet'].map(len).max()
def make_word_2_vec(sentence):
    padding_idx = w2v_model.wv.key_to_index['pad']
    padded_X = [padding_idx for i in range(max_len)]
    i = 0
    for word in sentence:
        if word not in w2v_model.wv.key_to_index:
            padded_X[i] = 0
            print(word)
        else:
            padded_X[i] = w2v_model.wv.key_to_index[word]
        i += 1
    return torch.tensor(padded_X, dtype=torch.long).view(1, -1)

In [62]:
EMBEDDING_SIZE = 500
NUM_FILTERS = 10

class CnnTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, window_sizes=(1,2,3,5)):
        super(CnnTextClassifier, self).__init__()
        weights = w2v_model.wv
        # With pretrained embeddings
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights.vectors),
                                                      padding_idx=w2v_model.wv.key_to_index['pad'])
        # Without pretrained embeddings
        # self.embedding = nn.Embedding(vocab_size, EMBEDDING_SIZE)

        self.convs = nn.ModuleList([
                                   nn.Conv2d(1, NUM_FILTERS, [window_size, EMBEDDING_SIZE],
                                             padding=(window_size - 1, 0))
                                   for window_size in window_sizes
        ])

        self.fc = nn.Linear(NUM_FILTERS * len(window_sizes), num_classes)

    def forward(self, x):
        x = self.embedding(x)

        # Apply a convolution + max_pool layer for each window size
        x = torch.unsqueeze(x, 1)
        xs = []
        for conv in self.convs:
            x2 = torch.tanh(conv(x))
            x2 = torch.squeeze(x2, -1)
            x2 = F.max_pool1d(x2, x2.size(2))
            xs.append(x2)
        x = torch.cat(xs, 2)

        # FC
        x = x.view(x.size(0), -1)
        logits = self.fc(x)

        probs = F.softmax(logits, dim = 1)

        return probs

In [63]:
NUM_CLASSES = 3
VOCAB_SIZE = len(w2v_model.wv.key_to_index)

cnn_model = CnnTextClassifier(vocab_size=VOCAB_SIZE, num_classes=NUM_CLASSES)
# cnn_model.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)
num_epochs = 3

# Open the file for writing loss
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
loss_file_name = 'cnn_class_big_loss_with_padding.csv'
losses = []
cnn_model.train()
for epoch in range(num_epochs):
    start_time = time.time()
    print("Epoch " + str(epoch + 1))
    train_loss = 0
    for index, row in covid_senti_train.iterrows():
        # Clearing the accumulated gradients
        cnn_model.zero_grad()

        # Make the bag of words vector for stemmed tokens 
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
       
        # Forward pass to get output
        probs = cnn_model(bow_vec)

        # Get the target label
        target = torch.tensor([class_dict[row['label']]], dtype=torch.long)

        # Calculate Loss: softmax --> cross entropy loss
        loss = loss_function(probs, target)
        train_loss += loss.item()

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()


    # if index == 0:
    #     continue
    print("Epoch completed in: %.4f seconds" % (time.time()-start_time))
    print(str((epoch+1)) + "," + str(train_loss / len(covid_senti_train)))
    print('\n')
    train_loss = 0

torch.save(cnn_model, 'cnn_big_model_500_with_paddingA.pth')

Epoch 1
Epoch completed in: 67.5854 seconds
1,0.7859940051859021


Epoch 2
Epoch completed in: 68.0643 seconds
2,0.7855748417495914


Epoch 3
Epoch completed in: 70.3885 seconds
3,0.7855748417347242




In [64]:
from sklearn.metrics import classification_report
predictions = []
correct = []
cnn_model.eval()

with torch.no_grad():
    results = defaultdict(dict)
    for index, row in covid_senti_test.iterrows():
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
        probs = cnn_model(bow_vec)
        correct.append(class_dict[row['label']])
        _, predicted = torch.max(probs.data, 1)
        predictions.append(predicted.numpy()[0])
target_names = ['neg', 'pos', 'neu']
print(classification_report(predictions, correct, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.00      0.00      0.00         0
         pos       0.00      0.00      0.00         0
         neu       1.00      0.76      0.86      5945

    accuracy                           0.76      5945
   macro avg       0.33      0.25      0.29      5945
weighted avg       1.00      0.76      0.86      5945



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## BERT

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(covid_senti.index.values, 
                                                    covid_senti.label.values, test_size=0.2,
                                                   stratify=covid_senti.label.values)

In [12]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}

In [13]:
covid_senti['data_type'] = ['not_set'] * covid_senti.shape[0]

In [14]:
covid_senti.loc[X_train, 'data_type'] = 'train'
covid_senti.loc[X_test, 'data_type'] = 'test'

In [15]:
from transformers import BertTokenizer
import torch

torch.cuda.empty_cache()

In [39]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [40]:
encoded_data_train = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



In [41]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

In [42]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [43]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

In [44]:
train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)

test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

In [19]:
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [20]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [21]:
model = model.cuda()
loss_total = 0
model.train()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
torch.save(model.state_dict(), 'modelA.pth')

Epoch: 1 / 3, Step: 0 / 750 Loss: 1.4978
Epoch: 1 / 3, Step: 1 / 750 Loss: 1.4497
Epoch: 1 / 3, Step: 2 / 750 Loss: 1.3034
Epoch: 1 / 3, Step: 3 / 750 Loss: 1.3531
Epoch: 1 / 3, Step: 4 / 750 Loss: 1.3236
Epoch: 1 / 3, Step: 5 / 750 Loss: 1.2706
Epoch: 1 / 3, Step: 6 / 750 Loss: 1.2097
Epoch: 1 / 3, Step: 7 / 750 Loss: 1.1642
Epoch: 1 / 3, Step: 8 / 750 Loss: 1.1235
Epoch: 1 / 3, Step: 9 / 750 Loss: 1.0776
Epoch: 1 / 3, Step: 10 / 750 Loss: 1.0305
Epoch: 1 / 3, Step: 11 / 750 Loss: 1.0590
Epoch: 1 / 3, Step: 12 / 750 Loss: 0.9862
Epoch: 1 / 3, Step: 13 / 750 Loss: 0.9032
Epoch: 1 / 3, Step: 14 / 750 Loss: 0.7150
Epoch: 1 / 3, Step: 15 / 750 Loss: 0.8478
Epoch: 1 / 3, Step: 16 / 750 Loss: 0.6535
Epoch: 1 / 3, Step: 17 / 750 Loss: 0.6589
Epoch: 1 / 3, Step: 18 / 750 Loss: 0.6021
Epoch: 1 / 3, Step: 19 / 750 Loss: 0.7397
Epoch: 1 / 3, Step: 20 / 750 Loss: 0.5786
Epoch: 1 / 3, Step: 21 / 750 Loss: 0.5841
Epoch: 1 / 3, Step: 22 / 750 Loss: 0.7222
Epoch: 1 / 3, Step: 23 / 750 Loss: 0.6692
Ep

In [22]:
model.load_state_dict(torch.load('modelA.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)

In [23]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [24]:
from sklearn.metrics import f1_score
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()

In [25]:
from sklearn.metrics import classification_report
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.88      0.89      0.89      1017
         pos       0.86      0.84      0.85       393
         neu       0.96      0.96      0.96      4590

    accuracy                           0.94      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.94      0.94      0.94      6000



# DistilBERT

In [16]:
from transformers import DistilBertConfig,DistilBertTokenizer,DistilBertModel
distil_berttokenizer = DistilBertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizer'.


In [17]:
encoded_data_train = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



In [18]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

In [19]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['bert.encoder.layer.2.attention.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.5.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.output.dense.bias', 'bert.encoder.layer.8.output.dense.weight', 'bert.encoder.layer.3.attention.output.dense.bias', 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.1.output.dense.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.intermediate.dense.weight', 'bert.encoder.layer.10.attention.output.LayerNorm.weight', 'bert.encoder.layer.4.attention.output.dense.bias', 'bert.encoder.layer.1.attention.self.key.bias', 'bert.e

In [20]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)
test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [21]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [22]:
model = model.cuda()
loss_total = 0
model.train()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
torch.save(model.state_dict(), 'distilmodelA.pth')

Epoch: 1 / 3, Step: 0 / 750 Loss: 1.1052
Epoch: 1 / 3, Step: 1 / 750 Loss: 0.7276
Epoch: 1 / 3, Step: 2 / 750 Loss: 0.9843
Epoch: 1 / 3, Step: 3 / 750 Loss: 0.6338
Epoch: 1 / 3, Step: 4 / 750 Loss: 0.5480
Epoch: 1 / 3, Step: 5 / 750 Loss: 0.6396
Epoch: 1 / 3, Step: 6 / 750 Loss: 0.5616
Epoch: 1 / 3, Step: 7 / 750 Loss: 0.9645
Epoch: 1 / 3, Step: 8 / 750 Loss: 0.6199
Epoch: 1 / 3, Step: 9 / 750 Loss: 0.7198
Epoch: 1 / 3, Step: 10 / 750 Loss: 0.6688
Epoch: 1 / 3, Step: 11 / 750 Loss: 0.7417
Epoch: 1 / 3, Step: 12 / 750 Loss: 0.9502
Epoch: 1 / 3, Step: 13 / 750 Loss: 0.7326
Epoch: 1 / 3, Step: 14 / 750 Loss: 0.6738
Epoch: 1 / 3, Step: 15 / 750 Loss: 0.7072
Epoch: 1 / 3, Step: 16 / 750 Loss: 0.5473
Epoch: 1 / 3, Step: 17 / 750 Loss: 0.6048
Epoch: 1 / 3, Step: 18 / 750 Loss: 0.6233
Epoch: 1 / 3, Step: 19 / 750 Loss: 0.7719
Epoch: 1 / 3, Step: 20 / 750 Loss: 0.6490
Epoch: 1 / 3, Step: 21 / 750 Loss: 0.6407
Epoch: 1 / 3, Step: 22 / 750 Loss: 0.6419
Epoch: 1 / 3, Step: 23 / 750 Loss: 0.6644
Ep

In [23]:
model.load_state_dict(torch.load('distilmodelA.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)

In [24]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [25]:
from sklearn.metrics import classification_report
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.80      0.79      0.79      1017
         pos       0.67      0.55      0.60       393
         neu       0.92      0.94      0.93      4590

    accuracy                           0.89      6000
   macro avg       0.80      0.76      0.78      6000
weighted avg       0.88      0.89      0.88      6000



# Evaluation on Part B

The Part B of this dataset consists of tweets related to the COVID-19 crises, social distancing, lockdown, and stay at home. All the models above will be used for our evaluation. The pre-processing and processing of the data is same as that for the whole dataset and that for Part A.

In [26]:
covid_senti = pd.read_csv("COVIDSenti-main/COVIDSenti-B.csv")
covid_senti["label"].value_counts()

neu    22496
neg     5471
pos     2033
Name: label, dtype: int64

As shown above, Part B consists of 22496 neutral samples, 5471 negative while 2033 rows labeled as postive.

In [27]:
from gensim.utils import simple_preprocess
import string

table = str.maketrans(dict.fromkeys(string.punctuation))

covid_senti['tokenized_tweet'] = [simple_preprocess(line, deacc=True) for line in covid_senti['tweet']]
covid_senti['tokenized_tweet'] = [[word.replace('\n', '') for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.replace('#', '') for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.lower() for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.translate(table) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[''.join(filter(lambda x: not word.startswith('https'), word)) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[''.join(filter(lambda x: not word.startswith('@'), word)) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['tokenized_tweet'].sample(n=10))

In [28]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
covid_senti['lemmatized_tweet'] = [[wordnet_lemmatizer.lemmatize(word) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['lemmatized_tweet'].sample(n=10))

[nltk_data] Downloading package wordnet to
[nltk_data]     /s/chopin/a/grad/sanket96/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [29]:
mask = np.random.rand(len(covid_senti)) < 0.8
covid_senti_train = covid_senti[mask]
covid_senti_test = covid_senti[~mask]

## Naive Bayes

In [31]:
class NaiveBayes():

    def __init__(self):
        # be sure to use the right class_dict for each data set
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        self.feature_dict = {}
        self.prior = np.zeros(len(self.class_dict))
        self.likelihood = None
    '''
    Trains a multinomial Naive Bayes classifier on a training set.
    Specifically, fills in self.prior and self.likelihood such that:
    self.prior[class] = log(P(class))
    self.likelihood[class][feature] = log(P(feature|class))
    '''
    def train(self, train_set):
        self.feature_dict = self.select_features(train_set)
        # iterate over training documents
        self.likelihood = np.zeros((len(self.class_dict), len(self.feature_dict)))
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        for word in self.feature_dict:
            for class_name in self.class_dict:
                self.likelihood[self.class_dict.get(class_name)][self.feature_dict.get(word)] = np.log(((word_count.get((word,
                                                class_name), 0) + 1)/(total_words_per_class[class_name] + len(vocabulary))))
        for class_name in self.class_dict:
            self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''
    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        # iterate over testing documents
        for index, row in dev_set.iterrows():
            class_name = row['label']
            # create feature vectors for each document
            word_count = {}
            true_labels.append(self.class_dict[class_name])
            data = str(row['lemmatized_tweet'])
            for word in data:
                if word in self.feature_dict:
                    word_count[word] = 1 + word_count.get(word, 0)
            feature_vector = np.zeros((len(self.feature_dict), 1))
            for i, word in enumerate(self.feature_dict):
                feature_vector[i] = word_count.get(word, 0)
            self.prior = np.reshape(self.prior, (self.prior.shape[0], 1))
            probability = self.prior + np.matmul(self.likelihood, feature_vector)
            pred_labels.append(np.argmax(probability))
                # get most likely class
        # print(dict(results))
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''
    def evaluate(self, results):
        # you may find this helpful
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))
    '''
    Performs feature selection.
    Returns a dictionary of features.
    '''
    def select_features(self, train_set):
        # almost any method of feature selection is fine here
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        likelihood_ratio = {}
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        prob_class = np.zeros((3, 1))
        for i, class_name in enumerate(self.class_dict):
            prob_class[i] = (doc_per_class[class_name] / sum(doc_per_class.values()))
        for word in vocabulary:
            class_probs = [1] * len(self.class_dict)
            for i, class_name in enumerate(self.class_dict):
                class_probs[i] = (word_count.get((word,
                                      class_name), 0) + 1) / (total_words_per_class[class_name] + len(vocabulary))
                class_probs[i] = class_probs[i] / prob_class[i]
            likelihood_ratio[word] = (1 / class_probs[0]) * (1 / class_probs[1]) * (1 / class_probs[2])
        #likelihood_ratio_pos = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1], reverse=True))
        likelihood_ratio = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1]))
        words = []
        words.extend(list(likelihood_ratio.keys())[:750])
        #words.extend(list(likelihood_ratio_pos.keys())[:750])
        # for class_name in self.class_dict:
        #     self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
        features = {}
        for i, word in enumerate(words):
            features[word] = i
        return features


if __name__ == '__main__':
    nb = NaiveBayes()
    # make sure these point to the right directories
    nb.train(covid_senti_train)
    # nb.train('movie_reviews_small/train')
    results = nb.test(covid_senti_test)
    # results = nb.test('movie_reviews_small/test')
    nb.evaluate(results)


              precision    recall  f1-score   support

         neg       0.00      0.00      0.00      1166
         pos       0.00      0.00      0.00       379
         neu       0.75      1.00      0.86      4579

    accuracy                           0.75      6124
   macro avg       0.25      0.33      0.29      6124
weighted avg       0.56      0.75      0.64      6124



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Logistic Regression classifier

In [32]:
# CS542 Fall 2021 Programming Assignment 2
# Logistic Regression Classifier

'''
Computes the logistic function.
'''


def sigma(z):
    return 1 / (1 + np.exp(-z))


class LogisticRegression():

    def __init__(self, n_features=400):
        # be sure to use the right class_dict for each data set
        self.theta = None
        self.n_features = n_features
        self.feature_dict = None
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        # use of self.feature_dict is optional for this assignment
        self.feature_dict = self.select_features(covid_senti_train)

    '''
    Loads a dataset. Specifically, returns a list of filenames, and dictionaries
    of classes and documents such that:
    classes[filename] = class of the document
    documents[filename] = feature vector for the document (use self.featurize)
    '''

    def select_features(self, data_set):
        feature_count = {}
        for index, row in data_set.iterrows():
            data = str(row['lemmatized_tweet']).split()
            for word in data:
                feature_count[word] = 1 + feature_count.get(word, 0)

        feature_count = list(dict(sorted(feature_count.items(), key=lambda v: v[1], reverse=True)).keys())[:500]
        features = {}

        for i, word in enumerate(feature_count):
            features[word] = i
        return features

    def load_data(self, data_set):
        filenames = []
        classes = dict()
        documents = dict()
        # iterate over documents
        for index, row in data_set.iterrows():
            # your code here
            # BEGIN STUDENT CODE
            # if os.path.isfile(os.path.join(root, name)):
            class_name = row['label']
            classes[index] = self.class_dict[class_name]
            documents[index] = self.featurize(row['lemmatized_tweet'])
            # END STUDENT CODE
        return classes, documents

    '''
    Given a document (as a list of words), returns a feature vector.
    Note that the last element of the vector, corresponding to the bias, is a
    "dummy feature" with value 1.
    '''

    def featurize(self, document):
        vector = np.zeros(self.n_features + 1)
        # BEGIN STUDENT CODE
        for word in document:
            if word in self.feature_dict:
                if word not in w2v_model.wv.key_to_index:
                    vector.extend([0] * 500)
                else:
                    vector.extend(w2v_model.wv[word])
        # END STUDENT CODE
        vector[-1] = 1
        return vector

    '''
    Trains a logistic regression classifier on a training set.
    '''

    def train(self, train_set, batch_size=3, n_epochs=1, eta=0.1):
        # if train_set == "movie_reviews_small/train":
        #     self.feature_dict = {'fast': 0, 'couple': 1, 'shoot': 2, 'fly': 3}
        # else:
        #     self.feature_dict = self.select_features(train_set)
        # self.n_features = len(self.feature_dict)
        self.theta = np.zeros(self.n_features + 1)  # weights (and bias)
        classes, documents = self.load_data(train_set)
        n_minibatches = ceil(len(train_set) / batch_size)
        for epoch in range(n_epochs):
            print("Epoch {:} out of {:}".format(epoch + 1, n_epochs))
            loss = 0
            for i in range(n_minibatches):
                # list of filenames in minibatch
                minibatch = train_set[i * batch_size: (i + 1) * batch_size]
                # BEGIN STUDENT CODE
                # create and fill in matrix x and vector y
                x = np.zeros((len(minibatch), self.n_features + 1))
                y = np.zeros(len(minibatch))
                k = 0
                for j, row in minibatch.iterrows():
                    x[k][:] = documents[j]
                    y[k] = classes[j]
                    k += 1
                # compute y_hat
                y_hat = sigma(np.dot(x, self.theta))
                # update loss
                loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
                # compute gradient
                gradient = np.dot(x.T, np.subtract(y_hat, y)) / len(minibatch)
                # update weights (and bias)
                self.theta = self.theta - (eta * gradient)
                # END STUDENT CODE
            loss /= len(train_set)
            print("Average Train Loss: {}".format(loss))
            # randomize order
            #Random(epoch).shuffle(train_set)

    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''

    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        classes, documents = self.load_data(dev_set)
        for index, row in dev_set.iterrows():
            # BEGIN STUDENT CODE
            # get most likely class (recall that P(y=1|x) = y_hat)
            true_labels.append(classes[index])
            prediction = sigma(np.dot(documents[index], self.theta))
            pred_label = 1 if prediction > 0.5 else 0
            pred_labels.append(pred_label)
            # END STUDENT CODE
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''

    def evaluate(self, results):
        # you can copy and paste your code from PA1 here
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))


if __name__ == '__main__':
    lr = LogisticRegression(n_features=750)
    # make sure these point to the right directories
    batch_size = [1, 2, 3, 8, 16, 32]
    n_epochs = [1, 5, 10, 20, 30, 40]
    eta = [0.025, 0.05, 0.1, 0.2, 0.4]

    # code for grid search
#     for b in batch_size:
#         for n in n_epochs:
#             for ler in eta:
#                 lr.train(covid_senti_train, batch_size=b, n_epochs=n, eta=ler)
#                 results = lr.test(covid_senti_test)
#                 lr.evaluate(results)
#                 print("Accuracy is for batch size: ", b, ", n_epochs: ", n, "eta: ", ler)

    # best features from grid search
    lr.train(covid_senti_train, batch_size=3, n_epochs=40, eta=0.05)
    results = lr.test(covid_senti_test)
    # lr.train('movie_reviews_small/train', batch_size=3, n_epochs=1, eta=0.1)
    # results = lr.test('movie_reviews_small/test')
    lr.evaluate(results)


Epoch 1 out of 40


  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))


Average Train Loss: nan
Epoch 2 out of 40
Average Train Loss: nan
Epoch 3 out of 40
Average Train Loss: nan
Epoch 4 out of 40
Average Train Loss: nan
Epoch 5 out of 40
Average Train Loss: nan
Epoch 6 out of 40
Average Train Loss: nan
Epoch 7 out of 40
Average Train Loss: nan
Epoch 8 out of 40
Average Train Loss: nan
Epoch 9 out of 40
Average Train Loss: nan
Epoch 10 out of 40
Average Train Loss: nan
Epoch 11 out of 40
Average Train Loss: nan
Epoch 12 out of 40
Average Train Loss: nan
Epoch 13 out of 40
Average Train Loss: nan
Epoch 14 out of 40
Average Train Loss: nan
Epoch 15 out of 40
Average Train Loss: nan
Epoch 16 out of 40
Average Train Loss: nan
Epoch 17 out of 40
Average Train Loss: nan
Epoch 18 out of 40
Average Train Loss: nan
Epoch 19 out of 40
Average Train Loss: nan
Epoch 20 out of 40
Average Train Loss: nan
Epoch 21 out of 40
Average Train Loss: nan
Epoch 22 out of 40
Average Train Loss: nan
Epoch 23 out of 40
Average Train Loss: nan
Epoch 24 out of 40
Average Train Loss:

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## CNN Classifier

In [33]:
import torch
from torch import nn, optim
import gensim
import time
import torch.nn.functional as F

In [34]:
from gensim.models import Word2Vec
tweets = list(covid_senti['lemmatized_tweet'].values)
tweets.append(['pad'])
w2v_model = Word2Vec(tweets, min_count = 1, vector_size = 500, workers = 3, window = 3, sg = 1)

In [35]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
covid_senti['label_num'] = covid_senti.label.replace(class_dict)
covid_senti

Unnamed: 0,tweet,label,tokenized_tweet,lemmatized_tweet,label_num
0,Coronavirus fears expose a cultural divide ove...,neu,"[coronavirus, fears, expose, cultural, divide,...","[coronavirus, fear, expose, cultural, divide, ...",2
1,Coronavirus Live Updates: Global Outbreak Rais...,neu,"[coronavirus, live, updates, global, outbreak,...","[coronavirus, live, update, global, outbreak, ...",2
2,"Ruling party, government mull W10tr to tackle ...",neu,"[ruling, party, government, mull, tr, to, tack...","[ruling, party, government, mull, tr, to, tack...",2
3,Exclusive: Thousands in Coronavirus Epicenter ...,neg,"[exclusive, thousands, in, coronavirus, epicen...","[exclusive, thousand, in, coronavirus, epicent...",0
4,"@Queen_kimo_ Derp derp, what's that have to do...",neu,"[queenkimo, derp, derp, what, that, have, to, ...","[queenkimo, derp, derp, what, that, have, to, ...",2
...,...,...,...,...,...
29995,What Happens if Coronavirus Forces U.S. Movie ...,neu,"[what, happens, if, coronavirus, forces, movie...","[what, happens, if, coronavirus, force, movie,...",2
29996,Real question : how does it seem that people i...,neu,"[real, question, how, does, it, seem, that, pe...","[real, question, how, doe, it, seem, that, peo...",2
29997,"#NOOLUYO35: No Time to Die, the latest #JamesB...",pos,"[nooluyo, no, time, to, die, the, latest, jame...","[nooluyo, no, time, to, die, the, latest, jame...",1
29998,@ShilpiSinghINC @RahulGandhi Ye Italy se laute...,neu,"[shilpisinghinc, rahulgandhi, ye, italy, se, l...","[shilpisinghinc, rahulgandhi, ye, italy, se, l...",2


In [36]:
max_len = covid_senti['lemmatized_tweet'].map(len).max()
def make_word_2_vec(sentence):
    padding_idx = w2v_model.wv.key_to_index['pad']
    padded_X = [padding_idx for i in range(max_len)]
    i = 0
    for word in sentence:
        if word not in w2v_model.wv.key_to_index:
            padded_X[i] = 0
            print(word)
        else:
            padded_X[i] = w2v_model.wv.key_to_index[word]
        i += 1
    return torch.tensor(padded_X, dtype=torch.long).view(1, -1)

In [37]:
EMBEDDING_SIZE = 500
NUM_FILTERS = 10

class CnnTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, window_sizes=(1,2,3,5)):
        super(CnnTextClassifier, self).__init__()
        weights = w2v_model.wv
        # With pretrained embeddings
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights.vectors),
                                                      padding_idx=w2v_model.wv.key_to_index['pad'])
        # Without pretrained embeddings
        # self.embedding = nn.Embedding(vocab_size, EMBEDDING_SIZE)

        self.convs = nn.ModuleList([
                                   nn.Conv2d(1, NUM_FILTERS, [window_size, EMBEDDING_SIZE],
                                             padding=(window_size - 1, 0))
                                   for window_size in window_sizes
        ])

        self.fc = nn.Linear(NUM_FILTERS * len(window_sizes), num_classes)

    def forward(self, x):
        x = self.embedding(x)

        # Apply a convolution + max_pool layer for each window size
        x = torch.unsqueeze(x, 1)
        xs = []
        for conv in self.convs:
            x2 = torch.tanh(conv(x))
            x2 = torch.squeeze(x2, -1)
            x2 = F.max_pool1d(x2, x2.size(2))
            xs.append(x2)
        x = torch.cat(xs, 2)

        # FC
        x = x.view(x.size(0), -1)
        logits = self.fc(x)

        probs = F.softmax(logits, dim = 1)

        return probs

In [38]:
NUM_CLASSES = 3
VOCAB_SIZE = len(w2v_model.wv.key_to_index)

cnn_model = CnnTextClassifier(vocab_size=VOCAB_SIZE, num_classes=NUM_CLASSES)
# cnn_model.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)
num_epochs = 3

# Open the file for writing loss
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
loss_file_name = 'cnn_class_big_loss_with_padding.csv'
losses = []
cnn_model.train()
for epoch in range(num_epochs):
    start_time = time.time()
    print("Epoch " + str(epoch + 1))
    train_loss = 0
    for index, row in covid_senti_train.iterrows():
        # Clearing the accumulated gradients
        cnn_model.zero_grad()

        # Make the bag of words vector for stemmed tokens 
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
       
        # Forward pass to get output
        probs = cnn_model(bow_vec)

        # Get the target label
        target = torch.tensor([class_dict[row['label']]], dtype=torch.long)

        # Calculate Loss: softmax --> cross entropy loss
        loss = loss_function(probs, target)
        train_loss += loss.item()

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()


    # if index == 0:
    #     continue
    print("Epoch completed in: %.4f seconds" % (time.time()-start_time))
    print(str((epoch+1)) + "," + str(train_loss / len(covid_senti_train)))
    print('\n')
    train_loss = 0

torch.save(cnn_model, 'cnn_big_model_500_with_paddingB.pth')

Epoch 1
Epoch completed in: 73.7431 seconds
1,0.8013223460504888


Epoch 2
Epoch completed in: 74.2900 seconds
2,0.8010258935897429


Epoch 3
Epoch completed in: 77.4480 seconds
3,0.8010258935522965




In [39]:
from sklearn.metrics import classification_report
predictions = []
correct = []
cnn_model.eval()

with torch.no_grad():
    results = defaultdict(dict)
    for index, row in covid_senti_test.iterrows():
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
        probs = cnn_model(bow_vec)
        correct.append(class_dict[row['label']])
        _, predicted = torch.max(probs.data, 1)
        predictions.append(predicted.numpy()[0])
target_names = ['neg', 'pos', 'neu']
print(classification_report(predictions, correct, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.00      0.00      0.00         0
         pos       0.00      0.00      0.00         0
         neu       1.00      0.75      0.86      6124

    accuracy                           0.75      6124
   macro avg       0.33      0.25      0.29      6124
weighted avg       1.00      0.75      0.86      6124



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## BERT

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(covid_senti.index.values, 
                                                    covid_senti.label.values, test_size=0.2,
                                                   stratify=covid_senti.label.values)

In [41]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}

In [43]:
covid_senti['data_type'] = ['not_set'] * covid_senti.shape[0]

In [44]:
covid_senti.loc[X_train, 'data_type'] = 'train'
covid_senti.loc[X_test, 'data_type'] = 'test'

In [45]:
from transformers import BertTokenizer
import torch

torch.cuda.empty_cache()

In [46]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [47]:
encoded_data_train = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



In [48]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label.values)

In [49]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [50]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

In [51]:
train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)

test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

In [54]:
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [56]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [57]:
model = model.cuda()
loss_total = 0
model.train()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
torch.save(model.state_dict(), 'modelB.pth')

Epoch: 1 / 3, Step: 0 / 750 Loss: 1.1228
Epoch: 1 / 3, Step: 1 / 750 Loss: 1.1532
Epoch: 1 / 3, Step: 2 / 750 Loss: 1.0181
Epoch: 1 / 3, Step: 3 / 750 Loss: 1.0761
Epoch: 1 / 3, Step: 4 / 750 Loss: 0.9837
Epoch: 1 / 3, Step: 5 / 750 Loss: 0.9835
Epoch: 1 / 3, Step: 6 / 750 Loss: 0.9329
Epoch: 1 / 3, Step: 7 / 750 Loss: 0.8398
Epoch: 1 / 3, Step: 8 / 750 Loss: 0.9530
Epoch: 1 / 3, Step: 9 / 750 Loss: 0.9688
Epoch: 1 / 3, Step: 10 / 750 Loss: 0.8938
Epoch: 1 / 3, Step: 11 / 750 Loss: 0.9393
Epoch: 1 / 3, Step: 12 / 750 Loss: 0.8730
Epoch: 1 / 3, Step: 13 / 750 Loss: 0.7420
Epoch: 1 / 3, Step: 14 / 750 Loss: 0.9393
Epoch: 1 / 3, Step: 15 / 750 Loss: 0.7057
Epoch: 1 / 3, Step: 16 / 750 Loss: 0.6884
Epoch: 1 / 3, Step: 17 / 750 Loss: 0.7565
Epoch: 1 / 3, Step: 18 / 750 Loss: 0.7399
Epoch: 1 / 3, Step: 19 / 750 Loss: 0.6450
Epoch: 1 / 3, Step: 20 / 750 Loss: 0.6797
Epoch: 1 / 3, Step: 21 / 750 Loss: 0.5143
Epoch: 1 / 3, Step: 22 / 750 Loss: 0.4315
Epoch: 1 / 3, Step: 23 / 750 Loss: 0.8014
Ep

In [58]:
model.load_state_dict(torch.load('modelB.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)

In [59]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [60]:
from sklearn.metrics import f1_score
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()

In [61]:
from sklearn.metrics import classification_report
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.92      0.86      0.89      1094
         pos       0.81      0.86      0.83       407
         neu       0.95      0.96      0.96      4499

    accuracy                           0.94      6000
   macro avg       0.89      0.90      0.89      6000
weighted avg       0.94      0.94      0.94      6000



# DistilBERT

DistilBert is a lighter version of BERT. This model is built by knowledge distillation technique wherein a small model is trained to reproduce the behavior of a larger model. Accordingly, it has 40% less parameters than BERT and runs 60% faster while retaining 95% of the BERT-base-uncased performance[3]. Here, we are planning to compare the results of BERT with DistilBERT and try to get hands on both.

In [62]:
from transformers import DistilBertConfig,DistilBertTokenizer,DistilBertModel
distil_berttokenizer = DistilBertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizer'.


In [63]:
encoded_data_train = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



In [64]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

In [65]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['bert.encoder.layer.2.attention.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.5.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.output.dense.bias', 'bert.encoder.layer.8.output.dense.weight', 'bert.encoder.layer.3.attention.output.dense.bias', 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.1.output.dense.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.intermediate.dense.weight', 'bert.encoder.layer.10.attention.output.LayerNorm.weight', 'bert.encoder.layer.4.attention.output.dense.bias', 'bert.encoder.layer.1.attention.self.key.bias', 'bert.e

In [66]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)
test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [67]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [69]:
model = model.cuda()
loss_total = 0
model.train()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
torch.save(model.state_dict(), 'distilmodelB.pth')

Epoch: 1 / 3, Step: 0 / 750 Loss: 0.2118
Epoch: 1 / 3, Step: 1 / 750 Loss: 0.1976
Epoch: 1 / 3, Step: 2 / 750 Loss: 0.1105
Epoch: 1 / 3, Step: 3 / 750 Loss: 0.4373
Epoch: 1 / 3, Step: 4 / 750 Loss: 0.1667
Epoch: 1 / 3, Step: 5 / 750 Loss: 0.0947
Epoch: 1 / 3, Step: 6 / 750 Loss: 0.2525
Epoch: 1 / 3, Step: 7 / 750 Loss: 0.1375
Epoch: 1 / 3, Step: 8 / 750 Loss: 0.1969
Epoch: 1 / 3, Step: 9 / 750 Loss: 0.5194
Epoch: 1 / 3, Step: 10 / 750 Loss: 0.3098
Epoch: 1 / 3, Step: 11 / 750 Loss: 0.1188
Epoch: 1 / 3, Step: 12 / 750 Loss: 0.2719
Epoch: 1 / 3, Step: 13 / 750 Loss: 0.1816
Epoch: 1 / 3, Step: 14 / 750 Loss: 0.3821
Epoch: 1 / 3, Step: 15 / 750 Loss: 0.4668
Epoch: 1 / 3, Step: 16 / 750 Loss: 0.3268
Epoch: 1 / 3, Step: 17 / 750 Loss: 0.2012
Epoch: 1 / 3, Step: 18 / 750 Loss: 0.2884
Epoch: 1 / 3, Step: 19 / 750 Loss: 0.2631
Epoch: 1 / 3, Step: 20 / 750 Loss: 0.1056
Epoch: 1 / 3, Step: 21 / 750 Loss: 0.1297
Epoch: 1 / 3, Step: 22 / 750 Loss: 0.1447
Epoch: 1 / 3, Step: 23 / 750 Loss: 0.2654
Ep

In [70]:
model.load_state_dict(torch.load('distilmodelB.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)

In [71]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [72]:
from sklearn.metrics import classification_report
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.79      0.78      0.79      1094
         pos       0.76      0.55      0.64       407
         neu       0.91      0.94      0.92      4499

    accuracy                           0.88      6000
   macro avg       0.82      0.76      0.78      6000
weighted avg       0.88      0.88      0.88      6000



# Evaluation on Part C

Now let's start on Part C. Part C of this dataset contains tweets related to COVID-19 cases, outbreak, and stay at home. The evaluation of this part (preprocessing and processing) is same as that of the entire dataset as above.

In [115]:
covid_senti = pd.read_csv("COVIDSenti-main/COVIDSenti-C.csv")
covid_senti["label"].value_counts()

neu    21940
neg     5781
pos     2279
Name: label, dtype: int64

It contains 21940 samples marked as neutral, 2279 as positive and 5781 as negative.

## Preprocessing

In [116]:
from gensim.utils import simple_preprocess
import string

table = str.maketrans(dict.fromkeys(string.punctuation))

covid_senti['tokenized_tweet'] = [simple_preprocess(line, deacc=True) for line in covid_senti['tweet']]
covid_senti['tokenized_tweet'] = [[word.replace('\n', '') for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.replace('#', '') for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.lower() for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[word.translate(table) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[''.join(filter(lambda x: not word.startswith('https'), word)) for word in line] for line in covid_senti['tokenized_tweet']]
covid_senti['tokenized_tweet'] = [[''.join(filter(lambda x: not word.startswith('@'), word)) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['tokenized_tweet'].sample(n=10))

In [117]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
covid_senti['lemmatized_tweet'] = [[wordnet_lemmatizer.lemmatize(word) for word in line] for line in covid_senti['tokenized_tweet']]
# print(covid_senti['lemmatized_tweet'].sample(n=10))

[nltk_data] Downloading package wordnet to
[nltk_data]     /s/chopin/a/grad/sanket96/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [118]:
mask = np.random.rand(len(covid_senti)) < 0.8
covid_senti_train = covid_senti[mask]
covid_senti_test = covid_senti[~mask]

## Naive Bayes

In [77]:
class NaiveBayes():

    def __init__(self):
        # be sure to use the right class_dict for each data set
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        self.feature_dict = {}
        self.prior = np.zeros(len(self.class_dict))
        self.likelihood = None
    '''
    Trains a multinomial Naive Bayes classifier on a training set.
    Specifically, fills in self.prior and self.likelihood such that:
    self.prior[class] = log(P(class))
    self.likelihood[class][feature] = log(P(feature|class))
    '''
    def train(self, train_set):
        self.feature_dict = self.select_features(train_set)
        # iterate over training documents
        self.likelihood = np.zeros((len(self.class_dict), len(self.feature_dict)))
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        for word in self.feature_dict:
            for class_name in self.class_dict:
                self.likelihood[self.class_dict.get(class_name)][self.feature_dict.get(word)] = np.log(((word_count.get((word,
                                                class_name), 0) + 1)/(total_words_per_class[class_name] + len(vocabulary))))
        for class_name in self.class_dict:
            self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''
    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        # iterate over testing documents
        for index, row in dev_set.iterrows():
            class_name = row['label']
            # create feature vectors for each document
            word_count = {}
            true_labels.append(self.class_dict[class_name])
            data = str(row['lemmatized_tweet'])
            for word in data:
                if word in self.feature_dict:
                    word_count[word] = 1 + word_count.get(word, 0)
            feature_vector = np.zeros((len(self.feature_dict), 1))
            for i, word in enumerate(self.feature_dict):
                feature_vector[i] = word_count.get(word, 0)
            self.prior = np.reshape(self.prior, (self.prior.shape[0], 1))
            probability = self.prior + np.matmul(self.likelihood, feature_vector)
            pred_labels.append(np.argmax(probability))
                # get most likely class
        # print(dict(results))
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''
    def evaluate(self, results):
        # you may find this helpful
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))
    '''
    Performs feature selection.
    Returns a dictionary of features.
    '''
    def select_features(self, train_set):
        # almost any method of feature selection is fine here
        doc_per_class = {}
        word_count = {}
        total_words_per_class = {}
        vocabulary = set()
        likelihood_ratio = {}
        for index, row in train_set.iterrows():
            class_name = row['label']
            if (class_name in self.class_dict):
                doc_per_class[class_name] = 1 + doc_per_class.get(class_name, 0)
                    # collect class counts and feature counts
                data = row['lemmatized_tweet']
                for word in data:
                    vocabulary.add(word)
                    word_count[(word, class_name)] = 1 + word_count.get((word, class_name), 0)
        # normalize counts to probabilities, and take logs
        for class_name in self.class_dict:
            counts = [v for k, v in word_count.items() if k[1] == class_name]
            total_words_per_class[class_name] = sum(counts)
        prob_class = np.zeros((3, 1))
        for i, class_name in enumerate(self.class_dict):
            prob_class[i] = (doc_per_class[class_name] / sum(doc_per_class.values()))
        for word in vocabulary:
            class_probs = [1] * len(self.class_dict)
            for i, class_name in enumerate(self.class_dict):
                class_probs[i] = (word_count.get((word,
                                      class_name), 0) + 1) / (total_words_per_class[class_name] + len(vocabulary))
                class_probs[i] = class_probs[i] / prob_class[i]
            likelihood_ratio[word] = (1 / class_probs[0]) * (1 / class_probs[1]) * (1 / class_probs[2])
        #likelihood_ratio_pos = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1], reverse=True))
        likelihood_ratio = dict(sorted(likelihood_ratio.items(), key=lambda item: item[1]))
        words = []
        words.extend(list(likelihood_ratio.keys())[:750])
        #words.extend(list(likelihood_ratio_pos.keys())[:750])
        # for class_name in self.class_dict:
        #     self.prior[self.class_dict[class_name]] = np.log((doc_per_class[class_name] / sum(doc_per_class.values())))
        features = {}
        for i, word in enumerate(words):
            features[word] = i
        return features


if __name__ == '__main__':
    nb = NaiveBayes()
    # make sure these point to the right directories
    nb.train(covid_senti_train)
    # nb.train('movie_reviews_small/train')
    results = nb.test(covid_senti_test)
    # results = nb.test('movie_reviews_small/test')
    nb.evaluate(results)


              precision    recall  f1-score   support

         neg       0.00      0.00      0.00      1132
         pos       0.00      0.00      0.00       448
         neu       0.74      1.00      0.85      4403

    accuracy                           0.74      5983
   macro avg       0.25      0.33      0.28      5983
weighted avg       0.54      0.74      0.62      5983



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Logistic Regression classifier

In [78]:
# CS542 Fall 2021 Programming Assignment 2
# Logistic Regression Classifier

'''
Computes the logistic function.
'''


def sigma(z):
    return 1 / (1 + np.exp(-z))


class LogisticRegression():

    def __init__(self, n_features=400):
        # be sure to use the right class_dict for each data set
        self.theta = None
        self.n_features = n_features
        self.feature_dict = None
        self.class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
        # self.class_dict = {'action': 0, 'comedy': 1}
        # use of self.feature_dict is optional for this assignment
        self.feature_dict = self.select_features(covid_senti_train)

    '''
    Loads a dataset. Specifically, returns a list of filenames, and dictionaries
    of classes and documents such that:
    classes[filename] = class of the document
    documents[filename] = feature vector for the document (use self.featurize)
    '''

    def select_features(self, data_set):
        feature_count = {}
        for index, row in data_set.iterrows():
            data = str(row['lemmatized_tweet']).split()
            for word in data:
                feature_count[word] = 1 + feature_count.get(word, 0)

        feature_count = list(dict(sorted(feature_count.items(), key=lambda v: v[1], reverse=True)).keys())[:500]
        features = {}

        for i, word in enumerate(feature_count):
            features[word] = i
        return features

    def load_data(self, data_set):
        filenames = []
        classes = dict()
        documents = dict()
        # iterate over documents
        for index, row in data_set.iterrows():
            # your code here
            # BEGIN STUDENT CODE
            # if os.path.isfile(os.path.join(root, name)):
            class_name = row['label']
            classes[index] = self.class_dict[class_name]
            documents[index] = self.featurize(row['lemmatized_tweet'])
            # END STUDENT CODE
        return classes, documents

    '''
    Given a document (as a list of words), returns a feature vector.
    Note that the last element of the vector, corresponding to the bias, is a
    "dummy feature" with value 1.
    '''

    def featurize(self, document):
        vector = np.zeros(self.n_features + 1)
        # BEGIN STUDENT CODE
        for word in document:
            if word in self.feature_dict:
                if word not in w2v_model.wv.key_to_index:
                    vector.extend([0] * 500)
                else:
                    vector.extend(w2v_model.wv[word])
        # END STUDENT CODE
        vector[-1] = 1
        return vector

    '''
    Trains a logistic regression classifier on a training set.
    '''

    def train(self, train_set, batch_size=3, n_epochs=1, eta=0.1):
        # if train_set == "movie_reviews_small/train":
        #     self.feature_dict = {'fast': 0, 'couple': 1, 'shoot': 2, 'fly': 3}
        # else:
        #     self.feature_dict = self.select_features(train_set)
        # self.n_features = len(self.feature_dict)
        self.theta = np.zeros(self.n_features + 1)  # weights (and bias)
        classes, documents = self.load_data(train_set)
        n_minibatches = ceil(len(train_set) / batch_size)
        for epoch in range(n_epochs):
            print("Epoch {:} out of {:}".format(epoch + 1, n_epochs))
            loss = 0
            for i in range(n_minibatches):
                # list of filenames in minibatch
                minibatch = train_set[i * batch_size: (i + 1) * batch_size]
                # BEGIN STUDENT CODE
                # create and fill in matrix x and vector y
                x = np.zeros((len(minibatch), self.n_features + 1))
                y = np.zeros(len(minibatch))
                k = 0
                for j, row in minibatch.iterrows():
                    x[k][:] = documents[j]
                    y[k] = classes[j]
                    k += 1
                # compute y_hat
                y_hat = sigma(np.dot(x, self.theta))
                # update loss
                loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
                # compute gradient
                gradient = np.dot(x.T, np.subtract(y_hat, y)) / len(minibatch)
                # update weights (and bias)
                self.theta = self.theta - (eta * gradient)
                # END STUDENT CODE
            loss /= len(train_set)
            print("Average Train Loss: {}".format(loss))
            # randomize order
            #Random(epoch).shuffle(train_set)

    '''
    Tests the classifier on a development or test set.
    Returns a dictionary of filenames mapped to their correct and predicted
    classes such that:
    results[filename]['correct'] = correct class
    results[filename]['predicted'] = predicted class
    '''

    def test(self, dev_set):
        pred_labels = []
        true_labels = []
        classes, documents = self.load_data(dev_set)
        for index, row in dev_set.iterrows():
            # BEGIN STUDENT CODE
            # get most likely class (recall that P(y=1|x) = y_hat)
            true_labels.append(classes[index])
            prediction = sigma(np.dot(documents[index], self.theta))
            pred_label = 1 if prediction > 0.5 else 0
            pred_labels.append(pred_label)
            # END STUDENT CODE
        return pred_labels, true_labels

    '''
    Given results, calculates the following:
    Precision, Recall, F1 for each class
    Accuracy overall
    Also, prints evaluation metrics in readable format.
    '''

    def evaluate(self, results):
        # you can copy and paste your code from PA1 here
        target_names = ['neg', 'pos', 'neu']
        print(classification_report(results[1], results[0], target_names=target_names))


if __name__ == '__main__':
    lr = LogisticRegression(n_features=750)
    # make sure these point to the right directories
    batch_size = [1, 2, 3, 8, 16, 32]
    n_epochs = [1, 5, 10, 20, 30, 40]
    eta = [0.025, 0.05, 0.1, 0.2, 0.4]

    # code for grid search
#     for b in batch_size:
#         for n in n_epochs:
#             for ler in eta:
#                 lr.train(covid_senti_train, batch_size=b, n_epochs=n, eta=ler)
#                 results = lr.test(covid_senti_test)
#                 lr.evaluate(results)
#                 print("Accuracy is for batch size: ", b, ", n_epochs: ", n, "eta: ", ler)

    # best features from grid search
    lr.train(covid_senti_train, batch_size=3, n_epochs=40, eta=0.05)
    results = lr.test(covid_senti_test)
    # lr.train('movie_reviews_small/train', batch_size=3, n_epochs=1, eta=0.1)
    # results = lr.test('movie_reviews_small/test')
    lr.evaluate(results)


Epoch 1 out of 40


  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))
  loss += -((y @ np.log(y_hat)) + ((1 - y) @ np.log(1 - y_hat)))


Average Train Loss: nan
Epoch 2 out of 40
Average Train Loss: nan
Epoch 3 out of 40
Average Train Loss: nan
Epoch 4 out of 40
Average Train Loss: nan
Epoch 5 out of 40
Average Train Loss: nan
Epoch 6 out of 40
Average Train Loss: nan
Epoch 7 out of 40
Average Train Loss: nan
Epoch 8 out of 40
Average Train Loss: nan
Epoch 9 out of 40
Average Train Loss: nan
Epoch 10 out of 40
Average Train Loss: nan
Epoch 11 out of 40
Average Train Loss: nan
Epoch 12 out of 40
Average Train Loss: nan
Epoch 13 out of 40
Average Train Loss: nan
Epoch 14 out of 40
Average Train Loss: nan
Epoch 15 out of 40
Average Train Loss: nan
Epoch 16 out of 40
Average Train Loss: nan
Epoch 17 out of 40
Average Train Loss: nan
Epoch 18 out of 40
Average Train Loss: nan
Epoch 19 out of 40
Average Train Loss: nan
Epoch 20 out of 40
Average Train Loss: nan
Epoch 21 out of 40
Average Train Loss: nan
Epoch 22 out of 40
Average Train Loss: nan
Epoch 23 out of 40
Average Train Loss: nan
Epoch 24 out of 40
Average Train Loss:

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## CNN classifier

In [119]:
import torch
from torch import nn, optim
import gensim
import time
import torch.nn.functional as F

In [120]:
from gensim.models import Word2Vec
tweets = list(covid_senti['lemmatized_tweet'].values)
tweets.append(['pad'])
w2v_model = Word2Vec(tweets, min_count = 1, vector_size = 500, workers = 3, window = 3, sg = 1)

In [121]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
covid_senti['label_num'] = covid_senti.label.replace(class_dict)
covid_senti

Unnamed: 0,tweet,label,tokenized_tweet,lemmatized_tweet,label_num
0,BREAKING: Kim Jong-un sends condolence letter ...,pos,"[breaking, kim, jong, un, sends, condolence, l...","[breaking, kim, jong, un, sends, condolence, l...",1
1,Coronavirus: Cases rise in South Korea as Aust...,neu,"[coronavirus, cases, rise, in, south, korea, a...","[coronavirus, case, rise, in, south, korea, a,...",2
2,"How dangerous is coronavirus really, when are ...",neu,"[how, dangerous, is, coronavirus, really, when...","[how, dangerous, is, coronavirus, really, when...",2
3,@Dr_psychiatry Make a mark and also coronaviru...,neu,"[drpsychiatry, make, mark, and, also, coronavi...","[drpsychiatry, make, mark, and, also, coronavi...",2
4,As #Coronavirus positive cases continues to ri...,neu,"[as, coronavirus, positive, cases, continues, ...","[a, coronavirus, positive, case, continues, to...",2
...,...,...,...,...,...
29995,@C_Racing48 The flu has a 2% death rate.. the ...,neu,"[cracing, the, flu, has, death, rate, the, cor...","[cracing, the, flu, ha, death, rate, the, coro...",2
29996,@realDonaldTrump We already know that but you‚...,neg,"[realdonaldtrump, we, already, know, that, but...","[realdonaldtrump, we, already, know, that, but...",0
29997,First coronavirus case reported in St. Joseph ...,neu,"[first, coronavirus, case, reported, in, st, j...","[first, coronavirus, case, reported, in, st, j...",2
29998,"If you ate ants when you were a child, you‚Äôr...",neu,"[if, you, ate, ants, when, you, were, child, y...","[if, you, ate, ant, when, you, were, child, yo...",2


In [82]:
max_len = covid_senti['lemmatized_tweet'].map(len).max()
def make_word_2_vec(sentence):
    padding_idx = w2v_model.wv.key_to_index['pad']
    padded_X = [padding_idx for i in range(max_len)]
    i = 0
    for word in sentence:
        if word not in w2v_model.wv.key_to_index:
            padded_X[i] = 0
            print(word)
        else:
            padded_X[i] = w2v_model.wv.key_to_index[word]
        i += 1
    return torch.tensor(padded_X, dtype=torch.long).view(1, -1)

In [83]:
EMBEDDING_SIZE = 500
NUM_FILTERS = 10

class CnnTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, window_sizes=(1,2,3,5)):
        super(CnnTextClassifier, self).__init__()
        weights = w2v_model.wv
        # With pretrained embeddings
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights.vectors),
                                                      padding_idx=w2v_model.wv.key_to_index['pad'])
        # Without pretrained embeddings
        # self.embedding = nn.Embedding(vocab_size, EMBEDDING_SIZE)

        self.convs = nn.ModuleList([
                                   nn.Conv2d(1, NUM_FILTERS, [window_size, EMBEDDING_SIZE],
                                             padding=(window_size - 1, 0))
                                   for window_size in window_sizes
        ])

        self.fc = nn.Linear(NUM_FILTERS * len(window_sizes), num_classes)

    def forward(self, x):
        x = self.embedding(x)

        # Apply a convolution + max_pool layer for each window size
        x = torch.unsqueeze(x, 1)
        xs = []
        for conv in self.convs:
            x2 = torch.tanh(conv(x))
            x2 = torch.squeeze(x2, -1)
            x2 = F.max_pool1d(x2, x2.size(2))
            xs.append(x2)
        x = torch.cat(xs, 2)

        # FC
        x = x.view(x.size(0), -1)
        logits = self.fc(x)

        probs = F.softmax(logits, dim = 1)

        return probs

In [84]:
NUM_CLASSES = 3
VOCAB_SIZE = len(w2v_model.wv.key_to_index)

cnn_model = CnnTextClassifier(vocab_size=VOCAB_SIZE, num_classes=NUM_CLASSES)
# cnn_model.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)
num_epochs = 3

# Open the file for writing loss
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}
loss_file_name = 'cnn_class_big_loss_with_padding.csv'
losses = []
cnn_model.train()
for epoch in range(num_epochs):
    start_time = time.time()
    print("Epoch " + str(epoch + 1))
    train_loss = 0
    for index, row in covid_senti_train.iterrows():
        # Clearing the accumulated gradients
        cnn_model.zero_grad()

        # Make the bag of words vector for stemmed tokens 
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
       
        # Forward pass to get output
        probs = cnn_model(bow_vec)

        # Get the target label
        target = torch.tensor([class_dict[row['label']]], dtype=torch.long)

        # Calculate Loss: softmax --> cross entropy loss
        loss = loss_function(probs, target)
        train_loss += loss.item()

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()


    # if index == 0:
    #     continue
    print("Epoch completed in: %.4f seconds" % (time.time()-start_time))
    print(str((epoch+1)) + "," + str(train_loss / len(covid_senti_train)))
    print('\n')
    train_loss = 0

torch.save(cnn_model, 'cnn_big_model_500_with_paddingA.pth')

Epoch 1
Epoch completed in: 69.4599 seconds
1,0.8216448113270761


Epoch 2
Epoch completed in: 69.1323 seconds
2,0.8212536107557186


Epoch 3
Epoch completed in: 69.1177 seconds
3,0.8212536107557186




In [85]:
from sklearn.metrics import classification_report
predictions = []
correct = []
cnn_model.eval()

with torch.no_grad():
    results = defaultdict(dict)
    for index, row in covid_senti_test.iterrows():
        bow_vec = make_word_2_vec(row['lemmatized_tweet'])
        probs = cnn_model(bow_vec)
        correct.append(class_dict[row['label']])
        _, predicted = torch.max(probs.data, 1)
        predictions.append(predicted.numpy()[0])
target_names = ['neg', 'pos', 'neu']
print(classification_report(predictions, correct, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.00      0.00      0.00         0
         pos       0.00      0.00      0.00         0
         neu       1.00      0.74      0.85      5983

    accuracy                           0.74      5983
   macro avg       0.33      0.25      0.28      5983
weighted avg       1.00      0.74      0.85      5983



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## BERT

In [129]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(covid_senti.index.values, 
                                                    covid_senti.label.values, test_size=0.2,
                                                   stratify=covid_senti.label.values)

In [130]:
class_dict = {'neg': 0, 'pos': 1, 'neu': 2}

In [131]:
covid_senti['data_type'] = ['not_set'] * covid_senti.shape[0]

In [132]:
covid_senti.loc[X_train, 'data_type'] = 'train'
covid_senti.loc[X_test, 'data_type'] = 'test'

In [133]:
from transformers import BertTokenizer
import torch

torch.cuda.empty_cache()

In [91]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [92]:
encoded_data_train = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = tokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



In [94]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

In [95]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [96]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

In [97]:
train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)

test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

In [98]:
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [100]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [103]:
model = model.cuda()
loss_total = 0
model.train()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
torch.save(model.state_dict(), 'modelC.pth')

Epoch: 1 / 3, Step: 0 / 750 Loss: 0.6443
Epoch: 1 / 3, Step: 1 / 750 Loss: 0.6138
Epoch: 1 / 3, Step: 2 / 750 Loss: 0.6902
Epoch: 1 / 3, Step: 3 / 750 Loss: 0.7266
Epoch: 1 / 3, Step: 4 / 750 Loss: 0.7678
Epoch: 1 / 3, Step: 5 / 750 Loss: 0.8211
Epoch: 1 / 3, Step: 6 / 750 Loss: 0.7691
Epoch: 1 / 3, Step: 7 / 750 Loss: 0.7659
Epoch: 1 / 3, Step: 8 / 750 Loss: 0.9845
Epoch: 1 / 3, Step: 9 / 750 Loss: 0.9286
Epoch: 1 / 3, Step: 10 / 750 Loss: 1.0161
Epoch: 1 / 3, Step: 11 / 750 Loss: 0.6764
Epoch: 1 / 3, Step: 12 / 750 Loss: 0.6803
Epoch: 1 / 3, Step: 13 / 750 Loss: 0.5819
Epoch: 1 / 3, Step: 14 / 750 Loss: 0.5408
Epoch: 1 / 3, Step: 15 / 750 Loss: 0.7032
Epoch: 1 / 3, Step: 16 / 750 Loss: 0.7146
Epoch: 1 / 3, Step: 17 / 750 Loss: 0.7520
Epoch: 1 / 3, Step: 18 / 750 Loss: 0.6804
Epoch: 1 / 3, Step: 19 / 750 Loss: 0.7841
Epoch: 1 / 3, Step: 20 / 750 Loss: 1.0411
Epoch: 1 / 3, Step: 21 / 750 Loss: 0.8513
Epoch: 1 / 3, Step: 22 / 750 Loss: 0.9568
Epoch: 1 / 3, Step: 23 / 750 Loss: 0.8434
Ep

In [104]:
model.load_state_dict(torch.load('modelC.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)

In [105]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [106]:
from sklearn.metrics import f1_score
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()

In [107]:
from sklearn.metrics import classification_report
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.86      0.91      0.89      1156
         pos       0.81      0.84      0.82       456
         neu       0.96      0.94      0.95      4388

    accuracy                           0.93      6000
   macro avg       0.88      0.90      0.89      6000
weighted avg       0.93      0.93      0.93      6000



# DistilBERT

In [135]:
from transformers import DistilBertConfig,DistilBertTokenizer,DistilBertModel
distil_berttokenizer = DistilBertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizer'.


In [136]:
encoded_data_train = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='train'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')

encoded_data_test = distil_berttokenizer.batch_encode_plus(covid_senti[covid_senti.data_type=='test'].tweet.values, add_special_tokens=True,
                                                return_attention_mask=True, padding=True,
                                                max_length=512, return_tensors='pt')



In [137]:
#train set
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(covid_senti[covid_senti.data_type == 'train'].label_num.values)

#validation set
input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(covid_senti[covid_senti.data_type == 'test'].label_num.values)

In [138]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(class_dict),
                                                     output_attentions = False,
                                                      output_hidden_states = False)

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['bert.encoder.layer.2.attention.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.5.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.output.dense.bias', 'bert.encoder.layer.8.output.dense.weight', 'bert.encoder.layer.3.attention.output.dense.bias', 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.1.output.dense.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.intermediate.dense.weight', 'bert.encoder.layer.10.attention.output.LayerNorm.weight', 'bert.encoder.layer.4.attention.output.dense.bias', 'bert.encoder.layer.1.attention.self.key.bias', 'bert.e

In [139]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from torch import nn, optim

train_dataset = TensorDataset(input_ids_train, attention_masks_train,labels_train)
test_dataset = TensorDataset(input_ids_test, attention_masks_test,labels_test)

train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
test_dataloader = DataLoader(test_dataset, sampler=RandomSampler(test_dataset), batch_size=32)

In [140]:
optimizer = optim.AdamW(model.parameters(), lr=1e-5)

In [141]:
model = model.cuda()
loss_total = 0
model.train()
for i in range(3):
    for j, data in enumerate(train_dataloader):
        inputs = {'input_ids': data[0].cuda(), 
                      'attention_mask': data[1].cuda(), 
                      'labels': data[2].cuda()}
        output = model(**inputs)
        loss = output[0]
        optimizer.zero_grad()
        loss_total += loss.item()
        loss.backward()
        optimizer.step()
        print("Epoch: {} / {}, Step: {} / {} Loss: {:.4f}".format(i+1, 3, j, len(train_dataloader),
                                                                      loss))
torch.save(model.state_dict(), 'distilmodelC.pth')

Epoch: 1 / 3, Step: 0 / 750 Loss: 1.3260
Epoch: 1 / 3, Step: 1 / 750 Loss: 0.9201
Epoch: 1 / 3, Step: 2 / 750 Loss: 0.6835
Epoch: 1 / 3, Step: 3 / 750 Loss: 0.8440
Epoch: 1 / 3, Step: 4 / 750 Loss: 0.6767
Epoch: 1 / 3, Step: 5 / 750 Loss: 0.6434
Epoch: 1 / 3, Step: 6 / 750 Loss: 0.8689
Epoch: 1 / 3, Step: 7 / 750 Loss: 0.7314
Epoch: 1 / 3, Step: 8 / 750 Loss: 0.6250
Epoch: 1 / 3, Step: 9 / 750 Loss: 0.7379
Epoch: 1 / 3, Step: 10 / 750 Loss: 0.8152
Epoch: 1 / 3, Step: 11 / 750 Loss: 0.8038
Epoch: 1 / 3, Step: 12 / 750 Loss: 0.5505
Epoch: 1 / 3, Step: 13 / 750 Loss: 0.5375
Epoch: 1 / 3, Step: 14 / 750 Loss: 0.7962
Epoch: 1 / 3, Step: 15 / 750 Loss: 0.9293
Epoch: 1 / 3, Step: 16 / 750 Loss: 0.5782
Epoch: 1 / 3, Step: 17 / 750 Loss: 0.8427
Epoch: 1 / 3, Step: 18 / 750 Loss: 0.6130
Epoch: 1 / 3, Step: 19 / 750 Loss: 0.6882
Epoch: 1 / 3, Step: 20 / 750 Loss: 0.7653
Epoch: 1 / 3, Step: 21 / 750 Loss: 0.8577
Epoch: 1 / 3, Step: 22 / 750 Loss: 0.7319
Epoch: 1 / 3, Step: 23 / 750 Loss: 0.6615
Ep

In [142]:
model.load_state_dict(torch.load('distilmodelC.pth'))
model = model.cuda()
model.eval()
loss_total = 0
results = defaultdict(dict)
predictions, true_vals = [], []
for j, data in enumerate(test_dataloader):
    inputs = {'input_ids': data[0].cuda(), 
              'attention_mask': data[1].cuda(), 
              'labels': data[2].cuda()}
    with torch.no_grad():
        output = model(**inputs)
    loss = output[0]
    logits = output[1]
    logits = logits.detach().cpu().numpy()
    labels = inputs['labels'].cpu().numpy()
    loss_total += loss.item()
    predictions.append(logits)
    true_vals.append(labels)

In [143]:
predictions = np.concatenate(predictions, axis=0)
true_vals = np.concatenate(true_vals, axis=0)

In [144]:
from sklearn.metrics import classification_report
preds_flat = np.argmax(predictions, axis = 1).flatten()
labels_flat = true_vals.flatten()
target_names = ['neg', 'pos', 'neu']
print(classification_report(labels_flat, preds_flat, target_names=target_names))

              precision    recall  f1-score   support

         neg       0.80      0.81      0.80      1156
         pos       0.58      0.81      0.68       456
         neu       0.93      0.89      0.91      4388

    accuracy                           0.87      6000
   macro avg       0.77      0.84      0.80      6000
weighted avg       0.88      0.87      0.87      6000



# Results and Discussion

Let's compare the results on the entire dataset:

| Model | Accuracy |
|-------|----------|
|Naive Bayes| 75% |
|Logistic Regression| 7% |
|CNN | 75% |
|BERT | 97% |
|DistilBERT| 91% |

This shows that BERT performs the best of the 5 models for sentiment analysis of the COVID-19 tweets.

But, to get a good sense of the performance, it is necessary to check the per class precision obtained by each model.



|    Model   | neu | pos | neg |
|-------|-----------|------|------|
|Naive Bayes | 75% | 0% | 0% |
|Logistic Regression | 0% | 7% | 0% |
|CNN | 100% | 0% | 0% |
|BERT | 98% | 91% | 93% |
|DistilBERT | 94% | 68% | 85% |

The results show that BERT performs the best with respect to per-class precision as well. 

The results even demonstrate that even though DistilBERT is a "lighter" version of BERT, it does compromise with accuracy and precision for our dataset. Also, the BERT model will perform the best for sentiment analysis.

Now, to discuss the results on the performance of the models on Parts A, B, C:


Part A:

| Model | Accuracy |
|-------|----------|
|Naive Bayes| 76% |
|Logistic Regression| 6% |
|CNN | 76% |
|BERT | 94% |
|DistilBERT| 89% |

This again shows that the BERT model performs well also on the types of tweets related to government actions on COVID.

Part B:

| Model | Accuracy |
|-------|----------|
|Naive Bayes| 75% |
|Logistic Regression| 6% |
|CNN | 75% |
|BERT | 94% |
|DistilBERT| 88% |

This again shows that the BERT model performs well also on the types of tweets related COVID-19 crises, social distancing, lockdown, and stay at home.

Part C:

| Model | Accuracy |
|-------|----------|
|Naive Bayes| 74% |
|Logistic Regression| 7% |
|CNN | 74% |
|BERT | 93% |
|DistilBERT| 87% |

This again shows that the BERT model performs well also on the types of tweets related to COVID-19 cases, outbreak, and stay at home.

Thus, it can be concluded that BERT is the best model to do sentiment analysis of COVID-19 tweets. Also, BERT performs slightly less accurate with repsect to tweets related to COVID-19 cases, outbreak and stay at home. But, the difference is not significant than that of parts A and B. <br>
Also, DistilBERT, though claimed to be equivalent to BERT gives a significant low performance than BERT for our task. Hence, the use of BERT for sentiment analysis is important as it significantly imrpoves performance over DistilBERT.

The results given in [1] on distilBERT and BERT are as follows:

| Models/Dataset | COVIDSenti-A | COVIDSenti-B | COVIDSenti-C | COVIDSenti |
|----------------|--------------|--------------|--------------|------------|
|distilBERT | 93.7% | 92.9% | 92.6% | 93.9% |
| BERT | 94.1% | 93.7% | 93.2% | 94.8% |

The results obtained by finetuning these both models in this project are:

| Models/Dataset | COVIDSenti-A | COVIDSenti-B | COVIDSenti-C | COVIDSenti |
|----------------|--------------|--------------|--------------|------------|
|distilBERT | 89% | 88% | 93% | 91% |
| BERT | 94% | 94% | 87% | 97% |

Thus, it can be seen that the BERT model fine tuned in this project above outperforms the one with the given in the original paper. On top of that, we have also tested the model on Naive Bayes, Logistic Regression and CNN. These models are not tested in [1]. Naive Bayes and CNN also achieve a satisfactory accuracy on all the datasets.

The main reason for BERT's high accuracy is its ability to capture contextual word representation which the other models can't do. These results align with the previous studies that state that BERT outperforms other methods like TF-IDF or Word2Vec for any NLP task.

## References:
[1] U. Naseem, I. Razzak, M. Khushi, P. W. Eklund and J. Kim, "COVIDSenti: A Large-Scale Benchmark Twitter Data Set for COVID-19 Sentiment Analysis," in IEEE Transactions on Computational Social Systems, vol. 8, no. 4, pp. 1003-1015, Aug. 2021, doi: 10.1109/TCSS.2021.3051189. <br>
[2] https://github.com/usmaann/COVIDSenti <br>
[3] https://huggingface.co/docs/transformers/model_doc/distilbert <br>
[4] https://medium.com/analytics-vidhya/bert-the-theory-you-need-to-know-ddd316794395 <br>