### The Sentiment Classification involves the following tasks:

Step 1: Quick Overview of the Data Structure

Step 2: Text Preprocessing

Step 3: Prepare Dataset (Training, Validation, and Test Data)

Step 4: Model Selection 

Step 5: Model Fine-Tuning

Step 6: Evaluate Models and Feature Importance on the Test Data

Import libraries

In [2]:
import os
import pandas as pd
import numpy as np
import re

%matplotlib inline 
import matplotlib.pyplot as plt

Import the dataset 

In [3]:
df = pd.read_csv('data/ps5_tweets_text.csv')
labels=pd.read_csv('data/ps5_tweets_labels_as_numbers.csv')

### Step 1: Quick Overview of the Data Structure

Display the tweet dataset to get a sense of its shape. In total, there are around 37K data points.

In [4]:
df

Unnamed: 0,Id,Tweet
0,0,https://t.co/UpjxfOgQs8\r\r\n\r\r\nGaisss! Ple...
1,1,@mygovindia Today just after a week of lockdow...
2,2,Tuskys partners with Amref to provide on groun...
3,3,@chrissyteigen are u doing ur own grocery shop...
4,4,UK Critical Care Nurse Cries at Empty SuperMar...
...,...,...
37036,37036,Minnesota classifies grocery store workers as ...
37037,37037,US Senator @ewarren has asked for information ...
37038,37038,Just commented on @thejournal_ie: Poll: Are yo...
37039,37039,My wife got laid off yesterday because the sma...


Print the data labels

In [5]:
labels

Unnamed: 0,Id,Label
0,0,4
1,1,1
2,2,2
3,3,1
4,4,0
...,...,...
37036,37036,1
37037,37037,1
37038,37038,0
37039,37039,2


Check for the frequency and percentage of the five sentiment classes

Extremely Negative (0), Negative (1), Neutral (2), Positive (3), Extremely Positive (4).

Based on the distribution, the dataset is relatively balanced: Classes 1 and 3 each account for roughly 24% to 28% of the dataset, and the remaining classes range from about 13% to 19%. Because of this relatively balanced distribution, I did not apply any data augmentation method to further balance the dataset.

In [6]:
from collections import Counter
count_label = Counter(labels.Label).items()
percentages_labels = {x: float(z) / len(labels.Label) * 100 for x, z in count_label}
Counter(labels.Label),percentages_labels

(Counter({4: 5953, 1: 8930, 2: 6930, 0: 4946, 3: 10282}),
 {4: 16.071380362301234,
  1: 24.108420399017305,
  2: 18.70899813719932,
  0: 13.35277125347588,
  3: 27.758429848006262})

### Step 2: Text Preprocessing

Because the tokenizer can assign a different number to the same word each time it is fitted, I performed text preprocessing before the data split so that every unique word is encoded with the same number in all subsets.

Define x and y of training samples and labels. 

In [7]:
x=df.Tweet
y=labels.Label

Copy the original tweets so they remain available for later processing (Geron, 2019).

In [8]:
x_copy=x.copy()

#### Text Cleaning

Remove mentions, hashtags, foreign characters, and URLs from the original tweets.

Mentions are removed because mentioning or retweeting other users does not contribute to the sentiment of a tweet. Though they might convey some sentimental meaning, hashtags are removed because some hashtags are combinations of multiple words, which can lead to noise in clean text.

In [9]:
texts=[]

for i in x:
    # Raw string so that \S and \t are passed to the regex engine unchanged
    text=' '.join(re.sub(r"([@][A-Za-z0-9]+)|([#][A-Za-z0-9]+)|([#][A-Za-z0-9|^0-9A-Za-z \th]+)|http\S+"," ",i).split())
    texts.append(text)
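
To make the effect of the cleaning pattern concrete, here is a minimal sketch that applies the same regular expression to two made-up tweets (the sample strings and the `clean_tweet` helper name are assumptions, not part of the notebook). It also shows one limitation: the mention pattern only matches letters and digits, so a handle containing an underscore is only partly removed.

```python
import re

# Same pattern as the cleaning cell above, written as a raw string.
PATTERN = re.compile(
    r"([@][A-Za-z0-9]+)|([#][A-Za-z0-9]+)|([#][A-Za-z0-9|^0-9A-Za-z \th]+)|http\S+"
)

def clean_tweet(tweet):
    """Replace mentions, hashtags, and URLs with spaces, then collapse whitespace."""
    return " ".join(PATTERN.sub(" ", tweet).split())

# Hypothetical sample tweets, not taken from the dataset.
sample = "@mygovindia Today #COVID19 stores https://t.co/abc closed"
print(clean_tweet(sample))              # Today stores closed
print(clean_tweet("@some_user hello"))  # _user hello
```

If handles with underscores matter, one option would be to widen the mention class (for example `[@]\w+`), but that changes the cleaned text and therefore the counts reported below.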

Print the length of cleaned texts to confirm all tweets have been converted.

In [10]:
len(texts)

37041

Illustrate the comparison between original and cleaned tweets

In [11]:
for i in range(6):
    print(i)
    print("a) Cleaned: "+texts[i])
    print("b) Original: "+x_copy[i])
    print('')

0
a) Cleaned: Gaisss! Please read this,and please limit yourself to go outside and please,please..always wash your hands,always use the hand sanitizer. And please get ready to stock up the food.
b) Original: https://t.co/UpjxfOgQs8

Gaisss! Please read this,and please limit yourself to go outside and please,please..always wash your hands,always use the hand sanitizer. 

And please get ready to stock up the food.

1
a) Cleaned: Today just after a week of lockdown lot of confectionary stores are running out of stock, how will be the seen if lockdown increased because of COVID-19 community spread, specially in B &amp; C class city. Emergency Supply chain need to be pla
b) Original: @mygovindia Today just after a week of lockdown lot of confectionary stores are running out of stock, how will be the seen if lockdown increased because of COVID-19 community spread, specially in B &amp; C class city. Emergency Supply chain need to be pla

2
a) Cleaned: Tuskys partners with Amref to pro

Check the number of unique words. In total, there are 77,378 unique words in this sentiment analysis task (DSCI 552 Lecture, 2021).

In [12]:
x=' '.join(texts)
words = x.split()
len(set(words)),words[:20]

(77378,
 ['Gaisss!',
  'Please',
  'read',
  'this,and',
  'please',
  'limit',
  'yourself',
  'to',
  'go',
  'outside',
  'and',
  'please,please..always',
  'wash',
  'your',
  'hands,always',
  'use',
  'the',
  'hand',
  'sanitizer.',
  'And'])

Illustrate the frequency of the first 10 words (DSCI 552 Lecture, 2021)

In [13]:
counter=Counter(words)
list(counter.items())[:10]

[('Gaisss!', 1),
 ('Please', 707),
 ('read', 169),
 ('this,and', 1),
 ('please', 587),
 ('limit', 186),
 ('yourself', 192),
 ('to', 33693),
 ('go', 1802),
 ('outside', 277)]

I defined and generated a popular word list by collecting the words that appear more than 20 times in the entire dataset (DSCI 552 Lecture, 2021).

In [14]:
popular_words = set()
for w in set(words):
    if counter[w]>20:
        popular_words.add(w)
len(popular_words)

4277
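
The frequency filter can be illustrated on a toy corpus. This is a small sketch with a hypothetical helper name (`frequent_words`) and made-up tokens; it uses the same strict `>` comparison as the cell above, so a word that appears exactly as many times as the threshold is excluded.

```python
from collections import Counter

def frequent_words(tokens, min_count_exclusive):
    """Return the set of tokens whose count is strictly greater than the threshold."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c > min_count_exclusive}

# Toy corpus (hypothetical): "to" appears 3 times, "go" twice, "outside" once.
toy_tokens = "to go to go to outside".split()
print(sorted(frequent_words(toy_tokens, 2)))  # ['to']
print(sorted(frequent_words(toy_tokens, 1)))  # ['go', 'to']
```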

Remove all non-popular words from the documents

In [15]:
from tensorflow.keras.preprocessing.text import Tokenizer

remove_not_popular_words = lambda texts: [t for t in texts.split() if t in popular_words]
text = [remove_not_popular_words(t) for t in texts]


I removed all non-popular words from the entire tweet data, then joined the remaining tokens of each tweet back into a single string for sequence transformation.

In [16]:
join_text=[]
for i in text:
    join_text.append(' '.join(i))

Compare the tweet before and after unpopular word removal

In [17]:
join_text[0],texts[0]

('Please read please limit yourself to go outside and wash your use the hand sanitizer. And please get ready to stock up the food.',
 'Gaisss! Please read this,and please limit yourself to go outside and please,please..always wash your hands,always use the hand sanitizer. And please get ready to stock up the food.')

Set the number of words kept by the tokenizer to the number of popular words (4,277). Tokenize the tweets after the unpopular word removal and convert them to integer sequences (DSCI 552 Lecture, 2021).

In [18]:
tokenizer = Tokenizer(num_words=4277)
tokenizer.fit_on_texts(join_text)
sequences = tokenizer.texts_to_sequences(join_text)

Show the integer sequence of the first tweet

In [19]:
sequences[0]

[94,
 198,
 94,
 570,
 494,
 2,
 75,
 406,
 3,
 399,
 33,
 153,
 1,
 81,
 82,
 3,
 94,
 57,
 801,
 2,
 84,
 45,
 1,
 19]
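
The sequence above repeats the value 94 wherever "please" occurs, because the tokenizer maps each word to a fixed integer. The sketch below is a simplified, pure-Python approximation of a frequency-ranked word index, not the Keras implementation: it lowercases the text but skips the punctuation filtering that the Keras `Tokenizer` applies by default, and the helper names and toy texts are assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words=None):
    """Encode a text as word ranks, skipping unknown words and ranks >= num_words."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and (num_words is None or idx < num_words):
            seq.append(idx)
    return seq

toy_texts = ["please wash your hands", "please stay home", "please wash"]
word_index = build_word_index(toy_texts)
print(word_index["please"])                                      # 1
print(to_sequence("please wash hands", word_index))              # [1, 2, 4]
print(to_sequence("please wash hands", word_index, num_words=3))  # [1, 2]
```

The last call shows why the `num_words` limit matters: only ranks below the limit are kept, so a limit of 4277 retains at most 4276 distinct word indices.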

### Step 3: Prepare Dataset (Training and Test Data)

Assign sequences to x

In [20]:
x=sequences

Use the train_test_split method to split the data into training and test datasets at a ratio of 8:2. I used 80% of the data for training to keep as many data points as possible for model fitting.


In [21]:
from sklearn import model_selection

train_x, test_x, train_y, test_y = model_selection.train_test_split(x, y, test_size=0.2, random_state=42)
len(train_x),len(test_x)

(29632, 7409)

Verify the percentage distribution of training and test data.

The class percentages in the training and test datasets are close to those of the original tweet dataset, so the random split preserved the label distribution.

In [22]:
count_train_y = Counter(train_y).items()
percentages_train = {x: float(z) / len(train_y) * 100 for x, z in count_train_y}
count_test_y = Counter(test_y).items()
percentages_test = {x: float(z) / len(test_y) * 100 for x, z in count_test_y}
Counter(train_y),percentages_train,Counter(test_y),percentages_test

(Counter({2: 5498, 4: 4750, 0: 3971, 3: 8253, 1: 7160}),
 {2: 18.5542656587473,
  4: 16.029967602591793,
  0: 13.401052915766739,
  3: 27.85164686825054,
  1: 24.16306695464363},
 Counter({4: 1203, 2: 1432, 3: 2029, 0: 975, 1: 1770}),
 {4: 16.237009043055743,
  2: 19.32784451342961,
  3: 27.38561209339992,
  0: 13.159670670805776,
  1: 23.88986367930895})
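
The split above is purely random and happens to preserve the class mix. If an exact match were required, a stratified split is one option (scikit-learn's `train_test_split` accepts a `stratify` argument for this). The following is a minimal pure-Python sketch of the idea under assumed names and toy labels; it is not what the notebook ran.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within each class
        n_test = round(len(idxs) * test_size)   # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 10%, 20%, and 70% of 100 samples.
toy_labels = [0] * 10 + [1] * 20 + [2] * 70
train_idx, test_idx = stratified_split(toy_labels)
print(Counter(toy_labels[i] for i in test_idx))
```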

### Step 4: Model Selection

#### 1) Naive Bayes as Baseline

Before evaluating the performance of more sophisticated word embedding methods, I used Naive Bayes as the baseline model for setting the benchmark. 

Common practice is to split the data into training, validation, and test datasets for model fitting, selection, and evaluation. However, because of the constraint of text preprocessing (all tweets must be preprocessed and encoded as number sequences before the data split), I performed a separate data split specifically for the Naive Bayes models and used one-hot encoding to convert their training and test data to vector format.

Use the train_test_split method to split the data into training and test datasets for Naive Bayes at a ratio of 8:2. The same random_state is used, so the rows match the earlier split.


In [23]:
train_x_n, test_x_n, train_y_n, test_y_n = model_selection.train_test_split(join_text, y, test_size=0.2, random_state=42)
len(train_x_n),len(test_x_n)

(29632, 7409)

Perform one hot encoding on the training data 

In [24]:
train_hot = tokenizer.texts_to_matrix(train_x_n, mode = 'binary')
train_hot

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Perform one hot encoding on the test data 

In [25]:
test_hot = tokenizer.texts_to_matrix(test_x_n, mode = 'binary')
test_hot

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])
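
The matrices above have one row per tweet and one column per word index, with a 1 wherever the word occurs at least once. A pure-Python sketch of that binary encoding (an approximation with an assumed helper name and toy sequences, not the Keras source) looks like this:

```python
def binary_matrix(sequences, num_words):
    """One row per document; column j is 1.0 if word index j occurs in the document."""
    rows = []
    for seq in sequences:
        row = [0.0] * num_words
        for idx in seq:
            if 0 < idx < num_words:   # index 0 is reserved and never assigned to a word
                row[idx] = 1.0
        rows.append(row)
    return rows

# Toy integer sequences (hypothetical), as produced by a fitted tokenizer.
toy_sequences = [[1, 2, 2], [3], []]
for row in binary_matrix(toy_sequences, num_words=5):
    print(row)
# [0.0, 1.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 1.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.0]
```

Since word indices start at 1, column 0 stays at zero, which matches the leading 0 in every row of the arrays printed above.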

Five Naive Bayes models (Multinomial, Gaussian, Bernoulli, Complement, and Categorical) were compared by their accuracy on the test data. Among the five, Multinomial performs slightly better than the other four, though none of the models reaches an accuracy above 50%.

Therefore, the Multinomial model, which has the best accuracy among the Naive Bayes models (about 47.4%), was selected as the benchmark for the more sophisticated word embedding models in the next section.

Multinomial Model Performance

In [26]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

MNB = MultinomialNB()
MNB.fit(train_hot,train_y_n)

predicted = MNB.predict(test_hot)
accuracy = metrics.accuracy_score(predicted,test_y_n)
accuracy,Counter(predicted)

(0.47415305709272504, Counter({4: 1220, 2: 1414, 3: 2167, 1: 1819, 0: 789}))

Gaussian Model Performance

In [27]:
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB()
GNB.fit(train_hot,train_y_n)

predicted = GNB.predict(test_hot)
accuracy = metrics.accuracy_score(predicted,test_y_n)
accuracy,Counter(predicted)

(0.3322985558105007, Counter({4: 1535, 2: 3627, 0: 1773, 1: 262, 3: 212}))

Bernoulli Model Performance

In [28]:
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()
BNB.fit(train_hot,train_y_n)

predicted = BNB.predict(test_hot)
accuracy = metrics.accuracy_score(predicted,test_y_n)
accuracy,Counter(predicted)

(0.44999325145093805, Counter({4: 1248, 2: 2148, 1: 1412, 0: 893, 3: 1708}))

Complement Model Performance

In [29]:
from sklearn.naive_bayes import ComplementNB

CNB = ComplementNB()
CNB.fit(train_hot,train_y_n)

predicted = CNB.predict(test_hot)
accuracy = metrics.accuracy_score(predicted,test_y_n)
accuracy,Counter(predicted)

(0.43838574706438116, Counter({4: 2067, 2: 1585, 3: 957, 1: 1272, 0: 1528}))

Categorical Model Performance

In [30]:
from sklearn.naive_bayes import CategoricalNB

CANB = CategoricalNB()
CANB.fit(train_hot,train_y_n)

predicted = CANB.predict(test_hot)
accuracy = metrics.accuracy_score(predicted,test_y_n)
accuracy,Counter(predicted)

(0.44877851261978674, Counter({4: 1307, 2: 2180, 1: 1354, 0: 960, 3: 1608}))
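
For context on the accuracies above, it can help to compare them with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below reuses the test-set label counts printed by the split check earlier; the variable names are assumptions.

```python
from collections import Counter

# Test-set label counts as reported by the split check earlier in the notebook.
test_counts = Counter({4: 1203, 2: 1432, 3: 2029, 0: 975, 1: 1770})

majority_class, majority_count = test_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(test_counts.values())
print(majority_class, round(baseline_accuracy, 4))  # 3 0.2739
```

Always predicting class 3 would score about 27.4%, so the Naive Bayes models are well above chance even though they stay below 50%.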

#### 2) Word Embedding

Since none of the Naive Bayes models produces an accuracy above 50%, I implemented more sophisticated models, namely word embeddings and neural networks, for the sentiment classifier.

For the word embedding method, I needed to pad the shorter tweet sequences and truncate the longer ones so that all inputs have the same shape for the neural networks. I set the maximum sequence length to 50 so that it captures most of the content of the cleaned tweets (DSCI 552 Lecture, 2021).

In [31]:
from tensorflow.keras import preprocessing

maxlen = 50

train_xpad = preprocessing.sequence.pad_sequences(train_x, maxlen = maxlen, padding = 'pre', truncating = 'pre')
test_xpad = preprocessing.sequence.pad_sequences(test_x, maxlen = maxlen, padding = 'pre', truncating = 'pre')

Illustrate a sequence after padding/truncating

In [32]:
print(train_xpad[0].tolist())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 150, 19, 18, 509, 16, 50, 2359, 440, 2, 25, 11, 7, 14, 460, 1815, 163, 4, 1, 2359, 2, 322, 3, 2369, 30, 460, 346, 374, 648, 1063, 103, 1, 22, 3, 510, 224]
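
With `padding='pre'` and `truncating='pre'`, zeros are added at the front of short sequences and long sequences lose their earliest tokens. A minimal pure-Python sketch of that behaviour (the `pad_pre` helper is an assumed name, not a Keras function):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 6, 7], maxlen=5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```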


Import all tensorflow packages

In [39]:
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding, Conv1D, MaxPooling1D, GlobalAveragePooling1D

##### i) One-layer Embedding Model 

I will start with a simple one-layer embedding model. In this model, there is only one embedding, one flatten and one dense layer to generate five classes using the softmax activation. 

In [34]:
onelayer = Sequential([
    Embedding(input_dim=4277, output_dim=6, input_length=maxlen),
    Flatten(),
    Dense(5, activation='softmax')
])

onelayer.compile(optimizer='adam',loss='sparse_categorical_crossentropy', metrics=['accuracy'])
onelayer.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 6)             25662     
_________________________________________________________________
flatten (Flatten)            (None, 300)               0         
_________________________________________________________________
dense (Dense)                (None, 5)                 1505      
Total params: 27,167
Trainable params: 27,167
Non-trainable params: 0
_________________________________________________________________
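
The parameter counts in the summary can be reproduced by hand, which is a quick check that the layer sizes are what was intended. The variable names below are assumptions; the numbers come from the model definition.

```python
vocab_size, embed_dim, seq_len, n_classes = 4277, 6, 50, 5

embedding_params = vocab_size * embed_dim             # one 6-d vector per word index
flatten_units = seq_len * embed_dim                   # Flatten has no weights
dense_params = flatten_units * n_classes + n_classes  # weights plus biases

print(embedding_params, flatten_units, dense_params)  # 25662 300 1505
print(embedding_params + dense_params)                # 27167
```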


Fit the model for 10 epochs, holding out 20% of the training data for validation.

In [35]:
onelayer.fit(train_xpad, train_y, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a45599290>

##### ii) CNN Model 

Inspired by Brownlee (2017), I built a CNN model with one embedding layer, one convolutional layer followed by one max pooling layer, one flatten layer, and two dense layers.

Deep Convolutional Neural Network for Sentiment Analysis (Text Classification): https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

In [36]:
CNN = Sequential()
CNN.add(Embedding(input_dim=4277, output_dim=6, input_length=maxlen))
CNN.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
CNN.add(MaxPooling1D(pool_size=2))
CNN.add(Flatten())
CNN.add(Dense(10, activation='relu'))
CNN.add(Dense(10, activation='softmax'))  # NOTE: 10 output units, but only labels 0-4 occur; Dense(5) would match the five classes

CNN.compile(optimizer='adam',loss='sparse_categorical_crossentropy', metrics=['accuracy'])
CNN.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 6)             25662     
_________________________________________________________________
conv1d (Conv1D)              (None, 43, 32)            1568      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 21, 32)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 672)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                6730      
_________________________________________________________________
dense_2 (Dense)              (None, 10)                110       
Total params: 34,070
Trainable params: 34,070
Non-trainable params: 0
__________________________________________________
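
The output shapes and parameter counts in this summary follow from simple arithmetic on the layer settings. The sketch below recomputes them; `valid_output_len` is an assumed helper name for the standard output-length formula of an unpadded ("valid") convolution or pooling window.

```python
def valid_output_len(input_len, window, stride=1):
    """Output length of a 1-D window operation without padding."""
    return (input_len - window) // stride + 1

seq_len, embed_dim = 50, 6
filters, kernel_size, pool_size = 32, 8, 2

conv_len = valid_output_len(seq_len, kernel_size)                  # 43
conv_params = kernel_size * embed_dim * filters + filters          # 1568
pool_len = valid_output_len(conv_len, pool_size, stride=pool_size)  # 21
flat_units = pool_len * filters                                    # 672
dense_1_params = flat_units * 10 + 10                              # 6730
dense_2_params = 10 * 10 + 10                                      # 110

print(conv_len, conv_params, pool_len, flat_units, dense_1_params, dense_2_params)
```

With a five-unit output layer, the last line of the calculation would be `10 * 5 + 5 = 55` parameters instead of 110.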

Fit the model for 10 epochs, holding out 20% of the training data for validation.

In [37]:
CNN.fit(train_xpad, train_y, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a9273a790>

##### iii) Global Average Pooling Layer Model  

Inspired by TensorFlow Documentation (2021),  I built a neural network model with one embedding layer, one global average pooling layer, and two dense layers. 


In [40]:
GlobalAvg = Sequential([
  Embedding(input_dim=4277, output_dim=6, input_length=maxlen),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(5, activation='softmax')
])

GlobalAvg.compile(optimizer='adam',
              loss="sparse_categorical_crossentropy",
              metrics=['accuracy'])
GlobalAvg.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 6)             25662     
_________________________________________________________________
global_average_pooling1d (Gl (None, 6)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                112       
_________________________________________________________________
dense_4 (Dense)              (None, 5)                 85        
Total params: 25,859
Trainable params: 25,859
Non-trainable params: 0
_________________________________________________________________
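
Global average pooling collapses the (50, 6) embedding output into a single 6-dimensional vector by averaging over the 50 positions, which is why the layer adds no parameters. A pure-Python sketch of the operation on a toy input (helper name and values are assumptions):

```python
def global_average_pool(embedded):
    """Average a (timesteps x features) list of vectors over the time axis."""
    n_steps = len(embedded)
    n_features = len(embedded[0])
    return [sum(step[j] for step in embedded) / n_steps for j in range(n_features)]

# Toy embedded sequence (hypothetical): 3 timesteps, 2 features each.
toy_embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(global_average_pool(toy_embedded))  # [3.0, 4.0]
```

Note that, as the model is defined, the padded positions (index 0) appear to be embedded and averaged together with real words; the Keras `Embedding` layer has a `mask_zero` option that could be tried to exclude them, though its effect here is untested.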


Fit the model for 10 epochs, holding out 20% of the training data for validation.

In [41]:
GlobalAvg.fit(train_xpad, train_y, epochs=10,validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a92e92dd0>

#### 3) Word Embedding - Word2vec approach

Previously, I represented each tweet as a padded integer sequence of length 50 and learned the word embedding inside the network. In this section, I first trained a word2vec model that converts each word into a dense 200-dimension vector, then averaged the word vectors to obtain one vector per tweet, and finally trained a neural network to classify sentiment based on the tweet vectors.

In [42]:
import gensim,logging
from gensim.models import word2vec



Stopword lists usually contain negation words like "not" or "no", and removing those would alter the sentiment of a tweet, so I skipped the stopword removal step during tweet cleaning.

I reuse the list of tokenized tweets generated earlier for word2vec.

In [43]:
text

[['Please',
  'read',
  'please',
  'limit',
  'yourself',
  'to',
  'go',
  'outside',
  'and',
  'wash',
  'your',
  'use',
  'the',
  'hand',
  'sanitizer.',
  'And',
  'please',
  'get',
  'ready',
  'to',
  'stock',
  'up',
  'the',
  'food.'],
 ['Today',
  'just',
  'after',
  'a',
  'week',
  'of',
  'lockdown',
  'lot',
  'of',
  'stores',
  'are',
  'running',
  'out',
  'of',
  'stock,',
  'how',
  'will',
  'be',
  'the',
  'seen',
  'if',
  'lockdown',
  'increased',
  'because',
  'of',
  'COVID-19',
  'community',
  'spread,',
  'in',
  'B',
  '&amp;',
  'C',
  'class',
  'Emergency',
  'Supply',
  'chain',
  'need',
  'to',
  'be'],
 ['partners',
  'with',
  'to',
  'provide',
  'on',
  'ground',
  'health',
  'education',
  'and',
  'awareness',
  'on',
  '19',
  'at',
  'all',
  'Its',
  'supermarket',
  'Kenya'],
 ['are',
  'u',
  'doing',
  'ur',
  'own',
  'grocery',
  'shopping',
  'now',
  'like',
  'a',
  'regular',
  'person',
  'or',
  'are',
  'u',
  'still',


Usually we need training, validation, and test data for model fitting, selection, and evaluation. However, because of the constraint of text preprocessing (all data must be preprocessed and converted to sequences before the data split), we directly perform another data split for the word2vec model using the tokenized tweets.

Use the train_test_split method to split the data into training and test datasets for word2vec at a ratio of 8:2. The same random_state is used, so the rows match the earlier splits.


In [44]:
train_x_w, test_x_w, train_y_w, test_y_w = model_selection.train_test_split(text, y, test_size=0.2, random_state=42)
len(train_x_w),len(test_x_w)

(29632, 7409)

Train the word2vec model using the training data. The line that saves the model locally is commented out.

In [48]:
w2v=word2vec.Word2Vec(sentences=train_x_w,vector_size=200,sg=1,window=10,min_count=3,hs=0)
#w2v.save('w2v.model')

Load the saved model

In [49]:
#w2v = word2vec.Word2Vec.load('w2v.model')

Print the vector for "good"

In [50]:
w2v.wv['good']

array([ 0.09379546, -0.03826694,  0.03124714, -0.14304438, -0.0255729 ,
       -0.15769012,  0.1852837 ,  0.11600946, -0.17804474, -0.09902942,
       -0.20548357, -0.1399364 , -0.03896389,  0.42858398, -0.15012059,
       -0.22436889,  0.17582186,  0.25466058,  0.03259578, -0.4062336 ,
        0.07866448,  0.04566304, -0.12107116, -0.03761733, -0.10486799,
        0.05563427,  0.00830531, -0.20233499, -0.06435529, -0.10289777,
        0.3068847 , -0.04249847,  0.01157409,  0.15313831,  0.14227022,
        0.13534659,  0.16453083, -0.02778324, -0.29743934, -0.01724864,
        0.03093329, -0.1213576 , -0.15015621,  0.01570711,  0.11024246,
        0.17788552,  0.02847718, -0.04729576, -0.03898012,  0.16029099,
        0.07566921, -0.30035648,  0.06925744, -0.08225472,  0.06817754,
       -0.11406031,  0.03125651, -0.0932342 , -0.31181374, -0.02734807,
       -0.25214806, -0.03544705, -0.11818658, -0.12409714, -0.2188806 ,
        0.20601973, -0.10561441,  0.21430221, -0.07681768,  0.14

Show the top 10 most similar words for "happy", "good", and "sad"

In [51]:
w2v.wv.most_similar(positive="happy"),w2v.wv.most_similar(positive="good"),w2v.wv.most_similar(positive="sad")

([('beautiful', 0.8226258754730225),
  ('suppose', 0.8105071783065796),
  ('realized', 0.7914396524429321),
  ('dinner', 0.7861936092376709),
  ('cool', 0.7829416394233704),
  ('lucky', 0.782436192035675),
  ('ride', 0.7623853087425232),
  ('thing.', 0.7621414661407471),
  ('much.', 0.7618863582611084),
  ('couldn?t', 0.7607919573783875)],
 [('sad', 0.6383953094482422),
  ('always', 0.6325969099998474),
  ('correct', 0.6322875022888184),
  ('enough.', 0.6310970187187195),
  ('cool', 0.6292913556098938),
  ('super', 0.6291665434837341),
  ('Hopefully', 0.6219711303710938),
  ('nice', 0.6208566427230835),
  ('bad', 0.6181733012199402),
  ('terrible', 0.6171517372131348)],
 [('cool', 0.8600603342056274),
  ('terrible', 0.8371973037719727),
  ('thing.', 0.8154576420783997),
  ('nobody', 0.8067632913589478),
  ('mad', 0.8063436150550842),
  ('is,', 0.8055611252784729),
  ('mentioned', 0.8044517636299133),
  ('idiot', 0.8029739856719971),
  ('surprised', 0.799253523349762),
  ('strange', 0.7
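
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure, using made-up 2-dimensional vectors rather than the trained 200-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

Because word2vec places words that occur in similar contexts close together, antonyms such as "good" and "bad" can score as similar, which may explain why "sad" and "terrible" appear in the neighbours of "good" above.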

Inspired by Besbes (2017), I vectorized each tweet by averaging the word vectors of all in-vocabulary words in the tweet.

Sentiment analysis 👍 👎 on Twitter using Word2vec and Keras: https://www.ahmedbesbes.com/blog/sentiment-analysis-with-keras-and-word-2-vec

In [52]:
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0
    for word in tokens:
        try:
            vec += w2v.wv[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

Transform each tweet into a 200-dimension vector (Besbes, 2017)

In [53]:
wordvec_arrays = np.zeros((len(train_x_w), 200)) 
for i in range(len(train_x_w)):
    wordvec_arrays[i,:] = word_vector(train_x_w[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape

(29632, 200)

Build a neural network with two dense layers for sentiment classification (Besbes, 2017)

In [54]:
WNN = keras.models.Sequential()
WNN.add(keras.layers.Dense(32, activation='relu', input_dim=200))
WNN.add(keras.layers.Dense(5, activation='softmax'))
WNN.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

WNN.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 32)                6432      
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 165       
Total params: 6,597
Trainable params: 6,597
Non-trainable params: 0
_________________________________________________________________


Fit the model for 10 epochs, holding out 20% of the training data for validation.

In [55]:
WNN.fit(wordvec_df,train_y_w, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a932bcbd0>

### Step 5: Model Fine-Tuning

Four variants of the Global Average Pooling model were selected for the parameter fine-tuning step. In the preliminary model training, the Global Average Pooling model performed slightly better than the other models and approaches. Therefore, these variants were mainly tested by adding different regularization methods to reduce the risk of overfitting. The variants are:

(1) Original Global Average Pooling model

(2) Global Average Pooling model with dropout layer

(3) Global Average Pooling model with batch normalization

(4) Original Global Average Pooling model with output dimension adjusted


Similarly, AUC scores were used to compare the performance of each Global Average Pooling model with the baseline Naive Bayes model.

Further split the padded training dataset into training and validation subsets for model fine-tuning.


In [56]:
train_xpad, valid_xpad, train_y, valid_y = model_selection.train_test_split(train_xpad, train_y, test_size=0.2, random_state=42)
len(train_xpad),len(valid_xpad)

(23705, 5927)

##### 1. Original Global Average Pooling model

Perform the original model for classification

In [57]:
GlobalAvg = Sequential([
  Embedding(input_dim=4277, output_dim=6, input_length=maxlen),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(5, activation='softmax')
])

GlobalAvg.compile(optimizer='adam',
              loss="sparse_categorical_crossentropy",
              metrics=['accuracy'])
GlobalAvg.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 50, 6)             25662     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 6)                 0         
_________________________________________________________________
dense_7 (Dense)              (None, 16)                112       
_________________________________________________________________
dense_8 (Dense)              (None, 5)                 85        
Total params: 25,859
Trainable params: 25,859
Non-trainable params: 0
_________________________________________________________________


Fit the original Global Average Pooling model for 10 epochs, using the validation dataset to monitor performance.

In [58]:
GlobalAvg.fit(train_xpad, train_y, epochs=10,validation_data=(valid_xpad,valid_y))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a93718d50>

scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

Based on the scikit-learn documentation (2020), the macro AUC score was selected for comparing model performance. The reason is that 'macro' averaging does not weight classes by their frequency, so the score is a simple unweighted average over the five sentiment classes (with multi_class='ovo', over all pairs of classes).

Calculate the AUC score for the model

In [59]:
from sklearn.metrics import roc_auc_score

pred_y = GlobalAvg.predict(valid_xpad)
roc_auc_score(valid_y, pred_y,average='macro',multi_class='ovo')

0.8569054348233325
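
To show what the macro one-vs-one AUC measures, here is a small pure-Python sketch of the pairwise scheme: for every pair of classes, the AUC is computed in both directions on the samples of those two classes, and the results are averaged without class weights. This is an illustrative reimplementation under the assumption that labels are the integers 0..K-1 and index the probability columns; it is not the scikit-learn source, and the helper names and toy data are made up.

```python
from itertools import combinations

def binary_auc(scores_pos, scores_neg):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def macro_ovo_auc(y_true, y_prob):
    """Unweighted mean of pairwise AUCs over all class pairs."""
    classes = sorted(set(y_true))
    pair_scores = []
    for a, b in combinations(classes, 2):
        a_rows = [p for y, p in zip(y_true, y_prob) if y == a]
        b_rows = [p for y, p in zip(y_true, y_prob) if y == b]
        auc_a = binary_auc([p[a] for p in a_rows], [p[a] for p in b_rows])
        auc_b = binary_auc([p[b] for p in b_rows], [p[b] for p in a_rows])
        pair_scores.append((auc_a + auc_b) / 2)
    return sum(pair_scores) / len(pair_scores)

# Toy data (hypothetical): three classes, two samples each.
y_true = [0, 1, 2, 0, 1, 2]
perfect = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]] * 2
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 6
print(macro_ovo_auc(y_true, perfect))        # 1.0
print(macro_ovo_auc(y_true, uninformative))  # 0.5
```

A score of 0.5 corresponds to uninformative predictions and 1.0 to a perfect ranking, which gives a reference scale for the validation scores reported in this section.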

##### 2. Global Average Pooling model with one dropout layer

Build the Global Average Pooling model with one dropout layer for classification

In [60]:
from tensorflow.keras.layers import Dropout

GlobalAvg_dropout = Sequential([
  Embedding(input_dim=4277, output_dim=6, input_length=maxlen),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dropout(0.5),
  Dense(5, activation='softmax')
])

GlobalAvg_dropout.compile(optimizer='adam',
              loss="sparse_categorical_crossentropy",
              metrics=['accuracy'])
GlobalAvg_dropout.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 50, 6)             25662     
_________________________________________________________________
global_average_pooling1d_2 ( (None, 6)                 0         
_________________________________________________________________
dense_9 (Dense)              (None, 16)                112       
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 5)                 85        
Total params: 25,859
Trainable params: 25,859
Non-trainable params: 0
_________________________________________________________________
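
The dropout layer adds no parameters; during training it randomly zeroes a fraction of the activations and rescales the rest, and at inference it passes values through unchanged. The sketch below illustrates this "inverted dropout" idea in pure Python with an assumed helper name and toy activations; it is not the Keras implementation.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)           # inference: identity
    keep_scale = 1.0 / (1.0 - rate)   # keeps the expected activation unchanged
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, rate=0.5, training=False, rng=rng))  # [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, rate=0.5, training=True, rng=rng))   # each entry is 0.0 or doubled
```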


Fit the dropout model for 15 epochs, using the validation dataset to monitor performance.

In [61]:
GlobalAvg_dropout.fit(train_xpad, train_y, epochs=15,validation_data=(valid_xpad,valid_y))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x114f26c10>

Calculate the AUC score for the model

In [62]:
pred_y = GlobalAvg_dropout.predict(valid_xpad)
roc_auc_score(valid_y, pred_y,average='macro',multi_class='ovo')

0.8569063762402831

##### 3. Global Average Pooling model with batch normalization

Build the Global Average Pooling model with a batch normalization layer for classification

In [63]:
from tensorflow.keras.layers import BatchNormalization

GlobalAvg_batch = Sequential([
  Embedding(input_dim=4277, output_dim=6, input_length=maxlen),
  BatchNormalization(),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dropout(0.5),
  Dense(5, activation='softmax')
])

GlobalAvg_batch.compile(optimizer='adam',
              loss="sparse_categorical_crossentropy",
              metrics=['accuracy'])
GlobalAvg_batch.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 50, 6)             25662     
_________________________________________________________________
batch_normalization (BatchNo (None, 50, 6)             24        
_________________________________________________________________
global_average_pooling1d_3 ( (None, 6)                 0         
_________________________________________________________________
dense_11 (Dense)             (None, 16)                112       
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_12 (Dense)             (None, 5)                 85        
Total params: 25,883
Trainable params: 25,871
Non-trainable params: 12
_________________________________________________
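
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard layer formulas: an embedding stores one vector per vocabulary entry, a dense layer stores a weight matrix plus a bias vector, and batch normalization keeps four values per feature, of which the scale and shift are trainable and the moving mean and variance are not. The helper names are hypothetical and used only for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    # Trainable: gamma and beta; non-trainable: moving mean and moving variance
    return 2 * n_features, 2 * n_features


vocab_size, embed_dim = 4277, 6
bn_trainable, bn_non_trainable = batchnorm_params(embed_dim)

trainable = (
    embedding_params(vocab_size, embed_dim)  # 25662
    + bn_trainable                           # 12
    + dense_params(embed_dim, 16)            # 112
    + dense_params(16, 5)                    # 85
)
total = trainable + bn_non_trainable
print(trainable, bn_non_trainable, total)  # 25871 12 25883
```

This reproduces the 24 batch normalization parameters shown above, 12 of them non-trainable.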

Fit the model on the training data for 10 epochs, using the validation dataset to monitor performance after each epoch.

In [64]:
GlobalAvg_batch.fit(train_xpad, train_y, epochs=10,validation_data=(valid_xpad,valid_y))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a450d4790>

Calculate the AUC score for the model

In [65]:
pred_y = GlobalAvg_batch.predict(valid_xpad)
roc_auc_score(valid_y, pred_y,average='macro',multi_class='ovo')

0.8545064549663971

##### 4. Global Average Pooling model with output dimension adjusted 

Build the Global Average Pooling model with the embedding output dimension adjusted (from 6 to 7) for classification.

In [66]:
from tensorflow.keras.layers import BatchNormalization

GlobalAvg_output = Sequential([
  Embedding(input_dim=4277, output_dim=7, input_length=maxlen),
  BatchNormalization(),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dropout(0.5),
  Dense(5, activation='softmax')
])

GlobalAvg_output.compile(optimizer='adam',
              loss="sparse_categorical_crossentropy",
              metrics=['accuracy'])
GlobalAvg_output.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 50, 7)             29939     
_________________________________________________________________
batch_normalization_1 (Batch (None, 50, 7)             28        
_________________________________________________________________
global_average_pooling1d_4 ( (None, 7)                 0         
_________________________________________________________________
dense_13 (Dense)             (None, 16)                128       
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_14 (Dense)             (None, 5)                 85        
Total params: 30,180
Trainable params: 30,166
Non-trainable params: 14
_________________________________________________

Fit the model on the training data for 8 epochs, using the validation dataset to monitor performance after each epoch.

In [67]:
GlobalAvg_output.fit(train_xpad, train_y, epochs=8,validation_data=(valid_xpad,valid_y))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<tensorflow.python.keras.callbacks.History at 0x1a4531eb50>

Calculate the AUC score for the model

In [68]:
pred_y = GlobalAvg_output.predict(valid_xpad)
roc_auc_score(valid_y, pred_y,average='macro',multi_class='ovo')

0.8553338945778851

### Step 6: Evaluate Models on the Test Data

Based on the preliminary model selection and fine-tuning steps, the original Global Average Pooling model, with no batch normalization and no dropout layer, gives the best performance among the candidates on the validation data.

In [69]:
GlobalAvg = Sequential([
  Embedding(input_dim=4277, output_dim=6, input_length=maxlen),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(5, activation='softmax')
])

GlobalAvg.compile(optimizer='adam',
              loss="sparse_categorical_crossentropy",
              metrics=['accuracy'])
GlobalAvg.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 50, 6)             25662     
_________________________________________________________________
global_average_pooling1d_5 ( (None, 6)                 0         
_________________________________________________________________
dense_15 (Dense)             (None, 16)                112       
_________________________________________________________________
dense_16 (Dense)             (None, 5)                 85        
Total params: 25,859
Trainable params: 25,859
Non-trainable params: 0
_________________________________________________________________


I ran the model fitting step multiple times, starting with a larger number of epochs and reducing it until the model no longer overfit. In the end, I fit the model for 10 epochs and validated it on the validation dataset.

In [70]:
GlobalAvg.fit(train_xpad, train_y, epochs=10,validation_data=(valid_xpad,valid_y))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a94547090>
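
The manual search for a good epoch count follows the same logic as patience-based early stopping: keep training while the validation loss improves and stop once it has failed to improve for a few epochs in a row. The sketch below shows that rule in plain Python. The function name `best_epoch`, the `patience` value, and the loss values are all made up for illustration and are not taken from the training runs above.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-indexed epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_idx, best_loss, waited = 0, float("inf"), 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_idx, best_loss, waited = idx, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx + 1


# Hypothetical validation losses: improvement stalls after the fifth epoch.
hypothetical_val_loss = [1.40, 1.20, 1.05, 0.98, 0.95, 0.96, 0.97, 0.99]
chosen = best_epoch(hypothetical_val_loss, patience=2)
print(chosen)  # 5
```

In Keras, the same rule can be applied automatically by passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` in the `callbacks` list of `fit`, which would replace the repeated manual runs.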

Compute the macro AUC score for the final model based on the validation data. 

In [71]:
pred_y = GlobalAvg.predict(valid_xpad)
roc_auc_score(valid_y, pred_y,average='macro',multi_class='ovo')

0.8608947755157887

Compute the macro AUC score for the final model based on the test data. 

In [72]:
pred_y = GlobalAvg.predict(test_xpad)
roc_auc_score(test_y, pred_y,average='macro',multi_class='ovo')

0.8615135250049242

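Reduce the NumPy array of predicted class probabilities to a list of predicted class labels for calculating the confusion matrix.

In [73]:
# Take the index of the highest predicted probability in each row as the predicted class
reduce = np.argmax(pred_y, axis=1).tolist()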

Calculate the confusion matrix for multiclass classification.

In [74]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y,reduce)

array([[ 512,  320,   63,   70,   10],
       [ 159, 1010,  272,  310,   19],
       [   7,  190,  994,  233,    8],
       [  27,  294,  244, 1282,  182],
       [   5,   44,   40,  407,  707]])

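Compare the confusion matrix with the distribution of labels in the test data. Each row of the confusion matrix sums to the number of test examples with that true label.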

In [75]:
Counter(test_y)

Counter({4: 1203, 2: 1432, 3: 2029, 0: 975, 1: 1770})
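
As a rough sketch of how to read the confusion matrix, the block below derives per-class recall, per-class precision, and overall accuracy from the matrix printed above. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true label, columns: predicted label)
cm = np.array([
    [512,  320,   63,   70,  10],
    [159, 1010,  272,  310,  19],
    [  7,  190,  994,  233,   8],
    [ 27,  294,  244, 1282, 182],
    [  5,   44,   40,  407, 707],
])

support = cm.sum(axis=1)                  # test examples per true class
recall = cm.diagonal() / support          # share of each true class predicted correctly
precision = cm.diagonal() / cm.sum(axis=0)  # share of each predicted class that is correct
accuracy = cm.trace() / cm.sum()          # overall share of correct predictions

print(support.tolist())  # [975, 1770, 1432, 2029, 1203]
for label in range(5):
    print(label, round(recall[label], 3), round(precision[label], 3))
print(round(accuracy, 3))
```

The row totals match the `Counter(test_y)` output, and most off-diagonal mass sits in cells next to the diagonal, meaning the model mainly confuses neighbouring sentiment levels (for example, 407 Extremely Positive tweets predicted as Positive).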