<div style="display:block">
    <div style="width: 10%; display: inline-block; text-align: left;">
    </div>
    <div style="width: 69%; display: inline-block">
        <h5  style="color:maroon; text-align: center; font-size:25px;">Sentiment Analysis using Transfer Learning</h5>
    </div>
</div>

Sentiment classification is a good first problem for anyone getting started with Natural Language Processing (NLP). As the name suggests, it is the classification of people's opinions or expressions into different sentiments, such as __Positive__, __Neutral__, and __Negative__.

NLP is a powerful tool, but in the real world we often come across tasks that suffer from a shortage of data and poor model generalisation. __Transfer learning__ helps with this problem. It is the process of training a model on a large-scale dataset and then using that pretrained model as the starting point for learning another downstream task (i.e., the target task).

# Importing libraries

In [None]:
# to hide warnings
import warnings
warnings.filterwarnings('ignore')

# basic data processing
import os
import datetime
import pandas as pd
import numpy as np

# for EDA
from pandas_profiling import ProfileReport

# for text preprocessing
import re
import nltk 
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from spellchecker import SpellChecker

# progress bar
from tqdm.auto import tqdm
from tqdm import tqdm_notebook

# instantiate
tqdm.pandas(tqdm_notebook)

# for wordcloud
from PIL import Image
from wordcloud import WordCloud

# for aesthetics and plots
from IPython.display import display, Markdown, clear_output
from termcolor import colored

import matplotlib.pyplot as plt

import plotly.graph_objects as go
from plotly.offline import plot, iplot
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook"

# for model
import tensorflow as tf
import tensorflow_hub as hub
import keras.layers as layers
from keras.models import Model
from keras import backend as K
import keras
from keras.models import load_model

display(Markdown('_All libraries are imported successfully!_'))

# Data Loading & Preprocessing

In this notebook, I am using __[Sentiment140](http://help.sentiment140.com/for-students)__. It contains two labeled datasets:
* data of __1.6 Million Tweets__ to be used as __train,validation,test split data__
* data of __498 Tweets__ to be used as another fresh __test data__

The data dictionary is as follows:

* __target__: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* __id__: the id of the tweet (2087)
* __date__: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
* __flag__: the query (lyx). If there is no query, then this value is NO_QUERY.
* __user__: the user that tweeted (robotickilldozr)
* __text__: the text of the tweet (Lyx is cool)

__NOTE__: The training data isn't perfectly categorised, as it was created by tagging each tweet according to the emoticons it contains. Any model built using this dataset may therefore have lower than expected accuracy.

In [None]:
col_names =  ['target', 'id', 'date', 'flag','user','text']

df = pd.read_csv('./data/sentiment140data/training_data.csv', encoding = "ISO-8859-1", names=col_names)

print(colored('DATA','blue',attrs=['bold']))
display(df.head())

Let us explore the data for better understanding.

In [None]:
profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True)
profile

We will be requiring only the __target__ and __text__ columns. As observed from the above report, we have only __positive (4)__ and __negative (0)__ sentiments. We will replace 4 with 1 for convenience.

Also, it's a __perfectly balanced__ dataset without any skewness: positive and negative sentiments are equally distributed.

In [None]:
# dropping irrelevant columns
df.drop(['id', 'date', 'flag', 'user'], axis=1, inplace=True)

# replacing positive sentiment 4 with 1
df.target = df.target.replace(4,1)

target_count = df.target.value_counts()

category_counts = len(target_count)
display(Markdown('__Number of categories__: {}'.format(category_counts)))
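
The balance claim can be checked numerically rather than read off the report. The following is a minimal sketch using only the standard library; `class_shares` and `toy_labels` are hypothetical names, and the toy list merely stands in for `df.target`.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy labels standing in for df.target (0 = negative, 1 = positive).
toy_labels = [0, 1, 1, 0, 1, 0]
shares = class_shares(toy_labels)
print(shares)
```

On the real data, shares close to 0.5 for both labels would confirm that no re-sampling or class weighting is needed.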

# Text Preprocessing

At first glance, it's evident that the data is not clean. Tweet texts often contain mentions of other users, hyperlinks, emoticons and special characters, which add no value as features for the model we are training. So we need to deal with them. To do this, we perform 4 crucial steps, one after the other:

1. __Hyperlinks and Mentions__: In Twitter, people can tag/mention other people's IDs and share URLs/hyperlinks. We need to eliminate these.

2. __Stopwords__ : These are commonly used words (such as “the”, “a”, “an”, “in”) which have no contextual meaning in a sentence and hence we ignore them when indexing entries for searching and when retrieving them as the result of a search query.

3. __Spelling Correction__: We can definitely expect incorrect spellings in the tweets/data, and we need to fix as many as possible, because without doing this, the following step will not work properly.

4. __Stemming/Lemmatization__: The goal of both stemming and lemmatization is to reduce inflectional forms and derivationally related forms of a word to a common base form. However, there is a difference, which you can understand from the image below.

![](./source/stem_lemm.png)

Lemmatization is similar to stemming with one difference: the final form is always a __meaningful word__. Stemming therefore does not need a dictionary, whereas lemmatization does. Since we want real words as output, we will go ahead with lemmatization.

Steps 2 and 4 can be done using the __`NLTK`__ library, step 1 using regular expressions (Python's `re` module), and spell-checking using __`pyspellchecker`__.
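
To make the stemming versus lemmatization distinction concrete, here is a deliberately tiny sketch. It does not use NLTK: `toy_stem` is a hypothetical suffix stripper and `TOY_LEMMAS` is a hand-written lookup table, both invented for illustration only.

```python
def toy_stem(word):
    """Strip a common suffix by rule; the result need not be a real word."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini dictionary; a real lemmatizer relies on a full lexicon.
TOY_LEMMAS = {"studies": "study", "studying": "study", "was": "be"}


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ("studies", "studying"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based stemmer maps `studies` to the non-word `stud`, while the dictionary lookup returns `study`, which is the behaviour the notebook relies on when it picks lemmatization.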

In [None]:
# set of stop words declared
stop_words = stopwords.words('english')

display(Markdown('__List of stop words__:'))
display(Markdown(str(stop_words)))

Some words like `not`, `haven't`, `don't` are included in the stop words, and ignoring them will make sentences like `this was not good` and `this was good`, or `He is a nice guy... not!` and `He is a nice guy... !`, produce the same predictions. So we need to remove the words that express negation, denial, refusal or prohibition from the stop word list, so that they are kept in the tweets.
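
The effect described above can be reproduced in a few lines. This is a sketch with a tiny, hypothetical stop word set (not NLTK's actual list), and `drop_stop_words` is an illustrative helper name.

```python
def drop_stop_words(sentence, stop_words):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [w for w in sentence.lower().split() if w not in stop_words]


# Tiny hypothetical stop word sets, not NLTK's real list.
with_negation = {"this", "was", "not"}
without_negation = with_negation - {"not"}

a = "this was not good"
b = "this was good"

# With "not" treated as a stop word, both sentences collapse to the same tokens.
same = drop_stop_words(a, with_negation) == drop_stop_words(b, with_negation)

# Once "not" is taken out of the stop word set, the negation survives.
kept = drop_stop_words(a, without_negation)
print(same, kept)
```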

In [None]:
updated_stop_words = stop_words.copy()
for word in stop_words:
    if "n't" in word or "no" in word or word.endswith('dn') or word.endswith('sn') or word.endswith('tn'):
        updated_stop_words.remove(word)

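# custom-selected words that should also be kept in the tweets
words_to_remove = ['for','by','with','against','shan','don','aren','haven','weren','until','ain','but','off','out']
for word in words_to_remove:
    # guard against NLTK versions whose stop word list lacks one of these words
    if word in updated_stop_words:
        updated_stop_words.remove(word)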

display(Markdown('__Updated list of stop words__:'))
display(Markdown(str(updated_stop_words)))

Now, let us define the function to perform the necessary preprocessing.

In [None]:
# Defining dictionary containing all emojis with their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

# Defining regex patterns.
urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
userPattern       = r'@[^\s]+'
alphaPattern      = "[^a-zA-Z0-9]"
sequencePattern   = r"(.)\1\1+"
seqReplacePattern = r"\1\1"

# creating instance of spellchecker
spell = SpellChecker()

# creating instance of lemmatizer
lemm = WordNetLemmatizer()


def preprocess(tweet):
    # lowercase the tweets
    tweet = tweet.lower().strip()
    
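    # Remove all URLs.
    tweet = re.sub(urlPattern,'',tweet)
    
    # Replace all emoticons. The tweet is already lowercased, so match on the
    # lowercased key; otherwise keys such as ':P' or ':-D' would never match.
    for emoji in emojis.keys():
        tweet = tweet.replace(emoji.lower(), "emoji" + emojis[emoji])        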
    
    # Remove @USERNAME
    tweet = re.sub(userPattern,'', tweet)        
    
    # Replace all non-alphanumeric characters with a space.
    tweet = re.sub(alphaPattern, " ", tweet)
    
    # Replace 3 or more consecutive identical characters with 2.
    tweet = re.sub(sequencePattern, seqReplacePattern, tweet)

    tokenized_tweet = tweet.split()
    
#     # spell checks
#     misspelled = spell.unknown(tokenized_tweet)
#     if misspelled == set():
#         pass
#     else:
#         for i,word in enumerate(misspelled):
#             tokenized_tweet[i] = spell.correction(word)

    tweetwords = ''
    for word in tokenized_tweet:
        # Checking if the word is a stopword.
        if word not in updated_stop_words:
            if len(word)>1:
                # Lemmatizing the word.
                lem_word = lemm.lemmatize(word)
                tweetwords += (lem_word+' ')
    
    return tweetwords
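
To see what the regex steps do on their own, the sketch below applies the same four patterns to a made-up tweet. It leaves out emoticon replacement, stop word removal and lemmatization; `clean_tokens` and the sample text are hypothetical.

```python
import re

# Same patterns as in the notebook, written as raw strings.
url_pattern = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
user_pattern = r"@[^\s]+"
alpha_pattern = r"[^a-zA-Z0-9]"
sequence_pattern = r"(.)\1\1+"
seq_replace_pattern = r"\1\1"


def clean_tokens(tweet):
    """Apply only the regex-based cleaning steps and return the tokens."""
    tweet = tweet.lower().strip()
    tweet = re.sub(url_pattern, "", tweet)       # drop URLs
    tweet = re.sub(user_pattern, "", tweet)      # drop @mentions
    tweet = re.sub(alpha_pattern, " ", tweet)    # punctuation -> space
    tweet = re.sub(sequence_pattern, seq_replace_pattern, tweet)  # "sooo" -> "soo"
    return tweet.split()


sample = "@bob I loooove this!!! http://t.co/xyz sooo coool"
tokens = clean_tokens(sample)
print(tokens)  # ['i', 'loove', 'this', 'soo', 'cool']
```

Note that collapsing repeated characters to two does not always restore the dictionary spelling (`loove`), which is one reason a spell-correction step is listed before lemmatization.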

Now, we will apply the function _`preprocess`_ to each value of the `text` column, where the tweets are located.

In [None]:
df['text'] = df['text'].progress_apply(lambda x: preprocess(x))
print(colored('DATA','blue',attrs=['bold']))
display(df.head())

Let's take a quick look at the words that are frequently used for positive and negative tweets.

In [None]:
def plot_wordcloud(text, mask, title = None):
    wordcloud = WordCloud(background_color='black', max_words = 200,
                          max_font_size = 200, random_state = 42, mask = mask)
    wordcloud.generate(text)
    
    plt.figure(figsize=(25,25))
    
    plt.imshow(wordcloud)
    plt.title(title, fontdict={'size': 40, 'verticalalignment': 'bottom'})
    plt.axis('off')
    plt.tight_layout()

In [None]:
pos_text = " ".join(df[df['target'] == 1]['text'])
pos_mask = np.array(Image.open('./source/upvote.png'))

plot_wordcloud(pos_text, pos_mask, title = 'Most common 200 words in positive tweets')

In [None]:
neg_text = " ".join(df[df['target'] == 0]['text'])
neg_mask = np.array(Image.open('./source/downvote.png'))

plot_wordcloud(neg_text, neg_mask,
               title = 'Most common 200 words in negative tweets')

# Data Split

We will shuffle the dataset and split it into __train__, __validation__ and __test__ sets. It's important to shuffle the dataset before training. The split is in the ratio __6:2:2__ respectively.

In [None]:
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=369):
    np.random.seed(seed)
    # permute row positions (not index labels), since the slices below use .iloc
    perm = np.random.permutation(len(df))
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

train_df, val_df, test_df = train_validate_test_split(df)

print('Train: {}, Validation: {}, Test: {}'.format(train_df.shape, val_df.shape, test_df.shape))

print(colored('TRAIN DATA','magenta',attrs=['bold']))
display(train_df.head())

train_text = train_df['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = np.asarray(pd.get_dummies(train_df['target']), dtype = np.int8)

val_text = val_df['text'].tolist()
val_text = np.array(val_text, dtype=object)[:, np.newaxis]
val_label = np.asarray(pd.get_dummies(val_df['target']), dtype = np.int8)

test_text = test_df['text'].tolist()
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = np.asarray(pd.get_dummies(test_df['target']), dtype = np.int8)
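
The label arrays built above are one-hot encoded, with one column per class in sorted label order. The following NumPy sketch mirrors that idea without pandas; `one_hot` is a hypothetical helper, and it also shows how an `argmax` over the columns maps back to the original labels.

```python
import numpy as np


def one_hot(labels):
    """Return the sorted class values and a one-hot matrix with one column per class."""
    classes = np.unique(labels)                 # sorted unique labels
    index = np.searchsorted(classes, labels)    # position of each label in `classes`
    return classes, np.eye(len(classes), dtype=np.int8)[index]


classes, encoded = one_hot(np.array([1, 0, 0, 1]))
print(encoded)

# Decoding: the column index returned by argmax refers to the SORTED class list.
decoded = classes[encoded.argmax(axis=1)]
print(decoded)
```

Because column `i` corresponds to the `i`-th smallest label, any list used to decode predictions later has to be in the same sorted order.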

# Pre-trained Embedding model

## Understanding the difference

There are two types of embeddings in the NLP domain:

__WORD EMBEDDING__

* Baseline
    1. __Word2Vec__
    2. __GloVe__
    3. __FastText__
* State-of-the-art
    1. __ELMo__ (`E`mbeddings from `L`anguage `Mo`dels)
    2. __BERT__ (`B`idirectional `E`ncoder `R`epresentations from `T`ransformers)
    3. __OpenAI GPT__ (`G`enerative `P`re-trained `T`ransformer)
    4. __ULMFiT__ (`U`niversal `L`anguage `M`odel `Fi`ne-`T`uning) - This is more of a process that includes word embedding along with NN architecture.

__SENTENCE EMBEDDING__

* Baseline
    1. __Bag of Words__
    2. __Doc2Vec__
* State-of-the-art
    1. __Sentence BERT__
    2. __Skip-Thoughts and Quick-Thoughts__
    3. __InferSent__
    4. __Universal Sentence Encoder__

So, the fundamental difference is that a __word embedding__ turns a single word into an N-dimensional vector, whereas a __sentence embedding__ is more versatile because it can embed not only words but also phrases and whole sentences.

__ULMFiT__ is often regarded as a strong choice for transfer learning in NLP, but it is built on the __fast.ai__ library, whose code conventions differ from those of __Keras__ or __TensorFlow__. For this notebook, we will be using the __Universal Sentence Encoder__.

## Universal Sentence Encoder

It can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.

It takes __variable length English text as input__ and __outputs a 512-dimensional vector__. Handling variable length text input sounds great, but the longer the input gets (counted in words), the more diluted the resulting embedding is likely to be.

There are 2 Universal Sentence Encoder variants to choose from, with different encoder architectures that target distinct design goals:
* __Transformer__ architecture that targets high accuracy at the cost of greater model complexity and resource consumption
* __Deep Averaging Network (DAN)__ that targets efficient inference with slightly reduced accuracy using a simple architecture

Both models were trained with the __Stanford Natural Language Inference (SNLI)__ corpus. The [SNLI](https://nlp.stanford.edu/pubs/snli_paper.pdf) corpus is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as __recognizing textual entailment (RTE)__. Essentially, the models were trained to learn the semantic similarity between the sentence pairs.

__The module used in this notebook is the DAN variant.__ DAN works in three simple steps:
1. take the vector average of the embeddings associated with an input sequence of tokens
2. pass that average through one or more feedforward layers
3. perform (linear) classification on the final layer’s representation

![](./source/dan.png)

The primary advantage of the DAN encoder is that compute time is linear in the length of the input sequence.
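
The three DAN steps can be sketched in plain NumPy. This is an illustrative toy under stated assumptions, not the Universal Sentence Encoder itself: the vocabulary, the dimensions and the randomly initialised weights (`E`, `W1`, `W2`) are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes.
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
embed_dim, hidden_dim, n_classes = 8, 4, 2

E = rng.normal(size=(len(vocab), embed_dim))       # token embedding table
W1, b1 = rng.normal(size=(embed_dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, n_classes)), np.zeros(n_classes)


def dan_forward(tokens):
    """Average token embeddings, apply one ReLU layer, then a linear softmax classifier."""
    avg = E[[vocab[t] for t in tokens]].mean(axis=0)   # step 1: vector average
    hidden = np.maximum(0.0, avg @ W1 + b1)            # step 2: feedforward layer
    logits = hidden @ W2 + b2                          # step 3: linear classification
    exp = np.exp(logits - logits.max())                # numerically stable softmax
    return exp / exp.sum()


probs = dan_forward(["the", "movie", "was", "great"])
print(probs)
```

Step 1 is a single pass over the tokens, which is where the linear compute time comes from. Because averaging ignores word order, shuffling the tokens gives the same output, which is the price paid for that efficiency.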

This module is about 1GB. Depending on your network speed, it might take a while to load the first time you run inference with it. After that, loading the model should be faster as modules are cached by default.

In [None]:
# we can change this model. check the url 'https://tfhub.dev/google/' for more
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

embed_size = embed.get_output_info_dict()['default'].get_shape()[1].value
display(Markdown("__Embedding size__: {}".format(embed_size)))

We have loaded the Universal Sentence Encoder. Computing the embeddings for some text is as easy as shown below.

In [None]:
# Compute a representation for each message, showing various lengths supported.
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = ("Universal Sentence Encoder embeddings also support short paragraphs. "
             "There is no hard limit on how long the paragraph is. Roughly, the longer the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages))

    for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
        print("Message: {}".format(messages[i]))
        print("Embedding size: {}".format(len(message_embedding)))
        message_embedding_snippet = ", ".join((str(x) for x in message_embedding[:3]))
        print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
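
Once texts are turned into fixed-size vectors, their semantic closeness is commonly measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors in place of real 512-dimensional encoder outputs; `cosine_similarity` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors standing in for message embeddings.
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

sim_same = cosine_similarity(v1, v2)
sim_orth = cosine_similarity(v1, v3)
print(sim_same, sim_orth)
```

With real embeddings, `message_embeddings[i]` and `message_embeddings[j]` would take the place of the toy vectors.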

# Model creation

We have loaded the Universal Sentence Encoder as the variable `embed`. To make it work with Keras, it is necessary to wrap it in a Keras Lambda layer and explicitly cast its input to a string. Then we build the Keras model with the standard Functional API. In the model summary we can see that __only the Keras layers are trainable. This is how transfer learning works here: the Universal Sentence Encoder weights are left untouched__.

Now, let's clear up the confusion between two terms used in deep learning: __loss function__ and __optimizer__.

The __loss function__ is a mathematical way of measuring how wrong the predictions are.

During the training process, we tweak and change the parameters (weights) of the model to __try and minimize that loss function__, and make the predictions as correct and optimized as possible. But how exactly is it done, by how much, and when?

_This is where optimizers come in_. They tie together the loss function and model parameters by updating the model in response to the output of the loss function.
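
The interplay between loss and optimizer can be shown on a one-weight logistic model. This is a minimal sketch that uses plain gradient descent rather than Adam, on four made-up data points; all names and values are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


# Made-up 1-D features and binary labels.
X = np.array([[0.5], [1.5], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(1)          # model parameter
learning_rate = 0.5

loss_before = binary_cross_entropy(y, sigmoid(X @ w))

# Optimizer step: move the weight against the gradient of the loss.
grad = X.T @ (sigmoid(X @ w) - y) / len(y)
w = w - learning_rate * grad

loss_after = binary_cross_entropy(y, sigmoid(X @ w))
print(loss_before, loss_after)
```

The loss only measures the error; the update rule decides how far and in which direction the weight moves. Adam follows the same pattern but adapts the step size per parameter.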

In [None]:
def UniversalEmbedding(x):
    # squeeze only the feature axis so that a batch of size 1 keeps its batch dimension
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1), 
                 signature="default", as_dict=True)["default"]

input_text = layers.Input(shape=(1,), dtype="string")
embedding = layers.Lambda(UniversalEmbedding, output_shape=(embed_size,))(input_text)

# experiment on the custom FC layer here
#------------------------------------------------------#
x = layers.Dense(256, activation='relu')(embedding)
x = layers.Dropout(0.25)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.125)(x)
x = layers.Dense(category_counts, activation='sigmoid')(x)
#------------------------------------------------------#

model_sa = Model(inputs=[input_text], outputs=x)

# we are selecting the Adam optimizer, a widely used default choice
opt = keras.optimizers.Adam(learning_rate=0.001)

# setting `binary_crossentropy` as loss function for the classifier
model_sa.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model_sa.summary()
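
The number of trainable parameters reported for the dense head can be cross-checked by hand. The sketch below assumes the values stated in this notebook (a 512-dimensional embedding and 2 output categories); `dense_params` is a hypothetical helper, and Dropout layers contribute no parameters.

```python
def dense_params(n_in, n_out):
    """A Dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


# embedding (512) -> Dense(256) -> Dense(64) -> Dense(2)
layer_sizes = [512, 256, 64, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [131328, 16448, 130]
print(total)      # 147906
```

Only these weights are updated during training, while the encoder that produces the 512-dimensional input stays fixed.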

# Training

Now, we train the model with the training dataset and validate its performance at the end of each training epoch with the validation dataset.

In [None]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    history = model_sa.fit(train_text, train_label,
                            validation_data=(val_text, val_label),
                            epochs=5,
                            batch_size=64,
                            shuffle=True)
    model_sa.save_weights('best_model.h5')

# Evaluation

Now that we have trained the model, we can evaluate its performance. We will use some evaluation metrics and techniques to test the model.

In [None]:
# load the saved weights
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model_sa.load_weights('best_model.h5')
    _, train_acc = model_sa.evaluate(train_text, train_label)
    _, test_acc = model_sa.evaluate(test_text, test_label)

clear_output()
display(Markdown('__Train Accuracy__: {}, __Test Accuracy__: {}'.format(round(train_acc,4), round(test_acc,4))))
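
Accuracy alone does not show which kind of error the model makes. As a sketch of complementary metrics, the snippet below derives precision and recall from the confusion-matrix counts; the label lists are made up and `binary_metrics` is a hypothetical helper, not part of the notebook's pipeline.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up ground truth and predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

With the notebook's arrays, `test_label.argmax(axis=1)` and the argmax of the model's predictions could be passed in the same way.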

The learning curves of accuracy and loss for each epoch are shown below:

In [None]:
fig = make_subplots(rows=1, cols=2)

# one x value per trained epoch (the model above was fit for 5 epochs, not 50)
epochs_run = list(range(len(history.history['accuracy'])))

fig.add_trace(go.Scatter(x=epochs_run, y=history.history['accuracy'], name='train'),
              row=1, col=1)
fig.add_trace(go.Scatter(x=epochs_run, y=history.history['val_accuracy'], name='validation'),
              row=1, col=1)

fig.add_trace(go.Scatter(x=epochs_run, y=history.history['loss'], name='train'),
              row=1, col=2)
fig.add_trace(go.Scatter(x=epochs_run, y=history.history['val_loss'], name='validation'),
              row=1, col=2)

fig.update_layout(height=600, width=900, showlegend=False,hovermode="x",
                  title_text="Train and Validation Accuracy and Loss")
fig.show()

# Prediction

Finally, let's perform some predictions to see where and why we are getting wrong predictions.

In [None]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model_sa.load_weights('best_model.h5')
    predicts = model_sa.predict(test_text, batch_size=32)

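# sort so that index i matches column i of the one-hot labels (get_dummies orders columns by value)
categories = sorted(train_df['target'].unique().tolist())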

predict_logits = predicts.argmax(axis=1)
test_df['predicted'] = [categories[i] for i in predict_logits]

def highlight_rows(x):
    if x['target'] != x['predicted']:
        return ['background-color: #d65f5f']*3
    else:
        return ['background-color: lightgreen']*3

clear_output()
display(test_df.head(20).style.apply(highlight_rows, axis=1))

Now, we will perform prediction on the clean test data set provided along with the train data.

In [None]:
col_names =  ['target', 'id', 'date', 'flag','user','text']

test_df = pd.read_csv('./data/sentiment140data/test_data.csv', encoding = "ISO-8859-1", names=col_names)

print(colored('TEST DATA','magenta',attrs=['bold']))
display(test_df.head(10))

In [None]:
# dropping irrelevant columns
test_df.drop(['id', 'date', 'flag', 'user'], axis=1, inplace=True)

# replacing positive sentiment 4 with 1
test_df['target'] = test_df['target'].replace(4,1)

target_count = test_df['target'].value_counts()

category_counts = len(target_count)
display(Markdown('__Number of categories__: {}'.format(category_counts)))

We now see 3 categories instead of 2. The extra sentiment is __neutral__, on which we haven't trained the model. If we kept it, one-hot encoding would produce 3 columns, which does not match the model's output layer (which has 2 units). So we have to discard it.

In [None]:
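# .copy() so that new columns can be added later without modifying a slice of the original frame
test_df = test_df[test_df['target'].isin([0, 1])].copy()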
target_count = test_df['target'].value_counts()

category_counts = len(target_count)
display(Markdown('__Number of categories__: {}'.format(category_counts)))

Now, we will evaluate and predict the data __with and without all the text preprocessing__, and analyze the difference.

In [None]:
test_df['processed_text'] = test_df['text'].apply(lambda x: preprocess(x))

new_test_text = test_df['text'].tolist()
new_test_text_p = test_df['processed_text'].tolist()

new_test_text = np.array(new_test_text, dtype=object)[:, np.newaxis]
new_test_text_p = np.array(new_test_text_p, dtype=object)[:, np.newaxis]

new_test_label = np.asarray(pd.get_dummies(test_df['target']), dtype = np.int8)

display(Markdown('__New Test Data size__: {}'.format(new_test_label.shape[0])))

In [None]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    
    model_sa.load_weights('best_model.h5')
    
    new_predicts = model_sa.predict(new_test_text, batch_size=16)
    new_predicts_p = model_sa.predict(new_test_text_p, batch_size=16)
    
    _,new_acc = model_sa.evaluate(new_test_text, new_test_label)
    _,new_acc_p = model_sa.evaluate(new_test_text_p, new_test_label)
    
clear_output()
display(Markdown('__New test data evaluation__'))
display(Markdown('__Without preprocessing__: {} || __With preprocessing__: {}'.format(round(new_acc,4), 
                                                                                      round(new_acc_p,4))))

There is not much of a difference. Let's look at some of the outputs to understand where the difference comes from.

In [None]:
# sort so that index i matches column i of the one-hot labels
new_categories = sorted(test_df['target'].unique().tolist())

predict_logits = new_predicts.argmax(axis=1)
test_df['predicted'] = [new_categories[i] for i in predict_logits]

predict_logits = new_predicts_p.argmax(axis=1)
test_df['processed_predicted'] = [new_categories[i] for i in predict_logits]

def highlight_rows(x):
    if x['target'] != x['predicted'] or x['target'] != x['processed_predicted']:
        return ['background-color: #d65f5f']*5
    else:
        return ['background-color: lightgreen']*5

display(test_df.head(20).style.apply(highlight_rows, axis=1))

# Executive Summary

The objective of this notebook is to analyze and classify the sentiment of Tweets obtained from Twitter as positive or negative.

After identifying the relevant columns, we performed intensive text preprocessing. The cleaned text can be provided as input to the model directly, without even the need for tokenization, a step that is required in the traditional deep learning approach.

Using Universal Sentence Encoder, which is a state-of-the-art pre-trained sentence embedding module, we contextualized the tweets and created a model that holds the information as to which tweets are referring to a positive sentiment, and which ones are negative.

An interesting observation is that we obtain a pretty decent accuracy despite the encoder using the less accurate (DAN) architecture, the dataset not being perfectly tagged, and the small number of NN layers.