<a href="https://colab.research.google.com/github/FelixSchmid/Sentiment_Analysis/blob/master/1_Sentiment_Analysis_IMBD_TFIDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment classification - close to the state of the art

The task of classifying sentiments of texts (for example movie or product reviews) has high practical significance in online marketing as well as financial prediction. This is a non-trivial task, since the concept of sentiment is not easily captured.

For this assignment you have to use the larger [IMDB sentiment](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) benchmark dataset from Stanford, and achieve close to state-of-the-art results.

The task is to try out multiple models in ascending complexity, namely:

1. TFIDF + classical statistical model (e.g. RandomForest)
2. LSTM classification model
3. LSTM model, where the embeddings are initialized with pre-trained GloVe vectors
4. fastText model
5. BERT based model (you are advised to use a pre-trained one and finetune, since the resource consumption is considerable!)

You should get over 90% validation accuracy (though nearly 94 is achievable).

You are allowed to use any library or tool, though the Keras environment, and some wrappers on top (e.g. Ktrain), make your life easier.





__Groups__
This assignment is to be completed individually, two weeks after the class has finished. For the precise deadline please see canvas.

__Format of submission__
You need to submit a PDF of your Google Colab notebooks.

__Due date__
Two weeks after the class has finished. For the precise deadline please see canvas.

Grade distribution:
1. TFIDF + classical statistical model (e.g. RandomForest) (25% of the final grade)
2. LSTM classification model (15% of the final grade)
3. LSTM model, where the embeddings are initialized with pre-trained GloVe vectors (15% of the final grade)
4. fastText model (15% of the final grade)
5. BERT based model (you are advised to use a pre-trained one and finetune it, since the resource consumption is considerable!) (30% of the final grade). For BERT you should get over 90% validation accuracy (though nearly 94% is achievable).


__For each of the models, the marks will be awarded according to the following three criteria__:

(1) The (appropriately measured) accuracy of your prediction for the task. The more accurate the prediction is, the better. Note that you need to validate the predictive accuracy of your model on a hold-out of unseen data that the model has not been trained with.

(2) How well you motivate the use of the model: what in this model's structure makes it suited for representing sentiment? After using the model for the task, how well do you evaluate the accuracy you got for each model and discuss the main advantages and disadvantages the model has in this particular modelling task? Ideally, you use part of the modelling to support your arguments.

(3) The consistency of your take-aways, i.e. what you have learned from your analyses. Also, analyze when the model is good and when and where it does not predict well.

Please make sure that you comment with # on the separate steps of the code you have produced. For the verbal description and analyses please insert markdown cells.


__Plagiarism__: The Frankfurt School does not accept any plagiarism. Data science is a collaborative exercise and you can discuss the research question with your classmates from other groups, if you like. You must not copy any code or text though. Plagiarism will be prosecuted and will result in a mark of 0 and you failing this class.

After carefully reading this document and having had a look at the data you may still have questions. Please submit those questions to the public Q&A board in canvas and we will answer each question, so 

# Importing data and libraries (notebook 1)

In [0]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz

In [0]:
# Later on I save preprocessed data and trained models on my Google Drive.
# With these two lines of code I can mount the drive in Colab.
from google.colab import drive 
drive.mount('/content/drive')

In [0]:
import os, re, string
import logging, warnings
import pprint
from glob import glob

import numpy as np
import pandas as pd
import gensim, spacy
from gensim import corpora, models
from gensim.models.phrases import Phrases, Phraser
from gensim.matutils import sparse2full
from sklearn.datasets import load_files
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.metrics import accuracy_score
# sklearn.externals.joblib is no longer available in newer scikit-learn
# releases, so joblib is imported directly
import joblib

In [0]:
%%capture
!pip install spacy gensim
!python -m spacy download en_core_web_sm
!pip install wordcloud bokeh

# 1. TFIDF + classical statistical model



## 1.0 Introduction

To use machine learning for text data, we somehow need to convert the text to numbers. There are several approaches such as Bag of Words (BoW), word2vec or ELMo. In this part of the notebook we want to use term frequency–inverse document frequency (TF-IDF).

TF-IDF builds on BoW. In the BoW approach a document is represented as a multiset ('bag') of its words, disregarding grammar and word order. The BoW for a certain document is a vector with the length of the vocabulary (the unique words) of the corpus. The vector simply contains the frequency of each word in the document. The values of the vector can be used as features for our statistical model.
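As a minimal sketch of the BoW idea, the following snippet builds count vectors for two made-up, already tokenized "reviews". The documents and the helper name `bow_vector` are assumptions for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

# Two toy "reviews", already tokenized (made-up examples, not IMDb data)
docs = [
    ["good", "movie", "good", "cast"],
    ["boring", "movie"],
]

# The vocabulary is the sorted set of unique words in the whole corpus
vocabulary = sorted({word for doc in docs for word in doc})

def bow_vector(doc, vocabulary):
    """Return the word counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

bow = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['boring', 'cast', 'good', 'movie']
print(bow)         # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Note that word order is lost: the vector only records how often each word occurs.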

Every word in the document is equally important in the BoW approach. However, we might want to give words like "the" or "movie" lower weights because these are very common words in the IMDb corpus and do not really distinguish documents from each other, as they are most likely distributed equally among reviews with positive and negative sentiment.

The idea of TF-IDF is to give more weight to words that occur often in a certain document but rarely in the corpus overall. Let us imagine a document that contains the word "film" as often as "boring". We want to weight "boring" more because in the general corpus the word "boring" is less frequent (for the sake of the explanation, let us assume it is less frequent; I did not check). Therefore, "boring" probably contains more valuable information for classifying that one document. This is realized with the following formula:

<img src="https://cdn-images-1.medium.com/max/1600/1*jNnpbGPxkjehlvTCXq9B8g.png" width=45%>
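To make the weighting concrete, here is a small sketch that assumes the common formulation w = tf × log(N / df), where tf is the count of a word in the document, N the number of documents and df the number of documents containing the word. The toy corpus, the base-2 logarithm and the function name `tfidf` are assumptions for illustration. gensim's `TfidfModel` additionally normalizes each document vector by default, so its numbers differ from this sketch.

```python
import math
from collections import Counter

# Toy corpus (made up): "film" occurs in every document, "boring" in only one
docs = [
    ["film", "boring"],
    ["film", "great"],
    ["film", "funny"],
    ["film", "great", "cast"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Weight each word of `doc` by tf * log2(N / df)."""
    term_freq = Counter(doc)
    return {word: tf * math.log2(n_docs / doc_freq[word])
            for word, tf in term_freq.items()}

weights = tfidf(docs[0])
print(weights)  # {'film': 0.0, 'boring': 2.0}
```

Although "film" and "boring" occur equally often in the first document, "boring" receives the higher weight because it is rare in the corpus.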

In the following code we will create a TF-IDF matrix. Each row of that matrix represents one document (review) and each column holds the TF-IDF score of one word.

In terms of machine learning, one row contains the features of the document with which the model will be trained. As there are many words in the corpus and only a few in each document, the feature vector will be sparse. That is why we do some further preprocessing, such as filtering by POS and throwing out stop words, before creating the TF-IDF matrix.


## 1.1 Preprocessing

In [0]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [0]:
# We load the data with scikit-learn's load_files function
train_set = load_files(os.path.join('aclImdb',  'train'), shuffle=False, categories=['neg', 'pos'])
test_set = load_files(os.path.join('aclImdb',  'test'), shuffle=False, categories=['neg', 'pos'])

**Filtering tokens that only consist of alphabetic characters:** 


Sometimes reviewers end their text with quantitative ratings such as "8/10". While this might be useful information for sentiment analysis of movie reviews, I decided to throw it out because it might not generalize, and we want a model that 'decides' based on qualitative text only.
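The effect of this filter can be sketched without spaCy. The snippet below is a simplified stand-in that splits on whitespace instead of using spaCy's tokenizer; the function name `keep_alpha_tokens` and the example sentence are assumptions for illustration.

```python
def keep_alpha_tokens(tokens, min_len=2, max_len=15):
    """Lowercase tokens and keep only purely alphabetic ones of moderate length."""
    kept = []
    for token in tokens:
        lower = token.lower()
        if lower.isalpha() and min_len <= len(lower) <= max_len:
            kept.append(lower)
    return kept

# Whitespace splitting is a simplification; the notebook uses spaCy's tokenizer
tokens = "A great movie , I give it 8/10".split()
print(keep_alpha_tokens(tokens))  # ['great', 'movie', 'give', 'it']
```

The rating "8/10" is dropped because it is not alphabetic, and one-letter tokens such as "A" and "I" are dropped by the length limit.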

In [0]:
%%time
def text_to_words(text):
    # Tokenization of the text with spaCy
    doc = nlp(text)
    token_list = []
    for token in doc:
        lower = token.lower_
        # Keep purely alphabetic tokens with 2 to 15 characters
        if token.is_alpha and 2 <= len(lower) <= 15:
            token_list.append(lower)

    return token_list

# Tokenize the raw (byte) texts of the train set
train_texts = train_set.data
train_texts_words = [text_to_words(text.decode('utf8')) for text in train_texts]

# Tokenize the raw (byte) texts of the test set
test_texts = test_set.data
test_texts_words = [text_to_words(text.decode('utf8')) for text in test_texts]


CPU times: user 7min 23s, sys: 2.9 s, total: 7min 26s
Wall time: 7min 26s


In [0]:
print(train_texts_words[0])

['story', 'of', 'man', 'who', 'has', 'unnatural', 'feelings', 'for', 'pig', 'starts', 'out', 'with', 'opening', 'scene', 'that', 'is', 'terrific', 'example', 'of', 'absurd', 'comedy', 'formal', 'orchestra', 'audience', 'is', 'turned', 'into', 'an', 'insane', 'violent', 'mob', 'by', 'the', 'crazy', 'chantings', 'of', 'it', 'singers', 'unfortunately', 'it', 'stays', 'absurd', 'the', 'whole', 'time', 'with', 'no', 'general', 'narrative', 'eventually', 'making', 'it', 'just', 'too', 'off', 'putting', 'even', 'those', 'from', 'the', 'era', 'should', 'be', 'turned', 'off', 'the', 'cryptic', 'dialogue', 'would', 'make', 'shakespeare', 'seem', 'easy', 'to', 'third', 'grader', 'on', 'technical', 'level', 'it', 'better', 'than', 'you', 'might', 'think', 'with', 'some', 'good', 'cinematography', 'by', 'future', 'great', 'vilmos', 'zsigmond', 'future', 'stars', 'sally', 'kirkland', 'and', 'frederic', 'forrest', 'can', 'be', 'seen', 'briefly']


**Creating bi- and trigrams**

In [0]:
%%time
def create_trigram_phraser(texts_words):
    # Detect frequent bigrams, then trigrams on top of the bigram-merged texts
    bigram_phrases = Phrases(texts_words, threshold=100)
    trigram_phrases = Phrases(bigram_phrases[texts_words], threshold=100)

    # Phraser is a lighter, frozen version of Phrases that is faster to apply
    bigram_phraser = Phraser(bigram_phrases)
    trigram_phraser = Phraser(trigram_phrases)
    return trigram_phraser, bigram_phraser

trigram_phraser, bigram_phraser = create_trigram_phraser(train_texts_words)
train_texts_words = [trigram_phraser[bigram_phraser[words]] for words in train_texts_words]
test_texts_words = [trigram_phraser[bigram_phraser[words]] for words in test_texts_words]



CPU times: user 1min 43s, sys: 129 ms, total: 1min 44s
Wall time: 1min 44s


In [0]:
print(train_texts_words[0])

['story', 'of', 'man', 'who', 'has', 'unnatural', 'feelings', 'for', 'pig', 'starts', 'out', 'with', 'opening', 'scene', 'that', 'is', 'terrific', 'example', 'of', 'absurd', 'comedy', 'formal', 'orchestra', 'audience', 'is', 'turned', 'into', 'an', 'insane', 'violent', 'mob', 'by', 'the', 'crazy', 'chantings', 'of', 'it', 'singers', 'unfortunately', 'it', 'stays', 'absurd', 'the', 'whole', 'time', 'with', 'no', 'general', 'narrative', 'eventually', 'making', 'it', 'just', 'too', 'off', 'putting', 'even', 'those', 'from', 'the', 'era', 'should', 'be', 'turned', 'off', 'the', 'cryptic', 'dialogue', 'would', 'make', 'shakespeare', 'seem', 'easy', 'to', 'third', 'grader', 'on', 'technical', 'level', 'it', 'better', 'than', 'you', 'might', 'think', 'with', 'some', 'good', 'cinematography', 'by', 'future', 'great', 'vilmos', 'zsigmond', 'future', 'stars', 'sally_kirkland', 'and', 'frederic', 'forrest', 'can', 'be', 'seen', 'briefly']
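Only 'sally_kirkland' was merged in this example, which is a consequence of the high `threshold=100`. As far as I understand gensim's default scoring, a word pair is merged when (count(a, b) - min_count) × vocabulary size / (count(a) × count(b)) exceeds the threshold. The sketch below reimplements that formula with made-up counts; the function name `phrase_score` and all numbers are assumptions for illustration, not values from the IMDb corpus.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Collocation score in the style of Mikolov et al. (assumed gensim default)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 100

# A rare pair whose words almost always occur together (made-up counts)
rare_pair = phrase_score(count_a=20, count_b=25, count_ab=18, vocab_size=5000)

# A frequent pair of common words that also occur separately (made-up counts)
common_pair = phrase_score(count_a=2000, count_b=3000, count_ab=500, vocab_size=5000)

print(rare_pair > threshold)    # True
print(common_pair > threshold)  # False
```

Names of actors fit the first pattern, while combinations of common words such as "not good" behave like the second one and stay separate tokens.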


**Filtering non-stopwords and ['NOUN', 'ADJ', 'VERB', 'ADV', 'INTJ']:**
Nouns, adjectives, verbs, adverbs and interjections should contain the most useful information. Therefore, we keep all tokens with these POS. In the first run I did not keep interjections ('ahh', 'hmmm') and got slightly worse results.

In [0]:
%%time
def filter_words_to_lemmas(texts_words):
    filtered_texts_lemmas = []
    keep_pos = ['NOUN', 'ADJ', 'VERB', 'ADV', 'INTJ']

    for words in texts_words:
        # Rebuild a Doc from the token list and run only the tagger on it
        doc = spacy.tokens.Doc(nlp.vocab, words=words)
        tagged = nlp.get_pipe("tagger")(doc)
        # Keep the lemmas of content words that are not stop words
        lemmas = [word.lemma_ for word in tagged
                  if word.pos_ in keep_pos and not word.is_stop]
        filtered_texts_lemmas.append(lemmas)

    return filtered_texts_lemmas

train_texts_words = filter_words_to_lemmas(train_texts_words)
test_texts_words = filter_words_to_lemmas(test_texts_words)

CPU times: user 5min 49s, sys: 1.51 s, total: 5min 50s
Wall time: 5min 51s


In [0]:
print(train_texts_words[0])

['story', 'man', 'unnatural', 'feeling', 'pig', 'start', 'open', 'scene', 'terrific', 'example', 'absurd', 'comedy', 'formal', 'orchestra', 'audience', 'turn', 'insane', 'violent', 'mob', 'crazy', 'chanting', 'singer', 'unfortunately', 'stay', 'absurd', 'time', 'general', 'narrative', 'eventually', 'make', 'put', 'era', 'turn', 'cryptic', 'dialogue', 'shakespeare', 'easy', 'grader', 'technical', 'level', 'better', 'think', 'good', 'cinematography', 'future', 'great', 'vilmos', 'zsigmond', 'future', 'star', 'sally_kirkland', 'frederic', 'forrest', 'see', 'briefly']


**Creating a TF-IDF matrix** 

In [0]:
# Creating id2word and TfidfModel based on the train set only to not leak information.
# Then applying them to the test set...

# Create the dictionary on the train data and keep the 5000 most frequent tokens
id2word = gensim.corpora.Dictionary(train_texts_words)
id2word.filter_extremes(keep_n=5000)

# Bag-of-words counts (token id, frequency) for the train set
train_corpus = [id2word.doc2bow(words) for words in train_texts_words]

# Bag-of-words counts for the test set, using the train dictionary
test_corpus = [id2word.doc2bow(words) for words in test_texts_words]

# Fit Tfidf on train
tfidf = gensim.models.TfidfModel(train_corpus) 

In [0]:
for doc in train_corpus[:2]:
    print([[id2word[id], freq] for id, freq in doc])

[['absurd', 2], ['audience', 1], ['better', 1], ['briefly', 1], ['cinematography', 1], ['comedy', 1], ['crazy', 1], ['dialogue', 1], ['easy', 1], ['era', 1], ['eventually', 1], ['example', 1], ['feeling', 1], ['future', 2], ['general', 1], ['good', 1], ['great', 1], ['insane', 1], ['level', 1], ['make', 1], ['man', 1], ['mob', 1], ['narrative', 1], ['open', 1], ['pig', 1], ['put', 1], ['scene', 1], ['see', 1], ['shakespeare', 1], ['singer', 1], ['star', 1], ['start', 1], ['stay', 1], ['story', 1], ['technical', 1], ['terrific', 1], ['think', 1], ['time', 1], ['turn', 2], ['unfortunately', 1], ['violent', 1]]
[['dialogue', 1], ['good', 2], ['great', 3], ['make', 1], ['open', 2], ['scene', 4], ['see', 1], ['star', 1], ['start', 1], ['stay', 1], ['think', 2], ['time', 4], ['unfortunately', 1], ['acting', 1], ['action', 1], ['actually', 1], ['add', 1], ['adventure', 1], ['air', 4], ['airport', 11], ['alright', 1], ['await', 1], ['bad', 2], ['badly', 1], ['bang', 1], ['barely', 1], ['belong

In [0]:
for doc in tfidf[train_corpus][:2]:
    print([[id2word[id], np.around(freq, decimals=2)] for id, freq in doc])

[['absurd', 0.38], ['audience', 0.1], ['better', 0.11], ['briefly', 0.22], ['cinematography', 0.14], ['comedy', 0.1], ['crazy', 0.15], ['dialogue', 0.12], ['easy', 0.14], ['era', 0.16], ['eventually', 0.15], ['example', 0.12], ['feeling', 0.13], ['future', 0.29], ['general', 0.15], ['good', 0.03], ['great', 0.05], ['insane', 0.2], ['level', 0.13], ['make', 0.06], ['man', 0.07], ['mob', 0.22], ['narrative', 0.18], ['open', 0.13], ['pig', 0.24], ['put', 0.15], ['scene', 0.06], ['see', 0.04], ['shakespeare', 0.21], ['singer', 0.19], ['star', 0.09], ['start', 0.08], ['stay', 0.13], ['story', 0.05], ['technical', 0.19], ['terrific', 0.17], ['think', 0.05], ['time', 0.04], ['turn', 0.17], ['unfortunately', 0.13], ['violent', 0.17]]
[['dialogue', 0.03], ['good', 0.01], ['great', 0.04], ['make', 0.01], ['open', 0.06], ['scene', 0.05], ['see', 0.01], ['star', 0.02], ['start', 0.02], ['stay', 0.03], ['think', 0.02], ['time', 0.03], ['unfortunately', 0.03], ['acting', 0.02], ['action', 0.02], ['a

**Setting up data for scikit learn**

In [0]:
# for scikit-learn we need to transform the sparse (id, weight) tuples to a numpy array
X_train = np.vstack([sparse2full(doc, len(id2word)) for doc in tfidf[train_corpus]])
X_test = np.vstack([sparse2full(doc, len(id2word)) for doc in tfidf[test_corpus]])

# getting the labels (0 = neg, 1 = pos) from the loaded data sets
y_train = train_set.target
y_test = test_set.target
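What the conversion from sparse tuples to a dense row does can be sketched in plain Python. The helper `sparse_to_dense` is a hypothetical stand-in for `sparse2full`, and the example values are made up.

```python
def sparse_to_dense(sparse_doc, length):
    """Expand a list of (term_id, weight) tuples into a dense list of `length` floats."""
    dense = [0.0] * length
    for term_id, weight in sparse_doc:
        dense[term_id] = weight
    return dense

# A document with non-zero weights for term ids 0 and 3, in a vocabulary of 5 terms
sparse_doc = [(0, 0.38), (3, 0.1)]
print(sparse_to_dense(sparse_doc, 5))  # [0.38, 0.0, 0.0, 0.1, 0.0]
```

With 5000 dictionary entries and only a few dozen distinct words per review, most entries of each row are zero.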

In [0]:
# save preprocessed data
np.save("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/X_train2.npy", X_train)
np.save("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/X_test2.npy", X_test)
np.save("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/y_train2.npy", y_train)
np.save("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/y_test2.npy", y_test)

In [0]:
X_train = np.load("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/X_train2.npy")
X_test = np.load("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/X_test2.npy")
y_train = np.load("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/y_train2.npy")
y_test = np.load("/content/drive/My Drive/Coding/NLP_project/preprocessed_data/y_test2.npy")

## 1.2 Modelling

**Random forest**

In [0]:
%%time
# Setting up grid search for a random forest
rf_params = {
    'bootstrap': [True],
    'max_depth': [5, 10, 30, 50],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 5],
    'min_samples_split': [8, 12],
    'n_estimators': [100, 300, 1000]
}
# Create a model
rf = RandomForestClassifier(n_jobs=2)
# Instantiate the grid search model
rf_gs = GridSearchCV(estimator=rf, param_grid=rf_params, cv=3)
rf_gs = rf_gs.fit(X_train, y_train)

CPU times: user 13min 5s, sys: 45.1 s, total: 13min 50s
Wall time: 52min 55s
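The long wall time follows from the size of the grid. As a quick sketch, the number of model fits can be counted from the parameter grid above (the grid is copied from the cell; the variable names `n_combinations` and `n_fits` are my own):

```python
from itertools import product

# Same grid as in the grid search cell above
rf_params = {
    'bootstrap': [True],
    'max_depth': [5, 10, 30, 50],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 5],
    'min_samples_split': [8, 12],
    'n_estimators': [100, 300, 1000]
}

# Every combination of parameter values is one candidate model
n_combinations = len(list(product(*rf_params.values())))
cv_folds = 3
n_fits = n_combinations * cv_folds

print(n_combinations)  # 96
print(n_fits)          # 288
```

On top of these cross-validation fits, the grid search fits the best candidate once more on the whole training set.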


In [0]:
# get best score from grid search
print('RF best CV accuracy: ' + str(round(rf_gs.best_score_, 3)))

RF best CV accuracy: 0.844


In [0]:
# best_estimator_ is already refit on the full training set (GridSearchCV's
# default refit=True); the extra fit below retrains it with the same parameters
rf = rf_gs.best_estimator_
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
rf_acc = str(round(accuracy_score(y_test, y_pred), 3))
print('RF accuracy on test set: ' + rf_acc)

# save model
joblib.dump(rf, 
            "/content/drive/My Drive/Coding/NLP_project/preprocessed_data/rf2.sav")

RF accuracy on test set: 0.848


['/content/drive/My Drive/Coding/NLP_project/preprocessed_data/rf2.sav']

**Elastic net**

In [0]:
%%time
# Setting up grid search for an elastic net
net_params = {"alpha": np.arange(0.1, 0.5, 0.05),
              "l1_ratio": np.arange(0.0, 0.5, 0.1)}
# Create a model (loss='log' gives logistic regression; newer scikit-learn
# releases call this loss 'log_loss')
net = linear_model.SGDClassifier(n_jobs=-1, loss='log', penalty='elasticnet')
# Instantiate the grid search model
net_classifier_gs = GridSearchCV(net, net_params, cv=3)
net_classifier_gs = net_classifier_gs.fit(X_train, y_train)

CPU times: user 6min 48s, sys: 14.2 s, total: 7min 2s
Wall time: 6min 43s


In [0]:
# get best score from grid search
print('Elastic net best CV accuracy: ' + str(net_classifier_gs.best_score_))

Elastic net best CV accuracy: 0.83004


In [0]:
# best_estimator_ is already refit on the full training set (GridSearchCV's
# default refit=True); the extra fit below retrains it with the same parameters
best_net_classifier = net_classifier_gs.best_estimator_
best_net_classifier.fit(X_train, y_train)
y_pred = best_net_classifier.predict(X_test)
el_acc = str(round(accuracy_score(y_test, y_pred), 3))
print('Elastic net best accuracy on test set: ' + el_acc)

# save model
joblib.dump(best_net_classifier,
            "/content/drive/My Drive/Coding/NLP_project/preprocessed_data/best_net_classifier2.sav")

Elastic net best accuracy on test set: 0.837


['/content/drive/My Drive/Coding/NLP_project/preprocessed_data/best_net_classifier2.sav']

In [0]:
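coefficients = best_net_classifier.coef_
# argsort returns the feature indices ordered from smallest to largest coefficient
index = coefficients.argsort()
smallest_50 = index[0, 0:50].tolist()
# The 50 largest coefficients, still in ascending order (strongest one last)
highest_50 = index[0, -50:].tolist()
# Look up feature names
smallest_50_words = [id2word[ind] for ind in smallest_50]
highest_50_words = [id2word[ind] for ind in highest_50]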

## 1.3 Evaluation

In [0]:
# Summary
print('Elastic net best CV accuracy: ' + str(round(net_classifier_gs.best_score_,3)))
print('Elastic net best accuracy on test set: ' + el_acc)
print('Random forest best CV accuracy: ' + str(round(rf_gs.best_score_, 3)))
print('Random forest accuracy on test set: ' + rf_acc)

Elastic net best CV accuracy: 0.83
Elastic net best accuracy on test set: 0.837
Random forest best CV accuracy: 0.844
Random forest accuracy on test set: 0.848


With the rather simple approach of creating TF-IDF scores and using these as features for a 'classical' classifier, we achieved an accuracy of about 85% (0.848) on the test set with a random forest. This is a decent baseline for benchmarking more complicated models.

In the following, the 10 features with the smallest coefficients (of the elastic net) are printed out, together with 10 of the 50 features with the highest coefficients. Because `highest_50_words` is sorted in ascending order, its first 10 entries are ranks 41 to 50 among the positive features, not the 10 strongest ones. Features with large absolute coefficients impact the model's decision the most.

Unsurprisingly, the words that drive the model towards classifying a review as positive are mostly single words that one would associate with a positive sentiment, and vice versa.

(Interestingly, the word "war" drives the model towards a positive prediction. This could be overfitting to the IMDb corpus. Maybe viewers judge war movies more positively in general.)
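As a small sketch of how features can be ranked by their coefficients, the snippet below sorts a handful of hypothetical coefficients. The words and values are made up for illustration and are not the fitted coefficients of the elastic net.

```python
# Hypothetical coefficients, made up for illustration (not the fitted values)
coefficients = {
    "bad": -2.1, "waste": -1.8, "plot": -0.4,
    "war": 0.3, "great": 1.9, "excellent": 2.2,
}

# Sort the words from the most negative to the most positive coefficient
ranked = sorted(coefficients, key=coefficients.get)

most_negative = ranked[:2]
# Reverse the ascending order so that the strongest positive word comes first
most_positive = ranked[::-1][:2]

print(most_negative)  # ['bad', 'waste']
print(most_positive)  # ['excellent', 'great']
```

Taking a slice from the end of an ascending list, such as `ranked[-2:]`, returns the same words but with the weaker one first, which matters when only the first few entries are printed.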

In [0]:
print('Words indicating a negative sentiment:')
print(smallest_50_words[:10])
print('Words indicating a positive sentiment:')
print(highest_50_words[:10])

Words indicating a negative sentiment:
['bad', 'waste', 'awful', 'terrible', 'stupid', 'horrible', 'plot', 'poor', 'money', 'minute']
Words indicating a positive sentiment:
['beautifully', 'powerful', 'incredible', 'perfectly', 'unique', 'outstanding', 'simple', 'war', 'live', 'season']


The biggest problem of the model is that it only evaluates the impact of single words and disregards their context. So, word combinations like "not good" or "I do not like" can be misinterpreted.
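A short sketch makes the loss of context visible: two made-up sentences with opposite meaning end up with exactly the same bag of words.

```python
from collections import Counter

# Two made-up sentences with opposite sentiment
a = "the movie was good not bad".split()
b = "the movie was bad not good".split()

# The token sequences differ, but the word counts are identical
print(a == b)                    # False
print(Counter(a) == Counter(b))  # True
```

Any model that only sees these counts, or TF-IDF scores derived from them, has to assign both sentences the same prediction.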

Evaluating the review without context is often not sufficient -- for example, when one uses irony or talks positively about a movie with a very dark topic.

To illustrate this, let us look at the following review. 

In [0]:
#load the model
path = '/content/drive/My Drive/Coding/NLP_project/preprocessed_data/'
model = joblib.load(path + 'best_net_classifier2.sav')
doc = 19
pp = pprint.PrettyPrinter()
pp.pprint(test_set.data[doc].decode('utf-8'))
print('\n predicted label: ' + str(model.predict(X_test[doc, :].reshape(1, -1))[0]))
print('\n true label: ' + str(y_test[doc]))

('New York family is the last in their neighborhood to get a television set, '
 "which nearly ruins David Niven's marriage to Mitzi Gaynor. Bedroom comedy "
 'that rarely ventures into the bedroom(and nothing sexy happens there '
 'anyway). Gaynor as an actress has about as much range as an oven--she turns '
 "on, she turns off. Film's sole compensation is a supporting performance by "
 'perky Patty Duke, pre-"Miracle Worker", as Niven\'s daughter. She\'s '
 'delightful; "Happy Anniversary" is not. * from ****')

 predicted label: 1

 true label: 0


The true label of this document is negative (0). For humans, the sentence "Gaynor as an actress has about as much range as an oven--she turns on, she turns off." is clearly meant negatively. However, none of its words has a negative connotation by itself. The context is what makes us understand the real sentiment. The words "delightful" and "Happy" have a positive connotation and are more frequent in positive reviews, leading the model to falsely predict the review as positive (1).

To conclude, we need context to further increase the accuracy. Therefore, we need to use another model such as an LSTM that can handle sentences. And, of course, we need to preprocess the data differently. We need sequences instead of BoW.

(Surely, we could try to improve the accuracy of the TF-IDF approach by playing around with the preprocessing. Another trick is to create bigrams and trigrams, which contain at least some local context. In fact we did this during preprocessing, but we also set a threshold and kept only very few of the bi- and trigrams. We could play around more with the threshold. We could think more about which words to filter out and which to keep according to their POS tag. We could use different statistical models on top and try to tune them further.)
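How bigrams bring back some local context can be sketched as follows. The helper `ngrams` and the example sentences are assumptions for illustration; unlike the phrase detection used during preprocessing, this sketch keeps every bigram without any threshold.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two made-up sentences that contain the same words in a different order
a = "the movie was good not bad".split()
b = "the movie was bad not good".split()

bigrams_a = ngrams(a, 2)
bigrams_b = ngrams(b, 2)

print(bigrams_a)
# ['the_movie', 'movie_was', 'was_good', 'good_not', 'not_bad']
print(sorted(set(bigrams_a) - set(bigrams_b)))
# ['good_not', 'not_bad', 'was_good']
```

With bigrams as additional features, the two sentences are no longer indistinguishable, at the price of a much larger vocabulary.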
