<a href="https://colab.research.google.com/github/FelixSchmid/Sentiment_Analysis/blob/master/2_Sentiment_Analysis_IMBD_LSTM_fastText_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing data and libraries (notebook 2)

In [0]:
%%capture
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz

In [0]:
# Mounting Google Drive in Colab to import the GloVe vectors
# and save models
from google.colab import drive 
drive.mount('/content/drive')

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"; 

In [0]:
%%capture
! pip install ktrain
! pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
import ktrain
from ktrain import text
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.layers import Dropout, Bidirectional, GlobalAveragePooling1D
from tensorflow.keras import backend as K 
from tensorflow.keras.initializers import Constant
import numpy as np
from sklearn.model_selection import train_test_split
! pip install keras-self-attention
os.environ['TF_KERAS'] = '1'
from keras_self_attention import SeqSelfAttention, SeqWeightedAttention
from sklearn.metrics import classification_report

# 2. LSTM Model

## 2.0 Introduction

So far, we have achieved a decent baseline with the TFIDF model. However, because TFIDF is based on a bag-of-words (BoW) representation, it cannot properly process contextual information (except locally, with the help of n-grams).

In this section, we will use an LSTM, which is well suited to sequential data such as time series. A sentence or a text can be interpreted as such a sequence, with each word being one time step. The simplest approach would be a single LSTM layer that reads the sequence from left to right. The key part of the LSTM is the cell state, which acts as the memory of the network.

In our context, that is where information from previous words can be stored. Three gates (input gate, output gate, forget gate) regulate which information from the current word is written to the cell state, which information is kept, and which information is passed on to the next time step. The behaviour of the gates is learned during training; in other words, the model learns what to store, output or forget. LSTMs were introduced by [Hochreiter & Schmidhuber (1997)](http://www.bioinf.jku.at/publications/older/2604.pdf).
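
To make the role of the gates more concrete, here is a minimal NumPy sketch of a single LSTM time step. It is not the Keras implementation: the function name `lstm_step`, the stacking order of the gates inside `W`, `U` and `b`, and the random toy weights are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,).
    Assumed row order: input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g             # updated memory
    h_t = o * np.tanh(c_t)               # output passed to the next time step
    return h_t, c_t

# Toy dimensions and random weights, only to show the data flow.
rng = np.random.default_rng(0)
input_dim, units, timesteps = 3, 2, 5
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(timesteps, input_dim)):  # one "word" per time step
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

If the forget gate stays close to 1 and the input gate close to 0, `c_t` is simply a copy of `c_prev`. That is the mechanism by which information from an early word can survive many time steps.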

<img src=https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png width="500">


[Illustration from colah's blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)



In theory, a word can be 'remembered' in the cell state over an arbitrary number of time steps. In our context, that means that the first word of a sequence could be stored until the end of the sequence. So, contextual information is processed not only locally but ideally over the whole sequence.

However, there is a limitation: an LSTM can carry information in one direction only. That is a problem, because the words that follow a word are just as relevant for understanding its context as the words before it.

Therefore, we will use a bidirectional LSTM. This architecture has two parallel LSTMs: one reads the sequence in its normal order, the other in reverse. The outputs of both LSTMs are then concatenated, so the resulting vector can contain information from both the beginning and the end of the sequence.
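
The wiring of the two directions can be sketched in a few lines of NumPy. To keep the example short, a plain tanh RNN stands in for the LSTM; `run_rnn`, `bidirectional` and the random toy weights are hypothetical names and values, not part of Keras or ktrain.

```python
import numpy as np

def run_rnn(inputs, W, U):
    """Run a plain tanh RNN over `inputs` (timesteps, input_dim); return all hidden states."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

def bidirectional(inputs, fwd_weights, bwd_weights):
    forward = run_rnn(inputs, *fwd_weights)
    # Feed the reversed sequence, then flip the outputs back so that
    # row t of both arrays refers to the same word.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 4, 3
inputs = rng.normal(size=(timesteps, input_dim))
fwd = (rng.normal(size=(units, input_dim)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(units, input_dim)), rng.normal(size=(units, units)))

outputs = bidirectional(inputs, fwd, bwd)
print(outputs.shape)  # (6, 6): 3 forward plus 3 backward features per time step
```

The doubled feature dimension is the same effect that turns an `LSTM(100)` into an output of width 200 once it is wrapped in `Bidirectional`.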


## 2.1 Preprocessing

In my first notebook on sentiment analysis with TFIDF, I wrote the importing and preprocessing manually using spaCy and gensim. However, ktrain already has a built-in function for efficiently importing and preprocessing text data. The function can even create n-grams on the fly if wanted. It also prints a handy overview of the data:

In [0]:
# Importing data
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('aclImdb', 
                                                                         max_features=20000,
                                                                         maxlen=400,
                                                                         preprocess_mode='standard',
                                                                         classes=['pos', 'neg'])

detected encoding: utf-8
language: en
Word Counts: 88582
Nrows: 25000
25000 train sequences
Average train sequence length: 231
x_train shape: (25000,400)
y_train shape: (25000,2)
25000 test sequences
Average test sequence length: 224
x_test shape: (25000,400)
y_test shape: (25000,2)


In [0]:
# Here, we split the test data to receive a holdout set. We do not want to 
# leak information by tuning hyperparameters
x_test, x_holdout, y_test, y_holdout = train_test_split(x_test, 
                                                        y_test, 
                                                        test_size=0.5, 
                                                        random_state=42)

**Data after preprocessing**

The maximum length after preprocessing is set to 400. When a sequence (movie review) is shorter, it is padded with 0 at the beginning, as the output below shows. Each word is represented by an integer; the vocabulary is limited to 20000 features, so the indices range from 0 to 19999, with 0 serving as the padding value.
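
The padding step can be illustrated with a small pure-Python sketch. `pad_pre` is a hypothetical helper, not a ktrain function. Cutting overly long reviews at the front is an assumption that mirrors the defaults of Keras' `pad_sequences`; I have not verified that ktrain truncates on the same side.

```python
def pad_pre(token_ids, maxlen, pad_value=0):
    """Left-pad (or left-truncate) a list of word indices to exactly `maxlen` entries."""
    trimmed = token_ids[-maxlen:]  # keep the last `maxlen` tokens
    return [pad_value] * (maxlen - len(trimmed)) + trimmed

short_review = [12, 7, 305]
long_review = list(range(1, 11))

print(pad_pre(short_review, maxlen=6))  # [0, 0, 0, 12, 7, 305]
print(pad_pre(long_review, maxlen=6))   # [5, 6, 7, 8, 9, 10]
```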

In [0]:
print('feature vector of first document in train set:')
print(x_train[0])
print('target vector of first document in train set:')
print(y_train[0])
print('highest word index in train set: ' + str(np.amax(x_train)))

feature vector of first document in train set:
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
    

The embedding layer maps each word index to a vector of dimension 100. These embedding vectors are trained together with the rest of the model.
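
Conceptually, an embedding layer is just a lookup table indexed by word id. The following NumPy sketch uses a randomly initialised table (`toy_embeddings`, a made-up stand-in for the trainable weights) to show how a padded review of 400 indices turns into a 400 x 100 matrix.

```python
import numpy as np

vocab_size, embedding_dim, maxlen = 20000, 100, 400
rng = np.random.default_rng(0)
# Random stand-in for the weights the model would learn during training.
toy_embeddings = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A padded review: zeros first, then three word indices.
review = np.zeros(maxlen, dtype=int)
review[-3:] = [12, 7, 305]

embedded = toy_embeddings[review]  # row lookup, one vector per position
print(embedded.shape)  # (400, 100)
```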

## 2.2 Modelling

In [0]:
def load_model(input_shape):
    model = Sequential()
    # add 1 for padding token
    model.add(Embedding(19999+1, 100, mask_zero=True, 
                        input_length=input_shape[0]))
    model.add(SpatialDropout1D(0.4))
    model.add(Bidirectional(LSTM(100, dropout=0.4, 
                                 return_sequences=True)))
    model.add(Bidirectional(LSTM(100, dropout=0.4, 
                                 return_sequences=True)))
    model.add(SeqWeightedAttention())
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    model.summary()
    return model


K.clear_session()
input_shape = ((400),)
lstm_model = load_model(input_shape)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 100)          2000000   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 400, 100)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 400, 200)          160800    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 400, 200)          240800    
_________________________________________________________________
seq_weighted_attention (SeqW (None, 200)               201       
_________________________________________________________________
dense (Dense)                (None, 2)                 402       
Total params: 2,402,203
Trainable params: 2,402,203
Non-trainable params: 0
______________________________________________
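
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The helper names are made up, and the count for the attention layer is simply taken from the summary (200 weights plus one bias) rather than derived from the library code.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def bilstm_params(input_dim, units):
    # Four weight blocks (three gates plus the candidate), each with input
    # weights, recurrent weights and a bias; doubled for the two directions.
    return 2 * 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, outputs):
    return input_dim * outputs + outputs

layers = {
    "embedding": embedding_params(20000, 100),
    "bidirectional": bilstm_params(100, 100),
    "bidirectional_1": bilstm_params(200, 100),  # input is the 200-wide output of the first BiLSTM
    "seq_weighted_attention": 200 + 1,           # value as reported in the summary
    "dense": dense_params(200, 2),
}
for name, count in layers.items():
    print(f"{name:<24}{count:>10,}")
print(f"{'total':<24}{sum(layers.values()):>10,}")
```

Most of the parameters (2,000,000 of 2,402,203) sit in the embedding layer, which is worth keeping in mind when comparing trainable and frozen embeddings later on.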

In [0]:
lstm_learner = ktrain.get_learner(lstm_model, 
                                  train_data=(x_train, y_train), 
                                  val_data=(x_test, y_test), 
                                  batch_size=64)

In [0]:
#find a good learning rate
#lstm_learner.lr_find()
#lstm_learner.lr_plot()

In [0]:
lstm_learner.autofit(0.01, 
                     5, 
                     early_stopping=1, 
                     checkpoint_folder='/content/drive/My Drive/models/lstm')



begin training using triangular learning rate policy with max lr of 0.01...
Train on 25000 samples, validate on 12500 samples
Epoch 1/2
Epoch 2/2
Epoch 00002: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f5037364048>

In [0]:
lstm_learner.save_model(
    '/content/drive/My Drive/models/lstm/lstm_model')

## 2.3 Evaluation

In [0]:
# I used a custom layer for attention. Unfortunately, that prevents me from
# loading the saved model directly. Luckily, I can still recreate the model and
# then load the weights of my best epoch.
lstm_learner.model.load_weights('/content/drive/My Drive/models/lstm/weights-01.hdf5')

**Analysing the top losses**

ktrain has some very nice functions for evaluating a model. In the following, we print the 5 movie reviews with the highest loss. This helps us understand in which cases the model performs poorly. 
If the model fails on reviews that are difficult to evaluate even for humans, or that are labeled wrongly, that is not a bad sign. 

It is not bad in the sense that the model does not base its predictions on arbitrary signals but has learned an understanding similar to a human one.
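
To get a feeling for the loss values, here is a small sketch of how a ranking by loss can be computed. It assumes categorical cross-entropy with the natural logarithm, so the loss of one example is the negative log of the probability assigned to its true class. The probabilities are made up and `cross_entropy` is a hypothetical helper, not a ktrain function.

```python
import math

def cross_entropy(true_class_probability):
    """Loss of a single example: negative log-probability of the true class."""
    return -math.log(true_class_probability)

# (document id, predicted probability of the TRUE class); made-up values
predictions = [(0, 0.98), (1, 0.60), (2, 0.0002), (3, 0.35), (4, 0.01)]

losses = [(doc_id, cross_entropy(p)) for doc_id, p in predictions]
top = sorted(losses, key=lambda pair: pair[1], reverse=True)[:3]
for doc_id, loss in top:
    print(f"id:{doc_id} | loss:{loss:.2f}")
```

Under this assumption, a loss of about 8.5 means the model gave the true class a probability of roughly 0.0002, so the reviews at the top of the list are the ones where the model was confidently wrong.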

In [0]:
lstm_learner.view_top_losses(n=5, preproc=preproc)

----------
id:3508 | loss:8.5 | true:pos | pred:neg)

it's not citizen kane but it does deliver cleavage and lots of it br br badly acted and directed poorly scripted who cares i didn't watch it for the dialog
----------
id:10050 | loss:8.33 | true:pos | pred:neg)

black tar can't be there's a documentary dark end of the street about s f street punks and b t abuse not bad quite heavy in wasted there's this stuff that looks like coke but should be something else no big deal black tar can't be there's a documentary dark end of the street about s f street punks and b t abuse not bad quite heavy in wasted there's this stuff that looks like coke but should be something else no big deal black tar can't be there's a documentary dark end of the street about s f street punks and b t abuse not bad quite heavy in wasted there's this stuff that looks like coke but should be something else no big deal
----------
id:1924 | loss:7.81 | true:pos | pred:neg)

david morse and andre are very talented act

**Interpretation of the errors**


*   id 3508: The author says the movie has a bad script and bad acting, but does not care. So, the "true sentiment" lies in a grey area, leaning slightly positive because the reviewer was entertained in some way. It is labeled positive, but the model predicted negative. The model failed here but, to be fair, it is a really difficult decision.

*   id 10050: This is a really interesting review. The movie is about drug addiction and, frankly, the author of the review seems to have some experience with it. He repeats one passage three times and the whole review is written in a very messy way. I googled it and found out that he actually gave the movie 7 out of 10 stars. I could not have derived that from the text, and I am not alone: "0 out of 2 found this helpful." (https://www.imdb.com/review/rw1177735/?ref_=tt_urv). I think it is no shame that the model predicted the sentiment wrongly. 

*   id 1924: This review is labeled incorrectly. It seems pretty clear that it is a negative review, so the model predicted the sentiment correctly.
*   id 3998: This review is mislabeled, too. The model predicted it correctly.
*   id 7726: I cannot decide whether the author liked the movie or found it bad in an entertaining way and meant the review ironically.

In [0]:
lstm_predictor = ktrain.get_predictor(lstm_learner.model, preproc)

In [0]:
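from sklearn.datasets import load_files

# Load the raw test reviews (as bytes) so single documents can be inspected
test_b = load_files(os.path.join('aclImdb',  'test'), 
                    shuffle=False, 
                    categories=['neg', 'pos'])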

**Did the model learn to process contextual information?**

Remember that our statistical model based on TFIDF had an accuracy of 85% on validation data. However, it struggled with document 19. I believe that is because the meaning of document 19 is heavily embedded in the context. The author does not use words that by themselves signal a strong negative sentiment, but compares an actress to an oven. The strongest words on their own are "delightful" and "happy", which falsely indicate a positive sentiment.

In theory, an LSTM model is able to capture such relationships by memorizing information from earlier words. We have used a bidirectional LSTM that reads the sentence once backwards and once in its normal order and then concatenates the results. 

In the following, we can see that the LSTM model actually classified the document correctly, unlike the classical statistical model, which is based on a bag of words. That seems to indicate that the theory works. To be honest, it could also be mere luck. To really evaluate this, I would need to manually check many more examples.

In [0]:
print("Predicted label: "+ lstm_predictor.predict(test_b.data[19].decode('utf-8')))
print("True label: "+ str(test_b.target[19]))

Predicted label: neg
True label: 0


In [0]:
lstm_predictor.explain(test_b.data[19].decode('utf-8'))

Contribution?,Feature
0.707,<BIAS>
-0.221,Highlighted in text (sum)


**Final validation on the holdout dataset**

Lastly, let us validate the model on the holdout data set. Note that I split the test set into a new test set and a holdout set (50/50) before training. That way, we make sure that we do not overfit to the test data by tuning the hyperparameters of the model.

The results are quite good. We gained 5 percentage points of accuracy compared to the model based on TFIDF (90% versus 85%). Furthermore, recall and precision are well balanced between both classes.

In [0]:
print(classification_report(y_holdout[:,1], 
                            np.around(lstm_predictor.model.predict(x_holdout)[:,1]),
                            target_names=['neg', 'pos']))

              precision    recall  f1-score   support

         neg       0.90      0.90      0.90      6308
         pos       0.89      0.90      0.89      6192

    accuracy                           0.90     12500
   macro avg       0.90      0.90      0.90     12500
weighted avg       0.90      0.90      0.90     12500



One disadvantage of the model is that it requires a long time for training (about half an hour for one epoch) and that there are many hyperparameters you need to optimize.

To further improve the model, we could try to add a convolution layer that "extracts the higher-level phrase representations from the word embedding vectors" as for example proposed by Liu and Guo (https://doi.org/10.1016/j.neucom.2019.01.078).

Also, we could add a second attention layer between the two LSTM layers. I tried this in another notebook in parallel and it slightly improved the accuracy. Unfortunately, I have no time to rerun all my LSTM models, so I leave it like this.

# 3. LSTM model with pre-trained GloVe vectors

## 3.0 Introduction

The benefit of using pretrained word vectors is that training does not start with zero information. Pretrained vectors already contain structure: similar words are closer together and, depending on the method, other logical relationships such as word analogies are embedded as well.

Common methods are, for example, GloVe or skip-gram. The underlying idea for training the vectors is that words can be described by their context.

"You shall know a word by the company it keeps" (Firth, 1957)

So, the basis of GloVe vectors is a co-occurrence matrix like the one created in the first notebook. GloVe vectors were introduced by [Pennington, Socher and Manning (2014).](https://nlp.stanford.edu/pubs/glove.pdf) They also provide several pre-trained vector sets. I used the set trained on Wikipedia with a dimension of 100.

<img src=https://nlp.stanford.edu/projects/glove/images/man_woman.jpg width="500">

[Visualization of analogies of GloVe vectors](https://nlp.stanford.edu/projects/glove/)
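
"Closer together" is usually measured with cosine similarity. The following sketch shows the computation on hand-made three-dimensional toy vectors. These are not real GloVe values; they are chosen only so that the two sentiment words point in a similar direction.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional toy vectors, NOT real GloVe values.
vectors = {
    "good":  [0.9, 0.8, 0.1],
    "great": [1.0, 0.7, 0.2],
    "oven":  [-0.2, 0.1, 0.9],
}

print(round(cosine_similarity(vectors["good"], vectors["great"]), 3))
print(round(cosine_similarity(vectors["good"], vectors["oven"]), 3))
```

With the real vectors, the same function could be applied to entries of the `embeddings_index` dictionary that is built in the next section.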

## 3.1 Loading GloVe vectors

To create the embedding matrix, I used and slightly modified the code from this Keras tutorial on using pretrained vectors: https://keras.io/examples/pretrained_word_embeddings/

In [0]:
DIR = '/content/drive/My Drive/word_vectors'
MAX_SEQUENCE_LENGTH = 400
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100


print('Indexing word vectors.')
embeddings_index = {}
with open(os.path.join(DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))


# This is the index-to-word dictionary which ktrain created
# while importing the IMDb data. We now need to connect it
# to the imported GloVe vectors
word_index = preproc.tok.index_word

# Preparation of the embedding matrix
print('Preparing embedding matrix.')
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for i, word in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

Indexing word vectors.
Found 400000 word vectors.
Preparing embedding matrix.
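
Words without a GloVe vector keep an all-zero row in the embedding matrix, so it can be useful to check how many words are actually covered. Below is a minimal sketch of such a check. `embedding_coverage` is a hypothetical helper, and the two toy dictionaries merely stand in for `preproc.tok.index_word` and `embeddings_index`.

```python
def embedding_coverage(index_word, embeddings_index, max_words):
    """Return (found, considered, missing_words) for indices below `max_words`."""
    considered = {i: w for i, w in index_word.items() if i < max_words}
    missing = [w for w in considered.values() if w not in embeddings_index]
    return len(considered) - len(missing), len(considered), missing

# Toy stand-ins for `preproc.tok.index_word` and `embeddings_index`.
toy_index_word = {1: "the", 2: "movie", 3: "zzzgarbled", 4: "great"}
toy_embeddings = {"the": [0.1], "movie": [0.2], "great": [0.3]}

found, considered, missing = embedding_coverage(toy_index_word, toy_embeddings, max_words=4)
print(f"{found}/{considered} words have a pretrained vector; missing: {missing}")
```

In the notebook, the call would presumably look like `embedding_coverage(word_index, embeddings_index, MAX_NUM_WORDS)`; I have not run it on the real data, so no coverage figure is given here.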


## 3.2 Modelling

### 3.2.1 Training with frozen GloVe vectors

In [0]:
def load_model(input_shape, embedding_matrix, trainable):
    model = Sequential()
    model.add(Embedding(19999+1, # add 1 for padding token
                        100, 
                        mask_zero=True, 
                        input_length=input_shape[0],
                        embeddings_initializer=Constant(embedding_matrix),
                        trainable=trainable)) 
    #model.add(SpatialDropout1D(0.4))
    model.add(Bidirectional(LSTM(200, dropout=0.2, recurrent_dropout=0.2, 
                                 return_sequences=True)))
    model.add(Bidirectional(LSTM(200, dropout=0.2, recurrent_dropout=0.2, 
                                 return_sequences=True)))
    model.add(SeqWeightedAttention())
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    model.summary()
    return model


K.clear_session()
input_shape = ((400),)
glove_lstm_model = load_model(input_shape, embedding_matrix, trainable=False)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 100)          2000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 400, 400)          481600    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 400, 400)          961600    
_________________________________________________________________
seq_weighted_attention (SeqW (None, 400)               401       
_________________________________________________________________
dense (Dense)                (None, 2)                 802       
Total params: 3,444,403
Trainable params: 1,444,403
Non-trainable params: 2,000,000
_________________________________________________________________


In [0]:
glove_lstm_learner = ktrain.get_learner(glove_lstm_model, 
                                        train_data=(x_train, y_train), 
                                        val_data=(x_test, y_test), 
                                        batch_size=150)

In [0]:
# find a good learning rate
#glove_lstm_learner.lr_find()
#glove_lstm_learner.lr_plot()

In [0]:
glove_lstm_learner.autofit(0.01,
                           150, 
                           early_stopping=3, 
                           checkpoint_folder='/content/drive/My Drive/models/glove')



begin training using triangular learning rate policy with max lr of 0.01...
Train on 25000 samples, validate on 12500 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 00010: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f54ce78eeb8>

In [0]:
glove_lstm_learner.save_model(
    '/content/drive/My Drive/models/glove/glove_model')

### 3.2.2 Training with unfrozen GloVe vectors

Now let's try the same, but allow the GloVe vectors to be trained further. Hopefully, this way we can fine-tune the vectors and achieve better results.

In [0]:
K.clear_session()
input_shape = ((400),)
glove_lstm_trainable_model = load_model(input_shape, embedding_matrix, trainable=True)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 100)          2000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 400, 400)          481600    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 400, 400)          961600    
_________________________________________________________________
seq_weighted_attention (SeqW (None, 400)               401       
_________________________________________________________________
dense (Dense)                (None, 2)                 802       
Total params: 3,444,403
Trainable params: 3,444,403
Non-trainable params: 0
_________________________________________________________________


In [0]:
glove_lstm_trainable_learner = ktrain.get_learner(glove_lstm_trainable_model, 
                                        train_data=(x_train, y_train), 
                                        val_data=(x_test, y_test), 
                                        batch_size=64)

In [0]:
glove_lstm_trainable_learner.autofit(0.01,
                           50, 
                           early_stopping=1, 
                           checkpoint_folder='/content/drive/My Drive/models/glove')



begin training using triangular learning rate policy with max lr of 0.01...
Train on 25000 samples, validate on 12500 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 00003: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f502c1b4860>

In [0]:
glove_lstm_trainable_learner.save_model(
    '/content/drive/My Drive/models/glove/glove_trainable_model')

## 3.3 Evaluation

### 3.3.1 Trained with unfrozen GloVe vectors

In [0]:
# I used a custom layer for attention. Unfortunately, that prevents me from
# loading the saved model directly. Luckily, I can still recreate the model and
# then load the weights of my best epoch.
glove_lstm_trainable_learner.model.load_weights('/content/drive/My Drive/models/glove/weights-01.hdf5')

In [0]:
glove_lstm_trainable_predictor = ktrain.get_predictor(glove_lstm_trainable_learner.model, preproc)

**Final validation on the holdout dataset**

While we have an accuracy of 90% on the validation set after the best epoch (the 1st), the model performs considerably worse on the holdout set (86%). I am surprised by the difference. Looking at the recall (0.89 for pos versus 0.83 for neg), there seems to be a bias towards the positive class.

In [0]:
print(classification_report(y_holdout[:,1], 
                            np.around(glove_lstm_trainable_predictor.model.predict(x_holdout)[:,1]),
                            target_names=['neg', 'pos']))

              precision    recall  f1-score   support

         neg       0.88      0.83      0.86      6308
         pos       0.84      0.89      0.86      6192

    accuracy                           0.86     12500
   macro avg       0.86      0.86      0.86     12500
weighted avg       0.86      0.86      0.86     12500



### 3.3.2 Trained with frozen GloVe vectors

In [0]:
glove_lstm_learner.model.load_weights('/content/drive/My Drive/models/glove/weights-07.hdf5')

In [0]:
glove_lstm_predictor = ktrain.get_predictor(glove_lstm_learner.model, preproc)

**Final validation on the holdout dataset**

The LSTM model with frozen GloVe vectors shows a stable performance on the holdout set with 90% accuracy.

In [0]:
print(classification_report(y_holdout[:,1], 
                            np.around(glove_lstm_predictor.model.predict(x_holdout)[:,1]),
                            target_names=['neg', 'pos']))

              precision    recall  f1-score   support

         neg       0.90      0.90      0.90      6308
         pos       0.90      0.90      0.90      6192

    accuracy                           0.90     12500
   macro avg       0.90      0.90      0.90     12500
weighted avg       0.90      0.90      0.90     12500




### 3.3.3 Summary

I hoped to improve the performance of the LSTM model by using pretrained GloVe vectors. However, the comparison on the holdout data shows no real improvement. The accuracy of both approaches is 90%. You could argue that the f1-score for the positive class is about 1 percentage point better with pretrained and frozen GloVe vectors (0.90 versus 0.89), but I do not think that is significant.

Using pretrained GloVe vectors and continuing to train them during the overall model training actually hurts performance: a bias towards the positive class is introduced. Maybe the corpus is too small to modify the GloVe vectors in a beneficial way. Maybe we would do better by unfreezing the vectors gradually.

Furthermore, we could try other GloVe vectors that are trained on another corpus or that have a higher dimension and therefore contain more information. Or we could try vectors that are pretrained with other models. Intuitively, vectors pretrained on a Twitter corpus might be more beneficial, as the communication on Twitter is more similar to the IMDb data set.

Additionally, we can improve the architecture of the LSTM as I already suggested in section 2.

# 4. fastText

## 4.0 Introduction

I implemented the fastText model based on the paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/pdf/1607.01759.pdf) (Joulin et al., 2016).

This model is simple but extremely efficient. It is specifically designed to provide an efficient baseline for NLP classification, but the paper shows that it can often compete with deep learning models. In the first layer, the features/tokens are embedded. In the next layer, the resulting embedding vectors are simply averaged to form the hidden representation. Lastly, there is an output layer with one unit per class and a softmax function.
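
The three layers can be written down as a forward pass in a few lines of NumPy. This is only a sketch with random toy weights: `fasttext_forward` is a hypothetical name, and skipping index 0 when averaging is my reading of what `mask_zero=True` is meant to achieve in the Keras model further down.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fasttext_forward(token_ids, embeddings, W, b):
    """Embed the tokens, average the non-padding vectors, apply a softmax layer."""
    mask = token_ids != 0  # index 0 is the padding value
    hidden = embeddings[token_ids[mask]].mean(axis=0)
    return softmax(hidden @ W + b)

rng = np.random.default_rng(0)
vocab_size, embedding_dim, n_classes = 50, 8, 2
embeddings = rng.normal(size=(vocab_size, embedding_dim))
W = rng.normal(size=(embedding_dim, n_classes))
b = np.zeros(n_classes)

review = np.array([0, 0, 0, 12, 7, 30])
probs = fasttext_forward(review, embeddings, W, b)
print(probs, probs.sum())
```

Because of the averaging, shuffling the tokens of `review` gives the same probabilities. The model itself is blind to word order, so any order information has to come from the input features.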

To get context information into the model, n-grams are used (3-grams in my implementation). Thereby, the model receives some information about the local word order.
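
As an illustration of what such n-gram features look like, here is a small pure-Python sketch. `add_ngrams` is a hypothetical helper. It keeps the n-grams as strings, whereas ktrain maps them to integer indices, and adding both 2-grams and 3-grams for a range of 3 is an assumption about ktrain's behaviour.

```python
def add_ngrams(tokens, max_n):
    """Return the tokens followed by all contiguous n-grams for n = 2..max_n."""
    features = list(tokens)
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

tokens = ["not", "a", "good", "movie"]
print(add_ngrams(tokens, max_n=3))
```

A feature such as "not a good" carries the negation that the single tokens "not" and "good" lose once they are averaged.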

## 4.1 Preprocessing

In [0]:
# Importing data
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('aclImdb', 
                                                                         max_features=20000,
                                                                         maxlen=500,
                                                                         preprocess_mode='standard',
                                                                         ngram_range=3,
                                                                         classes=['pos', 'neg'])

detected encoding: utf-8
language: en
Word Counts: 88582
Nrows: 25000
25000 train sequences
Average train sequence length: 231
Adding 3-gram features
max_features changed to 4678016 with addition of ngrams
Average train sequence length with ngrams: 690
x_train shape: (25000,500)
y_train shape: (25000,2)
25000 test sequences
Average test sequence length: 224
Average test sequence length with ngrams: 522
x_test shape: (25000,500)
y_test shape: (25000,2)


In [0]:
# Here, we split the test data to receive a holdout set. We do not want to 
# leak information by tuning hyperparameters
x_test, x_holdout, y_test, y_holdout = train_test_split(x_test,
                                                        y_test, 
                                                        test_size=0.5, 
                                                        random_state=42)

This time the first document looks different after preprocessing, because we created n-grams, which are represented by indices higher than 19999.

In [0]:
#print('feature vector of first document in train set:')
#print(x_train[0])
#print('target vector of first document in train set:')
#print(y_train[0])
#print('highest word index in train set: ' + str(np.amax(x_train)))

## 4.2 Modelling

In [0]:
def load_model(input_shape):
    model = Sequential()
    model.add(Embedding(4678015+1, 50, 
                        input_length=input_shape[0], 
                        mask_zero=True))
    model.add(GlobalAveragePooling1D())
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    model.summary()
    return model

K.clear_session()
input_shape = ((500),)
fasttext_model= load_model(input_shape)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 50)           233900800 
_________________________________________________________________
global_average_pooling1d (Gl (None, 50)                0         
_________________________________________________________________
dense (Dense)                (None, 2)                 102       
Total params: 233,900,902
Trainable params: 233,900,902
Non-trainable params: 0
_________________________________________________________________


In [0]:
fasttext_learner = ktrain.get_learner(fasttext_model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=64)

In [0]:
fasttext_learner.autofit(0.01, 
                         10, 
                         early_stopping=4, 
                         checkpoint_folder='/content/drive/My Drive/models/fasttext')



begin training using triangular learning rate policy with max lr of 0.01...
Train on 25000 samples, validate on 12500 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 00005: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f19b2cf9438>

In [0]:
fasttext_learner.save_model(
    '/content/drive/My Drive/models/fasttext/fasttext_model')

## 4.3 Evaluation

In [0]:
fasttext_learner.load_model(
    '/content/drive/My Drive/models/fasttext/fasttext_model')

**Analysing the top losses**

For the same reasons as in the LSTM section, let us look at the 5 documents with the highest loss.

In [0]:
fasttext_learner.view_top_losses(n=5, preproc=preproc)

----------
id:1134 | loss:16.12 | true:neg | pred:pos)

masterpiece carrot top blows the screen away never has one movie captured the essence of the human spirit quite like chairman of the board 10 10 don't miss this instant classic feyder's
----------
id:1876 | loss:11.41 | true:neg | pred:pos)

this film has the language the style and the attitude down plus greats rides from a world champ and the great jerry lopez john as turtle has the surf down and the surfing scenes are still the best ever a true classic that can be seen many times is a babe and laird hamilton shows the early stuff that has made him the world's number one extreme surfer ising
----------
id:3998 | loss:11.27 | true:neg | pred:pos)

this movie was pure genius john waters is brilliant it is hilarious and i am not sick of it even after seeing it about 20 times since i bought it a few months ago the acting is great although lake could have been better and johnny depp is magnificent he is such a beautiful man and a very

**Interpretation of the errors**

* id 1134: The review is mislabeled; fastText is correct.
* id 1876: The review is mislabeled; fastText is correct.
* id 3998: The review is mislabeled; fastText is correct.
* id 789: The review is mislabeled; fastText is correct.
* id 2365: The review is mislabeled; fastText is correct.

All 5 reviews with the highest loss are actually mislabeled. It would be interesting to know the real accuracy if all labels were correct.

In [0]:
fasttext_predictor = ktrain.get_predictor(fasttext_learner.model, preproc)

**Did the model learn to process contextual information?**

Remember that our statistical model based on TFIDF had an accuracy of 85% on the validation data. However, it struggled with document 19. I believe that is because the meaning of document 19 is heavily embedded in the context. The author does not use words that by themselves signal a strong negative sentiment, but compares an actress to an oven. The words that are strong on their own are "delightful" and "happy", which falsely indicate a positive sentiment.

The fastText model predicts this document correctly.

In [0]:
print("Predicted label: "+ fasttext_predictor.predict(test_b.data[19].decode('utf-8')))
print("True label: "+ str(test_b.target[19]))

Predicted label: neg
True label: 0


In [0]:
fasttext_predictor.explain(test_b.data[19].decode('utf-8'))

Contribution?,Feature
1.565,Highlighted in text (sum)
0.42,<BIAS>


**Final validation on the holdout dataset and comparison of the speed**

fastText performs about as well as the LSTM models. Yet its training time is a fraction of the LSTM's. One epoch for fastText with a batch size of 64 took 33s. In a previous setup in which I only created bigrams, it took even less time (8s). An epoch with the LSTM and the same setup took 1731s (almost half an hour).

Also, I spent very little time fine-tuning the architecture and hyperparameters of the fastText model, while I had to experiment for several days with the LSTM models to get above 90% accuracy during training.
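To put the epoch times quoted above in proportion, the ratios can be computed directly. This is only arithmetic on the three timings mentioned in the text; the variable names are made up for illustration.

```python
# Epoch times in seconds, as quoted in the text above.
lstm_epoch_s = 1731
fasttext_epoch_s = 33
fasttext_bigram_epoch_s = 8  # earlier setup that only created bigrams

# How many fastText epochs fit into a single LSTM epoch.
speedup = lstm_epoch_s / fasttext_epoch_s
speedup_bigram = lstm_epoch_s / fasttext_bigram_epoch_s

print(f"LSTM epoch: {lstm_epoch_s / 60:.1f} min")
print(f"fastText is roughly {speedup:.0f}x faster per epoch")
print(f"bigram-only fastText was roughly {speedup_bigram:.0f}x faster per epoch")
```

These ratios only compare time per epoch. They say nothing about how many epochs each model needs, which also affects the total training time.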

In [0]:
print(classification_report(y_holdout[:,1], 
                            np.around(fasttext_predictor.model.predict(x_holdout)[:,1]),
                            target_names=['neg', 'pos']))

              precision    recall  f1-score   support

         neg       0.90      0.90      0.90      6308
         pos       0.90      0.90      0.90      6192

    accuracy                           0.90     12500
   macro avg       0.90      0.90      0.90     12500
weighted avg       0.90      0.90      0.90     12500



# 5. BERT based model

## 5.0 Introduction

BERT (Bidirectional Encoder Representations from Transformers) is a powerful NLP technique [developed by Google in 2018](https://arxiv.org/abs/1810.04805v2). When it came out, it broke several records on various NLP tasks. Moreover, Google keeps the code open source and even provides pre-trained versions for download. The model is so powerful because it is conceptually sound and, at the same time, trained on a massive dataset.

The following picture gives a top-level overview from a practitioner's point of view. As practitioners, we can profit from the freely available pre-trained model. For our task (sentiment analysis), we only need to train the classifier on top of the model.

<img src=http://jalammar.github.io/images/bert-transfer-learning.png width="700">

[Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]](http://jalammar.github.io/illustrated-bert/)


But what actually makes BERT so powerful under the hood? We mentioned at the beginning of the notebook that LSTMs have a limitation: they can only read the data from one side. Therefore, we implemented a bidirectional LSTM that concatenates the output vectors of two LSTMs that read the data in opposite directions. The language model ELMo uses this kind of architecture to create bidirectional word vectors. But that is a suboptimal solution, as it yields only a shallow bidirectional representation compared to BERT's approach (the vectors are merely a concatenation of two unidirectionally trained vectors).

BERT's key feature is that it is designed to create a bidirectional representation by "jointly conditioning on both left and right context **in all layers**" (Devlin et al., 2018). To achieve this, Google uses a new pre-training technique: they mask out some words in the middle of the sentence and then condition each word bidirectionally to predict the masked words.

<img src=https://2.bp.blogspot.com/-pNxcHHXNZg0/W9iv3evVyOI/AAAAAAAADfA/KTSvKXNzzL0W8ry28PPl7nYI1CG_5WuvwCLcBGAs/s1600/f1.png width="700">

[Google AI Blog. Open Sourcing BERT (2018)](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
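The masking step described above can be illustrated with a small sketch. It is a deliberate simplification and not BERT's actual preprocessing: it works on whitespace-split words instead of WordPiece tokens, and it always substitutes `[MASK]`, whereas the paper sometimes keeps the original token or swaps in a random one. The function name `mask_tokens` and its parameters are made up for this example; the default rate of 15% follows Devlin et al. (2018).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the masked sequence and a dict {position: original token}
    holding the words a masked language model would have to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # prediction target for this position
        else:
            masked.append(tok)
    return masked, targets

sentence = "the man went to the store to buy a gallon of milk".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3, seed=42)
print(" ".join(masked_sentence))
print(targets)
```

Because the model sees the words on both sides of each `[MASK]`, predicting the target forces it to use left and right context at the same time.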


## 5.1 Preprocessing

In [0]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('aclImdb', 
                                                                         maxlen=400, 
                                                                         max_features = 20000,
                                                                         preprocess_mode='bert',
                                                                         classes=['pos', 'neg'])

detected encoding: utf-8
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


preprocessing test...
language: en


In [0]:
from sklearn.utils import shuffle
import numpy as np

# Shuffling x_test 
x_test[0], x_test[1], y_test = shuffle(x_test[0], x_test[1], y_test)

# Creating a holdout set for a final validation
x_holdout = [x_test[0][:12500], x_test[1][:12500]]
y_holdout = y_test[:12500]
x_test = [x_test[0][12500:], x_test[1][12500:]]
y_test = y_test[12500:]


## 5.2 Modelling

In [0]:
bert_model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)
bert_learner = ktrain.get_learner(bert_model, 
                                  train_data=(x_train, y_train), 
                                  val_data=(x_test, y_test),
                                  batch_size=6)

Is Multi-Label? False
maxlen is 400
done.


In [0]:
bert_learner.autofit(2e-5, 
                     10, 
                     early_stopping=2, 
                     checkpoint_folder='/content/drive/My Drive/models/bert')

# epoch1: 9348



begin training using triangular learning rate policy with max lr of 2e-05...
Train on 25000 samples, validate on 12500 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 00004: early stopping
Weights from best epoch have been loaded into model.


<tensorflow.python.keras.callbacks.History at 0x7f29cfdff7b8>

In [0]:
bert_learner.save_model(
    '/content/drive/My Drive/models/bert/bert_model')

## 5.3 Evaluation

In [0]:
bert_learner.load_model(
    '/content/drive/My Drive/models/bert/bert_model')

**Analysing the top losses**

For the same reasons as in the LSTM section, let us look at the 5 documents with the highest loss.

In [0]:
bert_learner.view_top_losses(n=5, preproc=preproc)

----------
id:11635 | loss:6.51 | true:neg | pred:pos)

[CLS] i really liked this qui ##rky movie . the characters are not the bland beautiful people that show up in so many movies and on tv . it has a realistic edge , with a capt ##ivating story line . the main title sequence alone makes this movie fun to watch . [SEP]
----------
id:11824 | loss:6.45 | true:neg | pred:pos)

[CLS] this has to be one of , if not the greatest mob / crime films of all time . every thing about this movie is great , the acting in this film is of true quality ; master p ' s acting skills make you actually believe he is italian ! the cinematography is excellent too , probably the best ever . this movie was great ; and i have the brain capacity of an earth worm . [SEP]
----------
id:7873 | loss:6.39 | true:pos | pred:neg)

[CLS] this was one of the worst col ##umb ##o episodes that i have seen , however , i am only in the second season . < br / > < br / > the typical col ##umb ##o activities are both amusing a

**Interpretation of the errors**

* id 11635: The review is mislabeled; BERT is correct.
* id 11824: The review is mislabeled; BERT is correct.
* id 7873: The review is mislabeled; BERT is correct.
* id 3114: The review is mislabeled; BERT is correct.
* id 3114: The review is mislabeled; BERT is correct.

All the highest losses are caused by mislabeled data. There is not even any ambiguity in how to interpret these reviews, which makes sense: the BERT model works quite well, so the highest losses result from confident predictions in the "wrong" direction. We could cut off a certain percentage of the list of highest losses to identify wrongly labeled data and then measure the true accuracy more accurately. But, of course, that would bear a considerable risk of throwing out documents that were genuinely mispredicted, which would inflate the accuracy.
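The idea of cutting off a share of the highest losses can be sketched with plain NumPy. This is a minimal sketch under assumptions, not part of ktrain: the helper names, the clipping constant `eps` and the `fraction` parameter are hypothetical, and the arrays are toy data. It expects one-hot labels and softmax probabilities in the same layout as `y_test` and the output of `model.predict`.

```python
import numpy as np

def per_sample_cross_entropy(y_true, y_prob, eps=1e-7):
    """Categorical cross-entropy per document (one-hot y_true, softmax y_prob)."""
    # eps is an assumed clipping constant that avoids log(0).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(y_true) * np.log(y_prob), axis=1)

def flag_top_losses(y_true, y_prob, fraction=0.01):
    """Return the indices of the `fraction` of documents with the highest loss."""
    losses = per_sample_cross_entropy(y_true, y_prob)
    n_flag = max(1, int(round(fraction * len(losses))))
    return np.argsort(losses)[::-1][:n_flag], losses

# Toy data (made up): the third document is predicted confidently "wrong".
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.01, 0.99], [0.4, 0.6]])

suspects, losses = flag_top_losses(y_true, y_prob, fraction=0.25)
print("candidates for manual review:", suspects)
print("losses:", np.round(losses, 2))
```

Given the risk described above, the flagged documents are better treated as candidates for manual review than removed automatically.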

In [0]:
bert_predictor = ktrain.get_predictor(bert_learner.model, preproc)

**Did the model learn to process contextual information?**

Remember that our statistical model based on TFIDF had an accuracy of 85% on the validation data. However, it struggled with document 19. I believe that is because the meaning of document 19 is heavily embedded in the context. The author does not use words that by themselves signal a strong negative sentiment, but compares an actress to an oven. The words that are strong on their own are "delightful" and "happy", which falsely indicate a positive sentiment.

The BERT model, however, predicts this document correctly, and it does so with the highest confidence (0.926) of all models.

In [0]:
print("Predicted label: "+ bert_predictor.predict(test_b.data[19].decode('utf-8')))
print("True label: "+ str(test_b.target[19]))

Predicted label: neg
True label: 0


In [0]:
bert_predictor.explain(test_b.data[19].decode('utf-8'))

Contribution?,Feature
2.192,Highlighted in text (sum)
0.339,<BIAS>


**Final validation on the holdout dataset and comparison of the speed**

Clearly, the BERT model performs best. Its accuracy is 4 percentage points higher than that of the LSTM and fastText models and 9 percentage points higher than that of the random forest based on TFIDF. Judging by the recall (0.92 for neg, 0.95 for pos), there is a small bias towards the positive class.

For comparison, the state of the art for sentiment analysis on the IMDb dataset as of 21.12.19 is 97.4%, and 94% accuracy was first achieved in 2016 with an oh-LSTM: https://paperswithcode.com/sota/sentiment-analysis-on-imdb

In [0]:
print(classification_report(y_holdout[:,1], 
                            np.around(bert_predictor.model.predict(x_holdout)[:,1]),
                            target_names=['neg', 'pos']))

              precision    recall  f1-score   support

         neg       0.95      0.92      0.94      6263
         pos       0.93      0.95      0.94      6237

    accuracy                           0.94     12500
   macro avg       0.94      0.94      0.94     12500
weighted avg       0.94      0.94      0.94     12500
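The remark about a small bias towards the positive class rests on the per-class recall. As a minimal sketch of how that number is obtained, the function below computes recall for each class from integer labels. The function name and the toy arrays are made up for illustration; they are not the notebook's holdout data.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes=2):
    """Share of documents of each true class that were predicted as that class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        is_class = y_true == c
        # Recall is undefined if the class never occurs in y_true.
        if is_class.any():
            recalls.append(float(np.mean(y_pred[is_class] == c)))
        else:
            recalls.append(float("nan"))
    return recalls

# Toy labels (made up): 0 = neg, 1 = pos.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 1]

recall_neg, recall_pos = per_class_recall(toy_true, toy_pred)
print(f"recall neg: {recall_neg:.2f}, recall pos: {recall_pos:.2f}")
```

A higher recall for `pos` than for `neg`, as in the report above, means that negative reviews are mistaken for positive ones more often than the reverse.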



In [0]:
data = [ 'Sesame street sucks!', 'I like Elmo.']
bert_predictor.predict(data)

['neg', 'pos']