In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import tensorflow as tf
from tensorflow.keras import layers
import re

print(tf.version.VERSION)
print(tf.keras.__version__)

1.14.0
2.2.4-tf


## Large Movie Review Dataset

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. 

http://ai.stanford.edu/~amaas/data/sentiment/

In [2]:
df = pd.read_csv('rnn/trainIMDB.tsv', delimiter ="\t")

In [3]:
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


As preprocessing, it is necessary to remove certain characters that may cause noise during the training of the recurrent neural network. Punctuations marks are removed and all strings are set to lowercase.

### Function that normalizes the reviews

In [4]:
def word_norm(text):
    text = text.str.lower()
    text = text.replace(r'[^\w\s]','')
    text = text.replace('.', '')
    text = text.replace(',', '')
    text = text.replace('!', '')
    text = text.replace(to_replace =r'!', value='',regex=True)
    text = text.replace('á', 'a')
    text = text.replace('é', 'e')
    text = text.replace('í', 'i')
    text = text.replace('ó', 'o')
    text = text.replace('ú', 'u')
    text = text.replace('?', '')
    text = text.replace('!', '')
    text = text.replace(to_replace =r'-', value='',regex=True)
    text = text.replace(')', '')
    text = text.replace('(', '')
    text = text.replace(':', '')
    text = text.replace('/', '')
    text = text.replace('\\', '')
    return(text)

In [5]:
df['review_clean'] = word_norm(df['review'])

In [6]:
df.head()

Unnamed: 0,id,sentiment,review,review_clean
0,5814_8,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...","\the classic war of the worlds\"" by timothy hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager (nicholas bell)...
3,3630_4,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...


In [7]:
#This class allows to vectorize a text corpus, by turning each text into either a sequence of integers 
#(each integer being the index of a token in a dictionary)
from tensorflow.keras.preprocessing.text import Tokenizer
#Pads sequences to the same length.
from tensorflow.keras.preprocessing.sequence import pad_sequences


## Defining the Dictionary

We will define a vocabulary of the words present in the dataset. Each word will be associated to a number. This number is the feature representation of the word that we will use in our model. For this project, we will use the top 6000 most common words present in the dataset.

At the moment, we have not removed stopwords. 

#### Our RNNs will be defined as Many-to-one

This is due to the fact that we are performing sentiment analysis, we will have several input features (words) and we will return the probability of the review being positive or negative as an output. 

In [8]:
max_features = 6000 ##the maximum number of words to keep.
tokenizer = Tokenizer(num_words=max_features)
##Updates internal vocabulary based on a list of texts.
tokenizer.fit_on_texts(df['review_clean'])
tokens = tokenizer.texts_to_sequences(df['review_clean'])

In [9]:
df.head()

Unnamed: 0,id,sentiment,review,review_clean
0,5814_8,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...","\the classic war of the worlds\"" by timothy hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager (nicholas bell)...
3,3630_4,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...


## Padding 

### Theorical max padding. 

input vectors for training all have to be of the same size. One way of defining the size or the padding required could be using as a benchmark the longest review of the dataset. This review has a total of 2470 words: 

In [10]:
max(df['review_clean'].str.split().str.len())

2470

### Practical pad size

As first approximation, a pad of size 140 will be used. 

In [11]:
pad = 140
X_data = pad_sequences(tokens, maxlen=pad)

## Embedding 

In [12]:
X_data

array([[   3,   51,    9, ...,   21,    1, 1563],
       [   3,   52,  437, ...,   27,   90, 5537],
       [   1, 1445, 1360, ...,  864, 1351,    4],
       ...,
       [   0,    0,    0, ...,    7,  358,  159],
       [  10,  131,   11, ...,   16,   82,   80],
       [   4,   55,   83, ...,   14,    3,  520]])

In [14]:
y_data = df['sentiment']

## Splitting the Data set

In [15]:
### Splitting the datsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data,
                                                    stratify=y_data, 
                                                    test_size=0.2)

## Experiment 1: LSTM 


* Input vectors of size 128 for each word.
* Long Term Short Memory Neural Net using with size 100 as an output.
* Dense layer using sigmoid activation.
* adam optimizer

### Creating a ModelCheckpoint 

* ModelCheckpoint will save the best model based on validation loss while training. 
* EarlyStopping will stop training when validation loss is no longer decreasing.

In [1]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks_1 = [EarlyStopping(monitor='val_loss', patience=1),
             ModelCheckpoint(('rnn/experiment_1/model.h5'), save_best_only=True, 
                             save_weights_only=False)]

In [17]:
embed_size = 128
model_1 = tf.keras.Sequential()
model_1.add(layers.Embedding(max_features,embed_size))
model_1.add(layers.LSTM(100))
model_1.add(layers.Dense(1,activation="sigmoid"))
print(model_1.summary())
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         768000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               91600     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 859,701
Trainable params: 859,701
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [18]:
batch_size = 100
epochs = 6
model_1.fit(X_train,y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2, callbacks =callbacks_1)

Train on 16000 samples, validate on 4000 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6


<tensorflow.python.keras.callbacks.History at 0x26a426fa940>

# Note:

Experiment 1 was first tested with a padding of 2470. This number represents the amount of words in the largest review of the dataset. However, each epoch took about 2 hours to train, so this experiment was discarded. 

<img src="rnn/maxpad.jfif">

In [19]:
score_1 = model_1.evaluate(X_test, y_test, batch_size=100)



## Experiment 1.1: LSTM 


* Input vectors of size 256 for each word.
* Long Term Short Memory Neural Net using with size 100 as an output.
* Dense layer using sigmoid activation.
* adam optimizer

In [20]:
callbacks_1_1 = [EarlyStopping(monitor='val_loss', patience=1),
             ModelCheckpoint(('rnn/experiment_1/model_1_1.h5'), save_best_only=True, 
                             save_weights_only=False)]
model_1_1 = tf.keras.Sequential()
model_1_1.add(layers.Embedding(max_features,256))
model_1_1.add(layers.LSTM(100))
model_1_1.add(layers.Dense(1,activation="sigmoid"))
print(model_1_1.summary())
model_1_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


batch_size = 100
epochs = 6

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 256)         1536000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               142800    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 1,678,901
Trainable params: 1,678,901
Non-trainable params: 0
_________________________________________________________________
None


In [21]:
model_1_1.fit(X_train,y_train, batch_size=batch_size, epochs=10, validation_split=0.2, callbacks =callbacks_1_1)

Train on 16000 samples, validate on 4000 samples
Epoch 1/10
Epoch 2/10


<tensorflow.python.keras.callbacks.History at 0x26a9f26ef28>

In [22]:
score_1_1 = model_1_1.evaluate(X_test, y_test, batch_size=100)



## Experiment 0 

In [23]:
callbacks = [EarlyStopping(monitor='val_loss', patience=1),
             ModelCheckpoint(('rnn/experiment_0/model.h5'), save_best_only=True, 
                             save_weights_only=False)]

In [24]:
embed_size = 128
model = tf.keras.Sequential()
model.add(layers.Embedding(max_features, embed_size))
model.add(layers.Bidirectional(layers.LSTM(32, return_sequences = True)))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(20, activation="relu"))
model.add(layers.Dropout(0.05))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 100
epochs = 6
model.fit(X_train,y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2, callbacks =callbacks)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Train on 16000 samples, validate on 4000 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6


<tensorflow.python.keras.callbacks.History at 0x26aa70852b0>

The model stopped at epoch 4 because the validation accuracy of the training set did not improve on that epoch. 

In [25]:
score = model.evaluate(X_test, y_test, batch_size=100)



## Comments and conclusions

Working this model, it was really important to perform data cleaning. Characters where removed so the context of words could be the same through the whole datasets.

Stopwords where not removed as they could add context to the problem.

A possible improvement for the model could be to perform stemming and lemmatization to the dataset. 
Furthermore, working with the padding was crucial for this model. A possible improvement could be to define the padding using the mean and standard deviation in the whole dataset. Working with the maximum size was not viable as each epoch took about 3 hours to train.

