# Data preparation for sentiment analysis model

First we import libraries to manage data

In [1]:
import pandas as pd 
import numpy as np
import tensorflow as tf 
import tensorflow_datasets as tfds

Counter function from collections library will be used to count the words in the different review and then this object will be used to define an encoder to converte string to number array.

In [2]:
from collections import Counter

The used dataset contains film's reviews from IMDB. Each review is labelled as "positive" or "negative" and we are going to use this data to build a supervised Recurrent neural network (RNN) to accomplish a sentiment analysis 

In [3]:
imdb = pd.read_csv("IMDB.csv")

We convert from string type to numeric the sentiment in the IMDB dataset

In [4]:
imdb.loc[imdb.sentiment == "negative", "score"] = 0
imdb.loc[imdb.sentiment == "positive", "score"] = 1

In [5]:
imdb.head()

Unnamed: 0,review,sentiment,score
0,One of the other reviewers has mentioned that ...,positive,1.0
1,A wonderful little production. <br /><br />The...,positive,1.0
2,I thought this was a wonderful way to spend ti...,positive,1.0
3,Basically there's a family where a little boy ...,negative,0.0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1.0


The first thing to do is create a tensorflow dataset object and defined the train/test/valid set to validate the model through a holdout

In [6]:
ds_raw = tf.data.Dataset.from_tensor_slices((imdb.review, imdb.score))

In [7]:
ds_raw = ds_raw.shuffle(50000)

In [8]:
ds_raw_test = ds_raw.take(25000)
ds_raw_train_valid = ds_raw.skip(25000)
ds_raw_train = ds_raw_train_valid.take(20000)
ds_raw_valid = ds_raw_train_valid.skip(20000)

To encode the string reviews we use the previously imported Counter to store every word used in the reviews. 
the reviews are made readable for the counter using the Tokenizer function provided in the Tensorflow library

In [9]:
count = Counter()
tokenizer = tfds.deprecated.text.Tokenizer()

In [10]:
for rew in ds_raw_train:
    token = tokenizer.tokenize(rew[0].numpy())
    count.update(token)

In [11]:
encoder = tfds.deprecated.text.TokenTextEncoder(count)

Once the reading of the reviews has been completed it is possible to decode them into numerical arrays as shown below

In [12]:
encoder.encode("This is an example")

[361, 61, 75, 431]

At this point we could define a function to encode the sets previously built. It's essential to convert it to a tensorflow function

In [13]:
def encode(tensor, label):
    text = tensor.numpy()
    encoded_text = encoder.encode(text)
    return encoded_text, label

In [14]:
def tf_encode(tensor, label):
    return tf.py_function(encode, inp = [tensor, label], Tout = (tf.int64, tf.float64))

In [15]:
ds_train = ds_raw_train.map(tf_encode)
ds_test = ds_raw_test.map(tf_encode)
ds_valid = ds_raw_valid.map(tf_encode)

As can be expected the vectors of the regressors are of different sizes. In general RNN can also handle several dimensions but you can simplify this by collecting reviews in batches and using padding to have the same size in each collection

In [16]:
train = ds_train.padded_batch(32, padded_shapes = ([-1], []) )
test = ds_test.padded_batch(32, padded_shapes = ([-1], []) )
valid = ds_valid.padded_batch(32, padded_shapes = ([-1], []) )

Finally we have the dataset processed to be used efficently in a RNN model. 
The model will be buld in the follow fashion:  
$\bullet$ embedding layer: maps discrete numerical array to continous array and normalized  
$\bullet$ bidirectional LSTM layer: LSTM neural network in which the recursion is forward and backward. The chosen activation function is the hyperbolic tangent. Finally a dropout with 0.2 rate is set in this layer.  
$\bullet$ dense hidden layer with 64 neurons  
$\bullet$ Output dense layer with sigmoid activation to classify review in feed-forward  

In [17]:
model = tf.keras.Sequential()

model.add(tf.keras.layers.Embedding(input_dim = len(count)+2, 
                                    output_dim = 20))

model.add(tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, activation = "tanh", dropout = 0.2))
         )

model.add(tf.keras.layers.Dense(64, activation = "tanh"))

model.add(tf.keras.layers.Dense(1, activation = "sigmoid"))

In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 20)          1745120   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               43520     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 1,796,961
Trainable params: 1,796,961
Non-trainable params: 0
_________________________________________________________________


The optimization method chosen to minime the binary crossentropy loss function is the "adam" stochastic gradient descent 

In [19]:
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

To simplify the times of execution on a local machine only ten epochs of learning are carried out

In [20]:
model.fit(train, validation_data = valid, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x20c65469430>

In the end it's possible to make predictions on records not yet seen by the model and evaluate the generalized error with the test set.

In [21]:
model.evaluate(test)



[0.1576797217130661, 0.9500799775123596]

The values of generalized error, of training and on the test of validation are in all the cases very low and therefore the risk of overfitting can be considered averted and the model can be considered satisfactory.