# RNN using TensorFlow

Here we will classify movie reviews as positive or negative using the 50,000 reviews in the IMDB dataset. Of these, 25,000 are in the training set and the remaining 25,000 are in the test set.

In [0]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

print(tf.__version__)



## Dataset

The dataset ships with Keras as `keras.datasets.imdb`.

As before, we load the train data and the test data.

We only keep the 10,000 most frequent words. Most of the predictive signal comes from frequently occurring words, and rare words contribute little.

Every word is mapped to an integer, which serves as its index. Using this mapping, we can recover the words of the original review.
More details on this at the end of the notebook.
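To make the integer-to-word mapping concrete, here is a minimal sketch of decoding a review. It uses a tiny made-up `toy_word_index` (hypothetical; the real one comes from `imdb.get_word_index()`) and assumes the default Keras IMDB conventions, where 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown words, so every word's rank is shifted by 3. The helper names are illustrative, not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(); the words and ranks are made up.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Assumed Keras IMDB defaults: 0 = padding, 1 = start, 2 = unknown,
# and real words are offset by 3.
INDEX_OFFSET = 3
SPECIAL_TOKENS = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}


def build_reverse_index(word_index):
    """Map encoded integers back to words, including the special tokens."""
    reverse = {rank + INDEX_OFFSET: word for word, rank in word_index.items()}
    reverse.update(SPECIAL_TOKENS)
    return reverse


def decode_review(encoded, reverse_index):
    """Turn a list of integers back into a space-separated string."""
    return " ".join(reverse_index.get(i, "<UNK>") for i in encoded)


reverse_index = build_reverse_index(toy_word_index)
decoded = decode_review([1, 4, 5, 6, 7, 2, 0, 0], reverse_index)
print(decoded)
```

With the real dataset, the same idea applies after replacing `toy_word_index` with the dictionary returned by `imdb.get_word_index()`.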

In [0]:
imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

### Preprocessing

In TensorFlow, all sequences in a batch need to have the same length (i.e., the same number of timesteps).

However, reviews naturally vary in length. As a workaround, we pad every review to the same length, which is 256 in this case. Reviews longer than 256 tokens are truncated. The model is meant to ignore the padded values when we train it.

Here, we are padding with zeros. By convention, index 0 does not stand for any word in this dataset, so it is safe to use as the padding value.

In [0]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=0,
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=256)
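
To show what the padding above does to individual reviews, here is a small NumPy sketch. `pad_post` is a hypothetical helper written for illustration, not the Keras function; it assumes the `pad_sequences` defaults as I understand them, in particular `truncating='pre'`, which means an over-long review keeps its last `maxlen` tokens.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """NumPy sketch of post-padding with Keras-style default truncation."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        # Long sequences keep their LAST maxlen tokens, mirroring the
        # assumed default truncating='pre' of pad_sequences.
        kept = list(seq)[-maxlen:]
        padded[row, :len(kept)] = kept
    return padded


# Two made-up "reviews": one shorter and one longer than maxlen.
toy_reviews = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
padded = pad_post(toy_reviews, maxlen=4)
print(padded)
```

The short review gets a trailing zero, while the long one loses its first two tokens.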

## Building the model

`vocab_size` is the number of distinct words we keep, which is 10,000 here.

We start with an embedding layer that maps each word index to a 100-dimensional vector. A masking layer follows, intended to make the model ignore the padded parts of each sequence. Next comes an LSTM layer with 100 units, then a fully connected layer with 16 neurons, and finally a single sigmoid neuron that outputs a value between 0 and 1.

For an explanation of embeddings, read the first section of this [post](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html) in the PyTorch documentation.
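
One caveat on the masking step, offered as an observation rather than a correction of the results reported here. As I understand the Keras docs, `Masking` skips a timestep only when every feature at that timestep equals `mask_value`. After the embedding lookup, the padding index 0 is mapped to a learned vector that is generally non-zero, so a masking layer placed after the embedding may not mask anything. The NumPy sketch below illustrates this with a random, made-up embedding table; the `mask_zero=True` argument of the `Embedding` layer is the usual way to mask on the integer ids instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny embedding table: 5 tokens, 3 dimensions each.
embedding_table = rng.normal(size=(5, 3))

# One padded sequence: two real tokens followed by two padding zeros.
token_ids = np.array([[3, 1, 0, 0]])
embedded = embedding_table[token_ids]  # shape (1, 4, 3)

# Rule applied to the embedded vectors by a mask_value=0. check:
# keep a timestep if any feature differs from 0.
mask_after_embedding = np.any(embedded != 0.0, axis=-1)

# Rule applied when masking on the integer ids (the mask_zero=True idea).
mask_on_ids = token_ids != 0

print(mask_after_embedding)  # padding positions are NOT masked out
print(mask_on_ids)           # padding positions are masked out
```

If masking matters for your experiments, it is worth checking the mask that actually reaches the LSTM before relying on it.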

In [5]:
vocab_size = 10000
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 100),
    keras.layers.Masking(mask_value=0., input_shape=(256, 100)),
    keras.layers.LSTM(100),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid) 
])
model.summary()



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 100)         1000000   
_________________________________________________________________
masking_1 (Masking)          (None, None, 100)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 16)                1616      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 1,082,033
Trainable params: 1,082,033
Non-trainable params: 0
_________________________________________________________________
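
The parameter counts in the summary above can be checked by hand, which is a useful habit when getting dimensions right. The sketch below redoes the arithmetic in plain Python, assuming the standard LSTM layout of four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias. The variable names are only for illustration.

```python
vocab_size = 10000
embedding_dim = 100
lstm_units = 100
dense_units = 16

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: four weight blocks, each with input weights, recurrent weights,
# and one bias per unit.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus one bias per output unit.
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total_params)
```

The masking layer has no weights, so it contributes nothing to the total.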


We use binary cross-entropy (`binary_crossentropy`) as the loss function because we are dealing with binary classification (positive or negative review).

In [0]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])
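
For intuition about what this loss measures, here is a rough NumPy sketch of mean binary cross-entropy. It is not the Keras implementation; `binary_cross_entropy` is a hypothetical helper, and the labels and probabilities are made up. The loss is small when the predicted probability agrees with the label and grows quickly for confident wrong predictions.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clipping avoids log(0)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_example.mean())


# Same labels, once with good predictions and once with bad ones.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```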

For the sake of time, I am only going to train on part of the training set: the slice below skips the first 10,000 records and keeps the remaining 15,000.

In [0]:
partial_x_train = train_data[10000:]
partial_y_train = train_labels[10000:]

In [8]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=5,
                    batch_size=1024,
                    verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### Testing

We see that this approach gets us a test accuracy of 83.57%. We can likely improve this by training for longer or using a better architecture. Feel free to experiment with the architecture, but make sure to get your dimensions right. For that, always refer to the docs.

In [9]:
results = model.evaluate(test_data, test_labels)

print(results)

[0.44628088002204896, 0.83572]
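
The two numbers returned by `evaluate` are the loss followed by the metric requested in `compile`, which is accuracy here. As a rough illustration of how accuracy is derived from sigmoid outputs, the sketch below uses made-up probabilities and labels and assumes the usual 0.5 decision threshold.

```python
import numpy as np

# Hypothetical sigmoid outputs for six reviews and their true labels.
predicted_probs = np.array([0.92, 0.15, 0.60, 0.40, 0.08, 0.75])
true_labels = np.array([1, 0, 0, 1, 0, 1])

# A probability above 0.5 is read as "positive".
predicted_labels = (predicted_probs > 0.5).astype(int)

# Accuracy is the fraction of reviews whose predicted label is correct.
accuracy = float((predicted_labels == true_labels).mean())
print(predicted_labels, accuracy)
```

On the real model, the probabilities would come from `model.predict(test_data)` instead of a hand-written array.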


This implementation is based on [this](https://www.tensorflow.org/tutorials/keras/basic_text_classification) TensorFlow tutorial. Read it for a more detailed explanation.

[This](https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17) post covers more ways of doing text classification with LSTMs.

### PyTorch implementation

Refer to [this](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html) post for a PyTorch implementation of a similar problem.