## Stage 1: Installing dependencies and setting up the GPU environment

In [1]:
!pip install numpy==1.16.1

Collecting numpy==1.16.1
  Downloading https://files.pythonhosted.org/packages/f5/bf/4981bcbee43934f0adb8f764a1e70ab0ee5a448f6505bd04a87a2fda2a8b/numpy-1.16.1-cp36-cp36m-manylinux1_x86_64.whl (17.3MB)
ERROR: umap-learn 0.4.6 has requirement numpy>=1.17, but you'll have numpy 1.16.1 which is incompatible.
ERROR: tensorflow 2.4.0 has requirement numpy~=1.19.2, but you'll have numpy 1.16.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
Installing collected packages: numpy
  Found existing installation: numpy 1.19.4
    Uninstalling numpy-1.19.4:
      Successfully uninstalled numpy-1.19.4
Successfully installed numpy-1.16.1


## Stage 2: Importing project dependencies

In [2]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.datasets import imdb

In [3]:
tf.__version__

'2.4.0'

## Stage 3: Dataset preprocessing

### Setting up dataset parameters

In [9]:
number_of_words = 20000
# Keep only the 20,000 most frequent words of the IMDB dataset; rarer words are replaced by an out-of-vocabulary token

max_len = 100

# Every review is represented as a sequence of exactly 100 elements. For example, a review of 5 words becomes
# a sequence with 5 real tokens and 95 padding tokens; reviews longer than 100 tokens are truncated.
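
As a rough illustration of what the vocabulary cap does, the sketch below mimics in plain Python how word indices at or above number_of_words are replaced. It assumes the Keras default out-of-vocabulary index (oov_char=2); the helper name cap_vocabulary and the toy review are hypothetical and not part of the notebook.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [w if w < num_words else oov_char for w in sequence]

# Toy review: indices 25000 and 31999 are rarer than the 20,000-word cap.
toy_review = [1, 14, 22, 25000, 43, 31999]
capped = cap_vocabulary(toy_review, num_words=20000)
print(capped)  # [1, 14, 22, 2, 43, 2]
```

The review keeps its length; only the rare words lose their identity.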

### Loading the IMDB dataset

In [10]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=number_of_words)


### Padding all sequences to the same length

To use the review text in an RNN, all reviews must have the same length. Some reviews are only 4 or 5 words long while others run to 20 words or more, so shorter reviews are filled with padding tokens (and longer ones are truncated) until every row of the input tensor has max_len elements.

In [11]:
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)

In [28]:
X_train

array([[1415,   33,    6, ...,   19,  178,   32],
       [ 163,   11, 3215, ...,   16,  145,   95],
       [1301,    4, 1873, ...,    7,  129,  113],
       ...,
       [  11,    6, 4065, ...,    4, 3586,    2],
       [ 100, 2198,    8, ...,   12,    9,   23],
       [  78, 1099,   17, ...,  204,  131,    9]], dtype=int32)

In [12]:
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)
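
The effect of pad_sequences can be sketched without TensorFlow. The helper below is a hypothetical pure-Python stand-in that assumes the function's default settings (padding and truncation both applied at the front of the sequence, padding value 0); it is meant to show the behavior, not to replace the library call.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad at the front, truncate from the front."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(sequence)) + sequence

short_review = [1, 14, 22]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=5)
print(padded_short)  # [0, 0, 1, 14, 22]
print(padded_long)   # [3, 4, 5, 6, 7]
```

Under these defaults a review longer than max_len keeps its last 100 tokens, and a shorter one is left-padded with zeros.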

### Setting up Embedding Layer parameters

In [13]:
vocab_size = number_of_words  # passed to the Embedding layer as input_dim
vocab_size

20000

In [14]:
embed_size = 128  # length of each word vector (the Embedding layer's output_dim); the embedding matrix has 128 columns

## Stage 4: Building a Recurrent Neural Network

### Defining the model

In [15]:
model = tf.keras.Sequential()

### Adding the Embedding Layer

The Embedding layer creates a vector representation of each word in the reviews. Instead of loading pre-trained word vectors, we let this layer train the vectors itself, stored in one large matrix. Each row of the matrix corresponds to one word of the vocabulary (20,000 words here, so 20,000 rows), and the columns hold the encoding of that word, known as its representation. The word representations are therefore learned jointly with the other weights of the network. The trained matrix encodes relationships between words, which the RNN can use to predict whether a given combination of words leads to a positive or a negative review.

In [16]:
model.add(tf.keras.layers.Embedding(vocab_size, embed_size, input_shape=(X_train.shape[1],)))

 # X_train.shape is (number of reviews, max_len), so shape[1] (index 1, the second element) is the sequence length of 100
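
Conceptually, the Embedding layer is a row lookup in a trainable matrix. The sketch below shows that lookup in plain Python with toy sizes; the names embedding_matrix and embed are hypothetical, and the random values only stand in for weights that the real layer would learn during training.

```python
import random

vocab_size = 6   # toy value; the notebook uses 20000
embed_size = 4   # toy value; the notebook uses 128

random.seed(0)
# One row per word index, one column per embedding dimension.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embed_size)]
                    for _ in range(vocab_size)]

def embed(sequence):
    """Look up the matrix row for every word index in the sequence."""
    return [embedding_matrix[w] for w in sequence]

padded_review = [0, 0, 3, 5, 1]
vectors = embed(padded_review)
print(len(vectors), len(vectors[0]))  # 5 4
```

A padded review of length 100 is mapped the same way to a 100 x 128 block of numbers, which is what the LSTM layer receives.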

### Adding the LSTM Layer

This layer helps the RNN capture the relationships between the words of the input sequence.

- units: 128 (number of cells/neurons in the LSTM layer)
- activation: tanh (hyperbolic tangent activation function)

In [17]:
model.add(tf.keras.layers.LSTM(units=128, activation='tanh'))

### Adding the Dense output layer

- units: 1 (each label is either 0 for negative or 1 for positive, so a single output neuron is enough)
- activation: sigmoid (outputs the probability that a review is positive)

In [18]:
model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

 # the output layer is fully connected to the previous layer, so we use the Dense class
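
To make the role of the sigmoid concrete, here is a small plain-Python sketch of how a raw score becomes a probability and then a label. The 0.5 threshold and the helper name label_from_probability are assumptions chosen for illustration, not something the notebook defines.

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def label_from_probability(p, threshold=0.5):
    """Hypothetical helper: call the review positive at or above the threshold."""
    return "positive" if p >= threshold else "negative"

for z in (-2.0, 0.0, 3.0):
    p = sigmoid(z)
    print(z, round(p, 3), label_from_probability(p))
```

A score of 0 maps to a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0.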

### Compiling the model

In [19]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# RMSprop is a commonly recommended optimizer for RNNs and often gives good results
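
The binary_crossentropy loss named in the compile call can be sketched in a few lines of plain Python. This is a textbook version written for illustration: the function below and its clipping constant eps are assumptions, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```

Confident correct predictions give a loss close to 0, while confident wrong ones are penalized heavily.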

In [20]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 128)          2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
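
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding matrix, an LSTM layer with four gates, and a dense layer; the variable names are chosen for this example only.

```python
vocab_size, embed_size, lstm_units = 20000, 128, 128

# One weight per cell of the embedding matrix.
embedding_params = vocab_size * embed_size
# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embed_size + lstm_units) + lstm_units)
# One weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)     # 2560000 131584 129
print(embedding_params + lstm_params + dense_params)   # 2691713
```

The embedding matrix accounts for about 95% of the trainable parameters.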


### Training the model

In [21]:
model.fit(X_train, y_train, epochs=3, batch_size=128)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f89924515f8>
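
For a sense of how much work each epoch involves, the number of weight updates can be estimated from the batch size. The sketch assumes the training split holds 25,000 reviews, which is the size of the standard Keras IMDB training set but is not printed anywhere in this notebook.

```python
import math

num_train_reviews = 25000  # assumption: size of the standard Keras IMDB training split
batch_size = 128
epochs = 3

steps_per_epoch = math.ceil(num_train_reviews / batch_size)
print(steps_per_epoch)           # 196
print(steps_per_epoch * epochs)  # 588
```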

### Evaluating the model

In [22]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)

In [23]:
print("Test accuracy: {}".format(test_accuracy))

Test accuracy: 0.8521999716758728
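
The accuracy figure is the fraction of test reviews whose thresholded probability matches the label. The sketch below reproduces that calculation on made-up numbers; the function, the labels and the probabilities are all hypothetical, and the 0.5 threshold is an assumption.

```python
def binary_accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of predictions whose thresholded value matches the label."""
    hits = sum(1 for y, p in zip(y_true, p_pred) if (p > threshold) == bool(y))
    return hits / len(y_true)

labels = [1, 0, 1, 1]
probabilities = [0.91, 0.20, 0.45, 0.77]
print(binary_accuracy(labels, probabilities))  # 0.75
```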


In [25]:
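probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])

# Caution: the model already ends in a sigmoid, so model.predict(X_test) returns probabilities directly.
# Softmax over a single output unit always returns 1.0, so this wrapper hides the real score.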

In [26]:
predictions = probability_model.predict(X_test)

In [31]:
predictions[200]

array([1.], dtype=float32)
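
A prediction of exactly 1.0 is what softmax returns for any single-element input, because the one value is normalized against itself. The plain-Python sketch below illustrates this; the softmax helper is written for this example and the sample sigmoid outputs are made up.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Whatever score the single sigmoid unit produces, softmax maps it to 1.0.
for sigmoid_output in (0.03, 0.5, 0.97):
    print(sigmoid_output, softmax([sigmoid_output]))  # always [1.0]
```

For a model with one sigmoid output, reading the probabilities from model.predict(X_test) and comparing them with a threshold such as 0.5 is likely the more informative approach.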