<a href="https://colab.research.google.com/github/HeadHunter28/DeepLearning/blob/main/Basic%20RNN%20-%20Sentiment%20Analysis/Simple_RNN_%2B_NLP_Sentiment_Analysis_on_imDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment analysis on IMDB reviews using recurrent neural networks

- Building an RNN on the IMDB dataset from Keras datasets.

#### 1. Importing libraries :

In [None]:
from tensorflow import keras
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import SimpleRNN
from keras.datasets import imdb
from keras import initializers

#### 2. Setting parameters for text processing :

In [None]:
# The size of our vocabulary :
max_features = 20000  # This is used in loading the data, picks the most common (max_features) words

maxlen = 30  # maximum length of a sequence - truncate after this

batch_size = 32

#### 3. Loading data from TensorFlow :

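- The dataset is already tokenized: `imdb.load_data` returns each review as a list of integer word indices, keeping only the `max_features` most frequent words.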

In [None]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
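The integer encoding can be illustrated with a toy example. This is only a sketch, assuming the documented defaults of `imdb.load_data` (`start_char=1`, `oov_char=2`, `index_from=3`); the `word_rank` dictionary and the `encode_review` helper are hypothetical names made up for illustration and are not part of Keras.

```python
# Toy illustration of the integer encoding used by the IMDB dataset.
# word_rank is a made-up frequency ranking (1 = most frequent word).
word_rank = {"the": 1, "movie": 2, "was": 3, "great": 4, "zyzzyva": 5}

def encode_review(words, num_words, start_char=1, oov_char=2, index_from=3):
    """Map words to indices; any index at or beyond num_words becomes oov_char."""
    ids = [start_char] + [word_rank[w] + index_from for w in words]
    return [i if i < num_words else oov_char for i in ids]

encoded = encode_review(["the", "movie", "was", "zyzzyva"], num_words=8)
print(encoded)  # [1, 4, 5, 6, 2]
```

The rare word falls outside the vocabulary limit, so it is replaced by the out-of-vocabulary index. This is why the index 2 shows up so often in the encoded reviews.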


In [None]:
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

25000 train sequences
25000 test sequences




---


#### 4.1 The text is already tokenized and limited to the 'max_features' most common words.

- But the tokenized reviews (each one is now a vector of integers) still have different lengths :

In [None]:
print(len(x_train[1]))
print(len(x_train[2])) #different lengths

189
141


#### 4.2 To be fed as input to an RNN, all sequences (vectors) must have the same length/dimension.

- All sequences are converted to the chosen sequence length (maxlen): sequences shorter than maxlen are PADDED, while longer ones are truncated.

- The truncation may lose some information, but it is a necessary step.

- Making the sequences the same length is done by the **pad_sequences** function of *Keras*, which pads or truncates each sequence to the given length.

In [None]:
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 30)
x_test shape: (25000, 30)


- Now all sequences are of the same length.
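To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_or_truncate` is hypothetical and is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad on the left, truncate from the left."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                      # keep the LAST maxlen tokens
    return [value] * (maxlen - len(seq)) + seq    # fill the front with zeros

short = pad_or_truncate([5, 8, 3], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 3]
print(long_)  # [3, 4, 5, 6, 7]
```

With these defaults, a long review keeps its final `maxlen` tokens, so the model sees the end of the review rather than the beginning.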



---



#### 5. Making an Embedding Layer



- The text sequences are tokenized and of the same length, but are still in **integer representation** form :

In [None]:
print(x_train[2])
print(x_train[3])

[  47    6 2307   51    9  170   23  595  116  595 1352   13  191   79
  638   89    2   14    9    8  106  607  624   35  534    6  227    7
  129  113]
[   12  1685   195    25   238    60   796 13713     4   671     7  2804
     5     4   559   154   888     7   726    50    26    49  7008    15
   566    30   579    21    64  2574]


- To be fed into an RNN layer, we have to convert the integer representation of tokens into **dense vector representations (embeddings)** by mapping each integer index to a corresponding dense vector.

- The embedding layer is initialized with an embedding matrix, where each row of the matrix corresponds to the embedding vector for a unique token in the vocabulary.

- The number of rows in the matrix is equal to the size of the vocabulary, and the number of columns is equal to the desired dimensionality of the embedding space. Each row represents the embedding vector for a specific token.

- We can think of this as learning a word vector embedding "on the fly" rather than using an existing mapping (like GloVe).
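The lookup described above amounts to row indexing into the embedding matrix. The following is a rough sketch with toy sizes and randomly filled values; the names `embedding_matrix` and `embed` are hypothetical, and in the real layer the matrix entries are trained along with the rest of the network.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 3   # toy sizes; the notebook uses 20000 and 50

# One row per token index, one column per embedding dimension.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer token with its row of the embedding matrix."""
    return [embedding_matrix[token] for token in sequence]

vectors = embed([4, 0, 4, 2])
print(len(vectors), len(vectors[0]))  # 4 3
```

A sequence of 4 tokens becomes 4 vectors of length 3, and both occurrences of token 4 map to the same row.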

----

#### Using Keras library :

#### **keras.layers.Embedding**(**input_dim**, **output_dim**, **embeddings_initializer**='uniform', **embeddings_regularizer**=None, **activity_regularizer**=None, **embeddings_constraint**=None, **mask_zero**=False, **input_length**=None)

- This layer maps each integer into a distinct (dense) word vector of length output_dim.

- The **input_dim** should be the size of the vocabulary.

- The **input_length** specifies the length of the sequences that the network expects.

---

We will feed the following values to the parameters:

- For input_dim : size of our vocabulary (max_features), i.e. 20000

- For output_dim : word_embedding_dim (50)

 Each integer in the sequence will be taken and embedded in a 50-dimensional vector.



#### 6. Making a Recurrent Layer

#### **keras.layers.SimpleRNN**(units, activation='tanh', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)

- The parameter units gives the dimensionality of the output (and therefore the hidden state).

- Note that typically there will be another layer after the RNN mapping the (RNN) output to the network output.
So we should think of this value as the desired dimensionality of the hidden state and not necessarily the desired output of the network.

- We set the parameters as follows :

    *units* : rnn_hidden_dim (5)

    *activation* : 'relu'

    *input_shape* : x_train.shape[1:], i.e. the shape of one padded sequence

    *kernel_initializer* : initializers.RandomNormal(stddev=0.001)

    *recurrent_initializer* : initializers.Identity(gain=1.0)
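As an illustration of what the recurrent layer computes, here is a pure-Python sketch of a simple RNN step, h_t = relu(x_t W + h_(t-1) U + b), applied over a two-step sequence. The sizes and weight values are toy assumptions chosen so the arithmetic is easy to follow, and `rnn_step` is a hypothetical helper rather than the Keras implementation.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def rnn_step(x_t, h_prev, W, U, b):
    """One simple RNN step: h_t = relu(x_t @ W + h_prev @ U + b)."""
    units = len(h_prev)
    pre = []
    for j in range(units):
        s = b[j]
        s += sum(x_t[i] * W[i][j] for i in range(len(x_t)))      # input contribution
        s += sum(h_prev[k] * U[k][j] for k in range(units))      # recurrent contribution
        pre.append(s)
    return relu(pre)

# Toy sizes: 2-dimensional inputs, 2 hidden units (the notebook uses 50 and 5).
W = [[0.5, -1.0], [0.25, 0.0]]   # input-to-hidden weights
U = [[1.0, 0.0], [0.0, 1.0]]     # identity recurrent weights
b = [0.0, 0.0]

h = [0.0, 0.0]                   # initial hidden state
for x_t in [[1.0, 2.0], [2.0, 0.0]]:
    h = rnn_step(x_t, h, W, U, b)
print(h)  # [2.0, 0.0]
```

With an identity recurrent matrix, the previous hidden state is carried forward unchanged and the new input is simply added to it, which is the starting behaviour that `initializers.Identity(gain=1.0)` gives the layer before training.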



#### 7. Making a Dense Layer

- One node (output is binary)

- Uses the sigmoid activation function, the usual choice for binary classification because it maps the output to a probability between 0 and 1
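A small sketch of how a single sigmoid output turns into a class label. The 0.5 threshold and the helper name `predict_label` are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Return the probability and the binary label (1 = positive review)."""
    p = sigmoid(z)
    return p, int(p >= threshold)

print(predict_label(2.0))
print(predict_label(-2.0))
```

A positive pre-activation gives a probability above 0.5 and the label 1, while a negative one gives the label 0.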

---

#### 8. Designing the Neural Network

- Adding all layers.

- Using gradient descent optimiser : RMSprop

- Using loss function : Binary cross-entropy
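For reference, binary cross-entropy can be computed by hand as below. This is a sketch of the standard formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions give a small loss, while confident wrong predictions are penalised heavily.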

In [None]:
rnn_hidden_dim = 5
word_embedding_dim = 50

In [None]:
RNN_model = Sequential()

# Embedding layer: maps each integer token to a dense 50-dimensional vector
RNN_model.add(Embedding(max_features, word_embedding_dim))

# Recurrent layer
RNN_model.add(SimpleRNN(rnn_hidden_dim,
                        kernel_initializer=initializers.RandomNormal(stddev=0.001),
                        recurrent_initializer=initializers.Identity(gain=1.0),
                        activation='relu',
                        input_shape=x_train.shape[1:]))

# Dense layer: a single sigmoid unit for the binary label
RNN_model.add(Dense(1, activation='sigmoid'))

- Let us check all the parameters :

In [None]:
RNN_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 50)          1000000   
                                                                 
 simple_rnn_2 (SimpleRNN)    (None, 5)                 280       
                                                                 
 dense_2 (Dense)             (None, 1)                 6         
                                                                 
Total params: 1000286 (3.82 MB)
Trainable params: 1000286 (3.82 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
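The parameter counts in the summary can be checked by hand. The sketch below uses hypothetical helper names and the standard counting rules: one weight per (input, output) pair, plus one bias per output unit for the recurrent and dense layers.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + biases
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # Weights + biases
    return input_dim * units + units

counts = [embedding_params(20000, 50), simple_rnn_params(50, 5), dense_params(5, 1)]
print(counts, sum(counts))  # [1000000, 280, 6] 1000286
```

Almost all of the trainable parameters sit in the embedding matrix. The recurrent and dense layers together contribute fewer than 300.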


- Setting the learning rate, loss function, metrics and compiling the network :

In [None]:
rmsprop = keras.optimizers.RMSprop(learning_rate = .0001)

RNN_model.compile(loss='binary_crossentropy',
              optimizer=rmsprop,
              metrics=['accuracy'])

#### 9. Training the neural network (10 epochs)

In [None]:
RNN_model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7afb46aec4f0>

#### 9.1 Evaluating the neural network's performance

In [None]:
score, acc = RNN_model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.455465167760849
Test accuracy: 0.781719982624054


The test accuracy obtained is 78.17%, which leaves room for improvement.

-----

#### 10. Testing performance with different values of:

- **Vocabulary size (max_features)**

- **Length of sequences (maxlen)**

- **Dimensionality of the hidden state (rnn_hidden_dim)**

- **Epochs** - Training the network for a larger number of epochs.
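One way to organise such experiments is to enumerate the combinations up front. The candidate values below are hypothetical and chosen only for illustration; they are not results or recommendations from this notebook.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
search_space = {
    "max_features": [5000, 20000],
    "maxlen": [30, 80],
    "rnn_hidden_dim": [5, 10],
}

# Build one dictionary per combination of settings.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]

print(len(configs))  # 8
print(configs[0])    # {'max_features': 5000, 'maxlen': 30, 'rnn_hidden_dim': 5}
```

Each dictionary can then drive one iteration: reload and pad the data with its `max_features` and `maxlen`, rebuild the model with its `rnn_hidden_dim`, and record the test accuracy.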


#### 10.1 Iteration 1

- Vocabulary size : 20000 (same as before)

- Sequence length : 80 (initially 30)

- Keeping hidden state dimension (5) and embedding layer dimension (50) same.

In [None]:
max_features = 20000  # This is used in loading the data, picks the most common (max_features) words
maxlen = 80  # maximum length of a sequence - truncate after this

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

- Making and compiling the neural network :

In [None]:
rnn_hidden_dim = 5
word_embedding_dim = 50
RNN_model1= Sequential()
RNN_model1.add(Embedding(max_features, word_embedding_dim))  # This layer takes each integer in the sequence and embeds it in a 50-dimensional vector
RNN_model1.add(SimpleRNN(rnn_hidden_dim,
                    kernel_initializer=initializers.RandomNormal(stddev=0.001),
                    recurrent_initializer=initializers.Identity(gain=1.0),
                    activation='relu',
                    input_shape=x_train.shape[1:]))

RNN_model1.add(Dense(1, activation='sigmoid'))

In [None]:
rmsprop = keras.optimizers.RMSprop(learning_rate = .0001)

RNN_model1.compile(loss='binary_crossentropy',
              optimizer=rmsprop,
              metrics=['accuracy'])

- Training the neural network :

In [None]:
RNN_model1.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

- Evaluating performance :

In [None]:
score, acc = RNN_model1.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)