Course credits: inspired by the Advanced Data Science course from Harvard.

# Course: Apprentissage statistique
## CORO-SIP, LEARN-TP
---
## Enseignants:
 * Dawood ALCHANTI
 * Mathieu LAGRANGE
---

# Recurrent Neural Network Part 2: 
### Complexity measure, behaviour analysis and usability.

## 1. Materials
#### 1. In this lab we will use Google Colab: https://colab.research.google.com/notebooks/intro.ipynb#recent=true

#### 2. Open the link above, go to **File** and select **Upload notebook**, then choose the file ''Recurrent_Neural_Network_Lab_Part_1''

#### 3. Before running your code, go to Edit, Notebook settings, and change the hardware accelerator to GPU so that you can use the Google Colab GPUs.

## In this lab we will look mainly at:
1. Recurrent Neural Networks (RNNs),
2. LSTMs and their building blocks.

## Goals: By the end of this lab, you should:
1. be able to **use RNNs and their variants (GRU, LSTM)** with the Keras library, which is built on TensorFlow.
2. **understand how any sequential data would fit** into and benefit from a recurrent architecture.
3. become familiar with **text preprocessing and dynamic embeddings**.
4. **understand the underlying complexity when using an RNN**.
5. **understand the gradient issues** that arise when RNNs are used to process longer sentences.
6. become familiar with **different kinds of LSTM architectures**, classifiers, and sequence to sequence models.

**Keywords: RNN, LSTM, RNN+CNN**

# The problem we will be working on today:

## **Sentiment classification on a movie review dataset using IMDb Reviews**

Inspired by Kaggle competitions; more information is available at the following link: https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews


**We are going to build a**:

1. Dense Feedforward Neural Network without and with Embedding Layer,
2. CNN Network, 
3. RNN Network,  
4. LSTM Network

and combine one or more of them to understand performance.

**Understanding the Context**:

1. A sentence can be thought of as a sequence of words that collectively represent meaning.
2. Individual words impact the meaning.
3. Thus, the context matters; words that occur earlier in the sentence influence the sentence's structure and meaning in the latter part of the sentence (e.g., Jose asked Anqi if she were going to the library today).
4. Likewise, words that occur later in a sentence can affect the meaning of earlier words (e.g., Apple is an interesting company). 
5. If we wish to make use of a full sentence's context in both directions, then we should use a bi-directional RNN (e.g., Bi-LSTM). 

* For the purpose of this tutorial, we are going to restrict ourselves to only uni-directional RNNs.



In [1]:
# Dependencies: for simplicity we rely on Keras with the TensorFlow backend
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding
from keras.layers import LSTM, SimpleRNN
from keras.layers import Conv1D, MaxPooling1D
from keras.preprocessing import sequence
# fix random seed for reproducibility
np.random.seed(1)

# **Natural Language Processing:**

1. The goal is to give computers a numerical representation of words that they can work with.
2. The first crucial step is to clean (pre-process) your data so that you can soundly make use of it.
3. Within NLP, this first step is called tokenization and it concerns how to represent each token (a.k.a. word) of your corpus (i.e., dataset).


# Data Preprocessing in NLP: 

### 1. TOKENIZATION --> Word-Based Encodings and Transforming Text to Sequence:

1. A **token** refers to a single, atomic unit of meaning (i.e., a **word**).

2. ***How should our computers represent each word?*** We could read in our corpus word by word and store each word as a String (data structure). However, *Strings tend to use more computer memory* *than Integers* and can become cumbersome. 

3. We are better off **converting each distinct word to a distinct number (Integer).**

4. As a simple example of tokenization, assume we have the **six** **sentences** below as our **entire corpus**:
* i have books 
* interesting books are useful 
* i have computers 
* computers are interesting and useful 
* books and computers are both valuable. 
* Bye Bye


**Now, let us assign a token to each word of the vocabulary based on its frequency of occurrence: the most frequent word gets index 1, and words with the same frequency keep their order of first appearance. We obtain the following tokens:**

'books': 1, 'are': 2, 'computers': 3, 'i': 4, 'have': 5, 'interesting': 6, 'useful': 7, 'and': 8, 'bye': 9, 'both': 10, 'valuable': 11

***Hence, the representation of our sentences will be as follows:***

* i have books --> [4, 5, 1]
* interesting books are useful --> [6, 1, 2, 7]
* i have computers --> [4, 5, 3]
* computers are interesting and useful -->  [3, 2, 6, 8, 7]
* books and computers are both valuable -->   [1, 8, 3, 2, 10, 11]
* Bye Bye --> [9, 9]


* For more information: https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html
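
The frequency-based assignment above can be sketched in plain Python. This is only an illustration, not the Keras implementation: the helper names `build_word_index` and `to_sequence` are hypothetical, and the sketch simply lowercases and splits on whitespace, which is enough for this toy corpus.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer: most frequent word -> 1, next -> 2, ..."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # most_common() sorts by count; words with equal counts keep the
    # order in which they were first seen.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(sentence, word_index):
    """Replace every known word of a sentence by its integer token."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]

corpus = [
    'i have books',
    'interesting books are useful',
    'i have computers',
    'computers are interesting and useful',
    'books and computers are both valuable',
    'Bye Bye',
]

toy_index = build_word_index(corpus)
print(toy_index)
print(to_sequence('computers are interesting and useful', toy_index))
```

On this corpus the sketch reproduces the token table given above; real text would additionally need punctuation handling.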

## How to do that automatically?

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
sentences = [
             'i have books',
             'interesting books are useful',
             'i have computers',
             'computers are interesting and useful',
             'books and computers are both valuable',
             'Bye Bye'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'books': 1, 'are': 2, 'computers': 3, 'i': 4, 'have': 5, 'interesting': 6, 'useful': 7, 'and': 8, 'bye': 9, 'both': 10, 'valuable': 11}


In [4]:
sequences = tokenizer.texts_to_sequences(sentences)

In [5]:
print(word_index)

{'books': 1, 'are': 2, 'computers': 3, 'i': 4, 'have': 5, 'interesting': 6, 'useful': 7, 'and': 8, 'bye': 9, 'both': 10, 'valuable': 11}


In [6]:
for sentence, seq in zip(sentences, sequences):
    print(sentence + ' : ' + str(seq))

i have books : [4, 5, 1]
interesting books are useful : [6, 1, 2, 7]
i have computers : [4, 5, 3]
computers are interesting and useful : [3, 2, 6, 8, 7]
books and computers are both valuable : [1, 8, 3, 2, 10, 11]
Bye Bye : [9, 9]


# 2. Padding:

1. If we were training our RNN one sentence at a time, it would be okay to have sentences of varying lengths. We could first feed the 1st sentence 'i have books' with sequence representation [4, 5, 1], which has size 3, then the second sentence 'interesting books are useful' [6, 1, 2, 7], which has size 4, and so on.

2. However, as with any neural network, it is often advantageous to train on inputs in batches, so our input tensors need to have the same length. We can take the longest sentence in our training set as the maximal length: **in the example above**, it is 'books and computers are both valuable', with representation [1, 8, 3, 2, 10, 11] and size 6. Each of the other sentences is then padded with zeros at the front until the maximal length is reached. That is:

---
* i have books :[ 0,  0,  0,  4,  5,  1]
* interesting books are useful : [ 0,  0,  6,  1,  2,  7]
* i have computers :[ 0,  0,  0,  4,  5,  3]
* computers are interesting and useful : [ 0,  3,  2,  6,  8,  7]
* books and computers are both valuable : [1, 8, 3, 2, 10, 11]
* Bye Bye :  [ 0,  0,  0,  0,  9,  9]



For more information: https://www.tensorflow.org/guide/keras/masking_and_padding
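
As a rough sketch of what padding does, the function below left-pads a sequence with zeros and, if the sequence is too long, keeps only its last `maxlen` tokens. The name `pad_pre` is hypothetical; the behaviour is meant to mirror the default `padding='pre'` and `truncating='pre'` settings of Keras' `pad_sequences`, but it is not the library code.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen; keep the last maxlen tokens if longer."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

toy_sequences = [[4, 5, 1], [6, 1, 2, 7], [1, 8, 3, 2, 10, 11], [9, 9]]
for s in toy_sequences:
    print(pad_pre(s, maxlen=6))
```

Padding at the front keeps the real words next to the end of the sequence, which is the part a uni-directional RNN reads last.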



## How to do that automatically?

In [7]:
from tensorflow.keras.preprocessing import sequence

In [8]:
PaddedSequence = sequence.pad_sequences(sequences, maxlen=6)

In [9]:
for sentence, padded in zip(sentences, PaddedSequence):
    print(sentence + ' : ' + str(padded))

i have books : [0 0 0 4 5 1]
interesting books are useful : [0 0 6 1 2 7]
i have computers : [0 0 0 4 5 3]
computers are interesting and useful : [0 3 2 6 8 7]
books and computers are both valuable : [ 1  8  3  2 10 11]
Bye Bye : [0 0 0 0 9 9]


## Thankfully, our dataset is already represented in such a tokenized form and no further preprocessing is required. However, it is not padded and we should perform this step.

1. Let us **load the dataset from the imdb module** using the load_data function.

* We will **restrict our vocabulary size** to 10000 so that we have a finite vocabulary and our word matrices do not become arbitrarily large; therefore we will set vocabulary_size = 10000
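
To make the effect of a vocabulary cap concrete, here is a small sketch that replaces every token index at or above the cap with a single out-of-vocabulary index. The function name `cap_vocabulary` is hypothetical, and the value 2 is an assumption that mirrors the default `oov_char` of `imdb.load_data`.

```python
def cap_vocabulary(seq, num_words, oov_index=2):
    """Replace every token index >= num_words by one out-of-vocabulary index."""
    return [w if w < num_words else oov_index for w in seq]

# Two rare words (indices 15000 and 10000) fall outside a 10000-word vocabulary.
print(cap_vocabulary([1, 14, 22, 15000, 43, 10000], num_words=10000))
```

With the cap in place, an embedding matrix needs only `num_words` rows, whatever the size of the raw corpus vocabulary.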



In [10]:
# We want a finite vocabulary to make sure that our word matrices are not arbitrarily large
vocabulary_size = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
print('Number of training reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))
print('First review vector', X_train[0])
print('First review label', y_train[0])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




Number of training reviews 25000
Length of first and fifth review before padding 218 147
First review vector [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16



Thus, let us now pad our sentences.

* We also want **reviews of a finite length** so that we do not have to process very long sentences; therefore we will set *max_review_length = 500*


---

* **Hint**: use the sequence module and its pad_sequences function. It takes as arguments our data and the maximum length to pad to.
---



### Q.1 Pad the training and test sets so that every review has length 500

In [11]:
max_review_length = 500

# Pad (or truncate) every review to exactly max_review_length tokens
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))

Length of first and fifth review after padding 500 500


# Q.2 MODEL 1A : FEED-FORWARD NETWORKS WITHOUT EMBEDDINGS

1. Build a feed-forward net with a single hidden layer of 250 nodes to do classification.
2. Each input is a 500-dim vector of tokens, since we padded all our sequences to size 500.
3. Calculate the number of parameters involved in this network.
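
The parameter count of a fully connected layer can be checked by hand: one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is hypothetical; the layer sizes are the ones given in this question.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(500, 250)   # 500 padded tokens -> 250 hidden nodes
output = dense_params(250, 1)     # 250 hidden nodes -> 1 sigmoid node
print(hidden, output, hidden + output)
```

These numbers can be compared with the `Param #` column printed by `model.summary()`.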

---
Remark: Check on 
* https://keras.io/guides/functional_api/ 
* https://keras.io/guides/sequential_model/

to build your 1st model

---




**Here you must**: 
1. Define your model.
2. Add Dense layers.
3. Compile your model (loss, optimizer and metrics).
4. Print your model summary and check the total number of model parameters.
5. Train your model, then evaluate it over the test set.
6. Print the accuracy.

In [12]:
# 1. Build a sequential model
model = Sequential()
num_classes = 1

# Hidden layer: a single Dense layer with 250 nodes and ReLU activation
model.add(Dense(250, activation='relu', input_dim=max_review_length))

# Output layer: the label is either 0 or 1, so a single sigmoid node is enough
model.add(Dense(num_classes, activation='sigmoid'))

# Use binary_crossentropy as the loss function, adam as the optimizer, and track the accuracy
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Print the model summary
print(model.summary())

# Train the model over the training set for 10 epochs with a batch size of 128.
# No validation data is passed here; add validation_data=(X_test, y_test) to monitor the test set.
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=2)

# Final evaluation of the model: get the model scores over the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 250)               125250    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 251       
Total params: 125,501
Trainable params: 125,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 2s - loss: 189.3796 - accuracy: 0.5041
Epoch 2/10
196/196 - 0s - loss: 51.9289 - accuracy: 0.5912
Epoch 3/10
196/196 - 0s - loss: 19.9678 - accuracy: 0.6674
Epoch 4/10
196/196 - 0s - loss: 8.9483 - accuracy: 0.7338
Epoch 5/10
196/196 - 0s - loss: 4.6588 - accuracy: 0.7831
Epoch 6/10
196/196 - 0s - loss: 2.7043 - accuracy: 0.8147
Epoch 7/10
196/196 - 0s - loss: 1.7247 - accuracy: 0.8392
Epoch 8/10
196/196 - 0s - loss: 1.2793 - accuracy: 0.8550
Epoch 9/10
196/196 - 0s - loss: 0.9349 

# *Discussion*: 
#### 1. Comment on the model performance (analyse the loss and the accuracy over the test set).
#### 2. What is wrong with feeding raw token indices to the network? Do you think they are a meaningful representation of the words?

# Q.3  MODEL 1B : FEED-FORWARD NETWORKS WITH EMBEDDINGS

---
**What is an embedding layer ?**
---

An embedding is a "**distributed representation**" (e.g., vector) of a particular atomic item (e.g., word token, object, etc). 

When representing items by embeddings:

* each distinct item should be represented by its own unique embedding
* the semantic similarity between items should correspond to the similarity between their respective embeddings (i.e., *words that are more similar to one another should have embeddings that are more similar to each other*).

*In general, though, one can view the embedding process as a linear projection from one vector space to another (e.g., a vector space of unique words being mapped to a space of fixed-length, dense vectors filled with continuous-valued numbers).*

---
**For NLP:** 
1. We usually use embeddings to project the one-hot encodings of words onto a lower-dimensional continuous space (e.g., vectors of size 100) so that the input surface is dense and possibly smooth.
2. Thus, one can view the embedding layer as just a transformation from $\mathbb{R}^{input}$ to $\mathbb{R}^{emb}$.
---


**One hot Encoding:** a vector that is the length of the entire vocabulary, and it is filled with all zeros except for a single value of 1 that corresponds to the particular word.
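
The "linear projection" view can be illustrated with NumPy: multiplying a one-hot vector by an embedding matrix selects one row of that matrix, which is why an embedding layer can be implemented as a simple table lookup. The sizes and the matrix `E` below are toy values chosen for illustration, not the ones learned in this lab.

```python
import numpy as np

vocab_size, emb_dim = 12, 4                  # toy sizes, for illustration only
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))   # embedding matrix, one row per token

def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

token = 6
projected = one_hot(token, vocab_size) @ E   # linear projection of the one-hot vector
looked_up = E[token]                         # equivalent row lookup
print(np.allclose(projected, looked_up))

# The matrix holds vocab_size * emb_dim trainable weights,
# e.g. 10000 * 100 = 1,000,000 for the sizes used in this lab.
print(E.size)
```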



Check on the following source to know how to use the Embedding layer:
https://keras.io/api/layers/core_layers/embedding/



In [13]:
embedding_dim = 100

In [14]:
model = Sequential()

# inputs are converted from (batch_size, sentence_length) to (batch_size, sentence_length, embedding_dim)
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
model.add(Flatten())

# Now repeat your previously defined model
# -------------------------------------

# Hidden layer: a single Dense layer with 250 nodes and ReLU activation.
# Its input size (500 * 100 = 50000 after Flatten) is inferred automatically.
model.add(Dense(250, activation='relu'))

# Output layer: the label is either 0 or 1, so a single sigmoid node is enough
model.add(Dense(num_classes, activation='sigmoid'))

# Use binary_crossentropy as the loss function, adam as the optimizer, and track the accuracy
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Print the model summary
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 100)          1000000   
_________________________________________________________________
flatten (Flatten)            (None, 50000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 250)               12500250  
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 251       
Total params: 13,500,501
Trainable params: 13,500,501
Non-trainable params: 0
_________________________________________________________________
None


# Train the defined Model

In [15]:
# Train the model over the training set for 10 epochs with a batch size of 128.
# No validation data is passed here; add validation_data=(X_test, y_test) to monitor the test set.
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=2)

# Final evaluation of the model: get the model scores over the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/10
196/196 - 5s - loss: 0.5868 - accuracy: 0.6615
Epoch 2/10
196/196 - 5s - loss: 0.1983 - accuracy: 0.9221
Epoch 3/10
196/196 - 5s - loss: 0.0468 - accuracy: 0.9868
Epoch 4/10
196/196 - 5s - loss: 0.0045 - accuracy: 0.9997
Epoch 5/10
196/196 - 5s - loss: 6.8039e-04 - accuracy: 1.0000
Epoch 6/10
196/196 - 5s - loss: 3.3671e-04 - accuracy: 1.0000
Epoch 7/10
196/196 - 5s - loss: 2.1489e-04 - accuracy: 1.0000
Epoch 8/10
196/196 - 5s - loss: 1.4842e-04 - accuracy: 1.0000
Epoch 9/10
196/196 - 5s - loss: 1.0829e-04 - accuracy: 1.0000
Epoch 10/10
196/196 - 5s - loss: 8.1968e-05 - accuracy: 1.0000
Accuracy: 87.59%


# *Discussion*:  
#### 1. Compare the performance with and without Embedding.
#### 2. What do you conclude?
#### 3. What are the main advantages of using an Embedding layer?

# Q.4 MODEL 2 : Build a CNN based Model

1. Text can be thought of as **1-dimensional sequence** (a single, long vector) 
2. Therefore, we can apply **1D Convolutions** over a set of word embeddings. 


* Use the model developed in Model 1B and modify it to include a 1D convolutional layer with 32 filters and a kernel size of 3, followed by a max pooling operation with a pool size of 2 to halve the sequence length.
* Keep the feed-forward layer of 250 nodes, with ReLU and sigmoid activations as appropriate.
* Fit the model over the training set.
* Evaluate the model performance over the test set.






**More resources on understanding convolutions in text can be found at:**

http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/




In [24]:

# create the CNN
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))

# Define a Conv1D layer with 32 filters and a kernel size of 3
model.add(Conv1D(32, kernel_size=3, padding='same', activation='relu'))

# Downsample the sequence by a factor of 2 using max pooling
model.add(MaxPooling1D(pool_size=2))

model.add(Flatten())

num_classes = 1
# Now repeat your previously defined model
# -------------------------------------

# Hidden layer: a single Dense layer with 250 nodes and ReLU activation.
# Its input size (250 * 32 = 8000 after Flatten) is inferred automatically.
model.add(Dense(250, activation='relu'))

# Output layer: the label is either 0 or 1, so a single sigmoid node is enough
model.add(Dense(num_classes, activation='sigmoid'))

# Use binary_crossentropy as the loss function, adam as the optimizer, and track the accuracy
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Print the model summary
print(model.summary())

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 500, 32)           9632      
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_12 (Dense)             (None, 250)               2000250   
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 251       
Total params: 3,010,133
Trainable params: 3,010,133
Non-trainable params: 0
____________________________________________

# Discussion: 
### Comment on the model performance in terms of:
1. accuracy, 
2. running time, and 
3. complexity: consider using the total number of parameters
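
For the complexity comparison, the parameter counts of the CNN model can be reproduced by hand. This is a sketch under the layer sizes used in this lab (vocabulary of 10000, embedding size 100, 32 filters of width 3, pooling by 2 over 500 steps); the helper names are hypothetical.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return (kernel_size * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding = 10000 * 100                      # vocabulary_size * embedding_dim
conv = conv1d_params(in_channels=100, filters=32, kernel_size=3)
flat = (500 // 2) * 32                       # pooled length * number of filters
dense_hidden = dense_params(flat, 250)
dense_out = dense_params(250, 1)
total = embedding + conv + dense_hidden + dense_out
print(conv, flat, dense_hidden, total)
```

Most of the saving relative to Model 1B comes from the smaller flattened vector that feeds the 250-node layer.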


## Q.5 MODEL 3 : Simple Recurrent Neural Network (RNN)

More resources on understanding RNN and LSTM are:

1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
2. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

* An RNN is similar to a feed-forward neural network in that there is an input layer, a hidden layer, and an output layer. 
* The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer.
*  However, the hidden layer for a given time  $t$ is not only based on the input layer at time  $t$ but also the hidden layer from time $t-1$.
* Mathematically, a simple RNN can be defined by the following recurrence relation, where $\sigma$ is the activation function (SimpleRNN in Keras uses $\tanh$ by default):


---
$$h_{t} = \sigma(W[h_{t-1},x_{t}]+b)$$
---
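
The recurrence can be sketched in a few lines of NumPy. This is an illustrative forward pass only, with random weights and the hypothetical name `simple_rnn_forward`; it uses $\tanh$ as the activation and a short sequence of 5 steps to keep the example small.

```python
import numpy as np

def simple_rnn_forward(x_seq, W, b, activation=np.tanh):
    """Apply h_t = activation(W [h_{t-1}, x_t] + b) over a sequence; return the last h."""
    units = b.shape[0]
    h = np.zeros(units)                       # initial hidden state h_0
    for x_t in x_seq:                         # the same W and b are reused at every step
        h = activation(W @ np.concatenate([h, x_t]) + b)
    return h

input_dim, units, steps = 100, 100, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(units, units + input_dim))
b = np.zeros(units)
x_seq = rng.normal(size=(steps, input_dim))

h_last = simple_rnn_forward(x_seq, W, b)
print(h_last.shape, W.size + b.size)
```

Because the weights are shared across time steps, the number of parameters, `units * (units + input_dim) + units`, does not depend on the sequence length.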



In [17]:

model = Sequential()

# Add an Embedding layer
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))

# Add a recurrent layer using SimpleRNN with 100 hidden units
model.add(SimpleRNN(100))

# Add a Dense layer with a sigmoid activation to compute the output score
model.add(Dense(1, activation="sigmoid"))

# Compile the model by defining the loss, the optimizer and the metrics, as before
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Train the model over the training set, using the test set as validation data
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)


# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 100)               20100     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 101       
Total params: 1,020,201
Trainable params: 1,020,201
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 73s - loss: 0.5994 - accuracy: 0.6498 - val_loss: 0.4376 - val_accuracy: 0.8148
Epoch 2/10
196/196 - 72s - loss: 0.3492 - accuracy: 0.8511 - val_loss: 0.4061 - val_accuracy: 0.8320
Epoch 3/10
196/196 - 72s - loss: 0.3451 - accuracy: 0.8471 - val_loss: 0.4400 - val_accuracy: 0.8121
Epoch 4/10
196/196 - 72s - loss: 0.2603 - accuracy: 0.8956 - val_l

# **Discussion**: 
1. Comment on the results, any drop or gain in the performance? 
2. What went wrong?
3. Compare the model complexity with CNN model and with Dense Model with Embedding.
4. Can you think of any trick to decrease the model complexity while still using RNN?

# Remark: as we have seen in the lecture

## The main problem when using RNN is the vanishing/exploding of the gradients

Let us use sigmoid activations as an example. The derivative of a sigmoid can be written as:

$$\sigma'(x) = \sigma(x)\,(1-\sigma(x))$$

* Remember that an RNN is a very deep feed-forward network when unrolled in time!
* Hence, backpropagation runs from $h_{t}$ all the way back to $h_{1}$, multiplying by one such derivative at every time step.
* Also, the sigmoid gradient depends multiplicatively on the value of the sigmoid, and it never exceeds $1/4$ (reached at $x = 0$).
* Hence, whenever the pre-activation of a step is large in magnitude, $\sigma$ saturates near 0 or 1 and $\sigma'$ tends to 0; the product of many such small factors shrinks towards 0, leading to **the vanishing gradient problem.**
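
A quick numerical check makes the effect visible. The sketch below only looks at the product of sigmoid derivatives and deliberately ignores the recurrent weight matrix, which also multiplies the gradient at each step and can make it explode instead; the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))        # largest possible value of the derivative
print(sigmoid_grad(5.0))        # saturated unit: derivative close to 0

# Best case for a chain of sigmoid derivatives: every factor equals the maximum.
for steps in (10, 50, 500):
    print(steps, sigmoid_grad(0.0) ** steps)
```

Even in this best case the factor contributed by 500 time steps is vanishingly small, which is why long reviews are hard for a plain RNN.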

# LSTM: Long Short Term Memory

1. LSTM and GRU are two sophisticated implementations of RNNs that have gates (one could say that their success hinges on using gates). 
2. A gate emits a value between 0 and 1. For instance, an LSTM is built on these state updates:


Let us assume that $L_f$, $L_i$, $L_o$ and $L_c$ are linear transformations of the form $L(x) = W*x + b$, each with its own weights $W$ and bias $b$.

* Forget Gate: $f_t = \sigma(L_f[h_{t-1},x_t])$
* Input Gate: $i_t = \sigma(L_i[h_{t-1},x_t])$
* Output Gate: $o_t = \sigma(L_o[h_{t-1},x_t])$
* Candidate Cell State: $\hat c_t = \tanh(L_c[h_{t-1},x_t])$

Now, using the forget and input gates, the neural network can learn to **control how much information it has to retain or forget**:

* $c_t = f_t*c_{t-1} + i_t*\hat c_t$

Thus the **hidden state update** is:

* $h_t = o_t*\tanh(c_{t})$
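
The LSTM updates can be sketched as a single NumPy step. This is an illustration with random weights, not the Keras implementation; the names `lstm_step`, `init_lstm` and `count_params` are hypothetical, and each of the four transformations is given its own weight matrix and bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; params maps 'f', 'i', 'o', 'c' to a (W, b) pair."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(params['f'][0] @ z + params['f'][1])       # forget gate
    i = sigmoid(params['i'][0] @ z + params['i'][1])       # input gate
    o = sigmoid(params['o'][0] @ z + params['o'][1])       # output gate
    c_hat = np.tanh(params['c'][0] @ z + params['c'][1])   # candidate cell state
    c = f * c_prev + i * c_hat                             # new cell state
    h = o * np.tanh(c)                                     # new hidden state
    return h, c

def init_lstm(input_dim, units, rng):
    """Random weights and zero biases for the four transformations."""
    return {k: (rng.normal(scale=0.1, size=(units, units + input_dim)), np.zeros(units))
            for k in 'fioc'}

def count_params(params):
    return sum(W.size + b.size for W, b in params.values())

rng = np.random.default_rng(0)
params = init_lstm(input_dim=100, units=100, rng=rng)
h, c = lstm_step(rng.normal(size=100), np.zeros(100), np.zeros(100), params)
print(h.shape, c.shape, count_params(params))
print(count_params(init_lstm(input_dim=32, units=100, rng=rng)))
```

With four transformations instead of one, an LSTM layer has four times the parameters of a simple RNN layer of the same size.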






# Q.6 MODEL 4 : Building an LSTM based Model

* Now, let's use an LSTM model to do classification! 

1. To make it a fair comparison to the SimpleRNN, let's start with the same architecture hyper-parameters (e.g., number of hidden nodes, epochs, and batch size). 

2. Then, experiment with increasing the number of nodes and stacking multiple layers.

3. Check the number of parameters that this model entails.



More information is available at: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

In [18]:
model = Sequential()

# Add an Embedding layer
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))

# Add a recurrent layer using LSTM with 100 hidden units and perform dropout with probability of 0.3
model.add(LSTM(units=100, dropout=0.3))

# Add a Dense layer for computing the output scores with Sigmoid activations
model.add(Dense(1, activation="sigmoid"))

# Compile the model by defining the loss, the optimizer and the metrics, similar to what you have done before
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

# Train the defined model over the training set
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 101       
Total params: 1,080,501
Trainable params: 1,080,501
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 19s - loss: 0.4586 - accuracy: 0.7679 - val_loss: 0.3575 - val_accuracy: 0.8446
Epoch 2/10
196/196 - 13s - loss: 0.2427 - accuracy: 0.9050 - val_loss: 0.2932 - val_accuracy: 0.8786
Epoch 3/10
196/196 - 13s - loss: 0.1911 - accuracy: 0.9307 - val_loss: 0.3075 - val_accuracy: 0.8755
Epoch 4/10
196/196 - 13s - loss: 0.1510 - accuracy: 0.9461 - val_l

# Q.7 MODEL 5 : Combining the CNN with the LSTM




1. CNNs are good at learning spatial features, and sentences can be thought of as 1-D spatial vectors (dimensionality is determined by the number of words in the sentence). 

2. Here we want to apply an LSTM over the features learned by the CNN (after a max pooling layer).

3. By doing that, we can leverage the power of CNNs and LSTMs combined! 

4. We expect the CNN to be able to pick out invariant features across the 1-D spatial structure (i.e., sentence) that characterize good and bad sentiment.

5. These learned spatial features may then be learned as sequences by an LSTM layer, and the final classification can be made via a feed-forward connection to a single node.

In [19]:
model = Sequential()

# Add an Embedding layer
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))

# Define a Conv1D layer with 32 filters, a kernel size of 3 and ReLU activation
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))

# Perform max pooling to downsample the previous layer by 2
model.add(MaxPooling1D(pool_size=2))

# Add a recurrent layer using LSTM with 100 hidden units and a dropout probability of 0.3
model.add(LSTM(units=100, dropout=0.3))

# Add a Dense layer with a sigmoid activation to compute the output score
model.add(Dense(1, activation="sigmoid"))

# Compile the model by defining the loss, the optimizer and the metrics, as before
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

# Train the model over the training set, using the test set as validation data
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           9632      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 101       
Total params: 1,062,933
Trainable params: 1,062,933
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 11s - loss: 0.4377 - accuracy: 0.7798 - val_loss: 0.3957 - val_accuracy: 0.8308
Epoc

# Discussion: 
1. Compare the model performance to the RNN-based and CNN-based models.
2. Elaborate on the model complexity in comparison to the RNN and CNN models.
3. Complexity: consider using the total number of parameters.
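
For the complexity question, the parameter counts in the summary above can be reproduced by hand. The sketch below is a minimal illustration, assuming the layer sizes shown in that summary (embedding dimension 100, 32 convolution filters with kernel size 3, 100 LSTM units) and a vocabulary of 10,000 words, which is inferred from the 1,000,000 embedding parameters rather than stated here. The helper names are hypothetical and are not part of Keras.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry, no bias.
    return vocab_size * embedding_dim


def conv1d_params(in_channels, filters, kernel_size):
    # Each filter spans kernel_size time steps over all input channels, plus one bias.
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four weight blocks (input, forget, output gates and cell candidate),
    # each with input weights, recurrent weights and one bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


# Sizes assumed from the model summary printed above.
layer_counts = {
    "embedding": embedding_params(vocab_size=10000, embedding_dim=100),
    "conv1d": conv1d_params(in_channels=100, filters=32, kernel_size=3),
    "max_pooling1d": 0,  # pooling has no trainable parameters
    "lstm": lstm_params(input_dim=32, units=100),
    "dense": dense_params(input_dim=100, units=1),
}
total = sum(layer_counts.values())

for name, count in layer_counts.items():
    print(f"{name:15s} {count:>10,d}")
print(f"{'total':15s} {total:>10,d}")
```

Almost all of the parameters sit in the embedding layer, so comparing only the recurrent and convolutional layers gives a clearer picture of the architectural differences.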


# General Discussion and Conclusion

* **Please state what you conclude regarding**:

1. What are the benefits of the embedding layer, and how does it contribute to improving the model performance?
2. What do you expect after adding a CNN layer? What types of features might it extract?
3. Does leveraging temporal information using an RNN and its variants help?
4. Why may an RNN fail on long sentences?
5. What are, in your opinion, the limitations and advantages of CNNs, RNNs and LSTMs?


* **You can base your discussion on:**
1. performance,
2. memory usage and model complexity,
3. how well the model leverages the temporally connected information contained in the inputs,
4. the performance vs. memory trade-offs of CNNs vs. RNNs.
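
For the question on long sentences, a toy calculation can make the vanishing-gradient effect concrete. The sketch below assumes a scalar RNN, `h_t = tanh(w * h_{t-1} + x_t)`, with a made-up weight and made-up constant inputs; it is not the lab model. By the chain rule, the gradient of the last state with respect to the initial state is a product of per-step factors `w * (1 - h_t**2)`, which shrinks quickly when each factor has magnitude below 1.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh(.)^2) = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


w = 0.5  # illustrative recurrent weight (assumption)
for length in (5, 20, 100):
    inputs = [0.1] * length  # illustrative constant input sequence (assumption)
    print(f"sequence length {length:3d}: gradient = {gradient_through_time(w, inputs):.3e}")
```

The longer the sequence, the less the early inputs influence the weight updates, which is the behaviour the gating mechanisms of LSTM and GRU are designed to mitigate.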



# **Bonus**: Import GRU and check the performance against LSTM in terms of accuracy and model complexity

In [20]:
from keras.layers import GRU

In [21]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(GRU(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 500, 32)           9632      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
gru (GRU)                    (None, 100)               40200     
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 101       
Total params: 1,049,933
Trainable params: 1,049,933
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy: 86.69%
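
The gap between the 53,200 parameters of the LSTM layer and the 40,200 parameters of the GRU layer follows from the number of weight blocks each cell uses. The sketch below is a rough check, assuming the Keras default `reset_after=True` for GRU, which keeps two bias vectors per block (one for the input and one for the recurrent transformation). `gated_rnn_params` is a hypothetical helper, not a Keras function.

```python
def gated_rnn_params(input_dim, units, n_blocks, n_bias_vectors):
    """Parameter count of a gated recurrent layer.

    n_blocks: number of weight blocks (4 for LSTM, 3 for GRU).
    n_bias_vectors: bias vectors per block (1 for LSTM, 2 for GRU with reset_after=True).
    """
    weights = units * (input_dim + units)  # input weights + recurrent weights
    biases = n_bias_vectors * units
    return n_blocks * (weights + biases)


input_dim = 32  # channels coming out of the Conv1D / MaxPooling1D layers
units = 100     # hidden units of the recurrent layer

lstm_count = gated_rnn_params(input_dim, units, n_blocks=4, n_bias_vectors=1)
gru_count = gated_rnn_params(input_dim, units, n_blocks=3, n_bias_vectors=2)

print("LSTM parameters:", lstm_count)
print("GRU parameters: ", gru_count)
print("Relative saving: %.1f%%" % (100 * (1 - gru_count / lstm_count)))
```

The GRU layer is roughly a quarter smaller than the LSTM layer, although the difference is small relative to the whole model because the embedding layer dominates the total.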


In [22]:
model = Sequential()

# Add an Embedding layer
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length)) # Replace your code by xxx

# Define a Conv1D layer with 32 filters, a kernel size of 3 and ReLU activation
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')) # Replace your code by xxx

# Perform maxpooling for downsampling the previous layer by 2
model.add(MaxPooling1D(pool_size=2)) # Replace your code by xxx

# Add a recurrent layer using GRU with 100 hidden units
model.add(GRU(units=100)) # Replace your code by xxx

# Add a Dense layer for computing the output scores with Sigmoid activations
model.add(Dense(1, activation="sigmoid")) # Replace your code by xxx

# Compile the model by defining the loss, the optimizer and the metrics, similar to what you have done before
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # Replace your code by xxx

print(model.summary())

# Train the defined model over the training set
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2) # Replace your code by xxx

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 500, 32)           9632      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               40200     
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 101       
Total params: 1,049,933
Trainable params: 1,049,933
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 11s - loss: 0.4385 - accuracy: 0.7819 - val_loss: 0.2929 - val_accuracy: 0.8778
Epoc

# Consider GRU with Dropout as a Form of Regularization

In [23]:
model = Sequential()

# Add an Embedding layer
model.add(Embedding(vocabulary_size, embedding_dim, input_length=max_review_length)) # Replace your code by xxx

# Define a Conv1D layer with 32 filters, a kernel size of 3 and ReLU activation
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')) # Replace your code by xxx

# Perform maxpooling for downsampling the previous layer by 2
model.add(MaxPooling1D(pool_size=2)) # Replace your code by xxx

# Add a recurrent layer using GRU with 100 hidden units and apply dropout with a rate of 0.3 to the layer inputs
model.add(GRU(units=100, dropout=0.3)) # Replace your code by xxx

# Add a Dense layer for computing the output scores with Sigmoid activations
model.add(Dense(1, activation="sigmoid")) # Replace your code by xxx

# Compile the model by defining the loss, the optimizer and the metrics, similar to what you have done before
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # Replace your code by xxx

print(model.summary())

# Train the defined model over the training set
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=2) # Replace your code by xxx

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 500, 100)          1000000   
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 500, 32)           9632      
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
gru_2 (GRU)                  (None, 100)               40200     
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 101       
Total params: 1,049,933
Trainable params: 1,049,933
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
196/196 - 11s - loss: 0.4280 - accuracy: 0.7842 - val_loss: 0.2770 - val_accuracy: 0.8856
Epoc

# Discussion:
1. Compare the model performance when using GRU with and without dropout
2. Comment on the results
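
One way to structure this comparison is to look at the gap between training and validation accuracy per epoch. The sketch below is a minimal illustration, assuming the metric lists have the same layout as the `history` dictionary of the object returned by `model.fit`. The helper name is hypothetical, and the values are only the first-epoch figures from the two training logs above, since the remaining epochs are truncated there.

```python
def accuracy_gap(history):
    """Per-epoch difference between training and validation accuracy."""
    return [
        train - val
        for train, val in zip(history["accuracy"], history["val_accuracy"])
    ]


# First-epoch values copied from the logs above; later epochs are not available.
runs = {
    "GRU without dropout": {"accuracy": [0.7819], "val_accuracy": [0.8778]},
    "GRU with dropout=0.3": {"accuracy": [0.7842], "val_accuracy": [0.8856]},
}

for name, history in runs.items():
    gaps = accuracy_gap(history)
    best_val = max(history["val_accuracy"])
    print(f"{name}: best val_accuracy = {best_val:.4f}, last gap = {gaps[-1]:+.4f}")
```

In a real run, storing the return value as `history = model.fit(...)` and passing `history.history` to the helper gives the gap for all epochs; a gap that keeps growing over the epochs is a common sign of overfitting.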