### Recurrent Neural Networks `RNN`
Recurrent Neural Networks (RNNs) are designed to work with sequential data. Sequential data (for example a time series) can take the form of text, audio, video, etc.
An RNN uses the previous information in the sequence to produce the current output. As an illustration, suppose we have the sentence:
```
"I love deep learning."
```
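* At `t0` the input is `I`, so the network has seen `I`
* At `t1` the input is `love`; together with the hidden state the network has seen `I love`
* At `t2` the input is `deep`; the network has seen `I love deep`
* At `t3` the input is `learning.`; the network has seen `I love deep learning.`
> At the last step the hidden state of the `NN` summarizes the information from all the previous steps.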

<p align="center"><img src="https://miro.medium.com/max/480/1*gEA0-LTj05xtESA5XoBxPw.gif"/> </p>

**Note**: In an ``RNN`` the same weights and biases are reused at every time step. The default activation function of `RNN` is `tanh`.
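
The recurrence described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the Keras implementation: the names `rnn_forward`, `W_x`, `W_h` and `b` are assumptions made for this example, and the weights are random rather than trained.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over `inputs` (steps x features) and return every hidden state."""
    h = np.zeros(W_h.shape[0])          # initial hidden state
    states = []
    for x_t in inputs:                  # one time step at a time
        # The same W_x, W_h and b are reused at every step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 3, 4, 5
W_x = rng.normal(scale=0.5, size=(n_units, n_features))
W_h = rng.normal(scale=0.5, size=(n_units, n_units))
b = np.zeros(n_units)
inputs = rng.normal(size=(n_steps, n_features))

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (5, 4): one hidden state per time step
```

The last row of `states` is the hidden state after the whole sequence has been read.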

> Let's have a look at an `RNN` unit.
<p align="center"><img src="https://miro.medium.com/max/332/0*eRJCRsikdGGu8ffA.png"/></p>

> ``RNNs`` suffer from a **short-term memory** problem, which is caused by the ``vanishing gradient`` problem. The more steps an ``RNN`` processes, the more it suffers from vanishing gradients compared with other neural network architectures.

### The vanishing gradient problem
To train an ``RNN`` you backpropagate through time: the gradient is calculated at each time step and used to update the weights of the network. If an earlier step has only a small effect on the current step, the corresponding gradient is small, and vice versa. Because the gradient at an earlier step is obtained by multiplying the gradient at the later step by such a factor, the gradients shrink exponentially as we backpropagate. A very small gradient barely changes the weights, so the network does not learn the effect of earlier inputs. This causes the ``short-term`` memory problem.
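
The shrinking effect can be made concrete with a one-unit `tanh` RNN. The sketch below is an assumption-laden toy (a single scalar weight `w`, random inputs, and a hypothetical helper `gradient_wrt_first_state`), but it shows how the chain rule multiplies one small factor per time step.

```python
import numpy as np

def gradient_wrt_first_state(w, inputs):
    """Return |d h_t / d h_0| after each step of a one-unit tanh RNN."""
    h, grad, history = 0.0, 1.0, []
    for x_t in inputs:
        h = np.tanh(w * h + x_t)
        # Chain rule: multiply by the local derivative of this step
        grad *= w * (1.0 - h ** 2)
        history.append(abs(grad))
    return history

rng = np.random.default_rng(0)
history = gradient_wrt_first_state(w=0.5, inputs=rng.normal(size=30))

# Each factor is at most 0.5 in magnitude, so the product decays at least as fast as 0.5**t
print(history[0], history[9], history[29])
```

With `w=0.5` every factor is bounded by 0.5, so after 30 steps the gradient with respect to the first state is vanishingly small.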

To mitigate the vanishing gradient problem we use two specialised versions of RNN.

#### Gated Recurrent Units `GRU`
A `GRU` has two gates:
1. update gate
2. reset gate

<p align="center"><img src="https://miro.medium.com/max/700/1*RiOzdOVaaeKrUotY7-1a2A.png"/></p>

Gates are nothing but small neural network layers; each gate has its own weights and biases (but don't forget that these weights and biases are the same at every time step).

* **Update gate** -
The update gate decides whether the cell state should be updated with the candidate state (the current activation value) or not.
* **Reset gate** -
The reset gate decides whether the previous cell state is important or not. Some simplified GRU variants leave the reset gate out.
* **Candidate cell** -
It plays the same role as the hidden state (activation) of a plain RNN.
* **Final cell state** -
The final cell state depends on the update gate: it may or may not be updated with the candidate state. Some content is removed from the last cell state and some new cell content is written.

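In a GRU:

* If the reset gate is close to 0, the previous hidden state is ignored (this allows the model to drop information that is irrelevant in the future).
* If gamma (the update gate) is close to 1, the unit copies its information through many steps. This assumes the convention in which gamma weights the previous state; some texts define the gate the other way round.
* Gamma controls how much of the past state should matter now.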
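
A single GRU step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the parameter names (`W_z`, `U_z`, `b_z` and so on) and the helper `gru_step` are made up for this example, the weights are random, and the final state is written as `z * h_prev + (1 - z) * h_cand`, so an update gate near 1 keeps the previous state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; `p` is a dict of weights with hypothetical names."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much of h_prev is used
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Final state: blend of the previous state and the candidate
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n_in, n_units = 3, 4
p = {}
for g in "zrh":
    p[f"W_{g}"] = rng.normal(size=(n_units, n_in))
    p[f"U_{g}"] = rng.normal(size=(n_units, n_units))
    p[f"b_{g}"] = np.zeros(n_units)

x_t = rng.normal(size=n_in)
h_prev = rng.uniform(-1, 1, size=n_units)
h_new = gru_step(x_t, h_prev, p)

# Push the update gate towards 1 with a large bias: the state is copied almost unchanged
p_copy = dict(p, b_z=np.full(n_units, 50.0))
h_copied = gru_step(x_t, h_prev, p_copy)
print(np.abs(h_copied - h_prev).max())
```

Because the blend is element-wise, each unit can decide independently whether to keep or overwrite its value.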

#### Long Short-Term Memory `LSTM`
LSTMs are similar to GRUs and are also intended to solve the vanishing gradient problem. An LSTM has the following gates:

1. input gate (sometimes called the update gate)
2. forget gate
3. output gate

<p align="center"><img src="https://miro.medium.com/max/700/1*lSDKRennQMpJFL4xxJHloQ.png"/></p>

> All 3 gates (input gate, output gate, forget gate) use sigmoid as their activation function, so all gate values are between 0 and 1.

* **Forget gate** -
It controls what is kept and what is forgotten from the previous cell state. In layman's terms, it decides how much information from the previous state should be kept; the rest is forgotten.
* **Output gate** -
It controls which parts of the cell are output to the hidden state. It will determine what the next hidden state will be.
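
The role of the three gates can be sketched as one LSTM step in NumPy. This is an illustrative sketch, not the Keras implementation: the function `lstm_step`, the stacked parameters `W`, `U`, `b` and the block order (input, forget, output, candidate) are assumptions chosen for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack four blocks: input, forget, output, candidate."""
    n = h_prev.shape[0]
    pre = W @ x_t + U @ h_prev + b
    i = sigmoid(pre[:n])             # input gate: how much new content to write
    f = sigmoid(pre[n:2 * n])        # forget gate: how much of c_prev to keep
    o = sigmoid(pre[2 * n:3 * n])    # output gate: how much of the cell to expose
    c_cand = np.tanh(pre[3 * n:])    # candidate cell content
    c = f * c_prev + i * c_cand      # new cell state
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_units = 3, 4
W = rng.normal(size=(4 * n_units, n_in))
U = rng.normal(size=(4 * n_units, n_units))
b = np.zeros(4 * n_units)
x_t = rng.normal(size=n_in)
h_prev = rng.uniform(-1, 1, size=n_units)
c_prev = rng.normal(size=n_units)

h, c = lstm_step(x_t, h_prev, c_prev, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

When the forget gate is near 1 and the input gate near 0, the cell state passes through almost unchanged, which is what lets information survive many steps.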

### Ref

[Article](https://medium.com/analytics-vidhya/rnn-vs-gru-vs-lstm-863b0b7b1573)


### Practical Example
We are going to use an `RNN` with the `airline-passengers` dataset, which we will load from a `csv` file. The dataset records the number of people travelling with US airlines in a particular month. We will only use the `Passengers` column.

### Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

### Data Prep

In [2]:
data = pd.read_csv("airline-passengers.csv")
data.head()

|   | Month | Passengers |
|---|---|---|
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |


In [8]:
data.isnull().values.any()

False

In [3]:
# Scale the passenger counts to the [0, 1] range
scaler = MinMaxScaler()
dataset = scaler.fit_transform(data["Passengers"].values.reshape(-1, 1))

# Chronological 75/25 split (no shuffling for a time series)
train_size = int(len(dataset) * 0.75)
test_size = len(dataset) - train_size
train = dataset[:train_size, :]
test = dataset[train_size:, :]

def getdata(data, lookback):
    """Build (window, next value) pairs from a 2-D array with one column."""
    X, Y = [], []
    for i in range(len(data) - lookback - 1):
        X.append(data[i:i + lookback, 0])
        Y.append(data[i + lookback, 0])
    return np.array(X), np.array(Y).reshape(-1, 1)

lookback = 1

X_train, y_train = getdata(train, lookback)
X_test, y_test = getdata(test, lookback)
# Reshape to (samples, time steps, features), as expected by Keras recurrent layers
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
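
To see what the windowing in `getdata` produces, here is a small self-contained sketch on a toy series. The helper `make_windows` is a hypothetical copy of the same loop, and `lookback=3` is chosen only to make the windows visible.

```python
import numpy as np

def make_windows(series, lookback):
    """Same loop as getdata: each window of `lookback` values predicts the next value."""
    X, Y = [], []
    for i in range(len(series) - lookback - 1):
        X.append(series[i:i + lookback, 0])
        Y.append(series[i + lookback, 0])
    return np.array(X), np.array(Y).reshape(-1, 1)

toy = np.arange(10, dtype=float).reshape(-1, 1)   # 0, 1, ..., 9 as one column
X, Y = make_windows(toy, lookback=3)
X = X.reshape(X.shape[0], X.shape[1], 1)          # (samples, time steps, features)

print(X.shape, Y.shape)      # (6, 3, 1) (6, 1)
print(X[0].ravel(), Y[0])    # [0. 1. 2.] [3.]
```

Note that the `- 1` in the loop bound skips the last possible pair (`[6, 7, 8]` predicting `9` in this toy series).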

### ``SimpleRNN``

In [31]:
X_train.shape

(106, 1, 1)

In [37]:
model_1 = keras.Sequential([
    keras.layers.SimpleRNN(64, input_shape=(1, 1)),
    keras.layers.Dense(1)  # linear output for regression; softmax over a single unit always returns 1
], name="model_1")

model_1.compile(
    loss='mean_squared_error',optimizer='adam',
    metrics=['mse']
)
model_1.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)


> `LSTM`

In [39]:
model_2 = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(1, 1)),
    keras.layers.Dense(1)  # linear output for regression; softmax over a single unit always returns 1
], name="model_2")

model_2.compile(
    loss='mean_squared_error',optimizer='adam',
    metrics=['mse']
)
model_2.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)


> `GRU`

In [41]:
model_3 = keras.Sequential([
    keras.layers.GRU(64, input_shape=(1, 1)),
    keras.layers.Dense(1)  # linear output for regression; softmax over a single unit always returns 1
], name="model_3")

model_3.compile(
    loss='mean_squared_error',optimizer='adam',
    metrics=['mse']
)
model_3.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)


> This notebook focuses on the theory and on how to use `RNN` layers, not on training or testing `metrics`.

### Nesting `RNNs`
To nest `RNN` layers, pass `return_sequences=True` to every recurrent layer that feeds another recurrent layer, for example:

> GRU
```python
keras.Sequential([
    keras.layers.GRU(64, input_shape=(1, 1), return_sequences=True),
    keras.layers.GRU(64, return_sequences=True),
    ...
])
```

> RNN / SimpleRNN
```python
keras.Sequential([
    keras.layers.SimpleRNN(64, input_shape=(1, 1), return_sequences=True),
    keras.layers.SimpleRNN(64, return_sequences=True),
    ...
])
```

> LSTM

```python
keras.Sequential([
    keras.layers.LSTM(64, input_shape=(1, 1), return_sequences=True),
    keras.layers.LSTM(64, return_sequences=True),
    ...
])
```
> We can also nest different `RNN` layers, for example:

```python
keras.Sequential([
    keras.layers.GRU(64, input_shape=(1, 1), return_sequences=True),
    keras.layers.LSTM(64, return_sequences=True),
    ...
])
```
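
Why `return_sequences=True` matters for nesting can be illustrated with a toy layer in NumPy. The function `simple_rnn_layer` is a hypothetical stand-in with random weights, not a Keras layer; it only mimics the difference in output shape.

```python
import numpy as np

def simple_rnn_layer(x, units, return_sequences, rng):
    """Toy tanh RNN layer over x of shape (steps, features) with random weights."""
    W_x = rng.normal(scale=0.3, size=(units, x.shape[1]))
    W_h = rng.normal(scale=0.3, size=(units, units))
    h = np.zeros(units)
    outputs = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h)
        outputs.append(h)
    # Either every hidden state (a sequence) or only the last one (a vector)
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 1))                      # 7 time steps, 1 feature
seq = simple_rnn_layer(x, 64, True, rng)         # still a sequence: one row per step
last = simple_rnn_layer(seq, 64, False, rng)     # a single vector for the whole sequence
print(seq.shape, last.shape)  # (7, 64) (64,)
```

A second recurrent layer needs one input per time step, so the layer before it has to return the full sequence rather than only its last state.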

### Bidirectional `RNN`
A Bidirectional layer is a wrapper around a recurrent layer such as ``keras.layers.LSTM`` or ``keras.layers.GRU``. It can also wrap a `keras.layers.Layer` instance that meets [these](https://keras.io/api/layers/recurrent_layers/bidirectional/) conditions.

```python
tf.keras.layers.Bidirectional(
    layer, merge_mode="concat", weights=None, backward_layer=None, **kwargs
)
```

* [Docs](https://keras.io/api/layers/recurrent_layers/bidirectional/)
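
The idea behind `merge_mode="concat"` can be sketched in NumPy. This sketch swaps the LSTM for a plain `tanh` RNN to stay short, and the helper `run_rnn` and the random weights are assumptions made for this example: one pass reads the sequence left to right, another reads it right to left, and the two outputs are concatenated per time step.

```python
import numpy as np

def run_rnn(x, W_x, W_h):
    """Plain tanh RNN over x (steps x features); returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
steps, features, units = 5, 10, 10
x = rng.normal(size=(steps, features))
fw_W_x = rng.normal(scale=0.3, size=(units, features))
fw_W_h = rng.normal(scale=0.3, size=(units, units))
bw_W_x = rng.normal(scale=0.3, size=(units, features))
bw_W_h = rng.normal(scale=0.3, size=(units, units))

forward = run_rnn(x, fw_W_x, fw_W_h)
# Read the sequence right to left, then flip the outputs back so they align in time
backward = run_rnn(x[::-1], bw_W_x, bw_W_h)[::-1]
merged = np.concatenate([forward, backward], axis=-1)
print(merged.shape)  # (5, 20): twice the number of units per time step
```

This is why a wrapped layer with 10 units produces 20 output features per step.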

> Example.

In [5]:
model = keras.Sequential([
    keras.layers.Bidirectional(
        keras.layers.LSTM(10, return_sequences=True), input_shape=(5, 10)
    ),
    keras.layers.Bidirectional(keras.layers.LSTM(10)),
    keras.layers.Dense(5)
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional (None, 5, 20)             1680      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 20)                2480      
_________________________________________________________________
dense (Dense)                (None, 5)                 105       
Total params: 4,265
Trainable params: 4,265
Non-trainable params: 0
_________________________________________________________________
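
The parameter counts in the summary above can be checked by hand. The helpers `lstm_params` and `dense_params` below are hypothetical names for this worked example; they assume the standard LSTM layout of four blocks, each with input weights, recurrent weights and a bias.

```python
def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate), each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units

first = 2 * lstm_params(10, 10)    # forward + backward LSTM on 10 input features
second = 2 * lstm_params(20, 10)   # input is the 20-wide concatenated output
head = dense_params(20, 5)

print(first, second, head, first + second + head)  # 1680 2480 105 4265
```

The second Bidirectional layer has more parameters than the first because its input is the 20-wide concatenated output rather than the 10 original features.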


> Example With custom ``backward`` layer

In [10]:

forward_layer = keras.layers.LSTM(10, return_sequences=True)
backward_layer = keras.layers.LSTM(10, activation='relu', return_sequences=True,
                                   go_backwards=True)

model = keras.Sequential([
    keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer,
                               input_shape=(5, 10)),
    keras.layers.Dense(5)
])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_3 (Bidirection (None, 5, 20)             1680      
_________________________________________________________________
dense_2 (Dense)              (None, 5, 5)              105       
Total params: 1,785
Trainable params: 1,785
Non-trainable params: 0
_________________________________________________________________


> **Practical Example**: ``imdb`` dataset.

In [16]:
max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Keep at most 200 words per review (pad_sequences drops the earliest words by default)
(X_train, y_train), (X_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_val = keras.preprocessing.sequence.pad_sequences(X_val, maxlen=maxlen)
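
What `pad_sequences` does to each review can be sketched in plain Python. The helper `pad_pre` is a hypothetical stand-in that mimics the default behaviour (padding and truncating at the start of the sequence); it is not the Keras function itself.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic default pre-padding / pre-truncation on one list of token ids."""
    seq = seq[-maxlen:]                          # too long: keep the last `maxlen` items
    return [value] * (maxlen - len(seq)) + seq   # too short: pad on the left

print(pad_pre([1, 2, 3], 5))              # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

After this step every review has exactly `maxlen` entries, so the reviews can be stacked into one 2-D array.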

> Building the model without `Bidirectional` layer.

In [18]:
input_layer = keras.layers.Input(shape=(None, ), dtype="int32")
x = keras.layers.Embedding(max_features, 128)(input_layer)
x = keras.layers.LSTM(256, return_sequences=True)(x)
x = keras.layers.LSTM(64)(x)
output_layer = keras.layers.Dense(1, activation='sigmoid')(x)
model_4 = keras.Model(inputs=input_layer, outputs=output_layer, name="model_4")
model_4.compile(
    "adam", "binary_crossentropy", metrics=["accuracy"]
)
model_4.fit(X_train, y_train, epochs=1, validation_data=(X_val, y_val), verbose=1, batch_size=128)

  5/196 [..............................] - ETA: 8:25 - loss: 0.6929 - accuracy: 0.5179

KeyboardInterrupt: 

> Building the model with `Bidirectional` layer.

In [None]:
input_layer = keras.layers.Input(shape=(None, ), dtype="int32")
x = keras.layers.Embedding(max_features, 128)(input_layer)
x = keras.layers.Bidirectional(keras.layers.LSTM(256, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
output_layer = keras.layers.Dense(1, activation='sigmoid')(x)
model_5 = keras.Model(inputs=input_layer, outputs=output_layer, name="model_5")
model_5.compile(
    "adam", "binary_crossentropy", metrics=["accuracy"]
)
model_5.fit(X_train, y_train, epochs=2, validation_data=(X_val, y_val), verbose=1, batch_size=64)

> Conclusion: 