# Deep Learning with Python 
# 6.3 - Recurrent Neural Networks

None of the neural networks we have seen so far maintain **state** between inputs. Each input is processed independently, so whatever the network computed for earlier inputs has no influence on how it processes the current one.

To process sequences with such networks, we fed each sequence to the network in one go. For example, each IMDb review is a sequence of words, but the network received the entire review as a single input instead of reading it word by word (or word vector by word vector).

An RNN **maintains a state between inputs** within a sequence, but the **state is reset between samples**. Here, each sample is a sequence of inputs. Within a sample, the output for the current input is fed back into the network along with the next input.

Processing a sample step by step in this way is the **forward pass** of the RNN.

## Forward Pass of an RNN
- An RNN takes as input a sequence of vectors, encoded as a 2D `numpy` tensor of shape `(timesteps, input_features)`.
- It loops over timesteps. At each timestep `t` it considers its current state and the input at `t` (a vector of shape `(input_features,)`) and combines them to obtain the output at `t`.
- It then sets the state for the next step to this output.

In [None]:
# The state at t - initially this will be 0
state_t = 0 

# Iterate over sequence elements
for input_t in input_sequence:
    # Uses the current input and state to compute output
    output_t = f(input_t, state_t)

    # The output becomes the state for the next iteration
    state_t = output_t

Assume that the function `f`, which combines the current input and the current state to produce the current output, uses two weight matrices `W` and `U` and a bias vector `b`, followed by an activation function.

In [None]:
state_t = 0
for input_t in input_sequence:
    output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
    state_t = output_t

The following cells build a complete, naive `numpy` implementation of the RNN forward pass, updating the state from each input in turn.

In [1]:
import numpy as np

In [2]:
timesteps = 100

# Dimensionality of the input feature space
input_features = 32

# Dimensionality of the output feature space
output_features = 64

# Input data is random noise for the sake of this example
# At each of the timesteps, we have the same number of features
inputs = np.random.random((timesteps, input_features))

In [3]:
# Initial state (before any input has been processed) is all zeros
state_t = np.zeros((output_features,))

In [12]:
# Defining random weight matrices
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features, ))

In [13]:
# Create an empty list to store the outputs produced at each timestep
successive_outputs = []

In [16]:
# `input_t` is a vector of shape (input_features, )
for input_t in inputs:
    # Using the hyperbolic tangent activation function
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    
    # Append this output to the list of computed outputs
    successive_outputs.append(output_t)
    
    # The state for the next timestep will be the current output
    state_t = output_t

In [18]:
# Stack the per-timestep outputs into a 2D tensor of shape `(timesteps, output_features)`
# (np.concatenate along axis 0 would instead flatten them into one long 1D vector)
final_output_sequence = np.stack(successive_outputs, axis=0)

This is the simplest variant of an RNN. Many other RNN architectures exist; they differ in how the current output or state is used when computing the next state.

For a single input sequence, the output of the RNN is a tensor of shape `(timesteps, output_features)`. The output features for timestep `t` contain information about the inputs at **all timesteps up to and including `t`**. This means we often don't need the entire output tensor of an RNN and are mostly interested in the final output, which already summarises the whole sequence.
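The dependence of each output on earlier inputs can be checked with a small experiment: run the same naive RNN twice, changing a single input vector in the second run. This is only a sketch. The helper `rnn_forward`, the perturbed index `k`, and the small zero-centered weights are assumptions made for illustration; zero-centered weights are used because large all-positive weights tend to saturate `tanh`, which would hide the effect.

```python
import numpy as np

def rnn_forward(inputs, W, U, b):
    """Naive RNN forward pass over one sequence of shape (timesteps, input_features)."""
    state_t = np.zeros(U.shape[0])
    outputs = []
    for input_t in inputs:
        # The new output depends on the current input and the previous state
        state_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(state_t)
    return np.stack(outputs, axis=0)

rng = np.random.default_rng(0)
timesteps, input_features, output_features = 10, 32, 64

# Small zero-centered weights (an illustrative choice, not the ones used above)
W = rng.normal(scale=0.1, size=(output_features, input_features))
U = rng.normal(scale=0.1, size=(output_features, output_features))
b = np.zeros(output_features)

inputs = rng.random((timesteps, input_features))
baseline = rnn_forward(inputs, W, U, b)

# Change only the input at timestep k and rerun
k = 5
perturbed_inputs = inputs.copy()
perturbed_inputs[k] += 1.0
perturbed = rnn_forward(perturbed_inputs, W, U, b)

unchanged_before_k = np.array_equal(baseline[:k], perturbed[:k])
changed_at_k = not np.allclose(baseline[k], perturbed[k])
print(unchanged_before_k, changed_at_k)
```

Outputs before timestep `k` are identical in both runs, while the output at `k` differs, and the change is carried forward to later timesteps through the state.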

## RNN and Return Sequences
When using a `keras` recurrent layer, we can choose between two kinds of output: a 3D tensor of shape `(batch_size, timesteps, output_features)` containing the outputs for every timestep of every sample in the batch, or a 2D tensor of shape `(batch_size, output_features)` containing only the last output for each input sequence.

This choice is controlled by the `return_sequences` constructor argument.
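The two output shapes can be illustrated by extending the naive `numpy` RNN to a batch of sequences. This is a sketch of the idea only, not how `keras` implements the layer: `simple_rnn_batch` is a hypothetical helper, and the batch size of 4 and the weight initialisation are arbitrary assumptions.

```python
import numpy as np

def simple_rnn_batch(batch, W, U, b, return_sequences=False):
    """Hypothetical numpy stand-in for a recurrent layer.

    `batch` has shape (batch_size, timesteps, input_features).
    """
    all_outputs = []
    for sample in batch:
        # The state is reset to zeros at the start of every sample
        state_t = np.zeros(U.shape[0])
        sample_outputs = []
        for input_t in sample:
            state_t = np.tanh(W @ input_t + U @ state_t + b)
            sample_outputs.append(state_t)
        all_outputs.append(np.stack(sample_outputs))
    all_outputs = np.stack(all_outputs)  # (batch_size, timesteps, output_features)
    # Either every timestep's output, or only the last one per sample
    return all_outputs if return_sequences else all_outputs[:, -1, :]

rng = np.random.default_rng(0)
batch = rng.random((4, 100, 32))
W = rng.normal(scale=0.1, size=(64, 32))
U = rng.normal(scale=0.1, size=(64, 64))
b = np.zeros(64)

full = simple_rnn_batch(batch, W, U, b, return_sequences=True)
last = simple_rnn_batch(batch, W, U, b)
print(full.shape)  # (4, 100, 64)
print(last.shape)  # (4, 64)
```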

### Returns only the Last Output for Each Sequence

In [19]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN

In [21]:
model = Sequential()

# Embedding layer handles a vocabulary of up to 10000 distinct
# words and encodes each word as a 32-dimensional vector
model.add(Embedding(10000, 32))

# Recurrent layer
model.add(SimpleRNN(32))

In [22]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                2080      
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
_________________________________________________________________
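The parameter counts in this summary can be reproduced by hand, assuming the recurrent layer holds an input weight matrix, a recurrent weight matrix, and a bias vector, like `W`, `U`, and `b` in the `numpy` implementation. The variable names below are illustrative.

```python
vocabulary_size = 10000
embedding_dim = 32   # size of each word vector, and the RNN's input size
rnn_units = 32       # size of the RNN's output and state

# One embedding_dim-sized vector per word in the vocabulary
embedding_params = vocabulary_size * embedding_dim

# Input weights + recurrent weights + bias
rnn_params = (rnn_units * embedding_dim) + (rnn_units * rnn_units) + rnn_units

print(embedding_params)               # 320000
print(rnn_params)                     # 2080
print(embedding_params + rnn_params)  # 322080
```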


### Returns the Full State Sequence

In [24]:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, None, 32)          2080      
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
_________________________________________________________________


## Stacking Multiple Recurrent Layers
Stacking multiple recurrent layers one after another increases the representational power of a network. For this to work, every intermediate recurrent layer must return its full output sequence rather than only its last output, because the next recurrent layer expects a sequence as input.

In [26]:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))

# The last recurrent layer only needs to return its final output
model.add(SimpleRNN(32))

In [27]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_7 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_8 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_9 (SimpleRNN)     (None, 32)                2080      
Total params: 328,320
Trainable params: 328,320
Non-trainable params: 0
_________________________________________________________________
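Why intermediate layers must return full sequences can be seen by stacking two naive `numpy` RNN layers on a single sequence. This is a sketch under stated assumptions: `rnn_layer` is a hypothetical helper that creates its own random weights, and it is not the `keras` implementation.

```python
import numpy as np

def rnn_layer(inputs, units, rng, return_sequences=False):
    """Hypothetical minimal RNN layer for one sequence of shape (timesteps, features)."""
    input_features = inputs.shape[1]
    W = rng.normal(scale=0.1, size=(units, input_features))
    U = rng.normal(scale=0.1, size=(units, units))
    b = np.zeros(units)
    state_t = np.zeros(units)
    outputs = []
    for input_t in inputs:
        state_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(state_t)
    return np.stack(outputs) if return_sequences else outputs[-1]

rng = np.random.default_rng(0)
sequence = rng.random((100, 32))

# The intermediate layer returns one output per timestep: still a sequence
hidden = rnn_layer(sequence, 32, rng, return_sequences=True)
print(hidden.shape)  # (100, 32)

# The last layer can return only its final output
final = rnn_layer(hidden, 32, rng)
print(final.shape)  # (32,)
```

If the intermediate layer returned only its last output, the next layer would receive a single vector of shape `(32,)` with no timestep axis to loop over. In this sketch, that case fails when the layer reads `inputs.shape[1]`.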
