# <center> Recurrent Neural Networks - Long Short Term Memory</center>

As seen in previous courses, neural networks offer a comprehensive and efficient solution to classification problems. But can they also be applied to other problems, such as prediction and time series forecasting?

# Recurrent Neural Networks (RNN)

Google Translate, Apple's Siri, ... What do these applications have in common? They rely on a specific kind of neural network called a Recurrent Neural Network. The core difference with what we have seen before is the ability of RNNs to "remember" information from previous steps of the sequence. These networks stem from David Rumelhart's work in 1986.  

Before the theoretical explanation, let's illustrate this idea with a graphical representation.

<img src="Pics/RNN-rolled.png" alt="RNN-rolled.PNG" style="width: 100px;"/>

This "rolled" representation of the RNN shows us an input $x_t$ in the network that then outputs an $h_t$ called the hidden state. 

We can also see a loop on the network, which can be unrolled (see below) to show the recurrence mechanism of an RNN.

<img src="Pics/RNN-unrolled.png" alt="RNN-unrolled.png" style="width: 600px;"/>

We see here that the input is in reality made of several inputs, one for each step from $1$ to $t$. The network feeds itself with the values it computed previously in order to improve its predictions. The simplest form of RNN feeds its outputs back as inputs, for instance after a tanh activation (see: https://keras.io/layers/recurrent/ - SimpleRNN).


## Theory

As stated before, the network relies on the hidden state $h_t$, which is the "memory" of the network: information about what the network has seen so far.

To compute the hidden state, several methods can be used. For instance, with the SimpleRNN layer in Keras, the output becomes the new hidden state.

A more general method is to concatenate the previous hidden state and the input into a single vector. This vector then goes through a $tanh$ activation, which avoids an uncontrollable growth of the values in the network, and the output of this activation is the new hidden state $h_t$.

To summarize, here is a visual representation of the process: 
<img src="Pics/RNN_resume.gif" alt="RNN_resume.gif" style="width: 600px;"/>
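
The update described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the weight matrix `W`, the bias `b`, the sizes and the random values are made up for the example (the text above does not name them), and the weights are applied to the concatenated vector before the tanh.

```python
import numpy as np

def rnn_step(prev_h, x, W, b):
    """One step of a basic recurrent cell (illustrative sketch)."""
    # Concatenate the previous hidden state and the current input
    combined = np.concatenate([prev_h, x])
    # Weighted sum followed by tanh keeps every value between -1 and 1
    return np.tanh(W @ combined + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 2            # hypothetical sizes
W = rng.normal(size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                 # initial hidden state
sequence = rng.normal(size=(5, input_size))
for x_t in sequence:
    h = rnn_step(h, x_t, W, b)            # the same weights are reused at every step

print(h.shape)  # (3,)
```

The loop is the "unrolled" view of the network: the hidden state returned at one step is passed back in at the next one.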

## Application

To create a simple RNN, we will use the dataset and the example from the following article. <br>  
<center><b>Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras </b></center> <br>
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/


Let's begin by plotting the data in order to grasp the problem at hand.  

In [None]:
import math
import pandas
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np


dataset = pandas.read_csv('data/international-airline-passengers.csv', usecols=[1], engine='python', skipfooter=3)
plt.plot(dataset)
plt.show()

A first look at our data shows that the number of passengers seems to follow a cyclic pattern. As such, a neural network with the ability to "remember" what came before could be a good approach to this problem.  

### Let's prepare our data

In [None]:
# fix random seed for reproducibility
np.random.seed(7)

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

Let's introduce the `create_dataset` function that we will use for our example.

In [None]:
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
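
To make the windowing concrete, here is a small standalone sketch on a toy series. It repeats the logic of `create_dataset` under the hypothetical name `make_windows` so that it can run on its own; the toy values are made up.

```python
import numpy as np

def make_windows(series, look_back=1):
    # Same logic as create_dataset: X holds `look_back` consecutive values,
    # Y holds the value that immediately follows them
    data_x, data_y = [], []
    for i in range(len(series) - look_back - 1):
        data_x.append(series[i:i + look_back, 0])
        data_y.append(series[i + look_back, 0])
    return np.array(data_x), np.array(data_y)

toy = np.array([[10.], [20.], [30.], [40.], [50.]])   # shape (5, 1), one column like our dataset
X, Y = make_windows(toy, look_back=1)
print(X.ravel())  # [10. 20. 30.]
print(Y)          # [20. 30. 40.]
print(X.shape)    # (3, 1)
```

Note that because of the `- 1` in the loop bound, the last possible pair (40 followed by 50) is not produced.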

In [None]:
# Using the create_dataset function given beforehand, reshape test and train into X=t and Y=t+1
look_back = 1
trainX, trainY = #...
testX, testY = #...
# reshape input to be [samples, time steps, features]
trainX = #...
testX = #...

In [None]:
#%load solutions/code1.py

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN

# create and fit the SimpleRNN network
model = Sequential()
model.add(SimpleRNN(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

### Plotting our results

In [None]:
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

# shift train predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict

# shift test predictions for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict

# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()

As can be seen here, the predictions are close to the actual values, which illustrates the efficiency of RNNs on this problem.

## Limits of RNNs

### The vanishing gradient issue

However, RNNs suffer from a serious issue: the vanishing of gradients during the backpropagation phase. Some gradients converge very quickly to 0, which prevents any modification of the network's weights and thus limits the network's efficiency and accuracy.  
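
Here is a rough numerical illustration of this effect, under simplifying assumptions: a scalar recurrent cell $h_t = tanh(w \cdot h_{t-1} + x_t)$ with a hypothetical weight `w = 0.5` and made-up inputs. By the chain rule, the gradient of $h_t$ with respect to $h_0$ is the product of the per-step factors $w \cdot (1 - h_k^2)$. Each factor is at most 0.5 in absolute value here, so the product shrinks as the sequence gets longer.

```python
import numpy as np

w = 0.5                                   # hypothetical recurrent weight
rng = np.random.default_rng(0)
inputs = rng.normal(size=50)              # made-up input sequence

h = 0.0
grad = 1.0                                # d h_t / d h_0, built up step by step
grads = []
for x_t in inputs:
    h = np.tanh(w * h + x_t)
    grad *= w * (1.0 - h ** 2)            # derivative of tanh(w*h + x) w.r.t. the previous h
    grads.append(abs(grad))

# gradient magnitudes after 1, 10 and 50 steps
print(grads[0], grads[9], grads[49])
```

After 50 steps the gradient is far too small to produce a meaningful weight update, which is why early inputs have almost no influence on training.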

### Memory capacity

The memory of an RNN is very limited: it forgets earlier data as information propagates through the network. This memory is only short-term.

<center><b>Before starting the second part of the notebook, feel free to take a short break</b>
<img src="Pics/Slacking.png" style="width: 300px;"/></center>

# Long Short Term Memory

To cope with the shortcomings of RNNs, a new cell was designed to improve on them while still exploiting their "memory". These cells are called LSTM. They were introduced in 1997 by Hochreiter and Schmidhuber in the article "Long Short-Term Memory".

## A quick guide on LSTM

LSTM cells rely on the same idea as RNNs: they propagate information through the network. The operations in the cell are, however, more elaborate than those of an RNN. We shall go through them step by step.

<img src="Pics/LSTM_resume.png" title="A LSTM cell" style="width: 600px;"/>

The cell is built around three main gates, each based on activation functions. RNNs rely mostly on $tanh$, but LSTM cells add sigmoid activations, which allow the network to forget information instead of keeping all of it as an RNN does.  

### Forget gate

The first gate decides what information to keep. The current input and the previous hidden state are concatenated into a vector, which is then passed through a sigmoid activation to decide what is to be kept or forgotten.
<img src="Pics/forget_gate.gif" alt="forget_gate.gif" style="width: 800px;"/>
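
In the usual notation, the forget gate can be written as follows. The symbols $W_f$ and $b_f$ for the gate's weights and bias are assumptions, since the text above does not name them; $[h_{t-1}, x_t]$ is the concatenated vector and $\sigma$ the sigmoid function.

$$f_t = \sigma\left(W_f \, [h_{t-1}, x_t] + b_f\right)$$

Each component of $f_t$ lies between 0 and 1: a value close to 0 means "forget", a value close to 1 means "keep".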

### Input gate

The input gate prepares the update of the cell state. It is made of two activation functions in parallel, both fed with the previously concatenated vector.

The sigmoid activation chooses what information is to be kept, and the $tanh$ activation helps regulate the network by keeping the values between -1 and 1 (thus avoiding the possible escalation of some values).

The outputs of these two activations are then multiplied, and the result will be used to update our cell state.

<img src="Pics/input_gate.gif" alt="input_gate.gif" style="width: 800px;"/>
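
Written as equations, the two parallel activations could look like this. The weight and bias names ($W_i$, $b_i$, $W_c$, $b_c$) and the symbols $i_t$ and $\tilde{c}_t$ are assumptions borrowed from the usual LSTM notation; $[h_{t-1}, x_t]$ is the concatenated vector and $\sigma$ the sigmoid function.

$$i_t = \sigma\left(W_i \, [h_{t-1}, x_t] + b_i\right) \qquad \tilde{c}_t = tanh\left(W_c \, [h_{t-1}, x_t] + b_c\right)$$

The output of the input gate is then the element-wise product $i_t \odot \tilde{c}_t$.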

### Cell state

In this step, we update the previous cell state: we first multiply it by the forget gate output to drop irrelevant values, then add the output of the input gate to create the new cell state.
<img src="Pics/cell_state.gif" style="width: 800px;"/>
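
As a formula, this update can be sketched as follows. The symbols are assumptions taken from the usual LSTM notation: $c_{t-1}$ is the previous cell state, $f_t$ the forget gate output, $i_t \odot \tilde{c}_t$ the input gate output (sigmoid part times $tanh$ part), and $\odot$ the element-wise product.

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Because the old state is only scaled and added to, not passed through another activation, information can travel along the cell state over many steps.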

### Output gate

The last gate is the output gate, where the new hidden state is computed from the previous hidden state, the current input and the new cell state.
The cell state is squashed through a $tanh$ activation, and the previously concatenated vector is passed through a sigmoid activation to filter out irrelevant information.

The outputs of both activations are multiplied to create the new hidden state.

<img src="Pics/output_gate.gif" style="width: 800px;"/>
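
In equation form, this last step could be written as follows. The names $W_o$, $b_o$ and $o_t$ are assumptions from the usual LSTM notation; $[h_{t-1}, x_t]$ is the concatenated vector, $c_t$ the new cell state, $\sigma$ the sigmoid function and $\odot$ the element-wise product.

$$o_t = \sigma\left(W_o \, [h_{t-1}, x_t] + b_o\right) \qquad h_t = o_t \odot tanh(c_t)$$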

<b><center>Now please try to complete the following code, which summarizes the process inside an LSTM cell</center></b>


<center>This is a pseudo-code exercise, no need to run the cell afterwards</center>

In [None]:
def LSTMcell(prev_ht, prev_ct,input):
    combine = #...
    forget_t = forget_layer(#...)
    candidate = candidate_layer(#...)
    input_t = input_layer(#...)
    c_t = #...
    output_t = output_layer(#...)
    h_t = #...
    return h_t, c_t

In [None]:
# %load solutions/code2.py

## Application

Let's apply an LSTM to our previous example to compare the efficiency of the two approaches.

In [None]:
# reshape test and train into X=t and Y=t+1
look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

In [None]:
from keras.layers import LSTM

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)

# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

In [None]:
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))
# shift train predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()

We can see that our results were improved simply by replacing the RNN cells with LSTM cells.

# To go further

## More about RNN and LSTM uses

As stated in the introduction, these cells can be used for prediction and classification. Here are two articles about these uses that could come in handy later, or that you can explore if you want to see more applications.

https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/ : a text generation model trained on Lewis Carroll's <i>"Alice in Wonderland"</i> by Jason Brownlee 

https://towardsdatascience.com/forecasting-air-pollution-with-recurrent-neural-networks-ffb095763a5c : Air quality forecast by Bert Carremans. It is highly interesting as it compares the efficiency of RNNs, simple LSTMs and more complex forms of LSTMs.

## Similar cells

<b>Gated Recurrent Unit</b> (GRU): These structures are a variation of LSTMs in which the first two gates are combined into a single update gate. They were introduced by Cho, et al. (2014).

<b>Benchmarks</b>: The benchmark of the efficiency of these networks by Greff, et al. (2015) is a relevant read.

RNNs and LSTMs have been a hot topic for some time now, and new results and studies about them are published often.

# Sources

Example source:<br>
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/


Guides on RNN and LSTM:<br>
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://skymind.ai/wiki/lstm