## Different LSTM Structures

We want to explore different LSTM structures and evaluate their performance on a simple, artificial dataset. The structures we are going to implement are:<br>
<br>
    - Vanilla LSTM: the standard LSTM we already know, used as the baseline for comparison<br>
    - Bidirectional LSTM<br>
    - Stacked LSTM<br>
    - LSTM + CNN<br>

**0) Loading Libraries and Subroutines**

Standard libraries for plotting and numerical operations

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Loading LSTM related keras libraries:

In [None]:
from keras import optimizers
from keras.layers import LSTM
from keras.layers import Dense
from keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler

Importing a subroutine that puts the dataset into the correct shape for the LSTM (see later)

In [None]:
from prepare_data import prepare_data

<br>
As before, we generate a simple dataset, but this time with a higher noise level in order to challenge the LSTMs we are going to build:

In [None]:
t_start = -50
t_end   = 20
incr    = 0.25

t       = np.arange(t_start, t_end, incr)
t       = t.reshape(len(t), 1)
Y_t     = np.sin(t) + 0.5*np.random.randn(len(t), 1) + np.exp((t + 20)*0.05)

In [None]:
plt.plot(t, Y_t)
plt.title('complete series')
plt.show()

2a) Scaling:

In [None]:
scaler  = MinMaxScaler(feature_range = (0, 1))
Y_tnorm = scaler.fit_transform(Y_t)
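
Since the models are trained on the scaled series, their predictions also come out in the range of roughly 0 to 1. The following NumPy-only sketch shows the arithmetic behind a min-max scaling to (0, 1) and its inverse. The helper names `minmax_fit`, `minmax_transform` and `minmax_inverse` as well as the demo data are made up for this illustration and are not part of the notebook.

```python
import numpy as np

def minmax_fit(y):
    """Return the column-wise minimum and maximum of y (shape: n_points x n_columns)."""
    return y.min(axis=0), y.max(axis=0)

def minmax_transform(y, y_min, y_max):
    """Map y linearly so that y_min becomes 0 and y_max becomes 1."""
    return (y - y_min) / (y_max - y_min)

def minmax_inverse(y_scaled, y_min, y_max):
    """Undo minmax_transform and return values in the original units."""
    return y_scaled * (y_max - y_min) + y_min

# hypothetical demo series, only used to check the round trip
rng      = np.random.default_rng(0)
y_demo   = 3.0 * rng.normal(size=(50, 1)) + 2.0

y_min, y_max = minmax_fit(y_demo)
y_scaled     = minmax_transform(y_demo, y_min, y_max)
y_back       = minmax_inverse(y_scaled, y_min, y_max)

print(y_scaled.min(), y_scaled.max())   # smallest and largest scaled value
print(np.allclose(y_back, y_demo))      # round trip recovers the input
```

With the `scaler` object from the cell above, the same back-transformation should be available through `scaler.inverse_transform`, which expects a 2D array with one column, like the one it was fitted on.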

2b) Reshaping the Data

In [None]:
dt_past    = 20
dt_futu    = 8
n_features = 1

[X, Y] = prepare_data(Y_tnorm, dt_past, dt_futu)
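
The source of `prepare_data` is not shown in this notebook. As a rough idea of what such a routine typically does, here is a minimal sliding-window sketch. The function `prepare_data_sketch` is a hypothetical stand-in and the real subroutine may differ, for example in the number of windows it returns.

```python
import numpy as np

def prepare_data_sketch(series, dt_past, dt_futu):
    """Hypothetical stand-in for prepare_data.

    Cuts a series of shape (T, 1) into overlapping windows:
    X holds dt_past past values, Y the dt_futu values that follow them.
    """
    series    = np.asarray(series, dtype=float).reshape(-1)
    n_windows = len(series) - dt_past - dt_futu + 1

    X = np.zeros((n_windows, dt_past, 1))   # (samples, timesteps, features)
    Y = np.zeros((n_windows, dt_futu))      # (samples, forecast steps)

    for i in range(n_windows):
        X[i, :, 0] = series[i : i + dt_past]
        Y[i, :]    = series[i + dt_past : i + dt_past + dt_futu]

    return X, Y

# demo with a simple counter so that the windows are easy to read
demo           = np.arange(280, dtype=float).reshape(-1, 1)
X_demo, Y_demo = prepare_data_sketch(demo, dt_past=20, dt_futu=8)

print(X_demo.shape, Y_demo.shape)
print(X_demo[0, :, 0])   # first input window
print(Y_demo[0])         # the values that follow it
```

In this sketch, the first input window contains the values 0 to 19 and its target contains 20 to 27, so each sample pairs `dt_past` observations with the `dt_futu` observations that directly follow.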

2c) Splitting data into Training and Test dataset

In [None]:
cut            = int(np.round(0.7*Y_tnorm.shape[0]))

TrainX, TrainY = X[:cut], Y[:cut]
TestX,   TestY = X[cut:], Y[cut:]

<br>

**1) Vanilla LSTM**

As in the previous lecture, we start with the standard, aka *vanilla*, LSTM<br>

1a) Generating the Model

In [None]:
n_neurons  = 400
batch_size = 128

model = Sequential()
model.add(LSTM(n_neurons, input_shape = (dt_past, n_features), activation = 'tanh'))
model.add(Dense(dt_futu))

opt = optimizers.Adam()
model.compile(loss = 'mean_squared_error', optimizer = opt)

model.summary()
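
The parameter counts printed by `model.summary()` can be checked by hand. The sketch below assumes the usual LSTM layout with four weight blocks (input gate, forget gate, cell candidate, output gate), each with input weights, recurrent weights and a bias vector. The helper names are made up for this illustration.

```python
def lstm_param_count(n_units, n_inputs):
    """Trainable parameters of one LSTM layer with bias terms."""
    # 4 blocks, each: input weights + recurrent weights + bias
    return 4 * (n_units * (n_inputs + n_units) + n_units)

def dense_param_count(n_inputs, n_outputs):
    """Trainable parameters of a dense layer: weights + bias."""
    return n_inputs * n_outputs + n_outputs

n_neurons, n_features, dt_futu = 400, 1, 8

n_lstm  = lstm_param_count(n_neurons, n_features)
n_dense = dense_param_count(n_neurons, dt_futu)

print(n_lstm, n_dense, n_lstm + n_dense)
```

The same helper is useful further down: whenever the input size of an LSTM layer changes, its parameter count changes according to this formula.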

<br>

1b) Fitting the Model 

In [None]:
n_epochs = 100
out = model.fit(TrainX, TrainY, epochs = n_epochs, batch_size = batch_size, validation_split = 0.2, verbose = 2, shuffle = False)

In [None]:
#plotting #############################################################
plt.plot(out.history['loss'])
plt.plot(out.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc = 'upper left')
plt.savefig('training loss.pdf')
plt.show()
#######################################################################

<br>

1c) Evaluating the Fit

In [None]:
PredY = model.predict(TestX)
back  = PredY.shape[0]

plt.plot(t, Y_tnorm, linewidth = 3)
plt.plot(t[-back:], PredY[:, dt_futu-1])
plt.legend(['actual data', 'prediction'])
plt.fill_between([t[-back,0], t[-1,0]], 0, 1, color = 'k', alpha = 0.1)
plt.plot([t[-back,0], t[-back,0]], [0, 1], 'k-', linewidth = 3)
plt.show()
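
The plot gives a visual impression only. To compare the architectures with a number, one option is the root mean squared error for each forecast step. The following sketch is an addition to the notebook, not part of the original analysis. The function name `rmse_per_horizon` and the demo arrays are assumptions made for illustration.

```python
import numpy as np

def rmse_per_horizon(y_true, y_pred):
    """RMSE for each forecast step; both inputs have shape (n_samples, dt_futu)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

# hypothetical demo: the prediction is off by a constant 0.1 on every step
rng         = np.random.default_rng(1)
y_true_demo = rng.uniform(size=(30, 8))
y_pred_demo = y_true_demo + 0.1

errors = rmse_per_horizon(y_true_demo, y_pred_demo)
print(errors)
```

In the notebook, the call would be `rmse_per_horizon(TestY, PredY)`, which returns one value per forecast step in scaled units.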

Let us run the same analysis, but with different LSTMs:

<br>

**2) Bidirectional LSTM**

For many sequences (e.g. DNA, RNA, AA) it makes sense to read them from both directions, which makes it easier to detect patterns. For example, a pattern in DNA in sense (ATTCA) and antisense (ACTTA) direction might look mirrored, hence different, but both are actually the same feature with the same function.<br>
The only thing we need to do is import the corresponding wrapper from *Keras*:<br>

In [None]:
from keras.layers import Bidirectional

...and add the class to our model: 

In [None]:
model = Sequential()
model.add(Bidirectional(LSTM(n_neurons, activation = 'tanh'), input_shape = (dt_past, n_features)))
model.add(Dense(dt_futu))

opt = optimizers.Adam()
model.compile(loss = 'mean_squared_error', optimizer = opt)

model.summary()
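
To illustrate what the wrapper does conceptually, the sketch below runs one recurrence over the sequence and a second one over the reversed sequence, then concatenates both final states (concatenation is the default merge mode of the Keras wrapper). A plain tanh recurrence is used as a simplified stand-in for the LSTM cell, and all names and sizes are chosen for illustration only.

```python
import numpy as np

def simple_rnn_last_state(x, W_in, W_rec):
    """Plain tanh recurrence, a simplified stand-in for an LSTM cell.

    x has shape (timesteps, n_features); returns the final hidden state.
    """
    h = np.zeros(W_rec.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ W_in + h @ W_rec)
    return h

rng = np.random.default_rng(2)
n_units, n_feat, n_steps = 4, 1, 20
x_demo = rng.normal(size=(n_steps, n_feat))

# each direction has its own set of weights
W_in_f, W_rec_f = rng.normal(size=(n_feat, n_units)), rng.normal(size=(n_units, n_units))
W_in_b, W_rec_b = rng.normal(size=(n_feat, n_units)), rng.normal(size=(n_units, n_units))

h_forward  = simple_rnn_last_state(x_demo,       W_in_f, W_rec_f)   # reads past -> future
h_backward = simple_rnn_last_state(x_demo[::-1], W_in_b, W_rec_b)   # reads future -> past

h_bi = np.concatenate([h_forward, h_backward])
print(h_bi.shape)
```

Because both states are concatenated, the layer after the bidirectional LSTM receives twice as many features as units, which is why the bidirectional layer in the summary above reports an output size of `2*n_neurons`.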

In [None]:
out = model.fit(TrainX, TrainY, epochs = n_epochs, batch_size = batch_size, validation_split = 0.2, verbose = 2, shuffle = False)

In [None]:
#plotting #############################################################
plt.plot(out.history['loss'])
plt.plot(out.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc = 'upper left')
plt.savefig('training loss.pdf')
plt.show()
#######################################################################

In [None]:
PredY = model.predict(TestX)
back  = PredY.shape[0]

plt.plot(t, Y_tnorm, linewidth = 3)
plt.plot(t[-back:], PredY[:, dt_futu-1])
plt.legend(['actual data', 'prediction'])
plt.fill_between([t[-back,0], t[-1,0]], 0, 1, color = 'k', alpha = 0.1)
plt.plot([t[-back,0], t[-back,0]], [0, 1], 'k-', linewidth = 3)
plt.show()

<br>

**3) Stacked LSTM**

In the same way that we can run several convolution layers one after another, we can stack several LSTMs.<br>
For the **first LSTM**, we still need to provide the input shape. In addition, we need to set *return_sequences = True* for **all LSTMs except the last one**, so that each of them returns its hidden state at every time step. The output then has the shape *(batch size x timesteps i.e. dt_past x hidden units)*, which is the three-dimensional input the next LSTM layer expects (see matrix multiplication "MLP" lecture and "LSTM1" lecture).

In [None]:
model = Sequential()

model.add(LSTM(n_neurons,   activation = 'tanh', return_sequences = True, input_shape = (dt_past, n_features)))
model.add(LSTM(2*n_neurons, activation = 'relu', return_sequences = True))
model.add(LSTM(n_neurons,   activation = 'relu'))
model.add(Dense(dt_futu))

opt = optimizers.Adam()
model.compile(loss = 'mean_squared_error', optimizer = opt)

model.summary()
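
The effect of *return_sequences* on the shapes can be reproduced with a few lines of NumPy. As before, a plain tanh recurrence serves as a simplified stand-in for the LSTM cell. The function `simple_rnn` and the small layer sizes are assumptions made for this sketch.

```python
import numpy as np

def simple_rnn(x, W_in, W_rec, return_sequences=False):
    """Plain tanh recurrence over x of shape (batch, timesteps, n_features)."""
    batch, timesteps, _ = x.shape
    h      = np.zeros((batch, W_rec.shape[0]))
    states = []
    for step in range(timesteps):
        h = np.tanh(x[:, step, :] @ W_in + h @ W_rec)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)   # (batch, timesteps, units)
    return h                              # (batch, units): last state only

rng    = np.random.default_rng(3)
x_demo = rng.normal(size=(5, 20, 1))      # (batch, dt_past, n_features)

W1_in, W1_rec = rng.normal(size=(1, 6)), rng.normal(size=(6, 6))
W2_in, W2_rec = rng.normal(size=(6, 3)), rng.normal(size=(3, 3))

seq_out  = simple_rnn(x_demo, W1_in, W1_rec, return_sequences=True)
last_out = simple_rnn(seq_out, W2_in, W2_rec)

print(seq_out.shape, last_out.shape)
```

The first layer hands a full sequence to the second one, while the second layer only returns its last state. Without `return_sequences=True` in the first layer, the second layer would receive a 2D array and the loop over time steps would have nothing to iterate over.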

In [None]:
out = model.fit(TrainX, TrainY, epochs = n_epochs, batch_size = batch_size, validation_split = 0.2, verbose = 2, shuffle = False)

In [None]:
#plotting #############################################################
plt.plot(out.history['loss'])
plt.plot(out.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc = 'upper left')
plt.savefig('training loss.pdf')
plt.show()
#######################################################################

In [None]:
PredY = model.predict(TestX)
back  = PredY.shape[0]

plt.plot(t, Y_tnorm, linewidth = 3)
plt.plot(t[-back:], PredY[:, dt_futu-1])
plt.legend(['actual data', 'prediction'])
plt.fill_between([t[-back,0], t[-1,0]], 0, 1, color = 'k', alpha = 0.1)
plt.plot([t[-back,0], t[-back,0]], [0, 1], 'k-', linewidth = 3)
plt.show()

<br>

**4) LSTM + CNN**

Both CNNs and LSTMs are quite successful at detecting patterns. A logical step is to combine both structures and their strengths.

First, we need to call the corresponding libraries:

In [None]:
from keras.layers import Flatten, Conv1D, MaxPooling1D

But now there is a tricky part: The convolution layer, as we know it from images, expects the shape *(N_images, N_pixel_x, N_pixel_y, N_color_chan)*, but a sequence usually has the shape *(N_samples, N_timesteps, N_features)*. Thus, we first need to reshape the input matrix. But we also want to maintain the order of time. If we want to learn a pattern in time from a certain number of samples having a certain number of features, each time point needs to have the information from all features and all samples. Thus, the first coordinate is time (see the lecture slides for more details).<br> 
Therefore, the shape for $X$ has to be *(N_timesteps, N_samples, dt_past, N_features)*.

In [None]:
N_samples  = 1

In [None]:
X          = X.reshape((X.shape[0], N_samples, dt_past, n_features))

In [None]:
TrainX, TrainY = X[:cut], Y[:cut]
TestX,   TestY = X[cut:], Y[cut:]

The next step is to make sure that the shapes from the convolution filters are passed on to the LSTM in the correct way. This is done by using the wrapper *TimeDistributed*. 

In [None]:
from keras.layers import TimeDistributed

Now, we are ready for building the model:

In [None]:
model = Sequential()
model.add(TimeDistributed(Conv1D(filters = 64, kernel_size = 3, activation = 'relu'), input_shape = (None, dt_past, n_features)))
model.add(TimeDistributed(MaxPooling1D(pool_size = 2)))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(n_neurons, activation = 'tanh'))  # input shape is inferred from the previous layer
model.add(Dense(dt_futu))

opt = optimizers.Adam()
model.compile(loss = 'mean_squared_error', optimizer = opt)

model.summary()
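
The shapes in the summary can be followed by hand. The sketch below assumes the Keras defaults for the layers used above: no padding and stride 1 for the convolution, and a pooling stride equal to the pool size. The helper names are made up for this illustration.

```python
def conv1d_valid_length(n_steps, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (n_steps - kernel_size) // stride + 1

def maxpool1d_valid_length(n_steps, pool_size):
    """Output length of 1D max pooling without padding, stride = pool_size."""
    return (n_steps - pool_size) // pool_size + 1

dt_past, n_filters, n_subsequences = 20, 64, 1

after_conv = conv1d_valid_length(dt_past, kernel_size=3)
after_pool = maxpool1d_valid_length(after_conv, pool_size=2)
n_flat     = after_pool * n_filters

print(after_conv, after_pool, n_flat)
print('LSTM input per sample:', (n_subsequences, n_flat))
```

Under these assumptions, each window of 20 values is turned into one vector of 576 features. With `N_samples = 1`, the LSTM therefore sees a sequence of length one per window. Splitting each window into several subsequences would give the LSTM more than one step to work with.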

In [None]:
out = model.fit(TrainX, TrainY, epochs = n_epochs, batch_size = batch_size, validation_split = 0.2, verbose = 2, shuffle = False)

In [None]:
#plotting #############################################################
plt.plot(out.history['loss'])
plt.plot(out.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc = 'upper left')
plt.savefig('training loss.pdf')
plt.show()
#######################################################################

In [None]:
PredY = model.predict(TestX)
back  = PredY.shape[0]

plt.plot(t, Y_tnorm, linewidth = 3)
plt.plot(t[-back:], PredY[:, dt_futu-1])
plt.legend(['actual data', 'prediction'])
plt.fill_between([t[-back,0], t[-1,0]], 0, 1, color = 'k', alpha = 0.1)
plt.plot([t[-back,0], t[-back,0]], [0, 1], 'k-', linewidth = 3)
plt.show()

Compared to the other architectures, the result has improved a lot!