# LSTM Tutorial 2
Thomas Delissen - November 2024

The goal of this tutorial is to understand how LSTM and GRU cells work, and to build some more complicated sequence-to-sequence architectures, such as the bidirectional approach. We will not include the attention mechanism in this tutorial.

## Part 1: Manual calculation of an LSTM forward pass
To fully understand the inner workings of an LSTM cell, the goal of this first exercise is to manually calculate a forward pass through the unrolled graph of an LSTM model with a single input feature and a single cell in the LSTM layer. The sequence length is 2, and we again do a binary classification at the end. The goal is to calculate y.

Note that I will use array notation everywhere, even for single scalars. This way you can extend the code to examples with larger inputs and more hidden units (in case this tickles your fancy).

I have placed the two LSTM timesteps more or less above each other so that the diagram remains readable; you should of course still interpret them as being in sequence. I also introduced intermediate variables so that you do not need to calculate y in one go. Good luck!

![Rolled Out LSTM](lstmTutorial.png)
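
For reference, the computation in the diagram can be written out as equations. This is a sketch that follows the naming of the code cells in this tutorial: step t consumes x(t), h(t) and c(t) and produces h(t+1) and c(t+1). The symbols z (the combined pre-activation), σ (the sigmoid) and ⊙ (element-wise product) are introduced here for illustration, and the gate order (forget, input, candidate, output) follows the way the columns of W are indexed in the cells below.

```latex
\begin{aligned}
z_t &= [\,x_t,\ h_t\,]\,W + B \\
f_t &= \sigma(z_{t,0}), \quad
i_t = \sigma(z_{t,1}), \quad
\hat{c}_t = \tanh(z_{t,2}), \quad
o_t = \sigma(z_{t,3}) \\
c_{t+1} &= f_t \odot c_t + i_t \odot \hat{c}_t \\
h_{t+1} &= o_t \odot \tanh(c_{t+1}) \\
y &= \sigma(h_2\,W_y + B_y)
\end{aligned}
```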

In [1]:
import numpy as np
# In this example, we only have 1 hidden unit and 1 input feature, but we still use arrays to ensure compatibility with more complicated architectures. 

# Input sequence x(0), x(1)
x_0 = np.array([[0.5]])
x_1 = np.array([[0.8]])

# Initial hidden state h(0) and cell state c(0)
h_0 = np.array([[0.0]])  
c_0 = np.array([[0.0]])  

# LSTM cell weights and biases
# Columns: forget gate, input gate, candidate, output gate
W = np.array([[0.2, 0.4, -0.3, 0.1],  # Row 0: weights for input x(t)
               [0.1, -0.2, 0.5, 0.3]])  # Row 1: weights for hidden state h(t)
B = np.array([[0.1, 0.2, -0.1, 0.0]])  # Biases

# Output layer weight and bias
Wy = np.array([[1.0]])
By = np.array([[0.5]])

# Activation functions
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):
    return np.tanh(x)

Ok, let us do the calculations. As a first step, we concatenate x_0 and h_0 so that they can be multiplied with W. Then we take the dot product and add the bias B. Print the result to check what the matrix looks like.

In [12]:
# First step: Concatenate x(0) and h(0), then apply weights and biases
combined_input_0 = np.concatenate((x_0, h_0), axis=1)  # Concatenate x(0) and h(0)
combined_result_0 = np.dot(combined_input_0, W) + B  # Apply weights and biases
print("Combined input matrix before split operation at t=0:")
print(combined_result_0)

Combined input matrix before split operation at t=0:
[[ 0.2   0.4  -0.25  0.05]]
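
The concatenation trick is just a compact way of writing two separate matrix products. The following minimal sketch checks this with the values from above; the names `W_x` and `W_h` are introduced here for illustration only and do not appear elsewhere in the tutorial.

```python
import numpy as np

# Same values as in the tutorial cells above
x_0 = np.array([[0.5]])
h_0 = np.array([[0.0]])
W = np.array([[0.2, 0.4, -0.3, 0.1],    # row 0: weights for the input
              [0.1, -0.2, 0.5, 0.3]])   # row 1: weights for the hidden state
B = np.array([[0.1, 0.2, -0.1, 0.0]])

# Variant 1: concatenate first, then a single matrix product (as in the tutorial)
z_concat = np.dot(np.concatenate((x_0, h_0), axis=1), W) + B

# Variant 2: split W into an input part and a recurrent part (illustrative names)
W_x, W_h = W[:1, :], W[1:, :]
z_split = np.dot(x_0, W_x) + np.dot(h_0, W_h) + B

# Both variants give the same pre-activations
print(np.allclose(z_concat, z_split))
```

Many textbooks and libraries write the LSTM in the second form, with separate input and recurrent weight matrices.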


Now we can split this matrix (a row vector) into four parts; each number is the input to one of the gates. We can index into it when we calculate the intermediate variables, for example combined_result_0[0, 1] to get the value for the input gate. Fill in the blanks, using the sigmoid and tanh functions defined before:

In [14]:
f_0 = sigmoid(combined_result_0[0, 0])  # Forget gate
i_0 = sigmoid(combined_result_0[0, 1])  # Input gate
c_hat_0 = tanh(combined_result_0[0, 2]) # Candidate gate
o_0 = sigmoid(combined_result_0[0, 3])  # Output gate

So far so good! With these variables, we can now calculate c_1, and with c_1, we can also calculate h_1. With that, we have the output of the first timestep ready, and can move to the second timestep in the sequence. Element-wise multiplication and addition are simply written with the symbols "*" and "+".

In [15]:
# Calculate c(1) based on the gate outputs
c_1 = (c_0 * f_0) + (i_0 * c_hat_0)  # Update cell state

# Calculate h(1) based on the updated cell state c(1) and output gate
h_1 = tanh(c_1) * o_0  # Update hidden state


print("Updated cell state c(1):")
print(c_1)
print("Updated hidden state h(1):")
print(h_1)

Updated cell state c(1):
[[-0.14662978]]
Updated hidden state h(1):
[[-0.07461341]]


Excellent work. Now we basically do the same for the second timestep of the LSTM. You might be able to copy a lot of code from above. 

In [23]:
# For the second timestep: Concatenate x(1) and h(1), then apply weights and biases

combined_input_1 = np.concatenate((x_1, h_1), axis=1)  # Concatenate x(1) and h(1)
combined_result_1 = np.dot(combined_input_1, W) + B  # Apply weights and biases

print("\nCombined input matrix before split operation at t=1:")
print(combined_result_1)


Combined input matrix before split operation at t=1:
[[ 0.25253866  0.53492268 -0.3773067   0.05761598]]


In [24]:
# Calculate the outputs of the four gates at t=1
f_1 = sigmoid(combined_result_1[0, 0])  # Forget gate
i_1 = sigmoid(combined_result_1[0, 1])  # Input gate
c_hat_1 = tanh(combined_result_1[0, 2])  # Candidate gate
o_1 = sigmoid(combined_result_1[0, 3])  # Output gate


In [25]:
# Calculate c(2) based on the gate outputs
c_2 = (c_1 * f_1) + (i_1 * c_hat_1)  # Update cell state

# Calculate h(2) based on the updated cell state and output gate
h_2 = tanh(c_2) * o_1  # Update hidden state

Excellent, now we have calculated h_2, which is basically the value we will pass to our regular output layer. Now we can directly calculate y. Again, fill in the blanks. This should be similar to the first exercise. 

In [26]:
y = sigmoid(h_2 * Wy + By)  # Compute output

print("And our final output is:")
print(y[0,0])

And our final output is:
0.5855398664699348
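
Since both timesteps run exactly the same computation, the step can be wrapped in a function and applied in a loop. The sketch below is one possible way to do that; `lstm_step` is a helper name made up for this illustration, and the use of `np.split` assumes that the columns of W are laid out in four equally sized blocks in the order forget, input, candidate, output (which is the case for the single-unit example above).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_t, c_t, W, B):
    """One LSTM step; W has shape (n_inputs + n_hidden, 4 * n_hidden)."""
    z = np.dot(np.concatenate((x_t, h_t), axis=1), W) + B
    # Four equally sized column blocks: forget, input, candidate, output
    z_f, z_i, z_c, z_o = np.split(z, 4, axis=1)
    c_next = sigmoid(z_f) * c_t + sigmoid(z_i) * np.tanh(z_c)
    h_next = sigmoid(z_o) * np.tanh(c_next)
    return h_next, c_next

# Same parameters as in the tutorial
W = np.array([[0.2, 0.4, -0.3, 0.1],
              [0.1, -0.2, 0.5, 0.3]])
B = np.array([[0.1, 0.2, -0.1, 0.0]])
Wy = np.array([[1.0]])
By = np.array([[0.5]])

# Start from zero states and walk through the two inputs
h = np.array([[0.0]])
c = np.array([[0.0]])
for x_t in (np.array([[0.5]]), np.array([[0.8]])):
    h, c = lstm_step(x_t, h, c, W, B)

y = sigmoid(np.dot(h, Wy) + By)
print(y[0, 0])  # should reproduce the value printed above (about 0.5855)
```

Because the function only relies on array shapes, it should also work for more hidden units as long as W and B are sized accordingly.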


If you are the fastest student in the class, and have finished the entire notebook before everyone else, you may come back to this point and do the following bonus exercise: 

Create a Python one-liner that calculates y in one step, without creating intermediate variables. (It is OK to spread the one-liner over multiple lines for readability, as long as there is just one "=" sign.)

In [None]:
# YOU ONLY NEED TO DO THIS IF YOU HAVE TIME LEFT
#y = ...
print("Output should be the same as above:")
print(y[0,0])

## Part 2: Train an LSTM and a GRU based Model on the example from exercise one
In this part of the exercise, you will create your own LSTM RNN as well as a GRU based RNN in Keras, and compare them to the RNNs we had before. 

We will use the same dataset and preprocessing pipeline from the first exercise, which I will give you here for free; the focus of this tutorial is on RNNs, and not on timeseries data wrangling. 

Make sure the csv from exercise 1 is in the same folder as this notebook. 


In [27]:
# Importing and transforming the dataframe from exercise 1
import pandas as pd
from sklearn.preprocessing import StandardScaler

file_path = "CTA_-_Ridership_-_Daily_Boarding_Totals_20241104.csv"
df = pd.read_csv(file_path)
df['service_date'] = pd.to_datetime(df['service_date'], format='%m/%d/%Y')
df = df.drop(columns=['day_type', 'bus', 'rail_boardings'])
cutoff_date = '2019-12-31'
df = df[df['service_date'] <= cutoff_date]
df = df.sort_values(by='service_date').reset_index(drop=True)
# Split the data into training and test sets: dates up to and including 2015-12-31 versus later dates.
# .copy() gives independent DataFrames, so the assignments below do not trigger a SettingWithCopyWarning.
train_data = df[df['service_date'] <= '2015-12-31'].copy()
test_data = df[df['service_date'] > '2015-12-31'].copy()

# Standardize the data; the scaler is fitted on the training data only
scaler = StandardScaler()
train_data['total_rides'] = scaler.fit_transform(train_data[['total_rides']])
test_data['total_rides'] = scaler.transform(test_data[['total_rides']])

# TRAINING DATA - single
X_train_s = []
y_train_s = []

for i in range(7, len(train_data)):
    X_train_s.append(train_data['total_rides'].iloc[i-7:i].values)
    y_train_s.append(train_data['total_rides'].iloc[i])

# TEST DATA - single
X_test_s = []
y_test_s = []

for i in range(7, len(test_data)):
    X_test_s.append(test_data['total_rides'].iloc[i-7:i].values)
    y_test_s.append(test_data['total_rides'].iloc[i])

# TRAINING DATA - multi
X_train_m = []
y_train_m = []

for i in range(14, len(train_data)-7):
    X_train_m.append(train_data['total_rides'].iloc[i-14:i].values)
    y_train_m.append(train_data['total_rides'].iloc[i:i+7].values)

# TEST DATA - multi
X_test_m = []
y_test_m = []

for i in range(14, len(test_data)-7):
    X_test_m.append(test_data['total_rides'].iloc[i-14:i].values)
    y_test_m.append(test_data['total_rides'].iloc[i:i+7].values)

# BELOW ARE THE VARIABLES WE WILL USE LATER ON
# For the single predictions
X_train_SinglePred = np.array(X_train_s)
y_train_SinglePred = np.array(y_train_s)
X_test_SinglePred = np.array(X_test_s)
y_test_SinglePred = np.array(y_test_s)

# For the series (multi) predictions
X_train_MultiPred = np.array(X_train_m)
y_train_MultiPred = np.array(y_train_m)
X_test_MultiPred = np.array(X_test_m)
y_test_MultiPred = np.array(y_test_m)

print(len(X_train_SinglePred))
print(len(X_test_SinglePred))
print(len(X_train_MultiPred))
print(len(X_test_MultiPred))


5533
1454
5519
1440
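
The windowing loops above can also be written in vectorized form. The following is a minimal sketch, assuming NumPy 1.20 or newer (for `sliding_window_view`); `make_windows` is a helper name made up for this illustration, and the series is a stand-in, not the CTA data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, n_in, n_out):
    """Return (X, y): n_in past values as inputs, the next n_out values as targets."""
    windows = sliding_window_view(np.asarray(series), n_in + n_out)
    X = windows[:, :n_in].copy()
    y = windows[:, n_in:].copy()
    if n_out == 1:
        y = y[:, 0]  # 1-D target vector, like y_train_SinglePred
    return X, y

# Stand-in series for illustration
series = np.arange(30, dtype=float)

X_single, y_single = make_windows(series, n_in=7, n_out=1)
X_multi, y_multi = make_windows(series, n_in=14, n_out=7)

print(X_single.shape, y_single.shape)  # (23, 7) (23,)
print(X_multi.shape, y_multi.shape)    # (10, 14) (10, 7)
```

For the single prediction this yields the same number of samples as the loop. For the multi prediction it yields one more sample, because `range(14, len(data)-7)` stops one step before the last complete window.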


Now that we have prepared the data, we can start creating our "less simple" neural nets, using LSTM and GRU layers. Since we are using Keras, using them is as simple as importing another layer. Here are the imports we will need:

In [28]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input, SimpleRNN, LSTM, GRU
from tensorflow.keras.models import Model
from sklearn.metrics import root_mean_squared_error

### GRU Single Prediction

Now let us build our first GRU model to predict the single forecast, based on a sequence. 

For this we will use the regular Sequential API. We first reshape the inputs again so that it is explicit that we only have one input feature, then feed them into a GRU layer. In the GRU layer we use 20 GRU cells, the tanh activation (because that is the one used in the original GRU), and return_sequences set to False. This is actually the default value for that parameter, so you could also omit it. We won't add any additional layers, just a single-unit output layer without an activation function.

You can also already compile it using adam and the mean squared error.
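
To see where the tanh activation enters a GRU cell, here is a minimal NumPy sketch of a single GRU step, in the same spirit as the manual LSTM pass in Part 1. It follows one common textbook formulation; the weights are made up for illustration, the helper name `gru_step` is hypothetical, and the Keras implementation may differ in details (for example in where exactly the reset gate is applied and in which branch the update gate weights).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step; each W_* has shape (n_inputs + n_hidden, n_hidden)."""
    xh = np.concatenate((x_t, h_prev), axis=1)
    z = sigmoid(np.dot(xh, W_z) + b_z)            # update gate
    r = sigmoid(np.dot(xh, W_r) + b_r)            # reset gate
    xrh = np.concatenate((x_t, r * h_prev), axis=1)
    h_cand = np.tanh(np.dot(xrh, W_h) + b_h)      # candidate state (tanh enters here)
    return z * h_prev + (1.0 - z) * h_cand        # blend old state and candidate

# Made-up weights for 1 input feature and 1 hidden unit
W_z = np.array([[0.2], [0.1]])
W_r = np.array([[0.4], [-0.2]])
W_h = np.array([[-0.3], [0.5]])
b_z = np.array([[0.1]])
b_r = np.array([[0.2]])
b_h = np.array([[-0.1]])

h = np.array([[0.0]])
for x_t in (np.array([[0.5]]), np.array([[0.8]])):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)

print(h)
```

Unlike the LSTM, the GRU keeps only one state vector (h), which is why it needs three weight blocks instead of four.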

In [33]:
# Reshape the X matrices to be compatible with RNN input (samples, time steps, features)
X_train_single_gru_lstm = X_train_SinglePred.reshape((X_train_SinglePred.shape[0], X_train_SinglePred.shape[1], 1))
X_test_single_gru_lstm = X_test_SinglePred.reshape((X_test_SinglePred.shape[0], X_test_SinglePred.shape[1], 1))

# Create the GRU model
# Keras recommends an explicit Input layer over passing input_shape to the first layer
model_single_gru = Sequential()
model_single_gru.add(Input(shape=(X_train_single_gru_lstm.shape[1], X_train_single_gru_lstm.shape[2])))
model_single_gru.add(GRU(20, activation='tanh', return_sequences=False))
model_single_gru.add(Dense(1))


# Compile the model
model_single_gru.compile(optimizer='adam', loss='mse')


Train the model for 20 epochs, with a batch size of 32 and a validation split of 0.2. This is the same configuration we used for the SimpleRNN.

In [34]:
model_single_gru.fit(X_train_single_gru_lstm, y_train_SinglePred, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.8618 - val_loss: 0.5440
Epoch 2/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.4489 - val_loss: 0.2635
Epoch 3/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2322 - val_loss: 0.2550
Epoch 4/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2013 - val_loss: 0.2487
Epoch 5/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2201 - val_loss: 0.2426
Epoch 6/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2106 - val_loss: 0.2472
Epoch 7/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1956 - val_loss: 0.2321
Epoch 8/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.1831 - val_loss: 0.2343
Epoch 9/20
139/139 ━━━━━━━━

<keras.src.callbacks.history.History at 0x145ed49f230>
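
The 139 steps per epoch shown in the log follow from the sample count, the validation split and the batch size. The sketch below reproduces the arithmetic; it assumes that the split point is the floor of n * (1 - validation_split) and relies on the documented Keras behavior that the validation samples are taken from the end of the arrays, before shuffling.

```python
import math

n_samples = 5533          # len(X_train_SinglePred), printed earlier
validation_split = 0.2
batch_size = 32

# Assumption: split point is floor(n * (1 - validation_split));
# the last n_val samples are held out for validation
n_train = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)  # 4426 1107 139
```

For time series this is convenient: because the samples are in chronological order, the validation set consists of the most recent 20% of the training period.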

In [35]:
# Predict and evaluate on test data
y_single_gru_pred = model_single_gru.predict(X_test_single_gru_lstm)

# Calculate RMSE for the single prediction GRU
rmse_single_gru = root_mean_squared_error(y_test_SinglePred, y_single_gru_pred)
print("Single Prediction GRU Model RMSE:", rmse_single_gru)

46/46 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Single Prediction GRU Model RMSE: 0.3844938255145831


The performance might not be too impressive compared to the regular RNN in this example. My assumption is that this architecture is a bit too simple for this problem. Further training, or a more complicated architecture, might change this.

### LSTM single prediction
Let us try to create a single prediction with an LSTM.

Same approach: we will reuse the reshaped X matrices from the GRU exercise.

Again, use an LSTM layer with 20 LSTM cells, the tanh activation function and return_sequences=False.


In [36]:
# Create the LSTM model
# Keras recommends an explicit Input layer over passing input_shape to the first layer
model_single_lstm = Sequential()
model_single_lstm.add(Input(shape=(X_train_single_gru_lstm.shape[1], X_train_single_gru_lstm.shape[2])))
model_single_lstm.add(LSTM(20, activation='tanh', return_sequences=False))
model_single_lstm.add(Dense(1))

# Compile the model
model_single_lstm.compile(optimizer='adam', loss='mse')


Again, train the model for 20 epochs, Batch size 32 and validation split of 0.2. 

In [37]:
model_single_lstm.fit(X_train_single_gru_lstm, y_train_SinglePred, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.9753 - val_loss: 0.4068
Epoch 2/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2980 - val_loss: 0.2934
Epoch 3/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2541 - val_loss: 0.2807
Epoch 4/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2483 - val_loss: 0.2743
Epoch 5/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2139 - val_loss: 0.2654
Epoch 6/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2226 - val_loss: 0.2742
Epoch 7/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2407 - val_loss: 0.2528
Epoch 8/20
139/139 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.2350 - val_loss: 0.2496
Epoch 9/20
139/139 ━━━━━━━━

<keras.src.callbacks.history.History at 0x145f13e6c00>

In [38]:
# Predict and evaluate on test data
y_single_lstm_pred = model_single_lstm.predict(X_test_single_gru_lstm)

# Calculate RMSE for the single prediction LSTM
rmse_single_lstm = root_mean_squared_error(y_test_SinglePred, y_single_lstm_pred)
print("Single Prediction LSTM Model RMSE:", rmse_single_lstm)

46/46 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Single Prediction LSTM Model RMSE: 0.38247860472619744


In my case, the LSTM also did not seem to work very well for this example. However, hopefully you noticed how easy it was to exchange the layers in these examples; Keras and TensorFlow take care of the complications behind the scenes, and we only need to swap a single layer.
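
Although swapping the layer is a one-line change, the three recurrent layers differ in how many weights they train. The following back-of-the-envelope sketch counts the parameters of the recurrent layer only (the Dense output layer is not included). The function names are made up for this illustration, and the GRU count assumes the Keras default variant with separate input and recurrent biases; you can compare the numbers against `model.summary()`.

```python
def simple_rnn_params(n_units, n_features):
    # One block: input weights + recurrent weights + bias
    return n_units * (n_features + n_units) + n_units

def lstm_params(n_units, n_features):
    # Four blocks (forget, input, candidate, output), each sized like a SimpleRNN
    return 4 * simple_rnn_params(n_units, n_features)

def gru_params(n_units, n_features, two_biases=True):
    # Three blocks (update, reset, candidate); assumption: the Keras default
    # keeps separate input and recurrent bias vectors per block
    bias = 2 * n_units if two_biases else n_units
    return 3 * (n_units * (n_features + n_units) + bias)

# 20 units and 1 input feature, as in the models above
print(simple_rnn_params(20, 1), gru_params(20, 1), lstm_params(20, 1))  # 440 1380 1760
```

With so few input features, the gated layers have roughly three to four times as many parameters as the SimpleRNN, which is one reason they may need more training to pay off.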

Let us try with the sequence to sequence problem we defined before, perhaps the LSTM will perform better there. 

### LSTM Series prediction
We will now evaluate the LSTM on the series model, to see how it works. 

Like before, we want to predict the next 7 values, based on the last 14. This is a sequence to sequence task, and will be modeled by using the encoder - decoder architecture. This will follow the same pattern as the first exercise, except this time we will use LSTMs instead of SimpleRNNs. 

Start by defining the encoder. 

One difference with the regular RNN is that we do not connect the encoder and the decoder through the encoder's output. Instead, we pass both the final hidden state and the cell state (the context) of the encoder to the decoder as its initial state.

Fill in the blanks:

In [46]:
# Define the input shape for the encoder (14 days of data, 1 feature)
encoder_inputs = Input(shape=(14, 1))

# Encoder LSTM: process the input sequence and return the final output plus the final hidden state and cell state
encoder_lstm = LSTM(20, activation='tanh', return_sequences=False, return_state=True)
encoder_output, state_h, state_c = encoder_lstm(encoder_inputs)

Next, define the decoder. We will mostly reuse the code from before, except that we will use LSTM layers.

In [41]:
decoder_inputs = Input(shape=(7, 1))  # Here we will input the 7 zeroes

# I will also already provide dummy values for the training data
decoder_input_train = tf.zeros((X_train_MultiPred.shape[0], 7, 1))

# LSTM RNN Layer of decoder: 
decoder_lstm = LSTM(20, activation='tanh', return_sequences=True, return_state=False)
decoder_lstm_outputs = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c]) # in this case, we forward the hidden state and state c from the encoder to the decoder
# Final dense layer of decoder. 
decoder_dense = Dense(1) # this defines the output layer of the network
decoder_final_outputs = decoder_dense(decoder_lstm_outputs)
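
To make the data flow between encoder and decoder concrete, here is a minimal NumPy sketch of the same idea for a single sample. It is not what Keras runs internally, only an illustration under stated assumptions: the weights are random stand-ins, `lstm_step` is a helper defined here, and the gate order (forget, input, candidate, output) follows Part 1 of this tutorial.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_t, c_t, W, B):
    """One LSTM step; W has shape (n_inputs + n_hidden, 4 * n_hidden)."""
    z = np.dot(np.concatenate((x_t, h_t), axis=1), W) + B
    z_f, z_i, z_c, z_o = np.split(z, 4, axis=1)
    c_next = sigmoid(z_f) * c_t + sigmoid(z_i) * np.tanh(z_c)
    h_next = sigmoid(z_o) * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
n_hidden = 20

# Random stand-in weights; encoder and decoder have separate parameters
W_enc = rng.normal(scale=0.1, size=(1 + n_hidden, 4 * n_hidden))
B_enc = np.zeros((1, 4 * n_hidden))
W_dec = rng.normal(scale=0.1, size=(1 + n_hidden, 4 * n_hidden))
B_dec = np.zeros((1, 4 * n_hidden))
W_out = rng.normal(scale=0.1, size=(n_hidden, 1))
b_out = np.zeros((1, 1))

encoder_sequence = rng.normal(size=(14, 1))  # 14 past values, 1 feature
decoder_sequence = np.zeros((7, 1))          # 7 dummy zero inputs

# Encoder: run through the 14 steps and keep only the final states
h = np.zeros((1, n_hidden))
c = np.zeros((1, n_hidden))
for x_t in encoder_sequence:
    h, c = lstm_step(x_t.reshape(1, 1), h, c, W_enc, B_enc)

# Decoder: start from the encoder's (h, c) and emit one prediction per step
predictions = []
for x_t in decoder_sequence:
    h, c = lstm_step(x_t.reshape(1, 1), h, c, W_dec, B_dec)
    predictions.append(np.dot(h, W_out) + b_out)

predictions = np.concatenate(predictions, axis=0)
print(predictions.shape)  # (7, 1)
```

Since the decoder inputs are all zeros, everything the decoder knows about the past reaches it through the two state vectors handed over by the encoder.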

Now, same as in the first exercise, we define the Model as the combination of the encoder and decoder. To define it, we need to provide the inputs and outputs of the elements we just defined. 

Inputs: 
- Encoder inputs
- Decoder inputs (7 zeroes)

Outputs: 
- Decoder output (7 predictions)

Define the model, then compile it as before, with optimizer adam and loss function mse.

In [42]:
# Define the model with encoder and decoder
model_series_lstm = Model([encoder_inputs, decoder_inputs], decoder_final_outputs)

# Compile the model
model_series_lstm.compile(optimizer='adam', loss='mse')

Now let us train the model, using 20 Epochs, Batch size 32, validation split 0.2.

In [43]:
model_series_lstm.fit([X_train_MultiPred, decoder_input_train], y_train_MultiPred, epochs=20, batch_size=32, validation_split=0.2)


Epoch 1/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - loss: 0.9717 - val_loss: 0.7915
Epoch 2/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.8003 - val_loss: 0.7241
Epoch 3/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.6579 - val_loss: 0.5471
Epoch 4/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.4568 - val_loss: 0.4378
Epoch 5/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.3804 - val_loss: 0.3981
Epoch 6/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.3363 - val_loss: 0.3726
Epoch 7/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.2989 - val_loss: 0.3572
Epoch 8/20
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.2895 - val_loss: 0.3494
Epoch 9/20
138/138 ━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x145f27fd2b0>

Evaluate the model.

In [44]:
# Generate dummy values for the test data: 
decoder_input_test = tf.zeros((X_test_MultiPred.shape[0], 7, 1))

# Predict and evaluate on test data
y_series_lstm_pred = model_series_lstm.predict([X_test_MultiPred, decoder_input_test])

# Remove the last dimension to get the desired shape
y_series_lstm_pred = np.squeeze(y_series_lstm_pred, axis=-1)

# Calculate RMSE for the series prediction LSTM
rmse_series_lstm = root_mean_squared_error(y_test_MultiPred, y_series_lstm_pred)
print("Series Prediction LSTM Model RMSE:", rmse_series_lstm)

45/45 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Series Prediction LSTM Model RMSE: 0.44198150402846453


Congratulations, you have now also trained an LSTM model on the series data. In my case, the performance was worse than for the regular RNN. Perhaps this is caused by the problem definition; perhaps we could improve performance by changing the architecture of the LSTM model.
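
A single RMSE number hides how the error is distributed over the 7 forecast days. A minimal sketch of a per-day breakdown is shown below; it uses small random stand-in arrays rather than the tutorial's predictions, so the printed numbers are only illustrative. As far as I know, `root_mean_squared_error` with its default settings reports the average of these per-column values for 2-D targets.

```python
import numpy as np

# Stand-in arrays with shape (samples, 7); not the tutorial's predictions
rng = np.random.default_rng(0)
y_true = rng.normal(size=(100, 7))
# Simulate errors that grow with the forecast horizon
y_pred = y_true + rng.normal(size=(100, 7)) * np.linspace(0.1, 0.7, 7)

# One RMSE per forecast day (column)
rmse_per_day = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

print(np.round(rmse_per_day, 3))
print("Mean over days:", rmse_per_day.mean())
```

To apply this to the model above, replace the stand-in arrays with `y_test_MultiPred` and `y_series_lstm_pred`.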

For the purpose of this tutorial, that doesn't matter: the goal was to create a working encoder - decoder architecture, and that seems to work. Combined with proper time series modeling (which will be taught later), LSTMs can be powerful tools. They also excel at NLP tasks, as you will see in the next semester.

Feel free to experiment and extend this model for your own projects. 

You are now ready to try and tackle the RNN / LSTM Assignment. Good luck!