# Notebook

In the Model learning step, the prepared dataset from [2_EDA](https://github.com/Rudinius/Bike_usage_Bremen/blob/57e21c8dd687aadc1498f82241cf662840c8b871/2_EDA.ipynb) is loaded. Then different machine learning algorithms are trained and compared to each other.

<a name="content"></a>
# Content 

* [1. Import libraries](#1)
* [2. Import dataset](#2)
* [3. Transform columns](#3)
* [4. Establish baseline benchmark](#4)
* [5. Training machine learning algorithms](#5)
    * [5.2. XGBoost](#5.2.)
    * [5.3. Multilayer perceptron](#5.3.)
    * [5.4. Recurrent Neural Network](#5.4.)

<a name="1"></a>
# 1.&nbsp;Import libraries
[Content](#content)

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import random
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split, TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import RootMeanSquaredError

In [2]:
# Install package pyjanitor since it is not part of the standard packages
# of Google Colab

import importlib.util

# Check if the package is installed. The PyPI package is called "pyjanitor",
# but the importable module is called "janitor".
package_name = "pyjanitor"
module_name = "janitor"
spec = importlib.util.find_spec(module_name)
if spec is None:
    # Package is not installed, install it via pip
    !pip install pyjanitor
else:
    print(f"{package_name} is already installed")

import janitor

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyjanitor
  Downloading pyjanitor-0.24.0-py3-none-any.whl (158 kB)
     158.4/158.4 kB 4.0 MB/s eta 0:00:00
Collecting pandas-flavor
  Downloading pandas_flavor-0.5.0-py3-none-any.whl (7.1 kB)
Installing collected packages: pandas-flavor, pyjanitor
Successfully installed pandas-flavor-0.5.0 pyjanitor-0.24.0


<a name="2"></a>
# 2.&nbsp;Import dataset
[Content](#content)

Next, we will import the processed dataset from [2_EDA](../Bike_usage_Bremen/2_EDA.ipynb).

In [3]:
# Set base url
url = "https://raw.githubusercontent.com/Rudinius/Bike_usage_Bremen/main/data/"



In [4]:
# Import dataset

# We will also parse the date column as datetime64 and set it to the index column
df = pd.read_csv(url + "03_training_data/" + "2023-04-21_df_full.csv", 
                         parse_dates=[0], index_col=[0])

# Check the correct loading of dataset
df.head()



date,weekday,graf_moltke_straße_ostseite,graf_moltke_straße_westseite,hastedter_bruckenstraße,langemarckstraße_ostseite,langemarckstraße_westseite,osterdeich,radweg_kleine_weser,schwachhauser_ring,wachmannstraße_auswarts_sud,...,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun,holiday,vacation
2013-01-01,1,261.0,290.0,381.0,312.0,308.0,870.0,410.0,391.0,514.0,...,9.1,6.9,0.0,233.0,19.4,50.4,1001.8,0,Neujahr,Weihnachtsferien
2013-01-02,2,750.0,876.0,1109.0,1258.0,1120.0,2169.0,1762.0,829.0,1786.0,...,7.1,1.8,0.0,246.0,20.2,40.0,1017.5,30,,Weihnachtsferien
2013-01-03,3,931.0,1015.0,1603.0,1556.0,1480.0,2295.0,2287.0,1196.0,2412.0,...,10.6,0.9,0.0,257.0,23.8,45.7,1024.5,0,,Weihnachtsferien
2013-01-04,4,500.0,587.0,1284.0,703.0,626.0,1640.0,1548.0,1418.0,964.0,...,9.7,0.0,0.0,276.0,25.2,48.2,1029.5,0,,Weihnachtsferien
2013-01-05,5,1013.0,1011.0,1284.0,1856.0,1621.0,4128.0,4256.0,3075.0,2065.0,...,8.6,0.1,0.0,293.0,20.2,41.0,1029.9,0,,Weihnachtsferien


<a name="3"></a>
# 3. Transform columns
[Content](#content)

We need to transform the categorical columns `holiday` and `vacation` into numerical columns using one-hot encoding. Then we drop the original columns.

In [5]:
# Use One Hot Encoder
OH_encoder = OneHotEncoder()
transformed_array = OH_encoder.fit_transform(df.loc[:,"holiday": "vacation"]).toarray()
df_holiday_vacation_transformed = pd.DataFrame(transformed_array, 
                              columns=OH_encoder.get_feature_names_out(), 
                              index = df.index)              
# Drop the columns with Holiday_nan and Vacation_nan as those hold no additional value          
df_holiday_vacation_transformed = df_holiday_vacation_transformed.drop(["holiday_nan", "vacation_nan"], axis=1)

# Drop the old categorical columns Holiday and Vacation
df_transformed = df.drop(["holiday", "vacation"], axis=1)

# Add the new columns from OHE
df_transformed = pd.concat([df_transformed, df_holiday_vacation_transformed], axis=1)



We will add the year, month and day as separate columns to give the algorithm the chance to pick up more granular and seasonal patterns.

In [6]:
df_date = pd.DataFrame(data = {
    "year": df.index.year,
    "month": df.index.month,
    "day": df.index.day
}, index=pd.to_datetime(df.index.values))
    
df_transformed_date = (pd.concat([df_date, df_transformed], axis=1)
                        .clean_names(strip_underscores="both"))

# Check dataframe
df_transformed_date.head()



date,year,month,day,weekday,graf_moltke_straße_ostseite,graf_moltke_straße_westseite,hastedter_bruckenstraße,langemarckstraße_ostseite,langemarckstraße_westseite,osterdeich,...,holiday_pfingstmontag,holiday_reformationstag,holiday_tag_der_arbeit,holiday_tag_der_deutschen_einheit,vacation_herbstferien,vacation_osterferien,vacation_pfingstferien,vacation_sommerferien,vacation_weihnachtsferien,vacation_winterferien
2013-01-01,2013,1,1,1,261.0,290.0,381.0,312.0,308.0,870.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2013-01-02,2013,1,2,2,750.0,876.0,1109.0,1258.0,1120.0,2169.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2013-01-03,2013,1,3,3,931.0,1015.0,1603.0,1556.0,1480.0,2295.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2013-01-04,2013,1,4,4,500.0,587.0,1284.0,703.0,626.0,1640.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2013-01-05,2013,1,5,5,1013.0,1011.0,1284.0,1856.0,1621.0,4128.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Now, after all those transformations, we have our final dataset to train our machine learning algorithms on.

<a name="4"></a>
# 4. Establish baseline benchmark
[Content](#content)


For our current task of creating a model to predict the number of cyclists on a given day, we do not have an existing baseline accuracy to measure our model against. 
For this reason, we create a naive baseline model that simply predicts the count for a day as the value of the previous day.

In [7]:
# Evaluate the naive baseline

# Select the `total` column as both the actual values and the predictions
y_test = preds = df.loc[:,"total"]

rmse = 0
length = y_test.shape[0]

# Loop from the first to the second-to-last entry, since the second-to-last
# entry is the last one that can be used to predict its successor
for i in range(length-1):
    # mean_squared_error expects array-like input, therefore we pass
    # one-element slices. For a single pair, the RMSE equals the absolute
    # error, so the average below is the mean absolute error of the baseline.
    rmse += np.sqrt(mean_squared_error(y_test[i+1:i+2], preds[i:i+1]))

# Divide the accumulated value by the number of pairs
rmse = rmse / (length-1)
print("RMSE: %f" % (rmse))



RMSE: 9083.225418


If we naively predict each day's value with the value of the previous day, we get an error over the entire dataset of approximately $9,083$. 

This is our naive benchmark to compare our model against. 
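
Averaging the RMSE of single pairs, as the loop above does, yields the mean absolute error, because the RMSE of one pair is simply its absolute difference. The following minimal sketch computes both metrics for a previous-day forecast in vectorized form. The helper name `naive_lag1_errors` and the toy counts are assumptions for illustration and are not part of the notebook.

```python
import numpy as np


def naive_lag1_errors(series):
    """Return (rmse, mae) when each value is predicted by its predecessor."""
    values = np.asarray(series, dtype=float)
    actual = values[1:]      # day t
    predicted = values[:-1]  # day t-1, used as the forecast for day t
    diff = actual - predicted
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae


# Toy daily counts (hypothetical values)
toy_counts = [100.0, 120.0, 90.0, 150.0]
rmse, mae = naive_lag1_errors(toy_counts)
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

On the real data, this could be called as `naive_lag1_errors(df["total"])`. The RMSE is never smaller than the MAE, so a benchmark expressed as a true RMSE would be at least as large as the averaged value printed above.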

<a name="5"></a>
# 5. Training machine learning algorithms
[Content](#content)

In [8]:
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Non-seasonal orders
p, d, q = 2, 1, 2
# Seasonal orders (currently unused, see the model definition below)
P, D, Q, m = 2, 1, 2, 365

# This cell runs before the train/test split of section 5.2, so it needs its own split.
# Assumption: a chronological 80/20 split of the target series.
y = df_transformed_date["total"]
split_index = int(len(y) * 0.8)
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

# Define the model (the seasonal part is disabled)
model = SARIMAX(endog=y_train, order=(p, d, q)) #, seasonal_order=(P, D, Q, m)

# Fit the model to the training data
fit_model = model.fit()

# Use the fitted model to forecast the length of the test period
y_pred = fit_model.forecast(steps=len(y_test))

# Calculate the RMSE between the predicted and actual values
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the RMSE
print(f"RMSE: {rmse}")

<a name="5.2."></a>
## 5.2. XGBoost
[Content](#content)

Specifically for the XGBoost algorithm, we add the values of the previous time steps as features to the feature vector. For RNNs, this will probably not be necessary, as RNNs have built-in memory units that allow them to store information from previous steps.

In this case, we only add the last 3 values, as the improvement in RMSE decreases sharply with each further time step added beyond 3.

Test RMSE and relative improvement per added lagged day:
* 1 day: 6291 (2.8 %)
* 2 days: 6157 (2.1 %)
* 3 days: 6120 (0.6 %)
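
The percentages in the list can be read as the relative RMSE reduction compared with the previous row. A small sketch of that arithmetic follows. The function name `relative_improvement` is a hypothetical helper, and the RMSE without any lag feature is not listed here, so only the second and third percentages are reproduced.

```python
def relative_improvement(previous_rmse, new_rmse):
    """Relative RMSE reduction in percent when going from previous_rmse to new_rmse."""
    return (previous_rmse - new_rmse) / previous_rmse * 100.0


# RMSE values for 1, 2 and 3 lagged days, taken from the list above
rmse_by_lags = [6291, 6157, 6120]

improvements = [
    round(relative_improvement(previous, new), 1)
    for previous, new in zip(rmse_by_lags, rmse_by_lags[1:])
]
print(improvements)
```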

In [12]:
# Create empty new dataframe
df_lagged_days = pd.DataFrame({})

# Select the number of lagged days
go_back_x_days = 3

for i in range(go_back_x_days):
    # Shift the values in "Total" by i and assign to new column prev_Total_i+1
    df_lagged_days[f'prev_total_{i+1}'] = df_transformed_date['total'].shift(i+1)


# Concat new dataframe with old dataframe
# Using bfill strategy on dataset since the first few days will have NaN values
# Using pyjanitor to clean up names
df_transformed_date_lagged = (pd.concat([df_transformed_date, df_lagged_days], axis=1)
                              .bfill()
                              .clean_names(strip_underscores="both"))

# Check output
df_transformed_date_lagged



date,year,month,day,weekday,graf_moltke_straße_ostseite,graf_moltke_straße_westseite,hastedter_bruckenstraße,langemarckstraße_ostseite,langemarckstraße_westseite,osterdeich,...,holiday_tag_der_deutschen_einheit,vacation_herbstferien,vacation_osterferien,vacation_pfingstferien,vacation_sommerferien,vacation_weihnachtsferien,vacation_winterferien,prev_total_1,prev_total_2,prev_total_3
2013-01-01,2013,1,1,1,261.0,290.0,381.0,312.0,308.0,870.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,5795.0,5795.0,5795.0
2013-01-02,2013,1,2,2,750.0,876.0,1109.0,1258.0,1120.0,2169.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,5795.0,5795.0,5795.0
2013-01-03,2013,1,3,3,931.0,1015.0,1603.0,1556.0,1480.0,2295.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,19494.0,5795.0,5795.0
2013-01-04,2013,1,4,4,500.0,587.0,1284.0,703.0,626.0,1640.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,24851.0,19494.0,5795.0
2013-01-05,2013,1,5,5,1013.0,1011.0,1284.0,1856.0,1621.0,4128.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,13475.0,24851.0,19494.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-27,2022,12,27,1,693.0,612.0,1495.0,1062.0,915.0,2123.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9520.0,8067.0,11206.0
2022-12-28,2022,12,28,2,643.0,585.0,1076.0,884.0,820.0,1819.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,19231.0,9520.0,8067.0
2022-12-29,2022,12,29,3,654.0,648.0,1076.0,1014.0,907.0,2013.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,16241.0,19231.0,9520.0
2022-12-30,2022,12,30,4,757.0,665.0,1076.0,1106.0,976.0,2088.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,18006.0,16241.0,19231.0


In [42]:
# Load the data into a pandas dataframe
data = df_transformed_date_lagged
# Define the features and target
# Highly correlated features have been removed (tavg, tmin, wpgt)
# `wdir` has been removed because it shows no correlation with the target

features = ['year', 'month', 'day', 'weekday', 'tmax', 'prcp', 
            'snow', 'wspd', 'pres', 'tsun', 
            'holiday_1_weihnachtsfeiertag', 'holiday_2_weihnachtsfeiertag', 
            'holiday_christi_himmelfahrt', 'holiday_karfreitag', 'holiday_neujahr', 
            'holiday_ostermontag', 'holiday_pfingstmontag', 'holiday_reformationstag', 
            'holiday_tag_der_arbeit', 'holiday_tag_der_deutschen_einheit', 
            'vacation_herbstferien', 'vacation_osterferien', 'vacation_pfingstferien', 
            'vacation_sommerferien', 'vacation_weihnachtsferien', 'vacation_winterferien',
            'prev_total_1', 'prev_total_2', 'prev_total_3']

target = 'total'

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], 
                                                    test_size=0.2, shuffle=True, random_state=0)





Finally, we standardize our dataset. Standardization generally improves the learning speed of the models and can help to improve their accuracy.

In [44]:
# Standardize and fit to the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same standardization to the test set
X_test_scaled = scaler.transform(X_test)


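
What the scaler does in the cell above can be sketched in plain NumPy: the mean and standard deviation are estimated on the training set only and then reused for the test set, so no information from the test set enters the transformation. The helper names `fit_standardizer` and `apply_standardizer` and the toy arrays are assumptions for illustration.

```python
import numpy as np


def fit_standardizer(train):
    """Estimate per-column mean and standard deviation on the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Avoid division by zero for constant columns
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def apply_standardizer(data, mean, std):
    """Scale data with statistics that were estimated elsewhere."""
    return (data - mean) / std


# Toy feature matrices (hypothetical values)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)
print(train_scaled)
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the scaled test values generally do not, since they are expressed in units of the training distribution.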



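We use `GridSearchCV` to search for good hyperparameters. Note that some values used in the cells further below (`max_depth = 12`, `reg_lambda = 5.0`) lie outside this grid, so they do not come directly from this search.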

In [None]:
params = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'n_estimators': [700, 1000, 1200],
    'subsample': [0.5, 0.8, 1.0],
    'colsample_bytree': [0.5, 0.8, 1.0],
    'reg_alpha': [0.0, 0.5, 1.0],
    'reg_lambda': [0.8, 1.0, 2.0],
    'min_child_weight': [1, 5, 10]
}

xg_reg = xgb.XGBRegressor(objective='reg:squarederror')

grid_search = GridSearchCV(xg_reg, param_grid=params, cv=5, n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
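
Before starting an exhaustive search, it can help to estimate its cost. The sketch below counts the parameter combinations of the grid above and the resulting number of model fits for 5-fold cross-validation. It only does the arithmetic and does not train anything.

```python
from math import prod

# Same grid as in the cell above
params = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'n_estimators': [700, 1000, 1200],
    'subsample': [0.5, 0.8, 1.0],
    'colsample_bytree': [0.5, 0.8, 1.0],
    'reg_alpha': [0.0, 0.5, 1.0],
    'reg_lambda': [0.8, 1.0, 2.0],
    'min_child_weight': [1, 5, 10]
}

# Number of combinations is the product of the number of values per parameter
n_candidates = prod(len(values) for values in params.values())
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With 6,561 candidates and 32,805 fits of up to 1,200 trees each, a coarser grid or a randomized search may be more practical. Also note that `GridSearchCV` ranks candidates with the estimator's default score (R² for regressors) unless `scoring` is set, for example to `"neg_root_mean_squared_error"`, which would match the RMSE used elsewhere in this notebook.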

In [30]:
%%time
### SCALED ###
# Build the XGBoost regressor model with selected hyper parameters
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 1.0, learning_rate = 0.01, 
                          max_depth = 12, n_estimators = 1000, reg_alpha = 0.5, reg_lambda = 5.0, 
                          subsample = 1.0, min_child_weight=5)
#print(X_train_scaled.shape)
#print(y_train.shape)
xg_reg.fit(X_train_scaled, y_train)



# Evaluate the model's performance using RMSE
# Predict on the train set
y_preds_train = xg_reg.predict(X_train_scaled)
print(y_preds_train.shape)
print(y_preds_train)

# Training set
rmse_train = np.sqrt(mean_squared_error(y_train, y_preds_train))
print("Train RMSE: %f" % (rmse_train))

# Predict on the test set
y_preds_test = xg_reg.predict(X_test_scaled)
# Test set
rmse_test = np.sqrt(mean_squared_error(y_test, y_preds_test))
print("Test RMSE: %f" % (rmse_test))



(2921,)
[15197.027 33466.727 15071.921 ... 28035.055 32195.223 57779.17 ]
Train RMSE: 1216.824632
Test RMSE: 6054.573112
CPU times: user 37.3 s, sys: 451 ms, total: 37.7 s
Wall time: 27.2 s


In [31]:
%%time
# Build the XGBoost regressor model with selected hyper parameters
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 1.0, learning_rate = 0.01, 
                          max_depth = 12, n_estimators = 1000, reg_alpha = 0.5, reg_lambda = 5.0, 
                          subsample = 1.0, min_child_weight=5)

xg_reg.fit(X_train, y_train)



# Evaluate the model's performance using RMSE
# Predict on the train set
preds_train = xg_reg.predict(X_train)
print(preds_train)
print(preds_train.shape)

# Training set
rmse_train = np.sqrt(mean_squared_error(y_train, preds_train))
print("Train RMSE: %f" % (rmse_train))

# Predict on the test set
preds_test = xg_reg.predict(X_test)

# Test set
rmse_test = np.sqrt(mean_squared_error(y_test, preds_test))
print("Test RMSE: %f" % (rmse_test))



[15197.027 33466.727 15071.921 ... 28035.055 32195.223 57779.17 ]
(2921,)
Train RMSE: 1216.824632
Test RMSE: 6054.719014
CPU times: user 37.8 s, sys: 490 ms, total: 38.3 s
Wall time: 28.2 s


Test RMSE by run date:
* 02.04.23: 6468
* 10.04.23: 6120

As an alternative, instead of `train_test_split`, we try `TimeSeriesSplit` and compare the performance.

In [None]:
# Split the data into training and test sets
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(data):
    X_train, X_test = data.iloc[train_index][features], data.iloc[test_index][features]
    y_train, y_test = data.iloc[train_index][target], data.iloc[test_index][target]

    xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=1.0, learning_rate=0.01,
                              max_depth=12, n_estimators=1000, reg_alpha=0.1, reg_lambda=1.0,
                              subsample=0.8, min_child_weight=5)

    xg_reg.fit(X_train, y_train)

    # Predict on the test set
    preds = xg_reg.predict(X_test)

    # Evaluate the model's performance using RMSE
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print("RMSE: %f" % (rmse))


RMSE: 7127.157045
RMSE: 6843.915770
RMSE: 6808.748428
RMSE: 7121.689867
RMSE: 6898.040513


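On the test set, the shuffled `train_test_split` gives a lower RMSE (about 6,055) than the `TimeSeriesSplit` folds (between about 6,809 and 7,127, averaging about 6,960). The two setups are not directly comparable, however: the `TimeSeriesSplit` run uses slightly different hyperparameters (`reg_alpha`, `reg_lambda`, `subsample`), and with a shuffled split the training set contains days that lie after the test days, which can make the shuffled score look more optimistic than a real forecast would be.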
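
To illustrate how a time-ordered split differs from a shuffled one, here is a simplified re-implementation of an expanding-window split in plain Python. The function `expanding_window_splits` is a hypothetical helper that is meant to mirror the default behavior of `TimeSeriesSplit` (no gap, equal test sizes); it is a sketch, not the scikit-learn implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    for fold in range(n_splits):
        test_start = n_samples - (n_splits - fold) * test_size
        train_indices = list(range(0, test_start))
        test_indices = list(range(test_start, test_start + test_size))
        yield train_indices, test_indices


# Toy example with 12 time-ordered samples and 5 folds
splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for train_indices, test_indices in splits:
    print(f"train: 0..{train_indices[-1]}, test: {test_indices}")
```

Each fold trains only on samples that come before its test block, and the training window grows from fold to fold. This also means that the early folds are trained on much less data than the single shuffled split.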

Plot a random sample of data points to analyze large differences between prediction and actual value in more detail.

In [None]:
# Select 100 random datapoints from y_test
random_indices = random.sample(range(len(y_test)), 100)
y_test_sample = y_test.iloc[random_indices]
preds_sample = preds[random_indices]

# Create plot using plotly express
fig = px.scatter()
fig.add_scatter(x=y_test_sample.index, y=y_test_sample, mode='markers', name='y_test', marker=dict(color='blue'))
fig.add_scatter(x=y_test_sample.index, y=preds_sample, mode='markers', name='preds', marker=dict(color='red'))
fig.update_layout(title='y_test vs. preds, random values', xaxis_title='Date', yaxis_title='Values')
fig.show()
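
As a complement to random sampling, the largest deviations can also be selected directly. The following sketch returns the positions and residuals of the `k` largest absolute errors. The helper name `largest_errors` and the toy values are assumptions for illustration.

```python
import numpy as np


def largest_errors(y_true, y_pred, k):
    """Return positions and residuals of the k largest absolute prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    # Sort by absolute residual, largest first
    order = np.argsort(-np.abs(residuals))[:k]
    return order, residuals[order]


# Toy actual values and predictions (hypothetical values)
toy_actual = [100.0, 200.0, 300.0, 400.0]
toy_predicted = [110.0, 150.0, 305.0, 480.0]

indices, top_residuals = largest_errors(toy_actual, toy_predicted, k=2)
print(indices, top_residuals)
```

Applied to the notebook's variables, `y_test.iloc[indices]` would then give the dates of the worst predictions, which can be checked against holidays or weather extremes.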



<a name="5.3."></a>
## 5.3. Multilayer Perceptron
[Content](#content)

* TO CHECK: Performance when standardizing/normalizing the dataset
* TO CHECK: When adding the `prev_total` values, the accuracy decreases. Check again later with the optimized MLP

In [50]:
%%time
# Define the architecture of the MLP
model = Sequential()
model.add(Input(shape=(29,)))  # Input with 29 features
# 10 hidden layers with 512 neurons each
for _ in range(10):
    model.add(Dense(units=512, activation='relu'))
model.add(Dense(units=1, activation='linear'))  # Output layer with 1 neuron for regression

# Compile the model with mean squared error (MSE) loss, and root mean square error (RMSE) as metric
# Use Adam optimizer with learning rate
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='mse', metrics=[RootMeanSquaredError()])  

# Train the model and save the learning history, use X_test and y_test for validation
history = model.fit(X_train_scaled, y_train, epochs=100, batch_size=64, validation_data=(X_test_scaled, y_test))





Epoch 1/100
...
Epoch 100/100
46/46 [==============================] - 1s 23ms/step - loss: 6617481.5000 - root_mean_squared_error: 2572.4465 - val_loss: 67260184.0000 - val_root_mean_squared_error: 8201.2305
CPU times: user 2min 43s, sys: 6.67 s, total: 2min 50s
Wall time: 2min 23
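
The final epoch shows a large gap between training RMSE (about 2,572) and validation RMSE (about 8,201). One way to put the network size into perspective is to count its trainable parameters and compare them with the number of training samples. The sketch below does this arithmetic for the layer sizes defined above. The helper `dense_parameter_count` is hypothetical, and the 2,921 training samples are taken from the shape printed in the XGBoost section.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network with the given layer sizes."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 29 input features, 10 hidden layers with 512 units, 1 output unit
layer_sizes = [29] + [512] * 10 + [1]
n_parameters = dense_parameter_count(layer_sizes)

n_train_samples = 2921
print(n_parameters)
print(n_parameters / n_train_samples)
```

With roughly 2.4 million parameters for fewer than 3,000 samples, the model has far more capacity than data, which is consistent with the overfitting visible in the training log. A smaller network or regularization might be worth trying.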

In [46]:
hist_train_rmse = np.array(history.history["root_mean_squared_error"])
hist_test_rmse = np.array(history.history["val_root_mean_squared_error"])

# Create the array for the x axis, starting from 1
x = np.arange(1, len(history.history["root_mean_squared_error"])+1)

# Create a line chart with two lines using Plotly Express
fig = px.line(title='train RMSE vs test RMSE')
fig.add_scatter(x=x, y=hist_train_rmse, mode='lines+markers', name='Train', line=dict(color='red'))
fig.add_scatter(x=x, y=hist_test_rmse, mode='lines+markers', name='Test', line=dict(color='blue'))

# Set chart title and axis labels
fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='RMSE'
)

# Show the chart
fig.show()






In [49]:
preds_train = model.predict(X_train_scaled)

# Evaluate the model's performance using RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, preds_train))
print("Train RMSE: %f" % (rmse_train))

preds_test = model.predict(X_test_scaled)

# Evaluate the model's performance using RMSE
rmse_test = np.sqrt(mean_squared_error(y_test, preds_test))
print("Test RMSE: %f" % (rmse_test))


Train RMSE: 2623.097985
Test RMSE: 8387.871121


<a name="5.4."></a>
## 5.4. Recurrent Neural Network
[Content](#content)

In [None]:
# Load the data into a pandas dataframe
data = df_transformed_date

# Define the features and target
# Highly correlated features have been removed (tavg, tmin, wpgt)
features = ['year', 'month', 'day', 'weekday', 'tmax', 'prcp', 
            'snow', 'wspd', 'pres', 'tsun', 
            'holiday_1_weihnachtsfeiertag', 'holiday_2_weihnachtsfeiertag', 
            'holiday_christi_himmelfahrt', 'holiday_karfreitag', 'holiday_neujahr', 
            'holiday_ostermontag', 'holiday_pfingstmontag', 'holiday_reformationstag', 
            'holiday_tag_der_arbeit', 'holiday_tag_der_deutschen_einheit', 
            'vacation_herbstferien', 'vacation_osterferien', 'vacation_pfingstferien', 
            'vacation_sommerferien', 'vacation_weihnachtsferien', 'vacation_winterferien'] 

target = 'total'

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], 
                                                    test_size=0.2, shuffle=True, random_state=0)

In [None]:

import tensorflow as tf
import numpy as np

# Assuming X_train, X_test, y_train, y_test are already defined and contain the appropriate data

num_samples = X_train.shape[0]

# Build the LSTM model
# Stacked LSTM layers need return_sequences=True on every layer except the last,
# so that the next LSTM layer receives a 3D input
model = tf.keras.models.Sequential([
    # Shape [batch, time, features] => [batch, time, 256]
    tf.keras.layers.LSTM(256, activation='relu', return_sequences=True),
    # Shape => [batch, time, 128]
    tf.keras.layers.LSTM(128, activation='relu', return_sequences=True),
    # Shape => [batch, 64]
    tf.keras.layers.LSTM(64, activation='relu'),
    # Shape => [batch, 1]
    tf.keras.layers.Dense(units=1)
])

# Reshape the features to [samples, time steps, features] with a single time step
X_train_3d = np.reshape(X_train.to_numpy(), (num_samples, 1, X_train.shape[1]))
model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=[tf.keras.metrics.MeanAbsoluteError()])

# Train the model
model.fit(X_train_3d, y_train, epochs=100, batch_size=64, verbose=1)
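
The reshape in the cell above gives every sample a sequence length of 1, so the LSTM never sees more than one day at a time. If the recurrent layers are supposed to use the preceding days, the input could instead be built from sliding windows. The following is a minimal sketch under that assumption. The helper `make_windows` and the toy arrays are hypothetical, and the rows must be in chronological order, which a shuffled split does not preserve.

```python
import numpy as np


def make_windows(features, target, window):
    """Build [samples, window, n_features] sequences with the target of the last day in each window."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_windows = len(features) - window + 1
    # Window i covers rows i .. i + window - 1
    x_seq = np.stack([features[i:i + window] for i in range(n_windows)])
    # The label of each window is the target of its last row
    y_seq = target[window - 1:]
    return x_seq, y_seq


# Toy data: 6 days with 2 features each (hypothetical values)
toy_features = np.arange(12).reshape(6, 2)
toy_target = np.arange(6) * 10

x_seq, y_seq = make_windows(toy_features, toy_target, window=3)
print(x_seq.shape, y_seq.shape)
```

The resulting `x_seq` has the `[batch, time, features]` layout that Keras LSTM layers expect, with 3 time steps per sample in this toy case.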

In [None]:
# Predict on the test set
#X_test_3d = np.reshape(X_test.to_numpy(), (X_test.shape[0], 1, X_test.shape[1]))  # Reshape X_test

num_samples = X_test.shape[0]
X_test_3d = np.reshape(X_test.to_numpy(), (num_samples, 1, X_test.shape[1]))  # Fix variable name
preds = model.predict(X_test_3d)

# Evaluate the model's performance using RMSE
rmse = np.sqrt(mean_squared_error(y_test, np.reshape(preds, (y_test.shape[0]))))
print("RMSE: %f" % (rmse))


2018-03-11    26957.0
2014-07-27    37410.0
2016-09-30    39593.0
2022-06-05    50288.0
2017-09-29    41285.0
               ...   
2013-08-06    37690.0
2014-11-02    29683.0
2020-06-27    33080.0
2016-08-31    53149.0
2022-01-16    12165.0
Name: Total, Length: 731, dtype: float64
[30535.756 37923.53  32049.592 36079.492 31525.645 28276.818 31182.781
 28032.14  26946.822 35099.41  40275.06  31049.348 38569.285 31857.996
 30947.023 28566.012 28895.145 29982.74  36054.23  28999.047 29830.938
 30775.758 27452.115 31243.29  27236.404 27673.469 29084.553 30560.8
 30556.438 34519.117 30605.236 31264.998 33066.836 20967.12  37602.754
 32578.305 39709.023 29811.994 28410.668 33195.67  27754.33  31554.383
 27051.059 28399.223 35825.227 27933.12  39415.55  35235.44  35060.434
 32877.6   31883.713 31583.65  28782.07  34868.4   32103.193 28370.572
 26299.674 29966.117 36611.188 40976.992 27086.205 35797.164 28439.223
 33622.242 28705.676 38175.42  34417.445 28957.5   36308.617 28908.55
 37213.16 