# Notebook

In the Model learning step, the prepared dataset from [2_EDA](https://github.com/Rudinius/Bike_usage_Bremen/blob/57e21c8dd687aadc1498f82241cf662840c8b871/2_EDA.ipynb) is loaded. Then different machine learning algorithms are trained and compared to each other.

<a name="content"></a>
# Content

* [1. Import libraries](#1)
* [2. Import dataset](#2)
* [3. Transform columns](#3)
* [4. Establish baseline benchmark](#4)
* [5. Training machine learning algorithms](#5)
    * [5.1. Adding sequential data to our model](#5.1.)
    * [5.2. XGBoost](#5.2.)
    * [5.3. Multilayer Perceptron](#5.3.)
    * [5.4. Recurrent Neural Network](#5.4.)

<a name="1"></a>
# 1.&nbsp;Import libraries
[Content](#content)

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import random
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split, TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Input, Dropout, BatchNormalization, LSTM
from keras.optimizers import Adam
from keras.metrics import RootMeanSquaredError

In [2]:
# Install package pyjanitor since it is not part of the standard packages
# of Google Colab

import importlib

# Check if package is installed
package_name = "pyjanitor"
spec = importlib.util.find_spec(package_name)
if spec is None:
    # Package is not installed, install it via pip
    !pip install pyjanitor
else:
    print(f"{package_name} is already installed")

import janitor

Collecting pyjanitor
  Downloading pyjanitor-0.25.0-py3-none-any.whl (171 kB)
Collecting pandas-flavor (from pyjanitor)
  Downloading pandas_flavor-0.6.0-py3-none-any.whl (7.2 kB)
Installing collected packages: pandas-flavor, pyjanitor
Successfully installed pandas-flavor-0.6.0 pyjanitor-0.25.0


<a name="2"></a>
# 2.&nbsp;Import dataset
[Content](#content)

Next, we will import the processed dataset from [2_EDA](../Bike_usage_Bremen/2_EDA.ipynb).

In [3]:
# Set base url
url = "https://raw.githubusercontent.com/Rudinius/Bike_usage_Bremen/main/data/"



In [4]:
# Import dataset

# We will also parse the date column as datetime64 and set it to the index column
df = pd.read_csv(url + "03_training_data/" + "2023-04-21_df_full.csv",
                         parse_dates=[0], index_col=[0])

# Check the correct loading of dataset
df.head()



| date | weekday | graf_moltke_straße_ostseite | graf_moltke_straße_westseite | hastedter_bruckenstraße | langemarckstraße_ostseite | langemarckstraße_westseite | osterdeich | radweg_kleine_weser | schwachhauser_ring | wachmannstraße_auswarts_sud | ... | tmax | prcp | snow | wdir | wspd | wpgt | pres | tsun | holiday | vacation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | 1 | 261.0 | 290.0 | 381.0 | 312.0 | 308.0 | 870.0 | 410.0 | 391.0 | 514.0 | ... | 9.1 | 6.9 | 0.0 | 233.0 | 19.4 | 50.4 | 1001.8 | 0 | Neujahr | Weihnachtsferien |
| 2013-01-02 | 2 | 750.0 | 876.0 | 1109.0 | 1258.0 | 1120.0 | 2169.0 | 1762.0 | 829.0 | 1786.0 | ... | 7.1 | 1.8 | 0.0 | 246.0 | 20.2 | 40.0 | 1017.5 | 30 | NaN | Weihnachtsferien |
| 2013-01-03 | 3 | 931.0 | 1015.0 | 1603.0 | 1556.0 | 1480.0 | 2295.0 | 2287.0 | 1196.0 | 2412.0 | ... | 10.6 | 0.9 | 0.0 | 257.0 | 23.8 | 45.7 | 1024.5 | 0 | NaN | Weihnachtsferien |
| 2013-01-04 | 4 | 500.0 | 587.0 | 1284.0 | 703.0 | 626.0 | 1640.0 | 1548.0 | 1418.0 | 964.0 | ... | 9.7 | 0.0 | 0.0 | 276.0 | 25.2 | 48.2 | 1029.5 | 0 | NaN | Weihnachtsferien |
| 2013-01-05 | 5 | 1013.0 | 1011.0 | 1284.0 | 1856.0 | 1621.0 | 4128.0 | 4256.0 | 3075.0 | 2065.0 | ... | 8.6 | 0.1 | 0.0 | 293.0 | 20.2 | 41.0 | 1029.9 | 0 | NaN | Weihnachtsferien |


<a name="3"></a>
# 3.&nbsp;Transform columns
[Content](#content)

We need to convert the categorical columns `holiday` and `vacation` to numerical columns and then drop the original columns.

We will use `One-Hot-Encoding` for the `holiday` feature. For the `vacation` feature we will use masking (replacing), which leaves a single `vacation` column with 1 for vacation and 0 for no vacation.

The reason for keeping only this reduced information on `vacation` is explained in [2_EDA](https://github.com/Rudinius/Bike_usage_Bremen/blob/57e21c8dd687aadc1498f82241cf662840c8b871/2_EDA.ipynb); it turns out that this actually decreased the train and dev error.

In [6]:
# Use One Hot Encoder only for encoding holiday feature
OH_encoder = OneHotEncoder()

transformed_array = OH_encoder.fit_transform(df[["holiday"]]).toarray()
df_holiday_transformed = pd.DataFrame(transformed_array,
                              columns=OH_encoder.get_feature_names_out(),
                              index = df.index)
# Drop the column holiday_nan, as it holds no additional information
df_holiday_transformed = df_holiday_transformed.drop(["holiday_nan"], axis=1)

# Drop the old categorical column holiday
df_transformed = df.drop(["holiday"], axis=1)

# Add the new columns from OHE
df_transformed = pd.concat([df_transformed, df_holiday_transformed], axis=1)

# Create a mask for vacation or no vacation
mask = df_transformed["vacation"].isna()

# Set the values to `0` or `1` according to mask
df_transformed.loc[mask, "vacation"] = 0
df_transformed.loc[np.invert(mask), "vacation"] = 1

# Set datatype of column to int
df_transformed["vacation"] = df_transformed["vacation"].astype(int)
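The two encodings can be illustrated on a small toy frame. The sketch below is not the notebook's method: it swaps in `pd.get_dummies` for `OneHotEncoder` to stay compact, and the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same two categorical columns (values are illustrative)
toy = pd.DataFrame({
    "holiday": ["Neujahr", np.nan, np.nan, "Karfreitag"],
    "vacation": ["Weihnachtsferien", "Weihnachtsferien", np.nan, np.nan],
})

# One-hot encode holiday; rows with NaN get zeros in every holiday column
holiday_ohe = pd.get_dummies(toy["holiday"], prefix="holiday", dtype=int)

# Binary mask for vacation: 1 if any vacation name is present, else 0
vacation_flag = toy["vacation"].notna().astype(int)

encoded = pd.concat([holiday_ohe, vacation_flag], axis=1)
print(encoded)
```

Because `get_dummies` ignores missing values by default, no separate `holiday_nan` column is created in this variant.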



We will add the year, month and day as separate columns to give the algorithm the chance to pick up more granular and seasonal patterns.

In [7]:
df_date = pd.DataFrame(data = {
    "year": df.index.year,
    "month": df.index.month,
    "day": df.index.day
}, index=pd.to_datetime(df.index.values))

df_transformed_date = (pd.concat([df_date, df_transformed], axis=1)
                        .clean_names(strip_underscores="both"))

# Check dataframe
df_transformed_date.head()



|  | year | month | day | weekday | graf_moltke_straße_ostseite | graf_moltke_straße_westseite | hastedter_bruckenstraße | langemarckstraße_ostseite | langemarckstraße_westseite | osterdeich | ... | holiday_1_weihnachtsfeiertag | holiday_2_weihnachtsfeiertag | holiday_christi_himmelfahrt | holiday_karfreitag | holiday_neujahr | holiday_ostermontag | holiday_pfingstmontag | holiday_reformationstag | holiday_tag_der_arbeit | holiday_tag_der_deutschen_einheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | 2013 | 1 | 1 | 1 | 261.0 | 290.0 | 381.0 | 312.0 | 308.0 | 870.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2013-01-02 | 2013 | 1 | 2 | 2 | 750.0 | 876.0 | 1109.0 | 1258.0 | 1120.0 | 2169.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2013-01-03 | 2013 | 1 | 3 | 3 | 931.0 | 1015.0 | 1603.0 | 1556.0 | 1480.0 | 2295.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2013-01-04 | 2013 | 1 | 4 | 4 | 500.0 | 587.0 | 1284.0 | 703.0 | 626.0 | 1640.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2013-01-05 | 2013 | 1 | 5 | 5 | 1013.0 | 1011.0 | 1284.0 | 1856.0 | 1621.0 | 4128.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Now, after all those transformations, we have our final dataset to train our machine learning algorithms on.

<a name="4"></a>
# 4.&nbsp;Establish baseline benchmark
[Content](#content)


For our current task of creating a model to predict the number of cyclists on a given day, we do not have any baseline metric score to measure our model against.
For this reason, we will create a naive baseline model: we simply predict the count of a day with the value of the previous day.

In [8]:
# Evaluate the naive previous-day forecast

# Select the `total` column as both the true values and the forecast source
y = y_hat = df.loc[:,"total"]

rmse = 0
length = y.shape[0]

# Loop from the first to the second-to-last entry, as the second-to-last entry
# is the last one that can be used to predict a following day
for i in range(length-1):
    # The mean_squared_error function expects array-like input, therefore we
    # slice a one-element range for the following day and for the current day.
    # Note: the square root of a single squared error is the absolute error.
    rmse += np.sqrt(mean_squared_error(y[i+1:i+2], y_hat[i:i+1]))

# Divide the accumulated value by the number of pairs
rmse = rmse / (length-1)
print("RMSE: %f" % (rmse))



RMSE: 9083.225418


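If we naively predict each day's value with the previous day's value, we get an error over the entire dataset of approximately $9,100$. Because the square root is taken for each pair before averaging, this figure is the mean absolute day-to-day difference; an RMSE computed over all pairs at once would be at least as large.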

This is our naive benchmark to compare our model against.
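The loop above averages per-pair square roots, which equals the mean absolute error of the previous-day forecast. A minimal sketch on a made-up series shows how that quantity differs from an RMSE computed over all pairs at once. The values in `y` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals, for illustration only
y = pd.Series([100.0, 140.0, 110.0, 180.0, 150.0])

# Previous-day (persistence) forecast: the prediction for day t is the value of day t-1
y_true = y.iloc[1:].to_numpy()
y_pred = y.shift(1).iloc[1:].to_numpy()

errors = y_true - y_pred
mae = np.mean(np.abs(errors))            # what averaging per-pair square roots yields
rmse = np.sqrt(np.mean(errors ** 2))     # square root taken once, after averaging

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Since the RMSE weights large deviations more strongly, it is never smaller than the mean absolute error on the same pairs.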

Another method would be to predict the value of a given day by the average of the same calendar day in all other years of the dataset (e.g., to predict 18.08.2017, we take the average of all other 18.08. days in the dataset).

In [9]:
# Initialize squared error
se = 0

# Get the total number of examples
m = df.shape[0]

for i in df.index:
    day = i.day
    month = i.month
    year = i.year

    # create a mask for given day but exclude the day we want to predict
    mask = (df.index.day == day) & (df.index.month == month) & (df.index.year != year)

    # Get value for current day and mean values of all the other same days in the dataset
    y = df.loc[i, "total"]
    y_hat = df.loc[mask,"total"].mean()

    # Calculate the squared error
    se += (y - y_hat)**2

# Calculate mean squared error
mse = se / m

# Calculate root mean squared error
rmse = np.sqrt(mse)

print(f"RMSE: {rmse}")



RMSE: 10282.852522084184


With this second approach, averaging the values of the same calendar day in all other years and using this as our forecast, we get an error over the entire dataset of approximately $10,300$.
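The same-calendar-day baseline can also be computed without an explicit loop. The following is a sketch under assumptions, not the notebook's implementation: the series `total` holds made-up values, and the mean over the other years is obtained as (group sum minus own value) divided by (group size minus one).

```python
import numpy as np
import pandas as pd

# Hypothetical totals for two calendar days over three years (illustration only)
idx = pd.to_datetime(["2015-08-18", "2016-08-18", "2017-08-18",
                      "2015-08-19", "2016-08-19", "2017-08-19"])
total = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=idx)

# Group by (month, day) and compute the mean of all *other* years
keys = [total.index.month, total.index.day]
group_sum = total.groupby(keys).transform("sum")
group_cnt = total.groupby(keys).transform("count")
y_hat = (group_sum - total) / (group_cnt - 1)

rmse = np.sqrt(((total - y_hat) ** 2).mean())
print(y_hat)
print(f"RMSE: {rmse:.2f}")
```

A calendar day that occurs only once in the data has no other years to average, so its forecast is undefined (NaN) in this sketch.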

The error of this second naive approach is close to that of the first approach.
Both approaches could be seen as human-level, as this would be a typical way for a human to predict the value of any given day. A domain expert who also looks at more data, e.g., compares the temperatures, could come up with better estimates. However, humans are typically not very good at accurately predicting complex time-series data. The expected Bayes error (least possible error) should therefore be much lower.

<a name="5"></a>
# 5.&nbsp;Training machine learning algorithms
[Content](#content)

We are going to train 1 shallow machine learning algorithm and 2 deep machine learning algorithms to be able to compare performances. Those are:

* XGBoost
* Multilayer Perceptron (MLP -- standard NN)
* Recurrent Neural Network (RNN)

<a name="5.1."></a>
## 5.1. Adding sequential data to our model

[Content](#content)

In contrast to RNNs, where the algorithm automatically takes the data points of previous time steps into account, XGBoost and MLPs do not have direct access to the sequential data of previous time steps.
Those algorithms have only indirect knowledge via the learned model parameters. RNNs, however, directly include the state of the previous time step when computing the current time step.

We will therefore add the data points of the previous time steps as features to the feature vector.

In this case, we will only add the last 3 values, as the observed improvement in the RMSE score decreases drastically with each further time step added after 3 steps.

Improvements (RMSE and relative improvement per added lagged day):
* 1 day: 6291 (2.8%)
* 2 days: 6157 (2.1%)
* 3 days: 6120 (0.6%)

The following code creates a dataframe with a configurable number of lagged days:

In [10]:
# Create empty new dataframe
df_lagged_days = pd.DataFrame({})

# Select the number of lagged days
go_back_x_days = 3

for i in range(go_back_x_days):
    # Shift the values in "Total" by i and assign to new column prev_Total_i+1
    df_lagged_days[f'prev_total_{i+1}'] = df_transformed_date['total'].shift(i+1)
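To make the effect of `shift` concrete, here is a small sketch on a made-up five-day series. The frame `toy` and the name `n_lags` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical totals for five consecutive days
toy = pd.DataFrame(
    {"total": [10, 20, 30, 40, 50]},
    index=pd.date_range("2013-01-01", periods=5, freq="D"),
)

n_lags = 3
for lag in range(1, n_lags + 1):
    # shift(lag) moves each value `lag` rows down, so a row sees earlier totals
    toy[f"prev_total_{lag}"] = toy["total"].shift(lag)

print(toy)
# Rows without a full history contain NaN; dropping them shortens the frame
print(len(toy), "->", len(toy.dropna()))
```

Each additional lag leaves one more leading row without a complete history, which is why the number of lagged days affects the usable length of the dataset.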



<a name="5.2."></a>
## 5.2. XGBoost
[Content](#content)

For XGBoost, we will add the dataframe `df_lagged_days` to our dataset. Because we do not have all the information about the previous days for the first `go_back_x_days` rows, we drop the rows with `NaN` values. The parameter that controls how far to go back in time therefore has an impact on the length of our dataset.

In [11]:
# Concat new dataframe with old dataframe
# Drop the first rows, since they have NaN values in the lagged columns
# Using pyjanitor to clean up names
df_transformed_date_lagged = (pd.concat([df_transformed_date, df_lagged_days], axis=1)
                              .dropna(axis=0)
                              .clean_names(strip_underscores="both"))

# Check output
df_transformed_date_lagged



|  | year | month | day | weekday | graf_moltke_straße_ostseite | graf_moltke_straße_westseite | hastedter_bruckenstraße | langemarckstraße_ostseite | langemarckstraße_westseite | osterdeich | ... | holiday_karfreitag | holiday_neujahr | holiday_ostermontag | holiday_pfingstmontag | holiday_reformationstag | holiday_tag_der_arbeit | holiday_tag_der_deutschen_einheit | prev_total_1 | prev_total_2 | prev_total_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-04 | 2013 | 1 | 4 | 4 | 500.0 | 587.0 | 1284.0 | 703.0 | 626.0 | 1640.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24851.0 | 19494.0 | 5795.0 |
| 2013-01-05 | 2013 | 1 | 5 | 5 | 1013.0 | 1011.0 | 1284.0 | 1856.0 | 1621.0 | 4128.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13475.0 | 24851.0 | 19494.0 |
| 2013-01-06 | 2013 | 1 | 6 | 6 | 819.0 | 905.0 | 1284.0 | 1602.0 | 1215.0 | 4128.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30643.0 | 13475.0 | 24851.0 |
| 2013-01-07 | 2013 | 1 | 7 | 0 | 1123.0 | 1318.0 | 3070.0 | 2637.0 | 2268.0 | 3240.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23582.0 | 30643.0 | 13475.0 |
| 2013-01-08 | 2013 | 1 | 8 | 1 | 1321.0 | 1584.0 | 4673.0 | 3082.0 | 2694.0 | 4957.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 36246.0 | 23582.0 | 30643.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-12-27 | 2022 | 12 | 27 | 1 | 693.0 | 612.0 | 1495.0 | 1062.0 | 915.0 | 2123.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9520.0 | 8067.0 | 11206.0 |
| 2022-12-28 | 2022 | 12 | 28 | 2 | 643.0 | 585.0 | 1076.0 | 884.0 | 820.0 | 1819.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19231.0 | 9520.0 | 8067.0 |
| 2022-12-29 | 2022 | 12 | 29 | 3 | 654.0 | 648.0 | 1076.0 | 1014.0 | 907.0 | 2013.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16241.0 | 19231.0 | 9520.0 |
| 2022-12-30 | 2022 | 12 | 30 | 4 | 757.0 | 665.0 | 1076.0 | 1106.0 | 976.0 | 2088.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18006.0 | 16241.0 | 19231.0 |


<a name="5.2.1"></a>
### 5.2.1 Split data into train and dev set and standardize training data
[Content](#content)

Now, we will split the data into a training set and a dev set. Here we also select the final features that we want to use to train our model.

The highly correlated features `tavg`, `tmin` and `wpgt` will be removed, as will `wdir`, which shows no correlation with our target value.

When splitting into train and dev set, we will not shuffle the data. This ensures that the validation results are more realistic, since the model is evaluated on data collected after the data it was trained on. Otherwise we would introduce a "leakage error" into our data.
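A short sketch illustrates what `shuffle=False` means for a time-indexed frame. The frame `toy` and its columns are hypothetical and stand in for the real feature matrix.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily series of ten days
toy = pd.DataFrame(
    {"x": range(10), "y": range(10)},
    index=pd.date_range("2022-01-01", periods=10, freq="D"),
)

X_tr, X_dv, y_tr, y_dv = train_test_split(
    toy[["x"]], toy["y"], test_size=0.2, shuffle=False
)

# With shuffle=False the dev set is the most recent 20 % of the rows
print(X_tr.index.max(), "<", X_dv.index.min())
```

Every dev-set date lies after the last training date, so the model is never evaluated on days that precede its training data.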

In [16]:
# Load the data into a pandas dataframe
data = df_transformed_date_lagged
# Define the features and target
# Highly correlated features have been removed (tavg, tmin, wpgt)
# Features with no correlation have been removed (wdir)
# The individual counting stations and the target `total` are removed as well
features = [feature for feature in data.columns if feature not in ["tavg", "tmin", "wpgt", "wdir",
                                                                    'graf_moltke_straße_ostseite', 'graf_moltke_straße_westseite',
                                                                    'hastedter_bruckenstraße', 'langemarckstraße_ostseite',
                                                                    'langemarckstraße_westseite', 'osterdeich', 'radweg_kleine_weser',
                                                                    'schwachhauser_ring', 'wachmannstraße_auswarts_sud',
                                                                    'wachmannstraße_einwarts_nord', 'wilhelm_kaisen_brucke_ost',
                                                                    'wilhelm_kaisen_brucke_west', 'total', ]]

target = ['total']

# Split the data into training and dev sets
# We set shuffle to False
X_train, X_dev, y_train, y_dev = train_test_split(data[features], data[target],
                                                    test_size=0.2, shuffle=False, random_state=0)

print("X_train: ", X_train.shape, "y_train: ", y_train.shape)
print("X_dev: ", X_dev.shape, "y_dev: ", y_dev.shape)

X_train:  (2919, 24) y_train:  (2919, 1)
X_dev:  (730, 24) y_dev:  (730, 1)




Finally, we will standardize our dataset. Standardization generally improves the learning speed of the models and can help to improve the accuracy of the model.

In [None]:
# Standardize and fit to the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same standardization to the dev set
X_dev_scaled = scaler.transform(X_dev)
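The following sketch shows what fitting the scaler on the training set only implies. The arrays `X_tr` and `X_dv` are randomly generated placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # hypothetical training features
X_dv = rng.normal(loc=55.0, scale=12.0, size=(50, 2))   # hypothetical dev features, slightly shifted

scaler = StandardScaler().fit(X_tr)   # statistics come from the training set only
X_tr_s = scaler.transform(X_tr)
X_dv_s = scaler.transform(X_dv)

# Training columns end up with mean 0 and standard deviation 1;
# dev columns are scaled with the same training statistics, so they may deviate
print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
print(X_dv_s.mean(axis=0).round(3), X_dv_s.std(axis=0).round(3))
```

Reusing the training statistics keeps information about the dev set out of the preprocessing step.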



<a name="5.2.2"></a>
### 5.2.2 Using GridSearchCV to select the optimal parameters
[Content](#content)

We will use `GridSearchCV` to select optimal parameters among the preselected ranges for the training data. Furthermore, we will create our own scoring metric to evaluate the performance of the parameters found with `GridSearchCV`.

`GridSearchCV` uses `KFold` for regression problems by default. However, `KFold` would split the training data in such a way that models trained on later data are evaluated on earlier data, introducing `leakage error`.
Therefore we do not use the default, but create splits with `TimeSeriesSplit` and pass them to `GridSearchCV`.
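To see how `TimeSeriesSplit` arranges the folds, the sketch below prints the indices for ten time-ordered samples. The array `X` is a placeholder for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten time-ordered samples (illustrative)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Each training window ends before its test window begins
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows from fold to fold, and the test indices always follow the training indices, which is the property that avoids evaluating on earlier data.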

In [None]:
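# Create custom function for evaluating GridSearchCV
def custom_rmse(y, y_hat):
    return np.sqrt(mean_squared_error(y, y_hat))

# Create the scoring object using the custom scoring function
# Note: make_scorer defaults to greater_is_better=True, so GridSearchCV ranks
# candidates with a higher RMSE as better; passing greater_is_better=False
# would make it minimize the RMSE instead (scores are then negated)
custom_scorer_rmse = make_scorer(custom_rmse)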
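Since RMSE is an error, a scorer for it is usually built with `greater_is_better=False`. The sketch below illustrates that convention and is not the notebook's code: `rmse`, `neg_rmse_scorer` and the two `DummyRegressor` models are illustrative stand-ins.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y, y_hat):
    return np.sqrt(mean_squared_error(y, y_hat))

# greater_is_better=False marks the metric as a loss; the scorer returns the
# negated RMSE so that a higher score still means a better model
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])

# Two constant predictors: one close to the targets, one far away
good_model = DummyRegressor(strategy="constant", constant=2.5).fit(X, y)
bad_model = DummyRegressor(strategy="constant", constant=10.0).fit(X, y)

good = neg_rmse_scorer(good_model, X, y)
bad = neg_rmse_scorer(bad_model, X, y)
print(good, bad)
```

With such a scorer, `best_score_` of a grid search is a negative number whose absolute value is the RMSE.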



In [None]:
params = {
    'n_estimators': [1000],
    'learning_rate': [0.05],
    'max_depth': [5, 10, 20],           # max. depth of tree
    'min_child_weight': [3, 6, 12],     # min. weight for splitting into new node
    'colsample_bytree': [0.7, 0.8],     # subsample ratio of columns
    'reg_alpha': [2.0, 4.0, 8.0],       # L1 regularization
    'reg_lambda': [4.0, 8.0, 16.0],    # L2 regularization
}

# Getting time series splits using TimeSeriesSplit
n = 4
tscv = TimeSeriesSplit(n_splits=n)
splits = []

for i, (train_index, test_index) in enumerate(tscv.split(X_train)):
    splits.append((train_index, test_index))

# Define the estimator
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=0)

# Define GridSearch and fit GridSearch on the training data with the custom scorer and custom splits
grid_search = GridSearchCV(xg_reg, param_grid=params, cv=splits, scoring=custom_scorer_rmse, n_jobs=-1, verbose=2)
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and score as well as the best model to that score
print(grid_search.best_params_)
print(grid_search.best_score_)

best_model = grid_search.best_estimator_
print(best_model)

Fitting 4 folds for each of 162 candidates, totalling 648 fits




{'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'min_child_weight': 12, 'n_estimators': 1000, 'reg_alpha': 8.0, 'reg_lambda': 8.0}
7136.89129076001
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.8, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.05, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=5, max_leaves=None,
             min_child_weight=12, missing=nan, monotone_constraints=None,
             n_estimators=1000, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=0, ...)


{'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 10, 'min_child_weight': 7, 'n_estimators': 1000, 'reg_alpha': 1.0, 'reg_lambda': 2.0, 'subsample': 1.0}


{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 13, 'min_child_weight': 5, 'n_estimators': 1000, 'reg_alpha': 7.0, 'reg_lambda': 4.0, 'subsample': 1.0}


{'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'min_child_weight': 12, 'n_estimators': 1000, 'reg_alpha': 8.0, 'reg_lambda': 8.0}


In [None]:
# Build the XGBoost regressor model with selected hyper parameters
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.8, learning_rate = 0.05,
                          max_depth = 12, n_estimators = 1000, reg_alpha = 5.0, reg_lambda = 5.0,
                          subsample = 1.0, min_child_weight=5)

xg_reg.fit(X_train_scaled, y_train)

# Predict on the train set
y_train_hat = xg_reg.predict(X_train_scaled)

# Predict on the dev set
y_dev_hat = xg_reg.predict(X_dev_scaled)

# Evaluate the model's performance using RMSE
# Training set
rmse_train = custom_rmse(y_train, y_train_hat)
print("Train RMSE: %f" % (rmse_train))

# Dev set
rmse_dev = custom_rmse(y_dev, y_dev_hat)
print("Dev RMSE: %f" % (rmse_dev))



Train RMSE: 8.210400
Dev RMSE: 7201.797497


<a name="5.3."></a>
## 5.3. Multilayer Perceptron
[Content](#content)

* TO CHECK: Performance when standardizing/normalizing the dataset
* TO CHECK: When adding the prev_total values, the accuracy decreases. Check again later with the optimized MLP

In [None]:
dropout = 0.1 # 0.2 best
training = True

# Define the architecture of the MLP (one hidden layer with dropout)
model = Sequential()

model.add(Input(shape=(24,)))
model.add(Dense(units=10, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(units=1, activation='linear'))  # Output layer with 1 neuron for regression
"""
model.add(Input(shape=(24,)))
model.add(Dense(units=128, activation='relu'))
model.add(Dropout(dropout))
model.add(BatchNormalization())
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(dropout))
model.add(BatchNormalization())
model.add(Dense(units=32, activation='relu'))
model.add(Dropout(dropout))
model.add(BatchNormalization())
model.add(Dense(units=1, activation='linear'))  # Output layer with 1 neuron for regression
"""
#print(model.summary())

# Compile the model with mean squared error (MSE) loss, and root mean square error (RMSE) as metric
# Use Adam optimizer with learning rate
optimizer = Adam(learning_rate=0.1)
model.compile(optimizer=optimizer, loss='mse', metrics=[RootMeanSquaredError()])

# Train the model and save the learning history, use x_dev and y_dev for validation
history = model.fit(X_train_scaled, y_train, epochs=500, batch_size=32, validation_data=(X_dev_scaled, y_dev))
#history = model.fit(batched_X_train, batched_y_train, epochs=50, batch_size=32)

Epoch 1/500
Epoch 2/500
...
Epoch 78/500
...

With BatchNormalization:

92/92 [==============================] - 0s 5ms/step - loss: 32794122.0000 - root_mean_squared_error: 5726.6152 - val_loss: 70737936.0000 - val_root_mean_squared_error: 8410.5850

Without BatchNormalization:

92/92 [==============================] - 0s 4ms/step - loss: 70121296.0000 - root_mean_squared_error: 8373.8457 - val_loss: 87356080.0000 - val_root_mean_squared_error: 9346.4473


In [None]:
history_reg0 = history



In [None]:
history_reg1000 = history



In [None]:
history_reg0_not_scaled = history



In [None]:
model.layers[0].get_config()

Epoch 100/100
46/46 [==============================] - 1s 23ms/step - loss: 6617481.5000 - root_mean_squared_error: 2572.4465 - val_loss: 67260184.0000 - val_root_mean_squared_error: 8201.2305
CPU times: user 2min 43s, sys: 6.67 s, total: 2min 50s
Wall time: 2min 23

In [None]:
hist_train_rmse_reg0 = np.array(history_reg0.history["root_mean_squared_error"])
hist_dev_rmse_reg0 = np.array(history_reg0.history["val_root_mean_squared_error"])
hist_train_rmse_reg0_not_scaled = np.array(history_reg0_not_scaled.history["root_mean_squared_error"])
hist_dev_rmse_reg0_not_scaled = np.array(history_reg0_not_scaled.history["val_root_mean_squared_error"])

# Create the array for the x axis, starting from 1
x = np.arange(1, len(history_reg0.history["root_mean_squared_error"])+1)

# Create a line chart with two lines using Plotly Express
fig = px.line(title='train RMSE vs dev RMSE')
fig.add_scatter(x=x, y=hist_train_rmse_reg0, mode='lines+markers', name='Train L2=0', line=dict(color='red'))
fig.add_scatter(x=x, y=hist_dev_rmse_reg0, mode='lines+markers', name='Dev L2=0', line=dict(color='blue'))
fig.add_scatter(x=x, y=hist_train_rmse_reg0_not_scaled, mode='lines+markers', name='Train L2=0 not scaled', line=dict(color='gray'))
fig.add_scatter(x=x, y=hist_dev_rmse_reg0_not_scaled, mode='lines+markers', name='Dev L2=0 not scaled', line=dict(color='black'))

# Set chart title and axis labels
fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='RMSE'
)

# Show the chart
fig.show()



In [None]:
hist_train_rmse = np.array(history.history["root_mean_squared_error"])
hist_dev_rmse = np.array(history.history["val_root_mean_squared_error"])

# Create the array for the x axis, starting from 1
x = np.arange(1, len(history.history["root_mean_squared_error"])+1)

# Create a line chart with two lines using Plotly Express
fig = px.line(title='train RMSE vs dev RMSE')
fig.add_scatter(x=x, y=hist_train_rmse, mode='lines+markers', name='Train', line=dict(color='red'))
fig.add_scatter(x=x, y=hist_dev_rmse, mode='lines+markers', name='Dev', line=dict(color='blue'))

# Set chart title and axis labels
fig.update_layout(
    xaxis_title='Epochs',
    yaxis_title='RMSE'
)

# Show the chart
fig.show()





In [None]:
preds_train = model.predict(X_train_scaled)

# Evaluate the model's performance on the training set using RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, preds_train))
print("Train RMSE: %f" % (rmse_train))

preds_dev = model.predict(X_dev_scaled)

# Evaluate the model's performance on the dev set using RMSE
rmse_dev = np.sqrt(mean_squared_error(y_dev, preds_dev))
print("Dev RMSE: %f" % (rmse_dev))


Train RMSE: 2623.097985
Dev RMSE: 8387.871121


<a name="5.4."></a>
## 5.4. Recurrent Neural Network
[Content](#content)

Lastly, we train an RNN to compare to the previous two models. An RNN can take in multiple time steps to make a prediction (many-to-one). Specifically, we will feed in 4 time steps at a time (the current day and the 3 previous days) and output the prediction for the current day.

Our input shape is therefore (None, 4, 21), where None represents a variable number of samples (in our case the number of 4-day windows built from X_train), 4 is the number of time steps and 21 is the number of features.
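One possible way to build such windows is sketched below. It is not the notebook's approach, which uses an explicit loop: it relies on NumPy's `sliding_window_view`, and `X` and `y` are zero and range placeholders with the same number of rows and features as the training set in this section.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_days, n_features, timesteps = 2921, 21, 4

# Placeholder feature matrix and target vector (illustrative values only)
X = np.zeros((n_days, n_features))
y = np.arange(n_days, dtype=float)

# sliding_window_view appends the window axis last: (n_windows, n_features, timesteps)
windows = sliding_window_view(X, window_shape=timesteps, axis=0)
# Reorder to the (samples, timesteps, features) layout used by Keras recurrent layers
X_seq = windows.transpose(0, 2, 1)

# Many-to-one: each window is paired with the target of its last (current) day
y_seq = y[timesteps - 1:]

print(X_seq.shape, y_seq.shape)
```

Each window of 4 days loses the first 3 rows as standalone samples, so the number of samples is the number of days minus 3.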

In [None]:
# Load the data into a pandas dataframe
data = df_transformed_date

# Define the features and target
# Highly correlated features have been removed (tavg, tmin, wpgt)
features = ['year', 'month', 'day', 'weekday', 'tmax', 'prcp',
            'snow', 'wspd', 'pres', 'tsun',
            'holiday_1_weihnachtsfeiertag', 'holiday_2_weihnachtsfeiertag',
            'holiday_christi_himmelfahrt', 'holiday_karfreitag', 'holiday_neujahr',
            'holiday_ostermontag', 'holiday_pfingstmontag', 'holiday_reformationstag',
            'holiday_tag_der_arbeit', 'holiday_tag_der_deutschen_einheit',
            'vacation']

target = 'total'

# Split the data into training and dev sets
X_train, X_dev, y_train, y_dev = train_test_split(data[features], data[target],
                                                    test_size=0.2, shuffle=False, random_state=0)



In [None]:
from keras.utils import timeseries_dataset_from_array

print(X_train.shape)
print(y_train.shape)

input_data = X_train
#targets = y_train[3:]
targets = y_train
x_trainnnn = tf.keras.utils.timeseries_dataset_from_array(
    data=input_data, targets=targets, sequence_length=4, batch_size=None, shuffle=False)
print(type(x_trainnnn))
# A tf.data.Dataset has no `as_numpy` method; materialize it via as_numpy_iterator()
arr = list(x_trainnnn.as_numpy_iterator())

#print(dataset.take(10))
"""
for x,y in x_trainnnn.take(10):
    print(x, y)"""

#for x, y in dataset.take(10):
#    print(x, y)
"""
y_trainnnn = tf.keras.utils.timeseries_dataset_from_array(
    data=targets, targets=None, sequence_length=4, batch_size=None, shuffle=False)


for y in y_trainnnn.take(10):
    print(y)

x_trainnnn.to_numpy()"""
arr


(2921, 21)
(2921,)
<class 'tensorflow.python.data.ops.prefetch_op._PrefetchDataset'>

In [None]:
# Standardize and fit to the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same standardization to the dev set
X_dev_scaled = scaler.transform(X_dev)

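# Override the scaled arrays with the unscaled data, so the scaling above
# has no effect on the following cells
X_train_scaled = X_train
X_dev_scaled = X_dev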



In [None]:
# Define the number of timesteps
timesteps = 4

# Step 1: Create sequences of length 4
X_train_scaled_3d, X_dev_scaled_3d = [], []
y_train_3d, y_dev_3d = [], []

for i in range(len(X_train_scaled) - timesteps + 1):
    X_train_scaled_3d.append(X_train_scaled[i : i + timesteps])
    y_train_3d.append(y_train[i : i + timesteps])

X_train_scaled_3d = np.array(X_train_scaled_3d)
#y_train = y_train [timesteps-1:]
y_train_3d = np.array(y_train_3d)
y_train_3d = np.expand_dims(y_train_3d, axis=2)

for i in range(len(X_dev_scaled) - timesteps + 1):
    X_dev_scaled_3d.append(X_dev_scaled[i : i + timesteps])
    y_dev_3d.append(y_dev[i : i + timesteps])

X_dev_scaled_3d = np.array(X_dev_scaled_3d)
#y_dev = y_dev [timesteps-1:]
y_dev_3d = np.array(y_dev_3d)
y_dev_3d = np.expand_dims(y_dev_3d, axis=2)

print(X_train_scaled_3d.shape)
#print(y_train.shape)
print(y_train_3d.shape)

print(X_dev_scaled_3d.shape)
#print(y_dev.shape)
print(y_dev_3d.shape)



(2918, 4, 21)
(2918, 4, 1)
(728, 4, 21)
(728, 4, 1)


In [None]:
testing_y = np.expand_dims(y_train_3d,axis=2)
print(testing_y.shape)
testing_y[0:2]

(2918, 4, 1)




array([[[ 5795.],
        [19494.],
        [24851.],
        [13475.]],

       [[19494.],
        [24851.],
        [13475.],
        [30643.]]])

In [None]:
print(X_train_scaled_3d[0:2])
print(y_train_3d[0:2])

[[[2.0130e+03 1.0000e+00 1.0000e+00 1.0000e+00 9.1000e+00 6.9000e+00
   0.0000e+00 1.9400e+01 1.0018e+03 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 1.0000e+00]
  [2.0130e+03 1.0000e+00 2.0000e+00 2.0000e+00 7.1000e+00 1.8000e+00
   0.0000e+00 2.0200e+01 1.0175e+03 3.0000e+01 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 1.0000e+00]
  [2.0130e+03 1.0000e+00 3.0000e+00 3.0000e+00 1.0600e+01 9.0000e-01
   0.0000e+00 2.3800e+01 1.0245e+03 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 1.0000e+00]
  [2.0130e+03 1.0000e+00 4.0000e+00 4.0000e+00 9.7000e+00 0.0000e+00
   0.0000e+00 2.5200e+01 1.0295e+03 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
   0.0000e+00 0.0000e+00 1.0000e+00]]

 [[2.0130e+03 1.0000e+



In [None]:
# Assumes X_train_scaled_3d, y_train_3d, X_dev_scaled_3d and y_dev_3d were built in the cells above

#num_samples = len(X_train_scaled_3d)
# Sequence length; must match `timesteps` used when building the windows
num_sequence = 4
# Build the LSTM model
"""
model = Sequential()
model.add(LSTM(units=256, input_shape=(1, X_train.shape[1]), activation='relu'))
# Add a dense output layer with a single unit (for regression) and no activation function
model.add(Dense(units=1))
"""
"""
model = tf.keras.models.Sequential([
    # Shape [batch, time, features] => [batch, time, lstm_units]
    tf.keras.layers.LSTM(64),
    # Shape => [batch, time, features]
    tf.keras.layers.Dense(units=1)
])
"""
# Build the LSTM model
"""
model = Sequential()
model.add(Input(shape=(num_sequence, 21)))
model.add(LSTM(units=32, activation='relu', dropout=0.1))
model.add(Dense(units=1, activation='linear'))

model = Sequential([
    #Input(shape=(num_sequence, 21)),
    # Shape [batch, time, features] => [batch, time, lstm_units]
    LSTM(32, activation='relu', dropout=0.1, return_sequences=True),
    # Shape => [batch, time, features]
    Dense(units=1, activation='linear')
])"""

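# Functional-API model. With return_sequences=True the LSTM emits one vector
# per timestep, so the Dense layer produces one prediction per timestep.
inputs = Input(shape=(num_sequence, 21))
x = LSTM(64, activation='relu', dropout=0.1, return_sequences=True)(inputs)
outputs = Dense(1, activation='linear', name="custom_output")(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

optimizer = Adam(learning_rate=0.1)
model.compile(optimizer=optimizer, loss='mse', metrics=[RootMeanSquaredError()])

# summary() prints the table itself and returns None, so wrapping it in
# print() adds a trailing "None" to the output
print(model.summary())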




#X_train_scaled_3d = np.reshape(X_train_scaled, (num_samples, num_sequence, X_train_scaled.shape[1]))  # Fix variable name

#num_samples = X_dev_scaled.shape[0]
#X_dev_scaled_3d = np.reshape(X_dev_scaled, (num_samples, num_sequence, X_dev_scaled.shape[1]))

# Train the model
#X_train_scaled_3d = np.expand_dims(X_train_scaled_3d[0], axis=0)
#y_train_3d = np.expand_dims(y_train_3d[0], axis=0)
#print(X_train_scaled_3d)
#print(y_train_3d)
#print(testing_y.shape)
history = model.fit(X_train_scaled_3d, y_train_3d, epochs=500, batch_size=32, verbose=2, validation_data=(X_dev_scaled_3d, y_dev_3d))
#history = model.fit(X_train_scaled_3d, testing_y, epochs=10, batch_size=32, verbose=2, validation_data=(X_dev_scaled_3d, y_dev_3d))
#history = model.fit(dataset, epochs=10, batch_size=32, verbose=2, validation_data=(X_dev_scaled_3d, y_dev_3d))
#history = model.fit(x_trainnnn, y_trainnnn, epochs=10, batch_size=32, verbose=2)

#validation_data=(X_dev_scaled, y_dev)

output = model.predict(X_train_scaled_3d)

#intermediate_model = tf.keras.Model(inputs=model.input, outputs=model.get_layer("custom_output").output)
#intermediate_model.predict(X_train_scaled_3d)




Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 4, 21)]           0         
                                                                 
 lstm_3 (LSTM)               (None, 4, 64)             22016     
                                                                 
 custom_output (Dense)       (None, 4, 1)              65        
                                                                 
Total params: 22,081
Trainable params: 22,081
Non-trainable params: 0
_________________________________________________________________




None
Epoch 1/500
92/92 - 2s - loss: 544387136.0000 - root_mean_squared_error: 23332.1055 - val_loss: 134798464.0000 - val_root_mean_squared_error: 11610.2744 - 2s/epoch - 26ms/step
Epoch 2/500
92/92 - 1s - loss: 143541840.0000 - root_mean_squared_error: 11980.8945 - val_loss: 129962208.0000 - val_root_mean_squared_error: 11400.0967 - 559ms/epoch - 6ms/step
Epoch 3/500
92/92 - 1s - loss: 135511808.0000 - root_mean_squared_error: 11640.9541 - val_loss: 152592368.0000 - val_root_mean_squared_error: 12352.8281 - 584ms/epoch - 6ms/step
Epoch 4/500
92/92 - 1s - loss: 134960576.0000 - root_mean_squared_error: 11617.2529 - val_loss: 142894256.0000 - val_root_mean_squared_error: 11953.8389 - 552ms/epoch - 6ms/step
Epoch 5/500
92/92 - 0s - loss: 131916600.0000 - root_mean_squared_error: 11485.4951 - val_loss: 127970664.0000 - val_root_mean_squared_error: 11312.4121 - 392ms/epoch - 4ms/step
Epoch 6/500
92/92 - 0s - loss: 126315304.0000 - root_mean_squared_error: 11239.0078 - val_loss: 149816528.0

In [None]:
output



array([[[18585.56 ],
        [18585.56 ],
        [18585.56 ],
        [18585.56 ]],

       [[21691.38 ],
        [21691.38 ],
        [21691.38 ],
        [21691.38 ]],

       [[21374.6  ],
        [21374.6  ],
        [21374.6  ],
        [21374.6  ]],

       ...,

       [[10651.098],
        [10651.098],
        [10651.098],
        [15214.214]],

       [[11091.42 ],
        [11091.42 ],
        [15622.95 ],
        [15622.95 ]],

       [[15030.084],
        [19279.078],
        [19279.078],
        [19279.078]]], dtype=float32)
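Since the model returns one value per timestep and consecutive windows overlap, each row of the original data appears in up to four windows. One possible way to get a single prediction per row, sketched below under that assumption, is to keep only the last timestep of every window and compare it with the targets from index `timesteps - 1` onward (the alignment hinted at by the commented-out `y_train [timesteps-1:]` line). All arrays here are synthetic stand-ins, not the notebook's data.

```python
import numpy as np

timesteps = 4

# Hypothetical flat targets, one per row
y = np.arange(10, dtype=float)

# Overlapping target windows, shape (7, 4)
windows = np.stack([y[i:i + timesteps] for i in range(len(y) - timesteps + 1)])

# Fake per-timestep predictions with the model's output layout (n_windows, timesteps, 1);
# each value is simply the true target plus 1
pred = windows[..., np.newaxis] + 1.0

# Keep the prediction for the last row of each window: shape (7,)
last_step_pred = pred[:, -1, 0]

# Targets aligned with those predictions: rows timesteps-1 .. end, shape (7,)
y_aligned = y[timesteps - 1:]

rmse = np.sqrt(np.mean((last_step_pred - y_aligned) ** 2))
print(last_step_pred.shape, y_aligned.shape, rmse)
```

With this alignment the first `timesteps - 1` rows have no prediction, because no complete window ends on them.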

In [None]:
output.shape



(2918, 1)