# Predicting target variables using different models

### Prepare the training set (2022) and test set (2023) using 
https://github.com/sagerpascal/uzh-data-science-project/blob/main/WZ/Q5_datacleaning.ipynb

### Data downloaded from:
https://data.stadt-zuerich.ch/dataset/vbz_fahrgastzahlen_ogd
and
https://data.stadt-zuerich.ch/dataset/vbz_fahrzeiten_ogd

In [13]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math
import csv
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

In [26]:
trainset = pd.read_csv('data/fahrgastzahlen_2017/cleaned.csv', sep=',')
testset = pd.read_csv('data/fahrgastzahlen_2019/cleaned.csv', sep=',')

In [27]:
trainset.info()
trainset.isna().any().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423484 entries, 0 to 423483
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Time             423484 non-null  float64
 1   Nachtnetz        423484 non-null  int64  
 2   Capacity         423484 non-null  float64
 3   Occupancy        423484 non-null  float64
 4   GPS_Latitude     423484 non-null  float64
 5   GPS_Longitude    423484 non-null  float64
 6   Weekday          423484 non-null  int64  
 7   Vehicle_type_B   423484 non-null  bool   
 8   Vehicle_type_FB  423484 non-null  bool   
 9   Vehicle_type_N   423484 non-null  bool   
 10  Vehicle_type_SB  423484 non-null  bool   
 11  Vehicle_type_T   423484 non-null  bool   
 12  Vehicle_type_TR  423484 non-null  bool   
dtypes: bool(6), float64(5), int64(2)
memory usage: 25.0 MB


0

In [28]:
# Fit the scaler on the training set only
scaler = StandardScaler()
columns_to_normalize = ['Time', 'GPS_Latitude', 'GPS_Longitude',
                        'Occupancy' 
                    # 'Capacity' only to use to restore the number of empty seats
                       ]
scaler.fit(trainset[columns_to_normalize])

# Transform both training and test sets
trainset[columns_to_normalize] = scaler.transform(trainset[columns_to_normalize])
testset[columns_to_normalize] = scaler.transform(testset[columns_to_normalize])

In [29]:
testset.head(10)

Unnamed: 0,Time,Nachtnetz,Capacity,Occupancy,GPS_Latitude,GPS_Longitude,Weekday,Vehicle_type_B,Vehicle_type_FB,Vehicle_type_N,Vehicle_type_SB,Vehicle_type_T,Vehicle_type_TR
0,-1.073274,0,96.0,-0.8,-0.616886,-0.181882,0,False,False,False,False,False,True
1,-1.073274,0,32.0,-0.385937,-0.616886,-0.181882,0,False,False,False,False,False,True
2,-0.98444,0,48.0,-0.479474,-0.616886,-0.181882,0,False,False,False,False,False,True
3,-0.98444,0,32.0,-0.100487,-0.616886,-0.181882,0,False,False,False,False,False,True
4,-1.517444,0,48.0,-0.71372,-0.616886,-0.181882,1,False,False,False,False,False,True
5,-1.517444,0,90.0,-0.716112,-0.616886,-0.181882,1,False,False,False,False,False,True
6,-1.339776,0,60.0,-0.77976,-0.616886,-0.181882,1,False,False,False,False,False,True
7,-1.517444,0,90.0,-0.774384,-0.616886,-0.181882,1,False,False,False,False,False,True
8,-1.42861,0,48.0,-0.645986,-0.616886,-0.181882,1,False,False,False,False,False,True
9,-1.42861,0,90.0,-0.824056,-0.616886,-0.181882,1,False,False,False,False,False,True


### Define Target and Predictor Variables

Define what should be predicted, e.g. free seats, number of passengers, etc. It should be a single variable that is used as $y$. Do this for the training and test set.

Use all other variables as $X_train$ and $X_test$. Remove all columns in $X_train$ and $X_test$ that basically includes the target value, e.g., we cannot predict a number of free seats but provide the model the number of passengers and the number of seats -> the number of passengers must be removed in this case.

### Define some metrics

Define some metrics, have a look here:

https://scikit-learn.org/stable/modules/model_evaluation.html

### Define Some Models

- Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- SVR: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
  (try RBF kernel) <- never finishes, forget about it
- Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

For each of these model, define different parameters and try them out using GridSearch: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Fit the training data.

### Calculate Metrics

Calculate the metrics based on the test data (fed into the model using the .predict function). Maybe also plot the results.

### Look at feature importance

- Linear Regression: Features with higher absolute values of coefficients have a more significant impact on the predicted outcome
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
- https://scikit-learn.org/stable/modules/permutation_importance.html


In [30]:
X_train = trainset.drop(columns=['Capacity', 'Occupancy'])
y_train = trainset['Occupancy']
X_test = testset.drop(columns=['Capacity', 'Occupancy'])
y_test = testset['Occupancy']
X_train.head()

Unnamed: 0,Time,Nachtnetz,GPS_Latitude,GPS_Longitude,Weekday,Vehicle_type_B,Vehicle_type_FB,Vehicle_type_N,Vehicle_type_SB,Vehicle_type_T,Vehicle_type_TR
0,-1.606278,0,0.328782,-0.903167,1,False,False,False,False,False,True
1,-1.606278,0,0.328782,-0.903167,1,False,False,False,False,False,True
2,-1.517444,0,0.328782,-0.903167,1,False,False,False,False,False,True
3,-1.517444,0,0.328782,-0.903167,1,False,False,False,False,False,True
4,-1.42861,0,0.328782,-0.903167,1,False,False,False,False,False,True


In [31]:
# Define parameter grids for GridSearchCV
param_grid_lr = {'fit_intercept': [True, False]}

# param_grid_svr = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
# param_grid_svr = {'kernel': ['linear'], 'C': [0.1, 1]}
## Taking too long

# param_grid_rf = {'n_estimators': [100, 200, 300], 'max_depth': [None, 10, 20]}
param_grid_rf = {'n_estimators': [100, 200], 'max_depth': [10]}


# Initialize models
# verbose=1 to show progress
lr = LinearRegression()
# svr = SVR(verbose=1)
rf = RandomForestRegressor(verbose=1)

# GridSearchCV for Linear Regression
grid_lr = GridSearchCV(lr, param_grid_lr, cv=5, verbose=1)
grid_lr.fit(X_train, y_train)


# GridSearchCV for SVR, use double precision for Qfloat
# svr_double = SVR(kernel='linear', verbose=1, cache_size=1000) 
# grid_svr = GridSearchCV(svr_double, param_grid_svr, cv=5, verbose=1)
# grid_svr.fit(X_train, y_train)

# GridSearchCV for Random Forest
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, verbose=1)
grid_rf.fit(X_train, y_train)

# Print best parameters for each model
print("Best parameters for Linear Regression:", grid_lr.best_params_)
# print("Best parameters for SVR:", grid_svr.best_params_)
print("Best parameters for Random Forest:", grid_rf.best_params_)

# Fit the training data with best parameters
best_lr = grid_lr.best_estimator_
best_lr.fit(X_train, y_train)

# best_svr = grid_svr.best_estimator_
# best_svr.fit(X_train, y_train)

best_rf = grid_rf.best_estimator_
best_rf.fit(X_train, y_train)



Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   20.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   22.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   21.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_j

Best parameters for Linear Regression: {'fit_intercept': True}
Best parameters for Random Forest: {'max_depth': 10, 'n_estimators': 100}


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   27.9s finished


In [32]:
from datetime import datetime
from joblib import dump

# Get the current date and time
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Define the directory path where you want to save the models
directory_path = 'models/'

# Define the file names with timestamps and full path
lr_model_filename = f'{directory_path}linear_regression_model_{timestamp}.joblib'
# svr_model_filename = f'{directory_path}svr_model_{timestamp}.joblib'
rf_model_filename = f'{directory_path}random_forest_model_{timestamp}.joblib'


# Save Linear Regression model
dump(best_lr, lr_model_filename)

# Save SVR model
# dump(best_svr, svr_model_filename)

# Save Random Forest model
dump(best_rf, rf_model_filename)

# Save scaler
dump(scaler, 'scaler.joblib')

['scaler.joblib']

In [33]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Predict on the test set
y_pred_lr = best_lr.predict(X_test)
# y_pred_svr = best_svr.predict(X_test_scaled)
y_pred_rf = best_rf.predict(X_test)

# Calculate MAPE with a small denominator to prevent inf
mape_lr = np.mean(np.abs((y_test - y_pred_lr) / (y_test + 1e-10))) * 100
# mape_svr = np.mean(np.abs((y_test - y_pred_svr) / y_test)) * 100
mape_rf = np.mean(np.abs((y_test - y_pred_rf) / (y_test + 1e-10))) * 100

# Calculate MAE 
mae_lr = mean_absolute_error(y_test, y_pred_lr)
# mae_svr = mean_absolute_error(y_test, y_pred_svr)
mae_rf = mean_absolute_error(y_test, y_pred_rf)

# Calculate MSE
mse_lr = mean_squared_error(y_test, y_pred_lr)
#mse_svr = mean_squared_error(y_test, y_pred_svr)
mse_rf = mean_squared_error(y_test, y_pred_rf)


print("Linear Regression MAPE:", mape_lr)
print("Linear Regression MAE:", mae_lr)
print("Linear Regression MSE:", mse_lr)

# print("SVR MAPE:", mape_svr)
# print("SVR MAE:", mae_svr)
# print("SVR MSE:", mse_svr)

print("Random Forest MAPE:", mape_rf)
print("Random Forest MAE:", mae_rf)
print("Random Forest MSE:", mse_rf)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Linear Regression MAPE: 275.02650832873985
Linear Regression MAE: 0.6751491303608285
Linear Regression MSE: 0.954257048969894
Random Forest MAPE: 358.5462065837942
Random Forest MAE: 0.5513029185191594
Random Forest MSE: 0.7072565635236046


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    1.5s finished


In [34]:
## Check on the training set to see if there has been shifts

# Predict on the test set
y_pred_lr = best_lr.predict(X_train)
# y_pred_svr = best_svr.predict(X_train)
y_pred_rf = best_rf.predict(X_train)

# Calculate MAPE with a small denominator to prevent inf
mape_lr = np.mean(np.abs((y_train - y_pred_lr) / (y_test + 1e-10))) * 100
# mape_svr = np.mean(np.abs((y_test - y_pred_svr) / y_test)) * 100
mape_rf = np.mean(np.abs((y_train - y_pred_rf) / (y_test + 1e-10))) * 100

# Calculate MAE 
mae_lr = mean_absolute_error(y_train, y_pred_lr)
# mae_svr = mean_absolute_error(y_test, y_pred_svr)
mae_rf = mean_absolute_error(y_train, y_pred_rf)

# Calculate MSE
mse_lr = mean_squared_error(y_train, y_pred_lr)
#mse_svr = mean_squared_error(y_test, y_pred_svr)
mse_rf = mean_squared_error(y_train, y_pred_rf)


print("Linear Regression MAPE:", mape_lr)
print("Linear Regression MAE:", mae_lr)
print("Linear Regression MSE:", mse_lr)

# print("SVR MAPE:", mape_svr)
# print("SVR MAE:", mae_svr)
# print("SVR MSE:", mse_svr)

print("Random Forest MAPE:", mape_rf)
print("Random Forest MAE:", mae_rf)
print("Random Forest MSE:", mse_rf)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Linear Regression MAPE: 536.6403115356728
Linear Regression MAE: 0.6437198215804588
Linear Regression MSE: 0.8751106869710044
Random Forest MAPE: 501.4482085353002
Random Forest MAE: 0.5189943379926268
Random Forest MSE: 0.6389002985890974


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    1.1s finished


In [None]:
# Change to train with 2017 data to predict 2019 data for better similarity in volume with vehicle types?
# see https://github.com/Chocobosaurus/uzh-data-science-project/blob/main/WZ/Q1Q2.md

In [None]:
# Models:
# 2024-05-14_15-59-58: Done with unormalized Occupancy, with vehicle type
# 2024-05-14_16-49-07: Done with normalized Occupancy, with vehicle type
# 2024-05-14_17-16-45: Done with normalized Occupanct, 2017 as training set and 2019 as test set