# Predicting target variables using different models

### Prepare the training set (2022) and test set (2023)
Use the data-cleaning notebook: https://github.com/sagerpascal/uzh-data-science-project/blob/main/WZ/Q5_datacleaning.ipynb

### Data downloaded from:
https://data.stadt-zuerich.ch/dataset/vbz_fahrgastzahlen_ogd
and
https://data.stadt-zuerich.ch/dataset/vbz_fahrzeiten_ogd

In [2]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math
import csv
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

In [3]:
trainset = pd.read_csv('data/fahrgastzahlen_2022/cleaned.csv', sep=',')
testset = pd.read_csv('data/fahrgastzahlen_2023/cleaned.csv', sep=',')

In [4]:
trainset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960951 entries, 0 to 960950
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Time           960951 non-null  float64
 1   Nachtnetz      960951 non-null  int64  
 2   Capacity       960951 non-null  float64
 3   Occupancy      960951 non-null  float64
 4   GPS_Latitude   960951 non-null  float64
 5   GPS_Longitude  960951 non-null  float64
 6   Weekday        960951 non-null  int64  
dtypes: float64(5), int64(2)
memory usage: 51.3 MB


In [5]:
testset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1127063 entries, 0 to 1127062
Data columns (total 7 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   Time           1127063 non-null  float64
 1   Nachtnetz      1127063 non-null  int64  
 2   Capacity       1127063 non-null  int64  
 3   Occupancy      1127063 non-null  float64
 4   GPS_Latitude   1127063 non-null  float64
 5   GPS_Longitude  1127063 non-null  float64
 6   Weekday        1127063 non-null  int64  
dtypes: float64(4), int64(3)
memory usage: 60.2 MB


In [6]:
# Fit the scaler on the training set only
scaler = StandardScaler()
columns_to_normalize = ['Time', 'GPS_Latitude', 'GPS_Longitude',
                    # 'Occupancy': not necessary to normalize?
                    # 'Capacity': only used to restore the number of empty seats
                       ]
scaler.fit(trainset[columns_to_normalize])

# Transform both training and test sets
trainset[columns_to_normalize] = scaler.transform(trainset[columns_to_normalize])
testset[columns_to_normalize] = scaler.transform(testset[columns_to_normalize])

### Define Target and Predictor Variables

Define what should be predicted, e.g. free seats, number of passengers, etc. It should be a single variable that is used as $y$. Do this for both the training and the test set.

Use all other variables as $X_{train}$ and $X_{test}$. Remove every column from $X_{train}$ and $X_{test}$ that effectively contains the target value. For example, we cannot predict the number of free seats while giving the model both the number of passengers and the number of seats, so the number of passengers must be removed in that case.
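
The split described above can be sketched as follows. This is only an illustration: the helper `split_features_target`, the toy values, and the `FreeSeats` column are made up for this example, and only the other column names match the cleaned data.

```python
import pandas as pd

def split_features_target(df, target, also_drop=()):
    """Return (X, y): y is the target column, X is every other column
    except those listed in also_drop (columns that would leak the target)."""
    X = df.drop(columns=[target, *also_drop])
    y = df[target]
    return X, y

# Tiny made-up frame using the column names of the cleaned data
toy = pd.DataFrame({
    "Time": [6.5, 7.0, 7.5],
    "Nachtnetz": [0, 0, 0],
    "Capacity": [100.0, 100.0, 150.0],
    "Occupancy": [12.0, 48.0, 90.0],
    "GPS_Latitude": [47.37, 47.38, 47.39],
    "GPS_Longitude": [8.54, 8.53, 8.52],
    "Weekday": [1, 1, 2],
})

# Hypothetical target: free seats. Occupancy must be dropped, because
# Capacity - Occupancy would give the answer away.
toy["FreeSeats"] = toy["Capacity"] - toy["Occupancy"]
X_toy, y_toy = split_features_target(toy, target="FreeSeats", also_drop=["Occupancy"])

print(list(X_toy.columns))
print(y_toy.tolist())
```

The same helper would be called once on the training frame and once on the test frame so that both end up with identical feature columns.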

### Define some metrics

Define some metrics, have a look here:

https://scikit-learn.org/stable/modules/model_evaluation.html
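
One possible set of regression metrics, sketched with functions from `sklearn.metrics`. The choice of MAE, RMSE and R^2 and the helper name `regression_metrics` are assumptions for illustration, and the numbers are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Collect a few common regression metrics in one dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),  # mean absolute error, in target units
        "RMSE": float(np.sqrt(mse)),                 # root mean squared error, in target units
        "R2": r2_score(y_true, y_pred),              # share of variance explained
    }

# Made-up values, only to show the call
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 37.0])
metrics_demo = regression_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

MAE and RMSE are in the units of the target (passengers), which makes them easier to interpret than R^2 alone.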

### Define Some Models

- Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- SVR: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
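  (the RBF kernel was tried, but training never finished on the full training set; SVR fit time grows more than quadratically with the number of samples, so it was dropped)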
- Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

For each of these models, define different parameters and try them out using GridSearch: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Fit the training data.
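
Because SVR scales poorly with the number of rows, one option is to run the grid search on a random subsample first. The sketch below uses synthetic data and an arbitrary subsample size of 300; both are assumptions for illustration, and parameters found on a subsample may not be the best ones for the full data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the training data (not the VBZ data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=2000)

# Draw a random subsample of rows without replacement
sample_idx = rng.choice(len(X_demo), size=300, replace=False)
X_small, y_small = X_demo[sample_idx], y_demo[sample_idx]

# Try every kernel/C combination with 3-fold cross-validation
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
grid = GridSearchCV(SVR(), param_grid, cv=3)
grid.fit(X_small, y_small)

print(grid.best_params_)
print(grid.best_score_)  # mean cross-validated R^2 of the best combination
```

With the default `refit=True`, `grid.best_estimator_` is already trained on all rows passed to `fit`, so it can be used for prediction directly.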

### Calculate Metrics

Calculate the metrics based on the test data (fed into the model using the .predict function). Maybe also plot the results.
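
A predicted-versus-actual scatter plot is one simple way to plot the results. The values below are randomly generated placeholders, not model output; in the notebook they would be replaced by the test target and the predictions of one model.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: "actual" values plus noisy "predictions"
rng = np.random.default_rng(1)
actual = rng.uniform(0, 100, size=200)
predicted = actual + rng.normal(scale=10, size=200)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(actual, predicted, s=8, alpha=0.5)

# Diagonal reference line: points on it are predicted perfectly
ax.plot([0, 100], [0, 100], color="black", linewidth=1)

ax.set_xlabel("Actual occupancy")
ax.set_ylabel("Predicted occupancy")
ax.set_title("Predicted vs. actual")
```

Points far from the diagonal are the cases the model gets most wrong, which is often more informative than a single score.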

### Look at feature importance

- Linear Regression: features with larger absolute coefficients have a stronger impact on the predicted outcome (coefficients are only comparable between features that are on the same scale)
- Random Forest: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
- Permutation importance (works for any fitted model): https://scikit-learn.org/stable/modules/permutation_importance.html
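
The three approaches can be compared on a small synthetic problem. In this sketch the data and the feature names `f0`, `f1`, `f2` are invented: `f0` drives the target strongly, `f1` weakly, and `f2` not at all.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_imp = rng.normal(size=(500, 3))
y_imp = 3.0 * X_imp[:, 0] + 0.5 * X_imp[:, 1] + rng.normal(scale=0.1, size=500)
feature_names = ["f0", "f1", "f2"]

# 1) Linear regression: rank features by absolute coefficient
#    (meaningful here because all features share the same scale)
lin = LinearRegression().fit(X_imp, y_imp)
coef_ranking = np.argsort(-np.abs(lin.coef_))

# 2) Random forest: impurity-based importances, which sum to 1
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_imp, y_imp)
impurity_importance = forest.feature_importances_

# 3) Permutation importance: score drop when one column is shuffled
perm = permutation_importance(forest, X_imp, y_imp, n_repeats=5, random_state=0)
perm_ranking = np.argsort(-perm.importances_mean)

print([feature_names[i] for i in coef_ranking])
print(dict(zip(feature_names, impurity_importance.round(3))))
print([feature_names[i] for i in perm_ranking])
```

For brevity the permutation importance is computed on the training data here; computing it on held-out data gives a better picture of what matters for generalization.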


In [7]:
X_train = trainset.drop(columns=['Capacity', 'Occupancy'])
y_train = trainset['Occupancy']
X_train.head()

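|   | Time | Nachtnetz | GPS_Latitude | GPS_Longitude | Weekday |
|---|------|-----------|--------------|---------------|---------|
| 0 | 1.329027 | 0 | 0.491249 | -1.005551 | 1 |
| 1 | 1.329027 | 0 | 0.491249 | -1.005551 | 1 |
| 2 | 1.417528 | 0 | 0.491249 | -1.005551 | 1 |
| 3 | 1.417528 | 0 | 0.491249 | -1.005551 | 1 |
| 4 | 1.50603 | 0 | 0.491249 | -1.005551 | 1 |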


In [None]:
# Define parameter grids for GridSearchCV
param_grid_lr = {'fit_intercept': [True, False]}
# param_grid_svr = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
param_grid_svr = {'kernel': ['linear'], 'C': [0.1, 1]}
# param_grid_rf = {'n_estimators': [100, 200, 300], 'max_depth': [None, 10, 20]}
param_grid_rf = {'n_estimators': [100], 'max_depth': [10]}


# Initialize models
# verbose=1 shows progress during fitting
# (the SVR instance used in the grid search is created further down)
lr = LinearRegression()
rf = RandomForestRegressor(verbose=1)

# GridSearchCV for Linear Regression
grid_lr = GridSearchCV(lr, param_grid_lr, cv=5, verbose=1)
grid_lr.fit(X_train, y_train)


# GridSearchCV for SVR; cache_size=1000 raises the kernel cache from the default 200 MB to 1000 MB
svr_double = SVR(kernel='linear', verbose=1, cache_size=1000)
grid_svr = GridSearchCV(svr_double, param_grid_svr, cv=5, verbose=1)
grid_svr.fit(X_train, y_train)

# GridSearchCV for Random Forest
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, verbose=1)
grid_rf.fit(X_train, y_train)

# Print best parameters for each model
print("Best parameters for Linear Regression:", grid_lr.best_params_)
print("Best parameters for SVR:", grid_svr.best_params_)
print("Best parameters for Random Forest:", grid_rf.best_params_)

# GridSearchCV refits the best configuration on the full training set
# (refit=True by default), so best_estimator_ is already fitted and
# does not need another call to fit().
best_lr = grid_lr.best_estimator_
best_svr = grid_svr.best_estimator_
best_rf = grid_rf.best_estimator_



Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[LibSVM].

In [None]:
from datetime import datetime
from joblib import dump

# Get the current date and time
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Define the directory where the models are saved and create it if it does not exist
directory_path = 'models/'
os.makedirs(directory_path, exist_ok=True)

# Define the file names with timestamps and full path
lr_model_filename = f'{directory_path}linear_regression_model_{timestamp}.joblib'
svr_model_filename = f'{directory_path}svr_model_{timestamp}.joblib'
rf_model_filename = f'{directory_path}random_forest_model_{timestamp}.joblib'


# Save Linear Regression model
dump(best_lr, lr_model_filename)

# Save SVR model
dump(best_svr, svr_model_filename)

# Save Random Forest model
dump(best_rf, rf_model_filename)

# Save scaler
dump(scaler, 'scaler.joblib')
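
To reuse saved models later, the files can be read back with `joblib.load`. The round trip below is a self-contained sketch with a toy model and a temporary directory; the file names and data are placeholders, not the notebook's actual artifacts.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data following y = 2x + 1
X_io = np.array([[0.0], [1.0], [2.0], [3.0]])
y_io = np.array([1.0, 3.0, 5.0, 7.0])

scaler_demo = StandardScaler().fit(X_io)
model_demo = LinearRegression().fit(scaler_demo.transform(X_io), y_io)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    scaler_path = os.path.join(tmp_dir, "scaler.joblib")
    dump(model_demo, model_path)
    dump(scaler_demo, scaler_path)

    # Later (or in another session): load both objects again
    loaded_model = load(model_path)
    loaded_scaler = load(scaler_path)

# New inputs must pass through the same scaler before prediction
X_new = np.array([[4.0]])
prediction = loaded_model.predict(loaded_scaler.transform(X_new))
print(prediction)
```

Saving the scaler together with the models matters because a model trained on scaled features gives meaningless predictions on unscaled input.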

In [None]:
# Import necessary libraries
from sklearn.metrics import r2_score

# Build the test features and target the same way as for training
# (the feature columns of testset were already scaled with the training scaler)
X_test = testset.drop(columns=['Capacity', 'Occupancy'])
y_test = testset['Occupancy']

# Predict on the test set
y_pred_lr = best_lr.predict(X_test)
y_pred_svr = best_svr.predict(X_test)
y_pred_rf = best_rf.predict(X_test)

# Calculate R^2 score for each model
r2_lr = r2_score(y_test, y_pred_lr)
r2_svr = r2_score(y_test, y_pred_svr)
r2_rf = r2_score(y_test, y_pred_rf)

print("Linear Regression R^2 Score:", r2_lr)
print("SVR R^2 Score:", r2_svr)
print("Random Forest R^2 Score:", r2_rf)


In [None]:
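# No inverse transform is needed for the target: the scaler was fit only on
# Time, GPS_Latitude and GPS_Longitude, so Occupancy was never scaled and the
# predictions are already in the original units.
# Calling scaler.inverse_transform on a single column would raise an error,
# because the scaler expects the three feature columns it was fit on.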


In [None]:
# Note: from the next run on, change the output to occupancy