# Predicting target variables using different models

### Prepare the training set (2022) and test set (2023) using 
https://github.com/sagerpascal/uzh-data-science-project/blob/main/WZ/Q5_datacleaning.ipynb

### Data downloaded from:
https://data.stadt-zuerich.ch/dataset/vbz_fahrgastzahlen_ogd
and
https://data.stadt-zuerich.ch/dataset/vbz_fahrzeiten_ogd

In [1]:
import os
import math
import csv

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import statsmodels.api as sma
from sklearn import (linear_model, datasets, metrics,
                     discriminant_analysis)

In [2]:
trainset = pd.read_csv('data/fahrgastzahlen_2022/cleaned.csv', sep=',')
testset = pd.read_csv('data/fahrgastzahlen_2023/cleaned.csv', sep=',')

In [3]:
trainset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960951 entries, 0 to 960950
Data columns (total 14 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Time           960951 non-null  float64
 1   Besetzung      960951 non-null  float64
 2   Tage_SA        960951 non-null  int64  
 3   Tage_SO        960951 non-null  int64  
 4   Nachtnetz      960951 non-null  int64  
 5   Tage_SA_N      960951 non-null  int64  
 6   Tage_SO_N      960951 non-null  int64  
 7   Occupancy      960951 non-null  float64
 8   Freeseats      960951 non-null  float64
 9   GPS_Latitude   960951 non-null  float64
 10  GPS_Longitude  960951 non-null  float64
 11  Weekday        960951 non-null  int64  
 12  Richtung_1     960951 non-null  bool   
 13  Richtung_2     960951 non-null  bool   
dtypes: bool(2), float64(6), int64(6)
memory usage: 89.8 MB


In [4]:
testset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1127063 entries, 0 to 1127062
Data columns (total 14 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   Time           1127063 non-null  float64
 1   Besetzung      1127063 non-null  float64
 2   Tage_SA        1127063 non-null  int64  
 3   Tage_SO        1127063 non-null  int64  
 4   Nachtnetz      1127063 non-null  int64  
 5   Tage_SA_N      1127063 non-null  int64  
 6   Tage_SO_N      1127063 non-null  int64  
 7   Occupancy      1127063 non-null  float64
 8   Freeseats      1127063 non-null  float64
 9   GPS_Latitude   1127063 non-null  float64
 10  GPS_Longitude  1127063 non-null  float64
 11  Weekday        1127063 non-null  int64  
 12  Richtung_1     1127063 non-null  bool   
 13  Richtung_2     1127063 non-null  bool   
dtypes: bool(2), float64(6), int64(6)
memory usage: 105.3 MB


In [5]:
# Standardize the continuous columns (zero mean, unit variance) before using them as input
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only
scaler = StandardScaler()
columns_to_normalize = ['Time', 'Besetzung', 'Occupancy', 'Freeseats', 'GPS_Latitude', 'GPS_Longitude']
scaler.fit(trainset[columns_to_normalize])

# Transform both training and test sets
trainset[columns_to_normalize] = scaler.transform(trainset[columns_to_normalize])
testset[columns_to_normalize] = scaler.transform(testset[columns_to_normalize])

### Define Target and Predictor Variables

Choose the quantity to predict, e.g. free seats or number of passengers. It must be a single variable, used as $y$, and it must be defined for both the training and the test set.

Use all other variables as $X_{train}$ and $X_{test}$. Remove every column from $X_{train}$ and $X_{test}$ that effectively contains the target value. For example, we cannot predict the number of free seats while also giving the model the number of passengers and the number of seats; in that case the number of passengers must be removed.
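
A minimal sketch of this split, assuming `Freeseats` is the target and that `Besetzung` and `Occupancy` leak it (the same choice made in the cell further down). The helper name `split_features_target` and the toy frame are illustrative only and stand in for `trainset` and `testset`.

```python
import pandas as pd

TARGET = 'Freeseats'
# Columns assumed to encode the target directly; they must not reach the model.
LEAKY_COLUMNS = ['Besetzung', 'Occupancy', 'Freeseats']

def split_features_target(df, target=TARGET, leaky=LEAKY_COLUMNS):
    """Return (X, y): y is the target column, X is df without the leaky columns."""
    X = df.drop(columns=leaky)
    y = df[target]
    return X, y

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({
    'Time': [6.5, 8.0, 17.25],
    'Besetzung': [12.0, 40.0, 55.0],
    'Occupancy': [0.2, 0.6, 0.8],
    'Freeseats': [48.0, 20.0, 5.0],
    'Weekday': [1, 1, 5],
})
X_toy, y_toy = split_features_target(toy)
print(list(X_toy.columns))
```

Applying the same function to both `trainset` and `testset` keeps the column sets identical, which the models require at prediction time.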

### Define some metrics

Choose the evaluation metrics. The scikit-learn user guide gives an overview:

https://scikit-learn.org/stable/modules/model_evaluation.html
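
As one possible choice for a regression target, the sketch below collects MAE, RMSE and $R^2$ in a single helper. The function name `regression_metrics` and the three example values are assumptions for illustration; other metrics from the guide could be added the same way.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': float(np.sqrt(mse)),
        'R2': r2_score(y_true, y_pred),
    }

# Made-up values, only to show the call.
scores = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
print(scores)
```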

### Define Some Models

I would suggest using:

- Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- SVR: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
- Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

For each of these models, define a grid of parameters and try them out using GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Fit the models on the training data.
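
The grid search could look roughly like the sketch below. The synthetic data, the parameter grids and the scoring choice are assumptions picked to keep the example small; they are not tuned for the passenger data. Kernel SVR in particular scales poorly with the number of samples, so on roughly a million rows it would probably need a subsample.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# One (estimator, parameter grid) pair per model; the grids are illustrative.
candidates = {
    'linear': (LinearRegression(), {'fit_intercept': [True, False]}),
    'svr': (SVR(), {'C': [0.1, 1.0], 'kernel': ['linear', 'rbf']}),
    'forest': (RandomForestRegressor(random_state=0),
               {'n_estimators': [20, 50], 'max_depth': [3, None]}),
}

best_models = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validation; scikit-learn maximizes scores, hence the negated MAE.
    search = GridSearchCV(estimator, grid, cv=3, scoring='neg_mean_absolute_error')
    search.fit(X_demo, y_demo)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_, round(-search.best_score_, 3))
```

By default `GridSearchCV` refits the best configuration on the full training data, so each entry in `best_models` is ready for `.predict`.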

### Calculate Metrics

Calculate the metrics on the test data (passed to the model via the `.predict` method). Optionally, plot the results as well.
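
A minimal sketch of this step on synthetic data: fit on a training split, call `.predict` on a separate test split, compute two metrics and draw a predicted-versus-actual scatter plot. All names (`make_split`, `model_demo`, `X_tr`, `X_te`) are placeholders, and the linear model is only one of the candidates above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(1)

def make_split(n):
    """Draw n synthetic samples from a noisy linear relationship."""
    X = rng.normal(size=(n, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
    return X, y

X_tr, y_tr = make_split(300)   # stands in for the 2022 data
X_te, y_te = make_split(100)   # stands in for the 2023 data

model_demo = LinearRegression().fit(X_tr, y_tr)
y_pred = model_demo.predict(X_te)   # the test set is only used for prediction

mae = mean_absolute_error(y_te, y_pred)
r2 = r2_score(y_te, y_pred)
print(f"MAE={mae:.3f}  R2={r2:.3f}")

# Points on the diagonal are perfect predictions.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(y_te, y_pred, s=8, alpha=0.6)
lims = [min(y_te.min(), y_pred.min()), max(y_te.max(), y_pred.max())]
ax.plot(lims, lims, color='black', linewidth=1)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title('Predicted vs. actual (test split)')
```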

### Look at feature importance

- Linear Regression: features whose coefficients have larger absolute values have a stronger impact on the predicted outcome (coefficients are only comparable when the features are on the same scale).
- Random Forest: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
- Permutation importance (works for any fitted model): https://scikit-learn.org/stable/modules/permutation_importance.html
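
The three approaches can be compared side by side, as in this sketch. The data are synthetic, with one strong, one weak and one irrelevant feature, and the feature names are made up; on the real data the column names of `X` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ['strong', 'weak', 'noise']  # hypothetical feature names
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
X_fit, X_eval = X_demo[:300], X_demo[300:]
y_fit, y_eval = y_demo[:300], y_demo[300:]

# 1) Linear regression: absolute coefficient size (features share one scale here).
linear = LinearRegression().fit(X_fit, y_fit)
coef_importance = dict(zip(feature_names, np.abs(linear.coef_)))

# 2) Random forest: impurity-based importances computed during training.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)
impurity_importance = dict(zip(feature_names, forest.feature_importances_))

# 3) Permutation importance: score drop on held-out data when a column is shuffled.
perm = permutation_importance(forest, X_eval, y_eval, n_repeats=5, random_state=0)
perm_importance = dict(zip(feature_names, perm.importances_mean))

for label, values in [('coef', coef_importance),
                      ('impurity', impurity_importance),
                      ('permutation', perm_importance)]:
    print(label, {k: round(float(v), 3) for k, v in values.items()})
```

Permutation importance is evaluated on held-out data here, so it reflects what the model uses to generalize rather than what it memorized during training.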


In [6]:
# Define feature variables and target variables
# Target variable set as 'Freeseats' here
X = trainset.drop(columns = ['Besetzung', 'Occupancy', 'Freeseats'])
y = trainset['Freeseats']
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
print(X.columns)

Shape of X: (960951, 11)
Shape of y: (960951,)
Index(['Time', 'Tage_SA', 'Tage_SO', 'Nachtnetz', 'Tage_SA_N', 'Tage_SO_N',
       'GPS_Latitude', 'GPS_Longitude', 'Weekday', 'Richtung_1', 'Richtung_2'],
      dtype='object')


In [7]:
# Convert to NumPy arrays; casting X to float turns the boolean Richtung_* columns into 0/1
X_array = X.astype(float).values
y_array = np.asarray(y)

In [8]:
# Note: sma.OLS does not add an intercept column; X_array is used as-is
model = sma.OLS(y_array, X_array).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.047
Model:                            OLS   Adj. R-squared:                  0.047
Method:                 Least Squares   F-statistic:                     5875.
Date:                Mon, 13 May 2024   Prob (F-statistic):               0.00
Time:                        23:08:33   Log-Likelihood:            -1.3406e+06
No. Observations:              960951   AIC:                         2.681e+06
Df Residuals:                  960942   BIC:                         2.681e+06
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0867      0.001    -84.402      0.0

In [None]:
# Looking suspicious: e.g. Df Residuals = 960942 = n - 9, although X has 11 columns
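
One possible reading of the summary above: `n - Df Residuals = 9`, while `X` has 11 columns, which would point to linearly dependent columns (for instance, `Richtung_1` and `Richtung_2` may be complementary dummies). This has not been verified against the data. The sketch below shows how such a check could be done with NumPy on a made-up frame; the `Dup` column is a hypothetical redundant feature added only for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
d1 = rng.integers(0, 2, size=100)

# Made-up design matrix: two complementary dummies plus an exact duplicate.
toy = pd.DataFrame({
    'Time': rng.normal(size=100),
    'Richtung_1': d1,
    'Richtung_2': 1 - d1,
})
toy['Dup'] = toy['Richtung_1']  # hypothetical redundant column

# Rank below the column count means some columns are linear combinations of others.
rank = np.linalg.matrix_rank(toy.to_numpy(dtype=float))
print("rank:", rank, "columns:", toy.shape[1])

# A column is part of a dependency if dropping it leaves the rank unchanged.
dependent = [
    col for col in toy.columns
    if np.linalg.matrix_rank(toy.drop(columns=col).to_numpy(dtype=float)) == rank
]
print("columns involved in a dependency:", dependent)
```

Running the same two checks on `X.astype(float)` would show whether the design matrix is rank deficient and which columns are candidates for removal before refitting.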