# Taxi Demand Prediction - Support Vector Machine
---
In this notebook, we build a model for predicting taxi demand in Chicago. The model is based on the taxi trip data for 2015 provided by the City of Chicago.

Furthermore, the model uses all of the available data. Because the prediction target, taxi demand, is computed by aggregating individual trips, the data used for modeling is far smaller than the raw trip records, so we do not expect memory problems during training.

To build our demand prediction model, we proceed as follows:

In [1]:
cd ..

/Users/simonwolf/git/aaa21


In [2]:
import utils, feature_engineering, geo_engineering, preprocessing, prediction_utils, prediction_svm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Daily Models
---
Explanations...

## Data Preparation
---

In [3]:
# Takes a few minutes to run (16 GB RAM)
chicago_df = utils.read_parquet('Taxi_Trips_Cleaned.parquet',
                                columns=['Trip ID','Trip Start Timestamp','Pickup Community Area',
                                         'Dropoff Community Area'])
weather_df = utils.read_parquet('Weather.parquet',columns = ['Trip Start Timestamp','Humidity(%)',
                                    'Pressure(hPa)','Temperature(C)',
                                    'Wind Direction(Meteoro. Degree)','Wind Speed(M/S)'])

daily_demand = preprocessing.create_aggregated_data(df=chicago_df,weather_df=weather_df,temporal_resolution='D')
#daily_demand_hex_7 = preprocessing.create_aggregated_data(df=chicago_df,weather_df=weather_df,temporal_resolution='D',
#                                            use_hexes=True,hex_resolution=7)
#daily_demand_hex_6 = preprocessing.create_aggregated_data(df=chicago_df,weather_df=weather_df,temporal_resolution='D',
#                                            use_hexes=True,hex_resolution=6)

del chicago_df
del weather_df
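
The internals of `preprocessing.create_aggregated_data` are not shown in this notebook. As a rough illustration of what a daily aggregation of this kind could look like, the sketch below counts trips per day and pickup community area and attaches the daily mean weather. The helper name `aggregate_daily_demand` and the toy values are assumptions made for this example only; the project's own function may work differently.

```python
import pandas as pd

# Toy stand-ins for the trip and weather tables (values are made up for illustration).
trips = pd.DataFrame({
    "Trip ID": ["a", "b", "c", "d"],
    "Trip Start Timestamp": pd.to_datetime([
        "2015-01-01 08:15", "2015-01-01 17:30",
        "2015-01-01 09:00", "2015-01-02 12:45",
    ]),
    "Pickup Community Area": [8.0, 8.0, 32.0, 8.0],
})
weather = pd.DataFrame({
    "Trip Start Timestamp": pd.to_datetime([
        "2015-01-01 00:00", "2015-01-01 12:00", "2015-01-02 00:00",
    ]),
    "Temperature(C)": [-6.0, -4.0, 1.0],
})


def aggregate_daily_demand(trips, weather):
    """Hypothetical helper: count trips per day and pickup area, then add daily mean weather."""
    trip_day = trips["Trip Start Timestamp"].dt.floor("D")
    demand = (
        trips.groupby([trip_day, "Pickup Community Area"])["Trip ID"]
        .count()
        .rename("Demand (D)")
        .reset_index()
    )

    weather_day = weather["Trip Start Timestamp"].dt.floor("D")
    daily_weather = (
        weather.drop(columns="Trip Start Timestamp")
        .groupby(weather_day)
        .mean()
        .reset_index()
    )

    # Every (day, area) row receives the weather averages of its day.
    return demand.merge(daily_weather, on="Trip Start Timestamp", how="left")


daily_demand_sketch = aggregate_daily_demand(trips, weather)
print(daily_demand_sketch)
```

The result has one row per day and community area, which is why the aggregated table stays small even when the raw trip table is large.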

### Daily Model - Community Areas
---

In [4]:
daily_demand

,Trip Start Timestamp,Pickup Community Area,Demand (D),Humidity(%),Pressure(hPa),Temperature(C),Wind Direction(Meteoro. Degree),Wind Speed(M/S)
0,2015-01-01,1.0,406,100.000000,1034.250777,-5.435774,249.095220,9.069681
1,2015-01-01,10.0,32,100.000000,1034.250777,-5.435774,249.095220,9.069681
2,2015-01-01,11.0,73,100.000000,1034.250777,-5.435774,249.095220,9.069681
3,2015-01-01,12.0,15,100.000000,1034.250777,-5.435774,249.095220,9.069681
4,2015-01-01,13.0,39,100.000000,1034.250777,-5.435774,249.095220,9.069681
...,...,...,...,...,...,...,...,...
25240,2015-12-31,73.0,1,75.061278,1024.788420,-2.917027,265.509481,5.001468
25241,2015-12-31,75.0,1,75.061278,1024.788420,-2.917027,265.509481,5.001468
25242,2015-12-31,76.0,1450,75.061278,1024.788420,-2.917027,265.509481,5.001468
25243,2015-12-31,77.0,737,75.061278,1024.788420,-2.917027,265.509481,5.001468


In [5]:
import datetime

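# Work on a copy so that re-running this cell does not fail once pop() has removed the target column
df = daily_demand.copy()
tscv = prediction_svm.TimeBasedCV(train_period=333,
                   test_period=31,
                   freq='days')

# Separate the target from the features
y = df.pop("Demand (D)")
X = df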
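
`prediction_svm.TimeBasedCV` is a project-specific class whose implementation is not part of this notebook. The sketch below only illustrates the general idea of a chronological split with a 333-day training window followed by a 31-day test window. The helper `time_based_split` is a hypothetical name introduced for this example, and it produces a single fold, which may differ from what the project class does.

```python
import pandas as pd


def time_based_split(df, date_column, train_days, test_days):
    """Hypothetical helper: return (train_index, test_index) for one chronological fold."""
    dates = pd.to_datetime(df[date_column])
    train_start = dates.min().normalize()
    train_end = train_start + pd.Timedelta(days=train_days)
    test_end = train_end + pd.Timedelta(days=test_days)

    # Half-open windows: [train_start, train_end) and [train_end, test_end)
    train_mask = (dates >= train_start) & (dates < train_end)
    test_mask = (dates >= train_end) & (dates < test_end)
    return df.index[train_mask.to_numpy()], df.index[test_mask.to_numpy()]


# Toy frame with one row per day of 2015 (the real data has one row per day and community area).
toy = pd.DataFrame(
    {"Trip Start Timestamp": pd.date_range("2015-01-01", "2015-12-31", freq="D")}
)
train_index, test_index = time_based_split(
    toy, "Trip Start Timestamp", train_days=333, test_days=31
)
print(len(train_index), len(test_index))
```

Because the test window always lies after the training window, the model is never evaluated on days that precede its training data. Note that 333 + 31 = 364, so in this sketch the last day of 2015 falls outside both windows.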

In [6]:
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import math

scores = []
for train_index, test_index in tscv.split(X):

    # The timestamp column is not used as a model feature
    data_train   = X.loc[train_index].drop('Trip Start Timestamp', axis=1)
    target_train = y.loc[train_index]

    data_test    = X.loc[test_index].drop('Trip Start Timestamp', axis=1)
    target_test  = y.loc[test_index]

    # TODO: tune C and epsilon; the values below are the scikit-learn defaults
    regr = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.1))
    regr.fit(data_train, target_train)

    # Predict once and reuse the predictions for every metric
    y_pred = regr.predict(data_test)
    mse = metrics.mean_squared_error(target_test, y_pred)
    r2score = metrics.r2_score(target_test, y_pred)
    scores.append(r2score)

    print("-------MODEL SCORES-------")
    print(f"MAE: {metrics.mean_absolute_error(target_test, y_pred):.3f}")
    print(f"MSE: {mse:.3f}")
    print(f"RMSE: {math.sqrt(mse):.3f}")
    print(f"R2: {100 * r2score:.3f} %")

# Average R2 score over all folds (R2 measures goodness of fit, not accuracy)
average_r2score = np.mean(scores)
#print(average_r2score)
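
The SVR above is fitted with fixed values `C=1.0` and `epsilon=0.1`. One possible way to tune them is a grid search with chronologically ordered folds. The sketch below uses scikit-learn's `GridSearchCV` with `TimeSeriesSplit` on synthetic data; the data, the parameter grid, and the choice of mean absolute error as the selection metric are assumptions made for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic, chronologically ordered data standing in for the daily features and demand.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 50.0 * X_toy[:, 0] + rng.normal(scale=5.0, size=200)

pipeline = make_pipeline(StandardScaler(), SVR())

# make_pipeline names each step after its lowercased class name, hence the "svr__" prefix.
param_grid = {
    "svr__C": [1.0, 10.0, 100.0],
    "svr__epsilon": [0.1, 1.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on the past and validates on later rows
    scoring="neg_mean_absolute_error",
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Keeping the scaler inside the pipeline means it is refitted on the training part of every fold, so no statistics from the validation rows leak into the scaling.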