# Regression - Taxi Prediction Model

This project combines several datasets that I cleaned and analyzed: the number of passengers and flights arriving at the airport, taxi counts, rideshare data (services such as Uber and Lyft), and daily temperature data. I use them to build a prediction model for YVR airport with machine learning techniques such as Decision Trees and Regression. This notebook covers the linear regression model. The purpose of the model is to predict how many taxis are required at the airport during each hour of the day.

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import warnings
from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings("ignore", category=DataConversionWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)


cleaned_data = pd.read_csv('merged_fixed.csv')


In [2]:
# Dropping rows with missing values in "ride.count" column since there is no 2019 
cleaned_data = cleaned_data.dropna(subset=["ride.count"])


In [3]:
# Fill in missing values for taxi wait with the column mean
cleaned_data['taximinwait'] = pd.to_numeric(cleaned_data['taximinwait'], errors='coerce')
cleaned_data['taximinwait'] = cleaned_data['taximinwait'].fillna(cleaned_data['taximinwait'].mean())
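
To make the imputation step concrete, here is a minimal sketch on made-up values (not the airport data). It shows how `errors='coerce'` turns non-numeric entries into NaN and how `fillna` then replaces every NaN with the mean of the valid values. The series contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wait times; "n/a" stands in for any non-numeric entry.
waits = pd.Series(["10.0", "n/a", "30.0", None], name="taximinwait")

numeric = pd.to_numeric(waits, errors="coerce")  # non-numeric values become NaN
filled = numeric.fillna(numeric.mean())          # mean of the valid values: (10 + 30) / 2

print(filled.tolist())  # [10.0, 20.0, 30.0, 20.0]
```

Because every missing entry receives the same value, a mean-filled column shows one number repeated across otherwise unrelated rows.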



In [4]:
cleaned_data

Unnamed: 0.1,Unnamed: 0,date,Entry.hour,Domestic,Transborder,Latin America,Asia Pacific,Europe,Middle East,Africa,total_passengers,TaxiCount,taximinwait,ride.count,precip_amt,temp
14377,14378,2021-01-01,10,981,49,0,34,0,0,0,1064,18.0,43.268333,26.0,,
14378,14379,2021-01-01,11,344,0,0,0,0,0,0,344,12.0,55.350000,26.0,,
14379,14380,2021-01-01,12,374,48,0,42,0,0,0,464,18.0,42.297222,33.0,,
14380,14381,2021-01-01,13,147,20,0,0,0,0,0,167,32.0,32.333499,13.0,,
14381,14382,2021-01-01,14,16,0,0,0,0,0,0,16,4.0,89.225000,7.0,,
14382,14383,2021-01-01,15,175,0,0,299,0,0,0,474,7.0,34.112857,10.0,,
14383,14384,2021-01-01,16,337,0,0,0,0,0,0,337,58.0,32.333499,7.0,,
14384,14385,2021-01-01,17,93,0,0,0,0,0,0,93,6.0,32.333499,5.0,,
14385,14386,2021-01-01,18,82,0,0,0,0,0,0,82,16.0,47.728125,5.0,,
14386,14387,2021-01-01,19,366,0,0,0,0,0,0,366,26.0,21.005769,6.0,,


In [5]:
# Select the feature columns (X) and the target column (y)
X = cleaned_data[['Entry.hour', 'ride.count', 'precip_amt', 'total_passengers','temp']]
y = cleaned_data['TaxiCount']


In [6]:

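X = X.fillna(0)  # Replace the remaining missing feature values (e.g. precip_amt, temp) with 0
scaler = StandardScaler()
# Note: the scaler is fit on all rows before the split, so the test rows also contribute to its mean and variance
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=37)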
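
The cell above calls `scaler.fit_transform(X)` before `train_test_split`, so the held-out rows influence the scaling statistics. One possible alternative, sketched here on synthetic data rather than the airport dataset, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and `pipe` are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five features and the target (not the YVR data).
rng = np.random.default_rng(37)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 2.0, -1.0]) + rng.normal(size=200)

# Split first, so the held-out rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=37)

# The pipeline fits StandardScaler on the training rows only and reuses
# those statistics when it transforms the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))  # True
print(pipe.score(X_te, y_te))  # R^2 on the held-out rows
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.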

In [7]:
# Count NaN and infinite values in the training features and the training target
print(np.isnan(X_train).sum())
print(np.isinf(X_train).sum())
print(np.isnan(y_train).sum())
print(np.isinf(y_train).sum())

0
0
73
0


In [8]:
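# Replace the NaN targets found above with 0
# Note: this treats hours with a missing TaxiCount as hours with zero taxis
y_train = np.nan_to_num(y_train)
y_test = np.nan_to_num(y_test)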
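
`np.nan_to_num` turns the 73 missing targets in `y_train` into zeros, which the model then learns from as if no taxis were present. A hedged alternative is to drop rows whose target is missing before building `X` and `y`. The sketch below uses made-up rows (the column names follow the notebook, the values do not) to show how the two choices shift the target mean.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the notebook, values are made up.
demo = pd.DataFrame({
    "total_passengers": [1064, 344, 464, 167],
    "TaxiCount": [18.0, np.nan, 18.0, 32.0],
})

# Zero-filling treats the unlabeled hour as "0 taxis".
zero_filled = np.nan_to_num(demo["TaxiCount"].to_numpy())
# Dropping removes the unlabeled hour instead.
dropped = demo.dropna(subset=["TaxiCount"])

print(zero_filled.mean())            # (18 + 0 + 18 + 32) / 4 = 17.0
print(dropped["TaxiCount"].mean())   # (18 + 18 + 32) / 3, about 22.67
```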

In [9]:
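# Grid search with 5-fold cross-validation
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release
param_grid = {'fit_intercept': [True, False], 'normalize': [True, False]}
model = LinearRegression()
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)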

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'fit_intercept': [True, False], 'normalize': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [10]:
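best_model = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated score of the best parameters (R^2 by default for regressors)
print("Best score:", grid_search.best_score_)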


Best parameters: {'fit_intercept': True, 'normalize': True}
Best score: 0.3533518150570999
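
Since no `scoring` argument is passed, the "Best score" is the mean R^2 across the five validation folds, not an error measure. The sketch below, on synthetic data, checks that reading by recomputing the score with `cross_val_score`. It searches only `fit_intercept`, because the `normalize` option no longer exists in scikit-learn 1.2 and later. The data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data, not the airport dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=120)

# Search over fit_intercept only, scoring each candidate by R^2.
search = GridSearchCV(LinearRegression(), {"fit_intercept": [True, False]}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

# Recompute the winning configuration's score by hand:
# the mean R^2 over the same 5 (unshuffled) folds.
manual = cross_val_score(
    LinearRegression(**search.best_params_), X_demo, y_demo, cv=5, scoring="r2"
).mean()

print(search.best_params_)
print(np.isclose(search.best_score_, manual))  # True
```

Read this way, a best score of about 0.35 means the linear model explains roughly a third of the variance in `TaxiCount` on the validation folds.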


In [11]:
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 996.0801681689037
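
The MSE is expressed in squared taxis, which is hard to interpret. Taking the square root gives the RMSE in the same unit as `TaxiCount`. A short worked calculation using the value printed above:

```python
import math

mse = 996.0801681689037  # value printed by the cell above
rmse = math.sqrt(mse)    # back to the unit of the target (taxis)

print(round(rmse, 2))  # 31.56
```

In other words, the typical prediction error on the test set is on the order of 32 taxis per row.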


In [12]:
# Making predictions
# The number of taxis needed for each entry hour

# Feature order expected by the scaler and the model:
# 0: Entry.hour, 1: ride.count, 2: precip_amt, 3: total_passengers, 4: temp
feature_cols = ['Entry.hour', 'ride.count', 'precip_amt', 'total_passengers', 'temp']

model.fit(X_train, y_train)  # Fits the default LinearRegression, not best_model from the grid search
scale_factor = 1.3  # Scaling to consider any unforeseen factors or variability in the data

unique_entry_hours = sorted(cleaned_data['Entry.hour'].unique())
for entry_hour in unique_entry_hours:
    entry_hour_data = cleaned_data[cleaned_data['Entry.hour'] == entry_hour]
    entry_hour_data_scaled = scaler.transform(entry_hour_data[feature_cols])
    entry_hour_data_scaled = np.nan_to_num(entry_hour_data_scaled)  # NaN -> 0 after scaling

    # Scaling based on factors
    # NOTE: these indices do not match the feature order above: index 3 is
    # total_passengers, index 1 is ride.count, and index 2 is precip_amt.
    ride_count_avg = entry_hour_data_scaled[:, 3].mean()
    precip_amt = entry_hour_data_scaled[:, 1]
    total_passengers = entry_hour_data_scaled[:, 1]
    temp = entry_hour_data_scaled[:, 2]
    # At entry hours 3 to 11 the ride.count average is larger than the TaxiCount average.
    # NOTE: scaling_factor is computed here but is not used in the prediction below.
    scaling_factor = np.where((entry_hour >= 3) & (entry_hour <= 11), ride_count_avg * temp, -ride_count_avg)

    predicted_count = model.predict(entry_hour_data_scaled) * scale_factor * (1 + precip_amt) / (1 + total_passengers) * (1 + temp)
    predicted_count = np.clip(predicted_count, 14, 116)  # Based on low and high-end average of cabs at each entry hour
    predicted_count = predicted_count.astype(int)

    # Only the first row's prediction for this hour is printed
    print(f"At entry hour {entry_hour}, {predicted_count[0]} cabs are needed.")


At entry hour 0, 17 cabs are needed.
At entry hour 1, 14 cabs are needed.
At entry hour 2, 14 cabs are needed.
At entry hour 3, 15 cabs are needed.
At entry hour 4, 80 cabs are needed.
At entry hour 5, 26 cabs are needed.
At entry hour 6, 36 cabs are needed.
At entry hour 7, 39 cabs are needed.
At entry hour 8, 41 cabs are needed.
At entry hour 9, 41 cabs are needed.
At entry hour 10, 75 cabs are needed.
At entry hour 11, 49 cabs are needed.
At entry hour 12, 55 cabs are needed.
At entry hour 13, 46 cabs are needed.
At entry hour 14, 43 cabs are needed.
At entry hour 15, 63 cabs are needed.
At entry hour 16, 59 cabs are needed.
At entry hour 17, 52 cabs are needed.
At entry hour 18, 53 cabs are needed.
At entry hour 19, 66 cabs are needed.
At entry hour 20, 56 cabs are needed.
At entry hour 21, 60 cabs are needed.
At entry hour 22, 58 cabs are needed.
At entry hour 23, 59 cabs are needed.
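
In the prediction loop the scaled array is indexed by position, and the positions used there (3, 1, 1, 2) differ from the order of the feature list (`Entry.hour`, `ride.count`, `precip_amt`, `total_passengers`, `temp`). One way to avoid that kind of mismatch is to look columns up by name. The following is a minimal sketch on made-up, already-scaled rows. The `col` mapping is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

feature_cols = ["Entry.hour", "ride.count", "precip_amt", "total_passengers", "temp"]
col = {name: i for i, name in enumerate(feature_cols)}  # name -> column position

# Two made-up, already-scaled rows in the feature order above.
scaled = np.array([
    [0.0, 1.5, -0.2, 0.8, 0.1],
    [0.0, -0.5, 0.3, -1.1, 0.4],
])

# Each variable now reads from the column its name refers to.
ride_count_avg = scaled[:, col["ride.count"]].mean()
precip_amt = scaled[:, col["precip_amt"]]
total_passengers = scaled[:, col["total_passengers"]]
temp = scaled[:, col["temp"]]

print(col["total_passengers"])  # 3
print(ride_count_avg)           # 0.5
```

Rerunning the loop with name-based lookups would change which features enter the adjustment formula, so the per-hour cab counts would likely differ from the ones printed above.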
