# NYC Taxi daily average pickups modelling

<font size=3>

What's in this Notebook?
    
<p>

</p>

- **Subset the resulting dataset and drop null values**

<p>

</p>

- **Split the dataset into train and test sets**

<p>

</p>
    
- **Scale the variables, train the model and evaluate its performance**

# Subset the dataset and drop null values

<font size=3>
    
We will use the `income_per_capita` and `nonfamily_households` variables to build a model that predicts the daily average number of taxi pickups (`avg_pickups`).

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns

In [3]:
# Read the dataset using GeoPandas 

taxi_gdf_filtered = gpd.read_file('taxi_gdf_filtered.gpkg')

In [4]:
taxi_gdf_filtered.head()

Unnamed: 0,number_pickups,cartodb_id,geoid,do_date,total_pop,households,male_pop,female_pop,median_age,male_under_5,...,not_in_labor_force,armed_forces,civilian_labor_force,do_label,boro_name,avg_pickups,zscore,white_pop_rel,employed_pop_rel,geometry
0,51,142301,340179801001,2011-01-01,0.0,0.0,0.0,0.0,,0.0,...,0.0,0.0,0.0,Block Group 1,Manhattan,1,0.193254,,,"MULTIPOLYGON (((-74.04366 40.69804, -74.04170 ..."
1,5,127295,360050016002,2011-01-01,884.0,322.0,379.0,505.0,33.8,44.0,...,334.0,0.0,284.0,Block Group 2,Bronx,1,0.193254,0.010181,0.435275,"MULTIPOLYGON (((-73.85827 40.81763, -73.86198 ..."
2,23,201291,360050016004,2011-01-01,1540.0,389.0,749.0,791.0,41.1,75.0,...,767.0,0.0,459.0,Block Group 4,Bronx,1,0.193254,0.02987,0.359706,"MULTIPOLYGON (((-73.86198 40.81714, -73.85827 ..."
3,1,102134,360050020001,2011-01-01,1499.0,413.0,859.0,640.0,49.8,0.0,...,490.0,0.0,827.0,Block Group 1,Bronx,1,0.193254,0.00934,0.58694,"MULTIPOLYGON (((-73.86571 40.81662, -73.86475 ..."
4,6,103494,360050020003,2011-01-01,2888.0,851.0,1070.0,1818.0,29.7,76.0,...,1375.0,0.0,771.0,Block Group 3,Bronx,1,0.193254,0.020083,0.258621,"MULTIPOLYGON (((-73.86997 40.81827, -73.86955 ..."


In [5]:
# Select the variables used by the model: two features plus the target (geometry is not kept)

selected_features = ['income_per_capita', 'nonfamily_households', 'avg_pickups']

taxi_selection = taxi_gdf_filtered[selected_features].copy()

In [6]:
taxi_selection.head()

|   | income_per_capita | nonfamily_households | avg_pickups |
|---|---|---|---|
| 0 | NaN | 0.0 | 1 |
| 1 | 26937.0 | 124.0 | 1 |
| 2 | 13832.0 | 57.0 | 1 |
| 3 | 16977.0 | 93.0 | 1 |
| 4 | 5709.0 | 171.0 | 1 |


In [7]:
# Drop rows with null values (dropna returns a new dataframe, so no explicit copy is needed)

taxi_selection_notna = taxi_selection.dropna()
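
Before dropping rows, it can help to check how many would be lost. The following is a minimal sketch, assuming a small hypothetical frame `demo` that stands in for `taxi_selection`; its values are illustrative only and not taken from the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for taxi_selection; values are illustrative only
demo = pd.DataFrame({
    "income_per_capita": [np.nan, 26937.0, 13832.0, 16977.0],
    "nonfamily_households": [0.0, 124.0, np.nan, 93.0],
    "avg_pickups": [1, 1, 1, 1],
})

# Count missing values per column before dropping anything
null_counts = demo.isna().sum()

# dropna() returns a new DataFrame without the rows that contain any null
demo_notna = demo.dropna()
rows_dropped = len(demo) - len(demo_notna)

print(null_counts)
print(f"Dropped {rows_dropped} of {len(demo)} rows")
```

If a large share of rows is dropped, it may be worth checking whether the nulls cluster in particular areas before modelling.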

# Split the dataset into train and test sets

<font size=3> 
    
We will first separate the variables used for prediction (`X`) from the target variable we want to predict (`y`), and then split both into training and test sets.

In [8]:
# Create X (features to be used for prediction) and y (target value) variables

X = taxi_selection_notna.drop('avg_pickups', axis=1).values

y = taxi_selection_notna['avg_pickups'].values

In [9]:
# Create training and test datasets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=101)
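
To see what the split produces, here is a small sketch on hypothetical random data (the arrays `X_demo` and `y_demo` are assumptions for illustration, not the taxi dataset). It shows the resulting shapes and that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=101
)

# With 100 rows and test_size=0.33, 33 rows go to the test set
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=101
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```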

# Scale the variables, train the model and evaluate its performance

<font size=3>
    
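We'll scale the variables so that no feature dominates purely because of its numerical magnitude. Tree-based models such as Random Forests are largely insensitive to feature scale, so this step is a precaution rather than a requirement. The scaler is fitted on the training set only and then applied to the test set.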

In [10]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train_std = sc.fit_transform(X_train)

X_test_std = sc.transform(X_test)
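
As a rough illustration of what the scaler does, the sketch below uses a few hypothetical rows (illustrative values, not the real split). `fit_transform` learns the per-column mean and standard deviation from the training rows, and `transform` reuses those same statistics on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (illustrative only)
X_train_demo = np.array([[26937.0, 124.0], [13832.0, 57.0], [16977.0, 93.0]])
X_test_demo = np.array([[5709.0, 171.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test_demo)        # reuses the train statistics

# After scaling, each training column has mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

# The test rows are standardized with the training mean and scale
expected_test = (X_test_demo - scaler.mean_) / scaler.scale_
```

Fitting on the training set only avoids leaking information from the test set into the preprocessing step.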

<font size=3>

We'll use a **Random Forest** model. The main reason is relative familiarity with it (compared with other models, at least!), but also that the behavior seen in the previous notebook, where the data was filtered through widgets, resembles a sequence of human decisions, much like the splits in decision trees.
    
<p></p><p></p><p></p>
    
<font size=3>
    
RF models combine the predictions of multiple **Decision Trees** (as many as the number of estimators) to compute the result. Each tree is trained on a random (bootstrap) sample of the training data, and for regression the individual predictions are averaged.

In [13]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=5000)

rf.fit(X_train_std, y_train)

y_pred = rf.predict(X_test_std)
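
The way a Random Forest regressor combines its trees can be checked directly. The sketch below uses hypothetical synthetic data and a small forest (25 trees, an arbitrary choice to keep it fast) and compares the forest's prediction with the plain average of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging the per-tree predictions reproduces the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])
print(np.allclose(manual_average, forest_prediction))
```

Setting `random_state` on the forest, as done here, also makes the fitted model reproducible between runs.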

<font size=3>

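We'll use the **R2 score** (coefficient of determination) to evaluate the model's performance. The best possible score is 1. A model that always predicts the mean of the target scores 0, and the score can be negative when a model does worse than that. The higher the value, the better the model performs.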
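
To make the metric concrete, here is a minimal sketch of the R2 computation on a tiny hypothetical target vector. The helper `r2_manual` is an illustrative name, not part of scikit-learn. Note that the score drops below 0 when predictions are worse than simply predicting the mean.

```python
import numpy as np

def r2_manual(y_true, y_hat):
    """R2 = 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_manual(y_true, y_true)                  # 1.0: exact predictions
mean_only = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # 0.0: always predict the mean
worse = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])      # -3.0: worse than the mean

print(perfect, mean_only, worse)
```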
    
    
<p></p><p></p><p></p>
    
<font size=3>
    
In this case, the model's R2 score is about 0.50, meaning it explains roughly half of the variance in `avg_pickups` on the test set. The predictions are not completely misleading, but they are not accurate enough to be relied on with confidence.
    
    
<p></p><p></p><p></p>
    
<font size=3>
    
The model's performance could potentially be improved by:
    
<p></p><p></p><p></p>
    
<font size=3>
    
 
1. **Investigating the data further** to understand the relationship between sociodemographic factors and NYC Taxi pickups
    
<p></p><p></p><p></p>
    
<font size=3>
    
    
2. Doing further **feature engineering** (creating secondary variables from the available data) to find variables that correlate more strongly with pickups
    
<p></p><p></p><p></p>
    
<font size=3>
    
    
3. Using **additional data** that could explain the main activity in the Manhattan area, such as POIs (offices, restaurants, hotels, places of tourist interest, etc.)
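
As one possible illustration of the feature-engineering idea, the sketch below derives a relative variable in the spirit of the existing `white_pop_rel` and `employed_pop_rel` columns. The column name `nonfamily_households_rel` is hypothetical, and the rows are a small illustrative sample rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are illustrative
demo = pd.DataFrame({
    "households": [0.0, 322.0, 389.0, 413.0],
    "nonfamily_households": [0.0, 124.0, 57.0, 93.0],
})

# Share of nonfamily households per block group.
# Block groups with 0 households become NaN instead of inf or a division error.
demo["nonfamily_households_rel"] = (
    demo["nonfamily_households"] / demo["households"].replace(0, np.nan)
)

print(demo)
```

Whether such a ratio actually helps would need to be checked against the model's score; it is only a candidate feature.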

In [14]:
from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))

0.5021049084822959


<font size=4><b>


Read the conclusions and possible improvements in [`5 - Discussion`](discussion.md)

</font>