# NYC fare prediction

[This notebook](TBD) shows my approach to predicting the fare amount of a taxi ride in NYC from the passengers' pickup and dropoff locations, as posed in the [New York City Taxi Fare Prediction Challenge](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction).



---

This notebook is separated into sections that follow the common data science workflow (except for the hypothesis and data collection steps, which were already done).


0.   Previous Commits
1.   Setup and Check Infrastructure
2.   Having a first look at the Data (EDA)
3.   Data Cleaning (Feature Engineering)
4.   Linear Regression
5.   Ridge Regression
6.   K-Nearest Neighbors Regression
7.   Evaluation and Discussion

---



# 0. Previous Commits - Comparison of different Models:



Predictions were made using the whole data set.

### Linear Regression

**Commit 1 (Baseline) Score: 5.67093**
 

### Ridge Regression

**Commit 2 Score: 12.54277**

params_ridge
  - alpha = loguniform(1e-5, 1e0)
  - solver = ['eig', 'cd']
  - n_iter = 100
  - cv = 5
  - verbose = 0
  - n_jobs = 1

ridge_params
  - alpha = 0.240960447726532
  - fit_intercept = True
  - normalize = False
  - solver = 'eig'

R² score for Ridge Regression (computed with `r2_score`): -7.212150573730469


### ...
Commit 3 Score:



# 1. Setup and Check Infrastructure

Check for GPU

In [3]:
!nvidia-smi

Sat Sep 19 12:53:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# Install RAPIDS and Dask_ml
!pip install dask_ml
!pip install dask_cuda

!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable


import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

Collecting dask_ml
  Using cached dask_ml-1.6.0-py3-none-any.whl (140 kB)
Collecting dask-glm>=0.2.0
  Using cached dask_glm-0.2.0-py2.py3-none-any.whl (12 kB)
Collecting multipledispatch>=0.4.9
  Using cached multipledispatch-0.6.0-py3-none-any.whl (11 kB)
Installing collected packages: multipledispatch, dask-glm, dask-ml
Successfully installed dask-glm-0.2.0 dask-ml-1.6.0 multipledispatch-0.6.0
Collecting dask_cuda
  Using cached dask_cuda-0.15.0-py3-none-any.whl (44 kB)
Installing collected packages: dask-cuda
Successfully installed dask-cuda-0.15.0
fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
PLEASE READ
********************************************************************************************************
Changes:
1. IMPORTANT CHANGES: RAPIDS on Colab will be pegged to 0.14 Stable until further notice.
2. Default stable version is now 0.14.  Nightly will redirect to 0.14.
3. You can now declare your RAPIDSAI version as a CLI option an

In [43]:
import nvstrings
import numpy as np
import cudf, cuml
import dask_cudf
import io, requests
import math
import gc
import cupy as cp
import pandas as pd

#Plotting
import matplotlib.pyplot as plt
import seaborn as sns 

#Learning
from cuml.preprocessing.model_selection import train_test_split
from scipy.stats import uniform

# Linear Models https://github.com/rapidsai/cuml/tree/branch-0.13/notebooks
from cuml.linear_model import LinearRegression # Linear
from cuml.linear_model import LogisticRegression # Logistic
from cuml.linear_model import ElasticNet # Elastic
from cuml.linear_model import Ridge # Ridge
from cuml.linear_model import Lasso # Lasso
from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor # Mini Batch SGD Regressor

from cuml.solvers import SGD as cumlSGD # Stochastic Gradient Descent
from cuml.ensemble import RandomForestRegressor as cuRF # Random Forest
from cuml.neighbors import KNeighborsRegressor as cumlKNR # Nearest Neighbours
from cuml.svm import SVC # Support Vector Machines

from cuml import ForestInference
import xgboost as xgb

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from cuml.metrics.regression import r2_score
from cuml.metrics.accuracy import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score as sk_acc
from sklearn.utils.fixes import loguniform

In [8]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 2. First look at the Data

In [3]:
cudf.set_allocator("managed")
dtype = {'fare_amount': 'float32',
         'pickup_datetime': 'str',
         'pickup_longitude': 'float32',
         'pickup_latitude': 'float32',
         'dropoff_longitude': 'float32',
         'dropoff_latitude': 'float32',
         'passenger_count': 'int8'}

usecols = list(dtype.keys())

NameError: ignored

In [4]:
%%time
# Use a subset of the rows; the maximum is nrows = 55423480
# The value below is roughly 1% of that
nrows = 542348

test = cudf.read_csv('/content/drive/My Drive/Colab Notebooks/test_taxi.csv', nrows=nrows, usecols=usecols, dtype=dtype)
train = cudf.read_csv('/content/drive/My Drive/Colab Notebooks/train.csv', nrows=nrows, usecols=usecols, dtype=dtype)

NameError: ignored

In [None]:
train.head(5)

In [None]:
test.head(5)

# 3. Data Cleaning

In [1]:
# Drop NaN values (nans_to_nulls returns a new frame, so its result has to be assigned)
train = train.nans_to_nulls()
train = train.dropna()

NameError: ignored

In [2]:
#Checking shape of the data
print("Train: " + str(train.shape))
print("Test: " + str(test.shape))

NameError: ignored

In [None]:
#Changing the data format of pickup_datetime and adding additional information about pickup time
train['pickup_datetime'] = train['pickup_datetime'].astype('datetime64[ns]')

train["hour"] = train.pickup_datetime.dt.hour
train["weekday"] = train.pickup_datetime.dt.weekday
train["month"] = train.pickup_datetime.dt.month
train["year"] = train.pickup_datetime.dt.year


test['pickup_datetime'] = test['pickup_datetime'].astype('datetime64[ns]')

test["hour"] = test.pickup_datetime.dt.hour
test["weekday"] = test.pickup_datetime.dt.weekday
test["month"] = test.pickup_datetime.dt.month
test["year"] = test.pickup_datetime.dt.year
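
Since the same four columns are derived for both `train` and `test`, the step could be wrapped in a single function. The sketch below uses pandas as a CPU stand-in for cuDF; `add_time_features` is a hypothetical helper name that does not appear in the notebook, and the two timestamps are illustrative only.

```python
import pandas as pd

def add_time_features(df, column="pickup_datetime"):
    """Return a copy of df with hour, weekday, month and year columns added."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column])
    out["hour"] = out[column].dt.hour
    out["weekday"] = out[column].dt.weekday  # Monday=0 ... Sunday=6
    out["month"] = out[column].dt.month
    out["year"] = out[column].dt.year
    return out

# Illustrative timestamps, not read from the data set.
sample = pd.DataFrame({"pickup_datetime": ["2015-01-27 13:08:24", "2011-10-08 11:53:44"]})
features = add_time_features(sample)
print(features[["hour", "weekday", "month", "year"]])
```

Calling the helper once for `train` and once for `test` would keep both frames in sync if further time features are added later.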

In [None]:
# Calculate the trip distance in miles (haversine formula)
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180, converts degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
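
A quick sanity check of the formula can be done on the CPU. The sketch below restates the same expression with plain NumPy scalars; `haversine_miles` and the constant names are hypothetical and only used for this illustration. One degree of latitude should come out at roughly 69 miles, and identical points at zero.

```python
import numpy as np

EARTH_DIAMETER_KM = 12742.0  # twice the mean Earth radius of 6371 km
KM_TO_MILES = 0.6213712

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    p = np.pi / 180.0  # degrees to radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return KM_TO_MILES * EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))

same_point = haversine_miles(40.75, -73.98, 40.75, -73.98)
one_degree_lat = haversine_miles(40.0, -74.0, 41.0, -74.0)
print(same_point, one_degree_lat)
```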

In [None]:
train['distance'] = distance(train['pickup_latitude'], train['pickup_longitude'], train['dropoff_latitude'], train['dropoff_longitude'] )
test['distance'] = distance(test['pickup_latitude'], test['pickup_longitude'], test['dropoff_latitude'], test['dropoff_longitude'] )
train['distance'].describe()

In [None]:
#check if everything worked
train.head(10)

In [None]:
test.head(2)

In [None]:
print("Average fare amount: " + str(train['fare_amount'].mean()))
print("Standard deviation fare amount: " + str(train['fare_amount'].std()))
print("Average distance: " + str(train['distance'].mean()) + " miles")
print("Standard deviation distance: " + str(train['distance'].std()) + " miles")

In [None]:
train.describe()

Visualization of the data

The following things were noticed (while using 500k data points):
*   The minimum fare_amount is negative. As this is not realistic, I will drop these rows from the dataset.
*   Some of the minimum and maximum longitude/latitude coordinates are way off. These will also be removed from the dataset (a bounding box will be defined).
*   The average fare_amount is about 9.79 USD with a standard deviation of 7.48 USD. A model that always predicts the average fare would have an RMSE of about 7.48 USD, so a useful predictive model needs an RMSE below that.
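
The link between the standard deviation and the target error can be made concrete: a model that always predicts the mean fare has an RMSE equal to the population standard deviation of the fares. The sketch below shows this with NumPy; the fare values are made up for illustration and `rmse` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative fares in USD, not taken from the data set.
fares = np.array([4.5, 6.1, 8.0, 9.7, 12.5, 30.0])
baseline = np.full_like(fares, fares.mean())  # always predict the mean fare

baseline_rmse = rmse(fares, baseline)
print(baseline_rmse, fares.std())  # both values agree (population std, ddof=0)
```

A sample standard deviation (ddof=1) differs slightly from this value, but with several hundred thousand rows the difference is negligible.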



In [None]:
train = train[train.fare_amount>=0]
train = train[(train['distance'] < 30) & (train['distance'] >=0 )]

In [None]:
fare_amount = train['fare_amount'].to_array()
passenger_count = train['passenger_count'].to_array()
# Named distance_values so that the distance() function defined above is not overwritten
distance_values = train['distance'].to_array()

In [None]:
plt.figure(figsize=(8,5))
sns.kdeplot(fare_amount).set_title("Distribution of the fare amount")

In [None]:
plt.figure(figsize=(8,5))
sns.kdeplot(distance_values).set_title("Distance")

In [None]:
# Check the min/max latitude and longitude of the test data
print("Max lat pickup: " + str(test['pickup_latitude'].max()))
print("Max lat dropoff: " + str(test['dropoff_latitude'].max()))
print("Max lon pickup: " + str(test['pickup_longitude'].max()))
print("Max lon dropoff: " + str(test['dropoff_longitude'].max()))
print("")
print("Min lat pickup: " + str(test['pickup_latitude'].min()))
print("Min lat dropoff: " + str(test['dropoff_latitude'].min()))
print("Min lon pickup: " + str(test['pickup_longitude'].min()))
print("Min lon dropoff: " + str(test['dropoff_longitude'].min()))

Bounding Box New York (minimum and maximum coordinates of the test data)
<table>
  <tr>
    <th></th>
    <th>Dropoff</th>
    <th>Pickup</th>
  </tr>
  <tr>
    <td>Max Long</td>
    <td>-72.99096</td>
    <td>-72.986534</td>
   </tr>
   <tr>
    <td>Max Lat</td>
    <td>41.696682</td>
    <td>41.709553</td>
   </tr>
   <tr>
    <td>Min Long</td>
    <td>-74.26323</td>
    <td>-74.25219</td>
    </tr>
   <tr>
    <td>Min Lat</td>
    <td>40.568974</td>
    <td>40.57314</td>
   </tr>
</table>



In [None]:
train.head(2)

In [None]:
#Parts of train data are too far away, so they can be dropped
train = train[(train['pickup_longitude'] > -74.25) & (train['pickup_longitude'] < -72.98)]
train = train[(train['pickup_latitude'] > 40.57) & (train['pickup_latitude'] < 41.70)]
train = train[(train['dropoff_longitude'] < -72.99) & (train['dropoff_longitude'] > -74.26)]
train = train[(train['dropoff_latitude'] > 40.56) & (train['dropoff_latitude'] < 41.69)]
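
The four filters above can also be expressed as one table of bounds applied in a loop, which keeps the limits in a single place. This is a minimal sketch that uses pandas as a CPU stand-in for cuDF; `BOUNDS` and `within_bounds` are hypothetical names, and the two sample rides are illustrative.

```python
import pandas as pd

# Bounds copied from the filter above: (min, max) per coordinate column.
BOUNDS = {
    "pickup_longitude": (-74.25, -72.98),
    "pickup_latitude": (40.57, 41.70),
    "dropoff_longitude": (-74.26, -72.99),
    "dropoff_latitude": (40.56, 41.69),
}

def within_bounds(df, bounds=BOUNDS):
    """Keep only rows whose coordinates lie strictly inside every interval."""
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in bounds.items():
        mask &= (df[column] > low) & (df[column] < high)
    return df[mask]

# Illustrative rides: the second one has a pickup far outside the box.
rides = pd.DataFrame({
    "pickup_longitude": [-73.98, 0.0],
    "pickup_latitude": [40.75, 0.0],
    "dropoff_longitude": [-73.95, -73.95],
    "dropoff_latitude": [40.78, 40.78],
})
kept = within_bounds(rides)
print(len(kept))
```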

In [None]:
dropoff_longitude = train['dropoff_longitude'].to_array()
dropoff_latitude = train['dropoff_latitude'].to_array()

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

plt.figure(figsize=(10,6))
plt.scatter(dropoff_longitude, dropoff_latitude,
                color='green', 
                s=.02, alpha=.6)
plt.title("Dropoffs")

plt.ylim(city_lat_border)
plt.xlim(city_long_border)

In [None]:
unnecessary_columns=['pickup_datetime','dropoff_latitude','pickup_latitude','dropoff_longitude','pickup_longitude']
train=train.drop(unnecessary_columns,axis=1)
test=test.drop(unnecessary_columns,axis=1)

In [None]:
train.head(2)

In [None]:
test.head(2)

# 4. Linear Regression

In [29]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  39130
Number of records in validation data  9782
(39130, 6)
(9782, 6)
(39130,)
(9782,)


In [None]:
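lm = LinearRegression(fit_intercept = True, normalize = False,
                      algorithm = "eig")
lm.fit(X_train,y_train)
y_pred=lm.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not the RMSE.
# cuML documents the argument order as r2_score(y, y_hat), i.e. true values first;
# the predictions are passed first here, which is how the value below was produced.
lm_r2 = r2_score(y_pred, y_test)
print("R^2 score for Linear Regression is ",lm_r2)

R^2 score for Linear Regression is  -11.430109024047852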
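
The value printed above comes from `r2_score`, so it is a coefficient of determination rather than an RMSE, while the competition is scored by RMSE. The sketch below, written with NumPy and hypothetical helper names (`rmse`, `r2`), shows how the two metrics differ and that R² changes when the arguments are swapped, whereas RMSE does not. The numbers are illustrative; whether the argument order accounts for the negative scores in this notebook is not verified here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; symmetric in its two arguments."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken from y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative values, not taken from the data set.
y_true = np.array([5.0, 10.0, 15.0, 20.0])
y_pred = np.array([9.0, 10.0, 11.0, 12.0])

print("RMSE:", rmse(y_true, y_pred))
print("R^2 (true first):", r2(y_true, y_pred))
print("R^2 (swapped):", r2(y_pred, y_true))
```

In this toy case the swapped call is far more negative because the predictions vary much less than the true values, so the denominator shrinks.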


In [None]:
y_pred=lm.predict(test)
y_pred

0        9.602448
1        9.602448
2        4.343445
3        9.452057
4       15.897491
          ...    
9909     9.640594
9910    12.716614
9911    46.170258
9912    22.659058
9913     4.728363
Length: 9914, dtype: float32

In [None]:
gdf_submission = cudf.read_csv('/content/drive/My Drive/Colab Notebooks/sample_submission.csv')
gdf_submission['fare_amount']= y_pred

gdf_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,9.602448
1,2015-01-27 13:08:24.0000003,9.602448
2,2011-10-08 11:53:44.0000002,4.343445
3,2012-12-01 21:12:12.0000002,9.452057
4,2012-12-01 21:12:12.0000003,15.897491


In [None]:
gdf_submission.to_csv('/content/drive/My Drive/Colab Notebooks/submission1.csv',index=False)

# 5. Ridge Regression

In [None]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  43381440
Number of records in validation data  10845359
(43381440, 6)
(10845359, 6)
(43381440,)
(10845359,)


In [None]:
params_ridge = {
    "alpha": loguniform(1e-5, 1e0), # default 1.0
    "solver": ['eig', 'cd'], 
}
ridge = Ridge()
clf = RandomizedSearchCV(ridge, params_ridge, random_state=1, n_iter=100, cv=5, verbose=0, n_jobs=1)
best_model = clf.fit(X_train,y_train)
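
The `loguniform(1e-5, 1e0)` prior makes the search try `alpha` values spread evenly across orders of magnitude rather than evenly on a linear scale. The sketch below illustrates that idea with plain NumPy instead of the scikit-learn helper; `sample_loguniform` is a hypothetical function written for this example.

```python
import numpy as np

def sample_loguniform(low, high, size, seed=0):
    """Draw samples whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    rng = np.random.default_rng(seed)
    exponents = rng.uniform(np.log10(low), np.log10(high), size=size)
    return 10.0 ** exponents

alphas = sample_loguniform(1e-5, 1e0, size=100)

# Count how many candidates fall into each decade between 1e-5 and 1e0.
counts, _ = np.histogram(np.log10(alphas), bins=np.arange(-5, 1))
print(counts)
```

Each decade should receive a similar share of the 100 candidates, which is why small values such as 1e-4 are explored as thoroughly as values close to 1.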

In [None]:
best_model.best_estimator_.get_params()

{'alpha': 0.001216494146415184,
 'fit_intercept': True,
 'normalize': False,
 'solver': 'eig'}

In [None]:
ridge_params = {
 'alpha': 0.240960447726532,
 'fit_intercept': True,
 'normalize': False,
 'solver': 'eig'
}

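ridge = Ridge(**ridge_params)
result_ridge = ridge.fit(X_train,y_train)

y_pred = result_ridge.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not the RMSE.
# cuML documents the argument order as r2_score(y, y_hat); the predictions are passed first here.
ridge_r2 = r2_score(y_pred, y_test)
print("R^2 score for Ridge Regression is ", ridge_r2)

R^2 score for Ridge Regression is  -7.212150573730469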


In [None]:
y_pred

0            7.710571
1           13.208282
2            9.265839
3            4.302399
4            9.057373
              ...    
10845354     3.778290
10845355     3.865448
10845356    19.796417
10845357    16.238464
10845358    14.010254
Length: 10845359, dtype: float32

In [None]:
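gdf_submission = cudf.read_csv('/content/drive/My Drive/Colab Notebooks/sample_submission.csv')
# NOTE: y_pred holds the predictions for the validation split (X_test), not for the Kaggle test set;
# a submission would need result_ridge.predict(test), as done in section 4.
gdf_submission['fare_amount']= y_pred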

gdf_submission.head()

Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,7.710571
1,2015-01-27 13:08:24.0000003,13.208282
2,2011-10-08 11:53:44.0000002,9.265839
3,2012-12-01 21:12:12.0000002,4.302399
4,2012-12-01 21:12:12.0000003,9.057373


In [None]:
gdf_submission.to_csv('/content/drive/My Drive/Colab Notebooks/submission2.csv',index=False)

# 6. K-Nearest Neighbors Regression

https://github.com/rapidsai/cuml/blob/branch-0.13/notebooks/kneighbors_regressor_demo.ipynb


In [32]:
## params
n_neighbors=4

In [33]:
X=train.drop(['fare_amount'],axis=1)
y=train['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
print("Number of records in training data ",X_train.shape[0])
print("Number of records in validation data ",X_test.shape[0])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Number of records in training data  4245058
Number of records in validation data  1061264
(4245058, 6)
(1061264, 6)
(4245058,)
(1061264,)


In [34]:
%%time
## inspiration: https://www.kaggle.com/cdeotte/rapids-knn-30-seconds-0-938/notebook
knn_cuml = cumlKNR(n_neighbors=n_neighbors)
knn_cuml.fit(X_train, y_train)

cuml_result = knn_cuml.predict(X_test)

CPU times: user 1min 36s, sys: 1min 40s, total: 3min 17s
Wall time: 3min 16s
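
For readers unfamiliar with the method, K-nearest-neighbors regression predicts a target by averaging the targets of the closest training points. The brute-force NumPy sketch below shows that idea on made-up data; it assumes uniform weighting and Euclidean distance, and `knn_predict` is a hypothetical helper, not the cuML implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors=4):
    """Predict each query target as the mean target of its nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the n_neighbors closest training points for every query row.
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    return y_train[nearest].mean(axis=1)

# Illustrative data: one feature (distance in miles) and a fare in USD.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([5.0, 7.0, 9.0, 30.0, 32.0])
X_query = np.array([[2.5], [10.5]])

pred = knn_predict(X_train, y_train, X_query, n_neighbors=2)
print(pred)
```

Because the prediction depends on raw distances, features on larger numeric scales influence the neighbor choice more than features on smaller scales.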


In [40]:
#y_pred=knn_cuml.predict(X_test)
#knn_cuml_r2 = cuml.metrics.regression.r2_score(y_test, cuml_result)
#print("R^2 score for K-Nearest Neighbor is ",knn_cuml_r2)

RuntimeError: ignored

In [34]:
gdf_submission = cudf.read_csv('/content/drive/My Drive/Colab Notebooks/sample_submission.csv')
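# cuml_result holds predictions for the validation split (X_test); the submission needs predictions for the test set
gdf_submission['fare_amount']= knn_cuml.predict(test)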

gdf_submission.head()

gdf_submission.to_csv('/content/drive/My Drive/Colab Notebooks/submission_KNearest.csv',index=False)

NameError: ignored

# 7. Evaluation and Discussion