<a href="https://colab.research.google.com/github/AnhDao1411/CSC14115/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# New York City Taxi Trip Duration

## Problem Statement

### Overview
This is a Kaggle competition that challenges participants to predict taxi trip durations in New York City. The dataset is provided by the NYC Taxi and Limousine Commission for building a model.

Prize money: $30,000.

Business motivation: accurate driving time estimates help to
- Detect bottlenecks that appear in the taxi traffic network.
- Predict the taxi trip price.

Input: a taxi trip with its attributes. \
Output: trip duration in seconds.
 
### Input and Output Data description

There are three files (shapes are given as rows, columns):
- train.csv: (1458644,11)
- test.csv: (625134,9)
- sample_submission.csv: (625134,2) 

| Column name | Description | 
| -------- | -------- | 
| id   | the id for each trip     |
| vendor_id   | the id of the provider associated with the trip record     |
| pickup_datetime   | date and time when the meter was engaged     |
| dropoff_datetime   | date and time when the meter was disengaged    |
| passenger_count   | The number of passengers in the taxi    |
| pickup_longitude   | the longitude when the meter was engaged     |
| pickup_latitude   | the latitude when the meter was engaged   |
| dropoff_longitude   | the longitude when the meter was disengaged     |
| dropoff_latitude   | the latitude when the meter was disengaged     |
| store_and_fwd_flag   | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server  |
| trip_duration  | duration of the trip in seconds     |


- A record in train.csv

![](https://i.imgur.com/LtLFbTO.png)

- A record in test.csv
![](https://i.imgur.com/gEJR1jT.png)

- A record in sample_submission.csv
![](https://i.imgur.com/zwq0ezw.png)




### Evaluation Metric
- **Root Mean Squared Logarithmic Error** (RMSLE) is the metric used to evaluate submissions in this competition.

$$\epsilon = \sqrt{\frac{1}{n}\sum^{n}_{i=1}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$

* where:
    * $\epsilon$: RMSLE score
    * $n$: the number of records (trip duration observations) in the dataset
    * $p_i$: the predicted trip duration
    * $a_i$: the actual trip duration
    * $\log(x)$: natural logarithm (base $e$)

* **The smaller the RMSLE value is, the better the model**.
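
The metric above can be sketched in a few lines of NumPy. This is a minimal illustration, not the competition's official scorer: the function name `rmsle` and the sample durations are assumptions made for the example.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two arrays of durations (seconds)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # np.log1p(x) computes log(x + 1) with the natural logarithm
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical durations in seconds, for illustration only
actual = np.array([300.0, 600.0, 1200.0])
perfect = rmsle(actual, actual)       # identical predictions give an error of 0
doubled = rmsle(2 * actual, actual)   # over-predicting by 2x gives roughly log(2)
```

Because the error is computed on logarithms, it penalizes relative rather than absolute differences: predicting twice the true duration costs about the same for a short trip as for a long one.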

## Explore data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%matplotlib inline
import numpy as np 
import pandas as pd
from datetime import timedelta
import datetime as dt
import matplotlib.pyplot as plt
import random

plt.rcParams['figure.figsize'] = [20, 15]

import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
import warnings
from math import radians, cos, sin, asin, sqrt
warnings.filterwarnings('ignore')

In [None]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
# !unzip ../input/nyc-taxi-trip-duration/train.zip -d nyc-taxi-trip-duration
# !unzip ../input/nyc-taxi-trip-duration/test.zip -d nyc-taxi-trip-duration
# !unzip ../input/nyc-taxi-trip-duration/sample_submission.zip -d nyc-taxi-trip-duration
%cd /content/drive/MyDrive/CSC14115 - KHDLUD

In [None]:
np.random.seed(1987)
N = 100000 # number of sample rows in plots
t0 = dt.datetime.now()
train = pd.read_csv('./nyc-taxi-trip-duration/train.csv')
test = pd.read_csv('./nyc-taxi-trip-duration/test.csv')
sample_submission = pd.read_csv('./nyc-taxi-trip-duration/sample_submission.csv')

In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
def check_basic(df, df_type="train"):
    print("{} DF has {} rows and {} columns".format(df_type,df.shape[0], df.shape[1]))
    if df.id.nunique() == df.shape[0]:
        print("1. Id is unique")
    if not df.isnull().any().any(): 
        print("2. No missing value")
    if df.duplicated(keep='first').sum() == 0:
        print("3. No duplicate record")
    if df_type == 'train':
        # Recompute the duration from the timestamps of the DataFrame being checked
        trip_duration_diff = (pd.to_datetime(df.dropoff_datetime) - pd.to_datetime(df.pickup_datetime)).dt.total_seconds()
        if len(df[np.abs(trip_duration_diff.values - df['trip_duration'].values) > 1]) == 0:
            print("4. Trip_duration is consistent with pickup and dropoff times.")

In [None]:
check_basic(train, df_type="train")

In [None]:
check_basic(test, df_type="test")

In [None]:
print("Train:\n", train.dtypes, "\n")
print("Test:\n", test.dtypes)

* Convert string type to datetime

In [None]:
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])

train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'])

In [None]:
print("Train:\n", train.dtypes, "\n")
print("Test:\n", test.dtypes)

In [None]:
train.describe(datetime_is_numeric = True)

In [None]:
num_cols = list(train.select_dtypes(exclude='object').columns)
cate_cols = list(train.select_dtypes(include='object').columns)
print("Num cols: ",num_cols)
print("Cate cols: ",cate_cols)

### Numerical columns

In [None]:
def missing_percentage(col):
    per = col.isna().mean()*100
    return per.round(1)

def value_percentages(col):
    val = col.groupby(by=col).count().sort_values(ascending=False)
    total = val.sum()
    return (((val / total)*100).round(1)).to_dict()

def num_values(col):
    return col.nunique()

In [None]:
train[num_cols].agg([missing_percentage, value_percentages, num_values])

In [None]:
train[num_cols].hist(figsize=(20, 15), bins = 100)
plt.show()

As we can see here:

- Because the trips are located in and around New York, the latitude and longitude values are concentrated in a narrow range.
- Most trips carry a single passenger.
- Trip duration is measured in seconds, so its range is very wide (up to the order of 1e6), while most values are concentrated in a small part of that range.

In [None]:
train_log_trip_duration = np.log(train['trip_duration'].values + 1)
plt.hist(train_log_trip_duration, bins=100)
plt.xlabel('log(trip_duration)')
plt.ylabel('number of train records')
plt.show()

In [None]:
def getColor(len_com):
    # Generate len_com distinct random hex colors
    lstColor = []
    for i in range(len_com):
        color = '#'+''.join([random.choice('ABCDEF0123456789') for j in range(6)])
        # Redraw until the color is not already in the list
        while color in lstColor:
            color = '#'+''.join([random.choice('ABCDEF0123456789') for j in range(6)])
        lstColor.append(color)
    return lstColor

In [None]:
def draw_chart(df, fig_size = (20,15)):
    temp = list(df.columns)
    figure, axis = plt.subplots(len(temp)//2 if len(temp) % 2 == 0 else len(temp)//2 + 1, 2, figsize = fig_size)
    color = ['#582f0e','#936639','#f3722c','#c2c5aa','#414833','#457b9d','#ffb703','#e63946', '#c77dff','#55a630','#f72585','#ffa69e','#4d908e','#7f5539','#b5e48c','#b6ad90','#38b000']
    for i, col in enumerate(temp):
        t = df[col].value_counts()
        axis[i//2, i%2].scatter(list(t.index), t.values, color = color[i])
        axis[i//2, i%2].set_title(col)
    if len(temp) %2 !=0:
        figure.delaxes(axis[len(temp)//2,1])
    plt.show()

In [None]:
draw_chart(train[num_cols], (20,30));

### Categorical columns

In [None]:
train[cate_cols].agg([missing_percentage, value_percentages, num_values])

* Because the **store_and_fwd_flag** column only contains two unique values, **"Y"** and **"N"**, we will convert it to **1** and **0**, respectively. (In train and test set)

In [None]:
train['store_and_fwd_flag'] = 1 * (train.store_and_fwd_flag.values == 'Y')
test['store_and_fwd_flag'] = 1 * (test.store_and_fwd_flag.values == 'Y')

In [None]:
print("Train:\n", train.dtypes, "\n")
print("Test:\n", test.dtypes)

## Extract features

### OSRM features

In [None]:
osrm_train = pd.read_csv('../input/nyc-taxi-trip-noisy/train_augmented.csv')
osrm_test = pd.read_csv('../input/nyc-taxi-trip-noisy/test_augmented.csv')
print("OSRM Train shape: ", osrm_train.shape)
print("OSRM Test shape: ", osrm_test.shape)

* Definition of each column

| Column name | Description | 
| -------- | -------- | 
| id   | Record id     |
| distance | Route distance (m) |
| duration | OSRM trip duration (s) |
| motorway | Share of the route on motorway roads (% of total distance) |
| trunk | Share of the route on trunk roads (% of total distance) |
| primary | Share of the route on primary roads (% of total distance) |
| secondary | Share of the route on secondary roads (% of total distance) |
| tertiary | Share of the route on tertiary roads (% of total distance) |
| unclassified | Share of the route on unclassified roads (% of total distance) |
| residential | Share of the route on residential roads (% of total distance) |
| nTrafficSignals | The number of traffic signals |
| nCrossing | The number of pedestrian crossings |
| nStop | The number of stop signs |
| nIntersection | The number of intersections (note for OSRM users: "intersection" here has a different meaning than the one used in OSRM) |
| srcCounty | Pickup county |
| dstCounty | Dropoff county |

* **srcCounty** and **dstCounty** values: There are 6 different values
    * NA: Not in NYC
    1. Brooklyn
    2. Queens
    3. Staten Island
    4. Manhattan
    5. Bronx

$\Rightarrow$ We will fill in NA values with the number 6.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update the DataFrame in recent pandas versions
for osrm_df in (osrm_train, osrm_test):
    osrm_df[['srcCounty', 'dstCounty']] = osrm_df[['srcCounty', 'dstCounty']].fillna(6)

In [None]:
osrm_train.dtypes

In [None]:
osrm_test.dtypes

In [None]:
osrm_train.head(2)

In [None]:
print(len(np.intersect1d(osrm_train['id'], train['id'])))
print(train.shape[0])

* osrm_train has one id fewer than the original train set, and every id in osrm_train also appears in the train set.
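
To see which id is missing, a set difference is one option. The sketch below uses toy DataFrames with hypothetical ids (`toy_train` and `toy_osrm` are stand-ins invented for the example, not the real tables):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train and osrm_train (hypothetical ids, illustration only)
toy_train = pd.DataFrame({"id": ["id1", "id2", "id3", "id4"]})
toy_osrm = pd.DataFrame({"id": ["id1", "id2", "id4"]})

# ids present in the train table but absent from the OSRM table
missing_ids = np.setdiff1d(toy_train["id"], toy_osrm["id"])

# ids in the OSRM table that do not exist in train (empty if OSRM is a subset)
extra_ids = np.setdiff1d(toy_osrm["id"], toy_train["id"])
```

Applied to the real tables, the same two calls would list the single trip without OSRM features and confirm that the OSRM table contains no unknown ids.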

In [None]:
osrm_train.describe()

In [None]:
osrm_num_cols = list(osrm_train.select_dtypes(exclude = 'object').columns)
osrm_cate_cols = list(osrm_train.select_dtypes(include = 'object').columns)

In [None]:
osrm_train[osrm_num_cols].agg([missing_percentage, value_percentages, num_values])

In [None]:
draw_chart(osrm_train.select_dtypes(exclude='object'),(20,30))

### Calculate Distance

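The formula is:

$$\text{Distance} = 3963.0 \times 1.609344 \times \arccos\left(\sin(\text{lat}_1)\sin(\text{lat}_2) + \cos(\text{lat}_1)\cos(\text{lat}_2)\cos(\text{long}_2 - \text{long}_1)\right)$$

Here 3963.0 is the Earth's radius in miles and 1.609344 converts miles to kilometers; latitudes and longitudes are in radians. The code below computes the same great-circle distance using the haversine form with a radius of 6371 km.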

In [None]:
train["distance"] = 0

In [None]:
def distance(row):
    # Latitude and longitude must not be swapped: cos() below applies to latitudes
    lat1 = radians(row.pickup_latitude)
    lat2 = radians(row.dropoff_latitude)
    lon1 = radians(row.pickup_longitude)
    lon2 = radians(row.dropoff_longitude)

    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2

    c = 2 * asin(sqrt(a))
    
    r = 6371
    row.distance = c*r
    return row

In [None]:
distance_df = train[['pickup_latitude','dropoff_latitude', 'pickup_longitude','dropoff_longitude','distance']]
distance_df.head(1)

In [None]:
%%time
distance_df = distance_df.apply(distance, axis = 1)

In [None]:
train['distance'] = distance_df['distance']
train.head(1)
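
The row-wise `apply` used for `distance` calls a Python function once per row, which can be slow on roughly 1.4 million records. A vectorized NumPy version of the haversine formula is one possible alternative. This is a sketch under assumptions: the function name `haversine_km` and the constant `EARTH_RADIUS_KM` are introduced here for illustration, and the sample coordinates are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the same value used in distance()

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays, or pandas Series in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Sanity checks on made-up points: one degree of latitude is about 111.19 km
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
same_point = haversine_km(40.75, -73.98, 40.75, -73.98)

# Array inputs are processed element-wise in a single call
batch = haversine_km(np.array([40.0, 40.75]), np.array([-74.0, -73.98]),
                     np.array([41.0, 40.75]), np.array([-74.0, -73.98]))
```

On the real data this could be called as `haversine_km(train['pickup_latitude'], train['pickup_longitude'], train['dropoff_latitude'], train['dropoff_longitude'])` and the result assigned to `train['distance']`.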

### Datetime feature

In [None]:
train.loc[:, 'pick_month'] = train['pickup_datetime'].dt.month
train.loc[:, 'hour'] = train['pickup_datetime'].dt.hour
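# .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
train.loc[:, 'week_of_year'] = train['pickup_datetime'].dt.isocalendar().week.astype(int)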
train.loc[:, 'day_of_year'] = train['pickup_datetime'].dt.dayofyear
train.loc[:, 'day_of_week'] = train['pickup_datetime'].dt.dayofweek
#The week starts on Monday, which is denoted by 0 and ends on Sunday which is denoted by 6
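
The day-of-week convention can be checked on a couple of known dates. The timestamps below are hypothetical and chosen only because 2016-03-14 was a Monday and 2016-03-20 a Sunday:

```python
import pandas as pd

# Hypothetical pickup timestamps (illustration only)
pickups = pd.Series(pd.to_datetime(["2016-03-14 17:24:55", "2016-03-20 08:05:00"]))

features = pd.DataFrame({
    "pick_month": pickups.dt.month,
    "hour": pickups.dt.hour,
    "day_of_year": pickups.dt.dayofyear,
    "day_of_week": pickups.dt.dayofweek,  # Monday=0 ... Sunday=6
})
```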

In [None]:
train.head()

## Explore correlation

In [None]:
# New york city border
west, east = -74.03, -73.75
south, north = (40.63, 40.85)

train = train[(train.pickup_latitude> south) & (train.pickup_latitude < north)]
train = train[(train.dropoff_latitude> south) & (train.dropoff_latitude < north)]
train = train[(train.pickup_longitude> west) & (train.pickup_longitude < east)]
train = train[(train.dropoff_longitude> west) & (train.dropoff_longitude < east)]

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(15,10))

train.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax1)
ax1.set_title("Pickups")
ax1.set_facecolor('black')

train.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax2)
ax2.set_title("Dropoffs")
ax2.set_facecolor('black') 

In [None]:
plt.figure(figsize=(8,6))
# pointplot has no 'kind' argument (that belongs to catplot), so it is omitted
sns.pointplot(x='hour',y='trip_duration',data=train,hue='pick_month')
plt.xlabel('pickup_hour',fontsize=16)
plt.ylabel('mean(trip_duration)',fontsize=16)

In [None]:
temp3 = train.copy()

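# Correlation is only defined for numeric columns
corr = temp3.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle (np.bool was removed from NumPy; use bool)
mask = np.zeros_like(corr, dtype=bool)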
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(15, 13))

cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})