<a href="https://colab.research.google.com/github/AMMLRepos/new-york-taxi-trip-duration/blob/main/new_york_city_taxi_trip_duration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Business Problem 
Predicting a taxi trip duration in New York city for will help a taxi company to -
- to plan number of taxis required to address the need 
- to undestand the most prominent locations 
- to undetstand the locations where drivers will get longer rides
- enhance customer experience 
- improve taxi utilization and planning 

# Objective 
To predict the trip duration of a taxi in New York city

# Source of data
Data is openely available on [Kaggle](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview/evaluation) 

# Steps
We will perform following probable activities to train a model -
- Import required libraries 
- Download the dataset and import it in notebook 
- Analyze existing data 
- Perform feature engineering if required 
- Prepare and clean data for model training 
- Evaluate the developed model and make changes to improve accuracy 
- Publish the model  

## Import required libraries and download the dataset
We will use [opendatasets](https://github.com/JovianML/opendatasets) library from [jovian](https://jovian.ai/) to download kaggle data 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
!pip install opendatasets



In [3]:
import opendatasets as od
import os 
dataset_url = "https://www.kaggle.com/c/nyc-taxi-trip-duration/overview/evaluation"
od.download(dataset_url)

Skipping, found downloaded files in "./nyc-taxi-trip-duration" (use force=True to force download)


In [4]:
files = os.listdir('nyc-taxi-trip-duration')

In [5]:
import zipfile
with zipfile.ZipFile("./nyc-taxi-trip-duration/train.zip", 'r') as zip_ref:
    zip_ref.extractall("./")

In [6]:
os.listdir()

['.config', 'nyc-taxi-trip-duration', 'train.csv', 'sample_data']

In [7]:
raw_taxi_df = pd.read_csv("./train.csv")
raw_taxi_df

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.964630,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.010040,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.782520,N,435
...,...,...,...,...,...,...,...,...,...,...,...
1458639,id2376096,2,2016-04-08 13:31:04,2016-04-08 13:44:02,4,-73.982201,40.745522,-73.994911,40.740170,N,778
1458640,id1049543,1,2016-01-10 07:35:15,2016-01-10 07:46:10,1,-74.000946,40.747379,-73.970184,40.796547,N,655
1458641,id2304944,2,2016-04-22 06:57:41,2016-04-22 07:10:25,1,-73.959129,40.768799,-74.004433,40.707371,N,764
1458642,id2714485,1,2016-01-05 15:56:26,2016-01-05 16:02:39,1,-73.982079,40.749062,-73.974632,40.757107,N,373


# Knowing data fields
Data fields in the dataset stands for the following - 

* id - a unique identifier for each trip
* vendor_id - a code indicating the provider associated with the trip record
* pickup_datetime - date and time when the meter was engaged
* dropoff_datetime - date and time when the meter was disengaged
* passenger_count - the number of passengers in the vehicle (driver entered value)
* pickup_longitude - the longitude where the meter was engaged
* pickup_latitude - the latitude where the meter was engaged
* dropoff_longitude - the longitude where the meter was disengaged
* dropoff_latitude - the latitude where the meter was disengaged
* store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending * to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* trip_duration - duration of the trip in seconds

# Possibilities of feature engineering 
Having first look at the data, we might end up doing feature engineering to get following fields - 
* Seperate date and time 
* Get days(Monday, Tuesday and so on) for specific date 
* Divide time period into slots of say Morning, afternoon, evening and night or may be more granular periods like early monring, late morning, noon, early evening, etc. 
* Calculate trip distance from pick-up latitude to drop-off latitude 

We can conclude on the same after seeing some more patterns in the data

# Doing first level analysis of data

In [8]:
raw_taxi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   id                  1458644 non-null  object 
 1   vendor_id           1458644 non-null  int64  
 2   pickup_datetime     1458644 non-null  object 
 3   dropoff_datetime    1458644 non-null  object 
 4   passenger_count     1458644 non-null  int64  
 5   pickup_longitude    1458644 non-null  float64
 6   pickup_latitude     1458644 non-null  float64
 7   dropoff_longitude   1458644 non-null  float64
 8   dropoff_latitude    1458644 non-null  float64
 9   store_and_fwd_flag  1458644 non-null  object 
 10  trip_duration       1458644 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB


As we can see from above output, we have - 
* 11 columns
* 14,58,644 - 14 Lakh rows - its a good size dataset
* A few string/object values and a few of them are numerical 
* No column has empty or missing values 

In [9]:
raw_taxi_df.describe()

Unnamed: 0,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration
count,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0,1458644.0
mean,1.53495,1.66453,-73.97349,40.75092,-73.97342,40.7518,959.4923
std,0.4987772,1.314242,0.07090186,0.03288119,0.07064327,0.03589056,5237.432
min,1.0,0.0,-121.9333,34.3597,-121.9333,32.18114,1.0
25%,1.0,1.0,-73.99187,40.73735,-73.99133,40.73588,397.0
50%,2.0,1.0,-73.98174,40.7541,-73.97975,40.75452,662.0
75%,2.0,2.0,-73.96733,40.76836,-73.96301,40.76981,1075.0
max,2.0,9.0,-61.33553,51.88108,-61.33553,43.92103,3526282.0


- Vendor ID is a categorical column with value 1, 2 
- Passenger count is max 9 and min 0 (How can this happen. We need to see this record)
- There is a trip duration of even 1 second (Again something which is not realistic) and max second 3526282 seconds which is approx 970 hours which is again not realistic

- We need to see such values which does not make sense to the context and could be there for some reasons


In [10]:
raw_taxi_df["trip_duration"] = raw_taxi_df["trip_duration"].astype("float64")

In [14]:
raw_taxi_df.sort_values(by = "trip_duration")

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
207497,id1520236,1,2016-05-17 09:03:38,2016-05-17 09:03:39,1,-73.819893,40.740822,-73.819885,40.740822,N,1.0
1382872,id0839864,1,2016-02-06 13:40:27,2016-02-06 13:40:28,1,-73.987991,40.724083,-73.987991,40.724079,N,1.0
1360664,id0480433,1,2016-01-14 12:33:28,2016-01-14 12:33:29,1,-73.991486,40.741940,-73.991478,40.741955,N,1.0
346102,id2375785,1,2016-01-15 23:57:18,2016-01-15 23:57:19,1,-73.985825,40.755760,-73.985901,40.755829,N,1.0
1034341,id0218424,1,2016-01-17 13:50:16,2016-01-17 13:50:17,1,-73.953728,40.670036,-73.953346,40.670021,N,1.0
...,...,...,...,...,...,...,...,...,...,...,...
1234291,id1942836,2,2016-02-15 23:18:06,2016-02-16 23:17:58,2,-73.794525,40.644825,-73.991051,40.755573,N,86392.0
355003,id1864733,1,2016-01-05 00:19:42,2016-01-27 11:08:38,1,-73.789650,40.643559,-73.956810,40.773087,N,1939736.0
680594,id0369307,1,2016-02-13 22:38:00,2016-03-08 15:57:38,2,-73.921677,40.735252,-73.984749,40.759979,N,2049578.0
924150,id1325766,1,2016-01-05 06:14:15,2016-01-31 01:01:07,1,-73.983788,40.742325,-73.985489,40.727676,N,2227612.0


Observing closely, there seems to be more non-realistic entries in our dataset. Some of them are 
- trip start date of 2016-02-13 and trip end date of 2016-03-25 which is approx 42 days 
- tip start time of 12:33:28 and trip stop time of 12:33:29 which is just 1 second

What can we do to get more information on the data - 
- Connect with the owner of the data / customer and check if these records are valid. If yes in what situations such records get entered in the system. For example, a duration of 1 second could mean that trip was started but was cancelled by passenger due to a sudden change of plan. A car was rented by a passenger for say 42 days. 
- Either correct the data or drop those records if business thinks they are not relevant records. 
- Understand the possible boundry conditions.




Since we cannot talk to anyone for now, we will for now just go ahead with the data and add a new feature which is distance travelled. Distance travelled can be calculated by the latitude and longitude paramters of start location and stop location. 

We will use a library named [geopy](https://github.com/geopy/geopy) to calculate the distance between two coordinates. 

In [16]:
!pip install geopy



In [26]:
#Sample geopy code
import geopy.distance

coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)

print(geopy.distance.vincenty(coords_1, coords_2).km)

279.35290160386563


In [31]:
import geopy.distance 

#pickup_longitude
#pickup_latitude
#dropoff_longitude	
#dropoff_latitude

def get_distance(df):
  pickup_coor = (df.pickup_latitude, df.pickup_longitude)
  dropoff_coor = (df.dropoff_latitude,df.dropoff_longitude)

  try:
    distance = geopy.distance.vincenty(pickup_coor, dropoff_coor).km
    print(distance)
  except:
    distance = "NAN"
    print(distance)

In [None]:
raw_taxi_df["trip_distance"] = raw_taxi_df.apply(get_distance, axis = 1)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
10.80797417715912
1.2109184717373755
1.966656275940384
3.4036305412399015
1.1389024948396216
0.872845730817573
1.4044143642181446
5.0331770579888
1.465718824682444
3.322632246041337
0.5364571908964919
1.472485211107888
2.2549802596355044
1.8555531188191599
1.8208298821406081
1.761144005490532
0.8278264243977633
1.9003595741260175
1.395927538376373
2.3179757886695684
1.1523495420085932
1.6050770891921093
0.9011009709581638
0.06951043516598068
1.8374461490766132
2.099070480840921
1.2811641453929914
3.0020750257588906
0.1950078488397122
1.349909911207854
1.9255712114325596
13.42773358675001
1.7943584084255262
2.845140295441938
0.8882328089175779
0.1675850805144958
3.439499131868296
3.133026656714929
0.9051176809117923
4.787583233071812
20.236944670831676
1.0321523917667592
7.103099377417078
22.658532243116703
2.1704849534652113
7.607288152167682
5.764851742387691
1.2128038176448652
1.556395284458631
0.37315146784496184
0.377

In [30]:
raw_taxi_df

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,trip_distance
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.964630,40.765602,N,455.0,
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663.0,
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124.0,
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.010040,40.719971,-74.012268,40.706718,N,429.0,
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.782520,N,435.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
1458639,id2376096,2,2016-04-08 13:31:04,2016-04-08 13:44:02,4,-73.982201,40.745522,-73.994911,40.740170,N,778.0,
1458640,id1049543,1,2016-01-10 07:35:15,2016-01-10 07:46:10,1,-74.000946,40.747379,-73.970184,40.796547,N,655.0,
1458641,id2304944,2,2016-04-22 06:57:41,2016-04-22 07:10:25,1,-73.959129,40.768799,-74.004433,40.707371,N,764.0,
1458642,id2714485,1,2016-01-05 15:56:26,2016-01-05 16:02:39,1,-73.982079,40.749062,-73.974632,40.757107,N,373.0,
