
# Business Problem 
Predicting taxi trip duration in New York City will help a taxi company to:
- plan the number of taxis required to meet demand
- understand the most prominent locations
- understand the locations where drivers get longer rides
- enhance customer experience
- improve taxi utilization and planning

# Objective 
To predict the trip duration of a taxi in New York City.

# Source of data
The data is openly available on [Kaggle](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview/evaluation).

# Steps
We will likely perform the following activities to train a model:
- Import required libraries
- Download the dataset and import it into the notebook
- Analyze existing data 
- Perform feature engineering if required 
- Prepare and clean data for model training 
- Evaluate the developed model and make changes to improve accuracy 
- Publish the model  

## Import required libraries and download the dataset
We will use the [opendatasets](https://github.com/JovianML/opendatasets) library from [Jovian](https://jovian.ai/) to download the Kaggle data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.20-py3-none-any.whl (14 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.20


In [6]:
import opendatasets as od
import os 
dataset_url = "https://www.kaggle.com/c/nyc-taxi-trip-duration/overview/evaluation"
od.download(dataset_url)

Skipping, found downloaded files in "./nyc-taxi-trip-duration" (use force=True to force download)


In [8]:
files = os.listdir('nyc-taxi-trip-duration')
files  # display the list; an assignment alone produces no cell output

['test.zip', 'train.zip', 'sample_submission.zip']

In [10]:
import zipfile
with zipfile.ZipFile("./nyc-taxi-trip-duration/train.zip", 'r') as zip_ref:
    zip_ref.extractall("./")
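
As an optional alternative to extracting the archive first, pandas can usually read a zip file directly when the archive contains a single CSV. The sketch below builds a tiny stand-in archive so that it runs on its own; the name `demo_zip_path` and the three-column CSV are illustrative assumptions, not the real `train.zip`.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a small stand-in archive (hypothetical data, NOT the real train.zip)
# so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip_path = os.path.join(tmp_dir, "train.zip")
    csv_text = (
        "id,pickup_datetime,trip_duration\n"
        "id2875421,2016-03-14 17:24:55,455\n"
        "id2377394,2016-06-12 00:43:35,663\n"
    )
    with zipfile.ZipFile(demo_zip_path, "w") as zip_ref:
        zip_ref.writestr("train.csv", csv_text)

    # Compression is inferred from the ".zip" extension; this works when the
    # archive holds exactly one file. parse_dates converts the column to datetimes.
    demo_df = pd.read_csv(demo_zip_path, parse_dates=["pickup_datetime"])

print(demo_df.dtypes)
```

In the notebook itself, the same call would presumably point at `./nyc-taxi-trip-duration/train.zip`, assuming that archive contains only `train.csv`.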

In [14]:
os.listdir()

['.config', 'nyc-taxi-trip-duration', 'train.csv', 'sample_data']

In [15]:
raw_taxi_df = pd.read_csv("./train.csv")
raw_taxi_df

|  | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.010040 | 40.719971 | -74.012268 | 40.706718 | N | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.782520 | N | 435 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1458639 | id2376096 | 2 | 2016-04-08 13:31:04 | 2016-04-08 13:44:02 | 4 | -73.982201 | 40.745522 | -73.994911 | 40.740170 | N | 778 |
| 1458640 | id1049543 | 1 | 2016-01-10 07:35:15 | 2016-01-10 07:46:10 | 1 | -74.000946 | 40.747379 | -73.970184 | 40.796547 | N | 655 |
| 1458641 | id2304944 | 2 | 2016-04-22 06:57:41 | 2016-04-22 07:10:25 | 1 | -73.959129 | 40.768799 | -74.004433 | 40.707371 | N | 764 |
| 1458642 | id2714485 | 1 | 2016-01-05 15:56:26 | 2016-01-05 16:02:39 | 1 | -73.982079 | 40.749062 | -73.974632 | 40.757107 | N | 373 |


# Understanding the data fields
The data fields in the dataset stand for the following:

* id - a unique identifier for each trip
* vendor_id - a code indicating the provider associated with the trip record
* pickup_datetime - date and time when the meter was engaged
* dropoff_datetime - date and time when the meter was disengaged
* passenger_count - the number of passengers in the vehicle (driver entered value)
* pickup_longitude - the longitude where the meter was engaged
* pickup_latitude - the latitude where the meter was engaged
* dropoff_longitude - the longitude where the meter was disengaged
* dropoff_latitude - the latitude where the meter was disengaged
* store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip)
* trip_duration - duration of the trip in seconds
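
To make these definitions concrete, the sketch below takes three rows copied from the preview above, converts the two timestamp columns to datetimes, maps the Y/N flag to a boolean, and checks that `trip_duration` equals the drop-off time minus the pickup time in seconds. The names `sample_df`, `store_and_fwd`, `computed_seconds` and `duration_matches` are illustrative assumptions; in the notebook the same steps would presumably run on `raw_taxi_df`.

```python
import pandas as pd

# Three rows copied from the preview above (a stand-in for raw_taxi_df).
sample_df = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38", "2016-01-19 12:10:48"],
    "store_and_fwd_flag": ["N", "N", "N"],
    "trip_duration": [455, 663, 2124],
})

# The timestamps arrive as strings; convert them to datetimes.
for col in ["pickup_datetime", "dropoff_datetime"]:
    sample_df[col] = pd.to_datetime(sample_df[col])

# Hypothetical helper column: turn the Y/N flag into a boolean.
sample_df["store_and_fwd"] = sample_df["store_and_fwd_flag"].map({"Y": True, "N": False})

# trip_duration should equal drop-off minus pickup, expressed in seconds.
computed_seconds = (
    sample_df["dropoff_datetime"] - sample_df["pickup_datetime"]
).dt.total_seconds()
duration_matches = computed_seconds.eq(sample_df["trip_duration"])
print(duration_matches.all())
```

Running the same comparison over the full dataset would be one way to spot rows where the recorded duration disagrees with the timestamps.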

# Possibilities of feature engineering
On a first look at the data, we might engineer the following features:
* Separate the date and time
* Get the day of the week (Monday, Tuesday and so on) for each date
* Divide the time of day into slots such as morning, afternoon, evening and night, or perhaps more granular periods like early morning, late morning, noon, early evening, etc.
* Calculate the trip distance from the pickup and drop-off coordinates (latitude and longitude)

We can decide on these after seeing some more patterns in the data.
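
The candidate features listed above could be sketched roughly as follows, using three rows copied from the data preview. Everything introduced here is an assumption for illustration: the column names (`pickup_date`, `pickup_time`, `pickup_day`, `pickup_slot`, `distance_km`), the time-slot boundaries, and the choice of the haversine (great-circle) formula, which gives straight-line distance rather than actual driving distance.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def time_slot(hour):
    """Map an hour (0-23) to a coarse slot; the boundaries are an arbitrary assumption."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


# Three rows copied from the preview above (a stand-in for raw_taxi_df).
features_df = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
    "pickup_longitude": [-73.982155, -73.980415, -73.979027],
    "pickup_latitude": [40.767937, 40.738564, 40.763939],
    "dropoff_longitude": [-73.964630, -73.999481, -74.005333],
    "dropoff_latitude": [40.765602, 40.731152, 40.710087],
})

pickup = pd.to_datetime(features_df["pickup_datetime"])
features_df["pickup_date"] = pickup.dt.date             # separate date
features_df["pickup_time"] = pickup.dt.time             # separate time
features_df["pickup_day"] = pickup.dt.day_name()        # Monday, Tuesday, ...
features_df["pickup_slot"] = pickup.dt.hour.map(time_slot)
features_df["distance_km"] = haversine_km(
    features_df["pickup_latitude"], features_df["pickup_longitude"],
    features_df["dropoff_latitude"], features_df["dropoff_longitude"],
)

print(features_df[["pickup_day", "pickup_slot", "distance_km"]])
```

The distance uses both latitude and longitude of each endpoint; latitude alone would ignore east-west travel.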

# First-level analysis of the data

In [None]:
raw_taxi_df.info()

As we can see from the above output, we have:
* 11 columns
* 1,458,644 rows (roughly 1.46 million), which is a good-sized dataset
* A mix of string/object columns and numerical columns
* No column with empty or missing values
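
These observations can also be checked programmatically rather than read off the `info()` output. The helper below, `summarize`, is a hypothetical name introduced for illustration; it is demonstrated on two rows copied from the data preview, and in the notebook it would presumably be called on `raw_taxi_df`.

```python
import io

import pandas as pd


def summarize(df):
    """Return the row/column counts, column types and total number of missing values."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
        "missing_values": int(df.isna().sum().sum()),
    }


# Two rows copied from the preview above (a stand-in for raw_taxi_df).
csv_text = """id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.964630,40.765602,N,455
id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

summary = summarize(sample_df)
print(summary)
```

Note that without `parse_dates`, the two timestamp columns are read as strings, which is why they appear among the non-numeric columns.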

In [None]:
raw_taxi_df.describe()