<a href="https://colab.research.google.com/github/Rushabhbhagat08/NYC-Taxi-Time-Prediction-/blob/main/NYC_Taxi_Time_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - NYC Taxi Trip Time Prediction 



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Team Member**: Rushabh Anilrao Bhagat

# **Project Summary -**

Taxi rides form the core of the traffic in New York City. The many rides taken every day by New Yorkers give a good picture of travel times, congestion, and road blockages. Predicting the duration of a taxi trip matters because riders want to know how long it will take to travel from one place to another. Given the rising popularity of app-based services such as Ola and Uber, vendors also have to offer competitive pricing to keep their users. Predictions of trip duration and price help riders plan their trips and leave a margin for traffic congestion, and they help drivers choose the route that takes the least time. Transparency about pricing and trip duration can also attract users at times when popular app-based vendors apply surge fares.

In this study we used data that a customer provides at the start of a ride, or while booking one, to predict the duration and fare. This data includes the pickup and drop-off coordinates, the distance of the trip, the start time, the number of passengers, and a rate code for the class of cab, so that the rate applied is either regular or airport based.

We then applied several regression models to find out which one gives better accuracy and to study the relationships between the variables. Finally, a comparison of the tree-based models shows that Random Forest fits better and is more efficient than a single Decision Tree for taxi trip duration predictions.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

# ***Let's Begin !***

### **Install Required Libraries**

In [None]:
!pip install haversine

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
# plt.style.use("dark_background")
# from sklearn.model_selection import GridSearchCV
# from pandas_profiling import ProfileReport
from haversine import haversine
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# from xgboost import XGBRegressor
# from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
# import statsmodels.formula.api as sm
# from sklearn.model_selection import learning_curve
# from sklearn.model_selection import ShuffleSplit
import warnings; warnings.simplefilter('ignore')

### Dataset Loading

In [None]:
#Load Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
Taxi_Time_df=pd.read_csv('/content/drive/MyDrive/Regression_project/NYC taxi Time Prediction /Copy of NYC Taxi Data.csv')
# Taxi_Time_df=pd.read_csv('/NYC Taxi Data.csv')

In [None]:
# Taxi_Time_df.hist()

### Dataset First View

In [None]:
# Dataset First Look
Taxi_Time_df.head()

In [None]:
Taxi_Time_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows= Taxi_Time_df.shape[0]
columns = Taxi_Time_df.shape[1]
print(f"The number of rows is {rows} and number of columns is {columns}.")

### Dataset Information

In [None]:
# Dataset Info
Taxi_Time_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
Taxi_Time_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Taxi_Time_df.isnull().sum()

There is no NaN/NULL record in the dataset, so we don't have to impute any records.

### What did you know about your dataset?

* The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC) and was sampled and cleaned for the purposes of this project. Based on individual trip attributes, the goal is to predict the duration of each trip in the test set.
* NYC Taxi Data.csv - the training set (contains 1458644 trip records)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Taxi_Time_df.columns

In [None]:
# Dataset Describe
Taxi_Time_df.describe()

### Variables Description 

1. id - a unique identifier for each trip.

2. vendor_id - a code indicating the provider associated with the trip record.

3. pickup_datetime - date and time when the meter was engaged.

4. dropoff_datetime - date and time when the meter was disengaged.

5. passenger_count - the number of passengers in the vehicle (driver entered value).

6. pickup_longitude - the longitude where the meter was engaged.

7. pickup_latitude - the latitude where the meter was engaged.

8. dropoff_longitude - the longitude where the meter was disengaged.

9. dropoff_latitude - the latitude where the meter was disengaged.

10. store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip).

11. trip_duration - duration of the trip in seconds.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("There are %d unique id's in Training dataset, which is equal to the number of records"%(Taxi_Time_df.id.nunique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#Calculate and assign new columns to the dataframe such as weekday,
#month and pickup_hour which will help us to gain more insights from the data.

Taxi_Time_df['pickup_datetime'] = pd.to_datetime(Taxi_Time_df['pickup_datetime'])
Taxi_Time_df['dropoff_datetime'] = pd.to_datetime(Taxi_Time_df['dropoff_datetime'])
Taxi_Time_df['pickup_day'] = Taxi_Time_df['pickup_datetime'].dt.day
Taxi_Time_df['pickup_month'] = Taxi_Time_df['pickup_datetime'].dt.month
Taxi_Time_df['pickup_date'] = Taxi_Time_df['pickup_datetime'].dt.date
Taxi_Time_df['pickup_hour'] = Taxi_Time_df['pickup_datetime'].dt.hour
Taxi_Time_df['pickup_min'] = Taxi_Time_df['pickup_datetime'].dt.minute
Taxi_Time_df['dropoff_min'] = Taxi_Time_df['dropoff_datetime'].dt.minute
Taxi_Time_df['pickup_weekday'] = Taxi_Time_df['pickup_datetime'].dt.weekday 


In [None]:
Taxi_Time_df['store_and_fwd_flag']=Taxi_Time_df['store_and_fwd_flag'].apply(lambda x : 0 if x=='N' else 1)

In [None]:
Taxi_Time_df.head()

In [None]:
#calc_distance is a function to calculate distance between pickup and dropoff coordinates using Haversine formula.
def calc_distance(df):
    pickup = (df['pickup_latitude'], df['pickup_longitude'])
    drop = (df['dropoff_latitude'], df['dropoff_longitude'])
    return haversine(pickup, drop)

In [None]:
#Calculate distance and assign new column to the dataframe.
Taxi_Time_df['distance'] = Taxi_Time_df.apply(lambda x: calc_distance(x), axis = 1)
# Taxi_Time_df['distance']
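
The row-wise `apply` above calls `haversine` once per trip, which can be slow on roughly 1.4 million rows. As a rough sketch, the same great-circle distance can be computed on whole columns with NumPy. The helper name `haversine_km` and the Earth radius constant are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumed constant for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(float(one_degree), 2))

# Vectorized call on arrays of coordinates (hypothetical values)
lats = np.array([40.75, 40.70])
lons = np.array([-73.99, -74.00])
same_point = haversine_km(lats, lons, lats, lons)
```

On the notebook's frame this would be called with the four coordinate columns of `Taxi_Time_df` as arguments, giving the whole `distance` column in one pass.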

In [None]:
#Calculate Speed in km/h for further insights
Taxi_Time_df['speed'] = (Taxi_Time_df.distance/(Taxi_Time_df.trip_duration/3600))

In [None]:
#Check the type of each variable
Taxi_Time_df.dtypes.reset_index()

In [None]:
#Dummify the categorical features "store_and_fwd_flag, vendor_id, pickup_month, pickup_weekday, pickup_hour, passenger_count"; the label "trip_duration" is left untouched.
#Each block of dummies is appended to `data`, so the columns accumulate instead of being overwritten.

data = Taxi_Time_df.copy()
dummy_columns = {'store_and_fwd_flag': 'flag', 'vendor_id': 'vendor_id', 'pickup_month': 'pickup_month',
                 'pickup_weekday': 'pickup_weekday', 'pickup_hour': 'pickup_hour', 'passenger_count': 'passenger_count'}

for col, prefix in dummy_columns.items():
    dummy = pd.get_dummies(Taxi_Time_df[col], prefix=prefix)
    dummy = dummy.drop(dummy.columns[0], axis=1) #avoid dummy trap
    data = pd.concat([data, dummy], axis=1)

In [None]:
# Taxi_Time_df['pickup_day'].value_counts()

In [None]:
#update a dataset
Taxi_Time_df.head()

* Now our dataset is complete for the further analysis before we train our model with optimal variables
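
A more compact way to build dummy columns, shown here on a toy frame, is `pd.get_dummies` with `drop_first=True`, which drops the first level of every encoded column and so avoids the dummy-variable trap in a single call. The toy values below are made up for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns and the label
toy = pd.DataFrame({
    'vendor_id': [1, 2, 2, 1],
    'pickup_month': [1, 2, 3, 1],
    'trip_duration': [455, 663, 2124, 429],
})

# Encode only the listed columns; the first level of each is dropped
encoded = pd.get_dummies(toy, columns=['vendor_id', 'pickup_month'], drop_first=True)
print(encoded.columns.tolist())
```

The untouched columns (here `trip_duration`) stay in place and the dummy columns are appended after them.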

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# 1. <u>**`Univariate Analysis`**</u>

---



---



##**1. <u>Passengers</u>**

### What all manipulations have you done and insights you found?

New York City Taxi Passenger Limit says:

* A maximum of 4 passengers can ride in traditional cabs; there are also 5-passenger cabs that look more like minivans.

* A child under 7 is allowed to sit on a passenger's lap in the rear seat in addition to the passenger limit.

So in total we can assume that at most 6 passengers can board a New York taxi, i.e. 5 adults + 1 minor.

In [None]:
pd.options.display.float_format = '{:.2f}'.format #To suppress scientific notation.
Taxi_Time_df.passenger_count.value_counts()

In [None]:
plt.figure(figsize = (20,5))
sns.boxplot(x=data.passenger_count)
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot makes outliers easy to spot. There are some trips with a passenger count of 0, and a few trips with 7, 8 or 9 passengers. These are clear outliers and point to data inconsistency. Most trips carry either 1 or 2 passengers.


##### 2. What is/are the insight(s) found from the chart?

Passenger count is a driver-entered value. Since a trip is not possible without passengers, the driver most likely forgot to enter the value for trips with a passenger count of 0. Let's analyze the passenger count distribution further to make it consistent for further analysis.

In [None]:
data.passenger_count.describe()

As per the details above, the mean, median and mode are all approximately 1, so we replace the 0 passenger counts with 1.

In [None]:
data['passenger_count']=data.passenger_count.map(lambda x:1 if x==0 else x)

We also remove the records with a passenger count of 7, 8 or 9, as they are extreme values and it is very unlikely that so many passengers occupied one taxi.

In [None]:
data=data[data.passenger_count<=6]

Now the data is consistent with respect to the passenger count. Let's take a look at the distribution with a graph below.

In [None]:
sns.countplot(x=data.passenger_count)
plt.show()

The graph shows that most of the trips were taken by a single passenger.
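
The two cleaning steps used for the passenger count (replacing 0 with 1 and dropping counts above 6) can also be written with vectorized pandas calls. A minimal sketch on a toy column, with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'passenger_count': [0, 1, 2, 6, 7, 9]})

# Treat 0 as a missed entry and set it to 1
toy['passenger_count'] = toy['passenger_count'].replace(0, 1)

# Keep only trips within the assumed limit of 6 passengers
toy = toy[toy['passenger_count'] <= 6]
print(toy['passenger_count'].tolist())
```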

##**2. <u>Vendor</u>**

Here we analyze taxi data for the 2 vendors, which are listed as 1 and 2 in the dataset.

In [None]:
sns.countplot(x=data.vendor_id)
plt.show()

Both vendors have an almost equal market share, but Vendor 2 is evidently more popular as per the above graph.

##**3. <u>Distance</u>**


Let's now have a look on the distribution of the distance across the different types of rides.

In [None]:
print(data.distance.describe())

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(x=data.distance)
plt.show()

##### 1. What is/are the insight(s) found from the chart?

There are some trips with a distance of over 100 km, and some trips with a distance of 0 km.

The mean distance travelled is approx 3.5 km, with a standard deviation of 4.3, which shows that most of the trips are limited to the range of 1-10 km.

In [None]:
print("There are {} trip records with 0 km distance".format(data.distance[data.distance == 0 ].count()))

In [None]:
data[data.distance == 0 ].head()

Around 6K trip records have a distance equal to 0. Below are some possible explanations for such records.
1. The customer changed their mind and cancelled the journey just after accepting it.
2. The software didn't record the dropoff location properly, so the dropoff location is the same as the pickup location.
3. There was an issue with the GPS tracker while the journey was being finished.
4. The driver cancelled the trip just after accepting it for some reason, so the trip couldn't start.

There may also be some other issue with the software itself that would need a technical explanation. Either way, there are serious inconsistencies in the data where the dropoff location is the same as the pickup location.

We can't impute the distance values from their correlation with the duration, because the dropoff coordinates would then be out of line with the imputed distance.

In [None]:
data.distance.groupby(pd.cut(data.distance, np.arange(0,100,10))).count().plot(kind='barh')
plt.show()

From the above observation it is evident that most of the rides cover 1-10 km, with some rides between 10-30 km. The bars for the other slabs are not visible because their trip counts are very small compared to these slabs.

##**4. <u>Trip duration</u>**

In [None]:
data.trip_duration.describe()

In [None]:
plt.figure(figsize = (20,5))
sns.boxplot(x=data.trip_duration)
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot exposes outliers: some trip durations are over 100000 seconds, which are clear outliers and should be removed.

##### 2. What is/are the insight(s) found from the chart?
* Some durations are as low as 1 second, which points towards trips with 0 km distance.
* Most trips took between 10-20 mins to complete.
* The mean and mode are not the same, which shows that the trip duration distribution is skewed towards the right.
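
The right skew noted above can be quantified, and it is the usual motivation for modelling the logarithm of the duration. A small sketch on synthetic log-normal "durations" (the distribution parameters are assumptions, not values from the taxi data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed durations in seconds, for illustration only
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.8, size=10_000))

raw_skew = durations.skew()          # clearly positive for a right-skewed sample
log_skew = np.log(durations).skew()  # close to 0 after the log transform

print(f"mean={durations.mean():.0f}, median={durations.median():.0f}")
print(f"skew raw={raw_skew:.2f}, skew log={log_skew:.2f}")
```

A mean well above the median together with a positive skew statistic is the numeric counterpart of the long right tail seen in the boxplot.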

In [None]:
data.trip_duration.groupby(pd.cut(data.trip_duration, np.arange(1,max(data.trip_duration),3600))).count()

There are some trips with more than 24 hours of travel duration, i.e. 86400 seconds, which might have occurred on weekends for outstation travel.
The major chunk of trips is completed within 1 hour, with a good number of trips going above 1 hour.
Let's look at the trips with huge durations; these are outliers and should be removed for data consistency.

In [None]:
data[data.trip_duration > 86400]

These trips run for more than 20 days, which seems unlikely given the distance travelled.
All of them were taken with vendor 1, which may suggest that this vendor allows much longer outstation trips.
All these trips were taken either on Tuesdays in the 1st month or on Saturdays in the 2nd month. There might be some relation with the weekday, pickup location, month and passenger count.
But they defeat our purpose of correct prediction and bring inconsistencies into the algorithm's calculations.

In [None]:
data = data[data.trip_duration <= 86400]

Let's visualize the number of trips taken in slabs of 0-10, 10-20, 20-30 ... minutes respectively.

In [None]:
data.trip_duration.groupby(pd.cut(data.trip_duration, np.arange(1,7200,600))).count().plot(kind='barh')
plt.xlabel('Trip Counts')
plt.ylabel('Trip Duration (seconds)')
plt.show()

We can observe that most of the trips took 0 - 30 mins to complete i.e. approx 1800 secs. Let's move ahead to next feature.

##**5. <u>Speed of a car</u>**

Speed is a function of distance and time. Let's visualize speed in different trips.

Maximum speed limits in NYC are as follows:

* 25 mph in urban areas, i.e. approx 40 km/h
* 65 mph on controlled state highways, i.e. approx 104 km/h
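
The km/h figures above follow from the definition of the mile (1 mile = 1.609344 km). A short worked conversion, with the helper name `mph_to_kmh` chosen only for this sketch:

```python
KM_PER_MILE = 1.609344  # exact definition of the international mile

def mph_to_kmh(mph):
    """Convert a speed from miles per hour to kilometres per hour."""
    return mph * KM_PER_MILE

urban_limit_kmh = mph_to_kmh(25)
highway_limit_kmh = mph_to_kmh(65)
print(f"{urban_limit_kmh:.1f} km/h, {highway_limit_kmh:.1f} km/h")
```

Rounding the highway value down gives the 104 km/h cut-off used for the speed filter.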

In [None]:
data.speed.describe()

In [None]:
plt.figure(figsize = (20,5))
sns.boxplot(x=data.speed)
plt.show()

###1. What is/are the insight(s) found from the chart?

Many trips were done at a speed of over 200 km/h. Going SuperSonic..!!
Let's remove them and focus on the trips which were done at less than 104 km/h as per the speed limits

In [None]:
data = data[data.speed <= 104]
plt.figure(figsize = (20,5))
sns.boxplot(x=data.speed)
plt.show()

The boxplot marks trips over 30 km/h as outliers, but we cannot ignore them because they are well under the highest speed limit of 104 km/h on state controlled highways.
Most trips are done in a speed range of 10-20 km/h, with an average speed of around 14 km/h.
Let's take a look at the speed range distribution with the help of a graph.

In [None]:
data.speed.groupby(pd.cut(data.speed, np.arange(0,104,10))).count().plot(kind = 'barh')
plt.xlabel('Trip count')
plt.ylabel('Speed (Km/H)')
plt.show()

This graph confirms what we thought earlier, i.e. most of the trips were done in a speed range of 10-20 km/h.

##**6. <u>Store_and_fwd_flag</u>**

This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip.

In [None]:

x=Taxi_Time_df['store_and_fwd_flag'].value_counts()
print(x)

In [None]:
plt.style.use("classic")
plt.figure(figsize=(8,8))
plt.pie(x, colors=['lightgreen', 'lightcoral'], shadow=True, explode=[0.5,0], autopct='%1.2f%%', startangle=200)
plt.legend(labels=['N','Y']) #value_counts() lists the majority class first, i.e. 0 (N), then 1 (Y)
plt.title("Store and Forward Flag")

* Visualization tells us that there were very few trips of which the records were stored in memory due to no connection to the server.

# 2. <u>**`Bivariate Analysis`**</u>








## 1. **Total Trips**

* **Total trips Per Hour**

Let's take a look at the distribution of the pickups across the 24 hour time scale.

In [None]:
sns.countplot(x=data.pickup_hour)
plt.show()

This is in line with the general trend of taxi pickups, which start increasing from 6 AM and then decline from late evening, i.e. around 8 PM. There is no unusual behavior here.

* **Total trips per weekday**

In [None]:
# Trips per weekday, Monday to Sunday
plt.figure(figsize = (8,6))
sns.countplot(x=data.pickup_weekday, palette='Accent')
plt.xlabel('Weekday')
# plt.ylabel('Pickup counts')
plt.xticks([0,1,2,3,4,5,6], labels=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'], rotation=90)
plt.show()


Here we can see an increasing trend of taxi pickups from Monday till Friday. The trend starts declining from Saturday till Monday, which is normal since some office-going people like to stay at home and rest on the weekends.

###1. Interesting find:


Taxi pickups increase in the late night hours over the weekend, possibly due to more outstation rides or late night leisure activities nearby.
Early morning pickups, i.e. before 5 AM, increase over the weekend, while office-hour pickups, i.e. after 7 AM, decrease since offices are closed.
Taxi pickups seem to be consistent across the week at 15 hours, i.e. at 3 PM.
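
The weekend-versus-weekday pattern described above compares pickups per hour separately for each weekday. A minimal sketch of such a table uses `pd.crosstab`; the toy timestamps below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'pickup_datetime': pd.to_datetime([
    '2016-01-02 01:15', '2016-01-02 02:40', '2016-01-03 01:05',  # Sat, Sat, Sun
    '2016-01-04 08:30', '2016-01-04 15:00', '2016-01-05 08:10',  # Mon, Mon, Tue
])})
toy['pickup_hour'] = toy['pickup_datetime'].dt.hour
toy['pickup_weekday'] = toy['pickup_datetime'].dt.weekday  # Monday=0 ... Sunday=6

# Rows are pickup hours, columns are weekdays, cells are trip counts
hour_by_weekday = pd.crosstab(toy['pickup_hour'], toy['pickup_weekday'])
print(hour_by_weekday)
```

On the full data the same table could be passed to `sns.heatmap` to see late-night weekend pickups and weekday office-hour pickups side by side.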

* **Total trips per month**


Let's take a look at the trip distribution across the months to understand if there is any difference in the taxi pickups in different months.

In [None]:
# Trips per month, January to June
# plt.figure(figsize=(10,10))
sns.countplot(x=data.pickup_month, palette='Accent')
plt.ylabel('Trip Counts')
# plt.xlabel('Months')
plt.xticks(range(6), labels=['Jan','Feb','March','April','May','June'], rotation=90) #6 ticks for 6 labels
plt.title('Overall Monthly trips')
plt.show()


# plt.style.use("dark_background")
# sns.countplot(Taxi_Time_df['month'], )
#

##2. **Trip Duration**

* **Trip Duration Per Hour**

We need to aggregate the trip duration to plot it against the pickup hour. The aggregation measure can be anything like sum, mean, median or mode of the duration. Since we already did the outlier analysis, we can take the mean to visualize the pattern without biasing the general trend.

In [None]:
group1 = data.groupby('pickup_hour').trip_duration.mean()
sns.pointplot(x=group1.index, y=group1.values)
plt.ylabel('Trip Duration (seconds)')
plt.xlabel('Pickup Hour')
plt.show()

# group3 = data.groupby('month').trip_duration.mean()
# sns.pointplot(group3.index, group3.values)
# plt.ylabel('Trip Duration (seconds)')
# plt.xlabel('Month')
# plt.show()


##### 1. What is/are the insight(s) found from the chart?

Average trip duration is lowest at 6 AM when there is minimal traffic on the roads.

Average trip duration is generally highest around 3 PM, when the streets are busy.

Trip duration on an average is similar during early morning hours i.e. before 6 AM & late evening hours i.e. after 6 PM.

* **Trip duration per weekday**

In [None]:
group2 = data.groupby('pickup_weekday').trip_duration.mean()
sns.pointplot(x=group2.index, y=group2.values)
plt.ylabel('Trip Duration (seconds)')
plt.xlabel('Weekday')
plt.show()

We can see that the average trip duration is almost equally distributed across the week, on a scale of 0-1000 seconds, with minimal difference between days. Trip duration on Thursday is the longest among all days.

* **Trip duration per month**

In [None]:
group3 = data.groupby('pickup_month').trip_duration.mean()
sns.pointplot(x=group3.index, y=group3.values)
plt.ylabel('Trip Duration (seconds)')
plt.xlabel('Month')
plt.show()

We can see an increasing trend in the average trip duration with each subsequent month.

The difference in duration between months is small; it increases gradually over the period of 6 months.

It is lowest during February, when winter starts receding.

There might be seasonal factors such as wind or rain behind this gradual increase in trip duration. For example, May is generally considered the wettest month in NYC, which is in line with our visualization, since trips generally take longer during rainy periods because of traffic jams. So naturally the trip duration would increase towards April, May and June.

* **Trip duration per vendor**

In [None]:
group4 = data.groupby('vendor_id').trip_duration.mean()
sns.barplot(x=group4.index, y=group4.values)
plt.ylabel('Trip Duration (seconds)')
plt.xlabel('Vendor')
plt.show()

Vendor 2 takes the crown. The average trip duration for vendor 2 is higher than for vendor 1 by approx 200 seconds, i.e. at least 3 minutes per trip.

In [None]:
# plt.figure(figsize = (6,6))
# plot_dist = data.loc[(data.distance < 100)]
# sns.boxplot(x = "flag_Y", y = "distance", data = plot_dist)
# plt.ylabel('Distance (km)')
# plt.show()

##3. **Distance**

* **Distance per hour**

Now let us check how the distance is distributed against different variables. Trip distance should be more or less proportional to trip duration if we ignore traffic and other conditions on the road. Let's visualize this for each hour now.

In [None]:
# Mean trip distance for each pickup hour
group5 = data.groupby('pickup_hour').distance.mean()
sns.pointplot(x=group5.index, y=group5.values)
plt.ylabel('Distance (km)')
plt.xlabel('Pickup Hour')
plt.show()
#plt.scatter(data.trip_duration, data.distance , s=1, alpha=0.5)
# plt.ylabel('Distance')
# plt.xlabel('Trip Duration')
# plt.show()

Trip distance is highest during the early morning hours, which can be accounted for by things like:

* Outstation trips taken during the weekends.

* Longer trips towards the city airport, which is located on the outskirts of the city.

Trip distance is fairly equal from morning till evening, varying around 3 - 3.5 km.

It increases gradually from the evening through the late night hours until 5 AM and then decreases steeply towards the morning.

* **Distance per weekday**

In [None]:
group6 = data.groupby('pickup_weekday').distance.mean()
sns.pointplot(x=group6.index, y=group6.values)
plt.ylabel('Distance (km)')
plt.show()

So it's a fairly equal distribution, with the average distance varying around 3.5 km and Sunday at the top, maybe due to outstation trips or night trips towards the airport.

* **Distance per month**

In [None]:
# Chart - 1 visualization code
group7 = data.groupby('pickup_month').distance.mean()
sns.pointplot(x=group7.index, y=group7.values)
plt.ylabel('Distance (km)')
plt.show()

Here also the distribution is almost flat, varying mostly around 3.5 km, with the 5th month having the highest average distance and the 2nd month the lowest.

* **Distance Per Vendor**

In [None]:
group8 = data.groupby('vendor_id').distance.mean()
sns.barplot(x=group8.index, y=group8.values)
plt.ylabel("Distance km")
plt.show()

This is more or less same picture with both the vendors. Nothing more to analyze in this.

##4. **Average Speed**

* **Average speed per hour**

In [None]:
group9 = data.groupby('pickup_hour').speed.mean()
sns.pointplot(x=group9.index, y=group9.values)
plt.show()

The trend is in line with normal circumstances.

Average speed tends to increase after late evening and continues to increase gradually till the early morning hours.

Average taxi speed is highest at 5 AM, then it declines steeply as the office hours approach.

Average taxi speed is more or less the same during the office hours, i.e. from 8 AM till 6 PM.

* **Average speed Per Weekday**

In [None]:
group10 = data.groupby('pickup_weekday').speed.mean()
sns.pointplot(x=group10.index, y=group10.values)
plt.show()

Average taxi speed is higher on weekends than on weekdays, which is expected since weekdays bring the rush of office goers and business owners.

Even on Monday the average taxi speed is higher, which is quite surprising as it is one of the busiest days after the weekend. There can be several explanations for such behaviour:

* Customers who come back from outstation in the early hours of Monday, before 6 AM, to attend office on time.

* Early morning customers who come from the airports after a vacation to attend office or business on time for the coming week.

There could be more reasons that only a local would be aware of.
We also can't rule out anomalies in the dataset, which are quite cumbersome to spot in such a large dataset.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
#Plotting Pearson Correlation heatmap

plt.figure(figsize=(20,10))
sns.heatmap(Taxi_Time_df.corr(numeric_only=True)*100, annot=True, cmap='inferno') #numeric_only skips the string and date columns
plt.title('Correlation Plot')

## ***6. Feature Engineering & Data Pre-processing***

After looking at the dataset from different perspectives, let's prepare it before training our model. Since our dataset does not contain a very large number of dimensions, we will first try feature selection instead of feature extraction.

In [None]:
#Encode your categorical columns
Taxi_Time_df.columns

In [None]:
#show the all columns 
Taxi_Time_df.head()

In [None]:
# #dropping unwanted columns
Taxi_Time_df = Taxi_Time_df.drop(['id','pickup_datetime','pickup_date','dropoff_datetime'], axis=1)


##**Normalization**

In [None]:
#Predictors and Target Variable

X = Taxi_Time_df.drop(['trip_duration'], axis=1)
y = np.log(Taxi_Time_df['trip_duration'])

In [None]:
X.head()

In [None]:
# Normalising Predictors and creating new dataframe
from sklearn.preprocessing import StandardScaler

cols = X.columns

ss = StandardScaler()

new_df = ss.fit_transform(X)
new_df = pd.DataFrame(new_df, columns=cols)
new_df.head()

We normalize the dataset with standard scaling.

Why standard scaling and not MinMax or Normalizer?

* MinMax rescales each feature to the range 0 to 1, which tends to work well for gradient-descent optimization and for distance-based algorithms like KNN, but it is sensitive to outliers.

* Normalizer rescales each row (sample) to unit norm (L1 or L2) rather than scaling each feature, so it suits methods that compare whole samples, such as KNN.

* StandardScaler centers each feature to zero mean and unit variance, so no feature dominates the principal components simply because of its units.
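
To make the difference between the three scalers concrete, here is a small sketch on a toy array with two features on very different scales. The values are made up; only the scikit-learn classes are real.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# Two toy features on different scales (e.g. distance in km, hour of day)
X_toy = np.array([[1.0, 0.0], [3.0, 12.0], [5.0, 23.0], [40.0, 8.0]])

standard = StandardScaler().fit_transform(X_toy)          # per column: mean 0, std 1
minmax = MinMaxScaler().fit_transform(X_toy)              # per column: min 0, max 1
unit_rows = Normalizer(norm='l2').fit_transform(X_toy)    # per row: L2 norm 1

print(np.round(standard, 2))
print(np.round(minmax, 2))
print(np.round(unit_rows, 2))
```

Note that `StandardScaler` and `MinMaxScaler` work column by column, while `Normalizer` works row by row, which is why it is not a drop-in replacement for feature scaling before PCA.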

## **The First Approach - Decomposition using Principal Component Analysis (PCA)**

* Now that scaling is done, we pass the scaled DataFrame to a PCA model and look at the elbow plot to get a better idea of the explained variance.

* Why PCA? It is a dimensionality reduction technique and also a feature extraction technique. PCA creates new features from the original ones, and the new features are always uncorrelated with each other. So it is not just a dimensionality reduction process; it also removes the correlation between the variables.

* We'll also go through an approach without PCA in the second part and later compare its results with the PCA approach.
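
The claim that PCA removes correlation between features can be checked numerically. The sketch below builds three synthetic features, two of them strongly correlated, and compares the correlation matrix before and after the transform. The data is random and only serves as an illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
a = rng.normal(size=500)
# Three synthetic features; the second is strongly correlated with the first
X_toy = np.column_stack([a, 0.9 * a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

X_scaled = StandardScaler().fit_transform(X_toy)
components = PCA(n_components=3).fit_transform(X_scaled)

corr_before = np.corrcoef(X_scaled, rowvar=False)
corr_after = np.corrcoef(components, rowvar=False)
print(np.round(corr_before, 2))
print(np.round(corr_after, 2))
```

After the transform the off-diagonal correlations are numerically zero, which is what "uncorrelated components" means in practice.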

In [None]:
X = new_df

In [None]:
#Applying PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=len(Taxi_Time_df.columns)-1)
pca.fit_transform(X)
var_rat = pca.explained_variance_ratio_
var_rat

In [None]:
#Variance Ratio vs PC plot

plt.figure(figsize=(15,6))
plt.bar(np.arange(pca.n_components_), var_rat, color="grey") #plot the variance ratio, as the cell title says

* From the 14th component onward the plot goes flat; the remaining components explain very little variance.

In [None]:
#Cumulative Variance Ratio

plt.figure(figsize=(10,6))
plt.plot(np.cumsum(var_rat)*100, color="g", marker='o')
plt.xlabel("Principal Components")
plt.ylabel("Cumulative Variance Ratio")
plt.title('Elbow Plot')

In [None]:
#Applying PCA as per required components

pca = PCA(n_components=14)
transform = pca.fit_transform(X)
pca.explained_variance_

* Above, we took 14 as the required number of components and extracted new features by transforming the data.
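
Instead of reading the number of components off the elbow plot, it can be derived from a target share of explained variance. A hedged sketch on synthetic data follows; the 95% threshold is an assumption for illustration, not a value used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic data: 10 features with decreasing variance (left unscaled on purpose)
X_toy = rng.normal(size=(300, 10)) * np.linspace(3.0, 0.3, 10)

pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # assumed target share of variance
# Smallest number of components whose cumulative ratio reaches the threshold
n_keep = int(np.argmax(cumulative >= threshold) + 1)
print(n_keep, round(float(cumulative[n_keep - 1]), 3))
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the component count by the same kind of cumulative-variance rule.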

In [None]:
#importance of features in Particular Principal Component

plt.figure(figsize=(25,6))
sns.heatmap(pca.components_, annot=True, cmap="winter")
plt.ylabel("Components")
plt.xlabel("Features")
plt.xticks(np.arange(len(X.columns)), X.columns, rotation=90)
plt.title('Contribution of a Particular feature to our Principal Components')

* The above plot gives us a detailed idea of how much each feature contributes to each principal component.

* Principal components are our new features, each of which combines information from every original feature.

* PCA reduces the dimensions while retaining as much information as possible.

###**Splitting Data and Choosing Algorithms**

Let's pass the PCA-transformed data to our machine learning regression algorithms. Linear Regression is a good starting point, after splitting our data into training and testing sets (30% for testing).
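
A minimal sketch of the split-and-fit step on synthetic data, assuming the scaler and PCA are wrapped in a scikit-learn `Pipeline` so that both are fitted on the training portion only. The data, the coefficient vector and the choice of 4 components are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(1000, 6))                     # synthetic predictors
true_coef = np.array([0.5, -0.2, 0.1, 0.0, 0.3, 0.0])  # made-up relationship
y_toy = X_toy @ true_coef + 0.1 * rng.normal(size=1000)

# 70/30 split, as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=10)

model = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=4)),
    ('lr', LinearRegression()),
])
model.fit(X_train, y_train)             # scaler and PCA see only the training rows
r2_test = model.score(X_test, y_test)   # R^2 on the held-out 30%
print(X_train.shape, X_test.shape, round(r2_test, 3))
```

Fitting the preprocessing inside the pipeline keeps test rows out of the scaling and PCA statistics, so the test score is a fairer estimate.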

# ***7. ML Model Implementation***

### **Why Linear Regression , Decision Tree and Random Forest ?**

**Linear Regression**

* Simple to explain.
* Model training and prediction are fast.
* No tuning is required except regularization.

### **Decision Tree:**

* Decision trees are very intuitive and easy to explain.
* They follow the same pattern of thinking that humans use when making decisions.
* Decision trees are a common-sense technique to find the best solutions to problems with uncertainty.

### **Random Forest:**

* It is often among the more accurate general-purpose learning algorithms.
* Random Forest consists of multiple Decision Trees; the results from the individual trees are combined to give the final prediction.
* Random forests overcome several problems of single decision trees, such as overfitting.

**So, I want to approach from base model built using basic Linear Regression and then bring in more Sophisticated Algorithms - Decision Tree & Random Forest. It will give us good idea how Linear Regression performs against Decision Tree Regressor and Random Forest Regressor. Later, we will also approach with same algorithms on "without PCA" data. Finally, we'll evaluate both approaches we took and lay down recommended approach and algorithms.**

In [None]:
#Passing in transformed values as predictors

X = transform
y = np.log(Taxi_Time_df['trip_duration']).values
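Since the target is log-transformed here, every prediction the models make later is on the log scale as well. A minimal sketch, using made-up durations rather than the taxi data, of how such values map back to seconds with `np.exp`:

```python
import numpy as np

durations_sec = np.array([300.0, 900.0, 3600.0])   # hypothetical trip durations in seconds
y_log = np.log(durations_sec)                      # same transform as applied to trip_duration

# A model trained on y_log predicts log-seconds; np.exp undoes the transform
recovered = np.exp(y_log)
assert np.allclose(recovered, durations_sec)

# An error of 0.1 on the log scale is a relative error of roughly 10.5% in seconds
relative_error = np.exp(0.1) - 1
print(round(relative_error, 3))
```

This is also why errors measured on the log target are best read as relative, not absolute, deviations.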

In [None]:
#importing train test split & some important metrics

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import r2_score, mean_squared_log_error , mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:

# X = new_df
# y = np.log(Taxi_Time_df['trip_duration']).values

### ML Model - 1

## Linear Regression

In [None]:
#implementing Linear regression
from sklearn.linear_model import LinearRegression

est_lr = LinearRegression()
est_lr.fit(X_train, y_train)
lr_pred = est_lr.predict(X_test)
lr_pred

In [None]:
#coefficients & intercept

est_lr.intercept_, est_lr.coef_

Note: The values can change every time the model is run afresh.

Because the target is `np.log(trip_duration)`, each coefficient is the change in log trip duration associated with a 1 unit increase in that Principal Component, holding all other Principal Components fixed:

* 1st PC: a decrease of 0.12842633
* 2nd PC: an increase of 0.18161527
* 3rd PC: a decrease of 0.01200735
* 4th PC: an increase of 0.01200735
* 5th PC: an increase of 0.01654769
* 6th PC: an increase of 0.08977729
* 7th PC: an increase of 0.03316785
* 8th PC: an increase of 0.01846044
* 9th PC: a decrease of 0.00394832
* 10th PC: a decrease of 0.00148052
* 11th PC: an increase of 0.30265117
* 12th PC: a decrease of 0.52439944
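Since the model is fitted on the log of trip duration, a coefficient `b` can also be read as a multiplicative effect of `exp(b)` on the duration itself. A small sketch of that conversion, using two of the coefficients from the interpretation above (the dictionary keys are just labels chosen here):

```python
import numpy as np

# Two coefficients taken from the interpretation above; the target is log(trip_duration)
coefs = {"PC2": 0.18161527, "PC12": -0.52439944}

for name, b in coefs.items():
    factor = np.exp(b)                 # multiplicative effect on the duration
    pct = (factor - 1) * 100           # the same effect as a percentage change
    print(f"{name}: duration multiplied by {factor:.3f} ({pct:+.1f}%)")
```

Reading coefficients this way avoids mistaking a change in log-seconds for a change in seconds.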



In [None]:
#examining scores
print ("Training Score : " , est_lr.score(X_train, y_train))

print ("Validation Score : ", est_lr.score(X_test, y_test))

print ("Cross Validation Score : " , cross_val_score(est_lr, X_train, y_train, cv=5).mean())
print ("R2_Score : ", r2_score(y_test, lr_pred))  # r2_score expects (y_true, y_pred)


In [None]:
#prediction vs real data
plt.style.use("seaborn-v0_8-dark")  # this style was named "seaborn-dark" before matplotlib 3.6
plt.figure(figsize=(15,8))
plt.subplot(1,1,1)
sns.distplot(y_test, kde=False, color="red", label="Test")

plt.subplot(1,1,1)
sns.distplot(lr_pred, kde=False, color="green", label="Prediction")
plt.legend()
plt.title("Test VS Prediction")

From the plot above, we can see that Linear Regression isn't performing well: the actual data (in red) and the predicted values (in green) differ substantially. Linear Regression doesn't seem like the right choice for trip duration prediction.

## **Null RMSLE**

In [None]:
#null rmsle implementation
from sklearn.metrics import mean_squared_log_error
y_null = np.zeros_like(y_test, dtype=float)
y_null.fill(y_test.mean())
print ("Null RMSLE : ", np.sqrt(mean_squared_log_error(y_test, y_null)))

### ML Model - 2

## **Decision Tree**

In [None]:
#implementation of decision tree

from sklearn.tree import DecisionTreeRegressor

est_dt = DecisionTreeRegressor(max_depth=10)  # default criterion is squared error (MSE); the "mse" alias was removed in scikit-learn 1.2
est_dt.fit(X_train, y_train)
dt_pred = est_dt.predict(X_test)
dt_pred

In [None]:
#examining metrics

print ("Training Score : " , est_dt.score(X_train, y_train))

print ("Validation Score : ", est_dt.score(X_test, y_test))

print ("Cross Validation Score : " , cross_val_score(est_dt, X_train, y_train, cv=5).mean())

print ("R2_Score : ", r2_score(y_test, dt_pred))  # r2_score expects (y_true, y_pred)

print ("RMSLE : ", np.sqrt(mean_squared_log_error(y_test, dt_pred)))

* Our goal is to bring the loss (RMSLE) as far as possible below the Null RMSLE baseline of 0.1146.

In [None]:
#prediction vs real data

plt.figure(figsize=(15,8))
plt.subplot(1,1,1)
sns.distplot(y_test, kde=False, color="red", label="Test")

plt.subplot(1,1,1)
sns.distplot(dt_pred, kde=False, color="blue", label="Prediction")
plt.legend()
plt.title("Test VS Prediction")

* From the plot above, we can see that the Decision Tree is performing well: the actual data (in red) and the predicted values (in blue) are close to each other. Decision Tree could be a good choice for trip duration prediction.

### ML Model - 3

## **Random Forest**

In [None]:
#random forest implementation
from sklearn.ensemble import RandomForestRegressor
est_rf = RandomForestRegressor(n_estimators=5, max_depth=10)  # default criterion is squared error (MSE); the "mse" alias was removed in scikit-learn 1.2
est_rf.fit(X_train, y_train)
rf_pred = est_rf.predict(X_test)
rf_pred

In [None]:

#examining metrics 

print ("Training Score : " , est_rf.score(X_train, y_train))

print ("Validation Score : ", est_rf.score(X_test, y_test))

print ("Cross Validation Score : " , cross_val_score(est_rf, X_train, y_train, cv=5).mean())

print ("R2_Score : ", r2_score(y_test, rf_pred))  # r2_score expects (y_true, y_pred)

print ("RMSLE : ", np.sqrt(mean_squared_log_error(y_test, rf_pred)))

In [None]:
#prediction vs real data

plt.figure(figsize=(15,8))
plt.subplot(1,1,1)
sns.distplot(y_test, kde=False, color="black", label="Test")

plt.subplot(1,1,1)
sns.distplot(rf_pred, kde=False, color="green", label="Prediction")
plt.legend()
plt.title("Test VS Prediction")

* From the plot above, we can see that Random Forest is also performing well: the actual data (in black) and the predicted values (in green) are close to each other. Random Forest could be a good choice for trip duration prediction.

* Similarly, we can tune Random Forest's hyperparameters to get the most out of it.
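One common way to do such tuning is a cross-validated grid search. The following is only a sketch on synthetic data: the parameter grid, the scoring choice and the names `X_demo`, `y_demo` and `search` are assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(10)
X_demo = rng.normal(size=(300, 5))                        # synthetic predictors
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + 0.1 * rng.normal(size=300)

# Hypothetical grid; sensible ranges depend on the data and the compute budget
param_grid = {"n_estimators": [5, 20], "max_depth": [3, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=10),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",   # scores are negated errors, so higher is better
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

On the full taxi data a grid like this multiplies training time by the number of combinations times the number of folds, so a small grid or a data subsample is a reasonable starting point.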


## R2 Scores Evaluation

* R2 Score or R-Squared is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

* The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
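The properties above can be checked with a few lines of NumPy. This is a hand-rolled sketch of the standard formula, with toy numbers, meant only to illustrate the definition; it also shows that R² is not symmetric in its arguments, which is why scikit-learn's `r2_score` takes the true values first and the predictions second.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0])

constant = np.full_like(y_true, y_true.mean())
print(r2(y_true, constant))   # a constant mean prediction scores 0.0
print(r2(y_true, y_pred))     # 0.6
print(r2(y_pred, y_true))     # -1.0: swapping the arguments changes the score
```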

In [None]:
#r2 score plot for all 3 models

r2 = pd.DataFrame({'Scores': [r2_score(y_test, lr_pred), r2_score(y_test, dt_pred), r2_score(y_test, rf_pred)], 'Model': ['Linear Regression', 'Decision Tree', 'Random Forest']})
r2.set_index('Model').plot(kind="bar", color="brown", figsize=(10,7))  # DataFrame.plot opens its own figure
plt.axhline(y=0, color='g')
plt.title("R2 Scores")

* Although R2 Score isn't our evaluation metric, I'm plotting it to check goodness of fit.

* Decision Tree and Random Forest both give a good fit, i.e., scores close to 1.0.

## RMSLE Evaluation

* RMSLE (Root Mean Squared Logarithmic Error) is the square root of the mean squared difference between log(1 + prediction) and log(1 + actual value).
* Because it works on logarithms, it measures relative rather than absolute deviation, so a few very long trips don't dominate the loss.
* Lower values of RMSLE indicate a better fit with less loss.
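As an illustration of the metric, here is a minimal NumPy sketch of RMSLE, which compares log(1 + prediction) with log(1 + actual value). The helper name `rmsle` and the toy numbers are assumptions for this example; the notebook itself relies on `mean_squared_log_error` from scikit-learn.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([100.0, 1000.0])
# Both predictions are off by a factor of 2, but by very different absolute amounts
y_pred = np.array([200.0, 2000.0])

print(rmsle(y_true, y_pred))   # close to log(2), about 0.69
print(rmsle(y_true, y_true))   # 0.0 for a perfect prediction

# Null baseline: always predict the mean of the targets
y_null = np.full_like(y_true, y_true.mean())
print(rmsle(y_true, y_null))
```

Both errors contribute almost equally to the score even though one is ten times larger in absolute terms, which is the relative-error behaviour that makes RMSLE suitable for skewed targets.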

In [None]:
#RMSLE plot
rmsle = pd.DataFrame({'RMSLE': [np.sqrt(mean_squared_log_error(y_test, dt_pred)), np.sqrt(mean_squared_log_error(y_test, rf_pred))], 'Model': ['Decision Tree', 'Random Forest']})
rmsle.set_index('Model').plot(kind="bar", color="lightblue", legend=False, figsize=(10,10))  # DataFrame.plot opens its own figure
plt.title("RMSLE - Lesser is Better")

* Remember our Null RMSLE of 0.1146 as the benchmark to beat.

* The plot above shows that both the Decision Tree and Random Forest models perform well. Since Random Forest gives the lower RMSLE, it is the model to opt for.

# **The Second Approach - Without PCA**

* Another approach is to skip PCA: just apply Standard Scaling to the dataset and run our algorithms.

* Comparing the two approaches gives us a better idea of what works for this data.

* This approach might take considerable computational resources and time. Running it on Google Colaboratory (Colab) moves that load off our own machine, since the program runs in the cloud.

In [None]:
X = new_df
y = np.log(Taxi_Time_df['trip_duration']).values

### ML Model - 1

## **Linear regression**

In [None]:
#train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)

In [None]:
#implementing linear regression

est_lr = LinearRegression()
est_lr.fit(X_train, y_train)
lr_pred = est_lr.predict(X_test)
lr_pred

In [None]:
#Intercept & Coef

est_lr.intercept_, est_lr.coef_

In [None]:

#Examining metrics

print ("Training Score : " , est_lr.score(X_train, y_train))

print ("Validation Score : ", est_lr.score(X_test, y_test))

print ("Cross Validation Score : " , cross_val_score(est_lr, X_train, y_train, cv=5).mean())

print ("R2_Score : ", r2_score(y_test, lr_pred))  # r2_score expects (y_true, y_pred)

#print ("RMSLE : ", np.sqrt(mean_squared_log_error(lr_pred, y_test)))

In [None]:
#prediction vs validation data

plt.figure(figsize=(15,8))
plt.subplot(1,1,1)
sns.distplot(y_test, kde=False, color="black", label="Test")

plt.subplot(1,1,1)
sns.distplot(lr_pred, kde=False, color="g", label="Prediction")
plt.legend()
plt.title("Test VS Prediction")

* The plot shows that Linear Regression isn't performing well with the second (without PCA) approach either.

### ML Model - 2

## **Decision Tree**

In [None]:
#decision tree implementation

est_dt = DecisionTreeRegressor(max_depth=10)  # default criterion is squared error (MSE); the "mse" alias was removed in scikit-learn 1.2
est_dt.fit(X_train, y_train)
dt_pred = est_dt.predict(X_test)
dt_pred

In [None]:
#examining metrics

print ("Training Score : " , est_dt.score(X_train, y_train))

print ("Validation Score : ", est_dt.score(X_test, y_test))

print ("Cross Validation Score : " , cross_val_score(est_dt, X_train, y_train, cv=5).mean())

print ("R2_Score : ", r2_score(y_test, dt_pred))  # r2_score expects (y_true, y_pred)

print ("RMSLE : ", np.sqrt(mean_squared_log_error(y_test, dt_pred)))

In [None]:
#prediction vs reality check

plt.figure(figsize=(15,8))
plt.subplot(1,1,1)
sns.distplot(y_test, kde=False, color="black", label="Test")

plt.subplot(1,1,1)
sns.distplot(dt_pred, kde=False, color="cyan", label="Prediction")
plt.legend()
plt.title("Test VS Prediction")

* Against our Null RMSLE of 0.1146, this model's loss of 0.0241 is good, but it is no improvement on the RMSLE of 0.0241 we got in the previous approach, where we applied PCA.

### ML Model - 3

## **Random Forest**

In [None]:
#implementation of forest algorithm

from sklearn.ensemble import RandomForestRegressor

est_rf = RandomForestRegressor(n_estimators=5, max_depth=10)  # default criterion is squared error (MSE); the "mse" alias was removed in scikit-learn 1.2
est_rf.fit(X_train, y_train)
rf_pred = est_rf.predict(X_test)
rf_pred

In [None]:
#examining metrics

print ("Training Score : " , est_rf.score(X_train, y_train))

print ("Validation Score : ", est_rf.score(X_test, y_test))

print ("Cross Validation Score : " , cross_val_score(est_rf, X_train, y_train, cv=5).mean())

print ("R2_Score : ", r2_score(y_test, rf_pred))  # r2_score expects (y_true, y_pred)

print ("RMSLE : ", np.sqrt(mean_squared_log_error(y_test, rf_pred)))

In [None]:
#prediction vs reality check

plt.figure(figsize=(15,8))
plt.subplot(1,1,1)
sns.distplot(y_test, kde=False, color="black", label="Test")

plt.subplot(1,1,1)
sns.distplot(rf_pred, kde=False, color="indigo", label="Prediction")
plt.legend()
plt.title("Test VS Prediction")

* The loss here, 0.0229, is lower than the Decision Tree's RMSLE of 0.0241. It could still be reduced further, perhaps with the PCA approach or hyperparameter tuning.

In [None]:
#r2 score plot for all 3 models

r2 = pd.DataFrame({'Scores': [r2_score(y_test, lr_pred), r2_score(y_test, dt_pred), r2_score(y_test, rf_pred)], 'Model': ['Linear Regression', 'Decision Tree', 'Random Forest']})
r2.set_index('Model').plot(kind="bar", color="maroon", figsize=(8,7))  # DataFrame.plot opens its own figure
plt.axhline(y=0, color='g')
plt.title("R2 Scores")

In [None]:
#RMSLE plot

rmsle = pd.DataFrame({'RMSLE': [np.sqrt(mean_squared_log_error(y_test, dt_pred)), np.sqrt(mean_squared_log_error(y_test, rf_pred))], 'Model': ['Decision Tree', 'Random Forest']})
rmsle.set_index('Model').plot(kind="bar", color="skyblue", legend=False, figsize=(8,7))  # DataFrame.plot opens its own figure
plt.title("RMSLE - Lesser is Better")

## **What's better - Decision Tree or Random Forest?**


* One problem with a single Decision Tree is that it can overfit.

* The difference is that a Random Forest is a collection of Decision Trees.

* A single Decision Tree sees all the features and all the training rows, so a deep tree can memorize the training data and then predict poorly on unseen data.

* A Random Forest trains each tree on a random sample of rows (and can consider a random subset of features at each split), then combines the results from all the trees to get a more accurate and stable final result.
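The overfitting contrast described above can be observed on a small synthetic problem. This is a sketch under assumed settings (noisy made-up data, an unrestricted tree, 50 trees in the forest), not a re-run of the notebook's models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)
X_demo = rng.uniform(-3, 3, size=(600, 3))
# Noisy target, so memorizing the training rows does not generalize
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=10)

tree = DecisionTreeRegressor(random_state=10).fit(X_tr, y_tr)   # unrestricted depth
forest = RandomForestRegressor(n_estimators=50, random_state=10).fit(X_tr, y_tr)

for name, model in [("tree", tree), ("forest", forest)]:
    # A large gap between train and test R^2 signals overfitting
    print(name, round(model.score(X_tr, y_tr), 3), round(model.score(X_te, y_te), 3))
```

The unrestricted tree fits its training split almost perfectly, so the number to compare between the two models is the test score, not the training score.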

## **Insights:**


* Observed which taxi service provider is most frequently used by New Yorkers.

* Found a few trips lasting 528 to 972 hours, which are most likely outliers.

* With the help of Tableau, we made good use of the geographical data in the dataset to identify prominent taxi pickup / dropoff locations.

* Also found some trips whose pickup / dropoff points ended up somewhere in the North Atlantic Ocean.

* Passenger count analysis showed that there were a few trips with zero passengers.

* Monthly trip analysis shows March and April with the highest number of trips and January with the lowest, possibly due to snowfall.
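Anomalies like the ones listed above are often handled with simple row filters. The sketch below uses a tiny made-up table; every column name other than `trip_duration`, the 24-hour cap and the bounding box around New York City are assumptions chosen for illustration, not rules taken from this notebook.

```python
import pandas as pd

# Tiny made-up sample; column names other than trip_duration are assumed
trips = pd.DataFrame({
    "trip_duration":    [600, 1_900_800, 450, 720],   # seconds; 1,900,800 s = 528 hours
    "passenger_count":  [1, 2, 0, 3],
    "pickup_longitude": [-73.98, -73.95, -73.99, -60.00],
    "pickup_latitude":  [40.75, 40.78, 40.73, 38.00],
})

# Assumed rules: trips under 24 hours, at least one passenger,
# and pickups inside a rough bounding box around New York City
mask = (
    (trips["trip_duration"] < 24 * 3600)
    & (trips["passenger_count"] > 0)
    & trips["pickup_longitude"].between(-74.3, -73.6)
    & trips["pickup_latitude"].between(40.4, 41.0)
)
clean = trips[mask]
print(len(trips), "->", len(clean))
```

Whether to drop or cap such rows is a judgment call; the point is only that each insight translates into one explicit, auditable condition.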

## **Recommended Approach:**

* Apply Standard Scaling to the dataset to standardize the values.

* Then apply PCA to reduce dimensions, since extracting features from the primary DateTime feature adds many columns. Those additional features might expose the model to the "curse of dimensionality" and could drastically affect performance.

* Pass the PCA-transformed data to the ML regression algorithms and evaluate the results.
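The three steps above can be chained in a single scikit-learn pipeline. This is a sketch on synthetic data with an assumed component count and assumed forest settings; one design point it illustrates is that the scaler and PCA are fitted on the training split only, so no statistics from the test split leak into the transformation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X_demo = rng.normal(size=(400, 8))                        # synthetic feature matrix
y_demo = X_demo[:, 0] - 2 * X_demo[:, 3] + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=10)

# Scaler and PCA are fitted on the training split, then reused unchanged on the test split
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),                                  # assumed component count
    RandomForestRegressor(n_estimators=20, max_depth=10, random_state=10),
)
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))
```

A pipeline object like this can also be passed directly to `cross_val_score` or `GridSearchCV`, which repeat the fit-on-train-only logic inside every fold.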

## **Future Work (Optional)**

* The model's performance can be improved further with hyperparameter tuning.

* Other ML algorithms can be tried.

* An ANN (Artificial Neural Network) approach could also be explored.


# **Conclusion**


In this project we covered various aspects of the Machine Learning development cycle. We observed that data exploration and variable analysis are a very important part of the cycle and should be done to understand the data thoroughly. We also cleaned the data while exploring it, as there were outliers that needed to be treated before feature engineering. We then did feature engineering to keep only the most significant features, those covering most of the variance in the dataset. Finally, we trained the models on this feature set to get the results.
