<a href="https://colab.research.google.com/github/Rishabhyadav888/NYC_taxi_trip_duration_prediction/blob/main/NYC_taxi_trip_duration_prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 - Rishabh Kumar Yaddav**


# **Project Summary -**


The New York City Taxi Trip Duration prediction project aims to build a machine learning model to predict the duration of a taxi ride given certain features such as pickup and dropoff coordinates, pickup datetime, and the number of passengers. The dataset used for this project contains information about 1.4 million taxi rides in New York City.

The dataset is preprocessed by converting the pickup and dropoff timestamps to datetime objects and extracting additional features such as the day of the week, the hour of the day, and whether the trip was taken during rush hour. The pickup and dropoff coordinates are also used to calculate the distance between the two locations with the Geopy library in Python.

Next, the project explores and visualizes the data to gain insights. The exploratory data analysis uses scatterplots, histograms, point plots and heatmaps to understand the distribution and correlation of the features. The analysis shows that the duration of the taxi rides is positively skewed.

Based on the insights gained from data exploration and visualization, the data was cleaned to remove outliers and inconsistencies. Trips lasting more than 24 hours or less than 30 seconds were removed, as they were likely outstation travel or trips with zero distance. Most trips had a single passenger, so zero passenger counts were replaced with one, and trips with more than six passengers were removed to align with NYC regulations. Trips exceeding 100 km and speeds over 104 km/h were considered outliers and removed. Trips with zero distance but more than one minute of travel time were removed, likely due to cancelled bookings. Trips taking more than an hour to cover one or two kilometers were also removed, as this was highly unlikely in a developed city like NYC. These cleaning operations made the data more consistent and reliable for further analysis.

Taxi pickups increase from 6 AM until late evening around 8 PM, with the busiest hours being 6:00 PM to 7:00 PM, likely due to people returning home from work. Pickups also increase from Monday to Friday and decline from Saturday to Monday. VIF values are less than or equal to 5, so there is no high correlation among features.


The next step involved feature engineering, where the store_and_fwd_flag feature was encoded using label encoding with Y:1 and N:0. For the month and weekday features, which contain more than two categories, one-hot encoding was used. An 80/20 split ratio was used for model learning and evaluation. Recursive feature selection was employed, and backward elimination was done based on the coefficient value of each feature by running different combinations. The selected features included pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, distance, speed, month_June, month_May, weekday_Monday, and weekday_Sunday. A standard scaler was used to bring the features onto a common scale, as they had different value ranges. A power transform was used to make the non-normally distributed input features closer to normal, and a linear regression model was trained and evaluated with an R2 score of 0.75 on the transformed data. This feature engineering process prepared the data for modeling.

Linear regression was used to train and test the data for this regression problem. The model was evaluated using RMSE, R2 score, and adjusted R2 score, resulting in an R2 score of 0.80 and an RMSE of 296 seconds on the testing dataset.

RandomizedSearchCV was used for hyperparameter tuning due to the large dataset of approximately 1.4 million records, and there was no improvement in accuracy after the tuning process. The ridge and lasso models yielded the same result as the R2 score remained at 0.80.

Due to the large dataset of approximately 1.4 million records, the decision tree regressor was utilized as it is both fast and efficient. The hyperparameters were set to default, except for a maximum depth of 12 to avoid overfitting. The model was evaluated using RMSE and R2 score, yielding a high R2 score of 0.998 and a low RMSE of 27 seconds on the testing dataset and cross validation score of 0.998.

XGBoost is a distributed gradient boosting library that converts weak learners into strong learners using the gradient boosting framework. Boosting is a sequential process where trees are grown using information from previously grown trees. The model was evaluated using RMSE and R2 score, achieving an R2 score of 0.998 and an RMSE of 20 seconds on the testing dataset.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Build a machine learning model to predict the trip duration of NYC taxi trip.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import warnings; warnings.simplefilter('ignore')



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path="/content/drive/MyDrive/Capstone Project/NYC taxi trip duration prediction/"

In [None]:
df=pd.read_csv(path + "NYC Taxi Data.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

**Feature details:**


*  id - a unique identifier for each trip.
*  vendor_id - a code indicating the provider associated with the trip record.
*  pickup_datetime - date and time when the meter was engaged.
*  dropoff_datetime - date and time when the meter was disengaged.
*  passenger_count - the number of passengers in the vehicle (driver entered value).
*  pickup_longitude - the longitude where the meter was engaged.
*  pickup_latitude - the latitude where the meter was engaged.
*  dropoff_longitude - the longitude where the meter was disengaged.
*  dropoff_latitude - the latitude where the meter was disengaged.
*  store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip.

**Target detail:**

*  trip_duration - duration of the trip in seconds



### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

*There are approximately 1.4 million records in our dataset, with 10 features and 1 target column.*

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

*There are no duplicate entries in it.*

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

There are no NaN/null values in our dataset, so we don't have to impute any records.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns=df.columns
columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

Passenger count varies from 0 to 9. Trip duration has a maximum value of 3526282 seconds (almost 979.5 hours) and a minimum of 1 second. These extremes are clearly outliers, so we will remove them.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Let us now look at the datatypes of all features.
df.dtypes


pickup_datetime and dropoff_datetime are of type 'object', so we convert them to type 'datetime'.

In [None]:
# Convert pickup_datetime, dropoff_datetime into "datetime"
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])

We will calculate and assign some new features to the dataframe, such as distance, speed, month, pickup_hour and weekday, which will help us gain more insights from the data.

In [None]:
from geopy.distance import great_circle

In [None]:
def cal_distance(pickup_lat, pickup_long, dropoff_lat, dropoff_long):
    """Return the great-circle distance in km between the pickup and dropoff points."""
    pickup_point = (pickup_lat, pickup_long)
    drop_point = (dropoff_lat, dropoff_long)
    return great_circle(pickup_point, drop_point).km

In [None]:
# Calculate the distance from pickup point to dropoff point and store it as a new feature

df["distance"] = df.apply(
    lambda x: cal_distance(x["pickup_latitude"], x["pickup_longitude"],
                           x["dropoff_latitude"], x["dropoff_longitude"]),
    axis=1,
)

In [None]:
#Calculated speed of the taxi for getting more insights

df['speed'] = (df.distance/(df.trip_duration/3600))

In [None]:
#Calculated month from pickup_datetime

df['month'] = df.pickup_datetime.dt.month_name()

In [None]:
#Calculated pickup_hour from pickup_datetime

df['pickup_hour'] = df.pickup_datetime.dt.hour

In [None]:
#Calculated weekday name from pickup_datetime

df['weekday'] = df.pickup_datetime.dt.day_name()

In [None]:
df.head()

### What all manipulations have you done and insights you found?

Converted pickup_datetime and dropoff_datetime to datetime.

We have created the following features:


*   Distance - great-circle distance between the pickup and dropoff points, in km.
*   Speed - average speed of the taxi during the trip, in km/h.
*   Month - month name of the pickup (January to December).
*   Pickup_hour - hour of the pickup in 24-hour format.
*   Weekday - day name of the pickup (Monday to Sunday).
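
The row-wise `df.apply` call used for the distance feature can be slow on about 1.4 million rows. A minimal sketch of a vectorized alternative is shown below, using the haversine formula with NumPy. The function name `haversine_km` and the radius constant are our own choices for illustration; the radius is assumed to match the mean Earth radius used by geopy's `great_circle`, so small numerical differences from the notebook's values are possible.

```python
import numpy as np

# Mean Earth radius in km (assumption: chosen to match geopy's great_circle default).
EARTH_RADIUS_KM = 6371.009

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are degrees (scalars, arrays or Series)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Toy check: one degree of longitude on the equator is roughly 111.19 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(float(one_degree), 2))

# In the notebook this could be applied to whole columns at once, for example:
# df["distance"] = haversine_km(df["pickup_latitude"], df["pickup_longitude"],
#                               df["dropoff_latitude"], df["dropoff_longitude"])
```

Because the function operates on whole columns, it avoids calling Python code once per row.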




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# ***Univariate Analysis***

#### Trip_duration

In [None]:
# Dependent variable 'trip_duration'
plt.figure(figsize=(7,7))
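# sns.distplot is deprecated; histplot with a KDE overlay is its replacement
sns.histplot(df['trip_duration'], bins=50, kde=True, stat="density", color="b")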

##### 1. Why did you pick the specific chart?

A distribution plot shows the shape of the target variable. Here trip_duration is right-skewed, so let's apply a log10 transform to bring it closer to a normal distribution.

In [None]:
plt.figure(figsize=(7,7))
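# sns.distplot is deprecated; histplot with a KDE overlay is its replacement
sns.histplot(np.log10(df['trip_duration']), bins=50, kde=True, stat="density", color="y")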

After the log10 transformation, trip_duration is approximately normally distributed.
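
The effect of the log transform can also be checked numerically with the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for `trip_duration` (the parameters are arbitrary assumptions, not values from the dataset); in the notebook the same two calls could be run on `df['trip_duration']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for trip_duration in seconds: log-normal, hence right-skewed.
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.7, size=10_000))

raw_skew = durations.skew()            # clearly positive for right-skewed data
log_skew = np.log10(durations).skew()  # near 0 when the log values are roughly symmetric

print(f"skew before: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```

A skewness close to zero after the transform supports what the plot suggests visually.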

Let's check for the outliers through boxplot.

In [None]:
plt.figure(figsize = (10,7))
sns.boxplot(x=df['trip_duration'])

We can clearly see outliers, which should be removed for data consistency.

Let's analyze more

In [None]:
plt.figure(figsize=[12,8])
labels=['less than 1 min','within 10 mins','within 30 mins','within an hour','within a day','within two days','more than two days']
df.groupby(pd.cut(df['trip_duration'],bins=[0,60,600,1800,3600,86400,86400*2,10000000],labels=labels))['trip_duration'].count().plot(kind='bar',fontsize=10)
plt.title("Trip duration with count")
plt.ylabel("trip count")
plt.xlabel("trip duration")


In [None]:
df['trip_duration'].nlargest(10)

In [None]:
df['trip_duration'].nsmallest(10)


##### 2. What is/are the insight(s) found from the chart?

**From above analysis:**

*   Most trips took roughly 10 to 30 minutes to complete.
*   Approximately 90% of trips were completed within 1 hour.
*   The bulk of trips finish within 1 hour, but a noticeable number of trips last longer than 1 hour.
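
Shares such as "approximately 90% within 1 hour" can be computed directly instead of being read off the bar chart. The sketch below uses a small hypothetical series of durations and our own band labels; in the notebook the series would be `df['trip_duration']` with the same bin edges as the chart above.

```python
import pandas as pd

# Hypothetical toy durations in seconds (not taken from the dataset).
durations = pd.Series([45, 300, 420, 900, 1200, 1500, 2400, 3000, 5000, 90000])

bins = [0, 60, 600, 1800, 3600, 86400, 86400 * 2, 10_000_000]
labels = ['< 1 min', '1-10 min', '10-30 min', '30-60 min',
          '1 h - 1 day', '1-2 days', '> 2 days']

# Share of trips in each duration band.
share = pd.cut(durations, bins=bins, labels=labels).value_counts(normalize=True, sort=False)

# Fraction of trips finished within one hour.
within_hour = (durations <= 3600).mean()

print(share)
print(within_hour)
```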

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Insights having negative impact:**
* Some durations are as low as 1 to 30 seconds, which points towards trips with 0 km distance.
* Some trips ran for more than 20 days, which seems unlikely given the distance travelled.
* Some trips have more than 24 hours of travel duration, i.e. 86400 seconds, which might have occurred for outstation travel.

We should get rid of these outliers for the sake of data consistency (trip duration greater than 86400 seconds or less than 30 seconds).

In [None]:
#Removing outliers trip greater than 86400 sec and trip duration less than 30 sec
df = df[df.trip_duration <= 86400]
df= df[df.trip_duration >= 30]

In [None]:
df.shape

#### Vendor_id

In [None]:
# Feature variables "Vendor_id"
plt.figure(figsize = (7,7))
sns.countplot(x='vendor_id', data=df)
plt.xlabel('Vendor ID')
plt.ylabel('Count')

##### 1. Why did you pick the specific chart?

A count plot clearly shows the number of trips taken by each vendor.



##### 2. What is/are the insight(s) found from the chart?

There is not much difference between the two vendors, but as per the graph vendor 2 has more trips than vendor 1.

#### Passenger

In [None]:
#Passenger count with different number of passengers
plt.figure(figsize = (10,5))
sns.countplot(x='passenger_count',data=df)
plt.ylabel('Count')
plt.xlabel('No. of Passengers')
plt.show()

In [None]:
df.passenger_count.value_counts()

##### 1. Why did you pick the specific chart?

The count plot clearly shows that the passenger count follows a right-skewed distribution.

##### 2. What is/are the insight(s) found from the chart?

**Feature insights**
* There are some trips with a passenger count of 0.
* A few trips had 7, 8 or 9 passengers.
* Most trips had either 1 or 2 passengers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


* Some trips have a passenger count of 0, although a trip is not possible without a passenger. The driver may have forgotten to enter the passenger count. Let's analyze the passenger count distribution further to make it consistent.
* A few trips have a passenger count of 7, 8 or 9, which is clearly an outlier, since by rule at most 6 passengers can be seated in a single car. We have to remove these outliers.

In [None]:
df["passenger_count"].describe()

Since the median passenger count is 1 (most trips were made with a single passenger), we will replace a passenger count of 0 with 1.

In [None]:
# Replace a passenger count of 0 with 1
df['passenger_count'] = df['passenger_count'].replace([0], 1)

In [None]:
# Remove trips with a passenger count of more than 6
df = df[df.passenger_count <= 6]

#### store_and_fwd_flag

In [None]:
# Chart - 4 visualization code
plt.figure(figsize = (7,7))
sns.countplot(x="store_and_fwd_flag",data=df)
plt.ylabel("count")
plt.xlabel("store_and_fwd_flag")
plt.show()

In [None]:
df.store_and_fwd_flag.value_counts()

In [None]:
# Relative frequency of each flag value
df["store_and_fwd_flag"].value_counts(normalize=True)

##### 1. Why did you pick the specific chart?

As this is a categorical column, a count plot can be used to compare how often the trip record was held in vehicle memory before being sent to the vendor.

##### 2. What is/are the insight(s) found from the chart?

**Feature insight:**
* Above result shows that less than 1% of the trip details were stored in the vehicle first before sending it to the server.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Trip details were stored in the vehicle before being sent to the server only very rarely. Possible reasons include:
* Low signal when the trip was about to complete.
* The mobile device's battery may have run down.

#### Distance

In [None]:
# Let's now have a look on the distribution of the distance across the different types of rides.
plt.figure(figsize = (10,5))
sns.boxplot(x=df['distance'])
plt.xlabel("Distance_Travelled")
plt.show()

In [None]:
df.distance.describe()

In [None]:
plt.figure(figsize = (10,7))
df.distance.groupby(pd.cut(df.distance, np.arange(0,100,10))).count().plot(kind='bar')
plt.xlabel('distance in km')
plt.ylabel('trip_count')
plt.show()

##### 1. Why did you pick the specific chart?

With the help of a boxplot we can clearly see that there are many outliers in the distance covered per trip and that the data is right-skewed.

##### 2. What is/are the insight(s) found from the chart?

**Feature insights:**
* There are some trips with over 100 km distance.
* Some trips have a distance of 0 km.
* The mean distance travelled is approximately 3.5 km.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
# Number of rows having trip distance value as 0
df.distance[df.distance == 0 ].count()

In [None]:
# Number of rows having trip distance value more than 100
df.distance[df.distance >= 100 ].count()

**Negative insights:**
* There are approximately 4500 rows with a trip distance of 0. Some possible reasons:
  * The dropoff location couldn’t be tracked.
  * The driver deliberately took this ride to complete a target ride number.
  * The passengers or driver cancelled the trip due to some issue.
  * A technical issue in the software, etc.

* There are 19 trips with over 100 km distance. Some possible reasons:
  * Those trips were outstation trips.
  * A technical issue in the software, etc.

* There are some serious inconsistencies in the data where the dropoff location is the same as the pickup location. We can't impute the distance values from their correlation with the duration, because the imputed distance would no longer agree with the dropoff coordinates. We will look into this further in the bivariate analysis with trip duration.
* Trips with a distance of more than 100 km are clearly outliers, so we will remove them.

In [None]:
#Removing the rows having distance more than 100 km.
df = df[df.distance <= 100]

#### speed

In [None]:
# Chart - 6 visualization code
plt.figure(figsize = (10,5))
sns.boxplot(x=df['speed'])
plt.show()

In [None]:
df.speed.describe()

In [None]:
df.speed.groupby(pd.cut(df.speed, np.arange(0,100,10))).count().plot(kind = 'barh')
plt.xlabel('Trip count')
plt.ylabel('Speed (Km/H)')
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot makes outliers easy to spot: speeds over 100 km/h are not plausible for this dataset. The bar plot shows that most trips were made at an average speed of 10-20 km/h.


##### 2. What is/are the insight(s) found from the chart?

**Feature insights:**
* Many trips were made at a speed of over 100 km/h.
* There are some trips with 0 km/h.
* Most trips were made at an average speed of 10-20 km/h.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Negative insights:**
* There are some trips with 0 km/h. We will look into this further in the bivariate analysis with trip duration.
* Trips with a speed over 104 km/h will be treated as outliers and removed, because of the speed limits in NYC:
  * 25 mph in urban areas, i.e. about 40 km/h
  * 65 mph on controlled state highways, i.e. approximately 104 km/h
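
The km/h figures above follow from the mile-to-kilometre conversion. A small worked check (the helper name `mph_to_kmh` is ours, for illustration only):

```python
KM_PER_MILE = 1.609344  # exact, by definition of the international mile

def mph_to_kmh(mph):
    """Convert a speed from miles per hour to kilometres per hour."""
    return mph * KM_PER_MILE

urban_limit_kmh = mph_to_kmh(25)    # about 40.2 km/h
highway_limit_kmh = mph_to_kmh(65)  # about 104.6 km/h

print(round(urban_limit_kmh, 1), round(highway_limit_kmh, 1))
```

The notebook's cutoff of 104 km/h is therefore the highway limit rounded down slightly.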

In [None]:
#Removing the data having speed more than 104km/hr
df = df[df.speed <= 104]

In [None]:
df.shape

#### Total trips per month

In [None]:
df.month.value_counts()

In [None]:
# Total trip per month (visualization code)
plt.figure(figsize = (10,7))
sns.countplot(x='month', data=df)
plt.ylabel('Trip Counts')
plt.xlabel('Months')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot lets us compare the total number of trips taken in each month.

##### 2. What is/are the insight(s) found from the chart?

**Feature insights:**
* There is not much difference in the number of trips across months.

#### Total trips Per Hour

In [None]:
# Total trips Per Hour(visualization code)
plt.figure(figsize = (10,7))
sns.countplot(x='pickup_hour', data=df)
plt.ylabel('Trip Counts')
plt.xlabel('pickup_hour')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot lets us compare the total number of trips taken in each hour.

##### 2. What is/are the insight(s) found from the chart?

**Feature insights:**
* Taxi pickups start increasing from 6 AM and decline from late evening, around 8 PM.
* The busiest hours are 6:00 PM to 7:00 PM, which makes sense as this is when people return home from work.
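
The busiest hour can also be read off programmatically rather than from the chart. The sketch below uses a handful of hypothetical pickup timestamps; in the notebook the same two lines could be applied to `df['pickup_hour']` directly.

```python
import pandas as pd

# Hypothetical pickup timestamps (not taken from the dataset).
pickups = pd.to_datetime(pd.Series([
    "2016-03-01 06:15:00", "2016-03-01 18:05:00", "2016-03-01 18:40:00",
    "2016-03-02 18:20:00", "2016-03-02 19:10:00", "2016-03-02 08:30:00",
]))

# Count trips per pickup hour, ordered by hour of day.
trips_per_hour = pickups.dt.hour.value_counts().sort_index()

# Hour with the largest number of pickups.
busiest_hour = trips_per_hour.idxmax()

print(trips_per_hour)
print(busiest_hour)
```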

#### Total trips per weekday

In [None]:
# Total trips per weekday(visualization code)
plt.figure(figsize = (10,7))
sns.countplot(x='weekday', data=df)
plt.ylabel('Trip Counts')
plt.xlabel('weekday')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot lets us compare the total number of trips taken on each weekday.

##### 2. What is/are the insight(s) found from the chart?

**Feature insight:**
* Fridays are the busiest days, followed by Saturdays, probably because the weekend is starting.
* Taxi pickups increase from Monday to Friday and then decline from Saturday to Monday, which is expected since some office-goers prefer to stay at home and rest on the weekend.

# ***Bivariate Analysis***

#### Trip duration per vendor

In [None]:
# Trip duration per vendor (visualization code)
plt.figure(figsize = (10,8))
trip_duration_per_vendor = df.groupby('vendor_id').trip_duration.mean()
sns.barplot(x=trip_duration_per_vendor.index, y=trip_duration_per_vendor.values)
plt.ylabel('Trip Duration (seconds)')
plt.xlabel('Vendor')
plt.show()

In [None]:
trip_duration_per_vendor

##### 1. Why did you pick the specific chart?

A bar chart lets us compare the average trip duration for each vendor.

##### 2. What is/are the insight(s) found from the chart?

**Feature insight:**
* Trips by vendor 2 are longer on average than trips by vendor 1.

#### Trip duration v/s Distance

In [None]:
# Trip duration v/s Distance (visualization code)
plt.figure(figsize = (10,5))
plt.scatter(x='trip_duration', y='distance',data=df)
plt.ylabel('Distance')
plt.xlabel('Trip Duration')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot makes it easy to see the relationship between trip duration and the distance covered in a trip.


##### 2. What is/are the insight(s) found from the chart?

**Feature insight:**
* Many trips covered a negligible distance but clocked more than 20,000 seconds of duration.
* A few trips covered a large distance of approximately 100 km in a very short time. Considering the NYC highway speed limit of 104 km/h, it is possible to cover such a distance within about an hour.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* We should remove trips that covered 0 km but lasted more than 1 minute, to make the data more consistent for the predictive model. Our assumption is that a trip cancelled after booking should not take more than a minute.

In [None]:
# Removing those data having distance equal to 0 & trip duration is greater than 60sec
df = df[~((df.distance == 0) & (df.trip_duration >= 60))]

* Many trips covered less than 1 km but clocked more than 3600 seconds (1 hour). It is rare for a customer to sit in a taxi for more than an hour without covering 1 km.

In [None]:
#Removing those data having distance less than 1 & trip duration is greater than 3600sec
df = df[~((df['distance'] <= 1) & (df['trip_duration'] >= 3600))]

* Some trips took more than 3600 seconds (1 hour) to cover just 1 or 2 km, which is very unlikely to be caused by traffic alone in such a developed city.

In [None]:
#Removing those data having the speed of less than or equal to 2km/hr.
df = df[~(df['speed'] <= 2)]

Let's visualize it again

In [None]:
plt.figure(figsize = (10,5))
plt.scatter(x='trip_duration', y='distance',data=df)
plt.ylabel('Distance')
plt.xlabel('Trip Duration')
plt.show()

* Only one point lies above a trip duration of 40000 seconds, and it is an outlier.

In [None]:
# Removed the trip duration above 40000
df = df[~(df.trip_duration >= 40000)]

In [None]:
# visualize it again
plt.figure(figsize = (10,5))
plt.scatter(x='trip_duration', y='distance',data=df)
plt.ylabel('Distance')
plt.xlabel('Trip Duration')
plt.show()

#### Trip duration vs speed

In [None]:
#Trip duration vs speed (visualization code)

plt.figure(figsize = (10,5))
plt.scatter(x='trip_duration', y='speed',data=df)
plt.ylabel('speed')
plt.xlabel('Trip Duration')
plt.show()


##### 1. Why did you pick the specific chart?

Through scatter plot we can easily find out the correlation between trip duration and average speed in a trip.

##### 2. What is/are the insight(s) found from the chart?

* Some trips have a speed above 80 km/h with a very short trip duration, because a short distance was covered at high speed.
* As speed decreases, trip duration increases.
* Some trips have a very low average speed and a very high duration; these look like outliers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Trips with a very low average speed and a very high duration make the data inconsistent, so we will remove these outliers.

In [None]:
#Removing the data having trip duration higher than 15000
df=df[~(df['trip_duration']>=15000)]

Let's visualize it again

In [None]:
plt.figure(figsize = (10,5))
plt.scatter(x='trip_duration', y='speed',data=df)
plt.ylabel('speed')
plt.xlabel('Trip Duration')
plt.show()


#### Trip duration per month

In [None]:
# Trip duration per month (visualization code)
months=["January","February","March","April","May","June","July"]
plt.figure(figsize = (10,7))
sns.pointplot(x='month',y='trip_duration', data=df,order=months)
plt.ylabel('Duration (seconds)')
plt.xlabel('Month of Trip ')

plt.show()

##### 1. Why did you pick the specific chart?

Through point plot we can find out the increase in trip duration over the different months.

##### 2. What is/are the insight(s) found from the chart?

**Feature insights:**
* Trip duration is lowest in February, when winter starts to recede.
* From February onwards, trip duration rises every month.
* Seasonal factors such as wind or rain might contribute to this gradual increase in trip duration.
* Trip durations are high in June, possibly due to the start of the summer season.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive insights:**
* We learn in which months people travel for a longer time, and possibly over longer distances.

#### Trip duration per hour

In [None]:
# Trip duration per hour (visualization code)
plt.figure(figsize = (12,7))
sns.pointplot(x='pickup_hour',y='trip_duration', data=df)
plt.ylabel('trip Duration (seconds)')
plt.xlabel('pickup_hour ')
plt.show()

##### 1. Why did you pick the specific chart?

Through point plot we can find out the increase in trip duration over the different pickup hours.

##### 2. What is/are the insight(s) found from the chart?

**Feature insight:**
* Trip duration peaks around 3 PM, possibly because of traffic on the roads.
* Trip duration is higher from 8 AM to 6 PM, as traffic is heavy during office hours.
* Trip duration is lowest around 6 AM, as the streets may not be busy.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Knowing the pickup hour, we can estimate how much longer than usual a trip will take to reach its dropoff location.

#### Trip duration per weekdays

In [None]:
# Trip duration per weekdays (visualization code)
weekdays=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
plt.figure(figsize = (12,7))
sns.pointplot(x='weekday',y='trip_duration', data=df,order=weekdays)
plt.ylabel('trip Duration (seconds)')
plt.xlabel('weekday')
plt.show()

##### 1. Why did you pick the specific chart?

Through point plot we can find out the increase in trip duration over the different weekdays.

##### 2. What is/are the insight(s) found from the chart?

**Feature insights:**
* Trip duration is comparatively high on Thursday.
* Trip duration is lower on Saturday and Sunday, as these are weekend days.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This tells us on which days to expect more traffic, since a higher trip duration suggests heavier traffic.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
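# Use only numeric columns (pandas >= 2.0 raises an error on text columns otherwise)
correlation = df.corr(numeric_only=True)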
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

##### 1. Why did you pick the specific chart?

Through correlation heatmap we can see the correlation among different features as well as with dependent variable.

##### 2. What is/are the insight(s) found from the chart?

**Insight**
* Distance is highly correlated with the dependent variable (trip duration).
* Some features appear to be correlated with each other. Let's use the VIF method to check whether this multicollinearity is severe.

In [None]:
# calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(df[df.describe().columns])

* All VIF values are less than or equal to 5, so there is no strong multicollinearity among the features.
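
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining features. The sketch below is a dependency-free illustration of the two-feature case only, where that R² equals the squared Pearson correlation. The sample values are made up, and `pearson_r` and `vif_two_features` are illustrative helpers, not statsmodels functions.

```python
import math

def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_features(a, b):
    """VIF of one feature given a single other feature: 1 / (1 - r^2)."""
    r_squared = pearson_r(a, b) ** 2
    return 1.0 / (1.0 - r_squared)

# Made-up values for illustration only
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
passenger_count = [1.0, 3.0, 1.0, 2.0, 1.0]      # barely related to distance
noisy_distance_km = [1.1, 1.9, 3.2, 3.9, 5.1]    # almost a copy of distance

vif_low = vif_two_features(distance_km, passenger_count)
vif_high = vif_two_features(distance_km, noisy_distance_km)
print(round(vif_low, 2), round(vif_high, 2))
```

A nearly duplicated feature pushes the VIF far above the threshold of 5 used here, while a weakly related feature keeps it close to 1.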

# ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no null values in the dataset, so no imputation was required.

### 2. Categorical Encoding

In [None]:
df.shape

In [None]:
# Encode your categorical columns
#label encoding
encoders_nums = {"store_and_fwd_flag":{"Y":1,"N":0}}

df = df.replace(encoders_nums)
# one hot encoding
df=pd.get_dummies(df,columns=["month","weekday"],drop_first=True)

In [None]:
df.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

Encoding categorical columns:
* The store_and_fwd_flag feature has only two classes, Y and N, so we used label encoding with Y:1 and N:0.
* The month and weekday features contain more than two categories, so we used one-hot encoding.
* Before encoding, the shape of df was (1439355, 16); after encoding the categorical columns it is (1439355, 25), an increase of 9 columns.
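
The column count can be checked by hand. The sketch below is a plain-Python illustration of one-hot encoding with the first category dropped; `one_hot_drop_first` is a hypothetical helper, and it assumes categories are ordered alphabetically, which is how `pd.get_dummies` appears to treat string columns. The count at the end assumes six distinct months and seven weekdays, which is what the reported shapes imply.

```python
def one_hot_drop_first(values):
    """Return (kept_categories, rows) of 0/1 dummies, dropping the first sorted category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is represented by all zeros
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows

weekdays = ["Monday", "Friday", "Sunday", "Monday"]
kept, rows = one_hot_drop_first(weekdays)
print(kept)
print(rows)

# 16 original columns, minus the 2 encoded columns,
# plus (6 - 1) month dummies and (7 - 1) weekday dummies
n_cols_after = 16 - 2 + (6 - 1) + (7 - 1)
print(n_cols_after)
```

Dropping one dummy per feature avoids a redundant column, since the dropped category is already implied when all the other dummies are zero.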

### 3. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split 
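# drop the target and the identifier/timestamp columns from the features
x = df.drop(['trip_duration', 'id', 'pickup_datetime', 'dropoff_datetime'], axis=1)
# select the target by name rather than by column position
y = df['trip_duration']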
x_train, x_test, y_train, y_test = train_test_split( x,y , test_size = 0.2, random_state = 0) 
print(x_train.shape)
print(x_test.shape)

##### What data splitting ratio have you used and why? 

* An 80/20 split is used for training and evaluating the model.
* 80% of the data is used for training and 20% for testing.
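
As a rough illustration of what a seeded 80/20 split does, the sketch below shuffles row indices with a fixed seed and cuts off the first 20% as the test set. `split_indices` is a hypothetical helper written for this example; it does not reproduce scikit-learn's exact shuffling, only the idea that a fixed seed makes the split repeatable.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices reproducibly and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_rows=10)
train_again, test_again = split_indices(n_rows=10)
print(len(train_idx), len(test_idx))
```

Calling the function twice with the same seed returns the same split, which is the role `random_state = 0` plays in the cell above.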

### 4. Feature Manipulation & Selection

#### 2. Feature Selection

In [None]:
# Running recursive feature elimination to select important features
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
# with n_features_to_select left unset, RFE keeps half of the features
rfe = RFE(lm)
rfe = rfe.fit(x_train, y_train)


In [None]:
# feature ranking & selected as true
print(rfe.support_)
print(rfe.ranking_)


In [None]:
# selecting columns 
col= x_train.columns[rfe.support_]

In [None]:
# updating selected columns in xtrain & xtest
x_train= x_train[col]
x_test = x_test[col]

##### What all feature selection methods have you used  and why?

* For feature selection we used recursive feature elimination (RFE), a backward elimination method: a linear regression is fitted repeatedly, and at each step the feature with the smallest coefficient magnitude is dropped until the desired number of features remains.

##### Which all features you found important and why?

In [None]:
# Important features
x_train.columns

* ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude','dropoff_latitude', 'distance', 'speed', 'month_June', 'month_May','weekday_Monday', 'weekday_Sunday']
* These features had larger coefficient values than the other features, meaning they explained relatively more of the relationship with the dependent variable.

### 5. Data Scaling

In [None]:
# Data scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)


##### Which method have you used to scale you data and why?

* We used StandardScaler, which rescales each feature to zero mean and unit variance, because some features had much larger values than others.
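
The arithmetic behind standardization is small enough to show directly. This is a minimal sketch with made-up distance values; `fit_standardizer` and `standardize` are illustrative helpers. It uses the population standard deviation, which matches how StandardScaler computes variance, and it reuses the training statistics on the test values, as the cell above does with `fit_transform` and `transform`.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(value - mean) / std for value in values]

# Made-up values for illustration only
train_distance = [1.0, 2.0, 3.0, 4.0, 5.0]
test_distance = [2.5, 6.0]

mean, std = fit_standardizer(train_distance)
train_scaled = standardize(train_distance, mean, std)
test_scaled = standardize(test_distance, mean, std)  # no refitting on test data
print(train_scaled)
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.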

### 6. Data Transformation

In [None]:
# used powertransform to distribute the data normally
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()

x_train_transform=pt.fit_transform(x_train)

In [None]:
# Fit the LinearRegression on the transformed data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

lr= LinearRegression()

# Fit the Algorithm
lr.fit(x_train_transform,y_train)

# Predict on the model (the test features need the same power transform as the training features)
y_test_pred = lr.predict(pt.transform(x_test))

print("Testing R2: ",r2_score(y_test,y_test_pred))

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

* Our input features were not normally distributed, so we used a power transform to bring them closer to a normal distribution. To check performance on the transformed data, we trained a linear regression model on it; the R2 score reported on the test data was 0.75.
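
PowerTransformer defaults to the Yeo-Johnson transform, which also accepts the negative values that standardized features contain. The sketch below writes out that transform for a single value and a given lambda; `yeo_johnson` is an illustrative helper, and the lambda of 0 is chosen by hand here, whereas PowerTransformer estimates a lambda per feature and then standardizes the result.

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)

# Made-up, right-skewed values for illustration only
skewed = [0.0, 1.0, 3.0, 10.0, 100.0]
compressed = [yeo_johnson(value, 0.0) for value in skewed]
print([round(value, 3) for value in compressed])
```

With lambda equal to 1 the transform leaves values unchanged, and smaller lambdas pull in a long right tail.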

# ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# LinearRegression Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

lr= LinearRegression()

# Fit the Algorithm
lr.fit(x_train,y_train)

# Predict on the model
y_test_pred = lr.predict(x_test)
y_train_pred = lr.predict(x_train)

print("Training RMSE: ",(np.sqrt(mean_squared_error(y_train,y_train_pred))))
print("Testing RMSE: ",(np.sqrt(mean_squared_error(y_test,y_test_pred))))

print("Training R2: ",r2_score(y_train,y_train_pred))
print("Testing R2: ",r2_score(y_test,y_test_pred))

# adjusted R2 on the test set uses the test-set sample count and the number of features
print("Testing Adjusted R2 : ", (1-(1-r2_score(y_test, y_test_pred))*((x_test.shape[0]-1)/(x_test.shape[0]-x_test.shape[1]-1))))

print(lr.coef_)
print(lr.intercept_)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

* As this is a regression problem, we used a linear regression model for training and testing on our dataset.
* To evaluate the model, we calculated the RMSE, R2 score and adjusted R2 score.
* The evaluation gives an R2 score of 0.80 and an RMSE of 296 seconds on the testing dataset.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(12,6))
plt.plot(np.array(y_test_pred))
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross-validation score with the LinearRegression model
from sklearn.model_selection import cross_val_score
score = (np.mean(cross_val_score(lr,x_train,y_train,cv=10)))
print(score)
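
With an integer `cv`, `cross_val_score` on a regressor splits the rows into consecutive, unshuffled folds and reports R2 on each held-out fold. The sketch below shows only the index bookkeeping of such a split; `k_fold_indices` is a hypothetical helper and the row count of 25 is made up.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k consecutive folds of near-equal size."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_rows=25, k=10))
for train, validation in folds[:2]:
    print(len(train), validation)
```

Every row is used for validation exactly once, and the reported score is the mean over the ten folds.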

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Lasso

params = {"alpha": np.arange(0.01,10,0.01)}
model=Lasso()
model1 = RandomizedSearchCV(model,params,scoring="r2", cv=5)

# Fit the Algorithm

model1.fit(x_train,y_train)

# Predict on the model

y_pred_model1= model1.predict(x_test)

In [None]:
print('The best fit alpha value is :', model1.best_params_)
print("Testing R2 score: ",r2_score(y_test,y_pred_model1))

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge

params = {"alpha": np.arange(0.01,10,0.01)}
ridge=Ridge()
model2 = RandomizedSearchCV(ridge,params,scoring="r2", cv=5)

# Fit the Algorithm

model2.fit(x_train,y_train)

# Predict on the model

y_pred_model2= model2.predict(x_test)

In [None]:
print('The best fit alpha value is :', model2.best_params_)
print("Testing R2 score: ",r2_score(y_test,y_pred_model2))

##### Which hyperparameter optimization technique have you used and why?

* For hyperparameter tuning we used RandomizedSearchCV. The dataset is very large (approx. 1.4 million records), and a randomized search is faster than an exhaustive grid search because it evaluates only a sample of the candidate values.
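
The saving is easy to see in a toy version. RandomizedSearchCV tries 10 candidates by default, so with roughly 1,000 alpha values in the grid only about 1% of them are fitted. In the sketch below, `random_search` is an illustrative helper and `toy_score` is a made-up stand-in for the mean cross-validated R2, not a real model score.

```python
import random

def random_search(candidates, score_fn, n_iter=10, seed=0):
    """Score a random sample of candidates and return (best_candidate, sampled_candidates)."""
    rng = random.Random(seed)
    sampled = rng.sample(candidates, n_iter)
    return max(sampled, key=score_fn), sampled

# Candidate grid similar to np.arange(0.01, 10, 0.01)
alphas = [round(0.01 * i, 2) for i in range(1, 1000)]

def toy_score(alpha):
    # Hypothetical score curve that peaks at alpha = 2.0
    return -(alpha - 2.0) ** 2

best_alpha, tried = random_search(alphas, toy_score)
print(len(alphas), len(tried), best_alpha)
```

The best sampled value is not guaranteed to be the best value in the grid, which is the trade-off accepted in exchange for speed.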

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

* There is no improvement after hyperparameter tuning: the Ridge and Lasso models give the same result, an R2 score of 0.80.

### ML Model - 2

In [None]:
from sklearn.tree import DecisionTreeRegressor
dt= DecisionTreeRegressor(max_depth=12)
dt.fit(x_train,y_train)

y_pred_dt=dt.predict(x_test)

print("Testing RMSE: ",(np.sqrt(mean_squared_error(y_test,y_pred_dt))))
print("Testing R2 score: ",r2_score(y_test,y_pred_dt))

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

* As we have a large dataset (approx. 1.4 million records), we used a decision tree regressor, which is comparatively fast and performs well on large datasets.
* We set the max_depth hyperparameter to 12 to avoid overfitting and left the remaining parameters at their default values.
* To evaluate the model, we calculated the RMSE and R2 score.
* The evaluation gives an R2 score of 0.998 and an RMSE of 27 seconds on the testing dataset.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(12,6))
plt.plot(np.array(y_pred_dt))
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
score2 = (np.mean(cross_val_score(dt,x_train,y_train,cv=8)))
print(score2)

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Evaluation metrics:
* RMSE - The Root Mean Squared Error (RMSE) is one of the two main performance indicators for a regression model. It is the square root of the mean squared difference between the values predicted by a model and the actual values, expressed in the same units as the target (here, seconds). It indicates how accurately the model predicts the target value.

* R Square - R-squared (R2) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variable or variables in a regression model.

* Adjusted R2 - The adjusted R-squared adjusts for the number of terms in the model. Its value increases only when a new term improves the model fit more than would be expected by chance alone, and it decreases when the term does not improve the fit by a sufficient amount.
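
The three metrics can be written out in a few lines of plain Python. The sketch below uses made-up durations in seconds and an assumed feature count of 2 for the adjusted R2; the helper names are illustrative and stand in for the scikit-learn calls used in the notebook.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2_value, n_samples, n_features):
    """Penalize R2 for the number of features used."""
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up trip durations in seconds
actual = [300.0, 600.0, 900.0, 1200.0]
predicted = [330.0, 570.0, 930.0, 1170.0]

rmse_value = rmse(actual, predicted)
r2_value = r2(actual, predicted)
adj_value = adjusted_r2(r2_value, n_samples=len(actual), n_features=2)
print(rmse_value, r2_value, adj_value)
```

Because every prediction here is off by exactly 30 seconds, the RMSE is 30 seconds, and the adjusted R2 is lower than the plain R2 for the same fit.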

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
import xgboost as xg

xgb_r = xg.XGBRegressor()

# Fit the Algorithm
xgb_r.fit(x_train,y_train)
# Predict on the model
y_pred_xg=xgb_r.predict(x_test)


print("Testing RMSE: ",(np.sqrt(mean_squared_error(y_test,y_pred_xg))))

print("Testing R2 score: ",r2_score(y_test,y_pred_xg))


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

* XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library. It uses the gradient boosting (GBM) framework at its core and belongs to a family of boosting algorithms that combine weak learners into a strong learner. A weak learner is one that is only slightly better than random guessing.

 'Boosting' here is a sequential process, i.e., trees are grown one after the other, each using information from the previously grown trees. The process learns slowly from the data and tries to improve its predictions in subsequent iterations.
* To evaluate the model, we calculated the RMSE and R2 score.
* The evaluation gives an R2 score of 0.999 and an RMSE of 20 seconds on the testing dataset.
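
The sequential idea can be sketched with one-split trees (stumps) fitted to the residuals of the current prediction. This is a simplified illustration of gradient boosting with squared error, not XGBoost itself, which adds regularization and second-order information; the data, `fit_stump` and `boost` are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the single split (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly add a shrunken stump fitted to the residuals."""
    preds = [sum(ys) / len(ys)] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up distances (km) and durations (seconds)
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
duration_s = [200.0, 380.0, 610.0, 790.0, 1010.0, 1190.0]

mse_0 = mse(duration_s, boost(distance_km, duration_s, n_rounds=0))
mse_1 = mse(duration_s, boost(distance_km, duration_s, n_rounds=1))
mse_20 = mse(duration_s, boost(distance_km, duration_s, n_rounds=20))
print(mse_0, mse_1, mse_20)
```

Each round corrects part of what the previous rounds got wrong, so the training error keeps falling as rounds are added.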

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(12,6))
plt.plot(np.array(y_pred_xg))
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

To evaluate the ML models, we calculated the RMSE and R2 score on the testing data.
* RMSE: The Root Mean Squared Error (RMSE) is one of the two main performance indicators for a regression model. It is the square root of the mean squared difference between the values predicted by a model and the actual values, and it indicates how accurately the model predicts the target value. The best RMSE was 20 seconds, obtained with the XGBoost regressor.
* R2 score: A statistical measure in a regression model that gives the proportion of variance in the dependent variable that can be explained by the independent variables. The best R2 score was 0.999, obtained with the XGBoost regressor.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

* Compared with the linear models above, the decision tree regressor and XGBoost give the best results, with R2 scores of around 0.999.
* XGBoost has the lowest RMSE, 20 seconds, compared with 27 seconds for the decision tree regressor.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
pip install shap

In [None]:
import shap
# Create object that can calculate shap values
explainer = shap.TreeExplainer(dt)
# Calculate Shap values
shap_values = explainer.shap_values(x_test)

In [None]:
features=['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'distance', 'speed', 'month_June', 'month_May',
       'weekday_Monday', 'weekday_Sunday']
# shap_values were computed on x_test, so the same rows must be passed as the feature matrix
shap.summary_plot(shap_values, x_test, feature_names=features, plot_type="bar")

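* Distance and speed are the most important features for predicting trip duration. If speed was derived from distance and the recorded trip duration, it would not be known before a trip ends, which is worth keeping in mind when interpreting its importance.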

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
from joblib import dump, load
dump(dt, 'NYC_taxi.joblib') 

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
clf = load('NYC_taxi.joblib')
# sanity check: predict a few held-out rows with the reloaded model
print(clf.predict(x_test[:5]))

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Based on the analysis and modeling of the NYC taxi trip duration prediction project, the following conclusions can be drawn:

* The project was successful in building a model that can accurately predict the duration of a taxi trip based on various features such as pick-up and drop-off location, distance, time of day, and day of the week.

* The model achieved a good level of accuracy, with a root mean square error (RMSE) of around 20 seconds, indicating that it can make reasonably accurate predictions.

* Exploratory data analysis helped in understanding the distribution of the target variable (trip duration) and identifying some important features that influence it, such as distance, day of the week, and time of day.

* Feature engineering techniques, such as creating new features from existing ones and encoding categorical variables, helped in improving the performance of the models.

* Several machine learning algorithms were tested for this regression problem, including linear regression, decision trees and XGBoost. XGBoost yielded the best performance, with an RMSE of around 20 seconds.

* The model was also evaluated using cross-validation and tested on a holdout dataset to check for overfitting. The results were consistent, indicating that the model is reliable and robust.

* The model can be used to predict the duration of a taxi trip in NYC based on the given features, which can be useful for taxi companies and passengers to plan their trips more efficiently.

In conclusion, the machine learning project on predicting the duration of NYC taxi trips was successful in building an accurate model that can have practical applications in the real world.











### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***