<a href="https://colab.research.google.com/github/Kushan1001/NYC-Taxi-Trip-Time-Prediction/blob/main/NYC_Taxi_Trip_Time_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Taxi trip time Prediction : Predicting total ride duration of taxi trips in New York City</u></b>

## <b> Problem Description </b>

### Your task is to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.

## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

##**Set Up**

In [58]:
# Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns

import datetime as dt

import warnings; warnings.filterwarnings('ignore')



# setting some display options

plt.rcParams["font.weight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams["axes.titlesize"] = 16
plt.rcParams["axes.titleweight"] = 'bold'
plt.rcParams['xtick.labelsize']=15
plt.rcParams['ytick.labelsize']=15
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["legend.fontsize"] = 15
plt.rcParams["legend.title_fontsize"] = 15


##**Loading the dataset**

In [59]:
from google.colab import drive
drive.mount('/content/drive')



In [None]:
# creating a pandas data-frame (taxi_df) for the dataset

taxi_df = pd.read_csv('/content/drive/MyDrive/Capstone project -2/NYC Taxi Data.csv', index_col ='id' )

##**Exploring the dataset**

In [None]:
# checking the first five rows of the dataset

taxi_df.head()

In [None]:
# a glance of the whole dataset

taxi_df.info()

In [None]:
# checking the number of observations and features

print(taxi_df.shape)
print('Number of observations :', taxi_df.shape[0])
print('Number of features :', taxi_df.shape[1])

In [None]:
# features in the dataset

taxi_df.columns

In [None]:
# checking the null values

taxi_df.isnull().sum()

We have zero null values in the dataset

In [None]:
taxi_df.nunique().sort_values()



*   There are two vendors: 1 and 2
*   store_and_fwd_flag takes two values: Y and N
*   passenger_count takes 10 distinct values






In [None]:
# getting a descriptive summary of our dataset

taxi_df.describe().applymap('{:,.5f}'.format)


*   Passenger count takes values between 0 and 9, with a mean between 1 and 2.
*   The maximum trip duration is 3526282 seconds, or approximately 980 hours, which is not feasible. Clearly, there are some outliers present.




##**Feature Creation**



* Now we will convert the pickup and dropoff datetime columns from object (string) type to pandas datetime.
* This lets us extract components such as weekday and hour, which the analysis below relies on.
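
As a small illustration of what the conversion enables, the sketch below parses one made-up timestamp string (not a row from the dataset) and reads calendar fields through the `.dt` accessor. The names `sample` and `sample_dt` are hypothetical.

```python
import pandas as pd

# One made-up timestamp in 'YYYY-MM-DD HH:MM:SS' layout (illustrative only)
sample = pd.Series(['2016-03-14 17:24:55'])

# After conversion the column has a datetime64 dtype instead of plain strings
sample_dt = pd.to_datetime(sample)
print(sample_dt.dtype)

# Calendar fields become available through the .dt accessor
print(sample_dt.dt.weekday[0])   # Monday=0 ... Sunday=6
print(sample_dt.dt.hour[0])
print(sample_dt.dt.month[0])
```

Arithmetic on the converted columns, such as subtracting pickup from dropoff, also works only after this conversion.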



In [None]:
taxi_df['dropoff_datetime'] = pd.to_datetime(taxi_df['dropoff_datetime'])
taxi_df['pickup_datetime'] = pd.to_datetime(taxi_df['pickup_datetime'])

taxi_df[['dropoff_datetime', 'pickup_datetime']].dtypes

## Now let's extract the weekday, day, month, year and hour

In [None]:
#weekday
taxi_df['pickup_weekday'] = taxi_df['pickup_datetime'].dt.weekday
taxi_df['dropoff_weekday'] = taxi_df['dropoff_datetime'].dt.weekday

#day
taxi_df['pickup_day'] = taxi_df['pickup_datetime'].dt.day
taxi_df['dropoff_day'] = taxi_df['dropoff_datetime'].dt.day


#month
taxi_df['pickup_month'] = taxi_df['pickup_datetime'].dt.month
taxi_df['dropoff_month'] = taxi_df['dropoff_datetime'].dt.month


#year
taxi_df['pickup_year'] = taxi_df['pickup_datetime'].dt.year
taxi_df['dropoff_year'] = taxi_df['dropoff_datetime'].dt.year


#hour
taxi_df['pickup_hour'] = taxi_df['pickup_datetime'].dt.hour
taxi_df['dropoff_hour'] = taxi_df['dropoff_datetime'].dt.hour



## Now let's calculate distance travelled

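To calculate the distance we will use the haversine formula, which gives the great-circle distance (the shortest path over the Earth's surface) between the pickup and dropoff coordinates.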
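
For reference, the formula itself can be sketched in NumPy. This is a minimal sketch and not the code of the `haversine` library. The function name `haversine_km` and the constant `EARTH_RADIUS_KM` are hypothetical, and the mean Earth radius of 6371.0088 km is an assumption.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Works on scalars as well as on NumPy arrays / pandas Series of equal length.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude at the equator is about 111.2 km
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

Because the function is vectorised, whole coordinate columns can be passed in one call instead of applying a function row by row.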

In [None]:
# installing the haversine library

!pip install haversine

In [None]:
# importing the library

from haversine import haversine

In [None]:
''' creating a new column trip_distance which stores the distance between pickup and dropoff,
 calculated using the haversine library'''

'''the distance calculated will be in km'''

taxi_df['trip_distance'] = taxi_df.apply(lambda x: haversine((x['pickup_latitude'], x['pickup_longitude']),
                                                        (x['dropoff_latitude'], x['dropoff_longitude']), unit = 'km'), axis = 1)

In [None]:
taxi_df.head()

##**Preprocessing**

Since we have already extracted the features we need from the pickup and dropoff datetimes, let's drop these columns.

In [None]:
taxi_df.drop(columns = ['pickup_datetime', 'dropoff_datetime'], inplace = True)

Now we will convert the store_and_fwd_flag column into binary values 0 and 1.

In [None]:
''' store and fwd flag takes only two values N and Y. So we can encode them as 0 for N
 and 1 for Y'''

print(taxi_df['store_and_fwd_flag'].unique())

We will do this with the help of scikit-learn's LabelEncoder class.
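
To see which label ends up as which number, here is a minimal sketch on toy values (not taken from the dataset). `LabelEncoder` sorts the classes it finds, so `'N'` comes before `'Y'`. The names `flags`, `encoder` and `mapping` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

flags = ['N', 'Y', 'N', 'N']   # toy values for illustration
encoder = LabelEncoder()
encoded = encoder.fit_transform(flags)

# classes_ is sorted, so the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```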

In [None]:
from sklearn.preprocessing import LabelEncoder # importing the class LabelEncoder

label_encoder = LabelEncoder() # creating an object of the class labelEncoder 

taxi_df['store_and_fwd_flag'] = label_encoder.fit_transform(taxi_df['store_and_fwd_flag']) # apply the method fit_transform of LabelEncoder
                                                                                           # to convert them into binary values

In [None]:
print(taxi_df['store_and_fwd_flag'].unique())

##**Exploratory Data Analysis**

##Checking for Outliers

In [None]:
rcParams['figure.figsize'] = 20,10

sns.boxplot(x = taxi_df['trip_duration'])
plt.xlabel('Trip duration')
plt.title('Boxplot showing outliers in trip duration');

From the boxplot we can see that there are some trips which are around 2000000 seconds or 555 hours and beyond. Clearly, it's not feasible for a taxi to run that long. Hence we will get rid of such trips.

In [None]:
taxi_df['trip_duration'].sort_values(ascending = False)

In [None]:
# There is a large gap in trip duration after 'id1942836' in the sorted output above.
# The trips with the four durations below are far longer than the rest, so we remove them

outlier_durations = [3526282, 2227612, 2049578, 1939736]
taxi_df.drop(taxi_df[taxi_df['trip_duration'].isin(outlier_durations)].index, inplace = True)
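
An alternative to listing exact values is to filter on a threshold. The sketch below uses a toy frame and a hypothetical cap of 24 hours. The cap, the frame `toy` and the name `MAX_DURATION_S` are assumptions made for illustration, not choices made in this notebook.

```python
import pandas as pd

# Toy durations in seconds; the last value mimics an extreme outlier
toy = pd.DataFrame({'trip_duration': [420, 660, 1500, 80000, 3526282]})

MAX_DURATION_S = 24 * 60 * 60   # hypothetical cap of 24 hours

# Keep only trips at or below the cap
filtered = toy[toy['trip_duration'] <= MAX_DURATION_S]
print(len(toy) - len(filtered), 'trip(s) removed')
```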


##Univariate Analysis

###Average trip duration

In [None]:
rcParams['figure.figsize'] = 10,7

sns.histplot(data = taxi_df, x= 'trip_duration', color = 'brown')
plt.xlim(0,6000)
plt.xlabel('Trip Duration')
plt.axvline(taxi_df['trip_duration'].mean(), linestyle = 'dashed', color = 'magenta', linewidth = 2)
plt.axvline(taxi_df['trip_duration'].median(), linestyle = 'dashed', color = 'black', linewidth = 2)
plt.title('Distribution of trip duration')
plt.show();

##**Insights**


*   Average trip duration is around 500 seconds, or roughly 8 minutes



###Distribution of Vendor ID

In [None]:
rcParams['figure.figsize'] = 6,6

sns.countplot(x = taxi_df['vendor_id'])
plt.title('Distribution of Vendor ID')
plt.xlabel('vendor ID')
plt.show();

###**Insights**

*  From the plot we can see that vendor 2 accounts for more trips than vendor 1




###Distribution of passenger count 

In [None]:
rcParams['figure.figsize'] = 10,6

sns.countplot(x = taxi_df['passenger_count'])
plt.title('Distribution of passenger count')
plt.xlabel('Passenger count')
plt.show();

###**Insights**

* The bar plot above shows that most trips carry either 1 or 2 passengers




###Distribution of pickup weekday

In [None]:
plt.figure(figsize=(15 , 8))

pickups_by_weekdays = taxi_df['pickup_weekday'].value_counts()
pickups_by_weekdays.sort_index().plot(kind = 'bar')
plt.title('Distribution of trips per day')
plt.xticks(rotation = 0, ticks= [0,1,2,3,4,5,6], labels = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.xlabel('Days')
plt.ylabel('Count')
plt.show();

###**Insights**


*   Thursday, Friday and Saturday are the busiest days for cab rides

*   Monday and Sunday are the least busy



###Distribution of Pickup hours


In [None]:
rcParams['figure.figsize'] = 20,7 

sns.countplot(data = taxi_df, x= 'pickup_hour', color = 'orange')
plt.title('Distribution of Pickup hours')
plt.xlabel('Pickup hour')
plt.show();

###**Insights**


*   Demand picks up from 8 in the morning (probably office hours) and keeps growing through the day.

*   6-7 in the evening is the busiest time for New Yorkers

* The graph also shows that New Yorkers take many cab rides late at night, between 10pm and 12am




###Distribution of pickup month

In [None]:
rcParams['figure.figsize'] = 15,7 

sns.countplot(data = taxi_df, x= 'pickup_month', color = 'lightgreen')
plt.title('Distribution of Pickup month')
plt.xlabel('Pickup month')
plt.show();

###**Insights**


*   New Yorkers take the most cab rides in March and April, closely followed by February and May




###Distribution of pickup day

In [None]:
rcParams['figure.figsize'] = 22,7 

sns.countplot(data = taxi_df, x= 'pickup_day', color = 'coral')
plt.title('Distribution of Pickup day')
plt.xlabel('Pickup day')
plt.show();

###**Insights**


*   Trips are spread more or less evenly across the days of the month, except for the 30th and 31st, which show a sharp decline in taxi rides. This is at least partly because not every month has a 30th or 31st day.




###Average trip distance

In [None]:
rcParams['figure.figsize'] = 10,7

sns.histplot(data = taxi_df, x= 'trip_distance', color = 'green')
plt.xlim(0,30)
plt.xlabel('Trip Distance')
plt.axvline(taxi_df['trip_distance'].mean(), linestyle = 'dashed', color = 'magenta', linewidth = 2)
plt.axvline(taxi_df['trip_distance'].median(), linestyle = 'dashed', color = 'black', linewidth = 2)
plt.title('Distribution of trip distance')
plt.show();

###**Insights**

Average trip distance is around 3-4 km, which is reasonable since our trips typically last only 7-8 minutes.

###Distribution of store and fwd flag

In [None]:
rcParams['figure.figsize'] = 5,5

explode = [0.4,0]
colors = ['lightpink','lightblue']
plt.pie(taxi_df['store_and_fwd_flag'].value_counts(), wedgeprops={'edgecolor':'black'},autopct='%1.1f%%',
        explode = explode, radius = 1.6,colors = colors, shadow = True)

plt.title('Proportion of store and fwd flag', x = 0.3, y= 1.2)
plt.legend(labels=['N','Y'])


plt.tight_layout();

##**Insights**


* From the pie chart we can see that less than 1% of trips are store-and-forward trips



##Bivariate Analysis

###Relationship b/w Vendor ID and trip duration

In [None]:
rcParams['figure.figsize'] = 15,10

sns.catplot(data = taxi_df, x = 'vendor_id', y = 'trip_duration', 
            palette = 'winter', kind = 'bar')
plt.title('Relationship b/w vendor Id and trip duration')
plt.ylabel('Trip Duration')
plt.xlabel('Vendor ID')
plt.show();


###**Insights**

* It can clearly be seen that vendor 2 has a longer average trip duration than vendor 1

###Relationship b/w store and fwd trip and trip duration

In [None]:
rcParams['figure.figsize'] = 15,10

sns.catplot(data = taxi_df, x = 'store_and_fwd_flag', y = 'trip_duration', 
            palette = 'winter', kind = 'strip', jitter = True)
plt.title('Relationship b/w store and fwd trip and trip duration')
plt.ylabel('Trip Duration')
plt.xlabel('Store and fwd trip')
plt.xticks(ticks = [0,1], labels = ['No', 'Yes'])
plt.show();

###**Insights**


*   Trips that are not stored and forwarded reach longer durations than trips that are stored and forwarded.




###Relationship b/w Passenger count and Trip duration

In [None]:
rcParams['figure.figsize'] = 15,10

sns.catplot(data = taxi_df, x = 'passenger_count', y = 'trip_duration', 
           kind = 'strip', jitter = True)
plt.title('Relationship b/w Passenger_count and trip duration')
plt.ylabel('Trip Duration')
plt.xlabel('Passenger count')
plt.show();

###**Insights**


*   There is no clear correlation b/w passenger count and trip duration




###Relationship b/w Pickup weekday and Trip duration


In [None]:
rcParams['figure.figsize'] = 15,8

sns.pointplot(data =taxi_df, x = 'pickup_weekday', y = 'trip_duration')

plt.title('Relationship b/w Pickup weekday and Trip duration')
plt.xlabel('Pickup weekday ')
plt.ylabel('Trip duration')
plt.xticks(ticks = [0,1,2,3,4,5,6], labels = ['Mon','Tues', 'Wed', 'Thurs', 'Fri', 'Sat','Sun'])
plt.show();


##**Insights**


* On Thursdays New Yorkers take the longest trips, followed by Fridays and Wednesdays

* On weekends, however, trips are the shortest



###Relationship b/w Pickup hour and Trip duration

In [None]:
rcParams['figure.figsize'] = 15,8

sns.pointplot(data =taxi_df, x = 'pickup_hour', y = 'trip_duration')

plt.title('Relationship b/w Pickup hour and Trip duration')
plt.xlabel('Pickup hour')
plt.ylabel('Trip duration')
plt.show();

##**Insights**


*   During afternoons (14-16 or 2pm-4pm) we see the longest duration trips.
*   On the other hand, during early mornings (6am) we see the shortest duration trips



###Relationship b/w Pickup month and Trip duration

In [None]:
rcParams['figure.figsize'] = 15,8

sns.pointplot(data =taxi_df, x = 'pickup_month', y = 'trip_duration')

plt.title('Relationship b/w Pickup month and Trip duration')
plt.xlabel('Pickup month')
plt.ylabel('Trip duration')
plt.show();

##**Insights**


*   Trip duration increases month over month. However, towards the 6th month (June) the rate of increase slows down slightly.




###Relationship b/w Pickup day and Trip duration

In [None]:
rcParams['figure.figsize'] = 20,8

sns.pointplot(data =taxi_df, x = 'pickup_day', y = 'trip_duration')

plt.title('Relationship b/w Pickup day and Trip duration')
plt.xlabel('Pickup day')
plt.ylabel('Trip duration')
plt.show();

###**Insights**


*   Trip duration is at its lowest on the 7th and 30th.
*   It is at its highest on the 3rd, 17th, 24th, 25th and 26th.



###Relationship b/w Pickup latitude and longitude & Trip duration

In [None]:
rcParams['figure.figsize'] = 15,8
plt.scatter(data = taxi_df, x = 'pickup_latitude', y = 'pickup_longitude', c = 'trip_duration', s = 200)
cbar = plt.colorbar(orientation = 'vertical', extend = 'both', pad = 0.05, aspect = 20 )
cbar.set_label(label = 'Trip duration', size = 15, x = 1.5)
plt.xlabel('Pickup latitude')
plt.ylabel('Pickup longitude')
plt.title('Relationship b/w Pickup latitude and longitude & Trip duration')
plt.clim(0,2000)
plt.show();

###**Insights**


*   Pickups are generally concentrated between latitudes of 40-42.5 degrees and around a longitude of -75 degrees, and their trip duration is generally between 0-1000 seconds



###Relationship b/w Dropoff latitude and longitude & Trip duration

In [None]:
rcParams['figure.figsize'] = 15,8
plt.scatter(data = taxi_df, x = 'dropoff_latitude', y = 'dropoff_longitude', c = 'trip_duration', s = 200)
cbar = plt.colorbar(orientation = 'vertical', extend = 'both', pad = 0.05, aspect = 20 )
cbar.set_label(label = 'Trip duration', size = 15, x = 1.5)
plt.xlabel('dropoff latitude')
plt.ylabel('dropoff longitude')
plt.title('Relationship b/w Dropoff latitude and longitude & Trip duration')
plt.clim(0,5000)
plt.show();

###**Insights**


*   Dropoffs are generally concentrated between latitudes of 40-42 degrees and around a longitude of -72 degrees, and their trip duration is generally between 0-2000 seconds.




###Relationship b/w Trip distance and Trip duration

In [None]:
rcParams['figure.figsize'] = 15,8

plt.scatter(data = taxi_df, x = 'trip_distance', y = 'trip_duration')
plt.xlabel('Trip distance')
plt.ylabel('Trip duration')
plt.title('Relationship b/w Trip distance and Trip duration')
plt.show();

###**Insights**


*   We can see that there are trips with a distance as short as 0 km but a duration of more than 8000 seconds, which is highly implausible.
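
One way to flag such records is to combine distance and duration into an average speed and look for trips with no movement but a long time on the meter. This is a minimal sketch on toy values. The 1-hour threshold and the names `toy`, `speed_kmh` and `suspicious` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'trip_distance': [0.0, 3.2, 12.5],    # km (toy values)
    'trip_duration': [8500, 600, 1800],   # seconds (toy values)
})

# Average speed in km/h: distance divided by duration expressed in hours
toy['speed_kmh'] = toy['trip_distance'] / (toy['trip_duration'] / 3600)

# Hypothetical rule: zero distance but more than 1 hour of recorded duration
suspicious = toy[(toy['trip_distance'] == 0) & (toy['trip_duration'] > 3600)]
print(suspicious)
```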




##**Feature Scaling**

###Dividing the dataset into dependent and independent columns

In [None]:
X = taxi_df.drop(columns = 'trip_duration') # independent variables/ features 
y = taxi_df['trip_duration'] # target variable
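
One point worth checking at this step: the weekday, day, month, year and hour columns derived from `dropoff_datetime` are only known once a trip has ended. The sketch below, on a toy frame with a few of the same column names, shows one way such columns could be left out of the features. Whether to exclude them is a modelling decision and an assumption of this sketch. The names `toy`, `dropoff_time_cols` and `X_toy` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'pickup_hour': [8, 17],
    'dropoff_hour': [8, 18],
    'trip_distance': [2.1, 7.4],
    'trip_duration': [540, 2100],
})

# Time features derived from the drop-off timestamp (coordinates are kept)
dropoff_time_cols = ['dropoff_weekday', 'dropoff_day', 'dropoff_month',
                     'dropoff_year', 'dropoff_hour']
present = [c for c in dropoff_time_cols if c in toy.columns]

X_toy = toy.drop(columns=['trip_duration'] + present)   # features
y_toy = toy['trip_duration']                            # target
print(list(X_toy.columns))
```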

###Using log transformation to reduce skewness in our target variable

In [None]:
y = np.log10(y)
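
Since the target is now on a log10 scale, model predictions and error metrics are on that scale too. The sketch below, on toy durations, shows the transformation and how a value is mapped back to seconds. The array names are hypothetical.

```python
import numpy as np

durations_s = np.array([60.0, 600.0, 3600.0])   # toy durations in seconds
log_durations = np.log10(durations_s)

# A value on the log10 scale maps back to seconds via 10 ** value
recovered_s = 10 ** log_durations
print(log_durations)
print(recovered_s)
```

The same `10 ** value` step would apply to predictions if they need to be reported in seconds.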

###Scaling independent variables using StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # creating an object of class StandardScaler
print(scaler)

In [None]:
X = scaler.fit_transform(X)
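
The cell above fits the scaler on all rows. A common alternative, sketched here on synthetic data, is to fit the scaler on the training split only and reuse its statistics for the test split, so that test rows do not influence the preprocessing. All `toy` names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # synthetic features
y_toy = rng.normal(size=100)                            # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)   # mean and std learned from training rows only
X_te_scaled = toy_scaler.transform(X_te)       # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0))   # close to 0 for every column
print(X_tr_scaled.std(axis=0))    # close to 1 for every column
```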

##**Implementing Linear Regression**

In [None]:
from sklearn.model_selection import train_test_split 

# splitting the dataset into 80-20 ratio
# 0.8 for training and 0.2 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

print(X_train.shape)
print(X_test.shape)

In [None]:
from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression() 
linear_reg_fit = linear_reg.fit(X_train, y_train)

In [None]:
# coefficients of the model

linear_reg_fit.intercept_ , linear_reg_fit.coef_

In [None]:
#predicting results

linear_reg_pred = linear_reg.predict(X_test)
linear_reg_pred

In [None]:
# examining R2 scores on the training and test sets

print('Training score :', linear_reg.score(X_train, y_train))
print('Test score :', linear_reg.score(X_test, y_test))

###Evaluating Regression Metrics

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
MSE = mean_squared_error(y_test, linear_reg_pred)
print('MSE :', MSE)

RMSE = np.sqrt(MSE)
print('RMSE :', RMSE)

r2 = r2_score(y_test, linear_reg_pred)
print('R2 score:', r2)

adj_r2 = 1-(1-r2_score(y_test, linear_reg_pred)) * ((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print('Adjusted R2 :', adj_r2)
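
The adjusted R2 expression above is repeated for every model, so it can be wrapped in a small helper. This is a sketch and the function name `adjusted_r2` is hypothetical. Here n is the number of samples and p the number of features.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy numbers: R^2 of 0.5 with 101 samples and 10 features gives about 0.444
print(adjusted_r2(0.5, 101, 10))
```

With the variables of this notebook the call would be `adjusted_r2(r2, X_test.shape[0], X_test.shape[1])`.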

In [None]:
rcParams['figure.figsize'] = 15,8

# overlaying both histograms on the same axes
sns.histplot(y_test, label = 'Test')
sns.histplot(linear_reg_pred, color = 'orange', label = 'Prediction')
plt.legend()
plt.title('Test VS Prediction')
plt.show();

##**Lasso Regression**

In [None]:
from sklearn.linear_model import Lasso

lasso= Lasso()
lasso_fit =lasso.fit(X_train , y_train)

In [57]:
from sklearn.model_selection import GridSearchCV

parameters = {'alpha': [1e-15, 1e-10, 1e-5, 1e-2, 1e-1, 1, 5, 10, 15, 20, 25]}
lasso_grid = GridSearchCV(lasso_fit , parameters, scoring='neg_mean_squared_error', cv=5)
lasso_grid = lasso_grid.fit(X_train,y_train)


In [None]:
lasso_grid.best_params_
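
The grid search above can be tried out on a small synthetic problem first. The sketch below assumes synthetic data and a shorter, hypothetical alpha grid. `GridSearchCV` clones the estimator it receives, so an unfitted `Lasso()` is sufficient. The names `X_toy`, `y_toy`, `toy_grid` and `search` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only the first two synthetic features carry signal
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_grid = {'alpha': [1e-3, 1e-2, 1e-1, 1]}   # hypothetical, shorter grid
search = GridSearchCV(Lasso(), toy_grid, scoring='neg_mean_squared_error', cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)   # cross-validated MSE for the best alpha
```

The score is negated because scikit-learn maximises scores, so `neg_mean_squared_error` is the MSE with its sign flipped.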

In [None]:
y_pred_lasso = lasso_grid.predict(X_test)
y_pred_lasso

In [None]:
MSE  = mean_squared_error(y_test,y_pred_lasso)
print("MSE :" , MSE)
    
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_lasso)
print("R2 :" ,r2)

adj_r2=1-(1-r2_score(y_test,y_pred_lasso))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",adj_r2)

In [None]:
rcParams['figure.figsize'] = 15,8

# overlaying both histograms on the same axes
sns.histplot(y_test, label = 'Test')
sns.histplot(y_pred_lasso, color = 'orange', label = 'Prediction')
plt.legend()
plt.title('Test VS Prediction')
plt.show();

##**Ridge Regression**

In [None]:
from sklearn.linear_model import Ridge

# using a lowercase name so that the Ridge class itself is not overwritten
ridge = Ridge()
ridge_fit = ridge.fit(X_train , y_train)

In [None]:
parameters = {'alpha': [1e-15, 1e-10, 1e-5, 1e-2, 1e-1, 1, 5, 10, 15, 20, 25]}
ridge_grid = GridSearchCV(ridge_fit , parameters, scoring='neg_mean_squared_error', cv=5)
ridge_grid = ridge_grid.fit(X_train,y_train) # fitting the ridge search, not the lasso one

In [None]:
ridge_grid.best_params_

In [None]:
y_pred_ridge = ridge_grid.predict(X_test)
y_pred_ridge

In [None]:
MSE  = mean_squared_error(y_test,y_pred_ridge)
print("MSE :" , MSE)
    
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_ridge)
print("R2 :" ,r2)

adj_r2=1-(1-r2_score(y_test,y_pred_ridge))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",adj_r2)

In [None]:
rcParams['figure.figsize'] = 15,8

# overlaying both histograms on the same axes
sns.histplot(y_test, label = 'Test')
sns.histplot(y_pred_ridge, color = 'orange', label = 'Prediction')
plt.legend()
plt.title('Test VS Prediction')
plt.show();

##**Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeRegressor

dt_regressor = DecisionTreeRegressor()
dt_fit = dt_regressor.fit(X_train, y_train)

In [None]:
y_pred_dt = dt_fit.predict(X_test)
y_pred_dt

In [None]:
params = {'max_depth': [20,30,50,100], 'min_samples_split':[5,10,15,20]}
dt_grid = GridSearchCV(dt_fit, param_grid= params, scoring='neg_mean_squared_error', cv=5)
dt_grid = dt_grid.fit(X_train, y_train)

In [None]:
dt_grid.best_estimator_

In [None]:
y_pred_dt = dt_grid.predict(X_test)
y_pred_dt

In [None]:
MSE  = mean_squared_error(y_test,y_pred_dt)
print("MSE :" , MSE)
    
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(y_test,y_pred_dt)
print("R2 :" ,r2)

adj_r2=1-(1-r2_score(y_test,y_pred_dt))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",adj_r2)

In [None]:
plt.figure(figsize=(10,8))

# overlaying both histograms on the same axes
sns.histplot(y_test, label='Test')
sns.histplot(y_pred_dt, color='orange', label='Prediction')
plt.legend()
plt.title('Test VS Prediction')
plt.show();