## `Task-3` - `Uber Fare Analysis`

### * `Problem Statement`:

* Assume that you are working as a `Data Analyst Intern` with `Uber`. Your first `assignment` as an `intern` here is to `perform analysis and ML modelling on rides data` recorded between `2009-01-01 and 2015-06-30`.

In [1]:
### Load the required libraries
%matplotlib inline
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
df = pd.read_csv(r"Uber Fare data/data.csv")

In [3]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.99471,40.750325,1
2,44984355,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.74077,-73.962565,40.772647,1
3,25894730,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5


#### * What is the shape of given dataset?

In [4]:
df.shape

(200000, 8)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ride_id            200000 non-null  int64  
 1   fare_amount        200000 non-null  float64
 2   pickup_datetime    200000 non-null  object 
 3   pickup_longitude   200000 non-null  float64
 4   pickup_latitude    200000 non-null  float64
 5   dropoff_longitude  199999 non-null  float64
 6   dropoff_latitude   199999 non-null  float64
 7   passenger_count    200000 non-null  int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 12.2+ MB


#### * How many integer columns (by default) are there in the dataset?

In [6]:
len(df.select_dtypes(include='int').columns)

2

In [7]:
df.describe()

Unnamed: 0,ride_id,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,200000.0,200000.0,200000.0,200000.0,199999.0,199999.0,200000.0
mean,27712500.0,11.359955,-72.527638,39.935885,-72.525292,39.92389,1.684535
std,16013820.0,9.901776,11.437787,7.720539,13.117408,6.794829,1.385997
min,1.0,-52.0,-1340.64841,-74.015515,-3356.6663,-881.985513,0.0
25%,13825350.0,6.0,-73.992065,40.734796,-73.991407,40.733823,1.0
50%,27745500.0,8.5,-73.981823,40.752592,-73.980093,40.753042,1.0
75%,41555300.0,12.5,-73.967153,40.767158,-73.963659,40.768001,2.0
max,55423570.0,499.0,57.418457,1644.421482,1153.572603,872.697628,208.0


In [8]:
df.isnull().sum()

ride_id              0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    1
dropoff_latitude     1
passenger_count      0
dtype: int64

#### * How many missing values exist in the 'dropoff_longitude' column?

In [9]:
df['dropoff_longitude'].isna().sum()

1

#### * What is the data type of the 'pickup_datetime' feature in your data?


In [10]:
df.dtypes['pickup_datetime']

dtype('O')

#### * Which function can be used to remove null values from the dataframe?


In [11]:
df = df.dropna()

In [12]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,24238194,7.5,2015-05-07 19:52:06 UTC,-73.999817,40.738354,-73.999512,40.723217,1
1,27835199,7.7,2009-07-17 20:04:56 UTC,-73.994355,40.728225,-73.99471,40.750325,1
2,44984355,12.9,2009-08-24 21:45:00 UTC,-74.005043,40.74077,-73.962565,40.772647,1
3,25894730,5.3,2009-06-26 08:22:21 UTC,-73.976124,40.790844,-73.965316,40.803349,3
4,17610152,16.0,2014-08-28 17:47:00 UTC,-73.925023,40.744085,-73.973082,40.761247,5


In [13]:
df.shape

(199999, 8)

#### * What is the average fare amount?


In [14]:
df['fare_amount'].mean()

11.359891549458371

In [15]:
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

In [16]:
df.dtypes

ride_id                            int64
fare_amount                      float64
pickup_datetime      datetime64[ns, UTC]
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
dtype: object

In [17]:
# Extract time features from 'pickup_datetime'
df = df.assign(date=df['pickup_datetime'].dt.date,
               hour=df['pickup_datetime'].dt.hour,
               day=df['pickup_datetime'].dt.day,
               month=df['pickup_datetime'].dt.month,
               year=df['pickup_datetime'].dt.year,
               dayofweek=df['pickup_datetime'].dt.dayofweek,
               nameofDOW=df['pickup_datetime'].dt.day_name())

In [18]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW
0,24238194,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1,2015-05-07,19,7,5,2015,3,Thursday
1,27835199,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.99471,40.750325,1,2009-07-17,20,17,7,2009,4,Friday
2,44984355,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.74077,-73.962565,40.772647,1,2009-08-24,21,24,8,2009,0,Monday
3,25894730,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3,2009-06-26,8,26,6,2009,4,Friday
4,17610152,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5,2014-08-28,17,28,8,2014,3,Thursday


### * * * Haversine formula * * *

$$\operatorname{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos\theta}{2}$$

* The Haversine formula gives the great-circle distance, that is, the shortest distance along the surface, between two points on a sphere from their latitudes and longitudes. Applied to the Earth, it treats the planet as a perfect sphere, so the result is an approximation.
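For reference, the distance that the `haversine` function in the next cell computes can be written out as below. The symbols are the conventional ones and are an assumption, not notation from the notebook: $\varphi_1, \varphi_2$ are the two latitudes and $\lambda_1, \lambda_2$ the two longitudes (all in radians), $R$ is the sphere's radius, and $a$ is the haversine of the central angle between the points.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right)
```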

In [19]:
import math

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the Haversine distance between two points on the Earth.

    Parameters:
        lat1, lon1: Latitude and longitude of the first point (in degrees)
        lat2, lon2: Latitude and longitude of the second point (in degrees)

    Returns:
        Distance in kilometers
    """

    R = 6371.0  # Radius of the Earth in kilometers

    lat1 = math.radians(lat1)
    lon1 = math.radians(lon1)
    lat2 = math.radians(lat2)
    lon2 = math.radians(lon2)

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    return R * c

In [20]:
df.columns

Index(['ride_id', 'fare_amount', 'pickup_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count', 'date', 'hour', 'day', 'month', 'year', 'dayofweek',
       'nameofDOW'],
      dtype='object')

In [21]:
# Calculate the distance for each row
df['distance_in_km'] = df.apply(
    lambda row: haversine(
        row['pickup_latitude'], row['pickup_longitude'],
        row['dropoff_latitude'], row['dropoff_longitude'],
    ),
    axis=1,
)
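Row-wise `df.apply` calls the Python function once per row, which is about 200,000 calls for this dataset. The following is a minimal sketch of a vectorized alternative, assuming NumPy is available. The names `haversine_np` and `EARTH_RADIUS_KM` are hypothetical and do not appear in the notebook; the sample coordinates are the first two rides shown in `df.head()`.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same mean radius as the scalar haversine() cell


def haversine_np(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or arrays given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    a = np.clip(a, 0.0, 1.0)
    return EARTH_RADIUS_KM * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Coordinates of the first two rides shown in df.head()
pickup_lat = np.array([40.738354, 40.728225])
pickup_lon = np.array([-73.999817, -73.994355])
dropoff_lat = np.array([40.723217, 40.750325])
dropoff_lon = np.array([-73.999512, -73.994710])

distances = haversine_np(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

Because NumPy functions also accept pandas Series, the same function could be called with the four coordinate columns of `df` directly in place of the `apply` call.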

In [22]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW,distance_in_km
0,24238194,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1,2015-05-07,19,7,5,2015,3,Thursday,1.683323
1,27835199,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.99471,40.750325,1,2009-07-17,20,17,7,2009,4,Friday,2.45759
2,44984355,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.74077,-73.962565,40.772647,1,2009-08-24,21,24,8,2009,0,Monday,5.036377
3,25894730,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3,2009-06-26,8,26,6,2009,4,Friday,1.661683
4,17610152,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5,2014-08-28,17,28,8,2014,3,Thursday,4.47545


In [23]:
# No-op: this assigns the column to itself, so the DataFrame is unchanged
df['distance_in_km'] = df['distance_in_km']

In [24]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW,distance_in_km
0,24238194,7.5,2015-05-07 19:52:06+00:00,-73.999817,40.738354,-73.999512,40.723217,1,2015-05-07,19,7,5,2015,3,Thursday,1.683323
1,27835199,7.7,2009-07-17 20:04:56+00:00,-73.994355,40.728225,-73.99471,40.750325,1,2009-07-17,20,17,7,2009,4,Friday,2.45759
2,44984355,12.9,2009-08-24 21:45:00+00:00,-74.005043,40.74077,-73.962565,40.772647,1,2009-08-24,21,24,8,2009,0,Monday,5.036377
3,25894730,5.3,2009-06-26 08:22:21+00:00,-73.976124,40.790844,-73.965316,40.803349,3,2009-06-26,8,26,6,2009,4,Friday,1.661683
4,17610152,16.0,2014-08-28 17:47:00+00:00,-73.925023,40.744085,-73.973082,40.761247,5,2014-08-28,17,28,8,2014,3,Thursday,4.47545


#### * Calculate distance between each pickup and dropoff points using Haversine formula. 
#### * What is the median haversine distance between pickup and dropoff location according to the given dataset?
#### * * Read about Haversine Distance here: https://en.wikipedia.org/wiki/Haversine_formula

In [25]:
df['distance_in_km'].median()

2.1209923961833708

#### * What is the maximum haversine distance between pickup and dropoff location according to the given dataset?

In [26]:
df['distance_in_km'].max()

16409.239135313168
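A distance of roughly 16,409 km is far longer than any plausible ride, and it is consistent with the `df.describe()` output, which reports latitudes up to 1644 and longitudes down to -3356. The sketch below shows one way such rows could be flagged. The thresholds (latitude within [-90, 90], longitude within [-180, 180], fare above zero) are assumptions for illustration, not rules from the notebook, and only the first toy row is a real ride from `df.head()`.

```python
import pandas as pd

# Toy frame: row 0 is a real ride from df.head(); rows 1 and 2 are made-up
# rows that mimic the extremes reported by df.describe().
rides = pd.DataFrame({
    "fare_amount":       [7.5, -52.0, 12.9],
    "pickup_latitude":   [40.738354, 40.75, 1644.421482],
    "pickup_longitude":  [-73.999817, -73.98, -73.98],
    "dropoff_latitude":  [40.723217, 40.76, 40.76],
    "dropoff_longitude": [-73.999512, -73.97, -73.97],
})

# Assumed validity rules: coordinates inside their geographic ranges, positive fare.
valid = (
    rides["pickup_latitude"].between(-90, 90)
    & rides["dropoff_latitude"].between(-90, 90)
    & rides["pickup_longitude"].between(-180, 180)
    & rides["dropoff_longitude"].between(-180, 180)
    & (rides["fare_amount"] > 0)
)
clean = rides[valid]
print(len(rides), "->", len(clean))
```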

In [27]:
df = df.drop(['pickup_datetime'], axis=1)

In [28]:
df.head()

Unnamed: 0,ride_id,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW,distance_in_km
0,24238194,7.5,-73.999817,40.738354,-73.999512,40.723217,1,2015-05-07,19,7,5,2015,3,Thursday,1.683323
1,27835199,7.7,-73.994355,40.728225,-73.99471,40.750325,1,2009-07-17,20,17,7,2009,4,Friday,2.45759
2,44984355,12.9,-74.005043,40.74077,-73.962565,40.772647,1,2009-08-24,21,24,8,2009,0,Monday,5.036377
3,25894730,5.3,-73.976124,40.790844,-73.965316,40.803349,3,2009-06-26,8,26,6,2009,4,Friday,1.661683
4,17610152,16.0,-73.925023,40.744085,-73.973082,40.761247,5,2014-08-28,17,28,8,2014,3,Thursday,4.47545


In [29]:
dfd = df.copy()

In [30]:
dfd.head()

Unnamed: 0,ride_id,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW,distance_in_km
0,24238194,7.5,-73.999817,40.738354,-73.999512,40.723217,1,2015-05-07,19,7,5,2015,3,Thursday,1.683323
1,27835199,7.7,-73.994355,40.728225,-73.99471,40.750325,1,2009-07-17,20,17,7,2009,4,Friday,2.45759
2,44984355,12.9,-74.005043,40.74077,-73.962565,40.772647,1,2009-08-24,21,24,8,2009,0,Monday,5.036377
3,25894730,5.3,-73.976124,40.790844,-73.965316,40.803349,3,2009-06-26,8,26,6,2009,4,Friday,1.661683
4,17610152,16.0,-73.925023,40.744085,-73.973082,40.761247,5,2014-08-28,17,28,8,2014,3,Thursday,4.47545


In [31]:
dfd = dfd.drop(['ride_id', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'date', 'nameofDOW'],axis=1)

In [32]:
dfd.head()

Unnamed: 0,fare_amount,passenger_count,hour,day,month,year,dayofweek,distance_in_km
0,7.5,1,19,7,5,2015,3,1.683323
1,7.7,1,20,17,7,2009,4,2.45759
2,12.9,1,21,24,8,2009,0,5.036377
3,5.3,3,8,26,6,2009,4,1.661683
4,16.0,5,17,28,8,2014,3,4.47545


In [33]:
for cols in dfd.columns:
    print('Mean of', cols, '-->', dfd[cols].mean())
    print('')

Mean of fare_amount --> 11.359891549458371

Mean of passenger_count --> 1.6845434227171137

Mean of hour --> 13.491387456937284

Mean of day --> 15.704738523692619

Mean of month --> 6.281791408957044

Mean of year --> 2011.7424337121686

Mean of dayofweek --> 3.048435242176211

Mean of distance_in_km --> 20.85534982511106



In [34]:
for cols in dfd.columns:
    print('Median of', cols, '-->', dfd[cols].median())
    print('')

Median of fare_amount --> 8.5

Median of passenger_count --> 1.0

Median of hour --> 14.0

Median of day --> 16.0

Median of month --> 6.0

Median of year --> 2012.0

Median of dayofweek --> 3.0

Median of distance_in_km --> 2.1209923961833708



In [35]:
for cols in dfd.columns:
    print('Maximum value of', cols, '-->', dfd[cols].max())
    print('')

Maximum value of fare_amount --> 499.0

Maximum value of passenger_count --> 208

Maximum value of hour --> 23

Maximum value of day --> 31

Maximum value of month --> 12

Maximum value of year --> 2015

Maximum value of dayofweek --> 6

Maximum value of distance_in_km --> 16409.239135313168



In [36]:
for cols in dfd.columns:
    print('Minimum value of', cols, '-->', dfd[cols].min())
    print('')

Minimum value of fare_amount --> -52.0

Minimum value of passenger_count --> 0

Minimum value of hour --> 0

Minimum value of day --> 1

Minimum value of month --> 1

Minimum value of year --> 2009

Minimum value of dayofweek --> 0

Minimum value of distance_in_km --> 0.0



#### * How many rides have 0.0 haversine distance between pickup and dropoff location according to the given dataset?

In [37]:
dff = df.loc[df['distance_in_km']==0,'ride_id'].reset_index(drop=True)

In [38]:
dff

0       44470845
1       44195482
2        6379048
3       22405517
4       21993993
          ...   
5627    35013970
5628    44115598
5629    45368488
5630    46517645
5631    50075618
Name: ride_id, Length: 5632, dtype: int64

In [39]:
len(dff)

5632

#### * What is the mean 'fare_amount' for rides with 0 haversine distance?


In [40]:
df[df["distance_in_km"]==0]["fare_amount"].mean()

11.585317826704578

#### * What is the maximum 'fare_amount' for a ride?


In [41]:
df['fare_amount'].max()

499.0

#### * What is the haversine distance between pickup and dropoff location for the costliest ride?

In [42]:
df[df["fare_amount"]==df["fare_amount"].max()]["distance_in_km"]

170081    0.00079
Name: distance_in_km, dtype: float64

#### * How many rides were recorded in the year 2014?


In [43]:
df1 = df.loc[df['year']==2014,'ride_id'].reset_index(drop=True)

In [44]:
df1

0        17610152
1        48725865
2        55085966
3        38755863
4        19277743
           ...   
29963     9699676
29964    21553740
29965    13096190
29966     3189201
29967    16382965
Name: ride_id, Length: 29968, dtype: int64

In [45]:
len(df1)

29968

#### * How many rides were recorded in the first quarter of 2014?

In [46]:
# Define the date range as datetime.date objects, matching the values stored in the 'date' column
start_date = pd.to_datetime('2014-01-01').date()
end_date = pd.to_datetime('2014-03-31').date()

In [47]:
# Keep rides whose date falls within the range (both bounds inclusive)
df2 = df[df['date'].between(start_date, end_date)]


In [48]:
df2.head()

Unnamed: 0,ride_id,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW,distance_in_km
20,55085966,10.5,-73.980022,40.74599,-74.003432,40.759667,1,2014-02-18,14,18,2,2014,1,Tuesday,2.490244
26,38755863,5.0,-73.957802,40.776372,-73.957422,40.78287,1,2014-01-21,6,21,1,2014,1,Tuesday,0.723253
39,38703737,29.0,-73.9926,40.753172,-73.908508,40.816192,1,2014-02-13,17,13,2,2014,3,Thursday,9.961496
46,37192633,17.0,-73.9939,40.751714,-73.958575,40.76039,1,2014-01-16,14,16,1,2014,3,Thursday,3.127905
100,29350780,9.0,-73.95828,40.7689,-73.97351,40.782907,5,2014-02-19,18,19,2,2014,2,Wednesday,2.017541


In [49]:
df2.shape

(7687, 15)
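The same count can also be obtained from the integer `year` and `month` columns created earlier, which avoids comparing dates altogether. This is a minimal sketch on a toy frame; the name `rides` and its four rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df with only the columns this filter needs.
rides = pd.DataFrame({
    "year":  [2014, 2014, 2014, 2013],
    "month": [1, 3, 4, 2],
})

# First quarter means months 1 through 3 of the chosen year.
in_q1_2014 = (rides["year"] == 2014) & rides["month"].between(1, 3)
print(int(in_q1_2014.sum()))
```

On the real data, the equivalent mask would be built from `df['year']` and `df['month']`.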

#### * On which day of the week were the most rides recorded in September 2010?

In [50]:
# Define the date range as datetime.date objects, matching the values stored in the 'date' column
start_date = pd.to_datetime('2010-09-01').date()
end_date = pd.to_datetime('2010-09-30').date()

In [51]:
# Keep rides whose date falls within the range (both bounds inclusive)
df3 = df[df['date'].between(start_date, end_date)]


In [52]:
df3.head()

Unnamed: 0,ride_id,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,date,hour,day,month,year,dayofweek,nameofDOW,distance_in_km
23,25121708,7.7,-73.9943,40.739512,-73.98807,40.724482,2,2010-09-04,16,4,9,2010,5,Saturday,1.751763
85,11536406,7.7,-73.987383,40.721255,-73.954733,40.730683,1,2010-09-08,10,8,9,2010,2,Wednesday,2.944304
87,11267091,4.5,-73.98192,40.761172,-74.016152,40.704765,1,2010-09-22,12,22,9,2010,2,Wednesday,6.903596
126,8791172,5.3,-73.956158,40.781593,-73.966138,40.773645,1,2010-09-21,12,21,9,2010,1,Tuesday,1.219522
129,30203807,10.5,-73.995627,40.749295,-73.965087,40.77161,1,2010-09-10,16,10,9,2010,4,Friday,3.573956


In [53]:
df3.shape

(2482, 15)

In [54]:
df3['nameofDOW'].mode()

0    Thursday
Name: nameofDOW, dtype: object

#### * Apply a Machine Learning algorithm to predict the fare amount given the following input features:
#### >>> passenger_count, distance and ride_week_day.

#### * Perform a 70-30 split of data.

### * * Train-Test Split * *

In [55]:
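# Note: X keeps all seven remaining columns, not only the three features named in the task.
X = dfd.drop(['fare_amount'], axis=1)
y = dfd['fare_amount']

In [56]:
# Note: train_size=0.75 gives a 75-25 split (149,999 training rows, 50,000 test rows),
# not the 70-30 split requested above; pass train_size=0.70 to match the task statement.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.75, random_state = 35)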
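The task statement names three input features and a 70-30 split, whereas the cell above uses every remaining column and `train_size = 0.75`. The sketch below shows what the split could look like when following the task statement literally. Mapping `distance` to `distance_in_km` and `ride_week_day` to `dayofweek` is an assumption, and `toy` is a hypothetical ten-row stand-in for `dfd` with illustrative values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row stand-in for dfd; values are illustrative only.
toy = pd.DataFrame({
    "fare_amount":     [7.5, 7.7, 12.9, 5.3, 16.0, 6.1, 9.3, 8.0, 4.5, 10.9],
    "passenger_count": [1, 1, 1, 3, 5, 1, 1, 1, 2, 1],
    "hour":            [19, 20, 21, 8, 17, 18, 14, 19, 12, 19],
    "dayofweek":       [3, 4, 0, 4, 3, 0, 6, 2, 2, 6],
    "distance_in_km":  [1.68, 2.46, 5.04, 1.66, 4.48, 1.89, 2.28, 2.05, 0.76, 4.07],
})

# Assumed mapping of the task's feature names onto the notebook's columns.
features = ["passenger_count", "distance_in_km", "dayofweek"]
X = toy[features]
y = toy["fare_amount"]

# test_size=0.30 gives the 70-30 split that the task statement asks for.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=35
)
print(X_train.shape, X_test.shape)
```

Restricting the features this way would change every score reported later, so the numbers in this notebook apply only to the seven-feature, 75-25 setup.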

In [57]:
X_train

Unnamed: 0,passenger_count,hour,day,month,year,dayofweek,distance_in_km
121752,2,6,8,5,2014,3,3.891882
120807,1,1,10,2,2013,6,1.892897
64310,2,19,31,12,2011,5,5.220771
103090,2,18,15,10,2011,5,1.191162
54754,1,13,29,5,2010,5,4.120162
...,...,...,...,...,...,...,...
84927,1,22,8,4,2014,1,5.248460
56300,1,15,20,9,2011,1,0.000000
179234,1,10,30,6,2013,6,1.311106
41911,1,17,5,8,2014,1,4.758027


In [58]:
X_test

Unnamed: 0,passenger_count,hour,day,month,year,dayofweek,distance_in_km
141257,1,13,31,10,2010,6,1.417486
161099,1,19,11,3,2013,0,0.745781
15505,1,14,8,11,2009,6,2.279551
76178,1,19,27,8,2014,2,2.053946
28950,1,7,18,4,2011,0,4.104468
...,...,...,...,...,...,...,...
168658,1,18,16,5,2011,0,1.894657
146968,2,5,4,3,2009,2,9.727890
105930,1,19,3,1,2010,6,4.067061
169265,2,12,29,6,2011,2,0.759620


In [59]:
y_train

121752    11.0
120807    11.0
64310     21.3
103090     7.3
54754      8.1
          ... 
84927     12.0
56300     26.5
179234     5.5
41911     25.5
115984     4.9
Name: fare_amount, Length: 149999, dtype: float64

In [60]:
y_test

141257     5.30
161099     3.50
15505      9.30
76178      8.00
28950     12.90
          ...  
168658     6.10
146968    30.35
105930    10.90
169265     4.50
28452      4.50
Name: fare_amount, Length: 50000, dtype: float64

### * * * Model Building  * * * 

#### 1. Linear regression

In [61]:
%%time
lr=LinearRegression(n_jobs=-1)
lr.fit(X_train,y_train)

Wall time: 69.6 ms


LinearRegression(n_jobs=-1)

In [62]:
%%time
y_pred = lr.predict(X_test)
y_pred

Wall time: 5 ms


array([10.69922679, 11.50603066, 10.04849651, ...,  9.43200911,
       10.93492206, 11.21591663])

In [63]:
r2_lr = r2_score(y_test , y_pred)
r2_lr

0.01577216058930453

In [64]:
mae_lr = mean_absolute_error(y_test, y_pred)
mae_lr

5.943299106434257

In [65]:
rmse_lr = mean_squared_error(y_test, y_pred, squared=False)
rmse_lr

9.725466350361572
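The `squared=False` argument used above depends on the installed scikit-learn version: recent releases deprecate it in favor of a separate `root_mean_squared_error` function. A version-independent option is to compute the RMSE directly, as in this minimal sketch. The helper name `rmse` is hypothetical, and the sample numbers are illustrative rather than taken from the test set.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only: errors are 1, 0 and -2, so the MSE is 5/3.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```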

#### 2. DecisionTree Regressor

In [66]:
%%time
dtr = DecisionTreeRegressor()
dtr.fit(X_train,y_train)

Wall time: 1.91 s


DecisionTreeRegressor()

In [67]:
%%time
y_pred = dtr.predict(X_test)
y_pred

Wall time: 51.3 ms


array([ 4.9,  5.5,  7.3, ..., 13.7,  5.3,  4.5])

In [68]:
r2_dtr = r2_score(y_test , y_pred)
r2_dtr

0.4695216689148426

In [69]:
mae_dtr = mean_absolute_error(y_test, y_pred)
mae_dtr

3.2126457

In [70]:
rmse_dtr = mean_squared_error(y_test, y_pred, squared=False)
rmse_dtr

7.139970826656087

#### 3. RandomForest Regressor

In [71]:
%%time
rfr=RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(X_train,y_train)

Wall time: 1min 15s


RandomForestRegressor(random_state=42)

In [72]:
%%time
y_pred = rfr.predict(X_test)
y_pred

Wall time: 2.47 s


array([ 6.357,  5.15 ,  8.352, ..., 11.531,  4.956,  4.817])

In [73]:
r2_rfr = r2_score(y_test , y_pred)
r2_rfr

0.7258874789501785

In [74]:
mae_rfr = mean_absolute_error(y_test, y_pred)
mae_rfr

2.347092444392063

In [75]:
rmse_rfr = mean_squared_error(y_test, y_pred, squared=False)
rmse_rfr

5.132477450170007

#### 4. KNN Regressor

In [76]:
%%time
knnr=KNeighborsRegressor(n_neighbors=5)
knnr.fit(X_train,y_train)

Wall time: 388 ms


KNeighborsRegressor()

In [77]:
%%time
y_pred = knnr.predict(X_test)
y_pred

Wall time: 5.22 s


array([5.78, 4.8 , 8.02, ..., 9.14, 5.38, 5.3 ])

In [78]:
r2_knnr = r2_score(y_test , y_pred)
r2_knnr

0.6900351812606629

In [79]:
mae_knnr = mean_absolute_error(y_test, y_pred)
mae_knnr

2.6165024800000003

In [80]:
rmse_knnr = mean_squared_error(y_test, y_pred, squared=False)
rmse_knnr

5.457815054440742

#### * Which algorithm gives the lowest adjusted R-squared value?

#### All algorithms along with their R2 scores:

In [81]:
all_model = {"Model":["Linear_Regression", "Decision_Tree", "Random_Forest", "KNN"],
            "R2_Score":[r2_lr, r2_dtr, r2_rfr, r2_knnr]}

In [82]:
model = pd.DataFrame(all_model)

In [83]:
model

Unnamed: 0,Model,R2_Score
0,Linear_Regression,0.015772
1,Decision_Tree,0.469522
2,Random_Forest,0.725887
3,KNN,0.690035
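The table above lists plain R2 scores, while the question asks about adjusted R-squared. A minimal sketch of the conversion follows, using the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1) with n = 50,000 test rows and p = 7 features, both read from the outputs above. The helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# R^2 values copied from the notebook outputs above.
r2_scores = {
    "Linear_Regression": 0.01577216058930453,
    "Decision_Tree": 0.4695216689148426,
    "Random_Forest": 0.7258874789501785,
    "KNN": 0.6900351812606629,
}
n_test, n_features = 50_000, 7  # test-set rows and columns in X

adjusted = {
    name: adjusted_r2(r2, n_test, n_features) for name, r2 in r2_scores.items()
}
lowest = min(adjusted, key=adjusted.get)
print(lowest)  # Linear_Regression
```

Since every model shares the same n and p, the adjustment lowers each score by a small amount without changing the ranking, so Linear Regression has the lowest adjusted R-squared as well as the lowest R2.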
