In [1]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

The aim here is to build a linear regression model that predicts the duration of a trip as accurately as possible.

# Reading the Dataset

In [None]:
# loading the dataset
df = pd.read_csv('Uber_Drives_Clean.csv')

In [None]:
# displaying the dataset
df.head()

In [None]:
# info of the dataset
df.info()

# Adding Features

Since the date and time features cannot be used directly in the model, three columns, 'Month', 'Day of the Month' and 'Hour', are created. They store the month, the day of the month and the hour in which the trip starts, respectively.

In [None]:
# creating the month and day columns (parse the start date once and reuse it)
start_date = pd.to_datetime(df['Start Date'])
df['Month'] = start_date.dt.month
df['Day of the Month'] = start_date.dt.day

In [None]:
# creating the hour column
df['Hour'] = pd.to_datetime(df['Start Time']).dt.hour

In [None]:
df.head()

# Correlation Matrix

The correlation matrix is plotted to understand the relationship that the target variable 'Duration' shares with the other independent variables.

It is to be noted that 'Weekday' and 'Month', although stored numerically, are categorical in nature. Hence their correlation should not be studied here.

In [None]:
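# heatmap of the correlation matrix
# numeric_only=True restricts the matrix to numeric columns; in pandas 2.0+
# df.corr() raises an error if text columns are present
sns.heatmap(df.corr(numeric_only=True), annot=True)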

The only strong correlation here is 0.93, between 'MILES*' and 'Duration'.
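One way to keep the numerically coded categorical columns out of the heatmap is to drop them before computing the correlations. The following is a minimal sketch on a made-up frame: the values are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    'MILES*': [5.1, 4.8, 63.7, 7.1, 10.4],
    'Duration': [6, 12, 67, 14, 20],
    'Hour': [21, 1, 17, 14, 8],
    'Weekday': [4, 5, 1, 2, 3],
    'Month': [1, 1, 2, 2, 3],
})

# Drop the columns that are numeric codes for categories, then correlate the rest.
corr = toy.drop(columns=['Weekday', 'Month']).corr()
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` in the same way as before.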

# Train - Test Split

The dataframe is divided into predictor and target dataframes, named X and y respectively.

Here, the columns 'MILES*', 'Weekday', 'Month', 'Day of the Month' and 'Hour' are considered to be the predictors. The features 'CATEGORY*' and 'PURPOSE*' of a trip are assumed to have little effect on its duration and are therefore excluded.

In [None]:
# creating target and predictor dataframes
y = df['Duration']
X = df[['MILES*', 'Weekday', 'Month', 'Day of the Month', 'Hour']]

In [None]:
X.head()

In [None]:
y.head()

The one-hot encoding of the categorical variables is performed.

In [None]:
# one hot encoding
X = pd.get_dummies(X, columns=['Month', 'Weekday'])

In [None]:
# info on all the predictors that will be used in the model
X.info()
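A side note on the encoding: with a full set of dummy columns, the columns of each category always sum to 1, which duplicates the information carried by the model's intercept. Predictions are generally unaffected, but the individual coefficients are then not uniquely determined. If the coefficients are to be interpreted, one option is the `drop_first` argument of `pd.get_dummies`. The sketch below uses a hypothetical toy frame, not the actual dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    'MILES*': [5.1, 4.8, 63.7, 7.1],
    'Month': [1, 2, 3, 1],
    'Weekday': [0, 6, 3, 0],
})

# Full encoding: one column per category level.
full = pd.get_dummies(toy, columns=['Month', 'Weekday'])

# Reduced encoding: the first level of each category becomes the reference level.
reduced = pd.get_dummies(toy, columns=['Month', 'Weekday'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The notebook itself keeps the full encoding; the reduced form is shown only as an alternative.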

The dataframes are split into the training and the test datasets.

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Function to get metrics

A function is created to compute the evaluation metrics of a given model on the test set: RMSE, SSE, SST and r<sup>2</sup>.

In [None]:
def get_metrics(y_train, y_test, y_pred):
    
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))        # Root mean square error 
    SSE = np.sum((y_pred-y_test)**2)                               # Sum of squares due to the errors
    SST = np.sum((y_test-np.mean(y_train))**2)                     # Total Sum of Squares
    r2_test = 1 - SSE/SST                                          # r2 measure of accuracy
    
    print("The metrics for the model are:")
    print("Test RMSE : ", rmse_test)
    print("Test SSE : ", SSE)
    print("Test SST : ", SST)
    print("Test R2 : ", r2_test)

# Baseline Model

The baseline model simply predicts, for every trip, the mean of the target variable in the training dataset.

In [None]:
# finding the baseline predictions
y_pred_bl = np.repeat(np.mean(y_train), len(y_test))

In [None]:
y_pred_bl

In [None]:
# finding the metrics for the Baseline model
get_metrics(y_train, y_test, y_pred_bl)

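The baseline model has an r<sup>2</sup> value of 0. This holds by construction, because SST is measured around the training mean, which is exactly what the baseline predicts. The baseline explains none of the variation in the target and serves as the reference point that any useful model must beat.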
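The r<sup>2</sup> computed in `get_metrics` measures total variation around the training mean, whereas the conventional r<sup>2</sup> (the one `sklearn.metrics.r2_score` reports) measures it around the mean of the test targets. The two usually differ slightly. A minimal sketch with hypothetical numbers shows the difference for a baseline prediction:

```python
import numpy as np

# Hypothetical values for illustration only.
y_train = np.array([10.0, 20.0, 30.0, 40.0])
y_test = np.array([12.0, 18.0, 33.0])

# Baseline prediction: the training mean, repeated for every test row.
y_pred = np.repeat(y_train.mean(), len(y_test))

sse = np.sum((y_test - y_pred) ** 2)

# Definition used in get_metrics: variation around the training mean.
r2_train_mean = 1 - sse / np.sum((y_test - y_train.mean()) ** 2)

# Conventional definition: variation around the test mean.
r2_test_mean = 1 - sse / np.sum((y_test - y_test.mean()) ** 2)

print(r2_train_mean, r2_test_mean)
```

Under the first definition the baseline scores exactly 0; under the second it scores slightly below 0 whenever the training and test means differ.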

# Linear Regression

Next, a simple Linear Regression model is fitted on the dataset.

In [None]:
# initialising the model 
model = LinearRegression()

In [None]:
# fitting the model
model.fit(X_train, y_train)

In [None]:
# finding predictions for the test set
y_pred_lr = model.predict(X_test)

In [None]:
# finding metrics
get_metrics(y_train, y_test, y_pred_lr)

This linear regression model has an r<sup>2</sup> value of ~0.88 on the test set, which means that the given predictors explain ~88% of the variation present in the target variable. Note that r<sup>2</sup> measures explained variation, not the share of correct predictions; the typical size of a prediction error is given by the RMSE.

This model is much better than the baseline model.

#### It was observed in the correlation matrix that the predictor 'MILES*' shared a very high correlation with the target variable 'Duration'. 

Hence, a linear regression model is built with 'MILES*' as the only predictor, to see how well it performs compared to the model above.

In [None]:
# defining target and predictor
y = df['Duration']
X = df[['MILES*']]

In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# fitting the model
model.fit(X_train, y_train)

In [None]:
# finding predictions for the test set
y_pred_lr1 = model.predict(X_test)

In [None]:
# finding the metrics
get_metrics(y_train, y_test, y_pred_lr1)

This linear regression model has an r<sup>2</sup> value of ~0.87 on the test set, which means that 'MILES*' alone explains ~87% of the variation present in the target variable.

The r<sup>2</sup> value drops by only about 0.01 when 'MILES*' is the sole predictor, so the weekday, month, day and hour features add little predictive power. Hence, this simpler model can also be used to predict the duration of a trip.
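The comparison of the two feature sets can also be wrapped in a small helper so that both models are scored on the same split. The sketch below runs on synthetic stand-in data (an assumption, not the Uber dataset), and `holdout_r2` is a hypothetical helper name. It uses `LinearRegression.score`, which returns the conventional r<sup>2</sup> and may therefore differ slightly from the value printed by `get_metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Uber dataset):
# duration depends mostly on miles and only weakly on the hour.
rng = np.random.default_rng(42)
miles = rng.uniform(1, 50, size=300)
hour = rng.integers(0, 24, size=300)
duration = 2.0 * miles + 0.1 * hour + rng.normal(0, 5, size=300)

X_full = np.column_stack([miles, hour])   # two predictors
X_miles = miles.reshape(-1, 1)            # miles only


def holdout_r2(X, y):
    """Fit on a 67/33 split and return r^2 on the held-out part."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)


r2_full = holdout_r2(X_full, duration)
r2_miles = holdout_r2(X_miles, duration)
print(f"full: {r2_full:.3f}  miles only: {r2_miles:.3f}")
```

Because the same `random_state` is used on inputs of equal length, both models are evaluated on the same held-out rows, which keeps the comparison fair.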