# External Lab 

Each question is worth 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol** from four variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [40]:
import pandas as pd
import numpy as np

In [49]:
petrol_df = pd.read_csv('petrol.csv')
petrol_df.describe()

| | tax | income | highway | dl | consumption |
|---|---|---|---|---|---|
| count | 48.0 | 48.0 | 48.0 | 48.0 | 48.0 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.95077 | 573.623768 | 3491.507166 | 0.05547 | 111.885816 |
| min | 5.0 | 3063.0 | 431.0 | 0.451 | 344.0 |
| 25% | 7.0 | 3739.0 | 3110.25 | 0.52975 | 509.5 |
| 50% | 7.5 | 4298.0 | 4735.5 | 0.5645 | 568.5 |
| 75% | 8.125 | 4578.75 | 7156.0 | 0.59525 | 632.75 |
| max | 10.0 | 5342.0 | 17782.0 | 0.724 | 968.0 |


## Question 2 - Cap outliers

Find the outliers and remove them. Use Q1 - 1.5 * IQR as the lower bound and Q3 + 1.5 * IQR as the upper bound. Keep only the data points that fall within this range. Any data point outside the range is an outlier, and the entire row containing it must be removed.

In [50]:
Q1 = petrol_df.quantile(0.25)
Q3 = petrol_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

tax               1.1250
income          839.7500
highway        4045.7500
dl                0.0655
consumption     123.2500
dtype: float64


In [91]:
petrol_df_out = petrol_df[~((petrol_df<(Q1-1.5*IQR)) | (petrol_df> (Q3 + 1.5*IQR))).any(axis=1)]

In [90]:
petrol_df_out.shape

(43, 5)
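
The row filter above can be wrapped in a small helper so the rule is easier to read and reuse. This is a minimal sketch, not the notebook's own code: the function name `drop_iqr_outliers`, the parameter `k`, and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd

def drop_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose values all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row is an outlier if any of its columns falls outside that column's bounds.
    is_outlier = ((df < lower) | (df > upper)).any(axis=1)
    return df[~is_outlier]

# Toy data (made up for illustration): the last row has an extreme value in "b".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0],
})
cleaned = drop_iqr_outliers(toy)
print(cleaned.shape)
```

Applied to `petrol_df`, the same helper should reproduce `petrol_df_out`, since it uses the same quantiles and the same comparison.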

## Question 3 - Independent variables and collinearity
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [94]:
petrol_df_out.corr()

| | tax | income | highway | dl | consumption |
|---|---|---|---|---|---|
| tax | 1.0 | -0.109537 | -0.390602 | -0.314702 | -0.446116 |
| income | -0.109537 | 1.0 | 0.051169 | 0.150689 | -0.347326 |
| highway | -0.390602 | 0.051169 | 1.0 | -0.016193 | 0.034309 |
| dl | -0.314702 | 0.150689 | -0.016193 | 1.0 | 0.611788 |
| consumption | -0.446116 | -0.347326 | 0.034309 | 0.611788 | 1.0 |


In [95]:
petrol_df_out.dl.corr( petrol_df_out.consumption )

0.6117880063947396

In [180]:
petrol_df_out.tax.corr( petrol_df_out.consumption )

-0.4461157362582568

In [181]:
# dl has the strongest positive association with consumption; tax has the strongest negative association

### Observation

The correlation values above show that the proportion of the population with driver's licenses (dl) has the strongest association with consumption (about 0.61). Tax has the next strongest association, and it is negative (about -0.45).

Insights:
- As tax increases, consumption tends to decrease.
- As the proportion of licensed drivers increases, consumption tends to increase.
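
To make the comparison explicit, the correlations with consumption can be ranked by absolute value. The sketch below uses the values reported in the correlation matrix above, typed in by hand. In the notebook the same Series would presumably come from `petrol_df_out.corr()['consumption'].drop('consumption')`. The variable names are illustrative.

```python
import pandas as pd

# Correlations with consumption, copied from the matrix above.
corr_with_consumption = pd.Series({
    "tax": -0.446116,
    "income": -0.347326,
    "highway": 0.034309,
    "dl": 0.611788,
})

# Rank by strength (absolute value) while keeping the sign for interpretation.
order = corr_with_consumption.abs().sort_values(ascending=False).index
ranked = corr_with_consumption.reindex(order)
print(ranked)
```

Ranking by absolute value keeps a strong negative relationship such as tax from being mistaken for a weak one.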

## Question 4 - Transform the dataset
Divide the data into feature (X) and target (Y) sets.

In [121]:
x = petrol_df_out[['dl','tax']]


In [110]:
y = petrol_df_out[['consumption']]

## Question 5 - Split data into train, test sets
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [122]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

In [157]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=7)
linreg = LinearRegression()



In [158]:
print('x_train shape', x_train.shape)
print('y_train shape', y_train.shape)
print('x_test shape', x_test.shape)
print('y_test shape', y_test.shape)

x_train shape (34, 2)
y_train shape (34, 1)
x_test shape (9, 2)
y_test shape (9, 1)
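
The 34/9 split follows from the 43 rows left after outlier removal. The arithmetic below is a sketch that assumes scikit-learn rounds the test count up when `test_size` is a fraction and assigns the remaining rows to training. That assumption is consistent with the shapes printed above.

```python
import math

n_samples = 43     # rows remaining after outlier removal
test_size = 0.2    # fraction requested in train_test_split

# Assumed rounding rule: the test count is rounded up, the rest go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

With only 9 test rows, the test-set metrics computed later are likely to be sensitive to the choice of `random_state`.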


## Question 6 - Build Model
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [159]:
model =linreg.fit(x_train, y_train)
b0 = linreg.intercept_
print('Intercept b0: ',b0)

Intercept b0:  [160.26896904]


In [160]:
b1= linreg.coef_
print('Coefficient b1: ',b1)

Coefficient b1:  [[1065.42542264  -25.49572645]]


In [162]:
for idx,col_name in enumerate(x_train.columns):
    print("The coefficient for {} is {}".format(col_name,model.coef_[0][idx]))

The coefficient for dl is 1065.4254226362966
The coefficient for tax is -25.49572645473844


# R-Square 

## Question 7 - Evaluate the model
Calculate the accuracy score for the above model.

In [163]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
y_predicted = linreg.predict(x_test)
mse = mean_squared_error(y_test, y_predicted)

rmse = sqrt(mse)
print('Root Mean Square Error:', rmse)

Root Mean Square Error: 73.26639220624098


In [164]:
linreg.score(x_train,y_train)

0.5118393644547021

In [174]:
print('R2 score:',r2_score(y_test, y_predicted))

R2 score: 0.011465091621612578
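
A test R2 near zero means the model predicts the test rows barely better than always predicting their mean. The sketch below computes R2 from its definition, 1 - SS_res / SS_tot, to make that reading concrete. The function name `r_squared` and the toy values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy targets (made up): perfect predictions versus predicting the mean everywhere.
y_true = [500.0, 600.0, 700.0]
perfect = r_squared(y_true, y_true)
mean_only = r_squared(y_true, [600.0, 600.0, 600.0])
print(perfect, mean_only)
```

On the same inputs, `r2_score(y_test, y_predicted)` should give the same number as this function, since it uses the same definition for a single target.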


## Question 8: Repeat the multiple linear regression modelling, adding both the income and highway features
Find R2.


In [166]:
x1 = petrol_df_out[['dl','tax','income','highway']]

In [167]:
x1_train, x1_test, y1_train, y1_test = train_test_split(x1,y,test_size=0.2, random_state=7)
linreg1 = LinearRegression()
print('x1_train shape', x1_train.shape)
print('y1_train shape', y1_train.shape)
print('x1_test shape', x1_test.shape)
print('y1_test shape', y1_test.shape)



x1_train shape (34, 4)
y1_train shape (34, 1)
x1_test shape (9, 4)
y1_test shape (9, 1)


## Question 9: Print the coefficients of the multiple linear regression model

In [168]:
model2 =linreg1.fit(x1_train, y1_train)
_b0 = linreg1.intercept_
print('Intercept b0: ',_b0)
_b1= linreg1.coef_
print('Coefficient b1: ',_b1)

Intercept b0:  [511.6303834]
Coefficient b1:  [[ 1.07238542e+03 -3.42596792e+01 -6.57770878e-02 -3.09447216e-03]]


In [170]:
for idx,col_name in enumerate(x1_train.columns):
    print("The coefficient for {} is {}".format(col_name,model2.coef_[0][idx]))

The coefficient for dl is 1072.3854176532423
The coefficient for tax is -34.259679176344335
The coefficient for income is -0.06577708777329248
The coefficient for highway is -0.0030944721617528117


In [171]:

y1_predicted = linreg1.predict(x1_test)
mse1 = mean_squared_error(y1_test, y1_predicted)

rmse1 = sqrt(mse1)
print('Root Mean Square Error:', rmse1)

Root Mean Square Error: 57.71119323483317


In [173]:
linreg1.score(x1_train,y1_train)

0.7046436836953849

In [175]:
print('R2 score:',r2_score(y1_test, y1_predicted))

R2 score: 0.3866582824256367


## Question 10
In one or two sentences give reasoning on R-Square on the basis of above findings
Answer

### R-squared increased when more independent variables were added to the model.

In [179]:
# R-squared increased when we added independent variables to the model. On the test set it rose
# from 0.011465091621612578 to 0.3866582824256367 when the number of independent variables went from 2 (tax, dl) to 4 (tax, dl, income, highway).
# Training R-squared also rose (0.51 to 0.70), but it can never decrease when predictors are added,
# so the test-set improvement is the stronger evidence that income and highway carry useful signal.
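
Because plain R-squared on the training data cannot go down when a predictor is added, adjusted R-squared is often used to compare models with different numbers of features. It is not part of the original lab. The sketch below applies it to the training-set scores and shapes reported above (34 training rows, with 2 and 4 features). The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Training-set R-squared values and shapes reported in this notebook.
adj_two = adjusted_r2(0.5118393644547021, n_samples=34, n_features=2)
adj_four = adjusted_r2(0.7046436836953849, n_samples=34, n_features=4)
print(round(adj_two, 3), round(adj_four, 3))
```

If the adjusted value still rises after adding income and highway, the improvement is less likely to come from the extra parameters alone.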