# External Lab 

Each question is worth 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol**. The relevant variables are the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [18]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [19]:
df = pd.read_csv("petrol.csv")
# Strip stray whitespace from the column names.
df.columns = df.columns.str.strip()
df.describe()

| | tax | income | highway | dl | consumption |
|---|---|---|---|---|---|
| count | 48.0 | 48.0 | 48.0 | 48.0 | 48.0 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.95077 | 573.623768 | 3491.507166 | 0.05547 | 111.885816 |
| min | 5.0 | 3063.0 | 431.0 | 0.451 | 344.0 |
| 25% | 7.0 | 3739.0 | 3110.25 | 0.52975 | 509.5 |
| 50% | 7.5 | 4298.0 | 4735.5 | 0.5645 | 568.5 |
| 75% | 8.125 | 4578.75 | 7156.0 | 0.59525 | 632.75 |
| max | 10.0 | 5342.0 | 17782.0 | 0.724 | 968.0 |


# Question 2 - Cap outliers 

Find the outliers and cap them. Use (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Keep only the data points that fall within this range. Any data point outside the range is an outlier, and its entire row must be removed.

In [20]:
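# Lower fence, Q1 - 1.5 * IQR. Note: np.abs flips a negative fence to positive (see highway below).
minCap = np.abs(df.quantile(0.25) - 1.5 * (df.quantile(0.75) - df.quantile(0.25)))
# Upper fence as coded: Q3 + (1.5 * Q3 - Q1), which is wider than the stated Q3 + 1.5 * (Q3 - Q1).
maxCap = np.abs(df.quantile(0.75) + (1.5 * df.quantile(0.75) - df.quantile(0.25)))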

In [21]:
minCap

tax               5.3125
income         2479.3750
highway        2958.3750
dl                0.4315
consumption     324.6250
dtype: float64

In [22]:
maxCap

tax               13.312500
income          7707.875000
highway        14779.750000
dl                 0.958375
consumption     1072.375000
dtype: float64
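
The rule stated in the question can be sketched directly with pandas. This is a minimal illustration on a made-up series (the values and the helper name `iqr_fences` are hypothetical, not part of the notebook). It applies 1.5 * (Q3 - Q1) on both sides and takes no absolute value, so a lower fence may be negative.

```python
import pandas as pd

def iqr_fences(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 100])
lower, upper = iqr_fences(values)

# Keep only the points inside [lower, upper].
kept = values[values.between(lower, upper)]
print(lower, upper)
print(kept.tolist())
```

Applied by hand to the quartiles in the `describe()` output, the same rule gives 7.0 - 1.5 * 1.125 = 5.3125 and 8.125 + 1.5 * 1.125 = 9.8125 for `tax`.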

In [23]:

# Keep only the rows where every column lies within [minCap, maxCap].
df = df[((df["tax"] >= minCap.tax) & (df["tax"] <= maxCap.tax))
   & ((df["income"] >= minCap.income) & (df["income"] <= maxCap.income))
   & ((df["highway"] >= minCap.highway) & (df["highway"] <= maxCap.highway))
   & ((df["dl"] >= minCap.dl) & (df["dl"] <= maxCap.dl))
   & ((df["consumption"] >= minCap.consumption) & (df["consumption"] <= maxCap.consumption))
  ]
df

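| | tax | income | highway | dl | consumption |
|---|---|---|---|---|---|
| 6 | 8.0 | 5319 | 11868 | 0.451 | 344 |
| 8 | 8.0 | 4447 | 8577 | 0.529 | 464 |
| 9 | 7.0 | 4512 | 8507 | 0.552 | 498 |
| 10 | 8.0 | 4391 | 5939 | 0.53 | 580 |
| 11 | 7.5 | 5126 | 14186 | 0.525 | 471 |
| 12 | 7.0 | 4817 | 6930 | 0.574 | 525 |
| 13 | 7.0 | 4207 | 6580 | 0.545 | 508 |
| 14 | 7.0 | 4332 | 8159 | 0.608 | 566 |
| 15 | 7.0 | 4318 | 10340 | 0.586 | 635 |
| 16 | 7.0 | 4206 | 8508 | 0.572 | 603 |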


# Question 3 - Independent variables and collinearity 
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [24]:
df.corr()

| | tax | income | highway | dl | consumption |
|---|---|---|---|---|---|
| tax | 1.0 | -0.147446 | -0.255876 | -0.302866 | -0.333229 |
| income | -0.147446 | 1.0 | 0.540871 | 0.145486 | -0.194945 |
| highway | -0.255876 | 0.540871 | 1.0 | -0.202322 | -0.407945 |
| dl | -0.302866 | 0.145486 | -0.202322 | 1.0 | 0.713031 |
| consumption | -0.333229 | -0.194945 | -0.407945 | 0.713031 | 1.0 |


`dl` has the strongest association with the dependent variable `consumption` (correlation of about 0.71).

Observing the correlation values above, the proportion of licensed drivers (`dl`) has the strongest association with consumption. By comparison, `tax` is negatively associated with consumption (about -0.33), as is `highway` (about -0.41).

Insights:
- As tax increases, consumption tends to decrease.
- As the proportion of licensed drivers increases, consumption tends to increase.
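
To compare strengths of association, it can help to rank the correlations with the target by absolute value. The sketch below hard-codes the `consumption` column of the correlation table above so that it runs on its own. In the notebook, the same series could presumably be obtained with `df.corr()["consumption"].drop("consumption")`.

```python
import pandas as pd

# Correlations with `consumption`, copied from the table above.
corr_with_target = pd.Series(
    {"tax": -0.333229, "income": -0.194945, "highway": -0.407945, "dl": 0.713031},
    name="consumption",
)

# Rank by magnitude: the sign gives the direction, the absolute value the strength.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```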

# Question 4 - Transform the dataset 
Divide the data into feature(X) and target(Y) sets.

In [26]:
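# Features: proportion of licensed drivers and petrol tax.
x = df[["dl", "tax"]]
# Target: petrol consumption, kept as a one-column DataFrame.
y = df[["consumption"]]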

# Question 5 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [28]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as tts

In [29]:
x_train,x_test,y_train,y_test= tts(x,y,test_size=0.2,random_state=1)

In [30]:
x_train.shape

(28, 2)

In [31]:
x_test.shape

(7, 2)
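
The shapes above follow from how `train_test_split` sizes the test set: for a fractional `test_size`, the number of test rows is rounded up and the remainder goes to training. A minimal sketch, assuming 35 rows (28 + 7, as reported above) and placeholder arrays in place of the real features:

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 35  # 28 + 7, the row counts reported above
X = np.zeros((n_rows, 2))  # placeholder features; the values are irrelevant here
y = np.zeros(n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# For a float test_size, the test count is ceil(test_size * n_rows).
expected_test = math.ceil(0.2 * n_rows)
print(X_train.shape, X_test.shape, expected_test)
```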

# Question 6 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [32]:
linreg = LinearRegression()
linreg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [33]:
linreg.coef_

array([[1178.64299358,  -33.28892404]])

In [34]:
linreg.intercept_

array([159.65329511])

In [35]:
res = linreg.predict(x_test)
res

array([[469.4114064 ],
       [441.12397455],
       [518.02268936],
       [664.46134079],
       [660.92541181],
       [718.67891849],
       [574.59755305]])

In [36]:
# One row of coefficients, labelled with the feature names.
df2 = pd.DataFrame(linreg.coef_, columns=x.columns)
df2

| | dl | tax |
|---|---|---|
| 0 | 1178.642994 | -33.288924 |
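
To see what these coefficients mean, a prediction can be reproduced by hand as the intercept plus the coefficient-weighted sum of the inputs. The sketch below uses the fitted values reported above. The input row (`dl` = 0.55, `tax` = 8.0) is a hypothetical example, not a row from the dataset.

```python
import numpy as np

# Fitted values reported above (feature order: dl, tax).
coef = np.array([1178.64299358, -33.28892404])
intercept = 159.65329511

# Hypothetical input: 55% licensed drivers, petrol tax of 8.
row = np.array([0.55, 8.0])

# Linear model: prediction = intercept + sum(coef_i * x_i)
predicted = intercept + coef @ row
print(round(predicted, 2))
```

Read this way, each additional unit of tax lowers the predicted consumption by about 33 units, holding `dl` fixed.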


# R-Square 

# Question 7 - Evaluate the model 
Calculate the accuracy score (the R-squared value returned by `score`) for the above model.

In [38]:
trainScore = linreg.score(x_train, y_train)
testScore = linreg.score(x_test, y_test)

In [39]:
print(trainScore)
print(testScore)

0.5682664446078698
0.3528095481196928


In [40]:
from sklearn import metrics
prediction = linreg.predict(x_test)
# mean_squared_error expects (y_true, y_pred); take the square root to get RMSE.
mse = metrics.mean_squared_error(y_test, prediction)
testRmse = np.sqrt(mse)
print(testRmse)

112.91435649217532
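
Both metrics come from the same residuals. The sketch below computes R-squared and RMSE by hand with NumPy on made-up numbers (the arrays are hypothetical, chosen only to keep the arithmetic easy to follow).

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([500.0, 550.0, 600.0, 650.0])
y_pred = np.array([520.0, 540.0, 610.0, 630.0])

sse = np.sum((y_true - y_pred) ** 2)         # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2 = 1 - sse / sst                 # the quantity LinearRegression.score reports
rmse = np.sqrt(sse / len(y_true))  # square root of the mean squared error
print(r2, rmse)
```

R-squared is unitless, while RMSE is in the units of the target, so the two answer different questions about the same fit.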


# Question 8: Repeat the same multiple linear regression modelling after adding both the income and highway features
Find the R-squared value.


In [41]:
x = df[["dl", "tax","income","highway"]]
y=df[['consumption']]

In [42]:
x_train,x_test,y_train,y_test= tts(x,y,test_size=0.2,random_state=1)


In [43]:
linreg2 = LinearRegression()
linreg2.fit(x_train, y_train)
linreg2.coef_

array([[ 1.18709610e+03, -4.14520195e+01, -6.38602267e-02,
        -4.96488404e-03]])

In [44]:
res = linreg2.predict(x_test)
res

array([[461.17358254],
       [474.81818084],
       [503.3434415 ],
       [661.50714287],
       [670.47694477],
       [726.39925201],
       [642.22597823]])

# Question 9: Print the coefficients of the multiple linear regression model

In [45]:
linreg2.intercept_

array([515.69536505])

In [46]:
trainScore2 = linreg2.score(x_train, y_train)
testScore2 = linreg2.score(x_test, y_test)
print(trainScore2)
print(testScore2)

0.7642545393848171
0.35765761808403596


In [49]:
prediction2 = linreg2.predict(x_test)

In [48]:
# Same RMSE computation for the four-feature model, with (y_true, y_pred) order.
mse = metrics.mean_squared_error(y_test, prediction2)
testRmse = np.sqrt(mse)
print(testRmse)

112.49064371646422


# Question 10 
In one or two sentences, give your reasoning about R-squared on the basis of the above findings.

Answer

### The R-squared value on the training data increases as we add independent variables to the analysis

R-squared measures how much better the model fits than a baseline model that always predicts the mean. It can be inflated artificially: on the training data, R-squared never decreases when more independent variables are added. Here the training R-squared rose from 0.568 to 0.764 after adding `income` and `highway`, while the test R-squared barely moved (0.353 to 0.358).

Since R-squared = 1 - SSE/SST and SST is fixed by the data, R-squared is largest when SSE is smallest. Adding explanatory variables lets the regression fit the training points at least as closely, so SSE cannot increase. Hence R-squared can increase even when the added variable is not significant to the model.
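
This behaviour of training R-squared can be checked numerically. The sketch below is an illustration on synthetic data, not the petrol dataset: it fits ordinary least squares with NumPy, then adds three pure-noise features and compares the training R-squared. The helper `r_squared`, the sample size, and the random seed are all assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """Training R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

rng = np.random.default_rng(0)
n = 35
x1 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)  # y depends on x1 only
noise = rng.normal(size=(n, 3))    # three irrelevant features

r2_small = r_squared(x1.reshape(-1, 1), y)
r2_big = r_squared(np.column_stack([x1, noise]), y)
print(r2_small, r2_big)
```

The larger model contains the smaller one as a special case (set the extra coefficients to zero), so its SSE on the training data can only be equal or lower.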