# External Lab 

Each question is worth 1 mark.

# Multiple Linear Regression

## Problem Statement

Use multiple linear regression to **predict the consumption of petrol** from the following variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, mannwhitneyu, levene, shapiro, wilcoxon, f_oneway, chisquare,chi2_contingency, binom
from statsmodels.stats.power import ttest_power
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import seaborn as sb
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [9]:
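data = pd.read_csv('petrol.csv')
# A cell displays only its last expression, so the describe() result below is
# computed but not shown; wrap it in print() or make it the last line to see it.
data.describe()
data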

Unnamed: 0,tax,income,highway,dl,consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410
5,10.0,5342,1333,0.571,457
6,8.0,5319,11868,0.451,344
7,8.0,5126,2138,0.553,467
8,8.0,4447,8577,0.529,464
9,7.0,4512,8507,0.552,498
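
The `df.describe()` hint can be tried without the file. The sketch below is an illustration only: it rebuilds the first three displayed rows as inline CSV text, and it assumes that the header of `petrol.csv` separates names with a comma followed by a space (suggested by the later use of names such as `' consumption'`). `skipinitialspace=True` is a standard `pd.read_csv` option that strips that space.

```python
import io
import pandas as pd

# Hypothetical stand-in for petrol.csv: three rows copied from the table above.
# The comma-plus-space header is an assumption about the real file.
csv_text = """tax, income, highway, dl, consumption
9.0, 3571, 1976, 0.525, 541
9.0, 4092, 1250, 0.572, 524
9.0, 3865, 1586, 0.58, 561
"""

sample = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = sample.describe()
print(sample.columns.tolist())
print(summary)
```

With names cleaned this way, later cells could select `sample[["tax", "dl"]]` rather than `[["tax", " dl"]]`.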


## Question 2 - Cap outliers

Find the outliers using the IQR rule: use (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Only data points that fall within this range are kept. Data points outside this range are outliers, and the entire row containing one must be removed.

In [23]:
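# Per-column quartiles and interquartile range
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows in which no column falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
data_out = data[~((data < (Q1 - 1.5 * IQR))|(data > (Q3 + 1.5 * IQR))).any(axis=1)]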
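
The question title says "cap" while its text asks for row removal, so it may help to see both side by side. The following is a minimal sketch on made-up toy data (the frame `toy` and the helper `iqr_bounds` are hypothetical, not part of the notebook); it removes outlier rows as the cell above does, and also shows capping with `DataFrame.clip` as the alternative.

```python
import pandas as pd

def iqr_bounds(df, k=1.5):
    """Return per-column (lower, upper) bounds from the IQR rule."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data: column "a" contains one obvious outlier (100.0).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
lower, upper = iqr_bounds(toy)

# Option 1 (used in the notebook): drop every row with at least one outlier.
outside = (toy < lower) | (toy > upper)
removed = toy[~outside.any(axis=1)]

# Option 2 (capping): keep all rows, but pull outliers back to the bounds.
capped = toy.clip(lower=lower, upper=upper, axis=1)

print(len(removed), capped["a"].max())
```

Removal shrinks the dataset (here from 48 rows to 43), whereas capping keeps every row but alters the extreme values.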

## Question 3 - Independent variables and collinearity
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [28]:
data_out.corr()

Unnamed: 0,tax,income,highway,dl,consumption
tax,1.0,-0.109537,-0.390602,-0.314702,-0.446116
income,-0.109537,1.0,0.051169,0.150689,-0.347326
highway,-0.390602,0.051169,1.0,-0.016193,0.034309
dl,-0.314702,0.150689,-0.016193,1.0,0.611788
consumption,-0.446116,-0.347326,0.034309,0.611788,1.0


In [None]:
#tax (-0.45) and dl (0.61) have the largest absolute correlations with consumption, so they are the most strongly associated with it. income (-0.35) is weaker, and highway (0.03) shows almost no linear association.

From the correlation values above, the proportion of licensed drivers (dl) has the strongest association with consumption, and it is positive. Tax has the next strongest association, and it is negative.

Insights:
As tax increases, consumption tends to decrease.
As the proportion of licensed drivers increases, consumption tends to increase.
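
The ranking behind these insights can be reproduced by sorting the correlations by absolute value. This is a small sketch that hard-codes the `consumption` column of the correlation matrix above; in the notebook the same series would presumably come from `data_out.corr()[' consumption']`.

```python
import pandas as pd

# Correlations with consumption, copied from the matrix above.
corr_with_consumption = pd.Series(
    {"tax": -0.446116, "income": -0.347326, "highway": 0.034309, "dl": 0.611788}
)

# Rank by absolute value: the sign gives direction, the magnitude gives strength.
order = corr_with_consumption.abs().sort_values(ascending=False).index
ranked = corr_with_consumption.reindex(order)
print(ranked)
```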

## Question 4 - Transform the dataset
Divide the data into feature (X) and target (Y) sets.

In [51]:
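# Note: every column name except "tax" is used with a leading space in this notebook (" dl", " consumption")
X = data_out[["tax"," dl"]]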
#X = X.values.reshape(-1,1)

In [52]:
Y = data_out[' consumption']
#Y = Y.values.reshape(-1,1)
print(X.shape)
print(Y.shape)

(43, 2)
(43,)


## Question 5 - Split data into train, test sets
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [58]:
linreg = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state = 1)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(34, 2)
(34,)
(9, 2)
(9,)
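
The 34/9 split follows from the row count: 20% of 43 rows is 8.6, and scikit-learn rounds the test count up to 9, leaving 34 rows for training. The sketch below checks this with placeholder arrays (`X_demo` and `y_demo` are illustrative names, not the petrol data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and Y above (43 rows, 2 features).
X_demo = np.arange(86).reshape(43, 2)
y_demo = np.arange(43)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)
```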


## Question 6 - Build Model
Estimate the coefficients for each input feature. Construct and display a dataframe with the coefficients as values and X.columns as the column names.

In [59]:
linreg.fit(x_train,y_train)
pred = linreg.predict(x_test)

In [60]:
linreg.coef_

array([-30.70924255, 892.88620875])

In [61]:
linreg.intercept_

292.55096524614896
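
The question also asks for the coefficients as a dataframe, which the cells above do not build. Below is a minimal sketch that hard-codes the printed coefficient and intercept values so it runs on its own; in the notebook the equivalent would presumably be `pd.DataFrame([linreg.coef_], columns=X.columns)`. The hand-computed prediction uses row 0 of the dataset (tax 9.0, dl 0.525) purely as an illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the outputs above (stand-ins for linreg.coef_ and linreg.intercept_).
feature_names = ["tax", " dl"]
coef = np.array([-30.70924255, 892.88620875])
intercept = 292.55096524614896

# One row of coefficients, one column per feature.
coef_df = pd.DataFrame([coef], columns=feature_names)
print(coef_df)

# A linear model predicts intercept + sum(coefficient * feature value).
row0 = np.array([9.0, 0.525])  # tax and dl of the first displayed row
predicted = intercept + float(coef @ row0)
print(predicted)
```

The negative tax coefficient and positive dl coefficient have the same signs as the correlations found in Question 3.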

# R-Square 

## Question 7 - Evaluate the model
Calculate the accuracy score (R-squared) for the above model.

In [62]:
def AdjRsquare(modelToBeTested, indData, target):
    Rsquare = modelToBeTested.score(indData, target)
    NoData = len(target)
    p = indData.shape[1]
    tempRsquare = 1 - (1-Rsquare)*(NoData-1)/(NoData - p - 1)
    return tempRsquare

def linRegcheckModelPerformance(x, y):
    model = LinearRegression()
    # Split the data into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = 1)
    # Build the model with the training set
    model.fit(x_train, y_train)
    # Training metrics
    trainR2 = model.score(x_train, y_train)
    predictedTrain = model.predict(x_train)
    mse = metrics.mean_squared_error(y_train, predictedTrain)
    trainRmse = np.sqrt(mse)
    # RMSE expressed as a percentage of the mean target value
    trainRmsePct = trainRmse/np.mean(y_train)*100
    trainAdjR2 = AdjRsquare(model, x_train, y_train)
    trainAccuracies = [len(y_train), trainRmse, trainRmsePct, trainR2, trainAdjR2]
    # Test metrics
    testR2 = model.score(x_test, y_test)
    predictedTest = model.predict(x_test)
    mse = metrics.mean_squared_error(y_test, predictedTest)
    testRmse = np.sqrt(mse)
    testRmsePct = testRmse/np.mean(y_test)*100
    testAdjR2 = AdjRsquare(model, x_test, y_test)
    testAccuracies = [len(y_test), testRmse, testRmsePct, testR2, testAdjR2]
    # Collect the results in a dataframe, one column per data set
    resultsDf = pd.DataFrame(index = ["dataSize", "rmse", "rmsePct", "r2", "adjR2"])
    resultsDf['trainData'] = trainAccuracies
    resultsDf['testData'] = testAccuracies
    return round(resultsDf, 4)

linRegcheckModelPerformance(X, Y)

Unnamed: 0,trainData,testData
dataSize,34.0,9.0
rmse,62.6155,69.0235
rmsePct,11.2155,11.7166
r2,0.4658,0.2876
adjR2,0.4313,0.0501
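
The `adjR2` row can be checked by hand from the `r2` row, using the same formula as `AdjRsquare`: 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features. The sketch below plugs in the rounded values from the table, so agreement is expected only to about three decimal places; `adjusted_r2` is an illustrative helper name.

```python
def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R-squared: penalises R-squared for the number of features."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

# r2 and dataSize taken from the table above; the model has 2 features.
train_adj = adjusted_r2(0.4658, 34, 2)
test_adj = adjusted_r2(0.2876, 9, 2)
print(round(train_adj, 4), round(test_adj, 4))
```

With only 9 test rows the penalty factor (n - 1)/(n - p - 1) is 8/6, which is why the test adjusted R-squared falls so far below the test R-squared.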


## Question 8 - Repeat the multiple linear regression modelling after adding both the income and highway features
Find R2.


In [63]:
X = data_out[[" income", " highway","tax"," dl"]]
#X = X.values.reshape(-1,1)

In [64]:
Y = data_out[' consumption']
#Y = Y.values.reshape(-1,1)
print(X.shape)
print(Y.shape)

(43, 4)
(43,)


## Question 9 - Print the coefficients of the multiple linear regression model

In [68]:
linreg = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state = 1)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
linreg.fit(x_train,y_train)
pred = linreg.predict(x_test)

(34, 4)
(34,)
(9, 4)
(9,)


In [69]:
linreg.coef_

array([-6.26281401e-02, -3.02198704e-03, -3.94115836e+01,  9.50882744e+02])

In [70]:
linreg.intercept_

607.718908908563

## Question 10
In one or two sentences, explain the behaviour of R-squared on the basis of the above findings.

Answer

### *R-squared on the training data increases (or at least does not decrease) when more independent variables are added to the model.*

In [71]:
linRegcheckModelPerformance(X, Y)

Unnamed: 0,trainData,testData
dataSize,34.0,9.0
rmse,51.3471,45.3097
rmsePct,9.1971,7.6912
r2,0.6408,0.693
adjR2,0.5912,0.386


In [None]:
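#R2 increased on both the train set (0.4658 to 0.6408) and the test set (0.2876 to 0.693) after income and highway were added.
#Adjusted R2 also increased (train: 0.4313 to 0.5912, test: 0.0501 to 0.386), which suggests that the added variables improve the fit rather than merely inflating R2.
#R2 says nothing about the direction of each effect: the coefficients of income, highway and tax are negative, and that of dl is positive.
#The test set has only 9 rows, so the test figures are noisy.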
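
Because training R-squared cannot decrease when a feature is added to an ordinary least squares model, an increase alone does not prove the new feature is useful. The sketch below illustrates this on synthetic data (not the petrol dataset): a pure-noise column is appended to two informative features, and the training R-squared still does not drop. All names and the data-generating formula are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 43  # same row count as the cleaned dataset, chosen only for familiarity

# Two informative features plus noise in the target.
x_signal = rng.normal(size=(n, 2))
y = 3.0 * x_signal[:, 0] - 2.0 * x_signal[:, 1] + rng.normal(size=n)

# A third column that is unrelated to y.
x_noise = rng.normal(size=(n, 1))
x_more = np.hstack([x_signal, x_noise])

# Training R-squared before and after adding the noise column.
r2_base = LinearRegression().fit(x_signal, y).score(x_signal, y)
r2_more = LinearRegression().fit(x_more, y).score(x_more, y)
print(r2_base, r2_more)
```

This is why adjusted R-squared and the test-set R-squared are the more informative figures when comparing the two-feature and four-feature models.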