# External Lab 

Each question is worth 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol** from the following variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [17]:
df=pd.read_csv('petrol.csv')
df.columns=df.columns.to_series().str.strip()
df

Unnamed: 0,tax,income,highway,dl,consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410
5,10.0,5342,1333,0.571,457
6,8.0,5319,11868,0.451,344
7,8.0,5126,2138,0.553,467
8,8.0,4447,8577,0.529,464
9,7.0,4512,8507,0.552,498


In [18]:
df.head()

Unnamed: 0,tax,income,highway,dl,consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410


In [19]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tax,48.0,7.668333,0.95077,5.0,7.0,7.5,8.125,10.0
income,48.0,4241.833333,573.623768,3063.0,3739.0,4298.0,4578.75,5342.0
highway,48.0,5565.416667,3491.507166,431.0,3110.25,4735.5,7156.0,17782.0
dl,48.0,0.570333,0.05547,0.451,0.52975,0.5645,0.59525,0.724
consumption,48.0,576.770833,111.885816,344.0,509.5,568.5,632.75,968.0


In [20]:
df.dtypes

tax            float64
income           int64
highway          int64
dl             float64
consumption      int64
dtype: object

In [21]:
df.median()

tax               7.5000
income         4298.0000
highway        4735.5000
dl                0.5645
consumption     568.5000
dtype: float64

# Question 2 - Cap outliers 

Find the outliers and cap them. Use (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Only data points that fall within this range are kept. Data points that fall outside this range are outliers, and the entire row containing one must be removed.

In [22]:
from scipy import stats

In [23]:
# IQR = Q3 - Q1 (IQR score)

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
IQR,Q1,Q3
#IQR scores for all the columns.

(tax               1.1250
 income          839.7500
 highway        4045.7500
 dl                0.0655
 consumption     123.2500
 dtype: float64, tax               7.00000
 income         3739.00000
 highway        3110.25000
 dl                0.52975
 consumption     509.50000
 Name: 0.25, dtype: float64, tax               8.12500
 income         4578.75000
 highway        7156.00000
 dl                0.59525
 consumption     632.75000
 Name: 0.75, dtype: float64)

In [24]:
((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).sum()

tax            2
income         0
highway        2
dl             1
consumption    2
dtype: int64

In [25]:
print((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR)))

      tax  income  highway     dl  consumption
0   False   False    False  False        False
1   False   False    False  False        False
2   False   False    False  False        False
3   False   False    False  False        False
4   False   False    False  False        False
5    True   False    False  False        False
6   False   False    False  False        False
7   False   False    False  False        False
8   False   False    False  False        False
9   False   False    False  False        False
10  False   False    False  False        False
11  False   False     True  False        False
12  False   False    False  False        False
13  False   False    False  False        False
14  False   False    False  False        False
15  False   False    False  False        False
16  False   False    False  False        False
17  False   False    False  False        False
18  False   False    False   True         True
19  False   False    False  False        False
20  False   F

In [26]:
df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis = 1)]

df_out

Unnamed: 0,tax,income,highway,dl,consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410
6,8.0,5319,11868,0.451,344
7,8.0,5126,2138,0.553,467
8,8.0,4447,8577,0.529,464
9,7.0,4512,8507,0.552,498
10,8.0,4391,5939,0.53,580


In [27]:
df.shape

(48, 5)

In [28]:
Min_cap=Q1-1.5*IQR
Max_cap=Q3+1.5*IQR
Max_cap,Min_cap


(tax                9.8125
 income          5838.3750
 highway        13224.6250
 dl                 0.6935
 consumption      817.6250
 dtype: float64, tax               5.3125
 income         2479.3750
 highway       -2958.3750
 dl                0.4315
 consumption     324.6250
 dtype: float64)
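The cells above drop every row that contains an outlier. If capping is wanted instead, values outside the fences can be clipped to the fences. The following is a minimal sketch on a small made-up frame; the `toy` data and the `cap_outliers` helper are illustrative assumptions, not part of the lab's dataset or solution.

```python
import pandas as pd

def cap_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # axis=1 aligns the per-column bounds with the frame's columns.
    return frame.clip(lower=lower, upper=upper, axis=1)

# Made-up values with one obvious outlier in each column (last row).
toy = pd.DataFrame({
    "tax": [7.0, 7.5, 8.0, 8.0, 9.0, 14.0],
    "consumption": [500, 520, 540, 560, 580, 1500],
})

capped = cap_outliers(toy)
print(capped)
```

Unlike row removal, capping keeps all rows, which can matter for a dataset as small as this one.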

In [29]:
df.corr(method='pearson')

Unnamed: 0,tax,income,highway,dl,consumption
tax,1.0,0.012665,-0.52213,-0.288037,-0.45128
income,0.012665,1.0,0.050163,0.15707,-0.244862
highway,-0.52213,0.050163,1.0,-0.064129,0.019042
dl,-0.288037,0.15707,-0.064129,1.0,0.698965
consumption,-0.45128,-0.244862,0.019042,0.698965,1.0


### Observations

The correlation values above show that consumption is most strongly associated with the proportion of licensed drivers, `dl` (r ≈ 0.70). Tax has a weaker, negative association with consumption (r ≈ -0.45).

Insights:

As tax increases, consumption tends to decrease.

As the proportion of licensed drivers increases, consumption tends to increase.
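To make the correlation values concrete, here is a minimal sketch that computes Pearson's r by hand on the ten rows displayed earlier and can be compared against `np.corrcoef`. The `pearson_r` helper is a hypothetical name, and because only ten of the 48 rows are used, the values will differ from the full-dataset matrix.

```python
import numpy as np

# The ten rows shown in the table output above (tax, dl, consumption).
tax = np.array([9.0, 9.0, 9.0, 7.5, 8.0, 10.0, 8.0, 8.0, 8.0, 7.0])
dl = np.array([0.525, 0.572, 0.580, 0.529, 0.544,
               0.571, 0.451, 0.553, 0.529, 0.552])
consumption = np.array([541, 524, 561, 414, 410,
                        457, 344, 467, 464, 498], dtype=float)

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r_tax = pearson_r(tax, consumption)
r_dl = pearson_r(dl, consumption)
print("tax vs consumption:", r_tax)
print("dl vs consumption:", r_dl)
```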

# Question 4 - Transform the dataset 
Divide the data into feature(X) and target(Y) sets.

In [30]:
df.shape

(48, 5)

In [32]:
# Columns used as features; 'consumption' is the target.
feature_cols = ['tax','income','highway','dl']

Feature = df[feature_cols]

Target = df['consumption']

# Question 5 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [33]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(Feature,Target, test_size=0.20, random_state=10)

In [34]:
X_train.shape

(38, 4)

In [35]:
y_train.shape

(38,)

In [36]:
X_test.shape

(10, 4)

In [37]:
y_test.shape

(10,)
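The 38/10 split follows from the row count: with 48 rows and `test_size=0.20`, scikit-learn rounds the test count up (0.20 × 48 = 9.6, so 10 rows), leaving 38 for training. A minimal sketch on placeholder arrays of the same shape (`X_demo` and `y_demo` are assumptions standing in for the real features and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the lab's feature/target sets.
X_demo = np.arange(48 * 4).reshape(48, 4)
y_demo = np.arange(48)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=10
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```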

# Question 6 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [38]:
from sklearn.linear_model import LinearRegression
Model_1=LinearRegression()

Model_1.fit(X_train[['tax','dl']],y_train)
Model_1.score(X_train[['tax','dl']],y_train)


0.554843267830076

In [39]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

pd.DataFrame([regression_model.coef_],columns=Feature.columns)


Unnamed: 0,tax,income,highway,dl
0,-46.529392,-0.064484,-0.004777,1434.506838
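Each coefficient is the change in the predicted value for a one-unit change in that feature, holding the others fixed. A prediction is the intercept plus the dot product of the feature row with the coefficients. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and the true weights are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 30 rows, 4 features, known linear relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
true_weights = np.array([-2.0, 0.5, 0.0, 4.0])
y_demo = 3.0 + X_demo @ true_weights + rng.normal(scale=0.1, size=30)

model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand: intercept + X . coef
manual = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual, model.predict(X_demo)))
```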


# R-Square 

# Question 7 - Evaluate the model 
Calculate the accuracy score (R², the coefficient of determination) for the above model on the training data.

In [40]:
regression_model.score(X_train, y_train)


0.6752215753272599
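For a regressor, `score` returns R², defined as 1 - SS_res / SS_tot. The following sketch, on synthetic data (all names here are illustrative assumptions), computes R² by hand so it can be compared with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features and a noisy linear target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=40)

model = LinearRegression().fit(X_demo, y_demo)

residuals = y_demo - model.predict(X_demo)
ss_res = float((residuals ** 2).sum())                 # unexplained variation
ss_tot = float(((y_demo - y_demo.mean()) ** 2).sum())  # total variation
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```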

# Question 8: Repeat the same multiple linear regression modelling by adding both the income and highway features
Find R².


In [41]:
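from sklearn.linear_model import LinearRegression
Model_2=LinearRegression()

# Model_2 uses only 'income' and 'highway'. The model with all four features
# (tax, income, highway, dl) is regression_model, fitted above.
Model_2.fit(X_train[['income','highway']],y_train)
Model_2.score(X_train[['income','highway']],y_train)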


0.013009143722660599

In [46]:
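# Caution: this refits Model_2 on the test rows, so the score below is an
# in-sample R² for those 10 rows, not an out-of-sample evaluation.
Model_2.fit(X_test[['income','highway']],y_test)
Model_2.score(X_test[['income','highway']],y_test)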


0.559736846591258

In [47]:
from sklearn.metrics import r2_score
r2_score(y_test,regression_model.predict(X_test))

0.5270817614402205

# Question 9: Print the coefficients of the multilinear regression model

In [48]:
for i in zip(Feature.columns,regression_model.coef_):
    print("coefficients of ",i)

coefficients of  ('tax', -46.529392172676104)
coefficients of  ('income', -0.06448381178322016)
coefficients of  ('highway', -0.004776732806829852)
coefficients of  ('dl', 1434.5068383107628)


# Question 10 
In one or two sentences give reasoning on R-Square on the basis of above findings
Answer

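*R-squared on the training data increases (or at least never decreases) when more independent variables are added to the model. Here it rose from 0.555 with `tax` and `dl` to 0.675 with all four features, while the four-feature model's R-squared on the test set was lower, at 0.527.*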

In [None]:
# R-squared is the coefficient of determination.
# It measures how close the data are to the fitted regression line.
# It is the percentage of the variation in the response variable that the model explains.
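Because training R-squared cannot fall when features are added, a common companion measure is adjusted R-squared, which penalises each extra feature. This goes slightly beyond what the lab asks, so treat it as an optional sketch. It applies the standard formula to the two training scores reported above (38 training rows); the `adjusted_r2` helper is a hypothetical name.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

n_train = 38                   # rows in X_train
r2_two = 0.554843267830076     # training R^2 with tax and dl
r2_four = 0.6752215753272599   # training R^2 with all four features

adj_two = adjusted_r2(r2_two, n_train, 2)
adj_four = adjusted_r2(r2_four, n_train, 4)

print("two features: ", r2_two, adj_two)
print("four features:", r2_four, adj_four)
```

If the adjusted value still rises after adding features, the extra features are contributing more than would be expected from the added flexibility alone.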