# External Lab 

Each question carries 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol** from the following variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [2]:
import pandas as pd
import numpy as np

In [3]:
petroldata = pd.read_csv ('petrol.csv')
##petroldata
print ('---------------sample data-------------')
print (petroldata.head())
print ('---------------data types-------------')
print (petroldata.dtypes)
print ('---------------stats data-------------')
print (petroldata.describe())

---------------sample data-------------
   tax   income   highway     dl   consumption
0  9.0     3571      1976  0.525           541
1  9.0     4092      1250  0.572           524
2  9.0     3865      1586  0.580           561
3  7.5     4870      2351  0.529           414
4  8.0     4399       431  0.544           410
---------------data types-------------
tax             float64
 income           int64
 highway          int64
 dl             float64
 consumption      int64
dtype: object
---------------stats data-------------
             tax       income       highway         dl   consumption
count  48.000000    48.000000     48.000000  48.000000     48.000000
mean    7.668333  4241.833333   5565.416667   0.570333    576.770833
std     0.950770   573.623768   3491.507166   0.055470    111.885816
min     5.000000  3063.000000    431.000000   0.451000    344.000000
25%     7.000000  3739.000000   3110.250000   0.529750    509.500000
50%     7.500000  4298.000000   4735.500000   0.5645
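
The `dtypes` output above lists ` income`, ` highway`, ` dl`, and ` consumption` with a leading space, which suggests that the CSV header has a space after each comma. If that is the case, a lookup such as `petroldata['income']` would raise a `KeyError`. The following is a minimal sketch of one way to normalise the names, assuming that header layout. The inline CSV text is a stand-in built from the first three sample rows, not the real file, and `sample` is a hypothetical name.

```python
import io
import pandas as pd

# Stand-in for petrol.csv: a space follows each comma (assumed layout).
csv_text = (
    "tax, income, highway, dl, consumption\n"
    "9.0, 3571, 1976, 0.525, 541\n"
    "9.0, 4092, 1250, 0.572, 524\n"
    "9.0, 3865, 1586, 0.580, 561\n"
)

sample = pd.read_csv(io.StringIO(csv_text))
print(list(sample.columns))   # names as read, before cleaning

# Strip surrounding whitespace from every column name.
sample.columns = sample.columns.str.strip()
print(list(sample.columns))   # names after cleaning
print(sample['income'].tolist())
```

Positional selection with `iloc`, as used later in this notebook, is unaffected by the spaces; the cleanup only matters for label-based access.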

# Question 2 - Cap outliers 

Find the outliers using (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Keep only the data points that fall within this range. Any data point outside this range is an outlier, and the entire row containing it must be removed.

In [4]:
Q1 = petroldata.quantile(0.25)
Q3 = petroldata.quantile(0.75)
IQR = Q3 - Q1
#print (IQR)
#print ((Q1 - 1.5 * IQR), (Q3 + 1.5 * IQR))
#print (petroldata)

print ('************Outlier flag******************')
petroldata_outlierflag = (petroldata < (Q1 - 1.5 * IQR)) |(petroldata > (Q3 + 1.5 * IQR))
print (petroldata_outlierflag.shape)

print (petroldata_outlierflag)

print ('*************************************************')
print ('************Data without Outlier*****************')

# Keep only the rows in which no column is flagged as an outlier
petroldata_out = petroldata[~petroldata_outlierflag.any(axis=1)]
print (petroldata_out.shape)
print (petroldata_out)

************Outlier flag******************
(48, 5)
      tax   income   highway     dl   consumption
0   False    False     False  False         False
1   False    False     False  False         False
2   False    False     False  False         False
3   False    False     False  False         False
4   False    False     False  False         False
5    True    False     False  False         False
6   False    False     False  False         False
7   False    False     False  False         False
8   False    False     False  False         False
9   False    False     False  False         False
10  False    False     False  False         False
11  False    False      True  False         False
12  False    False     False  False         False
13  False    False     False  False         False
14  False    False     False  False         False
15  False    False     False  False         False
16  False    False     False  False         False
17  False    False     False  False         False
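
The IQR rule used above can be wrapped in small helper functions. The sketch below is illustrative only: the function names and the toy data are hypothetical, and `cap_outliers` shows clipping as an alternative to row removal, which this lab does not use.

```python
import pandas as pd

def iqr_bounds(df, k=1.5):
    """Return (lower, upper) Series holding the per-column IQR bounds."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outlier_rows(df, k=1.5):
    """Remove every row that has at least one value outside the bounds."""
    lower, upper = iqr_bounds(df, k)
    is_outlier = (df < lower) | (df > upper)
    return df[~is_outlier.any(axis=1)]

def cap_outliers(df, k=1.5):
    """Clip values to the bounds instead of removing rows."""
    lower, upper = iqr_bounds(df, k)
    return df.clip(lower=lower, upper=upper, axis=1)

# Toy data: 100.0 in column 'a' lies far outside the bounds.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 100.0],
                    'b': [10.0, 11.0, 12.0, 13.0, 14.0]})
kept = drop_outlier_rows(toy)
capped = cap_outliers(toy)
print(kept)
print(capped)
```

For column `a`, Q1 is 2 and Q3 is 4, so the bounds are -1 and 7. Row removal drops the last row, while clipping keeps all five rows and replaces 100 with 7.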

# Question 3 - Independent variables and collinearity 
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [5]:
petroldata_out.corr(method='pearson')

| | tax | income | highway | dl | consumption |
|---|---|---|---|---|---|
| tax | 1.0 | -0.109537 | -0.390602 | -0.314702 | -0.446116 |
| income | -0.109537 | 1.0 | 0.051169 | 0.150689 | -0.347326 |
| highway | -0.390602 | 0.051169 | 1.0 | -0.016193 | 0.034309 |
| dl | -0.314702 | 0.150689 | -0.016193 | 1.0 | 0.611788 |
| consumption | -0.446116 | -0.347326 | 0.034309 | 0.611788 | 1.0 |


In order of correlation strength with consumption (from the matrix):

1. dl (the proportion of drivers) has the strongest correlation, and it is positive (0.61).
2. tax has a moderate negative correlation (-0.45).
3. income has a weaker negative correlation (-0.35).
4. highway has almost no correlation (0.03).

### From the correlation values above, the proportion of drivers (dl) has the strongest association with consumption, and tax has the next strongest association, in the negative direction.
Insights:
As tax increases, consumption decreases.
As the proportion of drivers increases, consumption increases.
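
The ranking can also be obtained programmatically by sorting on absolute correlation. This is a small sketch that copies the consumption row of the matrix into a Series by hand; the variable names are hypothetical.

```python
import pandas as pd

# Correlations with consumption, copied from the last row of the matrix.
corr_with_target = pd.Series(
    {'tax': -0.446116, 'income': -0.347326, 'highway': 0.034309, 'dl': 0.611788}
)

# Rank by strength (absolute value) while keeping the sign visible.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.loc[order]
print(ranked)
```

Sorting on the absolute value matters here: sorting the signed values would place tax last, even though it is the second strongest predictor.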

# Question 4 - Transform the dataset 
Divide the data into feature(X) and target(Y) sets.

In [6]:
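# Features: tax (column 0) and dl (column 3), the two attributes most strongly correlated with consumption
X = petroldata_out.iloc[:,[0,3]]
# Target: consumption (column 4), kept as a one-column DataFrame
Y = petroldata_out.iloc[:,4:5]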
print (X)
print (Y)


     tax     dl
0   9.00  0.525
1   9.00  0.572
2   9.00  0.580
3   7.50  0.529
4   8.00  0.544
6   8.00  0.451
7   8.00  0.553
8   8.00  0.529
9   7.00  0.552
10  8.00  0.530
12  7.00  0.574
13  7.00  0.545
14  7.00  0.608
15  7.00  0.586
16  7.00  0.572
17  7.00  0.540
19  8.50  0.677
20  7.00  0.663
21  8.00  0.602
22  9.00  0.511
23  9.00  0.517
24  8.50  0.551
25  9.00  0.544
26  8.00  0.548
27  7.50  0.579
28  8.00  0.563
29  9.00  0.493
30  7.00  0.518
31  7.00  0.513
32  8.00  0.578
33  7.50  0.547
34  8.00  0.487
35  6.58  0.629
37  7.00  0.586
38  8.50  0.663
40  7.00  0.626
41  7.00  0.563
42  7.00  0.603
43  7.00  0.508
44  6.00  0.672
45  9.00  0.571
46  7.00  0.623
47  7.00  0.593
     consumption
0            541
1            524
2            561
3            414
4            410
6            344
7            467
8            464
9            498
10           580
12           525
13           508
14           566
15           635
16           603
17           714
19     

# Question 5 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [7]:
from  sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
#print (X_train)
#print (Y_train)
#print (X_test, Y_test)
print ('X_train shape - ', X_train.shape)
print ('Y_train shape - ', Y_train.shape)
print ('X_test shape - ', X_test.shape)
print ('Y_test shape - ', Y_test.shape)



X_train shape -  (34, 2)
Y_train shape -  (34, 1)
X_test shape -  (9, 2)
Y_test shape -  (9, 1)
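
The call above does not set `random_state`, so each run selects different rows and the scores reported later will vary between runs. A minimal sketch of a repeatable split follows, using synthetic stand-in data with 43 rows (the number left after outlier removal). The seed value 42 and the `_demo` names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 43 rows; the values are random, not the petrol data.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(43, 2)), columns=['tax', 'dl'])
Y_demo = pd.DataFrame(rng.normal(size=(43, 1)), columns=['consumption'])

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, Y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, Y_demo, test_size=0.20, random_state=42)

X_tr, X_te, Y_tr, Y_te = split_a
print(X_tr.shape, X_te.shape)
print(X_te.index.equals(split_b[1].index))
```

With 43 rows and `test_size=0.20`, scikit-learn rounds the test count up to 9, which leaves 34 rows for training and matches the shapes printed above.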


# Question 6 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [8]:
from sklearn.linear_model import LinearRegression

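# Fit ordinary least squares on the training data
lrmodel = LinearRegression().fit(X_train, Y_train)
# coef_ has shape (1, n_features) because Y_train is a one-column DataFrame
m = lrmodel.coef_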
print (m)
print (m.dtype)

#Construct and display a dataframe with coefficients and X.columns as columns
coeff_df = pd.DataFrame(m,columns=['tax','dl'])
print (coeff_df)

[[ -26.14352769 1016.69740397]]
float64
         tax           dl
0 -26.143528  1016.697404
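
The question asks for `X.columns` as the column labels. Taking the labels from the feature frame, rather than typing them by hand, keeps them aligned with the coefficients. The sketch below shows the pattern on synthetic, noise-free data whose true coefficients are known; the data and the names `X_demo`, `Y_demo`, and `coeff_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3*x1 - 2*x2 + 5 (illustrative values).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(30, 2)), columns=['x1', 'x2'])
Y_demo = pd.DataFrame({'y': 3 * X_demo['x1'] - 2 * X_demo['x2'] + 5})

model = LinearRegression().fit(X_demo, Y_demo)

# With a one-column DataFrame target, coef_ has shape (1, n_features),
# so it maps directly onto one row labelled with the feature names.
coeff_demo = pd.DataFrame(model.coef_, columns=X_demo.columns)
coeff_demo['intercept'] = model.intercept_
print(coeff_demo.round(3))
```

Each coefficient is the expected change in the target for a one-unit change in that feature with the other features held fixed, so coefficients of features on different scales (such as tax and dl) are not directly comparable in magnitude.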


# R-Square 

# Question 7 - Evaluate the model 
Calculate the accuracy score for the above model.

In [9]:

# accuracy_score applies to classification only; for regression,
# LinearRegression.score returns the R-Square on the given data.
#from sklearn.metrics import r2_score
#Y_test_pred = lrmodel.predict(X_test)
#r2score = r2_score(Y_test, Y_test_pred)
#print ('r2_score -', r2score)

r2score = lrmodel.score (X_test, Y_test)
print ('r2_score -', r2score)

r2_score - 0.16505636516951616


# Question 8: Repeat the same multiple linear regression modelling after adding both the income and highway features
Find R2 


In [10]:
#Create X_rev including two more features (income and highway)
X_rev = petroldata_out.iloc[:,0:4]
print (X_rev.shape)
print (X_rev.head())

#Split the data

X_rev_train, X_rev_test, Y_train, Y_test = train_test_split(X_rev, Y, test_size=0.20)
print (X_rev_train)
print (Y_train)
#print (X_test, Y_test)
print ('X_train shape - ', X_rev_train.shape)
print ('Y_train shape - ', Y_train.shape)
print ('X_test shape - ', X_rev_test.shape)
print ('Y_test shape - ', Y_test.shape)


(43, 4)
   tax   income   highway     dl
0  9.0     3571      1976  0.525
1  9.0     4092      1250  0.572
2  9.0     3865      1586  0.580
3  7.5     4870      2351  0.529
4  8.0     4399       431  0.544
    tax   income   highway     dl
29  9.0     3601      4650  0.493
13  7.0     4207      6580  0.545
37  7.0     3897      6385  0.586
17  7.0     3718      4725  0.540
34  8.0     3528      3495  0.487
47  7.0     5002      9794  0.593
45  9.0     4476      3942  0.571
44  6.0     5215      2302  0.672
4   8.0     4399       431  0.544
30  7.0     3640      6905  0.518
21  8.0     4983       602  0.602
9   7.0     4512      8507  0.552
38  8.5     3635      3274  0.663
23  9.0     4258      4686  0.517
6   8.0     5319     11868  0.451
33  7.5     3357      4121  0.547
19  8.5     4341      6010  0.677
31  7.0     3333      6594  0.513
40  7.0     4449      4639  0.626
46  7.0     4296      4083  0.623
22  9.0     4897      2449  0.511
0   9.0     3571      1976  0.525
3   7.5     

In [11]:
lrmodel_rev = LinearRegression().fit(X_rev_train, Y_train)


# Question 9: Print the coefficients of the multilinear regression model

In [12]:
m_rev = lrmodel_rev.coef_
m_rev
#Construct and display a dataframe with coefficients and X.columns as columns
coeff_rev_df = pd.DataFrame(m_rev,columns=['tax','income','highway','dl'])
print ('****************coefficients****************')
print (coeff_rev_df)

****************coefficients****************
         tax    income   highway           dl
0 -31.626901 -0.069485 -0.001103  1080.259075


In [13]:
r2score_rev = lrmodel_rev.score (X_rev_test, Y_test)

print ('Previous r2_score -', r2score)
print ('Revised r2_score -', r2score_rev)


Previous r2_score - 0.16505636516951616
Revised r2_score - 0.5496492954881815


# Question 10 
In one or two sentences, give your reasoning on R-Square on the basis of the above findings.
Answer

### R-Square increased when more independent variables were added to the analysis

Adding an independent variable (even one with low correlation to the target) never lowers the R-Square on the training data, because the residual sum of squares cannot increase. Here the R-Square on the test set rose from 0.165 to 0.550 after adding income and highway, so the proportion of variability explained improved. The two scores were computed on different random train/test splits, so part of the difference may be due to the split rather than the added features.
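
The claim that an extra independent variable does not reduce R-Square holds for the data the model is fitted on. The sketch below illustrates this with a feature that is pure noise, and also computes adjusted R-Square, a standard correction that penalises each added feature. Adjusted R-Square is not used elsewhere in this lab; the helper `adjusted_r2` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-Square: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

rng = np.random.default_rng(0)
n = 40
x_useful = rng.normal(size=(n, 1))
y = 2.0 * x_useful[:, 0] + rng.normal(size=n)   # y depends on x_useful only
x_noise = rng.normal(size=(n, 1))               # unrelated to y by construction

X_one = x_useful
X_two = np.hstack([x_useful, x_noise])

# Score each model on the same data it was fitted on (training R-Square).
r2_one = LinearRegression().fit(X_one, y).score(X_one, y)
r2_two = LinearRegression().fit(X_two, y).score(X_two, y)

print('training R2, 1 feature :', r2_one)
print('training R2, 2 features:', r2_two)
print('adjusted R2, 1 feature :', adjusted_r2(r2_one, n, 1))
print('adjusted R2, 2 features:', adjusted_r2(r2_two, n, 2))
```

The training R-Square with two features is at least as large as with one, even though the second feature carries no information. Adjusted R-Square and R-Square on held-out data are not bound by that guarantee, which is why they are the more useful measures when deciding whether a feature is worth adding.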