# External Lab 

Each question carries 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol** from the relevant variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [5]:
import numpy as np
import pandas as pd
df_petrol = pd.read_csv('petrol.csv')
# df_petrol.head()
df_petrol.describe()


Unnamed: 0,tax,income,highway,dl,consumption
count,48.0,48.0,48.0,48.0,48.0
mean,7.668333,4241.833333,5565.416667,0.570333,576.770833
std,0.95077,573.623768,3491.507166,0.05547,111.885816
min,5.0,3063.0,431.0,0.451,344.0
25%,7.0,3739.0,3110.25,0.52975,509.5
50%,7.5,4298.0,4735.5,0.5645,568.5
75%,8.125,4578.75,7156.0,0.59525,632.75
max,10.0,5342.0,17782.0,0.724,968.0


# Question 2 - Cap outliers 

Find the outliers and handle them, using (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Only the data points that fall within this range are kept. Data points that fall outside this range are outliers, and the entire row needs to be removed.

In [10]:
for col in df_petrol.columns:
    # Row count before filtering on this column
    print("!$$!",len(df_petrol[col]))

    # Quartiles are recomputed on the rows that survived the earlier columns' filters
    q1, q3 = np.percentile(df_petrol[col], [25, 75])
    print(q1,q3)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Keep only the rows strictly inside (lower_bound, upper_bound)
    df_petrol = df_petrol.loc[(df_petrol[col] > lower_bound) & (df_petrol[col] < upper_bound)]
    # Row count after filtering on this column
    print("!!",len(df_petrol[col]))
    


!$$! 48
7.0 8.125
!! 46
!$$! 46
3727.0 4558.5
!! 46
!$$! 46
3329.25 6923.75
!! 45
!$$! 45
0.53 0.602
!! 44
!$$! 44
520.5 631.25
!! 42
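The loop above filters one column at a time, so each column's quartiles are computed on rows that already passed the earlier filters. A minimal alternative sketch, assuming the bounds should instead be computed once from the unfiltered data, is shown here. The function name `drop_iqr_outliers` and the toy frame are illustrative assumptions, not part of the lab, and the row counts it yields on `petrol.csv` may differ from the output above.

```python
import pandas as pd


def drop_iqr_outliers(df, k=1.5):
    """Drop every row that has at least one value outside the IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row is kept only if all of its columns lie strictly inside the fences
    inside = ((df > lower) & (df < upper)).all(axis=1)
    return df[inside]


# Toy data (hypothetical): column "a" contains one obvious outlier
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
filtered = drop_iqr_outliers(demo)
print(len(demo), "->", len(filtered))
```

Computing all fences before removing anything makes the result independent of column order, which the sequential loop does not guarantee.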


# Question 3 - Independent variables and collinearity 
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [11]:
df_petrol.corr()

Unnamed: 0,tax,income,highway,dl,consumption
tax,1.0,-0.133841,-0.443926,-0.316342,-0.463247
income,-0.133841,1.0,-0.076862,0.296002,-0.254464
highway,-0.443926,-0.076862,1.0,0.133983,0.215182
dl,-0.316342,0.296002,0.133983,1.0,0.549161
consumption,-0.463247,-0.254464,0.215182,0.549161,1.0


In [None]:
# tax and proportion of drivers seem to have a stronger correlation with consumption

### Observing the above correlation values between all the variables, the proportion of drivers (dl) has the strongest association with consumption (0.55). Tax comes next, with a negative association (-0.46).
Insights:
As tax increases, consumption tends to decrease.
As the proportion of drivers increases, consumption tends to increase.
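One way to make this comparison explicit is to rank the features by the absolute value of their correlation with consumption. The sketch below copies the `consumption` row of the correlation matrix above into a Series by hand; the variable names are illustrative assumptions.

```python
import pandas as pd

# Correlations with consumption, copied from the matrix printed above
corr_with_consumption = pd.Series({
    "tax": -0.463247,
    "income": -0.254464,
    "highway": 0.215182,
    "dl": 0.549161,
})

# Order by strength of association, ignoring the sign
order = corr_with_consumption.abs().sort_values(ascending=False).index
ranked = corr_with_consumption.reindex(order)
print(ranked)
```

With the full frame available, the same Series could be obtained from the `consumption` column of `df_petrol.corr()` after dropping the target's own entry.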

# Question 4 - Transform the dataset 
Divide the data into feature(X) and target(Y) sets.

In [21]:
x = df_petrol.loc[:,['tax',' dl']]
print(x)


     tax     dl
0   9.00  0.525
1   9.00  0.572
2   9.00  0.580
3   7.50  0.529
4   8.00  0.544
7   8.00  0.553
8   8.00  0.529
9   7.00  0.552
10  8.00  0.530
12  7.00  0.574
13  7.00  0.545
14  7.00  0.608
15  7.00  0.586
16  7.00  0.572
17  7.00  0.540
19  8.50  0.677
20  7.00  0.663
21  8.00  0.602
22  9.00  0.511
23  9.00  0.517
24  8.50  0.551
25  9.00  0.544
26  8.00  0.548
27  7.50  0.579
28  8.00  0.563
29  9.00  0.493
30  7.00  0.518
31  7.00  0.513
32  8.00  0.578
33  7.50  0.547
34  8.00  0.487
35  6.58  0.629
37  7.00  0.586
38  8.50  0.663
40  7.00  0.626
41  7.00  0.563
42  7.00  0.603
43  7.00  0.508
44  6.00  0.672
45  9.00  0.571
46  7.00  0.623
47  7.00  0.593


In [23]:
y = df_petrol.loc[:,' consumption']
print(y)

0     541
1     524
2     561
3     414
4     410
7     467
8     464
9     498
10    580
12    525
13    508
14    566
15    635
16    603
17    714
19    640
20    649
21    540
22    464
23    547
24    460
25    566
26    577
27    631
28    574
29    534
30    571
31    554
32    577
33    628
34    487
35    644
37    704
38    648
40    587
41    699
42    632
43    591
44    782
45    510
46    610
47    524
Name:  consumption, dtype: int64
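The labels `' dl'` and `' consumption'` used above carry a leading space, which suggests the header row of the CSV has spaces after its commas. A small sketch of one way to normalize such labels follows. The inline CSV text is a hypothetical stand-in for `petrol.csv`, reusing the first two rows of tax, dl and consumption shown above.

```python
import io

import pandas as pd

# Hypothetical CSV text whose header mimics the leading spaces seen above
raw = "tax, dl, consumption\n9.0, 0.525, 541\n9.0, 0.572, 524\n"

demo = pd.read_csv(io.StringIO(raw))
original_labels = list(demo.columns)  # labels as read, spaces included

# Strip surrounding whitespace from every column label
demo.columns = demo.columns.str.strip()
print(original_labels, "->", list(demo.columns))

# Columns can now be selected without the leading space
x_demo = demo[["tax", "dl"]]
y_demo = demo["consumption"]
```

If the labels are cleaned this way, every later selection must use the stripped names as well.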


# Question 5 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [31]:
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2) 
print(x_test.shape)
print(x_train.shape)
print("------")
print(y_test.shape)
print(y_train.shape)


(9, 2)
(33, 2)
------
(9,)
(33,)
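Because the split above is random, the rows that land in each set, and therefore the fitted coefficients and scores, change from run to run. A hedged sketch of a repeatable split follows; the seed value 42 and the placeholder arrays are arbitrary assumptions and will not reproduce the numbers in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the lab's feature set: 42 rows, 2 columns
features = np.arange(84).reshape(42, 2)
target = np.arange(42)

# Fixing random_state makes the shuffle repeatable across runs
split_a = train_test_split(features, target, test_size=0.2, random_state=42)
split_b = train_test_split(features, target, test_size=0.2, random_state=42)

x_train_demo, x_test_demo, y_train_demo, y_test_demo = split_a
print(x_train_demo.shape, x_test_demo.shape)
print(np.array_equal(split_a[1], split_b[1]))  # same test rows both times
```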


# Question 6 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [35]:
from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(x_train, y_train)
coeff_petrol = pd.DataFrame(regressor.coef_, x.columns, columns=['Coefficient'])  
print(coeff_petrol)

     Coefficient
tax   -40.304195
 dl   930.158680


# R-Square 

# Question 7 - Evaluate the model 
Calculate the accuracy score (R-squared) for the above model.

In [37]:
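# score() returns R-squared; here it is measured on the training set,
# so it describes the fit rather than performance on unseen data
r_square = regressor.score(x_train,y_train)
print(r_square)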

0.440643829
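For a regressor, the score is the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. The sketch below computes it by hand on a few made-up numbers; the helper name `r_squared` and the values are illustrative assumptions.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration
r2_demo = r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(r2_demo, 4))

# A model that always predicts the mean scores exactly 0
r2_mean = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```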


# Question 8: Repeat the same multiple linear regression modelling after adding both the income and highway features
Find the R-squared value.


In [42]:
x = df_petrol.loc[:,['tax',' income',' highway' ,' dl']]
# print(x)

y = df_petrol.loc[: , ' consumption']
# print(y)

# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2) 
# print(x_test.shape)
# print(x_train.shape)
# print("------")
# print(y_test.shape)
# print(y_train.shape)

from sklearn.linear_model import LinearRegression  
regressor = LinearRegression()  
regressor.fit(x_train, y_train)

r_square = regressor.score(x_train,y_train)
print(r_square)




0.595412253802


# Question 9: Print the coefficients of the multilinear regression model

In [43]:
coeff_petrol = pd.DataFrame(regressor.coef_, x.columns, columns=['Coefficient'])  
print(coeff_petrol)

          Coefficient
tax        -42.635165
 income     -0.066864
 highway    -0.002476
 dl        993.125288


# Question 10 
In one or two sentences, explain the change in R-squared on the basis of the above findings.
Answer

### R-squared rose from 0.44 to 0.60 when more independent variables were added to the analysis

In [44]:
print("R2 measures the share of the variance in the dependent variable that the model explains. Adding an independent variable can never lower R2 on the same training data, even if the variable has little real effect. To overcome this we use adjusted R2, which takes the number of independent variables into account.")

R2 measures the share of the variance in the dependent variable that the model explains. Adding an independent variable can never lower R2 on the same training data, even if the variable has little real effect. To overcome this we use adjusted R2, which takes the number of independent variables into account.
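Adjusted R² can be computed from R², the number of observations n and the number of predictors p as 1 - (1 - R²)(n - 1) / (n - p - 1). The sketch below applies it to the two training scores reported above, taking n = 33 from the training set shape. The helper name `adjusted_r2` is an assumption, and because the two models were fitted on different random splits, the comparison is only indicative.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


n_train = 33  # rows in the training set, from the shapes printed earlier

# Training R-squared values reported above for the 2- and 4-feature models
adj_two = adjusted_r2(0.440643829, n_train, 2)
adj_four = adjusted_r2(0.595412253802, n_train, 4)

print(round(adj_two, 4), round(adj_four, 4))
```

Both adjusted values are lower than the raw scores, and the four-feature model pays the larger penalty.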
