In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression

In [8]:
delivery = pd.read_csv('Desktop/delivery.csv')
delivery.head()

   n.prod  distance  delTime
0       7       560    16.68
1       3       220    11.50
2       3       340    12.03
3       4        80    14.88
4       6       150    13.75


In [9]:
delivery.count()

n.prod      25
distance    25
delTime     25
dtype: int64

In [10]:
# selecting the predictors and targets
x = delivery[["n.prod", "distance"]]
y = delivery["delTime"]

x

   n.prod  distance
0       7       560
1       3       220
2       3       340
3       4        80
4       6       150
5       7       330
6       2       110
7       7       210
8      30      1460
9       5       605


In [11]:
y

0     16.68
1     11.50
2     12.03
3     14.88
4     13.75
5     18.11
6      8.00
7     17.83
8     79.24
9     21.50
10    40.33
11    21.00
12    13.50
13    19.75
14    24.00
15    29.00
16    15.35
17    19.00
18     9.50
19    35.10
20    17.90
21    52.32
22    18.75
23    19.83
24    10.75
Name: delTime, dtype: float64

In [12]:
model = LinearRegression()

In [14]:
# building the model using fit() method
model.fit(x,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
print("Intercept:",model.intercept_ ,"\nCoefficients:",model.coef_)

Intercept: 2.3412311451922 
Coefficients: [1.61590721 0.01438483]
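
To make the printed numbers concrete, the following minimal sketch evaluates the fitted equation by hand for the first row of the dataset. The constants are copied from the output above; the helper name `predict_del_time` is made up for illustration and is not part of scikit-learn.

```python
# Intercept and coefficients as printed by the fitted model above
intercept = 2.3412311451922
coef_n_prod = 1.61590721
coef_distance = 0.01438483

def predict_del_time(n_prod, distance):
    """Evaluate the fitted plane: intercept + b1 * n.prod + b2 * distance."""
    return intercept + coef_n_prod * n_prod + coef_distance * distance

# First row of the dataset: n.prod = 7, distance = 560, observed delTime = 16.68
predicted = predict_del_time(7, 560)
residual = 16.68 - predicted  # observed minus predicted
print(round(predicted, 2), round(residual, 2))
```

For that row the hand-computed prediction is about 21.71 against an observed 16.68, which should agree with `model.predict` on the same inputs up to rounding of the printed coefficients.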


## Linear independence of predictors for delivery time dataset


##### For the delivery time dataset, the correlation (R) between the predictors n.prod and distance is shown below.



In [16]:
# finding the correlation
np.corrcoef(delivery["n.prod"],delivery["distance"])

array([[1.      , 0.824215],
       [0.824215, 1.      ]])

In [20]:
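# Compute the variance inflation factor (VIF) for the predictors n.prod and distance.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Note: variance_inflation_factor does not add an intercept column itself.
# x is passed as-is here, so each auxiliary regression is fitted without a constant term.
# Calculate the VIF for each column of x, labelled by column name.
vif = pd.Series([variance_inflation_factor(x.values, idx)
                 for idx in range(x.shape[1])],
                index=x.columns)
print(vif)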


n.prod      7.848245
distance    7.848245
dtype: float64
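
The values above come from auxiliary regressions without an intercept, because `x` was passed without a constant column. As a cross-check, the sketch below computes the VIF with an intercept using only NumPy. With exactly two predictors this reduces to 1 / (1 - r²), where r is their correlation. The function `vif_with_intercept` and the random demo data are assumptions made up for illustration; only the correlation 0.824215 is taken from the notebook.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Made-up demo data with two correlated columns (not the delivery dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 0.8 * a + rng.normal(scale=0.5, size=50)
vifs = vif_with_intercept(np.column_stack([a, b]))

# Two-predictor shortcut: VIF = 1 / (1 - r^2)
r_demo = np.corrcoef(a, b)[0, 1]
vif_from_r = 1.0 / (1.0 - r_demo ** 2)
print(vifs, vif_from_r)

# Same shortcut applied to the correlation reported for n.prod and distance
r_notebook = 0.824215
vif_notebook = 1.0 / (1.0 - r_notebook ** 2)
print(round(vif_notebook, 2))
```

Applied to the reported correlation, the shortcut gives a VIF of roughly 3.12, noticeably lower than the no-intercept figure. With statsmodels, the usual way to get the intercept-based value is to add a constant column to the design matrix before calling `variance_inflation_factor`.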


## Adjusted R-Squared

The least squares method finds the linear regression model that minimizes the sum of squared errors on the data. Adding a predictor variable can never increase that minimum error, because the fit can always assign the new variable a coefficient of zero and recover the previous model. As a result, R² measured on the fitting data never decreases as more predictor variables are included in the model.

The code below illustrates the increase in R² when an additional predictor is included in the model.

In [21]:
# Model with a single predictor - n.prod
model1 = LinearRegression()
features = ["n.prod"]
target = ["delTime"]
model1.fit(delivery[features],delivery[target])
print(model1.score(delivery[features],delivery[target]))
#sample model1 score
#0.9304813135986855

# Model with multiple predictors - n.prod,distance
model2 = LinearRegression()
features = ["n.prod","distance"]
target = ["delTime"]
model2.fit(delivery[features],delivery[target])
print(model2.score(delivery[features],delivery[target]))
#sample model2 score
#0.9595937494832257


0.9304813135986855
0.9595937494832257


In other words, R² can be inflated simply by including more and more predictor variables, whether or not they are useful.
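
A small synthetic experiment makes the point: even a predictor of pure random noise does not lower R² on the fitting data. The data and the helper `r_squared` below are made up for illustration and are not the delivery dataset.

```python
import numpy as np

def r_squared(X, y):
    """R² of an ordinary least squares fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Synthetic data: y depends on x1 only
rng = np.random.default_rng(42)
n = 25
x1 = rng.uniform(1, 30, size=n)
y = 3.0 + 1.5 * x1 + rng.normal(scale=4.0, size=n)
noise = rng.normal(size=n)  # a column unrelated to y

r2_base = r_squared(x1.reshape(-1, 1), y)
r2_with_noise = r_squared(np.column_stack([x1, noise]), y)
print(r2_base, r2_with_noise)
```

The second value is never smaller than the first, even though the added column carries no information about `y`.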

For this reason, an additional statistic known as adjusted R² is recommended. Adjusted R² takes into account both the number of predictor variables and the number of samples (observations) in the regression model, so it penalizes predictors that add little explanatory power.

The adjusted R² of the two-predictor model fitted to the delivery time dataset is computed below. When comparing models fitted to the same data, a higher adjusted R² indicates a better model.

In [22]:
#computation of adjusted R-squared
X = delivery[features]
y = delivery[target]
adjusted_rscore = 1 - (1-model2.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1)
print(adjusted_rscore)
#sample adjusted R-Squared
#0.9559204539817008


0.9559204539817008
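
The same adjustment can be wrapped in a small helper so that models with different numbers of predictors can be compared directly. The function name `adjusted_r2` is made up for this sketch; the R² values and the sample size of 25 are the ones reported earlier in this notebook.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# R² scores printed earlier for the one- and two-predictor models (n = 25)
adj_one = adjusted_r2(0.9304813135986855, 25, 1)
adj_two = adjusted_r2(0.9595937494832257, 25, 2)
print(adj_one, adj_two)
```

Each adjusted value is slightly below its raw R², and the two-predictor model still scores higher after the adjustment, which suggests that `distance` contributes real explanatory power here.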
