# Multiple Linear Regression in Statsmodels

## Introduction

In this lecture, you'll learn how to run your first multiple linear regression model.

## Objectives
You will be able to:
* Use Statsmodels to run a multiple linear regression
* Recognize that SciPy and scikit-learn offer alternative ways to run a regression

## Auto-mpg data

The code below reiterates the steps we've taken before: we've created dummies for our categorical variables and have log-transformed some of our continuous predictors. 

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("auto-mpg.csv") 
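# Assign the result back: astype returns a new Series rather than converting in place
data['horsepower'] = data['horsepower'].astype(str).astype(int)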

acc = data["acceleration"]
logdisp = np.log(data["displacement"])
loghorse = np.log(data["horsepower"])
logweight= np.log(data["weight"])

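# Each predictor gets a different rescaling
scaled_acc = (acc-min(acc))/(max(acc)-min(acc))  # min-max scaling
scaled_disp = (logdisp-np.mean(logdisp))/np.var(logdisp)  # centered, then divided by the variance
scaled_horse = (loghorse-np.mean(loghorse))/(max(loghorse)-min(loghorse))  # mean normalization
scaled_weight = (logweight)/(np.linalg.norm(logweight))  # unit-vector scaling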

data_fin = pd.DataFrame([])
data_fin["acc"]= scaled_acc
data_fin["disp"]= scaled_disp
data_fin["horse"] = scaled_horse
data_fin["weight"] = scaled_weight
cyl_dummies = pd.get_dummies(data["cylinders"], prefix="cyl")
yr_dummies = pd.get_dummies(data["model year"], prefix="yr")
orig_dummies = pd.get_dummies(data["origin"], prefix="orig")
mpg = data["mpg"]
data_fin = pd.concat([mpg, data_fin, cyl_dummies, yr_dummies, orig_dummies], axis=1)
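
To make the four rescalings in the cell above easier to tell apart, here is a minimal sketch on a small made-up array. The values and variable names are illustrative assumptions, not taken from the auto-mpg data. Note that this sketch standardizes with the standard deviation, whereas the cell above divides `logdisp` by its variance.

```python
import numpy as np

# Toy values, purely illustrative (not from auto-mpg.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: maps the values onto [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
standardized = (x - x.mean()) / x.std()

# Mean normalization: mean 0, total range of width 1
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Unit-vector scaling: the result has Euclidean norm 1
unit = x / np.linalg.norm(x)

print(min_max)
print(standardized)
print(mean_norm)
print(unit)
```

All four keep the ordering of the values; they differ only in where the result is centered and how wide it is.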

In [2]:
data_fin.head()

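|  | mpg | acc | disp | horse | weight | cyl_3 | cyl_4 | cyl_5 | cyl_6 | cyl_8 | ... | yr_76 | yr_77 | yr_78 | yr_79 | yr_80 | yr_81 | yr_82 | orig_1 | orig_2 | orig_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 0.238095 | 2.116162 | 0.173727 | 0.05176 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 15.0 | 0.208333 | 2.579297 | 0.32186 | 0.052093 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 18.0 | 0.178571 | 2.24054 | 0.262641 | 0.051636 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 16.0 | 0.238095 | 2.081467 | 0.262641 | 0.051631 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 17.0 | 0.14881 | 2.058147 | 0.219773 | 0.05166 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |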


This looks pretty good. But wait a second: we had earlier identified multicollinearity between our continuous features. Now that we've transformed the variables, is that multicollinearity still present? Let's take a quick look at the correlations of our feature-scaled continuous variables, using the scatter matrix again.

In [3]:
pd.plotting.scatter_matrix(data_fin.iloc[:, 1:5], figsize=[6, 6]);

There is clearly still a high correlation between "disp", "horse" and "weight". In our model, let's only use "horse". We'll drop "disp" and "weight" from `data_fin`.
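
A scatter matrix is a visual check; a numeric correlation matrix can back it up. The sketch below uses synthetic data as a stand-in (the generated columns and the 0.75 threshold are assumptions for illustration, not values from the lesson) and flags pairs of predictors whose absolute Pearson correlation is high.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three predictors that share a common signal, one that does not
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "disp": base + rng.normal(scale=0.1, size=200),
    "horse": base + rng.normal(scale=0.1, size=200),
    "weight": base + rng.normal(scale=0.1, size=200),
    "acc": rng.normal(size=200),
})

corr = toy.corr()  # pairwise Pearson correlations

# Flag off-diagonal pairs above a chosen threshold
threshold = 0.75  # assumption: a rule of thumb, not a fixed standard
off_diagonal = ~np.eye(len(corr), dtype=bool)
high = (corr.abs() > threshold) & off_diagonal

print(corr.round(2))
print(high)
```

On the lesson's data, the equivalent call would be `data_fin.iloc[:, 1:5].corr()`, run before "disp" and "weight" are dropped.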

In [4]:
data_fin.drop(['disp','weight'], axis = 1, inplace=True)

In [5]:
data_fin.columns

Index(['mpg', 'acc', 'horse', 'cyl_3', 'cyl_4', 'cyl_5', 'cyl_6', 'cyl_8',
       'yr_70', 'yr_71', 'yr_72', 'yr_73', 'yr_74', 'yr_75', 'yr_76', 'yr_77',
       'yr_78', 'yr_79', 'yr_80', 'yr_81', 'yr_82', 'orig_1', 'orig_2',
       'orig_3'],
      dtype='object')

Now, let's use the Statsmodels formula interface (`statsmodels.formula.api`) to run OLS on all our data. Just like for linear regression with a single predictor, you can use the formula $y \sim X$, where, with $n$ predictors, $X$ is written as $x_1+\ldots+x_n$.
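
Written out, a formula of this shape corresponds to the usual multiple linear regression model. The symbols $\beta_j$ and $\epsilon$ are introduced here for illustration; the formula interface adds the intercept term $\beta_0$ by default, so it does not need to appear in the formula string:

$$ y = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n + \epsilon $$

OLS picks the estimates $\hat\beta_0, \ldots, \hat\beta_n$ that minimize the sum of squared residuals, and the fitted values are

$$ \hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_n x_n $$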



In [6]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# import matplotlib.pyplot as plt


In [7]:
formula = "mpg ~ acc+horse+cyl_3+cyl_4+cyl_5+cyl_6+cyl_8+yr_70+yr_71+yr_72+yr_73+yr_74+yr_75+yr_76+yr_77+yr_78+yr_79+yr_80+yr_81+yr_82+orig_1+orig_2+orig_3"
model = ols(formula=formula, data=data_fin).fit()

Typing out all the predictors isn't practical when you have many. A better approach is to separate the outcome variable "mpg" from your data frame and use `"+".join()` on the predictor column names, as done below:

In [8]:
outcome = 'mpg'
predictors = data_fin.drop('mpg', axis=1)
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
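
To see what the join produces, here is the same pattern on a tiny made-up frame (the frame and the name `predictor_names` are illustrative assumptions):

```python
import pandas as pd

# Toy frame; values are made up for illustration
toy = pd.DataFrame({"mpg": [18.0, 15.0], "acc": [0.2, 0.3], "horse": [0.1, 0.4]})

outcome = "mpg"
# Every column except the outcome becomes a predictor
predictor_names = [c for c in toy.columns if c != outcome]
formula = outcome + " ~ " + " + ".join(predictor_names)

print(formula)  # mpg ~ acc + horse
```

This works as long as every column name is a valid Python identifier. A name such as `model year`, which contains a space, would not be accepted as written; patsy's `Q("model year")` quoting is one way around that.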

In [9]:
model = ols(formula= formula, data=data_fin).fit()
model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.877 |
| Model: | OLS | Adj. R-squared: | 0.871 |
| Method: | Least Squares | F-statistic: | 132.8 |
| Date: | Sun, 14 Oct 2018 | Prob (F-statistic): | 2.33e-155 |
| Time: | 15:14:33 | Log-Likelihood: | -949.82 |
| No. Observations: | 392 | AIC: | 1942.0 |
| Df Residuals: | 371 | BIC: | 2025.0 |
| Df Model: | 20 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 16.6029 | 0.453 | 36.615 | 0.000 | 15.711 | 17.495 |
| acc | -9.3480 | 1.292 | -7.238 | 0.000 | -11.888 | -6.808 |
| horse | -23.8371 | 1.612 | -14.786 | 0.000 | -27.007 | -20.667 |
| cyl_3 | -1.1195 | 1.244 | -0.900 | 0.369 | -3.566 | 1.327 |
| cyl_4 | 6.2499 | 0.473 | 13.219 | 0.000 | 5.320 | 7.180 |
| cyl_5 | 4.3716 | 1.435 | 3.047 | 0.002 | 1.550 | 7.193 |
| cyl_6 | 3.1118 | 0.518 | 6.002 | 0.000 | 2.092 | 4.131 |
| cyl_8 | 3.9891 | 0.625 | 6.380 | 0.000 | 2.760 | 5.219 |
| yr_70 | -0.2154 | 0.531 | -0.406 | 0.685 | -1.259 | 0.829 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 33.865 | Durbin-Watson: | 1.801 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 57.23 |
| Skew: | 0.552 | Prob(JB): | 3.74e-13 |
| Kurtosis: | 4.512 | Cond. No. | 2.04e+16 |
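
The summary above reports `Df Model: 20` even though the formula lists 23 predictors, and the condition number is about 2e+16. A likely explanation is that each complete set of dummy columns (cylinders, year, origin) sums to the intercept column, so three columns of the design matrix are redundant. The sketch below illustrates the effect on made-up data; the helper `design_matrix` is hypothetical, and `drop_first=True` is shown as one common remedy, not as something the lesson's code does.

```python
import numpy as np
import pandas as pd

# Made-up data: one 3-level category and one continuous predictor
toy = pd.DataFrame({
    "origin": [1, 2, 3, 1, 2, 3, 1, 2],
    "x": [0.1, 0.5, 0.3, 0.9, 0.2, 0.7, 0.4, 0.8],
})

def design_matrix(dummies, x):
    """Stack an intercept column, the dummy columns, and x side by side."""
    intercept = np.ones((len(x), 1))
    return np.hstack([
        intercept,
        dummies.to_numpy(dtype=float),
        x.to_numpy(dtype=float).reshape(-1, 1),
    ])

full = pd.get_dummies(toy["origin"], prefix="orig")                      # all 3 levels
reduced = pd.get_dummies(toy["origin"], prefix="orig", drop_first=True)  # 2 levels

X_full = design_matrix(full, toy["x"])
X_reduced = design_matrix(reduced, toy["x"])

# Number of columns versus rank: the full version has one redundant column
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

With all three dummy levels the matrix has five columns but rank four; dropping one level leaves four columns of full rank.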


In [24]:
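# Replace the year dummies with a single standardized "year" column
no_yr = [c for c in predictors.columns if not c.startswith('yr')]
predictors = predictors[no_yr].copy()  # copy so the new column is not assigned to a slice
predictors["year"] = (data["model year"]-np.mean(data["model year"]))/np.sqrt(np.var(data["model year"]))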

In [25]:
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum

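# predictors has no "mpg" column, so pass the outcome explicitly instead of relying on the global mpg variable
model = ols(formula=formula, data=pd.concat([data_fin[outcome], predictors], axis=1)).fit()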
model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.846 |
| Model: | OLS | Adj. R-squared: | 0.842 |
| Method: | Least Squares | F-statistic: | 232.8 |
| Date: | Sun, 14 Oct 2018 | Prob (F-statistic): | 4.01e-149 |
| Time: | 15:18:51 | Log-Likelihood: | -994.76 |
| No. Observations: | 392 | AIC: | 2010.0 |
| Df Residuals: | 382 | BIC: | 2049.0 |
| Df Model: | 9 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 17.7031 | 0.521 | 33.962 | 0.000 | 16.678 | 18.728 |
| acc | -10.1230 | 1.407 | -7.197 | 0.000 | -12.889 | -7.357 |
| horse | -23.6417 | 1.747 | -13.531 | 0.000 | -27.077 | -20.206 |
| cyl_3 | -1.0097 | 1.360 | -0.742 | 0.458 | -3.684 | 1.665 |
| cyl_4 | 6.5616 | 0.511 | 12.834 | 0.000 | 5.556 | 7.567 |
| cyl_5 | 5.5573 | 1.574 | 3.530 | 0.000 | 2.462 | 8.653 |
| cyl_6 | 2.7837 | 0.556 | 5.003 | 0.000 | 1.690 | 3.878 |
| cyl_8 | 3.8103 | 0.682 | 5.589 | 0.000 | 2.470 | 5.151 |
| orig_1 | 4.5479 | 0.333 | 13.646 | 0.000 | 3.893 | 5.203 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 55.951 | Durbin-Watson: | 1.48 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 95.139 |
| Skew: | 0.849 | Prob(JB): | 2.19e-21 |
| Kurtosis: | 4.715 | Cond. No. | 1.76e+16 |


In [29]:
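# Drop the cylinder dummies as well
no_cyl = [c for c in predictors.columns if not c.startswith('cyl')]
predictors = predictors[no_cyl]

pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum

# predictors has no "mpg" column, so pass the outcome explicitly instead of relying on the global mpg variable
model = ols(formula=formula, data=pd.concat([data_fin[outcome], predictors], axis=1)).fit()

model.summary()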

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.817 |
| Model: | OLS | Adj. R-squared: | 0.815 |
| Method: | Least Squares | F-statistic: | 344.6 |
| Date: | Sun, 14 Oct 2018 | Prob (F-statistic): | 7.03e-140 |
| Time: | 15:32:27 | Log-Likelihood: | -1028.4 |
| No. Observations: | 392 | AIC: | 2069.0 |
| Df Residuals: | 386 | BIC: | 2093.0 |
| Df Model: | 5 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 22.2364 | 0.507 | 43.865 | 0.000 | 21.240 | 23.233 |
| acc | -11.9491 | 1.476 | -8.097 | 0.000 | -14.851 | -9.048 |
| horse | -28.2206 | 1.337 | -21.110 | 0.000 | -30.849 | -25.592 |
| orig_1 | 5.4564 | 0.324 | 16.841 | 0.000 | 4.819 | 6.093 |
| orig_2 | 7.7914 | 0.380 | 20.519 | 0.000 | 7.045 | 8.538 |
| orig_3 | 8.9887 | 0.338 | 26.628 | 0.000 | 8.325 | 9.652 |
| year | 2.4568 | 0.188 | 13.056 | 0.000 | 2.087 | 2.827 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 21.608 | Durbin-Watson: | 1.458 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 27.627 |
| Skew: | 0.468 | Prob(JB): | 1e-06 |
| Kurtosis: | 3.903 | Cond. No. | 1.9e+15 |
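
Across the three fits, R-squared falls from 0.877 to 0.846 to 0.817 while AIC rises from 1942 to 2010 to 2069. Rather than reading these off each summary, the fitted results object exposes them as attributes (`rsquared_adj`, `aic`, `bic`). The sketch below collects them side by side for two models fit on synthetic data; the data, the coefficients used to generate it, and the column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data in which both predictors truly affect the outcome
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 3.0 + 2.0 * toy["x1"] - 1.5 * toy["x2"] + rng.normal(scale=0.5, size=n)

# Fit a larger model and a smaller one nested inside it
fits = {
    "y ~ x1 + x2": ols("y ~ x1 + x2", data=toy).fit(),
    "y ~ x1": ols("y ~ x1", data=toy).fit(),
}

# One row per model, one column per fit statistic
comparison = pd.DataFrame({
    name: {"adj_r2": fit.rsquared_adj, "aic": fit.aic, "bic": fit.bic}
    for name, fit in fits.items()
}).T

print(comparison.round(3))
```

Higher adjusted R-squared and lower AIC or BIC favor a model, so in this toy setup dropping `x2` makes every statistic worse. The same table could be built from the three models in this lesson if each fit were stored under its own name instead of overwriting `model`.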


## Summary
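In this lesson, you ran multiple linear regression models with the Statsmodels formula interface and built the formula string programmatically with `"+".join()`. You then simplified the model in two steps, first replacing the year dummies with a single standardized `year` column and then dropping the cylinder dummies, which lowered R-squared from 0.877 to 0.846 to 0.817.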