# Multiple Linear Regression with statsmodels

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
%matplotlib inline
sns.set()

In [2]:
data = pd.read_csv('1.02. Multiple linear regression.csv')

In [3]:
# the 'Rand 1,2,3' column assigns 1, 2 or 3 randomly to each student
# this column can't predict college GPA
# the new model is GPA = b0 + b1 * SAT + b2 * Rand 1,2,3

data

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
0,1714,2.40,1
1,1664,2.52,3
2,1760,2.54,3
3,1685,2.74,3
4,1693,2.83,2
...,...,...,...
79,1936,3.71,3
80,1810,3.71,1
81,1987,3.73,3
82,1962,3.76,1


In [4]:
data.describe()

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
count,84.0,84.0,84.0
mean,1845.27381,3.330238,2.059524
std,104.530661,0.271617,0.855192
min,1634.0,2.4,1.0
25%,1772.0,3.19,1.0
50%,1846.0,3.38,2.0
75%,1934.0,3.5025,3.0
max,2050.0,3.81,3.0


In [5]:
# dependent variable

y = data['GPA']

In [6]:
# independent variables

x1 = data[['SAT','Rand 1,2,3']]

In [7]:
# add a column of ones so the model can estimate the intercept b0

x = sm.add_constant(x1)


In [8]:
# because of add_constant,
# the DataFrame now has a const column

x

Unnamed: 0,const,SAT,"Rand 1,2,3"
0,1.0,1714,1
1,1.0,1664,3
2,1.0,1760,3
3,1.0,1685,3
4,1.0,1693,2
...,...,...,...
79,1.0,1936,3
80,1.0,1810,1
81,1.0,1987,3
82,1.0,1962,1


In [9]:
results = sm.OLS(y,x).fit()

In [10]:
results.summary()

Dep. Variable:,GPA,R-squared:,0.407
Model:,OLS,Adj. R-squared:,0.392
Method:,Least Squares,F-statistic:,27.76
Date:,"Wed, 25 Dec 2019",Prob (F-statistic):,6.58e-10
Time:,15:40:46,Log-Likelihood:,12.72
No. Observations:,84,AIC:,-19.44
Df Residuals:,81,BIC:,-12.15
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.2960,0.417,0.710,0.480,-0.533,1.125
SAT,0.0017,0.000,7.432,0.000,0.001,0.002
"Rand 1,2,3",-0.0083,0.027,-0.304,0.762,-0.062,0.046

Omnibus:,12.992,Durbin-Watson:,0.948
Prob(Omnibus):,0.002,Jarque-Bera (JB):,16.364
Skew:,-0.731,Prob(JB):,0.00028
Kurtosis:,4.594,Cond. No.,33300.0


___

#### Adjusted R-squared

Statisticians routinely check the adjusted R-squared when performing regression analysis.

The R-squared measures how much of the total variability in the dependent variable is explained by the model.

A multiple regression never has a lower R-squared than a simple regression nested inside it: with each additional variable, the explanatory power can only increase or stay the same. That alone does not make the larger model better.

The adjusted R-squared takes the number of variables into account and penalizes excessive use of them. Whenever the model has at least one predictor and the R-squared is below 1, the adjusted R-squared is smaller than the R-squared.

In this example, we were penalized for adding a variable with no real explanatory power: the adjusted R-squared (0.392) sits below the R-squared (0.407). We added information but lost value.

The point is that you should choose your variables so as to exclude useless information. One would hope, however, that the regression itself flags such a variable, and it does, through the p-values in the coefficient table.

The adjusted R-squared is the basis for comparing regression models. The comparison only makes sense between two models that share the same dependent variable and are fitted on the same dataset.
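
The penalty can be checked by hand. As a minimal sketch, assuming the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors excluding the constant, the values from the summary above reproduce the reported figure. The helper name `adjusted_r_squared` is made up for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept.

    n_predictors counts the independent variables only, not the constant.
    """
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the summary table: R-squared 0.407, 84 observations,
# 2 predictors (SAT and Rand 1,2,3).
adj = adjusted_r_squared(0.407, n_obs=84, n_predictors=2)
print(round(adj, 3))  # 0.392, matching "Adj. R-squared" in the summary
```

Because the input R-squared is itself rounded to three decimals, agreement beyond the third decimal should not be expected.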
___

#### Coefficient table

The coefficient table has a row for Rand 1,2,3 with a p-value of 0.762. The null hypothesis is <code>H0: β = 0</code>, which here means <code>H0: b2 = 0</code>. With a p-value of 0.762 we cannot reject it at any conventional significance level (0.01, 0.05 or 0.10). So Rand 1,2,3 not only worsens the explanatory power of the model, as reflected by a lower adjusted R-squared, but is also statistically insignificant. It should therefore be dropped altogether. Dropping useless variables is important.
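
To see roughly where these p-values come from, they can be approximated from the t statistics in the table. This sketch treats each t statistic as standard normal, an approximation that is reasonable here because there are 81 residual degrees of freedom. statsmodels itself uses the t distribution, so the results can differ slightly in the third decimal. The helper name and the 0.05 threshold are assumptions for illustration.

```python
import math


def two_sided_p_normal(t_stat):
    """Approximate two-sided p-value, treating the t statistic as standard normal."""
    return math.erfc(abs(t_stat) / math.sqrt(2))


alpha = 0.05  # assumed significance level

# t statistics copied from the coefficient table
t_values = {"const": 0.710, "SAT": 7.432, "Rand 1,2,3": -0.304}

p_approx = {name: two_sided_p_normal(t) for name, t in t_values.items()}
not_significant = [name for name, p in p_approx.items() if p >= alpha]

for name, p in p_approx.items():
    verdict = "cannot reject H0" if p >= alpha else "reject H0"
    print(f"{name:12s} p ~ {p:.3f}  ->  {verdict}")
```

The intercept also fails the test here, but the constant is usually kept in the model regardless of its p-value. The candidate for removal is Rand 1,2,3.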
___

#### The regression equation

The simple regression equation is <code>ŷ = 0.275 + 0.0017x1</code>, and the multiple regression equation is <code>ŷ = 0.296 + 0.0017x1 - 0.0083x2</code>. Adding the extra variable changed the intercept. Whenever one variable is ruining the model, you should not use that model at all, because the bias introduced by this variable is reflected in the coefficients of the other variables.

The correct approach is to remove it from the regression and run a new one.
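
To get a feel for the two equations, the sketch below plugs one hypothetical student into both, using the rounded coefficients quoted above. The SAT score of 1850 and the Rand value of 2 are made-up inputs, and because the coefficients are rounded the predictions are only approximate.

```python
def predict_simple(sat):
    """GPA predicted by the simple regression (rounded coefficients)."""
    return 0.275 + 0.0017 * sat


def predict_multiple(sat, rand):
    """GPA predicted by the multiple regression (rounded coefficients)."""
    return 0.296 + 0.0017 * sat - 0.0083 * rand


# Hypothetical student: these inputs are illustrative, not taken from the dataset.
sat, rand = 1850, 2

gpa_simple = predict_simple(sat)
gpa_multiple = predict_multiple(sat, rand)

print(f"simple:   {gpa_simple:.3f}")
print(f"multiple: {gpa_multiple:.3f}")
```

The random column shifts the prediction even though it carries no information about GPA, which is the reason to refit without it.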