# Explanation of statsmodels' model summary

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from IPython.display import Image
from sklearn.datasets import make_regression
%matplotlib inline

After fitting a regression model in statsmodels, we usually want to get some information about that model. The most common way of doing this is to call model.summary(). Here I'll explain in detail what each of the numbers it returns means.

Let's get started by making some data.

In [27]:
X, y = make_regression(n_samples=240, n_features=3, n_informative=2, noise=22)

I like to use the statsmodels formula API with a pandas dataframe, so I'll make one here.

In [28]:
df = pd.concat([pd.DataFrame(X, columns=['x0', 'x1', 'x2']),
                pd.DataFrame(y, columns=['y'])], axis=1)

In [29]:
df.head()

Unnamed: 0,x0,x1,x2,y
0,-2.249892,0.596463,0.085399,21.100636
1,-0.037913,0.803076,-1.574064,23.104803
2,0.285393,0.339467,-1.446577,-35.216285
3,-0.963888,1.527511,0.903515,109.53417
4,-2.279441,-1.629664,-1.645524,-138.575409


Finally specifying and fitting the model

In [30]:
model = smf.ols(formula='y ~ x0+x1+x2', data=df).fit()

and printing the summary.

In [31]:
model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.894
Model:,OLS,Adj. R-squared:,0.893
Method:,Least Squares,F-statistic:,662.8
Date:,"Sun, 22 Oct 2017",Prob (F-statistic):,1.25e-114
Time:,11:02:43,Log-Likelihood:,-1084.5
No. Observations:,240,AIC:,2177.0
Df Residuals:,236,BIC:,2191.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.6547,1.463,0.447,0.655,-2.228,3.538
x0,-0.5257,1.335,-0.394,0.694,-3.155,2.104
x1,53.8768,1.339,40.249,0.000,51.240,56.514
x2,26.0969,1.503,17.359,0.000,23.135,29.059

0,1,2,3
Omnibus:,0.168,Durbin-Watson:,1.873
Prob(Omnibus):,0.919,Jarque-Bera (JB):,0.181
Skew:,0.062,Prob(JB):,0.913
Kurtosis:,2.95,Cond. No.,1.24


Let’s go through the above one table at a time. 
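Before looking at each table, it can help to know that the summary object keeps them separately. The sketch below is a minimal example on a synthetic stand-in for the data above (the seed, the true coefficients, and the numpy-based generator are my assumptions, not the notebook's make_regression call), and it relies on the OLS summary exposing its three tables through the `tables` attribute.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()
summary = model.summary()

# The summary holds one entry per table discussed in this notebook
print(len(summary.tables))

# Print only the coefficient table (the second one)
print(summary.tables[1])
```

Most of the individual numbers are also available as attributes on the fitted result (for example model.rsquared or model.params), which is usually more convenient than parsing the printed tables.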

## The first table

In [41]:
Image(url="table1.png", width=500, height=500)


The first column here is all information you already know (or at least should know). 

- Df Residuals: the degrees of freedom of the residuals, which is calculated as N - Df Model - 1 (here 240 - 3 - 1 = 236).

- Df Model: the degrees of freedom of the model, which is just the number of predictors (the intercept is not counted).

- Covariance Type: reminds us that the standard errors are the ordinary, non-robust ones, which assume that the errors have constant variance and are uncorrelated. I'll have a separate notebook introducing robust regression, so I'm not going into it here, only to say that ordinary least squares regression is not robust because it can give misleading results if its underlying assumptions are not true (I'll talk about what I mean by underlying assumptions later in this notebook).
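The entries in the first column can be read straight off the fitted result. This is a minimal sketch on synthetic stand-in data (the seed and coefficients are assumptions); it also refits the same model with `cov_type="HC3"`, one of the robust covariance options statsmodels offers, purely to show that the Covariance Type setting changes the standard errors and not the coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()

n = int(model.nobs)       # No. Observations
k = int(model.df_model)   # Df Model: number of predictors, intercept excluded
print(n, k, model.df_resid, model.cov_type)

# Same model, heteroscedasticity-robust (HC3) standard errors
robust = smf.ols("y ~ x0 + x1 + x2", data=df).fit(cov_type="HC3")
print(pd.DataFrame({"nonrobust": model.bse, "HC3": robust.bse}))
```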


The second column is more interesting 

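- R-squared (aka coefficient of determination or proportion of variance explained) is a statistical measure of how close the data are to the fitted regression line. One is the largest value possible, and for an OLS model with an intercept the value computed on the training data cannot go below zero. A value of zero means your model does no better than just guessing the mean value of the dependent variable. You can see values less than zero when you score a model on new data or fit without an intercept, and that means the model is useless and you would have been better off guessing the mean.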

- Adj. R-squared (adjusted R-squared): there is a problem with R-squared, which is that every time you add a predictor to your model the R-squared gets closer to one (or stays the same). Because a simpler model should be preferred to a more complex one, we need a metric that takes this into account. Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected from adding a feature with no correlation to the dependent variable, and decreases otherwise.

**note:** R-squared values close to one do not necessarily mean that you're fitting the data well. As we'll see in this notebook, there is a lot that goes into producing a good model for any dataset, and R-squared is just one of these numbers.
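To make the two R-squared values concrete, here is a minimal sketch that recomputes them by hand from the residuals and compares them with what statsmodels reports. The data is a synthetic stand-in (seed and coefficients are assumptions), and the formulas are the standard textbook ones for a model with an intercept.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()

n = model.nobs       # number of observations
k = model.df_model   # number of predictors (intercept excluded)

ss_res = np.sum(model.resid ** 2)                   # unexplained variation
ss_tot = np.sum((df["y"] - df["y"].mean()) ** 2)    # total variation around the mean

r2 = 1 - ss_res / ss_tot
# Adjusted R-squared penalizes every extra predictor
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(r2, model.rsquared)
print(adj_r2, model.rsquared_adj)
```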
  
- F-statistic (aka F value) and its corresponding p-value (Prob (F-statistic)) test the overall significance of the regression model. The null hypothesis is that all of the regression coefficients (other than the intercept) are actually equal to zero, as opposed to the values shown in table two. The p-value is the probability of seeing an F-statistic at least this large if that null hypothesis were true. So if Prob (F-statistic) has a value of 0.01, a result this extreme would turn up only about 1 time in 100 if all of the regression coefficients really were zero, which would indicate to us that the regression equation does have some validity in fitting the data.

- Log-Likelihood: this is the value of the log-likelihood function with the optimal coefficients plugged in. The exact form of the log-likelihood function is not that important, but it can be found here: https://en.wikipedia.org/wiki/Likelihood_function.

**note:** It is never appropriate to compare log-likelihoods across models unless the models were fitted to the same observations of the same dependent variable.

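- BIC (Bayesian information criterion, aka Schwarz criterion) and AIC (Akaike information criterion) are closely related metrics. Both are criteria for model selection among the possible models that can be built with the features you have. Both start from the log-likelihood and add a penalty for the number of parameters, with BIC penalizing extra parameters more heavily as the sample grows. The model with the lowest AIC and/or BIC is preferred.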
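The remaining numbers in the second column can also be rebuilt from the residuals. The sketch below is an illustration on synthetic stand-in data (seed and coefficients are assumptions); it uses the Gaussian log-likelihood with the variance estimated as RSS / n, and counts the intercept as a parameter in the AIC and BIC penalties, which reproduces the values statsmodels reports for OLS.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()

n = model.nobs
k = model.df_model
rss = np.sum(model.resid ** 2)                   # residual sum of squares
tss = np.sum((df["y"] - df["y"].mean()) ** 2)    # total sum of squares

# F-statistic: explained variance per predictor over residual variance
f_manual = ((tss - rss) / k) / (rss / (n - k - 1))
f_pvalue = stats.f.sf(f_manual, k, n - k - 1)

# Gaussian log-likelihood evaluated at the fitted coefficients
llf_manual = -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)

# Information criteria: the penalty counts the predictors plus the intercept
n_params = k + 1
aic_manual = -2 * llf_manual + 2 * n_params
bic_manual = -2 * llf_manual + np.log(n) * n_params

print(f_manual, model.fvalue)
print(llf_manual, model.llf)
print(aic_manual, model.aic)
print(bic_manual, model.bic)
```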



## The second table

In [42]:
Image(url="table2.png", width=500, height=500)

The second table is really all about the coefficients of the model.

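- std err (of that coefficient): the estimated standard deviation of the coefficient estimate, so it is a measure of how precisely coef has been determined. Smaller values mean a more precise estimate.

- t (t-statistic) and P>|t| (corresponding p-value): test the null hypothesis that the coefficient is equal to zero (no effect). The t-statistic is just coef divided by std err. A low p-value (< 0.05) indicates that you can reject the null hypothesis (that the real value of the coefficient is zero).

- [0.025 0.975] are the lower and upper bounds of the 95% confidence interval for each coefficient.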
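The columns of the second table are all derived from coef and std err. As a minimal sketch on synthetic stand-in data (seed and coefficients are assumptions), the code below recomputes the t-statistics, two-sided p-values, and 95% confidence intervals from a t distribution with Df Residuals degrees of freedom, and compares them with the values on the fitted result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()

coef = model.params   # "coef" column
se = model.bse        # "std err" column

# t = coef / std err, with a two-sided p-value from the t distribution
t_manual = coef / se
p_manual = 2 * stats.t.sf(np.abs(t_manual), model.df_resid)

# 95% confidence interval: coef +/- critical t value * std err
t_crit = stats.t.ppf(0.975, model.df_resid)
ci_low = coef - t_crit * se
ci_high = coef + t_crit * se

table = pd.DataFrame({"coef": coef, "std err": se, "t": t_manual,
                      "P>|t|": p_manual, "[0.025": ci_low, "0.975]": ci_high})
print(table.round(3))
```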


## The third table

In [43]:
Image(url="table3.png", width=500, height=500)

Here is where we get into finding out if our data meets the underlying assumptions of OLS regression. The statistics in this table are computed on the residuals of the model.


- Omnibus and Prob(Omnibus) test whether the residuals are normally distributed, by combining their skew and kurtosis into a single statistic. A Prob(Omnibus) close to zero suggests that the residuals are not normal.

- Skew is a measure of the skew in the residuals. "Skewness" is a measure of how symmetrical the data are; a skewed variable is one whose mean is not in the middle of the distribution. A normal distribution has a skew of 0.

- Kurtosis has to do with how peaked the distribution is, either too peaked or too flat. statsmodels reports it on the scale where a normal distribution has a kurtosis of 3.

**note:** As a rough rule of thumb, "extreme values" are a skewness more than about 3 away from 0 or a kurtosis more than about 3 away from the normal value of 3.
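Skew and kurtosis can be checked directly on the residuals. This is a minimal sketch using scipy on synthetic stand-in data (seed and coefficients are assumptions); `fisher=False` asks scipy for the kurtosis on the scale where a normal distribution scores 3, which is the scale used in the summary table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()
resid = model.resid.to_numpy()

# Skew: 0 for a perfectly symmetric distribution
skew = stats.skew(resid)

# Kurtosis on the "normal = 3" scale (not excess kurtosis)
kurt = stats.kurtosis(resid, fisher=False)

print(skew, kurt)
```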

- Durbin-Watson statistic is a number that tests for autocorrelation in the residuals of a regression analysis. The Durbin-Watson statistic is always between 0 and 4. A value near 2 means that there is no autocorrelation in the sample; values toward 0 indicate positive autocorrelation and values toward 4 indicate negative autocorrelation.

- Jarque–Bera (JB) and Prob(JB) are a goodness-of-fit test of whether the residuals have the skewness and kurtosis matching a normal distribution.

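- Cond. No. (condition number) measures how sensitive the fit is to small changes in the data, which in practice reflects correlation between features (multicollinearity). Larger means more correlation. A common rule of thumb is that values above 20 or so deserve a closer look; statsmodels itself adds a warning to the notes under the summary only when the condition number is very large (above 1000).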
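The remaining diagnostics can be reproduced with helper functions from statsmodels.stats.stattools. The sketch below runs on synthetic stand-in data (seed and coefficients are assumptions) and also recomputes the Durbin-Watson and Jarque-Bera statistics by hand from their usual formulas, so you can see what goes into them.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

# Synthetic stand-in for the notebook's data (seed and coefficients are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(240, 3)), columns=["x0", "x1", "x2"])
df["y"] = 54 * df["x1"] + 26 * df["x2"] + rng.normal(scale=22, size=240)

model = smf.ols("y ~ x0 + x1 + x2", data=df).fit()
resid = model.resid.to_numpy()
n = len(resid)

# Durbin-Watson: squared successive differences over squared residuals
dw = durbin_watson(resid)
dw_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Jarque-Bera: built from skew S and kurtosis K (normal: S = 0, K = 3)
jb, jb_p, skew, kurt = jarque_bera(resid)
jb_manual = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# Omnibus normality test on the residuals
omni, omni_p = omni_normtest(resid)

# Condition number of the design matrix (intercept column included)
cond = model.condition_number
cond_manual = np.linalg.cond(model.model.exog)

print(dw, jb, jb_p, omni, omni_p, cond)
```

With independent, roughly uncorrelated features like these, the condition number stays small; it grows quickly once two features carry nearly the same information or are on very different scales.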

