# Regression Analysis - Importance of Model Coefficients - p-Values - Hypothesis Testing


In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

## OLS Estimation - Example 1

#### Generate Artificial Data

In [10]:
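nsample = 100
x = np.linspace(0, 10, nsample)       # predictor on an evenly spaced grid
X = np.column_stack((x, x**2))        # design matrix with columns x and x^2
beta = np.array([1, 0.1, 10])         # true coefficients: intercept, x, x^2
e = np.random.normal(size=nsample)    # standard normal noise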
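
Written out, the data-generating process implied by `beta`, `X` (once the intercept column is added), and `e` is the quadratic model below. The subscript $i$ and the symbol $\varepsilon_i$ are notation introduced here for illustration; the noise term corresponds to `e`, which is drawn from a standard normal distribution.

$$
y_i = 1 + 0.1\,x_i + 10\,x_i^2 + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1), \qquad i = 1, \dots, 100
$$

A well-behaved fit should therefore recover coefficients close to 1, 0.1, and 10.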

Our model needs an intercept so we add a column of 1s:

In [16]:
X = sm.add_constant(X)
X[:3,:]

array([[1.        , 0.        , 0.        ],
       [1.        , 0.1010101 , 0.01020304],
       [1.        , 0.2020202 , 0.04081216]])

In [18]:
y = np.dot(X, beta) + e
y[:3]

array([2.22684216, 0.60739185, 0.59169934])

#### Fit and Summarize

In [12]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.787e+06
Date:                Thu, 10 Sep 2020   Prob (F-statistic):          5.98e-243
Time:                        13:26:09   Log-Likelihood:                -137.80
No. Observations:                 100   AIC:                             281.6
Df Residuals:                      97   BIC:                             289.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6345      0.287      2.214      0.0

Quantities of interest can be extracted directly from the fitted results object. Call dir(results) for a full list of attributes. Here are some examples:

In [21]:
#dir(results)

In [19]:
print('Parameters: ', results.params)
print('R2: ', results.rsquared)

Parameters:  [0.63450072 0.37979982 9.97236965]
R2:  0.9999898677859683


## OLS Estimation - Example 2 - Dataset with Multicollinear Features


The Longley dataset contains various US macroeconomic variables and is well known for its high multicollinearity: the exogenous predictors are highly correlated with one another.

This is problematic because the coefficient estimates can become unstable, changing substantially when we make minor changes to the model specification.

In [7]:
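from statsmodels.datasets.longley import load_pandas

# Load the dataset once and reuse it for the response and the predictors
longley = load_pandas()
y = longley.endog        # TOTEMP, the dependent variable
X = longley.exog
X = sm.add_constant(X)   # add the intercept column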

In [8]:
ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
print(ols_results.summary())

                            OLS Regression Results                            
Dep. Variable:                 TOTEMP   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Thu, 10 Sep 2020   Prob (F-statistic):           4.98e-10
Time:                        13:16:54   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3.482e+06    8.9e+05     -3.911      0.0

