# Subject: Classical Data Analysis

## Session 1 - Regression

### Demo 5 - OLS estimation and Multicollinearity


This demo uses the Python StatsModels package, with data retrieved through Quandl, to fit an OLS regression and assess multicollinearity.

## Datatables

WIKI Prices - This database offers stock prices, dividends and splits for 3000 US publicly-traded companies.

In [1]:
import quandl
quandl.ApiConfig.api_key = 'wagAy5tFsmUZ84CH3Ng8'  # A valid API key is required to retrieve data; you can find yours under your Quandl account settings.
data1 = quandl.get_table('WIKI/PRICES', ticker='AAPL', paginate=True)

In [2]:
data1.head()

,ticker,date,open,high,low,close,volume,ex-dividend,split_ratio,adj_open,adj_high,adj_low,adj_close,adj_volume
0,AAPL,1980-12-12,28.75,28.87,28.75,28.75,2093900.0,0.0,1.0,0.422706,0.42447,0.422706,0.422706,117258400.0
1,AAPL,1980-12-15,27.38,27.38,27.25,27.25,785200.0,0.0,1.0,0.402563,0.402563,0.400652,0.400652,43971200.0
2,AAPL,1980-12-16,25.37,25.37,25.25,25.25,472000.0,0.0,1.0,0.37301,0.37301,0.371246,0.371246,26432000.0
3,AAPL,1980-12-17,25.87,26.0,25.87,25.87,385900.0,0.0,1.0,0.380362,0.382273,0.380362,0.380362,21610400.0
4,AAPL,1980-12-18,26.63,26.75,26.63,26.63,327900.0,0.0,1.0,0.391536,0.3933,0.391536,0.391536,18362400.0


# 1 - Regression model with Statsmodels and with a constant

In [3]:
import statsmodels.api as sm

X = data1["close"]
y = data1["high"]

In [4]:
X = sm.add_constant(X) 

In [5]:
model = sm.OLS(y, X).fit()  # sm.OLS(endog, exog): response (output) first, then predictors (input)

In [6]:
predictions = model.predict(X)

In [7]:
model.summary()

Dep. Variable:,high,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,58380000.0
Date:,"Tue, 03 Oct 2017",Prob (F-statistic):,0.0
Time:,19:56:54,Log-Likelihood:,-18234.0
No. Observations:,9280,AIC:,36470.0
Df Residuals:,9278,BIC:,36490.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,0.3806,0.022,17.081,0.000,0.337 0.424
close,1.0089,0.000,7640.852,0.000,1.009 1.009

Omnibus:,7429.58,Durbin-Watson:,1.641
Prob(Omnibus):,0.0,Jarque-Bera (JB):,365058.987
Skew:,3.464,Prob(JB):,0.0
Kurtosis:,32.935,Cond. No.,210.0


### Interpreting the Table 
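The R-squared is reported as 1.0 (rounded), so the closing price explains almost all of the variation in the daily high. The estimated slope on close is 1.0089 and the constant is 0.3806, both with p-values reported as 0.000.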
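
As a cross-check on the summary table, the R-squared of a one-predictor model with a constant can be recomputed directly from the residuals. The sketch below uses NumPy only and small made-up arrays (not the WIKI Prices data); the helper name `r_squared_with_constant` is hypothetical and is not part of StatsModels.

```python
import numpy as np

def r_squared_with_constant(x, y):
    """Fit y = b0 + b1 * x by least squares; return (coefficients, R-squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with an explicit column of ones, similar to sm.add_constant
    design = np.column_stack([np.ones_like(x), x])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = np.sum(residuals ** 2)        # variation left unexplained by the fit
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return coefs, 1.0 - ss_res / ss_tot

# Made-up illustrative data, not taken from the demo
coefs_exact, r2_exact = r_squared_with_constant([1, 2, 3, 4, 5], [2.5, 4.5, 6.5, 8.5, 10.5])
coefs_noisy, r2_noisy = r_squared_with_constant([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
print(coefs_exact, r2_exact)
print(coefs_noisy, r2_noisy)
```

In the notebook itself, `model.rsquared` returns the unrounded value behind the 1.0 shown in the summary.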

# Multicollinearity - example 1

The WIKI Prices dataset is treated here as having low multicollinearity. That is, the exogenous predictors are assumed to be only weakly correlated.

In [8]:
X = data1[["close", "low", "open"]]
y = data1["high"]
X = sm.add_constant(X) 

In [9]:
ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
print(ols_results.summary())

                            OLS Regression Results                            
Dep. Variable:                   high   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 6.333e+07
Date:                Tue, 03 Oct 2017   Prob (F-statistic):               0.00
Time:                        19:56:57   Log-Likelihood:                -12758.
No. Observations:                9280   AIC:                         2.552e+04
Df Residuals:                    9276   BIC:                         2.555e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.2429      0.013     19.235      0.0

The condition number is small (378), which might indicate that
multicollinearity is weak.
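
A complementary way to see how strongly predictors are correlated is to inspect their pairwise correlation matrix. The following is a minimal sketch with made-up columns (not the WIKI Prices data), using `np.corrcoef`; the variable names are illustrative only.

```python
import numpy as np

# Made-up predictor columns for illustration (not the WIKI Prices data)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                          # exact multiple of x1: perfectly correlated
x3 = np.array([1.0, -1.0, -1.0, 1.0])  # uncorrelated with x1 by construction

predictors = np.column_stack([x1, x2, x3])
# rowvar=False treats each column as one variable
corr = np.corrcoef(predictors, rowvar=False)
print(np.round(corr, 3))
```

Applied to the notebook's data, `data1[["close", "low", "open"]].corr()` would report the pairwise correlations of the predictors used above, which can be read alongside the condition number.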

# Multicollinearity - example 2

The Longley dataset (http://www.statsmodels.org/dev/datasets/generated/longley.html) is well known to have high multicollinearity. That is, the exogenous predictors are highly correlated. This is problematic because it can affect the stability of our coefficient estimates as we make minor changes to model specification.
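
To illustrate what unstable coefficient estimates can look like, here is a toy sketch with made-up data (not the Longley dataset) in which a second predictor is almost a copy of the first. It uses NumPy's least-squares solver in place of StatsModels, and the helper `fit_ols` is hypothetical. Adding the near-duplicate predictor to the specification flips the sign of the first coefficient.

```python
import numpy as np

def fit_ols(design, y):
    """Least-squares coefficients for y ~ design (design already includes a constant)."""
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Made-up data: x2 differs from x1 only in the last observation
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 2.0, 3.0, 4.01])
y = np.array([2.0, 4.0, 6.0, 8.11])

const = np.ones_like(x1)
coefs_small = fit_ols(np.column_stack([const, x1]), y)     # y ~ const + x1
coefs_full = fit_ols(np.column_stack([const, x1, x2]), y)  # y ~ const + x1 + x2

print("x1 coefficient without x2:", coefs_small[1])
print("x1 coefficient with x2:   ", coefs_full[1])
```

In this toy case the coefficient on x1 is about 2.03 in the smaller model and about -9 once x2 is added, even though the two predictors are nearly identical.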

In [10]:
import statsmodels.api as sm
from statsmodels.datasets.longley import load_pandas
longley = load_pandas()  # load the bundled dataset once
y = longley.endog
X = longley.exog
X = sm.add_constant(X)

Fit and summary:

In [11]:
ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
print(ols_results.summary())

                            OLS Regression Results                            
Dep. Variable:                 TOTEMP   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Tue, 03 Oct 2017   Prob (F-statistic):           4.98e-10
Time:                        19:57:02   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -3.482e+06    8.9e+05     -3.911      0.0


## Condition number

One way to assess multicollinearity is to compute the condition number. Values over 20 are worrisome. The first step is to normalize the independent variables to have unit length:

In [12]:
import numpy as np
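norm_x = X.values.copy()  # copy so the original design matrix X is left untouched
for i, name in enumerate(X):
    if name == "const":
        continue  # leave the constant column unscaled
    norm_x[:, i] = X[name] / np.linalg.norm(X[name])  # scale the column to unit length
norm_xtx = np.dot(norm_x.T, norm_x)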

Then, we take the square root of the ratio of the largest to the smallest eigenvalue.

In [13]:
eigs = np.linalg.eigvals(norm_xtx)
condition_number = np.sqrt(eigs.max() / eigs.min())
print(condition_number)

56240.8709118
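
The two steps above can be wrapped into one reusable function. This is a sketch under stated assumptions: the function name `normalized_condition_number` and its `skip_columns` parameter are hypothetical, the input is a plain NumPy array instead of a DataFrame, and `np.linalg.eigvalsh` is swapped in for `np.linalg.eigvals` because the cross-product matrix is symmetric. The example matrices are made up.

```python
import numpy as np

def normalized_condition_number(design, skip_columns=()):
    """Condition number of a design matrix after scaling columns to unit length.

    Columns whose index is in skip_columns (for example the constant) stay unscaled.
    """
    norm_x = np.array(design, dtype=float)  # np.array copies, so the input is untouched
    for i in range(norm_x.shape[1]):
        if i in skip_columns:
            continue
        norm_x[:, i] = norm_x[:, i] / np.linalg.norm(norm_x[:, i])
    norm_xtx = norm_x.T @ norm_x
    # norm_xtx is symmetric, so eigvalsh returns real eigenvalues
    eigs = np.linalg.eigvalsh(norm_xtx)
    return np.sqrt(eigs.max() / eigs.min())

# Made-up matrices for illustration
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
near_collinear = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.01]])

cond_orthogonal = normalized_condition_number(orthogonal)
cond_collinear = normalized_condition_number(near_collinear)
print(cond_orthogonal, cond_collinear)
```

Orthogonal columns give a condition number of 1, while the nearly collinear pair lands far above the threshold of 20 mentioned earlier.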
