## OLS Python

#### Table of Contents
* [Setup](#Setup)
* [statsmodels.api](#statsmodels.api)
    - [sm Training](#sm-Training)
    - [sm Predicting](#sm-Predicting)
    - [sm Testing](#sm-Testing)
* [statsmodels.formula.api](#statsmodels.formula.api)
    - [smf Training](#smf-Training)
    - [smf Predicting](#smf-Predicting)
    - [smf Testing](#smf-Testing)
    - [smf Formulas](#smf-Formulas)
* [sklearn.linear_model](#sklearn.linear_model)
    - [sklearn Training](#sklearn-Training)
    - [sklearn Predicting](#sklearn-Predicting)
    - [sklearn Testing](#sklearn-Testing)

We are going to estimate the same linear regression model with three different functions.
The packages they come from illustrate what each field cares about.
`statsmodels` is geared toward statistics and econometrics, so its output is much more formal (standard errors, test statistics, and fit diagnostics).
We will demonstrate both its base API and its R-like formula API.
`sklearn` is geared toward machine learning, which cares primarily about the prediction $\hat{y}$.

We are going to use the following regression:

$$
\begin{align*}
    \%\Delta rGDP_{i,t} = & \alpha_t + UrateBin_{i,t}^\prime\beta + LFPR_{i,t}\gamma + LFPR_{i,t}UrateBin_{i,t}\delta +\\
    & EmpPerEstab_{i,t}\zeta + EmpPerEstab_{i,t}^2\eta + \epsilon_{i,t}
\end{align*}
$$

*********
# Setup
[TOP](#OLS-Python)

In [1]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_pickle('C:/Users/hubst/Econ490_group/class_data.pkl')
df.columns

Index(['GeoName', 'pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

All but `statsmodels.formula` require the features and labels to be separate arguments. So, let's create them!

**IMPORTANT** The features matrix is the **design matrix**.
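
To make the term concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data and its values are hypothetical, not taken from `class_data.pkl`). It shows that the design matrix is simply the numeric table the estimator sees: a constant column, the continuous features, and one dummy per non-baseline category.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "lfpr": [60.0, 65.0, 70.0, 62.0],
    "urate_bin": ["higher", "lower", "similar", "lower"],
})

# One dummy per category; drop_first removes the first (alphabetical) level as the baseline
dummies = pd.get_dummies(toy["urate_bin"], prefix="urate", drop_first=True).astype(float)

# Design matrix: constant + continuous feature + dummies
design = pd.concat([toy[["lfpr"]], dummies], axis=1)
design.insert(0, "const", 1.0)

print(design.columns.tolist())
```

The dropped level (`higher` here) is absorbed into the constant, which is why only two of the three categories appear as columns.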

In [3]:
y = df['pct_d_rgdp']
x = df.drop(columns = 'pct_d_rgdp')

# Creating dummies
x = x.join([pd.get_dummies(x['year'], prefix = 'year', drop_first = True),
          pd.get_dummies(x['urate_bin'], prefix = 'urate', drop_first = True)]).drop(columns = ['year', 'urate_bin'])
x = sm.add_constant(x)

# Creating interactions
x['lfpr:urate_lower'] = x.lfpr * x.urate_lower
x['lfpr:urate_similar'] = x.lfpr * x.urate_similar
x['emp_estabs_sq'] = x.emp_estabs**2

# Dropping features we do not want to use
x.drop(columns = ['GeoName', 'pos_net_jobs', 'estabs_entry_rate', 'estabs_exit_rate',
                  'pop', 'pop_pct_black', 'pop_pct_hisp', 'density'], inplace = True)

# Sorting the columns for output
x.sort_index(axis = 'columns', inplace = True)

# Inspecting the final design matrix columns
x.columns

Index(['const', 'emp_estabs', 'emp_estabs_sq', 'lfpr', 'lfpr:urate_lower',
       'lfpr:urate_similar', 'urate_lower', 'urate_similar', 'year_2003',
       'year_2004', 'year_2005', 'year_2006', 'year_2007', 'year_2008',
       'year_2009', 'year_2010', 'year_2011', 'year_2012', 'year_2013',
       'year_2014', 'year_2015', 'year_2016', 'year_2017', 'year_2018'],
      dtype='object')

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)
print(x_train.shape)
print(y_train.shape, '\n')

print(x_test.shape)
print(y_test.shape)

(33418, 24)
(33418,) 

(16709, 24)
(16709,)


*********
# statsmodels.api
[TOP](#OLS-Python)

`statsmodels.api`'s linear regression function is `OLS`, written in capital letters.

This package is also one of the few that use the order (y, x) instead of (x, y). Be careful out there! Read the documentation when available!

## sm Training 
[TOP](#OLS-Python)

Fitting a `statsmodels` model proceeds as follows:

1. call the desired function with `y` and `x` arguments
2. chain the `.fit()` method

This is different from `sklearn`.

In [5]:
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.030
Model:                            OLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     44.42
Date:                Tue, 23 Feb 2021   Prob (F-statistic):          5.23e-198
Time:                        14:15:31   Log-Likelihood:            -1.2100e+05
No. Observations:               33418   AIC:                         2.420e+05
Df Residuals:                   33394   BIC:                         2.422e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -0.5453      0

In [6]:
print(fit_sm.summary2())

                  Results: Ordinary least squares
Model:              OLS              Adj. R-squared:     0.029      
Dependent Variable: pct_d_rgdp       AIC:                242038.7314
Date:               2021-02-23 14:15 BIC:                242240.7358
No. Observations:   33418            Log-Likelihood:     -1.2100e+05
Df Model:           23               F-statistic:        44.42      
Df Residuals:       33394            Prob (F-statistic): 5.23e-198  
R-squared:          0.030            Scale:              81.791     
--------------------------------------------------------------------
                     Coef.  Std.Err.    t     P>|t|   [0.025  0.975]
--------------------------------------------------------------------
const               -0.5453   0.6370  -0.8562 0.3919 -1.7938  0.7031
emp_estabs          -0.0495   0.0251  -1.9763 0.0481 -0.0987 -0.0004
emp_estabs_sq        0.0007   0.0007   1.0757 0.2821 -0.0006  0.0020
lfpr                 0.0431   0.0079   5.4432 0.0000 

In [7]:
fit_sm.params
fit_sm.pvalues
fit_sm.resid
fit_sm.conf_int(alpha = 0.01)
fit_sm.rsquared

0.029684707680828093

## sm Predicting 

In [8]:
y_hat_sm = fit_sm.predict(x_test)
y_hat_sm.head()

fips   year
17195  2003    2.228947
30025  2006    5.982549
48279  2008    1.897888
37011  2002    1.666423
19111  2014    1.271692
dtype: float64

## sm Testing

In [9]:
rmse_sm = np.sqrt(np.mean((y_test - y_hat_sm)**2))
rmse_sm

9.174408715923855

How good is this fit, you ask?
Well, it is a little difficult to say without a comparison.
A good starting place is to compare this fit against the null model.
Then we can determine the percent improvement we obtain from it.

In [10]:
# null model
rmse_null = np.sqrt(  np.mean((y_test - np.mean(y_train))**2)  )
rmse_null

9.295172646932564

In [11]:
print(round((rmse_null - rmse_sm)/rmse_null*100, 2), '%', sep = '')

1.3%


Only 1.3%!?
That is not much at all!

We should note that if we had made a 100% improvement, then we would have interpolated the data (overfit it).
I would say if we could improve upon the null model by 10%, then that is something to be excited about.
Let's see if we can get there this semester.
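
The null-model comparison above can be wrapped in a small helper. This is a sketch with a hypothetical function name, `pct_improvement_over_null`, and made-up numbers. It reproduces the two boundary cases: predicting the training mean gives a 0% improvement, and predicting every test value exactly gives 100%.

```python
import numpy as np

def pct_improvement_over_null(y_train, y_test, y_hat):
    """Percent reduction in test RMSE relative to predicting the training mean."""
    y_train, y_test, y_hat = map(np.asarray, (y_train, y_test, y_hat))
    rmse_model = np.sqrt(np.mean((y_test - y_hat) ** 2))
    rmse_null = np.sqrt(np.mean((y_test - y_train.mean()) ** 2))
    return 100 * (rmse_null - rmse_model) / rmse_null

# Hypothetical numbers, for illustration only
y_tr = [1.0, 2.0, 3.0]   # training mean is 2.0
y_te = [1.0, 3.0]

print(pct_improvement_over_null(y_tr, y_te, y_hat=[2.0, 2.0]))  # predicting the mean: 0.0
print(pct_improvement_over_null(y_tr, y_te, y_hat=[1.0, 3.0]))  # perfect predictions: 100.0
```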

***
# statsmodels.formula.api
[TOP](#OLS-Python)

This works just like `R`!

In [12]:
df.columns

Index(['GeoName', 'pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

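So here is something cool about `train_test_split()` with a specified `random_state`: splitting `df` with the same settings reproduces exactly the rows we obtained when splitting `x` and `y`.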

In [13]:
df_train, df_test = train_test_split(df, train_size = 2/3, random_state = 490)
all(x_train.index == df_train.index)

True
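
Why does this work? With the same `random_state` and the same number of rows, `train_test_split()` draws the same shuffle, no matter which columns are being split. A minimal sketch on a hypothetical ten-row frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a non-trivial index
toy = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) ** 2},
                   index=[f"row{i}" for i in range(10)])

# Split the full frame, then split a single column with the same settings
toy_train, toy_test = train_test_split(toy, train_size=0.7, random_state=490)
b_train, b_test = train_test_split(toy["b"], train_size=0.7, random_state=490)

print(toy_train.index.equals(b_train.index))   # True: same rows, same order
```

Changing the seed or the number of rows in only one of the two calls would break this correspondence, so the check is worth keeping.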

## smf Training
[TOP](#OLS-Python)

In [14]:
fit_smf = smf.ols(formula = 'pct_d_rgdp ~ emp_estabs + I(emp_estabs**2) + C(urate_bin)*lfpr + C(year)', data = df_train).fit()
print(fit_smf.summary2())

                       Results: Ordinary least squares
Model:                 OLS                 Adj. R-squared:        0.029      
Dependent Variable:    pct_d_rgdp          AIC:                   242038.7314
Date:                  2021-02-23 14:15    BIC:                   242240.7358
No. Observations:      33418               Log-Likelihood:        -1.2100e+05
Df Model:              23                  F-statistic:           44.42      
Df Residuals:          33394               Prob (F-statistic):    5.23e-198  
R-squared:             0.030               Scale:                 81.791     
-----------------------------------------------------------------------------
                              Coef.  Std.Err.    t     P>|t|   [0.025  0.975]
-----------------------------------------------------------------------------
Intercept                    -0.5453   0.6370  -0.8562 0.3919 -1.7938  0.7031
C(urate_bin)[T.lower]         0.8578   0.8794   0.9755 0.3293 -0.8658  2.5815
C(urate_b

## smf Predicting
[TOP](#OLS-Python)

In [15]:
yhat_smf = fit_smf.predict(df_test)

## smf Testing
[TOP](#OLS-Python)

In [16]:
rmse_smf = np.sqrt(  np.mean((yhat_smf - y_test)**2)  )
rmse_smf

9.174408715923862

In [17]:
rmse_sm

9.174408715923855

## smf Formulas
[TOP](#OLS-Python)

Here are a few more examples on how to use formulas in `statsmodels.formula.api`:

In [18]:
df_train.columns

Index(['GeoName', 'pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

In [19]:
# no intercept
smf.ols(formula = 'pct_d_rgdp ~ density + pop - 1', data = df_train).fit().params

density    0.000024
pop        0.000002
dtype: float64

In [20]:
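# only specific levels
# NOTE: isin([range(2007,2010)]) compares year against a list holding one range object,
# so it appears to match no rows; passing the range directly, year.isin(range(2007, 2010)),
# would flag the years 2007 through 2009
smf.ols(formula = "pct_d_rgdp ~ I(year == 2003) + I(year.isin([range(2007,2010)]))", data = df_train).fit().params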

Intercept                                    1.872175
I(year == 2003)[T.True]                      1.159957
I(year.isin([range(2007, 2010)]))[T.True]    0.000000
dtype: float64
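
The coefficient of exactly zero on the last term suggests that `isin([range(2007, 2010)])` matched no rows. Here is a minimal sketch, assuming the intent was to flag the years 2007 through 2009, in which the range is passed to `isin()` directly. The `toy` data are simulated so that the outcome is 2.0 lower in those years; nothing here comes from `class_data.pkl`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical panel: 20 observations for each year from 2005 to 2011
toy = pd.DataFrame({"year": np.repeat(np.arange(2005, 2012), 20)})
in_window = toy["year"].isin(range(2007, 2010))          # True for 2007, 2008, 2009
toy["y"] = 1.0 - 2.0 * in_window + rng.normal(scale=0.1, size=len(toy))

# Pass the range itself to isin(), not a list that contains the range
fit_toy = smf.ols("y ~ I(year.isin(range(2007, 2010)))", data=toy).fit()

print(fit_toy.params)   # intercept near 1, window effect near -2
```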

*********
# sklearn.linear_model
[TOP](#OLS-Python)

`sklearn` is the best machine learning package for everything *other than* neural networks.
The lack of statistical detail in its OLS function shows the difference in priorities between data scientists and statisticians/econometricians.

## sklearn Training 
[TOP](#OLS-Python)

Fitting `sklearn` functions proceeds as follows:

1. call the desired function without arguments
2. chain the `.fit()` method with `x` and `y` arguments

This is different from `statsmodels`.

In [21]:
fit_sk = LinearRegression().fit(x_train, y_train)
print(fit_sk.score(x_train, y_train)) # r_sq
fit_sk.coef_ # unnamed coefficients

0.029684707680828093


array([ 0.00000000e+00, -4.95321015e-02,  7.02136460e-04,  4.31047751e-02,
        7.24794039e-03, -2.57808743e-02,  8.57828839e-01,  2.60672027e+00,
       -2.79203830e-02,  9.64688694e-02,  4.15018524e-01,  2.27628941e+00,
       -1.71496729e+00, -2.30414244e+00, -3.99878144e+00,  3.82062093e-02,
       -1.07687250e+00, -1.71948367e+00, -5.10131372e-01, -1.58904658e+00,
       -1.32647063e+00, -2.71064185e+00, -1.61600204e+00, -7.26279146e-01])
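
Two things in this output are worth unpacking. The leading `0` is the weight on the `const` column: `LinearRegression` fits its own intercept by default (`fit_intercept=True`) and stores it in `intercept_`, so a constant column carries no extra information. The coefficients can also be given names by pairing them with the column labels. A sketch on simulated data (all `*_toy` names and the true coefficients are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X_toy = pd.DataFrame({"const": 1.0,
                      "x1": rng.normal(size=100),
                      "x2": rng.normal(size=100)})
y_toy = 0.5 + 1.5 * X_toy["x1"] - 2.5 * X_toy["x2"] + rng.normal(scale=0.1, size=100)

# Step 1: construct the estimator without data; step 2: fit with (x, y)
fit_toy = LinearRegression().fit(X_toy, y_toy)

coefs = pd.Series(fit_toy.coef_, index=X_toy.columns)   # attach column names
print(coefs)
print(fit_toy.intercept_)                               # the intercept is stored separately
```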

## sklearn Predicting
[TOP](#OLS-Python)

In [22]:
yhat_sk = fit_sk.predict(x_test)

## sklearn Testing
[TOP](#OLS-Python)

In [23]:
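# argument order is (y_true, y_pred); taking the square root ourselves avoids the
# `squared` argument, which newer sklearn releases no longer accept
rmse_sk = np.sqrt(mean_squared_error(y_test, yhat_sk))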
rmse_sk

9.174408715923867

In [24]:
rmse_sm

9.174408715923855

In [25]:
rmse_smf

9.174408715923862
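
The three RMSEs agree in their first 14 significant digits. The trailing digits differ, presumably because each package carries out the least-squares arithmetic in a slightly different order. A quick check with a tolerance, using the values copied from the outputs above:

```python
import math

# RMSE values copied from the notebook outputs
rmse_values = {
    "statsmodels.api": 9.174408715923855,
    "statsmodels.formula.api": 9.174408715923862,
    "sklearn": 9.174408715923867,
}

reference = rmse_values["statsmodels.api"]
for name, value in rmse_values.items():
    # Compare with a relative tolerance instead of ==, which would fail here
    print(name, math.isclose(value, reference, rel_tol=1e-12))
```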