## Workshop - OLS Python

In this workshop, we are going to:

1. perform backward selection on the class data set
   1. fit the full model with $\%\Delta rGDP$ as the label
   2. remove the feature with the highest p-value
   3. refit the model
   4. repeat steps 2 and 3 until all features have p-values below 0.05
2. evaluate the model performance

*Do not use interactions or polynomial terms in this workshop.*

# Preliminaries

- Load any necessary packages and/or functions
    * For backward selection, I recommend using `statsmodels.api` instead of `statsmodels.formula.api`. Your choice.
- Load in the class data
- Define `x` and `y`
- Create a train-test split with
    * training size of two-thirds
    * random state of 490

In [1]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [23]:
df = pd.read_pickle('C:/Users/hubst/Econ490_group/class_data.pkl')

In [24]:
y = df['pct_d_rgdp']
x = df.drop(columns = 'pct_d_rgdp')

In [None]:
x = x.join([pd.get_dummies(x['year'], prefix = 'year', drop_first = True),
          pd.get_dummies(x['urate_bin'], prefix = 'urate', drop_first = True)]).drop(columns = ['year', 'urate_bin'])
x = sm.add_constant(x)

x['lfpr:urate_lower'] = x.lfpr * x.urate_lower
x['lfpr:urate_similar'] = x.lfpr * x.urate_similar
x['emp_estabs_sq'] = x.emp_estabs**2

x.drop(columns = ['GeoName', 'pos_net_jobs', 'estabs_entry_rate', 'estabs_exit_rate',
                  'pop', 'pop_pct_black', 'pop_pct_hisp', 'density'], inplace = True)

x.sort_index(axis = 'columns', inplace = True)

x.columns

In [25]:
x.drop(columns = ['GeoName', 'urate_bin'], inplace = True)
x.drop(columns = ['pop_pct_black', 'density', 'emp_estabs'], inplace = True)

In [26]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

*****
# Backward Selection 

In [27]:
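# Drop features excluded from the full model; errors = 'ignore' skips any column already removed from x
x_train = x_train.drop(columns = ['pop_pct_black', 'density', 'emp_estabs', 'year_2003'], errors = 'ignore')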
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                                 OLS Regression Results                                
Dep. Variable:             pct_d_rgdp   R-squared (uncentered):                   0.068
Model:                            OLS   Adj. R-squared (uncentered):              0.068
Method:                 Least Squares   F-statistic:                              350.7
Date:                Tue, 23 Feb 2021   Prob (F-statistic):                        0.00
Time:                        14:54:53   Log-Likelihood:                     -1.2105e+05
No. Observations:               33418   AIC:                                  2.421e+05
Df Residuals:                   33411   BIC:                                  2.422e+05
Df Model:                           7                                                  
Covariance Type:            nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------

**********
# Testing

Evaluate two RMSEs:

1. null model
2. backward-selected model

Then, determine the percent improvement of the backward-selected model from the null model.
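
As a rough sketch of the comparison described above, assuming the null model predicts the training-set mean of the label for every test observation. The arrays and the helper `rmse` are made up for illustration and are not the class data:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Made-up numbers for illustration only.
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_test_demo = np.array([2.0, 4.0])
y_hat_demo = np.array([2.5, 3.5])  # stand-in for model predictions on the test set

# Null model (assumption): predict the training mean for every test observation.
y_hat_null = np.full_like(y_test_demo, y_train_demo.mean())

rmse_null = rmse(y_test_demo, y_hat_null)
rmse_model = rmse(y_test_demo, y_hat_demo)

# Percent improvement of the model over the null model (positive means lower RMSE).
pct_improvement = (rmse_null - rmse_model) / rmse_null * 100
print(rmse_null, rmse_model, pct_improvement)
```

With the workshop objects, `y_train`, `y_test`, and `y_hat_sm` would take the place of the demo arrays.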

In [32]:
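# Predict on the test set, using the same columns the model was fit on
y_hat_sm = fit_sm.predict(x_test[x_train.columns])
# Test-set RMSE of the fitted model
rmse_sm = np.sqrt(np.mean((y_test - y_hat_sm)**2))
rmse_sm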

9.15424055100316

In [22]:
x_test.shape

(16709, 10)