## Support Vector Machines - Part 1

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Null Model](#Null-Model)
- [10% Correlation](#10%-Correlation)
- [5% Correlation](#5%-Correlation)
- [1% Correlation](#1%-Correlation)
- [Comparison](#Comparison)


Takeaways from this script:

1. Copy and paste your code
2. Regularization helps with weak predictors

***
# Preliminaries
[TOP](#Support-Vector-Machines---Part-1)

Here we have our usual setup.

However, this time we are going to compare choosing features based upon their correlation with the label `pos_net_jobs`.
We will do so at

- 10%
- 5%
- 1%

Because the feature set changes with each threshold, the train-test split of the features is postponed until after the features are selected.

In [29]:
# utilities
import numpy as np
import pandas as pd

# processing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

#algorithms
from sklearn.svm import LinearSVC

In [30]:
df = pd.read_pickle('C:/Users/hubst/Econ490_group/class_data.pkl')
df_prepped = df.drop(columns = ['urate_bin', 'year', 'GeoName']).join([
    pd.get_dummies(df['urate_bin'], drop_first = True),
    pd.get_dummies(df['year'], drop_first = True)
])

**********
# Null Model 
[TOP](#Support-Vector-Machines---Part-1)

In [31]:
y = df_prepped['pos_net_jobs'].astype(float)
y_train, y_test = train_test_split(y,
                                  train_size = 2/3,
                                  random_state = 490)

In [32]:
yhat_null = y_train.value_counts().index[0]
acc_null = np.mean(yhat_null == y_test)
acc_null

0.562391525525166
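
The null model above always predicts the most common class in the training labels. As a hedged illustration on made-up labels (the names `y_train_demo` and `y_test_demo` and their values are assumptions, not the class data), the same baseline can be computed by hand or with scikit-learn's `DummyClassifier`:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration only: 6 ones and 4 zeros in training
y_train_demo = pd.Series([1.0] * 6 + [0.0] * 4)
y_test_demo = pd.Series([1.0, 1.0, 0.0, 1.0, 0.0])

# Majority class by hand, mirroring the cell above
majority = y_train_demo.value_counts().index[0]
acc_by_hand = np.mean(majority == y_test_demo)

# Same baseline via scikit-learn; this strategy ignores the features,
# so a placeholder column of zeros is enough
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
acc_dummy = dummy.score(np.zeros((len(y_test_demo), 1)), y_test_demo)

print(acc_by_hand, acc_dummy)
```

Both numbers should agree, since each one is the share of test labels equal to the training majority class.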

*****
# 10% Correlation
[TOP](#Support-Vector-Machines---Part-1)

First, let's produce a correlation matrix with the data frame method `.corr()`

In [33]:
df_prepped.corr()

| | pct_d_rgdp | pos_net_jobs | emp_estabs | estabs_entry_rate | estabs_exit_rate | pop | pop_pct_black | pop_pct_hisp | lfpr | density | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pct_d_rgdp | 1.0 | 0.095578 | -0.020888 | 0.107552 | -0.016574 | 0.000396 | -0.045466 | 0.042703 | 0.086249 | -0.001632 | ... | -0.075401 | 0.022368 | 0.001775 | -0.018607 | 0.015269 | -0.016142 | -0.006248 | -0.043309 | -0.016039 | 0.009017 |
| pos_net_jobs | 0.095578 | 1.0 | 0.084148 | 0.169942 | -0.142796 | 0.060543 | -0.031478 | 0.062863 | 0.044032 | 0.029483 | ... | -0.204247 | -0.119566 | 0.005404 | 0.055287 | 0.018699 | 0.05753 | 0.070377 | 0.03533 | 0.005861 | 0.045474 |
| emp_estabs | -0.020888 | 0.084148 | 1.0 | -0.096189 | -0.132596 | 0.265142 | 0.209641 | 0.046165 | -0.097833 | 0.145808 | ... | -0.020331 | -0.022647 | -0.014737 | -0.000619 | 0.001938 | 0.010087 | 0.019693 | 0.016318 | 0.019925 | 0.029483 |
| estabs_entry_rate | 0.107552 | 0.169942 | -0.096189 | 1.0 | 0.378506 | 0.119729 | -0.03432 | 0.090729 | 0.008339 | 0.059172 | ... | -0.097782 | -0.079984 | -0.067017 | -0.047806 | -0.069377 | -0.063668 | -0.06074 | -0.062557 | -0.098154 | -0.126031 |
| estabs_exit_rate | -0.016574 | -0.142796 | -0.132596 | 0.378506 | 1.0 | 0.083556 | -0.025939 | 0.059833 | -0.041004 | 0.047255 | ... | 0.13182 | 0.014639 | 0.008815 | -0.053665 | -0.056253 | -0.107329 | -0.109617 | -0.12636 | -0.076538 | -0.120902 |
| pop | 0.000396 | 0.060543 | 0.265142 | 0.119729 | 0.083556 | 1.0 | 0.090054 | 0.198232 | -0.005642 | 0.338012 | ... | -0.000306 | 0.000889 | 0.000582 | 0.002071 | 0.002534 | 0.003024 | 0.003618 | 0.0042 | 0.005113 | 0.005365 |
| pop_pct_black | -0.045466 | -0.031478 | 0.209641 | -0.03432 | -0.025939 | 0.090054 | 1.0 | -0.088277 | -0.420904 | 0.106843 | ... | -0.000788 | -0.000718 | 0.001806 | 0.001105 | 0.003142 | 0.004833 | 0.001927 | 0.004551 | 0.006433 | 0.007159 |
| pop_pct_hisp | 0.042703 | 0.062863 | 0.046165 | 0.090729 | 0.059833 | 0.198232 | -0.088277 | 1.0 | -0.044089 | 0.085918 | ... | -0.000996 | -0.000273 | 0.003606 | 0.007185 | 0.00998 | 0.012258 | 0.018469 | 0.020371 | 0.023543 | 0.026345 |
| lfpr | 0.086249 | 0.044032 | -0.097833 | 0.008339 | -0.041004 | -0.005642 | -0.420904 | -0.044089 | 1.0 | -0.012269 | ... | 0.019472 | -0.017164 | -0.020647 | -0.019918 | -0.024598 | -0.024735 | -0.019397 | -0.012369 | -0.007152 | 0.006658 |
| density | -0.001632 | 0.029483 | 0.145808 | 0.059172 | 0.047255 | 0.338012 | 0.106843 | 0.085918 | -0.012269 | 1.0 | ... | -0.000227 | 0.000284 | 0.000175 | 0.000864 | 0.001047 | 0.001222 | 0.001427 | 0.001607 | 0.00192 | 0.001914 |


This is far too much information. 
We really only want the values for `pos_net_jobs`.

Remember that Python is zero-indexed, so `pos_net_jobs`, the second column, is at position 1.

In [34]:
df_prepped.corr().iloc[:, 1]

pct_d_rgdp           0.095578
pos_net_jobs         1.000000
emp_estabs           0.084148
estabs_entry_rate    0.169942
estabs_exit_rate    -0.142796
pop                  0.060543
pop_pct_black       -0.031478
pop_pct_hisp         0.062863
lfpr                 0.044032
density              0.029483
lower                0.054305
similar              0.002937
2003                 0.021893
2004                 0.052089
2005                 0.032065
2006                 0.097031
2007                -0.030403
2008                -0.037354
2009                -0.204247
2010                -0.119566
2011                 0.005404
2012                 0.055287
2013                 0.018699
2014                 0.057530
2015                 0.070377
2016                 0.035330
2017                 0.005861
2018                 0.045474
Name: pos_net_jobs, dtype: float64
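
Positional indexing with `.iloc[:, 1]` works only while the label stays in the second column. As a small sketch on synthetic data (the frame `demo` and its column names are assumptions for illustration), the same series can be pulled by name, or computed directly with `DataFrame.corrwith`:

```python
import numpy as np
import pandas as pd

# Synthetic frame with the label deliberately placed in the second column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "label", "b"])

by_position = demo.corr().iloc[:, 1]          # what the cell above does
by_name = demo.corr()["label"]                # same column, selected by name
via_corrwith = demo.corrwith(demo["label"])   # skips building the full matrix

print(by_name)
```

Selecting by name keeps working if columns are reordered, which is the main reason one might prefer it.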

Now we are going to select the variables that have at least a 10% correlation with our label. 
Specifically, we want the absolute value of the correlation to be weakly greater than 10%, that is, at least 0.10.

In [35]:
pos_net_jobs_cor = np.abs(df_prepped.corr().iloc[:, 1])
vrbls = pos_net_jobs_cor[pos_net_jobs_cor >= 0.1].index
vrbls

Index(['pos_net_jobs', 'estabs_entry_rate', 'estabs_exit_rate', 2009, 2010], dtype='object')

Neat.

Now we can select the variables that we want.
Note that `pos_net_jobs` itself is in the list, because its correlation with itself is 1.

In [36]:
df_prepped2 = df_prepped.loc[:, vrbls]

In [38]:
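# note: x is built from df_prepped, so it keeps every feature rather than only the columns selected into df_prepped2
x = df_prepped.drop(columns = 'pos_net_jobs')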
x_train, x_test = train_test_split(x, 
                                  train_size = 2/3, 
                                  random_state = 490)
ss = StandardScaler()
x_train_std = pd.DataFrame(ss.fit(x_train).transform(x_train),
                          columns = x_train.columns,
                          index = x_train.index)
x_test_std = pd.DataFrame(ss.fit(x_test).transform(x_test),
                          columns = x_test.columns,
                          index = x_test.index)
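
The cell above builds `x` from `df_prepped`. As a hedged sketch of one way the correlation filter could feed the split, assuming the intent is to train only on the selected columns, the example below uses synthetic data (the frame `demo` and the columns `strong`, `noise`, and `label` are hypothetical). It also fits the scaler on the training rows only and reuses it for the test rows, a common convention that differs from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature, one pure-noise feature
rng = np.random.default_rng(490)
n = 300
demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["label"] = (demo["strong"] + 0.5 * rng.normal(size=n) > 0).astype(float)

# Keep columns with |correlation| >= 0.1, then drop the label itself
cor = demo.corr()["label"].abs()
keep = cor[cor >= 0.1].index.drop("label")
x_demo = demo.loc[:, keep]

x_tr, x_te = train_test_split(x_demo, train_size=2/3, random_state=490)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(x_tr)
x_tr_std = pd.DataFrame(scaler.transform(x_tr),
                        columns=x_tr.columns, index=x_tr.index)
x_te_std = pd.DataFrame(scaler.transform(x_te),
                        columns=x_te.columns, index=x_te.index)

print(list(x_tr_std.columns))
```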

Now let's use cross-validation to find the optimal value of `C`.

In [40]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20)
}
svc_cv = LinearSVC(dual = False)
grid_search = GridSearchCV(svc_cv, param_grid,
                          cv = 5, 
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_10 = grid_search.best_params_
best_10

Wall time: 15.1 s


{'C': 0.008858667904100823}
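
The grid `10.0**np.linspace(-5, 2, num = 20)` is 20 values spaced evenly on a log scale from $10^{-5}$ to $10^{2}$. As a quick check, `np.logspace` produces the same grid, and the value selected above is its ninth point (index 8):

```python
import numpy as np

grid_a = 10.0 ** np.linspace(-5, 2, num=20)   # as written in the cell above
grid_b = np.logspace(-5, 2, num=20)           # equivalent NumPy shorthand

print(np.allclose(grid_a, grid_b))
print(grid_a[8])
```

Because the grid is discrete, the "optimal" `C` is only the best of these 20 candidates, not a continuous optimum.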

Alternatively, a fixed setting such as `dual` can go in the parameter grid as a single-element list:

In [41]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20),
    'dual': [False]
}
svc_cv = LinearSVC()
grid_search = GridSearchCV(svc_cv, param_grid,
                          cv = 5, 
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_10 = grid_search.best_params_
best_10

Wall time: 14.5 s


{'C': 0.008858667904100823, 'dual': False}

Now we refit the model on the full training data with the optimal value of `C` and find its accuracy on the testing data.

In [42]:
svc_10 = LinearSVC(C = best_10['C'], 
                         dual = False).fit(x_train_std, y_train)
acc_10 = svc_10.score(x_test_std, y_test)
acc_10

0.6808306900472799
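
With its default `refit=True`, `GridSearchCV` already refits the best model on the full training data and exposes it as `best_estimator_`. The sketch below, on synthetic data from `make_classification` (an assumption for illustration, not the class data), compares that built-in refit with a manual refit like the one above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification problem
X, y = make_classification(n_samples=400, n_features=6, random_state=490)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=2/3,
                                          random_state=490)

search = GridSearchCV(LinearSVC(dual=False),
                      {"C": np.logspace(-5, 2, num=8)},
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Manual refit with the selected C, as in the cell above
manual = LinearSVC(C=search.best_params_["C"], dual=False).fit(X_tr, y_tr)
acc_manual = manual.score(X_te, y_te)

# Built-in refit: search.score uses best_estimator_ with the chosen scorer
acc_refit = search.score(X_te, y_te)

print(acc_manual, acc_refit)
```

The two accuracies should match, so the manual refit is mainly useful for making the step explicit.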

*****
# 5% Correlation
[TOP](#Support-Vector-Machines---Part-1)

Let's do the same thing with a threshold of weakly greater than 5%.

In [43]:
pos_net_jobs_cor = np.abs(df_prepped.corr().iloc[:, 1])
vrbls = pos_net_jobs_cor[pos_net_jobs_cor >= 0.05].index
df_prepped2 = df_prepped.loc[:, vrbls]

In [44]:
# note: as before, x is built from df_prepped rather than the filtered df_prepped2
x = df_prepped.drop(columns = 'pos_net_jobs')
x_train, x_test = train_test_split(x, 
                                  train_size = 2/3, 
                                  random_state = 490)
ss = StandardScaler()
x_train_std = pd.DataFrame(ss.fit(x_train).transform(x_train),
                          columns = x_train.columns,
                          index = x_train.index)
x_test_std = pd.DataFrame(ss.fit(x_test).transform(x_test),
                          columns = x_test.columns,
                          index = x_test.index)

Now let's use cross-validation to find the optimal value of `C`.

In [45]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20)
}
svc_cv = LinearSVC(dual = False)
grid_search = GridSearchCV(svc_cv, param_grid,
                          cv = 5, 
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_5 = grid_search.best_params_
best_5

Wall time: 14.6 s


{'C': 0.008858667904100823}

Now we refit the model on the full training data with the optimal value of `C` and find its accuracy on the testing data.

In [46]:
svc_5 = LinearSVC(C = best_5['C'], 
                         dual = False).fit(x_train_std, y_train)
acc_5 = svc_5.score(x_test_std, y_test)
acc_5

0.6808306900472799

*copy and paste more...*
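
The steps repeated for each threshold could also be wrapped in a function. The sketch below is one possible arrangement on synthetic data, not the notebook's own code: the helper `fit_at_threshold`, the frame `demo`, and its column names are all hypothetical. Unlike the cells above, it trains on the filtered columns only and fits the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def fit_at_threshold(df, label, threshold, seed=490):
    """Hypothetical helper: filter features by |corr| with the label,
    tune C by 5-fold CV, and return (features, best C, test accuracy)."""
    # As in the notebook, correlations are computed on the full data
    cor = df.corr()[label].abs()
    features = cor[cor >= threshold].index.drop(label)
    x_tr, x_te, y_tr, y_te = train_test_split(
        df[features], df[label], train_size=2/3, random_state=seed)
    scaler = StandardScaler().fit(x_tr)   # fit on training rows only
    search = GridSearchCV(LinearSVC(dual=False),
                          {"C": np.logspace(-5, 2, num=20)},
                          cv=5, scoring="accuracy")
    search.fit(scaler.transform(x_tr), y_tr)
    acc = search.score(scaler.transform(x_te), y_te)
    return list(features), search.best_params_["C"], acc

# Synthetic data: x1 is informative, x2 weakly so, x3 and x4 are noise
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
demo["label"] = (demo["x1"] + 0.3 * demo["x2"]
                 + rng.normal(size=n) > 0).astype(float)

results = {t: fit_at_threshold(demo, "label", t) for t in (0.10, 0.05, 0.01)}
for t, (features, best_c, acc) in results.items():
    print(t, features, best_c, round(acc, 4))
```

A loop like this makes it harder for one threshold's cell to drift out of sync with the others, at the cost of hiding the individual steps.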

*****
# 1% Correlation
[TOP](#Support-Vector-Machines---Part-1)

Let's do the same thing with a threshold of weakly greater than 1%.

In [47]:
pos_net_jobs_cor = np.abs(df_prepped.corr().iloc[:, 1])
vrbls = pos_net_jobs_cor[pos_net_jobs_cor >= 0.01].index
df_prepped2 = df_prepped.loc[:, vrbls]

In [48]:
# note: as before, x is built from df_prepped rather than the filtered df_prepped2
x = df_prepped.drop(columns = 'pos_net_jobs')
x_train, x_test = train_test_split(x, 
                                  train_size = 2/3, 
                                  random_state = 490)
ss = StandardScaler()
x_train_std = pd.DataFrame(ss.fit(x_train).transform(x_train),
                          columns = x_train.columns,
                          index = x_train.index)
x_test_std = pd.DataFrame(ss.fit(x_test).transform(x_test),
                          columns = x_test.columns,
                          index = x_test.index)

Now let's use cross-validation to find the optimal value of `C`.

In [49]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20)
}
svc_cv = LinearSVC(dual = False)
grid_search = GridSearchCV(svc_cv, param_grid,
                          cv = 5, 
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_1 = grid_search.best_params_
best_1

Wall time: 14.6 s


{'C': 0.008858667904100823}

Now we refit the model on the full training data with the optimal value of `C` and find its accuracy on the testing data.

In [50]:
svc_1 = LinearSVC(C = best_1['C'], 
                         dual = False).fit(x_train_std, y_train)
acc_1 = svc_1.score(x_test_std, y_test)
acc_1

0.6808306900472799

*copy and paste more...*

********************
# Comparison 
[TOP](#Support-Vector-Machines---Part-1)

Print the percent improvement in accuracy over the null model for each of the three models. 
Which model was the best performer?

In [51]:
pct_10 = 100*(acc_10 - acc_null)/acc_null
pct_5 = 100*(acc_5 - acc_null)/acc_null
pct_1 = 100*(acc_1 - acc_null)/acc_null
print('10% correlation accuracy gain: {0:.2f}%'.format(pct_10))
print('5% correlation accuracy gain: {0:.2f}%'.format(pct_5))
print('1% correlation accuracy gain: {0:.2f}%'.format(pct_1))

10% correlation accuracy gain: 21.06%
5% correlation accuracy gain: 21.06%
1% correlation accuracy gain: 21.06%
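
The gain printed above is relative to the null accuracy, not a difference in percentage points. As a worked check using the two accuracies reported earlier in this script (the variable names `acc_model`, `gain_pct`, and `gain_points` are introduced here for illustration):

```python
acc_null = 0.562391525525166     # null model accuracy reported above
acc_model = 0.6808306900472799   # SVM test accuracy reported above

# Relative gain: improvement as a share of the null accuracy
gain_pct = 100 * (acc_model - acc_null) / acc_null

# Absolute gain: improvement in percentage points
gain_points = 100 * (acc_model - acc_null)

print('relative gain: {0:.2f}%'.format(gain_pct))
print('absolute gain: {0:.2f} percentage points'.format(gain_points))
```

The relative figure reproduces the 21.06% shown above, while the absolute improvement is smaller, at a little under 12 percentage points.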


Print the optimal value of `C` for each model. 
Which model has the least amount of regularization?

In [53]:
print(best_10['C'])
print(best_5['C'])
print(best_1['C'])

0.008858667904100823
0.008858667904100823
0.008858667904100823
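
In `LinearSVC`, `C` multiplies the loss term, so a smaller `C` means stronger regularization and a larger `C` means weaker regularization. A minimal sketch on synthetic data (generated with `make_classification`, an assumption for illustration) shows the coefficient norm shrinking as `C` decreases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=490)

# Fit the same model at several values of C and record the L2 norm
# of the fitted coefficient vector
norms = {}
for C in (1e-4, 1e-2, 1.0, 100.0):
    coef = LinearSVC(C=C, dual=False).fit(X, y).coef_
    norms[C] = float(np.linalg.norm(coef))

for C, norm in norms.items():
    print(C, norm)
```

By this reading, the model with the largest selected `C` is the one with the least regularization.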
