Note to consider: feature selection should be done on a different dataset than the one you train on. Otherwise the selected features are tuned to the training data and the model will overfit it.
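One way to act on this note is to hold out part of the data purely for the selection step. The sketch below uses synthetic data from `make_regression` as a stand-in for the business dynamics table; the split sizes, `k=3`, and all variable names are illustrative assumptions, not choices made elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table: 200 rows, 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Split once into a selection set and a modelling set, so the rows used to
# pick features are never the rows used to fit or score the model.
X_sel, X_model, y_sel, y_model = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_model, y_model, test_size=0.3, random_state=0)

# Choose features on the selection set only.
selector = SelectKBest(f_regression, k=3).fit(X_sel, y_sel)
chosen = selector.get_support(indices=True)

# Fit and score on the modelling set using just the chosen columns.
model = LinearRegression().fit(X_train[:, chosen], y_train)
test_r2 = model.score(X_test[:, chosen], y_test)
print(chosen, round(test_r2, 3))
```

The same idea can be applied with cross-validation by refitting the selector inside each fold, which avoids giving up half the rows to selection.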


This dataset is a list of business attributes for a given year and location (region/state).

Feature selection: which columns contribute to job creation? To job deaths?

Predict Births (the number of jobs created by firm births in the past year) from the other features.

In [17]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import RFE, f_regression
# RandomizedLasso was removed in scikit-learn 0.21, so it is no longer imported
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import operator

records=pd.read_csv('data/business_dynamics.csv')
records.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Births,DHS Denominator,Deaths,Entered,Entered Rate,Establishment Exit,Exited,Exited Rate,Firm Exits.Count,Job Creation.Continuers,...,Job Destruction.Rate,Net Job Creation,Net Job Creation Rate,Number of Firms,Physical Locations,Rate/Births,Rate/Deaths,Reallocation Rate,State,Year
0,89869,933909,58891,10634,17.2,5641,8057,13.1,5623,101653,...,15.5,46776,5.0,52371,62852,9.6,6.3,31.0,Alabama,1977
1,19259,108134,13504,2028,27.0,1039,1564,20.8,1035,18286,...,36.5,-1940,-1.8,6480,7725,17.8,12.5,69.4,Alaska,1977
2,70645,589552,45366,9379,22.0,4357,6230,14.6,4332,76781,...,19.9,29997,5.1,36477,44113,12.0,7.7,39.8,Arizona,1977
3,44527,529709,32531,7291,18.3,3994,5455,13.7,3970,75201,...,14.5,42803,8.1,35499,40702,8.4,6.1,29.0,Arkansas,1977
4,779164,6484959,488562,88187,20.7,39932,60589,14.2,39693,960214,...,16.7,656693,10.1,362887,438766,12.0,7.5,33.4,California,1977


In [123]:
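# VarianceThreshold needs numeric input, so drop the text column first
records_noState = records.drop(['State'], axis=1)
# Keep only features whose variance exceeds 0.8 * (1 - 0.8) = 0.16
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
sel.fit(records_noState)
print(sel.get_support())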

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True]


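VarianceThreshold only accepts numeric data, which is why State is dropped first. It removes any feature whose variance falls below the threshold, since a nearly constant column does little to help predict a value. The threshold here, 0.8 * (1 - 0.8) = 0.16, is compared against the raw, unscaled variances, so columns measured in large units pass easily. In this case every column clears it, so nothing is removed.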
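Because variance depends on the units of a column, the same threshold can behave very differently on raw and rescaled data. The following is a minimal sketch on made-up numbers (a count-like column and a rate-like column, both hypothetical) rather than on the business dynamics table.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: a count-like column in large units and a
# rate-like column confined to a narrow range.
rng = np.random.RandomState(0)
counts = rng.randint(1_000, 100_000, size=50).astype(float)
rates = rng.uniform(0.10, 0.12, size=50)
X = np.column_stack([counts, rates])

threshold = 0.8 * (1 - 0.8)  # 0.16, the same value used in the cell above

# On raw values the large-unit column passes and the narrow one fails.
raw_support = VarianceThreshold(threshold=threshold).fit(X).get_support()

# After min-max scaling every column lies in [0, 1], where variance can
# never exceed 0.25, so the same threshold is much harder to clear.
X_scaled = MinMaxScaler().fit_transform(X)
scaled_variances = X_scaled.var(axis=0)

print(raw_support, scaled_variances)
```

If the goal is to drop near-constant columns regardless of units, it may make sense to choose the threshold after deciding whether the features will be scaled.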

In [107]:
#Setting up the datasets to perform the feature selection on
ranks_rfe={}
ranks_rf={}
job_creation=records_noState['Births'].values

#Remove all text fields and the field you are predicting
records_filtered=records_noState.drop(['Births'], axis=1)
#DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
records_mtx=records_filtered.values
records_colnames=records_filtered.columns


I first tried recursive feature elimination (RFE). This method fits a model on all the features, then calculates the importance of each feature in that model (for LinearRegression, the size of its coefficient). It removes the least important feature and refits with one fewer feature, and so on until the requested number of features remains. You set how many features to keep with n_features_to_select, and .ranking_ reflects when each feature was removed: the kept features get rank 1, and the first feature eliminated gets the highest rank.

In [108]:
lr_records=LinearRegression()
lr_records.fit(records_mtx,job_creation)
rfe_records=RFE(lr_records, n_features_to_select=3, verbose=3)
rfe_records.fit(records_mtx,job_creation)

Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.


RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
  n_features_to_select=3, step=1, verbose=3)

In [111]:
#With recursive feature elimination, a lower rank means a more important feature (1 = selected)

ranks_rfe=dict(zip(records_colnames, rfe_records.ranking_))
ranks_rfe_sorted=sorted(ranks_rfe.items(), key=operator.itemgetter(1))
print(ranks_rfe_sorted)


[('Job Creation.Continuers', 1), ('Job Creation.Count', 1), ('Job Destruction.Count', 1), ('Net Job Creation', 2), ('Deaths', 3), ('Job Destruction.Continuers', 4), ('Exited Rate', 5), ('Job Creation.Rate', 6), ('Rate/Deaths', 7), ('Entered Rate', 8), ('Rate/Births', 9), ('Net Job Creation Rate', 10), ('Job Destruction.Rate', 11), ('Year', 12), ('Reallocation Rate', 13), ('Firm Exits.Count', 14), ('Establishment Exit', 15), ('Exited', 16), ('Entered', 17), ('Number of Firms', 18), ('Job Destruction', 19), ('Physical Locations', 20), ('DHS Denominator', 21)]


RFE picked Job Creation.Continuers, Job Creation.Count, and Job Destruction.Count as the most important features for predicting Births (the number of jobs created by firm births in the past year). Job Creation.Count should probably be removed: it looks like Births + Job Creation.Continuers equals Job Creation.Count, in which case the features are collinear and the column effectively leaks the target. Job Destruction.Count is an interesting pick; I would expect it to have a negative relationship with Births. More on this when the model is created.
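The suspected identity between these columns can be checked directly rather than assumed. The helper below, `identity_holds`, is a hypothetical name introduced for this sketch, and the toy rows are invented for illustration; they are not values from business_dynamics.csv.

```python
import numpy as np
import pandas as pd

def identity_holds(df, parts, total, rtol=1e-6):
    """Return True if the `parts` columns sum to the `total` column in every row."""
    return bool(np.allclose(df[parts].sum(axis=1), df[total], rtol=rtol))

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    'Births': [100, 250, 40],
    'Job Creation.Continuers': [900, 1750, 360],
    'Job Creation.Count': [1000, 2000, 400],
})

print(identity_holds(toy, ['Births', 'Job Creation.Continuers'],
                     'Job Creation.Count'))
```

Calling the same function on `records` would show whether the relationship holds exactly in the real data. If it does, keeping Job Creation.Count alongside Job Creation.Continuers lets a linear model reconstruct Births by subtraction.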

Random forest is the second feature selection technique I tried, to compare against the RFE results. A random forest builds many decision trees, each on a bootstrap sample of the training rows, and can restrict each split to a random subset of the features. Each tree records how much every feature reduced the prediction error at its splits, and the forest averages these values over all the trees you ask for (n_estimators) into one importance per feature. In this case a higher number means a more important feature, and the importances sum to 1.

In [128]:

rf_records = RandomForestRegressor(n_estimators=100, verbose=3)
rf_records.fit(records_mtx,job_creation)

building tree 1 of 100
building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s



building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 100
building tree 47 of 100
building tree 48 of 100
building tree 49 

[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.2s finished


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=3, warm_start=False)

In [129]:
#For Random Forest, the higher the number, the more important the feature is
ranks_rf=dict(zip(records_colnames, ["%.4f" % i  for i in rf_records.feature_importances_]))
#Sort on the numeric value rather than on the formatted string
ranks_rf_sorted=sorted(ranks_rf.items(), key=lambda kv: float(kv[1]), reverse=True)
print(ranks_rf_sorted)

[('Entered', '0.5891'), ('Job Creation.Count', '0.3679'), ('Exited', '0.0151'), ('Job Destruction', '0.0083'), ('Establishment Exit', '0.0069'), ('Deaths', '0.0025'), ('Job Destruction.Count', '0.0023'), ('Rate/Births', '0.0023'), ('Job Creation.Continuers', '0.0014'), ('Job Destruction.Rate', '0.0005'), ('Entered Rate', '0.0004'), ('Job Destruction.Continuers', '0.0004'), ('Rate/Deaths', '0.0004'), ('Reallocation Rate', '0.0004'), ('Year', '0.0004'), ('Exited Rate', '0.0003'), ('Net Job Creation', '0.0003'), ('DHS Denominator', '0.0002'), ('Job Creation.Rate', '0.0002'), ('Net Job Creation Rate', '0.0002'), ('Number of Firms', '0.0002'), ('Physical Locations', '0.0002'), ('Firm Exits.Count', '0.0001')]


The random forest results indicate that Entered and Job Creation.Count have the most effect on Births; after those two, the importance values drop off sharply. I discussed Job Creation.Count above with RFE, along with why I wouldn't use it in a model to predict Births. Entered is an interesting field. It is the number of establishments that entered during 'this' time (I'm assuming during the year of the record), where entering means the establishment did not exist in the previous year. If that's the case, this feature does not relate to Births directly, since the firms that entered were not among the firms that created jobs in the past year (Births). This could indicate that the general trend of firms entering the market is affecting Births. That is assuming my interpretation of these columns is correct. What is also interesting is that Entered ranked very low in the RFE analysis. How can the two methods produce such different results?
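One possible explanation, offered as a hypothesis rather than a diagnosis of this dataset: RFE around LinearRegression ranks features by the size of their coefficients, and coefficient size depends on each column's units, whereas tree-based importances do not. The sketch below uses two invented columns (`big_units` and `small_units` are hypothetical names) to show how scaling alone can flip an RFE ranking.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 300
big_units = rng.normal(50_000, 10_000, n)   # count-like column
small_units = rng.normal(0.15, 0.02, n)     # rate-like column
X = np.column_stack([big_units, small_units])

# The count-like column drives almost all of the variation in y,
# yet its coefficient is tiny because its units are large.
y = 0.01 * big_units + 20 * small_units + rng.normal(0, 1, n)

def rfe_ranking(features, target):
    """Rank features with RFE around a plain linear regression (1 = kept)."""
    rfe = RFE(LinearRegression(), n_features_to_select=1)
    return rfe.fit(features, target).ranking_

raw_ranking = rfe_ranking(X, y)
scaled_ranking = rfe_ranking(StandardScaler().fit_transform(X), y)
print(raw_ranking, scaled_ranking)
```

On the unscaled data RFE discards the column that actually drives the target, and after standardizing it keeps that column instead. Rerunning the RFE cell on scaled features would show whether something similar is happening with Entered.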

Questions:
1) Are 0.3679 and 0.0151 really that different as random forest importances? How large does a gap need to be to count as a big difference?
2) How can the two feature selection methods produce such different results for the feature Entered?
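For the first question, one rough way to judge whether a gap between two importances is meaningful is to look at how much each importance varies from tree to tree. This is a sketch on synthetic data from `make_regression`, with illustrative parameter choices; it is not run on the business dynamics table.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 6 features, only 2 of which carry signal.
X, y = make_regression(n_samples=300, n_features=6, n_informative=2,
                       noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One importance vector per tree; the forest reports their (normalised) mean.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importance = per_tree.mean(axis=0)
std_importance = per_tree.std(axis=0)

# Print features from most to least important with their spread across trees.
for idx in np.argsort(mean_importance)[::-1]:
    print(f"feature {idx}: {mean_importance[idx]:.4f} +/- {std_importance[idx]:.4f}")
```

If the spread for two features overlaps heavily, their ordering is probably not reliable; if the gap is many times the spread, the difference is more likely to be real. Setting random_state and refitting with a few different seeds is another quick stability check.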
