This notebook is used to prototype different models for predicting voter turnout.

In [1]:
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

import os
import pandas as pd
import pickle
import time

import matplotlib.pyplot as plt
%matplotlib inline


First, I read in the data.

In [2]:
# Read in training data
data_train = pd.read_csv(os.path.join('data', 'train_2008.csv'))

# Extract input features and output labels
X_train = data_train.values[:, 1:-1]
y_train = data_train.values[:, -1]

# Use the first 10,000 rows as a smaller training set for hyperparameter selection
inds = np.arange(len(X_train))[:10000]
X = X_train[inds]
y = y_train[inds]
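
The slicing above assumes that the first column of the CSV is a row identifier and the last column is the label. A minimal sketch on a made-up frame (the column names `id`, `f1`, `f2`, and `target` are hypothetical, not taken from `train_2008.csv`) shows what each slice keeps:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training CSV: id, two features, label
toy = pd.DataFrame({'id': [10, 11, 12],
                    'f1': [1.0, 2.0, 3.0],
                    'f2': [4.0, 5.0, 6.0],
                    'target': [0, 1, 1]})

# Drop the first (id) and last (label) columns to get the features
X_toy = toy.values[:, 1:-1]
# The last column holds the labels
y_toy = toy.values[:, -1]

print(X_toy.shape, y_toy.shape)
```

Note that `.values` converts every column to one common dtype, so integer labels come back as floats when the frame also contains float columns.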

I then train random forest classifiers, selecting hyperparameters with a randomized search scored by ROC AUC.

In [3]:
# Specify hyperparameters for tuning
parameters = {'n_estimators': np.arange(1000, 3100, 100),
              'max_features': np.arange(15, 65, 5),
              'min_samples_leaf': np.arange(0.0001, 0.005, 0.0001),
              'max_depth': [None] + list(np.arange(10, 55, 5))}

# Perform hyperparameter testing
clf = RandomizedSearchCV(RandomForestClassifier(), parameters,
                         scoring='roc_auc', return_train_score=True,
                         n_iter=10, n_jobs=-1)
clf.fit(X, y)

# Save results
filename = 'RandomForest_{:s}.pkl'.format(time.strftime('%Y%m%d-%H%M'))
#with open(filename, 'wb') as file:
#    pickle.dump(clf, file)
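
To make the search step concrete, here is a scaled-down sketch of the same pattern on synthetic data. The data from `make_classification`, the small search space, and the names `X_demo`, `y_demo`, `param_space`, and `search` are assumptions chosen so the example runs in seconds; they are not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the voter turnout data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

# Deliberately small search space so the sketch runs quickly
param_space = {'n_estimators': [10, 20, 30],
               'max_depth': [None, 3, 5]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_space, scoring='roc_auc',
                            n_iter=3, cv=3, random_state=0)
search.fit(X_demo, y_demo)

# Best sampled setting and its mean cross-validated ROC AUC
print(search.best_params_)
print(search.best_score_)
```

After fitting, `best_params_` holds the best sampled combination and `best_score_` its mean cross-validated score; `best_estimator_` is that model refit on all the data passed to `fit`.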

I read in the results, extract the best estimator, and fit it to the full training data. This final model is then used to make predictions on the test sets.

In [4]:
# Specify file name
#filename = ''

# Read in results
#with open(filename, 'rb') as file:
#    clf = pickle.load(file)

# Train best estimator
model = clf.best_estimator_
model.fit(X_train, y_train)

# Show score on training set
print('ROC AUC (training):',
      roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))

# Read in input data for test sets
test_2008 = pd.read_csv(os.path.join('data', 'test_2008.csv'))
test_2012 = pd.read_csv(os.path.join('data', 'test_2012.csv'))

# Make predictions on test sets
pred_2008 = model.predict_proba(test_2008.values[:, 1:])[:, 1]
pred_2012 = model.predict_proba(test_2012.values[:, 1:])[:, 1]

# Write results
df_2008 = pd.DataFrame(data={'id': test_2008.values[:, 0],
                             'target': pred_2008})
df_2008.to_csv(os.path.join('predictions', 'pred_2008_CS.csv'),
               index=False, header=True)
df_2012 = pd.DataFrame(data={'id': test_2012.values[:, 0],
                             'target': pred_2012})
df_2012.to_csv(os.path.join('predictions', 'pred_2012_CS.csv'),
               index=False, header=True)

ROC AUC (training): 0.7925537920915184
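
As a reminder of what this score measures: ROC AUC equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by brute-force pairwise comparison on made-up labels and scores. The helper `pairwise_auc` is hypothetical and written only for illustration; `roc_auc_score` remains the function used in this notebook.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Made-up labels and predicted probabilities
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, p_toy))  # 3 of 4 pairs are ordered correctly: 0.75
```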


We can use this trained model to extract feature importances.

In [5]:
# Specify file name
filename = 'RandomForest_20190211-0525.pkl'

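# Read in results; the pickle stores a 3-tuple whose last element is the
# fitted model (it must expose feature_importances_)
with open(filename, 'rb') as file:
    _, _, clf = pickle.load(file)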

features = data_train.columns[1:-1]
importances = clf.feature_importances_
inds = np.argsort(-importances)
feature_importances = pd.DataFrame(data={'importance': importances[inds]},
                                   index=features[inds])

We examine the top features and their importances.

In [6]:
feature_importances

feature     importance
PEEDUCA       0.132349
PEAGE         0.063656
HETENURE      0.042779
HUFAMINC      0.038618
QSTNUM        0.021533
PERRP         0.020787
PEIO1OCD      0.018445
GESTCEN       0.017623
PXGRPROF      0.017202
HWHHWGT       0.017076
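
The ranking pattern used above (`np.argsort` on the negated importances) can be tried on a small example. The feature names and importance values below are invented for illustration, and the cumulative column is an optional extra, not something computed in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and importances, for illustration only
names = pd.Index(['a', 'b', 'c', 'd'])
imps = np.array([0.1, 0.4, 0.2, 0.3])

# Negating the values makes argsort return the largest importance first
order = np.argsort(-imps)
ranked = pd.DataFrame({'importance': imps[order]}, index=names[order])

# Keep the top 2 rows and add their running share of the total importance
top = ranked.head(2).copy()
top['cumulative'] = top['importance'].cumsum()
print(top)
```

Because the importances reported by a fitted random forest are normalized to sum to 1, a cumulative column like this reads directly as the share of total importance covered by the top features.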


I also experimented with other modeling choices that were not used for the final training, including the following:
- Scaling the input data
- Resampling the input data for balanced classes
- Performing feature selection using Yitong's features (code not shown)
- Using a logistic regression model

In [None]:
from sklearn import preprocessing

# Scale data
X = preprocessing.scale(X_train)
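
`preprocessing.scale` standardizes each column to zero mean and unit variance. The sketch below reproduces that computation with plain NumPy on made-up matrices, and also shows one way the same statistics could be reused on held-out data. Applying the training mean and standard deviation to a test matrix is an assumption about how this experiment might be extended; the cell above scales only `X_train`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up train and test matrices standing in for the real features
A_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
A_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Column statistics come from the training data only
mu = A_train.mean(axis=0)
sigma = A_train.std(axis=0)

A_train_scaled = (A_train - mu) / sigma
# Reuse the training statistics so both sets share one scale
A_test_scaled = (A_test - mu) / sigma

print(A_train_scaled.mean(axis=0).round(6))
print(A_train_scaled.std(axis=0).round(6))
```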

In [None]:
# Resample data to represent both classes equally: keep every negative and
# draw (with replacement) the same number of positives
inds_neg = np.where(y_train == 0)[0]
inds_pos = np.where(y_train == 1)[0]
inds = np.concatenate((inds_neg,
                       np.random.choice(inds_pos, size=len(inds_neg),
                                        replace=True)))
np.random.shuffle(inds)
X = X_train[inds]
y = y_train[inds]
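
The effect of this resampling is easy to check on a small example. The sketch below applies the same index logic to made-up labels and counts the classes afterwards; it uses a seeded `np.random.default_rng` generator (an assumption, chosen for reproducibility) rather than the global `np.random` functions used above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up imbalanced labels: 3 negatives, 7 positives
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

idx_neg = np.where(labels == 0)[0]
idx_pos = np.where(labels == 1)[0]

# Keep all negatives, draw as many positives as there are negatives, shuffle
idx = np.concatenate((idx_neg,
                      rng.choice(idx_pos, size=len(idx_neg), replace=True)))
rng.shuffle(idx)

# Class counts after resampling
print(np.bincount(labels[idx]))
```

With `replace=True` the same positive row can be drawn more than once. When the positive class is the larger one, `replace=False` would avoid such duplicates.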

In [None]:
from sklearn.linear_model import LogisticRegression

# Train logistic regression model with different regularization strengths
parameters = {'C': np.logspace(-2, 2, 9)}
# n_iter matches the number of candidate C values so that each one is tried
clf = RandomizedSearchCV(LogisticRegression(), parameters,
                         scoring='roc_auc', return_train_score=True,
                         n_iter=len(parameters['C']), n_jobs=-1)
clf.fit(X, y)

In [None]:
[(clf.cv_results_['params'][i], clf.cv_results_['mean_test_score'][i])
 for i in np.argsort(clf.cv_results_['rank_test_score'])]
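
The same ranking can be laid out as a small table. The sketch below uses a hand-written dictionary in place of `clf.cv_results_`; the key names match those used above, but the scores are made up and the variable names `cv_results` and `summary` are hypothetical.

```python
import pandas as pd

# Hand-written stand-in for clf.cv_results_ (values are made up)
cv_results = {'params': [{'C': 0.01}, {'C': 1.0}, {'C': 100.0}],
              'mean_test_score': [0.70, 0.74, 0.73],
              'rank_test_score': [3, 1, 2]}

# One row per candidate, best rank first
summary = (pd.DataFrame({'C': [p['C'] for p in cv_results['params']],
                         'mean_test_score': cv_results['mean_test_score'],
                         'rank': cv_results['rank_test_score']})
           .sort_values('rank')
           .reset_index(drop=True))
print(summary)
```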