# <div align="center">Analysis and modeling of whether Americans vote or not</div> #


**<div align="center">Barrett Nibling</div>**
**<div align="center">bnibling@gmail.com</div>**

# Background

Every two years, most American adults face the same decision come Election Day in early November: whether or not to exercise one of their fundamental rights as citizens by voting in the presidential or midterm elections. In each election, millions of Americans decide not to cast a ballot. In fact, according to [MIT's Election Data + Science Lab](https://electionlab.mit.edu/research/voter-turnout), anywhere between 35 and 60 percent of eligible voters abstain from voting, with this figure varying based on numerous demographic features.

So, the questions are: 
1. What makes people decide whether or not to vote in any given election? 

2. Can we predict who will vote and who will not vote from demographic and survey data? 

# Data

This notebook uses the polling data, provided by Ipsos, behind the FiveThirtyEight article [Why Many Americans Don't Vote](https://projects.fivethirtyeight.com/non-voters-poll-2020-election/). 

In the polling, approximately 8,000 respondents were surveyed and then matched to voter file records by Aristotle to determine their voting history, which reduced the number of responses to 5,836.

The data includes basic demographic data:
  1. Age
  2. Gender
  3. Education
  4. Race
  5. Income

It also includes responses to 33 topics about voting, government, and other political beliefs. 

The dataset and metadata (survey) can be found at:
 https://github.com/fivethirtyeight/data/tree/master/non-voters

# Import libraries

In [1]:
# !pip install xgboost
# !pip install scikit-optimize

In [2]:
%matplotlib inline

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import (
    train_test_split,
    GridSearchCV
)

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from scipy import stats

from imblearn.combine import SMOTEENN

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
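from sklearn.metrics import (
    mean_squared_error,
    r2_score,
    accuracy_score,
    mean_absolute_error,
    ConfusionMatrixDisplay,
    classification_report,
)

# `plot_confusion_matrix` was removed in scikit-learn 1.2; this alias keeps
# the same call signature (estimator, X, y, values_format, ax, cmap).
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator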

from xgboost import XGBClassifier

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


# Data Cleaning and Feature Engineering

Read the CSV file into a pandas DataFrame.

In [None]:
url = '/content/drive/My Drive/Thinkful/Capstone 2/nonvoters_data.csv'
data = pd.read_csv(url)

In [None]:
data.info()

Drop irrelevant features, namely those with unusable data, incomplete values, or answers that depend on other questions in the survey.

In [None]:
#Columns to Drop
#ID, weight, US citizen, Have you already voted, Political party (based on Q30)
drop_list = ['RespId','weight','Q1','Q21','Q22','Q31','Q32','Q33']

#Thoughts on the two parties
question_14 = ['Q14']
question_15 = ['Q15']

#Questions about whether participants voted in past elections
question_26 = ['Q26']
question_27 = ['Q27_1','Q27_2','Q27_3','Q27_4','Q27_5','Q27_6']
question_28 = ['Q28_1','Q28_2','Q28_3','Q28_4','Q28_5','Q28_6','Q28_7','Q28_8']
question_29 = ['Q29_1','Q29_2','Q29_3','Q29_4','Q29_5','Q29_6','Q29_7','Q29_8','Q29_9','Q29_10']

#`Other` option for 'What do you think would get more people to vote?' 
question_19_10 = ['Q19_10']

#Trump vs Biden
question_23 = ['Q23']

In [None]:
# Pass the axis by keyword; recent pandas versions no longer accept it positionally.
data.drop(drop_list, axis=1, inplace=True)
data.drop(question_14+question_15, axis=1, inplace=True)
data.drop(question_26+question_27+question_28+question_29, axis=1, inplace=True)
data.drop(question_19_10, axis=1, inplace=True)
data.drop(question_23, axis=1, inplace=True)

Organize features by type: Ordinal, Binary, Multi-class

In [None]:
#How important? (Ordinal)
question_2 = ['Q2_1', 'Q2_2', 'Q2_3', 'Q2_4', 'Q2_5', 'Q2_6', 'Q2_7', 'Q2_8', 'Q2_9', 'Q2_10']
#Agree or Disagree (Ordinal)
question_3 = ['Q3_1', 'Q3_2', 'Q3_3', 'Q3_4', 'Q3_5', 'Q3_6']
#Impact (Ordinal)
question_4 = ['Q4_1', 'Q4_2', 'Q4_3', 'Q4_4', 'Q4_5', 'Q4_6']
#Politician like you? (Ordinal)
question_6 = ['Q6']
#Trust (Ordinal)
question_8 = ['Q8_1', 'Q8_2', 'Q8_3', 'Q8_4', 'Q8_5', 'Q8_6', 'Q8_7', 'Q8_8', 'Q8_9']
#Way of governing (Ordinal)
question_9 = ['Q9_1', 'Q9_2', 'Q9_3', 'Q9_4']
#Ease of voting (Ordinal)
question_16 = ['Q16']
#Voting confidence (Ordinal)
question_17 = ['Q17_1', 'Q17_2', 'Q17_3', 'Q17_4']
#Following the 2020 election (Ordinal)
question_25 = ['Q25']

In [None]:
#Does presidential election matter? (Binary)
question_5 = ['Q5']
#U.S. government needs changes? (Binary)
question_7 = ['Q7']
#Various conditions (Binary )
question_10 = ['Q10_1', 'Q10_2', 'Q10_3', 'Q10_4']
question_11 = ['Q11_1', 'Q11_2', 'Q11_3', 'Q11_4', 'Q11_5', 'Q11_6']
question_18 = ['Q18_1', 'Q18_2', 'Q18_3', 'Q18_4', 'Q18_5', 'Q18_6', 'Q18_7', 'Q18_8', 'Q18_9', 'Q18_10']
#Registered to vote (Binary)
question_20 = ['Q20']

In [None]:
#What do you think would get more people to vote? (Binary- Yes or Blank)
question_19 = ['Q19_1','Q19_2','Q19_3','Q19_4','Q19_5','Q19_6','Q19_7','Q19_8','Q19_9']
#Preferred voting method (Categorical)
question_24 = ['Q24']
#Political party affiliation (Categorical)
question_30 = ['Q30']

Map the values to make them easier to analyze.

Ordinal values will be reversed (1, 2, 3, 4 become 4, 3, 2, 1) to better represent the data.

Binary values will be recoded as either 0 or 1.

Multi-class values will be given readable labels ahead of one-hot encoding.

All -1 values will be replaced with NaN so that those rows can be dropped, except in `Q19`, where a blank (-1) is recoded as 0.

In [None]:
df_encoded = data.copy()

In [None]:
ordinal_mapping = {1:4, 2:3, 3:2, 4:1, -1:np.nan}
bi_mapping = {1:1, 2:0, -1:np.nan}

In [None]:
df_encoded[question_2] = df_encoded[question_2].replace(ordinal_mapping)
df_encoded[question_3] = df_encoded[question_3].replace(ordinal_mapping)
df_encoded[question_4] = df_encoded[question_4].replace(ordinal_mapping)
df_encoded[question_6] = df_encoded[question_6].replace(ordinal_mapping)
df_encoded[question_8] = df_encoded[question_8].replace(ordinal_mapping)
df_encoded[question_9] = df_encoded[question_9].replace(ordinal_mapping)
df_encoded[question_16] = df_encoded[question_16].replace(ordinal_mapping)
df_encoded[question_17] = df_encoded[question_17].replace(ordinal_mapping)
df_encoded[question_25] = df_encoded[question_25].replace(ordinal_mapping)

df_encoded[question_5] = df_encoded[question_5].replace(bi_mapping)
df_encoded[question_7] = df_encoded[question_7].replace(bi_mapping)
df_encoded[question_10] = df_encoded[question_10].replace(bi_mapping)
df_encoded[question_11] = df_encoded[question_11].replace(bi_mapping)
df_encoded[question_18] = df_encoded[question_18].replace(bi_mapping)
df_encoded[question_20] = df_encoded[question_20].replace(bi_mapping)

df_encoded[question_19] = df_encoded[question_19].replace({1:1, -1:0})

In [None]:
q24_mapping = {1:'Mail-in', 2:'Early In-person', 3:'In-person', 4:'Other', -1:np.nan}
df_encoded[question_24] = df_encoded[question_24].replace(q24_mapping)

q30_mapping = {1:'Republican', 2:'Democrat', 3:'Independent', 4:'Other', 5:'No preference', -1:np.nan}
df_encoded[question_30] = df_encoded[question_30].replace(q30_mapping)
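
To make the recoding concrete, here is a minimal sketch on a toy frame. The column names `Q2_1`, `Q5`, and `Q19_1` are borrowed from the survey, but the values are made up for illustration. It uses `Series.map` rather than `replace`; the two behave the same here only because every value in the toy columns appears in the mapping.

```python
import numpy as np
import pandas as pd

# Toy responses (made-up values); -1 marks a skipped question.
toy = pd.DataFrame({
    'Q2_1': [1, 4, 2, -1],    # ordinal
    'Q5': [1, 2, -1, 2],      # binary
    'Q19_1': [1, -1, -1, 1],  # "yes or blank"
})

ordinal_mapping = {1: 4, 2: 3, 3: 2, 4: 1, -1: np.nan}
bi_mapping = {1: 1, 2: 0, -1: np.nan}

toy_encoded = toy.copy()
toy_encoded['Q2_1'] = toy_encoded['Q2_1'].map(ordinal_mapping)
toy_encoded['Q5'] = toy_encoded['Q5'].map(bi_mapping)
# For the "yes or blank" items a blank is information, not a missing answer.
toy_encoded['Q19_1'] = toy_encoded['Q19_1'].map({1: 1, -1: 0})

print(toy_encoded)

# Rows with any remaining NaN are removed, as in the next cell.
toy_clean = toy_encoded.dropna(axis=0)
print(toy_clean)
```

Only the rows without a skipped ordinal or binary answer survive the final `dropna`.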

In [None]:
df_encoded.dropna(axis=0, inplace=True)

Split features and the target variable.

Finish encoding of categorical values.

In [None]:
X = df_encoded.drop('voter_category', axis=1)
y = df_encoded['voter_category']

In [None]:
X_cat = pd.get_dummies(X.select_dtypes(include='O'), drop_first=True)

In [None]:
X = pd.concat([X.select_dtypes(exclude='O'), X_cat], axis=1)
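
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up `Q24` column. With k categories it keeps k-1 indicator columns, and the dropped category (the first in sorted order) becomes the all-zero baseline. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame: one numeric column and one labelled categorical column.
toy = pd.DataFrame({'ppage': [25, 40, 61],
                    'Q24': ['Mail-in', 'In-person', 'Early In-person']})

# Encode the categorical column, dropping the first category as the baseline.
toy_cat = pd.get_dummies(toy[['Q24']], drop_first=True)

# Re-attach the numeric column.
toy_X = pd.concat([toy[['ppage']], toy_cat], axis=1)
print(toy_X)
```

The respondent who answered `Early In-person` is represented by zeros in both remaining indicator columns.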

In [None]:
y.value_counts()

In [None]:
voter_map = {'always':0, 'sporadic':0, 'rarely/never':1}
y = y.replace(voter_map)
y.value_counts()

Check for collinearity among the feature variables; any pair with an absolute correlation above 0.8 is a candidate for dropping.

In [None]:
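# Keep the row labels when melting so each correlation stays tied to its pair of features.
corr_table = pd.melt(X.corr().reset_index(), id_vars='index')
corr_table = corr_table[corr_table['value'] != 1.0]
corr_table['value'] = np.abs(corr_table['value'])
corr_table.sort_values('value', ascending=False).head(10)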
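
The cell above only lists correlations; nothing is dropped by it. A minimal sketch of one way to act on the 0.8 threshold is shown below, on synthetic data. The helper name `high_corr_pairs` and the choice to drop the second feature of each pair are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, |r|) for each pair above the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic data: 'a_copy' is a noisy duplicate of 'a'; 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'a_copy': a + rng.normal(scale=0.05, size=200),
                    'b': rng.normal(size=200)})

pairs = high_corr_pairs(toy)
to_drop = sorted({second for _, second, _ in pairs})
toy_reduced = toy.drop(columns=to_drop)
print(pairs)
print(toy_reduced.columns.tolist())
```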

See if there are any features with strong correlation with the target.

In [None]:
corr_target = np.abs(X.corrwith(y)).sort_values(ascending=False)
corr_target[:10]

# Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
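
Because the target is imbalanced, one option worth considering is a stratified, seeded split, so that both partitions keep the same share of non-voters and the results are reproducible. The sketch below uses synthetic labels; `stratify` and `random_state=42` are assumptions for illustration, not settings used in the cells of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 25% positives (a made-up proportion).
y_toy = np.array([1] * 250 + [0] * 750)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Both partitions keep the original class proportion.
print(y_tr.mean(), y_te.mean())
```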

In [None]:
smote = SMOTEENN(sampling_strategy=0.75)

X_train_samp, y_train_samp = smote.fit_resample(X_train, y_train)

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
def get_scores(model, X_train, X_test, y_train, y_test, verbose=False):
  if verbose:
    print('\nTraining Scores:')
    print(f'Score: {model.score(X_train, y_train)}')
    print(f'Accuracy: {accuracy_score(y_train, model.predict(X_train))}')
    print(f'Mean Squared Error: {mean_squared_error(y_train, model.predict(X_train))}')
    print(f'Mean Absolute Error: {mean_absolute_error(y_train, model.predict(X_train))}')
    print(f'\nTest Scores:')
    print(f'Score: {model.score(X_test, y_test)}')
    print(f'Accuracy: {accuracy_score(y_test, model.predict(X_test))}')
    print(f'Mean Squared Error: {mean_squared_error(y_test, model.predict(X_test))}')
    print(f'Mean Absolute Error: {mean_absolute_error(y_test, model.predict(X_test))}')

def get_class_report(model, X_train, X_test, y_train, y_test):
  print(f'\nTraining Report:\n {classification_report(y_train, model.predict(X_train))}')
  print(f'\nTest Report:\n {classification_report(y_test, model.predict(X_test))}')
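
Since most of the grids below are scored on F1 rather than accuracy, it may help to see how these metrics relate. The sketch below computes them by hand from made-up labels (1 = rarely/never votes, matching the target encoding above); the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': (tp + tn) / len(pairs),
            'precision': precision, 'recall': recall, 'f1': f1}

# Made-up example: 20 positives out of 100; the model finds only half of them.
y_true_toy = [1] * 20 + [0] * 80
y_pred_toy = [1] * 10 + [0] * 10 + [1] * 5 + [0] * 75

metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

In this toy case accuracy is high even though half of the positive class is missed, which is why recall and F1 are worth reading alongside it.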

# Linear Support Vector Classification 

Used mainly to inspect which features receive the largest coefficients.

In [None]:
svc = SVC()

params = {'C': np.logspace(-1,1,3),
          'kernel': ['linear'],
          'gamma': ['scale', 'auto'],
          'class_weight': ['balanced', None]}

svc_lin_grid = GridSearchCV(svc, params, cv=3, scoring='f1', n_jobs=-1)

svc_lin_grid.fit(X_train, y_train)

In [None]:
svc_lin_grid.best_params_

In [None]:
get_scores(svc_lin_grid, X_train, X_test, y_train, y_test, verbose=True)
get_class_report(svc_lin_grid, X_train, X_test, y_train, y_test)
ax1 = plt.subplot(121)
plot_confusion_matrix(svc_lin_grid, X_train, y_train, values_format='.2f', ax=ax1, cmap='mako')
plt.title('Training Data')

ax2 = plt.subplot(122)
plot_confusion_matrix(svc_lin_grid, X_test, y_test, values_format='.2f', ax=ax2, cmap='mako')
plt.title('Test Data')

plt.tight_layout()
plt.show()

In [None]:
feature_svc_coef = pd.Series(abs(svc_lin_grid.best_estimator_.coef_[0]), index=X_train.columns)
feature_svc_coef.sort_values(ascending=False).nlargest(15).plot(kind='barh')
plt.show()
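
One caveat when ranking features by coefficient size: the magnitudes are only comparable when the features share a scale. Below is a minimal sketch of that idea on synthetic data, using a pipeline with `StandardScaler`. The pipeline is an assumption here; the grid above is fit on the unscaled `X_train`.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
toy_X = pd.DataFrame({
    'signal': rng.normal(size=n),
    'signal_big_units': rng.normal(size=n) * 1000,  # informative, but on a large scale
    'noise': rng.normal(size=n),                    # unrelated to the target
})
# The target depends equally on both signal columns once units are removed.
toy_y = ((toy_X['signal'] + toy_X['signal_big_units'] / 1000) > 0).astype(int)

# Scale first, then fit the linear SVC, so coefficients are on a common footing.
model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
model.fit(toy_X, toy_y)

coef = pd.Series(np.abs(model[-1].coef_[0]), index=toy_X.columns)
print(coef.sort_values(ascending=False))
```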

# Random Forest Classifier

To see how well the model fits and to inspect its `feature_importances_`.

In [None]:
params = {'max_depth': [5, 10, 15, 20], 'min_samples_split': [10, 25, 50], 'max_features': ['sqrt', 0.5, None]}

forest = RandomForestClassifier(criterion='entropy')

forest_grid = GridSearchCV(forest, param_grid = params, cv=5)

forest_grid.fit(X_train, y_train)

In [None]:
forest_grid.best_params_

In [None]:
get_scores(forest_grid, X_train, X_test, y_train, y_test, verbose=True)
get_class_report(forest_grid, X_train, X_test, y_train, y_test)
ax1 = plt.subplot(121)
plot_confusion_matrix(forest_grid, X_train, y_train, values_format='.2f', ax=ax1, cmap='mako')
plt.title('Training Data')

ax2 = plt.subplot(122)
plot_confusion_matrix(forest_grid, X_test, y_test, values_format='.2f', ax=ax2, cmap='mako')
plt.title('Test Data')

plt.tight_layout()
plt.show()

In [None]:
forest_feature_importances = pd.DataFrame({'columns': X_train.columns, 'importance scores':forest_grid.best_estimator_.feature_importances_}).sort_values(
    by='importance scores', ascending=False)

forest_feature_importances.head(10)

# Gradient Boosting Classifier

To see how well the model fits and to inspect its `feature_importances_`.

In [None]:
gbc = GradientBoostingClassifier(n_iter_no_change=100, n_estimators=10000)

params = {'learning_rate': np.arange(0.025, 0.125, 0.25),
          'max_depth': [3],
          'subsample': np.arange(0.4, 1.1,0.2),
          'max_features': [None, 'sqrt']}

gbc_grid = GridSearchCV(gbc, params, cv=3, scoring='f1', n_jobs=-1)

gbc_grid.fit(X_train, y_train)
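
It is worth printing the candidate values a grid actually contains. With `np.arange(start, stop, step)`, a step larger than the range yields a single value, so the learning-rate grid above contains only `0.025`. If several rates were intended, `np.linspace` states the count explicitly. The four-point grid below is an assumed example, not the setting behind the reported results.

```python
import numpy as np

# The learning-rate grid as written: the step (0.25) exceeds the range (0.1).
as_written = np.arange(0.025, 0.125, 0.25)
print(as_written)

# Hypothetical alternative: four evenly spaced learning rates.
alternative = np.linspace(0.025, 0.1, 4)
print(alternative)
```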

In [None]:
gbc_grid.best_params_

In [None]:
get_scores(gbc_grid, X_train, X_test, y_train, y_test, verbose=True)
get_class_report(gbc_grid, X_train, X_test, y_train, y_test)
ax1 = plt.subplot(121)
plot_confusion_matrix(gbc_grid, X_train, y_train, values_format='.2f', ax=ax1, cmap='mako')
plt.title('Training Data')

ax2 = plt.subplot(122)
plot_confusion_matrix(gbc_grid, X_test, y_test, values_format='.2f', ax=ax2, cmap='mako')
plt.title('Test Data')

plt.tight_layout()
plt.show()

In [None]:
gbc_feature_importances = pd.DataFrame({'columns': X_train.columns, 'importance scores':gbc_grid.best_estimator_.feature_importances_}).sort_values(
    by='importance scores', ascending=False)

gbc_feature_importances.head(10)

# XGBoost Classifier

To see how well the model fits and to inspect its `feature_importances_`.

In [None]:
# `n_iter_no_change` and `max_features` are scikit-learn gradient boosting
# options that XGBoost does not use, so only XGBoost parameters are set here.
xgb = XGBClassifier(n_estimators=1000)

params = {'learning_rate': np.arange(0.025, 0.125, 0.25),
          'max_depth': [3],
          'subsample': np.arange(0.4, 1.1,0.2)}

xgb_grid = GridSearchCV(xgb, params, cv=3, scoring='f1', n_jobs=-1)

xgb_grid.fit(X_train, y_train)

In [None]:
xgb_grid.best_params_

In [None]:
get_scores(xgb_grid, X_train, X_test, y_train, y_test, verbose=True)
get_class_report(xgb_grid, X_train, X_test, y_train, y_test)
ax1 = plt.subplot(121)
plot_confusion_matrix(xgb_grid, X_train, y_train, values_format='.2f', ax=ax1, cmap='mako')
plt.title('Training Data')

ax2 = plt.subplot(122)
plot_confusion_matrix(xgb_grid, X_test, y_test, values_format='.2f', ax=ax2, cmap='mako')
plt.title('Test Data')

plt.tight_layout()
plt.show()

In [None]:
xgb_feature_importances = pd.DataFrame({'columns': X_train.columns, 'importance scores':xgb_grid.best_estimator_.feature_importances_}).sort_values(
    by='importance scores', ascending=False)

xgb_feature_importances.head(10)

# Important Features

Take the top 10 features from each model to make a `best_features` list and run the model again to see how it fares.

In [None]:
best_features = []

for feature in feature_svc_coef.nlargest(10).index:
  best_features.append(feature)

for feature in forest_feature_importances['columns'].head(10):
  if feature not in best_features:
    best_features.append(feature)

for feature in gbc_feature_importances['columns'].head(10):
  if feature not in best_features:
    best_features.append(feature)

for feature in xgb_feature_importances['columns'].head(10):
  if feature not in best_features:
    best_features.append(feature)
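
The same order-preserving, de-duplicated union can be written more compactly with `dict.fromkeys`. The lists below are made-up stand-ins for the per-model top-10 lists, not actual results.

```python
from itertools import chain

# Made-up stand-ins for each model's ranked feature names.
svc_top = ['Q20', 'ppage', 'Q2_1']
forest_top = ['ppage', 'Q20', 'Q25']
gbc_top = ['Q20', 'Q16', 'ppage']

# dict keys keep insertion order (Python 3.7+), so the first occurrence wins
# and later duplicates are ignored.
best_features_toy = list(dict.fromkeys(chain(svc_top, forest_top, gbc_top)))
print(best_features_toy)
```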

In [None]:
best_features

In [None]:
X_train_best = X_train[best_features]
X_test_best = X_test[best_features]

In [None]:
# Only XGBoost parameters are set; `n_iter_no_change` and `max_features` are not used by XGBoost.
xgb = XGBClassifier(n_estimators=1000)

params = {'learning_rate': np.arange(0.025, 0.125, 0.25),
          'max_depth': [3],
          'subsample': np.arange(0.4, 1.1,0.2)}

xgb_best_grid = GridSearchCV(xgb, params, cv=3, scoring='f1', n_jobs=-1)

xgb_best_grid.fit(X_train_best, y_train)

In [None]:
xgb_best_grid.best_params_

In [None]:
get_scores(xgb_best_grid, X_train_best, X_test_best, y_train, y_test, verbose=True)
get_class_report(xgb_best_grid, X_train_best, X_test_best, y_train, y_test)
ax1 = plt.subplot(121)
plot_confusion_matrix(xgb_best_grid, X_train_best, y_train, values_format='.2f', ax=ax1, cmap='mako')
plt.title('Training Data')

ax2 = plt.subplot(122)
plot_confusion_matrix(xgb_best_grid, X_test_best, y_test, values_format='.2f', ax=ax2, cmap='mako')
plt.title('Test Data')

plt.tight_layout()
plt.show()

In [None]:
smote = SMOTEENN(sampling_strategy=0.75)

X_train_best_samp, y_train_samp = smote.fit_resample(X_train_best, y_train)

In [None]:
# Only XGBoost parameters are set; `n_iter_no_change` and `max_features` are not used by XGBoost.
xgb = XGBClassifier(n_estimators=1000)

params = {'learning_rate': np.arange(0.025, 0.125, 0.25),
          'max_depth': [3],
          'subsample': np.arange(0.4, 1.1,0.2)}

xgb_best_samp_grid = GridSearchCV(xgb, params, cv=3, scoring='f1', n_jobs=-1)

xgb_best_samp_grid.fit(X_train_best_samp, y_train_samp)

In [None]:
xgb_best_samp_grid.best_params_

In [None]:
get_scores(xgb_best_samp_grid, X_train_best_samp, np.array(X_test_best), y_train_samp, y_test, verbose=True)
get_class_report(xgb_best_samp_grid, X_train_best_samp, np.array(X_test_best), y_train_samp, y_test)
ax1 = plt.subplot(121)
plot_confusion_matrix(xgb_best_samp_grid, X_train_best_samp, y_train_samp, values_format='.2f', ax=ax1, cmap='mako')
plt.title('Training Data')

ax2 = plt.subplot(122)
plot_confusion_matrix(xgb_best_samp_grid, np.array(X_test_best), y_test, values_format='.2f', ax=ax2, cmap='mako')
plt.title('Test Data')

plt.tight_layout()
plt.show()

# Conclusion

No single model truly outperformed the others in this testing. Each predicted the test set fairly well, with test accuracies of 0.84 for all but the SVC model (which reached 0.8). However, accuracy isn't everything, and recall suffers for most of the models because of the class imbalance in the data set. When resampling is used to account for this, precision decreases drastically. So, depending on whether recall or precision matters more, either approach could be used.

Besides the demographic features that clearly matter in determining whether someone will vote, with age the most important of them, a few features appeared in every model's list of important features.

The most important of these, unsurprisingly, is `Q20`, which asks whether someone is registered to vote. Being registered is a very important predictor of voter turnout, which suggests that more incentives to ensure people are registered to vote are in order.

`Q18_3` is similar, as it identifies those who missed voter registration deadlines.

`Q2_1` asks the respondents how important the elections are. `Q25` asks how much attention they paid to the election. So, taking an interest in the elections and understanding their importance is an indicator of someone's interest in voting.

`Q16` is an interesting question, as it asks how easy or difficult it is to vote. So, efforts to make voting easier could greatly influence voter turnout, or at least changing people's impressions of its difficulty could.

