# Voter Data Analysis

## Overview
As my capstone project for Flatiron School’s Data Science program, I built a model to predict how individuals would vote in a presidential election, using data from the 2012, 2016 and 2020 elections. I then used that model to analyze how broad categories of political issues, and individual issues themselves, influence an individual’s vote. I also examined how accurately votes can be predicted from basic demographic information such as income, race and education.

## Business Understanding
This type of modeling could be useful in a number of contexts. The most obvious is a campaign that wants to focus its efforts on the individuals most likely to vote for its candidate, but it could also be useful for political parties and special interest groups that want to better understand their constituents and the public as a whole.

## Data
My data comes from the American National Election Studies for the years 2012, 2016 and 2020. The ANES is a national survey of voters in the United States, conducted before and after every presidential election. I used a subset of that data curated by the Inter-university Consortium for Political and Social Research. 
The full ANES survey data is publicly available for download at https://electionstudies.org/. You do have to make an account to access the data, which you can do by clicking the login button in the top right corner of the home page. Once you have an account, click on the Data Center tab at the top of the home page, select the data set you would like (for example, 2020 Time Series Study), and then, under the download data heading on the next page, select the type of file you would like.
The Inter-university Consortium for Political and Social Research’s data is available at https://www.icpsr.umich.edu/web/pages/instructors/setups2020/ to individuals with an email address from one of their member institutions. Once you have made an account with that email address, click on the “Find Data” tab at the top of the page and search for the data set (i.e. Voting Behavior: The 2020 Election, Voting Behavior: The 2016 Election, or Voting Behavior: The 2012 Election). The first result will take you to a page where you can download the data.

## Data Preparation
To prepare my data for modeling, I first dropped all rows where individuals did not vote or voted for a third-party candidate. This left me with 6075 rows to work with. The columns are broken into 16 categories, denoted by a letter in front of the question number (for example, A01 and R15). Questions in categories A, D and E relate to past political behavior and opinions of current and former politicians. These are strongly correlated with vote preference and uninteresting in terms of analysis, so they were dropped. The data is categorical and so needed to be encoded. I used one-hot encoding to avoid imposing a hierarchy where none should exist.
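The preparation steps described here can be sketched on a toy frame. This is only an illustration, not the notebook's actual pipeline: the columns `A01`, `A02`, `D01`, `F01` and `R15` and their answer labels are made up for the example, and `pd.get_dummies` stands in for the `OneHotEncoder` used in the modeling cells.

```python
import pandas as pd

# Hypothetical toy survey: column names and answer labels are illustrative only
toy = pd.DataFrame({
    "A01": ["Voted", "Voted", "Did not vote", "Voted"],
    "A02": ["Biden", "Trump", "NA", "Other"],
    "D01": ["Approve", "Disapprove", "Approve", "Approve"],
    "F01": ["Liberal", "Conservative", "Moderate", "Liberal"],
    "R15": ["High", "Low", "Low", "High"],
})

# Keep only respondents who voted for one of the two major-party candidates
voters = toy[(toy["A01"] == "Voted") & (toy["A02"].isin(["Biden", "Trump"]))]

# The target is the vote itself
y = voters["A02"]

# Drop every column in categories A, D and E (identified by the leading letter)
leaky = [c for c in voters.columns if c[0] in ("A", "D", "E")]
X = voters.drop(columns=leaky)

# One-hot encode: each answer becomes its own indicator column, so no order is implied
X_encoded = pd.get_dummies(X)
print(X_encoded.columns.tolist())
```

An ordinal encoding would instead map answers to integers such as 0, 1, 2, which suggests a ranking that unordered answers do not have.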

In [1]:
# Imports
from sklearn.model_selection import cross_validate, cross_val_score, RandomizedSearchCV, train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from xgboost import XGBClassifier
from kmodes.kmodes import KModes

import pprint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string


In [2]:
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',504)
pd.set_option('display.width',1000000000)

In [64]:
# Reading in data and displaying first 5 rows
df2020 = pd.read_stata('data/SETUPS2020/SETUPS2020.dta')
# df2020.head()

In [66]:
df2020['WEIGHT']

0       0.611133
1       1.209783
2       0.823936
3       0.512837
4       0.856575
          ...   
7448    1.480103
7449    1.503653
7450    1.150732
7451    0.281583
7452    0.432413
Name: WEIGHT, Length: 7453, dtype: float64

In [4]:
df2016 = pd.read_stata('data/SETUPS2016/SETUPS2016.dta')
# df2016.head()

In [48]:
df2012 = pd.read_stata('data/SETUPS2012/SETUPS2012.dta')
# df2012.head()

In [6]:
df2020.drop(['CASEID','WEIGHT'], axis=1, inplace=True)
df2016.drop(['CASEID','WEIGHT'], axis=1, inplace=True)
df2012.drop(['CASEID','WEIGHT_FULL'], axis=1, inplace=True)

In [7]:
# # Taking a look at the first column, which asks if the respondent voted
# df2020['A01'].value_counts()

In [8]:
# # The second question asks who the respondent voted for
# df2020['A02'].value_counts()

In [9]:
# # Subsetting data to keep only rows where the respondent voted for Donald Trump or Joe Biden
# df2020 = df2020.loc[(df2020['A01'] == '1. Voted') & ((df2020['A02'] == '1. Joe Biden') | (df2020['A02'] == '2. Donald Trump'))]

In [10]:
# df2016['A02'].value_counts()

In [11]:
# # Subsetting data to keep only rows where the respondent voted for Hillary Clinton or Donald Trump
# df2016subset = df2016.loc[(df2016['A02'] == 'Clinton') | (df2016['A02'] == 'Trump')]

In [12]:
# df2016subset['A02'].value_counts()

In [13]:
# df2012['A02'].value_counts()

In [14]:
# # Subsetting data to keep only rows where the respondent voted for Barack Obama or Mitt Romney
# df2012subset = df2012.loc[(df2012['A02'] == 'Obama') | (df2012['A02'] == 'Romney')]

In [15]:
# df2012subset['A02'].value_counts()

In [16]:
# # Getting target
# y = df2020['A02']
# X = df2020.drop(['A02'], axis=1, errors = "ignore")

I wrote a function to get the question categories for my dataset. This will help with subsetting the data later.

In [17]:
# This function returns a dictionary where the key is the question category and the associated value is a list of
# columns in that category
def get_columns(df):
    # Creating empty dictionary
    dictionary = {}
    # Looping through potential categories 
    alphabet = list(string.ascii_uppercase[0:26])
    for char in alphabet:
        # Creating dictionary entry
        dictionary[char] = []
        for num in list(range(df.shape[1])):
            if df.columns[num].startswith(char):
                # Populating dictionary entry
                dictionary[char].append(df.columns[num])            
        temp = dictionary.pop(char)
        # Removing keys where the value is empty
        if temp != []:
            dictionary[char] = temp
    # Returning dictionary
    return dictionary
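The same grouping can be written more compactly. The sketch below is an alternative, not the function used in the rest of the notebook; the name `group_columns_by_prefix` and the toy column names are hypothetical.

```python
import string
from collections import defaultdict

import pandas as pd


def group_columns_by_prefix(df):
    # Map each leading uppercase ASCII letter to the columns that start with it
    groups = defaultdict(list)
    for col in df.columns:
        first = str(col)[:1]
        if first and first in string.ascii_uppercase:
            groups[first].append(col)
    # Sort keys alphabetically; letters with no columns never appear
    return dict(sorted(groups.items()))


# Toy frame with made-up question columns
toy = pd.DataFrame(columns=["A01", "A02", "F01", "R15", "R16"])
print(group_columns_by_prefix(toy))
```

Because keys are only created when a column is found, there is no need to add and then remove empty entries.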

In [50]:
# Getting question categories for each dataset
_2020_dictionary = get_columns(df2020)
_2016_dictionary = get_columns(df2016)
_2012_dictionary = get_columns(df2012)

In [19]:
# Dropping question categories
df2020.drop(_2020_dictionary['A'], axis=1, inplace=True, errors = "ignore")
df2020.drop(_2020_dictionary['D'], axis=1, inplace=True)
df2020.drop(_2020_dictionary['E'], axis=1, inplace=True)

In [20]:
_2020_dictionary.pop('A')
_2020_dictionary.pop('D')
_2020_dictionary.pop('E');

In [21]:
# Dropping question categories
df2016subset = df2016
df2016subset.drop(_2016_dictionary['A'], axis=1, inplace=True, errors = "ignore")
df2016subset.drop(_2016_dictionary['D'], axis=1, inplace=True)
df2016subset.drop(_2016_dictionary['E'], axis=1, inplace=True)

In [22]:
_2016_dictionary.pop('A')
_2016_dictionary.pop('D')
_2016_dictionary.pop('E');

In [51]:
df2012subset = df2012
df2012subset.drop(_2012_dictionary['A'], axis=1, inplace=True, errors = "ignore")
df2012subset.drop(_2012_dictionary['D'], axis=1, inplace=True)
df2012subset.drop(_2012_dictionary['E'], axis=1, inplace=True)

In [52]:
_2012_dictionary.pop('A')
_2012_dictionary.pop('D')
_2012_dictionary.pop('E');

In [25]:
# categorical_columns = list(df2020.columns)

# Train Test Split

In [26]:
# # Performing train/test split
# X = X[categorical_columns]
# X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Clustering

## 2020 Analysis

The 2020 data requires less preprocessing because of the way it's formatted.

In [27]:
# The dimension of data
print('Dimension data: {} rows and {} columns'.format(len(df2020), len(df2020.columns)))
# Print the first 5 rows
df2020.head();

Dimension data: 7453 rows and 204 columns


In [28]:
kmode2020 = KModes(n_clusters=6, init = "random", n_init = 5, verbose=1) 
kmode2020.fit_predict(df2020) 

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 3056, cost: 698635.0
Run 1, iteration: 2/100, moves: 1082, cost: 696973.0
Run 1, iteration: 3/100, moves: 527, cost: 696441.0
Run 1, iteration: 4/100, moves: 248, cost: 696317.0
Run 1, iteration: 5/100, moves: 171, cost: 696257.0
Run 1, iteration: 6/100, moves: 19, cost: 696257.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 2648, cost: 709211.0
Run 2, iteration: 2/100, moves: 1223, cost: 704728.0
Run 2, iteration: 3/100, moves: 895, cost: 702214.0
Run 2, iteration: 4/100, moves: 717, cost: 700972.0
Run 2, iteration: 5/100, moves: 388, cost: 700745.0
Run 2, iteration: 6/100, moves: 160, cost: 700720.0
Run 2, iteration: 7/100, moves: 25, cost: 700720.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 2780, cost: 698968.0
Run 3, iteration: 2/100, move

array([1, 4, 5, ..., 0, 1, 5], dtype=uint16)

In [30]:
df2020['Cluster'] = kmode2020.labels_

## 2016 Analysis

Some of the categories in 2016 and 2012 mix numeric and string data, so we will need to deal with that before doing the cluster analysis.

In [31]:
_2016_dictionary;

J, K, L, M, N, P

In [32]:
df2016_test_subset = df2016subset[df2016subset.columns[pd.Series(df2016subset.columns).str.startswith('R')]]

In [33]:
df2016subset['K08']

0        Low commitment
1                     2
2                     3
3                     4
4                     3
             ...       
3644                  3
3645                  3
3646                  2
3647                  3
3648    High commitment
Name: K08, Length: 3649, dtype: category
Categories (6, object): ['High commitment' < 2 < 3 < 4 < 'Low commitment' < 'NA']
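The mixed labels shown for `K08` (strings such as 'High commitment' alongside the integers 2, 3 and 4) can be converted to strings in one step. The sketch below uses a toy column built to resemble that output, and it assigns the result back rather than passing `inplace=True`, which I believe newer pandas releases no longer accept. The explicit label lists in the following cells remain the approach used in this notebook.

```python
import pandas as pd

# Toy column mimicking the mixed string/integer labels shown for K08
scale = ["High commitment", 2, 3, 4, "Low commitment", "NA"]
k08 = pd.Series(pd.Categorical(["Low commitment", 2, 3, "High commitment"],
                               categories=scale, ordered=True))

# rename_categories accepts a callable: str() turns 2 into '2' and leaves text unchanged
k08_clean = k08.cat.rename_categories(lambda label: str(label))

print(list(k08_clean.cat.categories))
```

The category order is preserved, so an ordered scale stays ordered after the rename.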

In [34]:
df2016subset['H01'].cat.rename_categories(['Government insurance plan','2','3','4','5','6','Private insurance plan','NA'], inplace = True)
df2016subset['H02'].cat.rename_categories(['Government insurance plan','2','3','4','5','6','Private insurance plan','NA'], inplace = True)
df2016subset['H03'].cat.rename_categories(['Government insurance plan','2','3','4','5','6','Private insurance plan','NA'], inplace = True)

In [35]:
df2016subset['J01'].cat.rename_categories(['Provide many fewer services','2','3','4','5','6','Provide many more services','NA'], inplace = True)
df2016subset['J02'].cat.rename_categories(['Provide many fewer services','2','3','4','5','6','Provide many more services','NA'], inplace = True)
df2016subset['J03'].cat.rename_categories(['Provide many fewer services','2','3','4','5','6','Provide many more services','NA'], inplace = True)
df2016subset['J04'].cat.rename_categories(['Government should see to it','2','3','4','5','6','Individuals on own','NA'], inplace = True)
df2016subset['J05'].cat.rename_categories(['Government should see to it','2','3','4','5','6','Individuals on own','NA'], inplace = True)
df2016subset['J06'].cat.rename_categories(['Government should see to it','2','3','4','5','6','Individuals on own','NA'], inplace = True)
df2016subset['J14'].cat.rename_categories(['Regulate business','2','3','4','5','6','Do not regulate', 'NA'], inplace = True)
df2016subset['J15'].cat.rename_categories(['Regulate business','2','3','4','5','6','Do not regulate', 'NA'], inplace = True)
df2016subset['J16'].cat.rename_categories(['Regulate business','2','3','4','5','6','Do not regulate', 'NA'], inplace = True);

In [36]:
df2016subset['K07'].cat.rename_categories(['High traditionalism','2','3','4','Low traditionalism','NA'], inplace = True)
df2016subset['K08'].cat.rename_categories(['High commitment','2','3','4','Low commitment','NA'], inplace = True);
df2016subset['K09'].cat.rename_categories(['Low tolerance','2','3','4','High tolerance','NA'], inplace = True);

In [37]:
df2016subset['L06'].cat.rename_categories(['High','2','3','4','Low','NA'], inplace = True);

In [38]:
df2016subset['M01'].cat.rename_categories(['Govt should help blacks','2','3','4','5','6','Blacks should help themselves', 'NA'], inplace = True);
df2016subset['M02'].cat.rename_categories(['Govt should help blacks','2','3','4','5','6','Blacks should help themselves', 'NA'], inplace = True);
df2016subset['M03'].cat.rename_categories(['Govt should help blacks','2','3','4','5','6','Blacks should help themselves', 'NA'], inplace = True);
df2016subset['M06'].cat.rename_categories(['Low support','2','3','4','High support','NA'], inplace = True);

In [39]:
df2016subset['N05'].cat.rename_categories(['Greatly decrease','2','3','4','5','6','Greatly increase','NA'], inplace = True);
df2016subset['N06'].cat.rename_categories(['Greatly decrease','2','3','4','5','6','Greatly increase','NA'], inplace = True);
df2016subset['N07'].cat.rename_categories(['Greatly decrease','2','3','4','5','6','Greatly increase','NA'], inplace = True);

In [40]:
df2016subset['P01'].cat.rename_categories(['High Trust','2','3','Low Trust','NA'], inplace = True);
df2016subset['P02'].cat.rename_categories(['High efficacy','2','3','Low efficacy','NA'], inplace = True);
df2016subset['P03'].cat.rename_categories(['High support','2','3','Low support','NA'], inplace = True);
df2016subset['P06'].cat.rename_categories(['High Trust','2','3','Low Trust','NA'], inplace = True);

In [41]:
# cost = [] 
# K = range(6,7) 
# for k in list(K): 
#     kmode = KModes(n_clusters=k, init = "random", n_init = 5, verbose=1) 
#     kmode.fit_predict(df2016subset) 
#     cost.append(kmode.cost_) 

# plt.plot(K, cost, 'x-') 
# plt.xlabel('No. of clusters') 
# plt.ylabel('Cost') 
# plt.title('Elbow Curve') 
# plt.show()

In [42]:
kmode2016 = KModes(n_clusters=6, init = "random", n_init = 5, verbose=1) 
kmode2016.fit_predict(df2016subset) 

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 1707, cost: 272664.0
Run 1, iteration: 2/100, moves: 549, cost: 271626.0
Run 1, iteration: 3/100, moves: 344, cost: 271331.0
Run 1, iteration: 4/100, moves: 167, cost: 271268.0
Run 1, iteration: 5/100, moves: 104, cost: 271228.0
Run 1, iteration: 6/100, moves: 15, cost: 271228.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 1504, cost: 273229.0
Run 2, iteration: 2/100, moves: 574, cost: 272177.0
Run 2, iteration: 3/100, moves: 336, cost: 271884.0
Run 2, iteration: 4/100, moves: 100, cost: 271870.0
Run 2, iteration: 5/100, moves: 20, cost: 271870.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 1645, cost: 273475.0
Run 3, iteration: 2/100, moves: 701, cost: 272376.0
Run 3, iteration: 3/100, moves: 298, cost: 272065.0
Run 3, iteration: 4/100, moves:

array([4, 0, 2, ..., 2, 2, 5], dtype=uint16)

In [43]:
df2016subset['Cluster'] = kmode2016.labels_

## 2012 Analysis

J, K, M, N, P

In [45]:
# df2012subset = df2012[df2012.columns[pd.Series(df2012.columns).str.startswith('P')]]

In [53]:
print(df2012subset['J01'].cat.categories)

Index(['Provide many fewer services', 2, 3, 4, 5, 6, 'Provide many more services', 'NA'], dtype='object')


In [54]:
df2012subset['J01'].cat.rename_categories(['Provide many fewer services','2','3','4','5','6','Provide many more services','NA'], inplace = True)
df2012subset['J02'].cat.rename_categories(['Provide many fewer services','2','3','4','5','6','Provide many more services','NA'], inplace = True)
df2012subset['J03'].cat.rename_categories(['Provide many fewer services','2','3','4','5','6','Provide many more services','NA'], inplace = True)
df2012subset['J04'].cat.rename_categories(['Government should see to it','2','3','4','5','6','Individuals on own','NA'], inplace = True)
df2012subset['J05'].cat.rename_categories(['Government should see to it','2','3','4','5','6','Individuals on own','NA'], inplace = True)
df2012subset['J06'].cat.rename_categories(['Government should see to it','2','3','4','5','6','Individuals on own','NA'], inplace = True)
df2012subset['J07'].cat.rename_categories(['Government health plan','2','3','4','5','6','Private health plans','NA'], inplace = True)
df2012subset['J08'].cat.rename_categories(['Government health plan','2','3','4','5','6','Private health plans','NA'], inplace = True)
df2012subset['J09'].cat.rename_categories(['Government health plan','2','3','4','5','6','Private health plans','NA'], inplace = True)
df2012subset['J14'].cat.rename_categories(['Regulate business','2','3','4','5','6','Do not regulate', 'NA'], inplace = True)
df2012subset['J15'].cat.rename_categories(['Regulate business','2','3','4','5','6','Do not regulate', 'NA'], inplace = True)
df2012subset['J16'].cat.rename_categories(['Regulate business','2','3','4','5','6','Do not regulate', 'NA'], inplace = True);

In [55]:
df2012subset['K17'].cat.rename_categories(['High traditionalism','2','3','4','Low traditionalism','NA'], inplace = True)
df2012subset['K18'].cat.rename_categories(['Strong commitment','2','3','4','Weak commitment','NA'], inplace = True);

In [56]:
df2012subset['M01'].cat.rename_categories(['Govt should help blacks','2','3','4','5','6','Blacks should help themselves', 'NA'], inplace = True);
df2012subset['M02'].cat.rename_categories(['Govt should help blacks','2','3','4','5','6','Blacks should help themselves', 'NA'], inplace = True);
df2012subset['M03'].cat.rename_categories(['Govt should help blacks','2','3','4','5','6','Blacks should help themselves', 'NA'], inplace = True);
df2012subset['M08'].cat.rename_categories(['High support','2','3','4','Low support','NA'], inplace = True);

In [57]:
df2012subset['N06'].cat.rename_categories(['Greatly decrease','2','3','4','5','6','Greatly increase','NA'], inplace = True);
df2012subset['N07'].cat.rename_categories(['Greatly decrease','2','3','4','5','6','Greatly increase','NA'], inplace = True);
df2012subset['N08'].cat.rename_categories(['Greatly decrease','2','3','4','5','6','Greatly increase','NA'], inplace = True);

In [58]:
df2012subset['P02'].cat.rename_categories(['High','2','3','4','Low','NA'], inplace = True);
df2012subset['P03'].cat.rename_categories(['High','2','3','Low','NA'], inplace = True);
df2012subset['P04'].cat.rename_categories(['High','2','3','Low','NA'], inplace = True);
df2012subset['P12'].cat.rename_categories(['Little or no difference','2','3','Big difference','NA'], inplace = True);
df2012subset['P13'].cat.rename_categories(['Little or no difference','2','3','Big difference','NA'], inplace = True);

In [59]:
# cost = [] 
# K = range(6,7)  
# for k in list(K): 
#     kmode = KModes(n_clusters=k, init = "random", n_init = 5, verbose=1) 
#     kmode.fit_predict(df2012subset) 
#     cost.append(kmode.cost_) 

# plt.plot(K, cost, 'x-') 
# plt.xlabel('No. of clusters') 
# plt.ylabel('Cost') 
# plt.title('Elbow Curve') 
# plt.show()

In [60]:
kmode2012 = KModes(n_clusters=6, init = "random", n_init = 5, verbose=1) 
kmode2012.fit_predict(df2012subset) 

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 2013, cost: 465598.0
Run 1, iteration: 2/100, moves: 669, cost: 464644.0
Run 1, iteration: 3/100, moves: 301, cost: 464465.0
Run 1, iteration: 4/100, moves: 121, cost: 464410.0
Run 1, iteration: 5/100, moves: 73, cost: 464343.0
Run 1, iteration: 6/100, moves: 37, cost: 464343.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 2124, cost: 488493.0
Run 2, iteration: 2/100, moves: 759, cost: 487003.0
Run 2, iteration: 3/100, moves: 604, cost: 486020.0
Run 2, iteration: 4/100, moves: 505, cost: 485289.0
Run 2, iteration: 5/100, moves: 349, cost: 485086.0
Run 2, iteration: 6/100, moves: 209, cost: 484877.0
Run 2, iteration: 7/100, moves: 194, cost: 484762.0
Run 2, iteration: 8/100, moves: 46, cost: 484762.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 2

array([4, 2, 2, ..., 2, 0, 3], dtype=uint16)

In [61]:
df2012subset['Cluster'] = kmode2012.labels_

In [62]:
df2020.to_csv('2020_Cluster_Analysis.csv', index=False)
df2016subset.to_csv('2016_Cluster_Analysis.csv', index=False)
df2012subset.to_csv('2012_Cluster_Analysis.csv', index=False)

In [None]:
def cluster(data):
    cost = [] 
    K = range(1,6) 
    for k in list(K): 
        kmode = KModes(n_clusters=k, init = "random", n_init = 5, verbose=0) 
        kmode.fit_predict(data) 
        cost.append(kmode.cost_) 
      
    plt.plot(K, cost, 'x-') 
    plt.xlabel('No. of clusters') 
    plt.ylabel('Cost') 
    plt.title('Elbow Curve') 
    plt.show()
    return kmode

In [None]:
cluster_analysis_2020 = cluster(df2020)

In [None]:
cluster_analysis_2016 = cluster(df2016subset)

In [None]:
cluster_analysis_2012 = cluster(df2012subset)

In [None]:
kmode = KModes(n_clusters=6, init = "random", n_init = 5, verbose=1) 
kmode.fit_predict(X_train) 

In [None]:
kmode.cluster_centroids_

# Dummy Model

Using the uniform strategy for the dummy model should result in a roughly 50/50 split between our two choices, which is what we see. This will serve as our baseline to compare the following models against.

In [None]:
# Dummy model to use as baseline
dummy_clf = DummyClassifier(strategy = "uniform")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_test, y_test)
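To see why a uniform dummy model lands near 50% on a two-class problem, here is a small sketch on synthetic data. The arrays `X_toy` and `y_toy` are made up for the illustration and are unrelated to the survey data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic two-class target; the dummy model ignores the features entirely
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=10_000)
X_toy = np.zeros((len(y_toy), 1))

# "uniform" predicts each class with equal probability, regardless of the input
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any real model should clear this bar comfortably before its accuracy is worth interpreting.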

Now that we have that as a baseline we can begin the modeling process. We will start with a decision tree.

# Decision Tree

In [None]:
# Setting up one hot encoder to use with our categorical data
categorical_processing = OneHotEncoder(handle_unknown='ignore')

preprocessing = ColumnTransformer(
    [
        ("cat", categorical_processing, categorical_columns),
    ],
    verbose_feature_names_out=False,
)

# Setting up pipeline steps
tree_pipe = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", DecisionTreeClassifier(random_state=42)),
    ]
)
# Fitting pipeline to the training data
tree_pipe.fit(X_train, y_train)

In [None]:
# Getting predictions
y_pred = tree_pipe.predict(X_train)
# Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

# Getting cross validation score 
print(f"CV accuracy: {cross_val_score(tree_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

As a starting point, that is a good score. Let’s see if we can improve on it by tuning the hyperparameters using GridSearchCV, which tries every combination of parameters looking for the best results.

# Decision Tree Second Iteration 

In [None]:
# # Setting up parameter grid
# param_grid = {'classifier__criterion': ['gini', 'entropy', 'log_loss'],               
#               'classifier__max_depth': [2, 4, 6, 8, 10, 12]
#              }
# # Executing gridsearch
# gridsearch = GridSearchCV(estimator=tree_pipe,
#                           param_grid=param_grid,
#                           scoring='accuracy',
#                           cv=5,
#                           n_jobs = 3
#                          )
# # Fit the training data
# gridsearch.fit(X_train, y_train)
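Because a grid search is exhaustive, its cost is easy to estimate up front. The sketch below counts the candidates for the decision tree grid above using scikit-learn's `ParameterGrid`; it does not fit any models.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the decision tree grid in this notebook
param_grid = {
    "classifier__criterion": ["gini", "entropy", "log_loss"],
    "classifier__max_depth": [2, 4, 6, 8, 10, 12],
}

candidates = list(ParameterGrid(param_grid))
n_folds = 5

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # number of model fits during the search
```

Each extra parameter multiplies the number of fits, which is why the random forest search later starts with a randomized search instead.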

In [None]:
# # Print accuracy score for the best estimator and the best parameters 
# print(f'Best estimator score: ' + '{:.4%}'.format(gridsearch.score(X_train, y_train)))
# print(f'Gridsearch best params: ')
# print(gridsearch.best_params_)

### Results
Gridsearch score:  0.979367866549605

Gridsearch best params: 
- 'classifier__criterion': 'gini'
- 'classifier__max_depth': 8

In [None]:
# Updating the parameters in the pipeline
tree_pipe.set_params(classifier__criterion = 'gini',
                     classifier__max_depth = 8,
                    )
# Refitting pipeline
tree_pipe.fit(X_train, y_train)

In [None]:
# # Updating the parameters in the pipeline
# tree_pipe.set_params(classifier__criterion = gridsearch.best_params_['classifier__criterion'],
#                      classifier__max_depth = gridsearch.best_params_['classifier__max_depth'],
#                     )
# # Refitting pipeline
# tree_pipe.fit(X_train, y_train)

In [None]:
# Getting predictions from pipeline
y_pred = tree_pipe.predict(X_train)

# Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

# Getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(tree_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

There is some improvement there, but the tree is still somewhat overfit. Let’s take a look at the feature importance to get a better sense of what is going on.

# Feature Importance

In [None]:
# This function gets feature importances out of the pipeline. Single features are broken up into multiple columns because of 
# the encoding. This aggregates the importances by feature so high cardinality features are not discounted. 

def get_feature_importances(pipe):
    # Getting feature names
    feature_names = pipe[:-1].get_feature_names_out()
    # Creating a series with the feature names and their importances 
    feature_importances = pd.Series(pipe[-1].feature_importances_, index=feature_names).sort_values(ascending=True)
    # Creating a pandas DataFrame with the feature importances
    importances = feature_importances.to_frame(name = 'importance').reset_index().rename(columns={"index": "feature"})
    # Slicing the feature names stored in 'feature' to the first three characters, which is the original feature name
    importances['feature'] = importances['feature'].str.slice(0, 3)
    # Grouping and summing the features
    importances = importances.groupby('feature').sum()
    # Returning a DataFrame with the feature importances
    return importances
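The aggregation step can be checked on made-up numbers. In the sketch below, the importances and the one-hot column names are invented for the illustration; only the three-character prefix convention comes from the function above.

```python
import pandas as pd

# Made-up importances for one-hot columns named "<feature>_<answer>"
raw = pd.Series({
    "F01_Liberal": 0.30,
    "F01_Conservative": 0.25,
    "R15_High": 0.05,
    "R15_Low": 0.40,
})

# Recover the original feature name from the first three characters, then sum
by_feature = raw.groupby(raw.index.str.slice(0, 3)).sum()
print(by_feature)
```

Without the sum, a feature with many answer options would look unimportant simply because its importance is spread over many columns.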

In [None]:
# Getting top 10 feature importances for the Tree
tree_importances = get_feature_importances(tree_pipe)
tree_importances.nlargest(10, columns= 'importance')

In [None]:
# getting 10 smallest feature importances
tree_importances.nsmallest(10, columns= 'importance')

In [None]:
# Summing 100 smallest feature importances
tree_importances.nsmallest(100, columns= 'importance').sum()

The sum of the 100 least important features is zero, so the tree is not taking those into account. Next we will try a random forest, which builds a number of decision trees, each trained on a random sample of the data and considering a random subset of features at each split. This will allow it to use a broader selection of the data and hopefully get better results.

# Random Forest

In [None]:
# Setting up one hot encoder to use with our categorical data
categorical_processing = OneHotEncoder(handle_unknown='ignore')

preprocessing = ColumnTransformer(
    [
        ("cat", categorical_processing, categorical_columns),
    ],
    verbose_feature_names_out=False,
)

# Setting up pipeline steps
forest_pipe = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", RandomForestClassifier(random_state=42)),
    ]
)

# Fitting pipeline to the training data
forest_pipe.fit(X_train, y_train)

In [None]:
# Getting predictions from pipeline using training data
y_pred = forest_pipe.predict(X_train)

# Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

# Getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(forest_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

That’s a good score for an untuned model, but it is overfit. I will try to address that by using RandomizedSearchCV and GridSearchCV to tune the hyperparameters.

# RandomizedSearchCV

In [None]:
# Setting up the parameters for RandomizedSearchCV to test
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Creating random grid
random_grid = {'classifier__n_estimators': n_estimators,
               'classifier__max_depth': max_depth,
               'classifier__min_samples_split': min_samples_split,
               'classifier__min_samples_leaf': min_samples_leaf,
               'classifier__bootstrap': bootstrap}

In [None]:
# # setting up RandomizedSearchCV 
# forest_random = RandomizedSearchCV(estimator = forest_pipe,
#                                    param_distributions = random_grid,
#                                    n_iter = 100,
#                                    cv = 3,
#                                    verbose=2,
#                                    random_state=42,
#                                    n_jobs = -1
#                                   )
# # Fitting random search model
# forest_random.fit(X_train, y_train)

In [None]:
# #checking best parameters
# forest_random.best_params_

### Results
- 'classifier__n_estimators': 400
- 'classifier__min_samples_split': 5
- 'classifier__min_samples_leaf': 2
- 'classifier__max_depth': 90
- 'classifier__bootstrap': False
 

# GridSearchCV

Based on the results from our randomized search I constructed this parameter grid to feed into GridSearchCV.

In [None]:
# # Setting up parameter grid
# param_grid = {'classifier__n_estimators': [200, 300, 400],
#               'classifier__criterion': ['gini', 'entropy', 'log_loss'],               
#               'classifier__max_depth': [70, 80, 90],
#               'classifier__min_samples_split': [4, 5, 6],
#               'classifier__min_samples_leaf': [2, 3, 4],
#               'classifier__bootstrap': [False, True]
#              }
# # Executing gridsearch
# gridsearch = GridSearchCV(estimator=forest_pipe,
#                           param_grid=param_grid,
#                           scoring='accuracy',
#                           cv=5,
#                           n_jobs = 3
#                          )
# # Fit the training data
# gridsearch.fit(X_train, y_train)

In [None]:
# # Print the accuracy on train set
# print(f'Best estimator score: ' + '{:.4%}'.format(gridsearch.score(X_train, y_train)))
# print(f'Gridsearch best params: ')
# print(gridsearch.best_params_)

### Results

Best estimator score: 99.2098%

Gridsearch best params: 
- 'classifier__bootstrap': True 
- 'classifier__criterion': 'gini' 
- 'classifier__max_depth': 70 
- 'classifier__min_samples_leaf': 3 
- 'classifier__min_samples_split': 4 
- 'classifier__n_estimators': 300

In [None]:
# Updating the parameters in the pipeline
forest_pipe.set_params(classifier__bootstrap = True,
                       classifier__criterion = 'gini',
                       classifier__max_depth = 70,
                       classifier__min_samples_leaf = 3,
                       classifier__min_samples_split = 4,
                       classifier__n_estimators = 300,
                      )
# Refitting pipeline
forest_pipe.fit(X_train, y_train)

In [None]:
# # Updating the parameters in the pipeline
# forest_pipe.set_params(classifier__n_estimators = gridsearch.best_params_['classifier__n_estimators'],
#                        classifier__criterion = gridsearch.best_params_['classifier__criterion'],
#                        classifier__max_depth = gridsearch.best_params_['classifier__max_depth'],
#                        classifier__min_samples_leaf = gridsearch.best_params_['classifier__min_samples_leaf'],
#                        classifier__min_samples_split = gridsearch.best_params_['classifier__min_samples_split'],
#                        classifier__bootstrap = gridsearch.best_params_['classifier__bootstrap'],
#                       )
# # Refitting pipeline
# forest_pipe.fit(X_train, y_train)

In [None]:
#Getting predictions from pipeline using training data
y_pred = forest_pipe.predict(X_train)

#Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

#getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(forest_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

Training data prediction accuracy:  0.9920983318700615

CV accuracy: 0.966198942746548

Both scores improved slightly, but the model is still clearly overfit: the training accuracy (99.2%) is well above the cross-validation accuracy (96.6%). RandomizedSearchCV suggested that the model performed best with a max depth of 90, which is high, and the grid search settled on 70, the lowest value in the grid. Since we are worried about overfitting, we can try to prune the trees by decreasing the max depth.
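
The gap between training accuracy and cross-validation accuracy is the quantity being watched here; for the scores above it is roughly 2.6 percentage points. As a rough sketch of how that gap tends to respond to `max_depth`, the example below uses synthetic data with some label noise (an assumption for illustration, not the survey data) and a plain random forest rather than the full pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, noisy data standing in for the real training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

gaps = {}
for depth in [2, 5, 10, None]:  # None lets every tree grow until its leaves are pure
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    # Mean accuracy on held-out folds
    cv_acc = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    # Accuracy on the same data the model was fit on
    train_acc = forest.fit(X_demo, y_demo).score(X_demo, y_demo)
    gaps[depth] = train_acc - cv_acc
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}, gap={gaps[depth]:.3f}")
```

A shrinking gap with a stable cross-validation score is the outcome to look for; a shrinking gap that comes only from lower training accuracy does not mean the model generalizes better.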

# Hyperparameter Tuning Second Iteration

In [None]:
# # Setting up parameter grid
# param_grid = {'classifier__n_estimators': [50, 100, 150, 200],
#               'classifier__criterion': ['gini', 'entropy', 'log_loss'],               
#               'classifier__max_depth': [4, 6, 8, 10, 12, 14],
#               'classifier__bootstrap': [True, False]
#              }
# # Executing gridsearch
# gridsearch = GridSearchCV(estimator=forest_pipe,
#                           param_grid=param_grid,
#                           scoring='accuracy',
#                           cv=5,
#                           n_jobs = 3
#                          )
# # Fit the training data
# gridsearch.fit(X_train, y_train)

In [None]:
# # Print the accuracy on train set
# print(f'Best estimator score: ' + '{:.4%}'.format(gridsearch.score(X_train, y_train)))
# print(f'Gridsearch best params: ')
# print(gridsearch.best_params_)

### Results

Best estimator score: 99.1659%

Gridsearch best params: 
- 'classifier__bootstrap': True
- 'classifier__criterion': 'gini'
- 'classifier__max_depth': 14
- 'classifier__n_estimators': 150

In [None]:
# Updating the parameters in the pipeline
forest_pipe.set_params(classifier__bootstrap = True,
                       classifier__criterion = 'gini',
                       classifier__max_depth = 14,
                       classifier__n_estimators = 150,
                      )
# Refitting pipeline
forest_pipe.fit(X_train, y_train)

In [None]:
# # Updating the parameters in the pipeline
# forest_pipe.set_params(classifier__bootstrap = gridsearch.best_params_['classifier__bootstrap'],
#                        classifier__criterion = gridsearch.best_params_['classifier__criterion'],
#                        classifier__max_depth = gridsearch.best_params_['classifier__max_depth'],
#                        classifier__n_estimators = gridsearch.best_params_['classifier__n_estimators'],
#                       )
# # Refitting pipeline
# forest_pipe.fit(X_train, y_train)

In [None]:
#Getting predictions from pipeline using training data
y_pred = forest_pipe.predict(X_train)

#Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

#getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(forest_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

Those scores show almost no change. min_samples_split and min_samples_leaf can also help prevent overfitting, so I will tune those next.

# Hyperparameter Tuning Third Iteration

In [None]:
# # Setting up parameter grid
# param_grid = {'classifier__n_estimators': [125, 150, 175],              
#               'classifier__max_depth': [12, 14, 16],
#               'classifier__min_samples_split': [4, 6, 8, 10],
#               'classifier__min_samples_leaf': [3, 4, 5, 6],
#              }
# # Executing gridsearch
# gridsearch = GridSearchCV(estimator=forest_pipe,
#                           param_grid=param_grid,
#                           scoring='accuracy',
#                           cv=5,
#                           n_jobs = 3
#                          )
# # Fit the training data
# gridsearch.fit(X_train, y_train)

In [None]:
# # Print the accuracy on train set
# print(f'Best estimator score: ' + '{:.4%}'.format(gridsearch.score(X_train, y_train)))
# print(f'Gridsearch best params: ')
# print(gridsearch.best_params_)

### Results

Best estimator score: 99.1659%

Gridsearch best params:

- 'classifier__max_depth': 14
- 'classifier__min_samples_leaf': 3
- 'classifier__min_samples_split': 4
- 'classifier__n_estimators': 150

In [None]:
# Updating the parameters in the pipeline
forest_pipe.set_params(classifier__max_depth = 14,
                       classifier__min_samples_leaf = 3,
                       classifier__min_samples_split = 4,
                       classifier__n_estimators = 150,
                      )
# Refitting pipeline
forest_pipe.fit(X_train, y_train)

In [None]:
# # Updating the parameters in the pipeline
# forest_pipe.set_params(classifier__max_depth = gridsearch.best_params_['classifier__max_depth'],
#                        classifier__min_samples_leaf = gridsearch.best_params_['classifier__min_samples_leaf'],
#                        classifier__min_samples_split = gridsearch.best_params_['classifier__min_samples_split'],
#                        classifier__n_estimators = gridsearch.best_params_['classifier__n_estimators'],
#                       )
# # Refitting pipeline
# forest_pipe.fit(X_train, y_train)

In [None]:
#Getting predictions from pipeline using training data
y_pred = forest_pipe.predict(X_train)

#Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

#getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(forest_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

This iteration produced no change from the previous one, so parameter tuning has gotten me as far as it can. Next I will try incorporating dimensionality reduction using TruncatedSVD.
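
Before placing TruncatedSVD in front of a classifier, it can help to check how much of the one-hot encoded variance a given number of components retains. The sketch below is a minimal illustration on made-up categorical answers; the column names, the data, and the choice of 10 components are assumptions for the example and not this project's values.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up survey-style answers: 8 questions with 5 possible responses each (assumption)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.integers(0, 5, size=(200, 8)).astype(str),
    columns=[f"Q{i}" for i in range(8)],
)

# One-hot encode, then project the sparse matrix onto 10 components
reducer = Pipeline(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ("SVD", TruncatedSVD(n_components=10, random_state=0)),
    ]
)
reduced = reducer.fit_transform(demo_df)

# Share of the encoded data's variance captured by the kept components
retained = reducer.named_steps["SVD"].explained_variance_ratio_.sum()
print(reduced.shape)
print(f"Variance retained: {retained:.2%}")
```

TruncatedSVD works directly on the sparse output of OneHotEncoder, which is why it is a common choice after one-hot encoding.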

# TruncatedSVD

In [None]:
# Setting up pipeline steps
forest_pipe_SVD = Pipeline(
    [
        ("preprocess", preprocessing),
        ("SVD", TruncatedSVD(n_components = 200)),
        ("classifier", RandomForestClassifier(max_depth = 12,
                                              n_estimators = 200,
                                              min_samples_split = 5,
                                              min_samples_leaf = 4,
                                              bootstrap = True,
                                              
                                             )
        )
    ]
)
# Fitting pipeline to the training data
forest_pipe_SVD.fit(X_train, y_train)

In [None]:
# # Setting up parameter grid
# param_grid = {'classifier__n_estimators': [175, 200, 225],
#               'classifier__max_depth': [8, 10, 12],
#               'SVD__n_components' : [10, 100, 200, 300, 1000]
#              }
# # Executing gridsearch
# gridsearch = GridSearchCV(estimator=forest_pipe_SVD, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs = 3)
# # Fit the training data
# gridsearch.fit(X_train, y_train)

In [None]:
# # Print the accuracy on train set
# print(f'Best estimator score: ' + '{:.4%}'.format(gridsearch.score(X_train, y_train)))
# print(f'Gridsearch best params: ')
# print(gridsearch.best_params_)

### Results 

- 'SVD__n_components': 10
- 'classifier__max_depth': 10
- 'classifier__n_estimators': 200

In [None]:
# Updating the parameters in the pipeline
forest_pipe_SVD.set_params(SVD__n_components = 10,
                           classifier__max_depth = 10,
                           classifier__n_estimators = 200
                          )
# Refitting pipeline
forest_pipe_SVD.fit(X_train, y_train)

In [None]:
# # Updating the parameters in the pipeline
# forest_pipe_SVD.set_params(SVD__n_components = gridsearch.best_params_['SVD__n_components'],
#                            classifier__max_depth = gridsearch.best_params_['classifier__max_depth'],
#                            classifier__n_estimators = gridsearch.best_params_['classifier__n_estimators']
#                           )
# # Refitting pipeline
# forest_pipe_SVD.fit(X_train, y_train)

In [None]:
#Getting predictions from pipeline using training data
y_pred = forest_pipe_SVD.predict(X_train)

#Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

#getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(forest_pipe_SVD, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

The scores actually got slightly worse. XGBoost is another tree-based model that often performs better than random forests, so I will try that next.

# XGBoost

In [None]:
# Setting up pipeline steps
XGBoost_pipe = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", XGBClassifier(random_state=42))
    ]
)
# Fitting pipeline to the training data
XGBoost_pipe.fit(X_train, y_train)

In [None]:
# #Getting predictions from pipeline using training data
# y_pred = XGBoost_pipe.predict(X_train)
# #Checking accuracy of predictions
# print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

# #getting cross validation score for training data 
# print(f"CV accuracy: {cross_val_score(XGBoost_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

In [None]:
# param_grid = {'classifier__learning_rate': [0.1, 0.2, 0.3],
#               'classifier__max_depth': [2, 6, 8],
#               'classifier__min_child_weight': [1, 2],
#               'classifier__subsample': [0.5, 0.7],
#               'classifier__n_estimators': [50, 100, 150],
#              }
# # Executing gridsearch
# gridsearch = GridSearchCV(estimator=XGBoost_pipe, param_grid=param_grid, scoring='accuracy', cv=5, n_jobs = 3)
# # Fit the training data
# gridsearch.fit(X_train, y_train)

In [None]:
# # Print the accuracy on train set
# print(f'Best estimator score: ' + '{:.4%}'.format(gridsearch.score(X_train, y_train)))
# print(f'Gridsearch best params: ')
# print(gridsearch.best_params_)

### Results

Best estimator score: 98.1782%

Gridsearch best params: 

- 'classifier__learning_rate': 0.1
- 'classifier__max_depth': 2
- 'classifier__min_child_weight': 1
- 'classifier__n_estimators': 150
- 'classifier__subsample': 0.5

In [None]:
XGBoost_pipe.set_params(classifier__learning_rate = 0.1,
                        classifier__max_depth = 2,
                        classifier__min_child_weight = 1,
                        classifier__subsample = .5,
                        classifier__n_estimators = 150)
XGBoost_pipe.fit(X_train, y_train)

In [None]:
# XGBoost_pipe.set_params(classifier__learning_rate = gridsearch.best_params_['classifier__learning_rate'],
#                         classifier__max_depth = gridsearch.best_params_['classifier__max_depth'],
#                         classifier__min_child_weight = gridsearch.best_params_['classifier__min_child_weight'],
#                         classifier__subsample = gridsearch.best_params_['classifier__subsample'],
#                         classifier__n_estimators = gridsearch.best_params_['classifier__n_estimators'])
# # Refitting pipeline
# XGBoost_pipe.fit(X_train, y_train)

In [None]:
#Getting predictions from pipeline using training data
y_pred = XGBoost_pipe.predict(X_train)

#Checking accuracy of predictions
print(f"Training data prediction accuracy: ", accuracy_score(y_train, y_pred))

#getting cross validation score for training data 
print(f"CV accuracy: {cross_val_score(XGBoost_pipe, X_train, y_train, cv=5, scoring = 'accuracy').mean()}")

The improvements with the XGBoost model are slight, but it is less overfit and has a higher cross-validation score, so it is the model I will go with.

# Evaluation

In [None]:
#Getting predictions from pipeline using testing data
y_pred = XGBoost_pipe.predict(X_test)

#Checking accuracy of predictions
print(f"Testing data accuracy score: ", accuracy_score(y_test, y_pred))

My final accuracy score on the test data was 96.84%, which is quite good and shows how predictable voting behavior can be. The model is still slightly overfit. This is likely due to the high number of features resulting from one-hot encoding. I did try to incorporate dimensionality reduction, but it was ineffective. Removing features before the modeling process could help solve the overfitting problem.
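
One possible way to remove features before modeling is univariate selection on the one-hot encoded columns. The sketch below is only an illustration of that idea and was not part of this project: it uses made-up answers in which a single question determines the target, `SelectKBest` with the chi-squared score, and a random forest as a stand-in classifier. The value `k=8` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up answers: 12 questions with 4 possible responses each (assumption)
rng = np.random.default_rng(0)
answers = pd.DataFrame(
    rng.integers(0, 4, size=(300, 12)).astype(str),
    columns=[f"Q{i}" for i in range(12)],
)
# Only Q0 carries signal; the other 11 questions are pure noise
target = (answers["Q0"].astype(int) >= 2).astype(int)

select_pipe = Pipeline(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
        # Keep the 8 encoded columns with the highest chi-squared score
        ("select", SelectKBest(chi2, k=8)),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

# Selection happens inside each fold, so the held-out data never informs it
cv_acc = cross_val_score(select_pipe, answers, target, cv=5, scoring="accuracy").mean()
print(f"CV accuracy with 8 selected columns: {cv_acc:.3f}")
```

Keeping the selector inside the pipeline matters: selecting features on the full training set first and cross-validating afterwards would leak information into the validation folds.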

In [None]:
#Creating confusion matrix
cf = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cf, display_labels=['Trump vote', 'Biden vote']).plot(cmap = plt.cm.cividis)

The results shown in the confusion matrix are in line with what I would expect given the accuracy score, and the model does not appear to struggle more with Biden voters than with Trump voters.
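
Whether one group of voters is harder to classify can also be read off as per-class recall, which is each diagonal entry of the confusion matrix divided by its row total. The example below is a small sketch with made-up labels; which candidate corresponds to 0 and which to 1 is not shown in this section, so the labels here are placeholders.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (placeholders, not the project's results)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes and columns are predicted classes, in sorted label order
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Recall per class: correct predictions divided by the number of true members
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)                # [[3 1], [1 5]]
print(per_class_recall)  # 0.75 for class 0, about 0.833 for class 1
```

Because the rows follow the sorted label order, the `display_labels` passed to `ConfusionMatrixDisplay` need to be listed in that same order.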

# Categorical Analysis

Now that I have a functional model that can predict how an individual will vote based on the whole dataset, I will analyze how well the model performs when it only has access to a subset of the data.

In [None]:
# Getting column dictionary for X_train
X_train_dict = get_columns(X_train)
# Setting up scoring dictionary
score_dict = {}

In [None]:
# Looping through the data by category
for key in X_train_dict:
    # Getting columns in the category
    columns = X_train_dict[key]
    # Subsetting the data
    X_train_subset = X_train[columns]
    # Updating the pipeline
    XGBoost_pipe.set_params(preprocess__transformers = [("cat", categorical_processing, columns)])
    # Refitting the model
    XGBoost_pipe.fit(X_train_subset, y_train)
    # Getting the cross validation score
    score_dict[key] = {'cross validation score' : cross_val_score(XGBoost_pipe, 
                                                                          X_train_subset, 
                                                                          y_train, 
                                                                          cv=5, 
                                                                          scoring = 'accuracy').mean()}
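
The loop above relies on a `get_columns` helper that is not shown in this section. As a rough sketch of the overall idea, the example below assumes that a column's category is the first letter of its name (as in P28 or H05) and uses a hypothetical `group_columns_by_prefix` function in its place. It also builds a fresh pipeline for each category instead of updating a shared one, and uses logistic regression on made-up data as a stand-in model.

```python
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def group_columns_by_prefix(frame):
    """Hypothetical stand-in for get_columns: map first letter -> column names."""
    groups = defaultdict(list)
    for name in frame.columns:
        groups[name[0]].append(name)
    return dict(groups)


# Made-up answers; only P28 is related to the target (assumption for illustration)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    rng.integers(0, 3, size=(200, 4)).astype(str),
    columns=["B01", "B02", "P28", "P29"],
)
demo_y = (demo_X["P28"] == "2").astype(int)

demo_groups = group_columns_by_prefix(demo_X)
demo_scores = {}
for key, cols in demo_groups.items():
    # A new pipeline per category keeps each run independent of the others
    pipe = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        LogisticRegression(max_iter=1000),
    )
    demo_scores[key] = cross_val_score(
        pipe, demo_X[cols], demo_y, cv=5, scoring="accuracy"
    ).mean()

print(demo_groups)
print(demo_scores)
```

Building a new pipeline per category avoids leaving the shared pipeline configured for whichever category happened to run last.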

In [None]:
# Creating a dictionary to update the category labels
Catagory_labels = {'B' : 'Political Engagement',
                   'C' : 'Media Trust & Consumption',
                   'F' : 'Economy',
                   'G' : 'Direction of Country',
                   'H' : 'Health Care & Policy',
                   'J' : 'Federal Spending',
                   'K' : 'Abortion, Guns, Immigration',
                   'L' : "Women's and Gender Issues",
                   'M' : 'Race, Diversity & Religious Minorities',
                   'N' : 'Security & Foreign Policy',
                   'P' : 'Trust in Government',
                   'Q' : 'LGBTQ Rights',
                   'R' : 'Demographics'}

In [None]:
# Creating a data frame with updated labels 
score_df = pd.DataFrame.from_dict(score_dict, orient = 'index')
score_df.rename(index = Catagory_labels,inplace=True)
score_df

Looking at the accuracy scores for individual categories, a few things jump out. Political Engagement has the lowest score, which is not particularly surprising: with the country so narrowly divided, both parties have similar levels of political engagement. The next lowest score is for the Demographics category. This is unfortunate because demographic information is often available at the state or county level, so being able to predict how a state or county will vote based on demographics alone would be useful. However, predictions using this data alone are still fairly accurate at 76.4%.

Looking at the most accurate categories, Trust in Government and Health Care & Policy have the highest scores, both above 94.5%. This tracks with what we know about American politics at the moment. Democrats and Republicans don’t agree on much when it comes to health care policy. For example, a 2020 study by a team from the Harvard T.H. Chan School of Public Health found that three quarters of Democrats would like the federal government to ensure that all citizens have health insurance. In contrast, 79% of Republicans preferred a health care system that relies on private insurance (https://jamanetwork.com/journals/jama/fullarticle/2777394).

The reliability of the Trust in Government category in predicting one’s vote is a bit more troubling. It has long been the case that trust in government declines when an individual’s preferred party is out of power in Washington, as one can see in this analysis from Pew: https://www.pewresearch.org/politics/2023/09/19/public-trust-in-government-1958-2023/. However, as we saw following the 2020 election, a lack of trust in institutions can quickly turn violent and deadly. The fact that this lack of trust is concentrated on one side of the political spectrum makes that an even more dangerous possibility.

In [None]:
# Setting up new pipeline 
# Preprocessing steps
preprocessing = ColumnTransformer(
    [
        ("cat", categorical_processing, categorical_columns),
    ],
    verbose_feature_names_out=False,
)

# Setting up pipeline
XGBoost_pipe = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", XGBClassifier(random_state=42,
                                     learning_rate = 0.1,
                                     max_depth = 2,
                                     min_child_weight = 1,
                                     subsample = .5,
                                     n_estimators = 150
                                    )
        )
    ]
)

# Fitting pipeline to the training data
XGBoost_pipe.fit(X_train, y_train)

In [None]:
# getting individual feature importances
XGBoost_importances = get_feature_importances(XGBoost_pipe)
XGBoost_importances.nlargest(10, columns= 'importance')

Among the most important features, a few are not surprising. Question P28 asks if the respondent favors the House of Representatives’ decision to impeach Donald Trump in 2019, and P29 asks if the respondent favors the Senate’s decision not to convict. Similarly, H05 asks if the COVID-19 response was adequate. These were obviously major issues in the 2020 election.
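
The `get_feature_importances` helper is not shown in this section. Since the ranking is indexed by question code (P28, H05) rather than by individual one-hot columns, one plausible approach is to sum the importances of all encoded columns that come from the same question. The sketch below illustrates that idea under stated assumptions: made-up data, a random forest as a stand-in for the fitted classifier, and question names borrowed only as labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Made-up answers; the target depends only on the column labeled P28 (assumption)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    rng.integers(0, 3, size=(300, 3)).astype(str),
    columns=["H05", "P28", "R03"],
)
demo_y = (demo_X["P28"] == "0").astype(int)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(demo_X)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(encoded, demo_y)

# Repeat each original column name once per category it was expanded into,
# so every encoded column is matched with the question it came from
source_column = np.repeat(
    demo_X.columns.to_numpy(), [len(cats) for cats in encoder.categories_]
)

# Sum the per-column importances back up to one value per question
demo_importances = (
    pd.DataFrame({"feature": source_column, "importance": model.feature_importances_})
    .groupby("feature")["importance"]
    .sum()
    .to_frame()
)
print(demo_importances.nlargest(2, columns="importance"))
```

Mapping through `encoder.categories_` avoids parsing the generated column names, which would break if a question code itself contained an underscore.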

In [None]:
# Pulling out the second most important feature by rank,
# rather than removing a hard-coded name from the top two
single_feature = [XGBoost_importances.nlargest(2, columns= 'importance').index[1]]

In [None]:
# Modeling with a single column
X_train_subset = X_train[single_feature]
# Updating pipeline
XGBoost_pipe.set_params(preprocess__transformers = [("cat", categorical_processing, single_feature)])
# Refitting pipeline
XGBoost_pipe.fit(X_train_subset, y_train)
# Getting new predictions
y_pred = XGBoost_pipe.predict(X_train_subset)
print(f'cross validation score' , cross_val_score(XGBoost_pipe,
                                                  X_train_subset, 
                                                  y_train, 
                                                  cv=5, 
                                                  scoring = 'accuracy').mean())

Interestingly, the model can predict who the respondent would vote for with 87.1% accuracy based solely on their opinion of the federal government's response to COVID-19. Taking this a bit further, the model loses very little accuracy as I restrict the features it has access to. Below I have given it the features ranked 11 through 20 and then those ranked 21 through 30 by importance.

In [None]:
# Getting features 11 through 20 ranked by importance
next_10_most_important_features = list(XGBoost_importances.nlargest(20, columns= 'importance').index)[10:20]

In [None]:
# Modeling with the ten selected columns
X_train_subset = X_train[next_10_most_important_features]
# Updating pipeline
XGBoost_pipe.set_params(preprocess__transformers = [("cat", categorical_processing, next_10_most_important_features)])
# Refitting pipeline
XGBoost_pipe.fit(X_train_subset, y_train)
# getting new predictions
y_pred = XGBoost_pipe.predict(X_train_subset)
print(f'cross validation score' , cross_val_score(XGBoost_pipe,
                                           X_train_subset, 
                                           y_train, 
                                           cv=5, 
                                           scoring = 'accuracy').mean())

In [None]:
# Getting features 21 through 30 ranked by importance
next_10_most_important_features = list(XGBoost_importances.nlargest(30, columns= 'importance').index)[20:30]

In [None]:
# Modeling with the ten selected columns
X_train_subset = X_train[next_10_most_important_features]
# Updating pipeline
XGBoost_pipe.set_params(preprocess__transformers = [("cat", categorical_processing, next_10_most_important_features)])
# Refitting pipeline
XGBoost_pipe.fit(X_train_subset, y_train)
# getting new predictions
y_pred = XGBoost_pipe.predict(X_train_subset)
print(f'cross validation score' , cross_val_score(XGBoost_pipe,
                                           X_train_subset, 
                                           y_train, 
                                           cv=5, 
                                           scoring = 'accuracy').mean())

Even with these less politically charged questions, the model still has a very good idea of how an individual will vote. For example, one of the columns the model had access to in the final run asked "How important should science be for decisions about COVID-19?" and another asked "How much is Iran a threat to the United States?" These are not inherently political questions, and in the U.S. they did not use to be politically relevant. However, in the era of hyperpolarization, they can accurately predict how an individual will vote.

# Conclusions

This project shows that modeling and predicting voting accurately is possible. Additionally, in the context of the 2020 election, it shows the impact of the COVID-19 pandemic on the results. Finally, with the accuracy of the predictions from the Trust in Government category, this project is another data point indicating the troubling divisions and mistrust that exist in American society.

# Next Steps 

The most obvious next step would be to look at data from the 2012 and 2016 elections to see how the issues important to voters have changed. As I mentioned previously, the Health Care & Policy category was a good predictor of vote choice in 2020. This was almost certainly influenced by the COVID-19 pandemic. Analyzing previous elections could help quantify that impact.

Demographic data is generally available and, as this project demonstrated, a somewhat accurate predictor of how an individual will vote. Improving that accuracy would be very useful to political parties and campaigns.

Finally, turnout among eligible voters in 2020 was 66%, which is a high in recent U.S. history. Still, a third of eligible voters did not turn out. If we could analyze those potential voters and understand why they don’t vote, political parties could boost turnout among their voters, or a nonpartisan group could work to boost turnout in general.