# A Machine Learning Model
## Applied to Voting Behavior in Arizona
### Keras Neural Network 


The cells below load the dependencies, pull the data from BigQuery (GBQ), clean it, and split it into training and test sets.

A political engagement indicator was created, scored 1 if the voter participated in the 2020 primary as well as the 2018 primary and general elections, and 0 otherwise.
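As a minimal sketch of that scoring rule, assuming the turnout columns are coded 0/1 as in the voter file (the toy rows below are made up for illustration and are not real voters):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real columns come from the BigQuery table.
toy = pd.DataFrame({
    "primary_2020": [1, 1, 0, 1],
    "primary_2018": [1, 0, 0, 1],
    "general_2018": [1, 1, 1, 0],
})

# A voter counts as "engaged" only if all three elections were voted in.
toy["engaged"] = np.where(
    (toy["primary_2020"] == 1)
    & (toy["primary_2018"] == 1)
    & (toy["general_2018"] == 1),
    1,
    0,
)
print(toy["engaged"].tolist())
```

Only the first toy voter participated in all three elections, so only that row is scored 1.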


In [2]:
import tensorflow as tf
import os
import pandas as pd
from datetime import datetime
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/Users/Chris/Dropbox/Keys/az-voter-file-30395362c45b.json"
from google.cloud import bigquery
from sklearn.preprocessing import MinMaxScaler
import pandas_gbq
bqclient = bigquery.Client()
project_id = "az-voter-file"
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

I query BigQuery below.

In [None]:
### Formulate the SQL query to pull the data from BigQuery


query = """
    SELECT 
    geo_id,
    registrant_id,
    general_2020,
    primary_2020,
    general_2018,
    primary_2018,
    general_2016,
    primary_2016,
    general_2014,
    bachelors_degree_2,
    poverty,
    total_pop,
    birth_year,
    registration_date,
    registration_change,
    median_age,
    median_income,
    white_pop,
    black_pop,
    asian_pop,
    hispanic_pop,
    amerindian_pop,
    gini_index,
    housing_units,
    children,
    employed_pop,
    armed_forces,
    pop_in_labor_force,
    in_undergrad_college,
    speak_only_english_at_home,
    less_than_high_school_graduate,
    P1_001N as total_district,
    P1_003N as white_district,
    P1_004N as black_district,
    P1_005N as indian_district,
    P1_006N as asian_district,
    P4_001N as total_ethnicity,
    P4_002N as latino,
FROM `az-voter-file.az_file.clean_data_machine_learning`
"""

df = pandas_gbq.read_gbq(query, project_id=project_id)
df.to_pickle('voter_file00_00_02.pkl') ## For later load, not to sync.

This is pretty rudimentary, and likely overkill, but we can compare it to far simpler measures as well. I created a variable called "engaged" that is 1 if the voter participated in the 2020 primary as well as the 2018 primary and general elections. I then split the data into a training set (80 percent) and a test set (20 percent), train a model on the training set, and test it on the test set. The model I use is a "neural network"; the version shown below has an input layer, one hidden layer of 20 units with dropout, and a two-unit softmax output layer. I tested this with different parameterizations and numbers of hidden layers. It really doesn't matter. I hit about 85% accuracy, which is marginal, but far better than chance for these data. The features I use to train the model are primary and general election voting prior to 2018, as well as the following voter-level and tract-level characteristics:
_________
### Voter Level Information
______
* 'general_2016', 
* 'bachelors_degree_2',
* 'primary_2016', 
* 'general_2014', 
* 'registration_change',
* 'registration_length' (days from 'registration_date' to 11-04-2020),
* 'age'
_________
### Tract level Information
_________
* 'poverty',
* 'median_age', 
* 'median_income',
* 'white_pop', 
* 'black_pop', 
* 'asian_pop', 
* 'hispanic_pop', 
* 'amerindian_pop', 
* 'gini_index', 
* 'housing_units', 
* 'employed_pop' 

All variables were min-max scaled to the 0-1 range prior to analysis.
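To make the rescaling concrete, here is a small sketch of the arithmetic, written by hand in pandas rather than with `MinMaxScaler`. The helper name `min_max_scale` and the toy values are assumptions for illustration only; each column is mapped through (x - min) / (max - min).

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = col_range.replace(0, 1)
    return (frame - col_min) / col_range


# Hypothetical toy data: one continuous column, one 0/1 turnout column.
toy = pd.DataFrame({
    "age": [20, 40, 60],
    "general_2016": [0, 1, 1],
})
scaled = min_max_scale(toy)
print(scaled)
```

A 0/1 column that contains both values is left unchanged by this rescaling, which is why comparisons such as `== 1` on the turnout columns still behave as expected after scaling.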



In [26]:
df = pd.read_pickle('voter_file00_00_02.pkl')

reg_length =  (pd.to_datetime("11-04-2020", format = "%m-%d-%Y") - pd.to_datetime(df['registration_date'], 
                              format = "%Y-%m-%d", 
                              errors = 'coerce'))
df["registration_length"] = reg_length.dt.days 

reg_change=  (pd.to_datetime("11-04-2020", format = "%m-%d-%Y") - pd.to_datetime(df['registration_change'], 
                                format = "%Y-%m-%d", errors = 'coerce'))
df["registration_change"] = reg_change.dt.days 
df["age"] = 2020 -  df['birth_year']


st_dat = df[['general_2020', 'primary_2020', 'general_2018', 
    'primary_2018', 'general_2016', "bachelors_degree_2",
    'primary_2016', 'general_2014', "poverty",
    'age', 'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop', 'gini_index', 'housing_units', 
    'employed_pop']]
 
st_dat = st_dat.dropna(how = 'any')
scaler = MinMaxScaler()
st_dat_array = scaler.fit_transform(st_dat)
st_dat = pd.DataFrame(st_dat_array, columns = st_dat.columns)


st_dat['engaged'] = np.where((((st_dat["primary_2018"] == 1)  
                                     & (st_dat["primary_2020"] == 1) 
                                     & (st_dat["general_2018"] == 1)
                                    )),1,0) 
train, test = train_test_split(st_dat, test_size=0.2)

features_train   =  train[['general_2016',  "bachelors_degree_2", "poverty",
                            'primary_2016', 'general_2014', 
                            'age', 'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'gini_index', 'housing_units', 
                            'employed_pop']]
labels_train     =  pd.DataFrame({"engaged":train['engaged'],  "not_engaged":1-train['engaged']})


features_train_array = np.array(features_train, np.float64)
labels_train_array   = np.array(labels_train,   np.float64)


features_test   = test[['general_2016',  "bachelors_degree_2", "poverty",
                            'primary_2016', 'general_2014', 
                            'age', 'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'gini_index', 'housing_units', 
                            'employed_pop']]
labels_test     =  pd.DataFrame({"engaged":test['engaged'],  "not_engaged":1-test['engaged']})


features_test_array = np.array(features_test, np.float64)
labels_test_array   = np.array(labels_test,   np.float64)

from tensorflow.keras.regularizers import l1_l2
model = tf.keras.Sequential()
# Hidden layer with 20 units; input_shape matches the number of features
model.add(keras.layers.Dense(20, activation='softmax', 
                               input_shape=(features_train.shape[1],)))
model.add(keras.layers.Dropout(0.25))
# model.add(keras.layers.Dense(10, activation='softmax'))
# model.add(keras.layers.Dense(5, activation='softmax'))
model.add(keras.layers.Dense(2, activation='softmax'))
    

# Compile the model, then fit for one epoch, holding out 20% of the training rows for validation
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), 
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(features_train_array, 
          labels_train_array, epochs=1, 
          validation_split=0.20)

    8/73107 [..............................] - ETA: 9:59 - loss: 0.6992 - accuracy: 0.4180   

<keras.callbacks.History at 0x474594fa0>

Above, where I constructed the training data, I also set aside 20 percent of the sample. **The model was not trained on these data. They are fresh, randomly drawn observations, so we can compare the predicted outcome to the actual outcome.** Overall, I reach about 85% accuracy, which is not great, but far better than chance.
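As a rough illustration of how held-out predictions can be scored, the sketch below computes accuracy, precision, and recall by hand from two-column labels and softmax-style outputs. The arrays are made-up toy values, not model output, and the column order (engaged first, not engaged second) is assumed to match the label frames built in this notebook.

```python
import numpy as np

# Hypothetical one-hot labels: column 0 = engaged, column 1 = not engaged.
labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0], [0, 1]], dtype=float)
# Hypothetical softmax outputs in the same column order.
probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]])

true_engaged = labels[:, 0] == 1
# The predicted class is the column with the larger probability.
pred_engaged = probs.argmax(axis=1) == 0

accuracy = np.mean(true_engaged == pred_engaged)
true_pos = np.sum(true_engaged & pred_engaged)
precision = true_pos / np.sum(pred_engaged)  # share of predicted-engaged that are engaged
recall = true_pos / np.sum(true_engaged)     # share of engaged voters that were found

print(accuracy, precision, recall)
```

Reporting precision and recall alongside accuracy helps when engaged voters are a minority of the file, since accuracy alone can look strong even if few engaged voters are identified.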

In [27]:
features_test   =   test[['general_2016',  "bachelors_degree_2", "poverty",
                            'primary_2016', 'general_2014',  'age', 'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'gini_index', 'housing_units', 
                            'employed_pop']]
labels_test     =  pd.DataFrame({"engaged":test['engaged'],  "not_engaged":1-test['engaged']})
features_test_array = np.array(features_test, np.float64)
labels_test_array   = np.array(labels_test,   np.float64)

In [28]:
outcome = model.predict(features_test_array) > 0.5


   87/22846 [..............................] - ETA: 40s


I've tinkered with the model quite a bit and can't seem to improve it. It's not a remarkable degree of accuracy, but there's really not all that much individual-level data, so I'm not sure how much more can be gained.
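One way to put the accuracy figure in context is to score a couple of far simpler measures on the same outcome. The sketch below is only an illustration on made-up 0/1 values: it compares an always-predict-the-majority baseline with a hypothetical rule that flags anyone who voted in both 2016 elections. Neither baseline was run on the real data here.

```python
import numpy as np

# Hypothetical 0/1 columns for ten voters (not real data).
engaged      = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])
general_2016 = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1])
primary_2016 = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])

# Baseline 1: always predict the more common class.
majority_class = int(engaged.mean() >= 0.5)
majority_acc = np.mean(engaged == majority_class)

# Baseline 2: predict "engaged" for anyone who voted in both 2016 elections.
rule_pred = (general_2016 == 1) & (primary_2016 == 1)
rule_acc = np.mean(rule_pred == (engaged == 1))

print(majority_acc, rule_acc)
```

If the network's test accuracy is not clearly above both numbers when they are computed on the actual test set, the extra complexity is probably not buying much.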

In [29]:
full_data = df[['registrant_id',
    'general_2020', 'primary_2020', 'general_2018', 
    'primary_2018', 'general_2016', "bachelors_degree_2",
    'primary_2016', 'general_2014', "poverty",
    'age', 'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop', 'gini_index', 'housing_units', 
    'employed_pop',  'black_district', 'white_district', 'latino']]
full_data = full_data.dropna(how = 'any')
registrant_id = full_data['registrant_id']
full_data_array = scaler.fit_transform(full_data)
full_data = pd.DataFrame(full_data_array, columns = full_data.columns)
full_data['engaged'] = np.where((((full_data["primary_2018"] == 1)  
                                     & (full_data["primary_2020"] == 1) 
                                     & (full_data["general_2018"] == 1)
                                    )),1,0) 
labels_full     =  pd.DataFrame({"engaged": full_data['engaged'],  "not_engaged": 1-full_data['engaged']})
features_full    =  full_data[['general_2016',  "bachelors_degree_2", "poverty",
                            'primary_2016', 'general_2014', 
                            'age', 'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'gini_index', 'housing_units', 
                            'employed_pop',  'black_district', 'white_district', 'latino']]

features_full_array = np.array(features_full, np.float64)
labels_full_array   = np.array(labels_full,   np.float64)

In [30]:
preds1 = model.predict(features_test_array) > 0.5
preds2 = model.predict(features_train_array)

# Evaluate the model
from tensorflow.keras.metrics import Accuracy, Precision, Recall
acc = Accuracy()
prec = Precision()
recall = Recall()
acc.update_state(labels_test_array, preds1)

acc.result().numpy()
# prec.result().numpy()
# recall.result().numpy()



0.84815085

In [46]:
### Standardize the data, train with variables below ####
full_data = df[['registrant_id', 'general_2020', 'primary_2020', 'general_2018', 
    'primary_2018', 'general_2016', "bachelors_degree_2",
    'primary_2016', 'general_2014', "poverty",
    'age', 'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop', 'gini_index', 'housing_units', 
    'employed_pop']]
full_data = full_data.dropna(how = 'any')
registrant_id = full_data['registrant_id']
# Note: the scaler is refit here, and registrant_id is rescaled along with the model variables
full_data_array = scaler.fit_transform(full_data)
full_data = pd.DataFrame(full_data_array, columns = full_data.columns)
full_data['engaged'] = np.where((((full_data["primary_2018"] == 1)  
                                     & (full_data["primary_2020"] == 1) 
                                     & (full_data["general_2018"] == 1)
                                    )),1,0) 
labels_full     =  pd.DataFrame({"engaged": full_data['engaged'],  "not_engaged": 1-full_data['engaged']})
features_full    =  full_data[['general_2016', "bachelors_degree_2",
    'primary_2016', 'general_2014', "poverty",
    'age', 'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop', 'gini_index', 'housing_units', 
    'employed_pop']]

features_full_array = np.array(features_full, np.float64)
labels_full_array   = np.array(labels_full,   np.float64)


In [None]:
preds = model.predict(features_full_array)
preds = pd.DataFrame(preds)
preds.head()


In [66]:
np.random.binomial(1, preds.iloc[:,0])

array([1, 1, 0, ..., 0, 0, 1])

In [68]:
upload_data = pd.DataFrame( {"engaged_pr" : preds.iloc[:,0], 
                    "not_engaged_pr" : preds.iloc[:,1],  
                    "point" : np.random.binomial(1, preds.iloc[:,0]),
                    "engaged_true" :  full_data["engaged"],
                    "registrant_id" : registrant_id.tolist() } )

upload_data.head()

| | engaged_pr | not_engaged_pr | point | engaged_true | registrant_id |
|---|---|---|---|---|---|
| 0 | 0.55205 | 0.44795 | 1 | 1 | 26524902 |
| 1 | 0.405721 | 0.594279 | 1 | 1 | 26628656 |
| 2 | 0.03157 | 0.96843 | 0 | 0 | 26706618 |
| 3 | 0.466832 | 0.533168 | 1 | 0 | 26679668 |
| 4 | 0.1053 | 0.8947 | 1 | 0 | 26660520 |
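Note that the `point` column is a Bernoulli draw from each voter's predicted probability, not a thresholded classification, so it can disagree with a 0.5 cutoff and will change between runs unless the generator is seeded. A small sketch of the difference, using the `engaged_pr` values from the five rows shown above (the seed is an arbitrary assumption for reproducibility):

```python
import numpy as np

# Engaged probabilities copied from the first five rows shown above.
engaged_pr = np.array([0.55205, 0.405721, 0.03157, 0.466832, 0.1053])

# Deterministic classification: engaged if the probability exceeds 0.5.
thresholded = (engaged_pr > 0.5).astype(int)

# Stochastic classification: one Bernoulli draw per voter.
rng = np.random.default_rng(0)  # seed chosen arbitrarily
sampled = rng.binomial(1, engaged_pr)

print(thresholded)
print(sampled)
```

Averaged over many voters, the sampled points reproduce the predicted engagement rate, whereas thresholding at 0.5 would label only the first of these five voters as engaged.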


In [69]:
from google.cloud import bigquery
from sklearn.preprocessing import MinMaxScaler
import pandas_gbq
bqclient = bigquery.Client()
project_id = "az-voter-file"
pandas_gbq.to_gbq(upload_data, "az-voter-file.az_file.nn02", project_id=project_id, if_exists="replace")

100%|██████████| 1/1 [00:00<00:00, 8004.40it/s]


In [74]:
full_data.to_csv( "not_uploaded.csv")