# A Machine Learning Model
## Applied to Voting Behavior in Arizona
### Keras Neural Network 


Below I load the dependencies, pull the data with a Google BigQuery (GBQ) query, do some data cleaning, and finally split the data into train and test sets.

A political engagement indicator was created, scored 1 if the voter participated in the 2020 primary, as well as the 2018 primary and general elections. 
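
As a minimal sketch of how that indicator is scored, here it is on a handful of made-up turnout records. The toy arrays are hypothetical; only the column names and the all-three-elections rule come from this notebook.

```python
import numpy as np

# Hypothetical toy turnout records (1 = voted, 0 = did not vote).
primary_2020 = np.array([1, 1, 0, 1])
primary_2018 = np.array([1, 0, 0, 1])
general_2018 = np.array([1, 1, 1, 0])

# A voter counts as "engaged" only if all three elections show participation.
engaged = np.where(
    (primary_2020 == 1) & (primary_2018 == 1) & (general_2018 == 1), 1, 0
)

print(engaged)  # only the first toy voter took part in all three elections
```

Missing any one of the three elections is enough to score a voter 0, so this is a fairly strict definition of engagement.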


In [15]:
#import tensorflow as tf
import os
import pandas as pd
from datetime import datetime
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/Users/Chris/Dropbox/Keys/az-voter-file-30395362c45b.json"


In [16]:
import pandas_gbq
from google.cloud import bigquery
from sklearn.preprocessing import MinMaxScaler


Always manage your Python environments. I had to create a special one here because of my M1 chip. Don't ask me why TensorFlow requires a separate build; I don't know, but it does. For me it requires:

**pip install tensorflow-macos==2.4.1**

**pip install tensorflow-metal==0.1.1**

But first -- and this is important -- create a virtual environment. There are a lot of conflicts that arise, and it's just easiest to keep separate spaces. Something like this:

**python3 -m venv "/Users/Chris/website_fall22/site/tensorflow"**

**source "/Users/Chris/website_fall22/site/tensorflow/bin/activate"**

Then upgrade pip itself (pip install --upgrade pip) and pip install the rest of the packages: pandas, numpy, scikit-learn, etc.
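
One quick way to confirm that the notebook kernel is really running inside the virtual environment, and not the system Python, is to compare `sys.prefix` with `sys.base_prefix`. This is a small sketch using only the standard library; the helper name `in_virtualenv` is my own, and the check assumes an environment created with `python3 -m venv` as above.

```python
import sys

def in_virtualenv():
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If this prints False, the packages installed with pip are probably going somewhere other than the environment you just created.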


I query BigQuery below.

In [27]:
### Formulate the SQL query to pull the data from BigQuery

query = """
    SELECT 
    geo_id,
    registrant_id,
    general_2020,
    primary_2020,
    general_2018,
    primary_2018,
    general_2016,
    primary_2016,
    general_2014,
    bachelors_degree,
    total_pop,
    birth_year,
    registration_date,
    registration_change,
    median_age,
    median_income,
    white_pop,
    black_pop,
    asian_pop,
    hispanic_pop,
    amerindian_pop,
    housing_units,
    employed_pop,
    armed_forces,
    pop_in_labor_force,
FROM `az-voter-file.registration.clean_data_machine_learning_blocks`
"""

df = pandas_gbq.read_gbq(query, project_id="az-voter-file")
df.to_pickle('voter_file00_00_02.pkl') ## For later load, not to sync.

Downloading: 100%|██████████| 3860252/3860252 [09:16<00:00, 6940.48rows/s]


This is pretty rudimentary, and likely overkill, but we can compare it to far simpler measures as well. I created a variable, called "engaged", that is 1 if the voter participated in the 2020 primary, as well as the 2018 primary and general elections. I then split the data into a train set (80 percent) and a test set (20 percent), train a model on the train set, and test it on the test set. The model I use is a "neural network" with an input layer, hidden layers, and an output layer. I tested this with different parameterizations and numbers of hidden layers, including 4; the code below keeps a single 20-unit hidden layer with dropout. It really doesn't matter. I hit about 85% accuracy, which is marginal, but far better than chance for these data. The features I use to train the model are primary and general election voting prior to 2018, as well as the following characteristics, measured at the voter and tract levels:
_________
### Voter Level Information
______
* 'general_2016', 
* 'bachelors_degree_2',
* 'primary_2016', 
* 'general_2014', 
* 'registration_change',
* 'registration_date',
* 'age'
_________
### Tract level Information
_________
* 'poverty',
* 'age', 
* 'median_age', 
* 'median_income',
* 'white_pop', 
* 'black_pop', 
* 'asian_pop', 
* 'hispanic_pop', 
* 'amerindian_pop', 
* 'gini_index', 
* 'housing_units', 
* 'employed_pop' 

All variables were min-max scaled to the 0-1 range prior to analysis.
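
To make the 0-1 scaling concrete, here is a minimal NumPy sketch of the min-max formula, (x - min) / (max - min), applied one column at a time. The notebook itself uses scikit-learn's `MinMaxScaler`; the helper `min_max_scale` and the toy values below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array so its minimum maps to 0 and its maximum to 1."""
    column = np.asarray(column, dtype=np.float64)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # A constant column has no spread; map it to zeros to avoid 0/0.
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

# Hypothetical tract-level incomes and a binary turnout column.
median_income = [30000, 45000, 60000, 90000]
general_2016 = [0, 1, 1, 0]

scaled_income = min_max_scale(median_income)
scaled_turnout = min_max_scale(general_2016)

print(scaled_income)   # incomes spread across the 0-1 range
print(scaled_turnout)  # a 0/1 column comes back unchanged
```

Because a 0/1 turnout column that contains both values is left unchanged by this scaling, comparisons like `== 1` on the scaled turnout columns still pick out the voters who participated.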



In [35]:
import numpy as np 
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

import keras
from keras import layers
#from tensorflow.keras import layers

#"https://caffeinedev.medium.com/how-to-install-tensorflow-on-m1-mac-8e9b91d93706"

df = pd.read_pickle('voter_file00_00_02.pkl')

reg_length =  (pd.to_datetime("11-04-2020", format = "%m-%d-%Y") - pd.to_datetime(df['registration_date'], 
                              format = "%Y-%m-%d", 
                              errors = 'coerce'))
df["registration_length"] = reg_length.dt.days 

reg_change=  (pd.to_datetime("11-04-2020", format = "%m-%d-%Y") - pd.to_datetime(df['registration_change'], 
                                format = "%Y-%m-%d", errors = 'coerce'))
df["registration_change"] = reg_change.dt.days 



st_dat = df[['general_2020', 'primary_2020', 'general_2018', 
    'primary_2018', 'general_2016', "bachelors_degree",
    'primary_2016', 'general_2014',
    'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop', 'housing_units', 
    'employed_pop']]
 
st_dat = st_dat.dropna(how = 'any')
scaler = MinMaxScaler()
st_dat_array = scaler.fit_transform(st_dat)
st_dat = pd.DataFrame(st_dat_array, columns = st_dat.columns)


st_dat['engaged'] = np.where((((st_dat["primary_2018"] == 1)  
                                     & (st_dat["primary_2020"] == 1) 
                                     & (st_dat["general_2018"] == 1)
                                    )),1,0) 
train, test = train_test_split(st_dat, test_size=0.2)

features_train   =  train[['general_2016',  "bachelors_degree", 
                            'primary_2016', 'general_2014', 
                            'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'housing_units', 
                            'employed_pop']]
labels_train     =  pd.DataFrame({"engaged":train['engaged'],  "not_engaged":1-train['engaged']})


features_train_array = np.array(features_train, np.float64)
labels_train_array   = np.array(labels_train,   np.float64)


features_test   = test[['general_2016',  "bachelors_degree", 
                            'primary_2016', 'general_2014', 
                            'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'housing_units', 
                            'employed_pop']]
labels_test     =  pd.DataFrame({"engaged":test['engaged'],  "not_engaged":1-test['engaged']})


features_test_array = np.array(features_test, np.float64)
labels_test_array   = np.array(labels_test,   np.float64)

from tensorflow.keras.regularizers import l1_l2
model = tf.keras.Sequential()
# Define the first layer
model.add(keras.layers.Dense(20, activation='softmax', 
                               input_shape=(features_train.shape[1],)))
model.add(keras.layers.Dropout(0.25))
# model.add(keras.layers.Dense(10, activation='softmax'))
# model.add(keras.layers.Dense(5, activation='softmax'))
model.add(keras.layers.Dense(2, activation='softmax'))
    

# Finish the model compilation
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), 
              loss='categorical_crossentropy',
              metrics=['accuracy'])


In [23]:
# #! pip install tensorflow-macos --upgrade 
# import tensorflow as tf
# !pip install keras



In [38]:
print("TensorFlow version:", tf.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_physical_devices('GPU')

model.fit(features_train_array, 
          labels_train_array, epochs=10, batch_size=1000, 
          validation_split=0.20)

TensorFlow version: 2.9.2
Num GPUs Available:  1
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x57750b550>

Above, where I constructed the training data, I also set aside 20 percent of the sample. **The model was not trained on this sample. These are fresh data, randomly drawn, so that we can compare the true outcome to the predicted outcome.** Overall, I reach about 85% accuracy, which is not great, but far better than chance.
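
One way to put "better than chance" on a scale is to compare held-out accuracy with what you would get by always guessing the more common class. The sketch below does that on made-up labels. Treating "chance" as this majority-class baseline is my assumption, and the helper names and toy arrays are hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of cases where the predicted class equals the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the more common class."""
    share_positive = float(np.mean(np.asarray(y_true)))
    return max(share_positive, 1.0 - share_positive)

# Hypothetical held-out labels and predictions (1 = engaged, 0 = not engaged).
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)

print("model accuracy:   ", round(model_acc, 3))
print("majority baseline:", round(baseline_acc, 3))
```

When one class is much more common than the other, the baseline can already be high, so the gap between the two numbers says more than the accuracy alone.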

In [40]:
features_test   =   test[['general_2016',  "bachelors_degree",
                            'primary_2016', 'general_2014',   'registration_length', 
                            'registration_change',
                            'median_age', 'median_income',
                            'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
                            'amerindian_pop', 'housing_units', 
                            'employed_pop']]
labels_test     =  pd.DataFrame({"engaged":test['engaged'],  "not_engaged":1-test['engaged']})
features_test_array = np.array(features_test, np.float64)
labels_test_array   = np.array(labels_test,   np.float64)

In [41]:
outcome = model.predict(features_test_array) > 0.5


   92/19234 [..............................] - ETA: 31s

2022-09-21 11:12:28.511711: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




I've tinkered with the model quite a bit. I can't seem to improve it. It's not a remarkable degree of accuracy, but there's really not all that much individual-level data, so I'm not sure how much better it can get.

In [42]:
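preds1 = model.predict(features_test_array) > 0.5   # hard class assignments for the held-out test set
preds2 = model.predict(features_train_array)        # class probabilities for the training set (not used below)

# Evaluate the model
from tensorflow.keras.metrics import Accuracy, Precision, Recall
acc = Accuracy()
prec = Precision()
recall = Recall()
# Accuracy compares labels and predictions element by element across both one-hot columns
acc.update_state(labels_test_array, preds1)

acc.result().numpy()
# prec.result().numpy()    # would need prec.update_state(...) first
# recall.result().numpy()  # would need recall.update_state(...) first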



0.8436642

In [43]:
### Rescale the full data set and rebuild the features and labels used for scoring ####
full_data = df[['registrant_id', 'general_2020', 'primary_2020', 'general_2018', 
    'primary_2018', 'general_2016', "bachelors_degree",
    'primary_2016', 'general_2014',
    'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop',  'housing_units', 
    'employed_pop']]
full_data = full_data.dropna(how = 'any')
registrant_id = full_data['registrant_id']
full_data_array = scaler.fit_transform(full_data)
full_data = pd.DataFrame(full_data_array, columns = full_data.columns)
full_data['engaged'] = np.where((((full_data["primary_2018"] == 1)  
                                     & (full_data["primary_2020"] == 1) 
                                     & (full_data["general_2018"] == 1)
                                    )),1,0) 
labels_full     =  pd.DataFrame({"engaged": full_data['engaged'],  "not_engaged": 1-full_data['engaged']})
features_full    =  full_data[['general_2016', "bachelors_degree",
    'primary_2016', 'general_2014', 
     'registration_length', 
    'registration_change',
    'median_age', 'median_income',
    'white_pop', 'black_pop', 'asian_pop', 'hispanic_pop', 
    'amerindian_pop',  'housing_units', 
    'employed_pop']]

features_full_array = np.array(features_full, np.float64)
labels_full_array   = np.array(labels_full,   np.float64)


In [44]:
preds = model.predict(features_full_array)
preds = pd.DataFrame(preds)
preds.head()




In [66]:
np.random.binomial(1, preds.iloc[:,0])

array([1, 1, 0, ..., 0, 0, 1])

In [13]:
upload_data = pd.DataFrame( {"engaged_pr" : preds.iloc[:,0], 
                    "not_engaged_pr" : preds.iloc[:,1],  
                    "point" : np.random.binomial(1, preds.iloc[:,0]),
                    "engaged_true" :  full_data["engaged"],
                    "registrant_id" : registrant_id.tolist() } )

upload_data.head()

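|   | engaged_pr | not_engaged_pr | point | engaged_true | registrant_id |
|---|------------|----------------|-------|--------------|---------------|
| 0 | 0.643133 | 0.356867 | 1 | 1 | 22484551 |
| 1 | 0.085125 | 0.914875 | 0 | 0 | 23496232 |
| 2 | 0.224582 | 0.775418 | 0 | 1 | 22302082 |
| 3 | 0.519468 | 0.480532 | 1 | 0 | 22274355 |
| 4 | 0.086591 | 0.913409 | 0 | 0 | 27379527 |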


In [14]:
from google.cloud import bigquery
from sklearn.preprocessing import MinMaxScaler
import pandas_gbq
bqclient = bigquery.Client()
project_id = "az-voter-file"
pandas_gbq.to_gbq(upload_data, "az-voter-file.registration.nn04", project_id=project_id, if_exists="replace")

100%|██████████| 1/1 [00:00<00:00, 3292.23it/s]


In [74]:
full_data.to_csv( "not_uploaded.csv")