# Analysis of Voter Turnout in Indiana Pre- and Post- Voter Identification Law
### Authors: Christopher Lefrak, Hannah Li, George Yang, and Kuai Yu
### PSTAT 235

NOTES/TO-DO:
- truncate/limit outputs so the writeup looks polished and professional (no raw outputs/errors)
- interpret findings
- include visualizations and graphs (EDA? theoretical concepts?)

## Introduction

[importance/potential effect of voter ID law]

Thirty-five of the fifty U.S. states have passed voter ID laws that require or request voters to present a form of identification at the polls.
The remaining fifteen states do not require voters to present any documentation to vote at the polls. States such as Indiana, Wisconsin, and Tennessee have strict photo ID laws for voters, while states such as Minnesota, Nebraska, North Carolina, and Pennsylvania have no requirements for voter identification. A visualization of the levels of strictness of voter photo identification laws for each state can be seen in the graphic below.

![Voter ID Laws](GCS/voteridmap.png)

Advantages of implementing stricter voter identification requirements include preventing voter impersonation and thereby increasing public confidence in the election process. A disadvantage is that stricter laws can place an unnecessary burden on voters and election administrators.

## Goals
In this project we focus our investigation of voter identification laws on the state of Indiana, which implemented a strict voter identification law in 2008. We seek to estimate how much voter turnout would have decreased or increased without the implementation of the law.

> Project Goals
> - Apply the matching method to pre-voter identification law features.
> - Use k-Nearest-Neighbors (k-NN) classification to make predictions on voter data, and use cross-validation to evaluate them.
> - Strengthen our PySpark data analysis, collaboration, and project organization skills.

[technologies, packages, skills...]

## Indiana Voter Data

### Dataset Overview

Our data comes from the course's voter files folder. We primarily use the dataset corresponding to Indiana. At a glance, the dataset contains 726 columns and 946,908 rows, with records beginning from .... and ending on March 5, 2021.

[eda/visualizations]

### Data Cleaning

Many of the columns of the dataset have missing values.
We narrowed down our focus to individuals who were of the legal voting age of 18 or older at the time of voting.


We subsetted the dataset to focus on a narrower set of voter attributes. We selected the following columns from the original dataset:

[table with column names and descriptions]




In [4]:
# Importing necessary modules
import os
import random
import re
from functools import reduce
from operator import add

import numpy as np
import pandas as pd
import pyspark
import pyspark.sql.functions as F
from pyspark.shell import spark  # starts the PySpark shell and exposes `spark`
from pyspark.sql import SparkSession
# Explicit imports instead of `import *`, which would shadow builtins such as sum/min/max
from pyspark.sql.functions import (add_months, col, concat, lit, regexp_extract,
                                   regexp_replace, to_date, when, year)
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Setting up visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.3
      /_/

Using Python version 3.8.15 (default, Nov 22 2022 08:46:39)
Spark context Web UI available at http://mycluster-m.c.pstat135-235.internal:36437
Spark context available as 'sc' (master = yarn, app id = application_1679026115533_0004).
SparkSession available as 'spark'.


In [5]:
indi_full = spark.read.parquet("gs://voter-project-235-25/VM2Uniform--IN--2021-01-15_parq")


In [None]:
cols_to_keep = [
    "Voters_Gender",
    "Voters_Age",
    "Voters_BirthDate",
    "Residence_Families_HHCount",
    "Residence_HHGender_Description",
    "Mailing_Families_HHCount",
    "Mailing_HHGender_Description",

    # Voter party affiliation
    "Parties_Description", 
    
    "CommercialData_PropertyType",
    "AddressDistricts_Change_Changed_CD",
    "AddressDistricts_Change_Changed_SD",
    "AddressDistricts_Change_Changed_HD",
    "AddressDistricts_Change_Changed_County",
    "Residence_Addresses_Density",
    "CommercialData_EstimatedHHIncome",
    "CommercialData_ISPSA",
    "CommercialData_AreaMedianEducationYears",
    "CommercialData_AreaMedianHousingValue",
    "CommercialData_MosaicZ4Global",
    "CommercialData_AreaPcntHHMarriedCoupleNoChild",
    "CommercialData_AreaPcntHHMarriedCoupleWithChild",
    "CommercialData_AreaPcntHHSpanishSpeaking",
    "CommercialData_AreaPcntHHWithChildren",
    "CommercialData_StateIncomeDecile",
    "Ethnic_Description",
    "EthnicGroups_EthnicGroup1Desc",
    "CommercialData_DwellingType",
    "CommercialData_PresenceOfChildrenCode",
    "CommercialData_PresenceOfPremCredCrdInHome",
    "CommercialData_DonatesToCharityInHome",
    "CommercialData_DwellingUnitSize",
    "CommercialData_ComputerOwnerInHome",
    "CommercialData_DonatesEnvironmentCauseInHome",
    "CommercialData_Education",
    "General_2000",
    "General_2004",
    "PresidentialPrimary_2000",
    "PresidentialPrimary_2004",
        
    # Outcome variable (Indiana's law was passed in 2005 and approved by SCOTUS
    # before the presidential election in 2008)
    "General_2008"
]

indi = indi_full.select(cols_to_keep)

Based on the voter's birth date, we calculate the date on which they turn eighteen. We then create a new variable whose value is the year of the earliest election in which the voter could potentially participate. If the voter turns eighteen on or before November 3rd, we set the value to the year in which they turn eighteen; otherwise, we set it to the following year. We compute an analogous variable for the presidential primary, using May 3rd as the cutoff.
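
The cutoff rule described above can be checked on a few hand-picked birth dates with a small plain-Python sketch. The helper names `add_years` and `first_eligible_year` are hypothetical and not part of the notebook, and the fallback from February 29 to February 28 is an assumption made only for this illustration.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years; Feb 29 falls back to Feb 28 in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def first_eligible_year(birth: date, cutoff_month: int = 11, cutoff_day: int = 3) -> int:
    """Year of the first cutoff date that falls on or after the 18th birthday."""
    date_18 = add_years(birth, 18)
    cutoff = date(date_18.year, cutoff_month, cutoff_day)
    return date_18.year if date_18 <= cutoff else date_18.year + 1


# Hypothetical birth dates chosen to exercise both branches of the rule
examples = {
    "turns 18 before the cutoff": first_eligible_year(date(1990, 6, 15)),
    "turns 18 on the cutoff": first_eligible_year(date(1990, 11, 3)),
    "turns 18 after the cutoff": first_eligible_year(date(1990, 12, 1)),
}
print(examples)
```

Passing `cutoff_month=5` gives the corresponding rule for the primary.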

In [None]:
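yrs_add = 18
months_add = yrs_add * 12  # 18 years expressed in months, for add_months

# month-day used as the cutoff for the national general election (November)
target_month_day_presidential = "11-03"

# month-day used as the cutoff for Indiana's presidential primary (May)
target_month_day_primary = "05-03"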

indi = indi.withColumn("DATE_18", add_months(to_date(col("Voters_BirthDate"),"MM/dd/yyyy"), months_add))
indi.select(["Voters_BirthDate", "DATE_18"]).show(10)
indi = indi.dropna(subset = "Voters_BirthDate")
indi = indi.withColumn("YEAR_18", year("DATE_18"))
indi = indi.withColumn("comparator_date_presidential", to_date(concat(col("YEAR_18"), lit("-"), lit(target_month_day_presidential))))
indi = indi.withColumn("comparator_date_primary", to_date(concat(col("YEAR_18"), lit("-"), lit(target_month_day_primary))))
indi = indi.withColumn("YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL", \
                             when(col("DATE_18")<=col("comparator_date_presidential"), col("YEAR_18")) \
                             .otherwise(col("YEAR_18") + 1) \
                            )
indi = indi.withColumn("YEAR_ELIGIBLE_TO_VOTE_PRIMARY", \
                             when(col("DATE_18")<=col("comparator_date_primary"), col("YEAR_18")) \
                             .otherwise(col("YEAR_18") + 1) \
                            )

# check no missing vals:
indi.where(col("YEAR_18").isNull()).select("YEAR_18").show(10)

# get rid of rows where the voter was not old enough to vote in the 2008 general election
indi = indi.filter(col("YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL")<=2008).fillna("N", subset = ["General_2008"])

# For each earlier election, replace a missing value with "N" only IF the person was
# old enough to vote in that election; otherwise the value stays null.
elections = [
    ("General_2000", "YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL", 2000),
    ("General_2004", "YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL", 2004),
    ("PresidentialPrimary_2000", "YEAR_ELIGIBLE_TO_VOTE_PRIMARY", 2000),
    ("PresidentialPrimary_2004", "YEAR_ELIGIBLE_TO_VOTE_PRIMARY", 2004),
]

for vote_col, eligible_col, election_year in elections:
    indi = indi.withColumn(
        vote_col,
        when((col(eligible_col) <= election_year) & col(vote_col).isNull(), "N")
        .otherwise(col(vote_col)),
    )

# make the general voting for 2008 a numeric variable; since we've deleted
# everyone who was not eligible to vote, this can be directly calculated with a 1-0.
indi = indi.withColumn("Voted_General_2008", when(indi.General_2008 == "Y",1).otherwise(0))
indi = indi.drop("General_2008")

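We begin by drawing a sample of roughly 10% of the dataset to prototype our code.


In [None]:
# fraction=0.1 gives roughly 10% of the rows; with withReplacement=True the same
# voter can be drawn more than once
sampleind = indi.sample(withReplacement=True, fraction=0.1, seed=19480384)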

We then convert the column `CommercialData_EstimatedHHIncome` from string to numeric by keeping only the upper bound of each income range (the number to the right of the hyphen) and removing the symbols "$", ",", and "+".

In [None]:
income_col = "CommercialData_EstimatedHHIncome"

# Remove "$", "," and "+" so that only digits and the range hyphen remain
sampleind = sampleind.withColumn(income_col, regexp_replace(col(income_col), r"[$,+]", ""))

# Keep the trailing number: the upper bound of a range, or the only number when the
# value has no hyphen (assumed here to be a top-coded value that ended in "+")
sampleind = sampleind.withColumn(income_col, regexp_extract(col(income_col), r"(\d+)$", 1))

# Empty strings (no match) become null when cast to double
sampleind = sampleind.withColumn(income_col, col(income_col).cast("double"))

sampleind.select(["CommercialData_EstimatedHHIncome"]).show(10, truncate=False)
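
The string handling behind this kind of conversion can be tried outside Spark with a plain-Python sketch. The function name `income_upper_bound` and the sample strings are hypothetical; in particular, the formats "$75000-99999" and "$250000+" are assumed for illustration and are not taken from the dataset.

```python
import re
from typing import Optional


def income_upper_bound(raw: Optional[str]) -> Optional[float]:
    """Parse an income bracket string into a single number.

    Strips "$", "," and "+", then keeps the trailing number: the upper bound
    of a range, or the only number when there is no hyphen.
    """
    if not raw:
        return None
    cleaned = re.sub(r"[$,+]", "", raw)
    match = re.search(r"(\d+)$", cleaned)
    return float(match.group(1)) if match else None


# Hypothetical bracket strings (assumed format)
samples = ["$75000-99999", "$250000+", "$1,000-14,999", "", None]
parsed = [income_upper_bound(s) for s in samples]
print(parsed)
```

Values that cannot be parsed come back as `None`, which plays the role of a null in the Spark column.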


We also convert the column `CommercialData_AreaMedianHousingValue` from string to numeric by removing the symbol "$".

In [None]:
sampleind = sampleind.withColumn("CommercialData_AreaMedianHousingValue", regexp_replace("CommercialData_AreaMedianHousingValue", r"\$", ""))
sampleind = sampleind.withColumn("CommercialData_AreaMedianHousingValue",col("CommercialData_AreaMedianHousingValue").cast('double'))
sampleind.select(["CommercialData_AreaMedianHousingValue"]).show(10, truncate=False)

We proceed to search for the string "Pcnt" in all of the column names in our dataset, and convert the matching columns

> - 'CommercialData_AreaPcntHHMarriedCoupleNoChild'
> - 'CommercialData_AreaPcntHHMarriedCoupleWithChild'
> - 'CommercialData_AreaPcntHHSpanishSpeaking'
> - 'CommercialData_AreaPcntHHWithChildren'
 
to numeric types by removing the symbol "%".


In [None]:
cols_to_convert = [c for c in sampleind.columns if "Pcnt" in c]

for col_name in cols_to_convert:
    sampleind = sampleind.withColumn(col_name, regexp_replace(col_name, "%", ""))
    sampleind = sampleind.withColumn(col_name, col(col_name).cast('double'))
    sampleind.select([col_name]).show(5, truncate=False)
    

We then remove the helper columns that were only needed to determine each voter's eligibility.



In [None]:
columns_to_drop = ["comparator_date_presidential", "comparator_date_primary",
                   "YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL", "YEAR_ELIGIBLE_TO_VOTE_PRIMARY",
                   "YEAR_18", "DATE_18"]

sampleind = sampleind.drop(*columns_to_drop)

## Regression vs Matching
### Regression
Regression is parametric: it assumes a function linking the treatment $D$ and covariates $X$ to the outcome $Y$, with parameters $\beta$ to be estimated.
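
As an illustration, a linear specification of this kind could be written as follows. The subscript $i$ (indexing voters), the split of $\beta$ into $\beta_0$, $\beta_D$, and $\beta_X$, and the error term $\varepsilon_i$ are notation introduced here for the example and are not taken from the analysis.

$$
Y_i = \beta_0 + \beta_D D_i + \beta_X^\top X_i + \varepsilon_i
$$

Under this form, $\beta_D$ is the coefficient read as the effect of the treatment, which is only meaningful if the linear form is a reasonable description of the data.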


#### Problem with regression
- It does not weight each group's treatment effect by the size of the group; instead, the weights are proportional to the variance of the treatment within that group.
- It assumes linear relationships between the covariates and the outcome.
- It can underfit the data and underestimate the effect.
- Instead, we use k-NN to perform the matching.

### Matching

Matching is a statistical technique used to compare treated and non-treated data points with each other to reduce bias and evaluate the effect of the treatment. The matching process entails finding one or more non-treated data points with similar covariates to match a treated data point. In this case, we match voters from pre-voter identification law Indiana, before 2008, to voters from post-voter identification law Indiana, 2008 and beyond. We want to match voters with similar characteristics [column names] in order to predict [....], and, ultimately, to evaluate the effect of the voter identification law on voter turnout in Indiana.

Unlike regression, matching is non-parametric as it does not assume a functional form, and does not rely on the assumption of linearity.
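
The matching idea described above can be sketched in plain Python as one-nearest-neighbor matching with replacement. This is a minimal illustration under stated assumptions, not the procedure used in this project: the toy covariates (age in decades, income decile), the outcomes, and the function names `nearest_control` and `att_by_matching` are all hypothetical.

```python
import math


def nearest_control(x, controls):
    """Return the control unit whose covariates are closest (Euclidean) to x."""
    return min(controls, key=lambda unit: math.dist(x, unit["x"]))


def att_by_matching(treated, controls):
    """Average of (treated outcome - matched control outcome) over treated units."""
    diffs = [t["y"] - nearest_control(t["x"], controls)["y"] for t in treated]
    return sum(diffs) / len(diffs)


# Toy data: x = (age in decades, income decile), y = 1 if the person voted
treated = [
    {"x": (3.0, 2.0), "y": 0},
    {"x": (6.0, 8.0), "y": 1},
]
controls = [
    {"x": (3.1, 2.0), "y": 1},
    {"x": (6.0, 7.5), "y": 1},
    {"x": (9.0, 1.0), "y": 0},
]

att = att_by_matching(treated, controls)
print(att)
```

In practice the covariates would be rescaled before computing distances, since otherwise variables with large ranges dominate the choice of match.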

![Regression vs Matching](GCS/regressionmatching.png)

Matching assumes the following:

> - Conditional Independence Assumption (CIA): 
>> - The potential outcomes are independent of the treatment assignment after adjusting for a set of covariates $X$
>> - This means that, among voters with the same values of the covariates $X$, treatment assignment is as good as random.
> - Common Support Assumption:
>> - There should be untreated and treated observations for each combination of values of covariates $X$ in the data to ensure matches.
>> - This means that if the treated group is small relative to the entire dataset, we are more likely to have common support.

However, the matching process is not perfect as "overmatching" can actually increase bias.
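
The two assumptions listed above are commonly written in the following form. The symbols $Y_0$ and $Y_1$ (potential outcomes without and with treatment) and the probability notation are introduced here for illustration.

$$
(Y_0, Y_1) \perp D \mid X
\qquad \text{and} \qquad
0 < \Pr(D = 1 \mid X = x) < 1 \ \text{ for every } x \text{ in the data}
$$

The first statement is conditional independence; the second says that every covariate profile has a positive chance of appearing in both the treated and the untreated group.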

Additionally, working with numerous variables (dimensions) can result in the curse of dimensionality. When the number of dimensions of our dataset increases, for example as the number of covariates $X$ grows larger and larger relative to the number of rows in our dataset, we are less likely to find close matches for our treated observations.
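
A small simulation can make this concrete. The sketch below draws uniformly random points (synthetic data, not voter records) and measures the average distance from each point to its nearest neighbor as the number of dimensions grows; the function name `mean_nearest_distance` and the sample sizes are choices made for this illustration.

```python
import math
import random


def mean_nearest_distance(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Average Euclidean distance from each point to its nearest neighbor."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


# Same number of points, increasing number of covariates
distances = {d: mean_nearest_distance(200, d) for d in (1, 5, 20, 50)}
for d, dist in distances.items():
    print(f"{d:>2} dimensions: mean nearest-neighbor distance = {dist:.3f}")
```

With the number of rows held fixed, the nearest neighbor drifts farther away as dimensions are added, which is why close matches become harder to find.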

- The central limit theorem does not always hold for matching estimators, so a bias correction may be needed.

## Application to Voter Data
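Variables:
> - $X$: covariates (analogous to sex in a drug trial)
> - $T$: treatment assignment under natural circumstances (whether a voter ID law has been passed in your state; analogous to receiving the drug)
> - $Y$: observed outcome (voter turnout; analogous to the outcome measured in days in a drug trial)
> - $Y_1, Y_0$: potential outcomes (the outcome if you were treated, i.e. your state has the law, and if you were not treated, i.e. no law)
> - Average treatment effect: how much voter turnout would have increased or decreased without the law
> - If $(Y_0, Y_1)$ is independent of $T$, the effect can be estimated by directly comparing the treated and untreated groups.
> - If not, we must use additional techniques, including matching.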

We will follow the matching process for voter data from Indiana, then repeat it for another state [state] for comparison.
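
The quantities listed under Variables can be written compactly as follows. This is the standard potential-outcomes formulation, stated here as an aid to the reader; the abbreviation ATE and the expectation notation are introduced for this illustration.

$$
Y = T\,Y_1 + (1 - T)\,Y_0,
\qquad
\text{ATE} = E[Y_1 - Y_0]
$$

$$
(Y_0, Y_1) \perp T
\;\Longrightarrow\;
\text{ATE} = E[Y \mid T = 1] - E[Y \mid T = 0]
$$

When independence fails, the simple difference in group means no longer equals the ATE, which is the motivation for conditioning on $X$ through matching.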




## Other State Matching

## k-Nearest-Neighbors (k-NN)

We build a k-NN model, which is a supervised machine learning model. Thus, it learns from already labeled data points.
- Because k-NN relies on distances between data points, it is also affected by the curse of dimensionality.
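
To illustrate how a k-NN classifier learns from labeled points, here is a minimal plain-Python sketch. It is not the model built in this project: the function name `knn_predict`, the toy coordinates, and the labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled data: two small clusters with labels 0 and 1
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_X, train_y, (0.05, 0.05))
pred_high = knn_predict(train_X, train_y, (1.0, 0.95))
print(pred_low, pred_high)
```

An odd value of `k` avoids ties when there are two classes.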
### Cross Validation
- We use cross-validation to assess how well the k-NN model performs.
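
A sketch of k-fold cross-validation for a k-NN classifier is shown below, again in plain Python on synthetic data. The function names `knn_predict` and `k_fold_accuracy`, the number of folds, and the two simulated clusters are assumptions made for this illustration only.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def k_fold_accuracy(X, y, k_neighbors=3, n_folds=5, seed=0):
    """Mean held-out accuracy over n_folds disjoint folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_X = [X[i] for i in indices if i not in held]
        train_y = [y[i] for i in indices if i not in held]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k_neighbors) == y[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds


# Synthetic data: two well-separated clusters labeled 0 and 1
rng = random.Random(1)
X = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(20)]
X += [(rng.gauss(1, 0.1), rng.gauss(1, 0.1)) for _ in range(20)]
y = [0] * 20 + [1] * 20

accuracy = k_fold_accuracy(X, y)
print(f"mean cross-validated accuracy: {accuracy:.2f}")
```

Each point is used for evaluation exactly once and is never in the training set of the fold that evaluates it.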

## Backup: Logistic Regression

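Logistic regression is a statistical method that models the probability of a binary outcome (here, whether a person voted) as a function of covariates.
- Use non-Indiana data to predict Indiana outcomes.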
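
For reference, the standard logistic regression model has the form below. The symbols $\beta_0$ and $\beta$ (intercept and coefficient vector) and the use of $x$ for a voter's covariate values are notation introduced here, not taken from the analysis.

$$
\Pr(Y = 1 \mid X = x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}
$$

Under the plan sketched above, the coefficients would be fit on non-Indiana voters and then applied to the covariates of Indiana voters to obtain predicted turnout probabilities.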

## Summary of Findings


## Conclusion

[summary of everything]

[issues - curse of dimensionality]

[significance]

[possible future work]

## Resources
https://www.ncsl.org/elections-and-campaigns/voter-id#undefined 

https://www.franciscoyira.com/post/matching-in-r-2-differences-regression/
