# Analysis of Voter Turnout in Indiana Pre- and Post- Voter Identification Law
### Authors: Christopher Lefrak, Hannah Li, George Yang, and Kuai Yu
### PSTAT 235

NOTES/TO-DO:
- truncate/limit outputs so the writeup looks polished and professional (no raw outputs/errors)
- interpret findings
- include visualizations and graphs (EDA? theoretical concepts?)
- run and write all code (label all code clearly too) in the "Final Code file.ipynb" file, and save all visualizations as images to embed in this writeup document so it's more organized.
- submission: "Writeup.ipynb" file and "Final Code file.ipynb"


## Introduction

[importance/potential effect of voter ID law]

Thirty-five of the fifty states of the U.S. have passed stricter voter ID laws that require or request voters to present a form of identification at the polls. 
The remaining fifteen states do not require voters to present any documentation to vote at the polls. States such as Indiana, Wisconsin, and Tennessee have strict photo ID laws for voters, while states such as Minnesota, Nebraska, North Carolina, and Pennsylvania have no requirements for voter identification. A visualization of the levels of strictness of voter photo identification laws for each state can be seen in the map in **Fig. 1** below.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1-bKVXaRl_j3trfCus6iWYylx3RX4ACPz" style="width:100%">
<figcaption align = "center"> Fig. 1: Strictness Levels of US Voter ID Laws </figcaption>
</figure>

Advantages of implementing stricter voter identification requirements include preventing voter impersonation and thereby increasing public confidence in election processes. A disadvantage is that stricter laws can unnecessarily burden voters and administrators.

## Goals
In this project we focus our investigation of voter identification laws on the state of Indiana, which implemented a strict voter identification law in 2008. We seek to estimate how much voter turnout would have decreased or increased had the law not been implemented.

> Project Goals
> - Estimate propensity scores
> - Conduct logistic regression
> - Strengthen our PySpark data analysis, collaboration, and project organization skills

[technologies, packages, skills...]

## Indiana Voter Data

### Dataset Overview

Our data is from the course's voter files folder. We primarily use the dataset corresponding to Indiana. At a glance, the dataset contains 726 columns and 946,908 rows, with records beginning from .... and ending at March 5, 2021.

[eda/visualizations]

### Data Cleaning

Many columns of the dataset have missing values.
We narrow our focus to individuals who had reached the legal voting age of 18 at the time of the election.


We subset the dataset to a narrower set of voter attributes, selecting the following columns from the original dataset:

[table with column names and descriptions]




Based on each voter's birth date, we calculate the date on which they turn eighteen. We then create a new variable holding the earliest year in which the voter could participate in a general election. If the voter turns eighteen on or before November 3rd, the value is the year in which they turn eighteen; otherwise, it is the following year. We build a second variable in the same way for presidential primaries, using May 3rd as the cutoff.

In [None]:
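from pyspark.sql.functions import (add_months, col, concat, lit, regexp_extract,
                                   regexp_replace, to_date, when, year)

# number of months until a voter's 18th birthday
yrs_add = 18
months_add = yrs_add*12

# month and day used as the general election cutoff
target_month_day_presidential = "11-03"

# date of Indiana's presidential primary
target_month_day_primary = "05-03" 

# drop voters with no birth date, then compute the date of each voter's 18th birthday
indi = indi.dropna(subset = "Voters_BirthDate")
indi = indi.withColumn("DATE_18", add_months(to_date(col("Voters_BirthDate"),"MM/dd/yyyy"), months_add))
indi.select(["Voters_BirthDate", "DATE_18"]).show(10)
indi = indi.withColumn("YEAR_18", year("DATE_18"))

# election cutoff dates in the year the voter turns 18
indi = indi.withColumn("comparator_date_presidential", to_date(concat(col("YEAR_18"), lit("-"), lit(target_month_day_presidential))))
indi = indi.withColumn("comparator_date_primary", to_date(concat(col("YEAR_18"), lit("-"), lit(target_month_day_primary))))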
indi = indi.withColumn("YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL", \
                             when(col("DATE_18")<=col("comparator_date_presidential"), col("YEAR_18")) \
                             .otherwise(col("YEAR_18") + 1) \
                            )
indi = indi.withColumn("YEAR_ELIGIBLE_TO_VOTE_PRIMARY", \
                             when(col("DATE_18")<=col("comparator_date_primary"), col("YEAR_18")) \
                             .otherwise(col("YEAR_18") + 1) \
                            )

# check no missing vals:
indi.where(col("YEAR_18").isNull()).select("YEAR_18").show(10)

# get rid of rows where the voter was not old enough to vote in the 2008 general election
indi = indi.filter(col("YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL")<=2008).fillna("N", subset = ["General_2008"])

# for the 2000 and 2004 general elections, replace nulls with "N" only IF the
# person was old enough to vote in that election
indi = indi.withColumn("General_2000", \
                      when((col("YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL")<= 2000) & \
                           (col("General_2000").isNull()), "N") \
                      .otherwise(col("General_2000")) \
                      )

indi = indi.withColumn("General_2004", \
                      when((col("YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL")<= 2004) & \
                           (col("General_2004").isNull()), "N") \
                      .otherwise(col("General_2004")) \
                      )

# do the same for the primaries, using the primary eligibility year of each election:
indi = indi.withColumn("PresidentialPrimary_2000", \
                      when((col("YEAR_ELIGIBLE_TO_VOTE_PRIMARY")<= 2000) & \
                           (col("PresidentialPrimary_2000").isNull()), "N") \
                      .otherwise(col("PresidentialPrimary_2000")) \
                      )

indi = indi.withColumn("PresidentialPrimary_2004", \
                      when((col("YEAR_ELIGIBLE_TO_VOTE_PRIMARY")<= 2004) & \
                           (col("PresidentialPrimary_2004").isNull()), "N") \
                      .otherwise(col("PresidentialPrimary_2004")) \
                      )

# make the general voting for 2008 a numeric variable; since we've deleted
# everyone who was not eligible to vote, this can be directly calculated with a 1-0.
indi = indi.withColumn("Voted_General_2008", when(indi.General_2008 == "Y",1).otherwise(0))
indi = indi.drop("General_2008")

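The eligibility rule used above can be checked on individual birth dates in plain Python. This is a minimal sketch rather than the project's PySpark code: the helper names `date_turns_18` and `first_eligible_year` are hypothetical, and the February 29 handling is an assumption chosen to mirror how adding months clamps to the last valid day of the month.

```python
from datetime import date


def date_turns_18(birth: date) -> date:
    """Return the 18th birthday; Feb 29 birthdays fall back to Feb 28."""
    try:
        return birth.replace(year=birth.year + 18)
    except ValueError:
        # Feb 29 birthday and the target year is not a leap year
        return birth.replace(year=birth.year + 18, day=28)


def first_eligible_year(birth: date, cutoff_month: int = 11, cutoff_day: int = 3) -> int:
    """Year of the 18th birthday if it falls on or before the cutoff, else the next year."""
    d18 = date_turns_18(birth)
    cutoff = date(d18.year, cutoff_month, cutoff_day)
    return d18.year if d18 <= cutoff else d18.year + 1


# hypothetical birth dates around the November 3rd cutoff
on_cutoff = first_eligible_year(date(1990, 11, 3))
after_cutoff = first_eligible_year(date(1990, 11, 4))
# same rule with the May 3rd primary cutoff
primary_after_cutoff = first_eligible_year(date(1990, 5, 4), cutoff_month=5, cutoff_day=3)
print(on_cutoff, after_cutoff, primary_after_cutoff)
```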
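We then draw a 10% random sample of the dataset to prototype code.


In [None]:
# sample roughly 10% of rows without replacement so that no voter is duplicated
sampleind = indi.sample(withReplacement=False, fraction=0.1, seed=19480384)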

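We then convert the column `CommercialData_EstimatedHHIncome` from type string to type numeric by keeping the upper bound of each income range (the text after the hyphen) and removing the symbols "$", ",", and "+". Values that contain no hyphen produce an empty match and therefore become null after the cast.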

In [None]:
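# keep the text after the hyphen, i.e. the upper bound of the income range
sampleind = sampleind.withColumn("CommercialData_EstimatedHHIncome", regexp_extract(col("CommercialData_EstimatedHHIncome"), "(?<=-).*", 0))

# strip "$", "," and "+" so the remaining text can be cast to a number
sampleind = sampleind.withColumn("CommercialData_EstimatedHHIncome", \
                             regexp_replace('CommercialData_EstimatedHHIncome', r"[\$,+]", "") \
                            )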

sampleind = sampleind.withColumn("CommercialData_EstimatedHHIncome",col("CommercialData_EstimatedHHIncome").cast('double'))

sampleind.select(["CommercialData_EstimatedHHIncome"]).show(10, truncate=False)

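The string-to-number conversions in this section can be tried out on single values with Python's `re` module. The sketch below is illustrative only: the helper names and the sample strings (such as `"$75000-99999"`) are assumptions about what the raw values look like, not values taken from the dataset.

```python
import re
from typing import Optional


def parse_income_upper_bound(raw: Optional[str]) -> Optional[float]:
    """Keep the text after the first hyphen, strip "$", "," and "+", then convert."""
    if raw is None:
        return None
    match = re.search(r"(?<=-).*", raw)
    if match is None:
        # no hyphen: nothing is extracted, so the value is treated as missing
        return None
    cleaned = re.sub(r"[\$,+]", "", match.group(0))
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_symbol_number(raw: Optional[str], symbol: str) -> Optional[float]:
    """Remove a single symbol such as "$" or "%" and convert the rest to a float."""
    if raw is None:
        return None
    try:
        return float(raw.replace(symbol, ""))
    except ValueError:
        return None


# hypothetical raw values
print(parse_income_upper_bound("$75000-99999"))
print(parse_income_upper_bound("$250000+"))
print(parse_symbol_number("$185000", "$"))
print(parse_symbol_number("37%", "%"))
```

An open-ended bracket written without a hyphen comes back as missing under this rule, which is worth checking against the real income categories before modeling.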

We also convert the column `CommercialData_AreaMedianHousingValue` from type string to type numeric by removing the symbol "$".

In [None]:
sampleind = sampleind.withColumn("CommercialData_AreaMedianHousingValue", regexp_replace("CommercialData_AreaMedianHousingValue", r"\$", ""))
sampleind = sampleind.withColumn("CommercialData_AreaMedianHousingValue",col("CommercialData_AreaMedianHousingValue").cast('double'))
sampleind.select(["CommercialData_AreaMedianHousingValue"]).show(10, truncate=False)

We proceed to search for the string "Pcnt" in all of the column names in our dataset, and convert the matching columns

> - 'CommercialData_AreaPcntHHMarriedCoupleNoChild'
> - 'CommercialData_AreaPcntHHMarriedCoupleWithChild'
> - 'CommercialData_AreaPcntHHSpanishSpeaking'
> - 'CommercialData_AreaPcntHHWithChildren'

to numeric types by removing the symbol "%".


In [None]:
cols_to_convert = [c for c in sampleind.columns if "Pcnt" in c]

for col_name in cols_to_convert:
    sampleind = sampleind.withColumn(col_name, regexp_replace(col_name, r"%", ""))
    sampleind = sampleind.withColumn(col_name, col(col_name).cast('double'))
    sampleind.select([col_name]).show(5, truncate=False)
    

We then drop the helper columns that were created to determine voter eligibility.



In [None]:
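# helper columns created while determining voter eligibility
# ("target_month_day_primary" is a Python variable, not a column, so it is not listed)
columns_to_drop = ["comparator_date_presidential", "comparator_date_primary",
                   "YEAR_ELIGIBLE_TO_VOTE_PRESIDENTIAL",
                   "YEAR_ELIGIBLE_TO_VOTE_PRIMARY", "YEAR_18", "DATE_18"]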

sampleind = sampleind.drop(*columns_to_drop)

## Propensity Score
Our goal is to predict whether a voter is subject to the voter identification law.

We want to estimate the probability that someone would have voted if the voter identification law had not been implemented.

Assumption: People outside of Indiana are representative of the people in Indiana.


> Variables:
> - T = whether the voter is subject to the law (the treatment)
> - Y = whether the voter voted in 2008
> - P = predicted probability of T (the propensity score)

To compare the voter data with and without the implementation of a voter identification law, we observe that the difference in means $E[Y|T=1]-E[Y|T=0]$ is [...]. Thus, the treated voters (with implementation of a voter identification law) have a [...] compared to non-treated voters.

The propensity score is the conditional probability of receiving the treatment, here the implementation of the voter identification law, given the covariates $X$. Using this score means that we do not have to condition on the whole $X$ to achieve conditional independence of the potential outcomes and the treatment, $(Y_1,Y_0) \perp T \mid X$. Instead, it is sufficient to condition on the propensity score $P(x)=P(T=1 \mid X)=E[T \mid X]$, which gives $(Y_1,Y_0) \perp T \mid P(x)$.

The propensity score essentially summarizes $X$ into the probability of the treatment $T$, acting as a middleman between $X$ and $T$. Initially, we cannot directly compare treated and non-treated observations. However, we can compare a treated and a non-treated observation if they have the same probability of receiving the treatment, since which of the two actually received the treatment can then be attributed to randomness. Thus, we hold the propensity score constant to make treatment assignment look as good as random.


### Propensity Weighting and Estimation

We write the difference in means again, but we now condition on $X$: 
$$E[Y \mid X,T=1]-E[Y \mid X,T=0]=E\left[\frac{Y}{P(x)} \,\middle|\, X,T=1\right]P(x)-E\left[\frac{Y}{1-P(x)} \,\middle|\, X,T=0\right](1-P(x))$$

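In other words, propensity score weighting gives heavy weight to treated observations with a low probability of receiving treatment and to non-treated observations with a high probability of receiving treatment, and gives little weight to observations whose treatment status matches what their propensity score predicts.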

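To make the weighting concrete, the following is a minimal sketch on synthetic data, not on the voter file. Everything in it is an assumption made for illustration: a single binary confounder `x`, known propensities of 0.8 and 0.2, an arbitrary true effect of 0.1, and the helper names. It compares the plain difference in means with the inverse-propensity-weighted difference.

```python
import random

random.seed(0)
TRUE_EFFECT = 0.1  # arbitrary effect of T on P(Y = 1), chosen for illustration


def simulate(n):
    """Synthetic rows (x, t, y, p), where x raises both treatment and outcome rates."""
    rows = []
    for _ in range(n):
        x = 1 if random.random() < 0.5 else 0
        p = 0.8 if x == 1 else 0.2          # true propensity P(T=1 | x)
        t = 1 if random.random() < p else 0
        y_prob = 0.3 + 0.4 * x + TRUE_EFFECT * t
        y = 1 if random.random() < y_prob else 0
        rows.append((x, t, y, p))
    return rows


def naive_difference(rows):
    """E[Y | T=1] - E[Y | T=0], ignoring the confounder."""
    treated = [y for _, t, y, _ in rows if t == 1]
    control = [y for _, t, y, _ in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_difference(rows):
    """Sample version of E[Y T / P(x)] - E[Y (1 - T) / (1 - P(x))]."""
    n = len(rows)
    treated_term = sum(y * t / p for _, t, y, p in rows) / n
    control_term = sum(y * (1 - t) / (1 - p) for _, t, y, p in rows) / n
    return treated_term - control_term


rows = simulate(100_000)
naive = naive_difference(rows)
ipw = ipw_difference(rows)
print(f"naive difference: {naive:.3f}, weighted difference: {ipw:.3f}")
```

In this setup the plain difference is inflated by the confounder, while the weighted difference should land near the simulated effect.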

#### Positivity Assumption
We simplify our propensity score weighting estimator to $E\left[Y\frac{T-P(x)}{P(x)(1-P(x))}\right]$, where $P(x)$ and $1-P(x)$ must both be greater than $0$. Thus, each voter must have some nonzero probability of both receiving and not receiving the treatment, the implementation of a voter identification law.

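As a short worked derivation, the simplified estimator comes from putting the treated and non-treated weighted terms over a common denominator. The first line writes both terms as expectations over the whole sample, which uses $E[YT \mid X] = E[Y \mid X, T=1]\,P(x)$ and the analogous identity for $1-T$:

$$
\begin{aligned}
E\left[\frac{YT}{P(x)}\right]-E\left[\frac{Y(1-T)}{1-P(x)}\right]
&= E\left[Y\,\frac{T\,(1-P(x))-(1-T)\,P(x)}{P(x)(1-P(x))}\right] \\
&= E\left[Y\,\frac{T-T\,P(x)-P(x)+T\,P(x)}{P(x)(1-P(x))}\right] \\
&= E\left[Y\,\frac{T-P(x)}{P(x)(1-P(x))}\right]
\end{aligned}
$$

The denominator $P(x)(1-P(x))$ is what makes positivity necessary: if either factor is $0$ for some voter, the weight is undefined.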

We now estimate the true propensity score $P(x)$ with $\hat{P}(x)$ using logistic regression.

In order to complete this, we convert categorical variables in our dataset into dummy variables.

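As an illustration of these two steps, the sketch below dummy-encodes one categorical variable and fits a logistic regression by gradient descent in plain Python. It is a toy under stated assumptions, not the project's model: the `region` variable, its levels, the treatment counts, and the helper names are all hypothetical, and a real analysis would more likely rely on a library implementation.

```python
import math


def one_hot(value, levels):
    """Dummy-encode `value`, dropping the first level as the baseline."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Estimated propensity P(T=1 | x) for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(features, labels, lr=1.0, n_iter=2000):
    """Minimize the mean log-loss with plain gradient descent."""
    n, k = len(features), len(features[0])
    weights = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        for x, t in zip(features, labels):
            err = predict(weights, x) - t
            for j in range(k):
                grad[j] += err * x[j] / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# hypothetical data: 100 voters per region with 20, 50 and 80 treated
levels = ["A", "B", "C"]
treated_counts = {"A": 20, "B": 50, "C": 80}
rows = [(level, 1 if i < treated_counts[level] else 0)
        for level in levels for i in range(100)]

# intercept column plus dummy variables for regions B and C
features = [[1.0] + one_hot(region, levels) for region, _ in rows]
labels = [t for _, t in rows]
weights = fit_logistic(features, labels)

p_hat = {level: predict(weights, [1.0] + one_hot(level, levels)) for level in levels}
print({level: round(p, 3) for level, p in p_hat.items()})
```

With only dummy variables and an intercept, the fitted propensities should approach the treated share within each region, which gives a quick sanity check on the encoding.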

> Issues
> - A propensity score model with very strong predictive power can hurt our goals of causal inference.
>> - We want our prediction to control for confounding variables, not necessarily to predict the treatment very well.

## Logistic Regression

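Logistic regression is a statistical method that models the probability of a binary outcome as a logistic function of a linear combination of predictors.
- use non-Indiana data to predict Indiana data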

## Summary of Findings


## Conclusion

[summary of everything]

[issues - curse of dimensionality]

[significance]

[possible future work]

## Resources
https://www.ncsl.org/elections-and-campaigns/voter-id#undefined 

https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html
