# Analysis of Voter Turnout in Indiana Pre- and Post- Voter Identification Law
### Authors: Christopher Lefrak, Hannah Li, George Yang, and Kuai Yu
### PSTAT 235

NOTES/TO-DO (delete this before submitting):
- truncate/limit code so the writeup looks polished and professional (no raw outputs/errors)
- interpret findings
- include visualizations and graphs (EDA? theoretical concepts?)
- run and write ONLY NECESSARY code (label all code clearly and eliminate commented out code) in the "Final Code file.ipynb" file, save all visualizations as images to embed in this writeup document so it's more organized.
- submission: "Writeup.ipynb" file and "Final Code file.ipynb"


## Introduction

[importance/potential effect of voter ID law]

Thirty-five of the fifty states of the U.S. have passed stricter voter ID laws that require or request voters to present a form of identification at the polls. 
The remaining fifteen states do not require voters to present any documentation to vote at the polls. States such as Indiana, Wisconsin, and Tennessee have strict photo ID laws for voters, while states such as Minnesota, Nebraska, North Carolina, and Pennsylvania have no requirements for voter identification. A visualization of the levels of strictness of voter photo identification laws for each state can be seen in the map in **Fig. 1** below.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1-bKVXaRl_j3trfCus6iWYylx3RX4ACPz" style="width:100%">
<figcaption align = "center"> Fig. 1: Strictness Levels of US Voter ID Laws </figcaption>
</figure>

Advantages of implementing stricter voter identification requirements include preventing voter impersonation and thus increasing public confidence in election processes. A disadvantage is that stricter laws can unnecessarily burden voters and administrators.

## Goals
In this project we focus our investigation of voter identification laws on the state of Indiana, which implemented a strict voter identification law in 2008. We seek to analyze how much voter turnout would have decreased or increased without the implementation of the law.

> Project Goals
> - Estimate propensity scores
> - Conduct logistic regression
> - Strengthen our PySpark data analysis skills, collaborative skills, and project organization skills

## Voter Data

### Dataset Overview

Our voter data is obtained from the course's VM2Uniform folder. We primarily use the dataset corresponding to Indiana. At a glance, the dataset contains 726 columns and 946,908 rows, with records beginning from .... and ending in 2021.

[eda/visualizations]

We subsetted the dataset to focus on a narrower set of voter attributes. The columns we selected from the original dataset can be seen in the following section.




### Input Variables
The table below lists the columns we keep from the original dataset.

| Column Name | Data Type | Values |
| --- | --- | --- |
| Voters_Gender | categorical | ... |
| Voters_BirthDate | --- | ... |
| Residence_Families_HHCount | numerical | 1 through 10 |
| Residence_HHGender_Description | categorical | Cannot Determine, Female Only Household, Male Only Household, Mixed Gender Household |
| Mailing_Families_HHCount | numerical |  ... |
| Mailing_HHGender_Description | categorical |  ... |
| Parties_Description | categorical | ... |
| CommercialData_PropertyType | categorical | ... |
| AddressDistricts_Change_Changed_CD | categorical | ... |
| AddressDistricts_Change_Changed_SD | categorical | ... |
| AddressDistricts_Change_Changed_HD | categorical | ... |
| AddressDistricts_Change_Changed_County | categorical | ... |
| Residence_Addresses_Density | numerical | ... |
| CommercialData_EstimatedHHIncome | categorical | ... |
| CommercialData_ISPSA | categorical | 0 through 9 |
| CommercialData_AreaMedianEducationYears | numerical |  ... |
| CommercialData_AreaMedianHousingValue | numerical | ... |
| CommercialData_AreaPcntHHMarriedCoupleNoChild | categorical | ... |
| CommercialData_AreaPcntHHMarriedCoupleWithChild | categorical | ... |
| CommercialData_AreaPcntHHSpanishSpeaking | categorical | ... |
| CommercialData_AreaPcntHHWithChildren | categorical | ... |
| CommercialData_StateIncomeDecile | categorical | 0 through 9 |
| EthnicGroups_EthnicGroup1Desc | categorical |  East and South Asian, European, Hispanic and Portuguese, Likely African-American, N/A, Other |
| CommercialData_DwellingType | categorical | Large mult wo/Apt number, POBOX, Single Family Dwelling Unit, Small Mult or large mult w/apt number |
| CommercialData_PresenceOfChildrenCode | categorical | Known Data, Modeled Likely to have a child, Modeled Not as Likely to have a child, Not Likely to have a child |
| CommercialData_DonatesToCharityInHome | categorical | Yes, unknown |
| CommercialData_DwellingUnitSize | categorical | Single Family Dwelling Unit, Duplex, Triplex, 4, 5-9, 10-19, 20-49, 50-100, 101+ |
| CommercialData_ComputerOwnerInHome | categorical | Yes, unknown |
| CommercialData_DonatesEnvironmentCauseInHome | categorical | Yes, unknown |
| CommercialData_Education | categorical | Unknown, HS Diploma - Extremely Likely, Some College -Extremely Likely, Bach Degree - Extremely Likely, Grad Degree - Extremely Likely, Less than HS Diploma - Ex Like, HS Diploma - Likely, Some College - Likely, Bach Degree - Likely, Grad Degree - Likely, Less than HS Diploma - Likely |


### Other Variables
The table below shows other control variables that we expect to be highly associated with the response variable.

| Column Name | Data Type | Values |
| --- | --- | --- |
| General_2000 | categorical |  |
| General_2004 | categorical |  |
| PresidentialPrimary_2000 | categorical | --- |
| PresidentialPrimary_2004 | categorical |  |
| General_2008 | categorical | --- |

The table below shows the response variable.

| Column Name | Data Type | Values |
| --- | --- | --- |
| General_2008 | categorical |  |

### Other States
The table below lists the states that do not have strict voter identification laws.

| State | Abbreviation |
| --- | --- |
| California | CA | 
| Illinois | IL |
| Massachusetts | MA | 
| Maryland | MD | 
| Maine | ME |
| Minnesota | MN | 
| North Carolina | NC | 
| Nebraska | NE |
| New Jersey | NJ | 
| New Mexico | NM | 
| Nevada | NV |
| New York | NY | 
| Oregon | OR | 
| Pennsylvania | PA |
| Vermont | VT | 


### Data Cleaning

Many of the columns contain symbols including `$` and `%`, so we remove those symbols.
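
A minimal sketch of this cleaning step in plain Python, not the project's PySpark code. The helper name `strip_symbols` and the sample values are hypothetical, and removing thousands separators (`,`) alongside `$` and `%` is an assumption made so the result can be converted to a number.

```python
import re

def strip_symbols(value):
    """Remove `$`, `%`, and `,` from a raw string and convert it to float.

    Returns None for missing or empty values so they can be imputed later.
    """
    if value is None:
        return None
    cleaned = re.sub(r"[$%,]", "", value).strip()
    return float(cleaned) if cleaned else None

# Hypothetical raw values shaped like a housing-value and a percentage column.
raw_housing_values = ["$125,000", "$89,500", None, ""]
raw_percentages = ["12%", "3.5%", " 40% "]

housing_values = [strip_symbols(v) for v in raw_housing_values]
percentages = [strip_symbols(v) for v in raw_percentages]
```

On a Spark DataFrame the same pattern could likely be applied column by column with `pyspark.sql.functions.regexp_replace` followed by a cast.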

Many columns are also missing data. In numerical columns, we impute the missing values with the column mean to limit changes to the z-scores of the observed data. We encode categorical columns using PySpark's `StringIndexer`. This maps the categorical labels of a column to a column of label indices ordered by label frequency, where the most frequent label is assigned index 0.
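
The two transformations can be illustrated with a small plain-Python sketch. It mimics the behavior described above rather than calling PySpark; the function names and sample columns are hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def frequency_index(labels):
    """Map each label to an index ordered by descending frequency.

    The most frequent label gets index 0. Ties are broken alphabetically
    (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: index for index, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping

# Hypothetical columns for illustration only.
density = [2.0, None, 4.0, 6.0]
dwelling = ["Single Family", "POBOX", "Single Family",
            "Duplex", "POBOX", "Single Family"]

density_imputed = impute_mean(density)
dwelling_indexed, dwelling_mapping = frequency_index(dwelling)
```

Note that the resulting indices carry no natural order beyond label frequency, which matters when they are fed to a model as numbers.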


Our data also contains many individuals who were not old enough to vote in the 2008 general election (they were below the minimum voting age of 18), so we removed the rows corresponding to these individuals. 
We then converted the `General_2008` variable to a numerical type. 
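
A sketch of the age filter and the numerical encoding, in plain Python. The rows are hypothetical, and the assumption that a recorded vote is stored as `"Y"` (with a missing value otherwise) is ours; the election date of November 4, 2008 is used as the cutoff.

```python
from datetime import date

ELECTION_DAY_2008 = date(2008, 11, 4)  # date of the 2008 general election
MINIMUM_AGE = 18

def age_on(birth_date, on_date):
    """Return the age in whole years on `on_date`."""
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)

# Hypothetical rows: (birth date, General_2008 value as stored).
voters = [
    (date(1990, 11, 4), "Y"),   # turns 18 on election day
    (date(1990, 11, 5), None),  # one day too young
    (date(1965, 3, 12), None),  # eligible, no vote recorded
]

# Keep only individuals who were at least 18 on election day.
eligible = [(born, voted) for born, voted in voters
            if age_on(born, ELECTION_DAY_2008) >= MINIMUM_AGE]

# Encode the response numerically: 1 if a vote was recorded, else 0.
general_2008 = [1 if voted == "Y" else 0 for _, voted in eligible]
```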


## Propensity Score
Our goal is to predict whether or not a voter is subject to the voter identification law.

We want to find the probability that someone votes if the voter identification law had not been implemented.

Assumption: People outside of Indiana are representative of the people in Indiana.


> Variables:
> - T = whether the voter is subject to the law
> - Y = whether they voted in 2008
> - P = predicted probability of T

To compare the voter data with and without the implementation of a voter identification law, we observe that the difference in means $$E[Y|T=1]-E[Y|T=0]$$ is [...]. Thus, the treated voters (with implementation of a voter identification law) have a [...] compared to non-treated voters.

The propensity score is the conditional probability of receiving the treatment, here the implementation of the voter identification law, given the covariates $X$. Using this score means that we do not have to condition on the whole of $X$ to achieve independence between the potential outcomes and the treatment, $(Y_1,Y_0) \perp T \mid X$. Instead, it is sufficient to condition on the propensity score $$P(x)=P(T=1|X)=E[T|X]$$ to achieve $(Y_1,Y_0) \perp T \mid P(x)$.

The propensity score essentially summarizes $X$ as a single probability of treatment, acting as a middleman between $X$ and $T$. Initially, we cannot directly compare treated and non-treated observations. However, we can compare a treated and a non-treated observation if they have the same probability of receiving the treatment, since which of the two actually received it can then be attributed to randomness. Thus, we hold the propensity score constant so that treatment assignment looks as good as random.
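
The idea of comparing only observations with the same propensity score can be sketched as follows. The data are invented for illustration, and grouping on exactly equal scores is a simplification: with estimated scores one would more plausibly bin or match nearby values.

```python
from collections import defaultdict

def within_score_differences(rows):
    """Return {propensity score: treated mean of y minus untreated mean of y}."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for y, t, p in rows:
        groups[p][t].append(y)
    return {
        p: sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        for p, g in groups.items()
        if g[0] and g[1]  # a comparison needs both treated and untreated rows
    }

# Hypothetical rows: (Y = voted, T = subject to the law, P = propensity score).
rows = [
    (1, 1, 0.8), (1, 1, 0.8), (1, 1, 0.8), (0, 1, 0.8), (1, 0, 0.8),
    (0, 1, 0.2), (0, 0, 0.2), (0, 0, 0.2), (0, 0, 0.2), (1, 0, 0.2),
]

differences = within_score_differences(rows)
```

Within each score group the treated and untreated rows are comparable, so each difference is an estimate of the effect for voters with that probability of treatment.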


### Propensity Weighting and Estimation

We write the difference in means again, but we now condition on $X$: 

$$
\begin{align}
& E[Y|X,T=1]-E[Y|X,T=0]\\
& =E[\frac{Y}{P(x)}|X,T=1]P(x)-E[\frac{Y}{(1-P(x))}|X,T=0](1-P(x))
\end{align}
$$

In other words, the propensity score more heavily weights the observations with a low probability of receiving treatment, and weakly weights...

We simplify our propensity score weighting estimator to $E\left[Y\frac{T-P(x)}{P(x)(1-P(x))}\right]$, where $P(x)$ and $(1-P(x))$ both must be greater than $0$. Thus, each voter must have some probability of both receiving and not receiving the treatment of the implementation of a voter identification law.
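
As a rough illustration, the sample analogue of this estimator can be computed next to the unweighted difference in means. The data below are invented so that the law has no effect within either group of voters; the function names are ours, and the propensity scores are taken as known rather than estimated.

```python
def naive_difference(y, t):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ipw_estimate(y, t, p):
    """Sample mean of Y * (T - P(x)) / (P(x) * (1 - P(x)))."""
    if any(pi <= 0.0 or pi >= 1.0 for pi in p):
        raise ValueError("every propensity score must lie strictly between 0 and 1")
    terms = [yi * (ti - pi) / (pi * (1.0 - pi)) for yi, ti, pi in zip(y, t, p)]
    return sum(terms) / len(terms)

# Hypothetical data: five likely-treated voters who all vote (P = 0.8)
# and five unlikely-treated voters who all abstain (P = 0.2).
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
t = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
p = [0.8] * 5 + [0.2] * 5

naive = naive_difference(y, t)
weighted = ipw_estimate(y, t, p)
```

Here the unweighted difference is far from zero only because voters who were likely to be treated were also likely to vote, while the weighted estimate is approximately zero. The explicit check on `p` reflects the requirement that both $P(x)$ and $1-P(x)$ be positive.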


We now estimate the true propensity score $P(x)$ with $\hat{P}(x)$ using logistic regression.

> Issues
> - The propensity score's strong predictive power can hurt our goals of causal inference.
>> - We want our prediction to control for confounding variables, not necessarily to predict the treatment very well.

## Logistic Regression

Logistic regression is a statistical method that allows us to estimate the probability $p$ that an event occurs, in this case whether an individual voted, given a set of independent variables $X_1, X_2, \dots, X_k$. The model can be expressed in terms of the log odds, where the odds $\frac{p}{1-p}$ are defined as the probability of success divided by the probability of failure. The log odds are linear in the variables, and the logistic function $\sigma$ maps them back to a probability:

$$
\begin{align}
\ln\left(\frac{p}{1-p}\right)&=\beta_0+\beta_1 X_1+\beta_2 X_2+\dots+\beta_k X_k \\
p=\sigma(z)&=\frac{1}{1+e^{-z}}, \quad z=\beta_0+\beta_1 X_1+\beta_2 X_2+\dots+\beta_k X_k
\end{align}
$$

The coefficients of this function are $\beta_0, \beta_1, \beta_2, \dots, \beta_k$, where $\beta_0$ is the intercept and each remaining coefficient indicates the effect of the corresponding variable $X_1, X_2, \dots, X_k$ on the log odds of the response variable. The optimal coefficients maximize the likelihood of the observed data to find the best fit.
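
A short sketch of how the log odds and the predicted probability relate. The function names and the coefficient values are illustrative only and are not fitted to our data.

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Log odds of a probability p, the inverse of `sigmoid`."""
    return math.log(p / (1.0 - p))

def predict_probability(intercept, coefficients, features):
    """Probability implied by a linear combination of the features."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

# Illustrative coefficients: the two terms cancel, so the log odds are 0.
probability = predict_probability(0.0, [1.0, -1.0], [2.0, 2.0])

# Raising a feature by one unit raises the log odds by its coefficient.
base = predict_probability(-1.0, [0.5], [2.0])
shifted = predict_probability(-1.0, [0.5], [3.0])
change_in_log_odds = log_odds(shifted) - log_odds(base)
```

The last lines show why coefficients are read on the log odds scale: a one-unit increase in a feature shifts the log odds by exactly its coefficient, whereas the change in probability depends on the starting point.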


Applying this to our data, we split our dataset into 70% training data and 30% testing data, and fit logistic regression on our training data. 
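
The split-and-fit procedure can be sketched in plain Python on synthetic data. This is not the PySpark pipeline used for the project: the single feature, the seed, the learning rate, and the gradient-ascent fitting loop are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def train_test_split(rows, train_fraction, seed):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def predict(b0, b1, x):
    """Predicted probability for one feature value."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(rows, learning_rate=0.1, epochs=2000):
    """Fit intercept and slope by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in rows:
            error = y - predict(b0, b1, x)
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / len(rows)
        b1 += learning_rate * grad1 / len(rows)
    return b0, b1

# Synthetic data: the label is 1 when x > 0, with every tenth label flipped
# so that the classes are not perfectly separable.
rows = []
for i in range(100):
    x = (i - 49.5) / 25.0
    y = 1 if x > 0 else 0
    if i % 10 == 0:
        y = 1 - y
    rows.append((x, y))

train, test = train_test_split(rows, train_fraction=0.7, seed=0)
b0, b1 = fit_logistic(train)

# Evaluate on the held-out rows with a 0.5 probability threshold.
correct = sum((predict(b0, b1, x) >= 0.5) == (y == 1) for x, y in test)
test_accuracy = correct / len(test)
```

Holding out 30% of the rows gives an estimate of how well the fitted coefficients generalize to voters that were not used for fitting.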

A plot of the beta coefficients can be seen in **Fig. 2** and a plot of the ranked coefficients can be seen in **Fig. 3**.

<figure>
<img src="..." style="width:100%">
<figcaption align = "center"> Fig. 2: Logistic Regression Beta Coefficients </figcaption>
</figure>


<figure>
<img src="..." style="width:100%">
<figcaption align = "center"> Fig. 3: Logistic Regression Ranked Coefficients </figcaption>
</figure>

The propensity score we obtain is [...]


## Summary of Findings


## Conclusion

[summary of everything]

[issues - curse of dimensionality]

[significance]

[possible future work]

## Resources
https://www.ncsl.org/elections-and-campaigns/voter-id

https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html

https://www.ibm.com/topics/logistic-regression