<a href="https://colab.research.google.com/github/AustinVes/Voter_Emotions/blob/master/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sourcing Data



The data I analyse comes from the American National Election Studies (ANES). Specifically, I use data from their 2016 ANES Time Series Study on “electoral participation, voting behavior, public opinion, media exposure, cognitive style, and values and predispositions” of Americans during the 2016 elections.

*“ANES is a collaboration of Stanford University and the University of Michigan, with funding by the National Science Foundation… The mission of the American National Election Studies (ANES) is to inform explanations of election outcomes by providing… high quality data from its own surveys on voting, public opinion, and political participation.”* -[ANES](https://electionstudies.org/)

The 2016 ANES Time Series Study dataset consists mainly of survey responses collected both before and after the elections.

*“Data collection for the ANES 2016 Time Series Study began in early September and continued into January, 2017. Pre-election interviews were conducted with study respondents during the two months prior to the 2016 elections and were followed by post-election reinterviewing beginning November 9, 2016… Face-to-face interviewing was complemented with data collection on the Internet.  Data collection was conducted in the two modes independently, using separate samples but substantially identical questionnaires. Web-administered cases constituted a representative sample separate from the face-to-face.”*  -[ANES](https://electionstudies.org/)

This dataset is currently available for free [here](https://electionstudies.org/data-center/2016-time-series-study/) from the ANES website, but you will need to have an account with ANES to download it. This repository also contains several copies of this dataset, downloaded 04/06/2020. The dataset comes with a [User Guide and Codebook](https://electionstudies.org/wp-content/uploads/2018/12/anes_timeseries_2016_userguidecodebook.pdf) and [Methodology Report](https://electionstudies.org/wp-content/uploads/2018/12/anes_timeseries_2016_userguidecodebook.pdf), along with many other supporting documents available [here](https://electionstudies.org/data-center/2016-time-series-study/).

Here are the Terms of Use from ANES, which I accept:
* Use these datasets solely for research or statistical purposes and not for investigation of specific survey respondents.
* Make no use of the identity of any survey respondent(s) discovered intentionally or inadvertently, and to advise ANES of any such discovery (anes@electionstudies.org)
* Cite ANES data and documentation in your work that makes use of the data and documentation. Authors of publications based on ANES data should send citations of their published works to ANES for inclusion in our bibliography of related publications.
* You acknowledge that the original collector of the data, ANES, and the relevant funding agency/agencies bear no responsibility for use of the data or for interpretations or inferences based upon such uses.

## Downloading

Because downloading the 2016 ANES Time Series Study dataset from the ANES website requires creating an account, I made a copy of the dataset publicly available in this GitHub repository so that it can be loaded into this Google Colab notebook.

ANES offers this dataset in several different file formats. After experimenting with all the formats and confirming they contain the same amount of data, I chose to work with the "raw" ASCII file because I best understood how to manipulate it. That said, all the analysis I present here (after sanitization) is the same no matter which format you draw from.

In [0]:
import os

if not os.path.exists('anes_timeseries_2016_rawdata.txt'):
    !wget https://raw.githubusercontent.com/AustinVes/Voter_Emotions/master/data/ASCII/anes_timeseries_2016_rawdata.txt

The raw data file contains plaintext delimiter-separated values (delimiter = "|") in fixed-width columns. I load it into a Pandas dataframe with the low_memory flag disabled to avoid winding up with mixed-type columns from parsing the file in chunks.

In [0]:
import pandas as pd

df = pd.read_csv('anes_timeseries_2016_rawdata.txt', delimiter='|', low_memory=False)
df.info()
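
To see how this parsing step behaves on something small, here is a minimal sketch using a made-up three-row sample in the same pipe-delimited layout. The values are invented and the column names are only illustrative; the point is that a single space-only cell is enough for a column to be read as text rather than numbers.

```python
import io
import pandas as pd

# Made-up sample in the same pipe-delimited layout; the values are invented
# and the column names are only illustrative.
sample = io.StringIO(
    "V160001|V161002|V162084\n"
    "1|1|2\n"
    "2|2| \n"
    "3|1|-9\n"
)

sample_df = pd.read_csv(sample, delimiter='|', low_memory=False)

# The column containing a space-only cell is read as text,
# while the purely numeric columns are read as integers.
print(sample_df.dtypes)
```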

## Understanding the Data Structure

Everything in this section is explained in greater detail in the [User Guide and Codebook](https://electionstudies.org/wp-content/uploads/2018/12/anes_timeseries_2016_userguidecodebook.pdf).

This dataset consists entirely of survey data. Each column corresponds to a variable describing some aspect of the interviews (responses to a survey question, info about testing conditions, sampling weights, etc.). Each row corresponds to an individual respondent and is persistent through the whole study, meaning that even if a respondent was interviewed multiple times, their data is all contained in their one row.

This dataset includes results from two distinct interview methodologies: face-to-face (FtF) and online survey. In both cases, a sample of people was interviewed pre-election, and a subset of those people was interviewed again post-election. Nobody was interviewed post-election who wasn't also interviewed pre-election. Respondents from each methodology make up their own stratified sample, but the dataset provides weights for combining the two groups if you want to analyze them together.

This study was carried out using stratified sampling. Along with each row, the dataset provides a number of different weights for bootstrapping representative samples of U.S. voters, depending on which of the following subsets of respondents you want to analyze:<br/>
* full sample using post‐election survey only or both pre and post<br/>
* full sample using pre‐election survey data only<br/>
* face‐to‐face mode alone, using the post‐election survey or both pre and post<br/>
* face‐to‐face mode alone, using pre‐election survey data only<br/>
* Internet mode alone, using data from both pre‐ and post‐election or post alone<br/>
* Internet mode alone, using data from only the pre‐election survey
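
As a rough illustration of what these weights are for, the sketch below compares an unweighted and a weighted estimate on toy data, and draws a weighted resample. The column names `answer` and `weight` are hypothetical stand-ins; in the real dataset, the weight column for a given subset has to be looked up in the codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'answer' is a 0/1 survey response and 'weight' is a
# stand-in for whichever ANES weight column matches the subset being analyzed.
toy = pd.DataFrame({
    'answer': [1, 0, 1, 1],
    'weight': [0.5, 2.0, 1.0, 0.5],
})

# Plain mean treats every respondent equally.
unweighted = toy['answer'].mean()

# Weighted mean lets each respondent count in proportion to their weight.
weighted = np.average(toy['answer'], weights=toy['weight'])

# Weighted resampling with replacement: rows with larger weights are drawn more often.
resampled = toy.sample(n=1000, replace=True, weights='weight', random_state=0)

print(unweighted)  # 0.75
print(weighted)    # 0.5
```

This is only a sketch of the general idea; it does not account for the strata and clustering information that a full survey analysis would use.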

The variables all follow the same naming convention, that is:<br/>
"V" + study year (YY) + 1-digit section ID + 3-digit unique code (+ optional letter)<br/>
Examples: V166002, V162371b, V164012<br/>
The section ID code refers to which part of the study that variable is from (e.g. pre‐election interview, post‐election administrative variables, etc.). Beyond that, this naming scheme makes it impossible to intuit and difficult to remember what each variable means. If you want to work with this dataset, be ready to constantly cross-reference the codebook.
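
The naming convention above can be expressed as a regular expression. This is a minimal sketch: the helper name `parse_var_name` is my own, and I assume the optional trailing letter is lowercase, as in the examples above.

```python
import re

# Pattern for the naming convention described above:
# "V" + 2-digit year + 1-digit section ID + 3-digit code + optional letter.
# Assumption: the optional suffix is a single lowercase letter.
VAR_NAME = re.compile(
    r'^V(?P<year>\d{2})(?P<section>\d)(?P<code>\d{3})(?P<suffix>[a-z]?)$'
)

def parse_var_name(name):
    """Split a variable name into its parts, or return None if it doesn't fit."""
    match = VAR_NAME.match(name)
    return match.groupdict() if match else None

print(parse_var_name('V162371b'))
# {'year': '16', 'section': '2', 'code': '371', 'suffix': 'b'}
```

Grouping columns by the `section` part can make it a little easier to see which part of the study a variable comes from, though the codebook is still needed for its meaning.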

The values of categorical variables are encoded numerically as described in the codebook. Missing data is also encoded numerically, as follows:<br/>
‐1 = Inapplicable<br/>
‐2 = Text responses available in separate file or coded version will be included in future release<br/>
‐3 = Restricted<br/>
‐4 = Error<br/>
‐5 = Breakoff, sufficient partial IW<br/>
‐6 = No post‐election interview<br/>
‐7 = No post data, deleted due to incomplete IW<br/>
‐8 = Don’t know<br/>
‐9 = Refused<br/>
...At least in theory. I found a few odd exceptions to these rules in the dataset, which leads nicely into the next step of the process:

## Sanitizing

In exploring the dataset, the only obvious errors I found were related to these missing data codes. For instance, though almost all columns use these codes where there would otherwise be nothing, there is exactly one column in the whole dataset that uses empty cells instead: V162084. It being the only example and there being no justification given, I assume this was a mistake. I replace all the empty cells with '-4' as a string because this column has been incorrectly interpreted as an object dtype due to the "empty" cells actually containing spaces.

In [0]:
V162084_empty = df['V162084'].str.isspace()
V162084_empty.sum()

In [0]:
df.loc[V162084_empty, 'V162084'] = '-4'
df['V162084'].str.isspace().sum()

Next, almost all columns just code missing data as numbers (e.g. -1), but some columns also include the description of each missing data code (e.g. "-1. Inapplicable") which means those columns get interpreted incorrectly as objects (strings). I identify these cells using a regex that matches the sequence of a minus sign, a single digit, a period, then a space, and I remove all the text including and after the period.

In [0]:
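import re

def remove_missing_data_desc(cell):
  # Leave non-string cells (e.g. NaN) untouched
  if not isinstance(cell, str):
    return cell
  # Match a minus sign, one digit, a literal period, then a space (e.g. "-1. Inapplicable")
  if re.search(r'-\d\. ', cell):
    # Keep only the numeric code before the period
    return cell[:cell.find('.')]
  else:
    return cell

object_cols = df.select_dtypes(include='object')
# Note: in pandas >= 2.1, DataFrame.map is the preferred name for applymap
df.update(object_cols.applymap(remove_missing_data_desc))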
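
For comparison, the same clean-up can be sketched on a single column with pandas' vectorized string methods. The toy values are made up, and this variant is slightly stricter than the function above: it only rewrites cells that begin with a missing data code.

```python
import pandas as pd

# Made-up example cells: two with descriptions attached, one plain value.
toy = pd.Series(['-1. Inapplicable', '3', '-9. Refused'])

# Keep the captured code (minus sign + digit) and drop the period,
# the space, and everything after them.
cleaned = toy.str.replace(r'^(-\d)\. .*$', r'\1', regex=True)

print(cleaned.tolist())  # ['-1', '3', '-9']
```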

Now that I've removed all erroneous text from the missing data codes, I ask Pandas to re-interpret each object column's datatype to make every column numerical that should be.

In [0]:
df = df.apply(pd.to_numeric, errors='ignore')
# I would use df.infer_objects() but it doesn't seem to work as expected
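
One caveat: `errors='ignore'` is deprecated in recent pandas releases (2.2 and later), where the documentation suggests catching exceptions explicitly instead. A minimal sketch of that alternative follows; the helper name `to_numeric_if_possible` and the toy data are my own.

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return the column converted to numbers, or unchanged if conversion fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

# Toy data: one column that is entirely numeric text, one that is not.
toy = pd.DataFrame({
    'all_numbers': ['1', '-4', '2'],
    'some_text': ['1', 'abc', '2'],
})

converted = toy.apply(to_numeric_if_possible)
print(converted.dtypes)
```

As with `errors='ignore'`, a column that contains even one non-numeric cell is left entirely as text.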

In addition to the missing data codes listed at the beginning of the codebook, I also found these codes in use in a small handful of columns:<br/>
99   = Not answered; The answer recorded by the interviewer is uninterpretable<br/>
999  = Don’t recognize (don’t know who this is)<br/>
998  = Don’t know (where to rate)<br/>
9999 = Refused<br/>
9998 = Don't know<br/>
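
If an analysis needs these codes treated as missing values rather than as numbers, one possible approach is to mask them with NaN. The sketch below uses toy data with hypothetical column names (`vote`, `rating`); which real columns use the extra codes has to be checked against the codebook, since a value like 99 can be a legitimate answer elsewhere.

```python
import pandas as pd

# Standard missing data codes used across the dataset.
STANDARD_MISSING = [-1, -2, -3, -4, -5, -6, -7, -8, -9]

# Hypothetical mapping of column name -> extra codes used only in that column.
EXTRA_MISSING = {'rating': [998, 999]}

toy = pd.DataFrame({
    'vote': [1, -9, 2, -1],
    'rating': [85, 999, -8, 40],
})

# Replace the standard codes with NaN everywhere.
cleaned = toy.mask(toy.isin(STANDARD_MISSING))

# Replace the extra codes only in the columns known to use them.
for column, codes in EXTRA_MISSING.items():
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))

print(cleaned)
```

Note that this collapses all the different reasons for missingness into one value, so it is best done on a copy when the distinction between, say, "Refused" and "Don't know" matters.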

## Validating