# Using ANESPy for Data Acquisition and Transformations

To use `anespy` in a project:

In [4]:
import anespy
import anespy.anespy as anes

print(anespy.__version__)

0.1.5.0


## A Brief Note On the 'Type' for This Project

My project kind of sits between options A & B, as it basically serves as a pseudo-API for a website that has no true API (the ANES data archive) but is also about data transformations. I think it still demonstrates knowledge of the topics covered in the course, so hopefully it fulfills the criteria of a successful 'package' within the scope of this final project. 

## Loading ANES Data

One of the primary challenges of working with ANES data is that it's all over the place, and there is no *true* API for repeatedly accessing the data. Getting data files requires clicking a button for the format you'd like, which means that there are no user-facing static links for getting data. However, there is a somewhat hidden internal API that the site makes requests to when you select the file you wish to download. This package leverages this request system to acquire the datasets. 
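
The idea of replaying the request that a download button makes can be sketched roughly as follows. This is only an illustration, not the package's implementation: the endpoint `https://example.invalid/download`, the `year` and `format` query parameters, and the helper names `build_request_url` and `fetch_dataset` are all placeholders, since the real request format is not described in this guide.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint: the real internal URL is not documented in this guide.
BASE_URL = "https://example.invalid/download"


def build_request_url(year, file_format="csv"):
    """Build the URL a download button might request (hypothetical parameters)."""
    query = urllib.parse.urlencode({"year": year, "format": file_format})
    return f"{BASE_URL}?{query}"


def fetch_dataset(year, file_format="csv", opener=urllib.request.urlopen):
    """Return the raw bytes of the response for the requested edition.

    `opener` is injectable so the function can be exercised without a network.
    """
    with opener(build_request_url(year, file_format)) as response:
        return response.read()
```

Passing a stand-in for `opener` (for example, a function returning an `io.BytesIO`) makes it possible to test the request-building logic offline.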

The function ```load_ANES_data(year, add_names = False)``` takes two arguments:
1. ```year```: year of the data you would like to access
2. ```add_names```: if you want to swap the variable names for their more complete, context-inclusive names (defaults to ```False```)

For example, say you wanted to pull the 2016 version of the main ANES Time Series:

In [5]:
data = anes.load_ANES_data(2016)
data.head()

Unnamed: 0,version,V160001,V160001_orig,V160101,V160101f,V160101w,V160102,V160102f,V160102w,V160201,...,V168519,V168520,V168521,V168522,V168523,V168524,V168525,V168526,V168527,V168528
0,-1,1,300001,0.827,0.8877,0.0,0.842,0.9271,0.0,121,...,-1,1,-1,-1,-1,81,-1,-1,-1,-1
1,-1,2,300002,1.0806,1.1605,0.0,1.0133,1.0841,0.0,123,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
2,-1,3,300003,0.3878,0.4161,0.0,0.3672,0.3985,0.0,121,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
3,-1,4,300004,0.3596,0.3852,0.0,0.3663,0.4183,0.0,118,...,-1,2,-1,-1,-1,82,-1,-1,-1,-1
4,-1,5,300006,0.647,0.6931,0.0,0.6463,0.7262,0.0,113,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1


This package (at present) supports only the main Time Series editions of the ANES going back to 2000. To check which editions are available, you can use the ```editions()``` function.

In [6]:
anes.editions()

cumulative
2020
2016
2012
2008
2004
2000


But there is something essential to note here about the data we just loaded. Though it is a DataFrame, it also *isn't* a DataFrame. 

In [7]:
type(data)

anespy.anespy.ANES

As part of the acquisition, the data are instantiated as an ```ANES``` object. 

## The ```ANES``` Class

Part of this package is the ```ANES``` class, which is a subclass of ```pandas.DataFrame```. Most ```pandas``` functionality applies when working with ANES data, but ANES Time Series instances share consistent properties that make additional methods worthwhile. 

When you load ANES data, it is instantiated as an ```ANES``` object with a ```year``` property. For example, the data we just loaded:


In [8]:
data.year

2016

This property is very useful for class methods, or if you're transforming and combining datasets from multiple years.  
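
For readers curious how a DataFrame subclass can carry an extra attribute such as a year, here is a minimal sketch using the subclassing hooks that pandas documents (`_metadata` and `_constructor`). The class name `YearFrame` is hypothetical, and this is not necessarily how ```ANES``` itself is implemented.

```python
import pandas as pd


class YearFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that remembers a survey year."""

    # Attributes named in _metadata are propagated by many pandas operations.
    _metadata = ["year"]

    def __init__(self, *args, year=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.year = year

    @property
    def _constructor(self):
        # Make pandas operations return YearFrame instead of a plain DataFrame.
        return YearFrame


frame = YearFrame({"V160001": [1, 2, 3]}, year=2016)
subset = frame.head(2)
print(type(subset).__name__, subset.year)
```

Without `_constructor`, operations such as `head()` would fall back to a plain `DataFrame` and the extra attribute would be lost.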

## Adding Years to the Data

An advantage of the ```ANES``` class is that certain transformations and functions can access the year of the Time Series automatically. For example, a common problem with ANES data is that they do not include a **Year** column by default; the year appears only inside the long **version** name. One of the built-ins with this package is the method ```add_year(self)```, which inserts a **Year** column as the first column of the ```ANES``` object. 

In [9]:
data.add_year()
data.head()

Unnamed: 0,Year,version,V160001,V160001_orig,V160101,V160101f,V160101w,V160102,V160102f,V160102w,...,V168519,V168520,V168521,V168522,V168523,V168524,V168525,V168526,V168527,V168528
0,2016,-1,1,300001,0.827,0.8877,0.0,0.842,0.9271,0.0,...,-1,1,-1,-1,-1,81,-1,-1,-1,-1
1,2016,-1,2,300002,1.0806,1.1605,0.0,1.0133,1.0841,0.0,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
2,2016,-1,3,300003,0.3878,0.4161,0.0,0.3672,0.3985,0.0,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
3,2016,-1,4,300004,0.3596,0.3852,0.0,0.3663,0.4183,0.0,...,-1,2,-1,-1,-1,82,-1,-1,-1,-1
4,2016,-1,5,300006,0.647,0.6931,0.0,0.6463,0.7262,0.0,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1


This can be especially useful for data intended to be exported, joining variables across time series, or merging ANES samples with other datasets from the same year. 
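
On a plain DataFrame, the same effect can be approximated with `DataFrame.insert`. The helper name `add_year_column` is hypothetical, and unlike ```add_year``` above (which updates the object in place), this sketch returns a modified copy.

```python
import pandas as pd


def add_year_column(df, year):
    """Return a copy of df with a constant 'Year' column in the first position."""
    out = df.copy()
    out.insert(0, "Year", year)  # position 0 places the column first
    return out


raw = pd.DataFrame({"V160001": [1, 2, 3], "V160101": [0.827, 1.0806, 0.3878]})
with_year = add_year_column(raw, 2016)
print(with_year.columns.tolist())
```

Having the year as an ordinary column is what makes it possible to stack several editions with `pd.concat` and still tell the rows apart.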

## Converting Variable Names

Something you might have noticed about the example data is that the variable names are non-identifying. Typically, work with ANES data requires referencing a codebook to understand what the variables you're working with are. This is only the beginning of the issues with ANES variable names, but included in this package is ```convert_var_names(self, drop_extra = True)```, which recodes the variable names as their full title from the codebook. 


In [10]:
data.convert_var_names()
data.head()

Converted to named variables.


Unnamed: 0,Year,2016 Case ID,Pre-election weight -full sample,Pre-election weight -FTF sample,Pre-election weight -Web sample,Post-election weight -full sample,Post-election weight -FTF sample,Post-election weight -Web sample,Stratum -full sample,Stratum -FTF sample,...,POST ADMIN: Post Web device logins: desktop/laptop,POST ADMIN: Beginning date of Post IW (YYYYMMDD),POST ADMIN: Date IWR first opened IW in CAPI (YYYYMMDD),POST ADMIN: Beginning time of Post IW (HHMMSS),POST ADMIN: Time IWR first opened IW in CAPI (HHMMSS),POST ADMIN: Ending date of Post IW (YYYYMMDD),POST ADMIN: Ending time of Post IW (HHMMSS),POST ADMIN: Elapsed time interview start to end (minutes),RANDOM: Assignment to forward or reverse response options for selected questions,RANDOM: Assignment to ballot color/ order of major party names in vote sections
0,2016,1,0.827,0.8877,0.0,0.842,0.9271,0.0,121,21,...,-1,-1,-1,-1,-1,-1,-1,60.1,2,2
1,2016,2,1.0806,1.1605,0.0,1.0133,1.0841,0.0,123,23,...,-1,-1,-1,-1,-1,-1,-1,57.3,2,1
2,2016,3,0.3878,0.4161,0.0,0.3672,0.3985,0.0,121,21,...,-1,-1,-1,-1,-1,-1,-1,60.216667,1,1
3,2016,4,0.3596,0.3852,0.0,0.3663,0.4183,0.0,118,18,...,-1,-1,-1,-1,-1,-1,-1,66.066667,1,2
4,2016,5,0.647,0.6931,0.0,0.6463,0.7262,0.0,113,13,...,-1,-1,-1,-1,-1,-1,-1,77.65,1,2


But if you change your mind at any point, this transformation can be undone:

In [11]:
data.convert_var_names()
data.head()

Converted to numbered variables.


Unnamed: 0,Year,V160001,V160101,V160101f,V160101w,V160102,V160102f,V160102w,V160201,V160201f,...,V165001d,V165002,V165002a,V165003,V165003a,V165004,V165005,V165006,V166001,V166002
0,2016,1,0.827,0.8877,0.0,0.842,0.9271,0.0,121,21,...,-1,-1,-1,-1,-1,-1,-1,60.1,2,2
1,2016,2,1.0806,1.1605,0.0,1.0133,1.0841,0.0,123,23,...,-1,-1,-1,-1,-1,-1,-1,57.3,2,1
2,2016,3,0.3878,0.4161,0.0,0.3672,0.3985,0.0,121,21,...,-1,-1,-1,-1,-1,-1,-1,60.216667,1,1
3,2016,4,0.3596,0.3852,0.0,0.3663,0.4183,0.0,118,18,...,-1,-1,-1,-1,-1,-1,-1,66.066667,1,2
4,2016,5,0.647,0.6931,0.0,0.6463,0.7262,0.0,113,13,...,-1,-1,-1,-1,-1,-1,-1,77.65,1,2


Something to note is that you may lose some variables because of a mismatch between the codebook's listed variables and what is actually provided in the data. If you would like to retain the extra data and manually search these mystery variables, you have the option to set ```drop_extra``` to ```False``` during the initial conversion.

In [12]:
data = anes.load_ANES_data(2016)
data.convert_var_names(drop_extra = False)
data.head()

Converted to named variables.


Unnamed: 0,version,2016 Case ID,V160001_orig,Pre-election weight -full sample,Pre-election weight -FTF sample,Pre-election weight -Web sample,Post-election weight -full sample,Post-election weight -FTF sample,Post-election weight -Web sample,Stratum -full sample,...,V168519,V168520,V168521,V168522,V168523,V168524,V168525,V168526,V168527,V168528
0,-1,1,300001,0.827,0.8877,0.0,0.842,0.9271,0.0,121,...,-1,1,-1,-1,-1,81,-1,-1,-1,-1
1,-1,2,300002,1.0806,1.1605,0.0,1.0133,1.0841,0.0,123,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
2,-1,3,300003,0.3878,0.4161,0.0,0.3672,0.3985,0.0,121,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
3,-1,4,300004,0.3596,0.3852,0.0,0.3663,0.4183,0.0,118,...,-1,2,-1,-1,-1,82,-1,-1,-1,-1
4,-1,5,300006,0.647,0.6931,0.0,0.6463,0.7262,0.0,113,...,-1,1,-1,-1,-1,82,-1,-1,-1,-1
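
The renaming logic and the effect of ```drop_extra``` can be illustrated on a plain DataFrame with a small code-to-label dictionary. This is a sketch under assumptions: the helper names `to_named` and `to_coded` are hypothetical, and the two-entry `codebook` dictionary only reuses labels visible in the outputs above rather than the real codebook.

```python
import pandas as pd


def to_named(df, codebook, drop_extra=True):
    """Rename coded columns to their codebook labels.

    With drop_extra=True, columns missing from the codebook are dropped;
    otherwise they are kept under their original names.
    """
    if drop_extra:
        df = df[[col for col in df.columns if col in codebook]]
    return df.rename(columns=codebook)


def to_coded(df, codebook):
    """Undo to_named by mapping labels back to their variable codes."""
    reverse = {label: code for code, label in codebook.items()}
    return df.rename(columns=reverse)


codebook = {
    "V160001": "2016 Case ID",
    "V160101": "Pre-election weight -full sample",
}
raw = pd.DataFrame(
    {"V160001": [1, 2], "V160001_orig": [300001, 300002], "V160101": [0.827, 1.0806]}
)

named = to_named(raw, codebook)                        # V160001_orig is dropped
named_all = to_named(raw, codebook, drop_extra=False)  # V160001_orig is kept
restored = to_coded(named, codebook)
```

The reverse mapping only works when every label is unique, which is one reason duplicated variable labels need special handling.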


## Recoding To Categories

Another disadvantage of ANES data 'as-is' is that most of the variables are factors but are inconsistently coded. This issue is partially resolved by the ```load_ANES_data``` function, yet there remains the issue of ambiguous values for categoricals. Packaged with ```ANESPy``` is ```recode_to_char```, which replaces the values for a selected column with their full character labels from the codebook.

For example, the 2008 edition includes survey items that are not very informative when left as numeric codes. 

In [13]:
data = anes.load_ANES_data(2008)
data['V083099a']

0       3
1       2
2      -1
3       2
4      -1
       ..
2317    2
2318   -1
2319    2
2320   -1
2321   -1
Name: V083099a, Length: 2322, dtype: int64

After recoding:

In [14]:
data.recode_to_char('V083099a')
data['V083099a']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[col_in].iloc[idx] = recode_dict[tmp_tuple][0]


0                         Not too well
1                           Quite well
2       INAP, R selected for VERSION K
3                           Quite well
4       INAP, R selected for VERSION K
                     ...              
2317                        Quite well
2318    INAP, R selected for VERSION K
2319                        Quite well
2320    INAP, R selected for VERSION K
2321    INAP, R selected for VERSION K
Name: V083099a, Length: 2322, dtype: object

Now the values in this column can be read without consulting the codebook.
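
As an illustration of the recoding idea, a whole column can be mapped at once with `Series.map`. This is a sketch, not the package's implementation: the helper name `recode_values` is hypothetical, and the `labels` dictionary simply reuses the three code/label pairs visible in the output above. Assigning the full column in one step also avoids the chained-indexing pattern that triggers the pandas warning shown earlier.

```python
import pandas as pd


def recode_values(df, column, labels):
    """Return a copy of df with numeric codes in `column` replaced by labels.

    Codes that have no entry in `labels` are left unchanged.
    """
    out = df.copy()
    out[column] = out[column].map(lambda value: labels.get(value, value))
    return out


labels = {
    2: "Quite well",
    3: "Not too well",
    -1: "INAP, R selected for VERSION K",
}
responses = pd.DataFrame({"V083099a": [3, 2, -1, 2, -1]})
recoded = recode_values(responses, "V083099a", labels)
print(recoded["V083099a"].tolist())
```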

#### A Note About Variable Names

At present, this function is designed to work only with the variable names in their original "V_____" format. Because of the duplicated pre/post variable issue, some variables will return a `KeyError` after being converted to their full-context name.

## Split Pre & Post Variables

Another somewhat unbelievable issue with ANES datasets is that some years have duplicated variable codes. The first appearance represents the *pre-election* sample, while the second represents the *post-election* sample. As part of the ```convert_var_names``` functionality, specific variables are given "Pre" and "Post" tags. These can then be leveraged to split the variables into Pre and Post groups, which can be very useful for later analysis.

In [15]:
data = anes.load_ANES_data(2012)
data.convert_var_names()
data_pre, data_post = data.split_pre_post()

Converted to named variables.


In [16]:
data_pre

Unnamed: 0,PRE: FTF ONLY: INTERVIEWER: Is R male or female (Observation),PRE: How often does R pay attn to politics and elections,PRE: Interested in following campaigns standard,PRE: Does R know where to go to vote in neighborhood,PRE: Did R vote for President in 2008,PRE: Recall of last (2008) Presidential vote choice,PRE: FTF ONLY: HH Internet use,PRE: Days in typical week review news on internet,PRE: Attention to internet news,PRE: Days in typical week watch news on TV,...,PRE: CASI/WEB Polit knowledge: number times can be elected (incl DKRF),PRE: CASI/WEB Political knowledge: size of federal deficit (incl DKRF),PRE: Years Senator Elected (incl DKRF),PRE: CASI/WEB Political knowledge: What is Medicare (incl DKRF),PRE: CASI/WEB Political knowledge: program Fed govt spends (incl DKRF),PRE: How satisfied is R with life,POST: FTF ONLY: Was R registered Pre-election,POST: FTF ONLY: Did registered R report voting Pre-election,POST: FTF ONLY: Sen elctn reg state (R not reg at resid Pre or Post),POST: FTF ONLY: Gov elctn reg state (R not reg at resid Pre or Post)
0,2,4,1,1,1,1,2,-1,-1,7,...,2,2,4,1,3,1,-1,-1,-1,-1
1,1,1,1,1,1,1,2,-1,-1,7,...,6,1,3,1,1,1,1,0,-1,-1
2,2,2,1,1,1,1,1,2,4,7,...,2,1,8,1,3,2,1,0,-1,-1
3,2,4,2,2,2,-1,-1,-1,-1,0,...,2,2,1,3,3,5,1,0,-1,-1
4,1,2,1,1,1,1,1,5,3,4,...,2,1,2,1,2,3,1,0,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5909,-1,3,3,1,2,-1,-1,0,-1,2,...,2,1,6,1,2,3,-1,-1,-1,-1
5910,-1,2,2,1,1,1,-1,1,5,5,...,2,1,6,1,4,3,-1,-1,-1,-1
5911,-1,5,3,1,1,1,-1,0,-1,5,...,4,1,3,1,1,1,-1,-1,-1,-1
5912,-1,3,2,1,1,1,-1,2,4,3,...,2,2,8,2,2,4,-1,-1,-1,-1
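
The tag-based split can be sketched on a plain DataFrame by selecting columns by their name prefix. The helper name `split_by_prefix` is hypothetical, and this strict prefix match is an assumption; the package's own ```split_pre_post``` may group columns differently.

```python
import pandas as pd


def split_by_prefix(df, pre="PRE:", post="POST:"):
    """Split df into two frames based on the prefix of each column name."""
    pre_cols = [col for col in df.columns if str(col).startswith(pre)]
    post_cols = [col for col in df.columns if str(col).startswith(post)]
    return df[pre_cols], df[post_cols]


named = pd.DataFrame(
    {
        "PRE: How satisfied is R with life": [1, 1, 2],
        "POST: FTF ONLY: Was R registered Pre-election": [-1, 1, 1],
    }
)
pre_part, post_part = split_by_prefix(named)
```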


## Generate a Sample

Lastly, this package allows you to draw a sample from the object along a set of specific variables. This can be useful for designs involving re-sampling, exploratory statistical testing, or other functions where the entire set of respondents is not needed. 

The ```generate_sample``` method takes two arguments: `variables` (a list of variable names) and `n_respondents`, which is the size of the sample you want to extract. 

In [17]:
data = anes.load_ANES_data(2004)
sample = data.generate_sample(list(data.columns.values[0:7]), n_respondents = 10)
sample

Unnamed: 0,Version,Dsetid,V040001,V040002,V040101,V040102,V040103
1185,-1,-1,1187,293,0.857,0.8195,172
254,-1,-1,255,126,0.4521,0.4991,241
1138,-1,-1,1140,363,0.7067,0.726,222
918,-1,-1,919,1047,0.4565,0.4625,102
280,-1,-1,281,2090,0.5203,0.0,181
732,-1,-1,733,73,1.4637,1.4083,161
33,-1,-1,34,712,0.5514,0.5452,172
278,-1,-1,279,888,1.1482,1.1522,222
166,-1,-1,167,182,0.5621,0.564,31
34,-1,-1,35,214,1.4637,1.4083,162
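
A comparable result on a plain DataFrame can be obtained by selecting the columns of interest and calling `DataFrame.sample`. The helper name `draw_sample` and its `seed` parameter are hypothetical additions for reproducibility, not part of ```generate_sample```.

```python
import pandas as pd


def draw_sample(df, variables, n_respondents, seed=None):
    """Randomly draw n_respondents rows, keeping only the given columns."""
    return df[variables].sample(n=n_respondents, random_state=seed)


respondents = pd.DataFrame(
    {
        "V040001": range(1, 21),
        "V040002": range(101, 121),
        "V040101": [0.5] * 20,
    }
)
subset = draw_sample(respondents, ["V040001", "V040002"], n_respondents=10, seed=0)
```

Sampling is done without replacement by default, so each respondent appears at most once and the original row index is preserved.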
