Data QA Case: ``AK_precincts``
==============================

Below are the steps involved in performing automated data quality checks on the ``AK_precincts`` shapefile from ``mggg-states``.

This notebook does the following:
1. Collects the following Alaska election data from ``mggg-states``, MEDSL, and Wikipedia:

    - 2016 United States presidential election
    - 2016 United States Senate elections
    - 2016 United States House of Representatives elections
    - 2018 United States Senate elections
    - 2018 United States House of Representatives elections

2. Wrangles the datasets so that they can be compared against each other.
3. Checks if column names in ``AK_precincts`` diverge from the MGGG naming convention (as outlined in ``naming_convention.json``).
4. Compares the vote counts in ``AK_precincts`` with those in the MEDSL and Wikipedia datasets.
5. Prints the aggregated votes in ``AK_precincts`` for ease of spot checking against Secretary of State websites.

*Note:* the automated checks are not exhaustive; further manual checks are required.
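
As a rough illustration of step 3, the sketch below flags column names that do not match a naming pattern. The `CONVENTION` dictionary is a hypothetical stand-in, not the contents of ``naming_convention.json``; it assumes election columns follow an office code, a two-digit year, and a party letter, as `PRES16D` and `USH14R` do. The helper `find_nonconforming` and the sample column list are likewise invented for the example.

```python
import re

# Hypothetical stand-in for naming_convention.json (the real file may differ).
CONVENTION = {
    "offices": ["PRES", "SEN", "USH", "GOV"],
    "parties": ["D", "R", "L", "G", "I"],
}

def find_nonconforming(columns, convention, known=()):
    """Return columns that are neither explicitly known nor election-style."""
    offices = "|".join(map(re.escape, convention["offices"]))
    parties = "|".join(map(re.escape, convention["parties"]))
    # Office code, two-digit year, party letter, e.g. PRES16D.
    pattern = re.compile(rf"^(?:{offices})\d{{2}}(?:{parties})$")
    return [c for c in columns if c not in known and not pattern.match(c)]

# Sample column names, invented for the example.
columns = ["ID", "NAME", "PRES16D", "PRES16R", "USH14D", "SEN_16_D", "ush18r"]
flagged = find_nonconforming(columns, CONVENTION, known={"ID", "NAME"})
print(flagged)
```

Flagged names still need the manual review described below, since a mismatch may be a legitimate non-election column rather than an error.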

Automation Check Timestamp: 02:00 pm ET, 14 August 2020

---

After running the automated scripts, we recommend doing the following:

__Data Standardization__

- Manually evaluate column naming discrepancies to determine if changes are needed.
- Manually evaluate column datatypes to determine if changes are needed.

__Data Comparison__

- Manually investigate large differences found through comparing ``AK_precincts`` data with external sources (e.g. Are absentee ballots counted? Are the precinct counts accurate?).
- For overcounts, check how the votes were tallied. For example, a `USH__D` count may include votes for all Democratic candidates, whereas external sources may count only the main Democratic candidate.
- For more accurate comparisons, compare ``AK_precincts`` data with the results published on the state's Secretary of State website.
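
The overcount question above can be made concrete with a small sketch. The candidate names and vote counts are invented for the example and are not Alaska returns; the two helper functions are hypothetical and only show how an all-candidates party total can exceed a main-candidate total.

```python
# Invented precinct returns: (party, candidate, votes). Not real data.
returns = [
    ("D", "Candidate A", 120),
    ("D", "Candidate B", 15),
    ("R", "Candidate C", 140),
]

def party_total(rows, party):
    """Sum votes for every candidate of the given party."""
    return sum(votes for p, _, votes in rows if p == party)

def top_candidate_total(rows, party):
    """Votes for the party's best-performing candidate only."""
    return max((votes for p, _, votes in rows if p == party), default=0)

all_candidates = party_total(returns, "D")         # how a USH__D column might be tallied
main_candidate = top_candidate_total(returns, "D")  # how an external source might tally
difference = all_candidates - main_candidate
print(all_candidates, main_candidate, difference)
```

A nonzero difference of this kind reflects a tallying convention rather than a data error, which is why it calls for manual investigation.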

__Topological Soundness__

- Manually examine shapefiles for gaps and overlaps.
- *Note:* although gaps and overlaps are not necessarily indicators of inaccurate data (because some counties have precinct islands), they *do* mean that the data cannot be used for chain runs.
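
A first pass at the gap and overlap check can be sketched with `shapely`, assuming it is installed (it is a dependency of `geopandas`). The four rectangles are toy shapes, not Alaska precincts, and the helpers `find_overlaps` and `find_gaps` are hypothetical. This version only detects gaps that are fully enclosed by other shapes; slivers open to the outside would need a different test.

```python
from itertools import combinations

from shapely.geometry import Polygon, box
from shapely.ops import unary_union

# Toy shapes: "a" and "b" overlap, and the four together enclose an empty square.
precincts = {
    "a": box(0, 0, 3, 1),
    "b": box(2, 0, 3, 3),
    "c": box(0, 2, 2, 3),
    "d": box(0, 1, 1, 2),
}

def find_overlaps(shapes, tol=1e-9):
    """Return pairs of names whose shapes share more than a boundary."""
    return [
        (name_a, name_b)
        for (name_a, geom_a), (name_b, geom_b) in combinations(shapes.items(), 2)
        if geom_a.intersection(geom_b).area > tol
    ]

def find_gaps(shapes):
    """Return enclosed holes in the union of all shapes as polygons."""
    merged = unary_union(list(shapes.values()))
    parts = list(merged.geoms) if merged.geom_type == "MultiPolygon" else [merged]
    return [Polygon(list(ring.coords)) for part in parts for ring in part.interiors]

overlaps = find_overlaps(precincts)
gaps = find_gaps(precincts)
print(overlaps, [g.area for g in gaps])
```

Shapes that merely touch along an edge have zero intersection area, so they are not reported as overlaps.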

__Data Documentation__

- Do the READMEs provide data sources?
- Do the READMEs describe what aggregation/disaggregation processes were used?
- Do the READMEs discuss discrepancies/caveats in the data?
- Do the READMEs provide the scripts used and/or describe the data wrangling and processing steps?

---

Step 0. Setup
----------------

In [None]:
# Install useful Python packages

!pip3 install numpy
!pip3 install pandas
!pip3 install geopandas
!pip3 install wikipedia

!pip3 install git+https://github.com/KeiferC/gdutils.git

In [12]:
# Import useful Python modules

import numpy as np
import pandas as pd
import geopandas as gpd

import json # for parsing a json file
import wikipedia # unofficial Wikipedia package (wrapper of MediaWiki API)
import os # for ensuring file traversal works regardless of operating system

import gdutils.datamine as dm # data-mining module from gdutils
import gdutils.dataqa as dq # data QA module from gdutils
import gdutils.extract as et # table extraction module from gdutils

from typing import Any, List, Tuple, Dict, Hashable, Union, NoReturn

Step 1. Data collection
---------------------------

__Step 1.1.__ Collect `AK_precincts` data from the `AK-shapefiles` GitHub repository of `mggg-states`.

In [3]:
# Clone 'AK-shapefiles' repository into 'output/mggg/'

# dm.clone_gh_repos(account='mggg-states', 
#                   account_type='orgs', 
#                   repos=['AK-shapefiles'],
#                   outpath=os.path.join('output', 'mggg'))

In [14]:
# Extract a GeoDataFrame from 'AK-shapefiles/AK_precincts.zip'

mggg_gdf = et.read_file(os.path.join('output', 'mggg', 'AK-shapefiles', 
                                     'AK_precincts.zip')).extract()

mggg_gdf.head() # renders first 5 rows of the extracted GeoDataFrame

Unnamed: 0,ID,AREA,DISTRICT,NAME,POPULATION,USH14D,USH14R,USH14L,PRES16D,PRES16R,...,VAP,WVAP,BVAP,AMINVAP,ASIANVAP,NHPIVAP,OTHERVAP,2MOREVAP,2MORE,geometry
0,266.0,1.553231,01-446,01-446 AURORA,2995.0,336,457,91,295,434,...,2315,1740,92,237,78,2,48,118,229,"POLYGON ((294705.801 1667364.692, 294704.326 1..."
1,329.0,0.578508,01-455,01-455 FAIRBANKS NO. 1,659.0,72,106,16,65,113,...,545,416,16,62,12,0,10,29,36,"POLYGON ((297483.985 1669129.153, 297485.509 1..."
2,267.0,0.469371,01-465,01-465 FAIRBANKS NO. 2,1542.0,108,166,44,120,157,...,1312,853,85,252,37,1,20,64,100,"POLYGON ((297800.944 1668172.899, 297823.138 1..."
3,268.0,0.401854,01-470,01-470 FAIRBANKS NO. 3,1872.0,216,234,54,205,218,...,1531,1047,97,232,36,3,30,86,138,"POLYGON ((296902.053 1668075.791, 296915.198 1..."
4,269.0,0.561294,01-475,01-475 FAIRBANKS NO. 4,1143.0,123,118,40,86,149,...,883,622,28,153,27,0,14,39,106,"POLYGON ((296178.482 1666807.889, 296101.344 1..."
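
The election columns in this preview are the ones that get aggregated for spot checks. As a minimal sketch, the snippet below sums a few of them over the five rows shown above, using plain Python so it runs without the shapefile. On the full GeoDataFrame, the equivalent would be something like `mggg_gdf[vote_columns].sum()`; the name `vote_columns` is introduced here for illustration.

```python
# First five rows of AK_precincts, copied from the head() output above.
head_rows = [
    {"DISTRICT": "01-446", "USH14D": 336, "USH14R": 457, "USH14L": 91, "PRES16D": 295, "PRES16R": 434},
    {"DISTRICT": "01-455", "USH14D": 72, "USH14R": 106, "USH14L": 16, "PRES16D": 65, "PRES16R": 113},
    {"DISTRICT": "01-465", "USH14D": 108, "USH14R": 166, "USH14L": 44, "PRES16D": 120, "PRES16R": 157},
    {"DISTRICT": "01-470", "USH14D": 216, "USH14R": 234, "USH14L": 54, "PRES16D": 205, "PRES16R": 218},
    {"DISTRICT": "01-475", "USH14D": 123, "USH14R": 118, "USH14L": 40, "PRES16D": 86, "PRES16R": 149},
]

vote_columns = ["USH14D", "USH14R", "USH14L", "PRES16D", "PRES16R"]

# Column-wise totals over the sample rows only (not statewide totals).
sample_totals = {col: sum(row[col] for row in head_rows) for col in vote_columns}

for col, total in sample_totals.items():
    print(f"{col:8} : {total}")
```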


__Step 1.2.__ Gather MEDSL data for comparison purposes.

In [4]:
# Print available MEDSL data to select applicable datasets

# print('{:27} : {}'.format('Repo Name', 'Repo URL'))
# print('------------------------------------------------------------------')

# for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
#     print("{:27} : {}".format(repo, url))

Repo Name                   : Repo URL
------------------------------------------------------------------
elections                   : https://github.com/MEDSL/elections.git
official-precinct-returns   : https://github.com/MEDSL/official-precinct-returns.git
primaries                   : https://github.com/MEDSL/primaries.git
data-management             : https://github.com/MEDSL/data-management.git
election-scrapers           : https://github.com/MEDSL/election-scrapers.git
medslcleaner                : https://github.com/MEDSL/medslcleaner.git
precinct-shapefiles         : https://github.com/MEDSL/precinct-shapefiles.git
documentation               : https://github.com/MEDSL/documentation.git
elections-performance-index : https://github.com/MEDSL/elections-performance-index.git
constituency-returns        : https://github.com/MEDSL/constituency-returns.git
state-returns               : https://github.com/MEDSL/state-returns.git
county-returns              : https://github.com/MEDSL/

In [5]:
# Clone applicable MEDSL datasets

medsl_repos = ['official-precinct-returns', # precinct-level 2016 election results
               '2018-elections-official']   # 2018 election results (includes precinct_2018.zip)

# this will take some time to complete
# dm.clone_gh_repos(account='MEDSL', 
#                   account_type='orgs', 
#                   repos=medsl_repos, 
#                   outpath=os.path.join('output', 'medsl'))

In [17]:
# Find Alaska-specific MEDSL data

# dm.list_files_of_type('.zip', os.path.join('output', 'medsl'))

['output/medsl/2018-elections-official/precinct_2018.zip',
 'output/medsl/official-precinct-returns/2016-precinct-local/2016-precinct-local.zip',
 'output/medsl/official-precinct-returns/source/2016-tn-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ny-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ut-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-wv-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ia-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-vt-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ma-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ct-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-wi-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-mt-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ms-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-va-precinct.zip',
 '
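
The implementation of `dm.list_files_of_type` is not shown here, so the following is only an assumption about what such a search does: a recursive walk that collects files by extension, plus a hypothetical filter on the state postal code embedded in names like `2016-tn-precinct.zip`. The demonstration runs on a temporary directory that mimics two names from the listing above. Whether an Alaska file follows the same pattern is not confirmed by the truncated output.

```python
import os
import tempfile

def find_files_with_extension(ext, root):
    """Recursively collect paths under root whose names end with ext."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(ext):
                found.append(os.path.join(dirpath, name))
    return sorted(found)

def filter_by_state(paths, postal):
    """Keep paths whose file name contains '-<postal>-', e.g. '-tn-'."""
    tag = f"-{postal.lower()}-"
    return [p for p in paths if tag in os.path.basename(p).lower()]

# Throwaway directory tree; the file names are copied from the listing above.
with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, "official-precinct-returns", "source")
    os.makedirs(source)
    for name in ("2016-tn-precinct.zip", "2016-ny-precinct.zip", "notes.txt"):
        open(os.path.join(source, name), "w").close()

    zips = find_files_with_extension(".zip", root)
    zip_names = [os.path.basename(p) for p in zips]
    tn_only = [os.path.basename(p) for p in filter_by_state(zips, "tn")]

print(zip_names, tn_only)
```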

In [25]:
# Extract DataFrames from the following files (paths relative to 'output/medsl'):
# 'official-precinct-returns/2016-precinct-president/2016-precinct-president.zip',
# 'official-precinct-returns/2016-precinct-senate/2016-precinct-senate.zip',
# 'official-precinct-returns/2016-precinct-house/2016-precinct-house.zip', and
# '2018-elections-official/precinct_2018.zip'

medsl_16_path = os.path.join('output', 'medsl', 'official-precinct-returns')
medsl_18_path = os.path.join('output', 'medsl', '2018-elections-official')

In [21]:

medsl_pres16_df = et.read_file(os.path.join(medsl_16_path, '2016-precinct-president',
                                            '2016-precinct-president.zip')).extract()

medsl_pres16_df.head()

Unnamed: 0,year,stage,special,state,state_postal,state_fips,state_icpsr,county_name,county_fips,county_ansi,...,candidate_full,candidate_suffix,candidate_nickname,candidate_fec,candidate_fec_name,candidate_google,candidate_govtrack,candidate_icpsr,candidate_maplight,geometry
0,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,P00003392,"CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE",,,,,
1,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,P60012234,"JOHNSON, JOHN FITZGERALD MR.",,,,,
2,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,P20003984,"STEIN, JILL",,,,,
3,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,P80001571,"TRUMP, DONALD J. / MICHAEL R. PENCE",,,,,
4,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,


In [22]:
medsl_sen16_df  = et.read_file(os.path.join(medsl_16_path, '2016-precinct-senate',
                                            '2016-precinct-senate.zip')).extract()

medsl_sen16_df.head()

Unnamed: 0,year,stage,special,state,state_postal,state_fips,state_icpsr,county_name,county_fips,county_ansi,...,candidate_full,candidate_suffix,candidate_nickname,candidate_fec,candidate_fec_name,candidate_google,candidate_govtrack,candidate_icpsr,candidate_maplight,geometry
0,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,
1,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,
2,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,
3,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,S6AL00302,"CRUMPTON, RONALD (RON) STEVEN",,,,,
4,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,Richard C. Shelby,,,S6AL00013,,kg:/m/020yj1,300089.0,14659.0,608.0,


In [24]:
medsl_ush16_df  = et.read_file(os.path.join(medsl_16_path, '2016-precinct-house',
                                            '2016-precinct-house.zip')).extract()

medsl_ush16_df.head()

Unnamed: 0,year,stage,special,state,state_postal,state_fips,state_icpsr,county_name,county_fips,county_ansi,...,candidate_full,candidate_suffix,candidate_nickname,candidate_fec,candidate_fec_name,candidate_google,candidate_govtrack,candidate_icpsr,candidate_maplight,geometry
0,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,
1,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,
2,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,,,,,,,
3,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,,,,H6AL02167,"MATHIS, NATHAN",,,,,
4,2016,gen,False,Alabama,AL,1,41,Autauga County,1001.0,161526.0,...,Martha Roby,,,H0AL02087,,kg:/m/0drx5mb,412394.0,21192.0,1408.0,


In [27]:
medsl_18_df = et.read_file(os.path.join(medsl_18_path, 'precinct_2018.zip')).extract()

medsl_18_df.head()

Unnamed: 0,precinct,office,party,mode,votes,jurisdiction,county,candidate,district,dataverse,year,stage,state,special,writein,state_po,state_fips,state_cen,state_ic,geometry
0,10 JONES COMMUNITY CTR,Straight Party,democratic,election day,98,Autauga,Autauga,Alabama Democratic Party,,all,2018,gen,Alabama,False,False,AL,1,63,41,
1,10 JONES COMMUNITY CTR,Straight Party,republican,election day,110,Autauga,Autauga,Alabama Republican Party,,all,2018,gen,Alabama,False,False,AL,1,63,41,
2,10 JONES COMMUNITY CTR,US House,democratic,election day,118,Autauga,Autauga,Tabitha Isner,2.0,house,2018,gen,Alabama,False,False,AL,1,63,41,
3,10 JONES COMMUNITY CTR,US House,republican,election day,153,Autauga,Autauga,Martha Roby,2.0,house,2018,gen,Alabama,False,False,AL,1,63,41,
4,10 JONES COMMUNITY CTR,US House,,election day,0,Autauga,Autauga,,2.0,house,2018,gen,Alabama,False,True,AL,1,63,41,
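
Because the MEDSL tables are national and list one row per precinct, candidate, and voting mode, they need to be filtered to Alaska and aggregated before they can be compared with ``AK_precincts``. The sketch below shows that idea in plain Python on invented rows that reuse column names visible in `medsl_18_df` (`state_po`, `office`, `party`, `mode`, `votes`). The helper `totals_for_state` is hypothetical; with pandas, a comparable result would come from filtering on `state_po` and calling `groupby(['office', 'party'])['votes'].sum()`.

```python
from collections import defaultdict

# Invented rows using column names from medsl_18_df; not real returns.
rows = [
    {"state_po": "AK", "office": "US House", "party": "democratic", "mode": "election day", "votes": 100},
    {"state_po": "AK", "office": "US House", "party": "democratic", "mode": "absentee", "votes": 20},
    {"state_po": "AK", "office": "US House", "party": "republican", "mode": "election day", "votes": 130},
    {"state_po": "AL", "office": "US House", "party": "democratic", "mode": "election day", "votes": 118},
]

def totals_for_state(records, postal):
    """Sum votes per (office, party) for one state, across all voting modes."""
    totals = defaultdict(int)
    for record in records:
        if record["state_po"] == postal:
            totals[(record["office"], record["party"])] += int(record["votes"])
    return dict(totals)

ak_totals = totals_for_state(rows, "AK")
print(ak_totals)
```

Summing across `mode` matters here: dropping the absentee rows would understate the totals, which is one of the discrepancies the manual checks ask about.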
