## We're going to load the turnout data, flatten it, and join it to key demographic data to create a comprehensive database table for queries and predictions.

### We will consider that a person has 'voted' in a given year if they voted in any election in an odd year, or in the November general election in an even year


In [1]:
# First, set everything up
from __future__ import division # convert integer division to floating point (kept first, as __future__ imports should be)
import pyodbc
import graphlab as gl
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
import local # server information and credentials
from pullODBC import pullODBC
from pullODBCall import pullODBCall
from matplotlib.backends.backend_pdf import PdfPages
%matplotlib inline
plt.style.use('seaborn-white')
plt.rcParams['font.size'] = 7
plt.rc('xtick', labelsize=7)
plt.rc('ytick', labelsize=7)

### Section A: Extracting Voter History Data for each Registered Voter

Read in voter history from a csv file

In [2]:
history=pd.read_csv('history.csv')

Convert all columns to string

In [3]:
for col in history.columns:
    history[col]= history[col].apply(str)

Get rid of all the double quotes from the dataframe

In [4]:
history.columns = history.columns.str.strip('"')
for i, col in enumerate(history.columns):
    history.iloc[:, i] = history.iloc[:, i].str.replace('"', '')
print history.head(n=15)

   Unnamed: 0      ncid election_lbl county_desc
0           0   AA10737   11/03/2015    ALAMANCE
1           1   AA10737   11/05/2013    ALAMANCE
2           2   AA10737   05/06/2008    ALAMANCE
3           3   AA10737   11/04/2008    ALAMANCE
4           4   AA10737   10/09/2007    ALAMANCE
5           5   AA10737   11/08/2011    ALAMANCE
6           6   AA10737   11/04/2014    ALAMANCE
7           7  AA107373   11/04/2014    ALAMANCE
8           8  AA107373   11/06/2007    ALAMANCE
9           9  AA107373   11/02/2010    ALAMANCE
10         10  AA107373   11/08/2016    ALAMANCE
11         11  AA107373   03/15/2016    ALAMANCE
12         12  AA107373   11/04/2008    ALAMANCE
13         13  AA107373   11/06/2012    ALAMANCE
14         14  AA107373   05/08/2012    ALAMANCE
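The quote-stripping step can be tried on a small frame before running it on all 32 million rows. This is a minimal sketch: the toy values and the helper name `strip_quotes` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for history.csv; the quoted values are illustrative only.
raw = pd.DataFrame({'"ncid"': ['"AA10737"', '"AA107373"'],
                    '"election_lbl"': ['"11/03/2015"', '"11/04/2014"']})

def strip_quotes(df):
    """Return a copy with double quotes removed from headers and cell values."""
    out = df.copy()
    out.columns = out.columns.str.strip('"')
    for col in out.columns:
        # regex=False treats the quote as a literal character
        out[col] = out[col].astype(str).str.replace('"', '', regex=False)
    return out

cleaned = strip_quotes(raw)
```

Working on a copy leaves the raw frame untouched, which makes it easy to compare before and after.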


Drop the junk column

In [5]:
history = history.drop(['Unnamed: 0'], axis=1)

Save the total number of rows to raw_len and print

In [6]:
raw_len=len(history)
print "The number of rows pulled from the database is " + str(raw_len)

The number of rows pulled from the database is 32482836


Create a generic function to change blank fields to NaN for all categorical fields 

In [7]:
def fix_blanks_NaN(x):
    if x == "":
        return np.nan
    elif x == " ":
        return np.nan
    else:
        return x

In [8]:
history['election_lbl'] = history['election_lbl'].apply(fix_blanks_NaN) # change blanks to NaN
history['ncid'] = history['ncid'].apply(fix_blanks_NaN) # etc.
history['county_desc'] = history['county_desc'].apply(fix_blanks_NaN) # etc.
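As an alternative to applying `fix_blanks_NaN` column by column, pandas can replace blanks across the whole frame in one call. The sketch below uses made-up rows; note that the pattern is an assumption that is slightly broader than the function above, since it also treats strings of several spaces as blank.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real data comes from history.csv.
toy = pd.DataFrame({'ncid': ['AA1', ' ', 'AA3'],
                    'county_desc': ['ALAMANCE', 'GUILFORD', '']})

# Treat empty or whitespace-only strings as missing in every column at once.
blank_to_nan = toy.replace(r'^\s*$', np.nan, regex=True)

# Rows with any missing field are then removed, as in the next step.
complete_rows = blank_to_nan.dropna(how='any')
```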

Clean the data by dropping rows with missing data and print a summary

In [9]:
history.dropna(inplace=True, how='any') # drop all rows with NaN
clean_len=len(history) # how many rows do I have left?
difference = raw_len - clean_len # how many rows have I dropped?
fraction = float(difference)/float(raw_len) # what fraction of rows have I dropped?
print "The number of rows I have dropped is " + str(difference) 
print "The number of rows remaining in the database is " + str(clean_len)
print "The percentage of rows that I have dropped is: " + str('{:.3%}'.format(fraction))

The number of rows I have dropped is 0
The number of rows remaining in the database is 32482836
The percentage of rows that I have dropped is: 0.000%


We need to flatten the database to a single row per NCID and massage the data

Local elections are held in odd-numbered years, and their dates vary across the state. In fact, many candidates run unopposed in the general election, so the primary election is often more meaningful. Previously, we had only looked at elections held on election day in November. For local elections, we will consider that a registered voter 'voted' if they voted in any election in an odd-numbered year. So, the code below provides a marker of '1' if the voter voted at all in a year and '0' if that voter did not participate in any election that year.

In order to do this, we need to create a custom aggfunc that works on a pandas series and returns '1' if the length of that series is not equal to zero, and '0' if the length of that series is equal to zero.

In [10]:
def not_zero_len(x):
    if len(x) != 0:
        return 1
    else:
        return 0
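To see how this aggfunc behaves inside a pivot table, here is a minimal sketch on made-up records. As far as we can tell, pivot_table only calls the aggfunc on voter/year groups that contain at least one row, so `not_zero_len` returns 1 for every group it sees, and the 0s come from filling the NaN left for missing combinations. The explicit `values` column is an assumption added so the sketch does not depend on pandas version behavior.

```python
import pandas as pd

def not_zero_len(x):
    # 1 if the group has any rows, 0 otherwise
    return 1 if len(x) != 0 else 0

# Voter A voted twice in 2007 and once in 2009; voter B voted only in 2009.
toy = pd.DataFrame({
    'ncid': ['A', 'A', 'A', 'B'],
    'year': ['2007', '2007', '2009', '2009'],
    'election_lbl': ['10/09/2007', '11/06/2007', '11/03/2009', '11/03/2009'],
})

flags = pd.pivot_table(toy, index='ncid', columns='year',
                       values='election_lbl', aggfunc=not_zero_len)
# Missing voter/year combinations come back as NaN; treat them as "did not vote".
flags = flags.fillna(0).astype(int)
```

Voter A's two 2007 records collapse to a single 1, which is the per-year definition of 'voted' used for odd years.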

Let's create the new column 'year' that contains the year of the vote

In [11]:
history['year'] = history['election_lbl'].str[-4:]

In [12]:
print history.head()

      ncid election_lbl county_desc  year
0  AA10737   11/03/2015    ALAMANCE  2015
1  AA10737   11/05/2013    ALAMANCE  2013
2  AA10737   05/06/2008    ALAMANCE  2008
3  AA10737   11/04/2008    ALAMANCE  2008
4  AA10737   10/09/2007    ALAMANCE  2007


Now we create a function that returns 'True' if the input number is even, 'False' if it is odd

In [13]:
def is_even(x):
    if (x%2 == 0): 
        return True
    else: 
        return False

Next, we create a new column containing 'True' for even years, 'False' for odd years

In [14]:
history['even'] = history['year'].astype(int).apply(is_even)
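The same year and parity columns can also be computed without a helper function. A small sketch on illustrative labels, assuming every label is in MM/DD/YYYY form as in the output above:

```python
import pandas as pd

toy = pd.DataFrame({'election_lbl': ['11/03/2015', '05/06/2008', '10/09/2007']})

# The year is the last four characters of the MM/DD/YYYY label.
toy['year'] = toy['election_lbl'].str[-4:]
# Vectorized parity check: True for even years, False for odd years.
toy['even'] = toy['year'].astype(int) % 2 == 0
```

On tens of millions of rows the vectorized comparison avoids one Python function call per row.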

In [15]:
print history.head(n=50)

        ncid election_lbl county_desc  year   even
0    AA10737   11/03/2015    ALAMANCE  2015  False
1    AA10737   11/05/2013    ALAMANCE  2013  False
2    AA10737   05/06/2008    ALAMANCE  2008   True
3    AA10737   11/04/2008    ALAMANCE  2008   True
4    AA10737   10/09/2007    ALAMANCE  2007  False
5    AA10737   11/08/2011    ALAMANCE  2011  False
6    AA10737   11/04/2014    ALAMANCE  2014   True
7   AA107373   11/04/2014    ALAMANCE  2014   True
8   AA107373   11/06/2007    ALAMANCE  2007  False
9   AA107373   11/02/2010    ALAMANCE  2010   True
10  AA107373   11/08/2016    ALAMANCE  2016   True
11  AA107373   03/15/2016    ALAMANCE  2016   True
12  AA107373   11/04/2008    ALAMANCE  2008   True
13  AA107373   11/06/2012    ALAMANCE  2012   True
14  AA107373   05/08/2012    ALAMANCE  2012   True
15  AA107373   07/17/2012    ALAMANCE  2012   True
16  AA107373   05/06/2008    ALAMANCE  2008   True
17  AA107373   05/04/2010    ALAMANCE  2010   True
18  AA107380   11/04/2008    GU

Now, let's split history into two dataframes, one for even years and one for odd years

In [16]:
history_even = history[history['even']]
history_odd = history[~history['even']]

In [17]:
print len(history_odd)/(len(history_odd)+len(history_even))

0.10631704079


In [18]:
print history_odd.head(n=20)

         ncid election_lbl county_desc  year   even
0     AA10737   11/03/2015    ALAMANCE  2015  False
1     AA10737   11/05/2013    ALAMANCE  2013  False
4     AA10737   10/09/2007    ALAMANCE  2007  False
5     AA10737   11/08/2011    ALAMANCE  2011  False
8    AA107373   11/06/2007    ALAMANCE  2007  False
45   AA107392   11/03/2009    ALAMANCE  2009  False
50   AA107392   10/06/2015    ALAMANCE  2015  False
52   AA107392   11/03/2015    ALAMANCE  2015  False
69   AA107397   11/08/2011    ALAMANCE  2011  False
72   AA107397   11/03/2015    ALAMANCE  2015  False
89    AA10740   10/09/2007    ALAMANCE  2007  False
92    AA10740   10/06/2009    ALAMANCE  2009  False
94    AA10740   11/06/2007    ALAMANCE  2007  False
97    AA10740   11/03/2009    ALAMANCE  2009  False
98    AA10740   11/03/2015    ALAMANCE  2015  False
99    AA10740   11/05/2013    ALAMANCE  2013  False
100   AA10740   11/08/2011    ALAMANCE  2011  False
123  AA107410   11/08/2011    ALAMANCE  2011  False
140   AA1074

In [19]:
del history # we no longer need the full 'history' dataframe

For even year elections, eliminate all records not in November

In [20]:
history_even = history_even[history_even['election_lbl'].astype(str).str[0:2]=='11']

In [21]:
print history_even.head()

        ncid election_lbl county_desc  year  even
3    AA10737   11/04/2008    ALAMANCE  2008  True
6    AA10737   11/04/2014    ALAMANCE  2014  True
7   AA107373   11/04/2014    ALAMANCE  2014  True
9   AA107373   11/02/2010    ALAMANCE  2010  True
10  AA107373   11/08/2016    ALAMANCE  2016  True
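The November filter above relies on the first two characters of the label. A sketch of the same idea with parsed dates follows; the toy rows are illustrative, and, like the original, it assumes that any November election in an even year is the general election.

```python
import pandas as pd

toy = pd.DataFrame({'ncid': ['AA10737', 'AA10737', 'AA107373'],
                    'election_lbl': ['05/06/2008', '11/04/2008', '03/15/2016']})

# Parse the MM/DD/YYYY labels into real dates.
dates = pd.to_datetime(toy['election_lbl'], format='%m/%d/%Y')

is_even_year = dates.dt.year % 2 == 0
is_november = dates.dt.month == 11

# Keep only even-year records from November.
november_even = toy[is_even_year & is_november]
```

Parsing with an explicit format also fails loudly on malformed labels, which string slicing would silently pass through.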


In [22]:
print history_odd.head()

       ncid election_lbl county_desc  year   even
0   AA10737   11/03/2015    ALAMANCE  2015  False
1   AA10737   11/05/2013    ALAMANCE  2013  False
4   AA10737   10/09/2007    ALAMANCE  2007  False
5   AA10737   11/08/2011    ALAMANCE  2011  False
8  AA107373   11/06/2007    ALAMANCE  2007  False


Now, let's use a pandas pivot table to flatten the history file for even-year elections

In [23]:
elections_by_ncid = pd.pivot_table(history_even[['ncid','year']], index=['ncid'],  
                                 aggfunc=len, columns='year', margins = False) # create a pivot table
elections_flattened_even = pd.DataFrame(elections_by_ncid.to_records()) # convert the pivot table to a dataframe
cols = elections_flattened_even.columns # get an index object with all the column names
cols = cols.delete(0) # remove the first column name 'ncid' since I don't want to rename this
elections_flattened_even.fillna(value=0, inplace=True) # replace NaN with 0
elections_flattened_even[cols] = elections_flattened_even[cols].astype(int) # convert values to integers
elections_flattened_even = elections_flattened_even.reindex(columns=sorted(elections_flattened_even.columns)) # sort the columns by year (reindex_axis is deprecated)

In [24]:
print elections_flattened_even.head()

   2008  2010  2012  2014  2016  2018      ncid
0     1     0     0     0     1     0  AA100000
1     1     1     1     1     1     0  AA100006
2     1     1     1     1     1     0  AA100007
3     1     0     1     0     0     0  AA100008
4     1     1     1     1     1     0  AA100009


Now, let's use a pandas pivot table to flatten the history file for odd-year elections

First, we keep the aggfunc as 'len' so the output is the number of elections voted in each year

In [25]:
elections_by_ncid = pd.pivot_table(history_odd[['ncid','year']], index=['ncid'],  
                                 aggfunc=len, columns='year', margins = False) # create a pivot table
elections_flattened_odd = pd.DataFrame(elections_by_ncid.to_records()) # convert the pivot table to a dataframe
cols = elections_flattened_odd.columns # get an index object with all the column names
cols = cols.delete(0) # remove the first column name 'ncid' since I don't want to rename this
elections_flattened_odd.fillna(value=0, inplace=True) # replace NaN with 0
elections_flattened_odd[cols] = elections_flattened_odd[cols].astype(int) # convert values to integers
elections_flattened_odd = elections_flattened_odd.reindex(columns=sorted(elections_flattened_odd.columns)) # sort the columns by year (reindex_axis is deprecated)

In [26]:
print elections_flattened_odd.head(n=10)

   2007  2009  2011  2013  2015      ncid
0     0     0     1     0     0  AA100007
1     2     1     1     0     0   AA10001
2     0     1     0     0     0  AA100023
3     0     0     0     1     1  AA100062
4     0     0     0     1     1  AA100074
5     0     0     1     1     0  AA100095
6     1     0     0     0     0  AA100110
7     2     1     1     0     1  AA100140
8     2     0     0     0     1   AA10016
9     1     0     0     0     0  AA100166


Next, we set the aggfunc to 'not_zero_len' so the output is '1' if the voter voted in any election that year. Voter/year combinations with no records come back as NaN and are filled with '0'

In [27]:
elections_by_ncid = pd.pivot_table(history_odd[['ncid','year']], index=['ncid'],  
                                 aggfunc=not_zero_len, columns='year', margins = False) # create a pivot table
elections_flattened_odd = pd.DataFrame(elections_by_ncid.to_records()) # convert the pivot table to a dataframe
cols = elections_flattened_odd.columns # get an index object with all the column names
cols = cols.delete(0) # remove the first column name 'ncid' since I don't want to rename this
elections_flattened_odd.fillna(value=0, inplace=True) # replace NaN with 0
elections_flattened_odd[cols] = elections_flattened_odd[cols].astype(int) # convert values to integers
elections_flattened_odd = elections_flattened_odd.reindex(columns=sorted(elections_flattened_odd.columns)) # sort the columns by year (reindex_axis is deprecated)

In [28]:
print elections_flattened_odd.head(n=10)

   2007  2009  2011  2013  2015      ncid
0     0     0     1     0     0  AA100007
1     1     1     1     0     0   AA10001
2     0     1     0     0     0  AA100023
3     0     0     0     1     1  AA100062
4     0     0     0     1     1  AA100074
5     0     0     1     1     0  AA100095
6     1     0     0     0     0  AA100110
7     1     1     1     0     1  AA100140
8     1     0     0     0     1   AA10016
9     1     0     0     0     0  AA100166
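The count table and the 0/1 table can also be produced with `pd.crosstab`, which fills missing voter/year combinations with 0 directly. This is an alternative sketch on made-up records, not the method used in the cells above.

```python
import pandas as pd

# Voter B has two records in 2010, like the voters with a count of 2 above.
toy = pd.DataFrame({'ncid': ['A', 'A', 'B', 'B', 'B'],
                    'year': ['2008', '2012', '2008', '2010', '2010']})

counts = pd.crosstab(toy['ncid'], toy['year'])  # elections per voter per year
voted = counts.clip(upper=1)                    # 1 if any record that year, else 0

# Turn the index back into an ordinary 'ncid' column.
flat = voted.reset_index()
flat.columns.name = None
```

Because crosstab returns integer counts with no NaN, the fillna and astype(int) steps are not needed in this variant.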


Now, let's join the odd and even dataframes on 'ncid'. We use an outer join so we keep all NCIDs

In [29]:
elections_flattened = pd.merge(elections_flattened_odd, elections_flattened_even, how='outer', on='ncid')
print len(elections_flattened), len(elections_flattened_odd), len(elections_flattened_even)

6490208 1455448 6459643
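The three lengths printed above imply that 1424883 NCIDs appear in both dataframes, 30565 appear only in the odd-year dataframe, and 5034760 appear only in the even-year dataframe (1455448 + 6459643 - 6490208 = 1424883). The `indicator` option of `pd.merge` reports the same breakdown directly. A minimal sketch on toy frames, whose values are assumptions for illustration:

```python
import pandas as pd

odd = pd.DataFrame({'ncid': ['A', 'B'], '2015': [1, 1]})
even = pd.DataFrame({'ncid': ['B', 'C'], '2016': [1, 1]})

# indicator=True adds a '_merge' column: left_only, right_only, or both.
merged = pd.merge(odd, even, how='outer', on='ncid', indicator=True)
origin_counts = merged['_merge'].value_counts()

# Voters missing from one side get NaN in that side's columns; fill with 0.
flags = merged.drop(columns='_merge').fillna(0)
```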


In [30]:
print elections_flattened.head(n=10)

   2007  2009  2011  2013  2015      ncid  2008  2010  2012  2014  2016  2018
0   0.0   0.0   1.0   0.0   0.0  AA100007   1.0   1.0   1.0   1.0   1.0   0.0
1   1.0   1.0   1.0   0.0   0.0   AA10001   1.0   1.0   1.0   1.0   0.0   0.0
2   0.0   1.0   0.0   0.0   0.0  AA100023   1.0   0.0   1.0   0.0   0.0   0.0
3   0.0   0.0   0.0   1.0   1.0  AA100062   1.0   1.0   1.0   1.0   1.0   0.0
4   0.0   0.0   0.0   1.0   1.0  AA100074   1.0   1.0   1.0   1.0   1.0   0.0
5   0.0   0.0   1.0   1.0   0.0  AA100095   1.0   0.0   1.0   0.0   1.0   0.0
6   1.0   0.0   0.0   0.0   0.0  AA100110   1.0   1.0   1.0   1.0   1.0   0.0
7   1.0   1.0   1.0   0.0   1.0  AA100140   1.0   1.0   1.0   1.0   1.0   0.0
8   1.0   0.0   0.0   0.0   1.0   AA10016   1.0   1.0   1.0   1.0   1.0   0.0
9   1.0   0.0   0.0   0.0   0.0  AA100166   1.0   1.0   1.0   1.0   1.0   0.0


The outer join leaves 'NaN' in the year columns of any voter who appears in only one of the two dataframes, so change those values to zero

In [31]:
elections_flattened.fillna(value=0, inplace=True)

Finally, let's sort by columns in ascending order and get rid of '2018'

In [32]:
elections_flattened = elections_flattened.drop('2018', axis=1)
elections_flattened.sort_index(axis=1, ascending=True, inplace=True) # DataFrame.sort is deprecated; sort_index orders the column labels


In [33]:
print elections_flattened.head()

   2007  2008  2009  2010  2011  2012  2013  2014  2015  2016      ncid
0   0.0   1.0   0.0   1.0   1.0   1.0   0.0   1.0   0.0   1.0  AA100007
1   1.0   1.0   1.0   1.0   1.0   1.0   0.0   1.0   0.0   0.0   AA10001
2   0.0   1.0   1.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0  AA100023
3   0.0   1.0   0.0   1.0   0.0   1.0   1.0   1.0   1.0   1.0  AA100062
4   0.0   1.0   0.0   1.0   0.0   1.0   1.0   1.0   1.0   1.0  AA100074
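The flags print as 0.0 and 1.0 because the outer join introduced NaN, which forces float columns. If integer flags are preferred, they can be cast back once the NaN values are filled. A hedged sketch on a toy frame; the original notebook keeps the floats.

```python
import pandas as pd

toy = pd.DataFrame({'2008': [1.0, 0.0], 'ncid': ['A', 'B'], '2007': [0.0, 1.0]})

year_cols = [c for c in toy.columns if c != 'ncid']

# Cast only the year columns; this is safe only after NaN has been filled.
toy = toy.astype({c: int for c in year_cols})

# Years in ascending order, with 'ncid' last as in the output above.
toy = toy[sorted(year_cols) + ['ncid']]
```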


Clean up dataframes that are no longer necessary

In [34]:
del history_even, history_odd, elections_flattened_even, elections_flattened_odd

In [35]:
elections_flattened.to_csv('elections_flattened.csv')

We now have a flat dataframe with one row per ncid and one column per year. The value in each column is '1' if the person with that ncid voted in the November general election (even years) or in any election (odd years), and '0' if they didn't vote

### Section B: Extracting Voter Demographics

Read in voter demographics from a csv file

In [36]:
demographics = pd.read_csv('demographics.csv')

Convert all columns to string

In [37]:
for col in demographics.columns:
    demographics[col]= demographics[col].apply(str)

Get rid of all the double quotes from the dataframe

In [38]:
demographics.columns = demographics.columns.str.strip('"')
for i, col in enumerate(demographics.columns):
    demographics.iloc[:, i] = demographics.iloc[:, i].str.replace('"', '')
print demographics.head()

  Unnamed: 0 birth_age gender_code race_code ethnic_code party_cd status_cd  \
0          0        79           M         W          NL      DEM         A   
1          1        60           M         W          NL      REP         A   
2          2        43           M         W          NL      REP         A   
3          3        55           M         W          NL      REP         A   
4          4        58           F         W          NL      DEM         I   

  zip_code birth_state drivers_lic county_desc  registr_dt      ncid  \
0    27253          NC           N    ALAMANCE  01/19/1959   AA34771   
1    27215          NC           Y    ALAMANCE  04/09/1984   AA53687   
2    27302          NC           Y    ALAMANCE  04/02/1992   AA80986   
3    27215          NC           Y    ALAMANCE  09/15/2003  AA131587   
4    27258          NC           Y    ALAMANCE  05/19/2006  AA146650   

  precinct_abbrv municipality_abbrv last_name first_name  
0             07                 

In [39]:
#Save the total number of rows to raw_len and print. 
raw_len=len(demographics)
print "The number of rows pulled from the database is " + str(raw_len)

The number of rows pulled from the database is 6774419


In [40]:
print demographics.columns

Index([u'Unnamed: 0', u'birth_age', u'gender_code', u'race_code',
       u'ethnic_code', u'party_cd', u'status_cd', u'zip_code', u'birth_state',
       u'drivers_lic', u'county_desc', u'registr_dt', u'ncid',
       u'precinct_abbrv', u'municipality_abbrv', u'last_name', u'first_name'],
      dtype='object')


Create two lists: one for fields that may be blank (blanks will be relabeled "None") and another for fields whose rows must be dropped if blank

In [41]:
can_have_none = ['municipality_abbrv', 'birth_state', 'last_name', 'first_name']
cannot_have_none = ['birth_age', 'gender_code', 'race_code',
       'ethnic_code', 'party_cd', 'status_cd', 'zip_code', 
       'drivers_lic', 'county_desc', 'registr_dt', 'ncid',
       'precinct_abbrv']

Create a generic function to change blank fields to 'None' for 'birth_state', 'municipality_abbrv', 'first_name' and 'last_name'

In [42]:
def fix_blanks(x):
    if x == "":
        return "None"
    elif x == " ":
        return "None"
    else:
        return x

In [45]:
for col in can_have_none:
    demographics[col] = demographics[col].apply(fix_blanks)
    
for col in cannot_have_none:
    demographics[col] = demographics[col].apply(fix_blanks_NaN)
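The two blank-handling rules can be checked on a few made-up rows. The sketch below uses `isin` and `mask` instead of the two helper functions; the column lists `optional` and `required` are shortened stand-ins for `can_have_none` and `cannot_have_none`.

```python
import pandas as pd

toy = pd.DataFrame({'ncid': ['AA1', 'AA2', ' '],
                    'party_cd': ['DEM', '', 'REP'],
                    'birth_state': ['NC', ' ', '']})

optional = ['birth_state']       # blank becomes the label "None"
required = ['ncid', 'party_cd']  # blank becomes NaN, so the row is dropped
blanks = ['', ' ']               # the same two values the helper functions test for

toy[optional] = toy[optional].mask(toy[optional].isin(blanks), 'None')
toy[required] = toy[required].mask(toy[required].isin(blanks))  # default fill is NaN

kept = toy.dropna(how='any')
```

Only the first row survives: the second has a blank party code and the third a blank ncid, while the blank birth states are kept as "None".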

In [46]:
demographics.dropna(inplace=True, how='any') # drop all rows with NaN
clean_len=len(demographics) # how many rows do I have left?
difference = raw_len - clean_len # how many rows have I dropped?
fraction = float(difference)/float(raw_len) # what fraction of rows have I dropped?
print "The number of rows I have dropped is " + str(difference) 
print "The number of rows remaining in the database is " + str(clean_len)
print "The percentage of rows that I have dropped is: " + str('{:.3%}'.format(fraction))

The number of rows I have dropped is 3156
The number of rows remaining in the database is 6771263
The percentage of rows that I have dropped is: 0.047%


### Section C: Merging Demographics and Turnout History

Now we do a left join, adding turnout information to each ncid in the demographics dataframe

In [47]:
combined = pd.merge(demographics, elections_flattened, how='left', on='ncid')

In [48]:
print len(combined), len(demographics), len(elections_flattened)

6771263 6771263 6490208
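The left join kept all 6771263 demographic rows. Since the turnout table has only 6490208 rows, at least 281055 registered voters have no matching history, and their year columns are NaN. The sketch below, on toy frames, counts such rows and shows one way to fill them. Treating a missing history as '0' is an assumption; the notebook itself leaves these values as NaN.

```python
import pandas as pd

demo = pd.DataFrame({'ncid': ['A', 'B', 'C'], 'party_cd': ['DEM', 'REP', 'UNA']})
turnout = pd.DataFrame({'ncid': ['A', 'B'], '2015': [1, 0], '2016': [1, 1]})

# validate raises an error if 'ncid' is duplicated on either side.
combined_toy = pd.merge(demo, turnout, how='left', on='ncid', validate='one_to_one')

year_cols = ['2015', '2016']
# True for voters with no turnout history at all.
no_history = combined_toy[year_cols].isna().all(axis=1)

# Optional: treat "no history" as "did not vote" and restore integer flags.
combined_toy = combined_toy.fillna({c: 0 for c in year_cols})
combined_toy = combined_toy.astype({c: int for c in year_cols})
```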


In [49]:
print combined.head()

  Unnamed: 0 birth_age gender_code race_code ethnic_code party_cd status_cd  \
0          0        79           M         W          NL      DEM         A   
1          1        60           M         W          NL      REP         A   
2          2        43           M         W          NL      REP         A   
3          3        55           M         W          NL      REP         A   
4          4        58           F         W          NL      DEM         I   

  zip_code birth_state drivers_lic  ...  2007 2008 2009 2010 2011 2012 2013  \
0    27253          NC           N  ...   0.0  1.0  0.0  1.0  0.0  1.0  0.0   
1    27215          NC           Y  ...   0.0  1.0  0.0  1.0  0.0  1.0  0.0   
2    27302          NC           Y  ...   0.0  1.0  1.0  1.0  0.0  1.0  1.0   
3    27215          NC           Y  ...   0.0  1.0  0.0  1.0  0.0  1.0  0.0   
4    27258          NC           Y  ...   NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   2014  2015  2016  
0   1.0   0.0   1.0  
1   1.

Now save the dataframe 'combined' in a .csv

In [50]:
combined.to_csv('combined_updated.csv')