## We're going to load the combined demographic and turnout data and build a model to predict turnout for the 2017 local elections - More Data - Do Not Change

In [1]:
# First set everything up
from __future__ import division # convert integer division to floating point
import pyodbc
import graphlab as gl
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
import local #server information and credentials
from pullODBC import pullODBC
from pullODBCall import pullODBCall
from matplotlib.backends.backend_pdf import PdfPages
%matplotlib inline
plt.style.use('seaborn-white')
plt.rcParams['font.size'] = 7
plt.rcParams['figure.autolayout'] = True # apply tight_layout to every figure (a bare plt.tight_layout; does nothing)
plt.rc('xtick', labelsize=7)
plt.rc('ytick', labelsize=7)

Read in combined demographic and participation data from the June 2017 snapshots

In [2]:
combined_df = pd.read_csv('combined_updated.csv')


In [4]:
print len(combined_df)
print combined_df.head()
print combined_df.columns

6771263
   Unnamed: 0  Unnamed: 0.1  birth_age gender_code race_code ethnic_code  \
0           0             0         79           M         W          NL   
1           1             1         60           M         W          NL   
2           2             2         43           M         W          NL   
3           3             3         55           M         W          NL   
4           4             4         58           F         W          NL   

  party_cd status_cd  zip_code birth_state  ...  2007 2008 2009 2010 2011  \
0      DEM         A     27253          NC  ...   0.0  1.0  0.0  1.0  0.0   
1      REP         A     27215          NC  ...   0.0  1.0  0.0  1.0  0.0   
2      REP         A     27302          NC  ...   0.0  1.0  1.0  1.0  0.0   
3      REP         A     27215          NC  ...   0.0  1.0  0.0  1.0  0.0   
4      DEM         I     27258          NC  ...   NaN  NaN  NaN  NaN  NaN   

  2012 2013 2014  2015  2016  
0  1.0  0.0  1.0   0.0   1.0  
1  1.0  0.

Change all the NaNs to 0

In [5]:
combined_df.fillna(0.0, inplace=True)

In [6]:
pd.set_option('display.max_columns', 100)

Get rid of junk columns

In [7]:
del combined_df['Unnamed: 0']
del combined_df['Unnamed: 0.1']

Change all participation columns to integer

In [8]:
years = ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016']
for year in years:
    combined_df[year] = combined_df[year].astype(int)

In [9]:
print combined_df.head()

   birth_age gender_code race_code ethnic_code party_cd status_cd  zip_code  \
0         79           M         W          NL      DEM         A     27253   
1         60           M         W          NL      REP         A     27215   
2         43           M         W          NL      REP         A     27302   
3         55           M         W          NL      REP         A     27215   
4         58           F         W          NL      DEM         I     27258   

  birth_state drivers_lic county_desc  registr_dt      ncid precinct_abbrv  \
0          NC           N    ALAMANCE  01/19/1959   AA34771             07   
1          NC           Y    ALAMANCE  04/09/1984   AA53687            12W   
2          NC           Y    ALAMANCE  04/02/1992   AA80986            10S   
3          NC           Y    ALAMANCE  09/15/2003  AA131587            03S   
4          NC           Y    ALAMANCE  05/19/2006  AA146650             13   

  municipality_abbrv last_name first_name  2007  2008  2

We're only interested in municipal elections, so let's keep only the rows where municipality_abbrv is not equal to 'None' (that is, voters who live in a municipality).

In [10]:
print "The total number of registered voters is " + str(len(combined_df))
combined_df = combined_df[combined_df['municipality_abbrv']!= 'None']
print "The total number of registered voters living in municipalities is " +str(len(combined_df))
print combined_df.head()

The total number of registered voters is 6771263
The total number of registered voters living in municipalities is 4110733
   birth_age gender_code race_code ethnic_code party_cd status_cd  zip_code  \
1         60           M         W          NL      REP         A     27215   
2         43           M         W          NL      REP         A     27302   
3         55           M         W          NL      REP         A     27215   
4         58           F         W          NL      DEM         I     27258   
6         82           M         W          NL      REP         A     27302   

  birth_state drivers_lic county_desc  registr_dt      ncid precinct_abbrv  \
1          NC           Y    ALAMANCE  04/09/1984   AA53687            12W   
2          NC           Y    ALAMANCE  04/02/1992   AA80986            10S   
3          NC           Y    ALAMANCE  09/15/2003  AA131587            03S   
4          NC           Y    ALAMANCE  05/19/2006  AA146650             13   
6          N

Change registr_dt to datetime and birth_age to float

In [11]:
combined_df['birth_age'] = combined_df['birth_age'].astype(float)

In [13]:
combined_df['registr_dt']=pd.to_datetime(combined_df.registr_dt, format='%m/%d/%Y')
print combined_df.head()

   birth_age gender_code race_code ethnic_code party_cd status_cd  zip_code  \
1       60.0           M         W          NL      REP         A     27215   
2       43.0           M         W          NL      REP         A     27302   
3       55.0           M         W          NL      REP         A     27215   
4       58.0           F         W          NL      DEM         I     27258   
6       82.0           M         W          NL      REP         A     27302   

  birth_state drivers_lic county_desc registr_dt      ncid precinct_abbrv  \
1          NC           Y    ALAMANCE 1984-04-09   AA53687            12W   
2          NC           Y    ALAMANCE 1992-04-02   AA80986            10S   
3          NC           Y    ALAMANCE 2003-09-15  AA131587            03S   
4          NC           Y    ALAMANCE 2006-05-19  AA146650             13   
6          NC           Y    ALAMANCE 1995-11-08   AA96254            10S   

  municipality_abbrv last_name first_name  2007  2008  2009  2

Create datetime objects for the four election days to be used in the model

In [14]:
election_day_2015 = pd.to_datetime('11/3/2015', dayfirst=False, yearfirst=False)
election_day_2014 = pd.to_datetime('11/4/2014', dayfirst=False, yearfirst=False) 
election_day_2013 = pd.to_datetime('11/5/2013', dayfirst=False, yearfirst=False) 
election_day_2012 = pd.to_datetime('11/6/2012', dayfirst=False, yearfirst=False) 

Create a set of functions to set new columns to 'Y' (voted), 'N' (registered but did not vote), or 'U' (not registered at the time of the election)

In [15]:
def calc_voted_2015(x):
    if x['2015'] == 1:
        return 'Y'
    elif x['registr_dt'] <= election_day_2015:
        return 'N'
    else:
        return 'U'

In [16]:
def calc_voted_2014(x):
    if x['2014'] == 1:
        return 'Y'
    elif x['registr_dt'] <= election_day_2014:
        return 'N'
    else:
        return 'U'

In [17]:
def calc_voted_2013(x):
    if x['2013'] == 1:
        return 'Y'
    elif x['registr_dt'] <= election_day_2013:
        return 'N'
    else:
        return 'U'

In [18]:
def calc_voted_2012(x):
    if x['2012'] == 1:
        return 'Y'
    elif x['registr_dt'] <= election_day_2012:
        return 'N'
    else:
        return 'U'
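The four functions above differ only in the year column and the election date they use. As a minimal sketch, a single factory could build all of them. The names `make_calc_voted` and `label_2015` and the sample rows are hypothetical and introduced only for illustration; plain dicts stand in for DataFrame rows.

```python
import datetime

def make_calc_voted(year_column, election_day):
    """Return a row function that labels turnout for one election.

    'Y' = voted, 'N' = registered by election day but did not vote,
    'U' = not yet registered on election day.
    """
    def calc_voted(row):
        if row[year_column] == 1:
            return 'Y'
        elif row['registr_dt'] <= election_day:
            return 'N'
        else:
            return 'U'
    return calc_voted

# Hypothetical example rows (dicts stand in for DataFrame rows here).
label_2015 = make_calc_voted('2015', datetime.datetime(2015, 11, 3))

voted = {'2015': 1, 'registr_dt': datetime.datetime(1984, 4, 9)}
stayed_home = {'2015': 0, 'registr_dt': datetime.datetime(2006, 5, 19)}
registered_later = {'2015': 0, 'registr_dt': datetime.datetime(2016, 2, 1)}

labels = [label_2015(r) for r in (voted, stayed_home, registered_later)]
```

A function built this way should be usable with `DataFrame.apply(..., axis=1)` in the same way as the hand-written versions, assuming the election day is given as a pandas-compatible datetime.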

In [19]:
combined_df['target']= combined_df.apply(calc_voted_2015, axis=1)
print "created target column"
combined_df['congressional']= combined_df.apply(calc_voted_2014, axis=1)
print "created congressional column"
combined_df['local']= combined_df.apply(calc_voted_2013, axis=1)
print "created local column"
combined_df['presidential']= combined_df.apply(calc_voted_2012, axis=1)
print "created presidential column"

created target column
created congressional column
created local column
created presidential column


Print the first 10 rows and examine for correctness

In [20]:
pd.set_option('display.max_columns', 30)
print combined_df.head(n=10)

    birth_age gender_code race_code ethnic_code party_cd status_cd  zip_code  \
1        60.0           M         W          NL      REP         A     27215   
2        43.0           M         W          NL      REP         A     27302   
3        55.0           M         W          NL      REP         A     27215   
4        58.0           F         W          NL      DEM         I     27258   
6        82.0           M         W          NL      REP         A     27302   
9        30.0           F         W          NL      UNA         A     27253   
11       54.0           F         W          NL      REP         A     27215   
12       26.0           F         W          NL      REP         A     27215   
14       62.0           M         W          NL      UNA         A     27217   
18       51.0           F         W          NL      UNA         A     27215   

   birth_state drivers_lic county_desc registr_dt      ncid precinct_abbrv  \
1           NC           Y    ALAMANCE 19

Let's clean up the dataframe by keeping only the columns that we need, then delete the old dataframe, which we no longer need. We then print the top rows of the new dataframe to inspect it.

In [23]:
participation_df = combined_df[['ncid', 'birth_age', 'gender_code', 'race_code', 'ethnic_code', 'party_cd', 
                               'status_cd', 'zip_code','birth_state', 'drivers_lic', 'county_desc', 'registr_dt',
                               'precinct_abbrv','target', 'congressional', 'local', 'presidential', 'municipality_abbrv']].copy() # copy so later column assignments do not act on a slice of combined_df

del combined_df

In [24]:
print participation_df.head()

       ncid  birth_age gender_code race_code ethnic_code party_cd status_cd  \
1   AA53687       60.0           M         W          NL      REP         A   
2   AA80986       43.0           M         W          NL      REP         A   
3  AA131587       55.0           M         W          NL      REP         A   
4  AA146650       58.0           F         W          NL      DEM         I   
6   AA96254       82.0           M         W          NL      REP         A   

   zip_code birth_state drivers_lic county_desc registr_dt precinct_abbrv  \
1     27215          NC           Y    ALAMANCE 1984-04-09            12W   
2     27302          NC           Y    ALAMANCE 1992-04-02            10S   
3     27215          NC           Y    ALAMANCE 2003-09-15            03S   
4     27258          NC           Y    ALAMANCE 2006-05-19             13   
6     27302          NC           Y    ALAMANCE 1995-11-08            10S   

  target congressional local presidential municipality_abbrv  

Before modelling, we need to make sure the columns are of the correct type and that all voters with invalid ages or registration dates are removed. SFrames, which will be used for modelling, do not accept mixed types within a column, so we need to be very sure that every value in a column has the same type.

In [25]:
string_columns = ['ncid', 'gender_code',  'race_code',  'ethnic_code', 'party_cd', 'status_cd', 
                'zip_code', 'birth_state', 'drivers_lic', 'county_desc', 'precinct_abbrv',
                 'target', 'congressional', 'local', 'presidential', 'municipality_abbrv']

In [26]:
for col in string_columns: # convert all the string_columns to 'string'
    participation_df[col] = participation_df[col].astype(str)

Create interaction variables and polynomial terms

In [27]:
def convert_to_years(x): # function to convert a timedelta object to a float of years
    return x.total_seconds()/(60*60*24*365.2422)
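As a quick sanity check of the conversion, the sketch below applies the same formula to two standard-library `timedelta` values. The variable names are illustrative only, and 365.2422 is the approximate number of days in a mean tropical year.

```python
import datetime

def convert_to_years(x):
    # Same conversion as above: seconds in the timedelta divided by seconds in one year.
    return x.total_seconds() / (60 * 60 * 24 * 365.2422)

# One year expressed in days should convert to (almost exactly) 1.0.
one_year = convert_to_years(datetime.timedelta(days=365.2422))

# 10,000 days is a little over 27 years.
ten_thousand_days = convert_to_years(datetime.timedelta(days=10000))
```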

In [28]:
participation_df['race_gender'] = participation_df['race_code'] + participation_df['gender_code']
participation_df['race_ethnicity'] = participation_df['race_code'] + participation_df['ethnic_code']
participation_df['gender_ethnicity'] = participation_df['gender_code'] + participation_df['ethnic_code']
participation_df['age_squared'] = (participation_df['birth_age'])**2
participation_df['age_cubed'] = (participation_df['birth_age'])**3
now = pd.to_datetime('today') # get today's date
participation_df['years_since_registration'] = now - participation_df['registr_dt'] # get time since registration as a timedelta object
participation_df['years_since_registration'] = participation_df['years_since_registration'].apply(convert_to_years) # convert to years
participation_df['years_squared'] = (participation_df['years_since_registration'])**2
participation_df['years_cubed'] = (participation_df['years_since_registration'])**3

In [29]:
print participation_df.head()

       ncid  birth_age gender_code race_code ethnic_code party_cd status_cd  \
1   AA53687       60.0           M         W          NL      REP         A   
2   AA80986       43.0           M         W          NL      REP         A   
3  AA131587       55.0           M         W          NL      REP         A   
4  AA146650       58.0           F         W          NL      DEM         I   
6   AA96254       82.0           M         W          NL      REP         A   

  zip_code birth_state drivers_lic county_desc registr_dt precinct_abbrv  \
1    27215          NC           Y    ALAMANCE 1984-04-09            12W   
2    27302          NC           Y    ALAMANCE 1992-04-02            10S   
3    27215          NC           Y    ALAMANCE 2003-09-15            03S   
4    27258          NC           Y    ALAMANCE 2006-05-19             13   
6    27302          NC           Y    ALAMANCE 1995-11-08            10S   

  target congressional local presidential municipality_abbrv race_ge

Remove all of the voters with bad ages (over 99) or bad registration dates (more than 61 years since registration)

In [30]:
participation_df = participation_df[(participation_df['birth_age'] <= 99) & (participation_df['years_since_registration'] <= 61)]

To create the model, we can only use voters who were registered by the 2015 election. So we need to remove all voters with code 'U' in the 'target' column.

In [31]:
model_df = participation_df[(participation_df['target']!= 'U')]

Let's figure out the majority classifier

In [32]:
majority = model_df.groupby(['target'])['target'].count()/model_df.groupby(['target'])['target'].count().sum()
print majority
size_all = len(model_df)
print size_all

target
N    0.864531
Y    0.135469
Name: target, dtype: float64
3380621


A status_cd of 'I' means that the voter is inactive. The logical thing for our model to do is therefore to predict that all voters with status_cd = 'I' will not vote in the next local election, so we should model only the voters with status_cd = 'A'. We can then combine everything in our final classification model to predict whether a voter will vote in the next municipal election:

    if status_cd == 'I' then predict 'No'
    else if model_prediction == 'Y' then predict 'Yes'
    else predict 'No'

Let's remove all voters with status_cd = 'I' from the model dataframe

In [33]:
model_df = model_df[(model_df['status_cd']== 'A')]

In [34]:
majority = model_df.groupby(['target'])['target'].count()/model_df.groupby(['target'])['target'].count().sum()
print majority
size_active = len(model_df)
print "The number of both active and inactive voters is " +str(size_all)
print "The number of active voters is " + str(size_active)
print str(size_all - size_active) + " voters are inactive."
print "The fraction of voters that are inactive is " +str((size_all-size_active)/size_all)

target
N    0.831105
Y    0.168895
Name: target, dtype: float64
The number of both active and inactive voters is 3380621
The number of active voters is 2692158
688463 voters are inactive.
The fraction of voters that are inactive is 0.203649861963


The majority class (non-voters) makes up 83.1% of active voters. This could be our starting benchmark for accuracy: if we guess that no one registered for the 2015 election voted, we would guess correctly 83.1% of the time.

However, since the classes are imbalanced, this is not a good benchmark. What we're really trying to do is identify which voters are likely to vote in the next local election. If we use the majority classifier, we would predict that no one would vote.

Accuracy of correctly predicting voters that will vote: 0%

Accuracy of correctly predicting voters that will not vote: 100%

Therefore, we're going to try to balance the accuracy of predicting voters that will vote against the accuracy of predicting voters that won't vote.
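One way to make this goal concrete is to average the two per-class accuracies, a quantity often called balanced accuracy. This metric is not used elsewhere in this notebook; the sketch below is only an illustration, using the class shares printed above.

```python
# Class shares among active voters, taken from the output above.
share_non_voters = 0.831105
share_voters = 0.168895

# The majority classifier predicts 'N' for everyone.
accuracy_on_voters = 0.0       # no voter is ever predicted to vote
accuracy_on_non_voters = 1.0   # every non-voter is predicted correctly

# Overall accuracy is the share-weighted mean of the per-class accuracies.
overall_accuracy = (share_voters * accuracy_on_voters
                    + share_non_voters * accuracy_on_non_voters)

# Balanced accuracy weights both classes equally.
balanced_accuracy = (accuracy_on_voters + accuracy_on_non_voters) / 2
```

The overall accuracy looks high only because non-voters dominate the data, while the equally weighted average is no better than a coin flip.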

We can now explore 2015 turnout by county/municipality for active voters only

In [35]:
table = pd.pivot_table(model_df[['county_desc', 'municipality_abbrv', 'target']],
                       index=['county_desc', 'municipality_abbrv'], columns='target', aggfunc=len)
table = pd.DataFrame(table.to_records()) # convert the pivot table to a dataframe
table['turnout_active'] = table['Y']/(table['Y']+table['N'])
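As an alternative sketch, `pd.crosstab` counts the combinations directly and fills missing combinations with 0, so a municipality with nobody in one class gets a turnout of 0 or 1 rather than a missing value. The toy data and the names `toy_df` and `counts` are hypothetical; the notebook itself uses `model_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for model_df.
toy_df = pd.DataFrame({
    'county_desc': ['A', 'A', 'A', 'A', 'B', 'B'],
    'municipality_abbrv': ['X', 'X', 'X', 'Z', 'W', 'W'],
    'target': ['Y', 'N', 'N', 'N', 'Y', 'Y'],
})

# Count voters ('Y') and non-voters ('N') per county/municipality pair.
counts = pd.crosstab([toy_df['county_desc'], toy_df['municipality_abbrv']],
                     toy_df['target']).reset_index()

# Turnout among the rows present for each municipality.
counts['turnout_active'] = counts['Y'] / (counts['Y'] + counts['N'])
```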

In [36]:
print table

    county_desc municipality_abbrv        N       Y  turnout_active
0      ALAMANCE                ALA    469.0    29.0        0.058233
1      ALAMANCE                BUR  16887.0  3978.0        0.190654
2      ALAMANCE                ELO   2774.0   454.0        0.140644
3      ALAMANCE                GIB   1604.0   253.0        0.136241
4      ALAMANCE                GRA   4982.0   781.0        0.135520
5      ALAMANCE                GRE    637.0   154.0        0.194690
6      ALAMANCE                HAW    646.0   175.0        0.213155
7      ALAMANCE                MEB   4555.0   464.0        0.092449
8      ALAMANCE                OSS    176.0    45.0        0.203620
9      ALAMANCE                SWE    766.0    55.0        0.066991
10    ALEXANDER                  T    669.0   235.0        0.259956
11    ALLEGHANY                 01    573.0   141.0        0.197479
12        ANSON                ANS    226.0    64.0        0.220690
13        ANSON               LILE    164.0    7

Now we convert model_df to an SFrame for modelling.

In [37]:
model_sf = gl.SFrame(model_df)


In [38]:
cols = model_sf.column_names()

In [39]:
print cols

['ncid', 'birth_age', 'gender_code', 'race_code', 'ethnic_code', 'party_cd', 'status_cd', 'zip_code', 'birth_state', 'drivers_lic', 'county_desc', 'registr_dt', 'precinct_abbrv', 'target', 'congressional', 'local', 'presidential', 'municipality_abbrv', 'race_gender', 'race_ethnicity', 'gender_ethnicity', 'age_squared', 'age_cubed', 'years_since_registration', 'years_squared', 'years_cubed']


Let's start out by using only past participation as our features

In [40]:
features = ['congressional', 'local', 'presidential']                

In [41]:
print features

['congressional', 'local', 'presidential']


We now reduce the data we will model to include only the features plus the column 'target'

In [42]:
model_reduced = model_sf[['congressional', 'local', 'presidential', 'target']]

Split the model data into 80% training data and 20% test data

In [43]:
train_data,test_data = model_reduced.random_split(.8)

In order to yield meaningful coefficients, I need to insert a top row into the SFrame with 'U' as the value of every feature, presumably so that 'U' is the first category encountered and therefore serves as the reference level.

In [44]:
reference_sf = gl.SFrame({'local':'U', # create a new SFrame with 1 row and 'U' for all categorical features
                    'target': 'N', 
                    'congressional': 'U', 
                    'presidential': 'U'})
train_data = reference_sf.append(train_data) # append the full training SFrame to the bottom of this 1-row SFrame 
print train_data.head()

+---------------+-------+--------------+--------+
| congressional | local | presidential | target |
+---------------+-------+--------------+--------+
|       U       |   U   |      U       |   N    |
|       Y       |   N   |      Y       |   N    |
|       Y       |   N   |      Y       |   N    |
|       N       |   N   |      Y       |   N    |
|       Y       |   N   |      Y       |   N    |
|       N       |   N   |      Y       |   N    |
|       Y       |   N   |      Y       |   N    |
|       Y       |   N   |      Y       |   N    |
|       N       |   N   |      Y       |   N    |
|       Y       |   Y   |      Y       |   Y    |
+---------------+-------+--------------+--------+
[10 rows x 4 columns]



We will now train and test a logistic classifier model. GraphLab Create offers several solvers, including a Fast Iterative Shrinkage-Thresholding Algorithm (FISTA); the call below sets solver='newton', so Newton's method is the solver actually used here.

In [45]:
turnout_model = gl.logistic_classifier.create(train_data, target='target',features= features,validation_set=test_data,
                                                          verbose=True, solver='newton', class_weights=None, max_iterations=50)

In [46]:
turnout_model.evaluate(test_data)

{'accuracy': 0.8511821149103649,
 'auc': 0.8175871748470867,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |      Y       |        N        | 38839  |
 |      Y       |        Y        | 52775  |
 |      N       |        N        | 405933 |
 |      N       |        Y        | 41360  |
 +--------------+-----------------+--------+
 [4 rows x 3 columns],
 'f1_score': 0.5682399366887574,
 'log_loss': 0.3444555124075587,
 'precision': 0.5606310086577787,
 'recall': 0.5760582443731307,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+--------+
 | threshold | fpr | tpr |   p   |   n    |
 +-----------+-----+-----+-------+--------+
 |    0.0    | 1.0 | 1.0 | 91614 | 447293 |
 |   1e-05   | 1.0 | 1.0 | 91614 | 44729

Accuracy is 85.1% using only previous turnout as features vs. 83.1% for the majority classifier, only a small improvement. However, this does not tell the whole story. Recall that the majority classifier performed as follows:

Accuracy of correctly predicting voters that will vote: 0%

Accuracy of correctly predicting voters that will not vote: 100%


The classifier model we've just created performs as follows:

Accuracy of correctly predicting voters that will vote: 58%

Accuracy of correctly predicting voters that will not vote: 91%

If our objective is to predict voters that actually will vote, this model is far better than the majority classifier. However, identifying voters with 58% accuracy is still not good enough.
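The per-class figures for this model can be recomputed from the confusion matrix counts in the evaluation output above. The sketch below copies those four counts into a plain dict; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix printed above;
# keys are (target_label, predicted_label).
confusion = {
    ('Y', 'N'): 38839,
    ('Y', 'Y'): 52775,
    ('N', 'N'): 405933,
    ('N', 'Y'): 41360,
}

actual_voters = confusion[('Y', 'Y')] + confusion[('Y', 'N')]
actual_non_voters = confusion[('N', 'N')] + confusion[('N', 'Y')]

# Share of actual voters predicted to vote (sensitivity, also called recall).
sensitivity = confusion[('Y', 'Y')] / float(actual_voters)
# Share of actual non-voters predicted not to vote (specificity).
specificity = confusion[('N', 'N')] / float(actual_non_voters)
# Share of all predictions that are correct.
accuracy = (confusion[('Y', 'Y')] + confusion[('N', 'N')]) / float(actual_voters + actual_non_voters)
```

The sensitivity computed this way matches the 'recall' value reported by `evaluate`.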

We've observed that local election turnout varies widely among counties.  Let's add county_desc to the model and see if there is any meaningful improvement.

In [47]:
enhanced_features = ['congressional', 'local', 'presidential',  'county_desc'] 

In [48]:
model_reduced = model_sf[['local', 'target', 'congressional', 'presidential', 'county_desc']]
train_data,test_data = model_reduced.random_split(.8)

In [49]:
reference_sf = gl.SFrame({'local':'U', # create a new SFrame with 1 row and 'U' for all categorical features
                    'target': 'N', 
                    'congressional': 'U', 
                    'presidential': 'U',
                    'county_desc': 'U'})
train_data = reference_sf.append(train_data) # append the full training SFrame to the bottom of this 1-row SFrame 
print train_data.head()

+---------------+-------------+-------+--------------+--------+
| congressional | county_desc | local | presidential | target |
+---------------+-------------+-------+--------------+--------+
|       U       |      U      |   U   |      U       |   N    |
|       Y       |   ALAMANCE  |   N   |      Y       |   N    |
|       Y       |   ALAMANCE  |   Y   |      Y       |   N    |
|       Y       |   ALAMANCE  |   N   |      Y       |   N    |
|       N       |   ALAMANCE  |   N   |      Y       |   N    |
|       Y       |   ALAMANCE  |   N   |      Y       |   N    |
|       N       |   ALAMANCE  |   N   |      Y       |   N    |
|       Y       |   ALAMANCE  |   N   |      Y       |   N    |
|       Y       |   ALAMANCE  |   N   |      Y       |   N    |
|       N       |   ALAMANCE  |   N   |      Y       |   N    |
+---------------+-------------+-------+--------------+--------+
[10 rows x 5 columns]



In [50]:
turnout_model = gl.logistic_classifier.create(train_data, target='target',features= enhanced_features,validation_set=test_data,
                                                          verbose=True, solver='newton', class_weights=None, max_iterations=100)

In [51]:
turnout_model.evaluate(test_data)

{'accuracy': 0.8672545817051395,
 'auc': 0.8569551142526537,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |      Y       |        N        | 51522  |
 |      N       |        Y        | 19918  |
 |      Y       |        Y        | 38977  |
 |      N       |        N        | 427756 |
 +--------------+-----------------+--------+
 [4 rows x 3 columns],
 'f1_score': 0.5218014110339103,
 'log_loss': 0.320198457225191,
 'precision': 0.6618049070379489,
 'recall': 0.43068984187670584,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+-----+-------+--------+
 | threshold |      fpr       | tpr |   p   |   n    |
 +-----------+----------------+-----+-------+--------+
 |    0.0    |      1.0       | 1.0 | 90499 | 44767

Accuracy is 86.7% using previous turnout + county as features vs. 85.1% using only previous turnout, again a small improvement. However, this does not tell the whole story. Recall that the classifier using only previous turnout performed as follows:

Accuracy of correctly predicting voters that will vote: 58%

Accuracy of correctly predicting voters that will not vote: 91%

Our new classifier adding county to the feature list performed as follows:

Accuracy of correctly predicting voters that will vote: 43%

Accuracy of correctly predicting voters that will not vote: 96%

If our objective is to predict voters that actually will vote, this goes in the wrong direction. The new model became more accurate at predicting non-voters at the expense of the accuracy of predicting voters. We need to fix this.

Fortunately, there is a simple way to resolve this problem. To achieve the best overall accuracy, the algorithm as it stands treats the penalty for predicting a non-voter incorrectly as equal to the penalty for predicting a voter incorrectly. This does not match our real-world objectives. Classification modeling with unbalanced classes is a well-studied subject, and one recommended way to handle this situation is to penalize an incorrect prediction into the majority class more than an incorrect prediction into the minority class. This will increase the accuracy of predicting voters at the expense of the accuracy of predicting non-voters. Overall accuracy will be lower, but this is a fair price to pay for increased accuracy on voters.

One rule of thumb is to penalize an incorrect prediction into the majority class in proportion to the class imbalance. This would imply a penalty of 83.1/16.9. In other words, the penalty for incorrectly predicting a voter as a non-voter would be approximately 5 times that of incorrectly predicting a non-voter as a voter. Let's try modeling this way and see if we can get a better result.
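The rule of thumb can be worked through with the class shares of active voters printed earlier. This is a sketch of the arithmetic only: it does not reproduce how GraphLab Create computes class_weights='auto' internally, and the helper `weighted_errors` is a hypothetical name introduced for illustration.

```python
share_non_voters = 0.831105   # majority class 'N'
share_voters = 0.168895       # minority class 'Y'

# Weight each class inversely to its share, scaled so the majority class has weight 1.
weight_non_voter = 1.0
weight_voter = share_non_voters / share_voters

def weighted_errors(missed_voters, missed_non_voters):
    """Hypothetical weighted error count: a misclassified voter costs weight_voter,
    a misclassified non-voter costs weight_non_voter."""
    return missed_voters * weight_voter + missed_non_voters * weight_non_voter

# One missed voter costs about as much as five missed non-voters.
cost_one_missed_voter = weighted_errors(1, 0)
cost_five_missed_non_voters = weighted_errors(0, 5)
```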

In [52]:
turnout_model = gl.logistic_classifier.create(train_data, target='target',features= enhanced_features,validation_set=test_data,
                                                          verbose=True, solver='newton', class_weights='auto', max_iterations=100)

In [53]:
turnout_model.evaluate(test_data)

{'accuracy': 0.8056647211956006,
 'auc': 0.8572725692396028,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |      Y       |        N        | 24173  |
 |      N       |        Y        | 80413  |
 |      Y       |        Y        | 66326  |
 |      N       |        N        | 367261 |
 +--------------+-----------------+--------+
 [4 rows x 3 columns],
 'f1_score': 0.5591515693101442,
 'log_loss': 0.47003355372634803,
 'precision': 0.4519998091850156,
 'recall': 0.7328920761555376,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+-----+-------+--------+
 | threshold |      fpr       | tpr |   p   |   n    |
 +-----------+----------------+-----+-------+--------+
 |    0.0    |      1.0       | 1.0 | 90499 | 4476

Accuracy is 80.6% using previous turnout + county as features with class weights vs. 86.7% using the unweighted model, significantly lower. Once again, this does not tell the whole story. Recall that the unweighted model performed as follows:

Accuracy of correctly predicting voters that will vote: 43%

Accuracy of correctly predicting voters that will not vote: 96%

Using the weighted model, performance was:

Accuracy of correctly predicting voters that will vote: 73%

Accuracy of correctly predicting voters that will not vote: 82%


If our objective is to predict voters that actually will vote, this is a significant improvement. The new model became more accurate at predicting voters at the expense of the accuracy of predicting non-voters. This is exactly what we needed.

To further improve the accuracy of correctly predicting voters (making it as close as possible to the accuracy of correctly predicting non-voters), we can lower the probability threshold for predicting a voter. Currently, every person with a probability score > 0.5 will be predicted as a voter, and those < 0.5 will be predicted as a non-voter. Let's move this threshold down to balance the accuracies.
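The effect of moving the threshold can be seen on a tiny made-up example. The scores and labels below are hypothetical and chosen only to show the trade-off; `rates_at` is an illustrative helper, not part of the notebook.

```python
# Hypothetical probability scores and true labels, for illustration only.
scores = [0.10, 0.30, 0.46, 0.48, 0.49, 0.52, 0.70, 0.90]
labels = ['N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'Y']

def rates_at(threshold):
    """Return (sensitivity, specificity) when score >= threshold predicts 'Y'."""
    predicted = ['Y' if s >= threshold else 'N' for s in scores]
    for_voters = [p for p, y in zip(predicted, labels) if y == 'Y']
    for_non_voters = [p for p, y in zip(predicted, labels) if y == 'N']
    sensitivity = for_voters.count('Y') / float(len(for_voters))
    specificity = for_non_voters.count('N') / float(len(for_non_voters))
    return sensitivity, specificity

at_half = rates_at(0.5)     # default-style cut-off
at_lower = rates_at(0.45)   # lower cut-off: more people are predicted to vote
```

Lowering the cut-off catches more of the actual voters but mislabels more non-voters as voters, which is the trade-off the threshold search has to balance.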

We now need to create a function that manually calculates accuracy, sensitivity and specificity.

The function below (get_model_performance) takes as input a classification model, an SFrame used by this model to make predictions, and a threshold for making predictions such that any probability at or above this threshold yields a 'positive' (voter) classification while any probability below this threshold yields a 'negative' (non-voter) classification. It also takes as input the column in the SFrame containing the classification target and the value of this column that represents a 'positive'. Here, the target column is 'target' and 'Y' (voted) is the positive value.

This function returns a 3-tuple:  accuracy, sensitivity, specificity

In [54]:
from __future__ import division # convert integer division to floating point
def get_model_performance(model, sframe, threshold, target_feature, positive_value):
    # NOTE: this function adds helper columns to the SFrame that is passed in
    sframe['percent'] = model.predict(sframe, output_type='probability') # add a column with the model-generated probabilities
    sframe['predict_positive'] = sframe['percent'] >= threshold # add a column that is 1 if a positive is predicted
    sframe['predict_negative'] = sframe['percent'] < threshold # add a column that is 1 if a negative is predicted
    sframe['true_positive'] = sframe[target_feature] == positive_value # add a column that is 1 if the row is actually positive
    sframe['true_negative'] = sframe[target_feature] != positive_value # add a column that is 1 if the row is actually negative
    
    num_observations = len(sframe) # total number of observations
    true_positives = sframe['true_positive'].sum() # number of actual positives
    true_negatives = sframe['true_negative'].sum() # number of actual negatives
    correct_positives = ((sframe['true_positive'] == 1) & (sframe['predict_positive'] == 1)).sum() # number of correct positive predictions
    correct_negatives = ((sframe['true_negative'] == 1) & (sframe['predict_negative'] == 1)).sum() # number of correct negative predictions
    sensitivity = correct_positives/true_positives # calculate sensitivity: correctly predicted positives / total positives
    specificity = correct_negatives/true_negatives  # calculate specificity: correctly predicted negatives / total negatives
    accuracy = (correct_positives + correct_negatives)/num_observations # calculate accuracy:  correct predictions / total observations
    return accuracy, sensitivity, specificity # return the 3-tuple

Let's test the above function

In [55]:
print get_model_performance(turnout_model, test_data, .5, 'target', 'Y')

(0.8056647211956006, 0.7328920761555376, 0.8203759878840406)


Let's use the above function to select the threshold that balances the accuracies.

In [56]:
min_difference = 1
best_threshold = 0
best_result = 0.0,0.0,0.0
for thresh in np.linspace(0.45, 0.5, num=100):
    output =  get_model_performance(turnout_model, train_data, thresh, 'target', 'Y')
    difference = output[1]-output[2]
    if abs(difference) < min_difference:
        min_difference = abs(difference)
        best_threshold = thresh
        best_result = output
print "The best threshold value for predicting positives is " + str(best_threshold)  
print "This yields accuracy of " + str(best_result[0]) + " sensitivity of " + str(best_result[1]) + " and specificity of " + str(best_result[2])
threshold_to_use = best_threshold  

The best threshold value for predicting positives is 0.487878787879
This yields accuracy of 0.772805858534 sensitivity of 0.785640541253 and specificity of 0.770194223469
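
The loop above calls get_model_performance, and therefore model.predict, once per candidate threshold. A lighter variant scores the data once and reuses the probabilities for every candidate. The sketch below shows that idea in plain Python on made-up scores; balance_threshold and its arguments are hypothetical names, not part of GraphLab.

```python
def balance_threshold(scores, labels, candidates, positive='Y'):
    """Return (threshold, sensitivity, specificity) with the smallest gap between the two rates."""
    positives = sum(1 for y in labels if y == positive)
    negatives = len(labels) - positives
    best = None
    for threshold in candidates:
        hits = sum(1 for s, y in zip(scores, labels) if y == positive and s >= threshold)
        rejections = sum(1 for s, y in zip(scores, labels) if y != positive and s < threshold)
        sensitivity = float(hits) / positives
        specificity = float(rejections) / negatives
        gap = abs(sensitivity - specificity)
        if best is None or gap < best[0]:
            best = (gap, threshold, sensitivity, specificity)
    return best[1:]


# Made-up scores, computed once and reused for every candidate threshold.
scores = [0.20, 0.35, 0.47, 0.49, 0.51, 0.55, 0.60, 0.80]
labels = ['N',  'N',  'N',  'Y',  'N',  'Y',  'Y',  'Y']
candidates = [0.46, 0.48, 0.50, 0.52]
print(balance_threshold(scores, labels, candidates))
```

With real data, scores would be the single output of model.predict(..., output_type='probability') and candidates the np.linspace grid used above.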


We now have a good model.  To summarize (measured on the training data at the tuned threshold):

Accuracy of correctly predicting voters that will vote (sensitivity): 79%

Accuracy of correctly predicting voters that will not vote (specificity): 77%

If model probability score < 0.488 then predict "Will not vote"

Else (probability score >= 0.488) predict "Will vote"

This model can be combined with the rule 'status_cd' = 'I' ----> "Will not vote" to make final predictions.
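
The decision rule summarized above can be sketched as a small function. The name predict_turnout and its arguments are hypothetical; 0.488 is the rounded threshold found above, and the status_cd check is applied first so that it overrides the model score.

```python
def predict_turnout(probability, status_cd, threshold=0.488):
    """Hypothetical helper: combine the status_cd rule with the tuned probability cutoff."""
    if status_cd == 'I':            # status_cd 'I' is always predicted as a non-voter
        return 'Will not vote'
    if probability < threshold:     # below the tuned cutoff
        return 'Will not vote'
    return 'Will vote'


print(predict_turnout(0.49, 'A'))   # above the cutoff, status 'A'
print(predict_turnout(0.49, 'I'))   # same score, but status 'I'
print(predict_turnout(0.40, 'A'))   # below the cutoff
```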

Let's explore the coefficients to determine the influence each has on the model.

In [57]:
coeff_sf = turnout_model.coefficients
coeff_sf.print_rows(num_rows=200) # print_rows prints the table itself; wrapping it in print would also print "None"

+---------------+--------------+-------+------------------+-----------------+
|      name     |    index     | class |      value       |      stderr     |
+---------------+--------------+-------+------------------+-----------------+
|  (intercept)  |     None     |   Y   |  -2.14075112037  |   7.9265846811  |
| congressional |      Y       |   Y   |  0.859944421914  | 0.0198984970322 |
| congressional |      N       |   Y   | -0.847431962115  | 0.0207891181476 |
|     local     |      N       |   Y   | -0.443263932581  | 0.0178110335079 |
|     local     |      Y       |   Y   |  1.84206357206   | 0.0185934806385 |
|  presidential |      Y       |   Y   |  0.345722065867  | 0.0170247357358 |
|  presidential |      N       |   Y   | -0.245998258174  |  0.02459794445  |
|  county_desc  |   ALAMANCE   |   Y   |  1.04003369759   |  7.92664002797  |
|  county_desc  |    AVERY     |   Y   |  3.37725713656   |  7.92748526281  |
|  county_desc  |  ALEXANDER   |   Y   |  0.937905062256  |  7.9

By far, the factor that has the most impact on projecting turnout for an upcoming local election is whether a person voted in the last local election.  The next most meaningful factor is whether a person voted in the last congressional election.  Whether a person voted in the last presidential election has little impact on projecting turnout for the next local election.  

The North Carolina county you live in has a large impact on whether you will vote in a local election.  That being said, county is very highly correlated with voting history.  This high correlation is demonstrated by the very small (1.6%) increase in accuracy we achieve by adding county to voting history.
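
The relative weight of the three voting-history features can be made concrete by turning the coefficients into odds ratios. In a logistic model, the difference between two coefficients of the same feature is a difference in log-odds, so exponentiating it gives the factor by which the odds of voting change, all else being equal. The sketch below copies the 'Y' and 'N' coefficients from the table above. The variable names are only illustrative.

```python
import math

# Coefficients for class 'Y', copied from the coefficient table above.
coefficients = {
    'local':         {'Y': 1.84206357206,  'N': -0.443263932581},
    'congressional': {'Y': 0.859944421914, 'N': -0.847431962115},
    'presidential':  {'Y': 0.345722065867, 'N': -0.245998258174},
}

odds_ratios = {}
for feature, values in coefficients.items():
    # Difference in log-odds between having voted (Y) and not having voted (N).
    log_odds_gap = values['Y'] - values['N']
    odds_ratios[feature] = math.exp(log_odds_gap)

# Print the features from the largest to the smallest odds ratio.
for feature in sorted(odds_ratios, key=odds_ratios.get, reverse=True):
    print('%s %.2f' % (feature, odds_ratios[feature]))
```

Having voted in the last local election multiplies the odds of voting by roughly 9.8 compared with not having voted in it. The corresponding factors are about 5.5 for the congressional election and about 1.8 for the presidential election.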

Let's explore the coefficients in more detail

In [58]:
turnout_model.save('turnout_model')

In [59]:
coeff_sf.save('turnout_coeff_sf')

Create a new dataframe for coefficient statistics, calculate these statistics, and add these to the dataframe.  First, create a pandas dataframe from the coefficient sframe.

In [60]:
coeff_df = coeff_sf.to_dataframe()

In [61]:
import statsmodels.robust.scale as sm # for median absolute deviation (MAD)

coeff_df['value'] = coeff_df['value'].astype('float64') # Ensure that values are float64 (astype returns a copy, so assign it back)
coeff_statistics_df = pd.DataFrame(data=None, index=None, columns=['COEF','MIN', 'MAX', 'RANGE', 'MAD'], 
                                 dtype='float64', copy=False) # Create an empty dataframe for statistics
names = coeff_df['name'].unique() # names has all the distinct feature names
# Iterate through all the features with more than 1 coefficient value (the categorical variables) 
# and add statistics about the coefficient values to coeff_statistics_df
for coeff in names:
    df = coeff_df[coeff_df['name'] == coeff]
    if len(df) > 1:
        new_row = pd.DataFrame([[coeff, df['value'].min(), df['value'].max(), df['value'].ptp(), 
                                 sm.mad(df['value'])]], columns=['COEF','MIN', 'MAX', 'RANGE', 'MAD']) 

        coeff_statistics_df = coeff_statistics_df.append(new_row, ignore_index=False)

coeff_statistics_df.sort_values('MAD', axis=0, ascending=False, inplace=True, 
                              kind='quicksort', na_position='last') # sort the dataframe on Median Absolute Deviation

In [62]:
print coeff_statistics_df

            COEF        MIN       MAX      RANGE       MAD
0          local  -0.443264  1.842064   2.285328  1.694116
0  congressional  -0.847432  0.859944   1.707376  1.265680
0    county_desc -12.149360  3.377257  15.526617  0.781381
0   presidential  -0.245998  0.345722   0.591720  0.438643


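The table above supports our previous conclusions: the local coefficients show the largest spread (MAD), followed by congressional, while presidential shows the smallest.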
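
To make the MAD column easier to interpret, here is a minimal sketch that recomputes it by hand for two features. It assumes that sm.mad centers on the median and divides the raw median absolute deviation by the normal-consistency constant of about 0.6745, which is consistent with the values in the table. The helpers median and scaled_mad are illustrative, not statsmodels functions.

```python
def median(values):
    """Median of a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


# 75th percentile of the standard normal distribution (assumed scaling constant).
NORMAL_CONSISTENCY = 0.6744897501960817


def scaled_mad(values):
    """Median absolute deviation from the median, divided by the scaling constant."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    return median(deviations) / NORMAL_CONSISTENCY


local_values = [-0.443263932581, 1.84206357206]           # 'N' and 'Y' coefficients for local
congressional_values = [-0.847431962115, 0.859944421914]  # 'N' and 'Y' for congressional
print(round(scaled_mad(local_values), 6))
print(round(scaled_mad(congressional_values), 6))
```

For a feature with only two coefficients, the MAD is simply half the range divided by the constant, so it carries the same information as RANGE. It becomes a distinct, outlier-resistant measure only for features with many levels, such as county_desc.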

Let's try using a combination of county and precinct (precinct_abbrv) as our geographical predictor instead of just county. We create the new column first.

In [63]:
model_sf['county_prec'] = model_sf['county_desc'] + model_sf['precinct_abbrv']

Next we create the feature list

In [64]:
new_enhanced_features = ['congressional', 'local', 'presidential',  'county_prec'] 

In [65]:
model_reduced = model_sf[['local', 'target', 'congressional', 'presidential', 'county_prec']]
train_data,test_data = model_reduced.random_split(.8)

In [66]:
reference_sf = gl.SFrame({'local':'U', # create a new SFrame with 1 row and the placeholder 'U' for all categorical variables
                    'target': 'N', 
                    'congressional': 'U', 
                    'presidential': 'U',
                    'county_prec': 'U'})
train_data = reference_sf.append(train_data) # append the full SFrame to the bottom of this 1 row SFrame 
print train_data.head()

+---------------+--------------+-------+--------------+--------+
| congressional | county_prec  | local | presidential | target |
+---------------+--------------+-------+--------------+--------+
|       U       |      U       |   U   |      U       |   N    |
|       Y       | ALAMANCE12W  |   N   |      Y       |   N    |
|       Y       | ALAMANCE10S  |   Y   |      Y       |   N    |
|       N       | ALAMANCE10S  |   N   |      Y       |   N    |
|       Y       | ALAMANCE03S  |   N   |      Y       |   N    |
|       N       | ALAMANCE12W  |   N   |      Y       |   N    |
|       Y       | ALAMANCE12S  |   N   |      Y       |   N    |
|       Y       | ALAMANCE126  |   N   |      Y       |   N    |
|       Y       | ALAMANCE12E  |   N   |      Y       |   N    |
|       N       | ALAMANCE1210 |   N   |      Y       |   N    |
+---------------+--------------+-------+--------------+--------+
[10 rows x 5 columns]



In [67]:
turnout_model = gl.logistic_classifier.create(train_data, target='target',features= new_enhanced_features, 
                                              validation_set=test_data, verbose=True, solver='newton', 
                                              class_weights='auto', max_iterations=100)

In [68]:
turnout_model.evaluate(test_data)

{'accuracy': 0.8075789559749362,
 'auc': 0.8858333549307208,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |      N       |        Y        | 85158  |
 |      Y       |        N        | 18393  |
 |      Y       |        Y        | 72219  |
 |      N       |        N        | 362378 |
 +--------------+-----------------+--------+
 [4 rows x 3 columns],
 'f1_score': 0.5824371242272843,
 'log_loss': 0.4291897182549305,
 'precision': 0.4588917059036581,
 'recall': 0.797013640577407,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+--------+
 | threshold |      fpr       |      tpr       |   p   |   n    |
 +-----------+----------------+----------------+-------+--------+
 |    0.0    |     

Accuracy is 80.8% using previous turnout + county and precinct as features vs. 80.6% for the model whose only geographic feature was county, a slight increase.  Recall that the classifier using only county performed as follows:

Accuracy of correctly predicting voters that will vote: 73%

Accuracy of correctly predicting voters that will not vote: 82%

The new model using county + precinct performed as follows:

Accuracy of correctly predicting voters that will vote: 80%

Accuracy of correctly predicting voters that will not vote: 81%
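
The two percentages for the new model can be checked directly against the confusion matrix printed by turnout_model.evaluate(test_data). The sketch below copies the four counts from that output and recomputes the rates. The variable names are only illustrative.

```python
# Counts copied from the confusion matrix printed above.
true_positives = 72219     # actual Y, predicted Y
false_negatives = 18393    # actual Y, predicted N
false_positives = 85158    # actual N, predicted Y
true_negatives = 362378    # actual N, predicted N

total = true_positives + false_negatives + false_positives + true_negatives

# Share of actual voters predicted as voters (reported as 'recall').
sensitivity = float(true_positives) / (true_positives + false_negatives)
# Share of actual non-voters predicted as non-voters.
specificity = float(true_negatives) / (true_negatives + false_positives)
# Share of all predictions that are correct.
accuracy = float(true_positives + true_negatives) / total
# Share of predicted voters who actually voted.
precision = float(true_positives) / (true_positives + false_positives)

print('sensitivity %.3f' % sensitivity)
print('specificity %.3f' % specificity)
print('accuracy    %.3f' % accuracy)
print('precision   %.3f' % precision)
```

The precision of about 0.46 is worth keeping in mind: because non-voters greatly outnumber voters in the test data, fewer than half of the people predicted to vote actually voted, even though sensitivity and specificity are both near 80%.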


If our objective is to predict voters who actually will vote, this is a significant improvement.  The new model became more accurate at predicting voters at the slight expense of accuracy in predicting non-voters.  This is exactly what we needed.  These numbers are so close that tuning may be of limited value, but we'll try it anyway to see if we can slightly improve the model.


In [69]:
min_difference = 1
best_threshold = 0
best_result = 0.0,0.0,0.0
for thresh in np.linspace(0.47, 0.5, num=100):
    output =  get_model_performance(turnout_model, train_data, thresh, 'target', 'Y')
    difference = output[1]-output[2]
    if abs(difference) < min_difference:
        min_difference = abs(difference)
        best_threshold = thresh
        best_result = output
print "The best threshold value for predicting positives is " + str(best_threshold)  
print "This yields accuracy of " + str(best_result[0]) + " sensitivity of " + str(best_result[1]) + " and specificity of " + str(best_result[2])
threshold_to_use = best_threshold  

The best threshold value for predicting positives is 0.489393939394
This yields accuracy of 0.804164881238 sensitivity of 0.804322688208 and specificity of 0.804132782698


Wow!  It looks like we hit the mark.  We'll use this new model to predict turnout for 2017 municipal elections by registered voter.  

In [70]:
coeff_sf = turnout_model.coefficients

In [71]:
turnout_model.save('enhanced_turnout_model')

In [72]:
coeff_sf.save('enhanced_turnout_coeff_sf')

Create a new dataframe for coefficient statistics, calculate these statistics, and add these to the dataframe.  First, create a pandas dataframe from the coefficient sframe.

In [73]:
coeff_df = coeff_sf.to_dataframe()

In [74]:
import statsmodels.robust.scale as sm # for median absolute deviation (MAD)

coeff_df['value'] = coeff_df['value'].astype('float64') # Ensure that values are float64 (astype returns a copy, so assign it back)
coeff_statistics_df = pd.DataFrame(data=None, index=None, columns=['COEF','MIN', 'MAX', 'RANGE', 'MAD'], 
                                 dtype='float64', copy=False) # Create an empty dataframe for statistics
names = coeff_df['name'].unique() # names has all the distinct feature names
# Iterate through all the features with more than 1 coefficient value (the categorical variables) 
# and add statistics about the coefficient values to coeff_statistics_df
for coeff in names:
    df = coeff_df[coeff_df['name'] == coeff]
    if len(df) > 1:
        new_row = pd.DataFrame([[coeff, df['value'].min(), df['value'].max(), df['value'].ptp(), 
                                 sm.mad(df['value'])]], columns=['COEF','MIN', 'MAX', 'RANGE', 'MAD']) 

        coeff_statistics_df = coeff_statistics_df.append(new_row, ignore_index=False)

coeff_statistics_df.sort_values('MAD', axis=0, ascending=False, inplace=True, 
                              kind='quicksort', na_position='last') # sort the dataframe on Median Absolute Deviation

In [75]:
print coeff_statistics_df

            COEF        MIN       MAX      RANGE       MAD
0          local  -0.483707  1.892583   2.376290  1.761546
0  congressional  -0.859728  0.904866   1.764594  1.308095
0    county_prec -13.310620  5.666729  18.977349  1.041924
0   presidential  -0.239070  0.393148   0.632218  0.468664


Let's compare the above to the previous model.  The MAD for local, congressional, and presidential previous participation is pretty much unchanged.  The MAD for county_prec (1.04) is noticeably higher than for county alone (0.78).  This tells us that adding precinct to the county gives us meaningful information about turnout probability.  That being said, county_prec is still highly correlated with the turnout history, so we need to use this information carefully.  For example, in some precincts there may be no real person who actually voted in all of the last 3 elections, so if we were to model hypothetical people assuming 100% turnout in those elections, our predictions for those precincts would probably be off.
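
One way to guard against scoring hypothetical people with no real counterpart is to check whether a feature combination actually occurs in the training data before trusting its prediction. The sketch below does this in plain Python. The rows are made up for illustration (only the precinct key format follows the output above), and is_supported is a hypothetical helper.

```python
from collections import Counter

# Made-up training rows: (county_prec, local, congressional, presidential).
training_rows = [
    ('ALAMANCE12W', 'N', 'Y', 'Y'),
    ('ALAMANCE12W', 'N', 'N', 'Y'),
    ('ALAMANCE10S', 'Y', 'Y', 'Y'),
    ('ALAMANCE10S', 'N', 'N', 'Y'),
]
observed_profiles = Counter(training_rows)


def is_supported(profile, min_count=1):
    """True if at least min_count training rows share this exact feature combination."""
    return observed_profiles[profile] >= min_count


print(is_supported(('ALAMANCE10S', 'Y', 'Y', 'Y')))  # combination seen in training
print(is_supported(('ALAMANCE12W', 'Y', 'Y', 'Y')))  # hypothetical voter, never observed
```

Raising min_count gives a stricter check, at the cost of flagging more small precincts.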

One final thought.  Municipal elections are very different from each other.  Some have no competitive races most years.  Others may be competitive some years and not others.  Statewide analyses and predictions are therefore of limited use.  To model local elections most accurately, modeling needs to be done for each unique municipality, considering the specific circumstances of each election.  I would suggest these be done a county at a time, for each municipality in a county.  These models could be combined with the political preference model to help identify specific voters that have specific turnout probabilities combined with specific political leanings.  These voters could be contacted for focus groups and surveys for use in journalistic projects.