# Table of Contents:
* [Import Packages](#first-bullet)
* [Read In Data](#two-bullet)
* [Fill NAs](#three-bullet)
* [Create Columns For Model](#four-bullet)
* [Create Dummy Variables](#five-bullet)
* [Run Models](#infinity-bullet)
  * [Logistic Regression](#six-bullet)
  * [Naive Bayes](#seven-bullet)
  * [Random Forest](#eight-bullet)
  * [AdaBoostClassifier](#nine-bullet)
* [Add In Donations Data](#ten-bullet)
  * [Logistic Regression](#sec-two-bullet)
  * [Naive Bayes](#sec-two-one-bullet)

Now that we have prepared our data by combining the two BOE voter files into one, we can begin modeling.  The goal of our model is to classify whether registered voters will participate in an election or stay home.  We will be training the model on 2014 data, and then testing it on 2016.  We will try a variety of classification models to see which performs the best.  We will also try running models that use the donation data (by congressional district) that we scraped.

# Import Packages <a class="anchor" id="first-bullet"></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.linear_model import LogisticRegression

In [3]:
from sklearn.metrics import confusion_matrix

In [4]:
import re

In [41]:
from sklearn.ensemble import RandomForestClassifier 

# Read In Data <a class="anchor" id="two-bullet"></a>

To work with the voter file data on the AWS free tier EC2 instance, we will need to subset it down from the 7.5 MM records.  By using 1 MM randomly selected records from the voter file, we will be able to run our models for evaluation.
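
As a small illustration of the subsetting step, the sketch below samples rows without replacement from a toy frame. The frame `toy_vf` and the `random_state` value are assumptions made for illustration; the cells in this notebook do not fix a seed, so the exact 1 MM sample used here cannot be regenerated.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full voter file (hypothetical data, not the real extract)
toy_vf = pd.DataFrame({
    "voter_id": np.arange(10_000),
    "birth_year": np.random.default_rng(0).integers(1920, 2000, size=10_000),
})

# Sample without replacement; fixing random_state makes the subset repeatable
subset_a = toy_vf.sample(n=1_000, replace=False, random_state=42)
subset_b = toy_vf.sample(n=1_000, replace=False, random_state=42)

print(subset_a.shape)
print(subset_a.index.equals(subset_b.index))
```

Passing a seed is optional, but it lets a later run rebuild the same subset instead of relying on a saved CSV.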

In [5]:
pd.set_option('display.max_columns', 500)

In [7]:
# Due to the size of the files, working with the entire dataset is not possible on the AWS free tier.  We are subsetting the 
# voters down to a more manageable 1 million
#import pyarrow

In [8]:
#combined_vf = pd.read_parquet('./data/combined_vf.parquet', engine='pyarrow')

In [9]:
#combined_sample_subset = combined_vf.sample(n=1_000_000, replace=False)

In [10]:
#combined_sample_subset.to_csv('./data/combined_sample_subset.csv')

In [7]:
#combined_vf = pd.read_csv('./data/combined_vf.csv')

In [5]:
combined_sample_subset = pd.read_csv('combined_sample_subset.csv')

In [6]:
df_money = pd.read_csv('donations.csv')

In [7]:
combined_sample_subset.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,gen_16_dem,gen_12_dem,gen_18_dem,gen_14_dem,gen_10_dem,gen_16_ind,gen_12_ind,gen_18_ind,gen_14_ind,gen_10_ind
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,07/27/2018,UNA,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,07/08/2005,UNA,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,09/29/2018,DEM,9.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,BL407163,1993.0,,,25.0,B,NL,04/26/2012,DEM,,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,06/23/1982,DEM,5.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Fill NAs <a class="anchor" id="three-bullet"></a>

Our model will not run if we have NA values.  We will fill them with 0s.

In [7]:
combined_sample_subset = combined_sample_subset.fillna(value=0)
# for the entire voter file there hopefully won't be any nulls
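
One side effect of filling with 0 is worth seeing on a toy example: a missing district code becomes a literal district 0, which is why row 3 above shows 0.0 for its districts after the fill. The frame below is hypothetical and only mirrors two of the real column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using two column names from the voter file
toy = pd.DataFrame({
    "cong_dist_abbrv": [1.0, np.nan, 9.0],
    "gen_14_voted": [1.0, np.nan, 0.0],
})

# Count missing values per column before overwriting them
print(toy.isna().sum())

filled = toy.fillna(value=0)

# After filling, a missing district is indistinguishable from a district coded 0
print(filled["cong_dist_abbrv"].tolist())
```

Checking `isna().sum()` first shows how many rows will end up in that artificial "0" category.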

In [8]:
combined_sample_subset.shape

(1000000, 31)

In [9]:
combined_sample_subset['registr_dt'] = pd.to_datetime(combined_sample_subset['registr_dt'], infer_datetime_format=True)

In [12]:
combined_sample_subset.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,gen_16_dem,gen_12_dem,gen_18_dem,gen_14_dem,gen_10_dem,gen_16_ind,gen_12_ind,gen_18_ind,gen_14_ind,gen_10_ind
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Update any records that have 2 for voted to 1

In some cases, voters request an absentee ballot, and then vote in person.  Because they requested the absentee ballot, the voter file considers them both an absentee voter and an in-person voter.  To eliminate these cases, and show that they only voted once, we will update all instances of 2 in the voted columns to 1.

In [10]:
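list_of_years = [10, 12, 14, 16, 18]
for i in list_of_years:
    col = 'gen_' + str(i) + '_voted'
    # skip any year whose voted column is missing instead of swallowing every error
    if col in combined_sample_subset.columns:
        combined_sample_subset.loc[combined_sample_subset[col] == 2, col] = 1
        print(col)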

gen_10_voted
gen_12_voted
gen_14_voted
gen_16_voted
gen_18_voted
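
The same recoding can be sketched without a loop. This is an alternative under the assumption that every voted column ends in `_voted`; the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; birth_year holds a 2 to show it is left untouched
toy = pd.DataFrame({
    "gen_14_voted": [0.0, 1.0, 2.0],
    "gen_16_voted": [2.0, 0.0, 1.0],
    "birth_year": [1950.0, 1980.0, 2.0],
})

# Select only the voted columns, then map the value 2 to 1 in one call
voted_cols = [c for c in toy.columns if c.endswith("_voted")]
toy[voted_cols] = toy[voted_cols].replace(2, 1)

print(toy)
```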


# Create Columns For Model <a class="anchor" id="four-bullet"></a>

To test how this kind of analysis would work in reality, I need to train the model on the data that would be available at the time, and then create predictions for the following cycle.  If I am evaluating how the model would perform on 2016, I cannot show it any 2016 data, as this would not be available yet.  I am going to train the model on 2014 data and then test it on 2016 data.  To do this, I am turning all yearly data (party voted for in a year and voted/didn't vote in a year) into relative values.  For the 2016 data, t-0 = 2016, t-2 = 2014, etc.  For 2014, t-0 = 2014, t-2 = 2012, etc.  I am also excluding all data for the year in question and any later year (e.g. for 2016, not including any 2016 or 2018 data in my feature set).
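
A worked example of the relative naming may help. The helper `relative_name` below is a hypothetical single-column version of the renaming idea, written only to show what the names look like for each target year.

```python
import re

def relative_name(col, target_year):
    """Rename a yearly column to its offset from target_year (two-digit years)."""
    match = re.search(r"\d\d", col)
    if match is None:
        return col  # not a yearly column, leave it unchanged
    offset = int(match.group()) - target_year
    return re.sub(r"\d\d", "", col) + "_" + str(offset)

example_cols = ["gen_12_voted", "gen_14_voted", "gen_16_voted", "birth_year"]
names_14 = [relative_name(c, 14) for c in example_cols]
names_16 = [relative_name(c, 16) for c in example_cols]

print(names_14)
print(names_16)
```

The same physical column (`gen_14_voted`) is the target `gen__voted_0` when training on 2014 and the feature `gen__voted_-2` when testing on 2016.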

In [11]:
# turn reg date into a numeric column so that it can be used in the model
list_of_reg_years = []
for i in combined_sample_subset['registr_dt']:
    try:
        list_of_reg_years.append(i.year)
    except:
        list_of_reg_years.append(0)

In [12]:
combined_sample_subset['reg_yr'] = list_of_reg_years
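
For comparison, the registration year can also be pulled out with the `.dt` accessor instead of a Python loop. This is a sketch on hypothetical values; the explicit `format` string is an assumption based on the `MM/DD/YYYY` dates shown earlier.

```python
import pandas as pd

# Hypothetical registration dates, including one missing value
reg = pd.Series(["07/27/2018", "06/23/1982", None])

# Parse with an explicit format; missing entries become NaT
reg_dt = pd.to_datetime(reg, format="%m/%d/%Y")

# Extract the year, using 0 for missing dates as the loop above does
reg_yr = reg_dt.dt.year.fillna(0).astype(int)

print(reg_yr.tolist())
```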

In [13]:
# build the rename mapping for a single target year (2016) to check the naming scheme
target = 16
columns_dict = {}
for i in combined_sample_subset.columns:
    if re.search(r'\d\d', i):
        year = int(re.findall(r'\d\d', i)[0])
        column_name = re.sub(r'\d\d', '', i) + '_' + str(year - target)
        columns_dict[i] = column_name

In [14]:
# make a new DF that renames all of the yearly columns to their difference from the target year. this way the model can be trained
# on one year, and then tested on the subsequent cycle
def rename_year_columns(target_year, datafr):
    columns_dict = {}
    for i in datafr.columns:
        # raw strings avoid invalid-escape warnings for \d
        if re.search(r'\d\d', i):
            year = int(re.findall(r'\d\d', i)[0])
            column_name = re.sub(r'\d\d', '', i) + '_' + str(year - target_year)
            columns_dict[i] = column_name
    new_data = datafr.rename(columns=columns_dict)
    return new_data

In [15]:
target_2016 = rename_year_columns(16,combined_sample_subset)
target_2014 = rename_year_columns(14,combined_sample_subset)

In [21]:
target_2016.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,gen__dem_-4,gen__dem_2,gen__dem_-2,gen__dem_-6,gen__ind_0,gen__ind_-4,gen__ind_2,gen__ind_-2,gen__ind_-6,reg_yr
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2018
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,2005
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2018
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2012
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1982


In [22]:
target_2014.shape

(1000000, 32)

In [23]:
target_2016.shape

(1000000, 32)

# Create Dummy Variables <a class="anchor" id="five-bullet"></a>

Now that we have two data frames, one for training and one for testing, we need to add dummy variables for the categorical columns we want to include.  This covers non-numeric columns (e.g. race and party) as well as columns that look numeric but are really categorical (e.g. house/senate district).

In [16]:
# create dummy cols
dummy_df_race = pd.get_dummies(combined_sample_subset['race_code'], prefix='race')
dummy_df_ethnic = pd.get_dummies(combined_sample_subset['ethnic_code'], prefix='ethnic')
dummy_df_party = pd.get_dummies(combined_sample_subset['party_cd'], prefix='party')
dummy_df_nc_house = pd.get_dummies(combined_sample_subset['nc_house_abbrv'], prefix='nc_house') 
dummy_df_nc_sen = pd.get_dummies(combined_sample_subset['nc_senate_abbrv'], prefix='nc_senate')  
dummy_df_cong_dist = pd.get_dummies(combined_sample_subset['cong_dist_abbrv'], prefix='cong_dist')  

In [17]:
def merge_dummy_vars(datafr_name):
 
    
    # combine all into one df
    datafr_name = datafr_name.join(dummy_df_race)
    datafr_name = datafr_name.join(dummy_df_ethnic)
    datafr_name = datafr_name.join(dummy_df_party)
    datafr_name = datafr_name.join(dummy_df_nc_house)
    datafr_name = datafr_name.join(dummy_df_nc_sen)
    datafr_name = datafr_name.join(dummy_df_cong_dist)
    return datafr_name
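
To make the encoding concrete, here is a minimal sketch on a hypothetical four-row frame. It shows that a district stored as a float still gets one indicator column per distinct value, which is what produces names like `cong_dist_9.0` in the output further down.

```python
import pandas as pd

# Hypothetical toy frame using two of the real column names
toy = pd.DataFrame({
    "party_cd": ["DEM", "REP", "UNA", "DEM"],
    "cong_dist_abbrv": [1.0, 9.0, 1.0, 5.0],
})

# One indicator column per category, for text and numeric-looking codes alike
party_dummies = pd.get_dummies(toy["party_cd"], prefix="party")
dist_dummies = pd.get_dummies(toy["cong_dist_abbrv"], prefix="cong_dist")

# join aligns on the row index, so the dummies line up with their source rows
encoded = toy.join(party_dummies).join(dist_dummies)

print(list(encoded.columns))
```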

In [18]:
target_2016 = merge_dummy_vars(target_2016)

In [19]:
target_2016.gen__voted_0.value_counts()

1.0    633608
0.0    366392
Name: gen__voted_0, dtype: int64

In [20]:
target_2014 = merge_dummy_vars(target_2014)

In [29]:
target_2016.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,cong_dist_4.0,cong_dist_5.0,cong_dist_6.0,cong_dist_7.0,cong_dist_8.0,cong_dist_9.0,cong_dist_10.0,cong_dist_11.0,cong_dist_12.0,cong_dist_13.0
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,0,0,0,0,0,0,0,0,0,0
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,0,0,0,0,0,0,0,0,0,0
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,0,0,0,0,0,1,0,0,0,0
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,0,0,0,0,0,0,0,0,0,0
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,0,1,0,0,0,0,0,0,0,0


In [30]:
target_2014.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,cong_dist_4.0,cong_dist_5.0,cong_dist_6.0,cong_dist_7.0,cong_dist_8.0,cong_dist_9.0,cong_dist_10.0,cong_dist_11.0,cong_dist_12.0,cong_dist_13.0
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,0,0,0,0,0,0,0,0,0,0
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,0,0,0,0,0,0,0,0,0,0
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,0,0,0,0,0,1,0,0,0,0
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,0,0,0,0,0,0,0,0,0,0
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,0,1,0,0,0,0,0,0,0,0


In [31]:
# train model on '14
list_of_cols_14 = list(target_2014.columns)
# test model on '16
list_of_cols_16 = list(target_2016.columns)

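Below are the columns in the dataframes that we do not want to include as features.  We exclude every yearly column for t = 0 (the election being predicted) or with a positive t (e.g. gen__voted_2), since that information would not be available before the election.  We also exclude t = -6: it exists only in the 2016 frame (as 2010 data), while the 2014 frame has nothing that far back, and the training and test sets need the same feature columns.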

In [32]:
list_to_rm = [ 'nc_house_abbrv',
 'nc_senate_abbrv',
 'race_code',
 'ethnic_code',
 'party_cd',
 'cong_dist_abbrv',
'Unnamed: 0',
 'gen__voted_0',
 'gen__rep_0',
 'gen__dem_0',
 'gen__ind_0',
 'gen__voted_-6',
 'gen__rep_-6',
 'gen__dem_-6',
 'gen__ind_-6',
 'gen__voted_2',
 'gen__rep_2',
 'gen__dem_2',
 'gen__ind_2',  
 'gen__voted_4',
 'gen__rep_4',
 'gen__dem_4',
 'gen__ind_4',
 'registr_dt',
 'Unnamed: 0.1'
]
for i in list_to_rm:
    try:
        list_of_cols_16.remove(i)
    except:
        pass

In [33]:
for i in list_to_rm:
    try:
        list_of_cols_14.remove(i)
    except:
        pass
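
The exclusion rule for the yearly columns can also be expressed as a pattern instead of a hand-written list. The sketch below is an assumption-laden alternative: `keep_as_feature` and `ALLOWED_OFFSETS` are hypothetical names, and the regex assumes yearly columns always look like `gen__<word>_<offset>`. It does not handle the non-yearly columns (IDs, raw categorical codes), which would still need to be dropped separately.

```python
import re

# Offsets that exist in both the 2014 and the 2016 frame
ALLOWED_OFFSETS = {-2, -4}

def keep_as_feature(col):
    """Return False for yearly columns whose offset is not allowed."""
    match = re.fullmatch(r"gen__[a-z]+_(-?\d+)", col)
    if match is None:
        return True  # not a yearly column; decided elsewhere
    return int(match.group(1)) in ALLOWED_OFFSETS

cols = ["birth_age", "gen__voted_0", "gen__voted_-2",
        "gen__dem_-4", "gen__rep_2", "gen__ind_-6"]
features = [c for c in cols if keep_as_feature(c)]

print(features)
```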

In [35]:
# train model on '14 data
X_train = target_2014[list_of_cols_14]

y_train = target_2014['gen__voted_0']

X_test = target_2016[list_of_cols_16]


y_test = target_2016['gen__voted_0']

In [36]:
X_train.shape

(1000000, 216)

In [37]:
X_test.shape

(1000000, 216)

# Run Models <a class="anchor" id="infinity-bullet"></a>

Below, we will try out a variety of classification models on our data.  At the end of the process, we will compare the performance of each of the models.

## Logistic Regression Model <a class="anchor" id="six-bullet"></a>

In [32]:
lr_model = LogisticRegression(penalty='l1', C=.25, verbose=100)
lr_model.fit(X_train, y_train)

[LibLinear]



LogisticRegression(C=0.25, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=100, warm_start=False)
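
The output above shows `solver='liblinear'` being picked by default, which supports the L1 penalty. In scikit-learn 0.22 and later the default solver is `lbfgs`, which does not, so the same constructor call would raise an error there. A minimal sketch that names the solver explicitly, run on synthetic data rather than the voter file:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the voter file)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Name the solver explicitly so the L1 penalty works across versions
model = LogisticRegression(penalty="l1", C=0.25, solver="liblinear")
model.fit(X, y)

# L1 regularization can drive some coefficients exactly to zero
n_zero = int((model.coef_[0] == 0).sum())
print(model.coef_.shape, n_zero)
```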

In [33]:
lr_model_preds_train = lr_model.predict(X_train)


In [34]:
lr_model_preds_test = lr_model.predict(X_test)

In [35]:
confusion_matrix(y_train, lr_model_preds_train)# Predicted values.

array([[527024,  82677],
       [111194, 279105]])

In [36]:
tn, fp, fn, tp = confusion_matrix(y_train, lr_model_preds_train).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.8644
Sensitivity: 0.7151


Sensitivity - the percentage of actual voters who are correctly predicted to vote.

Specificity - the percentage of non-voters who are correctly identified as not voting.
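
Since the same four lines are repeated for every model, the two definitions can be wrapped in a small helper. `specificity_sensitivity` is a hypothetical name; the matrix passed in is the logistic regression training matrix shown above, laid out as scikit-learn returns it (rows are actual classes, columns are predicted classes).

```python
import numpy as np

def specificity_sensitivity(cm):
    """Return (specificity, sensitivity) for a 2x2 matrix [[tn, fp], [fn, tp]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    spec = float(tn / (tn + fp))  # share of non-voters predicted not to vote
    sens = float(tp / (tp + fn))  # share of voters predicted to vote
    return spec, sens

train_cm = [[527024, 82677], [111194, 279105]]
spec, sens = specificity_sensitivity(train_cm)

print(f"Specificity: {spec:.4f}")
print(f"Sensitivity: {sens:.4f}")
```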

In [37]:
confusion_matrix(y_test, lr_model_preds_test)# Predicted values.

array([[334690,  31702],
       [275661, 357947]])

In [38]:
tn, fp, fn, tp = confusion_matrix(y_test, lr_model_preds_test).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.9135
Sensitivity: 0.5649


In [39]:
# look at coefs

In [40]:
coefs = pd.DataFrame(list_of_cols_14)
coefs['val'] = lr_model.coef_[0]
coefs.sort_values(by='val')

Unnamed: 0,0,val
30,nc_house_0.0,-0.857699
202,cong_dist_0.0,-0.510932
110,nc_house_80.0,-0.429151
111,nc_house_81.0,-0.423833
170,nc_senate_19.0,-0.388423
172,nc_senate_21.0,-0.374003
115,nc_house_85.0,-0.336217
117,nc_house_87.0,-0.305275
201,nc_senate_50.0,-0.294299
124,nc_house_94.0,-0.289108


## Naive Bayes Model <a class="anchor" id="seven-bullet"></a>

In [59]:
from sklearn.naive_bayes import MultinomialNB

In [72]:
mb_model = MultinomialNB(alpha=.2, fit_prior = True)

In [73]:
mb_model.fit(X_train,y_train);
#mb_model = MultinomialNB()

In [74]:
mb_model_preds_train = mb_model.predict(X_train)


In [75]:
mb_model_preds_test = mb_model.predict(X_test)

In [76]:
confusion_matrix(y_train, mb_model_preds_train)# Predicted values.

array([[438833, 170868],
       [101189, 289110]], dtype=int64)

In [77]:
tn, fp, fn, tp = confusion_matrix(y_train, mb_model_preds_train).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.7198
Sensitivity: 0.7407


In [78]:
confusion_matrix(y_test, mb_model_preds_test)# Predicted values.

array([[268944,  97448],
       [263065, 370543]], dtype=int64)

In [79]:
tn, fp, fn, tp = confusion_matrix(y_test, mb_model_preds_test).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.734
Sensitivity: 0.5848


In [80]:
coefs = pd.DataFrame(list_of_cols_14)
coefs['val'] = mb_model.coef_[0]
coefs.sort_values(by='val')

Unnamed: 0,0,val
26,party_GRE,-17.835170
24,party_CST,-17.456054
23,party_0,-15.718316
19,ethnic_0,-15.718316
11,race_0,-15.718316
27,party_LIB,-14.320982
15,race_M,-14.013491
45,nc_house_15.0,-13.859140
72,nc_house_42.0,-13.777421
77,nc_house_47.0,-13.682177


## Random Forest <a class="anchor" id="eight-bullet"></a>

In [42]:
rf_model= RandomForestClassifier(n_estimators  = 100, verbose = 100)
rf_model.fit(X_train,y_train)

building tree 1 of 100
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.8s remaining:    0.0s
building tree 2 of 100
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   19.7s remaining:    0.0s
building tree 3 of 100
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   29.2s remaining:    0.0s
building tree 4 of 100
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   39.1s remaining:    0.0s
building tree 5 of 100
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   48.7s remaining:    0.0s
building tree 6 of 100
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   58.3s remaining:    0.0s
building tree 7 of 100
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  1.1min remaining:    0.0s
building tree 8 of 100
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  1.3min remaining:    0.0s
building tree 9 of 100
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.5min remaining:    0.0s
building tree 10 of 100
[Parallel(n_jobs=1)]: Done  10 out of  10 | elaps

[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed: 12.8min remaining:    0.0s
building tree 81 of 100
[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 12.9min remaining:    0.0s
building tree 82 of 100
[Parallel(n_jobs=1)]: Done  82 out of  82 | elapsed: 13.1min remaining:    0.0s
building tree 83 of 100
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed: 13.2min remaining:    0.0s
building tree 84 of 100
[Parallel(n_jobs=1)]: Done  84 out of  84 | elapsed: 13.4min remaining:    0.0s
building tree 85 of 100
[Parallel(n_jobs=1)]: Done  85 out of  85 | elapsed: 13.6min remaining:    0.0s
building tree 86 of 100
[Parallel(n_jobs=1)]: Done  86 out of  86 | elapsed: 13.7min remaining:    0.0s
building tree 87 of 100
[Parallel(n_jobs=1)]: Done  87 out of  87 | elapsed: 13.9min remaining:    0.0s
building tree 88 of 100
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed: 14.0min remaining:    0.0s
building tree 89 of 100
[Parallel(n_jobs=1)]: Done  89 out of  89 | elapsed: 14.2min rem

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=100,
            warm_start=False)

In [43]:
rf_model_preds_train = rf_model.predict(X_train)


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    3.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    3.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    4.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    5.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    6.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    6.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    7.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:    8.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    9.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  13 out of  1

In [44]:
rf_model_preds_test = rf_model.predict(X_test)

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    3.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    4.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    5.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    6.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    7.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    8.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:    9.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    9.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  13 out of  1

In [46]:
confusion_matrix(y_train, rf_model_preds_train)# Predicted values.

array([[596021,  13680],
       [ 18604, 371695]])

In [50]:
tn, fp, fn, tp = confusion_matrix(y_train, rf_model_preds_train).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

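Specificity: 0.9089
Sensitivity: 0.5556

Note: this output does not match the training confusion matrix above, and the cell numbers (In [50] after In [46]) show the cell was run out of order.  Computed from that matrix, specificity is 596021 / 609701 ≈ 0.9776 and sensitivity is 371695 / 390299 ≈ 0.9523, so the random forest fits the 2014 training data far more closely than it fits the 2016 test data below.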


In [48]:
confusion_matrix(y_test, rf_model_preds_test)# Predicted values.

array([[335566,  30826],
       [290559, 343049]])

In [49]:
tn, fp, fn, tp = confusion_matrix(y_test, rf_model_preds_test).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.9159
Sensitivity: 0.5414


## AdaBoost Classifier <a class="anchor" id="nine-bullet"></a>

In [81]:
from sklearn.model_selection import  GridSearchCV

In [82]:
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier, AdaBoostClassifier

In [91]:
ada_model = AdaBoostClassifier(base_estimator=MultinomialNB(alpha=.2, fit_prior = True), n_estimators=100)

In [92]:
ada_model.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=MultinomialNB(alpha=0.2, class_prior=None, fit_prior=True),
          learning_rate=1.0, n_estimators=100, random_state=None)

In [93]:
ada_model_preds_train = ada_model.predict(X_train)


In [94]:
ada_model_preds_test =ada_model.predict(X_test)

In [95]:
confusion_matrix(y_train, ada_model_preds_train)# Predicted values.

array([[468945, 140756],
       [148964, 241335]], dtype=int64)

In [96]:
tn, fp, fn, tp = confusion_matrix(y_train, ada_model_preds_train).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.7691
Sensitivity: 0.6183


In [97]:
confusion_matrix(y_test, ada_model_preds_test)# Predicted values.

array([[276723,  89669],
       [336936, 296672]], dtype=int64)

In [98]:
tn, fp, fn, tp = confusion_matrix(y_test, ada_model_preds_test).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.7553
Sensitivity: 0.4682
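
To put the four models side by side, the test-set confusion matrices printed above can be collected into one table. This is a sketch that only re-uses the recorded matrices; accuracy is an added column that the notebook does not otherwise compute.

```python
import numpy as np
import pandas as pd

# 2016 test confusion matrices copied from the outputs above
test_matrices = {
    "Logistic Regression": [[334690, 31702], [275661, 357947]],
    "Naive Bayes": [[268944, 97448], [263065, 370543]],
    "Random Forest": [[335566, 30826], [290559, 343049]],
    "AdaBoost (NB base)": [[276723, 89669], [336936, 296672]],
}

rows = []
for name, cm in test_matrices.items():
    tn, fp, fn, tp = np.asarray(cm).ravel()
    rows.append({
        "model": name,
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    })

summary = pd.DataFrame(rows).set_index("model").round(4)
print(summary.sort_values("accuracy", ascending=False))
```

For context, 633,608 of the 1,000,000 sampled voters voted in 2016, so predicting "votes" for everyone would already score about 0.63 accuracy on the test set.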


# Add in Donation Data To Model <a class="anchor" id="ten-bullet"></a>

To see if the donation data by congressional district that we pulled in from Follow The Money improves our model's performance, we will append it to our training and test sets and include the number of donors and the donation totals as features in our model.

In [21]:
# keep only US House races in North Carolina; .copy() gives an independent frame so the
# column assignments below do not trigger SettingWithCopyWarning
mask = df_money['Office Sought'].str.contains('US HOUSE DISTRICT NC')
df_cong_money = df_money[mask].copy()
# pull the district number out of the office name
cds = []
for i in df_cong_money['Office Sought']:
    j = re.findall(r'\d+', i)
    cd = int(j[0])
    cds.append(cd)
df_cong_money['cd_number'] = cds


In [22]:
df_cong_money['party'] = df_cong_money['General Party'].map({'DEMOCRATIC' : 'DEM', 'REPUBLICAN' : 'REP'})



In [23]:
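def f(row):
    # number of donation records for the year held in the global `year`
    # (set by the loop below); 0 for rows from any other election year
    if row['Election Year'] == year:
        val = row['# of Records']
    else:
        val = 0
    return val

In [24]:
def g(row):
    # total dollars donated for the global `year`; 0 for rows from any other election year
    if row['Election Year'] == year:
        val = row['Total $']
    else:
        val = 0
    return val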

In [25]:
years = [2010, 2012, 2014, 2016, 2018]
for i in years:
    year = i
    column = 'donors_' + str(year)[2:]
    df_cong_money[column] = df_cong_money.apply(f,axis=1)
    column_b = 'donations_' + str(year)[2:]
    df_cong_money[column_b] = df_cong_money.apply(g,axis=1)



In [26]:
grouped_df_money = df_cong_money[['party', 'cd_number', 'donors_10','donors_12','donors_14','donors_16','donors_18','donations_10','donations_12','donations_14','donations_16','donations_18']].groupby(['party', 'cd_number']).sum()

In [27]:
grouped_df_money.reset_index(level=['party', 'cd_number'], inplace=True)
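
The per-year columns and the group-by can also be produced in one step with a pivot table. This is an alternative sketch on hypothetical rows, shown for the donor counts only; it re-uses the real column names `Election Year` and `# of Records`.

```python
import pandas as pd

# Hypothetical donation rows with the same column names as the real file
toy = pd.DataFrame({
    "party": ["DEM", "DEM", "REP", "DEM"],
    "cd_number": [1, 1, 1, 2],
    "Election Year": [2014, 2016, 2014, 2014],
    "# of Records": [10, 20, 5, 7],
})

# One row per (party, district), one column per election year, summed
donors = toy.pivot_table(
    index=["party", "cd_number"],
    columns="Election Year",
    values="# of Records",
    aggfunc="sum",
    fill_value=0,
)

# Rename 2014 -> donors_14 etc. to match the naming used above
donors.columns = [f"donors_{str(y)[2:]}" for y in donors.columns]
donors = donors.reset_index()

print(donors)
```

This avoids the row-wise `apply` calls and the dependence on a global `year` variable.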

In [28]:
grouped_df_money.head()

Unnamed: 0,party,cd_number,donors_10,donors_12,donors_14,donors_16,donors_18,donations_10,donations_12,donations_14,donations_16,donations_18
0,DEM,1,727.0,889.0,422.0,807.0,0,824515.0,899936.0,532059.0,1104049.0,0
1,DEM,2,1094.0,198.0,4406.0,181.0,0,1147768.0,94558.0,2068316.0,132454.0,0
2,DEM,3,18.0,55.0,46.0,0.0,0,9702.0,15672.0,25388.0,0.0,0
3,DEM,4,1290.0,1561.0,952.0,902.0,0,980731.0,1147338.0,838259.0,785244.0,0
4,DEM,5,612.0,305.0,63.0,18.0,0,338870.0,190003.0,17955.0,8674.0,0


In [29]:
target_money_2016 = rename_year_columns(16,grouped_df_money)
target_money_2014 = rename_year_columns(14,grouped_df_money)

In [30]:
target_money_2016.head()

Unnamed: 0,party,cd_number,donors__-6,donors__-4,donors__-2,donors__0,donors__2,donations__-6,donations__-4,donations__-2,donations__0,donations__2
0,DEM,1,727.0,889.0,422.0,807.0,0,824515.0,899936.0,532059.0,1104049.0,0
1,DEM,2,1094.0,198.0,4406.0,181.0,0,1147768.0,94558.0,2068316.0,132454.0,0
2,DEM,3,18.0,55.0,46.0,0.0,0,9702.0,15672.0,25388.0,0.0,0
3,DEM,4,1290.0,1561.0,952.0,902.0,0,980731.0,1147338.0,838259.0,785244.0,0
4,DEM,5,612.0,305.0,63.0,18.0,0,338870.0,190003.0,17955.0,8674.0,0


In [31]:
target_money_2014.head()

Unnamed: 0,party,cd_number,donors__-4,donors__-2,donors__0,donors__2,donors__4,donations__-4,donations__-2,donations__0,donations__2,donations__4
0,DEM,1,727.0,889.0,422.0,807.0,0,824515.0,899936.0,532059.0,1104049.0,0
1,DEM,2,1094.0,198.0,4406.0,181.0,0,1147768.0,94558.0,2068316.0,132454.0,0
2,DEM,3,18.0,55.0,46.0,0.0,0,9702.0,15672.0,25388.0,0.0,0
3,DEM,4,1290.0,1561.0,952.0,902.0,0,980731.0,1147338.0,838259.0,785244.0,0
4,DEM,5,612.0,305.0,63.0,18.0,0,338870.0,190003.0,17955.0,8674.0,0


In [32]:
target_2014.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,cong_dist_4.0,cong_dist_5.0,cong_dist_6.0,cong_dist_7.0,cong_dist_8.0,cong_dist_9.0,cong_dist_10.0,cong_dist_11.0,cong_dist_12.0,cong_dist_13.0
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,0,0,0,0,0,0,0,0,0,0
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,0,0,0,0,0,0,0,0,0,0
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,0,0,0,0,0,1,0,0,0,0
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,0,0,0,0,0,0,0,0,0,0
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,0,1,0,0,0,0,0,0,0,0


In [33]:
target_16_plus_money = pd.merge(target_2016, target_money_2016, how='left', left_on=['party_cd', 'cong_dist_abbrv'], right_on = ['party', 'cd_number'])
target_14_plus_money = pd.merge(target_2014, target_money_2014, how='left', left_on=['party_cd', 'cong_dist_abbrv'], right_on = ['party', 'cd_number'])
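
Because the donation table only has DEM and REP rows, a left merge leaves every other voter without donation values. A quick way to see how many rows found a match is the `indicator` option of `pd.merge`. The frames below are hypothetical, trimmed to the merge keys and one donation column.

```python
import pandas as pd

# Hypothetical voters and donation totals, keyed the same way as above
voters = pd.DataFrame({
    "party_cd": ["UNA", "DEM", "DEM"],
    "cong_dist_abbrv": [1, 9, 5],
})
money = pd.DataFrame({
    "party": ["DEM", "DEM"],
    "cd_number": [9, 5],
    "donors__-2": [0.0, 63.0],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = pd.merge(
    voters, money, how="left",
    left_on=["party_cd", "cong_dist_abbrv"],
    right_on=["party", "cd_number"],
    indicator=True,
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` are the ones whose donation columns come back as NaN and are later filled with 0.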


In [34]:
target_16_plus_money.to_csv('target_2016_plus_money.csv')

In [35]:
target_14_plus_money.to_csv('target_2014_plus_money.csv')

In [51]:
target_16_plus_money.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,donors__-6,donors__-4,donors__-2,donors__0,donors__2,donations__-6,donations__-4,donations__-2,donations__0,donations__2
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,,,,,,,,,,
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,,,,,,,,,,
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,148.0,1048.0,0.0,98.0,0.0,111256.0,594305.0,0.0,53422.0,0.0
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,,,,,,,,,,
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,612.0,305.0,63.0,18.0,0.0,338870.0,190003.0,17955.0,8674.0,0.0


In [52]:
combined_sample_subset.head()

Unnamed: 0.1,Unnamed: 0,birth_year,nc_house_abbrv,nc_senate_abbrv,birth_age,race_code,ethnic_code,registr_dt,party_cd,cong_dist_abbrv,...,donors__-4,donors__-2,donors__0,donors__2,donors__4,donations__-4,donations__-2,donations__0,donations__2,donations__4
0,BL493176,1991.0,30.0,22.0,27.0,W,NL,2018-07-27,UNA,1.0,...,,,,,,,,,,
1,CJ111714,1956.0,26.0,11.0,62.0,W,NL,2005-07-08,UNA,2.0,...,,,,,,,,,,
2,BR155504,1988.0,69.0,35.0,30.0,W,NL,2018-09-29,DEM,9.0,...,148.0,1048.0,0.0,98.0,0.0,111256.0,594305.0,0.0,53422.0,0.0
3,BL407163,1993.0,0.0,0.0,25.0,B,NL,2012-04-26,DEM,0.0,...,,,,,,,,,,
4,AB6333,1953.0,94.0,42.0,65.0,W,UN,1982-06-23,DEM,5.0,...,612.0,305.0,63.0,18.0,0.0,338870.0,190003.0,17955.0,8674.0,0.0


In [36]:
target_14_plus_money.shape

(1000000, 249)

In [37]:
target_16_plus_money.shape

(1000000, 249)

In [38]:
# fill nas
target_14_plus_money = target_14_plus_money.fillna(value=0)
target_16_plus_money = target_16_plus_money.fillna(value=0)


In [39]:
# train model on '14
list_of_cols_14_money = list(target_14_plus_money.columns)
# test model on '16
list_of_cols_16_money = list(target_16_plus_money.columns)

In [43]:
list_to_rm = ['Unnamed: 0.1','Unnamed: 0','race_code','ethnic_code','party_cd','gen__voted_2','gen__voted_4','gen__voted_0','gen__rep_2','gen__rep_4','gen__rep_0'
              ,'gen__dem_2','gen__dem_4', 'gen__dem_0','gen__ind_2', 'gen__ind_4', 'gen__ind_0','gen__voted_-6', 'gen__rep_-6', 
              'gen__dem_-6', 'gen__ind_-6','First gen__voted_0','donations__0','donations__2','donations__4','nc_house_abbrv','cong_dist_abbrv',
              'donors__0', 'donors__2','donors__4', 'gen__ind_0', 'party','donors__-6','registr_dt','donations__-6','nc_senate_abbrv', 'cd_number']
for i in list_to_rm:
    try:
        list_of_cols_16_money.remove(i)
    except:
        pass

In [44]:
for i in list_to_rm:
    try:
        list_of_cols_14_money.remove(i)
    except:
        pass

In [48]:
# train model on '14 data
X_train = target_14_plus_money[list_of_cols_14_money]

y_train = target_14_plus_money['gen__voted_0']

X_test = target_16_plus_money[list_of_cols_16_money]


y_test = target_16_plus_money['gen__voted_0']

In [49]:
X_train.shape

(1000000, 220)

In [50]:
X_test.shape

(1000000, 220)

## Logistic Regression <a class="anchor" id="sec-two-bullet"></a>

In [51]:
lr_model = LogisticRegression(penalty='l1', C=.25, verbose=100)
lr_model.fit(X_train, y_train)

[LibLinear]

LogisticRegression(C=0.25, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=100, warm_start=False)

In [52]:
lr_model_preds_train = lr_model.predict(X_train)

In [53]:
lr_model_preds_test = lr_model.predict(X_test)

In [54]:
confusion_matrix(y_train, lr_model_preds_train)# Predicted values.

array([[526711,  82990],
       [110871, 279428]])

In [55]:
tn, fp, fn, tp = confusion_matrix(y_train, lr_model_preds_train).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.8639
Sensitivity: 0.7159


In [56]:
confusion_matrix(y_test, lr_model_preds_test)# Predicted values.

array([[333337,  33055],
       [271898, 361710]])

In [57]:
tn, fp, fn, tp = confusion_matrix(y_test, lr_model_preds_test).ravel()

spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')

Specificity: 0.9098
Sensitivity: 0.5709


In [58]:
coefs = pd.DataFrame(list_of_cols_14_money)
coefs['val'] = lr_model.coef_[0]
coefs.sort_values(by='val')

Unnamed: 0,0,val
202,cong_dist_0.0,-0.862183
151,nc_senate_0.0,-0.538772
110,nc_house_80.0,-0.405808
170,nc_senate_19.0,-0.400547
111,nc_house_81.0,-0.398928
172,nc_senate_21.0,-0.387709
115,nc_house_85.0,-0.326864
117,nc_house_87.0,-0.293750
124,nc_house_94.0,-0.286296
201,nc_senate_50.0,-0.285243


## Naive Bayes <a class="anchor" id="sec-two-one-bullet"></a>

In [61]:
mb_model = MultinomialNB(alpha=.2, fit_prior = True)

In [62]:
mb_model.fit(X_train, y_train)

MultinomialNB(alpha=0.2, class_prior=None, fit_prior=True)

In [63]:
mb_model_preds_train = mb_model.predict(X_train)

In [64]:
mb_model_preds_test = mb_model.predict(X_test)

In [65]:
confusion_matrix(y_train, mb_model_preds_train)# Predicted values.

array([[390753, 218948],
       [198212, 192087]])

In [66]:
confusion_matrix(y_test, mb_model_preds_test)# Predicted values.

array([[268972,  97420],
       [349715, 283893]])

In [67]:
coefs = pd.DataFrame(list_of_cols_14_money)
coefs['val'] = mb_model.coef_[0]
coefs.sort_values(by='val')

Unnamed: 0,0,val
26,party_GRE,-23.676618
24,party_CST,-23.297502
23,party_0,-21.559765
19,ethnic_0,-21.559765
11,race_0,-21.559765
27,party_LIB,-20.162430
15,race_M,-19.854939
45,nc_house_15.0,-19.700589
72,nc_house_42.0,-19.618869
77,nc_house_47.0,-19.523626
