In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data
import numpy as np
from statistics import mean

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

%matplotlib inline

# Presidential Primaries: Democratic Party Edition

This short project builds on previous Python notebooks I've posted on my GitHub, focusing on the contribution of campaign finances to a candidate's likelihood of winning. This particular project focuses on *Democratic Presidential primaries*.

All of the data used below comes directly from the [Federal Election Commission](https://www.fec.gov/data/). It includes all campaign finance data for Democratic candidates from 1995 to 2016 in the year preceding the election.

In [2]:
# Load the seven FEC exports and stack them into a single DataFrame.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
files = [
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_36_55.csv",
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_36_34.csv",
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_36_06.csv",
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_35_58.csv",
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_35_49.csv",
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_35_29.csv",
    r"C:/Users/ia767/Downloads/totals-2019-03-14T12_35_16.csv",
]
money = pd.concat([pd.read_csv(f) for f in files])

In [3]:
money.head(3)

Unnamed: 0,name,office,office_full,party,party_full,state,district,district_number,election_districts,election_years,...,cycle,is_election,receipts,disbursements,cash_on_hand_end_period,debts_owed_by_committee,coverage_start_date,coverage_end_date,federal_funds_flag,has_raised_funds
0,"CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE",P,President,DEM,DEMOCRATIC PARTY,US,0,0,"{00,00}","{2008,2016}",...,2016,f,585669600.0,585346300.0,323317.48,182.5,2015-04-01 00:00:00,2016-12-31 00:00:00,f,t
1,"WELLS, ROBERT CARR JR",P,President,DEM,DEMOCRATIC PARTY,US,0,0,"{00,00}","{2012,2016}",...,2016,f,16344.0,48000.0,0.0,0.0,2015-01-01 00:00:00,2016-12-31 00:00:00,f,t
2,"SHREFFLER, DOUG",P,President,DEM,DEMOCRATIC PARTY,US,0,0,"{00,00}","{2016,2020}",...,2016,f,0.0,30227.46,-8765.32,0.0,2015-01-01 00:00:00,2016-09-30 00:00:00,f,f


To make the analysis simpler, I focus on a subset of the columns included, which I deemed to be the most relevant for prediction purposes. These are mainly variables dealing directly with campaign finance.

In [4]:
money = money[["name", 'coverage_start_date', "candidate_status", "receipts", 'disbursements',
       'cash_on_hand_end_period', 'debts_owed_by_committee']]

In [5]:
money.head(3)

Unnamed: 0,name,coverage_start_date,candidate_status,receipts,disbursements,cash_on_hand_end_period,debts_owed_by_committee
0,"CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE",2015-04-01 00:00:00,C,585669600.0,585346300.0,323317.48,182.5
1,"WELLS, ROBERT CARR JR",2015-01-01 00:00:00,C,16344.0,48000.0,0.0,0.0
2,"SHREFFLER, DOUG",2015-01-01 00:00:00,N,0.0,30227.46,-8765.32,0.0


Then the dataset is further filtered to candidates whose status is "C" (running).

In [6]:
money = money[money["candidate_status"] == "C"].copy()  # copy so later assignments do not target a slice

I also add a winner column which designates candidates who won the Democratic primary.

In [7]:
# Flag the candidates who went on to win the Democratic primary
winners = [
    "CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE",
    "OBAMA, BARACK",
    "KERRY, JOHN F",
    "GORE, AL",
    "CLINTON, WILLIAM JEFFERSON",
]
# assign() returns a new DataFrame, which avoids the chained-assignment warning
money = money.assign(winner=money["name"].isin(winners).astype("int64"))

Finally, I reset the index, which still carries the row labels of the individual files after stacking and filtering, so that it corresponds directly to the final dataset's row numbers.

In [8]:
money = money.reset_index(drop=True)
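
To illustrate why the index needs resetting, here is a minimal sketch on two toy frames (the names and amounts are made up and merely stand in for two of the FEC exports). Stacking frames keeps each file's own row labels, so labels repeat until the index is rebuilt.

```python
import pandas as pd

# Two toy frames standing in for two FEC exports (values are made up)
a = pd.DataFrame({"name": ["A", "B"], "receipts": [10.0, 20.0]})
b = pd.DataFrame({"name": ["C", "D"], "receipts": [30.0, 40.0]})

# Stacking keeps each frame's original row labels, so 0 and 1 appear twice
stacked = pd.concat([a, b])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# reset_index(drop=True) replaces them with consecutive row numbers
clean = stacked.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1, 2, 3]

# When no rows are dropped afterwards, ignore_index=True does the same in one step
also_clean = pd.concat([a, b], ignore_index=True)
```

In this notebook rows are filtered after stacking, so the reset has to happen after the filter rather than at concatenation time.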

In [9]:
money.head(2)

Unnamed: 0,name,coverage_start_date,candidate_status,receipts,disbursements,cash_on_hand_end_period,debts_owed_by_committee,winner
0,"CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE",2015-04-01 00:00:00,C,585669600.0,585346300.0,323317.48,182.5,1
1,"WELLS, ROBERT CARR JR",2015-01-01 00:00:00,C,16344.0,48000.0,0.0,0.0,0


In [10]:
money.tail(2)

Unnamed: 0,name,coverage_start_date,candidate_status,receipts,disbursements,cash_on_hand_end_period,debts_owed_by_committee,winner
89,"HERMAN, RAPHAEL",2011-01-01 00:00:00,C,6410982.0,6475000.0,0.0,0.0,0
90,"OBAMA, BARACK",2011-01-01 00:00:00,C,738503800.0,737507900.0,3299312.93,5647729.93,1


### Prediction Models

The code below runs three different models (scaled logit, KNN and random forest) and compares their performance. Stratified k-fold was specified as the cross-validation method because winners make up such a small share of the dataset.
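
As a rough illustration of what stratification buys, the sketch below uses a toy label vector with 10% positives (made up, not the FEC data) and counts how many positives land in each test fold. The fold count of 3 is an assumption chosen for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 6 "winners" among 60 rows (made-up data)
y_toy = np.array([1] * 6 + [0] * 54)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=3)
positives_per_fold = [
    int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)
]
print(positives_per_fold)  # [2, 2, 2]
```

Each fold receives its share of the rare class, whereas a plain unshuffled split of these rows would put all six positives in the first fold.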

In [11]:
y = money['winner']  # target: 1 if the candidate won the primary
X = money[["receipts", "disbursements", "cash_on_hand_end_period", "debts_owed_by_committee"]]  # features

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)  # randomly splits the data into train and test sets

In [13]:
from sklearn import preprocessing

# Fit the scaler on the training split only, then apply it to every split
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Scaled logistic regression, using the default C
logreg_scaled = LogisticRegression().fit(X_train_scaled, y_train)
# Note: cross_val_score clones the model and refits it on folds of the test split
cv_scores = cross_val_score(logreg_scaled, X_test_scaled, y_test, cv=StratifiedKFold())
print("CV Test Score: {:.2f}".format(np.mean(cv_scores)))

CV Test Score: 0.96
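
One caveat about this score: `cross_val_score` works on clones of the estimator that it refits on each fold, so the figure reflects models trained on parts of the test split rather than the model fitted on the training split. A possible alternative, sketched here on synthetic data (the arrays, fold count and seed are assumptions, not the FEC data), is to wrap the scaler and the classifier in one pipeline so that scaling is re-learned inside every fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data: 9 positives out of 90 rows, 4 dollar-scale features
rng = np.random.default_rng(0)
y_toy = np.array([1] * 9 + [0] * 81)
X_toy = (rng.normal(size=(90, 4)) + 3.0 * y_toy[:, None]) * 1e6

# The pipeline refits the scaler on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv)
print("mean CV accuracy: {:.2f}".format(scores.mean()))
```

The same pattern would apply to the KNN model, which is also sensitive to feature scale.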


In [14]:
param_grid = {'n_neighbors': np.arange(1, 50, 2)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)

best_k = grid.best_params_["n_neighbors"]  # storing the best k

knn = KNeighborsClassifier(n_neighbors = best_k).fit(X_train, y_train)

print("best mean cross-validation score: {:.3f}".format(grid.best_score_))
print("best parameters: {}".format(grid.best_params_))
print("CV Test Score: {:.2f}".format( np.mean(cross_val_score(knn, X_test, y_test, cv= StratifiedKFold() ))) )



best mean cross-validation score: 0.971
best parameters: {'n_neighbors': 3}
CV Test Score: 0.91


In [15]:
from sklearn.ensemble import RandomForestClassifier

tree_rfc = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)

print("CV Test Score: {:.2f}".format(np.mean(cross_val_score(tree_rfc, X_test, y_test, cv= StratifiedKFold()) ))) 

CV Test Score: 1.00


It appears that the scaled logit and random forest models perform best. However, their high cross-validation test scores might be misleading because of the imbalance between winners and losers: 84 of the 91 candidates listed did not win, so a model predicting that every candidate loses would still have an accuracy of about 92%.

The code below compares the 'winner' predictions of the two best models to the actual outcomes of the primaries.
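
The baseline arithmetic can be checked directly. This small sketch rebuilds the label counts quoted above (84 losers, 7 winners) and scores a hypothetical "everyone loses" predictor on accuracy and on recall for winners.

```python
n_candidates = 91
n_losers = 84

# Actual labels: 0 = lost, 1 = won
labels = [0] * n_losers + [1] * (n_candidates - n_losers)
# A trivial predictor that says every candidate loses
always_lose = [0] * n_candidates

accuracy = sum(p == t for p, t in zip(always_lose, labels)) / n_candidates
winners_found = sum(p == 1 and t == 1 for p, t in zip(always_lose, labels))
recall = winners_found / sum(labels)

print("accuracy: {:.3f}".format(accuracy))  # accuracy: 0.923
print("recall:   {:.3f}".format(recall))    # recall:   0.000
```

Accuracy looks strong while recall is zero, which is why accuracy alone says little about how well a model identifies winners.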

In [16]:
y[logreg_scaled.predict(X_scaled) == 1]

0     1
3     0
20    1
39    1
59    1
74    1
82    1
90    1
Name: winner, dtype: int64

In [17]:
y[tree_rfc.predict(X) == 1]

0     1
3     0
20    1
39    1
59    1
74    1
82    1
90    1
Name: winner, dtype: int64
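
The two outputs above are identical, and they can be condensed into confusion-matrix counts. The sketch below uses only numbers already shown: 91 candidates, 84 losers, and eight flagged rows of which seven are actual winners. Keep in mind that these predictions cover the full dataset, including rows the models were trained on.

```python
n_candidates = 91
n_winners = n_candidates - 84

# Actual outcomes of the eight candidates each model flagged as winners
flagged_actual = [1, 0, 1, 1, 1, 1, 1, 1]

tp = sum(flagged_actual)              # flagged and actually won
fp = len(flagged_actual) - tp         # flagged but lost
fn = n_winners - tp                   # winners the model missed
tn = n_candidates - tp - fp - fn      # correctly predicted losers

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)                       # 7 1 0 83
print("precision: {:.3f}".format(precision))  # precision: 0.875
print("recall:    {:.3f}".format(recall))     # recall:    1.000
```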

It appears that the candidate who confounds both of my best-performing models is none other than Bernie Sanders.

In [18]:
money.loc[3, :]


name                          SANDERS, BERNARD
coverage_start_date        2015-04-01 00:00:00
candidate_status                             C
receipts                           2.37641e+08
disbursements                      2.32185e+08
cash_on_hand_end_period            5.45517e+06
debts_owed_by_committee                 449409
winner                                       0
Name: 3, dtype: object

This points to the main weakness of these models: they are built using *only* campaign finance data. As a result, they favor 'mainstream' candidates who draw a lot of power within their party and in more conventional circles. (Financial) "underdogs" will therefore be severely underestimated.

Finally, I take advantage of one of the features of tree-based models to look at the individual contribution of each independent variable. Here, receipts, disbursements and cash on hand are all relatively useful. Committee debts, however, seem to add little to the model.

In [19]:
print(tree_rfc.feature_importances_) 
#in same order as feature names in data
X.columns

[ 0.35439159  0.32944565  0.21606227  0.04510049]


Index(['receipts', 'disbursements', 'cash_on_hand_end_period',
       'debts_owed_by_committee'],
      dtype='object')
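
For easier reading, the importances can be paired with their feature names and ranked. This sketch hard-codes the values printed above rather than refitting the model.

```python
# Values copied from the output above, in the same order as X.columns
features = [
    "receipts",
    "disbursements",
    "cash_on_hand_end_period",
    "debts_owed_by_committee",
]
importances = [0.35439159, 0.32944565, 0.21606227, 0.04510049]

# Rank features from most to least important
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{:<26}{:.3f}".format(name, score))
```

With the fitted model in memory, `pd.Series(tree_rfc.feature_importances_, index=X.columns).sort_values(ascending=False)` should give the same ranking in one line.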

There are two main avenues to consider when building on these models further:

1. adding polling data (to branch out from financials only)
2. accounting for the fact that these models are trained on the *whole year at once*, and not as contributions roll in (as they will in this election cycle)