### Project Problem & Hypothesis

This project examines U.S. primary election data to identify trends that will help predict the general election outcome. This election cycle appears to show greater polarization between the two major political parties. Because the political options differ so sharply, we believe it will be easier to predict the general election outcome from primary results. We hypothesize that by identifying the underlying voter characteristics (age, ethnicity, education level, etc.) in key swing states, we can determine the preferences of the so-called median voter on whom the general election will hinge.

Since we are predicting voter preference, we frame this as a machine learning classification problem with a binary target of the following form: {0: Democrat vote, 1: Republican vote}.
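
One way the binary target could be encoded is sketched below. This is only an illustration under stated assumptions: the vote totals are made-up toy numbers, the names `toy_results`, `PARTY_TO_TARGET`, and `county_target` are hypothetical, and labeling a county by the party with more primary votes is an assumed rule, not necessarily the one the final model will use.

```python
import pandas as pd

# Toy data with made-up vote totals (NOT taken from primary_results.csv).
toy_results = pd.DataFrame({
    "fips": [1001, 1001, 1003, 1003],
    "party": ["Democrat", "Republican", "Democrat", "Republican"],
    "votes": [3000, 2000, 4000, 9000],
})

# Encoding described in the text: 0 = Democrat vote, 1 = Republican vote.
PARTY_TO_TARGET = {"Democrat": 0, "Republican": 1}

# Sum votes per county and party, with one column per party.
party_votes = (
    toy_results.groupby(["fips", "party"])["votes"].sum().unstack("party")
)

# Assumed labeling rule: the party with more primary votes in the county.
leading_party = party_votes.idxmax(axis=1)
county_target = leading_party.map(PARTY_TO_TARGET).rename("target")
print(county_target)
```

Primary turnout differs by party and by state, so raw primary vote totals are at best a rough proxy for general election preference.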

Predicting this outcome will ultimately help me sleep better at night, as the current political and social events occurring across the globe are disconcerting. There is also a wealth of research and work suggesting that election years have a large impact on the economy. Given the disparate views of the candidates, knowing the outcome of the election with some level of confidence can inform investment decisions that would be favored under the incoming POTUS.

It is my belief that race and region will play the largest role in determining the outcome of the upcoming presidential election.

In [1]:
import pandas as pd

### Read in the dataset and the dictionary of column descriptions

In [2]:
data = pd.read_csv('/Users/antuanweeks/PythonCode/GA_DataScience/akw_projects/datasets/2016-us-election/county_facts.csv')

In [3]:
data.head() # previewing the county-level demographic and socioeconomic information

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
0,0,United States,,318857056,308758105,3.3,308745538,6.2,23.1,14.5,...,8.3,28.8,5319456312,4174286516,3917663456,12990,613795732,1046363,3531905.43,87.4
1,1000,Alabama,,4849377,4780127,1.4,4779736,6.1,22.8,15.3,...,1.2,28.1,112858843,52252752,57344851,12364,6426342,13369,50645.33,94.4
2,1001,Autauga County,AL,55395,54571,1.5,54571,6.0,25.2,13.8,...,0.7,31.7,0,0,598175,12003,88157,131,594.44,91.8
3,1003,Baldwin County,AL,200111,182265,9.8,182265,5.6,22.2,18.7,...,1.3,27.3,1410273,0,2966489,17166,436955,1384,1589.78,114.6
4,1005,Barbour County,AL,26887,27457,-2.1,27457,5.7,21.2,16.5,...,0.0,27.0,0,0,188337,6334,0,8,884.88,31.0


In [4]:
data_dict = pd.read_csv('/Users/antuanweeks/PythonCode/GA_DataScience/akw_projects/datasets/2016-us-election/county_facts_dictionary.csv')

In [5]:
data_dict.head() # previewing the dictionary dataframe

Unnamed: 0,column_name,description
0,PST045214,"Population, 2014 estimate"
1,PST040210,"Population, 2010 (April 1) estimates base"
2,PST120214,"Population, percent change - April 1, 2010 to ..."
3,POP010210,"Population, 2010"
4,AGE135214,"Persons under 5 years, percent, 2014"
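
The column codes in `county_facts` are hard to read on their own, so the dictionary can be used to translate them. The sketch below rebuilds a few rows of the preview above by hand so that it runs without the CSV files; `code_to_description`, `county_sample`, and `readable` are hypothetical names introduced for illustration.

```python
import pandas as pd

# A few rows copied from the data_dict preview above.
data_dict = pd.DataFrame({
    "column_name": ["PST045214", "PST040210", "POP010210", "AGE135214"],
    "description": [
        "Population, 2014 estimate",
        "Population, 2010 (April 1) estimates base",
        "Population, 2010",
        "Persons under 5 years, percent, 2014",
    ],
})

# Build a code -> description lookup.
code_to_description = data_dict.set_index("column_name")["description"].to_dict()

# One county row (Autauga County) with two coded columns from the preview.
county_sample = pd.DataFrame(
    {"fips": [1001], "PST045214": [55395], "AGE135214": [6.0]}
)

# Columns without a dictionary entry (such as fips) keep their original name.
readable = county_sample.rename(columns=code_to_description)
print(readable.columns.tolist())
```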


In [6]:
data_results = pd.read_csv('/Users/antuanweeks/PythonCode/GA_DataScience/akw_projects/datasets/2016-us-election/primary_results.csv')

In [7]:
data_results.head() # previewing the primary results dataframe

Unnamed: 0,state,state_abbreviation,county,fips,party,candidate,votes,fraction_votes
0,Alabama,AL,Autauga,1001.0,Democrat,Bernie Sanders,544,0.182
1,Alabama,AL,Autauga,1001.0,Democrat,Hillary Clinton,2387,0.8
2,Alabama,AL,Baldwin,1003.0,Democrat,Bernie Sanders,2694,0.329
3,Alabama,AL,Baldwin,1003.0,Democrat,Hillary Clinton,5290,0.647
4,Alabama,AL,Barbour,1005.0,Democrat,Bernie Sanders,222,0.078


### Dataset Description

The county_facts dataset contains demographic and economic information for counties in the United States. It breaks down information such as firm ownership by ethnicity, type of employment (non-farm), retail sales, languages spoken, income, and more. The primary_results dataset provides the outcomes of primaries in 49 states and 2633 counties. It contains the state, county, FIPS county code, party, candidate, number of votes, and fraction of the total vote. Since both datasets identify counties by FIPS code, we will be able to join/merge the two in our later analyses.
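
A minimal sketch of such a join is shown below, using a few rows copied from the previews above so that it runs without the CSV files. It assumes `fips` is the join key and that rows with a missing `fips` should be dropped before merging; `results_clean` and `merged` are hypothetical names. Note that `fips` is an integer in county_facts but a float in primary_results, so the sketch aligns the types first.

```python
import pandas as pd

# Rows copied from the county_facts preview (subset of columns).
county_facts = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "area_name": ["Autauga County", "Baldwin County", "Barbour County"],
    "PST045214": [55395, 200111, 26887],
})

# Rows copied from the primary_results preview (subset of columns).
primary_results = pd.DataFrame({
    "fips": [1001.0, 1001.0, 1003.0, 1003.0, 1005.0],
    "candidate": ["Bernie Sanders", "Hillary Clinton", "Bernie Sanders",
                  "Hillary Clinton", "Bernie Sanders"],
    "votes": [544, 2387, 2694, 5290, 222],
})

# Assumption: rows without a FIPS code cannot be matched, so drop them,
# then cast the float codes to integers to match county_facts.
results_clean = primary_results.dropna(subset=["fips"]).copy()
results_clean["fips"] = results_clean["fips"].astype("int64")

# Many result rows map to one county row; validate raises if that is violated.
merged = results_clean.merge(
    county_facts, on="fips", how="left", validate="many_to_one"
)
print(merged)
```

A left join keeps every primary result row, which makes unmatched counties visible as missing values rather than silently dropping them.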

### Domain Knowledge

I am familiar with voter theory from undergraduate studies. Hopefully this will allow me to have robust assumptions and perhaps insights that might be missed otherwise.

Within economics, there is a theory known as the Median Voter Theorem (MVT). In its simplest form, it posits that an election between two candidates will ultimately be decided by the preference of the median voter. Given the two-party political system in the United States, I believe the general election will follow a similar form (as it has in past elections), since the presidency will be won in the center, between the extremes of each political platform.
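
The intuition can be illustrated with a toy one-dimensional example. This is a sketch under simplifying assumptions (each voter has a single ideal point on a left-right axis and votes for the nearer candidate); the voter positions and the function name `majority_winner` are made up for illustration.

```python
import statistics


def majority_winner(voter_positions, candidate_a, candidate_b):
    """Return 'a', 'b', or 'tie' when each voter picks the nearer candidate."""
    votes_a = sum(
        abs(v - candidate_a) < abs(v - candidate_b) for v in voter_positions
    )
    votes_b = sum(
        abs(v - candidate_b) < abs(v - candidate_a) for v in voter_positions
    )
    if votes_a == votes_b:
        return "tie"
    return "a" if votes_a > votes_b else "b"


# Made-up ideal points on a left (-1) to right (+1) axis.
voters = [-0.9, -0.6, -0.2, 0.1, 0.3, 0.7, 0.8]
median_voter = statistics.median(voters)

# Candidate b sits closer to the median voter than candidate a does.
result = majority_winner(voters, candidate_a=-0.5, candidate_b=0.2)
print(median_voter, result)
```

In this toy setup the candidate positioned closer to the median voter collects the majority, which is the behavior the theorem describes.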

__Other Methods__

Some social scientists predict such outcomes based on the underlying characteristics of the political system and do not use polling or demographic data at all. They believe that certain variables (such as the relationship between House/Senate control and the incumbent president's party) can be strong predictors of the outcome. If we are able to obtain datasets with this information, we may attempt to capture this data as well.

Work by Weingart and Sebastien looks at the relationship between donor funding and electoral success. Their model predicted more than 80% of the dropouts and winners in elections from 2000 to 2012.

### Project Concerns

- Is this too simple?
- How can I acquire more data in order to perform studies over multiple election cycles?
- assumptions:
 - that we will be able to obtain the swing state information
 - constant, homogeneous (or easy-to-model) voter turnout, so that outcomes do not differ greatly from the primary to the general election
 - rational preferences: voting preferences are transitive, such that someone voting for a Democrat in the primary will vote for the nearest representative (likely a Democrat) in the general election, as opposed to abstaining or, worse, voting for a candidate of the opposing party
- dataset implications: the data is based on a particular group of the population that appears more involved or interested in the political process, given that they participate during primary season
- risks:
 - if general election voter turnout is too low, or the electorate varies greatly from the electorate turning out for primaries, we risk a mismatch between the preferences in the training dataset and those in the general election
 - people are not always rational; a passionate follower of a candidate who is no longer in the race may choose not to vote in the general election, which will decrease actual turnout

__Is the data incorrect?__ Since this is polling data, there is a chance that it contains inaccuracies, but with the increasing electronic capabilities of our polling stations, accuracy is improving.

### Outcome

1. We anticipate the result to provide a county-by-county prediction of the 2016 presidential election. We will then be able to aggregate the data and apply the electoral college to determine the predicted outcome of the U.S. general election.

2. The audience expects an outcome similar to what is described in (1).

3. I believe gender will play a big role in determining the outcome. I cannot estimate a value, but I believe it will be significant. I believe ethnicity will also play a large role and may be the largest predictor in the race.

4. There will be a good number of regressors in the model. I do not believe it will need to be overly complex; it simply needs to capture the impact of multiple regressors.

5. We will find out in November. But seriously, we are looking for a high degree of confidence in the result; otherwise we have nothing more than what we can get from the daily poll data.

6. If the project is a bust, we will evaluate what caused the errors and do better next time!
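
The aggregation described in (1) could look roughly like the sketch below. Everything in it is illustrative: the states `X`, `Y`, and `Z`, their electoral vote counts, and the predicted county vote totals are made-up placeholders, and the sketch assumes every state awards all of its electors to the statewide winner (Maine and Nebraska, which allocate some electors by congressional district, would need separate handling).

```python
import pandas as pd

# Hypothetical county-level predictions (made-up numbers and state names).
county_predictions = pd.DataFrame({
    "state": ["X", "X", "Y", "Y", "Z"],
    "predicted_dem_votes": [1000, 400, 300, 200, 900],
    "predicted_rep_votes": [600, 500, 700, 100, 950],
})

# Hypothetical electoral vote counts per state.
electoral_votes = pd.Series({"X": 12, "Y": 6, "Z": 4})

# Sum predicted votes over the counties in each state.
state_totals = county_predictions.groupby("state")[
    ["predicted_dem_votes", "predicted_rep_votes"]
].sum()

# Same encoding as the target: 0 = Democrat, 1 = Republican.
state_target = (
    state_totals["predicted_rep_votes"] > state_totals["predicted_dem_votes"]
).astype(int)

# Winner-take-all assumption: each state's electors go to its predicted winner.
electoral_tally = electoral_votes.groupby(state_target).sum()
print(electoral_tally)
```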