# Election Data Analysis

There is a <b>county_facts.csv</b> file in the <b>datasets/geo/</b> folder; its variables are described in the <b>county_facts_dictionary.csv</b> file. These facts were mostly obtained from the 2012-2014 census. Can you find the top 3 factors explaining the relative voting margin in each county for the 2016 election?

## Step 1: Preprocessing data

In [101]:
import pandas as pd
import numpy as np

Load county_facts

In [102]:
Facts = pd.read_csv('county_facts.csv')
Facts.head()

Unnamed: 0,fips,area_name,state_abbreviation,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
0,0,United States,,318857056,308758105,3.3,308745538,6.2,23.1,14.5,...,8.3,28.8,5319456312,4174286516,3917663456,12990,613795732,1046363,3531905.43,87.4
1,1000,Alabama,,4849377,4780127,1.4,4779736,6.1,22.8,15.3,...,1.2,28.1,112858843,52252752,57344851,12364,6426342,13369,50645.33,94.4
2,1001,Autauga County,AL,55395,54571,1.5,54571,6.0,25.2,13.8,...,0.7,31.7,0,0,598175,12003,88157,131,594.44,91.8
3,1003,Baldwin County,AL,200111,182265,9.8,182265,5.6,22.2,18.7,...,1.3,27.3,1410273,0,2966489,17166,436955,1384,1589.78,114.6
4,1005,Barbour County,AL,26887,27457,-2.1,27457,5.7,21.2,16.5,...,0.0,27.0,0,0,188337,6334,0,8,884.88,31.0


Load 2016 Election data.

In [103]:
Election = pd.read_csv('elections/2016/general.csv')
Election.head()

Unnamed: 0.1,Unnamed: 0,state,county,1st,2nd,3rd,votes1,votes2,votes3,pct1,pct2,pct3,party1,party2,party3
0,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,0.24,0.022,R,D,O
1,1,Alabama,Baldwin County,D. TRUMP,H. CLINTON,G. JOHNSON,72780,18409,2448,0.774,0.196,0.026,R,D,O
2,2,Alabama,Barbour County,D. TRUMP,H. CLINTON,G. JOHNSON,5431,4848,93,0.523,0.467,0.009,R,D,O
3,3,Alabama,Bibb County,D. TRUMP,H. CLINTON,G. JOHNSON,6733,1874,124,0.77,0.214,0.014,R,D,O
4,4,Alabama,Blount County,D. TRUMP,H. CLINTON,G. JOHNSON,22808,2150,337,0.899,0.085,0.013,R,D,O


**Note:** The Election data has no state abbreviation field. To make merging the dataframes easier, we downloaded a state-to-abbreviation mapping and merge it into the Election dataset.

In [104]:
state_abbr = pd.read_csv('state_abbr.csv')
state_abbr.head()

Unnamed: 0,State,Abbreviation
0,ALABAMA,AL
1,ALASKA,AK
2,ARIZONA,AZ
3,ARKANSAS,AR
4,CALIFORNIA,CA


Merge the two datasets so that Election has an Abbreviation field.

In [105]:
Election['state_up'] = Election['state'].str.replace('-', ' ').str.upper()
Election = Election.merge(state_abbr, left_on='state_up', right_on='State', how='left')
Election.head()

Unnamed: 0.1,Unnamed: 0,state,county,1st,2nd,3rd,votes1,votes2,votes3,pct1,pct2,pct3,party1,party2,party3,state_up,State,Abbreviation
0,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,0.24,0.022,R,D,O,ALABAMA,ALABAMA,AL
1,1,Alabama,Baldwin County,D. TRUMP,H. CLINTON,G. JOHNSON,72780,18409,2448,0.774,0.196,0.026,R,D,O,ALABAMA,ALABAMA,AL
2,2,Alabama,Barbour County,D. TRUMP,H. CLINTON,G. JOHNSON,5431,4848,93,0.523,0.467,0.009,R,D,O,ALABAMA,ALABAMA,AL
3,3,Alabama,Bibb County,D. TRUMP,H. CLINTON,G. JOHNSON,6733,1874,124,0.77,0.214,0.014,R,D,O,ALABAMA,ALABAMA,AL
4,4,Alabama,Blount County,D. TRUMP,H. CLINTON,G. JOHNSON,22808,2150,337,0.899,0.085,0.013,R,D,O,ALABAMA,ALABAMA,AL
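
Because this is a left join, a state name that is missing from the mapping would silently receive a NaN Abbreviation. A minimal sketch of a check for that case, using made-up rows (the sample values and the `unmatched` name are illustrative and not taken from the datasets):

```python
import pandas as pd

# Toy stand-ins for the Election and state_abbr frames (illustrative values only).
election = pd.DataFrame({'state': ['Alabama', 'New-York', 'Atlantis']})
state_abbr = pd.DataFrame({'State': ['ALABAMA', 'NEW YORK'],
                           'Abbreviation': ['AL', 'NY']})

# Same normalization as in the cell above: hyphens to spaces, then upper case.
election['state_up'] = election['state'].str.replace('-', ' ').str.upper()
merged = election.merge(state_abbr, left_on='state_up', right_on='State', how='left')

# States with no match in the mapping end up with a NaN Abbreviation.
unmatched = merged.loc[merged['Abbreviation'].isna(), 'state'].unique().tolist()
print(unmatched)
```

An empty list on the real data would indicate that every state name was mapped.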


From the previously provided national_county.txt data, we build a FIPS code for each county and merge it into the Election dataset.

In [106]:
data = pd.read_csv('national_county.txt', sep=",", header=None, dtype=str)
data.columns = ["STATE", "STATEFP", "COUNTYFP", "COUNTYNAME","CLASSFP"]
data['FIPS'] = pd.to_numeric(data['STATEFP'] + data['COUNTYFP'],
                             downcast='integer')
data.head()

Unnamed: 0,STATE,STATEFP,COUNTYFP,COUNTYNAME,CLASSFP,FIPS
0,AL,1,1,Autauga County,H1,1001
1,AL,1,3,Baldwin County,H1,1003
2,AL,1,5,Barbour County,H1,1005
3,AL,1,7,Bibb County,H1,1007
4,AL,1,9,Blount County,H1,1009


In [107]:
Election = Election.merge(data, left_on='Abbreviation', right_on='STATE', how='left')
Election.head()

Unnamed: 0.1,Unnamed: 0,state,county,1st,2nd,3rd,votes1,votes2,votes3,pct1,...,party3,state_up,State,Abbreviation,STATE,STATEFP,COUNTYFP,COUNTYNAME,CLASSFP,FIPS
0,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,...,O,ALABAMA,ALABAMA,AL,AL,1,1,Autauga County,H1,1001.0
1,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,...,O,ALABAMA,ALABAMA,AL,AL,1,3,Baldwin County,H1,1003.0
2,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,...,O,ALABAMA,ALABAMA,AL,AL,1,5,Barbour County,H1,1005.0
3,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,...,O,ALABAMA,ALABAMA,AL,AL,1,7,Bibb County,H1,1007.0
4,0,Alabama,Autauga County,D. TRUMP,H. CLINTON,G. JOHNSON,18110,5908,538,0.734,...,O,ALABAMA,ALABAMA,AL,AL,1,9,Blount County,H1,1009.0
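
In the output above, every row repeats the Autauga County results while only the FIPS code changes, which suggests that joining on the state abbreviation alone pairs each election row with every county in that state. A minimal sketch of a join keyed on both state and county name, using toy rows copied from the tables above (it assumes county names match exactly between the two files, which has not been checked on the full data):

```python
import pandas as pd

# A few rows taken from the tables shown above.
election = pd.DataFrame({
    'Abbreviation': ['AL', 'AL'],
    'county': ['Autauga County', 'Baldwin County'],
    'votes1': [18110, 72780],
})
counties = pd.DataFrame({
    'STATE': ['AL', 'AL', 'AL'],
    'COUNTYNAME': ['Autauga County', 'Baldwin County', 'Barbour County'],
    'FIPS': [1001, 1003, 1005],
})

# Joining on state only: each election row matches all three AL counties.
state_only = election.merge(counties, left_on='Abbreviation', right_on='STATE',
                            how='left')

# Joining on state and county name: at most one FIPS per election row.
# validate='one_to_one' raises if the key pair is duplicated on either side.
both_keys = election.merge(counties,
                           left_on=['Abbreviation', 'county'],
                           right_on=['STATE', 'COUNTYNAME'],
                           how='left', validate='one_to_one')

print(len(state_only), len(both_keys))
```

With the state-only key the row count grows to the number of election rows times the number of counties in the state, which also makes rel_voting_margin constant across a state's FIPS codes in the later steps.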


The Democratic relative voting margin for each county is defined as:

rel_voting_margin = (democrat_votes - republican_votes)/total_votes

<b>Note:</b> The election dataset has no total_votes field. However, we can approximate it by summing votes1, votes2 and votes3, the vote counts of the top three candidates.

In [108]:
Election['total_votes'] = Election['votes1'] + Election['votes2'] + Election['votes3']

Then we can compute the rel_voting_margin.

<b>Note:</b> We first need to determine which vote count belongs to each party, because the party in each of the three slots can differ from county to county.

In [109]:
rel_voting_margin = []
party1 = Election['party1']
party2 = Election['party2']
party3 = Election['party3']
votes1 = Election['votes1']
votes2 = Election['votes2']
votes3 = Election['votes3']
total = Election['total_votes']
num = len(party1)
for i in range(num):
    if party1[i] == 'D':
        demo = votes1[i]
        if party2[i] == 'R':
            repub = votes2[i]
        else:
            repub = votes3[i]
    elif party1[i] == 'R':
        repub = votes1[i]
        if party2[i] == 'D':
            demo = votes2[i]
        else:
            demo = votes3[i]
    else:
        if party2[i] == 'D':
            demo = votes2[i]
            repub = votes3[i]
        else:
            demo = votes3[i]
            repub = votes2[i]
        
    rel_voting_margin.append((demo-repub)/total[i])
    
Election['rel_voting_margin'] = rel_voting_margin
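
The same per-party lookup can also be written without an explicit loop. The following is a sketch on toy rows, not the notebook's method: the `party_votes` helper is a hypothetical name, and unlike the loop above it counts a party as 0 votes when it is absent from all three slots.

```python
import pandas as pd

# Toy rows; the first one is the Autauga County row from the tables above.
election = pd.DataFrame({
    'party1': ['R', 'D', 'O'], 'party2': ['D', 'R', 'D'], 'party3': ['O', 'O', 'R'],
    'votes1': [18110, 500, 300], 'votes2': [5908, 400, 200], 'votes3': [538, 100, 100],
})

def party_votes(frame, party):
    """Sum the votes from whichever of the three slots belongs to `party`."""
    total = pd.Series(0, index=frame.index)
    for k in (1, 2, 3):
        # Keep votes{k} where party{k} matches, otherwise contribute 0.
        total = total + frame[f'votes{k}'].where(frame[f'party{k}'] == party, 0)
    return total

total_votes = election[['votes1', 'votes2', 'votes3']].sum(axis=1)
margin = (party_votes(election, 'D') - party_votes(election, 'R')) / total_votes
print(margin.round(6).tolist())
```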

In [124]:
df = pd.DataFrame({'FIPS': Election['FIPS'],
                   'rel_voting_margin': rel_voting_margin})
df.head()

Unnamed: 0,FIPS,rel_voting_margin
0,1001.0,-0.496905
1,1003.0,-0.496905
2,1005.0,-0.496905
3,1007.0,-0.496905
4,1009.0,-0.496905


Join the datasets so that we have all the data we need for further analysis.

**Note:** Without a FIPS field in the Election dataset, merging with the county_facts dataset would be much more difficult.

In [125]:
df = df.merge(Facts, left_on='FIPS', right_on='fips', how='left')
df = df.drop('fips', axis=1)
df = df.set_index(['FIPS', 'area_name', 'state_abbreviation'])
df.head()

FIPS,area_name,state_abbreviation,rel_voting_margin,PST045214,PST040210,PST120214,POP010210,AGE135214,AGE295214,AGE775214,SEX255214,RHI125214,...,SBO415207,SBO015207,MAN450207,WTN220207,RTN130207,RTN131207,AFN120207,BPS030214,LND110210,POP060210
1001.0,Autauga County,AL,-0.496905,55395.0,54571.0,1.5,54571.0,6.0,25.2,13.8,51.4,77.9,...,0.7,31.7,0.0,0.0,598175.0,12003.0,88157.0,131.0,594.44,91.8
1003.0,Baldwin County,AL,-0.496905,200111.0,182265.0,9.8,182265.0,5.6,22.2,18.7,51.2,87.1,...,1.3,27.3,1410273.0,0.0,2966489.0,17166.0,436955.0,1384.0,1589.78,114.6
1005.0,Barbour County,AL,-0.496905,26887.0,27457.0,-2.1,27457.0,5.7,21.2,16.5,46.6,50.2,...,0.0,27.0,0.0,0.0,188337.0,6334.0,0.0,8.0,884.88,31.0
1007.0,Bibb County,AL,-0.496905,22506.0,22919.0,-1.8,22915.0,5.3,21.0,14.8,45.9,76.3,...,0.0,0.0,0.0,0.0,124707.0,5804.0,10757.0,19.0,622.58,36.8
1009.0,Blount County,AL,-0.496905,57719.0,57322.0,0.7,57322.0,6.1,23.6,17.0,50.5,96.0,...,0.0,23.2,341544.0,0.0,319700.0,5622.0,20941.0,3.0,644.78,88.9


Drop every row that contains a NaN.

In [162]:
df = df.dropna(axis=0, how='any')

<b>Q:</b> Why do we use relative voting margin instead of the voting ratio democrat_votes / republican_votes?

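<b>A:</b> Dividing the vote difference by the total vote count normalizes the margin to the range [-1, 1], so it is comparable across counties of very different sizes. It is also symmetric around zero (swapping the two parties only flips the sign), whereas the ratio is unbounded, asymmetric, and undefined when republican_votes is zero.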
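
A small worked comparison of the two measures on hypothetical vote counts (the function names and the numbers are illustrative only):

```python
def margin(dem, rep, other=0):
    """Relative voting margin: (D - R) / total."""
    return (dem - rep) / (dem + rep + other)

def ratio(dem, rep):
    """Voting ratio: D / R (undefined when rep == 0)."""
    return dem / rep

# Two mirror-image counties: the parties' counts are swapped.
county_a = (margin(300, 100), ratio(300, 100))
county_b = (margin(100, 300), ratio(100, 300))

print(county_a)  # margin 0.5, ratio 3.0
print(county_b)  # margin -0.5, ratio about 0.333
```

The margin changes only in sign between the two counties, while the ratio moves from 3 to 1/3, so equal-sized leads for each party sit at very different distances from the tie value of 1.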

## Step 2: Explaining rel_voting_margin

In [163]:
from sklearn.linear_model import LinearRegression

Regress rel_voting_margin on each column of county_facts separately (one simple linear regression per column), and record the R² of each fit.

In [185]:
(m, n) = df.shape
Y = df.iloc[:, 0].values.reshape(m, 1)

In [182]:
r_square = []
for i in range(1,n):
    X = df.iloc[:, i].values.reshape(m, 1)
    lm = LinearRegression()
    lm.fit(X, Y)
    r_square.append(lm.score(X=X, y=Y))
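
For a one-feature linear regression with an intercept, R² equals the squared Pearson correlation between the feature and the target, so the same ranking can be obtained without fitting a model per column. A sketch on synthetic data (the `toy` frame and `r2_fit` helper are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Synthetic data: y depends on x1 but not on x2.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})
toy['y'] = 2.0 * toy['x1'] + rng.normal(size=50)

# Squared Pearson correlation of each feature with the target.
r2_corr = toy[['x1', 'x2']].corrwith(toy['y']) ** 2

def r2_fit(x, y):
    """R² of an ordinary least-squares line y ~ slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

for col in ['x1', 'x2']:
    print(col, r2_corr[col], r2_fit(toy[col], toy['y']))
```

On the notebook's data, the equivalent expression would be along the lines of `df.iloc[:, 1:].corrwith(df['rel_voting_margin']) ** 2`, which keeps each value labelled with its column name.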

Rank the R² values in descending order to get the top 3 factors explaining the Democratic relative voting margin in each county for the 2016 election.

In [188]:
fact_dic = pd.read_csv('county_facts_dictionary.csv')
fact_dic['r_square'] = r_square
fact_dic.sort_values(by='r_square', ascending=False).head(3)

Unnamed: 0,column_name,description,r_square
26,HSG495213,"Median value of owner-occupied housing units, ...",0.048331
8,RHI125214,"White alone, percent, 2014",0.025178
9,RHI225214,"Black or African American alone, percent, 2014",0.023545
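
Assigning the `r_square` list directly to `fact_dic` relies on the dictionary rows being in the same order as the feature columns of `df`. A sketch of attaching the scores by column name instead, on toy frames (the names `A`, `B`, `C` and the scores are made up for illustration):

```python
import pandas as pd

# Toy dictionary of variable descriptions.
fact_dic = pd.DataFrame({'column_name': ['A', 'B', 'C'],
                         'description': ['desc A', 'desc B', 'desc C']})

# R² values keyed by feature name, deliberately in a different order.
r_square = pd.Series({'C': 0.30, 'A': 0.10, 'B': 0.20}, name='r_square')

# Join on the column name so each score lands on the right description.
ranked = (fact_dic.set_index('column_name')
                  .join(r_square)
                  .sort_values('r_square', ascending=False))
top = ranked.head(2).index.tolist()
print(top)
```

In the notebook, the keyed series could be built as `pd.Series(r_square, index=df.columns[1:], name='r_square')`, assuming `df` still has rel_voting_margin as its first column.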
