
# Campaign contributions in the 2012 election

The data comes from the FEC. The analysis covers three tasks:
1. Clean up the donors' occupation and employer labels. The replacements are listed in the clean_up_name dict.
2. Find the total contribution for each occupation and each party, and plot the major occupations.
3. Find the top donors for the two main 2012 candidates (Obama and Romney), by state and by occupation.


In [1]:
import pandas as pd 
fec = pd.read_csv('P00000001-ALL.csv', dtype={'contbr_zip':str})

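# Every candidate in this file is a Republican except Barack Obama
cand = fec.cand_nm.unique().tolist()
dict_cand = {c: 'Democrat' if c == 'Obama, Barack' else 'Republican' for c in cand}
fec['party'] = fec.cand_nm.map(dict_cand)
# Keep positive amounts only; copy so later column assignments do not write to a view
fec = fec[fec.contb_receipt_amt > 0].copy()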
fec.head()

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,party
0,C00410118,P20002978,"Bachmann, Michelle","HARVEY, WILLIAM",MOBILE,AL,366010290,RETIRED,RETIRED,250.0,20-JUN-11,,,,SA17A,736166,Republican
1,C00410118,P20002978,"Bachmann, Michelle","HARVEY, WILLIAM",MOBILE,AL,366010290,RETIRED,RETIRED,50.0,23-JUN-11,,,,SA17A,736166,Republican
2,C00410118,P20002978,"Bachmann, Michelle","SMITH, LANIER",LANETT,AL,368633403,INFORMATION REQUESTED,INFORMATION REQUESTED,250.0,05-JUL-11,,,,SA17A,749073,Republican
3,C00410118,P20002978,"Bachmann, Michelle","BLEVINS, DARONDA",PIGGOTT,AR,724548253,NONE,RETIRED,250.0,01-AUG-11,,,,SA17A,749073,Republican
4,C00410118,P20002978,"Bachmann, Michelle","WARDENBURG, HAROLD",HOT SPRINGS NATION,AR,719016467,NONE,RETIRED,300.0,20-JUN-11,,,,SA17A,736166,Republican


In [2]:
# 1. Use the map function to add party affiliation (everyone is Republican except Barack Obama).
# 2. Get the number of contributions for each occupation.
# 3. Clean the occupation and employer names.
# 4. Compare which occupation prefers which party.
# 5. Generate a plot for step 4.
# 6. Get the top donors for Obama and Romney by occupation and by state.
# 7. Use cut/qcut to categorize the contribution amounts.

## clean up occupation column

In [None]:
clean_up_name = {'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
                 'INFORMATION REQUESTED': 'NOT PROVIDED',
                'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
                 'C.E.O.': 'CEO'}

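# Replace longer labels first so a shorter key cannot rewrite part of a longer one;
# regex=False treats '.', '(' and ')' as literal characters.
for k in sorted(clean_up_name, key=len, reverse=True):
    v = clean_up_name[k]
    for col in ['contbr_occupation', 'contbr_employer']:
        fec[col] = fec[col].str.replace(k, v, regex=False)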
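
One thing to keep in mind when cleaning labels like these: Series.str.replace substitutes substrings, whereas Series.replace with a dict swaps whole values only. Here is a minimal sketch of the difference on toy data (the sample values and the variable names are made up for illustration, not taken from the FEC file):

```python
import pandas as pd

# Toy occupation labels, invented for this example
s = pd.Series(['INFORMATION REQUESTED',
               'INFORMATION REQUESTED (BEST EFFORTS)',
               'C.E.O.',
               'ENGINEER'])

mapping = {'INFORMATION REQUESTED': 'NOT PROVIDED',
           'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
           'C.E.O.': 'CEO'}

# Whole-value replacement: a cell changes only if it equals a key exactly
whole = s.replace(mapping)

# Substring replacement: the match is rewritten even inside a longer label
partial = s.str.replace('INFORMATION REQUESTED', 'NOT PROVIDED', regex=False)

print(whole.tolist())
print(partial.tolist())
```

With substring replacement the second label keeps its "(BEST EFFORTS)" suffix, which is why the order of the keys matters there.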

Alternatively, there is another way to replace these garbage names. However, we cannot simply pass the dict to map, because any value without a matching key would become NA. Hence, we use get(x, x), which makes x its own default, so values with no mapping pass through unchanged.

In [None]:
emp_mapping = {
       'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
       'INFORMATION REQUESTED' : 'NOT PROVIDED',
       'SELF' : 'SELF-EMPLOYED',
       'SELF EMPLOYED' : 'SELF-EMPLOYED',
}
f = lambda x: emp_mapping.get(x, x)  # dict.get(key, default): unmapped values fall back to themselves
fec['contbr_employer'] = fec.contbr_employer.map(f)  # map returns a new Series, so assign it back
fec.contbr_employer.head()
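
To see why the fallback matters, here is a small sketch on toy data (the employer values and the names `employers` and `emp_map` are made up for illustration). It compares mapping with the dict directly against mapping through dict.get:

```python
import pandas as pd

# Toy employer labels, invented for this example
employers = pd.Series(['SELF', 'SELF EMPLOYED', 'ACME CORP'])
emp_map = {'SELF': 'SELF-EMPLOYED', 'SELF EMPLOYED': 'SELF-EMPLOYED'}

# Passing the dict directly: labels without a key become NaN
direct = employers.map(emp_map)

# Passing dict.get(x, x): labels without a key pass through unchanged
fallback = employers.map(lambda x: emp_map.get(x, x))

print(direct.tolist())
print(fallback.tolist())
```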

## top donors by occupation

In [None]:
# Total amount per (party, occupation); select the amount column before summing
amount_party_job = fec.groupby(['party', 'contbr_occupation'])['contb_receipt_amt'].sum()
amount_party_job = amount_party_job.unstack(0)  # one column per party
# Ten occupations with the largest combined total
sorting_idx = amount_party_job.sum(axis=1).sort_values(ascending=False).index[:10]
to_plot = amount_party_job.loc[sorting_idx, :]  # a DataFrame
to_plot.plot(kind='barh')

In [None]:
# An alternative way to make the plot.
# The trick is to melt the party columns into a single designated 'party' column.
import seaborn as sns
sns_plot = to_plot.reset_index().melt(id_vars='contbr_occupation', value_vars=['Democrat', 'Republican'], var_name='party')
sns.barplot(data=sns_plot, hue='party', x='value', y='contbr_occupation')

## Top donors by candidate

In [None]:
# Copy the subset so that adding columns later does not modify a view of fec
two_cand = fec.loc[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt']), :].copy()
two_cand.head()

In [None]:
def find_top_contrs(df):
    # Total amount per occupation within one candidate's donations
    df = df.groupby('contbr_occupation')['contb_receipt_amt'].sum()
    df = df.sort_values(ascending=False)  # Series.sort_values needs no `by` argument
    return df.head(5)

In [None]:
result_two_cand = two_cand.groupby(['cand_nm']).apply(find_top_contrs)
result_two_cand

In [None]:
def find_top_employer(df):
    # Total amount per employer within one candidate's donations
    df = df.groupby('contbr_employer')['contb_receipt_amt'].sum()
    df = df.sort_values(ascending=False)  # Series.sort_values needs no `by` argument
    return df.head(10)

In [None]:
result_two_cand = two_cand.groupby(['cand_nm']).apply(find_top_employer)
result_two_cand

The important lesson here is that we can perform a groupby operation inside a groupby. In this case, for each presidential candidate, we group their donors by occupation or employer and find the top ones.

In [None]:
def find_top_state(df):
    # Total amount per state within one candidate's donations
    df = df.groupby('contbr_st')['contb_receipt_amt'].sum()
    df = df.sort_values(ascending=False)  # Series.sort_values needs no `by` argument
    return df
result_two_cand = two_cand.groupby(['cand_nm']).apply(find_top_state)
result_two_cand.head()

In [None]:
two_cand_by_state = result_two_cand.unstack(0).dropna()
index = two_cand_by_state.sum(axis=1).sort_values(ascending=False).index[:20]  # top 20 donor states
# Each candidate's share of the state's total, ordered by total amount
two_cand_by_state.div(two_cand_by_state.sum(axis=1), axis=0).loc[index, :]

# Bucketing Donation Amounts

This part is about bucketing, or discretizing, a continuous variable. The two most commonly used functions are cut and qcut. The cut function lets you set the bin thresholds yourself, while the qcut function picks quantile-based thresholds, so each bucket holds roughly the same number of samples.
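
The difference between the two can be seen on a toy Series. This is a minimal sketch; the amounts and bin edges are made up for illustration:

```python
import pandas as pd

# Toy donation amounts, invented for this example
amounts = pd.Series([5, 20, 35, 50, 200, 1000, 2500, 5000])

# cut: we choose the edges, so bucket sizes depend on the data
by_edges = pd.cut(amounts, bins=[0, 10, 100, 1000, 10000])

# qcut: the edges come from quantiles, so buckets hold roughly equal counts
by_quantile = pd.qcut(amounts, q=4)

edge_counts = by_edges.value_counts().sort_index()
quantile_counts = by_quantile.value_counts().sort_index()
print(edge_counts)
print(quantile_counts)
```

Fixed edges suit amounts that span several orders of magnitude, as the donations here do, because the buckets stay easy to read.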

In [None]:
import numpy as np
bins = np.array([0, 1, 10, 100, 1000, 10000,100000, 1000000, 10000000])
two_cand['bins'] = pd.cut(two_cand.contb_receipt_amt, bins)

In [None]:
# Number of donations per candidate and bucket
two_c = two_cand.groupby(['cand_nm', 'bins'])['contb_receipt_amt'].count()
two_c

In [None]:
two_c.unstack(0)

In [None]:
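# Total amount per candidate and bucket, one column per candidate
two_c_prop = two_cand.groupby(['cand_nm', 'bins'])['contb_receipt_amt'].sum().unstack(0)
# Normalize each bucket (row) so the two candidates' shares sum to 1
pp = two_c_prop.div(two_c_prop.sum(axis=1), axis=0)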

In [None]:
pp[0:-2].plot.barh(title='Candidate share of amount donated, by donation size')