## Motivation

Every two years, every member of the United States House of Representatives is up for election. After being elected, members of the House are given a set budget from the legislature itself to hire staff, buy office equipment, and defray other costs of legislating and addressing constituent concerns. While each office gets the same amount of money from Congress to spend on these purposes, congressional offices have discretion over how that allowance is actually spent, and we would like to see whether some spending patterns are associated with higher political success. 

# Part 1: Getting/Formatting the Data

For this project, we use the [House Office Expenditure Data](https://www.propublica.org/datastore/dataset/house-office-expenditures) from ProPublica, which contains well-formatted data on House office expenditures from 2009 to 2018. The main downside of this dataset is that it is missing data from the most recent midterm election, but we still have almost 10 years of data to work with.

To access the datasets programmatically, we have included copies in this [repository](https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/).
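The quarterly files used in this notebook appear to follow the naming pattern `<year>Q<quarter>-house-disburse-detail.csv`. As a minimal sketch under that assumption, the URLs can be built by a small helper. The names `BASE_URL`, `quarter_label`, and `quarter_url` are hypothetical and only serve this illustration.

```python
# Assumed layout: the base URL and file suffix are taken from the paths used in this notebook.
BASE_URL = "https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/"


def quarter_label(year, quarter):
    """Return a label such as '2016Q1' (hypothetical helper)."""
    return f"{year}Q{quarter}"


def quarter_url(year, quarter):
    """Build the URL of one quarterly disbursement file (hypothetical helper)."""
    return f"{BASE_URL}{quarter_label(year, quarter)}-house-disburse-detail.csv"


# Quarters loaded in this notebook: all of 2016 and 2017, plus 2018 Q1.
labels = [quarter_label(y, q) for y in (2016, 2017) for q in range(1, 5)]
labels.append(quarter_label(2018, 1))

print(quarter_url(2016, 1))
print(labels)
```

Each URL could then be passed to `pd.read_csv(..., thousands=',')`, as the loading cell does.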

In [1]:
import pandas as pd
# collect one DataFrame per quarterly file
frames = []
# #manually add stuff for 2009 since only Q3 and Q4 are present
# frames.append(pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2009Q3-house-disburse-detail.csv').dropna(subset = ['BIOGUIDE_ID']))
# frames.append(pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2009Q4-house-disburse-detail.csv').dropna(subset = ['BIOGUIDE_ID']))


# load every quarter of 2016 and 2017 (range(2016, 2018) stops before 2018)
for i in range(2016, 2018):
    for j in range(1, 5):
        df = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/' + str(i) + 'Q' + str(j) +'-house-disburse-detail.csv', thousands=',')
        # drop rows that have no BIOGUIDE_ID
        df.dropna(subset = ['BIOGUIDE_ID'], inplace=True)
        # tag each row with the quarter it came from
        df["QUARTER"] = str(i) + 'Q' + str(j)
        frames.append(df)
        

# manually add stuff for 2018 since only Q1 is present
df = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2018Q1-house-disburse-detail.csv', thousands=',')
df.dropna(subset = ['BIOGUIDE_ID'], inplace=True)
df["QUARTER"] = str(2018) + 'Q' + str(1)
frames.append(df)

house_data = pd.concat(frames)
house_data





Unnamed: 0,AMOUNT,BIOGUIDE_ID,CATEGORY,DATE,END DATE,OFFICE,PAYEE,PROGRAM,PURPOSE,QUARTER,RECIP (orig.),RECORDID,SORT SEQUENCE,START DATE,TRANSCODE,TRANSCODELONG,YEAR
5387,-25.85,A000374,FRANKED MAIL,01-31,01/31/16,HON. RALPH ABRAHAM,,,FRANKED MAIL,2016Q1,,FLG0055718,,01/20/16,GL,General ledger,2016
5388,627.83,A000374,FRANKED MAIL,02-29,01/31/16,HON. RALPH ABRAHAM,UNITED STATES POSTAL SERVICE,,FRANKED MAIL,2016Q1,UNITED STATES POSTAL SERVICE,00844090,,01/03/16,AP,Accounts payable,2016
5389,-62.50,A000374,FRANKED MAIL,02-29,02/29/16,HON. RALPH ABRAHAM,,,FRANKED MAIL,2016Q1,,FLG0056519,,02/20/16,GL,General ledger,2016
5390,944.14,A000374,FRANKED MAIL,03-23,02/29/16,HON. RALPH ABRAHAM,UNITED STATES POSTAL SERVICE,,FRANKED MAIL,2016Q1,UNITED STATES POSTAL SERVICE,00849298,,02/01/16,AP,Accounts payable,2016
5391,-9.45,A000374,FRANKED MAIL,03-31,03/31/16,HON. RALPH ABRAHAM,,,FRANKED MAIL,2016Q1,,FLG0057391,,03/20/16,GL,General ledger,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61345,54.85,Z000017,SUPPLIES AND MATERIALS,2/19/18,12/30/17,2017 HON. LEE M. ZELDIN,CITI PCARD-READYREFRESH BY NESTLE,OFFICIAL EXPENSES OF MEMBERS,OFFICE SUPPLIES (OUTSIDE),2018Q1,,974834,DETAIL,12/29/17,AP,,2017
61346,110.00,Z000017,SUPPLIES AND MATERIALS,2/19/18,1/26/18,2017 HON. LEE M. ZELDIN,CITI PCARD-TIMES REVIEW NEWSPAP,OFFICIAL EXPENSES OF MEMBERS,PUBLICATIONS/REFERENCE MAT'L,2018Q1,,974834,DETAIL,12/29/17,AP,,2018
61347,2183.20,Z000017,SUPPLIES AND MATERIALS,,,2017 HON. LEE M. ZELDIN,,OFFICIAL EXPENSES OF MEMBERS,SUPPLIES AND MATERIALS TOTALS:,2018Q1,,,SUBTOTAL,,,,2018
61348,38303.93,Z000017,SUPPLIES AND MATERIALS,,,2017 HON. LEE M. ZELDIN,,OFFICIAL EXPENSES OF MEMBERS,OFFICIAL EXPENSES OF MEMBERS TOTALS:,2018Q1,,,SUBTOTAL,,,,2018


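We also load House election results from the [election data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2) hosted on the Harvard Dataverse. A copy of the file, which covers 1976 to 2020, is included in the same repository.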

In [2]:
election_data = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/1976-2020-house.csv')

# filter out the years before 2009 and after 2018
election_data = election_data[election_data['year'] >= 2009]
election_data = election_data[election_data['year'] <= 2018]

election_data

Unnamed: 0,year,state,state_po,state_fips,state_cen,state_ic,office,district,stage,runoff,special,candidate,party,writein,mode,candidatevotes,totalvotes,unofficial,version,fusion_ticket
22553,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,DAVID WALTER,CONSTITUTION,False,TOTAL,26357,156281,False,20220331,False
22554,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,JO BONNER,REPUBLICAN,False,TOTAL,129063,156281,False,20220331,False
22555,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,WRITEIN,,True,TOTAL,861,156281,False,20220331,False
22556,2010,ALABAMA,AL,1,63,41,US HOUSE,2,GEN,,False,BOBBY BRIGHT,DEMOCRAT,False,TOTAL,106865,219028,False,20220331,False
22557,2010,ALABAMA,AL,1,63,41,US HOUSE,2,GEN,,False,MARTHA ROBY,REPUBLICAN,False,TOTAL,111645,219028,False,20220331,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29631,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,DANIEL CLYDE CUMMINGS,CONSTITUTION,False,TOTAL,6070,201245,False,20220331,False
29632,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,GREG HUNTER,DEMOCRAT,False,TOTAL,59903,201245,False,20220331,False
29633,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,LIZ CHENEY,REPUBLICAN,False,TOTAL,127963,201245,False,20220331,False
29634,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,RICHARD BRUBAKER,LIBERTARIAN,False,TOTAL,6918,201245,False,20220331,False
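One way to turn these rows into a measure of electoral success is to compute each candidate's share of the vote and pick the top vote-getter in each race. The sketch below is only an illustration, not necessarily the approach this project takes. It uses two rows copied from the table above. It assumes one row per candidate per race, which may not hold for fusion tickets or runoffs.

```python
import pandas as pd

# Two rows copied from the table above (Alabama, district 2, 2010).
sample = pd.DataFrame({
    "year": [2010, 2010],
    "state_po": ["AL", "AL"],
    "district": [2, 2],
    "candidate": ["BOBBY BRIGHT", "MARTHA ROBY"],
    "candidatevotes": [106865, 111645],
    "totalvotes": [219028, 219028],
})

# Share of the total vote received by each candidate.
sample["vote_share"] = sample["candidatevotes"] / sample["totalvotes"]

# Index of the row with the most votes in each (year, state, district) race.
winner_idx = sample.groupby(["year", "state_po", "district"])["candidatevotes"].idxmax()
winners = sample.loc[winner_idx]

print(winners[["year", "state_po", "district", "candidate", "vote_share"]])
```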


Next, we look at which categories appear in the data and how they should be grouped into broader categories.

In [3]:
from collections import defaultdict
categories = defaultdict(lambda: 0)
def add_set(row):
      categories[row['CATEGORY']] += 1

#apply add_set to the house_data
house_data.apply(add_set, axis = 1);
categories

defaultdict(<function __main__.<lambda>()>,
            {'FRANKED MAIL': 25231,
             'PERSONNEL COMPENSATION': 97559,
             'TRAVEL': 177810,
             'RENT, COMMUNICATION, UTILITIES': 59560,
             'PRINTING AND REPRODUCTION': 33221,
             'OTHER SERVICES': 36402,
             'SUPPLIES AND MATERIALS': 134555,
             'EQUIPMENT': 24966,
             'TRANSPORTATION OF THINGS': 248,
             'PERSONNEL BENEFITS': 1,
             'RENT COMMUNICATION UTILITIES': 18459,
             'RENT  COMMUNICATION  UTILITIES': 92169,
             'BENEFITS TO FORMER PERSONNEL': 5})
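The same tally can be produced without a row-wise `apply`. As a small sketch on toy data (the `toy` frame is made up for illustration), `Series.value_counts` counts how often each label occurs and is usually much faster on a frame of this size:

```python
import pandas as pd

# Toy stand-in for house_data; only the CATEGORY column matters here.
toy = pd.DataFrame({
    "CATEGORY": ["TRAVEL", "FRANKED MAIL", "TRAVEL", "EQUIPMENT", "TRAVEL"],
})

# Count how many rows carry each category label, most frequent first.
category_counts = toy["CATEGORY"].value_counts()
print(category_counts)
```

On the real data, the equivalent call would be `house_data['CATEGORY'].value_counts()`.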

In [4]:
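# the same category appears under three spellings; map the variants to one label
replace = {
    'RENT  COMMUNICATION  UTILITIES': 'RENT, COMMUNICATION, UTILITIES',
    'RENT COMMUNICATION UTILITIES': 'RENT, COMMUNICATION, UTILITIES'
}
# restrict the replacement to the CATEGORY column so other columns are left untouched
house_data['CATEGORY'] = house_data['CATEGORY'].replace(replace)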
categories = defaultdict(lambda: 0)
def add_set(row):
      categories[row['CATEGORY']] += 1
      
house_data.apply(add_set, axis = 1);
categories

defaultdict(<function __main__.<lambda>()>,
            {'FRANKED MAIL': 25231,
             'PERSONNEL COMPENSATION': 97559,
             'TRAVEL': 177810,
             'RENT, COMMUNICATION, UTILITIES': 170188,
             'PRINTING AND REPRODUCTION': 33221,
             'OTHER SERVICES': 36402,
             'SUPPLIES AND MATERIALS': 134555,
             'EQUIPMENT': 24966,
             'TRANSPORTATION OF THINGS': 248,
             'PERSONNEL BENEFITS': 1,
             'BENEFITS TO FORMER PERSONNEL': 5})
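The three spellings of the rent category differ only in commas and spacing. An alternative to a hand-written mapping is to normalize every label with string operations. This is a sketch on toy data, and it makes one choice that differs from the mapping above: the canonical label it produces has no commas.

```python
import pandas as pd

# Toy labels reproducing the three spellings seen in the counts above.
labels = pd.Series([
    "RENT, COMMUNICATION, UTILITIES",
    "RENT COMMUNICATION UTILITIES",
    "RENT  COMMUNICATION  UTILITIES",
    "TRAVEL",
])

# Drop commas, collapse runs of whitespace into one space, trim the ends.
normalized = (
    labels.str.replace(",", "", regex=False)
          .str.replace(r"\s+", " ", regex=True)
          .str.strip()
)

print(normalized.unique())
```

This would also catch other spacing or comma variants that might show up in quarters not yet loaded, at the cost of changing the label text.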

In [5]:

hd_by_quarter = house_data.groupby('QUARTER')
hd_by_quarter.size()
hd_by_quarter.groups

{'2016Q1': Int64Index([ 5387,  5388,  5389,  5390,  5391,  5392,  5393,  5394,  5395,
              5396,
             ...
             87796, 87797, 87798, 87799, 87800, 87801, 87802, 87803, 87804,
             87805],
            dtype='int64', length=82406),
 '2016Q2': Int64Index([ 5518,  5519,  5520,  5521,  5522,  5523,  5524,  5525,  5526,
              5527,
             ...
             85008, 85009, 85010, 85011, 85012, 85013, 85014, 85015, 85016,
             85017],
            dtype='int64', length=79481),
 '2016Q3': Int64Index([ 5693,  5694,  5695,  5696,  5697,  5698,  5699,  5700,  5701,
              5702,
             ...
             78981, 78982, 78983, 78984, 78985, 78986, 78987, 78988, 78989,
             78990],
            dtype='int64', length=73270),
 '2016Q4': Int64Index([ 5493,  5494,  5495,  5496,  5497,  5498,  5499,  5500,  5501,
              5502,
             ...
             76883, 76884, 76885, 76886, 76887, 76888, 76889, 76890, 76891,
             76

In [13]:
import numpy as np
quarters = house_data['QUARTER'].unique()
categories = house_data['CATEGORY'].unique()

quarter = quarters[0]
hd_by_quarter = house_data[house_data['QUARTER'] == quarter]
bio_ids = hd_by_quarter['BIOGUIDE_ID'].unique()


# total amount per (member, category); the result is indexed by BIOGUIDE_ID, then CATEGORY
spending_per_candidate = hd_by_quarter.groupby(['BIOGUIDE_ID','CATEGORY'])['AMOUNT'].sum()

# build one row per member with one column per category (0 when a category has no entries)
spending_data = []
for bio_id in bio_ids:
    row = [bio_id]
    for category in categories:
        if category in spending_per_candidate[bio_id]:
            row.append(spending_per_candidate[bio_id][category])
        else:
            row.append(0)
    spending_data.append(row)
quarter_spending_df = pd.DataFrame(spending_data,columns = ["BIOGUIDE_ID"] + categories.tolist())
quarter_spending_df

Unnamed: 0,BIOGUIDE_ID,FRANKED MAIL,PERSONNEL COMPENSATION,TRAVEL,"RENT, COMMUNICATION, UTILITIES",PRINTING AND REPRODUCTION,OTHER SERVICES,SUPPLIES AND MATERIALS,EQUIPMENT,TRANSPORTATION OF THINGS,PERSONNEL BENEFITS,BENEFITS TO FORMER PERSONNEL
0,A000374,2433.05,208960.60,20827.93,19348.00,2255.50,9697.34,3770.78,466.20,0.00,0.0,0
1,A000370,20807.28,278020.63,11188.31,24069.64,31032.73,7139.75,11784.75,2163.69,0.00,0.0,0
2,A000055,2226.95,238973.95,14521.68,25156.54,2993.68,5686.53,23911.44,1685.25,0.00,0.0,0
3,A000371,15194.90,207955.61,19109.92,30868.23,48556.87,5692.00,3894.49,4338.41,0.00,0.0,0
4,A000372,53488.23,216416.12,10613.05,31413.01,22435.65,23773.97,7119.03,8231.82,0.00,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
452,Y000066,5431.69,220412.77,10948.66,19740.14,11103.72,22884.84,38224.48,8858.24,0.00,0.0,0
453,Y000033,775.06,243902.28,18090.62,22143.81,615.85,11250.00,17876.68,4098.39,0.00,0.0,0
454,Y000064,14101.66,225005.56,12034.22,17341.96,15317.40,12038.02,6585.75,1217.79,0.00,0.0,0
455,Z000017,41137.64,220874.98,5361.25,21395.06,26891.43,11695.00,3000.35,1413.02,18.13,0.0,0
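The nested loop above reshapes long data (one row per transaction) into wide data (one row per member, one column per category). As a sketch of an alternative, `pivot_table` can do the same reshaping in one call. The toy frame below is made up, apart from the two franked-mail amounts taken from the first table. One difference from the loop: `pivot_table` orders the category columns alphabetically.

```python
import pandas as pd

# Toy stand-in for one quarter of house_data.
toy = pd.DataFrame({
    "BIOGUIDE_ID": ["A000374", "A000374", "A000374", "Z000017"],
    "CATEGORY": ["FRANKED MAIL", "FRANKED MAIL", "TRAVEL", "TRAVEL"],
    "AMOUNT": [-25.85, 627.83, 100.0, 50.0],
})

# Sum AMOUNT per member and category; missing combinations become 0.
wide = (
    toy.pivot_table(index="BIOGUIDE_ID", columns="CATEGORY",
                    values="AMOUNT", aggfunc="sum", fill_value=0)
       .reset_index()
)
wide.columns.name = None  # drop the leftover "CATEGORY" axis label

print(wide)
```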


In [17]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
train, test = train_test_split(quarter_spending_df[categories], test_size=0.2)
# algorithm='auto' is no longer accepted by recent scikit-learn releases; the remaining defaults are left implicit
kmeans_model = KMeans(n_clusters=8, n_init=10)
kmeans_trained = kmeans_model.fit(train)
kmeans_trained.cluster_centers_

array([[ 4.10417563e+03,  2.12396503e+05,  1.12490868e+04,
         2.30833500e+04,  3.17063850e+03,  1.73584741e+04,
         9.20753900e+03,  4.31957525e+03,  3.13125000e-01,
        -1.94289029e-16,  0.00000000e+00],
       [ 9.37500000e+00,  4.12840000e+02,  6.09000000e+00,
         2.29033333e+01,  0.00000000e+00,  1.91666667e+01,
         1.42280833e+02, -8.48250000e+00, -8.88178420e-16,
         0.00000000e+00,  0.00000000e+00],
       [ 4.78975089e+04,  1.94078866e+05,  1.04468354e+04,
         2.69625746e+04,  5.54827696e+04,  1.65362154e+04,
         9.82128571e+03,  9.61261821e+03, -8.88178420e-16,
         0.00000000e+00,  0.00000000e+00],
       [ 1.47622078e+04,  1.85427706e+05,  1.52679441e+04,
         2.47954713e+04,  1.42043143e+04,  1.73197854e+04,
         8.47243913e+03,  6.54075717e+03,  2.83500000e+01,
        -1.38777878e-17,  0.00000000e+00],
       [ 2.71508481e+03,  2.60769519e+05,  9.43638827e+03,
         2.52379571e+04,  4.11672846e+03,  9.63474442e+03,
  