## Motivation

Every two years, every member of the United States House of Representatives is up for election. After being elected, members of the House are given a set budget from the legislature itself to hire staff, buy office equipment, and defray other costs of legislating and addressing constituent concerns. Each office receives a broadly comparable allowance from Congress for these purposes (the exact amount varies somewhat from member to member), but congressional offices have discretion over how that allowance is actually spent. We would like to see whether some spending patterns are associated with greater political success.

# Getting/Formatting the Data

For this project, we decided to use the [House Office Expenditure Data](https://www.propublica.org/datastore/dataset/house-office-expenditures) from ProPublica, as it contains well-formatted data about House expenditures from 2009 to 2018. The main downside of this dataset is that it is missing data from the most recent midterm election, but we still have almost 10 years of data to work with.

To access the datasets programmatically, we have included copies in this [repository](https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/).

In [53]:
import pandas as pd
# Collect one dataframe per quarterly file, then concatenate them below
frames = []
# #manually add stuff for 2009 since only Q3 and Q4 are present
# frames.append(pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2009Q3-house-disburse-detail.csv').dropna(subset = ['BIOGUIDE_ID']))
# frames.append(pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2009Q4-house-disburse-detail.csv').dropna(subset = ['BIOGUIDE_ID']))


# Load every quarter of 2016 and 2017 (range(2016, 2018) stops before 2018)
for i in range(2016, 2018):
    for j in range(1, 5):
        df = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/' + str(i) + 'Q' + str(j) +'-house-disburse-detail.csv', thousands=',')
        # Keep only rows tied to a member; rows without a BIOGUIDE_ID are dropped
        df.dropna(subset = ['BIOGUIDE_ID'], inplace=True)
        # Tag each row with the quarter it came from, e.g. '2016Q1'
        df["QUARTER"] = str(i) + 'Q' + str(j)
        frames.append(df)
        

# manually add stuff for 2018 since only Q1 is present
df = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2018Q1-house-disburse-detail.csv', thousands=',')
df.dropna(subset = ['BIOGUIDE_ID'], inplace=True)
df["QUARTER"] = str(2018) + 'Q' + str(1)
frames.append(df)

house_data = pd.concat(frames)
house_data


Unnamed: 0,BIOGUIDE_ID,OFFICE,QUARTER,CATEGORY,DATE,PAYEE,START DATE,END DATE,PURPOSE,AMOUNT,YEAR,TRANSCODE,TRANSCODELONG,RECORDID,RECIP (orig.),PROGRAM,SORT SEQUENCE
5387,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,01-31,,01/20/16,01/31/16,FRANKED MAIL,-25.85,2016,GL,General ledger,FLG0055718,,,
5388,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,02-29,UNITED STATES POSTAL SERVICE,01/03/16,01/31/16,FRANKED MAIL,627.83,2016,AP,Accounts payable,00844090,UNITED STATES POSTAL SERVICE,,
5389,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,02-29,,02/20/16,02/29/16,FRANKED MAIL,-62.50,2016,GL,General ledger,FLG0056519,,,
5390,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,03-23,UNITED STATES POSTAL SERVICE,02/01/16,02/29/16,FRANKED MAIL,944.14,2016,AP,Accounts payable,00849298,UNITED STATES POSTAL SERVICE,,
5391,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,03-31,,03/20/16,03/31/16,FRANKED MAIL,-9.45,2016,GL,General ledger,FLG0057391,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61345,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,2/19/18,CITI PCARD-READYREFRESH BY NESTLE,12/29/17,12/30/17,OFFICE SUPPLIES (OUTSIDE),54.85,2017,AP,,974834,,OFFICIAL EXPENSES OF MEMBERS,DETAIL
61346,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,2/19/18,CITI PCARD-TIMES REVIEW NEWSPAP,12/29/17,1/26/18,PUBLICATIONS/REFERENCE MAT'L,110.00,2018,AP,,974834,,OFFICIAL EXPENSES OF MEMBERS,DETAIL
61347,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,,,,,SUPPLIES AND MATERIALS TOTALS:,2183.20,2018,,,,,OFFICIAL EXPENSES OF MEMBERS,SUBTOTAL
61348,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,,,,,OFFICIAL EXPENSES OF MEMBERS TOTALS:,38303.93,2018,,,,,OFFICIAL EXPENSES OF MEMBERS,SUBTOTAL
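
The quarterly files follow a fixed naming pattern, so the list of files to load can be generated up front instead of being assembled inside nested loops. The following is a minimal sketch rather than the code used for the results here: `BASE_URL` is copied from the loading cell, while the helper name `quarter_urls` and its parameters are hypothetical.

```python
# Sketch: build the (quarter label, URL) pairs that a loading loop would read.
BASE_URL = "https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/"


def quarter_urls(first_year, last_year, extra=()):
    """Return (label, url) pairs for all four quarters of each year from
    first_year through last_year, plus any partial-year labels in `extra`."""
    labels = [
        f"{year}Q{q}"
        for year in range(first_year, last_year + 1)
        for q in range(1, 5)
    ]
    labels.extend(extra)
    return [(label, f"{BASE_URL}{label}-house-disburse-detail.csv") for label in labels]


# Full years 2016 and 2017, plus the single available quarter of 2018
pairs = quarter_urls(2016, 2017, extra=["2018Q1"])
for label, url in pairs:
    print(label, url)
```

Each pair could then be passed to `pd.read_csv(url, thousands=',')`, with `label` assigned to the `QUARTER` column, which keeps the partial 2018 year in the same code path as the full years.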


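For election results, we use this [election data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2) from the Harvard Dataverse; a copy of the file is stored in the same repository.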

In [54]:
election_data = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/1976-2020-house.csv')

# filter out the years before 2009 and after 2018
election_data = election_data[election_data['year'] >= 2009]
election_data = election_data[election_data['year'] <= 2018]

election_data

Unnamed: 0,year,state,state_po,state_fips,state_cen,state_ic,office,district,stage,runoff,special,candidate,party,writein,mode,candidatevotes,totalvotes,unofficial,version,fusion_ticket
22553,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,DAVID WALTER,CONSTITUTION,False,TOTAL,26357,156281,False,20220331,False
22554,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,JO BONNER,REPUBLICAN,False,TOTAL,129063,156281,False,20220331,False
22555,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,WRITEIN,,True,TOTAL,861,156281,False,20220331,False
22556,2010,ALABAMA,AL,1,63,41,US HOUSE,2,GEN,,False,BOBBY BRIGHT,DEMOCRAT,False,TOTAL,106865,219028,False,20220331,False
22557,2010,ALABAMA,AL,1,63,41,US HOUSE,2,GEN,,False,MARTHA ROBY,REPUBLICAN,False,TOTAL,111645,219028,False,20220331,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29631,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,DANIEL CLYDE CUMMINGS,CONSTITUTION,False,TOTAL,6070,201245,False,20220331,False
29632,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,GREG HUNTER,DEMOCRAT,False,TOTAL,59903,201245,False,20220331,False
29633,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,LIZ CHENEY,REPUBLICAN,False,TOTAL,127963,201245,False,20220331,False
29634,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,RICHARD BRUBAKER,LIBERTARIAN,False,TOTAL,6918,201245,False,20220331,False


Next, we figure out which categories we are working with and how they should be grouped into broader categories.

In [55]:
from collections import defaultdict
categories = defaultdict(lambda: 0)
# Count how many rows fall under each CATEGORY label
def add_set(row):
    categories[row['CATEGORY']] += 1

#apply add_set to the house_data
house_data.apply(add_set, axis = 1);
categories

defaultdict(<function __main__.<lambda>()>,
            {'FRANKED MAIL': 25231,
             'PERSONNEL COMPENSATION': 97559,
             'TRAVEL': 177810,
             'RENT, COMMUNICATION, UTILITIES': 59560,
             'PRINTING AND REPRODUCTION': 33221,
             'OTHER SERVICES': 36402,
             'SUPPLIES AND MATERIALS': 134555,
             'EQUIPMENT': 24966,
             'TRANSPORTATION OF THINGS': 248,
             'PERSONNEL BENEFITS': 1,
             'RENT COMMUNICATION UTILITIES': 18459,
             'RENT  COMMUNICATION  UTILITIES': 92169,
             'BENEFITS TO FORMER PERSONNEL': 5})
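
Counting labels row by row with `apply` works, but pandas also offers `Series.value_counts`, which produces the same kind of tally in one call. The sketch below uses a small made-up series as a stand-in for `house_data['CATEGORY']`; the name `toy_categories` and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for house_data['CATEGORY']; the values are illustrative only.
toy_categories = pd.Series(["TRAVEL", "FRANKED MAIL", "TRAVEL", "EQUIPMENT"])

# value_counts tallies each distinct label, most frequent first
counts = toy_categories.value_counts()
print(counts)
```

On the real data this would be `house_data['CATEGORY'].value_counts()`, which also avoids the cost of a Python-level function call per row.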

In [56]:
# Collapse the spelling variants of the rent/communication/utilities label into one
replace = {
    'RENT  COMMUNICATION  UTILITIES': 'RENT, COMMUNICATION, UTILITIES',
    'RENT COMMUNICATION UTILITIES': 'RENT, COMMUNICATION, UTILITIES'
}
house_data.replace(to_replace=replace, value=None, inplace=True)
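
The three rent labels differ only in commas and spacing, so an alternative to listing every variant by hand is to normalize punctuation and whitespace before comparing. This is a sketch of that idea, not the approach used in this notebook; the helper name `normalize_category` is hypothetical, and note that its canonical form drops the commas.

```python
import re


def normalize_category(label):
    """Drop commas and collapse runs of whitespace so spelling variants compare equal."""
    return re.sub(r"\s+", " ", label.replace(",", " ")).strip()


variants = [
    "RENT, COMMUNICATION, UTILITIES",
    "RENT COMMUNICATION UTILITIES",
    "RENT  COMMUNICATION  UTILITIES",
]

# All three spellings should collapse to a single normalized label
normalized = {normalize_category(v) for v in variants}
print(normalized)
```

Applied to the data, this could look like `house_data['CATEGORY'].map(normalize_category)`, which would also catch spacing variants that have not been seen yet.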

In [57]:
categories = defaultdict(lambda: 0)
def add_set(row):
    categories[row['CATEGORY']] += 1

house_data.apply(add_set, axis = 1)
str(categories)

"defaultdict(<function <lambda> at 0x7f8f0268a5e0>, {'FRANKED MAIL': 25231, 'PERSONNEL COMPENSATION': 97559, 'TRAVEL': 177810, 'RENT, COMMUNICATION, UTILITIES': 170188, 'PRINTING AND REPRODUCTION': 33221, 'OTHER SERVICES': 36402, 'SUPPLIES AND MATERIALS': 134555, 'EQUIPMENT': 24966, 'TRANSPORTATION OF THINGS': 248, 'PERSONNEL BENEFITS': 1, 'BENEFITS TO FORMER PERSONNEL': 5})"

In [58]:
import requests
from io import StringIO
from bs4 import BeautifulSoup
raw = requests.get("https://www.congress.gov/help/field-values/member-bioguide-ids")
soup = BeautifulSoup(raw.text, 'lxml')
table = soup.find('table')

# Since the entire thing is a formatted table, read it directly into a pandas dataframe
# (wrapped in StringIO because newer pandas versions deprecate passing literal HTML)
tabledf = pd.read_html(StringIO(str(table)))
tabledf = tabledf[0]
tabledf.dropna(subset = ['Member'], inplace=True)
tabledf.reset_index(drop=True, inplace=True)
# Split "Last, First (Party - State)" into four columns; the raw string avoids invalid-escape warnings
members = tabledf['Member'].str.extractall(r"(.*), (.*) \((.*) - (.*)\)")

members.reset_index(drop=True, inplace=True)
members.rename(columns={0 : 'LASTNAME', 1: "FIRSTNAME", 2: "PARTY", 3: "STATE"}, inplace=True)
# This positional assignment assumes every row of tabledf matched the pattern above
members['BIOGUIDE_ID'] = tabledf['Member ID']
members

Unnamed: 0,LASTNAME,FIRSTNAME,PARTY,STATE,BIOGUIDE_ID
0,Abdnor,James,Republican,South Dakota,A000009
1,Abercrombie,Neil,Democratic,Hawaii,A000014
2,Abourezk,James,Democratic,South Dakota,A000017
3,Abraham,Ralph Lee,Republican,Louisiana,A000374
4,Abraham,Spencer,Republican,Michigan,A000355
...,...,...,...,...,...
2422,Zinke,Ryan K.,Republican,Montana,Z000018
2423,Zion,Roger H.,Republican,Indiana,Z000010
2424,Zorinsky,Edward,Democratic,Nebraska,Z000013
2425,Zschau,Edwin V. W.,Republican,California,Z000014
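
The pattern used to split the `Member` column is greedy, so when a name contains more than one comma the first group absorbs everything up to the last `, `. That is consistent with the `LASTNAME` value `Beyer, Donald S.` that appears in the cluster listing later on. The sketch below compares the greedy pattern with a variant whose first group is non-greedy; the sample strings are assumptions about the page's layout, and the second one is made up to illustrate a name with a suffix.

```python
import re

# Assumed sample strings in the "Last, First (Party - State)" layout; the second
# is a made-up illustration of a name carrying an extra comma-separated suffix.
samples = [
    "Abraham, Ralph Lee (Republican - Louisiana)",
    "Beyer, Donald S., Jr. (Democratic - Virginia)",
]

greedy = re.compile(r"(.*), (.*) \((.*) - (.*)\)")
lazy = re.compile(r"(.*?), (.*) \((.*) - (.*)\)")  # only the first group is non-greedy

for text in samples:
    print("greedy:", greedy.fullmatch(text).groups())
    print("lazy:  ", lazy.fullmatch(text).groups())
```

With the non-greedy first group, the last name stops at the first comma and any suffix stays with the first name. Whether that is the preferred split depends on how the names will be used downstream.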


## k-means

In [59]:
import numpy as np
quarters = house_data['QUARTER'].unique()
categories = house_data['CATEGORY'].unique()

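# Work with a single quarter: index 4 is the fifth quarter in load order
quarter = quarters[4]
hd_by_quarter = house_data[house_data['QUARTER'] == quarter]
bio_ids = hd_by_quarter['BIOGUIDE_ID'].unique()


# Total spending per member and category for that quarter.
# Note: rows whose SORT SEQUENCE is SUBTOTAL (seen in the 2018Q1 file) repeat detail
# amounts and would inflate these sums if such a quarter were selected.
spending_per_candidate = hd_by_quarter.groupby(['BIOGUIDE_ID','CATEGORY'])['AMOUNT'].sum()
spending_data = []
# Build one row per member with one column per category, using 0 where nothing was spent
for bio_id in bio_ids:
    row = [bio_id]
    for category in categories:
        if category in spending_per_candidate[bio_id]:
            row.append(spending_per_candidate[bio_id][category])
        else:
            row.append(0)
    spending_data.append(row)
quarter_spending_df = pd.DataFrame(spending_data,columns = ["BIOGUIDE_ID"] + categories.tolist())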
# quarter_spending_df = quarter_spending_df.dropna(axis='columns')
# quarter_spending_df.dropna(subset = ['BIOGUIDE_ID']))
quarter_spending_df
# spending_per_candidate.to_csv('spending_per_candidate_' + quarter + '.csv')

Unnamed: 0,BIOGUIDE_ID,FRANKED MAIL,PERSONNEL COMPENSATION,TRAVEL,"RENT, COMMUNICATION, UTILITIES",PRINTING AND REPRODUCTION,OTHER SERVICES,SUPPLIES AND MATERIALS,EQUIPMENT,TRANSPORTATION OF THINGS,PERSONNEL BENEFITS,BENEFITS TO FORMER PERSONNEL
0,A000374,1235.60,234012.50,28623.57,19171.87,964.32,6409.85,2841.01,466.20,0.00,0,0.0
1,A000370,497.72,194269.43,8839.30,24445.28,853.15,7481.98,12858.26,10834.53,0.00,0,0.0
2,A000055,1319.35,221486.67,16458.35,23422.79,4490.80,5672.16,19533.53,1685.25,0.00,0,0.0
3,A000371,227.32,236002.41,22340.77,34242.45,6511.83,5580.00,27086.50,2966.79,0.00,0,0.0
4,A000372,11524.96,216703.99,11103.40,22524.60,12234.00,11540.00,12282.52,28309.61,0.00,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
492,Y000066,1590.74,239414.97,11215.94,34125.09,1520.25,11713.00,7336.26,2255.92,0.00,0,0.0
493,Y000033,134.37,237000.03,12776.28,21982.52,237.15,11250.00,13690.45,4345.74,0.00,0,0.0
494,Y000064,78.69,11202.10,3113.02,1274.01,56.88,451.82,765.72,10.70,0.00,0,0.0
495,Z000017,12589.16,261323.87,11695.71,25278.58,13928.18,10695.00,5990.83,1172.22,254.32,0,0.0
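
The member-by-category table built with nested loops can also be produced by reshaping the grouped sums directly. The following is a minimal sketch on made-up data, assuming the same `BIOGUIDE_ID`, `CATEGORY`, and `AMOUNT` columns as `house_data`; the frame `toy` and its amounts are illustrative only.

```python
import pandas as pd

# Toy stand-in for one quarter of house_data; amounts are illustrative only.
toy = pd.DataFrame({
    "BIOGUIDE_ID": ["A000374", "A000374", "A000374", "Z000017"],
    "CATEGORY": ["TRAVEL", "TRAVEL", "EQUIPMENT", "TRAVEL"],
    "AMOUNT": [100.0, 50.0, 20.0, 75.0],
})

# Sum per (member, category), then pivot categories into columns,
# filling member/category pairs that never occur with 0.
wide = (
    toy.groupby(["BIOGUIDE_ID", "CATEGORY"])["AMOUNT"]
    .sum()
    .unstack(fill_value=0)
    .reset_index()
)
print(wide)
```

One difference from the loop version: `unstack` only creates columns for categories that occur in the selected rows. Categories that are absent from the quarter would need to be added afterwards, for example with `reindex(columns=..., fill_value=0)`.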


In [69]:
# Join the spending data with the member data (merge defaults to an inner join, so IDs missing from members are dropped)
spending_member_info = quarter_spending_df.merge(members, on="BIOGUIDE_ID")
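
Because `merge` performs an inner join unless told otherwise, any spending row whose `BIOGUIDE_ID` is missing from the member table disappears silently. A hedged sketch of how to check for that, using made-up frames (`spending`, `names`, and the ID `X999999` are hypothetical):

```python
import pandas as pd

# Toy frames; X999999 is a made-up ID with no entry in the names table.
spending = pd.DataFrame({
    "BIOGUIDE_ID": ["A000374", "Z000017", "X999999"],
    "TRAVEL": [100.0, 75.0, 10.0],
})
names = pd.DataFrame({
    "BIOGUIDE_ID": ["A000374", "Z000017"],
    "LASTNAME": ["Abraham", "Zeldin"],
})

# Default merge is an inner join: the unmatched ID is dropped
inner = spending.merge(names, on="BIOGUIDE_ID")

# A left join keeps every spending row; indicator=True adds a _merge column
left = spending.merge(names, on="BIOGUIDE_ID", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]
print(len(inner), len(left))
print(unmatched["BIOGUIDE_ID"].tolist())
```

Listing the `left_only` rows once shows how many members are lost in the join before any clustering is run on the merged table.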

In [92]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
# train, test = train_test_split(spending_member_info[categories], test_size=0.2)
# algorithm is left at its default because the 'auto' value is deprecated in recent scikit-learn releases
kmeans_model = KMeans(n_clusters=50, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True)
kmeans_trained = kmeans_model.fit(spending_member_info[categories])
# kmeans_trained.cluster_centers_

clusters = kmeans_trained.predict(spending_member_info[categories])
# Print the members assigned to each cluster, in cluster-number order
for i in sorted(set(clusters)):
    print("Cluster:",i)
    print(spending_member_info[["STATE","BIOGUIDE_ID","LASTNAME"]][clusters==i].to_string())
    print("")
# print(spending_member_info[["PARTY","STATE"]][clusters==i].groupby("STATE").count())

Cluster: 0
         STATE BIOGUIDE_ID          LASTNAME
18  California     B000287           Becerra
22    Virginia     B001292  Beyer, Donald S.

Cluster: 1
          STATE BIOGUIDE_ID     LASTNAME
8      Nebraska     A000373      Ashford
124    Illinois     D000622    Duckworth
147   Louisiana     F000456      Fleming
149    Virginia     F000445       Forbes
163  New Jersey     G000548      Garrett
204       Texas     H000636     Hinojosa
207  California     H001034        Honda
217    New York     I000057       Israel
247     Arizona     K000368  Kirkpatrick
297  Washington     M000404    McDermott
310     Florida     M000689         Mica
312     Florida     M001144       Miller
371   Wisconsin     R000587       Ribble

Cluster: 2
         STATE BIOGUIDE_ID       LASTNAME
37       Texas     B000755          Brady
156       Ohio     F000455          Fudge
166      Texas     G000552        Gohmert
202   Arkansas     H001072           Hill
245       Iowa     K000362           King
335 

In [79]:
[clusters==1]

[array([False, False, False, False, False, False, False, False,  True,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False,  True,
        False, False, False, False, False, False, False, False,  True,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False,  True, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False,  True, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False,  True, False, False,  True, False,
      

In [84]:
print(spending_member_info[["PARTY","LASTNAME","STATE"]][clusters==6].to_string())

          PARTY LASTNAME         STATE
236  Republican    Kelly  Pennsylvania


In [80]:
len(spending_member_info[["PARTY","LASTNAME"]][clusters==1])

56
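
Two things in the outputs above are worth noting. First, cluster 1 lists 13 members in one cell but has length 56 in another; the cell numbers are out of order, and with `random_state=None` each fit can assign different labels, so these outputs likely come from different runs. Second, k-means relies on Euclidean distance, so a column with values in the hundreds of thousands (such as personnel compensation) can dominate columns with values in the hundreds. The sketch below shows one common way to address both, on synthetic data: standardize the columns and fix the seed. The data, the helper name `cluster`, and the choice of three clusters are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the per-member spending matrix: one large-valued
# column and one small-valued column (values are made up for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(220_000, 20_000, size=60),  # a personnel-sized column
    rng.normal(1_500, 1_000, size=60),     # a mail-sized column
])

# Rescale every column to zero mean and unit variance so each one
# contributes comparably to the distance computation.
X_scaled = StandardScaler().fit_transform(X)


def cluster(data, k=3, seed=0):
    """Fit k-means with a fixed seed so repeated runs give the same labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)


labels_a = cluster(X_scaled)
labels_b = cluster(X_scaled)
print((labels_a == labels_b).all())
```

With a fixed `random_state`, later cells that index by cluster number keep referring to the same groups when the notebook is rerun.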