## Motivation

Every two years, every member of the United States House of Representatives is up for election. After being elected, members of the House are given a set budget from the legislature itself to hire staff, buy office equipment, and defray other costs of legislating and addressing constituent concerns. While each office gets the same amount of money from Congress to spend on these purposes, congressional offices have discretion over how that allowance is actually spent, and we would like to see whether some spending patterns are associated with higher political success. 

# Part 1: Getting/Formatting the Data

For this project, we decided to use the [House Office Expenditure Data](https://www.propublica.org/datastore/dataset/house-office-expenditures) from ProPublica, as it contains well-formatted data on House expenditures from 2009 to 2018. The main downside of this dataset is that it is missing data from the most recent midterm election, but we still have almost 10 years of data to work with.

To access the datasets programmatically, we have included copies in this [repository](https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/).

In [38]:
import pandas as pd
# Collect one DataFrame per quarterly file, then concatenate them at the end
frames = []
# 2009 only has Q3 and Q4 available; these two loads are currently disabled
# frames.append(pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2009Q3-house-disburse-detail.csv').dropna(subset = ['BIOGUIDE_ID']))
# frames.append(pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2009Q4-house-disburse-detail.csv').dropna(subset = ['BIOGUIDE_ID']))


# Load every quarterly file for 2016 and 2017 (range() excludes 2018)
for year in range(2016, 2018):
    for quarter in range(1, 5):
        label = str(year) + 'Q' + str(quarter)
        df = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/' + label + '-house-disburse-detail.csv')
        # Drop rows that are not tied to a member (no BIOGUIDE_ID)
        df.dropna(subset = ['BIOGUIDE_ID'], inplace=True)
        # Record which quarterly file each row came from
        df["QUARTER"] = label
        frames.append(df)
        

# 2018 is loaded separately because only Q1 is available
df = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/2018Q1-house-disburse-detail.csv')
df.dropna(subset = ['BIOGUIDE_ID'], inplace=True)
df["QUARTER"] = '2018Q1'
frames.append(df)

house_data = pd.concat(frames)
house_data



Unnamed: 0,BIOGUIDE_ID,OFFICE,QUARTER,CATEGORY,DATE,PAYEE,START DATE,END DATE,PURPOSE,AMOUNT,YEAR,TRANSCODE,TRANSCODELONG,RECORDID,RECIP (orig.),PROGRAM,SORT SEQUENCE
5387,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,01-31,,01/20/16,01/31/16,FRANKED MAIL,-25.85,2016,GL,General ledger,FLG0055718,,,
5388,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,02-29,UNITED STATES POSTAL SERVICE,01/03/16,01/31/16,FRANKED MAIL,627.83,2016,AP,Accounts payable,00844090,UNITED STATES POSTAL SERVICE,,
5389,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,02-29,,02/20/16,02/29/16,FRANKED MAIL,-62.50,2016,GL,General ledger,FLG0056519,,,
5390,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,03-23,UNITED STATES POSTAL SERVICE,02/01/16,02/29/16,FRANKED MAIL,944.14,2016,AP,Accounts payable,00849298,UNITED STATES POSTAL SERVICE,,
5391,A000374,HON. RALPH ABRAHAM,2016Q1,FRANKED MAIL,03-31,,03/20/16,03/31/16,FRANKED MAIL,-9.45,2016,GL,General ledger,FLG0057391,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61345,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,2/19/18,CITI PCARD-READYREFRESH BY NESTLE,12/29/17,12/30/17,OFFICE SUPPLIES (OUTSIDE),54.85,2017,AP,,974834,,OFFICIAL EXPENSES OF MEMBERS,DETAIL
61346,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,2/19/18,CITI PCARD-TIMES REVIEW NEWSPAP,12/29/17,1/26/18,PUBLICATIONS/REFERENCE MAT'L,110.0,2018,AP,,974834,,OFFICIAL EXPENSES OF MEMBERS,DETAIL
61347,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,,,,,SUPPLIES AND MATERIALS TOTALS:,2183.2,2018,,,,,OFFICIAL EXPENSES OF MEMBERS,SUBTOTAL
61348,Z000017,2017 HON. LEE M. ZELDIN,2018Q1,SUPPLIES AND MATERIALS,,,,,OFFICIAL EXPENSES OF MEMBERS TOTALS:,38303.93,2018,,,,,OFFICIAL EXPENSES OF MEMBERS,SUBTOTAL
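
The file names in the loading cell above follow a regular pattern, so the list of quarters can be generated instead of typed out. The following is a minimal sketch of that idea; `BASE_URL`, `quarter_labels`, and `quarter_url` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
# Assumption: same repository folder used in the loading cell above
BASE_URL = 'https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/'

def quarter_labels(first_year, last_year, extra=()):
    """Return labels such as '2016Q1' for every quarter of first_year..last_year
    (inclusive), followed by any extra labels for partially covered years."""
    labels = [f'{year}Q{quarter}'
              for year in range(first_year, last_year + 1)
              for quarter in range(1, 5)]
    return labels + list(extra)

def quarter_url(label):
    """Build the URL of one quarterly CSV from its label."""
    return f'{BASE_URL}{label}-house-disburse-detail.csv'

# Eight full quarters for 2016 and 2017, plus the single 2018 quarter
labels = quarter_labels(2016, 2017, extra=['2018Q1'])
urls = [quarter_url(label) for label in labels]
```

Each URL could then be passed to `pd.read_csv` and tagged with its label, as the cell above does by hand.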


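Next, we load House election results from the [election data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2) hosted on the Harvard Dataverse. A copy of the file, which covers 1976 to 2020, is stored in the same repository as the expenditure data.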

In [31]:
election_data = pd.read_csv('https://github.com/AndrewTrackim/cmsc320-final/raw/master/house-expenditure/1976-2020-house.csv')

# filter out the years before 2009 and after 2018
election_data = election_data[election_data['year'] >= 2009]
election_data = election_data[election_data['year'] <= 2018]

election_data

Unnamed: 0,year,state,state_po,state_fips,state_cen,state_ic,office,district,stage,runoff,special,candidate,party,writein,mode,candidatevotes,totalvotes,unofficial,version,fusion_ticket
22553,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,DAVID WALTER,CONSTITUTION,False,TOTAL,26357,156281,False,20220331,False
22554,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,JO BONNER,REPUBLICAN,False,TOTAL,129063,156281,False,20220331,False
22555,2010,ALABAMA,AL,1,63,41,US HOUSE,1,GEN,,False,WRITEIN,,True,TOTAL,861,156281,False,20220331,False
22556,2010,ALABAMA,AL,1,63,41,US HOUSE,2,GEN,,False,BOBBY BRIGHT,DEMOCRAT,False,TOTAL,106865,219028,False,20220331,False
22557,2010,ALABAMA,AL,1,63,41,US HOUSE,2,GEN,,False,MARTHA ROBY,REPUBLICAN,False,TOTAL,111645,219028,False,20220331,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29631,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,DANIEL CLYDE CUMMINGS,CONSTITUTION,False,TOTAL,6070,201245,False,20220331,False
29632,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,GREG HUNTER,DEMOCRAT,False,TOTAL,59903,201245,False,20220331,False
29633,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,LIZ CHENEY,REPUBLICAN,False,TOTAL,127963,201245,False,20220331,False
29634,2018,WYOMING,WY,56,83,68,US HOUSE,0,GEN,,False,RICHARD BRUBAKER,LIBERTARIAN,False,TOTAL,6918,201245,False,20220331,False
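
The two year filters above can also be expressed as one inclusive range check. This is a small sketch on made-up rows; the `toy` frame and its values are hypothetical and only stand in for `election_data`. Since House general elections fall in even years, the earliest year that survives a 2009 to 2018 filter is 2010, which matches the output above.

```python
import pandas as pd

# Toy stand-in for election_data; only the 'year' column matters here
toy = pd.DataFrame({'year': [2008, 2010, 2014, 2018, 2020],
                    'candidate': ['A', 'B', 'C', 'D', 'E']})

# Series.between is inclusive on both ends by default
in_range = toy[toy['year'].between(2009, 2018)]
kept_years = in_range['year'].tolist()
```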


Next, we count how many rows fall under each spending category to see which categories we are working with and how they should be grouped into broader ones.

In [32]:
from collections import defaultdict
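# Count how many rows fall under each CATEGORY value
categories = defaultdict(lambda: 0)
def add_set(row):
    categories[row['CATEGORY']] += 1

# Apply add_set to every row of house_data; the trailing ';' hides the cell output
house_data.apply(add_set, axis = 1);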
categories

defaultdict(<function __main__.<lambda>>,
            {'BENEFITS TO FORMER PERSONNEL': 5,
             'EQUIPMENT': 24966,
             'FRANKED MAIL': 25231,
             'OTHER SERVICES': 36402,
             'PERSONNEL BENEFITS': 1,
             'PERSONNEL COMPENSATION': 97559,
             'PRINTING AND REPRODUCTION': 33221,
             'RENT  COMMUNICATION  UTILITIES': 92169,
             'RENT COMMUNICATION UTILITIES': 18459,
             'RENT, COMMUNICATION, UTILITIES': 59560,
             'SUPPLIES AND MATERIALS': 134555,
             'TRANSPORTATION OF THINGS': 248,
             'TRAVEL': 177810})
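
The same kind of tally can be obtained with pandas' built-in `value_counts`, which avoids a Python-level function call per row. A minimal sketch on a made-up column (the `toy` frame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for house_data with only the CATEGORY column
toy = pd.DataFrame({'CATEGORY': ['TRAVEL', 'EQUIPMENT', 'TRAVEL',
                                 'FRANKED MAIL', 'TRAVEL']})

# value_counts returns a Series of counts, sorted from most to least frequent
counts = toy['CATEGORY'].value_counts()
counts_dict = counts.to_dict()
```

On the real data this would be `house_data['CATEGORY'].value_counts()`, which should report the same counts as the dictionary above, ordered by frequency.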

In [33]:
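# Map the inconsistent spellings of the rent category onto a single label
replace = {
    'RENT  COMMUNICATION  UTILITIES': 'RENT, COMMUNICATION, UTILITIES',
    'RENT COMMUNICATION UTILITIES': 'RENT, COMMUNICATION, UTILITIES'
}
# Apply the mapping to the CATEGORY column only
house_data['CATEGORY'] = house_data['CATEGORY'].replace(replace)

# Recount to confirm that the three spellings have been merged
categories = defaultdict(lambda: 0)
def add_set(row):
    categories[row['CATEGORY']] += 1

house_data.apply(add_set, axis = 1);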
categories

defaultdict(<function __main__.<lambda>>,
            {'BENEFITS TO FORMER PERSONNEL': 5,
             'EQUIPMENT': 24966,
             'FRANKED MAIL': 25231,
             'OTHER SERVICES': 36402,
             'PERSONNEL BENEFITS': 1,
             'PERSONNEL COMPENSATION': 97559,
             'PRINTING AND REPRODUCTION': 33221,
             'RENT, COMMUNICATION, UTILITIES': 170188,
             'SUPPLIES AND MATERIALS': 134555,
             'TRANSPORTATION OF THINGS': 248,
             'TRAVEL': 177810})
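
The three rent spellings differ only in commas and spacing, so they could also be merged by normalizing the text rather than listing each variant. The sketch below assumes that dropping the commas is acceptable, which yields `'RENT COMMUNICATION UTILITIES'` rather than the comma-separated label chosen above; the `toy` series is made up for illustration.

```python
import pandas as pd

# Toy stand-in for house_data['CATEGORY'] with the three observed spellings
toy = pd.Series(['RENT  COMMUNICATION  UTILITIES',
                 'RENT COMMUNICATION UTILITIES',
                 'RENT, COMMUNICATION, UTILITIES',
                 'TRAVEL'])

# Replace commas with spaces, then collapse any run of whitespace to one space
normalized = (toy.str.replace(',', ' ', regex=False)
                 .str.split()
                 .str.join(' '))
unique_labels = sorted(normalized.unique())
```

This approach would also catch spacing variants that do not appear in the explicit mapping, at the cost of changing the canonical label.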