# Out-of-State-Contributions: Data Importation and Preparation

In [1]:
from functools import reduce
import numpy as np
import pandas as pd
import us

%load_ext jupyternotify

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)
pd.options.display.float_format = "{:,.2f}".format # Format floats

<IPython.core.display.Javascript object>

## Import and format the data

Import and format contribution-level data from the [National Institute on Money in Politics](https://www.followthemoney.org/) for gubernatorial, state senate and state house candidates in 2018, 2014 and 2010.

Download and save each cycle's contributions data and concatenate the data into a single file.

In [2]:
!sh process_contribs.sh

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/joe/.wget-hsts'. HSTS will be disabled.
--2018-08-29 11:46:55--  https://www.followthemoney.org/aaengine/aafetch.php?dt=1&y=2018&c-exi=1&c-r-ot=G,S,H&gro=c-t-id,d-id&APIKey=7393ac8fa32733ae574c429362bce82a&mode=csv
Resolving www.followthemoney.org (www.followthemoney.org)... 69.144.32.182
Connecting to www.followthemoney.org (www.followthemoney.org)|69.144.32.182|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/raw/contributions_18.csv’

data/raw/contributi     [        <=>         ]   1.32G   955KB/s    in 37m 38s 

2018-08-29 12:29:58 (612 KB/s) - ‘data/raw/contributions_18.csv’ saved [1415067691]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/joe/.wget-hsts'. HSTS will be disabled.
--2018-08-29 12:29:59--  http

Import the contributions data.

In [2]:
%%notify
contributions = pd.read_csv("data/raw/contributions.csv", usecols=["Candidate", "Election_Status", "General_Party", "Election_Jurisdiction", "Election_Year", "Office_Sought", "Contributor", "Amount", "Date", "In-State"], error_bad_lines=False)
contributions.columns = ["candidate", "election_status", "party", "state", "year", "office", "contributor", "amount", "date", "in_out_state"]
contributions.info()

  interactivity=interactivity, compiler=compiler, result=result)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8676062 entries, 0 to 8676061
Data columns (total 10 columns):
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             object
date               object
in_out_state       float64
dtypes: float64(1), int64(1), object(8)
memory usage: 661.9+ MB


<IPython.core.display.Javascript object>

Convert the contribution amount column to numeric (float) data type and the contribution date column to datetime data type.

In [8]:
contributions["amount"] = pd.to_numeric(contributions["amount"], errors="coerce")
contributions["date"] = pd.to_datetime(contributions["date"], errors="coerce")
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8547919 entries, 0 to 8676061
Data columns (total 11 columns):
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
dtypes: datetime64[ns](1), float64(1), int64(1), object(8)
memory usage: 782.6+ MB


Rename the categories in the in-vs.-out-of-state column.

In [9]:
# 0 = out-of-state, 1 = in-state, 2 = unknown
contributions["in_out_state"] = contributions["in_out_state"].replace({0: "out-of-state", 1: "in-state", 2: "unknown"})
contributions.head(1)

Unnamed: 0,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office
0,"RAUNER, BRUCE VINCENT & SANGUINETTI, EVELYN PA...",Won-Primary,Republican,IL,2018,GOVERNOR / LIEUTENANT GOVERNOR,"RAUNER, BRUCE VINCENT",50000000.0,2016-12-20,in-state,GOVERNOR/LIEUTENANT GOVERNOR


Filter out unitemized donations as it is impossible to determine where those contributions originated.

In [10]:
contributions = contributions[contributions["contributor"] != "UNITEMIZED DONATIONS"]
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8547919 entries, 0 to 8676061
Data columns (total 11 columns):
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
dtypes: datetime64[ns](1), float64(1), int64(1), object(8)
memory usage: 782.6+ MB


Create a standardized office column.

In [11]:
%%notify
contributions["standardized_office"] = np.where(contributions["office"].str.contains("governor", case=False), "GOVERNOR/LIEUTENANT GOVERNOR",
                                       np.where(contributions["office"].str.contains("senate", case=False), "STATE SENATE",
                                       np.where(contributions["office"].str.contains("house", case=False), "STATE HOUSE/ASSEMBLY",
                                       np.where(contributions["office"].str.contains("assembly", case=False), "STATE HOUSE/ASSEMBLY", ""))))
contributions.head(1)

Unnamed: 0,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office
0,"RAUNER, BRUCE VINCENT & SANGUINETTI, EVELYN PA...",Won-Primary,Republican,IL,2018,GOVERNOR / LIEUTENANT GOVERNOR,"RAUNER, BRUCE VINCENT",50000000.0,2016-12-20,in-state,GOVERNOR/LIEUTENANT GOVERNOR


<IPython.core.display.Javascript object>

Create a standardized election status column.

## Calculate a cut-off point for prior election cycles

Our next task is to determine a data cut-off point for prior election cycles so we can make accurate comparisons across cycles.

Extract the month and year from the contribution date column for 2018 election cycle data.

In [12]:
contributions_18 = contributions[contributions["year"] == 2018]
contributions_18["month"] = contributions_18["date"].dt.to_period("M")
contributions_18.info()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2132284 entries, 0 to 2161316
Data columns (total 12 columns):
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
month                  object
dtypes: datetime64[ns](1), float64(1), int64(1), object(9)
memory usage: 211.5+ MB


Group the contributions by state and month.

In [13]:
%%notify
grouped_by_month = contributions_18.groupby(["state", "month"])["amount"].sum().reset_index()
contributions_18.drop("month", axis=1, inplace=True)
grouped_by_month.head(1)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


Unnamed: 0,state,month,amount
0,AK,2017-04,223.93


<IPython.core.display.Javascript object>

Because we eventually want to use each state's month column as the cut-off date for contributions, we need to add a day to the month and the year and then convert the column into datetime data type.

In [14]:
grouped_by_month["month"] = grouped_by_month["month"].astype(str) + "-28" # No month has fewer than 28 days
grouped_by_month["month"] = pd.to_datetime(grouped_by_month["month"], errors="coerce")
grouped_by_month.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1434 entries, 0 to 1433
Data columns (total 3 columns):
state     1434 non-null object
month     1434 non-null datetime64[ns]
amount    1434 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 33.7+ KB


We know some of the contribution dates are wrong. We know this because some of the dates occur in the future and, unless we've got some time travelling campaign donors, these are data entry errors. To eliminate this noise, we will filter out months after August 2018.

In [15]:
grouped_by_month = grouped_by_month[grouped_by_month["month"] <= "2018-08-28"]

Return the most recent month of contributions for each state.

In [16]:
latest_month = grouped_by_month.groupby("state")["month"].max().reset_index()
latest_month.rename(columns={"month": "latest_month"}, inplace=True)
latest_month

Unnamed: 0,state,latest_month
0,AK,2018-07-28
1,AL,2018-07-28
2,AR,2018-03-28
3,AZ,2017-12-28
4,CA,2018-07-28
5,CO,2018-07-28
6,CT,2018-07-28
7,FL,2018-08-28
8,GA,2018-07-28
9,HI,2018-07-28


## Apply the cut-off date to the contributions data.

Join the table of the 2018 cycle's latest contribution months with the contribution-level data.

In [17]:
contributions = contributions.merge(latest_month, on="state")

Convert the year in the latest month column to its equivalent in the relevant election cycle.

In [18]:
contributions["latest_month"] = contributions["latest_month"].mask(contributions["year"] == 2014,
                                           contributions["latest_month"] - pd.to_timedelta(4, unit="y"))
contributions["latest_month"] = contributions["latest_month"].mask(contributions["year"] == 2010,
                                           contributions["latest_month"] - pd.to_timedelta(8, unit="y"))
# Remove time values from latest month column
contributions["latest_month"] = pd.DatetimeIndex(contributions["latest_month"]).normalize()

Filter the data to eliminate contributions after the 2018 cycle's latest contribution month in each state.

In [23]:
#contributions = contributions[contributions["date"] <= contributions["latest_month"]]
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5666188 entries, 0 to 8440633
Data columns (total 12 columns):
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
latest_month           datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(1), object(8)
memory usage: 562.0+ MB


## Add redistricting rules to the 2018 election cycles' data

Our next task is to incorporate each state's redistricting rules in our analysis. This will allow us to determine whether a particular office's role in that state's redistricting process has an effect on the proportion of out-of-state contributions flowing to its race.

In [64]:
redistricting = pd.read_csv("data/raw/redistricting_rules.csv")
redistricting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
state                     50 non-null object
single_house_district     50 non-null bool
independent_commission    50 non-null bool
no_veto                   50 non-null bool
two_year_term             50 non-null bool
dtypes: bool(4), object(1)
memory usage: 680.0+ bytes


We need to join the contribution-level data with the table of state redistring rules. In order to do so, we will add a state abbreviation column to the redistricting rules.

In [65]:
states = pd.DataFrame(list(us.states.mapping("name", "abbr").items()), columns=["state", "abbreviation"])
states.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 2 columns):
state           59 non-null object
abbreviation    59 non-null object
dtypes: object(2)
memory usage: 1.0+ KB


Join the table of state redistricting rules and state abbreviations.

In [66]:
redistricting = redistricting.merge(states, on="state")
redistricting

Unnamed: 0,state,single_house_district,independent_commission,no_veto,two_year_term,abbreviation
0,Alabama,False,False,False,False,AL
1,Alaska,True,False,False,False,AK
2,Arizona,False,True,False,False,AZ
3,Arkansas,False,False,False,False,AR
4,California,False,True,False,False,CA
5,Colorado,False,False,False,False,CO
6,Connecticut,False,False,True,True,CT
7,Delaware,True,False,False,False,DE
8,Florida,False,False,False,False,FL
9,Georgia,False,False,False,False,GA


Join the table of contribution-level data with the redistricting rules.

In [67]:
contributions_18 = contributions_18.merge(redistricting, left_on="state", right_on="abbreviation")
contributions_18.drop(["state_y", "abbreviation"], axis=1, inplace=True)
contributions_18.rename(columns={"state_x": "state"}, inplace=True)
contributions_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2132284 entries, 0 to 2132283
Data columns (total 19 columns):
candidate                   object
election_status             object
party                       object
state                       object
year                        int64
office                      object
contributor                 object
amount                      float64
date                        datetime64[ns]
in_out_state                object
standardized_office         object
single_house_district_x     object
independent_commission_x    object
no_veto_x                   object
two_year_term_x             object
single_house_district_y     bool
independent_commission_y    bool
no_veto_y                   bool
two_year_term_y             bool
dtypes: bool(4), datetime64[ns](1), float64(1), int64(1), object(12)
memory usage: 268.4+ MB


Filter 2018 contributions to those in races where the office plays a role in redistricting.

In [68]:
redistricting_contributions = contributions_18[
    (
        (contributions_18["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR") &
        (contributions_18["single_house_district"] == FALSE) &
        (contributions_18["independent_commission"] == FALSE) &
        (contributions_18["no_veto"] != "X")
    )
    |
    (
        (
            (contributions_18["standardized_office"] == "STATE HOUSE/ASSEMBLY") |
            (contributions_18["standardized_office"] == "STATE SENATE")
        ) &
        (contributions_18["single_house_district"] == FALSE) &
        (contributions_18["independent_commission"] == FALSE) &
        (contributions_18["two_year_term"] == FALSE)
    )
].reset_index()
redistricting_contributions.info()

KeyError: 'single_house_district'

Filter 2018 contributions to those in races where the office does not play a role in redistricting.

In [58]:
non_redistricting_contributions = contributions_18[
    (
        (contributions_18["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR") &
        (contributions_18["single_house_district"].notnull()) |
        (contributions_18["independent_commission"].notnull()) |
        (contributions_18["no_veto"].notnull())
    )
    |
    (
        (
            (contributions_18["standardized_office"] == "STATE HOUSE/ASSEMBLY") |
            (contributions_18["standardized_office"] == "STATE SENATE")
        ) &
        ((contributions_18["single_house_district"].notnull()) |
        (contributions_18["independent_commission"].notnull()) |
        (contributions_18["two_year_term"].notnull()))
    )
].reset_index()
non_redistricting_contributions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383465 entries, 0 to 383464
Data columns (total 16 columns):
index                     383465 non-null int64
candidate                 383465 non-null object
election_status           383465 non-null object
party                     383465 non-null object
state                     383465 non-null object
year                      383465 non-null int64
office                    383465 non-null object
contributor               383465 non-null object
amount                    383465 non-null float64
date                      380656 non-null datetime64[ns]
in_out_state              383465 non-null object
standardized_office       383465 non-null object
single_house_district     20823 non-null object
independent_commission    227702 non-null object
no_veto                   134940 non-null object
two_year_term             107850 non-null object
dtypes: datetime64[ns](1), float64(1), int64(2), object(12)
memory usage: 46.8+ MB


In [59]:
grouped_redistricting = redistricting_contributions.groupby("state").size().reset_index()
grouped_redistricting.columns = ["state", "records"]
grouped_redistricting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
state      30 non-null object
records    30 non-null int64
dtypes: int64(1), object(1)
memory usage: 560.0+ bytes


In [60]:
grouped_non_redistricting = non_redistricting_contributions.groupby("state").size().reset_index()
grouped_non_redistricting.columns = ["state", "records"]
grouped_non_redistricting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 2 columns):
state      16 non-null object
records    16 non-null int64
dtypes: int64(1), object(1)
memory usage: 336.0+ bytes


In [61]:
joined_redistricting = grouped_redistricting.merge(grouped_non_redistricting, on="state")
joined_redistricting

Unnamed: 0,state,records_x,records_y
0,IN,14276,14276
1,KY,12508,12508
2,MO,306,306


Concatenate the contributions data.

In [62]:
redistricting_contributions[redistricting_contributions["state"] == "IN"]

Unnamed: 0,index,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office,single_house_district,independent_commission,no_veto,two_year_term
1656054,1959979,"HOLDMAN, TRAVIS",Pending-General,Republican,IN,2018,SENATE DISTRICT 019,INDIANA REPUBLICAN SENATE MAJORITY CAMPAIGN CMTE,50000.00,2018-04-13,in-state,STATE SENATE,,,X,
1656055,1959980,"LAMOTTE, CRYSTAL D",Lost-Primary,Republican,IN,2018,SENATE DISTRICT 031,"LAMOTTE, CRYSTAL D",50000.00,2018-04-10,in-state,STATE SENATE,,,X,
1656056,1959981,"MOSELEY, CHARLES (CHUCK)",Pending-General,Democratic,IN,2018,HOUSE DISTRICT 010,BOILERMAKERS LOCAL 374,40000.00,2017-11-01,in-state,STATE HOUSE/ASSEMBLY,,,X,
1656057,1959982,"ALI, ZAKI",Pending-General,Republican,IN,2018,SENATE DISTRICT 025,"ZAKI, ALI",32000.00,2018-04-13,in-state,STATE SENATE,,,X,
1656058,1959983,"ROGERS, LINDA",Pending-General,Republican,IN,2018,SENATE DISTRICT 011,HOOSIERS FOR QUALITY EDUCATION,25000.00,2018-04-10,in-state,STATE SENATE,,,X,
1656059,1959984,"GIPSON, KEVIN",Lost-Primary,Republican,IN,2018,HOUSE DISTRICT 049,"GIPSON, KEVIN",24950.00,2018-03-01,in-state,STATE HOUSE/ASSEMBLY,,,X,
1656060,1959985,"GARTEN, CHRIS",Pending-General,Republican,IN,2018,SENATE DISTRICT 045,HOME BUILDERS ASSOCIATION OF SOUTHERN INDIANA,23000.00,2017-12-20,in-state,STATE SENATE,,,X,
1656061,1959986,"TALLIAN, KAREN",Pending-General,Democratic,IN,2018,SENATE DISTRICT 004,TALLIAN FOR INDIANA,22480.00,2016-07-15,in-state,STATE SENATE,,,X,
1656062,1959987,"ABBOTT, DAVID H",Pending-General,Republican,IN,2018,HOUSE DISTRICT 082,FRIENDS OF DAVID OBER,20000.00,2018-03-30,in-state,STATE HOUSE/ASSEMBLY,,,X,
1656063,1959988,"GOODRICH, CHUCK",Pending-General,Republican,IN,2018,HOUSE DISTRICT 029,HOOSIERS FOR QUALITY EDUCATION,20000.00,2018-04-12,in-state,STATE HOUSE/ASSEMBLY,,,X,


In [63]:
non_redistricting_contributions[non_redistricting_contributions["state"] == "IN"]

Unnamed: 0,index,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office,single_house_district,independent_commission,no_veto,two_year_term
316433,1959979,"HOLDMAN, TRAVIS",Pending-General,Republican,IN,2018,SENATE DISTRICT 019,INDIANA REPUBLICAN SENATE MAJORITY CAMPAIGN CMTE,50000.00,2018-04-13,in-state,STATE SENATE,,,X,
316434,1959980,"LAMOTTE, CRYSTAL D",Lost-Primary,Republican,IN,2018,SENATE DISTRICT 031,"LAMOTTE, CRYSTAL D",50000.00,2018-04-10,in-state,STATE SENATE,,,X,
316435,1959981,"MOSELEY, CHARLES (CHUCK)",Pending-General,Democratic,IN,2018,HOUSE DISTRICT 010,BOILERMAKERS LOCAL 374,40000.00,2017-11-01,in-state,STATE HOUSE/ASSEMBLY,,,X,
316436,1959982,"ALI, ZAKI",Pending-General,Republican,IN,2018,SENATE DISTRICT 025,"ZAKI, ALI",32000.00,2018-04-13,in-state,STATE SENATE,,,X,
316437,1959983,"ROGERS, LINDA",Pending-General,Republican,IN,2018,SENATE DISTRICT 011,HOOSIERS FOR QUALITY EDUCATION,25000.00,2018-04-10,in-state,STATE SENATE,,,X,
316438,1959984,"GIPSON, KEVIN",Lost-Primary,Republican,IN,2018,HOUSE DISTRICT 049,"GIPSON, KEVIN",24950.00,2018-03-01,in-state,STATE HOUSE/ASSEMBLY,,,X,
316439,1959985,"GARTEN, CHRIS",Pending-General,Republican,IN,2018,SENATE DISTRICT 045,HOME BUILDERS ASSOCIATION OF SOUTHERN INDIANA,23000.00,2017-12-20,in-state,STATE SENATE,,,X,
316440,1959986,"TALLIAN, KAREN",Pending-General,Democratic,IN,2018,SENATE DISTRICT 004,TALLIAN FOR INDIANA,22480.00,2016-07-15,in-state,STATE SENATE,,,X,
316441,1959987,"ABBOTT, DAVID H",Pending-General,Republican,IN,2018,HOUSE DISTRICT 082,FRIENDS OF DAVID OBER,20000.00,2018-03-30,in-state,STATE HOUSE/ASSEMBLY,,,X,
316442,1959988,"GOODRICH, CHUCK",Pending-General,Republican,IN,2018,HOUSE DISTRICT 029,HOOSIERS FOR QUALITY EDUCATION,20000.00,2018-04-12,in-state,STATE HOUSE/ASSEMBLY,,,X,


## Export the data

Concatenate the three cycles' contributions data.

In [27]:
contributions = pd.concat([contributions_18, contributions_14, contributions_10]).reset_index(drop=True)
contributions = contributions[["candidate", "election_status", "party", "state", "year",
                               "contributor", "amount", "date", "in_out_state", "office",
                               "standardized_office", "single_house_district", "independent_commission",
                               "no_veto", "two_year_term", "latest_month"]]
contributions.info()

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5680264 entries, 0 to 5680263
Data columns (total 16 columns):
candidate                 object
election_status           object
party                     object
state                     object
year                      int64
office                    object
contributor               object
amount                    float64
date                      datetime64[ns]
in_out_state              object
standardized_office       object
single_house_district     object
independent_commission    object
no_veto                   object
two_year_term             object
latest_month              datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(1), object(12)
memory usage: 693.4+ MB


In [28]:
%%notify
contributions.to_csv("data/contributions.csv", index=False)
contributions_18.to_csv("data/contributions_18.csv", index=False)
contributions_14.to_csv("data/contributions_14.csv", index=False)
contributions_10.to_csv("data/contributions_10.csv", index=False)