# Out-of-State-Contributions: Data Importation and Preparation

In [1]:
from functools import reduce
import numpy as np
import pandas as pd
import us

%load_ext jupyternotify

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)
pd.options.display.float_format = "{:,.2f}".format # Format floats


## Import and format the data

Import and format contribution-level data from the [National Institute on Money in Politics](https://www.followthemoney.org/) for gubernatorial, state senate and state house candidates in 2018, 2014 and 2010.

Download and save each cycle's contributions data and concatenate the data into a single file.

In [2]:
!sh process_contribs.sh

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/joe/.wget-hsts'. HSTS will be disabled.
--2018-08-29 11:46:55--  https://www.followthemoney.org/aaengine/aafetch.php?dt=1&y=2018&c-exi=1&c-r-ot=G,S,H&gro=c-t-id,d-id&APIKey=7393ac8fa32733ae574c429362bce82a&mode=csv
Resolving www.followthemoney.org (www.followthemoney.org)... 69.144.32.182
Connecting to www.followthemoney.org (www.followthemoney.org)|69.144.32.182|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/raw/contributions_18.csv’

data/raw/contributi     [        <=>         ]   1.32G   955KB/s    in 37m 38s 

2018-08-29 12:29:58 (612 KB/s) - ‘data/raw/contributions_18.csv’ saved [1415067691]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/joe/.wget-hsts'. HSTS will be disabled.
--2018-08-29 12:29:59--  http
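
The contents of `process_contribs.sh` are not shown in this notebook. Purely as an illustration, the concatenation step could be done in pandas roughly as follows. The helper name `concat_cycle_files`, the chunk size and the tiny stand-in files are all assumptions made for this sketch, not part of the original pipeline.

```python
import os
import tempfile

import pandas as pd


def concat_cycle_files(paths, out_path, chunksize=100_000):
    """Append several CSV files with identical columns into one CSV.

    Hypothetical helper: reads each file in chunks so that large
    downloads never have to fit in memory at once.
    """
    first = True
    for path in paths:
        # dtype=str keeps every value exactly as written in the source file
        for chunk in pd.read_csv(path, dtype=str, chunksize=chunksize):
            chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
            first = False


# Demo on two tiny stand-in files (not the real downloads)
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for year in (2018, 2014):
    path = os.path.join(tmp_dir, f"contributions_{year}.csv")
    pd.DataFrame({"Election_Year": [year, year], "Amount": ["10.00", "25.50"]}).to_csv(path, index=False)
    demo_paths.append(path)

combined_path = os.path.join(tmp_dir, "contributions.csv")
concat_cycle_files(demo_paths, combined_path)
combined = pd.read_csv(combined_path)
```

Writing the header only for the first chunk keeps the combined file readable with a single `pd.read_csv` call.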

Import the contributions data.

In [3]:
%%notify
# error_bad_lines=False skips malformed rows; pandas 1.3+ spells this on_bad_lines="skip"
contributions = pd.read_csv("data/raw/contributions.csv", usecols=["Candidate", "Election_Status", "Specific_Party", "Election_Jurisdiction", "Election_Year", "Office_Sought", "Contributor", "Amount", "Date", "In-State"], error_bad_lines=False)
contributions.columns = ["candidate", "election_status", "party", "state", "year", "office", "contributor", "amount", "date", "in_out_state"]
contributions.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8676062 entries, 0 to 8676061
Data columns (total 10 columns):
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             object
date               object
in_out_state       float64
dtypes: float64(1), int64(1), object(8)
memory usage: 661.9+ MB



Convert the contribution amount column to numeric (float) data type and the contribution date column to datetime data type.

In [4]:
contributions["amount"] = pd.to_numeric(contributions["amount"], errors="coerce")
contributions["date"] = pd.to_datetime(contributions["date"], errors="coerce")
contributions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8676062 entries, 0 to 8676061
Data columns (total 10 columns):
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             float64
date               datetime64[ns]
in_out_state       float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(6)
memory usage: 661.9+ MB
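
With `errors="coerce"`, any value that cannot be parsed silently becomes `NaN` (amounts) or `NaT` (dates). A quick way to see how much is lost is to count values that were present before conversion but missing afterwards. The sketch below uses made-up toy values, not rows from the real file.

```python
import pandas as pd

# Toy values standing in for raw "amount" and "date" strings
raw = pd.DataFrame({
    "amount": ["250.00", "1,000", "n/a"],
    "date": ["2018-03-01", "not a date", "2017-11-15"],
})

amount = pd.to_numeric(raw["amount"], errors="coerce")
date = pd.to_datetime(raw["date"], errors="coerce")

# Count values that were present in the raw data but could not be parsed
lost_amounts = int((amount.isna() & raw["amount"].notna()).sum())
lost_dates = int((date.isna() & raw["date"].notna()).sum())
```

Here both `"1,000"` and `"n/a"` fail numeric parsing, so a large count on the real data may point to a formatting issue (such as thousands separators) rather than genuinely missing values.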


Rename the categories in the in-vs.-out-of-state column.

In [5]:
# 0 = out-of-state, 1 = in-state, 2 = unknown
contributions["in_out_state"] = contributions["in_out_state"].replace({0: "out-of-state", 1: "in-state", 2: "unknown"})
contributions.head(1)

Unnamed: 0,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state
0,"RAUNER, BRUCE VINCENT & SANGUINETTI, EVELYN PA...",Won-Primary,REPUBLICAN,IL,2018,GOVERNOR / LIEUTENANT GOVERNOR,"RAUNER, BRUCE VINCENT",50000000.0,2016-12-20,in-state


Filter out unitemized donations as it is impossible to determine where those contributions originated.

In [6]:
contributions = contributions[contributions["contributor"] != "UNITEMIZED DONATIONS"]
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8547919 entries, 0 to 8676061
Data columns (total 10 columns):
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             float64
date               datetime64[ns]
in_out_state       object
dtypes: datetime64[ns](1), float64(1), int64(1), object(7)
memory usage: 717.4+ MB


Create a standardized office column.

In [7]:
# na=False: a missing office matches no category instead of passing NaN into np.where
contributions["standardized_office"] = np.where(contributions["office"].str.contains("governor", case=False, na=False), "GOVERNOR/LIEUTENANT GOVERNOR",
                                       np.where(contributions["office"].str.contains("senate", case=False, na=False), "STATE SENATE",
                                       np.where(contributions["office"].str.contains("house", case=False, na=False), "STATE HOUSE", "")))
contributions.head(1)

Unnamed: 0,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office
0,"RAUNER, BRUCE VINCENT & SANGUINETTI, EVELYN PA...",Won-Primary,REPUBLICAN,IL,2018,GOVERNOR / LIEUTENANT GOVERNOR,"RAUNER, BRUCE VINCENT",50000000.0,2016-12-20,in-state,GOVERNOR/LIEUTENANT GOVERNOR
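
The same first-match-wins logic can also be written with `np.select`, which takes a list of conditions and a parallel list of labels instead of nested calls. This is an alternative sketch on toy office labels, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Toy office labels; the real column has many more variants
office = pd.Series(["GOVERNOR / LIEUTENANT GOVERNOR", "STATE SENATE 012", "House District 5", None])

# Conditions are checked in order; na=False makes missing values match nothing
conditions = [
    office.str.contains("governor", case=False, na=False),
    office.str.contains("senate", case=False, na=False),
    office.str.contains("house", case=False, na=False),
]
labels = ["GOVERNOR/LIEUTENANT GOVERNOR", "STATE SENATE", "STATE HOUSE"]

# Rows matching no condition get the empty-string default
standardized = pd.Series(np.select(conditions, labels, default=""), index=office.index)
```

Adding another office category then only requires one more entry in each list.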


Filter the data by election year.

In [8]:
contributions_18 = contributions[contributions["year"] == 2018]
contributions_14 = contributions[contributions["year"] == 2014]
contributions_10 = contributions[contributions["year"] == 2010]
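
Each per-year frame is a boolean-filtered subset of `contributions`. If columns are added to such a subset later, pandas may emit a `SettingWithCopyWarning` because it cannot tell whether the assignment was meant to reach the original frame. Taking an explicit `.copy()` is one common way to make the intent clear. A minimal sketch on toy data:

```python
import pandas as pd

contribs = pd.DataFrame({"year": [2018, 2014, 2018], "amount": [100.0, 50.0, 25.0]})

# .copy() makes the subset an independent frame, so adding a column later
# neither warns nor touches the original
subset_18 = contribs[contribs["year"] == 2018].copy()
subset_18["double_amount"] = subset_18["amount"] * 2
```

The cost is extra memory for the copy, which may matter for frames with millions of rows.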

## Calculate a cut-off point for prior election cycles

Our next task is to determine a data cut-off point for prior election cycles so we can make accurate comparisons across cycles.

Extract the month and year from the contribution date column for 2018 election cycle data.

In [9]:
contributions_18 = contributions_18.assign(month=contributions_18["date"].dt.to_period("M"))
contributions_18.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2132284 entries, 0 to 2161316
Data columns (total 12 columns):
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
month                  object
dtypes: datetime64[ns](1), float64(1), int64(1), object(9)
memory usage: 211.5+ MB


Group the contributions by state and month.

In [10]:
grouped_by_month = contributions_18.groupby(["state", "month"])["amount"].sum().reset_index()
grouped_by_month.head(1)

Unnamed: 0,state,month,amount
0,AK,2017-04,223.93


Because we eventually want to use each state's latest month as the cut-off date for contributions, we need a full date rather than a year-month value. Append a day (the 28th) to each month and then convert the column to datetime data type.

In [11]:
grouped_by_month["month"] = grouped_by_month["month"].astype(str) + "-28" # No month has fewer than 28 days
grouped_by_month["month"] = pd.to_datetime(grouped_by_month["month"], errors="coerce")
grouped_by_month.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1434 entries, 0 to 1433
Data columns (total 3 columns):
state     1434 non-null object
month     1434 non-null datetime64[ns]
amount    1434 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 33.7+ KB
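
The string round trip above is not the only way to land on the 28th. On pandas versions where the month column has a proper period dtype, it can be converted to the first day of the month and shifted by 27 days. This is a sketch under that assumption, using toy dates.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2017-04-03", "2018-02-14"]))

# Period -> first day of the month -> 28th of the month
months = dates.dt.to_period("M")
cutoff = months.dt.to_timestamp() + pd.Timedelta(days=27)
```

Both approaches rely on the same fact: every month has at least 28 days, so the resulting date always exists.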


Some of the contribution dates are wrong: they fall in the future and, unless we've got some time-traveling campaign donors, are data-entry errors. To eliminate this noise, we will filter out months after August 2018.

In [12]:
grouped_by_month = grouped_by_month[grouped_by_month["month"] <= "2018-08-28"]

Return the most recent month of contributions for each state.

In [13]:
latest_month = grouped_by_month.groupby("state")["month"].max().reset_index()
latest_month.rename(columns={"month": "latest_month"}, inplace=True)
latest_month

Unnamed: 0,state,latest_month
0,AK,2018-07-28
1,AL,2018-07-28
2,AR,2018-03-28
3,AZ,2017-12-28
4,CA,2018-07-28
5,CO,2018-07-28
6,CT,2018-07-28
7,FL,2018-08-28
8,GA,2018-07-28
9,HI,2018-07-28


## Apply the cut-off date to the 2014 and 2010 election cycles' data

Join the table of the 2018 cycle's latest contribution months with the 2014 and 2010 contribution-level data.

In [14]:
contributions_14 = contributions_14.merge(latest_month, on="state")
contributions_10 = contributions_10.merge(latest_month, on="state")

Convert the year in the latest month column to its equivalent in the relevant election cycle.

In [15]:
%%notify
# Map 2018-cycle years onto the 2014 cycle: 2017 -> 2013, 2018 -> 2014
contributions_14["latest_month"] = contributions_14["latest_month"].mask(contributions_14["latest_month"].dt.year == 2017, contributions_14["latest_month"] + pd.offsets.DateOffset(year=2013))
contributions_14["latest_month"] = contributions_14["latest_month"].mask(contributions_14["latest_month"].dt.year == 2018, contributions_14["latest_month"] + pd.offsets.DateOffset(year=2014))
# Map 2018-cycle years onto the 2010 cycle: 2017 -> 2009, 2018 -> 2010
contributions_10["latest_month"] = contributions_10["latest_month"].mask(contributions_10["latest_month"].dt.year == 2017, contributions_10["latest_month"] + pd.offsets.DateOffset(year=2009))
contributions_10["latest_month"] = contributions_10["latest_month"].mask(contributions_10["latest_month"].dt.year == 2018, contributions_10["latest_month"] + pd.offsets.DateOffset(year=2010))
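
In `pd.DateOffset`, the singular keyword `year=` replaces the year component, whereas the plural `years=` shifts a date by that many years. Because the earlier cycles sit exactly four and eight years before 2018, a plain shift gives the same mapping as the year-by-year replacement above. A sketch on toy cut-off dates:

```python
import pandas as pd

latest_month = pd.Series(pd.to_datetime(["2017-12-28", "2018-07-28"]))

# Singular keyword: replace the year component outright
replaced = pd.Timestamp("2018-07-28") + pd.DateOffset(year=2014)

# Plural keyword: move back a fixed number of years
shifted_14 = latest_month - pd.DateOffset(years=4)
shifted_10 = latest_month - pd.DateOffset(years=8)
```

Since every cut-off falls on the 28th, the shift never lands on a date that does not exist in the target year.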




Filter the data to eliminate contributions after the 2018 cycle's latest contribution month in each state.

In [16]:
contributions_14 = contributions_14[contributions_14["date"] <= contributions_14["latest_month"]]
contributions_10 = contributions_10[contributions_10["date"] <= contributions_10["latest_month"]]
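
To make the per-state cut-off concrete, here is the merge-then-filter pattern on a toy 2014-cycle table. All values are invented; the column names follow the notebook.

```python
import pandas as pd

contribs_14 = pd.DataFrame({
    "state": ["AZ", "AZ", "FL"],
    "date": pd.to_datetime(["2013-11-02", "2014-03-15", "2014-08-01"]),
    "amount": [100.0, 250.0, 40.0],
})
# Cut-offs already expressed in 2014-cycle years
cutoffs = pd.DataFrame({
    "state": ["AZ", "FL"],
    "latest_month": pd.to_datetime(["2013-12-28", "2014-08-28"]),
})

# Attach each state's cut-off to its contributions, then compare row by row
merged = contribs_14.merge(cutoffs, on="state")
kept = merged[merged["date"] <= merged["latest_month"]]
```

Arizona's March 2014 contribution is dropped because that state's cut-off is December 2013. Note that rows with a missing date (`NaT`) compare as `False` and are dropped by this filter as well.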

## Add redistricting rules to the 2018 election cycle's data

Our next task is to incorporate each state's redistricting rules in our analysis. This will allow us to determine whether a particular office's role in that state's redistricting process has an effect on the proportion of out-of-state contributions flowing to its race.

In [17]:
redistricting = pd.read_csv("data/raw/redistricting_rules.csv")
redistricting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
state                     50 non-null object
single_house_district     7 non-null object
independent_commission    6 non-null object
no_veto                   8 non-null object
two_year_term             3 non-null object
dtypes: object(5)
memory usage: 2.0+ KB


We need to join the contribution-level data with the table of state redistricting rules. To do so, we will add a state abbreviation column to the redistricting rules.

In [18]:
states = pd.DataFrame(list(us.states.mapping("name", "abbr").items()), columns=["state", "abbreviation"])
states.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 2 columns):
state           59 non-null object
abbreviation    59 non-null object
dtypes: object(2)
memory usage: 1.0+ KB


Join the table of state redistricting rules and state abbreviations.

In [19]:
redistricting = redistricting.merge(states, on="state")
redistricting

Unnamed: 0,state,single_house_district,independent_commission,no_veto,two_year_term,abbreviation
0,Alabama,,,,,AL
1,Alaska,X,,,,AK
2,Arizona,,X,,,AZ
3,Arkansas,,,,,AR
4,California,,X,,,CA
5,Colorado,,,,,CO
6,Connecticut,,,X,X,CT
7,Delaware,X,,,,DE
8,Florida,,,,,FL
9,Georgia,,,,,GA


Join the table of the 2018 cycle's contribution-level data with the redistricting rules.

In [20]:
contributions_18 = contributions_18.merge(redistricting, left_on="state", right_on="abbreviation")
contributions_18.drop(["state_y", "abbreviation"], axis=1, inplace=True)
contributions_18.rename(columns={"state_x": "state"}, inplace=True)
contributions_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2132284 entries, 0 to 2132283
Data columns (total 16 columns):
candidate                 object
election_status           object
party                     object
state                     object
year                      int64
office                    object
contributor               object
amount                    float64
date                      datetime64[ns]
in_out_state              object
standardized_office       object
month                     object
single_house_district     object
independent_commission    object
no_veto                   object
two_year_term             object
dtypes: datetime64[ns](1), float64(1), int64(1), object(13)
memory usage: 276.6+ MB
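
The row count is unchanged here, which suggests every 2018 contribution matched a state in the rules table. For a more explicit check, `merge` accepts `indicator=True` to flag unmatched rows and `validate` to guard against duplicate keys. The sketch below uses toy tables with one deliberately unmatched jurisdiction.

```python
import pandas as pd

contribs = pd.DataFrame({"state": ["AK", "AZ", "DC"], "amount": [10.0, 20.0, 30.0]})
rules = pd.DataFrame({
    "state": ["Alaska", "Arizona"],
    "abbreviation": ["AK", "AZ"],
    "independent_commission": [None, "X"],
})

checked = contribs.merge(
    rules.drop(columns="state"),   # drop the full name to avoid state_x/state_y
    left_on="state",
    right_on="abbreviation",
    how="left",                    # keep unmatched contributions visible
    validate="many_to_one",        # fail if a state appears twice in the rules
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "state"].tolist()
```

An empty `unmatched` list on the real data would confirm that the default inner join loses nothing.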


## Export the data

In [21]:
contributions_18.to_csv("data/contributions_18.csv", index=False)
contributions_14.to_csv("data/contributions_14.csv", index=False)
contributions_10.to_csv("data/contributions_10.csv", index=False)