# Out-of-State-Contributions: Data Importation and Preparation

In [1]:
import numpy as np
import pandas as pd
import us

%load_ext jupyternotify

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)
pd.options.display.float_format = "{:,.2f}".format # Format floats

## Import and format the data

Import and format contribution-level data from the [National Institute on Money in Politics](https://www.followthemoney.org/) for gubernatorial, state senate and state house candidates in 2018, 2014 and 2010.

Download and save each cycle's contributions data and concatenate the data into a single file.

In [2]:
!sh process_contributions.sh

Import the contributions data.

In [4]:
%%notify
# error_bad_lines=False skips malformed rows; pandas 2.0 removed it in favor of on_bad_lines="skip"
contributions = pd.read_csv("data/raw/contributions.csv", usecols=["Candidate:id", "Candidate", "Election_Status", "General_Party", "Election_Jurisdiction", "Election_Year", "Office_Sought", "Contributor", "Amount", "Date", "Street", "City", "State", "Zip", "In-State"], error_bad_lines=False)
# Caution: "state" is assigned to two columns below (the election jurisdiction and the contributor's state)
contributions.columns = ["candidate_id", "candidate", "election_status", "party", "state", "year", "office", "contributor", "amount", "date", "street", "city", "state", "zip", "in_out_state"]
contributions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9153076 entries, 0 to 9153075
Data columns (total 15 columns):
candidate_id       int64
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             object
date               object
street             object
city               object
state              object
zip                float64
in_out_state       float64
dtypes: float64(2), int64(2), object(11)
memory usage: 1.0+ GB


In [5]:
contributions_18 = pd.read_csv("data/raw/contributions_18.csv", usecols=["Candidate:id", "Candidate", "Election_Status", "General_Party", "Election_Jurisdiction", "Election_Year", "Office_Sought", "Contributor", "Amount", "Date", "In-State"], error_bad_lines=False)
contributions_18.columns = ["candidate_id", "candidate", "election_status", "party", "state", "year", "office", "contributor", "amount", "date", "in_out_state"]
contributions_18.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2635991 entries, 0 to 2635990
Data columns (total 11 columns):
candidate_id       int64
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             object
date               object
in_out_state       float64
dtypes: float64(1), int64(2), object(8)
memory usage: 221.2+ MB


Convert the contribution amount column to numeric (float) data type and the contribution date column to datetime data type.

In [6]:
contributions["amount"] = pd.to_numeric(contributions["amount"], errors="coerce")
contributions["date"] = pd.to_datetime(contributions["date"], errors="coerce")
contributions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9153076 entries, 0 to 9153075
Data columns (total 15 columns):
candidate_id       int64
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             float64
date               datetime64[ns]
street             object
city               object
state              object
zip                float64
in_out_state       float64
dtypes: datetime64[ns](1), float64(3), int64(2), object(9)
memory usage: 1.0+ GB
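With `errors="coerce"`, values that cannot be parsed are silently replaced with `NaN` (amounts) or `NaT` (dates), so it can be worth counting them after the conversion. The following is a minimal sketch on made-up values, not the real contributions file; the `raw` frame and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw values, for illustration only
raw = pd.DataFrame({
    "amount": ["100.00", "25", "N/A", "1,000"],
    "date": ["2018-03-01", "not a date", "2018-07-15", None],
})

# Same conversions as in the cell above
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

# Count the values that failed to parse (or were missing) in each column
unparsed = raw[["amount", "date"]].isna().sum()
print(unparsed)
```

Note that rows with a `NaT` date later fail any date comparison, so they drop out when a cut-off date is applied.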


Filter out unitemized contributions because we cannot determine where those contributions came from.

In [7]:
contributions = contributions[contributions["contributor"] != "UNITEMIZED DONATIONS"]
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9007598 entries, 0 to 9153075
Data columns (total 15 columns):
candidate_id       int64
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             float64
date               datetime64[ns]
street             object
city               object
state              object
zip                float64
in_out_state       float64
dtypes: datetime64[ns](1), float64(3), int64(2), object(9)
memory usage: 1.1+ GB


Filter out contributions to candidates who raised less than $1,000.

In [8]:
contributions_by_candidate = contributions.groupby("candidate_id")["amount"].sum().reset_index()
contributions_by_candidate = contributions_by_candidate[contributions_by_candidate["amount"] >= 1000]
contributions = contributions.merge(contributions_by_candidate, on="candidate_id", how="inner")
contributions.drop("amount_y", axis=1, inplace=True)
contributions.rename(columns={"amount_x": "amount"}, inplace=True)
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8996016 entries, 0 to 8996015
Data columns (total 15 columns):
candidate_id       int64
candidate          object
election_status    object
party              object
state              object
year               int64
office             object
contributor        object
amount             float64
date               datetime64[ns]
street             object
city               object
state              object
zip                float64
in_out_state       float64
dtypes: datetime64[ns](1), float64(3), int64(2), object(9)
memory usage: 1.1+ GB
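The same threshold filter can also be written without a merge by broadcasting each candidate's total back onto their rows. This is a sketch of an alternative, not the approach used above; the `toy` frame is hypothetical. Unlike the merge, it keeps the original row index.

```python
import pandas as pd

# Hypothetical contributions, for illustration only
toy = pd.DataFrame({
    "candidate_id": [1, 1, 2, 3, 3],
    "amount": [600.0, 500.0, 999.0, 1000.0, -50.0],
})

# transform("sum") returns each candidate's total aligned to every row
candidate_totals = toy.groupby("candidate_id")["amount"].transform("sum")

# Keep only contributions to candidates who raised at least $1,000 in total
filtered = toy[candidate_totals >= 1000]
print(filtered)
```

Candidate 3 is dropped here because the negative amount (for example, a refund) brings the total below $1,000, which mirrors how the summed threshold behaves in the cell above.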


Replace the numeric codes in the in-state/out-of-state column with descriptive labels.

In [9]:
# 0 = out-of-state, 1 = in-state, 2 = unknown
contributions["in_out_state"] = contributions["in_out_state"].replace({0: "out-of-state", 1: "in-state", 2: "unknown"})
contributions.groupby("in_out_state").size()

in_out_state
in-state        7568757
out-of-state    1298298
unknown          128959
dtype: int64

Create a standardized office column.

In [10]:
%%notify
contributions["standardized_office"] = np.where(contributions["office"].str.contains("governor", case=False), "GOVERNOR/LIEUTENANT GOVERNOR",
                                       np.where(contributions["office"].str.contains("senate", case=False), "STATE SENATE",
                                       np.where(contributions["office"].str.contains("house", case=False), "STATE HOUSE/ASSEMBLY",
                                       np.where(contributions["office"].str.contains("assembly", case=False), "STATE HOUSE/ASSEMBLY", ""))))
contributions.groupby("standardized_office").size()

standardized_office
GOVERNOR/LIEUTENANT GOVERNOR    3828901
STATE HOUSE/ASSEMBLY            3447248
STATE SENATE                    1719867
dtype: int64
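Nested `np.where` calls become hard to read as the number of categories grows. One possible alternative is `np.select`, which takes a list of conditions and a matching list of labels and uses the first condition that matches. The office strings below are made up for illustration and are not taken from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical office names, for illustration only
offices = pd.Series([
    "GOVERNOR",
    "Lieutenant Governor",
    "STATE SENATE DISTRICT 004",
    "HOUSE DISTRICT 012",
    "ASSEMBLY DISTRICT 07",
    "AUDITOR",
])

# Conditions are checked in order; the first match wins
conditions = [
    offices.str.contains("governor", case=False),
    offices.str.contains("senate", case=False),
    offices.str.contains("house|assembly", case=False),
]
labels = ["GOVERNOR/LIEUTENANT GOVERNOR", "STATE SENATE", "STATE HOUSE/ASSEMBLY"]

# Rows matching no condition get an empty string, as in the cell above
standardized = np.select(conditions, labels, default="")
print(standardized)
```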

Create a standardized election status column.

In [11]:
%%notify
advanced_to_general = ["Deceased-General", "Disqualified-General", "Default Winner-General",
                       "Default Winner-Primary","Lost-General", "Lost-General Runoff", "Lost-Retention",
                       "Pending-General", "Tied-General", "Withdrew-General", "Won-General",
                       "Won-General Runoff", "Won-Primary", "Won-Primary Runoff", "Won-Top Two Primary"]
did_not_advance = ["Disqualified-Primary", "Lost-Convention", "Lost-Primary", "Lost-Primary Runoff",
                   "Lost-Top Two Primary", "Pending-Primary", "Pending-Primary Runoff",
                   "Tied-Primary", "Withdrew-Primary", "Withdrew-Primary Runoff"]
contributions["standardized_status"] = np.where(contributions["election_status"].isin(advanced_to_general),
                                                "ADVANCED TO GENERAL",
                                       np.where(contributions["election_status"].isin(did_not_advance),
                                                "DID NOT ADVANCE", ""))
contributions.groupby("standardized_status").size()

standardized_status
ADVANCED TO GENERAL    7685396
DID NOT ADVANCE        1310620
dtype: int64

In [17]:
# Export contributions to candidates whose names contain "Stansbury" or "Glasson"
glasson_stansbury_contributions = contributions[contributions["candidate"].str.contains("Stansbury|Glasson", case=False)]
glasson_stansbury_contributions.to_excel("glasson_stansbury_contributions.xlsx", index=False)

## Calculate a cut-off point for prior election cycles

Our next task is to determine a data cut-off date for prior election cycles so we can make accurate comparisons across cycles.

Extract the month and year from the contribution date column for 2018 election cycle data.

In [12]:
# Copy the slice so the new column is set on an independent DataFrame
contributions_18 = contributions[contributions["year"] == 2018].copy()
contributions_18["month"] = contributions_18["date"].dt.to_period("M")
contributions_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2548798 entries, 0 to 2548797
Data columns (total 14 columns):
candidate_id           int64
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
standardized_status    object
month                  object
dtypes: datetime64[ns](1), float64(1), int64(2), object(10)
memory usage: 291.7+ MB


Group the contributions by state and month.

In [13]:
%%notify
grouped_by_month = contributions_18.groupby(["state", "month"])["amount"].sum().reset_index()
contributions_18.drop("month", axis=1, inplace=True)
grouped_by_month.head(1)


Unnamed: 0,state,month,amount
0,AK,2013-08,50.0


Because we want to use each state's latest month as the cut-off date for contributions, we need to append a day to each year-month value and then convert the column to datetime data type.

In [14]:
grouped_by_month["month"] = grouped_by_month["month"].astype(str) + "-28" # No month has fewer than 28 days
grouped_by_month["month"] = pd.to_datetime(grouped_by_month["month"], errors="coerce")
grouped_by_month.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1604 entries, 0 to 1603
Data columns (total 3 columns):
state     1604 non-null object
month     1604 non-null datetime64[ns]
amount    1604 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 37.7+ KB
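Using day 28 is safe for every month, but it places the cut-off a few days before the end of longer months. If a true month-end cut-off were preferred, one option is to convert the monthly periods directly to their last calendar day. This is a sketch of that alternative on made-up periods, not what the notebook does.

```python
import pandas as pd

# Hypothetical monthly periods, for illustration only
months = pd.Series(pd.PeriodIndex(["2018-02", "2018-08", "2017-12"], freq="M"))

# Fixed day 28, as in the cell above
day_28 = pd.to_datetime(months.astype(str) + "-28")

# Last calendar day of each month, with the time-of-day component removed
month_end = months.dt.to_timestamp(how="end").dt.normalize()

print(pd.DataFrame({"day_28": day_28, "month_end": month_end}))
```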


We know some of the contribution dates are wrong because they fall in the future; unless we have some time-traveling campaign donors, these are data entry errors. To eliminate this noise, we will filter out months after September 2018.

In [15]:
grouped_by_month = grouped_by_month[grouped_by_month["month"] <= "2018-09-28"]

Return the most recent month of contributions for each state.

In [16]:
latest_month = grouped_by_month.groupby("state")["month"].max().reset_index()
latest_month.rename(columns={"month": "latest_month"}, inplace=True)
latest_month

Unnamed: 0,state,latest_month
0,AK,2018-08-28
1,AL,2018-07-28
2,AR,2018-03-28
3,AZ,2017-12-28
4,CA,2018-09-28
5,CO,2018-07-28
6,CT,2018-08-28
7,DE,2017-12-28
8,FL,2018-08-28
9,GA,2018-07-28


Filter out the states whose most recent month of contributions falls before 2018.

In [17]:
latest_month = latest_month[latest_month["latest_month"] > "2018-01-01"].reset_index(drop=True)
latest_month

Unnamed: 0,state,latest_month
0,AK,2018-08-28
1,AL,2018-07-28
2,AR,2018-03-28
3,CA,2018-09-28
4,CO,2018-07-28
5,CT,2018-08-28
6,FL,2018-08-28
7,GA,2018-07-28
8,HI,2018-08-28
9,IA,2018-07-28


## Apply the cut-off date to the contributions data

Join the table of the 2018 cycle's latest contribution months with the contribution-level data.

In [18]:
contributions = contributions.merge(latest_month, on="state")

Convert the year in the latest month column to its equivalent in the relevant election cycle.

In [19]:
# pd.DateOffset shifts by whole calendar years; the "y" timedelta unit is only an
# approximate year length and is no longer supported in recent pandas versions
contributions["latest_month"] = contributions["latest_month"].mask(contributions["year"] == 2014,
                                           contributions["latest_month"] - pd.DateOffset(years=4))
contributions["latest_month"] = contributions["latest_month"].mask(contributions["year"] == 2010,
                                           contributions["latest_month"] - pd.DateOffset(years=8))
# Remove any time values from latest month column
contributions["latest_month"] = pd.DatetimeIndex(contributions["latest_month"]).normalize()

Filter the data to eliminate contributions after the 2018 cycle's latest contribution month in each state.

In [20]:
contributions = contributions[contributions["date"] <= contributions["latest_month"]]
contributions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6362352 entries, 0 to 8373867
Data columns (total 14 columns):
candidate_id           int64
candidate              object
election_status        object
party                  object
state                  object
year                   int64
office                 object
contributor            object
amount                 float64
date                   datetime64[ns]
in_out_state           object
standardized_office    object
standardized_status    object
latest_month           datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(2), object(9)
memory usage: 728.1+ MB
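To make the cut-off logic concrete, here is a small sketch of the join, shift and filter steps on made-up rows. The `toy` and `cutoffs` frames are hypothetical, and the year shift is computed from the difference between 2018 and each row's cycle rather than with two separate `mask` calls.

```python
import pandas as pd

# Hypothetical contributions and 2018 cut-offs, for illustration only
toy = pd.DataFrame({
    "state": ["AK", "AK", "AK", "AL"],
    "year": [2018, 2014, 2014, 2010],
    "date": pd.to_datetime(["2018-08-01", "2014-08-15", "2014-10-01", "2010-07-28"]),
})
cutoffs = pd.DataFrame({
    "state": ["AK", "AL"],
    "latest_month": pd.to_datetime(["2018-08-28", "2018-07-28"]),
})

# Attach each state's 2018 cut-off to every contribution in that state
toy = toy.merge(cutoffs, on="state")

# Shift the 2018 cut-off back by the number of years between cycles
years_back = 2018 - toy["year"]
toy["latest_month"] = [
    cutoff - pd.DateOffset(years=int(n))
    for cutoff, n in zip(toy["latest_month"], years_back)
]

# Keep contributions made on or before the cycle-equivalent cut-off
kept = toy[toy["date"] <= toy["latest_month"]]
print(kept)
```

In this example the October 2014 contribution is dropped because Alaska's 2018 data ends in August, so only contributions through August 28, 2014 are comparable.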


## Add redistricting rules to the 2018 election cycle's data

Our next task is to incorporate each state's redistricting rules in our analysis. This will allow us to determine whether a particular office's role in that state's redistricting process has an effect on the proportion of out-of-state contributions flowing to its race.

In [21]:
redistricting = pd.read_csv("data/raw/redistricting_rules.csv")
redistricting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
state                     50 non-null object
independent_commission    50 non-null object
single_house_district     50 non-null object
no_veto                   50 non-null object
two_year_term             50 non-null object
dtypes: object(5)
memory usage: 2.0+ KB


We need to join the contribution-level data with the table of state redistricting rules. In order to do so, we will add a state abbreviation column to the redistricting rules.

In [22]:
states = pd.DataFrame(list(us.states.mapping("name", "abbr").items()), columns=["state", "abbreviation"])
states.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 2 columns):
state           59 non-null object
abbreviation    59 non-null object
dtypes: object(2)
memory usage: 1.0+ KB


Join the table of state redistricting rules and state abbreviations.

In [23]:
redistricting = redistricting.merge(states, on="state")
redistricting

Unnamed: 0,state,independent_commission,single_house_district,no_veto,two_year_term,abbreviation
0,Alabama,N,N,N,N,AL
1,Alaska,N,Y,N,N,AK
2,Arizona,Y,N,N,N,AZ
3,Arkansas,N,N,N,N,AR
4,California,Y,N,N,N,CA
5,Colorado,N,N,N,N,CO
6,Connecticut,N,N,Y,Y,CT
7,Delaware,N,Y,N,N,DE
8,Florida,N,N,N,N,FL
9,Georgia,N,N,N,N,GA
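An inner join on state names silently drops any state whose name does not match exactly (for example, because of stray whitespace). One way to check for this is a left join with `indicator=True`. The frames below are made up for illustration, including the deliberately misspelled entry.

```python
import pandas as pd

# Hypothetical tables, for illustration only; "Nebraska " has a trailing space
rules = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Nebraska "],
    "no_veto": ["N", "N", "N"],
})
abbreviations = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Nebraska"],
    "abbreviation": ["AL", "AK", "NE"],
})

# A left join keeps every rule row; _merge records whether a match was found
checked = rules.merge(abbreviations, on="state", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every state in the rules table received an abbreviation.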


Join the table of 2018 contribution-level data with the redistricting rules.

In [24]:
contributions_18 = contributions[contributions["year"] == 2018]
contributions_18 = contributions_18.merge(redistricting, left_on="state", right_on="abbreviation")
contributions_18.drop(["state_y", "abbreviation"], axis=1, inplace=True)
contributions_18.rename(columns={"state_x": "state"}, inplace=True)
contributions_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2441346 entries, 0 to 2441345
Data columns (total 18 columns):
candidate_id              int64
candidate                 object
election_status           object
party                     object
state                     object
year                      int64
office                    object
contributor               object
amount                    float64
date                      datetime64[ns]
in_out_state              object
standardized_office       object
standardized_status       object
latest_month              datetime64[ns]
independent_commission    object
single_house_district     object
no_veto                   object
two_year_term             object
dtypes: datetime64[ns](2), float64(1), int64(2), object(13)
memory usage: 353.9+ MB


Filter contributions to those in races where the office plays a role in redistricting.

In [25]:
redistricting_contributions_18 = contributions_18[((contributions_18["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR") &
                                                   (contributions_18["single_house_district"] == "N") &
                                                   (contributions_18["independent_commission"] == "N") &
                                                   (contributions_18["no_veto"] == "N")) |
                                               (((contributions_18["standardized_office"] == "STATE HOUSE/ASSEMBLY") |
                                                   (contributions_18["standardized_office"] == "STATE SENATE")) &
                                                   (contributions_18["single_house_district"] == "N") &
                                                   (contributions_18["independent_commission"] == "N") &
                                                   (contributions_18["two_year_term"] == "N"))
                                              ].copy()  # Copy so the new column is set on an independent DataFrame
redistricting_contributions_18["redistricting_role"] = "Y"
redistricting_contributions_18.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2055524 entries, 0 to 2422483
Data columns (total 19 columns):
candidate_id              int64
candidate                 object
election_status           object
party                     object
state                     object
year                      int64
office                    object
contributor               object
amount                    float64
date                      datetime64[ns]
in_out_state              object
standardized_office       object
standardized_status       object
latest_month              datetime64[ns]
independent_commission    object
single_house_district     object
no_veto                   object
two_year_term             object
redistricting_role        object
dtypes: datetime64[ns](2), float64(1), int64(2), object(14)
memory usage: 313.6+ MB


Filter contributions to those in races where the office does not play a role in redistricting.

In [26]:
non_redistricting_contributions_18 = contributions_18[((contributions_18["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR") &
                                                   ((contributions_18["single_house_district"] == "Y") |
                                                   (contributions_18["independent_commission"] == "Y") |
                                                   (contributions_18["no_veto"] == "Y"))) |
                                                   (((contributions_18["standardized_office"] == "STATE HOUSE/ASSEMBLY") |
                                                   (contributions_18["standardized_office"] == "STATE SENATE")) &
                                                   ((contributions_18["single_house_district"] == "Y") |
                                                   (contributions_18["independent_commission"] == "Y") |
                                                   (contributions_18["two_year_term"] == "Y")))
                                                  ].copy()  # Copy so the new column is set on an independent DataFrame
non_redistricting_contributions_18["redistricting_role"] = "N"
non_redistricting_contributions_18.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 385822 entries, 52277 to 2441345
Data columns (total 19 columns):
candidate_id              385822 non-null int64
candidate                 385822 non-null object
election_status           385822 non-null object
party                     385822 non-null object
state                     385822 non-null object
year                      385822 non-null int64
office                    385822 non-null object
contributor               385822 non-null object
amount                    385822 non-null float64
date                      385822 non-null datetime64[ns]
in_out_state              385822 non-null object
standardized_office       385822 non-null object
standardized_status       385822 non-null object
latest_month              385822 non-null datetime64[ns]
independent_commission    385822 non-null object
single_house_district     385822 non-null object
no_veto                   385822 non-null object
two_year_term             385822 non-null object

Confirm the filtering worked.

In [27]:
redistricting_contributions_18[(redistricting_contributions_18["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR") & ((redistricting_contributions_18["single_house_district"] == "Y") | (redistricting_contributions_18["independent_commission"] == "Y") | (redistricting_contributions_18["no_veto"] == "Y"))]

Unnamed: 0,candidate_id,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office,standardized_status,latest_month,independent_commission,single_house_district,no_veto,two_year_term,redistricting_role


In [28]:
non_redistricting_contributions_18[(non_redistricting_contributions_18["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR") & (non_redistricting_contributions_18["single_house_district"] == "N") & (non_redistricting_contributions_18["independent_commission"] == "N") & (non_redistricting_contributions_18["no_veto"] == "N")]

Unnamed: 0,candidate_id,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office,standardized_status,latest_month,independent_commission,single_house_district,no_veto,two_year_term,redistricting_role


In [29]:
redistricting_contributions_18[((redistricting_contributions_18["standardized_office"] == "STATE HOUSE/ASSEMBLY") | (redistricting_contributions_18["standardized_office"] == "STATE SENATE")) & ((redistricting_contributions_18["single_house_district"] == "Y") | (redistricting_contributions_18["independent_commission"] == "Y") | (redistricting_contributions_18["two_year_term"] == "Y"))]

Unnamed: 0,candidate_id,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office,standardized_status,latest_month,independent_commission,single_house_district,no_veto,two_year_term,redistricting_role


In [30]:
non_redistricting_contributions_18[((non_redistricting_contributions_18["standardized_office"] == "STATE HOUSE/ASSEMBLY") | (non_redistricting_contributions_18["standardized_office"] == "STATE SENATE")) & (non_redistricting_contributions_18["single_house_district"] == "N") & (non_redistricting_contributions_18["independent_commission"] == "N") & (non_redistricting_contributions_18["two_year_term"] == "N")]

Unnamed: 0,candidate_id,candidate,election_status,party,state,year,office,contributor,amount,date,in_out_state,standardized_office,standardized_status,latest_month,independent_commission,single_house_district,no_veto,two_year_term,redistricting_role
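The four checks above return empty tables, which shows that no row landed in the wrong subset. The same rule can also be expressed as a single boolean mask, which labels every row in one pass and avoids the need to filter twice and concatenate. This is a sketch on made-up rows; it assumes every row is either a gubernatorial or a legislative race, so anything without a redistricting role is labeled "N".

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "standardized_office": ["GOVERNOR/LIEUTENANT GOVERNOR", "GOVERNOR/LIEUTENANT GOVERNOR",
                            "STATE SENATE", "STATE HOUSE/ASSEMBLY"],
    "single_house_district": ["N", "N", "N", "Y"],
    "independent_commission": ["N", "N", "N", "N"],
    "no_veto": ["N", "Y", "Y", "N"],
    "two_year_term": ["N", "N", "N", "N"],
})

is_governor = toy["standardized_office"] == "GOVERNOR/LIEUTENANT GOVERNOR"
is_legislator = toy["standardized_office"].isin(["STATE HOUSE/ASSEMBLY", "STATE SENATE"])

# Conditions shared by both office types
shared = (toy["single_house_district"] == "N") & (toy["independent_commission"] == "N")

# Governors also need veto power; legislators also need terms longer than two years
has_role = shared & (
    (is_governor & (toy["no_veto"] == "N")) | (is_legislator & (toy["two_year_term"] == "N"))
)
toy["redistricting_role"] = np.where(has_role, "Y", "N")
print(toy[["standardized_office", "redistricting_role"]])
```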


Concatenate the redistricting and non-redistricting contributions data.

In [31]:
contributions_18 = pd.concat([redistricting_contributions_18, non_redistricting_contributions_18])
contributions_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2441346 entries, 0 to 2441345
Data columns (total 19 columns):
candidate_id              int64
candidate                 object
election_status           object
party                     object
state                     object
year                      int64
office                    object
contributor               object
amount                    float64
date                      datetime64[ns]
in_out_state              object
standardized_office       object
standardized_status       object
latest_month              datetime64[ns]
independent_commission    object
single_house_district     object
no_veto                   object
two_year_term             object
redistricting_role        object
dtypes: datetime64[ns](2), float64(1), int64(2), object(14)
memory usage: 372.5+ MB


## Export the data

Concatenate the 2010, 2014 and 2018 contributions data.

In [32]:
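# Copy the slices so the new column is set on independent DataFrames
contributions_14 = contributions[contributions["year"] == 2014].copy()
contributions_10 = contributions[contributions["year"] == 2010].copy()
contributions_14["redistricting_role"] = ""
contributions_10["redistricting_role"] = ""
# sort=False keeps the existing column order; columns missing from a cycle are filled with NaN
contributions = pd.concat([contributions_18, contributions_14, contributions_10], sort=False).reset_index(drop=True)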
contributions = contributions[["candidate", "candidate_id", "year", "state", "party", "election_status", "contributor", "amount",
                               "date", "in_out_state", "no_veto", "office", "latest_month", "redistricting_role",
                               "independent_commission", "single_house_district", "standardized_office",
                               "standardized_status", "two_year_term"]]
contributions.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362352 entries, 0 to 6362351
Data columns (total 19 columns):
candidate                 object
candidate_id              int64
year                      int64
state                     object
party                     object
election_status           object
contributor               object
amount                    float64
date                      datetime64[ns]
in_out_state              object
no_veto                   object
office                    object
latest_month              datetime64[ns]
redistricting_role        object
independent_commission    object
single_house_district     object
standardized_office       object
standardized_status       object
two_year_term             object
dtypes: datetime64[ns](2), float64(1), int64(2), object(14)
memory usage: 922.3+ MB


Export the contribution-level data for the 2010, 2014 and 2018 election cycles with filters applied and redistricting rules added.

In [33]:
%%notify
contributions.to_csv("data/contributions.csv", index=False)
