# 527 IRS filings data

**Post Description:** A detailed walkthrough of data extraction and cleaning for forms 8871 and 8872, including related entities, contributions, and expenditures.

**Post Categories:** Data Sourcing

# What is this data?

Before we get into *how* to get the data, let's spend a moment understanding why getting this data is important.  To do that we need to understand a little bit of terminology and background.

### 527 Organizations

A 527 organization is a tax-exempt political organization created primarily to influence the selection, nomination, election, appointment, or defeat of candidates for federal, state, or local public office.

### What data does the IRS provide

The IRS provides form 8871 and 8872 and attachments to those forms.  These are:

+ Form 8871: Top level information about a 527 organization
    + Type "D": Directors and Officers
    + Type "R": Related Entities
    + Type "E": Election Authority Identification number(s)
+ Form 8872: Top level main form data
    + Type "A": Schedule A data (Itemized Contributions) for 8872
    + Type "B": Schedule B data (Itemized Expenditures) for 8872
    
Clearly, this data is important to understanding financial influence in politics, and a core requirement for understanding it well.
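
For reference while reading the parsing code later in this post, the record types listed above can be written down as a small lookup table. This is only a sketch: the codes "1" and "2" for the top-level 8871 and 8872 rows are inferred from the record type list used in the mapping step further down, and the `RECORD_TYPES` name and `describe` helper are hypothetical.

```python
# Hypothetical lookup table; descriptions paraphrase the list above.
# Codes "1" and "2" are assumed to mark the top-level 8871 and 8872 rows.
RECORD_TYPES = {
    "1": ("8871", "Top level information about a 527 organization"),
    "D": ("8871", "Directors and Officers"),
    "R": ("8871", "Related Entities"),
    "E": ("8871", "Election Authority Identification number(s)"),
    "2": ("8872", "Top level main form data"),
    "A": ("8872", "Schedule A (Itemized Contributions)"),
    "B": ("8872", "Schedule B (Itemized Expenditures)"),
}

def describe(record_type):
    """Return a readable label for a record type code, or None if unknown."""
    entry = RECORD_TYPES.get(record_type)
    if entry is None:
        return None
    form, description = entry
    return f"Form {form}: {description}"

print(describe("A"))
```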


# Data Pull

>Credit: [This repo](https://github.com/sahilchinoy/django-irs-filings) written by [Sahil Chinoy](https://sahilchinoy.com/) was very helpful in writing this code.  I changed a lot, but left a lot the same.  The IRS has changed significantly since that code was written, but it was a huge help anyway, as the general structure of the files is the same.

We will now walk through the code to download and get the data into a usable format.

## Imports

In [3]:
# System Libs
import os, sys, shutil
from pathlib import Path

#File Libs
import csv, io, pickle

# Data Processing and Transformation
import pandas as pd 
from collections import defaultdict
def def_value(): return [] # this is for a default dict we use later
from fastcore.all import L
from datetime import datetime

# Vinculum Re-Usable Utilities
from download_utils import unzip_file, download_file

# jupyter Utils
from IPython.display import clear_output

# Configure Logging for Jupyter - This is to make transition to modules easier
import logging
logger = logging.getLogger(name="jupyter")
if len(logger.handlers) == 0: logger.addHandler(logging.StreamHandler(stream=sys.stdout))
logger.setLevel(logging.INFO)

## Download and Extract

This step is very simple so we won't spend much time on it.  It covers 3 things:
+ Download
+ Unzip
+ Flatten directory structure out

We imported the download_file and unzip_file functions from our utility functions.  I will not cover those here as they are standard download and unzipping and not unique to this dataset.  Feel free to check those out at our GitHub repository if you would like.

```python
from download_utils import unzip_file, download_file
```
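
If you do not have those utilities handy, something along the following lines would probably do. This is a minimal sketch using only the standard library, not the actual `download_utils` code; the `_sketch` function names are hypothetical and chosen so they do not shadow the real imports.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

def download_file_sketch(url, dest_path):
    """Stream the response body at `url` to `dest_path` and return the path."""
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return dest_path

def unzip_file_sketch(zip_path, extract_path):
    """Extract every member of the archive into `extract_path`."""
    extract_path = Path(extract_path)
    extract_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_path)
    return extract_path
```

Streaming with `shutil.copyfileobj` avoids holding the whole archive in memory, which matters for a large download.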

We need a simple function to clean the directories.  This is because once unzipped, the data file is 6 directories deep.

In [1]:
def clean_527(zip_path,extract_path,final_path):
    logger.info('Cleaning up archive...')
    shutil.move(f"{extract_path}/var/IRS/data/scripts/pofd/download/FullDataFile.txt",final_path)
    shutil.rmtree(extract_path)
    os.remove(zip_path)
    logger.info(f"FINAL RAW DATA FILE RELATIVE PATH: {final_path}")
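
The hardcoded path above depends on the archive layout staying the same. As a hedged alternative, the file could be located by name instead. `find_data_file` is a hypothetical helper, not part of the original code, and it assumes the data file keeps the name `FullDataFile.txt`.

```python
from pathlib import Path

def find_data_file(extract_path, name="FullDataFile.txt"):
    """Return the first file called `name` anywhere under `extract_path`."""
    matches = sorted(Path(extract_path).rglob(name))
    if not matches:
        raise FileNotFoundError(f"{name} not found under {extract_path}")
    return matches[0]
```

The returned path could then be passed to `shutil.move` in place of the f-string.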

After that, all we need to do is download, unzip, and clean.  This step is pretty straightforward.

In [8]:
def extract_data(url, zip_path,extract_path,final_path):
    download_file(url,zip_path)
    unzip_file(zip_path,extract_path)
    clean_527(zip_path,extract_path,final_path)

Run all that and we are done with this step and have the file unzipped and where we want it.

In [9]:
url = 'http://forms.irs.gov/app/pod/dataDownload/fullData'
base_dir = Path("./data/irs-filings")
zip_path = (base_dir/'data.zip')
extract_path = (base_dir/'unzipped/')
final_path = (base_dir/'raw_FullDataFile.txt')

In [None]:
extract_data(url,zip_path,extract_path,final_path)

In [13]:
!ls data/irs-filings/ | grep txt

raw_FullDataFile.txt


## Clean and Split

The next step is more unique to this dataset.  This is more complex due to the IRS practice of storing multiple types of data in the same pipe delimited file (with differing numbers of columns).  In a normal "relational" structure, this would be multiple different files.

This means we need to parse row by row to determine what type of row it is to know how to process it.
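
To make that concrete, here is a small sketch of grouping rows by their first field. The sample lines are made up for illustration and do not reflect the real column layout; the only point is that each record type carries a different number of columns.

```python
import csv
import io
from collections import defaultdict

# Made-up sample: three record types with different column counts.
sample = (
    "1|9001|EXAMPLE ORG\n"
    "E|9001|1|ABC123|CA\n"
    "D|9001|2|EXAMPLE ORG|JANE DOE|TREASURER\n"
    "E|9001|3|XYZ789|NY\n"
)

rows_by_type = defaultdict(list)
for row in csv.reader(io.StringIO(sample), delimiter="|"):
    if row:  # csv.reader yields [] for blank lines
        rows_by_type[row[0]].append(row)

for record_type, rows in rows_by_type.items():
    print(record_type, len(rows), "row(s),", len(rows[0]), "columns")
```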

### Mapping

The first step is to get a mapping of the different record types.  I created this by converting the IRS data dictionary from MS Word into an Excel file so that I could read it in and get the field name and data type for each value.

In [19]:
def load_mapping_file(fname, record_types = ["1","D","R","E","2","A","B"]):
    mappings = {}
    for r in record_types: mappings[r] = pd.read_excel(fname,sheet_name=r) # read from the fname argument, not a global
    return mappings

When we run this, we get the field name and type for each type of row.

In [24]:
mappings_path = Path("DataConfigs/irs-filings/mappings.xlsx")
load_mapping_file(mappings_path)['E']

| | model_name | field_name | field_type | position |
|---|---|---|---|---|
| 0 | record_type | Record Type | C | 0 |
| 1 | form_id_number | Form ID Number | N | 1 |
| 2 | eain_id | EAIN ID | N | 2 |
| 3 | election_authority_id_number | ELECTION AUTHORITY ID NUMBER | C | 3 |
| 4 | state_issued | STATE_ISSUED | C | 4 |


Next, we convert the information we need from this into a dictionary keyed by column position, so we can look it up faster as we are parsing the file.

In [32]:
def build_mappings(mapping_file):
    record_types = L(mapping_file.keys())
    mappings = {}
    for record_type in record_types:
        cols = L(o for o in mapping_file[record_type].columns)
        mapping = {}
        for row in mapping_file[record_type].values:
            mapping[row[cols.index('position')]] = (row[cols.index('model_name')],row[cols.index('field_type')])
        mappings[record_type] = mapping
    return mappings

In [33]:
mappings = build_mappings(load_mapping_file(mappings_path))
mappings['E']

{0: ('record_type', 'C'),
 1: ('form_id_number', 'N'),
 2: ('eain_id', 'N'),
 3: ('election_authority_id_number', 'C'),
 4: ('state_issued', 'C')}
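
To see how this lookup lines up with a raw row, here is a minimal sketch. The row values are made up, and the trailing delimiter is an assumption about the file rather than something shown in this post.

```python
# The 'E' mapping exactly as printed above.
e_mapping = {
    0: ('record_type', 'C'),
    1: ('form_id_number', 'N'),
    2: ('eain_id', 'N'),
    3: ('election_authority_id_number', 'C'),
    4: ('state_issued', 'C'),
}

# Made-up row; the trailing empty string mimics a line that ends with "|".
row = "E|9001|1|ABC123|CA|".split("|")

# Pair each position with its field name; extra trailing cells are ignored.
named = {e_mapping[i][0]: cell for i, cell in enumerate(row[:len(e_mapping)])}
print(named)
```

No type conversion happens here; that is the job of the cell cleaning step.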

## Parsing

Finally we are onto the meat of this, which is parsing the file.

First we create a function to parse each cell.  This uses the mapping to set the appropriate datatype and truncate as needed.

In [36]:
def clean_cell(cell, cell_type,NULL_TERMS = ['N/A','NOT APPLICABLE','NA','NONE','NOT APPLICABE','NOT APLICABLE','N A','N-A']):
    if cell_type == 'D': cell = datetime.strptime(cell, '%Y-%m-%d %H:%M:%S')
    elif cell_type == 'I': cell = int(cell)
    elif cell_type == 'N': cell = float(cell)
    else:
        cell = cell.upper()
        if len(cell) > 50: cell = cell[0:50]
        if not cell or cell in NULL_TERMS: cell = None
    return cell
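
One thing to watch: `clean_cell` as written raises `ValueError` if a date or numeric cell is empty or malformed. Whether that actually occurs in the IRS file is not shown here, so the following is only a defensive sketch. `clean_cell_safe` is a hypothetical variant that returns `None` instead of raising and also strips surrounding whitespace; it uses a shortened null-term list and is not the version used in the rest of this post.

```python
from datetime import datetime

NULL_TERMS = {'N/A', 'NOT APPLICABLE', 'NA', 'NONE', 'N A', 'N-A'}

def clean_cell_safe(cell, cell_type, max_len=50):
    """Convert a raw cell by type code; return None for unusable values."""
    cell = cell.strip()
    converters = {
        'D': lambda s: datetime.strptime(s, '%Y-%m-%d %H:%M:%S'),
        'I': int,
        'N': float,
    }
    if cell_type in converters:
        try:
            return converters[cell_type](cell)
        except ValueError:
            return None  # empty or malformed date/number
    cell = cell.upper()[:max_len]
    return None if not cell or cell in NULL_TERMS else cell
```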

Then we go through each row, clean each cell, and add it to a dictionary.  This way, for each row we have each field name paired with its value, cleaned per our data dictionary/mapping file.

In [37]:
def parse_row(row, mapping):
    fields = mapping
    parsed_row = {}
    for i, cell in enumerate(row[0:len(fields)]):
        field_name, field_type = fields[i]
        parsed_cell = clean_cell(cell, field_type)
        parsed_row[field_name] = parsed_cell
    return parsed_row

In [38]:
# This just allows us to have a visual indicator of status of parsing
def file_len(fname):
    # Match the encoding used in process_file so non-ASCII bytes do not raise
    with open(fname, encoding='ISO-8859-1') as f:
        return sum(1 for _ in f)

In [41]:
def process_file(final_path,mappings):
    with io.open(final_path, 'r', encoding='ISO-8859-1') as raw_file:
        
        # initialize variables
        def def_value(): return []
        records = defaultdict(def_value)
        file_length = file_len(final_path)
        start_time = datetime.now()
        
        # Parse File
        reader = csv.reader(raw_file, delimiter='|')
        for i,row in enumerate(reader): # Row by Row
            try:
                form_type = str(row[0]) # Check record type for the row
                if form_type in mappings.keys(): # Ensure record type is in mapping
                    parsed_row = parse_row(row, mappings[form_type]) # Parse row using function above
                    records[form_type].append(parsed_row) # Save parsed row in list
                elif form_type in ("H","F"): logger.info(row)
            except IndexError:
                if row: records["error_idxs"].append(i) # Blank lines parse to []; record only non-blank erroneous lines
                
            # Print out progress information and estimated time remaining
            if i%10000 ==0:
                clear_output(wait=True)
                elapsed = datetime.now()-start_time
                time_per = elapsed/max(i,1)
                logger.info(f"{i} of {file_length} | {round((i/file_length)*100,2)}% | Elapsed={elapsed} | Time Per={time_per} | Remaining={time_per*(file_length-i)}")
    
    # Save parsed data
    with open('data/irs-filings/processed_lists.pickle', 'wb') as handle: # close the file once written
        pickle.dump(dict(records), handle)
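
The progress line in `process_file` uses a simple linear estimate: average time per row so far, multiplied by the rows left. Pulled out on its own it looks roughly like this; `estimate_remaining` is a hypothetical helper name and the numbers are made up.

```python
from datetime import timedelta

def estimate_remaining(rows_done, total_rows, elapsed):
    """Linear estimate: average time per row so far times rows left."""
    time_per_row = elapsed / max(rows_done, 1)  # avoid dividing by zero at row 0
    return time_per_row * (total_rows - rows_done)

# 10,000 of 50,000 rows done in 20 seconds
remaining = estimate_remaining(10_000, 50_000, timedelta(seconds=20))
print(remaining)
```

The estimate assumes every row takes about the same time to parse, so it is only a rough guide.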




We can run all of that and we have our data.

In [1]:
mappings_path = Path("DataConfigs/irs-filings/mappings.xlsx")
mappings = build_mappings(load_mapping_file(mappings_path))
process_file(final_path,mappings)

# Look at the output

In this tutorial we will not load the data into the graph database, but there will be a future blog post on how we load data into it.  For now, let's make sure the data looks good and look at a couple of very basic things here in Python.

In [4]:
with open('data/irs-filings/processed_lists.pickle', 'rb') as handle:
    records = pickle.load(handle)
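
Before digging in, a quick sanity check is to count how many rows landed under each key. This is a minimal sketch: `summarize` is a hypothetical helper, and `sample_records` is a made-up stand-in for the unpickled `records` dict.

```python
def summarize(records):
    """Count parsed rows per record type, largest first."""
    counts = {key: len(rows) for key, rows in records.items()}
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

# Made-up stand-in for the unpickled `records` dict.
sample_records = {
    'E': [{'state_issued': 'CA'}],
    'D': [{'entity_name': 'JANE DOE'}, {'entity_name': 'JOHN ROE'}],
    'error_idxs': [],
}
print(summarize(sample_records))
```

Calling `summarize(records)` on the real data would show at a glance whether any record type came back empty.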

Let's start with seeing how many organizations some of our top directors and officers are involved with.

In [32]:
directorsdf = pd.DataFrame(records['D'])
topdirectors = directorsdf[['entity_name','org_name']].groupby('entity_name').nunique().sort_values('org_name',ascending=False).head().reset_index()
topdirectors

| | entity_name | org_name |
|---|---|---|
| 0 | J. RICHARD EICHMAN | 241 |
| 1 | KINDE DURKEE | 203 |
| 2 | SHAWNDA DEANE | 181 |
| 3 | LAURA ANN STEPHEN | 172 |
| 4 | NANCY H. WATKINS | 166 |
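
The key piece of that query is `nunique`, which counts distinct organizations rather than rows. A toy example with made-up names shows the difference when the same person appears for the same organization more than once.

```python
import pandas as pd

# Made-up data: ANN appears twice for ORG 1, which should count only once.
toy = pd.DataFrame({
    'entity_name': ['ANN', 'ANN', 'ANN', 'BOB', 'BOB'],
    'org_name':    ['ORG 1', 'ORG 1', 'ORG 2', 'ORG 1', 'ORG 1'],
})

# Distinct organizations per person, most connected first
counts = toy.groupby('entity_name')['org_name'].nunique().sort_values(ascending=False)
print(counts.to_dict())
```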


When we look, we see the top 5 are each in over 150 organizations!  I wonder how they split their time between so many different companies...

>Note:  You may recognize the name Kinde Durkee.  It was widely reported that she embezzled millions and agreed to an 8-year sentence for her crimes.

Let's take a look at them to see what they do.

In [68]:
#filter for top directors
topdirectors = directorsdf[directorsdf.entity_name.isin(topdirectors.entity_name)]

In [48]:
# Get count of titles
titles = topdirectors[['entity_title','record_type']].groupby('entity_title').count()
titles.sort_values('record_type', ascending=False).head()

| entity_title | record_type |
|---|---|
| TREASURER | 1058 |
| ASSISTANT TREASURER | 303 |
| DEPUTY TREASURER | 13 |
| ASSITANT TREASURER | 10 |
| SECRETARY/TREASURER | 8 |


We can see that these are basically all money people in the orgs they are in.  Let's see how much crossover there is, with more than one of these individuals at the same organization.

In [69]:
orgs = topdirectors[['ein','entity_name']].groupby('ein').nunique()
(orgs.entity_name > 1).sum()/(orgs.entity_name>0).sum()

0.12198067632850242

Across the organizations the top 5 individuals are part of, just over 12% have at least 1 of the other 4 involved as a director or officer as well.  Put another way, if 1 of these 5 individuals is an officer at one of these political organizations, there's about a 1/8 chance that at least 1 of the others is an officer there too.
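
The ratio itself is easier to follow on a toy table. In this made-up data, one of four organizations has two of the individuals, so the share comes out to a quarter.

```python
import pandas as pd

# Made-up data: org '01' has two individuals, the other three have one each.
toy = pd.DataFrame({
    'ein':         ['01', '01', '02', '03', '04'],
    'entity_name': ['ANN', 'BOB', 'ANN', 'BOB', 'ANN'],
})

per_org = toy.groupby('ein')['entity_name'].nunique()  # distinct people per org
share = (per_org > 1).sum() / len(per_org)             # orgs with more than one
print(share)
```

In this toy data every organization has at least one name, so dividing by `len(per_org)` matches the `> 0` count used above.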

What if we expanded this to include related entities?

In [93]:
# Get director table for join
directorjoin = topdirectors[['org_name','entity_name']]
directorjoin.columns = ['org_name','director_name']

# get related entities table for join
relateddf = pd.DataFrame(records['R'])
relatedjoin = relateddf[['org_name','entity_name']]
relatedjoin.columns = ['org_name','related_org']

# Join tables
related = pd.merge(directorjoin,relatedjoin,how='left',left_on='org_name',right_on='org_name')[['related_org','director_name']]
related = related[~related.related_org.isnull()]
related.columns = ['org_name','director_name']

relatedtopdirectors =  pd.concat([related,directorjoin])
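
Here is the same join pattern on a toy example, with made-up names, to show what the combined table ends up containing. It is a sketch of the idea rather than a drop-in replacement, and it uses `rename` and `dropna` where the code above assigns to `.columns` and filters with `isnull`.

```python
import pandas as pd

# Made-up inputs: ANN is at ORG 1, BOB is at ORG 2, and ORG 1 lists two related orgs.
directors = pd.DataFrame({'org_name': ['ORG 1', 'ORG 2'],
                          'director_name': ['ANN', 'BOB']})
related = pd.DataFrame({'org_name': ['ORG 1', 'ORG 1'],
                        'related_org': ['ORG 9', 'ORG 2']})

# Attach related orgs to each director, dropping orgs with no related entity
joined = directors.merge(related, how='left', on='org_name')
joined = joined.dropna(subset=['related_org'])

# Treat each related org as an org the director is indirectly tied to
indirect = joined[['related_org', 'director_name']].rename(columns={'related_org': 'org_name'})
combined = pd.concat([indirect, directors], ignore_index=True)
print(combined)
```

After the join, ORG 2 is tied to both ANN (through ORG 1) and BOB (directly), which is the kind of overlap the next step measures.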

Now let's run the same thing as above to see if the percentage moves if we include the directors' relationships not just to the orgs they are a part of, but to orgs related to those as well.

How much overlap do we see?

In [104]:
orgs = relatedtopdirectors[['org_name','director_name']].groupby('org_name').nunique().reset_index()
(orgs.director_name > 1).sum()/(orgs.director_name>0).sum()

0.17130044843049327

That brings us up to 17%, meaning that if one of these individuals is involved in an organization, there is a greater than 1/6 chance that at least one of the other 4 is involved in that organization or a related organization.

This may or may not be meaningful depending on what is found as we dig deeper, but I hope you enjoyed this cursory sneak peek at just a couple of the many data points the IRS data includes.