# Get a subset of a state (by filtering for a city)

## Download the 2017 metadata and show where we put it

Use `irsx_index` at the command line to retrieve the 2017 listing of all XML 990 e-filings.

__Note that these are filings received *during* 2017, so check the tax_period__

`irsx_index` is a helper command that is included with irsx, so you need to have installed irsx first. Try `pip install irsx` or see more [here](https://github.com/jsfenfen/990-xml-reader/#installation).

We use the `--verbose` flag so we can watch its progress:

    $ irsx_index --verbose --year=2017
    Getting index file for year: 2017 
    remote=https://s3.amazonaws.com/irs-form-990/index_2017.csv 
    local=/Users/jfenton/github-whitelabel/envs/irs-cookbook/lib/python3.6/site-packages/irsx/CSV/index_2017.csv
    Beginning streaming download of https://s3.amazonaws.com/irs-form-990/index_2017.csv
    Total file size: 59.45 MB

In [1]:
## You don't need to run the command below if you've already run irsx_index for 2018 at the command line
## It runs the download from within the notebook environment; comment it out to skip it
## Note that we're using the %sx 'magic command', which captures the output as a list of lines
## Your mileage may vary depending on how Jupyter plays with your operating system

%sx irsx_index --verbose --year=2018

['Getting index file for year: 2018 remote=https://s3.amazonaws.com/irs-form-990/index_2018.csv local=/Users/mfriese1/Sites/irsx_cookbook/venv/lib/python3.6/site-packages/irsx/CSV/index_2018.csv',
 'Beginning streaming download of https://s3.amazonaws.com/irs-form-990/index_2018.csv',
 'Total file size: 55.52 MB',
 'Download completed to /Users/mfriese1/Sites/irsx_cookbook/venv/lib/python3.6/site-packages/irsx/CSV/index_2018.csv in 0:00:05.009696']
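
To decide whether the download needs to be run at all, one option is to check whether the index file is already on disk. This is a minimal sketch, not part of irsx: `index_is_downloaded` is a hypothetical helper, and the `index_<year>.csv` file name is assumed from the download log above.

```python
import os

def index_is_downloaded(index_dir, year):
    """Return True if index_<year>.csv already exists in index_dir.

    Hypothetical helper; the file name pattern is assumed from the
    download log shown above (e.g. index_2018.csv).
    """
    path = os.path.join(index_dir, "index_%s.csv" % year)
    return os.path.isfile(path)

# Possible usage once irsx is installed (INDEX_DIRECTORY is imported
# from irsx.settings later in this notebook):
# from irsx.settings import INDEX_DIRECTORY
# index_is_downloaded(INDEX_DIRECTORY, 2018)
```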

# Get all nonprofit organizations in your state

We grabbed a file for just the state of Oregon as eo_or.csv from here: 
https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf

Note that this method isn't great for historic organizations; the IRS purges organizations after they've become inactive for a period of time. Historic EO BMF files are available here: http://nccs-data.urban.org/data.php?ds=bmf 

In [2]:
# importing libraries we'll use.
import csv
import os
import pandas as pd

# This tells us where the csv files are located in the system
from irsx.settings import INDEX_DIRECTORY

In [3]:
oregon_np = pd.read_csv("eo_or.csv")

In [4]:
# look at the first few lines
#oregon_np.head()
## print the headers as an array
list(oregon_np)

['EIN',
 'NAME',
 'ICO',
 'STREET',
 'CITY',
 'STATE',
 'ZIP',
 'GROUP',
 'SUBSECTION',
 'AFFILIATION',
 'CLASSIFICATION',
 'RULING',
 'DEDUCTIBILITY',
 'FOUNDATION',
 'ACTIVITY',
 'ORGANIZATION',
 'STATUS',
 'TAX_PERIOD',
 'ASSET_CD',
 'INCOME_CD',
 'FILING_REQ_CD',
 'PF_FILING_REQ_CD',
 'ACCT_PD',
 'ASSET_AMT',
 'INCOME_AMT',
 'REVENUE_AMT',
 'NTEE_CD',
 'SORT_NAME']

In [5]:
# Ignore some columns for now
or_np_simplified = oregon_np.filter(items=['EIN', 'NAME', 'ICO', 'STREET', 'CITY', 'STATE', 'ZIP', 'INCOME_AMT', 'ASSET_AMT', 'TAX_PERIOD'])
print("total oregon orgs: %s" % len(or_np_simplified))

# This is a toy filter for a demo -- you'd want something more robust than a perfect text match
pdx_orgs = or_np_simplified.query('CITY == "PORTLAND"')
print("total Portland, OR orgs: %s" % len(pdx_orgs))

total oregon orgs: 24185
total Portland, OR orgs: 5637
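
As noted in the comment, a perfect text match is fragile. Here is a sketch of a slightly more forgiving filter that normalizes case and whitespace before comparing. The rows are made-up stand-ins for the EO BMF extract, and whether a prefix match is appropriate for your data is an assumption you should check.

```python
import pandas as pd

# Toy rows standing in for the EO BMF extract; the values are made up.
orgs = pd.DataFrame({
    "EIN": [1, 2, 3, 4],
    "CITY": ["PORTLAND", " Portland ", "PORTLAND OR", "SALEM"],
})

# Normalize case and surrounding whitespace before comparing.
city = orgs["CITY"].fillna("").str.strip().str.upper()

# Exact match on the normalized value.
exact = orgs[city == "PORTLAND"]

# A prefix match also catches variants such as "PORTLAND OR",
# at the cost of possible false positives.
prefix = orgs[city.str.startswith("PORTLAND")]

print(len(exact), len(prefix))
```

Comparing the two counts on the real file is a quick way to see how many rows an exact match would miss.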


In [7]:
# Show the top values by income
or_np_simplified.sort_values(by=['INCOME_AMT'], ascending=False).head()


Unnamed: 0,EIN,NAME,ICO,STREET,CITY,STATE,ZIP,INCOME_AMT,ASSET_AMT,TAX_PERIOD
23481,941105628,KAISER FOUNDATION HOSPITALS,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,36971310000.0,51507510000.0,201712.0
16192,930798039,KAISER FOUNDATION HEALTH PLAN OF THE NORTHWEST,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,5075035000.0,1285021000.0,201712.0
12295,840591617,KAISER FOUNDATION HEALTH PLAN OF COLORADO,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,4632392000.0,1797186000.0,201712.0
12716,910511770,KAISER FOUNDATION HEALTH PLAN OF WASHINGTON,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,4066199000.0,2628078000.0,201712.0
8794,520954463,KAISER FOUNDATION HEALTH PLAN OF THE MID ATLAN...,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,3841838000.0,1560429000.0,201712.0


In [12]:

# this is from the index file we downloaded at the start
INDEX_2018 = os.path.join(INDEX_DIRECTORY, 'index_2018.csv')
np_2018 = pd.read_csv(INDEX_2018)
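
Because the index lists filings by the year they were received, it helps to look at which tax periods are actually present. This sketch assumes the index's `TAX_PERIOD` column holds `YYYYMM` integers, as the sample output in this notebook suggests; the frame here is a toy stand-in, and `TAX_YEAR` is a column we add for illustration.

```python
import pandas as pd

# Toy stand-in for a few index rows; EINs and tax periods echo the
# sample output in this notebook but are only illustrative.
index_rows = pd.DataFrame({
    "EIN": [941105628, 941105628, 930798039],
    "TAX_PERIOD": [201612, 201712, 201612],
})

# TAX_PERIOD is assumed to be YYYYMM, so integer division by 100
# leaves the year.
index_rows["TAX_YEAR"] = index_rows["TAX_PERIOD"] // 100

# How many filings fall in each tax year?
print(index_rows["TAX_YEAR"].value_counts().sort_index())

# Keep only filings for tax periods ending in 2017.
only_2017 = index_rows[index_rows["TAX_YEAR"] == 2017]
```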


## Save the organizations that actually e-filed to a .csv file

In [13]:
# Now find orgs that are in Oregon that filed in 2018.
# This join requires that both frames have a column named EIN, formatted the same way

ore_2018_efilers = pd.merge(np_2018,
                            or_np_simplified,
                            on='EIN')
print("Found a total of %s oregon 2018 efilers" % len(ore_2018_efilers))

# sort by income amt, then asset amt, largest first
ore_2018_efilers = ore_2018_efilers.sort_values(by=['INCOME_AMT', 'ASSET_AMT'], ascending=[False, False])
# Let's write them back out to a file for reference.
ore_2018_efilers.to_csv('orefilers.csv')

# These are the top few for reference
ore_2018_efilers.head()

Found a total of 6264 oregon 2018 efilers


Unnamed: 0,RETURN_ID,FILING_TYPE,EIN,TAX_PERIOD_x,SUB_DATE,TAXPAYER_NAME,RETURN_TYPE,DLN,OBJECT_ID,NAME,ICO,STREET,CITY,STATE,ZIP,INCOME_AMT,ASSET_AMT,TAX_PERIOD_y
1225,15138469,EFILE,941105628,201612,1/19/2018 1:38:47 PM,KAISER FOUNDATION HOSPITALS,990,93493313028757,201703139349302875,KAISER FOUNDATION HOSPITALS,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,36971310000.0,51507510000.0,201712.0
1226,16035003,EFILE,941105628,201712,12/21/2018 7:25:49 PM,KAISER FOUNDATION HOSPITALS,990,93493312025448,201843129349302544,KAISER FOUNDATION HOSPITALS,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,36971310000.0,51507510000.0,201712.0
1518,15150911,EFILE,930798039,201612,1/29/2018 8:34:07 AM,KAISER FOUNDATION HEALTH PLAN OF THE NORTHWEST,990,93493313028047,201743139349302804,KAISER FOUNDATION HEALTH PLAN OF THE NORTHWEST,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,5075035000.0,1285021000.0,201712.0
1519,16034561,EFILE,930798039,201712,12/21/2018 5:02:39 PM,KAISER FOUNDATION HEALTH PLAN OF THE NORTHWEST,990,93493312021948,201843129349302194,KAISER FOUNDATION HEALTH PLAN OF THE NORTHWEST,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,5075035000.0,1285021000.0,201712.0
1894,15149962,EFILE,840591617,201612,1/26/2018 4:28:14 PM,KAISER FOUNDATION HEALTH PLAN OF COLORADO,990,93493313025267,201713139349302526,KAISER FOUNDATION HEALTH PLAN OF COLORADO,% CHIEF ACCOUNTING OFFICER,2701 NW VAUGHN ST STE 490,PORTLAND,OR,97210-5358,4632392000.0,1797186000.0,201712.0
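
The sample rows above show the same EIN appearing more than once, so the merged row count is a count of filings rather than of organizations. The sketch below uses made-up data to show one way to make the EIN formatting consistent before the join and then count distinct organizations. `normalize_ein` is a hypothetical helper, and the assumption that EINs should be zero-padded 9-character strings is ours, not something irsx enforces.

```python
import pandas as pd

# Toy frames; column names follow the notebook, values are made up.
index_df = pd.DataFrame({
    "EIN": [10000001, 10000001, 930000002],   # read as integers
    "TAX_PERIOD": [201612, 201712, 201712],
})
bmf_df = pd.DataFrame({
    "EIN": ["010000001", "930000002", "930000003"],  # read as strings
    "NAME": ["ORG A", "ORG B", "ORG C"],
})

def normalize_ein(series):
    """Render EINs as zero-padded 9-character strings (hypothetical helper)."""
    return series.astype(str).str.strip().str.zfill(9)

index_df["EIN"] = normalize_ein(index_df["EIN"])
bmf_df["EIN"] = normalize_ein(bmf_df["EIN"])

# validate="m:1" makes pandas raise if the BMF side has duplicate EINs.
merged = pd.merge(index_df, bmf_df, on="EIN", validate="m:1")

print("filings:", len(merged))
print("distinct organizations:", merged["EIN"].nunique())
```

Without the normalization step, the integer and string EINs in this toy data would not line up on the join key.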
