# Get a subset of a state (by filtering for a city)

## Download the 2017 metadata and show where we put it

Use `irsx_index` at the command line to retrieve the 2017 listing of all XML 990 e-filings.

__Note that these are filings received *during* 2017, so check each filing's tax_period to see which period it covers.__

`irsx_index` is a helper command included with irsx, so you need to install irsx first. Try `pip install irsx` or see more [here](https://github.com/jsfenfen/990-xml-reader/#installation).

We use the `--verbose` flag so we can watch its progress.

    $ irsx_index --verbose --year=2017
    Getting index file for year: 2017 
    remote=https://s3.amazonaws.com/irs-form-990/index_2017.csv 
    local=/Users/jfenton/github-whitelabel/envs/irs-cookbook/lib/python3.6/site-packages/irsx/CSV/index_2017.csv
    Beginning streaming download of https://s3.amazonaws.com/irs-form-990/index_2017.csv
    Total file size: 59.45 MB

In [2]:
## You don't need to run the command below if you've run irsx_index at the command line
## To actually do this from within the notebook environment uncomment the below
## Note that we're using the %sx 'magic command' which captures the output as an array 
## Your mileage may vary depending on how jupyter plays with your operating system

## Getting 2018 and 2019 files here to ensure we've got the latest


%sx irsx_index --verbose --year=2018

['Getting index file for year: 2018 remote=https://s3.amazonaws.com/irs-form-990/index_2018.csv local=/Users/mfriese1/Sites/irsx_cookbook/venv/lib/python3.6/site-packages/irsx/CSV/index_2018.csv',
 'Beginning streaming download of https://s3.amazonaws.com/irs-form-990/index_2018.csv',
 'Total file size: 55.52 MB',
 'Download completed to /Users/mfriese1/Sites/irsx_cookbook/venv/lib/python3.6/site-packages/irsx/CSV/index_2018.csv in 0:00:06.536202']

In [3]:
%sx irsx_index --verbose --year=2019

['Getting index file for year: 2019 remote=https://s3.amazonaws.com/irs-form-990/index_2019.csv local=/Users/mfriese1/Sites/irsx_cookbook/venv/lib/python3.6/site-packages/irsx/CSV/index_2019.csv',
 'Beginning streaming download of https://s3.amazonaws.com/irs-form-990/index_2019.csv',
 'Total file size: 36.65 MB',
 'Download completed to /Users/mfriese1/Sites/irsx_cookbook/venv/lib/python3.6/site-packages/irsx/CSV/index_2019.csv in 0:00:05.391344']
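
Because an index file lists filings by the year they were received, the tax periods inside it can span more than one year. A minimal sketch of how one might check this, using a few toy rows rather than the real index file; the column name `TAX_PERIOD` and its `YYYYMM` format match the index output shown in this notebook, while `index_sample` and `TAX_PERIOD_YEAR` are names made up for illustration:

```python
import pandas as pd

# Toy rows shaped like the index CSV (assumption: TAX_PERIOD is an integer YYYYMM).
index_sample = pd.DataFrame({
    "EIN": [455093195, 930223960, 930386823],
    "TAX_PERIOD": [201712, 201809, 201803],
})

# Integer-divide by 100 to drop the month and keep the year the tax period ended in.
index_sample["TAX_PERIOD_YEAR"] = index_sample["TAX_PERIOD"] // 100

# Count filings per tax-period year to see how far they drift from the index year.
period_counts = index_sample["TAX_PERIOD_YEAR"].value_counts().sort_index()
print(period_counts)
```

Running the same two steps on a full index DataFrame should show how many filings in a given index year belong to earlier tax periods.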

# Get all nonprofit organizations in your state

We grabbed a file for just the state of Oregon as eo_or.csv from here: 
https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf

Note that this method isn't great for historic organizations; the IRS purges organizations after they've become inactive for a period of time. Historic EO BMF files are available here: http://nccs-data.urban.org/data.php?ds=bmf 

In [1]:
# importing libraries we'll use.
import csv
import os
import pandas as pd

# This tells us where the csv files are located in the system
from irsx.settings import INDEX_DIRECTORY

In [2]:
oregon_np = pd.read_csv("eo_or.csv")

In [3]:
# look at the first few lines
#oregon_np.head()
## print the headers as an array
list(oregon_np)

['EIN',
 'NAME',
 'ICO',
 'STREET',
 'CITY',
 'STATE',
 'ZIP',
 'GROUP',
 'SUBSECTION',
 'AFFILIATION',
 'CLASSIFICATION',
 'RULING',
 'DEDUCTIBILITY',
 'FOUNDATION',
 'ACTIVITY',
 'ORGANIZATION',
 'STATUS',
 'TAX_PERIOD',
 'ASSET_CD',
 'INCOME_CD',
 'FILING_REQ_CD',
 'PF_FILING_REQ_CD',
 'ACCT_PD',
 'ASSET_AMT',
 'INCOME_AMT',
 'REVENUE_AMT',
 'NTEE_CD',
 'SORT_NAME']

In [8]:
# Ignore some columns for now
or_np_simplified = oregon_np.filter(items=['EIN', 'NAME', 'ICO', 'STREET', 'CITY', 'STATE', 'ZIP', 'SUBSECTION','INCOME_AMT', 'ASSET_AMT', 'REVENUE_AMT', 'TAX_PERIOD', 'NTEE_CD'])
print("total oregon orgs: %s" % len(or_np_simplified))

# This is a toy filter for a demo -- you'd want something more robust than a perfect text match
pdx_orgs = or_np_simplified.query('CITY == "PORTLAND"')
print("total Portland, OR orgs: %s" % len(pdx_orgs))

total oregon orgs: 24613
total Portland, OR orgs: 5721
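
The exact text match above misses rows with stray whitespace or different capitalization. One possible way to make the filter a little more forgiving is to normalize the column before comparing. This is only a sketch on made-up rows; `orgs` and `city_clean` are hypothetical names, and real data may need further checks (for example on ZIP code):

```python
import pandas as pd

# Toy rows; the CITY values are invented to show common inconsistencies.
orgs = pd.DataFrame({
    "NAME": ["ORG A", "ORG B", "ORG C", "ORG D"],
    "CITY": ["PORTLAND", "Portland ", " portland", "BEND"],
})

# Exact match, as in the toy filter: only the first row qualifies.
exact_count = len(orgs.query('CITY == "PORTLAND"'))

# Normalized match: fill missing values, strip whitespace, upper-case, then compare.
city_clean = orgs["CITY"].fillna("").str.strip().str.upper()
pdx = orgs[city_clean == "PORTLAND"]
normalized_count = len(pdx)

print("exact match: %s, normalized match: %s" % (exact_count, normalized_count))
```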


In [9]:
# Show the top values by income 
or_np_simplified.sort_values(by='INCOME_AMT', ascending=False).head()


Unnamed: 0,EIN,NAME,ICO,STREET,CITY,STATE,ZIP,SUBSECTION,INCOME_AMT,ASSET_AMT,REVENUE_AMT,TAX_PERIOD,NTEE_CD
6145,455093195,HEALTH SHARE OF OREGON,% JANET MEYER,2121 SW BROADWAY STE 200,PORTLAND,OR,97201-3181,3,2010355000.0,99666100.0,2010355000.0,201812.0,E80
13958,930223960,ASANTE,,2635 SISKIYOU BLVD,MEDFORD,OR,97504-8125,3,1371129000.0,1265166000.0,816045300.0,201809.0,E220
18196,930933975,CAREOREGON INC,% TERESA KENNEDY LEARN CFO,315 SW 5TH AVE,PORTLAND,OR,97204-1703,3,1265449000.0,427700600.0,1147714000.0,201812.0,E31Z
14210,930386823,LEGACY EMANUEL HOSPITAL & HEALTH CENTER,,2801 N GANTENBEIN AVE,PORTLAND,OR,97227-1623,3,934438900.0,605104800.0,934290300.0,201803.0,E220
15223,930602940,ST CHARLES HEALTH SYSTEM INC,,2500 NE NEFF RD,BEND,OR,97701-6015,3,831908300.0,1047606000.0,830561600.0,201812.0,E220


In [10]:

# These are the index files we downloaded at the start
INDEX_2018 = os.path.join(INDEX_DIRECTORY, 'index_2018.csv')
np_2018 = pd.read_csv(INDEX_2018)

INDEX_2019 = os.path.join(INDEX_DIRECTORY, 'index_2019.csv')
np_2019 = pd.read_csv(INDEX_2019)

df = pd.concat([np_2018,np_2019])
np_all = df.sort_values('TAX_PERIOD', ascending=False).drop_duplicates(subset=['EIN'])
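
The sort-then-deduplicate step above keeps one row per EIN: the one with the most recent tax period. A small sketch on toy data shows the effect; the frames `idx_2018` and `idx_2019` and their values are invented for illustration:

```python
import pandas as pd

# Toy index rows: EIN 111 appears in both years, EIN 222 only once.
idx_2018 = pd.DataFrame({"EIN": [111, 222],
                         "TAX_PERIOD": [201712, 201806],
                         "OBJECT_ID": ["a", "b"]})
idx_2019 = pd.DataFrame({"EIN": [111],
                         "TAX_PERIOD": [201812],
                         "OBJECT_ID": ["c"]})

combined = pd.concat([idx_2018, idx_2019], ignore_index=True)

# Sort newest tax period first, then keep the first row seen for each EIN.
latest = (combined.sort_values("TAX_PERIOD", ascending=False)
                  .drop_duplicates(subset=["EIN"], keep="first"))
print(latest)
```

Note that this discards older filings for the same organization, so it suits a "latest filing per org" list rather than a full filing history.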


## Save the list of possible filers who actually filed to a .csv file

In [11]:
# Now find orgs in Oregon that e-filed in 2018 or 2019.
# This join requires that both fields be named EIN and be formatted the same

ore_efilers = pd.merge(np_all,
                 or_np_simplified,
                 on='EIN')
print("Found a total of %s oregon 2018/19 efilers" % len(ore_efilers))
ore_efilers.head()

# Sort by income amount, then asset amount, largest first
ore_efilers = ore_efilers.sort_values(by=['INCOME_AMT', 'ASSET_AMT'], ascending=[False, False])
# Let's write them back out to a file for reference.
ore_efilers.to_csv('orefilers.csv')

# These are the top few for reference
ore_efilers.head()

Found a total of 6728 oregon 2018/19 efilers


Unnamed: 0,RETURN_ID,FILING_TYPE,EIN,TAX_PERIOD_x,SUB_DATE,TAXPAYER_NAME,RETURN_TYPE,DLN,OBJECT_ID,NAME,...,STREET,CITY,STATE,ZIP,SUBSECTION,INCOME_AMT,ASSET_AMT,REVENUE_AMT,TAX_PERIOD_y,NTEE_CD
6338,15892895,EFILE,455093195,201712,11/8/2018 10:52:16 AM,HEALTH SHARE OF OREGON,990,93493283012358,201802839349301235,HEALTH SHARE OF OREGON,...,2121 SW BROADWAY STE 200,PORTLAND,OR,97201-3181,3,2010355000.0,99666100.0,2010355000.0,201812.0,E80
2716,16724048,EFILE,930223960,201809,10/7/2019 12:38:43 PM,ASANTE,990,93493226015059,201902269349301505,ASANTE,...,2635 SISKIYOU BLVD,MEDFORD,OR,97504-8125,3,1371129000.0,1265166000.0,816045300.0,201809.0,E220
5075,16007105,EFILE,930933975,201712,12/14/2018 8:22:16 PM,CAREOREGON INC,990,93493309024658,201803099349302465,CAREOREGON INC,...,315 SW 5TH AVE,PORTLAND,OR,97204-1703,3,1265449000.0,427700600.0,1147714000.0,201812.0,E31Z
4887,16280455,EFILE,930386823,201803,5/9/2019 2:21:18 AM,LEGACY EMANUEL HOSPITAL AND HEALTH CENTER,990,93493046023289,201930469349302328,LEGACY EMANUEL HOSPITAL & HEALTH CENTER,...,2801 N GANTENBEIN AVE,PORTLAND,OR,97227-1623,3,934438900.0,605104800.0,934290300.0,201803.0,E220
5097,16061823,EFILE,930602940,201712,2/5/2019 12:00:34 AM,ST CHARLES HEALTH SYSTEM INC,990,93493320016858,201803209349301685,ST CHARLES HEALTH SYSTEM INC,...,2500 NE NEFF RD,BEND,OR,97701-6015,3,831908300.0,1047606000.0,830561600.0,201812.0,E220
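
The join above only works when EIN has the same type and format in both frames. If one file stores EINs as integers and another as strings with leading zeros, rows can fail to match or the merge can raise an error. A hedged sketch of one way to normalize first, on toy data; `normalize_ein`, `index_df`, and `bmf_df` are hypothetical names, and the first EIN is invented:

```python
import pandas as pd

# Toy frames: one side has integer EINs, the other has zero-padded strings.
index_df = pd.DataFrame({"EIN": [10203040, 930223960],
                         "RETURN_TYPE": ["990EZ", "990"]})
bmf_df = pd.DataFrame({"EIN": ["010203040", "930223960"],
                       "NAME": ["SMALL ORG", "ASANTE"]})

def normalize_ein(series):
    """Return EINs as zero-padded 9-character strings."""
    return series.astype(str).str.strip().str.zfill(9)

index_df["EIN"] = normalize_ein(index_df["EIN"])
bmf_df["EIN"] = normalize_ein(bmf_df["EIN"])

# validate= makes pandas raise if an EIN is unexpectedly duplicated on either side.
merged = pd.merge(index_df, bmf_df, on="EIN", validate="one_to_one")
print(merged)
```

Reading the CSV files with `dtype={'EIN': str}` in `pd.read_csv` is another option, since it preserves any leading zeros present in the file.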
