## Import with a generic data structure

This is similar to recipe 2, but it uses a structured data format to specify which variables to capture, which makes defaults explicit.
There's not much advantage to doing it this way (arguably the code is more complex), but it becomes more useful with more complex processing; see recipe 3.

## The format for this data structure is:
# data_capture_dict = {
#    '<SCHEDULE NAME>': {
#        'groups': {
#            '<REPEATING GROUP (TABLE) NAME>': {
#                '<IRSX VARIABLE NAME>': {
#                    'header': '<HEADER IN OUR OUTPUT CSV FILE>',
#                    'default': <DEFAULT VALUE TO USE IF IT'S MISSING>
#                }
#            }
#        }
#    }
# }
 

data_capture_dict = {
    'IRS990ScheduleJ': {
        'groups': {
            'SkdJRltdOrgOffcrTrstKyEmpl': {
                'PrsnNm': {'header':'name'},
                'BsnssNmLn1Txt': {'header':'business_name1'},
                'BsnssNmLn2Txt': {'header':'business_name2'},
                'TtlTxt': {'header':'title'},
                'TtlCmpnstnFlngOrgAmt': {
                    'header':'org_comp',
                    'default':0
                },
                'TtlCmpnstnRltdOrgsAmt': {
                    'header':'related_comp',
                    'default':0
                },
            }
        }
    }
}
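
Because each captured variable carries its own `header`, the list of output columns could in principle be derived from the structure instead of being typed out by hand. The following is a minimal sketch under that assumption; `headers_for_group` and `METADATA_HEADERS` are hypothetical names that do not appear elsewhere in this recipe, and the dictionary is a trimmed copy so the snippet runs by itself. It relies on dictionaries preserving insertion order (Python 3.7+).

```python
# Trimmed copy of the structure defined above, so this sketch runs by itself.
data_capture_dict = {
    'IRS990ScheduleJ': {
        'groups': {
            'SkdJRltdOrgOffcrTrstKyEmpl': {
                'PrsnNm': {'header': 'name'},
                'TtlTxt': {'header': 'title'},
                'TtlCmpnstnFlngOrgAmt': {'header': 'org_comp', 'default': 0},
            }
        }
    }
}

# Columns that come from the input metadata row rather than from the filing.
# (Hypothetical constant, named here for illustration only.)
METADATA_HEADERS = ["period", "ein", "object_id", "taxpayer_name"]


def headers_for_group(capture, sked, group):
    """Hypothetical helper: metadata columns plus one column per captured variable."""
    variables = capture[sked]['groups'][group]
    return METADATA_HEADERS + [spec['header'] for spec in variables.values()]


headers = headers_for_group(
    data_capture_dict, 'IRS990ScheduleJ', 'SkdJRltdOrgOffcrTrstKyEmpl'
)
print(headers)
```

Deriving the columns this way keeps the CSV header in step with the capture dictionary, at the cost of making the column order depend on the order of the dictionary entries.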

In [1]:
import unicodecsv as csv
from irsx.xmlrunner import XMLRunner

In [2]:
# read the whole file in here, it's not very long
file_rows = [] 

# We're using the output of part 1
with open('pdxefilers.csv', 'rb') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        file_rows.append(row)

In [3]:
len(file_rows)

1874

In [4]:
# the name of the output file
outfilename = "employees.csv"
outfile = open(outfilename, 'wb')

# the column headers as they'll appear in the output
headers = ["period", "ein", "object_id", "taxpayer_name", "name", "business_name1", "business_name2", "title", "org_comp", "related_comp"]

# start up a DictWriter; ignore any extra keys in a row
dw = csv.DictWriter(outfile, headers, extrasaction='ignore')
dw.writeheader()

# get an XMLRunner -- this is what actually does the parsing
xml_runner = XMLRunner()

## Figure out what to extract

Data from each repeating group should go to its own file; otherwise it won't make sense.

To figure out what to capture, I started by looking at schedule J: http://www.irsx.info/#IRS990ScheduleJ
Then I went to the table details and picked the rows I wanted from the repeating group:
http://www.irsx.info/metadata/groups/SkdJRltdOrgOffcrTrstKyEmpl.html

Note that it's common for director/employee names in Schedule J to be listed in the business name fields.
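
One way to cope with that when reading the output is to fall back to the business name columns whenever the person name is empty. This is only a sketch: `display_name` is a hypothetical helper, not part of IRSX, and the sample rows are made up for illustration. The column names match the `header` values used in this recipe.

```python
def display_name(row):
    """Hypothetical helper: prefer the person name, else join the business name lines."""
    person = (row.get('name') or '').strip()
    if person:
        return person
    parts = [row.get('business_name1'), row.get('business_name2')]
    return ' '.join(p.strip() for p in parts if p and p.strip())


# Made-up rows shaped like the output CSV, for illustration only.
sample_rows = [
    {'name': 'Jane Doe', 'business_name1': '', 'business_name2': ''},
    {'name': '', 'business_name1': 'John Smith', 'business_name2': ''},
    {'business_name1': 'Example', 'business_name2': 'Person'},
]

names = [display_name(row) for row in sample_rows]
print(names)
```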


In [6]:

def run_filing(filing, metadata_row, dw):
        parsed_filing = xml_runner.run_filing(filing)
        if not parsed_filing:
            print("Skipping filing %s (pre-2013 filings are skipped)\n row details: %s" % (filing, metadata_row))
            return None
        
        schedule_list = parsed_filing.list_schedules()

        for sked in data_capture_dict.keys():
                
            if sked in schedule_list:

                parsed_skeds = parsed_filing.get_parsed_sked(sked)
                if parsed_skeds:
                    parsed_sked = parsed_skeds[0]
                else:
                    continue
                    
                for group in data_capture_dict[sked]['groups']:
                    #print("Extracting from repeating group %s" % group)
                    try:
                        groups = parsed_sked['groups'][group]
                        #print("Found %s groups for %s" % (len(groups), group))
                    except KeyError:
                        print("No groups found for %s\n" % group)
                        continue
                    
                    # Get the individual variables we're gonna pull 
                    capture_dict = data_capture_dict[sked]['groups'][group]

                    # We know the groups are there, extract from each one
                    for parsed_group in groups:
                        
                        # Store the data for the new csv output file here
                        row_data = {}
                        # Get fields from the metadata row we passed in
                        row_data['period'] = metadata_row['TAX_PERIOD_x']
                        row_data['ein'] = metadata_row['EIN']
                        row_data['object_id'] = metadata_row['OBJECT_ID']
                        row_data['taxpayer_name'] = metadata_row['TAXPAYER_NAME']
                        
                        for variablename in capture_dict.keys():
                            try:
                                val = parsed_group[variablename]
                                csv_header = capture_dict[variablename]['header']
                                row_data[csv_header] = val
                            except KeyError:
                                
                                try:
                                    default = capture_dict[variablename]['default']
                                    csv_header = capture_dict[variablename]['header']
                                    row_data[csv_header]=default
                                except KeyError:
                                    pass            
                        dw.writerow(row_data)            
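
The default handling in the inner loop can also be read in isolation. The sketch below restates it as a standalone function, using a plain dictionary as a stand-in for one parsed repeating group; `extract_variables` is a hypothetical name, and the sketch assumes every entry in the capture dictionary has a `header` key.

```python
def extract_variables(parsed_group, capture_dict):
    """Hypothetical helper: map IRSX variable names to CSV headers, applying defaults."""
    row = {}
    for variable_name, spec in capture_dict.items():
        if variable_name in parsed_group:
            # The variable is present in the filing: copy its value.
            row[spec['header']] = parsed_group[variable_name]
        elif 'default' in spec:
            # The variable is missing, but a default was declared.
            row[spec['header']] = spec['default']
        # Otherwise leave the column out; DictWriter writes it as empty.
    return row


capture_dict = {
    'PrsnNm': {'header': 'name'},
    'TtlTxt': {'header': 'title'},
    'TtlCmpnstnFlngOrgAmt': {'header': 'org_comp', 'default': 0},
}

# Stand-in for one parsed group that has a name but no title or compensation.
parsed_group = {'PrsnNm': 'Jane Doe'}

row = extract_variables(parsed_group, capture_dict)
print(row)
```

Variables without a declared default are simply omitted from the row, which `csv.DictWriter` fills in with its `restval` (an empty string unless set otherwise).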

In [7]:
DEMO_MAX = 1000
for count, row in enumerate(file_rows):
    this_object_id = row['OBJECT_ID']
    run_filing(this_object_id, row, dw)
    # Don't run endlessly during a demo:
    if(count > DEMO_MAX):
        break
    if count%100==0:
        print("Processed %s filings" % count)

Processed 0 filings
Processed 100 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl

Processed 200 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl

Processed 300 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl

No groups found for SkdJRltdOrgOffcrTrstKyEmpl

Processed 400 filings
Processed 500 filings
Processed 600 filings
Processed 700 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl

Processed 800 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl

No groups found for SkdJRltdOrgOffcrTrstKyEmpl

Processed 900 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl

Processed 1000 filings
No groups found for SkdJRltdOrgOffcrTrstKyEmpl



In [8]:
# close the output file so that all rows are flushed to disk
outfile.close()