# Mapping Pilot Files to Draft DATA Act Standard

This notebook maps data elements from the [pilot's input files](https://github.com/18F/data-act-pilot/tree/master/schema) to the [draft DATA Act Standard](http://fedspendingtransparency.github.io/data-exchange-standard/). The steps are:

1. Get a list of data elements from the XML draft DATA Act Standard (i.e., a series of .xsd files).
2. For each of the four input files used in this pilot, match the column names to their XML counterparts.
3. Output the mappings (and show which pilot data elements don't map to the XML version).

## Setup

For this notebook to work, you will need:

### Schema File
The "palette" DATA Act schema .xsd file that includes financial, awards, and assistance awards information. This file is included in the [_Schema instance (XML version)_ download](http://fedspendingtransparency.github.io/assets/docs/DATA_Act_Schema_v0.5.zip) and is named _plt/case-finassist-ussglfin/da-palette-finassist-ussglfin-2015-06-29.xsd_

**Note**: The repo contains a copy of the .xsd files in schema/xbrl, but this will have to be updated when a new draft of the standard is released.

### generateDS Python Package
The [generateDS Python package](http://pythonhosted.org/generateDS/) generates Python class definitions from XML schema documents. For this exercise, we don't really need the classes because we're not parsing data. However, the generated module contains some handy code that provides a list of the data elements in the .xsd. We use this shortcut to get a list of the XBRL elements so we can compare them to the pilot input files.

**Note**: So far, research into the open source ecosystem for XML schema/XBRL tools shows that it's pretty weak and outdated.

For example, installing generateDS using pip (from either PyPI or Bitbucket) doesn't actually run setup.py. Here is the annoying series of steps I used to install it:
1. Download the project .tar
1. Extract the .tar into your virtualenv's site-packages folder
1. From your virtualenv's site-packages folder:
    * ```cd generateDS-2.17a0```
    * ```python setup.py build```
    * ```python setup.py install```

### Read the financial assistance/ussglfin palette and create Python object

This command sends the DATA Act financial-award-finassist-award .xsd to generateDS and outputs _finassist_ussglfin.py_ (Python class definitions based on the .xsd). Change the command to reflect the location of generateDS.py on your system.

In [None]:
!python ~/Dev/.virtualenvs/intercessor/lib/python2.7/site-packages/generateDS-2.17a0/generateDS.py -o finassist_ussglfin.py -s finassist_ussglfin_sub.py -f --no-dates --no-versions --member-specs=list ../schema/xbrl/plt/case-finassist-ussglfin/da-palette-finassist-ussglfin-2015-06-29.xsd

### Use resulting Python object to grab financial assistance/financial XBRL element names

Use _finassist_ussglfin_ (generated in the previous step) to get a list of the XML data element names. Put them in a Pandas dataframe.

In [None]:
import pandas as pd
import finassist_ussglfin
mapper = finassist_ussglfin

elementDict = mapper.GDSClassesMapping
xbrl_elements = pd.DataFrame(list(elementDict.keys()))
xbrl_elements = xbrl_elements.rename(columns = {0:'element_xbrl'})
xbrl_elements['element_lower'] = xbrl_elements['element_xbrl'].str.lower()
xbrl_elements.to_csv('xbrl_elements.csv', index = False)
xbrl_elements
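
As a cross-check on the generateDS shortcut, element names can also be read straight out of an .xsd with the standard library. The following is only a minimal sketch: `list_xsd_element_names` and `sample_xsd` are hypothetical names invented for illustration, the sample schema is made up, and the function assumes elements are declared as `xs:element` with a `name` attribute. It does not follow `xs:import` or `xs:include`, so against the real palette file it would likely need to be run on each referenced .xsd as well.

```python
import xml.etree.ElementTree as ET

# Namespace used by XML Schema documents
XSD_NS = '{http://www.w3.org/2001/XMLSchema}'


def list_xsd_element_names(xsd_text):
    """Return the sorted, de-duplicated names of all named xs:element declarations.

    Hypothetical helper: elements that only carry a ref="..." attribute are
    skipped, and imported/included schemas are not followed.
    """
    root = ET.fromstring(xsd_text)
    names = set(
        el.get('name')
        for el in root.iter(XSD_NS + 'element')
        if el.get('name')
    )
    return sorted(names)


# Made-up schema fragment, for illustration only
sample_xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="outlays" type="xs:decimal"/>
  <xs:element name="mainAccountNumber" type="xs:string"/>
  <xs:complexType name="accountType">
    <xs:sequence>
      <xs:element ref="outlays"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

print(list_xsd_element_names(sample_xsd))
```

The resulting list could be passed to `pd.DataFrame` in the same way as the keys of `GDSClassesMapping`.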

### Fix un-matched names

The first iterations revealed some element names in the pilot files that don't exactly match their XML counterparts. Eventually, another draft of the data standard will reflect the canonical element names. For now, create a crosswalk dictionary we can use for the mapping.

In [None]:
xbrl_elements['element_original'] = xbrl_elements['element_lower']
updated_element_names = {
    'budgetAuthorityAppropriated' : 'budgetauthorityappropriatedamount',
    'mainAccountNumber' : 'mainaccountcode',
    'otherBudgetaryResources' : 'otherbudgetaryresourcesamount',
    'outlays' : 'outlayamount',
    'modificationAmendmentNumber' : 'awardmodamendmentnumber',
    'parentAwardID' : 'parentawardnumber',
    #'typeOfAction' : 'reasonformodification',
    'periodOfPerformanceActionDate' : 'ActionDateDay',
    #'periodOfPerformanceActionDate': 'ActionDateMonth',
    #'periodOfPerformanceActionDate' : 'ActionDateYear',
    'typeOfTransactionCode' : 'assistancetype',
    'totalFundingAmount' : 'currenttotalfundingobligationamount',
    'awardingAgency' : 'awardingagencyname',
    'awardingSubTierAgency' : 'awardingsubtieragencyname',
    'catalogOfFederalDomesticAssistanceTitle' : 'cfda_description',
    'catalogOfFederalDomesticAssistanceNumber' : 'cfda_code',
    'periodOfPerformanceStartDate' : 'periodofperfstartday',
    'periodOfPerformanceCurrentEndDate' : 'periodofperfcurrentendday',
    'periodOfPerformancePotentialEndDate' : 'periodofperfpotentialendday',
    'primaryPlaceOfPerformance' : 'placeofperfcity',
    'awardeeUniqueIdentifier' : 'recipientdunsnumber',
    'awardeeAddress' : 'recipientlegalentityaddressstreet1',
    'awardeeLegalBusinessName' : 'recipientlegalentityname',
    'highlyCompensatedOfficer' : 'HighCompOfficer1FirstName',
    #'ultimateParentUniqueIdentifier' : 'parentawardnumber'
}
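for key, new_name in updated_element_names.items():
    # Assign through .loc to avoid pandas chained-assignment problems.
    # Lowercase the new name because the merge key (element_lower) is all lowercase.
    xbrl_elements.loc[xbrl_elements['element_xbrl'] == key, 'element_lower'] = new_name.lower()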
xbrl_elements
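
The crosswalk can also be applied without a loop. The sketch below illustrates the idea on a made-up three-row frame; `toy` and `toy_crosswalk` are hypothetical names, not part of the notebook's data. It relies on `Series.map` returning NaN for names that are missing from the dictionary, so `fillna` can fall back to the plain lowercased name.

```python
import pandas as pd

# Made-up stand-in for xbrl_elements
toy = pd.DataFrame({'element_xbrl': ['outlays', 'mainAccountNumber', 'agency']})
toy['element_lower'] = toy['element_xbrl'].str.lower()

# Made-up stand-in for updated_element_names
toy_crosswalk = {
    'outlays': 'outlayamount',
    'mainAccountNumber': 'mainaccountcode',
}

# Names in the crosswalk get the pilot name; all others keep the lowercased default
toy['element_lower'] = (
    toy['element_xbrl'].map(toy_crosswalk).fillna(toy['element_lower'])
)
print(toy)
```

Only the first two rows change; `agency` is not in the crosswalk and keeps its lowercased name.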


### For each of the four CSVs, match element names to their XML counterparts

In [None]:
#appropriations
approp = pd.read_csv('../schema/appropriation.csv')
approp_elements = approp[['elementMappingName']]
approp_elements = approp_elements.rename(columns = {'elementMappingName':'element'})
approp_elements['element_lower'] = approp_elements['element'].str.lower()
xbrl_mapping_approp = pd.merge(approp_elements, xbrl_elements, how='left', on='element_lower')
xbrl_mapping_approp = xbrl_mapping_approp.fillna({'element_xbrl': ''})
xbrl_mapping_approp[['element', 'element_xbrl']]

In [None]:
#appropriations/object class/program activity
obj_class_pgm_activity = pd.read_csv('../schema/object_class_program_activity.csv')
ocpa_elements = obj_class_pgm_activity[['elementMappingName']]
ocpa_elements = ocpa_elements.rename(columns = {'elementMappingName':'element'})
ocpa_elements['element_lower'] = ocpa_elements['element'].str.lower()
xbrl_mapping_ocpa = pd.merge(ocpa_elements, xbrl_elements, how='left', on='element_lower')
xbrl_mapping_ocpa = xbrl_mapping_ocpa.fillna({'element_xbrl': ''})
xbrl_mapping_ocpa[['element', 'element_xbrl']]

In [None]:
#award financial
award_financial = pd.read_csv('../schema/award_financial.csv')
award_financial_elements = award_financial[['elementMappingName']]
award_financial_elements = award_financial_elements.rename(columns = {'elementMappingName':'element'})
award_financial_elements['element_lower'] = award_financial_elements['element'].str.lower()
xbrl_mapping_award_financial = pd.merge(award_financial_elements, xbrl_elements, how='left', on='element_lower')
xbrl_mapping_award_financial = xbrl_mapping_award_financial.fillna({'element_xbrl': ''})
xbrl_mapping_award_financial[['element', 'element_xbrl']]

In [None]:
#award
award = pd.read_csv('../schema/award.csv')
award_elements = award[['elementMappingName']]
award_elements = award_elements.rename(columns = {'elementMappingName':'element'})
award_elements['element_lower'] = award_elements['element'].str.lower()
xbrl_mapping_award = pd.merge(award_elements, xbrl_elements, how='left', on='element_lower')
xbrl_mapping_award = xbrl_mapping_award.fillna({'element_xbrl': ''})
pd.set_option('display.max_rows', 1000)
xbrl_mapping_award[['element', 'element_xbrl']]
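
The four cells above repeat the same steps, and unmatched pilot elements show up only as blank `element_xbrl` values. One possible way to factor this out and flag the unmatched rows explicitly is sketched below. `map_to_xbrl`, `toy_xbrl`, and the sample element names are hypothetical and used only for illustration; the sketch assumes the XBRL frame has `element_xbrl` and `element_lower` columns, as `xbrl_elements` does, and uses the `indicator` option of `pd.merge` to record whether each row found a match.

```python
import pandas as pd


def map_to_xbrl(pilot_names, xbrl_elements):
    """Left-join pilot element names to XBRL elements on the lowercased name.

    Hypothetical helper: returns one row per pilot element with a boolean
    'mapped' column showing whether an XBRL counterpart was found.
    """
    pilot = pd.DataFrame({'element': list(pilot_names)})
    pilot['element_lower'] = pilot['element'].str.lower()
    merged = pd.merge(pilot, xbrl_elements, how='left',
                      on='element_lower', indicator=True)
    # '_merge' is 'both' when the pilot name matched an XBRL element
    merged['mapped'] = merged['_merge'] == 'both'
    merged['element_xbrl'] = merged['element_xbrl'].fillna('')
    return merged[['element', 'element_xbrl', 'mapped']]


# Made-up stand-in for xbrl_elements
toy_xbrl = pd.DataFrame({
    'element_xbrl': ['outlays', 'agency'],
    'element_lower': ['outlayamount', 'agency'],
})

result = map_to_xbrl(['OutlayAmount', 'Agency', 'FiscalYear'], toy_xbrl)
unmapped = result.loc[~result['mapped'], 'element'].tolist()
print(unmapped)
```

In the notebook, the first argument could be the `elementMappingName` column of any of the four pilot CSVs.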

### Save mapping results

In [None]:
xbrl_mapping_approp.to_csv('xbrl_mapping_approp.csv', index = False)
xbrl_mapping_ocpa.to_csv('xbrl_mapping_ocpa.csv', index = False)
xbrl_mapping_award_financial.to_csv('xbrl_mapping_award_financial.csv', index = False)
xbrl_mapping_award.to_csv('xbrl_mapping_award.csv', index = False)