# Mapping Pilot Files to Draft DATA Act Standard

This is an exercise in mapping data elements from the [pilot's input files](https://github.com/18F/data-act-pilot/tree/master/schema) to the [draft DATA Act Standard](http://fedspendingtransparency.github.io/data-exchange-standard/).

1. Get a list of data elements from the XML version of the draft DATA Act Standard (i.e., a series of .xsd files).
2. For each of the four input files used in this pilot, match the column names to their XML counterparts.
3. Output the mappings, and show which pilot data elements don't map to the XML version.

## Setup

For this notebook to work, you will need:

### Schema File
The "palette" DATA Act schema .xsd file that includes financial, awards, and assistance awards information. This file is included in the [_Schema instance (XML version)_ download](http://fedspendingtransparency.github.io/assets/docs/DATA_Act_Schema_v0.5.zip) and is named _plt/case-finassist-ussglfin/da-palette-finassist-ussglfin-2015-06-29.xsd_

**Note**: The repo contains a copy of the .xsd files in schema/xbrl, but this will have to be updated when a new draft of the standard is released.

### generateDS Python Package
The [generateDS Python package](http://pythonhosted.org/generateDS/) generates Python class definitions from XML schema documents. For this exercise, we don't really need the classes themselves because we're not parsing data. However, each generated module includes a handy mapping (`GDSClassesMapping`) that lists the data elements in the .xsd. We use this shortcut to get a list of the XBRL elements so we can compare them to the pilot input files.

**Note**: So far, our research suggests that the ecosystem of open source XML schema/XBRL tools is weak and outdated.

For example, installing generateDS using pip (from either PyPI or Bitbucket) doesn't actually run setup.py. Here is the series of manual steps I used to install it:
1. download the project .tar
1. extract the .tar into your virtualenv's site-packages folder
1. from your virtualenv's site-packages folder:
    * ```cd generateDS-2.17a0```
    * ```python setup.py build```
    * ```python setup.py install```

### Read the .xsd schemas and create Python objects

These commands send the DATA Act gen, ussglfin, award, finassist, and procurement .xsd schemas to generateDS and output a corresponding Python module of class definitions for each. Change the commands to reflect the location of generateDS.py on your system.

In [None]:
!python ~/Dev/.virtualenvs/intercessor/lib/python2.7/site-packages/generateDS-2.17a0/generateDS.py -o gen.py -qf --no-dates --no-versions ../schema/xbrl/gen/da-gen-2015-06-29.xsd
!python ~/Dev/.virtualenvs/intercessor/lib/python2.7/site-packages/generateDS-2.17a0/generateDS.py -o ussglfin.py -qf --no-dates --no-versions ../schema/xbrl/ussglfin/da-ussglfin-content-2015-06-29.xsd
!python ~/Dev/.virtualenvs/intercessor/lib/python2.7/site-packages/generateDS-2.17a0/generateDS.py -o award.py -qf --no-dates --no-versions ../schema/xbrl/award/da-award-content-2015-06-29.xsd
!python ~/Dev/.virtualenvs/intercessor/lib/python2.7/site-packages/generateDS-2.17a0/generateDS.py -o finassist.py -qf --no-dates --no-versions ../schema/xbrl/finassist/da-finassist-content-2015-06-29.xsd
!python ~/Dev/.virtualenvs/intercessor/lib/python2.7/site-packages/generateDS-2.17a0/generateDS.py -o procurement.py -qf --no-dates --no-versions ../schema/xbrl/procurement/da-procurement-content-2015-06-29.xsd
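
The five commands above differ only in the output module name and the .xsd path, so they can also be built in a loop. The following is a minimal sketch, not part of the original notebook: the helper names `build_commands` and `run_all` are hypothetical, and the flags and schema paths are copied from the commands above.

```python
import subprocess

# (output module name, .xsd path relative to the schema root), in the order used above
SCHEMAS = [
    ('gen', 'gen/da-gen-2015-06-29.xsd'),
    ('ussglfin', 'ussglfin/da-ussglfin-content-2015-06-29.xsd'),
    ('award', 'award/da-award-content-2015-06-29.xsd'),
    ('finassist', 'finassist/da-finassist-content-2015-06-29.xsd'),
    ('procurement', 'procurement/da-procurement-content-2015-06-29.xsd'),
]

def build_commands(generateds_path, schema_root='../schema/xbrl'):
    """Build one generateDS argument list per schema (hypothetical helper)."""
    commands = []
    for name, xsd in SCHEMAS:
        commands.append([
            'python', generateds_path,
            '-o', name + '.py',
            '-qf', '--no-dates', '--no-versions',
            schema_root + '/' + xsd,
        ])
    return commands

def run_all(generateds_path):
    """Run generateDS once per schema; raises if any command fails."""
    for command in build_commands(generateds_path):
        subprocess.check_call(command)
```

Calling `run_all` with the full path to generateDS.py on your system would then replace the five `!python` lines.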

### Use Python objects to grab XBRL element names

Use the Python modules generated in the previous step to get a list of the XML data element names and load them into dataframes.

In [None]:
import pandas as pd
import gen, finassist, award, ussglfin, procurement
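def get_fields(schema):
    """Return a dataframe of the element names defined in a generateDS module."""
    # GDSClassesMapping maps each element name to its generated Python class
    element_dict = schema.GDSClassesMapping
    xbrl_elements = pd.DataFrame(list(element_dict.items()), columns = ['xbrl_element', 'xbrl_type'])
    xbrl_elements['xbrl_element_lower'] = xbrl_elements['xbrl_element'].str.lower()
    xbrl_elements['xbrl_schema'] = schema.__name__
    # replace each class object with its name (a plain lambda avoids shadowing the built-in type)
    xbrl_elements['xbrl_type'] = xbrl_elements['xbrl_type'].map(lambda cls: cls.__name__)
    return xbrl_elements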
fields_gen = get_fields(gen)
fields_ussglfin = get_fields(ussglfin)
fields_award = get_fields(award)
fields_finassist = get_fields(finassist)
fields_procurement = get_fields(procurement)
#important - concatenate in this order so the subsequent de-dupe keeps the row that documents the correct schema
xbrl_elements = pd.concat([fields_gen, fields_ussglfin, fields_award, fields_finassist, fields_procurement])
xbrl_elements = xbrl_elements.drop_duplicates(subset='xbrl_element')
xbrl_elements.to_csv('xbrl_elements.csv', index = False)
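
The concatenation order matters because `drop_duplicates` keeps the first occurrence of each element name by default. The toy example below is an illustration only: the element names (`agencyName`, `elementA`, `elementB`) and the two small frames are made up, standing in for the per-schema dataframes built above.

```python
import pandas as pd

# Made-up stand-ins for two per-schema frames that share one element name
fields_a = pd.DataFrame({'xbrl_element': ['agencyName', 'elementA'], 'xbrl_schema': 'gen'})
fields_b = pd.DataFrame({'xbrl_element': ['agencyName', 'elementB'], 'xbrl_schema': 'award'})

# fields_a comes first, so its row for the shared element is the one that survives
combined = pd.concat([fields_a, fields_b])
deduped = combined.drop_duplicates(subset='xbrl_element')

# element name -> schema recorded after de-duplication
schema_of = dict(zip(deduped['xbrl_element'], deduped['xbrl_schema']))
```

Here `schema_of['agencyName']` is `'gen'`; reversing the list passed to `pd.concat` would record `'award'` instead.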

### Crosswalk XML and .csv element names

Some element names in the current XML-based schema don't match the element names released in late July. The names will sync up in the future. For now, create a crosswalk dictionary that we can use for the data mapping.

In [None]:
csv_xml_crosswalk = {
    'budgetauthorityappropriatedamount' : 'budgetAuthorityAppropriated',
    'mainaccountcode' : 'mainAccountNumber',
    'otherbudgetaryresourcesamount' : 'otherBudgetaryResources',
    'outlayamount' : 'outlays',
    'piidprefix' : 'awardID',
    'piidawardyear' : 'awardID',
    'piidawardtype' : 'awardID',
    'piidawardnumber' : 'awardID',
    'fainawardnumber' : 'awardID',
    'awardmodamendmentnumber' : 'modificationAmendmentNumber',
    'currenttotalfundingobligationamount' : 'totalFundingAmount', #finassist
    'currenttotalvalueawardamount' : 'currentTotalValueOfAward', #proc
    'potentialtotalvalueawardamount' : 'potentialTotalValueOfAward', #proc
    'parentawardidprefix' : 'parentAwardID',
    'parentawardyear' : 'parentAwardID',
    'parentawardtype' : 'parentAwardID',
    'parentawardnumber' : 'parentAwardID',
    'actiondateday' : 'periodOfPerformanceActionDate',
    'actiondatemonth' : 'periodOfPerformanceActionDate',
    'actiondateyear' : 'periodOfPerformanceActionDate',
    'assistancetype' : 'typeOfTransactionCode',
    'typeofcontractpricing' : 'typeOfTransactionCode',
    'idvtype' : 'typeOfTransactionCode',
    'contractawardtype' : 'typeOfTransactionCode',
    'reasonformodification' : 'typeOfAction',
    'curenttotalfundingobligationamount' : 'totalFundingAmount',
    'awardingagencyname' : 'awardingAgency', #awardingAgency = gen:agencyComplexType
    'awardingagencycode' : 'awardingAgency',
    'awardingsubtieragencycode': 'awardingSubTierAgency', #awardingSubTierAgency = gen:agencyComplexType
    'awardingsubtieragencyname' : 'awardingSubTierAgency',
    'fundingagencyname' : 'fundingAgency', #fundingAgency = gen:agencyComplexType
    'fundingagencycode' : 'fundingAgency',
    'fundingsubtieragencyname' : 'fundingSubTierAgency', #fundingSubTierAgency = gen:agencyComplexType
    'fundingsubtieragencycode' : 'fundingSubTierAgency',
    'cfda_description' : 'catalogOfFederalDomesticAssistanceTitle',
    'cfda_code' : 'catalogOfFederalDomesticAssistanceNumber',
    'periodofperfstartday' : 'periodOfPerformanceStartDate', #all dates = xbrli:dateItemType
    'periodofperfstartmonth' : 'periodOfPerformanceStartDate',
    'periodofperfstartyear' : 'periodOfPerformanceStartDate',
    'periodofperfcurrentendday' : 'periodOfPerformanceCurrentEndDate',
    'perioofperfcurrentendmonth' : 'periodOfPerformanceCurrentEndDate',
    'periodofperfcurrentendyear' : 'periodOfPerformanceCurrentEndDate',
    'periodofperfpotentialendday' : 'periodOfPerformancePotentialEndDate',
    'periodofperfpotentialendmonth' : 'periodOfPerformancePotentialEndDate',
    'periodofperfpotentialendyear' : 'periodOfPerformancePotentialEndDate',
    'orderingperiodendday' : 'periodOfPerformanceOrderingPeriodEndDate',
    'orderingperiodendmonth' :'periodOfPerformanceOrderingPeriodEndDate',
    'orderingperiodendyear': 'periodOfPerformanceOrderingPeriodEndDate',
    'recipientdunsnumber' : 'awardeeUniqueIdentifier',
    'recipientultimateparentuniqueid' : 'ultimateParentUniqueIdentifier',
    'recipientultimateparentlegalentityname' : 'ultimateParentLegalBusinessName',
    'recipientlegalentityaddressstreet1' : 'awardeeAddress', #awardeeAddress = gen:addressComplexType
    #'recipientlegalentityaddressstreet2' : 'awardeeAddress',
    'recipientlegalentitycityname' : 'awardeeAddress', 
    'recipientlegalentitystatecode' : 'awardeeAddress',
    'recipientlegalentityzip' : 'awardeeAddress',
    'recipientlegalentityzip+4' : 'awardeeAddress',
    'recipientlegalentitypostalcode' : 'awardeeAddress',
    'recipientlegalentitycongressionaldistrict' : 'awardeeAddress',
    'recipientlegalentitycountrycode' : 'awardeeAddress',
    'recipientlegalentitycountryname' : 'awardeeAddress',
    'recipientlegalentityname' : 'awardeeLegalBusinessName',
    'highcompofficer1firstname' : 'highlyCompensatedOfficerFirstName', #highlyCompensatedOfficer = award:highlyCompensatedOfficerComplexType
    'highcompofficer1middleinitial' : 'highlyCompensatedOfficerMiddleInitial',
    'highcompofficer1lastname' : 'highlyCompensatedOfficerLastName',
    'highcompofficer1amount' : 'highlyCompensatedOfficerCompensation', 
    'highcompofficer2firstname' : 'highlyCompensatedOfficerFirstName', #highlyCompensatedOfficer = award:highlyCompensatedOfficerComplexType
    'highcompofficer2middleinitial' : 'highlyCompensatedOfficerMiddleInitial',
    'highcompofficer2lastname' : 'highlyCompensatedOfficerLastName',
    'highcompofficer2amount' : 'highlyCompensatedOfficerCompensation',
    'highcompofficer3firstname' : 'highlyCompensatedOfficerFirstName', #highlyCompensatedOfficer = award:highlyCompensatedOfficerComplexType
    'highcompofficer3middleinitial' : 'highlyCompensatedOfficerMiddleInitial',
    'highcompofficer3lastname' : 'highlyCompensatedOfficerLastName',
    'highcompofficer3amount' : 'highlyCompensatedOfficerCompensation', 
    'highcompofficer4firstname' : 'highlyCompensatedOfficerFirstName', #highlyCompensatedOfficer = award:highlyCompensatedOfficerComplexType
    'highcompofficer4middleinitial' : 'highlyCompensatedOfficerMiddleInitial',
    'highcompofficer4lastname' : 'highlyCompensatedOfficerLastName',
    'highcompofficer4amount' : 'highlyCompensatedOfficerCompensation', 
    'highcompofficer5firstname' : 'highlyCompensatedOfficerFirstName', #highlyCompensatedOfficer = award:highlyCompensatedOfficerComplexType
    'highcompofficer5middleinitial' : 'highlyCompensatedOfficerMiddleInitial',
    'highcompofficer5lastname' : 'highlyCompensatedOfficerLastName',
    'highcompofficer5amount' : 'highlyCompensatedOfficerCompensation',
    'placeofperfcity' : 'primaryPlaceOfPerformance',
    'placeofperfstate' : 'primaryPlaceOfPerformance',
    'placeofperfcounty' : 'primaryPlaceOfPerformance',
    'placeofperfzip+4' :'primaryPlaceOfPerformance',
    'placeofperfcongressionaldistrict' : 'primaryPlaceOfPerformance', 
    'placeofperfcountryname' : 'primaryPlaceOfPerformance',
    'naics_code' : 'naicsNumber',
    'naics_description' : 'naicsDescription'
}
def map_csv(csv_df, xbrl_df = xbrl_elements, crosswalk = csv_xml_crosswalk):
    # copy the one column we need so later assignments don't write to a slice of the caller's dataframe
    csv_df = csv_df[['elementMappingName']].copy()
    csv_df = csv_df.rename(columns = {'elementMappingName':'csv_element'})
    csv_df['csv_element_lower'] = csv_df['csv_element'].str.lower()
    # first pass: case-insensitive match on element name
    csv_df = pd.merge(
        csv_df,
        xbrl_df[['xbrl_element_lower', 'xbrl_element']],
        how='left', left_on='csv_element_lower',
        right_on='xbrl_element_lower')
    # second pass: crosswalk entries override the name match
    for key in crosswalk:
        csv_df.loc[csv_df['csv_element_lower'] == key, 'xbrl_element'] = crosswalk[key]
    # look up type and schema for each matched element (from xbrl_df, not the global dataframe)
    csv_df = pd.merge(
        csv_df,
        xbrl_df[['xbrl_element', 'xbrl_type', 'xbrl_schema']],
        how='left', on='xbrl_element')
    csv_df = csv_df.fillna('')
    csv_df = csv_df[['csv_element', 'xbrl_element', 'xbrl_type', 'xbrl_schema']]
    return csv_df
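
To make the matching rule in `map_csv` easier to follow, here is a plain-Python sketch of the same logic for a single column name: a crosswalk entry wins if one exists, otherwise the name is matched case-insensitively against the XML element names. The function `resolve_element` is a hypothetical helper, and the sample inputs are illustrative rather than taken from the pilot files.

```python
def resolve_element(csv_name, xbrl_names, crosswalk):
    """Return the XML element name for one pilot column name, or '' if none matches."""
    key = csv_name.lower()
    # crosswalk entries take precedence, as in map_csv
    if key in crosswalk:
        return crosswalk[key]
    # otherwise fall back to a case-insensitive match on the element name
    by_lower = {name.lower(): name for name in xbrl_names}
    return by_lower.get(key, '')

# Small subset of the crosswalk above, plus a few XML element names
sample_crosswalk = {'outlayamount': 'outlays', 'piidprefix': 'awardID'}
sample_xbrl_names = ['outlays', 'awardID', 'modificationAmendmentNumber']

from_crosswalk = resolve_element('OutlayAmount', sample_xbrl_names, sample_crosswalk)
from_name_match = resolve_element('ModificationAmendmentNumber', sample_xbrl_names, sample_crosswalk)
no_match = resolve_element('SomeUnmatchedColumn', sample_xbrl_names, sample_crosswalk)
```

The dataframe version does the same thing for a whole file at once, using a left merge for the name match and then overwriting rows that have a crosswalk entry.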

### For each of the four csvs, match element names to XML counterpart

In [None]:
#appropriations
approp = pd.read_csv('../schema/appropriation.csv')
xbrl_mapping_approp = map_csv(approp)
xbrl_mapping_approp

In [None]:
#appropriations/object class/program activity
obj_class_pgm_activity = pd.read_csv('../schema/object_class_program_activity.csv')
xbrl_mapping_ocpa = map_csv(obj_class_pgm_activity)
xbrl_mapping_ocpa

In [None]:
#award financial
award_financial = pd.read_csv('../schema/award_financial.csv')
xbrl_mapping_award_financial = map_csv(award_financial)
xbrl_mapping_award_financial

In [None]:
#award (named award_df so it doesn't shadow the imported award module)
award_df = pd.read_csv('../schema/award.csv')
xbrl_mapping_award = map_csv(award_df)
xbrl_mapping_award
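
Because `map_csv` fills missing values with empty strings, the pilot elements that don't map to the XML version are the rows where `xbrl_element` is `''`. A minimal sketch for listing them follows; `unmapped_elements` is a hypothetical helper, and the sample dataframe is made up to mimic the shape of a `map_csv` result.

```python
import pandas as pd

def unmapped_elements(mapping_df):
    """Return the csv_element values that have no XML counterpart."""
    missing = mapping_df['xbrl_element'] == ''
    return mapping_df.loc[missing, 'csv_element'].tolist()

# Made-up sample with the same columns map_csv returns
sample_mapping = pd.DataFrame({
    'csv_element': ['OutlayAmount', 'SomeUnmatchedColumn'],
    'xbrl_element': ['outlays', ''],
    'xbrl_type': ['', ''],
    'xbrl_schema': ['', ''],
})

unmapped = unmapped_elements(sample_mapping)
```

The same call could be applied to each of the four mapping dataframes above, for example `unmapped_elements(xbrl_mapping_approp)`.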

### Save mapping results

In [None]:
xbrl_mapping_approp.to_csv('xbrl_mapping_approp.csv', index = False)
xbrl_mapping_ocpa.to_csv('xbrl_mapping_ocpa.csv', index = False)
xbrl_mapping_award_financial.to_csv('xbrl_mapping_award_financial.csv', index = False)
xbrl_mapping_award.to_csv('xbrl_mapping_award.csv', index = False)