# Normalize and Map NPPES Outputs
We will need to recreate the NPPES Public Use Files (PUFs) as a product of NPD, so the core data model must support, at a minimum, the entities and data elements that are present in those files. This Jupyter notebook is an attempt to normalize the NPPES open data files and parse key entities and data elements, while mapping the normalized data elements back to the original CSV fields.

In [1]:
# Note: I did not containerize this script due to its scratch nature, so you may have to uncomment the line below to install xlsxwriter
# !pip install xlsxwriter

In [2]:
import pandas as pd
import os
import warnings
import numpy as np

warnings.simplefilter(action='ignore', category=pd.errors.DtypeWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

### Sample Data
Note: I manually downloaded and unzipped the weekly incremental files [here](https://download.cms.gov/nppes/NPI_Files.html) and put them in the `sample_data` directory. These let us review the structure of the CSVs and give us an easily manageable chunk of data, so that we can understand which entities and elements pertain to organizations and which pertain to individuals.

In [3]:
work_dir = 'sample_data'

### Read and Transpose the Sample Data CSVs
For each of the files included in the NPPES PUF output (npidata, endpoint, othername, and pl), we do the following:
1. Read the file
2. Use NPI entity type code information to separate each file into a dataframe for individual records and a dataframe for organization records
3. Remove the fields that are not relevant to each of those categories (dropping the fields that are entirely populated with null values in the sample data)
4. Transpose the fields so that all the field names appear in a column instead of as headers
5. Union the individual and organization dataframes (using pd.concat) and populate the filename and original sort order of the fields

Once this process has been completed for each of the sample data files, the resulting dataframes are all unioned, so there is a single dataframe that contains all the fields across all the files and both core entity types (individual and organization).

In [4]:
code_to_type = {1:'individual', 2:'organization'}
concatted_dfs = []
for filename in ['npidata', 'endpoint', 'othername', 'pl']:
    file = [f for f in os.listdir(work_dir) if filename in f][0]
    file_df = pd.read_csv(os.path.join(work_dir,file))
    if filename == 'npidata':
        npis_with_types = file_df[['NPI', 'Entity Type Code']]
    else:
        file_df = file_df.merge(npis_with_types, on = 'NPI')
    field_list = []
    for code in code_to_type.keys():
        subset = file_df.loc[file_df['Entity Type Code']==code].dropna(axis=1, how = 'all')
        transposed_fields = subset.T.reset_index(drop=False).rename(columns = {'index':'original_field'})[['original_field']]
        transposed_fields[code_to_type[code]] = True
        if filename != 'npidata':
            transposed_fields = transposed_fields.loc[~(transposed_fields['original_field']=='Entity Type Code')]
        transposed_fields.set_index('original_field', inplace=True)
        field_list.append(transposed_fields)
    concatted_df = pd.concat(field_list, axis=1)
    concatted_df['file'] = filename
    sort_order_df = {val:i for i,val in enumerate(file_df.columns)}
    concatted_df['original_sort_order'] = [sort_order_df[i] for i in concatted_df.index]
    concatted_dfs.append(concatted_df)
all_fields_df = pd.concat(concatted_dfs)
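To make steps 2 through 5 concrete, here is a minimal sketch on a tiny in-memory dataframe. The three rows and the `sample` and `field_flags` names are made up for illustration and are not part of the NPPES data or of this notebook's variables. It builds the per-type flag frames directly from the surviving column names rather than transposing, which should give the same field list.

```python
import pandas as pd

# Hypothetical three-row sample standing in for one NPPES file
sample = pd.DataFrame({
    "NPI": [1, 2, 3],
    "Entity Type Code": [1, 1, 2],
    "Provider First Name": ["Ann", "Bo", None],
    "Provider Organization Name (Legal Business Name)": [None, None, "Acme Health"],
})

code_to_type = {1: "individual", 2: "organization"}
flags = []
for code, label in code_to_type.items():
    # Keep the rows of one entity type, then drop columns that are entirely null
    subset = sample.loc[sample["Entity Type Code"] == code].dropna(axis=1, how="all")
    # One row per surviving field name, flagged True for this entity type
    index = pd.Index(subset.columns, name="original_field")
    flags.append(pd.DataFrame({label: True}, index=index))

# Outer join on field name: NaN means the field was never populated for that type
field_flags = pd.concat(flags, axis=1)
print(field_flags)
```

In this toy case the first-name field is flagged only for individuals and the organization-name field only for organizations, while `NPI` and `Entity Type Code` are flagged for both.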


### Standardize Field Names
There are a number of inconsistencies in the field names of the original CSVs, which makes it difficult to identify common data elements. This mapping takes a naive approach to parsing the underlying data elements, applying text splitting and substitution to remove unnecessary words, characters, and capitalization (e.g. `"Provider Business Mailing Address Postal Code"` is converted to `"business_mailing_address_postal_code"`).

In [5]:
all_fields_df['data_element'] = list(pd.Series(all_fields_df.index).apply(lambda x: x.replace('Provider','').replace('Healthcare','').replace('Other','').split('(')[0].replace(' - ',' ').strip().split('_')[0].replace(' ','_').replace('__','_').lower()))
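For readability, the chained one-liner above can be unpacked into a named function. This is only a sketch: `normalize_field_name` is a hypothetical name, and the steps are intended to mirror the lambda in the same order.

```python
def normalize_field_name(field: str) -> str:
    """Reduce an NPPES CSV header to a snake_case data element name."""
    # Remove filler words that do not distinguish one data element from another
    for word in ("Provider", "Healthcare", "Other"):
        field = field.replace(word, "")
    field = field.split("(")[0]                # drop parenthetical qualifiers
    field = field.replace(" - ", " ").strip()  # flatten " - " separators
    field = field.split("_")[0]                # drop repeat suffixes such as "_1"
    return field.replace(" ", "_").replace("__", "_").lower()


for header in [
    "Provider Business Mailing Address Postal Code",
    "Provider License Number_1",
    "Healthcare Provider Taxonomy Code_1",
]:
    print(header, "->", normalize_field_name(header))
```

Because the repeat suffix is dropped, `"Provider License Number_1"` and `"Provider License Number_2"` collapse to the same data element, which is what the cardinality step later relies on.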

### Handle Addresses
The NPPES PUFs contain several address fields representing different types of addresses (e.g. `Mailing Address`, `Practice Location Address`). At their core, these are all just addresses. Since every address has a postal code, we identify all the postal code fields and remove the text `"_postal_code"`, so that the remaining data element text is the address type (e.g. `"business_mailing_address_postal_code"` yields the address type `"business_mailing_address"`). The address type is then populated for every data element whose name contains one of the identified address types, and the address type text is removed from the data element (e.g. `"business_mailing_address_postal_code"` becomes the data element `"postal_code"`).

In [6]:
address_types = [de.replace('_postal_code','') for de in all_fields_df['data_element'] if 'postal_code' in de and de!='postal_code']
all_fields_df['address_type'] = all_fields_df['data_element'].apply(lambda x: max([at for at in address_types if at in x], default=''))
all_fields_df['data_element'] = [row['data_element'].replace(row["address_type"],'').rstrip('_-').lstrip('_-') for x,row in all_fields_df.iterrows()]
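The same splitting logic can be tried on a handful of strings without a dataframe. In this sketch the `data_elements` list is a made-up sample and `split_address` is a hypothetical helper; it follows the cell above in using a plain `max` over the matching address types.

```python
# Made-up sample of already-normalized data element names
data_elements = [
    "business_mailing_address_postal_code",
    "business_mailing_address_city_name",
    "first_line_business_mailing_address",
    "last_name",
]

# An address type is whatever remains once "_postal_code" is removed
address_types = sorted({
    de.replace("_postal_code", "")
    for de in data_elements
    if "postal_code" in de and de != "postal_code"
})


def split_address(data_element, address_types):
    """Return (address_type, remaining data element); address_type is '' if none matches."""
    address_type = max((at for at in address_types if at in data_element), default="")
    remainder = data_element.replace(address_type, "").strip("_-")
    return address_type, remainder


for de in data_elements:
    print(de, "->", split_address(de, address_types))
```

Non-address elements such as `last_name` pass through unchanged with an empty address type, which is what the later clean-up step uses to tell address fields from everything else.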

### Additional Field Clean Up
Some data elements need additional text-based clean up. Here, we standardize the address line data elements (which can appear as "line_1", "line_one", or "first_line", for example), and also parse entities and sub-entities for the following entity types: `['name', 'license_number', 'number', 'identifier', 'taxonomy', 'npi', 'endpoint', 'credential']`, based on whether those entity types appear in the data_element name.

In [7]:
entity_types = []
subtypes = []
for i, row in all_fields_df.iterrows():
    entity_type = np.nan
    subtype = np.nan
    data_element = row['data_element']
    if row['address_type'] != '':
        all_fields_df.loc[i,'data_element'] = data_element.replace('_name','')
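        # Require 'line' so that e.g. 'telephone_number' (which contains 'one') is not treated as an address line
        if 'line' in data_element and ('one' in data_element or '1' in data_element or 'first' in data_element):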
            all_fields_df.loc[i,'data_element'] = 'line_1'
        elif 'two' in data_element or '2' in data_element or 'second' in data_element:
            all_fields_df.loc[i,'data_element'] = 'line_2'
        entity_type = 'address'
        subtype = row['address_type']
    elif 'authorized_official' in data_element:
        entity_type = 'authorized_official'
        for entity in ['name', 'license_number', 'number', 'identifier', 'taxonomy', 'npi', 'endpoint', 'credential']:
            if entity in data_element:
                if entity != 'number':
                    subtype = entity
                else:
                    if 'identification' not in data_element:
                        subtype = 'contact_number'
                break
        all_fields_df.loc[i,'data_element'] = data_element.replace('authorized_official_','')
    elif row['file'] == 'endpoint' and i!='NPI':
        entity_type = 'endpoint'
    elif data_element in ['entity_type_code', 'enumeration_date', 'certification_date', 'last_updated_date']:
        entity_type = 'npi'
    else:
        for entity in ['name', 'license_number', 'number', 'identifier', 'taxonomy', 'npi', 'endpoint', 'credential']:
            if entity in data_element:
                if entity != 'number':
                    entity_type = entity
                else:
                    if 'identification' not in data_element:
                        entity_type = 'contact_number'
                        subtype = data_element.replace('_number','')
                break
    entity_types.append(entity_type)
    subtypes.append(subtype)
all_fields_df['entity_type'] = entity_types
all_fields_df['entity_subtype'] = subtypes
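Substring tests on field names are sensitive to incidental matches inside longer words; for instance, the string `'telephone_number'` contains `'one'`. A token-based check is one way to guard against that. The sketch below uses a hypothetical helper, `address_line_number`, that is not part of this notebook and assumes address-line elements always include the token `line`.

```python
def address_line_number(data_element):
    """Return 1 or 2 for address-line data elements, otherwise None."""
    # Compare whole underscore-separated tokens instead of raw substrings
    tokens = set(data_element.split("_"))
    if "line" not in tokens:
        return None
    if tokens & {"1", "one", "first"}:
        return 1
    if tokens & {"2", "two", "second"}:
        return 2
    return None


for de in ["first_line", "line_two", "address_line_1", "telephone_number"]:
    print(de, "->", address_line_number(de))
```

The same idea could be applied to the entity keyword list, where the order of the keywords matters because `'number'` is also a substring of `'license_number'`.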

### Separate Individual- and Organization-Related Entities
To understand the relationships between the entities identified above and the core entities of individuals and organizations, we associate each entity and data element with individuals, organizations, or both, repeating the data elements and entities that pertain to both.

In [8]:
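# .copy() gives each subset its own data, so adding the 'type' column does not write to a slice of all_fields_df
individual_fields = all_fields_df.loc[all_fields_df['individual'].fillna(False)].copy()
individual_fields['type'] = 'individual'
organization_fields = all_fields_df.loc[all_fields_df['organization'].fillna(False)].copy()
organization_fields['type'] = 'organization'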
combined_fields = pd.concat([individual_fields, organization_fields])[['file','original_sort_order','type','entity_type','entity_subtype','data_element']]



### Calculate Cardinality
Based on the flattened structure of the sample data CSVs, in which each row represents an individual or organization and columns are repeated to represent one-to-many relationships between an individual or an organization and another entity (e.g. `"Provider License Number_1"`, `"Provider License Number_2"`, etc.), we count the occurrences of each data element across types (individuals or organizations) and entity types (e.g. address, name), assigning a cardinality of "many" if the fields are repeated and "one" if the fields only occur once per type. The NPI field is repeated across files because it serves as a foreign key between files and not because an individual or organization can have many NPIs, so we manually populate NPI as having a "one" cardinality.

In [9]:
data_element_counts = combined_fields.groupby(['type','entity_type'])['data_element'].value_counts().groupby(['type','entity_type']).max()
combined_fields['cardinality'] = ['many' if (row['type'],row['entity_type']) in data_element_counts.index and data_element_counts.loc[row['type'],row['entity_type']]>1 else 'one' for i,row in combined_fields.iterrows() ]
combined_fields.loc[combined_fields['entity_type']=='npi','cardinality'] = 'one'
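The counting rule can be illustrated without pandas. The sketch below uses plain Python and four made-up rows of `(type, entity_type, data_element)`; it is meant to reproduce the same idea as the cell above (the largest repeat count within each type and entity type decides the cardinality), not the exact dataframe operations.

```python
from collections import Counter

# Made-up rows: (type, entity_type, data_element)
rows = [
    ("individual", "license_number", "license_number"),  # e.g. from "..._1"
    ("individual", "license_number", "license_number"),  # e.g. from "..._2"
    ("individual", "name", "first_name"),
    ("individual", "name", "last_name"),
]

# How many source fields map to each data element
counts = Counter(rows)

# Largest repeat count observed within each (type, entity_type) pair
max_repeats = {}
for (type_, entity_type, _), n in counts.items():
    key = (type_, entity_type)
    max_repeats[key] = max(max_repeats.get(key, 0), n)

cardinality = {key: "many" if n > 1 else "one" for key, n in max_repeats.items()}
print(cardinality)
```

Here the license number entity comes out as "many" because its data element is repeated, while the name entity is "one" because each of its data elements appears only once.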

### Output the Result
We create an Excel spreadsheet with two sheets: 
* "Normalized Data Elements," which has each data element only listed once, irrespective of the source field(s) that map to it 
* "Raw Mapping," which preserves the mapping between data elements and source field(s)

In [10]:
normalized_data_elements = combined_fields.drop_duplicates(subset = ['type','entity_type','data_element'])[['type','entity_type','cardinality','data_element']].sort_values(['type','entity_type','data_element'])
raw_mapping = combined_fields.reset_index(drop=False)[['file', 'original_field', 'original_sort_order', 'type', 'entity_type', 'cardinality', 'entity_subtype', 'data_element']]
dfs = {'Normalized Data Elements':normalized_data_elements, 'Raw Mapping':raw_mapping}
with pd.ExcelWriter('nppes_data_elements.xlsx', engine='xlsxwriter') as writer:
    for df_name in dfs.keys():
        df = dfs[df_name]
        df.to_excel(writer, sheet_name=df_name, index=False)
        worksheet = writer.sheets[df_name]
        for i, col in enumerate(df.columns):
            column_len = max(df[col].astype(str).map(len).max(), len(col)) + 1
            worksheet.set_column(i, i, column_len)