# FHIR for Research Workshop

## Exercise 1 

Intro: see https://github.com/NIH-NCPI/fhir-101/blob/master/FHIR%20101%20-%20Practical%20Guide.ipynb for a great example of an introduction.

## What is this notebook?

(common overview of the FHIR Training)

(overview of this specific notebook)




### Icons in this Guide
 📘 A link to a useful external reference related to the section the icon appears in  

 ⚡️ A key takeaway for the section that this icon appears in  

 🖐 A hands-on section where you will code something or interact with the server  


(any required MITRE legalese should either go here or at the very bottom of the notebook)

## Motivation / Purpose

## Scenario

(this section describes the specifics of the use case: what is the problem statement, what is the basic approach we are going to take, etc)


## Initial Setup

In [1]:
# import any required libraries here.
#  - requests
#  - fhirclient: https://github.com/smart-on-fhir/client-py
#  - Pandas - DataFrames
#  - NumPy - basic data analysis
#  - matplotlib
#  - maybe seaborn for viz on top of matplotlib ?
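
Before importing anything, it can help to confirm which of the libraries listed above are installed. The sketch below uses only the standard library; the `LIBRARIES` list and the `check_libraries` helper are illustrative names, not part of any of these packages.

```python
from importlib.util import find_spec

# Import names of the libraries listed above (assumed to match the pip names).
LIBRARIES = ["requests", "fhirclient", "pandas", "numpy", "matplotlib", "seaborn"]


def check_libraries(names):
    """Return a dict mapping each import name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_libraries(LIBRARIES).items():
        print(f"{name}: {'ok' if available else 'missing'}")
```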

## Step 1 Connect to Client

Connect to the source FHIR server from which the data will be extracted.
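
One way to check the connection is to request the server's CapabilityStatement, which every FHIR server exposes at `[base]/metadata`. The sketch below uses only the standard library so it has no dependencies; the `requests` library listed in Initial Setup would work equally well. The base URL is an assumption (a public HAPI FHIR test server) and should be replaced with the workshop's server. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: a public HAPI FHIR R4 test server; swap in the workshop's server.
FHIR_BASE_URL = "https://hapi.fhir.org/baseR4"


def metadata_url(base_url):
    """Build the URL of the server's CapabilityStatement endpoint."""
    return base_url.rstrip("/") + "/metadata"


def fetch_capability_statement(base_url, timeout=30):
    """GET [base]/metadata and return the parsed CapabilityStatement."""
    request = urllib.request.Request(
        metadata_url(base_url),
        headers={"Accept": "application/fhir+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_capability_statement(FHIR_BASE_URL)` needs network access. If it succeeds, the returned dictionary's `fhirVersion` field shows which FHIR release the server speaks.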

## Step 2 Query Data

Submit a query to the source server, retrieve the results, and save them locally.
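
A FHIR search returns a Bundle, and large result sets are split into pages linked by a `next` relation in `Bundle.link`. The helpers below are a minimal sketch of the three pieces this step needs: building a search URL, finding the next page, and saving a Bundle to disk. All function names are hypothetical, and the HTTP request itself is left out.

```python
import json
from urllib.parse import urlencode


def search_url(base_url, resource_type, params=None):
    """Build a FHIR search URL such as [base]/Patient?_count=50."""
    url = f"{base_url.rstrip('/')}/{resource_type}"
    if params:
        url += "?" + urlencode(params)
    return url


def next_page_url(bundle):
    """Return the URL of the next page of a searchset Bundle, or None."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def save_bundle(bundle, path):
    """Write a Bundle to disk as JSON so later steps can read it back."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(bundle, f, indent=2)
```

A download loop would fetch `search_url(...)`, save the result, and keep following `next_page_url(...)` until it returns `None`.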

## Step 3 Mount Data onto Pandas Dataframe

Take the FHIR-formatted data and convert it to a pandas DataFrame for subsequent analysis.

The fhiry library looks like a good fit for this step: https://github.com/dermatologist/fhiry
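
The core of the conversion is `pandas.json_normalize`, which flattens nested dictionaries into dotted column names. The sketch below applies it to a small hand-written Bundle, so the entries and identifiers are made up for illustration. It assumes pandas 1.0 or later, where `json_normalize` is available at the top level.

```python
import pandas as pd

# A hand-written, heavily trimmed Bundle used only for illustration.
example_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {
            "fullUrl": "urn:uuid:example-patient",
            "resource": {"resourceType": "Patient", "id": "example-patient"},
            "request": {"method": "POST", "url": "Patient"},
        },
        {
            "fullUrl": "urn:uuid:example-observation",
            "resource": {
                "resourceType": "Observation",
                "id": "example-observation",
                "valueQuantity": {"value": 117.9, "unit": "cm"},
            },
            "request": {"method": "POST", "url": "Observation"},
        },
    ],
}

# Nested dictionaries become dotted column names; one row per Bundle entry.
example_df = pd.json_normalize(example_bundle["entry"])
print(sorted(example_df.columns))
```

Fields that only some resources carry, such as `resource.valueQuantity.unit`, become columns that are `NaN` for every other row. That is why the full DataFrame later in this notebook is wide and sparse.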

In [2]:
#fhir.py document
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import pandas as pd
import json
import os


class Fhiry(object):
    def __init__(self):
        self._df = None
        self._filename = ""
        self._folder = ""

    @property
    def df(self):
        return self._df

    @property
    def filename(self):
        return self._filename

    @property
    def folder(self):
        return self._folder

    @filename.setter
    def filename(self, filename):
        self._filename = filename
        self._df = self.read_bundle_from_file(filename)

    @folder.setter
    def folder(self, folder):
        self._folder = folder

    def read_bundle_from_file(self, filename):
        with open(filename, 'r') as f:
            json_in = f.read()
            json_in = json.loads(json_in)
            return json_normalize(json_in['entry'])

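    def delete_unwanted_cols(self):
        # Drop the narrative HTML if present; not every Bundle includes it.
        self._df = self._df.drop(
            columns=['resource.text.div'], errors='ignore')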

    def process_df(self):
        """Read a single JSON resource or a directory full of JSON resources
        ONLY COMMON FIELDS IN ALL resources will be mapped
        """
        if self._folder:
            df = pd.DataFrame(columns=[])
            for file in os.listdir(self._folder):
                if file.endswith(".json"):
                    self._df = self.read_bundle_from_file(
                        os.path.join(self._folder, file))
                    self.delete_unwanted_cols()
                    self.convert_object_to_list()
                    self.add_patient_id()
                    if df.empty:
                        df = self._df
                    else:
                        df = pd.concat([df, self._df])
            self._df = df
        elif self._filename:
            self._df = self.read_bundle_from_file(self._filename)
            self.delete_unwanted_cols()
            self.convert_object_to_list()
            self.add_patient_id()

    def process_file(self, filename):
        self._df = self.read_bundle_from_file(filename)
        self.delete_unwanted_cols()
        self.convert_object_to_list()
        self.add_patient_id()
        return self._df

    def convert_object_to_list(self):
        """Convert object to a list of codes
        """
        for col in self._df.columns:
            if 'coding' in col:
                codes = self._df.apply(
                    lambda x: self.process_list(x[col]), axis=1)
                self._df = pd.concat(
                    [self._df, codes.to_frame(name=col+'codes')], axis=1)
                del self._df[col]
            # elif: a column removed by the branch above must not be read again
            elif 'display' in col:
                codes = self._df.apply(
                    lambda x: self.process_list(x[col]), axis=1)
                self._df = pd.concat(
                    [self._df, codes.to_frame(name=col+'display')], axis=1)
                del self._df[col]

    def add_patient_id(self):
        """Create a patientId column with the resource.id of the first Patient resource
        """
        self._df['patientId'] = self._df[(
            self._df['resource.resourceType'] == "Patient")].iloc[0]['resource.id']

    def get_info(self):
        if self._df is None:
            return "Dataframe is empty"
        return self._df.info()

    def process_list(self, myList):
        """Extracts the codes from a list of objects
        Args:
            myList (list): A list of objects
        Returns:
            list: A list of codes
        """
        myCodes = []
        if isinstance(myList, list):
            for entry in myList:
                if 'code' in entry:
                    myCodes.append(entry['code'])
                else:
                    myCodes.append(entry['display'])
        return myCodes
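
To see what `process_list` does in isolation, the sketch below mirrors its logic as a standalone function and applies it to a hand-written coding list. The name `extract_codes` is hypothetical. Unlike the method above, it skips entries that have neither `code` nor `display` instead of raising a `KeyError`.

```python
def extract_codes(codings):
    """Return the code of each coding, falling back to its display text."""
    codes = []
    if isinstance(codings, list):
        for coding in codings:
            if "code" in coding:
                codes.append(coding["code"])
            elif "display" in coding:
                codes.append(coding["display"])
    return codes


# An illustrative coding list (LOINC 8302-2 is "Body height").
example_codings = [
    {"system": "http://loinc.org", "code": "8302-2", "display": "Body Height"},
    {"display": "Height"},
]
print(extract_codes(example_codings))  # ['8302-2', 'Height']
print(extract_codes(float("nan")))     # [] (missing cells are not lists)
```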

In [3]:
# parallel file
#from fhiry import Fhiry, Fhirndjson
import os
import multiprocessing as mp
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated


def process_files(file):
    f = Fhiry()
    return f.process_file(file)


def process_ndjson(file):
    f = Fhirndjson()
    return f.process_file(file)

def process1(folder):
    # TODO: Fix the below error when ? folder has few files
    # Currently falls back to sequential processing when the pool fails
    # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    files = [os.path.join(folder, name) for name in os.listdir(folder)]
    try:
        # The context manager cleans up the worker processes even on error.
        with mp.Pool(mp.cpu_count()) as pool:
            list_of_dataframes = pool.map(process_files, files)
        return pd.concat(list_of_dataframes)
    except Exception:
        f = Fhiry()
        f.folder = folder
        f.process_df()
        return f.df


def ndjson(folder):
    # TODO: Fix the below error when ? folder has few files
    # Currently falls back to sequential processing when the pool fails
    # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    files = [os.path.join(folder, name) for name in os.listdir(folder)]
    try:
        # The context manager cleans up the worker processes even on error.
        with mp.Pool(mp.cpu_count()) as pool:
            list_of_dataframes = pool.map(process_ndjson, files)
        return pd.concat(list_of_dataframes)
    except Exception:
        f = Fhirndjson()
        f.folder = folder
        f.process_df()
        return f.df
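
Note that `Fhirndjson` is never defined in this notebook because the `fhiry` import is commented out, so `ndjson()` cannot run as written. For orientation, NDJSON is the format produced by FHIR Bulk Data export: one JSON resource per line rather than a single Bundle. The sketch below shows a minimal way to read it. `parse_ndjson` and `read_ndjson` are hypothetical helpers, not fhiry's implementation.

```python
import json


def parse_ndjson(lines):
    """Parse an iterable of NDJSON lines into a list of dicts, skipping blanks."""
    resources = []
    for line in lines:
        line = line.strip()
        if line:
            resources.append(json.loads(line))
    return resources


def read_ndjson(path):
    """Read a file that holds one JSON resource per line."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_ndjson(f)
```

The resulting list of resources can be passed directly to `json_normalize`. In that case the columns have no `resource.` prefix, because there is no enclosing Bundle entry.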

In [4]:
#import fhiry.parallel as fp
df = process1('fhir-test')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1862 entries, 0 to 422
Columns: 110 entries, fullUrl to resource.verificationStatus.codingcodes
dtypes: float64(5), object(105)
memory usage: 1.6+ MB






In [5]:
df.columns

Index(['fullUrl', 'patientId', 'request.method', 'request.url',
       'resource.abatementDateTime', 'resource.achievementStatus.codingcodes',
       'resource.active', 'resource.activity', 'resource.address',
       'resource.addresses',
       ...
       'resource.use', 'resource.vaccineCode.codingcodes',
       'resource.vaccineCode.text',
       'resource.valueCodeableConcept.codingcodes',
       'resource.valueCodeableConcept.text', 'resource.valueQuantity.code',
       'resource.valueQuantity.system', 'resource.valueQuantity.unit',
       'resource.valueQuantity.value',
       'resource.verificationStatus.codingcodes'],
      dtype='object', length=110)

In [6]:
df.head(5)

Unnamed: 0,fullUrl,patientId,request.method,request.url,resource.abatementDateTime,resource.achievementStatus.codingcodes,resource.active,resource.activity,resource.address,resource.addresses,...,resource.use,resource.vaccineCode.codingcodes,resource.vaccineCode.text,resource.valueCodeableConcept.codingcodes,resource.valueCodeableConcept.text,resource.valueQuantity.code,resource.valueQuantity.system,resource.valueQuantity.unit,resource.valueQuantity.value,resource.verificationStatus.codingcodes
0,urn:uuid:b426b062-8273-4b93-a907-de3176c0567d,b426b062-8273-4b93-a907-de3176c0567d,POST,Patient,,,,,[{'extension': [{'url': 'http://hl7.org/fhir/S...,,...,,[],,[],,,,,,[]
1,urn:uuid:fc0bcb63-569b-3658-aa03-71cf89aea64e,b426b062-8273-4b93-a907-de3176c0567d,POST,Organization,,,True,,"[{'line': ['563 BROADWAY'], 'city': 'EVERETT',...",,...,,[],,[],,,,,,[]
2,urn:uuid:0000016d-3a85-4cca-0000-000000000636,b426b062-8273-4b93-a907-de3176c0567d,POST,Practitioner,,,True,,"[{'line': ['563 BROADWAY'], 'city': 'EVERETT',...",,...,,[],,[],,,,,,[]
3,urn:uuid:d051d64b-5b2f-4465-92c3-4e693d54f653,b426b062-8273-4b93-a907-de3176c0567d,POST,Encounter,,,,,,,...,,[],,[],,,,,,[]
4,urn:uuid:5cd46965-1cb3-47dd-a2e3-fb2581b981da,b426b062-8273-4b93-a907-de3176c0567d,POST,Observation,,,,,,,...,,[],,[],,cm,http://unitsofmeasure.org,cm,117.946892,[]


In [8]:
df['resource.address'].head()

0    [{'extension': [{'url': 'http://hl7.org/fhir/S...
1    [{'line': ['563 BROADWAY'], 'city': 'EVERETT',...
2    [{'line': ['563 BROADWAY'], 'city': 'EVERETT',...
3                                                  NaN
4                                                  NaN
Name: resource.address, dtype: object

In [10]:
df['patientId'].unique()

array(['b426b062-8273-4b93-a907-de3176c0567d',
       '5cbc121b-cd71-4428-b8b7-31e53eba8184',
       'adccf2c3-9dc4-4067-ba23-98982c4875da',
       '31191928-6acb-4d73-931c-e601cc3a13fa',
       '67816396-e325-496d-a6ec-c047756b7ce4',
       '5c818f3d-7051-4b86-8203-1dc624a91804'], dtype=object)

## Step 4 Exploratory Data Analysis 

Conduct some limited exploratory data analysis (EDA) for demonstration purposes.
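
As a starting point, the sketch below shows three simple summaries on a tiny stand-in DataFrame that reuses column names from the output above. The values are made up. It also assumes that `request.url` holds the resource type, as it does in the transaction Bundles shown earlier.

```python
import pandas as pd

# Tiny stand-in for the notebook's DataFrame; values are made up for illustration.
eda_df = pd.DataFrame(
    {
        "patientId": ["p1", "p1", "p1", "p2", "p2"],
        "request.url": ["Patient", "Observation", "Observation", "Patient", "Encounter"],
        "resource.valueQuantity.value": [None, 117.9, 21.4, None, None],
    }
)

# How many resources of each type are in the data set?
type_counts = eda_df["request.url"].value_counts()

# How many resources does each patient have, broken down by type?
per_patient = (
    eda_df.groupby(["patientId", "request.url"]).size().unstack(fill_value=0)
)

# Summary statistics for the numeric observation values (NaN rows are ignored).
value_summary = eda_df["resource.valueQuantity.value"].describe()

print(type_counts)
print(per_patient)
print(value_summary)
```

The same three expressions should work on the real `df` from Step 3. The counts per patient are a quick check that every Bundle was loaded.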

## Summary

(A review of what was done in this notebook, possibly reinforcing how this kind of use case could be useful in the real world)