This workshop introduces FHIR Bulk Data export and the SMART Backend Services authorization it relies on. The two specifications are linked below.

http://hl7.org/fhir/uv/bulkdata/index.html

http://www.hl7.org/fhir/smart-app-launch/backend-services.html

The goal of this workshop is to connect to the SMART Bulk Data Server and fetch a set of sample patient data.
While libraries like FHIR-PYrate (https://github.com/UMEssen/FHIR-PYrate) allow you to fetch data from a server and parse it directly into a DataFrame, these libraries generally do not support FHIR Bulk Data. This workshop steps through building up functions that fetch Bulk Data and convert it into DataFrames.



The FHIR Bulk Data specification uses SMART Backend Authorization to connect. The basic idea of SMART Backend Authorization is that registered clients make a signed request to a token endpoint to receive a Bearer token, which is then used for subsequent calls to the FHIR server.

In practice, client registration is likely to happen as a separate one-time event performed manually. The SMART Backend Auth specification does not define an API for registration. 

For this workshop, we connect to the SMART Bulk Data Server, which lets clients "register" by supplying either a JWKS URL or a raw JWKS. This "registration" is not stored on the server; instead, the server hands back a base URL and a client ID that carry the "registration" as encoded state.
For convenience, the SMART Bulk Data Server launch page can generate a one-off JWKS for testing. For production use, clients must generate their own certificates and JWKS and keep the private key private. In this workshop we use a JWKS generated by the launch page with the algorithm "RS384".
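
To see what "stored as state in the URL" looks like in practice, the opaque path segment of the server URL used in this workshop can be decoded locally. The following is a minimal sketch, assuming the segment is unpadded base64url-encoded JSON; the helper name `decode_base64url_json` is ours and not part of any library, and no signature is verified.

```python
import base64
import json


def decode_base64url_json(segment: str) -> dict:
    """Decode a base64url-encoded JSON segment (no signature verification)."""
    # base64url values often omit '=' padding, so restore it before decoding
    padded = segment + '=' * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# The path segment taken from the server_url used in this workshop
state_segment = 'eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1Ijo0LCJkZWwiOjB9'
server_state = decode_base64url_json(state_segment)
print(server_state)
# {'err': '', 'page': 10000, 'dur': 10, 'tlt': 15, 'm': 1, 'stu': 4, 'del': 0}
```

The client ID is shaped like a JWT (three dot-separated segments), so the same helper applied to its middle segment should reveal the JWKS that was "registered".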

IMPORTANT: this workshop largely skips error handling and stays on the "happy path" for brevity and readability. We strongly recommend reviewing the specifications and adding better error handling before using any of this code in a production environment.
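
As one possible starting point for that error handling, the sketch below raises an exception for any 4xx or 5xx response and, where the body is a FHIR OperationOutcome, includes its issue text in the message. The helper name `raise_for_fhir_error` is hypothetical; it assumes only that the response object has `status_code`, `text`, and `json()`, as a `requests` response does.

```python
def raise_for_fhir_error(response):
    """Return the response unchanged, or raise RuntimeError for 4xx/5xx."""
    if response.status_code < 400:
        return response

    details = response.text
    try:
        body = response.json()
    except ValueError:
        # The body was not JSON, so fall back to the raw text
        body = None

    # FHIR servers commonly describe errors with an OperationOutcome resource
    if isinstance(body, dict) and body.get('resourceType') == 'OperationOutcome':
        messages = []
        for issue in body.get('issue', []):
            text = issue.get('diagnostics') or issue.get('details', {}).get('text')
            if text:
                messages.append(text)
        if messages:
            details = '; '.join(messages)

    raise RuntimeError(f'HTTP {response.status_code}: {details}')
```

Wrapping each call, for example `raise_for_fhir_error(session.get(url))`, keeps the happy path readable while still surfacing server errors early.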


In [1]:
# First define our credentials

client_id = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6InJlZ2lzdHJhdGlvbi10b2tlbiJ9.eyJqd2tzIjp7ImtleXMiOlt7Imt0eSI6IlJTQSIsImFsZyI6IlJTMzg0IiwibiI6IjNlLUx6cjJfRk5NQVRqZWpZa0Zqd1JxQTdWM3d6TnFMV25WODRjVGd5ZnlNVThvdjRERUk4S0V3cXpBZ0Q5U0ZzaW9FY2dLMjlDcE1tVHpYVy1lblFXYUc4Qk5qTjMxZ1NFNnBFLXpZMWdDOTFSclVjbGUzdENHYXVMeWlCS3JzOHA3WS0tekhLVVlaWjNLWmtUbldON2FFU0docGZQSlJvcUZ3V3BoZFBxc1dNUDdPZjVDUDFRMHJ3OTFaMXZ5TkdoV3l6YWNVd043WkJfVjVoVW1URE1Cb2FSXy1IMnp4YUVQVUZiaHVFU3R1QmozdmZURS10NE5RbWVNSmlaSkd1LVdiZHVMTWN2UzFNYlQwUzZiaEFEbUlHSW5tbkh0MnRlbVhUbXhaVWdTaERocHFaUHZ4alJfV0tZVW5uekM5S2lxaDNsT19lQUtObUV2Q0k1WjhqUSIsImUiOiJBUUFCIiwia2V5X29wcyI6WyJ2ZXJpZnkiXSwiZXh0Ijp0cnVlLCJraWQiOiIwOWEyZGRmMzljZWVmMGRmMDQ1ZDdmNGUzOTZjNzg1MSJ9LHsia3R5IjoiUlNBIiwiYWxnIjoiUlMzODQiLCJuIjoiM2UtTHpyMl9GTk1BVGplallrRmp3UnFBN1Yzd3pOcUxXblY4NGNUZ3lmeU1VOG92NERFSThLRXdxekFnRDlTRnNpb0VjZ0syOUNwTW1UelhXLWVuUVdhRzhCTmpOMzFnU0U2cEUtelkxZ0M5MVJyVWNsZTN0Q0dhdUx5aUJLcnM4cDdZLS16SEtVWVpaM0taa1RuV043YUVTR2hwZlBKUm9xRndXcGhkUHFzV01QN09mNUNQMVEwcnc5MVoxdnlOR2hXeXphY1V3TjdaQl9WNWhVbVRETUJvYVJfLUgyenhhRVBVRmJodUVTdHVCajN2ZlRFLXQ0TlFtZU1KaVpKR3UtV2JkdUxNY3ZTMU1iVDBTNmJoQURtSUdJbm1uSHQydGVtWFRteFpVZ1NoRGhwcVpQdnhqUl9XS1lVbm56QzlLaXFoM2xPX2VBS05tRXZDSTVaOGpRIiwiZSI6IkFRQUIiLCJkIjoiMWtQM0RscFNxS0F0bzFaRF94QnlablJZRk5LbE1LR3QtRi1GZWRMQjAwQm5tbDJSYXpqc0VLVU9mN2V1dkpuSm1nREcyZXVWQnBYdjdlRzNhWnQwOXNjdGI0cklOMEpzT21MM0NhMllpc09jZ3Ftc2dkZi1HNEoyQmZUWDF2bk9XVTdTM2lYekFmNFRlTFJEWHRvZjN4bnZESmtCZndmVG1OZVR5V05nWXFhdDM4VmJjTjFPMVJGNFhPMGk3RktUaVZ0Z3d2RzlLX2hHMXNrQkpMd0R0YXR3SGl6Z2ZJdERtRFMtbTQxaGRUSlFDZUptS3c0eGpJWDlWaXlpRXpsMTFxRmNrWUkzQUl2Q2toTkZMNVh5dkpuV0kyaHpmbTJxa1gzbTRXNEdKZVU0SV9Dbzl0dUd2SUNhZHJ0eFhQZjVWWk4tck5IemszRTVkbEZNNkRUdEFRIiwicCI6Ii1CR0NKaWF2V3ZiRE1SejZzd1lRc2ZEbTBBMnV0S3plZW82WDF0M3hBa0hCZ1pRajltZ25acFlVZEdvWW00NjFLSU1YUE54VWxfX0hWcWFuaUlNMjNvODlWTDJkMkJ1Z3UwcE5PbzhTcXdVNFROOVNIczBPOXUwVWZJRURzRVpPWWdDaDB1ZXBtMFMteG1QZ0F3UkxJTE10NmJUTFYyLVFtT1dHb2xjeno0ayIsInEiOiI1UWdpNjRTMEJaR2JobVJRYWFxc0YySFU5VllFM25Qck1GcHNJdm
hvUzNRWEw5RFNOMEpQYWZ4VHVsMW9HS3hsV3FvRDkydks5YWNaVVpZWjU2SHZ4MFdWaGFzUWpGMERmd3NFMV8xRk1OTFppVy1xdVhGOGxRVjlEOHpHSkREYllXeEdOUDYtYkpXVWQycnM3MndtMzFDa3lrbUNsQmk5UWJ5SXRrM3FYLVUiLCJkcCI6ImVHWUhCUDFCbnFTbGxfQzR2S3IwNzJnOG5qNEZ6U3Naei1IbFVDUG9GWEJVdXM5cnBPeG9NeUlrUzF3ekZVenVIX3VBQzhua1JPR2ZuaTdFb1QwT0pIYmhEWF82WENrTW1kbzJJWFhQV2JIdTRXQ0NPdkRMa296LXBHNzVtMVNFTm95WF9nVHlES29RN2JrTHdHc1ZDNG5yZnNLQTdxNzNQejRuV2lONHdnRSIsImRxIjoiQ0pOemEwby15MjZXV2tQclZ1bVRKQlRfdW1nTUtxQkFrRUR5aDZTeGt4RzN4SXlYTW9hREhyN2FDOEp2b1d6akpxX3pFaEt4T04yVzd4MGx2eXlySTlVUk1qNGprbjN4SVpLeURieG9HTm5zVjE4ZEQzQ1diNllTOXNKLU1PQzdkanh0ckpKVll3OS16Ykh0U1ZITmF0TkVPR2JrUXROaVV0SFNkTEVhTTVrIiwicWkiOiJPR1JXYkJ4V2xncy05UUlsaEpFZEdPM0ZDb0FYREkyc3JTUlVwXzlnZVpXcGlxVTZHYUp0MjY4NzdKNHlGR0hhSG83MWN4eHpiOG55NTZCMTFiZm5DT3E2TUwtalI5Ym9RWmo1REhCaE1CYTZuVHI0anpfaE1uQjZqTFdMNFd3dGJwMkdpcXVMTVAyWnI2MGFSZHFJOVAzNlFadXMyVURGUmxLVmJrMlRLeFkiLCJrZXlfb3BzIjpbInNpZ24iXSwiZXh0Ijp0cnVlLCJraWQiOiIwOWEyZGRmMzljZWVmMGRmMDQ1ZDdmNGUzOTZjNzg1MSJ9XX0sImFjY2Vzc1Rva2Vuc0V4cGlyZUluIjoxNSwiaWF0IjoxNjgzOTAwNTEzfQ.OdAI3mUiBsEIF1ViZRyqOb6gpEg3HOxPjSB9kwf9R8w'

private_key = """-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEA3e+Lzr2/FNMATjejYkFjwRqA7V3wzNqLWnV84cTgyfyMU8ov
4DEI8KEwqzAgD9SFsioEcgK29CpMmTzXW+enQWaG8BNjN31gSE6pE+zY1gC91RrU
cle3tCGauLyiBKrs8p7Y++zHKUYZZ3KZkTnWN7aESGhpfPJRoqFwWphdPqsWMP7O
f5CP1Q0rw91Z1vyNGhWyzacUwN7ZB/V5hUmTDMBoaR/+H2zxaEPUFbhuEStuBj3v
fTE+t4NQmeMJiZJGu+WbduLMcvS1MbT0S6bhADmIGInmnHt2temXTmxZUgShDhpq
ZPvxjR/WKYUnnzC9Kiqh3lO/eAKNmEvCI5Z8jQIDAQABAoIBAQDWQ/cOWlKooC2j
VkP/EHJmdFgU0qUwoa34X4V50sHTQGeaXZFrOOwQpQ5/t668mcmaAMbZ65UGle/t
4bdpm3T2xy1visg3Qmw6YvcJrZiKw5yCqayB1/4bgnYF9NfW+c5ZTtLeJfMB/hN4
tENe2h/fGe8MmQF/B9OY15PJY2Bipq3fxVtw3U7VEXhc7SLsUpOJW2DC8b0r+EbW
yQEkvAO1q3AeLOB8i0OYNL6bjWF1MlAJ4mYrDjGMhf1WLKITOXXWoVyRgjcAi8KS
E0UvlfK8mdYjaHN+baqRfebhbgYl5Tgj8Kj224a8gJp2u3Fc9/lVk36s0fOTcTl2
UUzoNO0BAoGBAPgRgiYmr1r2wzEc+rMGELHw5tANrrSs3nqOl9bd8QJBwYGUI/Zo
J2aWFHRqGJuOtSiDFzzcVJf/x1amp4iDNt6PPVS9ndgboLtKTTqPEqsFOEzfUh7N
DvbtFHyBA7BGTmIAodLnqZtEvsZj4AMESyCzLem0y1dvkJjlhqJXM8+JAoGBAOUI
IuuEtAWRm4ZkUGmqrBdh1PVWBN5z6zBabCL4aEt0Fy/Q0jdCT2n8U7pdaBisZVqq
A/dryvWnGVGWGeeh78dFlYWrEIxdA38LBNf9RTDS2YlvqrlxfJUFfQ/MxiQw22Fs
RjT+vmyVlHdq7O9sJt9QpMpJgpQYvUG8iLZN6l/lAoGAeGYHBP1BnqSll/C4vKr0
72g8nj4FzSsZz+HlUCPoFXBUus9rpOxoMyIkS1wzFUzuH/uAC8nkROGfni7EoT0O
JHbhDX/6XCkMmdo2IXXPWbHu4WCCOvDLkoz+pG75m1SENoyX/gTyDKoQ7bkLwGsV
C4nrfsKA7q73Pz4nWiN4wgECgYAIk3NrSj7LbpZaQ+tW6ZMkFP+6aAwqoECQQPKH
pLGTEbfEjJcyhoMevtoLwm+hbOMmr/MSErE43ZbvHSW/LKsj1REyPiOSffEhkrIN
vGgY2exXXx0PcJZvphL2wn4w4Lt2PG2sklVjD37Nse1JUc1q00Q4ZuRC02JS0dJ0
sRozmQKBgDhkVmwcVpYLPvUCJYSRHRjtxQqAFwyNrK0kVKf/YHmVqYqlOhmibduv
O+yeMhRh2h6O9XMcc2/J8uegddW35wjqujC/o0fW6EGY+QxwYTAWup06+I8/4TJw
eoy1i+FsLW6dhoqrizD9ma+tGkXaiPT9+kGbrNlAxUZSlW5NkysW
-----END RSA PRIVATE KEY-----
"""

key_id = "09a2ddf39ceef0df045d7f4e396c7851"

# And server endpoint
server_url = 'https://bulk-data.smarthealthit.org/eyJlcnIiOiIiLCJwYWdlIjoxMDAwMCwiZHVyIjoxMCwidGx0IjoxNSwibSI6MSwic3R1Ijo0LCJkZWwiOjB9/fhir'


In [2]:
# we use the requests library to make all HTTP requests
import requests

# and use a Session, in case we need to persist common settings such as proxy or SSL configuration
session = requests.Session()

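# WARNING: the next three lines disable TLS certificate verification and silence the resulting warnings.
# Only do this if your network (for example, an intercepting proxy) requires it, and never in production.
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
session.verify = False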


In [3]:
# Let's start by confirming we can even hit the server with the metadata endpoint
r = session.get(f'{server_url}/metadata')
metadata = r.json()

metadata

{'resourceType': 'CapabilityStatement',
 'status': 'active',
 'date': '2023-05-18T13:11:36+00:00',
 'publisher': "Boston Children's Hospital",
 'kind': 'instance',
 'instantiates': ['http://hl7.org/fhir/uv/bulkdata/CapabilityStatement/bulk-data'],
 'software': {'name': 'SMART Sample Bulk Data Server', 'version': '2.1.1'},
 'implementation': {'description': 'SMART Sample Bulk Data Server'},
 'fhirVersion': '4.0.1',
 'acceptUnknown': 'extensions',
 'format': ['json'],
 'rest': [{'mode': 'server',
   'security': {'extension': [{'url': 'http://fhir-registry.smarthealthit.org/StructureDefinition/oauth-uris',
      'extension': [{'url': 'token',
        'valueUri': 'https://bulk-data.smarthealthit.org/auth/token'},
       {'url': 'register',
        'valueUri': 'https://bulk-data.smarthealthit.org/auth/register'}]}],
    'service': [{'coding': [{'system': 'http://hl7.org/fhir/restful-security-service',
        'code': 'SMART-on-FHIR',
        'display': 'SMART-on-FHIR'}],
      'text': 'OAut

In [4]:
# For SMART Backend Auth, the token endpoint is published at .well-known/smart-configuration
r = session.get(f'{server_url}/.well-known/smart-configuration')
smart_config = r.json()

smart_config

{'token_endpoint': 'https://bulk-data.smarthealthit.org/auth/token',
 'registration_endpoint': 'https://bulk-data.smarthealthit.org/auth/register',
 'token_endpoint_auth_methods_supported': ['private_key_jwt'],
 'token_endpoint_auth_signing_alg_values_supported': ['HS256',
  'HS384',
  'HS512',
  'RS256',
  'RS384',
  'RS512',
  'ES256',
  'ES384',
  'ES512',
  'PS256',
  'PS384',
  'PS512'],
 'scopes_supported': ['system/*.rs',
  'system/Patient.rs',
  'system/Encounter.rs',
  'system/Condition.rs',
  'system/Claim.rs',
  'system/ExplanationOfBenefit.rs',
  'system/Observation.rs',
  'system/Immunization.rs',
  'system/DiagnosticReport.rs',
  'system/Procedure.rs',
  'system/CareTeam.rs',
  'system/CarePlan.rs',
  'system/MedicationRequest.rs',
  'system/AllergyIntolerance.rs',
  'system/Device.rs',
  'system/ImagingStudy.rs',
  'system/Organization.rs',
  'system/Practitioner.rs',
  'system/DocumentReference.rs',
  'system/Group.rs',
  'system/*.read',
  'system/Patient.read',
  'sys

In [5]:
# There are a number of important fields here, but for now the one we care most about is the token_endpoint
token_endpoint = smart_config['token_endpoint']
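
Some of the other fields are worth a quick sanity check before requesting a token. The sketch below is one way to do that, assuming the field names shown in the output above; the function name `check_smart_config` is ours. It returns a list of problems rather than raising, so the caller decides how strict to be.

```python
def check_smart_config(config: dict, algorithm: str = 'RS384') -> list:
    """Return a list of human-readable problems found in a SMART configuration."""
    problems = []

    if not config.get('token_endpoint'):
        problems.append('token_endpoint is missing')

    # Backend services authenticate with a signed JWT ("private_key_jwt")
    auth_methods = config.get('token_endpoint_auth_methods_supported', [])
    if 'private_key_jwt' not in auth_methods:
        problems.append('private_key_jwt authentication is not advertised')

    # The algorithm we sign with should be one the server says it accepts
    algorithms = config.get('token_endpoint_auth_signing_alg_values_supported', [])
    if algorithm not in algorithms:
        problems.append(f'signing algorithm {algorithm} is not advertised')

    return problems


# A trimmed-down copy of the configuration shown above
example_config = {
    'token_endpoint': 'https://bulk-data.smarthealthit.org/auth/token',
    'token_endpoint_auth_methods_supported': ['private_key_jwt'],
    'token_endpoint_auth_signing_alg_values_supported': ['RS256', 'RS384', 'ES384'],
}
print(check_smart_config(example_config))
```

Calling `check_smart_config(smart_config)` on the real response and stopping if the list is non-empty is a cheap guard against a misconfigured client.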

In [6]:
# Now to get a token, we create a JWT client assertion as follows:
import jwt
import datetime

assertion = jwt.encode({
        'iss': client_id,
        'sub': client_id,
        'aud': token_endpoint,
        'exp': int((datetime.datetime.now() + datetime.timedelta(minutes=5)).timestamp())
}, private_key, algorithm='RS384',
headers={"kid": key_id}) # kid required for smart bulk data server


# And post it to the token endpoint
r = session.post(token_endpoint, data={
    'scope': 'system/*.read',
    'grant_type': 'client_credentials',
    'client_assertion_type': 'urn:ietf:params:oauth:client-assertion-type:jwt-bearer',
    'client_assertion': assertion
})

token_response = r.json()

token_response

{'token_type': 'bearer',
 'scope': 'system/*.read',
 'expires_in': 299,
 'access_token': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbl90eXBlIjoiYmVhcmVyIiwic2NvcGUiOiJzeXN0ZW0vKi5yZWFkIiwiZXhwaXJlc19pbiI6Mjk5LCJpYXQiOjE2ODQ0NDQ5NDMsImV4cCI6MTY4NDQ0NTI0Mn0.Zzdne5i3oO0KBFnsjhfMyImCldFoWXRbtM78UptC-5s'}
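
The SMART Backend Services specification also lists a `jti` claim (a unique identifier for each assertion) as required, and limits `exp` to at most five minutes in the future. The sample server accepted the assertion above without `jti`, but other servers may enforce it. The sketch below builds the claims with both rules applied; the function name `build_assertion_claims` and its parameters are ours.

```python
import time
import uuid


def build_assertion_claims(client_id, token_endpoint, lifetime_seconds=300, now=None):
    """Build the claims for a SMART Backend Services client assertion."""
    if lifetime_seconds > 300:
        # The specification caps the assertion lifetime at five minutes
        raise ValueError('lifetime_seconds must not exceed 300')

    now = time.time() if now is None else now
    return {
        'iss': client_id,
        'sub': client_id,
        'aud': token_endpoint,
        'exp': int(now) + lifetime_seconds,
        # A fresh random identifier per assertion helps the server detect replays
        'jti': str(uuid.uuid4()),
    }
```

The returned dict can be passed to `jwt.encode` in place of the inline dict used above, with the same key, algorithm, and `kid` header.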

In [7]:
# Two important fields we need to keep track of are the token itself, and the expire time. 
# Tokens are only valid for a certain amount of time, and once they expire
# we will need to fetch a new one via the same process as above.
# 'expires_in' is in seconds from the current time. 

token = token_response['access_token']
expire_time = datetime.datetime.now() + datetime.timedelta(seconds=token_response['expires_in'])


In [8]:
# Now to make this easier for ourselves, let's package this up into a function that we can call 

def get_token():
    global token, expire_time
    if datetime.datetime.now() < expire_time:
        # the existing token is still valid, so reuse it
        return token
    
    assertion = jwt.encode({
            'iss': client_id,
            'sub': client_id,
            'aud': token_endpoint,
            'exp': int((datetime.datetime.now() + datetime.timedelta(minutes=5)).timestamp())
    }, private_key, algorithm='RS384',
    headers={"kid": key_id})

    r = session.post(token_endpoint, data={
        'scope': 'system/*.read',
        'grant_type': 'client_credentials',
        'client_assertion_type': 'urn:ietf:params:oauth:client-assertion-type:jwt-bearer',
        'client_assertion': assertion
    })

    token_response = r.json()
    token = token_response['access_token']
    expire_time = datetime.datetime.now() + datetime.timedelta(seconds=token_response['expires_in'])
    
    return token

# now we can reference get_token() and it will automatically fetch a new one whenever needed
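
One weakness of checking the expiry time exactly is that a token can expire between the check and the request that uses it. The sketch below is an alternative that avoids module-level globals and refreshes slightly early. The class name `TokenCache`, the `margin_seconds` default, and the injected `fetch_token` callable (expected to return a dict with `access_token` and `expires_in`, like the token response above) are all assumptions for illustration.

```python
import time


class TokenCache:
    """Cache an access token and refresh it shortly before it expires."""

    def __init__(self, fetch_token, margin_seconds=30, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning the token response dict
        self._margin = margin_seconds    # refresh this many seconds before expiry
        self._clock = clock              # injectable so the cache can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Fetch a new token if we have none, or if the current one is about to expire
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            response = self._fetch_token()
            self._token = response['access_token']
            self._expires_at = self._clock() + response['expires_in']
        return self._token
```

A monotonic clock is used so that changes to the system time cannot make a stale token look valid.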


In [9]:
# Now we make the export request
r = session.get(f'{server_url}/Patient/$export?_type=Patient,Condition', headers={'Authorization': f'Bearer {get_token()}', 'Accept': 'application/fhir+json', 'Prefer': 'respond-async'})

r.headers


{'Server': 'Cowboy', 'Connection': 'keep-alive', 'X-Powered-By': 'Express', 'Content-Location': 'https://bulk-data.smarthealthit.org/fhir/bulkstatus/06771732fc08e2f250431a3a8d9fd7f2', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '644', 'Etag': 'W/"284-AothQCy6VN/EvjChVMq1OgdnwQA"', 'Date': 'Thu, 18 May 2023 21:22:23 GMT', 'Via': '1.1 vegur'}
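
The kick-off URL above is assembled by hand. The sketch below wraps that in a small helper that also covers two other options defined by the Bulk Data specification: the `_since` parameter and group-level export. The function name `build_export_url` is ours, and whether a given server supports each option is something to confirm against its documentation.

```python
from urllib.parse import urlencode


def build_export_url(base_url, resource_types=None, since=None, group_id=None):
    """Build a Bulk Data kick-off URL for a patient-level or group-level export."""
    base = base_url.rstrip('/')
    if group_id is not None:
        path = f'{base}/Group/{group_id}/$export'
    else:
        path = f'{base}/Patient/$export'

    params = {}
    if resource_types:
        # _type takes a comma-separated list of resource types
        params['_type'] = ','.join(resource_types)
    if since:
        # _since is a FHIR instant, e.g. 2023-01-01T00:00:00Z
        params['_since'] = since

    if not params:
        return path
    # Keep commas and colons readable; everything else is percent-encoded
    return f"{path}?{urlencode(params, safe=',:')}"


print(build_export_url('https://example.org/fhir', ['Patient', 'Condition']))
# https://example.org/fhir/Patient/$export?_type=Patient,Condition
```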

In [10]:
check_url = r.headers['Content-Location']
check_url

'https://bulk-data.smarthealthit.org/fhir/bulkstatus/06771732fc08e2f250431a3a8d9fd7f2'

In [11]:
# Now we check the status in a loop

from time import sleep

while True:
    r = session.get(check_url, headers={'Authorization': f'Bearer {get_token()}', 'Accept': 'application/fhir+json'})

    # There are three possible options here: http://hl7.org/fhir/uv/bulkdata/export.html#bulk-data-status-request
    # Error = 4xx or 5xx status code
    # In-Progress = 202
    # Complete = 200

    if r.status_code == 200:
        # complete
        response = r.json()
        print(response)
        break

    elif r.status_code == 202:
        # in progress
        print(r.headers)
        
        delay = r.headers['Retry-After']
        
        print(f"Sleeping {delay} seconds before retrying")
        sleep(int(delay))

    else:
        # error
        print(r.text)

        break

{'Server': 'Cowboy', 'Connection': 'keep-alive', 'X-Powered-By': 'Express', 'X-Progress': '0% complete, currenly processing Patient resources', 'Retry-After': '2', 'Date': 'Thu, 18 May 2023 21:22:23 GMT', 'Content-Length': '0', 'Via': '1.1 vegur'}
Sleeping 2 seconds before retrying
{'Server': 'Cowboy', 'Connection': 'keep-alive', 'X-Powered-By': 'Express', 'X-Progress': '21% complete, currenly processing Patient resources', 'Retry-After': '2', 'Date': 'Thu, 18 May 2023 21:22:25 GMT', 'Content-Length': '0', 'Via': '1.1 vegur'}
Sleeping 2 seconds before retrying
{'Server': 'Cowboy', 'Connection': 'keep-alive', 'X-Powered-By': 'Express', 'X-Progress': '41% complete, currenly processing Patient resources', 'Retry-After': '2', 'Date': 'Thu, 18 May 2023 21:22:27 GMT', 'Content-Length': '0', 'Via': '1.1 vegur'}
Sleeping 2 seconds before retrying
{'Server': 'Cowboy', 'Connection': 'keep-alive', 'X-Powered-By': 'Express', 'X-Progress': '61% complete, currenly processing Patient resources', 'Ret
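
The loop above assumes `Retry-After` is always present and always a number of seconds. In HTTP the header may also be a date, and it may be missing altogether. The sketch below handles those cases; the function name `retry_after_seconds` and the `default` and `maximum` values are arbitrary choices for illustration.

```python
import datetime
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=5.0, maximum=60.0, now=None):
    """Convert a Retry-After header value into a bounded delay in seconds."""
    if value is None:
        return default

    try:
        # Most common form: a plain number of seconds
        seconds = float(value)
    except ValueError:
        try:
            # Alternative form: an HTTP date such as 'Thu, 18 May 2023 21:22:33 GMT'
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=datetime.timezone.utc)
        now = now or datetime.datetime.now(datetime.timezone.utc)
        seconds = (when - now).total_seconds()

    # Never return a negative delay, and never wait longer than the cap
    return min(max(seconds, 0.0), maximum)
```

In the polling loop this would replace `int(delay)` with `retry_after_seconds(r.headers.get('Retry-After'))`.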

In [12]:
output_files = response['output']
output_files

[{'type': 'Condition',
  'count': 639,
  'url': 'https://bulk-data.smarthealthit.org/eyJpZCI6IjA2NzcxNzMyZmMwOGUyZjI1MDQzMWEzYThkOWZkN2YyIiwib2Zmc2V0IjowLCJsaW1pdCI6NjM5LCJzZWN1cmUiOnRydWV9/fhir/bulkfiles/1.Condition.ndjson'},
 {'type': 'Patient',
  'count': 100,
  'url': 'https://bulk-data.smarthealthit.org/eyJpZCI6IjA2NzcxNzMyZmMwOGUyZjI1MDQzMWEzYThkOWZkN2YyIiwib2Zmc2V0IjowLCJsaW1pdCI6MTAwLCJzZWN1cmUiOnRydWV9/fhir/bulkfiles/1.Patient.ndjson'}]

In [13]:
# So the response points us to one or more NDJSON files per resource type.
# Now we can loop through the list and download each one.
# Each file is NDJSON (Newline Delimited JSON), so that's one resource per line.
# For starters we'll keep a dict of { resourceType: [resources,...]}
import json

resources_by_type = {}

for output_file in output_files:
    download_url = output_file['url']
    resource_type = output_file['type']
    
    r = session.get(download_url, headers={'Authorization': f'Bearer {get_token()}', 'Accept': 'application/fhir+json'})
                                                   
    ndjson = r.text.strip()
    
    if resource_type not in resources_by_type:
        resources_by_type[resource_type] = []
    
    for line in ndjson.split('\n'):
        resource = json.loads(line)
        resources_by_type[resource_type].append(resource)
        

resources_by_type
    

{'Condition': [{'resourceType': 'Condition',
   'id': 'a5a38601-b6fe-46b4-a67e-cde9d5957dde',
   'clinicalStatus': {'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/condition-clinical',
      'code': 'active'}]},
   'verificationStatus': {'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/condition-ver-status',
      'code': 'confirmed'}]},
   'code': {'coding': [{'system': 'http://snomed.info/sct',
      'code': '40055000',
      'display': 'Chronic sinusitis (disorder)'}],
    'text': 'Chronic sinusitis (disorder)'},
   'subject': {'reference': 'Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2'},
   'encounter': {'reference': 'Encounter/17b801ac-58e3-4f6b-8b48-8e33f3a36086'},
   'onsetDateTime': '1985-06-18T17:30:49-04:00',
   'recordedDate': '1985-06-18T17:30:49-04:00'},
  {'resourceType': 'Condition',
   'id': '8f818ad4-c292-47e8-8d99-c4c54174b671',
   'clinicalStatus': {'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/condition-clinical',
      'code'
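
The download loop above parses every line and would fail on a blank one. The sketch below is a slightly more defensive NDJSON parser that skips empty lines and reports the line number of any invalid JSON; the function name `iter_ndjson` is ours.

```python
import json


def iter_ndjson(text):
    """Yield one parsed object per non-empty line of an NDJSON document."""
    # Split on '\n' only; strip() below also removes any trailing '\r'
    for line_number, line in enumerate(text.split('\n'), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f'Invalid JSON on line {line_number}: {exc}') from exc


sample = '{"resourceType": "Patient", "id": "a"}\n\n{"resourceType": "Patient", "id": "b"}\n'
print([resource['id'] for resource in iter_ndjson(sample)])
# ['a', 'b']
```

For large exports, reading the whole file into `r.text` can use a lot of memory; `requests` also offers `stream=True` with `Response.iter_lines()` to process the response line by line.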

In [14]:
# Finally, let's convert these into DataFrames
import pandas as pd

resource_dfs = {}

for resource_type, resources in resources_by_type.items():
    resource_dfs[resource_type] = pd.json_normalize(resources)

# Now we can work with them by type:

resource_dfs['Patient']
    

Unnamed: 0,resourceType,id,extension,identifier,name,telecom,gender,birthDate,address,multipleBirthBoolean,communication,text.status,text.div,maritalStatus.coding,maritalStatus.text,multipleBirthInteger
0,Patient,6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Lemke', 'given...","[{'system': 'phone', 'value': '555-532-1156', ...",male,1965-01-13,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
1,Patient,58c297c4-d684-4677-8024-01131d93835e,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Wintheiser', '...","[{'system': 'phone', 'value': '555-712-4709', ...",female,1971-04-05,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
2,Patient,538a9a4e-8437-47d3-8c01-1a17dca8f0be,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Alaniz', 'give...","[{'system': 'phone', 'value': '555-446-6900', ...",male,1923-03-24,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
3,Patient,c6c60742-8694-46e4-bb42-b00bf6d8b536,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Walsh', 'given...","[{'system': 'phone', 'value': '555-436-4287', ...",female,1965-10-27,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
4,Patient,fbfec681-d357-4b28-b1d2-5db6434c7846,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Bednar', 'give...","[{'system': 'phone', 'value': '555-405-4909', ...",female,1942-07-04,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Patient,5efb1ac1-d29b-40a5-a3d1-2d682f10bfa7,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Schmeler', 'gi...","[{'system': 'phone', 'value': '555-971-6300', ...",male,1995-10-19,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,Never Married,
96,Patient,c1981741-f90e-4077-9156-429a3c4c5ded,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Lubowitz', 'gi...","[{'system': 'phone', 'value': '555-328-5229', ...",male,1956-05-06,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
97,Patient,f98b23bf-4443-46d0-9eaf-563e767cf948,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Funk', 'given'...","[{'system': 'phone', 'value': '555-497-7639', ...",male,1966-02-07,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,M,
98,Patient,c536dee9-9ef6-4807-ae20-9f1045c9c7d6,[{'url': 'http://hl7.org/fhir/StructureDefinit...,[{'system': 'https://github.com/synthetichealt...,"[{'use': 'official', 'family': 'Bergstrom', 'g...","[{'system': 'phone', 'value': '555-845-1730', ...",male,1990-11-18,[{'extension': [{'url': 'http://hl7.org/fhir/S...,False,[{'language': {'coding': [{'system': 'urn:ietf...,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",[{'system': 'http://terminology.hl7.org/CodeSy...,S,


In [15]:
# Note that there's no perfect option for representing FHIR in a tabular way, because of all the nested values,
# but we can do a little better with the flatten_json library: https://github.com/amirziai/flatten

from flatten_json import flatten

for resource_type, resources in resources_by_type.items():
    resource_dfs[resource_type] = pd.json_normalize(list(map(lambda r: flatten(r), resources)))

# Now let's take another look
resource_dfs['Patient']

Unnamed: 0,resourceType,id,text_status,text_div,extension_0_url,extension_0_valueString,extension_1_url,extension_1_valueAddress_city,extension_1_valueAddress_state,extension_1_valueAddress_country,...,multipleBirthBoolean,communication_0_language_coding_0_system,communication_0_language_coding_0_code,communication_0_language_coding_0_display,communication_0_language_text,name_1_use,name_1_family,name_1_given_0,name_1_prefix_0,multipleBirthInteger
0,Patient,6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Lettie Boyle,http://hl7.org/fhir/StructureDefinition/patien...,Boston,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,,,,,
1,Patient,58c297c4-d684-4677-8024-01131d93835e,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Marquetta Schamberger,http://hl7.org/fhir/StructureDefinition/patien...,Macau,Macao Special Administrative Region of the Peo...,CN,...,False,urn:ietf:bcp:47,zh,Chinese,Chinese,maiden,Heathcote,Aleta,Mrs.,
2,Patient,538a9a4e-8437-47d3-8c01-1a17dca8f0be,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Pilar Orta,http://hl7.org/fhir/StructureDefinition/patien...,San Jose,San Jose,CR,...,False,urn:ietf:bcp:47,es,Spanish,Spanish,,,,,
3,Patient,c6c60742-8694-46e4-bb42-b00bf6d8b536,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Arvilla Haag,http://hl7.org/fhir/StructureDefinition/patien...,Norton,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,maiden,Kuphal,Alyce,Mrs.,
4,Patient,fbfec681-d357-4b28-b1d2-5db6434c7846,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Marcelina Harber,http://hl7.org/fhir/StructureDefinition/patien...,Brockton,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,maiden,Runolfsson,Arnette,Mrs.,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Patient,5efb1ac1-d29b-40a5-a3d1-2d682f10bfa7,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Allison Daugherty,http://hl7.org/fhir/StructureDefinition/patien...,Boston,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,,,,,
96,Patient,c1981741-f90e-4077-9156-429a3c4c5ded,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Antoinette Parker,http://hl7.org/fhir/StructureDefinition/patien...,Mansfield,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,,,,,
97,Patient,f98b23bf-4443-46d0-9eaf-563e767cf948,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Barbar Windler,http://hl7.org/fhir/StructureDefinition/patien...,Randolph,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,,,,,
98,Patient,c536dee9-9ef6-4807-ae20-9f1045c9c7d6,generated,"<div xmlns=""http://www.w3.org/1999/xhtml"">Gene...",http://hl7.org/fhir/StructureDefinition/patien...,Juli Johns,http://hl7.org/fhir/StructureDefinition/patien...,Holyoke,Massachusetts,US,...,False,urn:ietf:bcp:47,en-US,English,English,,,,,
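
To make the column names above less mysterious, here is a minimal sketch of the flattening idea: nested keys and list indexes are joined with underscores. This is an illustration only, not the `flatten_json` implementation, and the function name `flatten_resource` is ours.

```python
def flatten_resource(value, prefix='', separator='_'):
    """Flatten nested dicts and lists into a single-level dict."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # List positions become part of the key, e.g. name_0_family
        items = enumerate(value)
    else:
        return {prefix: value}

    flat = {}
    for key, child in items:
        child_prefix = f'{prefix}{separator}{key}' if prefix else str(key)
        flat.update(flatten_resource(child, child_prefix, separator))
    # Note: empty dicts and lists produce no keys in this sketch
    return flat


patient = {'id': 'p1', 'name': [{'family': 'Lemke', 'given': ['A', 'B']}]}
print(flatten_resource(patient))
# {'id': 'p1', 'name_0_family': 'Lemke', 'name_0_given_0': 'A', 'name_0_given_1': 'B'}
```

Because list positions end up in the column names, resources with repeated elements (two names, several extensions) produce many sparse columns, which is why the table above is so wide.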


In [16]:
# Next, what if we know in advance we will only want certain fields? 

# https://github.com/beda-software/fhirpath-py

import fhirpathpy

fhir_paths=[
        ["id", "identifier[0].value"],
        ["gender", "gender"],
        ["date_of_birth", "birthDate"],
        ["marital_status", "maritalStatus.coding[0].code"]
    ]

for f in fhir_paths:
    f[1] = fhirpathpy.compile(f[1])

for resource_type, resources in resources_by_type.items():
    filtered_resources = []
    
    for resource in resources:
        filtered_resource = {}
        for f in fhir_paths:
            fieldname = f[0]
            func = f[1]
            filtered_resource[fieldname] = func(resource)
            
            if isinstance(filtered_resource[fieldname], list) and len(filtered_resource[fieldname]) == 1:
                filtered_resource[fieldname] = filtered_resource[fieldname][0]
            
        filtered_resources.append(filtered_resource)

    resource_dfs[resource_type] = pd.json_normalize(list(map(lambda r: flatten(r), filtered_resources)))
    

resource_dfs['Patient']

Unnamed: 0,id,gender,date_of_birth,marital_status
0,6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,male,1965-01-13,M
1,58c297c4-d684-4677-8024-01131d93835e,female,1971-04-05,M
2,538a9a4e-8437-47d3-8c01-1a17dca8f0be,male,1923-03-24,M
3,c6c60742-8694-46e4-bb42-b00bf6d8b536,female,1965-10-27,M
4,fbfec681-d357-4b28-b1d2-5db6434c7846,female,1942-07-04,M
...,...,...,...,...
95,5efb1ac1-d29b-40a5-a3d1-2d682f10bfa7,male,1995-10-19,S
96,c1981741-f90e-4077-9156-429a3c4c5ded,male,1956-05-06,M
97,f98b23bf-4443-46d0-9eaf-563e767cf948,male,1966-02-07,M
98,c536dee9-9ef6-4807-ae20-9f1045c9c7d6,male,1990-11-18,S


In [17]:
# Now we have everything we need. Let's bring everything together into one class with a clear entry point

import datetime
import json
from time import sleep

import fhirpathpy
import jwt
import pandas as pd
import requests
from flatten_json import flatten



class BulkDataFetcher:
    def __init__(
        self,
        base_url: str,
        client_id: str,
        private_key: str,
        key_id: str,
        session = None
    ):
        self.base_url = base_url
        self.client_id = client_id
        self.private_key = private_key
        self.key_id = key_id
        
        self.token = None
        self.expire_time = None
        
        if session is None:
            self.session = requests.Session()
        else:
            self.session = session
            
        r = self.session.get(f'{base_url}/.well-known/smart-configuration')
        smart_config = r.json()
        self.token_endpoint = smart_config['token_endpoint']
        
        self.resource_types = []
        self.fhir_paths = {}
        
        
    def get_token(self):
        if self.token and datetime.datetime.now() < self.expire_time:
            # the existing token is still valid, so reuse it
            return self.token

        assertion = jwt.encode({
                'iss': self.client_id,
                'sub': self.client_id,
                'aud': self.token_endpoint,
                'exp': int((datetime.datetime.now() + datetime.timedelta(minutes=5)).timestamp())
        }, self.private_key, algorithm='RS384',
        headers={"kid": self.key_id})

        r = self.session.post(self.token_endpoint, data={
            'scope': 'system/*.read',
            'grant_type': 'client_credentials',
            'client_assertion_type': 'urn:ietf:params:oauth:client-assertion-type:jwt-bearer',
            'client_assertion': assertion
        })

        token_response = r.json()
        self.token = token_response['access_token']
        self.expire_time = datetime.datetime.now() + datetime.timedelta(seconds=token_response['expires_in'])

        return self.token
    
    def add_resource_type(self, resource_type: str, fhir_paths = None):
        self.resource_types.append(resource_type)
        if fhir_paths:
            # fhir_paths=[
            #    ("id", "identifier[0].value"),
            #    ("marital_status", "maritalStatus.coding[0].code")
            # ]
            compiled_fhir_paths = [(f[0], fhirpathpy.compile(f[1])) for f in fhir_paths]
            self.fhir_paths[resource_type] = compiled_fhir_paths
            
    def _invoke_request(self):
        types = ','.join(self.resource_types)
        r = self.session.get(f'{self.base_url}/Patient/$export?_type={types}', headers={'Authorization': f'Bearer {self.get_token()}', 'Accept': 'application/fhir+json', 'Prefer': 'respond-async'})

        self.check_url = r.headers['Content-Location']
        return self.check_url
    
    def _wait_until_ready(self):
        while True:
            r = self.session.get(self.check_url, headers={'Authorization': f'Bearer {self.get_token()}', 'Accept': 'application/fhir+json'})

            # There are three possible options here: http://hl7.org/fhir/uv/bulkdata/export.html#bulk-data-status-request
            # Error = 4xx or 5xx status code
            # In-Progress = 202
            # Complete = 200

            if r.status_code == 200:
                # complete
                response = r.json()
                self.output_files = response['output']
                return self.output_files

            elif r.status_code == 202:
                # in progress
                delay = r.headers['Retry-After']

                sleep(int(delay))

            else:
                raise RuntimeError(r.text)

    def get_dataframe(self):
        self._invoke_request()
        self._wait_until_ready()
        
        resources_by_type = {}

        for output_file in self.output_files:
            download_url = output_file['url']
            resource_type = output_file['type']

            r = self.session.get(download_url, headers={'Authorization': f'Bearer {self.get_token()}', 'Accept': 'application/fhir+json'})

            ndjson = r.text.strip()

            if resource_type not in resources_by_type:
                resources_by_type[resource_type] = []

            for line in ndjson.split('\n'):
                if not line.strip():
                    # skip blank lines (for example, an empty output file)
                    continue
                resource = json.loads(line)
                
                if resource_type in self.fhir_paths:
                    fhir_paths = self.fhir_paths[resource_type]
                    filtered_resource = {}
                    for f in fhir_paths:
                        fieldname = f[0]
                        func = f[1]
                        filtered_resource[fieldname] = func(resource)

                        if isinstance(filtered_resource[fieldname], list) and len(filtered_resource[fieldname]) == 1:
                            filtered_resource[fieldname] = filtered_resource[fieldname][0]
                    resource = filtered_resource

                resources_by_type[resource_type].append(resource)
        
        dfs = {}
        
        for resource_type, resources in resources_by_type.items():
            dfs[resource_type] = pd.json_normalize([flatten(r) for r in resources])
        
        return dfs


# And then to invoke it:


fetcher = BulkDataFetcher(server_url, client_id, private_key, key_id, session)

fetcher.add_resource_type('Patient', [
    ("id", "identifier[0].value"),
    ("gender", "gender"),
    ("date_of_birth", "birthDate"),
    ("marital_status", "maritalStatus.coding[0].code")
])

fetcher.add_resource_type('Condition')

dfs = fetcher.get_dataframe()

dfs['Patient']

Unnamed: 0,id,gender,date_of_birth,marital_status
0,6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,male,1965-01-13,M
1,58c297c4-d684-4677-8024-01131d93835e,female,1971-04-05,M
2,538a9a4e-8437-47d3-8c01-1a17dca8f0be,male,1923-03-24,M
3,c6c60742-8694-46e4-bb42-b00bf6d8b536,female,1965-10-27,M
4,fbfec681-d357-4b28-b1d2-5db6434c7846,female,1942-07-04,M
...,...,...,...,...
95,5efb1ac1-d29b-40a5-a3d1-2d682f10bfa7,male,1995-10-19,S
96,c1981741-f90e-4077-9156-429a3c4c5ded,male,1956-05-06,M
97,f98b23bf-4443-46d0-9eaf-563e767cf948,male,1966-02-07,M
98,c536dee9-9ef6-4807-ae20-9f1045c9c7d6,male,1990-11-18,S
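
To make the row-building step behind this table easier to follow, here is a minimal standalone sketch of the same idea: parse NDJSON one line at a time, run each named extractor, and unwrap single-element lists. It is not the class itself. The two sample resources are made up, and the `extractors` are plain Python callables standing in for the compiled FHIRPath expressions, under the assumption (matching the `isinstance(..., list)` check in `get_dataframe`) that each expression evaluates to a list of matches.

```python
import json

# Two made-up NDJSON lines standing in for a downloaded Patient file (not real server output)
ndjson = '\n'.join([
    json.dumps({"resourceType": "Patient", "id": "p1", "gender": "male",
                "identifier": [{"value": "abc"}]}),
    json.dumps({"resourceType": "Patient", "id": "p2", "gender": "female",
                "identifier": [{"value": "def"}], "birthDate": "1971-04-05"}),
])

# Hypothetical extractors: plain callables that return a list of matches,
# standing in for the compiled FHIRPath expressions used by the class
extractors = [
    ("id", lambda r: [i["value"] for i in r.get("identifier", [])[:1]]),
    ("gender", lambda r: [r["gender"]] if "gender" in r else []),
    ("date_of_birth", lambda r: [r["birthDate"]] if "birthDate" in r else []),
]


def extract_rows(ndjson_text, extractors):
    """Turn NDJSON text into a list of flat dicts, one per resource."""
    rows = []
    for line in ndjson_text.split('\n'):
        if not line.strip():
            continue  # ignore blank lines
        resource = json.loads(line)
        row = {}
        for fieldname, func in extractors:
            value = func(resource)
            # Unwrap single-element lists so the cell holds a scalar
            if isinstance(value, list) and len(value) == 1:
                value = value[0]
            row[fieldname] = value
        rows.append(row)
    return rows


rows = extract_rows(ndjson, extractors)
print(rows)
```

Note that a field with no match stays an empty list (the first sample resource has no `birthDate`), which is also how the class above behaves; mapping empty results to `None` before building the DataFrame may be worth considering.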


In [18]:
dfs['Condition']

Unnamed: 0,resourceType,id,clinicalStatus_coding_0_system,clinicalStatus_coding_0_code,verificationStatus_coding_0_system,verificationStatus_coding_0_code,code_coding_0_system,code_coding_0_code,code_coding_0_display,code_text,subject_reference,encounter_reference,onsetDateTime,recordedDate,abatementDateTime
0,Condition,a5a38601-b6fe-46b4-a67e-cde9d5957dde,http://terminology.hl7.org/CodeSystem/conditio...,active,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,40055000,Chronic sinusitis (disorder),Chronic sinusitis (disorder),Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,Encounter/17b801ac-58e3-4f6b-8b48-8e33f3a36086,1985-06-18T17:30:49-04:00,1985-06-18T17:30:49-04:00,
1,Condition,8f818ad4-c292-47e8-8d99-c4c54174b671,http://terminology.hl7.org/CodeSystem/conditio...,active,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,162864005,Body mass index 30+ - obesity (finding),Body mass index 30+ - obesity (finding),Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,Encounter/0953dd44-90bb-4805-badd-169a761a6ab3,2005-01-19T16:30:49-05:00,2005-01-19T16:30:49-05:00,
2,Condition,65d9d5f2-a772-4586-932f-df1f2ce1a863,http://terminology.hl7.org/CodeSystem/conditio...,active,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,15777000,Prediabetes,Prediabetes,Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,Encounter/d4e1370a-a679-4570-a3dc-e4f7ac847512,2013-02-06T16:30:49-05:00,2013-02-06T16:30:49-05:00,
3,Condition,77ac8342-6950-4302-a303-efba12e06785,http://terminology.hl7.org/CodeSystem/conditio...,resolved,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,68496003,Polyp of colon,Polyp of colon,Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,Encounter/58ad433b-3707-4d40-9b63-2a803b4913bd,2015-01-14T16:30:49-05:00,2015-01-14T16:30:49-05:00,2017-05-03T17:30:49-04:00
4,Condition,6514ab0c-bc64-4e1b-aa61-b97d27d72bc7,http://terminology.hl7.org/CodeSystem/conditio...,active,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,271737000,Anemia (disorder),Anemia (disorder),Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2,Encounter/58ad433b-3707-4d40-9b63-2a803b4913bd,2015-01-14T16:30:49-05:00,2015-01-14T16:30:49-05:00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
634,Condition,ab051f6c-4298-407b-9315-2322ce913539,http://terminology.hl7.org/CodeSystem/conditio...,active,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,162864005,Body mass index 30+ - obesity (finding),Body mass index 30+ - obesity (finding),Patient/a845ead4-d9de-42eb-b4b5-eb21a8963578,Encounter/c5ed8aed-2b7e-4630-bd1d-ac5090967edc,2014-11-22T15:43:42-05:00,2014-11-22T15:43:42-05:00,
635,Condition,76c1f07a-f8f2-4705-aa80-5f7a25d7c651,http://terminology.hl7.org/CodeSystem/conditio...,resolved,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,39848009,Whiplash injury to neck,Whiplash injury to neck,Patient/a845ead4-d9de-42eb-b4b5-eb21a8963578,Encounter/9c8b41dd-d6fd-4691-ae46-01b47992dd8d,2015-07-13T16:43:42-04:00,2015-07-13T16:43:42-04:00,2015-08-10T16:43:42-04:00
636,Condition,b9a078eb-bb83-49ed-b4ed-633d1445356d,http://terminology.hl7.org/CodeSystem/conditio...,resolved,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,70704007,Sprain of wrist,Sprain of wrist,Patient/a845ead4-d9de-42eb-b4b5-eb21a8963578,Encounter/f044f05a-8433-4952-926d-dd8e2b4ee44e,2018-07-25T16:43:42-04:00,2018-07-25T16:43:42-04:00,2018-08-15T16:43:42-04:00
637,Condition,0fe427ce-7ea1-4409-8de1-3879f9dc56bb,http://terminology.hl7.org/CodeSystem/conditio...,resolved,http://terminology.hl7.org/CodeSystem/conditio...,confirmed,http://snomed.info/sct,444814009,Viral sinusitis (disorder),Viral sinusitis (disorder),Patient/a845ead4-d9de-42eb-b4b5-eb21a8963578,Encounter/9100e9aa-1206-403b-b2bf-b75ac23991bd,2018-09-26T16:43:42-04:00,2018-09-26T16:43:42-04:00,2018-10-17T16:43:42-04:00
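
The `subject_reference` column holds relative references such as `Patient/6c5d9ca9-...`, while the Patient table's `id` column holds the bare identifier. A rough sketch of linking the two follows. The helper name `patient_id_from_reference` is hypothetical, the three sample rows are copied from the output above, and the sketch assumes that the value extracted from `identifier[0].value` equals the patient's logical id, which holds for this sample data but is not guaranteed on other servers.

```python
from collections import defaultdict


def patient_id_from_reference(reference):
    """Return the id part of a reference such as 'Patient/123' (hypothetical helper)."""
    resource_type, _, logical_id = reference.rpartition('/')
    if not resource_type.endswith('Patient') or not logical_id:
        raise ValueError(f'Not a Patient reference: {reference!r}')
    return logical_id


# A few rows copied from the Condition output above
conditions = [
    {"subject_reference": "Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2",
     "code_text": "Prediabetes"},
    {"subject_reference": "Patient/6c5d9ca9-54d7-42f5-bfae-a7c19cd217f2",
     "code_text": "Polyp of colon"},
    {"subject_reference": "Patient/a845ead4-d9de-42eb-b4b5-eb21a8963578",
     "code_text": "Sprain of wrist"},
]

# Group condition descriptions by the patient they refer to
conditions_by_patient = defaultdict(list)
for condition in conditions:
    patient_id = patient_id_from_reference(condition["subject_reference"])
    conditions_by_patient[patient_id].append(condition["code_text"])

for patient_id, texts in conditions_by_patient.items():
    print(patient_id, texts)
```

With the DataFrames themselves, the same helper could be applied via `dfs['Condition']['subject_reference'].map(patient_id_from_reference)` to add a column that can then be merged against `dfs['Patient']` on `id`.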
