# Select patients with multiple imaging studies of different modalities
---
This notebook briefly demonstrates how to use the MIDRC open APIs to build a cohort of MIDRC patients that have multiple imaging studies of different modalities.

by Chris Meyer, PhD

Manager of Data and User Services at the Center for Translational Data Science at University of Chicago

Last updated: April 2024





## 1) Set up Python environment
---


### Download an API key file containing your credentials
---
1) Navigate to the MIDRC data portal in your browser: https://data.midrc.org.
2) Read and accept the DUA (if you haven't already).
3) Navigate to the user profile page: https://data.midrc.org/identity
4) Click the "Create API Key" button and save the `credentials.json` file somewhere safe.


### Set local variables
---
Change the following `cred` variable path to point to your credentials file downloaded from the MIDRC data portal following the instructions above.

In [None]:
cred = "/Users/christopher/Downloads/midrc-credentials.json" # location of your MIDRC credentials, downloaded from https://data.midrc.org/identity by clicking "Create API key" button and saving the credentials.json locally
api = "https://data.midrc.org" # The base URL of the data commons being queried. This shouldn't change for MIDRC.
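
Before authenticating, it can help to confirm that the credentials file is readable and looks as expected. The sketch below assumes the file is a JSON object with `api_key` and `key_id` entries, which is the layout Gen3 credential files typically have; `missing_credential_keys` is a hypothetical helper, not part of the Gen3 SDK, and the demo file holds placeholder values only.

```python
import json
import os
import tempfile

def missing_credential_keys(path, required=("api_key", "key_id")):
    """Return the required keys that are absent from the JSON file at `path`.

    The key names are an assumption about the credentials file layout.
    """
    with open(path) as f:
        creds = json.load(f)
    return [key for key in required if key not in creds]

# Demonstrate on a throwaway file with placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credentials.json")
    with open(demo_path, "w") as f:
        json.dump({"api_key": "placeholder", "key_id": "placeholder"}, f)
    demo_missing = missing_credential_keys(demo_path)

print(demo_missing)
```

To check your own file, call `missing_credential_keys(cred)`; an empty list means both keys were found.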


### Install / Import Python Packages and Scripts

In [None]:
## Uncomment and run any of the lines below to install packages that the imports in later cells require.

import sys
#!{sys.executable} -m pip install
#!{sys.executable} -m pip install --upgrade pandas
#!{sys.executable} -m pip install --upgrade --ignore-installed PyYAML
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --upgrade gen3
#!{sys.executable} -m pip install pydicom
#!{sys.executable} -m pip install --upgrade Pillow
#!{sys.executable} -m pip install psmpy
#!{sys.executable} -m pip install python-gdcm --upgrade
#!{sys.executable} -m pip install pylibjpeg --upgrade

In [None]:
## Import Python Packages and scripts

import os, subprocess
import pandas as pd
#import numpy as np
#import pydicom

# import some Gen3 packages
import gen3
from gen3.auth import Gen3Auth
from gen3.query import Gen3Query



### Initiate instances of the Gen3 SDK Classes using credentials file for authentication
---
Make sure the "cred" variable reflects the location of your credentials file.

In [None]:
auth = Gen3Auth(api, refresh_file=cred) # authentication class
query = Gen3Query(auth) # query class


## 2) Build Cohorts by Sending Queries to the MIDRC APIs
#### General notes on sending queries:
* There are many ways to query and access metadata for cohort building in MIDRC, but this notebook will focus on using the [Gen3](https://gen3.org) graphQL query service ["guppy"](https://github.com/uc-cdis/guppy/#readme). This is the backend query service that [MIDRC's data explorer GUI](https://data.midrc.org/explorer) uses. So, anything you can do in the explorer GUI, you can do with guppy queries, and more!
* The guppy graphQL service has more functionality than is demonstrated in this simple example. You can find extensive documentation in GitHub [here](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md) in case you'd like to build your own queries from scratch.
* The Gen3 SDK (initialized as `query` above in this notebook) has Python wrapper functions that make sending queries to the guppy graphQL API simpler. The SDK's query module can be viewed in GitHub [here](https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/query.py).
* Guppy queries focus on a particular type of data (cases, imaging studies, files, etc.), which corresponds to the major tabs in [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* Queries include arguments that are akin to selecting filter values in [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* To see more documentation about how to use and combine filters with various operator logic (like AND/OR/IN, etc.) see [this page](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md#filter).
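
Filter objects are plain nested dictionaries, so they can be assembled with small helper functions. The sketch below builds a filter with the same shape that this notebook passes to `raw_data_download`; `nested_filter` and `all_of` are hypothetical helpers introduced here for illustration and are not part of guppy or the Gen3 SDK.

```python
def nested_filter(path, operator, field, value):
    """Build one nested filter clause, e.g. operator "=" or "IN"."""
    return {"nested": {"path": path, operator: {field: value}}}

def all_of(*filters):
    """Combine clauses so that every one of them must match."""
    return {"AND": list(filters)}

# Cases with a Chest study, a DX or CR study, and a CT study
example_filter = all_of(
    nested_filter("imaging_studies", "=", "loinc_system", "Chest"),
    nested_filter("imaging_studies", "IN", "study_modality", ["DX", "CR"]),
    nested_filter("imaging_studies", "IN", "study_modality", ["CT"]),
)
print(example_filter)
```

Building filters this way keeps the bracket nesting in one place, which makes it easier to swap in other modalities or body parts.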

---


#### Set query parameters
---
* Here, we'll send a query to the `case` guppy index, which corresponds to the "Cases" tab of [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* The filters defined below can be modified to return different subsets of cases. Here, we'll select cases that have at least one Chest CT and at least one Chest X-ray (CXR).
* If our query request is successful, the API response should be in JSON format, and it should contain a list of patient IDs along with any other patient data we ask for.


In [None]:
### Set some query parameters for filtering on each case's nested "imaging_studies"

## Imaging study modality filter: we want imaging studies with at least one CT and one CR or DX
modality_1 = ["DX", "CR"]
modality_2 = ["CT"]

## Imaging study body part filter: here we select "chest" as the "LOINC system" filter, which is the body part examined
body_part_examined = "Chest"

## The fields we want our query to return; 
## Note: you can set fields to "None" to return all fields with the query in the cell below
fields = ["project_id",
          "submitter_id",
          "imaging_studies.loinc_system",
          "imaging_studies.study_uid",
          "imaging_studies.study_modality",
          "_imaging_studies_count",
          "_cr_series_file_count",
          "_dx_series_file_count",
          "_ct_series_file_count"]

In [None]:
## Note: the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

cases = query.raw_data_download(
                    data_type="case",
                    #fields=None,
                    fields=fields,
                    filter_object={
                        "AND": [
                            {"nested":{"path":"imaging_studies",
                                        "=": {"loinc_system": body_part_examined}}},
                            {"nested":{"path":"imaging_studies",
                                       "IN":{"study_modality":modality_1}}},
                            {"nested":{"path":"imaging_studies",
                                       "IN":{"study_modality":modality_2}}}
                        ],
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(cases) > 0 and "submitter_id" in cases[0]:
    case_ids = [i['submitter_id'] for i in cases] ## make a list of the case IDs (submitter_id) returned
    print("Query returned {} cases with data for each that looks like this:\n\t".format(len(cases)))
    display(cases[0:1])
else:
    print("Your query returned no data! Please check that the query parameters are valid.")

### Filter Query Results for only the desired imaging studies
---
Our query has returned all cases that have at least one imaging study of the Chest, and have at least one CXR and one CT. However, those cases may have imaging studies of other modalities or body parts we're not interested in. 

So, next we'll filter the query results to obtain only imaging studies that are both of the Chest and of modality CT, CR, or DX, thus excluding studies of other body parts or modalities.

In [None]:
## Keep only the Chest CT and Chest X-ray studies so we can build a file download manifest
wanted_modalities = modality_1 + modality_2
desired_studies = {
    i['submitter_id']: [
        j for j in i.get('imaging_studies', [])
        if j.get('study_modality')  # skip studies with no modality recorded
        and j['study_modality'][0] in wanted_modalities
        and body_part_examined in (j.get('loinc_system') or '')
    ]
    for i in cases
}
list(desired_studies.items())[0]
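
Because the body-part filter and the modality filters may each be satisfied by a different study of the same case, a case could in principle end up with only one modality group once non-Chest studies are dropped. The sketch below checks for that. It assumes `study_modality` is a list of strings, as the filtering code above implies; `has_both_groups` is a hypothetical helper and the sample records are made up.

```python
def has_both_groups(studies, group_1, group_2):
    """True if at least one study falls in each of the two modality groups."""
    modalities = {m for s in studies for m in (s.get("study_modality") or [])}
    return bool(modalities & set(group_1)) and bool(modalities & set(group_2))

# Hypothetical filtered studies, keyed by case ID
sample_studies = {
    "case-A": [{"study_modality": ["CT"]}, {"study_modality": ["DX"]}],
    "case-B": [{"study_modality": ["CR"]}],  # no Chest CT left after filtering
}

complete_cases = [
    case_id for case_id, studies in sample_studies.items()
    if has_both_groups(studies, ["DX", "CR"], ["CT"])
]
print(complete_cases)
```

In the notebook, the same check could be applied to `desired_studies` with `modality_1` and `modality_2` as the two groups.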


## 3) Send another query to get data file details for our cohort / case ID
---
Now that we have a list of imaging studies we're interested in from our original cohort of cases, we can run another query to get the `object_id` of each of the imaging series files related to those imaging studies. This is achieved by querying the `data_file` guppy index, which corresponds to the "Data Files" tab of the MIDRC data explorer GUI.

All MIDRC data files, including both images and annotations, are listed in the guppy index `data_file`, which is queried in a similar manner to our query of the `case` index above. The query parameter `data_type` below determines which guppy (Elasticsearch) index we're querying.

To get only `data_file` records that correspond to our imaging study cohort built previously, we'll use the list of study UIDs as a query filter. 


### Set 'data_file' query parameters
---
Here, we'll utilize the property `source_node` to filter the list of files for our cohort to only those matching the type of files we're interested in. In this example, we ask only for CR, DX, and CT images, which will exclude any other types of files related to our desired imaging studies like annotations or supplemental files.

We're also using the property `study_uid` as a filter to restrict the `data_file` records returned down to those associated with the imaging studies in our cohort built above. 


In [None]:
## We only want CR, DX, and CT imaging series files, so we can use the "source_node" to filter out other types of data files
source_nodes = ['cr_series_file', 'dx_series_file', 'ct_series_file']

# Build a list of study UIDs to use as a filter in our data_file query
all_study_uids = []
for case_id in desired_studies:
    studies = desired_studies[case_id]
    study_uids = [i['study_uid'] for i in studies]
    all_study_uids += study_uids

display(len(list(set(all_study_uids))))
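
The cell above counts unique study UIDs but leaves any duplicates in the list itself. The sketch below removes duplicates while keeping the original order, and also shows one way to split a long list into smaller batches. Whether the query service needs batching for long `IN` lists is not something this notebook establishes, so treat the batching step and its size as assumptions; both helpers are hypothetical.

```python
def unique_in_order(values):
    """Drop duplicates while preserving first-seen order."""
    return list(dict.fromkeys(values))

def batches(values, size):
    """Yield consecutive slices of `values` with at most `size` items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

# Hypothetical study UIDs
sample_uids = ["1.2.3", "1.2.4", "1.2.3", "1.2.5"]
deduped = unique_in_order(sample_uids)
print(deduped)
print(list(batches(deduped, 2)))
```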


In [None]:
## Search for specific files associated with our cohort by adding "study_uid" as a filter
# * Note: "fields" is set to "None" in this query, which by default returns all the properties available
data_files = query.raw_data_download(
                    data_type="data_file",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"study_uid": all_study_uids}},
                            {"IN": {"source_node": source_nodes}},
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(data_files) > 0:
    object_ids = [i['object_id'] for i in data_files if 'object_id' in i] ## make a list of the file object_ids returned by our query
    print("Query returned {} data files with {} object_ids.".format(len(data_files),len(object_ids)))
    print("Data is a list with rows like this:\n\t {}".format(data_files[0:1]))
else:
    print("Your query returned no data! Please check that the query parameters are valid.")
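
Before downloading anything, a quick summary of the returned records shows how many files of each type there are and roughly how much data is involved. The sketch below groups records by `source_node`; the `file_size` field is an assumption about what the `data_file` index returns, and records without it are counted as zero bytes. The sample records are made up.

```python
from collections import Counter

def summarize_files(records):
    """Count records per source_node and sum their sizes in bytes."""
    counts = Counter(r.get("source_node", "unknown") for r in records)
    total_bytes = sum(r.get("file_size") or 0 for r in records)  # assumed field
    return counts, total_bytes

# Hypothetical data_file records
sample_files = [
    {"source_node": "ct_series_file", "file_size": 100},
    {"source_node": "cr_series_file", "file_size": 50},
    {"source_node": "ct_series_file"},
]

counts, total_bytes = summarize_files(sample_files)
print(dict(counts), total_bytes)
```

In the notebook, `summarize_files(data_files)` would give the same breakdown for the real query results.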

## 4) Access data files using their object_id / data GUID (globally unique identifiers)
---
In order to download files stored in MIDRC, users need to reference the file's object_id (AKA data GUID or Globally Unique IDentifier).

Once we have a list of GUIDs we want to download, we can use either the gen3-client or the gen3 SDK to download the files. You can also access individual files in your browser after logging in and entering the GUID after the `files/` endpoint, as in this URL: https://data.midrc.org/files/GUID

where GUID is the actual GUID, e.g.: https://data.midrc.org/files/dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec
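
The URL pattern above is simple enough to build in code. This is a minimal sketch; `file_page_url` is a hypothetical helper, and it only formats the URL without contacting the portal.

```python
def file_page_url(guid, base_url="https://data.midrc.org"):
    """Return the browser URL for a file, given its GUID (object_id)."""
    return "{}/files/{}".format(base_url.rstrip("/"), guid)

print(file_page_url("dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec"))
```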

For instructions on how to install and use the gen3-client, please see [the MIDRC quick-start guide](https://data.midrc.org/dashboard/Public/documentation/Gen3_MIDRC_GetStarted.pdf), which can be found linked here and in the MIDRC data portal header as "Get Started".

Below we use the gen3 SDK command `gen3 drs-pull object` which is [documented in detail here](https://github.com/uc-cdis/gen3sdk-python/blob/master/docs/howto/drsDownloading.md).

### Use the Gen3 SDK command `gen3 drs-pull object` to download an individual file

In [None]:
## Make a new directory for downloaded files
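#import shutil; shutil.rmtree("downloads", ignore_errors=True) # uncomment to clear previous downloads
os.makedirs("downloads", exist_ok=True) # portable equivalent of `mkdir -p downloads`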


In [None]:
## We can use a simple loop to download all files and keep track of successes and failures

success,failure,other=[],[],[]
count,total = 0,len(object_ids)
for object_id in object_ids:
    count+=1
    cmd = "gen3 --auth {} --endpoint data.midrc.org drs-pull object {} --output-dir downloads".format(cred,object_id)
    stout = subprocess.run(cmd, shell=True, capture_output=True, text=True) # text=True decodes the captured output to str
    print("Progress ({}/{}): {}".format(count,total,stout.stdout))
    if "failed" in str(stout.stdout):
        failure.append(object_id)
    elif "successfully" in str(stout.stdout):
        success.append(object_id)
    else:
        other.append(object_id)
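
The loop above builds the command as a single string and runs it through the shell, which can break if the credentials path contains spaces. One alternative is to pass the command as a list of arguments. The sketch below only assembles the list, using the same flags as the command above; `drs_pull_object_args` is a hypothetical helper and the path and GUID are placeholders.

```python
import shlex

def drs_pull_object_args(cred_path, object_id, output_dir="downloads",
                         endpoint="data.midrc.org"):
    """Build the `gen3 drs-pull object` command as an argument list."""
    return [
        "gen3", "--auth", cred_path, "--endpoint", endpoint,
        "drs-pull", "object", object_id, "--output-dir", output_dir,
    ]

args = drs_pull_object_args("/path/with spaces/credentials.json",
                            "dg.MD1R/example-guid")
print(shlex.join(args))  # shell-quoted form, handy for logging
```

The list can then be passed to `subprocess.run(args, capture_output=True)` without `shell=True`, so no manual quoting is needed.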


### Export a Gen3 file download "manifest"
---
The following script generates a Gen3-style data file download manifest JSON file. 

This `manifest.json` file can be used with either the gen3-client command-line tool or the `gen3 drs-pull manifest` command shown below.


In [None]:
## Export a Gen3 file download "manifest" JSON file to use with the gen3-client command-line tool or the `gen3 drs-pull manifest` command.

import json

def write_manifest(guids, filename):
    """Write a Gen3-style manifest: a JSON list of {"object_id": ...} records."""
    # json.dump guarantees valid JSON, including for an empty list of GUIDs
    manifest = [{"object_id": guid} for guid in guids]
    with open(filename, "w") as mani:
        json.dump(manifest, mani, indent=2)

    print("\tDone ({}/{}).".format(len(manifest), len(guids)))
    print("\tManifest written to file: {}".format(filename))
    return filename

manifest_filename = "multimodal_cases_files_manifest.json"
write_manifest(guids=object_ids,filename=manifest_filename)
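
A quick way to confirm that a manifest is valid is to read it back and compare the GUIDs with the list that was written. The sketch below does this on a small throwaway manifest with placeholder GUIDs; `read_manifest_ids` is a hypothetical helper.

```python
import json
import os
import tempfile

def read_manifest_ids(filename):
    """Load a manifest JSON file and return its object_ids in order."""
    with open(filename) as f:
        entries = json.load(f)
    return [entry["object_id"] for entry in entries]

# Write a tiny demo manifest, then read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_manifest = os.path.join(tmp, "manifest.json")
    with open(demo_manifest, "w") as f:
        json.dump([{"object_id": "dg.MD1R/example-1"},
                   {"object_id": "dg.MD1R/example-2"}], f)
    demo_ids = read_manifest_ids(demo_manifest)

print(demo_ids)
```

In the notebook, `read_manifest_ids(manifest_filename) == object_ids` should evaluate to `True` if the manifest was written correctly.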

In [None]:
ll

In [None]:
cmd = "gen3 --auth {} --endpoint data.midrc.org drs-pull manifest {}".format(cred,manifest_filename)
print(cmd)
#stout = subprocess.run(cmd, shell=True, capture_output=True)
# This command is better run in a terminal so you can watch the progress bar. Running it in the notebook may take quite some time.