# Cohort Building and Data Access Using the MIDRC BDF Imaging Hub

---

This notebook briefly demonstrates how to use the MIDRC Biomedical Imaging Hub (BIH) APIs to discover medical imaging datasets across the Biomedical Data Fabric (BDF), including those in data resources other than the MIDRC data commons.

Anything a user can do in the [MIDRC BIH Explorer graphical user interface](https://imaging-hub.data-commons.org/Explorer), including using complex search criteria to select similar subsets of images distributed across multiple repositories, can also be achieved programmatically using API requests.

by Chris Meyer, PhD

August 2025

## 1) Set up Python environment
---

### Download an API key file containing your credentials
---
1) Navigate to the MIDRC BIH login page in your browser: https://imaging-hub.data-commons.org/portal/login.
2) Navigate to the user profile page: https://imaging-hub.data-commons.org/portal/identity.
3) Click on the button "Create API Key" and save the `credentials.json` file somewhere safe as `bih-credentials.json`.

### Set local variables
---
Change the following `bcred` variable path to point to your credentials file downloaded from the MIDRC data portal following the instructions above.

In [None]:
bcred = "/Users/cgmeyer/Downloads/bih-credentials.json" # location of your MIDRC BIH credentials, downloaded from https://imaging-hub.data-commons.org/portal/identity by clicking "Create API key" button and saving the credentials.json locally
bapi = "https://imaging-hub.data-commons.org" # The base URL of the resource being queried. This shouldn't change for MIDRC BIH
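
Before authenticating, it can help to confirm that the path really points at an API key file. The helper below is a minimal sketch and not part of the Gen3 SDK: the function name `check_credentials_file` is hypothetical, and it assumes that a Gen3 API key file is a JSON object containing `api_key` and `key_id` entries.

```python
import json
from pathlib import Path


def check_credentials_file(path):
    """Return True if `path` looks like a Gen3 API key file, else False."""
    p = Path(path).expanduser()
    if not p.is_file():
        print(f"No file found at {p}")
        return False
    try:
        creds = json.loads(p.read_text())
    except json.JSONDecodeError as err:
        print(f"{p} is not valid JSON: {err}")
        return False
    if not isinstance(creds, dict):
        print(f"{p} does not contain a JSON object")
        return False
    # Assumption: Gen3 API key files carry "api_key" and "key_id" entries.
    missing = [key for key in ("api_key", "key_id") if key not in creds]
    if missing:
        print(f"{p} is missing expected keys: {missing}")
        return False
    return True


# Example usage with the variable defined above:
# check_credentials_file(bcred)
```

The check only inspects the file locally; it does not confirm that the key is still valid on the server.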


### Install / Import Python Packages and Scripts

In [None]:
## Some of the packages imported in the following cells may need to be installed first.
## Uncomment the lines for any packages you need to install.

import sys
#!{sys.executable} -m pip install --upgrade pandas
#!{sys.executable} -m pip install --upgrade --ignore-installed PyYAML
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install --upgrade gen3
#!{sys.executable} -m pip install pydicom
#!{sys.executable} -m pip install --upgrade Pillow
#!{sys.executable} -m pip install psmpy
#!{sys.executable} -m pip install python-gdcm --upgrade
#!{sys.executable} -m pip install pylibjpeg --upgrade


In [None]:
## Import Python Packages and scripts

import os, subprocess
import pandas as pd
import numpy as np
import pydicom
from PIL import Image
import glob
#import gdcm
#import pylibjpeg

# import some Gen3 packages
import gen3
from gen3.auth import Gen3Auth
from gen3.query import Gen3Query
from IPython.display import display

### Initiate instances of the Gen3 SDK Classes using your credentials file for authentication
---
Again, make sure the `bcred` path variable set above reflects the location of _your_ credentials file.

In [None]:
bauth = Gen3Auth(bapi, refresh_file=bcred) # authentication class
bquery = Gen3Query(bauth) # query class


## 2) Build Cohorts by Sending Queries to the MIDRC BIH Metadata API
---

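Currently, there are four views of the imaging data in MIDRC BIH: datasets, patients, imaging studies, and imaging series. These four views correspond to the four indices queried in the sections below: `dataset`, `subject`, `imaging_study`, and `imaging_series`.
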
### Find Imaging Studies of Interest

* Here, we'll send a query to the `imaging_study` index, which is the default table view in the [MIDRC BIH data explorer GUI](https://imaging-hub.data-commons.org/Explorer).
* The filters defined below can be added to, removed, or modified to return different subsets of imaging studies.
* If our query request is successful, the API response should be in JSON format. The response will be a list of structured data records, each corresponding to a single imaging study. 
* The Gen3 query service "guppy" has extensive documentation in GitHub [here](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md), which will guide you through query syntax, available types of filters, operators, etc.
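
Filter objects are plain nested dictionaries, so they can be assembled programmatically. The sketch below uses only the `=`, `IN`, `>=`, `<=`, and `AND` operators that appear in this notebook's queries; the helper name `build_filter` and its argument names are hypothetical and not part of the Gen3 SDK.

```python
def build_filter(equals=None, any_of=None, ranges=None):
    """Assemble a guppy-style filter object from simple Python containers.

    equals: {field: value}          -> {"=": {field: value}}
    any_of: {field: [values]}       -> {"IN": {field: [values]}}
    ranges: {field: (low, high)}    -> inclusive range built from >= and <=
    """
    clauses = []
    for field, value in (equals or {}).items():
        clauses.append({"=": {field: value}})
    for field, values in (any_of or {}).items():
        clauses.append({"IN": {field: list(values)}})
    for field, (low, high) in (ranges or {}).items():
        clauses.append({"AND": [{">=": {field: low}}, {"<=": {field: high}}]})
    # All clauses must hold at once, so they are combined with AND.
    return {"AND": clauses}


example_filter = build_filter(
    equals={"PatientSex": "Male"},
    any_of={"race": ["Black", "Black or African American"]},
    ranges={"PatientAge": (65, 70)},
)
print(example_filter)
```

The resulting dictionary can be passed as the `filter_object` argument of `raw_data_download`.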

#### Fetch the Query Schema 
---

To see all the fields that can be used as filter parameters in queries, we can send a request to [get the query schema/mapping](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md#mapping-query). Here we specify the `imaging_study` index to see all the fields in BIH related to imaging studies.
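
Because the mapping query has the same shape for every index, it can be generated and its response unpacked with two small helpers. This is a sketch under an assumption: the response is taken to be a dictionary shaped like `{"data": {"_mapping": {index: [field names]}}}`, as described in the guppy documentation linked above. The helper names `mapping_query` and `fields_from_mapping` are hypothetical.

```python
def mapping_query(index):
    """Build the GraphQL mapping query string for one index."""
    if not index.isidentifier():
        raise ValueError(f"Unexpected index name: {index!r}")
    return "{\n  _mapping {\n    %s\n  }\n}" % index


def fields_from_mapping(response, index):
    """Return the sorted field names for `index` from a mapping response.

    Assumes a response shaped like {"data": {"_mapping": {index: [...]}}}.
    """
    mapping = (response or {}).get("data", {}).get("_mapping", {})
    return sorted(mapping.get(index) or [])


print(mapping_query("imaging_study"))

# Example usage with the query object created earlier:
# response = bquery.graphql_query(query_string=mapping_query("imaging_study"), variables=None)
# print(fields_from_mapping(response, "imaging_study"))
```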




In [None]:
query_string = """{
  _mapping {
    imaging_study
  }
}"""
bquery.graphql_query(query_string=query_string,variables=None)


#### Set some filter values to subset the imaging studies in BIH

In [None]:
## Find imaging studies with the following parameters:
StudyDescription = ["XR Chest AP or PA", 
                    "CHEST AP PORT",
                    "CHEST PORT 1 VIEW (RAD)-CS",
                    "CHEST PA & LATERAL (RAD)-CS",
                    "CHEST AP VIEWONLY",
                    "Portable Chest",
                    "Chest Portable",
                    "CHEST AP PORTABLE"]

## Filter studies based on some patient attributes:
PatientSex = "Male"

min_PatientAge = 65
max_PatientAge = 70

EthnicGroup = ["Non-Hispanic/Non-Latino",
                "Not Hispanic or Latino"]

race = ["Black",
        "Black or African American"]


In [None]:
studies = bquery.raw_data_download(
                    data_type="imaging_study",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"StudyDescription": StudyDescription}},
                            {"=": {"PatientSex": PatientSex}},
                            {"IN": {"EthnicGroup": EthnicGroup}},
                            {"IN": {"race": race}},
                            {"AND":[{">=":{"PatientAge":min_PatientAge}},{"<=":{"PatientAge":max_PatientAge}}]}
                        ]
                    },
                )


In [None]:
# Summarize the query response
if len(studies) > 0:
    study_ids = list(set([i['submitter_id'] for i in studies if 'submitter_id' in i])) ## make a list of the imaging study IDs returned
    platforms = list(set([rec['commons_name'][0] for rec in studies if 'commons_name' in rec])) ## make a list of the platforms hosting the studies
    subject_ids = list(set([rec['subject_id'][0] for rec in studies if 'subject_id' in rec])) ## make a list of the subject IDs returned
    print(f"Query returned {len(studies)} imaging studies for {len(subject_ids)} subjects across {len(platforms)} platforms: {platforms}.")
    print("Data is a list with rows like this:")
    for k,v in studies[0].items():
      print("\t\'{}' : '{}'".format(k,v))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")
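
The same kind of summary is useful for every index, so the unique-value logic can be factored into a helper. The sketch below is hypothetical (the name `unique_values` is not part of the Gen3 SDK) and differs from the cell above in one respect: for list-valued fields it collects every element rather than only the first. It also accepts scalar-valued fields.

```python
def unique_values(records, field):
    """Collect the distinct values of `field` across a list of record dicts.

    Handles fields that hold a single value as well as fields that hold a
    list of values; missing fields and None values are skipped.
    """
    values = set()
    for rec in records:
        value = rec.get(field)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            values.update(v for v in value if v is not None)
        else:
            values.add(value)
    return sorted(values, key=str)


# Example usage with the query response from above:
# platforms = unique_values(studies, "commons_name")
# subject_ids = unique_values(studies, "subject_id")
```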

In [None]:
studies_df = pd.DataFrame(studies)
display(studies_df)

In [None]:
## Export the file metadata as a TSV file
filename = "MIDRC_BIH_imaging_studies_metadata.tsv"
studies_df.to_csv(filename, sep='\t')

In [None]:
## Explore counts of patient demographics
display(studies_df.value_counts('EthnicGroup'))
race_df = studies_df['race'].explode()
display(race_df.value_counts())
display(studies_df['StudyDescription'].value_counts())

### Find Imaging Series of Interest
---
Now we will search over the >1M imaging series indexed in the MIDRC BIH. 
* First, we'll send a request to get the imaging_series schema/mapping.
* Then we'll set some values to use as filters in our data download request using the same [raw_data_download](https://github.com/uc-cdis/gen3sdk-python/blob/2b4fb5ad9facd7cd37818743b558251b48e1f219/gen3/query.py#L146) SDK function we used earlier for imaging studies.
* The API response should be a list of structured data records, each one corresponding to a single imaging series indexed in MIDRC BIH.

In [None]:
query_string = """{
  _mapping {
    imaging_series
  }
}"""
bquery.graphql_query(query_string=query_string,variables=None)


In [None]:
### Set some "imaging_series" query parameters to select chest/lung CT imaging series for female patients with COVID-19

## Here we select imaging series with a BodyPartExamined of "LUNG" or "CHEST"
BodyPartExamined = ["LUNG","CHEST"]

## Here we select imaging series with a Modality of "CT"
Modality = "CT"

## Here we select imaging series with a PatientSex of "Female"
PatientSex = "Female"

## Here we select imaging series with a disease_type of "COVID-19"
## (to select lung cancer instead, swap in the commented-out list below)
#disease_type = ["Non-small Cell Lung Cancer",
#                "Lung Cancer"]

disease_type = ["COVID-19"]


In [None]:
series = bquery.raw_data_download(
                    data_type="imaging_series",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"BodyPartExamined": BodyPartExamined}},
                            {"=": {"Modality": Modality}},
                            {"=": {"PatientSex": PatientSex}},
                            {"IN": {"disease_type": disease_type}},
                        ]
                    },
                )

In [None]:
if len(series) > 0:
    series_ids = list(set([i['submitter_id'] for i in series if 'submitter_id' in i])) ## make a list of the imaging series IDs returned
    object_ids = list(set([rec['object_ids'][0] for rec in series if 'object_ids' in rec and rec['object_ids'] is not None])) ## make a list of the object_ids (file GUIDs) returned
    platforms = list(set([rec['commons_name'][0] for rec in series if 'commons_name' in rec])) ## make a list of the platforms hosting the series
    subject_ids = list(set([rec['subject_id'][0] for rec in series if 'subject_id' in rec])) ## make a list of the subject IDs returned
    print(f"Query returned {len(series)} imaging series for {len(subject_ids)} subjects across {len(platforms)} platforms: {platforms}.")
    print("Data is a list with rows like this:")
    for k,v in series[0].items():
      print("\t\'{}' : '{}'".format(k,v))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")


In [None]:
series_df = pd.DataFrame(series)
display(series_df)

In [None]:
## Export the file metadata as a TSV file
filename = "MIDRC_BIH_imaging_series_metadata.tsv"
series_df.to_csv(filename, sep='\t')

### Find Patient Cohorts of Interest

* Here, we'll send a query to the `subject` index, which corresponds to the Subjects tab of the MIDRC BIH Explorer GUI.
* First, we'll specify some values of subject attributes to send as filters, then we'll send our query request using the Gen3 SDK.
* The response should be a list of structured records, each of which corresponds to a single subject indexed in BIH.

In [None]:
query_string = """{
  _mapping {
    subject
  }
}"""
bquery.graphql_query(query_string=query_string,variables=None)


In [None]:
### Set some "subject" query parameters to select Asian subjects in BIH with breast cancer
race = "Asian"
disease_type = "Breast Cancer"
primary_site = "Breast"



In [None]:
subjects = bquery.raw_data_download(
                    data_type="subject",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"=": {"race": race}},
                            {"=": {"primary_site": primary_site}},
                            {"=": {"disease_type": disease_type}},
                        ]
                    },
                )

In [None]:
if len(subjects) > 0:
    subject_ids = list(set([i['submitter_id'] for i in subjects if 'submitter_id' in i])) ## make a list of the subject IDs returned
    platforms = list(set([rec['commons_name'][0] for rec in subjects if 'commons_name' in rec])) ## make a list of the platforms hosting the subjects
    print(f"Query returned {len(subjects)} subjects across {len(platforms)} platform(s): {platforms}.")
    print("Data is a list with rows like this:")
    for k,v in subjects[0].items():
      print("\t\'{}' : '{}'".format(k,v))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")


In [None]:
subjects_df = pd.DataFrame(subjects)
display(subjects_df)

In [None]:
## Export the file metadata as a TSV file
filename = "MIDRC_BIH_imaging_subjects_metadata.tsv"
subjects_df.to_csv(filename, sep='\t')

### Find Datasets of Interest

* Here, we'll send a query to the `dataset` index, which corresponds to the Datasets tab of the MIDRC BIH Explorer GUI.
* First, we'll specify some values of dataset attributes to send as filters, then we'll send our query request using the Gen3 SDK.
* The response should be a list of structured records, each of which corresponds to a single dataset indexed in BIH.

In [None]:
query_string = """{
  _mapping {
    dataset
  }
}"""
bquery.graphql_query(query_string=query_string,variables=None)


In [None]:
### Set some "dataset" query parameters to select datasets in BIH
disease_type = "Non-small Cell Lung Cancer"
primary_site = ["Lung","Chest","Esophagus, Lung, Pancreas, Thymus"]



In [None]:
datasets = bquery.raw_data_download(
                    data_type="dataset",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"=": {"disease_type": disease_type}},
                            {"IN": {"primary_site": primary_site}},
                        ]
                    },
                )

In [None]:
if len(datasets) > 0:
    platforms = list(set([rec['commons_name'] for rec in datasets if 'commons_name' in rec])) ## make a list of the platforms hosting the datasets
    print(f"Query returned {len(datasets)} datasets across {len(platforms)} platform(s): {platforms}.")
    print("Data is a list with rows like this:")
    for k,v in datasets[0].items():
      print("\t\'{}' : '{}'".format(k,v))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")


In [None]:
datasets_df = pd.DataFrame(datasets)
display(datasets_df.sort_values(by='submitter_id', key=lambda col: col.str.lower(), ascending=False))

* Note: There are some datasets that may be hosted by more than one repository. Researchers should ensure they are not including duplicates in any analyses or AI training sets.  
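
One rough way to spot candidates for such duplicates is to group dataset records by a normalized name and flag names that appear under more than one host. This is only a sketch: it assumes that a dataset hosted in several repositories carries the same `submitter_id` (up to case and whitespace) in each, which may not hold, and the helper name `possible_duplicates` is hypothetical. The example records are made up for illustration.

```python
from collections import defaultdict


def possible_duplicates(records, name_field="submitter_id", host_field="commons_name"):
    """Map each normalized dataset name to its hosts when there is more than one.

    Assumption: the same dataset uses the same name (ignoring case and
    surrounding whitespace) in every repository that hosts it.
    """
    hosts_by_name = defaultdict(set)
    for rec in records:
        name = rec.get(name_field)
        host = rec.get(host_field)
        if not name or not host:
            continue
        hosts = host if isinstance(host, (list, tuple)) else [host]
        hosts_by_name[str(name).strip().lower()].update(hosts)
    return {
        name: sorted(hosts)
        for name, hosts in hosts_by_name.items()
        if len(hosts) > 1
    }


# Made-up records, for illustration only
example_records = [
    {"submitter_id": "Example-Set", "commons_name": "Repository A"},
    {"submitter_id": "example-set ", "commons_name": "Repository B"},
    {"submitter_id": "Other-Set", "commons_name": "Repository A"},
]
print(possible_duplicates(example_records))
```

Any names flagged this way still need manual review before records are dropped from an analysis.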

## 3) Access image files using their object_id / data GUID (globally unique identifiers)
---
There are a number of ways to access the image files indexed in MIDRC BIH. In general, users will need to understand the host platform's process for downloading files. For Gen3-powered data commons like the MIDRC Data Commons, files are referenced programmatically by their object_id (AKA "data GUID" or "Globally Unique IDentifier", which is an example of a GA4GH DRS URI).

Once we have a list of object_ids / GUIDs we want to download, we can use either the gen3-client or the gen3 SDK to download the files. You can also access individual files in your browser after logging in and entering the GUID after the `files/` endpoint, as in this URL: https://data.midrc.org/files/GUID

where GUID is the actual GUID, e.g.: https://data.midrc.org/files/dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec
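
The browser URL for a file can be assembled from its GUID with a few lines of Python. The sketch below is illustrative: the helper name `file_url` is hypothetical, and the validation pattern assumes that GUIDs look like the example above (an optional prefix such as `dg.MD1R/` followed by a UUID), which may not cover every host platform.

```python
import re

# Assumption: a GUID is an optional "prefix/" followed by a standard UUID.
GUID_PATTERN = re.compile(
    r"^(?:[\w.]+/)?[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$"
)


def file_url(guid, portal="https://data.midrc.org"):
    """Return the portal URL for a file GUID, as described above."""
    if not GUID_PATTERN.match(guid):
        raise ValueError(f"Value does not look like a GUID: {guid!r}")
    return f"{portal.rstrip('/')}/files/{guid}"


print(file_url("dg.MD1R/b87d0db3-d95a-43c7-ace1-ab2c130e04ec"))
```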

For instructions on how to install and use the gen3-client, please see [the MIDRC quick-start guide](https://data.midrc.org/dashboard/Public/documentation/Gen3_MIDRC_GetStarted.pdf).

Below we use the gen3 SDK command `gen3 drs-pull object` which is [documented in detail here](https://github.com/uc-cdis/gen3sdk-python/blob/master/docs/howto/drsDownloading.md).


### Get credentials from the host platform MIDRC Data Commons

### Download an API key file containing your credentials
---
1) Navigate to the MIDRC data portal in your browser: https://data.midrc.org.
2) Read and accept the DUA (if you haven't already).
3) Navigate to the user profile page: https://data.midrc.org/identity
4) Click on the button "Create API Key" and save the `credentials.json` file somewhere safe.
5) Change the following `cred` variable path to point to your credentials file downloaded from the MIDRC data portal following the instructions above.

In [None]:
cred = "/Users/cgmeyer/Downloads/midrc-credentials.json" # location of your MIDRC credentials, downloaded from https://data.midrc.org/identity by clicking "Create API key" button and saving the credentials.json locally

### Make a list of object_ids to download

In [None]:
## get the subset of imaging series that have object_ids
series_with_files = [rec for rec in series if 'object_ids' in rec and rec['object_ids']]

## make a list of the first object_id of each of those imaging series
object_ids = list(set([rec['object_ids'][0] for rec in series_with_files]))

print(f"Found {len(object_ids)} object_ids for the {len(series)} imaging series selected in BIH.")
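
The cell above keeps only the first object_id of each series. If a series can list more than one object_id, all of them can be gathered with a small helper. This is a sketch: the name `collect_object_ids` is hypothetical, and it assumes the `object_ids` field holds a list of strings (or a single string) when present.

```python
def collect_object_ids(records, field="object_ids"):
    """Return every distinct object_id across the records, in first-seen order."""
    seen = {}
    for rec in records:
        value = rec.get(field) or []
        if isinstance(value, str):
            value = [value]  # tolerate a single ID stored as a plain string
        for object_id in value:
            if object_id:
                seen.setdefault(object_id, None)
    return list(seen)


# Example usage with the series records queried earlier:
# all_object_ids = collect_object_ids(series)
```

Using a dictionary rather than a set removes duplicates while keeping the order in which IDs were first encountered, which makes repeated runs easier to compare.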


### Use the Gen3 SDK command `gen3 drs-pull object` to download an individual file

In [None]:
## Make a new, empty directory for downloaded files (any existing "downloads" directory is deleted)
import shutil
shutil.rmtree("downloads", ignore_errors=True)
os.makedirs("downloads", exist_ok=True)


In [None]:
## We can use a simple loop to download files and keep track of successes and failures
max_downloads = 3 # limit the number of downloads for demo purposes
to_download = object_ids[0:max_downloads]
success,failure,other=[],[],[]
count,total = 0,len(to_download)
for object_id in to_download:
    count+=1
    # pass the command as a list of arguments so that paths with spaces are handled safely
    cmd = ["gen3","--auth",cred,"--endpoint","data.midrc.org","drs-pull","object",object_id,"--output-dir","downloads"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print("Progress ({}/{}): {}".format(count,total,result.stdout))
    if "failed" in result.stdout:
        failure.append(object_id)
    elif "successfully" in result.stdout:
        success.append(object_id)
    else:
        other.append(object_id)


In [None]:
# Get a list of all downloaded .dcm files
## NOTE: Since we've downloaded some zip files containing entire imaging series from MIDRC, the number of files may be more than the number of object_ids once the packages are unzipped
image_files = glob.glob(pathname='downloads/**/*.dcm',recursive=True)
print(f"Found {len(image_files)} image files in the downloads directory.")
image_files

### View the DICOM Images
---
Here we'll use the [Python package `pydicom`](https://pydicom.github.io/pydicom/stable/) to view the downloaded DICOM images. 

Note that some of the files may contain compressed pixel data that require other packages to view; so, for this demo we'll simply skip over those using the following loop.

In [None]:
max_view = 1 # just view the first one for demo purposes
for image_file in image_files[0:max_view]:
    print(image_file)
    ds = pydicom.dcmread(image_file)
    try:
        new_image = ds.pixel_array.astype(float)
        scaled_image = (np.maximum(new_image, 0) / new_image.max()) * 255.0
        scaled_image = np.uint8(scaled_image)
        final_image = Image.fromarray(scaled_image)
        print(type(final_image))
        display(final_image)
    except Exception as e:
        print("Couldn't view {}: {}.".format(image_file,e))

### View the DICOM Headers
---
DICOM files have metadata elements embedded in the images. These can also be read and viewed using the `pydicom` package.

In [None]:
ds = pydicom.dcmread(image_files[0],force=True)
display(ds)

In [None]:
# Access individual elements using keywords (e.g., ds.ImageType) or numeric tags (e.g., ds[0x0008, 0x0016])
display(ds.file_meta)
display(ds.ImageType)
display(ds[0x0008, 0x0016])


In [None]:
# View the dicom metadata for all files as a DataFrame
dfs = []
for image_file in image_files:
    ds = pydicom.dcmread(image_file)
    df = pd.DataFrame(ds.values())
    df[0] = df[0].apply(lambda x: pydicom.dataelem.DataElement_from_raw(x) if isinstance(x, pydicom.dataelem.RawDataElement) else x)
    df['name'] = df[0].apply(lambda x: x.name)
    df['value'] = df[0].apply(lambda x: x.value)
    df = df[['name', 'value']]
    df = df.set_index('name').T.reset_index(drop=True)
    df['filename'] = image_file
    df.drop(columns=['Pixel Data'],inplace=True,errors='ignore') # drop the pixel data (if present) as it's too large and nonsensical to store in a DataFrame
    dfs.append(df)

In [None]:
# Make a master dataframe for all images using only headers in all dataframes
headers = list(set.intersection(*map(set,dfs)))
df = pd.concat([df[headers] for df in dfs])
df.set_index('filename',inplace=True)


In [None]:
display(df)

In [None]:
## Export the file metadata as a TSV file
filename = "MIDRC_DICOM_metadata.tsv"
df.to_csv(filename, sep='\t')


## The End
---
If you have any questions related to this notebook don't hesitate to reach out to the MIDRC Helpdesk at midrc-support@gen3.org or the author directly at cgmeyer@uchicago.edu

Happy data wrangling!