# How to Build a Cohort of Severe COVID-19 Cases Using the MIDRC Data Commons
---
This notebook demonstrates how to build a cohort of severe COVID-19 cases using patient clinical data and AI research-based annotations in the MIDRC data commons.

Our goal is to download structured data and files for 2 related cohorts: 1) severe COVID cases and 2) a control cohort of non-severe COVID cases.

Fortunately for patients, non-severe cases far outnumber severe ones, but that imbalance makes it harder to build a balanced dataset suitable for AI/ML training and evaluation.

* Cohort 1: All chest x-rays (CXR) with an mRALE score of 10 or higher obtained within 2 days after a positive COVID test.

* Cohort 2: Matching number of CXRs with an mRALE score <10 obtained within 2 days after a positive COVID test.

* Additionally, we want the cohorts to be somewhat balanced and matched in terms of the demographics: age, sex, race, and ethnicity.


by Chris Meyer, PhD

Manager of Data and User Services at the Center for Translational Data Science at University of Chicago

August 2023





## 1) Set up Python environment
---


### Set local variables
---
Change the following path to a valid location in the environment where you're running this notebook.

In [None]:
# Location of your MIDRC credentials file. Download it from https://data.midrc.org/identity
# by clicking the "Create API key" button and saving credentials.json locally, then upload
# it to the Colab Files browser.
cred = "/content/credentials.json"
api = "https://data.midrc.org" # The base URL of the data commons being queried. This shouldn't change for MIDRC.
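
Before authenticating, it can help to confirm that the credentials file exists and parses as JSON. The sketch below uses a hypothetical helper, `check_credentials_file`, which is not part of the Gen3 SDK, and it assumes the downloaded file is a JSON object containing `api_key` and `key_id` entries.

```python
import json
import os

def check_credentials_file(path, required_keys=("api_key", "key_id")):
    """Return a list of problems found with the credentials file (empty if none)."""
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]
    try:
        with open(path) as handle:
            content = json.load(handle)
    except json.JSONDecodeError as err:
        return ["not valid JSON: {}".format(err)]
    if not isinstance(content, dict):
        return ["expected a JSON object at the top level"]
    # The expected key names are an assumption about the file layout
    return ["missing key: {}".format(key) for key in required_keys if key not in content]

problems = check_credentials_file("/content/credentials.json")
print(problems if problems else "Credentials file looks usable.")
```

An empty list means none of these basic checks failed; it does not guarantee that the key itself is still valid.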


### Install / Import Python Packages and Scripts

In [None]:
## Install any packages below that the imports in subsequent cells require; uncomment lines as needed.

import sys
#!{sys.executable} -m pip install
#!{sys.executable} -m pip install --upgrade pandas
#!{sys.executable} -m pip install --upgrade --ignore-installed PyYAML
#!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade gen3
#!{sys.executable} -m pip install pydicom
#!{sys.executable} -m pip install IPython
#!{sys.executable} -m pip install psmpy

In [None]:
## Import Python Packages and scripts

import os, subprocess
import pandas as pd
import numpy as np
#import pydicom

# import some Gen3 packages
import gen3
from gen3.auth import Gen3Auth
from gen3.query import Gen3Query
from IPython.display import display


### Initiate instances of the Gen3 SDK Classes using credentials file for authentication
---
Again, make sure the "cred" directory path variable reflects the location of your credentials file (path variables set above).

In [None]:
auth = Gen3Auth(api, refresh_file=cred) # authentication class
query = Gen3Query(auth) # query class


## 2) Build Cohorts by Sending Queries to the MIDRC APIs
#### General notes on sending queries:
* There are many ways to query and access metadata for cohort building in MIDRC, but this notebook will focus on using the [Gen3](https://gen3.org) graphQL query service ["guppy"](https://github.com/uc-cdis/guppy/#readme). This is the backend query service that [MIDRC's data explorer GUI](https://data.midrc.org/explorer) uses. So, anything you can do in the explorer GUI, you can do with guppy queries, and more!
* The guppy graphQL service has more functionality than is demonstrated in this simple example with extensive documentation in GitHub [here](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md) in case you'd like to build your own queries from scratch.
* The Gen3 SDK (initialized as "query" above in this notebook) has Python wrapper scripts that make it simpler to send queries to the guppy graphQL API. The guppy SDK package can be viewed in GitHub [here](https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/query.py).
* Guppy queries focus on a particular type of data (cases, imaging studies, files, etc.) and include arguments that are akin to selecting filter values in [MIDRC's data explorer GUI](https://data.midrc.org/explorer).
* For more documentation about how to use and combine filters with various operator logic (like AND/OR/IN, etc.), see [this page](https://github.com/uc-cdis/guppy/blob/master/doc/queries.md#filter).
* We then send our query to MIDRC's guppy API endpoint using [the Gen3Query SDK package](https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/query.py) we initialized earlier.
* If our query request is successful, the API response should be in JSON format, and it should contain a list of patient IDs along with any other patient data we ask for.
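
Because a guppy filter is plain nested JSON, it can be assembled as an ordinary Python dictionary before it is passed to the SDK. The sketch below uses a hypothetical helper, `range_filter`, written only for illustration; the field names are the ones used in the queries later in this notebook, and the operators (`AND`, `IN`, `>=`, `<=`) are the ones described in the guppy filter documentation linked above.

```python
import json

def range_filter(field, minimum, maximum):
    """Build an inclusive range filter: minimum <= field <= maximum."""
    return {"AND": [{">=": {field: minimum}}, {"<=": {field: maximum}}]}

# Combine a value-list filter with a range filter under a single AND
example_filter = {
    "AND": [
        {"IN": {"study_modality": ["DX", "CR"]}},
        range_filter("days_from_study_to_pos_covid_test", -2, 0),
    ]
}
print(json.dumps(example_filter, indent=2))
```

Printing the filter as JSON before sending it is a simple way to spot a misplaced bracket or operator.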

---


#### Cohort 1: All chest x-rays (CXR) with an mRALE score of 10 or higher obtained within 2 days after a positive COVID test.
---

In [None]:
### Set some "imaging_study" query parameters

## mRALE filter: we'll select all imaging studies annotated with an mRALE score greater than or equal to this threshold number
mRALE_threshold = 10

## days from study to positive COVID-19 test filter: we want imaging studies performed within two days after a positive test
min_days_from_study_to_test = -2
max_days_from_study_to_test = 0

## Imaging study modality filter: we want chest x-rays, so we want studies with a modality of either DX or CR
study_modalities = ["DX", "CR"]

## Imaging study body part filter: here we select "chest" as the "LOINC system" filter, which is the body part examined
body_part_examined = "Chest"
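
The sign convention behind the day-offset filter is easy to get backwards. The sketch below assumes, consistent with the comments above, that `days_from_study_to_pos_covid_test` is the test date minus the study date in days, so a study performed after the positive test has a negative value. The helper name and the dates are hypothetical, and the definition should be confirmed against the MIDRC data dictionary.

```python
from datetime import date

def days_from_study_to_test(study_date, test_date):
    """Days from the imaging study to the positive test (negative if the study came later)."""
    return (test_date - study_date).days

test_day = date(2021, 3, 10)  # hypothetical positive test date
study_days = (date(2021, 3, 9), date(2021, 3, 10), date(2021, 3, 12), date(2021, 3, 13))

window = {}
for study_day in study_days:
    offset = days_from_study_to_test(study_day, test_day)
    # Same bounds as min_days_from_study_to_test and max_days_from_study_to_test
    window[study_day] = (offset, -2 <= offset <= 0)
    print(study_day, offset, -2 <= offset <= 0)
```

Under this assumption, a study on the day of the test or up to two days afterwards falls inside the window, while a study the day before the test does not.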



In [None]:
## Note: the "fields" option defines what fields we want the query to return. If set to "None", returns all available fields.

severe_studies = query.raw_data_download(
                    data_type="imaging_study",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"loinc_system": [body_part_examined]}},
                            {"IN": {"study_modality": study_modalities}},
                            {"nested": {"path": "imaging_study_annotations", ">=": {"midrc_mRALE_score": mRALE_threshold}}},
                            {"AND": [
                                {">=": {"days_from_study_to_pos_covid_test": min_days_from_study_to_test}},
                                {"<=": {"days_from_study_to_pos_covid_test": max_days_from_study_to_test}}
                            ]}
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(severe_studies) > 0 and "submitter_id" in severe_studies[0]:
    severe_study_ids = [i['submitter_id'] for i in severe_studies] ## make a list of the imaging study IDs returned
    print("Query returned {} study IDs.".format(len(severe_studies)))
    print("Data is a list with rows like this:\n\t {}".format(severe_studies[0:1]))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")

#### Cohort 2: CXRs with an mRALE score <10 obtained within 2 days after a positive COVID test.
---

We don't need to set any new parameters for our filters this time. We just need to reverse the operator on the mRALE threshold from greater than or equal to (`>=`) to less than (`<`).

In [None]:
mild_studies = query.raw_data_download(
                    data_type="imaging_study",
                    fields=None,
                    filter_object={
                        "AND": [
                            {"IN": {"loinc_system": [body_part_examined]}},
                            {"IN": {"study_modality": study_modalities}},
                            {"nested": {"path": "imaging_study_annotations", "<": {"midrc_mRALE_score": mRALE_threshold}}},
                            {"AND": [
                                {">=": {"days_from_study_to_pos_covid_test": min_days_from_study_to_test}},
                                {"<=": {"days_from_study_to_pos_covid_test": max_days_from_study_to_test}}
                            ]}
                        ]
                    },
                    sort_fields=[{"submitter_id": "asc"}]
                )

if len(mild_studies) > 0 and "submitter_id" in mild_studies[0]:
    mild_study_ids = [i['submitter_id'] for i in mild_studies] ## make a list of the imaging study IDs returned
    print("Query returned {} study IDs.".format(len(mild_studies)))
    print("Data is a list with rows like this:\n\t {}".format(mild_studies[0:1]))
else:
    print("Your query returned no data! Please, check that query parameters are valid.")
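
Since each query matches on nested annotation records, it may be worth confirming that no study ID landed in both cohorts (for example, if a study carried more than one mRALE annotation). This is a sketch on hypothetical records; `find_overlap` is an illustrative helper, not an SDK function.

```python
def find_overlap(records_a, records_b, key="submitter_id"):
    """Return the sorted IDs that appear in both lists of record dictionaries."""
    ids_a = {record[key] for record in records_a if key in record}
    ids_b = {record[key] for record in records_b if key in record}
    return sorted(ids_a & ids_b)

# Hypothetical records standing in for the two query results
toy_severe = [{"submitter_id": "study-1"}, {"submitter_id": "study-2"}]
toy_mild = [{"submitter_id": "study-2"}, {"submitter_id": "study-3"}]
print(find_overlap(toy_severe, toy_mild))
```

With the real results, the call would be `find_overlap(severe_studies, mild_studies)`; an empty list means the cohorts are disjoint.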

In [None]:
severe_df = pd.DataFrame(severe_studies)
display(severe_df.head())

mild_df = pd.DataFrame(mild_studies)
display(mild_df.head())


In [None]:
## Label cases as mild or severe and then combine the dataframes into a single dataframe
mild_df['cohort'] = 'mild'
severe_df['cohort'] = 'severe'
df = pd.concat([mild_df,severe_df],ignore_index=True)


In [None]:
# convert list-valued patient demographic columns to strings
df['case_ids'] = df['case_ids'].apply(lambda x: ','.join(map(str, x)))
df['ethnicity'] = df['ethnicity'].apply(lambda x: ','.join(map(str, x)))
df['race'] = df['race'].apply(lambda x: ','.join(map(str, x)))
df['age_at_index'] = df['age_at_index'].apply(lambda x: ','.join(map(str, x)))
df['age_at_index'] = df['age_at_index'].astype(int)
df['sex'] = df['sex'].apply(lambda x: ','.join(map(str, x)))

# add binned ages for calculating age distributions later
age_bins = np.arange(10,100,10)
df['age_bin'] = pd.cut(df['age_at_index'], bins=age_bins)
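
Note how `pd.cut` treats the bin edges: with `np.arange(10, 100, 10)` the intervals run from `(10, 20]` to `(80, 90]`, each open on the left and closed on the right, so ages of 10 or younger and ages above 90 receive `NaN`. The sketch below shows this on hypothetical ages.

```python
import numpy as np
import pandas as pd

toy_ages = pd.Series([5, 10, 11, 20, 45, 90, 95])  # hypothetical ages
toy_bins = pd.cut(toy_ages, bins=np.arange(10, 100, 10))

# Ages outside (10, 90] show up with a missing age_bin
print(pd.DataFrame({"age": toy_ages, "age_bin": toy_bins}))
```

If those ages matter for the cohort, widening the edges and passing `include_lowest=True` is one option, though that is a suggestion rather than part of the original analysis.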

In [None]:
# The dataset is imbalanced, with more mild COVID patients than severe
df['cohort'].value_counts()

## 3) Now we use re-sampling techniques to balance the dataset
---
To create a mild COVID cohort that is the same size as the smaller severe COVID cohort and roughly matches its demographics, we sample cases from the larger mild COVID cohort, a process called "undersampling", until the two cohorts are equal in size.

"Undersampling" refers to a group of techniques, common in machine learning, that balance a skewed class distribution in a classification dataset by keeping only a subset of the majority class. In this case, we want to undersample the larger patient cohort while ensuring that the resulting two cohorts have a similar distribution of four demographic variables: sex, race, ethnicity, and age.

The following column headers in the Pandas DataFrame we created above will be used for our sampling script: `sex`, `ethnicity`, `race`, and `age_at_index`.

### Calculate the Size of the Smaller Cohort:

Determine the target cohort size. Because both cohorts need to be the same size, the target is the size of the smaller cohort, which you can compute with the `min` function.


In [None]:
cohorts = ['mild', 'severe']
cohort_sizes = {}
for cohort in cohorts:
    cohort_sizes[cohort] = len(df[df['cohort']==cohort])
display(cohort_sizes)


In [None]:
smaller_cohort_size = min(cohort_sizes.values())
smaller_cohort_name = list(cohort_sizes.keys())[list(cohort_sizes.values()).index(smaller_cohort_size)]
sdf = df.loc[df['cohort']==smaller_cohort_name] # smaller cohort DataFrame

larger_cohort_size = max(cohort_sizes.values())
larger_cohort_name = list(cohort_sizes.keys())[list(cohort_sizes.values()).index(larger_cohort_size)]
ldf = df.loc[df['cohort']==larger_cohort_name].copy() # larger cohort DataFrame; copied so a 'weight' column can be added later without a SettingWithCopyWarning

print("The smaller cohort is '{}' with a size of '{}'.".format(smaller_cohort_name,smaller_cohort_size))
print("The larger cohort is '{}' with a size of '{}'.".format(larger_cohort_name,larger_cohort_size))

### Undersampling:

Now, we undersample the larger cohort to match the smaller cohort's size while maintaining the desired distribution of demographic variables. For this, we'll use the `sample` function in Pandas.

To undersample the larger cohort (mild COVID cases) while accounting for the four demographic variables and their distribution in the smaller cohort (severe COVID cases), we pass `sample` custom sampling weights derived from the smaller cohort's distribution.

#### Strategy
---
1) Determine the frequency of all combinations of demographic properties in the smaller cohort,
2) Add this frequency to each row in the larger cohort by matching the demographics combinations, and
3) Undersample the larger cohort using the inverse of these frequencies as weights
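
The three steps above can be sketched on a toy example. The data and the names `toy_small`, `toy_large`, and `props` are hypothetical, only two demographic properties are used to keep the output small, and a `merge` is used here to attach the frequencies, which is one of several ways to do the matching.

```python
import pandas as pd

props = ["sex", "race"]
# Toy cohorts with hypothetical values, not MIDRC records
toy_small = pd.DataFrame({"sex": ["Male", "Male", "Male", "Female"],
                          "race": ["White", "White", "Asian", "White"]})
toy_large = pd.DataFrame({"sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
                          "race": ["White", "Asian", "White", "Asian", "White", "White"]})

# 1) Relative frequency of each demographic combination in the smaller cohort
freq = toy_small[props].value_counts(normalize=True).rename("freq").reset_index()

# 2) Attach that frequency to each row of the larger cohort (NaN if the combination is absent)
toy_large = toy_large.merge(freq, on=props, how="left")

# 3) Inverse frequency as the sampling weight; absent combinations get weight 0
toy_large["weight"] = (1 / toy_large["freq"]).fillna(0)

toy_sample = toy_large.sample(n=len(toy_small), weights="weight", random_state=41)
print(toy_large)
print(toy_sample)
```

Rows whose combination never occurs in the smaller cohort end up with a weight of 0, so they cannot be drawn into the sample.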


In [None]:
# Make a list of all combinations of demographic variables found in the master DataFrame using DataFrame.value_counts()

# list of properties to consider (dprops: "demographic properties")
#dprops = ['sex','ethnicity']
dprops = ['sex','ethnicity','race','age_bin']

print("Counts of Demographic Property Combinations in Master DataFrame:")
mvc = df[dprops].value_counts() # mvc: "master value counts"
#mvc[('Male', 'Not Hispanic or Latino')] # this is how you access individual values
display(mvc)

print("\nAll Combinations of Demographic Properties in Master DataFrame:")
combos = mvc.index.tolist()
display(combos)


In [None]:
# Now look at the frequencies of each demographic combo in the smaller cohort
svc = sdf[dprops].value_counts(normalize=False).reindex(combos) # svc: "smaller cohort value counts"
print("Smaller Cohort Demographics Value Counts\n{}\n\n".format(svc))

# use normalize=True to get the relative frequencies (count of a demographic / sum of all demographics)
svf = sdf[dprops].value_counts(normalize=True).reindex(combos) # svf: "smaller cohort value frequencies"
print("Smaller Cohort Demographics Frequencies\n{}\n\n".format(svf))

# sampling weights should be the inverse of their frequencies; less frequent demographics get a higher probability of sampling
svw = 1/svf # svw: "smaller cohort value weights"
print("Smaller Cohort Demographics Weights for Undersampling\n{}\n\n".format(svw))

Demographics in the larger cohort not represented in the smaller cohort get NaN for counts. Here we'll convert the `NaN`s to `0` to use as weights:

In [None]:
# Replace NaNs with 0:
svw = svw.fillna(0)
display(svw)

In [None]:
# Now apply the weights to each row of the larger cohort (mild COVID cases) to use in undersampling
for combo in combos:
    print("{}: {}".format(combo,svw[combo]))
    ldf.loc[(ldf[dprops[0]] == combo[0]) & (ldf[dprops[1]] == combo[1]) & (ldf[dprops[2]] == combo[2]) & (ldf[dprops[3]] == combo[3]),'weight'] = svw[combo]


In [None]:
# Double check that all rows in the larger cohort were assigned a weight. If this is an empty DataFrame, then each row has a non-NaN weight.
ldf.loc[ldf['weight'].isna()]

### Undersample the Larger Cohort:

Use the `sample` function with the calculated weights to undersample the larger cohort.

In [None]:
# Undersample the larger cohort (mild COVID cases) using weights

udf = ldf.sample(n=smaller_cohort_size, weights=ldf['weight'], random_state=np.random.RandomState(41)) # undersampled larger cohort DataFrame, can set random_state in order to have reproducible sampling
#udf = ldf.sample(n=smaller_cohort_size, weights=ldf['weight']) # undersampled larger cohort DataFrame, leave random_state out to get a non-reproducible, random sample
udf.reset_index(drop=True,inplace=True)
udf
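
The effect of `random_state` can be seen on a toy frame: the same seed returns the same rows, and rows with a weight of 0 are never drawn. The data below are hypothetical, and an integer seed is used in place of a `RandomState` object; `sample` accepts either.

```python
import pandas as pd

# Hypothetical rows; study "e" has weight 0
toy = pd.DataFrame({"study": list("abcdef"), "weight": [1, 1, 2, 2, 0, 4]})

first = toy.sample(n=3, weights="weight", random_state=41)
second = toy.sample(n=3, weights="weight", random_state=41)

print(first.equals(second))              # same seed, same draw
print("e" in first["study"].tolist())    # zero-weight row is excluded
```

Leaving `random_state` out gives a different draw on each run, which is useful for checking how sensitive the balanced cohort is to the particular sample.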

### Combine the Two Cohorts:

After undersampling, you'll have two cohorts of the same size, and the undersampled cohort should roughly match the smaller cohort with respect to the four demographic variables.


In [None]:
# Combine the undersampled larger cohort with the smaller cohort
bdf = pd.concat([udf.drop(columns=["weight"]), sdf]).reset_index(drop=True) # balanced DataFrame
bdf

## 4) Verification:

Ensure that the demographic distributions of the two cohorts are now balanced for all four variables. You can use the `pandas` `groupby` and `value_counts` methods or other appropriate methods to check the distributions.
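
As a numeric complement to plots, the share of each category within each cohort can be tabulated with `pd.crosstab`. The sketch below runs on a hypothetical frame, `toy_bdf`, standing in for the balanced DataFrame; the "largest difference in share" summary is an illustrative choice, not a formal balance test.

```python
import pandas as pd

# Hypothetical balanced frame standing in for the combined cohorts
toy_bdf = pd.DataFrame({
    "cohort": ["severe"] * 4 + ["mild"] * 4,
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female", "Female", "Female"],
})

# Share of each sex within each cohort (each column sums to 1)
shares = pd.crosstab(toy_bdf["sex"], toy_bdf["cohort"], normalize="columns")
largest_gap = (shares["severe"] - shares["mild"]).abs().max()

print(shares)
print("Largest difference in share:", largest_gap)
```

Repeating this for each demographic property gives a quick sense of where the two cohorts still differ.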


In [None]:
# Use pandas groupby and plot functions to view relative counts of different demographics in the balanced cohort
for prop in dprops:
    dfu = bdf.groupby([prop],observed=True).cohort.value_counts().unstack()
    ax = dfu.plot(kind='bar', figsize=(7, 5), xlabel=prop, ylabel='Count', rot=90) # change rot=0 or rot=45 to change x-axis label display angle
    ax.legend(title='cohort', bbox_to_anchor=(1, 1), loc='upper left')

## The End
---
If you have any questions related to this notebook don't hesitate to reach out to the MIDRC Helpdesk at midrc-support@datacommons.io or the author directly at cgmeyer@uchicago.edu

Happy data wrangling!