# COVIDx Challenge
---
**29-Aug-2022**

---
**CHALLENGE SUMMARY**: The goal of this Challenge is to train an AI/machine learning model in the task of distinguishing between COVID-negative and COVID-positive patients using frontal-view portable chest radiographs (CXRs).

**DATA**: This Jupyter Notebook demonstrates how the cohort for the Challenge Dataset was selected.

*Some descriptions within this notebook were taken from the MIDRC notebook tutorial repo at https://github.com/MIDRC/tutorial_notebooks/blob/1488e64937aca3424252ed5afaf44ca3213c1d08/Cohort_Selection_Using_MIDRC_Temporal_COVID_Test_Data.ipynb*

---
If you would like to run this notebook on your own, you must have Python 3.x (and the Jupyter Python package) installed on your local system. In addition, make sure you follow the instructions in the MIDRC Quick Start Guide to set up your MIDRC account. You will need to use your MIDRC credentials ("API key") file within the notebook.

## Import Python Packages and scripts

In [None]:
# PLEASE BE AWARE that you may need to install some packages into your local instance of Python
# so that all of the functions here will operate properly. Installing packages is beyond the scope
# of this notebook, but the commands to install some common packages that may be needed are
# included in the comments below.
#
# !pip install --upgrade pip
# !pip install --upgrade --ignore-installed PyYAML
# !pip install --upgrade gen3
# !pip install cdiserrors


## Import Python Packages and scripts

import pandas as pd
import sys, os, webbrowser
import gen3
import tqdm

from gen3.submission import Gen3Submission
from gen3.auth import Gen3Auth
from gen3.index import Gen3Index
from gen3.query import Gen3Query

*Make sure you change the location of your working directory in the cell below!*

In [None]:
## Define the directories for our workspace

home_dir = "/Users/ngrusz1/Downloads/jupyter-notebooks"  #CHANGE THIS DIRECTORY to the working directory for your files!

print ("Changing to working directory: ")
os.chdir(home_dir)
print(os.getcwd())
challenge_dir = "{}/COVIDx-Challenge".format(home_dir)


## Make a directory to work in

try:
    os.mkdir(challenge_dir)
except OSError as error:
    print(error)

    
## Import some custom Python from the Gen3 repo

os.system("curl -LO https://raw.githubusercontent.com/cgmeyer/gen3sdk-python/master/expansion/expansion.py")
# If curl is not available on your system, you can try wget or a similar tool
#os.system("wget https://raw.githubusercontent.com/cgmeyer/gen3sdk-python/master/expansion/expansion.py")

print("Retrieved additional Python scripts from Gen3 repo.")
%run expansion.py

print ("Changing to challenge directory: ")
os.chdir(challenge_dir)
print(os.getcwd())

## Connect to the MIDRC Data Commons

*Make sure you update the location of your MIDRC API key file in the cell below!*

In [None]:
## Initiate instances of the Gen3 SDK Classes using credentials file uploaded. This is unnecessary with functioning WTS

#CHANGE THIS FILE to the full path of your MIDRC credentials file!
cred = '/Users/ngrusz1/Downloads/jupyter-notebooks/credentials/midrc-credentials.json'

api = 'https://data.midrc.org'
auth = Gen3Auth(api, refresh_file=cred) # authentication class
sub = Gen3Submission(api, auth) # submission class
query = Gen3Query(auth) # query class
exp = Gen3Expansion(api,auth,sub) # class with some custom scripts
exp.get_project_ids()

## Query the MIDRC Data Commons and export metadata using submission API
---
Here we'll utilize the MIDRC submission API to export all the imaging study and measurement (COVID-19 tests) data using the ["get_node_tsvs" function](https://github.com/cgmeyer/gen3sdk-python/blob/2aecc6575b22f9cca279b650914971dd6723a2ce/expansion/expansion.py#L219), which is a wrapper to export and merge all the records in a node across each project in the data commons using the [Gen3SDK](https://github.com/uc-cdis/gen3sdk-python/) function [Gen3Submission.export_node()](https://github.com/uc-cdis/gen3sdk-python/blob/5d7b5270ff11cf7037f211cf01e410d8e73d6b84/gen3/submission.py#L361).

**NOTE**: For this challenge, we are only interested in selecting cases from the *Open-A1* and *Open-R1* projects. As such, we will explicitly define this set of projects -- just in case the user has access to additional projects that we don't plan to use here!

In [None]:
## Explicitly define the list of projects we will use in this challenge.

pids = ['Open-A1','Open-R1']

In [None]:
## Export all the records in the imaging_study node

st = exp.get_node_tsvs(node='imaging_study',projects=pids)
chest_descriptions = ['XR CHEST 1 VIEW AP', 'CHEST PORT 1 VIEW (RAD)-CS', 'XR PORT CHEST 1V']
s = st.loc[
    st['project_id'].isin(pids)
    & (st['age_at_imaging'] >= 18)
    & st['study_description'].isin(chest_descriptions)
]
s

In [None]:
## Now export all the data in the measurement node, which is used to store the COVID test data

meas = exp.get_node_tsvs(node='measurement',projects=pids)
meas.head()

In [None]:
## Filter the measurements for only COVID-19 tests with a non-null "test_days_from_index" property

m = meas.loc[(~meas['test_days_from_index'].isna()) & (meas['test_name']=='COVID-19')]
m

In [None]:
## Check out the properties in each DataFrame to help make a list of properties to merge into a single table

display(list(s))
display(len(s))
display(list(m))
display(len(m))

In [None]:
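## Merge the imaging_study and COVID-19 measurement data using "case_ids" as a foreign key.
## The filtered table `m` is used so that only COVID-19 tests with a known test day are merged.

temp = pd.merge(s, m, on='case_ids')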
display(temp)
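
Because a case can have several COVID-19 tests, merging on `case_ids` produces one row per (study, test) pair rather than one row per study. The following is a plain-Python sketch of that inner join, not the notebook's own code (the notebook uses `pd.merge`); all records and identifiers below are made up for illustration.

```python
# Hypothetical records: one case with two tests, one case with no tests.
studies = [
    {"case_ids": "case-1", "study_uid": "study-A", "days_to_study": 13},
    {"case_ids": "case-2", "study_uid": "study-B", "days_to_study": 5},
]
tests = [
    {"case_ids": "case-1", "test_days_from_index": 1, "test_result_text": "Negative"},
    {"case_ids": "case-1", "test_days_from_index": 10, "test_result_text": "Positive"},
]

# Inner join on case_ids: each study is paired with every test of the same case,
# and studies whose case has no test are dropped.
merged = [
    {**study, **test}
    for study in studies
    for test in tests
    if study["case_ids"] == test["case_ids"]
]

for row in merged:
    print(row["study_uid"], row["test_days_from_index"], row["test_result_text"])
```

Here `study-A` appears twice (once per test) and `study-B` does not appear at all, which is why a later step has to reduce the merged table back to one row per study.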

## Calculate the days from COVID-19 test to an imaging_study
---
Now that we have the temporal data for imaging studies and COVID-19 tests in a single DataFrame for all cases in MIDRC for which this data is provided, we can calculate the number of days between each COVID-19 test and each imaging study, which we'll call `days_from_test_to_study`.

* Note: In MIDRC, a negative "days to XYZ" indicates that the event XYZ took place that many days prior to the index event, while a positive "days to" indicates the number of days since the index event. For example, a "days_to_study" of "-10" indicates that the imaging study was performed 10 days *before* the index event. A value of "365" indicates the imaging study took place one year *after* the index event. 

We expect a positive value if the test was performed before the study.
- So, if `test_days_from_index` is `1` and `days_to_study` is `4`, the `days_from_test_to_study` should be `3`, which means the study took place 3 days after the test. 
- If the test is on day 4 and the study is on day 1, then the `days_from_test_to_study` is `-3`, meaning the study took place 3 days before the test.

In [None]:
## Calculate the days from COVID-19 test to an imaging_study

temp['days_from_test_to_study'] = temp['days_to_study'] - temp['test_days_from_index']
display(temp)

## Identify "COVID-19 positive" imaging studies
---
Now that we've calculated `days_from_test_to_study`, we can define a cut-off value and filter the imaging studies using that value to determine which imaging studies were performed within a certain time-frame of receiving a positive COVID-19 test.

Our new property `days_from_test_to_study` has a positive value if the COVID test was performed before the imaging study (e.g., `3` means the study took place 3 days after the test) and a negative value if the test was performed after the imaging study (e.g., `-3` means the study took place 3 days before the test).

For this challenge, we assume that a test result reflects the patient's COVID status at the time of imaging if the imaging study was performed on the day of the test or up to 14 days after it. So, we'll filter the DataFrame of studies for a `days_from_test_to_study` in the range of 0 to 14 (inclusive). Because the challenge needs both COVID-negative and COVID-positive examples, we keep rows whose `test_result_text` is either `Negative` or `Positive`.
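
As a minimal sketch of the time-window condition in plain Python (the helper name `in_window`, the constant `WINDOW_DAYS`, and the records are assumptions made for illustration; the notebook applies the same bounds with a pandas boolean mask):

```python
WINDOW_DAYS = 14  # cut-off used in this notebook


def in_window(days_from_test_to_study, window=WINDOW_DAYS):
    """True if the study was performed on the test day or up to `window` days after it."""
    return 0 <= days_from_test_to_study <= window


# Hypothetical rows of the merged table.
rows = [
    {"study_uid": "study-A", "days_from_test_to_study": 3},   # 3 days after the test
    {"study_uid": "study-B", "days_from_test_to_study": -2},  # 2 days before the test
    {"study_uid": "study-C", "days_from_test_to_study": 14},  # last day of the window
    {"study_uid": "study-D", "days_from_test_to_study": 15},  # one day too late
]

kept = [row["study_uid"] for row in rows if in_window(row["days_from_test_to_study"])]
print(kept)
```

Both bounds are inclusive, so a study on the day of the test (offset 0) and a study exactly 14 days later are both kept.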

In [None]:
## Keep rows where the test was taken 0 to 14 days before the study and the result is Negative or Positive

ps = temp.loc[
    temp['days_from_test_to_study'].between(0, 14)
    & temp['test_result_text'].isin(['Negative', 'Positive'])
]
display(ps)

In [None]:
## Define a subroutine for identifying unique study identifiers (after merging the study and 
## measurement data, each study will appear in multiple rows of the resulting table if a patient 
## has multiple COVID tests)

def unique(items):
    items = list(items)
    ht = {}
    unique_list = []
    for i in items:
        if i not in ht:
            unique_list.append(i)
            ht[i] = True
    return unique_list

In [None]:
## Identify the most recent COVID test prior to the imaging study within the preset time window in order
## to determine the "COVID status" of a study. This is because, for example, a patient may test Negative 
## on day 1, test Positive on day 10, and then be imaged on day 13. In this case, the imaging study appears
## twice in the table after filtering the full merged table for the time window, once with a 'Negative'
## label and once with a 'Positive' label. We will pick the most recent test result.

# PLEASE NOTE this step may take a long time to run. Scroll down in the output window to see the status.

studies = unique(ps['study_uid'])
rows = []  # collect one row per study; DataFrame.append was removed in pandas 2.0
for study in tqdm.tqdm(studies, desc = 'Creating Reference Standard Table', ascii = False, ncols = 133):
    study_frame = ps.loc[ps['study_uid'] == study] # get all entries for a given study
    min_day = study_frame['days_from_test_to_study'].min() # get the closest COVID test for the study
    rows.append(study_frame.loc[study_frame['days_from_test_to_study'] == min_day].iloc[0])
rs = pd.DataFrame(rows, columns = temp.columns)
print(rs)
print(rs.columns)
filename = 'DX_imaging_studies_plus_covid_tests.tsv'
rs_filename = 'COVIDx_training_imagingStudy_COVIDstatus.tsv'
os.chdir(challenge_dir)
ps.to_csv(filename,sep='\t',index = False)
rs.to_csv(rs_filename, sep = '\t', index = False)
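
The selection rule used in the cell above (for each study, keep the row with the smallest `days_from_test_to_study`) can be sketched in plain Python. The records below are made up and reuse the example from the cell comment: a Negative test on day 1, a Positive test on day 10, and imaging on day 13.

```python
# Hypothetical rows that survived the 0-to-14-day window filter.
rows = [
    {"study_uid": "study-A", "days_from_test_to_study": 12, "test_result_text": "Negative"},
    {"study_uid": "study-A", "days_from_test_to_study": 3, "test_result_text": "Positive"},
    {"study_uid": "study-B", "days_from_test_to_study": 0, "test_result_text": "Negative"},
]

closest = {}
for row in rows:
    uid = row["study_uid"]
    best = closest.get(uid)
    # Strict "<" keeps the first row encountered when two tests are equally close.
    if best is None or row["days_from_test_to_study"] < best["days_from_test_to_study"]:
        closest[uid] = row

labels = {uid: row["test_result_text"] for uid, row in closest.items()}
print(labels)
```

For `study-A` the Positive test (3 days before imaging) wins over the Negative test (12 days before imaging), so the study is labeled Positive.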

## Get the imaging files for the identified studies or cases.
---
Now that we have a list of imaging studies, each labeled with the COVID-19 status of the patient at the time of imaging, we can use the `study_uid`, which is a unique identifier for imaging studies, to collect the associated files. If we want *all* the imaging studies for the cohort of identified cases, e.g., to have "healthy" or "baseline" images for comparison, we can instead use the `case_ids` to pull all imaging files for the cases, keeping in mind that this will also pull any imaging studies that fall outside our defined temporal range.
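
The difference between selecting by study and selecting by case can be illustrated with a short plain-Python sketch. The records and identifiers are hypothetical, and `case_ids` is treated as a single string here for simplicity.

```python
# Hypothetical imaging_study records for two cases.
all_studies = [
    {"case_ids": "case-1", "study_uid": "study-A"},  # inside the time window
    {"case_ids": "case-1", "study_uid": "study-Z"},  # same patient, outside the window
    {"case_ids": "case-2", "study_uid": "study-B"},  # a case that was not selected
]
selected_study_uids = {"study-A"}
selected_case_ids = {"case-1"}

# Selecting by study_uid returns only the studies that passed the temporal filter.
by_study = [s["study_uid"] for s in all_studies if s["study_uid"] in selected_study_uids]

# Selecting by case_ids returns every study of the selected patients.
by_case = [s["study_uid"] for s in all_studies if s["case_ids"] in selected_case_ids]

print(by_study)
print(by_case)
```

Selecting by case returns `study-Z` as well, even though it was never matched to a COVID-19 test inside the window.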

In [None]:
## Make a list of study_uids and case_ids

cids = list(set(rs['case_ids']))
display(len(cids))

sids = list(set(rs['study_uid']))
display(len(sids))

In [None]:
## This query retrieves imaging_study records (up to the `first` limit of 60000). We will eventually filter these results based on the COVID test data

res = query.query(
    data_type="imaging_study",
    first=60000,
    fields=[
              "study_uid",
              "case_ids",
              "object_id",
              "project_id"
           ]
)

In [None]:
## Take a glance at the returned data

st = res['data']['imaging_study']
display(len(st))
st[1:2]

In [None]:
## Convert the query data to a DataFrame and remove any records that lack a study_uid or object_id

oids = pd.DataFrame(st)
oids = oids.loc[(~oids['object_id'].isna())&(~oids['study_uid'].isna())]
len(oids)

In [None]:
## Now filter the imaging studies based on our temporal results

toids = oids.loc[oids['study_uid'].isin(sids)]
len(toids)

In [None]:
## Take a glance at the results

display(toids)

In [None]:
## Save our result to a TSV file and build a manifest of the associated object_ids

results_name = "Imaging_Covid_Status_Object_Names.tsv"
toids.to_csv(results_name,sep='\t',index=False)

object_ids = list(set([a for b in toids.object_id.tolist() for a in b]))

display(len(object_ids))

manifest = [{"object_id":i} for i in object_ids]
display(len(manifest))
display(manifest)

import json

mani_name = 'MIDRC_COVIDx_challenge_1_imaging_studies_covid_manifest.json'
with open(mani_name,'w') as mani:
    json.dump(manifest, mani)  # write valid JSON; str(manifest) would emit a single-quoted Python repr
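
A quick way to confirm that a manifest file is valid JSON is to read it back. The sketch below uses a temporary directory and made-up `object_id` values; the file name `example_manifest.json` is hypothetical. It also shows why `json.dump` is preferable to `str()` for this purpose: Python's `str()` of a list of dicts uses single quotes, which JSON parsers reject.

```python
import json
import os
import tempfile

# Hypothetical manifest with the same shape as the one built in the notebook.
manifest = [{"object_id": "example-object-id-1"}, {"object_id": "example-object-id-2"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_manifest.json")
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    with open(path) as fh:
        loaded = json.load(fh)  # round-trip: parse what was just written

print(loaded == manifest)

# The Python repr of the same list is not parseable as JSON.
try:
    json.loads(str(manifest))
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
print(repr_is_valid_json)
```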

In [None]:
## We should now have all the metadata files we need as well as a Gen3 manifest file for downloading the associated images!
## Check the working directory for the files!

print(os.getcwd())
os.listdir(os.getcwd())