<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/nsclc-radiomics/nsclc_radiomics_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Rationale </center> 

This notebook is a demo of Hosny et al., [Deep learning for lung cancer prognostication: A retrospective multi-cohort radiomics study](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002711), reproduced using the tools provided by the Imaging Data Commons.

<br>

The goal of this notebook is to provide the user with an example of how the tools provided by the Imaging Data Commons could be used to run an AI/ML end-to-end analysis on a cohort hosted by the portal, and to describe what we identified as the best practices to do so.

<br>


---
---

# <center> Environment Setup, Data Download and Pre-processing </center>

In [None]:
curr_dir = !pwd
curr_droid = !hostname
curr_pilot = !whoami

print("Current directory :", curr_dir[-1])
print("Hostname          :", curr_droid[-1])
print("Username          :", curr_pilot[-1])

## Environment Set Up

This demo notebook was designed to run on a GPU.

To access a free GPU on Colab, open:
`Edit > Notebook settings`

From the dropdown menu under `Hardware accelerator`, select `GPU`.

In [None]:
# check whether a GPU was correctly enabled
gpu_list = !nvidia-smi --list-gpus

has_gpu = "failed" not in gpu_list[0]

has_gpu
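
The string match above only catches the case where `nvidia-smi` runs and reports a failure. As a minimal sketch of a more defensive check, assuming `nvidia-smi` is the only probe available, the hypothetical helper `detect_gpu` (not part of the demo code) also returns `False` when the executable is missing or exits with an error:

```python
import shutil
import subprocess


def detect_gpu(executable="nvidia-smi"):
    """Return True if `executable --list-gpus` runs and lists at least one GPU."""
    # the driver utility is not installed at all: no GPU runtime
    if shutil.which(executable) is None:
        return False
    try:
        result = subprocess.run([executable, "--list-gpus"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True,
                                timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    # a non-zero return code means the driver could not be queried
    if result.returncode != 0:
        return False
    # each detected device is reported on a line starting with "GPU <index>:"
    return any(line.startswith("GPU ") for line in result.stdout.splitlines())


has_gpu_checked = detect_gpu()
print("GPU available:", has_gpu_checked)
```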

---

## Preliminary Notes

The Hosny et al. model was developed using Keras 1.2.2 and an old version of TensorFlow, as stated by the authors (see, e.g., [the Docker config file in the model GitHub repository](https://github.com/modelhub-ai/deep-prognosis/blob/master/dockerfiles/keras:1.0.1)). Since Google Colab instances run either TensorFlow 2.x.x or TensorFlow 1.15.2, together with Keras 2.x.x, pulling the model from the [project repository](https://github.com/modelhub-ai/deep-prognosis) will not work out of the box, because of compatibility issues between Keras 1.x.x and Keras 2.x.x.

<br>

While it is possible to [use the `%tensorflow_version 1.x` magic to switch to the latest 1.x version of TensorFlow](https://colab.research.google.com/notebooks/tensorflow_version.ipynb)<sup>*</sup>, Colab does not allow switching to an older version of Keras. Fortunately, the solution to this issue is known and discussed in [various threads](https://github.com/keras-team/keras/issues/6382#issuecomment-530258501). We therefore provide, together with this notebook, a [Keras-2-compatible network configuration JSON file](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/nsclc-radiomics/demo/architecture.json).

<sup>*</sup> The magic command must be run before importing TensorFlow.

<br>

This JSON file is stored, together with the source code and the files needed for this demo, in the [Imaging Data Commons' "IDC-Examples" repository](https://github.com/ImagingDataCommons/IDC-Examples), specifically under [IDC-Examples/notebooks/nsclc-radiomics/](https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks/nsclc-radiomics).

Since we are interested in only one subdirectory of the repository, which a plain `git clone` cannot fetch on its own, we can use either [GitHub's CLI](https://cli.github.com/manual/) `gh` (requires authentication) or Apache Subversion `svn` [[1]](https://cheatography.com/davechild/cheat-sheets/subversion/) [[2]](https://stackoverflow.com/questions/7106012/download-a-single-folder-or-directory-from-a-github-repo):

In [None]:
%%capture
!sudo apt install subversion

In [None]:
!svn checkout https://github.com/ImagingDataCommons/IDC-Examples/trunk/notebooks/nsclc-radiomics/demo
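
GitHub has since retired its Subversion bridge, so the `svn checkout` above may fail on current infrastructure. One possible alternative, sketched here under the assumption that git 2.25 or later is installed, is a sparse checkout. The helper names and the `IDC-Examples` destination folder are hypothetical; the `master` branch is taken from the URLs above. Note that the files would land under `IDC-Examples/notebooks/nsclc-radiomics/demo` rather than `./demo`, so the paths used in the rest of the notebook would need adjusting.

```python
import subprocess


def sparse_checkout_commands(repo_url, subdir, dest, branch="master"):
    """Build the git commands that fetch only `subdir` of `repo_url` into `dest`."""
    return [
        # shallow, blob-less clone that initially checks out top-level files only
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
         "--branch", branch, repo_url, dest],
        # widen the sparse checkout to the requested subdirectory
        ["git", "-C", dest, "sparse-checkout", "set", subdir],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = sparse_checkout_commands(
    repo_url="https://github.com/ImagingDataCommons/IDC-Examples.git",
    subdir="notebooks/nsclc-radiomics/demo",
    dest="IDC-Examples",
)
for cmd in commands:
    print(" ".join(cmd))

# uncomment to actually download the files (requires network access):
# run_commands(commands)
```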

In [None]:
!pip3 install -r demo/requirements.txt

For image pre-processing we will use [Plastimatch](https://plastimatch.org), a reliable, open-source software package for image computation.

Plastimatch is available as an extension (plug-in) for 3D Slicer, but can also be used from the command line or from Python scripts (using libraries such as `subprocess`):

In [None]:
%%capture
!sudo apt update
!sudo apt install plastimatch

Verify the installation process was successful by checking Plastimatch version:

In [None]:
!plastimatch --version

---

In [None]:
import os
import sys
import json
import sklearn.metrics
import numpy as np
import pandas as pd
import pydicom
import SimpleITK as sitk

from IPython.display import clear_output

%tensorflow_version 1.x
import tensorflow as tf
import keras

print("Python version               : ", sys.version.split('\n')[0])
print("Numpy version                : ", np.__version__)
print("TensorFlow version           : ", tf.__version__)
print("Keras (stand-alone) version  : ", keras.__version__)

if has_gpu:
    print("\nThis Colab instance is equipped with a GPU.")
else:
    print("\nNo GPU was detected on this Colab instance.")

# ----------------------------------------

# everything related to plotting goes below

import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

## ----------------------------------------

# create a new colormap by appending an alpha channel to the selected one
# (so that we don't get a "color overlay" when plotting the segmask superimposed on the CT)
cmap = plt.cm.Reds
my_reds = cmap(np.arange(cmap.N))
my_reds[:,-1] = np.linspace(0, 1, cmap.N)
my_reds = ListedColormap(my_reds)

cmap = plt.cm.jet
my_jet = cmap(np.arange(cmap.N))
my_jet[:,-1] = np.linspace(0, 1, cmap.N)
my_jet = ListedColormap(my_jet)

## ----------------------------------------

import seaborn as sns

---

## Data Download

The Imaging Data Commons GCS buckets are "[requester pays](https://cloud.google.com/storage/docs/requester-pays)" buckets. Hence, it is not possible to [mount such buckets directly in Colab](https://gist.github.com/korakot/f3600576720206363c734eca5f302e38).

Instead, the user can query the BigQuery table associated with the bucket/dataset, select the cohort of interest, and then download the files with `gsutil`.

<font size="5" color="orange"><b>This step (auth) won't be needed once the IDC bucket goes live</b></font>

In [None]:
from google.colab import auth
auth.authenticate_user()

---
#### Exploiting BigQuery to Select which Data to Download

Let's run a BigQuery query, using the `%%bigquery` [IPython Magic](https://googleapis.dev/python/bigquery/latest/magics.html), to retrieve information about the dataset of interest (e.g., which subjects to download, the mapping between DICOM CTs and DICOM RTSTRUCTs/SEGs, ...).

Using the following syntax, the result will be stored as a DataFrame in `cohort_df`.

In [None]:
%%bigquery --project=idc-sandbox-000 cohort_df

WITH
  ct_series AS (
  SELECT
    DISTINCT(PatientID),
    StudyInstanceUID AS ctStudyInstanceUID,
    SeriesInstanceUID AS ctSeriesInstanceUID
  FROM
    `idc-dev-etl.idc_tcia.idc_tcia`
  WHERE
    PatientID LIKE "LUNG1%"
    AND Modality = "CT"
  ORDER BY
    PatientID),
  rtstruct_series AS (
  SELECT
    DISTINCT(PatientID),
    StudyInstanceUID AS rtstructStudyInstanceUID,
    SeriesInstanceUID AS rtstructSeriesInstanceUID
  FROM
    `idc-dev-etl.idc_tcia.idc_tcia`
  WHERE
    PatientID LIKE "LUNG1%"
    AND Modality = "RTSTRUCT"
  ORDER BY
    PatientID),
  seg_series AS (
  SELECT
    DISTINCT(PatientID),
    StudyInstanceUID AS segStudyInstanceUID,
    SeriesInstanceUID AS segSeriesInstanceUID
  FROM
    `idc-dev-etl.idc_tcia.idc_tcia`
  WHERE
    PatientID LIKE "LUNG1%"
    AND Modality = "SEG"
  ORDER BY
    PatientID)
SELECT
  PatientID,
  ctStudyInstanceUID,
  ctSeriesInstanceUID,
  rtstructStudyInstanceUID,
  rtstructSeriesInstanceUID,
  segStudyInstanceUID,
  segSeriesInstanceUID
FROM
  ct_series
JOIN
  rtstruct_series
USING
  (PatientID)
JOIN
  seg_series
USING
  (PatientID)
ORDER BY
  PatientID

In [None]:
cohort_df.head()

After selecting a few subjects from the cohort using the output of the BigQuery query, populate a dictionary with the `ctStudyInstanceUID` and `rtstructStudyInstanceUID` values. Finally, use `gsutil` to download this sub-cohort:

In [None]:
n_patients = 10

# useful for logging purposes
download_dict = dict()

# since gsutil can be called with "-m" providing a list as an input,
# append each gs URI to download to a list:
to_download = list()
base_gs_uri = 'gs://idc-tcia-1-nsclc-radiomics/dicom/'

# analysis baseline: Hosny et al. results
baseline_csv_name = 'nsclc-radiomics_hosny_baseline.csv'
baseline_csv_path = os.path.join('demo', baseline_csv_name)
baseline_df = pd.read_csv(baseline_csv_path)

# make sure the selected sub-cohort was analysed in full by Hosny et al.
#subcohort_df = cohort_df.sample(n = n_patients)

# list of the NSCLC-Radiomics subjects analysed in Hosny et al.
# (a[1:] drops the leading space the IDs carry in the baseline CSV)
baseline_subj_list = [a[1:] for a in list(baseline_df["id"].dropna())]

# intersection between the two sets
common_subj_list = list(set(baseline_subj_list) & set(cohort_df["PatientID"]))

assert len(baseline_subj_list) == len(common_subj_list)

# populate a dataset with n_patients sampled from this pool only
subcohort_df = cohort_df[cohort_df["PatientID"].isin(common_subj_list)].sample(n = n_patients)


for pat_num, pat in enumerate(list(subcohort_df["PatientID"])):
  print("(%g/%g) - PatientID: %s"%(pat_num + 1, n_patients, pat), end = '\r')

  pat_df = subcohort_df[subcohort_df["PatientID"] == pat]
  download_dict[pat] = dict()
  download_dict[pat]["ctStudyInstanceUID"] = pat_df["ctStudyInstanceUID"].values[0]
  download_dict[pat]["rtstructStudyInstanceUID"] = pat_df["rtstructStudyInstanceUID"].values[0]

  to_download.append(base_gs_uri + download_dict[pat]["ctStudyInstanceUID"])
  to_download.append(base_gs_uri + download_dict[pat]["rtstructStudyInstanceUID"])
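
The cohort selection above can be illustrated in isolation. The sketch below uses toy patient IDs (made up for the example) and plain Python instead of pandas: it strips the whitespace around the baseline IDs, checks that every baseline subject is present in the cohort, and draws the sample with a fixed seed so that re-running it selects the same patients. The function name `select_subcohort` and the seed value are assumptions; in pandas, a comparable effect can be obtained by passing `random_state` to `DataFrame.sample`.

```python
import random


def select_subcohort(baseline_ids, cohort_ids, n_patients, seed=0):
    """Reproducibly sample `n_patients` baseline IDs, all of which must be in the cohort."""
    # baseline IDs may carry surrounding whitespace (e.g. ' LUNG1-001')
    baseline = {pid.strip() for pid in baseline_ids}
    missing = baseline - set(cohort_ids)
    if missing:
        raise ValueError("baseline subjects missing from the cohort: %s" % sorted(missing))
    # sort before sampling: set ordering is not stable across runs
    pool = sorted(baseline)
    return random.Random(seed).sample(pool, n_patients)


toy_baseline = [" LUNG1-001", " LUNG1-002", " LUNG1-003", " LUNG1-004"]
toy_cohort = ["LUNG1-001", "LUNG1-002", "LUNG1-003", "LUNG1-004", "LUNG1-005"]

picked = select_subcohort(toy_baseline, toy_cohort, n_patients=2)
print(picked)
```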

In [None]:
# populate a dataframe with such gs URI (again, mostly for logging purposes)
manifesto_dict = {"gs_uri" : to_download}
manifesto_df = pd.DataFrame(manifesto_dict, columns = ["gs_uri"])

manifesto_df.head()

In [None]:
# generate a text file, which will be parsed by gsutil to download the selected patients:
manifesto_df.to_csv("gcs_paths.txt", header = False, index = False)

# check everything went as expected
!head /content/gcs_paths.txt -n 2
!echo "..." && echo ""

# the number of lines in the file should be equal to twice the specified "n_patients"
!echo "Number of lines in the file:" $(cat /content/gcs_paths.txt | wc -l)
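
The same check can also be done in Python, which avoids relying on the absolute `/content` path. This is a minimal sketch: the helper `check_manifest` is hypothetical and the study UIDs in the toy manifest are made up.

```python
import os
import tempfile


def check_manifest(manifest_path, expected_lines, prefix="gs://"):
    """Return the manifest lines after checking their number and URI scheme."""
    with open(manifest_path) as f:
        lines = [line.strip() for line in f if line.strip()]
    bad = [line for line in lines if not line.startswith(prefix)]
    if bad:
        raise ValueError("not a GCS URI: %s" % bad[0])
    if len(lines) != expected_lines:
        raise ValueError("expected %d lines, found %d" % (expected_lines, len(lines)))
    return lines


# toy manifest with made-up study UIDs, written to a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "gcs_paths.txt")
    with open(toy_path, "w") as f:
        f.write("gs://idc-tcia-1-nsclc-radiomics/dicom/1.2.3\n")
        f.write("gs://idc-tcia-1-nsclc-radiomics/dicom/1.2.4\n")
    toy_lines = check_manifest(toy_path, expected_lines=2)

print(len(toy_lines), "URIs checked")
```

In the notebook, the call would be `check_manifest("gcs_paths.txt", expected_lines = 2 * n_patients)`.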

---

#### DICOM Data Download

The following instructions will download the DICOM CT and DICOM RTSTRUCT files for `n_patients` patients. For `n_patients = 10`, the download should take approximately 3-4 minutes (roughly 1-1.5k files, a total of 500-700MB).

In [None]:
# if everything is all right, proceed with the download
!mkdir -p data/nsclc-radiomics/dicom

!cat gcs_paths.txt | gsutil -u idc-sandbox-000 -m cp -Ir ./data/nsclc-radiomics/dicom

---

Check the disk space after the download:

In [None]:
!df -h

In [None]:
!du -h ./data/nsclc-radiomics -d 0


---

## Data Pre-processing

Let us use the Python scripts under `demo` to pre-process the data:

* First, each DICOM Series (CT and RTSTRUCT) is converted from DICOM to NRRD using Plastimatch. Following the authors' pipeline, we then resample all the volumes to 1-mm isotropic:
  * In this case, the interpolation step uses a linear strategy but, as we will investigate later, the model is robust to the textural differences introduced by a different interpolation strategy (such as nearest neighbour);
  * A .png image is exported for quality control, together with other potentially useful information, and can be found in the patient folder (under `nsclc-radiomics_preprocessed`);
* After the conversion and the resampling, other application-specific transformations are applied to the data. The Center of Mass (CoM) of the main tumour is computed from the labelled GTV, and a $150 \times 150 \times 150$ subvolume is cropped around that point.
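
The CoM-and-crop step can be sketched in a few lines of dependency-free Python. This is an illustration, not the code in `demo`: the function names are hypothetical, the mask is given as a list of foreground voxel coordinates, and shifting the crop window back inside the volume near the borders is an assumption (the demo's `export_com_subvolume` may handle borders differently).

```python
def center_of_mass(voxels):
    """Center of mass of a binary mask given as (z, y, x) foreground coordinates."""
    n = len(voxels)
    if n == 0:
        raise ValueError("empty segmentation mask")
    return tuple(sum(v[axis] for v in voxels) / n for axis in range(3))


def crop_bounds(com, vol_shape, crop_size=(150, 150, 150)):
    """Per-axis (start, stop) indices of a crop centred on `com`, kept inside the volume."""
    bounds = []
    for c, dim, size in zip(com, vol_shape, crop_size):
        if size > dim:
            raise ValueError("crop larger than volume")
        start = int(c) - size // 2
        # shift the window if it would cross the volume border
        start = max(0, min(start, dim - size))
        bounds.append((start, start + size))
    return bounds


# toy GTV made of four voxels, in a 300 x 500 x 500 volume
toy_gtv = [(100, 200, 300), (102, 200, 300), (101, 202, 298), (101, 198, 302)]
com = center_of_mass(toy_gtv)
bounds = crop_bounds(com, (300, 500, 500))

print(com)     # (101.0, 200.0, 300.0)
print(bounds)  # [(26, 176), (125, 275), (225, 375)]
```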

<br>

In order to save space on the Colab instance partition, only the final $150 \times 150 \times 150$ subvolumes (both CT and the GTV segmentation mask, saved in NRRD) are kept. 

All the information regarding the crop, the list of segmentation masks found in the RTSTRUCT, and the CoM, together with the aforementioned quality-control PNGs, is saved under the patient folder as follows:

```
data
    |_nsclc-radiomics_preprocessed
                                  |_nrrd
                                        |_LUNG1-XYZ
                                                   |_LUNG1-XYZ_whole_ct_rt
                                                   |_LUNG1-XYZ_com_log.json
                                                   |_LUNG1-XYZ_lookup_info.json
                                                   ...
```
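
Given this layout, a quick way to see which patients were processed completely is to list the expected files and report the missing ones. The sketch below is hypothetical helper code, not part of `demo`; the file suffixes are the ones used by the processing loop in this notebook, and the toy folder is created only for the example.

```python
import os
import tempfile

OUTPUT_SUFFIXES = ["_ct_res_crop.nrrd", "_rt_res_crop.nrrd",
                   "_com_log.json", "_lookup_info.json"]


def expected_outputs(preproc_data_path, pat_id):
    """Paths of the files the pre-processing should leave in the patient folder."""
    pat_dir = os.path.join(preproc_data_path, pat_id)
    return [os.path.join(pat_dir, pat_id + suffix) for suffix in OUTPUT_SUFFIXES]


def missing_outputs(preproc_data_path, pat_id):
    """Subset of the expected files that does not exist on disk."""
    return [p for p in expected_outputs(preproc_data_path, pat_id)
            if not os.path.exists(p)]


# toy patient folder containing only the two JSON logs
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_dir = os.path.join(tmp_dir, "LUNG1-XYZ")
    os.makedirs(toy_dir)
    for suffix in ["_com_log.json", "_lookup_info.json"]:
        with open(os.path.join(toy_dir, "LUNG1-XYZ" + suffix), "w") as f:
            f.write("{}")
    toy_missing = [os.path.basename(p) for p in missing_outputs(tmp_dir, "LUNG1-XYZ")]

print(toy_missing)
```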

In [None]:
from demo.data_utils import *
from demo.utils import *

# FIXME: DEBUG

#import importlib
#importlib.reload(data_utils)

In [None]:
data_base_path = 'data'

dataset_name = 'nsclc-radiomics'
dataset_path = os.path.join(data_base_path, dataset_name)
data_path = os.path.join(dataset_path, 'dicom')

preproc_dataset_name = dataset_name + '_preprocessed'
preproc_dataset_path = os.path.join(data_base_path, preproc_dataset_name)
preproc_data_path = os.path.join(preproc_dataset_path, 'nrrd')
    
if not os.path.exists(preproc_data_path):
    os.makedirs(preproc_data_path)

In [None]:
for pat_num, pat in enumerate(subcohort_df.PatientID.values):

    # clear cell output before moving to the next (goes at the top to clean what comes next)
    clear_output(wait = True)
    
    print("\nPatient %d/%d (%s)"%(pat_num + 1, len(subcohort_df), pat))
    
    pat_dir_path = os.path.join(preproc_data_path, pat)
    
    if not os.path.exists(pat_dir_path):
        os.mkdir(pat_dir_path)
    
    # location where the tmp nrrd files (resampled CT/RTSTRUCT nrrd) should be saved
    # by the "export_res_nrrd_from_dicom" function found in preprocess.py
    ct_nrrd_path = os.path.join(pat_dir_path, pat + '_ct_resampled.nrrd')
    rt_nrrd_path = os.path.join(pat_dir_path, pat + '_rt_resampled.nrrd')

    # location where the nrrd files (cropped resampled CT/RTSTRUCT nrrd) should be saved
    # by the "export_com_subvolume" function found in preprocess.py
    ct_nrrd_crop_path = os.path.join(pat_dir_path, pat + '_ct_res_crop.nrrd')
    rt_nrrd_crop_path = os.path.join(pat_dir_path, pat + '_rt_res_crop.nrrd')
    
    # if the latter are already there, skip the processing
    if os.path.exists(ct_nrrd_crop_path) and os.path.exists(rt_nrrd_crop_path):
        print("%s\nand\n%s\nfound, skipping the processing for patient %s..."%(ct_nrrd_crop_path,
                                                                               rt_nrrd_crop_path, 
                                                                               pat))
        continue
    
    ## ----------------------------------------
    
    pat_df = subcohort_df[subcohort_df["PatientID"] == pat]


    path_to_ct_dir = os.path.join(data_path,
                                  pat_df["ctStudyInstanceUID"].values[0],
                                  pat_df["ctSeriesInstanceUID"].values[0])

    path_to_rt_dir = os.path.join(data_path,
                                  pat_df["rtstructStudyInstanceUID"].values[0],
                                  pat_df["rtstructSeriesInstanceUID"].values[0])

    path_to_seg_dir = os.path.join(data_path, 
                                   pat_df["segStudyInstanceUID"].values[0], 
                                   pat_df["segSeriesInstanceUID"].values[0])

    # FIXME: sanity check
    assert os.path.exists(path_to_ct_dir)
    assert os.path.exists(path_to_rt_dir)
    assert os.path.exists(path_to_seg_dir)    
    
    # log lookup information (human-readable to StudyUID and SeriesUID)
    lookup_dict_path = os.path.join(pat_dir_path, pat + '_lookup_info.json')
    
    lookup_dict = dict()
    lookup_dict[pat] = dict()
    
    lookup_dict[pat]["path_to_ct_dir"] = path_to_ct_dir
    lookup_dict[pat]["ctStudyInstanceUID"] = pat_df["ctStudyInstanceUID"].values[0]
    lookup_dict[pat]["ctSeriesInstanceUID"] = pat_df["ctSeriesInstanceUID"].values[0]
    
    lookup_dict[pat]["path_to_rt_dir"] = path_to_rt_dir
    lookup_dict[pat]["rtstructStudyInstanceUID"] = pat_df["rtstructStudyInstanceUID"].values[0]
    lookup_dict[pat]["rtstructSeriesInstanceUID"] = pat_df["rtstructSeriesInstanceUID"].values[0]
    
    # FIXME: not used so far but still, useful to log for future purposes
    lookup_dict[pat]["path_to_seg_dir"] = path_to_seg_dir
    lookup_dict[pat]["segStudyInstanceUID"] = pat_df["segStudyInstanceUID"].values[0]
    lookup_dict[pat]["segSeriesInstanceUID"] = pat_df["segSeriesInstanceUID"].values[0]
    
    with open(lookup_dict_path, 'w') as json_file:
        json.dump(lookup_dict, json_file, indent = 2)
    
    ## ----------------------------------------

    proc_log = export_res_nrrd_from_dicom(dicom_ct_path = path_to_ct_dir, 
                                          dicom_rt_path = path_to_rt_dir, 
                                          output_dir = pat_dir_path, 
                                          pat_id = pat,
                                          output_dtype = "float")
    
    # check every step of the DICOM to NRRD conversion returned 0 (everything's ok)
    assert(np.sum(np.array(list(proc_log.values()))) == 0)
    
    # FIXME: sanity check
    assert(os.path.exists(ct_nrrd_path))
    assert(os.path.exists(rt_nrrd_path))
    
    sitk_vol = sitk.ReadImage(ct_nrrd_path)
    vol = sitk.GetArrayFromImage(sitk_vol)
    
    sitk_seg = sitk.ReadImage(rt_nrrd_path)
    seg = sitk.GetArrayFromImage(sitk_seg)
    
    # FIXME: sanity check
    assert(vol.shape == seg.shape)
    
    com = compute_center_of_mass(seg)
    com_int = [int(coord) for coord in com]

    # export the CoM slice (CT + RTSTRUCT) for quality control
    export_png_slice(input_volume = vol,
                     input_segmask = seg,
                     fig_out_path = os.path.join(pat_dir_path, pat + '_whole_CT_CoM.png'),
                     fig_dpi = 220,
                     lon_slice_idx = com_int[0],
                     cor_slice_idx = com_int[1],
                     sag_slice_idx = com_int[2],
                     z_first = True)
    
    # crop a (150, 150, 150) subvolume from the resampled scans, get rid of the latter
    proc_log = export_com_subvolume(ct_nrrd_path = ct_nrrd_path, 
                                    rt_nrrd_path = rt_nrrd_path, 
                                    crop_size = (150, 150, 150), 
                                    output_dir = pat_dir_path,
                                    pat_id = pat,
                                    z_first = True, 
                                    rm_orig = True)
    
    # log CoM information
    com_log_path = os.path.join(pat_dir_path, pat + '_com_log.json')
    com_log_dict = {k : v for (k, v) in proc_log.items() if "com_int" in k}
    
    with open(com_log_path, 'w') as json_file:
        json.dump(com_log_dict, json_file, indent = 2)
    
    # if CoM calculation goes wrong then continue
    proc_log_crop = {k : v for (k, v) in proc_log.items() if "cropping" in k}
    if len(proc_log_crop) == 0:
        os.remove(ct_nrrd_path)
        os.remove(rt_nrrd_path)
        continue
    
    # check the cropped volumes have been exported as intended
    assert(np.sum(np.array(list(proc_log_crop.values()))) == 0)
    assert(os.path.exists(ct_nrrd_crop_path))
    assert(os.path.exists(rt_nrrd_crop_path))
    
    sitk_vol = sitk.ReadImage(ct_nrrd_crop_path)
    vol_crop = sitk.GetArrayFromImage(sitk_vol)

    sitk_seg = sitk.ReadImage(rt_nrrd_crop_path)
    seg_crop = sitk.GetArrayFromImage(sitk_seg)
    
    # export the cropped subvolume CoM slice (CT + RTSTRUCT) for quality control
    export_png_slice(input_volume = vol_crop,
                     input_segmask = seg_crop,
                     fig_out_path = os.path.join(pat_dir_path, pat + '_crop_CT_CoM.png'),
                     fig_dpi = 220,
                     lon_slice_idx = 75,
                     cor_slice_idx = 75,
                     sag_slice_idx = 75,
                     z_first = True)

---

### Logging All the Processing Details

In order to make sure the whole pipeline is easily reproducible, let's log all the details in `data/nsclc-radiomics_preproc_details.csv`.

The CSV file contains information regarding:
* The patient ID;
* The relative paths to the folders containing the DICOM CT, DICOM RTSTRUCT and DICOM SEG Series;
* The Study and Series Instance UID of all the Series;
* The shape of the original DICOM Series and the shape of the NRRD volumes resampled to 1-mm isotropic;
* The name of the label used to compute the CoM (as stored in the DICOM RTSTRUCT Series, and thus exported by Plastimatch), the CoM (integer coordinates), the bounding box size and its coordinates in the resampled volume space.
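
As a worked example of how the resampled shape relates to the original one: at 1-mm isotropic spacing, the number of voxels along each axis is approximately the physical extent in millimetres. The sketch below is an approximation under stated assumptions (the slice thickness equals the slice spacing, and the extent is truncated to an integer); the function name and the example values are made up, and the actual resampler may round differently.

```python
def resampled_shape(n_slices, rows, cols, pixel_spacing, slice_thickness, target_mm=1.0):
    """Approximate (z, y, x) shape after resampling to `target_mm` isotropic voxels."""
    # physical extent along each axis divided by the target voxel size
    z = int(n_slices * slice_thickness / target_mm)
    y = int(rows * pixel_spacing[0] / target_mm)
    x = int(cols * pixel_spacing[1] / target_mm)
    return (z, y, x)


# 134 slices of 512 x 512 pixels, 0.9765625 mm in-plane spacing, 3 mm thick
toy_shape = resampled_shape(n_slices=134, rows=512, cols=512,
                            pixel_spacing=(0.9765625, 0.9765625),
                            slice_thickness=3.0)
print(toy_shape)  # (402, 500, 500)
```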

In [None]:
csv_out_name = 'nsclc-radiomics_preproc_details.csv'
dataset_csv_path = os.path.join(data_base_path, csv_out_name)


df_keys = ['PatientID',
           'path_to_ct_dir', 'ctStudyInstanceUID', 'ctSeriesInstanceUID',
           'path_to_rt_dir', 'rtstructStudyInstanceUID', 'rtstructSeriesInstanceUID',
           'path_to_seg_dir', 'segStudyInstanceUID', 'segSeriesInstanceUID',
           'rt_exported', 'orig_shape', '1mm_iso_shape', 'crop_shape', 'com_int', 'bbox']

data = {k : list() for k in df_keys}

det_df = pd.DataFrame(data = data, dtype = object)

In [None]:
for pat_num, pat in enumerate(subcohort_df.PatientID.values):

    print("\rProcessing patient '%s' (%d/%d)... "%(pat, pat_num + 1,
                                                   len(subcohort_df.PatientID)),
          end = '')
    
    pat_df = subcohort_df[subcohort_df['PatientID'] == pat]
            
    # init a dictionary with the same keys as "df_keys" to populate the latter
    pat_dict = dict()
    pat_dict["PatientID"] = pat

    pat_dir_path = os.path.join(preproc_data_path, pat)
    pat_json_path = os.path.join(pat_dir_path, pat + '_lookup_info.json')

    with open(pat_json_path, 'r') as json_file:
        lookup_dict = json.load(json_file)
    
    pat_dict["path_to_ct_dir"] = lookup_dict[pat]["path_to_ct_dir"]
    pat_dict["ctStudyInstanceUID"] = pat_df['ctStudyInstanceUID'].values[0]
    pat_dict["ctSeriesInstanceUID"] = pat_df['ctSeriesInstanceUID'].values[0]

    pat_dict["path_to_rt_dir"] = lookup_dict[pat]["path_to_rt_dir"]
    pat_dict["rtstructStudyInstanceUID"] = pat_df['rtstructStudyInstanceUID'].values[0]
    pat_dict["rtstructSeriesInstanceUID"] = pat_df['rtstructSeriesInstanceUID'].values[0]

    pat_dict["path_to_seg_dir"] = lookup_dict[pat]["path_to_seg_dir"]
    pat_dict["segStudyInstanceUID"] = pat_df['segStudyInstanceUID'].values[0]
    pat_dict["segSeriesInstanceUID"] = pat_df['segSeriesInstanceUID'].values[0]
    
    # ----------------------------------------
    
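    # populate the "rt_exported" field
    rt_folder = os.path.join(pat_dir_path, pat  + '_whole_ct_rt')
    pat_dict['rt_exported'] = [f for f in os.listdir(rt_folder) if 'gtv-1' in f.lower()][0].split('.nrrd')[0]
    # flag the patients whose GTV label deviates from the expected 'GTV-1'
    if pat_dict['rt_exported'] != 'GTV-1':
        print("\nPatient '%s': GTV label exported as '%s'"%(pat, pat_dict['rt_exported']))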
    
    # ----------------------------------------
    
    dicom_ct_path = lookup_dict[pat]["path_to_ct_dir"]
    
    dcm_file_path = os.path.join(dicom_ct_path,                 # parent folder
                                 os.listdir(dicom_ct_path)[0])  # *.dcm files

    dcm_file = pydicom.dcmread(dcm_file_path)
    n_dcm_files = len([f for f in os.listdir(dicom_ct_path) if '.dcm' in f])

    xy = int(float(dcm_file.Rows)*float(dcm_file.PixelSpacing[0]))
    z = int(float(dcm_file.SliceThickness)*float(n_dcm_files))
    
    orig_dcm_shape = (n_dcm_files, dcm_file.Columns, dcm_file.Rows)
    res_dcm_shape = (z, xy, xy)
    
    pat_dict['orig_shape'] = orig_dcm_shape
    pat_dict['1mm_iso_shape'] = res_dcm_shape
    
    # ----------------------------------------
     
    com_json_path = os.path.join(pat_dir_path, pat + '_com_log.json')
    
    try:
        with open(com_json_path, 'r') as json_file:
            com_dict = json.load(json_file)
            pat_dict['com_int'] = tuple(com_dict["com_int"])
    except Exception:
        print('_com_log.json loading error;')
    
    # ----------------------------------------
    
    bbox_json_path = os.path.join(pat_dir_path, pat + '_crop_log.json')
    
    try:
        with open(bbox_json_path, 'r') as json_file:
            bbox_dict = json.load(json_file)
            pat_dict['bbox'] = bbox_dict
    except Exception:
        print('_crop_log.json loading error;')
    
    # ----------------------------------------
    ct_res_crop_path = os.path.join(pat_dir_path, pat + '_ct_res_crop.nrrd')
    
    try:
        sitk_ct_res_crop = sitk.ReadImage(ct_res_crop_path)
        pat_dict['crop_shape'] = sitk_ct_res_crop.GetSize()
    except Exception:
        print('_ct_res_crop.nrrd loading error;')
        
    # ----------------------------------------
    
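    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    det_df = pd.concat([det_df, pd.DataFrame([pat_dict])], ignore_index = True)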

det_df.to_csv(dataset_csv_path, index = False)

---
---

# <center> Data Exploration </center>


---
---

# <center> Data Processing </center>


Brief description of the model here (image from the paper, or description of the architecture, also from the paper)?

Then, load the model.

In [None]:
arch_json_path = "demo/architecture.json"
weights_path = "demo/weights.h5"

"""
 N.B. the warnings are due to the fact that the model was developed for
 Keras 1, and the config file has been converted in a Keras-2-compatible file

 nonetheless, Keras 2 uses different naming conventions/def.s, so in order to
 get rid of the warnings one should change all the layers def.s in the JSON file
"""

# load the model architecture from the config file, then load the model weights 
with open(arch_json_path, 'r') as json_file:
    model_json = json.load(json_file)  

model = keras.models.model_from_config(model_json)
model.summary()

model.load_weights(weights_path)

In [None]:
"""
# sanity check on model weights + visualisation?
assert np.sum(model.get_weights()[1]) != 0

with tf.Session() as sess:
  #This way
  tf.global_variables_initializer().run()
  
  bn1_weights = model.layers[1].weights[2].value()
  
  print(model.layers[1].weights[2].name)
  print("weights:", sess.run(bn1_weights))
"""

---

In [None]:
# define a new dataframe to store basic information + baseline output
# as well as the reproduced experiment output
df_keys = ['PatientID', 'StudyInstanceUID', 'SeriesInstanceUID_CT',
           'SeriesInstanceUID_RTSTRUCT', 'CNN_output_raw', 'CNN_output_argmax',
           'baseline_output_raw', 'baseline_output_argmax', 'surv2yr'
           ]

data = {k : list() for k in df_keys}

out_df = pd.DataFrame(data, dtype = object)

In [None]:
y_pred_dict = dict()

input_df = pd.read_csv(dataset_csv_path) 
input_subj_list = list(input_df["PatientID"])

"""
# analysis baseline: Hosny et Al. results
baseline_csv_name = 'nsclc-radiomics_hosny_baseline.csv'
baseline_csv_path = os.path.join('demo', baseline_csv_name)
baseline_df = pd.read_csv(baseline_csv_path)
"""

for idx, subj in enumerate(input_subj_list):

    print("Processing subject '%s' (%d/%d)... "%(subj, idx + 1, len(input_subj_list)), end = '\r')

    """
    The NRRD files for each subject in "input_df" should exist and be readable
    (already double checked during the creation of 'lung1_proc_details.csv').
    If not, just run the code in  'lung1_det_csv.ipynb', found under /src.
    """
    
    subj_df = input_df[input_df['PatientID'] == subj]
    
    ct_res_crop_path = os.path.join(preproc_data_path, subj, subj + '_ct_res_crop.nrrd')
    
    input_vol = get_input_volume(input_ct_nrrd_path = ct_res_crop_path)
    input_vol = np.expand_dims(input_vol, axis = 0)
    input_vol = np.expand_dims(input_vol, axis = -1)
    
    y_pred_raw = model.predict(input_vol)
    y_pred_argmax = int(np.argmax(y_pred_raw[0]))
    
    subj_dict = dict()
    subj_dict["PatientID"] = subj
    
    subj_dict["StudyInstanceUID"] = subj_df["ctStudyInstanceUID"].values[0]
    subj_dict["SeriesInstanceUID_CT"] = subj_df["ctSeriesInstanceUID"].values[0]
    subj_dict["SeriesInstanceUID_RTSTRUCT"] = subj_df["rtstructSeriesInstanceUID"].values[0]

    subj_dict["CNN_output_raw"] = y_pred_raw.tolist()[0]
    subj_dict["CNN_output_argmax"] = y_pred_argmax
    
    baseline_output_list = list()
    
    try:
        baseline_output_list.append(baseline_df[baseline_df["id"] == ' %s'%(subj)]["logit_0"].values[0])
        baseline_output_list.append(baseline_df[baseline_df["id"] == ' %s'%(subj)]["logit_1"].values[0])

        subj_dict['baseline_output_raw'] = np.array(baseline_output_list)
        subj_dict['baseline_output_argmax'] = int(np.argmax(np.array(baseline_output_list)))
        
        subj_dict['surv2yr'] = baseline_df[baseline_df["id"] == ' %s'%(subj)]["surv2yr"].values[0]
    except (IndexError, KeyError):
        # subject or column missing from the baseline CSV: leave the fields empty
        pass

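    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    out_df = pd.concat([out_df, pd.DataFrame([subj_dict])], ignore_index = True)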

In [None]:
out_df

---
---

# <center> Visualising the Results </center>


In [None]:
# replication
y_true = np.stack(out_df["surv2yr"].values)
y_pred = np.stack(out_df["CNN_output_raw"].values)

fpr, tpr, thr_roc = sklearn.metrics.roc_curve(y_true, y_pred[:, 1])
prc, rec, thr_pr = sklearn.metrics.precision_recall_curve(y_true, y_pred[:, 1])

roc_auc = sklearn.metrics.auc(fpr, tpr)
pr_auc = sklearn.metrics.auc(rec, prc)

print("ROC AUC: %g"%(roc_auc))
print("PR AUC: %g"%(pr_auc))
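
For intuition about the number printed above: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counting one half. The dependency-free sketch below computes it that way on toy labels and scores (made up for the example); the function name is hypothetical. It also makes explicit that the metric is undefined when the sample contains a single class, which can happen with a sub-cohort as small as `n_patients = 10`.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC is undefined when only one class is present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count one half
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
print("ROC AUC: %g" % toy_auc)  # ROC AUC: 0.75
```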

In [None]:
# Hosny et al. model
y_pred_baseline = np.stack(out_df["baseline_output_raw"].values)

fpr_base, tpr_base, thr_roc_base = sklearn.metrics.roc_curve(y_true, 
                                                             y_pred_baseline[:, 1])

prc_base, rec_base, thr_pr_base = sklearn.metrics.precision_recall_curve(y_true,
                                                                         y_pred_baseline[:, 1])

roc_auc_baseline = sklearn.metrics.auc(fpr_base, tpr_base)
pr_auc_baseline = sklearn.metrics.auc(rec_base, prc_base)

print("ROC AUC: %g"%(roc_auc_baseline))
print("PR AUC: %g"%(pr_auc_baseline))

In [None]:
# operating point
opp = 0.5

opp_roc = np.argmin(np.abs(thr_roc - opp))
opp_pr = np.argmin(np.abs(thr_pr - opp))

print('ROC OPP: FPR = %2.4f, TPR = %2.4f'%(fpr[opp_roc], tpr[opp_roc]))
print('PR OPP: PRC = %2.4f, REC = %2.4f'%(prc[opp_pr], rec[opp_pr]))

# ----------------------------------------

opp_baseline = 0.5

opp_roc_baseline = np.argmin(np.abs(thr_roc_base - opp_baseline))
opp_pr_baseline = np.argmin(np.abs(thr_pr_base - opp_baseline))

print('\nROC OPP: FPR = %2.4f, TPR = %2.4f'%(fpr_base[opp_roc_baseline], tpr_base[opp_roc_baseline]))
print('PR OPP: PRC = %2.4f, REC = %2.4f'%(prc_base[opp_pr_baseline], rec_base[opp_pr_baseline]))
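
The lookup above picks the curve point whose listed threshold is closest to 0.5; since the curves only contain thresholds that occur among the scores, that point need not coincide with the exact 0.5 operating point. As a cross-check, the rates can be computed directly from the thresholded predictions. This is a minimal sketch on toy data (made up for the example) with a hypothetical function name; it assumes a score greater than or equal to the threshold is predicted as positive.

```python
def rates_at_threshold(y_true, y_score, threshold=0.5):
    """TPR, FPR, precision and recall obtained by thresholding the scores."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # undefined ratios (empty denominator) are reported as NaN
        return num / den if den else float("nan")

    return {"tpr": ratio(tp, tp + fn),
            "fpr": ratio(fp, fp + tn),
            "precision": ratio(tp, tp + fp),
            "recall": ratio(tp, tp + fn)}


toy_rates = rates_at_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], threshold=0.5)
print(toy_rates)
```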

In [None]:
sns.set()

fig, (ax0, ax1) = plt.subplots(1, 2, figsize = (12, 12))

# plot ROC curve
ax0.plot(fpr, tpr, label = 'ROC AUC: %2.2f'%(roc_auc))

# plot operating point on ROC curve
ax0.plot(fpr[opp_roc], tpr[opp_roc], label = 'CNN Operating Point', marker = '^', color = 'red')
ax0.legend(loc = 'lower right')
ax0.set_aspect('equal', 'box')
ax0.set_xlim([-0.01, 1.01])
ax0.set_ylim([-0.01, 1.01])
ax0.set_xlabel('FPR')
ax0.set_ylabel('TPR')
ax0.set_title('ROC curve - replicated pre-processing pipeline')

## ----------------------------------------

# plot PR curve
ax1.plot(rec, prc, label = 'PR AUC: %2.2f'%(pr_auc))

# plot operating point on PR curve
ax1.plot(rec[opp_pr], prc[opp_pr], label = 'CNN Operating Point', marker = '^', color = 'red')
ax1.legend(loc = 'upper right')
ax1.set_aspect('equal', 'box')
ax1.set_xlim([-0.01, 1.01])
ax1.set_ylim([-0.01, 1.01])
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title('PR curve - replicated pre-processing pipeline')

In [None]:
sns.set()

fig, (ax0, ax1) = plt.subplots(1, 2, figsize = (12, 12))

# plot ROC curve
ax0.plot(fpr_base, tpr_base, label = 'ROC AUC: %2.2f'%(roc_auc_baseline))

# plot operating point on ROC curve
ax0.plot(fpr_base[opp_roc_baseline], tpr_base[opp_roc_baseline],
         label = 'CNN Operating Point', marker = '^', color = 'red')
ax0.legend(loc = 'lower right')
ax0.set_aspect('equal', 'box')
ax0.set_xlim([-0.01, 1.01])
ax0.set_ylim([-0.01, 1.01])
ax0.set_xlabel('FPR')
ax0.set_ylabel('TPR')
ax0.set_title('ROC curve - results from Hosny et al.')

## ----------------------------------------

# plot PR curve
ax1.plot(rec_base, prc_base, label = 'PR AUC: %2.2f'%(pr_auc_baseline))

# plot operating point on PR curve
ax1.plot(rec_base[opp_pr_baseline], prc_base[opp_pr_baseline],
         label = 'CNN Operating Point', marker = '^', color = 'red')
ax1.legend(loc = 'upper right')
ax1.set_aspect('equal', 'box')
ax1.set_xlim([-0.01, 1.01])
ax1.set_ylim([-0.01, 1.01])
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title('PR curve - results from Hosny et al.')