<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/nsclc-radiomics/src/nsclc_radiomics_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Environment Setup, Data Download and Pre-processing </center>


In [105]:
curr_dir = !pwd
curr_droid = !hostname
curr_pilot = !whoami

print("Current directory :", curr_dir[-1])
print("Hostname          :", curr_droid[-1])
print("Username          :", curr_pilot[-1])

Current directory : /content
Hostname          : 027ce7241af7
Username          : root


## Environment Set Up

N.B. To access a free GPU on Colab:
`Edit > Notebook settings`

From the dropdown menu under `Hardware accelerator`, select `GPU`.

---
Check whether the filesystem was wiped or not:

In [None]:
!ls

sample_data


Check whether the instance is a GPU instance.

This is useful **for debugging purposes**, e.g., to drive some of the operations in the following blocks (such as installing the right version of Keras and importing TensorFlow).

In [None]:
gpu_list = !nvidia-smi --list-gpus

has_gpu = "failed" not in gpu_list[0]

has_gpu

False
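
The check above relies on the word "failed" appearing in the first output line. A slightly more defensive variant is sketched below; `detect_gpu` is a hypothetical helper name, and the assumption is that a working driver prints one `GPU <n>: ...` line per device and exits with status 0.

```python
import shutil
import subprocess


def detect_gpu():
    """Return True if `nvidia-smi --list-gpus` reports at least one GPU."""
    # On CPU-only instances the binary may be missing altogether.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # Assumption: a working driver exits with 0 and prints "GPU <n>: ..." lines.
    return result.returncode == 0 and any(
        line.startswith("GPU ") for line in result.stdout.splitlines()
    )


has_gpu = detect_gpu()
print("GPU available:", has_gpu)
```

Because it never indexes into the output, this version also copes with an empty result.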

---
<font size="+2" color="orange"><b>Be aware of Tensorflow/Keras compatibility problems.</b></font>

Ahmed's model was compiled and saved in such a way that only old versions of TensorFlow support it.

Output of `pip3 freeze | grep tensor` on the IDC VM:
```
tensorflow-datasets==1.2.0
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.2
tensorflow-hub==0.6.0
tensorflow-io==0.8.1
tensorflow-metadata==0.21.1
tensorflow-probability==0.8.0
tensorflow-serving-api-gpu==1.15.0
```

For `pip3 freeze | grep Keras`:
```
Keras==1.2.2
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
```

While running `nvcc --version` gets us:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
```

<br>

Google Colab's GPU instances come with CUDA 10.1 pre-installed:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
```

By default, newer versions of TensorFlow are installed, in which Keras is fully integrated:

```
keras-vis==0.4.1

tensorflow==2.3.0
tensorflow-addons==0.8.3
tensorflow-datasets==2.1.0
tensorflow-estimator==2.3.0
tensorflow-gcs-config==2.3.0
tensorflow-hub==0.9.0
tensorflow-metadata==0.24.0
tensorflow-privacy==0.2.2
tensorflow-probability==0.11.0
```

<br>

Luckily, [the `%tensorflow_version 1.x` magic](https://colab.research.google.com/notebooks/tensorflow_version.ipynb) makes it possible to switch to an older version of TensorFlow (`1.15.2`). The magic command must be run before importing TensorFlow. CUDA 10.1 is still used, which probably means that TensorFlow 1.15, which is not compatible with CUDA 10.1 by default, was compiled from source.


We will then need to install the old version of Keras with which Ahmed's model was saved.
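
Once the magic and the Keras install have run, it can help to confirm that the legacy pairing is really the one in use. The sketch below is only an illustration: `version_tuple` and `is_legacy_stack` are hypothetical helpers, and the strings passed in the example stand in for `tf.__version__` and `keras.__version__`.

```python
def version_tuple(version):
    """Turn "1.15.2" into (1, 15, 2); non-numeric suffixes such as "rc1" are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_legacy_stack(tf_version, keras_version):
    """True for TensorFlow 1.x together with Keras 1.x."""
    return (version_tuple(tf_version)[:1] == (1,)
            and version_tuple(keras_version)[:1] == (1,))


# Versions from the IDC VM listing, then an arbitrary 2.x pairing for contrast.
print(is_legacy_stack("1.15.2", "1.2.2"))
print(is_legacy_stack("2.3.0", "2.4.3"))
```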

---

In [None]:
# FIXME: useful for debug purposes
if has_gpu:
  !pip3 install --force-reinstall Keras==1.2.2

In [None]:
# FIXME: parse from a GitHub repository
requirements_list = ["numpy",
                     "pandas",
                     "SimpleITK",
                     "pydicom",
                     "matplotlib",
                     "seaborn"]

with open('requirements.txt', 'w') as fp:
  for item in requirements_list:
    fp.write("%s\n"%(item))

!pip3 install -r requirements.txt

Collecting SimpleITK
  Downloading https://files.pythonhosted.org/packages/22/c6/0319c4efabb6e7f5650bbd41e1e5ec5c89ca0e857a9aaf287c29ac8c266c/SimpleITK-2.0.0-cp36-cp36m-manylinux1_x86_64.whl (44.9MB)
     |████████████████████████████████| 44.9MB 104kB/s
Collecting pydicom
  Downloading https://files.pythonhosted.org/packages/d3/56/342e1f8ce5afe63bf65c23d0b2c1cd5a05600caad1c211c39725d3a4cc56/pydicom-2.0.0-py3-none-any.whl (35.4MB)
     |████████████████████████████████| 35.5MB 1.4MB/s
Installing collected packages: SimpleITK, pydicom
Successfully installed SimpleITK-2.0.0 pydicom-2.0.0
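
For reproducibility, the requirements file could also carry version pins. The following is a minimal sketch, not the notebook's method: `requirement_lines` is a hypothetical helper, and only the two versions visible in the install log above are pinned.

```python
def requirement_lines(packages):
    """Format {"name": "version" or None} as requirements.txt lines."""
    lines = []
    for name, version in packages.items():
        lines.append(name if version is None else "%s==%s" % (name, version))
    return lines


# Pins taken from the install log above; the other packages are left unpinned.
packages = {
    "numpy": None,
    "pandas": None,
    "SimpleITK": "2.0.0",
    "pydicom": "2.0.0",
    "matplotlib": None,
    "seaborn": None,
}

with open("requirements.txt", "w") as fp:
    fp.write("\n".join(requirement_lines(packages)) + "\n")
```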


Install [Plastimatch](https://plastimatch.org), a reliable, open-source software package for image computation. Plastimatch is available as an extension (plug-in) for 3D Slicer, but can also be used from the command line.

In [None]:
!sudo apt update
!sudo apt install plastimatch

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:10 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:12 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,340 kB]
Hit:13 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:14 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bio

Verify that the installation was successful by checking the Plastimatch version:

In [None]:
!plastimatch --version

plastimatch version 1.7.0
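
The same check can be made from Python, so that later cells can fail early when the tool is missing. This is a sketch under the assumption that the output keeps the `plastimatch version X.Y.Z` form shown above; both function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_plastimatch_version(output):
    """Extract "1.7.0" from a line such as "plastimatch version 1.7.0"."""
    match = re.search(r"plastimatch version\s+(\S+)", output)
    return match.group(1) if match else None


def plastimatch_version():
    """Return the installed Plastimatch version, or None if it is not on the PATH."""
    if shutil.which("plastimatch") is None:
        return None
    result = subprocess.run(
        ["plastimatch", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_plastimatch_version(result.stdout)


print(parse_plastimatch_version("plastimatch version 1.7.0"))
```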


In [None]:
import os
import sys
import json
import numpy as np
import pandas as pd
import pydicom  # used later to read the DICOM headers
import SimpleITK as sitk

from IPython.display import clear_output

print("Python version      : ", sys.version.split('\n')[0])
print("Numpy version       : ", np.__version__)

# FIXME: useful for debug purposes
if has_gpu:
  %tensorflow_version 1.x
  import tensorflow as tf
  import keras
  print("TensorFlow version  : ", tf.__version__)
  print("Keras version       : ", keras.__version__)
  print("\nThis Colab instance is equipped with a GPU.")
else:
  print("\nThis Colab instance IS NOT equipped with a GPU.")


# ----------------------------------------

#everything that has to do with plotting goes here below
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

## ----------------------------------------

# create new colormaps by appending an alpha channel to the selected ones
# (so that we don't get a "color overlay" when plotting the segmask superimposed on the CT)
cmap = plt.cm.Reds
my_reds = cmap(np.arange(cmap.N))
my_reds[:,-1] = np.linspace(0, 1, cmap.N)
my_reds = ListedColormap(my_reds)

cmap = plt.cm.jet
my_jet = cmap(np.arange(cmap.N))
my_jet[:,-1] = np.linspace(0, 1, cmap.N)
my_jet = ListedColormap(my_jet)

## ----------------------------------------

import seaborn as sns

Python version      :  3.6.9 (default, Jul 17 2020, 12:50:27) 
Numpy version       :  1.18.5

This Colab instance IS NOT equipped with a GPU.


---

## Data Download

[How to Mount GCS bucket on Colab:](https://gist.github.com/korakot/f3600576720206363c734eca5f302e38)

The first step is authentication:

In [8]:
from google.colab import auth
auth.authenticate_user()

Mount Google Drive's filesystem.

<font size="+1" color="orange"><b>This will become unnecessary once the specific IDC repository is made public and can be cloned onto the Colab filesystem.</b></font>

In [9]:
from google.colab import drive

mount_point = '/mnt/gdrive'

if os.path.exists(mount_point):
  drive.flush_and_unmount()

drive.mount(mount_point)

Mounted at /mnt/gdrive


---
#### Exploiting BigQuery to Select which Data to Download

To get information about the dataset of interest (e.g., which subjects to download, or the mapping between DICOM CTs and DICOM RTSTRUCTs/RTSEGs), run a BigQuery query using the `%%bigquery` [IPython magic](https://googleapis.dev/python/bigquery/latest/magics.html).

The result will be stored as a DataFrame in `cohort_df`. 

In [10]:
%%bigquery --project=idc-sandbox-000 cohort_df

WITH
  ct_series AS (
  SELECT
    DISTINCT(PatientID),
    StudyInstanceUID AS ctStudyInstanceUID,
    SeriesInstanceUID AS ctSeriesInstanceUID
  FROM
    `idc-dev-etl.idc_tcia.idc_tcia`
  WHERE
    PatientID LIKE "LUNG1%"
    AND Modality = "CT"
  ORDER BY
    PatientID),
  rtstruct_series AS (
  SELECT
    DISTINCT(PatientID),
    StudyInstanceUID AS rtstructStudyInstanceUID,
    SeriesInstanceUID AS rtstructSeriesInstanceUID
  FROM
    `idc-dev-etl.idc_tcia.idc_tcia`
  WHERE
    PatientID LIKE "LUNG1%"
    AND Modality = "RTSTRUCT"
  ORDER BY
    PatientID),
  seg_series AS (
  SELECT
    DISTINCT(PatientID),
    StudyInstanceUID AS segStudyInstanceUID,
    SeriesInstanceUID AS segSeriesInstanceUID
  FROM
    `idc-dev-etl.idc_tcia.idc_tcia`
  WHERE
    PatientID LIKE "LUNG1%"
    AND Modality = "SEG"
  ORDER BY
    PatientID)
SELECT
  PatientID,
  ctStudyInstanceUID,
  ctSeriesInstanceUID,
  rtstructStudyInstanceUID,
  rtstructSeriesInstanceUID,
  segStudyInstanceUID,
  segSeriesInstanceUID
FROM
  ct_series
JOIN
  rtstruct_series
USING
  (PatientID)
JOIN
  seg_series
USING
  (PatientID)
ORDER BY
  PatientID

In [11]:
cohort_df.head()

Unnamed: 0,PatientID,ctStudyInstanceUID,ctSeriesInstanceUID,rtstructStudyInstanceUID,rtstructSeriesInstanceUID,segStudyInstanceUID,segSeriesInstanceUID
0,LUNG1-001,1.3.6.1.4.1.32722.99.99.2393413539117143687725...,1.3.6.1.4.1.32722.99.99.2989917765213423750108...,1.3.6.1.4.1.40744.29.2393413539117143687725971...,1.3.6.1.4.1.40744.29.2279381215866080725084441...,1.3.6.1.4.1.40744.29.2393413539117143687725971...,1.2.276.0.7230010.3.1.3.8323329.16296.15548365...
1,LUNG1-002,1.3.6.1.4.1.32722.99.99.2037150038059966416957...,1.3.6.1.4.1.32722.99.99.2329880015517990803358...,1.3.6.1.4.1.40744.29.2037150038059966416957653...,1.3.6.1.4.1.40744.29.2432675512669112458302594...,1.3.6.1.4.1.40744.29.2037150038059966416957653...,1.2.276.0.7230010.3.1.3.8323329.21133.15548295...
2,LUNG1-003,1.3.6.1.4.1.32722.99.99.2477262867958601216867...,1.3.6.1.4.1.32722.99.99.2389222799296192439904...,1.3.6.1.4.1.40744.29.2477262867958601216867965...,1.3.6.1.4.1.40744.29.2175894477461117410564218...,1.3.6.1.4.1.40744.29.2477262867958601216867965...,1.2.276.0.7230010.3.1.3.8323329.20958.15548278...
3,LUNG1-004,1.3.6.1.4.1.32722.99.99.2026036697036260886779...,1.3.6.1.4.1.32722.99.99.2809816144625926346520...,1.3.6.1.4.1.40744.29.2026036697036260886779831...,1.3.6.1.4.1.40744.29.1989243449739101957480841...,1.3.6.1.4.1.40744.29.2026036697036260886779831...,1.2.276.0.7230010.3.1.3.8323329.31481.15548267...
4,LUNG1-005,1.3.6.1.4.1.32722.99.99.7196186628043392557101...,1.3.6.1.4.1.32722.99.99.3490584753983772067630...,1.3.6.1.4.1.40744.29.7196186628043392557101987...,1.3.6.1.4.1.40744.29.1186162681418618501244963...,1.3.6.1.4.1.40744.29.7196186628043392557101987...,1.2.276.0.7230010.3.1.3.8323329.7097.155482712...
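
Note that in BigQuery standard SQL `DISTINCT` applies to the whole selected row, so writing `DISTINCT(PatientID)` does not by itself guarantee one row per patient: a patient with more than one CT, RTSTRUCT or SEG series would appear several times after the joins. A quick way to check is sketched below on a toy stand-in for `cohort_df` (the UIDs are placeholders, not real values).

```python
import pandas as pd

# Toy stand-in for cohort_df: LUNG1-002 has two CT series, so it appears twice.
toy_cohort_df = pd.DataFrame({
    "PatientID": ["LUNG1-001", "LUNG1-002", "LUNG1-002"],
    "ctSeriesInstanceUID": ["1.2.3", "1.2.4", "1.2.5"],
})

# Rows whose PatientID occurs more than once in the query result.
repeated = toy_cohort_df[toy_cohort_df["PatientID"].duplicated(keep=False)]
print(sorted(repeated["PatientID"].unique()))

# Keep the first row per patient if exactly one series per patient is wanted.
one_per_patient = toy_cohort_df.drop_duplicates(subset="PatientID", keep="first")
print(len(one_per_patient))
```

Running the same two statements on `cohort_df` shows whether any patient needs manual series selection before downloading.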


After selecting a few subjects from the cohort returned by the BigQuery query, populate a dictionary with their `ctStudyInstanceUID` and `rtstructStudyInstanceUID` values. Then use `gsutil` to download this sub-cohort:

In [17]:
n_patients = 10

# useful for logging purposes
download_dict = dict()

# since gsutil can be called with "-m" providing a list as an input,
# append each gs URI to download to a list:
to_download = list()
base_gs_uri = 'gs://idc-tcia-1-nsclc-radiomics/dicom/'

subcohort_df = cohort_df.sample(n = n_patients)

for pat_num, pat in enumerate(list(subcohort_df["PatientID"])):
  print("(%g/%g) - PatientID: %s"%(pat_num + 1, n_patients, pat), end = '\r')

  pat_df = subcohort_df[subcohort_df["PatientID"] == pat]
  download_dict[pat] = dict()
  download_dict[pat]["ctStudyInstanceUID"] = pat_df["ctStudyInstanceUID"].values[0]
  download_dict[pat]["rtstructStudyInstanceUID"] = pat_df["rtstructStudyInstanceUID"].values[0]

  to_download.append(base_gs_uri + download_dict[pat]["ctStudyInstanceUID"])
  to_download.append(base_gs_uri + download_dict[pat]["rtstructStudyInstanceUID"])

(1/10) - PatientID: LUNG1-244
(2/10) - PatientID: LUNG1-376
(3/10) - PatientID: LUNG1-106
(4/10) - PatientID: LUNG1-043
(5/10) - PatientID: LUNG1-336
(6/10) - PatientID: LUNG1-143
(7/10) - PatientID: LUNG1-136
(8/10) - PatientID: LUNG1-407
(9/10) - PatientID: LUNG1-350
(10/10) - PatientID: LUNG1-067
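
The URI list can also be built by a small standalone function, which makes the logic easier to test. The sketch below works on plain dictionaries such as those returned by `subcohort_df.to_dict("records")`; `build_download_list` is a hypothetical helper and the UIDs are placeholders. Unlike the loop above, it lists each study URI only once, so the final count can be lower than twice the number of patients if two rows share a study.

```python
BASE_GS_URI = "gs://idc-tcia-1-nsclc-radiomics/dicom/"


def build_download_list(records, base_gs_uri=BASE_GS_URI):
    """Return one CT and one RTSTRUCT study URI per record, without repeats."""
    uris = []
    seen = set()
    for record in records:
        for key in ("ctStudyInstanceUID", "rtstructStudyInstanceUID"):
            uri = base_gs_uri + record[key]
            if uri not in seen:  # the same study URI is only listed once
                seen.add(uri)
                uris.append(uri)
    return uris


# Placeholder UIDs, for illustration only.
records = [
    {"PatientID": "LUNG1-001", "ctStudyInstanceUID": "1.1", "rtstructStudyInstanceUID": "1.2"},
    {"PatientID": "LUNG1-002", "ctStudyInstanceUID": "2.1", "rtstructStudyInstanceUID": "2.2"},
]
to_download = build_download_list(records)
print(len(to_download))
```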

In [20]:
# populate a dataframe with these gs URIs (again, mostly for logging purposes)
manifesto_dict = {"gs_uri" : to_download}
manifesto_df = pd.DataFrame(manifesto_dict, columns = ["gs_uri"])

manifesto_df.head()

Unnamed: 0,gs_uri
0,gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1....
1,gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1....
2,gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1....
3,gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1....
4,gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1....


In [32]:
# generate a text file, which will be parsed by gsutil to download the selected patients:
manifesto_df.to_csv("gcs_paths.txt", header = False, index = False)

# check everything went as expected
!head /content/gcs_paths.txt -n 2
!echo "..." && echo ""

# the number of lines in the file should be equal to twice the specified "n_patients"
!echo "Number of lines in the file:" $(cat /content/gcs_paths.txt | wc -l)

gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.337406557596513362688574142695266048351
gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1.4.1.40744.29.337406557596513362688574142695266048351
...

Number of lines in the file: 20


---

#### DICOM Data Download

The following instructions will download the DICOM CT and DICOM RTSTRUCT files for `n_patients` patients. For `n_patients = 10`, the download should take approximately 3-4 minutes (roughly 1-1.5k files, a total of 500-700MB).

In [33]:
# if everything is all right, proceed with the download
!mkdir -p data/nsclc-radiomics/dicom

!cat gcs_paths.txt | gsutil -u idc-sandbox-000 -m cp -Ir ./data/nsclc-radiomics/dicom

Copying gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.337406557596513362688574142695266048351/1.3.6.1.4.1.32722.99.99.125867255546325889061338987114510353419/1.3.6.1.4.1.32722.99.99.101327154694781456769426549942732497146.dcm...
/ [0 files][    0.0 B/ 49.7 MiB]                                                Copying gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.337406557596513362688574142695266048351/1.3.6.1.4.1.32722.99.99.125867255546325889061338987114510353419/1.3.6.1.4.1.32722.99.99.100092387728571126613076858797865161575.dcm...
/ [0 files][    0.0 B/ 49.7 MiB]                                                Copying gs://idc-tcia-1-nsclc-radiomics/dicom/1.3.6.1.4.1.32722.99.99.337406557596513362688574142695266048351/1.3.6.1.4.1.32722.99.99.125867255546325889061338987114510353419/1.3.6.1.4.1.32722.99.99.103484459146885810660349892602025355476.dcm...
/ [0 files][    0.0 B/ 49.7 MiB]                                                Copying gs://idc-tcia-

---

Check the disk space after the download:

In [34]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay         108G   32G   72G  31% /
tmpfs            64M     0   64M   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
shm             5.9G     0  5.9G   0% /dev/shm
tmpfs           6.4G   12K  6.4G   1% /var/colab
/dev/sda1       114G   33G   82G  29% /etc/hosts
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware
drive            15G   33M   15G   1% /mnt/gdrive


In [111]:
!du -h ./data/nsclc-radiomics -d 0

684M	./data/nsclc-radiomics
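
To compare the download against the estimate given earlier (number of files and total size), the same information can be gathered in Python. This is a minimal sketch with a hypothetical helper, `dir_stats`; it sums file sizes, so the figure can differ slightly from `du`, which reports disk usage.

```python
import os


def dir_stats(root):
    """Return (number of files, total size in bytes) for everything under root."""
    n_files, n_bytes = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            n_files += 1
            n_bytes += os.path.getsize(os.path.join(dirpath, name))
    return n_files, n_bytes


root = "./data/nsclc-radiomics"
if os.path.isdir(root):
    n_files, n_bytes = dir_stats(root)
    print("%d files, %.1f MiB" % (n_files, n_bytes / 2**20))
```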



---

## Data Pre-processing

For the time being, copy the Python files from the `src` directory under `My Drive/Colab Notebooks`.

When the specific IDC repository becomes available, clone it instead (or, better, [only a subdirectory](https://stackoverflow.com/questions/600079/git-how-do-i-clone-a-subdirectory-only-of-a-git-repository)).

In [101]:
!ls $drive_src_path

data_utils.py  __init__.py  __pycache__


In [138]:
# replace the following instructions with "git clone" (and the --filter flag?)
# possibly, move this to the top (as for now we need to mount the GDrive folder
# before doing anything, but in the future this will change)
# raw string: the backslashes escape the spaces for the shell command below
drive_src_path = os.path.join(mount_point, r"My\ Drive/Colab\ Notebooks/src/")
!mkdir -p src
!cp -r $drive_src_path .

from src.data_utils import *
from src.utils import *

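# FIXME: DEBUG
# bind the module name explicitly so that it can be reloaded after editing the source

import importlib
from src import data_utils
importlib.reload(data_utils)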


<module 'src.data_utils' from '/content/src/data_utils.py'>

In [139]:
data_base_path = 'data'

dataset_name = 'nsclc-radiomics'
dataset_path = os.path.join(data_base_path, dataset_name)
data_path = os.path.join(dataset_path, 'dicom')

preproc_dataset_name = dataset_name + '_preprocessed'
preproc_dataset_path = os.path.join(data_base_path, preproc_dataset_name)
preproc_data_path = os.path.join(preproc_dataset_path, 'nrrd')
    
os.makedirs(preproc_data_path, exist_ok = True)

In [140]:
for pat_num, pat in enumerate(subcohort_df.PatientID.values):

    # clear cell output before moving to the next (goes at the top to clean what comes next)
    clear_output(wait = True)
    
    print("\nPatient %d/%d (%s)"%(pat_num + 1, len(subcohort_df), pat))
    
    pat_dir_path = os.path.join(preproc_data_path, pat)
    
    if not os.path.exists(pat_dir_path):
        os.mkdir(pat_dir_path)
    
    # location where the tmp nrrd files (resampled CT/RTSTRUCT nrrd) should be saved
    # by the "export_res_nrrd_from_dicom" function found in preprocess.py
    ct_nrrd_path = os.path.join(pat_dir_path, pat + '_ct_resampled.nrrd')
    rt_nrrd_path = os.path.join(pat_dir_path, pat + '_rt_resampled.nrrd')

    # location where the nrrd files (cropped resampled CT/RTSTRUCT nrrd) should be saved
    # by the "export_com_subvolume" function found in preprocess.py
    ct_nrrd_crop_path = os.path.join(pat_dir_path, pat + '_ct_res_crop.nrrd')
    rt_nrrd_crop_path = os.path.join(pat_dir_path, pat + '_rt_res_crop.nrrd')
    
    # if the latter are already there, skip the processing
    if os.path.exists(ct_nrrd_crop_path) and os.path.exists(rt_nrrd_crop_path):
        print("%s\nand\n%s\nfound, skipping the processing for patient %s..."%(ct_nrrd_crop_path,
                                                                               rt_nrrd_crop_path, 
                                                                               pat))
        continue
    
    ## ----------------------------------------
    
    pat_df = subcohort_df[subcohort_df["PatientID"] == pat]


    path_to_ct_dir = os.path.join(data_path,
                                  pat_df["ctStudyInstanceUID"].values[0],
                                  pat_df["ctSeriesInstanceUID"].values[0])

    path_to_rt_dir = os.path.join(data_path,
                                  pat_df["rtstructStudyInstanceUID"].values[0],
                                  pat_df["rtstructSeriesInstanceUID"].values[0])

    path_to_seg_dir = os.path.join(data_path, 
                                   pat_df["segStudyInstanceUID"].values[0], 
                                   pat_df["segSeriesInstanceUID"].values[0])

    # FIXME: sanity check
    assert os.path.exists(path_to_ct_dir)
    assert os.path.exists(path_to_rt_dir)
    assert os.path.exists(path_to_seg_dir)    
    
    # log lookup information (human-readable PatientID to StudyUID and SeriesUID)
    lookup_dict_path = os.path.join(pat_dir_path, pat + '_lookup_info.json')
    
    lookup_dict = dict()
    lookup_dict[pat] = dict()
    
    lookup_dict[pat]["path_to_ct_dir"] = path_to_ct_dir
    lookup_dict[pat]["ctStudyInstanceUID"] = pat_df["ctStudyInstanceUID"].values[0]
    lookup_dict[pat]["ctSeriesInstanceUID"] = pat_df["ctSeriesInstanceUID"].values[0]
    
    lookup_dict[pat]["path_to_rt_dir"] = path_to_rt_dir
    lookup_dict[pat]["rtstructStudyInstanceUID"] = pat_df["rtstructStudyInstanceUID"].values[0]
    lookup_dict[pat]["rtstructSeriesInstanceUID"] = pat_df["rtstructSeriesInstanceUID"].values[0]
    
    # FIXME: not used so far but still, useful to log for future purposes
    lookup_dict[pat]["path_to_seg_dir"] = path_to_seg_dir
    lookup_dict[pat]["segStudyInstanceUID"] = pat_df["segStudyInstanceUID"].values[0]
    lookup_dict[pat]["segSeriesInstanceUID"] = pat_df["segSeriesInstanceUID"].values[0]
    
    with open(lookup_dict_path, 'w') as json_file:
        json.dump(lookup_dict, json_file, indent = 2)
    
    ## ----------------------------------------

    proc_log = export_res_nrrd_from_dicom(dicom_ct_path = path_to_ct_dir, 
                                          dicom_rt_path = path_to_rt_dir, 
                                          output_dir = pat_dir_path, 
                                          pat_id = pat,
                                          output_dtype = "float")
    
    # check every step of the DICOM to NRRD conversion returned 0 (everything's ok)
    assert(np.sum(np.array(list(proc_log.values()))) == 0)
    
    # FIXME: sanity check
    assert(os.path.exists(ct_nrrd_path))
    assert(os.path.exists(rt_nrrd_path))
    
    sitk_vol = sitk.ReadImage(ct_nrrd_path)
    vol = sitk.GetArrayFromImage(sitk_vol)
    
    sitk_seg = sitk.ReadImage(rt_nrrd_path)
    seg = sitk.GetArrayFromImage(sitk_seg)
    
    # FIXME: sanity check
    assert(vol.shape == seg.shape)
    
    com = compute_center_of_mass(seg)
    com_int = [int(coord) for coord in com]

    # export the CoM slice (CT + RTSTRUCT) for quality control
    export_png_slice(input_volume = vol,
                     input_segmask = seg,
                     fig_out_path = os.path.join(pat_dir_path, pat + '_whole_CT_CoM.png'),
                     fig_dpi = 220,
                     lon_slice_idx = com_int[0],
                     cor_slice_idx = com_int[1],
                     sag_slice_idx = com_int[2],
                     z_first = True)
    
    # crop a (150, 150, 150) subvolume from the resampled scans, get rid of the latter
    proc_log = export_com_subvolume(ct_nrrd_path = ct_nrrd_path, 
                                    rt_nrrd_path = rt_nrrd_path, 
                                    crop_size = (150, 150, 150), 
                                    output_dir = pat_dir_path,
                                    pat_id = pat,
                                    z_first = True, 
                                    rm_orig = True)
    
    # log CoM information
    com_log_path = os.path.join(pat_dir_path, pat + '_com_log.json')
    com_log_dict = {k : v for (k, v) in proc_log.items() if "com_int" in k}
    
    with open(com_log_path, 'w') as json_file:
        json.dump(com_log_dict, json_file, indent = 2)
    
    # if the CoM calculation went wrong (no cropping step was logged),
    # remove the resampled files and move on to the next patient
    proc_log_crop = {k : v for (k, v) in proc_log.items() if "cropping" in k}
    if len(proc_log_crop) == 0:
        os.remove(ct_nrrd_path)
        os.remove(rt_nrrd_path)
        continue
    
    # check the cropped volumes have been exported as intended
    assert(np.sum(np.array(list(proc_log_crop.values()))) == 0)
    assert(os.path.exists(ct_nrrd_crop_path))
    assert(os.path.exists(rt_nrrd_crop_path))
    
    sitk_vol = sitk.ReadImage(ct_nrrd_crop_path)
    vol_crop = sitk.GetArrayFromImage(sitk_vol)

    sitk_seg = sitk.ReadImage(rt_nrrd_crop_path)
    seg_crop = sitk.GetArrayFromImage(sitk_seg)
    
    # export the cropped subvolume CoM slice (CT + RTSTRUCT) for quality control
    export_png_slice(input_volume = vol_crop,
                     input_segmask = seg_crop,
                     fig_out_path = os.path.join(pat_dir_path, pat + '_crop_CT_CoM.png'),
                     fig_dpi = 220,
                     lon_slice_idx = 75,
                     cor_slice_idx = 75,
                     sag_slice_idx = 75,
                     z_first = True)


Patient 10/10 (LUNG1-067)
Converting DICOM CT to NRRD using plastimatch... Done.
Converting DICOM RTSTRUCT to NRRD using plastimatch... Done.

Resampling NRRD CT to 1mm isotropic using plastimatch... Done.
Resampling NRRD RTSTRUCT to 1mm isotropic using plastimatch... Done.

Removing temporary files (DICOM to NRRD, non-resampled)... Done.

Exporting figure at: data/nsclc-radiomics_preprocessed/nrrd/LUNG1-067/LUNG1-067_whole_CT_CoM.png

Cropping the resampled NRRD CT to bbox using plastimatch... Done.
Cropping the resampled NRRD RTSTRUCT to bbox using plastimatch... Done.

Removing the resampled NRRD files... Done.

Exporting figure at: data/nsclc-radiomics_preprocessed/nrrd/LUNG1-067/LUNG1-067_crop_CT_CoM.png
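
`compute_center_of_mass` and `export_com_subvolume` are imported from `src/data_utils.py`, which is not shown here, and the log above indicates that the actual cropping is done by Plastimatch. The NumPy sketch below only illustrates the idea, using hypothetical function names: locate the centre of mass of the segmentation mask, then cut a fixed-size box around it, shifted where needed so that it stays inside the volume. It assumes a `(z, y, x)` array at least as large as the crop along every axis.

```python
import numpy as np


def center_of_mass(mask):
    """Mean index of the non-zero voxels, as a (z, y, x) tuple of floats."""
    coords = np.argwhere(mask > 0)
    if coords.size == 0:
        raise ValueError("empty segmentation mask")
    return tuple(coords.mean(axis=0))


def crop_bounds(com_int, crop_size, vol_shape):
    """Per-axis (start, stop) of a crop_size box centred on com_int, kept inside the volume."""
    bounds = []
    for center, size, dim in zip(com_int, crop_size, vol_shape):
        start = center - size // 2
        start = max(0, min(start, dim - size))  # shift the box back inside the volume
        bounds.append((start, start + size))
    return bounds


seg = np.zeros((200, 300, 300), dtype=np.uint8)
seg[90:110, 40:60, 140:160] = 1  # synthetic "tumour"

com_int = [int(c) for c in center_of_mass(seg)]
bounds = crop_bounds(com_int, (150, 150, 150), seg.shape)
crop = seg[tuple(slice(a, b) for a, b in bounds)]
print(com_int, bounds, crop.shape)
```

In this example the box is shifted along the second axis because the synthetic mask sits close to the border, so the centre of mass is not always at index 75 of the cropped volume.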


---

### On Fostering Reproducibility: Logging the Processing Details?


In [152]:
csv_out_name = 'nsclc-radiomics_preproc_details.csv'
dataset_csv_path = os.path.join(data_base_path, csv_out_name)


df_keys = ['PatientID',
           'path_to_ct_dir', 'ctStudyInstanceUID', 'ctSeriesInstanceUID',
           'path_to_rt_dir', 'rtstructStudyInstanceUID', 'rtstructSeriesInstanceUID',
           'path_to_seg_dir', 'segStudyInstanceUID', 'segSeriesInstanceUID',
           'rt_exported', 'orig_shape', '1mm_iso_shape', 'crop_shape', 'com_int', 'bbox']

data = {k : list() for k in df_keys}

det_df = pd.DataFrame(data = data, dtype = object)

In [153]:
# patients whose exported RTSTRUCT file is not named exactly 'GTV-1'
a = list()

for pat_num, pat in enumerate(subcohort_df.PatientID.values):

    print("\rProcessing patient '%s' (%d/%d)... "%(pat, pat_num + 1,
                                                   len(subcohort_df.PatientID)),
          end = '')
    
    pat_df = subcohort_df[subcohort_df['PatientID'] == pat]
            
    # init a dictionary with the same keys as "df_keys" to populate the latter
    pat_dict = dict()
    pat_dict["PatientID"] = pat

    pat_dir_path = os.path.join(preproc_data_path, pat)
    pat_json_path = os.path.join(pat_dir_path, pat + '_lookup_info.json')

    with open(pat_json_path, 'r') as json_file:
        lookup_dict = json.load(json_file)
    
    pat_dict["path_to_ct_dir"] = lookup_dict[pat]["path_to_ct_dir"]
    pat_dict["ctStudyInstanceUID"] = pat_df['ctStudyInstanceUID'].values[0]
    pat_dict["ctSeriesInstanceUID"] = pat_df['ctSeriesInstanceUID'].values[0]

    pat_dict["path_to_rt_dir"] = lookup_dict[pat]["path_to_rt_dir"]
    pat_dict["rtstructStudyInstanceUID"] = pat_df['rtstructStudyInstanceUID'].values[0]
    pat_dict["rtstructSeriesInstanceUID"] = pat_df['rtstructSeriesInstanceUID'].values[0]

    pat_dict["path_to_seg_dir"] = lookup_dict[pat]["path_to_seg_dir"]
    pat_dict["segStudyInstanceUID"] = pat_df['segStudyInstanceUID'].values[0]
    pat_dict["segSeriesInstanceUID"] = pat_df['segSeriesInstanceUID'].values[0]
    
    # ----------------------------------------
    
    # populate the "rt_exported" field
    rt_folder = os.path.join(pat_dir_path, pat  + '_whole_ct_rt')
    pat_dict['rt_exported'] = [f for f in os.listdir(rt_folder) if 'gtv-1' in f.lower()][0].split('.nrrd')[0]   
    if pat_dict['rt_exported'] != 'GTV-1':
        a.append(pat)
    
    # ----------------------------------------
    
    dicom_ct_path = lookup_dict[pat]["path_to_ct_dir"]
    
    dcm_file_path = os.path.join(dicom_ct_path,                 # parent folder
                                 os.listdir(dicom_ct_path)[0])  # *.dcm files

    dcm_file = pydicom.dcmread(dcm_file_path)
    n_dcm_files = len([f for f in os.listdir(dicom_ct_path) if '.dcm' in f])

    xy = int(float(dcm_file.Rows)*float(dcm_file.PixelSpacing[0]))
    z = int(float(dcm_file.SliceThickness)*float(n_dcm_files))
    
    orig_dcm_shape = (n_dcm_files, dcm_file.Columns, dcm_file.Rows)
    res_dcm_shape = (z, xy, xy)
    
    pat_dict['orig_shape'] = orig_dcm_shape
    pat_dict['1mm_iso_shape'] = res_dcm_shape
    
    # ----------------------------------------
     
    com_json_path = os.path.join(pat_dir_path, pat + '_com_log.json')
    
    try:
        with open(com_json_path, 'r') as json_file:
            com_dict = json.load(json_file)
            pat_dict['com_int'] = tuple(com_dict["com_int"])
    except Exception:
        print('_com_log.json loading error;')
    
    # ----------------------------------------
    
    bbox_json_path = os.path.join(pat_dir_path, pat + '_crop_log.json')
    
    try:
        with open(bbox_json_path, 'r') as json_file:
            bbox_dict = json.load(json_file)
            pat_dict['bbox'] = bbox_dict
    except Exception:
        print('_crop_log.json loading error;')
    
    # ----------------------------------------
    ct_res_crop_path = os.path.join(pat_dir_path, pat + '_ct_res_crop.nrrd')
    
    try:
        sitk_ct_res_crop = sitk.ReadImage(ct_res_crop_path)
        pat_dict['crop_shape'] = sitk_ct_res_crop.GetSize()
    except Exception:
        print('_ct_res_crop.nrrd loading error;')
        
    # ----------------------------------------
    
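    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    det_df = pd.concat([det_df, pd.DataFrame([pat_dict])], ignore_index = True)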

det_df.to_csv(dataset_csv_path)

Processing patient 'LUNG1-067' (10/10)... 
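
The `1mm_iso_shape` column is not measured from the resampled files: it is estimated from the DICOM header of a single slice. The arithmetic is sketched below with a hypothetical helper and illustrative header values. It assumes square slices and that `SliceThickness` equals the spacing between slices, which is not guaranteed for every series.

```python
def expected_iso_shape(rows, pixel_spacing_mm, slice_thickness_mm, n_slices):
    """Approximate (z, y, x) shape after resampling to 1 mm isotropic voxels."""
    xy = int(float(rows) * float(pixel_spacing_mm))
    z = int(float(slice_thickness_mm) * float(n_slices))
    return (z, xy, xy)


# Illustrative header values, not read from any file in the dataset.
print(expected_iso_shape(rows=512, pixel_spacing_mm=0.9765625,
                         slice_thickness_mm=3.0, n_slices=134))
# prints (402, 500, 500)
```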

---
---

# <center> Data Exploration </center>
