The contents of this notebook are copied/redistributed and modified from [RSNA 2022 tutorials](https://github.com/RSNA/AI-Deep-Learning-Lab-2022/blob/main/sessions/data-curation/RSNA22_DLL_Data_Processing_Curation_for_Deep_Learning.ipynb) as per the [MIT License](https://github.com/RSNA/AI-Deep-Learning-Lab-2022/blob/main/LICENSE).

<a href="https://colab.research.google.com/github/AFRICAI-MICCAI/model_development_1_data/blob/main/Notebooks/2-%20Data-inspection-and-curation-DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Data Processing & Curation for Deep Learning**

### In this session, we will build a **toolbox** of data processing techniques useful for deep learning applications.

### Learning Objectives
1. Extract relevant data from radiology and pathology reports
2. Understand how to generate and process image-based annotations
3. Perform image registration and normalization
4. Recognize features of data formats ideal for deep learning

### Course Description
This course will provide attendees with the essential tools to perform data processing and curation necessary for deep learning projects. Attendees will start with free text radiology and pathology reports as well as anonymized DICOM data and process data into a unified data file ready for deep learning applications.


## **Setting up the runtime environment...**

### Conda environment
It is suggested to create a conda environment for the summer school's notebooks. Please find conda installation instructions [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) (miniconda would be enough).  
If you have not yet created the africai conda environment, run the following in a terminal from the parent directory *model_development_1_data*:  
> conda env create -f africai.yml  
> conda activate africai

*Other useful commands*:  
To deactivate a conda environment 
> conda deactivate

To delete a conda environment (e.g. africai conda environment, replace *ENV_NAME* with *africai*)
> conda remove --name *ENV_NAME* --all 

### Install [Advanced Normalization Tools](http://stnava.github.io/ANTs/) (ants) 
Run the command for your platform in a terminal. For macOS:  
> pip install https://github.com/ANTsX/ANTsPy/releases/download/v0.3.8/antspyx-0.3.8-cp310-cp310-macosx_10_9_x86_64.whl  

For Linux:
> pip install https://github.com/ANTsX/ANTsPy/releases/download/v0.3.8/antspyx-0.3.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl  

For Windows:
> pip install https://github.com/ANTsX/ANTsPy/releases/download/v0.3.8/antspyx-0.3.8-cp310-cp310-win_amd64.whl


### Install the remaining dependencies
Running the following cell will install the remaining required libraries.

In [None]:
%%capture

!pip install matplotlib
!pip install scikit-image
!pip install simplejson
!pip install pydicom
!pip install dicom2nifti
!pip install nipype
!pip install h5py

In [None]:
import matplotlib.pyplot as plt
import os
from pathlib import Path
import numpy as np
import pydicom
import shutil
import dicom2nifti
import nibabel as nib
import h5py
import re
from skimage import exposure
from zipfile import ZipFile
from IPython.display import HTML, display
import ants


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

## **Downloading the data**
Running this cell will download the data for this demo.

In [None]:
data_dir = os.path.join(os.path.dirname(os.getcwd()), "data")
Path(data_dir).mkdir(parents=True, exist_ok=True)

!wget -P {data_dir} -q https://github.com/RSNA/AI-Deep-Learning-Lab-2022/raw/main/sessions/data-curation/Sample_DICOM.zip

file_name = os.path.join(data_dir, "Sample_DICOM.zip")
# use a name other than "zip" so the built-in zip() is not shadowed
with ZipFile(file_name, "r") as zip_file:
    zip_file.extractall(data_dir)
    print("Input data file unzipped!")
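
The `!wget` call above relies on a `wget` executable, which is not available by default on every system (Windows in particular). A minimal sketch of a portable alternative that uses only the Python standard library is given here. The helper name `fetch_if_missing` is hypothetical and not part of the original notebook.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest_dir):
    """Download url into dest_dir unless the file is already there.

    Hypothetical helper for illustration; returns the local file path.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # use the last URL component as the local file name
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target
```

It could be called as `fetch_if_missing(<zip URL used above>, data_dir)`; skipping the download when the file already exists also avoids fetching the archive again on every run.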

## **Listing the data files**
Running this cell will show you the directory structure for the provided sample prostate MRI data. We have three folders, each of which contains a series of files: T2-weighted imaging (T2WI), diffusion-weighted imaging (DWI) and an ADC map (ADC). (Note: the file lesion.nii.gz is a backup file provided for the demo.)

This data has been exported directly from PACS in the DICOM format and can be loaded into a DICOM viewer for visualization. A free option available to macOS users is OsiriX Lite.



In [None]:
dcm_sample_path = os.path.join(data_dir, "Sample_DICOM")

# Command line "magic" command to show directory contents
if os.name == "nt":
    !dir {dcm_sample_path}
else:
    !ls {dcm_sample_path}/**/*


for current_dir, subdirs, files1 in os.walk(dcm_sample_path):
    # Current Iteration Directory
    print(current_dir)

    # Directories
    for dirname in subdirs:
        print("\t" + dirname)

    # Files
    for filename in files1:
        print("\t" + filename)
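
Beyond printing the tree, it can help to count the files in each series folder, since a DICOM series with an unexpected number of slices is a common sign of an incomplete export. The following is a small sketch using only the standard library; `count_files_per_folder` is a hypothetical helper name.

```python
from pathlib import Path


def count_files_per_folder(root):
    """Return {folder path relative to root: number of files directly inside}."""
    root = Path(root)
    counts = {}
    # visit every sub-folder below root in a stable (sorted) order
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        n_files = sum(1 for p in folder.iterdir() if p.is_file())
        counts[str(folder.relative_to(root))] = n_files
    return counts
```

For this demo it could be called as `count_files_per_folder(dcm_sample_path)`.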

## **Preparing the data for annotation and further processing**

DICOM files are not always a convenient format to work with for deep learning. The first step that we will take will be conversion of our DICOM files to [nifti](https://nifti.nimh.nih.gov/), an open source image file format that has good interoperability between many available image processing tools and packages. 

### **Conversion to nifti file format**
Run this cell to convert our DICOM files to nifti files. Each folder of DICOM files (a series) will be converted into a single nifti file.

In [None]:
niftipath = os.path.join(dcm_sample_path, "nifti")
os.makedirs(niftipath, exist_ok=True)

for series in ["T2", "DWI", "ADC"]:
    dicom2nifti.dicom_series_to_nifti(
        os.path.join(dcm_sample_path, series),
        os.path.join(niftipath, (series + ".nii")),
        reorient_nifti=True,
    )
print("Data converted!")
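
After conversion, it is worth comparing the geometry of the series, because series acquired with different matrix sizes and voxel spacings cover different physical extents. The sketch below computes the field of view from a volume shape and voxel spacing. The helper name `field_of_view_mm` is hypothetical, and the numbers are illustrative values only; they are not read from the sample data.

```python
def field_of_view_mm(shape, spacing_mm):
    """Physical extent along each axis: number of voxels times voxel size."""
    if len(shape) != len(spacing_mm):
        raise ValueError("shape and spacing must have the same length")
    return tuple(n * s for n, s in zip(shape, spacing_mm))


# Illustrative values only, NOT taken from the sample data.
t2_fov = field_of_view_mm((320, 320, 24), (0.5, 0.5, 3.0))
dwi_fov = field_of_view_mm((128, 128, 24), (2.0, 2.0, 3.0))
print("T2 FOV (mm):", t2_fov)
print("DWI FOV (mm):", dwi_fov)
```

For a nifti file loaded with nibabel, the two inputs are available as `img.shape` and `img.header.get_zooms()`.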

### **Visualizing the data**

There are many options for image visualization and annotation. Two easy and versatile open source tools are [**Mango**](https://mangoviewer.com/) and [**ITK-SNAP**](http://www.itksnap.org). Versions for Windows, macOS and Linux are available to download and install.

As an example, open Mango and navigate as follows:  
&rarr; *Open*  
&rarr; *Open Image...*  
&rarr; Browse to *<your_path>/model_development_1_data/data/Sample_DICOM/nifti* and select the nifti file for the T2 series  
&rarr; *Window* &rarr; *Maximize* (to maximize the visualization)

A brief and handy Mango user guide can be found [here](https://mangoviewer.com/userguide.html).

To visualize the provided lesion segmentation, navigate in the window where the data is visualized as follows:  
&rarr; *File*  
&rarr; *Add Overlay...*  
&rarr; Browse to *<your_path>/model_development_1_data/data/Sample_DICOM* and select the file *lesion.nii.gz*

## **Image Registration**

You may have multiple series (by CT or MRI) that you want registered to each other for your deep learning project. There are many ways to do this, some easier than others. The open source libraries [**Advanced Normalization Tools**](http://stnava.github.io/ANTs/) (ants) or [**NiftyReg**](http://cmictig.cs.ucl.ac.uk/wiki/index.php/NiftyReg) are helpful for this purpose.  
In this example we will employ ants.

Note that the ADC and DWI images are not yet registered to the T2 images. The lesion file is already aligned with T2, since those are the images we used for the annotation.

In [None]:
# get pixel data from nifti files
t2_img = nib.load(os.path.join(niftipath, "T2.nii"))
data_t2 = np.rot90(t2_img.get_fdata())
adc_img = nib.load(os.path.join(niftipath, "ADC.nii"))
data_adc = np.rot90(adc_img.get_fdata())
dwi_img = nib.load(os.path.join(niftipath, "DWI.nii"))
data_dwi = np.rot90(dwi_img.get_fdata())
seg_img = nib.load(os.path.join(dcm_sample_path, "lesion.nii.gz"))
data_seg = np.rot90(seg_img.get_fdata())

# plot image
# set the font size before plotting so that the titles of this figure use it
plt.rcParams.update({"font.size": 40})
fig, axes = plt.subplots(ncols=4, nrows=1, figsize=(36, 24), sharex=True, sharey=True)
ax0 = axes[0]
ax1 = axes[1]
ax2 = axes[2]
ax3 = axes[3]
ax0.imshow(data_t2[:, :, 12], cmap="gray")
ax0.set_title("T2")
ax1.imshow(data_adc[:, :, 12], cmap="gray")
ax1.set_title("ADC")
ax2.imshow(data_dwi[:, :, 12], cmap="gray")
ax2.set_title("DWI")
ax3.imshow(data_seg[:, :, 12], cmap="Blues", alpha=0.5)
ax3.set_title("Lesion")
plt.tight_layout()

### **Register your images**
Register the DWI and ADC series to the T2 series. The T2 acquisition uses a different pixel spacing, slice spacing and field of view than the DWI and ADC acquisitions.

In [None]:
t2_ants = ants.from_nibabel(t2_img)
adc_ants = ants.from_nibabel(adc_img)
dwi_ants = ants.from_nibabel(dwi_img)

mytx = ants.registration(fixed=t2_ants, moving=adc_ants, type_of_transform="SyN")
adc_regis_ants = ants.apply_transforms(
    fixed=t2_ants, moving=adc_ants, transformlist=mytx["fwdtransforms"]
)
adc_regis = ants.to_nibabel(adc_regis_ants)

mytx = ants.registration(fixed=t2_ants, moving=dwi_ants, type_of_transform="SyN")
dwi_regis_ants = ants.apply_transforms(
    fixed=t2_ants, moving=dwi_ants, transformlist=mytx["fwdtransforms"]
)
dwi_regis = ants.to_nibabel(dwi_regis_ants)
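
Visual inspection remains the main check of a registration, but a simple number can flag gross failures. The sketch below computes the Pearson correlation between two volumes on the same grid. This is an assumption-laden sanity check rather than part of the original workflow: for images with different contrast (such as T2 and ADC) the value can be low or negative even when the alignment is good, and metrics such as mutual information are usually better suited. The helper name `volume_correlation` is hypothetical.

```python
import numpy as np


def volume_correlation(fixed, moving):
    """Pearson correlation between two equally shaped volumes.

    A crude similarity score, not a substitute for visual inspection.
    """
    fixed = np.asarray(fixed, dtype=float).ravel()
    moving = np.asarray(moving, dtype=float).ravel()
    if fixed.shape != moving.shape:
        raise ValueError("volumes must have the same number of voxels")
    return float(np.corrcoef(fixed, moving)[0, 1])
```

Because the registered volumes are resampled onto the T2 grid, a call such as `volume_correlation(t2_img.get_fdata(), adc_regis.get_fdata())` should be possible after the cell above.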

### **Visualize the registered images**

In [None]:
# get pixel data from nifti files
data_t2 = np.rot90(t2_img.get_fdata())
data_adc = np.rot90(adc_regis.get_fdata())
data_dwi = np.rot90(dwi_regis.get_fdata())
data_lesion = np.rot90(seg_img.get_fdata())

# plot image
fig, axes = plt.subplots(ncols=4, nrows=1, figsize=(36, 24), sharex=True, sharey=True)
ax0 = axes[0]
ax1 = axes[1]
ax2 = axes[2]
ax3 = axes[3]
ax0.imshow(data_t2[:, :, 12], cmap="gray")
ax0.set_title("T2")
ax1.imshow(data_adc[:, :, 12], cmap="gray")
ax1.set_title("ADC")
ax2.imshow(data_dwi[:, :, 12], cmap="gray")
ax2.set_title("DWI")
ax3.imshow(data_lesion[:, :, 12], cmap="Blues", alpha=0.5)
ax3.set_title("Lesion")
plt.tight_layout()
plt.rcParams.update({"font.size": 40})

## **Image Normalization**

Image normalization can be a critical step to help correct for scanner and exam variation in acquisition, especially for MRI images.

### **Normalize your images**
This normalization is being performed over the entire imaged volume for each series. An alternate approach would be to normalize based on pixels within a region of interest.




In [None]:
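# Define function for normalization
def normalize_equalize(img):
    # rescale the 0.1-99.9 percentile range of the non-zero voxels to [0, 1]
    p1, p2 = np.percentile(img[img != 0], (0.1, 99.9))
    img_rescale = exposure.rescale_intensity(img, in_range=(p1, p2), out_range=(0, 1))
    # histogram equalization spreads the intensities over the full range
    img_rescale_equalize = exposure.equalize_hist(img_rescale)
    return img_rescale_equalize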


# Normalize each series
data_t2_norm = normalize_equalize(data_t2)
data_adc_norm = normalize_equalize(data_adc)
data_dwi_norm = normalize_equalize(data_dwi)
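
The alternate approach mentioned above, normalizing based on pixels within a region of interest, could look like the following sketch. Note that it uses a z-score (subtract the ROI mean, divide by the ROI standard deviation) instead of the percentile rescaling and histogram equalization used in this notebook. The `roi_mask` argument is an assumption: it would typically be a mask of the whole prostate gland, which is not included in the sample data.

```python
import numpy as np


def zscore_within_roi(volume, roi_mask):
    """Scale a volume by the mean and standard deviation inside an ROI."""
    volume = np.asarray(volume, dtype=float)
    roi = np.asarray(roi_mask).astype(bool)
    if roi.shape != volume.shape:
        raise ValueError("volume and roi_mask must have the same shape")
    if not roi.any():
        raise ValueError("roi_mask is empty")
    mean = volume[roi].mean()
    std = volume[roi].std()
    if std == 0:
        raise ValueError("ROI intensities are constant")
    # the statistics come from the ROI, but the whole volume is rescaled
    return (volume - mean) / std
```

After this step the voxels inside the ROI have zero mean and unit standard deviation, regardless of how bright the background or surrounding anatomy is.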

### **Visualize the normalized images**

In [None]:
# plot image
fig, axes = plt.subplots(ncols=4, nrows=1, figsize=(36, 24), sharex=True, sharey=True)
ax0 = axes[0]
ax1 = axes[1]
ax2 = axes[2]
ax3 = axes[3]
ax0.imshow(data_t2_norm[:, :, 12], cmap="gray")
ax0.set_title("T2")
ax1.imshow(data_adc_norm[:, :, 12], cmap="gray")
ax1.set_title("ADC")
ax2.imshow(data_dwi_norm[:, :, 12], cmap="gray")
ax2.set_title("DWI")
ax3.imshow(data_lesion[:, :, 12], cmap="Blues", alpha=0.5)
ax3.set_title("Lesion")
plt.tight_layout()
plt.rcParams.update({"font.size": 40})

## **Free text report processing**

Oftentimes, valuable information for ground truth will need to be processed from free text reports. For this demonstration, the prostate lesion that we annotated needs to be linked to results from a biopsy report.

### **View the raw free text pathology report**

In [None]:
report = 'XXX DEPARTMENT OF PATHOLOGY XXX ADDRESS TEL: XXX  FAX: XXXX  ; ;SURGICAL PATHOLOGY REPORT  ;Patient Name: XXX Med. Rec.#: XXXXXXX DOB: XX/XX/XXXX (Age: XX) Sex: Male Accession #: XX Visit #: XXX Service Date: XX/XX/XXXX Received: XX/XX/XXXX Location: XX Client:XX  Physician(s): XX; ;FINAL PATHOLOGIC DIAGNOSIS  ;A. Prostate, right apex, needle core biopsy:  High-grade prostatic intraepithelial neoplasia (HGPIN); see comment.   ;B. Prostate, right mid, needle core biopsy:  Prostatic adenocarcinoma, Gleason score 4+4=8; see comment.    ;C. Prostate, right base, needle core biopsy:  Prostatic adenocarcinoma, Gleason score 4+4=8; see comment.    ;D. Prostate, right anterior, needle core biopsy:  Atypical small acinar proliferation (ASAP); see comment.   ;E. Prostate, left apex, needle core biopsy:  Benign prostatic tissue.  ;F. Prostate, left mid, needle core biopsy:  Prostatic adenocarcinoma, Gleason score 3+3=6; see comment.    ;G. Prostate, left base, needle core biopsy:  Benign prostatic tissue.   ;H. Prostate, left anterior, needle core biopsy:  Benign prostatic tissue.    ;I. Prostate, "left mid US lesion", needle core biopsy:  Benign prostatic tissue.    ;J. Prostate, "right base US lesion", needle core biopsy:  Prostatic adenocarcinoma, Gleason score 4+4=8; see comment.    ;K. Prostate, "MRI #1 right mid/base posterior", needle core biopsy: Prostatic adenocarcinoma, Gleason score 4+4=8; see comment.     ; ;COMMENT: Immunohistochemical staining for p63/CK903/P504S was performed on blocks A1, D1, and F1. In part A (right apex), all glands show at least partially retained basal cells with some of the glands demonstrating increased P504S staining, supporting high-grade prostatic intraepithelial neoplasia (HGPIN). 
In part D (right anterior), a focus composed of 7-8 small atypical glands are seen, and these glands demonstrate no retained basal cells by immunostaining, but lack definite infiltrative growth pattern with no increased P504S staining, and are best classified as atypical small acinar proliferation (ASAP). In part F (left mid), atypical glands show no retained basal cells with increased P504S staining, supporting the diagnosis of adenocarcinoma.   ;The Gleason pattern 4 in this case is of the expansile cribriform and cribriform types.   ;Specimen B  2 of 2 cores contain carcinoma. The total length of tumor in all of the cores is 11 mm. The total length of tissue in all of the cores is 21 mm. The percentage of the tissue involved by tumor is 52%. The percentage of tumor greater than Gleason pattern 3 is 100%. Perineural invasion is not present. No extraprostatic tumor is seen.  ;Specimen C  2 of 2 cores contain carcinoma. The total length of tumor in all of the cores is 21 mm. The total length of tissue in all of the cores is 28 mm. The percentage of the tissue involved by tumor is 75%. The percentage of tumor greater than Gleason pattern 3 is &gt; 95%. Perineural invasion is not present. No extraprostatic tumor is seen.  ;Specimen F  1 of 3 cores contains carcinoma. The total length of tumor in all of the cores is 2 mm. The total length of tissue in all of the cores is 34 mm. The percentage of the tissue involved by tumor is 6%. The percentage of tumor greater than Gleason pattern 3 is 0%. Perineural invasion is not present. No extraprostatic tumor is seen.  ;Specimen J  2 of 2 cores contain carcinoma. The total length of tumor in all of the cores is 28 mm. The total length of tissue in all of the cores is 31 mm. The percentage of the tissue involved by tumor is 90%. The percentage of tumor greater than Gleason pattern 3 is &gt;95%. Perineural invasion is not present. No extraprostatic tumor is seen.  ;Specimen K  2 of 3 cores contain carcinoma. 
The total length of tumor in all of the cores is 16 mm. The total length of tissue in all of the cores is 37 mm. The percentage of the tissue involved by tumor is 43%. The percentage of tumor greater than Gleason pattern 3 is &gt;95%. Perineural invasion is not present. No extraprostatic tumor is seen.  ; ;Specimen(s) Received A:Right apex B:Right mid C:Right base D:Right anterior E:Left apex F:Left mid G:Left base H:Left anterior I:Left mid US lesion J:Right base US lesion K:MRI #1 right mid/base posterior  ; ;Clinical History The patient is a 72-year-old man with an elevated PSA in two subsequent occasions (28 and 31), estimated prostatic volume of 29.4 cc with ultrasound revealing a right anterior base and left mid gland lesions, and MRI revealing a right mid/base lesion with extracapsular extension. No previous biopsies. Overall PI-RADS v2 score = 5.   ; ;Gross Description The case is received in 11 parts, each labeled with the patient name and medical record number.  ;Part A is received in formalin and additionally labeled "right apex," consists of two 0.1 cm-thick cores of soft tan-white tissue.  The cores each have a length of 2 cm and 1.5 cm.  The specimen is entirely submitted in cassette A1.  (lds)  ;Part B is received in formalin and additionally labeled "right mid," consists of two 0.1 cm-thick cores of soft yellow-gold tissue.  The cores each have a length of 1 cm and 1.1 cm.  The specimen is entirely submitted in cassette B1.  (lds)  ;Part C is received in formalin and additionally labeled "right base," consists of two 0.1 cm-thick cores of soft yellow-gold tissue.  The cores each have a length of 1.7 cm and 1.1 cm.  The specimen is entirely submitted in cassette C1.  (lds)  ;Part D is received in formalin and additionally labeled "right anterior," consists of a single 0.1 cm-thick core of soft tan-white tissue.  The core has a length of 1.6 cm.  The specimen is entirely submitted in cassette D1.  
(lds)  ;Part E is received in formalin and additionally labeled "left apex," consists of two 0.1 cm-thick cores of soft tan-white tissue.  The cores each have a length of 1.5 cm and 1 cm.  The specimen is entirely submitted in cassette E1.  (lds)  ;Part F is received in formalin and additionally labeled "left mid," consists of two 0.1 cm-thick cores of soft tan-white tissue.  The cores each have a length of 2 cm and 1.7 cm.  The specimen is entirely submitted in cassette F1.  (lds)  ;Part G is received in formalin and additionally labeled "left base," consists of two 0.1 cm-thick cores of soft tan-white tissue.  The cores each have a length of 1.5 cm and 1.8 cm.  The specimen is entirely submitted in cassette G1.  (lds)  ;Part H is received in formalin and additionally labeled "left anterior," consists of a single 0.1 cm-thick core of soft tan-white tissue.  The core has a length of 2.2 cm.  The specimen is entirely submitted in cassette H1.  (lds)  ;Part I is received in formalin and additionally labeled "US lesion left mid," consists of two 0.1 cm-thick cores of soft tan-white tissue.  The cores each have a length of 1.8 cm and 1.4 cm.  The specimen is entirely submitted in cassette I1.  (lds)  ;Part J is received in formalin and additionally labeled "US lesion right base," consists of two 0.1 cm-thick cores of soft white-pink tissue.  The cores each have a length of 1.6 cm and 1.5 cm.  The specimen is entirely submitted in cassette J1.  (lds)  ;Part K is received in formalin and additionally labeled "MRI #1, right mid/base posterior," consists of three 0.1 cm-thick cores of soft white-pink tissue.  The cores each have a length of 0.4 cm, 1.3 cm and 2 cm.  The specimen is entirely submitted in cassette K1.  (lds)  ;All controls performed with the immunohistochemical stains reported above reacted appropriately. These immunohistochemical stains were developed and their performance characteristics determined by the XXX Department of Pathology. 
They have not been cleared or approved by the U. S. Food and Drug Administration. The FDA has determined that such clearance or approval is not necessary. These tests are used for clinical purposes. They should not be regarded as investigational or for research. This laboratory is certified under the Clinical Laboratory Improvement Amendments of 1988 ("CLIA") as qualified to perform high-complexity clinical testing.  ;Diagnosis based on gross and microscopic examinations.  Final diagnosis made by attending pathologist following review of all pathology slides.  The attending pathologist has reviewed all dictations, including prosector work, and preliminary interpretations performed by any resident involved in the case and performed all necessary edits before signing the final report.  ;XXX/Pathology Resident XXX/Pathologist      Electronically signed out on XX/XX/XXXX XX:XX  ;'
print(report)

### **Trim the free text pathology report to the section with results of interest**

In [None]:
report_trim = report.split("FINAL PATHOLOGIC DIAGNOSIS")[1].split("COMMENT:")[0]
print(report_trim)
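
The chained `split(...)[1]` above raises an `IndexError` when a report does not contain the expected heading, which is likely to happen when processing many reports. A more defensive sketch is shown here; `extract_section` is a hypothetical helper name, and unlike the cell above it also strips surrounding whitespace.

```python
def extract_section(text, start_marker, end_marker):
    """Return the text between two markers, or None if the start is absent."""
    _, found, rest = text.partition(start_marker)
    if not found:
        return None
    # if end_marker is missing, everything after start_marker is returned
    section, _, _ = rest.partition(end_marker)
    return section.strip()
```

For this report the call would be `extract_section(report, "FINAL PATHOLOGIC DIAGNOSIS", "COMMENT:")`, and a `None` result can be used to flag reports that need manual review.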

### **Split the free text pathology report into each individual result**

In [None]:
import re


def path_process(report):
    # split the report into one entry per specimen label (e.g. "A. ", "B. ")
    report1 = re.split(r"[A-Z][.] ", report)
    if len(report1) == 1:
        # fall back to numbered specimens (e.g. "1. ", "2. ")
        report1 = re.split(
            r"[0-9][.] ",
            report.split("CLINICAL DATA")[0]
            .split("++++++++++Addendum.++++++++++")[0]
            .split("COMMENT")[0],
        )
    # remove the leading "Prostate" / "Prostate gland" phrase and its separators;
    # a regex is used because str.lstrip removes a set of characters, not a prefix
    repor_t2 = [
        re.sub(r"^Prostate(?: gland)?[ ,-]*", "", x.lstrip(" ")) for x in report1
    ]
    return repor_t2


repor_t2 = path_process(report_trim)

for i in repor_t2:
    print(i)
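
Splitting on the specimen label discards the label itself, so the link between a result and its specimen letter is lost. A sketch that keeps the letters is given below; it relies on the fact that `re.split` returns the text matched by a capturing group. The helper name `split_by_specimen` is hypothetical, and the pattern assumes single capital-letter labels followed by a period and a space, as in this report.

```python
import re


def split_by_specimen(diagnosis_text):
    """Map each specimen letter (A, B, ...) to its diagnosis text."""
    # The capturing group makes re.split keep the letters in the result:
    # [preamble, "A", text_a, "B", text_b, ...]
    parts = re.split(r"([A-Z])[.] ", diagnosis_text)
    labels = parts[1::2]
    chunks = parts[2::2]
    # drop the spaces and ";" separators around each entry
    return {label: chunk.strip(" ;") for label, chunk in zip(labels, chunks)}
```

As with the function above, a sentence that ends in a capital letter followed by a period would cause a false split, so the output should be spot-checked.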

### **Identify any pathology results for MRI targets**

In [None]:
report_MR = [x for x in repor_t2 if any(y in x for y in ["MR", "MRI"])]
print(report_MR)

Isolate the Gleason score for the MRI target. The ISUP grade group is more convenient to store for ground truth as a single integer and therefore has also been calculated.

In [None]:
gleason = report_MR[0].split("+")
print(
    "Gleason Score =",
    int(gleason[0].strip(" ")[-1]),
    "+",
    int(gleason[1].strip(" ")[0]),
)


# convert to the more usable ISUP grade group
def GleasonGrade(gleason1, gleason2):
    total = gleason1 + gleason2
    if total == 0:
        GG = 0
    elif total <= 6:
        GG = 1
    elif (gleason1 == 3) and (gleason2 == 4):
        GG = 2
    elif (gleason1 == 4) and (gleason2 == 3):
        GG = 3
    elif total == 8:
        GG = 4
    elif total in (9, 10):
        GG = 5
    else:
        GG = float("nan")
    return GG


GG = GleasonGrade(int(gleason[0].strip(" ")[-1]), int(gleason[1].strip(" ")[0]))
print("ISUP Grade Group =", GG)
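
The character indexing above works for this report, but it depends on the exact position of the digits around the first `+` sign. A regular expression makes the expected pattern explicit and returns nothing when no score is reported. This is a sketch under the assumption that the score is written as "Gleason score X+Y"; `parse_gleason` and `GLEASON_PATTERN` are hypothetical names.

```python
import re

# "Gleason score", then two integers separated by "+", with optional spaces
GLEASON_PATTERN = re.compile(r"Gleason score\s*(\d+)\s*\+\s*(\d+)", re.IGNORECASE)


def parse_gleason(text):
    """Return (primary, secondary) Gleason patterns, or None if not reported."""
    match = GLEASON_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The returned pair can be passed on to the grade group function, for example `GleasonGrade(*parse_gleason(report_MR[0]))`, after checking that the result is not `None`.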

## **Conversion to ideal format for storage and model development**

You have now imported your MRI images, inspected the lesion segmentation, registered multiple series to each other, normalized your imaging data and processed a pathology report to obtain a biopsy result for the annotated MRI lesion.

You have multiple options for how to store this data. A convenient option is to choose a file format with the ability to handle heterogeneous data simultaneously including numerical arrays and strings. One such option is [**hdf5**](https://www.hdfgroup.org/solutions/hdf5/).

### **Convert your data to hdf5**
Add the imaging data, lesion annotation and pathology ground truth to a single hdf5 file.  
You can either combine your imaging inputs and/or lesion annotation into a single multichannel array now, or keep them separate and combine them in a later step. For now, we have kept them separate.

In [None]:
# create hdf5 file (the context manager closes the file so the data is written to disk)
with h5py.File(os.path.join(dcm_sample_path, "data_point_example.hdf5"), "w") as f:
    # assign datasets (datasets have to be numerical data)
    f.create_dataset("T2", data=data_t2_norm)
    f.create_dataset("ADC", data=data_adc_norm)
    f.create_dataset("DWI", data=data_dwi_norm)
    f.create_dataset("Lesion_mask", data=data_lesion)
    f.create_dataset("Lesion_GG", data=GG)

    # assign attributes (can be strings or lists of data)
    f.attrs["ID"] = "data_point_000"

    print("hdf5 file created")
    print("datasets:")
    print(list(f.keys()))
    print("attributes:")
    print(list(f.attrs.keys()))
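
If you prefer the multichannel option described above, the registered and normalized series can be stacked along a new channel axis, since they share the same grid after registration. The sketch below places the channels first, which is an assumption about the downstream framework; some frameworks expect the channel axis last. `stack_channels` is a hypothetical helper name.

```python
import numpy as np


def stack_channels(*volumes):
    """Stack equally shaped volumes into one (channels, ...) array."""
    shapes = {np.shape(v) for v in volumes}
    if len(shapes) != 1:
        raise ValueError(f"all volumes must share one shape, got {shapes}")
    return np.stack(volumes, axis=0)
```

With the variables from this notebook, `stack_channels(data_t2_norm, data_adc_norm, data_dwi_norm)` would give an array with three channels that could be stored as a single dataset.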

### Acknowledgments
An earlier version of this notebook was presented at the 2021 Radiological Society of North America Annual Meeting. The pathology report analysis was previously presented at the 2020 Radiological Society of North America Annual Meeting.

We would also like to acknowledge the general advice of research collaborators:
- Christopher Bridge, PhD
- Abhejit Rajagopal, PhD