# ProstateX(2) preprocessing

This notebook will read and preprocess all ProstateX and ProstateX2 images so that they can be read by the Retina UNet for the purposes of lesion detection and classification.

Before running this notebook, the following things will be needed:

- Download the TRAIN and/or TEST images from the ProstateX challenge website:
  - Go to: https://wiki.cancerimagingarchive.net/display/Public/SPIE-AAPM-NCI+PROSTATEx+Challenges
  - Go to the "Detailed Description" tab 
  - In Section "PROSTATEx Challenge (November 21, 2016 to February 16, 2017)" download all the data, except for the .bmp files
  - In Section "PROSTATEx-2 — SPIE-AAPM-NCI Prostate MR Gleason Grade Group Challenge (May 15, 2017 to August 3, 2017)" download only lesion information (.zip), as the images are exactly the same as the ones we have already downloaded.
  - Once all images have been downloaded, please modify the path below to point to them
  
  
- We will need ``plot_lib`` to visualize the images as they are processed:
 - `plot_lib`: https://github.com/OscarPellicer/plot_lib
 
- Edit the following cell according to where the downloaded data is located.

- Finally, make sure to read the Initial Setup Section (below) for further instructions.


- Please note that this notebook has to be run twice: once for the ProstateX TRAIN data and once for the ProstateX TEST data.

In [1]:
#Path to the ProstateX dataset to be processed (can be either the train or test set)
DS_PATH= r'D:\oscar\Prostate Images\ProstateX\TRAIN'
#DS_PATH= r'D:\oscar\Prostate Images\ProstateX\TEST'

#TRAIN is boolean indicating whether we are using the TRAIN or the TEST data
TRAIN= True

#Main configuration
verbose= True #Show extra information during the process
apply_registration= True #Use transformations in transforms_path to register the images

## Initial setup

`plot_lib` setup

In [2]:
#Import plot_lib
from pathlib import Path
import sys, os
sys.path.append(os.path.join(Path.home(), 'plot_lib'))
from plot_lib import plot_alpha, plot_multi_mask, plot,  plot4

#Some CSS to allow images to display side by side by default
br= lambda: print(' '*100) 
from IPython.display import display, HTML
CSS = """.output { flex-direction: row; flex-wrap: wrap; }
         .widget-hslider { width: auto; } """
HTML('<style>{}</style>'.format(CSS))

Import other required libraries. 

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import SimpleITK as sitk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pydicom
import glob
import pickle
from preprocessing_lib import (info, grow_regions_sitk,
                              join_sitk_images, join_masks, read_prostatex_patient,
                              rescale_intensity, center_image, get_blank_image,
                              get_lesion_mask_id_seed, grow_lesions)

We need to set all the **required paths**. Modify them if needed so that they point to the correct locations.

In [4]:
#1) Data that must have been downloaded following instructions above
#Path where DICOMS are stored
dicom_path= os.path.join(DS_PATH, 'Images')

#Path where ktrans images are stored
ktrans_path= os.path.join(DS_PATH, 'Images Ktrans')

#Path where csvs are stored
prostateX_csv_path= os.path.join(DS_PATH, 'Lesion Information', 
                                 'ProstateX-Findings-%s.csv'%('Train' if TRAIN else 'Test'))
prostateX2_csv_path= os.path.join(DS_PATH.replace('ProstateX', 'ProstateX2'), 'Lesion Information', 
                                 'ProstateX-Findings-%s.csv'%('Train' if TRAIN else 'Test'))

#2) Data provided alongside the repository:
#Path to where ProstateX_masks are stored
masks_path= r'ProstateX_masks'

#Path where registration transformations are stored
transforms_path= r'ProstateX_transforms'

#3) Output data to be generated by this Notebook:
#Path where data to be read by the Retina UNet model will be stored
#We create the directory if it does not exist
output_path= 'out' + ('_unregistered' if not apply_registration else '')
os.makedirs(output_path, exist_ok=True)
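
Since the rest of the notebook depends on these paths, it can help to check them before starting the long processing loop. The following is a minimal sketch, not part of the original notebook: `missing_paths` is a hypothetical helper, and the demo dictionary uses made-up entries rather than the real variables.

```python
import os

def missing_paths(paths):
    """Return the names of the entries in `paths` that do not exist on disk."""
    return [name for name, path in paths.items() if not os.path.exists(path)]

# Demo with one path that exists and one that (presumably) does not
demo = {'here': os.getcwd(),
        'nowhere': os.path.join(os.getcwd(), 'no_such_dir_12345')}
print(missing_paths(demo))  # ['nowhere']
```

In the notebook, the same helper could be called with a dictionary built from `dicom_path`, `ktrans_path`, `prostateX_csv_path`, `prostateX2_csv_path`, `masks_path` and `transforms_path`.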

## Description of the datasets and their labels

**ProstateX** contains 204 train and 142 test images of prostates with lesion locations and two significance levels (available only for the train set; the test set has no significance levels):
 - **False**: Gleason Score <= 3+3
 - **True**: Gleason Score >= 3+4
     
References:
 - https://prostatex.grand-challenge.org/challenge_description/
 - https://wiki.cancerimagingarchive.net/display/Public/SPIE-AAPM-NCI+PROSTATEx+Challenges
     
     
**ProstateX2** contains the same images (both train and test), but with much more complete label information:
 - **Grade Group 1 (Gleason score <= 3+3)**: Only individual discrete well-formed glands
 - **Grade Group 2 (Gleason score 3+4)**: Predominantly well-formed glands with lesser component of poorly-formed/fused/cribriform glands
 - **Grade Group 3 (Gleason score 4+3)**: Predominantly poorly formed/fused/cribriform glands with lesser component of well-formed glands
 - **Grade Group 4 (Gleason score 4+4, 3+5, 5+3)**: (1) Only poorly-formed/fused/cribriform glands or (2) predominantly well-formed glands and lesser component lacking glands or (3) predominantly lacking glands and lesser component of well-formed glands
 - **Grade Group 5 (Higher Gleason scores)**: Lacks gland formation (or with necrosis) with or without poorly formed/fused/cribriform glands

References:
 - Epstein JI, Egevad L, Amin MB, Delahunt B, Srigley JR, Humphrey PA, the Grading Committee. The 2014 International Society of Urologic Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of grading patterns and proposal for a new grading system. Am J Surg Pathol, 40(2):244-252, 2016
 - https://www.aapm.org/GrandChallenge/PROSTATEx-2/

As it turns out, the images that appear in ProstateX, but not in ProstateX2, are those that have a GS < 6 (i.e., benign).

Therefore, by combining the information from both datasets, we can create an **extended labelling system** defined as follows:
 - 0: Normal prostate
 - 1-5: GG{1-5}
 - 10: Benign lesion or normal prostate
 - 20: Unknown (test set)
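
The relationship between the extended labels and the binary ProstateX significance can be sketched as follows. This is an illustrative sketch only: `extended_label` and `is_significant` are hypothetical helpers that do not exist in `preprocessing_lib`, and the threshold simply restates the definitions above (significant means Gleason Score >= 3+4, i.e. GG2 or higher).

```python
import pandas as pd

BENIGN, UNKNOWN = 10, 20  # extended labels from the list above

def extended_label(ggg, train=True):
    """Map a ProstateX2 grade group (NaN if the lesion has none) to an extended label."""
    if not train:
        return UNKNOWN
    return BENIGN if pd.isna(ggg) else int(ggg)

def is_significant(label):
    """Binary significance implied by an extended label (None if unknown)."""
    if label == UNKNOWN:
        return None
    return 2 <= label <= 5  # GG2 or higher, i.e. Gleason score >= 3+4

labels = [extended_label(g) for g in [3.0, 1.0, float('nan')]]
print(labels)                                # [3, 1, 10]
print([is_significant(l) for l in labels])   # [True, False, False]
```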

In [5]:
#Read csvs
l_info= pd.read_csv(prostateX_csv_path, index_col='ProxID')
l_info_2= pd.read_csv(prostateX2_csv_path, index_col='ProxID')
lesion_info= l_info.reset_index().merge(l_info_2.reset_index(), how="left", 
                                        on=['ProxID', 'pos', 'zone', 'fid']).set_index('ProxID')

if TRAIN: 
    lesion_info.loc[lesion_info.ggg.isna(), 'ggg']= 10
    lesion_info.ggg= lesion_info.ggg.astype(int)

lesion_info.head(10)

| ProxID | fid | pos | zone | ClinSig | ggg |
|---|---|---|---|---|---|
| ProstateX-0000 | 1 | 25.7457 31.8707 -38.511 | PZ | True | 3 |
| ProstateX-0001 | 1 | -40.5367071921656 29.320722668457 -16.70766907... | AS | False | 1 |
| ProstateX-0002 | 1 | -27.0102 41.5467 -26.0469 | PZ | True | 2 |
| ProstateX-0002 | 2 | -2.058 38.6752 -34.6104 | PZ | False | 1 |
| ProstateX-0003 | 1 | 22.1495 31.2717 -2.45933 | TZ | False | 10 |
| ProstateX-0003 | 2 | -21.2871 19.3995 19.7429 | TZ | False | 10 |
| ProstateX-0004 | 1 | -7.69665 3.64226 23.1659 | AS | False | 1 |
| ProstateX-0005 | 0 | -14.5174331665039 49.4428329467773 20.78152465... | PZ | True | 2 |
| ProstateX-0005 | 1 | -38.6276 42.2781 21.4084 | PZ | True | 3 |
| ProstateX-0005 | 1 | -22.0892639160156 25.4668045043945 22.87915420... | TZ | False | 10 |


## Read, process and save the images

For each patient, we will produce three outputs, saved to `output_path`, for use in the Retina UNet (`ID` represents the actual patient ID):

- `ID_img.npy`: It contains the prostate mpMRI with dimensions 160 (x) x 160 (y) x 24 (z) x 8 (channels) and spacing 0.5 x 0.5 x 3 mm. The channels are the following:
 - 0: T2
 - 1: B500
 - 2: B800
 - 3: ADC
 - 4: ktrans
 - 5: Prostate mask
 - 6: Central Zone mask
 - 7: Peripheral Zone mask
 
 
- `ID_rois.npy`: It contains a mask of the lesion IDs


- `meta_info_ID.pickle`: It contains a dictionary with items: 
 - 'pid': String with the ID of the patient. E.g.: 'ProstateX-0000'
 - 'class_target': 1D array with the class associated with each of the lesions. E.g.: np.array([1,10])
 - 'spacing': tuple with the spacing of the image. E.g.: (0.5, 0.5, 3)
 - 'fg_slices': z slices where there is at least one lesion. E.g.: [5,6,7,8,9].
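
As a rough illustration of how these three files fit together, the sketch below writes a synthetic patient to a temporary directory and reads it back. The patient ID and array contents are made up, `load_patient` and `foreground_slices` are hypothetical helpers, and the arrays are assumed to keep the z-first layout that `sitk.GetArrayFromImage` returns, with the 8 channels (0 to 7) listed above in the last axis.

```python
import os, pickle, tempfile
import numpy as np

def load_patient(output_path, pid):
    """Load the image, the lesion mask and the meta information of one patient."""
    img = np.load(os.path.join(output_path, '%s_img.npy' % pid))
    rois = np.load(os.path.join(output_path, '%s_rois.npy' % pid))
    with open(os.path.join(output_path, 'meta_info_%s.pickle' % pid), 'rb') as handle:
        meta = pickle.load(handle)
    return img, rois, meta

def foreground_slices(rois):
    """z indices (axis 0) that contain at least one lesion voxel."""
    return [int(z) for z in np.unique(np.argwhere(rois != 0)[:, 0])]

# Synthetic stand-in for a processed patient (hypothetical ID and values)
with tempfile.TemporaryDirectory() as tmp:
    pid = 'ProstateX-demo'
    rois = np.zeros((24, 160, 160, 1), dtype=np.uint8)
    rois[5:8, 70:80, 70:80, 0] = 1  # one fake lesion on slices 5 to 7
    np.save(os.path.join(tmp, pid + '_img.npy'),
            np.zeros((24, 160, 160, 8), dtype=np.float32))
    np.save(os.path.join(tmp, pid + '_rois.npy'), rois)
    with open(os.path.join(tmp, 'meta_info_%s.pickle' % pid), 'wb') as handle:
        pickle.dump({'pid': pid, 'class_target': [1], 'spacing': (0.5, 0.5, 3.0),
                     'fg_slices': foreground_slices(rois)}, handle)

    img, rois_loaded, meta = load_patient(tmp, pid)

print(img.shape, rois_loaded.shape, meta['fg_slices'])
# (24, 160, 160, 8) (24, 160, 160, 1) [5, 6, 7]
```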

In [None]:
for ID in os.listdir(dicom_path):
    try:
        #Print patient ID
        print('\n%s'%ID) 
        
        #Check if registration transform exists and load it
        if apply_registration and os.path.exists(os.path.join(transforms_path, ID + '.tfm')):
            transform= sitk.ReadTransform(os.path.join(transforms_path, ID + '.tfm'))
        else:
            transform= sitk.Euler3DTransform()
            print('No transform was found (or apply_registration is off). Image might be unregistered')

        #---------------Read all images and masks, and then combine them---------------
        
        #Read all the modalities for a given ProstateX patient ID
        #There might be multiple directories (or not)
        patient_directories= os.listdir(os.path.join(dicom_path, ID))
        if len(patient_directories) != 1: 
             print(' - Warning: Multiple directories!')
        patient_directories= patient_directories[0]
        images_path= os.path.join(dicom_path, ID, patient_directories)
        images_list= read_prostatex_patient(ID, images_path, ktrans_path, verbose=verbose)

        #Read Prostate segmentation mask
        mask = sitk.ReadImage(os.path.join(masks_path, ID + '_msk.nrrd'))

        #Read CZ segmentation mask
        cz_mask= sitk.ReadImage(os.path.join(masks_path, ID + '_cz_msk.nrrd'))
        
        #Combine all images in images_list as single multichannel image using the first as reference
        img_final= join_sitk_images(images_list, resampler=sitk.sitkBSpline, cast_type=sitk.sitkFloat32)
        
        #Join all masks
        prostate_mask= join_masks(mask, cz_mask > 1.5, mode='append')

        #---------------Lesion region growing preparation---------------
        
        #Load lesions information for current ID
        try:
            lesions= lesion_info.loc[ID].values
            lesions= lesions[np.newaxis, ] if len(lesions.shape) == 1 else lesions
            positions= np.array([np.fromstring(p[1], dtype=np.float32, sep=' ') for p in lesions])
            positions_img= np.array([images_list[0].TransformPhysicalPointToContinuousIndex(p.astype(np.float64)) 
                                         for p in positions])
            significances= [p[4] for p in lesions] if TRAIN else [20 for _ in lesions]
            print(' - Lesion positions and significances:')
            for i, (pos, sig) in enumerate(zip(positions_img, significances)):
                print('   - %d: %s, Sig: %d'%(i+1, str(pos), sig))
            
        except Exception as e:
            positions, positions_img, significances= [], [], []
            print(' - Error: No lesion information found!')
            raise e
            
        #We now create a small mask around the positions where lesions are located,
        #which will be used as the seed to grow the lesions later
        lesion_mask_id_seed= get_lesion_mask_id_seed(positions_img, mask)
        prostate_mask_intermediate= join_masks(prostate_mask, lesion_mask_id_seed, mode='append')
            
        #---------------Get ROI and apply computed registration (if available)---------------
               
        #Rescale intensity (must be converted to numpy first)
        img_backup= sitk.Image(img_final)
        img_array= sitk.GetArrayFromImage(img_final)
        img_array= rescale_intensity(img_array)
        img_final= sitk.GetImageFromArray(img_array, isVector=True)
        img_final.CopyInformation(img_backup)
        
        #Keep only the central area of size (160,160,24) around the prostate centroid 
        img_final, prostate_mask_intermediate= center_image(
                                    img_final, prostate_mask_intermediate, center_around_roi=True,
                                    size=(160,160,24), spacing=(0.5,0.5,3),
                                    transform_channels=[1,2,3], per_channel_transform=transform,
                                    pre_mask_growth_mm=2, pre_mask_growth_mm_channels= [2])
        
        #---------------Actual lesion region growing---------------
        
        #Create the automatic lesion segmentation mask
        lesion_mask_id, _= grow_lesions( prostate_mask_intermediate, img_final, significances, transform, 
                                            iters_max=120, factors= [2.5,2.5,3.5,4] )
        
        #Join all masks
        prostate_mask= join_masks(sitk.VectorIndexSelectionCast(prostate_mask_intermediate, 0), 
                                  sitk.VectorIndexSelectionCast(prostate_mask_intermediate, 1), mode='append')
        prostate_mask= join_masks(prostate_mask, lesion_mask_id, mode='append')
        
        #---------------Plot results---------------
        
        #info(img_final)
        for c in [0,3]:
            plot_multi_mask(sitk.VectorIndexSelectionCast(img_final, c), prostate_mask, title='All masks')

        #---------------Save them---------------
        
        #Generate needed arrays and information
        img_final_arr= sitk.GetArrayFromImage(img_final)
        prostate_mask_arr= sitk.GetArrayFromImage(prostate_mask)  

        final_rois= prostate_mask_arr[...,2][..., np.newaxis]
        img_arr= np.concatenate([img_final_arr, #C0-4 Image 
                                 prostate_mask_arr[...,[0,1]], #C5-6: Prostate & CZ mask
                                 (prostate_mask_arr[...,0] - prostate_mask_arr[...,1])[...,np.newaxis]], #C7: PZ mask
                                 axis=-1)
        fg_slices= [ii for ii in np.unique(np.argwhere(final_rois != 0)[:, 0])]    

        #Save all information: ID_rois.npy, ID_img.npy, meta_info_ID.pickle
        np.save(os.path.join(output_path, '{}_rois.npy'.format(ID)), final_rois)
        np.save(os.path.join(output_path, '{}_img.npy'.format(ID)), img_arr)
        with open(os.path.join(output_path, 'meta_info_{}.pickle'.format(ID)), 'wb') as handle:
            meta_info_dict = {'pid': ID, 'class_target': significances, 
                              'spacing': img_final.GetSpacing(), 'fg_slices': fg_slices}
            pickle.dump(meta_info_dict, handle)
            
    except Exception as e:
        print(' - Error: Unhandled exception: %s'%e)
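
In the loop above, lesion positions from the CSV (physical coordinates in mm) are converted to image coordinates with `TransformPhysicalPointToContinuousIndex`. The NumPy sketch below illustrates the underlying arithmetic under the usual ITK convention (physical = origin + direction · (spacing × index)). It is not the SimpleITK implementation, `physical_to_continuous_index` is a hypothetical helper, and the numbers are made up.

```python
import numpy as np

def physical_to_continuous_index(point, origin, spacing, direction=None):
    """Invert physical = origin + direction @ (spacing * index)."""
    point, origin, spacing = (np.asarray(v, dtype=np.float64)
                              for v in (point, origin, spacing))
    n = len(point)
    # The direction may be given as a flat row-major tuple; default is identity
    direction = (np.eye(n) if direction is None
                 else np.asarray(direction, dtype=np.float64).reshape(n, n))
    return np.linalg.solve(direction, point - origin) / spacing

idx = physical_to_continuous_index(point=(10.0, 20.0, 30.0),
                                   origin=(0.0, 10.0, 0.0),
                                   spacing=(0.5, 0.5, 3.0))
print(idx)  # [20. 20. 10.]
```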

In [10]:
#Generate info_df.pickle for all the images
files = [os.path.join(output_path, f) for f in os.listdir(output_path) if 'meta_info' in f]
df = pd.DataFrame(columns=['pid', 'class_target', 'spacing', 'fg_slices'])
for f in files:
    with open(f, 'rb') as handle:
        df.loc[len(df)] = pickle.load(handle)
df.to_pickle(os.path.join(output_path, 'info_df.pickle'))
print ("Aggregated meta info to df with length", len(df))

Aggregated meta info to df with length 204
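
Once `info_df.pickle` exists, a quick sanity check is to count how many lesions fall into each class. The sketch below uses a small hand-built DataFrame in place of the real file. Its three rows mirror the first patients of the lesion table shown earlier, and `class_target` is assumed to hold one list of labels per patient, as saved by the loop above.

```python
import pandas as pd

# Stand-in for: df = pd.read_pickle(os.path.join(output_path, 'info_df.pickle'))
df = pd.DataFrame({
    'pid': ['ProstateX-0000', 'ProstateX-0002', 'ProstateX-0003'],
    'class_target': [[3], [2, 1], [10, 10]],
})

# One row per lesion, then count the lesions in each class
counts = df['class_target'].explode().astype(int).value_counts().sort_index()
print(counts)
```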
