### InnerEye deep learning framework implementation with normalization
This notebook was prepared following https://github.com/microsoft/InnerEye-DeepLearning/blob/main/docs/creating_dataset.md

The data for DeepMedic was preprocessed in a way similar to what InnerEye deep learning expects. Hence we will use the data preprocessed for DeepMedic and check that the dataset adheres to the InnerEye guidelines.

In [1]:
# Load packages
from MEDIcaTe.file_folder_ops import *
from MEDIcaTe.utilities import *
from MEDIcaTe.nii_resampling import find_pix_dim_with_orientation
import pandas as pd

The requirements described at https://github.com/microsoft/InnerEye-DeepLearning/blob/main/docs/creating_dataset.md are already met by the normalized data in the folder
/homes/kovacs/project_data/hnc-auto-contouring/deepMedic/data_nifti_as_deepmedic/train_normalized.

These files were created as part of the preprocessing for DeepMedic.

The guide specifies that images should be encoded as float32 and labels as binary masks.
We will therefore create a copy of the data in which the datatypes are changed to these formats.

We start by checking the current format of the data. 
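
One way to inspect the current format is to read the datatype code straight from the NIfTI header. The sketch below is an illustration only, not part of the MEDIcaTe package: `read_nifti_dtype` is a hypothetical helper, and it assumes the files are NIfTI-1 (348-byte header, datatype code at byte offset 70). It reports the on-disk datatype, which may differ from what a loader returns after applying scaling.

```python
import gzip
import struct

# NIfTI-1 datatype codes (subset) mapped to readable names.
NIFTI_DTYPES = {2: 'uint8', 4: 'int16', 8: 'int32', 16: 'float32',
                64: 'float64', 256: 'int8', 512: 'uint16', 768: 'uint32'}


def read_nifti_dtype(path):
    """Return the on-disk datatype name of a .nii or .nii.gz file (hypothetical helper)."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as fh:
        header = fh.read(348)  # a NIfTI-1 header is 348 bytes long
    if len(header) < 348:
        raise ValueError(f'{path}: file too short for a NIfTI-1 header')
    # sizeof_hdr (first 4 bytes) must equal 348; use it to detect the byte order.
    endian = '<' if struct.unpack('<i', header[:4])[0] == 348 else '>'
    if struct.unpack(endian + 'i', header[:4])[0] != 348:
        raise ValueError(f'{path}: not a NIfTI-1 file')
    # The datatype code is a 16-bit integer at byte offset 70.
    code = struct.unpack(endian + 'h', header[70:72])[0]
    return NIFTI_DTYPES.get(code, f'unknown ({code})')
```

Calling `read_nifti_dtype` on one CT, one PET, and one label file from the source folder should be enough to see whether a conversion is needed at all.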

In [3]:
path_to_train_dat = '/homes/kovacs/project_data/hnc-auto-contouring/deepMedic/data_nifti_as_deepmedic/train_normalized'
path_to_image_folder_dst = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye/d_train_norm/images'
path_to_label_folder_dst = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye/d_train_norm/labels'

### Converting dataset format to float32 for images and int8 for labels

In [None]:
# This only needs to be run once.
# To run it in the background, I used the file ./convert_dtypes.py
# Convert images to float32 and labels to int8.

def convert_data_types(path_to_train_dat, path_to_image_folder_dst, path_to_label_folder_dst):
    """Copy every case to the destination folders: images as float32, labels as int8."""
    # Each subfolder of path_to_train_dat holds the files of one case.
    for fo in listdir(path_to_train_dat):
        for fi in listdir(join(path_to_train_dat, fo)):
            # copy images to image destination folder and convert to float32
            if ('0000.nii.gz' in fi) or ('0001.nii.gz' in fi):
                print(f'{fi} is PET or CT, i.e. an image. Converting to float32.')
                convert_nii_to_float32(join(path_to_train_dat, fo, fi), join(path_to_image_folder_dst, fi))

            # copy labels to label destination folder and convert to int8
            elif fi == f'{fo}.nii.gz':
                print(f'{fi} is a tumor label file. Converting to int8.')
                convert_nii_to_int8(join(path_to_train_dat, fo, fi), join(path_to_label_folder_dst, fi))

# run only once.
convert_data_types(path_to_train_dat, path_to_image_folder_dst, path_to_label_folder_dst)
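
After the conversion, it may be worth confirming that every label in the destination folder has both a CT and a PET image next to it. The following is a minimal sketch under the naming convention used in this notebook (`<case>_0000.nii.gz`, `<case>_0001.nii.gz`, `<case>.nii.gz`); `find_incomplete_cases` is a hypothetical helper and not part of MEDIcaTe.

```python
import os


def find_incomplete_cases(image_dir, label_dir):
    """Return {case_id: [missing image file names]} for labels lacking a CT or PET file."""
    images = set(os.listdir(image_dir))
    missing = {}
    for label in sorted(os.listdir(label_dir)):
        if not label.endswith('.nii.gz'):
            continue  # skip anything that is not a NIfTI label
        case_id = label[:-len('.nii.gz')]
        expected = [f'{case_id}_0000.nii.gz', f'{case_id}_0001.nii.gz']
        absent = [name for name in expected if name not in images]
        if absent:
            missing[case_id] = absent
    return missing
```

An empty dictionary from `find_incomplete_cases(path_to_image_folder_dst, path_to_label_folder_dst)` would indicate that no case lost an image during the copy.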
        

### Generating the dataset.csv file

In [None]:
# Generate dataset.csv file
'''
This all depends on how you structured your files. This script works for my structure, which is:
-- data
    -- d_train_norm
        -- images
            HNC01_000_0000.nii.gz
            HNC01_000_0001.nii.gz
            HNC01_001_0000.nii.gz
            HNC01_001_0001.nii.gz
            ...
        -- labels
            HNC01_000.nii.gz
            HNC01_001.nii.gz
            ...
where files ending in _0000.nii.gz are CT and files ending in _0001.nii.gz are PET.
'''
path_labels = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye/d_train_norm/labels'
path_images = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye/d_train_norm/images'

path_to_dataset_csv = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye'

# paths relative to the location of dataset.csv:
rel_path_labels = 'd_train_norm/labels'
rel_path_images = 'd_train_norm/images'

subject = []
filePath = []
channel = []

# sort the labels so that the subject numbering is reproducible across runs
for i,f in enumerate(sorted(listdir(path_labels))):
    case_id = f[:-7]
    # add ct line
    filePath.append(join(rel_path_images,f'{case_id}_0000.nii.gz'))
    channel.append('ct')
    subject.append(i+1)

    # add pet line
    filePath.append(join(rel_path_images,f'{case_id}_0001.nii.gz'))
    channel.append('pet')
    subject.append(i+1)

    # add label line
    filePath.append(join(rel_path_labels,f'{case_id}.nii.gz'))
    channel.append('tumor')
    subject.append(i+1)

out_dat = pd.DataFrame(list(zip(subject, filePath, channel)), columns =['subject', 'filePath', 'channel'])
out_dat.to_csv(join(path_to_dataset_csv,'dataset.csv'),index=False)
print(out_dat.head(15))
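
Before handing the file to InnerEye, a quick consistency check on the generated `dataset.csv` can catch cases with a missing or duplicated channel. The sketch below uses only the standard library and assumes the three channel names written by the cell above (`ct`, `pet`, `tumor`); `check_dataset_csv` is a hypothetical helper.

```python
import csv
from collections import defaultdict

# Channel names written by the dataset.csv cell in this notebook.
EXPECTED_CHANNELS = {'ct', 'pet', 'tumor'}


def check_dataset_csv(csv_path):
    """Return {subject: problem} for subjects whose set of channels is not as expected."""
    channels = defaultdict(list)
    with open(csv_path, newline='') as fh:
        for row in csv.DictReader(fh):
            channels[row['subject']].append(row['channel'])
    problems = {}
    for subject, found in channels.items():
        if len(found) != len(set(found)):
            problems[subject] = 'duplicate channel'
        elif set(found) != EXPECTED_CHANNELS:
            problems[subject] = 'missing or unexpected channel'
    return problems
```

For a correctly generated file, `check_dataset_csv(join(path_to_dataset_csv, 'dataset.csv'))` should return an empty dictionary. Note that subjects are read back as strings, since the `csv` module does not convert types.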

### Adhering to the image size requirements
We check that this dataset adheres to the image size requirements as prescribed at https://github.com/microsoft/InnerEye-DeepLearning/blob/main/docs/creating_dataset.md.

In [None]:
# Note: takes about 7 minutes to run for 800-900 cases.
path_labels = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye/d_train_norm/labels'
path_images = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye/d_train_norm/images'
path_to_dataset_csv = '/homes/kovacs/project_data/hnc-auto-contouring/inner-eye'


ct_dim_list = []
pet_dim_list = []
label_dim_list = []

for i,f in enumerate(listdir(path_labels)):
    case_id = f[:-7]

    ct_file = join(path_images,f'{case_id}_0000.nii.gz')
    pet_file = join(path_images,f'{case_id}_0001.nii.gz')
    label_file = join(path_labels,f)
    
    ct_dim = find_pix_dim_with_orientation(ct_file)
    pet_dim = find_pix_dim_with_orientation(pet_file)
    label_dim = find_pix_dim_with_orientation(label_file)

    ct_dim_list.append(ct_dim)
    pet_dim_list.append(pet_dim)
    label_dim_list.append(label_dim)

out_dat = pd.DataFrame(list(zip(ct_dim_list, pet_dim_list, label_dim_list)), columns =['ct', 'pet', 'label'])
out_dat.to_csv(join(path_to_dataset_csv,'image_dimensions_norm.csv'),index=False)
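
The resulting CSV can then be scanned for cases where CT, PET, and label do not agree. This is a sketch under two assumptions: that the intent of the check is for the three entries of a case to be identical, and that `find_pix_dim_with_orientation` returns values whose text form is the same for equal results. `find_dimension_mismatches` is a hypothetical helper. Because the CSV has no case ID column, rows are identified by their position, which follows the order of `listdir(path_labels)`.

```python
import csv


def find_dimension_mismatches(csv_path):
    """Return (row_number, ct, pet, label) for every row whose three entries differ."""
    mismatches = []
    with open(csv_path, newline='') as fh:
        for row_number, row in enumerate(csv.DictReader(fh), start=1):
            values = (row['ct'], row['pet'], row['label'])
            # Entries are compared as the text pandas wrote to the file.
            if len(set(values)) > 1:
                mismatches.append((row_number, *values))
    return mismatches
```

Since the comparison is textual, small floating-point differences in the stored values would be reported as mismatches; any reported row is best inspected by hand.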
