# CONVERT DICOM TO AXIAL PNG 2D SLICES, <br /> k-FOLD CV, FILTER BASED ON IMAGE PIXEL STATISTICS
## 🧠 RSNA-MICCAI Brain Tumor Radiogenomic Classification 🧠


**Data preprocessing** for the [RSNA-MICCAI Brain Tumor Radiogenomic Classification](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification) challenge, with steps on how to **filter** useful images, **create a map of the chosen images**, **make k-fold train/val splits**, **gain insights from image pixel statistics**, and **convert the DICOM files** of the MRI scans **to PNG**.

---------------------------------------

**Credit:**
Notebook based on 
1. this wonderful solution [Connecting voxel spaces](https://www.kaggle.com/boojum/connecting-voxel-spaces), 
2. my EDA notebook [🧠 Brain Radiogenomics Advanced EDA](https://www.kaggle.com/smoschou55/brain-radiogenomics-advanced-eda) 

and some inspiration from the following notebooks: 
1. [🧠Brain Tumor🧠 - EDA with Animations and Modeling](https://www.kaggle.com/ihelon/brain-tumor-eda-with-animations-and-modeling), 
2. [DICOM to PNG dataset (128 GB -> 5.2 GB) 🎨🔥](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/253000) and 
3. [🧠Fast DICOM--> PNG full DATA+ {Download DATA} ✅](https://www.kaggle.com/anasshnn/fast-dicom-png-full-data-download-data).

<!-- 2. [Converting DICOM Metadata to CSV](https://www.kaggle.com/carlolepelaars/converting-dicom-metadata-to-csv-rsna-ihd-2019)
3. [DICOM Metadata EDA](https://www.kaggle.com/anarthal/dicom-metadata-eda)
4. [Pulmonary Dicom Preprocessing](https://www.kaggle.com/allunia/pulmonary-dicom-preprocessing#Prepare-to-start-) and 
5. [Insightful EDA on Meta Data & Dicom Files](https://www.kaggle.com/jagdmir/insightful-eda-on-meta-data-dicom-files).
6. [BTRC EDA (Final)](https://www.kaggle.com/josecarmona/btrc-eda-final)
7. [(Part-1) RSNA-MICCAI BTRC: Understanding The Data](https://www.kaggle.com/arnabs007/part-1-rsna-miccai-btrc-understanding-the-data) -->

![](https://storage.googleapis.com/kaggle-competitions/kaggle/29653/logos/header.png)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:darkviolet; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

* [1. Overview](#1)
* [2. Helper Functions](#2)
* [3. Create DataFrame with Image Filepaths](#10)
* [4. Compute Image Data Stats](#15)
* [5. Filter Based on Threshold and Save PNGs](#20)
* [6. Train / Val Split](#30)
* [7. Convert to Voxel Space of Choice](#40)

<a id="1"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>1. Overview</center></h2>

In [None]:
!pip install pandarallel

import os
import shutil
import ast
import json
import glob
import random
import collections
import gc

import numpy as np
import pandas as pd

# Visualization
import nibabel as nib
import matplotlib.pyplot as plt
import SimpleITK as sitk
import matplotlib.image as mpimg
import seaborn as sns; sns.set();

import imageio    # save to PNG images
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import cv2

from tqdm.notebook import tqdm; tqdm.pandas(); # get nice bar

from pandarallel import pandarallel; pandarallel.initialize(); 

# Seed for reproducibility (Python's `random` is used later for sampling images)
seed = 1234
random.seed(seed)
np.random.seed(seed)

## Files
**train/** - folder containing the training files, with each top-level folder representing a subject  
**train_labels.csv** - file containing the target MGMT_value for each subject in the training data (e.g. the presence of MGMT promoter methylation)   
**test/** - the test files, which use the same structure as train/; your task is to predict the MGMT_value for each subject in the test data. NOTE: the total size of the rerun test set (Public and Private) is ~5x the size of the Public test set   
**sample_submission.csv** - a sample submission file in the correct format

The exact mpMRI scans included are:

- Fluid Attenuated Inversion Recovery (FLAIR)
- T1-weighted pre-contrast (T1w)
- T1-weighted post-contrast (T1Gd)
- T2-weighted (T2)

Exact folder structure:


```
Training/Validation/Testing
│
└─── 00000
│   │
│   └─── FLAIR
│   │   │ Image-1.dcm
│   │   │ Image-2.dcm
│   │   │ ...
│   │   
│   └─── T1w
│   │   │ Image-1.dcm
│   │   │ Image-2.dcm
│   │   │ ...
│   │   
│   └─── T1wCE
│   │   │ Image-1.dcm
│   │   │ Image-2.dcm
│   │   │ ...
│   │   
│   └─── T2w
│   │   │ Image-1.dcm
│   │   │ Image-2.dcm
│   │   │ .....
│   
└─── 00001
│   │ ...
│   
│ ...   
│   
└─── 00002
│   │ ...
```

DICOM® — [Digital Imaging and Communications in Medicine](https://www.dicomstandard.org/about-home) — is the international standard for medical images and related information. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use. With hundreds of thousands of medical imaging devices in use, DICOM® is one of the most widely deployed healthcare messaging Standards in the world.

<a id="2"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>2. Helper Functions</center></h2>

In [None]:
# Paths 
KAGGLE_DIR = '/kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/'
IMG_PATH_TRAIN = KAGGLE_DIR + 'train/'
IMG_PATH_TEST = KAGGLE_DIR + 'test/'
TRAIN_CSV_PATH = KAGGLE_DIR + 'train_labels.csv'
TEST_CSV_PATH = KAGGLE_DIR + 'sample_submission.csv'

In [None]:
# All filenames for train and test images
train_images = os.listdir(IMG_PATH_TRAIN)
test_images = os.listdir(IMG_PATH_TEST)

### For more details on pixel arrays, see [Working with Pixel Data](https://pydicom.github.io/pydicom/stable/old/working_with_pixel_data.html)

In [None]:
def load_dicom(path):
    # read the DICOM file (dcmread replaces the deprecated read_file)
    dicom = pydicom.dcmread(path)
    # get the pixel data as a NumPy array
    data = dicom.pixel_array
    # min-max scale the data to 8-bit grayscale (0-255)
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data

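def save_png_to_disk(data, path, png_master_dir):
    # Mirror the Kaggle layout (<split>/<patient>/<scan_type>/) under png_master_dir;
    # the first four '/'-separated parts of `path` ('', 'kaggle', 'input', <dataset>) are dropped
    image_name = path.split('/')[-1].split('.')[0]
    png_image_path = png_master_dir+'/'+'/'.join(path.split('/')[4:-1])+'/'+image_name+'.png'
    imageio.imsave(png_image_path, data)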

def convert_dicom_to_png(path, png_master_dir, resize = None):
    dicom = pydicom.dcmread(path)
    data = apply_voi_lut(dicom.pixel_array, dicom)
    # If a target (width, height) is given, resize the image to that resolution
    if resize:
        data = cv2.resize(data, resize)
    # Transform Data as Necessary     
    if dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    # SAVE PNG TO DISK     
    save_png_to_disk(data, path, png_master_dir)
    return data

def image_stats(path):
    dicom = pydicom.dcmread(path)
    data = apply_voi_lut(dicom.pixel_array, dicom)
    
    # Transform Data as Necessary     
    if dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    
    # Compute and Return Image Stats: min, max, mean, std, 25th-, 50th-, and 75th percentile
    return np.min(data), \
            np.max(data), \
            np.mean(data), \
            np.std(data), \
            np.percentile(data, 25), \
            np.percentile(data, 50), \
            np.percentile(data, 75)

def is_valid_slice(path, threshold=0):
    dicom = pydicom.dcmread(path)
    data = apply_voi_lut(dicom.pixel_array, dicom)
    
    # Transform Data as Necessary     
    if dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    
    # A slice is valid (worth keeping) only if its mean pixel value exceeds the threshold
    return bool(np.mean(data) > threshold)
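
The min-max rescaling to 8-bit grayscale is repeated in each helper above. A minimal sketch of a shared helper follows; the name `to_uint8` and the `invert` flag are illustrative and not part of the original notebook, and the input is a synthetic array rather than a real `dicom.pixel_array`.

```python
import numpy as np

def to_uint8(data, invert=False):
    """Min-max scale an array to the 0-255 range and cast to uint8.

    `invert=True` mimics the MONOCHROME1 handling above (bright and dark swapped).
    """
    data = np.asarray(data, dtype=np.float64)
    if invert:
        data = np.amax(data) - data
    data = data - np.min(data)
    max_val = np.max(data)
    if max_val != 0:          # a constant image stays all zeros
        data = data / max_val
    return (data * 255).astype(np.uint8)

# Synthetic 12-bit "slice" standing in for dicom.pixel_array
fake_slice = np.array([[0, 1024], [2048, 4095]], dtype=np.uint16)
scaled = to_uint8(fake_slice)
print(scaled.dtype, scaled.min(), scaled.max())
```

Casting to `float64` first avoids unsigned-integer wrap-around when the image is inverted.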

<a id="10"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>3. Create DataFrame with Image Filepaths</center></h2>

❗❗❗ **Uncomment the cells below to repeat the entire I/O process from scratch** ❗❗❗

**\train**

In [None]:
# f = []
# for (dirpath, dirnames, filenames) in os.walk(IMG_PATH_TRAIN):
#     f.extend(os.path.join(dirpath, x) for x in filenames)
    
# train_file_paths_df = pd.DataFrame({'file_paths': f})
# train_file_paths_df['directory'] = IMG_PATH_TRAIN
# train_file_paths_df['dataset'] = train_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[4]
# train_file_paths_df['patient_id'] = train_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[5]
# train_file_paths_df['scan_type'] = train_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[6]
# train_file_paths_df['file'] = train_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[7]
# display(train_file_paths_df.head(2))
# train_file_paths_df.shape[0]

**\test**

In [None]:
# f = []
# for (dirpath, dirnames, filenames) in os.walk(IMG_PATH_TEST):
#     f.extend(os.path.join(dirpath, x) for x in filenames)
    
# test_file_paths_df = pd.DataFrame({'file_paths': f})
# test_file_paths_df['directory'] = IMG_PATH_TEST
# test_file_paths_df['dataset'] = test_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[4]
# test_file_paths_df['patient_id'] = test_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[5]
# test_file_paths_df['scan_type'] = test_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[6]
# test_file_paths_df['file'] = test_file_paths_df['file_paths'].str.split("/", n = 7, expand = True)[7]
# display(test_file_paths_df.head(2))
# test_file_paths_df.shape[0]

### [Exclude 3 problematic cases](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/262046)

**BraTSIDs**
* 00109
* 00123
* 00709

In [None]:
# train_df = train_file_paths_df.copy()
# test_df = test_file_paths_df.copy()

# train_df = train_df[(train_df.patient_id != "00109") & 
#                     (train_df.patient_id != "00123") &
#                     (train_df.patient_id != "00709")]

Save to *.csv file

In [None]:
# train_df.to_csv('train_filepaths_rsna.csv', index=False)
# test_df.to_csv('test_filepaths_rsna.csv', index=False)

<a id="15"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>4. Compute Image Data Stats</center></h2>

# Create a Master DataFrame with Pixel Stats per Image (and Save It to a *.csv File)
This step only needs to run once. The saved statistics can then be reused to filter images out of the file-path DataFrame by any statistical criterion before a new PNG dataset is created.
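
As an illustration of what each row of the statistics table holds, the sketch below computes the same seven statistics on a synthetic 8-bit image instead of a DICOM file. `pixel_stats` is a hypothetical helper, and its keys simply reuse the column names of the table.

```python
import numpy as np

def pixel_stats(data):
    """Return min, max, mean, std and quartiles of an 8-bit image as a dict."""
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    return {
        "min_px": int(np.min(data)),
        "max_px": int(np.max(data)),
        "mean_px": float(np.mean(data)),
        "std_px": float(np.std(data)),
        "q1": float(q1), "q2": float(q2), "q3": float(q3),
    }

# Synthetic slice: black background with a bright 4 x 4 square in the middle
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 200
stats = pixel_stats(img)
print(stats)
```

A slice that is mostly background has a low mean and a median of zero, which is why `mean_px` works as a simple filter criterion later on.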

❗❗❗ **Uncomment the cells below to repeat the entire I/O process from scratch** ❗❗❗

**/train**

In [None]:
# stdf = train_df.copy()
# stdf["stats"] = stdf["file_paths"].parallel_apply(lambda x: image_stats(x))
# stdf[["min_px", "max_px", "mean_px", "std_px", "q1", "q2", "q3"]] = pd.DataFrame(stdf["stats"].tolist(), index=stdf.index)
# stdf = stdf.drop(['stats'], axis=1)
# stdf

In [None]:
# stdf.to_csv('stats_train_file_paths_df.csv', index=False)

**/test**

In [None]:
# ssdf = test_df.copy()
# ssdf["stats"] = ssdf["file_paths"].parallel_apply(lambda x: image_stats(x))
# ssdf[["min_px", "max_px", "mean_px", "std_px", "q1", "q2", "q3"]] = pd.DataFrame(ssdf["stats"].tolist(), index=ssdf.index)
# ssdf = ssdf.drop(['stats'], axis=1)
# ssdf

In [None]:
# ssdf.to_csv('stats_test_file_paths_df.csv', index=False)

# Read pre-saved *csv file

Use this if you don't want to uncomment the code blocks above and rerun the entire process.

In [None]:
train_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/stats_train_file_paths_df.csv')
test_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/stats_test_file_paths_df.csv')

# Visualize Overall Pixel Statistics to get Insights on Images

Number of unique values

**\train**

In [None]:
stats_cols = []
num_unique = []

for col in train_df:
#     print("* For attribute  '{}' , there are [ {} ] unique values.".format(col,
#                     len(train_meta_df[col].unique())))
    stats_cols.append(col)
    num_unique.append(len(train_df[col].unique()))
    
train_df_stats = pd.DataFrame(
    {'col_name': stats_cols,
     'value_count': num_unique,
     'nan_count': train_df.isna().sum()
    })

train_df_stats = train_df_stats.sort_values(by=['value_count'], ascending=False).reset_index(drop=True)
train_df_stats

**\test**

In [None]:
stats_cols = []
num_unique = []

for col in test_df:
#     print("* For attribute  '{}' , there are [ {} ] unique values.".format(col,
#                     len(train_meta_df[col].unique())))
    stats_cols.append(col)
    num_unique.append(len(test_df[col].unique()))
    
test_df_stats = pd.DataFrame(
    {'col_name': stats_cols,
     'value_count': num_unique,
     'nan_count': test_df.isna().sum()
    })

test_df_stats = test_df_stats.sort_values(by=['value_count'], ascending=False).reset_index(drop=True)
test_df_stats

Some Useful Histograms

**\train**

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(25, 8), sharey=True)
fig.suptitle('Train Dataset Pixel Distributions')

sns.histplot(ax=axes[0], data = train_df[['mean_px', 'std_px']], bins=50, alpha=0.5,)
axes[0].set_title("mean, std")
sns.histplot(ax=axes[1], data = train_df[['min_px', 'max_px']], bins=50, alpha=0.5,)
axes[1].set_title("min, max")
sns.histplot(ax=axes[2], data= train_df[['q1', 'q2', 'q3']], bins=50, alpha=0.5,)
axes[2].set_title("q1, q2, q3")
plt.show();

There are many empty images.

**\test**

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(25, 8), sharey=True)
fig.suptitle('Test Dataset Pixel Distributions')

sns.histplot(ax=axes[0], data = test_df[['mean_px', 'std_px']], bins=50, alpha=0.5,)
axes[0].set_title("mean, std")
sns.histplot(ax=axes[1], data = test_df[['min_px', 'max_px']], bins=50, alpha=0.5,)
axes[1].set_title("min, max")
sns.histplot(ax=axes[2], data= test_df[['q1', 'q2', 'q3']], bins=50, alpha=0.5,)
axes[2].set_title("q1, q2, q3")
plt.show();

The distributions are similar to those of the train dataset.

<a id="20"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>5. Filter Based on Threshold and Save PNGs</center></h2>

# Remove Paths to Images Based on Some Criteria
e.g. remove from the file-path DataFrame all images whose mean pixel value, np.mean(data), is below a threshold (50 in the cell below)

In [None]:
threshold = 50
truncated_train_df = train_df[train_df.mean_px >= threshold].reset_index(drop=True)
truncated_test_df = test_df[test_df.mean_px >= threshold].reset_index(drop=True)
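
To get a feel for how the choice of threshold trades dataset size against empty slices, the sketch below counts the surviving rows for several thresholds. The DataFrame is synthetic and `kept_rows` is a hypothetical helper; only the `mean_px` column name is taken from the notebook.

```python
import pandas as pd

# Synthetic stand-in for train_df: one row per image with its mean pixel value
demo_df = pd.DataFrame({
    "file": [f"Image-{i}.dcm" for i in range(1, 7)],
    "mean_px": [0.0, 3.5, 12.0, 49.9, 50.0, 87.2],
})

def kept_rows(df, threshold):
    """Number of images whose mean pixel value is at least `threshold`."""
    return int((df["mean_px"] >= threshold).sum())

for t in (0, 10, 50):
    print(f"threshold={t:>2}: kept {kept_rows(demo_df, t)} of {len(demo_df)} images")
```

Running the same loop on the real `train_df` before fixing `threshold` shows how many slices each candidate value would discard.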

# Create Directory Tree to Host PNGs

In [None]:
def create_png_dir_tree(PNG_MSTR_DIR, train_df, test_df):
    #delete old folder (if you run the code twice)
    shutil.rmtree(PNG_MSTR_DIR, ignore_errors=True)
    
    # create the main PNG folder with the test and train 
    png_test_path=PNG_MSTR_DIR + '/test/'
    png_train_path=PNG_MSTR_DIR + '/train/'
        
    os.makedirs(PNG_MSTR_DIR)
    os.makedirs(png_train_path)
    os.makedirs(png_test_path)
    print('\t\t\t DONE DIR TREE')

    # create one folder per patient and scan type
    for trfold in set(train_df.patient_id):
        os.mkdir(png_train_path+str(trfold).zfill(5))
        for mp in set(train_df.scan_type):
            os.mkdir(png_train_path+str(trfold).zfill(5)+'/'+str(mp))
    print('\t\t\t DONE CREATING TRAIN DIR')

    for trfold in set(test_df.patient_id): 
        os.mkdir(png_test_path+str(trfold).zfill(5))
        for mp in set(test_df.scan_type):
            os.mkdir(png_test_path+str(trfold).zfill(5)+'/'+str(mp))
    print('\t\t\t DONE CREATING TEST DIR')
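
A more compact variant is sketched below, assuming the same `<split>/<patient>/<scan_type>` layout: `os.makedirs` with `exist_ok=True` creates all intermediate folders in one call and does not fail if a folder already exists. `make_scan_dirs` is a hypothetical name, and the demo writes into a temporary directory rather than the Kaggle working directory.

```python
import os
import tempfile

def make_scan_dirs(root, split, patient_ids, scan_types):
    """Create root/split/<5-digit patient id>/<scan type> for every combination."""
    created = []
    for pid in sorted(set(patient_ids)):
        for scan in sorted(set(scan_types)):
            path = os.path.join(root, split, str(pid).zfill(5), str(scan))
            os.makedirs(path, exist_ok=True)   # also creates missing parents
            created.append(path)
    return created

with tempfile.TemporaryDirectory() as tmp:
    dirs = make_scan_dirs(tmp, "train", [0, 2, 2], ["FLAIR", "T1w"])
    rel = [os.path.relpath(d, tmp) for d in dirs]
    all_exist = all(os.path.isdir(d) for d in dirs)
print(rel, all_exist)
```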

# SAVE PNG IMAGES INTO WORKING DIR

❗❗❗ IF YOU WANT TO SAVE PNG FILES ABOVE THRESHOLD UNCOMMENT CELL BELOW ❗❗❗

In [None]:
# # Create directory
# PNG_ROOT_DIR = 'png_dataset_threshold_' + str(threshold)
# create_png_dir_tree(PNG_ROOT_DIR, truncated_train_df, truncated_test_df)

# print('\t\t\t Start Saving TRAIN Images')
# # Convert DICOM to PNG and Save to Disk
# truncated_train_df["file_paths"].parallel_apply(lambda x: convert_dicom_to_png(x, PNG_ROOT_DIR, (256, 256)));
# print('\t\t\t Finished Saving TRAIN Images')

# print('\t\t\t Start Saving TEST Images')
# truncated_test_df["file_paths"].parallel_apply(lambda x: convert_dicom_to_png(x, PNG_ROOT_DIR, (256, 256)));
# print('\t\t\t Finished Saving TEST Images')

# Visualize PNG images

❗❗❗ IF YOU WANT TO VISUALIZE n - RANDOM SAVED PNG FILES UNCOMMENT CELL BELOW ❗❗❗

In [None]:
# n = 5
# j = 0
# plt.figure(figsize=(18, 10))
# for i in random.sample(range(truncated_train_df.shape[0]), n):
#     j +=1
#     img = mpimg.imread(PNG_ROOT_DIR + '/train/' + 
#                        str(truncated_train_df.patient_id.iloc[i]).zfill(5) +"/"+
#                        str(truncated_train_df.scan_type.iloc[i]) +"/"+
#                        str(truncated_train_df.file.iloc[i].split(".")[0])+'.png') 
#     print(img.shape)
#     plt.subplot(1, n, j)
#     imgplot = plt.imshow(img)

# Conclusions
* Images have different sizes, so we used OpenCV (`cv2`) to resize them to the desired resolution, in this case 256 x 256.

* We might also want to convert all images to the same plane of reference, most likely the axial plane (most images are already axial).

<a id="30"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>6. Train / Val Splits</center></h2>

# Train / Val k-fold Cross Validation Splits

In [None]:
train_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/train_filepaths_rsna.csv')
test_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/test_filepaths_rsna.csv')

train_lbl_df = pd.read_csv('/kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv')

### [Exclude 3 problematic cases](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/262046)

**BraTSIDs**
* 109
* 123
* 709

In [None]:
train_lbl_df = train_lbl_df[(train_lbl_df.BraTS21ID != 109) & 
                    (train_lbl_df.BraTS21ID != 123) &
                    (train_lbl_df.BraTS21ID != 709)].reset_index(drop = True)
train_lbl_df

## 5-fold Stratified Cross Validation on Target Binary Values MGMT_value

In [None]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)

print('Class Ratio:',sum(train_lbl_df['MGMT_value'])/len(train_lbl_df['MGMT_value']))

target = train_lbl_df.loc[:,'MGMT_value']

fold_no = 1
train_fold_dict = {}
val_fold_dict = {}
for train_index, test_index in skf.split(train_lbl_df, target):
    train = train_lbl_df.loc[train_index,:]
    val = train_lbl_df.loc[test_index,:]
    train_fold_dict['train_fold_'+str(fold_no)] = train.set_index('BraTS21ID')['MGMT_value'].to_dict()
    val_fold_dict['val_fold_'+str(fold_no)] = val.set_index('BraTS21ID')['MGMT_value'].to_dict()
    print('Fold',str(fold_no),'Class Ratio:',sum(val['MGMT_value'])/len(val['MGMT_value']),
          ',\t len train, val, sum:',len(train), len(val), len(train)+len(val))
    fold_no += 1
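
Without `shuffle=True`, `StratifiedKFold` splits the rows in their original order, so the `seed` set earlier has no effect on the folds. The sketch below, on synthetic labels, shows the shuffled variant and checks that every patient lands in exactly one validation fold. The IDs and label values are made up; `random_state=1234` mirrors the `seed` defined at the top.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 20 patients, 10 per class
labels = pd.DataFrame({
    "BraTS21ID": np.arange(20),
    "MGMT_value": [0, 1] * 10,
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

val_ids_per_fold = []
for fold_no, (train_idx, val_idx) in enumerate(
        skf.split(labels, labels["MGMT_value"]), start=1):
    val = labels.iloc[val_idx]
    val_ids_per_fold.append(set(val["BraTS21ID"]))
    print(f"Fold {fold_no}: val size {len(val)}, "
          f"class ratio {val['MGMT_value'].mean():.2f}")

# Every patient should appear in exactly one validation fold
all_val_ids = set().union(*val_ids_per_fold)
total_val = sum(len(s) for s in val_ids_per_fold)
```

Because the split is done on the per-patient label table, all slices of a patient end up on the same side of each fold.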

**train/ folds**

In [None]:
train_df_1 = train_df.copy()
train_fold_df_1 = train_fold_dict['train_fold_1']
train_df_1['MGMT_value'] = train_df_1['patient_id'].map(train_fold_df_1)
train_df_1 = train_df_1.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

train_df_2 = train_df.copy()
train_fold_df_2 = train_fold_dict['train_fold_2']
train_df_2['MGMT_value'] = train_df_2['patient_id'].map(train_fold_df_2)
train_df_2 = train_df_2.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

train_df_3 = train_df.copy()
train_fold_df_3 = train_fold_dict['train_fold_3']
train_df_3['MGMT_value'] = train_df_3['patient_id'].map(train_fold_df_3)
train_df_3 = train_df_3.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

train_df_4 = train_df.copy()
train_fold_df_4 = train_fold_dict['train_fold_4']
train_df_4['MGMT_value'] = train_df_4['patient_id'].map(train_fold_df_4)
train_df_4 = train_df_4.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

train_df_5 = train_df.copy()
train_fold_df_5 = train_fold_dict['train_fold_5']
train_df_5['MGMT_value'] = train_df_5['patient_id'].map(train_fold_df_5)
train_df_5 = train_df_5.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

**val/ folds**

In [None]:
val_df_1 = train_df.copy()
val_fold_df_1 = val_fold_dict['val_fold_1']
val_df_1['MGMT_value'] = val_df_1['patient_id'].map(val_fold_df_1)
val_df_1 = val_df_1.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

val_df_2 = train_df.copy()
val_fold_df_2 = val_fold_dict['val_fold_2']
val_df_2['MGMT_value'] = val_df_2['patient_id'].map(val_fold_df_2)
val_df_2 = val_df_2.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

val_df_3 = train_df.copy()
val_fold_df_3 = val_fold_dict['val_fold_3']
val_df_3['MGMT_value'] = val_df_3['patient_id'].map(val_fold_df_3)
val_df_3 = val_df_3.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

val_df_4 = train_df.copy()
val_fold_df_4 = val_fold_dict['val_fold_4']
val_df_4['MGMT_value'] = val_df_4['patient_id'].map(val_fold_df_4)
val_df_4 = val_df_4.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

val_df_5 = train_df.copy()
val_fold_df_5 = val_fold_dict['val_fold_5']
val_df_5['MGMT_value'] = val_df_5['patient_id'].map(val_fold_df_5)
val_df_5 = val_df_5.dropna(axis=0).groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

**Check sizes make sense**

In [None]:
print('Train, ', 'Val, ', 'Sum')
for tr_df, va_df in [(train_df_1, val_df_1), (train_df_2, val_df_2), (train_df_3, val_df_3),
                     (train_df_4, val_df_4), (train_df_5, val_df_5)]:
    n_train, n_val = tr_df.patient_id.nunique(), va_df.patient_id.nunique()
    print(n_train, n_val, n_train + n_val)
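
The ten near-identical per-fold blocks above could also be produced in a loop. The sketch below does this on a tiny synthetic file table; `series_table` is a hypothetical helper that filters with `isin` rather than `map` plus `dropna`, and the fold dictionaries are written by hand instead of coming from `StratifiedKFold`.

```python
import pandas as pd

# Synthetic stand-in for train_df: one row per DICOM file
files = pd.DataFrame({
    "patient_id": [0, 0, 0, 2, 2, 3, 3, 3],
    "directory": ["train/"] * 8,
    "scan_type": ["FLAIR", "FLAIR", "T1w", "FLAIR", "T1w", "T1w", "T1w", "T2w"],
})

# Hand-written fold dictionaries: {patient_id: MGMT_value}
train_fold_dict = {"train_fold_1": {0: 1, 2: 0}, "train_fold_2": {2: 0, 3: 1}}
val_fold_dict = {"val_fold_1": {3: 1}, "val_fold_2": {0: 1}}

def series_table(files_df, fold_labels):
    """Keep the patients of one fold and count images per (patient, scan type)."""
    in_fold = files_df[files_df["patient_id"].isin(list(fold_labels))]
    return (in_fold.groupby(["patient_id", "directory", "scan_type"])
                   .size().reset_index(name="count"))

fold_tables = {}
for k in (1, 2):
    tr = series_table(files, train_fold_dict[f"train_fold_{k}"])
    va = series_table(files, val_fold_dict[f"val_fold_{k}"])
    # A patient must never be in both the train and the val part of a fold
    assert set(tr["patient_id"]).isdisjoint(va["patient_id"])
    fold_tables[k] = (tr, va)
    print(k, tr["patient_id"].nunique(), va["patient_id"].nunique())
```

Keeping the tables in a dictionary keyed by fold number avoids the numbered variable names and makes it easy to change the number of folds.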

<a id="40"></a>
<h2 style='background:darkviolet; border:0; color:white'><center>7. Convert to Voxel Space of Choice</center></h2>

Based on this fantastic notebook: [Connecting voxel spaces](https://www.kaggle.com/boojum/connecting-voxel-spaces)

# Steps
* As we saw in the [🧠 Advanced EDA - Brain Tumor Data 🧠](https://www.kaggle.com/smoschou55/advanced-eda-brain-tumor-data), for each patient all images of a given modality share the same plane of reference, e.g. all FLAIR coronal, all T1w and T1wCE axial, and all T2w sagittal.

* Thus, we just need to 
    1. Determine the orientation of each modality per patient (this requires the summary table from the EDA).
    2. Choose a reference example modality in terms of orientation and resolution.
    3. Use the SimpleITK Python package to convert each 3D volume (the collection of all images of one modality for one patient) into the orientation and resolution of the reference 3D volume (e.g. an axial T1w volume of shape (256, 256, 32)).
    4. Finally, save the 2D slices in DICOM or PNG format.
    5. OPTIONALLY: Determine which **IMAGES** to remove (np.mean(data) < threshold) and store only the informative ones, i.e. save each slice of the 3D volume (image sequence) as PNG **IF and ONLY IF** np.mean(data) >= threshold.

## 1. Determine the orientation of each modality per patient

In [None]:
#  Read Metadata train and test dfs
train_meta_df = pd.read_csv('/kaggle/input/stage0-metadata-rsna/stage_0_train_with_metadata.csv')
test_meta_df = pd.read_csv('/kaggle/input/stage0-metadata-rsna/stage_0_test_with_metadata.csv')

def get_image_plane(data):
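    '''
    Returns the MRI plane ('coronal', 'axial' or 'sagittal') from the rounded
    ImageOrientationPatient values of a metadata row, or None for any other orientation.
    '''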
    x1,y1,_,x2,y2,_ = [round(j) for j in ast.literal_eval(data.ImageOrientationPatient)]
    cords = [x1,y1,x2,y2]

    if cords == [1,0,0,0]:
        return 'coronal'
    if cords == [1,0,0,1]:
        return 'axial'
    if cords == [0,1,0,0]:
        return 'sagittal'

train_meta_df['Orientation'] = train_meta_df.apply(get_image_plane, axis=1)

test_meta_df['Orientation'] = test_meta_df.apply(get_image_plane, axis=1)
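
An alternative way to classify the plane, sketched below, takes the cross product of the row and column direction cosines stored in `ImageOrientationPatient` and looks at the dominant axis of the resulting slice normal. This is not the method used above; `plane_from_iop` is a hypothetical helper. Unlike exact matching of rounded values, it also assigns a plane to oblique acquisitions.

```python
import numpy as np

def plane_from_iop(iop):
    """Classify a slice from its six ImageOrientationPatient direction cosines."""
    row = np.asarray(iop[:3], dtype=float)
    col = np.asarray(iop[3:], dtype=float)
    normal = np.cross(row, col)            # perpendicular to the slice
    dominant = int(np.argmax(np.abs(normal)))
    # Patient axes: x = left-right, y = anterior-posterior, z = head-foot
    return ("sagittal", "coronal", "axial")[dominant]

print(plane_from_iop([1, 0, 0, 0, 1, 0]))     # normal along z
print(plane_from_iop([1, 0, 0, 0, 0, -1]))    # normal along y
print(plane_from_iop([0, 1, 0, 0, 0, -1]))    # normal along x
```

On a metadata row as used above, the input would be `ast.literal_eval(row.ImageOrientationPatient)`.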

In [None]:
dftr = train_meta_df.groupby(['PatientID', 'Orientation', 'SeriesDescription']).size().reset_index(name='count') 
dftr

NOTE: The 3 patients with problematic datasets have not been removed yet, hence 585 * 4 = 2340 rows.

## 2. Choose a reference example modality in terms of orientation and resolution

In [None]:
dftr2 = train_meta_df[(train_meta_df.Rows == 256) & 
                      (train_meta_df.Columns == 256) &
                      (train_meta_df.Orientation == "axial") &
                      (train_meta_df.SeriesDescription == "T1w")].groupby(['PatientID', 'Orientation', 'SeriesDescription']).size().reset_index(name='count') 
dftr2.loc[(dftr2['count'] < 50) & (dftr2['count'] >15)].reset_index(drop = True)

In [None]:
dftr[dftr.PatientID == 143]

### In Conclusion:
* There are 20 examples (20 patients) with axial T1w 3D volumes of resolution 256 x 256 and between 15 and 50 slices.

* We chose PatientID = 143, with a 256 x 256 T1w volume of 36 slices, as the reference 3D volume.

* Next, we will convert all other modalities for all patients into the same voxel space as the 3D volume of PatientID = 143. 

* For that, we need a DataFrame with all the modalities and their associated file paths (about 2328 = 582 * 4).

### DataFrame of all modalities with their associated file-paths

In [None]:
train_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/train_filepaths_rsna.csv')
test_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/test_filepaths_rsna.csv')

dir_train_df = train_df.groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')
dir_test_df = test_df.groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

In [None]:
display(dir_train_df.head(2))
display(dir_test_df.head(2))

In [None]:
dir_train_df[dir_train_df.patient_id == 143]

The reference series sits at row index 369:

In [None]:
dir_train_df.iloc[369]

## 3 + 4. Convert to Voxel Space of Choice with Simple ITK and Save

### Helper Functions to Resample, Normalize and Convert to Voxel Space of Interest

In [None]:
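def resample(image, ref_image):
    """Resample `image` onto the voxel grid of `ref_image` with linear interpolation."""
    resampler = sitk.ResampleImageFilter()
    # Copies size, spacing, origin and direction from the reference image
    resampler.SetReferenceImage(ref_image)
    resampler.SetInterpolator(sitk.sitkLinear)
    # A default-constructed AffineTransform is the identity
    resampler.SetTransform(sitk.AffineTransform(image.GetDimension()))

    # Explicit output grid (redundant after SetReferenceImage, kept for clarity)
    resampler.SetOutputSpacing(ref_image.GetSpacing())
    resampler.SetSize(ref_image.GetSize())
    resampler.SetOutputDirection(ref_image.GetDirection())
    resampler.SetOutputOrigin(ref_image.GetOrigin())

    # NOTE: GetPixelIDValue() returns the pixel *type* ID, not an intensity, so
    # voxels outside the input volume are filled with that small integer
    resampler.SetDefaultPixelValue(image.GetPixelIDValue())

    resampled_image = resampler.Execute(image)
    return resampled_image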

In [None]:
def normalize_png(data):
    '''Input: block of 2D images forming a 3D volume of a full brain; min-max scaled to uint8 over the whole volume'''
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data

### Helper Functions to Create Directory Tree and Save Converted Voxel Space

In [None]:
def create_train_val_dir(PNG_MSTR_DIR, train_df, val_df, fold = None):
    if fold:
        # create the main PNG folder with the val and train 
        png_val_path=PNG_MSTR_DIR + '/val_'+str(fold)+'/'
        png_train_path=PNG_MSTR_DIR + '/train_'+str(fold)+'/'
    else:
        png_val_path=PNG_MSTR_DIR + '/val/'
        png_train_path=PNG_MSTR_DIR + '/train/'
        
    os.makedirs(png_train_path)
    os.makedirs(png_val_path)
    print('\t\t\t DONE DIR TREE')

    # create one folder per patient and scan type
    for trfold in set(train_df.patient_id):
        os.mkdir(png_train_path+str(trfold).zfill(5))
        for mp in set(train_df.scan_type):
            os.mkdir(png_train_path+str(trfold).zfill(5)+'/'+str(mp))
    print('\t\t\t DONE CREATING TRAIN DIR')

    for trfold in set(val_df.patient_id): 
        os.mkdir(png_val_path+str(trfold).zfill(5))
        for mp in set(val_df.scan_type):
            os.mkdir(png_val_path+str(trfold).zfill(5)+'/'+str(mp))
    print('\t\t\t DONE CREATING VAL DIR')
    
    
def create_test_dir(PNG_MSTR_DIR, test_df):
    png_test_path=PNG_MSTR_DIR + '/test/'

    os.makedirs(png_test_path)
    print('\t\t\t DONE DIR TREE')

    # create one folder per patient and scan type
    for trfold in set(test_df.patient_id):
        os.mkdir(png_test_path+str(trfold).zfill(5))
        for mp in set(test_df.scan_type):
            os.mkdir(png_test_path+str(trfold).zfill(5)+'/'+str(mp))
    print('\t\t\t DONE CREATING TEST DIR')
    
def convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df, val_df, fold = None, 
                                    is_test=True, save_cv_test = False, save_train_val = True, 
                                    test_df = None, threshold = -1.0):
#     ref_dir = str(dir_train_df.directory.iloc[369])+ \
#                 str(dir_train_df.patient_id.iloc[369]).zfill(5)+ \
#                 '/'+str(dir_train_df.scan_type.iloc[369])
    print('Reference Dir:')
    print(f'{ref_dir}')

    reader = sitk.ImageSeriesReader()
    reader.LoadPrivateTagsOn()

    filenamesDICOM = reader.GetGDCMSeriesFileNames(f'{ref_dir}')
    reader.SetFileNames(filenamesDICOM)
    ref_sitk = reader.Execute()

    if save_train_val:
        if is_test:
            print('\t STARTED TRAIN CONVERSION and SAVING TO DISK')
        else:
            print('\t STARTED TRAIN_'+str(fold)+' CONVERSION and SAVING TO DISK')
        for i in tqdm(range(len(train_df))):
            scan_dir = str(train_df.directory.iloc[i])+ \
                        str(train_df.patient_id.iloc[i]).zfill(5)+ \
                        '/'+str(train_df.scan_type.iloc[i])

            if is_test:
                output_dir = png_out_dir+'/train/'+ \
                        str(train_df.patient_id.iloc[i]).zfill(5)+ \
                        '/'+str(train_df.scan_type.iloc[i])
            else:
                output_dir = png_out_dir+'/train_'+str(fold)+'/'+ \
                            str(train_df.patient_id.iloc[i]).zfill(5)+ \
                            '/'+str(train_df.scan_type.iloc[i])

            filenamesDICOM = reader.GetGDCMSeriesFileNames(f'{scan_dir}')
            reader.SetFileNames(filenamesDICOM)
            scan_sitk = reader.Execute()

            scan_resampled = resample(scan_sitk, ref_sitk)
            scan_sitk_array = normalize_png(sitk.GetArrayFromImage(scan_resampled))

            for j in range(scan_sitk_array.shape[0]):
                # Save the slice as PNG only if its mean pixel value exceeds the threshold
                if np.mean(scan_sitk_array[j,:,:]) > threshold:
                    imageio.imsave(output_dir+'/Image-'+str(j)+'.png', scan_sitk_array[j,:,:])
        if is_test:
            print('\t FINISHED TRAIN CONVERSION and SAVING TO DISK')
        else:
            print('\t FINISHED TRAIN_'+str(fold)+' CONVERSION and SAVING TO DISK')

        if is_test:
            print('\t STARTED TEST CONVERSION and SAVING TO DISK')
        else:
            print('\t STARTED VAL_'+str(fold)+' CONVERSION and SAVING TO DISK')
        for i in tqdm(range(len(val_df))):
            scan_dir = str(val_df.directory.iloc[i])+ \
                        str(val_df.patient_id.iloc[i]).zfill(5)+ \
                    '/'+str(val_df.scan_type.iloc[i])

            if is_test:
                output_dir = png_out_dir+'/test/'+ \
                        str(val_df.patient_id.iloc[i]).zfill(5)+ \
                        '/'+str(val_df.scan_type.iloc[i])
            else:
                output_dir = png_out_dir+'/val_'+str(fold)+'/'+ \
                            str(val_df.patient_id.iloc[i]).zfill(5)+ \
                            '/'+str(val_df.scan_type.iloc[i])

            filenamesDICOM = reader.GetGDCMSeriesFileNames(f'{scan_dir}')
            reader.SetFileNames(filenamesDICOM)
            scan_sitk = reader.Execute()

            scan_resampled = resample(scan_sitk, ref_sitk)
            scan_sitk_array = normalize_png(sitk.GetArrayFromImage(scan_resampled))

            for j in range(len(scan_sitk_array[:,0,0])):
                # Skip near-empty slices: save the PNG only if the mean pixel value exceeds the threshold
                if np.mean(scan_sitk_array[j,:,:]) <= threshold:
                    pass
                else:
                    imageio.imsave(output_dir+'/Image-'+str(j)+'.png', scan_sitk_array[j,:,:])
        if is_test:
            print('\t FINISHED TEST CONVERSION and SAVING TO DISK')
        else:
            print('\t FINISHED VAL_'+str(fold)+' CONVERSION and SAVING TO DISK')
    else:
        print('\t ONLY SAVE TEST - SKIP TRAIN / VAL')
        
    if save_cv_test:
        print('\t STARTED TEST CONVERSION and SAVING TO DISK')
        for i in tqdm(range(len(test_df))):
            scan_dir = str(test_df.directory.iloc[i])+ \
                        str(test_df.patient_id.iloc[i]).zfill(5)+ \
                    '/'+str(test_df.scan_type.iloc[i])

            output_dir = png_out_dir+'/test/'+ \
                        str(test_df.patient_id.iloc[i]).zfill(5)+ \
                        '/'+str(test_df.scan_type.iloc[i])

            filenamesDICOM = reader.GetGDCMSeriesFileNames(f'{scan_dir}')
            reader.SetFileNames(filenamesDICOM)
            scan_sitk = reader.Execute()

            scan_resampled = resample(scan_sitk, ref_sitk)
            scan_sitk_array = normalize_png(sitk.GetArrayFromImage(scan_resampled))

            for j in range(len(scan_sitk_array[:,0,0])):
                # SAVE PNG TO DISK     
                imageio.imsave(output_dir+'/Image-'+str(j)+'.png', scan_sitk_array[j,:,:])
        print('\t FINISHED TEST CONVERSION and SAVING TO DISK')

**Create k-fold Train / Val Directories, Convert and Save PNGs**

❗❗❗ SAVE A FEW EXAMPLES TO DEMONSTRATE ❗❗❗ 

In [None]:
create_k_fold_cv_ds = True
create_test_ds = False
fold = 1

if create_k_fold_cv_ds:
    # Load summary dataframes
    train_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/train_filepaths_rsna.csv')
    test_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/test_filepaths_rsna.csv')

    dir_train_df = train_df.groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')
    dir_test_df = test_df.groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

    # Create k-fold Train / Val Directories
    png_out_dir = os.path.join('/kaggle/working', 'png_'+str(fold)+'_outof_5_fold_cv')
    if os.path.exists(png_out_dir) and os.path.isdir(png_out_dir):
        shutil.rmtree(png_out_dir)
    os.makedirs(png_out_dir)

    create_train_val_dir(png_out_dir, train_df_1, val_df_1, fold = fold)
#     create_train_val_dir(png_out_dir, train_df_2, val_df_2, fold = fold)
#     create_train_val_dir(png_out_dir, train_df_3, val_df_3, fold = fold)
#     create_train_val_dir(png_out_dir, train_df_4, val_df_4, fold = fold)
#     create_train_val_dir(png_out_dir, train_df_5, val_df_5, fold = fold)

    # Choose Reference 3D Volume (Collection of Images)
    ref_dir = str(dir_train_df.directory.iloc[369])+ \
                    str(dir_train_df.patient_id.iloc[369]).zfill(5)+ \
                    '/'+str(dir_train_df.scan_type.iloc[369])
    
    if create_test_ds:
        # Create Test Directory
#         test_png_out_dir = os.path.join('/kaggle/working', 'png_test_axial_256x256x36')
#         if os.path.exists(test_png_out_dir) and os.path.isdir(test_png_out_dir):
#             shutil.rmtree(test_png_out_dir)
#         os.makedirs(test_png_out_dir)
#         png_out_dir = test_png_out_dir
        create_test_dir(png_out_dir, dir_test_df)
        # test_png_out_dir is only defined in the commented-out block above, so write to png_out_dir
        convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df_1, val_df_1, fold = fold, 
                is_test=False, save_cv_test = True, save_train_val = False, test_df = dir_test_df[:8])

    # Convert to Voxel Space of Choice
    convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df_1[:8], val_df_1[:8], fold = fold, 
                                    is_test=False, threshold = 0.0)
#     convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df_2, val_df_2, fold = fold, 
#                                     is_test=False)
#     convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df_3, val_df_3, fold = fold, 
#                                     is_test=False)
#     convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df_4, val_df_4, fold = fold, 
#                                     is_test=False)
#     convert_train_val_2_voxel_space(ref_dir, png_out_dir, train_df_5, val_df_5, fold = fold, 
#                                     is_test=False)

### Check that empty slices were removed

In [None]:
!ls -alt ./png_1_outof_5_fold_cv/train_1/00183/FLAIR

Indeed, with threshold = 0, Images 0, 1, 2 and 32, 33, 34, 35 were removed.
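
The same check can be done programmatically by comparing the slice indices found on disk with the expected range. The sketch below assumes the `Image-<j>.png` naming used by the conversion function. The helper name `missing_slice_indices` and the default of 36 slices per volume are illustrative assumptions, not part of the original notebook.

```python
import os
import re

def missing_slice_indices(slice_dir, n_slices=36):
    """Return the sorted slice indices in range(n_slices) that have no Image-<j>.png in slice_dir."""
    pattern = re.compile(r'^Image-(\d+)\.png$')
    found = set()
    for name in os.listdir(slice_dir):
        match = pattern.match(name)
        if match:  # ignore files that do not follow the Image-<j>.png naming
            found.add(int(match.group(1)))
    return sorted(set(range(n_slices)) - found)
```

Calling `missing_slice_indices('./png_1_outof_5_fold_cv/train_1/00183/FLAIR')` should list the removed indices, provided the resampled volume really has 36 slices.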

**Create Train / Test Directories, Convert and Save PNGs**

❗❗❗ SAVE A FEW EXAMPLES TO DEMONSTRATE ❗❗❗ 

In [None]:
create_train_test_ds = False

if create_train_test_ds:
    # Load summary dataframes
    train_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/train_filepaths_rsna.csv')
    test_df = pd.read_csv('/kaggle/input/train-test-filepaths-rsna-full/test_filepaths_rsna.csv')

    dir_train_df = train_df.groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')
    dir_test_df = test_df.groupby(['patient_id', 'directory', 'scan_type']).size().reset_index(name='count')

    # Create Train / Test Dirs
    png_out_dir = os.path.join('/kaggle/working', 'png_voxel_converted_ds')
    if os.path.exists(png_out_dir) and os.path.isdir(png_out_dir):
        shutil.rmtree(png_out_dir)
    os.makedirs(png_out_dir)
    
    create_png_dir_tree(png_out_dir, dir_train_df, dir_test_df)

    # Choose Reference 3D Volume (Collection of Images)
    ref_dir = str(dir_train_df.directory.iloc[369])+ \
                    str(dir_train_df.patient_id.iloc[369]).zfill(5)+ \
                    '/'+str(dir_train_df.scan_type.iloc[369])

    # Convert to Voxel Space of Choice and Save
    convert_train_val_2_voxel_space(ref_dir, png_out_dir, dir_train_df[:8], dir_test_df[:8], 
                                    fold = None, is_test=True)

In [None]:
'''Check out the writing to disk process.'''
# !ls /kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00143/T1w/ | wc -l
# !ls /kaggle/working/png_voxel_converted_ds/train/00000/FLAIR

**Visualize slices of Patient ID 0, whose scans come in different orientations (T1w and T1wCE: Axial, FLAIR: Coronal, T2w: Sagittal), so it is easy to see whether the conversion to Axial worked.**

In [None]:
if create_train_test_ds:
    # Show slice 18 of each scan type for patient 00000 side by side
    plt.figure(figsize=(18, 10))
    for k, scan_type in enumerate(['T2w', 'FLAIR', 'T1wCE', 'T1w'], start=1):
        img = mpimg.imread(f'/kaggle/working/png_voxel_converted_ds/train/00000/{scan_type}/Image-18.png')
        print(img.shape)
        plt.subplot(1, 4, k)
        imgplot = plt.imshow(img)

#### 5. You can remove images with mostly empty pixel values (see Sec 5: Filter Based on Threshold and Save PNGs)
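
A minimal sketch of that filter, mirroring the `np.mean(slice) <= threshold` test used during conversion. The function name `keep_slice_indices` and the toy volume are assumptions for illustration. A sensible threshold depends on how `normalize_png` scales the pixel values.

```python
import numpy as np

def keep_slice_indices(volume, threshold=0.0):
    """Return indices of slices whose mean pixel value exceeds threshold.

    volume is a (slices, height, width) array, the layout returned by
    sitk.GetArrayFromImage.
    """
    # One mean per slice: flatten each 2D slice, then average over its pixels
    slice_means = volume.reshape(volume.shape[0], -1).mean(axis=1)
    return np.flatnonzero(slice_means > threshold).tolist()

# Toy volume (assumed data): 4 slices of 2x2 pixels; slices 0 and 3 are empty
toy = np.zeros((4, 2, 2), dtype=np.uint8)
toy[1] = 10        # slice mean 10.0
toy[2, 0, 0] = 4   # slice mean 1.0
print(keep_slice_indices(toy))       # [1, 2]
print(keep_slice_indices(toy, 1.0))  # [1]
```

With `threshold = 0.0` only completely black slices are dropped. A higher threshold also drops slices that contain just a few bright pixels.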

# Trick to be able to create Dataset from PNG image collections

Save any type of CSV or text file.

In [None]:
!echo "This dataset contains PNG files in AXIAL orientation for all patients" > README.txt