![](https://i.imgur.com/5xUFyIP.png)

# Create Voxel Dataset with KaggleRecipes
Hey fellow Kagglers, I have been working on a GitHub library called **[KaggleRecipes](https://github.com/ayulockin/kagglerecipes)** along with [Morgan](https://www.kaggle.com/morganmcg), my colleague from Weights and Biases, these past few days. The idea for this repository is to package the necessary utilities and provide baseline training and inference for the competitions we are participating in. The repository is also instrumented with Weights and Biases and comes with convenient utilities to visualise your datasets and evaluate your models efficiently with W&B.

This is an **early release** and we are working to add new functionality and features that abstract away a lot of boilerplate code for your Kaggle competitions. You can find the full documentation **[here](https://ayulockin.github.io/kagglerecipes/)**.

# Created Dataset:

The following dataset was created with KaggleRecipes, using a single BraTS21ID as the reference sequence.

* [RSNA MICCAI Voxel BraTS21ID 143](https://www.kaggle.com/ayuraj/rsna-miccai-voxel-brats21id-143) - Uses BraTS21ID 143 as the reference sequence. 


## Multiprocessing with `mpire`
MPIRE is a recently released multiprocessing library that abstracts away a lot of the multiprocessing boilerplate **and** claims to be faster than `multiprocessing.Pool` and `concurrent.futures.ProcessPoolExecutor` and on par with Ray. This [blog post](https://towardsdatascience.com/mpire-for-python-multiprocessing-is-really-easy-d2ae7999a3e9) introduces it, and you can find the [mpire GitHub repo here](https://github.com/Slimmer-AI/mpire/issues/11).

We've used it in `kagglerecipes` as it's really enjoyable to work with (Morgan's favorite feature is the simple flag to turn on a tqdm progress bar). We used it to cut down the DICOM metadata extraction step in this notebook **from 44 minutes to 12 minutes**.
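To give a feel for the API, here is a minimal sketch of an `mpire` worker pool with the progress bar flag. `WorkerPool`, `map` and `progress_bar` come from mpire itself; the helper `run_parallel` and the toy `square` function are made up for illustration, and the serial fallback is only there so the snippet still runs where mpire is not installed.

```python
def square(x):
    """Toy workload standing in for something like reading one DICOM header."""
    return x * x


def run_parallel(func, items, n_jobs=2, progress_bar=False):
    """Map func over items with mpire if available, otherwise serially."""
    try:
        from mpire import WorkerPool
    except ImportError:
        # Assumption for this sketch: fall back to a plain loop without mpire.
        return [func(item) for item in items]

    with WorkerPool(n_jobs=n_jobs) as pool:
        # progress_bar=True turns on the tqdm progress bar mentioned above.
        return pool.map(func, items, progress_bar=progress_bar)


if __name__ == "__main__":
    print(run_parallel(square, range(8), n_jobs=2, progress_bar=True))
```

The results come back in input order, so the call can replace a plain list comprehension.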

## Credits
This kernel is possible because of the awesome work done by these fellow Kagglers:
* [Connecting voxel spaces](https://www.kaggle.com/boojum/connecting-voxel-spaces) by [Michael Beregov](https://www.kaggle.com/boojum)
* [Normalized Voxels: Align Planes and Crop](https://www.kaggle.com/ren4yu/normalized-voxels-align-planes-and-crop) by [yu4u](https://www.kaggle.com/ren4yu)
* [🧠 DICOM to 2D Resized Axial PNGs 256x256 [x36] 🧠](https://www.kaggle.com/smoschou55/dicom-to-2d-resized-axial-pngs-256x256-x36) by [Sofia Moschou](https://www.kaggle.com/smoschou55)

## Previous Work
🧐 For a deeper EDA looking at individual MRI slices see the [kernel here](https://www.kaggle.com/ayuraj/brain-tumor-eda-and-interactive-viz-with-w-b). <br>
🧐 You can use this voxel manipulated dataset with this [[Train] Brain Tumor as Video Classification + W&B](https://www.kaggle.com/ayuraj/train-brain-tumor-as-video-classification-w-b) kernel. 


# Imports and Setup

In [None]:
!pip install -q kagglerecipes

In [None]:
import os
import ast
import mpire
import wandb
import timeit
import imageio
import numpy as np
import pandas as pd
from tqdm import tqdm
from pathlib import Path
import matplotlib.pyplot as plt

# kagglerecipes based imports
from kagglerecipes.preprocess import VoxelData
from kagglerecipes.utils import (
    get_patient_id,
    get_all_BraTS21_dicom_meta,
    get_patient_BraTS21ID_path,
    get_image_plane,
    KAGGLE_BRAINTUMOR_META_COLS
)
from kagglerecipes.wandb_utils import log_to_artifacts

In [None]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    secret_value_0 = user_secrets.get_secret("wandb_api")
    wandb.login(key=secret_value_0)
    anony = None
except Exception:
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as wandb_api. \nGet your W&B access token from here: https://wandb.ai/authorize')

In [None]:
# This is an optional config used to show that you can pass a dictionary to log your hyperparameters and other required info.
CONFIG = {'competition': 'rsna-miccai-brain', '_wandb_kernel': 'kr_voxel', 'use_wandb': True}

# Prepare dataset

In [None]:
DATA_PATH = Path('../input/rsna-miccai-brain-tumor-radiogenomic-classification/')
TRAIN_PATH = DATA_PATH / 'train/'
SCAN_TYPES = ['FLAIR', 'T1w', 'T1wCE', 'T2w']

Load the `train_labels.csv` dataset. 

Log the dataset as a W&B Artifact. This enables data version control, which matters here because there are many possible datasets you can create for this competition. Data version control goes a long way towards reproducible results: it keeps things tidy when a lot is going on, makes sharing with your team easy and, most importantly, gives you a bird's-eye view of everything. This [discussion post may be worth a read](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/229586).

📌 You can easily log any file or directory using our `log_to_artifacts` function. Check out the [documentation](https://ayulockin.github.io/kagglerecipes/wandb_utils.html#log_to_artifacts).
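For a rough idea of what logging a file as an artifact involves, here is a minimal sketch that uses the public `wandb` API directly. It is not the implementation of `log_to_artifacts`; the helper name `log_file_as_artifact` and its arguments are made up for illustration.

```python
def log_file_as_artifact(path, artifact_name, artifact_type, project, anonymous=None):
    """Log a single file as a versioned W&B artifact (hypothetical helper)."""
    import wandb  # imported lazily so the helper can be defined without wandb installed

    run = wandb.init(project=project, job_type="log-dataset", anonymous=anonymous)
    artifact = wandb.Artifact(name=artifact_name, type=artifact_type)
    artifact.add_file(str(path))  # artifact.add_dir(...) would log a whole directory
    run.log_artifact(artifact)    # a changed file logged under the same name becomes a new version
    run.finish()
```

Calling it with the path to `train_labels.csv`, an artifact name and a type would give a result comparable to the cell below.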

In [None]:
# Load as dataframe.
train_df = pd.read_csv(DATA_PATH / 'train_labels.csv')
# Excluding three cases
# Refer: https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/262046
train_df = train_df[train_df.BraTS21ID != 109]
train_df = train_df[train_df.BraTS21ID != 123]
train_df = train_df[train_df.BraTS21ID != 709]

print(f'Number of BraTS21ID: {len(train_df)}')

# Save the file as W&B artifacts. 
run = wandb.init(project='rsna-miccai-brain-tumor', 
                 config=CONFIG,  # CONFIG is optional here
                 job_type='log-dataset-labels',
                 anonymous=anony) 

log_to_artifacts(path_to_data=DATA_PATH/'train_labels.csv', # 📌
                artifact_name='raw_rsna_miccai_labels', 
                artifact_type='labels',
                log='file')
wandb.finish()

Add the path to the patient folders to `train_labels.csv`

📌 Use `get_patient_BraTS21ID_path` to easily get the correct path. Check out the [documentation](https://ayulockin.github.io/kagglerecipes/utils.html#get_patient_BraTS21ID_path).

In [None]:
# add path to the df
train_df['path'] = train_df.apply(lambda row: get_patient_BraTS21ID_path(row, TRAIN_PATH), axis=1) # 📌

train_df.head(2)

⭐ For a quick demonstration you can work with only 100 randomly sampled patient IDs by uncommenting the line in the cell below. As written, the line is commented out and the notebook runs on the entire dataset, which takes around 90 minutes in a Kaggle kernel.

In [None]:
# train_df = train_df.sample(100).reset_index(drop=True)

# Data Preprocessing

### Number of Cores to Use - Check Your System Specs
We can use multiprocessing to handle some of our data processing. To get the best out of multiprocessing, let's check how many cores are available to use.

In [None]:
mpire.cpu_count()
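If you prefer to derive the number of processes from the machine instead of hard-coding it, a small helper like the one below can do it. This is a sketch using only the standard library; `pick_n_jobs` and its `reserve` argument are hypothetical names, and whether capping at the core count actually helps depends on whether the work is I/O-bound or CPU-bound.

```python
import os


def pick_n_jobs(requested=10, reserve=0):
    """Return the requested number of processes, capped at the available cores."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available - reserve))


n_jobs = pick_n_jobs(requested=10)
print(f"Using {n_jobs} processes")
```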

If you're curious about the specs of the machine you're using, [Tim Yee](https://www.kaggle.com/teeyee314) has a nice kernel [here](https://www.kaggle.com/teeyee314/cpu-kernel-specs) that gives you system info.

In [None]:
# GPU count and name
!nvidia-smi -L

# CPU model name
!lscpu | grep 'Model name'

# Number of sockets, i.e. available slots for physical processors
!lscpu | grep 'Socket(s):'

# Number of cores per processor
!lscpu | grep 'Core(s) per socket'

# Number of threads per core
!lscpu | grep 'Thread(s) per core'

The code cell below performs the following steps:

📌 It uses multiprocessing to quickly extract the associated metadata using our `get_all_BraTS21_dicom_meta` function. Check out the [documentation](https://ayulockin.github.io/kagglerecipes/utils.html#get_all_BraTS21_dicom_meta). <br>
📌 It then extracts the orientation of each DICOM slice using our `get_image_plane` function. Check out the [documentation](https://ayulockin.github.io/kagglerecipes/utils.html#get_image_plane).
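To make the orientation step less of a black box, here is one common way to derive the image plane from the six direction cosines stored in the DICOM `ImageOrientationPatient` attribute. This is a sketch of the general idea, not necessarily how `get_image_plane` is implemented; the function name `image_plane_from_orientation` is made up for illustration.

```python
def image_plane_from_orientation(iop):
    """Classify a slice as axial, coronal or sagittal from its six direction cosines."""
    row = [float(v) for v in iop[:3]]   # direction of the image rows
    col = [float(v) for v in iop[3:6]]  # direction of the image columns

    # The slice normal is the cross product of the row and column directions.
    normal = (
        row[1] * col[2] - row[2] * col[1],
        row[2] * col[0] - row[0] * col[2],
        row[0] * col[1] - row[1] * col[0],
    )

    # The patient axis (x, y, z) the normal is most aligned with decides the plane.
    dominant = max(range(3), key=lambda axis: abs(normal[axis]))
    return ("sagittal", "coronal", "axial")[dominant]


print(image_plane_from_orientation([1, 0, 0, 0, 1, 0]))
```

Using the dominant axis of the normal means slightly tilted acquisitions are still assigned to the nearest plane.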



In [None]:
# Number of parallel processes to use for data processing
n_jobs = 10

# Get DICOM metadata
train_meta_df = get_all_BraTS21_dicom_meta(train_df, KAGGLE_BRAINTUMOR_META_COLS, 
                                           SCAN_TYPES, n_jobs, True) # 📌

# Get orientation metadata
train_meta_df['Orientation'] = train_meta_df.apply(get_image_plane, axis=1) # 📌

# Save metadata
train_meta_df.to_csv('train_meta_df.csv', index=False)

print(f'train_meta_df has {len(train_meta_df)} rows')
train_meta_df.head()

## Save Metadata as W&B Artifacts

Additionally you can save the metadata as W&B Artifacts. Check out the [official documentation page](https://docs.wandb.ai/guides/artifacts) to learn more. 

In [None]:
# Save the file as W&B artifacts. 
run = wandb.init(project='rsna-miccai-brain-tumor',
                 config=CONFIG,
                 job_type='create-dicom-metadata',
                 anonymous=anony)

log_to_artifacts(path_to_data='train_meta_df.csv', # we saved the file in the last cell. # 📌
                artifact_name='metadata',  # let's name it metadata. 
                artifact_type='meta-dataset', # let the type be meta-dataset
                log='file') 

wandb.finish()

### [Check out the saved Artifacts page $\rightarrow$](https://wandb.ai/ayush-thakur/brain-tumor-voxel-dataset/artifacts/meta-dataset/metadata/3debb76d8026375a39aa/files)

# Create Dataset

In the cell below you can query the metadata table to find a reference sequence that meets the criteria below.

Criteria:
* Height (`Rows`) and Width (`Columns`)
* Orientation (`Orientation`)
* MRI Scan type (`SeriesDescription`)
* Number of slices (`count`) per `PatientID`
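The criteria above can be tried out on a toy table first. The sketch below uses made-up metadata (patients `A`, `B`, `C`) and a hypothetical helper `find_reference_candidates`; it also converts `Rows` and `Columns` to numbers before comparing, so the filter works whether the values are strings or integers.

```python
import pandas as pd

# Toy metadata with one row per DICOM slice (values are made up for illustration).
toy_meta = pd.DataFrame({
    "PatientID": ["A"] * 20 + ["B"] * 60 + ["C"] * 30,
    "Rows": ["256"] * 80 + ["512"] * 30,
    "Columns": ["256"] * 80 + ["512"] * 30,
    "Orientation": ["axial"] * 110,
    "SeriesDescription": ["T1w"] * 110,
})


def find_reference_candidates(meta, size=256, plane="axial", scan="T1w",
                              min_slices=16, max_slices=49):
    """Return patients whose sequence matches the criteria, with their slice count."""
    mask = (
        (pd.to_numeric(meta["Rows"]) == size)
        & (pd.to_numeric(meta["Columns"]) == size)
        & (meta["Orientation"] == plane)
        & (meta["SeriesDescription"] == scan)
    )
    counts = meta[mask].groupby("PatientID").size().reset_index(name="count")
    # between() is inclusive on both ends: 16..49 slices.
    return counts[counts["count"].between(min_slices, max_slices)].reset_index(drop=True)


candidates = find_reference_candidates(toy_meta)
print(candidates)
```

Only patient `A` survives here: `B` has too many slices and `C` has the wrong image size.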


In [None]:
# Find the reference MRI sequence manually
dftr2 = train_meta_df[(train_meta_df.Rows == '256') & 
                      (train_meta_df.Columns == '256') &
                      (train_meta_df.Orientation == "axial") &
                      (train_meta_df.SeriesDescription == "T1w")].groupby(['PatientID', 'Orientation', 'SeriesDescription']).size().reset_index(name='count') 

dftr2.loc[(dftr2['count'] < 50) & (dftr2['count'] >15)].reset_index(drop = True)

⭐ Note: If you are using the entire dataset and have decided on an ID to use as the reference, use Run and Save all.

In [None]:
REFERENCE_ID = '00147'

The cell below has the functions that save the voxel manipulated dataset.

📌 The `VoxelData` class contains the main logic for the dataset creation by manipulating the MRI sequences in the voxel space. It takes `reference_path`, which is the path to the reference MRI sequence. Here we use the T1w sequence of the patient set in `REFERENCE_ID` above. Check out the [documentation](https://ayulockin.github.io/kagglerecipes/preprocess.html#VoxelData).

📌 `VoxelData` has a method called `get_voxel_data` which takes in the path to the MRI sequence and returns the manipulated data.
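To illustrate what resampling onto a reference voxel grid means, here is a simplified nearest-neighbour sketch in NumPy. It assumes each volume comes with a 4x4 affine that maps voxel indices to patient coordinates; it is not the `VoxelData` implementation, and `resample_to_reference` is a hypothetical name.

```python
import numpy as np


def resample_to_reference(src_volume, src_affine, ref_affine, ref_shape):
    """Nearest-neighbour resampling of src_volume onto the reference voxel grid."""
    # Homogeneous voxel indices of every point in the reference grid, shape (4, N).
    grid = np.indices(ref_shape).reshape(3, -1)
    ref_voxels = np.vstack([grid, np.ones((1, grid.shape[1]))])

    # reference voxel -> patient coordinates -> source voxel
    ref_to_src = np.linalg.inv(src_affine) @ ref_affine
    src_voxels = np.rint((ref_to_src @ ref_voxels)[:3]).astype(int)

    # Keep only the points that land inside the source volume; the rest stay 0.
    upper = np.array(src_volume.shape).reshape(3, 1)
    inside = np.all((src_voxels >= 0) & (src_voxels < upper), axis=0)

    out = np.zeros(grid.shape[1], dtype=src_volume.dtype)
    out[inside] = src_volume[tuple(src_voxels[:, inside])]
    return out.reshape(ref_shape)


# Toy example: the reference grid is twice as coarse as the source grid.
src = np.arange(4 * 4 * 4).reshape(4, 4, 4)
coarse = resample_to_reference(src, np.eye(4), np.diag([2.0, 2.0, 2.0, 1.0]), (2, 2, 2))
print(coarse.shape)
```

Every sequence resampled this way ends up with the shape of the reference grid, which is what makes the saved volumes directly comparable.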

In [None]:
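def save_voxel_data(
    scan_types:list,
    save_path:Path,  # Root folder to save voxel data to
    reference_path,  # Path to the reference MRI sequence
    patient_path,  # Path to the patient folder
    BraTS21ID:int  # BraTS21ID
):
    "Resamples each scan type of one patient based on the reference path and saves it as a `.npy` file"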
    
    connect_voxel = VoxelData(reference_path) # 📌
    
    # Create folder to save to
    save_dir = save_path / get_patient_id(BraTS21ID)
    os.makedirs(save_dir, exist_ok=True)

    # Resample the dicom files and save the output images as numpy files
    for scan in scan_types:
        scan_path = os.path.join(patient_path, scan)
        voxel_data = connect_voxel.get_voxel_data(scan_path) # 📌
        np.save(save_dir / f'{scan}.npy', voxel_data)

        
def save_all_voxel_data(
    reference_path:str,  # Path to the dicom folder to use as a template for resampling
    save_path:Path,  # Path to save voxel data to (used with the `/` operator, so a Path is required)
    df,  # Dataframe with path to patient folder and BraTS21ID
    scan_types:list=['FLAIR', 'T1w', 'T1wCE', 'T2w'],  # The subfolders in the patient data to loop through
):    
    "Resamples the dicom data based on reference template dicom and then saves that voxel data to save_path"
    
    patient_path = df.path.values
    BraTS21ID_ls = df.BraTS21ID.values
    # save_voxel_data writes to disk and returns nothing, so there is nothing to collect
    for i in tqdm(range(len(df))):
        save_voxel_data(scan_types, save_path, reference_path, patient_path[i], BraTS21ID_ls[i])

In [None]:
REFERENCE_PATH = os.path.join(TRAIN_PATH / REFERENCE_ID, "T1w") # Note: The PatientID and the MRI Scan type selected as reference.
SAVE_PATH = Path('../tmp/') 

os.makedirs(SAVE_PATH, exist_ok=True)

# Save all data
save_all_voxel_data(REFERENCE_PATH, SAVE_PATH, train_df)

# Visualize the Created Dataset as W&B Tables

This allows you to interactively check if the created dataset is correct. 
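Alongside the interactive check, a quick file-level check can confirm that every patient folder contains one `.npy` file per scan type. The sketch below uses only the standard library; `find_incomplete_patients` is a hypothetical helper, and the demo runs on a throwaway folder with empty placeholder files instead of the real `SAVE_PATH`.

```python
import tempfile
from pathlib import Path

SCAN_TYPES = ("FLAIR", "T1w", "T1wCE", "T2w")


def find_incomplete_patients(save_path, scan_types=SCAN_TYPES):
    """Return {patient_folder: [missing scan types]} for folders lacking a .npy file."""
    incomplete = {}
    for patient_dir in sorted(Path(save_path).iterdir()):
        if not patient_dir.is_dir():
            continue
        missing = [s for s in scan_types if not (patient_dir / f"{s}.npy").is_file()]
        if missing:
            incomplete[patient_dir.name] = missing
    return incomplete


# Tiny demo: patient 00000 is complete, patient 00002 lacks two scan types.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for patient, scans in {"00000": SCAN_TYPES, "00002": ("FLAIR", "T1w")}.items():
        (root / patient).mkdir()
        for scan in scans:
            (root / patient / f"{scan}.npy").touch()
    report = find_incomplete_patients(root)

print(report)
```

An empty dictionary means every patient folder has all the expected files.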

🧐 For a deeper EDA looking at individual images see this **[kernel here](https://www.kaggle.com/ayuraj/brain-tumor-eda-and-interactive-viz-with-w-b)**

⚠️ See this discussion post for more info on MGMT and the objective of this competition: **[[Self-Note] "Brain tumor classification" is misleading!](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/264861)**

In [None]:
# Number of scans to viz
NUM_SAMPLES = 32

# Initialize a W&B run to log images
run = wandb.init(project='rsna-miccai-brain-tumor',
                 config=CONFIG,
                 name='viz-dataset-tables',
                 anonymous=anony) # 📌 W&B Code 1

data_at = wandb.Table(columns=['patient_id', 'target', 'FLAIR', 'T1w', 'T1wCE', 'T2w']) # 📌 W&B Code 2

os.makedirs('tables-gif', exist_ok=True)

# Log at most NUM_SAMPLES patients
for i in tqdm(range(min(NUM_SAMPLES, len(train_df)))):
    row = train_df.iloc[i]  # positional lookup, since the index has gaps after the exclusions above
    patient_id = get_patient_id(row.BraTS21ID)
    
    for j, key in enumerate(SCAN_TYPES):
        _frames = np.load(f'{SAVE_PATH}/{patient_id}/{key}.npy')
        imageio.mimsave(f'tables-gif/out_{patient_id}_{j}.gif', (_frames*255).astype('uint8'))
    
    data_at.add_data(patient_id,                                            
                     row.MGMT_value,
                     wandb.Image(f'tables-gif/out_{patient_id}_0.gif'),
                     wandb.Image(f'tables-gif/out_{patient_id}_1.gif'),
                     wandb.Image(f'tables-gif/out_{patient_id}_2.gif'),
                     wandb.Image(f'tables-gif/out_{patient_id}_3.gif')) # 📌 W&B Code 3

wandb.log({'MRI Sequencing Dataset': data_at}) # 📌 W&B Code 4
wandb.finish() # 📌 W&B Code 5
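The GIF export above multiplies the frames by 255, which presumes the saved voxel values lie in the range 0 to 1. If your volumes are not in that range, a min-max rescale before the cast avoids wrap-around artefacts. This is a sketch with a hypothetical helper, `to_uint8_frames`, and not part of KaggleRecipes.

```python
import numpy as np


def to_uint8_frames(volume):
    """Rescale a volume to the full 0..255 range and cast it to uint8."""
    volume = np.asarray(volume, dtype=np.float32)
    lo, hi = float(volume.min()), float(volume.max())
    if hi == lo:
        # A constant volume carries no contrast, so return black frames.
        return np.zeros(volume.shape, dtype=np.uint8)
    scaled = (volume - lo) / (hi - lo)
    return np.rint(scaled * 255).astype(np.uint8)


frames = to_uint8_frames(np.array([[0.0, 50.0], [100.0, 200.0]]))
print(frames.dtype, frames.min(), frames.max())
```

The result can be passed to `imageio.mimsave` in place of `(_frames*255).astype('uint8')`.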

### Visualize entire dataset interactively

### [Check out the Tables](https://wandb.ai/ayush-thakur/brain-tumor-viz/runs/kb9nwx3a) 

![img](https://i.imgur.com/4cGorA3.gif)

You will probably want to save the created dataset as a Kaggle Dataset, which is not covered in this kernel. If you publish it as a public Kaggle dataset, it would be great if you could use a title like: `RSNA-MICCAI Voxel Dataset BraTS21Id <PatientID>`

**If you like the effort consider upvoting and using the library :)**

In [None]:
!zip -rq voxel.zip ../tmp

In [None]:
print('Done!')