# nnUNet Data Preparation

[nnUNet](https://github.com/MIC-DKFZ/nnUNet) is a self-configuring method for deep
learning-based biomedical image segmentation. However, it requires data to be laid out in a
specific way on the file system. In this notebook, we demonstrate some useful functionality for
preparing a dataset converted by PyDicer for training with nnUNet.

> Note: PyDicer currently only supports nnUNet v1. Contributions adding support for nnUNet v2 are
> welcome.

In [None]:
try:
    from pydicer import PyDicer
except ImportError:
    !pip install pydicer
    from pydicer import PyDicer

import os
import logging

from pathlib import Path

from pydicer.utils import fetch_converted_test_data

from pydicer.dataset.nnunet import NNUNetDataset
from pydicer.dataset.structureset import StructureSet


## Setup nnUNet

Consult the [nnUNet documentation](https://github.com/MIC-DKFZ/nnUNet/tree/master/documentation)
for details on how to install nnUNet, set up folder paths and conduct model training. The dataset
will be prepared in the `nnUNet_raw_data_base` directory. If you already have this set in your
environment, you can remove the following cell. For demonstration purposes, we set our
`nnUNet_raw_data_base` to a scratch directory.

In [None]:
os.environ["nnUNet_raw_data_base"] = "./nnScratch"
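
If `nnUNet_raw_data_base` may already be defined in your environment, a small guard avoids overwriting it. The following is a minimal sketch using only the standard library; it falls back to the same `./nnScratch` scratch directory used above and assumes it is acceptable to create that directory if it is missing.

```python
import os
from pathlib import Path

# Keep an existing value if one is set; otherwise fall back to the scratch directory.
raw_data_base = Path(os.environ.setdefault("nnUNet_raw_data_base", "./nnScratch"))

# Make sure the directory exists before the dataset is prepared.
raw_data_base.mkdir(parents=True, exist_ok=True)
print(f"nnUNet_raw_data_base: {raw_data_base.resolve()}")
```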

## Setup PyDicer

For this example, we will use the LCTSC test data which has already been converted using PyDicer.
We also initialise our PyDicer object.

For working with nnUNet, we set the PyDicer logging verbosity to `INFO`, so that we can see the
relevant output being generated by the tool.

In [None]:
working_directory = fetch_converted_test_data("./testdata_lctsc", dataset="LCTSC")
pydicer = PyDicer(working_directory)
pydicer.set_verbosity(logging.INFO)

## Define Structures

PyDicer uses the
[structure name mapping](https://australiancancerdatanetwork.github.io/pydicer/_examples/WorkingWithStructures.html)
functionality to determine which structures the nnUNet model will be trained to segment. Here we
add a structure name mapping for this task. Each key is a standardised structure name and each
value lists the name variants that should be mapped to it.

In [None]:
mapping_id = "nnunet_lctsc"
mapping = {
    "Esophagus": [],
    "Heart": [],
    "Lung_L": ["L_Lung", "Lung_Left"],
    "Lung_R": ["Lung_Right"],
    "SpinalCord": ["SC"],
}

pydicer.add_structure_name_mapping(
    mapping_id=mapping_id,
    mapping_dict=mapping
)
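
To make the role of the mapping concrete, the sketch below resolves a structure name found in a structure set to its standardised name. `resolve_structure_name` is a hypothetical helper written for illustration, not part of PyDicer, and it assumes exact, case-sensitive matching; PyDicer's own lookup may behave differently.

```python
from typing import Dict, List, Optional

mapping: Dict[str, List[str]] = {
    "Esophagus": [],
    "Heart": [],
    "Lung_L": ["L_Lung", "Lung_Left"],
    "Lung_R": ["Lung_Right"],
    "SpinalCord": ["SC"],
}


def resolve_structure_name(name: str, mapping: Dict[str, List[str]]) -> Optional[str]:
    """Hypothetical helper: return the standardised name for `name`, or None if unmapped."""
    for standard_name, variants in mapping.items():
        if name == standard_name or name in variants:
            return standard_name
    return None


print(resolve_structure_name("Lung_Left", mapping))  # Lung_L
print(resolve_structure_name("Heart", mapping))      # Heart
print(resolve_structure_name("Trachea", mapping))    # None
```

An empty list, as for `Heart`, means that only the standardised name itself is accepted.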

## Initialise `NNUNetDataset` object

The `NNUNetDataset` class provides the functionality to prepare a dataset from PyDicer data. Here
we create an object of this class for use in this example. Check out the 
[documentation](https://australiancancerdatanetwork.github.io/pydicer/nnunet.html) for more
information on how the `NNUNetDataset` class works.

In [None]:
nnunet_task_id = 123
nnunet_task_name = "LCTSC_Test"
nnunet_task_description = "A dummy nnUNet task for demonstration purposes"

nnunet = NNUNetDataset(
    working_directory,
    nnunet_task_id,
    nnunet_task_name,
    nnunet_task_description,
    mapping_id=mapping_id
)

## Inspect Dataset

Our `NNUNetDataset` tool expects to have exactly one image and one structure set per patient
(multi-modal training not yet supported, contributions welcome). Let's fetch our converted
DataFrame to confirm that this is the case.

If your dataset isn't yet in such a state, you can use the
[`dataset preparation`](https://australiancancerdatanetwork.github.io/pydicer/_examples/DatasetPreparation.html)
module in PyDicer to prepare a subset of data. Once the dataset is prepared, pass the
`dataset_name` argument when creating the `NNUNetDataset` object above.

In [None]:
df = pydicer.read_converted_data(working_directory)
df

## Check Dataset

The `check_dataset` function confirms that we have one image and one structure set per patient in 
our dataset.

In [None]:
nnunet.check_dataset()
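
The idea behind this check can be sketched with plain Python. The rows below are hypothetical stand-ins for the converted DataFrame (a patient ID and a modality per object), and the logic is an illustration of the rule rather than PyDicer's implementation.

```python
from collections import Counter

# Hypothetical (patient ID, modality) rows standing in for the converted DataFrame.
rows = [
    ("patient_01", "CT"),
    ("patient_01", "RTSTRUCT"),
    ("patient_02", "CT"),
    ("patient_02", "RTSTRUCT"),
    ("patient_02", "RTSTRUCT"),  # second structure set: violates the rule
]

counts = Counter(rows)
patients = sorted({patient_id for patient_id, _ in rows})

# A patient passes only with exactly one image and exactly one structure set.
invalid = [
    p for p in patients
    if counts[(p, "CT")] != 1 or counts[(p, "RTSTRUCT")] != 1
]
print(invalid)  # ['patient_02']
```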

## Split Dataset

Here we randomly split our dataset into a training and testing set. You can specify the
`training_cases` and `testing_cases` to use in the `split_dataset` function. If these aren't
supplied, the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
function from `sklearn` will be used. You can pass keyword arguments to this function via the
`split_dataset` function.

In [None]:
nnunet.split_dataset()
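
If you would rather control the split yourself, one option is to build the two lists up front. The sketch below uses a seeded shuffle from the standard library in place of `train_test_split`; `split_cases` and the patient IDs are hypothetical and only illustrate a reproducible split.

```python
import random
from typing import List, Tuple


def split_cases(
    cases: List[str], test_fraction: float = 0.25, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: shuffle with a fixed seed, then split into train/test lists."""
    shuffled = sorted(cases)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cases = [f"patient_{i:02d}" for i in range(8)]
training_cases, testing_cases = split_cases(cases)
print(len(training_cases), len(testing_cases))  # 6 2
```

The resulting lists could then be supplied as the `training_cases` and `testing_cases` described above.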

## Check for Duplicate Data

Now that the dataset is split, we must ensure that none of the `training_cases` are present in the
`testing_cases`. Even if the cases have different IDs, it is possible that through anonymisation
the same patient is anonymised to two different IDs. The `check_duplicates_train_test` function
will check the imaging data to ensure there are no duplicates.

In [None]:
nnunet.check_duplicates_train_test()
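
One way to detect the same image under two different IDs is to compare a hash of the image content rather than the IDs. The sketch below illustrates that idea on raw bytes; `find_duplicates` is a hypothetical helper, and PyDicer's actual comparison may work differently.

```python
import hashlib
from typing import Dict, List, Tuple


def find_duplicates(train: Dict[str, bytes], test: Dict[str, bytes]) -> List[Tuple[str, str]]:
    """Hypothetical helper: pair train/test cases whose image bytes hash identically."""
    train_hashes = {hashlib.sha256(data).hexdigest(): case for case, data in train.items()}
    duplicates = []
    for case, data in test.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in train_hashes:
            duplicates.append((train_hashes[digest], case))
    return duplicates


# Hypothetical image contents; "anon_B" and "anon_X" hold identical data.
train_images = {"anon_A": b"image-1", "anon_B": b"image-2"}
test_images = {"anon_X": b"image-2", "anon_Y": b"image-3"}
print(find_duplicates(train_images, test_images))  # [('anon_B', 'anon_X')]
```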

## Check Structure Names

The nnUNet requires that all structures are present for all cases (missing structures are not
supported). The `check_structure_names` function will output a grid indicating where structures
might be missing (or a structure name mapping is missing).

If any structures are missing for any case, resolve this (by adding a structure name mapping or
removing the case from the dataset) before proceeding.

In [None]:
df_results = nnunet.check_structure_names()
df_results
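
The grid can be thought of as one row per case and one column per standardised structure, with a flag for whether that structure was found. The sketch below builds such a grid from hypothetical per-case structure names; it illustrates the concept only and does not reproduce the DataFrame that `check_structure_names` returns.

```python
mapping = {
    "Esophagus": [],
    "Heart": [],
    "Lung_L": ["L_Lung", "Lung_Left"],
    "Lung_R": ["Lung_Right"],
    "SpinalCord": ["SC"],
}

# Hypothetical structure names found in each case's structure set.
case_structures = {
    "case_A": ["Esophagus", "Heart", "L_Lung", "Lung_Right", "SC"],
    "case_B": ["Esophagus", "Heart", "Lung_L", "Lung_R"],
}


def is_present(standard_name, names):
    """True if the standardised name or any of its variants appears in `names`."""
    return standard_name in names or any(v in names for v in mapping[standard_name])


grid = {
    case: {name: is_present(name, names) for name in mapping}
    for case, names in case_structures.items()
}

for case, row in grid.items():
    missing = [name for name, present in row.items() if not present]
    print(case, "missing:", missing)
```

Here `case_B` has no spinal cord structure under any mapped name, so it would need either an additional mapping or removal from the dataset.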

## Check for Overlapping Structures

nnUNet (v1) cannot handle overlapping structures. If structures overlap, PyDicer assigns the
overlapping voxels to the smaller structure (to assign them to the larger structure instead, set
`nnunet.assign_overlap_to_largest=True`).

The `check_overlapping_structures` function will log any structures which are overlapping and will
be affected by this rule.

In [None]:
nnunet.check_overlapping_structures()
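
The overlap rule can be illustrated with sets of voxel indices. In the sketch below, `resolve_overlaps` and its `assign_to_largest` parameter are hypothetical names chosen for illustration; the code shows the general idea of making structures mutually exclusive and is not PyDicer's implementation.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]


def resolve_overlaps(
    structures: Dict[str, Set[Voxel]], assign_to_largest: bool = False
) -> Dict[str, Set[Voxel]]:
    """Hypothetical helper: give contested voxels to the smaller (or larger) structure."""
    # Visit structures in priority order; each keeps only voxels not already claimed.
    order = sorted(structures, key=lambda name: len(structures[name]), reverse=assign_to_largest)
    claimed: Set[Voxel] = set()
    resolved = {}
    for name in order:
        resolved[name] = structures[name] - claimed
        claimed |= structures[name]
    return resolved


heart = {(x, 0, 0) for x in range(0, 6)}      # 6 voxels
esophagus = {(x, 0, 0) for x in range(4, 7)}  # 3 voxels, 2 shared with the heart
result = resolve_overlaps({"Heart": heart, "Esophagus": esophagus})
print(len(result["Heart"]), len(result["Esophagus"]))  # 4 3
```

With the default setting the smaller structure keeps all of its voxels and the larger one loses the shared ones.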

## Prepare nnUNet Dataset

Now that all checks are complete, we can proceed with preparing the nnUNet dataset. Take a look in
the dataset directory after the cell finishes running to confirm that everything worked as
expected.

In [None]:
nnunet_dataset_path = nnunet.prepare_dataset()
print(f"Dataset prepared in: {nnunet_dataset_path}")
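
A quick way to look over the prepared directory is to count the entries in each top-level folder. The sketch below uses a hypothetical `summarise_directory` helper and demonstrates it on a temporary stand-in directory with made-up file names, loosely following the nnUNet v1 raw data convention (`imagesTr`, `labelsTr`, `dataset.json`); with a real dataset you would pass the path returned by `prepare_dataset` instead.

```python
import tempfile
from pathlib import Path
from typing import Dict


def summarise_directory(path: Path) -> Dict[str, int]:
    """Hypothetical helper: map each top-level entry to its entry count (1 for plain files)."""
    summary = {}
    for child in sorted(Path(path).iterdir()):
        summary[child.name] = len(list(child.iterdir())) if child.is_dir() else 1
    return summary


# Stand-in directory for demonstration purposes only.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "imagesTr").mkdir()
    (demo / "imagesTr" / "case_0000.nii.gz").touch()
    (demo / "labelsTr").mkdir()
    (demo / "labelsTr" / "case.nii.gz").touch()
    (demo / "dataset.json").write_text("{}")
    summary = summarise_directory(demo)

print(summary)  # {'dataset.json': 1, 'imagesTr': 1, 'labelsTr': 1}
```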

## Prepare nnUNet Training Scripts

Consult the [nnUNet documentation](https://github.com/MIC-DKFZ/nnUNet/tree/master/documentation)
for information on model training. The `generate_training_scripts` function can help by generating
a script for training the nnUNet models on the dataset which was prepared.

In [None]:
# Add some additional commands at the top of the script (useful for activating a virtual
# environment)
script_header = [
    '# source /path/to/venv/bin/activate',
]

script_path = nnunet.generate_training_scripts(script_header=script_header)
print(f"Training script ready in: {script_path}")