# Segmentation Training Walkthrough
This notebook walks you through training your own segmentation model with the Atlas-compatible htc. A similar notebook for training a classification model based on spectral analysis is titled "TissueAtlasClassificationTraining.py". If you have not done so yet, please read the Setup tutorial for important information.
Start with the necessary imports and define the path to your dataset_settings .json. The tutorial is written with an example dataset, but you should replace references with your own dataset where appropriate.

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

import pandas as pd
from IPython.display import JSON
from htc import (
    Config,
    DataPath,
    DataSpecification,
    MetricAggregation,
    SpecsGeneration,
    create_class_scores_figure,
    settings,
)
from htc.models.data.SpecsGenerationAtlas import SpecsGenerationAtlas

intermediates_dir = settings.intermediates_dirs.external
print(intermediates_dir)

/omics/groups/OE0645/internal/data/htcdata/medium_test/external/intermediates


Next, specify the important parameters for your training run, such as the number of folds and the train/test split. Replace the values in the following code block with values of your choice.

In [None]:
filter_txt = lambda p: p.contains_txt()
filters = [filter_txt]  # list of callable filter functions
annotation_name = 'annotator1'  # name of the annotator(s) whose annotations are used
test_ratio = 0  # fraction of images held out as test data, i.e. not used in any training; should be a float between 0.0 and 1.0
n_folds = 3  # number of folds; the training data (not the test data) is randomly split into n_folds groups
# For each fold, the network trains a model with one group as validation data and all other groups as training data.
seed = None  # optional seed for the random fold assignment; with None, every function call yields a different split
# For a reproducible split, set seed to a number of your choice, e.g. seed = 42
name = "testSegment"  # name of the JSON file created in the following code block (written to the data folder of the external directory); choose something simple and descriptive
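
To make the interaction between test_ratio, n_folds, and seed more concrete, here is a minimal sketch of a seeded hold-out and k-fold split over subject names. It is not the SpecsGenerationAtlas implementation: the split_into_folds helper, the round-robin grouping, and the subject names are assumptions made purely for illustration.

```python
import random


def split_into_folds(subjects, n_folds, test_ratio=0.0, seed=None):
    """Hold out a test set, then split the remaining subjects into n_folds groups.

    Hypothetical helper for illustration only (not part of htc).
    """
    rng = random.Random(seed)  # seed=None gives a different shuffle on every call
    shuffled = list(subjects)
    rng.shuffle(shuffled)

    # The test subjects are removed first and never used for training
    n_test = round(len(shuffled) * test_ratio)
    test = shuffled[:n_test]
    train_pool = shuffled[n_test:]
    if n_folds > len(train_pool):
        raise ValueError("n_folds must not exceed the number of training subjects")

    # Round-robin assignment of the remaining subjects to n_folds groups
    groups = [train_pool[i::n_folds] for i in range(n_folds)]

    # Each fold uses one group for validation and all other groups for training
    folds = []
    for i, val in enumerate(groups):
        train = [s for j, group in enumerate(groups) if j != i for s in group]
        folds.append({"train": train, "val": val})
    return test, folds


example_subjects = [f"subject_{i}" for i in range(6)]  # placeholder names
test, folds = split_into_folds(example_subjects, n_folds=3, test_ratio=0.0, seed=42)
for i, fold in enumerate(folds):
    print(i, "train:", fold["train"], "val:", fold["val"])
```

Calling the helper twice with the same seed returns the same folds, while seed=None reshuffles on every call.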


In [3]:
tutorial_dir = Path().absolute()
external = settings.external_dir.external['path_dataset']  # brackets are needed to access the path because settings.external_dir.external is a dictionary containing information about the external directory
# settings.external_dir is an object containing all the different external directories; in our case, there should always be just one, with the shortcut "external"
print(external)
specs_path = external / 'data' / name
SpecsGenerationAtlas(intermediates_dir,
                filters = filters,
                annotation_name = annotation_name,
                test_ratio = test_ratio,
                n_folds = n_folds,
                seed = seed,
                name = name,
                ).generate_dataset(external / 'data')

/omics/groups/OE0645/internal/data/htcdata/medium_test/external
['P160_OP124_2023_06_28_Experiment1', 'P162_OP126_2023_07_06_Experiment1', 'P163_OP127_2023_07_12_Experiment1']


## Lightning Class

The next step is to choose or build our Lightning class. The Lightning class (as in PyTorch Lightning) manages many aspects of training and can be customized by creating your own child class. Most notably, it lets you specify your loss function.

For this walkthrough, we will use the htc "LightningImage" class, which is the htc default for training on full images (as opposed to patches, pixels, or superpixels). It calculates the loss as a weighted average of Dice loss and cross-entropy loss. See the htc "networkTraining" tutorial for more information on the Lightning class.
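
As a rough illustration of what a weighted average of Dice loss and cross-entropy loss means, the following sketch computes both terms on a handful of pixels in plain Python. It is not the LightningImage implementation: the function names, the soft Dice formulation, and the dice_weight value of 0.5 are assumptions chosen for the example.

```python
import math


def cross_entropy(probs, labels, eps=1e-6):
    """Mean negative log-probability of the true class over all pixels."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, n_classes, eps=1e-6):
    """1 minus the soft Dice score averaged over classes."""
    scores = []
    for c in range(n_classes):
        # Overlap between predicted probability and ground truth for class c
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denominator = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        scores.append((2 * intersection + eps) / (denominator + eps))
    return 1 - sum(scores) / n_classes


def combined_loss(probs, labels, n_classes, dice_weight=0.5):
    """Weighted average of Dice loss and cross-entropy (the weight is an assumed value)."""
    return dice_weight * dice_loss(probs, labels, n_classes) + (
        1 - dice_weight
    ) * cross_entropy(probs, labels)


# Two pixels, two classes: per-pixel class probabilities and true labels
labels = [0, 1]
confident = [[0.9, 0.1], [0.2, 0.8]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(combined_loss(confident, labels, n_classes=2))
print(combined_loss(uncertain, labels, n_classes=2))
```

The more confident and correct predictions yield the lower combined loss, which is the behaviour both terms are meant to reward.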

## Config
The last step before training is to create our configuration file. This file is also a JSON file containing important metadata, and the training process uses it to configure training hyperparameters such as batch size and transformations. We will use the htc Config ***class*** to write the config ***json***.

The following code blocks write the config JSON for you. By default, the config JSON file is stored in the same directory as your dataset_settings JSON.

You can change your config by changing the values assigned in the code block immediately below. If you are confident in your understanding and want to make more specific or advanced changes, you can add them in the code block after that, where the config is instantiated.

For a guide to the possible config keys and their meanings, see the htc config.schema file in htc/utils.
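
The config cells in this section address settings with slash-separated keys such as "trainer_kwargs/max_epochs". The sketch below shows one way such keys can map onto a nested dictionary. It is an illustration under that assumption, not the htc Config implementation, and set_nested and get_nested are hypothetical helper names.

```python
def set_nested(data, key, value):
    """Store value under a slash-separated key, creating nested dicts as needed."""
    parts = key.split("/")
    for part in parts[:-1]:
        data = data.setdefault(part, {})
    data[parts[-1]] = value


def get_nested(data, key):
    """Read the value stored under a slash-separated key."""
    for part in key.split("/"):
        data = data[part]
    return data


example_config = {}
set_nested(example_config, "trainer_kwargs/max_epochs", 2)
set_nested(example_config, "dataloader_kwargs/batch_size", 8)
print(example_config)
print(get_nested(example_config, "trainer_kwargs/max_epochs"))
```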

In [4]:
# Assign training hyperparameters
max_epochs = 2  # choose any number of epochs you like
batch_size = 8  # number of SUBJECTS, rather than images, in each batch; the loader is designed to sort batches by subject. Note: training breaks with batch size 1 (cause unclear)
# The default batch size is 3 subjects
shuffle = True  # True: the batch generator draws different random batches on every epoch
# False: the same batches are used across epochs
num_workers = "auto"  # number of data-loading worker subprocesses to start; the optimal number for fast loading depends heavily on your system
# You can experiment with low-epoch runs to see which num_workers value maximizes your training speed.

For training, we use a new label mapping to read and interpret the annotations. In this case, the label mapping is defined explicitly in the code cell below. Another option is to write a SettingsProject class for your project that contains a label mapping attribute, and to reference that settings file. This can be useful for keeping the settings and configurations of one project in the same file. For an example of using such a class, see the config section of the TissueAtlasClassTraining tutorial.

Notably, it is possible to map multiple labels in your annotations to the same value for training, so that the model treats them as the same class. (It is not possible to map one annotation label to multiple training classes.) If you do so, make sure you also define mapping_index_name to make clear which name should be recovered for that class.
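
A many-to-one mapping of this kind can be sketched with two plain dictionaries. The label names, the merged class name "ureter", and the helper functions below are hypothetical and only illustrate the idea; they do not reproduce how htc applies the mapping internally.

```python
# Several annotation names may share one training index (many-to-one)
mapping_name_index = {
    "background": 0,
    "ureter_left": 1,
    "ureter_right": 1,  # merged with ureter_left
}

# One index can only map back to one name, so it must be stated explicitly
mapping_index_name = {
    0: "background",
    1: "ureter",
}


def to_training_indices(annotation_names):
    """Translate annotation label names into training class indices."""
    return [mapping_name_index[name] for name in annotation_names]


def to_class_names(indices):
    """Recover the class name that each training index stands for."""
    return [mapping_index_name[index] for index in indices]


indices = to_training_indices(["ureter_left", "background", "ureter_right"])
print(indices)
print(to_class_names(indices))
```

Without the explicit index-to-name dictionary, index 1 could be reported as either ureter_left or ureter_right, which is why the merged class needs its own name.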

In [5]:
config = Config.from_model_name("default", "image")
config["inherits"] = "htc/context/models/configs/organ_transplantation_0.8.json"  # use the organ transplantation model config as a basis
config["input/data_spec"] = specs_path
config["input/annotation_name"] = ["polygon#annotator1"]
config["validation/checkpoint_metric_mode"] = "class_level"



# We want to merge the annotations from all annotators into one label mask
config["input/merge_annotations"] = "union"

# We have a two-class problem and we want to ignore all unlabeled pixels
# Everything which is >= settings.label_index_thresh will later be considered invalid
config["label_mapping"] = {
    "last_valid_label_index": 3,
    "mapping_index_name": {
        "0": "uro_conduit",
        "1": "background",
        "254": "overlap",
    },
    "mapping_name_index": {
        "background_anorganic": 0,
        "background_organic": 1,
        "ureter_left": 2,
        "ureter_right": 3,
        "overlap": 254,
    },
    "unknown_invalid": False,
    "zero_is_invalid": False,
}
# Leaving the label mapping as None uses the label IDs stored in the segmentation bloscs; to remap the labels, specify the mapping here.
# This may be useful for combining multiple labels into one label without reloading the intermediates.
# Note: how the background is handled and defined is still somewhat unclear.

#specify batch and sampler settings:
config['dataloader_kwargs/batch_size'] = batch_size
config['dataloader_kwargs/num_workers'] = num_workers

# Reduce the training time
config["trainer_kwargs/max_epochs"] = max_epochs

# Progress bars can cause problems in Jupyter notebooks; set this to False if you want to disable them (training does not take super long)
config["trainer_kwargs/enable_progress_bar"] = True

# Uncomment the following lines if you want to use one of the pretrained models as basis for our training
# config["model/pretrained_model"] = {
#     "model": "image",
#     "run_folder": "2022-02-03_22-58-44_generated_default_model_comparison",
# }

config_path = external/'data'/ (name + "_config.json")
config.save_config(config_path)
JSON(config_path)

print(type(config_path))

<class 'pathlib.PosixPath'>


## Start the Training
You are now ready to train your network. Open the file named "SegmentTraining.sh" in this tutorial directory and set the config path variable to the path you generated in the previous cell (stored in the config_path variable).
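
Before editing the shell script, it can help to confirm that the config path points to an existing, readable JSON file. The helper below is a minimal sketch using only the standard library; check_config_file is a hypothetical name and not part of htc.

```python
import json
from pathlib import Path


def check_config_file(config_path):
    """Return the parsed config if the file exists and contains valid JSON."""
    config_path = Path(config_path)
    if not config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with config_path.open() as f:
        return json.load(f)  # raises json.JSONDecodeError if the file is not valid JSON
```

In this notebook, you could call check_config_file(config_path) and inspect the returned dictionary before copying the path into the script.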



Then, in a terminal, from the root directory of the repository, run:
```bash
 chmod +x tutorials/Urology_group_tutorials/training.sh
 sh tutorials/Urology_group_tutorials/training.sh
```

Now your training has started! Depending on your number of epochs and the size of your dataset, it will take some time.

## Viewing Results

Once your training is complete, you can use htc code to view an experimental analysis of your model.
Start by locating your training directory. Navigate to your results directory (the one you set with the PATH environment variable).
Inside results, you can find your run in a path similar to the one below:
```bash
 training/<model_name>/<run_name>
```
Here, the run name is usually the timestamp of the training with the name of the config appended. The run directory should contain a config.json, data.json, log.txt, and a fold directory for each fold you trained. If everything is there, run the following cell, replacing the input path with the absolute path to your run directory.
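
The check described above can also be scripted. The following sketch lists what is missing from a run directory. It assumes that fold directory names start with "fold", which is a guess about the naming scheme; missing_run_items is a hypothetical helper and not part of htc.

```python
from pathlib import Path

EXPECTED_FILES = ("config.json", "data.json", "log.txt")


def missing_run_items(run_dir, n_folds):
    """Return a list describing expected items that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]

    # Assumption: each fold is stored in a subdirectory whose name starts with "fold"
    fold_dirs = [p for p in run_dir.iterdir() if p.is_dir() and p.name.startswith("fold")]
    if len(fold_dirs) < n_folds:
        missing.append(f"{n_folds - len(fold_dirs)} fold directories")
    return missing
```

An empty list means that all expected files and at least n_folds fold directories were found.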

In [None]:
!python htc/evaluation/run_table_generation.py --notebook htc/evaluation/ExperimentAnalysis.ipynb --input-path /omics/groups/OE0645/internal/data/htcdata/medium_test/results/training/image/2024-07-25_17-43-32_SegmentTrain_config

After the script runs, you should see a new ExperimentAnalysis.html file in the results folder. Open the HTML file in a browser to see your results!

Happy Training!!!