# Tissue Atlas Classification Training Walkthrough
This notebook walks you through using the Atlas-compatible htc to train your own organ classification (not segmentation) model based on spectral analysis. A similar notebook for training a segmentation model is available as "SegmentationTraining.ipynb". If you have not done so yet, please read the Setup tutorial first, as it contains important information.
Start with the necessary imports and define the path to your dataset_settings .json. The tutorial is written with a very small dataset (2 pigs) called "HeiPorSpectral_mod". Replace the relevant directory paths and names with those of your own dataset and JSON file.

In [16]:
# This is a Python cell to define the relative navigation steps
import os
from pathlib import Path

# Get the current notebook directory
notebook_dir = Path().resolve()
print(notebook_dir)

# Define the relative path to the root directory (e.g., go up 2 levels)
levels_up = 2  # Adjust this based on your project structure
repo_root = notebook_dir.parents[levels_up - 1]  # Adjust the index based on the levels

print(repo_root)
# Change the working directory to the repository root
os.chdir(repo_root)

# Verify the current working directory
current_dir = Path().resolve()
print(f"Current working directory: {current_dir}")

/home/l328r/htc/tutorials/Urology_group_tutorials
/home/l328r/htc
Current working directory: /home/l328r/htc
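
The cell above relies on a fixed number of levels (`levels_up`), which breaks if the notebook is moved. As an alternative, here is a minimal sketch that searches upwards for a marker file instead. The helper name `find_repo_root` and the choice of `setup.py` as marker are assumptions made for this example; pick any file that only exists at the root of your checkout.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "setup.py") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")


# Usage in the notebook (hypothetical marker, adjust to your repository):
# os.chdir(find_repo_root(Path.cwd(), marker="setup.py"))
```

Searching for a marker makes the cell independent of how deeply the notebook is nested, at the cost of having to choose a marker that is unique to the repository root.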


In [17]:
!pwd

/home/l328r/htc


In [18]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

from htc.tissue_atlas.settings_atlas import SettingsAtlas
import pandas as pd
from IPython.display import JSON
from typing import TYPE_CHECKING, Any, Callable, Union, Self
from htc import (
    Config,
    DataPath,
    DataSpecification,
    MetricAggregation,
    SpecsGeneration,
    create_class_scores_figure,
    settings,
)
from htc.models.data.SpecsGenerationAtlas import SpecsGenerationAtlas

intermediates_dir = settings.intermediates_dirs.external
print(intermediates_dir)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/omics/groups/OE0645/internal/data/htcdata/medium_test/external/intermediates


Next, we specify important parameters for the training run, such as the number of folds and the train/test split. Replace the values in the following code block with values of your choice.

In [19]:
# TODO:
# - filter by existence of a .txt file next to hypergui within the timestamp folder (it marks the recording as OK)
# - add batch size
# - add epoch length
# - add batch randomization conditions

filter_txt = lambda p: p.contains_txt()
filters = [filter_txt]  # list of callable filter functions
annotation_name = 'annotator1'  # name of the annotator to be used
test_ratio = 0.33  # ratio of subjects held out as test data, i.e. not used in any training. Float between 0.0 and 1.0 (write 1/3 as 0.33)
n_folds = 2  # number of folds. The training data (not the test data) is randomly split into n_folds groups.
# For each fold, a model is trained with one group as validation data and all other groups as training data.
seed = None  # optional seed for the random grouping of the folds. None gives a different split on every call;
# for a reproducible split, set seed to a number of your choice, e.g. seed = 42
name = "Atlas"  # name of the JSON file created in the following code block (stored in the directory passed to generate_dataset). Keep it simple and descriptive
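
To make the roles of `test_ratio`, `n_folds` and `seed` more concrete, the following sketch splits a list of subjects by hand. It is only an illustration of the idea under simple assumptions (rounding of the test set size, round-robin assignment to folds); it is not the code used by `SpecsGenerationAtlas`, and the function `split_subjects` and the shortened subject names are hypothetical.

```python
import random


def split_subjects(subjects, test_ratio, n_folds, seed=None):
    """Hold out a test set, then deal the remaining subjects into n_folds groups."""
    rng = random.Random(seed)  # seed=None gives a different shuffle on every call
    shuffled = sorted(subjects)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_ratio)
    test, remaining = shuffled[:n_test], shuffled[n_test:]

    # Fold i uses group i for validation and all other groups for training
    groups = [remaining[i::n_folds] for i in range(n_folds)]
    folds = []
    for i, val in enumerate(groups):
        train = [s for j, group in enumerate(groups) if j != i for s in group]
        folds.append({"train": train, "val": val})
    return test, folds


subjects = ["P160", "P162", "P163"]  # hypothetical, shortened subject names
test, folds = split_subjects(subjects, test_ratio=0.33, n_folds=2, seed=42)
print("test:", test)
for i, fold in enumerate(folds):
    print(f"fold {i}: train={fold['train']} val={fold['val']}")
```

With three subjects, one is held out for testing and the other two alternate between training and validation. Calling the function again with the same seed reproduces the split.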


In [20]:
tutorial_dir = Path().absolute()
# settings.external_dir holds all registered external directories; in our case, there should always be just one, with the shortcut "external".
# settings.external_dir.external is a dictionary with information about that directory, hence the brackets to access the path.
external = settings.external_dir.external['path_dataset']
print(external)
specs_path = external / 'data' / name
SpecsGenerationAtlas(intermediates_dir,
                filters = filters,
                annotation_name = annotation_name,
                test_ratio = test_ratio,
                n_folds = n_folds,
                seed = seed,
                name = name,
                ).generate_dataset(external / 'data')

/omics/groups/OE0645/internal/data/htcdata/medium_test/external
['P160_OP124_2023_06_28_Experiment1', 'P162_OP126_2023_07_06_Experiment1', 'P163_OP127_2023_07_12_Experiment1']


## Lightning Class

The next step is to choose or build our Lightning class. The Lightning class (as in PyTorch Lightning) manages many aspects of training and can be customized by creating your own child class. Most notably, the Lightning class lets you specify your loss function.

For this walkthrough, we will use the htc default "LightningImage" class, which is the default class for training on full images (as opposed to patches, pixels, or superpixels). It calculates the loss as a weighted average of Dice loss and cross-entropy loss. See the htc "networkTraining" tutorial for more information on the Lightning class.
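
As a rough illustration of what a weighted average of Dice loss and cross-entropy loss means, the following pure-Python sketch computes both terms for a handful of samples. The function names, the `weight_ce` parameter and its default of 0.5 are assumptions made for this example; the actual htc implementation and weighting may differ.

```python
import math


def cross_entropy(probs, targets, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[t], eps)) for p, t in zip(probs, targets)) / len(targets)


def soft_dice_loss(probs, targets, n_classes, eps=1e-6):
    """1 minus the soft Dice score, averaged over classes."""
    scores = []
    for c in range(n_classes):
        intersection = sum(p[c] for p, t in zip(probs, targets) if t == c)
        pred_sum = sum(p[c] for p in probs)
        true_sum = sum(1 for t in targets if t == c)
        scores.append((2 * intersection + eps) / (pred_sum + true_sum + eps))
    return 1 - sum(scores) / n_classes


def combined_loss(probs, targets, n_classes, weight_ce=0.5):
    """Weighted average of cross-entropy and Dice loss (weights are assumptions)."""
    ce = cross_entropy(probs, targets)
    dice = soft_dice_loss(probs, targets, n_classes)
    return weight_ce * ce + (1 - weight_ce) * dice


targets = [0, 1]                       # true class per sample
perfect = [[1.0, 0.0], [0.0, 1.0]]     # predicted class probabilities
uncertain = [[0.6, 0.4], [0.3, 0.7]]
print(combined_loss(perfect, targets, n_classes=2))
print(combined_loss(uncertain, targets, n_classes=2))
```

Cross-entropy penalizes each sample individually, while Dice measures the overlap per class, so combining them balances per-sample accuracy against class-level overlap.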

## Config
The last step before training is to create our configuration file. This file is also a JSON file that contains important metadata, and it is used by the training process itself to configure training hyperparameters, like batch size and transformations. We will use the htc Config ***class*** to write the config ***json***.

The following code block writes the config JSON for you. By default, it stores the config file in the same directory as your dataset_settings JSON.

In [21]:
# Assign training hyperparameters
max_epochs = 1  # this can be whatever you want
batch_size = 20000  # batch size works differently for the median pixel model; here, it should be quite large.
# It cannot be 1, as this seems to break the batch norm steps.
# The default batch size is 3 subjects.
shuffle = True  # True: the batch generator retrieves different, random batches on every epoch.
# False: the same batches are used across epochs.
num_workers = "auto"  # number of data loading worker subprocesses to start. The optimal number for fast loading depends heavily on your system;
# you can experiment with low-epoch runs to see which num_workers maximizes your training speed.
# TODO: specialized sampling practices, such as guaranteeing an even distribution of organ classes
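
The remark about a batch size of 1 most likely relates to batch normalization, which normalizes each feature with the mean and variance computed across the batch. PyTorch's batch norm layers raise an error in training mode when they receive only one value per channel. The sketch below shows the underlying problem in plain Python; it is a simplified stand-in (no learnable scale or shift, no running statistics), not the layer used by htc.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize one feature across the batch dimension."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


print(batch_norm([0.2, 0.5, 0.9]))  # values are spread around zero
print(batch_norm([0.2]))            # single sample: [0.0]
print(batch_norm([0.9]))            # single sample: [0.0]
```

With a single sample, the mean equals the sample itself, so every input collapses to zero and the layer carries no information. This is why larger batches are needed here.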

For training, we use a label mapping to read and interpret the annotations. An existing label mapping is defined in the SettingsAtlas class, which we will use here for the tutorial.
If you have different requirements, you can either write the label mapping yourself, using the one in SettingsAtlas as a guide, or modify the one in SettingsAtlas.

Notably, it is possible to map multiple labels in your annotations to the same value for training, so that the model treats them as the same class. (The reverse, mapping one annotation label to multiple training classes, is not possible.) If you do so, make sure you also define mapping_index_name, so that it is clear which name should be recovered for that class.
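
The idea of a many-to-one mapping can be sketched with plain dictionaries. This is not the API of the htc LabelMapping class; the dictionaries, the helper functions and the selection of labels are assumptions made for illustration, with `index_to_name` playing the role that mapping_index_name has in the text above.

```python
# Hypothetical annotation labels; two of them are merged into one training class
label_to_index = {
    "kidney": 0,
    "kidney_with_Gerotas_fascia": 0,  # merged with "kidney"
    "liver": 1,
    "spleen": 2,
}

# A many-to-one mapping cannot be inverted automatically, so we state
# explicitly which name should be reported for each training index
index_to_name = {0: "kidney", 1: "liver", 2: "spleen"}


def encode(labels):
    """Translate annotation label names into training indices."""
    return [label_to_index[label] for label in labels]


def decode(indices):
    """Translate training indices back into the chosen class names."""
    return [index_to_name[i] for i in indices]


annotations = ["kidney_with_Gerotas_fascia", "liver", "kidney"]
indices = encode(annotations)
n_classes = len(set(label_to_index.values()))
print(indices, decode(indices), n_classes)
```

Note that the number of training classes is the number of distinct indices, not the number of annotation labels, which matters when setting the class count in the config.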

In [33]:
# Load the SettingsAtlas object, which contains (among other properties) a label mapping that can be used for training
Settings_Atlas = SettingsAtlas()
label_mapping_train = Settings_Atlas.label_mapping


In [25]:
config = Config("htc/tissue_atlas/median_pixel/configs/default.json")
#config["inherits"] = "/home/l328r/htc/htc/tissue_atlas/median_pixel/configs/default.json"
config["input/data_spec"] = specs_path
config["input/annotation_name"] = ["polygon#annotator1"]
config["validation/checkpoint_metric_mode"] = "class_level"



# We want to merge the annotations from all annotators into one label mask
config["input/merge_annotations"] = "union"

# Use the label mapping from SettingsAtlas for training.
# Everything which is >= settings.label_index_thresh will later be considered invalid
# Leaving the mapping as None will use the label IDs stored in the segmentation bloscs.
# A custom mapping could also be useful for combining multiple labels into one class, possibly without reloading the intermediates, e.g.:
#     "spleen": 0,
#     "gallbladder": 1,
#     "unlabeled": settings.label_index_thresh,
config["label_mapping"] = label_mapping_train
config["input/n_classes"] = 21  # currently needs to be set explicitly, otherwise 0 is assumed
# NOTE: it is still unclear how the background is handled/defined

#specify batch and sampler settings:
config['dataloader_kwargs/batch_size'] = batch_size
config['dataloader_kwargs/num_workers'] = num_workers

# Reduce the training time
config["trainer_kwargs/max_epochs"] = max_epochs

# Progress bars can cause problems in Jupyter notebooks so we disable them here (training does not take super long)
config["trainer_kwargs/enable_progress_bar"] = False

# Uncomment the following lines if you want to use one of the pretrained models as the basis for your training
# config["model/pretrained_model"] = {
#     "model": "image",
#     "run_folder": "2022-02-03_22-58-44_generated_default_model_comparison",
# }

config_path = external/'data'/ (name + "_config.json")
config.save_config(config_path)
JSON(config_path)

print(config_path)
print(label_mapping_train)

/omics/groups/OE0645/internal/data/htcdata/medium_test/external/data/Atlas_config.json
LabelMapping(stomach=0, small_bowel=1, colon=2, liver=3, gallbladder=4, pancreas=5, kidney=6, lung=7, heart=8, cartilage=9, bile_fluid=10, kidney_with_Gerotas_fascia=11, major_vein=12, peritoneum=13, muscle=14, skin=15, bone=16, omentum=17, bladder=18, spleen=19, uro_conduit=20)
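
The config keys used above, such as `"input/data_spec"`, are slash-separated paths into a nested JSON structure. The following sketch shows how such a key can be resolved with plain dictionaries. It is an illustration only and not the implementation of the htc Config class; the helper `set_nested` is hypothetical.

```python
import json


def set_nested(config: dict, key: str, value) -> None:
    """Store `value` under a slash-separated key, creating nested dicts as needed."""
    *parents, last = key.split("/")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[last] = value


config_dict = {}
set_nested(config_dict, "input/n_classes", 21)
set_nested(config_dict, "input/merge_annotations", "union")
set_nested(config_dict, "trainer_kwargs/max_epochs", 1)
print(json.dumps(config_dict, indent=2))
```

Opening the saved config file in a text editor and looking for this nesting is a quick way to confirm that your settings ended up where you expected.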


## Start the Training
You are now ready to train your network. Open the file named "training.sh" in this tutorial directory and set the config path variable to the path you generated in the previous cell (it is printed at the bottom of the cell output).

Then, in a terminal, from the root directory of the repository, run:
```bash
 chmod +x tutorials/Urology_group_tutorials/training.sh
 sh tutorials/Urology_group_tutorials/training.sh
```

Your training has now started! Depending on the number of epochs, it may take a while.

# Viewing Results

Once your training is complete, you can use htc code to view an analysis of your model.
Start by locating your training directory. Navigate to your results directory (the one you set with the PATH environment variable).
Inside the results directory, you can find your run in a path similar to the one below:
```bash
 training/<model_name>/<run_name>
```
Here, the run name is usually the timestamp of the training with the name of the config appended. The directory should contain a config.json, data.json, log.txt, and one fold directory for each fold you trained. If everything is there, run the following cell, replacing the input path with the absolute path to the run directory.
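
If you prefer to check the run directory programmatically, a small helper along the following lines can be used. The function `check_run_dir` is hypothetical, and the assumption that fold directories start with "fold" should be adjusted to the names you actually see in your run directory.

```python
from pathlib import Path

REQUIRED_FILES = ("config.json", "data.json", "log.txt")


def check_run_dir(run_dir: Path, fold_prefix: str = "fold") -> list[str]:
    """Return a list of problems found in a training run directory (empty if none)."""
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (run_dir / name).is_file()
    ]
    # Assumption: every fold has its own directory whose name starts with fold_prefix
    fold_dirs = [p for p in run_dir.iterdir() if p.is_dir() and p.name.startswith(fold_prefix)]
    if not fold_dirs:
        problems.append(f"no directory starting with {fold_prefix!r}")
    return problems


# Usage (replace with the absolute path to your own run directory):
# print(check_run_dir(Path("/path/to/results/training/<model_name>/<run_name>")))
```

An empty list means all expected files and at least one fold directory were found.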

In [29]:
!python htc/evaluation/run_table_generation.py --notebook htc/tissue_atlas/ExperimentAnalysisValidation.ipynb --input-path /omics/groups/OE0645/internal/data/htcdata/medium_test/results/training/median_pixel/2024-07-25_17-43-32_Atlas_config

starting at line 4
[INFO][htc] Will generate results for the following runs:   run_table_generation.py:453
[INFO][htc] median_pixel/2024-07-25_17-43-32_Atlas_config   run_table_generation.py:455
/omics/groups/OE0645/internal/data/htcdata/medium_test/results/training/median_pixel/2024-07-25_17-43-32
/omics/groups/OE0645/internal/data/htcdata/medium_test/results/training/median_pixel/2024-07-25_17-43-32
Check for necessary 

After the cell completes, you should see a new validation_table file and an ExperimentAnalysisValidation.html file in the run directory. Open the HTML file in a browser to see the analysis of your experiment.

Bear in mind that this only analyzes the validation set. If you are satisfied with your model and want test results, you must generate test predictions and run the same command as above, but with ExperimentAnalysis.ipynb.

Happy Training!!! : )