# Tissue Atlas Setup Walkthrough
This notebook walks a user through the process of setting up the DKFZ htc framework to work with data from the DKFZ tissue atlas dataset. The Tissue Atlas dataset is organized differently from the data used by the original htc framework, so a slightly edited version of the htc framework must be used. This version retains (or aims to retain) the ability to use default datasets from the original htc framework, but also allows users to access the tissue atlas.

This notebook will guide the user through the initial setup necessary to use the framework. After completing this notebook, a second notebook, titled "TissueAtlasTraining", will take you through actually using the data. The notebook is designed for use by a medical researcher with little to no Python experience, and it will walk you explicitly through all the steps you need to get started and run your own training and inference on a dataset of your choice.

Reminder: htc is only executable on Ubuntu. If you are running Windows, I recommend using WSL to get set up with Linux capabilities.


## PATH Environment Variables
In order to use the htc framework, you must first define PATH environment variables in an appropriate file. Please consult the README in the htc repository for a more detailed treatment of these environment variables. Here, we will provide a basic overview of how they work and instructions on how to use them.

PATH environment variables are variables that the htc framework uses to locate the dataset(s) that you want it to use. First, you must create an environment file. If you have cloned the htc framework as suggested in the README, simply navigate to the repository's root directory (it should be named htc). In this directory, create a new file named ".env".

For example, in a bash terminal, run:

```bash
cd ~/path/to/my/htc
nano .env
```

Replace "~/path/to/my/htc" with the actual path on your system to the cloned htc repository.

Now, you must define your PATH variables. The htc framework uses a specific naming convention, so please follow the steps carefully.

### PATH to dataset(s):
For our purposes, there are two types of PATH environment variables used by the framework. The first is the PATH_Tivita variable, which tells the framework where to look for the dataset that you wish to use. First, you must find the path on your system to the dataset that you want to use. In the current iteration of the framework, a "dataset" has a specific meaning: a dataset is a directory in the larger Tissue Atlas folder structure that itself contains a directory titled "data", which is where the imaging data is actually kept. (This specific structure is a vestige of the original framework's close coupling with the default dataset design, so while it may seem odd to the user, it is motivated on the back end.)
Importantly, the "dataset" path should not point to the "data" directory itself. In the current iteration, it must also not point to a parent or grandparent directory. (The author aims to support this soon. However, in the current structure, you should only need to point to one "dataset" because we have only binary annotations for each class.)

For example, the following path is a valid path to a dataset:

```bash
 ~/TIVITA_Cat/Cat_Pig/Cat_atlas/Cat_0002_small_bowel
```
However, the next two are not valid "dataset" paths:

```bash
 ~/TIVITA_Cat/Cat_Pig/Cat_atlas/Cat_0002_small_bowel/data
 ~/TIVITA_Cat/Cat_Pig/Cat_atlas
```
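
The rule above can be checked mechanically before anything is written to .env. The following is a minimal sketch and not part of the htc framework: `is_valid_dataset_dir` is a hypothetical helper that only tests the two conditions described in this section (the directory directly contains a "data" subdirectory and is not itself the "data" directory).

```python
from pathlib import Path


def is_valid_dataset_dir(path: Path) -> bool:
    """Return True if `path` looks like a "dataset" directory as described above.

    Hypothetical helper (not part of htc): it only checks that the directory
    directly contains a "data" subdirectory and is not itself named "data".
    """
    path = Path(path).expanduser()  # resolve a leading "~" to the home directory
    return path.name != "data" and (path / "data").is_dir()
```

On a system with the layout used in the examples, the call should return `True` for the `Cat_0002_small_bowel` directory and `False` for its `data` subdirectory. It also returns `False` for `Cat_atlas`, assuming that directory has no "data" folder directly inside it.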

Once you have found your dataset path, you can define a PATH environment variable by copying and pasting the following code into the .env file (replace the path with your own):
```bash
 PATH_Tivita_Cat_0002_small_bowel="~/TIVITA_Cat/Cat_Pig/Cat_atlas/Cat_0002_small_bowel:shortcut=smallbowel"
```
The shortcut here provides an extra way to access your dataset further in the framework; name it how you like. However, the variable name is NOT trivial. It must follow the form:
```bash
 PATH_Tivita_<your_dataset_directory_name>
```
If the variable name does not match the path it is handed, the framework will not work.
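
To avoid typing the variable name by hand, the line for .env can be derived from the dataset path. This is only a sketch under the naming convention stated above; `tivita_env_line` is a hypothetical helper and not a function provided by htc.

```python
from pathlib import Path


def tivita_env_line(dataset_dir: str, shortcut: str) -> str:
    """Build a .env line whose variable name matches the dataset directory name.

    Hypothetical helper (not part of htc); it follows the
    PATH_Tivita_<your_dataset_directory_name> convention described above.
    """
    name = Path(dataset_dir).name  # last path component, e.g. "Cat_0002_small_bowel"
    return f'PATH_Tivita_{name}="{dataset_dir}:shortcut={shortcut}"'


# Reproduces the example line shown earlier in this section
print(tivita_env_line("~/TIVITA_Cat/Cat_Pig/Cat_atlas/Cat_0002_small_bowel", "smallbowel"))
```

Copy the printed line into your .env file.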

If you have multiple datasets that you would like to use, simply add them in the same way to your .env file. Finally, when you are done, run the following in the root of your htc directory:
```bash
source .env
```
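You can also copy this source line to your .bashrc, so that the .env file is automatically sourced every time you open a new terminal. In that case, use the full path to the .env file, because .bashrc is not run from the htc directory.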

## PATH to Results

You also need to add a path to a "Results" folder. This is where the framework will send the output of training or inference tasks. The setup is pretty much the same as with the dataset, except simpler:
```bash
 PATH_HTC_RESULTS="~/path/to/results"
```
Do not specify a shortcut for the results path. You don't need one, and the framework will not recognize it. If you want to have multiple results folders, you can do that as well; for instructions, please consult the README.md in the htc repository.

## PATH to External

The original htc framework is designed and strongly integrated around an expected dataset structure. In particular, the original htc expects a dataset to be divided into "intermediates" and "data", and expects to find a specific JSON file named "dataset_settings.json" within the "data" folder (more on that later). However, with the tissue atlas we are somewhat stuck with the structure we have, and we cannot go adding folders and JSON files here and there.

Instead, this modified framework allows you to specify an extra PATH environment variable to a directory named "external". This directory is in many ways interpreted by the framework as a dataset, and should mimic the expected structure of the dataset (with intermediates and data subdirectories). However, it contains no core data, and instead houses all the extra metadata, preprocessed data, and configuration files needed by the framework. The modified framework is written so that if it detects an external directory (as set with a PATH variable), the external directory will take precedence over the dataset directory: the framework will use the external directory for all the "extras", and the dataset directory only for the contents of the "data" subdirectory.

If this is confusing, don't worry. All you need to know to use it is that the external directory is a workaround that allows us to build our extra files and information in a location of our choosing that is not relative to the dataset.

To define the variable, create an "external" directory at your desired location on your system. Then write in .env:

```bash
 PATH_HTC_EXTERNAL="~/path/to/external:shortcut=myshortcut"
```
Again, you may create a shortcut of your choosing.
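
With all three kinds of variables defined, it can help to sanity-check the .env file before moving on. The sketch below is an illustration only: `parse_env_text` and `missing_variables` are hypothetical helpers, and this is not the parser that the htc framework itself uses. It assumes each entry has the form `NAME="path:shortcut=name"` with the shortcut part optional, as in the examples above.

```python
def parse_env_text(text: str) -> dict[str, dict[str, str]]:
    """Parse lines of the form NAME="path:shortcut=name" into a dictionary.

    Hypothetical helper for sanity-checking a .env file; this is not the
    parser used by the htc framework.
    """
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments and anything that is not an assignment
        name, value = line.split("=", 1)
        value = value.strip().strip('"')
        # Split off the optional ":shortcut=<name>" suffix (empty string if absent)
        path, _, shortcut = value.partition(":shortcut=")
        entries[name.strip()] = {"path": path, "shortcut": shortcut}
    return entries


def missing_variables(entries: dict) -> list[str]:
    """List the kinds of variables described in this section that are absent."""
    missing = []
    if not any(name.startswith("PATH_Tivita_") for name in entries):
        missing.append("PATH_Tivita_<your_dataset_directory_name>")
    for required in ("PATH_HTC_RESULTS", "PATH_HTC_EXTERNAL"):
        if required not in entries:
            missing.append(required)
    return missing


example = """
PATH_Tivita_Cat_0002_small_bowel="~/TIVITA_Cat/Cat_Pig/Cat_atlas/Cat_0002_small_bowel:shortcut=smallbowel"
PATH_HTC_RESULTS="~/path/to/results"
"""
print(missing_variables(parse_env_text(example)))  # → ['PATH_HTC_EXTERNAL']
```

To check your own file, pass its contents instead of `example`, for instance `Path(".env").read_text()` after importing `Path` from `pathlib`.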

# Building your external directory

Now that we have defined an external directory, we need to make sure it has all the right contents. This consists of creating the "data" and "intermediates" subdirectories, and populating them with the appropriate files.

The following code cells will do just that for you. Please run them, filling in your own values where indicated.

In [1]:
#imports
from pathlib import Path

import json

import pandas as pd
from IPython.display import JSON
from typing import TYPE_CHECKING, Any, Callable, Union, Self
from htc import decompress_file, read_meta_file, read_tivita_hsi, settings
from htc.tivita.DataPath import DataPath
from htc.tivita.metadata import generate_metadata_table
# specific preprocessing files
from htc.data_processing.run_l1_normalization import L1Normalization

Please replace `test_dataset11june` in the data_dir definition with the shortcut you chose for your desired dataset:

In [5]:
#all variables are type: Path
data_dir = settings.data_dirs.test_dataset11june
external_dir = settings.external_dir['PATH_HTC_EXTERNAL']['path_dataset']
output_dir = settings.intermediates_dirs['PATH_HTC_EXTERNAL']

In [6]:
def populate_external(path: Path) -> None:
    """
    Populate the external directory with appropriate structure

    Args:
        path: Path to the external directory.
    """
    subdirectories = ["data", "intermediates"]
    intermediates_subdirectories = ["preprocessing", "rgb_crops", "segmentations", "tables"]
    preprocessing_subdirectories = ["L1", "parameter_images"]
    for subdir in subdirectories:
        subdir_path = path / subdir
        subdir_path.mkdir(parents=True, exist_ok=True)
        print(f"Created directory: {subdir_path}")
        if subdir == "intermediates":
            for int_subdir in intermediates_subdirectories:
                int_subdir_path = subdir_path / int_subdir
                int_subdir_path.mkdir(parents=True, exist_ok=True)
                print(f"Created directory: {int_subdir_path}")
                if int_subdir == "preprocessing":
                    for pre_subdir in preprocessing_subdirectories:
                        pre_subdir_path = int_subdir_path / pre_subdir
                        pre_subdir_path.mkdir(parents=True, exist_ok=True)
                        print(f"Created directory: {pre_subdir_path}")
    

# Example usage:
populate_external(external_dir)

Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/data
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates/preprocessing
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates/preprocessing/L1
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates/preprocessing/parameter_images
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates/rgb_crops
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates/segmentations
Created directory: /home/lucas/dkfz/htc/tests/test_11jun/external/intermediates/tables
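
If you want to confirm the result later (for example, after moving the external directory), the structure can be checked without recreating it. This is a minimal sketch: `missing_external_subdirs` is a hypothetical helper, and the list of expected subdirectories is copied from the `populate_external` cell above rather than taken from the htc framework.

```python
from pathlib import Path

# Relative paths created by populate_external in the cell above
EXPECTED_SUBDIRS = [
    "data",
    "intermediates",
    "intermediates/preprocessing",
    "intermediates/preprocessing/L1",
    "intermediates/preprocessing/parameter_images",
    "intermediates/rgb_crops",
    "intermediates/segmentations",
    "intermediates/tables",
]


def missing_external_subdirs(external_root: Path) -> list[str]:
    """Return the expected subdirectories that do not exist under external_root."""
    return [rel for rel in EXPECTED_SUBDIRS if not (Path(external_root) / rel).is_dir()]
```

Calling `missing_external_subdirs(external_dir)` after the cell above should return an empty list; any entries it does return are the directories that still need to be created.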


### Specify Dataset Settings

The next step is to populate the "data" directory of the external directory with a dataset_settings.json file, which contains some important metadata about your dataset. This is used by the framework to facilitate data loading, and to provide some important functionalities with the data.




You can add more information to the .json if you like, such as subject or annotator mapping, but this tutorial does not yet cover those options.

In [21]:
import shutil
import os
tutorial_dir = Path(os.getcwd())
default_dataset_settings = tutorial_dir / "setup_dataset_settings.json"
destination = external_dir/"data/dataset_settings.json"
shutil.copy(default_dataset_settings, destination)

PosixPath('/home/lucas/dkfz/htc/tests/test_11jun/external/data/dataset_settings.json')
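
If you edit the copied file by hand, a quick check that it is still valid JSON can save a confusing error later. The sketch below uses only the Python standard library; `load_settings` is a hypothetical helper, and the requirement that the file holds a JSON object at the top level is an assumption about dataset_settings.json rather than something this tutorial specifies.

```python
import json
from pathlib import Path


def load_settings(path: Path) -> dict:
    """Load a JSON settings file and fail early if it is not a JSON object.

    Hypothetical helper (not part of htc). Assumes the settings file stores
    a JSON object (key/value pairs) at the top level.
    """
    with Path(path).open(encoding="utf-8") as f:
        content = json.load(f)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(content, dict):
        raise ValueError(f"{path} does not contain a JSON object at the top level")
    return content
```

For example, `sorted(load_settings(destination))` lists the top-level keys of the file copied in the cell above.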

Now we need to fill these subdirectories with the appropriate data.

In [22]:
# get list of paths
paths = list(DataPath.iterate(data_dir, filters=None, annotation_name=None))  # might have to change kwargs

external directory found


In [23]:
# create L1-normalized .blosc files
L1Normalization(
    paths=paths, file_type="blosc", output_dir=(output_dir / "preprocessing"), regenerate=False
).run()

Output()

In [24]:
# create meta tables
dataset_name = data_dir.parent.name
tables_dir = output_dir / "tables"
tables_dir.mkdir(parents=True, exist_ok=True)
meta_dataframe = generate_metadata_table(paths)
meta_dataframe.to_csv(tables_dir / f"{dataset_name}@meta.csv")
meta_dataframe.to_feather(tables_dir / f"{dataset_name}@meta.feather")


# Tutorial Incomplete: missing segmentations section (ask htc authors)

## Loading Data

We will end with a brief demonstration of how the framework loads data. In the Training notebook, this process will be rolled into other class constructors. However, there are many functionalities for data analysis that build on it, such as retrieving and plotting median reflectance across the HSI spectrum for an organ (see the original htc "General" tutorial for more info).

Start by importing necessary packages, and defining the Path object to your dataset_settings json.

In [25]:
%load_ext autoreload
%autoreload 2
from pathlib import Path

from htc import settings
from htc.tivita.DataPath import DataPath

There are then 3 ways to build DataPaths to the images. This notebook only demonstrates the first. One path should always represent just one timestamp (image) directory.

In [26]:
# 1 via iteration: main tool to access images
# replace test_dataset11june with the shortcut you defined earlier in the PATH variable

paths = list(DataPath.iterate(settings.data_dirs.test_dataset11june))
[p.timestamp for p in paths[:10]]

external directory found


['2021_04_15_09_22_02', '2021_04_28_08_49_12']

Now that you are set up, you can go to the "TissueAtlasTraining.ipynb" notebook to begin training a model on your dataset.