# Adapt your dataset

## NOTE:
`mml` handles data preprocessing internally, so there is no need to preprocess any data manually in advance. See the `preprocess` mode.

Assuming you have previously used your dataset as a plain PyTorch `Dataset`, migrating to `mml` works as follows:




## Step 1: Write your DsetCreator and TaskCreator

`mml` distinguishes between "datasets" and "tasks". A dataset contains all data (possibly plus meta information, additional tasks on the same data, additional test samples, etc.), whereas a task only describes which samples and labels of the dataset belong to that specific task. Many convenience functions simplify this process.
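
The distinction can be pictured with a small, library-independent sketch. The classes `ToyDataset` and `ToyTask` below are hypothetical stand-ins invented for illustration; they are not part of the `mml` API and only mirror the idea that a task is a description pointing into a shared pool of data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Holds every sample once, with all labels known for it (hypothetical stand-in)."""

    samples: dict[str, dict[str, object]] = field(default_factory=dict)


@dataclass
class ToyTask:
    """Describes which samples and which label of a dataset belong to one task."""

    name: str
    label_key: str
    sample_ids: list[str]

    def resolve(self, dataset: ToyDataset) -> list[tuple[str, object]]:
        # Look up only the samples and the label this task refers to.
        return [(sid, dataset.samples[sid][self.label_key]) for sid in self.sample_ids]


dataset = ToyDataset(
    samples={
        "img_001": {"organ": "larynx", "healthy": True},
        "img_002": {"organ": "larynx", "healthy": False},
        "img_003": {"organ": "stomach", "healthy": True},
    }
)
# Two tasks share the same underlying data but select different labels and samples.
organ_task = ToyTask(name="organ", label_key="organ", sample_ids=["img_001", "img_002", "img_003"])
health_task = ToyTask(name="larynx_health", label_key="healthy", sample_ids=["img_001", "img_002"])

print(organ_task.resolve(dataset))
print(health_task.resolve(dataset))
```

Storing the data once and letting several task descriptions point into it is what makes "additional tasks on the same data" cheap to add.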

### Example: Reusing your previous dataset definition

In this example we use some `torchvision` dataset to be integrated into `mml`, but it may be fully replaced with your existing dataset class.

```python
from mml.api import (
    DSetCreator,
    TaskCreator,
    get_iterator_and_mapping_from_image_dataset,
    TaskType,
    Keyword,
    License,
    Modality,
    DataKind,
)
from torchvision.datasets import STL10
```

```python
REFERENCE = """
@inproceedings{Coates2011AnAO,
  title={An Analysis of Single-Layer Networks in Unsupervised Feature Learning},
  author={Adam Coates and A. Ng and Honglak Lee},
  booktitle={AISTATS},
  year={2011}
}"""

# download the data via torchvision and extract it into the mml dataset folder
dset_creator = DSetCreator(dset_name="STL_10_DEMO")
train = STL10(root=dset_creator.download_path, split="train", download=True)
test = STL10(root=dset_creator.download_path, split="test", download=True)
dset_path = dset_creator.extract_from_pytorch_datasets(
    datasets={"training": train, "testing": test}, task_type=TaskType.CLASSIFICATION, class_names=train.classes
)
# describe the task and attach its meta information
task_creator = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="STL_10_DEMO",
    desc="STL-10 image recognition task",
    ref=REFERENCE,
    url="https://cs.stanford.edu/~acoates/stl10/",
    instr="downloaded via torchvision dataset (https://pytorch.org/vision/stable/generated/torchvision.datasets.STL10.html#torchvision.datasets.STL10)",
    lic=License.UNKNOWN,
    release="2011",
    keywords=[Keyword.NATURAL_OBJECTS],
)
# collect the extracted samples for the train and test splits
train_iterator, idx_to_class = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "training_data", classes=train.classes
)
test_iterator, idx_to_class_2 = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "testing_data", classes=test.classes
)
assert all([a == b for a, b in zip(idx_to_class, idx_to_class_2)])
task_creator.find_data(train_iterator=train_iterator, test_iterator=test_iterator, idx_to_class=idx_to_class)
task_creator.auto_complete()
```

That's it! From now on you can reference your task as `stl10` (the value provided to `alias=` in the `TaskCreator`; the snippet above sets only `name="STL_10_DEMO"`, so use that name if you do not pass an alias).
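
One detail of the consistency check in the code above is worth spelling out. Assuming the two mappings are plain dicts (as the hand-written `idx_to_class` later in this guide is), `zip()` over them iterates the keys only, so differing class names would go unnoticed. The sketch below uses made-up stand-in mappings to show the difference from a direct dict comparison.

```python
# Stand-in mappings; in the example above they come from
# get_iterator_and_mapping_from_image_dataset (assumed here to be plain dicts).
train_mapping = {0: "airplane", 1: "bird", 2: "car"}
test_mapping = {0: "airplane", 1: "car", 2: "bird"}  # same keys, swapped names

# Iterating a dict yields its keys, so zip() compares the indices only.
keys_only_check = all(a == b for a, b in zip(train_mapping, test_mapping))

# Comparing the dicts directly also compares the class names (and the lengths).
full_check = train_mapping == test_mapping

print(keys_only_check)  # True, although the names differ
print(full_check)       # False
```

If the stricter behaviour is wanted, `assert idx_to_class == idx_to_class_2` would be the corresponding one-line check.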

### Example: DSetCreator when using public data

In this case we recommend implementing the `DSetCreator` from scratch, including the download of the data, since this improves reproducibility. The following convenience functions are available so far:

 - `DSetCreator.download()` to download given a URL
 - `DSetCreator.kaggle_download()` to download given a kaggle dataset ID or competition ID
 - `DSetCreator.verify_pre_download()` if parts of the data have to be downloaded manually (e.g. access only after registration)
 - `DSetCreator.unpack_and_store()` to extract the data from archive formats; call it after any of the above
 - `DSetCreator.transform_masks()` to transform masks (e.g. segmentation masks), if necessary, so that they fit the `mml` requirements

### Example: TaskCreator, writing your own data iterator

If `get_iterator_and_mapping_from_image_dataset` does not fit your data structure, you can simply write an iterator yourself, as in this example:


```python
# download and unpack the archive
dset_creator = DSetCreator(dset_name="laryngeal_DEMO")
dset_creator.download(
    url="https://zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    file_name="laryngeal dataset.tar",
    data_kind=DataKind.TRAINING_DATA,
)
dset_path = dset_creator.unpack_and_store()
laryngeal_tissue = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="laryngeal_DEMO",
    desc="Laryngeal dataset for patches of healthy and early-stage cancerous laryngeal tissues",
    ref="...",
    url="https://nearlab.polimi.it/medical/dataset/",
    instr="download via zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    lic=License.CC_BY_NC_4_0,
    release="2017",
    keywords=[Keyword.MEDICAL, Keyword.LARYNGOSCOPY, Keyword.TISSUE_PATHOLOGY, Keyword.ENDOSCOPY],
)
classes = ["Hbv", "He", "IPCL", "Le"]
folds = ["FOLD 1", "FOLD 2", "FOLD 3"]
# one dict per sample: an ID, the image path and the class index
data_iterator = []
for fold in folds:
    root = dset_path / "training_data" / "laryngeal dataset" / fold
    folders = [p.name for p in root.iterdir() if p.is_dir()]
    assert all([cl in folders for cl in classes]), "some class folder does not exist"
    for class_folder in root.iterdir():
        assert class_folder.is_dir()
        if class_folder.name not in classes:
            continue
        for img_path in class_folder.iterdir():
            data_iterator.append(
                {
                    Modality.SAMPLE_ID: img_path.stem,
                    Modality.IMAGE: img_path,
                    Modality.CLASS: classes.index(class_folder.name),
                }
            )
idx_to_class = {classes.index(cl): cl for cl in classes}
laryngeal_tissue.find_data(train_iterator=data_iterator, idx_to_class=idx_to_class)
laryngeal_tissue.auto_complete()
```
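
When writing an iterator by hand, a quick sanity check before calling `find_data` can catch mistakes early. The helper below is a minimal sketch and not part of `mml`: `check_iterator` is a hypothetical function, the string keys stand in for `Modality.SAMPLE_ID` and `Modality.CLASS`, and the idea that sample IDs should be unique is an assumption rather than a documented `mml` requirement.

```python
from collections import Counter
from pathlib import Path


def check_iterator(entries, id_key, class_key, idx_to_class):
    """Return a list of human-readable problems found in a data iterator."""
    problems = []
    # Sample IDs that occur more than once (e.g. identical file names in different folds).
    counts = Counter(entry[id_key] for entry in entries)
    for sample_id, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"sample id {sample_id!r} appears {n} times")
    # Class indices that have no name in the mapping.
    unknown = sorted({entry[class_key] for entry in entries} - set(idx_to_class))
    for idx in unknown:
        problems.append(f"class index {idx} missing from idx_to_class")
    return problems


# Toy entries with string keys standing in for Modality.SAMPLE_ID / Modality.CLASS.
toy_entries = [
    {"sample_id": Path("FOLD 1/He/0001.png").stem, "class": 1},
    {"sample_id": Path("FOLD 2/He/0001.png").stem, "class": 1},  # same stem as above
    {"sample_id": Path("FOLD 1/Le/0002.png").stem, "class": 4},  # index not in mapping
]
toy_mapping = {0: "Hbv", 1: "He", 2: "IPCL", 3: "Le"}

for problem in check_iterator(toy_entries, "sample_id", "class", toy_mapping):
    print(problem)
```

If duplicate stems do turn up across folds, one possible remedy is to build the ID from both the fold name and the file stem.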

### Example: Multiple tasks per Dataset

This example will be added later.

## Step 2: BONUS - automate the task creation

`mml` has a `create` mode to generate tasks automatically. If set up correctly, the above datasets are downloaded and prepared automatically when calling `mml create tasks=example` (assuming `example.yaml` is already provided). This is most convenient when working from within `mml` itself rather than using it as a library. The library route is nevertheless possible and allows any other `mml` user who has installed your package to quickly start on your data and code:

 - make your code installable as a package; this requires a `pyproject.toml` and a `setup.cfg` file
 - decorate the `DSetCreator` with `@register_dsetcreator` and your `TaskCreator` with `@register_taskcreator`
 - add an `activate.py` script to the root of your package's source code
 - import the module (file) that defines the creators within this file
 - in your `setup.cfg` (or `setup.py` or `pyproject.toml`, see [here](https://setuptools.pypa.io/en/latest/userguide/entry_point.html?highlight=entry_points#entry-points-syntax)) provide the correct entry point for `mml`

```cfg
[options.entry_points]
mml.plugins =
    some_key = your_package:activate
```

 - replace `some_key` with a descriptive id and `your_package` with the name of your package; `activate` points to the `activate.py` file created above
 - These tasks are now always linked when calling `mml task_list=[stl10,laryngeal_tissues]` 🎉
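
After installing the package, it can be useful to confirm that the entry point is actually visible to Python. The sketch below relies only on the standard library's `importlib.metadata`; the helper name `list_plugin_entry_points` is made up for this example, and the group name `mml.plugins` is taken from the config snippet above. How `mml` itself loads these entry points is not shown here.

```python
from importlib.metadata import entry_points


def list_plugin_entry_points(group="mml.plugins"):
    """Return sorted (name, value) pairs of all installed entry points in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, value in list_plugin_entry_points():
    print(f"{name} -> {value}")
```

If your key does not show up in the printed list, the package is most likely not installed in the active environment or the entry point section is misspelled.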



## Step 3: BONUS - add your task(s) to a tasks config file

To refer to your task(s) later on, create a tasks config file in a configs folder that is linked to `mml`:

 - if you cloned `mml`, just navigate into `configs/tasks`
 - if you are writing your own package, create a `configs` folder (ideally at your package root level) and add a `tasks` folder inside
 - create a new file `example.yaml` with the following content

```yaml
# @package _global_

tasks:
  - 'stl10'
  - 'laryngeal_tissues'

pivot:
  name: False
  tags: ''

tagging:
  all: False
  variants: []
```

 - add something like the following to your `activate.py` (see the previous step)

```python
from hydra.core.config_search_path import ConfigSearchPath
from hydra.core.plugins import Plugins
from hydra.plugins.search_path_plugin import SearchPathPlugin


# register plugin configs
class MMLINSERTPLUGINNAMESearchPathPlugin(SearchPathPlugin):
    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        # Sets the search path for mml with copied config files
        search_path.append(
            provider="mml-???", path="pkg://mml_???.configs"
        )


Plugins.instance().register(MMLINSERTPLUGINNAMESearchPathPlugin)
```

 - of course, you have to replace `INSERTPLUGINNAME` and `mml-???` / `mml_???` with the names of your own plugin and package

These tasks are now always linked when calling `mml tasks=example` 🎉