# Advanced Training Pipeline

Once the pipeline is set up, we can focus on trying different combinations of data and models. To do that, the pipeline must be easy to adjust for a specific task: we should be able to run it with different models, settings, and other components. If we wanted to run 100 experiments with various combinations, writing a separate script for each one would be tedious and time-consuming, and maintaining many near-identical pipelines or configurations would be just as hard.

<img src="https://raw.githubusercontent.com/facebookresearch/hydra/master/website/static/img/Hydra-Readme-logo2.svg" alt="logo" width="40%" />

In this section, I will demonstrate how we can create complex configurations and run multiple experiments with just one command using [Hydra](https://github.com/facebookresearch/hydra), a tool developed by Facebook Research. Hydra simplifies the process by providing a convenient way to manage configurations. We will begin by running a basic experiment with the default settings and then proceed to run multiple experiments effortlessly using Hydra. Finally, we will combine Hydra with the [Optuna Framework](https://optuna.org/) to perform a Hyperparameter Sweep, allowing us to run thousands of experiments to find the best configuration for our model.

This section is inspired by [ashleve/lightning-hydra-template](https://github.com/ashleve/lightning-hydra-template). Please read the repository to fully understand the Lightning + Hydra setup.

In [3]:
!pip install -q hydra-core hydra-colorlog timm



Set up the necessary paths and move to the root directory so that we can access `train.py`.

In [1]:
import os
import sys

ROOT_DIR = os.path.dirname(os.path.abspath(''))
DATA_DIR = os.path.join(ROOT_DIR, 'data/food-101-tiny')

TRAIN_DATA_PATH = os.path.join(DATA_DIR, 'train')
VAL_DATA_PATH = os.path.join(DATA_DIR, 'valid')

sys.path.append(ROOT_DIR)
%cd $ROOT_DIR

/home/haritsahm/Documents/Getting Started


## Hydra Configuration Pipeline

In Hydra, you can create separate configuration files for different parts of your project or experiment. Each file contains settings and parameters specific to that part. For example, you might have one file for model configuration, another for dataset configuration, and so on. Hydra allows you to define a hierarchical structure for these configuration files. You can specify relationships and dependencies between them. For instance, you can have a base configuration file that contains common settings shared by all other configurations, and then have specialized configuration files that override or extend the base settings.

When you run your experiment or application, Hydra combines these configuration files into a single, cohesive configuration. It merges the settings from the different files according to their hierarchy, with configs that come later in the composition order overriding earlier ones. The result is a composition of all the specified settings that captures the specific requirements of your experiment. This modular approach makes it easy to manage and reuse configurations across different experiments or projects. The trade-off is that the software itself can be **challenging** to manage, since it's built using OOP principles.
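To make the composition idea concrete, here is a minimal pure-Python sketch of a recursive merge in which later settings win. It is not Hydra's implementation; `merge_configs`, `base_config`, and `experiment_config` are hypothetical names used only for illustration.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override values win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the old value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base settings shared by every experiment.
base_config = {
    "trainer": {"max_epochs": 10, "accelerator": "cpu"},
    "data": {"batch_size": 16},
}
# Hypothetical specialised config that states only what differs.
experiment_config = {
    "trainer": {"accelerator": "gpu", "precision": 16},
}

composed = merge_configs(base_config, experiment_config)
print(composed)
```

Only the keys named in the specialised config change; everything else is inherited from the base, which is why a small experiment file can stand in for a full configuration.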

For more details on how to use hydra, please read the [documentation](https://hydra.cc/docs/intro/).

### Python Scripts

We're going to wrap the training pipeline in `src/train_pipeline.py`. Before that, we're going to copy all of the Lightning code that we developed in the previous section into `src/`.

<details>
<summary><b>Training pipeline code snippet</b></summary>

```python
def train(cfg: DictConfig) -> Tuple[dict, dict]:
    """Trains the model. Can additionally evaluate on a testset, using best weights obtained during
    training.

    This method is wrapped in optional @task_wrapper decorator, that controls the behavior during
    failure. Useful for multiruns, saving info about the crash, etc.

    Args:
        cfg (DictConfig): Configuration composed by Hydra.

    Returns:
        Tuple[dict, dict]: Dict with metrics and dict with all instantiated objects.
    """

    # set seed for random number generators in pytorch, numpy and python.random
    if cfg.get("seed"):
        L.seed_everything(cfg.seed, workers=True)

    log.info(f"Instantiating datamodule <{cfg.data._target_}>")
    datamodule: LightningDataModule = hydra.utils.instantiate(cfg.data)

    log.info(f"Instantiating model <{cfg.model._target_}>")
    model: LightningModule = hydra.utils.instantiate(cfg.model)

    log.info("Instantiating callbacks...")
    callbacks: List[Callback] = utils.instantiate_callbacks(cfg.get("callbacks"))

    log.info("Instantiating loggers...")
    logger: List[Logger] = utils.instantiate_loggers(cfg.get("logger"))

    log.info(f"Instantiating trainer <{cfg.trainer._target_}>")
    trainer: Trainer = hydra.utils.instantiate(cfg.trainer, callbacks=callbacks, logger=logger)

    object_dict = {
        "cfg": cfg,
        "datamodule": datamodule,
        "model": model,
        "callbacks": callbacks,
        "logger": logger,
        "trainer": trainer,
    }

    if logger:
        log.info("Logging hyperparameters!")
        utils.log_hyperparameters(object_dict)

    if cfg.get("compile"):
        log.info("Compiling model!")
        model = torch.compile(model)

    if cfg.get("train"):
        log.info("Starting training!")
        trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))

    train_metrics = trainer.callback_metrics

    if cfg.get("test"):
        log.info("Starting testing!")
        ckpt_path = trainer.checkpoint_callback.best_model_path
        if ckpt_path == "":
            log.warning("Best ckpt not found! Using current weights for testing...")
            ckpt_path = None
        trainer.test(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
        log.info(f"Best ckpt path: {ckpt_path}")

    test_metrics = trainer.callback_metrics

    # merge train and test metrics
    metric_dict = {**train_metrics, **test_metrics}

    return metric_dict, object_dict

```
</details>
<br>

The function above will construct the required modules to execute the training pipeline, e.g. `datamodule`, `model`, `callbacks`, `logger`, and `trainer`.

### Configuration Files

Next, we're going to define the configuration for every module or function that we're going to use in `configs/`.

```yaml
configs/
  train.yaml            -> Main configuration file
  trainer/              -> Trainer configs
  model/                -> Model configs
  logger/               -> Logger configs
  hparams_search/       -> Hyperparameter search configs
  experiment/           -> Experiment configs
  data/                 -> Data configs
  callbacks/            -> Callback configs
```

### How It Works

All PyTorch Lightning modules are dynamically instantiated from module paths specified in config. Example model config:

```yaml
_target_: src.models.ClassificationLightningModule
num_classes: 10
lr: 0.0001
net:
  _target_: src.models.ResNet18
  input_channels: 3
  num_classes: ${..num_classes}
```

Using this config we can instantiate the object with the following line:

```python
model = hydra.utils.instantiate(config.model)
```

This allows you to easily iterate over new models! Every time you create a new one, just specify its module path and parameters in the appropriate config file. <br>
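The idea behind `_target_` can be sketched with the standard library alone. The snippet below is an illustration under simplifying assumptions, not the code of `hydra.utils.instantiate`: it imports the dotted path, instantiates nested configs first, and passes the remaining keys as keyword arguments. Standard-library classes stand in for the project's model classes, and this local `instantiate` function is hypothetical.

```python
import importlib
from typing import Any


def instantiate(config: Any) -> Any:
    """Build the object described by a dict that carries a `_target_` key."""
    if not isinstance(config, dict):
        return config  # plain values are passed through unchanged
    # Instantiate nested configs first so they arrive as ready-made objects.
    kwargs = {key: instantiate(value) for key, value in config.items() if key != "_target_"}
    if "_target_" not in config:
        return kwargs
    # Split "package.module.ClassName" into module path and attribute name.
    module_path, _, attr_name = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Stand-ins: SimpleNamespace plays the Lightning module, Fraction plays the network.
model_config = {
    "_target_": "types.SimpleNamespace",
    "lr": 0.0001,
    "net": {
        "_target_": "fractions.Fraction",
        "numerator": 3,
        "denominator": 4,
    },
}

model = instantiate(model_config)
print(type(model).__name__, model.lr, model.net)
```

Because the class is looked up by name at run time, adding a new model only requires a new config file that points at its import path.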

Switch between models and datamodules with command line arguments:

```bash
python train.py model=timm
```

### Main Config

Location: [configs/train.yaml](../configs/train.yaml) <br>
The main project config contains the default training configuration.<br>
It determines how the config is composed when simply executing the command `python train.py`.<br>

<details>
<summary><b>Show main project config</b></summary>

```yaml
# order of defaults determines the order in which configs override each other
defaults:
  - _self_
  - data: food101.yaml
  - model: resnet18.yaml
  - callbacks: default.yaml
  - logger: null # set logger here or use command line (e.g. `python train.py logger=csv`)
  - trainer: default.yaml

  # experiment configs allow for version control of specific hyperparameters
  # e.g. best hyperparameters for given model and datamodule
  - experiment: null

  # config for hyperparameter optimization
  - hparams_search: null

work_dir: ${hydra:runtime.cwd}

# task name, determines output directory path
task_name: "train"

# path to data directory
data_dir: data/

# tags to help you identify your experiments
# you can overwrite this in experiment configs
# overwrite from command line with `python train.py tags="[first_tag, second_tag]"`
tags: ["dev"]

# set False to skip model training
train: True

# evaluate on test set, using best model weights achieved during training
# lightning chooses best weights based on the metric specified in checkpoint callback
test: True

# simply provide checkpoint path to resume training
ckpt_path: null

# seed for random number generators in pytorch, numpy and python.random
seed: null

# disable python warnings if they annoy you
ignore_warnings: True

# pretty print config tree at the start of the run using Rich library
print_config: True
```

</details>

### Experiment Config

Location: [configs/experiment](../configs/experiment)<br>
Experiment configs allow you to overwrite parameters from the main config.<br>
For example, you can use them to version control best hyperparameters for each combination of model and dataset.

<details>
<summary><b>Show example experiment config</b></summary>

```yaml
# @package _global_

# to execute this experiment run:
# python train.py experiment=example

defaults:
  - override /data: food101.yaml
  - override /model: resnet18.yaml
  - override /callbacks: default.yaml
  - override /trainer: gpu.yaml

# all parameters below will be merged with parameters from default configurations set above
# this allows you to overwrite only specified parameters

tags: ["resnet18", "food101-tiny"]

seed: 12345

trainer:
  min_epochs: 1
  max_epochs: 10
  gradient_clip_val: 0.5
  precision: 16

model:
  num_classes: 10

data:
  batch_size: 16

logger:
  wandb:
    tags: ${tags}
    group: "resnet"
```

</details>

<br>


### Open a Configuration File

This example composes the configuration from Python; the same composition happens when `train.py` is run from the command line.

In [2]:
from hydra import initialize, initialize_config_module, initialize_config_dir, compose
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="../configs/"):
    cfg = compose(config_name="train.yaml")
    print(OmegaConf.to_yaml(cfg))

work_dir: ${hydra:runtime.cwd}
task_name: train
data_dir: data/
tags:
- dev
train: true
test: true
ckpt_path: null
seed: null
print_config: true
data:
  _target_: src.dataset.Food101LitDatamodule
  data_dir: ${data_dir}
  input_size:
  - 384
  - 384
  batch_size: 16
  num_workers: 4
  pin_memory: false
model:
  _target_: src.models.ClassificationLightningModule
  num_classes: 10
  lr: 0.0001
  net:
    _target_: src.models.ResNet18
    input_channels: 3
    num_classes: ${..num_classes}
callbacks:
  model_checkpoint:
    _target_: lightning.pytorch.callbacks.ModelCheckpoint
    dirpath: null
    filename: epoch_{epoch:03d}
    monitor: val/acc
    mode: max
    save_last: true
    save_top_k: 1
    auto_insert_metric_name: false
  early_stopping:
    _target_: lightning.pytorch.callbacks.EarlyStopping
    monitor: val/acc
    patience: 100
    mode: max
trainer:
  _target_: lightning.pytorch.trainer.Trainer
  default_root_dir: null
  min_epochs: 1
  max_epochs: 10
  accelerator: cpu
  

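The configuration above is the fully composed default configuration: the top-level settings from `train.yaml` merged with the selected `data`, `model`, `callbacks`, and `trainer` config groups.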

### Running an Experiment

Running an experiment with `hydra` is simple: compose the configuration with the experiment override and pass it to `train`.

Equivalent CLI command:
```
python3 train.py -m experiment=food101
```

In [6]:
from src.train_pipeline import train

with initialize(version_base=None, config_path="../configs/"):
    cfg = compose(
        config_name="train.yaml",
        overrides=[f"data_dir={DATA_DIR}", "experiment=food101"]
    )
    train(cfg)

Global seed set to 12345
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | net          | ResNet18           | 12.6 M
1 | criterion    | CrossEntropyLoss   | 0     
2 | train_acc    | MulticlassAccuracy | 0     
3 | val_metrics  | MetricCollection   | 0     
4 | test_metrics | MetricCollection   | 0     
----------------------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
50.251    Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=10` reached.
Restoring states from the checkpoint path at /home/haritsahm/Documents/Getting Started/checkpoints/epoch_009-v1.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/haritsahm/Documents/Getting Started/checkpoints/epoch_009-v1.ckpt



### Running an Experiment with Overridden Parameters

We can override parts of an existing configuration to run the same experiment with different settings.

Equivalent CLI command:
```
python3 train.py -m experiment=food101 model=resnet18 model.lr=0.001 data.batch_size=8
```
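Dotted overrides such as `model.lr=0.001` address a nested key by its path. The sketch below shows that idea in plain Python under simplifying assumptions: it only handles value overrides of existing keys (not config-group selections such as `model=resnet18`), and `apply_overrides` and `parse_value` are hypothetical helpers, not Hydra functions.

```python
from copy import deepcopy
from typing import Any


def parse_value(text: str) -> Any:
    """Convert override text to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with each `a.b.c=value` override applied."""
    result = deepcopy(config)
    for override in overrides:
        dotted_key, _, raw_value = override.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node[name]  # walk down to the section that owns the key
        if leaf not in node:
            # Only existing keys may be changed in this sketch.
            raise KeyError(f"'{dotted_key}' is not in the config")
        node[leaf] = parse_value(raw_value)
    return result


# Values taken from the default configuration shown earlier.
default_config = {"model": {"lr": 0.0001}, "data": {"batch_size": 16}}
updated = apply_overrides(default_config, ["model.lr=0.001", "data.batch_size=8"])
print(updated)
```

The original configuration is left untouched, so the same defaults can be reused for the next run with a different set of overrides.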

In [7]:
overrides = [
    f"data_dir={DATA_DIR}",
    "experiment=food101",
    "model=resnet18",
    "model.lr=0.001",
    "data.batch_size=8",
]

with initialize(version_base=None, config_path="../configs/"):
    cfg = compose(
        config_name="train.yaml",
        overrides=overrides,
    )
    train(cfg)

Global seed set to 12345
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | net          | ResNet18           | 12.6 M
1 | criterion    | CrossEntropyLoss   | 0     
2 | train_acc    | MulticlassAccuracy | 0     
3 | val_metrics  | MetricCollection   | 0     
4 | test_metrics | MetricCollection   | 0     
----------------------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
50.251    Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=10` reached.
Restoring states from the checkpoint path at /home/haritsahm/Documents/Getting Started/checkpoints/epoch_008.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/haritsahm/Documents/Getting Started/checkpoints/epoch_008.ckpt



In [8]:
overrides = [
    f"data_dir={DATA_DIR}",
    "experiment=food101",
    "model=timm",
    "model.net.model_name=resnetv2_50",
    "model.lr=0.001",
    "data.batch_size=8",
]

with initialize(version_base=None, config_path="../configs/"):
    cfg = compose(
        config_name="train.yaml",
        overrides=overrides,
    )
    train(cfg)

Global seed set to 12345
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type               | Params
----------------------------------------------------
0 | net          | ResNetV2           | 23.5 M
1 | criterion    | CrossEntropyLoss   | 0     
2 | train_acc    | MulticlassAccuracy | 0     
3 | val_metrics  | MetricCollection   | 0     
4 | test_metrics | MetricCollection   | 0     
----------------------------------------------------
23.5 M    Trainable params
0         Non-trainable params
23.5 M    Total params
94.083    Total estimated model params size (MB)


`Trainer.fit` stopped: `max_epochs=10` reached.
Restoring states from the checkpoint path at /home/haritsahm/Documents/Getting Started/checkpoints/epoch_009-v2.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/haritsahm/Documents/Getting Started/checkpoints/epoch_009-v2.ckpt




### Running Multiple Experiments

`hydra` can run multiple experiments with different configurations. We need to add `--multirun` or `-m` to the Python command that executes the main file. <br>
The command below runs 4 experiments using the `timm` model with lr-bs pairs of: `(0.001,8)`, `(0.001,16)`, `(0.0001,8)`, `(0.0001,16)`.

For more details, please read the [multi-run documentation](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/).

Unfortunately, `multirun` cannot be executed from a notebook cell in the same way as the previous experiments; it must be launched with the `python` command.
From the root directory, run the following command to run multiple experiments:
```
python3 train.py -m experiment=food101 model=timm trainer.max_epochs=5 model.lr=0.001,0.0001 data.batch_size=8,16 logger=wandb
```
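The four jobs come from taking every combination of the comma-separated values. A minimal sketch of that expansion, assuming a plain Cartesian product over the overrides (`expand_sweep` is a hypothetical helper, not part of Hydra):

```python
from itertools import product


def expand_sweep(overrides: list[str]) -> list[list[str]]:
    """Expand comma-separated override values into one override list per job."""
    choices = []
    for override in overrides:
        key, _, values = override.partition("=")
        # "model.lr=0.001,0.0001" -> ["model.lr=0.001", "model.lr=0.0001"]
        choices.append([f"{key}={value}" for value in values.split(",")])
    return [list(job) for job in product(*choices)]


jobs = expand_sweep([
    "model=timm",
    "model.lr=0.001,0.0001",
    "data.batch_size=8,16",
])
for index, job in enumerate(jobs):
    print(f"#{index} : {' '.join(job)}")
```

Each extra swept parameter multiplies the number of jobs, so a grid like this grows quickly as more values are added.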

In [9]:
%cd $ROOT_DIR

/home/haritsahm/Documents/Getting Started


In [10]:
!python3 train.py -m experiment=food101 model=timm trainer.max_epochs=5 model.lr=0.001,0.0001 data.batch_size=8,16 logger=wandb data_dir=data/food-101-tiny/

[2023-06-12 07:42:52,468][HYDRA] Launching 4 jobs locally
[2023-06-12 07:42:52,468][HYDRA] 	#0 : experiment=food101 model=timm trainer.max_epochs=5 model.lr=0.001 data.batch_size=8 logger=wandb data_dir=data/food-101-tiny/
Global seed set to 12345
[2023-06-12 07:42:52,700][src.train_pipeline][INFO] - Instantiating datamodule <src.dataset.Food101LitDatamodule>
[2023-06-12 07:42:54,383][src.train_pipeline][INFO] - Instantiating model <src.models.ClassificationLightningModule>
  rank_zero_warn(
[2023-06-12 07:42:54,958][src.train_pipeline][INFO] - Instantiating callbacks...
[2023-06-12 07:42:54,959][src.utils][INFO] - Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2023-06-12 07:42:54,960][src.utils][INFO] - Instantiating callback <lightning.pytorch.callbacks.EarlyStopping>
[2023-06-12 07:42:54,960][src.train_pipeline][INFO] - Instantiating loggers...
[2023-06-12 07:42:54,960][src.utils][INFO] - Instantiating logger <lightning.pytorch.loggers.wandb.WandbLogger>
[2023

## Hyperparameter Sweeps using Optuna

<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="logo" width="40%" />

[Optuna](https://optuna.org/) automates the search for a good combination of hyperparameters by exploring the hyperparameter space with a sampling algorithm. Its built-in samplers include the Tree-structured Parzen Estimator (TPE), CMA-ES, and the genetic algorithm NSGA-II, alongside plain random and grid search.

To use Optuna, you define an objective function that evaluates the performance of your model for a specific set of hyperparameters. Optuna then iteratively samples different hyperparameter configurations, evaluates them by calling the objective function, and updates its search strategy based on the collected results. This process continues for a specified number of trials or until a time budget runs out.
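The sample, evaluate, and keep-the-best loop can be sketched without any dependencies. Note the substitutions: this uses plain random search rather than one of Optuna's samplers, so it never updates its strategy from past results, and the `objective` is a made-up toy score rather than a real validation metric. All function names here are hypothetical.

```python
import math
import random


def objective(params: dict) -> float:
    """Toy stand-in for a validation score; highest at lr=1e-3 and batch_size=16."""
    lr_penalty = (math.log10(params["lr"]) + 3.0) ** 2
    batch_penalty = 0.0 if params["batch_size"] == 16 else 0.1
    return 1.0 - lr_penalty - batch_penalty


def sample_params(rng: random.Random) -> dict:
    """Draw one configuration: log-uniform learning rate, categorical batch size."""
    return {
        "lr": 10 ** rng.uniform(-5, -2),
        "batch_size": rng.choice([8, 16, 32]),
    }


def random_search(n_trials: int, seed: int = 0) -> tuple[dict, float]:
    """Run `n_trials` evaluations and return the best configuration and score."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)  # in practice: train the model and read val/acc
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=20)
print(best_params, best_score)
```

In a real sweep each call to the objective is a full training run, which is why a sampler that learns from earlier trials can save a lot of compute compared with this baseline.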

To combine the capabilities of Hydra and Optuna, there is an [Optuna Sweeper plugin](https://hydra.cc/docs/plugins/optuna_sweeper/) available for Hydra. This plugin integrates Optuna's hyperparameter optimization capabilities into Hydra's configuration management. With the Optuna Sweeper plugin, you can define a search space for hyperparameters in your Hydra configuration files. During the hyperparameter sweep, Optuna will sample different combinations of hyperparameters and run the experiments accordingly.

By leveraging the Optuna Sweeper plugin in Hydra, you can easily perform hyperparameter optimization and explore different configurations without the need for writing separate scripts or managing multiple pipelines. It simplifies the process of finding the best hyperparameters for your machine learning models.


Install the Optuna Sweeper plugin for Hydra.
```shell
pip install hydra-optuna-sweeper
```

The following command will run 20 experiments to find the best configuration based on the `val/acc` metric. It will try to find the best `timm` model by experimenting with different `batch_size`, `lr`, and `model_name` parameters.

Unfortunately, `multirun` is not available in a notebook.
From the `02_Model_Development/` directory, run the following command to start the hyperparameter search:
```
python3 train.py -m experiment=food101 hparams_search=food101_optuna model=timm trainer.max_epochs=5 logger=wandb
```

In [None]:
!pip install -q hydra-optuna-sweeper

In [None]:
!python3 train.py -m experiment=food101 hparams_search=food101_optuna model=timm trainer.max_epochs=5 logger=wandb