# Hydra Configurations Tutorial

## Introduction

CMB-ML manages a complex pipeline that processes data across multiple stages. Each stage produces outputs that need to be tracked, reused, and processed in later stages. Without a clear framework, this can lead to disorganized code, redundant logic, and errors.

The CMB-ML library offers a set of tools to manage the pipeline in a modular and scalable way. 

At the core of this approach is configuration management, which cleanly separates the logic of the process from its parameters. This separation ensures that the code remains streamlined while the details stay isolated and easy to manage.

This notebook introduces Hydra, a tool developed by Meta to allow for elegant configuration management of complex programs.

## Contents
View this notebook with [nbviewer](https://nbviewer.org/github/CMB-ML/cmb-ml/tree/main/demonstrations/A_hydra_tutorial.ipynb#Introduction) (or in your IDE) to enable these links.

- [Simple configurations](#Simple-configurations)
- [Nested configurations](#Nested-configurations)
- [Lists in configurations](#Lists-in-configurations)
- [The defaults list](#The-defaults-list)
- [Hydra tools](#Other-features-Interpolation-Addition-and-Subtraction)
- [Config initialization](#Initializing-the-config)
- [Using environment variables](#Using-Environment-Variables)
- [Next Steps](#Next-steps)

# Simple configurations

Hydra is a Python library. Usually, it's used with modules (`*.py` files), but it can be made to work in Jupyter notebooks. I'll discuss that later, in [Initializing the config](#Initializing-the-config).

First, I'll show some simple examples of how it's used. I need to load the library and a few tools to work with it:

In [1]:
import hydra
from hydra import compose, initialize
from omegaconf import DictConfig, OmegaConf

To start, I'll use Hydra to load a configuration from [cfg/simple.yaml](./cfg/simple.yaml):

In [2]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='simple')

That configuration has two simple parameters:

``` yaml
some_string: abc
some_number: 3
```

I'll check the config:

In [3]:
print(OmegaConf.to_yaml(cfg))

some_string: abc
some_number: 3



I can access parameters in the config with two different styles. Then I can use that information. For instance:

In [4]:
# Parameters can be accessed as dictionary keys
n_repeats = cfg['some_number']

# And parameters can be accessed as attributes, using "dot notation"
my_text = cfg.some_string

for i in range(n_repeats):
    print(my_text)

abc
abc
abc


If I want to change parameters at runtime, I can do so using "overrides." I'll show that in a bit.

# Nested configurations

Parameters can also be nested, which allows for organization.

I'll look at a different config, [cfg/simple2.yaml](./cfg/simple2.yaml):

``` yaml
icon1:
  shape: square
  color: blue
icon2:
  shape: circle
  color: red
```

I'll check it:

In [5]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='simple2')

print(OmegaConf.to_yaml(cfg))

icon1:
  shape: square
  color: blue
icon2:
  shape: circle
  color: red



I can now look up the details for icon1 and icon2.

In [6]:
icon_mapping = {
    ('square', 'blue'): '🟦',
    ('circle', 'red'): '🔴',
    ('square', 'red'): '🟥',
    ('circle', 'blue'): '🔵'
}

print(icon_mapping[(cfg.icon1.shape, cfg.icon1.color)])
print(icon_mapping[(cfg.icon2.shape, cfg.icon2.color)])

🟦
🔴


Since both icons have the same structure, I could also use a helper function:

In [7]:
def print_icon(icon):
    print(icon_mapping[(icon.shape, icon.color)])

print_icon(cfg.icon1)
print_icon(cfg.icon2)

🟦
🔴


## Overrides

At runtime, I may want to do something slightly different and change a parameter. I can do this in a simple way, using "overrides." I'll change the shape of one of the icons.

In [8]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='simple2',
                  overrides=["icon1.shape=circle"])

print(OmegaConf.to_yaml(cfg))

def print_icon(icon):
    print(icon_mapping[(icon.shape, icon.color)])

print_icon(cfg.icon1)
print_icon(cfg.icon2)

icon1:
  shape: circle
  color: blue
icon2:
  shape: circle
  color: red

🔵
🔴
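
To make the mechanics concrete, the sketch below models what a dotted override such as `icon1.shape=circle` does to a nested configuration. It uses plain dictionaries and a hypothetical helper, `apply_override`, so it is only an illustration and not Hydra's actual override parser (which also handles types, quoting, and other operators).

```python
import copy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one `dotted.key=value` override applied.

    Hypothetical helper for illustration; not part of Hydra.
    """
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")

    result = copy.deepcopy(config)  # leave the original config untouched
    node = result
    for name in parents:
        node = node[name]  # walk down the hierarchy, one level per name

    # In this sketch, a plain override may only change a key that already exists
    if leaf not in node:
        raise KeyError(f"{dotted_key!r} is not in the config")
    node[leaf] = value
    return result


base = {
    "icon1": {"shape": "square", "color": "blue"},
    "icon2": {"shape": "circle", "color": "red"},
}

updated = apply_override(base, "icon1.shape=circle")
print(updated["icon1"])        # {'shape': 'circle', 'color': 'blue'}
print(base["icon1"]["shape"])  # square
```

The override only touches the named key; everything else in the configuration is carried over unchanged.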


# Lists in configurations

I named the icons `icon1` and `icon2`, but they're really multiple instances of the same thing, so a list may be more appropriate. I can have a list in my configurations, as shown in [cfg/simple3.yaml](./cfg/simple3.yaml):

``` yaml
icons:
  - shape: square
    color: blue
  - shape: circle
    color: red
```

I'll check it:

In [9]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='simple3')

print(OmegaConf.to_yaml(cfg))

icons:
- shape: square
  color: blue
- shape: circle
  color: red



This is a list of two dictionaries. Because it's a list, I can iterate through it:

In [10]:
icon_mapping = {
    ('square', 'blue'): '🟦',
    ('circle', 'red'): '🔴',
    ('square', 'red'): '🟥',
    ('circle', 'blue'): '🔵'
}

for icon in cfg.icons:
    print(icon_mapping[icon.shape, icon.color])

🟦
🔴


**Note:** While lists are more flexible, there is a downside: individual elements in a list (generally) cannot be overridden using Hydra's override mechanism.
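One possible way to keep both behaviors is to give each element a name, as in `simple2.yaml`, and iterate over the values. The sketch below uses plain dictionaries to stay dependency-free; the assumption is that the same pattern carries over to a loaded config, where a named entry can be the target of an override like `icon1.shape=circle`.

```python
icon_mapping = {
    ("square", "blue"): "🟦",
    ("circle", "red"): "🔴",
    ("square", "red"): "🟥",
    ("circle", "blue"): "🔵",
}

# Named entries instead of an anonymous list
icons = {
    "icon1": {"shape": "square", "color": "blue"},
    "icon2": {"shape": "circle", "color": "red"},
}

# A named entry can be addressed directly, which is what a dotted override relies on
icons["icon1"]["shape"] = "circle"

# Iteration still works, by looping over the values rather than the keys
symbols = [icon_mapping[(icon["shape"], icon["color"])] for icon in icons.values()]
print(symbols)  # ['🔵', '🔴']
```

The trade-off is that every element needs a unique key, which is more verbose than a list for long collections.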

# The defaults list

One great feature of Hydra is that it can *compose* configurations, using a defaults list.

Consider the [cfg/sample_cfg.yaml](./cfg/sample_cfg.yaml), which uses parameters closer to what I'll use for CMB-ML:

```yaml
defaults:
  - scenario: scenario_512
  - splits: all
  - _self_

preset_strings : ["d9", "s4", "f1"]
```

The tutorial configs have the following directory structure:
```
├─ cfg
│  ├─ scenario
│  │   ├─ scenario_128.yaml
│  │   └─ scenario_512.yaml
│  ├─ splits
│  │   ├─ 1-1.yaml
│  │   └─ all.yaml
│  └─ sample_cfg.yaml
└── tutorial notebooks here
```

Given that structure, the defaults list tells Hydra to create a key for "scenario", where the value is the contents of [scenario_512.yaml](./cfg/scenario/scenario_512.yaml). Similarly, "splits" will come from [all.yaml](./cfg/splits/all.yaml).

In [11]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='sample_cfg')
    print(OmegaConf.to_yaml(cfg))

scenario:
  nside: 512
  map_fields: IQU
  precision: float
  units: uK_CMB
splits:
  name: '1450'
  Train:
    n_sims: 1000
  Valid:
    n_sims: 250
  Test:
    n_sims: 200
preset_strings:
- d9
- s4
- f1



This makes my configurations much shorter. When dealing with the complicated CMB-ML pipeline, this makes it much easier for me to find the settings I need.

The defaults list is a special list in Hydra. It allows swapping out whole sets of parameters. For instance, if I want to run a set of parameters for debugging, I can choose a simpler scenario and smaller set of splits.

In [12]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='sample_cfg',
                  overrides=['scenario=scenario_128', 'splits="1-1"'])
    print(OmegaConf.to_yaml(cfg))

scenario:
  nside: 128
  map_fields: I
  precision: float
  units: uK_CMB
splits:
  name: 1-1
  Test:
    n_sims: 1
preset_strings:
- d9
- s4
- f1



The defaults list is very powerful and a significant reason behind the adoption of Hydra. I can chain together configuration files for very modular setups. This could occur if (for instance) the scenario file also had a defaults list. It is how some of the models are set up.

There are some extra rules for the defaults list. It has to appear at the top of the file.

It should include `_self_` somewhere on the list. This entry refers to the remainder of the YAML after the defaults list.

Order matters in the defaults list. If I define a parameter in two different places, the last one takes precedence. For instance, if the `_self_` portion also defined `scenario.precision`, then whichever entry is lower on the list would set that parameter.
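The "last one wins" rule can be pictured as a sequence of nested merges. The sketch below is a simplified model using plain dictionaries; `deep_merge` and `compose_in_order` are hypothetical helpers, the `double` value is made up for the example, and Hydra's real composition is more involved than this.

```python
def deep_merge(base: dict, update: dict) -> dict:
    """Merge `update` into a copy of `base`; values from `update` win on conflicts."""
    merged = dict(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested sections
        else:
            merged[key] = value  # later value replaces the earlier one
    return merged


def compose_in_order(*layers: dict) -> dict:
    """Apply layers from first to last, as a stand-in for a defaults list."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Stand-ins for the scenario group file and for the file's own values
scenario_group = {"scenario": {"nside": 512, "precision": "float"}}
self_values = {
    "scenario": {"precision": "double"},  # hypothetical conflicting value
    "preset_strings": ["d9", "s4", "f1"],
}

self_last = compose_in_order(scenario_group, self_values)
self_first = compose_in_order(self_values, scenario_group)

print(self_last["scenario"]["precision"])   # double
print(self_first["scenario"]["precision"])  # float
```

In both orderings, keys defined in only one place (such as `nside` or `preset_strings`) are kept; only the conflicting key depends on the order.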

# Other features: Interpolation, Addition, and Subtraction

I'll introduce three more concepts here. These can be very useful, though they're used less frequently. They will appear in some of the following tutorials.

**Interpolation**: While the defaults list and its ordering can be used to set keys in multiple places, I often find that *interpolation* is more convenient. This allows me to set a value once and use it elsewhere. Hydra resolves ordering as it builds the configuration, but it resolves interpolations later, at run time, when a value is accessed. This allows me to circumvent circularity issues. It also allows me to build strings from multiple values.

Interpolation uses the syntax `"${thing.to.lookup}"`.

**Addition**: When performing overrides, I may want to include a parameter which isn't already present. I can do this with the `+` operator.

**Subtraction**: Similarly, when performing overrides, I may want to remove some parameter. This is done with the `~` operator.

In the following example, I remove `scenario.units`, and add `scenario.map_units` and `scenario.ps_units`. This is contrived, but shows both addition and subtraction. I also add a dataset name, which is composed by interpolation of other keys.

In [13]:
with initialize(version_base=None, config_path="cfg"):
    cfg = compose(config_name='sample_cfg',
                  overrides=['~scenario.units',
                             '+scenario.map_units="K_CMB"',
                             '+scenario.ps_units="uK_CMB^2"',
                             '+dataset_name="CMB-ML_${scenario.nside}_${splits.name}"',])
    print(OmegaConf.to_yaml(cfg))

cfg.dataset_name

scenario:
  nside: 512
  map_fields: IQU
  precision: float
  map_units: K_CMB
  ps_units: uK_CMB^2
splits:
  name: '1450'
  Train:
    n_sims: 1000
  Valid:
    n_sims: 250
  Test:
    n_sims: 200
preset_strings:
- d9
- s4
- f1
dataset_name: CMB-ML_${scenario.nside}_${splits.name}



'CMB-ML_512_1450'

Note how `dataset_name` in the YAML still shows the interpolation syntax, but when accessed programmatically (`cfg.dataset_name`), it is fully resolved.

# Initializing the config

Hydra does more than just pull the configuration. Depending on how it's started, we get different results.

Most of the development work I've done has used **Python modules** (`*.py` files), which is where Hydra really shines. In that case, the `@hydra.main` decorator is used for top-level entry points:

```python
@hydra.main(version_base=None, config_path="cfg", config_name="sample_cfg")
def main(cfg: DictConfig) -> None:
    do_something(cfg)
```

When used this way, Hydra automatically initializes the config, manages the runtime and logging, and can run the program multiple times, sweeping over a set of parameters. In this case, the `cfg` object is scoped to just the entry point.

See [this python module](./B_hydra_script_tutorial.py) for a simple functioning example.

However, in the remaining **Jupyter notebooks**, I use a different initialization method that makes the Hydra instance global. That gives me interactive access to the configuration. Because it's a global object, some precautions are needed to avoid issues such as initializing it twice. You'll see patterns that look like this:

In [14]:
# Clear any existing global Hydra instance, so that initialize() can be called again
hydra.core.global_hydra.GlobalHydra.instance().clear()

initialize(version_base=None, config_path="cfg")

cfg = compose(config_name='sample_cfg')

print(OmegaConf.to_yaml(cfg))

scenario:
  nside: 512
  map_fields: IQU
  precision: float
  units: uK_CMB
splits:
  name: '1450'
  Train:
    n_sims: 1000
  Valid:
    n_sims: 250
  Test:
    n_sims: 200
preset_strings:
- d9
- s4
- f1



Hydra is a bit more limited in these notebooks because we can't use logging or sweeping in this interactive form. Similarly, most of the CMB-ML classes were written for modules, so some things may seem strange for a notebook. 

I've taken care to describe things in these demonstrations such that it translates to either notebook or module use. Hopefully, the differences will not be sources of confusion. Please reach out if they are.

# Using Environment Variables

You won't often need to think about environment variables in Hydra configs, but you will when setting up CMB-ML for use with your system. If you see `${oc.env:SOME_ENV_VARIABLE}`, that's the interpolation mechanism working in concert with system environment variables: when the value is read, the current value of `SOME_ENV_VARIABLE` is substituted into the interpolation.

It's used as a string, just like other interpolations. If you were to set `SOME_ENV_VARIABLE=Hello`, then the following config:
```yaml
h_w: "${oc.env:SOME_ENV_VARIABLE}, World!"
```
would give "Hello, World!" as the value of `cfg.h_w`.

Again, this isn't often used, but may be seen when you set up your local system.

# Next steps

Hydra configurations form the backbone of CMB-ML, so that the modular and scalable code can be customized through structured configurations.

This notebook began with simple, flat configurations. It then described structures (hierarchies and lists) that can be used for more modular code. That structure justifies the use of a defaults list which simplifies composition of the configurations. Other tools, such as interpolation, were described. Last, I tried to clear up a potential point of confusion as you transition between notebooks and modules.

For more information on how we use Hydra configs, refer to:
- [Hydra documentation](https://hydra.cc/docs/intro/)
- [The top level configs README](https://github.com/CMB-ML/cmb-ml/blob/dev-package/cmbml/cfg/README.md)
- [The pipeline configs README](https://github.com/CMB-ML/cmb-ml/blob/dev-package/cmbml/cfg/pipeline/README.md)

Now that some background on Hydra has been established, continue with [setting up your local system](./C_setting_up_local.ipynb). You'll set up a configuration, 