# ParamGen Quickstart Guide
Alper Altuntas, NCAR\
Boulder, CO - 2021

## 1. Introduction

ParamGen is a lightweight, generic Python module for generating runtime parameters for earth system modeling applications. The module supports arbitrary Python expressions for the specification of parameter values. This provides a high level of flexibility and genericity.

ParamGen infers the values of model parameters from inclusive sets of *default parameters databases* (DPD) to be put together and maintained by the model developers. These databases are typically stored in a file written in a markup language such as xml, yaml or json. ParamGen is generic, i.e., it is agnostic of any details of a particular modeling framework, model component, or input/output format. By default, the base ParamGen class supports xml, yaml and json as DPD (input) formats. The only out-of-the-box output format, on the other hand, is the Fortran namelist format. New input and output formats can easily be introduced by application developers via class inheritance, as will be discussed in this document.

The primary property of a ParamGen instance is its `.data` member, which is of type Python dictionary, i.e., a collection of key-value pairs. When a ParamGen instance gets created, a dictionary must be provided to the ParamGen constructor to be accepted as its initial `.data`. This initial dictionary corresponds to the DPD, which may be read from xml, yaml, json, etc.

In the simplest case, the keys correspond to parameter names and the values correspond to parameter values. In a more involved case, the `.data` member may be formed as a nested dictionary for grouping model parameters into separate namelist modules. Moreover, the keys of the `.data` member may consist of logical expressions, i.e., *guards*. The notion of guards is one of the most important concepts in ParamGen. A *guard* is a proposition of a parameter value (similar to how guards are propositions of commands in Dijkstra's Guarded Command Language). Take the following data, for instance:

```
NIGLOBAL:
    $OCN_GRID == "gx1v6":
       320
    $OCN_GRID == "tx0.66v1": 
       540
...
```

In the above nested dictionary, `NIGLOBAL` is interpreted as one of the parameter names.  Within the inner dictionary, however, we have two keys, both of which are logical expressions. These logical expressions, or guards, are regarded as propositions of the values following them. After the instantiation, the `.reduce()` method may be called to evaluate the guards and determine the values of model parameters. In the above example, assuming the expandable variable `OCN_GRID` is `"tx0.66v1"`, calling the reduce method results in: 

```
NIGLOBAL:
    540
...
```

Finally, the `.write()` method may be called to write the set of parameters in a desired format.

*Note: not sure about the "default parameters database" (DPD) term. Any alternative suggestions welcome.* -aa

## 2. ParamGen in Action

#### Obtaining ParamGen

Although ParamGen is model-agnostic, it is currently distributed within an experimental CESM fork. To obtain this CESM version, run the following commands:

```
git clone https://github.com/alperaltuntas/CESM.git -b paramGenBeta
(cd CESM; ./manage_externals/checkout_externals -o)
```

In the above CESM sandbox, ParamGen is located in `CESM/cime/scripts/lib/CIME/ParamGen`.

#### Importing ParamGen class

The first step in working with ParamGen is to import the `ParamGen` class from this experimental version of the module:

In [1]:
from paramgen import ParamGen

Note: In the case of a CESM model, ParamGen would be imported from a buildnml script. To do so, one would first append the ParamGen directory to the Python module search path (`sys.path`). See `CESM/components/mom/cime_config/buildnml` as an example.

#### Instantiating a ParamGen object:
The ParamGen constructor expects a `data` argument, which is a Python dictionary that may or may not be nested. This dictionary corresponds to the default parameters database (DPD), i.e., the collection of parameter name-value pairs for all possible configurations. Let's first define a simple Python dictionary containing three variables `X`, `Y`, and `Z`:

In [2]:
DPD_dict = {"X" : 1.0,
            "Y" : True,
            "Z" : "foo" }

Now, create a ParamGen instance with this DPD dictionary:

In [3]:
pg = ParamGen(DPD_dict)

In [4]:
pg.data

{'X': 1.0, 'Y': True, 'Z': 'foo'}

We can now call the reduce method to generate the final version of `.data`:

In [5]:
pg.reduce()

In [6]:
pg.data

{'X': 1.0, 'Y': True, 'Z': 'foo'}

As expected, the reduced data is not any different from the initial data we passed to ParamGen constructor. The `.reduce()` method makes a difference only when the initial data contains conditionals, variable expansion, or Python expressions. Before describing these mechanisms, let's generate the same ParamGen instance via yaml and json formats:

#### Instantiating a ParamGen object via yaml or json:

ParamGen can be instantiated via yaml or json files using the following methods:

- `.from_yaml()`
- `.from_json()`


Under the hood, these methods simply create a Python dictionary from files with these formats and then call the ParamGen constructor with the generated dictionary.
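As a rough illustration of that pattern (not ParamGen's actual source), the sketch below defines a hypothetical `MiniGen` class whose `from_json` class method parses a file with the standard `json` module and forwards the resulting dictionary to the constructor. The class name and the temporary file are assumptions made for this example only.

```python
import json
import os
import tempfile


class MiniGen:
    """Hypothetical stand-in for a ParamGen-like class (illustration only)."""

    def __init__(self, data):
        # The constructor only ever sees a plain dictionary.
        self.data = data

    @classmethod
    def from_json(cls, path):
        # Parse the file into a dictionary, then defer to the constructor.
        with open(path) as f:
            return cls(json.load(f))


# Write a small JSON file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmpdir:
    json_path = os.path.join(tmpdir, "DPD.json")
    with open(json_path, "w") as f:
        json.dump({"X": 1.0, "Y": True, "Z": "foo"}, f)
    mg = MiniGen.from_json(json_path)

print(mg.data)
```

Because the class method builds the object through `cls` rather than a hard-coded class name, a subclass that calls `from_json` receives an instance of the subclass.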

#### yaml

In [7]:
%%writefile DPD.yaml
X: 1.0
Y: True
Z: foo

Writing DPD.yaml


In [8]:
# Create a ParamGen instance from the yaml file:
pg = ParamGen.from_yaml("DPD.yaml")
pg.data

{'X': 1.0, 'Y': True, 'Z': 'foo'}

#### json

In [9]:
%%writefile DPD.json
{
    "X": 1.0,
    "Y": true,
    "Z": "foo"
}

Writing DPD.json


In [10]:
# Create a ParamGen instance from the json file:
pg = ParamGen.from_json("DPD.json")
pg.data

{'X': 1.0, 'Y': True, 'Z': 'foo'}

#### Instantiating a ParamGen object via XML:

In a similar fashion, a ParamGen object may be created from an XML file. However, when working with XML, a specific schema must be satisfied. See the XML_NML.ipynb document for more information on how to work with XML within the ParamGen framework.

Out of the three commonly used markup languages, yaml has the most readable and concise syntax, especially when working with a large number of parameters and nested entries. The disadvantage of yaml is that, unlike xml and json, it is not supported by the Python standard library, so a third-party yaml parser, e.g., PyYAML, is required.

Instead of using these file formats, we will continue to create ParamGen instances using Python dictionaries explicitly in the remainder of this documentation. Recall that ParamGen converts these formats to a dictionary before creating an instance so the instructions below apply to all ParamGen instances regardless of which DPD format is used.

## ParamGen Mechanisms
- Variable Expansion
- Guards
- Formulas
- Appending

### Variable expansion

Similar to the shell parameter expansion mechanism in Linux, the `$` character may be used to introduce expandable variables in DPDs. These variables are expanded, i.e., replaced with their values, when the `.reduce()` method is called. Variable expansion may be employed in both keys and values of DPD dictionaries. To illustrate this mechanism, we define a new ParamGen instance:

In [11]:
pg = ParamGen({
    "${alpha}": 1.0,
    "Y": "$beta",
    "$gamma": "foo"
})

In [12]:
pg.data

{'${alpha}': 1.0, 'Y': '$beta', '$gamma': 'foo'}

In the above ParamGen instantiation, we specify three expandable variables in keys and values: `alpha`, `beta`, `gamma`. 
When expandable variables are included in the initial data, an `expand_func` must be provided. This function is required to take a string as an argument and return a scalar, i.e., a string, integer, float, or boolean value. The passed string corresponds to the variable name, while the return value corresponds to the value of the expandable variable. A rather simple `expand_func` is defined below for demonstration purposes.

In [13]:
def expand_func(varname):
    if varname == "alpha":
        return "X"
    elif varname == "beta":
        return True
    elif varname == "gamma":
        return "Z"
    else:
        raise RuntimeError("Unknown variable")

In [14]:
pg.reduce(expand_func)

In [15]:
pg.data

{'X': 1.0, 'Y': 'True', '"Z"': 'foo'}

As seen above, all the expandable variables are expanded, i.e., replaced with their respective values. Notice that the variable `beta` is converted from bool to string during variable expansion. The same behavior applies to numeric variables as well. However, this behavior is not restrictive because (1) all values are converted to strings before they are written out to text files anyway and (2) all logical expressions and formulas are strings that get evaluated.

**Warning:** There is a behavioral difference between specifying string variables with curly braces vs. without curly braces. When a variable of type string gets specified ***without*** curly braces, its value is automatically enclosed by quotes when the `reduce()` method is called. However, string variables specified ***with*** curly braces are not automatically enclosed by quotes. This behavior can be observed with the variable `gamma`, which expands to `'"Z"'`. Had we specified gamma with curly braces, i.e., `${gamma}`, the value would instead be `'Z'`, and not `'"Z"'`. This can be confirmed with the variable `alpha` above, which expands to `'X'`.

This behavior is introduced in ParamGen as a means of keeping conditional expressions more concise. Compare the following two logical ParamGen expressions, which are equivalent, but the first one has expandable variables defined with curly braces. In the first version, we have to explicitly enclose with quotes not only the expandable variables (`"${...}"`) but also the entire expression (`'...'`), so as to make sure that the YAML parser treats the entire logical formula as a single expression. In the second version, neither set of quotes is necessary, except, of course, for the literal strings `"gx1v7"` and `"datm"`.

`' "${OCN_GRID}" == "gx1v7" and "${COMP_ATM}" == "datm" ':`

`$OCN_GRID == "gx1v7" and $COMP_ATM == "datm":`
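The quoting rule described above can be mimicked with a few lines of standard-library Python. The sketch below is not ParamGen's implementation: the helper name `expand_vars` and the regular expression are assumptions chosen only to reproduce the documented behavior, in which `${name}` is replaced verbatim while `$name` is additionally wrapped in quotes when its value is a string.

```python
import re

# Matches either ${name} (group 1) or $name (group 2).
_VAR = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def expand_vars(text, expand_func):
    """Toy expansion: quote string values only for the brace-less form."""

    def _replace(match):
        braced, bare = match.group(1), match.group(2)
        value = expand_func(braced or bare)
        if bare is not None and isinstance(value, str):
            return '"' + value + '"'  # $name with a string value: add quotes
        return str(value)             # ${name}, or a non-string value

    return _VAR.sub(_replace, text)


values = {"alpha": "X", "beta": True, "gamma": "Z"}
print(expand_vars("${alpha}", values.get))  # X
print(expand_vars("$beta", values.get))     # True
print(expand_vars("$gamma", values.get))    # "Z"

# The brace-less form turns a guard into a valid Python comparison.
print(expand_vars('$OCN_GRID == "gx1v7"', {"OCN_GRID": "gx1v7"}.get))
```

The last line prints `"gx1v7" == "gx1v7"`, which is why the brace-less form needs no extra quoting inside guards.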

#### CESM/CIME XML variables as expandable variables

Within the CESM framework, CIME XML variables may easily be specified in DPDs as expandable variables. Typically, `ParamGen` is utilized in the `buildnml` scripts of components. The first argument of all `buildnml` methods is the `case` variable, which is an instance of `CIME.case.Case`. This CIME case object has a `.get_value()` method that returns the value of a given XML variable. This method may simply be passed to the `reduce()` method of ParamGen as an expand function:

```
def expand_func(varname):
    return case.get_value(varname)
    
pg.reduce(expand_func)
```

Or, more concisely:

```
pg.reduce(lambda varname: case.get_value(varname))

```

Examples of this usage can be found in the MOM6 implementation of ParamGen. Check out the following derived ParamGen classes of MOM6:

    - CESM/components/mom/cime_config/MOM_RPS/FType_input_nml.py
    - CESM/components/mom/cime_config/MOM_RPS/FType_MOM_params.py
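Any callable that maps a variable name to a value can serve as an expand function. The sketch below uses a hypothetical `StubCase` class in place of `CIME.case.Case` (the class, its constructor, and the variable values are assumptions for illustration) to show three equivalent ways of wrapping a `get_value` method.

```python
class StubCase:
    """Hypothetical stand-in for CIME.case.Case (illustration only)."""

    def __init__(self, xml_vars):
        self._xml_vars = xml_vars

    def get_value(self, name):
        return self._xml_vars.get(name)


case = StubCase({"OCN_GRID": "tx0.66v1", "OCN_NCPL": 24})


# 1. A named function; note the explicit return statement.
def expand_func(varname):
    return case.get_value(varname)


# 2. A lambda wrapping the same call.
expand_lambda = lambda varname: case.get_value(varname)

# 3. The bound method itself, which is already a callable.
expand_bound = case.get_value

for func in (expand_func, expand_lambda, expand_bound):
    print(func("OCN_GRID"))
```

All three callables return the same value for a given variable name, so the choice among them is a matter of style.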


### Guards

Recall that the keys of the `.data` dictionary specify the parameter names while the values correspond to the respective parameter values. Depending on the context, the keys may also be interpreted as guards, i.e., propositions of parameter values. The keys are interpreted as guards if all the keys at a certain level are logical expressions that evaluate to True or False. A data dictionary *without* any guards:

In [16]:
dict1 = {
    "var1": 1,
    "var2": 2
}

A data dictionary with some guards:

In [17]:
dict2 = {
    "var1": {
        True: 1,
        False: 0
    },
    "var2": 2
}

In the above dictionary `dict2`, the variable `var1` has two options, `1` and `0`. Which value gets picked for "var1" depends on the guards, i.e., the propositions `True` and `False`. Now let's create a ParamGen instance with the dictionary `dict2`:

In [18]:
pg = ParamGen(dict2)
pg.data

{'var1': {True: 1, False: 0}, 'var2': 2}

Observe the effect of calling the reduce method below:

In [19]:
pg.reduce()
pg.data

{'var1': 1, 'var2': 2}

#### Logical Python expressions as guards

The guards above are trivially specified to be `True` and `False`. In practice, however, guards are arbitrary Python expressions that evaluate to `True` or `False`. These expressions may have expandable variables, standard Python operators, method calls, etc. For an expression to be regarded as a guard, then, the expression must evaluate to `True` or `False`.

Note: In YAML, the quotes enclosing the expressions are not necessary, since the YAML parser automatically interprets those logical expressions as strings.
 
The following is an example with arbitrary Python expressions as guards:

In [20]:
def expand_func(varname):
    if varname == "one":
        return 1.0
    elif varname == "two":
        return 2.0
    else:
        raise RuntimeError("Unknown variable")
        
        
dict3 = {
    "var1": {
        '$one < $two' : 1,
        '$one > $two' : 0
    },
    "var2": 2
}

pg = ParamGen(dict3)
pg.reduce(expand_func)
pg.data

{'var1': 1, 'var2': 2}

#### Guard behavior

- If multiple guards evaluate to True, the last option gets picked. If it is desired to pick the first valid option, however, the default behavior may be changed by setting the optional `match` argument of ParamGen to `first`. For example: `pg = ParamGen(dict2, match='first')`
- If no guards evaluate to True, the parameter value gets set to `None`. In a model-specific write method, the parameters with the value `None` may, for example, be chosen to be omitted by the application developer. 
- The `else` keyword evaluates to True only if all other guards evaluate to False.
- If ParamGen attempts to expand a variable whose value is undefined, it throws an error. In some cases, certain expandable variables may be defined only for certain configurations. 
For instance, in the example below, the variable `INIT_LAYERS_FROM_Z_FILE` is defined only if `OCN_GRID` is one of `["gx1v6", "tx0.66v1", "tx0.25v1"]`. Therefore, to avoid an undefined expandable variable error, we place the `INIT_LAYERS_FROM_Z_FILE` check below the `OCN_GRID` check, as follows:

```
    tempsalt:
        $OCN_GRID in ["gx1v6", "tx0.66v1", "tx0.25v1"]:
            $INIT_LAYERS_FROM_Z_FILE == "True":
                "${INPUTDIR}/${TEMP_SALT_Z_INIT_FILE}"
```

   This ensures that ParamGen attempts to expand `INIT_LAYERS_FROM_Z_FILE` variable only when `$OCN_GRID in ["gx1v6", "tx0.66v1", "tx0.25v1"]` evaluates to True.
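The selection rules listed above can be summarized in a small stand-alone function. This is a minimal sketch under stated assumptions, not ParamGen's code: the helper `pick_value` is hypothetical, it expects guards whose variables have already been expanded, and it takes `match` as a function argument rather than a constructor argument.

```python
def pick_value(options, match="last"):
    """Toy guard selection over already-expanded guards.

    `options` maps guards (bools, expression strings, or "else") to values.
    """
    hits = []
    else_value = None
    has_else = False
    for guard, value in options.items():
        if guard == "else":
            has_else, else_value = True, value
        # eval() is for illustration only; never use it on untrusted input.
        elif guard is True or (isinstance(guard, str) and eval(guard) is True):
            hits.append(value)
    if hits:
        return hits[0] if match == "first" else hits[-1]
    # "else" applies only when no other guard holds; otherwise the result is None.
    return else_value if has_else else None


both_true = {"1.0 < 2.0": "a", "2.0 > 1.0": "b"}
print(pick_value(both_true))                       # b (last match)
print(pick_value(both_true, match="first"))        # a (first match)
print(pick_value({"1 > 2": "a"}))                  # None (no guard holds)
print(pick_value({"1 > 2": "a", "else": "z"}))     # z (else branch)
```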

### Formulas

In ParamGen, a variable value may be specified as a formula to be evaluated. This is done by starting the value with an `=` character followed by a space. See the example below:

In [21]:
pg = ParamGen({
    'var1' : '= 2+3'
})
pg.reduce()
pg.data

{'var1': 5}

Note that formulas may also include expandable variables:

In [22]:
pg = ParamGen({
    'var1' : '= (2+3) / $two'
})
pg.reduce(expand_func)
pg.data

{'var1': 2.5}
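Conceptually, a formula is a string whose leading `=` marks the rest as a Python expression. The sketch below is an assumption about how such a value could be processed (the helper `eval_formula` is hypothetical and uses the built-in `eval`); it is meant to reproduce the two results above, not to describe ParamGen's internals.

```python
def eval_formula(value):
    """Toy formula evaluation: strings starting with "= " are evaluated."""
    if isinstance(value, str) and value.startswith("= "):
        # eval() is used for illustration only; never call it on untrusted input.
        return eval(value[2:])
    return value


print(eval_formula("= 2+3"))          # 5
print(eval_formula("= (2+3) / 2.0"))  # 2.5, with $two already expanded to 2.0
print(eval_formula("foo"))            # foo (not a formula, returned unchanged)
```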

### Appending

The `append()` method of ParamGen adds the data of a given ParamGen instance to the existing data. If a data entry with the same name already exists, its value gets overridden with the new value. Otherwise, the new data entry is simply appended to the existing data.

In [23]:
pg1 = ParamGen({'a':1, 'b':2})
pg2 = ParamGen({'b':3, 'c':4})
pg1.append(pg2)
pg1.data

{'a': 1, 'b': 3, 'c': 4}
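For flat dictionaries like these, the semantics described above coincide with Python's built-in `dict.update`. The snippet below uses plain dictionaries only and makes no claim about how `append()` treats nested data.

```python
existing = {"a": 1, "b": 2}
incoming = {"b": 3, "c": 4}

# dict.update overrides entries with matching keys and adds the remaining ones.
existing.update(incoming)
print(existing)  # {'a': 1, 'b': 3, 'c': 4}
```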

## 3. Special Use Cases

### Referencing across multiple ParamGen instances

The genericity that comes with the custom expand functions allows us to reference the data of a ParamGen instance in another ParamGen instance. To illustrate this use case, we define two ParamGen instances:

In [24]:
pg1 = ParamGen({
    'var1' : 'foo',
    'var2' : 'bar'
})

pg2 = ParamGen({
    'var3': '${var1}${var2}'        
})

Notice above that the data of the second ParamGen instance, pg2, includes references to variables defined in the pg1 data. Now let's reduce the data of pg2 and pass a lambda function that returns the values of pg1 variables:

In [25]:
pg2.reduce(lambda varname: pg1.data[varname])

In [26]:
pg2.data

{'var3': 'foobar'}

**Note:** Cross-referencing, i.e., references to variables within the same instance, is not supported. (May be added later on if need be. -aa)

### Custom value inference

More involved expand functions may allow higher customizations. In the below example, for instance, we read in an xarray dataset `ds` and set the value of an expandable variable `my_fields_list` to the list of all variables in `ds`, that is `"lat air lon time"`.

In [27]:
pg1 = ParamGen({
    'var1' : 'foo',
    'var2' : 'bar'
})

pg2 = ParamGen({
    'param1': '${var1}${var2}',
    'param2': '$my_fields_list'
})

def expand_function(varname):
    if varname in pg1.data:
        return pg1.data[varname]
    elif varname == "my_fields_list":
        try:
            import xarray as xr
            ds = xr.tutorial.load_dataset("air_temperature")
            return ' '.join([var for var in ds.variables])
        except Exception:
            print("Cannot load xarray module. Skipping...")
            return None

In [28]:
pg2.reduce(expand_function)

In [29]:
pg2.data

{'param1': 'foobar', 'param2': '"lat air lon time"'}

### Regular expression searches as guards

In some cases, it may be desirable to do regex searches as opposed to simpler string comparisons. One such example use case is provided below:

In [30]:
pg = ParamGen({
    'USE_MARBL_TRACERS': {
        'bool(re.search("MOM6%[^_]*MARBL", $COMPSET ))': True,
        'else': False
    }
})

In [31]:
pg.reduce(lambda varname:'1850_DATM%NYF_SLND_DICE%SSMI_MOM6%MARBL_DROF%NYF_SGLC_SWAV')
pg.data

{'USE_MARBL_TRACERS': True}

Notice how the `re.search()` method is used above in the first guard. The guard evaluates to `True` since the `re.search()` method is able to find the `"MOM6%[^_]*MARBL"` regex pattern in the specified COMPSET: 
`1850_DATM%NYF_SLND_DICE%SSMI_MOM6%MARBL_DROF%NYF_SGLC_SWAV`.
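The pattern can be tried out directly with the standard `re` module. In the sketch below, the first string is the COMPSET from the example above; the other two strings are hypothetical inputs made up to show when the pattern does not match.

```python
import re

pattern = "MOM6%[^_]*MARBL"
compset = "1850_DATM%NYF_SLND_DICE%SSMI_MOM6%MARBL_DROF%NYF_SGLC_SWAV"

# The compset from the example above: MOM6 carries the MARBL modifier.
print(bool(re.search(pattern, compset)))  # True

# Hypothetical compset string in which MOM6 has no modifier at all.
no_marbl = "1850_DATM%NYF_SLND_DICE%SSMI_MOM6_DROF%NYF_SGLC_SWAV"
print(bool(re.search(pattern, no_marbl)))  # False

# Hypothetical string where MARBL appears only after an underscore, i.e.,
# outside the MOM6 field; [^_]* cannot cross the "_" separator.
other_field = "MOM6%XYZ_FOO%MARBL"
print(bool(re.search(pattern, other_field)))  # False
```

The `[^_]*` part is what restricts the search to the MOM6 field, since fields in the string are separated by underscores.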

## 4. Notes on ParamGen for MOM6 in CESM

Here, we briefly describe how ParamGen is used within CESM to generate MOM6 runtime parameters. Called from the `buildnml` script of MOM6+CESM, the ParamGen module is used to generate the four main MOM6 runtime input files:

 1. **MOM_input:** Default MOM6 runtime parameters. The file syntax is based on the simple `key = value` pair. Example parameter entries from a typical MOM6 experiment:
 
 ```
 DIABATIC_FIRST = True   ! If true, apply diabatic and thermodynamic processes...
 DT_THERM = 3600.0       ! The thermodynamic and tracer advection time step.
 MIN_SALINITY = 0.0      ! The minimum value of salinity when BOUND_SALINITY=True.
 ```
 
 2. **MOM_override:** An auxiliary file to override parameter values set in MOM_input
 3. **input.nml:** A file to set some general MOM6 and FMS variables. The file is in classical Fortran namelist format.
 4. **diag_table:** An input file to configure the model diagnostics. It has a relatively complex syntax. See https://mom6.readthedocs.io/en/latest/api/generated/pages/Diagnostics.html
  
 In addition to these files, ParamGen is used to generate the MOM6 version of the `input_data_list` file needed by CIME.
 
 ### Default Parameters Databases 
 
 For each of the input files mentioned above, we have a DPD that includes all of the default parameter values for any possible model configuration. In the case of `MOM_input`, for instance, we have a DPD called `MOM_input.yaml` located in `components/mom/param_templates`. An example entry from this file:
 
 ```
     DT_THERM:
        description: |
            "[s] default = 3600.0
            The thermodynamic and tracer advection time step.
        value:
            $OCN_GRID == "MISOMIP": 1800.0
            else: >
                = ( ( $NCPL_BASE_PERIOD =="decade") * 86400.0 * 3650 +
                    ( $NCPL_BASE_PERIOD =="year") * 86400.0 * 365 +
                    ( $NCPL_BASE_PERIOD =="day") * 86400.0 +
                    ( $NCPL_BASE_PERIOD =="hour") * 3600.0 ) / $OCN_NCPL
```

In the above entry, the default value of the runtime parameter `DT_THERM` is specified, which depends on a few CIME variables such as `OCN_GRID`, `NCPL_BASE_PERIOD`, and `OCN_NCPL`. Notice the usage of expandable variables, guards, and a formula.

Assuming `$OCN_GRID != "MISOMIP"`,  `$NCPL_BASE_PERIOD == "day"`, and `$OCN_NCPL==24`, the runtime parameter `DT_THERM` gets reduced to 3600.0 when `.reduce()` method is called. 
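The arithmetic behind this result can be checked in plain Python. The sketch below hard-codes the assumed values of the two CIME variables and evaluates the `else` formula by hand; it relies on the fact that Python treats `True` as 1 and `False` as 0 in arithmetic.

```python
# Assumed values of the CIME variables for this worked example.
NCPL_BASE_PERIOD = "day"
OCN_NCPL = 24

# Each comparison yields a bool; in arithmetic True counts as 1 and False as 0,
# so only the term for the matching base period survives.
seconds_per_period = (
    (NCPL_BASE_PERIOD == "decade") * 86400.0 * 3650
    + (NCPL_BASE_PERIOD == "year") * 86400.0 * 365
    + (NCPL_BASE_PERIOD == "day") * 86400.0
    + (NCPL_BASE_PERIOD == "hour") * 3600.0
)
dt_therm = seconds_per_period / OCN_NCPL
print(dt_therm)  # 3600.0
```

With a base period of one day (86400 s) and 24 coupling intervals per period, the thermodynamic time step equals the coupling interval of 3600 s.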

### Utilizing ParamGen class as a base

For each file category, we have developed an individual class derived from the `ParamGen` class. These classes are located in `CESM/components/mom/cime_config/MOM_RPS/` and are utilized in the `buildnml` file to generate the corresponding input files. 

    - FType_MOM_params.py
    - FType_diag_table.py
    - FType_input_data_list.py
    - FType_input_nml.py

Since the Fortran namelist syntax is already available as an out-of-the-box format, the most straightforward one is `FType_input_nml.py`, which produces the `input.nml` file. The whole module consists of about 15 lines of code:

In [None]:
import os, sys

CIMEROOT = os.environ.get("CIMEROOT")
if CIMEROOT is None:
    raise SystemExit("ERROR: must set CIMEROOT environment variable")
sys.path.append(os.path.join(CIMEROOT, "scripts", "lib", "CIME", "ParamGen"))
from paramgen import ParamGen

class FType_input_nml(ParamGen):
    """Encapsulates data and read/write methods for MOM6 (FMS) input.nml file"""

    def write(self, output_path, case):
        self.reduce(lambda varname: case.get_value(varname))
        self.write_nml(output_path)

Within the `buildnml` script, the above class is instantiated and utilized as follows:

```
...
input_nml = FType_input_nml.from_json(input_nml_src)
input_nml.write(input_nml_dest, case)
```

#### (todo) more descriptions and notes to come.