# Customizing and controlling xclim

xclim's behaviour can be controlled globally or contextually through `xclim.set_options`, which acts the same way as `xarray.set_options`.

In [None]:
import xarray as xr
import xclim
from xclim.testing import open_dataset

Let's create fake data with some missing values by masking the 10th, 20th and 30th of each month. This masks 9.7-10% of the data in every month except February, where the share is 7.1%.

In [None]:
tasmax = xr.tutorial.open_dataset('air_temperature').air.resample(time='D').max(keep_attrs=True)
tasmax = tasmax.where(tasmax.time.dt.day % 10 != 0)
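
The masked shares quoted above can be checked with a little calendar arithmetic. The sketch below uses only the standard library and assumes a non-leap year such as 2013. `masked_fraction` is a helper name introduced here for illustration; it is not part of xclim.

```python
import calendar


def masked_fraction(year, month):
    """Fraction of days in the month whose day number is a multiple of 10."""
    n_days = calendar.monthrange(year, month)[1]
    n_masked = sum(1 for day in range(1, n_days + 1) if day % 10 == 0)
    return n_masked / n_days


# 2013 is an assumption made for this example (it is not a leap year).
fractions = {month: masked_fraction(2013, month) for month in range(1, 13)}
for month, fraction in fractions.items():
    print(f"{calendar.month_abbr[month]}: {fraction:.1%}")
```

Months with 30 or 31 days lose three days each, while a 28-day February loses only two, which is why its share is lower.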

## Checks
Above, we created fake temperature data from an xarray tutorial dataset that doesn't have all the standard CF attributes. By default, when triggering a computation with an Indicator from xclim, warnings will be raised:

In [None]:
tx_mean = xclim.atmos.tx_mean(tasmax=tasmax, freq='MS')  # compute the monthly mean of tasmax

Setting `cf_compliance` to `'log'` mutes those warnings and sends them to the log instead.

In [None]:
xclim.set_options(cf_compliance='log')

tx_mean = xclim.atmos.tx_mean(tasmax=tasmax, freq='MS')  # compute the monthly mean of tasmax

## Missing values

The method used to handle missing values can also be changed globally. For example, change the default missing method to "pct" and set its tolerance to 8%:

In [None]:
xclim.set_options(check_missing='pct', missing_options={'pct': {'tolerance': 0.08}})

tx_mean = xclim.atmos.tx_mean(tasmax=tasmax, freq='MS')  # compute the monthly mean of tasmax
tx_mean.sel(time='2013', lat=75, lon=200)

Only February has non-masked data. To use the "wmo" method (with its default options) for a single computation only, use `set_options` as a context manager:

In [None]:
with xclim.set_options(check_missing="wmo"):
    tx_mean = xclim.atmos.tx_mean(tasmax=tasmax, freq='MS')  # compute the monthly mean of tasmax
tx_mean.sel(time='2013', lat=75, lon=200)

This method checks that there are fewer than `nm=5` invalid values in a month and that there are no consecutive runs of `nc>=4` invalid values. Thus, every month is now valid.
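
To make the rule concrete, here is a minimal pure-Python sketch of the two criteria as they are described above, applied to a 31-day month where days 10, 20 and 30 are masked. It is a simplified illustration and not xclim's implementation; `longest_run` and `wmo_is_missing` are names chosen for this example, and the thresholds are the `nm` and `nc` values quoted in the text.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    longest = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def wmo_is_missing(invalid, nm=5, nc=4):
    """True if the period has nm or more invalid values, or a run of nc or more."""
    return sum(invalid) >= nm or longest_run(invalid) >= nc


# A 31-day month with the 10th, 20th and 30th masked, as in the fake data.
january = [day % 10 == 0 for day in range(1, 32)]
print(sum(january), longest_run(january), wmo_is_missing(january))
```

With only three isolated invalid days, neither criterion is met, so the month is kept.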

Finally, advanced users can register their own method. xclim's missing methods are in fact based on class instances. Thus, to create a custom missing class, one should implement a subclass of `xclim.core.missing.MissingBase` and override at least the `is_missing` method. The method should take a `null` argument, a `count` argument and any number of other keyword arguments.

- `null` is a `DataArrayResample` instance of the resampled mask of invalid values in the input dataarray.
- `count` is the number of days in each resampled period.

The `is_missing` method should return a boolean mask, at the same frequency as the indicator output (same as `count`), where True values are for elements that are considered missing and masked on the output.

When registering the class with the `xclim.core.missing.register_missing_method` decorator, the keyword arguments will be registered as options for the missing method. One can also implement a `validate` static method that receives only those options and returns whether they should be considered valid or not.

In [None]:
from xclim.core.missing import register_missing_method
from xclim.core.missing import MissingBase
from xclim.indices.run_length import longest_run

@register_missing_method("consecutive")
class MissingConsecutive(MissingBase):
    """Any period with max_n or more consecutive missing values is considered invalid."""
    def is_missing(self, null, count, max_n=5):
        return null.map(longest_run, dim="time") >= max_n

    @staticmethod
    def validate(max_n):
        return max_n > 0


The new method is now accessible and usable with:

In [None]:
with xclim.set_options(check_missing="consecutive", missing_options={'consecutive': {'max_n': 2}}):
    tx_mean = xclim.atmos.tx_mean(tasmax=tasmax, freq='MS')  # compute the monthly mean of tasmax
tx_mean.sel(time='2013', lat=75, lon=200)

## Indices vs Indicators

Internally and in the documentation, xclim makes a distinction between "indices" and "indicators".
 
### indice

 - A Python function accepting DataArrays and other parameters (usually built-in types).
 - Returns one or several DataArrays.
 - Handles the units: checks the inputs' units and sets proper CF-compliant output units.
 - Performs no other checks and sets no other metadata.
 - Accessible through [xclim.indices](../indices.rst).
 
### indicator

 - An instance of a subclass of `xclim.core.indicator.Indicator` that wraps around an `indice` (stored in its `compute` property). 
 - Returns one or several DataArrays.
 - Handles missing values, performs input data and metadata checks (see [usage](usage.ipynb#Health-checks-and-metadata-attributes)).
 - Always outputs data in the same units.
 - Adds dynamically generated metadata to the output after computation.
 - Accessible through [xclim.indicators](../indicators_api.rst).

Most metadata stored in the Indicators is parsed from the underlying indice documentation, so defining indices with complete documentation and an appropriate signature helps the process. The next two sections go into detail on the definition of both objects.

## Defining new indices

The annotated example below shows the general template to be followed when defining proper _indices_. In the comments `Ind` is the indicator instance that would be created from this function.

<div class="alert alert-info">

Note that following these standards is not strictly _required_ when writing indices that will be wrapped in indicators. Problems in parsing will not raise errors at runtime, but will result in Indicators with poorer metadata than expected by most users, especially those that dynamically use indicators in other applications where the code is inaccessible, like web services.
    
</div>

In [None]:
from xclim.core.units import declare_units, convert_units_to
from xclim.indices.generic import threshold_count

# Declaring units. The decorator checks the input units, so passing incompatible arrays will raise an error.
# As of xclim 0.24, units must be set appropriately by the function.
# Input units can be given either directly ("K", "degC", "m", etc.) or by dimensionality ("[temperature]", "[length]", etc.).
# Output units will only be reformatted to a CF-compliant format, ensuring consistency in xclim.
@declare_units(tasmax="[temperature]", thresh="[temperature]")
# Annotations are important : input *variables* need to have a DataArray annotation so
# the indicator can parse them from a dataset. Argument order is also important,
# inputs used as variables are first then the parameters follow.
def tx_days_compare(tasmax: xr.DataArray, thresh: str = "0 degC", op: str = '>', freq: str = "YS"):
    r"""Number of days where maximum daily temperature is above or under a threshold.
    # First line of the docstring is "Ind.title"

    The daily maximum temperature is compared to a threshold using a given operator and the number
    of days where the condition is true is returned.
    # The first paragraph is "Ind.abstract"
     
    It assumes a daily input.
    # All subsequent paragraphs are ignored and not parsed by the indicator.

    Parameters
    ----------
    # Each parameter will be parsed into the "Ind.parameters" list. The list is created from the call
    # signature. The name, description, units and choices are read from here. The default value and 
    # the annotation are read from the call signature. The annotations here should be the same unless
    # more human-readable versions are needed, or a fixed set of choices is required.
    tasmax : xarray.DataArray 
      Maximum daily temperature.
    thresh : str
      Threshold temperature to compare to.
    op : {'>', '<'}
      The operator to use.
      # A fixed set of choices can be imposed. Only strings, numbers, booleans or None are accepted.
    freq : str
      Resampling frequency.

    Returns
    -------
    # The line following the return type is set to "Ind.cf_attrs[0]['long_name']"
    # (the long_name attribute of the first output)
    # Output units (or dimension) are noted in the first line, but will not be parsed in the indicator.
    xarray.DataArray, [time]
      Number of days where maximum daily temperature is above or under the threshold.

    Notes
    -----
    # Notes will be added as-is to the indicator.
    Let :math:`TX_{ij}` be the maximum temperature at day :math:`i` of period :math:`j`. Then the maximum
    daily maximum temperature for period :math:`j` is:

    .. math::

        TXx_j = max(TX_{ij})
    
    References
    ----------
    # References are also added as-is to the indicator.
    Smith, John and Tremblay, Robert, A dummy citation for examples in documentation. J. RTD. (2020).
    """
    thresh = convert_units_to(thresh, tasmax)  # Convert a unit string to a number.
    out = threshold_count(tasmax, op, thresh, freq)  # Basic operations are already implemented in xclim.indices.generic
    # Indices do not need to set any attributes, except units.
    # If the output has no units, the `declare_units` decorator will patch the one passed.
    # However, if xr.set_options(keep_attrs=True) was used here, temperature units might have been kept.
    # We patch just to be sure.
    out.attrs['units'] = "days"
    return out

# Small hack to get a normal docstring
# Comments are not really supported in docstrings, so we removed those added for the example.
doclines = [line for line in tx_days_compare.__doc__.split('\n') if not line.strip().startswith('#')]
tx_days_compare.__doc__ = '\n'.join(doclines)
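
To illustrate how the first docstring lines map to `title` and `abstract`, here is a minimal standard-library sketch. It is not xclim's parser: `title_and_abstract` and `demo_index` are hypothetical names, and the sketch assumes the docstring has a title line, a blank line, and then an abstract paragraph.

```python
import inspect


def title_and_abstract(func):
    """Return the first line and the first following paragraph of a docstring."""
    doc = inspect.cleandoc(func.__doc__ or "")
    blocks = [block for block in doc.split("\n\n") if block.strip()]
    title = blocks[0].strip() if blocks else ""
    # Collapse line breaks inside the paragraph into single spaces.
    abstract = " ".join(blocks[1].split()) if len(blocks) > 1 else ""
    return title, abstract


def demo_index(tasmax, thresh="0 degC"):
    """Number of days above a threshold.

    The daily maximum temperature is compared to a threshold
    and the number of days exceeding it is returned.

    Parameters
    ----------
    tasmax : xarray.DataArray
      Maximum daily temperature.
    """


title, abstract = title_and_abstract(demo_index)
print(title)
print(abstract)
```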

### Generic functions for common operations

The [xclim.indices.generic](../api.rst#generic-indices-submodule) submodule contains useful functions for common computations (like `threshold_count` or `select_resample_op`) and many basic indice functions, as defined by [clix-meta](https://github.com/clix-meta/clix-meta). In order to reduce duplicate code, their use is recommended for xclim's indices. As previously said, units must be handled explicitly when doing so is not trivial. [xclim.core.units](../api.rst#module-xclim.core.units) exposes a few helpers for that (like `convert_units_to`, `to_agg_units` or `rate2amount`).
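
For readers unfamiliar with the idea behind a threshold count, the sketch below reproduces it in plain Python on a handful of made-up daily values, grouping by month. It is only an analogue for illustration: the real `threshold_count` operates on DataArrays, and `monthly_threshold_count` and `OPERATORS` are names invented for this example.

```python
import operator
from collections import Counter
from datetime import date, timedelta

OPERATORS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def monthly_threshold_count(days, values, op, thresh):
    """Count, for each (year, month), the values satisfying `value op thresh`."""
    compare = OPERATORS[op]
    counts = Counter()
    for day, value in zip(days, values):
        counts[(day.year, day.month)] += int(compare(value, thresh))
    return dict(counts)


start = date(2013, 1, 30)
days = [start + timedelta(days=i) for i in range(4)]  # Jan 30, Jan 31, Feb 1, Feb 2
values = [-2.0, 1.5, 3.0, 4.0]  # daily maxima in degC (made-up numbers)
above = monthly_threshold_count(days, values, ">", 0.0)
below = monthly_threshold_count(days, values, "<", 0.0)
print(above, below)
```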

## Defining new indicators

xclim's Indicators are instances of (subclasses of) `xclim.core.indicator.Indicator`. While they are central to xclim, their construction can be somewhat tricky as a lot happens backstage. Essentially, they act as self-aware functions, taking a set of input variables (DataArrays) and parameters (usually strings, integers or floats), performing some health checks on them and returning one or multiple DataArrays, with CF-compliant (and potentially translated) metadata attributes, masked according to a given set of missing-value rules.
They define the following key attributes:

- the `identifier`, a string that uniquely identifies the indicator,
- the `realm`, one of "atmos", "land", "seaIce" or "ocean", classifying the domain of use of the indicator,
- the `compute` function that returns one or more DataArrays, the "indice",
- the `cfcheck` and `datacheck` methods that make sure the inputs are appropriate and valid,
- the `missing` function that masks elements based on null values in the input,
- all metadata attributes that will be attributed to the output and that document the indicator:
    - Indicator-level attributes are: `title`, `abstract`, `keywords`, `references` and `notes`.
    - Output variable attributes (respecting CF conventions) are: `var_name`, `standard_name`, `long_name`, `units`, `cell_methods`, `description` and `comment`.

Output variable attributes are grouped in `Indicator.cf_attrs` and input parameters are documented in `Indicator.parameters`.

A particularity of Indicators is that each instance corresponds to a single class: when creating a new indicator, a new class is automatically created. This is done for easy construction of indicators based on others, as shown further down.
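
The "one class per instance" idea can be pictured with a toy model. The sketch below is not xclim's implementation; `MiniIndicator` and its `registry` attribute are hypothetical and only mimic the behaviour described here: instantiating with keyword attributes first builds a new subclass, records it under its uppercase identifier, and then returns an instance of that subclass.

```python
class MiniIndicator:
    """Toy base class: instantiating with keyword attributes creates a subclass."""

    identifier = "base"
    units = ""
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every new class is recorded under its uppercase identifier.
        MiniIndicator.registry[cls.identifier.upper()] = cls

    def __new__(cls, **attrs):
        new_cls = cls
        if attrs:
            # Build a new subclass carrying the overridden attributes.
            name = attrs.get("identifier", cls.identifier).upper()
            new_cls = type(name, (cls,), attrs)
        return super().__new__(new_cls)

    def __init__(self, **attrs):
        # The attributes already live on the class; nothing to do per instance.
        pass


tg_mean = MiniIndicator(identifier="tg_mean", units="K")
# Instantiating the class of an existing indicator subclasses it again.
tg_mean_c = type(tg_mean)(identifier="tg_mean_c", units="degC")
print(type(tg_mean).__name__, type(tg_mean_c).__name__, sorted(MiniIndicator.registry))
```

Because each new indicator lives in its own class, deriving a variant only requires overriding the attributes that differ.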

See the [class documentation](../api.rst#module-xclim.core.indicator) for more info on the meaning of each attribute. The [indicators](https://github.com/Ouranosinc/xclim/tree/master/xclim/indicators) module contains over 50 examples of indicators to draw inspiration from.

### Metadata parsing vs explicit setting

As explained above, most metadata can be parsed from the indice's signature and docstring. Otherwise, it can always be set when creating a new Indicator instance *or* a new subclass. When _creating_ an indicator, output metadata attributes can be given as strings, or as lists of strings for indicators returning multiple outputs. However, they are stored in the `cf_attrs` list of dictionaries on the instance.
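
The relation between attributes given at creation and a list of per-output dictionaries can be sketched as follows. This is an illustration only: `to_cf_attrs` is a hypothetical helper, and repeating a single string for every output is an assumption of this sketch rather than documented xclim behaviour.

```python
def to_cf_attrs(n_outputs, **attrs):
    """Turn attributes given as strings or lists into one dict per output."""
    cf_attrs = [{} for _ in range(n_outputs)]
    for name, value in attrs.items():
        # Assumption of this sketch: a single string applies to every output.
        values = [value] * n_outputs if isinstance(value, str) else list(value)
        if len(values) != n_outputs:
            raise ValueError(f"{name!r} needs {n_outputs} values, got {len(values)}")
        for output_attrs, item in zip(cf_attrs, values):
            output_attrs[name] = item
    return cf_attrs


single = to_cf_attrs(1, units="degC", long_name="Mean temperature")
double = to_cf_attrs(2, units="days", var_name=["start", "length"])
print(single)
print(double)
```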

### Internationalization of metadata

xclim offers the possibility to translate the main Indicator metadata fields and automatically add the translations to the outputs. The mechanism is explained in the [Internationalization](../internationalization.rst) page.

### Inputs and checks

xclim uses two mechanisms to decide which input arguments of the indicator's call function are considered _variables_ and which are _parameters_.

- The `nvar` indicator integer attribute sets the number of arguments that are sent to the `datacheck` and `cfcheck` methods (in the signature's order).
- The annotations of the underlying indice (the `compute` method). Arguments annotated with the `xarray.DataArray` type are considered _variables_ and can be read from the dataset passed in `ds`.
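
The annotation-based mechanism can be illustrated with `inspect.signature`. In this minimal sketch, `DataArray` is a stand-in class so that the example runs without xarray, and `split_arguments` is a hypothetical helper, not an xclim function.

```python
import inspect


class DataArray:
    """Stand-in for xarray.DataArray, used only to keep the example self-contained."""


def tx_days_compare(tasmax: DataArray, thresh: str = "0 degC", op: str = ">", freq: str = "YS"):
    """Placeholder with the same signature as the indice defined earlier."""


def split_arguments(func, variable_type):
    """Split argument names into variables and parameters based on annotations."""
    variables, parameters = [], []
    for name, param in inspect.signature(func).parameters.items():
        if param.annotation is variable_type:
            variables.append(name)
        else:
            parameters.append(name)
    return variables, parameters


variables, parameters = split_arguments(tx_days_compare, DataArray)
print(variables, parameters)
```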

### Indicator creation

There are two and a half ways to create indicators:

- 1) By initializing an existing indicator (sub)class
- 1B) Subclassing an existing indicator and then initializing it.
- 2) From a dictionary

The first two methods are best when defining indicators in scripts or external modules and are explained here. Number 2 is best used when building virtual modules through YAML files, and is explained in the [mappings](mappings.ipynb) notebook and in the [submodule doc](../api.rst#module-xclim.core.indicator).

Creating a new indicator that simply modifies a few output metadata attributes of an existing one is a simple call like:

In [None]:
from xclim.core.indicator import registry
from xclim.core.utils import wrapped_partial
from xclim.indices import tg_mean

# An indicator based on tg_mean, but returning Celsius and fixed on annual resampling
tg_mean_c = registry['TG_MEAN'](
    identifier='tg_mean_c',
    units='degC',
    title='Mean daily mean temperature but in degC',
    compute=wrapped_partial(tg_mean, freq='YS')  # We inject the freq arg.
)

In [None]:
print(tg_mean_c.__doc__)

The registry is a dictionary mapping indicator identifiers (in uppercase) to their class. This way, we could subclass `tg_mean` to create our new indicator. `tg_mean_c` is exactly the same as `atmos.tg_mean`, except that it outputs the result in degrees Celsius instead of kelvins, has a different title and resamples to "YS". The `identifier` keyword is needed here to differentiate the new indicator from `tg_mean` itself. Had it been omitted, a warning would have been raised and further subclassing of `tg_mean` would in fact have subclassed `tg_mean_c`, which is not wanted!

This method of class initialization is good for the cases where only metadata and perhaps the compute function are changed. However, to modify the CF compliance and data checks, we recommend creating a class first:

In [None]:
class TG_MAX_C(registry['TG_MAX']):
    identifier = "tg_max_c"
    missing = "wmo"
    title = 'Maximum daily mean temperature'
    units = 'degC'

    @staticmethod
    def cfcheck(tas):
        xclim.core.cfchecks.check_valid(tas, "standard_name", "air_temperature")
        # Add a very strict check on the long name.
        # glob-like wildcards can be used (here *)
        xclim.core.cfchecks.check_valid(tas, "long_name", "Surface * daily temperature")

    @staticmethod
    def datacheck(tas):
        xclim.core.datachecks.check_daily(tas)
        
tg_max_c = TG_MAX_C()

In [None]:
ds = open_dataset('ERA5/daily_surface_cancities_1990-1993.nc')

ds.tas.attrs['long_name'] = 'Surface average daily temperature'

with xclim.set_options(cf_compliance='raise'): 
    # The data passes the test we implemented ("average" is caught by the *)
    tmaxc = tg_max_c(tas=ds.tas)
tmaxc

A caveat of this method is that the new indicator is added to the registry with a non-trivial name. When an indicator subclass is created in a module outside `xclim.indicators`, the name of its parent module is prepended to its identifier in the registry. Here, the module is `__main__`, so:

In [None]:
'__main__.TG_MAX_C' in registry

A simple way to work around this is to provide a (fake) module name. Passing one of `atmos`, `land`, `seaIce` or `ocean` will result in a normal entry in the registry. However, one could want to keep the distinction between the newly created indicators and the "official" ones by passing a custom name **upon instantiation**:

In [None]:
# Fake module is passed upon instantiation
tg_max_c2 = TG_MAX_C(module='example')
print(tg_max_c2.__module__)
print('example.TG_MAX_C' in registry)
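
The naming rule described above can be summarised in a few lines. This is a sketch of the rule as stated in the text, not xclim's code; `registry_key` and `OFFICIAL_REALMS` are names made up for the example.

```python
OFFICIAL_REALMS = {"atmos", "land", "seaIce", "ocean"}


def registry_key(identifier, module):
    """Registry key for an indicator, prefixed by its module unless it is an official realm."""
    name = identifier.upper()
    if module in OFFICIAL_REALMS:
        return name
    return f"{module}.{name}"


for module in ("__main__", "example", "atmos"):
    print(module, "->", registry_key("tg_max_c", module))
```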

One pattern to create multiple indicators is to write a standard subclass that declares all the attributes that are common to indicators, then call this subclass with the custom attributes. See for example in [xclim.indicators.atmos](https://github.com/Ouranosinc/xclim/blob/master/xclim/indicators/atmos/_temperature.py) how indicators based on daily mean temperatures are created from the `Tas` subclass of the `Daily` class.

## Defining new modules

The [Mapped modules](mappings.ipynb) page explains how old and new indicators can be wrapped and regrouped in virtual modules. 