# Checking consistency of a scenario ensemble

In previous model comparison exercises, reported data were sometimes not internally consistent. Possible causes include incomplete variable hierarchies, reporting templates that are incompatible with model specifications, and user error.

In this tutorial, we show how to use **pyam** to check that a scenario ensemble (or just a single scenario) is complete and that timeseries data "add up" across regions and along the variable tree (i.e., that the sum of the values of the subcategories such as `Primary Energy|*` is identical to the value of the category `Primary Energy`).

<div class="alert alert-block alert-warning">
    This feature of the <b>pyam</b> package currently only supports "consistency"
    in the sense of a strictly hierarchical variable tree
    (with subcategories summing up to the category value)
    and subregions of depth 1 adding up to the "World" region.
</div>

In [None]:
import pandas as pd
import pyam

We start with a hypothetical tutorial data set, which is constructed to highlight the individual validation features below.

The scenario below has two inconsistencies:

1. In year `2010` and regions `region_b` & `World`, the values of coal and wind do not add up to the total `Primary Energy` value
2. In year `2020` in the `World` region, the values of `Primary Energy` and `Primary Energy|Coal` are not the sum of the corresponding values in `region_a` and `region_b` <br />
   (but wind and coal do add up to `Primary Energy` in each sub-region)

In [None]:
tutorial_df = pd.DataFrame([
    ['World', 'Primary Energy', 'EJ/y', 7, 15],
    ['World', 'Primary Energy|Coal', 'EJ/y', 4, 11],
    ['World', 'Primary Energy|Wind', 'EJ/y', 2, 4],
    ['region_a', 'Primary Energy', 'EJ/y', 4, 8],
    ['region_a', 'Primary Energy|Coal', 'EJ/y', 2, 6],
    ['region_a', 'Primary Energy|Wind', 'EJ/y', 2, 2],
    ['region_b', 'Primary Energy', 'EJ/y', 3, 6],
    ['region_b', 'Primary Energy|Coal', 'EJ/y', 2, 4],
    ['region_b', 'Primary Energy|Wind', 'EJ/y', 0, 2],
],
    columns=['region', 'variable', 'unit', 2010, 2020]
)

df = pyam.IamDataFrame(data=tutorial_df, model='model_a', scenario='scen_a')

## Summary

With the [check_internal_consistency()](https://pyam-iamc.readthedocs.io/en/stable/api.html#pyam.IamDataFrame.check_internal_consistency) method, we can check the internal consistency of a scenario ensemble (i.e., an `IamDataFrame` instance).
If this method returns `None`, the scenario ensemble is internally consistent (i.e., each total equals the sum of its subcategories, and the `World` values equal the sum of the regional values).

In the rest of this tutorial, we give you a chance to better understand this method. We go through what it is actually doing and show you the kind of output you can expect.

## Checking that variables are the sum of their components

We are going to use the [check_aggregate()](https://pyam-iamc.readthedocs.io/en/stable/api.html#pyam.IamDataFrame.check_aggregate) method of the `IamDataFrame`
to check that the components of a variable add up to its total.
This method accepts the arguments of [np.isclose()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html) as keyword arguments. We show our recommended settings here.

In [None]:
np_isclose_args = {
    'equal_nan': True,
    'rtol': 1e-03,
    'atol': 1e-05,
}
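
As a brief illustration of what these settings mean: `numpy.isclose(a, b)` treats two values as equal when `|a - b| <= atol + rtol * |b|`. The sketch below applies the settings to a few made-up numbers (the values and variable names are ours, chosen only for illustration).

```python
import numpy as np

np_isclose_args = {"equal_nan": True, "rtol": 1e-03, "atol": 1e-05}

# A difference of 1 on a value of 6 is far outside the tolerance.
clearly_different = np.isclose(7.0, 6.0, **np_isclose_args)

# A difference of 0.5 on a value of 1000 is within rtol = 1e-3
# (the allowed difference is about 1.0).
within_tolerance = np.isclose(1000.5, 1000.0, **np_isclose_args)

# With equal_nan=True, two missing values count as equal.
both_missing = np.isclose(np.nan, np.nan, **np_isclose_args)

print(clearly_different, within_tolerance, both_missing)
```

A relative tolerance scales with the magnitude of the data, so small rounding differences in large totals are not flagged, while the absolute tolerance covers values close to zero.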

The [check_aggregate()](https://pyam-iamc.readthedocs.io/en/stable/api.html#pyam.IamDataFrame.check_aggregate) function allows us to quickly verify whether a given variable is the sum of its sectoral components (e.g. `Primary Energy` should be equal to `Primary Energy|Coal` plus `Primary Energy|Wind`). The validation is performed separately for each region.

This section illustrates the first constructed inconsistency in this scenario. The returned [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) indicates where the aggregate is not equal to the sum of components.

In [None]:
df.check_aggregate('Primary Energy', **np_isclose_args)

In practice, it would now be up to the user to determine the cause of the inconsistency (or confirm that this is expected for some reason).
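
To make the check concrete, the following is a minimal sketch of the same comparison done by hand with plain pandas and numpy. It is an illustration under our own assumptions (the tutorial data retyped without the unit column; the names `component_sum` and `failed` are ours) and not a description of how pyam implements the check.

```python
import numpy as np
import pandas as pd

years = [2010, 2020]
data = pd.DataFrame(
    [
        ["World", "Primary Energy", 7, 15],
        ["World", "Primary Energy|Coal", 4, 11],
        ["World", "Primary Energy|Wind", 2, 4],
        ["region_a", "Primary Energy", 4, 8],
        ["region_a", "Primary Energy|Coal", 2, 6],
        ["region_a", "Primary Energy|Wind", 2, 2],
        ["region_b", "Primary Energy", 3, 6],
        ["region_b", "Primary Energy|Coal", 2, 4],
        ["region_b", "Primary Energy|Wind", 0, 2],
    ],
    columns=["region", "variable", 2010, 2020],
)

# Declared totals, one row per region.
total = data[data["variable"] == "Primary Energy"].set_index("region")[years]

# Sum of the subcategories `Primary Energy|*`, one row per region.
is_component = data["variable"].str.startswith("Primary Energy|")
component_sum = data[is_component].groupby("region")[years].sum()

# Collect every (region, year) where the total differs from the sum.
failed = [
    (region, year)
    for region in total.index
    for year in years
    if not np.isclose(
        total.loc[region, year],
        component_sum.loc[region, year],
        rtol=1e-3,
        atol=1e-5,
    )
]
print(failed)
```

The two entries in `failed` correspond to the first constructed inconsistency: year `2010` in `World` and `region_b`.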

### Checking multiple variables

We can now construct a loop over all variables in this `IamDataFrame`.

In [None]:
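# Note: recent pyam releases expose the list of variables as the
# attribute `df.variable`; `df.variables()` is the older spelling.
for variable in df.variables():
    df.check_aggregate(variable, **np_isclose_args)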

The log shows the same message as in the previous example, and it reports that the other two variables (coal and wind) cannot be assessed because they have no subcategories.

<div class="alert alert-block alert-info">
Note that the detailed output (i.e., where the aggregation validation fails) is not shown in a notebook when calling the function within a loop.<br />
    Read <a href="https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/">this page</a> for helpful tips and tricks when working with Jupyter notebooks.
</div>
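
One possible way to keep the detailed output when looping is to store each return value and inspect it afterwards. The sketch below is a hypothetical helper (the names `collect_failures` and `demo_check` are ours, not part of pyam); it assumes that the check returns `None` when nothing fails and uses a stand-in check so that it runs without pyam.

```python
def collect_failures(check, names):
    """Run `check(name)` for each name and keep every non-None result."""
    failures = {}
    for name in names:
        result = check(name)
        if result is not None:
            failures[name] = result
    return failures


# Stand-in check for demonstration only: it flags a single variable.
def demo_check(variable):
    return "mismatch found" if variable == "Primary Energy" else None


failures = collect_failures(
    demo_check,
    ["Primary Energy", "Primary Energy|Coal", "Primary Energy|Wind"],
)
print(failures)
```

With the objects from this tutorial, `check` could be `lambda v: df.check_aggregate(v, **np_isclose_args)`, and each stored value can then be displayed in its own cell.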

## Checking that timeseries subregions sum to aggregate regions

Just as we checked above that a variable's components sum to its declared total, we can check that summing over subregions returns the value of a region.

To do this, we use the [check_aggregate_region()](https://pyam-iamc.readthedocs.io/en/stable/api.html#pyam.IamDataFrame.check_aggregate_region) method. By default, this method checks that all the regions in the dataframe sum to `World`.

Using this function allows us to quickly check if a regional total for a single variable is equal to the sum of its regional values.
This section illustrates the second constructed inconsistency in this scenario. 
The returned [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) indicates where the timeseries at the `region='World'` level is not equal to the sum of regional components.

In [None]:
df.check_aggregate_region('Primary Energy', **np_isclose_args)
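
For comparison, here is a minimal sketch of the regional check done by hand with plain pandas and numpy, applied to all three variables at once. It is an illustration under our own assumptions (the tutorial data retyped without the unit column; every region other than `World` is treated as a subregion; the names `subregion_sum` and `failed` are ours) and not pyam's implementation.

```python
import numpy as np
import pandas as pd

years = [2010, 2020]
data = pd.DataFrame(
    [
        ["World", "Primary Energy", 7, 15],
        ["World", "Primary Energy|Coal", 4, 11],
        ["World", "Primary Energy|Wind", 2, 4],
        ["region_a", "Primary Energy", 4, 8],
        ["region_a", "Primary Energy|Coal", 2, 6],
        ["region_a", "Primary Energy|Wind", 2, 2],
        ["region_b", "Primary Energy", 3, 6],
        ["region_b", "Primary Energy|Coal", 2, 4],
        ["region_b", "Primary Energy|Wind", 0, 2],
    ],
    columns=["region", "variable", 2010, 2020],
)

# Values reported at the `World` level, one row per variable.
world = data[data["region"] == "World"].set_index("variable")[years]

# Sum over all other regions, one row per variable.
subregion_sum = data[data["region"] != "World"].groupby("variable")[years].sum()

# Collect every (variable, year) where `World` differs from the regional sum.
failed = [
    (variable, year)
    for variable in world.index
    for year in years
    if not np.isclose(
        world.loc[variable, year],
        subregion_sum.loc[variable, year],
        rtol=1e-3,
        atol=1e-5,
    )
]
print(failed)
```

The entries in `failed` match the second constructed inconsistency: in `2020`, `Primary Energy` and `Primary Energy|Coal` at the `World` level exceed the sum of `region_a` and `region_b`.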

## Checking complete internal consistency of a scenario (ensemble)

The previous sections illustrated two functions to validate specific variables across their subcategories or regional breakdown. These two functions are combined in the [check_internal_consistency()](https://pyam-iamc.readthedocs.io/en/stable/api.html#pyam.IamDataFrame.check_internal_consistency) feature.

If we have an internally consistent scenario ensemble (or single scenario), the function will return `None`; otherwise, it will return a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) indicating all detected inconsistencies.

<div class="alert alert-block alert-warning">
    Note that at the moment, this method assumes that all the regions sum to the <b>World</b> region. See <a href="https://github.com/IAMconsortium/pyam/issues/106">this issue</a> for more information.
</div>

In [None]:
df.check_internal_consistency()

The output of this function reports both types of illustrative inconsistencies in the scenario constructed for this tutorial.