# Checking consistency of a scenario ensemble

Reported data is occasionally not internally consistent. Here we show how to use **pyam** to check that a scenario ensemble is complete and that timeseries data "add up" across regions and along the variable tree (i.e., that the sum of the values of subcategories such as `Primary Energy|*` is identical to the value of the category `Primary Energy`).

We apply these tools to the sample AR5 data.

In [None]:
import time
from pprint import pprint

import pyam
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

We start with the tutorial data. It contains only a fraction of the AR5 data, so it is not internally consistent, which makes it a good dataset to start with.

In [None]:
df = pyam.IamDataFrame(data='tutorial_AR5_data.csv', encoding='utf-8')

In [None]:
df.head()

## Summary

With the `pyam.IamDataFrame.check_internal_consistency` method, we can check the internal consistency of a database. If this method returns `None`, the database is internally consistent (i.e. the total variables are the sum of the sectoral breakdowns and the regional breakdowns).

In the rest of this tutorial, we give you a chance to better understand this method. We go through what it is actually doing and show you the kind of output you can expect.

## Checking variables are the sum of their components

We are going to use the `check_aggregate` method of `IamDataFrame` to check that the components of a variable sum to its total. This method takes `np.isclose` arguments as keyword arguments. We show our recommended settings here.

In [None]:
np_isclose_args = {
    "equal_nan": True,
    "rtol": 1e-03,
    "atol": 1e-05,
}
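To make these tolerances concrete: `numpy.isclose` documents its test as `|a - b| <= atol + rtol * |b|`. The sketch below re-implements that test in plain Python. The helper name `is_close` is ours (it is not part of NumPy or pyam) and infinities are ignored, so treat it as an illustration rather than a replacement.

```python
import math


def is_close(a, b, rtol=1e-3, atol=1e-5, equal_nan=True):
    """Illustrative version of the numpy.isclose test for two floats."""
    if math.isnan(a) or math.isnan(b):
        # with equal_nan=True, two missing values count as matching
        return equal_nan and math.isnan(a) and math.isnan(b)
    return abs(a - b) <= atol + rtol * abs(b)


# 0.05 % apart: inside the 0.1 % relative tolerance
within = is_close(1000.5, 1000.0)
# 0.5 % apart: outside the tolerance
outside = is_close(1005.0, 1000.0)
# both values missing
both_nan = is_close(float("nan"), float("nan"))
```

With the settings above, a reported total may therefore differ from the sum of its components by roughly 0.1 % before it is flagged.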

Using `check_aggregate` on the `IamDataFrame` allows us to quickly check if a single variable is equal to the sum of its sectoral components (e.g. is `Emissions|CO2` equal to `Emissions|CO2|Transport` plus `Emissions|CO2|Solvents` plus `Emissions|CO2|Energy` etc.). A returned `DataFrame` will show us where the aggregate is not equal to the sum of components.

In [None]:
df.check_aggregate(
    "Emissions|CO2", 
    **np_isclose_args
)

As we are missing most of the sectoral data in this subset of AR5, the total variables are mostly not equal to their components. The data table above shows us which model-scenario-region combinations this is the case for. As a user, we would then have to examine which sectors we have for each of these model-scenario-region combinations in order to determine what is missing.
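The idea behind this check can be sketched in plain Python for a single model, scenario, region and year. This is not pyam's implementation: the helper names and the numbers are invented, and we assume that the components of a variable are the reported variables exactly one level below it.

```python
def direct_components(variable, reported):
    """Return the reported variables exactly one level below `variable`."""
    prefix = variable + "|"
    return [
        v for v in reported
        if v.startswith(prefix) and "|" not in v[len(prefix):]
    ]


def aggregate_mismatch(variable, values, rtol=1e-3, atol=1e-5):
    """Return (total, component sum) if they differ, otherwise None."""
    components = direct_components(variable, values)
    if not components:
        return None  # no components reported, nothing to compare
    total = values[variable]
    summed = sum(values[c] for c in components)
    if abs(summed - total) <= atol + rtol * abs(total):
        return None
    return total, summed


# invented values for one model, scenario, region and year
values = {
    "Emissions|CO2": 100.0,
    "Emissions|CO2|Energy": 70.0,
    "Emissions|CO2|Transport": 20.0,
    "Emissions|CO2|Energy|Supply": 40.0,  # deeper level, not summed
}
mismatch = aggregate_mismatch("Emissions|CO2", values)
```

Here `mismatch` holds the reported total and the component sum, which makes the gap (a missing sector, in this toy case) easy to spot.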

### Checking multiple variables

We can then wrap this all together to check all, or a subset, of the variables in an `IamDataFrame`.

In [None]:
for variable in df.filter(level=1).variables():
    diff = df.check_aggregate(
        variable, 
        **np_isclose_args
    )
    # you could then make whatever summary you wanted
    # with diff

The output tells us where there are issues as well as where it is not possible to actually check sums because no components have been reported. 
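As a rough illustration of which variables such a loop visits, the sketch below assumes that the "level" of a variable is the number of `|` separators in its name, and that a variable can only be checked if something is reported beneath it. The variable list and helper name are invented for the example.

```python
def level(variable):
    """Depth in the variable tree, counted as the number of '|' separators."""
    return variable.count("|")


# invented list of reported variables
reported = [
    "Primary Energy",
    "Primary Energy|Coal",
    "Emissions|CO2",
    "Emissions|CO2|Energy",
    "Emissions|CH4",
]

level_one = [v for v in reported if level(v) == 1]

# level-1 variables with nothing reported beneath them cannot be checked
uncheckable = [
    v for v in level_one
    if not any(other.startswith(v + "|") for other in reported)
]
```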

## Checking that regions sum to aggregate regions

Similarly to checking that the sum of a variable's components gives the declared total, we can check that summing regions gives the intended total.

To do this, we use the `check_aggregate_region` method of `IamDataFrame`. By default, this method checks that all the regions in the dataframe sum to World.

Using `check_aggregate_region` on the `IamDataFrame` allows us to quickly check if a regional total for a single variable is equal to the sum of its regional contributors. A returned `DataFrame` will show us where the aggregate is not equal to the sum of components.

In [None]:
df.check_aggregate_region(
    "Emissions|CO2",
    **np_isclose_args
)

Again, as the AR5 snapshot is incomplete, the World totals do not match the sum of the regions provided.
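The regional check follows the same pattern as the sectoral one. A minimal sketch for one variable and one year, with invented region names and values, and under the assumption that every region other than the total region contributes to it:

```python
def region_mismatch(values, total_region="World", rtol=1e-3, atol=1e-5):
    """Return (total, regional sum) if they differ, otherwise None."""
    subregions = [r for r in values if r != total_region]
    if not subregions:
        return None  # no regional breakdown reported
    total = values[total_region]
    summed = sum(values[r] for r in subregions)
    if abs(summed - total) <= atol + rtol * abs(total):
        return None
    return total, summed


# invented values of one variable in one year
incomplete = {"World": 36.0, "Region A": 15.0, "Region B": 12.0}
complete = {"World": 36.0, "Region A": 15.0, "Region B": 12.0, "Region C": 9.0}

gap = region_mismatch(incomplete)
no_gap = region_mismatch(complete)
```

If a dataframe also contains intermediate aggregate regions, a naive sum like this one would double count, which is why the relationship between regions matters.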

Once again, we can repeat this analysis over all the variables of interest in an `IamDataFrame`.

In [None]:
for variable in df.variables():
    diff = df.check_aggregate_region(
        variable, 
        **np_isclose_args
    )
    # you could then make whatever summary you wanted with diff;
    # here we simply keep the last non-empty result as an example
    if diff is not None:
        eg = diff

eg.head(20)

## An internally consistent database

If we have an internally consistent database, the returned value will always be `None`.

Repeating the same analysis as above can then confirm that all is well with the database as well as give us some insight into which variables do not have regional or sectoral breakdowns reported.

In [None]:
consistent_df = pyam.IamDataFrame(data="tutorial_check_database.csv", encoding='utf-8')

In [None]:
for variable in consistent_df.filter(level=1).variables():
    diff = consistent_df.check_aggregate(
        variable, 
        **np_isclose_args
    )
    assert diff is None

In [None]:
for variable in consistent_df.filter(level=1).variables():
    diff = consistent_df.check_aggregate_region(
        variable, 
        **np_isclose_args
    )
    assert diff is None

## Putting it all together

Finally, we provide the `check_internal_consistency` method which does all the above for you and returns a dictionary with all of the dataframes which document the errors.

Note: at the moment, this method's regional checking is limited to checking that all the regions sum to the World region. We cannot make this more automatic unless we start to store how the regions relate; see [this issue](https://github.com/IAMconsortium/pyam/issues/106).
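Conceptually, such a method only needs to run every check on every variable and keep the non-empty results. The sketch below shows that pattern with stand-in checks. The function name, the stand-in checks and the `regional` key suffix are assumptions made for illustration; only the `-aggregate` suffix is taken from the output shown in this tutorial.

```python
def collect_errors(variables, checks):
    """Run each named check on each variable; return failures or None."""
    errors = {}
    for variable in variables:
        for suffix, check in checks.items():
            diff = check(variable)
            if diff is not None:
                errors[f"{variable}-{suffix}"] = diff
    # an empty dict means every check passed
    return errors or None


# stand-in checks: each returns None on success, anything else on failure
known_problems = {"Emissions|CO2": "total differs from component sum"}
checks = {
    "aggregate": known_problems.get,
    "regional": lambda variable: None,  # hypothetical suffix
}

errors = collect_errors(["Emissions|CO2", "Primary Energy"], checks)
all_good = collect_errors(["Primary Energy"], checks)
```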

In [None]:
# if all is good, None is returned
print("Checking consistent data")
time.sleep(0.5)
assert consistent_df.check_internal_consistency() is None

# otherwise we get a dict back
print("Checking AR5 subset")
time.sleep(0.5)
errors = df.check_internal_consistency()

In [None]:
pprint(list(errors.keys()))

In [None]:
errors["Emissions|CO2-aggregate"]