# Aggregating and downscaling timeseries data

The **pyam** package offers many tools to facilitate processing of scenario data.
In this notebook, we illustrate methods to aggregate and downscale timeseries data of an **IamDataFrame** across regions and sectors, and to check the consistency of given data along these dimensions.

In this tutorial, we show how to use **pyam** to compute such aggregate timeseries data, and to check that a scenario ensemble (or just a single scenario) is complete and that timeseries data "add up" across regions and along the variable tree (i.e., that the sum of the values of the subcategories `Primary Energy|*` is identical to the value of the category `Primary Energy`).

There are two distinct use cases where these features can be used.

### Use case 1: compute data at higher/lower sectoral or spatial aggregation

Given scenario results at a specific (usually very detailed) sectoral and spatial resolution, **pyam** offers a suite of functions to easily compute aggregate timeseries. For example, these functions can sum up national energy demand to regional or global values,
or compute a global average carbon price weighted by regional emissions.

These functions can be used as part of an automated workflow to generate complete scenario results from raw model outputs.

### Use case 2: check the consistency of data across sectoral or spatial levels

In model comparison exercises or ensemble compilation projects, a user needs to verify the internal consistency of submitted scenario results (cf. Huppmann et al., 2018, doi: [10.1038/s41558-018-0317-4](http://rdcu.be/9i8a)).
Such inconsistencies can be due to incomplete variable hierarchies, reporting templates incompatible with model specifications, or user error.

## Overview

This notebook illustrates the following features:

0. Import data from file and inspect the scenario
1. Aggregate timeseries over sectors (i.e., sub-categories)
2. Aggregate timeseries over regions including weighted average
3. Downscale timeseries given at a region level to sub-regions using a proxy variable
4. Downscale timeseries using an explicit weighting dataframe
5. Check the internal consistency of a scenario (ensemble)

<div class="alert alert-info">

**See Also**

The **pyam** package also supports algebraic operations (addition, subtraction, multiplication, division)
on the timeseries data along any axis or dimension.
See the [algebraic operations tutorial notebook](https://pyam-iamc.readthedocs.io/en/stable/tutorials/algebraic_operations.html)
for more information.

</div>

In [None]:
import pandas as pd
import pyam

## 0. Import data from file and inspect the scenario

The stylized scenario used in this tutorial has data for two regions (`reg_a` & `reg_b`) as well as the `World` aggregate, and four categories of variables: primary energy demand, emissions, carbon price, and population.

In [None]:
df = pyam.IamDataFrame(data='tutorial_data_aggregating_downscaling.csv')

In [None]:
df.region

In [None]:
df.variable

## 1. Aggregating timeseries across sectors

Let's first display the data for the components of primary energy demand.

In [None]:
df.filter(variable='Primary Energy|*').timeseries()

Next, we are going to use the [aggregate()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.aggregate) function to compute the total `Primary Energy` from its components (wind and coal) in each region (including `World`).

The function returns an **IamDataFrame**, so we can use [timeseries()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.timeseries) to display the resulting data.

In [None]:
df.aggregate('Primary Energy').timeseries()

If we are interested in **use case 1**, we could use the argument `append=True` to directly add the computed aggregate to the **IamDataFrame** instance.

However, in this tutorial, the data already includes the total primary energy demand. Therefore, we illustrate **use case 2** and apply the [check_aggregate()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.check_aggregate) function to verify whether a given variable is the sum of its sectoral components
(i.e., `Primary Energy` should be equal to `Primary Energy|Coal` plus `Primary Energy|Wind`).
The validation is performed separately for each region.

The function returns `None` if the validation passes (which it does for primary energy demand)
or a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) highlighting where the aggregate does not match (this will be illustrated in the next section).

In [None]:
df.check_aggregate('Primary Energy')

The function also emits a useful log message when there is nothing to check (here, because there are no sectors below `Primary Energy|Wind`).

In [None]:
df.check_aggregate('Primary Energy|Wind')

## 2. Aggregating timeseries across subregions

Similarly to the previous example, we now use the [aggregate_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.aggregate_region) function to compute regional aggregates.
By default, this method sums all the regions in the dataframe to make a `World` region; this can be changed with the keyword arguments `region` and `subregions`.

In [None]:
df.aggregate_region('Primary Energy').timeseries()

### Adding regional components

As a next step, we use [check_aggregate_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.check_aggregate_region) to verify that the regional aggregate of CO2 emissions matches the timeseries data given in the scenario.

In [None]:
df.check_aggregate_region('Emissions|CO2')

As announced in the previous section, this validation fails, and the function returns a dataframe showing the data reported at the `region` level next to the aggregate computed from the `subregions`.

Let's look at the entire emissions timeseries in the scenario to find out what is going on.

In [None]:
df.filter(variable='Emissions*').timeseries()

Investigating the data carefully, you will notice that emissions from the energy sector and agriculture, forestry & land use (AFOLU) are given in the subregions and the `World` region, whereas emissions from bunker fuels are only defined at the global level.
This is a common issue in emissions data, where some sources (e.g., global aviation and maritime transport) cannot be attributed to one region.

Luckily, the functions [aggregate_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.aggregate_region)
and [check_aggregate_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.check_aggregate_region)
support this use case:
by adding `components=True`, the regional aggregation will include any sub-categories of the variable that are only present at the `region` level but not in any subregion.

In [None]:
df.aggregate_region('Emissions|CO2', components=True).timeseries()

The regional aggregate now matches the data given at the `World` level in the tutorial data.

Note that the components to be included at the region level can also be specified directly via a list of variables; in this case, we would use `components=['Emissions|CO2|Bunkers']`.
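
The effect of including components can be sketched in plain Python, independent of **pyam**. The numbers below are hypothetical (they are not taken from the tutorial data), and the logic is a simplified reading of the description above rather than the library's implementation: sum over the subregions, then add the sub-categories that are reported only at the `region` level.

```python
# Hypothetical CO2 emissions for one year (illustrative values only).
subregion_totals = {"reg_a": 6.0, "reg_b": 3.0}  # `Emissions|CO2` per subregion
world_only_components = {"Emissions|CO2|Bunkers": 1.0}  # reported only for `World`
reported_world_total = 10.0  # `Emissions|CO2` as reported for `World`

# Plain regional aggregation: sum over the subregions only.
plain_sum = sum(subregion_totals.values())

# Aggregation with components: add the sub-categories that exist only
# at the `World` level and cannot be attributed to any subregion.
sum_with_components = plain_sum + sum(world_only_components.values())

print(plain_sum == reported_world_total)            # False: bunkers are missing
print(sum_with_components == reported_world_total)  # True: bunkers are included
```

In this sketch, the plain sum falls short of the reported total by exactly the bunker emissions, which mirrors the pattern seen in the failed validation above.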

### Computing a weighted average across regions

One other frequent requirement when aggregating across regions is a weighted average.

To illustrate this feature, the tutorial data includes carbon price data.
Naturally, the appropriate weighting data are the regional carbon emissions.

The following cells show:

0. The carbon price data across the regions
1. A (failing) validation that the regional aggregation (without weights) matches the reported prices at the `World` level
2. The emissions-weighted average of carbon prices returned as a new **IamDataFrame**

In [None]:
df.filter(variable='Price|Carbon').timeseries()

In [None]:
df.check_aggregate_region('Price|Carbon')

In [None]:
df.aggregate_region('Price|Carbon', weight='Emissions|CO2').timeseries()
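
To make the arithmetic behind a weighted average explicit, here is a minimal sketch in plain Python. The prices and emissions are hypothetical values chosen for easy mental arithmetic, and the helper `weighted_average` is our own illustration, not a **pyam** function.

```python
# Hypothetical regional carbon prices and CO2 emissions for one year.
price = {"reg_a": 10.0, "reg_b": 30.0}    # carbon price per region
emissions = {"reg_a": 6.0, "reg_b": 2.0}  # emissions per region, used as weights


def weighted_average(values, weights):
    """Return sum(w_r * v_r) / sum(w_r) over all regions r."""
    total_weight = sum(weights[region] for region in values)
    return sum(values[region] * weights[region] for region in values) / total_weight


# Simply adding up prices across regions is not meaningful ...
unweighted_sum = sum(price.values())
# ... whereas the emissions-weighted average is (10*6 + 30*2) / (6 + 2).
world_price = weighted_average(price, emissions)

print(unweighted_sum, world_price)
```

The region with higher emissions pulls the average towards its own price, which is why an unweighted sum cannot reproduce the prices reported at the `World` level.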

## 3. Downscaling timeseries data to subregions using a proxy

The inverse operation of regional aggregation is "downscaling" of timeseries data given at a regional level to a number of subregions, usually using some other data as proxy to divide and allocate the total to the subregions.

This section shows an example using the [downscale_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.downscale_region) function to divide the total primary energy demand using population as a proxy.

In [None]:
df.filter(variable='Population').timeseries()

In [None]:
df.downscale_region('Primary Energy', proxy='Population').timeseries()
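
Conceptually, downscaling with a proxy allocates the regional total to each subregion in proportion to its share of the proxy variable. The sketch below assumes this proportional allocation and uses hypothetical numbers (not the tutorial data); the helper `downscale` is for illustration only.

```python
# Hypothetical inputs for one year (not taken from the tutorial data).
world_primary_energy = 12.0                # regional total to be downscaled
population = {"reg_a": 3.0, "reg_b": 1.0}  # proxy value per subregion


def downscale(total, proxy):
    """Allocate `total` to subregions in proportion to their proxy values."""
    proxy_sum = sum(proxy.values())
    return {region: total * value / proxy_sum for region, value in proxy.items()}


downscaled = downscale(world_primary_energy, population)
print(downscaled)

# By construction, the downscaled values add back up to the original total.
assert abs(sum(downscaled.values()) - world_primary_energy) < 1e-12
```

A useful sanity check after any downscaling step is therefore to aggregate the result again and compare it to the original regional total.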

By the way, the functions
[aggregate()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.aggregate), 
[aggregate_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.aggregate_region) and
[downscale_region()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.downscale_region)
also take lists of variables as the `variable` argument.
See the next cell for an example.

In [None]:
var_list = ['Primary Energy', 'Primary Energy|Coal']
df.downscale_region(var_list, proxy='Population').timeseries()

## 4. Downscaling timeseries data to subregions using a weighting dataframe

In cases where using existing data directly as a proxy (as illustrated in the previous section) is not practical,
a user can also create a weighting dataframe and pass that directly to the `downscale_region()` function.

The example below uses the weighting factors implied by the population variable for easy comparison to the previous section.

In [None]:
weight = pd.DataFrame(
    [[0.66, 0.6], [0.33, 0.4]],
    index=pd.Series(['reg_a', 'reg_b'], name='region'),
    columns=pd.Series([2005, 2010], name='year')
)
weight

In [None]:
df.downscale_region(var_list, weight=weight).timeseries()

## 5. Checking the internal consistency of a scenario (ensemble)

The previous sections illustrated two functions to validate specific variables across their sectors (sub-categories) or regional disaggregation.
These two functions are combined in the [check_internal_consistency()](https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.check_internal_consistency) feature.

<div class="alert alert-warning">

This feature of the **pyam** package currently only supports "consistency"
in the sense of a strictly hierarchical variable tree
(with subcategories summing up to the category value, including components as discussed above)
and of all regions summing up to the `World` region.  
See [this issue](https://github.com/IAMconsortium/pyam/issues/106) for more information.

</div>

If we have an internally consistent scenario ensemble (or single scenario), the function will return `None`; otherwise, it will return a concatenation of [pandas.DataFrames](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) indicating all detected inconsistencies.

For this section, we use a tutorial scenario which is constructed to highlight the individual validation features below.
The scenario below has two inconsistencies:

1. In year `2010`, in the regions `region_b` and `World`, the values of coal and wind do not add up to the total `Primary Energy` value
2. In year `2020`, in the `World` region, the values of `Primary Energy` and `Primary Energy|Coal` are not the sums of the respective values in `region_a` and `region_b` <br />
   (but the sum of wind and coal equals `Primary Energy` in each sub-region)

In [None]:
tutorial_df = pyam.IamDataFrame(pd.DataFrame([
    ['World', 'Primary Energy', 'EJ/yr', 7, 15],
    ['World', 'Primary Energy|Coal', 'EJ/yr', 4, 11],
    ['World', 'Primary Energy|Wind', 'EJ/yr', 2, 4],
    ['region_a', 'Primary Energy', 'EJ/yr', 4, 8],
    ['region_a', 'Primary Energy|Coal', 'EJ/yr', 2, 6],
    ['region_a', 'Primary Energy|Wind', 'EJ/yr', 2, 2],
    ['region_b', 'Primary Energy', 'EJ/yr', 3, 6],
    ['region_b', 'Primary Energy|Coal', 'EJ/yr', 2, 4],
    ['region_b', 'Primary Energy|Wind', 'EJ/yr', 0, 2],
],
    columns=['region', 'variable', 'unit', 2010, 2020]
), model='model_a', scenario='scen_a')
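
Before running the automated check, the two inconsistencies can be confirmed by hand. The plain-Python sketch below copies the values from the scenario defined in this section; the data layout and variable names are our own and do not reflect how **pyam** stores or validates data.

```python
# Values copied from the scenario in this section: {region: {variable: (2010, 2020)}}
data = {
    "World": {
        "Primary Energy": (7, 15),
        "Primary Energy|Coal": (4, 11),
        "Primary Energy|Wind": (2, 4),
    },
    "region_a": {
        "Primary Energy": (4, 8),
        "Primary Energy|Coal": (2, 6),
        "Primary Energy|Wind": (2, 2),
    },
    "region_b": {
        "Primary Energy": (3, 6),
        "Primary Energy|Coal": (2, 4),
        "Primary Energy|Wind": (0, 2),
    },
}
years = (2010, 2020)

# Sectoral check: does coal + wind equal the reported total in each region?
sector_mismatches = [
    (region, year)
    for region, values in data.items()
    for i, year in enumerate(years)
    if values["Primary Energy|Coal"][i] + values["Primary Energy|Wind"][i]
    != values["Primary Energy"][i]
]

# Regional check: does region_a + region_b equal `World` for each variable?
region_mismatches = [
    (variable, year)
    for variable in data["World"]
    for i, year in enumerate(years)
    if data["region_a"][variable][i] + data["region_b"][variable][i]
    != data["World"][variable][i]
]

print(sector_mismatches)  # [('World', 2010), ('region_b', 2010)]
print(region_mismatches)  # [('Primary Energy', 2020), ('Primary Energy|Coal', 2020)]
```

Both lists agree with the two inconsistencies described above, so we know what to expect from the automated consistency check.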

All checking functions take arguments for [np.isclose()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html) as keyword arguments. We show our recommended settings and how to use them here.

In [None]:
np_isclose_args = {
    'equal_nan': True,
    'rtol': 1e-03,
    'atol': 1e-05,
}
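
As a rough guide to what these tolerances mean, the sketch below re-implements the comparison documented for `numpy.isclose`, namely `|a - b| <= atol + rtol * |b|`, in plain Python. The helper `within_tolerance` is hypothetical and ignores the `equal_nan` option; it only serves to show how `rtol` and `atol` interact.

```python
def within_tolerance(a, b, rtol=1e-03, atol=1e-05):
    """Tolerance test documented for numpy.isclose: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


# With rtol=1e-3, a deviation of 0.05% of the reference value passes ...
small_deviation = within_tolerance(100.05, 100.0)
# ... whereas a deviation of 1% does not.
large_deviation = within_tolerance(101.0, 100.0)
# For reference values at or near zero, the absolute tolerance decides.
near_zero = within_tolerance(0.000005, 0.0)

print(small_deviation, large_deviation, near_zero)  # True False True
```

Note that the test is not symmetric in `a` and `b`, because the relative tolerance scales with the second argument only.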

In [None]:
tutorial_df.check_internal_consistency(**np_isclose_args)

The output of this function reports both types of illustrative inconsistencies in the scenario constructed for this section.
The log also shows that the other two variables (coal and wind) cannot be assessed because they have no subcategories.

<div class="alert alert-info">

In practice, it would now be up to the user to determine
the cause of the inconsistency (or confirm that this is expected for some reason).

</div>