Partial region aggregation #99

Merged

Conversation

phackstock
Contributor

closes #96.

New Feature

This PR implements partial region aggregation.
Partial region aggregation applies when a common region is already reported natively by a model.
This can encompass all variables or only a subset. It also works for variables that are only reported at the common-region level.

Example

Say we have model native results that look like this:

  • model_a reports a total of three regions: region_a, region_b and World
  • region_a and region_b are native regions and World falls under the category of common regions (although it is reported by the model directly)
  • For region aggregation, World is composed of region_a and region_b.
  • For regions region_a and region_b we report the variables var_a and var_b.
  • For region World we report var_b and var_c.

The corresponding model mapping looks like this:

model: model_a
native_regions:
- region_a
- region_b
common_regions:
  - World:
    - region_a
    - region_b

The region processor now applies the following logic:

  1. Variable var_a is not reported on World level, therefore it is calculated through region aggregation of regions region_a and region_b.
  2. Variable var_b is reported on World level as well as for region_a and region_b. By convention the model native results (on World level) take precedence over performing aggregation.
  3. Variable var_c is only reported on World level so it is taken from there.

The resulting data frame then contains the following:

  • For region_a and region_b: variables var_a and var_b
  • For World: variables var_a (calculated by aggregation), var_b (taken from model native results by precedence), and var_c (taken from model native results); see the sketch below
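
To make the precedence logic concrete, here is a minimal sketch using pyam. This is an illustration only (not the implementation in this PR); the scenario name and numeric values are made up:

import pandas as pd
import pyam

df = pyam.IamDataFrame(
    pd.DataFrame(
        [
            ["model_a", "scen_a", "region_a", "var_a", "EJ/yr", 1, 2],
            ["model_a", "scen_a", "region_b", "var_a", "EJ/yr", 3, 4],
            ["model_a", "scen_a", "region_a", "var_b", "EJ/yr", 5, 6],
            ["model_a", "scen_a", "region_b", "var_b", "EJ/yr", 7, 8],
            ["model_a", "scen_a", "World", "var_b", "EJ/yr", 12, 14],
            ["model_a", "scen_a", "World", "var_c", "EJ/yr", 9, 10],
        ],
        columns=pyam.IAMC_IDX + [2005, 2010],
    )
)

native = df.filter(region=["region_a", "region_b"])  # var_a, var_b
provided_world = df.filter(region="World")           # var_b, var_c (model native)

# var_a is not reported at the World level, so it is computed by aggregation;
# var_b and var_c are taken from the model native World data (precedence)
to_aggregate = [v for v in native.variable if v not in provided_world.variable]
aggregated = pyam.concat(
    [
        df.aggregate_region(v, region="World", subregions=["region_a", "region_b"])
        for v in to_aggregate
    ]
)

result = pyam.concat([native, provided_world, aggregated])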

Open Points

  • As mentioned in point 2 of the region processor logic, we currently simply take the model native results over the aggregation. We could additionally calculate the aggregation, compare the results, and raise an error if they do not match.

@danielhuppmann
Member

I think that implementation was too easy... Best to extend the test, I guess...

@phackstock
Contributor Author

I added deep copies for both vars_default_args as well as vars_kwargs for each common region.
Additionally I added a dedicated test test_core.test_partial_aggregation that tests the case that you outlined @danielhuppmann:

  • We have two common regions common_region_A and common_region_B.
  • For common_region_A we report both Temperature|Mean and Primary Energy so both of those are removed from vars_default_args_cr (I appended "cr" to the name to signal that this copy is common-region specific; I also couldn't overwrite the original vars_default_args as it needs to be reset for every loop iteration)
  • For common_region_B we only report Temperature|Mean so Primary Energy must be obtained by region aggregation.

This test would have failed without the use of the deep copy as we would have removed Primary Energy from the list of variables that need to be aggregated.
Now it works.

One more point remains though: I deliberately chose the numbers so that, for common_region_A (composed of region_A and region_B, as defined in tests/data/region_processing/partial_aggregation/model_a.yaml), the Primary Energy values do not line up: region_A + region_B = 1 + 3 = 4 (for 2005), while common_region_A reports 1 (for 2005).
Currently there are no checks for such consistency, should I add them @danielhuppmann?
If so, what should the failure scenario look like: should we just print a warning or raise an error?
My personal preference would be in favor of this consistency check and raising an error in case of failure.
This could prevent some issues related to faulty aggregation, also connected to missing weights. If the reported common-region values and the aggregation are both incorrect in the same way, however, this check is useless. It might still catch some errors though.

@phackstock
Contributor Author

In the end this was a bit trickier than I thought, including all the edge cases. Nonetheless, here's my first prototype for partial region aggregation with the comparison feature between model native and aggregated results.

Points addressed

  • Added _aggregate_with_model_native, which performs region aggregation and checks for differences between model native and aggregated results using pyam.compare. If a difference is found, a warning-level log message is written detailing the differences.
  • The logic of assembling the results is as follows:
    1. All variables that are shared by model native and aggregated results are compared.
    2. If there are differences, issue a logging call at the warning level.
    3. Assemble the results by using all model native data and adding aggregated data for variables which are not in the model native ones, i.e. data that are only obtained through region aggregation. This step makes sure that results which are exclusive to either the model native or the aggregated data are both taken, and that in case results exist for both, the model native ones take precedence (a sketch of this step follows below).
  • Modified the existing test case to check the logs for the expected message detailing the differences. Example output from the test case:
Differences found between model native and aggregated results
                                                          model native  aggregated
model scenario region          variable       unit  year                          
m_a   s_a      common_region_A Primary Energy EJ/yr 2005             1           4
                                                    2010             2           6
  • Added basic logging config to nomenclature/__init__.py.
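
As a rough illustration of the assembly step in point 3 above, here is a sketch assuming two IamDataFrames provided (model native common-region data) and aggregated (region-aggregated data) for the same region; the function name is hypothetical and not the one used in the PR:

import logging

import pyam

logger = logging.getLogger(__name__)


def compare_and_merge(provided, aggregated):
    """Sketch: warn about differences, then merge with model-native precedence."""
    # compare only variables present in both the provided and aggregated data
    shared = [v for v in provided.variable if v in aggregated.variable]
    if shared:
        diff = pyam.compare(
            provided.filter(variable=shared),
            aggregated.filter(variable=shared),
            left_label="model native",
            right_label="aggregated",
        )
        if not diff.empty:
            logger.warning(
                "Differences found between model native and aggregated results\n%s",
                diff,
            )
    # provided (model native) data takes precedence; aggregated data only
    # fills in variables that the model did not report at this region
    missing = [v for v in aggregated.variable if v not in provided.variable]
    if not missing:
        return provided
    return pyam.concat([provided, aggregated.filter(variable=missing)])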

Open points

  • The warning message is something I just put together quickly; maybe the information can be presented in a better way.
  • Right now the comparison is made on the basis of a single model and region for a common set of variables, but across all provided scenarios. What this does not currently take into account is a variable which is reported on the model native level for only a subset of scenarios. If there is a case where such a setting might occur, I can modify _aggregate_with_model_native accordingly.
  • The configuration of the logger.
  • For variables which are supposed to be renamed, I am not sure what logic should apply if the variable is reported on both the model native and the aggregation level. Currently I've implemented it so that the renaming is done after region processing, meaning the comparison between model native and aggregated results happens before, according to the logic detailed above. I'm not 100 percent sure though if this is desirable.
  • For variables where skip-region-aggregation is True but that exist in model native results, should we still upload them?
  • @danielhuppmann I only merged the main branch into this PR branch after I created _aggregate_with_model_native, so I didn't remember that you had already introduced _aggregate_region. Both helper functions fulfill similar purposes. The difference in terms of usage is that _aggregate_with_model_native is called both for 'simple' region aggregation as well as with 'kwargs'. Right now I've set it up so that _aggregate_with_model_native calls _aggregate_region in case kwargs are present. Should I consolidate both functions into one or is it fine this way?

@danielhuppmann
Member

Thanks @phackstock - the mechanics of the implementation look great, but I have a few implementation suggestions...

@phackstock
Contributor Author

Thanks for the suggestions @danielhuppmann, I updated the code accordingly.
The two _aggregate_region... functions are now combined into a single one.

I also added a check for the warning message in test_region_processing_weighted_aggregation, in the third parameter configuration. I can also split this third one into its own test if you think that would bring more clarity. I could also revert this commit and add it in its own PR since it's only partially related to this one. Additionally, we could also test the weighted aggregation for the missing variable failure scenario. I ran a quick check and the error it produces is the same "Inconsistent index between ...". It would be quite straightforward to distinguish between the two cases, we would just need to include a check for the variables present in the data frame inside the except clause. But that's probably good to include in a separate PR.

Coming back to the open points I raised in my comment two days ago:

  • Should I include a specific check on a scenario level in what is now _aggregate_region? As it stands the comparison is attempted for one region, model and variable but for all scenarios. Right now we would lose a variable that's missing some scenarios in the uploaded results but is present in the aggregated ones. Is this edge case realistic or can we live without having to explicitly iterate over all scenarios?
  • The renaming is done after the aggregation so model native results would still take precedence here. I could see an (admittedly far-fetched) use case where someone might want to upload model native results under the original variable name and the aggregated results under the new name, but I think this is too convoluted and we should keep it as it currently is.
  • If results for common regions are reported from the model directly but a variable has skip-region-aggregation set to True, this variable will not be uploaded even though technically we don't need region aggregation to obtain it. Is this behavior ok?

@danielhuppmann
Member

The two _aggregate_region... functions are now combined into a single one.

Thanks, looks good!

I also added a check for the warning message in test_region_processing_weighted_aggregation, in the third parameter configuration.

Nice, fine to leave here.

Additionally, we could also test the weighted aggregation for the missing variable failure scenario. I ran a quick check and the error it produces is the same "Inconsistent index between ...". It would be quite straightforward to distinguish between the two cases, we would just need to include a check for the variables present in the data frame inside the except clause. But that's probably good to include in a separate PR.

Not quite sure what you mean, maybe make an issue and tackle later...

Coming back to the open points I raised in my comment two days ago:

  • Should I include a specific check on a scenario level in what is now _aggregate_region? As it stands the comparison is attempted for one region, model and variable but for all scenarios. Right now we would lose a variable that's missing some scenarios in the uploaded results but is present in the aggregated ones. Is this edge case realistic or can we live without having to explicitly iterate over all scenarios?

Now that you specifically ask... It might make sense to change the implementation strategy. Do all aggregation first, then do the comparison. I gave it a try, see phackstock#2

Additional benefits: faster performance (because filtering and comparison happen only once, not in every iteration), and only one warning message per model.

  • The renaming is done after the aggregation so model native results would still take precedence here. I could see an (admittedly far-fetched) use case where someone might want to upload model native results under the original variable name and the aggregated results under the new name, but I think this is too convoluted and we should keep it as it currently is.

The suggestion would also take care of this.

  • If results for common regions are reported from the model directly but a variable has skip-region-aggregation set to True, this variable will not be uploaded even though technically we don't need region aggregation to obtain it. Is this behavior ok?

The suggestion would also take care of it, because it filters to all valid variables provided at a common region (a sketch of the approach follows below).
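
For illustration, a rough sketch of that strategy (aggregate everything first, then compare once per model); the structure is assumed and not the actual code in phackstock#2, and combine_first is just one way to express the precedence merge:

import logging

import pyam

logger = logging.getLogger(__name__)


def aggregate_then_compare(df, common_regions):
    """Sketch: aggregate all common regions first, then compare and merge once.

    `common_regions` maps common-region names to their constituent regions,
    e.g. {"World": ["region_a", "region_b"]}.
    """
    aggregated = pyam.concat(
        [
            df.aggregate_region(
                df.filter(region=subregions).variable, region=name, subregions=subregions
            )
            for name, subregions in common_regions.items()
        ]
    )
    provided = df.filter(region=list(common_regions))

    # one comparison across all common regions and scenarios; rows present on
    # only one side are not differences, they are handled by the merge below
    diff = pyam.compare(
        provided, aggregated, left_label="model native", right_label="aggregated"
    ).dropna()
    if not diff.empty:
        logger.warning("Difference between original and aggregated data:\n%s", diff)

    # provided data takes precedence, aggregated data fills the gaps
    merged = (
        provided.data.set_index(provided.dimensions)
        .combine_first(aggregated.data.set_index(aggregated.dimensions))
        .reset_index()
    )
    return pyam.IamDataFrame(merged)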

@phackstock
Contributor Author

Thanks a lot for the PR @danielhuppmann, we can continue the discussion over there.
Overall it looks very promising to me; taking care of some corner cases while also providing a speed boost is a really nice improvement.

…egation-alternative

Alternative implementation for partial region aggregation
@phackstock
Contributor Author

Updates

Tests

Added unit tests for partial region aggregation covering the following cases:

  • Variable is available in provided and aggregated data and the same
  • Variable is only available in the provided data
  • Variable is only available in the aggregated data
  • Variable is not available in all scenarios in the provided data
  • Using skip-aggregation: true should only take provided results
  • Using the region-aggregation attribute to create an additional variable
  • Variable is available in provided and aggregated data but different

Tests also check the logging output (see the sketch below).
Thanks a lot again for your solution @danielhuppmann, it worked for all the edge cases I threw at it.
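
For reference, a minimal sketch of checking the logging output with pytest's caplog fixture; the helper function and data are placeholders rather than the fixtures used in this repository, with values mirroring the Primary Energy example from the earlier comment:

import logging

import pandas as pd
import pyam

logger = logging.getLogger(__name__)


def warn_on_difference(provided, aggregated):
    """Placeholder for the compare step under test."""
    diff = pyam.compare(
        provided, aggregated, left_label="model native", right_label="aggregated"
    ).dropna()
    if not diff.empty:
        logger.warning(
            "Differences found between model native and aggregated results\n%s", diff
        )


def test_partial_aggregation_warning(caplog):
    cols = pyam.IAMC_IDX + [2005, 2010]
    provided = pyam.IamDataFrame(
        pd.DataFrame(
            [["m_a", "s_a", "common_region_A", "Primary Energy", "EJ/yr", 1, 2]],
            columns=cols,
        )
    )
    aggregated = pyam.IamDataFrame(
        pd.DataFrame(
            [["m_a", "s_a", "common_region_A", "Primary Energy", "EJ/yr", 4, 6]],
            columns=cols,
        )
    )
    with caplog.at_level(logging.WARNING):
        warn_on_difference(provided, aggregated)
    assert "Differences found between model native and aggregated results" in caplog.text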

Compare & merge function

@danielhuppmann I took your compare & merge for provided and aggregated results and put it in a dedicated function. Now RegionProcessor.apply is a little bit easier to read.

Adjust pyam log level

In order to get rid of some warnings from pyam about empty data frames resulting from the comparison between provided and aggregated results, I extended the with adjust_log_level(logger="pyam", level="ERROR"): context manager to include this part.
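
For context, a standalone sketch of what such a context manager does (temporarily raising a logger's level and restoring it on exit); the adjust_log_level helper referenced above is the existing one, so this is only an illustration of the idea:

import logging
from contextlib import contextmanager


@contextmanager
def adjusted_log_level(logger="pyam", level="ERROR"):
    """Temporarily change the level of `logger`, restoring it on exit."""
    log = logging.getLogger(logger)
    old_level = log.level
    log.setLevel(level)
    try:
        yield
    finally:
        log.setLevel(old_level)


# usage sketch: silence pyam warnings about empty data frames during the
# comparison of provided and aggregated results
# with adjusted_log_level(logger="pyam", level="ERROR"):
#     ...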

Open points

From my side the only thing remaining is to update the docs. I would put this here: https://nomenclature-iamc.readthedocs.io/en/latest/usage.html#regionprocessor, if that works for you @danielhuppmann.

@phackstock
Contributor Author

When writing the documentation I stumbled upon a possible bug in the way partial region aggregation is performed.
Say we have a variable that is defined like this:

- Variable B:
    unit: EJ/yr
    region-aggregation:
        - Variable B:
            method: max

with the following model mapping:

model: m_a
common_regions:
  - World:
    - region_A
    - region_B

then providing data such as this:

[
    ["m_a", "s_a", "region_A", "Variable B", "EJ/yr", 1, 2],
    ["m_a", "s_a", "region_B", "Variable B", "EJ/yr", 3, 4],
    ["m_a", "s_a", "World", "Variable B", "EJ/yr", 4, 6]
]

and performing region aggregation would give us, according to our "provided data takes precedence over aggregated"-rule:

["m_a", "s_a", "World", "Variable B", "EJ/yr", 4, 6]

which would be incorrect as Variable B is defined as the max and not the sum, so we should instead get:

["m_a", "s_a", "World", "Variable B", "EJ/yr", 3, 4]

What are your thoughts on this @danielhuppmann?
One option would be to disallow renaming a variable to itself when using the region-aggregation attribute as part of a variable code list.
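
For illustration, the example above can be reproduced with pyam directly; the data and names follow the example, the aggregation call mirrors the method: max attribute, and this is a sketch rather than code from the PR:

import pandas as pd
import pyam

df = pyam.IamDataFrame(
    pd.DataFrame(
        [
            ["m_a", "s_a", "region_A", "Variable B", "EJ/yr", 1, 2],
            ["m_a", "s_a", "region_B", "Variable B", "EJ/yr", 3, 4],
            ["m_a", "s_a", "World", "Variable B", "EJ/yr", 4, 6],
        ],
        columns=pyam.IAMC_IDX + [2005, 2010],
    )
)

# aggregation as defined in the variable code list (method: max) gives [3, 4]
aggregated = df.aggregate_region(
    "Variable B", region="World", subregions=["region_A", "region_B"], method="max"
)

# the provided World data [4, 6] differs, which currently only triggers the
# difference warning; the provided values still take precedence
print(
    pyam.compare(
        df.filter(region="World"),
        aggregated,
        left_label="model native",
        right_label="aggregated",
    )
)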

@phackstock
Contributor Author

Added some documentation about partial region aggregation. As touched on before, the docs will probably need another re-work soon, but that's a separate issue.

@danielhuppmann
Member

Re the potential bug or inconsistency in the partial region aggregation... I do not think that this is a bug or problem (from a technical-implementation point of view).

If "Variable B" is defined such that the maximum across regions should be reported at the "World" level, and a team reports the sum instead of the max, this is clearly a reporting error. The package can help identify reporting errors by writing a warning to the log, but it is not the purpose of the package to be smarter than the user...

@phackstock
Contributor Author

Ah yes fair enough, then it should be all good. The warning is written already anyway.
Since we now have a few cases where the user should read the log file, maybe we can think about a way to present the log file to the user in the Scenario Explorer.

@danielhuppmann
Member

Looks good to me!

PS: We should revisit the docs and harmonize the notation (provided data vs. original) in the not-too-distant future...

@phackstock
Contributor Author

Agreed on the docs part; I'm also not super happy with the distinction between "Getting Started" and "Usage". I find myself looking for things, not knowing in which part they are.

@phackstock merged commit c03f9d3 into IAMconsortium:main on Mar 8, 2022
@phackstock deleted the feature/partial-region-aggregation branch on March 8, 2022 at 09:53