Partial region aggregation #99

Merged

Conversation

phackstock
Contributor

closes #96.

New Feature

This PR implements partial region aggregation.
Partial region aggregation applies when a common region is already reported natively by a model.
This can encompass all variables or only a subset. It also works for variables that are only reported at the common-region level.

Example

Say we have model native results that look like this:

  • model_a reports a total of three regions: region_a, region_b and World
  • region_a and region_b are native regions and World falls under the category of common regions (although it is reported by the model directly)
  • For region aggregation, World is composed of region_a and region_b.
  • For regions region_a and region_b we report the variables var_a and var_b.
  • For region World we report var_b and var_c.

The corresponding model mapping looks like this:

model: model_a
native_regions:
- region_a
- region_b
common_regions:
  - World:
    - region_a
    - region_b

The region processor now applies the following logic:

  1. Variable var_a is not reported on World level, therefore it is calculated through region aggregation of regions region_a and region_b.
  2. Variable var_b is reported on World level as well as for region_a and region_b. By convention the model native results (on World level) take precedence over performing aggregation.
  3. Variable var_c is only reported on World level so it is taken from there.

The resulting data frame then contains the following:

  • For region_a and region_b: variables var_a and var_b
  • For World: variables var_a (calculated by aggregation), var_b (taken from model native results by precedence), and var_c (taken from model native results); see the sketch below
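
To make the precedence logic concrete, here is a minimal sketch using pyam. This is an illustration only (not the implementation in this PR); the scenario name and numeric values are made up:

import pandas as pd
import pyam

df = pyam.IamDataFrame(
    pd.DataFrame(
        [
            ["model_a", "scen_a", "region_a", "var_a", "EJ/yr", 1, 2],
            ["model_a", "scen_a", "region_b", "var_a", "EJ/yr", 3, 4],
            ["model_a", "scen_a", "region_a", "var_b", "EJ/yr", 5, 6],
            ["model_a", "scen_a", "region_b", "var_b", "EJ/yr", 7, 8],
            ["model_a", "scen_a", "World", "var_b", "EJ/yr", 12, 14],
            ["model_a", "scen_a", "World", "var_c", "EJ/yr", 9, 10],
        ],
        columns=pyam.IAMC_IDX + [2005, 2010],
    )
)

native = df.filter(region=["region_a", "region_b"])  # var_a, var_b
provided_world = df.filter(region="World")           # var_b, var_c (model native)

# var_a is not reported at the World level, so it is computed by aggregation;
# var_b and var_c are taken from the model native World data (precedence)
to_aggregate = [v for v in native.variable if v not in provided_world.variable]
aggregated = pyam.concat(
    [
        df.aggregate_region(v, region="World", subregions=["region_a", "region_b"])
        for v in to_aggregate
    ]
)

result = pyam.concat([native, provided_world, aggregated])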

Open Points

  • As mentioned in point 2 of the region processor logic, we currently simply take the model native results over the aggregation. We could additionally calculate the aggregation, compare the results, and raise an error if they do not match.

@danielhuppmann
Member

I think that implementation was too easy... Best to extend the test, I guess...

@phackstock
Contributor Author

I added deep copies for both vars_default_args as well as vars_kwargs for each common region.
Additionally I added a dedicated test test_core.test_partial_aggregation that tests the case that you outlined @danielhuppmann:

  • We have two common regions common_region_A and common_region_B.
  • For common_region_A we report both Temperature|Mean and Primary Energy so both of those are removed from vars_default_args_cr (I appended "cr" to the name to signal that this copy is common-region specific; I also couldn't overwrite the original vars_default_args as it needs to be reset for every loop iteration)
  • For common_region_B we only report Temperature|Mean so Primary Energy must be obtained by region aggregation.

This test would have failed without the use of the deep copy as we would have removed Primary Energy from the list of variables that need to be aggregated.
Now it works.

One more point remains though: I deliberately chose the numbers so that, for common_region_A (composed of region_A and region_B, as defined in tests/data/region_processing/partial_aggregation/model_a.yaml), the Primary Energy values do not line up: region_A + region_B = 1 + 3 = 4 (for 2005), while common_region_A reports 1 (for 2005).
Currently there are no checks for such consistency, should I add them @danielhuppmann?
If so, what should the failure scenario look like: should we just print a warning or raise an error?
My personal preference would be in favor of this consistency check and raising an error in case of failure.
This could prevent some issues related to faulty aggregation, also connected to missing weights. If the reported common-region values and the aggregation are both incorrect in the same way, however, this check is useless. It might still catch some errors though.

@phackstock
Contributor Author

In the end this was a bit trickier than I thought, including all the edge cases. Nonetheless, here's my first prototype for partial region aggregation with the comparison feature between model native and aggregated results.

Points addressed

  • Added _aggregate_with_model_native, which performs region aggregation and checks for differences between model native and aggregated results using pyam.compare. If a difference is found, a warning-level log message is written detailing the differences.
  • The logic of assembling the results is as follows:
    1. All variables that are shared by model native and aggregated results are compared.
    2. If there are differences, issue a logging call at the warning level.
    3. Assemble the results by using all model native data and adding aggregated data for variables which are not in the model native ones, i.e. data that are only obtained through region aggregation. This step makes sure that results which are exclusive to either the model native or the aggregated data are both taken, and that in case results exist for both, the model native ones take precedence (a sketch of this step follows below).
  • Modified the existing test case to check the logs for the expected message detailing the differences. Example output from the test case:
Differences found between model native and aggregated results
                                                          model native  aggregated
model scenario region          variable       unit  year                          
m_a   s_a      common_region_A Primary Energy EJ/yr 2005             1           4
                                                    2010             2           6
  • Added basic logging config to nomenclature/__init__.py.
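
As a rough illustration of the assembly step in point 3 above, here is a sketch assuming two IamDataFrames provided (model native common-region data) and aggregated (region-aggregated data) for the same region; the function name is hypothetical and not the one used in the PR:

import logging

import pyam

logger = logging.getLogger(__name__)


def compare_and_merge(provided, aggregated):
    """Sketch: warn about differences, then merge with model-native precedence."""
    # compare only variables present in both the provided and aggregated data
    shared = [v for v in provided.variable if v in aggregated.variable]
    if shared:
        diff = pyam.compare(
            provided.filter(variable=shared),
            aggregated.filter(variable=shared),
            left_label="model native",
            right_label="aggregated",
        )
        if not diff.empty:
            logger.warning(
                "Differences found between model native and aggregated results\n%s",
                diff,
            )
    # provided (model native) data takes precedence; aggregated data only
    # fills in variables that the model did not report at this region
    missing = [v for v in aggregated.variable if v not in provided.variable]
    if not missing:
        return provided
    return pyam.concat([provided, aggregated.filter(variable=missing)])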

Open points

  • The warning message is something I just put together quickly; maybe the information can be presented in a better way.
  • Right now the comparison is made on the basis of a single model and region for a common set of variables, but across all provided scenarios. What this does not currently take into account is a variable which is reported on the model native level for only a subset of scenarios. If there is a case where such a setting might occur, I can modify _aggregate_with_model_native accordingly.
  • The configuration of the logger.
  • For variables which are supposed to be renamed, I am not sure what logic should apply if the variable is reported on both the model native and the aggregation level. Currently I've implemented it so that the renaming is done after region processing, meaning the comparison between model native and aggregated results happens before, according to the logic detailed above. I'm not 100 percent sure though if this is desirable.
  • For variables where skip-region-aggregation is True but that exist in model native results, should we still upload them?
  • @danielhuppmann I only merged the main branch into this PR branch after I created _aggregate_with_model_native, so I didn't remember that you had already introduced _aggregate_region. Both helper functions fulfill similar purposes. The difference in terms of usage is that _aggregate_with_model_native is called both for 'simple' region aggregation as well as with 'kwargs'. Right now I've set it up so that _aggregate_with_model_native calls _aggregate_region in case kwargs are present. Should I consolidate both functions into one or is it fine this way?

@danielhuppmann
Member

Thanks @phackstock - the mechanics of the implementation look great, but I have a few implementation suggestions...

@phackstock
Contributor Author

Thanks for the suggestions @danielhuppmann, I updated the code accordingly.
The two _aggregate_region... functions are now combined into a single one.

I also added a check for the warning message in test_region_processing_weighted_aggregation, in the third parameter configuration. I can also split this third one into its own test if you think that would bring more clarity. I could also revert this commit and add it in its own PR since it's only partially related to this one. Additionally, we could also test the weighted aggregation for the missing variable failure scenario. I ran a quick check and the error it produces is the same "Inconsistent index between ...". It would be quite straightforward to distinguish between the two cases, we would just need to include a check for the variables present in the data frame inside the except clause. But that's probably good to include in a separate PR.

Coming back to the open points I raised in my comment two days ago:

  • Should I include a specific check on a scenario level in what is now _aggregate_region? As it stands the comparison is attempted for one region, model and variable but for all scenarios. Right now we would lose a variable that's missing some scenarios in the uploaded results but is present in the aggregated ones. Is this edge case realistic or can we live without having to explicitly iterate over all scenarios?
  • The renaming is done after the aggregation so model native results would still take precedence here. I could see an (admittedly far-fetched) use case where someone might want to upload model native results under the original variable name and the aggregated results under the new name, but I think this is too convoluted and we should keep it as it currently is.
  • If results for common regions are reported from the model directly but a variable has skip-region-aggregation set to True, this variable will not be uploaded even though technically we don't need region aggregation to obtain it. Is this behavior ok?

@danielhuppmann
Member

The two _aggregate_region... functions are now combined into a single one.

Thanks, looks good!

I also added a check for the warning message in test_region_processing_weighted_aggregation, in the third parameter configuration.

Nice, fine to leave here.

Additionally, we could also test the weighted aggregation for the missing variable failure scenario. I ran a quick check and the error it produces is the same "Inconsistent index between ...". It would be quite straightforward to distinguish between the two cases, we would just need to include a check for the variables present in the data frame inside the except clause. But that's probably good to include in a separate PR.

Not quite sure what you mean, maybe make an issue and tackle later...

Coming back to the open points I raised in my comment two days ago:

  • Should I include a specific check on a scenario level in what is now _aggregate_region? As it stands the comparison is attempted for one region, model and variable but for all scenarios. Right now we would lose a variable that's missing some scenarios in the uploaded results but is present in the aggregated ones. Is this edge case realistic or can we live without having to explicitly iterate over all scenarios?

Now that you specifically ask... It might make sense to change the implementation strategy. Do all aggregation first, then do the comparison. I gave it a try, see phackstock#2

Additional benefits: faster performance (because filtering and comparison happen only once, not in every iteration), and only one warning message per model.

  • The renaming is done after the aggregation so model native results would still take precedence here. I could see an (admittedly far-fetched) use case where someone might want to upload model native results under the original variable name and the aggregated results under the new name, but I think this is too convoluted and we should keep it as it currently is.

The suggestion would also take care of this.

  • If results for common regions are reported from the model directly but a variable has skip-region-aggregation set to True, this variable will not be uploaded even though technically we don't need region aggregation to obtain it. Is this behavior ok?

The suggestion would also take care of it, because it filters to all valid variables provided at a common region (a sketch of the approach follows below).
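
For illustration, a rough sketch of that strategy (aggregate everything first, then compare once per model); the structure is assumed and not the actual code in phackstock#2, and combine_first is just one way to express the precedence merge:

import logging

import pyam

logger = logging.getLogger(__name__)


def aggregate_then_compare(df, common_regions):
    """Sketch: aggregate all common regions first, then compare and merge once.

    `common_regions` maps common-region names to their constituent regions,
    e.g. {"World": ["region_a", "region_b"]}.
    """
    aggregated = pyam.concat(
        [
            df.aggregate_region(
                df.filter(region=subregions).variable, region=name, subregions=subregions
            )
            for name, subregions in common_regions.items()
        ]
    )
    provided = df.filter(region=list(common_regions))

    # one comparison across all common regions and scenarios; rows present on
    # only one side are not differences, they are handled by the merge below
    diff = pyam.compare(
        provided, aggregated, left_label="model native", right_label="aggregated"
    ).dropna()
    if not diff.empty:
        logger.warning("Difference between original and aggregated data:\n%s", diff)

    # provided data takes precedence, aggregated data fills the gaps
    merged = (
        provided.data.set_index(provided.dimensions)
        .combine_first(aggregated.data.set_index(aggregated.dimensions))
        .reset_index()
    )
    return pyam.IamDataFrame(merged)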

@phackstock
Contributor Author

Thanks a lot for the PR @danielhuppmann, we can continue the discussion over there.
Overall it looks very promising to me; taking care of some corner cases while also providing a speed boost is a really nice improvement.

…egation-alternative

Alternative implementation for partial region aggregation
@phackstock
Contributor Author

Updates

Tests

Added unit tests for partial region aggregation covering the following cases:

  • Variable is available in provided and aggregated data and the same
  • Variable is only available in the provided data
  • Variable is only available in the aggregated data
  • Variable is not available in all scenarios in the provided data
  • Using skip-aggregation: true should only take provided results
  • Using the region-aggregation attribute to create an additional variable
  • Variable is available in provided and aggregated data but different

Tests also check the logging output (see the sketch below).
Thanks a lot again for your solution @danielhuppmann, it worked for all the edge cases I threw at it.
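
For reference, a minimal sketch of checking the logging output with pytest's caplog fixture; the helper function and data are placeholders rather than the fixtures used in this repository, with values mirroring the Primary Energy example from the earlier comment:

import logging

import pandas as pd
import pyam

logger = logging.getLogger(__name__)


def warn_on_difference(provided, aggregated):
    """Placeholder for the compare step under test."""
    diff = pyam.compare(
        provided, aggregated, left_label="model native", right_label="aggregated"
    ).dropna()
    if not diff.empty:
        logger.warning(
            "Differences found between model native and aggregated results\n%s", diff
        )


def test_partial_aggregation_warning(caplog):
    cols = pyam.IAMC_IDX + [2005, 2010]
    provided = pyam.IamDataFrame(
        pd.DataFrame(
            [["m_a", "s_a", "common_region_A", "Primary Energy", "EJ/yr", 1, 2]],
            columns=cols,
        )
    )
    aggregated = pyam.IamDataFrame(
        pd.DataFrame(
            [["m_a", "s_a", "common_region_A", "Primary Energy", "EJ/yr", 4, 6]],
            columns=cols,
        )
    )
    with caplog.at_level(logging.WARNING):
        warn_on_difference(provided, aggregated)
    assert "Differences found between model native and aggregated results" in caplog.text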

Compare & merge function

@danielhuppmann I took your compare & merge for provided and aggregated results and put it in a dedicated function. Now RegionProcessor.apply is a little bit easier to read.

Adjust pyam log level

In order to get rid of some warnings from pyam about empty data frames resulting from the comparison between provided and aggregated results, I extended the with adjust_log_level(logger="pyam", level="ERROR"): context manager to include this part.
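
For context, a standalone sketch of what such a context manager does (temporarily raising a logger's level and restoring it on exit); the adjust_log_level helper referenced above is the existing one, so this is only an illustration of the idea:

import logging
from contextlib import contextmanager


@contextmanager
def adjusted_log_level(logger="pyam", level="ERROR"):
    """Temporarily change the level of `logger`, restoring it on exit."""
    log = logging.getLogger(logger)
    old_level = log.level
    log.setLevel(level)
    try:
        yield
    finally:
        log.setLevel(old_level)


# usage sketch: silence pyam warnings about empty data frames during the
# comparison of provided and aggregated results
# with adjusted_log_level(logger="pyam", level="ERROR"):
#     ...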

Open points

From my side the only thing remaining is to update the docs. I would put this here: https://nomenclature-iamc.readthedocs.io/en/latest/usage.html#regionprocessor, if that works for you @danielhuppmann.

@phackstock
Contributor Author

When writing the documentation I stumbled upon a possible bug in the way partial region aggregation is performed.
Say we have a variable that is defined like this:

- Variable B:
    unit: EJ/yr
    region-aggregation:
        - Variable B:
            method: max

with the following model mapping:

model: m_a
common_regions:
  - World:
    - region_A
    - region_B

then providing data such as this:

[
    ["m_a", "s_a", "region_A", "Variable B", "EJ/yr", 1, 2],
    ["m_a", "s_a", "region_B", "Variable B", "EJ/yr", 3, 4],
    ["m_a", "s_a", "World", "Variable B", "EJ/yr", 4, 6]
]

and performing region aggregation would give us, according to our "provided data takes precedence over aggregated"-rule:

["m_a", "s_a", "World", "Variable B", "EJ/yr", 4, 6]

which would be incorrect as Variable B is defined as the max and not the sum, so we should instead get:

["m_a", "s_a", "World", "Variable B", "EJ/yr", 3, 4]

What are your thoughts on this @danielhuppmann?
One option would be to disallow renaming a variable to itself when using the region-aggregation attribute as part of a variable code list.
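
For illustration, the example above can be reproduced with pyam directly; the data and names follow the example, the aggregation call mirrors the method: max attribute, and this is a sketch rather than code from the PR:

import pandas as pd
import pyam

df = pyam.IamDataFrame(
    pd.DataFrame(
        [
            ["m_a", "s_a", "region_A", "Variable B", "EJ/yr", 1, 2],
            ["m_a", "s_a", "region_B", "Variable B", "EJ/yr", 3, 4],
            ["m_a", "s_a", "World", "Variable B", "EJ/yr", 4, 6],
        ],
        columns=pyam.IAMC_IDX + [2005, 2010],
    )
)

# aggregation as defined in the variable code list (method: max) gives [3, 4]
aggregated = df.aggregate_region(
    "Variable B", region="World", subregions=["region_A", "region_B"], method="max"
)

# the provided World data [4, 6] differs, which currently only triggers the
# difference warning; the provided values still take precedence
print(
    pyam.compare(
        df.filter(region="World"),
        aggregated,
        left_label="model native",
        right_label="aggregated",
    )
)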

@phackstock
Contributor Author

Added some documentation about partial region aggregation. As touched on before, the docs will probably need another re-work soon, but that's a separate issue.

@danielhuppmann
Member

Re the potential bug or inconsistency in the partial region aggregation... I do not think that this is a bug or problem (from a technical-implementation point of view).

If "Variable B" is defined such that the maximum across regions should be reported at the "World" level, and a team reports the sum instead of the max, this is clearly a reporting error. The package can help identify reporting errors by writing a warning to the log, but it is not the purpose of the package to be smarter than the user...

@phackstock
Contributor Author

Ah yes fair enough, then it should be all good. The warning is written already anyway.
Since we now have a few cases where the user should read the log file, maybe we can think about a way to present the log file to the user in the Scenario Explorer.

@danielhuppmann
Member

Looks good to me!

PS: We should revisit the docs and harmonize the notation (provided data vs. original) in the not-too-distant future...

@phackstock
Contributor Author

Agreed on the docs part; I'm also not super happy with the distinction between "Getting Started" and "Usage". I find myself looking for things, not knowing in which part they are.

@phackstock merged commit c03f9d3 into IAMconsortium:main on Mar 8, 2022
@phackstock deleted the feature/partial-region-aggregation branch on March 8, 2022 at 09:53