Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to skip existing intermediate variables when aggregating recursivly #532

Merged
merged 35 commits into from Jun 10, 2021

Conversation

pjuergens
Copy link
Contributor

@pjuergens pjuergens commented May 12, 2021

Please confirm that this PR has done the following:

  • Tests Added
  • Documentation Added
  • Name of contributors Added to AUTHORS.rst
  • Description in RELEASE_NOTES.md Added

Description of PR

Add an option to skip existing intermediate variables when aggregating recursivly as discussed in #525

@codecov
Copy link

codecov bot commented May 12, 2021

Codecov Report

Merging #532 (25edae0) into main (8d8aa6b) will increase coverage by 0.0%.
The diff coverage is 100.0%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main    #532   +/-   ##
=====================================
  Coverage   93.5%   93.5%           
=====================================
  Files         47      48    +1     
  Lines       5228    5257   +29     
=====================================
+ Hits        4891    4920   +29     
  Misses       337     337           
Impacted Files Coverage Δ
pyam/_aggregate.py 99.0% <100.0%> (+<0.1%) ⬆️
pyam/_compare.py 100.0% <100.0%> (ø)
pyam/core.py 92.6% <100.0%> (-0.1%) ⬇️
tests/conftest.py 100.0% <100.0%> (ø)
tests/test_feature_aggregate.py 98.8% <100.0%> (+<0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d8aa6b...25edae0. Read the comment docs.

@pjuergens pjuergens marked this pull request as draft May 12, 2021 09:13
@pjuergens
Copy link
Contributor Author

I don't understand why stickler-ci is failing as it doesn't tell me the line where the PR differs from the black code style

@pjuergens pjuergens marked this pull request as ready for review May 12, 2021 09:55
@danielhuppmann danielhuppmann self-requested a review May 12, 2021 10:49
pyam/core.py Outdated Show resolved Hide resolved
@danielhuppmann
Copy link
Member

You can directly commit my suggestion to your branch directly, this should appease Stickler. Going forward, I recommend that you install a Black linter utility on your machine and follow this workflow:

  • write the code
  • commit the changes
  • run the linter
  • commit the changes, looking carefully at what parts of the code were changed by the linter
  • then push to GitHub

After a while, you'll not need to run the linter anymore because you'll be coding black naturally...

Co-authored-by: Daniel Huppmann <dh@dergelbesalon.at>
@pjuergens
Copy link
Contributor Author

thanks, that did the job :)

@danielhuppmann
Copy link
Member

Looking at the code implementation in a bit more detail, two thoughts:

  1. having thought about this feature again, I think it makes more sense to not add an extra kwarg skip_intermediate but instead do the following:

    • if an intermediate variable exists, check that the values are in line with the aggregate of its components (rather than simply skipping the computation) to ensure internal consistency along the variable hierarchy
    • if it doesn't exist, compute the aggregate of the components and append the data
  2. I'm a bit concerned with the proposed implementation about unexpected behavior if a user has an IamDataFrame with multiple scenarios, where one scenario (scen_a) has some intermediate variables and another (scen_b) has only the lower-level components. With the proposed implementation, scen_b would not be complete.

    The (in my opinion expected) behavior could be tested by adding the following line

    df2.aggregate("Secondary Energy|Electricity|Wind", append=True)

    in the function test_aggregate_recursive() before appending df2 to df.

    This way, the existing test could be modified and you wouldn't need to add an almost-duplicate test.

@pjuergens
Copy link
Contributor Author

I changed the implementation from comparing variables beforehand to comparing indices after aggregating, which should now be able to deal with different variable sets in different scenarios, regions or even years.

Concerning your first thought: I'd prefer to not check internal consistency while skipping intermediate variables but rather let the user check it afterwards. When explicitly using this feature it should be clear that the intermediate variables are not aggregated by pyam and should be checked. Running check_internal_consistency then gives the user more information about which variables differ and if it's due to rounding errors. In my opinion more useful for potential debugging.

@pjuergens pjuergens marked this pull request as draft May 14, 2021 09:47
@pjuergens
Copy link
Contributor Author

It seems like stickler uses slightly different definitions of black than the built-in black-linter in Spyder.

@danielhuppmann
Copy link
Member

Thanks for continuing to work on this, sorry for the issues with difference versions of black...

Re your comment:

When explicitly using this feature it should be clear that the intermediate variables are not aggregated by pyam and should be checked.

There is a tradeoff here:

  • advanced users will know what they are doing - for them, the increased efficiency of skipping validation is a bonus
  • for non-expert users, pyam should have sensible guardrails to prevent them from doing stupid things or running into unexpected behavior - for them, pyam should raise loud and specific error messages

And either way, first running recursive-aggregation and the checking consistency is not an efficient approach, because it requires computing a lot of data twice.

So let me modify my earlier suggestion:

  • drop the skip_intermediate keyword argument - skipping should be the default, but with validation
  • allow recursive to be a boolean (as it is now) or a string "skip-validate" - and skip the validation in that case.
    So your expert-level function could be called with
    df.aggregate(<variable>, recursive="skip-validate")
    but novice users would get an error if the hierarchy is not consistent.
    In core.py, you only have to change if recursive is True: to if recursive:, then the recursive-aggregation will be selected both if it's True or a dictionary.

Later extensions could then also do something like recursive={rtol=0.2} where the contents of the dictionary are passed to np.isclose() in the validation to avoid errors if data is almost identical but not quite.

@pjuergens
Copy link
Contributor Author

I implemented my approach for the aggregation check. I will change the interface with dropping the skip_intermediate keyword after the weekend.

@pjuergens pjuergens marked this pull request as ready for review May 19, 2021 13:31
Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one minor in-line comment for now - will need to check the approach more thoroughly later...

pyam/_aggregate.py Outdated Show resolved Hide resolved
Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pjuergens for the initiative!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants