clean-up of aggregation features #315

danielhuppmann · 2019-12-31T15:26:30Z

Please confirm that this PR has done the following:

Tests Added
Documentation Added
Description in RELEASE_NOTES.md Added

Description of PR

This PR implements a number of clean-ups following PRs #305 and #312:

the return type of aggregate() and aggregate_region() is changed to an IamDataFrame instance (per suggestion by @jkikstra), previously a timeseries-dataframe
the return type of check_aggregate() and check_aggregate_region() is changed to a pd.DataFrame with both expected and actual value (previously only the expected value)
adds an equals() function (originally used to make tests easier)
adds a pyam.testing.assert_frames_equal() function (to make tests easier)
the tutorial for aggregation, downscaling and consistency checking is reworked (now includes description of regional components, weighted average, setting variables as list, downscaling)
the aggregation tests are completely refactored (because of the changed return types, it was easier to completely rewrite rather than figure out how to salvage)
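
To illustrate the idea of returning both values from a consistency check, here is a minimal plain-Python sketch. It is not pyam's implementation: the function name `check_sum`, the nested-dict data layout, and the keys `actual` (the reported value) and `expected` (the sum of the components) are assumptions made for this example.

```python
import math


def check_sum(data, variable, components, rtol=1e-5):
    """Compare a variable to the sum of its components, year by year.

    Returns {year: {"actual": ..., "expected": ...}} for mismatching years only.
    """
    mismatches = {}
    for year, actual in data[variable].items():
        expected = sum(data[c][year] for c in components)
        if not math.isclose(actual, expected, rel_tol=rtol):
            mismatches[year] = {"actual": actual, "expected": expected}
    return mismatches


# made-up numbers: the 2010 total does not match its components
data = {
    "Primary Energy": {2005: 10.0, 2010: 14.0},
    "Primary Energy|Coal": {2005: 6.0, 2010: 8.0},
    "Primary Energy|Wind": {2005: 4.0, 2010: 5.0},
}
result = check_sum(
    data, "Primary Energy", ["Primary Energy|Coal", "Primary Energy|Wind"]
)
```

Having both numbers side by side makes it easier to judge whether a mismatch is a rounding issue or a real inconsistency.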

znicholls

tl;dr - I don't think this is the best path forward, but I also know you have deadlines and I'm contributing less and less, so I don't want to be a blocker. Hence, like @gidden, I'll put "request changes", but ignore it if you want.

I definitely haven't done the most thorough review. Given the PR has ~1 000 lines of changes, that would unfortunately take more time than I have. In general, though, I think this makes lots of good changes, but I'm concerned about the aggregation feature testing.

> the aggregation tests are completely refactored (because of the changed return types, it was easier to completely rewrite rather than figure out how to salvage)

I think this is a really bad idea, and I think the effort of putting the tests back in is worth it. My plan A would be to do it in this PR, but opening an issue and addressing it later could also be fine. To explain why I think this: checking aggregation and internal consistency is hard because there are lots of edge cases (mainly bunker related...) and because you want it to work on big datasets, so you need some (admittedly annoyingly large) test sets. The existing tests covered a lot of that, and just throwing them away risks that coverage never coming back (or bugs re-appearing when you really wish they wouldn't, e.g. in the middle of checking AR6 data).

I like all the other changes to return types etc. and think they'll make things way easier to use. The only other thing that I would reconsider is that check_internal_consistency has components=True for check_aggregate_region, which is the opposite of the default of check_aggregate_region. I would find this very confusing as a new user. I would make components=True the default for check_aggregate_region.
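
The role of `components` in region aggregation can be sketched in plain Python. This is an illustration under assumptions, not pyam's code: a regional total is taken as the sum of the subregions, and with `components=True` any sub-category reported only at the regional level (such as bunkers) is added on top. The function name, region names, and numbers are made up.

```python
def aggregate_region_sketch(subregion_values, region_only_components, components=False):
    """Sum subregional values; optionally add components that exist only at region level."""
    total = sum(subregion_values.values())
    if components:
        total += sum(region_only_components.values())
    return total


# made-up emissions: bunkers are reported for "World" only, not for any subregion
subregions = {"reg_a": 3.0, "reg_b": 5.0}
world_only = {"Emissions|CO2|Bunkers": 1.0}

without_components = aggregate_region_sketch(subregions, world_only)
with_components = aggregate_region_sketch(subregions, world_only, components=True)
```

The two calls give different totals for the same data, which is why differing defaults between a function and its wrapper can surprise a new user.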

gidden · 2020-01-10T08:08:03Z

@danielhuppmann I share @znicholls's opinion on the test refactor here (also not blocking, but with a strong preference).

I may not grok the gritty details, but is it possible to write a wrapper function in the tests that would translate the newly returned IamDataFrame back to a pd.DataFrame in the test's expected format and then apply that in each test, a la pdt.assert_frame_equal(pd_to_pyam_test(obs), exp)?
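
A rough sketch of that wrapper idea, written in plain Python so that it runs without pyam or pandas. `FakeResult` is a hypothetical stand-in for the object returned by an aggregation call, and `to_expected_format` is a hypothetical test helper that unwraps it into the plain structure an older test would compare against. In a real test suite the comparison would go through pandas' testing helpers rather than `==`.

```python
class FakeResult:
    """Hypothetical stand-in for the object returned by an aggregation call."""

    def __init__(self, rows):
        self._rows = rows  # list of (region, variable, year, value) tuples

    def timeseries(self):
        """Return a wide mapping {(region, variable): {year: value}}."""
        wide = {}
        for region, variable, year, value in self._rows:
            wide.setdefault((region, variable), {})[year] = value
        return wide


def to_expected_format(obs):
    """Hypothetical helper: unwrap a result object, pass plain data through."""
    return obs.timeseries() if hasattr(obs, "timeseries") else obs


obs = FakeResult(
    [
        ("World", "Primary Energy", 2005, 10.0),
        ("World", "Primary Energy", 2010, 13.0),
    ]
)
exp = {("World", "Primary Energy"): {2005: 10.0, 2010: 13.0}}
matches = to_expected_format(obs) == exp
```

A helper like this only works if the old tests compared values; tests that inspected other properties of the result would need more than a format conversion.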

danielhuppmann · 2020-01-10T08:29:01Z

re the comments by @znicholls and @gidden:

1. sorry that this is such a big refactoring - I divided it into a suite of PRs to make it manageable, but it's still huge...
2. API of check_internal_consistency(): I added a kwarg components so that the behaviour can be controlled and passed through to check_aggregate_region(). As to whether the default should be True or False: detecting components that exist only at the region level is one use case out of many (it's relevant mainly in the emissions domain), and it would interfere with using a (weighted) average or other methods, which are equally valid and relevant. So I changed that to False on purpose.
3. about the tests: I agree that just dropping tests is a really bad idea, but the suite of tests was written so efficiently (a dozen different tests with hardly any documentation about their intent, all feeding into one master testing function with a bunch of if-else logic to do the comparison) that it just wasn't possible for me to disentangle it.
   @gidden, your approach sounds smart, but if you look at what was actually tested (length of indices, etc.), this won't work. The new test suite with assert_frame_equal() is far more explicit.
   @znicholls, if you can spell out for each of the removed tests what their intent was, I'll be happy to re-introduce them in a next step.
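
The point about weighted averages can be made concrete with a small plain-Python sketch. The function and the numbers are hypothetical and this is not pyam's implementation: a regional price is computed as a weighted mean of the subregional prices, so there is nothing meaningful to add from components that exist only at the region level.

```python
def weighted_average(values, weights):
    """Weighted mean of `values`, with `weights` keyed by the same regions."""
    total_weight = sum(weights[region] for region in values)
    weighted_sum = sum(values[region] * weights[region] for region in values)
    return weighted_sum / total_weight


# made-up data: a carbon price per subregion, weighted by emissions
price = {"reg_a": 10.0, "reg_b": 20.0}
emissions = {"reg_a": 1.0, "reg_b": 3.0}

world_price = weighted_average(price, emissions)
```

With a summation method, adding region-only components changes the total in a well-defined way; with an average like this one, it does not, which is one argument for a default of components=False.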

francescolovat · 2020-01-10T14:01:45Z

Hi @danielhuppmann,

I've been through the pyam_first_steps.ipynb tutorial again.

It really looks nice. You've implemented all our suggestions in a very detailed way. It was really enjoyable to follow the steps in the notebook.

I have no further comments to add to it (I'll keep in mind some formatting features you included, e.g. the blue boxes, to potentially include them also in the MESSAGEix tutorials).

Regarding the previous conversation about the refactoring of tests, I have little experience with them. I'm still getting familiar with the pandas.util.testing methods. Matt's suggestion of a wrapper function looks promising; however, it is also a matter of the time left before the openmod meeting.

znicholls · 2020-01-11T01:14:03Z

> sorry that this is such a big refactoring - I divided it into a suite of PRs to make it manageable, but it's still huge...

All good, there's always a nasty conflict between time pressure and moving slowly enough that everyone can keep up (you're striking a good balance).

> 2. API of check_internal_consistency()

All makes sense, thanks for spelling it out.

> 3. about the tests

Fair, they were definitely not as clear as they should have been (I have since learnt that whilst this 'efficiency' approach looks good, it's actually a terrible idea as it's not explicit enough about what is going on). Let's park this in #317 for now.

gidden · 2020-01-13T07:21:50Z

Ok all, will merge this now that everything is approved/reviewed/marked for future issues. Thanks @danielhuppmann for the huge effort here and all reviewers!

danielhuppmann added 30 commits December 31, 2019 16:10

add equals() function (with tests)

dc3ec0d

refactor auxiliary aggregation functions

51e4492

add check that arg in equals is an IamDataFrame (and test)

6784ca4

split aggregate[_region]() function into public and internal parts

19a84ee

change return type to `IamDataFrame`

refactor aggregation tests to expect IamDataFrame as return object

78060d6

use equals() in downscale-tests

7319871

move auxiliary functions related to aggregation features to own file

81cd8d5

add region-column to PRICE_MAX_DF

a54fbfa

smarter aggregation-tests parametrization

c67c37b

remove duplicate test

52b6a17

change order of tests

489a203

smarter parametrization and re-ordering of region-aggregation tests

ed7bb0e

rename full-feature dataframe for aggregation tests to simple_df

7ae19cf

remove unnecessary imports

45a2890

add treatment of empty result of aggregation (and tests)

58fc75d

change return object from check_aggregate[_region]()

d3f0fd9

fix return-object of aggregate-region with weights

4acdddd

complete rework of consistency-checking tutorial

a321d5d

update list of tutorials on doc-pages

55b0db7

fix docstrings

40842f2

refactor region/subregions tests

82187b8

refactor aggregate-and-append tests

360435b

remove duplicate "passing" tests (already covered by new tests)

1a5bfbe

refactor check-aggregate(-region) tests

093eef8

refactor top-level check_aggregate() test (with exclude_on_fail)

c4a0b35

clean-up

8bc1983

refactor tests for log-messages of region aggregation

04343ae

remove tests that are duplicated by new test suite

3435b91

refactor test for `check_internal_consistency()`

cee6955

remove test data for previous aggregation test suite

d1f325b

danielhuppmann added 3 commits January 9, 2020 21:55

change check_internal_consistency to return a concatenated dataframe

cfedb36

update API changes in release notes

46738e1

update the tutorial

ea900e7

stickler-ci reviewed Jan 9, 2020

appease stickler

8931cce

znicholls requested changes Jan 10, 2020

znicholls and others added 5 commits January 10, 2020 08:17

Minor updates in the notebooks

d942af5

cleaner import in testing module

ab8d431

Co-Authored-By: Zeb Nicholls <zebedee.nicholls@climate-energy-college.org>

refactor to group_and_agg() (suggested by @znicholls)

2e5687b

add continue in variable-components-loop (suggested by @znicholls)

cecc518

fix hard-coded region in check_aggregate_region() (found by @znicholls)

af07892

allow components to be passed through `check_internal_consistency()` (per comment by @znicholls)

f3aebfc

danielhuppmann added 3 commits January 10, 2020 09:37

add parameter to equals() docstring

7a4c978

appease stickler

ec1e418

add API change of check_internal_consistency() to readme

07fe142

znicholls mentioned this pull request Jan 11, 2020

Aggregation tests #317

Open

Move aggregate stuff into 'private' API

c189373

stickler-ci reviewed Jan 11, 2020

appease stickler, revert to import functions instead of private module

8d01f42

znicholls approved these changes Jan 12, 2020

gidden approved these changes Jan 13, 2020

gidden merged commit 412ff2a into IAMconsortium:master Jan 13, 2020

danielhuppmann deleted the cleanup/aggregation branch January 13, 2020 10:10

danielhuppmann mentioned this pull request Feb 16, 2020

Df ops #333

Closed

3 tasks

danielhuppmann mentioned this pull request May 17, 2020

consistent return type in aggregate family of functions #255

Closed
