Add a function to aggregate a variable to a region from subregions #207

danielhuppmann · 2019-03-07T15:59:16Z

Please confirm that this PR has done the following:

Tests Added
Documentation Added
Description in RELEASE_NOTES.md Added

Description of PR

This PR refactors the check_aggregate_regions() function into a separate aggregate_region() function and a check_...() function to use it in scenario postprocessing (calculating regional values rather than just checking that the aggregation is correct).
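As a sketch of the split described above (plain pandas and NumPy, not the pyam code; the function names and the `rtol` default are assumptions made for illustration), the aggregate step computes the regional total, and the check step reuses it to compare against a reported value:

```python
import numpy as np
import pandas as pd


def aggregate_sketch(values_by_subregion):
    """Hypothetical aggregate step: sum the subregion values."""
    return values_by_subregion.sum()


def check_aggregate_sketch(reported, values_by_subregion, rtol=1e-5):
    """Hypothetical check step: reuse the aggregate step and return
    True if the reported regional value matches the computed total."""
    computed = aggregate_sketch(values_by_subregion)
    return bool(np.isclose(reported, computed, rtol=rtol))


subregions = pd.Series({"Europe": 5.0, "Africa": 3.0})
world_total = aggregate_sketch(subregions)  # usable in postprocessing
is_consistent = check_aggregate_sketch(8.0, subregions)
```

Keeping the computation in its own function means postprocessing can use the total directly, while the check only adds the comparison.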

The important point here is that any variable components that are defined only at the region level (e.g., 'World') should be added to the regional total. For example, if CO2 emissions from air travel are accounted for only at the global level, the following output should be returned:

Input

| region | variable | unit | 2020 |
|---|---|---|---|
| Europe | Emissions\|CO2 | GtCO2 | 5 |
| Africa | Emissions\|CO2 | GtCO2 | 3 |
| World | Emissions\|CO2\|Air Travel | GtCO2 | 1 |

Expected output

| region | variable | unit | 2020 |
|---|---|---|---|
| World | Emissions\|CO2 | GtCO2 | 9 |
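The rule can be sketched in plain pandas. This is not the pyam implementation: the helper name `aggregate_region_sketch` and the long-format column layout are assumptions made for illustration. It sums the variable over the subregions, then adds any direct components that are reported only at the region level.

```python
import pandas as pd

# Toy data in long format, mirroring the Input table.
data = pd.DataFrame(
    [
        ["Europe", "Emissions|CO2", "GtCO2", 2020, 5.0],
        ["Africa", "Emissions|CO2", "GtCO2", 2020, 3.0],
        ["World", "Emissions|CO2|Air Travel", "GtCO2", 2020, 1.0],
    ],
    columns=["region", "variable", "unit", "year", "value"],
)


def aggregate_region_sketch(df, variable, region, subregions):
    """Hypothetical helper: sum `variable` over `subregions` and add
    components of `variable` that exist only at the `region` level."""
    # Total of the variable over all subregions, per year.
    in_sub = df[df["region"].isin(subregions) & (df["variable"] == variable)]
    total = in_sub.groupby("year")["value"].sum()

    # Components exactly one level below the variable,
    # e.g. 'Emissions|CO2|Air Travel' for 'Emissions|CO2'.
    is_component = df["variable"].str.startswith(variable + "|") & (
        df["variable"].str.count(r"\|") == variable.count("|") + 1
    )
    at_region = df[is_component & (df["region"] == region)]
    in_subregions = set(
        df[is_component & df["region"].isin(subregions)]["variable"]
    )

    # Keep only components that no subregion reports.
    region_only = at_region[~at_region["variable"].isin(in_subregions)]
    extra = region_only.groupby("year")["value"].sum()
    return total.add(extra, fill_value=0)


result = aggregate_region_sketch(
    data, "Emissions|CO2", "World", ["Europe", "Africa"]
)
print(result)
```

For the toy data, the subregions sum to 8 and the air-travel component adds 1, giving the 9 GtCO2 shown in the expected output.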

Refactoring as part of this PR

While implementing this, I found a number of issues that I believe should be improved even though the changes break the API.

@znicholls, you implemented the first version of this, can you mark the items below if you agree with the change?

- all [check_]aggregate[_regions]() functions: refactor units to unit for closer resemblance to df.filter()
- the check_aggregate...() functions return the index columns in the standard IAMC order (i.e., [model, scenario, region, variable, unit])
- the function check_aggregate_regions() was renamed to check_aggregate_region() because it only ever checks one region at a time
- the function check_aggregate_regions() takes a kwarg subregions for an optional list of regions to aggregate and a kwarg components for variable components to be included at the region level (before, it was not possible to select custom variable components, and components referred to the custom subregions, differing from the use in check_aggregate())
- _apply_filters() interprets col=None as no filter applied, i.e., all True (before, it would return all False). This streamlines passing the unit filter from the [check_]aggregate[_regions]() functions and is in line with how pandas treats slice(None).
- _apply_filters() takes kwargs, not a dict
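The `col=None` convention can be illustrated with a small pandas helper. This is a sketch only: `apply_filters_sketch` is a hypothetical stand-in, not pyam's `_apply_filters()`, and it supports only a single string or a list of values per column.

```python
import pandas as pd


def apply_filters_sketch(df, **filters):
    """Hypothetical helper: return a boolean mask for `df`.

    A filter value of None means "no filter on this column",
    so it keeps every row (all True), much like slice(None).
    """
    keep = pd.Series(True, index=df.index)
    for col, values in filters.items():
        if values is None:
            continue  # no filter applied for this column
        if isinstance(values, str):
            values = [values]
        keep &= df[col].isin(values)
    return keep


data = pd.DataFrame(
    {"region": ["Europe", "Africa"], "unit": ["GtCO2", "GtCO2"]}
)
mask_all = apply_filters_sketch(data, unit=None)
mask_one = apply_filters_sketch(data, region="Europe", unit=None)
```

With this convention, a caller can forward an optional `unit` argument unchanged without first checking whether it was set.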


coveralls · 2019-03-07T16:42:08Z

Coverage increased (+0.1%) to 84.785% when pulling 52bd4b2 on danielhuppmann:aggregate_region into 381c4f6 on IAMconsortium:master.

coveralls · 2019-03-07T16:42:09Z

Coverage increased (+0.08%) to 91.918% when pulling e38afc5 on danielhuppmann:aggregate_region into 7d695a6 on IAMconsortium:master.

gidden · 2019-03-07T17:08:27Z

Hey @danielhuppmann quick first question. I think we already have an aggregate_regions style function in map_region. Is there overlap here? Happy to use any suggestions you have to improve/change it. Plus I see the docstring is definitely wrong in its summary...

pyam/core.py

znicholls · 2019-03-07T23:44:13Z

* `_apply_filters()` takes `kwargs`, not a `dict`

Probably need to ask @gidden about this one I think?

Otherwise looks very nice to me

danielhuppmann · 2019-03-11T12:29:51Z

> Hey @danielhuppmann quick first question. I think we already have an aggregate_regions style function in map_region. Is there overlap here? Happy to use any suggestions you have to improve/change it. Plus I see the docstring is definitely wrong in its summary...

why do you think that the docstring is wrong?
well, I didn't have map_regions() on my radar... dug into it a bit now.

One could use aggregate_region() iteratively to get map_regions(), see the example in footnote [1].
Whether it's worth the effort, I can't tell...

main distinction
- map_regions()
  - implicitly assumes that the variable tree is complete
  - supports different mappings of subregions to regions per model
  - returns an average (?) over each subregion, see footnote [2]
  - takes a table with two columns as input, e.g.,
    pd.DataFrame([['NAF', 'R5MAF'], ['ME', 'R5MAF']], columns=['region', 'r5_region'])
- aggregate_region()
  - intended to build up a region "tree" similar to a variable tree (calling it recursively over a variable tree and region hierarchy from bottom up)
  - implicitly assumes common (or at least non-conflicting) region names
  - includes variable components that are only defined at the region level (not the subregions), see Input and Expected Output in the description of this PR.
  - takes the mapping as kwargs, e.g. region='R5MAF', subregions=['NAF', 'ME'] (see footnote [1])

[1] e.g., for df = IamDataFrame(REG_DF).filter(model='IMAGE') where REG_DF is from conftest.py

df.aggregate_region(variable='Primary Energy', region='R5MAF', subregions=['NAF', 'ME'],
                    components=[])

is similar to (except for the "average" issue in [2])

df.map_regions('r5_region').timeseries()

where region='R5MAF', subregions=['NAF', 'ME'] could also be read dynamically from run_control() as is implemented in map_regions().

[2] Calling IamDataFrame(REG_DF).map_regions('r5_region').timeseries() does NOT return the World timeseries.
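A rough pandas sketch of the iterative idea from footnote [1], assuming a two-column mapping table like the one in the map_regions() example. The data values are made up for illustration, and the code uses plain `groupby` rather than any pyam method:

```python
import pandas as pd

# Mapping of subregions to regions, as in the map_regions() example.
mapping = pd.DataFrame(
    [["NAF", "R5MAF"], ["ME", "R5MAF"]], columns=["region", "r5_region"]
)

# Made-up values for a single variable and year (illustration only).
data = pd.DataFrame(
    {
        "region": ["NAF", "ME"],
        "variable": ["Primary Energy", "Primary Energy"],
        "year": [2010, 2010],
        "value": [1.0, 2.0],
    }
)

# One aggregation per target region: collect its subregions, then sum.
totals = {}
for target, group in mapping.groupby("r5_region"):
    subregions = group["region"].tolist()
    rows = data[data["region"].isin(subregions)]
    totals[target] = rows.groupby(["variable", "year"])["value"].sum()

print(totals["R5MAF"])
```

Each loop iteration corresponds to one call with region=target and subregions=[...], which is the many-to-one direction discussed here.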

gidden · 2019-03-29T10:28:55Z

I meant map_regions docstring is wrong, sorry! =)

gidden · 2019-03-29T10:39:53Z

Ok, just getting back to this now. I agree that there are differences between the two, but my suspicion here is that we should harmonize them into one function (or two, but different).

The original goal of map_regions was to be able to take either a many-to-one or one-to-many mapping and plot them. Perhaps it would be better to break this into two different functions, one to aggregate, and one to "paint" (for lack of a better word). What do you think? In principle, it should be relatively easy to update the tests/tutorials that go in "that direction" (many-to-one).

What do you think?

danielhuppmann · 2019-03-29T11:53:54Z

Agree that the two could, maybe should, be refactored into a data-manipulation and a “paint” function. But...

aggregate_regions() has the extra feature that it automatically searches for subcategories of the one variable it operates on that are defined at the one region but not in the subregions, and adds them to the total.

map_regions() just aggregates all variables without consideration of the variable tree and whatever region mapping it gets - or it loads a default region mapping from file and does some model-dependent best-guessing how to apply the mapping.

Merging these two features will require a lot of kwargs...

I did go down that rabbit hole to harmonise them and I did not see a light after an hour or two, so I abandoned the effort. We can venture into it again together next week...

gidden · 2019-04-23T09:09:15Z

Per in-person discussions, we have tentatively agreed to break this into two functions:

aggregate_regions() for the many-to-one mapping
downscale_regions() for the one-to-many mapping

danielhuppmann · 2019-04-24T13:01:53Z

@gidden, merge conflicts resolved, should be good to be merged once the CI passes

pyam/core.py

znicholls · 2019-04-28T05:44:56Z

Just catching up on this now

> Per in-person discussions, we have tentatively agreed to break this into two functions:
>
> aggregate_regions() for the many-to-one mapping
>
> downscale_regions() for the one-to-many mapping

So in this PR you add aggregate_regions(), in a future PR we will add downscale_regions()?

danielhuppmann · 2019-04-28T06:16:25Z

> Just catching up on this now
>
> > Per in-person discussions, we have tentatively agreed to break this into two functions:
> >
> > aggregate_regions() for the many-to-one mapping
> >
> > downscale_regions() for the one-to-many mapping
>
> So in this PR you add aggregate_regions(), in a future PR we will add downscale_regions()?

Yes (but renaming the functions to aggregate_region() and check_aggregate_region() because it is only one region being treated).

And as suggested in the inline discussion above, we could refactor the existing map_regions() to take a region-mapping and then call the functions iteratively as required.

@gidden

pyam/core.py

gidden · 2019-04-29T08:25:27Z

hey @danielhuppmann. it turns out the geopandas dep is more complicated than initially thought - it seems some underlying datasets have changed (perhaps ISO names? I would need to dig further)... For the moment, can you try updating the CI conda installs to geopandas<0.5.0 and I can make an issue

appveyor.yml

.travis.yml

danielhuppmann · 2019-04-29T10:21:42Z

force-pushed to get rid of my own very sorry efforts at implementing @gidden's suggestion

cherrypicked from #226

danielhuppmann · 2019-04-29T11:12:42Z

Thanks for the assist with geopandas, @gidden!

I made an issue for the refactoring of map_regions() (#225) as discussed. So should be good to go!

gidden · 2019-04-29T11:12:49Z

Woo, thanks so much @danielhuppmann. can you please add that conversation to the map_region() issue?

gidden · 2019-04-29T11:12:57Z

hah, beat me to it =)

danielhuppmann added 17 commits March 4, 2019 12:02

clean-up of aggregate() function

b57d02f

using latest `pyam.append()` beauty

auxiliary function _variable_components()

dee9395

move _variable_components() below aggregate function block

972cdf5

make _apply_filters() take kwargs instead of dict

e8209fc

make _apply_filter(col=None) behave as if no filter applied

769e34c

refactor all aggregate-functions to unit as kwarg

7e9942b

for consistency with `filter()`

replace _aggregate_by_variables() by more generic function

e08da79

refactor internal pd.Series variables to be not called df_*

9604d74

shorten name of auxiliary function _aggregate, allow by as list

65661dc

add function aggregate_region()

6418c51

refactor check_aggregate_region() function to use aggregate_region()

5a5f7e1

remove deprecated auxiliary function and index lists

a7356f1

check for subregions only where variable exists

96302c1

use df.timeseries() as return of check_aggregate[_region]()

b64d3e7

update order of index in tests (using timeseries() in check*())

026bd8c

implement refactoring of function name and args in tests

c09a317

update inline comments and logger messages

ad75447

danielhuppmann requested review from gidden and znicholls March 7, 2019 15:59

danielhuppmann added 2 commits March 7, 2019 17:14

add to release notes

ad6ec76

update check_internal_consistency()

52bd4b2

znicholls reviewed Mar 7, 2019

pyam/core.py

znicholls approved these changes Mar 7, 2019

danielhuppmann mentioned this pull request Mar 8, 2019

decide if/how to fill in missing columns in constructor #208

Closed

Merge branch 'master' into aggregate_region

9e7f1db

gidden reviewed Apr 26, 2019

pyam/core.py

gidden reviewed Apr 26, 2019

pyam/core.py

gidden reviewed Apr 26, 2019

pyam/core.py (outdated)

danielhuppmann added 2 commits April 28, 2019 12:57

add docstring for _variable_components() as requested by @gidden

c2a5c52

remove unit arg from (check_)aggregate(_region) functions as requested by @gidden

585fe41

stickler-ci reviewed Apr 28, 2019

pyam/core.py (outdated)

appease stickler

787c9a5

danielhuppmann mentioned this pull request Apr 29, 2019

map_regions() to downscale_regions() #225

Open

gidden reviewed Apr 29, 2019

appveyor.yml (outdated)

gidden reviewed Apr 29, 2019

.travis.yml (outdated)

restrict geopandas<0.5.0 to make region-plot tests pass

e38afc5

cherrypicked from `gidden:testci`

danielhuppmann force-pushed the aggregate_region branch from f51bf94 to e38afc5 Compare April 29, 2019 10:19

danielhuppmann mentioned this pull request Apr 29, 2019

try to fix ci #226

Closed

gidden merged commit 4077929 into IAMconsortium:master Apr 29, 2019

danielhuppmann deleted the aggregate_region branch April 29, 2019 11:45
