Add attributes to access index dimensions #432

danielhuppmann · 2020-10-03T11:39:37Z

Please confirm that this PR has done the following:

Tests Added
Documentation Added
Description in RELEASE_NOTES.md Added

Description of PR

Per this comment by @Rlamboll that scenario would be a more intuitive way to retrieve the list of scenarios compared to scenarios(), this PR adds the index dimensions as attributes. This is closer to the behavior that a pandas-user would expect.

Open questions:

the return type is a list, but could also be a pandas.Series - any preferences?
(except for continuous-time, which is a pandas.DateTimeIndex)
the existing functions (models(), scenarios(), regions()) are marked for deprecation - any objections?
the existing function variables() can either return a series (only variables) or a dataframe (variables and units) - any preferences whether to keep this as is, or alternative suggestions?

Rlamboll · 2020-10-03T11:54:03Z

Sorry, there may have been a misunderstanding with that comment - the expected behaviour in pandas is that .scenario returns the whole column "scenario", rather than the summary statistics listing only unique scenarios (i.e. the summary info for the column). I'm not sure I would have a preference for this notation over the old notation - it's shorter to write but requires a breaking change.

danielhuppmann · 2020-10-05T08:05:49Z

Thanks @Rlamboll, you are right, this makes the behavior more similar to pandas but it is still different. I updated the docstrings to at least clarify the behaviour.

I understood your comment in response to the overview of an IamDataFrame now returned via print(df) and df.info() (#427):

<class 'pyam.core.IamDataFrame'>
Index dimensions:
 * model    : AIM/CGE 2.1, GENeSYS-MOD 1.0, ... WITCH-GLOBIOM 4.4 (8)
 * scenario : 1.0, CD-LINKS_INDCi, CD-LINKS_NPi, ... Faster Transition Scenario (8)
 * region   : R5ASIA, R5LAM, R5MAF, R5OECD90+EU, R5REF, R5ROWO, World (7)
 * variable : ... (6)
 * unit     : EJ/yr, Mt CO2/yr, °C (3)
 * year     : 2010, 2020, 2030, 2040, 2050, 2060, 2070, 2080, ... 2100 (10)
Meta indicators:
   exclude (bool) False (1)

In light of this, it makes sense to have df.model to return the (unique) list of models... (which I thought was your point).

@gidden @byersiiasa @znicholls @jkikstra, any thoughts?

znicholls · 2020-10-05T08:21:57Z

This is going to be annoying, but I'd leave the API. Having .model return a unique list, rather than the entire column, would be super confusing in my opinion (given that if you did .model on a pandas dataframe you'd get the entire column). If users need the unique values, they could just do .model.unique()..?

Rlamboll · 2020-10-05T08:34:24Z

@danielhuppmann this does indeed resolve the inconsistency I was referencing at the time... but I had imagined you would resolve that by adding letter "s"es to the new printouts we were discussing rather than recoding loads of other interfaces! Then it would read "models:" <output of models()> etc.

danielhuppmann · 2020-10-05T14:05:21Z

Thanks for the feedback!

I guess part of the confusion comes from different understanding of what an IamDataFrame is:

(mainly) the timeseries dataframe, or
the timeseries dataframe, a meta dataframe for indicators and categories, and (later) other relevant attributes e.g. provenance and workflow recipes (track provenance and the operations recipe #287)

I want to move in a direction of (2), whereas you see (1) as the expected behavior - but then, I see a challenging discussion whether df.model should return the pd.Series from the long or wide timeseries dataframe...

I didn't use plurals in the info() output for two reasons:

it's not how pandas or xarray display this information (they use the column/index/coordinate names)
it can't be extended to automatically cover additional non-standard columns in the timeseries data, e.g., species vs. climate_model - one should get an s, the other one not...

Important to note that this PR adds an attribute for all index dimensions, not just the standard ones...

danielhuppmann · 2020-10-09T11:42:40Z

Again, thanks for sharing your concerns! As there were no responses to my previous response, two statements and a follow-up question...

I think that once users have seen the output of print(df) a few times, it will be natural for them to use df.model to get the (unique) list of models. So I see this as a useful forward-looking change.
If you want to get the full column, I think we should require users to be explicit whether they want it from a long (df.data.model) or wide table format (df.timeseries().reset_index().model)

Question: how often have you actually used either of the to-be-deprecated functions (models(), scenarios(), regions()) in a workflow or script? In my experience, I only use them when exploring some data, while setting up workflows - not in the actual workflows...

If you think this is a terrible change, I'll forget the PR - but I still see it as an improvement in the long term.

Rlamboll · 2020-10-09T13:35:26Z

I hadn't realised that your point above was a question rather than statement - I have made clear I am in camp 1, and Silicone does not interact with metadata since I don't know what it is or how the new data will change it (although if you append your output to the original df it will be conserved), so camp 2 cannot realistically entice me in without concrete use-cases that readily interconvert with pandas dfs.
I use these commands about 10 times in my code and quite a few tens of times in various notebooks (I leave in some of the exploratory work so people reading it can follow). Since it's a 1:1 substitution it's not a hard change to recode, it just feels like an unnecessary break in continuity: annoying rather than terrible, but without significant upside.
What would be more useful than this is a way to list unique combinations of models and scenarios and optionally regions, since it's fiddly to extract from the .data and wasteful to loop over all possible combinations of model/scenario, most of which don't exist.

znicholls · 2020-10-10T03:17:04Z

two statements

These both make good sense to me!

how often have you actually used either of the to-be-deprecated functions (models(), scenarios(), regions()) in a workflow or script?

I don't actually use pyam very much anymore so for me it's never... When I did use it, the answer was also basically never, like you it was only for looking at what was available, never actually a key part of the processing.

What would be more useful than this is a way to list unique combinations of models and scenarios and optionally regions, since it's fiddly to extract from the .data and wasteful to loop over all possible combinations of model/scenario, most of which don't exist.

Unless I'm mistaken, I'm pretty sure you can do

# get a dataframe with unique combos
iamdf.meta[["model", "scenario"]].drop_duplicates()
# you can then make lists or whatever else with some trickery
iamdf.meta[["model", "scenario"]].drop_duplicates().to_records("dict")
iamdf.meta[["model", "scenario"]].drop_duplicates().to_dict("records")
[tuple(v) for _, v in iamdf.meta[["model", "scenario"]].drop_duplicates().iterrows()]

I still see it as an improvement in the long term

I'm convinced too

danielhuppmann added the enhancement label Oct 3, 2020

danielhuppmann self-assigned this Oct 3, 2020

danielhuppmann marked this pull request as ready for review October 9, 2020 11:58

danielhuppmann mentioned this pull request Oct 10, 2020

Add an index attribute of model-scenario combinations #438

Merged

3 tasks

rebase-and-squash

70229a8

danielhuppmann force-pushed the feature/index-attributes branch from 66ad926 to 70229a8 Compare October 15, 2020 11:02

add to release notes

6e88939

danielhuppmann merged commit 97f6c76 into IAMconsortium:master Oct 15, 2020

danielhuppmann deleted the feature/index-attributes branch October 15, 2020 11:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add attributes to access index dimensions #432

Add attributes to access index dimensions #432

danielhuppmann commented Oct 3, 2020 •

edited

Rlamboll commented Oct 3, 2020

danielhuppmann commented Oct 5, 2020

znicholls commented Oct 5, 2020

Rlamboll commented Oct 5, 2020

danielhuppmann commented Oct 5, 2020

danielhuppmann commented Oct 9, 2020

Rlamboll commented Oct 9, 2020

znicholls commented Oct 10, 2020

Add attributes to access index dimensions #432

Add attributes to access index dimensions #432

Conversation

danielhuppmann commented Oct 3, 2020 • edited

Please confirm that this PR has done the following:

Description of PR

Rlamboll commented Oct 3, 2020

danielhuppmann commented Oct 5, 2020

znicholls commented Oct 5, 2020

Rlamboll commented Oct 5, 2020

danielhuppmann commented Oct 5, 2020

danielhuppmann commented Oct 9, 2020

Rlamboll commented Oct 9, 2020

znicholls commented Oct 10, 2020

danielhuppmann commented Oct 3, 2020 •

edited