Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add attributes to access index dimensions #432

Merged

Conversation

danielhuppmann
Copy link
Member

@danielhuppmann danielhuppmann commented Oct 3, 2020

Please confirm that this PR has done the following:

  • Tests Added
  • Documentation Added
  • Description in RELEASE_NOTES.md Added

Description of PR

Per this comment by @Rlamboll that scenario would be a more intuitive way to retrieve the list of scenarios compared to scenarios(), this PR adds the index dimensions as attributes. This is closer to the behavior that a pandas-user would expect.

Open questions:

  • the return type is a list, but could also be a pandas.Series - any preferences?
    (except for continuous-time, which is a pandas.DateTimeIndex)
  • the existing functions (models(), scenarios(), regions()) are marked for deprecation - any objections?
  • the existing function variables() can either return a series (only variables) or a dataframe (variables and units) - any preferences whether to keep this as is, or alternative suggestions?

@Rlamboll
Copy link
Collaborator

Rlamboll commented Oct 3, 2020

Sorry, there may have been a misunderstanding with that comment - the expected behaviour in pandas is that .scenario returns the whole column "scenario", rather than the summary statistics listing only unique scenarios (i.e. the summary info for the column). I'm not sure I would have a preference for this notation over the old notation - it's shorter to write but requires a breaking change.

@danielhuppmann
Copy link
Member Author

Thanks @Rlamboll, you are right, this makes the behavior more similar to pandas but it is still different. I updated the docstrings to at least clarify the behaviour.

I understood your comment in response to the overview of an IamDataFrame now returned via print(df) and df.info() (#427):

<class 'pyam.core.IamDataFrame'>
Index dimensions:
 * model    : AIM/CGE 2.1, GENeSYS-MOD 1.0, ... WITCH-GLOBIOM 4.4 (8)
 * scenario : 1.0, CD-LINKS_INDCi, CD-LINKS_NPi, ... Faster Transition Scenario (8)
 * region   : R5ASIA, R5LAM, R5MAF, R5OECD90+EU, R5REF, R5ROWO, World (7)
 * variable : ... (6)
 * unit     : EJ/yr, Mt CO2/yr, °C (3)
 * year     : 2010, 2020, 2030, 2040, 2050, 2060, 2070, 2080, ... 2100 (10)
Meta indicators:
   exclude (bool) False (1)

In light of this, it makes sense to have df.model to return the (unique) list of models... (which I thought was your point).

@gidden @byersiiasa @znicholls @jkikstra, any thoughts?

@znicholls
Copy link
Collaborator

This is going to be annoying, but I'd leave the API. Having .model return a unique list, rather than the entire column, would be super confusing in my opinion (given that if you did .model on a pandas dataframe you'd get the entire column). If users need the unique values, they could just do .model.unique()..?

@Rlamboll
Copy link
Collaborator

Rlamboll commented Oct 5, 2020

@danielhuppmann this does indeed resolve the inconsistency I was referencing at the time... but I had imagined you would resolve that by adding letter "s"es to the new printouts we were discussing rather than recoding loads of other interfaces! Then it would read "models:" <output of models()> etc.

@danielhuppmann
Copy link
Member Author

Thanks for the feedback!

I guess part of the confusion comes from different understanding of what an IamDataFrame is:

  1. (mainly) the timeseries dataframe, or
  2. the timeseries dataframe, a meta dataframe for indicators and categories, and (later) other relevant attributes e.g. provenance and workflow recipes (track provenance and the operations recipe #287)

I want to move in a direction of (2), whereas you see (1) as the expected behavior - but then, I see a challenging discussion whether df.model should return the pd.Series from the long or wide timeseries dataframe...

I didn't use plurals in the info() output for two reasons:

  • it's not how pandas or xarray display this information (they use the column/index/coordinate names)
  • it can't be extended to automatically cover additional non-standard columns in the timeseries data, e.g., species vs. climate_model - one should get an s, the other one not...

Important to note that this PR adds an attribute for all index dimensions, not just the standard ones...

@danielhuppmann
Copy link
Member Author

Again, thanks for sharing your concerns! As there were no responses to my previous response, two statements and a follow-up question...

  1. I think that once users have seen the output of print(df) a few times, it will be natural for them to use df.model to get the (unique) list of models. So I see this as a useful forward-looking change.
  2. If you want to get the full column, I think we should require users to be explicit whether they want it from a long (df.data.model) or wide table format (df.timeseries().reset_index().model)

Question: how often have you actually used either of the to-be-deprecated functions (models(), scenarios(), regions()) in a workflow or script? In my experience, I only use them when exploring some data, while setting up workflows - not in the actual workflows...

If you think this is a terrible change, I'll forget the PR - but I still see it as an improvement in the long term.

@danielhuppmann danielhuppmann marked this pull request as ready for review October 9, 2020 11:58
@Rlamboll
Copy link
Collaborator

Rlamboll commented Oct 9, 2020

I hadn't realised that your point above was a question rather than statement - I have made clear I am in camp 1, and Silicone does not interact with metadata since I don't know what it is or how the new data will change it (although if you append your output to the original df it will be conserved), so camp 2 cannot realistically entice me in without concrete use-cases that readily interconvert with pandas dfs.
I use these commands about 10 times in my code and quite a few tens of times in various notebooks (I leave in some of the exploratory work so people reading it can follow). Since it's a 1:1 substitution it's not a hard change to recode, it just feels like an unnecessary break in continuity: annoying rather than terrible, but without significant upside.
What would be more useful than this is a way to list unique combinations of models and scenarios and optionally regions, since it's fiddly to extract from the .data and wasteful to loop over all possible combinations of model/scenario, most of which don't exist.

@znicholls
Copy link
Collaborator

two statements

These both make good sense to me!

how often have you actually used either of the to-be-deprecated functions (models(), scenarios(), regions()) in a workflow or script?

I don't actually use pyam very much anymore so for me it's never... When I did use it, the answer was also basically never, like you it was only for looking at what was available, never actually a key part of the processing.

What would be more useful than this is a way to list unique combinations of models and scenarios and optionally regions, since it's fiddly to extract from the .data and wasteful to loop over all possible combinations of model/scenario, most of which don't exist.

Unless I'm mistaken, I'm pretty sure you can do

# get a dataframe with unique combos
iamdf.meta[["model", "scenario"]].drop_duplicates()
# you can then make lists or whatever else with some trickery
iamdf.meta[["model", "scenario"]].drop_duplicates().to_records("dict")
iamdf.meta[["model", "scenario"]].drop_duplicates().to_dict("records")
[tuple(v) for _, v in iamdf.meta[["model", "scenario"]].drop_duplicates().iterrows()]

I still see it as an improvement in the long term

I'm convinced too

@danielhuppmann danielhuppmann merged commit 97f6c76 into IAMconsortium:master Oct 15, 2020
@danielhuppmann danielhuppmann deleted the feature/index-attributes branch October 15, 2020 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants