decide if/how to fill in missing columns in constructor #208

Closed
gidden opened this issue Mar 8, 2019 · 7 comments

@gidden
Member

gidden commented Mar 8, 2019

During PR #199 we had a use case that ended up unsupported in the final implementation: filling in "missing" values for expected columns.

For example, a dataframe that looks like

  scenario  year    Population           GDP  Urbanization
0     SSP1  2010  6.868687e+09  7.641454e+13      0.516281
1     SSP1  2015  7.210848e+09  9.249094e+13      0.546193
2     SSP1  2020  7.517782e+09  1.144206e+14      0.584815
3     SSP1  2025  7.782887e+09  1.409554e+14      0.621583
4     SSP1  2030  7.999304e+09  1.725584e+14      0.656344

currently raises an error:

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-d2c631ea80be> in <module>()
----> 1 y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'])

~/.local/lib/python3.5/site-packages/pyam_iamc-0.1.2+44.g381c4f6-py3.5.egg/pyam/core.py in __init__(self, data, **kwargs)
     69         # import data from pd.DataFrame or read from source
     70         if isinstance(data, pd.DataFrame) or isinstance(data, pd.Series):
---> 71             _data = format_data(data.copy(), **kwargs)
     72         elif has_ix and isinstance(data, ixmp.TimeSeries):
     73             _data = read_ix(data, **kwargs)

~/.local/lib/python3.5/site-packages/pyam_iamc-0.1.2+44.g381c4f6-py3.5.egg/pyam/utils.py in format_data(df, **kwargs)
    188     if not set(IAMC_IDX).issubset(set(df.columns)):
    189         missing = list(set(IAMC_IDX) - set(df.columns))
--> 190         raise ValueError("missing required columns `{}`!".format(missing))
    191 
    192     # check whether data in wide format (IAMC) or long format (`value` column)

ValueError: missing required columns `['model', 'unit', 'region']`!

At one point in the PR, these three columns were filled in with default values (simply their column names) for ease of use. In many cases, I find that I don't actually care what these values are, and in fact just want the mountain of other nice pyam utilities to work with my data.

So the question is: should we force users to fill these in, e.g.,

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'], model='foo', region='bar', unit='baz')

or should we do that for them with column names or some other value?
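To make the two options concrete, here is a minimal sketch in plain pandas of what "fill in from keyword arguments, otherwise raise" could look like. The helper `to_long_with_defaults` and the `REQUIRED` list are hypothetical names used for illustration only; they are not part of pyam, and `REQUIRED` is merely assumed to mirror `IAMC_IDX`.

```python
import pandas as pd

# Assumption: these mirror the required IAMC index columns (IAMC_IDX in pyam).
REQUIRED = ["model", "scenario", "region", "variable", "unit"]


def to_long_with_defaults(df, value, **defaults):
    """Hypothetical helper (not pyam API): melt `value` columns into a
    `variable`/`value` pair and fill missing required columns from kwargs."""
    id_cols = [c for c in df.columns if c not in value]
    long = df.melt(
        id_vars=id_cols, value_vars=value, var_name="variable", value_name="value"
    )
    # Only fill columns that are absent; never overwrite existing data.
    for col, val in defaults.items():
        if col not in long.columns:
            long[col] = val
    # Anything still missing is an error, as in the traceback above.
    missing = [c for c in REQUIRED if c not in long.columns]
    if missing:
        raise ValueError("missing required columns `{}`!".format(missing))
    return long


df = pd.DataFrame(
    {
        "scenario": ["SSP1", "SSP1"],
        "year": [2010, 2015],
        "Population": [6.868687e09, 7.210848e09],
        "GDP": [7.641454e13, 9.249094e13],
    }
)
long = to_long_with_defaults(
    df, value=["Population", "GDP"], model="foo", region="bar", unit="baz"
)
```

With this shape the user stays in control of the placeholder values, and nothing is silently invented when a default is not supplied.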

@gidden
Member Author

gidden commented Mar 8, 2019

cc @danielhuppmann @znicholls

@znicholls
Collaborator

Tricky one, I'm not sure. I tried auto-filling with None in OpenSCM and it wasn't happy, so that solution, whilst ideal, might be a bit hairy to get working (pandas can be temperamental with None and nan values). Plan 'b', filling with the column name, seems like an OK fallback, with plan 'c' being to force users to fill the values in.

@danielhuppmann
Member

I agree that all required columns other than variable can default to None (not sure how I feel about variable=None).

Need to check whether the "check for duplicates" part at the end of format_data() continues to work as expected.

@danielhuppmann
Member

Update following comment by @znicholls:

If pandas behaves weirdly with None in columns, forcing users to provide names might be preferable.

@danielhuppmann
Member

One more thought about None in columns: how do we expect things to behave if we append an IamDataFrame with model=None to a "regular" frame? df.filter(model=None) will not work (I think) and will also conflict with the changes suggested in #207.

@znicholls
Collaborator

hmmm ok so maybe None is a bad idea. nan could work but it also creates plenty of havoc with pandas (and wouldn't work with the current drop_duplicate call in format_data).
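As one illustration of the kind of havoc meant here (a plain-pandas sketch, not pyam code, with made-up data): `groupby` drops rows whose key is None or nan by default, so a placeholder of that kind can make data silently vanish from aggregations, whereas a string placeholder does not.

```python
import pandas as pd

# Made-up data: `model` was "auto-filled" with None as a placeholder.
data = pd.DataFrame(
    {
        "model": [None, None],
        "scenario": ["SSP1", "SSP1"],
        "value": [1.0, 2.0],
    }
)

# pandas treats None in the key column as missing, and groupby
# excludes missing keys by default (dropna=True), so every row is dropped.
grouped = data.groupby(["model", "scenario"])["value"].sum()

# With a string placeholder the rows survive the aggregation.
filled = data.assign(model="foo").groupby(["model", "scenario"])["value"].sum()

print(len(grouped), len(filled))
```

Passing `dropna=False` to `groupby` would keep the rows, but every call site would have to remember to do so.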

@danielhuppmann
Member

This issue has been resolved in the sense that the constructor now accepts keyword arguments that supply a default value for any column missing from the input dataframe, as suggested above:

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'], model='foo', region='bar', unit='baz')
