decide if/how to fill in missing columns in constructor #208

gidden · 2019-03-08T09:12:13Z

During PR #199 we had a use case that became unsupported in the final implementation, notably filling in "missing" values in expected columns.

For example, a dataframe looking like

   scenario  year    Population           GDP  Urbanization
0      SSP1  2010  6.868687e+09  7.641454e+13      0.516281
1      SSP1  2015  7.210848e+09  9.249094e+13      0.546193
2      SSP1  2020  7.517782e+09  1.144206e+14      0.584815
3      SSP1  2025  7.782887e+09  1.409554e+14      0.621583
4      SSP1  2030  7.999304e+09  1.725584e+14      0.656344

currently raises an error:

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-d2c631ea80be> in <module>()
----> 1 y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'])

~/.local/lib/python3.5/site-packages/pyam_iamc-0.1.2+44.g381c4f6-py3.5.egg/pyam/core.py in __init__(self, data, **kwargs)
     69         # import data from pd.DataFrame or read from source
     70         if isinstance(data, pd.DataFrame) or isinstance(data, pd.Series):
---> 71             _data = format_data(data.copy(), **kwargs)
     72         elif has_ix and isinstance(data, ixmp.TimeSeries):
     73             _data = read_ix(data, **kwargs)

~/.local/lib/python3.5/site-packages/pyam_iamc-0.1.2+44.g381c4f6-py3.5.egg/pyam/utils.py in format_data(df, **kwargs)
    188     if not set(IAMC_IDX).issubset(set(df.columns)):
    189         missing = list(set(IAMC_IDX) - set(df.columns))
--> 190         raise ValueError("missing required columns `{}`!".format(missing))
    191 
    192     # check whether data in wide format (IAMC) or long format (`value` column)

ValueError: missing required columns `['model', 'unit', 'region']`!

At some point in the PR, default values were filled in for these three columns (just using their column names) for ease of use. In many cases, I don't actually care what these values are; I just want the mountain of other nice pyam utilities to work with my data.

So the question is: should we force users to fill these in, e.g.,

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'], model='foo', region='bar', unit='baz')

or should we do that for them with column names or some other value?
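
To make the second option concrete, here is a minimal pandas-only sketch of filling in the missing columns. The helper name `fill_missing_columns` and the `REQUIRED` list are hypothetical, made up for illustration and not part of pyam; the three column names are taken from the error message above.

```python
import pandas as pd

# Assumption: the three columns reported as missing in the traceback above.
REQUIRED = ["model", "region", "unit"]


def fill_missing_columns(df, **defaults):
    """Hypothetical helper: return a copy of `df` with missing required columns added."""
    out = df.copy()
    for col in REQUIRED:
        if col not in out.columns:
            # Use the caller's value if one was given, else fall back to the column name.
            out[col] = defaults.get(col, col)
    return out


df = pd.DataFrame(
    {
        "scenario": ["SSP1", "SSP1"],
        "year": [2010, 2015],
        "Population": [6.868687e09, 7.210848e09],
    }
)

# `model` is supplied explicitly; `region` and `unit` fall back to their own names.
filled = fill_missing_columns(df, model="foo")
```

Forcing users to fill in the values would amount to dropping the fallback and raising the existing ValueError instead.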


gidden · 2019-03-08T09:14:03Z

cc @danielhuppmann @znicholls

znicholls · 2019-03-08T09:34:41Z

Tricky one, I'm not sure. I've tried auto-filling with None in OpenSCM and it hasn't been happy, so that solution, whilst ideal, might be a bit hairy to make behave (pandas can be temperamental with None and nan values). Plan 'b', filling with the column name, seems like an OK fallback, with plan 'c' being to force users to fill the values in.

danielhuppmann · 2019-03-08T09:34:50Z

I agree that all required columns other than variable can default to None (not sure how I feel about variable=None).

Need to check whether the "check for duplicates" part at the end of format_data() continues to work as expected.

danielhuppmann · 2019-03-08T09:40:33Z

Update following comment by @znicholls:

If pandas behaves weirdly with None in columns, forcing users to provide names might be preferable.

danielhuppmann · 2019-03-08T09:45:04Z

One more thought about None in columns: what behaviour do we expect if we append an IamDataFrame with model=None to a “regular” frame? df.filter(model=None) will not work (I think) and will also conflict with the suggested changes in #207.

znicholls · 2019-03-08T10:11:25Z

hmmm ok so maybe None is a bad idea. nan could work but it also creates plenty of havoc with pandas (and wouldn't work with the current drop_duplicate call in format_data).
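
For reference, a small pandas-only sketch of one way missing keys cause trouble. It does not touch pyam or format_data; the frame and column names are made up for illustration. With default settings, groupby drops rows whose key is None or nan, and nan does not compare equal to itself.

```python
import numpy as np
import pandas as pd

# Illustrative frame where the `model` column is entirely unset.
df = pd.DataFrame(
    {
        "model": [None, None],
        "scenario": ["SSP1", "SSP1"],
        "value": [1.0, 2.0],
    }
)

# With default settings, groupby discards rows whose key is missing,
# so both rows vanish from the aggregated result.
grouped = df.groupby(["model", "scenario"])["value"].sum()
n_groups = len(grouped)

# nan is not equal to itself, so equality-based matching on nan fails.
nan_equal = np.nan == np.nan
```

Here `n_groups` is 0 and `nan_equal` is False, so any groupby-based utility would silently lose data with unset index columns.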

danielhuppmann · 2020-02-19T06:53:26Z

This issue has been resolved in the sense that the constructor now accepts keyword arguments that set a default value for any column missing from the input dataframe, as suggested above:

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'], model='foo', region='bar', unit='baz')
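
As a rough illustration of the long-format layout such a call implies, here is a pandas-only sketch. This is an assumption about the resulting shape, not pyam's actual implementation: the value columns are stacked into variable/value pairs and the keyword defaults are added as constant columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "scenario": ["SSP1", "SSP1"],
        "year": [2010, 2015],
        "Population": [6.868687e09, 7.210848e09],
        "GDP": [7.641454e13, 9.249094e13],
    }
)

# Constant values for the columns the input lacks (placeholders from the call above).
defaults = {"model": "foo", "region": "bar", "unit": "baz"}

# Stack the value columns into variable/value pairs, then add the constants.
long_df = df.melt(
    id_vars=["scenario", "year"],
    value_vars=["Population", "GDP"],
    var_name="variable",
    value_name="value",
).assign(**defaults)
```

Note that a single default such as unit='baz' applies to every variable in this sketch, which is only sensible when the actual units do not matter for the analysis.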

gidden mentioned this issue Mar 8, 2019

bug in constructor when passing column mapped arguments #210

Closed

znicholls mentioned this issue Apr 22, 2019

ScmDataFrame: Better 'model' column handling openscm/openscm#114

Open

danielhuppmann closed this as completed Feb 19, 2020
