Interp update #456
Conversation
Seems like the test failures are related to iiasa queries!
Already pinged @peterkolp and @fonfon about this...
LGTM, thanks @gidden!
Minor quibble: I guess a substantial performance increase could be had by avoiding calling df.timeseries()
for every iteration, instead trying something like:
df = self.timeseries()
time = set(time) if islistable(time) else {time}  # cast to set to drop duplicates
lst = [None] * len(time)
# apply fill_series, re-add time dimension to index, set series name
for i, t in enumerate(time):
    lst[i] = df[np.isnan(df[t])].apply(fill_series, raw=False, axis=1, time=t).dropna()
    lst[i].index = append_index_level(
        index=lst[i].index, codes=[0] * len(lst[i]),
        level=[t], name=self.time_col, order=self._data.index.names)
    lst[i].name = 'value'
# append interpolated values to `_data` and sort index
self._data = self._data.append(pd.concat(lst)).sort_index()
Looks like @danielhuppmann has the review under control. My only thought: if you cast to wide format first, could you get an extra performance boost by using pandas' or numpy's built-in interpolation rather than looping over all the times?
Agreed. Something like

if not islistable(time):
    time = [time]
pyam.IamDataFrame(
    idf.timeseries().reindex(columns=np.sort(idf.year + time)).interpolate(method='slinear', axis=1),
    meta=idf.meta
)

has proven to beat pyam's implementation by leaps and bounds.
Indeed, it would be great to switch to a more performant implementation of the interpolation! One issue to keep in mind, though, is that pandas.DataFrame.interpolate fills in all missing values in the dataframe, not just the values for one new column. That's why I didn't use it at the time. And numpy.interp assumes that the values are constant outside the timeseries domain, which is also not a valid general assumption.
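A minimal toy illustration of both caveats (hypothetical data, not from the PR):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0, np.nan]], columns=[2010, 2015, 2020, 2025])
# pandas fills *every* gap along the row, not only a requested column:
# the interior 2015 becomes 2.0 and the trailing 2025 is padded with 3.0
print(df.interpolate(axis=1))

# numpy.interp holds values constant outside the data range instead of extrapolating
print(np.interp([2005, 2035], [2010, 2020], [1.0, 3.0]))  # -> [1.0, 3.0]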
Yes, you'd have to choose the subset you want to interpolate first if you didn't just want to interpolate everything.

Another thought: you could also change the API so the user specifies all the new time points they desire (including any which are already present), rather than only adding extra time points. So, say you had data for 2010, 2020 and 2030 and you wanted to interpolate to have 2010, 2015, 2020, 2025 and 2030. In the current implementation you'd do

df.interpolate([2015, 2025])

but we could make it so it's

df.interpolate([2010, 2015, 2020, 2025, 2030])

and just leave all the calculations to pandas. That interface would mean that interpolating would always leave you with a nan-free dataset.
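A rough sketch of what that interface could look like, assuming `time` lists all desired time points (existing ones included); `interpolate_all` is a hypothetical name, not pyam API:

import numpy as np
import pyam

def interpolate_all(df, time):
    # hypothetical sketch: reindex to the full set of requested columns
    # (assumed to cover all existing ones), then let pandas fill the gaps row-wise
    wide = df.timeseries().reindex(columns=np.sort(time))
    return pyam.IamDataFrame(wide.interpolate(method='slinear', axis=1), meta=df.meta)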
scipy's interp1d will extrapolate if you want, via the fill_value='extrapolate' option. P.S. @coroa,
was that a typo?
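For reference, a minimal sketch of that scipy behavior:

import scipy.interpolate

f = scipy.interpolate.interp1d(
    [2010, 2020], [1.0, 3.0], kind='slinear',
    bounds_error=False, fill_value='extrapolate')
print(f([2005, 2030]))  # -> [0.0, 5.0], linear extension beyond the data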
Not sure about the interpolate interface change; it will be difficult to guarantee an efficient deprecation path. On the other hand, it's a nice way to solve the all-NaNs-filled problem.
Something like that :)
Otherwise, something along the lines of (untested):

from bisect import bisect
import pandas as pd
import scipy.interpolate

df = self.timeseries()
# interpolate each row at the requested time points (interp1d takes `kind`, not `method`)
new_values = df.apply(
    lambda s: pd.Series(scipy.interpolate.interp1d(s.index, s.values, kind='slinear', **kwargs)(time), index=time),
    axis=1)
for year in time:
    if year in df:
        df[year].fillna(new_values[year], inplace=True)
    else:
        i = bisect(df.columns, year)  # insert position that keeps columns sorted
        df.insert(i, year, new_values[year])
return pyam.IamDataFrame(df, meta=self.meta)

should work fine.
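A quick toy check of the row-wise apply above (hypothetical data):

import pandas as pd
import scipy.interpolate

df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=[2010, 2020, 2030])
time = [2015, 2025]
new_values = df.apply(
    lambda s: pd.Series(scipy.interpolate.interp1d(s.index, s.values, kind='slinear')(time), index=time),
    axis=1)
print(new_values)  # expects 1.5 and 2.5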
Thanks for the discussion all! I'll take a look at the suggestions, update, and flag for final review.
OK folks, I have updated the implementation, as shown in the attached notebook. Let me know what you think. It should maintain prior behavior (only updating the columns asked for).
OK... this took me a bit further down a rabbit hole than I expected. The basic approach follows what @coroa suggested earlier, and includes:
There were a few issues I ran into that I will raise with others, namely an issue for @danielhuppmann / @Rlamboll regarding interpolation. @coroa, happy if you take a look at the final version here too.
def test_interpolate_extra_cols():
    # check that interpolation with non-matching extra_cols has no effect (#351)
    EXTRA_COL_DF = pd.DataFrame([
        ['foo', 2005, 1],
        ['foo', 2010, 2],
@Rlamboll and @danielhuppmann, I had to update this test to get correct behavior. The current implementation uses the wide-form timeseries object, which treats each index value (row) as an independent observation. Therefore, the foo and bar entries here are each interpolated separately. If I left it as is, I would get an error, because there are too few datapoints to interpolate (i.e., 1 each).
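A toy illustration of why that raises (hypothetical data, not the actual test fixture):

import pandas as pd

# in wide form, foo and bar become separate rows, each with one observation
wide = pd.DataFrame([[1.0, None], [None, 2.0]], index=['foo', 'bar'], columns=[2005, 2010])
try:
    wide.interpolate(method='slinear', axis=1)
except ValueError as exc:
    print(exc)  # scipy's slinear needs at least two data points per row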
OK, I believe this is consistent with the current state as of the discussion in #351. It's unclear whether this is the desired behaviour or not, but it's what we're used to.
Slightly concerned that this behavior might throw off users who have an IamDataFrame where just one timeseries row has only one datapoint...
For the moment, an error will be raised when trying to interpolate with a single data point. I suggest we allow this, as I am not sure we 'want' to support a use case of pyam data with a single data point (this is what metadata is for, right?). @danielhuppmann, I will let you mark this as resolved.
I'm worried about the case where a user has a table like the one below and wants to interpolate for 2015.

variable       | 2010 | 2020 | 2030
---------------|------|------|-----
Primary Energy | 1    | 2    | 3
Population     | 5    | nan  | nan

I would expect a return-object like

variable       | 2010 | 2015 | 2020 | 2030
---------------|------|------|------|-----
Primary Energy | 1    | 1.5  | 2    | 3
Population     | 5    | nan  | nan  | nan
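For what it's worth, a sketch of how pandas could produce exactly that, interpolating only rows that have at least two observations (toy data mirroring the table above):

import numpy as np
import pandas as pd

wide = pd.DataFrame(
    [[1.0, np.nan, 2.0, 3.0], [5.0, np.nan, np.nan, np.nan]],
    index=['Primary Energy', 'Population'], columns=[2010, 2015, 2020, 2030])
enough = wide.notna().sum(axis=1) >= 2
wide.loc[enough] = wide.loc[enough].interpolate(method='slinear', axis=1)
print(wide)  # Primary Energy gets 1.5 for 2015; Population keeps its NaNs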
Looks really great, thanks @gidden! Just a bunch of minor comments...
I ran a test on my standard large-data ensemble (4 scenarios, 2m data points) and can report:
- old implementation (using two separate calls for interpolation): 64.4 sec
- new implementation (using two separate calls for interpolation): 50.8 sec
- new implementation (calling it via a list): 29.3 sec
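For anyone who wants to reproduce such numbers, a hypothetical timing harness (assuming df is a large IamDataFrame and interpolate operates in place, as in this PR; the years are illustrative):

import time

df2 = df.copy()
t0 = time.time()
df2.interpolate(2007)
df2.interpolate(2013)
print('two separate calls:', time.time() - t0)

df2 = df.copy()
t0 = time.time()
df2.interpolate([2007, 2013])
print('one call with a list:', time.time() - t0)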
tests/test_core.py (outdated)
@@ -642,8 +686,11 @@ def test_interpolate_extra_cols():
    df2 = df.copy()
    df2.interpolate(2007)

    # assert that interpolation didn't change any data
    assert_iamframe_equal(df, df2)
    # interpolate should work as if this is a new index
Not clear to me what this comment means...
Have cleared it up.
Co-authored-by: Daniel Huppmann <dh@dergelbesalon.at>
OK @danielhuppmann, this should be good for final review now. I must admit, I am a bit surprised at the rather minor speedups (@coroa, can you confirm these numbers match your observations?). But in general, this PR was meant to support additional input, not optimization, so no need to block it on that.
Thanks @gidden, really nice!
Resolved merge conflicts
Please confirm that this PR has done the following:
Description of PR
This PR fixes a bug when there are no values to interpolate (all scenarios already have this value) and allows users to pass a list-like input to interpolate multiple time points.
closes #240
closes #371