
Interp update #456

Merged
merged 23 commits into IAMconsortium:master on Nov 24, 2020

Conversation

gidden
Member

@gidden gidden commented Nov 10, 2020

Please confirm that this PR has done the following:

  • Tests Added
  • Documentation Added
  • Description in RELEASE_NOTES.md Added

Description of PR

This PR fixes a bug when there are no values to interpolate (all scenarios already have the value) and allows users to pass a list-like input to interpolate multiple time points at once.

closes #240
closes #371
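
For illustration, a minimal pandas-only sketch of the list-input behavior (hypothetical data; pyam's actual call is `df.interpolate(time)`, and `method='index'` stands in here for the scipy-backed interpolation so the sketch has no extra dependencies):

```python
import pandas as pd

# hypothetical wide-format timeseries: one row per model/scenario/variable
ts = pd.DataFrame(
    [[1.0, 2.0, 3.0]],
    index=["Primary Energy"],
    columns=[2010, 2020, 2030],
)

# interpolate several time points in one call: add the new columns,
# then interpolate linearly along the time axis using the column labels
new_times = [2015, 2025]
wide = ts.reindex(columns=sorted(ts.columns.tolist() + new_times))
interpolated = wide.interpolate(method="index", axis=1)
# interpolated.loc["Primary Energy", 2015] == 1.5
```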

@gidden
Member Author

gidden commented Nov 10, 2020

seems like test failures are related to iiasa queries!

@danielhuppmann
Member

seems like test failures are related to iiasa queries!

already pinged @peterkolp and @fonfon about this...

@danielhuppmann
Member

danielhuppmann commented Nov 10, 2020

@gidden, is this related to #371? not quite sure what the bug to be resolved is, can't replicate the issue described there...

Member

@danielhuppmann danielhuppmann left a comment

LGTM, thanks @gidden!

Minor quibble: I guess a substantial performance increase could be had by avoiding the df.timeseries() call in every iteration; instead, try something like:

        df = self.timeseries()
        time = set(time) if islistable(time) else {time}  # cast to set to drop duplicates
        lst = [None] * len(time)

        # apply fill_series, re-add time dimension to index, set series name
        for i, t in enumerate(time):
            lst[i] = df[np.isnan(df[t])].apply(fill_series, raw=False, axis=1, time=t).dropna()
            lst[i].index = append_index_level(
                index=lst[i].index, codes=[0] * len(lst[i]),
                level=[t], name=self.time_col, order=self._data.index.names)
            lst[i].name = 'value'

        # append interpolated values to `_data` and sort index
        self._data = self._data.append(pd.concat(lst)).sort_index()

@znicholls
Collaborator

Looks like @danielhuppmann has the review under control. My only thought: if you cast to wide format first, could you get an extra performance boost by using pandas' or numpy's built-in interpolation rather than looping over all the times?

@coroa
Collaborator

coroa commented Nov 11, 2020

My only thought: if you cast to wide format first, could you get an extra performance boost by using pandas' or numpy's built-in interpolation rather than looping over all the times?

Agreed.

if not islistable(time):
    time = [time]
pyam.IamDataFrame(
    idf.timeseries().reindex(columns=np.sort(idf.year + time)).interpolate(method='slinear', axis=1),
    meta=idf.meta
)

has proven to beat pyam's implementation by heaps and bounds.

@danielhuppmann
Member

Indeed, it would be great to switch to a more performant implementation of the interpolation!

One issue to keep in mind, though, is that pandas.DataFrame.interpolate fills in all missing values in the dataframe, not just the values for one new column. That's why I didn't use it at the time. And numpy.interp assumes that the values are constant outside the timeseries domain, which is also not a valid general assumption.
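
A small sketch of that pitfall with hypothetical data (again using pandas' dependency-free `method='index'` in place of the scipy-backed `'slinear'`): interpolating to add 2015 also fills a pre-existing gap at 2020.

```python
import numpy as np
import pandas as pd

ts = pd.DataFrame(
    [[1.0, 2.0, 3.0], [5.0, np.nan, 7.0]],  # Population has a real gap at 2020
    index=["Primary Energy", "Population"],
    columns=[2010, 2020, 2030],
)

# the requested 2015 column is filled, but so is Population's original 2020 gap
out = ts.reindex(columns=[2010, 2015, 2020, 2030]).interpolate(method="index", axis=1)
```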

@znicholls
Collaborator

is that pandas.DataFrame.interpolate fills in all missing values in the dataframe, not just the values for one new column

Yes you'd have to choose the subset you want to interpolate first if you didn't just want to interpolate everything.

Another thought: you could also change the API so the user specifies the full set of time points they want (including any already present) rather than only the extra ones.

So, say you had data for 2010, 2020 and 2030 and you wanted to interpolate to have 2010, 2015, 2020, 2025 and 2030.

In the current implementation you'd do

df.interpolate([2015, 2025])

But we could just make it so it's

df.interpolate([2010, 2015, 2020, 2025, 2030])

And just leave all the calculations to pandas. That interface would mean that interpolating would always leave you with a nan-free dataset.
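
A sketch of what that interface reduces to under the hood (hypothetical data; `method='index'` as a pandas-only stand-in): passing the full grid yields a NaN-free result, and repeating the call is a no-op, which also relates to the "accidentally applied twice" issue this PR closes.

```python
import pandas as pd

ts = pd.DataFrame(
    [[1.0, 2.0, 3.0]],
    index=["Primary Energy"],
    columns=[2010, 2020, 2030],
)

# user passes the complete target grid, existing points included
full_grid = [2010, 2015, 2020, 2025, 2030]
out = ts.reindex(columns=full_grid).interpolate(method="index", axis=1)

# no NaNs remain, and applying the same call again changes nothing
again = out.reindex(columns=full_grid).interpolate(method="index", axis=1)
```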

And numpy.interp assumes that the values are constant outside the timeseries domain

scipy interp1d will extrapolate if you want with the fill_value argument.
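
For reference, a minimal sketch of that extrapolation behavior (requires scipy):

```python
import numpy as np
from scipy.interpolate import interp1d

x = np.array([2010.0, 2020.0, 2030.0])
y = np.array([1.0, 2.0, 3.0])

# with fill_value='extrapolate', points outside the domain are
# extrapolated instead of raising a ValueError
f = interp1d(x, y, kind="slinear", fill_value="extrapolate")
inside = float(f(2015))   # 1.5
outside = float(f(2040))  # linear extension beyond 2030
```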

p.s. @coroa

heaps and bounds

was that a typo?

@coroa
Collaborator

coroa commented Nov 11, 2020

Not sure about the interpolate interface change; it would be difficult to guarantee a clean deprecation path. On the other hand, it is a nice way to solve the all-NaNs-filled problem.

And numpy.interp assumes that the values are constant outside the timeseries domain

scipy interp1d will extrapolate if you want with the fill_value argument.

pandas' method='slinear' is indeed passed on to scipy's interp1d (together with all extra **kwargs), whose default is to raise a ValueError outside the domain unless fill_value is set to e.g. 'extrapolate'.

p.s. @coroa

heaps and bounds

was that a typo?

something like that :)

@coroa
Collaborator

coroa commented Nov 11, 2020

Otherwise, something along the lines of (untested)

df = self.timeseries()
new_values = df.apply(
    lambda s: pd.Series(
        scipy.interpolate.interp1d(s.index, s.values, kind='slinear', **kwargs)(time),
        time),
    axis=1)
for year in time:
    if year in df:
        df[year].fillna(new_values[year], inplace=True)
    else:
        i = bisect(df.columns, year)  # bisect from the stdlib bisect module
        df.insert(i, year, new_values[year])
return pyam.IamDataFrame(df, meta=self.meta)

should work fine.

@gidden
Member Author

gidden commented Nov 11, 2020

Thanks for the discussion all! I'll take a look at the suggestions, update, and flag for final review.

@gidden
Member Author

gidden commented Nov 14, 2020

Ok folks, I have updated the implementation as shown in the attached notebook. Let me know what you think - it should maintain prior behavior (only updating the columns asked for).
interpolate_implementation (1).pdf

@gidden
Member Author

gidden commented Nov 14, 2020

Ok... this took me a bit further down a rabbit hole than I expected. The basic approach follows what @coroa suggested earlier, and includes:

  1. pivot to wide-format once
  2. interpolate using pandas.DataFrame.interpolate, passing along kwargs
  3. include back only time values that were asked for (keeping old behavior)
  4. return a copy by default (in the future)
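
The four steps above can be sketched in plain pandas (hypothetical long-format data; `method='index'` stands in for the kwargs pyam forwards to pandas.DataFrame.interpolate):

```python
import pandas as pd

# hypothetical long-format data: (variable, year) -> value
long = pd.DataFrame({
    "variable": ["Primary Energy"] * 3,
    "year": [2010, 2020, 2030],
    "value": [1.0, 2.0, 3.0],
})

time = [2015, 2025]

# 1. pivot to wide format once
wide = long.pivot(index="variable", columns="year", values="value")

# 2. interpolate, forwarding kwargs to pandas.DataFrame.interpolate
cols = sorted(set(wide.columns) | set(time))
filled = wide.reindex(columns=cols).interpolate(method="index", axis=1)

# 3. keep only the original columns plus the requested time points,
#    so times not asked for are left untouched (old behavior)
result = filled[sorted(set(wide.columns) | set(time))]

# 4. return a copy rather than a view of the internal data
result = result.copy()
```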

There were a few issues I ran into that I will raise to others. Namely an issue for @danielhuppmann / @Rlamboll regarding interpolation with extra_cols and an issue with datetime.datetime for @znicholls. I will ping on the specific lines.

@coroa happy if you take a look at the final version here too.

def test_interpolate_extra_cols():
# check that interpolation with non-matching extra_cols has no effect (#351)
EXTRA_COL_DF = pd.DataFrame([
['foo', 2005, 1],
['foo', 2010, 2],
Member Author

@Rlamboll and @danielhuppmann I had to update this test to get correct behavior. The current implementation uses the wide-format timeseries object, which treats each index value (row) as an independent observation. Therefore, the foo and bar entries here are each interpolated separately. If I left it as is, I would get an error, because there are too few data points to interpolate (i.e., one each).

Collaborator

OK, I believe this is consistent with the current state as of the discussion in #351. It's unclear whether this is the desired behaviour or not, but it's what we're used to.

Member

Slightly concerned that this behavior might trip up users who have an IamDataFrame where just one timeseries-row has only one data point...

Member Author

For the moment, an error will be raised when trying to interpolate with a single data point. I suggest we keep it that way, as I am not sure we want to support a use case of pyam data with a single data point (that is what metadata is for, right?). @danielhuppmann I will let you mark if resolved.

Member

I'm worried if a user has a table like the one below and wants to interpolate for 2015.

variable        2010  2020  2030
Primary Energy     1     2     3
Population         5   nan   nan

I would expect a return-object like

variable        2010  2015  2020  2030
Primary Energy     1   1.5     2     3
Population         5   nan   nan   nan


Member

@danielhuppmann danielhuppmann left a comment

Looks really great, thanks @gidden! Just a bunch of minor comments...

I ran a test on my standard large-data ensemble (4 scenarios, 2m data points) and can report:

  • old implementation (using two separate calls for interpolation): 64.4 sec
  • new implementation (using two separate calls for interpolation): 50.8 sec
  • new implementation (calling it via a list): 29.3 sec

@@ -642,8 +686,11 @@ def test_interpolate_extra_cols():
df2 = df.copy()
df2.interpolate(2007)

# assert that interpolation didn't change any data
assert_iamframe_equal(df, df2)
# interpolate should work as if this is a new index
Member

Not clear to me what this comment means...

Member Author

have cleared it up

gidden and others added 5 commits November 17, 2020 12:20
Co-authored-by: Daniel Huppmann <dh@dergelbesalon.at>
@gidden
Member Author

gidden commented Nov 17, 2020

Ok @danielhuppmann - should be good for final review now. I must admit, I am a bit surprised at the rather minor speedups (@coroa, can you confirm these numbers match your observations?). But in general, this PR was meant to support additional input, not optimization, so no need to block on that here.

Member

@danielhuppmann danielhuppmann left a comment

Thanks @gidden, really nice!

@danielhuppmann
Member

resolved merge conflicts

@danielhuppmann danielhuppmann merged commit 27eee85 into IAMconsortium:master Nov 24, 2020
@gidden gidden deleted the interp-update branch June 15, 2022 11:28
Development

Successfully merging this pull request may close these issues.

  • interpolate fails if 'accidently applied twice'
  • interpolate does not follow other return conventions
6 participants