Improve execution speed of rdtools.degradation_classical_decomposition #371

Merged: 7 commits merged into release/2.1.6 on Jun 2, 2023

Conversation

@kandersolar (Member) commented on May 18, 2023:

- [x] Code changes are covered by tests
- [ ] Code changes have been evaluated for compatibility/integration with TrendAnalysis
- [ ] New functions added to `__init__.py`
- [ ] `API.rst` is up to date, along with other sphinx docs pages
- [ ] Example notebooks are rerun and differences in results scrutinized
- [x] Updated changelog

`rdtools.degradation_classical_decomposition` is rather slow for large inputs (~10 seconds for a 6-year daily dataset). The runtime is dominated by two computational bottlenecks: the moving average calculation and the Mann-Kendall (M-K) trend test. The current implementations of both use Python loops and are straightforward to replace with vectorized pandas/numpy operations. Doing so speeds up the overall `rdtools.degradation_classical_decomposition` runtime by a couple orders of magnitude.

The following table compares runtimes in seconds, along with their ratio, for various input lengths (number of years of daily values):

| years | v2.1.5 (s) | PR (s) | ratio |
|------:|-----------:|-------:|------:|
|     2 |      0.717 |  0.013 |  53.4 |
|     3 |      2.560 |  0.015 | 169.8 |
|     4 |      5.992 |  0.026 | 226.2 |
|     5 |      7.445 |  0.041 | 182.2 |
|     6 |     10.614 |  0.056 | 190.1 |
|     7 |     14.840 |  0.080 | 184.6 |
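
For reference, timings like these could be gathered with a harness along the following lines. This is a hypothetical sketch, not the exact benchmark used for the table above: the synthetic input is an assumption, and absolute numbers will vary by machine.

```python
import time

import numpy as np
import pandas as pd
import rdtools

for nyears in [2, 3, 4, 5, 6, 7]:
    # synthetic normalized-energy series: daily values with mild noise
    times = pd.date_range('2000-01-01', freq='d', periods=nyears * 365)
    energy = pd.Series(1 + np.random.normal(0, 0.1, len(times)), index=times)

    start = time.perf_counter()
    rdtools.degradation_classical_decomposition(energy)
    print(nyears, time.perf_counter() - start)
```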

Here is some code to verify that the new implementations produce output equivalent to the current implementations:

**MK-test**

```python
import numpy as np

for n in [10, 100, 1000]:
    # setup
    x = np.random.rand(n)

    # current method: O(n^2) Python double loop over pairs
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            s += np.sign(x[j] - x[k])

    # new method: same pairwise sign sum via a broadcasted outer difference;
    # triu(..., 1) keeps only the j > k pairs
    s2 = np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))

    assert s == s2
    print(s, s2)
```
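
For context (this is not part of the PR's diff): the quantity computed above is the Mann-Kendall statistic S, from which the standard test statistic follows. A sketch using the textbook no-ties variance formula, which may differ in detail from rdtools' actual implementation:

```python
import numpy as np
from scipy import stats

def mann_kendall_z(x):
    """Z statistic and two-sided p-value for the M-K trend test
    (textbook no-ties formula; illustrative only)."""
    n = len(x)
    # vectorized pairwise sign sum, same expression as above
    s = np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S, no ties
    z = (s - np.sign(s)) / np.sqrt(var_s)     # continuity correction
    p = 2 * (1 - stats.norm.cdf(abs(z)))      # two-sided p-value
    return z, p
```

Note that the vectorized form materializes an n-by-n array, so memory grows quadratically with input length; for daily PV data (a few thousand points) this is negligible.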
**Moving average**

```python
import numpy as np
import pandas as pd

for nyears in [2, 3, 4]:
    # setup
    times = pd.date_range('2000-01-01', freq='d', periods=nyears * 365)
    noise = np.random.normal(0, 0.1, len(times))
    df = pd.DataFrame({
        'energy_normalized': 1 + noise,
    }, index=times)
    day_diffs = df.index - df.index[0]
    df['days'] = day_diffs / pd.Timedelta('1d')
    df['years'] = df.days / 365.0

    # current method: Python loop, re-slicing the frame for every row
    energy_ma = []
    for i, row in df.iterrows():
        if row.years - 0.5 >= min(df.years) and \
           row.years + 0.5 <= max(df.years):
            roll = df[(df.years <= row.years + 0.5) &
                      (df.years >= row.years - 0.5)]
            energy_ma.append(roll.energy_normalized.mean())
        else:
            energy_ma.append(np.nan)

    df['energy_ma_loop'] = energy_ma

    # new method: centered time-based rolling mean, masking out rows
    # that lack a full year of surrounding data
    energy_ma = df['energy_normalized'].rolling('365d', center=True).mean()
    has_full_year = (df['years'] > df['years'][0] + 0.5) & \
                    (df['years'] < df['years'][-1] - 0.5)
    energy_ma[~has_full_year] = np.nan
    df['energy_ma_pandas'] = energy_ma

    pd.testing.assert_series_equal(df['energy_ma_loop'], df['energy_ma_pandas'],
                                   check_names=False)
```
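
One caveat worth flagging here (and discussed below): `center=True` with a time-offset window like `'365d'` is the piece that requires a newer pandas. A hypothetical early check for older environments; rdtools itself handles this via its minimum version pin instead:

```python
import pandas as pd

# center=True with an offset rolling window was added in pandas 1.3
# (hypothetical guard, not part of this PR)
major, minor = (int(part) for part in pd.__version__.split('.')[:2])
if (major, minor) < (1, 3):
    raise RuntimeError("centered offset rolling windows need pandas >= 1.3")
```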

@kandersolar (Member, Author) commented:
requirements-min is failing. It looks like the necessary pandas functionality was only added in pandas v1.3, released July 2, 2021. Is it okay to bump the minimum version to 1.3? If that's too recent (not quite two years old), I could revert the moving average improvement and keep only the M-K test change, which would still be a nice runtime improvement.

Also, I took the liberty of making a 2.1.6 whatsnew file for this. Happy to change to whatever the release plan is, or feel free to just push changes yourself :)

The review thread below was opened on these lines of the new moving average code:

```python
energy_ma = df['energy_normalized'].rolling('365d', center=True).mean()
has_full_year = (df['years'] > df['years'][0] + 0.5) & (df['years'] < df['years'][-1] - 0.5)
```
A Contributor commented:

Does it make a difference that the old method uses `>=` and `<=`?

@kandersolar (Member, Author) replied:
For daily values, I think it does not matter. Because this code defines a year to be 365 days, the closest a daily value can get to 0.5 years is 0.49863014 or 0.50136986, so there is no difference between < and <=.
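
(Those two values are just 182/365 and 183/365:)

```python
print(182 / 365, 183 / 365)  # ≈ 0.49863014, 0.50136986
```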

For some sub-daily cases (e.g., 12-hour intervals) it would matter, so I have updated the code to match the previous behavior. Thanks! But this also made me realize that the old method's float comparisons for the window boundaries produced inconsistent behavior: with 12-hour inputs, the moving window is sometimes length 730 and sometimes length 731 using the loop approach, whereas the pandas windows are always length 730. So although the two implementations give identical results for daily inputs, they can give slightly different results for sub-daily inputs. The difference is very minor, but I point it out for completeness.
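
A quick way to observe the sub-daily discrepancy described above (a hypothetical illustration; the exact counts depend on pandas' window-closure semantics):

```python
import numpy as np
import pandas as pd

# 12-hour inputs: compare the float-comparison window length with pandas'
times = pd.date_range('2000-01-01', freq='12h', periods=2 * 730)
years = (times - times[0]) / pd.Timedelta('1d') / 365.0
t = years[len(years) // 2]  # a point well inside the series

loop_window = int(np.sum((years >= t - 0.5) & (years <= t + 0.5)))
pandas_window = int(pd.Series(1.0, index=times)
                      .rolling('365d', center=True).count()
                      .iloc[len(times) // 2])
print(loop_window, pandas_window)  # may differ by one for sub-daily data
```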

@mdeceglie (Collaborator) commented:
I think it makes sense to update the minimum pandas version to 1.3. Looks like #373 needs a more recent minimum version as well.

@kandersolar (Member, Author) commented:

> I think it makes sense to update the minimum pandas version to 1.3

Done. As is often the case with increasing minimum versions, it required increasing some others as well.

@mdeceglie changed the base branch from master to release/2.1.6 (June 2, 2023).
@mdeceglie (Collaborator) left a review:

LGTM. Thanks @kandersolar and @mikofski

@mdeceglie merged commit 02c68d9 into release/2.1.6 on Jun 2, 2023 (16 checks passed), then deleted the fast-mk branch.