Improve execution speed of rdtools.degradation_classical_decomposition #371

Merged: 7 commits merged into release/2.1.6 on Jun 2, 2023

Conversation

@kandersolar (Member) commented on May 18, 2023:

- [x] Code changes are covered by tests
- [ ] Code changes have been evaluated for compatibility/integration with TrendAnalysis
- [ ] New functions added to `__init__.py`
- [ ] `API.rst` is up to date, along with other sphinx docs pages
- [ ] Example notebooks are rerun and differences in results scrutinized
- [x] Updated changelog

`rdtools.degradation_classical_decomposition` is rather slow for large inputs (~10 seconds for a 6-year daily dataset). The runtime is dominated by two computational bottlenecks: the moving average calculation and the Mann-Kendall (M-K) trend test. The current implementations of both use Python loops and are straightforward to replace with vectorized pandas/numpy operations. Doing so speeds up the overall `rdtools.degradation_classical_decomposition` runtime by a couple orders of magnitude.

The following table compares runtimes in seconds, along with their ratio, for various input lengths (number of years of daily values):

| years | v2.1.5 (s) | PR (s) | ratio |
|------:|-----------:|-------:|------:|
|     2 |      0.717 |  0.013 |  53.4 |
|     3 |      2.560 |  0.015 | 169.8 |
|     4 |      5.992 |  0.026 | 226.2 |
|     5 |      7.445 |  0.041 | 182.2 |
|     6 |     10.614 |  0.056 | 190.1 |
|     7 |     14.840 |  0.080 | 184.6 |
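
For reference, timings like these could be gathered with a harness along the following lines. This is a hypothetical sketch, not the exact benchmark used for the table above: the synthetic input is an assumption, and absolute numbers will vary by machine.

```python
import time

import numpy as np
import pandas as pd
import rdtools

for nyears in [2, 3, 4, 5, 6, 7]:
    # synthetic normalized-energy series: daily values with mild noise
    times = pd.date_range('2000-01-01', freq='d', periods=nyears * 365)
    energy = pd.Series(1 + np.random.normal(0, 0.1, len(times)), index=times)

    start = time.perf_counter()
    rdtools.degradation_classical_decomposition(energy)
    print(nyears, time.perf_counter() - start)
```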

Here is some code to verify that the new implementations produce output equivalent to the current implementations:

**MK-test**

```python
import numpy as np

for n in [10, 100, 1000]:
    # setup
    x = np.random.rand(n)

    # current method: O(n^2) Python double loop over pairs
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            s += np.sign(x[j] - x[k])

    # new method: same pairwise sign sum via a broadcasted outer difference;
    # triu(..., 1) keeps only the j > k pairs
    s2 = np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))

    assert s == s2
    print(s, s2)
```
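
For context (this is not part of the PR's diff): the quantity computed above is the Mann-Kendall statistic S, from which the standard test statistic follows. A sketch using the textbook no-ties variance formula, which may differ in detail from rdtools' actual implementation:

```python
import numpy as np
from scipy import stats

def mann_kendall_z(x):
    """Z statistic and two-sided p-value for the M-K trend test
    (textbook no-ties formula; illustrative only)."""
    n = len(x)
    # vectorized pairwise sign sum, same expression as above
    s = np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S, no ties
    z = (s - np.sign(s)) / np.sqrt(var_s)     # continuity correction
    p = 2 * (1 - stats.norm.cdf(abs(z)))      # two-sided p-value
    return z, p
```

Note that the vectorized form materializes an n-by-n array, so memory grows quadratically with input length; for daily PV data (a few thousand points) this is negligible.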
**Moving average**

```python
import numpy as np
import pandas as pd

for nyears in [2, 3, 4]:
    # setup
    times = pd.date_range('2000-01-01', freq='d', periods=nyears * 365)
    noise = np.random.normal(0, 0.1, len(times))
    df = pd.DataFrame({
        'energy_normalized': 1 + noise,
    }, index=times)
    day_diffs = df.index - df.index[0]
    df['days'] = day_diffs / pd.Timedelta('1d')
    df['years'] = df.days / 365.0

    # current method: Python loop, re-slicing the frame for every row
    energy_ma = []
    for i, row in df.iterrows():
        if row.years - 0.5 >= min(df.years) and \
           row.years + 0.5 <= max(df.years):
            roll = df[(df.years <= row.years + 0.5) &
                      (df.years >= row.years - 0.5)]
            energy_ma.append(roll.energy_normalized.mean())
        else:
            energy_ma.append(np.nan)

    df['energy_ma_loop'] = energy_ma

    # new method: centered time-based rolling mean, masking out rows
    # that lack a full year of surrounding data
    energy_ma = df['energy_normalized'].rolling('365d', center=True).mean()
    has_full_year = (df['years'] > df['years'][0] + 0.5) & \
                    (df['years'] < df['years'][-1] - 0.5)
    energy_ma[~has_full_year] = np.nan
    df['energy_ma_pandas'] = energy_ma

    pd.testing.assert_series_equal(df['energy_ma_loop'], df['energy_ma_pandas'],
                                   check_names=False)
```
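
One caveat worth flagging here (and discussed below): `center=True` with a time-offset window like `'365d'` is the piece that requires a newer pandas. A hypothetical early check for older environments; rdtools itself handles this via its minimum version pin instead:

```python
import pandas as pd

# center=True with an offset rolling window was added in pandas 1.3
# (hypothetical guard, not part of this PR)
major, minor = (int(part) for part in pd.__version__.split('.')[:2])
if (major, minor) < (1, 3):
    raise RuntimeError("centered offset rolling windows need pandas >= 1.3")
```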

@kandersolar (Member, Author) commented:
requirements-min is failing. It looks like the necessary pandas functionality was only added in pandas v1.3, released July 2, 2021. Is it okay to bump the minimum version to 1.3? If that's too recent (not quite two years old), I could revert the moving average improvement and keep only the M-K test change, which would still be a nice runtime improvement.

Also, I took the liberty of making a 2.1.6 whatsnew file for this. Happy to change to whatever the release plan is, or feel free to just push changes yourself :)

The review thread below was opened on these lines of the new moving average code:

```python
energy_ma = df['energy_normalized'].rolling('365d', center=True).mean()
has_full_year = (df['years'] > df['years'][0] + 0.5) & (df['years'] < df['years'][-1] - 0.5)
```
A Contributor commented:

Does it make a difference that the old method uses `>=` and `<=`?

@kandersolar (Member, Author) replied:
For daily values, I think it does not matter. Because this code defines a year to be 365 days, the closest a daily value can get to 0.5 years is 0.49863014 or 0.50136986, so there is no difference between < and <=.
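
(Those two values are just 182/365 and 183/365:)

```python
print(182 / 365, 183 / 365)  # ≈ 0.49863014, 0.50136986
```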

For some sub-daily cases (e.g., 12-hour intervals) it would matter, so I have updated the code to match the previous behavior. Thanks! But this also made me realize that the old method's float comparisons for the window boundaries produced inconsistent behavior: with 12-hour inputs, the moving window is sometimes length 730 and sometimes length 731 using the loop approach, whereas the pandas windows are always length 730. So although the two implementations give identical results for daily inputs, they can give slightly different results for sub-daily inputs. The difference is very minor, but I point it out for completeness.
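
A quick way to observe the sub-daily discrepancy described above (a hypothetical illustration; the exact counts depend on pandas' window-closure semantics):

```python
import numpy as np
import pandas as pd

# 12-hour inputs: compare the float-comparison window length with pandas'
times = pd.date_range('2000-01-01', freq='12h', periods=2 * 730)
years = (times - times[0]) / pd.Timedelta('1d') / 365.0
t = years[len(years) // 2]  # a point well inside the series

loop_window = int(np.sum((years >= t - 0.5) & (years <= t + 0.5)))
pandas_window = int(pd.Series(1.0, index=times)
                      .rolling('365d', center=True).count()
                      .iloc[len(times) // 2])
print(loop_window, pandas_window)  # may differ by one for sub-daily data
```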

@mdeceglie (Collaborator) commented:
I think it makes sense to update the minimum pandas version to 1.3. Looks like #373 needs a more recent minimum version as well.

@kandersolar (Member, Author) commented:

> I think it makes sense to update the minimum pandas version to 1.3

Done. As is often the case with increasing minimum versions, it required increasing some others as well.

@mdeceglie changed the base branch from master to release/2.1.6 (June 2, 2023).
@mdeceglie (Collaborator) left a review:

LGTM. Thanks @kandersolar and @mikofski

@mdeceglie merged commit 02c68d9 into release/2.1.6 on Jun 2, 2023 (16 checks passed), then deleted the fast-mk branch.