SweatStack/mean-max-computation-experiment

How fast can we compute mean-max curves in Python?

Mean-max is a standard analysis in endurance sports. For every possible duration (1 second, 2 seconds, ... up to the full activity), find the highest average power sustained over that window. A 6-hour ride gives you about 21,000 windows to check against 21,000 positions.
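As a toy illustration of the definition (a brute-force sketch, not one of the benchmarked implementations):

```python
power = [100, 200, 300, 250]  # a toy 4-second "activity"

# For each window size w, take the best average over any contiguous w samples
curve = [
    max(sum(power[s:s + w]) / w for s in range(len(power) - w + 1))
    for w in range(1, len(power) + 1)
]
print(curve)  # [300.0, 275.0, 250.0, 212.5]
```

The best 1-second effort is the single highest sample; the best 4-second effort is just the overall average.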

Why does speed matter?

Mean-max is typically computed once per activity and persisted. A second or two is fine for that. But there are cases where it needs to be faster: local analytical workloads that recompute mean-max many times (for example, calculating power curves for different fatigue states based on energy expenditure), or "online" scenarios where mean-max isn't persisted and needs to be derived on the fly from raw timeseries data.

Results

We benchmark the same algorithm across a range of approaches: a naive Python baseline, a typical Pandas implementation, a pure NumPy version, Numba with JIT compilation, and a Rust extension via PyO3. Each approach is explained in detail below, but the results come first.

Tested on a 6-hour cycling power file (21,666 samples at 1Hz).

Exact, single-threaded

All of these compute the identical result, on a single thread. This is the apples-to-apples comparison.

Approach         Time         Speedup     Implementation
────────────────────────────────────────────────────────
Rust             47.3 ms      24x         PyO3, iterator-based
Numba            60.0 ms      19x         @njit compiled loop
NumPy            114.0 ms     10x         Python loop, NumPy ops
Pandas ←         1,126 ms     1x          Python loop, pd.Series.diff (reference)

With trade-offs (parallelism, approximation)

These change the rules: multiple cores, approximate results, or both. Faster, but not a direct comparison. See Beyond single-threaded exact computation for details.

Approach                    Time       Speedup     Trade-off
────────────────────────────────────────────────────────────────────────
Rust sparse + parallel      0.5 ms     2,200x      ~0.3% error, multi-threaded
Rust sparse                 2.4 ms     470x        ~0.3% error
NumPy sparse                5.0 ms     225x        ~0.3% error
Rust parallel               7.6 ms     150x        multi-threaded

Too slow to measure on the full dataset

Extrapolated from smaller subsets.

Approach         Measured              Extrapolated            Algorithm
───────────────────────────────────────────────────────────────────────
Pure Python      22 ms / 1K samples    ~11 s / 21K samples     O(N^2) cumsum
Naive            405 ms / 500 samples  ~9 hours / 21K samples  O(N^3) no cumsum

All speedups are relative to the Pandas reference (1,126 ms).

Takeaways

  1. Dropping Pandas for NumPy gives 10x. Biggest improvement for the least effort.
  2. Rust is the fastest exact single-threaded implementation at 24x. Numba (19x) is close and much easier to set up.
  3. If you can accept ~0.3% error, sparse windows are the practical winner. Even single-threaded Rust sparse (2.4ms, 470x) is fast enough for real-time use. Add parallelism and it drops to 0.5ms.
  4. Pure Python is about 10x slower than Pandas. Pandas has real overhead, but Python's interpreter without any C-backed library is far worse.
  5. The cumsum trick matters more than anything else. Naive (no cumsum) extrapolates to ~9 hours. With cumsum, even pure Python does it in ~11 seconds. Same language, ~3,000x difference, just from the algorithm.

The algorithm

Naive approach, O(N^3)

For each window size, slide through every position and sum the values from scratch:

result = [0.0] * (N + 1)
for w in range(1, N + 1):
    for start in range(N - w + 1):
        mean = sum(power[start:start + w]) / w   # O(w) work each time
        result[w] = max(result[w], mean)

The inner sum recomputes overlapping additions every time. For N=21,666 that's roughly 1.7 trillion operations, or about 9 hours in pure Python.

Cumsum trick, O(N^2)

A prefix sum lets you compute the sum of any contiguous range in O(1). Instead of re-adding w values, you subtract two pre-computed values:

cumsum[i] = power[0] + power[1] + ... + power[i-1]

sum(power[start..start+w]) = cumsum[start+w] - cumsum[start]
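The identity is easy to verify on a small array (NumPy sketch; note the leading zero, so that cumsum[i] is the sum of the first i samples):

```python
import numpy as np

power = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
cumsum = np.concatenate(([0.0], np.cumsum(power)))  # [0, 100, 300, 600, 1000, 1500]

# 3-sample window starting at index 1: power[1] + power[2] + power[3]
assert cumsum[1 + 3] - cumsum[1] == 200.0 + 300.0 + 400.0  # two lookups, no loop
```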

This eliminates the innermost loop:

cumsum = np.concatenate(([0.0], np.cumsum(power)))   # O(N) prefix sum, once
for w in range(1, N + 1):
    rolling_sums = cumsum[w:] - cumsum[:-w]
    result[w] = rolling_sums.max() / w

For N=21,666 that's about 235 million operations. All benchmarked implementations use this approach. They only differ in language and library overhead.
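Put together as a runnable NumPy function (names and shapes here are illustrative, not the repo's exact code), checked against the O(N^3) brute force on a small random input:

```python
import numpy as np

def mean_max(power: np.ndarray) -> np.ndarray:
    """result[w - 1] is the best average over any window of w samples."""
    n = len(power)
    cumsum = np.concatenate(([0.0], np.cumsum(power)))  # prefix sums, O(N) once
    result = np.empty(n)
    for w in range(1, n + 1):
        # All sums of length-w windows, each in O(1) via two lookups
        rolling_sums = cumsum[w:] - cumsum[:-w]
        result[w - 1] = rolling_sums.max() / w
    return result

# Verify against a brute-force computation on a small input
rng = np.random.default_rng(42)
power = rng.uniform(0.0, 400.0, size=50)
brute = [max(power[s:s + w].mean() for s in range(len(power) - w + 1))
         for w in range(1, len(power) + 1)]
assert np.allclose(mean_max(power), brute)
```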

Can we do better than O(N^2)?

Probably not. The core problem, computing max_i(S[i+w] - S[i]) for all w, is equivalent to a max-plus convolution. No significantly sub-quadratic algorithm is known for this, and there are conditional lower bounds suggesting O(N^2) is close to optimal.

This is important context for the benchmark. The implementations are not leaving algorithmic performance on the table. The cumsum trick is the one structural improvement, and the inner loop that remains (subtract two values, compare against a running max, contiguous memory) is about as simple and vectorization-friendly as it gets. What we're really measuring is how fast different languages and runtimes can execute that loop.

Approaches

  1. Naive (implementations/naive.py): Triple nested loop, no cumsum. Exists to show why the cumsum trick matters.
  2. Pure Python (implementations/pure_python.py): Cumsum with raw Python loops. Shows the cost of CPython's interpreter without any C-backed library to help.
  3. Pandas (implementations/pandas_baseline.py): Python loop calling pd.Series.diff() per window. A typical Pandas-based implementation. Per-call overhead (index alignment, type checks, Series construction) dominates the runtime.
  4. NumPy (implementations/numpy_loop.py): Same Python loop, but cumsum[w:] - cumsum[:-w] on raw NumPy arrays. Drops all Pandas overhead while keeping the code nearly identical. The easiest 10x you'll find.
  5. Numba (implementations/numba_parallel.py): @njit compiles the double loop to machine code. No per-iteration Python overhead at all.
  6. Rust (rust/src/lib.rs): Same algorithm in Rust, exposed to Python via PyO3. Iterator-based inner loop helps the compiler auto-vectorize.

Beyond single-threaded exact computation

The benchmark above is a single-threaded, exact-result comparison. It isolates the effect of language and library choice. The repo also includes approaches that change the rules.

Parallelism. Each window size is independent, so the outer loop parallelizes trivially. Rust with Rayon reaches 7.6ms (150x vs Pandas). Any approach could be parallelized in principle, but Python's GIL limits this to compiled backends. The parallel variant is included as mean_max_parallel in Rust.

Sparse windows. You don't need to compute every single window size. The mean-max curve changes fast at short durations (a 5-second sprint looks very different from a 6-second one) but barely moves at long durations (the best 59-minute and 60-minute averages are almost identical). A pragmatic, domain-driven approach: compute every second up to 3 minutes, every 5 seconds up to 10 minutes, and progressively coarser from there. This gives about 700 windows instead of 21,000, with the interpolation error staying below 0.4%. The repo includes a NumPy version (5.0ms) and Rust versions in both single-threaded (2.4ms) and parallel (0.5ms) form. Combining sparse windows with parallelism is trivial because the two optimizations are orthogonal: sparse reduces the number of windows, parallelism distributes them across cores. No added implementation complexity beyond what each technique requires on its own.
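A sketch of such a window grid. The breakpoints up to 10 minutes follow the schedule described above; the coarser steps beyond that are illustrative assumptions, not the repo's exact schedule (which lives in implementations/sparse_windows.py):

```python
import numpy as np

def sparse_window_sizes(n: int) -> np.ndarray:
    """Window sizes in seconds: dense where the curve changes fast,
    coarse where it barely moves. Steps past 10 min are assumptions."""
    sizes = list(range(1, 181))              # every 1 s up to 3 min
    sizes += list(range(185, 601, 5))        # every 5 s up to 10 min
    sizes += list(range(630, 3601, 30))      # every 30 s up to 1 h
    sizes += list(range(3660, n + 1, 60))    # every 60 s beyond
    return np.array([w for w in sizes if w <= n])

windows = sparse_window_sizes(21_666)
print(len(windows))  # 665 window sizes with this assumed schedule, vs 21,666 dense
```

The full curve can then be reconstructed over the skipped durations with np.interp, which is where the small interpolation error comes from.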

What fast mean-max enables

At 0.5-2ms per activity, analyses that weren't practical before become trivial:

  • Fatigue-dependent power curves. Compute mean-max after every 500 kJ of energy expenditure, for example, to see how the power curve degrades over a ride.
  • Interactive filtering. Recompute the curve on every UI interaction (date range, sub-sport, route segment, training zone) without perceptible delay.

Running the benchmarks

# Install dependencies
uv pip install -e .

# Build the Rust extension
uv pip install -e ./rust/

# Run all benchmarks (includes parallel and sparse variants)
uv run pytest -v --benchmark-sort=mean

The naive and pure Python benchmarks run on small subsets (500 and 1,000 samples). All other benchmarks run on the full 21,666-sample dataset and verify correctness against a reference generated from the Pandas baseline.

Project structure

.
├── implementations/               # All Python implementations
│   ├── naive.py                   #   Triple nested loop
│   ├── pure_python.py             #   Cumsum + raw loops
│   ├── pandas_baseline.py         #   pd.Series.diff per window
│   ├── numpy_loop.py              #   NumPy array ops
│   ├── numba_parallel.py          #   @njit compiled
│   └── sparse_windows.py          #   Approximate: domain-driven sparse windows
├── rust/                          # Rust implementations
│   ├── src/lib.rs                 #   mean_max, mean_max_parallel, mean_max_sparse, mean_max_sparse_parallel
│   ├── Cargo.toml
│   └── pyproject.toml
├── benchmarks/                    # pytest-benchmark tests
│   ├── conftest.py                #   Loads power.csv, generates reference
│   ├── bench_naive.py
│   ├── bench_pure_python.py
│   ├── bench_pandas.py
│   ├── bench_numpy.py
│   ├── bench_numba.py
│   ├── bench_rust.py
│   ├── bench_rust_parallel.py
│   ├── bench_rust_sparse.py
│   ├── bench_rust_sparse_parallel.py
│   └── bench_sparse.py
├── power.csv                      # Test data: 6-hour ride, 21,666 samples
├── pyproject.toml
└── LICENSE

License

MIT
