Mean-max is a standard analysis in endurance sports. For every possible duration (1 second, 2 seconds, ... up to the full activity), find the highest average power sustained over that window. A 6-hour ride gives you about 21,000 windows to check against 21,000 positions.
Mean-max is typically computed once per activity and persisted. A second or two is fine for that. But there are cases where it needs to be faster: local analytical workloads that recompute mean-max many times (for example, calculating power curves for different fatigue states based on energy expenditure), or "online" scenarios where mean-max isn't persisted and needs to be derived on the fly from raw timeseries data.
We benchmark the same algorithm across a range of approaches: a naive Python baseline, a typical Pandas implementation, a pure NumPy version, Numba with JIT compilation, and a Rust extension via PyO3. Each approach is explained in detail below, but the results come first.
Tested on a 6-hour cycling power file (21,666 samples at 1Hz).
All of these compute the identical result, on a single thread. This is the apples-to-apples comparison.
Approach     Time        Speedup   Implementation
────────────────────────────────────────────────────────
Rust         47.3 ms     24x       PyO3, iterator-based
Numba        60.0 ms     19x       @njit compiled loop
NumPy        114.0 ms    10x       Python loop, NumPy ops
Pandas       1,126 ms    1x        Python loop, pd.Series.diff (reference)
These change the rules: multiple cores, approximate results, or both. Faster, but not a direct comparison. See Beyond single-threaded exact computation for details.
Approach                 Time      Speedup   Trade-off
────────────────────────────────────────────────────────────────────────
Rust sparse + parallel   0.5 ms    2,200x    ~0.3% error, multi-threaded
Rust sparse              2.4 ms    470x      ~0.3% error
NumPy sparse             5.0 ms    225x      ~0.3% error
Rust parallel            7.6 ms    150x      multi-threaded
Extrapolated from smaller subsets.
Approach      Measured               Extrapolated             Algorithm
───────────────────────────────────────────────────────────────────────
Pure Python   22 ms / 1K samples     ~11 s / 21K samples      O(N^2) cumsum
Naive         405 ms / 500 samples   ~9 hours / 21K samples   O(N^3) no cumsum
All speedups are relative to the Pandas reference (1,126ms).
- Dropping Pandas for NumPy gives 10x. Biggest improvement for the least effort.
- Rust is the fastest exact single-threaded implementation at 24x. Numba (19x) is close and much easier to set up.
- If you can accept ~0.3% error, sparse windows are the practical winner. Even single-threaded Rust sparse (2.4ms, 470x) is fast enough for real-time use. Add parallelism and it drops to 0.5ms.
- Pure Python is about 10x slower than Pandas. Pandas has real overhead, but Python's interpreter without any C-backed library is far worse.
- The cumsum trick matters more than anything else. Naive (no cumsum) extrapolates to ~9 hours. With cumsum, even pure Python does it in ~11 seconds. Same language, ~3,000x difference, just from the algorithm.
For each window size, slide through every position and sum the values from scratch:
for w in 1..N:
    for start in 0..N-w:
        mean = sum(power[start..start+w]) / w   ← O(w) work each time
        result[w] = max(result[w], mean)
The inner sum recomputes overlapping additions every time. For N=21,666 that's roughly 1.7 trillion operations, or about 9 hours in pure Python.
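The pseudocode above translates directly into runnable Python (the function name is illustrative, not the repo's):

```python
def mean_max_naive(power):
    """O(N^3): every window is re-summed from scratch. Demonstration only."""
    n = len(power)
    result = [0.0] * (n + 1)  # result[w] = best mean over any window of length w
    for w in range(1, n + 1):
        best = 0.0
        for start in range(n - w + 1):
            best = max(best, sum(power[start:start + w]) / w)  # O(w) per window
        result[w] = best
    return result

# Four samples: best 1 s = 300 W, best 2 s = (300+200)/2 = 250 W, ...
print(mean_max_naive([100, 300, 200, 200]))
```

Fine for a toy input; hopeless at 21,666 samples.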
A prefix sum lets you compute the sum of any contiguous range in O(1). Instead of re-adding w values, you subtract two pre-computed values:
cumsum[i] = power[0] + power[1] + ... + power[i-1]
sum(power[start..start+w]) = cumsum[start+w] - cumsum[start]
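The identity is easy to check in a few lines of Python:

```python
from itertools import accumulate

power = [100, 300, 200, 200]
# cumsum[i] holds the sum of the first i samples, so cumsum[0] = 0
cumsum = [0] + list(accumulate(power))  # [0, 100, 400, 600, 800]

# Sum of the length-2 window starting at index 1, via two lookups:
start, w = 1, 2
assert cumsum[start + w] - cumsum[start] == sum(power[start:start + w])  # 500
```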
This eliminates the innermost loop:
cumsum = prefix_sum(power)                   ← O(N) once
for w in 1..N:
    rolling_sums = cumsum[w:] - cumsum[:-w]
    result[w] = max(rolling_sums) / w
For N=21,666 that's about 235 million operations. All benchmarked implementations use this approach. They only differ in language and library overhead.
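In NumPy this is a few lines per window size. A minimal sketch of the approach (not the repo's exact code; the function name is mine):

```python
import numpy as np

def mean_max_numpy(power):
    """Cumsum-based exact mean-max: one vectorized pass per window size."""
    power = np.asarray(power, dtype=np.float64)
    n = len(power)
    cumsum = np.concatenate(([0.0], np.cumsum(power)))  # length n + 1
    result = np.zeros(n + 1)
    for w in range(1, n + 1):
        # All rolling sums of length w, as a difference of two array slices
        rolling = cumsum[w:] - cumsum[:-w]
        result[w] = rolling.max() / w
    return result
```

The Python loop over `w` remains, but the O(N) inner loop runs inside NumPy's C code.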
Probably not. The core problem, computing max_i(S[i+w] - S[i]) for all w, is equivalent to a max-plus convolution. No significantly sub-quadratic algorithm is known for this, and there are conditional lower bounds suggesting O(N^2) is close to optimal.
This is important context for the benchmark. The implementations are not leaving algorithmic performance on the table. The cumsum trick is the one structural improvement, and the inner loop that remains (subtract two values, compare against a running max, contiguous memory) is about as simple and vectorization-friendly as it gets. What we're really measuring is how fast different languages and runtimes can execute that loop.
| # | Name | File | What it does |
|---|------|------|--------------|
| 1 | Naive | implementations/naive.py | Triple nested loop, no cumsum. Exists to show why the cumsum trick matters. |
| 2 | Pure Python | implementations/pure_python.py | Cumsum with raw Python loops. Shows the cost of CPython's interpreter without any C-backed library to help. |
| 3 | Pandas | implementations/pandas_baseline.py | Python loop calling pd.Series.diff() per window. A typical Pandas-based implementation. Per-call overhead (index alignment, type checks, Series construction) dominates the runtime. |
| 4 | NumPy | implementations/numpy_loop.py | Same Python loop, but cumsum[w:] - cumsum[:-w] on raw NumPy arrays. Drops all Pandas overhead while keeping the code nearly identical. The easiest 10x you'll find. |
| 5 | Numba | implementations/numba_parallel.py | @njit compiles the double loop to machine code. No per-iteration Python overhead at all. |
| 6 | Rust | rust/src/lib.rs | Same algorithm in Rust, exposed to Python via PyO3. Iterator-based inner loop helps the compiler auto-vectorize. |
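The Numba variant is worth sketching, because the entire speedup comes from compiling the double loop. A hedged sketch (function name and the plain-Python fallback are mine, not the repo's code):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # run as plain (slow) Python if Numba isn't installed
    def njit(func):
        return func

@njit
def mean_max_numba(power):
    # Same cumsum algorithm; Numba compiles both loops to machine code.
    n = power.shape[0]
    cumsum = np.zeros(n + 1)
    for i in range(n):
        cumsum[i + 1] = cumsum[i] + power[i]
    result = np.zeros(n + 1)
    for w in range(1, n + 1):
        best = 0.0
        for start in range(n - w + 1):
            s = cumsum[start + w] - cumsum[start]
            if s > best:
                best = s
        result[w] = best / w
    return result
```

Note the explicit inner loop: with @njit, looping element by element is as fast as slicing, and it avoids allocating a temporary array per window size.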
The benchmark above is a single-threaded, exact-result comparison. It isolates the effect of language and library choice. The repo also includes approaches that change the rules.
Parallelism. Each window size is independent, so the outer loop parallelizes trivially. Rust with Rayon reaches 7.6ms (150x vs Pandas). Any approach could be parallelized in principle, but Python's GIL limits this to compiled backends. The parallel variant is included as mean_max_parallel in Rust.
Sparse windows. You don't need to compute every single window size. The mean-max curve changes fast at short durations (a 5-second sprint looks very different from a 6-second one) but barely moves at long durations (the best 59-minute and 60-minute averages are almost identical). A pragmatic, domain-driven approach: compute every second up to 3 minutes, every 5 seconds up to 10 minutes, and progressively coarser from there. This gives about 700 windows instead of 21,000, with the interpolation error staying below 0.4%. The repo includes a NumPy version (5.0ms) and Rust versions in both single-threaded (2.4ms) and parallel (0.5ms) form. Combining sparse windows with parallelism is trivial because the two optimizations are orthogonal: sparse reduces the number of windows, parallelism distributes them across cores. No added implementation complexity beyond what each technique requires on its own.
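The window grid itself is trivial to generate. A sketch with illustrative breakpoints (the repo's exact grid may differ, so the window count below is for this grid, not the repo's ~700):

```python
def sparse_windows(n):
    # Dense grid at short durations, progressively coarser at long ones.
    windows = []
    w = 1
    while w <= n:
        windows.append(w)
        if w < 180:        # every second up to 3 minutes
            w += 1
        elif w < 600:      # every 5 seconds up to 10 minutes
            w += 5
        elif w < 3600:     # every 15 seconds up to 1 hour
            w += 15
        else:              # every minute beyond that
            w += 60
    return windows

print(len(sparse_windows(21_666)))  # 765 window sizes instead of 21,666
```

Any of the exact implementations can then iterate over this list instead of `range(1, n + 1)`.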
At 0.5-2ms per activity, analyses that weren't practical before become trivial:
- Fatigue-dependent power curves. Compute mean-max after for example every 500kJ of energy expenditure to see how the power curve degrades over a ride.
- Interactive filtering. Recompute the curve on every UI interaction (date range, sub-sport, route segment, training zone) without perceptible delay.
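To illustrate the first use case, here is a sketch of fatigue-segmented mean-max. Both function names and the segmentation scheme are hypothetical, not the repo's API; in practice the inner `mean_max` would be one of the fast implementations above.

```python
import numpy as np

def mean_max(power):
    # Exact cumsum-based mean-max; returns the best mean per window size.
    cs = np.concatenate(([0.0], np.cumsum(power)))
    n = len(power)
    out = np.zeros(n + 1)
    for w in range(1, n + 1):
        out[w] = (cs[w:] - cs[:-w]).max() / w
    return out

def fatigue_curves(power, kj_step=500):
    # Recompute mean-max from each energy threshold onward, to see how the
    # power curve degrades with accumulated work.
    power = np.asarray(power, dtype=np.float64)
    energy_kj = np.cumsum(power) / 1000.0  # 1 Hz samples: joules = watts * seconds
    curves = {}
    kj = 0.0
    while kj < energy_kj[-1]:
        start = int(np.searchsorted(energy_kj, kj))
        curves[int(kj)] = mean_max(power[start:])
        kj += kj_step
    return curves
```

For a 6-hour ride this calls `mean_max` a dozen or so times, which is exactly the workload where per-call milliseconds start to matter.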
# Install dependencies
uv pip install -e .
# Build the Rust extension
uv pip install -e ./rust/
# Run all benchmarks (includes parallel and sparse variants)
uv run pytest -v --benchmark-sort=mean

The naive and pure Python benchmarks run on small subsets (500 and 1,000 samples). All other benchmarks run on the full 21,666-sample dataset and verify correctness against a reference generated from the Pandas baseline.
.
├── implementations/ # All Python implementations
│ ├── naive.py # Triple nested loop
│ ├── pure_python.py # Cumsum + raw loops
│ ├── pandas_baseline.py # pd.Series.diff per window
│ ├── numpy_loop.py # NumPy array ops
│ ├── numba_parallel.py # @njit compiled
│ └── sparse_windows.py # Approximate: domain-driven sparse windows
├── rust/ # Rust implementations
│ ├── src/lib.rs # mean_max, mean_max_parallel, mean_max_sparse, mean_max_sparse_parallel
│ ├── Cargo.toml
│ └── pyproject.toml
├── benchmarks/ # pytest-benchmark tests
│ ├── conftest.py # Loads power.csv, generates reference
│ ├── bench_naive.py
│ ├── bench_pure_python.py
│ ├── bench_pandas.py
│ ├── bench_numpy.py
│ ├── bench_numba.py
│ ├── bench_rust.py
│ ├── bench_rust_parallel.py
│ ├── bench_rust_sparse.py
│ ├── bench_rust_sparse_parallel.py
│ └── bench_sparse.py
├── power.csv # Test data: 6-hour ride, 21,666 samples
├── pyproject.toml
└── LICENSE
MIT