Replace dispatch heuristics with profiling-derived thresholds; fix AVX-512/MSVC build #15

Merged
OldCrow merged 18 commits into main from investigate-gaussian-avx512-perf
Apr 12, 2026

Conversation

@OldCrow
Owner

@OldCrow OldCrow commented Apr 12, 2026

Summary

Replace the heuristic-based dispatch threshold system with a constexpr lookup table derived from empirical profiling data across four SIMD architectures. Fix AVX-512 build and test infrastructure for MSVC/Windows.

Closes #14

Dispatch threshold rework

The previous dispatcher had three compounding problems (documented in #14):

  1. Base thresholds reused unrelated statistical constants as placeholders
  2. refineWithCapabilities() inverted the SIMD-efficiency adjustment
  3. Distribution-specific divisors compounded the error

This PR replaces both AdaptiveThresholdCalculator and PerformanceDispatcher::Thresholds with a single constexpr lookup table indexed by (SIMDLevel, DistributionType, OperationType). Each entry is derived directly from profiling CSV data collected on four machines:

  • NEON (Apple M1, 8C/8T)
  • AVX (Ivy Bridge i7-3820QM, 4P/8T)
  • AVX2 (Kaby Lake i7-7820HQ, 4P/8T)
  • AVX-512 (Zen 4 Ryzen 7 7445HS, 6P/12T)

Key design decisions:

  • SCALAR→VECTORIZED boundary: architecture-independent (SIMD width + overhead)
  • VECTORIZED→PARALLEL boundary: from the lookup table (shifts by 4+ orders of magnitude across SIMD levels)
  • PARALLEL vs WORK_STEALING: runtime logic, not in the table (depends on workload irregularity and core count)
  • Beta CDF gets SIZE_MAX sentinel on all architectures (no SIMD path — scalar continued-fraction)
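
The design decisions above can be sketched as a small constexpr table plus a three-step lookup. This is only an illustration under stated assumptions: the enum members, the table name `kParallelMin`, the `simd_min` and `prefer_work_stealing` parameters, and every numeric value are placeholders rather than the contents of dispatch_thresholds.h, and only two of the nine distributions are shown.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-ins for the library's enums; the real table covers 9 distributions.
enum class SIMDLevel : std::size_t { NEON, AVX, AVX2, AVX512, Count };
enum class DistributionType : std::size_t { Gaussian, Beta, Count };
enum class OperationType : std::size_t { PDF, LOG_PDF, CDF, Count };
enum class Strategy { SCALAR, VECTORIZED, PARALLEL, WORK_STEALING };

// Sentinel meaning "never switch to a threaded strategy" (SIZE_MAX).
constexpr std::size_t NEVER = std::numeric_limits<std::size_t>::max();

template <class E>
constexpr std::size_t idx(E e) { return static_cast<std::size_t>(e); }

// VECTORIZED->PARALLEL thresholds indexed by [SIMD level][distribution][operation].
// All numbers are placeholders chosen to resemble the ranges quoted in this PR.
constexpr std::size_t kParallelMin[idx(SIMDLevel::Count)]
                                  [idx(DistributionType::Count)]
                                  [idx(OperationType::Count)] = {
    /* NEON   */ {{16, 16, 64},           {32, 32, NEVER}},
    /* AVX    */ {{16, 16, 64},           {32, 32, NEVER}},
    /* AVX2   */ {{16, 16, NEVER},        {32, 32, NEVER}},
    /* AVX512 */ {{50000, 50000, 100000}, {50000, 50000, NEVER}},
};

// Three-step lookup: SCALAR below simd_min, VECTORIZED below the table entry,
// otherwise a threaded strategy chosen by the caller-supplied runtime preference.
constexpr Strategy selectStrategy(SIMDLevel level, DistributionType dist,
                                  OperationType op, std::size_t batch_size,
                                  std::size_t simd_min, bool prefer_work_stealing) {
    if (batch_size < simd_min) return Strategy::SCALAR;
    if (batch_size < kParallelMin[idx(level)][idx(dist)][idx(op)])
        return Strategy::VECTORIZED;
    return prefer_work_stealing ? Strategy::WORK_STEALING : Strategy::PARALLEL;
}

// The sentinel keeps Beta CDF out of the threaded strategies at any realistic size.
static_assert(selectStrategy(SIMDLevel::AVX512, DistributionType::Beta,
                             OperationType::CDF, 10000000, 8, false)
                  == Strategy::VECTORIZED,
              "SIZE_MAX entry must never select PARALLEL");
```

Keeping the PARALLEL versus WORK_STEALING choice outside the table, as a runtime input, mirrors the third design decision: the table only answers where the vectorized-to-parallel boundary sits.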

AVX-512/MSVC fixes

  • CMake: Global SIMD flag now follows SIMDDetection results — uses /arch:AVX512 when detected instead of hardcoding /arch:AVX2 for all MSVC x64 builds
  • Test validators: AMD branch gains __AVX512F__ tier; SIMD and parallel thresholds adjusted for wide-vector characteristics (vectorized-to-parallel crossovers at 50K–100K vs 8–64 on narrower architectures)
  • Student-T MLE: NU_MAX = 1000 upper bound prevents Newton-Raphson divergence; initial moment estimate clamped to 100
  • Thread safety test: vector<bool> → vector<int> (bit-packing caused concurrent writes to different indices to race on the same byte)
  • Threading overhead bound: 100μs → 500μs (Windows SRWLOCK + scheduler jitter)
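
The Student-T MLE bullet above can be illustrated with a minimal sketch of the two clamps. `NU_MAX = 1000` and the initial-estimate cap of 100 come from this PR; the function names, the lower bound `NU_MIN`, and the use of the method-of-moments relation (excess kurtosis of a Student-t is 6/(ν − 4) for ν > 4) as the starting estimate are assumptions made for illustration, not the code in student_t.cpp.

```cpp
#include <algorithm>
#include <cassert>

constexpr double NU_MAX = 1000.0;          // upper bound from this PR
constexpr double NU_INITIAL_MAX = 100.0;   // cap on the moment-based starting value
constexpr double NU_MIN = 1.0;             // hypothetical lower bound for the sketch

// Method-of-moments start: excess kurtosis k = 6 / (nu - 4), so nu = 4 + 6 / k.
// When k is near zero (or not positive) the estimate blows up, so it is capped.
inline double initialNuFromKurtosis(double excess_kurtosis) {
    if (!(excess_kurtosis > 0.0)) return NU_INITIAL_MAX;  // also catches NaN
    return std::min(4.0 + 6.0 / excess_kurtosis, NU_INITIAL_MAX);
}

// One Newton-Raphson update on the score function, clamped so a flat score
// (tiny derivative) cannot push nu toward infinity.
inline double clampedNewtonStep(double nu, double score, double score_derivative) {
    assert(score_derivative != 0.0);
    return std::clamp(nu - score / score_derivative, NU_MIN, NU_MAX);
}
```

A caller would iterate `clampedNewtonStep` from `initialNuFromKurtosis(k)` until the step size falls below a tolerance; hitting `NU_MAX` then signals data that is effectively Gaussian.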

Other improvements

  • Beta batch paths: hoist lgamma prefix out of the loop via new beta_i(x, a, b, log_beta_prefix) overload
  • Canonical strategy profiler (tools/strategy_profile) replaces ad-hoc benchmarks as the primary dispatcher-threshold evidence source
  • BUILD_SYSTEM_GUIDE.md updated for AVX-512 (not server-only, MSVC flag behavior, source file listing)
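
The lgamma hoisting in the Beta batch paths follows a general pattern: the normalising term depends only on (a, b), so it can be computed once per batch. The sketch below applies that pattern to the Beta log-PDF because it fits in a few lines; the PR applies it to `beta_i`, whose body is not shown here. The names `logBetaPrefix`, `betaLogPdf`, and `betaLogPdfBatch` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// log(1 / B(a, b)) = lgamma(a + b) - lgamma(a) - lgamma(b); depends only on (a, b).
inline double logBetaPrefix(double a, double b) {
    return std::lgamma(a + b) - std::lgamma(a) - std::lgamma(b);
}

// Per-element work that takes the precomputed prefix instead of recomputing it.
inline double betaLogPdf(double x, double a, double b, double log_beta_prefix) {
    return log_beta_prefix + (a - 1.0) * std::log(x) + (b - 1.0) * std::log1p(-x);
}

// Batch path: three lgamma calls in total instead of three per element.
inline std::vector<double> betaLogPdfBatch(const std::vector<double>& xs,
                                           double a, double b) {
    assert(a > 0.0 && b > 0.0);
    const double log_beta_prefix = logBetaPrefix(a, b);  // hoisted out of the loop
    std::vector<double> out(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) {
        out[i] = betaLogPdf(xs[i], a, b, log_beta_prefix);
    }
    return out;
}
```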

Profiling data

Four profile bundles in data/profiles/dispatcher/, each containing 1728 measurements (9 distributions × 3 operations × 16 batch sizes × 4 strategies).

Testing

  • 45/45 tests pass on AVX-512 (Zen 4, Windows/MSVC), NEON (M1, macOS/Clang), AVX (Ivy Bridge, macOS/Clang), AVX2 (Kaby Lake, macOS/Clang)
  • SIMD correctness: 54/54 pass on all architectures
  • pylibstats: 168/168 tests, 85/85 SciPy comparison checks, 8–44× speedups over SciPy

Warp conversations: session 1, session 2

Co-Authored-By: Oz <oz-agent@warp.dev>

OldCrow and others added 18 commits April 11, 2026 22:13
Root cause: selectStrategyBasedOnCapabilities unconditionally preferred
WORK_STEALING for all distributions at batch_size >= work_stealing_min
(8000 for AVX-512). Profiling showed WORK_STEALING is 3-4x slower than
PARALLEL for regular workloads (Gaussian, Exponential, Uniform, etc.)
due to load-balancing overhead on uniform-cost elements.

Changes:
- WORK_STEALING now limited to distributions with irregular per-element
  cost (Poisson, Gamma, ChiSquared) where load balancing helps
- AVX-512 base parallel_min raised from 500 to 5000 (wider SIMD keeps
  VECTORIZED competitive to higher batch sizes)
- AVX-512 work_stealing_min raised from 8000 to 50000

Impact (pylibstats benchmark, Gaussian N=100k):
  PDF: 0.2x vs SciPy -> 2.6x
  CDF: 0.4x -> 3.3x

Add gaussian_strategy_profile tool for per-strategy timing investigation.

Co-Authored-By: Oz <oz-agent@warp.dev>
- inverse_t_cdf: raise normal-approximation cutoff from df>100 to
  df>1000 (consistent with t_cdf); Newton-Raphson now refines the
  estimate for intermediate degrees of freedom (fixes TTableValues)
- performance_dispatcher: use 2x base parallel_min for simple
  distributions (Uniform, Discrete) so threading overhead cannot
  undercut low per-element cost; extend createForSIMDLevel to all
  9 distribution types; clamp per-distribution thresholds after
  refineWithCapabilities (fixes DistributionSpecificThresholds)
- test_gamma_enhanced: use absolute time bound when traditional_time
  ≤ 2μs instead of ratio check — dispatch overhead dominates at
  sub-microsecond scalar times (fixes AutoDispatchAssessment)
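
The test_gamma_enhanced change above amounts to switching the acceptance rule when the baseline is too small for a ratio to be meaningful. A minimal sketch, assuming a hypothetical helper `dispatchTimeAcceptable` with hypothetical `max_ratio` and `abs_bound_us` parameters (only the 2μs cutoff comes from the commit message):

```cpp
#include <cassert>

// Cutoff from the commit message: at or below this baseline, use an absolute bound.
constexpr double kTinyBaselineUs = 2.0;

// Returns true when the auto-dispatched time is acceptable.
// max_ratio and abs_bound_us are illustrative knobs, not values from the test suite.
inline bool dispatchTimeAcceptable(double traditional_us, double auto_us,
                                   double max_ratio, double abs_bound_us) {
    assert(max_ratio > 0.0 && abs_bound_us > 0.0);
    if (traditional_us <= kTinyBaselineUs) {
        // Dispatch overhead dominates here, so a ratio would be mostly noise.
        return auto_us <= abs_bound_us;
    }
    return auto_us <= max_ratio * traditional_us;
}
```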

Co-Authored-By: Oz <oz-agent@warp.dev>
- test_performance_dispatcher: use batch_size=3 (below all simd_min
  thresholds) instead of 5 which is above NEON/SSE2 simd_min of 4
- validators.h: lower parallel validation thresholds for small-medium
  batch sizes where threading overhead dominates on architectures with
  efficient vectorization
- test_discrete_enhanced: replace hardcoded parallel speedup assertions
  with architecture-aware adaptive validators consistent with other
  enhanced test suites

NOTE: the dispatch thresholds in performance_dispatcher.cpp have known
issues across all architectures — inverted SIMD-efficiency refinement
logic and non-empirical base thresholds cause PARALLEL to be selected
at batch sizes where VECTORIZED is faster. This needs a dedicated
follow-up using gaussian_strategy_profile on each target architecture.

Co-Authored-By: Oz <oz-agent@warp.dev>
Add strategy_profile tool that benchmarks forced SCALAR/VECTORIZED/PARALLEL/
WORK_STEALING across all 9 distributions, 3 operations (PDF/LogPDF/CDF), and
16 batch sizes. Produces canonical CSV for dispatcher threshold tuning.

Update capture_dispatcher_profile.sh and summarize_dispatcher_profile.py to
use the new profiler as the canonical data source. Capture script now copies
bundles into tracked data/profiles/dispatcher/ so profiles from all target
architectures accumulate in version control.

Remove 4 superseded tools:
- gaussian_strategy_profile.cpp (strict subset of strategy_profile)
- parallel_threshold_benchmark.cpp (strict subset of strategy_profile)
- performance_dispatcher_tool.cpp (simulation-based, not measured data)
- learning_analyzer.cpp (simulation-based, not measured data)

Include NEON profiling bundle from Mac Mini M1 (1728 measurements).
Update tool references in CMakeLists.txt, README.md, WARP.md,
PROJECT_CONCEPT.md, and tools/README.md.

Co-Authored-By: Oz <oz-agent@warp.dev>
Captured on Intel Core i7-7820HQ @ 2.90GHz (darwin-x86_64, AVX2, 4C/8T).
9 distributions × 3 operations × 16 batch sizes × 4 strategies = 1,728 measurements.

Key crossover findings:
- Beta CDF, Gaussian CDF, StudentT CDF, Uniform PDF/LogPDF: VECTORIZED
  wins at all measured batch sizes (parallel never pays)
- Poisson PDF: parallel threshold 2,000; LogPDF: 50,000
- StudentT PDF/LogPDF: parallel threshold 100,000
- Most others (ChiSquared, Exponential, Gamma, Gaussian PDF/LogPDF):
  parallel crossover at batch size 8-16
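
One way to read a crossover out of profile rows like these is sketched below. It assumes a definition that the commit message does not spell out: the crossover is the smallest measured batch size from which PARALLEL beats VECTORIZED at that size and at every larger measured size, and `SIZE_MAX` when parallel never pays. The `Sample` struct and `parallelCrossover` function are hypothetical and do not reflect summarize_dispatcher_profile.py.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One profile row per batch size, with timings for the two competing strategies.
struct Sample {
    std::size_t batch_size;
    double vectorized_ns;
    double parallel_ns;
};

// Expects samples sorted by ascending batch_size.
// Walks from the largest size downward and stops at the first size where
// PARALLEL is not faster, so a one-off win at a small size is ignored.
inline std::size_t parallelCrossover(const std::vector<Sample>& samples) {
    std::size_t crossover = SIZE_MAX;  // "parallel never pays"
    for (auto it = samples.rbegin(); it != samples.rend(); ++it) {
        assert(it + 1 == samples.rend() || (it + 1)->batch_size < it->batch_size);
        if (it->parallel_ns < it->vectorized_ns) {
            crossover = it->batch_size;
        } else {
            break;
        }
    }
    return crossover;
}
```

Requiring the win to persist through the largest measured size is a conservative choice: it avoids setting a threshold from a single noisy row.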

Co-Authored-By: Oz <oz-agent@warp.dev>
Remove the Dev (-O1) NEON profile and add a Release (-O3) capture.
Release profiles are canonical for threshold tuning since they reflect
production optimization levels. Strategy win distribution shifts with
-O3: WORK_STEALING gains at PARALLEL's expense as per-element cost
decreases and threading overhead becomes relatively more significant.

Co-Authored-By: Oz <oz-agent@warp.dev>
Canonical strategy_profile run on Ivy Bridge with Release build (Clang -O3).
9 distributions x 3 operations x 4 strategies x 16 batch sizes.
Needs bundling via capture_dispatcher_profile.sh for full metadata.

Co-Authored-By: Oz <oz-agent@warp.dev>
Full capture_dispatcher_profile.sh bundle for Ivy Bridge i7-3820QM (SSE2+AVX).
Release build, Clang -O3. 9 distributions x 3 ops x 4 strategies x 16 sizes.
Includes metadata, summary, crossovers, best strategies, and logs.

Co-Authored-By: Oz <oz-agent@warp.dev>
Captured on ASUS TUF A16 with AMD Ryzen 7 7445HS (6P/12T, Zen 4).
Release build, MSVC 17 2022, AVX-512 enabled.
Completes four-architecture profiling dataset: NEON, AVX, AVX2, AVX-512.

Co-Authored-By: Oz <oz-agent@warp.dev>
Beta CDF: hoist lgamma(a+b)-lgamma(a)-lgamma(b) prefix out of the
per-element loop in getCumulativeProbabilityBatchUnsafeImpl. Add
beta_i(x, a, b, log_prefix) overload to skip redundant lgamma calls.
Fix PARALLEL/WS lambdas to acquire cache_mutex_ once instead of per
element and use the hoisted prefix with direct beta_i calls.

Beta PDF/LogPDF: replace per-element scalar std::log/std::exp in
PARALLEL/WS lambdas with chunked (1024-element) delegation to the
SIMD batch impl (vector_log/vector_exp). Parallel tasks now get SIMD
within each chunk instead of losing vectorization entirely.
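
The chunked delegation described above can be sketched as follows. The 1024-element chunk size comes from the commit message; everything else is an assumption for illustration: `batchLog` is a plain loop standing in for the SIMD batch implementation (`vector_log`), and raw `std::thread` workers with a strided chunk assignment stand in for the library's PARALLEL/WORK_STEALING machinery.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kChunkSize = 1024;  // chunk size named in the commit message

// Stand-in for the SIMD batch kernel: processes a contiguous span in one call.
inline void batchLog(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::log(in[i]);
}

// Each worker handles whole chunks and hands every chunk to the batch kernel,
// so vectorization is kept inside each parallel task.
inline void parallelChunkedLog(const std::vector<double>& in,
                               std::vector<double>& out, unsigned num_threads) {
    assert(in.size() == out.size() && num_threads > 0);
    const std::size_t n = in.size();
    const std::size_t num_chunks = (n + kChunkSize - 1) / kChunkSize;
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&in, &out, n, num_chunks, num_threads, t] {
            // Strided assignment: worker t takes chunks t, t + T, t + 2T, ...
            for (std::size_t c = t; c < num_chunks; c += num_threads) {
                const std::size_t begin = c * kChunkSize;
                const std::size_t len = std::min(kChunkSize, n - begin);
                batchLog(in.data() + begin, out.data() + begin, len);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Because chunks never overlap, workers write to disjoint ranges of `out` and need no locking.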

Also update vector_beta_i to hoist the lgamma prefix.

33/33 correctness tests pass, 54/54 SIMD verification tests pass.

Co-Authored-By: Oz <oz-agent@warp.dev>
…able

Add dispatch_thresholds.h with per-(SIMDLevel, DistributionType, OperationType)
parallel thresholds derived from four-architecture Release profiling data
(NEON, AVX, AVX2, AVX-512). Each of the 108 entries traces directly to a
profiling bundle in data/profiles/dispatcher/.

Add OperationType enum (PDF, LOG_PDF, CDF, BATCH_FIT) and new
selectStrategy() method that replaces the old complexity-based dispatch with
a three-line table lookup: SCALAR below simd_min, VECTORIZED below parallel
threshold, then PARALLEL or WORK_STEALING based on platform.

P-vs-WS selection uses platform detection: macOS/GCD+HT prefers
WORK_STEALING, Windows/TP prefers PARALLEL, macOS/GCD without HT prefers
PARALLEL. Based on four-architecture profiling showing threading backend
as the dominant factor (not distribution type).
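
The platform rule in the paragraph above reduces to a small decision function. The sketch below encodes only the three cases the commit message lists; the enum names, the function name, and the fallback to PARALLEL for any other backend are assumptions.

```cpp
// Hypothetical identifiers for the threading backends named in the commit message.
enum class ThreadingBackend { GCD, WindowsThreadPool, Other };
enum class ThreadedStrategy { PARALLEL, WORK_STEALING };

// macOS/GCD with hyperthreading prefers WORK_STEALING; Windows thread pool and
// GCD without hyperthreading prefer PARALLEL. "Other" falling back to PARALLEL
// is an assumption of this sketch.
constexpr ThreadedStrategy pickThreadedStrategy(ThreadingBackend backend,
                                                bool has_hyperthreading) {
    if (backend == ThreadingBackend::GCD && has_hyperthreading) {
        return ThreadedStrategy::WORK_STEALING;
    }
    return ThreadedStrategy::PARALLEL;
}

static_assert(pickThreadedStrategy(ThreadingBackend::GCD, true)
                  == ThreadedStrategy::WORK_STEALING,
              "GCD with HT prefers work stealing");
```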

Beta gets SIZE_MAX on all architectures — vectorization is not viable for
any Beta operation due to the serial incomplete-beta continued fraction.

Update all 24 autoDispatch() call sites across 8 distributions to pass
OperationType instead of ComputationComplexity. Update 6 parallelBatchFit
call sites to use dispatch_table::BATCH_FIT_MIN directly.

Old threshold systems (AdaptiveThresholdCalculator, Thresholds struct with
refineWithCapabilities) retained for now as deprecated — removal follows
in a separate commit.

33/33 correctness tests pass. 54/54 SIMD verification tests pass.
36/36 parallel correctness tests pass.

Co-Authored-By: Oz <oz-agent@warp.dev>
…rategy

Update tests, tools, and examples to use selectStrategy() with
OperationType instead of selectOptimalStrategy() with ComputationComplexity.
No deprecated API calls remain in the codebase.

Co-Authored-By: Oz <oz-agent@warp.dev>
Delete parallel_thresholds.h/.cpp (AdaptiveThresholdCalculator),
distribution_characteristics.h (empirical complexity constants), and
empirical_characteristics_demo.cpp (demo tool for deleted system).

Remove deprecated selectOptimalStrategy() and selectStrategyBasedOnCapabilities()
from PerformanceDispatcher. Simplify Thresholds struct population to fixed
defaults (constexpr lookup table in dispatch_thresholds.h is now the authority).

Replace all get_optimal_parallel_threshold() calls with
get_min_elements_for_distribution_parallel(). Update docs to reflect changes.

Co-Authored-By: Oz <oz-agent@warp.dev>
Mark 'system' as [[maybe_unused]] — the constexpr threshold table
replaced the runtime system-capability conditioning.

Co-Authored-By: Oz <oz-agent@warp.dev>
Superseded by the bundled profile in data/profiles/dispatcher/.

Co-Authored-By: Oz <oz-agent@warp.dev>
- CMake: use /arch:AVX512 globally when SIMDDetection detects AVX-512,
  instead of hardcoding /arch:AVX2 for all MSVC x64 builds.  Ensures
  __AVX512F__ is defined in non-SIMD source files (validators, tests).
  Clang-cl path updated symmetrically (-mavx512f).

- validators.h: add AVX-512 awareness to adaptive test thresholds.
  AMD branch gains __AVX512F__ tier (base 2.0, Zen4 double-pumped).
  Complex-distribution SIMD multiplier reduced to 0.7x on AVX-512
  (lgamma/factorial scalar bottlenecks limit wide-pipeline benefit).
  Parallel thresholds below 100K accept >= 0.1x (forced PARALLEL below
  the vectorized-to-parallel crossover is expected to underperform).
  Large-batch SIMD multiplier lowered to 1.05x (amortisation curve
  flattens earlier on 8-wide processing).

- student_t.cpp: add NU_MAX=1000 upper bound and clamp initial moment
  estimate to 100, preventing Newton-Raphson divergence in the flat
  tail of the score function when sample excess kurtosis is near zero.

- test_student_t_enhanced.cpp: increase MLE sample size from 500 to
  2000 for stable convergence across stdlib implementations (MSVC vs
  libc++ produce different samples from identical mt19937 seeds).

- test_system_capabilities.cpp: replace vector<bool> with vector<int>
  in ThreadSafety test (bit-packing caused concurrent writes to
  different indices to race on the same byte).  Widen threading
  overhead bound from 100us to 500us (Windows scheduler jitter).

Co-Authored-By: Oz <oz-agent@warp.dev>
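
The vector<bool> fix in the commit above rests on a language rule: `std::vector<bool>` is a space-optimised specialisation whose elements may share storage, so the standard's guarantee that distinct container elements can be modified concurrently does not cover it. A minimal sketch of the safe pattern, using a hypothetical helper `markFromThreads` rather than the actual ThreadSafety test:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread sets exactly one flag. With std::vector<int> every element is a
// distinct memory location, so these writes do not race. With std::vector<bool>
// neighbouring flags can live in the same byte and the same code would be a data race.
inline std::vector<int> markFromThreads(std::size_t n) {
    std::vector<int> done(n, 0);
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        threads.emplace_back([&done, i] { done[i] = 1; });
    }
    for (auto& t : threads) t.join();  // join gives the reader a happens-before edge
    return done;
}
```

`std::vector<char>` or an array of `std::atomic<bool>` would work equally well; `int` matches the replacement the PR chose.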
…r, source list

- Correct 'server CPUs' to 'Intel Skylake-X+, AMD Zen4+' for AVX-512
- Add AVX-512 detection output example
- Document that Windows global SIMD flag follows SIMDDetection results
- Add /arch:AVX512 to MSVC manual flags example
- Add simd_avx512.cpp and simd_dispatch.cpp to source file listing

Co-Authored-By: Oz <oz-agent@warp.dev>
Leftover from the old complexity-loop in displayDispatcherConfiguration()
that was simplified during the dispatch rework.

Co-Authored-By: Oz <oz-agent@warp.dev>
@OldCrow OldCrow changed the title from "Fix AVX-512 dispatch issue and stabilize cross-architecture tests" to "Replace dispatch heuristics with profiling-derived thresholds; fix AVX-512/MSVC build" on Apr 12, 2026
@OldCrow OldCrow merged commit d48530d into main Apr 12, 2026
26 checks passed
@OldCrow OldCrow deleted the investigate-gaussian-avx512-perf branch April 12, 2026 21:39
Development

Successfully merging this pull request may close these issues.

Rework dispatch thresholds: replace heuristics with empirical crossover data
