Replace dispatch heuristics with profiling-derived thresholds; fix AVX-512/MSVC build by OldCrow · Pull Request #15 · OldCrow/libstats

OldCrow · 2026-04-12T04:03:07Z

Summary

Replace the heuristic-based dispatch threshold system with a constexpr lookup table derived from empirical profiling data across four SIMD architectures. Fix AVX-512 build and test infrastructure for MSVC/Windows.

Closes #14

Dispatch threshold rework

The previous dispatcher had three compounding problems (documented in #14):

Base thresholds reused unrelated statistical constants as placeholders
refineWithCapabilities() inverted the SIMD-efficiency adjustment
Distribution-specific divisors compounded the error

This PR replaces both AdaptiveThresholdCalculator and PerformanceDispatcher::Thresholds with a single constexpr lookup table indexed by (SIMDLevel, DistributionType, OperationType). Each entry is derived directly from profiling CSV data collected on four machines:

NEON (Apple M1, 8C/8T)
AVX (Ivy Bridge i7-3820QM, 4P/8T)
AVX2 (Kaby Lake i7-7820HQ, 4P/8T)
AVX-512 (Zen 4 Ryzen 7 7445HS, 6P/12T)

Key design decisions:

SCALAR→VECTORIZED boundary: architecture-independent (SIMD width + overhead)
VECTORIZED→PARALLEL boundary: from the lookup table (shifts by 4+ orders of magnitude across SIMD levels)
PARALLEL vs WORK_STEALING: runtime logic, not in the table (depends on workload irregularity and core count)
Beta CDF gets SIZE_MAX sentinel on all architectures (no SIMD path — scalar continued-fraction)
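
The design decisions above can be sketched as a small constexpr table plus a lookup. This is a minimal illustration only: the reduced enums, the `kParallelMin` name, the `simd_min` parameter, and every number in the table are placeholders assumed for the example, not the contents of dispatch_thresholds.h.

```cpp
#include <cstddef>
#include <cstdint>

// Reduced, hypothetical enums; the real table spans 4 SIMD levels,
// 9 distributions and 3 operations.
enum class SIMDLevel : std::size_t { NEON, AVX, AVX2, AVX512, Count };
enum class DistributionType : std::size_t { Gaussian, Beta, Count };
enum class OperationType : std::size_t { PDF, LOG_PDF, CDF, Count };
enum class Strategy { SCALAR, VECTORIZED, PARALLEL };

constexpr std::size_t kSimdCount = static_cast<std::size_t>(SIMDLevel::Count);
constexpr std::size_t kDistCount = static_cast<std::size_t>(DistributionType::Count);
constexpr std::size_t kOpCount = static_cast<std::size_t>(OperationType::Count);
constexpr std::size_t kNever = SIZE_MAX;  // sentinel: parallel never pays

// Placeholder thresholds for illustration, NOT measured values.
constexpr std::size_t kParallelMin[kSimdCount][kDistCount][kOpCount] = {
    /* NEON    */ {{16, 16, 64}, {kNever, kNever, kNever}},
    /* AVX     */ {{16, 16, 64}, {kNever, kNever, kNever}},
    /* AVX2    */ {{8, 16, 64}, {kNever, kNever, kNever}},
    /* AVX-512 */ {{50000, 50000, 100000}, {kNever, kNever, kNever}},
};
static_assert(sizeof(kParallelMin) / sizeof(std::size_t) ==
                  kSimdCount * kDistCount * kOpCount,
              "one entry per (SIMD level, distribution, operation)");

constexpr std::size_t parallelThreshold(SIMDLevel s, DistributionType d,
                                        OperationType o) {
    return kParallelMin[static_cast<std::size_t>(s)]
                       [static_cast<std::size_t>(d)]
                       [static_cast<std::size_t>(o)];
}

// SCALAR below simd_min, VECTORIZED below the table entry, otherwise a
// threaded strategy. The PARALLEL vs WORK_STEALING choice is left to
// runtime logic and is not modelled here.
constexpr Strategy selectStrategy(std::size_t batch, std::size_t simd_min,
                                  SIMDLevel s, DistributionType d,
                                  OperationType o) {
    return batch < simd_min ? Strategy::SCALAR
         : batch < parallelThreshold(s, d, o) ? Strategy::VECTORIZED
                                              : Strategy::PARALLEL;
}
```

Because the sentinel is SIZE_MAX, any realistic batch size compares below it, so sentinel entries stay on the VECTORIZED (or scalar) path without a special case.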

AVX-512/MSVC fixes

CMake: Global SIMD flag now follows SIMDDetection results — uses /arch:AVX512 when detected instead of hardcoding /arch:AVX2 for all MSVC x64 builds
Test validators: AMD branch gains __AVX512F__ tier; SIMD and parallel thresholds adjusted for wide-vector characteristics (vectorized-to-parallel crossovers at 50K–100K vs 8–64 on narrower architectures)
Student-T MLE: NU_MAX = 1000 upper bound prevents Newton-Raphson divergence; initial moment estimate clamped to 100
Thread safety test: vector<bool> → vector<int> (bit-packing caused concurrent writes to race on the same byte)
Threading overhead bound: 100μs → 500μs (Windows SRWLOCK + scheduler jitter)
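
The thread-safety fix can be illustrated with a minimal sketch. The function name `runWorkers` is hypothetical and is not the test's actual code. Each thread writes only to its own index: with `std::vector<int>` every element is a distinct memory location, whereas `std::vector<bool>` packs elements into shared storage, so the same pattern would be a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Launches num_threads workers; worker i sets flag i and nothing else.
// std::vector<int> gives each flag its own memory location, so concurrent
// writes to different indices do not race. std::vector<bool> would not.
inline std::vector<int> runWorkers(std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<int> done(num_threads, 0);  // sized up front, never resized
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i) {
        workers.emplace_back([&done, i] { done[i] = 1; });
    }
    for (auto& w : workers) {
        w.join();  // join before reading the flags
    }
    return done;
}
```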

Other improvements

Beta batch paths: hoist lgamma prefix out of the loop via new beta_i(x, a, b, log_beta_prefix) overload
Canonical strategy profiler (tools/strategy_profile) replaces ad-hoc benchmarks as the primary dispatcher-threshold evidence source
BUILD_SYSTEM_GUIDE.md updated for AVX-512 (not server-only, MSVC flag behavior, source file listing)
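
The hoisting idea can be sketched as follows. The PR applies it to beta_i (the incomplete beta function); this example uses the simpler Beta log-PDF purely as an illustration, and the function names are hypothetical. The point is that lgamma(a+b) - lgamma(a) - lgamma(b) depends only on the shape parameters, so it can be computed once per batch instead of once per element.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Log of the normalising constant 1/B(a, b); depends only on (a, b).
inline double logBetaPrefix(double a, double b) {
    assert(a > 0.0 && b > 0.0);
    return std::lgamma(a + b) - std::lgamma(a) - std::lgamma(b);
}

// Per-element kernel that takes the precomputed prefix (hypothetical
// overload shape, mirroring the extra log-prefix argument in the PR).
inline double betaLogPdf(double x, double a, double b, double log_prefix) {
    return log_prefix + (a - 1.0) * std::log(x) + (b - 1.0) * std::log1p(-x);
}

// Batch path: three lgamma calls per batch instead of three per element.
inline std::vector<double> betaLogPdfBatch(const std::vector<double>& xs,
                                           double a, double b) {
    const double log_prefix = logBetaPrefix(a, b);  // hoisted out of the loop
    std::vector<double> out;
    out.reserve(xs.size());
    for (double x : xs) {
        out.push_back(betaLogPdf(x, a, b, log_prefix));
    }
    return out;
}
```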

Profiling data

Four profile bundles in data/profiles/dispatcher/, each containing 1728 measurements (9 distributions × 3 operations × 16 batch sizes × 4 strategies).

Testing

45/45 tests pass on AVX-512 (Zen 4, Windows/MSVC), NEON (M1, macOS/Clang), AVX (Ivy Bridge, macOS/Clang), AVX2 (Kaby Lake, macOS/Clang)
SIMD correctness: 54/54 pass on all architectures
pylibstats: 168/168 tests, 85/85 SciPy comparison checks, 8–44× speedups over SciPy

Warp conversations: session 1, session 2

Co-Authored-By: Oz <oz-agent@warp.dev>

Root cause: selectStrategyBasedOnCapabilities unconditionally preferred WORK_STEALING for all distributions at batch_size >= work_stealing_min (8000 for AVX-512). Profiling showed WORK_STEALING is 3-4x slower than PARALLEL for regular workloads (Gaussian, Exponential, Uniform, etc.) due to load-balancing overhead on uniform-cost elements.

Changes:
- WORK_STEALING now limited to distributions with irregular per-element cost (Poisson, Gamma, ChiSquared) where load balancing helps
- AVX-512 base parallel_min raised from 500 to 5000 (wider SIMD keeps VECTORIZED competitive to higher batch sizes)
- AVX-512 work_stealing_min raised from 8000 to 50000

Impact (pylibstats benchmark, Gaussian N=100k):
PDF: 0.2x vs SciPy -> 2.6x
CDF: 0.4x -> 3.3x

Add gaussian_strategy_profile tool for per-strategy timing investigation.

Co-Authored-By: Oz <oz-agent@warp.dev>

- inverse_t_cdf: raise normal-approximation cutoff from df>100 to df>1000 (consistent with t_cdf); Newton-Raphson now refines the estimate for intermediate degrees of freedom (fixes TTableValues)
- performance_dispatcher: use 2x base parallel_min for simple distributions (Uniform, Discrete) so threading overhead cannot undercut low per-element cost; extend createForSIMDLevel to all 9 distribution types; clamp per-distribution thresholds after refineWithCapabilities (fixes DistributionSpecificThresholds)
- test_gamma_enhanced: use absolute time bound when traditional_time ≤ 2μs instead of ratio check, since dispatch overhead dominates at sub-microsecond scalar times (fixes AutoDispatchAssessment)

Co-Authored-By: Oz <oz-agent@warp.dev>
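
The test_gamma_enhanced change can be sketched as a small validator. Only the 2μs cutoff comes from the commit message; the function name, the `max_ratio` parameter, and the `absolute_bound` parameter are assumptions made for illustration.

```cpp
#include <chrono>

using Micros = std::chrono::duration<double, std::micro>;

// Cutoff from the commit message: at or below this baseline the ratio
// check is dominated by dispatch overhead and stops being meaningful.
constexpr Micros kTinyBaseline{2.0};

// Accepts the auto-dispatched timing if it is within max_ratio of the
// traditional timing, or, for tiny baselines, within an absolute bound.
// max_ratio and absolute_bound are hypothetical tuning parameters.
constexpr bool dispatchTimeAcceptable(Micros traditional, Micros dispatched,
                                      double max_ratio, Micros absolute_bound) {
    return (traditional <= kTinyBaseline)
               ? dispatched <= absolute_bound
               : dispatched.count() <= max_ratio * traditional.count();
}
```

With a 0.5μs baseline and a 5μs dispatched time, a 2x ratio check would fail even though the absolute cost is negligible; the absolute bound accepts it.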

- test_performance_dispatcher: use batch_size=3 (below all simd_min thresholds) instead of 5, which is above the NEON/SSE2 simd_min of 4
- validators.h: lower parallel validation thresholds for small-medium batch sizes where threading overhead dominates on architectures with efficient vectorization
- test_discrete_enhanced: replace hardcoded parallel speedup assertions with architecture-aware adaptive validators consistent with other enhanced test suites

NOTE: the dispatch thresholds in performance_dispatcher.cpp have known issues across all architectures: inverted SIMD-efficiency refinement logic and non-empirical base thresholds cause PARALLEL to be selected at batch sizes where VECTORIZED is faster. This needs a dedicated follow-up using gaussian_strategy_profile on each target architecture.

Co-Authored-By: Oz <oz-agent@warp.dev>

Add strategy_profile tool that benchmarks forced SCALAR/VECTORIZED/PARALLEL/WORK_STEALING across all 9 distributions, 3 operations (PDF/LogPDF/CDF), and 16 batch sizes. Produces canonical CSV for dispatcher threshold tuning.

Update capture_dispatcher_profile.sh and summarize_dispatcher_profile.py to use the new profiler as the canonical data source. Capture script now copies bundles into tracked data/profiles/dispatcher/ so profiles from all target architectures accumulate in version control.

Remove 4 superseded tools:
- gaussian_strategy_profile.cpp (strict subset of strategy_profile)
- parallel_threshold_benchmark.cpp (strict subset of strategy_profile)
- performance_dispatcher_tool.cpp (simulation-based, not measured data)
- learning_analyzer.cpp (simulation-based, not measured data)

Include NEON profiling bundle from Mac Mini M1 (1728 measurements).

Update tool references in CMakeLists.txt, README.md, WARP.md, PROJECT_CONCEPT.md, and tools/README.md.

Co-Authored-By: Oz <oz-agent@warp.dev>

Captured on Intel Core i7-7820HQ @ 2.90GHz (darwin-x86_64, AVX2, 4C/8T). 9 distributions × 3 operations × 16 batch sizes × 4 strategies = 1,728 measurements.

Key crossover findings:
- Beta CDF, Gaussian CDF, StudentT CDF, Uniform PDF/LogPDF: VECTORIZED wins at all measured batch sizes (parallel never pays)
- Poisson PDF: parallel threshold 2,000; LogPDF: 50,000
- StudentT PDF/LogPDF: parallel threshold 100,000
- Most others (ChiSquared, Exponential, Gamma, Gaussian PDF/LogPDF): parallel crossover at batch size 8-16

Co-Authored-By: Oz <oz-agent@warp.dev>

Remove the Dev (-O1) NEON profile and add a Release (-O3) capture. Release profiles are canonical for threshold tuning since they reflect production optimization levels. Strategy win distribution shifts with -O3: WORK_STEALING gains at PARALLEL's expense as per-element cost decreases and threading overhead becomes relatively more significant. Co-Authored-By: Oz <oz-agent@warp.dev>

Canonical strategy_profile run on Ivy Bridge with Release build (Clang -O3). 9 distributions x 3 operations x 4 strategies x 16 batch sizes. Needs bundling via capture_dispatcher_profile.sh for full metadata. Co-Authored-By: Oz <oz-agent@warp.dev>

Full capture_dispatcher_profile.sh bundle for Ivy Bridge i7-3820QM (SSE2+AVX). Release build, Clang -O3. 9 distributions x 3 ops x 4 strategies x 16 sizes. Includes metadata, summary, crossovers, best strategies, and logs. Co-Authored-By: Oz <oz-agent@warp.dev>

Captured on ASUS TUF A16 with AMD Ryzen 7 7445HS (6P/12T, Zen 4). Release build, MSVC 17 2022, AVX-512 enabled. Completes four-architecture profiling dataset: NEON, AVX, AVX2, AVX-512. Co-Authored-By: Oz <oz-agent@warp.dev>

Beta CDF: hoist lgamma(a+b)-lgamma(a)-lgamma(b) prefix out of the per-element loop in getCumulativeProbabilityBatchUnsafeImpl. Add beta_i(x, a, b, log_prefix) overload to skip redundant lgamma calls. Fix PARALLEL/WS lambdas to acquire cache_mutex_ once instead of per element and use the hoisted prefix with direct beta_i calls.

Beta PDF/LogPDF: replace per-element scalar std::log/std::exp in PARALLEL/WS lambdas with chunked (1024-element) delegation to the SIMD batch impl (vector_log/vector_exp). Parallel tasks now get SIMD within each chunk instead of losing vectorization entirely.

Also update vector_beta_i to hoist the lgamma prefix.

33/33 correctness tests pass, 54/54 SIMD verification tests pass.

Co-Authored-By: Oz <oz-agent@warp.dev>
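
The chunked delegation described for Beta PDF/LogPDF can be sketched as the body of one parallel task. Only the 1024-element chunk size comes from the commit message; `expBatch` is a scalar stand-in for a SIMD batch kernel and `processRangeChunked` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kChunkSize = 1024;  // chunk size from the commit message

// Stand-in for a SIMD batch kernel (hypothetical); a real build would call
// the library's vectorized exp here instead of a scalar loop.
inline void expBatch(const double* in, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(in[i]);
    }
}

// Work done by one parallel task over [begin, end): hands the batch kernel
// contiguous chunks rather than calling std::exp once per element.
// Returns the number of chunks processed.
inline std::size_t processRangeChunked(const std::vector<double>& in,
                                       std::vector<double>& out,
                                       std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= in.size() && out.size() == in.size());
    std::size_t chunks = 0;
    for (std::size_t i = begin; i < end; i += kChunkSize) {
        const std::size_t n = std::min(kChunkSize, end - i);  // last chunk may be short
        expBatch(in.data() + i, out.data() + i, n);
        ++chunks;
    }
    return chunks;
}
```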

…able

Add dispatch_thresholds.h with per-(SIMDLevel, DistributionType, OperationType) parallel thresholds derived from four-architecture Release profiling data (NEON, AVX, AVX2, AVX-512). Each of the 108 entries traces directly to a profiling bundle in data/profiles/dispatcher/.

Add OperationType enum (PDF, LOG_PDF, CDF, BATCH_FIT) and new selectStrategy() method that replaces the old complexity-based dispatch with a three-line table lookup: SCALAR below simd_min, VECTORIZED below parallel threshold, then PARALLEL or WORK_STEALING based on platform.

P-vs-WS selection uses platform detection: macOS/GCD+HT prefers WORK_STEALING, Windows/TP prefers PARALLEL, macOS/GCD without HT prefers PARALLEL. Based on four-architecture profiling showing threading backend as the dominant factor (not distribution type).

Beta gets SIZE_MAX on all architectures: vectorization is not viable for any Beta operation due to the serial incomplete-beta continued fraction.

Update all 24 autoDispatch() call sites across 8 distributions to pass OperationType instead of ComputationComplexity. Update 6 parallelBatchFit call sites to use dispatch_table::BATCH_FIT_MIN directly.

Old threshold systems (AdaptiveThresholdCalculator, Thresholds struct with refineWithCapabilities) retained for now as deprecated; removal follows in a separate commit.

33/33 correctness tests pass. 54/54 SIMD verification tests pass. 36/36 parallel correctness tests pass.

Co-Authored-By: Oz <oz-agent@warp.dev>
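
The P-vs-WS rule stated in this commit message can be written as a tiny decision function. The enum and function names are hypothetical, and the fallback for backends the message does not mention is an assumption of this sketch.

```cpp
// Hypothetical names; only the three listed platform cases come from the
// commit message.
enum class ThreadBackend { GCD, WindowsThreadPool, Other };
enum class ThreadedStrategy { PARALLEL, WORK_STEALING };

constexpr ThreadedStrategy chooseThreadedStrategy(ThreadBackend backend,
                                                  bool has_hyperthreading) {
    // macOS/GCD with hyperthreading prefers WORK_STEALING; Windows thread
    // pool and GCD without hyperthreading prefer PARALLEL. Defaulting any
    // other backend to PARALLEL is an assumption, not stated in the source.
    return (backend == ThreadBackend::GCD && has_hyperthreading)
               ? ThreadedStrategy::WORK_STEALING
               : ThreadedStrategy::PARALLEL;
}
```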

…rategy Update tests, tools, and examples to use selectStrategy() with OperationType instead of selectOptimalStrategy() with ComputationComplexity. No deprecated API calls remain in the codebase. Co-Authored-By: Oz <oz-agent@warp.dev>

Delete parallel_thresholds.h/.cpp (AdaptiveThresholdCalculator), distribution_characteristics.h (empirical complexity constants), and empirical_characteristics_demo.cpp (demo tool for deleted system).

Remove deprecated selectOptimalStrategy() and selectStrategyBasedOnCapabilities() from PerformanceDispatcher. Simplify Thresholds struct population to fixed defaults (constexpr lookup table in dispatch_thresholds.h is now the authority). Replace all get_optimal_parallel_threshold() calls with get_min_elements_for_distribution_parallel().

Update docs to reflect changes.

Co-Authored-By: Oz <oz-agent@warp.dev>

Mark 'system' as [[maybe_unused]] — the constexpr threshold table replaced the runtime system-capability conditioning. Co-Authored-By: Oz <oz-agent@warp.dev>

Superseded by the bundled profile in data/profiles/dispatcher/. Co-Authored-By: Oz <oz-agent@warp.dev>

- CMake: use /arch:AVX512 globally when SIMDDetection detects AVX-512, instead of hardcoding /arch:AVX2 for all MSVC x64 builds. Ensures __AVX512F__ is defined in non-SIMD source files (validators, tests). Clang-cl path updated symmetrically (-mavx512f).
- validators.h: add AVX-512 awareness to adaptive test thresholds. AMD branch gains __AVX512F__ tier (base 2.0, Zen4 double-pumped). Complex-distribution SIMD multiplier reduced to 0.7x on AVX-512 (lgamma/factorial scalar bottlenecks limit wide-pipeline benefit). Parallel thresholds below 100K accept >= 0.1x (forced PARALLEL below the vectorized-to-parallel crossover is expected to underperform). Large-batch SIMD multiplier lowered to 1.05x (amortisation curve flattens earlier on 8-wide processing).
- student_t.cpp: add NU_MAX=1000 upper bound and clamp initial moment estimate to 100, preventing Newton-Raphson divergence in the flat tail of the score function when sample excess kurtosis is near zero.
- test_student_t_enhanced.cpp: increase MLE sample size from 500 to 2000 for stable convergence across stdlib implementations (MSVC vs libc++ produce different samples from identical mt19937 seeds).
- test_system_capabilities.cpp: replace vector<bool> with vector<int> in ThreadSafety test (bit-packing caused concurrent writes to different indices to race on the same byte). Widen threading overhead bound from 100us to 500us (Windows scheduler jitter).

Co-Authored-By: Oz <oz-agent@warp.dev>
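
The Student-T clamping can be sketched under one stated assumption: that the initial moment estimate is the kurtosis-based estimator. For ν > 4 the excess kurtosis of a Student-t distribution is 6/(ν − 4), so ν = 4 + 6/κ, which grows without bound as the sample excess kurtosis κ approaches zero. The constants 1000 and 100 come from the commit message; the function names and the choice of estimator are assumptions, and the Newton-Raphson iteration itself is not shown.

```cpp
#include <algorithm>

constexpr double NU_MAX = 1000.0;      // upper bound from the commit message
constexpr double NU_INIT_MAX = 100.0;  // clamp on the initial moment estimate

// Assumed estimator: invert excess kurtosis 6/(nu - 4) to get
// nu = 4 + 6/kappa. Near-zero or non-positive kappa (near-normal data)
// would give a huge or undefined nu, so the estimate is clamped.
constexpr double initialNuFromKurtosis(double excess_kurtosis) {
    return (excess_kurtosis > 0.0)
               ? std::min(4.0 + 6.0 / excess_kurtosis, NU_INIT_MAX)
               : NU_INIT_MAX;
}

// Applied to each Newton-Raphson iterate so a step taken in the flat tail
// of the score function cannot run away.
constexpr double clampNu(double nu) {
    return std::min(nu, NU_MAX);
}
```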

…r, source list

- Correct 'server CPUs' to 'Intel Skylake-X+, AMD Zen4+' for AVX-512
- Add AVX-512 detection output example
- Document that Windows global SIMD flag follows SIMDDetection results
- Add /arch:AVX512 to MSVC manual flags example
- Add simd_avx512.cpp and simd_dispatch.cpp to source file listing

Co-Authored-By: Oz <oz-agent@warp.dev>

Leftover from the old complexity-loop in displayDispatcherConfiguration() that was simplified during the dispatch rework. Co-Authored-By: Oz <oz-agent@warp.dev>

OldCrow and others added 18 commits April 11, 2026 22:13

Fix unused parameter warning in createForSIMDLevel

d8e31ea

Remove stale strategy_profile_results.csv from project root

7a68b94

Remove unused MAX_COMPLEXITY_DEMOS constant from system_inspector

9089f2c

OldCrow changed the title ~~Fix AVX-512 dispatch issue and stabilize cross-architecture tests~~ Replace dispatch heuristics with profiling-derived thresholds; fix AVX-512/MSVC build Apr 12, 2026

OldCrow merged commit d48530d into main Apr 12, 2026
26 checks passed

OldCrow deleted the investigate-gaussian-avx512-perf branch April 12, 2026 21:39

OldCrow merged 18 commits into main from investigate-gaussian-avx512-perf
