Conversation
Co-Authored-By: Oz <oz-agent@warp.dev>
Add max-reduce and adaptive recurrence paths in calculators, wire CMake experiment flags, include contour/hotspot profiling tools, and set adaptive policy to pairwise for N<=2 and max-reduce for N>=3. Co-Authored-By: Oz <oz-agent@warp.dev>
…ase gate script
- fb_recurrence_policy.h: FbRecurrenceMode enum, FbHostProfile struct, selectFbRecurrenceMode(), isFbBoundaryPoint(), toString() helpers.
- forward_backward_calculator.h/.cpp: wire resolveRecurrenceMode() to the policy module; add setRecurrenceModeOverride/getRecurrenceModeOverride/getRecurrenceMode; implement A2 (policy-driven dispatch), A3 (boundary probe + thread-local LRU cache + hysteresis), A4 (env var + instance override).
- test_fb_mode_parity.cpp (D1): forces Pairwise vs MaxReduce on identical (hmm, obs) pairs and asserts logP, logAlpha, logBeta agree within 1e-9 absolute / 1e-12 relative; covers N=2..8 discrete and N=4/8/16 continuous.
- test_bw_parity.cpp (D2): one-step Baum-Welch determinism, EM monotonicity, and parameter-invariant checks.
- tests/CMakeLists.txt: register both new tests; remove duplicate test_bw_parity entry.
- scripts/phase_gate.ps1 (D3): runs all 7 correctness-gate tests, reports PASS/FAIL per target, exits non-zero on any failure or missing binary.
Phase gate: 7/7 PASS (MSVC Release, Ryzen / Windows x86_64).
Co-Authored-By: Oz <oz-agent@warp.dev>
- Remove FbIsaClass enum, FbHostProfile::isa field, and the ISA detection block in makeFbHostProfile(). The ISA class was never consulted by any policy decision; the only architecture-specific branch (arm64 Clang) already used a direct preprocessor check.
- Remove the simd_platform.h include from fb_recurrence_policy.h. It existed solely to populate the unused ISA field and pulled SIMD intrinsic headers (<intrin.h>, <immintrin.h>) into every TU that included forward_backward_calculator.h.
- Remove toString(FbIsaClass) helper (dead with the enum).
- Update makeFbHostProfile() doc comment and selectFbRecurrenceMode() @param tag to reflect that compiler identity is the sole policy axis.
- Remove spurious mutable qualifier from logEmitBuf_ and logEmitByTime_ in forward_backward_calculator.h; these fields are only written in non-const compute() and must not be mutable.
Phase gate: 7/7 PASS.
Co-Authored-By: Oz <oz-agent@warp.dev>
LogSpaceOps was unreferenced outside its own translation unit (no calculator, trainer, test, tool, or benchmark called it). Delete the headers and source, drop the source from LIBHMM_SOURCES in CMakeLists.txt, drop the include from libhmm.h, and remove three stale doc references in simd_platform.h (the other two referenced calculators removed in v3.0.0-alpha). Co-Authored-By: Oz <oz-agent@warp.dev>
…ction

Replace the per-compiler/per-runtime probe machinery in fb_recurrence_policy.h with a minimal ISA-based static threshold (FbRecurrenceMode enum + selectFbRecurrenceMode). Drop the unused probeRecurrenceMode method and the LIBHMM_FB_MODE env-var reference from forward_backward_calculator.h.

In forward_backward_calculator.cpp, route the max-reduce path through the new TranscendentalKernels scalar backend so AVX2/NEON implementations can swap in without further structural changes.

Co-Authored-By: Oz <oz-agent@warp.dev>
Switch baum_welch_trainer.cpp to time-major emission layout (logEmitByTime[t*N+j], stride-1 in the xi inner loop) and a flat transposed transition buffer for contiguous access. Detect zero-mass transitions once per train() call and route dense models through a branch-free xi inner loop using TranscendentalKernels::accumulate_exp_sum2_bias; sparse models keep the existing zero-skip path. Co-Authored-By: Oz <oz-agent@warp.dev>
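The time-major layout above can be sketched as follows. This is a minimal illustration, not the trainer's actual code; the buffer and function names (logEmitByState, toTimeMajor, sumAtTime) are hypothetical stand-ins for the real logEmitByTime machinery.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose state-major emissions [j*T + t] into time-major [t*N + j]
// once per train() call, so the xi inner loop over target state j is
// stride-1. Names are illustrative, not the library's API.
std::vector<double> toTimeMajor(const std::vector<double>& logEmitByState,
                                std::size_t N, std::size_t T) {
    std::vector<double> logEmitByTime(T * N);
    for (std::size_t t = 0; t < T; ++t)
        for (std::size_t j = 0; j < N; ++j)
            logEmitByTime[t * N + j] = logEmitByState[j * T + t];
    return logEmitByTime;
}

// Xi-style inner loop: for a fixed t, the j-loop now reads contiguous
// memory, which is what lets a SIMD kernel consume it directly.
double sumAtTime(const std::vector<double>& logEmitByTime,
                 std::size_t N, std::size_t t) {
    double acc = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        acc += logEmitByTime[t * N + j];  // stride-1 access
    return acc;
}
```

The same trick motivates the flat transposed transition buffer: both operands of the xi inner loop become contiguous.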
tools/bw_hotspot.cpp breaks Baum-Welch runtime into FB, gamma accumulation, and dense/sparse xi accumulation, mirroring the production split. Useful for tracking xi exp-call dominance and validating SIMD changes. Register in tools/CMakeLists.txt. Co-Authored-By: Oz <oz-agent@warp.dev>
…per scripts

Capture focus n2-8 sweep CSVs (pairwise + max-reduce + adaptive_static_v1), per-compiler ryzen-windows reruns (msvc/clangcl/mingw), HMMLib 9-pass median-gate dumps, the 26-Apr rollback patch, and the helper python scripts (run_focus_compiler_sweep.py, run_focus_single_compiler.py, run_hmmlib_passes.py, summarize_windows_compiler_rerun.py). .log files remain gitignored.

Co-Authored-By: Oz <oz-agent@warp.dev>
…tors-phase1

# Conflicts:
#   CMakeLists.txt
#   include/libhmm/libhmm.h
#   include/libhmm/platform/simd_platform.h
…NEON)
Move kernel bodies from the header to a new src/performance/transcendental_kernels.cpp.
Add four-tier ISA cascade for each of the five kernels, mirroring the existing
Tier-2 distribution kernels (gaussian_distribution.cpp, exponential_distribution.cpp).
Vector exp(double) design:
- Range reduction: x = N*ln2 + r, |r| <= ln2/2, Cephes split ln2 = ln2_hi + ln2_lo.
- Polynomial: 13-term Horner of sum(r^k/k!). Truncation < 7.4e-17 at r = ln2/2.
- 2^N: (n + 1023) << 52 via integer bit manipulation.
- Underflow guard: clamp x >= constants::probability::MIN_LOG_PROBABILITY (-700);
mask output lanes to 0 for inputs at or below that threshold. Handles LOG_ZERO
= -inf sentinels branch-free. No +inf / NaN handling (callers guarantee finite
or LOG_ZERO inputs).
- AVX path is AVX-1 compatible (Ivy Bridge / Catalina): 2^N step uses two 128-bit
halves to avoid AVX2-only _mm256_cvtepi32_epi64.
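The design bullets above correspond to the following scalar sketch. It is a one-lane illustration of the algorithm, not the kernel itself (the real code applies the same steps per lane with intrinsics), and the constant names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of the vector exp(double): Cephes-split range reduction,
// 13-term Horner polynomial for e^r, 2^n by integer bit manipulation,
// and the branch-free-style underflow guard at -700.
double exp_sketch(double x) {
    const double MIN_LOG = -700.0;               // MIN_LOG_PROBABILITY
    if (x <= MIN_LOG) return 0.0;                // masked lane; covers -inf
    const double LOG2E  = 1.4426950408889634;
    const double LN2_HI = 6.93145751953125e-1;   // Cephes split of ln 2
    const double LN2_LO = 1.42860682030941723212e-6;
    const double n = std::nearbyint(x * LOG2E);  // x = n*ln2 + r
    double r = x - n * LN2_HI;                   // |r| <= ln2/2
    r -= n * LN2_LO;
    // 13-term Horner evaluation of sum_{k=0}^{12} r^k / k!
    const double c[] = {1.0/479001600.0, 1.0/39916800.0, 1.0/3628800.0,
                        1.0/362880.0, 1.0/40320.0, 1.0/5040.0, 1.0/720.0,
                        1.0/120.0, 1.0/24.0, 1.0/6.0, 0.5, 1.0, 1.0};
    double p = c[0];
    for (int k = 1; k < 13; ++k) p = p * r + c[k];
    // 2^n: place (n + 1023) into the IEEE-754 exponent field.
    const std::int64_t ebits = (static_cast<std::int64_t>(n) + 1023) << 52;
    double scale;
    std::memcpy(&scale, &ebits, sizeof scale);
    return p * scale;
}
```

In the vector version the `if` becomes a lane mask, and on AVX-1 the exponent-field construction is done in two 128-bit halves as noted above.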
Kernel cascade pattern:
AVX-512 8-wide __m512d (uses _mm512_fmadd_pd)
AVX/AVX2 4-wide __m256d (compiler fuses FMA under AVX2)
SSE2 2-wide __m128d
NEON 2-wide float64x2_t (uses vfmaq_f64)
scalar tail and portable fallback
Each ISA block advances i and the outer scalar variable (maxVal / sum) is
seeded from the previous block's result so the cascade handles any size
without early returns (avoids MSVC C4702 unreachable-code warnings).
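The cascade shape can be sketched as below, using a max reduction as the example. The macro names follow this commit's text; the vector loop bodies are elided to comments, so only the seeding/fall-through structure is shown, not a real kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Each conditionally compiled tier consumes as many full vectors as fit,
// advances i, and folds its partial result into maxVal; the scalar tail
// finishes from wherever i stopped. No early returns, so every macro
// configuration falls through cleanly (avoids MSVC C4702).
double reduce_max_cascade_sketch(const double* x, std::size_t n) {
    std::size_t i = 0;
    double maxVal = -std::numeric_limits<double>::infinity();
#if defined(LIBHMM_HAS_AVX512)
    // 8-wide __m512d loop: for (; i + 8 <= n; i += 8) { fold into maxVal }
#endif
#if defined(LIBHMM_HAS_AVX)
    // 4-wide __m256d loop resumes at the current i, seeded with maxVal.
#endif
    // Scalar tail and portable fallback.
    for (; i < n; ++i) maxVal = std::max(maxVal, x[i]);
    return maxVal;
}
```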
Build system: add src/performance/transcendental_kernels.cpp,
src/calculators/forward_backward_calculator.cpp, and
src/training/baum_welch_trainer.cpp to LIBHMM_SIMD_SOURCES so
LIBHMM_BEST_SIMD_FLAGS and the LIBHMM_HAS_* macros fire correctly.
Drop the dead TranscendentalBackend enum (zero callers; outlier vs project
convention). Active-ISA reporting uses simd::feature_string() from simd_platform.h.
Sparse-path BW xi loop stays scalar (masking non-zero transitions in a SIMD
loop costs more than it saves for sparse models; comment added at call site).
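The sparse zero-skip pattern kept scalar above looks roughly like this. The function and parameter names are hypothetical, not the trainer's real xi loop.

```cpp
#include <cassert>
#include <cmath>

// Zero-mass transitions (log prob == -inf) are skipped before the exp()
// call. A masked SIMD loop would still pay vector-exp cost on the dead
// lanes, which is why the scalar branch wins for sparse models.
double xi_sparse_sketch(const double* logA, const double* contrib, int n) {
    double acc = 0.0;
    for (int j = 0; j < n; ++j) {
        if (std::isinf(logA[j]) && logA[j] < 0.0)
            continue;  // zero-mass transition: skip entirely
        acc += std::exp(logA[j] + contrib[j]);
    }
    return acc;
}
```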
New test: tests/performance/test_transcendental_kernels.cpp.
Five kernels x N in {1,2,3,4,7,8,15,16,31,32,64}; std::exp inline reference
(not kernel scalar variant); tolerance 1e-12 rel / 1e-15 abs.
Performance: bw_hotspot (Zen 4 / Windows / MSVC, AVX-512, median 8 runs):
                     BEFORE (scalar)   AFTER (AVX-512)
  FB N=8  T=1000        0.725 ms          0.533 ms   (1.36x)
  FB N=16 T=500         1.585 ms          0.574 ms   (2.76x)
  FB N=32 T=2000       32.743 ms          5.772 ms   (5.67x)
  Xi N=16 T=500         0.758 ms          0.658 ms   (1.15x)
  Xi N=32 T=2000       18.169 ms         17.700 ms   (1.03x)
FB max-reduce is the primary beneficiary (5.7x at N=32). BW xi accumulation
shows modest improvement at these sizes -- the dense-xi inner loop is
memory-bandwidth-bound at N>=16, not compute-bound.
Tests: 36/36 ctest + 7/7 phase-gate passing on Windows/MSVC Release.
simd_inspection: 6/6 smoke tests pass; vector width = 8 lanes (AVX-512).
Co-Authored-By: Oz <oz-agent@warp.dev>
…_sweep tool

Measured via new fb_crossover_sweep tool (ForwardBackwardCalculator with setRecurrenceModeOverride, Zen 4 / MSVC / AVX-512, T=1000, median 8 runs):
  N=2:  MaxReduce 2.1x slower -- Pairwise wins
  N=3:  MaxReduce 1.1x slower -- Pairwise wins
  N=4:  MaxReduce 1.7x faster -- crossover
  N=8:  MaxReduce 5.0x faster
  N=32: MaxReduce 15x faster

The pre-SIMD threshold (N>=5 on x86) was set before TranscendentalKernels had SIMD backends. With AVX-512/AVX/SSE2 kernels now active, MaxReduce breaks even at N=4 on this hardware. The arm64 threshold was already N>=4 (unchanged). Since both arms now return the same thing, collapse the #if defined(__aarch64__) block to a single unconditional threshold. If future NEON vs x86 measurements diverge, the split can be reintroduced.

tools/fb_crossover_sweep.cpp: new diagnostic tool that times Pairwise vs MaxReduce via the production calculator at N=2..64 and marks the mode that selectFbRecurrenceMode() currently picks for each N.

36/36 ctest + 7/7 phase-gate passing.
Co-Authored-By: Oz <oz-agent@warp.dev>
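After this collapse, the selection logic reduces to a single comparison. A minimal sketch, assuming the enum and function names from the commit text (the real function may carry additional parameters):

```cpp
#include <cassert>

// Collapsed threshold: x86 (post-SIMD, N>=4) and arm64 (already N>=4)
// now agree, so no #if defined(__aarch64__) split is needed.
enum class FbRecurrenceMode { Pairwise, MaxReduce };

FbRecurrenceMode selectFbRecurrenceMode(int numStates) {
    // Measured crossover: MaxReduce breaks even at N=4 (Zen 4 / AVX-512;
    // N=2 and N=3 still favour Pairwise).
    return numStates >= 4 ? FbRecurrenceMode::MaxReduce
                          : FbRecurrenceMode::Pairwise;
}
```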
HMMLib is header-only and has zero Boost includes in its headers. The find_package(Boost) guard in the HMMLIB_READY check was a historical cargo-cult that broke on CMake 3.30+ (CMP0167 NEW routes find_package(Boost) to Boost's own config files, which are absent in a headers-only extraction). Replace the Boost check with a direct existence test for HMMlib/hmm.hpp. Simplify enable_hmmlib() to add only the HMMLib directory as a SYSTEM include (the Boost_INCLUDE_DIRS and Boost_LIBRARIES lines were also dead). Co-Authored-By: Oz <oz-agent@warp.dev>
Add simd_kernels_internal.h: internal header that provides inline log_pd_* and exp_pd_* helpers (AVX-512 / AVX / SSE2 / NEON) for use by Tier-2 distribution TUs compiled with LIBHMM_BEST_SIMD_FLAGS. Also extend transcendental_kernels.cpp with the same log_pd_* helpers (kept separately because TranscendentalKernels lives in its own TU).

Vector log design:
- Range reduction: extract IEEE754 exponent e and mantissa m (x = 2^e*m, m in [1,2)). If m > sqrt(2): e += 1, m *= 0.5 (m in [1/sqrt(2), sqrt(2)]).
- y = (m-1)/(m+1), |y| <= 0.172.
- Polynomial: log(m) = 2y*(1 + y^2/3 + ... + y^12/13), 7-term Horner. Truncation at |y|_max is adequate for distribution callers (1e-10 abs).
- Reconstruction: log(x) = e*LN2_HI + e*LN2_LO + log(m) (Cephes split).
- Guard: x <= 0 lanes -> -inf; no NaN (callers validate x > 0).
- AVX-512 int64->double via scalar store (no AVX-512 DQ required).
- AVX path stays AVX-1 compatible (same 128-bit half trick as exp_pd_avx).

LogNormalDistribution::getBatchLogProbabilities: Tier 2 free function lognormal_logpdf_batch. Per element: lx = log(x); res = S*(lx-mu)^2 - lx - C, where S = negHalfSigmaSquaredInv_, C = logNormalizationConstant_.

ParetoDistribution::getBatchLogProbabilities: Tier 2 free function pareto_logpdf_batch. Per element: if x < xm -> -inf; else logK + kLogXm - kPlus1 * log(x). The xm guard is a vector mask (no branch in the SIMD body).

Build: log_normal_distribution.cpp and pareto_distribution.cpp were already in LIBHMM_SIMD_SOURCES; simd_kernels_internal.h is included directly and fires when those TUs have LIBHMM_BEST_SIMD_FLAGS active.

36/36 ctest + 7/7 phase-gate passing on Windows/MSVC Release.
Co-Authored-By: Oz <oz-agent@warp.dev>
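The vector log design above corresponds to this one-lane scalar sketch. Constant names and the function name are illustrative; the real helpers are the lane-wise log_pd_* intrinsics.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of the vector log(double): exponent/mantissa split,
// y = (m-1)/(m+1) series, split-LN2 reconstruction, x <= 0 guard.
double log_sketch(double x) {
    if (x <= 0.0) return -INFINITY;  // guarded lanes; callers pass x > 0
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // IEEE-754 binary64: biased exponent in bits 52..62.
    int e = static_cast<int>((bits >> 52) & 0x7FF) - 1023;
    const std::uint64_t mbits =
        (bits & 0xFFFFFFFFFFFFFull) | (1023ull << 52);
    double m;
    std::memcpy(&m, &mbits, sizeof m);  // m in [1, 2)
    if (m > 1.4142135623730951) { e += 1; m *= 0.5; }  // m in [1/sqrt2, sqrt2]
    const double y  = (m - 1.0) / (m + 1.0);  // |y| <= 0.172
    const double y2 = y * y;
    // 7-term Horner of log(m) = 2y*(1 + y^2/3 + y^4/5 + ... + y^12/13)
    const double c[] = {1.0/13.0, 1.0/11.0, 1.0/9.0, 1.0/7.0,
                        1.0/5.0, 1.0/3.0, 1.0};
    double p = c[0];
    for (int k = 1; k < 7; ++k) p = p * y2 + c[k];
    const double logm = 2.0 * y * p;
    const double LN2_HI = 6.93145751953125e-1;  // Cephes split of ln 2
    const double LN2_LO = 1.42860682030941723212e-6;
    return e * LN2_HI + e * LN2_LO + logm;
}
```

The per-element distribution formulas then ride on this: e.g. the log-normal body is S*(lx-mu)^2 - lx - C with lx = log_sketch(x), computed entirely branch-free per lane.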
tests/platform/test_simd_platform.cpp verifies simd_platform.h at two levels:

Compile-time (#error): ISA hierarchy invariants -- AVX512 implies AVX and SSE2, AVX2 implies AVX, AVX implies SSE2, SSE4.1 implies SSE2, NEON and x86 macros are mutually exclusive. A broken macro combination becomes a build error rather than a silent runtime failure.

Runtime (GTest, 12 assertions): contracts on feature_string() (non-null, non-empty, string value agrees with active macros), double_vector_width() and float_vector_width() (power-of-two, float == 2*double), optimal_alignment() (power-of-two >= 8, covers one SIMD register), has_simd_support() / supports_vectorization() consistency, and compile-time constant / function agreement for DOUBLE_SIMD_WIDTH, FLOAT_SIMD_WIDTH, SIMD_ALIGNMENT.

Not compiled with LIBHMM_BEST_SIMD_FLAGS -- tests the detection infrastructure, not the intrinsics.

Also updates simd_platform.h: replaces the stale four-item consumer list with a concise description that won't drift as consumers are added.

37/37 tests pass.
Co-Authored-By: Oz <oz-agent@warp.dev>
Without this, cmake defaults to no build type (effectively -O0). At -O0 the compiler emits VZEROUPPER in the k_exp_pd_avx function prologue (before a dynamic stack-probe call for the 6336-byte stack frame) and BEFORE the __m256d ymm0 argument is saved to the stack. VZEROUPPER zeros the upper 128 bits of all YMM registers, so x[2] and x[3] are silently set to 0.0 inside the function body, producing exp(0)=1.0 instead of the correct tiny values for those lanes. At Release (-O3) the static inline helpers are inlined into their callers and no function-call ABI boundary exists where this can occur. 37/37 tests pass; simd_inspection 6/6 (AVX, width=4). Co-Authored-By: Oz <oz-agent@warp.dev>
…type requirement
WARP.md: document that configure_catalina.sh now defaults to Release, explain
the -O0 VZEROUPPER / __m256d argument corruption issue, and give the
RelWithDebInfo override for debuggable builds.
Ivy Bridge / AVX-1 (macOS Catalina, i7-3820QM) benchmark results:
fb_crossover_sweep: N>=4 MaxReduce threshold confirmed correct on 4-wide AVX.
N=2: Pairwise (ratio 1.83), N=3: tied (0.98), N=4+: MaxReduce clearly wins.
No threshold change required.
libhmm throughput (FB): peaks ~100k state-steps/ms at N=128.
vs GHMM (Gaussian continuous, Forward-Backward):
GHMM ~2.2x faster on Forward (was ~5x on Windows/Zen4 before SIMD work).
libhmm Viterbi ~5x faster than GHMM (0.066ms vs 0.373ms at T=1000).
All log-likelihoods match to <1e-10.
vs HMMLib (discrete): HMMLib ~4x faster.
The narrowed GHMM Forward gap (5x→2.2x) is consistent with the transcendental
kernel SIMD work landing on this platform. The AVX-1 (no AVX2/FMA) path was
exercised for the first time here.
Co-Authored-By: Oz <oz-agent@warp.dev>
… data

Focused N=2..8 pairwise vs MaxReduce crossover sweep from MacBook Pro 9,1 (i7-3820QM, AVX no AVX2, macOS Catalina). Confirms N>=4 MaxReduce threshold is correct on this platform.

Co-Authored-By: Oz <oz-agent@warp.dev>
Apple M1 / ARM NEON / macOS Tahoe 26.4.1 / Homebrew LLVM 22:
Tests: 37/37 pass, simd_inspection 6/6. LIBHMM_HAS_NEON=YES, vector width=2, feature_string='ARM NEON (Apple Silicon)'.
fb_crossover_sweep (NEON 2-wide):
  N=2:  Pairwise (1.70x)
  N=3:  tied (1.001)
  N=4+: MaxReduce (0.68 at N=4, 0.15 at N=32)
N>=4 threshold confirmed correct -- no change required. MaxReduce advantage grows faster on NEON than AVX (6x at N=16 vs 3.7x on Ivy Bridge).
libhmm throughput (FB): peaks ~240k state-steps/ms at N=128.
vs GHMM (Gaussian continuous): GHMM ~3.0x faster (was ~5x pre-SIMD, ~2.2x on Ivy Bridge). libhmm Viterbi faster than GHMM on all sizes. All log-likelihoods match to <1e-10.
vs HMMLib (discrete): HMMLib ~3.4x faster.
Also add previously collected M1/Tahoe crossover and HMMLib 9-pass sweep CSVs across four compiler configurations (AppleClang, GCC 15, Homebrew LLVM, reruns).
Co-Authored-By: Oz <oz-agent@warp.dev>
Intel i7-7820HQ (Kaby Lake), AVX2+FMA, macOS 13.7.8 Ventura / AppleClang:
Tests: 37/37 pass, simd_inspection 6/6. LIBHMM_HAS_AVX2=YES, LIBHMM_HAS_AVX512=-, width=4, feature='AVX2'.
No correctness issues; identical code path to Ivy Bridge AVX-1 with compiler auto-FMA fusion (vfmadd) on the Horner polynomial.
fb_crossover_sweep (AVX2 4-wide):
  N=2:  Pairwise (1.94x)
  N=3:  tied (1.011)
  N=4+: MaxReduce (0.62 at N=4)
N>=4 threshold confirmed correct -- no change required.
libhmm throughput (FB): peaks ~90k state-steps/ms at N=64.
vs HMMLib (discrete): HMMLib ~4.3x faster. No GHMM installed on this machine.
No anomalies. AVX2+FMA auto-fusion produces expected minor improvements vs AVX-1. k_exp_pd_avx validated across all three x86 ISA tiers.
Also add previously collected Kaby Lake crossover and high-N CSVs.
Co-Authored-By: Oz <oz-agent@warp.dev>
GHMM copied to ~/Development post-initial commit; added here as amendment.
vs GHMM (Gaussian continuous, Kaby Lake AVX2+FMA):
  GHMM ~2.3x faster Forward (4.9k vs 8.9k obs/ms average).
  libhmm Viterbi: 3-4x faster than GHMM (0.053ms vs 0.181ms at T=1000).
  All log-likelihoods match to <1e-10. Numerical match: YES on all sizes.
Pattern consistent across platforms:
  Zen4/AVX-512:    GHMM ~2.2x fwd   libhmm Viterbi ~5x faster
  Ivy Bridge/AVX:  GHMM ~2.2x fwd   libhmm Viterbi ~3x faster
  Kaby Lake/AVX2:  GHMM ~2.3x fwd   libhmm Viterbi ~3-4x faster
  M1/NEON:         GHMM ~3.0x fwd   libhmm Viterbi ~3x faster
Co-Authored-By: Oz <oz-agent@warp.dev>
- test_transcendental_kernels.cpp: update 'Level 8 section' reference to 'Performance Primitives section' following the test group rename.
- transcendental_kernels.cpp: add missing '// Scalar tail.' comment to accumulate_exp_sum2_bias, consistent with the other four kernels.
Co-Authored-By: Oz <oz-agent@warp.dev>
pre-commit:
- Remove trailing whitespace from Doxygen comment lines in log_normal_distribution.cpp and pareto_distribution.cpp.
- Add missing EOF newline to forward_backward_calculator.h.
- Apply clang-format to log_normal and pareto distribution files.

cppcheck:
- Add cppcheck-suppress redundantInitialization to the AVX-512 blocks in reduce_max_sum2 and reduce_max_sum3. When only LIBHMM_HAS_AVX512 is defined, maxVal=neg_inf at entry is overwritten before any read. The initialization is an intentional cascade seed for non-AVX512 tiers; the suppression documents this.

Co-Authored-By: Oz <oz-agent@warp.dev>
…ppressions

clang-format (v19.1.7 via pre-commit mirrors-clang-format): applied to all tracked C/C++ source and header files. No semantic changes -- formatting only. Register this commit in .git-blame-ignore-revs to keep git blame output clean.

cppcheck: move inline redundantInitialization suppressions onto the flagged statement line in reduce_max_sum2 and reduce_max_sum3 (was two lines above; cppcheck requires same-line or immediately-preceding-line placement).

Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
… fix

.gitattributes: change *.ps1 from eol=crlf to eol=lf. The existing eol=crlf rule caused git to check out phase_gate.ps1 as CRLF on CI (Ubuntu), which the pre-commit mixed-line-ending hook then flagged. PowerShell 7 handles LF line endings on all platforms.

transcendental_kernels.cpp: replace inline cppcheck-suppress comments (which did not work regardless of placement) with a structural fix. double maxVal is now declared without initialisation; the AVX-512 #if block sets it, and the #else branch assigns neg_inf for all non-AVX512 paths. No redundant initialisation in any configuration.

Co-Authored-By: Oz <oz-agent@warp.dev>
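The structural fix can be sketched as below. The function name is hypothetical (the real kernels are reduce_max_sum2 / reduce_max_sum3) and the vector loop body is elided; only the initialization structure that silences cppcheck is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// maxVal has no initializer at its declaration; the AVX-512 block is
// responsible for its first write, and the #else seeds every non-AVX512
// configuration. No configuration assigns maxVal twice before a read,
// so redundantInitialization cannot fire.
double reduce_max_fixed_sketch(const double* x, std::size_t n) {
    std::size_t i = 0;
    double maxVal;  // deliberately no initializer here
#if defined(LIBHMM_HAS_AVX512)
    // 8-wide loop sets maxVal from its vector reduction and advances i.
    maxVal = -std::numeric_limits<double>::infinity();
#else
    maxVal = -std::numeric_limits<double>::infinity();  // seed for lower tiers
#endif
    for (; i < n; ++i) maxVal = std::max(maxVal, x[i]);
    return maxVal;
}
```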
Co-Authored-By: Oz <oz-agent@warp.dev>
Summary

Delivers the performance work planned for perf/trainers-calculators-phase1. Validated on four platforms (Windows/MSVC/AVX-512, macOS/Kaby Lake/AVX2, macOS/Ivy Bridge/AVX-1, macOS/M1/NEON); 37/37 tests pass on all four.

Changes
SIMD transcendental kernels
src/performance/transcendental_kernels.cpp -- five inner-loop kernels (reduce_max_sum2, sum_exp_sum2_minus_max, reduce_max_sum3, sum_exp_sum3_minus_max, accumulate_exp_sum2_bias) now have full AVX-512 / AVX / SSE2 / NEON implementations. Consumed by ForwardBackwardCalculator (FB max-reduce recurrence) and BaumWelchTrainer (dense-xi accumulation). Both TUs added to LIBHMM_SIMD_SOURCES.

Vector exp helper uses a 13-term Horner polynomial with Cephes ln2 range reduction and branch-free underflow masking at MIN_LOG_PROBABILITY (= -700). AVX path is AVX-1 compatible (splits the 256-bit 2^N step into two 128-bit halves to avoid AVX2-only _mm256_cvtepi32_epi64).

Tier-2 LogNormal and Pareto
log_normal_distribution.cpp and pareto_distribution.cpp gain explicit-intrinsics getBatchLogProbabilities using a vector log helper (IEEE-754 exponent extraction, 7-term Horner, split-LN2 reconstruction, <= 5 ULP).

SIMD helper consolidation
include/libhmm/performance/simd_kernels_internal.h is the single source of truth for vector exp/log primitives. transcendental_kernels.cpp previously carried identical duplicate bodies; those are removed (-591 lines) and replaced with #include + kernels::k_* call sites.

fb_recurrence_policy.h relocation

Moved from include/libhmm/calculators/ to include/libhmm/performance/ -- it is a cross-cutting performance primitive, not a calculator-specific interface. Three consumer includes updated.

FB recurrence crossover retuning
N>=5 -> N>=4 on x86 after profiling on Zen 4 / AVX-512: MaxReduce is 1.7x faster at N=4 post-SIMD. New fb_crossover_sweep and fb_contour_sweep tools provide sweep and heatmap data.

BW hotspot profiling
bw_hotspot and hotspot_breakdown tools allow independent timing of the three BW E-step cost centres (FB, gamma accumulation, xi accumulation).

Tests

- tests/platform/test_simd_platform.cpp -- fills the Platform Capabilities group: compile-time #error ISA hierarchy invariants + 12 runtime assertions on simd_platform.h utility functions.
- tests/performance/test_transcendental_kernels.cpp -- five kernels x 11 sizes, std::exp as oracle, 1e-12 rel / 1e-15 abs tolerance.
- tests/training/test_bw_parity.cpp -- BW determinism (bit-exact) and EM monotonicity.
- tests/calculators/test_fb_mode_parity.cpp -- Pairwise vs MaxReduce log-likelihood agreement.

Documentation and test structure
- performance/PERFORMANCE_ARCHITECTURE.md updated: Tier-2 coverage, delivered recurrence-kernel SIMD, corrected LIBHMM_SIMD_SOURCES list, simd_kernels_internal.h noted.
- WARP.md: Tier-2 list, test group names.
- tests/CMakeLists.txt: numeric Level labels replaced with semantic group names; Performance Primitives reordered before Distributions to reflect dependency order.

Benchmark highlights (Zen 4 / AVX-512, T=1000)

Cross-platform data in benchmark-analysis/.

Plans
Conversation: https://app.warp.dev/conversation/4294c1ae-52ec-4582-80f7-acabb801c408
Co-Authored-By: Oz <oz-agent@warp.dev>