Conversation
Co-Authored-By: Oz <oz-agent@warp.dev>
Add max-reduce and adaptive recurrence paths in calculators, wire CMake experiment flags, include contour/hotspot profiling tools, and set adaptive policy to pairwise for N<=2 and max-reduce for N>=3. Co-Authored-By: Oz <oz-agent@warp.dev>
…ase gate script
- fb_recurrence_policy.h: FbRecurrenceMode enum, FbHostProfile struct, selectFbRecurrenceMode(), isFbBoundaryPoint(), toString() helpers.
- forward_backward_calculator.h/.cpp: wire resolveRecurrenceMode() to the policy module; add setRecurrenceModeOverride/getRecurrenceModeOverride/getRecurrenceMode; implement A2 (policy-driven dispatch), A3 (boundary probe + thread-local LRU cache + hysteresis), A4 (env var + instance override).
- test_fb_mode_parity.cpp (D1): forces Pairwise vs MaxReduce on identical (hmm, obs) pairs and asserts logP, logAlpha, logBeta agree within 1e-9 absolute / 1e-12 relative; covers N=2..8 discrete and N=4/8/16 continuous.
- test_bw_parity.cpp (D2): one-step Baum-Welch determinism, EM monotonicity, and parameter-invariant checks.
- tests/CMakeLists.txt: register both new tests; remove duplicate test_bw_parity entry.
- scripts/phase_gate.ps1 (D3): runs all 7 correctness-gate tests, reports PASS/FAIL per target, exits non-zero on any failure or missing binary.
Phase gate: 7/7 PASS (MSVC Release, Ryzen / Windows x86_64).
Co-Authored-By: Oz <oz-agent@warp.dev>
- Remove FbIsaClass enum, FbHostProfile::isa field, and the ISA detection block in makeFbHostProfile(). The ISA class was never consulted by any policy decision; the only architecture-specific branch (arm64 Clang) already used a direct preprocessor check.
- Remove the simd_platform.h include from fb_recurrence_policy.h. It existed solely to populate the unused ISA field and pulled SIMD intrinsic headers (<intrin.h>, <immintrin.h>) into every TU that included forward_backward_calculator.h.
- Remove toString(FbIsaClass) helper (dead with the enum).
- Update makeFbHostProfile() doc comment and selectFbRecurrenceMode() @param tag to reflect that compiler identity is the sole policy axis.
- Remove spurious mutable qualifier from logEmitBuf_ and logEmitByTime_ in forward_backward_calculator.h; these fields are only written in non-const compute() and must not be mutable.
Phase gate: 7/7 PASS.
Co-Authored-By: Oz <oz-agent@warp.dev>
LogSpaceOps was unreferenced outside its own translation unit (no calculator, trainer, test, tool, or benchmark called it). Delete the headers and source, drop the source from LIBHMM_SOURCES in CMakeLists.txt, drop the include from libhmm.h, and remove three stale doc references in simd_platform.h (the other two referenced calculators removed in v3.0.0-alpha). Co-Authored-By: Oz <oz-agent@warp.dev>
…ction

Replace the per-compiler/per-runtime probe machinery in fb_recurrence_policy.h with a minimal ISA-based static threshold (FbRecurrenceMode enum + selectFbRecurrenceMode). Drop the unused probeRecurrenceMode method and the LIBHMM_FB_MODE env-var reference from forward_backward_calculator.h.

In forward_backward_calculator.cpp, route the max-reduce path through the new TranscendentalKernels scalar backend so AVX2/NEON implementations can swap in without further structural changes.

Co-Authored-By: Oz <oz-agent@warp.dev>
Switch baum_welch_trainer.cpp to time-major emission layout (logEmitByTime[t*N+j], stride-1 in the xi inner loop) and a flat transposed transition buffer for contiguous access. Detect zero-mass transitions once per train() call and route dense models through a branch-free xi inner loop using TranscendentalKernels::accumulate_exp_sum2_bias; sparse models keep the existing zero-skip path. Co-Authored-By: Oz <oz-agent@warp.dev>
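The time-major layout above can be sketched as follows. This is a minimal illustration, not the trainer's actual code; the buffer and function names (logEmitByState, toTimeMajor, sumAtTime) are hypothetical stand-ins for the real logEmitByTime machinery.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose state-major emissions [j*T + t] into time-major [t*N + j]
// once per train() call, so the xi inner loop over target state j is
// stride-1. Names are illustrative, not the library's API.
std::vector<double> toTimeMajor(const std::vector<double>& logEmitByState,
                                std::size_t N, std::size_t T) {
    std::vector<double> logEmitByTime(T * N);
    for (std::size_t t = 0; t < T; ++t)
        for (std::size_t j = 0; j < N; ++j)
            logEmitByTime[t * N + j] = logEmitByState[j * T + t];
    return logEmitByTime;
}

// Xi-style inner loop: for a fixed t, the j-loop now reads contiguous
// memory, which is what lets a SIMD kernel consume it directly.
double sumAtTime(const std::vector<double>& logEmitByTime,
                 std::size_t N, std::size_t t) {
    double acc = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        acc += logEmitByTime[t * N + j];  // stride-1 access
    return acc;
}
```

The same trick motivates the flat transposed transition buffer: both operands of the xi inner loop become contiguous.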
tools/bw_hotspot.cpp breaks Baum-Welch runtime into FB, gamma accumulation, and dense/sparse xi accumulation, mirroring the production split. Useful for tracking xi exp-call dominance and validating SIMD changes. Register in tools/CMakeLists.txt. Co-Authored-By: Oz <oz-agent@warp.dev>
…per scripts

Capture focus n2-8 sweep CSVs (pairwise + max-reduce + adaptive_static_v1), per-compiler ryzen-windows reruns (msvc/clangcl/mingw), HMMLib 9-pass median-gate dumps, the 26-Apr rollback patch, and the helper python scripts (run_focus_compiler_sweep.py, run_focus_single_compiler.py, run_hmmlib_passes.py, summarize_windows_compiler_rerun.py). .log files remain gitignored.

Co-Authored-By: Oz <oz-agent@warp.dev>
…tors-phase1

# Conflicts:
#   CMakeLists.txt
#   include/libhmm/libhmm.h
#   include/libhmm/platform/simd_platform.h
…NEON)
Move kernel bodies from the header to a new src/performance/transcendental_kernels.cpp.
Add four-tier ISA cascade for each of the five kernels, mirroring the existing
Tier-2 distribution kernels (gaussian_distribution.cpp, exponential_distribution.cpp).
Vector exp(double) design:
- Range reduction: x = N*ln2 + r, |r| <= ln2/2, Cephes split ln2 = ln2_hi + ln2_lo.
- Polynomial: 13-term Horner of sum(r^k/k!). Truncation < 7.4e-17 at r = ln2/2.
- 2^N: (n + 1023) << 52 via integer bit manipulation.
- Underflow guard: clamp x >= constants::probability::MIN_LOG_PROBABILITY (-700);
mask output lanes to 0 for inputs at or below that threshold. Handles LOG_ZERO
= -inf sentinels branch-free. No +inf / NaN handling (callers guarantee finite
or LOG_ZERO inputs).
- AVX path is AVX-1 compatible (Ivy Bridge / Catalina): 2^N step uses two 128-bit
halves to avoid AVX2-only _mm256_cvtepi32_epi64.
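The design bullets above correspond to the following scalar sketch. It is a one-lane illustration of the algorithm, not the kernel itself (the real code applies the same steps per lane with intrinsics), and the constant names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of the vector exp(double): Cephes-split range reduction,
// 13-term Horner polynomial for e^r, 2^n by integer bit manipulation,
// and the branch-free-style underflow guard at -700.
double exp_sketch(double x) {
    const double MIN_LOG = -700.0;               // MIN_LOG_PROBABILITY
    if (x <= MIN_LOG) return 0.0;                // masked lane; covers -inf
    const double LOG2E  = 1.4426950408889634;
    const double LN2_HI = 6.93145751953125e-1;   // Cephes split of ln 2
    const double LN2_LO = 1.42860682030941723212e-6;
    const double n = std::nearbyint(x * LOG2E);  // x = n*ln2 + r
    double r = x - n * LN2_HI;                   // |r| <= ln2/2
    r -= n * LN2_LO;
    // 13-term Horner evaluation of sum_{k=0}^{12} r^k / k!
    const double c[] = {1.0/479001600.0, 1.0/39916800.0, 1.0/3628800.0,
                        1.0/362880.0, 1.0/40320.0, 1.0/5040.0, 1.0/720.0,
                        1.0/120.0, 1.0/24.0, 1.0/6.0, 0.5, 1.0, 1.0};
    double p = c[0];
    for (int k = 1; k < 13; ++k) p = p * r + c[k];
    // 2^n: place (n + 1023) into the IEEE-754 exponent field.
    const std::int64_t ebits = (static_cast<std::int64_t>(n) + 1023) << 52;
    double scale;
    std::memcpy(&scale, &ebits, sizeof scale);
    return p * scale;
}
```

In the vector version the `if` becomes a lane mask, and on AVX-1 the exponent-field construction is done in two 128-bit halves as noted above.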
Kernel cascade pattern:
AVX-512 8-wide __m512d (uses _mm512_fmadd_pd)
AVX/AVX2 4-wide __m256d (compiler fuses FMA under AVX2)
SSE2 2-wide __m128d
NEON 2-wide float64x2_t (uses vfmaq_f64)
scalar tail and portable fallback
Each ISA block advances i and the outer scalar variable (maxVal / sum) is
seeded from the previous block's result so the cascade handles any size
without early returns (avoids MSVC C4702 unreachable-code warnings).
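The cascade shape can be sketched as below, using a max reduction as the example. The macro names follow this commit's text; the vector loop bodies are elided to comments, so only the seeding/fall-through structure is shown, not a real kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Each conditionally compiled tier consumes as many full vectors as fit,
// advances i, and folds its partial result into maxVal; the scalar tail
// finishes from wherever i stopped. No early returns, so every macro
// configuration falls through cleanly (avoids MSVC C4702).
double reduce_max_cascade_sketch(const double* x, std::size_t n) {
    std::size_t i = 0;
    double maxVal = -std::numeric_limits<double>::infinity();
#if defined(LIBHMM_HAS_AVX512)
    // 8-wide __m512d loop: for (; i + 8 <= n; i += 8) { fold into maxVal }
#endif
#if defined(LIBHMM_HAS_AVX)
    // 4-wide __m256d loop resumes at the current i, seeded with maxVal.
#endif
    // Scalar tail and portable fallback.
    for (; i < n; ++i) maxVal = std::max(maxVal, x[i]);
    return maxVal;
}
```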
Build system: add src/performance/transcendental_kernels.cpp,
src/calculators/forward_backward_calculator.cpp, and
src/training/baum_welch_trainer.cpp to LIBHMM_SIMD_SOURCES so
LIBHMM_BEST_SIMD_FLAGS and the LIBHMM_HAS_* macros fire correctly.
Drop the dead TranscendentalBackend enum (zero callers; outlier vs project
convention). Active-ISA reporting uses simd::feature_string() from simd_platform.h.
Sparse-path BW xi loop stays scalar (masking non-zero transitions in a SIMD
loop costs more than it saves for sparse models; comment added at call site).
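The sparse zero-skip pattern kept scalar above looks roughly like this. The function and parameter names are hypothetical, not the trainer's real xi loop.

```cpp
#include <cassert>
#include <cmath>

// Zero-mass transitions (log prob == -inf) are skipped before the exp()
// call. A masked SIMD loop would still pay vector-exp cost on the dead
// lanes, which is why the scalar branch wins for sparse models.
double xi_sparse_sketch(const double* logA, const double* contrib, int n) {
    double acc = 0.0;
    for (int j = 0; j < n; ++j) {
        if (std::isinf(logA[j]) && logA[j] < 0.0)
            continue;  // zero-mass transition: skip entirely
        acc += std::exp(logA[j] + contrib[j]);
    }
    return acc;
}
```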
New test: tests/performance/test_transcendental_kernels.cpp.
Five kernels x N in {1,2,3,4,7,8,15,16,31,32,64}; std::exp inline reference
(not kernel scalar variant); tolerance 1e-12 rel / 1e-15 abs.
Performance: bw_hotspot (Zen 4 / Windows / MSVC, AVX-512, median 8 runs):
                     BEFORE (scalar)   AFTER (AVX-512)
  FB N=8  T=1000        0.725 ms          0.533 ms   (1.36x)
  FB N=16 T=500         1.585 ms          0.574 ms   (2.76x)
  FB N=32 T=2000       32.743 ms          5.772 ms   (5.67x)
  Xi N=16 T=500         0.758 ms          0.658 ms   (1.15x)
  Xi N=32 T=2000       18.169 ms         17.700 ms   (1.03x)
FB max-reduce is the primary beneficiary (5.7x at N=32). BW xi accumulation
shows modest improvement at these sizes -- the dense-xi inner loop is
memory-bandwidth-bound at N>=16, not compute-bound.
Tests: 36/36 ctest + 7/7 phase-gate passing on Windows/MSVC Release.
simd_inspection: 6/6 smoke tests pass; vector width = 8 lanes (AVX-512).
Co-Authored-By: Oz <oz-agent@warp.dev>
…_sweep tool

Measured via new fb_crossover_sweep tool (ForwardBackwardCalculator with setRecurrenceModeOverride, Zen 4 / MSVC / AVX-512, T=1000, median 8 runs):
  N=2:  MaxReduce 2.1x slower -- Pairwise wins
  N=3:  MaxReduce 1.1x slower -- Pairwise wins
  N=4:  MaxReduce 1.7x faster -- crossover
  N=8:  MaxReduce 5.0x faster
  N=32: MaxReduce 15x faster

The pre-SIMD threshold (N>=5 on x86) was set before TranscendentalKernels had SIMD backends. With AVX-512/AVX/SSE2 kernels now active, MaxReduce breaks even at N=4 on this hardware. The arm64 threshold was already N>=4 (unchanged). Since both arms now return the same thing, collapse the #if defined(__aarch64__) block to a single unconditional threshold. If future NEON vs x86 measurements diverge, the split can be reintroduced.

tools/fb_crossover_sweep.cpp: new diagnostic tool that times Pairwise vs MaxReduce via the production calculator at N=2..64 and marks the mode that selectFbRecurrenceMode() currently picks for each N.

36/36 ctest + 7/7 phase-gate passing.
Co-Authored-By: Oz <oz-agent@warp.dev>
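After this collapse, the selection logic reduces to a single comparison. A minimal sketch, assuming the enum and function names from the commit text (the real function may carry additional parameters):

```cpp
#include <cassert>

// Collapsed threshold: x86 (post-SIMD, N>=4) and arm64 (already N>=4)
// now agree, so no #if defined(__aarch64__) split is needed.
enum class FbRecurrenceMode { Pairwise, MaxReduce };

FbRecurrenceMode selectFbRecurrenceMode(int numStates) {
    // Measured crossover: MaxReduce breaks even at N=4 (Zen 4 / AVX-512;
    // N=2 and N=3 still favour Pairwise).
    return numStates >= 4 ? FbRecurrenceMode::MaxReduce
                          : FbRecurrenceMode::Pairwise;
}
```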
HMMLib is header-only and has zero Boost includes in its headers. The find_package(Boost) guard in the HMMLIB_READY check was a historical cargo-cult that broke on CMake 3.30+ (CMP0167 NEW routes find_package(Boost) to Boost's own config files, which are absent in a headers-only extraction). Replace the Boost check with a direct existence test for HMMlib/hmm.hpp. Simplify enable_hmmlib() to add only the HMMLib directory as a SYSTEM include (the Boost_INCLUDE_DIRS and Boost_LIBRARIES lines were also dead). Co-Authored-By: Oz <oz-agent@warp.dev>
Add simd_kernels_internal.h: internal header that provides inline log_pd_* and exp_pd_* helpers (AVX-512 / AVX / SSE2 / NEON) for use by Tier-2 distribution TUs compiled with LIBHMM_BEST_SIMD_FLAGS. Also extend transcendental_kernels.cpp with the same log_pd_* helpers (kept separately because TranscendentalKernels lives in its own TU).

Vector log design:
- Range reduction: extract IEEE754 exponent e and mantissa m (x = 2^e*m, m in [1,2)). If m > sqrt(2): e += 1, m *= 0.5 (m in [1/sqrt(2), sqrt(2)]).
- y = (m-1)/(m+1), |y| <= 0.172.
- Polynomial: log(m) = 2y*(1 + y^2/3 + ... + y^12/13), 7-term Horner. Truncation at |y|_max is adequate for distribution callers (1e-10 abs).
- Reconstruction: log(x) = e*LN2_HI + e*LN2_LO + log(m) (Cephes split).
- Guard: x <= 0 lanes -> -inf; no NaN (callers validate x > 0).
- AVX-512 int64->double via scalar store (no AVX-512 DQ required).
- AVX path stays AVX-1 compatible (same 128-bit half trick as exp_pd_avx).

LogNormalDistribution::getBatchLogProbabilities: Tier 2 free function lognormal_logpdf_batch. Per element: lx = log(x); res = S*(lx-mu)^2 - lx - C, where S = negHalfSigmaSquaredInv_, C = logNormalizationConstant_.

ParetoDistribution::getBatchLogProbabilities: Tier 2 free function pareto_logpdf_batch. Per element: if x < xm -> -inf; else logK + kLogXm - kPlus1 * log(x). The xm guard is a vector mask (no branch in the SIMD body).

Build: log_normal_distribution.cpp and pareto_distribution.cpp were already in LIBHMM_SIMD_SOURCES; simd_kernels_internal.h is included directly and fires when those TUs have LIBHMM_BEST_SIMD_FLAGS active.

36/36 ctest + 7/7 phase-gate passing on Windows/MSVC Release.
Co-Authored-By: Oz <oz-agent@warp.dev>
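The vector log design above corresponds to this one-lane scalar sketch. Constant names and the function name are illustrative; the real helpers are the lane-wise log_pd_* intrinsics.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar sketch of the vector log(double): exponent/mantissa split,
// y = (m-1)/(m+1) series, split-LN2 reconstruction, x <= 0 guard.
double log_sketch(double x) {
    if (x <= 0.0) return -INFINITY;  // guarded lanes; callers pass x > 0
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // IEEE-754 binary64: biased exponent in bits 52..62.
    int e = static_cast<int>((bits >> 52) & 0x7FF) - 1023;
    const std::uint64_t mbits =
        (bits & 0xFFFFFFFFFFFFFull) | (1023ull << 52);
    double m;
    std::memcpy(&m, &mbits, sizeof m);  // m in [1, 2)
    if (m > 1.4142135623730951) { e += 1; m *= 0.5; }  // m in [1/sqrt2, sqrt2]
    const double y  = (m - 1.0) / (m + 1.0);  // |y| <= 0.172
    const double y2 = y * y;
    // 7-term Horner of log(m) = 2y*(1 + y^2/3 + y^4/5 + ... + y^12/13)
    const double c[] = {1.0/13.0, 1.0/11.0, 1.0/9.0, 1.0/7.0,
                        1.0/5.0, 1.0/3.0, 1.0};
    double p = c[0];
    for (int k = 1; k < 7; ++k) p = p * y2 + c[k];
    const double logm = 2.0 * y * p;
    const double LN2_HI = 6.93145751953125e-1;  // Cephes split of ln 2
    const double LN2_LO = 1.42860682030941723212e-6;
    return e * LN2_HI + e * LN2_LO + logm;
}
```

The per-element distribution formulas then ride on this: e.g. the log-normal body is S*(lx-mu)^2 - lx - C with lx = log_sketch(x), computed entirely branch-free per lane.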
tests/platform/test_simd_platform.cpp verifies simd_platform.h at two levels:

Compile-time (#error): ISA hierarchy invariants -- AVX512 implies AVX and SSE2, AVX2 implies AVX, AVX implies SSE2, SSE4.1 implies SSE2, NEON and x86 macros are mutually exclusive. A broken macro combination becomes a build error rather than a silent runtime failure.

Runtime (GTest, 12 assertions): contracts on feature_string() (non-null, non-empty, string value agrees with active macros), double_vector_width() and float_vector_width() (power-of-two, float == 2*double), optimal_alignment() (power-of-two >= 8, covers one SIMD register), has_simd_support() / supports_vectorization() consistency, and compile-time constant / function agreement for DOUBLE_SIMD_WIDTH, FLOAT_SIMD_WIDTH, SIMD_ALIGNMENT.

Not compiled with LIBHMM_BEST_SIMD_FLAGS -- tests the detection infrastructure, not the intrinsics.

Also updates simd_platform.h: replaces the stale four-item consumer list with a concise description that won't drift as consumers are added.

37/37 tests pass.
Co-Authored-By: Oz <oz-agent@warp.dev>
Without this, cmake defaults to no build type (effectively -O0). At -O0 the compiler emits VZEROUPPER in the k_exp_pd_avx function prologue (before a dynamic stack-probe call for the 6336-byte stack frame) and BEFORE the __m256d ymm0 argument is saved to the stack. VZEROUPPER zeros the upper 128 bits of all YMM registers, so x[2] and x[3] are silently set to 0.0 inside the function body, producing exp(0)=1.0 instead of the correct tiny values for those lanes. At Release (-O3) the static inline helpers are inlined into their callers and no function-call ABI boundary exists where this can occur. 37/37 tests pass; simd_inspection 6/6 (AVX, width=4). Co-Authored-By: Oz <oz-agent@warp.dev>
…type requirement
WARP.md: document that configure_catalina.sh now defaults to Release, explain
the -O0 VZEROUPPER / __m256d argument corruption issue, and give the
RelWithDebInfo override for debuggable builds.
Ivy Bridge / AVX-1 (macOS Catalina, i7-3820QM) benchmark results:
fb_crossover_sweep: N>=4 MaxReduce threshold confirmed correct on 4-wide AVX.
N=2: Pairwise (ratio 1.83), N=3: tied (0.98), N=4+: MaxReduce clearly wins.
No threshold change required.
libhmm throughput (FB): peaks ~100k state-steps/ms at N=128.
vs GHMM (Gaussian continuous, Forward-Backward):
GHMM ~2.2x faster on Forward (was ~5x on Windows/Zen4 before SIMD work).
libhmm Viterbi ~5x faster than GHMM (0.066ms vs 0.373ms at T=1000).
All log-likelihoods match to <1e-10.
vs HMMLib (discrete): HMMLib ~4x faster.
The narrowed GHMM Forward gap (5x→2.2x) is consistent with the transcendental
kernel SIMD work landing on this platform. The AVX-1 (no AVX2/FMA) path was
exercised for the first time here.
Co-Authored-By: Oz <oz-agent@warp.dev>
… data

Focused N=2..8 pairwise vs MaxReduce crossover sweep from MacBook Pro 9,1 (i7-3820QM, AVX no AVX2, macOS Catalina). Confirms N>=4 MaxReduce threshold is correct on this platform.

Co-Authored-By: Oz <oz-agent@warp.dev>
Apple M1 / ARM NEON / macOS Tahoe 26.4.1 / Homebrew LLVM 22:
Tests: 37/37 pass, simd_inspection 6/6. LIBHMM_HAS_NEON=YES, vector width=2, feature_string='ARM NEON (Apple Silicon)'.
fb_crossover_sweep (NEON 2-wide):
  N=2:  Pairwise (1.70x)
  N=3:  tied (1.001)
  N=4+: MaxReduce (0.68 at N=4, 0.15 at N=32)
N>=4 threshold confirmed correct -- no change required. MaxReduce advantage grows faster on NEON than AVX (6x at N=16 vs 3.7x on Ivy Bridge).
libhmm throughput (FB): peaks ~240k state-steps/ms at N=128.
vs GHMM (Gaussian continuous): GHMM ~3.0x faster (was ~5x pre-SIMD, ~2.2x on Ivy Bridge). libhmm Viterbi faster than GHMM on all sizes. All log-likelihoods match to <1e-10.
vs HMMLib (discrete): HMMLib ~3.4x faster.
Also add previously collected M1/Tahoe crossover and HMMLib 9-pass sweep CSVs across four compiler configurations (AppleClang, GCC 15, Homebrew LLVM, reruns).
Co-Authored-By: Oz <oz-agent@warp.dev>
Intel i7-7820HQ (Kaby Lake), AVX2+FMA, macOS 13.7.8 Ventura / AppleClang:
Tests: 37/37 pass, simd_inspection 6/6. LIBHMM_HAS_AVX2=YES, LIBHMM_HAS_AVX512=-, width=4, feature='AVX2'.
No correctness issues; identical code path to Ivy Bridge AVX-1 with compiler auto-FMA fusion (vfmadd) on the Horner polynomial.
fb_crossover_sweep (AVX2 4-wide):
  N=2:  Pairwise (1.94x)
  N=3:  tied (1.011)
  N=4+: MaxReduce (0.62 at N=4)
N>=4 threshold confirmed correct -- no change required.
libhmm throughput (FB): peaks ~90k state-steps/ms at N=64.
vs HMMLib (discrete): HMMLib ~4.3x faster. No GHMM installed on this machine.
No anomalies. AVX2+FMA auto-fusion produces expected minor improvements vs AVX-1. k_exp_pd_avx validated across all three x86 ISA tiers.
Also add previously collected Kaby Lake crossover and high-N CSVs.
Co-Authored-By: Oz <oz-agent@warp.dev>
GHMM copied to ~/Development post-initial commit; added here as amendment.
vs GHMM (Gaussian continuous, Kaby Lake AVX2+FMA):
  GHMM ~2.3x faster Forward (4.9k vs 8.9k obs/ms average).
  libhmm Viterbi: 3-4x faster than GHMM (0.053ms vs 0.181ms at T=1000).
  All log-likelihoods match to <1e-10. Numerical match: YES on all sizes.
Pattern consistent across platforms:
  Zen4/AVX-512:    GHMM ~2.2x fwd   libhmm Viterbi ~5x faster
  Ivy Bridge/AVX:  GHMM ~2.2x fwd   libhmm Viterbi ~3x faster
  Kaby Lake/AVX2:  GHMM ~2.3x fwd   libhmm Viterbi ~3-4x faster
  M1/NEON:         GHMM ~3.0x fwd   libhmm Viterbi ~3x faster
Co-Authored-By: Oz <oz-agent@warp.dev>
- test_transcendental_kernels.cpp: update 'Level 8 section' reference to 'Performance Primitives section' following the test group rename.
- transcendental_kernels.cpp: add missing '// Scalar tail.' comment to accumulate_exp_sum2_bias, consistent with the other four kernels.
Co-Authored-By: Oz <oz-agent@warp.dev>
pre-commit:
- Remove trailing whitespace from Doxygen comment lines in log_normal_distribution.cpp and pareto_distribution.cpp.
- Add missing EOF newline to forward_backward_calculator.h.
- Apply clang-format to log_normal and pareto distribution files.

cppcheck:
- Add cppcheck-suppress redundantInitialization to the AVX-512 blocks in reduce_max_sum2 and reduce_max_sum3. When only LIBHMM_HAS_AVX512 is defined, maxVal=neg_inf at entry is overwritten before any read. The initialization is an intentional cascade seed for non-AVX512 tiers; the suppression documents this.

Co-Authored-By: Oz <oz-agent@warp.dev>
…ppressions

clang-format (v19.1.7 via pre-commit mirrors-clang-format): applied to all tracked C/C++ source and header files. No semantic changes -- formatting only. Register this commit in .git-blame-ignore-revs to keep git blame output clean.

cppcheck: move inline redundantInitialization suppressions onto the flagged statement line in reduce_max_sum2 and reduce_max_sum3 (was two lines above; cppcheck requires same-line or immediately-preceding-line placement).

Co-Authored-By: Oz <oz-agent@warp.dev>
Co-Authored-By: Oz <oz-agent@warp.dev>
… fix

.gitattributes: change *.ps1 from eol=crlf to eol=lf. The existing eol=crlf rule caused git to check out phase_gate.ps1 as CRLF on CI (Ubuntu), which the pre-commit mixed-line-ending hook then flagged. PowerShell 7 handles LF line endings on all platforms.

transcendental_kernels.cpp: replace inline cppcheck-suppress comments (which did not work regardless of placement) with a structural fix. double maxVal is now declared without initialisation; the AVX-512 #if block sets it, and the #else branch assigns neg_inf for all non-AVX512 paths. No redundant initialisation in any configuration.

Co-Authored-By: Oz <oz-agent@warp.dev>
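The structural fix can be sketched as below. The function name is hypothetical (the real kernels are reduce_max_sum2 / reduce_max_sum3) and the vector loop body is elided; only the initialization structure that silences cppcheck is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// maxVal has no initializer at its declaration; the AVX-512 block is
// responsible for its first write, and the #else seeds every non-AVX512
// configuration. No configuration assigns maxVal twice before a read,
// so redundantInitialization cannot fire.
double reduce_max_fixed_sketch(const double* x, std::size_t n) {
    std::size_t i = 0;
    double maxVal;  // deliberately no initializer here
#if defined(LIBHMM_HAS_AVX512)
    // 8-wide loop sets maxVal from its vector reduction and advances i.
    maxVal = -std::numeric_limits<double>::infinity();
#else
    maxVal = -std::numeric_limits<double>::infinity();  // seed for lower tiers
#endif
    for (; i < n; ++i) maxVal = std::max(maxVal, x[i]);
    return maxVal;
}
```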
Co-Authored-By: Oz <oz-agent@warp.dev>
Summary

Delivers the performance work planned for perf/trainers-calculators-phase1. Validated on four platforms (Windows/MSVC/AVX-512, macOS/Kaby Lake/AVX2, macOS/Ivy Bridge/AVX-1, macOS/M1/NEON); 37/37 tests pass on all four.

Changes
SIMD transcendental kernels
src/performance/transcendental_kernels.cpp -- five inner-loop kernels (reduce_max_sum2, sum_exp_sum2_minus_max, reduce_max_sum3, sum_exp_sum3_minus_max, accumulate_exp_sum2_bias) now have full AVX-512 / AVX / SSE2 / NEON implementations. Consumed by ForwardBackwardCalculator (FB max-reduce recurrence) and BaumWelchTrainer (dense-xi accumulation). Both TUs added to LIBHMM_SIMD_SOURCES.

Vector exp helper uses a 13-term Horner polynomial with Cephes ln2 range reduction and branch-free underflow masking at MIN_LOG_PROBABILITY (= -700). AVX path is AVX-1 compatible (splits the 256-bit 2^N step into two 128-bit halves to avoid AVX2-only _mm256_cvtepi32_epi64).

Tier-2 LogNormal and Pareto
log_normal_distribution.cpp and pareto_distribution.cpp gain explicit-intrinsics getBatchLogProbabilities using a vector log helper (IEEE-754 exponent extraction, 7-term Horner, split-LN2 reconstruction, <= 5 ULP).

SIMD helper consolidation
include/libhmm/performance/simd_kernels_internal.h is the single source of truth for vector exp/log primitives. transcendental_kernels.cpp previously carried identical duplicate bodies; those are removed (-591 lines) and replaced with #include + kernels::k_* call sites.

fb_recurrence_policy.h relocation

Moved from include/libhmm/calculators/ to include/libhmm/performance/ -- it is a cross-cutting performance primitive, not a calculator-specific interface. Three consumer includes updated.

FB recurrence crossover retuning
N>=5 -> N>=4 on x86 after profiling on Zen 4 / AVX-512: MaxReduce is 1.7x faster at N=4 post-SIMD. New fb_crossover_sweep and fb_contour_sweep tools provide sweep and heatmap data.

BW hotspot profiling
bw_hotspot and hotspot_breakdown tools allow independent timing of the three BW E-step cost centres (FB, gamma accumulation, xi accumulation).

Tests

- tests/platform/test_simd_platform.cpp -- fills the Platform Capabilities group: compile-time #error ISA hierarchy invariants + 12 runtime assertions on simd_platform.h utility functions.
- tests/performance/test_transcendental_kernels.cpp -- five kernels x 11 sizes, std::exp as oracle, 1e-12 rel / 1e-15 abs tolerance.
- tests/training/test_bw_parity.cpp -- BW determinism (bit-exact) and EM monotonicity.
- tests/calculators/test_fb_mode_parity.cpp -- Pairwise vs MaxReduce log-likelihood agreement.

Documentation and test structure
- performance/PERFORMANCE_ARCHITECTURE.md updated: Tier-2 coverage, delivered recurrence-kernel SIMD, corrected LIBHMM_SIMD_SOURCES list, simd_kernels_internal.h noted.
- WARP.md: Tier-2 list, test group names.
- tests/CMakeLists.txt: numeric Level labels replaced with semantic group names; Performance Primitives reordered before Distributions to reflect dependency order.

Benchmark highlights (Zen 4 / AVX-512, T=1000)

Cross-platform data in benchmark-analysis/.

Plans
Conversation: https://app.warp.dev/conversation/4294c1ae-52ec-4582-80f7-acabb801c408
Co-Authored-By: Oz <oz-agent@warp.dev>