perf: optimize SIMD check and encode implementations #39

Merged
DaniPopes merged 20 commits into master from dani/optimize-simd
Mar 2, 2026

@DaniPopes
Owner

Replace check algorithm with signed overflow trick for large inputs (≥128 bytes). Add AVX2 check path processing 32 bytes per iteration. Double encode throughput (32→64 bytes/iter) via new AVX2 path.

References:
- http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html

Split from #35, credit to @zerosnacks.
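For readers unfamiliar with the trick, here is a minimal scalar sketch of one way to read "signed overflow": bias the byte so the start of the valid range lands on `i8::MIN`, then a single signed comparison replaces the usual two-sided range test. The names `in_range_signed` and `is_hex_digit` are hypothetical and do not come from this PR; the actual SIMD code may fold the constants differently.

```rust
/// Returns true if `b` lies in `lo..=hi`, using one signed comparison.
/// Assumes `hi - lo < 127` so that the threshold fits in an `i8`.
fn in_range_signed(b: u8, lo: u8, hi: u8) -> bool {
    // Move the range start to the most negative i8: b == lo maps to -128.
    let biased = b.wrapping_sub(lo).wrapping_add(0x80) as i8;
    // First value past the range, expressed in the same biased space.
    let limit = i8::MIN + (hi - lo + 1) as i8;
    // Bytes below `lo` or above `hi` wrap around to values >= limit.
    biased < limit
}

/// Hypothetical scalar hex check built from the helper above.
fn is_hex_digit(b: u8) -> bool {
    // `b | 0x20` folds 'A'..='F' onto 'a'..='f'.
    in_range_signed(b, b'0', b'9') || in_range_signed(b | 0x20, b'a', b'f')
}
```

The point of the bias is that SSE2 and AVX2 only offer signed byte comparisons, so an unsigned range test has to be rephrased this way to cost one compare per range.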
@codspeed-hq

codspeed-hq bot commented Feb 25, 2026

Merging this PR will degrade performance by 30.01%

⚡ 12 improved benchmarks
❌ 4 regressed benchmarks
✅ 20 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| bench1_32b | 321.9 ns | 460 ns | -30.01% |
| bench5_128k | 206.2 µs | 151.7 µs | +35.89% |
| bench3_2k | 3.5 µs | 2.8 µs | +26.51% |
| bench4_16k | 26 µs | 19.3 µs | +34.71% |
| bench2_256b | 812.5 ns | 915 ns | -11.2% |
| bench3_2k | 7.8 µs | 7.1 µs | +10.14% |
| bench1_32b | 429.4 ns | 493.1 ns | -12.9% |
| bench3_2k | 5.6 µs | 4.8 µs | +16.34% |
| bench5_128k | 314.6 µs | 260.1 µs | +20.94% |
| bench6_1m | 2.5 ms | 2.1 ms | +21.24% |
| bench4_16k | 39.7 µs | 32.9 µs | +20.46% |
| bench6_1m | 2.5 ms | 2.1 ms | +21.27% |
| bench6_1m | 1.6 ms | 1.2 ms | +36.07% |
| bench1_32b | 1.3 µs | 1.5 µs | -10.07% |
| bench4_16k | 41.6 µs | 34.9 µs | +19.16% |
| bench5_128k | 312 µs | 257.5 µs | +21.17% |

Comparing dani/optimize-simd (852c862) with master (023646a)

Apply Muła & Langdale signed overflow trick to SSE2 check path too,
reducing from 10 to 5 operations per chunk. Remove length threshold
dispatch since both paths now use the same efficient algorithm.
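As an illustration of what a five-operation classification can look like, here is a hedged SSE2 sketch: two adds, two signed compares, and one OR for case folding, followed by a final OR and movemask to combine the results. This is an assumption about the shape of the algorithm, not the PR's `check_chunk_sse2`; the function name `check_chunk` and the constant folding are hypothetical.

```rust
/// Hypothetical 16-byte hex check; not the crate's actual implementation.
#[cfg(target_arch = "x86_64")]
fn check_chunk(chunk: &[u8; 16]) -> bool {
    use std::arch::x86_64::*;
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned load
    // reads exactly the 16 bytes of `chunk`.
    unsafe {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        // '0'..='9' plus 80 wraps to -128..=-119 as i8; other bytes land at -118 or above.
        let digits = _mm_add_epi8(v, _mm_set1_epi8((0x80 - b'0') as i8));
        let is_digit = _mm_cmplt_epi8(digits, _mm_set1_epi8(i8::MIN + 10));
        // Fold 'A'..='F' onto 'a'..='f', then apply the same bias for letters.
        let lower = _mm_or_si128(v, _mm_set1_epi8(0x20));
        let letters = _mm_add_epi8(lower, _mm_set1_epi8((0x80 - b'a') as i8));
        let is_alpha = _mm_cmplt_epi8(letters, _mm_set1_epi8(i8::MIN + 6));
        // Every lane must be a digit or a letter: all 16 mask bits set.
        _mm_movemask_epi8(_mm_or_si128(is_digit, is_alpha)) == 0xffff
    }
}

/// Portable fallback so the sketch also builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
fn check_chunk(chunk: &[u8; 16]) -> bool {
    chunk.iter().all(|b| b.is_ascii_hexdigit())
}
```

Because the bias is folded into the add constant, each range costs one add and one compare, with no separate lower-bound and upper-bound tests.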

Extract check_chunk_sse2 and call it directly in check_avx2's remainder
handler instead of check_sse2, which goes through chunks_exact.
Since the AVX2 remainder is at most 31 bytes, it holds at most one full
16-byte chunk, so no loop is needed.
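The remainder handling described here can be sketched roughly as follows, assuming a main loop over 32-byte chunks, one optional 16-byte step, and a scalar tail of at most 15 bytes. The names `check`, `check_avx2_sketch`, `check_chunk16`, and `check_scalar` are hypothetical stand-ins, and the runtime dispatch is an assumption rather than the crate's actual wiring.

```rust
/// Scalar fallback, also used for the final tail of fewer than 16 bytes.
fn check_scalar(input: &[u8]) -> bool {
    input.iter().all(|b| b.is_ascii_hexdigit())
}

/// Validates the first 16 bytes of `chunk` with SSE2 (baseline on x86_64).
/// Hypothetical stand-in for the `check_chunk_sse2` named in the commit.
#[cfg(target_arch = "x86_64")]
fn check_chunk16(chunk: &[u8]) -> bool {
    use std::arch::x86_64::*;
    assert!(chunk.len() >= 16);
    // SAFETY: the assert guarantees 16 readable bytes; the load is unaligned.
    unsafe {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let digit = _mm_cmplt_epi8(_mm_add_epi8(v, _mm_set1_epi8(80)), _mm_set1_epi8(i8::MIN + 10));
        let lower = _mm_or_si128(v, _mm_set1_epi8(0x20));
        let alpha = _mm_cmplt_epi8(_mm_add_epi8(lower, _mm_set1_epi8(31)), _mm_set1_epi8(i8::MIN + 6));
        _mm_movemask_epi8(_mm_or_si128(digit, alpha)) == 0xffff
    }
}

/// 32 bytes per iteration, then at most one 16-byte chunk, then a scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn check_avx2_sketch(input: &[u8]) -> bool {
    use std::arch::x86_64::*;
    let mut chunks = input.chunks_exact(32);
    for chunk in &mut chunks {
        // SAFETY: each chunk is exactly 32 bytes; the caller guarantees AVX2.
        let all_hex = unsafe {
            let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
            // The AVX2 intrinsics only offer cmpgt, so the operands are swapped.
            let digit = _mm256_cmpgt_epi8(_mm256_set1_epi8(i8::MIN + 10), _mm256_add_epi8(v, _mm256_set1_epi8(80)));
            let lower = _mm256_or_si256(v, _mm256_set1_epi8(0x20));
            let alpha = _mm256_cmpgt_epi8(_mm256_set1_epi8(i8::MIN + 6), _mm256_add_epi8(lower, _mm256_set1_epi8(31)));
            _mm256_movemask_epi8(_mm256_or_si256(digit, alpha)) == -1
        };
        if !all_hex {
            return false;
        }
    }
    let mut rest = chunks.remainder(); // 0..=31 bytes left
    if rest.len() >= 16 {
        // A single `if` suffices: 31 bytes cannot hold two 16-byte chunks.
        if !check_chunk16(rest) {
            return false;
        }
        rest = &rest[16..]; // now 0..=15 bytes
    }
    check_scalar(rest)
}

/// Hypothetical entry point with runtime feature detection.
fn check(input: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { check_avx2_sketch(input) };
        }
    }
    check_scalar(input)
}
```

Calling the 16-byte helper directly avoids building a second `chunks_exact` iterator for a remainder that can never produce more than one chunk.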
@DaniPopes DaniPopes merged commit eab3daf into master Mar 2, 2026
20 of 21 checks passed
@DaniPopes DaniPopes deleted the dani/optimize-simd branch March 2, 2026 07:02