Speed up partial-boundary tail scan via bytes.find #281

Merged
Kludex merged 1 commit into main from speed-up-partial-boundary-scan-v2
May 10, 2026

@Kludex Kludex commented May 10, 2026

Summary

  • replace the Python-level byte-by-byte while loop in the multipart partial-boundary tail scan with a single bytes.find call

Why

When a chunk does not contain the full boundary, the parser falls back to scanning the trailing region for boundary[0]. The fallback was a Python while loop, which dominates parse time for non-trivial boundary lengths. Routing the same scan through bytes.find keeps it at C speed without changing behavior.

Measured against a 2 MB body fed in 16 KB chunks:

| Boundary length | Before  | After   |
|-----------------|---------|---------|
| 16 B            | 0.43 ms | 0.33 ms |
| 2 KB            | 12 ms   | 1.84 ms |
| 8 KB            | 43 ms   | 3.49 ms |
| 32 KB           | 92 ms   | 12.3 ms |

The post-condition of the original loop (either data[i] == boundary[0] or i == data_length - 1) is preserved exactly by the bytes.find replacement. All existing tests pass unchanged, including test_random_splitting which exhaustively exercises the partial-boundary path.
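As an illustration of the post-condition argument, here is a minimal sketch of the two scans side by side. It is not the parser's actual code: the function names `scan_loop` and `scan_find` and their parameters are hypothetical, and the loop is reconstructed only from the post-condition stated above (`data[i] == boundary[0]` or `i == data_length - 1`).

```python
def scan_loop(data: bytes, first_byte: int, start: int) -> int:
    """Hypothetical reconstruction of the byte-by-byte fallback scan."""
    i = start
    last = len(data) - 1
    # Stop on the first occurrence of first_byte, or on the last index.
    while i < last and data[i] != first_byte:
        i += 1
    return i


def scan_find(data: bytes, first_byte: int, start: int) -> int:
    """Same scan expressed with a single bytes.find call."""
    last = len(data) - 1
    if start >= last:
        # The loop above would not advance at all in this case.
        return start
    # bytes.find accepts an integer in range 0..255 as the needle.
    # The end bound excludes the last index, mirroring `i < last`.
    found = data.find(first_byte, start, last)
    return found if found != -1 else last


boundary = b"--boundary"
chunk = b"payload bytes-"  # ends with a possible start of the boundary
assert scan_loop(chunk, boundary[0], 0) == scan_find(chunk, boundary[0], 0)
```

Both functions return an index `i` where either `data[i]` equals the first boundary byte or `i` is the last index of the chunk, which is the property the rest of the parser is described as relying on.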

On the CodSpeed report

CodSpeed will likely report this PR as "unchanged". That is expected and not a sign the optimization is missing.

CodSpeed instrumentation mode counts retired CPU instructions, not wall-clock time. This optimization is a constant-factor win: bytes.find (C-level memchr/SIMD) replaces a Python interpreter loop. The wall-clock difference is large because CPython's per-bytecode interpreter overhead is high; the retired-instruction-count difference is small because the parser still walks the byte stream interpretively in many surrounding paths that are unchanged. Local wall-clock measurements confirm 5-87x speedup on the test_multipart_long_boundary scenario depending on body size.
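To see the wall-clock side of this locally, a rough micro-benchmark along the following lines can help. It is only a sketch: `scan_loop`, `scan_find`, and `best_time` are hypothetical helpers written for this illustration, not the project's benchmark suite, and the timings it prints depend on the machine and Python version.

```python
import time


def scan_loop(data: bytes, first_byte: int, start: int) -> int:
    # Interpreter-level scan: one bytecode dispatch per byte examined.
    i = start
    last = len(data) - 1
    while i < last and data[i] != first_byte:
        i += 1
    return i


def scan_find(data: bytes, first_byte: int, start: int) -> int:
    # Same result, but the scan itself runs inside bytes.find.
    last = len(data) - 1
    if start >= last:
        return start
    found = data.find(first_byte, start, last)
    return found if found != -1 else last


def best_time(fn, data: bytes, first_byte: int, start: int, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(data, first_byte, start)
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    # Worst case for the scan: the first boundary byte never appears.
    tail = b"a" * 32_768
    first = ord("-")
    print(f"loop: {best_time(scan_loop, tail, first, 0) * 1e3:.3f} ms")
    print(f"find: {best_time(scan_find, tail, first, 0) * 1e3:.3f} ms")
```

A timer such as `time.perf_counter` measures elapsed time, so it reflects the interpreter overhead that an instruction-count report may largely hide.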

Test plan

  • pytest tests/ passes (147 tests)

AI Disclaimer

This PR was developed with the assistance of either Claude or Codex. I've reviewed and verified the changes.

The fallback path for the partial-boundary tail used a Python-level
byte-by-byte while loop. Replace it with a single bytes.find call,
which scans the same range at C speed.

Behavior is identical: the loop's post-condition (i lands on
boundary[0] or at data_length - 1) is preserved exactly.

codspeed-hq Bot commented May 10, 2026

Merging this PR will not alter performance

✅ 6 untouched benchmarks


Comparing speed-up-partial-boundary-scan-v2 (7ebad5f) with main (09cb8c3)

@Kludex Kludex merged commit d1b5739 into main May 10, 2026
14 checks passed
@Kludex Kludex deleted the speed-up-partial-boundary-scan-v2 branch May 10, 2026 10:32
@Kludex Kludex mentioned this pull request May 10, 2026