Speed up partial-boundary tail scan via bytes.find#281
Merged
Conversation
The fallback path for the partial-boundary tail used a Python-level byte-by-byte while loop. Replace it with a single bytes.find call, which scans the same range at C speed. Behavior is identical: the loop's post-condition (i lands on boundary[0] or at data_length - 1) is preserved exactly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
bytes.findcallWhy
When a chunk does not contain the full boundary, the parser falls back to scanning the trailing region for
boundary[0]. The fallback was a Pythonwhileloop, which dominates parse time for non-trivial boundary lengths. Routing the same scan throughbytes.findkeeps it at C speed without changing behavior.Measured against a 2 MB body fed in 16 KB chunks:
The post-condition of the original loop (either
data[i] == boundary[0]ori == data_length - 1) is preserved exactly by thebytes.findreplacement. All existing tests pass unchanged, includingtest_random_splittingwhich exhaustively exercises the partial-boundary path.On the CodSpeed report
CodSpeed will likely report this PR as "unchanged". That is expected and not a sign the optimization is missing.
CodSpeed instrumentation mode counts retired CPU instructions, not wall-clock time. This optimization is a constant-factor win:
bytes.find(C-level memchr/SIMD) replaces a Python interpreter loop. The wall-clock difference is large because CPython's per-bytecode interpreter overhead is high; the retired-instruction-count difference is small because the parser still walks the byte stream interpretively in many surrounding paths that are unchanged. Local wall-clock measurements confirm 5-87x speedup on thetest_multipart_long_boundaryscenario depending on body size.Test plan
pytest tests/passes (147 tests)AI Disclaimer
This PR was developed with the assistance of either Claude or Codex. I've reviewed and verified the changes.