build(deps): bump sha2 from 0.10.9 to 0.11.0 in /keygen-rs #5
Merged
Bumps [sha2](https://github.com/RustCrypto/hashes) from 0.10.9 to 0.11.0.
- [Commits](RustCrypto/hashes@sha2-v0.10.9...sha2-v0.11.0)

---
updated-dependencies:
- dependency-name: sha2
  dependency-version: 0.11.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Verification on RTX 4090 with --tier minimal forced via
XCHPLOT2_STREAMING=1 surfaced two corrections:
1. The post-cuts overall peak is 4228 MB at T3 sort phase
(d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT
~3700 MB at T3 match as the previous commit's README +
dispatch message claimed. Cuts #1+#2 reduced T1/T2-sort
GATHER peaks (sub-phases inside the sort phase) but the
CUB DeviceRadixSort itself still allocates four cap-sized
uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB
at k=28), so T1/T2/T3 sort phases stay near the 4 GiB
line. Per-phase peaks measured:
    Xs       : 4136 MB   (4 cap u32 + CUB scratch 40)
    T1 sort  : 4180 MB   (4 cap u32 + CUB scratch 20)
    T2 sort  : 4170 MB   (4 cap u32 + CUB scratch 10)
    T3 match : ~3700 MB  (cut #3 working as designed)
    T3 sort  : 4228 MB   ← bottleneck
Compact tier unchanged at 5200 MB peak (no cuts active).
Drop from 5200 → 4228 (−972 MB / −19%) is real, just less
than the README claimed.
2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still
don't fit. Closing that gap requires the SYCL-branch's
cuts #5 (CUB sort output tiling with host accumulators
across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling) —
not yet ported to cuda-only. The minimal tier in its
current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600
XT, RX 7600) comfortably with ~1 GiB headroom.
Updates:
src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356
MiB (= 4228 measured peak + 128 MB margin). Dispatch
message floor reads as "4.25 GiB floor" instead of the
overstated "3.74 GiB".
README.md: minimal-tier description rewritten with measured
peak (4228 MB), the new bottleneck phase (T3 sort), the
accurate target hardware (5 GiB+ cards, not 4 GiB), and a
pointer to cuts #5/#6 as the remaining work for genuine
4 GiB-card support. Top-of-file streaming-floor summary
updated 3.8 → 4.25 GiB.
tools/xchplot2/cli.cpp: --tier help text updated to match.
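The floor arithmetic in the updates above can be checked with a small sketch. Only `kMinimalFloorBytes` is named in the commit; `kMiB`, `floor_bytes`, and `to_gib` are hypothetical helpers, and the sketch assumes the measured peak and the margin are both in MiB (the message mixes "MB" and "MiB").

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not taken from BatchPlotter.cpp.
constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Floor = measured peak plus a fixed safety margin, both given in MiB
// (assumption: the commit's "MB" figures are MiB).
constexpr std::uint64_t floor_bytes(std::uint64_t peak_mib,
                                    std::uint64_t margin_mib) {
    return (peak_mib + margin_mib) * kMiB;
}

// Convert a byte count to GiB, as a dispatch message might display it.
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With a 4228 peak and a 128 margin this gives 4356 MiB, which is about 4.25 GiB, matching the figures quoted in the message.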
Verified byte-identical at k=22 across plain / compact / minimal
(sha256 17dbf594…) and at k=28 across compact / minimal
(sha256 f42e62ad…). Plain pool and compact streaming paths
unchanged by this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Replaces the four cap-sized uint32/uint64 buffers that CUB
DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile
buffers + host accumulators across all three sort phases. Drops
each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec
of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).
Each tile's CUB sort lands either in the input slice (d_*_mi /
d_t3) or the cap/N alternate buffer; whichever side it lands on,
we D2H to a host pinned accumulator at the matching offset. After
all tiles, we free the per-tile device buffers and the input
buffer, then run a tree of pairwise stable in-place merges on
host (std::inplace_merge for keys-only; a hand-rolled
paired_merge_t* for the pairs cases). The result is a globally
sorted run that we H2D back to the output buffer that downstream
consumers expect.
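A minimal host-only sketch of the tile-then-merge shape described above, for the keys-only case. `std::sort` stands in for the per-tile CUB sort, and the D2H/H2D transfers are omitted; the function name and the ceil-division tile split are assumptions, not the commit's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort `keys` as n_tiles independent slices, then merge neighbouring sorted
// runs pairwise (doubling the run width each pass) until one run remains.
inline void tiled_sort_then_merge(std::vector<std::uint32_t>& keys,
                                  std::size_t n_tiles) {
    assert(n_tiles >= 1);
    const std::size_t n = keys.size();
    if (n == 0) return;
    const std::size_t tile = (n + n_tiles - 1) / n_tiles;  // ceil(n / n_tiles)
    std::uint32_t* p = keys.data();

    // Phase 1: each tile is sorted on its own (stand-in for the device sort).
    for (std::size_t lo = 0; lo < n; lo += tile) {
        const std::size_t hi = std::min(lo + tile, n);
        std::sort(p + lo, p + hi);
    }

    // Phase 2: tree of pairwise stable merges over adjacent runs.
    for (std::size_t width = tile; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            const std::size_t mid = lo + width;
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::inplace_merge(p + lo, p + mid, p + hi);
        }
    }
}
```

The merge tree needs about log2(N) passes over the data, which is where the extra host CPU time per plot would come from in a scheme of this shape.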
T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators.
h_keys_merged was already going to receive the T1 sorted-mi park
after the device-side merge — cut #5 just writes it directly,
skipping the round-trip. h_t2_xbits is dead at T1 sort time
(T2 match staging hasn't filled it yet) so it doubles as the
T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals
from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged
stays null — h_keys_merged is already the parked form. Per-
phase peak: 4180 → 3640 MB.
T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals
(cap × u32, freed at end of phase). h_t2_xbits is NOT reused
for T2 sort — cut #2's xbits gather still reads h_t2_xbits as
the parked unsorted xbits stream, so an in-place reuse would
corrupt that data. Mirror of T1 sort otherwise. Per-phase peak:
4170 → 3640 MB.
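For the pairs cases in T1 and T2 sort, the merge has to move each value together with its key. The sketch below shows one stable merge step over two adjacent sorted runs held in parallel key/value arrays. It is an illustration only: the commit's `paired_merge_t*` routines are described as in-place, while this version uses temporary buffers for clarity, and the name `paired_merge` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable merge of the sorted runs [lo, mid) and [mid, hi) of `keys`,
// carrying `vals` along so that vals[i] stays attached to keys[i].
inline void paired_merge(std::vector<std::uint32_t>& keys,
                         std::vector<std::uint32_t>& vals,
                         std::size_t lo, std::size_t mid, std::size_t hi) {
    assert(keys.size() == vals.size());
    assert(lo <= mid && mid <= hi && hi <= keys.size());
    std::vector<std::uint32_t> tk, tv;  // temporaries (not in-place)
    tk.reserve(hi - lo);
    tv.reserve(hi - lo);

    std::size_t i = lo, j = mid;
    while (i < mid && j < hi) {
        // Take from the right run only when strictly smaller, so equal keys
        // keep their left-before-right order (stability).
        if (keys[j] < keys[i]) { tk.push_back(keys[j]); tv.push_back(vals[j]); ++j; }
        else                   { tk.push_back(keys[i]); tv.push_back(vals[i]); ++i; }
    }
    for (; i < mid; ++i) { tk.push_back(keys[i]); tv.push_back(vals[i]); }
    for (; j < hi;  ++j) { tk.push_back(keys[j]); tv.push_back(vals[j]); }

    std::copy(tk.begin(), tk.end(), keys.begin() + static_cast<std::ptrdiff_t>(lo));
    std::copy(tv.begin(), tv.end(), vals.begin() + static_cast<std::ptrdiff_t>(lo));
}
```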
T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's
cut #3 lifetime as parked T2 meta ends at the H2D-back step
that cut #3 emits before T3 sort entry, so it's reusable.
SortKeys (no vals) → std::inplace_merge for the host merge
step. Per-phase peak: 4228 → 3640 MB.
Plus a small init_u32_identity_offset kernel — the cap/N tile
sort needs its vals_in seeded with global positions
[tile_start..tile_end) so the post-merge d_merged_vals stream
indexes directly into the cap-sized d_t*_meta / d_t*_xbits.
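The seeding that `init_u32_identity_offset` performs can be pictured with a host-side analog. This is a sketch under the assumption that the kernel writes `tile_start + i` at position `i`; the real kernel runs on the device, and `identity_offset` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Host-side analog: fill a tile's value buffer with the global positions
// [tile_start, tile_end) so that, after sorting, each value still indexes
// into the full cap-sized arrays rather than into the tile.
inline std::vector<std::uint32_t> identity_offset(std::uint32_t tile_start,
                                                  std::uint32_t tile_end) {
    assert(tile_start <= tile_end);
    std::vector<std::uint32_t> vals(tile_end - tile_start);
    std::iota(vals.begin(), vals.end(), tile_start);
    return vals;
}
```

Without the offset, every tile would be seeded with 0..cap/N and the merged value stream would only be valid relative to its own tile.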
Verification (RTX 4090 at k=22 + k=28 strength=2):
- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on
each sort phase but cap-sized Xs gen+sort+pack is the new
overall bottleneck — cut #6 closes that gap).
- Compact / plain paths unchanged (the new tile path is gated
on scratch.gather_tile_count >= 2 + the per-tier scratch
pinned slots being populated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the last cap × 4 (uint32) hot-spot. The non-tiled Xs phase
peaks at four cap-sized uint32 buffers + CUB DoubleBuffer scratch
(~4136 MB at k=28); cut #6 generates once into 2 cap × u32, then
sorts in N tiles using cap/N alternate buffers, accumulates into
host pinned, and packs into d_xs without ever holding 4 cap on
device.
src/host/GpuPipeline.cu Xs phase: when scratch.gather_tile_count
>= 2 + scratch.h_meta != nullptr, take a tiled_xs branch:
1. Allocate d_xs_keys_full + d_xs_vals_full (2 cap × u32).
2. launch_xs_gen → fill them.
3. Allocate one shared cap/N alternate pair (keys + vals) + CUB
scratch sized for tile_cap_xs.
4. For each tile in [0, N): CUB DoubleBuffer SortPairs over the
slice, D2H sorted (key, val) pair to scratch.h_meta
reinterpreted as a 2-cap u32 buffer (h_xs_keys at the first
cap entries, h_xs_vals at the next cap; h_meta is cap × u64 =
2 cap × u32 of storage, with total_xs <= cap so both halves
fit). h_meta gets overwritten by T1 match's cut #4 D2H later,
so reusing it through Xs is safe.
5. Free per-tile alt + scratch + d_xs_keys_full + d_xs_vals_full
(peak drops to 0 device-side).
6. Host paired stable merge (cut #5 shape) over h_xs_keys +
h_xs_vals so the host buffers end up globally sorted by
match_info with vals tiebreak.
7. Allocate d_xs (cap × XsCandidateGpu = 2 cap) and pack via two
cudaMemcpy2DAsync H2D copies: match_info field gets h_xs_keys
at struct stride 8, x field gets h_xs_vals at the same stride.
No separate d_xs_keys_b / d_xs_vals_b on-device pack pair
needed.
Per-phase peak: 2 cap (full keys+vals) + 2 cap/N (sort alt) +
scratch ≈ 2.5 cap = 2570 MB at N=4. Final d_xs alloc is the
post-merge peak at ~2 cap = 2080 MB.
Plain / compact paths unchanged (gated on the same tier flags as
the other cuts).
src/host/BatchPlotter.cpp: kMinimalFloorBytes 4356 → 3768 MiB
(= 3640 measured peak + 128 MiB margin). Dispatch message
"3.68 GiB floor".
README.md: minimal-tier description rewritten as six layered
cuts with measured per-phase peaks (Xs 2570, T1/T2 sort 3640,
T3 match/sort 3640) and the new ~31 s/plot wall (vs ~12 s
compact) reflecting the host-CPU merge overhead. Top-of-file
streaming-floor summary 4.25 → 3.7 GiB. 4 GiB cards now
targeted (with the standard "real 4 GiB hardware reports
~3.5 GiB free post-CUDA-context, please report actual fit"
caveat).
tools/xchplot2/cli.cpp: --tier help "minimal = ~3.7 GiB floor,
fits 4 GiB".
Verification on RTX 4090 (XCHPLOT2_STREAMING=1 + --tier minimal,
POS2GPU_STREAMING_STATS=1):
- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 3640 MB; the bottleneck is now T1 sort /
T2 sort / T3 match / T3 sort all tied at 3640 MB (T2 match was
already at this level via the existing N=8 staging).
- k=28 minimal wall: ~31 s/plot (vs ~12 s compact). The 2.6×
slowdown matches the SYCL-branch's measured ~34 vs ~13 s for
the same six-cut configuration on sm_89.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
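The strided pack in step 7 of the tiled Xs path can be illustrated on the host. The sketch below mimics the row/pitch semantics of a 2D copy with `std::memcpy`: each "row" is one 4-byte field, the source pitch is 4 bytes, and the destination pitch is the 8-byte struct stride. The struct `XsCandidate` and its field order (match_info first, then x) are assumptions for illustration; the commit only states the stride of 8 and the two field names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout for illustration; the real XsCandidateGpu may differ.
struct XsCandidate {
    std::uint32_t match_info;
    std::uint32_t x;
};
static_assert(sizeof(XsCandidate) == 8, "sketch assumes an 8-byte stride");

// Host analog of a pitched 2D copy: `height` rows of `width` bytes each,
// with independent source and destination pitches (bytes between rows).
inline void copy_2d(void* dst, std::size_t dpitch,
                    const void* src, std::size_t spitch,
                    std::size_t width, std::size_t height) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r)
        std::memcpy(d + r * dpitch, s + r * spitch, width);
}

// Interleave separate key and value arrays into an array of structs using
// two strided copies, one per field.
inline std::vector<XsCandidate> pack_xs(const std::vector<std::uint32_t>& keys,
                                        const std::vector<std::uint32_t>& vals) {
    assert(keys.size() == vals.size());
    std::vector<XsCandidate> out(keys.size());
    unsigned char* base = reinterpret_cast<unsigned char*>(out.data());
    copy_2d(base + offsetof(XsCandidate, match_info), sizeof(XsCandidate),
            keys.data(), sizeof(std::uint32_t), sizeof(std::uint32_t), keys.size());
    copy_2d(base + offsetof(XsCandidate, x), sizeof(XsCandidate),
            vals.data(), sizeof(std::uint32_t), sizeof(std::uint32_t), vals.size());
    return out;
}
```

Packing this way writes straight into the final struct array, so no second pair of cap-sized staging buffers is needed for the interleave.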
Bumps sha2 from 0.10.9 to 0.11.0.
Commits
- ffe0939 Release sha2 0.11.0 (#806)
- 8991b65 Use the standard order of the `[package]` section fields (#807)
- 3d2bc57 sha2: refactor backends (#802)
- faa55fb sha3: bump `keccak` to v0.2 (#803)
- d3e6489 sha3 v0.11.0-rc.9 (#801)
- bbf6f51 sha2: tweak backend docs (#800)
- 155dbbf sha3: add default value for the `DS` generic parameter on `TurboShake128/256`...
- ed514f2 Use published version of `keccak` v0.2 (#799)
- 702bcd8 Migrate to closure-based `keccak` (#796)
- 827c043 sha3 v0.11.0-rc.8 (#794)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)