build(deps): bump sha2 from 0.10.9 to 0.11.0 in /keygen-rs by dependabot[bot] · Pull Request #5 · Jsewill/xchplot2

dependabot · 2026-04-27T20:59:02Z

Bumps sha2 from 0.10.9 to 0.11.0.

Commits

ffe0939 Release sha2 0.11.0 (#806)
8991b65 Use the standard order of the [package] section fields (#807)
3d2bc57 sha2: refactor backends (#802)
faa55fb sha3: bump keccak to v0.2 (#803)
d3e6489 sha3 v0.11.0-rc.9 (#801)
bbf6f51 sha2: tweak backend docs (#800)
155dbbf sha3: add default value for the DS generic parameter on TurboShake128/256...
ed514f2 Use published version of keccak v0.2 (#799)
702bcd8 Migrate to closure-based keccak (#796)
827c043 sha3 v0.11.0-rc.8 (#794)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [sha2](https://github.com/RustCrypto/hashes) from 0.10.9 to 0.11.0.
- [Commits](RustCrypto/hashes@sha2-v0.10.9...sha2-v0.11.0)

---
updated-dependencies:
- dependency-name: sha2
  dependency-version: 0.11.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Verification on RTX 4090 with --tier minimal forced via XCHPLOT2_STREAMING=1 surfaced two corrections:

1. The post-cuts overall peak is 4228 MB at the T3 sort phase (d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT ~3700 MB at T3 match as the previous commit's README + dispatch message claimed. Cuts #1+#2 reduced the T1/T2-sort GATHER peaks (sub-phases inside the sort phase), but the CUB DeviceRadixSort itself still allocates four cap-sized uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB at k=28), so the T1/T2/T3 sort phases stay near the 4 GiB line.

   Per-phase peaks measured:

     Xs       : 4136 MB (4 cap u32 + CUB scratch 40)
     T1 sort  : 4180 MB (4 cap u32 + CUB scratch 20)
     T2 sort  : 4170 MB (4 cap u32 + CUB scratch 10)
     T3 match : ~3700 MB (cut #3 working as designed)
     T3 sort  : 4228 MB ← bottleneck

   Compact tier unchanged at 5200 MB peak (no cuts active). The drop from 5200 → 4228 (−972 MB / −19%) is real, just less than the README claimed.

2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still don't fit. Closing that gap requires the SYCL-branch's cuts #5 (CUB sort output tiling with host accumulators across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling), neither of which is ported to cuda-only yet. The minimal tier in its current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600 XT, RX 7600) comfortably with ~1 GiB headroom.

Updates:

  src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356 MiB (= 4228 measured peak + 128 MB margin). Dispatch message floor reads as "4.25 GiB floor" instead of the overstated "3.74 GiB".

  README.md: minimal-tier description rewritten with the measured peak (4228 MB), the new bottleneck phase (T3 sort), the accurate target hardware (5 GiB+ cards, not 4 GiB), and a pointer to cuts #5/#6 as the remaining work for genuine 4 GiB-card support. Top-of-file streaming-floor summary updated 3.8 → 4.25 GiB.

  tools/xchplot2/cli.cpp: --tier help text updated to match.

Verified byte-identical at k=22 across plain / compact / minimal (sha256 17dbf594…) and at k=28 across compact / minimal (sha256 f42e62ad…). Plain pool and compact streaming paths unchanged by this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
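The floor arithmetic in the commit message above can be checked with a small sketch. The helper names here (phase_peak_mb, floor_with_margin, fits_minimal_tier) are hypothetical and are not taken from BatchPlotter.cpp; only the numbers come from the message, which uses MB and MiB interchangeably.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: device memory held during one phase, as the sum of
// two resident buffers plus CUB scratch (all in MB, as quoted in the message).
constexpr std::uint32_t phase_peak_mb(std::uint32_t buf_a_mb,
                                      std::uint32_t buf_b_mb,
                                      std::uint32_t scratch_mb) {
    return buf_a_mb + buf_b_mb + scratch_mb;
}

// Hypothetical helper: dispatch floor = measured peak + safety margin.
constexpr std::uint32_t floor_with_margin(std::uint32_t peak_mb,
                                          std::uint32_t margin_mb) {
    return peak_mb + margin_mb;
}

// Hypothetical helper: does the reported free memory cover the floor?
inline bool fits_minimal_tier(std::uint32_t free_mb, std::uint32_t floor_mb) {
    assert(floor_mb > 0);
    return free_mb >= floor_mb;
}

// T3 sort: d_t3 2080 + d_frags_out 2080 + CUB scratch 68.
static_assert(phase_peak_mb(2080, 2080, 68) == 4228, "T3 sort peak");
// kMinimalFloorBytes in MiB: 4228 measured peak + 128 margin.
static_assert(floor_with_margin(4228, 128) == 4356, "minimal-tier floor");
```

With these numbers, a card reporting about 3.5 GiB free (3584 MiB) falls below the 4356 MiB floor, which is consistent with the message's statement that 4 GiB cards still do not fit at this commit.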

Replaces the four cap-sized uint32/uint64 buffers that CUB DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile buffers + host accumulators across all three sort phases. Drops each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).

Each tile's CUB sort lands either in the input slice (d_*_mi / d_t3) or the cap/N alternate buffer; whichever side it lands on, we D2H to a host pinned accumulator at the matching offset. After all tiles, we free the per-tile device buffers and the input buffer, then run a tree of pairwise stable in-place merges on host (std::inplace_merge for keys-only; a hand-rolled paired_merge_t* for the pairs cases). The result is a globally sorted run that we H2D back to the output buffer that downstream consumers expect.

T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators. h_keys_merged was already going to receive the T1 sorted-mi park after the device-side merge; cut #5 just writes it directly, skipping the round-trip. h_t2_xbits is dead at T1 sort time (T2 match staging hasn't filled it yet) so it doubles as the T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged stays null because h_keys_merged is already the parked form. Per-phase peak: 4180 → 3640 MB.

T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals (cap × u32, freed at end of phase). h_t2_xbits is NOT reused for T2 sort: cut #2's xbits gather still reads h_t2_xbits as the parked unsorted xbits stream, so an in-place reuse would corrupt that data. Mirror of T1 sort otherwise. Per-phase peak: 4170 → 3640 MB.

T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's cut #3 lifetime as parked T2 meta ends at the H2D-back step that cut #3 emits before T3 sort entry, so it's reusable. SortKeys (no vals) → std::inplace_merge for the host merge step. Per-phase peak: 4228 → 3640 MB.

Plus a small init_u32_identity_offset kernel: the cap/N tile sort needs its vals_in seeded with global positions [tile_start..tile_end) so the post-merge d_merged_vals stream indexes directly into the cap-sized d_t*_meta / d_t*_xbits.

Verification (RTX 4090 at k=22 + k=28 strength=2):

- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on each sort phase but cap-sized Xs gen+sort+pack is the new overall bottleneck; cut #6 closes that gap).
- Compact / plain paths unchanged (the new tile path is gated on scratch.gather_tile_count >= 2 + the per-tier scratch pinned slots being populated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
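The tile-sort-then-host-merge shape described in this commit message can be sketched on the host alone. This is a minimal illustration under stated assumptions, not the code in GpuPipeline.cu: std::stable_sort stands in for the per-tile CUB sort and its D2H copy, the pairs case uses an array of KeyVal structs rather than the separate key/val arrays that the hand-rolled paired_merge_t* operates on, and every name below is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort `data` in `tiles` independent slices, then combine the sorted runs
// with a tree of pairwise stable merges (run width doubles at each level).
template <typename T, typename Less>
void tiled_sort_then_merge(std::vector<T>& data, std::size_t tiles, Less less) {
    assert(tiles >= 1);
    const std::size_t n = data.size();
    if (n == 0) return;
    const std::size_t tile = (n + tiles - 1) / tiles;  // ceil(n / tiles)
    T* p = data.data();

    // Phase 1: per-tile sort (stand-in for the cap/N CUB sort per tile).
    for (std::size_t lo = 0; lo < n; lo += tile) {
        std::stable_sort(p + lo, p + std::min(lo + tile, n), less);
    }
    // Phase 2: pairwise merges of adjacent runs; std::inplace_merge is stable.
    for (std::size_t width = tile; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            const std::size_t mid = lo + width;
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::inplace_merge(p + lo, p + mid, p + hi, less);
        }
    }
}

// Keys-only case (the T3 sort shape in the message).
inline void sort_keys_tiled(std::vector<std::uint32_t>& keys, std::size_t tiles) {
    tiled_sort_then_merge(keys, tiles,
                          [](std::uint32_t a, std::uint32_t b) { return a < b; });
}

struct KeyVal {
    std::uint32_t key;
    std::uint32_t val;  // global position of the key before sorting
};

// Pairs case: vals are seeded with global positions, which is the role the
// message gives to init_u32_identity_offset, so the merged vals index
// directly into the unsorted cap-sized arrays.
inline std::vector<KeyVal> sort_pairs_tiled(const std::vector<std::uint32_t>& keys,
                                            std::size_t tiles) {
    std::vector<KeyVal> kv(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        kv[i] = KeyVal{keys[i], static_cast<std::uint32_t>(i)};
    }
    tiled_sort_then_merge(kv, tiles,
                          [](const KeyVal& a, const KeyVal& b) { return a.key < b.key; });
    return kv;
}
```

Because both the per-tile sort and the merges are stable and tiles are visited in order, equal keys come out ordered by their original global position. Only one tile's worth of sort workspace is needed at a time, which is the property the cut relies on to shrink the device-side peak.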

Closes the last cap × 4 (uint32) hot-spot. The non-tiled Xs phase peaks at four cap-sized uint32 buffers + CUB DoubleBuffer scratch (~4136 MB at k=28); cut #6 generates once into 2 cap × u32, then sorts in N tiles using cap/N alternate buffers, accumulates into host pinned, and packs into d_xs without ever holding 4 cap on device.

src/host/GpuPipeline.cu Xs phase: when scratch.gather_tile_count >= 2 + scratch.h_meta != nullptr, take a tiled_xs branch:

1. Allocate d_xs_keys_full + d_xs_vals_full (2 cap × u32).
2. launch_xs_gen → fill them.
3. Allocate one shared cap/N alternate pair (keys + vals) + CUB scratch sized for tile_cap_xs.
4. For each tile in [0, N): CUB DoubleBuffer SortPairs over the slice, D2H sorted (key, val) pair to scratch.h_meta reinterpreted as a 2-cap u32 buffer (h_xs_keys at the first cap entries, h_xs_vals at the next cap; h_meta is cap × u64 = 2 cap × u32 of storage, with total_xs <= cap so both halves fit). h_meta gets overwritten by T1 match's cut #4 D2H later, so reusing it through Xs is safe.
5. Free per-tile alt + scratch + d_xs_keys_full + d_xs_vals_full (peak drops to 0 device-side).
6. Host paired stable merge (cut #5 shape) over h_xs_keys + h_xs_vals so the host buffers end up globally sorted by match_info with vals tiebreak.
7. Allocate d_xs (cap × XsCandidateGpu = 2 cap) and pack via two cudaMemcpy2DAsync H2D copies: the match_info field gets h_xs_keys at struct stride 8, the x field gets h_xs_vals at the same stride. No separate d_xs_keys_b / d_xs_vals_b on-device pack pair needed.

Per-phase peak: 2 cap (full keys+vals) + 2 cap/N (sort alt) + scratch ≈ 2.5 cap = 2570 MB at N=4. Final d_xs alloc is the post-merge peak at ~2 cap = 2080 MB. Plain / compact paths unchanged (gated on the same tier flags as the other cuts).

src/host/BatchPlotter.cpp: kMinimalFloorBytes 4356 → 3768 MiB (= 3640 measured peak + 128 MiB margin). Dispatch message "3.68 GiB floor".

README.md: minimal-tier description rewritten as six layered cuts with measured per-phase peaks (Xs 2570, T1/T2 sort 3640, T3 match/sort 3640) and the new ~31 s/plot wall (vs ~12 s compact) reflecting the host-CPU merge overhead. Top-of-file streaming-floor summary 4.25 → 3.7 GiB. 4 GiB cards now targeted (with the standard "real 4 GiB hardware reports ~3.5 GiB free post-CUDA-context, please report actual fit" caveat).

tools/xchplot2/cli.cpp: --tier help "minimal = ~3.7 GiB floor, fits 4 GiB".

Verification on RTX 4090 (XCHPLOT2_STREAMING=1 + --tier minimal, POS2GPU_STREAMING_STATS=1):

- k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
- k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
- k=28 minimal peak 4228 → 3640 MB; the bottleneck is now T1 sort / T2 sort / T3 match / T3 sort all tied at 3640 MB (T2 match was already at this level via the existing N=8 staging).
- k=28 minimal wall: ~31 s/plot (vs ~12 s compact). The 2.6× slowdown matches the SYCL-branch's measured ~34 vs ~13 s for the same six-cut configuration on sm_89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
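Step 7's strided pack can be illustrated on the host. The sketch below emulates a pitched 2D copy in the style of cudaMemcpy2D (rows of `width` bytes, with separate source and destination pitches) using memcpy, so it runs without a GPU. The struct layout is an assumption: the message only states a struct stride of 8 with a match_info field and an x field, so the field order shown here and the names XsCandidate, copy_2d, and pack_xs are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout: two u32 fields, 8 bytes per candidate.
struct XsCandidate {
    std::uint32_t match_info;
    std::uint32_t x;
};
static_assert(sizeof(XsCandidate) == 8, "sketch assumes an 8-byte struct stride");

// Host stand-in for a pitched 2D copy: `height` rows of `width` bytes each,
// row r read at src + r * spitch and written at dst + r * dpitch.
inline void copy_2d(void* dst, std::size_t dpitch,
                    const void* src, std::size_t spitch,
                    std::size_t width, std::size_t height) {
    assert(width <= dpitch && width <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(d + r * dpitch, s + r * spitch, width);
    }
}

// Interleave two densely packed u32 arrays into an array of structs with
// two strided copies, one per field.
inline std::vector<XsCandidate> pack_xs(const std::vector<std::uint32_t>& keys,
                                        const std::vector<std::uint32_t>& vals) {
    assert(keys.size() == vals.size());
    std::vector<XsCandidate> out(keys.size());
    if (out.empty()) return out;
    auto* base = reinterpret_cast<unsigned char*>(out.data());
    // Each "row" is one 4-byte field; source rows are contiguous (pitch 4),
    // destination rows are one struct apart (pitch 8).
    copy_2d(base + offsetof(XsCandidate, match_info), sizeof(XsCandidate),
            keys.data(), sizeof(std::uint32_t), sizeof(std::uint32_t), keys.size());
    copy_2d(base + offsetof(XsCandidate, x), sizeof(XsCandidate),
            vals.data(), sizeof(std::uint32_t), sizeof(std::uint32_t), vals.size());
    return out;
}
```

Treating every element as a 4-byte row lets the copy itself do the interleaving, which is why no separate on-device pack buffers are required in the message's description.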

dependabot Bot added the dependencies (Pull requests that update a dependency file) and rust (Pull requests that update rust code) labels Apr 27, 2026

Jsewill merged commit 2ad7501 into main Apr 27, 2026
11 checks passed

dependabot Bot deleted the dependabot/cargo/keygen-rs/sha2-0.11.0 branch April 27, 2026 21:46
