build(deps): bump sha2 from 0.10.9 to 0.11.0 in /keygen-rs#5

Merged
Jsewill merged 1 commit into main from dependabot/cargo/keygen-rs/sha2-0.11.0 on Apr 27, 2026

@dependabot (bot) commented on behalf of github on Apr 27, 2026

Bumps sha2 from 0.10.9 to 0.11.0.

Bumps [sha2](https://github.com/RustCrypto/hashes) from 0.10.9 to 0.11.0.
- [Commits](RustCrypto/hashes@sha2-v0.10.9...sha2-v0.11.0)

---
updated-dependencies:
- dependency-name: sha2
  dependency-version: 0.11.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot (bot) added the dependencies (pull requests that update a dependency file) and rust (pull requests that update Rust code) labels on Apr 27, 2026
@Jsewill Jsewill merged commit 2ad7501 into main Apr 27, 2026
11 checks passed
@dependabot dependabot Bot deleted the dependabot/cargo/keygen-rs/sha2-0.11.0 branch April 27, 2026 21:46
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Verification on RTX 4090 with --tier minimal forced via
XCHPLOT2_STREAMING=1 surfaced two corrections:

  1. The post-cuts overall peak is 4228 MB at T3 sort phase
     (d_t3 2080 + d_frags_out 2080 + CUB scratch 68), NOT
     ~3700 MB at T3 match as the previous commit's README +
     dispatch message claimed. Cuts #1+#2 reduced T1/T2-sort
     GATHER peaks (sub-phases inside the sort phase) but the
     CUB DeviceRadixSort itself still allocates four cap-sized
     uint32 buffers under DoubleBuffer mode (~4 cap = 4160 MB
     at k=28), so T1/T2/T3 sort phases stay near the 4 GiB
     line. Per-phase peaks measured:

       Xs       : 4136 MB (4 cap u32 + CUB scratch 40)
       T1 sort  : 4180 MB (4 cap u32 + CUB scratch 20)
       T2 sort  : 4170 MB (4 cap u32 + CUB scratch 10)
       T3 match : ~3700 MB (cut #3 working as designed)
       T3 sort  : 4228 MB ← bottleneck

     Compact tier unchanged at 5200 MB peak (no cuts active).
     Drop from 5200 → 4228 (−972 MB / −19%) is real, just less
     than the README claimed.

  2. 4 GiB cards (≤ ~3.5 GiB free post-CUDA-context) still
     don't fit. Closing that gap requires the SYCL-branch's
     cuts #5 (CUB sort output tiling with host accumulators
     across T1/T2/T3 sort) and #6 (Xs gen+sort+pack tiling) —
     not yet ported to cuda-only. The minimal tier in its
     current form fits 5 GiB+ cards (RTX 2060, RX 6600 / 6600
     XT, RX 7600) comfortably with ~1 GiB headroom.
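The per-phase figures above can be reproduced with simple arithmetic. The following is a minimal sketch, not code from the repository: the names kCapU32Mb, sort_phase_peak_mb and fits are hypothetical, the 1040 MB per-buffer size is derived from the quoted "4 cap = 4160 MB", and the quoted MB figures are treated as MiB when compared with free memory, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// One cap-sized uint32 buffer at k=28, derived from "4 cap = 4160 MB".
constexpr std::uint32_t kCapU32Mb = 1040;

// Peak of a sort phase that holds four cap-sized uint32 buffers
// (keys and vals, each with its DoubleBuffer alternate) plus CUB scratch.
constexpr std::uint32_t sort_phase_peak_mb(std::uint32_t cub_scratch_mb) {
    return 4 * kCapU32Mb + cub_scratch_mb;
}

// True when a phase peak fits into the memory a card reports as free.
constexpr bool fits(std::uint32_t peak_mb, std::uint32_t free_mb) {
    return peak_mb <= free_mb;
}

static_assert(sort_phase_peak_mb(20) == 4180, "T1 sort");
static_assert(sort_phase_peak_mb(10) == 4170, "T2 sort");
static_assert(2080 + 2080 + 68 == 4228, "T3 sort: d_t3 + d_frags_out + scratch");
```

With roughly 3.5 GiB (3584 MiB) free, fits(4228, 3584) is false, which matches the conclusion that 4 GiB cards do not fit at this stage.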

Updates:

  src/host/BatchPlotter.cpp: kMinimalFloorBytes 3828 → 4356
    MiB (= 4228 measured peak + 128 MiB margin). The dispatch
    message now reads "4.25 GiB floor" instead of the
    overstated "3.74 GiB".

  README.md: minimal-tier description rewritten with measured
    peak (4228 MB), the new bottleneck phase (T3 sort), the
    accurate target hardware (5 GiB+ cards, not 4 GiB), and a
    pointer to cuts #5/#6 as the remaining work for genuine
    4 GiB-card support. Top-of-file streaming-floor summary
    updated 3.8 → 4.25 GiB.

  tools/xchplot2/cli.cpp: --tier help text updated to match.
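
The floor values in these updates follow from "measured peak + margin", converted to GiB for the dispatch message. A small sketch of that calculation, assuming MiB inputs; floor_bytes and format_gib are hypothetical helper names, not the functions in BatchPlotter.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Floor in bytes = measured peak + safety margin, both given in MiB.
constexpr std::uint64_t floor_bytes(std::uint64_t peak_mib, std::uint64_t margin_mib) {
    return (peak_mib + margin_mib) * kMiB;
}

// Format a byte count as GiB with two decimals, as a dispatch message might.
std::string format_gib(std::uint64_t bytes) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.2f GiB",
                  static_cast<double>(bytes) / (1024.0 * static_cast<double>(kMiB)));
    return buf;
}
```

For example, format_gib(floor_bytes(4228, 128)) yields "4.25 GiB", and the previous 3828 MiB floor formats as "3.74 GiB".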

Verified byte-identical at k=22 across plain / compact / minimal
(sha256 17dbf594…) and at k=28 across compact / minimal
(sha256 f42e62ad…). Plain pool and compact streaming paths
unchanged by this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Replaces the four cap-sized uint32/uint64 buffers that CUB
DeviceRadixSort needs in DoubleBuffer mode with cap/N per-tile
buffers + host accumulators across all three sort phases. Drops
each phase peak from ~4200 MB to 3640 MB at k=28 by paying ~7 sec
of host CPU merge time per plot (k=28 minimal: 12 → 22 s/plot).

Each tile's CUB sort lands either in the input slice (d_*_mi /
d_t3) or the cap/N alternate buffer; whichever side it lands on,
we D2H to a host pinned accumulator at the matching offset. After
all tiles, we free the per-tile device buffers and the input
buffer, then run a tree of pairwise stable in-place merges on
host (std::inplace_merge for keys-only; a hand-rolled
paired_merge_t* for the pairs cases). The result is a globally
sorted run that we H2D back to the output buffer that downstream
consumers expect.
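
The keys-only shape of this scheme can be sketched on the host alone. In the sketch below std::sort on each slice stands in for the per-tile CUB sort and the D2H copy; tiled_sort_then_merge is a hypothetical name, and the real code works on pinned host buffers rather than a std::vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort `keys` in `tiles` independent slices, then combine the sorted runs
// with a tree of pairwise std::inplace_merge calls.
void tiled_sort_then_merge(std::vector<std::uint32_t>& keys, std::size_t tiles) {
    assert(tiles >= 1);
    const std::size_t n = keys.size();
    const std::size_t tile = (n + tiles - 1) / tiles;  // ceil(n / tiles)
    if (tile == 0) return;                             // empty input
    std::uint32_t* p = keys.data();

    // Phase 1: sort each slice on its own (stand-in for the per-tile sort).
    for (std::size_t lo = 0; lo < n; lo += tile) {
        const std::size_t hi = std::min(lo + tile, n);
        std::sort(p + lo, p + hi);
    }

    // Phase 2: merge adjacent runs, doubling the run width each round.
    for (std::size_t width = tile; width < n; width *= 2) {
        for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
            const std::size_t mid = lo + width;
            const std::size_t hi = std::min(lo + 2 * width, n);
            std::inplace_merge(p + lo, p + mid, p + hi);
        }
    }
}
```

With N tiles the merge tree has about log2(N) rounds, each touching every element once, which is where the extra host CPU time per plot comes from.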

  T1 sort: scratch.h_keys_merged + scratch.h_t2_xbits as accumulators.
    h_keys_merged was already going to receive the T1 sorted-mi park
    after the device-side merge — cut #5 just writes it directly,
    skipping the round-trip. h_t2_xbits is dead at T1 sort time
    (T2 match staging hasn't filled it yet) so it doubles as the
    T1 vals accumulator. Final H2D rehydrates d_t1_merged_vals
    from h_t2_xbits for the cut #1 gather phase. d_t1_keys_merged
    stays null — h_keys_merged is already the parked form. Per-
    phase peak: 4180 → 3640 MB.

  T2 sort: scratch.h_keys_merged + a per-plot pinned h_t2_sort_vals
    (cap × u32, freed at end of phase). h_t2_xbits is NOT reused
    for T2 sort — cut #2's xbits gather still reads h_t2_xbits as
    the parked unsorted xbits stream, so an in-place reuse would
    corrupt that data. Mirror of T1 sort otherwise. Per-phase peak:
    4170 → 3640 MB.

  T3 sort: scratch.h_meta as the keys-only accumulator. h_meta's
    cut #3 lifetime as parked T2 meta ends at the H2D-back step
    that cut #3 emits before T3 sort entry, so it's reusable.
    SortKeys (no vals) → std::inplace_merge for the host merge
    step. Per-phase peak: 4228 → 3640 MB.
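
For the T1 and T2 cases, keys and vals live in separate accumulators, so std::inplace_merge cannot be used directly. The following is an illustrative stable merge over parallel arrays, not the repository's paired_merge_t* routine: paired_merge is a hypothetical name, and this sketch copies the left run to a temporary instead of merging in place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable merge of two adjacent sorted runs [lo, mid) and [mid, hi) stored as
// parallel key/value arrays. The left run is copied out first so that writes
// never overtake unread elements of the right run.
void paired_merge(std::vector<std::uint32_t>& keys, std::vector<std::uint32_t>& vals,
                  std::size_t lo, std::size_t mid, std::size_t hi) {
    assert(keys.size() == vals.size());
    assert(lo <= mid && mid <= hi && hi <= keys.size());
    const std::vector<std::uint32_t> lk(keys.data() + lo, keys.data() + mid);
    const std::vector<std::uint32_t> lv(vals.data() + lo, vals.data() + mid);

    std::size_t i = 0, j = mid, out = lo;
    while (i < lk.size() && j < hi) {
        // Take from the right run only when strictly smaller, so equal keys
        // keep their left-before-right order (stability).
        if (keys[j] < lk[i]) {
            keys[out] = keys[j]; vals[out] = vals[j]; ++j;
        } else {
            keys[out] = lk[i]; vals[out] = lv[i]; ++i;
        }
        ++out;
    }
    // Flush the rest of the left run; leftover right elements are already in place.
    for (; i < lk.size(); ++i, ++out) {
        keys[out] = lk[i]; vals[out] = lv[i];
    }
}
```

Each value travels with its key, which is what lets the merged vals stream be used later as gather indices.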

Plus a small init_u32_identity_offset kernel — the cap/N tile
sort needs its vals_in seeded with global positions
[tile_start..tile_end) so the post-merge d_merged_vals stream
indexes directly into the cap-sized d_t*_meta / d_t*_xbits.
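
What such a kernel computes can be shown with a host-side stand-in. This is a sketch under assumptions: identity_offset is a hypothetical name, and the real init_u32_identity_offset is a CUDA kernel whose signature is not shown in this commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Host stand-in for an identity-with-offset init: vals[i] = tile_start + i,
// so each tile's values carry global positions instead of tile-local ones.
std::vector<std::uint32_t> identity_offset(std::uint32_t tile_start, std::uint32_t tile_end) {
    assert(tile_start <= tile_end);
    std::vector<std::uint32_t> vals(tile_end - tile_start);
    std::iota(vals.begin(), vals.end(), tile_start);
    return vals;
}
```

For a tile covering [4, 8) this produces 4, 5, 6, 7; a plain 0-based identity would instead index into the wrong part of the cap-sized buffers after the merge.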

Verification (RTX 4090 at k=22 + k=28 strength=2):
  - k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
  - k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
  - k=28 minimal peak 4228 → 4136 MB (-92 MB; cut #5 saves on
    each sort phase but cap-sized Xs gen+sort+pack is the new
    overall bottleneck — cut #6 closes that gap).
  - Compact / plain paths unchanged (the new tile path is gated
    on scratch.gather_tile_count >= 2 + the per-tier scratch
    pinned slots being populated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jsewill pushed a commit that referenced this pull request Apr 28, 2026
Closes the last cap × 4 (uint32) hot-spot. The non-tiled Xs phase
peaks at four cap-sized uint32 buffers + CUB DoubleBuffer scratch
(~4136 MB at k=28); cut #6 generates once into 2 cap × u32, then
sorts in N tiles using cap/N alternate buffers, accumulates into
host pinned, and packs into d_xs without ever holding 4 cap on
device.

  src/host/GpuPipeline.cu Xs phase: when scratch.gather_tile_count
    >= 2 + scratch.h_meta != nullptr, take a tiled_xs branch:

    1. Allocate d_xs_keys_full + d_xs_vals_full (2 cap × u32).
    2. launch_xs_gen → fill them.
    3. Allocate one shared cap/N alternate pair (keys + vals) +
       CUB scratch sized for tile_cap_xs.
    4. For each tile in [0, N): CUB DoubleBuffer SortPairs over
       the slice, D2H sorted (key, val) pair to scratch.h_meta
       reinterpreted as a 2-cap u32 buffer (h_xs_keys at the
       first cap entries, h_xs_vals at the next cap — h_meta is
       cap × u64 = 2 cap × u32 of storage, with total_xs <= cap
       so both halves fit). h_meta gets overwritten by T1
       match's cut #4 D2H later, so reusing it through Xs is safe.
    5. Free per-tile alt + scratch + d_xs_keys_full +
       d_xs_vals_full (device-side usage drops to 0).
    6. Host paired stable merge (cut #5 shape) over h_xs_keys +
       h_xs_vals so the host buffers end up globally sorted by
       match_info with vals tiebreak.
    7. Allocate d_xs (cap × XsCandidateGpu = 2 cap) and pack via
       two cudaMemcpy2DAsync H2D copies — match_info field gets
       h_xs_keys at struct stride 8, x field gets h_xs_vals at
       the same stride. No separate d_xs_keys_b / d_xs_vals_b
       on-device pack pair needed.

    Per-phase peak: 2 cap (full keys+vals) + 2 cap/N (sort alt)
    + scratch ≈ 2.5 cap = 2570 MB at N=4. Final d_xs alloc is
    the post-merge peak at ~2 cap = 2080 MB. Plain / compact
    paths unchanged (gated on the same tier flags as the other
    cuts).
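
The strided pack in step 7 can be illustrated on the host. In the sketch below, copy_2d mirrors the (dst, dpitch, src, spitch, width, height) parameter order of cudaMemcpy2DAsync but uses plain memcpy; the XsCandidate layout (match_info first, then x, 8 bytes total) and the names copy_2d and pack_xs are assumptions for illustration, not the definitions in GpuPipeline.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout: two uint32 fields, 8 bytes per element.
struct XsCandidate {
    std::uint32_t match_info;
    std::uint32_t x;
};
static_assert(sizeof(XsCandidate) == 8, "sketch assumes an 8-byte struct");

// Host equivalent of a 2D copy: `height` rows of `width` bytes, read
// `spitch` bytes apart and written `dpitch` bytes apart.
void copy_2d(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
             std::size_t width, std::size_t height) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(d + row * dpitch, s + row * spitch, width);
}

// Interleave sorted keys and vals into an array of structs with two strided copies.
std::vector<XsCandidate> pack_xs(const std::vector<std::uint32_t>& keys,
                                 const std::vector<std::uint32_t>& vals) {
    assert(keys.size() == vals.size());
    std::vector<XsCandidate> out(keys.size());
    auto* base = reinterpret_cast<unsigned char*>(out.data());
    // One copy per field: 4-byte rows written at struct stride 8.
    copy_2d(base + offsetof(XsCandidate, match_info), sizeof(XsCandidate),
            keys.data(), sizeof(std::uint32_t), sizeof(std::uint32_t), keys.size());
    copy_2d(base + offsetof(XsCandidate, x), sizeof(XsCandidate),
            vals.data(), sizeof(std::uint32_t), sizeof(std::uint32_t), vals.size());
    return out;
}
```

Treating each 4-byte field as one "row" is what removes the need for a separate on-device pack pass and its extra buffers.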

  src/host/BatchPlotter.cpp: kMinimalFloorBytes 4356 → 3768 MiB
    (= 3640 measured peak + 128 MiB margin). Dispatch message
    "3.68 GiB floor".

  README.md: minimal-tier description rewritten as six layered
    cuts with measured per-phase peaks (Xs 2570, T1/T2 sort 3640,
    T3 match/sort 3640) and the new ~31 s/plot wall (vs ~12 s
    compact) reflecting the host-CPU merge overhead. Top-of-file
    streaming-floor summary 4.25 → 3.7 GiB. 4 GiB cards now
    targeted (with the standard "real 4 GiB hardware reports
    ~3.5 GiB free post-CUDA-context, please report actual fit"
    caveat).

  tools/xchplot2/cli.cpp: --tier help "minimal = ~3.7 GiB floor,
    fits 4 GiB".

Verification on RTX 4090 (XCHPLOT2_STREAMING=1 + --tier minimal,
POS2GPU_STREAMING_STATS=1):
  - k=22 plain / compact / minimal byte-identical (sha256 17dbf594…).
  - k=28 minimal byte-identical with k=28 compact (sha256 f42e62ad…).
  - k=28 minimal peak 4228 → 3640 MB; the bottleneck is now T1
    sort / T2 sort / T3 match / T3 sort all tied at 3640 MB
    (T2 match was already at this level via the existing N=8
    staging).
  - k=28 minimal wall: ~31 s/plot (vs ~12 s compact). The 2.6×
    slowdown matches the SYCL-branch's measured ~34 vs ~13 s
    for the same six-cut configuration on sm_89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>