
Claude/compare rustynum ndarray 5e p rn #150

Merged: AdaWorldAPI merged 8 commits into master from claude/compare-rustynum-ndarray-5ePRn on May 16, 2026

@AdaWorldAPI
Owner

No description provided.

claude added 5 commits May 16, 2026 20:12
… types

6-tier refactoring shopping list transforming hpc/ from a bolted-on
raw-slice library into a first-class ndarray citizen:

- Tier 1: Type bridges (Fingerprint/VsaVector ↔ ArrayView, BF16 → BlasFloat)
- Tier 2: Extension traits (HdcOps, Quantize, Prefilter, SimdMath)
- Tier 3: Backend wiring (core sum/mean → SIMD, unified detection)
- Tier 4: View factories (Arrow → ArrayView)
- Tier 5: Zip modernization (VML tails, parallel hamming)
- Tier 6: N-D axis reductions with SIMD dispatch

No deletions — raw-slice functions stay for FFI/embedded. Extension traits
add ndarray-native overloads that delegate to existing SIMD kernels.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
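
The Tier 2 idea above (extension traits that delegate to the existing raw-slice kernels) can be sketched with the standard library alone. This is a minimal sketch, not the repository's code: `hamming_raw`, the `Fingerprint` stand-in struct, and the single `hamming` method on `HdcOps` are hypothetical shapes chosen for illustration, and a real version would presumably implement the trait for ndarray views instead.

```rust
/// Raw-slice kernel in the style the plan keeps for FFI/embedded callers.
/// (Hypothetical name and signature.)
pub fn hamming_raw(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "fingerprints must have equal length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Hypothetical stand-in for an array-backed fingerprint type.
pub struct Fingerprint {
    words: Vec<u64>,
}

impl Fingerprint {
    pub fn new(words: Vec<u64>) -> Self {
        Self { words }
    }

    pub fn as_slice(&self) -> &[u64] {
        &self.words
    }
}

/// Extension trait: adds a typed method without touching the raw kernel.
pub trait HdcOps {
    fn hamming(&self, other: &Self) -> u32;
}

impl HdcOps for Fingerprint {
    fn hamming(&self, other: &Self) -> u32 {
        // Pure delegation: the trait owns no arithmetic of its own.
        hamming_raw(self.as_slice(), other.as_slice())
    }
}
```

Because the trait only forwards to the slice function, both entry points share one kernel, which is what "no deletions" implies.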
…ltering

The meta-level insight: don't iterate candidates and probe words,
iterate words and filter candidates. ndarray's F-order strides make
the same Array2<u64> serve both column scanning (K0/K1, contiguous)
and per-record access (K2, strided on survivors).

Key ideas:
- Column-major database: K0 becomes sequential 8-byte scan (8x cache util)
- Bitmask survivor propagation (no branching, pure SIMD mask narrowing)
- BF16 field-separated SoA (sign/exp/mantissa pre-decomposed at ingest)
- QualiaColumns (16 parallel dimension arrays for per-dim scan)
- TieredDatabase (K0 in L2, K1 in L3, K2 in DRAM)
- Arrow's columnar format aligns natively with this layout

Expected: 4-8x cascade throughput, 18x qualia dimension scans.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
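
The "iterate words and filter candidates" idea from this commit can be modelled in scalar Rust. The sketch below is an illustration under stated assumptions: `ColumnStore`, `cascade_survivors`, and the single `max_dist` budget are hypothetical, and nested `Vec`s stand in for the F-order `Array2<u64>` the commit describes.

```rust
/// Column-major store: `columns[w][r]` holds word `w` of record `r`,
/// so scanning one word across all records is a contiguous read.
pub struct ColumnStore {
    pub columns: Vec<Vec<u64>>,
    pub n_records: usize,
}

/// Iterate words in the outer loop and narrow a survivor bitmask in the
/// inner loop. Returns the indices of records whose accumulated Hamming
/// distance to `query` stays within `max_dist`.
pub fn cascade_survivors(store: &ColumnStore, query: &[u64], max_dist: u32) -> Vec<usize> {
    assert_eq!(store.columns.len(), query.len());
    let n = store.n_records;
    let mut dist = vec![0u32; n];
    // One bit per record; every record starts alive.
    let mut alive = vec![u64::MAX; (n + 63) / 64];

    for (col, &q) in store.columns.iter().zip(query) {
        assert_eq!(col.len(), n);
        for (r, &word) in col.iter().enumerate() {
            dist[r] += (word ^ q).count_ones();
            // Branch-free narrowing: clear the bit once the budget is exceeded.
            let dead = (dist[r] > max_dist) as u64;
            alive[r >> 6] &= !(dead << (r & 63));
        }
    }

    (0..n)
        .filter(|&r| (alive[r >> 6] >> (r & 63)) & 1 == 1)
        .collect()
}
```

This model still touches dead records on later words; a tiered kernel would presumably skip mask blocks that are already zero, which is where the widening-reads-as-population-shrinks behaviour would come from.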
Merges three inputs into one executable plan:
- REFACTOR_HPC_INTEGRATION.md (type bridges, extension traits)
- SOA_KERNEL_ARCHITECTURE.md (columnar cascade, field separation)
- Transformer session feedback (API conventions, namespace, codegen)

8 waves: conventions → macros → bridges → traits → backend → SoA → namespace → release
Critical path: 15 days serial, 10 days with parallelism.

The SoA meta-insight applies at 6 levels:
  1. Database layout (column-per-word)
  2. API surface (_into forms = caller-controlled layout)
  3. Module structure (hpc/cog/ext/io = columns of concern)
  4. ndarray types (F-order strides encode the duality)
  5. Cascade pipeline (widen reads as population shrinks)
  6. Arrow/Lance (storage IS the compute layout, zero ETL)

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
…contract

Fourth analysis pass adds Wave 1 (SIMD Consumer Primitives) — 5 P0
primitives blocking 158 raw-intrinsic violations in lance-graph:

- I8x16::from_i4_packed_u64 (closure-batch pattern)
- I8x16::saturating_abs (VPABSB correction for i8::MIN)
- U16x8::gather_u16 (quantized distance tables)
- prefetch_read_t0/t1/t2 (hint wrappers)
- U64x8::popcnt + xor_popcount (HDC hamming)

Sequence now 9 waves (was 8), dependency graph updated,
lance-graph unblocked at day 6 of 12-day plan.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
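
As a scalar reference for the first primitive in the list, the sketch below unpacks sixteen signed 4-bit values from one `u64`. The nibble order (lane `i` taken from bits `4*i..4*i+4`) and the two's-complement interpretation are assumptions; the commit message does not state them, and the "closure-batch pattern" is not modelled here.

```rust
/// Scalar model of unpacking 16 signed nibbles into i8 lanes.
/// Assumed layout: lane i comes from bits 4*i .. 4*i+4 of `packed`.
pub fn from_i4_packed_u64(packed: u64) -> [i8; 16] {
    let mut lanes = [0i8; 16];
    for (i, lane) in lanes.iter_mut().enumerate() {
        let nibble = ((packed >> (4 * i)) & 0xF) as u8;
        // Move the nibble to the top of the byte, then shift back
        // arithmetically so its sign bit is extended.
        *lane = ((nibble << 4) as i8) >> 4;
    }
    lanes
}
```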
Canonical spec for 5 P0 SIMD primitives lance-graph requires:
- I8x16::from_i4_packed_u64 (closure-batch)
- I8x16::saturating_abs (VPABSB + VPMINUB clamp)
- U16x8::gather_u16 (with Codex P2 OOB correction for 32-bit gather)
- prefetch_read_t0/t1/t2
- U64x8::popcnt + xor_popcount (NEON: vpaddlq cascade, NOT vaddvq_u8)

Includes Codex P2 corrections for gather OOB and NEON popcount
lane contamination.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
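
The `saturating_abs` correction can be checked against a scalar model. VPABSB leaves `i8::MIN` as the bit pattern `0x80`; reading that byte as unsigned (128) and taking an unsigned minimum with 127 yields the saturated result. The function name and the `[i8; 16]` stand-in for `I8x16` below are assumptions for illustration.

```rust
/// Scalar model of the VPABSB + VPMINUB sequence on 16 lanes.
pub fn saturating_abs_lanes(v: [i8; 16]) -> [i8; 16] {
    v.map(|x| {
        // Models VPABSB: the wrapping absolute value keeps -128 as 0x80.
        let wrapped = x.wrapping_abs() as u8;
        // Models VPMINUB against 0x7F: unsigned min clamps 128 down to 127.
        wrapped.min(127) as i8
    })
}
```

For every possible input this agrees with the standard library's `i8::saturating_abs`, which makes an exhaustive 256-value test cheap.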

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4321fcf33c


// Safe: u8 slice reinterpreted as u64 (alignment checked)
let words = unsafe {
std::slice::from_raw_parts(
bytes.as_ptr() as *const u64,

P2: Avoid unaligned u64 views over u8 channels

For CogRecord channels backed by Array<u8>, the buffer is only guaranteed to be byte-aligned, so bytes.as_ptr() as *const u64 can create an unaligned &[u64] and trigger UB whenever the allocation is not 8-byte aligned. If this plan is implemented for channel_as_words, use an aligned storage type, align_to with fallback copying, or unaligned reads instead of returning an ArrayView1<u64> over the raw u8 allocation.
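
One way to realize the "align_to with fallback copying" option is sketched below using only the standard library. `words_from_bytes` is a hypothetical helper name, and returning a `Cow` (borrowed when the buffer happens to be aligned, owned otherwise) is one possible design, not necessarily what `channel_as_words` should expose.

```rust
use std::borrow::Cow;

/// View a byte buffer as u64 words without ever creating an unaligned
/// `&[u64]`. Returns `None` if the length is not a multiple of 8.
pub fn words_from_bytes(bytes: &[u8]) -> Option<Cow<'_, [u64]>> {
    if bytes.len() % 8 != 0 {
        return None;
    }
    // SAFETY: every bit pattern is a valid u64, and `align_to` only places
    // correctly aligned elements in the middle slice.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u64>() };
    if prefix.is_empty() && suffix.is_empty() {
        // The whole buffer was aligned: zero-copy view.
        return Some(Cow::Borrowed(words));
    }
    // Misaligned buffer: copy word by word (native byte order).
    let owned = bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_ne_bytes(buf)
        })
        .collect();
    Some(Cow::Owned(owned))
}
```

The copy path costs one allocation, so a hot path would likely prefer aligned storage at ingest; the fallback only keeps the function sound for arbitrary `u8` allocations.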


{
fn softmax_axis(&self, axis: Axis) -> Array<A, Ix2> {
let mut result = self.to_owned();
for mut row in result.axis_iter_mut(axis) {

P2: Iterate lanes for softmax along the requested axis

For a [batch, features] matrix, axis_iter_mut(Axis(1)) iterates columns, not rows, so the documented batch_logits.softmax_axis(Axis(1)) case would normalize each feature across the batch instead of normalizing each row over features. Use lanes_mut(axis)/lanes(axis) or otherwise iterate the subviews whose elements lie along the requested axis; the same issue appears in the log-softmax loop below.
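
The lane semantics can be shown with a standard-library model of a row-major `[rows, cols]` matrix. This is a sketch of the intended behaviour, not ndarray code: the flat-slice signature is an assumption, and an ndarray version would presumably iterate `lanes_mut(axis)` instead of computing strides by hand.

```rust
/// Softmax over every lane that runs along `axis` of a row-major
/// `[rows, cols]` matrix stored in a flat slice.
pub fn softmax_axis(data: &mut [f64], rows: usize, cols: usize, axis: usize) {
    assert_eq!(data.len(), rows * cols);
    assert!(axis < 2, "only 2-D matrices are modelled");
    // A lane along axis 1 is one row (elements are adjacent);
    // a lane along axis 0 is one column (elements are `cols` apart).
    let (n_lanes, lane_len, lane_step, elem_step) = if axis == 1 {
        (rows, cols, cols, 1)
    } else {
        (cols, rows, 1, cols)
    };
    for lane in 0..n_lanes {
        let idx = |i: usize| lane * lane_step + i * elem_step;
        // Subtract the lane maximum for numerical stability.
        let max = (0..lane_len)
            .map(|i| data[idx(i)])
            .fold(f64::NEG_INFINITY, f64::max);
        let mut sum = 0.0;
        for i in 0..lane_len {
            let e = (data[idx(i)] - max).exp();
            data[idx(i)] = e;
            sum += e;
        }
        for i in 0..lane_len {
            data[idx(i)] /= sum;
        }
    }
}
```

With `axis = 1` each row of a `[batch, features]` matrix sums to 1, which is the documented `softmax_axis(Axis(1))` case.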


/// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
pub fn popcnt(self) -> Self;
/// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
pub fn xor_popcount(self, other: Self) -> u64;

P2: Resolve xor_popcount return type contract

This wave spec defines U64x8::xor_popcount as a horizontal-sum u64, but the new binding contract in .claude/knowledge/vertical-simd-consumer-contract.md lines 167-173 defines the same method as returning Self with per-lane popcounts. Implementing either signature will leave one of the newly added docs and its consumers wrong, so make one API canonical or split these into separate per-lane and horizontal-sum methods.
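
The "split into separate methods" option could look like the scalar model below. The wrapper over `[u64; 8]` and the name `xor_popcnt_lanes` are hypothetical; neither document defines them, and keeping `xor_popcount` as the horizontal-sum form is only one of the two possible resolutions.

```rust
/// Scalar stand-in for an 8-lane u64 vector.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U64x8(pub [u64; 8]);

impl U64x8 {
    /// Lane-wise population count (each lane in 0..=64).
    pub fn popcnt(self) -> Self {
        U64x8(self.0.map(|w| w.count_ones() as u64))
    }

    /// Per-lane form: XOR, then lane-wise popcount. (Hypothetical name.)
    pub fn xor_popcnt_lanes(self, other: Self) -> Self {
        let mut x = self.0;
        for (a, b) in x.iter_mut().zip(other.0.iter()) {
            *a ^= *b;
        }
        U64x8(x).popcnt()
    }

    /// Horizontal-sum form: total Hamming distance across all lanes.
    pub fn xor_popcount(self, other: Self) -> u64 {
        self.xor_popcnt_lanes(other).0.iter().sum()
    }
}
```

Defining the horizontal sum in terms of the per-lane method keeps the two contracts consistent by construction.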


claude added 3 commits May 16, 2026 21:18
Resolves add/add conflict on .claude/knowledge/vertical-simd-consumer-contract.md
by taking master's version (PR #149 — the polished READ BY / P0 TRIGGERS form
with agent routing). CLAUDE.md gains the W1a contract hard rule pointer
from master.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
…detection

UNIFIED_REFACTOR_SEQUENCE.md:
- Add "Dispatch Model" section documenting compile-time cfg(target_feature) routing
- Wave 1 contract: replace "three backends" with correct per-file impl rule
- Replace rules 6-7 with: no is_x86_feature_detected, no #[target_feature(enable)]
- Wave 5: reframe as "delete dead detection code" not "unify runtime singleton"
- Add rules 9-10 (don't touch simd_avx2.rs, don't reach for rayon)

REFACTOR_HPC_INTEGRATION.md:
- §3.2: replace LazyLock<CpuCaps> proposal with "delete 877 lines of dead code"
- Architecture diagram: "backend dispatch" → "cfg(target_feature) routing"
- Phase C execution order updated to match

Keeps: all type bridges, extension traits, SoA cascade, Wave sequencing,
W1a primitive specs, VPABSB correction, palette-256 priority, NEON 2×128-bit,
Arrow integration, codegen macros, namespace restructure, effort estimates.

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
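
Compile-time routing of the kind this commit describes can be sketched as follows. The `BACKEND` constant and the `hamming` function are hypothetical; each `cfg` arm would select its own per-file implementation in practice, whereas here only a label differs and all targets share one scalar body.

```rust
// Exactly one of these items exists in any given build, chosen by the
// target features the compiler was invoked with (no runtime detection).
#[cfg(target_feature = "avx2")]
pub const BACKEND: &str = "avx2";

#[cfg(all(target_feature = "neon", not(target_feature = "avx2")))]
pub const BACKEND: &str = "neon";

#[cfg(not(any(target_feature = "avx2", target_feature = "neon")))]
pub const BACKEND: &str = "scalar";

/// Hamming distance over u64 words. A real backend file would replace
/// this body with intrinsics; the scalar form is portable everywhere.
pub fn hamming(a: &[u64], b: &[u64]) -> u64 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones() as u64)
        .sum()
}
```

Since the selection happens at compile time, there is no detection state to cache, which is consistent with deleting the runtime-detection code rather than unifying it.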
…aints

Remove the "Dispatch Model" explainer section, excessive rule expansions,
and paragraph-length justifications for obvious things. What remains:
- Known constraints list (7 bullets from prior sessions)
- Terse per-primitive contract (7 rules, no essays)
- 7 don'ts instead of 10 (dropped the ones restating known constraints)
- §3.2 reduced to 3 lines (dead code, delete it)
- Architecture diagram without dispatch lecture

https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
@AdaWorldAPI merged commit 9f3cf22 into master on May 16, 2026
15 checks passed