[Math] New QIPC ops for single-threaded linalg #683

Open
hughperkins wants to merge 26 commits into main from hp/new-qipc-ops-linalg

Conversation

@hughperkins
Collaborator


Adds a free function quadrants.lang.matrix_ops.frobenius_inner(A, B) and a
matching Matrix.frobenius_inner(other) method, computing
⟨A, B⟩ = Σ_ij A_ij B_ij. Mirrors the existing norm_sqr function — they are
the same operation when A == B.
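A quick numpy reference for the contract (the quadrants versions run per thread inside a kernel; this sketch is just the math):

```python
import numpy as np

def frobenius_inner(A, B):
    # <A, B> = sum_ij A_ij * B_ij (reduces to norm_sqr when B is A)
    return float(np.sum(A * B))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(frobenius_inner(A, B))   # 1*5 + 2*6 + 3*7 + 4*8 = 70.0
print(frobenius_inner(A, A))   # same value as norm_sqr(A) = 30.0
```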

Tests parametrise over arch=qd.gpu (CUDA / Metal / Vulkan / AMDGPU) for both
f32 and f64, on square sizes 2, 3, 6, 9, 12 (matching qipc's IPC needs) plus
rectangular shapes 9×12, 12×3, 2×4 to cover the non-square use cases (Hessian
blocks in qipc).

Closes the "Frobenius inner product" gap row in
perso_hugh/doc/qipc/qipc_gaps_linalg.md.
…12 · 12×9)

Adds test_matmul_chain_qipc_sizes_{f32,f64} verifying that the largest matmul
chain qipc's IPC pipeline needs (9×12 · 12×12 · 12×9 → 9×9) compiles cleanly
and matches numpy. Both the chained form (A @ B @ C) and the staged form
(AB = A @ B; AB @ C) are checked, since the chained form may stress the
backend codegen differently (intermediate has 1296 FMAs unrolled).
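The shape contract being tested, as a numpy sketch (sizes taken from the commit message; the quadrants kernel specifics are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((9, 12))
B = rng.standard_normal((12, 12))
C = rng.standard_normal((12, 9))

chained = A @ B @ C   # single expression, as in the kernel's A @ B @ C
AB = A @ B            # staged form
staged = AB @ C

assert chained.shape == (9, 9)
assert np.allclose(chained, staged)
```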

Parametrised over arch=qd.gpu so CUDA / Metal / Vulkan / AMDGPU all run.
Quadrants imposes no enforced size cap on matmul; the "n·m ≤ 32" warning
in qipc's design doc is qipc-side only.

Closes the "Matrix.__matmul__ correctness at large sizes" gap row in
perso_hugh/doc/qipc/qipc_gaps_linalg.md (gap (verify) → ✅).
qipc's ARAP rotation R = U @ V.T must be a proper rotation (det R = +1) for
any input deformation gradient F. The libuipc convention enforced by qipc is
det(U) = det(V) = +1 always, with the sign of det(F) absorbed into σ
(σ may have a negative entry when det(F) < 0).

This test verifies that quadrants' qd.svd at 3×3 follows the same convention,
across the cases qipc actually exercises:
  - identity (det = +1), reflection (det = -1), generic positive- and
    negative-det matrices, SPD, near rank-deficient, near-degenerate
    singular values.
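The sign convention can be sketched on top of numpy's SVD (this is an illustrative reference, not the quadrants kernel code): flip the last column of whichever factor carries a reflection and absorb the sign into σ, so that reconstruction is preserved while det(U) = det(V) = +1.

```python
import numpy as np

def svd_det_plus_one(F):
    # Force det(U) = det(V) = +1; the sign of det(F) moves into sigma.
    U, s, Vt = np.linalg.svd(F)
    V = Vt.T.copy()
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1.0   # flipping a column of U and negating the matching
        s[-1] *= -1.0      # sigma entry leaves U @ diag(s) @ V.T unchanged
    if np.linalg.det(V) < 0:
        V[:, -1] *= -1.0
        s[-1] *= -1.0
    return U, s, V

F = np.diag([1.0, 1.0, -1.0])   # a reflection: det(F) = -1
U, s, V = svd_det_plus_one(F)
# det(U) = det(V) = +1, sigma carries one negative entry,
# and U @ diag(s) @ V.T still reconstructs F
```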

Parametrised over arch=qd.gpu × {f32, f64}.
…pivoting

Adds _inverse_lu in matrix_ops.py: in-place Gauss elimination with partial
pivoting, fully unrolled (static range for all loop bounds, runtime int for
pivot-row index). The inverse function dispatches to it for N >= 5; sizes
1–4 keep the existing closed-form cofactor-expansion paths. Precondition
relaxed from dim_lt(0, 5) to dim_lt(0, 13).

The implementation maintains a working copy `a` for in-place LU and a
parallel matrix `b` initialised to identity that receives the same row
swaps + row eliminations; at the end `b = L⁻¹ P` and the inverse is read
column-by-column by back-solving `U x = b[:, c]`.
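The scheme reads, in plain numpy (a sketch of the steps described above, not the unrolled quadrants code):

```python
import numpy as np

def inverse_lu(A):
    # In-place Gauss elimination with partial pivoting on a working copy
    # `a`, mirroring every row swap / elimination into `b` (starts as I);
    # then the inverse is read column by column by back-solving U x = b[:, c].
    n = A.shape[0]
    a = A.astype(float).copy()
    b = np.eye(n)
    for k in range(n):
        piv = k + np.argmax(np.abs(a[k:, k]))   # partial pivot
        if piv != k:
            a[[k, piv]] = a[[piv, k]]
            b[[k, piv]] = b[[piv, k]]
        for i in range(k + 1, n):
            f = a[i, k] / a[k, k]
            a[i, k:] -= f * a[k, k:]
            b[i] -= f * b[k]
    inv = np.empty((n, n))
    for c in range(n):                           # back-substitution per column
        x = b[:, c].copy()
        for i in range(n - 1, -1, -1):
            x[i] = (x[i] - a[i, i + 1:] @ x[i + 1:]) / a[i, i]
        inv[:, c] = x
    return inv

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)   # diagonally dominant
print(np.allclose(inverse_lu(M) @ M, np.eye(6)))    # True
```

A matrix with a zero in `[0, 0]` (e.g. a permutation matrix) exercises the pivoting path specifically, matching the permuted-upper-triangular test factory.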

Tests at tests/python/test_linalg.py::test_inverse_large_{f32,f64} cover
N ∈ {5..12} × {diagonally-dominant, SPD, permuted-upper-triangular}. The
permuted-upper-triangular factory has a zero in [0, 0] so it specifically
exercises the pivoting path. Tolerance scales with condition number ×
machine epsilon (50 × cond × eps + dtype floor).

Parametrised over arch=qd.gpu so CUDA / Metal / Vulkan / AMDGPU all run.
Adds python/quadrants/_funcs_sym_eig_general.py with sym_eig_general(A, dt)
and make_spd(A, dt). Implements Eigen 3.4's SelfAdjointEigenSolver compute()
path: Householder tridiagonalisation + implicit QR with Wilkinson shift +
ascending sort. Direct port of qipc/_src/core/linalg/evd.py — qipc can drop
its private copy once this lands.

qd.sym_eig now dispatches: N=2/3 keep the existing closed-form
_sym_eig{2,3}x3 paths; 4 ≤ N ≤ 12 → sym_eig_general. Also exposes
qd.make_spd(A) which projects a symmetric matrix to the nearest PSD matrix
in Frobenius norm by clamping eigenvalues to ≥ 0 — qipc's per-element
Hessian projection.
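The projection itself is one eigendecomposition plus a clamp; a numpy reference (illustrative, not the quadrants implementation):

```python
import numpy as np

def make_spd(A):
    # Nearest-PSD projection in Frobenius norm: clamp eigenvalues to >= 0.
    # Assumes A is symmetric (as in qipc's per-element Hessian projection).
    w, Q = np.linalg.eigh(A)
    return (Q * np.maximum(w, 0.0)) @ Q.T   # Q diag(max(w, 0)) Q^T

A = np.diag([2.0, -1.0, 3.0])
P = make_spd(A)   # negative eigenvalue clamped: diag(2, 0, 3)
```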

Tests at tests/python/test_eig.py:
  - test_sym_eig_general_{f32,f64}: N ∈ {4, 5, 6, 9, 12} × {random symmetric,
    SPD, indefinite, diagonal, repeated-eigenvalues}. Verifies eigenvalues
    match numpy, eigenvectors are orthogonal, and Q diag(λ) Qᵀ ≈ A.
  - test_make_spd_{f32,f64}: N ∈ {4, 6, 9, 12} × {indefinite, random, SPD}.
    Verifies symmetry, PSD-ness (min eig ≥ -tol), and that the result
    matches numpy-reference clamping.

Parametrised over arch=qd.gpu so CUDA / Metal / Vulkan / AMDGPU all run.
The previous Householder + implicit-QR port produced wrong eigenvalues for
N>3 (off-diagonal residue ~50% of the input scale), and the algorithm's
many static branches did not lend themselves to debugging via printf.

Switch sym_eig_general (and the make_spd that builds on it) to cyclic
Jacobi. The Jacobi loop is fully unrolled with static(range): runtime range
loops in @func that return values were observed to iterate only once on
this branch, so static unrolling of MAX_SWEEPS=6 sweeps is what actually
reduces the off-diagonals across passes. The 6-sweep budget gives ~6 digits
in f32 and ~12 digits in f64 for N≤9 on the test factories.
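A minimal numpy sketch of cyclic Jacobi with a fixed sweep budget (illustrative; the quadrants version is fully unrolled in registers):

```python
import numpy as np

def jacobi_eig(A, max_sweeps=12):
    # Cyclic Jacobi: each sweep visits every (p, q) pair and applies a
    # rotation zeroing A[p, q]; V accumulates the eigenvectors.
    n = A.shape[0]
    a = A.astype(float).copy()
    V = np.eye(n)
    for _sweep in range(max_sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p, q]) < 1e-30:
                    continue
                theta = 0.5 * np.arctan2(2.0 * a[p, q], a[q, q] - a[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                a = J.T @ a @ J
                V = V @ J
    w = np.diag(a)
    order = np.argsort(w)        # ascending, as qd.sym_eig returns
    return w[order], V[:, order]
```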

N=12 (used by qipc's 12×12 contact Hessians) is dropped from this path:
the fully static-unrolled Jacobi at N=12 with 6 sweeps does not finish
compiling within 15 minutes on CUDA. Either a blocked/partially-runtime
implementation, or porting qipc's exact `sym_evd` template-mutation pattern
(ndarray of compound types passed via template()), is needed to recover
N=12 — tracked as a follow-up.

- _MAX_SWEEPS = 6 (sweep / (p,q) / per-row updates all static).
- 2-pass right/left rotation (no `r != p, q` static guards) keeps the
  unrolled body lean enough to compile for N≤9.
- sym_eig() now raises for N≥10 with a follow-up note.
- test_eig.py: parametrize sym_eig_general / make_spd at N ∈ {4, 5, 6, 9}
  (drop 12).
Hugh's catch: the previous sweep loop wasn't broken — it was being
parallelized. quadrants @qd.kernel reaches into a callee @qd.func and
parallelizes the func's outermost runtime range loop when the kernel
itself doesn't have one. The sweep loop saw _MAX_SWEEPS threads each
running one sweep on a fresh copy of the locals, with last-write-wins.
That's why static(range) was the only thing that "worked", and that's
what blew up the compile time at N=12.
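A toy numpy illustration (not quadrants code) of why parallelizing the sweep loop breaks convergence: under last-write-wins, each thread runs one sweep from the same starting locals, so the surviving result is a single sweep rather than six chained ones.

```python
import numpy as np

def sweep(a):
    # One cyclic-Jacobi-style sweep over the strict upper triangle.
    a = a.copy()
    n = a.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            theta = 0.5 * np.arctan2(2.0 * a[p, q], a[q, q] - a[p, p])
            c, s = np.cos(theta), np.sin(theta)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q], J[q, p] = s, -s
            a = J.T @ a @ J
    return a

def offdiag(a):
    return np.abs(a - np.diag(np.diag(a))).max()

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = A + A.T

seq = A
for _ in range(6):
    seq = sweep(seq)   # sequential: sweeps chain, off-diagonals vanish

par = sweep(A)         # "parallelized" loop: every thread sweeps the
                       # ORIGINAL locals once; last write wins, so only
                       # a single sweep's worth of reduction survives
```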

Fix is a one-liner in the test kernel (for _tid in range(1):) plus
flipping the sweep loop back to runtime range:

  for _sweep in range(_MAX_SWEEPS):   # was static(range(_MAX_SWEEPS))

With that, N=12 compiles in ~2 minutes (vs not finishing in 15 min
before) and the cap in qd.sym_eig() goes back up to 12, restoring qipc's
target sizes.

- _MAX_SWEEPS = 12 (was 6) — runtime, so cost is per-call, not compile.
- qd.sym_eig() supports N ∈ {2..12}; N≥13 raises.
- test_eig.py: re-add N=12 to the parametrize lists; wrap kernel calls
  in `for _tid in range(1):`.
- Docstring on _funcs_sym_eig_general points at the gotcha note in
  perso_hugh that explains the parallel-for behavior.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40af9db7ab


  eigvecs[i, j] = zero
  eigvecs[i, i] = one

  for _sweep in range(_MAX_SWEEPS):
P1 Badge Make Jacobi sweeps sequential inside sym_eig

For callers that use qd.sym_eig in the normal scalar-kernel style, e.g. eigvals[None], eigvecs[None] = qd.sym_eig(A[None]) without a dummy outer for, this runtime range is parallelized by the kernel machinery instead of executing the 12 sweeps sequentially; the new implementation's own docstring calls out that this produces incorrect iteration semantics. That means the newly supported n >= 4 path can return a non-converged eigensystem unless every caller restructures its kernel, unlike the existing 2x2/3x3 API. Since _MAX_SWEEPS is a fixed constant, this should be hidden inside the implementation rather than imposed as a caller requirement.




-__all__ = ["randn", "polar_decompose", "eig", "sym_eig", "svd", "solve"]
+__all__ = ["randn", "polar_decompose", "eig", "sym_eig", "make_spd", "svd", "solve"]

P2 Badge Update the API snapshot for public additions

Adding make_spd to _funcs.__all__ exposes qd.make_spd through the top-level wildcard import, and this same commit also exposes Matrix.frobenius_inner; however tests/python/test_api.py::test_api compares dir(qd) and dir(qd.Matrix) against hard-coded lists that contain neither name. The existing CPU API snapshot test will therefore fail even though the new APIs are intentional, so please update the expected API lists with these additions.


Comment on lines +622 to +623
  if A.n <= 12:
      return sym_eig_general(A, dt)

P2 Badge Keep user docs in sync with the expanded API

This expands the public qd.sym_eig contract from 2x2/3x3 to sizes up to 12, but no docs/ files were changed; AGENTS.md specifically requires docs updates for public API or usage changes. The current user guide still advertises qd.sym_eig as 2x2/3x3-only, and related changes in this commit also leave the documented inverse() size cap and new public helpers stale, so users will get contradictory guidance.


@github-actions

Tests (test_eig.py): four new contract / edge-case tests for the
N≥4 cyclic-Jacobi path, all parametrized over qd.gpu × N ∈ {4,6,9,12}
and wrapped in the required `for _tid in range(1):` outer loop:

  - test_sym_eig_alpha_identity_f64 — α·I (incl. α=0) at every N to
    cover the fully-degenerate / repeated-eigenvalue case.
  - test_make_spd_idempotent_f64 — make_spd(make_spd(A)) ≈ make_spd(A)
    over indefinite / negative-definite / SPD inputs.
  - test_make_spd_negative_definite_zero_f64 — all-negative-eig
    inputs project to the zero matrix.
  - test_sym_eig_above_cap_raises — N=13 raises with a clear
    "up to 12" message instead of silently miscompiling.

30/30 pass on cluster (CUDA + Vulkan + CPU, ~32 min) and amddesktop
(AMDGPU + Vulkan + CPU, ~10 min).

Docs (user_guide/decompositions.md): updated the table to include
N up to 12 for qd.sym_eig and add qd.make_spd; documented the
cyclic-Jacobi path, the for-_tid kernel pattern requirement, and
compile/runtime cost characteristics. Replaced the inlined make_spd
3×3 snippet with a real qd.make_spd example for N≥4 (kept the 3×3
recipe for users below the cap). Added a Frobenius inner-product
section to user_guide/matrix_vector.md and updated the Matrix.inverse
size cap from 4×4 to 12×12.
@github-actions

- decompositions.md → linalg_per_thread.md (rename + retitle "Per-thread
  linear algebra"; updated opening paragraph to make the per-thread,
  non-cooperative semantics explicit; updated "Related" cross-links).
- matrix_vector.md → split: type / storage / declaration content stays
  here; element-wise + closed-form ops (arithmetic, dot/cross, norm,
  transpose/det/trace/inverse, frobenius_inner, mat-vec/mat-mat multiply)
  move to matrix_vector_per_thread.md.
- index.md toctree: replaces `decompositions` with
  `matrix_vector_per_thread` + `linalg_per_thread` under Core concepts.

The "_per_thread" suffix marks pages whose ops run per thread in
registers with no cross-thread cooperation (no shared memory, no syncs,
no warp/subgroup primitives). Cross-thread / sparse linalg under
qd.linalg.* is a separate axis and is not covered yet.

No source / test changes — pure docs reorg.
The dispatch shape-cap exceptions in qd.polar_decompose / qd.svd /
qd.eig / qd.solve and the dim assertion in qd.solve all said "2D
matrix" / "3D matrix" when they meant "2×2 / 3×3 matrix". "2D matrix"
conventionally means "rank-2 tensor", which is true of every matrix —
so the message was actively misleading.

Updated to:
  - "Polar decomposition only supports 2×2 and 3×3 matrices."
  - "SVD only supports 2×2 and 3×3 matrices."
  - "Eigen solver only supports 2×2 matrices."
  - "Solver only supports 2×2 and 3×3 matrices."
  - assert "Only 2×2 and 3×3 matrices are supported"

Also updated the one test that pinned on the old wording
(test_ast_refactor.py::test_raise) and dropped the FIXME from the
linalg_per_thread.md doc.

Verified locally on amddesktop: test_raise passes 3/3 (cpu + amdgpu +
vulkan).
@github-actions

Drops the caller-pattern requirement that qd.sym_eig / qd.make_spd be
called from inside a top-level `for _tid in range(N):` in the
@qd.kernel. The sweep loop in _funcs_sym_eig_general.sym_eig_general
now opens with `qd.loop_config(serialize=True)`, which pins its
parallelism to 1 — so even when the kernel parallelizer reaches into
the callee @qd.func (the underlying gotcha) it serializes the sweep
loop and iterates the requested MAX_SWEEPS times on a single thread.

Tests (test_eig.py): removed all `for _tid in range(1):` wrappers from
the kernels in test_sym_eig_general / test_make_spd /
test_sym_eig_alpha_identity / test_make_spd_idempotent /
test_make_spd_negative_definite_zero / test_sym_eig_above_cap_raises.
Each test now calls `qd.sym_eig(...)` or `qd.make_spd(...)` as a plain
single-element kernel body.

Verified:
  - amd (CPU + AMDGPU + Vulkan): full test_eig.py 211/211 pass in 18:22.
  - MRE without any wrapper at N ∈ {4,6,9,12} on amdgpu / vulkan / cpu:
    all pass with eig err ≤ 1.7e-14, orth err ≤ 5e-15 (f64).
  - cluster (CPU + CUDA + Vulkan): 126/126 of the new sym_eig_general
    + make_spd test parametrizations pass before Slurm step time-out
    (no failures or crashes from this code path).

Docs (linalg_per_thread.md): removed the "Caller pattern" subsection
and the top-level "needs a top-level `for` in the calling kernel"
bullet — the constraint no longer exists.
@github-actions

CI test_api[arch=arm64-quadrants] and test_api[arch=arm64-Matrix] were
failing because the new public symbols added on hp/new-qipc-ops-linalg
(`qd.make_spd` module-level entry point and the
`Matrix.frobenius_inner` method) weren't listed in the expected API
manifests in tests/python/test_api.py.

Added:
  - "make_spd" in user_api[qd] (alphabetical, between "loop_config" and
    "math").
  - "frobenius_inner" in _get_expected_matrix_apis() (between "fill"
    and "identity").

Verified locally: test_api[arch=x64-quadrants] and
test_api[arch=x64-Matrix] both pass on amd. The other test_api
parametrizations failing locally are pre-existing API drift between
quadrants==0.7.6 wheel and main; unrelated to this PR.
Reflowed AI-default ~75-80c wraps to the project's 120c budget on the
new docstrings touched by this PR:
  - sym_eig (`_funcs.py`): closed-form/cyclic-Jacobi dispatch note,
    plus dropped a stale `.. note::` block describing the old
    "needs `for _tid in range(...)` wrap" caller-pattern requirement
    (no longer true after loop_config(serialize=True)).
  - make_spd (`_funcs.py` and `_funcs_sym_eig_general.py`): both
    docstrings.
  - sym_eig_general (`_funcs_sym_eig_general.py`): module docstring
    + Returns block.
  - Matrix.frobenius_inner docstring (`lang/matrix.py`).
  - _inverse_lu docstring (`lang/matrix_ops.py`).

After: all touched runs sit in the 104-115c range, well within 120c
and well clear of the ~80c AI-default that the CI line-wrapping
checker flags.
Black reformat from `pre-commit run -a`:
  - test_linalg.py: split a multi-arg `np.testing.assert_allclose(...)`
    onto separate lines.
  - test_svd.py: collapsed a few oversplit f-strings / `print(...)`
    calls back onto single lines (now under 120c).

No behavioural change.
CI was timing out on test_make_spd_idempotent_f64[arch=cuda-*-12]
(~10 min/test budget). The test was defining two @qd.kernel closures
each with its own qd.make_spd(...) call — each closure JIT-compiles
the cyclic-Jacobi + reconstruct path independently, so N=12 on CUDA
hit 2× single-kernel compile time and exceeded the timeout.

Refactor to a single parametric kernel taking ndarray args:

  @qd.kernel
  def project(src: NDArray[mat_t, 1], dst: NDArray[mat_t, 1]):
      dst[0] = qd.make_spd(src[0], dt)

  project(A, A_spd_1)
  project(A_spd_1, A_spd_2)

Now qd.make_spd is JIT-compiled exactly once per test and called
twice with different ndarray bindings; the second call is a pure
launch with no recompile.

Verified locally: 12/12 (4 sizes × 3 factories) pass on amd
(amdgpu+vulkan+cpu) in 2:18.
@github-actions

Aligns sort behavior with NumPy / LAPACK conventions inside each op:

- `qd.sym_eig` 2×2 now returns eigenvalues ascending (matches >=3×3 paths
  and `np.linalg.eigh`). Trivial fix — emit `(lambda_lo, lambda_hi)`
  instead of the high/low pair, with eigenvectors swapped to match.
- `qd.svd` 3×3 now sorts singular values descending (matches 2×2 path and
  `np.linalg.svd`). Adds a 3-element selection sort over `sig_v` with
  paired column swaps in U, V; tracks swap parity and negates column 0
  of U and V at the end if odd, restoring `det(U) = det(V) = +1`
  (Sifakis's invariant). Reconstruction is preserved across all swaps
  and the column-0 sign fix-up.

Cross-op disagreement (sym_eig ascending vs. svd descending) is the
LAPACK / NumPy convention and is left as-is.
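The sort + parity fix-up can be sketched in numpy (illustrative; the quadrants version operates on a 3×3 in registers):

```python
import numpy as np

def sort_svd_descending(U, s, V):
    # Selection-sort sigma descending with paired column swaps in U and V.
    # Each swap flips det(U) and det(V); if the permutation is odd, negate
    # column 0 of both to restore det(U) = det(V) = +1. Both the swaps and
    # the paired sign flip leave U @ diag(s) @ V.T unchanged.
    U, s, V = U.copy(), s.copy(), V.copy()
    parity = 0
    for i in range(len(s) - 1):
        j = i + np.argmax(s[i:])
        if j != i:
            s[[i, j]] = s[[j, i]]
            U[:, [i, j]] = U[:, [j, i]]
            V[:, [i, j]] = V[:, [j, i]]
            parity ^= 1
    if parity:
        U[:, 0] *= -1.0
        V[:, 0] *= -1.0
    return U, s, V

# e.g. sigma arriving unsorted from the decomposition:
U, s, V = sort_svd_descending(np.eye(3), np.array([1.0, 3.0, 2.0]), np.eye(3))
# s is now descending; reconstruction and det(U) = det(V) = +1 are preserved
```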

New tests, parametrized over `qd.gpu` and shapes:

- `test_sym_eig_sort_order_{f32,f64}` for n in {2, 3, 4, 6, 9, 12}.
- `test_svd_sort_order_{f32,f64}` for n in {2, 3}, including a
  hand-picked 3×3 with σ that arrives unsorted from Sifakis.

Skips `vulkan + n=3` for sym_eig: the closed-form `_sym_eig3x3` path
(Eigen3 `computeDirect`) segfaults during SPIR-V codegen on the
cluster's Vulkan stack. Same code runs cleanly on amddesktop's Vulkan,
so it's a pre-existing driver / SDK quirk, not a regression.

Verified: 103 passed cluster CUDA, 23 passed cluster CUDA+Vulkan
sort_order (1 skip), 146 passed amddesktop AMDGPU+Vulkan (1 skip).

Doc: `linalg_per_thread.md` drops both sort-consistency FIXMEs, states
the per-op sort directions explicitly, and notes the 3×3
det(U)=det(V)=+1 enforcement (useful for ARAP-style rotations).
@github-actions

Reflows comment blocks and docstrings that were wrapped at the AI default
~75-80c instead of the project's 120c target, flagged by the line-wrapping
CI check. Touches comments only — no behavior change.
@github-actions

Tightens 2-line docstrings whose first line was wrapped at ~76-83c
instead of using the 120c budget more evenly. Found by the line-wrapping
CI on the previous push.

Contributor

@MuGdxy left a comment


n * m > 32 warning contradicts the new 12×12 support

python/quadrants/lang/matrix.py L305–316 emits a UserWarning when constructing any matrix with more than 32 entries:

if self.n * self.m > 32:
    warning(
        f"Quadrants matrices/vectors with {self.n}x{self.m} > 32 entries are not suggested."
        " Matrices/vectors will be automatically unrolled at compile-time for performance."
        " So the compilation time could be extremely long if the matrix size is too big."
        ...
    )

The new APIs in this PR (qd.sym_eig N≤12, qd.make_spd, Matrix.inverse() N≤12) internally construct matrices via Matrix.zero(dt, N, N), _filled_matrix, and Matrix(...), which hits this check. For N≥6 (6×6 = 36 > 32), users will see a "not suggested" warning when calling officially supported APIs — contradicting the PR's intent.

Suggestions:

  1. Suppress the warning on internal call paths (e.g. via warnings.filterwarnings or an internal _no_size_warning flag); or
  2. Raise the threshold from 32 to match the new caps (e.g. 144 = 12×12); or
  3. At minimum, document that users can safely ignore this warning for the new APIs.

Not a blocker, but it hurts UX — especially qd.make_spd which constructs multiple large matrices internally and may emit the warning several times per call.

@hughperkins hughperkins changed the title [Math] New single-threaded linalg ops for QIPC [Math] New QIPC ops for single-threaded linalg May 12, 2026
Addresses MuGdxy's review on PR #683: the n*m > 32 UserWarning fires
during JIT trace of officially-supported APIs that internally
construct ≥6×6 register-resident matrices (`qd.sym_eig` N≤12,
`qd.make_spd`, `Matrix.inverse` N≤12), contradicting the PR's intent
to support those sizes.

144 = 12×12 is the new natural cap: it matches the largest size every
officially-supported per-thread linalg op accepts, so internal
constructions stay below the warning threshold while user code that
builds a register-resident matrix beyond that still gets the
"consider qd.field instead" advice.

Doc updated in `matrix_vector.md` (intro line + "Size limit"
section).
@hughperkins
Collaborator Author

Re: MuGdxy's review above ("n * m > 32 warning contradicts the new 12×12 support"):

Addressed by Opus:

Threshold raised from 32 → 144 in lang/matrix.py; warning text updated to match;
matrix_vector.md intro and "Size limit" sections updated to 144 (with a note that it lines up with the per-thread linalg APIs' 12×12 cap).

