45 changes: 45 additions & 0 deletions .cargo/config_ndarray_simd.toml
@@ -0,0 +1,45 @@
# Copy this file to `.cargo/config.toml` (or merge into your existing one)
# to enable AVX-512 builds of the AdaWorldAPI/ndarray SIMD polyfill from
# this Bevy fork.
#
# ## What it does
#
# Two build profiles for the ndarray polyfill — chosen at compile time
# via `target-cpu`, never at runtime:
#
# - **Default** (`cargo build`): `target-cpu=x86-64-v3`. AVX2 baseline.
# Works on every GitHub Actions runner. `crate::simd::F32x16` picks
# the 8-lane AVX2 path; `U8x64` ditto.
#
# - **Opt-in AVX-512** (`cargo build-avx512`): `target-cpu=x86-64-v4`.
# Polyfill picks the 16-lane AVX-512 path. Required for the
# `ndarray_simd_smoke` / `ndarray_graph_plugin` examples to exercise
# `__m512` / `permute_bytes` / `pairwise_avg` (PR #112's rasterizer
# intrinsics). Only run on hardware with AVX-512F (Sapphire Rapids,
# Ice Lake-SP, Zen 4 with AVX-512, etc.). CI runners WILL SIGILL.
#
# ## Why two profiles
#
# GitHub Actions stock runners support x86-64-v3 (AVX2) but NOT
# x86-64-v4 (AVX-512). Unconditionally setting `target-cpu=x86-64-v4`
# would break CI. Project convention is: default build = CI-safe
# baseline; AVX-512 = explicit opt-in via cargo alias.
#
# ## Runtime sanity
#
# Whichever profile you build with, the ndarray smoke test prints
# `simd_caps()` at startup (CPUID-detected at runtime via the LazyLock
# singleton). The smoke test catches the mismatch between
# runtime-detected `avx512f=true` and a compile-time x86-64-v3 build
# (`PREFERRED_F32_LANES=8`) — that's the asymmetry to watch for.

[build]
rustflags = ["-C", "target-cpu=x86-64-v3"]

[alias]
# AVX-512 variants — for AdaWorldAPI dev boxes (Sapphire Rapids+).
# Do NOT run these binaries on a non-AVX-512 host.
build-avx512 = ["build", "--config", "build.rustflags=['-C','target-cpu=x86-64-v4']"]
run-avx512 = ["run", "--config", "build.rustflags=['-C','target-cpu=x86-64-v4']"]
test-avx512 = ["test", "--config", "build.rustflags=['-C','target-cpu=x86-64-v4']"]
check-avx512 = ["check", "--config", "build.rustflags=['-C','target-cpu=x86-64-v4']"]
44 changes: 44 additions & 0 deletions .github/workflows/ndarray-smoke.yml
@@ -0,0 +1,44 @@
name: ndarray-smoke
on:
push:
branches: ["claude/**"]
pull_request:
branches: ["claude/**", "main", "master"]

# Minimum permission set per CodeQL "Workflow does not contain permissions"
# rule on PR #1. The job only checks out and builds; no write access is needed.
permissions:
contents: read

jobs:
build:
runs-on: ubuntu-latest
steps:
# Pinned to commit SHA per zizmor unpinned-action rule on PR #1.
# v4.1.7 corresponds to commit 692973e3d937129bcbf40652eb9f2f61becf3332.
- uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

[Check warning, Code scanning / zizmor: credential persistence through GitHub Actions artifacts]

- name: Install Bevy system deps
run: sudo apt-get update -y && sudo apt-get install -y libwayland-dev libasound2-dev libudev-dev

# Pinned to commit SHA per zizmor unpinned-action rule on PR #1.
# The action treats "1.95.0" as a toolchain version, but the action ref
# itself must be a commit SHA. Commit f04cf2e09f5b6448b46c0aa9893a76ee36ed64c2
# corresponds to the stable tag.
- uses: dtolnay/rust-toolchain@f04cf2e09f5b6448b46c0aa9893a76ee36ed64c2

[Check failure, Code scanning / zizmor: commit with no history in referenced repository]
with:
toolchain: "1.95.0"

# ndarray is now a git dev-dep in Cargo.toml (codex P1 fix on PR #1),
# so the workflow no longer needs to clone ../ndarray. The
# ndarray-examples feature must be enabled because the [[example]]
# entries require it (so upstream Bevy CI doesn't try to build them
# on macOS / Windows where ndarray's AMX path doesn't compile).
- name: cargo check --example ndarray_simd_smoke
run: cargo check --example ndarray_simd_smoke --features ndarray-examples

- name: cargo check --example ndarray_graph_plugin
run: cargo check --example ndarray_graph_plugin --features ndarray-examples

- name: cargo check --example ndarray_graph_plugin_tests
run: cargo check --example ndarray_graph_plugin_tests --features ndarray-examples
51 changes: 51 additions & 0 deletions Cargo.toml
@@ -133,6 +133,15 @@ unused_qualifications = "warn"
[features]
default = ["2d", "3d", "ui", "audio"]

# PROFILE: Examples that depend on AdaWorldAPI/ndarray fork (Linux x86_64 only).
# Marker feature: enabling it lets `cargo build --examples` pick up the
# ndarray_* examples. Upstream Bevy CI does NOT enable this, so its
# multi-platform build matrix doesn't try to compile these examples on
# macOS / Windows / non-x86_64 (where ndarray's AMX + Linux prctl don't
# build). Local dev / our own CI workflow enables this explicitly via
# `cargo run --example ndarray_graph_plugin --features ndarray-examples`.
ndarray-examples = []

# PROFILE: The default 2D Bevy experience. This includes the core Bevy framework, 2D functionality, scenes and picking.
2d = ["default_app", "default_platform", "2d_bevy_render", "scene", "picking"]

@@ -742,6 +751,21 @@ chacha20 = { version = "0.10.0", default-features = false, features = ["rng"] }
ron = "0.12"
flate2 = "1.0"
serde = { version = "1", features = ["derive"] }

# AdaWorldAPI/ndarray fork: HPC + SIMD polyfill — Linux x86_64 ONLY.
# ndarray uses AMX inline asm + a Linux-x86_64 prctl syscall for AMX tile
# permission grants, neither of which is available on macOS / Windows /
# aarch64. The CI matrix on the bevy fork runs all three OS × ISA combos,
# so without this target.cfg gate, the macOS and Windows runners try
# (and fail) to fetch + build ndarray. Gating the dep AND stubbing the
# example mains for non-Linux-x86_64 means upstream cargo check / build
# passes cleanly on every target.
#
# Git dep (not path) so `cargo metadata` works without a sibling checkout
# (codex P1 on PR #1: `path = "../ndarray"` made every cargo command on
# the workspace fail unless every dev pre-cloned the sibling).
[target.'cfg(all(target_os = "linux", target_arch = "x86_64"))'.dev-dependencies]
ndarray = { git = "https://github.com/AdaWorldAPI/ndarray.git", branch = "master", features = ["rayon"] }
serde_json = "1.0.140"
bytemuck = "1"
# The following explicit dependencies are needed for proc macros to work inside of examples as they are part of the bevy crate itself.
@@ -797,6 +821,33 @@ doc-scrape-examples = true
[package.metadata.example.hello_world]
hidden = true

[[example]]
name = "ndarray_simd_smoke"
path = "examples/ndarray_simd_smoke.rs"
doc-scrape-examples = false
required-features = ["ndarray-examples"]

[package.metadata.example.ndarray_simd_smoke]
hidden = true

[[example]]
name = "ndarray_graph_plugin"
path = "examples/ndarray_graph_plugin.rs"
doc-scrape-examples = false
required-features = ["ndarray-examples"]

[package.metadata.example.ndarray_graph_plugin]
hidden = true

[[example]]
name = "ndarray_graph_plugin_tests"
path = "examples/ndarray_graph_plugin_tests.rs"
doc-scrape-examples = false
required-features = ["ndarray-examples"]

[package.metadata.example.ndarray_graph_plugin_tests]
hidden = true

# 2D Rendering
[[example]]
name = "bloom_2d"
219 changes: 219 additions & 0 deletions examples/README_NDARRAY_PLUGIN.md
@@ -0,0 +1,219 @@
# ndarray Graph Plugin for Bevy

## What this is

`ndarray_graph_plugin` is a Bevy example that shows how to wire the
AdaWorldAPI/ndarray SIMD polyfill (`crate::simd::F32x16`, `Framebuffer`,
`compose_neo4j`, `GLOBAL_RENDERER`) directly into a Bevy `App` as a
first-class `Plugin`. Each Bevy `Update` tick advances a 64-node /
80-edge force-directed graph through `ndarray::hpc::renderer`'s
double-buffer integrator, rasterizes the result into a 512x512 palette-indexed
`Framebuffer` using `compose_neo4j`, converts the palette indices to RGBA via a
compile-time LUT, uploads the result as a `bevy::asset::Image`, and displays it
on a `Sprite`. The SIMD path (`F32x16::mul_add`, `U8x64::pairwise_avg`) is
selected at compile time from the `target-cpu` flag and confirmed at runtime
via `simd_caps()`.

---

## Build

### Prerequisites

**Rust toolchain**

```
rustup toolchain install 1.95.0
rustup override set 1.95.0
```

**System libraries** (Debian/Ubuntu)

```
sudo apt-get update -y
sudo apt-get install -y libwayland-dev libasound2-dev libudev-dev
```

**ndarray dependency**

The Bevy `Cargo.toml` pulls ndarray in as a git dev-dependency
(`https://github.com/AdaWorldAPI/ndarray.git`, branch `master`), gated to
Linux x86_64 targets. Cargo fetches it on the first build, so no sibling
`../ndarray` checkout is needed. The `ndarray_*` examples are additionally
gated behind the `ndarray-examples` feature, which must be passed
explicitly on the command line.

---

## Run

### CI-safe build (x86-64-v3, AVX2 baseline)

This is the default. It works on every GitHub Actions runner. The ndarray
polyfill picks the 8-lane AVX2 path; `PREFERRED_F32_LANES` is 8.

```
cargo run --example ndarray_graph_plugin --features ndarray-examples
```

### AVX-512 build (x86-64-v4, Sapphire Rapids / Ice Lake-SP / Zen 4+)

The `run-avx512` alias is defined in `.cargo/config_ndarray_simd.toml`.
Copy or merge that file into `.cargo/config.toml` before using it.
This build will SIGILL on any host without AVX-512F; do not run it in CI
on stock GitHub Actions runners.

```
cargo run-avx512 --example ndarray_graph_plugin --features ndarray-examples
```

---

## What it shows

On startup the plugin seeds `GLOBAL_RENDERER` with 64 nodes arranged in a
circle and 80 directed edges forming a random sparse graph. Each `Update`
tick:

1. `GLOBAL_RENDERER.tick(dt, damping)` integrates node positions via
`integrate_simd` — `F32x16::mul_add` fused multiply-add over the
position/velocity SoA buffers, one AVX-512 (or AVX2) pass per 16
floats.

2. `compose_neo4j(&mut fb, frame, &edges, scale, offset, node_color, edge_color)`
rasterizes the front buffer into a 512x512 `Framebuffer`:
- Edges drawn as Bresenham lines with palette index `edge_color`.
- Nodes drawn as dot sprites with palette index `node_color`.
- Pixel values are u8 palette indices (0–15 for AVX-512 tier, 0–7
for AVX2 tier, 0–3 for NEON/scalar tier).

3. A compile-time RGBA lookup table (`ndarray_graph_palette.rs`) maps
each palette index to a 4-byte RGBA value. The 512x512 pixel array is
expanded to a 1048576-byte RGBA buffer suitable for `bevy::asset::Image`.

4. The `Image` is uploaded to the Bevy asset server and bound to a `Sprite`
component, which Bevy's 2D renderer displays in the window.
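
Step 3 can be sketched in isolation. The block below is a minimal
illustration, not the contents of `ndarray_graph_palette.rs`: the grayscale
colors, the 16-entry table size, and the `palette_to_rgba` signature are all
assumptions made for the example. It shows only the shape of the expansion,
one index byte in and four RGBA bytes out, which is where the
512 × 512 × 4 = 1048576-byte figure comes from.

```rust
/// Hypothetical 16-entry RGBA table built at compile time.
/// The real colors live in `ndarray_graph_palette.rs` and are not shown here.
const PALETTE: [[u8; 4]; 16] = build_palette();

/// Builds an illustrative grayscale ramp (0, 17, ..., 255) with opaque alpha.
const fn build_palette() -> [[u8; 4]; 16] {
    let mut lut = [[0u8; 4]; 16];
    let mut i = 0;
    while i < 16 {
        let v = (i * 17) as u8;
        lut[i] = [v, v, v, 255];
        i += 1;
    }
    lut
}

/// Expands one palette-index byte per pixel into four RGBA bytes per pixel.
/// The signature is assumed for this sketch.
fn palette_to_rgba(indices: &[u8]) -> Vec<u8> {
    let mut rgba = Vec::with_capacity(indices.len() * 4);
    for &idx in indices {
        // Masking keeps any out-of-range index inside the 16-entry table.
        rgba.extend_from_slice(&PALETTE[(idx & 0x0F) as usize]);
    }
    rgba
}

fn main() {
    // A 512x512 framebuffer of palette indices, all set to index 0.
    let pixels = vec![0u8; 512 * 512];
    let rgba = palette_to_rgba(&pixels);
    println!("{} index bytes -> {} RGBA bytes", pixels.len(), rgba.len());
}
```

A table lookup keeps the rasterizer working on one byte per pixel; the
four-byte expansion happens once per frame, just before the upload.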

The window title shows the current tick count, SIMD tier, and frame time
so the polyfill path is visible at a glance.

---

## Architecture

```
Bevy App
└── NdarrayGraphPlugin
├── Resource<Renderer> (wraps GLOBAL_RENDERER or a local instance)
│ └── ndarray::hpc::renderer::GLOBAL_RENDERER
│ ├── RenderFrame (front) ← readers here
│ └── RenderFrame (back) ← integrate_simd writes here
├── System: tick_renderer
│ calls Renderer::tick(dt, damping)
│ → F32x16::mul_add via crate::simd polyfill
├── System: rasterize_to_framebuffer
│ calls compose_neo4j(&mut fb, frame, edges, ...)
│ → Framebuffer { pixels: Vec<u8> } (palette indices)
├── System: palette_blit
│ expands palette indices → RGBA bytes via LUT
│ → bevy::asset::Image (Rgba8UnormSrgb, 512×512)
└── Sprite ← displays the Image in the 2D world
```

Data flows in one direction: `Renderer` produces a `RenderFrame`, which
`compose_neo4j` reads to fill a `Framebuffer`, which the palette LUT
converts to an `Image`, which Bevy renders. No compute step takes
`&mut self`; all mutation goes through the renderer's internal `RwLock`
double-buffer and Bevy's `ResMut`.
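
The double-buffer pattern described above can be sketched with the standard
library alone. This is an illustration under stated assumptions, not
ndarray's `Renderer`: the `Frame` and `DoubleBuffer` types, the scalar loop
(standing in for `integrate_simd`), and the reading of `damping` as a
per-tick velocity multiplier are all hypothetical.

```rust
use std::sync::RwLock;

/// Hypothetical structure-of-arrays frame: positions and velocities.
#[derive(Clone)]
struct Frame {
    pos: Vec<f32>,
    vel: Vec<f32>,
}

/// Hypothetical double buffer: readers see `front`, the integrator fills `back`.
struct DoubleBuffer {
    front: RwLock<Frame>,
    back: RwLock<Frame>,
}

impl DoubleBuffer {
    fn new(pos: Vec<f32>, vel: Vec<f32>) -> Self {
        let frame = Frame { pos, vel };
        Self {
            front: RwLock::new(frame.clone()),
            back: RwLock::new(frame),
        }
    }

    /// Takes `&self`: all mutation happens behind the locks.
    fn tick(&self, dt: f32, damping: f32) {
        {
            let front = self.front.read().unwrap();
            let mut back = self.back.write().unwrap();
            for i in 0..front.pos.len() {
                // Assumed damping model: scale velocity once per tick.
                let v = front.vel[i] * damping;
                back.vel[i] = v;
                // Fused multiply-add: pos + v * dt (scalar stand-in for SIMD).
                back.pos[i] = v.mul_add(dt, front.pos[i]);
            }
        } // both guards dropped here
        // Publish the new frame by swapping the buffers (front lock first).
        let mut front = self.front.write().unwrap();
        let mut back = self.back.write().unwrap();
        std::mem::swap(&mut *front, &mut *back);
    }

    fn positions(&self) -> Vec<f32> {
        self.front.read().unwrap().pos.clone()
    }
}

fn main() {
    let buf = DoubleBuffer::new(vec![0.0, 1.0], vec![1.0, -2.0]);
    buf.tick(0.5, 1.0);
    println!("{:?}", buf.positions());
}
```

Because `tick` needs only a shared reference, a value like this can sit
behind a `static` or a shared Bevy resource without handing out `&mut`
access to the whole renderer.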

---

## Compile-time vs runtime tier

The polyfill exposes two orthogonal tier signals that can disagree:

| Signal | Where | Value on AVX2 build | Value on AVX-512 build |
|--------|-------|---------------------|------------------------|
| `PREFERRED_F32_LANES` | compile-time const (`crate::simd`) | `8` | `16` |
| `simd_caps().avx512f` | runtime CPUID (`LazyLock`) | `true` (if Sapphire Rapids) | `true` |

The smoke test caught exactly this mismatch: building with
`target-cpu=x86-64-v3` (the CI default) on a Sapphire Rapids host
produces `PREFERRED_F32_LANES=8` but `simd_caps().avx512f=true`. The two
signals are not automatically reconciled.

**What controls which path runs:**

- `target-cpu=x86-64-v3` (the default in `.cargo/config.toml`): the
compiler emits AVX2 code; `cfg(target_feature = "avx512f")` is false
at compile time; `F32x16::mul_add` compiles to 8-lane AVX2 FMA;
`PREFERRED_F32_LANES = 8`. The runtime tier reported by `simd_caps()`
is informational only — no code path switches based on it.

- `target-cpu=x86-64-v4` (via `cargo run-avx512` alias): the compiler
emits AVX-512 code; `cfg(target_feature = "avx512f")` is true at
compile time; `F32x16::mul_add` compiles to 16-lane `_mm512_fmadd_ps`;
`PREFERRED_F32_LANES = 16`. The runtime `simd_caps()` tier now agrees
with compile time.

The plugin prints both values at startup:

```
[ndarray_graph_plugin] compile-time: PREFERRED_F32_LANES=8
[ndarray_graph_plugin] runtime: avx512f=true avx2=true
```

A mismatch is not an error — it is expected on Sapphire Rapids with a
CI-safe x86-64-v3 binary — but it means you are leaving AVX-512 throughput
on the table. Pass `-C target-cpu=x86-64-v4` (via the `run-avx512` alias)
to close the gap.
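
The two signals can be reproduced outside the polyfill with standard Rust
facilities. The sketch below does not call ndarray: `compile_time_f32_lanes`
only mirrors the role of `PREFERRED_F32_LANES`, `runtime_avx512f` stands in
for `simd_caps().avx512f`, and the fallback lane count of 4 is a placeholder
assumption. `cfg!(target_feature = ...)` is fixed when the binary is built,
while `is_x86_feature_detected!` queries the CPU the binary is running on.

```rust
/// Lane count implied by the compile-time target features.
/// Mirrors the role of `PREFERRED_F32_LANES`; the fallback of 4 is assumed.
const fn compile_time_f32_lanes() -> usize {
    if cfg!(target_feature = "avx512f") {
        16
    } else if cfg!(target_feature = "avx2") {
        8
    } else {
        4
    }
}

/// Runtime CPUID check; stands in for `simd_caps().avx512f`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn runtime_avx512f() -> bool {
    std::is_x86_feature_detected!("avx512f")
}

/// Non-x86 targets never report AVX-512.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn runtime_avx512f() -> bool {
    false
}

/// True when the host supports AVX-512 but the binary was built narrower.
fn avx512_left_unused(compile_lanes: usize, host_avx512f: bool) -> bool {
    host_avx512f && compile_lanes < 16
}

fn main() {
    let lanes = compile_time_f32_lanes();
    let avx512f = runtime_avx512f();
    println!("compile-time: lanes={lanes}");
    println!("runtime:      avx512f={avx512f}");
    if avx512_left_unused(lanes, avx512f) {
        println!("note: host supports AVX-512 but this build does not use it");
    }
}
```

Building this once with `-C target-cpu=x86-64-v3` and once with
`-C target-cpu=x86-64-v4` changes only the first line of output; the second
depends on the machine alone.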

---

## Companion files

The full plugin is split across four files generated by the round-2 CCA2A
fleet:

| File | Agent | Contents |
|------|-------|----------|
| `bevy/examples/ndarray_graph_plugin.rs` | agent #1 plugin-core | `NdarrayGraphPlugin` struct and impl, Bevy systems (`tick_renderer`, `rasterize_to_framebuffer`, `palette_blit`), `Cargo.toml` `[[example]]` entry |
| `bevy/examples/ndarray_graph_palette.rs` | agent #2 plugin-palette | Compile-time RGBA LUT, `palette_to_rgba` expansion function, tier-keyed color definitions for nodes / edges / background |
| `bevy/.github/workflows/ndarray-smoke.yml` | agent #3 plugin-ci | GitHub Actions workflow: installs system deps, sets Rust 1.95.0, runs `cargo check --features ndarray-examples` on the `ndarray_simd_smoke`, `ndarray_graph_plugin`, and `ndarray_graph_plugin_tests` examples on every push to `claude/**` branches and every PR targeting `claude/**`, `main`, or `master` |
| `bevy/examples/README_NDARRAY_PLUGIN.md` | agent #4 plugin-readme | This file |

The existing smoke test at `bevy/examples/ndarray_simd_smoke.rs` remains
the canonical end-to-end correctness check. The graph plugin builds on the
same ndarray API surface that the smoke test exercises; see the smoke test's
assertion 5 (`compose_neo4j`) and assertions 3–4 (`integrate_simd`,
`integrate_simd_par`) for the tested contracts.

---

## Known limitations

- `integrate_simd_par` (rayon) is deliberately not used in the per-frame
tick at 64 nodes. The documented crossover is 65536 floats; at 64 nodes
(192 floats) rayon overhead dominates. Use `integrate_simd` for scenes
under ~5000 nodes and switch to `integrate_simd_par` only when profiling
confirms the crossover is reached.

- `PaletteTier::detect()` currently proxies off `PREFERRED_F32_LANES` (an
  f32 lane count) to select u8 palette depth. On an AVX2 build
(`PREFERRED_F32_LANES=8`) the framebuffer uses `Mid8` (8 colors) even
though AVX2 has 32 u8 lanes. This is a known issue in `framebuffer.rs`;
the plugin uses whichever tier `PaletteTier::detect()` returns.

- The `GLOBAL_RENDERER` singleton is initialized once per process at 4096
node capacity. It cannot be resized at runtime. For larger scenes,
construct a local `Renderer::with_capacity(n)` and store it as a Bevy
`Resource` instead of using `GLOBAL_RENDERER`.
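
The first limitation amounts to a size check before choosing an integrator.
The sketch below is a hypothetical helper, not part of ndarray: the
`Integrator` enum and `choose_integrator` are invented for illustration, and
the figure of 3 floats per node is inferred from "64 nodes (192 floats)".
It encodes only the documented 65536-float crossover as a starting point;
profiling should have the final say.

```rust
/// Documented crossover from the limitation above, in floats.
const PAR_CROSSOVER_FLOATS: usize = 65_536;

/// Inferred from "64 nodes (192 floats)"; an assumption, not an ndarray constant.
const FLOATS_PER_NODE: usize = 3;

/// Hypothetical selector between `integrate_simd` and `integrate_simd_par`.
#[derive(Debug, PartialEq)]
enum Integrator {
    Serial,
    Parallel,
}

/// Picks the serial path until the float count reaches the crossover.
fn choose_integrator(nodes: usize) -> Integrator {
    if nodes * FLOATS_PER_NODE >= PAR_CROSSOVER_FLOATS {
        Integrator::Parallel
    } else {
        Integrator::Serial
    }
}

fn main() {
    for nodes in [64, 5_000, 30_000] {
        println!("{nodes} nodes -> {:?}", choose_integrator(nodes));
    }
}
```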