Advance to SIMD Vector Generation and Expose Physical NUMA Configuration Controls #1

@mcpwest

Summary

Project ORCHID serves as a deterministic proof-of-concept for the low-level micro-architectural core of the RAMNET protocol. Its engineering philosophy is strictly "Engine-First," prioritizing deterministic cache/memory scheduling primitives, minimal allocation overhead, and bare-metal performance over high-level abstractions.

The codebase is split into two co-dependent planes:

  1. Control Plane (Python SDK): Handles graph decomposition, spec plan parsing, code-generation metrics, and simulated scheduling bounds.
  2. Execution Plane (Go/C/Assembly): Implements concurrent runtime loops, atomic thread-safe counters, and cache-aligned bare-metal kernels executing on host nodes.

The codebase demonstrates strong architectural discipline, with a reproducible ~2.34x mean measured speedup from the memory-locality optimization and a ~3.01x reduction in modeled cycle count from CADENCE memory-bank role routing.


🔍 Subsystem Code Audit

1. The Locality Subsystem (Cache-Line Saturation)

  • Files Audited: locality/matmul.plan, orchid/assembler.py, locality/fair_harness.c, locality/build/flat.S, locality/build/locality.S
  • Mechanics: Reorders the matrix-multiplication loop nest from the cache-hostile row-by-column I-J-K order to the cache-friendly I-K-J order, whose inner loop reads and writes contiguous memory.
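
The reordering can be illustrated with a minimal Python sketch. It is not taken from locality/matmul.plan or the generated assembly; the function names and the flat row-major layout are assumptions chosen for illustration. Python lists will not reproduce the cache effect, so the sketch only shows that the two loop orders compute the same product while touching memory in different patterns.

```python
def matmul_ijk(a, b, n):
    """Row-by-column order: the inner loop walks a column of b (stride n)."""
    c = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def matmul_ikj(a, b, n):
    """I-K-J order: the inner loop walks a row of b and a row of c (stride 1)."""
    c = [0] * (n * n)
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]          # loaded once, reused across the row
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c


n = 4
a = list(range(1, n * n + 1))
b = list(range(n * n, 0, -1))
assert matmul_ijk(a, b, n) == matmul_ikj(a, b, n)
```

In a compiled kernel the stride-1 inner loop is what lets each fetched cache line be fully consumed before it is evicted.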

Critique & Observations:

  • Strict Alignment: fair_harness.c relies on aligned_alloc(64, BYTES), so each allocation starts on a 64-byte boundary. That matches the cache-line size of most current x86-64 CPUs and avoids loads that straddle two cache lines.
  • Deterministic Anti-Biasing: The verification loops alternate execution order (flat-first vs locality-first) and write through a volatile pointer across a 64 MiB buffer (flush_cache) between timed runs. Provided the buffer exceeds the last-level cache, this evicts most previously cached data, so state left over from one run is unlikely to skew the next measurement.
  • Assembly Generation Strategy: The pipeline intentionally skips an intermediate representation and emits assembly that addresses the arrays directly through three pointer registers (%rdi, %rsi, %rdx) with explicit offsets.
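
The alternating-order protocol can be sketched as follows. This is a simplified Python analogue, not the logic of fair_harness.c: the function name, the `flush` callback, and the injectable `clock` are all hypothetical, and a real harness would time the compiled kernels.

```python
import time


def alternating_speedups(flat, locality, flush, rounds=4, clock=time.perf_counter):
    """Time both kernels each round, swapping which one runs first."""
    results = []
    for r in range(rounds):
        order = ("flat", "locality") if r % 2 == 0 else ("locality", "flat")
        elapsed = {}
        for name in order:
            kernel = flat if name == "flat" else locality
            flush()                      # evict cached data before each timed run
            start = clock()
            kernel()
            elapsed[name] = clock() - start
        # Record which kernel went first alongside the observed speedup.
        results.append((order[0], elapsed["flat"] / elapsed["locality"]))
    return results
```

Keeping the first-run label next to each ratio makes it easy to check that the speedup does not depend on which kernel ran first.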

Production Refactoring Vector:

The generated assembly currently processes one 32-bit element per instruction (for example, the scalar load movl (%rdi,%rax,4), %r11d), while AVX-512-capable hardware can process 16 at a time. The code generator should be advanced to emit explicit AVX-512 loops that handle 16 dense 32-bit integers per instruction in 512-bit %zmm registers instead of one scalar per iteration.
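
A minimal sketch of what such an emitter might produce for the I-K-J inner loop is shown below. It is not based on the actual interface of orchid/assembler.py: the function name, the register roles, and the restriction to row lengths that are multiples of 16 are all assumptions. The emitted instructions (vpbroadcastd, vpmulld, vpaddd, vmovdqu32) are AVX-512F instructions in AT&T syntax; vpmulld keeps the low 32 bits of each product.

```python
LANES = 16  # 512-bit register / 32-bit elements


def emit_avx512_inner_loop(n):
    """Emit AT&T-syntax lines for c[i][j:j+16] += a[i][k] * b[k][j:j+16].

    Assumed (hypothetical) register roles: %rdi -> &a[i][k], %rsi -> &b[k][0],
    %rdx -> &c[i][0], %rcx = j. Requires n to be a multiple of 16.
    """
    if n % LANES != 0:
        raise ValueError("n must be a multiple of 16 (no tail loop in this sketch)")
    return [
        "    vpbroadcastd (%rdi), %zmm0           # splat a[i][k] into 16 lanes",
        "    xorq %rcx, %rcx                      # j = 0",
        "1:",
        "    vpmulld (%rsi,%rcx,4), %zmm0, %zmm1  # a[i][k] * b[k][j..j+15]",
        "    vpaddd (%rdx,%rcx,4), %zmm1, %zmm1   # + c[i][j..j+15]",
        "    vmovdqu32 %zmm1, (%rdx,%rcx,4)       # store 16 results",
        f"    addq ${LANES}, %rcx                   # j += 16",
        f"    cmpq ${n}, %rcx",
        "    jl 1b                                # loop while j < n",
    ]


print("\n".join(emit_avx512_inner_loop(64)))
```

A production generator would also need a scalar or masked tail loop for row lengths that are not multiples of 16, plus a runtime check that the host supports AVX-512F.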


2. The Parallel Subsystem (CADENCE Memory Banking)

  • Files Audited: orchid/simulator.py, scheduler/scheduler.go, scheduler/scheduler_test.go
  • Mechanics: Models memory channel assignment through specialized processing types (Weights $\rightarrow$ Bank 0, Intermediate Activations $\rightarrow$ Bank 1, Outputs $\rightarrow$ Bank 2).
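
The role-to-bank routing can be illustrated with a toy cycle model. This is an idealized sketch, not the logic of orchid/simulator.py: the workload, the function names, and the modulo fallback used when there are fewer banks than roles are all assumptions, and banks are treated as fully concurrent.

```python
ROLE_TO_BANK = {"weights": 0, "activations": 1, "outputs": 2}


def serial_cycles(requests):
    """All requests share one queue, so their costs add up."""
    return sum(cost for _, cost in requests)


def role_split_cycles(requests, bank_count=3):
    """Each role drains its own bank queue; the slowest bank sets the total."""
    per_bank = [0] * bank_count
    for role, cost in requests:
        per_bank[ROLE_TO_BANK[role] % bank_count] += cost
    return max(per_bank)


# Hypothetical workload: (role, cycle cost) pairs, not taken from the repository.
requests = [("weights", 100), ("activations", 100), ("outputs", 100)] * 4
print(serial_cycles(requests))          # 1200
print(role_split_cycles(requests))      # 400
print(role_split_cycles(requests, 2))   # 800
```

In this toy model, two banks give 1.5x because two roles still share a bank, while three banks give 3x. A configuration with several banks but a single shared queue would behave like serial_cycles.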

Critique & Observations:

  • Proof of Non-Sufficiency: The implementation includes a vital negative control (parallel_two_memory_conflicted_control). It demonstrates that adding multi-channel hardware alone yields no speedup (1.000x) if the control plane keeps routing all read/write requests through a single unsharded queue.
  • Mathematical Precision: The system checks that every execution case produces the same position-weighted checksum (sum((i + 1) * value for i, value in enumerate(a))), which is strong evidence that the optimizations did not change the computed results.
  • Go Scheduler Structural Integrity: The native daemon implementation (scheduler.go) mirrors the simulator. It isolates bank execution with one mutex per bank (make([]sync.Mutex, bankCount)) and aggregates metrics with allocation-free atomic.AddUint64 counters, so metrics collection adds no garbage-collector pressure to the critical loops.
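
The checksum expression quoted above weights each value by its 1-based position. A short sketch of why that matters, using made-up sample data (the wrapper name `checksum` is an assumption): a plain sum cannot detect reordered outputs, while the weighted form can.

```python
def checksum(a):
    """Position-weighted sum: each value is scaled by its 1-based index."""
    return sum((i + 1) * value for i, value in enumerate(a))


reference = [3, 1, 4, 1, 5]          # hypothetical output vector
swapped = [1, 3, 4, 1, 5]            # same values, first two exchanged

assert checksum(reference) == checksum(list(reference))  # identical data agrees
assert checksum(swapped) != checksum(reference)          # the swap is detected
assert sum(swapped) == sum(reference)                    # a plain sum misses it
```

A matching checksum is evidence rather than proof of identical output, since distinct vectors can still collide.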

🛠️ Tooling & Quality Engineering Review

1. Modern Dependency & Image Management

  • The setup relies on **Astral uv**, which runs Python inside an isolated .venv/ virtual environment. This keeps project dependencies separate from system packages across local testing and containerized nodes.
  • The multi-stage Dockerfile handles development layers elegantly, utilizing a tight debian:bookworm-slim base image and leveraging quick multi-service setups through bind-mounted volumes in docker-compose.yml.

2. Clean Architecture Cleanups

  • The codebase provides top-level command visibility via a root Makefile.
  • Transient files are handled correctly: the .gitignore excludes local binary outputs (build/, bin/), trace outputs (evidence/), and Python bytecode caches (*.pyc) from version control.
  • Multi-language compliance rules are centralized cleanly using sonar-project.properties.

🏁 Architectural Verification Data

The performance logs recorded in the workspace conclusively back the system's design assertions:

📈 Locality Cache-Line Saturation Benchmarks

(From evidence/reproduced/fair_summary_current_environment.txt)

  • Minimum Speedup Achieved: 2.230x
  • Median Speedup Achieved: 2.303x
  • Maximum Speedup Achieved: 2.502x
  • Mean Speedup Achieved: 2.343x

This indicates that organizing generated assembly around contiguous I-K-J reads substantially reduces memory-stall cost on single-threaded runs, with every recorded run at least 2.230x faster than the flat layout.

🔀 CADENCE Parallel Memory Bank Scheduling

(From evidence/current/summary.txt)

  • serial_single_memory: 4,931,584 cycles (Baseline)
  • parallel_two_memory_role_split: 3,293,184 cycles (1.498x speedup)
  • parallel_three_memory_role_split: 1,638,501 cycles (3.010x speedup)
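
The speedup factors follow directly from the cycle counts listed above, dividing the baseline by each configuration's count. A short check (the dictionary layout is only for illustration):

```python
cycles = {
    "serial_single_memory": 4_931_584,
    "parallel_two_memory_role_split": 3_293_184,
    "parallel_three_memory_role_split": 1_638_501,
}
baseline = cycles["serial_single_memory"]

# speedup = baseline cycles / configuration cycles
speedups = {name: baseline / count for name, count in cycles.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.3f}x")
```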

This data supports the role-sharding hypothesis: isolating memory access types in separate banks lets their queues drain concurrently and removes most serialization delay. Scaling is not linear in bank count, however: two banks yield 1.498x, while three banks yield 3.010x.


💡 Recommendation Profile for RAMNET Integration

  1. Advance to SIMD Vector Generation: Modify orchid.assembler to emit vectorized assembly using explicit AVX-512 register steps.
  2. Expose Physical NUMA Configuration Controls: Extend the Go execution-plane daemon to interface directly with the host's physical memory topology, using libnuma or explicitly prefaulted memory mappings (mmap with MAP_POPULATE), so that CADENCE role sharding can be tested on real multi-socket nodes.
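
As a rough illustration of the second recommendation, the sketch below creates a prefaulted anonymous mapping from Python's standard mmap module. It is not the proposed Go implementation: the helper name and the per-bank region are hypothetical, the code is Unix-only, and MAP_POPULATE is only exposed on Linux with Python 3.10 or later. Prefaulting by itself does not pin memory to a NUMA node; explicit placement would still need libnuma or a similar binding.

```python
import mmap

# MAP_POPULATE is Linux-only and exposed by Python 3.10+; fall back to 0 elsewhere.
MAP_POPULATE = getattr(mmap, "MAP_POPULATE", 0)


def alloc_prefaulted(size):
    """Anonymous private mapping whose pages are faulted in at map time (on Linux)."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_POPULATE
    return mmap.mmap(-1, size, flags=flags)


bank = alloc_prefaulted(4 * mmap.PAGESIZE)  # hypothetical per-bank region
bank[0:4] = b"\x01\x02\x03\x04"
assert bank[0:4] == b"\x01\x02\x03\x04"
bank.close()
```

Prefaulting moves page-fault cost out of the timed region, which matters when comparing bank configurations by cycle count.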

Verdict: The repository is exceptionally clean, high-performance, logically correct, and ready for advanced physical substrate prototyping.

Labels

enhancement (New feature or request)