```
███╗   ██╗███████╗███████╗
████╗  ██║██╔════╝██╔════╝
██╔██╗ ██║█████╗  █████╗
██║╚██╗██║██╔══╝  ██╔══╝
██║ ╚████║███████╗██║
╚═╝  ╚═══╝╚══════╝╚═╝
```
*The portable computation graph engine for AI workloads inside HydraLogOS*
Write once. Run anywhere HydraLogOS runs. No device management.
NEF is not a model format, training framework, or GPU driver wrapper.
It is a lazy computation graph system — a complete pipeline from operator definition through device planning, kernel compilation, and hardware execution — targeting heterogeneous compute across NVIDIA, AMD, Intel, NPU, and CPU targets with zero explicit device management from user code.
Running AI workloads on heterogeneous hardware today means writing this kind of code:
```python
# Without NEF — you manage everything manually
tensor = tensor.to("cuda:0")          # device hell
if torch.cuda.is_available():
    kernel = cuda_kernel(tensor)      # backend-specific paths
elif rocm_available():
    kernel = rocm_kernel(tensor)      # more branching
memory_pool.pin(tensor)               # manual memory
torch.cuda.synchronize()              # explicit sync
```

NEF eliminates all of it:
```python
import nef

a = nef.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=nef.float32)
b = nef.tensor([[5.0, 6.0], [7.0, 8.0]], dtype=nef.float32)

c = nef.matmul(a, b)   # ← no execution yet. graph node created.
c.execute()            # ← optimizer → planner → compiler → hardware. done.
```

No `tensor.to("cuda")`. No backend conditionals. No memory calls.
```text
┌─────────────────────────────────────────┐
│          NEF API (Python / Go)          │
└────────────────────┬────────────────────┘
                     │
           ┌─────────▼─────────┐
           │   Graph Builder   │   ← Lazy DAG / IR
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │     Optimizer     │   ← Fusion · Folding · Elimination
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │  Device Planner   │   ← Op → Hardware assignment
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │  Kernel Compiler  │   ← Backend-specific lowering
           └──┬─────┬─────┬──┬─┘
              │     │     │  │
       ┌──────┘     │     │  └────────┐
       ▼            ▼     ▼           ▼
 NVIDIA GPU     AMD GPU   CPU SIMD    NPU
 (CUDA/PTX)     (ROCm)    (AVX-512)   (Vendor)
       │            │     │           │
       └────────────┴──┬──┴───────────┘
                       │
             ┌─────────▼─────────┐
             │ Execution Runtime │   ← Async · Parallel · Streamed
             └───────────────────┘
```
① Graph Builder — Lazy IR Layer
Converts API calls into a Directed Acyclic Graph (DAG). No hardware decisions happen here. No execution happens here. Every op call simply extends the graph.
- Nodes — individual ops (MatMul, Softmax, LayerNorm, RMSNorm …)
- Edges — tensor dependencies between nodes
- Metadata — shape, dtype, estimated FLOPs, device hint
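As a rough illustration of what a single node record might carry, here is a sketch; the field names below are assumptions, not NEF's actual internal representation:

```python
# Illustrative sketch of a graph node record; field names are assumptions,
# not NEF's real internals.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GraphNode:
    node_id: str
    op: str                            # "matmul", "softmax", "layernorm", ...
    inputs: List[str]                  # edges: ids of the producing nodes
    shape: Tuple[int, ...]             # metadata consumed by later stages
    dtype: str = "float32"
    est_flops: int = 0                 # rough cost estimate for the planner
    device_hint: Optional[str] = None  # optional developer override
```

User code never touches these records; every API call simply appends to the graph: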
```python
a = nef.tensor([1, 2, 3])
b = nef.tensor([4, 5, 6])

c = nef.matmul(a, b)   # → DAG node added. Nothing ran.
d = nef.softmax(c)     # → DAG node added. Nothing ran.

# Graph: a,b → matmul → softmax → d
```

② Optimizer — Graph Transformation Passes
Runs a deterministic sequence of passes before compilation. Does not alter numerical output beyond floating-point rounding equivalence.
| Pass | What it does |
|---|---|
| Node Fusion | Adjacent elementwise ops collapse into a single kernel |
| Constant Folding | Static subgraphs computed at compile time |
| Dead Node Elimination | Unreachable nodes removed from graph |
| Memory Reuse | Tensors that can share buffers are identified |
| Op Simplification | Expensive ops replaced with cheaper equivalents |
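To make the fusion pass concrete, here is a standalone sketch of the idea over a linear op sequence; it is simplified and is not NEF's actual optimizer code:

```python
# Standalone sketch of node fusion; simplified, not NEF's actual optimizer.
ELEMENTWISE = {"add", "mul", "relu", "gelu"}

def fuse_elementwise(ops):
    """Collapse runs of adjacent elementwise ops into single fused nodes."""
    fused, run = [], []
    for op in ops:
        if op in ELEMENTWISE:
            run.append(op)            # keep extending the current fusion group
            continue
        if run:
            fused.append("fused(" + "+".join(run) + ")")
            run = []
        fused.append(op)              # non-elementwise ops pass through unchanged
    if run:
        fused.append("fused(" + "+".join(run) + ")")
    return fused

print(fuse_elementwise(["matmul", "mul", "add", "relu", "softmax"]))
# ['matmul', 'fused(mul+add+relu)', 'softmax']
```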
③ Device Planner — Hardware Assignment
Maps each graph node to the best available hardware target. Heuristic-driven, with developer override support.
| Op Pattern | Default Target | Why |
|---|---|---|
| Large MatMul (≥ 1M params) | CUDA / ROCm GPU | Parallelism |
| Transformer Attention | GPU / NPU | Memory-bandwidth bound |
| Small elementwise ops | CPU SIMD | GPU launch overhead > cost |
| Quantized ops | NPU (if present) | Power efficiency |
| Everything else | CPU SIMD | Correctness fallback |
Inserts memory transfer nodes automatically at device boundaries.
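The table above maps onto a simple decision function. The sketch below is illustrative only, with thresholds mirroring the table; it is not NEF's planner code, and the parameter names are made up:

```python
# Illustrative planner heuristic mirroring the table above; not NEF's actual code.
def plan_device(op, elements, quantized=False, has_gpu=True, has_npu=False):
    if quantized and has_npu:
        return "npu"                        # power efficiency
    if op == "matmul" and elements >= 1_000_000 and has_gpu:
        return "gpu"                        # large GEMMs win from parallelism
    if op == "attention" and (has_gpu or has_npu):
        return "gpu" if has_gpu else "npu"  # memory-bandwidth bound
    return "cpu_simd"                       # small/elementwise ops and fallback

print(plan_device("matmul", elements=4096 * 4096))   # gpu
print(plan_device("add", elements=1024))              # cpu_simd
```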
④ Kernel Compiler — Backend Lowering
Lowers abstract ops to backend-specific executable kernels. Results are cached by (op_type, shape, dtype, backend) — warm re-execution skips compilation entirely.
| Backend | Target | Compilation path |
|---|---|---|
| CUDA | NVIDIA GPUs | PTX / cuBLAS / cuDNN |
| ROCm | AMD GPUs | HIP / hipBLAS |
| Level Zero | Intel Arc GPUs | SPIR-V |
| CPU SIMD | x86-64 / ARM | AVX2 / AVX-512 / NEON |
| NPU Delegate | Qualcomm / Apple ANE | Vendor SDK |
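A minimal sketch of the warm-cache lookup described above; `compile_fn` is a stand-in for the real backend compiler, and this is not NEF's actual code:

```python
# Illustrative kernel cache keyed by (op_type, shape, dtype, backend).
_kernel_cache = {}

def get_kernel(op_type, shape, dtype, backend, compile_fn):
    key = (op_type, tuple(shape), dtype, backend)
    if key not in _kernel_cache:                   # cold path: compile once
        _kernel_cache[key] = compile_fn(op_type, shape, dtype, backend)
    return _kernel_cache[key]                      # warm path: no compilation
```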
⑤ Execution Runtime — Async Graph Dispatch
- Parallel scheduling — independent branches execute concurrently
- Async dispatch — non-blocking kernel launch with explicit sync barriers
- Stream management — per-device CUDA/HIP streams; CPU thread pool
- Memory coordination — host↔device transfers inserted at boundary nodes
- Materialization — tensors pulled to CPU memory only when accessed
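As a rough illustration of parallel, dependency-ordered dispatch, here is a standalone sketch using a thread pool; it is not NEF's runtime code:

```python
# Standalone sketch of async, dependency-ordered dispatch; not NEF's runtime.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_graph(deps, run_op):
    """deps: {node_id: set of prerequisite node_ids}; run_op(node_id) runs one kernel."""
    done, in_flight = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # dispatch every node whose prerequisites have all completed
            for node, prereqs in deps.items():
                if node not in done and node not in in_flight and prereqs <= done:
                    in_flight[node] = pool.submit(run_op, node)
            # block only until some in-flight kernel finishes, then re-plan
            finished, _ = wait(in_flight.values(), return_when=FIRST_COMPLETED)
            for node in [n for n, f in in_flight.items() if f in finished]:
                in_flight.pop(node).result()
                done.add(node)

run_graph({"a": set(), "b": set(), "matmul": {"a", "b"}, "softmax": {"matmul"}},
          run_op=lambda n: print("ran", n))
```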
NEF follows define-then-run, identical in philosophy to JAX JIT and MLX lazy evaluation.
```text
nef.matmul(a, b)    → graph node added. zero compute.
nef.softmax(c)      → graph node added. zero compute.
nef.layernorm(d)    → graph node added. zero compute.
result.execute()    → full graph: optimized → compiled → dispatched.
```
Execution triggers:
- An explicit `.execute()` or `.eval()` call
- A Python operation requiring a concrete value (`print(t)`, `t.numpy()`)
- HydraLogOS scheduler forcing materialization for downstream consumers
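For example, reading values implicitly triggers execution; this sketch assumes only the API already shown above:

```python
import nef

a = nef.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=nef.float32)
b = nef.tensor([[5.0, 6.0], [7.0, 8.0]], dtype=nef.float32)

c = nef.softmax(nef.matmul(a, b))   # still lazy: only graph nodes so far
print(c.numpy())                    # implicit trigger: whole graph runs, result pulled to CPU
```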
| Platform | Backend | Status |
|---|---|---|
| 🟢 NVIDIA GPU | CUDA / PTX / cuBLAS | active |
| 🟢 AMD GPU | ROCm / HIP / hipBLAS | active |
| 🟡 Intel Arc GPU | Level Zero / SPIR-V | planned |
| 🟢 CPU (x86-64) | AVX2 / AVX-512 / VNNI | active |
| 🟢 CPU (ARM) | NEON / SVE | active |
| 🟡 NPU | Qualcomm QNN / Apple ANE | planned |
| ⚪ WASM | Browser / Edge | future |
| Metric | Target |
|---|---|
| Graph planning overhead | < 1ms for graphs ≤ 10K nodes |
| GPU utilization (LLM inference) | ≥ 85% sustained |
| Kernel cache hit rate (warm) | ≥ 95% |
| CPU fallback penalty vs GPU | ≤ 2× for ops ≤ 1M elements |
| Memory transfer overhead | Zero-copy where supported; otherwise < 5% of total |
import "github.com/Hexa08/NEF"
a := nef.Tensor([]float32{1, 2, 3, 4}, []int{2, 2})
b := nef.Tensor([]float32{5, 6, 7, 8}, []int{2, 2})
c := nef.MatMul(a, b)
c.Execute()
fmt.Println(c.Numpy())NEF graphs can be saved to .nef files for deployment via the HydraLogOS registry.
```json
{
  "version": "1.0",
  "nef_format": "graph-v1",
  "graph": {
    "nodes": [
      {
        "id": "node_0",
        "op": "matmul",
        "inputs": ["tensor_a", "tensor_b"],
        "output": "tensor_c",
        "preferred_device": "gpu"
      }
    ],
    "edges": [{ "from": "node_0", "to": "node_1" }]
  },
  "tensors": {
    "tensor_a": { "shape": [1024, 4096], "dtype": "float16" }
  },
  "target": "auto",
  "compiler_cache": "embedded"
}
```
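Assuming the container is stored as the JSON shown above (an assumption about the on-disk encoding), a manifest can be inspected with nothing but the standard library; this is a sketch, not the NEF loader:

```python
# Inspect a .nef manifest with plain JSON tooling; illustrative, not the NEF loader.
import json

with open("model.nef") as f:
    manifest = json.load(f)

assert manifest["nef_format"] == "graph-v1"
for node in manifest["graph"]["nodes"]:
    print(node["id"], node["op"], "->", node["output"], f"({node['preferred_device']})")
```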
Deploy with:

```bash
hydra run model.nef
```

NEF is a first-class subsystem of HydraLogOS — not a plugin.
```text
hydra run model.nef
        │
        ▼
  hydrad daemon
        ├── resource allocation   (GPU slots · memory budget)
        ├── NEF runtime init      (device detection · graph deserialization)
        ├── execution dispatch    (async graph scheduling)
        └── result → HydraLogOS scheduler / output consumer
```
hydrad controls NEF lifecycle. NEF exposes a gRPC control interface consumed by the scheduler. Graphs can be preempted, paused, and resumed mid-execution.
| Failure | Response |
|---|---|
| GPU out of memory | Evict cache → retry on smaller batch → CPU fallback |
| Backend compile failure | Log → reroute to CPU → continue |
| NPU probe failure | Silently disable NPU; proceed with GPU/CPU |
| Graph cycle detected | NEFGraphCycleError raised at construction time |
| hydrad preemption | Serialize in-progress graph state; resume on restart |
No silent corruption. All failures surface through structured error types.
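A standalone sketch of the GPU out-of-memory ladder from the table; the error type and helpers below are stand-ins, not NEF's actual ones:

```python
# Illustrative GPU-OOM recovery ladder; error type and helpers are stand-ins.
class GpuOutOfMemory(RuntimeError):
    pass

def execute_with_fallback(run, batch, evict_cache):
    try:
        return run(batch, device="gpu")
    except GpuOutOfMemory:
        evict_cache()                                  # step 1: evict kernel cache
        try:
            half = batch[: max(1, len(batch) // 2)]
            return run(half, device="gpu")             # step 2: retry on a smaller batch
        except GpuOutOfMemory:
            return run(batch, device="cpu_simd")       # step 3: CPU fallback
```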
```bash
git clone https://github.com/Hexa08/NEF
cd NEF
pip install -e .
```

Run the test suite:
```bash
PYTHONPATH=src python -m pytest
```

Build a `.nef` file:
```bash
PYTHONPATH=src python -c "
import nef
a = nef.tensor([[1.0]], dtype=nef.float32)
b = nef.tensor([[2.0]], dtype=nef.float32)
nef.matmul(a, b).build('model.nef')
"
```

- Lazy tensor graph construction
- Deferred execution via `.execute()` / `.numpy()`
- Graph optimizer + device planner stubs
- Kernel compiler pipeline (CPU path)
- CUDA backend integration
- ROCm backend integration
- Distributed NEF (multi-node tensor/pipeline parallel)
- Streaming graphs (real-time token-by-token LLM execution)
- Quantization-aware execution (INT4 / INT8)
- Dynamic shape support (variable-length sequences)
- WASM backend (browser / edge inference)
To avoid confusion, here is what NEF is not:
- ❌ Not a replacement for PyTorch / JAX in training workflows
- ❌ Not a low-level GPU driver or CUDA wrapper
- ❌ Not a model storage format (≠ GGUF, ONNX, SafeTensors)
- ❌ Not a distributed training coordinator
NEF is the execution layer — everything below the graph, everything above the hardware.
- No direct hardware access from user code — all execution routes through `hydrad`
- Execution sandboxed under HydraLogOS's process isolation
- No arbitrary kernel injection — kernels compiled from whitelisted op templates only
- Graph validation runs before optimization; malformed graphs are rejected at construction
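As an illustration of that construction-time validation, a cycle check over the dependency map might look like the sketch below; NEF raises `NEFGraphCycleError`, and the exception here is a stand-in:

```python
# Standalone sketch of construction-time cycle detection; not NEF's validator.
def assert_acyclic(deps):
    """deps: {node_id: set of input node_ids}. Raises on a cycle."""
    state = {}                                  # node -> "visiting" | "done"

    def visit(node):
        if state.get(node) == "done":
            return
        if state.get(node) == "visiting":
            raise ValueError(f"graph cycle detected at {node!r}")
        state[node] = "visiting"
        for parent in deps.get(node, ()):
            visit(parent)
        state[node] = "done"

    for node in deps:
        visit(node)

assert_acyclic({"a": set(), "matmul": {"a"}, "softmax": {"matmul"}})   # passes silently
```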
NEF — Neural Essence Format
v0.1.0-draft · HydraLogOS Internal · github.com/Hexa08/NEF
Built for HydraLogOS. Designed to disappear into the hardware.
