```
███╗   ██╗███████╗███████╗
████╗  ██║██╔════╝██╔════╝
██╔██╗ ██║█████╗  █████╗
██║╚██╗██║██╔══╝  ██╔══╝
██║ ╚████║███████╗██║
╚═╝  ╚═══╝╚══════╝╚═╝
```
*The portable computation graph engine for AI workloads inside HydraLogOS*
Write once. Run anywhere HydraLogOS runs. No device management.
NEF is not a model format, training framework, or GPU driver wrapper.
It is a lazy computation graph system — a complete pipeline from operator definition through device planning, kernel compilation, and hardware execution — targeting heterogeneous compute across NVIDIA, AMD, Intel, NPU, and CPU targets with zero explicit device management from user code.
Running AI workloads on heterogeneous hardware today means writing this kind of code:
```python
# Without NEF — you manage everything manually
tensor = tensor.to("cuda:0")          # device hell
if torch.cuda.is_available():
    kernel = cuda_kernel(tensor)      # backend-specific paths
elif rocm_available():
    kernel = rocm_kernel(tensor)      # more branching
memory_pool.pin(tensor)               # manual memory
torch.cuda.synchronize()              # explicit sync
```

NEF eliminates all of it:
```python
import nef

a = nef.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=nef.float32)
b = nef.tensor([[5.0, 6.0], [7.0, 8.0]], dtype=nef.float32)

c = nef.matmul(a, b)   # ← no execution yet. graph node created.
c.execute()            # ← optimizer → planner → compiler → hardware. done.
```

No `tensor.to("cuda")`. No backend conditionals. No memory calls.
```text
┌─────────────────────────────────────────┐
│          NEF API (Python / Go)          │
└────────────────────┬────────────────────┘
                     │
           ┌─────────▼─────────┐
           │   Graph Builder   │   ← Lazy DAG / IR
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │     Optimizer     │   ← Fusion · Folding · Elimination
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │  Device Planner   │   ← Op → Hardware assignment
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │  Kernel Compiler  │   ← Backend-specific lowering
           └──┬─────┬─────┬──┬─┘
              │     │     │  │
       ┌──────┘     │     │  └────────┐
       ▼            ▼     ▼           ▼
 NVIDIA GPU     AMD GPU   CPU SIMD    NPU
 (CUDA/PTX)     (ROCm)    (AVX-512)   (Vendor)
       │            │     │           │
       └────────────┴──┬──┴───────────┘
                       │
             ┌─────────▼─────────┐
             │ Execution Runtime │   ← Async · Parallel · Streamed
             └───────────────────┘
```
① Graph Builder — Lazy IR Layer
Converts API calls into a Directed Acyclic Graph (DAG). No hardware decisions happen here. No execution happens here. Every op call simply extends the graph.
- Nodes — individual ops (MatMul, Softmax, LayerNorm, RMSNorm …)
- Edges — tensor dependencies between nodes
- Metadata — shape, dtype, estimated FLOPs, device hint
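As a rough illustration of what a single node record might carry, here is a sketch; the field names below are assumptions, not NEF's actual internal representation:

```python
# Illustrative sketch of a graph node record; field names are assumptions,
# not NEF's real internals.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GraphNode:
    node_id: str
    op: str                            # "matmul", "softmax", "layernorm", ...
    inputs: List[str]                  # edges: ids of the producing nodes
    shape: Tuple[int, ...]             # metadata consumed by later stages
    dtype: str = "float32"
    est_flops: int = 0                 # rough cost estimate for the planner
    device_hint: Optional[str] = None  # optional developer override
```

User code never touches these records; every API call simply appends to the graph: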
```python
a = nef.tensor([1, 2, 3])
b = nef.tensor([4, 5, 6])

c = nef.matmul(a, b)   # → DAG node added. Nothing ran.
d = nef.softmax(c)     # → DAG node added. Nothing ran.

# Graph: a,b → matmul → softmax → d
```

② Optimizer — Graph Transformation Passes
Runs a deterministic sequence of passes before compilation. Does not alter numerical output beyond floating-point rounding equivalence.
| Pass | What it does |
|---|---|
| Node Fusion | Adjacent elementwise ops collapse into a single kernel |
| Constant Folding | Static subgraphs computed at compile time |
| Dead Node Elimination | Unreachable nodes removed from graph |
| Memory Reuse | Tensors that can share buffers are identified |
| Op Simplification | Expensive ops replaced with cheaper equivalents |
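To make the fusion pass concrete, here is a standalone sketch of the idea over a linear op sequence; it is simplified and is not NEF's actual optimizer code:

```python
# Standalone sketch of node fusion; simplified, not NEF's actual optimizer.
ELEMENTWISE = {"add", "mul", "relu", "gelu"}

def fuse_elementwise(ops):
    """Collapse runs of adjacent elementwise ops into single fused nodes."""
    fused, run = [], []
    for op in ops:
        if op in ELEMENTWISE:
            run.append(op)            # keep extending the current fusion group
            continue
        if run:
            fused.append("fused(" + "+".join(run) + ")")
            run = []
        fused.append(op)              # non-elementwise ops pass through unchanged
    if run:
        fused.append("fused(" + "+".join(run) + ")")
    return fused

print(fuse_elementwise(["matmul", "mul", "add", "relu", "softmax"]))
# ['matmul', 'fused(mul+add+relu)', 'softmax']
```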
③ Device Planner — Hardware Assignment
Maps each graph node to the best available hardware target. Heuristic-driven, with developer override support.
| Op Pattern | Default Target | Why |
|---|---|---|
| Large MatMul (≥ 1M params) | CUDA / ROCm GPU | Parallelism |
| Transformer Attention | GPU / NPU | Memory-bandwidth bound |
| Small elementwise ops | CPU SIMD | GPU launch overhead > cost |
| Quantized ops | NPU (if present) | Power efficiency |
| Everything else | CPU SIMD | Correctness fallback |
Inserts memory transfer nodes automatically at device boundaries.
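The table above maps onto a simple decision function. The sketch below is illustrative only, with thresholds mirroring the table; it is not NEF's planner code, and the parameter names are made up:

```python
# Illustrative planner heuristic mirroring the table above; not NEF's actual code.
def plan_device(op, elements, quantized=False, has_gpu=True, has_npu=False):
    if quantized and has_npu:
        return "npu"                        # power efficiency
    if op == "matmul" and elements >= 1_000_000 and has_gpu:
        return "gpu"                        # large GEMMs win from parallelism
    if op == "attention" and (has_gpu or has_npu):
        return "gpu" if has_gpu else "npu"  # memory-bandwidth bound
    return "cpu_simd"                       # small/elementwise ops and fallback

print(plan_device("matmul", elements=4096 * 4096))   # gpu
print(plan_device("add", elements=1024))              # cpu_simd
```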
④ Kernel Compiler — Backend Lowering
Lowers abstract ops to backend-specific executable kernels. Results are cached by (op_type, shape, dtype, backend) — warm re-execution skips compilation entirely.
| Backend | Target | Compilation path |
|---|---|---|
| CUDA | NVIDIA GPUs | PTX / cuBLAS / cuDNN |
| ROCm | AMD GPUs | HIP / hipBLAS |
| Level Zero | Intel Arc GPUs | SPIR-V |
| CPU SIMD | x86-64 / ARM | AVX2 / AVX-512 / NEON |
| NPU Delegate | Qualcomm / Apple ANE | Vendor SDK |
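A minimal sketch of the warm-cache lookup described above; `compile_fn` is a stand-in for the real backend compiler, and this is not NEF's actual code:

```python
# Illustrative kernel cache keyed by (op_type, shape, dtype, backend).
_kernel_cache = {}

def get_kernel(op_type, shape, dtype, backend, compile_fn):
    key = (op_type, tuple(shape), dtype, backend)
    if key not in _kernel_cache:                   # cold path: compile once
        _kernel_cache[key] = compile_fn(op_type, shape, dtype, backend)
    return _kernel_cache[key]                      # warm path: no compilation
```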
⑤ Execution Runtime — Async Graph Dispatch
- Parallel scheduling — independent branches execute concurrently
- Async dispatch — non-blocking kernel launch with explicit sync barriers
- Stream management — per-device CUDA/HIP streams; CPU thread pool
- Memory coordination — host↔device transfers inserted at boundary nodes
- Materialization — tensors pulled to CPU memory only when accessed
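As a rough illustration of parallel, dependency-ordered dispatch, here is a standalone sketch using a thread pool; it is not NEF's runtime code:

```python
# Standalone sketch of async, dependency-ordered dispatch; not NEF's runtime.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_graph(deps, run_op):
    """deps: {node_id: set of prerequisite node_ids}; run_op(node_id) runs one kernel."""
    done, in_flight = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # dispatch every node whose prerequisites have all completed
            for node, prereqs in deps.items():
                if node not in done and node not in in_flight and prereqs <= done:
                    in_flight[node] = pool.submit(run_op, node)
            # block only until some in-flight kernel finishes, then re-plan
            finished, _ = wait(in_flight.values(), return_when=FIRST_COMPLETED)
            for node in [n for n, f in in_flight.items() if f in finished]:
                in_flight.pop(node).result()
                done.add(node)

run_graph({"a": set(), "b": set(), "matmul": {"a", "b"}, "softmax": {"matmul"}},
          run_op=lambda n: print("ran", n))
```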
NEF follows define-then-run, identical in philosophy to JAX JIT and MLX lazy evaluation.
```text
nef.matmul(a, b)    → graph node added. zero compute.
nef.softmax(c)      → graph node added. zero compute.
nef.layernorm(d)    → graph node added. zero compute.
result.execute()    → full graph: optimized → compiled → dispatched.
```
Execution triggers:
- An explicit `.execute()` or `.eval()` call
- A Python operation requiring a concrete value (`print(t)`, `t.numpy()`)
- HydraLogOS scheduler forcing materialization for downstream consumers
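For example, reading values implicitly triggers execution; this sketch assumes only the API already shown above:

```python
import nef

a = nef.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=nef.float32)
b = nef.tensor([[5.0, 6.0], [7.0, 8.0]], dtype=nef.float32)

c = nef.softmax(nef.matmul(a, b))   # still lazy: only graph nodes so far
print(c.numpy())                    # implicit trigger: whole graph runs, result pulled to CPU
```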
| Platform | Backend | Status |
|---|---|---|
| 🟢 NVIDIA GPU | CUDA / PTX / cuBLAS | active |
| 🟢 AMD GPU | ROCm / HIP / hipBLAS | active |
| 🟡 Intel Arc GPU | Level Zero / SPIR-V | planned |
| 🟢 CPU (x86-64) | AVX2 / AVX-512 / VNNI | active |
| 🟢 CPU (ARM) | NEON / SVE | active |
| 🟡 NPU | Qualcomm QNN / Apple ANE | planned |
| ⚪ WASM | Browser / Edge | future |
| Metric | Target |
|---|---|
| Graph planning overhead | < 1ms for graphs ≤ 10K nodes |
| GPU utilization (LLM inference) | ≥ 85% sustained |
| Kernel cache hit rate (warm) | ≥ 95% |
| CPU fallback penalty vs GPU | ≤ 2× for ops ≤ 1M elements |
| Memory transfer overhead | Zero-copy where supported; otherwise < 5% of total |
import "github.com/Hexa08/NEF"
a := nef.Tensor([]float32{1, 2, 3, 4}, []int{2, 2})
b := nef.Tensor([]float32{5, 6, 7, 8}, []int{2, 2})
c := nef.MatMul(a, b)
c.Execute()
fmt.Println(c.Numpy())NEF graphs can be saved to .nef files for deployment via the HydraLogOS registry.
```json
{
  "version": "1.0",
  "nef_format": "graph-v1",
  "graph": {
    "nodes": [
      {
        "id": "node_0",
        "op": "matmul",
        "inputs": ["tensor_a", "tensor_b"],
        "output": "tensor_c",
        "preferred_device": "gpu"
      }
    ],
    "edges": [{ "from": "node_0", "to": "node_1" }]
  },
  "tensors": {
    "tensor_a": { "shape": [1024, 4096], "dtype": "float16" }
  },
  "target": "auto",
  "compiler_cache": "embedded"
}
```
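Assuming the container is stored as the JSON shown above (an assumption about the on-disk encoding), a manifest can be inspected with nothing but the standard library; this is a sketch, not the NEF loader:

```python
# Inspect a .nef manifest with plain JSON tooling; illustrative, not the NEF loader.
import json

with open("model.nef") as f:
    manifest = json.load(f)

assert manifest["nef_format"] == "graph-v1"
for node in manifest["graph"]["nodes"]:
    print(node["id"], node["op"], "->", node["output"], f"({node['preferred_device']})")
```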
Deploy with:

```bash
hydra run model.nef
```

NEF is a first-class subsystem of HydraLogOS — not a plugin.
```text
hydra run model.nef
        │
        ▼
  hydrad daemon
        ├── resource allocation   (GPU slots · memory budget)
        ├── NEF runtime init      (device detection · graph deserialization)
        ├── execution dispatch    (async graph scheduling)
        └── result → HydraLogOS scheduler / output consumer
```
hydrad controls NEF lifecycle. NEF exposes a gRPC control interface consumed by the scheduler. Graphs can be preempted, paused, and resumed mid-execution.
| Failure | Response |
|---|---|
| GPU out of memory | Evict cache → retry on smaller batch → CPU fallback |
| Backend compile failure | Log → reroute to CPU → continue |
| NPU probe failure | Silently disable NPU; proceed with GPU/CPU |
| Graph cycle detected | NEFGraphCycleError raised at construction time |
| hydrad preemption | Serialize in-progress graph state; resume on restart |
No silent corruption. All failures surface through structured error types.
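A standalone sketch of the GPU out-of-memory ladder from the table; the error type and helpers below are stand-ins, not NEF's actual ones:

```python
# Illustrative GPU-OOM recovery ladder; error type and helpers are stand-ins.
class GpuOutOfMemory(RuntimeError):
    pass

def execute_with_fallback(run, batch, evict_cache):
    try:
        return run(batch, device="gpu")
    except GpuOutOfMemory:
        evict_cache()                                  # step 1: evict kernel cache
        try:
            half = batch[: max(1, len(batch) // 2)]
            return run(half, device="gpu")             # step 2: retry on a smaller batch
        except GpuOutOfMemory:
            return run(batch, device="cpu_simd")       # step 3: CPU fallback
```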
```bash
git clone https://github.com/Hexa08/NEF
cd NEF
pip install -e .
```

Run the test suite:
```bash
PYTHONPATH=src python -m pytest
```

Build a `.nef` file:
```bash
PYTHONPATH=src python -c "
import nef
a = nef.tensor([[1.0]], dtype=nef.float32)
b = nef.tensor([[2.0]], dtype=nef.float32)
nef.matmul(a, b).build('model.nef')
"
```

- Lazy tensor graph construction
- Deferred execution via `.execute()` / `.numpy()`
- Graph optimizer + device planner stubs
- Kernel compiler pipeline (CPU path)
- CUDA backend integration
- ROCm backend integration
- Distributed NEF (multi-node tensor/pipeline parallel)
- Streaming graphs (real-time token-by-token LLM execution)
- Quantization-aware execution (INT4 / INT8)
- Dynamic shape support (variable-length sequences)
- WASM backend (browser / edge inference)
To avoid confusion, here is what NEF is not:
- ❌ Not a replacement for PyTorch / JAX in training workflows
- ❌ Not a low-level GPU driver or CUDA wrapper
- ❌ Not a model storage format (≠ GGUF, ONNX, SafeTensors)
- ❌ Not a distributed training coordinator
NEF is the execution layer — everything below the graph, everything above the hardware.
- No direct hardware access from user code — all execution routes through `hydrad`
- Execution sandboxed under HydraLogOS's process isolation
- No arbitrary kernel injection — kernels compiled from whitelisted op templates only
- Graph validation runs before optimization; malformed graphs are rejected at construction
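As an illustration of that construction-time validation, a cycle check over the dependency map might look like the sketch below; NEF raises `NEFGraphCycleError`, and the exception here is a stand-in:

```python
# Standalone sketch of construction-time cycle detection; not NEF's validator.
def assert_acyclic(deps):
    """deps: {node_id: set of input node_ids}. Raises on a cycle."""
    state = {}                                  # node -> "visiting" | "done"

    def visit(node):
        if state.get(node) == "done":
            return
        if state.get(node) == "visiting":
            raise ValueError(f"graph cycle detected at {node!r}")
        state[node] = "visiting"
        for parent in deps.get(node, ()):
            visit(parent)
        state[node] = "done"

    for node in deps:
        visit(node)

assert_acyclic({"a": set(), "matmul": {"a"}, "softmax": {"matmul"}})   # passes silently
```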
NEF — Neural Essence Format
v0.1.0-draft · HydraLogOS Internal · github.com/Hexa08/NEF
Built for HydraLogOS. Designed to disappear into the hardware.
