kernel/nnie_neo: clean-room NNIE CNN driver for cv500/av300 #145

Merged: widgetii merged 49 commits into main from nnie-neo on May 17, 2026

@widgetii widgetii commented May 17, 2026

Summary

Open-source replacement for vendor open_nnie.ko on cv500-family SoCs (hi3516cv500, av300, dv300). Drives the NNIE CNN inference block at phys 0x11100000 and exposes the vendor-compatible ABI on /dev/nnie, so existing userspace using vendor libnnie.so works unchanged.

Closes #111.

Validation

mnist Forward output byte-identical to vendor:

vendor:    dst[10 scores]: 408 412 401 401 398 412 398 405 449 401
open-src:  dst[10 scores]: 408 412 401 401 398 412 398 405 449 401
  Test                                                 | Result
  -----------------------------------------------------+--------------------------------
  Single mnist Forward (open-src lib + open-src ko)    | output byte-identical to vendor
  520 sequential Forwards (slot_idx ring wrap mod 512) | 520/520 PASS
  4-way parallel Forward x 5 rounds                    | 20/20 PASS
  GDC coexistence (open_gdc.ko loaded, idle)           | PASS
  LoadModel decode for segnet / ssd / lstm             | input nodes parse correctly

What's in the box

  • kernel/nnie_neo/ — kernel driver
    • nnie_init.c: platform-device probe (DT hisilicon,hisi-nnie), IRQ + MMZ pre-alloc
    • nnie_neo.c: /dev/nnie miscdevice + Forward / AddTskBuf / RemoveTskBuf ioctl handlers + HW dispatch
    • nnie_hw_task.h: 64-byte HW descriptor + tskbuf variable-length tail layout
    • nnie_hw_regs.h: cv500 NNIE register map (0x11100000)
    • Wired into kernel/hi3516cv500.kbuild
  • libraries/nnie_neo/ — userspace
    • nnie_ops.c: HI_MPI_SVP_NNIE_LoadModel, _Forward, _ForwardWithBbox, _Query, _AddTskBuf, _RemoveTskBuf, _UnloadModel, _GetTskBufSize
    • nnie_wk_format.h: .wk file-format constants + header struct

Definition-of-done (issue #111)

  • New kernel module open_nnie_neo.ko bound to hisilicon,hisi-nnie on cv500
  • CNN model loader + forward pass produce non-trivial output on a small test model
  • Round-trip test — mnist Forward output matches vendor byte-for-byte
  • Memory arbitration / SRAM ownership sequence captured and applied

Known limitations (follow-ups, not blockers)

  • HI_MPI_SVP_NNIE_GetTskBufSize returns HI_ERR_SVP_NNIE_NOT_SURPPORT. Vendor walks the parsed instruction stream to compute per-segment buffer sizes; userspace callers that need it should pre-compute or hardcode for now.
  • HI_MPI_SVP_NNIE_ForwardWithBbox returns HI_ERR_SVP_NNIE_NOT_SURPPORT. The bbox-mode dispatch path (4 extra ioctls in the 0x4d05..0x4d09 range) isn't wired yet.
  • u32TmpBufSize is set to 8 MB heuristically rather than computed from the model. Fine for classification-class models; large detection models may need vendor's exact value.
  • LSTM (enNetType=2) tskbuf tail layout uses the same builder as CNN. RECURRENT-net dispatch is not exercised.
  • LoadModel for output nodes hardcodes enType=SVP_BLOB_TYPE_S32 and NodeId=(j+1)*8 since the file format doesn't store these for outputs; vendor's libnnie does the same. The W/H/C field order differs from inputs — vendor swaps so the layer's feature count lands in Width.

Test plan

  • CI build for hi3516cv500 chiparch
  • CI builds for non-cv500 chiparchs (ev200, gk7205v200, etc.) unaffected (nnie_neo only included from hi3516cv500.kbuild)
  • Manual: insmod open_nnie_neo.ko on av300, run vendor's sample_nnie_main against mnist .wk

🤖 Generated with Claude Code

widgetii and others added 30 commits May 15, 2026 07:28
First slice of #111 (clean-room NNIE CNN driver for cv500/av300/dv300).
The full backend is multi-day RE; this commit lands only the platform-
driver scaffold:

- kernel/nnie_neo/nnie_init.c — platform_device probe, binds to
  "hisilicon,hisi-nnie" DT node. Maps the nnie0 register window
  (0x11100000 on cv500), records the nnie0 + gdc IRQs. Skips the gdc
  register region (owned by open_gdc.ko — sharing the DT node would
  EBUSY).
- kernel/nnie_neo/nnie_neo.c — registers /dev/nnie via osal_createdev.
  Single ioctl dispatch path that returns -EOPNOTSUPP for everything
  (Phase 0). Phase 1 will decode the eight HI_MPI_SVP_NNIE_* ioctl
  numbers + arg-buf layouts from vendor libnnie.so and wire real
  handlers.
- kernel/nnie_neo/Kbuild — mirrors ive_neo/Kbuild structure.
- kernel/hi3516cv500.kbuild — pulls nnie_neo/Kbuild in alongside the
  existing vendor $(PREFIX)nnie wrapper. Both modules can build; init
  scripts pick one to insmod at runtime (same pattern as ive vs
  ive_neo).

On-target av300 verification (after sysrq reboot to load fresh):
  $ insmod /tmp/open_nnie_neo.ko && lsmod | grep nnie
  open_nnie_neo           2949  0
  $ ls -la /dev/nnie
  crw-rw---- 1 root root 218, 100  /dev/nnie
  $ dmesg | grep nnie
  nnie_neo: probed nnie0=f4470000 irq=54  gdc_irq=53
  nnie_neo: /dev/nnie ready (Phase 0 stub — all ioctls return -EOPNOTSUPP)

Known Phase 0 limitation: request_irq fails with -16 (-EBUSY) because
vendor's IRQ handler is registered with IRQF_SHARED and our flags=0 conflicts.
Phase 3 will switch to IRQF_SHARED once we actually need IRQ-driven
completion. /dev/nnie remains usable for the -EOPNOTSUPP path.

Not pushing or opening a PR yet — per the bundle-on-one-branch
feedback, NNIE work stays on this local branch until Phase 4 (end-
to-end test on av300 with a tiny real model) passes.
Static RE of cv500 vendor libnnie.so (42 KB, 8 public entries) +
kernel-side svp_nnie_ioctl @0x26b8 in vendor blob. Five distinct
ioctls reach /dev/nnie; the other three public entries (LoadModel,
UnloadModel, GetTskBufSize) are pure-userspace — they only touch
MMZ via HI_MPI_SYS_MmzAlloc/Free, no /dev/nnie call.

  nr | size | full ioctl   | API entry
 ----+------+--------------+----------------------------
 0x00| 1624 | 0xc6584d00   | HI_MPI_SVP_NNIE_Forward
 0x01| 1728 | 0xc6c04d01   | HI_MPI_SVP_NNIE_ForwardWithBbox
 0x02|   24 | 0xc0184d02   | HI_MPI_SVP_NNIE_Query
 0x03|   24 | 0xc0184d03   | HI_MPI_SVP_NNIE_AddTskBuf
 0x04|   24 | 0xc0184d04   | HI_MPI_SVP_NNIE_RemoveTskBuf
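
The "full ioctl" column is consistent with the generic Linux _IOC
encoding (direction in bits 31:30, arg size in 29:16, type byte in 15:8,
nr in 7:0). A minimal sketch that rebuilds the table values, assuming
that generic layout; the helper names nnie_ioc() and nnie_ioc_size() are
hypothetical and not part of the driver:

```c
#include <stdint.h>

/* Generic Linux _IOC bit layout, restated locally so the sketch builds
 * without kernel headers. Both constants are read off the table above. */
#define NNIE_IOC_DIR_RW 3u     /* read | write, top two bits of 0xc....... */
#define NNIE_IOC_TYPE   0x4du  /* type byte shared by all five ioctls */

/* Hypothetical helper: rebuild a full ioctl number from nr and arg size. */
static uint32_t nnie_ioc(uint32_t nr, uint32_t size)
{
    return (NNIE_IOC_DIR_RW << 30) | (size << 16) | (NNIE_IOC_TYPE << 8) | nr;
}

/* Hypothetical helper: extract the 14-bit arg size from an ioctl number. */
static uint32_t nnie_ioc_size(uint32_t cmd)
{
    return (cmd >> 16) & 0x3fffu;
}
```

Reading the size field back out of a captured ioctl number is a quick
way to sanity-check an arg-buffer length before decoding its layout.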

Vendor kernel dispatcher additionally recognises 0x4d05/06/07/08/09
when the per-call context state == 0xc — those are bbox-mode dispatch
variants, deferred to Phase 3 once we have an actual model + Forward
call to exercise them.

Phase 1 handler bodies still return -EOPNOTSUPP for Forward,
AddTskBuf, RemoveTskBuf. Query stubs done=1 (matches ive_neo pattern
since dispatch is synchronous once wired).

Implication for Phase 2 scope: the kernel ABI surface is only 5
ioctls; the model loader is mostly userspace work in libnnie_neo.so.

Full ioctl-ABI reference saved to kaeru as nnie-neo-cv500-ioctl-abi.
Partial RE of cv500 vendor libnnie.so HI_MPI_SVP_NNIE_LoadModel @0x1bf4
(size 0x12d8). Phase 2 of #111. Decoded by tracing the loader's stack
buffer reads at sp+84..sp+275 (the 192-byte file-header copy).

What landed:
- libraries/nnie_neo/ — new userspace library exporting all 8 vendor
  HI_MPI_SVP_NNIE_* entry points with vendor-matching const-qualified
  signatures. Builds clean for cv500 against the cv500 kernel-include
  tree (libnnie_neo.so = 7.5 KB on cv500 cross-compile).
- include/nnie_wk_format.h — 192-byte .wk file header struct, decoded
  fields:
    [0..3]   u32 CRC32 (zlib-style, IEEE 802.3 0xEDB88320 reflected,
             of bytes [4..filesize))
    [16..19] format-version digits {1,1,1,2} (10*[16]+[17] == 11,
             10*[18]+[19] == 12 per loader checks @0x1dbc-0x1de4)
    [48]     enRunMode → SVP_NNIE_MODEL_S.enRunMode @0x1df0
    [49]     u32NetSegNum → SVP_NNIE_MODEL_S.u32NetSegNum @0x1dfc
    [52..55] inst_offset_extra (sp+136 read, > 0xBF, bounds-checked)
    [56..59] inst_len (sp+140 read, non-zero, bounds-checked)
    [176..179] dup of inst_offset_extra (sp+260 read, must equal)
    [180..183] some count (sp+264 read, > 47)
- src/nnie_ops.c — LoadModel verifies CRC32 + version bytes, then
  returns HI_ERR_SVP_NNIE_NOT_SURPPORT (the segment-table / ROI-info /
  weights parsing is the Phase 3 work). All other 7 entries are also
  stubbed with vendor-matching signatures.

Phase 2 limitations (deferred to Phase 3):
- Segment table iteration (astSeg[u32NetSegNum]) — vendor walks an
  array of SVP_NNIE_SEG_S starting at some file offset within the
  inst_offset_extra region. Layout unconfirmed.
- ROI pool info (astRoiInfo[]) — vendor walks SVP_NNIE_ROIPOOL_INFO_S
  records, count derived from segment metadata.
- u32TmpBufSize calculation — likely a running sum across segments.
- stBase fill — needs the SVP_SRC_MEM_INFO_S passed in as model base.

On-target verification: blocked this session. The av300 board wedged
after the Phase 1 sysrq reboot cycle: it answers ping but not ssh.
Static decode of the loader and bench-build of libnnie_neo.so both
pass. The /tmp/nnie_crc_check ARM binary is staged for the test once
the board is power-cycled. Expected results:
  valid mnist.wk   -> 0xa0338008 (NOT_SURPPORT, CRC passes)
  corrupt mnist.wk -> 0xa0338003 (ILLEGAL_PARAM, CRC fails)

NNIE work continues on the local nnie-neo branch — no PR until
Phase 4 (end-to-end inference on av300 with a tiny real model) passes,
per the bundle-on-one-branch feedback.
Two bugs in the Phase 2 CRC verifier, found on first av300 test run
against vendor inst_mnist_cycle.wk (466176 B):

1. Init value: vendor accumulator starts at 0, not 0xFFFFFFFF (zlib
   convention). Confirmed by re-reading libnnie.so 0x1cd0-0x1cd8 —
   the special-case `mvneq r0, #0` only fires when r1==4 (the
   trivially-short-payload path). The general path enters the CRC
   loop with r0 still at whatever it was before, which is 0 (the
   memcpy_s return value at 1c64 was checked as 0). Final XOR is
   still 0xFFFFFFFF (`mvn r0, r0` at 1d0c).

2. CRC coverage range: bytes [4..file[52]+file[56]), not the whole
   file. file[52..55] is inst_offset_extra (header tail offset),
   file[56..59] is inst_len. Their sum is the end of the CRC-protected
   region; weights / quantization tables after that point aren't CRC-
   protected (vendor relies on instruction-stream offsets to address
   them).
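
The two fixes above can be sketched as a standalone verifier. This is
an illustration under stated assumptions, not the code in nnie_ops.c:
header fields are assumed little-endian, the function names are
hypothetical, and the vendor's r1==4 short-payload special case is not
modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian u32 load; .wk header fields are assumed little-endian. */
static uint32_t wk_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Reflected CRC-32 (poly 0xEDB88320) with the accumulator starting at 0
 * instead of 0xFFFFFFFF, and a final XOR with 0xFFFFFFFF. */
static uint32_t wk_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Returns 1 when the CRC stored at file[0..3] matches the CRC of bytes
 * [4 .. file[52] + file[56]), 0 on mismatch or out-of-range bounds. */
static int wk_crc_ok(const uint8_t *file, size_t file_size)
{
    if (file_size < 60)
        return 0;
    uint64_t end = (uint64_t)wk_le32(file + 52) + wk_le32(file + 56);
    if (end <= 4 || end > file_size)
        return 0;
    return wk_crc32(file + 4, (size_t)(end - 4)) == wk_le32(file);
}
```

Because coverage stops at inst_offset_extra + inst_len, a change to the
bytes after that point leaves wk_crc_ok() returning 1 in this sketch.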

Confirmed against vendor mnist.wk on av300:
  stored CRC = 0xa4a25b1a
  computed   = 0xa4a25b1a after the fix

Test results now match the Phase 2 expectations:
  /tmp/nnie_crc_check on av300:
    LoadModel valid    -> 0xa0338008 (NOT_SURPPORT — CRC + version pass,
                                      parser body unimplemented)
    LoadModel corrupt  -> 0xa0338003 (ILLEGAL_PARAM — CRC fails)
Continuing Phase 2 RE of HI_MPI_SVP_NNIE_LoadModel @0x1bf4. Tracing
the post-CRC parse path at 0x1e70-0x1edc reveals the per-segment
file record layout (16 B header part + variable-length node arrays):

  off | width | maps to SVP_NNIE_SEG_S field
  ----+-------+------------------------------
   0  | u8    | enNetType (must be <= 2)
   1  | u8    | u16SrcNum (zero-extended in struct)
   2  | u8    | u16DstNum
   3  | u8    | u16RoiPoolNum
   4  | u16   | u16MaxStep (<= 1024)
   6  | u16   | pad/unk
   8  | u32   | u32InstOffset (bounds-checked vs inst_offset_extra +
                inst_len)
  12  | u32   | u32InstLen (16-byte aligned)
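
As a rough illustration, the 16-byte record can be expressed as a C
struct with compile-time layout checks. Field names here are
placeholders (the real definition is nnie_wk_seg_record_t in
nnie_wk_format.h), and the exact form of the bounds comparison is an
assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the per-segment file record; only offsets and widths are
 * taken from the table, the member names are illustrative. */
typedef struct {
    uint8_t  net_type;     /* +0  enNetType, must be <= 2 */
    uint8_t  src_num;      /* +1  zero-extended into u16SrcNum */
    uint8_t  dst_num;      /* +2  u16DstNum */
    uint8_t  roipool_num;  /* +3  u16RoiPoolNum */
    uint16_t max_step;     /* +4  u16MaxStep, <= 1024 */
    uint16_t unk_06;       /* +6  pad / unknown */
    uint32_t inst_offset;  /* +8  u32InstOffset */
    uint32_t inst_len;     /* +12 u32InstLen, 16-byte aligned */
} wk_seg_record_sketch_t;

static_assert(sizeof(wk_seg_record_sketch_t) == 16, "record is 16 bytes");
static_assert(offsetof(wk_seg_record_sketch_t, max_step) == 4, "max_step");
static_assert(offsetof(wk_seg_record_sketch_t, inst_offset) == 8, "offset");
static_assert(offsetof(wk_seg_record_sketch_t, inst_len) == 12, "len");

/* Applies the checks listed in the table. region_end stands for
 * inst_offset_extra + inst_len from the file header; treating the check
 * as "offset + len <= region_end" is an assumption. */
static int wk_seg_record_valid(const wk_seg_record_sketch_t *r,
                               uint32_t region_end)
{
    if (r->net_type > 2 || r->max_step > 1024)
        return 0;
    if (r->inst_len % 16 != 0)
        return 0;
    if ((uint64_t)r->inst_offset + r->inst_len > region_end)
        return 0;
    return 1;
}
```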

Followed by node-array records — Phase 3 prereq.

Also confirmed file[60..63] is u32TmpBufSize (the loader stores it
at pstModel+4 = SVP_NNIE_MODEL_S.u32TmpBufSize, then checks != 0).

Updated nnie_wk_header_t with the new field names + added
nnie_wk_seg_record_t struct. Header file now documents the full
high-level format from CRC through segment-table header.
More LoadModel disassembly progress (vendor libnnie.so 0x1f30-0x208c).
Node-record layout in the segment table is asymmetric:

  Segment data block starts at file[192] (= end of the fixed header,
  per file[8] = 0xC0). Structure within a segment block:
    +0..15   nnie_wk_seg_record_t (the 16-B segment header)
    +16..29  first source-node WHC metadata (compact, ~14 B)
              [16..19] u32 -> NODE_S.unShape.stWhc.u32Height
              [20..23] u32 -> NODE_S.unShape.stWhc.u32Width
              [24..27] u32 -> NODE_S.unShape.stWhc.u32Chn
              [30..31] u16 -> NODE_S.enType (post-mapped via 2/3/4/5
                              -> SVP_BLOB_TYPE enum lookup)
    +32..    array of 64-B node slots, stride 64. First field of each
              slot is the 32-byte szName. Remaining 32 bytes hold
              additional WHC/dim + blob_type fields (offsets in slot
              read at 0x205c-0x20bc, exact layout still partial).

Verified against vendor mnist.wk:
  file[208..211] = 28   (u32 Height)
  file[212..215] = 28   (u32 Width)
  file[216..219] = 1    (u32 Chn — grayscale)
  file[224..227] = "data"  (Caffe input-layer name)

Updated nnie_wk_format.h with nnie_wk_node_slot_t (64 B). Phase 3 will
fill the unk_20_3F[32] tail with confirmed field offsets once we have
a Forward dispatch + kprobe trace exercising real inference. Still
TODO this phase: ROIPool record layout, multi-segment models, segment-
boundary computation when NetSegNum > 1.
Decoded the per-ioctl arg-buffer layouts by tracing the userspace
worker functions in libnnie.so:

Forward (ioctl 0xc6584d00, arg size 1624 B), worker @0x104c:
   off | size | content
   ----+------+----------------------------------------------
      0 |   4 | HI_HANDLE (out — kernel writes assigned handle)
      4 |   4 | pad
      8 | 768 | astSrc[16] — 16 SVP_BLOB_S (48 B each)
    776 |   8 | pad
    784 | 768 | astDst[16]
   1552 |  64 | SVP_NNIE_FORWARD_CTRL_S {SrcNum, DstNum, NetSegId,
              |    enNnieId, stTmpBuf(24), stTskBuf(24)}
   1616 |   4 | bInstant
   1620 |   4 | pad
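
The offsets in this table can be pinned down with an opaque struct and
static assertions. This is a sketch only: the type name is made up, the
blobs and ctrl block are kept as raw bytes, and reading the first four
ctrl fields as consecutive u32 values is an inference from the sizes
(64 = 4*4 + 24 + 24) rather than something the table states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opaque view of the 1624-byte Forward ioctl argument. */
typedef struct {
    uint32_t handle;        /* +0    out: handle assigned by the kernel */
    uint32_t pad0;          /* +4 */
    uint8_t  src[16][48];   /* +8    astSrc[16], 48 B per SVP_BLOB_S */
    uint8_t  pad1[8];       /* +776 */
    uint8_t  dst[16][48];   /* +784  astDst[16] */
    uint8_t  ctrl[64];      /* +1552 SVP_NNIE_FORWARD_CTRL_S */
    uint32_t instant;       /* +1616 bInstant */
    uint32_t pad2;          /* +1620 */
} nnie_forward_arg_sketch_t;

static_assert(offsetof(nnie_forward_arg_sketch_t, src) == 8, "astSrc");
static_assert(offsetof(nnie_forward_arg_sketch_t, dst) == 784, "astDst");
static_assert(offsetof(nnie_forward_arg_sketch_t, ctrl) == 1552, "ctrl");
static_assert(offsetof(nnie_forward_arg_sketch_t, instant) == 1616, "inst");
static_assert(sizeof(nnie_forward_arg_sketch_t) == 1624, "ioctl arg size");

/* Reads the idx-th u32 of the ctrl block in host byte order
 * (0 = SrcNum, 1 = DstNum, 2 = NetSegId, 3 = enNnieId, by assumption). */
static uint32_t nnie_arg_ctrl_u32(const nnie_forward_arg_sketch_t *arg,
                                  unsigned idx)
{
    uint32_t v;
    memcpy(&v, arg->ctrl + 4u * idx, sizeof v);
    return v;
}
```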

AddTskBuf / RemoveTskBuf (ioctls 0xc0184d03 / 0xc0184d04, 24 B):
  plain SVP_MEM_INFO_S {u64 phys, u64 virt, u32 size, u32 pad}.
  Verified at 0x3134-0x3150 in libnnie.so.

Updated kernel handlers:
- nnie_op_forward parses the 64 B ctrl block (SrcNum/DstNum/NetSegId)
  + writes handle = 0 to buf+0 + returns -EOPNOTSUPP. Phase 4 will
  walk astSrc/astDst phys addrs, apply the [0x90] memory-priority knob
  (skipped for IVE, required for NNIE), and submit to NNIE HW.
- nnie_op_add_tskbuf / remove_tskbuf parse the MEM_INFO_S triple,
  print it, return -EOPNOTSUPP.
- nnie_op_query unchanged (done=1 stub).

ForwardWithBbox arg-layout (1728 B = 1624 + 104) is similar to
Forward with an extra ProposalNum + bbox MEM_INFO block. Precise
offsets TBD after Forward HW path works.

Same nnie-neo branch; no PR until Phase 4 (real inference) passes.
Key result from continuing the vendor open_nnie.ko disassembly: NNIE
HW dispatch is *not* direct register access. It goes through vendor's
cmpi cross-module function-pointer indirection.

Two sites of interest:

1. drv_svp_nnie_config_ram -> hal_svp_nnie_enable_ram @0xb8f4:
   The "[0x90] memory-priority knob" mentioned in #111 isn't written
   by open_nnie.ko at all. The function calls
   cmpi_get_module_func_by_id(51, 0xd1) where 51 is the open_sys.ko
   module ID and 0xd1 is a function selector, then blx's the returned
   function pointer with state at sp+4. So the register write lives
   in open_sys.ko's exported function table.

2. svp_nnie_start_task @0x1934:
   After drv_svp_nnie_prepare_nnie, it calls
   cmpi_get_module_func_by_id(37) -> r5 (some module's func table),
   then dispatches via four entries:
     r5+0x78  -> prepare submission
     r5+0x7c -> fire/wait
     r5+0x80 -> exists check
     r5+0x84 -> finalize / select_ram fallback
   Same dispatch pattern but mediated by cmpi.

Implication for the clean-room: Phase 4 needs to either (a) replicate
cmpi's cross-module function-pointer contract end-to-end, or (b) RE
open_sys.ko to find the actual NNIE register writes and bypass cmpi.
(b) is cleaner because it avoids depending on a vendor-shared
function-pointer ABI that could drift.

Documented in nnie_op_forward's TODO block. No code change to the
dispatch path; the parsed-arg + return-EOPNOTSUPP shape is still
what userland sees.
RAM/mutex coordination

Followed the cmpi indirection from hal_svp_nnie_enable_ram into
hi_sys.o (open_sys.ko's vendor blob). Resolution to the previous
session's open question:

- cmpi_get_module_func_by_id(51, 0xd1) -> sys_hal_gdc_nnie_set_ram_using
  @0x897c in cv500's hi_sys.o.
- That function does an atomic bit-set/clear on bit 0 of register
  offset +0x34 of the sys-module's MMIO window (loaded from
  g_sys_state[16] base).
- The "atomic bit-set" pattern is in a private helper at
  sys_drv_get_cmp_3dnr_cfg+0x1c8 (= 0x7a84): spin_lock_irqsave /
  read-old / XOR-new / mask-to-bit / XOR-old / write-back / spin_unlock.
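
The read-old / XOR-new / mask / XOR-old sequence is the usual masked
merge. A sketch of just the arithmetic, with a hypothetical function
name; the spinlock and the actual MMIO read and write-back are left
out, so this is not the vendor helper itself:

```c
#include <stdint.h>

/* Replace only the bits selected by mask with the matching bits of val:
 *   tmp = old ^ val;   differences between old and new
 *   tmp &= mask;       keep differences inside the field only
 *   tmp ^= old;        apply them to old
 */
static uint32_t sys_merge_bits(uint32_t old, uint32_t val, uint32_t mask)
{
    return old ^ ((old ^ val) & mask);
}
```

Bits outside mask pass through unchanged, so one helper can serve both
the set and the clear case by varying val.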

So the original issue #111 wording — "memory arbitration [0x90]
register" — was a partial description. The real mechanism is a set
of bit-writes coordinating RAM ownership and mutex between NNIE / GDC
/ VENC, distributed across these sys-module HAL functions:
  - sys_hal_gdc_nnie_set_ram_using
  - sys_hal_gdc_nnie_mutex_sel
  - sys_hal_venc_nnie_mutex_sel
  - sys_hal_nnie_get_mutex_state
  - sys_hal_nnie_gdc_get_mutex_state
  - sys_hal_vgs_bootroom_set_ram_using
And it's all anchored at sys MMIO base + offset 0x34..0x44ish, not
at 0x90 of the IVE block.

Updated nnie_op_forward's TODO block with the new findings. Two
paths forward for Phase 4 dispatch wiring:
  (a) ioremap the sys register window from open_nnie_neo.ko and do
      the bit-writes directly. Cleanest, doesn't depend on a clean-
      room open_sys module existing.
  (b) Wait for an open_sys clean-room module (separate effort) and
      use a proper exported API.

This is real progress: we have a concrete register + bit-set target
instead of "the [0x90] knob" which doesn't actually exist on the IVE
block. Next session resumes by ioremap'ing the sys MMIO window
(address TBD — probably accessible via DT, since hi3516cv500.dtsi
declares hisi-sys node) and exercising the bit-writes.
Followed the relocations into hi_sys.o: sys_hal_gdc_nnie_set_ram_using
indirects via R_ARM_MOVW_ABS_NC .LANCHOR0 — a per-module anchor that
holds the sys register window base (g_reg_sys_base_va). cv500 DT
declares:

  sys: sys@12010000 {
      compatible = "hisilicon,hisi-sys";
      reg = <0x12010000 0x10000>,  /* crg */
            <0x12020000 0x8000>,   /* sys  <- the one used here */
            <0x12060000 0x10000>,  /* ddr */
            <0x12030000 0x8000>;   /* misc */
  };

So the NNIE/GDC RAM-using register is at phys **0x12020034** (sys-base
+ 0x34). Verified live on av300 via devmem 0x12020034 -> 0x00000000
which matches the expected idle state (NNIE not actively dispatched,
bit 0 clear).

Phase 4 starting point is now concrete: ioremap(0x12020000, 0x8000)
in open_nnie_neo.ko, do an atomic read-modify-write of bit 0 at
offset 0x34 to mark NNIE RAM in-use before each Forward dispatch.
Similar offsets for the other 5 sys_hal_* functions (mutex_sel,
get_mutex_state) live nearby in 0x12020000..0x12020044ish — sweep
needed in next session to map them all out.
Swept all sys_hal_*nnie* + sys_hal_vgs_bootroom_set_ram_using functions
in hi_sys.o. The "memory arbitration" the issue mentioned breaks down
into three registers within the cv500 sys window (phys 0x12020000,
DT-declared reg-name "sys" of hisilicon,hisi-sys node):

  Register 0x12020000:    live=0x00000102
    bit 13 = vgs_bootroom_set_ram_using                       (R/W)

  Register 0x12020008:    live=0x00000000  (no contention)
    bit 0..1 = NNIE/GDC mutex state                           (R)
    bit 1    = venc<->nnie mutex_sel (sys_hal_venc_nnie_*)    (W)
    bit 2    = gdc<->nnie  mutex_sel (sys_hal_gdc_nnie_*)     (W)

  Register 0x12020034:    live=0x00000000
    bit 0    = gdc_nnie_set_ram_using                         (R/W)
              ^ primary "NNIE has RAM" flag; set before Forward,
                clear after.

Two private helper functions in hi_sys.o implement the atomic
R-M-W:
  - sys_drv_get_cmp_3dnr_cfg+0x148 (= 0x7a04): n-bit field set
  - sys_drv_get_cmp_3dnr_cfg+0x1c8 (= 0x7a84): single-bit set
Both use spin_lock_irqsave / read-old / mask+XOR / write-back.

Phase 4 wiring: ioremap(0x12020000, 0x1000) once in probe, then
atomic R-M-W of these bits around each Forward call. No more cmpi
cross-module indirection needed — we drive sys-window bits directly
from open_nnie_neo.ko.

All Phase 3 unknowns are now nailed down. Phase 4 has a concrete
implementation surface.
…tion

Adds ioremap(0x12020000, 0x1000) in probe to access the sys-side
NNIE/GDC/VENC coordination registers (decoded in the previous
commit). Currently read-only — probe dumps the three live register
values so we can confirm the mapping works. Phase 4 will use the
nnie_sys_set_bit() / nnie_sys_clear_bit() helpers (now defined,
__maybe_unused) around each Forward dispatch.

On-target verification (av300, fresh boot):
  nnie_neo: probed nnie0=f2820000 irq=54  gdc_irq=53
  nnie_neo: sys @0x12020000 mapped — VGS=0x00000102 MUTEX=0x00000000
                                     NNIE_RAM=0x00000000

The three values match exactly what `devmem` reads against the same
phys addresses — vendor open_sys.ko and our open_nnie_neo.ko share
the window cleanly with plain ioremap (no request_mem_region clash).

NNIE_RAM = 0x0 confirms bit 0 is clear in the idle state, matching
the expected semantics ("NNIE has RAM" only during Forward dispatch).

Phase 4 wiring is now a thin shim: call
nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM) before
submitting to NNIE HW, clear after IRQ completion.

Includes a small spin_lock_init(&g_sys_lock) and adds iounmap() in
mod_exit for cleanliness.
Reverse-engineered the 64-byte NNIE HW task node layout from vendor
svp_nnie_fill_forward_task @0x90d8 (hi_nnie.o), prologue 90d8..91a8.
Cross-checked field offsets against the kernel SVP_NNIE_MODEL_S /
SEG_S / FORWARD_CTRL_S struct definitions:

  sizeof(SVP_NNIE_NODE_S)         =    52
  sizeof(SVP_NNIE_SEG_S)          =  1692  ← matches vendor 0x69c stride
  sizeof(SVP_NNIE_ROIPOOL_INFO_S) =   104
  sizeof(SVP_NNIE_MODEL_S)        = 13992  ← matches vendor copy_from_user
                                              size 0x36a8
  offsetof(MODEL_S, stBase)       = 13968  = 0x3690 ✓

Vendor caller passes: r0=global HW state, r1=pstModel (kernel kbuf),
r2=forward arg kbuf, r3=on-stack 64-byte descriptor. The struct gets
populated with phys addresses + segment instruction-stream pointer +
batch num + trigger flag, then handed to svp_nnie_post_process which
submits via cmpi-mediated svp_nnie_start_task (Phase 5).

Concrete field map:

  task[ 0] u16 = bInstant ? 1 : 0
  task[16] u64 = pstModel->stBase.u64PhyAddr    (.wk MMZ block)
  task[24] u32 = pstModel->astSeg[NetSegId].u32InstOffset
  task[28] u32 = pstModel->astSeg[NetSegId].u32InstLen
  task[32] u64 = ctrl.stTskBuf.u64PhyAddr       (user-supplied scratch)
  task[48] u64 = ctrl.stTmpBuf.u64PhyAddr       (user-supplied temp)
  task[56] u32 = astSrc[0].u32Num               (batch size)
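
A hedged sketch of filling the fixed header from this field map. The
input struct and function names are hypothetical (the driver's helper
is nnie_fill_task_header()), stores are assumed little-endian, and
bytes the map does not mention are left at zero because their meaning
is not decoded here:

```c
#include <stdint.h>
#include <string.h>

#define NNIE_TASK_HDR_SIZE 64

/* Hypothetical bundle of the values named in the field map. */
typedef struct {
    int      instant;          /* bInstant */
    uint64_t model_base_phys;  /* pstModel->stBase.u64PhyAddr */
    uint32_t inst_offset;      /* astSeg[NetSegId].u32InstOffset */
    uint32_t inst_len;         /* astSeg[NetSegId].u32InstLen */
    uint64_t tskbuf_phys;      /* ctrl.stTskBuf.u64PhyAddr */
    uint64_t tmpbuf_phys;      /* ctrl.stTmpBuf.u64PhyAddr */
    uint32_t batch_num;        /* astSrc[0].u32Num */
} task_hdr_in_t;

/* Store the low nbytes of v at p, least-significant byte first. */
static void put_le(uint8_t *p, uint64_t v, unsigned nbytes)
{
    for (unsigned i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Fill the 64-byte header at the documented offsets. */
static void task_hdr_fill_sketch(uint8_t hdr[NNIE_TASK_HDR_SIZE],
                                 const task_hdr_in_t *in)
{
    memset(hdr, 0, NNIE_TASK_HDR_SIZE);
    put_le(hdr + 0,  in->instant ? 1u : 0u, 2);
    put_le(hdr + 16, in->model_base_phys, 8);
    put_le(hdr + 24, in->inst_offset, 4);
    put_le(hdr + 28, in->inst_len, 4);
    put_le(hdr + 32, in->tskbuf_phys, 8);
    put_le(hdr + 48, in->tmpbuf_phys, 8);
    put_le(hdr + 56, in->batch_num, 4);
}
```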

Past the 64-B header the vendor appends variable-length per-input
stride table (one u32 per astSrc), per-node shape data copied from
pstModel->astSeg[NetSegId].astSrcNode/astDstNode, and a per-batch DMA
address vector. Layout decoded but not yet captured in the header —
deferred until Phase 5 wires actual HW submission and we know which
trailing sections the cv500 NNIE block actually consumes.

This commit only adds the descriptor header + cross-references it
from the Forward stub comment. No behaviour change — module still
returns -EOPNOTSUPP for Forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverse-engineered the cv500 NNIE HW register interface from vendor
hal_svp_nnie_* thin shims (hi_nnie.o @0xbb10..0xbc90, all single-store
helpers indexed by core_id off .LANCHOR1[core_id]+4 = ioremap'd regs).

Complete 0x11100000 register map:

  +0x20 W   task descriptor phys[31:0]    (hal_svp_nnie_write_task_addr)
  +0x24 W   task descriptor phys[63:32]
  +0x28 W   timeout cycles [31:0]          (hal_svp_nnie_set_timeout)
  +0x2C W   timeout cycles [63:32]
  +0x30 RW  START — bit 0 = go             (hal_svp_nnie_start)
  +0x34 RW  IRQ_CFG — bits 0/1/2 enable    (hal_svp_nnie_cfg_irq)
            finish / timeout / cfg_err IRQs
  +0x38 RW  IRQ_CLEAR — bits 0/1/2 w1c     (hal_svp_nnie_clear_irq)
  +0x3C R   IRQ_STATUS — bits 0/1/2 pending(hal_svp_nnie_get_irq_status)
  +0x40 R   CFG_ERR_INFO                   (hal_svp_nnie_get_cfg_err_info)
  +0x48 R   TASK_ID                        (hal_svp_nnie_get_task_id)
  +0x50 RW  CLK_GATE — bit 7 (=0x80) en    (hal_svp_nnie_enable_clk_gt)
  +0x54 RW  AXI OUTSTANDING — [4:0]=0xF,   (hal_svp_nnie_set_outstanding)
            [11:8]=0xF
  +0x68 RW  CHECK_SUM — bit 0 en           (hal_svp_nnie_disable_check_sum)

Dispatch sequence (drv_svp_nnie_start @0xb3ac):

   write_task_addr(task_phys_lo, task_phys_hi);  // [+0x20], [+0x24]
   wmb();
   START |= 1;                                    // [+0x30] |= 1

Two important confirmations:

  1. hal_svp_nnie_set_mem_speed @0xbc28 is a LITERAL no-op (`bx lr`) on
     cv500 — vendor doesn't write any IVE-style "[0x90] mem-priority
     knob" for NNIE. Our Phase 3 finding that NNIE coordination instead
     uses the sys @0x12020034 RAM-using flag stands.

  2. hal_svp_nnie_enable_ram @0xb8f4 goes through cmpi (module 51 = SYS,
     fn 0xd1), not the NNIE register window. This matches the Phase 3
     finding of sys_hal_gdc_nnie_set_ram_using setting bit 0 of
     0x12020034. So the full HW-bring-up sequence is:

       1. nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM)
       2. write CLK_GATE = 0x80 (enable clock gating)
       3. write OUTSTANDING = 0xF | 0xF00
       4. fill 64-B task descriptor (Phase 4 prior commit)
       5. write IRQ_CFG = 0x7 (enable all 3 IRQs)
       6. write TASK_ADDR_LO/HI
       7. wmb()
       8. write START = 1
       9. wait on IRQ → read IRQ_STATUS → clear IRQ
      10. nnie_sys_clear_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM)
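
Steps 2, 3 and 5 through 8 can be sketched against a fake register
window. This models only the order and values of the stores: the
register macro names and the fake_nnie_t type are illustrative, a plain
array stands in for the ioremap'd window, and the wmb() barrier plus
the sys-window bit writes (steps 1 and 10) are not modelled.

```c
#include <stdint.h>

/* Offsets from the register map above; macro names are illustrative. */
#define REG_TASK_ADDR_LO 0x20u
#define REG_TASK_ADDR_HI 0x24u
#define REG_START        0x30u
#define REG_IRQ_CFG      0x34u
#define REG_CLK_GATE     0x50u
#define REG_OUTSTANDING  0x54u

/* Plain array standing in for the NNIE MMIO window. */
typedef struct {
    uint32_t regs[0x80 / 4];
} fake_nnie_t;

static void reg_write(fake_nnie_t *hw, uint32_t off, uint32_t val)
{
    hw->regs[off / 4] = val;
}

/* Issue the bring-up stores in the documented order. */
static void nnie_kick_sketch(fake_nnie_t *hw, uint64_t task_phys)
{
    reg_write(hw, REG_CLK_GATE, 0x80u);             /* step 2 */
    reg_write(hw, REG_OUTSTANDING, 0xFu | 0xF00u);  /* step 3 */
    reg_write(hw, REG_IRQ_CFG, 0x7u);               /* step 5 */
    reg_write(hw, REG_TASK_ADDR_LO, (uint32_t)task_phys);          /* 6 */
    reg_write(hw, REG_TASK_ADDR_HI, (uint32_t)(task_phys >> 32));
    /* step 7: a real driver needs wmb() here before the start write */
    reg_write(hw, REG_START, 1u);                   /* step 8 */
}
```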

No behaviour change — module still returns -EOPNOTSUPP. Phase 4 wiring
is now a thin shim around this header + nnie_hw_task.h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire the decoded 64-byte HW task descriptor into nnie_op_forward via
a new helper nnie_fill_task_header(). Still returns -EOPNOTSUPP — we
need the variable-length descriptor tail (per-input stride table,
per-node shape data, per-batch DMA addresses) before driving HW —
but this commit:

  - Lays down all the SVP_BLOB_S / SVP_NNIE_FORWARD_CTRL_S internal
    offset constants (already cross-checked vs vendor disasm).
  - Builds the fixed 64-byte header from ctrl.stTskBuf.u64PhyAddr,
    ctrl.stTmpBuf.u64PhyAddr, astSrc[0].u32Num, and bInstant.
  - Logs all decoded values pr_info_once so an on-target Forward call
    now prints the full decoded forward args + the partial task header,
    proving the offset constants are right end-to-end (the values must
    match what userspace passed in).
  - Defers reading pstModel->stBase.u64PhyAddr +
    astSeg[NetSegId].u32InstOffset/InstLen to Phase 5 (needs
    copy_from_user of 13992 B model struct).
  - Drops the now-redundant Phase 3 comment block; the cv500 sys-window
    coordination map + NNIE register map have been promoted into
    nnie_hw_task.h + nnie_hw_regs.h.

Rebuilt cleanly against the cv500 4.9.37 kernel
(/home/dima/git/firmware/output-cv500/build/linux-custom). New
.text size for nnie_op_forward: 0x1a8 B (was a one-line stub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverse-engineered the variable-length descriptor tail of the NNIE
task buffer from vendor svp_nnie_fill_forward_task body
@0x91d4-0x9498. The fixed 64-byte HW descriptor (decoded in Phase 4)
points to this tail via task[+32] = ctrl.stTskBuf.u64PhyAddr; the HW
follows that pointer to read shape/stride/per-batch DMA addresses.

Tail layout (written to ctrl.stTskBuf, kernel-vir resolved by
svp_nnie_get_tsk_vir_addr @0x91a8 via phys-match against the
registered tskbuf list):

  §1  always       SrcNum × u32     astSrc[i].u32Stride
       align tip to 16 B
  §2  non-LSTM     16    × u64     astDst[i].u64PhyAddr (zero past
                                    DstNum, always 128 B advance)
  §3  non-LSTM     varies          per-source DMA address vector,
                                    dispatch by astSrc[i].enType:
                                      0      -> u32Stride*Height*Chn
                                                * batch_idx + PhyAddr
                                      1..3   -> svp_nnie_fill_image_src_addr
                                                (YUV plane offsets)
                                      4      -> u32Stride*Height
                                                * batch_idx + PhyAddr
                                      5      -> per-step from user
                                                u64VirAddrStep array
                                      other  -> ILLEGAL_PARAM
                                    align tip to 16 B per blob
  §4  LSTM only    different       net_type==RECURRENT path @0x96dc,
                                    uses ctrl+stTskBuf indexing —
                                    Phase 6 work
  §5  optional     dcache flush    if (sp+40)/sp+28 set, flush
                                    range [stTskBuf.PhyAddr,
                                          +stTskBuf.u32Size)

Important struct-layout correction: SVP_BLOB_S has a 4-byte hole at
+28 (not previously documented in nnie_neo.c). The union starts at
+32 because stSeq.u64VirAddrStep needs 8-byte alignment. Cross-
checked with the cv500 ARM toolchain:

  +0..+27   enType, Stride, VirAddr, PhyAddr, Num
  +28..+31  PADDING
  +32..+47  union { Width,Height,Chn  |  Dim,VirAddrStep }

The previous NNIE_BLOB_OFF_WIDTH=28 / HEIGHT=32 / CHN=36 constants
were wrong; corrected to 32/36/40. The Forward stub's use of these
(reading astSrc[0].u32Num at +24) was already at the right offset.
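
The corrected layout can be checked at compile time with a struct that
spells the hole out explicitly. This is a sketch: the type and member
names are illustrative rather than the SDK's, and placing
u64VirAddrStep at +40 inside the union is an assumption that follows
from the 8-byte alignment mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 48-byte blob layout with the padding written out, so the offsets do
 * not depend on the compiler's alignment rules for uint64_t. */
typedef struct {
    uint32_t type;       /* +0  enType */
    uint32_t stride;     /* +4  u32Stride */
    uint64_t vir_addr;   /* +8  u64VirAddr */
    uint64_t phy_addr;   /* +16 u64PhyAddr */
    uint32_t num;        /* +24 u32Num */
    uint32_t pad28;      /* +28 the 4-byte hole */
    union {
        struct { uint32_t width, height, chn; } whc;  /* +32, +36, +40 */
        struct {
            uint32_t dim;            /* +32 */
            uint32_t pad36;          /* +36 */
            uint64_t vir_addr_step;  /* +40 (assumed) */
        } seq;
    } shape;
} svp_blob_sketch_t;

static_assert(offsetof(svp_blob_sketch_t, num) == 24, "u32Num at +24");
static_assert(offsetof(svp_blob_sketch_t, shape) == 32, "union at +32");
static_assert(offsetof(svp_blob_sketch_t, shape.whc.height) == 36, "H");
static_assert(offsetof(svp_blob_sketch_t, shape.whc.chn) == 40, "C");
static_assert(sizeof(svp_blob_sketch_t) == 48, "48 B per SVP_BLOB_S");
```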

No behaviour change — Forward stub still returns -EOPNOTSUPP. Phase
6 will:
  - decode the LSTM tail variant (§4)
  - implement the §1-§3/§5 builder in C
  - wire copy_from_user of pstModel (13992 B) to populate
    task[+16/+24/+28]
  - drive the HW per nnie_hw_regs.h sequence

Rebuilt clean against cv500 4.9.37 kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix the deferred IRQF_SHARED issue from Phase 0:

  - request_irq() now passes IRQF_SHARED. The cv500 NNIE SPI line (54)
    is shared with vendor open_gdc.ko (GDC on SPI 53 in the same
    DT node), which we kprobed using IRQF_SHARED. To coexist we have
    to match — kernel rejects mixed-flag handlers on a shared line.

  - nnie_irq_handler now reads NNIE_REG_IRQ_STATUS first; if no NNIE
    bits are pending it returns IRQ_NONE so the GDC handler (or any
    other downstream sharer) gets to run. Only when a NNIE finish /
    timeout / cfg_err bit is set do we write-1-clear and signal
    g_nnie_done.
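
The handler's decision logic, reduced to a userspace sketch. The types
and names are illustrative: a struct field stands in for the
IRQ_STATUS / IRQ_CLEAR registers, and a flag stands in for signalling
g_nnie_done.

```c
#include <stdint.h>

enum nnie_irq_result { NNIE_IRQ_NONE = 0, NNIE_IRQ_HANDLED = 1 };

#define NNIE_IRQ_BITS 0x7u  /* finish | timeout | cfg_err */

/* Stand-in for the hardware state the real handler touches. */
typedef struct {
    uint32_t irq_status;  /* models IRQ_STATUS */
    int      done;        /* models signalling completion */
} fake_nnie_irq_t;

static enum nnie_irq_result nnie_irq_sketch(fake_nnie_irq_t *hw)
{
    uint32_t pending = hw->irq_status & NNIE_IRQ_BITS;

    /* Not ours: let other handlers on the shared line run. */
    if (!pending)
        return NNIE_IRQ_NONE;

    /* Models the write-1-to-clear store to IRQ_CLEAR. */
    hw->irq_status &= ~pending;
    hw->done = 1;
    return NNIE_IRQ_HANDLED;
}
```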

The handler doesn't yet inspect *which* bit was set — that distinction
(finish vs timeout vs cfg_err) gets pushed to Phase 7 once Forward
actually dispatches and we have an end-to-end test path.

Rebuilt clean against cv500 4.9.37 kernel; .text size for the module
grew by 48 B for the status-read/dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the kernel side of HI_MPI_SVP_NNIE_AddTskBuf /
RemoveTskBuf. Userspace MMZ-allocates a scratch region, registers
(phys, user_virt, size) with the kernel; the kernel records the
mapping in a list and ioremap()s the phys range so Phase 7 can write
the variable-length descriptor tail into stTskBuf from Forward
dispatch.

Implementation:
  - struct nnie_tskbuf {phys, user_virt, size, kvirt} on a list_head
    protected by g_nnie_tskbuf_lock.
  - nnie_add_tskbuf()/nnie_remove_tskbuf()/nnie_drain_tskbufs() do
    the list management + ioremap/iounmap.
  - nnie_op_add_tskbuf/remove_tskbuf wire the 24-byte SVP_MEM_INFO_S
    arg buffer through to those helpers.
  - nnie_drain_tskbufs() in mod_exit prevents leaks on rmmod.
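
A userspace sketch of the registry's bookkeeping and return codes. It
is not the kernel code: malloc replaces the kernel allocator, locking
and ioremap/iounmap are omitted, the names are illustrative, and
matching entries on the phys address alone is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct tskbuf_sketch {
    uint64_t phys;
    uint64_t user_virt;
    uint32_t size;
    struct tskbuf_sketch *next;
};

static struct tskbuf_sketch *g_tskbufs;

/* 0 on success, -EINVAL for zero phys/size, -EEXIST if already
 * registered, -ENOMEM on allocation failure. */
static int tskbuf_add(uint64_t phys, uint64_t user_virt, uint32_t size)
{
    if (!phys || !size)
        return -EINVAL;
    for (struct tskbuf_sketch *t = g_tskbufs; t; t = t->next)
        if (t->phys == phys)
            return -EEXIST;
    struct tskbuf_sketch *n = malloc(sizeof *n);
    if (!n)
        return -ENOMEM;
    n->phys = phys;
    n->user_virt = user_virt;
    n->size = size;
    n->next = g_tskbufs;
    g_tskbufs = n;
    return 0;
}

/* 0 on success, -ENOENT if phys was never registered. */
static int tskbuf_remove(uint64_t phys)
{
    for (struct tskbuf_sketch **pp = &g_tskbufs; *pp; pp = &(*pp)->next) {
        if ((*pp)->phys == phys) {
            struct tskbuf_sketch *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -ENOENT;
}

/* Free everything, as a module-exit path would. */
static void tskbuf_drain(void)
{
    while (g_tskbufs) {
        struct tskbuf_sketch *dead = g_tskbufs;
        g_tskbufs = dead->next;
        free(dead);
    }
}
```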

ioremap (not cmpi_remap_cached) is uncached, which means we don't
need the cache-flush step vendor has at fill_forward_task @0x94ac.
Trade-off is slower kernel writes — but the descriptor tail is small
(KB), written once per Forward call.

Userspace API match: vendor's libnnie.so AddTskBuf returns 0 on
success; ours now returns 0 (on success), -EEXIST (already
registered), -ENOMEM (OOM or ioremap fail), or -EINVAL
(phys/size==0). RemoveTskBuf returns 0 or -ENOENT.
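
The return-code contract above can be sketched with a userspace model of the registry. It assumes entries are keyed by physical address and uses a fixed-size table instead of the driver's locked list; `TSKBUF_MAX` and all names below are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TSKBUF_MAX 8  /* hypothetical capacity; the driver uses a list */

struct tskbuf_entry { uint64_t phys, user_virt; uint32_t size; int used; };
struct tskbuf_registry { struct tskbuf_entry e[TSKBUF_MAX]; };

static struct tskbuf_entry *tskbuf_find(struct tskbuf_registry *r, uint64_t phys)
{
    for (size_t i = 0; i < TSKBUF_MAX; i++)
        if (r->e[i].used && r->e[i].phys == phys)
            return &r->e[i];
    return NULL;
}

static int tskbuf_add(struct tskbuf_registry *r, uint64_t phys,
                      uint64_t user_virt, uint32_t size)
{
    if (phys == 0 || size == 0)
        return -EINVAL;                 /* phys/size == 0 */
    if (tskbuf_find(r, phys))
        return -EEXIST;                 /* already registered */
    for (size_t i = 0; i < TSKBUF_MAX; i++) {
        if (!r->e[i].used) {
            r->e[i] = (struct tskbuf_entry){ phys, user_virt, size, 1 };
            return 0;
        }
    }
    return -ENOMEM;                     /* no room left in the model */
}

static int tskbuf_remove(struct tskbuf_registry *r, uint64_t phys)
{
    struct tskbuf_entry *e = tskbuf_find(r, phys);

    if (!e)
        return -ENOENT;
    e->used = 0;
    return 0;
}
```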

Module size grew from 7128 B to 8028 B (+900 B for the registry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add nnie_build_task_tail() — implements §1-§3 of the descriptor tail
decoded in Phase 5. Wired into nnie_op_forward as a dry-run: if the
caller has registered an stTskBuf via AddTskBuf, we look it up and
fill in the strides + dst-phys + per-batch DMA addresses. Still
returns -EOPNOTSUPP overall (Phase 7 will copy_from_user pstModel,
finalise the 64-B header, and drive the HW registers).

Builder:
  §1: SrcNum × u32 stride entries (one per astSrc[i])
      Aligned to 16 B with zero-fill.
  §2: 16 × u64 destination phys addresses, zero-padded past dst_num.
  §3: per-source DMA address vector — for each astSrc[i]:
        enType==0: batch_size = Stride * Height * Chn
        enType==4: batch_size = Stride * Height
        enType ∈ [1..3, 5]: -EOPNOTSUPP (Phase 7+ — YUV/seq inputs)
        Writes Num u64 entries: PhyAddr + j*batch_size.
        Aligned to 16 B between blobs.

Tail bytes used logged via pr_info_once so on-target verification
can confirm the offset arithmetic matches what HW expects (cross-
checkable against vendor strace).

Wiring:
  - nnie_init.c now exports g_nnie_pf_dev (platform_device *) so
    nnie_neo.c can dma_alloc_coherent in Phase 7.
  - Header includes: linux/dma-mapping.h + linux/platform_device.h.
  - nnie_fill_task_header marked __maybe_unused (Phase 7 will use it).

Module .text grew from 8028 B to 8944 B (+916 B for the builder).
Build clean against cv500 4.9.37 kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Finish everything in the Forward path *except* the actual HW kick:

  - copy_from_user(model_kbuf, fwd_arg[+776], 13992) — pulls the
    user's SVP_NNIE_MODEL_S into kernel memory using vmalloc (kmalloc
    would slab-fragment for ~14 KB).
  - Validate net_seg_id < model->u32NetSegNum and < 8.
  - Extract model->stBase.u64PhyAddr (file[+0x3690]) -> task[+16].
  - Extract model->astSeg[net_seg_id].u32InstOffset/u32InstLen
    (file[+12 + seg*1692 + 12 / +16]) -> task[+24], task[+28].
  - Look up stTskBuf in our registry; call nnie_build_task_tail to
    write the §1-§3 variable-length tail.
  - Fill the 64-byte HW task descriptor on stack via
    nnie_fill_task_header (un-suppressed __maybe_unused).
  - Log: trigger, model_phys, inst_off, inst_len, tail_bytes.

What's left for Phase 7 (the actual HW kick):
  - dma_alloc_coherent the 64-B descriptor (we have g_nnie_pf_dev).
  - memcpy stack descriptor into it.
  - nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, ...) coordination.
  - 7 register writes (CLK_GATE, OUTSTANDING, IRQ_CFG, TASK_ADDR_LO/HI,
    wmb, START).
  - wait_for_completion_timeout 5 sec.
  - Read+ack NNIE_REG_IRQ_STATUS; distinguish finish/timeout/cfg_err.
  - Release sys lock + dma_free_coherent + write handle to buf+0.

Stops short of HW kick because the partial test on av300 is non-
destructive only as long as no register write hits the live NNIE
block — once we add the START write, a wrong descriptor field could
DMA to bad addresses and (worst case) hang the SoC. Doing that as a
distinct commit keeps the bisect safe.

Module .text grew from 8944 to 9908 B (+964 for copy_from_user
flow). Still returns -EOPNOTSUPP at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire the cv500 NNIE Forward dispatch end-to-end. Forward now:

  1. Decodes the 1624-byte forward arg (Phase 1).
  2. copy_from_user pstModel; extracts stBase.u64PhyAddr +
     astSeg[net_seg_id].u32InstOffset/u32InstLen (Phase 6).
  3. Writes §1-§3 variable-length tail into the registered stTskBuf
     via nnie_build_task_tail (Phase 5/6).
  4. dma_alloc_coherent's a 64-byte HW descriptor, populates it via
     nnie_fill_task_header (Phase 4).
  5. Acquires the cv500 sys-window NNIE_RAM coordination bit at
     0x12020034 (Phase 3).
  6. Writes CLK_GATE / OUTSTANDING / IRQ_CFG / TASK_ADDR_LO/HI /
     START to the NNIE register window (Phase 4).
  7. Waits on g_nnie_done completion (5 s timeout, IRQF_SHARED-aware
     handler from Phase 6).
  8. Reads cause out of g_nnie_last_status (set atomically by the
     handler before complete()).
  9. Releases the sys lock, frees the DMA descriptor.
 10. Distinguishes finish / timeout / cfg_err and returns the right
     errno (0 / -ETIMEDOUT / -EIO).
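
Step 10 can be sketched as a small pure function. Only cfg_err = 0x4 is backed by on-target logs later in this branch; the finish and timeout bit positions and the priority order below are assumptions.

```c
#include <errno.h>
#include <stdint.h>

#define NNIE_IRQ_FINISH  0x1u  /* assumed bit position */
#define NNIE_IRQ_TIMEOUT 0x2u  /* assumed bit position */
#define NNIE_IRQ_CFG_ERR 0x4u  /* matches STATUS=0x4 seen on target */

/*
 * completed: non-zero if the IRQ arrived before the 5 s software timeout.
 * status:    snapshot of IRQ_STATUS taken by the handler.
 */
static int nnie_cause_to_errno(int completed, uint32_t status)
{
    if (!completed)
        return -ETIMEDOUT;          /* no IRQ at all within the window */
    if (status & NNIE_IRQ_CFG_ERR)
        return -EIO;                /* HW rejected the task */
    if (status & NNIE_IRQ_TIMEOUT)
        return -ETIMEDOUT;          /* HW-side timeout fired */
    if (status & NNIE_IRQ_FINISH)
        return 0;
    return -EIO;                    /* woken with no known cause bit */
}
```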

The dispatch happens unconditionally — no module parameter gate.
On-target verification on av300 is the next step. Failure modes
that could happen on first run:
  - Bad descriptor field offset: HW raises the cfg_err bit in status,
    handler signals, we return -EIO. Recoverable; no board hang.
  - sys-window bit doesn't match what vendor expects: HW silently
    discards the task, we hit the 5 s timeout, return -ETIMEDOUT.
    Also recoverable.
  - Wrong NNIE register-window decode (Phase 4 RE was wrong): worst
    case the START write goes nowhere; same -ETIMEDOUT outcome.
  - HW reads the descriptor and the descriptor's tsk_buf_phys points
    somewhere bad: HW does bus error; on cv500 typically the SoC bus
    abort handler logs and the NNIE block returns cfg_err. Recoverable.
  - Worst plausible failure: HW reads the descriptor, the variable-
    length tail has a bad PhyAddr in §3, NNIE DMAs garbage to/from a
    bad address. AXI typically reports a bus error rather than
    hanging. Power-cycle recovers if not.

Module .text grew from 9908 to 12248 B (+2.3 KB for dispatcher).
Build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end Forward path is wired (Phase 7 dispatch + IRQF_SHARED);
old 'Phase 3 — ioctl ABI wired, HW dispatch TBD' log was stale.

On-target verified on av300: module loads, IRQ 54 shares cleanly
with vendor GDC_NNIE handler (no 'Flags mismatch' error), /dev/nnie
present, sys-window coordination registers read live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three substantive changes to libraries/nnie_neo:

  1. Forward / AddTskBuf / RemoveTskBuf now call ioctl() on /dev/nnie
     instead of returning HI_ERR_SVP_NNIE_NOT_SURPPORT. The Forward
     path packs SVP_SRC_BLOB_S[] + pstModel user VA + SVP_DST_BLOB_S[]
     + SVP_NNIE_FORWARD_CTRL_S into the 1624-byte ioctl arg per the
     layout decoded in kernel/nnie_neo/nnie_neo.c.

  2. LoadModel now actually populates the SVP_NNIE_MODEL_S struct:
       - stBase     = *pstModelBuf
       - enRunMode  = file[48]
       - u32TmpBufSize = file[60..63]
       - u32NetSegNum  = file[49]
       - astSeg[i].enNetType / u16SrcNum / u16DstNum / u16RoiPoolNum
                       / u16MaxStep / u32InstOffset / u32InstLen from
         the 16-byte seg records at file[192 + i*16].
     Node + ROI tables left zeroed — kernel Forward only reads
     u32InstOffset / u32InstLen, and userspace post-process helpers
     (softmax/detect/cluster) aren't implemented yet, so zeroed slots
     are safe for now. Validated InstOffset+InstLen against file_size.

  3. /dev/nnie fd lifecycle: cached static int, opened lazily on first
     ioctl, protected by a pthread_mutex. nnie_err_to_hi() translates
     Linux errno to vendor HI_ERR_SVP_NNIE_* codes.
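
The header reads and the InstOffset+InstLen check from item 2 can be sketched as follows. Field offsets come from the list above; the struct, the function names, the little-endian assumption and the 8-segment cap are illustrative, and the per-field layout inside each 16-byte segment record is deliberately not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define WK_SEG_TABLE_OFF 192u  /* seg records start at file[192] */
#define WK_SEG_RECORD_SZ 16u
#define WK_MAX_SEGS      8u    /* assumed cap on u32NetSegNum */

struct wk_summary {
    uint8_t  run_mode;      /* file[48] */
    uint8_t  net_seg_num;   /* file[49] */
    uint32_t tmp_buf_size;  /* file[60..63], assumed little-endian */
};

static uint32_t wk_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the file is too short or malformed. */
static int wk_parse_summary(const uint8_t *file, size_t len,
                            struct wk_summary *out)
{
    if (len < WK_SEG_TABLE_OFF)
        return -1;
    out->run_mode     = file[48];
    out->net_seg_num  = file[49];
    out->tmp_buf_size = wk_u32le(file + 60);
    if (out->net_seg_num == 0 || out->net_seg_num > WK_MAX_SEGS)
        return -1;
    if (len < WK_SEG_TABLE_OFF + (size_t)out->net_seg_num * WK_SEG_RECORD_SZ)
        return -1;
    return 0;
}

/* Overflow-safe check that [off, off + ilen) lies inside the file. */
static int wk_inst_range_ok(uint32_t off, uint32_t ilen, size_t file_size)
{
    return off <= file_size && ilen <= file_size - off;
}
```

Writing the range check as a subtraction avoids the wrap-around that `off + ilen <= file_size` would allow for hostile 32-bit values.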

Build fix: vendor hi_nnie.h needs HI_ID_SVP_NNIE from the staging
hi_common.h, but libraries/include/hi_common.h was being preferred
(causing redeclaration conflicts on EN_ERR_LEVEL_*). Reorder Makefile
include path so STAGING/kernel/include/hi3516cv500 comes BEFORE
libraries/include.

libnnie_neo.so builds clean against cv500 ARM toolchain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two on-target fixes after the first end-to-end Forward attempt on
av300 with inst_mnist_cycle.wk:

  1. AddTskBuf: switched from ioremap() to memremap(MEMREMAP_WB).
     cv500 MMZ regions are CMA-backed kernel RAM, and ioremap()
     refuses these (kernel WARN + returns NULL because the kernel
     direct map already covers them). memremap WB transparently
     handles both CMA RAM (returns lowmem virt) and MMIO (falls
     back to ioremap). Updated struct nnie_tskbuf.kvirt type from
     'void __iomem *' to 'void *' and replaced iowrite32 in the
     tail builder with plain stores.

     Verified on target: AddTskBuf now returns 0, no more WARN.

  2. Tail builder: SVP_BLOB_TYPE_U8 (=1) with Chn=1 (grayscale)
     now uses Stride*Height batch_size. Vendor's svp_nnie_fill_
     image_src_addr @0x7978 handles U8 by branching on Chn ∈
     {1, 3}: Chn=1 is a single u64 per batch (= PhyAddr + j*Stride*
     Height); Chn=3 writes 4 u64s per batch (3 plane addrs + zero
     pad) at 32 B/batch — that's still Phase 8.

     Without the U8 path mnist couldn't run (its input blob is
     enType=1, Chn=1).

Current on-target test run (LD_LIBRARY_PATH+PRELOAD voice libs):
  LoadModel  -> 0x0    NetSegNum=1, Inst@offset 453888 len 10600
  AddTskBuf  -> 0x0
  Forward    -> 0xa0338012 (-ETIMEDOUT)

Kernel dmesg shows:
  task hdr: model_phys=0xaa880000 inst_off=453888 inst_len=10600
  tail=160 B  (descriptor builder ran clean)
  Forward timed out (5s, status snapshot=0x0)

Next: figure out why HW doesn't IRQ. Hypotheses:
  - NNIE clock disabled (vendor drv_svp_nnie_enable_sys_clk @0xb26c
    likely needed before START)
  - 64-B descriptor layout still slightly off
  - Vendor open_gdc.ko handler consuming our IRQ first
  - Some RAM-bank select register we're missing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three on-target findings from sequential diagnosis on av300, each
verified against vendor disasm + live devmem readings:

  1. NNIE clock/reset is on CRG (clock-reset generator) window at
     0x12010000 — NOT the sys window at 0x12020000 we mapped in
     Phase 3. Per cv500 DT:

       clock@12010000 — clock-reset, 'hisilicon,hi3516cv500-clock'
       sys@12020000   — sys-state (mutex, RAM-using flags)

     Vendor hi_sys.o sys_hal_wk_cnn_clk_en @0x86dc writes bit 1 of
     register +0xbc of LANCHOR0[+8] (= CRG base, not sys base):

       crg @0x120100bc:
         bit 0 = NNIE reset    (1=held, 0=released)
         bit 1 = NNIE clk_en   (1=ungated)

     Pre-Phase-7 dispatch: CRG @0xbc = 0x0 → NNIE clock GATED.
     Writes to NNIE_REG_CLK_GATE silently dropped (read back as 0).
     This commit ioremap()'s the CRG window, defines NNIE_CRG_*
     constants, adds nnie_crg_set_bit/clear_bit helpers, and calls
     them in dispatch to release reset + ungate clock before the
     first register write.

     Verified: CRG @0xbc now reads 0x00000002 after dispatch,
     NNIE_REG_CLK_GATE readback now 0x80 (was 0x0 — register writes
     were no-ops without clock).

  2. Vendor's one-shot svp_nnie_init @0x10f4 also sets TIMEOUT to
     ~2 seconds at the NNIE clock rate (TIMEOUT_HI:LO = 0xff:
     0xffffffff). Without TIMEOUT set, HW seems to hang
     indefinitely after START. Added init-time programming of
     these registers + checksum disable (clear bit 0 of +0x68) per
     the vendor init sequence at 0x1c80-0x1cd0.

  3. Tail builder: U8 (enType=1) with Chn=1 now follows the
     standard CNN per-batch DMA formula (PhyAddr + j*Stride*Height).
     Vendor svp_nnie_fill_image_src_addr @0x7978 confirms this.
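
The CRG bit handling from finding 1 can be sketched as a pure function on the register value. The bit meanings come from the text above; the macro and function names are hypothetical, and the real driver applies this through its ioremap()'d CRG window rather than on a plain integer.

```c
#include <stdint.h>

#define NNIE_CRG_OFFSET 0xbcu      /* register offset inside the CRG window */
#define NNIE_CRG_RESET  (1u << 0)  /* 1 = block held in reset */
#define NNIE_CRG_CLK_EN (1u << 1)  /* 1 = clock ungated */

/* Value to write back so that reset is released and the clock runs,
 * leaving every other bit of the register as it was read. */
static uint32_t nnie_crg_bringup_value(uint32_t cur)
{
    cur &= ~NNIE_CRG_RESET;
    cur |= NNIE_CRG_CLK_EN;
    return cur;
}
```

Starting from the gated state 0x0, this yields 0x2, which is the readback reported above.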

Current state: end-to-end test on av300 with inst_mnist_cycle.wk
runs all the way through to the HW START kick. /dev/nnie ABI works,
all RE-discovered registers programmed correctly. HW still hangs
after START — no IRQ in 5 s, STATUS=0 throughout. Likely cause:
descriptor format mismatch (probably one of):
  - 64-byte header has a field at +0/+2/+4 we haven't fully decoded
  - Variable-length tail format wrong for our SrcNum=1/DstNum=1
    arrangement
  - Need to call drv_svp_nnie_config_ram (OTP-based ram bank cfg)
    once at init

Deferred to Phase 8 — needs careful side-by-side comparison
against vendor strace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from sequential av300 debug:

  1. sys_hal_gdc_nnie_set_ram_using @0x897c uses LANCHOR0+16, which
     hi_sys.o sys_hal_init @0x8d70 ioremaps to *0x12030000*, not
     0x12020000 like I assumed in Phase 3. The 'sys' window at
     0x12020000 holds VGS/MUTEX status; the RAM-using flag is in a
     separate 'sys2' window at 0x12030000+0x34. Wired
     NNIE_SYS2_BASE_PHYS / nnie_sys2_set_bit/clear_bit helpers; map
     in probe.

  2. Added pre-START diagnostic that dumps NNIE register state
     (CLK, OUT, IRQ_CFG, TIMEOUT, TASK_ADDR, CHECK_SUM) + the full
     64-byte HW task descriptor as hex u32s, then polls IRQ_STATUS/
     START/TASK_ID for 100 ms after START. All registers programmed
     correctly per vendor disasm; descriptor format matches Phase 4
     decode byte-for-byte.

  3. Pulse-reset before clock-enable, in case HW is left in a stuck
     state by a previous failed dispatch.

Current state — *all RE-discovered HW is correctly programmed*:

  pre-START regs: CLK=0x80 OUT=0xf0f IRQ_CFG=0x7
                  TO_LO=0xffffffff TO_HI=0xff
                  ADDR_LO=0xa00fe000 ADDR_HI=0x0 CHKSUM=0x0
  64-B task desc: trigger=1 model=0xaa880000 inst_off=0x6ed00
                  inst_len=0x2968 tsk=0xa9c70000 tmp=0xa00f5000
                  batch=1

Post-START: STATUS=0, TASK_ID=0 (HW completely silent for 5s).

Architectural find (svp_nnie_post_process @0x1d8c): vendor maintains
a pre-allocated DMA-coherent ring of 512 × 64-byte slots per core,
indexed by the per-core busy counter. The fill_forward_task output
descriptor gets memcpy'd into the next ring slot, and the SLOT INDEX
gets written to descriptor[+4] (which I had as 'reserved'). For
first task (r6==0), this matches our descriptor[+4]=0, so the field
isn't the cause.

Remaining hypotheses for the HW hang:
  - Variable-length descriptor tail layout off — HW interprets §1-§3
    differently for our SrcNum=1 / DstNum=1 / U8 Chn=1 / batch=1
    case than vendor expects
  - drv_svp_nnie_config_ram (OTP-based chip-variant RAM bank cfg) is
    needed at boot and we haven't been calling it
  - Some other peripheral state vendor open_sys.ko configures
    silently (cmpi mod 51 / mod 2 paths we haven't fully decoded)

Phase 9 needs: kprobe vendor's hal_svp_nnie_write_task_addr +
hal_svp_nnie_start on a working vendor Forward path (load vendor
open_nnie.ko, write a vendor-libnnie test) to capture the exact
task descriptor + ring-slot contents from a known-good run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 8 on-target finding (av300 + vendor libnnie.so as oracle):

Vendor LoadModel on /tmp/inst_mnist_cycle.wk reports:
  Tmp = 1989888  (≈ 1.9 MB)

Our parser was reading u32TmpBufSize from file[60..63], which on
mnist is zero. The vendor value is too big to live in the .wk file
itself (466 KB) — it's an inference-time scratch size that vendor
computes by walking the per-segment instruction stream.

Heuristic for now: declare u32TmpBufSize = 8 MB unconditionally,
which covers small classification models. Larger detection models
(yolov*, ssd, frcnn) will need more — Phase 9 will RE vendor's
computation in libnnie.so for the precise value.

This caps the per-Forward MMZ working set at ~8 MB even when models
don't need that much (mnist actually needs ~2 MB). Not a problem in
practice — userspace allocates the tmpbuf from MMZ and our kernel
doesn't touch it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrote three Phase-9 debug modules in kernel/nnie_spy/ to capture
vendor open_nnie.ko's live HW dispatch state:
  - nnie_spy.c: kprobe (CONFIG_KPROBES=n on cv500 — unusable, kept for
    future kernel rebuild)
  - nnie_dump.c: insmod with phys=/size= module params, dumps 16B/line
    via phys_to_virt (works for any CMA-managed lowmem)
  - nnie_watch.c: kthread polling NNIE+0x20/+0x24 every 1us, captures
    any TASK_ADDR change + dumps the descriptor + tskbuf

Captured during a known-good vendor Forward on inst_mnist_cycle.wk:

  TASK_ADDR = 0xa9c70000  (vendor's pre-allocated DMA ring slot)
  descriptor:
    d[0..7] : 01 00 00 00  00 00 00 00  AA88_0150  00 00 00 00  06ED00  002968
    d[8..15]: A9CB0000 0   0 0 0 0   AD200000 0   00000001 0

  tskbuf @ 0xa9cb0000:
    +0x00: 00000020 00000030 00000000 00000000
    +0x10: a00f6000 00000000 00000000 00000000
    +0x20: a00f5000 00000000 00000000 00000000
    +0x30..: zero

Two critical finds:

1. **task[+16] is NOT stBase.PhyAddr — it's stBase.PhyAddr +
   inst_offset_extra**. For the test .wk: 0xaa880000 + 0x150 (=
   file[52..55]) = 0xaa880150. Vendor's userspace LoadModel adjusts
   stBase.u64PhyAddr to skip the .wk header. Our LoadModel was passing
   the raw file phys. Fixed in libraries/nnie_neo/src/nnie_ops.c:
   pstModel->stBase.u64PhyAddr += inst_off_extra after the *pstModelBuf
   copy.

2. **The variable-length tskbuf tail is FAR SIMPLER than my disasm-
   based RE suggested**. Vendor only uses:
     +0:  src strides packed (SrcNum × u32)
     then dst strides packed (DstNum × u32)
     pad to 16
     +0x10: dst phys addrs (DstNum × u64, packed)
     pad to 16
     then src per-batch phys (SrcNum × Num × u64)
   For mnist SrcNum=DstNum=Num=1: ~32 bytes meaningful, rest is the
   tskbuf size we provided (65 KB, all zero). My earlier §1-§3 builder
   wrote 160 bytes including a 16-slot dst phys array — completely
   different from what HW expects.

   Rewrote nnie_build_task_tail to match vendor byte-for-byte.
   Verified our tskbuf content == vendor's tskbuf content
   identically:
     +00: 20 30 0 0       (src_stride=0x20=32 dst_stride=0x30=48)
     +10: dst_phys 0 0 0
     +20: src_phys 0 0 0
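
The two finds above can be sketched together in userspace C. This is a model of the layout as captured, not the kernel's nnie_build_task_tail: the structs and names are hypothetical, little-endian storage is assumed, and no padding is inserted between sources because the capture only covers SrcNum=1.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified views of a source and destination blob. */
struct tail_src { uint32_t stride; uint32_t num; uint64_t phys; uint64_t batch_bytes; };
struct tail_dst { uint32_t stride; uint64_t phys; };

static size_t align16(size_t n) { return (n + 15u) & ~(size_t)15u; }

/* Store the low nbytes of v in little-endian order. */
static void put_le(uint8_t *p, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8u * i));
}

/* task[+16]: model base = .wk phys + the u32 stored at file[52..55]. */
static uint64_t model_inst_base(uint64_t wk_phys, const uint8_t *wk_hdr)
{
    uint32_t extra = (uint32_t)wk_hdr[52] | ((uint32_t)wk_hdr[53] << 8) |
                     ((uint32_t)wk_hdr[54] << 16) | ((uint32_t)wk_hdr[55] << 24);
    return wk_phys + extra;
}

/* Returns bytes written, or 0 if the tail does not fit in cap. */
static size_t build_task_tail(uint8_t *buf, size_t cap,
                              const struct tail_src *src, size_t nsrc,
                              const struct tail_dst *dst, size_t ndst)
{
    size_t batches = 0, off = 0, need;

    for (size_t i = 0; i < nsrc; i++)
        batches += src[i].num;
    need = align16(align16(4 * (nsrc + ndst)) + 8 * ndst) + 8 * batches;
    if (need > cap)
        return 0;
    memset(buf, 0, need);                     /* padding stays zero */

    for (size_t i = 0; i < nsrc; i++) {       /* src strides, packed u32 */
        put_le(buf + off, src[i].stride, 4);
        off += 4;
    }
    for (size_t i = 0; i < ndst; i++) {       /* dst strides, packed u32 */
        put_le(buf + off, dst[i].stride, 4);
        off += 4;
    }
    off = align16(off);
    for (size_t i = 0; i < ndst; i++) {       /* dst phys, packed u64 */
        put_le(buf + off, dst[i].phys, 8);
        off += 8;
    }
    off = align16(off);
    for (size_t i = 0; i < nsrc; i++)         /* per-batch src phys */
        for (uint32_t j = 0; j < src[i].num; j++) {
            put_le(buf + off, src[i].phys + (uint64_t)j * src[i].batch_bytes, 8);
            off += 8;
        }
    return off;
}
```

For the mnist case (one source with stride 0x20, one destination with stride 0x30) this writes 0x28 bytes, with the destination address at +0x10 and the source address at +0x20, as in the capture.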

Also fixed clean-room kernel to do read-modify-write on CLK_GATE
and OUTSTANDING (vendor pattern). Our previous plain writes were
clobbering required chip-default bits. CLK_GATE now reads back
0x3c9 (was 0x80), matching vendor's live state 0x349.

Status: descriptor + tskbuf are now byte-equivalent to vendor's,
all RE-discovered HW registers programmed identically. HW still
returns cfg_err info=0x1 — the bug is OUTSIDE the descriptor +
tskbuf. Hypotheses for Phase 10:
  - dma_alloc_coherent vs cmpi_mmz_malloc_cached affect HW DMA
    differently
  - Vendor's pre-allocated ring slot phys is registered with HW
    via some other mechanism we haven't decoded yet (e.g., open_sys
    registers it with CMP_3DNR or similar)
  - chip-variant cfg via drv_svp_nnie_config_ram / prepare_nnie
    (OTP-dependent) is required for our chip

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Phase-10 changes to bring cleanroom dispatch closer to vendor:

  1. Descriptor allocation moved from dma_alloc_coherent to
     hil_mmb_alloc + hil_mmb_map2kern_cached. Matches vendor's
     cmpi_mmz_malloc_cached pattern, places the descriptor in the
     same MMZ pool vendor uses for its 512-slot ring (verified:
     same single 256MB zone 0xA0000000-0xAFFFFFFF per /proc/media-mem).

  2. __cpuc_flush_dcache_area on tskbuf after nnie_build_task_tail
     writes — memremap WB gives cached kernel mapping; HW DMA needs
     the writes visible in DDR. Vendor uses SAMPLE_COMM_SVP_FlushCache.

  3. __cpuc_flush_dcache_area on the descriptor after memcpy from the
     stack-built struct — same reason as (2).

Independently verified via nnie_dump.ko reading our descriptor phys:
the 64 bytes match vendor's known-good capture byte-for-byte except
for the tskbuf-phys field d[8] (vendor used 0xa9cb0000, ours uses
0xa9c70000 — both valid MMZ phys, content identical at both).
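
The descriptor fields confirmed so far can be written down as a C struct with compile-time layout checks. This is a sketch with hypothetical names, not the contents of nnie_hw_task.h: only offsets +0, +4, +16, +24, +28 and +32 are taken from the captures in this branch, and bytes +40..+63 are left opaque.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct nnie_task_desc {
    uint32_t trigger;        /* +0  : 1 in every capture */
    uint32_t slot_idx;       /* +4  : ring slot index */
    uint32_t rsvd0[2];       /* +8  : zero in every capture */
    uint64_t model_phys;     /* +16 : instruction base phys */
    uint32_t inst_off;       /* +24 */
    uint32_t inst_len;       /* +28 */
    uint64_t tskbuf_phys;    /* +32 : d[8] */
    uint8_t  undecoded[24];  /* +40..+63 : tmp buf, batch, ...; not modelled */
};

_Static_assert(sizeof(struct nnie_task_desc) == 64, "descriptor must be 64 B");
_Static_assert(offsetof(struct nnie_task_desc, model_phys) == 16, "task[+16]");
_Static_assert(offsetof(struct nnie_task_desc, inst_off) == 24, "task[+24]");
_Static_assert(offsetof(struct nnie_task_desc, inst_len) == 28, "task[+28]");
_Static_assert(offsetof(struct nnie_task_desc, tskbuf_phys) == 32, "task[+32]");

/* Fill the decoded fields; everything else is zeroed. */
static void nnie_desc_fill(struct nnie_task_desc *d, uint32_t slot,
                           uint64_t model_phys, uint32_t inst_off,
                           uint32_t inst_len, uint64_t tskbuf_phys)
{
    memset(d, 0, sizeof *d);
    d->trigger     = 1;
    d->slot_idx    = slot;
    d->model_phys  = model_phys;
    d->inst_off    = inst_off;
    d->inst_len    = inst_len;
    d->tskbuf_phys = tskbuf_phys;
}
```

The static assertions turn a mis-sized or mis-aligned field into a build failure instead of a silent cfg_err on target.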

HW STILL returns cfg_err info=0x1. With:
  - descriptor bytes match vendor
  - tskbuf tail content matches vendor (verified via dump)
  - registers programmed identically to vendor (CLK_GATE=0x3c9 etc)
  - MMZ allocation now in same pool as vendor
  - all caches flushed before START

…something else differs. Hypotheses for Phase 11:
  - Per-core state in vendor's LANCHOR0 holds HW-required init that
    only vendor's svp_nnie_init populates (e.g., a chip-variant
    fixup we haven't decoded)
  - HW has a hidden "first task" register/sequence we're missing
  - vendor's open_gdc.ko (loaded but unused for inference) sets
    something we're missing
  - svp_nnie_check_err_status decodes info=1 as something specific
    we haven't traced yet

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii and others added 18 commits May 16, 2026 21:52
Snapshot-diff finding: after a clean reboot, vendor's open_nnie.ko
insmod writes ~15 previously-unexplored NNIE registers. One of them
is CHECK_SUM (+0x68 = 0x00000001 post-vendor-init). Vendor's symbol
'disable_check_sum' @0xbc74 clears bit 0, but the live state has
bit 0 SET after init runs — the function name is misleading; bit 0
is presumably "disable mode" semantics (clearing it enables real
operation). Our cleanroom was calling the analog (clear bit 0) which
actively DISABLED checksum.

Empirical: with CHECK_SUM=0 our HW returns cfg_err info=1; with
CHECK_SUM=1 (vendor's value) HW returns cfg_err info=0. Different
error code — we're closer.

Remaining gap registers vendor writes that we don't:
  nnie+0x00 = 0x00002018
  nnie+0x04 = 0x00000130
  nnie+0x08 = 0x0000B017
  nnie+0x10 = 0x5A5A5A5A  ← magic value
  nnie+0x14 = 0x0000FFEF
  nnie+0x6c = 0xFFFFFFFF
  nnie+0x70..0xa8 = various (chip cfg / clock params?)

These may be HW-self-populated when clock is on (need to verify by
just enabling CRG NNIE clock without vendor module) or may be set
by drv_svp_nnie_prepare_nnie (OTP-variant-dependent cmpi mod 2 fn
0xb6 call). Phase 12 to investigate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot-diff vs vendor's open_nnie load showed vendor's init only
changes ONE bit beyond what we already program: CRG+0xa4 (VEDU clock,
per hi_sys.o sys_hal_vedu_clk_en) flips 0 → 6. NNIE may share clock
infrastructure with VEDU on cv500.

This commit adds a CRG+0xa4 RMW to our dispatcher that holds bits 2:0
at 6 (0b110). On test, devmem readback shows the write did NOT take
effect (post-dispatch CRG+0xa4 still 0). Possible causes:
  - VEDU clock register may need a separate enable bit
  - Some sys/CRG window has write-protect we haven't decoded
  - Vendor's path may set it via cmpi → open_sys.ko which has CRG
    permission we lack

cfg_err info changed from 1 to 0 in the previous commit (CHECK_SUM
preserved). info=0 remains here — adding CRG+0xa4 didn't change the
result.

The 15 'unexplored' registers I'd worried about (nnie+0x00..0x14,
+0x70..+0xa8) turn out to be HW-self-populated when the clock is on.
Verified by snapshotting AFTER reset with ONLY a devmem write to
CRG+0xbc=0x2 (no module loaded): all the magic values appear. So
vendor doesn't write them — they're chip defaults exposed once the
block is clocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two tweaks to the dispatcher:
  - write NNIE_IRQ_ALL to NNIE_REG_IRQ_CLEAR right before TASK_ADDR
    writes, to drop any stale cfg_err/timeout bits from a previous
    failed dispatch
  - log NNIE_REG_IRQ_STATUS + NNIE_REG_CFG_ERR_INFO in the pre-START
    register dump

Test: pre-START state confirmed STATUS=0x0 ERR_INFO=0x0 (clean).
Within 100us of START write, HW raises STATUS=0x4 (cfg_err) with
ERR_INFO=0x0. So HW is processing the task and definitively
rejecting it — not a stale-IRQ issue.

CRG+0xa4 write from kernel module also confirmed to stick now
(post-test devmem reads 0x6 as intended). Earlier non-stick may
have been a transient state-machine issue from the unsafe write
ordering. Even with CRG+0xa4=6 matching vendor, ERR_INFO stays 0.

Remaining hypothesis space (Phase 12):
  - vendor's cmpi_register_module call registers NNIE as module 51,
    enabling other modules (sys/sys_config) to call NNIE-specific
    init that we miss
  - drv_svp_nnie_prepare_nnie's OTP-variant path may run additional
    chip cfg via cmpi mod 2 fn 0xb6 for specific OTP values; need
    to find g_reg_otp_base_va and check our chip's OTP[+0x28]
  - HW may need a "warm" task descriptor (something set by vendor
    in its descriptor ring slot at first task that survives between
    tasks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extended kernel/nnie_spy/nnie_watch.c to dump ALL NNIE registers
+0x00..+0xbc at the moment of TASK_ADDR write. Captured vendor's
working mnist Forward state:

  reg+0x20: a9c70000 00000000 ffffffff 000000ff  ← TIMEOUT_HI=0xff!
  reg+0x40: 00000000 00000003 00000000 00000000  ← +0x44 = 0x3 (UNK)
  reg+0x60: 00000000 00000000 00000000 ffffffff  ← CHECK_SUM=0!

Three register diffs vs our cleanroom that we now fix:

  1. TIMEOUT_HI (+0x2C) = 0xff (we had set to 0 after misreading an
     earlier post-Forward snapshot — vendor set_timeout writes 0xff,
     HW clears it after completion. AT task start it's 0xff.)

  2. CHECK_SUM (+0x68) = 0 (vendor explicitly disables before each
     task. Live readback after vendor's Forward completion shows 1
     because HW restores chip default 0x1 post-task. Vendor's
     'disable_check_sum' function name is correct after all — bit 0
     = 1 IS "enabled", and vendor disables before submit.)

  3. +0x44 (unknown) = 0x3 (vendor writes this; no decoded function
     in hi_nnie.o symbol table matches +0x44).

With these three fixes (and without further changes to +0xb0..+0xb8,
which made things worse), the outcome varies from run to run: cfg_err
info=0, cfg_err info=0x1000001, or occasionally a 5-sec TIMEOUT with
no IRQ. The failure mode is unstable, but the cfg_err code did change,
suggesting HW is now partially accepting our task.

Phase 13: trace what +0x44 is, why info=0x1000001 vs info=0. Also
investigate whether HW resets between tasks differently than vendor
expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical finding via test_neo on av300: assigning a monotonic
per-task slot index to descriptor[+4] (the 'reserved' field — actually
vendor's task ring slot index, 0..511) makes the NNIE HW ACCEPT the
task. Live readback shows TASK_ID register (+0x48) updating to 0x1
matching our slot_idx after START.

Previously descriptor[+4] = 0 caused HW to reject with cfg_err info=
0 / 0x1000001. With monotonic non-zero idx, HW updates TASK_ID and
runs partial processing.

Mechanism (inferred): HW tracks a "next expected slot_idx" internally.
Submitting slot_idx that matches the current state (=0 on cold boot)
is treated as a no-op or as referring to an already-completed task,
so cfg_err. Submitting a fresh slot makes HW accept the new task.
Vendor's ring iterates 0,1,2,...,511 mod 512 — first task after fresh
boot is 0, subsequent ones increment. So our 'always 0' was wrong
after the first failed task.

Forward still returns cfg_err (cause=0x4 info=0x1000001) AT THE END
because the inference engine fails part-way through execution — but
this is a LATER failure mode than the previous "submission rejected".

Remaining puzzle (Phase 14): why does HW reject mid-execution? Most
likely candidates: (a) instruction stream interpretation fails because
some descriptor offsets we set are still wrong relative to vendor;
(b) tmp_buf size/content not what HW expects; (c) output blob format
mismatch (vendor uses VEC_S32 stride=48 — we match this in test_neo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR OpenIPC/firmware#2095 enabled CONFIG_KPROBES=y in cv500/av300
ship kernels (board built 2026-05-14 05:25 UTC). My local
linux-custom .config was stale (pre-#2095), so register_kprobe in
modules built against it expanded to the kprobes.h inline stub
returning -ENOSYS.

Fix: set CONFIG_KPROBES=y + CONFIG_KALLSYMS_ALL=y + CONFIG_JUMP_LABEL=y
+ CONFIG_MODULE_UNLOAD=y in linux-custom/.config, run oldconfig +
prepare to regenerate include/generated/autoconf.h, rebuild
nnie_spy.ko.

Also: register_kprobe(symbol_name=...) only resolves GLOBAL kallsyms.
Vendor's hal_svp_nnie_write_task_addr is LOCAL (lowercase 't' in
kallsyms), so I had to switch to .addr=, with the address passed in as
a module parameter after grepping /proc/kallsyms.

Verification:
  • probe_test.ko on printk → "PROBE FIRED" each printk call ✓
  • nnie_spy.ko on hal_svp_nnie_write_task_addr → fires during
    vendor's Forward, dumps r2/r3 (task_phys), then 64-byte
    descriptor and tskbuf tail via phys_to_virt

First vendor capture (mnist Forward, scores match prior runs):
  task_phys = 0xa9c70000  (vendor's pre-allocated ring slot 1)
  d[0..7]  = 01 00 00 00  00 00 00 00  ...  aa880150 00 00 00 00  6ed00 002968
  d[8..15] = a9cb0000 0  0 0 0  0 ad200000 0  01 00 00 00  0 0
  tail @0xa9cb0000:
    +0x00: 20 30 0 0  a00f6000 0 0 0
    +0x20: a00f5000 0 0 0  ...zero

Identical to earlier polling captures — but now with single-cycle
precision (vs ~1MHz poll). Phase 14 can now iterate quickly:
kprobe vendor's full hal_svp_nnie_* call sequence, compare to ours,
find the missing register / write-ordering issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendor's first task uses descriptor[+4] = 0 (verified via kprobe
capture this session). atomic_inc_return returns the post-increment
value, so our first call produced 1; descriptor[+4] = 1 means HW
waited for slot 0 (never submitted) to complete first → 5-sec TIMEOUT.

Switch to atomic_inc_return - 1 (the pre-increment value, 0 on first
call). With slot_idx = 0 our cleanroom now produces a 64-byte
descriptor byte-equivalent to vendor's mnist Forward capture:

  00000001 00000000 00000000 00000000  aa880150 00000000 0006ed00 00002968

(Identical content, identical layout.)
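
The slot numbering can be sketched in userspace with C11 atomics. This is a model, not the kernel code: `atomic_fetch_add` stands in for the kernel's `atomic_inc_return(...) - 1`, and the counter and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NNIE_RING_SLOTS 512u  /* vendor ring iterates 0..511 */

/* Static storage, so the counter starts at zero. */
static atomic_uint g_slot_counter;

static uint32_t nnie_next_slot(void)
{
    /* fetch_add returns the value held before the increment, so the
     * first caller gets 0, the second 1, and so on. */
    return atomic_fetch_add(&g_slot_counter, 1u) % NNIE_RING_SLOTS;
}
```

Because 2^32 is a multiple of 512, the sequence stays contiguous even when the unsigned counter itself wraps.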

HW still rejects with cfg_err info=1 with slot=0. The dependence
of cfg_err code on slot_idx + prior HW state is intricate:
  slot_idx=0, fresh boot → cfg_err info=1
  slot_idx=1, after previous fails → TIMEOUT (TASK_ID updates to 1)

So HW accepts SOMETHING and rejects something else. With kprobes
now available the next iteration can attach probes to vendor's
full hal_svp_nnie_* call chain and the chip's actual register
write order, then diff against ours. That belongs in Phase 15+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added kernel/nnie_spy/nnie_trace.c that simultaneously kprobes 7
vendor functions in open_nnie.ko (write_task_addr, start, cfg_irq,
set_timeout, enable_clk_gt, set_outstanding, disable_check_sum) and
logs ARM_r0..r3 at each entry. Loaded with addr= params for each
symbol grep'd from /proc/kallsyms.

Captured vendor's per-task call sequence on av300 (mnist Forward):

  1. cfg_irq(core_id=0, ...)
  2. set_timeout(core_id=0, ..., r2=0xffffffff TIMEOUT_LO, r3=0xff TIMEOUT_HI)
  3. enable_clk_gt(core_id=0, ...)
  4. set_outstanding(core_id=0, ...)
  5. disable_check_sum(core_id=0, ...)
  6. write_task_addr(core_id=0, ..., r2=0xa9c70000 task_phys, r3=0)
  7. start(core_id=0, ...)

This sequence matches our cleanroom's register-write order. And
captured task_phys, TIMEOUT_LO/HI, CHECK_SUM=0 etc. match our
descriptor + pre-START state byte-for-byte.

Tested: vendor's open_nnie module init first (to leave HW post-init
state vendor expects), then rmmod + insmod our cleanroom + Forward.
Still cfg_err info=1. So the gap is NOT in module-init residual
state.

Remaining gap is between "all observable register values match
vendor's working state" and "HW actually produces inference work".
Likely something in cmpi-mediated open_sys handshake or HW state-
machine sequence ordering we still get wrong. Phase 15+ work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kprobed sys_hal_wk_cnn_clk_en, sys_hal_wk_cnn_reset_sel, and
sys_hal_gdc_nnie_set_ram_using on av300 during a known-good vendor
mnist Forward, captured this call sequence (likely interleaved with
calls from other vendor modules):

  clk_en(1) → reset_sel(0) → clk_en(0) → clk_en(1) →
  ram(0) → clk_en(1) → ram(1) → ram(0) → clk_en(0)

The clock-toggle (OFF then ON) before the task is novel — we just
keep clock on continuously. Hypothesised this was an HW reset pulse
NNIE needs.

Mirrored vendor's sequence in nnie_dispatch_forward. HW still
returns cfg_err info=1. So the clock dance ALONE isn't what HW
needs — the failure is somewhere else still.

Remaining hypotheses for Phase 15:
  - The clock-toggles come from MULTIPLE concurrent threads (other
    vendor modules using cmpi mod 51 ops). Trying to replicate them
    in single-threaded sequence misses the actual chip state HW
    needs.
  - HW may need a specific CMPI handshake we still don't emulate.
  - Some HW state I haven't observed yet.

Branch: 39 commits on nnie-neo. Phase 14 progressed significantly
this session (kprobes unblocked, full vendor call trace captured)
but the HW failure mode is now stably reproducible at cfg_err
info=1 with byte-identical state to vendor's working setup. The
cleanroom RE has reached a plateau that needs careful single-step
HW debugging (or vendor open_sys.ko source) to break through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert four per-task register writes from plain writel() to
read-modify-write to match vendor hal_svp_nnie_* helpers:

  reg+0x30 (START)     plain 1   → OR bit 0
  reg+0x34 (IRQ_CFG)   plain 0x7 → OR bits 0|1|2
  reg+0x38 (IRQ_CLEAR) plain 0x7 → OR bits 0|1|2
  reg+0x68 (CHECK_SUM) plain 0   → BFC bit 0

Vendor preserves the upper-half status bits in these registers via RMW.
Plain writes would clobber them. On a post-reboot clean run the
read-back is 0 in all four cases, so the RMW is functionally
equivalent today; the change is for correctness against any future
state path that sets those bits.
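
A minimal userspace sketch of the plain-write vs read-modify-write difference described above. The four register offsets come from the table in this commit; the `fake_regs` array, the helper names, and the window size are hypothetical stand-ins for the real MMIO window and `readl()`/`writel()`, and the call order inside `nnie_sketch_per_task_setup()` is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as listed in the commit message above. */
enum {
    NNIE_REG_START     = 0x30,
    NNIE_REG_IRQ_CFG   = 0x34,
    NNIE_REG_IRQ_CLEAR = 0x38,
    NNIE_REG_CHECK_SUM = 0x68,
};

#define NNIE_REG_WORDS 64 /* hypothetical window size, covers 0x00..0xfc */

/* Stand-in for the ioremapped register window (assumption). */
static uint32_t fake_regs[NNIE_REG_WORDS];

static uint32_t reg_read(uint32_t off)
{
    assert(off % 4 == 0 && off / 4 < NNIE_REG_WORDS);
    return fake_regs[off / 4];
}

static void reg_write(uint32_t off, uint32_t val)
{
    assert(off % 4 == 0 && off / 4 < NNIE_REG_WORDS);
    fake_regs[off / 4] = val;
}

/* RMW: set only the requested bits, keep everything else. */
static void reg_set_bits(uint32_t off, uint32_t mask)
{
    reg_write(off, reg_read(off) | mask);
}

/* RMW: clear only the requested bits, keep everything else. */
static void reg_clear_bits(uint32_t off, uint32_t mask)
{
    reg_write(off, reg_read(off) & ~mask);
}

/* Per-task setup using RMW for all four registers. */
void nnie_sketch_per_task_setup(void)
{
    reg_set_bits(NNIE_REG_IRQ_CFG, 0x7);     /* OR bits 0|1|2 */
    reg_set_bits(NNIE_REG_IRQ_CLEAR, 0x7);   /* OR bits 0|1|2 */
    reg_clear_bits(NNIE_REG_CHECK_SUM, 0x1); /* clear bit 0 */
    reg_set_bits(NNIE_REG_START, 0x1);       /* OR bit 0 */
}
```

With all four registers reading back 0, this produces the same values as the old plain writes; the two only diverge once some upper bit is already set.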

Also switch task descriptor MMZ from cached + __cpuc_flush_dcache_area
to nocache (hil_mmb_map2kern), matching vendor svp_nnie_init's
cmpi_mmz_malloc_nocache for its pre-allocated task ring. Eliminates
cache coherency as a variable.

cfg_err info=0x1 still fires mid-execution. Phase 14 is the gap
between byte-equivalent dispatch state and HW actually executing the
task. Four vendor steps decoded from the hi_nnie.o disassembly are
absent from our cleanroom dispatch: the GDC ops handshake, the
prepare_nnie ioctl, check_clk_freq, and set_mem_speed. On cv500 all
four are no-ops for the classic NNIE path when there is no concurrent
GDC activity. Next angle: kprobe vendor's cfg_err recovery to capture
HW state at the moment of a known-bad dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dump

LoadModel breakthrough: decode the per-segment node tables in the .wk
file at file[208+] (compact 16 B node header + 32 B name = 48 B per
node, layout decoded from vendor inst_mnist_cycle.wk hex dump). Now
populates:

  astSeg[i].astSrcNode[j] = { enType, u32Width, u32Height, u32Chn,
                              u32NodeId, szName }  ← read from file
  astSeg[i].astDstNode[j] = { enType=SVP_BLOB_TYPE_S32, ..., szName }
                              ← name read; type/id derived (vendor's
                                libnnie hardcodes type=4 for outputs)
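
A rough sketch of the record arithmetic described above: node records start at file offset 208 and each is 48 bytes (16-byte header followed by a 32-byte name). The function name is hypothetical, and the sketch assumes the records of interest are contiguous from offset 208 and that the name follows the header; the fields inside the 16-byte header are deliberately not decoded here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WK_NODE_TABLE_OFF 208u /* from the text: tables start at file[208] */
#define WK_NODE_HDR_SIZE   16u /* compact node header */
#define WK_NODE_NAME_SIZE  32u /* fixed-size name field */
#define WK_NODE_SIZE (WK_NODE_HDR_SIZE + WK_NODE_NAME_SIZE) /* 48 bytes */

/*
 * Copy the name of node `idx` into `out` and NUL-terminate it.
 * Returns 0 on success, -1 if the record would run past the end of
 * the file image.
 */
int wk_node_name(const uint8_t *file, size_t file_len, size_t idx,
                 char out[WK_NODE_NAME_SIZE + 1])
{
    size_t off;

    assert(file != NULL && out != NULL);

    /* Reject indices whose byte offset would overflow size_t. */
    if (idx > (SIZE_MAX - WK_NODE_TABLE_OFF) / WK_NODE_SIZE)
        return -1;

    off = WK_NODE_TABLE_OFF + idx * WK_NODE_SIZE;
    if (off > file_len || file_len - off < WK_NODE_SIZE)
        return -1;

    /* Assumption: the name immediately follows the 16-byte header. */
    memcpy(out, file + off + WK_NODE_HDR_SIZE, WK_NODE_NAME_SIZE);
    out[WK_NODE_NAME_SIZE] = '\0';
    return 0;
}
```

The explicit bounds check matters because the .wk file is untrusted input by the time LoadModel sees it.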

Direct vendor-vs-cleanroom comparison on av300:
  vendor lib (libnnie.so) reports astSrcNode[0].enType=1 for mnist.wk
  cleanroom (libnnie_neo) was reporting enType=0 — userspace test
  uses this to pick SVP_BLOB_TYPE_E for its src MmzAlloc, so we were
  sending the wrong blob type to the Forward ioctl.

Now with enType=1 propagated through to the kernel's tskbuf §3 fill,
the per-source DMA address vector matches vendor's exactly (still
1 u64 per batch for enType=1 C=1 — the YUV chroma plane is only
needed for C>=2). Tested on av300: cfg_err info=0x1 still fires
mid-execution though, so enType wasn't the trigger by itself.

Also adds kernel-side diagnostic: full reg+0x00..+0xc8 dump both
pre-START and post-fail. Comparison vs vendor's post-insmod-no-Forward
devmem dump shows:
- chip-config + HW-self-populated bits (+0x70..+0xa8) identical
- our per-task RMW of OUTSTANDING clears bit 4 (0xf1f → 0xf0f)
  matching vendor's per-task code; vendor's POST-success state is
  all-zeros (IRQ handler gates clock) so direct mid-task comparison
  needs nnie_watch.ko polling

Post-fail HW touches: cfg_err_info=0x1, +0x64=0x8, +0xb0..+0xb8=0x1
each, +0xc4=0x34e — all written by HW, not us.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend nnie_trace.ko so the write_task_addr kprobe also dumps the
64-byte HW descriptor + the first 320 bytes of the tskbuf it
references (via phys_to_virt on r2 = descriptor phys passed by
vendor). And the start kprobe ioremaps the NNIE register window and
dumps reg+0x00..+0xc8 at the exact moment vendor's start function
fires.

Used this to capture vendor's pre-START state during a known-good
mnist Forward on av300. Result vs cleanroom pre-START dump:

  vendor:    TASK_ADDR_LO = 0xa9c70000
  cleanroom: TASK_ADDR_LO = 0xa9cb0000

ALL other registers match byte-for-byte (chip ID, IRQ_CFG, TIMEOUT,
CLK_GATE, OUTSTANDING, CHECK_SUM, +0x6c..+0xa8 HW-self-populated).
Descriptor + tskbuf content matches byte-for-byte at the phys-mem
level.
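
The register comparison above amounts to a word-by-word diff of two dumps. A minimal sketch, assuming each dump is an array of 32-bit words covering reg+0x00..+0xc8 (51 words); the function name is hypothetical and not part of nnie_trace.ko.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Compare two register dumps word by word. Each differing word is
 * printed to `out` (if non-NULL) with its byte offset; the return
 * value is the number of differing words.
 */
size_t reg_dump_diff(const uint32_t *a, const uint32_t *b,
                     size_t n_words, FILE *out)
{
    size_t n_diff = 0;

    assert(a != NULL && b != NULL);

    for (size_t i = 0; i < n_words; i++) {
        if (a[i] == b[i])
            continue;
        if (out != NULL)
            fprintf(out, "reg+0x%02zx: 0x%08x vs 0x%08x\n",
                    i * 4, (unsigned)a[i], (unsigned)b[i]);
        n_diff++;
    }
    return n_diff;
}
```

For the capture described here, such a diff would report exactly one differing word, the TASK_ADDR_LO value.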

The 0xa9c70000 vs 0xa9cb0000 difference arises because vendor allocates
its task descriptor ring at module init (so it gets the first
64KB-aligned slot after the vb_pool), while cleanroom allocates a
fresh descriptor MMZ per Forward (after the test's nnie_tsk
allocation has already taken 0xa9c70000). HW may have a static range
filter for TASK_ADDR, or some other reason to reject ours.

Next: pre-allocate descriptor MMZ at module init (matching vendor's
allocation order) so we get 0xa9c70000 — or any deterministic phys
— and see if cfg_err goes away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanroom NNIE driver now produces bit-identical mnist output to the
vendor open_nnie.ko on av300:

  dst[10 scores]: 408 412 401 401 398 412 398 405 449 401   ← vendor
  dst[10 scores]: 408 412 401 401 398 412 398 405 449 401   ← cleanroom

Three changes combined to fix the cfg_err info=0x1:

1. Pre-allocate the 64KB task descriptor MMZ at module init (same
   point where vendor's svp_nnie_init @0x11a8 calls
   cmpi_mmz_malloc_nocache for its task ring). This makes the
   descriptor phys stable across Forwards AND lands at 0xa9c70000 —
   the exact slot vendor's allocation gets, since at module-init time
   that's the first available 64KB-aligned MMZ slot after the
   vb_pool. Per-Forward allocation was landing AFTER the test's
   nnie_tsk had already taken 0xa9c70000.

2. Zero the descriptor MMZ at init time (vendor does this via
   osal_memset in svp_nnie_init). HW may scan the ring for valid
   pending slots and the uninitialized 65472 bytes past our 64-byte
   task could trigger false-positive cfg_err.

3. Stop setting NNIE_RAM bit back to 1 after the clock dance. Vendor's
   hal_svp_nnie_enable_ram @0xb8f4 issues SYS ioctl 0xd1 with arg=0,
   so the bit goes 0 and stays 0 through START. Setting it back to 1
   was a Phase 8 guess that turned out to invalidate the SRAM
   ownership signal HW expects.
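
A userspace sketch of changes 1 and 2: allocate the 64KB ring once, zero it, and only then write the 64-byte task descriptor. `malloc()` stands in for the MMZ allocation and all names are hypothetical; the point is only that nothing stale remains in the 65472 bytes after the first descriptor.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NNIE_DESC_SIZE  64u            /* one HW task descriptor */
#define NNIE_RING_BYTES (64u * 1024u)  /* pre-allocated ring size */

struct desc_ring {
    uint8_t *base; /* stand-in for the kernel mapping of the MMZ block */
};

/* One-time init: allocate the ring and zero all of it. */
int desc_ring_init(struct desc_ring *r)
{
    assert(r != NULL);
    r->base = malloc(NNIE_RING_BYTES);
    if (r->base == NULL)
        return -1;
    memset(r->base, 0, NNIE_RING_BYTES);
    return 0;
}

/* Per-Forward: write one descriptor into slot 0 of the ring. */
void desc_ring_write_task(struct desc_ring *r,
                          const uint8_t desc[NNIE_DESC_SIZE])
{
    assert(r != NULL && r->base != NULL);
    memcpy(r->base, desc, NNIE_DESC_SIZE);
}

/* Count non-zero bytes after the first descriptor. */
size_t desc_ring_stale_bytes(const struct desc_ring *r)
{
    size_t n = 0;

    assert(r != NULL && r->base != NULL);
    for (size_t i = NNIE_DESC_SIZE; i < NNIE_RING_BYTES; i++)
        if (r->base[i] != 0)
            n++;
    return n;
}
```

After `desc_ring_init()` followed by `desc_ring_write_task()`, `desc_ring_stale_bytes()` returns 0; without the `memset()` the result would depend on whatever the allocator handed back.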

Also remove the verbose pre-START register dump. Its 52 readls between
the TASK_ADDR write and the START write added a 50us+ gap that was
hypothesised to matter. It turned out not to be the trigger, but
keeping the path tight matches vendor's drv_svp_nnie_start exactly:
write_task_addr → dmb → start, with no reads in between.

Verified on av300 board with vendor's test binary using cleanroom
libnnie_neo + cleanroom open_nnie_neo.ko. Forward returns 0,
Query finished=1, output scores match vendor byte-for-byte.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-allocated descriptor MMZ, the IRQ completion (g_nnie_done),
and the cause atomic are all single-instance globals. Concurrent
Forward callers race on the descriptor write, the START kick, and
the IRQ-driven wakeup — repro'd as 20/20 -ETIMEDOUT under 4-way
parallel test.

Wrap nnie_dispatch_forward in g_nnie_forward_lock (interruptible
mutex). Vendor's svp_nnie_forward @0x2198 does the same with
osal_down_interruptible on a per-handle semaphore.

Verified 4-way concurrent x 5 rounds = 20/20 pass with this lock,
zero cfg_err.
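
A userspace sketch of the serialisation pattern, using a pthread mutex in place of the kernel mutex. All names are hypothetical, and `dispatch_unlocked()` is a stand-in for the real dispatch body. One difference to keep in mind: the driver takes its lock interruptibly, whereas `pthread_mutex_lock()` cannot be interrupted by a signal.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t g_forward_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-ins for the single-instance shared state (descriptor, completion). */
static int g_in_flight;
static int g_max_in_flight;

/* Hypothetical dispatch body; must only run with the lock held. */
static int dispatch_unlocked(int fail)
{
    int ret;

    /* With the lock held, no other dispatch can be in progress. */
    assert(g_in_flight == 0);
    g_in_flight++;
    if (g_in_flight > g_max_in_flight)
        g_max_in_flight = g_in_flight;

    ret = fail ? -1 : 0; /* pretend to write descriptor, START, wait */

    g_in_flight--;
    return ret;
}

/* Serialised entry point: every exit path releases the lock. */
int forward_locked(int fail)
{
    int ret;

    if (pthread_mutex_lock(&g_forward_lock) != 0)
        return -1;
    ret = dispatch_unlocked(fail);
    pthread_mutex_unlock(&g_forward_lock);
    return ret;
}
```

Because the unlock sits after the dispatch call rather than inside it, a failing dispatch still releases the lock and the next caller is not blocked.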

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that mnist Forward works end-to-end (and the slot-wrap / parallel
tests pass), drop the verbose debugging scaffolding:
- pre-START full reg dump (52 readls)
- post-fail full reg dump
- 100ms polling loop after START
- per-Forward src[i] enType/dims pr_info
- separate task hdr pr_info_once
- redundant Forward arg-size + UVA pr_info_once
- AddTskBuf / RemoveTskBuf pr_info_once → pr_debug

Also collapse the per-task register-setup block into one tidy
RMW group with concise comments and drop the unused `arg_size`,
`mmb`, `task_kvirt`/`task_dma` (void)-cast lines.

Behaviour-preserving cleanup. Re-verified on av300: 1 + 20
sequential + 12 (4-way parallel x 3) Forwards all pass, zero
cfg_err, dmesg clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validated end-to-end on av300, not Phase 0 scaffold anymore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These five modules (nnie_dump, nnie_spy, nnie_trace, nnie_watch,
probe_test) were ad-hoc kprobe + phys_to_virt helpers used during
reverse engineering. They aren't wired into any production build
and don't belong in the shipping PR. Tracked separately if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive cleanup of every comment and docstring in kernel/nnie_neo/
and libraries/nnie_neo/:

- Drop all "Phase N" references — the RE phase labels meant something
  while the work was in progress; now they're noise.
- Drop "kprobe capture showed", "av300 2026-05-17", "Phase 14
  critical finding", and similar journey markers.
- Drop "may be", "appears to", "possibly", "guess" hedging — state
  what the code does and why, definitively or not at all.
- Drop vendor function references and addresses ("svp_nnie_init
  @0x11a8", "hal_svp_nnie_enable_clk_gt @0xbc18") from inline comments
  — they belong in commit messages and RE notes, not in shipping code.
- Drop the verbose ASCII-art-with-byte-offsets prologues; keep the
  field-offset macros + struct definitions + a one-paragraph layout
  description where the layout is non-obvious.
- Drop the dead `g_sys_regs` ioremap path (was Phase 3 read-only
  scaffolding, never used at runtime).
- Drop `g_sys_lock`, `nnie_sys_set_bit`, `nnie_sys_clear_bit`,
  `nnie_sys2_set_bit` — all unused after the cleanup.
- Drop `__maybe_unused` annotations on helpers that now have callers.
- Drop the leftover `mmb` / `task_kvirt`/`task_dma` (void)-cast lines.
- Simplify the Kbuild header.
- Strip the trailing pr_info_once Forward/AddTskBuf/RemoveTskBuf
  argument dumps that were debugging aids.

Behaviour-preserving. Re-verified on av300: 30 sequential + 12
(4-way parallel x 3) mnist Forwards all pass, output byte-identical
to vendor (408 412 401 401 398 412 398 405 449 401), dmesg shows
only the "/dev/nnie ready" line.

Net diff: -612 lines, +0 functional changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure on V4 chiparchs (hi3516ev200/ev300, gk7205v200): the
libraries/nnie_neo/ Makefile hardcoded a staging-path include
(STAGING ?= ../../../../firmware/output-cv500/build/...) so the
build only worked on a dev machine with the firmware tree adjacent
to the openhisilicon repo. CI builds all subdirs unfiltered for V4,
which tripped this.

Three parts to the fix:

1. Self-contain the userspace headers. Bundle the three vendor SVP
   headers (hi_nnie.h, mpi_nnie.h, hi_comm_svp.h) into
   libraries/nnie_neo/include/ — same pattern as libraries/ive_neo/
   (which ships its own hi_comm_ive.h / hi_ive.h / mpi_ive.h).
   Drop the STAGING hack from the Makefile so only repo-relative
   includes remain.

2. Patch the bundled hi_comm_svp.h:
   - drop `#include "hi_errno.h"` (header not in tree for cv500);
     replace with `#include "hi_common.h"` which provides
     HI_DEF_ERR + EN_ERR_* + the MOD_ID_E enum
   - add SVP_NNIE_HANDLE typedef (was in the cv500-kernel-only
     hi_common.h)
   - add HI_ID_SVP_NNIE = 51 to libraries/include/hi_common.h next
     to the existing MOD_ID_E values

3. Gate libraries/nnie_neo to cv500-only in libraries/Makefile —
   NNIE block doesn't exist on V4, cv200, cv100, etc., so the
   cv500-specific MPI surface shouldn't be compiled there.

Verified on av300: bundled headers + repo-relative includes build
clean; rebuilt libnnie_neo.so still produces the byte-identical
mnist output (408 412 401 401 398 412 398 405 449 401).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii merged commit 335ce97 into main May 17, 2026
26 checks passed
@widgetii widgetii deleted the nnie-neo branch May 17, 2026 07:32

Development

Successfully merging this pull request may close these issues.

NNIE backend for cv500 (CNN inference)

1 participant