kernel/nnie_neo: clean-room NNIE CNN driver for cv500/av300 by widgetii · Pull Request #145 · OpenIPC/openhisilicon

widgetii · 2026-05-17T05:30:57Z

Summary

Open-source replacement for vendor open_nnie.ko on cv500-family SoCs (hi3516cv500, av300, dv300). Drives the NNIE CNN inference block at phys 0x11100000 and exposes the vendor-compatible ABI on /dev/nnie, so existing userspace using vendor libnnie.so works unchanged.

Closes #111.

Validation

mnist Forward output byte-identical to vendor:

vendor:    dst[10 scores]: 408 412 401 401 398 412 398 405 449 401
open-src:  dst[10 scores]: 408 412 401 401 398 412 398 405 449 401

Test	Result
Single mnist Forward (open-src lib + open-src ko)	output byte-identical to vendor
520 sequential Forwards (slot_idx ring wrap mod 512)	520/520 PASS
4-way parallel Forward x 5 rounds	20/20 PASS
GDC coexistence (`open_gdc.ko` loaded, idle)	PASS
LoadModel decode for segnet / ssd / lstm	input nodes parse correctly

What's in the box

kernel/nnie_neo/ — kernel driver
- nnie_init.c — platform-device probe (DT hisilicon,hisi-nnie), IRQ + MMZ pre-alloc
- nnie_neo.c — /dev/nnie miscdevice + Forward / AddTskBuf / RemoveTskBuf ioctl handlers + HW dispatch
- nnie_hw_task.h — 64-byte HW descriptor + tskbuf variable-length tail layout
- nnie_hw_regs.h — cv500 NNIE register map (0x11100000)
- Wired into kernel/hi3516cv500.kbuild
libraries/nnie_neo/ — userspace
- nnie_ops.c — HI_MPI_SVP_NNIE_LoadModel, _Forward, _ForwardWithBbox, _Query, _AddTskBuf, _RemoveTskBuf, _UnloadModel, _GetTskBufSize
- nnie_wk_format.h — .wk file-format constants + header struct

Definition-of-done (issue #111)

New kernel module open_nnie_neo.ko bound to hisilicon,hisi-nnie on cv500
CNN model loader + forward pass produce non-trivial output on a small test model
Round-trip test — mnist Forward output matches vendor byte-for-byte
Memory arbitration / SRAM ownership sequence captured and applied

Known limitations (follow-ups, not blockers)

HI_MPI_SVP_NNIE_GetTskBufSize returns HI_ERR_SVP_NNIE_NOT_SURPPORT. Vendor walks the parsed instruction stream to compute per-segment buffer sizes; userspace callers that need it should pre-compute or hardcode for now.
HI_MPI_SVP_NNIE_ForwardWithBbox returns HI_ERR_SVP_NNIE_NOT_SURPPORT. The bbox-mode dispatch path (4 extra ioctls in the 0x4d05..0x4d09 range) isn't wired yet.
u32TmpBufSize is set to 8 MB heuristically rather than computed from the model. Fine for classification-class models; large detection models may need vendor's exact value.
LSTM (enNetType=2) tskbuf tail layout uses the same builder as CNN. RECURRENT-net dispatch is not exercised.
LoadModel for output nodes hardcodes enType=SVP_BLOB_TYPE_S32 and NodeId=(j+1)*8 since the file format doesn't store these for outputs; vendor's libnnie does the same. The W/H/C field order differs from inputs — vendor swaps so the layer's feature count lands in Width.

Test plan

CI build for hi3516cv500 chiparch
CI builds for non-cv500 chiparchs (ev200, gk7205v200, etc.) unaffected (nnie_neo only included from hi3516cv500.kbuild)
Manual: insmod open_nnie_neo.ko on av300, run vendor's sample_nnie_main against mnist .wk

🤖 Generated with Claude Code

First slice of #111 (clean-room NNIE CNN driver for cv500/av300/dv300). The full backend is multi-day RE; this commit lands only the platform-driver scaffold:

- kernel/nnie_neo/nnie_init.c: platform_device probe, binds to the "hisilicon,hisi-nnie" DT node. Maps the nnie0 register window (0x11100000 on cv500) and records the nnie0 + gdc IRQs. Skips the gdc register region (owned by open_gdc.ko; sharing the DT node would EBUSY).
- kernel/nnie_neo/nnie_neo.c: registers /dev/nnie via osal_createdev. Single ioctl dispatch path that returns -EOPNOTSUPP for everything (Phase 0). Phase 1 will decode the eight HI_MPI_SVP_NNIE_* ioctl numbers + arg-buf layouts from vendor libnnie.so and wire real handlers.
- kernel/nnie_neo/Kbuild: mirrors ive_neo/Kbuild structure.
- kernel/hi3516cv500.kbuild: pulls nnie_neo/Kbuild in alongside the existing vendor $(PREFIX)nnie wrapper. Both modules can build; init scripts pick one to insmod at runtime (same pattern as ive vs ive_neo).

On-target av300 verification (after sysrq reboot to load fresh):

    $ insmod /tmp/open_nnie_neo.ko && lsmod | grep nnie
    open_nnie_neo 2949 0
    $ ls -la /dev/nnie
    crw-rw---- 1 root root 218, 100 /dev/nnie
    $ dmesg | grep nnie
    nnie_neo: probed nnie0=f4470000 irq=54 gdc_irq=53
    nnie_neo: /dev/nnie ready (Phase 0 stub — all ioctls return -EOPNOTSUPP)

Known Phase 0 limitation: request_irq fails -16 because vendor's IRQ handler is registered with IRQF_SHARED and our flags=0 conflicts. Phase 3 will switch to IRQF_SHARED once we actually need IRQ-driven completion. /dev/nnie remains usable for the -EOPNOTSUPP path.

Not pushing or opening a PR yet: per the bundle-on-one-branch feedback, NNIE work stays on this local branch until Phase 4 (end-to-end test on av300 with a tiny real model) passes.

Static RE of cv500 vendor libnnie.so (42 KB, 8 public entries) + kernel-side svp_nnie_ioctl @0x26b8 in vendor blob. Five distinct ioctls reach /dev/nnie; the other three public entries (LoadModel, UnloadModel, GetTskBufSize) are pure-userspace: they only touch MMZ via HI_MPI_SYS_MmzAlloc/Free, no /dev/nnie call.

    nr   | size | full ioctl | API entry
    -----+------+------------+--------------------------------
    0x00 | 1624 | 0xc6584d00 | HI_MPI_SVP_NNIE_Forward
    0x01 | 1728 | 0xc6c04d01 | HI_MPI_SVP_NNIE_ForwardWithBbox
    0x02 |   24 | 0xc0184d02 | HI_MPI_SVP_NNIE_Query
    0x03 |   24 | 0xc0184d03 | HI_MPI_SVP_NNIE_AddTskBuf
    0x04 |   24 | 0xc0184d04 | HI_MPI_SVP_NNIE_RemoveTskBuf

Vendor kernel dispatcher additionally recognises 0x4d05/06/07/08/09 when the per-call context state == 0xc. Those are bbox-mode dispatch variants, deferred to Phase 3 once we have an actual model + Forward call to exercise them.

Phase 1 handler bodies still return -EOPNOTSUPP for Forward, AddTskBuf, RemoveTskBuf. Query stubs done=1 (matches ive_neo pattern since dispatch is synchronous once wired).

Implication for Phase 2 scope: the kernel ABI surface is only 5 ioctls; the model loader is mostly userspace work in libnnie_neo.so.

Full ioctl-ABI reference saved to kaeru as nnie-neo-cv500-ioctl-abi.
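The full ioctl numbers in the table follow the generic Linux ioctl encoding (direction in bits 31..30, argument size in bits 29..16, type byte in bits 15..8, nr in bits 7..0). A minimal sketch that rebuilds them from nr and size, assuming the read/write direction and the 0x4d type byte visible in the table; the helper names are hypothetical and not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Generic Linux ioctl layout (asm-generic/ioctl.h, used on ARM):
 * bits 31..30 direction, 29..16 arg size, 15..8 type, 7..0 nr. */
#define NNIE_IOC_DIR_RW 3u    /* _IOC_READ | _IOC_WRITE */
#define NNIE_IOC_TYPE   0x4du /* type byte seen in the decoded numbers */

/* Hypothetical helper: compose a full ioctl number from nr + arg size. */
static uint32_t nnie_ioc(uint32_t nr, uint32_t size)
{
    return (NNIE_IOC_DIR_RW << 30) | (size << 16) | (NNIE_IOC_TYPE << 8) | nr;
}

/* Hypothetical helpers: pull the fields back out of a full number. */
static uint32_t nnie_ioc_size(uint32_t cmd) { return (cmd >> 16) & 0x3fffu; }
static uint32_t nnie_ioc_nr(uint32_t cmd)   { return cmd & 0xffu; }
```

For example, `nnie_ioc(0x00, 1624)` reproduces the Forward number and `nnie_ioc_size(0xc6c04d01)` recovers the 1728-byte ForwardWithBbox argument size.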

Partial RE of cv500 vendor libnnie.so HI_MPI_SVP_NNIE_LoadModel @0x1bf4 (size 0x12d8). Phase 2 of #111. Decoded by tracing the loader's stack buffer reads at sp+84..sp+275 (the 192-byte file-header copy).

What landed:

- libraries/nnie_neo/: new userspace library exporting all 8 vendor HI_MPI_SVP_NNIE_* entry points with vendor-matching const-qualified signatures. Builds clean for cv500 against the cv500 kernel-include tree (libnnie_neo.so = 7.5 KB on cv500 cross-compile).
- include/nnie_wk_format.h: 192-byte .wk file header struct, decoded fields:

      [0..3]      u32 CRC32 (zlib-style, IEEE 802.3 0xEDB88320 reflected, of bytes [4..filesize))
      [16..19]    format-version digits {1,1,1,2} (10*[16]+[17] == 11, 10*[18]+[19] == 12 per loader checks @0x1dbc-0x1de4)
      [48]        enRunMode → SVP_NNIE_MODEL_S.enRunMode @0x1df0
      [49]        u32NetSegNum → SVP_NNIE_MODEL_S.u32NetSegNum @0x1dfc
      [52..55]    inst_offset_extra (sp+136 read, > 0xBF, bounds-checked)
      [56..59]    inst_len (sp+140 read, non-zero, bounds-checked)
      [176..179]  dup of inst_offset_extra (sp+260 read, must equal)
      [180..183]  some count (sp+264 read, > 47)

- src/nnie_ops.c: LoadModel verifies CRC32 + version bytes, then returns HI_ERR_SVP_NNIE_NOT_SURPPORT (the segment-table / ROI-info / weights parsing is the Phase 3 work). All other 7 entries are also stubbed with vendor-matching signatures.

Phase 2 limitations (deferred to Phase 3):

- Segment table iteration (astSeg[u32NetSegNum]): vendor walks an array of SVP_NNIE_SEG_S starting at some file offset within the inst_offset_extra region. Layout unconfirmed.
- ROI pool info (astRoiInfo[]): vendor walks SVP_NNIE_ROIPOOL_INFO_S records, count derived from segment metadata.
- u32TmpBufSize calculation: likely a running sum across segments.
- stBase fill: needs the SVP_SRC_MEM_INFO_S passed in as model base.

On-target verification: blocked this session. The av300 board got into a wedged state after the Phase 1 sysrq reboot cycle; it answers ping but ssh does not respond. Static decode of the loader and bench-build of libnnie_neo.so both pass. /tmp/nnie_crc_check ARM binary is staged for the test once the board is power-cycled. Expected results:

    valid mnist.wk   -> 0xa0338008 (NOT_SURPPORT, CRC passes)
    corrupt mnist.wk -> 0xa0338003 (ILLEGAL_PARAM, CRC fails)

NNIE work continues on the local nnie-neo branch: no PR until Phase 4 (end-to-end inference on av300 with a tiny real model) passes, per the bundle-on-one-branch feedback.

Two bugs in the Phase 2 CRC verifier, found on first av300 test run against vendor inst_mnist_cycle.wk (466176 B):

1. Init value: vendor accumulator starts at 0, not 0xFFFFFFFF (zlib convention). Confirmed by re-reading libnnie.so 0x1cd0-0x1cd8: the special-case `mvneq r0, #0` only fires when r1==4 (the trivially-short-payload path). The general path enters the CRC loop with r0 still at whatever it was before, which is 0 (the memcpy_s return value at 1c64 was checked as 0). Final XOR is still 0xFFFFFFFF (`mvn r0, r0` at 1d0c).
2. CRC coverage range: bytes [4..file[52]+file[56]), not the whole file. file[52..55] is inst_offset_extra (header tail offset), file[56..59] is inst_len. Their sum is the end of the CRC-protected region; weights / quantization tables after that point aren't CRC-protected (vendor relies on instruction-stream offsets to address them).

Confirmed against vendor mnist.wk on av300:

    stored CRC = 0xa4a25b1a
    computed   = 0xa4a25b1a after the fix

Test results now match the Phase 2 expectations (/tmp/nnie_crc_check on av300):

    LoadModel valid   -> 0xa0338008 (NOT_SURPPORT: CRC + version pass, parser body unimplemented)
    LoadModel corrupt -> 0xa0338003 (ILLEGAL_PARAM: CRC fails)
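The corrected verifier described above can be sketched as follows. This is an illustrative bitwise implementation, not the code in nnie_ops.c: it assumes the reflected 0xEDB88320 polynomial, an initial accumulator of 0, a final XOR of 0xFFFFFFFF, little-endian header fields, and coverage of bytes [4 .. file[52]+file[56]). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian u32 from an unaligned byte pointer. */
static uint32_t wk_rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Bitwise reflected CRC-32 (poly 0xEDB88320). Unlike zlib, the
 * accumulator starts at 0; the final inversion is kept. */
static uint32_t wk_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Returns 1 when the stored CRC at file[0..3] matches the CRC of
 * bytes [4 .. inst_offset_extra + inst_len), 0 otherwise. */
static int wk_crc_ok(const uint8_t *file, size_t file_len)
{
    if (file_len < 60)
        return 0;
    uint64_t end = (uint64_t)wk_rd_le32(file + 52) + wk_rd_le32(file + 56);
    if (end < 4 || end > file_len)
        return 0;
    return wk_crc32(file + 4, (size_t)(end - 4)) == wk_rd_le32(file);
}
```

One consequence of the zero initial value is that leading zero bytes do not change the accumulator, which is a property of this variant rather than of the sketch.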

Continuing Phase 2 RE of HI_MPI_SVP_NNIE_LoadModel @0x1bf4. Tracing the post-CRC parse path at 0x1e70-0x1edc reveals the per-segment file record layout (16 B header part + variable-length node arrays):

    off | width | maps to SVP_NNIE_SEG_S field
    ----+-------+-----------------------------------------------
      0 | u8    | enNetType (must be <= 2)
      1 | u8    | u16SrcNum (zero-extended in struct)
      2 | u8    | u16DstNum
      3 | u8    | u16RoiPoolNum
      4 | u16   | u16MaxStep (<= 1024)
      6 | u16   | pad/unk
      8 | u32   | u32InstOffset (bounds-checked vs inst_offset_extra + inst_len)
     12 | u32   | u32InstLen (16-byte aligned)

Followed by node-array records (Phase 3 prereq).

Also confirmed file[60..63] is u32TmpBufSize (the loader stores it at pstModel+4 = SVP_NNIE_MODEL_S.u32TmpBufSize, then checks != 0).

Updated nnie_wk_header_t with the new field names + added nnie_wk_seg_record_t struct. Header file now documents the full high-level format from CRC through segment-table header.
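A minimal sketch of decoding the 16-byte segment record from the table above. The struct and function names here are hypothetical (the real header uses nnie_wk_seg_record_t), fields are assumed little-endian, and the bound applied to the instruction offset is passed in by the caller because the exact vendor comparison is not spelled out here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of the 16-byte on-disk segment record. */
typedef struct {
    uint8_t  net_type;     /* +0  enNetType, must be <= 2 */
    uint8_t  src_num;      /* +1  zero-extended into u16SrcNum */
    uint8_t  dst_num;      /* +2  u16DstNum */
    uint8_t  roi_pool_num; /* +3  u16RoiPoolNum */
    uint16_t max_step;     /* +4  u16MaxStep, <= 1024 */
    uint16_t unk_06;       /* +6  pad / unknown */
    uint32_t inst_offset;  /* +8  u32InstOffset */
    uint32_t inst_len;     /* +12 u32InstLen, 16-byte aligned */
} wk_seg_sketch;

_Static_assert(sizeof(wk_seg_sketch) == 16, "segment record is 16 bytes");

static uint32_t seg_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode and validate one record. Returns 0 on success, -1 on any
 * failed check. inst_limit is an assumed upper bound on inst_offset. */
static int wk_parse_seg(const uint8_t *p, uint32_t inst_limit, wk_seg_sketch *out)
{
    out->net_type     = p[0];
    out->src_num      = p[1];
    out->dst_num      = p[2];
    out->roi_pool_num = p[3];
    out->max_step     = (uint16_t)(p[4] | (p[5] << 8));
    out->unk_06       = (uint16_t)(p[6] | (p[7] << 8));
    out->inst_offset  = seg_le32(p + 8);
    out->inst_len     = seg_le32(p + 12);

    if (out->net_type > 2 || out->max_step > 1024)
        return -1;
    if (out->inst_len % 16u != 0 || out->inst_offset > inst_limit)
        return -1;
    return 0;
}
```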

More LoadModel disassembly progress (vendor libnnie.so 0x1f30-0x208c). Node-record layout in the segment table is asymmetric:

Segment data block starts at file[192] (= end of the fixed header, per file[8] = 0xC0). Structure within a segment block:

    +0..15   nnie_wk_seg_record_t (the 16-B segment header)
    +16..29  first source-node WHC metadata (compact, ~14 B)
               [16..19] u32 -> NODE_S.unShape.stWhc.u32Height
               [20..23] u32 -> NODE_S.unShape.stWhc.u32Width
               [24..27] u32 -> NODE_S.unShape.stWhc.u32Chn
               [30..31] u16 -> NODE_S.enType (post-mapped via 2/3/4/5 -> SVP_BLOB_TYPE enum lookup)
    +32..    array of 64-B node slots, stride 64. First field of each slot is the 32-byte szName. Remaining 32 bytes hold additional WHC/dim + blob_type fields (offsets in slot read at 0x205c-0x20bc, exact layout still partial).

Verified against vendor mnist.wk:

    file[208..211] = 28     (u32 Height)
    file[212..215] = 28     (u32 Width)
    file[216..219] = 1      (u32 Chn, grayscale)
    file[224..227] = "data" (Caffe input-layer name)

Updated nnie_wk_format.h with nnie_wk_node_slot_t (64 B). Phase 3 will fill the unk_20_3F[32] tail with confirmed field offsets once we have a Forward dispatch + kprobe trace exercising real inference.

Still TODO this phase: ROIPool record layout, multi-segment models, segment-boundary computation when NetSegNum > 1.

Decoded the per-ioctl arg-buffer layouts by tracing the userspace worker functions in libnnie.so:

Forward (ioctl 0xc6584d00, arg size 1624 B), worker @0x104c:

    off  | size | content
    -----+------+----------------------------------------------
       0 |    4 | HI_HANDLE (out: kernel writes assigned handle)
       4 |    4 | pad
       8 |  768 | astSrc[16]: 16 SVP_BLOB_S (48 B each)
     776 |    8 | pad
     784 |  768 | astDst[16]
    1552 |   64 | SVP_NNIE_FORWARD_CTRL_S {SrcNum, DstNum, NetSegId,
         |      | enNnieId, stTmpBuf(24), stTskBuf(24)}
    1616 |    4 | bInstant
    1620 |    4 | pad

AddTskBuf / RemoveTskBuf (ioctls 0xc0184d03 / 0xc0184d04, 24 B): plain SVP_MEM_INFO_S {u64 phys, u64 virt, u32 size, u32 pad}. Verified at 0x3134-0x3150 in libnnie.so.

Updated kernel handlers:

- nnie_op_forward parses the 64 B ctrl block (SrcNum/DstNum/NetSegId) + writes handle = 0 to buf+0 + returns -EOPNOTSUPP. Phase 4 will walk astSrc/astDst phys addrs, apply the [0x90] memory-priority knob (skipped for IVE, required for NNIE), and submit to NNIE HW.
- nnie_op_add_tskbuf / remove_tskbuf parse the MEM_INFO_S triple, print it, return -EOPNOTSUPP.
- nnie_op_query unchanged (done=1 stub).

ForwardWithBbox arg-layout (1728 B = 1624 + 104) is similar to Forward with an extra ProposalNum + bbox MEM_INFO block. Precise offsets TBD after Forward HW path works.

Same nnie-neo branch; no PR until Phase 4 (real inference) passes.
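The Forward argument layout above can be pinned down with compile-time checks. This is a sketch under stated assumptions: the struct names are hypothetical, blobs are kept as opaque 48-byte slots, and the control block is assumed to hold four u32 fields followed by the two 24-byte memory-info records in the order listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 24-byte SVP_MEM_INFO_S as decoded: phys, virt, size, pad. */
typedef struct {
    uint64_t phys;
    uint64_t virt;
    uint32_t size;
    uint32_t pad;
} mem_info_sketch;

/* 64-byte control block; field order follows the decoded table. */
typedef struct {
    uint32_t src_num;
    uint32_t dst_num;
    uint32_t net_seg_id;
    uint32_t nnie_id;
    mem_info_sketch tmp_buf;
    mem_info_sketch tsk_buf;
} fwd_ctrl_sketch;

/* Hypothetical mirror of the 1624-byte Forward ioctl argument. */
typedef struct {
    uint32_t handle;          /* +0    out: assigned handle */
    uint32_t pad0;            /* +4 */
    uint8_t  src[16][48];     /* +8    astSrc[16] */
    uint8_t  pad1[8];         /* +776 */
    uint8_t  dst[16][48];     /* +784  astDst[16] */
    fwd_ctrl_sketch ctrl;     /* +1552 */
    uint32_t instant;         /* +1616 bInstant */
    uint32_t pad2;            /* +1620 */
} fwd_arg_sketch;

_Static_assert(sizeof(mem_info_sketch) == 24, "MEM_INFO is 24 B");
_Static_assert(sizeof(fwd_ctrl_sketch) == 64, "ctrl block is 64 B");
_Static_assert(offsetof(fwd_arg_sketch, src) == 8, "astSrc at +8");
_Static_assert(offsetof(fwd_arg_sketch, dst) == 784, "astDst at +784");
_Static_assert(offsetof(fwd_arg_sketch, ctrl) == 1552, "ctrl at +1552");
_Static_assert(offsetof(fwd_arg_sketch, instant) == 1616, "bInstant at +1616");
_Static_assert(sizeof(fwd_arg_sketch) == 1624, "matches ioctl size field");
```

Keeping the checks as static assertions means a wrong offset fails the build instead of surfacing as a bad read on target.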

Key result from continuing the vendor open_nnie.ko disassembly: NNIE HW dispatch is *not* direct register access. It goes through vendor's cmpi cross-module function-pointer indirection. Two sites of interest:

1. drv_svp_nnie_config_ram -> hal_svp_nnie_enable_ram @0xb8f4: The "[0x90] memory-priority knob" mentioned in #111 isn't written by open_nnie.ko at all. The function calls cmpi_get_module_func_by_id(51, 0xd1) where 51 is the open_sys.ko module ID and 0xd1 is a function selector, then blx's the returned function pointer with state at sp+4. So the register write lives in open_sys.ko's exported function table.
2. svp_nnie_start_task @0x1934: After drv_svp_nnie_prepare_nnie, it calls cmpi_get_module_func_by_id(37) -> r5 (some module's func table), then dispatches via four entries:

       r5+0x78 -> prepare submission
       r5+0x7c -> fire/wait
       r5+0x80 -> exists check
       r5+0x84 -> finalize / select_ram fallback

   Same dispatch pattern but mediated by cmpi.

Implication for the clean-room: Phase 4 needs to either (a) replicate cmpi's cross-module function-pointer contract end-to-end, or (b) RE open_sys.ko to find the actual NNIE register writes and bypass cmpi. (b) is cleaner because it avoids depending on a vendor-shared function-pointer ABI that could drift.

Documented in nnie_op_forward's TODO block. No code change to the dispatch path; the parsed-arg + return-EOPNOTSUPP shape is still what userland sees.

RAM/mutex coordination

Followed the cmpi indirection from hal_svp_nnie_enable_ram into hi_sys.o (open_sys.ko's vendor blob). Resolution to the previous session's open question:

- cmpi_get_module_func_by_id(51, 0xd1) -> sys_hal_gdc_nnie_set_ram_using @0x897c in cv500's hi_sys.o.
- That function does an atomic bit-set/clear on bit 0 of register offset +0x34 of the sys-module's MMIO window (loaded from g_sys_state[16] base).
- The "atomic bit-set" pattern is in a private helper at sys_drv_get_cmp_3dnr_cfg+0x1c8 (= 0x7a84): spin_lock_irqsave / read-old / XOR-new / mask-to-bit / XOR-old / write-back / spin_unlock.

So the original issue #111 wording, "memory arbitration [0x90] register", was a partial description. The real mechanism is a set of bit-writes coordinating RAM ownership and mutex between NNIE / GDC / VENC, distributed across these sys-module HAL functions:

- sys_hal_gdc_nnie_set_ram_using
- sys_hal_gdc_nnie_mutex_sel
- sys_hal_venc_nnie_mutex_sel
- sys_hal_nnie_get_mutex_state
- sys_hal_nnie_gdc_get_mutex_state
- sys_hal_vgs_bootroom_set_ram_using

And it's all anchored at sys MMIO base + offset 0x34..0x44ish, not at 0x90 of the IVE block.

Updated nnie_op_forward's TODO block with the new findings. Two paths forward for Phase 4 dispatch wiring:

(a) ioremap the sys register window from open_nnie_neo.ko and do the bit-writes directly. Cleanest, doesn't depend on a clean-room open_sys module existing.
(b) Wait for an open_sys clean-room module (separate effort) and use a proper exported API.

This is real progress: we have a concrete register + bit-set target instead of "the [0x90] knob" which doesn't actually exist on the IVE block. Next session resumes by ioremap'ing the sys MMIO window (address TBD, probably accessible via DT, since hi3516cv500.dtsi declares hisi-sys node) and exercising the bit-writes.

Followed the relocations into hi_sys.o: sys_hal_gdc_nnie_set_ram_using indirects via R_ARM_MOVW_ABS_NC .LANCHOR0, a per-module anchor that holds the sys register window base (g_reg_sys_base_va). cv500 DT declares:

    sys: sys@12010000 {
        compatible = "hisilicon,hisi-sys";
        reg = <0x12010000 0x10000>, /* crg */
              <0x12020000 0x8000>,  /* sys <- the one used here */
              <0x12060000 0x10000>, /* ddr */
              <0x12030000 0x8000>;  /* misc */
    };

So the NNIE/GDC RAM-using register is at phys **0x12020034** (sys-base + 0x34). Verified live on av300 via

    devmem 0x12020034 -> 0x00000000

which matches the expected idle state (NNIE not actively dispatched, bit 0 clear).

Phase 4 starting point is now concrete: ioremap(0x12020000, 0x8000) in open_nnie_neo.ko, do an atomic read-modify-write of bit 0 at offset 0x34 to mark NNIE RAM in-use before each Forward dispatch. Similar offsets for the other 5 sys_hal_* functions (mutex_sel, get_mutex_state) live nearby in 0x12020000..0x12020044ish; a sweep is needed in the next session to map them all out.

Swept all sys_hal_*nnie* + sys_hal_vgs_bootroom_set_ram_using functions in hi_sys.o. The "memory arbitration" the issue mentioned breaks down into three registers within the cv500 sys window (phys 0x12020000, DT-declared reg-name "sys" of hisilicon,hisi-sys node):

    Register 0x12020000: live=0x00000102
      bit 13   = vgs_bootroom_set_ram_using (R/W)

    Register 0x12020008: live=0x00000000 (no contention)
      bit 0..1 = NNIE/GDC mutex state (R)
      bit 1    = venc<->nnie mutex_sel (sys_hal_venc_nnie_*) (W)
      bit 2    = gdc<->nnie mutex_sel (sys_hal_gdc_nnie_*) (W)

    Register 0x12020034: live=0x00000000
      bit 0    = gdc_nnie_set_ram_using (R/W)
                 ^ primary "NNIE has RAM" flag; set before Forward, clear after.

Two private helper functions in hi_sys.o implement the atomic R-M-W:

- sys_drv_get_cmp_3dnr_cfg+0x148 (= 0x7a04): n-bit field set
- sys_drv_get_cmp_3dnr_cfg+0x1c8 (= 0x7a84): single-bit set

Both use spin_lock_irqsave / read-old / mask+XOR / write-back.

Phase 4 wiring: ioremap(0x12020000, 0x1000) once in probe, then atomic R-M-W of these bits around each Forward call. No more cmpi cross-module indirection needed: we drive sys-window bits directly from open_nnie_neo.ko.

All Phase 3 unknowns are now nailed down. Phase 4 has a concrete implementation surface.
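The read-old / mask+XOR / write-back pattern used by the two helpers can be sketched against a mock register window. This is not the driver code: a real kernel version would use readl()/writel() on the ioremap'd base under spin_lock_irqsave, and the macro and function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets inside the sys window, per the register sweep above. */
#define SYS_REG_VGS      0x00u
#define SYS_REG_MUTEX    0x08u
#define SYS_REG_NNIE_RAM 0x34u
#define SYS_BIT_NNIE_RAM (1u << 0)

/* Replace only the bits selected by mask: old ^ ((old ^ val) & mask). */
static uint32_t sys_field_update(uint32_t old, uint32_t mask, uint32_t val)
{
    return old ^ ((old ^ val) & mask);
}

/* Read-modify-write one register in a window of 32-bit registers.
 * Locking is omitted here; the vendor helpers hold a spinlock. */
static void sys_rmw(volatile uint32_t *base, uint32_t off,
                    uint32_t mask, uint32_t val)
{
    volatile uint32_t *reg = base + off / 4u;
    *reg = sys_field_update(*reg, mask, val);
}

static void sys_set_bit(volatile uint32_t *base, uint32_t off, uint32_t bit)
{
    sys_rmw(base, off, bit, bit);
}

static void sys_clear_bit(volatile uint32_t *base, uint32_t off, uint32_t bit)
{
    sys_rmw(base, off, bit, 0u);
}
```

The XOR form leaves every bit outside the mask untouched, which matters because the same window carries VGS, VENC and GDC state.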

…tion

Adds ioremap(0x12020000, 0x1000) in probe to access the sys-side NNIE/GDC/VENC coordination registers (decoded in the previous commit). Currently read-only: probe dumps the three live register values so we can confirm the mapping works. Phase 4 will use the nnie_sys_set_bit() / nnie_sys_clear_bit() helpers (now defined, __maybe_unused) around each Forward dispatch.

On-target verification (av300, fresh boot):

    nnie_neo: probed nnie0=f2820000 irq=54 gdc_irq=53
    nnie_neo: sys @0x12020000 mapped — VGS=0x00000102 MUTEX=0x00000000 NNIE_RAM=0x00000000

The three values match exactly what `devmem` reads against the same phys addresses: vendor open_sys.ko and our open_nnie_neo.ko share the window cleanly with plain ioremap (no request_mem_region clash). NNIE_RAM = 0x0 confirms bit 0 is clear in the idle state, matching the expected semantics ("NNIE has RAM" only during Forward dispatch).

Phase 4 wiring is now a thin shim: call nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM) before submitting to NNIE HW, clear after IRQ completion.

Includes a small spin_lock_init(&g_sys_lock) and adds iounmap() in mod_exit for cleanliness.

Reverse-engineered the 64-byte NNIE HW task node layout from vendor svp_nnie_fill_forward_task @0x90d8 (hi_nnie.o), prologue 90d8..91a8. Cross-checked field offsets against the kernel SVP_NNIE_MODEL_S / SEG_S / FORWARD_CTRL_S struct definitions:

    sizeof(SVP_NNIE_NODE_S)         = 52
    sizeof(SVP_NNIE_SEG_S)          = 1692   ← matches vendor 0x69c stride
    sizeof(SVP_NNIE_ROIPOOL_INFO_S) = 104
    sizeof(SVP_NNIE_MODEL_S)        = 13992  ← matches vendor copy_from_user size 0x36a8
    offsetof(MODEL_S, stBase)       = 13968 = 0x3690 ✓

Vendor caller passes: r0=global HW state, r1=pstModel (kernel kbuf), r2=forward arg kbuf, r3=on-stack 64-byte descriptor. The struct gets populated with phys addresses + segment instruction-stream pointer + batch num + trigger flag, then handed to svp_nnie_post_process which submits via cmpi-mediated svp_nnie_start_task (Phase 5).

Concrete field map:

    task[ 0] u16 = bInstant ? 1 : 0
    task[16] u64 = pstModel->stBase.u64PhyAddr (.wk MMZ block)
    task[24] u32 = pstModel->astSeg[NetSegId].u32InstOffset
    task[28] u32 = pstModel->astSeg[NetSegId].u32InstLen
    task[32] u64 = ctrl.stTskBuf.u64PhyAddr (user-supplied scratch)
    task[48] u64 = ctrl.stTmpBuf.u64PhyAddr (user-supplied temp)
    task[56] u32 = astSrc[0].u32Num (batch size)

Past the 64-B header the vendor appends a variable-length per-input stride table (one u32 per astSrc), per-node shape data copied from pstModel->astSeg[NetSegId].astSrcNode/astDstNode, and a per-batch DMA address vector. Layout decoded but not yet captured in the header; deferred until Phase 5 wires actual HW submission and we know which trailing sections the cv500 NNIE block actually consumes.

This commit only adds the descriptor header + cross-references it from the Forward stub comment. No behaviour change: module still returns -EOPNOTSUPP for Forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
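The field map above translates into a fixed 64-byte struct. The sketch below is illustrative: the type and function names are hypothetical, and the byte ranges the field map does not mention (+2..+15, +40..+47, +60..+63) are kept as opaque, zero-filled arrays because their meaning is not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the 64-byte HW task descriptor. */
typedef struct {
    uint16_t trigger;      /* +0  bInstant ? 1 : 0 */
    uint8_t  unk_02[14];   /* +2  not covered by the field map */
    uint64_t model_phys;   /* +16 pstModel->stBase.u64PhyAddr */
    uint32_t inst_offset;  /* +24 astSeg[NetSegId].u32InstOffset */
    uint32_t inst_len;     /* +28 astSeg[NetSegId].u32InstLen */
    uint64_t tskbuf_phys;  /* +32 ctrl.stTskBuf.u64PhyAddr */
    uint8_t  unk_28[8];    /* +40 not covered by the field map */
    uint64_t tmpbuf_phys;  /* +48 ctrl.stTmpBuf.u64PhyAddr */
    uint32_t batch_num;    /* +56 astSrc[0].u32Num */
    uint8_t  unk_3c[4];    /* +60 not covered by the field map */
} task_hdr_sketch;

_Static_assert(sizeof(task_hdr_sketch) == 64, "descriptor is 64 bytes");
_Static_assert(offsetof(task_hdr_sketch, model_phys) == 16, "model phys at +16");
_Static_assert(offsetof(task_hdr_sketch, tskbuf_phys) == 32, "tskbuf at +32");
_Static_assert(offsetof(task_hdr_sketch, tmpbuf_phys) == 48, "tmpbuf at +48");
_Static_assert(offsetof(task_hdr_sketch, batch_num) == 56, "batch at +56");

/* Fill the known fields; everything else is zeroed. */
static void task_hdr_fill(task_hdr_sketch *t, int instant,
                          uint64_t model_phys, uint32_t inst_offset,
                          uint32_t inst_len, uint64_t tskbuf_phys,
                          uint64_t tmpbuf_phys, uint32_t batch_num)
{
    memset(t, 0, sizeof(*t));
    t->trigger     = instant ? 1 : 0;
    t->model_phys  = model_phys;
    t->inst_offset = inst_offset;
    t->inst_len    = inst_len;
    t->tskbuf_phys = tskbuf_phys;
    t->tmpbuf_phys = tmpbuf_phys;
    t->batch_num   = batch_num;
}
```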

Reverse-engineered the cv500 NNIE HW register interface from vendor hal_svp_nnie_* thin shims (hi_nnie.o @0xbb10..0xbc90, all single-store helpers indexed by core_id off .LANCHOR1[core_id]+4 = ioremap'd regs).

Complete 0x11100000 register map:

    +0x20 W  task descriptor phys[31:0]          (hal_svp_nnie_write_task_addr)
    +0x24 W  task descriptor phys[63:32]
    +0x28 W  timeout cycles [31:0]               (hal_svp_nnie_set_timeout)
    +0x2C W  timeout cycles [63:32]
    +0x30 RW START: bit 0 = go                   (hal_svp_nnie_start)
    +0x34 RW IRQ_CFG: bits 0/1/2 enable          (hal_svp_nnie_cfg_irq)
             finish / timeout / cfg_err IRQs
    +0x38 RW IRQ_CLEAR: bits 0/1/2 w1c           (hal_svp_nnie_clear_irq)
    +0x3C R  IRQ_STATUS: bits 0/1/2 pending      (hal_svp_nnie_get_irq_status)
    +0x40 R  CFG_ERR_INFO                        (hal_svp_nnie_get_cfg_err_info)
    +0x48 R  TASK_ID                             (hal_svp_nnie_get_task_id)
    +0x50 RW CLK_GATE: bit 7 (=0x80) en          (hal_svp_nnie_enable_clk_gt)
    +0x54 RW AXI OUTSTANDING: [4:0]=0xF,         (hal_svp_nnie_set_outstanding)
             [11:8]=0xF
    +0x68 RW CHECK_SUM: bit 0 en                 (hal_svp_nnie_disable_check_sum)

Dispatch sequence (drv_svp_nnie_start @0xb3ac):

    write_task_addr(task_phys_lo, task_phys_hi); // [+0x20], [+0x24]
    wmb();
    START |= 1;                                  // [+0x30] |= 1

Two important confirmations:

1. hal_svp_nnie_set_mem_speed @0xbc28 is a LITERAL no-op (`bx lr`) on cv500: vendor doesn't write any IVE-style "[0x90] mem-priority knob" for NNIE. Our Phase 3 finding that NNIE coordination instead uses the sys @0x12020034 RAM-using flag stands.
2. hal_svp_nnie_enable_ram @0xb8f4 goes through cmpi (module 51 = SYS, fn 0xd1), not the NNIE register window. This matches the Phase 3 finding of sys_hal_gdc_nnie_set_ram_using setting bit 0 of 0x12020034.

So the full HW-bring-up sequence is:

1. nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM)
2. write CLK_GATE = 0x80 (enable clock gating)
3. write OUTSTANDING = 0xF | 0xF00
4. fill 64-B task descriptor (Phase 4 prior commit)
5. write IRQ_CFG = 0x7 (enable all 3 IRQs)
6. write TASK_ADDR_LO/HI
7. wmb()
8. write START = 1
9. wait on IRQ → read IRQ_STATUS → clear IRQ
10. nnie_sys_clear_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM)

No behaviour change: module still returns -EOPNOTSUPP. Phase 4 wiring is now a thin shim around this header + nnie_hw_task.h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
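Steps 2, 3, 5, 6, 7 and 8 of the bring-up sequence can be sketched against a mock register window. This is a userspace illustration under stated assumptions: the macro and function names are hypothetical, the window is a plain array rather than an ioremap'd region, and a C11 release fence stands in for the kernel's wmb(). The sys-window bit and the IRQ wait (steps 1, 9, 10) are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Offsets from the decoded cv500 NNIE register map. */
#define NNIE_REG_TASK_LO     0x20u
#define NNIE_REG_TASK_HI     0x24u
#define NNIE_REG_START       0x30u
#define NNIE_REG_IRQ_CFG     0x34u
#define NNIE_REG_CLK_GATE    0x50u
#define NNIE_REG_OUTSTANDING 0x54u

static void reg_wr(volatile uint32_t *regs, uint32_t off, uint32_t v)
{
    regs[off / 4u] = v;
}

/* Program the block and set the START bit for one task descriptor. */
static void nnie_kick_sketch(volatile uint32_t *regs, uint64_t task_phys)
{
    reg_wr(regs, NNIE_REG_CLK_GATE, 0x80u);            /* step 2 */
    reg_wr(regs, NNIE_REG_OUTSTANDING, 0xFu | 0xF00u); /* step 3 */
    reg_wr(regs, NNIE_REG_IRQ_CFG, 0x7u);              /* step 5 */
    reg_wr(regs, NNIE_REG_TASK_LO, (uint32_t)task_phys);         /* step 6 */
    reg_wr(regs, NNIE_REG_TASK_HI, (uint32_t)(task_phys >> 32));

    /* step 7: order the descriptor/address writes before the kick;
     * the kernel driver would use wmb() here. */
    atomic_thread_fence(memory_order_release);

    /* step 8: START |= 1 */
    reg_wr(regs, NNIE_REG_START, regs[NNIE_REG_START / 4u] | 1u);
}
```

Writing the descriptor address before the fence and START after it mirrors the ordering in drv_svp_nnie_start as decoded above.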

Wire the decoded 64-byte HW task descriptor into nnie_op_forward via a new helper nnie_fill_task_header(). Still returns -EOPNOTSUPP (we need the variable-length descriptor tail: per-input stride table, per-node shape data, per-batch DMA addresses, before driving HW), but this commit:

- Lays down all the SVP_BLOB_S / SVP_NNIE_FORWARD_CTRL_S internal offset constants (already cross-checked vs vendor disasm).
- Builds the fixed 64-byte header from ctrl.stTskBuf.u64PhyAddr, ctrl.stTmpBuf.u64PhyAddr, astSrc[0].u32Num, and bInstant.
- Logs all decoded values pr_info_once so an on-target Forward call now prints the full decoded forward args + the partial task header, proving the offset constants are right end-to-end (the values must match what userspace passed in).
- Defers reading pstModel->stBase.u64PhyAddr + astSeg[NetSegId].u32InstOffset/InstLen to Phase 5 (needs copy_from_user of 13992 B model struct).
- Drops the now-redundant Phase 3 comment block; the cv500 sys-window coordination map + NNIE register map have been promoted into nnie_hw_task.h + nnie_hw_regs.h.

Rebuilt cleanly against the cv500 4.9.37 kernel (/home/dima/git/firmware/output-cv500/build/linux-custom). New .text size for nnie_op_forward: 0x1a8 B (was a one-line stub).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reverse-engineered the variable-length descriptor tail of the NNIE task buffer from vendor svp_nnie_fill_forward_task body @0x91d4-0x9498. The fixed 64-byte HW descriptor (decoded in Phase 4) points to this tail via task[+32] = ctrl.stTskBuf.u64PhyAddr; the HW follows that pointer to read shape/stride/per-batch DMA addresses.

Tail layout (written to ctrl.stTskBuf, kernel-vir resolved by svp_nnie_get_tsk_vir_addr @0x91a8 via phys-match against the registered tskbuf list):

    §1 always     SrcNum × u32 astSrc[i].u32Stride
                  align tip to 16 B
    §2 non-LSTM   16 × u64 astDst[i].u64PhyAddr
                  (zero past DstNum, always 128 B advance)
    §3 non-LSTM   varies: per-source DMA address vector, dispatch by astSrc[i].enType:
                    0     -> u32Stride*Height*Chn * batch_idx + PhyAddr
                    1..3  -> svp_nnie_fill_image_src_addr (YUV plane offsets)
                    4     -> u32Stride*Height * batch_idx + PhyAddr
                    5     -> per-step from user u64VirAddrStep array
                    other -> ILLEGAL_PARAM
                  align tip to 16 B per blob
    §4 LSTM only  different net_type==RECURRENT path @0x96dc, uses ctrl+stTskBuf indexing (Phase 6 work)
    §5 optional   dcache flush if (sp+40)/sp+28 set, flush range [stTskBuf.PhyAddr, +stTskBuf.u32Size)

Important struct-layout correction: SVP_BLOB_S has a 4-byte hole at +28 (not previously documented in nnie_neo.c). The union starts at +32 because stSeq.u64VirAddrStep needs 8-byte alignment. Cross-checked with the cv500 ARM toolchain:

    +0..+27   enType, Stride, VirAddr, PhyAddr, Num
    +28..+31  PADDING
    +32..+47  union { Width,Height,Chn | Dim,VirAddrStep }

The previous NNIE_BLOB_OFF_WIDTH=28 / HEIGHT=32 / CHN=36 constants were wrong; corrected to 32/36/40. The Forward stub's use of these (reading astSrc[0].u32Num at +24) was already at the right offset.

No behaviour change: Forward stub still returns -EOPNOTSUPP.

Phase 6 will:

- decode the LSTM tail variant (§4)
- implement the §1-§3/§5 builder in C
- wire copy_from_user of pstModel (13992 B) to populate task[+16/+24/+28]
- drive the HW per nnie_hw_regs.h sequence

Rebuilt clean against cv500 4.9.37 kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
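The 4-byte hole at +28 falls out of ordinary alignment rules, which a small struct can demonstrate. The sketch below is a hypothetical mirror of SVP_BLOB_S, not the vendor header: field names are shortened, the hole is written as an explicit member, and `_Alignas(8)` is an addition so the 64-bit step pointer is 8-byte aligned on any host, as it is under the ARM EABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 48-byte SVP_BLOB_S layout. */
typedef struct {
    uint32_t type;      /* +0  enType */
    uint32_t stride;    /* +4  u32Stride */
    uint64_t vir_addr;  /* +8  u64VirAddr */
    uint64_t phy_addr;  /* +16 u64PhyAddr */
    uint32_t num;       /* +24 u32Num (batch count) */
    uint32_t hole_28;   /* +28 padding the compiler inserts on ARM */
    union {
        struct {
            uint32_t width;   /* +32 */
            uint32_t height;  /* +36 */
            uint32_t chn;     /* +40 */
        } whc;
        struct {
            uint32_t dim;                        /* +32 */
            _Alignas(8) uint64_t vir_addr_step;  /* +40 */
        } seq;
    } shape;
} blob_sketch;

_Static_assert(offsetof(blob_sketch, num) == 24, "Num at +24");
_Static_assert(offsetof(blob_sketch, shape) == 32, "union starts at +32");
_Static_assert(offsetof(blob_sketch, shape.whc.height) == 36, "Height at +36");
_Static_assert(offsetof(blob_sketch, shape.whc.chn) == 40, "Chn at +40");
_Static_assert(sizeof(blob_sketch) == 48, "blob is 48 bytes");
```

With these offsets, sixteen blobs occupy 768 bytes, which agrees with the astSrc and astDst sizes in the Forward argument layout.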

Fix the deferred IRQF_SHARED issue from Phase 0:

- request_irq() now passes IRQF_SHARED. The cv500 NNIE SPI line (54) is shared with vendor open_gdc.ko (GDC on SPI 53 in the same DT node), which we kprobed using IRQF_SHARED. To coexist we have to match: the kernel rejects mixed-flag handlers on a shared line.
- nnie_irq_handler now reads NNIE_REG_IRQ_STATUS first; if no NNIE bits are pending it returns IRQ_NONE so the GDC handler (or any other downstream sharer) gets to run. Only when a NNIE finish / timeout / cfg_err bit is set do we write-1-clear and signal g_nnie_done.

The handler doesn't yet inspect *which* bit was set. That distinction (finish vs timeout vs cfg_err) gets pushed to Phase 7 once Forward actually dispatches and we have an end-to-end test path.

Rebuilt clean against cv500 4.9.37 kernel; .text size for the module grew by 48 B for the status-read/dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
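The shared-line handler logic described above can be modelled in userspace. This is a sketch, not the driver's nnie_irq_handler: the mock struct, the return constants and the write-1-to-clear behaviour are all simulated, and only the control flow (check status, bail out with "not mine", otherwise clear and signal) reflects the commit.

```c
#include <assert.h>
#include <stdint.h>

#define NNIE_IRQ_MASK 0x7u /* finish / timeout / cfg_err bits 0..2 */

/* Stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum { IRQ_NONE_SKETCH = 0, IRQ_HANDLED_SKETCH = 1 };

/* Mock of the pieces of device and driver state the handler touches. */
struct nnie_mock {
    uint32_t irq_status; /* models the read-only IRQ_STATUS register */
    int      done;       /* models complete(&g_nnie_done) */
};

/* Model a write-1-to-clear store to IRQ_CLEAR. */
static void mock_irq_clear(struct nnie_mock *hw, uint32_t bits)
{
    hw->irq_status &= ~bits;
}

static int nnie_irq_sketch(struct nnie_mock *hw)
{
    uint32_t pending = hw->irq_status & NNIE_IRQ_MASK;

    /* Nothing pending on our block: let the other sharer handle it. */
    if (!pending)
        return IRQ_NONE_SKETCH;

    mock_irq_clear(hw, pending);
    hw->done = 1;
    return IRQ_HANDLED_SKETCH;
}
```

Returning the "none" value without touching any register is what lets a second handler on the same line see its own interrupt.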

Implement the kernel side of HI_MPI_SVP_NNIE_AddTskBuf / RemoveTskBuf. Userspace MMZ-allocates a scratch region, registers (phys, user_virt, size) with the kernel; the kernel records the mapping in a list and ioremap()s the phys range so Phase 7 can write the variable-length descriptor tail into stTskBuf from Forward dispatch.

Implementation:

- struct nnie_tskbuf {phys, user_virt, size, kvirt} on a list_head protected by g_nnie_tskbuf_lock.
- nnie_add_tskbuf()/nnie_remove_tskbuf()/nnie_drain_tskbufs() do the list management + ioremap/iounmap.
- nnie_op_add_tskbuf/remove_tskbuf wire the 24-byte SVP_MEM_INFO_S arg buffer through to those helpers.
- nnie_drain_tskbufs() in mod_exit prevents leaks on rmmod.

ioremap (not cmpi_remap_cached) is uncached, which means we don't need the cache-flush step vendor has at fill_forward_task @0x94ac. Trade-off is slower kernel writes, but the descriptor tail is small (KB), written once per Forward call.

Userspace API match: vendor's libnnie.so AddTskBuf returns 0 on success; ours now returns 0 (on success), -EEXIST (already registered), -ENOMEM (OOM or ioremap fail), or -EINVAL (phys/size==0). RemoveTskBuf returns 0 or -ENOENT.

Module size grew from 7128 B to 8028 B (+900 B for the registry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add nnie_build_task_tail(), which implements §1-§3 of the descriptor tail decoded in Phase 5. Wired into nnie_op_forward as a dry-run: if the caller has registered an stTskBuf via AddTskBuf, we look it up and fill in the strides + dst-phys + per-batch DMA addresses. Still returns -EOPNOTSUPP overall (Phase 7 will copy_from_user pstModel, finalise the 64-B header, and drive the HW registers).

Builder:

    §1: SrcNum × u32 stride entries (one per astSrc[i]).
        Aligned to 16 B with zero-fill.
    §2: 16 × u64 destination phys addresses, zero-padded past dst_num.
    §3: per-source DMA address vector, for each astSrc[i]:
          enType==0: batch_size = Stride * Height * Chn
          enType==4: batch_size = Stride * Height
          enType ∈ [1..3, 5]: -EOPNOTSUPP (Phase 7+: YUV/seq inputs)
        Writes Num u64 entries: PhyAddr + j*batch_size.
        Aligned to 16 B between blobs.

Tail bytes used logged via pr_info_once so on-target verification can confirm the offset arithmetic matches what HW expects (cross-checkable against vendor strace).

Wiring:

- nnie_init.c now exports g_nnie_pf_dev (platform_device *) so nnie_neo.c can dma_alloc_coherent in Phase 7.
- Header includes: linux/dma-mapping.h + linux/platform_device.h.
- nnie_fill_task_header marked __maybe_unused (Phase 7 will use it).

Module .text grew from 8028 B to 8944 B (+916 B for the builder). Build clean against cv500 4.9.37 kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
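The §1-§3 builder can be sketched in plain C. This is an illustration of the layout arithmetic only, not nnie_build_task_tail itself: the `tail_src` struct is a hypothetical flattened view of the blob fields the builder needs, values are written in host byte order (little-endian on the target), and the capacity checks are an addition of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flattened view of one astSrc[i] entry. */
typedef struct {
    uint32_t type, stride, num, height, chn;
    uint64_t phys;
} tail_src;

/* Zero-fill up to the next 16-byte boundary and return the new tip. */
static size_t tail_pad16(uint8_t *buf, size_t tip)
{
    while (tip & 15u)
        buf[tip++] = 0;
    return tip;
}

/* Returns bytes written, or a negative errno value. */
static long tail_build(uint8_t *buf, size_t cap,
                       const tail_src *src, uint32_t src_num,
                       const uint64_t *dst_phys, uint32_t dst_num)
{
    size_t tip = 0;

    if (src_num > 16 || dst_num > 16)
        return -EINVAL;
    if (cap < (((size_t)src_num * 4u + 15u) & ~(size_t)15u) + 128u)
        return -ENOSPC;

    /* §1: one u32 stride per source, then align to 16 B. */
    for (uint32_t i = 0; i < src_num; i++, tip += 4)
        memcpy(buf + tip, &src[i].stride, 4);
    tip = tail_pad16(buf, tip);

    /* §2: always 16 u64 slots, zero past dst_num. */
    for (uint32_t i = 0; i < 16; i++, tip += 8) {
        uint64_t v = (i < dst_num) ? dst_phys[i] : 0;
        memcpy(buf + tip, &v, 8);
    }

    /* §3: per-source vector of Num addresses, 16 B aligned per blob. */
    for (uint32_t i = 0; i < src_num; i++) {
        uint64_t batch;
        if (src[i].type == 0)
            batch = (uint64_t)src[i].stride * src[i].height * src[i].chn;
        else if (src[i].type == 4)
            batch = (uint64_t)src[i].stride * src[i].height;
        else
            return -EOPNOTSUPP;
        uint64_t need = ((uint64_t)src[i].num * 8u + 15u) & ~(uint64_t)15u;
        if ((uint64_t)(cap - tip) < need)
            return -ENOSPC;
        for (uint32_t j = 0; j < src[i].num; j++, tip += 8) {
            uint64_t addr = src[i].phys + (uint64_t)j * batch;
            memcpy(buf + tip, &addr, 8);
        }
        tip = tail_pad16(buf, tip);
    }
    return (long)tip;
}
```

For a single 28-row grayscale source with stride 32 and a batch of two, the tail comes to 16 + 128 + 16 = 160 bytes, with the second batch address 896 bytes after the first.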

Finish everything in the Forward path *except* the actual HW kick:

- copy_from_user(model_kbuf, fwd_arg[+776], 13992): pulls the user's SVP_NNIE_MODEL_S into kernel memory using vmalloc (kmalloc would slab-fragment for ~14 KB).
- Validate net_seg_id < model->u32NetSegNum and < 8.
- Extract model->stBase.u64PhyAddr (file[+0x3690]) -> task[+16].
- Extract model->astSeg[net_seg_id].u32InstOffset/u32InstLen (file[+12 + seg*1692 + 12 / +16]) -> task[+24], task[+28].
- Look up stTskBuf in our registry; call nnie_build_task_tail to write the §1-§3 variable-length tail.
- Fill the 64-byte HW task descriptor on stack via nnie_fill_task_header (un-suppressed __maybe_unused).
- Log: trigger, model_phys, inst_off, inst_len, tail_bytes.

What's left for Phase 7 (the actual HW kick):

- dma_alloc_coherent the 64-B descriptor (we have g_nnie_pf_dev).
- memcpy stack descriptor into it.
- nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, ...) coordination.
- 6 register writes plus a wmb() (CLK_GATE, OUTSTANDING, IRQ_CFG, TASK_ADDR_LO/HI, wmb, START).
- wait_for_completion_timeout 5 sec.
- Read+ack NNIE_REG_IRQ_STATUS; distinguish finish/timeout/cfg_err.
- Release sys lock + dma_free_coherent + write handle to buf+0.

Stops short of HW kick because the partial test on av300 is non-destructive only as long as no register write hits the live NNIE block. Once we add the START write, a wrong descriptor field could DMA to bad addresses and (worst case) hang the SoC. Doing that as a distinct commit keeps the bisect safe.

Module .text grew from 8944 to 9908 B (+964 for copy_from_user flow). Still returns -EOPNOTSUPP at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire the cv500 NNIE Forward dispatch end-to-end. Forward now:
1. Decodes the 1624-byte forward arg (Phase 1).
2. copy_from_user pstModel; extracts stBase.u64PhyAddr + astSeg[net_seg_id].u32InstOffset/u32InstLen (Phase 6).
3. Writes §1-§3 variable-length tail into the registered stTskBuf via nnie_build_task_tail (Phase 5/6).
4. dma_alloc_coherent's a 64-byte HW descriptor, populates it via nnie_fill_task_header (Phase 4).
5. Acquires the cv500 sys-window NNIE_RAM coordination bit at 0x12020034 (Phase 3).
6. Writes CLK_GATE / OUTSTANDING / IRQ_CFG / TASK_ADDR_LO/HI / START to the NNIE register window (Phase 4).
7. Waits on g_nnie_done completion (5 s timeout, IRQF_SHARED-aware handler from Phase 6).
8. Reads cause out of g_nnie_last_status (set atomically by the handler before complete()).
9. Releases the sys lock, frees the DMA descriptor.
10. Distinguishes finish / timeout / cfg_err and returns the right errno (0 / -ETIMEDOUT / -EIO).

The dispatch happens unconditionally, with no module parameter gate. On-target verification on av300 is the next step.

Failure modes that could happen on first run:
- Bad descriptor field offset: HW reports cfg_err in status, handler signals, we return -EIO. Recoverable; no board hang.
- sys-window bit doesn't match what vendor expects: HW silently discards the task, we hit the 5 s timeout, return -ETIMEDOUT. Also recoverable.
- Wrong NNIE register-window decode (Phase 4 RE was wrong): worst case the START write goes nowhere; same -ETIMEDOUT outcome.
- HW reads the descriptor and the descriptor's tsk_buf_phys points somewhere bad: HW raises a bus error; on cv500 typically the SoC bus abort handler logs and the NNIE block returns cfg_err. Recoverable.
- Worst plausible failure: HW reads the descriptor, the variable-length tail has a bad PhyAddr in §3, NNIE DMAs garbage to/from a bad address. AXI typically reports a bus error rather than hanging. Power-cycle recovers if not.

Module .text grew from 9908 to 12248 B (+2.3 KB for dispatcher). Build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
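Step 10 above (cause to errno) could look roughly like the sketch below. The bit positions for finish and timeout are assumptions; only cfg_err = 0x4 corresponds to a STATUS readback reported later in this log, and `nnie_cause_to_errno` with its `irq_seen` flag is a hypothetical helper, not the driver's actual function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NNIE_IRQ_FINISH   (1u << 0) /* assumed bit position */
#define NNIE_IRQ_TIMEOUT  (1u << 1) /* assumed bit position */
#define NNIE_IRQ_CFG_ERR  (1u << 2) /* 0x4, as seen in a later STATUS readback */
#define NNIE_IRQ_ALL      (NNIE_IRQ_FINISH | NNIE_IRQ_TIMEOUT | NNIE_IRQ_CFG_ERR)

/*
 * Map the status snapshot taken by the IRQ handler to an errno.
 * irq_seen is 0 when the 5 s software wait expired without any IRQ.
 */
static int nnie_cause_to_errno(uint32_t status, int irq_seen)
{
    status &= NNIE_IRQ_ALL;

    if (!irq_seen)
        return -ETIMEDOUT;      /* HW stayed silent */
    if (status & NNIE_IRQ_CFG_ERR)
        return -EIO;            /* HW rejected the task */
    if (status & NNIE_IRQ_TIMEOUT)
        return -ETIMEDOUT;      /* HW-side timeout fired */
    if (status & NNIE_IRQ_FINISH)
        return 0;
    return -EIO;                /* IRQ with no known cause bit */
}

/* Small self-check showing the intended use. */
static void nnie_cause_selfcheck(void)
{
    assert(nnie_cause_to_errno(NNIE_IRQ_FINISH, 1) == 0);
    assert(nnie_cause_to_errno(0, 0) == -ETIMEDOUT);
}
```

Testing cfg_err before finish is a deliberate choice in this sketch: if both bits were ever set, reporting the error seems the safer outcome.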

End-to-end Forward path is wired (Phase 7 dispatch + IRQF_SHARED); old 'Phase 3 — ioctl ABI wired, HW dispatch TBD' log was stale. On-target verified on av300: module loads, IRQ 54 shares cleanly with vendor GDC_NNIE handler (no 'Flags mismatch' error), /dev/nnie present, sys-window coordination registers read live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three substantive changes to libraries/nnie_neo:

1. Forward / AddTskBuf / RemoveTskBuf now call ioctl() on /dev/nnie instead of returning HI_ERR_SVP_NNIE_NOT_SURPPORT. The Forward path packs SVP_SRC_BLOB_S[] + pstModel user VA + SVP_DST_BLOB_S[] + SVP_NNIE_FORWARD_CTRL_S into the 1624-byte ioctl arg per the layout decoded in kernel/nnie_neo/nnie_neo.c.

2. LoadModel now actually populates the SVP_NNIE_MODEL_S struct:
   - stBase = *pstModelBuf
   - enRunMode = file[48]
   - u32TmpBufSize = file[60..63]
   - u32NetSegNum = file[49]
   - astSeg[i].enNetType / u16SrcNum / u16DstNum / u16RoiPoolNum / u16MaxStep / u32InstOffset / u32InstLen from the 16-byte seg records at file[192 + i*16].
   Node + ROI tables are left zeroed: kernel Forward only reads u32InstOffset / u32InstLen, and userspace post-process helpers (softmax/detect/cluster) aren't implemented yet, so zeroed slots are safe for now. InstOffset+InstLen is validated against file_size.

3. /dev/nnie fd lifecycle: cached static int, opened lazily on first ioctl, protected by a pthread_mutex. nnie_err_to_hi() translates Linux errno to vendor HI_ERR_SVP_NNIE_* codes.

Build fix: vendor hi_nnie.h needs HI_ID_SVP_NNIE from the staging hi_common.h, but libraries/include/hi_common.h was being preferred (causing redeclaration conflicts on EN_ERR_LEVEL_*). Reorder Makefile include path so STAGING/kernel/include/hi3516cv500 comes BEFORE libraries/include.

libnnie_neo.so builds clean against cv500 ARM toolchain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
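The InstOffset+InstLen validation mentioned in change 2 is easy to get wrong when the sum is done in 32 bits. A minimal sketch, assuming both fields are u32 as read from the seg record; `nnie_seg_in_file` is a hypothetical name and the real check in nnie_ops.c may differ:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 when [inst_off, inst_off + inst_len) lies inside the .wk file,
 * 0 otherwise. The sum is widened to 64 bits so a huge offset cannot wrap
 * around and slip past the comparison.
 */
static int nnie_seg_in_file(uint32_t inst_off, uint32_t inst_len,
                            uint64_t file_size)
{
    uint64_t end = (uint64_t)inst_off + inst_len;

    assert(end >= inst_off); /* holds by construction in 64-bit arithmetic */
    return end <= file_size;
}
```

With a 32-bit sum, an offset of 0xffffffff and a length of 2 would wrap to 1 and pass any size check, which is the case the widening guards against.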

Two on-target fixes after the first end-to-end Forward attempt on av300 with inst_mnist_cycle.wk:

1. AddTskBuf: switched from ioremap() to memremap(MEMREMAP_WB). cv500 MMZ regions are CMA-backed kernel RAM, and ioremap() refuses these (kernel WARN + returns NULL because the kernel direct map already covers them). memremap WB transparently handles both CMA RAM (returns lowmem virt) and MMIO (falls back to ioremap). Updated struct nnie_tskbuf.kvirt type from 'void __iomem *' to 'void *' and replaced iowrite32 in the tail builder with plain stores. Verified on target: AddTskBuf now returns 0, no more WARN.

2. Tail builder: SVP_BLOB_TYPE_U8 (=1) with Chn=1 (grayscale) now uses Stride*Height batch_size. Vendor's svp_nnie_fill_image_src_addr @0x7978 handles U8 by branching on Chn ∈ {1, 3}: Chn=1 is a single u64 per batch (= PhyAddr + j*Stride*Height); Chn=3 writes 4 u64s per batch (3 plane addrs + zero pad) at 32 B/batch, which is still Phase 8. Without the U8 path mnist couldn't run (its input blob is enType=1, Chn=1).

Current on-target test run (LD_LIBRARY_PATH+PRELOAD voice libs):
  LoadModel -> 0x0  NetSegNum=1, Inst@offset 453888 len 10600
  AddTskBuf -> 0x0
  Forward   -> 0xa0338012 (-ETIMEDOUT)

Kernel dmesg shows:
  task hdr: model_phys=0xaa880000 inst_off=453888 inst_len=10600 tail=160 B (descriptor builder ran clean)
  Forward timed out (5s, status snapshot=0x0)

Next: figure out why HW doesn't IRQ. Hypotheses:
- NNIE clock disabled (vendor drv_svp_nnie_enable_sys_clk @0xb26c likely needed before START)
- 64-B descriptor layout still slightly off
- Vendor open_gdc.ko handler consuming our IRQ first
- Some RAM-bank select register we're missing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
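The per-batch address rule from fix 2, combined with the enType cases from the earlier builder commit, can be sketched as follows. The enum names are hypothetical (the values 0, 1 and 4 are the ones used in this log), and U8 with Chn=3 is deliberately left unsupported here, as it is in the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; values are the enType numbers used in this log. */
enum nnie_blob_type {
    NNIE_BLOB_S32     = 0,
    NNIE_BLOB_U8      = 1,
    NNIE_BLOB_VEC_S32 = 4,
};

/* Computes the per-batch byte size; returns 0 or -EOPNOTSUPP. */
static int nnie_batch_size(uint32_t type, uint32_t stride, uint32_t height,
                           uint32_t chn, uint64_t *out)
{
    assert(out != NULL);

    switch (type) {
    case NNIE_BLOB_S32:
        *out = (uint64_t)stride * height * chn;
        return 0;
    case NNIE_BLOB_VEC_S32:
        *out = (uint64_t)stride * height;
        return 0;
    case NNIE_BLOB_U8:
        if (chn != 1)
            return -EOPNOTSUPP; /* Chn=3 planar layout not handled yet */
        *out = (uint64_t)stride * height;
        return 0;
    default:
        return -EOPNOTSUPP;     /* YUV / sequence inputs */
    }
}

/* DMA address of batch j: PhyAddr + j * batch_size. */
static uint64_t nnie_batch_addr(uint64_t phys, uint32_t j, uint64_t batch_size)
{
    return phys + (uint64_t)j * batch_size;
}
```

Widening to 64 bits before multiplying avoids overflow for large blobs, since Stride, Height and Chn are all 32-bit fields.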


Three on-target findings from sequential diagnosis on av300, each verified against vendor disasm + live devmem readings:

1. NNIE clock/reset is on the CRG (clock-reset generator) window at 0x12010000, NOT the sys window at 0x12020000 we mapped in Phase 3. Per cv500 DT:
     clock@12010000: clock-reset, 'hisilicon,hi3516cv500-clock'
     sys@12020000: sys-state (mutex, RAM-using flags)
   Vendor hi_sys.o sys_hal_wk_cnn_clk_en @0x86dc writes bit 1 of register +0xbc of LANCHOR0[+8] (= CRG base, not sys base):
     crg @0x120100bc: bit 0 = NNIE reset (1=held, 0=released)
                      bit 1 = NNIE clk_en (1=ungated)
   Pre-Phase-7 dispatch: CRG @0xbc = 0x0 → NNIE clock GATED. Writes to NNIE_REG_CLK_GATE silently dropped (read back as 0). This commit ioremap()'s the CRG window, defines NNIE_CRG_* constants, adds nnie_crg_set_bit/clear_bit helpers, and calls them in dispatch to release reset + ungate clock before the first register write. Verified: CRG @0xbc now reads 0x00000002 after dispatch, NNIE_REG_CLK_GATE readback now 0x80 (was 0x0; register writes were no-ops without clock).

2. Vendor's one-shot svp_nnie_init @0x10f4 also sets TIMEOUT to ~2 seconds at the NNIE clock rate (TIMEOUT_HI:LO = 0xff:0xffffffff). Without TIMEOUT set, HW seems to hang indefinitely after START. Added init-time programming of these registers + checksum disable (clear bit 0 of +0x68) per the vendor init sequence at 0x1c80-0x1cd0.

3. Tail builder: U8 (enType=1) with Chn=1 now follows the standard CNN per-batch DMA formula (PhyAddr + j*Stride*Height). Vendor svp_nnie_fill_image_src_addr @0x7978 confirms this.

Current state: end-to-end test on av300 with inst_mnist_cycle.wk runs all the way through to the HW START kick. /dev/nnie ABI works, all RE-discovered registers programmed correctly. HW still hangs after START: no IRQ in 5 s, STATUS=0 throughout.

Likely cause: descriptor format mismatch (probably one of):
- 64-byte header has a field at +0/+2/+4 we haven't fully decoded
- Variable-length tail format wrong for our SrcNum=1/DstNum=1 arrangement
- Need to call drv_svp_nnie_config_ram (OTP-based ram bank cfg) once at init

Deferred to Phase 8; needs careful side-by-side comparison against vendor strace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
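The CRG +0xbc bit semantics from finding 1 can be illustrated with a plain read-modify-write on a memory word. This is a userspace-style sketch: `nnie_crg_power_on` is a hypothetical name, and a real driver would go through readl()/writel() on the ioremapped CRG window rather than dereference a pointer directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit meanings at crg+0xbc, as decoded in the commit message above. */
#define NNIE_CRG_RESET   (1u << 0) /* 1 = held in reset, 0 = released */
#define NNIE_CRG_CLK_EN  (1u << 1) /* 1 = clock ungated */

/*
 * Release reset, then ungate the clock. Both steps are read-modify-write
 * so unrelated bits in the same register are preserved.
 */
static void nnie_crg_power_on(volatile uint32_t *crg_bc)
{
    assert(crg_bc != NULL);

    *crg_bc &= ~NNIE_CRG_RESET;
    *crg_bc |= NNIE_CRG_CLK_EN;
}
```

Starting from the gated state 0x0, the word ends at 0x2, which is the post-dispatch readback the commit reports for CRG @0xbc.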

Three findings from sequential av300 debug:

1. sys_hal_gdc_nnie_set_ram_using @0x897c uses LANCHOR0+16, which hi_sys.o sys_hal_init @0x8d70 ioremaps to *0x12030000*, not 0x12020000 as I assumed in Phase 3. The 'sys' window at 0x12020000 holds VGS/MUTEX status; the RAM-using flag is in a separate 'sys2' window at 0x12030000+0x34. Wired NNIE_SYS2_BASE_PHYS / nnie_sys2_set_bit/clear_bit helpers; map in probe.

2. Added pre-START diagnostic that dumps NNIE register state (CLK, OUT, IRQ_CFG, TIMEOUT, TASK_ADDR, CHECK_SUM) + the full 64-byte HW task descriptor as hex u32s, then polls IRQ_STATUS/START/TASK_ID for 100 ms after START. All registers programmed correctly per vendor disasm; descriptor format matches Phase 4 decode byte-for-byte.

3. Pulse-reset before clock-enable, in case HW is left in a stuck state by a previous failed dispatch.

Current state, *all RE-discovered HW is correctly programmed*:
  pre-START regs: CLK=0x80 OUT=0xf0f IRQ_CFG=0x7 TO_LO=0xffffffff TO_HI=0xff ADDR_LO=0xa00fe000 ADDR_HI=0x0 CHKSUM=0x0
  64-B task desc: trigger=1 model=0xaa880000 inst_off=0x6ed00 inst_len=0x2968 tsk=0xa9c70000 tmp=0xa00f5000 batch=1
  Post-START: STATUS=0, TASK_ID=0 (HW completely silent for 5s).

Architectural find (svp_nnie_post_process @0x1d8c): vendor maintains a pre-allocated DMA-coherent ring of 512 × 64-byte slots per core, indexed by the per-core busy counter. The fill_forward_task output descriptor gets memcpy'd into the next ring slot, and the SLOT INDEX gets written to descriptor[+4] (which I had as 'reserved'). For first task (r6==0), this matches our descriptor[+4]=0, so the field isn't the cause.

Remaining hypotheses for the HW hang:
- Variable-length descriptor tail layout off: HW interprets §1-§3 differently for our SrcNum=1 / DstNum=1 / U8 Chn=1 / batch=1 case than vendor expects
- drv_svp_nnie_config_ram (OTP-based chip-variant RAM bank cfg) is needed at boot and we haven't been calling it
- Some other peripheral state vendor open_sys.ko configures silently (cmpi mod 51 / mod 2 paths we haven't fully decoded)

Phase 9 needs: kprobe vendor's hal_svp_nnie_write_task_addr + hal_svp_nnie_start on a working vendor Forward path (load vendor open_nnie.ko, write a vendor-libnnie test) to capture the exact task descriptor + ring-slot contents from a known-good run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 8 on-target finding (av300 + vendor libnnie.so as oracle): Vendor LoadModel on /tmp/inst_mnist_cycle.wk reports: Tmp = 1989888 (≈ 1.9 MB) Our parser was reading u32TmpBufSize from file[60..63], which on mnist is zero. The vendor value is too big to live in the .wk file itself (466 KB) — it's an inference-time scratch size that vendor computes by walking the per-segment instruction stream. Heuristic for now: declare u32TmpBufSize = 8 MB unconditionally, which covers small classification models. Larger detection models (yolov*, ssd, frcnn) will need more — Phase 9 will RE vendor's computation in libnnie.so for the precise value. This caps the per-Forward MMZ working set at ~8 MB even when models don't need that much (mnist actually needs ~2 MB). Not a problem in practice — userspace allocates the tmpbuf from MMZ and our kernel doesn't touch it directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wrote three Phase-9 debug modules in kernel/nnie_spy/ to capture vendor open_nnie.ko's live HW dispatch state:
- nnie_spy.c: kprobe (CONFIG_KPROBES=n on cv500, so unusable; kept for future kernel rebuild)
- nnie_dump.c: insmod with phys=/size= module params, dumps 16B/line via phys_to_virt (works for any CMA-managed lowmem)
- nnie_watch.c: kthread polling NNIE+0x20/+0x24 every 1us, captures any TASK_ADDR change + dumps the descriptor + tskbuf

Captured during a known-good vendor Forward on inst_mnist_cycle.wk:
  TASK_ADDR = 0xa9c70000 (vendor's pre-allocated DMA ring slot)
  descriptor:
    d[0..7] : 01 00 00 00 00 00 00 00 AA88_0150 00 00 00 00 06ED00 002968
    d[8..15]: A9CB0000 0 0 0 0 0 AD200000 0 00000001 0
  tskbuf @ 0xa9cb0000:
    +0x00: 00000020 00000030 00000000 00000000
    +0x10: a00f6000 00000000 00000000 00000000
    +0x20: a00f5000 00000000 00000000 00000000
    +0x30..: zero

Two critical finds:

1. **task[+16] is NOT stBase.PhyAddr; it's stBase.PhyAddr + inst_offset_extra**. For the test .wk: 0xaa880000 + 0x150 (= file[52..55]) = 0xaa880150. Vendor's userspace LoadModel adjusts stBase.u64PhyAddr to skip the .wk header. Our LoadModel was passing the raw file phys. Fixed in libraries/nnie_neo/src/nnie_ops.c: pstModel->stBase.u64PhyAddr += inst_off_extra after the *pstModelBuf copy.

2. **The variable-length tskbuf tail is FAR SIMPLER than my disasm-based RE suggested**. Vendor only uses:
     +0:    src strides packed (SrcNum × u32), then dst strides packed (DstNum × u32), pad to 16
     +0x10: dst phys addrs (DstNum × u64, packed), pad to 16, then src per-batch phys (SrcNum × Num × u64)
   For mnist SrcNum=DstNum=Num=1: ~32 bytes meaningful, rest is the tskbuf size we provided (65 KB, all zero). My earlier §1-§3 builder wrote 160 bytes including a 16-slot dst phys array, completely different from what HW expects. Rewrote nnie_build_task_tail to match vendor byte-for-byte.

Verified our tskbuf content == vendor's tskbuf content identically:
  +00: 20 30 0 0 (src_stride=0x20=32 dst_stride=0x30=48)
  +10: dst_phys 0 0 0
  +20: src_phys 0 0 0

Also fixed clean-room kernel to do read-modify-write on CLK_GATE and OUTSTANDING (vendor pattern). Our previous plain writes were clobbering required chip-default bits. CLK_GATE now reads back 0x3c9 (was 0x80), matching vendor's live state 0x349.

Status: descriptor + tskbuf are now byte-equivalent to vendor's, all RE-discovered HW registers programmed identically. HW still returns cfg_err info=0x1; the bug is OUTSIDE the descriptor + tskbuf.

Hypotheses for Phase 10:
- dma_alloc_coherent vs cmpi_mmz_malloc_cached affect HW DMA differently
- Vendor's pre-allocated ring slot phys is registered with HW via some other mechanism we haven't decoded yet (e.g., open_sys registers it with CMP_3DNR or similar)
- chip-variant cfg via drv_svp_nnie_config_ram / prepare_nnie (OTP-dependent) is required for our chip

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
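The simplified tail layout from finding 2 can be sketched as a small builder that works on a byte buffer. This is an illustration of the captured layout, not the actual nnie_build_task_tail: `struct nnie_blob`, `align16`, and `nnie_build_tail` are hypothetical names, and fields are stored in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down view of a source or destination blob. */
struct nnie_blob {
    uint32_t stride;     /* bytes per row */
    uint32_t num;        /* batch count (sources only) */
    uint64_t phys;       /* physical base address */
    uint64_t batch_size; /* bytes per batch (sources only) */
};

static size_t align16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/*
 * Layout: src strides, dst strides, pad to 16; dst phys addrs, pad to 16;
 * then one u64 per source batch. Returns bytes used, or 0 if cap is too small.
 */
static size_t nnie_build_tail(uint8_t *buf, size_t cap,
                              const struct nnie_blob *src, size_t src_num,
                              const struct nnie_blob *dst, size_t dst_num)
{
    size_t need, off = 0, i;
    uint32_t j;

    assert(buf != NULL && src != NULL && dst != NULL);

    need = align16(4 * (src_num + dst_num));
    need = align16(need + 8 * dst_num);
    for (i = 0; i < src_num; i++)
        need += 8 * (size_t)src[i].num;
    if (need > cap)
        return 0;

    memset(buf, 0, need); /* zero-fills the alignment padding */

    for (i = 0; i < src_num; i++, off += 4)
        memcpy(buf + off, &src[i].stride, 4);
    for (i = 0; i < dst_num; i++, off += 4)
        memcpy(buf + off, &dst[i].stride, 4);
    off = align16(off);

    for (i = 0; i < dst_num; i++, off += 8)
        memcpy(buf + off, &dst[i].phys, 8);
    off = align16(off);

    for (i = 0; i < src_num; i++) {
        for (j = 0; j < src[i].num; j++, off += 8) {
            uint64_t addr = src[i].phys + (uint64_t)j * src[i].batch_size;
            memcpy(buf + off, &addr, 8);
        }
    }
    return off;
}
```

Fed the mnist values from the capture (src stride 0x20, dst stride 0x30, dst phys 0xa00f6000, src phys 0xa00f5000), this produces words at +0x00, +0x10 and +0x20 in the same positions as the vendor tskbuf dump.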

Three Phase-10 changes to bring cleanroom dispatch closer to vendor: 1. Descriptor allocation moved from dma_alloc_coherent to hil_mmb_alloc + hil_mmb_map2kern_cached. Matches vendor's cmpi_mmz_malloc_cached pattern, places the descriptor in the same MMZ pool vendor uses for its 512-slot ring (verified: same single 256MB zone 0xA0000000-0xAFFFFFFF per /proc/media-mem). 2. __cpuc_flush_dcache_area on tskbuf after nnie_build_task_tail writes — memremap WB gives cached kernel mapping; HW DMA needs the writes visible in DDR. Vendor uses SAMPLE_COMM_SVP_FlushCache. 3. __cpuc_flush_dcache_area on the descriptor after memcpy from the stack-built struct — same reason as (2). Independently verified via nnie_dump.ko reading our descriptor phys: the 64 bytes match vendor's known-good capture byte-for-byte except for the tskbuf-phys field d[8] (vendor used 0xa9cb0000, ours uses 0xa9c70000 — both valid MMZ phys, content identical at both). HW STILL returns cfg_err info=0x1. With: - descriptor bytes match vendor - tskbuf tail content matches vendor (verified via dump) - registers programmed identically to vendor (CLK_GATE=0x3c9 etc) - MMZ allocation now in same pool as vendor - all caches flushed before START …something else differs. Hypotheses for Phase 11: - Per-core state in vendor's LANCHOR0 holds HW-required init that only vendor's svp_nnie_init populates (e.g., a chip-variant fixup we haven't decoded) - HW has a hidden "first task" register/sequence we're missing - vendor's open_gdc.ko (loaded but unused for inference) sets something we're missing - svp_nnie_check_err_status decodes info=1 as something specific we haven't traced yet Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Snapshot-diff finding: after a clean reboot, vendor's open_nnie.ko insmod writes ~15 previously-unexplored NNIE registers. One of them is CHECK_SUM (+0x68 = 0x00000001 post-vendor-init). Vendor's symbol 'disable_check_sum' @0xbc74 clears bit 0, but the live state has bit 0 SET after init runs — the function name is misleading; bit 0 is presumably "disable mode" semantics (clearing it enables real operation). Our cleanroom was calling the analog (clear bit 0) which actively DISABLED checksum. Empirical: with CHECK_SUM=0 our HW returns cfg_err info=1; with CHECK_SUM=1 (vendor's value) HW returns cfg_err info=0. Different error code — we're closer. Remaining gap registers vendor writes that we don't: nnie+0x00 = 0x00002018 nnie+0x04 = 0x00000130 nnie+0x08 = 0x0000B017 nnie+0x10 = 0x5A5A5A5A ← magic value nnie+0x14 = 0x0000FFEF nnie+0x6c = 0xFFFFFFFF nnie+0x70..0xa8 = various (chip cfg / clock params?) These may be HW-self-populated when clock is on (need to verify by just enabling CRG NNIE clock without vendor module) or may be set by drv_svp_nnie_prepare_nnie (OTP-variant-dependent cmpi mod 2 fn 0xb6 call). Phase 12 to investigate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Snapshot-diff vs vendor's open_nnie load showed vendor's init only changes ONE bit beyond what we already program: CRG+0xa4 (VEDU clock, per hi_sys.o sys_hal_vedu_clk_en) flips 0 → 6. NNIE may share clock infrastructure with VEDU on cv500. This commit adds a CRG+0xa4 RMW to our dispatcher to keep bit 0..2 = 6. On test, devmem readback shows the write did NOT take effect (post- dispatch CRG+0xa4 still 0). Possible causes: - VEDU clock register may need a separate enable bit - Some sys/CRG window has write-protect we haven't decoded - Vendor's path may set it via cmpi → open_sys.ko which has CRG permission we lack cfg_err info changed from 1 to 0 in the previous commit (CHECK_SUM preserved). info=0 remains here — adding CRG+0xa4 didn't change the result. The 15 'unexplored' registers I'd worried about (nnie+0x00..0x14, +0x70..+0xa8) turn out to be HW-self-populated when the clock is on. Verified by snapshotting AFTER reset with ONLY a devmem write to CRG+0xbc=0x2 (no module loaded): all the magic values appear. So vendor doesn't write them — they're chip defaults exposed once the block is clocked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two tweaks to the dispatcher: - write NNIE_IRQ_ALL to NNIE_REG_IRQ_CLEAR right before TASK_ADDR writes, to drop any stale cfg_err/timeout bits from a previous failed dispatch - log NNIE_REG_IRQ_STATUS + NNIE_REG_CFG_ERR_INFO in the pre-START register dump Test: pre-START state confirmed STATUS=0x0 ERR_INFO=0x0 (clean). Within 100us of START write, HW raises STATUS=0x4 (cfg_err) with ERR_INFO=0x0. So HW is processing the task and definitively rejecting it — not a stale-IRQ issue. CRG+0xa4 write from kernel module also confirmed to stick now (post-test devmem reads 0x6 as intended). Earlier non-stick may have been a transient state-machine issue from the unsafe write ordering. Even with CRG+0xa4=6 matching vendor, ERR_INFO stays 0. Remaining hypothesis space (Phase 12): - vendor's cmpi_register_module call registers NNIE as module 51, enabling other modules (sys/sys_config) to call NNIE-specific init that we miss - drv_svp_nnie_prepare_nnie's OTP-variant path may run additional chip cfg via cmpi mod 2 fn 0xb6 for specific OTP values; need to find g_reg_otp_base_va and check our chip's OTP[+0x28] - HW may need a "warm" task descriptor (something set by vendor in its descriptor ring slot at first task that survives between tasks) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extended kernel/nnie_spy/nnie_watch.c to dump ALL NNIE registers +0x00..+0xbc at the moment of TASK_ADDR write. Captured vendor's working mnist Forward state:
  reg+0x20: a9c70000 00000000 ffffffff 000000ff ← TIMEOUT_HI=0xff!
  reg+0x40: 00000000 00000003 00000000 00000000 ← +0x44 = 0x3 (UNK)
  reg+0x60: 00000000 00000000 00000000 ffffffff ← CHECK_SUM=0!

Three register diffs vs our cleanroom that we now fix:

1. TIMEOUT_HI (+0x2C) = 0xff. We had set it to 0 after misreading an earlier post-Forward snapshot: vendor set_timeout writes 0xff, and HW clears it after completion. AT task start it's 0xff.

2. CHECK_SUM (+0x68) = 0. Vendor explicitly disables it before each task. Live readback after vendor's Forward completion shows 1 because HW restores chip default 0x1 post-task. Vendor's 'disable_check_sum' function name is correct after all: bit 0 = 1 IS "enabled", and vendor disables before submit.

3. +0x44 (unknown) = 0x3. Vendor writes this; no decoded function in hi_nnie.o symbol table matches +0x44.

With these three fixes (without further changes to +0xb0..+0xb8, which made things worse) the result varies from run to run: cfg_err info=0, cfg_err info=0x1000001, or sometimes a 5-sec TIMEOUT (no IRQ). At least the cfg_err code changed, suggesting HW is partially accepting our task now.

Phase 13: trace what +0x44 is, why info=0x1000001 vs info=0. Also investigate whether HW resets between tasks differently than vendor expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Empirical finding via test_neo on av300: assigning a monotonic per-task slot index to descriptor[+4] (the 'reserved' field — actually vendor's task ring slot index, 0..511) makes the NNIE HW ACCEPT the task. Live readback shows TASK_ID register (+0x48) updating to 0x1 matching our slot_idx after START. Previously descriptor[+4] = 0 caused HW to reject with cfg_err info= 0 / 0x1000001. With monotonic non-zero idx, HW updates TASK_ID and runs partial processing. Mechanism (inferred): HW tracks a "next expected slot_idx" internally. Submitting slot_idx that matches the current state (=0 on cold boot) is treated as a no-op or as referring to an already-completed task, so cfg_err. Submitting a fresh slot makes HW accept the new task. Vendor's ring iterates 0,1,2,...,511 mod 512 — first task after fresh boot is 0, subsequent ones increment. So our 'always 0' was wrong after the first failed task. Forward still returns cfg_err (cause=0x4 info=0x1000001) AT THE END because the inference engine fails part-way through execution — but this is a LATER failure mode than the previous "submission rejected". Remaining puzzle (Phase 14): why does HW reject mid-execution? Most likely candidates: (a) instruction stream interpretation fails because some descriptor offsets we set are still wrong relative to vendor; (b) tmp_buf size/content not what HW expects; (c) output blob format mismatch (vendor uses VEC_S32 stride=48 — we match this in test_neo). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR OpenIPC/firmware#2095 enabled CONFIG_KPROBES=y in cv500/av300 ship kernels (board built 2026-05-14 05:25 UTC). My local linux-custom .config was stale (pre-#2095), so register_kprobe in modules built against it expanded to the kprobes.h inline stub returning -ENOSYS. Fix: set CONFIG_KPROBES=y + CONFIG_KALLSYMS_ALL=y + CONFIG_JUMP_LABEL=y + CONFIG_MODULE_UNLOAD=y in linux-custom/.config, run oldconfig + prepare to regenerate include/generated/autoconf.h, rebuild nnie_spy.ko. Also: register_kprobe(symbol_name=...) only resolves GLOBAL kallsyms. Vendor's hal_svp_nnie_write_task_addr is LOCAL (lowercase 't' in kallsyms) so I had to switch to .addr= with the value module-loaded from kallsyms grep. Verification: • probe_test.ko on printk → "PROBE FIRED" each printk call ✓ • nnie_spy.ko on hal_svp_nnie_write_task_addr → fires during vendor's Forward, dumps r2/r3 (task_phys), then 64-byte descriptor and tskbuf tail via phys_to_virt First vendor capture (mnist Forward, scores match prior runs): task_phys = 0xa9c70000 (vendor's pre-allocated ring slot 1) d[0..7] = 01 00 00 00 00 00 00 00 ... aa880150 00 00 00 00 6ed00 002968 d[8..15] = a9cb0000 0 0 0 0 0 ad200000 0 01 00 00 00 0 0 tail @0xa9cb0000: +0x00: 20 30 0 0 a00f6000 0 0 0 +0x20: a00f5000 0 0 0 ...zero Identical to earlier polling captures — but now with single-cycle precision (vs ~1MHz poll). Phase 14 can now iterate quickly: kprobe vendor's full hal_svp_nnie_* call sequence, compare to ours, find the missing register / write-ordering issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Vendor's first task uses descriptor[+4] = 0 (verified via kprobe capture this session). atomic_inc_return returns the post-increment value, giving 1 on the first call. descriptor[+4] = 1 meant HW waited for slot 0 (never submitted) to complete first → 5-sec TIMEOUT. Switch to atomic_inc_return - 1 (the pre-increment value, 0 on first call).

With slot_idx = 0 our cleanroom now produces a 64-byte descriptor byte-equivalent to vendor's mnist Forward capture:
  00000001 00000000 00000000 00000000
  aa880150 00000000 0006ed00 00002968
(Identical content, identical layout.)

HW still rejects with cfg_err info=1 with slot=0. The dependence of cfg_err code on slot_idx + prior HW state is intricate:
  slot_idx=0, fresh boot           → cfg_err info=1
  slot_idx=1, after previous fails → TIMEOUT (TASK_ID updates to 1)
So HW accepts SOMETHING and rejects something else.

With kprobes now available the next iteration can attach probes to vendor's full hal_svp_nnie_* call chain and the chip's actual register write order, then diff against ours. That belongs in Phase 15+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
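The off-by-one described above comes down to whether the counter is read before or after the increment. A userspace sketch with C11 atomics, where atomic_fetch_add returns the old value (the equivalent of the kernel's atomic_inc_return() - 1); `nnie_next_slot_idx` and the counter name are hypothetical, and the 512-slot ring size is the one decoded from the vendor driver:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NNIE_RING_SLOTS 512u

/*
 * A power-of-two ring size keeps the sequence continuous when the 32-bit
 * counter itself wraps, because 2^32 is a multiple of 512.
 */
static_assert((NNIE_RING_SLOTS & (NNIE_RING_SLOTS - 1u)) == 0,
              "ring size must be a power of two");

/* Zero-initialised by static storage duration: first task gets slot 0. */
static atomic_uint g_nnie_task_counter;

/* Returns the slot index to store in descriptor[+4] for the next task. */
static uint32_t nnie_next_slot_idx(void)
{
    /* fetch_add hands back the value *before* the increment. */
    return (uint32_t)(atomic_fetch_add(&g_nnie_task_counter, 1u)
                      % NNIE_RING_SLOTS);
}
```

The first call yields 0, matching the vendor capture, and the index returns to 0 after 512 tasks.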

Added kernel/nnie_spy/nnie_trace.c that simultaneously kprobes 7 vendor functions in open_nnie.ko (write_task_addr, start, cfg_irq, set_timeout, enable_clk_gt, set_outstanding, disable_check_sum) and logs ARM_r0..r3 at each entry. Loaded with addr= params for each symbol grep'd from /proc/kallsyms.

Captured vendor's per-task call sequence on av300 (mnist Forward):
1. cfg_irq(core_id=0, ...)
2. set_timeout(core_id=0, ..., r2=0xffffffff TIMEOUT_LO, r3=0xff TIMEOUT_HI)
3. enable_clk_gt(core_id=0, ...)
4. set_outstanding(core_id=0, ...)
5. disable_check_sum(core_id=0, ...)
6. write_task_addr(core_id=0, ..., r2=0xa9c70000 task_phys, r3=0)
7. start(core_id=0, ...)

This sequence matches our cleanroom's register-write order. And captured task_phys, TIMEOUT_LO/HI, CHECK_SUM=0 etc. match our descriptor + pre-START state byte-for-byte.

Tested: vendor's open_nnie module init first (to leave HW in the post-init state vendor expects), then rmmod + insmod our cleanroom + Forward. Still cfg_err info=1. So the gap is NOT in module-init residual state.

Remaining gap is between "all observable register values match vendor's working state" and "HW actually produces inference work". Likely something in the cmpi-mediated open_sys handshake or HW state-machine sequence ordering we still get wrong. Phase 15+ work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Kprobed sys_hal_wk_cnn_clk_en, sys_hal_wk_cnn_reset_sel, and sys_hal_gdc_nnie_set_ram_using on av300 during a known-good vendor mnist Forward, captured this call sequence (likely interleaved with calls from other vendor modules): clk_en(1) → reset_sel(0) → clk_en(0) → clk_en(1) → ram(0) → clk_en(1) → ram(1) → ram(0) → clk_en(0) The clock-toggle (OFF then ON) before the task is novel — we just keep clock on continuously. Hypothesised this was an HW reset pulse NNIE needs. Mirrored vendor's sequence in nnie_dispatch_forward. HW still returns cfg_err info=1. So the clock dance ALONE isn't what HW needs — the failure is somewhere else still. Remaining hypotheses for Phase 15: - The clock-toggles come from MULTIPLE concurrent threads (other vendor modules using cmpi mod 51 ops). Trying to replicate them in single-threaded sequence misses the actual chip state HW needs. - HW may need a specific CMPI handshake we still don't emulate. - Some HW state I haven't observed yet. Branch: 39 commits on nnie-neo. Phase 14 progressed significantly this session (kprobes unblocked, full vendor call trace captured) but the HW failure mode is now stably reproducible at cfg_err info=1 with byte-identical state to vendor's working setup. The cleanroom RE has reached a plateau that needs careful single-step HW debugging (or vendor open_sys.ko source) to break through. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Convert four per-task register writes from plain writel() to read-modify-write to match vendor hal_svp_nnie_* helpers:
  reg+0x30 (START)      plain 1   → OR bit 0
  reg+0x34 (IRQ_CFG)    plain 0x7 → OR bits 0|1|2
  reg+0x38 (IRQ_CLEAR)  plain 0x7 → OR bits 0|1|2
  reg+0x68 (CHECK_SUM)  plain 0   → BFC bit 0

Vendor preserves upper-half status bits in these registers via RMW. Plain writes were clobbering them, but on a post-reboot clean run the read-back is 0 in all four cases, so the RMW is functionally equivalent. The change is for correctness against any future state path that sets those bits.

Also switch task descriptor MMZ from cached + __cpuc_flush_dcache_area to nocache (hil_mmb_map2kern), matching vendor svp_nnie_init's cmpi_mmz_malloc_nocache for its pre-allocated task ring. Eliminates cache coherency as a variable.

cfg_err info=0x1 still fires mid-execution. Phase 14 is the gap between byte-equivalent dispatch state and HW actually executing the task. None of: GDC ops handshake, prepare_nnie ioctl, check_clk_freq, set_mem_speed (all decoded from hi_nnie.o disasm) are present in our cleanroom dispatch, but on cv500 all four are no-ops for the classic NNIE path with no GDC concurrent activity.

Next angle: kprobe vendor cfg_err recovery to capture HW state at the moment of a known-bad dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
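The four read-modify-write conversions can be sketched against a simulated register file. The register offsets and bit masks are the ones listed in this commit; `nnie_rmw` and the wrapper names are hypothetical, and a kernel driver would use readl()/writel() on the ioremapped window instead of indexing an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NNIE_REG_START      0x30u
#define NNIE_REG_IRQ_CFG    0x34u
#define NNIE_REG_IRQ_CLEAR  0x38u
#define NNIE_REG_CHECK_SUM  0x68u
#define NNIE_IRQ_ALL        0x7u   /* bits 0|1|2 */

/* Read the register, clear some bits, set others, write it back. */
static void nnie_rmw(volatile uint32_t *regs, uint32_t off,
                     uint32_t clear, uint32_t set)
{
    uint32_t v;

    assert(regs != NULL && (off & 3u) == 0);
    v = regs[off / 4];
    v = (v & ~clear) | set;
    regs[off / 4] = v;
}

static void nnie_cfg_irq(volatile uint32_t *regs)
{
    nnie_rmw(regs, NNIE_REG_IRQ_CFG, 0, NNIE_IRQ_ALL);
}

static void nnie_clear_irq(volatile uint32_t *regs)
{
    nnie_rmw(regs, NNIE_REG_IRQ_CLEAR, 0, NNIE_IRQ_ALL);
}

static void nnie_disable_check_sum(volatile uint32_t *regs)
{
    nnie_rmw(regs, NNIE_REG_CHECK_SUM, 1u, 0); /* BFC bit 0 */
}

static void nnie_start(volatile uint32_t *regs)
{
    nnie_rmw(regs, NNIE_REG_START, 0, 1u);
}
```

On a register that reads back 0 the result equals the old plain write; the difference only shows when upper bits are already set, which is the case the commit wants to be safe against.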

…dump LoadModel breakthrough: decode the per-segment node tables in the .wk file at file[208+] (compact 16 B node header + 32 B name = 48 B per node, layout decoded from vendor inst_mnist_cycle.wk hex dump). Now populates: astSeg[i].astSrcNode[j] = { enType, u32Width, u32Height, u32Chn, u32NodeId, szName } ← read from file astSeg[i].astDstNode[j] = { enType=SVP_BLOB_TYPE_S32, ..., szName } ← name read; type/id derived (vendor's libnnie hardcodes type=4 for outputs) Direct vendor-vs-cleanroom comparison on av300: vendor lib (libnnie.so) reports astSrcNode[0].enType=1 for mnist.wk cleanroom (libnnie_neo) was reporting enType=0 — userspace test uses this to pick SVP_BLOB_TYPE_E for its src MmzAlloc, so we were sending the wrong blob type to the Forward ioctl. Now with enType=1 propagated through to the kernel's tskbuf §3 fill, the per-source DMA address vector matches vendor's exactly (still 1 u64 per batch for enType=1 C=1 — the YUV chroma plane is only needed for C>=2). Tested on av300: cfg_err info=0x1 still fires mid-execution though, so enType wasn't the trigger by itself. Also adds kernel-side diagnostic: full reg+0x00..+0xc8 dump both pre-START and post-fail. Comparison vs vendor's post-insmod-no-Forward devmem dump shows: - chip-config + HW-self-populated bits (+0x70..+0xa8) identical - our per-task RMW of OUTSTANDING clears bit 4 (0xf1f → 0xf0f) matching vendor's per-task code; vendor's POST-success state is all-zeros (IRQ handler gates clock) so direct mid-task comparison needs nnie_watch.ko polling Post-fail HW touches: cfg_err_info=0x1, +0x64=0x8, +0xb0..+0xb8=0x1 each, +0xc4=0x34e — all written by HW, not us. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extend nnie_trace.ko so the write_task_addr kprobe also dumps the 64-byte HW descriptor + the first 320 bytes of the tskbuf it references (via phys_to_virt on r2 = descriptor phys passed by vendor). And the start kprobe ioremaps the NNIE register window and dumps reg+0x00..+0xc8 at the exact moment vendor's start function fires.

Used this to capture vendor's pre-START state during a known-good mnist Forward on av300. Result vs cleanroom pre-START dump:
  vendor:    TASK_ADDR_LO = 0xa9c70000
  cleanroom: TASK_ADDR_LO = 0xa9cb0000

ALL other registers match byte-for-byte (chip ID, IRQ_CFG, TIMEOUT, CLK_GATE, OUTSTANDING, CHECK_SUM, +0x6c..+0xa8 HW-self-populated). Descriptor + tskbuf content matches byte-for-byte at the phys-mem level.

The 0xa9c70000 vs 0xa9cb0000 swap happens because vendor allocates its task descriptor ring at module init (so it gets the first 64KB-aligned slot after the vb_pool), while cleanroom allocates a fresh descriptor MMZ per Forward (after the test's nnie_tsk allocation has already taken 0xa9c70000). HW may have a static range filter for TASK_ADDR or some other reason it rejects ours.

Next: pre-allocate descriptor MMZ at module init (matching vendor's allocation order) so we get 0xa9c70000 (or any deterministic phys) and see if cfg_err goes away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cleanroom NNIE driver now produces bit-identical mnist output to the vendor open_nnie.ko on av300:

  dst[10 scores]: 408 412 401 401 398 412 398 405 449 401   ← vendor
  dst[10 scores]: 408 412 401 401 398 412 398 405 449 401   ← cleanroom

Three changes combined to fix the cfg_err info=0x1:

1. Pre-allocate the 64KB task descriptor MMZ at module init (same point where vendor's svp_nnie_init @0x11a8 calls cmpi_mmz_malloc_nocache for its task ring). This makes the descriptor phys stable across Forwards AND lands at 0xa9c70000, the exact slot vendor's allocation gets, since at module-init time that's the first available 64KB-aligned MMZ slot after the vb_pool. Per-Forward allocation was landing AFTER the test's nnie_tsk had already taken 0xa9c70000.

2. Zero the descriptor MMZ at init time (vendor does this via osal_memset in svp_nnie_init). HW may scan the ring for valid pending slots and the uninitialized 65472 bytes past our 64-byte task could trigger false-positive cfg_err.

3. Stop setting NNIE_RAM bit back to 1 after the clock dance. Vendor's hal_svp_nnie_enable_ram @0xb8f4 issues SYS ioctl 0xd1 with arg=0, so the bit goes 0 and stays 0 through START. Setting it back to 1 was a Phase 8 guess that turned out to invalidate the SRAM ownership signal HW expects.

Also remove the verbose pre-START register dump: the 52 readls between the TASK_ADDR write and the START write added a 50us+ gap that was hypothesised to matter. It turned out not to be the trigger, but keeping the path tight matches vendor's drv_svp_nnie_start exactly (write_task_addr → dmb → start with no reads in between).

Verified on av300 board with vendor's test binary using cleanroom libnnie_neo + cleanroom open_nnie_neo.ko. Forward returns 0, Query finished=1, output scores match vendor byte-for-byte.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
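
Changes 1 and 2 in this commit (allocate the 64KB descriptor region once, zero it, then reuse it for every Forward) can be modelled in userspace as below. This is only a sketch under stated assumptions: `aligned_alloc` stands in for the kernel-side MMZ allocation, all names are hypothetical, and neither the NNIE_RAM change nor the write_task_addr → dmb → start ordering is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DESC_REGION_SIZE 65536u /* 64KB descriptor region */
#define DESC_SIZE           64u /* one HW task descriptor */

/* The "65472 bytes past our 64-byte task" figure from the commit message. */
static_assert(DESC_REGION_SIZE - DESC_SIZE == 65472u, "tail size mismatch");

struct desc_region {
    uint8_t *base; /* 64KB-aligned, allocated once at init */
};

/* Init-time step: allocate once (stand-in for the MMZ call) and zero it. */
static int desc_region_init(struct desc_region *r)
{
    r->base = aligned_alloc(DESC_REGION_SIZE, DESC_REGION_SIZE);
    if (!r->base)
        return -1;
    memset(r->base, 0, DESC_REGION_SIZE);
    return 0;
}

/* Per-Forward step: same region every time; only 64 bytes are written. */
static void desc_region_write(struct desc_region *r,
                              const uint8_t desc[DESC_SIZE])
{
    memcpy(r->base, desc, DESC_SIZE);
}

static void desc_region_free(struct desc_region *r)
{
    free(r->base);
    r->base = NULL;
}
```

Because the region is never reallocated, its address stays the same across Forwards and the bytes past the first descriptor stay zero.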

The pre-allocated descriptor MMZ, the IRQ completion (g_nnie_done), and the cause atomic are all single-instance globals. Concurrent Forward callers race on the descriptor write, the START kick, and the IRQ-driven wakeup; reproduced as 20/20 -ETIMEDOUT under the 4-way parallel test.

Wrap nnie_dispatch_forward in g_nnie_forward_lock (interruptible mutex). Vendor's svp_nnie_forward @0x2198 does the same with osal_down_interruptible on a per-handle semaphore.

Verified 4-way concurrent x 5 rounds = 20/20 pass with this lock, zero cfg_err.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Now that mnist Forward works end-to-end (and the slot-wrap / parallel tests pass), drop the verbose debugging scaffolding:
- pre-START full reg dump (52 readls)
- post-fail full reg dump
- 100ms polling loop after START
- per-Forward src[i] enType/dims pr_info
- separate task hdr pr_info_once
- redundant Forward arg-size + UVA pr_info_once
- AddTskBuf / RemoveTskBuf pr_info_once → pr_debug

Also collapse the per-task register-setup block into one tidy RMW group with concise comments and drop the unused `arg_size`, `mmb`, `task_kvirt`/`task_dma` (void)-cast lines.

Behaviour-preserving cleanup. Re-verified on av300: 1 + 20 sequential + 12 (4-way parallel x 3) Forwards all pass, zero cfg_err, dmesg clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
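
The RMW group mentioned in this commit is built from read-modify-write steps. An earlier commit in this PR notes that the per-task RMW of OUTSTANDING clears bit 4 (0xf1f → 0xf0f). A minimal sketch of that step on a plain value follows, with hypothetical names. In a driver the value would typically be read with readl() and written back with writel(), and the register offset is deliberately left out because the commit messages do not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 4 of OUTSTANDING, per the 0xf1f -> 0xf0f transition in the PR. */
#define OUTSTANDING_BIT4 (UINT32_C(1) << 4)

static_assert((0xf1fu & ~OUTSTANDING_BIT4) == 0xf0fu, "bit 4 clear check");

/* Read-modify-write on a register value: clear `clr` bits, then set `set`. */
static inline uint32_t rmw32(uint32_t val, uint32_t clr, uint32_t set)
{
    return (val & ~clr) | set;
}
```

Calling `rmw32(0xf1f, OUTSTANDING_BIT4, 0)` yields `0xf0f`, and applying the same clear a second time leaves the value unchanged.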

Validated end-to-end on av300, not Phase 0 scaffold anymore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These five modules (nnie_dump, nnie_spy, nnie_trace, nnie_watch, probe_test) were ad-hoc kprobe + phys_to_virt helpers used during reverse engineering. They aren't wired into any production build and don't belong in the shipping PR. Tracked separately if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Comprehensive cleanup of every comment and docstring in kernel/nnie_neo/ and libraries/nnie_neo/:
- Drop all "Phase N" references: the RE phase labels meant something while the work was in progress; now they're noise.
- Drop "kprobe capture showed", "av300 2026-05-17", "Phase 14 critical finding", and similar journey markers.
- Drop "may be", "appears to", "possibly", "guess" hedging; state what the code does and why, definitively or not at all.
- Drop vendor function references and addresses ("svp_nnie_init @0x11a8", "hal_svp_nnie_enable_clk_gt @0xbc18") from inline comments; they belong in commit messages and RE notes, not in shipping code.
- Drop the verbose ASCII-art-with-byte-offsets prologues; keep the field-offset macros + struct definitions + a one-paragraph layout description where the layout is non-obvious.
- Drop the dead `g_sys_regs` ioremap path (was Phase 3 read-only scaffolding, never used at runtime).
- Drop `g_sys_lock`, `nnie_sys_set_bit`, `nnie_sys_clear_bit`, `nnie_sys2_set_bit`, all unused after the cleanup.
- Drop `__maybe_unused` annotations on helpers that now have callers.
- Drop the leftover `mmb` / `task_kvirt`/`task_dma` (void)-cast lines.
- Simplify the Kbuild header.
- Strip the trailing pr_info_once Forward/AddTskBuf/RemoveTskBuf argument dumps that were debugging aids.

Behaviour-preserving. Re-verified on av300: 30 sequential + 12 (4-way parallel x 3) mnist Forwards all pass, output byte-identical to vendor (408 412 401 401 398 412 398 405 449 401), dmesg shows only the "/dev/nnie ready" line.

Net diff: -612 lines, +0 functional changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI failure on V4 chiparchs (hi3516ev200/ev300, gk7205v200): the libraries/nnie_neo/ Makefile hardcoded a staging-path include (STAGING ?= ../../../../firmware/output-cv500/build/...) so the build only worked on a dev machine with the firmware tree adjacent to the openhisilicon repo. CI builds all subdirs unfiltered for V4, which tripped this.

Three parts to the fix:

1. Self-contain the userspace headers. Bundle the three vendor SVP headers (hi_nnie.h, mpi_nnie.h, hi_comm_svp.h) into libraries/nnie_neo/include/, the same pattern as libraries/ive_neo/ (which ships its own hi_comm_ive.h / hi_ive.h / mpi_ive.h). Drop the STAGING hack from the Makefile so only repo-relative includes remain.

2. Patch the bundled hi_comm_svp.h:
   - drop `#include "hi_errno.h"` (header not in tree for cv500); replace with `#include "hi_common.h"` which provides HI_DEF_ERR + EN_ERR_* + the MOD_ID_E enum
   - add SVP_NNIE_HANDLE typedef (was in the cv500-kernel-only hi_common.h)
   - add HI_ID_SVP_NNIE = 51 to libraries/include/hi_common.h next to the existing MOD_ID_E values

3. Gate libraries/nnie_neo to cv500-only in libraries/Makefile. The NNIE block doesn't exist on V4, cv200, cv100, etc., so the cv500-specific MPI surface shouldn't be compiled there.

Verified on av300: bundled headers + repo-relative includes build clean; rebuilt libnnie_neo.so still produces the byte-identical mnist output (408 412 401 401 398 412 398 405 449 401).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

widgetii and others added 30 commits May 15, 2026 07:28

nnie_neo: bump init print to reflect Phase 3 state

0fa8724

widgetii and others added 18 commits May 16, 2026 21:52

kernel/hi3516cv500.kbuild: refresh nnie_neo include comment

805aced

Validated end-to-end on av300, not Phase 0 scaffold anymore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

widgetii mentioned this pull request May 17, 2026

ive_neo: scrub RE journey shrapnel from comments #146

Merged

3 tasks

widgetii merged commit 335ce97 into main May 17, 2026
26 checks passed

widgetii deleted the nnie-neo branch May 17, 2026 07:32

widgetii merged 49 commits into main from nnie-neo


