First slice of #111 (clean-room NNIE CNN driver for cv500/av300/dv300).
The full backend is multi-day RE; this commit lands only the
platform-driver scaffold:
- kernel/nnie_neo/nnie_init.c: platform_device probe, binds to the
  "hisilicon,hisi-nnie" DT node. Maps the nnie0 register window
  (0x11100000 on cv500), records the nnie0 + gdc IRQs. Skips the gdc
  register region (owned by open_gdc.ko; sharing the DT node would
  EBUSY).
- kernel/nnie_neo/nnie_neo.c: registers /dev/nnie via osal_createdev.
  Single ioctl dispatch path that returns -EOPNOTSUPP for everything
  (Phase 0). Phase 1 will decode the eight HI_MPI_SVP_NNIE_* ioctl
  numbers + arg-buf layouts from vendor libnnie.so and wire real
  handlers.
- kernel/nnie_neo/Kbuild: mirrors ive_neo/Kbuild structure.
- kernel/hi3516cv500.kbuild: pulls nnie_neo/Kbuild in alongside the
  existing vendor $(PREFIX)nnie wrapper. Both modules can build; init
  scripts pick one to insmod at runtime (same pattern as ive vs
  ive_neo).
On-target av300 verification (after sysrq reboot to load fresh):
  $ insmod /tmp/open_nnie_neo.ko && lsmod | grep nnie
  open_nnie_neo 2949 0
  $ ls -la /dev/nnie
  crw-rw---- 1 root root 218, 100 /dev/nnie
  $ dmesg | grep nnie
  nnie_neo: probed nnie0=f4470000 irq=54 gdc_irq=53
  nnie_neo: /dev/nnie ready (Phase 0 stub — all ioctls return -EOPNOTSUPP)
Known Phase 0 limitation: request_irq fails -16 because vendor's IRQ
handler is registered with IRQF_SHARED and our flags=0 conflicts.
Phase 3 will switch to IRQF_SHARED once we actually need IRQ-driven
completion. /dev/nnie remains usable for the -EOPNOTSUPP path.
Not pushing or opening a PR yet: per the bundle-on-one-branch
feedback, NNIE work stays on this local branch until Phase 4
(end-to-end test on av300 with a tiny real model) passes.
Static RE of cv500 vendor libnnie.so (42 KB, 8 public entries) +
kernel-side svp_nnie_ioctl @0x26b8 in vendor blob. Five distinct
ioctls reach /dev/nnie; the other three public entries (LoadModel,
UnloadModel, GetTskBufSize) are pure-userspace: they only touch MMZ
via HI_MPI_SYS_MmzAlloc/Free, no /dev/nnie call.
nr   | size | full ioctl | API entry
-----+------+------------+--------------------------------
0x00 | 1624 | 0xc6584d00 | HI_MPI_SVP_NNIE_Forward
0x01 | 1728 | 0xc6c04d01 | HI_MPI_SVP_NNIE_ForwardWithBbox
0x02 |   24 | 0xc0184d02 | HI_MPI_SVP_NNIE_Query
0x03 |   24 | 0xc0184d03 | HI_MPI_SVP_NNIE_AddTskBuf
0x04 |   24 | 0xc0184d04 | HI_MPI_SVP_NNIE_RemoveTskBuf
Vendor kernel dispatcher additionally recognises 0x4d05/06/07/08/09
when the per-call context state == 0xc. Those are bbox-mode dispatch
variants, deferred to Phase 3 once we have an actual model + Forward
call to exercise them.
Phase 1 handler bodies still return -EOPNOTSUPP for Forward,
AddTskBuf, RemoveTskBuf. Query stubs done=1 (matches ive_neo pattern
since dispatch is synchronous once wired).
Implication for Phase 2 scope: the kernel ABI surface is only 5
ioctls; the model loader is mostly userspace work in libnnie_neo.so.
Full ioctl-ABI reference saved to kaeru as nnie-neo-cv500-ioctl-abi.
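The size column can be cross-checked from the ioctl numbers themselves. A minimal sketch, assuming the asm-generic Linux ioctl bit layout that 32-bit ARM uses (nr in bits 7:0, type in 15:8, size in 29:16, dir in 31:30); the helper names ioc_decode / ioc_encode are illustrative and not part of the driver:

```c
#include <stdint.h>

/* Fields of a Linux ioctl request number (asm-generic layout). */
struct ioc_fields {
    unsigned dir;   /* 1 = write, 2 = read, 3 = read + write */
    unsigned size;  /* argument buffer size in bytes */
    unsigned type;  /* magic byte; 0x4d ('M') for the numbers above */
    unsigned nr;    /* command index */
};

/* Split a request number into its four fields. */
static struct ioc_fields ioc_decode(uint32_t cmd)
{
    struct ioc_fields f;

    f.nr   = cmd & 0xffu;
    f.type = (cmd >> 8) & 0xffu;
    f.size = (cmd >> 16) & 0x3fffu;
    f.dir  = (cmd >> 30) & 0x3u;
    return f;
}

/* Inverse of ioc_decode(); equivalent in spirit to the kernel's _IOC(). */
static uint32_t ioc_encode(unsigned dir, unsigned type, unsigned nr,
                           unsigned size)
{
    return ((uint32_t)dir << 30) | ((uint32_t)size << 16) |
           ((uint32_t)type << 8) | (uint32_t)nr;
}
```

All five numbers in the table decode to dir = 3 (read + write) and type 0x4d, so the arg buffer is copied in both directions.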
Partial RE of cv500 vendor libnnie.so HI_MPI_SVP_NNIE_LoadModel @0x1bf4
(size 0x12d8). Phase 2 of #111. Decoded by tracing the loader's stack
buffer reads at sp+84..sp+275 (the 192-byte file-header copy).
What landed:
- libraries/nnie_neo/: new userspace library exporting all 8 vendor
  HI_MPI_SVP_NNIE_* entry points with vendor-matching const-qualified
  signatures. Builds clean for cv500 against the cv500 kernel-include
  tree (libnnie_neo.so = 7.5 KB on cv500 cross-compile).
- include/nnie_wk_format.h: 192-byte .wk file header struct, decoded
  fields:
  [0..3]     u32 CRC32 (zlib-style, IEEE 802.3 0xEDB88320 reflected,
             of bytes [4..filesize))
  [16..19]   format-version digits {1,1,1,2} (10*[16]+[17] == 11,
             10*[18]+[19] == 12 per loader checks @0x1dbc-0x1de4)
  [48]       enRunMode → SVP_NNIE_MODEL_S.enRunMode @0x1df0
  [49]       u32NetSegNum → SVP_NNIE_MODEL_S.u32NetSegNum @0x1dfc
  [52..55]   inst_offset_extra (sp+136 read, > 0xBF, bounds-checked)
  [56..59]   inst_len (sp+140 read, non-zero, bounds-checked)
  [176..179] dup of inst_offset_extra (sp+260 read, must equal)
  [180..183] some count (sp+264 read, > 47)
- src/nnie_ops.c: LoadModel verifies CRC32 + version bytes, then
  returns HI_ERR_SVP_NNIE_NOT_SURPPORT (the segment-table / ROI-info /
  weights parsing is the Phase 3 work). All other 7 entries are also
  stubbed with vendor-matching signatures.
Phase 2 limitations (deferred to Phase 3):
- Segment table iteration (astSeg[u32NetSegNum]): vendor walks an
  array of SVP_NNIE_SEG_S starting at some file offset within the
  inst_offset_extra region. Layout unconfirmed.
- ROI pool info (astRoiInfo[]): vendor walks SVP_NNIE_ROIPOOL_INFO_S
  records, count derived from segment metadata.
- u32TmpBufSize calculation: likely a running sum across segments.
- stBase fill: needs the SVP_SRC_MEM_INFO_S passed in as model base.
On-target verification: blocked this session. The av300 board got
into a wedged state after the Phase 1 sysrq reboot cycle and now
answers ping but not ssh.
Static decode of the loader and bench-build of libnnie_neo.so both
pass. /tmp/nnie_crc_check ARM binary is staged for the test once the
board is power-cycled. Expected results:
  valid mnist.wk   -> 0xa0338008 (NOT_SURPPORT, CRC passes)
  corrupt mnist.wk -> 0xa0338003 (ILLEGAL_PARAM, CRC fails)
NNIE work continues on the local nnie-neo branch: no PR until Phase 4
(end-to-end inference on av300 with a tiny real model) passes, per the
bundle-on-one-branch feedback.
Two bugs in the Phase 2 CRC verifier, found on first av300 test run
against vendor inst_mnist_cycle.wk (466176 B):
1. Init value: vendor accumulator starts at 0, not 0xFFFFFFFF (zlib
convention). Confirmed by re-reading libnnie.so 0x1cd0-0x1cd8 —
the special-case `mvneq r0, #0` only fires when r1==4 (the
trivially-short-payload path). The general path enters the CRC
loop with r0 still at whatever it was before, which is 0 (the
memcpy_s return value at 1c64 was checked as 0). Final XOR is
still 0xFFFFFFFF (`mvn r0, r0` at 1d0c).
2. CRC coverage range: bytes [4..file[52]+file[56]), not the whole
file. file[52..55] is inst_offset_extra (header tail offset),
file[56..59] is inst_len. Their sum is the end of the CRC-protected
region; weights / quantization tables after that point aren't CRC-
protected (vendor relies on instruction-stream offsets to address
them).
Confirmed against vendor mnist.wk on av300:
stored CRC = 0xa4a25b1a
computed = 0xa4a25b1a after the fix
Test results now match the Phase 2 expectations:
/tmp/nnie_crc_check on av300:
LoadModel valid -> 0xa0338008 (NOT_SURPPORT — CRC + version pass,
parser body unimplemented)
LoadModel corrupt -> 0xa0338003 (ILLEGAL_PARAM — CRC fails)
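The two fixes above can be sketched as follows. This is a minimal illustration rather than the code in src/nnie_ops.c: the function names and the bitwise (table-free) CRC loop are assumptions, while the polynomial, the zero initial value, the final inversion and the [4..file[52]+file[56]) range come from the notes above.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32 (polynomial 0xEDB88320) with a caller-chosen initial
 * accumulator and a final inversion. init = 0xFFFFFFFF gives the zlib
 * convention; init = 0 gives the vendor variant described above. */
static uint32_t crc32_reflected(uint32_t init, const uint8_t *p, size_t n)
{
    uint32_t crc = init;

    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Little-endian u32 load, independent of host byte order. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 when the stored CRC matches, 0 on mismatch or bad bounds.
 * The 192-byte minimum is the fixed header size. */
static int wk_crc_ok(const uint8_t *file, size_t file_size)
{
    uint64_t end;

    if (file_size < 192)
        return 0;
    end = (uint64_t)le32(file + 52) + le32(file + 56);
    if (end <= 4 || end > file_size)
        return 0;
    return crc32_reflected(0, file + 4, (size_t)(end - 4)) == le32(file);
}
```

Bytes past file[52]+file[56] do not influence the result, which is why corrupting the weight area alone would not be caught by this check.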
Continuing Phase 2 RE of HI_MPI_SVP_NNIE_LoadModel @0x1bf4. Tracing
the post-CRC parse path at 0x1e70-0x1edc reveals the per-segment
file record layout (16 B header part + variable-length node arrays):
off | width | maps to SVP_NNIE_SEG_S field
----+-------+------------------------------
0 | u8 | enNetType (must be <= 2)
1 | u8 | u16SrcNum (zero-extended in struct)
2 | u8 | u16DstNum
3 | u8 | u16RoiPoolNum
4 | u16 | u16MaxStep (<= 1024)
6 | u16 | pad/unk
8 | u32 | u32InstOffset (bounds-checked vs inst_offset_extra +
inst_len)
12 | u32 | u32InstLen (16-byte aligned)
Followed by node-array records — Phase 3 prereq.
Also confirmed file[60..63] is u32TmpBufSize (the loader stores it
at pstModel+4 = SVP_NNIE_MODEL_S.u32TmpBufSize, then checks != 0).
Updated nnie_wk_header_t with the new field names + added
nnie_wk_seg_record_t struct. Header file now documents the full
high-level format from CRC through segment-table header.
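A parser for the 16-byte segment record could look roughly like the sketch below. The struct and function names are hypothetical (the real definition is nnie_wk_seg_record_t in nnie_wk_format.h), and the exact form of the vendor's bounds check is an assumption: here the instruction range simply has to end at or before a caller-supplied region end.

```c
#include <stdint.h>

/* Hypothetical decoded form of one 16-byte segment record. */
struct seg_record {
    uint8_t  net_type;      /* off 0, must be <= 2 */
    uint8_t  src_num;       /* off 1 */
    uint8_t  dst_num;       /* off 2 */
    uint8_t  roi_pool_num;  /* off 3 */
    uint16_t max_step;      /* off 4, must be <= 1024 */
    uint16_t unk_06;        /* off 6, pad/unknown */
    uint32_t inst_offset;   /* off 8 */
    uint32_t inst_len;      /* off 12, multiple of 16 */
};

static uint32_t seg_le(const uint8_t *p, int nbytes)
{
    uint32_t v = 0;

    for (int i = nbytes - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Returns 0 on success, -1 when a field violates the checks listed in
 * the table. region_end is inst_offset_extra + inst_len from the header
 * (assumed to be the bound the vendor compares against). */
static int seg_record_parse(const uint8_t rec[16], uint32_t region_end,
                            struct seg_record *out)
{
    out->net_type     = rec[0];
    out->src_num      = rec[1];
    out->dst_num      = rec[2];
    out->roi_pool_num = rec[3];
    out->max_step     = (uint16_t)seg_le(rec + 4, 2);
    out->unk_06       = (uint16_t)seg_le(rec + 6, 2);
    out->inst_offset  = seg_le(rec + 8, 4);
    out->inst_len     = seg_le(rec + 12, 4);

    if (out->net_type > 2 || out->max_step > 1024)
        return -1;
    if (out->inst_len % 16 != 0)
        return -1;
    if ((uint64_t)out->inst_offset + out->inst_len > region_end)
        return -1;
    return 0;
}
```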
More LoadModel disassembly progress (vendor libnnie.so 0x1f30-0x208c).
Node-record layout in the segment table is asymmetric:
Segment data block starts at file[192] (= end of the fixed header,
per file[8] = 0xC0). Structure within a segment block:
+0..15 nnie_wk_seg_record_t (the 16-B segment header)
+16..29 first source-node WHC metadata (compact, ~14 B)
[16..19] u32 -> NODE_S.unShape.stWhc.u32Height
[20..23] u32 -> NODE_S.unShape.stWhc.u32Width
[24..27] u32 -> NODE_S.unShape.stWhc.u32Chn
[30..31] u16 -> NODE_S.enType (post-mapped via 2/3/4/5
-> SVP_BLOB_TYPE enum lookup)
+32.. array of 64-B node slots, stride 64. First field of each
slot is the 32-byte szName. Remaining 32 bytes hold
additional WHC/dim + blob_type fields (offsets in slot
read at 0x205c-0x20bc, exact layout still partial).
Verified against vendor mnist.wk:
file[208..211] = 28 (u32 Height)
file[212..215] = 28 (u32 Width)
file[216..219] = 1 (u32 Chn — grayscale)
file[224..227] = "data" (Caffe input-layer name)
Updated nnie_wk_format.h with nnie_wk_node_slot_t (64 B). Phase 3 will
fill the unk_20_3F[32] tail with confirmed field offsets once we have
a Forward dispatch + kprobe trace exercising real inference. Still
TODO this phase: ROIPool record layout, multi-segment models, segment-
boundary computation when NetSegNum > 1.
Decoded the per-ioctl arg-buffer layouts by tracing the userspace
worker functions in libnnie.so:
Forward (ioctl 0xc6584d00, arg size 1624 B), worker @0x104c:
off | size | content
----+------+----------------------------------------------
0 | 4 | HI_HANDLE (out — kernel writes assigned handle)
4 | 4 | pad
8 | 768 | astSrc[16] — 16 SVP_BLOB_S (48 B each)
776 | 8 | pad
784 | 768 | astDst[16]
1552 | 64 | SVP_NNIE_FORWARD_CTRL_S {SrcNum, DstNum, NetSegId,
| enNnieId, stTmpBuf(24), stTskBuf(24)}
1616 | 4 | bInstant
1620 | 4 | pad
AddTskBuf / RemoveTskBuf (ioctls 0xc0184d03 / 0xc0184d04, 24 B):
plain SVP_MEM_INFO_S {u64 phys, u64 virt, u32 size, u32 pad}.
Verified at 0x3134-0x3150 in libnnie.so.
Updated kernel handlers:
- nnie_op_forward parses the 64 B ctrl block (SrcNum/DstNum/NetSegId)
+ writes handle = 0 to buf+0 + returns -EOPNOTSUPP. Phase 4 will
walk astSrc/astDst phys addrs, apply the [0x90] memory-priority knob
(skipped for IVE, required for NNIE), and submit to NNIE HW.
- nnie_op_add_tskbuf / remove_tskbuf parse the MEM_INFO_S triple,
print it, return -EOPNOTSUPP.
- nnie_op_query unchanged (done=1 stub).
ForwardWithBbox arg-layout (1728 B = 1624 + 104) is similar to
Forward with an extra ProposalNum + bbox MEM_INFO block. Precise
offsets TBD after Forward HW path works.
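The Forward arg layout can be pinned down with compile-time checks. A sketch under stated assumptions: the struct and field names are illustrative (not the vendor's), blobs are kept as opaque 48-byte cells, and every pad is spelled out so the layout does not depend on the compiler's alignment rules.

```c
#include <stddef.h>
#include <stdint.h>

/* SVP_MEM_INFO_S as decoded above: 24 bytes. */
struct mem_info {
    uint64_t phys;
    uint64_t virt;
    uint32_t size;
    uint32_t pad;
};

/* 64-byte control block: four u32 fields + two mem_info records. */
struct forward_ctrl {
    uint32_t src_num;
    uint32_t dst_num;
    uint32_t net_seg_id;
    uint32_t nnie_id;
    struct mem_info tmp_buf;
    struct mem_info tsk_buf;
};

/* 1624-byte arg buffer of ioctl 0xc6584d00. */
struct forward_arg {
    uint32_t handle;            /* off 0, written by the kernel */
    uint32_t pad0;              /* off 4 */
    uint8_t  src[16][48];       /* off 8, astSrc[16] */
    uint8_t  pad1[8];           /* off 776 */
    uint8_t  dst[16][48];       /* off 784, astDst[16] */
    struct forward_ctrl ctrl;   /* off 1552 */
    uint32_t instant;           /* off 1616, bInstant */
    uint32_t pad2;              /* off 1620 */
};

_Static_assert(sizeof(struct mem_info) == 24, "MEM_INFO is 24 B");
_Static_assert(sizeof(struct forward_ctrl) == 64, "ctrl block is 64 B");
_Static_assert(offsetof(struct forward_arg, dst) == 784, "astDst offset");
_Static_assert(offsetof(struct forward_arg, ctrl) == 1552, "ctrl offset");
_Static_assert(offsetof(struct forward_arg, instant) == 1616, "bInstant");
_Static_assert(sizeof(struct forward_arg) == 1624, "arg size");
```

If any offset in the table were off, the build would fail instead of the driver silently reading the wrong field.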
Same nnie-neo branch; no PR until Phase 4 (real inference) passes.
Key result from continuing the vendor open_nnie.ko disassembly: NNIE HW
dispatch is *not* direct register access. It goes through vendor's cmpi
cross-module function-pointer indirection. Two sites of interest:
1. drv_svp_nnie_config_ram -> hal_svp_nnie_enable_ram @0xb8f4:
   The "[0x90] memory-priority knob" mentioned in #111 isn't written by
   open_nnie.ko at all. The function calls
   cmpi_get_module_func_by_id(51, 0xd1) where 51 is the open_sys.ko
   module ID and 0xd1 is a function selector, then blx's the returned
   function pointer with state at sp+4. So the register write lives in
   open_sys.ko's exported function table.
2. svp_nnie_start_task @0x1934:
   After drv_svp_nnie_prepare_nnie, it calls
   cmpi_get_module_func_by_id(37) -> r5 (some module's func table),
   then dispatches via four entries:
     r5+0x78 -> prepare submission
     r5+0x7c -> fire/wait
     r5+0x80 -> exists check
     r5+0x84 -> finalize / select_ram fallback
   Same dispatch pattern but mediated by cmpi.
Implication for the clean-room: Phase 4 needs to either
(a) replicate cmpi's cross-module function-pointer contract
    end-to-end, or
(b) RE open_sys.ko to find the actual NNIE register writes and bypass
    cmpi.
(b) is cleaner because it avoids depending on a vendor-shared
function-pointer ABI that could drift. Documented in nnie_op_forward's
TODO block. No code change to the dispatch path; the parsed-arg +
return-EOPNOTSUPP shape is still what userland sees.
RAM/mutex coordination
Followed the cmpi indirection from hal_svp_nnie_enable_ram into
hi_sys.o (open_sys.ko's vendor blob). Resolution to the previous
session's open question:
- cmpi_get_module_func_by_id(51, 0xd1) ->
  sys_hal_gdc_nnie_set_ram_using @0x897c in cv500's hi_sys.o.
- That function does an atomic bit-set/clear on bit 0 of register
  offset +0x34 of the sys-module's MMIO window (loaded from
  g_sys_state[16] base).
- The "atomic bit-set" pattern is in a private helper at
  sys_drv_get_cmp_3dnr_cfg+0x1c8 (= 0x7a84): spin_lock_irqsave /
  read-old / XOR-new / mask-to-bit / XOR-old / write-back /
  spin_unlock.
So the original issue #111 wording, "memory arbitration [0x90]
register", was a partial description. The real mechanism is a set of
bit-writes coordinating RAM ownership and mutex between NNIE / GDC /
VENC, distributed across these sys-module HAL functions:
- sys_hal_gdc_nnie_set_ram_using
- sys_hal_gdc_nnie_mutex_sel
- sys_hal_venc_nnie_mutex_sel
- sys_hal_nnie_get_mutex_state
- sys_hal_nnie_gdc_get_mutex_state
- sys_hal_vgs_bootroom_set_ram_using
And it's all anchored at sys MMIO base + offset 0x34..0x44ish, not at
0x90 of the IVE block.
Updated nnie_op_forward's TODO block with the new findings. Two paths
forward for Phase 4 dispatch wiring:
(a) ioremap the sys register window from open_nnie_neo.ko and do the
    bit-writes directly. Cleanest, doesn't depend on a clean-room
    open_sys module existing.
(b) Wait for an open_sys clean-room module (separate effort) and use
    a proper exported API.
This is real progress: we have a concrete register + bit-set target
instead of "the [0x90] knob" which doesn't actually exist on the IVE
block. Next session resumes by ioremap'ing the sys MMIO window
(address TBD, probably accessible via DT, since hi3516cv500.dtsi
declares hisi-sys node) and exercising the bit-writes.
Followed the relocations into hi_sys.o: sys_hal_gdc_nnie_set_ram_using
indirects via R_ARM_MOVW_ABS_NC .LANCHOR0 — a per-module anchor that
holds the sys register window base (g_reg_sys_base_va). cv500 DT
declares:
sys: sys@12010000 {
compatible = "hisilicon,hisi-sys";
reg = <0x12010000 0x10000>, /* crg */
<0x12020000 0x8000>, /* sys <- the one used here */
<0x12060000 0x10000>, /* ddr */
<0x12030000 0x8000>; /* misc */
};
So the NNIE/GDC RAM-using register is at phys **0x12020034** (sys-base
+ 0x34). Verified live on av300 via devmem 0x12020034 -> 0x00000000
which matches the expected idle state (NNIE not actively dispatched,
bit 0 clear).
Phase 4 starting point is now concrete: ioremap(0x12020000, 0x8000)
in open_nnie_neo.ko, do an atomic read-modify-write of bit 0 at
offset 0x34 to mark NNIE RAM in-use before each Forward dispatch.
Similar offsets for the other 5 sys_hal_* functions (mutex_sel,
get_mutex_state) live nearby in 0x12020000..0x12020044ish — sweep
needed in next session to map them all out.
Swept all sys_hal_*nnie* + sys_hal_vgs_bootroom_set_ram_using functions
in hi_sys.o. The "memory arbitration" the issue mentioned breaks down
into three registers within the cv500 sys window (phys 0x12020000,
DT-declared reg-name "sys" of hisilicon,hisi-sys node):
Register 0x12020000: live=0x00000102
bit 13 = vgs_bootroom_set_ram_using (R/W)
Register 0x12020008: live=0x00000000 (no contention)
bit 0..1 = NNIE/GDC mutex state (R)
bit 1 = venc<->nnie mutex_sel (sys_hal_venc_nnie_*) (W)
bit 2 = gdc<->nnie mutex_sel (sys_hal_gdc_nnie_*) (W)
Register 0x12020034: live=0x00000000
bit 0 = gdc_nnie_set_ram_using (R/W)
^ primary "NNIE has RAM" flag; set before Forward,
clear after.
Two private helper functions in hi_sys.o implement the atomic
R-M-W:
- sys_drv_get_cmp_3dnr_cfg+0x148 (= 0x7a04): n-bit field set
- sys_drv_get_cmp_3dnr_cfg+0x1c8 (= 0x7a84): single-bit set
Both use spin_lock_irqsave / read-old / mask+XOR / write-back.
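The read-old / mask+XOR / write-back step reduces to one expression. A minimal userspace sketch: merge_bits and reg_update are illustrative names, and the locking (spin_lock_irqsave in the vendor helpers) plus the readl/writel accessors a kernel build would use are deliberately left out.

```c
#include <stdint.h>

/* Replace the bits selected by mask in old with the same bits of val.
 * (old ^ val) & mask is the set of masked bits that differ; XOR-ing
 * that back into old flips exactly those bits. */
static uint32_t merge_bits(uint32_t old, uint32_t val, uint32_t mask)
{
    return old ^ ((old ^ val) & mask);
}

/* Read-modify-write of one register. In the kernel this would run under
 * a spinlock with interrupts disabled. */
static void reg_update(volatile uint32_t *reg, uint32_t val, uint32_t mask)
{
    *reg = merge_bits(*reg, val, mask);
}
```

The single-bit case is mask = 1u << bit with val either equal to the mask (set) or 0 (clear); the n-bit field helper is the same expression with a wider mask.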
Phase 4 wiring: ioremap(0x12020000, 0x1000) once in probe, then
atomic R-M-W of these bits around each Forward call. No more cmpi
cross-module indirection needed — we drive sys-window bits directly
from open_nnie_neo.ko.
All Phase 3 unknowns are now nailed down. Phase 4 has a concrete
implementation surface.
…tion
Adds ioremap(0x12020000, 0x1000) in probe to access the sys-side
NNIE/GDC/VENC coordination registers (decoded in the previous
commit). Currently read-only — probe dumps the three live register
values so we can confirm the mapping works. Phase 4 will use the
nnie_sys_set_bit() / nnie_sys_clear_bit() helpers (now defined,
__maybe_unused) around each Forward dispatch.
On-target verification (av300, fresh boot):
nnie_neo: probed nnie0=f2820000 irq=54 gdc_irq=53
nnie_neo: sys @0x12020000 mapped — VGS=0x00000102 MUTEX=0x00000000
NNIE_RAM=0x00000000
The three values match exactly what `devmem` reads against the same
phys addresses — vendor open_sys.ko and our open_nnie_neo.ko share
the window cleanly with plain ioremap (no request_mem_region clash).
NNIE_RAM = 0x0 confirms bit 0 is clear in the idle state, matching
the expected semantics ("NNIE has RAM" only during Forward dispatch).
Phase 4 wiring is now a thin shim: call
nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM) before
submitting to NNIE HW, clear after IRQ completion.
Includes a small spin_lock_init(&g_sys_lock) and adds iounmap() in
mod_exit for cleanliness.
Reverse-engineered the 64-byte NNIE HW task node layout from vendor
svp_nnie_fill_forward_task @0x90d8 (hi_nnie.o), prologue 90d8..91a8.
Cross-checked field offsets against the kernel SVP_NNIE_MODEL_S /
SEG_S / FORWARD_CTRL_S struct definitions:
sizeof(SVP_NNIE_NODE_S) = 52
sizeof(SVP_NNIE_SEG_S) = 1692 ← matches vendor 0x69c stride
sizeof(SVP_NNIE_ROIPOOL_INFO_S) = 104
sizeof(SVP_NNIE_MODEL_S) = 13992 ← matches vendor copy_from_user
size 0x36a8
offsetof(MODEL_S, stBase) = 13968 = 0x3690 ✓
Vendor caller passes: r0=global HW state, r1=pstModel (kernel kbuf),
r2=forward arg kbuf, r3=on-stack 64-byte descriptor. The struct gets
populated with phys addresses + segment instruction-stream pointer +
batch num + trigger flag, then handed to svp_nnie_post_process which
submits via cmpi-mediated svp_nnie_start_task (Phase 5).
Concrete field map:
task[ 0] u16 = bInstant ? 1 : 0
task[16] u64 = pstModel->stBase.u64PhyAddr (.wk MMZ block)
task[24] u32 = pstModel->astSeg[NetSegId].u32InstOffset
task[28] u32 = pstModel->astSeg[NetSegId].u32InstLen
task[32] u64 = ctrl.stTskBuf.u64PhyAddr (user-supplied scratch)
task[48] u64 = ctrl.stTmpBuf.u64PhyAddr (user-supplied temp)
task[56] u32 = astSrc[0].u32Num (batch size)
Past the 64-B header the vendor appends variable-length per-input
stride table (one u32 per astSrc), per-node shape data copied from
pstModel->astSeg[NetSegId].astSrcNode/astDstNode, and a per-batch DMA
address vector. Layout decoded but not yet captured in the header —
deferred until Phase 5 wires actual HW submission and we know which
trailing sections the v500 NNIE block actually consumes.
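The field map translates directly into byte stores on a 64-byte buffer. A sketch only: struct task_inputs and fill_task_header are hypothetical names (the driver's helper is nnie_fill_task_header), little-endian stores are assumed because the target is a little-endian ARM core, and offsets not listed in the map are left zero.

```c
#include <stdint.h>
#include <string.h>

/* Values that feed the fixed 64-byte header (illustrative grouping). */
struct task_inputs {
    int      instant;      /* bInstant */
    uint64_t model_phys;   /* pstModel->stBase.u64PhyAddr */
    uint32_t inst_offset;  /* astSeg[NetSegId].u32InstOffset */
    uint32_t inst_len;     /* astSeg[NetSegId].u32InstLen */
    uint64_t tsk_phys;     /* ctrl.stTskBuf.u64PhyAddr */
    uint64_t tmp_phys;     /* ctrl.stTmpBuf.u64PhyAddr */
    uint32_t batch;        /* astSrc[0].u32Num */
};

/* Store the low nbytes of v at p, least significant byte first. */
static void put_le(uint8_t *p, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

static void fill_task_header(uint8_t task[64], const struct task_inputs *in)
{
    memset(task, 0, 64);
    put_le(task + 0,  in->instant ? 1 : 0, 2);
    put_le(task + 16, in->model_phys, 8);
    put_le(task + 24, in->inst_offset, 4);
    put_le(task + 28, in->inst_len, 4);
    put_le(task + 32, in->tsk_phys, 8);
    put_le(task + 48, in->tmp_phys, 8);
    put_le(task + 56, in->batch, 4);
}
```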
This commit only adds the descriptor header + cross-references it
from the Forward stub comment. No behaviour change — module still
returns -EOPNOTSUPP for Forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverse-engineered the cv500 NNIE HW register interface from vendor
hal_svp_nnie_* thin shims (hi_nnie.o @0xbb10..0xbc90, all single-store
helpers indexed by core_id off .LANCHOR1[core_id]+4 = ioremap'd regs).
Complete 0x11100000 register map:
+0x20 W task descriptor phys[31:0] (hal_svp_nnie_write_task_addr)
+0x24 W task descriptor phys[63:32]
+0x28 W timeout cycles [31:0] (hal_svp_nnie_set_timeout)
+0x2C W timeout cycles [63:32]
+0x30 RW START — bit 0 = go (hal_svp_nnie_start)
+0x34 RW IRQ_CFG — bits 0/1/2 enable (hal_svp_nnie_cfg_irq)
finish / timeout / cfg_err IRQs
+0x38 RW IRQ_CLEAR — bits 0/1/2 w1c (hal_svp_nnie_clear_irq)
+0x3C R IRQ_STATUS — bits 0/1/2 pending (hal_svp_nnie_get_irq_status)
+0x40 R CFG_ERR_INFO (hal_svp_nnie_get_cfg_err_info)
+0x48 R TASK_ID (hal_svp_nnie_get_task_id)
+0x50 RW CLK_GATE — bit 7 (=0x80) en (hal_svp_nnie_enable_clk_gt)
+0x54 RW AXI OUTSTANDING — [4:0]=0xF, (hal_svp_nnie_set_outstanding)
[11:8]=0xF
+0x68 RW CHECK_SUM — bit 0 en (hal_svp_nnie_disable_check_sum)
Dispatch sequence (drv_svp_nnie_start @0xb3ac):
write_task_addr(task_phys_lo, task_phys_hi); // [+0x20], [+0x24]
wmb();
START |= 1; // [+0x30] |= 1
Two important confirmations:
1. hal_svp_nnie_set_mem_speed @0xbc28 is a LITERAL no-op (`bx lr`) on
cv500 — vendor doesn't write any IVE-style "[0x90] mem-priority
knob" for NNIE. Our Phase 3 finding that NNIE coordination instead
uses the sys @0x12020034 RAM-using flag stands.
2. hal_svp_nnie_enable_ram @0xb8f4 goes through cmpi (module 51 = SYS,
fn 0xd1), not the NNIE register window. This matches the Phase 3
finding of sys_hal_gdc_nnie_set_ram_using setting bit 0 of
0x12020034. So the full HW-bring-up sequence is:
1. nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM)
2. write CLK_GATE = 0x80 (enable clock gating)
3. write OUTSTANDING = 0xF | 0xF00
4. fill 64-B task descriptor (Phase 4 prior commit)
5. write IRQ_CFG = 0x7 (enable all 3 IRQs)
6. write TASK_ADDR_LO/HI
7. wmb()
8. write START = 1
9. wait on IRQ → read IRQ_STATUS → clear IRQ
10. nnie_sys_clear_bit(NNIE_SYS_REG_NNIE_RAM, NNIE_SYS_BIT_NNIE_RAM)
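Steps 2, 3 and 5 to 8 of the sequence above are plain register writes, so they can be exercised against a fake register array before touching hardware. A sketch under assumptions: the NNIE_REG_* names other than NNIE_REG_IRQ_STATUS are made up for this example, reg_write/reg_read stand in for writel/readl, and __sync_synchronize() (a GCC/Clang builtin) stands in for wmb().

```c
#include <stdint.h>

/* Byte offsets inside the NNIE register window, from the map above. */
enum {
    NNIE_REG_TASK_ADDR_LO = 0x20,
    NNIE_REG_TASK_ADDR_HI = 0x24,
    NNIE_REG_START        = 0x30,
    NNIE_REG_IRQ_CFG      = 0x34,
    NNIE_REG_CLK_GATE     = 0x50,
    NNIE_REG_OUTSTANDING  = 0x54
};

static void reg_write(volatile uint32_t *base, unsigned off, uint32_t v)
{
    base[off / 4] = v;
}

static uint32_t reg_read(volatile uint32_t *base, unsigned off)
{
    return base[off / 4];
}

/* Program the block and set START bit 0. The sys-window RAM flag
 * (steps 1 and 10), descriptor fill (step 4) and IRQ wait (step 9)
 * are outside this helper. */
static void nnie_kick(volatile uint32_t *regs, uint64_t task_phys)
{
    reg_write(regs, NNIE_REG_CLK_GATE, 0x80);
    reg_write(regs, NNIE_REG_OUTSTANDING, 0xF | 0xF00);
    reg_write(regs, NNIE_REG_IRQ_CFG, 0x7);
    reg_write(regs, NNIE_REG_TASK_ADDR_LO, (uint32_t)task_phys);
    reg_write(regs, NNIE_REG_TASK_ADDR_HI, (uint32_t)(task_phys >> 32));
    __sync_synchronize();   /* descriptor address visible before START */
    reg_write(regs, NNIE_REG_START, reg_read(regs, NNIE_REG_START) | 1u);
}
```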
No behaviour change — module still returns -EOPNOTSUPP. Phase 4 wiring
is now a thin shim around this header + nnie_hw_task.h.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire the decoded 64-byte HW task descriptor into nnie_op_forward via
a new helper nnie_fill_task_header(). Still returns -EOPNOTSUPP — we
need the variable-length descriptor tail (per-input stride table,
per-node shape data, per-batch DMA addresses) before driving HW —
but this commit:
- Lays down all the SVP_BLOB_S / SVP_NNIE_FORWARD_CTRL_S internal
offset constants (already cross-checked vs vendor disasm).
- Builds the fixed 64-byte header from ctrl.stTskBuf.u64PhyAddr,
ctrl.stTmpBuf.u64PhyAddr, astSrc[0].u32Num, and bInstant.
- Logs all decoded values pr_info_once so an on-target Forward call
now prints the full decoded forward args + the partial task header,
proving the offset constants are right end-to-end (the values must
match what userspace passed in).
- Defers reading pstModel->stBase.u64PhyAddr +
astSeg[NetSegId].u32InstOffset/InstLen to Phase 5 (needs
copy_from_user of 13992 B model struct).
- Drops the now-redundant Phase 3 comment block; the cv500 sys-window
coordination map + NNIE register map have been promoted into
nnie_hw_task.h + nnie_hw_regs.h.
Rebuilt cleanly against the cv500 4.9.37 kernel
(/home/dima/git/firmware/output-cv500/build/linux-custom). New
.text size for nnie_op_forward: 0x1a8 B (was a one-line stub).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverse-engineered the variable-length descriptor tail of the NNIE
task buffer from vendor svp_nnie_fill_forward_task body
@0x91d4-0x9498. The fixed 64-byte HW descriptor (decoded in Phase 4)
points to this tail via task[+32] = ctrl.stTskBuf.u64PhyAddr; the HW
follows that pointer to read shape/stride/per-batch DMA addresses.
Tail layout (written to ctrl.stTskBuf, kernel-vir resolved by
svp_nnie_get_tsk_vir_addr @0x91a8 via phys-match against the
registered tskbuf list):
§1 always SrcNum × u32 astSrc[i].u32Stride
align tip to 16 B
§2 non-LSTM 16 × u64 astDst[i].u64PhyAddr (zero past
DstNum, always 128 B advance)
§3 non-LSTM varies per-source DMA address vector,
dispatch by astSrc[i].enType:
0 -> u32Stride*Height*Chn
* batch_idx + PhyAddr
1..3 -> svp_nnie_fill_image_src_addr
(YUV plane offsets)
4 -> u32Stride*Height
* batch_idx + PhyAddr
5 -> per-step from user
u64VirAddrStep array
other -> ILLEGAL_PARAM
align tip to 16 B per blob
§4 LSTM only different net_type==RECURRENT path @0x96dc,
uses ctrl+stTskBuf indexing —
Phase 6 work
§5 optional dcache flush if (sp+40)/sp+28 set, flush
range [stTskBuf.PhyAddr,
+stTskBuf.u32Size)
Important struct-layout correction: SVP_BLOB_S has a 4-byte hole at
+28 (not previously documented in nnie_neo.c). The union starts at
+32 because stSeq.u64VirAddrStep needs 8-byte alignment. Cross-
checked with the cv500 ARM toolchain:
+0..+27 enType, Stride, VirAddr, PhyAddr, Num
+28..+31 PADDING
+32..+47 union { Width,Height,Chn | Dim,VirAddrStep }
The previous NNIE_BLOB_OFF_WIDTH=28 / HEIGHT=32 / CHN=36 constants
were wrong; corrected to 32/36/40. The Forward stub's use of these
(reading astSrc[0].u32Num at +24) was already at the right offset.
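The corrected SVP_BLOB_S layout can be locked in with offsetof checks. In this sketch the struct and member names are illustrative, the hole at +28 is written out as an explicit field, and the pad inside the sequence variant is an assumption that mirrors what 8-byte alignment of u64VirAddrStep would produce on the ARM EABI.

```c
#include <stddef.h>
#include <stdint.h>

struct blob {
    uint32_t type;      /* +0  enType */
    uint32_t stride;    /* +4  u32Stride */
    uint64_t vir_addr;  /* +8  u64VirAddr */
    uint64_t phy_addr;  /* +16 u64PhyAddr */
    uint32_t num;       /* +24 u32Num */
    uint32_t hole;      /* +28 padding, made explicit */
    union {             /* +32..+47 */
        struct {
            uint32_t width;
            uint32_t height;
            uint32_t chn;
        } whc;
        struct {
            uint32_t dim;
            uint32_t pad;            /* assumed alignment pad */
            uint64_t vir_addr_step;
        } seq;
    } u;
};

_Static_assert(sizeof(struct blob) == 48, "SVP_BLOB_S is 48 B");
_Static_assert(offsetof(struct blob, num) == 24, "u32Num at +24");
_Static_assert(offsetof(struct blob, u) == 32, "union starts at +32");
```

With this layout the width, height and channel fields land at +32, +36 and +40, matching the corrected constants.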
No behaviour change — Forward stub still returns -EOPNOTSUPP. Phase
6 will:
- decode the LSTM tail variant (§4)
- implement the §1-§3/§5 builder in C
- wire copy_from_user of pstModel (13992 B) to populate
task[+16/+24/+28]
- drive the HW per nnie_hw_regs.h sequence
Rebuilt clean against cv500 4.9.37 kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix the deferred IRQF_SHARED issue from Phase 0:
- request_irq() now passes IRQF_SHARED. The cv500 NNIE SPI line (54)
is shared with vendor open_gdc.ko (GDC is on SPI 53 in the same
DT node), whose handler we kprobed and found registered with
IRQF_SHARED. To coexist we have to match: the kernel rejects
mixed-flag handlers on a shared line.
- nnie_irq_handler now reads NNIE_REG_IRQ_STATUS first; if no NNIE
bits are pending it returns IRQ_NONE so the GDC handler (or any
other downstream sharer) gets to run. Only when a NNIE finish /
timeout / cfg_err bit is set do we write-1-clear and signal
g_nnie_done.
The handler doesn't yet inspect *which* bit was set — that distinction
(finish vs timeout vs cfg_err) gets pushed to Phase 7 once Forward
actually dispatches and we have an end-to-end test path.
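The shared-line decision in the handler boils down to a few lines. The sketch below is a userspace model, not the kernel handler: the enum values mirror IRQ_NONE / IRQ_HANDLED, the register pointers replace readl/writel on the ioremap'd window, and the completion signal is reduced to storing the pending bits.

```c
#include <stdint.h>

#define NNIE_IRQ_BITS 0x7u   /* finish / timeout / cfg_err */

enum irq_result {
    IRQ_NOT_OURS = 0,   /* plays the role of IRQ_NONE */
    IRQ_OURS     = 1    /* plays the role of IRQ_HANDLED */
};

/* Decide whether the interrupt belongs to NNIE. If none of our bits are
 * pending, leave the hardware alone so the other sharer can handle it. */
static enum irq_result nnie_irq_model(volatile uint32_t *status_reg,
                                      volatile uint32_t *clear_reg,
                                      uint32_t *last_status)
{
    uint32_t pending = *status_reg & NNIE_IRQ_BITS;

    if (!pending)
        return IRQ_NOT_OURS;

    *clear_reg = pending;     /* write-1-to-clear only what we saw */
    *last_status = pending;   /* the real handler would complete() here */
    return IRQ_OURS;
}
```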
Rebuilt clean against cv500 4.9.37 kernel; .text size for the module
grew by 48 B for the status-read/dispatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the kernel side of HI_MPI_SVP_NNIE_AddTskBuf /
RemoveTskBuf. Userspace MMZ-allocates a scratch region, registers
(phys, user_virt, size) with the kernel; the kernel records the
mapping in a list and ioremap()s the phys range so Phase 7 can write
the variable-length descriptor tail into stTskBuf from Forward
dispatch.
Implementation:
- struct nnie_tskbuf {phys, user_virt, size, kvirt} on a list_head
protected by g_nnie_tskbuf_lock.
- nnie_add_tskbuf()/nnie_remove_tskbuf()/nnie_drain_tskbufs() do
the list management + ioremap/iounmap.
- nnie_op_add_tskbuf/remove_tskbuf wire the 24-byte SVP_MEM_INFO_S
arg buffer through to those helpers.
- nnie_drain_tskbufs() in mod_exit prevents leaks on rmmod.
ioremap (not cmpi_remap_cached) is uncached, which means we don't
need the cache-flush step vendor has at fill_forward_task @0x94ac.
Trade-off is slower kernel writes — but the descriptor tail is small
(KB), written once per Forward call.
Userspace API match: vendor's libnnie.so AddTskBuf returns 0 on
success; ours now returns 0 (on success), -EEXIST (already
registered), -ENOMEM (OOM or ioremap fail), or -EINVAL
(phys/size==0). RemoveTskBuf returns 0 or -ENOENT.
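The return-code contract is easy to model outside the kernel. This sketch swaps the locked list_head for a small fixed array and drops the mapping step entirely, so TSKBUF_MAX, the slot array and both function names are assumptions made for illustration; only the errno values follow the description above.

```c
#include <errno.h>
#include <stdint.h>

#define TSKBUF_MAX 8   /* hypothetical cap; the driver uses a list */

struct tskbuf {
    uint64_t phys;
    uint64_t user_virt;
    uint32_t size;
    int      used;
};

static struct tskbuf g_tskbufs[TSKBUF_MAX];

/* 0 on success, -EINVAL for phys/size == 0, -EEXIST when phys is
 * already registered, -ENOMEM when no slot is free. */
static int tskbuf_add(uint64_t phys, uint64_t user_virt, uint32_t size)
{
    int free_slot = -1;

    if (phys == 0 || size == 0)
        return -EINVAL;
    for (int i = 0; i < TSKBUF_MAX; i++) {
        if (g_tskbufs[i].used && g_tskbufs[i].phys == phys)
            return -EEXIST;
        if (!g_tskbufs[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -ENOMEM;
    g_tskbufs[free_slot].phys = phys;
    g_tskbufs[free_slot].user_virt = user_virt;
    g_tskbufs[free_slot].size = size;
    g_tskbufs[free_slot].used = 1;
    return 0;
}

/* 0 on success, -ENOENT when phys was never registered. */
static int tskbuf_remove(uint64_t phys)
{
    for (int i = 0; i < TSKBUF_MAX; i++) {
        if (g_tskbufs[i].used && g_tskbufs[i].phys == phys) {
            g_tskbufs[i].used = 0;
            return 0;
        }
    }
    return -ENOENT;
}
```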
Module size grew from 7128 B to 8028 B (+900 B for the registry).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add nnie_build_task_tail() — implements §1-§3 of the descriptor tail
decoded in Phase 5. Wired into nnie_op_forward as a dry-run: if the
caller has registered an stTskBuf via AddTskBuf, we look it up and
fill in the strides + dst-phys + per-batch DMA addresses. Still
returns -EOPNOTSUPP overall (Phase 7 will copy_from_user pstModel,
finalise the 64-B header, and drive the HW registers).
Builder:
§1: SrcNum × u32 stride entries (one per astSrc[i])
Aligned to 16 B with zero-fill.
§2: 16 × u64 destination phys addresses, zero-padded past dst_num.
§3: per-source DMA address vector — for each astSrc[i]:
enType==0: batch_size = Stride * Height * Chn
enType==4: batch_size = Stride * Height
enType ∈ [1..3, 5]: -EOPNOTSUPP (Phase 7+ — YUV/seq inputs)
Writes Num u64 entries: PhyAddr + j*batch_size.
Aligned to 16 B between blobs.
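For reference, the builder rules above fit in one function. This is a standalone sketch, not nnie_build_task_tail itself: struct src_blob carries only the fields the rules mention, entries are stored in host byte order (which matches the little-endian target), and the capacity checks are an added assumption about how an undersized stTskBuf should be rejected.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Only the source-blob fields the tail rules read (illustrative). */
struct src_blob {
    uint32_t type;     /* enType: 0 and 4 are handled here */
    uint32_t stride;
    uint32_t num;      /* batch count */
    uint32_t height;
    uint32_t chn;
    uint64_t phys;
};

static size_t align16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Returns the number of tail bytes used, or a negative errno. */
static long build_task_tail(uint8_t *tail, size_t cap,
                            const struct src_blob *src, unsigned src_num,
                            const uint64_t *dst_phys, unsigned dst_num)
{
    size_t off = 0;

    if (src_num > 16 || dst_num > 16)
        return -EINVAL;
    if (align16((size_t)src_num * 4) + 128 > cap)
        return -ENOMEM;
    memset(tail, 0, cap);                    /* provides the zero-fill */

    for (unsigned i = 0; i < src_num; i++, off += 4)   /* §1 strides */
        memcpy(tail + off, &src[i].stride, 4);
    off = align16(off);

    for (unsigned i = 0; i < dst_num; i++)             /* §2 dst phys */
        memcpy(tail + off + 8 * (size_t)i, &dst_phys[i], 8);
    off += 128;                              /* always 16 x u64 */

    for (unsigned i = 0; i < src_num; i++) {           /* §3 DMA vector */
        uint64_t batch;

        if (src[i].type == 0)
            batch = (uint64_t)src[i].stride * src[i].height * src[i].chn;
        else if (src[i].type == 4)
            batch = (uint64_t)src[i].stride * src[i].height;
        else
            return -EOPNOTSUPP;              /* YUV / sequence inputs */
        if (align16(off + 8 * (size_t)src[i].num) > cap)
            return -ENOMEM;
        for (uint32_t j = 0; j < src[i].num; j++, off += 8) {
            uint64_t addr = src[i].phys + j * batch;
            memcpy(tail + off, &addr, 8);
        }
        off = align16(off);
    }
    return (long)off;
}
```

For a single source and a single destination the tail is 16 B of strides, 128 B of destination addresses, then the per-batch vector rounded up to 16 B.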
Tail bytes used logged via pr_info_once so on-target verification
can confirm the offset arithmetic matches what HW expects (cross-
checkable against vendor strace).
Wiring:
- nnie_init.c now exports g_nnie_pf_dev (platform_device *) so
nnie_neo.c can dma_alloc_coherent in Phase 7.
- Header includes: linux/dma-mapping.h + linux/platform_device.h.
- nnie_fill_task_header marked __maybe_unused (Phase 7 will use it).
Module .text grew from 8028 B to 8944 B (+916 B for the builder).
Build clean against cv500 4.9.37 kernel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Finish everything in the Forward path *except* the actual HW kick:
- copy_from_user(model_kbuf, fwd_arg[+776], 13992) — pulls the
user's SVP_NNIE_MODEL_S into kernel memory using vmalloc (kmalloc
would slab-fragment for ~14 KB).
- Validate net_seg_id < model->u32NetSegNum and < 8.
- Extract model->stBase.u64PhyAddr (model[+0x3690]) -> task[+16].
- Extract model->astSeg[net_seg_id].u32InstOffset/u32InstLen
(model[+12 + seg*1692 + 12 / +16]) -> task[+24], task[+28].
- Look up stTskBuf in our registry; call nnie_build_task_tail to
write the §1-§3 variable-length tail.
- Fill the 64-byte HW task descriptor on stack via
nnie_fill_task_header (un-suppressed __maybe_unused).
- Log: trigger, model_phys, inst_off, inst_len, tail_bytes.
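The offset arithmetic for pulling those three values out of the copied model struct can be sketched as below. The constants come from the sizes and offsets quoted in this log; the names are illustrative, and the u32NetSegNum offset of +8 is inferred from astSeg starting at +12 rather than confirmed separately.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum {
    MODEL_SIZE            = 13992,   /* sizeof(SVP_NNIE_MODEL_S) */
    MODEL_OFF_NET_SEG_NUM = 8,       /* assumed, see lead-in */
    MODEL_OFF_SEG0        = 12,      /* astSeg[0] */
    MODEL_OFF_BASE_PHYS   = 0x3690,  /* stBase.u64PhyAddr */
    MODEL_MAX_SEGS        = 8,
    SEG_STRIDE            = 1692,    /* sizeof(SVP_NNIE_SEG_S) */
    SEG_OFF_INST_OFFSET   = 12,
    SEG_OFF_INST_LEN      = 16
};

struct model_fields {
    uint64_t base_phys;
    uint32_t inst_offset;
    uint32_t inst_len;
};

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t rd_le64(const uint8_t *p)
{
    return (uint64_t)rd_le32(p) | ((uint64_t)rd_le32(p + 4) << 32);
}

/* model points at MODEL_SIZE bytes already copied from userspace.
 * Returns 0, or -EINVAL when seg_id is out of range. */
static int model_extract(const uint8_t *model, uint32_t seg_id,
                         struct model_fields *out)
{
    const uint8_t *seg;

    if (seg_id >= MODEL_MAX_SEGS ||
        seg_id >= rd_le32(model + MODEL_OFF_NET_SEG_NUM))
        return -EINVAL;

    seg = model + MODEL_OFF_SEG0 + (size_t)seg_id * SEG_STRIDE;
    out->inst_offset = rd_le32(seg + SEG_OFF_INST_OFFSET);
    out->inst_len    = rd_le32(seg + SEG_OFF_INST_LEN);
    out->base_phys   = rd_le64(model + MODEL_OFF_BASE_PHYS);
    return 0;
}
```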
What's left for Phase 7 (the actual HW kick):
- dma_alloc_coherent the 64-B descriptor (we have g_nnie_pf_dev).
- memcpy stack descriptor into it.
- nnie_sys_set_bit(NNIE_SYS_REG_NNIE_RAM, ...) coordination.
- 6 register writes plus a barrier (CLK_GATE, OUTSTANDING, IRQ_CFG,
TASK_ADDR_LO/HI, wmb(), START).
- wait_for_completion_timeout 5 sec.
- Read+ack NNIE_REG_IRQ_STATUS; distinguish finish/timeout/cfg_err.
- Release sys lock + dma_free_coherent + write handle to buf+0.
Stops short of HW kick because the partial test on av300 is non-
destructive only as long as no register write hits the live NNIE
block — once we add the START write, a wrong descriptor field could
DMA to bad addresses and (worst case) hang the SoC. Doing that as a
distinct commit keeps the bisect safe.
Module .text grew from 8944 to 9908 B (+964 for copy_from_user
flow). Still returns -EOPNOTSUPP at the end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire the cv500 NNIE Forward dispatch end-to-end. Forward now:
1. Decodes the 1624-byte forward arg (Phase 1).
2. copy_from_user pstModel; extracts stBase.u64PhyAddr +
astSeg[net_seg_id].u32InstOffset/u32InstLen (Phase 6).
3. Writes §1-§3 variable-length tail into the registered stTskBuf
via nnie_build_task_tail (Phase 5/6).
4. dma_alloc_coherent's a 64-byte HW descriptor, populates it via
nnie_fill_task_header (Phase 4).
5. Acquires the cv500 sys-window NNIE_RAM coordination bit at
0x12020034 (Phase 3).
6. Writes CLK_GATE / OUTSTANDING / IRQ_CFG / TASK_ADDR_LO/HI /
START to the NNIE register window (Phase 4).
7. Waits on g_nnie_done completion (5 s timeout, IRQF_SHARED-aware
handler from Phase 6).
8. Reads cause out of g_nnie_last_status (set atomically by the
handler before complete()).
9. Releases the sys lock, frees the DMA descriptor.
10. Distinguishes finish / timeout / cfg_err and returns the right
errno (0 / -ETIMEDOUT / -EIO).
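Step 10 is a small mapping from the latched status to an errno. A sketch with assumptions stated up front: the bit positions follow the IRQ_CFG description (bit 0 finish, bit 1 timeout, bit 2 cfg_err), the macro and function names are made up, and the priority given to cfg_err when several bits are set is a guess, not something decoded from the vendor driver.

```c
#include <errno.h>
#include <stdint.h>

#define NNIE_IRQ_FINISH   (1u << 0)
#define NNIE_IRQ_TIMEOUT  (1u << 1)
#define NNIE_IRQ_CFG_ERR  (1u << 2)

/* wait_timed_out: non-zero when the 5 s software wait expired without
 * the handler signalling. status: the bits latched by the handler. */
static int nnie_status_to_errno(int wait_timed_out, uint32_t status)
{
    if (wait_timed_out)
        return -ETIMEDOUT;
    if (status & NNIE_IRQ_CFG_ERR)
        return -EIO;
    if (status & NNIE_IRQ_TIMEOUT)
        return -ETIMEDOUT;
    if (status & NNIE_IRQ_FINISH)
        return 0;
    return -EIO;   /* signalled with no known bit set: treat as error */
}
```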
The dispatch happens unconditionally — no module parameter gate.
On-target verification on av300 is the next step. Failure modes
that could happen on first run:
- Bad descriptor field offset: HW flags cfg_err in IRQ_STATUS, the
handler signals, we return -EIO. Recoverable; no board hang.
- sys-window bit doesn't match what vendor expects: HW silently
discards the task, we hit the 5 s timeout, return -ETIMEDOUT.
Also recoverable.
- Wrong NNIE register-window decode (Phase 4 RE was wrong): worst
case the START write goes nowhere; same -ETIMEDOUT outcome.
- HW reads the descriptor and the descriptor's tsk_buf_phys points
somewhere bad: the HW access raises a bus error; on cv500 the SoC bus
abort handler typically logs it and the NNIE block returns cfg_err.
Recoverable.
- Worst plausible failure: HW reads the descriptor, the variable-
length tail has a bad PhyAddr in §3, NNIE DMAs garbage to/from a
bad address. AXI typically reports a bus error rather than
hanging. Power-cycle recovers if not.
Module .text grew from 9908 to 12248 B (+2.3 KB for dispatcher).
Build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end Forward path is wired (Phase 7 dispatch + IRQF_SHARED); old 'Phase 3 — ioctl ABI wired, HW dispatch TBD' log was stale. On-target verified on av300: module loads, IRQ 54 shares cleanly with vendor GDC_NNIE handler (no 'Flags mismatch' error), /dev/nnie present, sys-window coordination registers read live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three substantive changes to libraries/nnie_neo:
1. Forward / AddTskBuf / RemoveTskBuf now call ioctl() on /dev/nnie
instead of returning HI_ERR_SVP_NNIE_NOT_SURPPORT. The Forward
path packs SVP_SRC_BLOB_S[] + pstModel user VA + SVP_DST_BLOB_S[]
+ SVP_NNIE_FORWARD_CTRL_S into the 1624-byte ioctl arg per the
layout decoded in kernel/nnie_neo/nnie_neo.c.
2. LoadModel now actually populates the SVP_NNIE_MODEL_S struct:
- stBase = *pstModelBuf
- enRunMode = file[48]
- u32TmpBufSize = file[60..63]
- u32NetSegNum = file[49]
- astSeg[i].enNetType / u16SrcNum / u16DstNum / u16RoiPoolNum
/ u16MaxStep / u32InstOffset / u32InstLen from
the 16-byte seg records at file[192 + i*16].
Node + ROI tables left zeroed — kernel Forward only reads
u32InstOffset / u32InstLen, and userspace post-process helpers
(softmax/detect/cluster) aren't implemented yet, so zeroed slots
are safe for now. Validated InstOffset+InstLen against file_size.
3. /dev/nnie fd lifecycle: cached static int, opened lazily on first
ioctl, protected by a pthread_mutex. nnie_err_to_hi() translates
Linux errno to vendor HI_ERR_SVP_NNIE_* codes.
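The InstOffset+InstLen check from change 2 could look like the sketch below. It assumes only what the list states (16-byte seg records starting at file[192]); the helper names are hypothetical, and the field layout inside each record is deliberately left out because it is not spelled out here.

```c
#include <stddef.h>
#include <stdint.h>

#define WK_SEG_TABLE_OFF 192u  /* first seg record, per the list above */
#define WK_SEG_RECORD_SZ 16u   /* bytes per seg record */

/* Byte offset of seg record seg_idx inside the .wk file. */
static size_t wk_seg_record_off(uint32_t seg_idx)
{
    return (size_t)WK_SEG_TABLE_OFF + (size_t)seg_idx * WK_SEG_RECORD_SZ;
}

/* Returns 1 if [inst_off, inst_off + inst_len) lies inside the file.
 * Uses a subtraction so inst_off + inst_len cannot wrap around. */
static int wk_inst_range_ok(uint32_t inst_off, uint32_t inst_len,
                            size_t file_size)
{
    if ((size_t)inst_off > file_size)
        return 0;
    return (size_t)inst_len <= file_size - (size_t)inst_off;
}
```

Comparing against `file_size - inst_off` rather than summing the two u32 fields keeps a corrupt header from passing the check through integer overflow.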
Build fix: vendor hi_nnie.h needs HI_ID_SVP_NNIE from the staging
hi_common.h, but libraries/include/hi_common.h was being preferred
(causing redeclaration conflicts on EN_ERR_LEVEL_*). Reorder Makefile
include path so STAGING/kernel/include/hi3516cv500 comes BEFORE
libraries/include.
libnnie_neo.so builds clean against cv500 ARM toolchain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two on-target fixes after the first end-to-end Forward attempt on
av300 with inst_mnist_cycle.wk:
1. AddTskBuf: switched from ioremap() to memremap(MEMREMAP_WB).
cv500 MMZ regions are CMA-backed kernel RAM, and ioremap()
refuses these (kernel WARN + returns NULL because the kernel
direct map already covers them). memremap WB transparently
handles both CMA RAM (returns lowmem virt) and MMIO (falls
back to ioremap). Updated struct nnie_tskbuf.kvirt type from
'void __iomem *' to 'void *' and replaced iowrite32 in the
tail builder with plain stores.
Verified on target: AddTskBuf now returns 0, no more WARN.
2. Tail builder: SVP_BLOB_TYPE_U8 (=1) with Chn=1 (grayscale)
now uses Stride*Height batch_size. Vendor's svp_nnie_fill_
image_src_addr @0x7978 handles U8 by branching on Chn ∈
{1, 3}: Chn=1 is a single u64 per batch (= PhyAddr + j*Stride*
Height); Chn=3 writes 4 u64s per batch (3 plane addrs + zero
pad) at 32 B/batch — that's still Phase 8.
Without the U8 path mnist couldn't run (its input blob is
enType=1, Chn=1).
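The Chn=1 address formula from fix 2 as a small sketch. The function name is hypothetical; only the arithmetic (PhyAddr + j*Stride*Height) comes from the text above.

```c
#include <stdint.h>

/* Physical address of batch j for a U8, Chn=1 source blob.
 * The product is widened to 64 bits before multiplying so large
 * stride * height values do not overflow 32 bits. */
static uint64_t nnie_u8c1_batch_addr(uint64_t phys, uint32_t stride,
                                     uint32_t height, uint32_t j)
{
    return phys + (uint64_t)j * stride * height;
}
```

For a 28-row image with a 32-byte stride, each batch advances by 896 bytes.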
Current on-target test run (LD_LIBRARY_PATH+PRELOAD voice libs):
LoadModel -> 0x0 NetSegNum=1, Inst@offset 453888 len 10600
AddTskBuf -> 0x0
Forward -> 0xa0338012 (-ETIMEDOUT)
Kernel dmesg shows:
task hdr: model_phys=0xaa880000 inst_off=453888 inst_len=10600
tail=160 B (descriptor builder ran clean)
Forward timed out (5s, status snapshot=0x0)
Next: figure out why HW doesn't IRQ. Hypotheses:
- NNIE clock disabled (vendor drv_svp_nnie_enable_sys_clk @0xb26c
likely needed before START)
- 64-B descriptor layout still slightly off
- Vendor open_gdc.ko handler consuming our IRQ first
- Some RAM-bank select register we're missing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three on-target findings from sequential diagnosis on av300, each
verified against vendor disasm + live devmem readings:
1. NNIE clock/reset is on CRG (clock-reset generator) window at
0x12010000 — NOT the sys window at 0x12020000 we mapped in
Phase 3. Per cv500 DT:
clock@12010000 — clock-reset, 'hisilicon,hi3516cv500-clock'
sys@12020000 — sys-state (mutex, RAM-using flags)
Vendor hi_sys.o sys_hal_wk_cnn_clk_en @0x86dc writes bit 1 of
register +0xbc of LANCHOR0[+8] (= CRG base, not sys base):
crg @0x120100bc:
bit 0 = NNIE reset (1=held, 0=released)
bit 1 = NNIE clk_en (1=ungated)
Pre-Phase-7 dispatch: CRG @0xbc = 0x0 → NNIE clock GATED.
Writes to NNIE_REG_CLK_GATE silently dropped (read back as 0).
This commit ioremap()'s the CRG window, defines NNIE_CRG_*
constants, adds nnie_crg_set_bit/clear_bit helpers, and calls
them in dispatch to release reset + ungate clock before the
first register write.
Verified: CRG @0xbc now reads 0x00000002 after dispatch,
NNIE_REG_CLK_GATE readback now 0x80 (was 0x0 — register writes
were no-ops without clock).
2. Vendor's one-shot svp_nnie_init @0x10f4 also sets TIMEOUT to
~2 seconds at the NNIE clock rate (TIMEOUT_HI:LO = 0xff:
0xffffffff). Without TIMEOUT set, HW seems to hang
indefinitely after START. Added init-time programming of
these registers + checksum disable (clear bit 0 of +0x68) per
the vendor init sequence at 0x1c80-0x1cd0.
3. Tail builder: U8 (enType=1) with Chn=1 now follows the
standard CNN per-batch DMA formula (PhyAddr + j*Stride*Height).
Vendor svp_nnie_fill_image_src_addr @0x7978 confirms this.
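The CRG sequence from finding 1 can be sketched in userspace against a plain variable standing in for the register at CRG+0xbc. The bit meanings are taken from the text above; the helper names are illustrative, and in the kernel the accesses would go through readl()/writel() on the ioremapped window instead of a pointer dereference.

```c
#include <stdint.h>

#define NNIE_CRG_RESET  (1u << 0)  /* 1 = held in reset, 0 = released */
#define NNIE_CRG_CLK_EN (1u << 1)  /* 1 = clock ungated */

/* Read-modify-write helpers; other bits in the register are preserved. */
static void crg_set_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg = *reg | mask;
}

static void crg_clear_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg = *reg & ~mask;
}

/* Release reset, then ungate the clock, before any NNIE register write. */
static void nnie_crg_power_on(volatile uint32_t *crg_bc)
{
    crg_clear_bits(crg_bc, NNIE_CRG_RESET);
    crg_set_bits(crg_bc, NNIE_CRG_CLK_EN);
}
```

Starting from 0x0 this leaves the register at 0x2, which agrees with the readback quoted in finding 1.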
Current state: end-to-end test on av300 with inst_mnist_cycle.wk
runs all the way through to the HW START kick. /dev/nnie ABI works,
all RE-discovered registers programmed correctly. HW still hangs
after START — no IRQ in 5 s, STATUS=0 throughout. Likely cause:
descriptor format mismatch (probably one of):
- 64-byte header has a field at +0/+2/+4 we haven't fully decoded
- Variable-length tail format wrong for our SrcNum=1/DstNum=1
arrangement
- Need to call drv_svp_nnie_config_ram (OTP-based ram bank cfg)
once at init
Deferred to Phase 8 — needs careful side-by-side comparison
against vendor strace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from sequential av300 debug:
1. sys_hal_gdc_nnie_set_ram_using @0x897c uses LANCHOR0+16, which
hi_sys.o sys_hal_init @0x8d70 ioremaps to *0x12030000*, not
0x12020000 like I assumed in Phase 3. The 'sys' window at
0x12020000 holds VGS/MUTEX status; the RAM-using flag is in a
separate 'sys2' window at 0x12030000+0x34. Wired
NNIE_SYS2_BASE_PHYS / nnie_sys2_set_bit/clear_bit helpers; map
in probe.
2. Added pre-START diagnostic that dumps NNIE register state
(CLK, OUT, IRQ_CFG, TIMEOUT, TASK_ADDR, CHECK_SUM) + the full
64-byte HW task descriptor as hex u32s, then polls IRQ_STATUS/
START/TASK_ID for 100 ms after START. All registers programmed
correctly per vendor disasm; descriptor format matches Phase 4
decode byte-for-byte.
3. Pulse-reset before clock-enable, in case HW is left in a stuck
state by a previous failed dispatch.
Current state — *all RE-discovered HW is correctly programmed*:
pre-START regs: CLK=0x80 OUT=0xf0f IRQ_CFG=0x7
TO_LO=0xffffffff TO_HI=0xff
ADDR_LO=0xa00fe000 ADDR_HI=0x0 CHKSUM=0x0
64-B task desc: trigger=1 model=0xaa880000 inst_off=0x6ed00
inst_len=0x2968 tsk=0xa9c70000 tmp=0xa00f5000
batch=1
Post-START: STATUS=0, TASK_ID=0 (HW completely silent for 5s).
Architectural find (svp_nnie_post_process @0x1d8c): vendor maintains
a pre-allocated DMA-coherent ring of 512 × 64-byte slots per core,
indexed by the per-core busy counter. The fill_forward_task output
descriptor gets memcpy'd into the next ring slot, and the SLOT INDEX
gets written to descriptor[+4] (which I had as 'reserved'). For
first task (r6==0), this matches our descriptor[+4]=0, so the field
isn't the cause.
Remaining hypotheses for the HW hang:
- Variable-length descriptor tail layout off — HW interprets §1-§3
differently for our SrcNum=1 / DstNum=1 / U8 Chn=1 / batch=1
case than vendor expects
- drv_svp_nnie_config_ram (OTP-based chip-variant RAM bank cfg) is
needed at boot and we haven't been calling it
- Some other peripheral state vendor open_sys.ko configures
silently (cmpi mod 51 / mod 2 paths we haven't fully decoded)
Phase 9 needs: kprobe vendor's hal_svp_nnie_write_task_addr +
hal_svp_nnie_start on a working vendor Forward path (load vendor
open_nnie.ko, write a vendor-libnnie test) to capture the exact
task descriptor + ring-slot contents from a known-good run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 8 on-target finding (av300 + vendor libnnie.so as oracle):
Vendor LoadModel on /tmp/inst_mnist_cycle.wk reports:
Tmp = 1989888 (≈ 1.9 MB)
Our parser was reading u32TmpBufSize from file[60..63], which on
mnist is zero. The vendor value is too big to live in the .wk file
itself (466 KB): it's an inference-time scratch size that vendor
computes by walking the per-segment instruction stream.
Heuristic for now: declare u32TmpBufSize = 8 MB unconditionally,
which covers small classification models. Larger detection models
(yolov*, ssd, frcnn) will need more; Phase 9 will RE vendor's
computation in libnnie.so for the precise value.
This caps the per-Forward MMZ working set at ~8 MB even when models
don't need that much (mnist actually needs ~2 MB). Not a problem in
practice: userspace allocates the tmpbuf from MMZ and our kernel
doesn't touch it directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrote three Phase-9 debug modules in kernel/nnie_spy/ to capture
vendor open_nnie.ko's live HW dispatch state:
- nnie_spy.c: kprobe (CONFIG_KPROBES=n on cv500 — unusable, kept for
future kernel rebuild)
- nnie_dump.c: insmod with phys=/size= module params, dumps 16B/line
via phys_to_virt (works for any CMA-managed lowmem)
- nnie_watch.c: kthread polling NNIE+0x20/+0x24 every 1us, captures
any TASK_ADDR change + dumps the descriptor + tskbuf
Captured during a known-good vendor Forward on inst_mnist_cycle.wk:
TASK_ADDR = 0xa9c70000 (vendor's pre-allocated DMA ring slot)
descriptor:
d[0..7] : 01 00 00 00 00 00 00 00 AA88_0150 00 00 00 00 06ED00 002968
d[8..15]: A9CB0000 0 0 0 0 0 AD200000 0 00000001 0
tskbuf @ 0xa9cb0000:
+0x00: 00000020 00000030 00000000 00000000
+0x10: a00f6000 00000000 00000000 00000000
+0x20: a00f5000 00000000 00000000 00000000
+0x30..: zero
Two critical finds:
1. **task[+16] is NOT stBase.PhyAddr — it's stBase.PhyAddr +
inst_offset_extra**. For the test .wk: 0xaa880000 + 0x150 (=
file[52..55]) = 0xaa880150. Vendor's userspace LoadModel adjusts
stBase.u64PhyAddr to skip the .wk header. Our LoadModel was passing
the raw file phys. Fixed in libraries/nnie_neo/src/nnie_ops.c:
pstModel->stBase.u64PhyAddr += inst_off_extra after the *pstModelBuf
copy.
2. **The variable-length tskbuf tail is FAR SIMPLER than my disasm-
based RE suggested**. Vendor only uses:
+0: src strides packed (SrcNum × u32)
then dst strides packed (DstNum × u32)
pad to 16
+0x10: dst phys addrs (DstNum × u64, packed)
pad to 16
then src per-batch phys (SrcNum × Num × u64)
For mnist SrcNum=DstNum=Num=1: ~32 bytes meaningful, rest is the
tskbuf size we provided (65 KB, all zero). My earlier §1-§3 builder
wrote 160 bytes including a 16-slot dst phys array — completely
different from what HW expects.
Rewrote nnie_build_task_tail to match vendor byte-for-byte.
Verified that our tskbuf content is identical to vendor's:
+00: 20 30 0 0 (src_stride=0x20=32 dst_stride=0x30=48)
+10: dst_phys 0 0 0
+20: src_phys 0 0 0
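A userspace sketch of the simplified tail layout described in find 2, checked against the captured mnist values. The function name, signature, and source-major ordering of the per-batch array are assumptions; the section order and 16-byte padding follow the layout above. It assumes a little-endian host, as on the ARM target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t align16(size_t n) { return (n + 15u) & ~(size_t)15u; }

/* Builds: src strides, dst strides, pad to 16, dst phys addrs,
 * pad to 16, src per-batch phys addrs. Returns bytes used, or 0 if
 * the buffer is too small. src_phys is indexed [s * batch_num + j]. */
static size_t build_task_tail(uint8_t *buf, size_t cap,
                              const uint32_t *src_stride, uint32_t src_num,
                              const uint32_t *dst_stride, uint32_t dst_num,
                              const uint64_t *dst_phys,
                              const uint64_t *src_phys, uint32_t batch_num)
{
    size_t dst_off = align16(((size_t)src_num + dst_num) * 4);
    size_t src_off = align16(dst_off + (size_t)dst_num * 8);
    size_t total = src_off + (size_t)src_num * batch_num * 8;
    size_t off = 0;

    if (total > cap)
        return 0;
    memset(buf, 0, total);                       /* zero the padding */

    memcpy(buf + off, src_stride, (size_t)src_num * 4);
    off += (size_t)src_num * 4;
    memcpy(buf + off, dst_stride, (size_t)dst_num * 4);

    memcpy(buf + dst_off, dst_phys, (size_t)dst_num * 8);
    memcpy(buf + src_off, src_phys, (size_t)src_num * batch_num * 8);
    return total;
}
```

With SrcNum=DstNum=Num=1 this places the strides at +0x00, the dst phys at +0x10 and the src phys at +0x20, the same offsets as the vendor capture.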
Also fixed clean-room kernel to do read-modify-write on CLK_GATE
and OUTSTANDING (vendor pattern). Our previous plain writes were
clobbering required chip-default bits. CLK_GATE now reads back
0x3c9 (was 0x80); vendor's live state is 0x349, so the two differ
only in bit 7 (0x80).
Status: descriptor + tskbuf are now byte-equivalent to vendor's,
all RE-discovered HW registers programmed identically. HW still
returns cfg_err info=0x1 — the bug is OUTSIDE the descriptor +
tskbuf. Hypotheses for Phase 10:
- dma_alloc_coherent vs cmpi_mmz_malloc_cached affect HW DMA
differently
- Vendor's pre-allocated ring slot phys is registered with HW
via some other mechanism we haven't decoded yet (e.g., open_sys
registers it with CMP_3DNR or similar)
- chip-variant cfg via drv_svp_nnie_config_ram / prepare_nnie
(OTP-dependent) is required for our chip
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Phase-10 changes to bring cleanroom dispatch closer to vendor:
1. Descriptor allocation moved from dma_alloc_coherent to
hil_mmb_alloc + hil_mmb_map2kern_cached. Matches vendor's
cmpi_mmz_malloc_cached pattern, places the descriptor in the
same MMZ pool vendor uses for its 512-slot ring (verified:
same single 256MB zone 0xA0000000-0xAFFFFFFF per /proc/media-mem).
2. __cpuc_flush_dcache_area on tskbuf after nnie_build_task_tail
writes — memremap WB gives cached kernel mapping; HW DMA needs
the writes visible in DDR. Vendor uses SAMPLE_COMM_SVP_FlushCache.
3. __cpuc_flush_dcache_area on the descriptor after memcpy from the
stack-built struct — same reason as (2).
Independently verified via nnie_dump.ko reading our descriptor phys:
the 64 bytes match vendor's known-good capture byte-for-byte except
for the tskbuf-phys field d[8] (vendor used 0xa9cb0000, ours uses
0xa9c70000 — both valid MMZ phys, content identical at both).
HW STILL returns cfg_err info=0x1. With:
- descriptor bytes match vendor
- tskbuf tail content matches vendor (verified via dump)
- registers programmed identically to vendor (CLK_GATE=0x3c9 etc)
- MMZ allocation now in same pool as vendor
- all caches flushed before START
…something else differs. Hypotheses for Phase 11:
- Per-core state in vendor's LANCHOR0 holds HW-required init that
only vendor's svp_nnie_init populates (e.g., a chip-variant
fixup we haven't decoded)
- HW has a hidden "first task" register/sequence we're missing
- vendor's open_gdc.ko (loaded but unused for inference) sets
something we're missing
- svp_nnie_check_err_status decodes info=1 as something specific
we haven't traced yet
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot-diff finding: after a clean reboot, vendor's open_nnie.ko
insmod writes ~15 previously-unexplored NNIE registers. One of them
is CHECK_SUM (+0x68 = 0x00000001 post-vendor-init).
Vendor's symbol 'disable_check_sum' @0xbc74 clears bit 0, but the
live state has bit 0 SET after init runs. The function name is
misleading; bit 0 is presumably "disable mode" semantics (clearing
it enables real operation). Our cleanroom was calling the analog
(clear bit 0) which actively DISABLED checksum.
Empirical: with CHECK_SUM=0 our HW returns cfg_err info=1; with
CHECK_SUM=1 (vendor's value) HW returns cfg_err info=0. Different
error code, so we're closer.
Remaining gap, registers vendor writes that we don't:
nnie+0x00 = 0x00002018
nnie+0x04 = 0x00000130
nnie+0x08 = 0x0000B017
nnie+0x10 = 0x5A5A5A5A ← magic value
nnie+0x14 = 0x0000FFEF
nnie+0x6c = 0xFFFFFFFF
nnie+0x70..0xa8 = various (chip cfg / clock params?)
These may be HW-self-populated when clock is on (need to verify by
just enabling CRG NNIE clock without vendor module) or may be set by
drv_svp_nnie_prepare_nnie (OTP-variant-dependent cmpi mod 2 fn 0xb6
call). Phase 12 to investigate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot-diff vs vendor's open_nnie load showed vendor's init only
changes ONE register beyond what we already program: CRG+0xa4 (VEDU
clock, per hi_sys.o sys_hal_vedu_clk_en) goes 0 → 6. NNIE may share
clock infrastructure with VEDU on cv500.
This commit adds a CRG+0xa4 RMW to our dispatcher to keep bits 0..2 =
6. On test, devmem readback shows the write did NOT take effect
(post-dispatch CRG+0xa4 still 0). Possible causes:
- VEDU clock register may need a separate enable bit
- Some sys/CRG window has write-protect we haven't decoded
- Vendor's path may set it via cmpi → open_sys.ko which has CRG
permission we lack
cfg_err info changed from 1 to 0 in the previous commit (CHECK_SUM
preserved). info=0 remains here — adding CRG+0xa4 didn't change the
result.
The 15 'unexplored' registers I'd worried about (nnie+0x00..0x14,
+0x70..+0xa8) turn out to be HW-self-populated when the clock is on.
Verified by snapshotting AFTER reset with ONLY a devmem write to
CRG+0xbc=0x2 (no module loaded): all the magic values appear. So
vendor doesn't write them — they're chip defaults exposed once the
block is clocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two tweaks to the dispatcher:
- write NNIE_IRQ_ALL to NNIE_REG_IRQ_CLEAR right before TASK_ADDR
writes, to drop any stale cfg_err/timeout bits from a previous
failed dispatch
- log NNIE_REG_IRQ_STATUS + NNIE_REG_CFG_ERR_INFO in the pre-START
register dump
Test: pre-START state confirmed STATUS=0x0 ERR_INFO=0x0 (clean).
Within 100us of START write, HW raises STATUS=0x4 (cfg_err) with
ERR_INFO=0x0. So HW is processing the task and definitively
rejecting it — not a stale-IRQ issue.
CRG+0xa4 write from kernel module also confirmed to stick now
(post-test devmem reads 0x6 as intended). Earlier non-stick may
have been a transient state-machine issue from the unsafe write
ordering. Even with CRG+0xa4=6 matching vendor, ERR_INFO stays 0.
Remaining hypothesis space (Phase 12):
- vendor's cmpi_register_module call registers NNIE as module 51,
enabling other modules (sys/sys_config) to call NNIE-specific
init that we miss
- drv_svp_nnie_prepare_nnie's OTP-variant path may run additional
chip cfg via cmpi mod 2 fn 0xb6 for specific OTP values; need
to find g_reg_otp_base_va and check our chip's OTP[+0x28]
- HW may need a "warm" task descriptor (something set by vendor
in its descriptor ring slot at first task that survives between
tasks)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extended kernel/nnie_spy/nnie_watch.c to dump ALL NNIE registers
+0x00..+0xbc at the moment of TASK_ADDR write. Captured vendor's
working mnist Forward state:
reg+0x20: a9c70000 00000000 ffffffff 000000ff ← TIMEOUT_HI=0xff!
reg+0x40: 00000000 00000003 00000000 00000000 ← +0x44 = 0x3 (UNK)
reg+0x60: 00000000 00000000 00000000 ffffffff ← CHECK_SUM=0!
Three register diffs vs our cleanroom that we now fix:
1. TIMEOUT_HI (+0x2C) = 0xff (we had set to 0 after misreading an
earlier post-Forward snapshot — vendor set_timeout writes 0xff,
HW clears it after completion. AT task start it's 0xff.)
2. CHECK_SUM (+0x68) = 0 (vendor explicitly disables before each
task. Live readback after vendor's Forward completion shows 1
because HW restores chip default 0x1 post-task. Vendor's
'disable_check_sum' function name is correct after all — bit 0
= 1 IS "enabled", and vendor disables before submit.)
3. +0x44 (unknown) = 0x3 (vendor writes this; no decoded function
in hi_nnie.o symbol table matches +0x44).
With these three fixes (without further changes to +0xb0..+0xb8 which
made things worse), the result now varies from run to run: cfg_err
info=0, cfg_err info=0x1000001, or sometimes a 5-sec TIMEOUT (no IRQ).
The failure mode differs each run, but at least the cfg_err code
changed, suggesting HW is partially accepting our task now.
Phase 13: trace what +0x44 is, why info=0x1000001 vs info=0. Also
investigate whether HW resets between tasks differently than vendor
expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical finding via test_neo on av300: assigning a monotonic per-task slot index to descriptor[+4] (the 'reserved' field — actually vendor's task ring slot index, 0..511) makes the NNIE HW ACCEPT the task. Live readback shows TASK_ID register (+0x48) updating to 0x1 matching our slot_idx after START. Previously descriptor[+4] = 0 caused HW to reject with cfg_err info= 0 / 0x1000001. With monotonic non-zero idx, HW updates TASK_ID and runs partial processing. Mechanism (inferred): HW tracks a "next expected slot_idx" internally. Submitting slot_idx that matches the current state (=0 on cold boot) is treated as a no-op or as referring to an already-completed task, so cfg_err. Submitting a fresh slot makes HW accept the new task. Vendor's ring iterates 0,1,2,...,511 mod 512 — first task after fresh boot is 0, subsequent ones increment. So our 'always 0' was wrong after the first failed task. Forward still returns cfg_err (cause=0x4 info=0x1000001) AT THE END because the inference engine fails part-way through execution — but this is a LATER failure mode than the previous "submission rejected". Remaining puzzle (Phase 14): why does HW reject mid-execution? Most likely candidates: (a) instruction stream interpretation fails because some descriptor offsets we set are still wrong relative to vendor; (b) tmp_buf size/content not what HW expects; (c) output blob format mismatch (vendor uses VEC_S32 stride=48 — we match this in test_neo). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR OpenIPC/firmware#2095 enabled CONFIG_KPROBES=y in cv500/av300 ship
kernels (board built 2026-05-14 05:25 UTC). My local linux-custom
.config was stale (pre-#2095), so register_kprobe in modules built
against it expanded to the kprobes.h inline stub returning -ENOSYS.
Fix: set CONFIG_KPROBES=y + CONFIG_KALLSYMS_ALL=y +
CONFIG_JUMP_LABEL=y + CONFIG_MODULE_UNLOAD=y in linux-custom/.config,
run oldconfig + prepare to regenerate include/generated/autoconf.h,
rebuild nnie_spy.ko.
Also: register_kprobe(symbol_name=...) only resolves GLOBAL kallsyms.
Vendor's hal_svp_nnie_write_task_addr is LOCAL (lowercase 't' in
kallsyms) so I had to switch to .addr=, with the address supplied at
module load from a kallsyms grep.
Verification:
• probe_test.ko on printk → "PROBE FIRED" each printk call ✓
• nnie_spy.ko on hal_svp_nnie_write_task_addr → fires during
vendor's Forward, dumps r2/r3 (task_phys), then 64-byte descriptor
and tskbuf tail via phys_to_virt
First vendor capture (mnist Forward, scores match prior runs):
task_phys = 0xa9c70000 (vendor's pre-allocated ring slot 1)
d[0..7] = 01 00 00 00 00 00 00 00 ... aa880150 00 00 00 00 6ed00 002968
d[8..15] = a9cb0000 0 0 0 0 0 ad200000 0 01 00 00 00 0 0
tail @0xa9cb0000:
+0x00: 20 30 0 0 a00f6000 0 0 0
+0x20: a00f5000 0 0 0 ...zero
Identical to earlier polling captures, but now with single-cycle
precision (vs ~1MHz poll).
Phase 14 can now iterate quickly: kprobe vendor's full hal_svp_nnie_*
call sequence, compare to ours, find the missing register /
write-ordering issue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendor's first task uses descriptor[+4] = 0 (verified via kprobe
capture this session). Our slot index came straight from
atomic_inc_return, which returns the incremented value and so gave 1
on the first call. descriptor[+4] = 1 means HW waited for slot 0
(never submitted) to complete first → 5-sec TIMEOUT. Switch to
atomic_inc_return - 1 (the pre-increment value, 0 on first call).
With slot_idx = 0 our cleanroom now produces a 64-byte descriptor
byte-equivalent to vendor's mnist Forward capture:
00000001 00000000 00000000 00000000
aa880150 00000000 0006ed00 00002968
(Identical content, identical layout.)
HW still rejects with cfg_err info=1 with slot=0. The dependence of
cfg_err code on slot_idx + prior HW state is intricate:
slot_idx=0, fresh boot → cfg_err info=1
slot_idx=1, after previous fails → TIMEOUT (TASK_ID updates to 1)
So HW accepts SOMETHING and rejects something else.
With kprobes now available the next iteration can attach probes to
vendor's full hal_svp_nnie_* call chain and the chip's actual
register write order, then diff against ours. That belongs in
Phase 15+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
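A userspace sketch of the slot-index rule, using C11 atomics in place of the kernel's atomic_t. atomic_fetch_add returns the value before the increment, which plays the role of atomic_inc_return - 1 here. The counter and function names are hypothetical; the 512-slot ring size comes from the vendor ring described earlier in this log.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NNIE_RING_SLOTS 512u

/* Zero-initialised at file scope, so the first task gets slot 0. */
static atomic_uint g_task_seq;

/* Returns the slot index for the next task: 0, 1, ..., 511, 0, ...
 * atomic_fetch_add yields the pre-increment value. */
static uint32_t nnie_next_slot(void)
{
    return atomic_fetch_add(&g_task_seq, 1u) % NNIE_RING_SLOTS;
}
```

Because 2^32 is a multiple of 512, the sequence stays continuous even when the unsigned counter itself wraps.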
Added kernel/nnie_spy/nnie_trace.c that simultaneously kprobes 7
vendor functions in open_nnie.ko (write_task_addr, start, cfg_irq,
set_timeout, enable_clk_gt, set_outstanding, disable_check_sum) and
logs ARM_r0..r3 at each entry. Loaded with addr= params for each
symbol grep'd from /proc/kallsyms.
Captured vendor's per-task call sequence on av300 (mnist Forward):
1. cfg_irq(core_id=0, ...)
2. set_timeout(core_id=0, ..., r2=0xffffffff TIMEOUT_LO,
r3=0xff TIMEOUT_HI)
3. enable_clk_gt(core_id=0, ...)
4. set_outstanding(core_id=0, ...)
5. disable_check_sum(core_id=0, ...)
6. write_task_addr(core_id=0, ..., r2=0xa9c70000 task_phys, r3=0)
7. start(core_id=0, ...)
This sequence matches our cleanroom's register-write order. And
captured task_phys, TIMEOUT_LO/HI, CHECK_SUM=0 etc. match our
descriptor + pre-START state byte-for-byte.
Tested: vendor's open_nnie module init first (to leave HW post-init
state vendor expects), then rmmod + insmod our cleanroom + Forward.
Still cfg_err info=1. So the gap is NOT in module-init residual
state.
Remaining gap is between "all observable register values match
vendor's working state" and "HW actually produces inference work".
Likely something in cmpi-mediated open_sys handshake or HW
state-machine sequence ordering we still get wrong. Phase 15+ work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kprobed sys_hal_wk_cnn_clk_en, sys_hal_wk_cnn_reset_sel, and
sys_hal_gdc_nnie_set_ram_using on av300 during a known-good vendor
mnist Forward, captured this call sequence (likely interleaved with
calls from other vendor modules):
clk_en(1) → reset_sel(0) → clk_en(0) → clk_en(1) →
ram(0) → clk_en(1) → ram(1) → ram(0) → clk_en(0)
The clock-toggle (OFF then ON) before the task is novel — we just
keep clock on continuously. Hypothesised this was an HW reset pulse
NNIE needs.
Mirrored vendor's sequence in nnie_dispatch_forward. HW still
returns cfg_err info=1. So the clock dance ALONE isn't what HW
needs — the failure is somewhere else still.
Remaining hypotheses for Phase 15:
- The clock-toggles come from MULTIPLE concurrent threads (other
vendor modules using cmpi mod 51 ops). Trying to replicate them
in single-threaded sequence misses the actual chip state HW
needs.
- HW may need a specific CMPI handshake we still don't emulate.
- Some HW state I haven't observed yet.
Branch: 39 commits on nnie-neo. Phase 14 progressed significantly
this session (kprobes unblocked, full vendor call trace captured)
but the HW failure mode is now stably reproducible at cfg_err
info=1 with byte-identical state to vendor's working setup. The
cleanroom RE has reached a plateau that needs careful single-step
HW debugging (or vendor open_sys.ko source) to break through.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert four per-task register writes from plain writel() to
read-modify-write to match vendor hal_svp_nnie_* helpers:
reg+0x30 (START)     plain 1   → OR bit 0
reg+0x34 (IRQ_CFG)   plain 0x7 → OR bits 0|1|2
reg+0x38 (IRQ_CLEAR) plain 0x7 → OR bits 0|1|2
reg+0x68 (CHECK_SUM) plain 0   → BFC bit 0
Vendor preserves upper-half status bits in these registers via RMW.
Plain writes were clobbering them, but on a post-reboot clean run the
read-back is 0 in all four cases, so the RMW is functionally
equivalent; the change is for correctness against any future state
path that sets those bits.
Also switch task descriptor MMZ from cached + __cpuc_flush_dcache_area
to nocache (hil_mmb_map2kern), matching vendor svp_nnie_init's
cmpi_mmz_malloc_nocache for its pre-allocated task ring. Eliminates
cache coherency as a variable.
cfg_err info=0x1 still fires mid-execution. Phase 14 is the gap
between byte-equivalent dispatch state and HW actually executing the
task. None of: GDC ops handshake, prepare_nnie ioctl, check_clk_freq,
set_mem_speed (all decoded from hi_nnie.o disasm) are present in our
cleanroom dispatch, but on cv500 all four are no-ops for the classic
NNIE path with no GDC concurrent activity.
Next angle: kprobe vendor cfg_err recovery to capture HW state at the
moment of a known-bad dispatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dump
LoadModel breakthrough: decode the per-segment node tables in the .wk
file at file[208+] (compact 16 B node header + 32 B name = 48 B per
node, layout decoded from vendor inst_mnist_cycle.wk hex dump). Now
populates:
astSeg[i].astSrcNode[j] = { enType, u32Width, u32Height, u32Chn,
u32NodeId, szName } ← read from file
astSeg[i].astDstNode[j] = { enType=SVP_BLOB_TYPE_S32, ..., szName }
← name read; type/id derived (vendor's
libnnie hardcodes type=4 for outputs)
Direct vendor-vs-cleanroom comparison on av300:
vendor lib (libnnie.so) reports astSrcNode[0].enType=1 for mnist.wk
cleanroom (libnnie_neo) was reporting enType=0 — userspace test
uses this to pick SVP_BLOB_TYPE_E for its src MmzAlloc, so we were
sending the wrong blob type to the Forward ioctl.
Now with enType=1 propagated through to the kernel's tskbuf §3 fill,
the per-source DMA address vector matches vendor's exactly (still
1 u64 per batch for enType=1 C=1 — the YUV chroma plane is only
needed for C>=2). Tested on av300: cfg_err info=0x1 still fires
mid-execution though, so enType wasn't the trigger by itself.
Also adds kernel-side diagnostic: full reg+0x00..+0xc8 dump both
pre-START and post-fail. Comparison vs vendor's post-insmod-no-Forward
devmem dump shows:
- chip-config + HW-self-populated bits (+0x70..+0xa8) identical
- our per-task RMW of OUTSTANDING clears bit 4 (0xf1f → 0xf0f)
matching vendor's per-task code; vendor's POST-success state is
all-zeros (IRQ handler gates clock) so direct mid-task comparison
needs nnie_watch.ko polling
Post-fail HW touches: cfg_err_info=0x1, +0x64=0x8, +0xb0..+0xb8=0x1
each, +0xc4=0x34e — all written by HW, not us.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend nnie_trace.ko so the write_task_addr kprobe also dumps the
64-byte HW descriptor + the first 320 bytes of the tskbuf it
references (via phys_to_virt on r2 = descriptor phys passed by
vendor). And the start kprobe ioremaps the NNIE register window and
dumps reg+0x00..+0xc8 at the exact moment vendor's start function
fires.
Used this to capture vendor's pre-START state during a known-good
mnist Forward on av300. Result vs cleanroom pre-START dump:
vendor: TASK_ADDR_LO = 0xa9c70000
cleanroom: TASK_ADDR_LO = 0xa9cb0000
ALL other registers match byte-for-byte (chip ID, IRQ_CFG, TIMEOUT,
CLK_GATE, OUTSTANDING, CHECK_SUM, +0x6c..+0xa8 HW-self-populated).
Descriptor + tskbuf content matches byte-for-byte at the phys-mem
level.
The 0xa9c70000 vs 0xa9cb0000 swap happens because vendor allocates
its task descriptor ring at module init (so it gets the first
64KB-aligned slot after the vb_pool), while cleanroom allocates a
fresh descriptor MMZ per Forward (after the test's nnie_tsk
allocation has already taken 0xa9c70000). HW may have a static range
filter for TASK_ADDR or some other reason it rejects ours.
Next: pre-allocate descriptor MMZ at module init (matching vendor's
allocation order) so we get 0xa9c70000, or any deterministic phys,
and see if cfg_err goes away.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanroom NNIE driver now produces bit-identical mnist output to the
vendor open_nnie.ko on av300:
dst[10 scores]: 408 412 401 401 398 412 398 405 449 401 ← vendor
dst[10 scores]: 408 412 401 401 398 412 398 405 449 401 ← cleanroom
Three changes combined to fix the cfg_err info=0x1:
1. Pre-allocate the 64KB task descriptor MMZ at module init (same
point where vendor's svp_nnie_init @0x11a8 calls
cmpi_mmz_malloc_nocache for its task ring). This makes the
descriptor phys stable across Forwards AND lands at 0xa9c70000, the
exact slot vendor's allocation gets, since at module-init time
that's the first available 64KB-aligned MMZ slot after the vb_pool.
Per-Forward allocation was landing AFTER the test's nnie_tsk had
already taken 0xa9c70000.
2. Zero the descriptor MMZ at init time (vendor does this via
osal_memset in svp_nnie_init). HW may scan the ring for valid
pending slots and the uninitialized 65472 bytes past our 64-byte
task could trigger false-positive cfg_err.
3. Stop setting NNIE_RAM bit back to 1 after the clock dance.
Vendor's hal_svp_nnie_enable_ram @0xb8f4 issues SYS ioctl 0xd1 with
arg=0, so the bit goes 0 and stays 0 through START. Setting it back
to 1 was a Phase 8 guess that turned out to invalidate the SRAM
ownership signal HW expects.
Also remove the verbose pre-START register dump (52 readls between
TASK_ADDR write and START write added 50us+ gap that was hypothesised
to matter; turned out not to be the trigger, but keeping the path
tight matches vendor's drv_svp_nnie_start exactly: write_task_addr →
dmb → start with no reads in between).
Verified on av300 board with vendor's test binary using cleanroom
libnnie_neo + cleanroom open_nnie_neo.ko. Forward returns 0, Query
finished=1, output scores match vendor byte-for-byte.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-allocated descriptor MMZ, the IRQ completion (g_nnie_done), and the cause atomic are all single-instance globals. Concurrent Forward callers race on the descriptor write, the START kick, and the IRQ-driven wakeup — repro'd as 20/20 -ETIMEDOUT under 4-way parallel test. Wrap nnie_dispatch_forward in g_nnie_forward_lock (interruptible mutex). Vendor's svp_nnie_forward @0x2198 does the same with osal_down_interruptible on a per-handle semaphore. Verified 4-way concurrent x 5 rounds = 20/20 pass with this lock, zero cfg_err. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that mnist Forward works end-to-end (and the slot-wrap / parallel
tests pass), drop the verbose debugging scaffolding:
- pre-START full reg dump (52 readls)
- post-fail full reg dump
- 100ms polling loop after START
- per-Forward src[i] enType/dims pr_info
- separate task hdr pr_info_once
- redundant Forward arg-size + UVA pr_info_once
- AddTskBuf / RemoveTskBuf pr_info_once → pr_debug
Also collapse the per-task register-setup block into one tidy RMW
group with concise comments and drop the unused `arg_size`, `mmb`,
`task_kvirt`/`task_dma` (void)-cast lines.
Behaviour-preserving cleanup. Re-verified on av300: 1 + 20 sequential
+ 12 (4-way parallel x 3) Forwards all pass, zero cfg_err, dmesg
clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validated end-to-end on av300, not Phase 0 scaffold anymore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These five modules (nnie_dump, nnie_spy, nnie_trace, nnie_watch, probe_test) were ad-hoc kprobe + phys_to_virt helpers used during reverse engineering. They aren't wired into any production build and don't belong in the shipping PR. Tracked separately if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive cleanup of every comment and docstring in kernel/nnie_neo/
and libraries/nnie_neo/:
- Drop all "Phase N" references — the RE phase labels meant something
while the work was in progress; now they're noise.
- Drop "kprobe capture showed", "av300 2026-05-17", "Phase 14
critical finding", and similar journey markers.
- Drop "may be", "appears to", "possibly", "guess" hedging — state
what the code does and why, definitively or not at all.
- Drop vendor function references and addresses ("svp_nnie_init
@0x11a8", "hal_svp_nnie_enable_clk_gt @0xbc18") from inline comments
— they belong in commit messages and RE notes, not in shipping code.
- Drop the verbose ASCII-art-with-byte-offsets prologues; keep the
field-offset macros + struct definitions + a one-paragraph layout
description where the layout is non-obvious.
- Drop the dead `g_sys_regs` ioremap path (was Phase 3 read-only
scaffolding, never used at runtime).
- Drop `g_sys_lock`, `nnie_sys_set_bit`, `nnie_sys_clear_bit`,
`nnie_sys2_set_bit` — all unused after the cleanup.
- Drop `__maybe_unused` annotations on helpers that now have callers.
- Drop the leftover `mmb` / `task_kvirt`/`task_dma` (void)-cast lines.
- Simplify the Kbuild header.
- Strip the trailing pr_info_once Forward/AddTskBuf/RemoveTskBuf
argument dumps that were debugging aids.
Behaviour-preserving. Re-verified on av300: 30 sequential + 12
(4-way parallel x 3) mnist Forwards all pass, output byte-identical
to vendor (408 412 401 401 398 412 398 405 449 401), dmesg shows
only the "/dev/nnie ready" line.
Net diff: -612 lines, +0 functional changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure on V4 chiparchs (hi3516ev200/ev300, gk7205v200): the
libraries/nnie_neo/ Makefile hardcoded a staging-path include
(STAGING ?= ../../../../firmware/output-cv500/build/...) so the
build only worked on a dev machine with the firmware tree adjacent
to the openhisilicon repo. CI builds all subdirs unfiltered for V4,
which tripped this.
Three parts to the fix:
1. Self-contain the userspace headers. Bundle the three vendor SVP
headers (hi_nnie.h, mpi_nnie.h, hi_comm_svp.h) into
libraries/nnie_neo/include/ — same pattern as libraries/ive_neo/
(which ships its own hi_comm_ive.h / hi_ive.h / mpi_ive.h).
Drop the STAGING hack from the Makefile so only repo-relative
includes remain.
2. Patch the bundled hi_comm_svp.h:
- drop `#include "hi_errno.h"` (header not in tree for cv500);
replace with `#include "hi_common.h"` which provides
HI_DEF_ERR + EN_ERR_* + the MOD_ID_E enum
- add SVP_NNIE_HANDLE typedef (was in the cv500-kernel-only
hi_common.h)
- add HI_ID_SVP_NNIE = 51 to libraries/include/hi_common.h next
to the existing MOD_ID_E values
3. Gate libraries/nnie_neo to cv500-only in libraries/Makefile —
NNIE block doesn't exist on V4, cv200, cv100, etc., so the
cv500-specific MPI surface shouldn't be compiled there.
Verified on av300: bundled headers + repo-relative includes build
clean; rebuilt libnnie_neo.so still produces the byte-identical
mnist output (408 412 401 401 398 412 398 405 449 401).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Open-source replacement for vendor `open_nnie.ko` on cv500-family SoCs (hi3516cv500, av300, dv300). Drives the NNIE CNN inference block at phys `0x11100000` and exposes the vendor-compatible ABI on `/dev/nnie`, so existing userspace using vendor `libnnie.so` works unchanged.
Closes #111.
Validation
mnist Forward output byte-identical to vendor:
`open_gdc.ko` loaded, idle)
What's in the box
- `kernel/nnie_neo/`: kernel driver
  - `nnie_init.c`: platform-device probe (DT `hisilicon,hisi-nnie`), IRQ + MMZ pre-alloc
  - `nnie_neo.c`: `/dev/nnie` miscdevice + Forward / AddTskBuf / RemoveTskBuf ioctl handlers + HW dispatch
  - `nnie_hw_task.h`: 64-byte HW descriptor + tskbuf variable-length tail layout
  - `nnie_hw_regs.h`: cv500 NNIE register map (`0x11100000`)
  - `kernel/hi3516cv500.kbuild`
- `libraries/nnie_neo/`: userspace
  - `nnie_ops.c`: `HI_MPI_SVP_NNIE_LoadModel`, `_Forward`, `_ForwardWithBbox`, `_Query`, `_AddTskBuf`, `_RemoveTskBuf`, `_UnloadModel`, `_GetTskBufSize`
  - `nnie_wk_format.h`: `.wk` file-format constants + header struct
Definition-of-done (issue #111)
- `open_nnie_neo.ko` bound to `hisilicon,hisi-nnie` on cv500
Known limitations (follow-ups, not blockers)
- `HI_MPI_SVP_NNIE_GetTskBufSize` returns `HI_ERR_SVP_NNIE_NOT_SURPPORT`. Vendor walks the parsed instruction stream to compute per-segment buffer sizes; userspace callers that need it should pre-compute or hardcode for now.
- `HI_MPI_SVP_NNIE_ForwardWithBbox` returns `HI_ERR_SVP_NNIE_NOT_SURPPORT`. The bbox-mode dispatch path (4 extra ioctls in the `0x4d05..0x4d09` range) isn't wired yet.
- `u32TmpBufSize` is set to 8 MB heuristically rather than computed from the model. Fine for classification-class models; large detection models may need vendor's exact value.
- `enNetType=2`) tskbuf tail layout uses the same builder as CNN. RECURRENT-net dispatch is not exercised.
- `enType=SVP_BLOB_TYPE_S32` and `NodeId=(j+1)*8` since the file format doesn't store these for outputs; vendor's libnnie does the same. The W/H/C field order differs from inputs: vendor swaps so the layer's feature count lands in Width.
Test plan
- `hi3516cv500` chiparch
- `nnie_neo` only included from `hi3516cv500.kbuild`)
- `open_nnie_neo.ko` on av300, run vendor's `sample_nnie_main` against mnist.wk
🤖 Generated with Claude Code