Add SPI NAND read support to flash agent (read-only first cut)#71
Merged
Fourth agent platform after ev300 (V4), cv300 (V3), and av300/dv300/cv500 (V5/cv500-family). V3A is 3519v101 + av200: Cortex-A7 with V3-era peripheral addresses (UART 0x12100000, WDT 0x12080000) but DDR at 0x80000000 like the cv500 family. Memory map per qemu-hisilicon's hi3519v101_soc.

The bootrom-protocol quirks (sendFrameForStart handshake, PRESTEP1 DDR training step, non-fatal TAILs) were already landed for the install/burn flow in #47 + #48 + #65, so this is just the agent build wiring plus one real protocol fix the agent path was missing.

## Fix: don't pre-truncate spl_override at the call site

`defib agent upload` and `agent flash` were doing `spl_data = uboot[:profile.spl_max_size]` before passing to `send_firmware()`. When `_send_spl()` then scans this truncated buffer for the LZMA/gzip SPL boundary, it can't find anything past profile_max, so for chips where the OpenIPC SPL is *larger* than the HiTool reference (e.g. av200's SVB-enabled SPL is 0x6800, profile_max is 0x4F00), we send 0x1900 too few bytes. The SPL never reaches its post-DDR-init code, hangs after the SPL TAIL, and the agent HEAD frame for 0x81000000 gets a `0x08` rejection.

Fix: pass the full u-boot binary as `spl_override`. `_send_spl()` already handles the slicing via its detected boundary.

Verified on real hardware:
- hi3516av200 (NAND board, on /dev/ttyUSB1, ether8): SPL detected at 0x6800, agent uploads, runs, READY received. Flash JEDEC reads byte-shifted (this board has SPI NAND; the agent's NOR-only flash driver is a separate, larger limitation).
- hi3516cv300 regression (on /dev/uart-IVGHP203Y-AF, ether3): SPL now detected at 0x5400 (was being clamped to 0x4F00 pre-fix). Agent loads identically, jedec=ef4018, 256 KiB read at 921600 baud = 84.9 KB/s, same as before.

## Aliasing

Match the existing `gk7205v300 → gk7205v200` shape: one agent binary serves the V3A family, multiple chip names route to it. Add `hi3519v101 → hi3519v101` (own binary) and `hi3516av200 → hi3519v101` to `chip_to_agent`.

## Verification

QEMU `qemu-system-arm -M hi3519v101 -kernel agent-hi3519v101.elf`: agent boots cleanly, READY/DEFIB packet stream, no faults.

Real hi3516av200 board:
```
upload ok=True
agent ready: ram=0x80000000 caps=0x7f version=2
```

cv300 regression (testing that the spl_override fix doesn't break what landed in #66/#67): jedec_id, ram_base, caps, throughput unchanged.

make test HOST_CC=gcc: 5406/5406. pytest: 402 passed, 2 skipped. ruff & mypy clean.

## Known limitation

The av200 board in this lab has SPI NAND. The agent's flash driver (`agent/spi_flash.c`) supports SPI NOR only: it uses memory-mapped reads at 0x14000000 and direct FMC register commands. On NAND, JEDEC reads return shifted bytes and `read_memory` returns 0 bytes. The agent still loads, runs, and emits READY on NAND boards; only the read/erase/write/scan operations don't work. SPI NAND support is a separate, larger piece of work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
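
The truncation effect described under the fix can be reproduced with a small sketch. This is not defib's `_send_spl()`: `find_spl_boundary`, the magic-byte scan, and the synthetic image are hypothetical, assuming only that the boundary is found by searching for a gzip or LZMA header that follows the SPL.

```python
GZIP_MAGIC = b"\x1f\x8b\x08"   # gzip member header (deflate)
LZMA_MAGIC = b"\x5d\x00\x00"   # common LZMA-alone header prefix


def find_spl_boundary(image: bytes, fallback: int) -> int:
    """Offset of the first compressed-payload header, or `fallback` if none is found."""
    hits = []
    for magic in (GZIP_MAGIC, LZMA_MAGIC):
        pos = image.find(magic)
        if pos > 0:
            hits.append(pos)
    return min(hits) if hits else fallback


PROFILE_SPL_MAX = 0x4F00  # profile_max quoted in the commit message
# Synthetic image: 0x6800 bytes of SPL filler followed by a compressed payload.
uboot = b"\xaa" * 0x6800 + GZIP_MAGIC + bytes(64)

# Pre-fix call site: the buffer is cut before the scan, so the scan finds nothing.
clamped = find_spl_boundary(uboot[:PROFILE_SPL_MAX], PROFILE_SPL_MAX)
# Post-fix call site: the full image is scanned and the real boundary is found.
detected = find_spl_boundary(uboot, PROFILE_SPL_MAX)

print(hex(clamped), hex(detected), hex(detected - clamped))
# prints: 0x4f00 0x6800 0x1900
```

The 0x1900-byte difference is the shortfall the commit message describes for av200.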
The agent's flash driver was NOR-only — JEDEC ID readback on a SPI NAND
chip returned shifted bytes and `read_memory` returned 0 bytes. This
adds NAND detection plus a register-based read path so `defib agent
read` works on NAND boards (e.g. hi3516av200 with Macronix
MX35LF1GE4AB, 1Gbit / 128 MiB).
Read-only for now: erase/write are NOR-specific (different opcodes,
different protection model, ECC and bad-block management). Those
handlers (CMD_ERASE, CMD_WRITE, CMD_FLASH_PROGRAM, CMD_FLASH_STREAM,
CMD_SCAN) now return ACK_FLASH_ERROR cleanly instead of silently
issuing NOR commands the chip won't honor.
## Changes
- `agent/spi_flash.h`: add `flash_type` field to `flash_info_t`
  (FLASH_TYPE_NOR or FLASH_TYPE_NAND), exposed in CMD_INFO response.
- `agent/spi_flash.c`:
  - `nand_identify()` recognizes Macronix `0xc2 0x12`
    (MX35LF1GE4AB), tolerating the leading dummy byte some SPI NAND
    chips emit during 0x9F READ_ID.
  - `nand_wait_oip()` polls SPI NAND status via
    GET_FEATURE 0xC0 + bit 0 (Operation In Progress).
  - `nand_read()` issues PAGE_READ (0x13) → wait OIP → READ_FROM_CACHE
    (0x03) with `OP_CFG_DUMMY_NUM(1)` for the 1-byte dummy SPI NAND
    READ_x1 requires. On-chip ECC (default-enabled on MX35LF*) gives
    us byte-perfect reads; no host-side ECC needed.
  - `flash_init()` dispatches by JEDEC: NAND skips fmc_enter_boot
    (NAND has no memory-mapped boot mode) and reports 128 MiB total /
    128 KiB block / 2 KiB page.
  - `flash_read()` early-dispatches to a per-page loop on NAND;
    NOR path unchanged.
- `agent/main.c`:
  - `handle_info` reports `flash_info.flash_type` in the spare byte
    of the JEDEC ID slot and uses `flash_info.sector_size`
    (was hardcoded 0x10000).
  - `handle_flash_write`, `handle_flash_program`, `handle_flash_stream`,
    `handle_erase`, `handle_scan`: early-return ACK_FLASH_ERROR on NAND
    instead of issuing NOR commands the chip won't accept.
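
The per-page read loop can be sketched on the host side as a command planner. This is an illustrative model in Python, not the agent's C code: `plan_nand_read` is a hypothetical name, and the 256-byte chunk limit is taken from the FMC I/O buffer size mentioned in the PR summary.

```python
PAGE_SIZE = 2048   # 2 KiB page, as reported by flash_init
FMC_IOBUF = 256    # FMC I/O buffer size (from the PR summary)


def plan_nand_read(addr: int, length: int) -> list:
    """List the SPI NAND operations needed to read `length` bytes at `addr`."""
    ops = []
    loaded_row = None
    while length > 0:
        # Row = page index, column = byte offset inside that page.
        row, column = divmod(addr, PAGE_SIZE)
        if row != loaded_row:
            ops.append(("PAGE_READ", row))   # 0x13: array -> on-chip cache
            ops.append(("WAIT_OIP",))        # poll status bit 0 until clear
            loaded_row = row
        # Never cross a page boundary or exceed the I/O buffer in one transfer.
        chunk = min(length, PAGE_SIZE - column, FMC_IOBUF)
        ops.append(("READ_FROM_CACHE", column, chunk))  # 0x03
        addr += chunk
        length -= chunk
    return ops


for op in plan_nand_read(0x7F0, 0x20):
    print(op)
```

A 32-byte read starting 16 bytes before a page boundary yields two PAGE_READ cycles, one per page touched.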
## Verification
Real hi3516av200 board (Macronix MX35LF1GE4AB SPI NAND, 128 MiB):
jedec=00c212 flash=131072 KiB block=128 KiB
64 KiB read at 921600 baud: 0.86 s = 76 KB/s
256 KiB read at 921600 baud: 3.01 s = 85.0 KB/s
1 MiB read at 921600 baud: 11.48 s = 89.2 KB/s
Same 64 KiB read across two passes: identical (no read errors)
Found "System startup" string at offset 0x175f
(real HiSilicon SDK u-boot content, not OpenIPC — board has
factory firmware, throughput and consistency confirm read works)
Throughput matches NOR path on the same baud rate — UART is the
bottleneck on both, the per-page PAGE_READ→READ_FROM_CACHE overhead
is ~negligible at 921600.
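
As a cross-check of the figures above, the rates can be recomputed from the reported sizes and times. The sketch assumes 8N1 framing (10 bits on the wire per byte), which the commit message does not state.

```python
BAUD = 921600
BITS_PER_BYTE = 10  # assumption: 8 data bits + start + stop (8N1)


def rates(num_bytes: int, seconds: float):
    """Return (kB/s with 1000-byte kB, KiB/s with 1024-byte KiB)."""
    bytes_per_s = num_bytes / seconds
    return bytes_per_s / 1000, bytes_per_s / 1024


reads = [
    ("64 KiB", 64 * 1024, 0.86),
    ("256 KiB", 256 * 1024, 3.01),
    ("1 MiB", 1024 * 1024, 11.48),
]
for label, size, secs in reads:
    kb, kib = rates(size, secs)
    print(f"{label}: {kb:.1f} kB/s, {kib:.1f} KiB/s")

line_rate = BAUD / BITS_PER_BYTE  # payload ceiling in bytes/s under 8N1
print(f"line rate: {line_rate:.0f} B/s")
```

Computed this way, the 64 KiB figure (76) corresponds to bytes/1000, while the 85.0 and 89.2 figures correspond to bytes/1024. Under the 8N1 assumption the 1 MiB read runs within about 1 % of the 92160 B/s line rate, which is consistent with UART being the bottleneck.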
QEMU `qemu-system-arm -M hi3519v101`: agent boots clean, READY/DEFIB
packets, no faults (NAND code paths gated by chip detection — QEMU
doesn't emulate a NAND chip but the NOR path still works there).
cv300 hardware regression deferred (FTDI USB-serial glitch on that
adapter). NOR code path is logically unchanged: flash_init sets
`current_flash_type` and `flash_read` adds an early NAND check that
returns before any NOR code; the NOR loop after it is byte-identical
to before.
```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
make SOC=hi3516ev300: 13624 B (was 12652)
make SOC=hi3516cv300: 13844 B (was 12872, ARM926 + MMU + NAND)
make SOC=hi3516cv500: 13608 B (was 12636)
make SOC=hi3519v101: 13608 B (was 12636)
```
## Out of scope (follow-up)
- NAND erase + program (block erase 0xD8, page program 0x02 + 0x10).
- ECC mismatch detection / reporting.
- Bad-block management (skip blocks with FF byte 0 of OOB).
- Other SPI NAND chip IDs (currently only Macronix MX35LF1GE4AB).
- Quad-IO read for higher throughput (not needed at 921600 baud).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii added a commit that referenced this pull request on May 5, 2026
Builds on the SPI NAND read support: adds the missing write side so the agent can do a full reflash on hi3516av200 (and any other board with a recognized SPI NAND chip).

## Changes

agent/spi_flash.c:
- New SPI NAND command opcodes: BLOCK_ERASE 0xD8, PROGRAM_LOAD 0x02, PROGRAM_LOAD_RANDOM 0x84, PROGRAM_EXECUTE 0x10.
- nand_get_feature/nand_set_feature: GET_FEATURES 0x0F / SET_FEATURE 0x1F register access.
- nand_write_enable: WRITE_ENABLE 0x06 (sets WEL in status feature).
- nand_wait_oip refactored to return the final status byte so callers can inspect E_FAIL (bit 2) / P_FAIL (bit 3) bits.
- nand_erase_block(row): WE → BLOCK_ERASE → wait OIP → check E_FAIL.
- nand_program_page(row, column, data, len): WE → PROGRAM_LOAD (first 256-byte chunk, resets cache) → PROGRAM_LOAD_RANDOM (subsequent chunks, preserves cache) → PROGRAM_EXECUTE → wait OIP → check P_FAIL.
- flash_init for NAND now clears block-protect bits via SET_FEATURE 0xA0 = 0x00. Most SPI NAND chips ship with all blocks locked, so this is the equivalent of NOR's flash_unlock.
- flash_erase_sector dispatches to nand_erase_block when type=NAND.
- flash_write_page dispatches to nand_program_page when type=NAND.

agent/main.c:
- Remove NAND guard early-returns from handle_erase, handle_flash_write, handle_flash_program, handle_flash_stream, handle_scan. These flow through to the NOR or NAND path now.

## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)

Test cycle on a sacrificial block at flash offset 0x800000 (8 MiB, well past u-boot/kernel partitions, in unwritten 0xFF area per the earlier 16 MiB dump): backup → erase → write pattern → verify → erase → restore backup → verify.

- ERASE block (128 KiB): 0.02 s → 99.9 % of bytes are 0xFF post-erase
- WRITE 64 pages × 2 KiB: 1.53 s = 83.4 KB/s
- Page-program report: success (P_FAIL bit clear)
- Block-erase report: success (E_FAIL bit clear)

## Known limitation: read-side off-by-one at page boundaries

The READ_FROM_CACHE register-mode path inherited from the read PR (#71) returns 0x00 as the first byte of each 2 KiB page instead of the actual content: the FMC's OP_CFG_DUMMY_NUM accounting doesn't quite match the chip's READ_FROM_CACHE timing. About 64 bytes per 128 KiB block (0.05 %) read incorrectly; the rest is byte-perfect. This affects byte-perfect verification of the write side via readback.

The strings in the 16 MiB dump from #71 ("Hisilicon HI3516AV200 DEMO Board", "U-Boot 2010.06-dirty", "Linux-3.18.20-hi3516av2.0") confirm mid-page reads are accurate; the issue is localised to the first byte of each PAGE_READ → READ_FROM_CACHE cycle.

The proper fix is switching the NAND read path to the FMC's NAND-aware FMC_OP_CTRL register (rather than the NOR-style FMC_OP). That is the path u-boot uses, and it handles the PAGE_READ → READ_FROM_CACHE timing internally. It is a substantial rewrite worth its own PR; tracked as a follow-up.

## Test suites

- make -C agent test HOST_CC=gcc: 5406/5406
- pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
- ruff & mypy: clean
- All four agent SoCs build clean (ev300, cv300, cv500, 3519v101).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
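
The program sequence and status check listed above can be sketched as follows. This is a host-side illustration, not the agent's C implementation: `program_page_ops` and `decode_status` are hypothetical names, and the sketch always starts at column 0, whereas the real `nand_program_page` also takes a column argument. Opcodes and bit positions are the ones quoted in the commit message.

```python
OIP = 1 << 0      # Operation In Progress
E_FAIL = 1 << 2   # erase failed
P_FAIL = 1 << 3   # program failed
CHUNK = 256       # bytes loaded per PROGRAM_LOAD* command


def program_page_ops(row: int, data: bytes) -> list:
    """List the commands needed to program one page, starting at column 0."""
    ops = [("WRITE_ENABLE", 0x06)]
    for column in range(0, len(data), CHUNK):
        size = len(data[column:column + CHUNK])
        if column == 0:
            # First chunk: PROGRAM_LOAD resets the cache before loading.
            ops.append(("PROGRAM_LOAD", 0x02, column, size))
        else:
            # Later chunks: PROGRAM_LOAD_RANDOM keeps what is already loaded.
            ops.append(("PROGRAM_LOAD_RANDOM", 0x84, column, size))
    ops.append(("PROGRAM_EXECUTE", 0x10, row))
    return ops


def decode_status(status: int) -> dict:
    """Split the status byte returned by the OIP wait into named flags."""
    return {
        "busy": bool(status & OIP),
        "erase_failed": bool(status & E_FAIL),
        "program_failed": bool(status & P_FAIL),
    }


ops = program_page_ops(row=4096, data=bytes(2048))
print(len(ops), ops[1][0], ops[2][0], ops[-1][0])
print(decode_status(0x08))
```

For a full 2 KiB page this gives one WRITE_ENABLE, eight load commands, and one PROGRAM_EXECUTE.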
Builds on the SPI NAND read support: adds the missing write side so the agent can do a full reflash on hi3516av200 (and any other board with a recognized SPI NAND chip). Also fixes the read-side byte-0-per-page off-by-one from the previous commit.

## Read fix: skip iobuf[0]

The FMC captures the SPI NAND chip's post-address dummy byte at iobuf[0] (always 0x00 because the chip drives the dummy line low), not as a transparent dummy. Reading iobuf[0..N-1] gave the dummy byte at position 0 of every 2 KiB page, which was the off-by-one bug. Fix: request `chunk + 1` bytes, set OP_CFG_DUMMY_NUM(0), copy from iobuf[1..N]. Verified byte-perfect against u-boot env strings, UBI volume headers, kernel image bytes.

## Erase + program

agent/spi_flash.c:
- New SPI NAND command opcodes: BLOCK_ERASE 0xD8, PROGRAM_LOAD 0x02, PROGRAM_LOAD_RANDOM 0x84, PROGRAM_EXECUTE 0x10.
- nand_get_feature/nand_set_feature: GET_FEATURES 0x0F / SET_FEATURE 0x1F register access.
- nand_write_enable: WRITE_ENABLE 0x06 (sets WEL in status feature).
- nand_wait_oip refactored to return the final status byte so callers can inspect E_FAIL (bit 2) / P_FAIL (bit 3) bits.
- nand_erase_block(row): WE → BLOCK_ERASE → wait OIP → check E_FAIL.
- nand_program_page(row, column, data, len): WE → PROGRAM_LOAD (first 256-byte chunk, resets cache) → PROGRAM_LOAD_RANDOM (subsequent chunks, preserves cache) → PROGRAM_EXECUTE → wait OIP → check P_FAIL.
- flash_init for NAND now clears block-protect bits via SET_FEATURE 0xA0 = 0x00. Most SPI NAND chips ship with all blocks locked, so this is the equivalent of NOR's flash_unlock.
- flash_erase_sector dispatches to nand_erase_block when type=NAND.
- flash_write_page dispatches to nand_program_page when type=NAND.

agent/main.c:
- Remove NAND guard early-returns from handle_erase, handle_flash_write, handle_flash_program, handle_flash_stream, handle_scan. These flow through to the NOR or NAND path now.

## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)

End-to-end test cycle on a sacrificial block (flash offset 0x800000): backup → erase → write 64 pages × 2 KiB pattern → read-back-verify → erase → restore backup → final verify.

- ERASE block (128 KiB): 0.02 s, 100.0 % bytes are 0xFF post-erase
- WRITE 64 pages: 1.53 s = 83.4 KB/s
- READ-BACK verify: byte-for-byte match (131072 B) ✓
- RESTORE original: byte-for-byte match against backup ✓

All four agent SoCs build clean (ev300, cv300, cv500, 3519v101).

## Test suites

- make -C agent test HOST_CC=gcc: 5406/5406
- pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
- ruff & mypy: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
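
The read fix can be illustrated with a deliberately simplified model of the FMC transfer. `fmc_transfer` and `read_from_cache_fixed` are hypothetical names; the model only encodes the behaviour stated in the commit message (slot 0 of the I/O buffer holds the captured dummy byte) and does not try to reproduce the exact pre-fix behaviour.

```python
def fmc_transfer(cache: bytes, column: int, nbytes: int) -> bytes:
    """Model the FMC filling iobuf: slot 0 is the captured dummy byte (0x00)."""
    return b"\x00" + cache[column:column + nbytes - 1]


def read_from_cache_fixed(cache: bytes, column: int, chunk: int) -> bytes:
    """Request chunk + 1 bytes and drop iobuf[0], as the fix describes."""
    iobuf = fmc_transfer(cache, column, chunk + 1)
    return iobuf[1:chunk + 1]


# Synthetic 2 KiB page whose first byte is non-zero, so a stray 0x00 is visible.
page = bytes((i * 7 + 1) % 256 for i in range(2048))

iobuf = fmc_transfer(page, 0, 129)
fixed = read_from_cache_fixed(page, 0, 128)
print(iobuf[0], page[0], fixed == page[:128])
```

Copying from `iobuf[0]` would place the 0x00 dummy where the page's first byte belongs; copying from `iobuf[1]` returns the page content unchanged.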
widgetii added a commit that referenced this pull request on May 5, 2026
## Summary
The flash doctor's `CMD_SCAN` was NOR-only: it used direct memory-mapped
pointer reads at `FLASH_MEM`, which silently returns garbage on NAND (no
boot-mode window). And it never read the OOB area, where factory-marked
bad blocks live (OOB[0] of page 0 of the bad block per the standard SPI
NAND convention).
This PR makes the scan NAND-aware so it correctly classifies blocks on
SPI NAND boards — including a new `BAD_BLOCK` status, surfaced in the
TUI flash doctor with a distinct glyph (`✗`) and color (magenta).
## Changes
| Component | Change |
|---|---|
| `agent/spi_flash.{c,h}` | New `flash_read_oob(block, buf, len)`: reads OOB bytes of page 0 via PAGE_READ + READ_FROM_CACHE at column = `NAND_PAGE_SIZE`, with the same iobuf[1] dummy-skip from #71. NOR returns -1 (no OOB). |
| `agent/main.c` | New `SCAN_BAD_BLOCK = 0x06` status. `handle_scan` now routes data-area reads through `flash_read()` (NAND-aware) into a static 128 KiB buffer instead of a mem-mapped pointer. For NAND blocks: read OOB[0..1] of page 0; if `OOB[0] != 0xFF`, report `SCAN_BAD_BLOCK` and skip the data-area scan. The Pass-2 stability re-read (UNSTABLE) is gated to NOR: NAND on-chip ECC auto-corrects, so re-reads always match. |
| `src/defib/agent/client.py` | Add `SectorStatus.BAD_BLOCK = 0x06` + `ScanResult.bad_block` accessor. |
| `src/defib/tui/screens/flash_doctor.py` | New `BLOCK_BAD = "✗"` glyph in magenta; surface bad-block count in `ScanStats` panel (only when non-zero). |
## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)
```
chip: jedec=00c212 flash=131072 KiB block=128 KiB
Full-chip scan: 1024 blocks in 59.1 s
GOOD: 968 (data area has stable content)
EMPTY: 56 (all 0xFF)
BAD_BLOCK: 0 (this chip has zero factory bad blocks, within
spec; MX35LF1GE4AB allows up to 20 of 1024)
```
The OOB-read code path executed for all 1024 blocks without false
positives (zero spurious `BAD_BLOCK` reports), confirming the `OOB[0] !=
0xFF` check is wired correctly end-to-end. If a chip with factory bad
blocks shows up in the lab later, those blocks will be reported
distinctly instead of mixing in with the data-area pattern checks.
Synthesizing a bad block (writing `0x00` to OOB[0] of a sacrificial
block) would require extending `nand_program_page` to allow OOB-column
writes — out of scope for this PR but tracked as a follow-up test if
wider chip coverage demands it.
## Bad-block detection logic (per JEDEC SPI NAND convention)
| OOB[0] of page 0 | Block status |
|---|---|
| `0xFF` | Good: proceed with data-area scan |
| any other value | Factory-marked bad: report `BAD_BLOCK`, skip data scan |
This applies only to NAND. NOR has no OOB, so the existing NOR scan path
is unchanged.
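
The detection rule in the table can be written as a small sketch. It is a simplified host-side model, not the agent's `handle_scan`: `oob_address` and `classify_nand_block` are hypothetical names, only the GOOD / EMPTY / BAD_BLOCK outcomes from the verification log are modelled, and other scan statuses are omitted.

```python
NAND_PAGE_SIZE = 2048      # 2 KiB page
PAGES_PER_BLOCK = 64       # 128 KiB block / 2 KiB page
BAD_BLOCK = 0x06           # SCAN_BAD_BLOCK / SectorStatus.BAD_BLOCK value from this PR


def oob_address(block: int):
    """(row, column) of OOB[0] for page 0 of `block`."""
    return block * PAGES_PER_BLOCK, NAND_PAGE_SIZE


def classify_nand_block(oob0: int, data: bytes) -> str:
    """Apply the bad-block marker check before any data-area check."""
    if oob0 != 0xFF:
        return "BAD_BLOCK"          # factory-marked: skip the data-area scan
    if data.count(0xFF) == len(data):
        return "EMPTY"              # erased block, all 0xFF
    return "GOOD"


print(oob_address(64))
print(classify_nand_block(0x00, b"\xff" * 16))
print(classify_nand_block(0xFF, b"\xff" * 16))
print(classify_nand_block(0xFF, b"\x00\xff"))
```

Block 64 is the sacrificial block at flash offset 0x800000 used in the earlier erase/program test; its page 0 is row 4096.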
## Out of scope (follow-ups)
- ECC mismatch reporting (could surface as new `WORN` status when the
chip's STATUS_ECC bits in feature `0xC0` indicate corrections — pages
still readable but accumulating bit errors).
- OOB programming path — needed both to synthesize bad blocks for
testing and to mark blocks bad after wear is detected.
- Bad-block-aware erase/write — currently erase/write hit all blocks
uniformly; a chip with bad blocks would see write failures we'd need to
handle.
```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
make SOC=hi3516ev300/cv300/cv500/3519v101: all build clean
```
## Test plan
- [x] Real av200 hardware: full-chip scan, OOB-read path executed for
all 1024 blocks, zero false-positive bad-block reports
- [x] All test suites green
- [ ] Synthetic bad-block test (deferred — needs OOB-write support,
separate PR)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

The agent's flash driver was NOR-only. This PR adds full read + erase + program for SPI NAND, so `defib agent read/erase/write` works on NAND boards (the hi3516av200 we just landed in #69 ships with a Macronix MX35LF1GE4AB, 1Gbit / 128 MiB).

## Two commits

**1. Add SPI NAND read support to flash agent (`777538a`)**

- `nand_identify` recognizes Macronix `0xc2 0x12` (MX35LF1GE4AB), tolerating the leading dummy byte some SPI NAND chips emit during 0x9F READ_ID.
- `flash_init` dispatches by JEDEC: NAND skips `fmc_enter_boot` (no memory-mapped boot mode on NAND) and reports 128 MiB total / 128 KiB block / 2 KiB page.
- `nand_read` issues PAGE_READ (0x13) → wait OIP → READ_FROM_CACHE (0x03) chunked through the FMC's 256-byte I/O buffer.
- `flash_read` early-dispatches to NAND on chip type; NOR path unchanged.
- `flash_info.flash_type` reported in CMD_INFO so the host can branch on chip type.

**2. Add SPI NAND erase + program, byte-perfect (`563cb14`)**

- Read fix: the FMC captures the SPI NAND chip's post-address dummy byte at `iobuf[0]` (always 0x00 because the chip drives the dummy line low) rather than consuming it transparently. Reading `iobuf[0..N-1]` gave the dummy byte at position 0 of every 2 KiB page, which was the off-by-one bug from the read commit. Fix: request `chunk + 1` bytes, set `OP_CFG_DUMMY_NUM(0)`, copy `iobuf[1..N]`.
- `nand_get_feature` / `nand_set_feature` for GET_FEATURES 0x0F / SET_FEATURE 0x1F.
- `nand_write_enable` (0x06).
- `nand_erase_block`: WE → BLOCK_ERASE 0xD8 (3-byte row) → wait OIP → check E_FAIL bit.
- `nand_program_page`: WE → PROGRAM_LOAD 0x02 (first chunk, resets cache) → PROGRAM_LOAD_RANDOM 0x84 (rest, preserves cache) → PROGRAM_EXECUTE 0x10 → wait OIP → check P_FAIL bit.
- `flash_init` for NAND now clears block-protect bits via SET_FEATURE 0xA0 = 0x00 (NAND equivalent of NOR's `flash_unlock`).
- `flash_erase_sector` and `flash_write_page` dispatch to NAND helpers when type=NAND.
- The NAND guard early-returns in `main.c` (added in commit 1) are removed: erase/write/scan/flash_program/flash_stream now flow to the right path.

## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)

End-to-end test cycle on a sacrificial block (flash offset `0x800000`): backup → erase → write 64 pages × 2 KiB pattern → read-back verify → erase → restore backup → final verify.

```
ERASE block (128 KiB): 0.02 s, 100.0 % bytes are 0xFF post-erase
WRITE 64 pages × 2 KiB: 1.53 s = 83.4 KB/s, P_FAIL=0
READ-BACK verify: byte-for-byte match (131072 B) ✓
RESTORE original: byte-for-byte match against backup ✓
```
Plus the 16 MiB factory-firmware dump from the earlier read-only iteration shows real, structured content with byte-perfect strings:

| Offset | String |
|---|---|
| `0x0080a04` | `"U-Boot 2010.06-dirty (Apr 22 2…"` |
| `0x0200021` | `"Linux-3.18.20-hi3516av2.0…"` |
| `0x03da91d` | `"Hisilicon HI3516AV200 DEMO Board"` |
| `0x03de19d` | `"spi-nand@0"` |

Throughput at 921600 baud is the same as NOR: UART is the bottleneck, and the per-page PAGE_READ→READ_FROM_CACHE overhead is negligible.
```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
make SOC=hi3516ev300/cv300/cv500/3519v101: all build
```
Implementation table
Out of scope (separate follow-ups)
Test plan
🤖 Generated with Claude Code