Detect NAND factory bad blocks in flash doctor scan #72
Merged
The flash doctor's CMD_SCAN was NOR-only: it used direct memory-mapped
pointer reads at FLASH_MEM, which silently return garbage on NAND
chips (no boot-mode window). It also never read the OOB area, where
factory-marked bad blocks live (OOB[0] of page 0 of the bad block).

This PR makes the scan NAND-aware so it correctly classifies blocks on
SPI NAND boards, including a new BAD_BLOCK status for factory bad
blocks, surfaced in the TUI flash doctor with a distinct glyph (✗) and
color (magenta).
## Changes
- `agent/spi_flash.{c,h}`: add `flash_read_oob(block, buf, len)` that
reads OOB bytes of page 0 of a block via PAGE_READ + READ_FROM_CACHE
at column = NAND_PAGE_SIZE, with the same iobuf[1] dummy-skip from
PR #71. NOR returns -1 (no OOB).
- `agent/main.c`:
  - New `SCAN_BAD_BLOCK = 0x06` status code.
  - `handle_scan` now routes data-area reads through `flash_read()`
    (NAND-aware) into a static 128 KiB `scan_buf`, instead of reading
    through a direct memory-mapped pointer. For NAND blocks it reads
    OOB[0..1] of page 0 first; if OOB[0] != 0xFF it reports
    SCAN_BAD_BLOCK and skips the data scan; otherwise it continues
    with the CRC32 and pattern checks.
  - The Pass-2 stability re-read (UNSTABLE detection) is gated to NOR
    only: NAND on-chip ECC auto-corrects single-bit flips, so re-reads
    return the same bytes. Distinct ECC-correction telemetry is a
    separate follow-up.
- `src/defib/agent/client.py`: add `SectorStatus.BAD_BLOCK = 0x06` and
matching `ScanResult.bad_block` accessor.
- `src/defib/tui/screens/flash_doctor.py`: render BAD_BLOCK as `✗` in
magenta in the sector grid, and surface its count in the live
ScanStats panel (only when non-zero).
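The NAND branch of the scan described above can be modeled on the host as a small classifier. This is a minimal sketch, not the agent's C code: `classify_nand_block`, the string labels, and the simplification that any non-0xFF data byte means GOOD are illustrative assumptions (the real scan also runs a pattern check). Only the `OOB[0] != 0xFF` test and the `0x06` status value come from this PR.

```python
import zlib
from typing import Optional, Tuple

SCAN_BAD_BLOCK = 0x06  # status code added by this PR


def classify_nand_block(oob: bytes, data: bytes) -> Tuple[str, Optional[int]]:
    """Return (label, crc32) for one NAND block.

    oob  -- first OOB bytes of page 0 of the block
    data -- data-area contents of the block (ignored for bad blocks)
    """
    # Bad-block marker: anything other than 0xFF in OOB[0] of page 0.
    if oob[0] != 0xFF:
        return "BAD_BLOCK", None  # the data-area scan is skipped

    # Erased flash reads back as all 0xFF.
    if all(b == 0xFF for b in data):
        return "EMPTY", zlib.crc32(data)

    # Simplification: any programmed content counts as GOOD here.
    return "GOOD", zlib.crc32(data)
```

Checking OOB first means a marked block never reaches the data-area checks, which matches the ordering described for `handle_scan`.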
## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)
    jedec=00c212 flash=131072 KiB block=128 KiB
    Full-chip scan: 1024 blocks in 59.1 s
      GOOD:      968  (data area has stable content)
      EMPTY:      56  (all 0xFF)
      BAD_BLOCK:   0  (this chip has zero factory bad blocks, within
                       spec; MX35LF1GE4AB allows up to 20 of 1024)
The OOB-read code path executed for all 1024 blocks without false
positives (zero spurious BAD_BLOCK reports), confirming that the
OOB[0] != 0xFF check does not misfire on a clean chip. If a chip with
factory bad blocks shows up later, those blocks will now be reported
distinctly instead of mixing in with the data-area pattern checks.

Synthesizing a bad block (writing 0x00 to OOB[0] of a sacrificial
block) would require extending the agent's program path to handle OOB
column writes. That is out of scope for this PR but tracked as a
follow-up test if wider chip coverage demands it.
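The figures in the scan output are mutually consistent, which can be checked with a few lines of arithmetic. This sketch uses only the numbers reported above; the variable names are illustrative.

```python
flash_kib = 131072      # flash=131072 KiB
block_kib = 128         # block=128 KiB
scan_seconds = 59.1     # full-chip scan time
counts = {"GOOD": 968, "EMPTY": 56, "BAD_BLOCK": 0}

blocks = flash_kib // block_kib               # 1024 blocks
assert sum(counts.values()) == blocks         # every block got exactly one status

ms_per_block = scan_seconds / blocks * 1000   # about 57.7 ms per block
kib_per_s = flash_kib / scan_seconds          # about 2218 KiB/s overall

print(blocks, round(ms_per_block, 1), round(kib_per_s))
```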
## Out of scope (follow-ups)
- ECC mismatch reporting back to host (could surface as new "WORN"
status when the chip's STATUS_ECC bits in feature 0xC0 indicate
corrections).
- OOB programming path (needed to synthesize bad blocks for testing
and to mark blocks bad after wear).
- Bad-block-aware erase/write (currently erase/write hit all blocks
uniformly; a chip with bad blocks would see write failures).
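One common way to approach the last follow-up is skip-block mapping: logical blocks are laid onto physical blocks in order, stepping over any block the scan reported as BAD_BLOCK. The sketch below is an assumption about how that could look, not something this PR implements; `map_logical_blocks` is a hypothetical helper.

```python
def map_logical_blocks(total_blocks, bad_blocks, count, start=0):
    """Map `count` logical blocks onto good physical blocks, skipping bad ones."""
    bad = set(bad_blocks)
    mapping = []
    phys = start
    while len(mapping) < count:
        if phys >= total_blocks:
            raise ValueError("not enough good blocks for the requested range")
        if phys not in bad:
            mapping.append(phys)  # this physical block is usable
        phys += 1
    return mapping


# Logical blocks 0..3 on an 8-block chip where blocks 1 and 3 are bad.
print(map_logical_blocks(8, {1, 3}, 4))  # [0, 2, 4, 5]
```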
## Test suites
- make -C agent test HOST_CC=gcc: 5406/5406
- pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
- ruff & mypy: clean
- All four agent SoCs build clean (ev300, cv300, cv500, 3519v101).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OOB-program support and a CMD_MARK_BAD agent command, used to
synthesize a bad block on demand for testing the scan's bad-block
detection path. This proves the BAD_BLOCK pipeline works end-to-end:
program → scan-detects → erase-restores.
## Changes
- agent/spi_flash.{c,h}: new flash_program_oob(block, buf, len) — calls
nand_program_page with column = NAND_PAGE_SIZE (start of OOB area)
for page 0 of the block. NOR returns -1.
- agent/protocol.h: new CMD_MARK_BAD = 0x0C opcode.
- agent/main.c: new handle_mark_bad — writes 0x00 to OOB[0] of the
given block via flash_program_oob. Wired into both command
dispatchers (interactive + framed).
- src/defib/agent/protocol.py: matching CMD_MARK_BAD = 0x0C.
- src/defib/agent/client.py: new FlashAgentClient.mark_bad_block(block)
method.
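The addressing shared by `flash_read_oob` and `flash_program_oob` (page 0 of the block, column = NAND_PAGE_SIZE) can be illustrated with a small calculation. This is a sketch only: the 2048-byte page size is an assumed geometry for a chip with 128 KiB blocks, and `oob_address` is a hypothetical helper, not a function in the agent.

```python
NAND_PAGE_SIZE = 2048                            # assumed data bytes per page
BLOCK_SIZE = 128 * 1024                          # block=128 KiB from the scan output
PAGES_PER_BLOCK = BLOCK_SIZE // NAND_PAGE_SIZE   # 64 under this assumption


def oob_address(block, oob_offset=0):
    """Return (row, column) addressing OOB[oob_offset] of page 0 of a block."""
    row = block * PAGES_PER_BLOCK          # page address of page 0 in the block
    column = NAND_PAGE_SIZE + oob_offset   # OOB starts right after the data area
    return row, column


print(oob_address(100))  # (6400, 2048)
```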
## Hardware proof on hi3516av200 (synthetic bad-block round-trip)
[1/3] erase block 100 → scan: EMPTY ✓
[2/3] mark block 100 bad → scan: BAD_BLOCK ✓
[3/3] erase block 100 → scan: EMPTY ✓ (OOB cleared)
The chip programs OOB[0]=0x00 successfully, the scan reads OOB[0] and
correctly identifies the marker as a bad-block indicator, and a
subsequent block erase clears OOB back to 0xFF, restoring the block to
good status. Full pipeline verified.
This complements PR #72's negative verification (1024-block scan with
zero false positives on a clean chip) with positive verification.
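The three-step round trip above can be replayed against an in-memory model. This is a sketch under stated assumptions: `FakeNand` is a hypothetical stand-in that models only OOB[0] and one page of data per block, and it is not the `FlashAgentClient` API.

```python
class FakeNand:
    """In-memory stand-in for a NAND chip (illustrative only)."""

    def __init__(self, blocks=1024, page_size=2048):
        self.page_size = page_size
        self.data = [bytearray(b"\xff" * page_size) for _ in range(blocks)]
        self.oob0 = [0xFF] * blocks

    def erase(self, block):
        # A block erase resets both the data area and the OOB to 0xFF.
        self.data[block][:] = b"\xff" * self.page_size
        self.oob0[block] = 0xFF

    def mark_bad(self, block):
        # Programming can only clear bits (1 -> 0); writing 0x00 clears all.
        self.oob0[block] &= 0x00

    def scan(self, block):
        if self.oob0[block] != 0xFF:
            return "BAD_BLOCK"
        if all(b == 0xFF for b in self.data[block]):
            return "EMPTY"
        return "GOOD"


nand = FakeNand()
steps = []
nand.erase(100)
steps.append(nand.scan(100))
nand.mark_bad(100)
steps.append(nand.scan(100))
nand.erase(100)
steps.append(nand.scan(100))
print(steps)  # ['EMPTY', 'BAD_BLOCK', 'EMPTY']
```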
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
The flash doctor's `CMD_SCAN` was NOR-only: it used direct memory-mapped pointer reads at `FLASH_MEM`, which silently return garbage on NAND (no boot-mode window). It also never read the OOB area, where factory-marked bad blocks live (OOB[0] of page 0 of the bad block, per the standard SPI NAND convention).

This PR makes the scan NAND-aware so it correctly classifies blocks on SPI NAND boards, including a new `BAD_BLOCK` status, surfaced in the TUI flash doctor with a distinct glyph (✗) and color (magenta).
## Changes
- `agent/spi_flash.{c,h}`: `flash_read_oob(block, buf, len)` reads OOB bytes of page 0 via PAGE_READ + READ_FROM_CACHE at column = `NAND_PAGE_SIZE`, with the same iobuf[1] dummy-skip from #71. NOR returns -1 (no OOB).
- `agent/main.c`: new `SCAN_BAD_BLOCK = 0x06` status. `handle_scan` now routes data-area reads through `flash_read()` (NAND-aware) into a static 128 KiB buffer instead of a memory-mapped pointer. For NAND blocks: read OOB[0..1] of page 0; if `OOB[0] != 0xFF`, report `SCAN_BAD_BLOCK` and skip the data-area scan. The Pass-2 stability re-read (UNSTABLE) is gated to NOR, because NAND on-chip ECC auto-corrects and re-reads always match.
- `src/defib/agent/client.py`: `SectorStatus.BAD_BLOCK = 0x06` + `ScanResult.bad_block` accessor.
- `src/defib/tui/screens/flash_doctor.py`: `BLOCK_BAD = "✗"` glyph in magenta; surface bad-block count in `ScanStats` panel (only when non-zero).
## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)
```
chip: jedec=00c212 flash=131072 KiB block=128 KiB
Full-chip scan: 1024 blocks in 59.1 s
GOOD: 968 (data area has stable content)
EMPTY: 56 (all 0xFF)
BAD_BLOCK: 0 (this chip has zero factory bad blocks — within
spec; MX35LF1GE4AB allows up to 20 of 1024)
```
The OOB-read code path executed for all 1024 blocks without false positives (zero spurious `BAD_BLOCK` reports), confirming the `OOB[0] != 0xFF` check is wired correctly end-to-end. If a chip with factory bad blocks shows up in the lab later, those blocks will be reported distinctly instead of mixing in with the data-area pattern checks.

Synthesizing a bad block (writing `0x00` to OOB[0] of a sacrificial block) would require extending `nand_program_page` to allow OOB-column writes. That is out of scope for this PR but tracked as a follow-up test if wider chip coverage demands it.
## Bad-block detection logic (per JEDEC SPI NAND convention)
| OOB[0] of page 0 | Scan result |
| --- | --- |
| `0xFF` | not marked bad; data-area scan runs |
| anything else | `BAD_BLOCK`, skip data scan |

This applies only to NAND. NOR has no OOB, so the existing NOR scan path is unchanged.
## Out of scope (follow-ups)
- ECC mismatch reporting back to host (could surface as a new `WORN` status when the chip's STATUS_ECC bits in feature `0xC0` indicate corrections: pages still readable but accumulating bit errors).
- OOB programming path (needed to synthesize bad blocks for testing and to mark blocks bad after wear).
- Bad-block-aware erase/write (currently erase/write hit all blocks uniformly; a chip with bad blocks would see write failures).
## Test suites
```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
make SOC=hi3516ev300/cv300/cv500/3519v101: all build clean
```
Test plan
🤖 Generated with Claude Code