Detect NAND factory bad blocks in flash doctor scan#72

Merged: widgetii merged 2 commits into master from agent-nand-bbm on May 5, 2026

@widgetii widgetii commented May 5, 2026

Summary

The flash doctor's CMD_SCAN was NOR-only: it used direct memory-mapped pointer reads at FLASH_MEM, which silently return garbage on NAND (no boot-mode window). It also never read the OOB area, where factory-marked bad blocks live (OOB[0] of page 0 of the bad block per the standard SPI NAND convention).

This PR makes the scan NAND-aware so it correctly classifies blocks on SPI NAND boards, including a new BAD_BLOCK status, surfaced in the TUI flash doctor with a distinct glyph (✗) and color (magenta).

Changes

| Component | Change |
| --- | --- |
| `agent/spi_flash.{c,h}` | New `flash_read_oob(block, buf, len)`: reads OOB bytes of page 0 via PAGE_READ + READ_FROM_CACHE at column = NAND_PAGE_SIZE, with the same iobuf[1] dummy-skip from #71. NOR returns -1 (no OOB). |
| `agent/main.c` | New `SCAN_BAD_BLOCK = 0x06` status. `handle_scan` now routes data-area reads through `flash_read()` (NAND-aware) into a static 128 KiB buffer instead of a mem-mapped pointer. For NAND blocks: read OOB[0..1] of page 0; if OOB[0] != 0xFF, report SCAN_BAD_BLOCK and skip the data-area scan. The Pass-2 stability re-read (UNSTABLE) is gated to NOR, because NAND on-chip ECC auto-corrects and re-reads always match. |
| `src/defib/agent/client.py` | Add `SectorStatus.BAD_BLOCK = 0x06` and a `ScanResult.bad_block` accessor. |
| `src/defib/tui/screens/flash_doctor.py` | New `BLOCK_BAD = "✗"` glyph in magenta; surface the bad-block count in the ScanStats panel (only when non-zero). |
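
The OOB read described for `flash_read_oob` can be sketched from the host side as the two SPI command frames it implies. This is an illustration, not the agent's C code: the opcodes (0x13, 0x03) are the common SPI NAND values, the geometry constants are assumed from the 128 KiB block size with 2 KiB pages, and `oob_read_commands` is a hypothetical helper name. The iobuf[1] dummy-skip from #71 is not modelled.

```python
from typing import Tuple

# Assumed geometry (not taken from the agent source): 2 KiB pages, 64 pages per 128 KiB block.
NAND_PAGE_SIZE = 2048
PAGES_PER_BLOCK = 64

OP_PAGE_READ = 0x13        # common SPI NAND opcode: load a page from the array into the cache
OP_READ_FROM_CACHE = 0x03  # common SPI NAND opcode: stream bytes out of the cache


def oob_read_commands(block: int) -> Tuple[bytes, bytes]:
    """Return (PAGE_READ frame, READ_FROM_CACHE frame) for OOB[0] of page 0 of `block`."""
    row = block * PAGES_PER_BLOCK   # row address of page 0 in this block
    column = NAND_PAGE_SIZE         # first byte past the data area, i.e. OOB[0]
    page_read = bytes([OP_PAGE_READ]) + row.to_bytes(3, "big")
    # 16-bit column address followed by one dummy byte before data is clocked out
    read_cache = bytes([OP_READ_FROM_CACHE]) + column.to_bytes(2, "big") + b"\x00"
    return page_read, read_cache


page_read, read_cache = oob_read_commands(100)
print(page_read.hex(), read_cache.hex())
```

Setting the column to the page size is what moves the read window from the data area into the spare area; everything else is an ordinary page read.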

Verification on real hi3516av200 (Macronix MX35LF1GE4AB)

```
chip: jedec=00c212 flash=131072 KiB block=128 KiB
Full-chip scan: 1024 blocks in 59.1 s
GOOD: 968 (data area has stable content)
EMPTY: 56 (all 0xFF)
BAD_BLOCK: 0 (this chip has zero factory bad blocks — within
spec; MX35LF1GE4AB allows up to 20 of 1024)
```

The OOB-read code path executed for all 1024 blocks without false positives (zero spurious BAD_BLOCK reports), confirming the OOB[0] != 0xFF check is wired correctly end-to-end. If a chip with factory bad blocks shows up in the lab later, those blocks will be reported distinctly instead of mixing in with the data-area pattern checks.
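
The figures in the scan log can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the throughput figure assumes the full data area of every block was read, which holds here because no block was skipped as bad.

```python
# Numbers copied from the scan log above.
flash_kib = 131072
block_kib = 128
scan_seconds = 59.1
counts = {"GOOD": 968, "EMPTY": 56, "BAD_BLOCK": 0}

blocks = flash_kib // block_kib                  # expected number of erase blocks
ms_per_block = scan_seconds / blocks * 1000      # average wall time per block
mib_per_s = flash_kib / 1024 / scan_seconds      # effective read throughput

# The per-status counts must account for every block.
assert sum(counts.values()) == blocks

print(blocks, round(ms_per_block, 1), round(mib_per_s, 2))  # prints: 1024 57.7 2.17
```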

Synthesizing a bad block (writing 0x00 to OOB[0] of a sacrificial block) would require extending nand_program_page to allow OOB-column writes — out of scope for this PR but tracked as a follow-up test if wider chip coverage demands it.

Bad-block detection logic (per the standard SPI NAND convention)

| OOB[0] of page 0 | Block status |
| --- | --- |
| 0xFF | Good: proceed with data-area scan |
| any other value | Factory-marked bad: report BAD_BLOCK, skip data scan |
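
The decision in the table above can be sketched as a small classifier. This is a simplified model rather than the agent's `handle_scan`: the real scan also runs CRC32 and pattern checks on the data area, and `classify_nand_block` and the string status names are hypothetical stand-ins for the numeric status codes.

```python
def classify_nand_block(oob: bytes, data: bytes) -> str:
    """Classify one NAND block from page-0 OOB bytes and its data area."""
    # Any marker other than 0xFF in OOB[0] means factory-marked bad;
    # the data area is not inspected at all in that case.
    if oob[0] != 0xFF:
        return "BAD_BLOCK"
    # An erased block reads back as all 0xFF.
    if data.count(0xFF) == len(data):
        return "EMPTY"
    return "GOOD"


print(classify_nand_block(b"\xff\xff", b"\xff" * 16))   # EMPTY
print(classify_nand_block(b"\x00\xff", b"\x12" * 16))   # BAD_BLOCK
print(classify_nand_block(b"\xff\xff", b"\x12" * 16))   # GOOD
```

Checking the marker first matters because the data area of a bad block may hold arbitrary content, so classifying it by pattern would be misleading.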

This applies only to NAND. NOR has no OOB, so the existing NOR scan path is unchanged.

Out of scope (follow-ups)

  • ECC mismatch reporting (could surface as new WORN status when the chip's STATUS_ECC bits in feature 0xC0 indicate corrections — pages still readable but accumulating bit errors).
  • OOB programming path — needed both to synthesize bad blocks for testing and to mark blocks bad after wear is detected.
  • Bad-block-aware erase/write — currently erase/write hit all blocks uniformly; a chip with bad blocks would see write failures we'd need to handle.
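
For the ECC follow-up, a decoder for the status byte read from feature 0xC0 might look like the sketch below. It assumes the layout common to many SPI NAND parts, where a two-bit ECC status sits in bits 5:4 (0 = no errors, 1 = errors corrected, 2 = uncorrectable); the exact encoding should be confirmed against the chip's datasheet, and `ecc_state` is a hypothetical name. Mapping "corrected" to a WORN status is only the idea floated above, not existing behavior.

```python
ECC_SHIFT = 4   # assumed position of the ECC status field in feature 0xC0
ECC_MASK = 0x3  # two-bit field

# Assumed encoding; verify against the datasheet of the specific chip.
_ECC_STATES = {0: "clean", 1: "corrected", 2: "uncorrectable"}


def ecc_state(status: int) -> str:
    """Decode the ECC field of a status byte; unknown encodings map to 'reserved'."""
    bits = (status >> ECC_SHIFT) & ECC_MASK
    return _ECC_STATES.get(bits, "reserved")


print(ecc_state(0x00), ecc_state(0x10), ecc_state(0x20))
```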

```
make -C agent test HOST_CC=gcc: 5406/5406
pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
ruff & mypy: clean
make SOC=hi3516ev300/cv300/cv500/3519v101: all build clean
```

Test plan

  • Real av200 hardware: full-chip scan, OOB-read path executed for all 1024 blocks, zero false-positive bad-block reports
  • All test suites green
  • Synthetic bad-block test (deferred — needs OOB-write support, separate PR)

🤖 Generated with Claude Code

widgetii and others added 2 commits May 5, 2026 19:37
The flash doctor's CMD_SCAN was NOR-only: it used direct memory-mapped
pointer reads at FLASH_MEM, which silently return garbage on NAND
chips (no boot-mode window).  And it never read the OOB area, where
factory-marked bad blocks live (OOB[0] of page 0 of the bad block).

This PR makes the scan NAND-aware so it correctly classifies blocks on
SPI NAND boards — including the new BAD_BLOCK status for factory bad
blocks, surfaced in the TUI flash doctor with a distinct glyph (✗) and
color (magenta).

## Changes

- `agent/spi_flash.{c,h}`: add `flash_read_oob(block, buf, len)` that
  reads OOB bytes of page 0 of a block via PAGE_READ + READ_FROM_CACHE
  at column = NAND_PAGE_SIZE, with the same iobuf[1] dummy-skip from
  PR #71.  NOR returns -1 (no OOB).
- `agent/main.c`:
  - New `SCAN_BAD_BLOCK = 0x06` status code.
  - `handle_scan` now routes data-area reads through `flash_read()`
    (NAND-aware) into a static 128 KiB scan_buf, instead of direct
    mem-mapped pointer.  For NAND blocks: read OOB[0..1] of page 0
    first; if OOB[0] != 0xFF report SCAN_BAD_BLOCK and skip the data
    scan; otherwise CRC32 + pattern check (the UNSTABLE re-read runs
    on NOR only).
  - The Pass-2 stability re-read (UNSTABLE detection) is gated to NOR
    only — NAND on-chip ECC auto-corrects single-bit flips so re-reads
    always return the same bytes.  Distinct ECC-correction telemetry
    is a separate follow-up.
- `src/defib/agent/client.py`: add `SectorStatus.BAD_BLOCK = 0x06` and
  matching `ScanResult.bad_block` accessor.
- `src/defib/tui/screens/flash_doctor.py`: render BAD_BLOCK as `✗` in
  magenta in the sector grid, and surface its count in the live
  ScanStats panel (only when non-zero).

## Verification on real hi3516av200 (Macronix MX35LF1GE4AB)

  jedec=00c212  flash=131072 KiB  block=128 KiB
  Full-chip scan: 1024 blocks in 59.1 s
    GOOD:        968   (data area has stable content)
    EMPTY:        56   (all 0xFF)
    BAD_BLOCK:     0   (this chip has zero factory bad blocks — within
                        spec; MX35LF1GE4AB allows up to 20 of 1024)

The OOB-read code path executed for all 1024 blocks without false
positives (zero spurious BAD_BLOCK reports), confirming the OOB[0] !=
0xFF check is wired correctly end-to-end.  Scaling out: if a chip with
factory bad blocks shows up later, those blocks will now be reported
distinctly instead of mixing in with the data-area pattern checks.

Synthesizing a bad block (writing 0x00 to OOB[0] of a sacrificial
block) would require extending the agent's program path to handle OOB
column writes — out of scope for this PR but tracked as a follow-up
test if wider chip coverage demands it.

## Out of scope (follow-ups)

- ECC mismatch reporting back to host (could surface as new "WORN"
  status when the chip's STATUS_ECC bits in feature 0xC0 indicate
  corrections).
- OOB programming path (needed to synthesize bad blocks for testing
  and to mark blocks bad after wear).
- Bad-block-aware erase/write (currently erase/write hit all blocks
  uniformly; a chip with bad blocks would see write failures).

## Test suites

- make -C agent test HOST_CC=gcc:    5406/5406
- pytest tests/ -x --ignore=tests/fuzz: 402 passed, 2 skipped
- ruff & mypy: clean
- All four agent SoCs build clean (ev300, cv300, cv500, 3519v101).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds OOB-program support and a CMD_MARK_BAD agent command, used to
synthesize a bad block on demand for testing the scan's bad-block
detection path.  This proves the BAD_BLOCK pipeline works end-to-end:
program → scan-detects → erase-restores.

## Changes

- agent/spi_flash.{c,h}: new flash_program_oob(block, buf, len) — calls
  nand_program_page with column = NAND_PAGE_SIZE (start of OOB area)
  for page 0 of the block.  NOR returns -1.
- agent/protocol.h: new CMD_MARK_BAD = 0x0C opcode.
- agent/main.c: new handle_mark_bad — writes 0x00 to OOB[0] of the
  given block via flash_program_oob.  Wired into both command
  dispatchers (interactive + framed).
- src/defib/agent/protocol.py: matching CMD_MARK_BAD = 0x0C.
- src/defib/agent/client.py: new FlashAgentClient.mark_bad_block(block)
  method.

## Hardware proof on hi3516av200 (synthetic bad-block round-trip)

  [1/3] erase block 100         → scan: EMPTY    ✓
  [2/3] mark block 100 bad      → scan: BAD_BLOCK ✓
  [3/3] erase block 100         → scan: EMPTY    ✓ (OOB cleared)

The chip programs OOB[0]=0x00 successfully, the scan reads OOB[0] and
correctly identifies the marker as a bad-block indicator, and a
subsequent block erase clears OOB back to 0xFF restoring the block to
good status.  Full pipeline verified.
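
The round trip above follows from two NAND properties: programming can only clear bits (1 to 0), and a block erase resets data and OOB to 0xFF. A minimal in-memory model of that behavior, with a hypothetical `NandBlockModel` class and an assumed 64-byte OOB area, reproduces the same three-step sequence:

```python
class NandBlockModel:
    """Toy model of the page-0 OOB area of one NAND block (data area assumed erased)."""

    def __init__(self, oob_len: int = 64) -> None:  # 64-byte OOB is an assumption
        self.oob = bytearray(b"\xff" * oob_len)

    def erase(self) -> None:
        # Block erase returns every OOB byte to 0xFF.
        self.oob[:] = b"\xff" * len(self.oob)

    def program_oob(self, data: bytes) -> None:
        # Programming can only clear bits, so AND the new bytes into place.
        for i, value in enumerate(data):
            self.oob[i] &= value

    def scan(self) -> str:
        return "BAD_BLOCK" if self.oob[0] != 0xFF else "EMPTY"


block = NandBlockModel()
block.erase()
first = block.scan()            # EMPTY
block.program_oob(b"\x00")      # what a mark-bad operation writes to OOB[0]
second = block.scan()           # BAD_BLOCK
block.erase()
third = block.scan()            # EMPTY again, marker cleared
print(first, second, third)
```

The same erase that restores the test block would also wipe a genuine factory marker, which is why bad-block-aware erase is listed as a follow-up.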

This complements PR #72's negative verification (1024-block scan with
zero false positives on a clean chip) with positive verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii merged commit fad5eee into master May 5, 2026
13 checks passed
@widgetii widgetii deleted the agent-nand-bbm branch May 5, 2026 17:05