Skip to content

agent: detect dropped RSP_DATA packets in read_memory via seq gap#106

Merged
widgetii merged 1 commit into
masterfrom
agent-read-seq-gap-detection
May 16, 2026
Merged

agent: detect dropped RSP_DATA packets in read_memory via seq gap#106
widgetii merged 1 commit into
masterfrom
agent-read-seq-gap-detection

Conversation

@widgetii
Copy link
Copy Markdown
Member

Why

`FlashAgentClient.read_memory` silently returned short data when a `RSP_DATA` packet was CRC-rejected mid-stream by `recv_packet`. Surfaced backing up an `hi3516av300:emmc` camera with the new agent eMMC reader: at 921600 baud a 9 MiB stream consistently dropped ~5 % of packets, the client appended the surviving bytes contiguously without realising data was missing, then accepted the agent's final `ACK_OK` as success. Two back-to-back dumps disagreed on the kernel/romfs SHAs while uboot/env (small enough to land in single packet bursts) were byte-identical.

What

The agent already stamps the 2-byte seq prefix on every RSP_DATA in `agent/main.c::handle_read` (`pkt[0..1] = seq`). The client now validates it. On a gap, it raises loudly with the offset, the expected and received seq numbers, and the actual addr the gap landed at — the caller can chunk + retry:

```
RuntimeError: Read packet seq gap at offset 64512 (addr=0x1486d600):
expected seq=235, got seq=236; likely UART corruption on a long stream.
Try a smaller read or a lower baud.
```

Also added a final size check at ACK_OK: if `len(received) != size`, raise. Catches the (rare) case of an agent ACKing early without any seq gap on the way.

End-to-end validation on real hardware

Run 1 Run 2 Status
uboot (320 KiB) sha 644932… sha 644932… ✓ match
env (64 KiB) sha a3f951… sha a3f951… ✓ match
kernel (2240 KiB) sha b79ee0… sha b79ee0… ✓ match
romfs (9216 KiB) sha e0f115… sha e0f115… ✓ match

Same partitions, dumped end-to-end with `fast=False` (115200 baud) + 64-KiB chunks + a byte-level UART-silence drain in the host retry path — zero retries needed across both runs.

(Pre-fix on the same hardware: same two runs produced 4 of 4 partitions matching only for the two small partitions; kernel was complete in run 1 but truncated in run 2; romfs was truncated in both with different sizes and different SHAs.)

Test plan

  • `uv run pytest tests/ -x --ignore=tests/fuzz` — 530 passed, 2 skipped (3 new tests in `TestReadMemorySeqValidation`)
  • `uv run ruff check` + `mypy` clean on touched files
  • Two real-hardware backup runs at 115200 baud, ~33 min each, SHA-identical across all four partitions
  • Direct manual at-921600 attempt — seq gap raised within ~100 KB of read start (was: silent short read)

Followups noted in kaeru

`agent-read-seq-gap-detection-2026-05-16` captures the chunked-retry + byte-level-drain pattern that the host-side script used during validation; a reasonable candidate for a future public `FlashAgentClient.read_memory_chunked` if more callers hit the same problem.

🤖 Generated with Claude Code

`FlashAgentClient.read_memory` would silently return short data when a
RSP_DATA packet got CRC-rejected mid-stream by `recv_packet`. Surfaced
backing up an hi3516av300:emmc camera's partitions over the new agent
eMMC reader: at 921600 baud a 9 MiB stream consistently dropped ~5%
of packets, and the client appended the surviving bytes contiguously
without realising data was missing, then accepted the agent's final
ACK_OK as success. Two back-to-back dumps disagreed on kernel/romfs
SHAs while uboot/env (both small enough to land in single packet bursts)
were byte-identical.

The agent already stamps the 2-byte seq prefix on every RSP_DATA in
`agent/main.c::handle_read` (pkt[0..1] = seq). Just validate it on the
host side. On a gap, raise loudly with the offset, the expected and
received seq numbers, and the actual addr the gap landed at — the
caller can chunk + retry.

Also added a final size check at ACK_OK: if `len(received) != size`,
raise. Catches the (rare) case of agent ACKing early without any seq
gap on the way.

Verified end-to-end with the hi3516av300:emmc partition dump:

  - 115200 baud + 64-KiB chunks + byte-level idle-drain in the host
    retry path → 0 retries, both two back-to-back dumps SHA256-identical
    across all four partitions (uboot 320 KiB, env 64 KiB, kernel
    2240 KiB, romfs 9216 KiB).
  - 921600 baud single-shot → seq gap is now raised after the first
    lost packet (was: silent short read).

3 new tests in TestReadMemorySeqValidation:
  - test_normal_sequential_seq_passes  (regression: clean seq stream
    still works)
  - test_seq_gap_raises  (the bug fix)
  - test_size_mismatch_at_ack_raises  (belt-and-braces total-size check)

Followup work captured in kaeru:
  `agent-read-seq-gap-detection-2026-05-16`. The host transport's
  chunked-retry helper (which lived in /tmp/backup_emmc.py for this
  validation) is a reasonable candidate for a public
  `FlashAgentClient.read_memory_chunked` if more callers hit the same
  problem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii merged commit 587c8c0 into master May 16, 2026
13 checks passed
@widgetii widgetii deleted the agent-read-seq-gap-detection branch May 16, 2026 06:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant