Skip to content

agent+rack: fix multi-packet recv loss; pod-side fastboot client#86

Merged
widgetii merged 1 commit into
masterfrom
agent-protocol-leftover-and-rack-fastboot
May 11, 2026
Merged

agent+rack: fix multi-packet recv loss; pod-side fastboot client#86
widgetii merged 1 commit into
masterfrom
agent-protocol-leftover-and-rack-fastboot

Conversation

@widgetii
Copy link
Copy Markdown
Member

Summary

Two fixes that together make the SPL+agent path actually work over the rack pod's WiFi-bridged UART. The first is a regression-class bug in agent.protocol; the second is the host counterpart to a new POST /fastboot endpoint on the rack pod.


1. agent.protocol.recv_packet — async path was losing post-delimiter bytes

The pyserial path through _recv_packet_sync already maintained a per-port leftover buffer for bytes that arrived in the same read as a frame's delimiter but after it (those bytes belong to the NEXT frame). The async path was written without one. When a chunk contained two whole packets back-to-back — e.g. RSP_DATA + 0x00 + RSP_ACK + 0x00 — the for-loop returned the first frame and silently discarded the rest.

Result: only single-packet commands worked on async transports (SocketTransport, MockTransport). INFO and CRC32 worked (one packet); READ and the streaming WRITE responses timed out 100% of the time because the trailing RSP_ACK was already consumed from the socket and gone.

read 16 bytes at 0x40010000 (RAM)... FAIL: TimeoutError
read 16 bytes at 0x14000000 (FLASH)... FAIL: TimeoutError

Fix: a transport-keyed _async_leftover buffer that's drained before each socket read and refilled with any bytes that came after the parsed frame's delimiter. Same semantics as the pyserial buffer, just keyed on id(transport) to avoid coupling Transport to the parser.

Post-fix, same hardware:

read 16 bytes at 0x40010000 (RAM-16)...    OK: aa aa aa aa aa aa aa aa ...
read 16 bytes at 0x14000000 (FLASH-16)...  OK: 15 05 00 ea fe ff ff ea ...
read 64 bytes at 0x14050000 (KERNEL-64)... OK: 27 05 19 56 e2 8a ab 68 ...
                                               ^^^^^^^^^^^ uImage magic

Two regression tests in TestRecvPacket:

  • two whole packets in one chunk → both come out in order
  • three packets (READY + DATA + ACK) in one chunk → all three come out

2. RackController.fastboot(...) — host counterpart to pod-side BootROM upload

The pod's TCP-bridged UART has ~50 ms WiFi RTT per round-trip. The HiSilicon SPL boot protocol's per-frame ACK loop (150 ms timeout, dozens of frames per upload) couldn't survive that — every attempt failed at PRESTEP0 even with the fixes from #85.

The companion rack-firmware change adds POST /fastboot which runs the entire BootROM upload sequence on the ESP32-S3 itself: handshake (flood 0xAA, watch for 0x20 markers), DDR step (PRESTEP0/DDRSTEP0/optional PRESTEP1), SPL upload, agent upload. ACK latency drops from ~50 ms (WiFi) to ~1 ms (local UART), making the whole sequence reliable.

This PR adds RackController.fastboot(profile, spl, agent) that packs everything into a single binary blob (load addresses + PRESTEP/DDRSTEP/PRESTEP1 + SPL + agent), POSTs it, and returns the pod's phase-by-phase JSON.

End-to-end against 10.216.128.69 (hi3516ev300):

POST /fastboot (41 834 bytes blob)
  result: {'success': True, 'last_phase': 'done',
           'elapsed_ms': 4552, 'handshake_markers': 5}
agent INFO: jedec=ef4018 (W25Q128), flash=16 MB, version=2, caps=0x7f
read 64 b @ 0x14000000: 15 05 00 ea fe ff ff ea ... (uImage payload)

~4.5 s wall-clock for the full SPL+agent bring-up vs. 100% failure before.

Test plan

  • uv run pytest tests/ -x -v --ignore=tests/fuzz (457 passed / 2 skipped)
  • uv run ruff check src/defib/ tests/
  • uv run mypy src/defib/ --ignore-missing-imports
  • Soak the agent path against a local serial transport to confirm the leftover-buffer change doesn't regress pyserial flows (it shouldn't: those go through _recv_packet_sync, untouched).

🤖 Generated with Claude Code

Two fixes that together make the SPL+agent path work over the rack
pod's WiFi-bridged UART.

### 1. `agent.protocol.recv_packet`: preserve post-delimiter bytes (regression)

The pyserial path through `_recv_packet_sync` already maintained a
per-port leftover buffer for bytes read past the current frame's
delimiter — those bytes belong to the NEXT frame. The async path
through `recv_packet` did not: when a chunk contained
`RSP_DATA + 0x00 + RSP_ACK + 0x00`, the for-loop returned the
`RSP_DATA` and silently discarded the rest. `read_memory` then
hung forever waiting for the `RSP_ACK` that had already arrived.

Symptoms only visible on async transports (`SocketTransport`,
`MockTransport`); single-packet commands (`INFO`, `CRC32`) worked
because they only need one frame. Multi-packet commands (`READ`,
streaming `WRITE` responses) timed out 100% of the time.

Fix: add a transport-keyed `_async_leftover` buffer, drain it
before each socket read, and stash any post-delimiter bytes for
the next call. Mirrors the pyserial path's buffer semantics.

Two new regression tests in `TestRecvPacket`:
- two whole packets in one chunk: both come out in order
- three packets (READY + DATA + ACK) in one chunk: all three come
  out in order

### 2. `RackController.fastboot(...)` — pod-side SPL upload client

Companion to the new `POST /fastboot` endpoint in the rack pod
firmware. Host packs profile (PRESTEP0 / DDRSTEP0 / optional
PRESTEP1 + load addresses), SPL bytes, and agent bytes into a
single binary blob and POSTs it. The pod runs the entire HiSilicon
SPL BootROM upload sequence locally on its UART (microsecond ACK
latency, no WiFi jitter in the critical loop). Returns phase-by-
phase JSON.

End-to-end against the prototype at 10.216.128.69 (hi3516ev300):
the full handshake + DDR + 24 KB SPL + 17 KB agent upload completes
in **~4.5 s** wall-clock from a single HTTP POST. Without this, the
per-frame ACK loop over WiFi failed 100% of the time at the very
first PRESTEP0 frame.

Full suite: 457 passed / 2 skipped; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii merged commit 3ad3721 into master May 11, 2026
13 checks passed
@widgetii widgetii deleted the agent-protocol-leftover-and-rack-fastboot branch May 11, 2026 18:37
widgetii added a commit that referenced this pull request May 11, 2026
… factory) (#87)

## Summary

The six `defib agent ...` commands hardcoded
`SerialTransport.create(port)`, so `-p tcp://<host>:<port>` errored with
*\"No such file or directory: 'tcp://...'\"*, even though the underlying
COBS protocol works fine over `SocketTransport` (now that #86 fixed the
multi-packet recv-loss).

Drop-in switch to `create_transport(normalize_port_name(port))` — the
same factory `burn` / `install` / `restore` already use. Help strings on
every command updated to mention the supported URL schemes globally.

Affected: `agent upload`, `agent flash`, `agent info`, `agent read`,
`agent write`, `agent scan`.

## Verification on rack pod at `10.216.128.69`

```
$ defib agent info -p tcp://10.216.128.69:9000
  JEDEC ID:    ef4018
  Flash size:  16384 KB
  RAM base:    0x40000000
  Sector size: 64 KB
  Agent ver:   2
  Capabilities: flash_stream, sector_bitmap, page_skip, set_baud, reboot,
                selfupdate, scan
```

```
$ defib agent read -p tcp://10.216.128.69:9000 -a 0x14050000 -s 256 -o uimage_dump.bin
  256/256 (100%)
  256 bytes in 0.0s
  CRC32: OK
$ xxd uimage_dump.bin | head -2
  00000000: 2705 1956 e28a ab68 ...  '..V...hj...
  00000020: 4c69 6e75 782d 342e ...  Linux-4.9.37-hi3516ev300
```

```
$ defib agent scan -p tcp://10.216.128.69:9000
  Scanned: 256/256  (29.6 s)
  Good:          256
  Flash is healthy!
```

The scan command in particular is a stress test of the multi-packet
bridge path — every sector returns its own `RSP_SCAN` packet plus the
final `RSP_ACK`. Streams cleanly thanks to #86's leftover-buffer fix.

## Test plan

- [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz` — 457 passed / 2
skipped
- [ ] `uv run ruff check src/defib/cli/app.py`
- [ ] `uv run mypy src/defib/cli/app.py --ignore-missing-imports`
- [ ] Regression: confirm the same 6 commands still work against a local
`/dev/ttyUSB*` (they go through the same `create_transport` factory, so
should be unchanged).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii added a commit that referenced this pull request May 11, 2026
## Summary

Test-only PR — locks in two integration points that previously rested on
real-hardware verification alone.

### 1. `/fastboot` binary blob (`tests/test_power_rack.py`)

`RackController.fastboot()` packs profile + SPL + agent into a single
big-endian binary blob that the pod's C handler in
`rack/firmware/main/http_api.c` parses field by field. Any drift between
the two breaks bring-up silently — pod returns 400, the host can't tell
why.

Six new `TestFastbootWireFormat` cases:

| Case | Why |
|---|---|
| `test_packs_expected_layout` | round-trip every field |
| `test_prestep1_passthrough` | optional prestep1 not dropped |
| `test_success_response_returned_verbatim` | done/elapsed/markers
preserved |
| `test_pod_500_returns_json_body_not_exception` | protocol failure (500
+ JSON) surfaces as a dict, not an exception — callers need
`failed_phase` / `error` |
| `test_pod_unreachable_raises_power_controller_error` | network-layer
errors stay raised |
| `test_realistic_blob_size_within_pod_limit` | 41 834 B for a typical
hi3516ev300 upload, under the pod's 1 MiB cap |

Shared helper `_parse_fastboot_blob()` is the inverse of the host
packer; if either side bumps the format, the round-trip test fails
loudly.

### 2. Async `recv_packet` leftover stress
(`tests/test_agent_protocol.py`)

PR #86 added a per-transport leftover buffer so multi-packet TCP chunks
don't drop trailing packets. The existing tests covered the two simplest
cases (two/three packets in one chunk). Six stress cases added:

| Case | Why |
|---|---|
| `test_frame_split_across_two_reads_recombines` | half a packet, then
the other half (TCP MTU split) |
| `test_large_stream_50_packets_in_one_chunk` | catches off-by-one in
leftover slicing |
| `test_recv_response_skips_ready_in_leftover` | READY skipping still
works when READYs are in leftover, not on the wire |
| `test_per_transport_isolation` | two transports must not share
leftover state |
| `test_incomplete_frame_in_leftover_blocks_until_timeout` | half a
frame must wait for the rest, never spuriously return partial |
| `test_corrupt_frame_skipped_then_recovers` | bit-flipped CRC
mid-stream must be discarded; parser picks up the next valid frame |

## Test plan

- [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz` — **480 passed /
2 skipped** (was 468)
- [ ] `uv run ruff check tests/ src/defib/`
- [ ] `uv run mypy src/defib/ --ignore-missing-imports`

No production code touched.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant