agent+rack: fix multi-packet recv loss; pod-side fastboot client#86
Merged
Merged
Conversation
Two fixes that together make the SPL+agent path work over the rack pod's WiFi-bridged UART. ### 1. `agent.protocol.recv_packet`: preserve post-delimiter bytes (regression) The pyserial path through `_recv_packet_sync` already maintained a per-port leftover buffer for bytes read past the current frame's delimiter — those bytes belong to the NEXT frame. The async path through `recv_packet` did not: when a chunk contained `RSP_DATA + 0x00 + RSP_ACK + 0x00`, the for-loop returned the `RSP_DATA` and silently discarded the rest. `read_memory` then hung forever waiting for the `RSP_ACK` that had already arrived. Symptoms only visible on async transports (`SocketTransport`, `MockTransport`); single-packet commands (`INFO`, `CRC32`) worked because they only need one frame. Multi-packet commands (`READ`, streaming `WRITE` responses) timed out 100% of the time. Fix: add a transport-keyed `_async_leftover` buffer, drain it before each socket read, and stash any post-delimiter bytes for the next call. Mirrors the pyserial path's buffer semantics. Two new regression tests in `TestRecvPacket`: - two whole packets in one chunk: both come out in order - three packets (READY + DATA + ACK) in one chunk: all three come out in order ### 2. `RackController.fastboot(...)` — pod-side SPL upload client Companion to the new `POST /fastboot` endpoint in the rack pod firmware. Host packs profile (PRESTEP0 / DDRSTEP0 / optional PRESTEP1 + load addresses), SPL bytes, and agent bytes into a single binary blob and POSTs it. The pod runs the entire HiSilicon SPL BootROM upload sequence locally on its UART (microsecond ACK latency, no WiFi jitter in the critical loop). Returns phase-by- phase JSON. End-to-end against the prototype at 10.216.128.69 (hi3516ev300): the full handshake + DDR + 24 KB SPL + 17 KB agent upload completes in **~4.5 s** wall-clock from a single HTTP POST. Without this, the per-frame ACK loop over WiFi failed 100% of the time at the very first PRESTEP0 frame. Full suite: 457 passed / 2 skipped; ruff + mypy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
widgetii
added a commit
that referenced
this pull request
May 11, 2026
… factory) (#87) ## Summary The six `defib agent ...` commands hardcoded `SerialTransport.create(port)`, so `-p tcp://<host>:<port>` errored with *\"No such file or directory: 'tcp://...'\"*, even though the underlying COBS protocol works fine over `SocketTransport` (now that #86 fixed the multi-packet recv-loss). Drop-in switch to `create_transport(normalize_port_name(port))` — the same factory `burn` / `install` / `restore` already use. Help strings on every command updated to mention the supported URL schemes globally. Affected: `agent upload`, `agent flash`, `agent info`, `agent read`, `agent write`, `agent scan`. ## Verification on rack pod at `10.216.128.69` ``` $ defib agent info -p tcp://10.216.128.69:9000 JEDEC ID: ef4018 Flash size: 16384 KB RAM base: 0x40000000 Sector size: 64 KB Agent ver: 2 Capabilities: flash_stream, sector_bitmap, page_skip, set_baud, reboot, selfupdate, scan ``` ``` $ defib agent read -p tcp://10.216.128.69:9000 -a 0x14050000 -s 256 -o uimage_dump.bin 256/256 (100%) 256 bytes in 0.0s CRC32: OK $ xxd uimage_dump.bin | head -2 00000000: 2705 1956 e28a ab68 ... '..V...hj... 00000020: 4c69 6e75 782d 342e ... Linux-4.9.37-hi3516ev300 ``` ``` $ defib agent scan -p tcp://10.216.128.69:9000 Scanned: 256/256 (29.6 s) Good: 256 Flash is healthy! ``` The scan command in particular is a stress test of the multi-packet bridge path — every sector returns its own `RSP_SCAN` packet plus the final `RSP_ACK`. Streams cleanly thanks to #86's leftover-buffer fix. ## Test plan - [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz` — 457 passed / 2 skipped - [ ] `uv run ruff check src/defib/cli/app.py` - [ ] `uv run mypy src/defib/cli/app.py --ignore-missing-imports` - [ ] Regression: confirm the same 6 commands still work against a local `/dev/ttyUSB*` (they go through the same `create_transport` factory, so should be unchanged). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
widgetii
added a commit
that referenced
this pull request
May 11, 2026
## Summary Test-only PR — locks in two integration points that previously rested on real-hardware verification alone. ### 1. `/fastboot` binary blob (`tests/test_power_rack.py`) `RackController.fastboot()` packs profile + SPL + agent into a single big-endian binary blob that the pod's C handler in `rack/firmware/main/http_api.c` parses field by field. Any drift between the two breaks bring-up silently — pod returns 400, the host can't tell why. Six new `TestFastbootWireFormat` cases: | Case | Why | |---|---| | `test_packs_expected_layout` | round-trip every field | | `test_prestep1_passthrough` | optional prestep1 not dropped | | `test_success_response_returned_verbatim` | done/elapsed/markers preserved | | `test_pod_500_returns_json_body_not_exception` | protocol failure (500 + JSON) surfaces as a dict, not an exception — callers need `failed_phase` / `error` | | `test_pod_unreachable_raises_power_controller_error` | network-layer errors stay raised | | `test_realistic_blob_size_within_pod_limit` | 41 834 B for a typical hi3516ev300 upload, under the pod's 1 MiB cap | Shared helper `_parse_fastboot_blob()` is the inverse of the host packer; if either side bumps the format, the round-trip test fails loudly. ### 2. Async `recv_packet` leftover stress (`tests/test_agent_protocol.py`) PR #86 added a per-transport leftover buffer so multi-packet TCP chunks don't drop trailing packets. The existing tests covered the two simplest cases (two/three packets in one chunk). Six stress cases added: | Case | Why | |---|---| | `test_frame_split_across_two_reads_recombines` | half a packet, then the other half (TCP MTU split) | | `test_large_stream_50_packets_in_one_chunk` | catches off-by-one in leftover slicing | | `test_recv_response_skips_ready_in_leftover` | READY skipping still works when READYs are in leftover, not on the wire | | `test_per_transport_isolation` | two transports must not share leftover state | | `test_incomplete_frame_in_leftover_blocks_until_timeout` | half a frame must wait for the rest, never spuriously return partial | | `test_corrupt_frame_skipped_then_recovers` | bit-flipped CRC mid-stream must be discarded; parser picks up the next valid frame | ## Test plan - [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz` — **480 passed / 2 skipped** (was 468) - [ ] `uv run ruff check tests/ src/defib/` - [ ] `uv run mypy src/defib/ --ignore-missing-imports` No production code touched. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two fixes that together make the SPL+agent path actually work over the rack pod's WiFi-bridged UART. The first is a regression-class bug in
agent.protocol; the second is the host counterpart to a newPOST /fastbootendpoint on the rack pod.1.
agent.protocol.recv_packet— async path was losing post-delimiter bytesThe pyserial path through
_recv_packet_syncalready maintained a per-port leftover buffer for bytes that arrived in the same read as a frame's delimiter but after it (those bytes belong to the NEXT frame). The async path was written without one. When a chunk contained two whole packets back-to-back — e.g.RSP_DATA + 0x00 + RSP_ACK + 0x00— the for-loop returned the first frame and silently discarded the rest.Result: only single-packet commands worked on async transports (
SocketTransport,MockTransport).INFOandCRC32worked (one packet);READand the streamingWRITEresponses timed out 100% of the time because the trailingRSP_ACKwas already consumed from the socket and gone.Fix: a transport-keyed
_async_leftoverbuffer that's drained before each socket read and refilled with any bytes that came after the parsed frame's delimiter. Same semantics as the pyserial buffer, just keyed onid(transport)to avoid couplingTransportto the parser.Post-fix, same hardware:
Two regression tests in
TestRecvPacket:2.
RackController.fastboot(...)— host counterpart to pod-side BootROM uploadThe pod's TCP-bridged UART has ~50 ms WiFi RTT per round-trip. The HiSilicon SPL boot protocol's per-frame ACK loop (150 ms timeout, dozens of frames per upload) couldn't survive that — every attempt failed at PRESTEP0 even with the fixes from #85.
The companion rack-firmware change adds
POST /fastbootwhich runs the entire BootROM upload sequence on the ESP32-S3 itself: handshake (flood 0xAA, watch for 0x20 markers), DDR step (PRESTEP0/DDRSTEP0/optional PRESTEP1), SPL upload, agent upload. ACK latency drops from ~50 ms (WiFi) to ~1 ms (local UART), making the whole sequence reliable.This PR adds
RackController.fastboot(profile, spl, agent)that packs everything into a single binary blob (load addresses + PRESTEP/DDRSTEP/PRESTEP1 + SPL + agent), POSTs it, and returns the pod's phase-by-phase JSON.End-to-end against
10.216.128.69(hi3516ev300):~4.5 s wall-clock for the full SPL+agent bring-up vs. 100% failure before.
Test plan
uv run pytest tests/ -x -v --ignore=tests/fuzz(457 passed / 2 skipped)uv run ruff check src/defib/ tests/uv run mypy src/defib/ --ignore-missing-imports_recv_packet_sync, untouched).🤖 Generated with Claude Code