fix: diagnose daemon SerialMonitor write/RPC timeouts and use board-family line control #834

Description

@zackees

Context

FastLED AutoResearch on Windows with an ESP32-S3 on COM22 (VID:PID 303A:1001) can get into a state where fbuild deployment succeeds and the fbuild-backed serial monitor receives device output, but JSON-RPC writes do not produce any firmware response. The client then reports only a generic RPC timeout.

Observed with FastLED using fbuild 2.3.13:

bash autoresearch esp32s3 --all --skip-lint --timeout 240 --upload-port COM22

The deploy completed and the monitor received firmware output such as:

RESULT: {"type":"status","ready":true,"uptimeMs":...}
RESULT: {"chip":"ESP32-S3 (Xtensa)","type":"ready",...}

AutoResearch then sent JSON-RPC pings through fbuild.api.SerialMonitor.write():

{"method":"ping","params":[{}],"id":1}

No REMOTE: response was observed. The client retried, reset via fbuild, retried again, and still timed out.

A direct fbuild probe also did not produce an RPC response:

fbuild serial probe read COM22 --seconds 20 --send '{"method":"ping","params":[{}],"id":77}\n'

It printed only:

ESP-ROM:esp32s3-20210327
probe read: port=COM22 baud=115200 family=Esp32NativeUsbCdc dtr=false rts=false seconds=20

This should be surfaced as an actionable fbuild serial/line-control/write-path failure instead of an undifferentiated higher-level RPC timeout.

Code paths that look suspicious:

  • crates/fbuild-daemon/src/handlers/websockets.rs opens WebSocket serial-monitor sessions with open_port(&port, baud_rate, &client_id, None). None falls back to (DTR=true, RTS=true) in SharedSerialManager::open_port, even though family_for_vid_pid(0x303A, *) maps ESP native USB CDC to (false, false).
  • crates/fbuild-daemon/src/handlers/operations/monitor.rs and post-deploy monitor attach in deploy.rs also pass None to open_port.
  • crates/fbuild-python/src/serial_monitor.rs::write() sends a Write frame and waits for only the next WebSocket frame. If a serial Data frame arrives before WriteAck, it returns 0 and leaves the ack to be consumed later by unrelated reads.
  • crates/fbuild-serial/src/manager.rs::write_to_port() uses serial.write(data) once and acks the returned byte count. It should either write the full buffer or report a partial write as a failed/incomplete write.
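The last item can be illustrated with a small model of the full-buffer loop. This is a Python sketch of the logic only, not the Rust code in manager.rs; `write_full`, `PartialWriteError`, and `ChunkedHandle` are hypothetical names introduced here, and the handle is assumed to return the number of bytes it accepted per call.

```python
class PartialWriteError(IOError):
    """Raised when the handle stops accepting bytes before the buffer is drained."""


def write_full(handle, data: bytes) -> int:
    """Write all of `data`, looping over short writes; return the total written."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        n = handle.write(view[total:])
        if not n:
            # 0 or None means no progress: report it instead of acking a short count.
            raise PartialWriteError(f"wrote {total} of {len(view)} bytes")
        total += n
    return total


class ChunkedHandle:
    """Hypothetical stand-in for an OS serial handle that takes at most `limit` bytes per call."""

    def __init__(self, limit: int):
        self.limit = limit
        self.received = bytearray()

    def write(self, buf) -> int:
        chunk = bytes(buf[: self.limit])
        self.received += chunk
        return len(chunk)
```

With a handle that accepts four bytes at a time, a single `write()` would ack 4 bytes of the ping request, while `write_full()` delivers the whole line or raises with both byte counts in the message.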

Proposal

Make fbuild's daemon-backed serial monitor robust enough that AutoResearch and other clients can distinguish these cases:

  1. The monitor opened the port with the wrong board-family DTR/RTS state.
  2. The write was only partially accepted by the OS serial handle.
  3. The WebSocket write ack raced with serial data frames.
  4. The device is still in ROM/download/boot state after attach/reset and is not running the expected firmware.

Concrete implementation direction:

  • Infer BoardFamily for daemon WebSocket and HTTP monitor opens from the selected OS port's VID/PID, matching fbuild serial probe read behavior. For deploy flows, prefer carrying the known board/platform family through to the post-deploy monitor attach rather than falling back to None.
  • Log the inferred family plus DTR/RTS values on every daemon monitor attach.
  • Change SerialMonitor.write() to keep reading until it sees the matching WriteAck or hits a timeout/error, buffering any serial Data frames that arrive in between so later reads still receive them, instead of treating the first non-ack frame as a write failure.
  • Change write_to_port() to use write_all() or an explicit full-buffer loop, and fail loudly on partial writes/timeouts.
  • Add diagnostics in the write failure/timeout path that include requested byte count, ack byte count, port, inferred family, and current DTR/RTS policy.
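A rough sketch of the SerialMonitor.write() change, as a Python model rather than the actual serial_monitor.rs code. The frame dictionaries (`"type"`, `"data"`, `"bytes_written"`), the `send`/`recv` callables, and `WriteTimeout` are assumptions made for illustration; the real WebSocket frame layout may differ.

```python
import queue
import time
from collections import deque


class WriteTimeout(Exception):
    """No WriteAck arrived before the deadline."""


def write_and_wait_ack(send, recv, pending_data: deque, payload: bytes, timeout: float = 2.0) -> int:
    """Send a Write frame, then read frames until the WriteAck shows up.

    `recv(seconds)` is assumed to return the next frame or raise queue.Empty.
    Data frames seen while waiting are appended to `pending_data` so that
    later reads still receive them.
    """
    send({"type": "Write", "data": payload})
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise WriteTimeout(f"no WriteAck for {len(payload)} bytes within {timeout}s")
        try:
            frame = recv(remaining)
        except queue.Empty:
            continue  # the deadline check above turns this into WriteTimeout
        if frame["type"] == "WriteAck":
            return frame["bytes_written"]
        if frame["type"] == "Data":
            pending_data.append(frame["data"])
        # Other frame types are ignored in this sketch.
```

The point of the design is that a Data frame arriving first no longer makes the write look like it returned 0, and the ack is never left behind for an unrelated read to consume.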

Acceptance criteria

  • WebSocket SerialMonitor attach for VID:PID 303A:* opens with BoardFamily::Esp32NativeUsbCdc semantics (DTR=false, RTS=false) unless the caller explicitly overrides it.
  • HTTP monitor/post-deploy monitor paths no longer pass None when the board family can be inferred from the port or deploy context.
  • SerialMonitor.write() succeeds when a Data frame arrives before WriteAck; add a regression test for Data -> WriteAck ordering.
  • SerialMonitor.write() does not silently return success/zero on partial writes; callers can tell whether all bytes were accepted.
  • AutoResearch-style RPC failures report an fbuild serial diagnostic when the write/attach path is suspect, rather than only timing out waiting for REMOTE:.
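The first criterion could be modelled roughly as follows. This is a Python sketch, not the daemon's Rust code: only the 0x303A to (DTR=false, RTS=false) mapping and the (DTR=true, RTS=true) fallback come from the description above; the `UNKNOWN` family, the lookup table, and `resolve_line_control` are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple


class BoardFamily(Enum):
    ESP32_NATIVE_USB_CDC = "Esp32NativeUsbCdc"
    UNKNOWN = "Unknown"  # hypothetical catch-all for this sketch


ESPRESSIF_VID = 0x303A

# (DTR, RTS) per family; UNKNOWN mirrors the current None fallback.
LINE_CONTROL = {
    BoardFamily.ESP32_NATIVE_USB_CDC: (False, False),
    BoardFamily.UNKNOWN: (True, True),
}


def family_for_vid_pid(vid: Optional[int], pid: Optional[int]) -> BoardFamily:
    """Any PID under the Espressif VID is treated as native USB CDC."""
    if vid == ESPRESSIF_VID:
        return BoardFamily.ESP32_NATIVE_USB_CDC
    return BoardFamily.UNKNOWN


def resolve_line_control(
    vid: Optional[int],
    pid: Optional[int],
    override: Optional[Tuple[bool, bool]] = None,
) -> Tuple[BoardFamily, bool, bool]:
    """Return (family, dtr, rts); an explicit caller override always wins."""
    family = family_for_vid_pid(vid, pid)
    dtr, rts = override if override is not None else LINE_CONTROL[family]
    return family, dtr, rts
```

Returning the family alongside the DTR/RTS pair makes it straightforward to log all three on every monitor attach.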

Open questions

  • Should the Python SerialMonitor constructor accept an optional board/family argument for ambiguous VID/PID cases, or should the daemon infer from VID/PID plus the deploy context only?
  • Should write acknowledgements carry a request id so the Python client can match acks deterministically even with interleaved serial data?
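For the second question, one possible shape is sketched below in Python. The `request_id` field, the frame dictionaries, and `AckMatcher` are all hypothetical; nothing here reflects the current protocol.

```python
import itertools


class AckMatcher:
    """Tracks outstanding writes by id so acks can arrive in any order."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # request_id -> requested byte count
        self.data = []      # serial Data payloads seen while waiting

    def new_write(self, payload: bytes) -> dict:
        """Build a Write frame carrying a fresh request id."""
        rid = next(self._ids)
        self._pending[rid] = len(payload)
        return {"type": "Write", "request_id": rid, "data": payload}

    def on_frame(self, frame: dict):
        """Return (request_id, fully_written) for a matched ack, else None."""
        if frame["type"] == "Data":
            self.data.append(frame["data"])
            return None
        if frame["type"] == "WriteAck":
            requested = self._pending.pop(frame.get("request_id"), None)
            if requested is None:
                return None  # stale or unknown ack
            return frame["request_id"], frame["bytes_written"] == requested
        return None
```

Because each ack names the write it belongs to, interleaved Data frames and out-of-order acks cannot be confused with one another, and a short `bytes_written` shows up as an incomplete write for that specific request.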
