-
Notifications
You must be signed in to change notification settings - Fork 693
Fix WebSocket connection reliability and domain reload recovery #656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
On Windows, WebSocket connections can die silently (OSError 64) without either side being notified. This causes commands to fail with "Unity session not available" until Unity eventually detects the dead connection. Changes: - Add PingMessage model for server->client pings - Add ping loop in PluginHub that sends pings every 10 seconds - Track last pong time per session; close connection if no pong within 20s - Include session_id in pong messages from Unity for server-side tracking - Clean up debug/timing logs from Issue CoplayDev#654 investigation The server will now proactively detect dead connections within 20 seconds instead of waiting indefinitely for the next command to fail. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When Unity performs a domain reload (after script changes, test runs, or large payload transfers), the MCP connection drops and needs to reconnect. The previous reconnection timeout (2s) was too short for domain reloads which can take 10-30s. Changes: - Increase UNITY_MCP_RELOAD_MAX_WAIT_S default from 2s to 30s - Increase backoff cap when reloading from 0.8s to 5.0s - Skip PluginHub session resolution for stdio transport (was causing unnecessary waits on every command) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…beat # Conflicts: # Server/src/transport/plugin_hub.py # Server/src/transport/unity_transport.py
Reviewer's GuideImplements a server-initiated WebSocket ping/pong heartbeat to proactively detect dead connections, extends Unity-domain-reload-related timeouts from 2s to 30s (with updated backoff and documentation), skips PluginHub session resolution when using stdio transport, and adjusts logging/timing for better observability and reliability diagnostics. Sequence diagram for WebSocket ping_pong heartbeat and dead connection detectionsequenceDiagram
participant PluginHub as PluginHub_Server
participant UnityClient as UnityWebSocketTransportClient
rect rgb(230,230,250)
Note over PluginHub,UnityClient: Registration establishes WebSocket session
PluginHub->>UnityClient: Register handshake
UnityClient-->>PluginHub: RegisteredMessage(session_id)
PluginHub->>PluginHub: _connections[session_id] = websocket
PluginHub->>PluginHub: _last_pong[session_id] = time.monotonic()
PluginHub->>PluginHub: start _ping_loop(session_id, websocket)
end
loop Every PING_INTERVAL (10s)
PluginHub->>PluginHub: sleep(PING_INTERVAL)
PluginHub->>PluginHub: check session_id in _connections
alt Session no longer connected
PluginHub->>PluginHub: stop _ping_loop for session_id
else Session still connected
PluginHub->>PluginHub: elapsed = now - _last_pong[session_id]
alt elapsed > PING_TIMEOUT (20s)
PluginHub->>PluginHub: log [Ping] stale session
PluginHub->>UnityClient: close(code=1001)
PluginHub->>PluginHub: stop _ping_loop for session_id
else Send ping
PluginHub->>UnityClient: send_json(PingMessage{type: ping})
alt Send succeeds
UnityClient-->>PluginHub: PongMessage{type: pong, session_id}
PluginHub->>PluginHub: _handle_pong(payload)
PluginHub->>PluginHub: registry.touch(session_id)
PluginHub->>PluginHub: _last_pong[session_id] = time.monotonic()
else Send fails (e.g. OSError 64)
PluginHub->>PluginHub: log [Ping] send failure
PluginHub->>UnityClient: close(code=1006)
PluginHub->>PluginHub: stop _ping_loop for session_id
end
end
end
end
Note over PluginHub,UnityClient: on_disconnect cancels ping task and clears _last_pong
Sequence diagram for HTTP vs stdio transport session resolution with PluginHubsequenceDiagram
actor MCPClient
participant Server as UnityTransport_Server
participant Middleware as UnityInstanceMiddleware
participant PluginHub as PluginHub
participant Stdio as LegacyUnityConnection
MCPClient->>Server: Invoke command
Server->>Middleware: _inject_unity_instance(context)
alt HTTP transport
Middleware->>Server: _is_http_transport() == true
Middleware->>PluginHub: PluginHub.is_configured()
alt PluginHub configured
Middleware->>PluginHub: _resolve_session_id(unity_instance, user_id)
PluginHub-->>Middleware: session_id (wait up to UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_S <= 30s)
Middleware->>Server: attach session_id to context
Server->>PluginHub: route HTTP request via WebSocket session
PluginHub-->>Server: MCPResponse
else PluginHub not configured
Middleware->>Server: fall back to active_instance only
Server->>Stdio: send_command_with_retry(instance_id, command)
Stdio-->>Server: Unity response
end
else Stdio transport
Middleware->>Server: _is_http_transport() == false
Note over Middleware,PluginHub: PluginHub session resolution is skipped entirely
Middleware->>Server: use active_instance or instance_id only
Server->>Stdio: send_command_with_retry(instance_id, command)
Stdio-->>Server: Unity response
end
Server-->>MCPClient: MCPResponse
Class diagram for PluginHub heartbeat and related transport modelsclassDiagram
class PluginHub {
+KEEP_ALIVE_INTERVAL: int
+SERVER_TIMEOUT: int
+COMMAND_TIMEOUT: int
+PING_INTERVAL: int
+PING_TIMEOUT: int
+FAST_FAIL_TIMEOUT: float
+_connections: dict~str, WebSocket~
+_pending: dict~str, dict~str, Any~~
+_lock: asyncio.Lock
+_loop: asyncio.AbstractEventLoop
+_last_pong: dict~str, float~
+_ping_tasks: dict~str, asyncio.Task~
+on_disconnect(websocket: WebSocket, close_code: int) void
+_handle_register(websocket: WebSocket, payload: RegisterMessage) Task
+_handle_pong(payload: PongMessage) Task
+_ping_loop(session_id: str, websocket: WebSocket) Task
+_get_connection(session_id: str) Task~WebSocket~
+_resolve_session_id(unity_instance: str, user_id: str, allow_unattached: bool) Task~str~
}
class PingMessage {
+type: str
}
class PongMessage {
+type: str
+session_id: str
}
class WebSocketTransportClient {
-_sessionId: string
+SendPongAsync(token: CancellationToken) Task
+SendJsonAsync(payload: JObject, token: CancellationToken) Task
+HandleSocketClosureAsync(reason: string) Task
}
class TransportCommandDispatcher {
}
class PendingCommand {
+PendingCommand(commandJson: string, completionSource: TaskCompletionSource, cancellationToken: CancellationToken, registration: CancellationTokenRegistration)
+CommandJson: string
+CompletionSource: TaskCompletionSource
+CancellationToken: CancellationToken
+CancellationRegistration: CancellationTokenRegistration
+IsExecuting: bool
+QueuedAt: DateTime
+Dispose() void
}
class LegacyUnityConnection {
+send_command(command_type: str, params: dict~str, Any~, max_attempts: int) dict
}
class UnityTransportModule {
+_is_http_transport() bool
}
PluginHub --> PingMessage : uses
PluginHub --> PongMessage : handles
WebSocketTransportClient --> PongMessage : sends
WebSocketTransportClient --> PingMessage : receives
TransportCommandDispatcher *-- PendingCommand : contains
LegacyUnityConnection <-- UnityTransportModule : used_by
PluginHub <-- UnityTransportModule : used_by
Flow diagram for send_command_with_retry and reload timeout handlingflowchart TD
A[Start send_command_with_retry] --> B[Get LegacyUnityConnection via get_unity_connection]
B --> C[Read reload_max_retries and reload_retry_ms from config]
C --> D[Read UNITY_MCP_RELOAD_MAX_WAIT_S env or default 30.0]
D --> E[Clamp max_wait_s to 0..30]
E --> F[Initialize waited = 0, attempt = 0]
F --> G{waited < max_wait_s and
attempt <= max_retries}
G -->|No| H[Log reload timed out
or retries exhausted]
H --> I[Return last response or error]
G -->|Yes| J[Attempt send_command
via LegacyUnityConnection]
J --> K{Success?}
K -->|Yes| L[Return successful response]
K -->|No| M[Inspect status file /
fast_error flags]
M --> N{status.reloading?}
N -->|Yes| O[Set backoff cap = 5.0s
domain reload can take 10-30s]
N -->|No and fast_error| P[Set backoff cap = 0.25s]
N -->|No and not fast_error| Q[Set backoff cap = 1.0s]
O --> R[Compute next backoff
and sleep]
P --> R
Q --> R
R --> S[Increment waited and attempt]
S --> G
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
📝 WalkthroughWalkthroughAdds command enqueue timestamps and WebSocket pong/session payloads; logs socket closure stack traces; implements a server-initiated ping/pong heartbeat with per-session liveness tracking; broad timing instrumentation and increased Unity reload wait defaults to 20s; restricts PluginHub session resolution to HTTP transport; updates logger init and adds multiple Unity editor tests for property-conversion error handling. Changes
Sequence DiagramsequenceDiagram
actor Client as Unity Client
participant WS as WebSocketTransportClient
participant Hub as PluginHub (Server)
participant Server as Python Server
Client->>WS: Connect & register (session_id)
WS->>Hub: Register session
Hub->>Hub: set _last_pong[session] = now<br/>spawn _ping_loop(session)
loop every PING_INTERVAL (10s)
Hub->>WS: PingMessage
WS->>Client: deliver ping
Client->>WS: PongMessage (includes session_id)
WS->>Hub: forward pong
Hub->>Hub: update _last_pong[session]
end
alt no pong within PING_TIMEOUT (20s)
Hub->>Hub: detect stale session
Hub->>WS: close connection / unregister session
end
Client->>Server: Command requests (normal flow)
Server->>Client: Responses
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 2 issues, and left some high level feedback:
- The new
[TIMING-STDIO]and WebSocket stack-trace logs are emitted atinfo/Warnlevel and on hot paths; consider downgrading todebugor guarding them behind a config flag to avoid flooding logs in normal operation. - The ping heartbeat constants (
PING_INTERVAL,PING_TIMEOUT) are currently hard-coded; consider making them configurable (similar to theUNITY_MCP_*timeouts) so they can be tuned without code changes for different environments. - Access to
_last_pongin the ping loop and_handle_pongis done without the shared lock while other session state uses the lock; for consistency and to avoid subtle concurrency issues, it may be safer to read/write_last_pongunder the same lock as_connections.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `[TIMING-STDIO]` and WebSocket stack-trace logs are emitted at `info`/`Warn` level and on hot paths; consider downgrading to `debug` or guarding them behind a config flag to avoid flooding logs in normal operation.
- The ping heartbeat constants (`PING_INTERVAL`, `PING_TIMEOUT`) are currently hard-coded; consider making them configurable (similar to the `UNITY_MCP_*` timeouts) so they can be tuned without code changes for different environments.
- Access to `_last_pong` in the ping loop and `_handle_pong` is done without the shared lock while other session state uses the lock; for consistency and to avoid subtle concurrency issues, it may be safer to read/write `_last_pong` under the same lock as `_connections`.
## Individual Comments
### Comment 1
<location> `Server/src/transport/plugin_hub.py:466-475` </location>
<code_context>
+ async def _ping_loop(cls, session_id: str, websocket: WebSocket) -> None:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider cleaning up `_ping_tasks`/`_last_pong` in `_ping_loop` to avoid relying solely on `on_disconnect`.
Relying on `on_disconnect` for cleanup means that in shutdown/exception edge cases (e.g. `_lock` set to `None`, `close()` failing, or `on_disconnect` not firing), the loop can exit the `while` and hit `finally` without removing this session from `_ping_tasks` or clearing `_last_pong`. Consider doing that cleanup in the `finally` block so the ping loop fully manages its own lifecycle and avoids stale entries.
Suggested implementation:
```python
await registry.touch(session_id)
# Record last pong time for staleness detection
cls._last_pong[session_id] = time.monotonic()
@classmethod
def _cleanup_ping_session(cls, session_id: str) -> None:
"""Cleanup ping tracking state for a session.
This is used by the ping loop's finally block to fully manage its own
lifecycle, and can also be safely called from on_disconnect as needed.
"""
# Remove this session's ping task and staleness tracking, if present.
# Using pop(..., None) makes this idempotent and safe to call multiple times.
cls._ping_tasks.pop(session_id, None)
cls._last_pong.pop(session_id, None)
@classmethod
async def _ping_loop(cls, session_id: str, websocket: WebSocket) -> None:
```
To fully implement your suggestion, the following additional changes are needed in the same file:
1. Wrap the body of `_ping_loop` in a `try`/`finally` and call the new helper in the `finally` block, so the ping loop always cleans up its own state:
```python
@classmethod
async def _ping_loop(cls, session_id: str, websocket: WebSocket) -> None:
"""Server-initiated ping loop to detect dead connections.
Sends periodic pings to the Unity client. If no pong is received within
PING_TIMEOUT seconds, the connection is considered dead and closed.
This helps detect connections that die silently (e.g., Windows OSError 64).
"""
logger.debug(f"[Ping] Starting ping loop for session {session_id}")
try:
while True:
await asyncio.sleep(cls.PING_INTERVAL)
# ... existing ping / timeout logic remains unchanged ...
finally:
# Ensure we don't leave stale entries if on_disconnect never fires
cls._cleanup_ping_session(session_id)
```
2. Anywhere else you’re cleaning up `_ping_tasks` / `_last_pong` (likely in `on_disconnect`), update that code to call `cls._cleanup_ping_session(session_id)` instead of manipulating `_ping_tasks` / `_last_pong` directly. Because `_cleanup_ping_session` uses `pop(..., None)`, calling it from both `on_disconnect` and `_ping_loop` is safe and idempotent.
These changes ensure that the ping loop fully manages its own lifecycle and avoids stale entries even in shutdown/exception edge cases where `on_disconnect` might not execute.
</issue_to_address>
### Comment 2
<location> `MCPForUnity/Editor/Services/Transport/Transports/WebSocketTransportClient.cs:654-660` </location>
<code_context>
private async Task HandleSocketClosureAsync(string reason)
{
+ // DEBUG: Capture stack trace to identify what triggers disconnection (Issue #654 investigation)
+ var stackTrace = new System.Diagnostics.StackTrace(true);
+ McpLog.Warn($"[WebSocket] HandleSocketClosureAsync called. Reason: {reason}\nStack trace:\n{stackTrace}");
+
if (_lifecycleCts == null || _lifecycleCts.IsCancellationRequested)
</code_context>
<issue_to_address>
**suggestion:** The WARN-level stack trace log on every socket closure may be too noisy for normal usage.
Since disconnects are often expected, consider either lowering this log to `Info`/`Debug`, gating it behind a diagnostic flag or `#if DEBUG`, or restricting the stack trace to unexpected closure reasons so you keep the investigation signal without overwhelming normal logs.
```suggestion
private async Task HandleSocketClosureAsync(string reason)
{
#if DEBUG
// DEBUG: Capture stack trace to identify what triggers disconnection (Issue #654 investigation)
var stackTrace = new System.Diagnostics.StackTrace(true);
McpLog.Debug($"[WebSocket] HandleSocketClosureAsync called. Reason: {reason}\nStack trace:\n{stackTrace}");
#endif
if (_lifecycleCts == null || _lifecycleCts.IsCancellationRequested)
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| async def _ping_loop(cls, session_id: str, websocket: WebSocket) -> None: | ||
| """Server-initiated ping loop to detect dead connections. | ||
|
|
||
| Sends periodic pings to the Unity client. If no pong is received within | ||
| PING_TIMEOUT seconds, the connection is considered dead and closed. | ||
| This helps detect connections that die silently (e.g., Windows OSError 64). | ||
| """ | ||
| logger.debug(f"[Ping] Starting ping loop for session {session_id}") | ||
| try: | ||
| while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Consider cleaning up _ping_tasks/_last_pong in _ping_loop to avoid relying solely on on_disconnect.
Relying on on_disconnect for cleanup means that in shutdown/exception edge cases (e.g. _lock set to None, close() failing, or on_disconnect not firing), the loop can exit the while and hit finally without removing this session from _ping_tasks or clearing _last_pong. Consider doing that cleanup in the finally block so the ping loop fully manages its own lifecycle and avoids stale entries.
Suggested implementation:
await registry.touch(session_id)
# Record last pong time for staleness detection
cls._last_pong[session_id] = time.monotonic()
@classmethod
def _cleanup_ping_session(cls, session_id: str) -> None:
"""Cleanup ping tracking state for a session.
This is used by the ping loop's finally block to fully manage its own
lifecycle, and can also be safely called from on_disconnect as needed.
"""
# Remove this session's ping task and staleness tracking, if present.
# Using pop(..., None) makes this idempotent and safe to call multiple times.
cls._ping_tasks.pop(session_id, None)
cls._last_pong.pop(session_id, None)
@classmethod
async def _ping_loop(cls, session_id: str, websocket: WebSocket) -> None:To fully implement your suggestion, the following additional changes are needed in the same file:
- Wrap the body of
_ping_loopin atry/finallyand call the new helper in thefinallyblock, so the ping loop always cleans up its own state:
@classmethod
async def _ping_loop(cls, session_id: str, websocket: WebSocket) -> None:
"""Server-initiated ping loop to detect dead connections.
Sends periodic pings to the Unity client. If no pong is received within
PING_TIMEOUT seconds, the connection is considered dead and closed.
This helps detect connections that die silently (e.g., Windows OSError 64).
"""
logger.debug(f"[Ping] Starting ping loop for session {session_id}")
try:
while True:
await asyncio.sleep(cls.PING_INTERVAL)
# ... existing ping / timeout logic remains unchanged ...
finally:
# Ensure we don't leave stale entries if on_disconnect never fires
cls._cleanup_ping_session(session_id)- Anywhere else you’re cleaning up
_ping_tasks/_last_pong(likely inon_disconnect), update that code to callcls._cleanup_ping_session(session_id)instead of manipulating_ping_tasks/_last_pongdirectly. Because_cleanup_ping_sessionusespop(..., None), calling it from bothon_disconnectand_ping_loopis safe and idempotent.
These changes ensure that the ping loop fully manages its own lifecycle and avoids stale entries even in shutdown/exception edge cases where on_disconnect might not execute.
MCPForUnity/Editor/Services/Transport/Transports/WebSocketTransportClient.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@Server/src/transport/legacy/unity_connection.py`:
- Around line 244-245: Remove the unused variable t0 in the send_command
function: delete the assignment "t0 = time.time()" (or replace it with a
meaningful use if you intended to compute duration) and keep the
logger.info("[TIMING-STDIO] send_command START: command=%s", command_type) line
unchanged; locate this inside the send_command method in unity_connection.py to
make the edit.
🧹 Nitpick comments (2)
MCPForUnity/Editor/Services/Transport/Transports/WebSocketTransportClient.cs (1)
656-658: Debug logging uses warning level unconditionally.The comment indicates this is for "DEBUG" / "Issue
#654investigation", butMcpLog.Warnlogs unconditionally at warning level. This could be noisy in production. Consider usingMcpLog.Debugwhich respects the debug logging preference, or remove this investigation code before merging.🔧 Proposed fix: Use conditional debug logging
- // DEBUG: Capture stack trace to identify what triggers disconnection (Issue `#654` investigation) - var stackTrace = new System.Diagnostics.StackTrace(true); - McpLog.Warn($"[WebSocket] HandleSocketClosureAsync called. Reason: {reason}\nStack trace:\n{stackTrace}"); + // Capture stack trace for debugging disconnection triggers + var stackTrace = new System.Diagnostics.StackTrace(true); + McpLog.Debug($"[WebSocket] HandleSocketClosureAsync called. Reason: {reason}\nStack trace:\n{stackTrace}");Server/src/transport/plugin_hub.py (1)
86-89: AddClassVartype annotation for mutable class attributes.Static analysis correctly identifies that mutable class attributes should be annotated with
typing.ClassVarto clarify they are shared across instances and prevent accidental shadowing.🔧 Proposed fix
+from typing import Any, ClassVar + class PluginHub(WebSocketEndpoint): ... - # session_id -> last pong timestamp (monotonic) - _last_pong: dict[str, float] = {} - # session_id -> ping task - _ping_tasks: dict[str, asyncio.Task] = {} + # session_id -> last pong timestamp (monotonic) + _last_pong: ClassVar[dict[str, float]] = {} + # session_id -> ping task + _ping_tasks: ClassVar[dict[str, asyncio.Task]] = {}
| t0 = time.time() | ||
| logger.info("[TIMING-STDIO] send_command START: command=%s", command_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused variable t0.
The variable t0 is assigned but never used. The timing is logged but the stored value isn't referenced for duration calculation later.
🔧 Proposed fix
- t0 = time.time()
logger.info("[TIMING-STDIO] send_command START: command=%s", command_type)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| t0 = time.time() | |
| logger.info("[TIMING-STDIO] send_command START: command=%s", command_type) | |
| logger.info("[TIMING-STDIO] send_command START: command=%s", command_type) |
🧰 Tools
🪛 Ruff (0.14.14)
[error] 244-244: Local variable t0 is assigned to but never used
Remove assignment to unused variable t0
(F841)
🤖 Prompt for AI Agents
In `@Server/src/transport/legacy/unity_connection.py` around lines 244 - 245,
Remove the unused variable t0 in the send_command function: delete the
assignment "t0 = time.time()" (or replace it with a meaningful use if you
intended to compute duration) and keep the logger.info("[TIMING-STDIO]
send_command START: command=%s", command_type) line unchanged; locate this
inside the send_command method in unity_connection.py to make the edit.
- Add server-side ping loop to detect dead WebSocket connections - Include session_id in pong messages for tracking - Reduce domain reload timeout from 30s to 20s - Add ClassVar annotations for mutable class attributes - Add lock protection for _last_pong access - Change debug stack trace log from Warn to Debug level - Remove unused TIMING-STDIO variable - Fix flaky async duration test (allow 20% timer variance) - Fix Python test that cleared HOME env var on Windows - Skip Unix-path test on Windows (path separator difference) - Add LogAssert.Expect to PropertyConversion tests Fixes CoplayDev#654, CoplayDev#643 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2bfffe7 to
0c4c5cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@Server/src/transport/legacy/unity_connection.py`:
- Around line 780-800: Update the reload wait default and clamp to 30s: change
the environment default for UNITY_MCP_RELOAD_MAX_WAIT_S from "20.0" to "30.0",
adjust any hard-coded fallback max_wait_s values (currently set to 20.0) to
30.0, and modify the clamp statement that enforces the range to use 30.0 as the
upper bound (max_wait_s = max(0.0, min(max_wait_s, 30.0))). Also update the
logger.warning message text (and any nearby comments referencing 20s) to reflect
the new 30s default so messages and docs match the new cap; ensure you reference
the same variable name max_wait_s and the env var UNITY_MCP_RELOAD_MAX_WAIT_S in
your changes.
In `@Server/src/transport/plugin_hub.py`:
- Around line 492-519: In the exception handler for failed ping sends (the block
that creates PingMessage, calls websocket.send_json and handles send_ex),
replace the prohibited close code 1006 with 1001 when calling websocket.close
and ensure any exception thrown by websocket.close is logged instead of
swallowed; update the logger call to include context (session_id and the close
exception) and reference the same symbols (PingMessage, websocket.close,
cls.PING_TIMEOUT, logger) so the error path logs close failures for debugging.
In
`@TestProjects/UnityMCPTests/Assets/Tests/EditMode/Tools/PropertyConversionErrorHandlingTests.cs`:
- Around line 207-235: ManageEditor.HandleCommand currently lacks a case for the
"telemetry_status" action so unknown actions hit the default ErrorResponse;
update ManageEditor.HandleCommand (ManageEditor.cs) to handle "telemetry_status"
by either implementing a new branch that gathers and returns the telemetry
health fields (dispatcher status, success flags, etc.) or routing the action to
the existing telemetry handler (if one exists), ensuring the switch or dispatch
logic recognizes "telemetry_status" and returns a success response object used
by the test.
🧹 Nitpick comments (4)
TestProjects/UnityMCPTests/Assets/Tests/EditMode/Services/Server/ServerCommandBuilderTests.cs (1)
113-118: UseAssert.Ignoreto mark the test as skipped on Windows instead ofAssert.Pass.
Assert.Passmarks the test as passed in test reports, which masks that it never actually ran.Assert.Ignorecorrectly records it as skipped, keeping test results accurate.Suggested change
if (UnityEngine.Application.platform == UnityEngine.RuntimePlatform.WindowsEditor) { - Assert.Pass("Skipped on Windows - use BuildUvPathFromUvx_WindowsPath_ConvertsCorrectly instead"); - return; + Assert.Ignore("Skipped on Windows - use BuildUvPathFromUvx_WindowsPath_ConvertsCorrectly instead"); }TestProjects/UnityMCPTests/Assets/Tests/EditMode/Tools/PropertyConversion_ArrayForFloat_Test.cs (1)
36-56: Assert error response and unchanged property.Right now the test only checks for a non-null result, so it would still pass if the command incorrectly succeeds. Consider asserting the error response and that
spatialBlendremains unchanged.♻️ Suggested assertion tightening
var result = ManageComponents.HandleCommand(setPropertyParams); -Assert.IsNotNull(result, "Should return a result"); +Assert.IsNotNull(result, "Should return a result"); +if (result is ErrorResponse error) +{ + Assert.IsFalse(error.Success, "Should report failure for incompatible value"); +} +else +{ + Assert.Fail($"Expected ErrorResponse but got {result.GetType().Name}"); +} +Assert.AreEqual(0f, audioSource.spatialBlend, "spatialBlend should remain unchanged");TestProjects/UnityMCPTests/Assets/Tests/EditMode/Tools/PropertyConversionErrorHandlingTests.cs (2)
40-66: Enforce the expected ErrorResponse.The test allows a success response to pass; that undermines the “graceful error” assertion. Consider requiring an
ErrorResponsewithSuccess == false.♻️ Suggested assertion tightening
var result = ManageComponents.HandleCommand(setPropertyParams); - -// Main test: should return a result without crashing -Assert.IsNotNull(result, "Should return a result, not crash dispatcher"); - -// If it's an ErrorResponse, verify it properly reports failure -if (result is ErrorResponse errorResp) -{ - Assert.IsFalse(errorResp.Success, "Should report failure for incompatible type"); -} +var errorResp = result as ErrorResponse; +Assert.IsNotNull(errorResp, "Expected ErrorResponse for incompatible type"); +Assert.IsFalse(errorResp.Success, "Should report failure for incompatible type");
241-280: Test currently passes in all branches.
Assert.Passis invoked in every path, so this test can’t fail even if behavior regresses. Consider asserting the intended contract once defined (e.g., no throw + null result, or a specific error path).
| # Default to 20s to handle domain reloads (which can take 10-20s after tests or script changes). | ||
| # | ||
| # NOTE: This wait can impact agentic workflows where domain reloads happen | ||
| # frequently (e.g., after test runs, script compilation). The 20s default | ||
| # balances handling slow reloads vs. avoiding unnecessary delays. | ||
| # | ||
| # TODO: Make this more deterministic by detecting Unity's actual reload state | ||
| # rather than blindly waiting up to 20s. See Issue #657. | ||
| # | ||
| # Configurable via: UNITY_MCP_RELOAD_MAX_WAIT_S (default: 20.0, max: 20.0) | ||
| try: | ||
| max_wait_s = float(os.environ.get( | ||
| "UNITY_MCP_RELOAD_MAX_WAIT_S", "2.0")) | ||
| "UNITY_MCP_RELOAD_MAX_WAIT_S", "20.0")) | ||
| except ValueError as e: | ||
| raw_val = os.environ.get("UNITY_MCP_RELOAD_MAX_WAIT_S", "2.0") | ||
| raw_val = os.environ.get("UNITY_MCP_RELOAD_MAX_WAIT_S", "20.0") | ||
| logger.warning( | ||
| "Invalid UNITY_MCP_RELOAD_MAX_WAIT_S=%r, using default 2.0: %s", | ||
| "Invalid UNITY_MCP_RELOAD_MAX_WAIT_S=%r, using default 20.0: %s", | ||
| raw_val, e) | ||
| max_wait_s = 2.0 | ||
| # Clamp to [0, 30] to prevent misconfiguration from causing excessive waits | ||
| max_wait_s = max(0.0, min(max_wait_s, 30.0)) | ||
| max_wait_s = 20.0 | ||
| # Clamp to [0, 20] to prevent misconfiguration from causing excessive waits | ||
| max_wait_s = max(0.0, min(max_wait_s, 20.0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reload wait is still capped at 20s—objective calls for 30s.
Issue #654 notes reloads can take 10–30s, but the default, clamp, and log text still enforce 20s, which can prematurely abort retries. Please bump the default and cap (and update messaging) to 30s.
♻️ Proposed fix
- # Default to 20s to handle domain reloads (which can take 10-20s after tests or script changes).
+ # Default to 30s to handle domain reloads (which can take 10-30s after tests or script changes).
@@
- # Configurable via: UNITY_MCP_RELOAD_MAX_WAIT_S (default: 20.0, max: 20.0)
+ # Configurable via: UNITY_MCP_RELOAD_MAX_WAIT_S (default: 30.0, max: 30.0)
try:
- max_wait_s = float(os.environ.get(
- "UNITY_MCP_RELOAD_MAX_WAIT_S", "20.0"))
+ max_wait_s = float(os.environ.get(
+ "UNITY_MCP_RELOAD_MAX_WAIT_S", "30.0"))
except ValueError as e:
- raw_val = os.environ.get("UNITY_MCP_RELOAD_MAX_WAIT_S", "20.0")
+ raw_val = os.environ.get("UNITY_MCP_RELOAD_MAX_WAIT_S", "30.0")
logger.warning(
- "Invalid UNITY_MCP_RELOAD_MAX_WAIT_S=%r, using default 20.0: %s",
+ "Invalid UNITY_MCP_RELOAD_MAX_WAIT_S=%r, using default 30.0: %s",
raw_val, e)
- max_wait_s = 20.0
- # Clamp to [0, 20] to prevent misconfiguration from causing excessive waits
- max_wait_s = max(0.0, min(max_wait_s, 20.0))
+ max_wait_s = 30.0
+ # Clamp to [0, 30] to prevent misconfiguration from causing excessive waits
+ max_wait_s = max(0.0, min(max_wait_s, 30.0))🤖 Prompt for AI Agents
In `@Server/src/transport/legacy/unity_connection.py` around lines 780 - 800,
Update the reload wait default and clamp to 30s: change the environment default
for UNITY_MCP_RELOAD_MAX_WAIT_S from "20.0" to "30.0", adjust any hard-coded
fallback max_wait_s values (currently set to 20.0) to 30.0, and modify the clamp
statement that enforces the range to use 30.0 as the upper bound (max_wait_s =
max(0.0, min(max_wait_s, 30.0))). Also update the logger.warning message text
(and any nearby comments referencing 20s) to reflect the new 30s default so
messages and docs match the new cap; ensure you reference the same variable name
max_wait_s and the env var UNITY_MCP_RELOAD_MAX_WAIT_S in your changes.
| # Check staleness: has it been too long since we got a pong? | ||
| elapsed = time.monotonic() - last_pong | ||
| if elapsed > cls.PING_TIMEOUT: | ||
| logger.warning( | ||
| f"[Ping] Session {session_id} stale: no pong for {elapsed:.1f}s " | ||
| f"(timeout={cls.PING_TIMEOUT}s). Closing connection." | ||
| ) | ||
| try: | ||
| await websocket.close(code=1001) # Going away | ||
| except Exception as close_ex: | ||
| logger.debug(f"[Ping] Error closing stale websocket: {close_ex}") | ||
| break | ||
|
|
||
| # Send a ping to the client | ||
| try: | ||
| ping_msg = PingMessage() | ||
| await websocket.send_json(ping_msg.model_dump()) | ||
| logger.debug(f"[Ping] Sent ping to session {session_id}") | ||
| except Exception as send_ex: | ||
| # Send failed - connection is dead | ||
| logger.warning( | ||
| f"[Ping] Failed to send ping to session {session_id}: {send_ex}. " | ||
| "Connection likely dead." | ||
| ) | ||
| try: | ||
| await websocket.close(code=1006) # Abnormal closure | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🌐 Web query:
RFC 6455 WebSocket close code 1006 must not be sent close frame specification
💡 Result:
In RFC 6455, WebSocket close code 1006 (“abnormal closure”) is reserved and must not be sent on the wire:
- “1006 is a reserved value and MUST NOT be set as a status code in a Close control frame by an endpoint.” It is only for local/reporting use to indicate the connection closed abnormally (e.g., the TCP connection dropped) without sending or receiving a Close control frame. [1]
- RFC 6455 also defines that if the connection is closed and no Close control frame was received, then the implementation considers the connection close code to be 1006 (again, internal/observed state—not a transmitted close frame). [2]
So: you will never legitimately see a peer send a Close frame with code 1006; implementations use it to report “abnormal closure” when the closing handshake didn’t happen. [1][2]
Sources:
[1] RFC 6455, Section 7.4.1 “Defined Status Codes” (1006) https://datatracker.ietf.org/doc/html/rfc6455
[2] RFC 6455, Section 7.1.5 “The WebSocket Connection Close Code” (1006 when no Close frame received) https://datatracker.ietf.org/doc/html/rfc6455
🏁 Script executed:
# Find and examine the file mentioned in the review
fd -t f "plugin_hub.py" | head -5Repository: CoplayDev/unity-mcp
Length of output: 97
🏁 Script executed:
# Read the specific lines mentioned in the review
sed -n '492,519p' Server/src/transport/plugin_hub.pyRepository: CoplayDev/unity-mcp
Length of output: 1455
🏁 Script executed:
# Let's see more context around these lines to understand the full function
sed -n '470,530p' Server/src/transport/plugin_hub.pyRepository: CoplayDev/unity-mcp
Length of output: 2947
Replace close code 1006 with 1001 and log close failures.
Line 517 uses close code 1006, which RFC 6455 explicitly reserves and forbids sending on the wire—it must only be used internally to report abnormal closure when no Close frame was exchanged. Sending it violates the specification and can cause protocol violations. Additionally, the exception handler at lines 518–519 silently swallows errors without logging, hiding diagnostics crucial for debugging connection failures.
🔧 Proposed fix
await websocket.close(code=1006) # Abnormal closure
- except Exception:
- pass
+ await websocket.close(code=1001) # Going away
+ except Exception as close_ex:
+ logger.debug(f"[Ping] Error closing websocket after send failure: {close_ex}")🧰 Tools
🪛 Ruff (0.14.14)
[warning] 501-501: Do not catch blind exception: Exception
(BLE001)
[warning] 510-510: Do not catch blind exception: Exception
(BLE001)
[error] 518-519: try-except-pass detected, consider logging the exception
(S110)
[warning] 518-518: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@Server/src/transport/plugin_hub.py` around lines 492 - 519, In the exception
handler for failed ping sends (the block that creates PingMessage, calls
websocket.send_json and handles send_ex), replace the prohibited close code 1006
with 1001 when calling websocket.close and ensure any exception thrown by
websocket.close is logged instead of swallowed; update the logger call to
include context (session_id and the close exception) and reference the same
symbols (PingMessage, websocket.close, cls.PING_TIMEOUT, logger) so the error
path logs close failures for debugging.
| [Test] | ||
| public void ManageEditor_TelemetryStatus_ReportsAccurateHealth() | ||
| { | ||
| // Trigger multiple conversion failures first | ||
| var audioSource = testGameObject.AddComponent<AudioSource>(); | ||
|
|
||
| for (int i = 0; i < 3; i++) | ||
| { | ||
| var badParam = new JObject | ||
| { | ||
| ["action"] = "set_property", | ||
| ["target"] = testGameObject.name, | ||
| ["componentType"] = "AudioSource", | ||
| ["property"] = "clip", | ||
| ["value"] = i * 1000 // bad | ||
| }; | ||
| ManageComponents.HandleCommand(badParam); | ||
| } | ||
|
|
||
| // Now check telemetry | ||
| var telemetryParams = new JObject { ["action"] = "telemetry_status" }; | ||
| var telemetryResult = ManageEditor.HandleCommand(telemetryParams); | ||
|
|
||
| Assert.IsNotNull(telemetryResult, "Telemetry should return result"); | ||
|
|
||
| // NOTE: Issue #654 noted that telemetry returns success even when dispatcher is dead. | ||
| // If telemetry returns success, that's the actual current behavior (which may be a problem). | ||
| // This test just documents what happens. | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
rg -n --type=cs -C3 'telemetry_status'Repository: CoplayDev/unity-mcp
Length of output: 8475
🏁 Script executed:
rg -n --type=cs 'telemetry_status' -B5 -A10 | grep -E '(case|HandleCommand|dispatcher|health|success)' | head -50Repository: CoplayDev/unity-mcp
Length of output: 4682
🏁 Script executed:
# Find the ManageEditor implementation
fd -e cs -type f | xargs rg -l 'ManageEditor' | head -10Repository: CoplayDev/unity-mcp
Length of output: 490
🏁 Script executed:
cat -n MCPForUnity/Editor/Tools/ManageEditor.cs | head -100Repository: CoplayDev/unity-mcp
Length of output: 4734
🏁 Script executed:
# Search for telemetry_status handler specifically
rg -n 'telemetry_status|case "telemetry' MCPForUnity/Editor/Tools/ManageEditor.cs -A 20Repository: CoplayDev/unity-mcp
Length of output: 45
🏁 Script executed:
# Search for telemetry in the entire codebase
rg -n 'case.*telemetry' --type=cs -A 15Repository: CoplayDev/unity-mcp
Length of output: 45
🏁 Script executed:
# Get line count to see full file
wc -l MCPForUnity/Editor/Tools/ManageEditor.csRepository: CoplayDev/unity-mcp
Length of output: 107
🏁 Script executed:
cat -n MCPForUnity/Editor/Tools/ManageEditor.cs | tail -300Repository: CoplayDev/unity-mcp
Length of output: 15288
🏁 Script executed:
# Search for where telemetry_status is handled - might be in a different file
rg -rn 'telemetry_status' --type=cs -B 5 -A 15 | grep -E '(case|return|new|telemetry_status|dispatcher|health)' | head -60Repository: CoplayDev/unity-mcp
Length of output: 4442
🏁 Script executed:
# Search for dispatcher or health related code
rg -rn 'dispatcher' --type=cs | head -20Repository: CoplayDev/unity-mcp
Length of output: 2117
🏁 Script executed:
cat -n TestProjects/UnityMCPTests/Assets/Tests/EditMode/Tools/PropertyConversionErrorHandlingTests.cs | sed -n '207,235p'Repository: CoplayDev/unity-mcp
Length of output: 1502
🏁 Script executed:
# Search for where telemetry might be handled as a resource
fd -type f -name '*.cs' | xargs grep -l 'telemetry' | head -20Repository: CoplayDev/unity-mcp
Length of output: 233
🏁 Script executed:
# Search for telemetry in all files
find . -name '*.cs' -type f | xargs grep -l telemetryRepository: CoplayDev/unity-mcp
Length of output: 9188
🏁 Script executed:
# Look for resource implementations
find . -name '*Resource*.cs' -type f | head -10Repository: CoplayDev/unity-mcp
Length of output: 184
🏁 Script executed:
cat -n MCPForUnity/Editor/Helpers/TelemetryHelper.csRepository: CoplayDev/unity-mcp
Length of output: 9323
🏁 Script executed:
# Look for how telemetry_status action is actually handled
find . -name '*.cs' -type f -exec grep -l 'telemetry_status.*action\|HandleCommand.*telemetry' {} \; | head -20Repository: CoplayDev/unity-mcp
Length of output: 265
🏁 Script executed:
# Search for response classes that might include dispatcher health
find . -name '*Response*.cs' -type f | head -20Repository: CoplayDev/unity-mcp
Length of output: 1298
🏁 Script executed:
# Look for a McpServer or handler that might route telemetry_status
find . -name '*Server*.cs' -o -name '*Handler*.cs' -o -name '*Dispatcher*.cs' | head -20Repository: CoplayDev/unity-mcp
Length of output: 922
🏁 Script executed:
cat -n MCPForUnity/Editor/Services/Transport/TransportCommandDispatcher.csRepository: CoplayDev/unity-mcp
Length of output: 17315
🏁 Script executed:
# Find CommandRegistry
find . -name 'CommandRegistry.cs' -type fRepository: CoplayDev/unity-mcp
Length of output: 108
🏁 Script executed:
# Search for telemetry_status registration
grep -rn 'telemetry_status' --include='*.cs' | grep -v Test | grep -v '.Trim()'Repository: CoplayDev/unity-mcp
Length of output: 45
🏁 Script executed:
cat -n MCPForUnity/Editor/Tools/CommandRegistry.csRepository: CoplayDev/unity-mcp
Length of output: 18443
🏁 Script executed:
# Find telemetry_status as a resource or tool
find . -name '*.cs' -type f -exec grep -l 'telemetry_status\|TelemetryStatus' {} \; | grep -v Test | grep -v '.Trim()'Repository: CoplayDev/unity-mcp
Length of output: 45
🏁 Script executed:
# Search for McpForUnityResource or McpForUnityTool with telemetry context
grep -rn 'McpForUnityResource\|McpForUnityTool' --include='*.cs' | grep -v Test | head -20Repository: CoplayDev/unity-mcp
Length of output: 2277
🏁 Script executed:
# Search for all HandleCommand methods across the codebase
grep -rn 'public static.*HandleCommand' --include='*.cs' | grep -v TestRepository: CoplayDev/unity-mcp
Length of output: 3557
🏁 Script executed:
# Look for files that might contain telemetry action
find . -path ./TestProjects -prune -o -name '*.cs' -type f -print | xargs grep -l 'action.*telemetry\|telemetry.*action' | head -20Repository: CoplayDev/unity-mcp
Length of output: 45
🏁 Script executed:
# Read the full test to see what it actually checks
cat -n TestProjects/UnityMCPTests/Assets/Tests/EditMode/Tools/PropertyConversionErrorHandlingTests.cs | sed -n '200,250p'Repository: CoplayDev/unity-mcp
Length of output: 2599
🏁 Script executed:
# Let's also check if ManageEditor has a default case that might handle telemetry
grep -n 'default:' MCPForUnity/Editor/Tools/ManageEditor.cs -A 10Repository: CoplayDev/unity-mcp
Length of output: 616
Test will fail with unknown action error before reaching health assertions.
The test calls ManageEditor.HandleCommand with action telemetry_status, but this action is not implemented in ManageEditor.cs (lines 50-141). The switch statement only handles: play, pause, stop, set_active_tool, add_tag, remove_tag, add_layer, remove_layer. Any unknown action hits the default case and returns an ErrorResponse, so the test will fail immediately rather than proceed to check telemetry response content.
Either:
- Implement telemetry_status handling in ManageEditor, or
- Route telemetry_status to the correct handler if it exists elsewhere.
Once the action is properly implemented, the suggestion to assert on health fields (dispatcher status, success flags) aligns with issue #654's intent and should be added.
🤖 Prompt for AI Agents
In
`@TestProjects/UnityMCPTests/Assets/Tests/EditMode/Tools/PropertyConversionErrorHandlingTests.cs`
around lines 207 - 235, ManageEditor.HandleCommand currently lacks a case for
the "telemetry_status" action so unknown actions hit the default ErrorResponse;
update ManageEditor.HandleCommand (ManageEditor.cs) to handle "telemetry_status"
by either implementing a new branch that gathers and returns the telemetry
health fields (dispatcher status, success flags, etc.) or routing the action to
the existing telemetry handler (if one exists), ensuring the switch or dispatch
logic recognizes "telemetry_status" and returns a success response object used
by the test.
Summary
Problem
Two related issues were causing connection reliability problems:
Issue PropertyConversion failures crash command dispatcher while telemetry continues #654: After running test suites or other operations that trigger Unity domain reloads, the dispatcher would become unreachable ("No Unity Editor instances found"). The root cause was that domain reloads take 10-20s, but the reconnection timeout was only 2s.
Issue beta branch: when Claude Code reconnects, Unity session becomes "no session" even though handshake succeeds. This breaks the workflow completely. Logs attached. But Sometimes it works,and the bug only occurs in beta by using http to connect #643: On Windows, WebSocket connections can die silently (OSError 64: "The specified network name is no longer available") without either side being notified. Commands would fail with "Unity session not available" until Unity eventually detected the dead connection.
Solution
Ping/Pong Heartbeat (fixes #643)
{"type": "ping"}every 10 seconds{"type": "pong", "session_id": "..."}Domain Reload Recovery (fixes #654)
UNITY_MCP_SESSION_RESOLVE_MAX_WAIT_Sfrom 2s to 20sUNITY_MCP_RELOAD_MAX_WAIT_Sfrom 2s to 20s_is_http_transport()check so stdio transport skips PluginHub entirelyCode Quality
ClassVartype annotations for mutable class attributes_last_pongdict accessWarntoDebuglevelTest Fixes
Path.home()on WindowsLogAssert.Expectto PropertyConversion tests for expected error logsTest plan
[Ping]messages)Fixes #654
Fixes #643
🤖 Generated with Claude Code