feat(workflows): cloud-connect fix workflows (claude hang + utils bundling)#738
Conversation
Two agent-relay workflows that fix regressions in `agent-relay cloud
connect <provider>` and add regression guards to prevent them shipping
again.
1. fix-cloud-connect-claude-hang.ts — claude TUI hang
- Refactors src/cli/lib/ssh-interactive.ts to (a) register
stream.on('data') BEFORE stream.write so synchronous early output is
not dropped, and (b) use `exec <cmd>` instead of `<cmd>; exit $?` so
the shell is replaced with the CLI and there is no teardown race
against the Ink alternate-screen-buffer final flush.
- Extracts formatShellInvocation() as a pure helper and adds vitest
unit tests with a mock ssh2 Client (handler-order regression,
exec-prefix assertion, synchronous-early-data capture).
2. fix-agent-relay-utils-bundling.ts — @agent-relay/utils not shipped
- Adds scripts/prepack-materialize-workspaces.mjs to replace
node_modules/@agent-relay/* symlinks with real directories so
bundledDependencies actually copies them into the tarball.
- Adds scripts/verify-bundled-deps.mjs and wires it into prepack +
prepublishOnly as a fail-fast guard.
- Extends scripts/post-publish-verify/verify-install.sh with a new
Test 6 that resolves @agent-relay/utils from the installed package
and dynamic-imports the cloud-connect code path, so an
ERR_MODULE_NOT_FOUND regression is caught on the next publish
instead of in user reports.
Both workflows follow the 80-100 pattern (test-fix-rerun, verify gates,
typecheck, regression suite) so they stay useful beyond the initial
landing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xkonjin
left a comment
There was a problem hiding this comment.
Nice work on the cloud connect bundling fix. A few observations:
Potential issue in prepack-materialize-workspaces.mjs:
The script uses a mix of sync and async realpath APIs. The comment mentions using realpathSync but the code shows await import('node:fs/promises')).realpath(dst). Consider standardizing on fs.realpathSync for consistency since the rest of the script uses sync APIs.
Race condition concern in ssh-interactive.ts:
The fix moves stream event handlers before the write call, which is good. However, the timer is now started after stdin.resume() but before the stream.write(). If the remote side responds very quickly (within microseconds), the timeout might not be active yet. Consider starting the timer before or immediately alongside stdin setup.
Test coverage:
The new ssh-interactive.test.ts covers the handler-order regression well. Consider adding a test case for the timeout timer initialization order to prevent future regressions.
vitest.config.ts changes:
The new workspaceAliases approach is cleaner than the hardcoded alias list. Make sure all packages listed in workspacePackages are actually built before this config is used, or you may get resolve errors during test runs if a package has not been compiled yet.
Overall solid fix for a tricky bundling issue.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: Fix cloud connect claude hang
Overall: This is a well-structured fix for a race condition in the SSH interactive session handling. The reordering of event handlers and the switch from ; exit $? to exec pattern eliminates the data-listener race (H1) and shell-exit race (H2).
What looks good
-
Handler ordering fix: Moving
stream.on('data', ...)beforestream.write()is the correct fix for the race where early TUI output could be dropped before the listener attached. -
execpattern: Usingexec <command>instead of<command>; exit $?eliminates the shell-teardown race. The PTY now closes when the CLI exits naturally. -
Exported testable helper:
formatShellInvocationis pure, simple, and directly testable. Good separation of concerns. -
Test coverage: The new unit tests verify the handler-order regression (H1) and the shell invocation format. Mocking
ssh2is appropriate here.
Suggestions
1. Comment the handler-order invariant
Consider adding a code comment explaining why stream.on('data') must precede stream.write():
// Attach data listener BEFORE writing to avoid dropping early shell output
// that may be emitted synchronously when the shell spawns.
stream.on('data', ...);
stream.write(formatShellInvocation(command));2. Edge case: command with shell metacharacters
formatShellInvocation blindly wraps the command. If command contains shell metacharacters (e.g., claude; rm -rf /), the behavior changes with exec. Consider whether input validation is needed upstream.
3. Verify no other callers need updating
Search for other stream.write patterns with exit $? in the codebase — the same bug may exist elsewhere in SSH-related code.
Security
No new security issues. The change eliminates a shell wrapper, which is slightly safer.
Test coverage
Good regression tests. Consider adding:
- Test for command with quotes/special characters
- Test for the timeout path (timer firing before completion)
Verdict: Approve.
- Wrap command in `exec sh -c '…'` so zsh's exec builtin doesn't treat leading `VAR=…` prefix assignments as the command name. - Escape single quotes in the shell-wrapped command body. - Ignore pre-command shell MOTD when matching success/error patterns by resetting the output buffer and gating pattern matching until after the command is written. Some sandbox images print "Last logged in: …" which falsely matched the broad /logged\s*in/i pattern. - Drop the `exitCode === 0` fallback for authDetected — zsh stays alive after a failed `exec` in interactive mode, and a user closing the session with Ctrl+D exits 0 without authenticating. - Update tests to cover the new payload shape, env-var prefix handling, quote escaping, MOTD filtering, and post-write pattern matching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The "ignores pre-command shell MOTD" test was failing because the fake emitted its early data AFTER stream.write, which meant the buffer reset had already enabled pattern matching and the /READY/ pattern matched. Replace it with a test that reproduces the actual cloud-connect regression this branch fixes: a clean shell exit (exit code 0) without any pattern match must not be reported as authDetected. This is the behavior that caused `cloud connect claude` / `cloud connect codex` to always print "Authentication Complete!" even when the remote exec failed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| const distSrc = join(realPath, 'dist'); | ||
| if (existsSync(distSrc)) { | ||
| cpSync(distSrc, join(dst, 'dist'), { recursive: true }); | ||
| } | ||
|
|
||
| const readmeSrc = join(realPath, 'README.md'); | ||
| if (existsSync(readmeSrc)) { | ||
| cpSync(readmeSrc, join(dst, 'README.md')); | ||
| } |
There was a problem hiding this comment.
🔴 Materializer omits bin/ directory, losing @agent-relay/sdk broker binary
The prepack-materialize-workspaces.mjs script hardcodes copying only package.json, dist/, and README.md from each symlinked workspace package. However, @agent-relay/sdk declares "files": ["dist", "bin", "README.md", "package.json"] (packages/sdk/package.json), and its bin/ directory contains the platform-specific agent-relay-broker binary (populated by the build:rust script that runs before the materializer in the prepack chain). The getBrokerBinaryPath() resolver in packages/sdk/src/broker-path.ts:68 looks for the binary at resolve(currentModuleDir, '..', 'bin') — i.e. the SDK's own bin/ directory. Because the materializer doesn't copy bin/, the bundled node_modules/@agent-relay/sdk/bin/ will be absent in the published tarball, and the primary broker binary resolution path will silently fail. The root package.json files array does not include packages/sdk/bin either, so the binary has no other packaging route into the tarball.
| const distSrc = join(realPath, 'dist'); | |
| if (existsSync(distSrc)) { | |
| cpSync(distSrc, join(dst, 'dist'), { recursive: true }); | |
| } | |
| const readmeSrc = join(realPath, 'README.md'); | |
| if (existsSync(readmeSrc)) { | |
| cpSync(readmeSrc, join(dst, 'README.md')); | |
| } | |
| const distSrc = join(realPath, 'dist'); | |
| if (existsSync(distSrc)) { | |
| cpSync(distSrc, join(dst, 'dist'), { recursive: true }); | |
| } | |
| const binSrc = join(realPath, 'bin'); | |
| if (existsSync(binSrc)) { | |
| cpSync(binSrc, join(dst, 'bin'), { recursive: true }); | |
| } | |
| const readmeSrc = join(realPath, 'README.md'); | |
| if (existsSync(readmeSrc)) { | |
| cpSync(readmeSrc, join(dst, 'README.md')); | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
xkonjin
left a comment
There was a problem hiding this comment.
Solid, well-scoped PR — the bundling fix addresses a real regression and the ssh-interactive refactor closes a race condition. Feedback below:
Bundling / packaging:
-
prepack-materialize-workspaces.mjsidempotency edge case: The script skips already-materialized dirs via.materializedmarker, but if a developer runsnpm installafter materialization, npm may restore symlinks while the.materializedmarker is still present. The nextprepackwould silently skip the stale real-dir. Consider checkinglstatSync(dst).isSymbolicLink()even when the marker exists, or documenting thatnpm installafterprepackrequires manualgit checkout node_modules/. -
verify-bundled-deps.mjssymlink check: Good. But it only checks root-level symlinks. Ifnode_modules/@agent-relay/utils/node_modules/.bin/contains symlinks (common), they are irrelevant tobundledDependenciesbut worth keeping in mind. Not blocking. -
Coverage thresholds: The
vitest.config.tsexcludespackages/cloud/src/workflows.ts,packages/cloud/src/api-client.ts, andpackages/telemetry/**from coverage. Make sure these exclusions are intentional and temporary — if they're permanently dead code, consider deleting them instead.
SSH interactive / claude hang:
-
formatShellInvocationescaping bug: The implementation handles single quotes viareplace(/'/g, "'\\''"), which is correct for POSIXsh -c. But it does not escape backslashes before single quotes. A command likeecho 'it\'s working'would break the quoting. This is a minor edge case, but since the function is now exported and tested, consider documenting the limitation or handling\\'sequences. -
authDetectedlogic change: The return expression now requiresexecResult?.authDetected === trueand no longer falls back toexitCode === 0. This is the correct fix for the zshexecfailure case, but verify that all callers supply non-emptysuccessPatterns— otherwise a successful auth with no pattern will now reportauthDetected: false. The tests show callers do supply patterns, but worth a quick audit. -
Missing test for
cleanup()onstream.error: TherunInteractiveSessiontests are great for the H1 race, but there's no test verifying that a stream error mid-session callscleanup(), restores raw mode, and rejects. A future refactor could regress this. Consider adding one.
Approve with nits — the bundling and claude-hang fixes are both high-value and low-risk.
Summary
Adds two agent-relay workflows under
workflows/cloud-connect/that fix regressions inagent-relay cloud connect <provider>and install regression guards so they don't reship.fix-cloud-connect-claude-hang.ts—agent-relay cloud connect anthropicprints "Starting interactive authentication..." and hangs. Root cause is insrc/cli/lib/ssh-interactive.ts: (1)stream.on('data', ...)is registered AFTERstream.write, so bytes emitted synchronously on shell open are dropped; (2); exit $?wrapper races with the Ink TUI's final alternate-screen-buffer flush. The workflow extracts a pureformatShellInvocationhelper, reorders the shell callback, replaces the payload withexec <cmd>, and writes vitest unit tests with a mock ssh2Clientthat exercise the handler-order regression, theexecprefix, and synchronous early data capture.fix-agent-relay-utils-bundling.ts—npx agent-relay cloud connect openaifails withERR_MODULE_NOT_FOUND: Cannot find package '@agent-relay/utils', butagent-relay --version/--helppass because the existing post-publish verifier never exercises a code path that imports@agent-relay/utils.bundledDependencieslists@agent-relay/utils, but atnpm packtimenode_modules/@agent-relay/utilsis a workspace symlink andnpm packdoesn't follow symlinks out of the package root, so the tarball silently ships nothing undernode_modules/@agent-relay/. The workflow addsscripts/prepack-materialize-workspaces.mjsto replace those symlinks with real directories before pack,scripts/verify-bundled-deps.mjswired intoprepack+prepublishOnlyas a fail-fast guard, and a new Test 6 inscripts/post-publish-verify/verify-install.shthatrequire.resolves@agent-relay/utilsfrom the installed package and dynamic-imports the cloud-connect code path.Both workflows follow the 80-100 pattern (test-fix-rerun, verify gates, typecheck, regression suite) so they stay useful as permanent regression-detection workflows, not one-shot fixers. The companion server-side fix for the openai PATH-propagation hang landed in
AgentWorkforce/cloud#151.Also unignores
/workflows/cloud-connect/in.gitignorealongside the existingci/,refactor/, andrelayauth-integration/exceptions.Test plan
cd relay && agent-relay run workflows/cloud-connect/fix-cloud-connect-claude-hang.tsruns to completion with all steps greencd relay && agent-relay run workflows/cloud-connect/fix-agent-relay-utils-bundling.tsruns to completion with all steps greenagent-relay cloud connect anthropicsuccessfully opens the claude TUI instead of hangingnpx agent-relay cloud connect openaireaches codex login withoutERR_MODULE_NOT_FOUND@agent-relay/utilsis ever unresolvable from the installed package🤖 Generated with Claude Code