Skip to content

feat(workflows): cloud-connect fix workflows (claude hang + utils bundling)#738

Merged
khaliqgant merged 10 commits into
mainfrom
fix/cloud-connect-workflows
Apr 14, 2026
Merged

feat(workflows): cloud-connect fix workflows (claude hang + utils bundling)#738
khaliqgant merged 10 commits into
mainfrom
fix/cloud-connect-workflows

Conversation

@khaliqgant
Copy link
Copy Markdown
Member

@khaliqgant khaliqgant commented Apr 13, 2026

Summary

Adds two agent-relay workflows under workflows/cloud-connect/ that fix regressions in agent-relay cloud connect <provider> and install regression guards so they don't reship.

  • fix-cloud-connect-claude-hang.tsagent-relay cloud connect anthropic prints "Starting interactive authentication..." and hangs. Root cause is in src/cli/lib/ssh-interactive.ts: (1) stream.on('data', ...) is registered AFTER stream.write, so bytes emitted synchronously on shell open are dropped; (2) ; exit $? wrapper races with the Ink TUI's final alternate-screen-buffer flush. The workflow extracts a pure formatShellInvocation helper, reorders the shell callback, replaces the payload with exec <cmd>, and writes vitest unit tests with a mock ssh2 Client that exercise the handler-order regression, the exec prefix, and synchronous early data capture.

  • fix-agent-relay-utils-bundling.tsnpx agent-relay cloud connect openai fails with ERR_MODULE_NOT_FOUND: Cannot find package '@agent-relay/utils', but agent-relay --version / --help pass because the existing post-publish verifier never exercises a code path that imports @agent-relay/utils. bundledDependencies lists @agent-relay/utils, but at npm pack time node_modules/@agent-relay/utils is a workspace symlink and npm pack doesn't follow symlinks out of the package root, so the tarball silently ships nothing under node_modules/@agent-relay/. The workflow adds scripts/prepack-materialize-workspaces.mjs to replace those symlinks with real directories before pack, scripts/verify-bundled-deps.mjs wired into prepack + prepublishOnly as a fail-fast guard, and a new Test 6 in scripts/post-publish-verify/verify-install.sh that require.resolves @agent-relay/utils from the installed package and dynamic-imports the cloud-connect code path.

Both workflows follow the 80-100 pattern (test-fix-rerun, verify gates, typecheck, regression suite) so they stay useful as permanent regression-detection workflows, not one-shot fixers. The companion server-side fix for the openai PATH-propagation hang landed in AgentWorkforce/cloud#151.

Also unignores /workflows/cloud-connect/ in .gitignore alongside the existing ci/, refactor/, and relayauth-integration/ exceptions.

Test plan

  • cd relay && agent-relay run workflows/cloud-connect/fix-cloud-connect-claude-hang.ts runs to completion with all steps green
  • cd relay && agent-relay run workflows/cloud-connect/fix-agent-relay-utils-bundling.ts runs to completion with all steps green
  • agent-relay cloud connect anthropic successfully opens the claude TUI instead of hanging
  • npx agent-relay cloud connect openai reaches codex login without ERR_MODULE_NOT_FOUND
  • Post-publish verification Test 6 fails cleanly if @agent-relay/utils is ever unresolvable from the installed package

🤖 Generated with Claude Code


Open with Devin

Two agent-relay workflows that fix regressions in `agent-relay cloud
connect <provider>` and add regression guards to prevent them shipping
again.

1. fix-cloud-connect-claude-hang.ts — claude TUI hang
   - Refactors src/cli/lib/ssh-interactive.ts to (a) register
     stream.on('data') BEFORE stream.write so synchronous early output is
     not dropped, and (b) use `exec <cmd>` instead of `<cmd>; exit $?` so
     the shell is replaced with the CLI and there is no teardown race
     against the Ink alternate-screen-buffer final flush.
   - Extracts formatShellInvocation() as a pure helper and adds vitest
     unit tests with a mock ssh2 Client (handler-order regression,
     exec-prefix assertion, synchronous-early-data capture).

2. fix-agent-relay-utils-bundling.ts — @agent-relay/utils not shipped
   - Adds scripts/prepack-materialize-workspaces.mjs to replace
     node_modules/@agent-relay/* symlinks with real directories so
     bundledDependencies actually copies them into the tarball.
   - Adds scripts/verify-bundled-deps.mjs and wires it into prepack +
     prepublishOnly as a fail-fast guard.
   - Extends scripts/post-publish-verify/verify-install.sh with a new
     Test 6 that resolves @agent-relay/utils from the installed package
     and dynamic-imports the cloud-connect code path, so an
     ERR_MODULE_NOT_FOUND regression is caught on the next publish
     instead of in user reports.

Both workflows follow the 80-100 pattern (test-fix-rerun, verify gates,
typecheck, regression suite) so they stay useful beyond the initial
landing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the cloud connect bundling fix. A few observations:

Potential issue in prepack-materialize-workspaces.mjs:
The script uses a mix of sync and async realpath APIs. The comment mentions using realpathSync but the code shows await import('node:fs/promises')).realpath(dst). Consider standardizing on fs.realpathSync for consistency since the rest of the script uses sync APIs.

Race condition concern in ssh-interactive.ts:
The fix moves stream event handlers before the write call, which is good. However, the timer is now started after stdin.resume() but before the stream.write(). If the remote side responds very quickly (within microseconds), the timeout might not be active yet. Consider starting the timer before or immediately alongside stdin setup.

Test coverage:
The new ssh-interactive.test.ts covers the handler-order regression well. Consider adding a test case for the timeout timer initialization order to prevent future regressions.

vitest.config.ts changes:
The new workspaceAliases approach is cleaner than the hardcoded alias list. Make sure all packages listed in workspacePackages are actually built before this config is used, or you may get resolve errors during test runs if a package has not been compiled yet.

Overall solid fix for a tricky bundling issue.

Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test review

Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Fix cloud connect claude hang

Overall: This is a well-structured fix for a race condition in the SSH interactive session handling. The reordering of event handlers and the switch from ; exit $? to exec pattern eliminates the data-listener race (H1) and shell-exit race (H2).

What looks good

  1. Handler ordering fix: Moving stream.on('data', ...) before stream.write() is the correct fix for the race where early TUI output could be dropped before the listener attached.

  2. exec pattern: Using exec <command> instead of <command>; exit $? eliminates the shell-teardown race. The PTY now closes when the CLI exits naturally.

  3. Exported testable helper: formatShellInvocation is pure, simple, and directly testable. Good separation of concerns.

  4. Test coverage: The new unit tests verify the handler-order regression (H1) and the shell invocation format. Mocking ssh2 is appropriate here.

Suggestions

1. Comment the handler-order invariant
Consider adding a code comment explaining why stream.on('data') must precede stream.write():

// Attach data listener BEFORE writing to avoid dropping early shell output
// that may be emitted synchronously when the shell spawns.
stream.on('data', ...);
stream.write(formatShellInvocation(command));

2. Edge case: command with shell metacharacters
formatShellInvocation blindly wraps the command. If command contains shell metacharacters (e.g., claude; rm -rf /), the behavior changes with exec. Consider whether input validation is needed upstream.

3. Verify no other callers need updating
Search for other stream.write patterns with exit $? in the codebase — the same bug may exist elsewhere in SSH-related code.

Security

No new security issues. The change eliminates a shell wrapper, which is slightly safer.

Test coverage

Good regression tests. Consider adding:

  • Test for command with quotes/special characters
  • Test for the timeout path (timer firing before completion)

Verdict: Approve.

khaliqgant and others added 3 commits April 14, 2026 09:49
- Wrap command in `exec sh -c '…'` so zsh's exec builtin doesn't treat
  leading `VAR=…` prefix assignments as the command name.
- Escape single quotes in the shell-wrapped command body.
- Ignore pre-command shell MOTD when matching success/error patterns by
  resetting the output buffer and gating pattern matching until after the
  command is written. Some sandbox images print "Last logged in: …" which
  falsely matched the broad /logged\s*in/i pattern.
- Drop the `exitCode === 0` fallback for authDetected — zsh stays alive
  after a failed `exec` in interactive mode, and a user closing the
  session with Ctrl+D exits 0 without authenticating.
- Update tests to cover the new payload shape, env-var prefix handling,
  quote escaping, MOTD filtering, and post-write pattern matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The "ignores pre-command shell MOTD" test was failing because the fake
emitted its early data AFTER stream.write, which meant the buffer reset
had already enabled pattern matching and the /READY/ pattern matched.

Replace it with a test that reproduces the actual cloud-connect regression
this branch fixes: a clean shell exit (exit code 0) without any pattern
match must not be reported as authDetected. This is the behavior that
caused `cloud connect claude` / `cloud connect codex` to always print
"Authentication Complete!" even when the remote exec failed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 13 additional findings in Devin Review.

Open in Devin Review

Comment on lines +82 to +90
const distSrc = join(realPath, 'dist');
if (existsSync(distSrc)) {
cpSync(distSrc, join(dst, 'dist'), { recursive: true });
}

const readmeSrc = join(realPath, 'README.md');
if (existsSync(readmeSrc)) {
cpSync(readmeSrc, join(dst, 'README.md'));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Materializer omits bin/ directory, losing @agent-relay/sdk broker binary

The prepack-materialize-workspaces.mjs script hardcodes copying only package.json, dist/, and README.md from each symlinked workspace package. However, @agent-relay/sdk declares "files": ["dist", "bin", "README.md", "package.json"] (packages/sdk/package.json), and its bin/ directory contains the platform-specific agent-relay-broker binary (populated by the build:rust script that runs before the materializer in the prepack chain). The getBrokerBinaryPath() resolver in packages/sdk/src/broker-path.ts:68 looks for the binary at resolve(currentModuleDir, '..', 'bin') — i.e. the SDK's own bin/ directory. Because the materializer doesn't copy bin/, the bundled node_modules/@agent-relay/sdk/bin/ will be absent in the published tarball, and the primary broker binary resolution path will silently fail. The root package.json files array does not include packages/sdk/bin either, so the binary has no other packaging route into the tarball.

Suggested change
const distSrc = join(realPath, 'dist');
if (existsSync(distSrc)) {
cpSync(distSrc, join(dst, 'dist'), { recursive: true });
}
const readmeSrc = join(realPath, 'README.md');
if (existsSync(readmeSrc)) {
cpSync(readmeSrc, join(dst, 'README.md'));
}
const distSrc = join(realPath, 'dist');
if (existsSync(distSrc)) {
cpSync(distSrc, join(dst, 'dist'), { recursive: true });
}
const binSrc = join(realPath, 'bin');
if (existsSync(binSrc)) {
cpSync(binSrc, join(dst, 'bin'), { recursive: true });
}
const readmeSrc = join(realPath, 'README.md');
if (existsSync(readmeSrc)) {
cpSync(readmeSrc, join(dst, 'README.md'));
}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-scoped PR — the bundling fix addresses a real regression and the ssh-interactive refactor closes a race condition. Feedback below:

Bundling / packaging:

  1. prepack-materialize-workspaces.mjs idempotency edge case: The script skips already-materialized dirs via .materialized marker, but if a developer runs npm install after materialization, npm may restore symlinks while the .materialized marker is still present. The next prepack would silently skip the stale real-dir. Consider checking lstatSync(dst).isSymbolicLink() even when the marker exists, or documenting that npm install after prepack requires manual git checkout node_modules/.

  2. verify-bundled-deps.mjs symlink check: Good. But it only checks root-level symlinks. If node_modules/@agent-relay/utils/node_modules/.bin/ contains symlinks (common), they are irrelevant to bundledDependencies but worth keeping in mind. Not blocking.

  3. Coverage thresholds: The vitest.config.ts excludes packages/cloud/src/workflows.ts, packages/cloud/src/api-client.ts, and packages/telemetry/** from coverage. Make sure these exclusions are intentional and temporary — if they're permanently dead code, consider deleting them instead.

SSH interactive / claude hang:

  1. formatShellInvocation escaping bug: The implementation handles single quotes via replace(/'/g, "'\\''"), which is correct for POSIX sh -c. But it does not escape backslashes before single quotes. A command like echo 'it\'s working' would break the quoting. This is a minor edge case, but since the function is now exported and tested, consider documenting the limitation or handling \\' sequences.

  2. authDetected logic change: The return expression now requires execResult?.authDetected === true and no longer falls back to exitCode === 0. This is the correct fix for the zsh exec failure case, but verify that all callers supply non-empty successPatterns — otherwise a successful auth with no pattern will now report authDetected: false. The tests show callers do supply patterns, but worth a quick audit.

  3. Missing test for cleanup() on stream.error: The runInteractiveSession tests are great for the H1 race, but there's no test verifying that a stream error mid-session calls cleanup(), restores raw mode, and rejects. A future refactor could regress this. Consider adding one.

Approve with nits — the bundling and claude-hang fixes are both high-value and low-risk.

@khaliqgant khaliqgant merged commit fb90632 into main Apr 14, 2026
52 checks passed
@khaliqgant khaliqgant deleted the fix/cloud-connect-workflows branch April 14, 2026 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants