Add Agent Relay Cloud onboarding design document#35

Merged: khaliqgant merged 39 commits into main from claude/agent-provider-onboarding-I7cHk on Dec 30, 2025
@khaliqgant
Member

Design for cloud-hosted agent-relay with multi-provider authentication:

  • GitHub OAuth as primary auth + repo connection
  • Provider credential vault (API keys + OAuth tokens)
  • Support for Claude, Codex, Gemini, and custom providers
  • Team templates for quick setup
  • Security considerations for credential storage

bd-cloud-onboarding

Remove API key support in favor of login-based auth for all providers:
- All providers now use "Login with X" buttons
- OAuth tokens stored instead of API keys
- Automatic token refresh before expiry
- GitHub Copilot auto-connected via signup
- Updated security model for OAuth token lifecycle

bd-cloud-onboarding
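
A minimal sketch of the "refresh before expiry" idea from the commit above, not the PR's actual code. The `StoredToken` shape, the five-minute window, and both function names are assumptions made for illustration. The response fields follow the standard OAuth 2.0 token response, where a new refresh token is optional.

```typescript
// Hypothetical shape of a stored OAuth token (field names are illustrative).
interface StoredToken {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // Unix epoch, milliseconds
}

// Standard OAuth 2.0 token response fields; refresh_token may be omitted.
interface RefreshResponse {
  access_token: string;
  expires_in: number; // seconds
  refresh_token?: string;
}

// True when the token expires inside the safety window (assumed: 5 minutes).
function needsRefresh(token: StoredToken, now: number, windowMs = 5 * 60 * 1000): boolean {
  return token.expiresAt - now <= windowMs;
}

// Build the replacement record, keeping the old refresh token if none is returned.
function applyRefresh(old: StoredToken, res: RefreshResponse, now: number): StoredToken {
  return {
    accessToken: res.access_token,
    refreshToken: res.refresh_token ?? old.refreshToken,
    expiresAt: now + res.expires_in * 1000,
  };
}
```

A background job could call `needsRefresh` on each stored token and only contact the provider when it returns true.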
Claude Code currently uses browser-based OAuth that's not fully
headless-compatible. Updated design with practical alternatives:

- Device authorization flow (enter code at anthropic.com)
- Credential import from local Claude installation
- Provider status table showing actual OAuth support levels

References GitHub issue anthropics/claude-code#7100 for context.

bd-cloud-onboarding
Both Claude Code and Codex use browser-based OAuth that doesn't support
redirect URIs for third-party apps. Device flow (RFC 8628) is the solution.

Added:
- Sequence diagram showing device flow protocol
- Provider-specific device flow URLs table
- Complete UI flow with wireframes for both providers
- TypeScript implementation (DeviceFlowAuth class)
- Express API routes with background polling
- React component with state machine

References:
- anthropics/claude-code#7100 (headless auth request)
- openai/codex#2798 (remote auth request)

bd-cloud-onboarding
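
The polling half of the device flow can be sketched as a small state transition function. This is an illustration of RFC 8628 behavior, not the `DeviceFlowAuth` class from the PR; the type and function names here are hypothetical. The error codes and the 5-second default and slow-down increment come from the RFC.

```typescript
// Fields returned by the device authorization endpoint (RFC 8628, section 3.2).
interface DeviceAuthorization {
  device_code: string;
  user_code: string;
  verification_uri: string;
  expires_in: number; // seconds
  interval?: number; // seconds; the RFC default is 5 when omitted
}

// Simplified view of one token-endpoint poll response.
type PollResult =
  | { kind: "token"; accessToken: string }
  | { kind: "error"; error: string };

type PollState =
  | { status: "pending"; intervalSec: number }
  | { status: "authorized"; accessToken: string }
  | { status: "failed"; reason: string };

function initialState(auth: DeviceAuthorization): PollState {
  return { status: "pending", intervalSec: auth.interval ?? 5 };
}

// Apply one poll response to the current state.
function nextState(state: PollState, result: PollResult): PollState {
  if (state.status !== "pending") return state; // terminal states do not change
  if (result.kind === "token") {
    return { status: "authorized", accessToken: result.accessToken };
  }
  switch (result.error) {
    case "authorization_pending":
      return state; // user has not finished yet; keep polling
    case "slow_down":
      return { status: "pending", intervalSec: state.intervalSec + 5 };
    default:
      // access_denied, expired_token, or anything unexpected ends the flow
      return { status: "failed", reason: result.error };
  }
}
```

Keeping the transition pure makes it easy to drive from either a background poller on the server or a UI state machine.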
Agent Relay supports both deployment models with unified auth:

1. Cloud Hosted: Everything runs in cloud, users just connect accounts
2. Self-Hosted: Agents run locally with optional cloud sync
3. Self-Hosted + Cloud: Local execution with cloud auth/dashboard

Added:
- Deployment model diagrams
- Feature comparison table
- Self-hosted onboarding CLI flow
- Credential sync architecture
- Hybrid mode (agent-relay cloud connect)

bd-cloud-onboarding
Clarify that authentication is always handled by Agent Relay Cloud,
not self-hosted. The two models are:

1. Cloud Hosted: We run everything (auth + compute + repos)
2. Self-Hosted: User brings compute, auth still via our cloud

Updated:
- Deployment diagrams showing auth always in cloud
- Feature comparison table (removed offline auth option)
- Self-hosted setup flow using `agent-relay cloud` commands
- Credential sync architecture diagram

bd-cloud-onboarding
Self-hosted users must connect to our cloud to authenticate since
Claude/Codex require browser-based OAuth. This creates intentional
friction to encourage cloud adoption.

Key changes:
- Auth runs on our servers, not user's headless server
- User opens URL in browser, tokens sync to their server
- Token refresh continues via cloud (ongoing dependency)
- Added friction comparison table (cloud = easy, self = more steps)
- QR code option for mobile auth

Business rationale: Cloud should be path of least resistance.

bd-cloud-onboarding
Implements one-click workspace provisioning with:
- Database layer (PostgreSQL) for users, credentials, workspaces, repos
- Credential vault with AES-256-GCM encryption for OAuth tokens
- Workspace provisioner supporting Fly.io, Railway, and Docker
- API routes for auth, providers, workspaces, repos, onboarding
- CLI proxy authentication for Claude Code and Codex
- Device flow OAuth for Google/Gemini
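
For readers unfamiliar with AES-256-GCM, a rough sketch of sealing and unsealing a token with Node's built-in `crypto` module follows. It is not the vault code from this PR: the `SealedToken` shape, the function names, and the base64 encoding are assumptions, and key management is left out entirely.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical storage format: all three fields base64-encoded.
interface SealedToken {
  iv: string;
  tag: string;
  ciphertext: string;
}

// Encrypt a token with a 32-byte key. A fresh 12-byte IV is generated per call.
function sealToken(plaintext: string, key: Buffer): SealedToken {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    ciphertext: ciphertext.toString("base64"),
  };
}

// Decrypt and authenticate. Throws if the key, tag, or ciphertext is wrong.
function unsealToken(sealed: SealedToken, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(sealed.iv, "base64"));
  decipher.setAuthTag(Buffer.from(sealed.tag, "base64"));
  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(sealed.ciphertext, "base64")),
    decipher.final(),
  ]);
  return plaintext.toString("utf8");
}
```

GCM gives integrity as well as confidentiality, so a tampered record fails at `decipher.final()` instead of decrypting to garbage.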

Auth strategy:
- Google: Real OAuth device flow (works today)
- Claude/Codex: CLI-based auth with URL proxy through our UI
- GitHub: Web OAuth (for signup)

- Add custom_domain and custom_domain_status fields to workspaces
- Add API endpoints for domain management:
  - POST /workspaces/:id/domain - Set custom domain
  - POST /workspaces/:id/domain/verify - Verify DNS & provision SSL
  - DELETE /workspaces/:id/domain - Remove custom domain
- DNS verification via CNAME lookup
- SSL provisioning for Fly.io and Railway
- Database index for custom domain lookups

Users can now use their own domains (e.g., agents.acme.com)
instead of the default workspace-xxx.agentrelay.dev URLs.
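
The CNAME check described above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the resolver is injected so the comparison logic can be exercised without real DNS. Node's `dns/promises` `resolveCname(hostname)` returns `Promise<string[]>` and fits the `CnameResolver` type.

```typescript
// Normalize a hostname for comparison: trimmed, lowercase, no trailing dot.
function normalizeHost(host: string): string {
  return host.trim().toLowerCase().replace(/\.$/, "");
}

// True if any CNAME record points at the expected workspace hostname.
function cnameMatches(records: string[], expectedTarget: string): boolean {
  const expected = normalizeHost(expectedTarget);
  return records.some((record) => normalizeHost(record) === expected);
}

// A resolver returns the CNAME targets for a hostname.
type CnameResolver = (hostname: string) => Promise<string[]>;

// Resolve the custom domain and compare. Lookup failures count as "not verified".
async function verifyCustomDomain(
  domain: string,
  expectedTarget: string,
  resolve: CnameResolver,
): Promise<boolean> {
  try {
    return cnameMatches(await resolve(domain), expectedTarget);
  } catch {
    return false;
  }
}
```

For example, `agents.acme.com` would verify if its CNAME points at the workspace's `workspace-xxx.agentrelay.dev` hostname.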
- Add plan field to users (free, pro, team, enterprise)
- Custom domains require Team or Enterprise plan
- Returns 402 Payment Required with upgrade link for free/pro users
- Default URLs use subdomains: workspace-xxx.agentrelay.dev

Pricing model:
- Free/Pro: Subdomains only (included)
- Team/Enterprise: Custom domains ($10/mo add-on)

- Add workspace_members table for team collaboration
- Support roles: owner, admin, member, viewer
- Add team invitation system with accept/decline
- Add user avatar_url from GitHub
- Add findByGithubUsername and findByEmail to users
- Create /api/teams routes for member management
- Update /api/auth/me to include avatar, plan, pending invites
- Team members require Team/Enterprise plan

Workspace member permissions:
- Owner: Full control, can delete workspace
- Admin: Can invite/remove members, edit settings
- Member: Can use agents, view all
- Viewer: Read-only access
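
One way to encode the permission list above is a role-to-actions table. The action names are hypothetical, and the sketch assumes that admins also have everything a member has, which the commit message implies but does not state.

```typescript
type Role = "owner" | "admin" | "member" | "viewer";

// Hypothetical action names, chosen to mirror the permission list.
type Action = "delete_workspace" | "manage_members" | "edit_settings" | "use_agents" | "view";

const PERMISSIONS: Record<Role, readonly Action[]> = {
  owner: ["delete_workspace", "manage_members", "edit_settings", "use_agents", "view"],
  // Assumption: admin includes member abilities in addition to the listed ones.
  admin: ["manage_members", "edit_settings", "use_agents", "view"],
  member: ["use_agents", "view"],
  viewer: ["view"],
};

// True if the given role is allowed to perform the action.
function can(role: Role, action: Action): boolean {
  return PERMISSIONS[role].includes(action);
}
```

A route handler would look up the caller's role in `workspace_members` and reject the request when `can` returns false.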
New src/resiliency/ module provides:

## Health Monitor (health-monitor.ts)
- Periodic process liveness checks
- Configurable health check intervals and timeouts
- Max consecutive failures before marking dead
- Memory/CPU usage tracking
- Event-based notifications

## Structured Logger (logger.ts)
- JSON format for production, pretty format for dev
- Log levels with filtering (debug, info, warn, error, fatal)
- Context propagation (correlation IDs, agent names)
- File output with rotation
- Child loggers for scoped context

## Metrics (metrics.ts)
- Per-agent crash/restart/spawn counters
- System-wide health status
- Prometheus-compatible export format
- Metric history for trending
- JSON export for dashboards
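
As an illustration of what a Prometheus-compatible export of the per-agent counters might look like, here is a small formatter for the text exposition format. The metric names and the `AgentCounters` shape are assumptions, not taken from metrics.ts.

```typescript
interface AgentCounters {
  crashes: number;
  restarts: number;
  spawns: number;
}

// Escape a label value: backslash, double quote, and newline must be escaped.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render counters keyed by agent name as Prometheus text exposition lines.
function toPrometheus(counters: Record<string, AgentCounters>): string {
  // Hypothetical metric names, one family per counter field.
  const families: Array<[keyof AgentCounters, string]> = [
    ["crashes", "agent_crashes_total"],
    ["restarts", "agent_restarts_total"],
    ["spawns", "agent_spawns_total"],
  ];
  const lines: string[] = [];
  for (const [field, name] of families) {
    lines.push(`# TYPE ${name} counter`);
    for (const [agent, values] of Object.entries(counters)) {
      lines.push(`${name}{agent="${escapeLabel(agent)}"} ${values[field]}`);
    }
  }
  return lines.join("\n") + "\n";
}
```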

## Supervisor (supervisor.ts)
- Ties together health + logging + metrics
- Auto-restart with configurable limits
- Crash notifications
- Force restart capability
- Overall status reporting

Key improvements:
- Agents auto-restart on crash (up to 5 attempts)
- Dead process detection via PID checks
- Structured logs for debugging
- Metrics endpoint for observability
- Event-based notifications for alerts
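
The interaction between consecutive-failure counting and the restart cap can be sketched as follows. This is not the supervisor.ts implementation: the class, policy fields, and decision names are hypothetical, and the threshold of 3 failed checks in the usage is an assumption (only the limit of 5 restarts comes from the commit message). `process.kill(pid, 0)` is the usual Node way to probe a PID without sending a signal.

```typescript
interface SupervisorPolicy {
  maxConsecutiveFailures: number; // failed checks before the agent is considered dead
  maxRestarts: number; // restart attempts before giving up
}

type Decision = "healthy" | "suspect" | "restart" | "give_up";

// Probe a PID. Signal 0 performs the existence check without delivering a signal.
function pidIsAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

class AgentHealth {
  private readonly policy: SupervisorPolicy;
  private failures = 0;
  private restarts = 0;

  constructor(policy: SupervisorPolicy) {
    this.policy = policy;
  }

  // Record one health-check result and decide what the supervisor should do.
  record(alive: boolean): Decision {
    if (alive) {
      this.failures = 0;
      return "healthy";
    }
    this.failures += 1;
    if (this.failures < this.policy.maxConsecutiveFailures) return "suspect";
    if (this.restarts >= this.policy.maxRestarts) return "give_up";
    this.restarts += 1;
    this.failures = 0; // the restarted process starts with a clean slate
    return "restart";
  }
}
```

In this sketch the restart budget never resets; a real supervisor might restore it after a period of stable uptime.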
Implements Continuous-Claude-v2 inspired context persistence:
- Ledger-based state storage for agent context
- Handoff protocol for task continuation across restarts
- Provider-specific context injection:
  - Claude: Uses hooks to inject context into CLAUDE.md
  - Codex: Uses config for periodic context refresh via system prompt
  - Gemini: Updates system instruction file
- Auto-save functionality with configurable intervals
- Integrated with supervisor for automatic context save on crash/restart

The new architecture makes the relay daemon the default mode:
- Orchestrator manages multiple workspaces (repos) from a single API
- Dashboard becomes primary interface for project switching
- No separate "bridge" command needed

New modules:
- orchestrator.ts: Top-level service managing workspace daemons
- workspace-manager.ts: Add/remove/switch workspaces
- agent-manager.ts: Spawn/stop agents with resiliency integration
- api.ts: REST and WebSocket API for dashboard
- types.ts: Core types for workspaces, agents, events

API endpoints:
- GET/POST /workspaces - List/add workspaces
- POST /workspaces/:id/switch - Switch active workspace
- GET/POST /workspaces/:id/agents - List/spawn agents
- WebSocket for real-time events
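
The workspace endpoints above map naturally onto a small in-memory registry. The sketch below is illustrative only: `WorkspaceRegistry`, its ID scheme, and the rule that nothing is active until an explicit switch are assumptions, not the behavior of workspace-manager.ts.

```typescript
interface Workspace {
  id: string;
  path: string; // repository path on disk
  active: boolean;
}

class WorkspaceRegistry {
  private readonly workspaces = new Map<string, Workspace>();
  private counter = 0;

  // Backs POST /workspaces: register a repository path.
  add(path: string): Workspace {
    this.counter += 1;
    const ws: Workspace = { id: `ws-${this.counter}`, path, active: false };
    this.workspaces.set(ws.id, ws);
    return ws;
  }

  // Backs POST /workspaces/:id/switch: exactly one workspace is active afterwards.
  switchTo(id: string): Workspace {
    const target = this.workspaces.get(id);
    if (!target) throw new Error(`unknown workspace: ${id}`);
    for (const ws of this.workspaces.values()) {
      ws.active = ws.id === id;
    }
    return target;
  }

  // Backs GET /workspaces.
  list(): Workspace[] {
    return [...this.workspaces.values()];
  }
}
```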
New components for multi-workspace navigation:
- WorkspaceSelector: Dropdown for switching between workspaces
- AddWorkspaceModal: Modal for adding new repository paths
- useOrchestrator hook: Connects to orchestrator API and WebSocket

Features:
- Real-time workspace/agent updates via WebSocket
- Provider detection (Claude, Codex, Gemini)
- Git branch display
- Status indicators (active, inactive, error)

Integrates multi-workspace support into the dashboard:
- Connect useOrchestrator hook for workspace/agent management
- Add WorkspaceSelector dropdown in sidebar for switching projects
- Add AddWorkspaceModal for adding new repositories
- Convert workspaces to projects for unified navigation
- Update spawn/release handlers to use orchestrator when available

Implements complete billing system for Agent Relay Cloud:
- Billing types and plan definitions (Free, Pro, Team, Enterprise)
- Stripe service for customer, subscription, and payment management
- Billing API endpoints (checkout, portal, webhooks, invoices)
- PricingPlans component with monthly/yearly toggle
- BillingPanel for subscription overview and management
- Usage tracking and plan limit comparisons

Mission-control themed landing page with:
- Animated agent network visualization with glowing connections
- Live demo section showing agents collaborating in real-time
- Dark atmospheric design with cyan/purple/orange accent colors
- Responsive layout with smooth animations
- Features, providers, and pricing sections
- Terminal-style CTA with realistic CLI output
- Static HTML version for SEO and fast initial load
- React component version for dynamic interactions

Resolves conflicts:
- App.tsx: Keep workspace switching + orchestrator logic, add shadow agent support
- hooks/index.ts: Export both useOrchestrator and useAgentLogs
- index.ts: Merge component exports with workspace/billing components
- issues.jsonl: Accept upstream beads issues
- Add .js extensions to all ESM imports in cloud and resiliency modules
- Install missing type declarations (@types/pg, @types/cors, etc.)
- Fix Stripe API compatibility with type assertions for API version changes
- Fix Redis/connect-redis API compatibility with type assertions
- Fix WebSocket import in daemon/api.ts
- Rename DaemonConfig to ApiDaemonConfig to avoid duplicate export
- Exclude landing page from main tsconfig (it's React/browser code)
- Fix type annotations for event handlers in supervisor.ts
- Fix logger delete operator issue using destructuring
- Fix SpawnConfig type mismatch in dashboard App.tsx
- Add Dockerfile for main cloud service (Railway deployment)
- Add railway.json with healthcheck configuration
- Add workspace Dockerfile and fly.toml template for Fly.io machines
- Add deploy scripts for Railway and Fly.io setup
- Add .env.cloud.example with domain configuration template
- Update FlyProvisioner with custom domain support:
  - Support FLY_REGION for workspace placement
  - Support FLY_WORKSPACE_DOMAIN for custom subdomains (e.g., ws.agent-relay.com)
  - Auto-provision SSL certificates for custom hostnames
  - Enable auto-stop/start for cost optimization
- Update all provisioners to use consistent image naming

Cloud-Daemon Sync:
- Add /api/daemons endpoints for daemon registration and linking
- Implement API key authentication for daemon-to-cloud communication
- Add CloudSyncService for heartbeat, agent discovery, and credential sync
- Support cross-machine message relay through cloud queue

Database:
- Add Drizzle ORM schema with full type safety
- Create drizzle.ts client with typed query helpers
- Add linkedDaemons table for daemon registration
- Add subscriptions and usage_records tables for billing

Local Development:
- Add docker-compose.dev.yml for full local cloud stack
- Add init-db.sql for PostgreSQL schema initialization
- Add npm scripts: db:generate, db:migrate, db:push, db:studio

Architecture:
- Each machine gets one API key (not per-project)
- Daemon reports all agents from all projects on that machine
- Cloud aggregates agents from all linked machines
- Messages can be relayed across machines via cloud queue
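
The aggregation model in the architecture notes can be sketched as a directory keyed by machine. Everything named here (`Heartbeat`, `AgentDirectory`, the staleness cutoff) is hypothetical; the real CloudSyncService and /api/daemons payloads may differ.

```typescript
interface AgentReport {
  name: string;
  project: string;
}

// One heartbeat per machine, listing every agent from every project on it.
interface Heartbeat {
  machineId: string;
  agents: AgentReport[];
  receivedAt: number; // Unix epoch, milliseconds
}

class AgentDirectory {
  private readonly latest = new Map<string, Heartbeat>();

  // Keep only the most recent heartbeat per machine.
  report(heartbeat: Heartbeat): void {
    this.latest.set(heartbeat.machineId, heartbeat);
  }

  // All agents on machines heard from within the last staleMs milliseconds.
  allAgents(now: number, staleMs: number): Array<AgentReport & { machineId: string }> {
    const result: Array<AgentReport & { machineId: string }> = [];
    for (const hb of this.latest.values()) {
      if (now - hb.receivedAt > staleMs) continue;
      for (const agent of hb.agents) {
        result.push({ ...agent, machineId: hb.machineId });
      }
    }
    return result;
  }

  // Find which machine hosts an agent, so a message can be queued for it.
  locate(agentName: string, now: number, staleMs: number): string | undefined {
    return this.allAgents(now, staleMs).find((a) => a.name === agentName)?.machineId;
  }
}
```

With one API key per machine, `machineId` can be derived from the key, so the daemon does not need to identify each project separately.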
CLI Commands:
- Add `agent-relay cloud link` to connect machine to cloud
- Add `agent-relay cloud unlink` to disconnect from cloud
- Add `agent-relay cloud status` to show sync status
- Add `agent-relay cloud sync` to manually sync credentials
- Implements browser-based OAuth flow with API key verification
- Stores config securely in ~/.local/share/agent-relay/

CSS to Tailwind:
- Convert sidebar-container CSS classes to Tailwind utilities
- Convert workspace-selector-container to Tailwind classes
- Remove appStyles CSS export (kept empty for backwards compat)
- Use Tailwind theme tokens (bg-sidebar-bg, border-sidebar-border)
- Add CloudSessionProvider to wrap the dashboard with session management
- Add useSession hook for detecting expired sessions
- Add SessionExpiredModal component for re-login prompts
- Add cloudApi client with automatic session expiration detection
- Update auth.ts with session endpoint and error codes
- Add ProjectGroup schema with coordinator agent configuration
- Refactor db layer to use Drizzle ORM with strong typing
- Add WorkspaceMemberQueries for team management
- Fix null/undefined type conversions in vault

Coordinator Agents:
- Add coordinators API at /api/project-groups/:groupId/coordinator
- Add coordinator service for lifecycle management (start/stop/restart)
- Support enable/disable, configuration updates for coordinators

Plan-based Limits:
- Add plan limits service with tier definitions (free/pro/team/enterprise)
- Add middleware to enforce workspace and agent count limits
- Add usage API for tracking quotas (/api/usage, /api/usage/summary)
- Update workspaces API to check limits before creation
- Return 402 errors with upgrade prompts when limits exceeded
- Change limits from per-workspace agents to global concurrent agents
- Add repo count limit (3/20/100/unlimited per tier)
- Add coordinatorsEnabled flag (Pro+ only)
- Update middleware: checkRepoLimit, checkAgentLimit, checkCoordinatorAccess
- Update usage API to return new limit structure
- Free: 1 workspace, 3 repos, 2 concurrent agents, 10 compute hrs
- Pro: 5 workspaces, 20 repos, 10 concurrent agents, 100 compute hrs
- Team: 20 workspaces, 100 repos, 50 concurrent agents, 500 compute hrs
- Update LandingPage.tsx pricing section with new limits:
  - Free: 1 workspace, 3 repos, 2 concurrent agents, 10 hrs
  - Pro: 5 workspaces, 20 repos, 10 agents, 100 hrs, coordinators
  - Team: 20 workspaces, 100 repos, 50 agents, 500 hrs
  - Enterprise: Unlimited everything
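
The tier numbers above can be captured in a single table that both the pricing page and the limit checks read from. The following is a sketch, not the planLimits.ts service: the field names, the `canSpawnAgent` helper, and the error message are assumptions. The numbers, the 402 status, and the Pro+ coordinator rule come from the commit messages.

```typescript
type Plan = "free" | "pro" | "team" | "enterprise";

interface PlanLimits {
  workspaces: number;
  repos: number;
  concurrentAgents: number;
  computeHours: number;
  coordinatorsEnabled: boolean;
}

// "Unlimited" is modeled as Infinity so comparisons need no special case.
const LIMITS: Record<Plan, PlanLimits> = {
  free: { workspaces: 1, repos: 3, concurrentAgents: 2, computeHours: 10, coordinatorsEnabled: false },
  pro: { workspaces: 5, repos: 20, concurrentAgents: 10, computeHours: 100, coordinatorsEnabled: true },
  team: { workspaces: 20, repos: 100, concurrentAgents: 50, computeHours: 500, coordinatorsEnabled: true },
  enterprise: {
    workspaces: Infinity,
    repos: Infinity,
    concurrentAgents: Infinity,
    computeHours: Infinity,
    coordinatorsEnabled: true,
  },
};

type LimitCheck = { ok: true } | { ok: false; status: 402; message: string };

// Hypothetical check: may this account start one more agent right now?
function canSpawnAgent(plan: Plan, runningAgents: number): LimitCheck {
  const max = LIMITS[plan].concurrentAgents;
  if (runningAgents < max) return { ok: true };
  return {
    ok: false,
    status: 402,
    message: `Plan "${plan}" allows ${max} concurrent agents. Upgrade to run more.`,
  };
}
```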

- Create dedicated PricingPage.tsx with:
  - Monthly/annual billing toggle (20% discount)
  - Plan cards with visual limit indicators
  - Feature comparison table
  - FAQ section explaining compute hours, BYOK, coordinators
  - Orbital animation CTA section

- Add coordinator Pro-only restriction to coordinators API
- Fix TypeScript warnings array type in usage.ts
- Add pricing page styles to styles.css
- Move landing pages into dashboard/landing for build compatibility
- / now serves LandingPage
- /pricing serves PricingPage
- /app serves the dashboard (post-login)
- Remove old /landing route
…odes

Create comprehensive documentation for three deployment modes:
- CLOUD.md: Getting started with agent-relay.com managed service
- SELF-HOSTED.md: Running on own infrastructure with cloud auth
- LOCAL.md: Standalone local development usage
- Add scripts/dev.sh to start daemon + Next.js dashboard in tmux
- Add dev:start, dev:stop, dev:attach npm scripts
- Update LOCAL.md with simplified quickstart guide
- Dashboard dev server on port 4281 with hot reload
- Add .github/workflows/docker.yml to build and push images on release
- Publishes agent-relay and agent-relay-workspace images
- Supports linux/amd64 and linux/arm64 platforms
- Update all docs and docker-compose to use agentworkforce org
khaliqgant pushed a commit that referenced this pull request Dec 30, 2025
Key changes to match cloud-first paradigm:
- OAuth handled by cloud API (src/cloud/api/integrations/slack.ts)
- Credentials stored in encrypted vault, not local config files
- SlackService in src/cloud/services/ for token management
- Database schema via Drizzle ORM (slack_integrations table)
- Daemon bridge syncs credentials via cloud-sync.ts
- Orchestrator manages SlackBridge lifecycle per workspace
- Plan-gated access (Pro+ only)
- Dashboard UI with SlackIntegrationPanel component

Follows same patterns as provider credentials from PR #35.

bd-slack-integration
khaliqgant and others added 4 commits December 30, 2025 22:35
- AgentList: Solo agents (like Lead) now display without redundant group header
- AgentList: Reduced spacing between agents (gap-2 → gap-1)
- Header: Added notification badge on mobile hamburger menu for unread messages
- App: Track unread messages when sidebar closed on mobile
- LogViewer: Fixed auto-scroll to re-enable when user scrolls back to bottom
- MessageList: Fixed auto-scroll reliability with setTimeout and instant behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@khaliqgant khaliqgant requested a review from Copilot December 30, 2025 22:13

Copilot AI left a comment


Pull request overview

This PR introduces the foundational cloud infrastructure for Agent Relay, including landing/pricing pages, credential management, workspace provisioning, and multi-provider authentication. The design supports GitHub OAuth as primary authentication, a secure credential vault for API keys/OAuth tokens, and integration with Claude, Codex, Gemini, and custom providers. Additional features include team templates, coordinator agents for project groups, and Stripe-based subscription management.

Key changes:

  • Landing and pricing pages with mission control aesthetic
  • Secure credential vault using AES-256-GCM encryption
  • Workspace provisioning for Fly.io, Railway, and Docker
  • Database schema with Drizzle ORM for users, workspaces, project groups, and credentials
  • Billing integration with Stripe for subscription tiers (free, pro, team, enterprise)
  • Plan limits service with usage tracking and quota management

Reviewed changes

Copilot reviewed 70 out of 119 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/dashboard/landing/index.ts | Export landing and pricing page components |
| src/dashboard/landing/PricingPage.tsx | Full-featured pricing page with plan comparison and FAQ |
| src/dashboard/landing/LandingPage.tsx | Landing page with hero section, live demo, and feature showcase |
| src/dashboard/app/pricing/page.tsx | Next.js route wrapper for pricing page |
| src/dashboard/app/page.tsx | Switched main page from dashboard to landing page |
| src/dashboard/app/globals.css | Updated global styles for mission control theme |
| src/dashboard/app/app/page.tsx | New route for authenticated dashboard app |
| src/dashboard-server/server.ts | Enhanced dashboard server with agent online checks and processing state updates |
| src/dashboard-server/metrics.ts | Updated offline threshold to 30 seconds |
| src/daemon/workspace-manager.ts | Manager for multiple workspaces with switching support |
| src/daemon/types.ts | Core types for daemon (workspaces, agents, events) |
| src/daemon/server.ts | Updated daemon server to support processing state callbacks |
| src/daemon/router.ts | Added processing state change notifications |
| src/daemon/orchestrator.ts | Top-level orchestrator for managing workspace daemons |
| src/daemon/index.ts | Added exports for orchestrator and workspace manager |
| src/daemon/cloud-sync.ts | Cloud sync service for cross-machine agent coordination |
| src/daemon/api.ts | REST and WebSocket API for daemon communication |
| src/daemon/agent-manager.ts | Manages agents across workspaces with resiliency |
| src/cloud/vault/index.ts | Secure credential vault with AES-256-GCM encryption |
| src/cloud/services/planLimits.ts | Plan limits and usage tracking service |
| src/cloud/services/coordinator.ts | Coordinator agent service for project groups |
| src/cloud/server.ts | Express server with session management and CSRF protection |
| src/cloud/provisioner/index.ts | Workspace provisioning for Fly.io, Railway, Docker |
| src/cloud/index.ts | Main entry point for cloud infrastructure |
| src/cloud/db/schema.ts | Drizzle ORM schema for PostgreSQL |
| src/cloud/db/migrations/0001_initial.sql | Initial database migration |
| src/cloud/db/index.ts | Database layer exports and query namespaces |
| src/cloud/db/drizzle.ts | Drizzle database client with type-safe queries |
| src/cloud/config.ts | Cloud configuration with environment variable loading |
| src/cloud/billing/types.ts | Billing types for subscriptions and payments |
| src/cloud/billing/service.ts | Stripe integration for billing operations |
| src/cloud/billing/plans.ts | Subscription plan definitions and comparisons |
| src/cloud/billing/index.ts | Billing module exports |
| src/bridge/types.ts | Added shadow execution mode fields to SpawnRequest |


@khaliqgant khaliqgant merged commit 4c6fad6 into main Dec 30, 2025
6 checks passed
@khaliqgant khaliqgant deleted the claude/agent-provider-onboarding-I7cHk branch December 30, 2025 22:23
3 participants