AgentOps - Slack Job Orchestration System

Slack-driven job orchestration for a Windows workstation. Messages and /queue commands in a designated Slack channel are routed to specialized workers via BullMQ priority queues backed by Redis.

Slack Channel (#claude-code-cli)
     |
Controller (Socket Mode - WebSocket, no ngrok)
     |--- Router (keyword regex + LM Studio LLM fallback)
     |--- Queue Manager (BullMQ / Redis)
     |
Workers (NSSM Windows services)
     |--- local-worker  (LM Studio chat completions)
     |--- code-worker   (Claude Code CLI headless)
     |--- research-worker (stub - future implementation)
     |
Optional: Qdrant (job summary vector memory)

Key features: priority queuing (P1/P2/P3), interactive Slack controls (buttons for cancel/retry/promote/pause/resume), event-level idempotency via Redis SET NX, auto-restart via NSSM, graceful degradation when optional services are down.


Prerequisites

Prerequisite Version Purpose Install
Node.js >= 20 Runtime for controller and workers nodejs.org
Docker Desktop latest Redis (required), Qdrant (optional) docker.com
Chocolatey latest Package manager for NSSM chocolatey.org/install
NSSM latest Windows service management choco install nssm -y
LM Studio latest Local LLM inference + embeddings lmstudio.ai
Claude Code CLI latest Headless code execution for code-worker npm install -g @anthropic-ai/claude-code

LM Studio requirements:

  • A chat model must be loaded for local-worker job processing and LLM routing fallback.
  • An embedding model (e.g., nomic-embed-text-v1.5) must be loaded for Qdrant vector memory. This is optional -- the system works without it.

Slack App Setup

1. Create the App

  1. Go to api.slack.com/apps and click Create New App > From scratch.
  2. Name it whatever you like (e.g., "AgentOps") and select your workspace.

2. Bot Token Scopes (OAuth & Permissions)

Navigate to OAuth & Permissions and add these Bot Token Scopes:

Scope Purpose
chat:write Post messages and job receipts
channels:history Read channel messages for message-based commands
channels:read List channels
commands Register and handle /queue slash command

3. Event Subscriptions

Navigate to Event Subscriptions:

  1. Toggle Enable Events to On.
  2. Under Subscribe to bot events, add: message.channels
  3. Save changes.

Request URL is not needed for Socket Mode. Leave it blank or enter any placeholder.

4. Slash Commands

Navigate to Slash Commands and click Create New Command:

Field Value
Command /queue
Request URL Not needed for Socket Mode (enter any placeholder like https://localhost)
Short Description Manage the job queue
Usage Hint status

5. Interactivity & Shortcuts

Navigate to Interactivity & Shortcuts:

  1. Toggle Interactivity to On.
  2. Request URL: not needed for Socket Mode (enter any placeholder like https://localhost).
  3. Save changes.

6. Socket Mode

Navigate to Settings > Socket Mode:

  1. Toggle Enable Socket Mode to On.
  2. When prompted, generate an app-level token with the connections:write scope.
  3. Copy this token -- it starts with xapp- and becomes your SLACK_APP_TOKEN.

7. Install to Workspace

Navigate to OAuth & Permissions and click Install to Workspace:

  1. Authorize the app.
  2. Copy the Bot User OAuth Token (xoxb-...) -- this is your SLACK_BOT_TOKEN.
  3. Navigate to Basic Information and copy the Signing Secret -- this is your SLACK_SIGNING_SECRET.

8. Invite Bot to Channel

In Slack, go to your target channel (e.g., #claude-code-cli) and run:

/invite @YourBotName

9. Get Channel ID

Right-click the channel name > View channel details > scroll to the bottom and copy the Channel ID (starts with C).


Environment Configuration

Copy .env.example to .env and fill in real values:

cp .env.example .env
Variable Required Default Description
SLACK_BOT_TOKEN Yes - Bot User OAuth Token from step 7 (xoxb-...)
SLACK_SIGNING_SECRET Yes - Signing Secret from Basic Information page
SLACK_CHANNEL_ID Yes - Target channel ID from step 9 (e.g., C0ABKSQ3TKQ)
SLACK_APP_TOKEN Yes - App-level token from step 6 (xapp-...)
REDIS_HOST No localhost Docker Redis host
REDIS_PORT No 6379 Docker Redis port
PORT No 3000 Controller HTTP port (unused in Socket Mode, kept for config schema)
LM_STUDIO_URL No http://localhost:1234/v1 LM Studio OpenAI-compatible API endpoint
QDRANT_URL No http://localhost:6333 Qdrant REST API endpoint (optional)
NODE_ENV No development development, production, or test

Example .env:

SLACK_BOT_TOKEN=xoxb-YOUR-BOT-TOKEN-HERE
SLACK_SIGNING_SECRET=your-signing-secret-here
SLACK_CHANNEL_ID=C0ABKSQ3TKQ
SLACK_APP_TOKEN=xapp-YOUR-APP-TOKEN-HERE

REDIS_HOST=localhost
REDIS_PORT=6379
PORT=3000

LM_STUDIO_URL=http://localhost:1234/v1
QDRANT_URL=http://localhost:6333
NODE_ENV=development

All environment variables are validated at startup via Zod schema (shared/src/types/config.ts). The controller will exit with a clear error message if required values are missing.
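As a rough illustration of what the startup check does, here is a hand-rolled sketch. The real project validates via a Zod schema in shared/src/types/config.ts; the function and field names below are illustrative, not the project's actual code, but the behavior mirrors the table above (required keys enforced, defaults applied, clear error on failure).

```typescript
// Illustrative sketch only -- the real check is a Zod schema in
// shared/src/types/config.ts. Required keys throw a clear error;
// optional keys fall back to the defaults listed in the table above.
interface Config {
  SLACK_BOT_TOKEN: string;
  SLACK_SIGNING_SECRET: string;
  SLACK_CHANNEL_ID: string;
  SLACK_APP_TOKEN: string;
  REDIS_HOST: string;
  REDIS_PORT: number;
}

function validateConfig(env: Record<string, string | undefined>): Config {
  const required = [
    "SLACK_BOT_TOKEN",
    "SLACK_SIGNING_SECRET",
    "SLACK_CHANNEL_ID",
    "SLACK_APP_TOKEN",
  ] as const;
  const missing = required.filter((k) => !env[k]);
  if (missing.length > 0) {
    // The controller exits with a message along these lines when validation fails.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    SLACK_BOT_TOKEN: env.SLACK_BOT_TOKEN!,
    SLACK_SIGNING_SECRET: env.SLACK_SIGNING_SECRET!,
    SLACK_CHANNEL_ID: env.SLACK_CHANNEL_ID!,
    SLACK_APP_TOKEN: env.SLACK_APP_TOKEN!,
    REDIS_HOST: env.REDIS_HOST ?? "localhost", // defaults match the table above
    REDIS_PORT: Number(env.REDIS_PORT ?? "6379"),
  };
}
```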


First-Time Setup

# Run as Administrator (right-click PowerShell > Run as Administrator)
cd C:\Users\jcwoo\.agentops

# Run bootstrap (checks prereqs, installs NSSM, registers services, starts everything)
.\scripts\bootstrap.ps1

What bootstrap does (in order):

  1. Verifies administrator privileges
  2. Checks Docker Desktop is running
  3. Checks Chocolatey is installed
  4. Checks Node.js is installed
  5. Verifies TypeScript is compiled (all dist/ directories exist)
  6. Verifies .env file exists
  7. Installs NSSM via Chocolatey (if not present)
  8. Creates logs/ and legacy/ directories
  9. Runs install-services.ps1 to register 4 NSSM services
  10. Runs start-all.ps1 to start Docker Redis + all services
  11. Waits 5 seconds for stabilization
  12. Prints a status table showing each service's state

Before running bootstrap, ensure:

  • .env is populated with real Slack tokens (see Environment Configuration above)
  • TypeScript is compiled: run npx tsc -b from the project root
  • Docker Desktop is running
  • LM Studio is running with a chat model loaded (for worker processing)

Service account: During install-services.ps1, you will be prompted for your Windows password. Services run under .\jcwoo to access user-level environment variables and PATH.


Start / Stop

All scripts require Administrator privileges.

Start Everything

.\scripts\start-all.ps1

Startup sequence:

  1. Checks Docker Desktop is accessible
  2. Starts Redis via docker compose up -d
  3. Waits for Redis health check (PING/PONG, up to 30 seconds)
  4. Starts all 4 NSSM services
  5. Verifies all services are running

Stop Everything

.\scripts\stop-all.ps1

Shutdown sequence (ordered for safety):

  1. Stops workers first (local-worker, code-worker, research-worker)
  2. Stops controller last (so it doesn't enqueue to dead workers)
  3. Each service gets 15 seconds for graceful shutdown
  4. Stops Redis via docker compose down
  5. Prints shutdown summary

Individual Service Control

# Service names
# agentops-controller
# agentops-local-worker
# agentops-code-worker
# agentops-research-worker

nssm start agentops-controller
nssm stop agentops-controller
nssm restart agentops-controller
nssm status agentops-controller

Service Management Scripts

Script Purpose
scripts\bootstrap.ps1 First-time setup (prereqs + install + start + verify)
scripts\start-all.ps1 Start Docker Redis + all 4 NSSM services
scripts\stop-all.ps1 Stop all services + Docker Redis (ordered shutdown)
scripts\install-services.ps1 Register NSSM services (idempotent, does not start them)
scripts\uninstall-services.ps1 Remove all NSSM services (preserves log files)
scripts\legacy-disable.ps1 Disable old pm2-based system (archive scripts, clean state)

Usage

Slash Commands

Command Action
/queue help Show formatted help text with all subcommands
/queue status Show Block Kit status message with queue counts and worker health (no LLM calls)
/queue add <text> Enqueue a job at default priority (P3)
/queue addp1 <text> Enqueue a priority 1 job (highest priority)
/queue addp2 <text> Enqueue a priority 2 job
/queue addp3 <text> Enqueue a priority 3 job (lowest priority)
/queue workers <n> Set global concurrency across all queues (1-6)
/queue pause Pause all queues (in-progress jobs complete, no new claims)
/queue resume Resume all queues
/queue cancel <id> Cancel a job (removes if waiting, flags if active)
/queue retry <id> Retry a failed job (re-enters queue)
/queue promote <id> p1|p2|p3 Change a job's priority

Message-Based Commands

Type directly in the channel (no /queue prefix needed):

Input Behavior
Any text Enqueued as a single job, routed by keyword classifier
status or queue status Shows queue status (same as /queue status)
queue: followed by newline-separated items Enqueues multiple jobs in a batch
p1: <text> or p2: <text> or p3: <text> Priority tag prefix sets job priority
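The parsing rules in the table above can be sketched as follows. The real logic lives in controller/src/util/command-parser.ts; the function shape and exact regexes here are assumptions for illustration only.

```typescript
// Hypothetical sketch of the message-command rules in the table above.
// The actual parser is controller/src/util/command-parser.ts; names and
// regexes here are illustrative.
type Priority = 1 | 2 | 3;

interface ParsedMessage {
  kind: "status" | "batch" | "job";
  priority: Priority;
  items: string[];
}

function parseMessage(text: string): ParsedMessage {
  const trimmed = text.trim();

  // "status" or "queue status" -> status request, nothing enqueued
  if (/^(queue\s+)?status$/i.test(trimmed)) {
    return { kind: "status", priority: 3, items: [] };
  }

  // "queue:" followed by newline-separated items -> batch enqueue
  const batch = trimmed.match(/^queue:\s*\n([\s\S]+)$/i);
  if (batch) {
    const items = batch[1].split("\n").map((s) => s.trim()).filter(Boolean);
    return { kind: "batch", priority: 3, items };
  }

  // "p1:"/"p2:"/"p3:" prefix sets the job priority
  const tagged = trimmed.match(/^p([123]):\s*([\s\S]+)$/i);
  if (tagged) {
    return { kind: "job", priority: Number(tagged[1]) as Priority, items: [tagged[2].trim()] };
  }

  // Any other text is a single job at the default priority (P3)
  return { kind: "job", priority: 3, items: [trimmed] };
}
```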

Interactive Buttons

Job receipt buttons (appear on each enqueued job):

  • Cancel -- Remove the job from queue or flag for cancellation
  • Retry -- Re-enqueue a failed job
  • Promote -- Increase the job's priority

Status message buttons (appear on /queue status output):

  • Pause / Resume -- Toggle queue processing
  • Refresh -- Update the status counts in-place
  • Set Workers -- Dropdown to change global concurrency (1-6)
  • View Status -- Post a fresh status message

Routing

The router classifies each job to determine which queue it belongs to:

  1. Keyword classifier (fast): Regex patterns match against job text. Code-related keywords (repo, PR, bug, .ts, .py, GitHub URLs) route to code queue. Writing-related keywords (summarize, draft, compose) route to local queue. Research keywords (research, investigate, compare) route to research queue.
  2. LLM fallback (when ambiguous): If keyword confidence is low (< 2 matches or tied), the router calls LM Studio for classification.
  3. Default: If no matches at all, defaults to local queue with low confidence.
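The keyword-first step can be sketched like this. The real patterns live in controller/src/router/keyword-classifier.ts; the specific regexes below are taken from the examples in the text, and the "< 2 matches or tied" threshold mirrors the rule above, but the code itself is an illustrative assumption, not the project's implementation.

```typescript
// Illustrative sketch of the keyword classifier described above. Low
// confidence (fewer than 2 matches, or a tie) is what would trigger the
// LM Studio LLM fallback in the real router.
type QueueName = "code" | "local" | "research";

const PATTERNS: Record<QueueName, RegExp[]> = {
  code: [/\brepo\b/i, /\bPR\b/, /\bbug\b/i, /\.ts\b/, /\.py\b/, /github\.com/i],
  local: [/\bsummarize\b/i, /\bdraft\b/i, /\bcompose\b/i],
  research: [/\bresearch\b/i, /\binvestigate\b/i, /\bcompare\b/i],
};

function classifyByKeywords(text: string): { queue: QueueName; confident: boolean } {
  const scores = (Object.keys(PATTERNS) as QueueName[]).map((q) => ({
    queue: q,
    hits: PATTERNS[q].filter((re) => re.test(text)).length,
  }));
  scores.sort((a, b) => b.hits - a.hits);
  const [best, second] = scores;
  const confident = best.hits >= 2 && best.hits !== second.hits;
  return { queue: best.queue, confident };
}
```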

Test Procedures

Automated Tests (Vitest)

Requires Docker Redis running.

cd C:\Users\jcwoo\.agentops

# Run all tests
npx vitest run

# Run with verbose output
npx vitest run --reporter=verbose

Test files (in tests/acceptance/):

File What It Tests
queue-processing.test.ts Multi-item enqueue, parallel processing, crash recovery
idempotency.test.ts Pause/resume, duplicate event prevention via Redis SET NX
controller-unit.test.ts Command parser, help text builder, status formatter

Tests use dedicated test-* queue names to avoid interfering with production data. Queues are cleaned up via obliterate() in afterAll.

Qdrant-related tests are skipped automatically if Qdrant is not running (graceful skip, not failure).

Manual Verification Checklist

Run through these after deployment, major changes, or first-time setup:

  • Zero-token idle: Start system, wait 60 seconds, verify no LLM calls in logs. Check with nssm status agentops-controller that the service is running. Verify no requests to LM Studio or Claude CLI unless a job is queued.
  • /queue help: Run the command, verify formatted help text appears with all subcommands listed.
  • /queue status: Run the command, verify Block Kit status message with per-queue counts and worker health. Confirm no LLM calls appear in controller logs (status is a pure Redis read).
  • /queue add: Run /queue add summarize today's news, verify a job receipt with Cancel/Retry/Promote buttons appears. Check controller logs for routing decision (keyword or LLM).
  • Button - Cancel: Click Cancel on a queued job. Verify job is removed (thread reply confirms).
  • Button - Retry: Wait for a job to fail (or manually fail one), then click Retry. Verify job re-enters queue.
  • Button - Promote: Click Promote on a P3 job. Verify thread reply confirms priority change.
  • Button - Pause/Resume: Click Pause on a status message. Verify all queues pause (status updates in-place). Click Resume. Verify queues resume.
  • Button - Refresh: Click Refresh on a status message. Verify counts update in-place.
  • Message command: Type a task directly in the channel (no /queue prefix). Verify it's enqueued and receipt posted.
  • Duplicate events: Send a message, verify only 1 job is created. Check Redis for idempotency keys: docker exec agentops-redis-1 redis-cli KEYS 'idemp:*'
  • Multi-item batch: Type queue: followed by 3 items on separate lines. Verify 3 separate jobs are created with individual receipts.

Monitoring & Logs

Log Locations

NSSM captures stdout and stderr to individual log files:

logs\agentops-controller-stdout.log
logs\agentops-controller-stderr.log
logs\agentops-local-worker-stdout.log
logs\agentops-local-worker-stderr.log
logs\agentops-code-worker-stdout.log
logs\agentops-code-worker-stderr.log
logs\agentops-research-worker-stdout.log
logs\agentops-research-worker-stderr.log

View log configuration:

nssm get agentops-controller AppStdout
nssm get agentops-controller AppStderr

Log rotation is configured at 10 MB per file.

Service Status

# Check individual service
nssm status agentops-controller
nssm status agentops-local-worker
nssm status agentops-code-worker
nssm status agentops-research-worker

# Check all agentops services via PowerShell
Get-Service agentops-*

Key Log Patterns

Pattern Meaning
[controller] Bolt app running in Socket Mode Controller started successfully
[controller] Listening on channel: C0... Correct channel configured
[controller] Qdrant vector memory enabled Qdrant connected and operational
[controller] Running without Qdrant (optional) Qdrant unavailable -- system works fine
[local-worker] Started on queue "local" with concurrency 2 Local worker started
[code-worker] Started on queue "code" with concurrency 1 Code worker started
[router] Keyword match: CODE, patterns: repo, .ts Job routed via keyword classifier
[router] Low confidence, using LLM fallback -> local Job routed via LM Studio LLM
[qdrant] Stored summary for job X Summary saved to Qdrant on job completion
[qdrant] Not available, running without vector memory Qdrant down at startup (non-fatal)
[qdrant] Embedding failed: ... LM Studio embedding model not loaded or down
[events] Duplicate event X, skipping Idempotency layer rejected a duplicate Slack event

Redis Monitoring

# Connection count
docker exec agentops-redis-1 redis-cli INFO clients

# Total key count
docker exec agentops-redis-1 redis-cli DBSIZE

# List queue keys
docker exec agentops-redis-1 redis-cli KEYS 'bull:*'

# List idempotency keys
docker exec agentops-redis-1 redis-cli KEYS 'idemp:*'

# Check Redis health
docker exec agentops-redis-1 redis-cli PING

Qdrant Dashboard

If Qdrant is running, the dashboard is at http://localhost:6333/dashboard.

The agentops-jobs collection stores job completion summaries with vector embeddings for similarity search.


Troubleshooting

Redis Won't Start

# Check Docker Desktop is running
docker ps

# Start Redis manually
docker compose -f C:\Users\jcwoo\.agentops\docker-compose.yml up -d

# Verify Redis responds
docker exec agentops-redis-1 redis-cli PING
# Expected: PONG

# Check Docker logs
docker logs agentops-redis-1

Common causes: Docker Desktop not running, port 6379 already in use by another Redis instance.

LM Studio Not Responding

# Test LM Studio API
curl http://localhost:1234/v1/models
# Should return a JSON list of loaded models

Impact when down:

  • local-worker jobs fail (no LLM available for processing)
  • LLM classifier falls back to keyword-only routing (functional but less accurate)
  • Qdrant embeddings fail (summaries not stored, but jobs still complete)

Fix: Open LM Studio, ensure a chat model is loaded and the server is running on port 1234.

Slack Authentication Errors

Symptom Likely Cause Fix
Controller crashes on startup with Slack auth error Invalid or expired tokens Verify SLACK_BOT_TOKEN and SLACK_APP_TOKEN in .env match your Slack app
xapp- token rejected Socket Mode not enabled Enable Socket Mode in Slack app settings and regenerate the app-level token
Bot messages not appearing Bot not in channel Run /invite @YourBotName in the target channel
/queue command returns "dispatch_failed" Slash command not configured Verify /queue exists in Slack app's Slash Commands section
Events not firing Wrong event subscription Verify message.channels is subscribed under Event Subscriptions
Interactivity errors Interactivity not enabled Enable Interactivity in Slack app settings

Services Won't Start

# Check service status
nssm status agentops-controller

# Check if service is installed
Get-Service agentops-controller

# Check stderr log for crash reason
type C:\Users\jcwoo\.agentops\logs\agentops-controller-stderr.log

# Try manual start for better error visibility
nssm start agentops-controller

# Check for port conflicts
netstat -an | findstr 6379

# Rebuild TypeScript and restart
npx tsc -b
nssm restart agentops-controller

Common causes:

  • TypeScript not compiled (missing dist/ files) -- run npx tsc -b
  • .env file missing or invalid -- check Zod validation errors in stderr log
  • Redis not running -- start with docker compose up -d
  • NSSM services not installed -- run .\scripts\install-services.ps1

Qdrant Errors (Non-Fatal)

Qdrant is optional. The system operates normally without it -- no job processing is affected.

# Start Qdrant (if using the standalone container)
docker start qdrant

# Verify Qdrant health
curl http://localhost:6333/healthz
# Expected: HTTP 200

# Check Qdrant dashboard
# http://localhost:6333/dashboard

Embedding model not loaded: If you see [qdrant] Embedding failed in logs, load an embedding model in LM Studio (e.g., nomic-embed-text-v1.5). The embedding endpoint is at http://localhost:1234/v1/embeddings.

Dimension mismatch: If the collection was created with one embedding model and you switch to another with different dimensions, delete the collection via the Qdrant dashboard and restart the controller. It will recreate the collection with the correct dimensions.

Jobs Stuck in Active State

# Check for stalled jobs
docker exec agentops-redis-1 redis-cli KEYS 'bull:*:stalled'

Most common cause: a worker crashed or was killed while processing a job.

BullMQ auto-recovery: Stalled jobs are automatically reclaimed after the stalled interval (default 30 seconds). The job will be retried up to 3 times with exponential backoff (5s, 10s, 20s).
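The retry schedule above (5s, 10s, 20s) follows the exponential-backoff shape delay * 2^(attempt - 1) with a 5s base. A quick sketch, assuming that formula matches the project's BullMQ backoff settings:

```typescript
// Sketch of the retry schedule stated above: 3 attempts, exponential
// backoff from a 5000 ms base delay (delay doubles each retry).
function backoffDelayMs(attemptsMade: number, baseDelayMs = 5000): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}

// Retries 1..3 wait 5s, 10s, 20s respectively.
const schedule = [1, 2, 3].map((n) => backoffDelayMs(n));
```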

Manual recovery:

# Restart the specific worker
nssm restart agentops-local-worker

# Nuclear option: clear all queue data (WARNING: destroys all jobs)
docker exec agentops-redis-1 redis-cli FLUSHDB

Build Errors

# Clean build
npx tsc -b --clean
npx tsc -b

# If npm dependencies are missing
npm install
cd controller && npm install && cd ..
cd workers\local-worker && npm install && cd ..\..
cd workers\code-worker && npm install && cd ..\..
cd workers\research-worker && npm install && cd ..\..
cd shared && npm install && cd ..

Architecture

Package Structure

.agentops/
  shared/                    # Shared types and utilities
    src/types/
      job.ts                 # JobData, JobMetadata, JobPriority, JobStatus, QueueName
      config.ts              # Zod env schema (Config type)
      queue.ts               # QUEUE_NAMES constant, createRedisConnection()
      slack.ts               # JobReceipt, QueueStatusMessage types

  controller/                # Slack app + queue manager + router + Qdrant
    src/
      server.ts              # Entry point: Bolt app, Socket Mode, Qdrant init, shutdown
      config.ts              # Loads .env via dotenv + Zod validation
      slack/
        commands.ts          # /queue slash command handler (12 subcommands)
        events.ts            # message.channels event handler (single, multi-item, status)
        actions.ts           # 8 interactive button/menu action handlers
        middleware.ts        # Channel filter (pure function)
      queue/
        manager.ts           # BullMQ queue singletons, all queue operations
        formatters.ts        # Block Kit builders for status, receipts, help
      router/
        index.ts             # routeJob(): keyword first, LLM fallback
        keyword-classifier.ts # Regex pattern matching (CODE/LOCAL/RESEARCH)
        lm-classifier.ts     # LM Studio chat completion classifier
      qdrant/
        embeddings.ts        # LM Studio /v1/embeddings via OpenAI client
        client.ts            # Qdrant init, store, search, close (graceful degradation)
        index.ts             # Barrel re-export
      util/
        command-parser.ts    # Parse /queue subcommands and message text
        idempotency.ts       # Redis SET NX EX deduplication (24h TTL)

  workers/
    local-worker/            # LM Studio inference worker
      src/
        worker.ts            # BullMQ Worker on "local" queue, concurrency 2
        processor.ts         # Job processing logic (LM Studio chat completion)

    code-worker/             # Claude Code CLI worker
      src/
        worker.ts            # BullMQ Worker on "code" queue, concurrency 1
        processor.ts         # Job processing logic (Claude Code headless CLI)

    research-worker/         # Future research worker (stub)
      src/
        worker.ts            # BullMQ Worker on "research" queue
        processor.ts         # Placeholder processor

  scripts/                   # PowerShell service management
    bootstrap.ps1            # First-time setup (prereqs + install + start)
    start-all.ps1            # Start Docker Redis + all NSSM services
    stop-all.ps1             # Ordered shutdown (workers first, then controller, then Redis)
    install-services.ps1     # Register 4 NSSM services (idempotent)
    uninstall-services.ps1   # Remove NSSM services (preserves logs)
    legacy-disable.ps1       # Disable old pm2-based system

  tests/                     # Vitest acceptance tests
    acceptance/
      queue-processing.test.ts   # BullMQ queue behavior tests
      idempotency.test.ts        # Dedup and pause/resume tests
      controller-unit.test.ts    # Parser and formatter unit tests

  docker-compose.yml         # Redis 7 Alpine with AOF persistence
  .env.example               # Environment variable template
  .env                       # Actual environment values (not in git)
  vitest.config.ts           # Vitest configuration
  tsconfig.base.json         # Shared TypeScript settings (skipLibCheck, Node16)
  tsconfig.json              # Root project references (composite build)
  package.json               # Root package with devDependencies (vitest, typescript)

Data Flow

  1. Slack message or /queue command arrives via Socket Mode WebSocket connection.
  2. Controller validates the channel (must match SLACK_CHANNEL_ID) and deduplicates via Redis SET key NX EX 86400 (24-hour TTL). Duplicate Slack event retries are silently dropped.
  3. Router classifies the job text: keyword regex matching runs first (deterministic, fast). If confidence is low (< 2 pattern matches or tied), it falls back to LM Studio LLM classification.
  4. Job is added to the correct BullMQ queue (code, local, or research) with the assigned priority (P1=1, P2=2, P3=3 -- lower number = higher priority in BullMQ).
  5. Job receipt is posted to Slack via client.chat.postMessage with interactive Cancel/Retry/Promote buttons. The message_ts is captured and stored in job metadata for threaded replies.
  6. Worker claims the job from its queue. Each worker type processes differently:
    • local-worker: Sends job text to LM Studio chat completion API, posts result as a thread reply.
    • code-worker: Executes job text via Claude Code CLI in headless mode, posts result as a thread reply.
    • research-worker: Stub -- returns a placeholder response.
  7. On completion: If Qdrant is available, a QueueEvents listener in the controller stores the job summary as a vector embedding in the agentops-jobs collection. This enables future similarity search across completed jobs.
  8. On failure: BullMQ retries up to 3 times with exponential backoff (5s base delay). After all retries exhausted, the job moves to the failed state. Old completed jobs are pruned at 100 per queue; failed jobs at 50.
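Step 4's priority mapping can be shown concretely. In BullMQ, a lower priority number is claimed first, so P1 maps to 1. The enqueue call is shown only as a comment because it needs a live Redis connection; the mapping itself is plain logic, and the call shape is a sketch rather than the project's exact code.

```typescript
// Sketch of step 4 above: Slack priority tags map to BullMQ numeric
// priorities, where a LOWER number is claimed FIRST.
type SlackPriority = "p1" | "p2" | "p3";

function toBullPriority(p: SlackPriority): number {
  return { p1: 1, p2: 2, p3: 3 }[p]; // P1=1 beats P2=2 beats P3=3
}

// With a real bullmq Queue instance, the enqueue would look roughly like:
//   await queue.add("job", { text, slackTs }, { priority: toBullPriority("p1") });
```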

Key Design Decisions

Decision Rationale
Socket Mode (no ngrok/public URL) Single-workspace internal tool. WebSocket eliminates external URL management.
Workers as separate NSSM services Crash isolation. A worker crash doesn't take down the controller or other workers. NSSM auto-restarts on exit with 5s delay.
Qdrant is optional enrichment Never blocks job processing. All Qdrant calls wrapped in try/catch. System is fully functional without it.
Idempotency via Redis SET NX EX 24-hour TTL prevents duplicate processing from Slack event retries. Composite key for actions: message_ts + action_id + action_ts.
Keyword-first routing with LLM fallback Deterministic classification is fast and predictable. LLM only called when keywords are ambiguous, avoiding unnecessary inference costs.
Dynamic ESM import for Qdrant client @qdrant/js-client-rest is ESM-only. The project uses CJS (Node16 moduleResolution). Dynamic import() at runtime bridges the gap.
UUID point IDs in Qdrant Qdrant string IDs must be valid UUIDs. BullMQ job IDs are numeric strings. BullMQ job ID stored in the payload for traceability.
Services run as user account (.\jcwoo) Access to user-level PATH, environment variables, and npm global packages (Claude Code CLI).
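The SET NX EX idempotency decision above can be mirrored in memory to show the semantics: first writer wins, entries expire after the TTL. The real implementation (controller/src/util/idempotency.ts) uses Redis so the guard survives restarts and is shared across processes; this Map-based version is only a semantic sketch.

```typescript
// In-memory sketch of the Redis SET NX EX dedup described above.
// First claim of a key succeeds; repeats within the TTL are rejected.
const seen = new Map<string, number>(); // key -> expiry timestamp (ms)

function claimEvent(key: string, ttlSeconds = 86_400, now = Date.now()): boolean {
  const expiry = seen.get(key);
  if (expiry !== undefined && expiry > now) {
    return false; // duplicate within TTL -- a Slack retry, drop it
  }
  seen.set(key, now + ttlSeconds * 1000); // equivalent of SET key NX EX ttl
  return true; // first time seen -- process the event
}

// Composite key for button actions, per the decisions table above:
const actionKey = (messageTs: string, actionId: string, actionTs: string) =>
  `idemp:${messageTs}:${actionId}:${actionTs}`;
```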
