AgentOps - Slack Job Orchestration System

Slack-driven job orchestration for a Windows workstation. Messages and /queue commands in a designated Slack channel are routed to specialized workers via BullMQ priority queues backed by Redis.

Slack Channel (#claude-code-cli)
     |
Controller (Socket Mode - WebSocket, no ngrok)
     |--- Router (keyword regex + LM Studio LLM fallback)
     |--- Queue Manager (BullMQ / Redis)
     |
Workers (NSSM Windows services)
     |--- local-worker  (LM Studio chat completions)
     |--- code-worker   (Claude Code CLI headless)
     |--- research-worker (stub - future implementation)
     |
Optional: Qdrant (job summary vector memory)

Key features: priority queuing (P1/P2/P3), interactive Slack controls (buttons for cancel/retry/promote/pause/resume), event-level idempotency via Redis SET NX, auto-restart via NSSM, graceful degradation when optional services are down.


Prerequisites

Prerequisite Version Purpose Install
Node.js >= 20 Runtime for controller and workers nodejs.org
Docker Desktop latest Redis (required), Qdrant (optional) docker.com
Chocolatey latest Package manager for NSSM chocolatey.org/install
NSSM latest Windows service management choco install nssm -y
LM Studio latest Local LLM inference + embeddings lmstudio.ai
Claude Code CLI latest Headless code execution for code-worker npm install -g @anthropic-ai/claude-code

LM Studio requirements:

  • A chat model must be loaded for local-worker job processing and LLM routing fallback.
  • An embedding model (e.g., nomic-embed-text-v1.5) must be loaded for Qdrant vector memory. This is optional -- the system works without it.

Slack App Setup

1. Create the App

  1. Go to api.slack.com/apps and click Create New App > From scratch.
  2. Name it whatever you like (e.g., "AgentOps") and select your workspace.

2. Bot Token Scopes (OAuth & Permissions)

Navigate to OAuth & Permissions and add these Bot Token Scopes:

Scope Purpose
chat:write Post messages and job receipts
channels:history Read channel messages for message-based commands
channels:read List channels
commands Register and handle /queue slash command

3. Event Subscriptions

Navigate to Event Subscriptions:

  1. Toggle Enable Events to On.
  2. Under Subscribe to bot events, add: message.channels
  3. Save changes.

Request URL is not needed for Socket Mode. Leave it blank or enter any placeholder.

4. Slash Commands

Navigate to Slash Commands and click Create New Command:

Field Value
Command /queue
Request URL Not needed for Socket Mode (enter any placeholder like https://localhost)
Short Description Manage the job queue
Usage Hint status

5. Interactivity & Shortcuts

Navigate to Interactivity & Shortcuts:

  1. Toggle Interactivity to On.
  2. Request URL: not needed for Socket Mode (enter any placeholder like https://localhost).
  3. Save changes.

6. Socket Mode

Navigate to Settings > Socket Mode:

  1. Toggle Enable Socket Mode to On.
  2. When prompted, generate an app-level token with the connections:write scope.
  3. Copy this token -- it starts with xapp- and becomes your SLACK_APP_TOKEN.

7. Install to Workspace

Navigate to OAuth & Permissions and click Install to Workspace:

  1. Authorize the app.
  2. Copy the Bot User OAuth Token (xoxb-...) -- this is your SLACK_BOT_TOKEN.
  3. Navigate to Basic Information and copy the Signing Secret -- this is your SLACK_SIGNING_SECRET.

8. Invite Bot to Channel

In Slack, go to your target channel (e.g., #claude-code-cli) and run:

/invite @YourBotName

9. Get Channel ID

Right-click the channel name > View channel details > scroll to the bottom and copy the Channel ID (starts with C).


Environment Configuration

Copy .env.example to .env and fill in real values:

cp .env.example .env
Variable Required Default Description
SLACK_BOT_TOKEN Yes - Bot User OAuth Token from step 7 (xoxb-...)
SLACK_SIGNING_SECRET Yes - Signing Secret from Basic Information page
SLACK_CHANNEL_ID Yes - Target channel ID from step 9 (e.g., C0ABKSQ3TKQ)
SLACK_APP_TOKEN Yes - App-level token from step 6 (xapp-...)
REDIS_HOST No localhost Docker Redis host
REDIS_PORT No 6379 Docker Redis port
PORT No 3000 Controller HTTP port (unused in Socket Mode, kept for config schema)
LM_STUDIO_URL No http://localhost:1234/v1 LM Studio OpenAI-compatible API endpoint
QDRANT_URL No http://localhost:6333 Qdrant REST API endpoint (optional)
NODE_ENV No development development, production, or test

Example .env:

SLACK_BOT_TOKEN=xoxb-YOUR-BOT-TOKEN-HERE
SLACK_SIGNING_SECRET=your-signing-secret-here
SLACK_CHANNEL_ID=C0ABKSQ3TKQ
SLACK_APP_TOKEN=xapp-YOUR-APP-TOKEN-HERE

REDIS_HOST=localhost
REDIS_PORT=6379
PORT=3000

LM_STUDIO_URL=http://localhost:1234/v1
QDRANT_URL=http://localhost:6333
NODE_ENV=development

All environment variables are validated at startup via Zod schema (shared/src/types/config.ts). The controller will exit with a clear error message if required values are missing.
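As a rough illustration of what the startup check does, here is a hand-rolled sketch. The real project validates via a Zod schema in shared/src/types/config.ts; the function and field names below are illustrative, not the project's actual code, but the behavior mirrors the table above (required keys enforced, defaults applied, clear error on failure).

```typescript
// Illustrative sketch only -- the real check is a Zod schema in
// shared/src/types/config.ts. Required keys throw a clear error;
// optional keys fall back to the defaults listed in the table above.
interface Config {
  SLACK_BOT_TOKEN: string;
  SLACK_SIGNING_SECRET: string;
  SLACK_CHANNEL_ID: string;
  SLACK_APP_TOKEN: string;
  REDIS_HOST: string;
  REDIS_PORT: number;
}

function validateConfig(env: Record<string, string | undefined>): Config {
  const required = [
    "SLACK_BOT_TOKEN",
    "SLACK_SIGNING_SECRET",
    "SLACK_CHANNEL_ID",
    "SLACK_APP_TOKEN",
  ] as const;
  const missing = required.filter((k) => !env[k]);
  if (missing.length > 0) {
    // The controller exits with a message along these lines when validation fails.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    SLACK_BOT_TOKEN: env.SLACK_BOT_TOKEN!,
    SLACK_SIGNING_SECRET: env.SLACK_SIGNING_SECRET!,
    SLACK_CHANNEL_ID: env.SLACK_CHANNEL_ID!,
    SLACK_APP_TOKEN: env.SLACK_APP_TOKEN!,
    REDIS_HOST: env.REDIS_HOST ?? "localhost", // defaults match the table above
    REDIS_PORT: Number(env.REDIS_PORT ?? "6379"),
  };
}
```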


First-Time Setup

# Run as Administrator (right-click PowerShell > Run as Administrator)
cd C:\Users\jcwoo\.agentops

# Run bootstrap (checks prereqs, installs NSSM, registers services, starts everything)
.\scripts\bootstrap.ps1

What bootstrap does (in order):

  1. Verifies administrator privileges
  2. Checks Docker Desktop is running
  3. Checks Chocolatey is installed
  4. Checks Node.js is installed
  5. Verifies TypeScript is compiled (all dist/ directories exist)
  6. Verifies .env file exists
  7. Installs NSSM via Chocolatey (if not present)
  8. Creates logs/ and legacy/ directories
  9. Runs install-services.ps1 to register 4 NSSM services
  10. Runs start-all.ps1 to start Docker Redis + all services
  11. Waits 5 seconds for stabilization
  12. Prints a status table showing each service's state

Before running bootstrap, ensure:

  • .env is populated with real Slack tokens (see Environment Configuration above)
  • TypeScript is compiled: run npx tsc -b from the project root
  • Docker Desktop is running
  • LM Studio is running with a chat model loaded (for worker processing)

Service account: During install-services.ps1, you will be prompted for your Windows password. Services run under .\jcwoo to access user-level environment variables and PATH.


Start / Stop

All scripts require Administrator privileges.

Start Everything

.\scripts\start-all.ps1

Startup sequence:

  1. Checks Docker Desktop is accessible
  2. Starts Redis via docker compose up -d
  3. Waits for Redis health check (PING/PONG, up to 30 seconds)
  4. Starts all 4 NSSM services
  5. Verifies all services are running

Stop Everything

.\scripts\stop-all.ps1

Shutdown sequence (ordered for safety):

  1. Stops workers first (local-worker, code-worker, research-worker)
  2. Stops controller last (so it doesn't enqueue to dead workers)
  3. Each service gets 15 seconds for graceful shutdown
  4. Stops Redis via docker compose down
  5. Prints shutdown summary

Individual Service Control

# Service names
# agentops-controller
# agentops-local-worker
# agentops-code-worker
# agentops-research-worker

nssm start agentops-controller
nssm stop agentops-controller
nssm restart agentops-controller
nssm status agentops-controller

Service Management Scripts

Script Purpose
scripts\bootstrap.ps1 First-time setup (prereqs + install + start + verify)
scripts\start-all.ps1 Start Docker Redis + all 4 NSSM services
scripts\stop-all.ps1 Stop all services + Docker Redis (ordered shutdown)
scripts\install-services.ps1 Register NSSM services (idempotent, does not start them)
scripts\uninstall-services.ps1 Remove all NSSM services (preserves log files)
scripts\legacy-disable.ps1 Disable old pm2-based system (archive scripts, clean state)

Usage

Slash Commands

Command Action
/queue help Show formatted help text with all subcommands
/queue status Show Block Kit status message with queue counts and worker health (no LLM calls)
/queue add <text> Enqueue a job at default priority (P3)
/queue addp1 <text> Enqueue a priority 1 job (highest priority)
/queue addp2 <text> Enqueue a priority 2 job
/queue addp3 <text> Enqueue a priority 3 job (lowest priority)
/queue workers <n> Set global concurrency across all queues (1-6)
/queue pause Pause all queues (in-progress jobs complete, no new claims)
/queue resume Resume all queues
/queue cancel <id> Cancel a job (removes if waiting, flags if active)
/queue retry <id> Retry a failed job (re-enters queue)
/queue promote <id> p1|p2|p3 Change a job's priority

Message-Based Commands

Type directly in the channel (no /queue prefix needed):

Input Behavior
Any text Enqueued as a single job, routed by keyword classifier
status or queue status Shows queue status (same as /queue status)
queue: followed by newline-separated items Enqueues multiple jobs in a batch
p1: <text> or p2: <text> or p3: <text> Priority tag prefix sets job priority
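The parsing rules in the table above can be sketched as follows. The real logic lives in controller/src/util/command-parser.ts; the function shape and exact regexes here are assumptions for illustration only.

```typescript
// Hypothetical sketch of the message-command rules in the table above.
// The actual parser is controller/src/util/command-parser.ts; names and
// regexes here are illustrative.
type Priority = 1 | 2 | 3;

interface ParsedMessage {
  kind: "status" | "batch" | "job";
  priority: Priority;
  items: string[];
}

function parseMessage(text: string): ParsedMessage {
  const trimmed = text.trim();

  // "status" or "queue status" -> status request, nothing enqueued
  if (/^(queue\s+)?status$/i.test(trimmed)) {
    return { kind: "status", priority: 3, items: [] };
  }

  // "queue:" followed by newline-separated items -> batch enqueue
  const batch = trimmed.match(/^queue:\s*\n([\s\S]+)$/i);
  if (batch) {
    const items = batch[1].split("\n").map((s) => s.trim()).filter(Boolean);
    return { kind: "batch", priority: 3, items };
  }

  // "p1:"/"p2:"/"p3:" prefix sets the job priority
  const tagged = trimmed.match(/^p([123]):\s*([\s\S]+)$/i);
  if (tagged) {
    return { kind: "job", priority: Number(tagged[1]) as Priority, items: [tagged[2].trim()] };
  }

  // Any other text is a single job at the default priority (P3)
  return { kind: "job", priority: 3, items: [trimmed] };
}
```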

Interactive Buttons

Job receipt buttons (appear on each enqueued job):

  • Cancel -- Remove the job from queue or flag for cancellation
  • Retry -- Re-enqueue a failed job
  • Promote -- Increase the job's priority

Status message buttons (appear on /queue status output):

  • Pause / Resume -- Toggle queue processing
  • Refresh -- Update the status counts in-place
  • Set Workers -- Dropdown to change global concurrency (1-6)
  • View Status -- Post a fresh status message

Routing

The router classifies each job to determine which queue it belongs to:

  1. Keyword classifier (fast): Regex patterns match against job text. Code-related keywords (repo, PR, bug, .ts, .py, GitHub URLs) route to code queue. Writing-related keywords (summarize, draft, compose) route to local queue. Research keywords (research, investigate, compare) route to research queue.
  2. LLM fallback (when ambiguous): If keyword confidence is low (< 2 matches or tied), the router calls LM Studio for classification.
  3. Default: If no matches at all, defaults to local queue with low confidence.
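The keyword-first step can be sketched like this. The real patterns live in controller/src/router/keyword-classifier.ts; the specific regexes below are taken from the examples in the text, and the "< 2 matches or tied" threshold mirrors the rule above, but the code itself is an illustrative assumption, not the project's implementation.

```typescript
// Illustrative sketch of the keyword classifier described above. Low
// confidence (fewer than 2 matches, or a tie) is what would trigger the
// LM Studio LLM fallback in the real router.
type QueueName = "code" | "local" | "research";

const PATTERNS: Record<QueueName, RegExp[]> = {
  code: [/\brepo\b/i, /\bPR\b/, /\bbug\b/i, /\.ts\b/, /\.py\b/, /github\.com/i],
  local: [/\bsummarize\b/i, /\bdraft\b/i, /\bcompose\b/i],
  research: [/\bresearch\b/i, /\binvestigate\b/i, /\bcompare\b/i],
};

function classifyByKeywords(text: string): { queue: QueueName; confident: boolean } {
  const scores = (Object.keys(PATTERNS) as QueueName[]).map((q) => ({
    queue: q,
    hits: PATTERNS[q].filter((re) => re.test(text)).length,
  }));
  scores.sort((a, b) => b.hits - a.hits);
  const [best, second] = scores;
  const confident = best.hits >= 2 && best.hits !== second.hits;
  return { queue: best.queue, confident };
}
```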

Test Procedures

Automated Tests (Vitest)

Requires Docker Redis running.

cd C:\Users\jcwoo\.agentops

# Run all tests
npx vitest run

# Run with verbose output
npx vitest run --reporter=verbose

Test files (in tests/acceptance/):

File What It Tests
queue-processing.test.ts Multi-item enqueue, parallel processing, crash recovery
idempotency.test.ts Pause/resume, duplicate event prevention via Redis SET NX
controller-unit.test.ts Command parser, help text builder, status formatter

Tests use dedicated test-* queue names to avoid interfering with production data. Queues are cleaned up via obliterate() in afterAll.

Qdrant-related tests are skipped automatically if Qdrant is not running (graceful skip, not failure).

Manual Verification Checklist

Run through these after deployment, major changes, or first-time setup:

  • Zero-token idle: Start system, wait 60 seconds, verify no LLM calls in logs. Check with nssm status agentops-controller that the service is running. Verify no requests to LM Studio or Claude CLI unless a job is queued.
  • /queue help: Run the command, verify formatted help text appears with all subcommands listed.
  • /queue status: Run the command, verify Block Kit status message with per-queue counts and worker health. Confirm no LLM calls appear in controller logs (status is a pure Redis read).
  • /queue add: Run /queue add summarize today's news, verify a job receipt with Cancel/Retry/Promote buttons appears. Check controller logs for routing decision (keyword or LLM).
  • Button - Cancel: Click Cancel on a queued job. Verify job is removed (thread reply confirms).
  • Button - Retry: Wait for a job to fail (or manually fail one), then click Retry. Verify job re-enters queue.
  • Button - Promote: Click Promote on a P3 job. Verify thread reply confirms priority change.
  • Button - Pause/Resume: Click Pause on a status message. Verify all queues pause (status updates in-place). Click Resume. Verify queues resume.
  • Button - Refresh: Click Refresh on a status message. Verify counts update in-place.
  • Message command: Type a task directly in the channel (no /queue prefix). Verify it's enqueued and receipt posted.
  • Duplicate events: Send a message, verify only 1 job is created. Check Redis for idempotency keys: docker exec agentops-redis-1 redis-cli KEYS 'idemp:*'
  • Multi-item batch: Type queue: followed by 3 items on separate lines. Verify 3 separate jobs are created with individual receipts.

Monitoring & Logs

Log Locations

NSSM captures stdout and stderr to individual log files:

logs\agentops-controller-stdout.log
logs\agentops-controller-stderr.log
logs\agentops-local-worker-stdout.log
logs\agentops-local-worker-stderr.log
logs\agentops-code-worker-stdout.log
logs\agentops-code-worker-stderr.log
logs\agentops-research-worker-stdout.log
logs\agentops-research-worker-stderr.log

View log configuration:

nssm get agentops-controller AppStdout
nssm get agentops-controller AppStderr

Log rotation is configured at 10 MB per file.

Service Status

# Check individual service
nssm status agentops-controller
nssm status agentops-local-worker
nssm status agentops-code-worker
nssm status agentops-research-worker

# Check all agentops services via PowerShell
Get-Service agentops-*

Key Log Patterns

Pattern Meaning
[controller] Bolt app running in Socket Mode Controller started successfully
[controller] Listening on channel: C0... Correct channel configured
[controller] Qdrant vector memory enabled Qdrant connected and operational
[controller] Running without Qdrant (optional) Qdrant unavailable -- system works fine
[local-worker] Started on queue "local" with concurrency 2 Local worker started
[code-worker] Started on queue "code" with concurrency 1 Code worker started
[router] Keyword match: CODE, patterns: repo, .ts Job routed via keyword classifier
[router] Low confidence, using LLM fallback -> local Job routed via LM Studio LLM
[qdrant] Stored summary for job X Summary saved to Qdrant on job completion
[qdrant] Not available, running without vector memory Qdrant down at startup (non-fatal)
[qdrant] Embedding failed: ... LM Studio embedding model not loaded or down
[events] Duplicate event X, skipping Idempotency layer rejected a duplicate Slack event

Redis Monitoring

# Connection count
docker exec agentops-redis-1 redis-cli INFO clients

# Total key count
docker exec agentops-redis-1 redis-cli DBSIZE

# List queue keys
docker exec agentops-redis-1 redis-cli KEYS 'bull:*'

# List idempotency keys
docker exec agentops-redis-1 redis-cli KEYS 'idemp:*'

# Check Redis health
docker exec agentops-redis-1 redis-cli PING

Qdrant Dashboard

If Qdrant is running, the dashboard is at http://localhost:6333/dashboard.

The agentops-jobs collection stores job completion summaries with vector embeddings for similarity search.


Troubleshooting

Redis Won't Start

# Check Docker Desktop is running
docker ps

# Start Redis manually
docker compose -f C:\Users\jcwoo\.agentops\docker-compose.yml up -d

# Verify Redis responds
docker exec agentops-redis-1 redis-cli PING
# Expected: PONG

# Check Docker logs
docker logs agentops-redis-1

Common causes: Docker Desktop not running, port 6379 already in use by another Redis instance.

LM Studio Not Responding

# Test LM Studio API
curl http://localhost:1234/v1/models
# Should return a JSON list of loaded models

Impact when down:

  • local-worker jobs fail (no LLM available for processing)
  • LLM classifier falls back to keyword-only routing (functional but less accurate)
  • Qdrant embeddings fail (summaries not stored, but jobs still complete)

Fix: Open LM Studio, ensure a chat model is loaded and the server is running on port 1234.

Slack Authentication Errors

Symptom Likely Cause Fix
Controller crashes on startup with Slack auth error Invalid or expired tokens Verify SLACK_BOT_TOKEN and SLACK_APP_TOKEN in .env match your Slack app
xapp- token rejected Socket Mode not enabled Enable Socket Mode in Slack app settings and regenerate the app-level token
Bot messages not appearing Bot not in channel Run /invite @YourBotName in the target channel
/queue command returns "dispatch_failed" Slash command not configured Verify /queue exists in Slack app's Slash Commands section
Events not firing Wrong event subscription Verify message.channels is subscribed under Event Subscriptions
Interactivity errors Interactivity not enabled Enable Interactivity in Slack app settings

Services Won't Start

# Check service status
nssm status agentops-controller

# Check if service is installed
Get-Service agentops-controller

# Check stderr log for crash reason
type C:\Users\jcwoo\.agentops\logs\agentops-controller-stderr.log

# Try manual start for better error visibility
nssm start agentops-controller

# Check for port conflicts
netstat -an | findstr 6379

# Rebuild TypeScript and restart
npx tsc -b
nssm restart agentops-controller

Common causes:

  • TypeScript not compiled (missing dist/ files) -- run npx tsc -b
  • .env file missing or invalid -- check Zod validation errors in stderr log
  • Redis not running -- start with docker compose up -d
  • NSSM services not installed -- run .\scripts\install-services.ps1

Qdrant Errors (Non-Fatal)

Qdrant is optional. The system operates normally without it -- no job processing is affected.

# Start Qdrant (if using the standalone container)
docker start qdrant

# Verify Qdrant health
curl http://localhost:6333/healthz
# Expected: HTTP 200

# Check Qdrant dashboard
# http://localhost:6333/dashboard

Embedding model not loaded: If you see [qdrant] Embedding failed in logs, load an embedding model in LM Studio (e.g., nomic-embed-text-v1.5). The embedding endpoint is at http://localhost:1234/v1/embeddings.

Dimension mismatch: If the collection was created with one embedding model and you switch to another with different dimensions, delete the collection via the Qdrant dashboard and restart the controller. It will recreate the collection with the correct dimensions.

Jobs Stuck in Active State

# Check for stalled jobs
docker exec agentops-redis-1 redis-cli KEYS 'bull:*:stalled'

Most common cause: a worker crashed or was killed while processing a job.

BullMQ auto-recovery: Stalled jobs are automatically reclaimed after the stalled interval (default 30 seconds). The job will be retried up to 3 times with exponential backoff (5s, 10s, 20s).
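The retry schedule above (5s, 10s, 20s) follows the exponential-backoff shape delay * 2^(attempt - 1) with a 5s base. A quick sketch, assuming that formula matches the project's BullMQ backoff settings:

```typescript
// Sketch of the retry schedule stated above: 3 attempts, exponential
// backoff from a 5000 ms base delay (delay doubles each retry).
function backoffDelayMs(attemptsMade: number, baseDelayMs = 5000): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}

// Retries 1..3 wait 5s, 10s, 20s respectively.
const schedule = [1, 2, 3].map((n) => backoffDelayMs(n));
```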

Manual recovery:

# Restart the specific worker
nssm restart agentops-local-worker

# Nuclear option: clear all queue data (WARNING: destroys all jobs)
docker exec agentops-redis-1 redis-cli FLUSHDB

Build Errors

# Clean build
npx tsc -b --clean
npx tsc -b

# If npm dependencies are missing
npm install
cd controller && npm install && cd ..
cd workers\local-worker && npm install && cd ..\..
cd workers\code-worker && npm install && cd ..\..
cd workers\research-worker && npm install && cd ..\..
cd shared && npm install && cd ..

Architecture

Package Structure

.agentops/
  shared/                    # Shared types and utilities
    src/types/
      job.ts                 # JobData, JobMetadata, JobPriority, JobStatus, QueueName
      config.ts              # Zod env schema (Config type)
      queue.ts               # QUEUE_NAMES constant, createRedisConnection()
      slack.ts               # JobReceipt, QueueStatusMessage types

  controller/                # Slack app + queue manager + router + Qdrant
    src/
      server.ts              # Entry point: Bolt app, Socket Mode, Qdrant init, shutdown
      config.ts              # Loads .env via dotenv + Zod validation
      slack/
        commands.ts          # /queue slash command handler (12 subcommands)
        events.ts            # message.channels event handler (single, multi-item, status)
        actions.ts           # 8 interactive button/menu action handlers
        middleware.ts        # Channel filter (pure function)
      queue/
        manager.ts           # BullMQ queue singletons, all queue operations
        formatters.ts        # Block Kit builders for status, receipts, help
      router/
        index.ts             # routeJob(): keyword first, LLM fallback
        keyword-classifier.ts # Regex pattern matching (CODE/LOCAL/RESEARCH)
        lm-classifier.ts     # LM Studio chat completion classifier
      qdrant/
        embeddings.ts        # LM Studio /v1/embeddings via OpenAI client
        client.ts            # Qdrant init, store, search, close (graceful degradation)
        index.ts             # Barrel re-export
      util/
        command-parser.ts    # Parse /queue subcommands and message text
        idempotency.ts       # Redis SET NX EX deduplication (24h TTL)

  workers/
    local-worker/            # LM Studio inference worker
      src/
        worker.ts            # BullMQ Worker on "local" queue, concurrency 2
        processor.ts         # Job processing logic (LM Studio chat completion)

    code-worker/             # Claude Code CLI worker
      src/
        worker.ts            # BullMQ Worker on "code" queue, concurrency 1
        processor.ts         # Job processing logic (Claude Code headless CLI)

    research-worker/         # Future research worker (stub)
      src/
        worker.ts            # BullMQ Worker on "research" queue
        processor.ts         # Placeholder processor

  scripts/                   # PowerShell service management
    bootstrap.ps1            # First-time setup (prereqs + install + start)
    start-all.ps1            # Start Docker Redis + all NSSM services
    stop-all.ps1             # Ordered shutdown (workers first, then controller, then Redis)
    install-services.ps1     # Register 4 NSSM services (idempotent)
    uninstall-services.ps1   # Remove NSSM services (preserves logs)
    legacy-disable.ps1       # Disable old pm2-based system

  tests/                     # Vitest acceptance tests
    acceptance/
      queue-processing.test.ts   # BullMQ queue behavior tests
      idempotency.test.ts        # Dedup and pause/resume tests
      controller-unit.test.ts    # Parser and formatter unit tests

  docker-compose.yml         # Redis 7 Alpine with AOF persistence
  .env.example               # Environment variable template
  .env                       # Actual environment values (not in git)
  vitest.config.ts           # Vitest configuration
  tsconfig.base.json         # Shared TypeScript settings (skipLibCheck, Node16)
  tsconfig.json              # Root project references (composite build)
  package.json               # Root package with devDependencies (vitest, typescript)

Data Flow

  1. Slack message or /queue command arrives via Socket Mode WebSocket connection.
  2. Controller validates the channel (must match SLACK_CHANNEL_ID) and deduplicates via Redis SET key NX EX 86400 (24-hour TTL). Duplicate Slack event retries are silently dropped.
  3. Router classifies the job text: keyword regex matching runs first (deterministic, fast). If confidence is low (< 2 pattern matches or tied), it falls back to LM Studio LLM classification.
  4. Job is added to the correct BullMQ queue (code, local, or research) with the assigned priority (P1=1, P2=2, P3=3 -- lower number = higher priority in BullMQ).
  5. Job receipt is posted to Slack via client.chat.postMessage with interactive Cancel/Retry/Promote buttons. The message_ts is captured and stored in job metadata for threaded replies.
  6. Worker claims the job from its queue. Each worker type processes differently:
    • local-worker: Sends job text to LM Studio chat completion API, posts result as a thread reply.
    • code-worker: Executes job text via Claude Code CLI in headless mode, posts result as a thread reply.
    • research-worker: Stub -- returns a placeholder response.
  7. On completion: If Qdrant is available, a QueueEvents listener in the controller stores the job summary as a vector embedding in the agentops-jobs collection. This enables future similarity search across completed jobs.
  8. On failure: BullMQ retries up to 3 times with exponential backoff (5s base delay). After all retries exhausted, the job moves to the failed state. Old completed jobs are pruned at 100 per queue; failed jobs at 50.
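Step 4's priority mapping can be shown concretely. In BullMQ, a lower priority number is claimed first, so P1 maps to 1. The enqueue call is shown only as a comment because it needs a live Redis connection; the mapping itself is plain logic, and the call shape is a sketch rather than the project's exact code.

```typescript
// Sketch of step 4 above: Slack priority tags map to BullMQ numeric
// priorities, where a LOWER number is claimed FIRST.
type SlackPriority = "p1" | "p2" | "p3";

function toBullPriority(p: SlackPriority): number {
  return { p1: 1, p2: 2, p3: 3 }[p]; // P1=1 beats P2=2 beats P3=3
}

// With a real bullmq Queue instance, the enqueue would look roughly like:
//   await queue.add("job", { text, slackTs }, { priority: toBullPriority("p1") });
```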

Key Design Decisions

Decision Rationale
Socket Mode (no ngrok/public URL) Single-workspace internal tool. WebSocket eliminates external URL management.
Workers as separate NSSM services Crash isolation. A worker crash doesn't take down the controller or other workers. NSSM auto-restarts on exit with 5s delay.
Qdrant is optional enrichment Never blocks job processing. All Qdrant calls wrapped in try/catch. System is fully functional without it.
Idempotency via Redis SET NX EX 24-hour TTL prevents duplicate processing from Slack event retries. Composite key for actions: message_ts + action_id + action_ts.
Keyword-first routing with LLM fallback Deterministic classification is fast and predictable. LLM only called when keywords are ambiguous, avoiding unnecessary inference costs.
Dynamic ESM import for Qdrant client @qdrant/js-client-rest is ESM-only. The project uses CJS (Node16 moduleResolution). Dynamic import() at runtime bridges the gap.
UUID point IDs in Qdrant Qdrant string IDs must be valid UUIDs. BullMQ job IDs are numeric strings. BullMQ job ID stored in the payload for traceability.
Services run as user account (.\jcwoo) Access to user-level PATH, environment variables, and npm global packages (Claude Code CLI).
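The SET NX EX idempotency decision above can be mirrored in memory to show the semantics: first writer wins, entries expire after the TTL. The real implementation (controller/src/util/idempotency.ts) uses Redis so the guard survives restarts and is shared across processes; this Map-based version is only a semantic sketch.

```typescript
// In-memory sketch of the Redis SET NX EX dedup described above.
// First claim of a key succeeds; repeats within the TTL are rejected.
const seen = new Map<string, number>(); // key -> expiry timestamp (ms)

function claimEvent(key: string, ttlSeconds = 86_400, now = Date.now()): boolean {
  const expiry = seen.get(key);
  if (expiry !== undefined && expiry > now) {
    return false; // duplicate within TTL -- a Slack retry, drop it
  }
  seen.set(key, now + ttlSeconds * 1000); // equivalent of SET key NX EX ttl
  return true; // first time seen -- process the event
}

// Composite key for button actions, per the decisions table above:
const actionKey = (messageTs: string, actionId: string, actionTs: string) =>
  `idemp:${messageTs}:${actionId}:${actionTs}`;
```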
