Problem
Discovered while validating `feature/inference-perf` on BigMama (RTX 5090, WSL2) for PR #891:
(1) CUDA Dockerfile is orphaned
`docker/continuum-core-cuda.Dockerfile` exists but is not referenced anywhere in `docker-compose.yml`. The `continuum-core` service is always built from the CPU-only `continuum-core.Dockerfile` with `GPU_FEATURES: "--no-default-features --features load-dynamic-ort"` — no `cuda` feature.
```yaml
# docker-compose.yml
continuum-core:
  build:
    context: ./src/workers
    dockerfile: ../../docker/continuum-core.Dockerfile  # ← always the CPU one
    args:
      GPU_FEATURES: "--no-default-features --features load-dynamic-ort"  # ← no cuda
```
(2) The `gpu` profile bypasses our substrate
The `gpu` profile adds an `inference` service that uses the upstream `ghcr.io/ggml-org/llama.cpp:server-cuda` image — not our vendored llama.cpp, not our BatchScheduler, not our scheduler sequencing work.
```yaml
inference:
  image: ghcr.io/ggml-org/llama.cpp:server-cuda
  profiles: ["gpu"]
  command: ["-m", "/models/current.gguf", "-c", "4096", "-ngl", "99", "--port", "8090", ...]
```
Implication
Production docker-CUDA deploys today do not run the work that PR #891 is about. Users get:
- CPU-only `continuum-core` (no GPU inference path)
- Upstream llama-server on port 8090 (runs GGUF on CUDA, but outside our runtime — no scheduler, no per-seq LoRA, no Continuum cognition integration)
Given the recent direction "docker-first going forward, `npm start` is dev-only," this is a PR-blocker for #891: the PR ships work that isn't reachable via the documented deploy path.
Proposed fix (scope: minimum to close the gap)
- Add a compose service variant — either a `cuda` profile override that swaps `continuum-core`'s Dockerfile to `continuum-core-cuda.Dockerfile` with `GPU_FEATURES: "--no-default-features --features load-dynamic-ort,cuda"`, or a separate `continuum-core-cuda` service gated on the same profile.
- Remove the upstream-image `inference` service from the `gpu` profile. It bypasses our substrate by design, and that bypass is redundant now that our substrate handles GPU inference end-to-end.
- Update the compose-file top-of-file deploy doc: `# CUDA: docker compose --profile cuda up` (or equivalent) so the documented path matches the wired path.
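The first option above could look like the following override-file sketch. The file name, and the exact profile/override mechanics, are assumptions; the service name, build context, and `GPU_FEATURES` values mirror the existing compose excerpt, with `cuda` appended to the feature list:

```yaml
# docker-compose.cuda.yml — hypothetical override file (name is an assumption);
# usage: docker compose -f docker-compose.yml -f docker-compose.cuda.yml up
services:
  continuum-core:
    build:
      context: ./src/workers
      dockerfile: ../../docker/continuum-core-cuda.Dockerfile  # the orphaned CUDA Dockerfile
      args:
        GPU_FEATURES: "--no-default-features --features load-dynamic-ort,cuda"
```

Compose merges override files over the base, so only the `build` keys change; everything else on `continuum-core` (ports, volumes, env) carries over from `docker-compose.yml`.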
Coordination
- Claude on M5 is taking the compose YAML draft.
- Claude on M1 Pro (memento, this author) is on standby for BigMama CUDA build validation once Joel toggles WSL integration on Docker Desktop (current blocker: `/var/run/docker.sock` is unreachable from the WSL Linux side).
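Once WSL integration is toggled, the validation pass can start with a small pre-flight check along these lines. This is a sketch: the socket path is the standard default, and the suggested follow-up `docker compose` invocation assumes the `cuda` profile wiring from the proposed fix exists:

```shell
#!/usr/bin/env sh
# Pre-flight for BigMama CUDA build validation (hypothetical helper, not in the repo).

check_docker_sock() {
  # True iff the given path (default: /var/run/docker.sock) is a unix socket.
  [ -S "${1:-/var/run/docker.sock}" ]
}

if check_docker_sock; then
  echo "docker.sock reachable — ok to run: docker compose --profile cuda build continuum-core"
else
  echo "docker.sock unreachable — enable WSL integration in Docker Desktop first" >&2
fi
```

The check fails fast on the current blocker instead of letting the CUDA build die mid-way with a less obvious daemon-connection error.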
Related