docker: continuum-core-cuda.Dockerfile orphaned — gpu profile runs upstream llama-server, bypasses our substrate #892

@joelteply

Description

Problem

Discovered while validating `feature/inference-perf` on BigMama (RTX 5090, WSL2) for PR #891:

(1) CUDA Dockerfile is orphaned

`docker/continuum-core-cuda.Dockerfile` exists but is not referenced anywhere in `docker-compose.yml`. The `continuum-core` service is always built from the CPU-only `continuum-core.Dockerfile` with `GPU_FEATURES: "--no-default-features --features load-dynamic-ort"` — no `cuda` feature.

```yaml
# docker-compose.yml
continuum-core:
  build:
    context: ./src/workers
    dockerfile: ../../docker/continuum-core.Dockerfile  # ← always the CPU one
    args:
      GPU_FEATURES: "--no-default-features --features load-dynamic-ort"  # ← no cuda
```

(2) The `gpu` profile bypasses our substrate

The `gpu` profile adds an `inference` service that uses the upstream `ghcr.io/ggml-org/llama.cpp:server-cuda` image — not our vendored llama.cpp, not our BatchScheduler, not our scheduler sequencing work.

```yaml
inference:
  image: ghcr.io/ggml-org/llama.cpp:server-cuda
  profiles: ["gpu"]
  command: ["-m", "/models/current.gguf", "-c", "4096", "-ngl", "99", "--port", "8090", ...]
```

Implication

Production docker-CUDA deploys today do not run the work that PR #891 is about. Users get:

  • CPU-only `continuum-core` (no GPU inference path)
  • Upstream llama-server on port 8090 (runs GGUF on CUDA, but outside our runtime — no scheduler, no per-seq LoRA, no Continuum cognition integration)

Given the recent direction "docker-first going forward, npm start is dev-only," this is a PR-blocker for #891: the PR ships work that isn't reachable via the documented deploy path.

Proposed fix (scope: minimum to close the gap)

  1. Add a compose service variant — either a `cuda` profile override that swaps `continuum-core`'s Dockerfile to `continuum-core-cuda.Dockerfile` with `GPU_FEATURES: "--no-default-features --features load-dynamic-ort,cuda"`, or a separate `continuum-core-cuda` service gated on the same profile.
  2. Remove the upstream-image `inference` service from the `gpu` profile (it bypasses our substrate, which now handles GPU inference end-to-end).
  3. Update the deploy note at the top of the compose file, e.g. `# CUDA: docker compose --profile cuda up` (or equivalent), so the documented path matches the wired path.
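A minimal sketch of item 1 as a compose override file (the file name `docker-compose.cuda.yml`, the GPU reservation block, and the exact wiring are illustrative assumptions, not existing repo contents; a profile-gated duplicate service would work similarly):

```yaml
# docker-compose.cuda.yml — illustrative override, not yet in the repo.
# Usage: docker compose -f docker-compose.yml -f docker-compose.cuda.yml up
services:
  continuum-core:
    build:
      context: ./src/workers
      dockerfile: ../../docker/continuum-core-cuda.Dockerfile  # the orphaned CUDA Dockerfile
      args:
        GPU_FEATURES: "--no-default-features --features load-dynamic-ort,cuda"  # adds cuda feature
    # Assumed GPU reservation (standard compose v2 device-reservation syntax):
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]
```

An override file (rather than a `cuda` profile in the main file) sidesteps the limitation that a profile cannot swap an existing service's `build.dockerfile`; either shape closes the gap as long as the documented command matches it.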

Coordination

  • Claude on M5 taking the compose YAML draft.
  • Claude on M1 Pro (memento, this author) on standby for BigMama CUDA build validation once Joel toggles WSL integration on Docker Desktop (current blocker: `/var/run/docker.sock` unreachable from WSL Linux side).
