Problem
Discovered while validating `feature/inference-perf` on BigMama (RTX 5090, WSL2) for PR #891:
(1) CUDA Dockerfile is orphaned
`docker/continuum-core-cuda.Dockerfile` exists but is not referenced anywhere in `docker-compose.yml`. The `continuum-core` service is always built from the CPU-only `continuum-core.Dockerfile` with `GPU_FEATURES: "--no-default-features --features load-dynamic-ort"` — no `cuda` feature.
```yaml
# docker-compose.yml
continuum-core:
  build:
    context: ./src/workers
    dockerfile: ../../docker/continuum-core.Dockerfile  # ← always the CPU one
    args:
      GPU_FEATURES: "--no-default-features --features load-dynamic-ort"  # ← no cuda
```
(2) The `gpu` profile bypasses our substrate
The `gpu` profile adds an `inference` service that uses the upstream `ghcr.io/ggml-org/llama.cpp:server-cuda` image — not our vendored llama.cpp, not our BatchScheduler, not our scheduler sequencing work.
```yaml
inference:
  image: ghcr.io/ggml-org/llama.cpp:server-cuda
  profiles: ["gpu"]
  command: ["-m", "/models/current.gguf", "-c", "4096", "-ngl", "99", "--port", "8090", ...]
```
Implication
Production docker-CUDA deploys today do not run the work that PR #891 is about. Users get:
- CPU-only `continuum-core` (no GPU inference path)
- Upstream llama-server on port 8090 (runs GGUF on CUDA, but outside our runtime — no scheduler, no per-seq LoRA, no Continuum cognition integration)
Given the recent direction "docker-first going forward, `npm start` is dev-only," this is a PR-blocker for #891: the PR ships work that isn't reachable via the documented deploy path.
Proposed fix (scope: minimum to close the gap)
- Add a compose service variant — either a `cuda` profile override that swaps `continuum-core`'s Dockerfile to `continuum-core-cuda.Dockerfile` with `GPU_FEATURES: "--no-default-features --features load-dynamic-ort,cuda"`, or a separate `continuum-core-cuda` service gated on the same profile.
- Remove the upstream-image `inference` service from the `gpu` profile. It bypasses our substrate by design, and that bypass is redundant now that our substrate handles GPU inference end-to-end.
- Update the compose-file top-of-file deploy doc: `# CUDA: docker compose --profile cuda up` (or equivalent) so the documented path matches the wired path.
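The first option above could look like the following override-file sketch. The file name, and the exact profile/override mechanics, are assumptions; the service name, build context, and `GPU_FEATURES` values mirror the existing compose excerpt, with `cuda` appended to the feature list:

```yaml
# docker-compose.cuda.yml — hypothetical override file (name is an assumption);
# usage: docker compose -f docker-compose.yml -f docker-compose.cuda.yml up
services:
  continuum-core:
    build:
      context: ./src/workers
      dockerfile: ../../docker/continuum-core-cuda.Dockerfile  # the orphaned CUDA Dockerfile
      args:
        GPU_FEATURES: "--no-default-features --features load-dynamic-ort,cuda"
```

Compose merges override files over the base, so only the `build` keys change; everything else on `continuum-core` (ports, volumes, env) carries over from `docker-compose.yml`.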
Coordination
- Claude on M5 is taking the compose YAML draft.
- Claude on M1 Pro (memento, this author) is on standby for BigMama CUDA build validation once Joel toggles WSL integration on Docker Desktop (current blocker: `/var/run/docker.sock` is unreachable from the WSL Linux side).
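Once WSL integration is toggled, the validation pass can start with a small pre-flight check along these lines. This is a sketch: the socket path is the standard default, and the suggested follow-up `docker compose` invocation assumes the `cuda` profile wiring from the proposed fix exists:

```shell
#!/usr/bin/env sh
# Pre-flight for BigMama CUDA build validation (hypothetical helper, not in the repo).

check_docker_sock() {
  # True iff the given path (default: /var/run/docker.sock) is a unix socket.
  [ -S "${1:-/var/run/docker.sock}" ]
}

if check_docker_sock; then
  echo "docker.sock reachable — ok to run: docker compose --profile cuda build continuum-core"
else
  echo "docker.sock unreachable — enable WSL integration in Docker Desktop first" >&2
fi
```

The check fails fast on the current blocker instead of letting the CUDA build die mid-way with a less obvious daemon-connection error.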
Related