Skip to content

fix: pick podman-hpc over podman in container runtime auto-detection#88

Merged
EiffL merged 5 commits into
mainfrom
fix/container-runtime-site-aware
Apr 30, 2026
Merged

fix: pick podman-hpc over podman in container runtime auto-detection#88
EiffL merged 5 commits into
mainfrom
fix/container-runtime-site-aware

Conversation

@EiffL
Copy link
Copy Markdown
Member

@EiffL EiffL commented Apr 30, 2026

Summary

  • On NERSC login nodes, lc run was silently picking podman over podman-hpc because both were on PATH and RUNTIMES listed podman first. Plain podman builds images that compute nodes cannot read.
  • Reordered RUNTIMES to (podman-hpc, podman, docker) — the HPC wrapper wins by PATH presence wherever it's installed.
  • Wired up the previously-unused container_runtime field in site_registry.py. detect_runtime() now consults the host's site entry and moves its declared runtime to the front of the probe order. The site preference is a hint: if the declared binary isn't on PATH, it falls through to the next candidate. Explicit user config in ~/.lightcone/config.yaml still errors via load_runtime (unchanged).
  • Two complementary protections — either alone fixes NERSC; together the system is self-correcting from both directions.

Test plan

  • uv run pytest tests/test_container.py — 76 pass (existing 73 + 3 new site-aware tests)
  • uv run pytest — full suite (268) passes
  • uv run ruff check clean
  • uv run mypy src/lightcone/engine/container.py clean
  • Manual verification on Perlmutter that lc run now picks podman-hpc

🤖 Generated with Claude Code

On NERSC login nodes both `podman` and `podman-hpc` are on PATH, and
`detect_runtime()` was returning `podman` first — but plain podman builds
images that compute nodes can't read. Two complementary fixes:

- Reorder RUNTIMES to (podman-hpc, podman, docker) so the HPC wrapper
  wins by PATH presence on any host.
- Wire up the previously-unused `container_runtime` field in
  `site_registry.py`: `detect_runtime()` now consults the host site
  and moves its declared runtime to the front of the probe order.
  The site preference is a hint — missing-from-PATH falls through to
  the next candidate. Explicit user config still errors via
  `load_runtime`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@EiffL
Copy link
Copy Markdown
Member Author

EiffL commented Apr 30, 2026

@claude please review this PR, make sure it makes sense to import socket, didn't we not already have a detection mechanism for which machine we were using? If so, any way to consolidate?

@claude
Copy link
Copy Markdown

claude Bot commented Apr 30, 2026

Claude finished @EiffL's task in 1m 20s —— View job


PR Review

  • Read changed files (container.py, test_container.py)
  • Read site_registry.py for existing detection mechanism
  • Analyze socket usage and consolidation opportunities
  • Post review feedback

The fix is correct and well-motivated. Here are my findings on the specific questions raised, plus a few other observations.


On import socket — does it make sense?

Yes — socket.gethostname() is the idiomatic stdlib way to get the local hostname and there's no cleaner alternative. The import itself is fine.

However, there is a real consolidation issue here (see below).


Was there already a detection mechanism? Yes — and it's now duplicated.

commands.py already does exactly the same thing inside the init callback:

# commands.py:152–156, 226
import socket
from lightcone.engine.site_registry import detect_site, get_site_defaults
...
site_key = detect_site(socket.gethostname())

And now _site_preferred_runtime() in container.py does:

# container.py:185
site_key = detect_site(socket.gethostname())

The pattern detect_site(socket.gethostname()) — "detect the site I'm currently running on" — is now in two places with socket imported in both. This is a good candidate for a small helper in site_registry.py:

def detect_current_site() -> str | None:
    """Detect the site we're currently running on from the local hostname."""
    import socket
    return detect_site(socket.gethostname())

This would:

  • Remove import socket from container.py entirely
  • Give commands.py a cleaner call (detect_current_site() instead of detect_site(socket.gethostname()))
  • Keep the socket dependency scoped to the one module that conceptually owns it

Not a blocker, but worth doing given the explicit question. Fix this →


Other observations

RUNTIMES reordering (podman-hpcpodmandocker) — correct. Anyone who has podman-hpc on PATH installed it intentionally; plain podman would produce images that compute nodes can't read. The original ordering comment was misleading ("podman first" as a rootless rationale doesn't apply on HPC). The new comment accurately explains the intent.

_detection_order() / _site_preferred_runtime() split — clean design. Keeping the site-preference lookup separate from the reordering logic makes both testable in isolation.

get_site_defaults(site_key) or {} — the or {} is slightly redundant since detect_site would already have returned None if the key wasn't in SITE_DEFAULTS, but it's harmless defensive code.

Test fixture _generic_hostname with autouse=True — good pattern. Pinning to "generic-laptop" in TestDetectRuntime ensures those tests are machine-agnostic. TestSiteAwareDetection then explicitly patches the hostname per test. The test coverage looks thorough.

The fallthrough test (test_falls_through_when_site_runtime_missing) — important coverage: it verifies the "site preference is a hint" contract. Well done.


Summary

The fix is correct and the logic is sound. The only structural issue is the duplicated detect_site(socket.gethostname()) pattern between commands.py and container.py. Consolidating that into detect_current_site() in site_registry.py would be the clean follow-up.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 30, 2026

✅ Eval Results

Metric Value
Score 1.00
Build complete
Cost $1.23
Turns 56
Duration 542s
lightcone-cli 0.2.1.dev11+g2eae9692f (2eae9692)
Results Download

Graders

✅ spec_valid (1.00)
✅ all_materialized (1.00)

Full output
7-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:30 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:31 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:12:32 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5b41befe-416d-4996-b657-7b61c59eab1e/files/bulk-upload "HTTP/1.1 200 OK"
12:21:35 lightcone.eval.sandbox Deleted sandbox for trial build-snae-0
  snae trial 0: score=1.00 complete

lightcone-cli: 0.2.1.dev11+g2eae9692f (HEAD 2eae9692)

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 1.00 +/- 0.00 │
│      │ pass@k: 100%  │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $1.23      │
│      │      542s       │
└──────┴─────────────────┘

Total: 1 trials, $1.23, 542s

Results saved to: eval-results/build-2eae9692/results.json

EiffL and others added 4 commits April 30, 2026 13:49
The pattern `detect_site(socket.gethostname()) → get_site_defaults(...) or
{}` was duplicated in three places (commands.py, scratch.py, container.py),
each importing `socket` separately and re-deriving the same site context.

Introduce a single high-level entry point in site_registry.py:

- `HostSite` — frozen dataclass bundling the matched site key with its
  declared defaults. Truthy when matched, with a `.get(field)` shortcut
  and a `.display_name` property that handles the key-fallback chain.
- `detect_current_site() → HostSite` — single source of truth for "which
  site are we on?". Returns an empty (falsy) HostSite when unmatched, so
  callers don't need a guard before reading fields.

Refactor all three call sites to use it. `socket` now lives only where it
conceptually belongs — inside site_registry.py. Adding a future "site
declares X" feature no longer means a fourth copy of the boilerplate;
extend HostSite or its consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests

- Drop the WHAT-restating docstring on HostSite.display_name; the
  property name and three-clause expression already say it.
- Replace three repeated `@patch("...socket.gethostname", ...)` blocks
  in TestDetectCurrentSite with a `fake_hostname` fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without fetch-depth: 0, the shallow checkout has no tags, so hatch-vcs
falls back to 0.1.dev1+g<sha> instead of resolving the real version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@EiffL EiffL merged commit 16172f8 into main Apr 30, 2026
5 of 6 checks passed
@aboucaud aboucaud deleted the fix/container-runtime-site-aware branch May 8, 2026 10:47
cailmdaley added a commit that referenced this pull request May 11, 2026
astra-tools v0.2.9 (EiffL's PR #88, merged today) ships stricter
semantics for Option.insights resolution than the 2ffbda4 grammar
rebase assumed: bare ids in [<id>] resolve node-locally only,
cross-scope refs require explicit ../ prefix. Same ../ grammar as
Input.from / Decision.from.

The typical SPECIFY authoring shape (declare each cited paper's
prior_insight at the sub-analysis that uses it, reference with a
bare id from same-scope options) is already what specify.md
prescribes — that's the node-local case and stays correct under
the new semantics.

The change adds the scope rules to step 1's Option ↔ prior_insights
linkage paragraph so SPECIFY iterations doing the rare cross-scope
ref reach for ../ instead of failing astra validate with
INVALID_INSIGHT_REF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant