fix: gracefully retire dask workers so srun exits silently #95

Merged
EiffL merged 3 commits into main from fix/graceful-srun-worker-shutdown
Apr 30, 2026

@EiffL EiffL commented Apr 30, 2026

Summary

  • On every clean `lc run` inside a SLURM allocation, the cleanup path SIGTERM-s the `srun` process, which prints "srun: forcing job termination", "task 0: Killed", and "Terminating StepId=…" to stderr — visible noise on a successful run.
  • Ask the dask scheduler to retire workers first (`Client.retire_workers(close_workers=True)`). Each `dask worker` process exits on its own, srun's task returns code 0, and srun terminates silently.
  • SIGTERM and SIGKILL remain as fallbacks if `retire_workers` fails or workers don't exit within 20s, so we never leak a hung srun.
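
The shutdown order described above (retire, wait, SIGTERM, SIGKILL) can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function name `shutdown_workers`, its return values, and the 5-second `kill_grace_s` are hypothetical. Only `Client.retire_workers(close_workers=True)` and the 20-second wait come from the description.

```python
import subprocess
from typing import Any


def shutdown_workers(
    client: Any,
    srun: subprocess.Popen,
    grace_s: float = 20.0,      # wait described in the PR
    kill_grace_s: float = 5.0,  # hypothetical: not stated in the PR
) -> str:
    """Stop workers launched under srun; report which path ended the process."""
    try:
        # Ask the scheduler to retire every worker and close its process.
        client.retire_workers(close_workers=True)
    except Exception:
        # Best effort only: fall through to the wait and signal fallbacks.
        pass

    try:
        # If the workers exited on their own, srun returns without any signal.
        srun.wait(timeout=grace_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        pass

    srun.terminate()  # SIGTERM fallback
    try:
        srun.wait(timeout=kill_grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        srun.kill()  # SIGKILL as the last resort
        srun.wait()
        return "killed"
```

Because the signals are only sent after the wait times out, a clean run never reaches them, while a hung `srun` is still bounded by the two timeouts.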

Test plan

  • `uv run pytest tests/test_dask_cluster.py` (12 passed; the fake `Client` in the tests doesn't implement `retire_workers`, so the call raises, the `except Exception` catches it, and the fallback wait path runs as before)
  • `uv run ruff check` / `uv run mypy` clean
  • Manual: re-run `lc run` on NERSC inside a salloc and confirm no `srun: forcing job termination` lines on success; confirm a Ctrl-C / failure path still terminates srun within ~20s
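
The first item works because calling a method that a test double lacks raises `AttributeError`, which `except Exception` catches. A minimal sketch with hypothetical names (`FakeClient` and `try_retire` are illustrative; the project's actual test double is not shown in this PR):

```python
class FakeClient:
    """Stand-in for a dask Client that has no retire_workers method."""


def try_retire(client) -> bool:
    """Return True if the graceful retire call succeeded, False otherwise."""
    try:
        client.retire_workers(close_workers=True)
        return True
    except Exception:
        # AttributeError from the missing method lands here, so the caller
        # can continue with the wait and signal fallbacks.
        return False
```

With this shape, `try_retire(FakeClient())` returns `False` and the existing tests keep exercising the fallback path unchanged.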

Notes

Independent of #94 (the dask logger env-var fix); both touch `_slurm_backed_cluster` but in different sections, so whichever lands first will need a trivial rebase of the other.

🤖 Generated with Claude Code

EiffL and others added 2 commits April 30, 2026 14:59
`workers.terminate()` SIGTERM-s the srun process, which then prints
"srun: forcing job termination" and "task 0: Killed" to stderr on
every clean `lc run` inside a SLURM allocation. Ask the scheduler to
retire workers first (close_workers=True): each `dask worker` process
exits on its own, srun's task returns code 0, and srun terminates
silently. SIGTERM/SIGKILL remain as fallbacks if retire_workers fails
or workers don't exit within 20s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented Apr 30, 2026

✅ Eval Results

| Metric | Value |
| --- | --- |
| Score | 1.00 |
| Build | complete |
| Cost | $1.43 |
| Turns | 53 |
| Duration | 512s |
| lightcone-cli | 0.2.1.dev13+g8c6b06feb (8c6b06fe) |
| Results | Download |

Graders

✅ spec_valid (1.00)
✅ all_materialized (1.00)

Full output
13:16:02 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5678f77b-cce7-40d6-b1b0-155792648b68/files/bulk-upload "HTTP/1.1 200 OK"
13:16:03 httpx HTTP Request: POST https://proxy.app.daytona.io/toolbox/5678f77b-cce7-40d6-b1b0-155792648b68/files/bulk-upload "HTTP/1.1 200 OK"
13:24:36 lightcone.eval.sandbox Deleted sandbox for trial build-snae-0
  snae trial 0: score=1.00 complete

lightcone-cli: 0.2.1.dev13+g8c6b06feb (HEAD 8c6b06fe)

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 1.00 +/- 0.00 │
│      │ pass@k: 100%  │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $1.43      │
│      │      512s       │
└──────┴─────────────────┘

Total: 1 trials, $1.43, 512s

Results saved to: eval-results/build-8c6b06fe/results.json

When retire_workers(close_workers=True) tells the worker to exit on
shutdown, the parent Nanny sees the child die and logs "Worker
process died unexpectedly" before it picks up the graceful-close
flag. Drop the Nanny: each srun task is a single run-scoped worker,
so auto-restart doesn't help (srun wouldn't relaunch the task
anyway). The worker process exits cleanly when retired and srun
terminates silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
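
The commit message does not show how the worker is launched. One way to run a worker without a Nanny is the `--no-nanny` flag of the `dask worker` CLI; the sketch below builds such a command line. The helper name `worker_command`, the bare `srun` invocation without resource options, and the scheduler address are all assumptions for illustration.

```python
import shlex


def worker_command(scheduler_address: str, nanny: bool = False) -> list[str]:
    """Build an srun command line that starts one dask worker per task."""
    cmd = ["srun", "dask", "worker", scheduler_address]
    # Without a Nanny the srun task is the worker process itself, so a
    # retired worker's exit code is what srun sees.
    cmd.append("--nanny" if nanny else "--no-nanny")
    return cmd


# Hypothetical scheduler address, shown only to illustrate the result.
print(shlex.join(worker_command("tcp://10.0.0.1:8786")))
```

Running the worker directly trades away automatic restarts, which the commit message argues are of no use for a single run-scoped worker under `srun`.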
@EiffL EiffL merged commit 7acc108 into main Apr 30, 2026
5 of 6 checks passed
@aboucaud aboucaud deleted the fix/graceful-srun-worker-shutdown branch May 8, 2026 10:47