What's Changed
- Update README to include TensorRT LLM and vLLM in description by @nlevin-ui in #1
- [MISC] Add License / headers, and a small check to prepare for release by @xinli-sw in #4
- feat: enable runtime container detection for portable dynamo source builds by @qiching in #3
- Sync ishandhanani/srt-slurm history into NVIDIA/srt-slurm by @csahithi in #14
- Add trace-replay benchmark type by @alec-flowers in #16
- fix: use custom_tokenizer to work around the trtllm + glm5 tokenizer loading issue by @richardhuo-nv in #20
- fix: add nvidia pypi as an extra index to be able to pip install the prerelease dynamo wheels by @richardhuo-nv in #22
- fix: support cross-arch clusters (x86_64 login, aarch64 compute) by @alec-flowers in #17
- feat: trace-replay benchmark with aiperf_args passthrough by @alec-flowers in #18
- feat: add mocker backend for pipeline smoke tests by @alec-flowers in #25
- feat: separate login-node and compute-node venvs by @alec-flowers in #29
- feat: runtime fingerprinting, identity verification, and lockfile by @alec-flowers in #19
- feat: configurable NATS max_payload for disagg serving by @alec-flowers in #31
- Copy {job_id}.json into log directory for S3 upload by @KaunilD in #15
- TRTLLM nsys profiling harness + Dynamo OTEL tracing automation by @karen-sy in #27
- Add CODEOWNERS file by @xinli-sw in #37
- Add CSV export for sa-bench rollup by @weireweire in #26
- Sanitize srun output in node IP resolution by @weireweire in #38
- feat: lockfile v2 — shareable recipe + lock section by @alec-flowers in #32
- fix: Install maturin if not present by @trevor-m in #45
- [codex] Add generic telemetry and custom benchmark support by @ishandhanani in #43
- [codex] Port HF cache cleanup by @ishandhanani in #49
- Add srt-slurm MCP spec server and preflight validation by @ishandhanani in #53
- Push logs_url to status API eagerly and via final PUT by @ishandhanani in #54
- [codex] narrow srtctl mcp to authoring and validation by @ishandhanani in #55
- [codex] Keep MCP validation off host cluster config by @ishandhanani in #56
- fix: emit aggregated resources and harden sa-bench rollup by @ishandhanani in #58
- feat: use pre-generated custom dataset for benchmarking MTP with chat template by @richardhuo-nv in #64
- docs: loud warnings on custom benchmark templating and nginx-off mode by @ishandhanani in #66
- feat(sa-bench): add sglang DeepSeek-V4 tokenizer by @YAMY1234 in #73
- feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg) by @elvischenv in #70
- fix(orchestrate): robust container bootstrap (maturin/protoc/venv-race) by @ishandhanani in #81
- fix(sa-bench): actionable error + warmup parity for use_chat_template by @YAMY1234 in #76
- feat(schema): make gsm8k a first-class BenchmarkType by @ishandhanani in #82
- [codex] add AIME benchmark by @ishandhanani in #83
- feat(aime): rework around `ns eval` for reasoning-model parity by @ishandhanani in #87
- Add scripts for wideEP; Note we can reach a PD balance with dep8, cc=2048 by @samuellees in #52
- Revert "Add scripts for wideEP; Note we can reach a PD balance with dep8, cc=2048" by @ishandhanani in #89
- refactor(aime): drop structured runner, ship configs/aime/{run.sh,rescore.py} by @ishandhanani in #91
- Add the chat template to the glm5 tokenizer and apply that when sampling the requests by @richardhuo-nv in #65
- feat(config): resolve container aliases for telemetry + preflight by @ishandhanani in #101
- [codex] Add Dynamo nightly wheel install support by @alec-flowers in #99
- feat(dynamo): cache hash-pinned source builds on /configs by @ishandhanani in #88
- Add DeepSeek V4 Pro vLLM GB200 recipes by @alec-flowers in #102
- feat(config): cluster-wide default_bash_preamble for ulimits and the like by @ishandhanani in #104
- fix(nginx): raise file descriptor limit for nginx workers by @ishandhanani in #108
- log: always set dyn skip log fmt by @ishandhanani in #109
- [NOT FINAL] add wip DSv4 aggregate and disaggregate recipes by @ishandhanani in #85
- nginx: rework to make ulimit optional by @ishandhanani in #110
- log: demote per-srun command line to DEBUG by @cquil11 in #111
- fix: using a setup script to install pip in trtllm venv by @richardhuo-nv in #116
- default dyn log by @ishandhanani in #118
- feat: Add live monitor to SRT-SLURM by @leo-cf-tian in #119
- Pass in bootstrap port on prefill by @wenscarl in #121
- Cherry-pick lm-eval benchmark runner from sa-submission-q2-2026 by @ishandhanani in #122
- fix: preflight accepts hf:* model paths and Docker image URIs by @Thunderbeee in #125
- Add GLM5 B200 FP8 disaggregated recipe by @weireweire in #50
- [NOT FINAL] Qwen3.5 fp8 mtp-off recipes by @samuellees in #128
- feat: live in-flight batch-metrics snapshotter (opt-in) by @YAMY1234 in #115
- feat(profiling): add extra_nsys_args for optional nsys CLI flags by @zhengd-nv in #59
- Handle null telemetry in live metrics startup by @weireweire in #135
- Add GPT-OSS TRT-LLM aggregated recipe by @faradawn in #132
- feat: peak gen throughput metric in sa-bench + server-side node metrics CSV export by @zhengd-nv in #93
- feat: first-class mooncake KV store support for SGLang backend by @ishandhanani in #136
- feat: SGLang decode slow_down for PD disagg nsys profiling (with skip-warmup workflow) by @zhengd-nv in #60
- sglang: enable mooncake_master HTTP metadata server + auto-inject MOONCAKE_TE_META_DATA_SERVER by @ishandhanani in #138
- recipes: update glm5 sglang to use faster weights loading by @weireweire in #137
- sa-bench: make SGLangDeepseekV4Tokenizer callable by @ch-wan in #144
- fix(batch-metrics): split agg logs by DP rank by @YAMY1234 in #145
- Capture git state for extra mounts by @YAMY1234 in #146
- Sglang port jitter by @nvjullin in #134
- Default SA-Bench random workers to auto by @weireweire in #147
- Update GB300 FP4 GLM-5 recipe by @weireweire in #152
- Expand environment variables in extra_mount paths by @weireweire in #153
- Make batch metrics legends translucent by @YAMY1234 in #151
- Centralize safe runtime port allocation by @weireweire in #156
- Support default sbatch directives in srtslurm config by @weireweire in #159
- Update GB300 FP8 GLM-5 recipe by @weireweire in #160
- Add Nemotron Super 120B recipes by @faradawn in #150
- Add --no-preflight CLI flag to srtctl apply by @cquil11 in #162
- Add Qwen3.5 DeepEP MTP recipes by @YAMY1234 in #163
- Accept legacy token metric names in telemetry plots by @weireweire in #166
- Fixing sweep submissions for 'sweep' block by @AlphaBladez in #170
- Add DSV4 GB300 8k1k recipe by @weireweire in #173
- Add GB300 FP8 GLM5 MTP recipes and update max-running-requests by @weireweire in #168
- Add spread worker placement and vLLM colocation (PR against main) by @jasonlizhengjian in #182
- Force-reinstall maturin in portable top_of_tree dynamo source build by @Ankur-singh in #183
- feat(config): add default_health_check cluster-level default in srtslurm.yaml by @shljessie in #180
- fix(slurm): use --key=value for srun options (Slurm 25.11 cpu-bind regression) by @shljessie in #179
- Added heterogeneous job support by @nvjullin in #178
- Copy resolved override/zip config into log dir for S3 upload by @KaunilD in #194
- Add workflow: auto-release on PR merge to main by @Ankur-singh in #185
- Fix release workflow 403 by using pull_request_target by @Ankur-singh in #200
New Contributors
- @xinli-sw made their first contribution in #4
- @csahithi made their first contribution in #14
- @KaunilD made their first contribution in #15
- @karen-sy made their first contribution in #27
- @trevor-m made their first contribution in #45
- @ishandhanani made their first contribution in #43
- @elvischenv made their first contribution in #70
- @samuellees made their first contribution in #52
- @cquil11 made their first contribution in #111
- @leo-cf-tian made their first contribution in #119
- @wenscarl made their first contribution in #121
- @Thunderbeee made their first contribution in #125
- @zhengd-nv made their first contribution in #59
- @faradawn made their first contribution in #132
- @ch-wan made their first contribution in #144
- @nvjullin made their first contribution in #134
- @AlphaBladez made their first contribution in #170
- @Ankur-singh made their first contribution in #183
- @shljessie made their first contribution in #180
Full Changelog: https://github.com/NVIDIA/srt-slurm/commits/v1.0.0