[atom-vllm] enable DeepSeek V3.2 quick reduce envs #1047

Merged
zejunchen-zejun merged 2 commits into main from xiaobing/deepseek_v3.2_int4_reduce on Jun 3, 2026

@XiaobingSuper
Contributor

Motivation

Add INT4 quick reduce for DeepSeek V3.2.

Technical Details

Test Plan

accuracy:

[image: accuracy results]

Test Result

Submission Checklist

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings June 3, 2026 03:04
Contributor

Copilot AI left a comment

Pull request overview

This PR enables INT4 “quick reduce” environment settings for DeepSeek-V3.2 in the vLLM-ATOM ecosystem, updating the user recipe plus CI accuracy and OOT benchmark/accuracy metadata so DeepSeek-V3.2 runs pick up the intended AllReduce behavior.

Changes:

  • Update the DeepSeek-V3.2 vLLM recipe to export quick-reduce env vars.
  • Set DeepSeek-V3.2 TP4/TP8 accuracy-validation workflow env_vars to include the quick-reduce settings.
  • Mirror the same env_vars updates into OOT accuracy and OOT benchmark model catalogs.
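As a rough illustration of the env-var change listed above, the sketch below builds a process environment that carries the two quick-reduce variables. Only the variable names and values come from this PR; the helper name `quick_reduce_env` is hypothetical, and how the server process is actually launched is deliberately left out.

```python
import os


def quick_reduce_env(base=None):
    """Return a copy of `base` (default: os.environ) with the quick-reduce vars set."""
    env = dict(os.environ if base is None else base)
    # Names and values as exported in the recipe change of this PR.
    env["AITER_QUICK_REDUCE_QUANTIZATION"] = "INT4"
    env["AITER_QUICK_REDUCE_CAST_BF16_TO_FP16"] = "0"
    return env


# Example: start from a minimal environment instead of the real one.
env = quick_reduce_env({"TP": "4"})
```

The resulting dictionary could be handed to `subprocess.run(cmd, env=env)` so that the settings apply to the launched process only.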

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| recipes/atom_vllm/DeepSeek-V3.2.md | Adds quick-reduce env exports to the documented launch command. |
| .github/workflows/atom-vllm-accuracy-validation.yaml | Enables quick-reduce env vars for DeepSeek-V3.2 TP4/TP8 accuracy runs (and matching benchmark overrides). |
| .github/benchmark/oot_models_accuracy.json | Updates DeepSeek-V3.2 accuracy dashboard metadata to run with the new env vars. |
| .github/benchmark/oot_benchmark_models.json | Updates DeepSeek-V3.2 OOT benchmark variants to run with the new env vars. |

@XiaobingSuper XiaobingSuper requested a review from wuhuikx June 3, 2026 03:07

Comment on lines 16 to +18

```bash
TP=4
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export AITER_QUICK_REDUCE_CAST_BF16_TO_FP16=0
```

Collaborator

Could you please add the accuracy results to the recipe file?

Contributor Author

Added.

@XiaobingSuper XiaobingSuper requested a review from wuhuikx June 3, 2026 03:38
@zejunchen-zejun zejunchen-zejun merged commit a912dd1 into main Jun 3, 2026
18 of 30 checks passed
@zejunchen-zejun zejunchen-zejun deleted the xiaobing/deepseek_v3.2_int4_reduce branch June 3, 2026 04:06
zejunchen-zejun added a commit that referenced this pull request Jun 5, 2026
* Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK (#838)

* add gdn decode fast kernel

* resolve gdn code conflicts

* resolve gdn code conflicts

* solve misspelling error

* solve redundant import error

* add layernorm and rope optimization

* revert non-gdn optimization changes

Co-authored-by: Cursor <cursoragent@cursor.com>

* revert gdn changes

Co-authored-by: Cursor <cursoragent@cursor.com>

* add gdn decode lossy fast kernel

* revert sglang benchmark file changes

Co-authored-by: Cursor <cursoragent@cursor.com>

* gate gdn decode lossy fast path

Co-authored-by: Cursor <cursoragent@cursor.com>

* address gdn decode review comments

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(gdn): zero out for PAD_SLOT_ID in lossy fast kernel

When ssm_state_indices contains a negative slot id (e.g. SGLang's
PAD_SLOT_ID = -1 for idle/padded decode slots) the kernel previously
returned early without writing to out, leaving the corresponding
positions in the output tensor uninitialized and propagating garbage
into downstream ops.

Match the safer behavior expected by callers: write zeros to out for
the invalid slot and skip the state load/store entirely.

Addresses the latest Copilot review comment on PR #838.
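The padded-slot handling described above can be sketched in plain Python. This is a toy stand-in, not the Triton kernel: `decode_step`, the list-based state, and the doubling "compute" are all illustrative assumptions; only the rule (negative slot id: write zeros to `out`, skip the state load/store) comes from the commit message.

```python
PAD_SLOT_ID = -1  # idle/padded decode slot, as named in the commit message


def decode_step(slot_ids, states, out, width):
    """Toy stand-in for the kernel: one output row per decode slot."""
    for row, slot in enumerate(slot_ids):
        if slot < 0:
            # Invalid slot: zero the output row and skip state load/store.
            out[row] = [0.0] * width
            continue
        state = states[slot]                 # state load
        out[row] = [2.0 * x for x in state]  # placeholder math, not the real GDN update
        states[slot] = out[row]              # state store
    return out


states = {0: [1.0, 2.0]}
out = [[float("nan")] * 2 for _ in range(2)]  # simulate uninitialized output memory
decode_step([0, PAD_SLOT_ID], states, out, width=2)
```

Without the zero write, the second row would keep whatever was in the buffer (here the NaN placeholders) and pass it on to downstream ops.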

* style: apply black formatting

Fix Check Code Style with Black CI failure on #838.

* perf(qwen3.5): add fused MRoPE QK Triton path

Merges the MRoPE Q/K fusion work originally in #888 into this PR so
the two related Qwen3.5-35B-A3B-FP8 optimizations ship together
(per review feedback that #888's stand-alone +1.7% gain is too small
to justify a separate PR).

Adds:
- atom/model_ops/triton_mrope.py: specialized Qwen3.5 MRoPE Q/K
  Triton kernels (tiled + per-token) with a try_mrope_qk_fused
  dispatcher decorated with @torch.compiler.disable so Dynamo cannot
  specialize positions/q/k symbolic dims to constants (was tripping
  ConstraintViolationError under MMStar dynamic-shape compile).
- atom/models/qwen3_next.py: wires try_mrope_qk_fused into
  Qwen3NextAttention after qk_norm; falls back to the generic
  rotary_emb path when the shapes don't match.

Combined effect over main (MI308X, CONC 224, ISL 4094, OSL 2048,
TP/EP 1/1, ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1):
- Total token throughput: 7466.90 -> 8004.41 tok/s (+7.20%)
- Mean E2E latency: 176401 -> 164893 ms (-6.52%)
- Mean TPOT: 77.44 -> 71.87 ms (-7.19%)

GSM8K 5-shot remains on par with main:
- flexible-extract: 0.895 (vs 0.8946 baseline)
- strict-match: 0.903 (vs 0.9052 baseline)
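The percentages quoted above follow from the raw before/after numbers. A small check, where `pct_change` and the dictionary keys are names chosen here for illustration:

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (main, this PR) pairs copied from the commit message.
metrics = {
    "throughput_tok_s": (7466.90, 8004.41),
    "mean_e2e_ms": (176401, 164893),
    "mean_tpot_ms": (77.44, 71.87),
}
changes = {name: f"{pct_change(*pair):+.2f}%" for name, pair in metrics.items()}
```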

* fix(mrope): early-return under torch.compile instead of graph break

Previously try_mrope_qk_fused used @torch.compiler.disable to keep the
Python shape branches out of Dynamo. That fixed the original
ConstraintViolationError but introduced a new MMStar failure:

  torch._dynamo.exc.BackendCompilerFailed: backend='...VllmBackend'
  raised: AssertionError: VllmBackend can only be called once

The graph break inserted by @torch.compiler.disable inside the compiled
Qwen3NextAttention forward causes Dynamo to invoke ATOM's VllmBackend a
second time on the same instance.

Switch to torch.compiler.is_compiling() early-return: under compile we
skip the fused path entirely (fall back to self.rotary_emb, identical
to main), eager mode keeps the fused-path perf gain. No graph break,
no double-backend invocation.
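A minimal sketch of the early-return pattern, written without a torch dependency so it runs anywhere. The function `apply_rope` and its keyword arguments are hypothetical; in the real code the predicate would be `torch.compiler.is_compiling`, and the assumption that the fused dispatcher returns `None` on a shape mismatch is made here only for illustration.

```python
def apply_rope(q, k, *, is_compiling, fused_fn, fallback_fn):
    """Choose the rotary path; `is_compiling` stands in for torch.compiler.is_compiling."""
    if is_compiling():
        # Under compile: take the generic path so no graph break is introduced.
        return fallback_fn(q, k)
    fused = fused_fn(q, k)  # assumed to return None when shapes do not match
    return fused if fused is not None else fallback_fn(q, k)


def fused(q, k):
    return ("fused", q, k)


def generic(q, k):
    return ("generic", q, k)


eager_result = apply_rope(1, 2, is_compiling=lambda: False,
                          fused_fn=fused, fallback_fn=generic)
compiled_result = apply_rope(1, 2, is_compiling=lambda: True,
                             fused_fn=fused, fallback_fn=generic)
```

Because the branch is an ordinary Python conditional rather than a disabled region, the compiled forward stays a single graph.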

* perf(mrope): drop tl.constexpr on num_tokens to avoid recompilation

num_tokens equals positions.shape[1], which changes every batch (mixed
prefill/decode, varying decode batch sizes). With tl.constexpr, Triton
specializes and recompiles the kernel for every distinct value, which
defeats the perf gain of the fused path.

num_tokens is only used in a runtime mask (row_mask = rows < num_tokens),
so it does not need constexpr semantics. Drop the annotation so the
kernel is compiled once per shape group.

Addresses Copilot review r3322237301.
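The recompilation effect can be illustrated with a toy cache model. `ToyJit` is not Triton; it only mimics the idea that a JIT keeps one compiled binary per distinct tuple of compile-time arguments. Real Triton also specializes on some properties of runtime integer arguments, which this sketch ignores.

```python
class ToyJit:
    """Toy JIT cache: one 'compilation' per distinct tuple of constexpr values."""

    def __init__(self, constexpr_names):
        self.constexpr_names = tuple(constexpr_names)
        self.cache = {}

    def launch(self, **kwargs):
        key = tuple(kwargs[name] for name in self.constexpr_names)
        if key not in self.cache:
            self.cache[key] = f"binary{len(self.cache)}"  # pretend to compile
        return self.cache[key]


batches = [17, 64, 3, 64, 128]                   # num_tokens varies per batch
specialized = ToyJit(["BLOCK", "num_tokens"])    # num_tokens marked constexpr
runtime_arg = ToyJit(["BLOCK"])                  # num_tokens passed at runtime
for n in batches:
    specialized.launch(BLOCK=64, num_tokens=n)
    runtime_arg.launch(BLOCK=64, num_tokens=n)
```

In this model the specialized variant compiles once per distinct batch size, while the runtime-argument variant compiles once in total.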

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: zovonoir <zovonoir@users.noreply.github.com>

* fix(spec_decode): support DP attention with MTP in Deepseek V4 (#1001)

* fix(spec_decode): support DP attention with MTP draft

Refresh dp_metadata per draft step (force variable-length DP path) and
add num_spec_step + scheduled_spec_decode_tokens to the dummy decode
batch so DP+MTP runs stay in lockstep.

* style: apply black formatting

---------

Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com>

* Remove qkv 256 tok limitation (#999)

* [Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention (#750)

* [feat][Attention Refactor] Reconstruct the Attention arch

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* ci(benchmark): raise benchmark drain MAX_MIN 30->60 and step timeout 60->80 (#1019)

High-concurrency long-context benchmarks (DP-attention 8k/1k c=1024, which
runs num_prompts = conc*10 = 10240) need ~48 min wall: ~14 min warmup +
~34 min for the measured run (10 waves of 1024 at ~3:20/wave). The benchmark
drain's MAX_MIN=30 cut them off mid-run with exit 4 (timeout), failing the
job even though the server was healthy and still making progress.

Raise the benchmark drain MAX_MIN 30->60 and the "Run benchmark" step
timeout-minutes 60->80 so these runs complete. Fast jobs are unaffected
(drain exits on client completion, well before MAX_MIN); genuine hangs/faults
still surface quickly via STUCK_POLLS (3 min) and fault detection, not MAX_MIN.
The accuracy drain (MAX_MIN=30) is left unchanged.
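The wall-time estimate above can be reproduced with a few lines of arithmetic. The variable names are chosen here for illustration and the per-wave and warmup durations are the approximate figures from the commit message, so the result is a rough estimate, not a measurement.

```python
conc = 1024
num_prompts = conc * 10          # 10240 prompts, as stated above
waves = num_prompts // conc      # 10 waves of `conc` requests each
wave_seconds = 3 * 60 + 20       # about 3:20 per wave
warmup_min = 14                  # approximate warmup time

measured_min = waves * wave_seconds / 60
total_min = warmup_min + measured_min

old_max_min, new_max_min, step_timeout_min = 30, 60, 80
fits_old = total_min <= old_max_min
fits_new = total_min <= new_max_min <= step_timeout_min
```

With these inputs the run needs well over 30 minutes but stays under 60, which matches the motivation for raising MAX_MIN.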

* [atom-vllm-benchmark] Retrieve model case name (#1022)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* ci(accuracy): set Qwen3.5-35B-A3B TP2 baseline to 0.85 (#993)

Mean of first 4 valid CI runs after PR #893 (0.8226 / 0.8529 / 0.8620 / 0.8628).
Threshold 0.83 unchanged.
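The baseline is the rounded mean of the four listed runs, which can be checked directly:

```python
from statistics import mean

runs = [0.8226, 0.8529, 0.8620, 0.8628]  # values from the commit message
baseline = round(mean(runs), 2)
threshold = 0.83
```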

Co-authored-by: JiaoliangYu <jiaolyu@amd.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: support PTPC indexer wk FP8 scales (#1009)

* fix: support MTP indexer wk FP8 scales

Allow DeepSeek-V3.2 MTP checkpoints to load indexer.wk tensors that use per-channel FP8 scales while preserving the existing block-scale path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: clarify PTPC indexer wk scale support

Describe the per-channel FP8 scale path as PTPC quantization support rather than MTP-specific behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Fix] Enable dpsk r1 mxfp4 V2 model (#934)

* [Fix] Enable dpsk r1 mxfp4 V2 model

* [Benchmark] Change model to dpsk v2 model for sglang plugin

* [Fix] Move MXFP4 kv_b_proj preservation into SGLang MLA

* [Fix] Handle SGLang MXFP4 kv_b_proj postprocess order

* Add fused chunk GDN prefill path for Qwen3.5-35B (#921)

* Add fused chunk GDN prefill path for Qwen3.5-35B

Port AMD HIP fast path from sglang's flash-linear-attention to
chunk_gated_delta_rule prefill. Fuses 4 kernels into 3.

* remove unused o_16 in fused_merge_recompute_kernel

* format NT_16 ternary on single line for black

* [fix](attn): fix slot mapping in model runner v2 (#1015)

Co-authored-by: perzhang <perzhang@amd.com>

* [MoE] adapt to triton_kernels matmul_ogs -> matmul rename (#763)

Upstream triton_kernels merged the `matmul_ogs` module into `matmul`
and the `matmul_ogs_details` package into `matmul_details`. The
`PrecisionConfig` dataclass was also reshaped: `weight_scale` is now
`b_mx_scale`, and setting it requires `b_microblock_size` to be
provided explicitly (enforced by an assert in the new `matmul()`).

- fused_moe_triton: try importing `FnSpecs / FusedActivation /
  PrecisionConfig / matmul` from `triton_kernels.matmul` first, fall
  back to the old `triton_kernels.matmul_ogs` path. Alias `matmul as
  matmul_ogs` so existing call sites stay unchanged.
- moe (Mxfp4MoEMethod.process_weights_after_loading): same dual-path
  import for `FlexCtx / PrecisionConfig`; detect the kwarg name via
  `dataclasses.fields` so the old `weight_scale=` path keeps working
  while the new API takes `b_mx_scale=` + `b_microblock_size=`.
- Drop the `_amd_smem_safe_tile` workaround that pinned
  block_m / block_n on gfx950: the underlying LDS-spill is no longer
  reproducible against current triton / triton_kernels.
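The kwarg-detection idea can be sketched with stand-in dataclasses. `OldPrecisionConfig`, `NewPrecisionConfig`, and `make_precision_config` are hypothetical names used so the example runs without `triton_kernels` installed; only the field names `weight_scale`, `b_mx_scale`, and `b_microblock_size` come from the commit message, and the microblock size of 32 is an illustrative value.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class OldPrecisionConfig:  # stand-in for the pre-rename layout
    weight_scale: Optional[Any] = None


@dataclass
class NewPrecisionConfig:  # stand-in for the post-rename layout
    b_mx_scale: Optional[Any] = None
    b_microblock_size: Optional[int] = None


def make_precision_config(cls, scale, microblock_size):
    """Build `cls` with whichever scale kwarg its dataclass fields expose."""
    names = {f.name for f in fields(cls)}
    if "b_mx_scale" in names:
        return cls(b_mx_scale=scale, b_microblock_size=microblock_size)
    return cls(weight_scale=scale)


old_cfg = make_precision_config(OldPrecisionConfig, "scale", 32)
new_cfg = make_precision_config(NewPrecisionConfig, "scale", 32)
```

In the real module the class would come from a `try`/`except ImportError` around the new and old import paths; inspecting the fields keeps a single call site working for both.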

Co-authored-by: jianlian <jianlian@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* CI: Use linux-atom-mi35x-1 in docker release pipeline

* [atom-vllm benchmark] set 0 to random range ratio for vllm bench (#1029)

* Fix AW benchmark fixed length config (#1020)

Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* Clarify AW benchmark matrix job name (#1021)

* Clarify AW benchmark matrix job name

* Use explicit zero ratio for AW benchmark cases

---------

Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

---------

Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* [atom-sgl-benchmark] Debug timeout (#977)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* [atom-vllm benchmark] allow P0 benchmarks at 128 and 256 concurrency (#1036)

Allow P0 benchmarks at 128 and 256 concurrency (#1030)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* fix: chunk prefill (#1032)

* remove disable deepseek v4 chunk prefill flag

* fix(scheduler): use num_tokens for preempted seq re-prefill chunk size

Preempted seqs keep their decoded token_ids (preempt() only deallocates
KV blocks) so seq.num_tokens > seq.num_prompt_tokens on re-admit.
Computing num_new_tokens from num_prompt_tokens caused chunk=0 when a
fully-cached prefix exhausted num_prompt_tokens, triggering the
"chunk must be positive" assert under high concurrency benchmarks.

* fix format
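The preempted-sequence fix in #1032 can be illustrated with a simplified model of the chunk-size computation. The two functions and the `num_cached_tokens` parameter are assumptions made for this sketch; the actual scheduler has more inputs (for example a token budget) that are omitted here.

```python
def new_tokens_from_prompt(num_prompt_tokens, num_tokens, num_cached_tokens):
    """Old behavior (simplified): ignores tokens decoded before preemption."""
    return num_prompt_tokens - num_cached_tokens


def new_tokens_from_total(num_prompt_tokens, num_tokens, num_cached_tokens):
    """Fixed behavior (simplified): re-prefill covers prompt plus decoded tokens."""
    return num_tokens - num_cached_tokens


# A preempted sequence: 128 prompt tokens, 5 already decoded, prefix fully cached.
seq = dict(num_prompt_tokens=128, num_tokens=133, num_cached_tokens=128)
old_chunk = new_tokens_from_prompt(**seq)
new_chunk = new_tokens_from_total(**seq)
```

The old computation yields a chunk of zero for this sequence, which is the case that tripped the "chunk must be positive" assert.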

* fix sparse_attn_v4_paged_prefill for MI308 (#1003)

* [ATOM SGLang] SGL plugin Attention Refractory (#863)

* add work log

* [ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full attention backend

* remove work log

* [ATOM-SGL][Attn refrac] Route DeepSeek MLA through an SGLang wrapper
Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into
SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention
backend as the host/backend layers. Shrink deepseek_mla_forward.py into a
helper module and clarify absorbed vs non-absorbed path naming.

* [ATOM SGL] runtime extraction

* [ATOM-SGL][Runtime] Introduce model adapter specs

Co-authored-by: Cursor <cursoragent@cursor.com>

* [ATOM-SGL][Runtime] Keep custom wrappers out of generated entries

Co-authored-by: Cursor <cursoragent@cursor.com>

* [ATOM-SGL][Attn refrac] Split full attention backend helpers

Co-authored-by: Cursor <cursoragent@cursor.com>

* [ATOM-SGL][Attn refrac] Format refactored attention files

Co-authored-by: Cursor <cursoragent@cursor.com>

* [ATOM-SGL][Attn refrac] Fix ruff findings in refactored attention code

Co-authored-by: Cursor <cursoragent@cursor.com>

* [ATOM-SGL][Attn refrac] Avoid DeepSeek MLA wrapper module cycle

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix rebase issue

* precheckin

* prepare for sglang only

* import error meet in qwen3.5

* qwen3.5 acc fix

* [Fix] Limit static FP4 linear kv_b_proj post-processing

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: qichu-yun <qichu@amd.com>

* [enable EP] deepseek V4 (#875)

* [enable EP] deepseek V4

* update

* [KV-events] ZMQ publisher for KV cache events (#869)

* feat(kv-events): ZMQ publisher for KV cache events

Add a KV cache lifecycle event pipeline so external consumers can track
when blocks become resident, are evicted, or move across tiers.

- atom/distributed/kv_events.py: EventBatch + tagged-union schema
  (BlockStored, BlockRemoved, AllBlocksCleared, BlockTransferred);
  ZMQ PUB publisher with a background sender thread and bounded queue
  (drops oldest on slow subscriber).
- atom/model_engine/block_manager.py: emit BlockStored on prefix-cache
  coalesced runs, BlockRemoved on lazy LRU eviction, AllBlocksCleared on
  clear_cache(); record_remote_store() hook for remote-transfer
  connectors to emit BlockStored(medium=REMOTE).
- atom/model_engine/scheduler.py: publish_kv_events() drains the
  BlockManager event log per scheduler step into one EventBatch;
  shutdown_kv_events() tears down the publisher on engine shutdown.
- atom/model_engine/engine_core.py: publisher lifecycle wiring.
- atom/utils/envs.py: ATOM_KV_EVENTS_{ENABLE,PUBLISHER,ENDPOINT,TOPIC,
  HWM,BUFFER_STEPS} env vars.
- atom/config.py: KV-events config knobs.
- tests/test_kv_events.py: schema round-trip + tagged-union batch.

BlockTransferred and medium in {CPU, DISK} are reserved in the schema
but not emitted yet. The hybrid-cache metadata fields on BlockStored
(kv_cache_spec_kind, kv_cache_spec_sliding_window) are reserved wire
slots emitted as None until a follow-up wires them from the cache-spec
coordinator.

Review feedback (incorporated):
- Make pyzmq an optional runtime dep: import zmq inside ZmqEventPublisher
  so BlockManager's unconditional import of this module no longer
  requires pyzmq when KV events are disabled.
- Validate buffer_steps >= 1 in ZmqEventPublisher so 0 (which Python's
  queue.Queue treats as unbounded) can't silently disable backpressure.
- Track encode failures in stats (encode_errors counter) instead of
  swallowing the exception silently.
- Add BlockManager.kv_events_enabled property so the scheduler stops
  reaching into _event_log directly.
- Use the MEDIUM_REMOTE constant rather than the "REMOTE" string literal
  in record_remote_store.
- Use pytest.importorskip("zmq") and an inproc:// endpoint in
  test_zmq_publisher_roundtrip so the test no longer hard-codes a TCP
  port and can be skipped cleanly when pyzmq is absent.
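The drop-oldest buffering and the `buffer_steps >= 1` check mentioned above can be sketched as follows. `DropOldestBuffer` is a hypothetical, single-threaded simplification built on `collections.deque`; the real publisher uses a background sender thread and ZMQ, neither of which is modeled here.

```python
from collections import deque


class DropOldestBuffer:
    """Bounded buffer that evicts the oldest batch when full."""

    def __init__(self, buffer_steps):
        if buffer_steps < 1:
            # A size of 0 would otherwise mean "unbounded" for queue.Queue.
            raise ValueError("buffer_steps must be >= 1")
        self._items = deque(maxlen=buffer_steps)
        self.dropped = 0

    def put(self, batch):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # deque evicts the oldest entry on append
        self._items.append(batch)

    def drain(self):
        items = list(self._items)
        self._items.clear()
        return items


buf = DropOldestBuffer(buffer_steps=2)
for step in (1, 2, 3):
    buf.put(step)
pending = buf.drain()
```

Counting drops rather than blocking keeps a slow subscriber from stalling the producer, at the cost of losing the oldest events.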

* chore(kv-events): trim verbose comments and docstrings

Remove descriptive comments and docstrings that restated what the code
already says, leaving only the ones whose WHY is non-obvious (lazy
eviction point, coalesced-store parent semantics, sticky cache_miss
invariant, drop-on-overflow design, clear_cache live-seq invariant).

* fix(kv-events): import MEDIUM_REMOTE for record_remote_store

The earlier commit added a MEDIUM_REMOTE reference at the
record_remote_store() emit site but the import line was never added,
which would have raised NameError on first remote-store callback.
Path wasn't exercised in the local smoke run because we never wired a
KV-transfer producer.

* fix(kv-events): close shutdown race and drop unused _EventBatch

* fix(kv-events): align KVEventsConfig defaults with env

* fix(kv-events): teardown safety, multipart docstring, parent_hash dedupe

* fix(kv-events): no BlockRemoved on cache-hit block reuse

* fix(kv-events): chain parent on remote store, atomic drain, longer linger

* fix(kv-events): use sub.poll in test_zmq_publisher_roundtrip

* Merge branch 'main' into feat/kv-events

* fix(kv-events): publish on every step, skip cached blocks on remote-store, safer shutdown

* fix(kv-events): default endpoint to loopback for safer opt-in

* fix(kv-events): default group_idx to None to match vLLM wire layout

* fix(kv-events): call hash_blocks before fwd_output idx-skip

main's postprocess() skipped seqs whose idx is None (prefill step pattern)
before calling hash_blocks(), so the prefill seq's hashes were never
registered and BlockStored was never emitted. Move the hash_blocks call
above the idx-None continue so it runs on every prefill step regardless
of the fwd_output idx mapping.
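The ordering change can be shown with a stripped-down loop. The function signature and the dictionary-based `idx_of` lookup are illustrative assumptions; only the ordering (call `hash_blocks` first, then skip sequences whose idx is None) reflects the fix described above.

```python
def postprocess(seqs, idx_of, hash_blocks):
    """Simplified loop: hashing happens before the idx-None skip."""
    handled = []
    for seq in seqs:
        hash_blocks(seq)        # runs for every seq, including prefill steps
        idx = idx_of.get(seq)
        if idx is None:         # no fwd_output entry for this seq in this step
            continue
        handled.append((seq, idx))
    return handled


hashed = []
handled = postprocess(["prefill_seq", "decode_seq"], {"decode_seq": 0}, hashed.append)
```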

* test(kv-events): rename test_cache_hit_emits_no_new_store -> only_new_blocks

* kv_events: log first encode error, count shutdown drops, hoist event-log check

* black format

* kv_events: harden finally, add overflow/encode tests

* pyproject: add msgspec to deps

* [atom-vllm benchmark] enable DeepSeek V3.2 quick reduce envs (#1047)

* [atom-vllm] enable DeepSeek V3.2 quick reduce envs

Co-authored-by: Cursor <cursoragent@cursor.com>

* add accuracy recipe

---------

Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: warmup uses full token budget for DP (#1024)

* fix: warmup uses full token budget

* only for dp attn

---------

Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com>

* feat: support DeepSeek-V4-Flash-Base model on gfx942 device. (#996)

* Expose ATOM test base image input (#1053)

* [atom-vllm-benchmark] Add model case amd/DeepSeek-V3.2-mtp-ptpc for AW_P0 (#1039)

* Add model case amd/DeepSeek-V3.2-mtp-ptpc for AW_P0

* First run non-mtp version

* Remove 'MTP' from choice_label

* Add model case amd/DeepSeek-V3.2-mtp-ptpc to accuracy and recipe

* Add launch params to deepseek v3.2 ptpc

---------

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* [atom-vllm-benchmark] Change AW execution logic from one server one job to one server multi jobs (#1005)

* Rename to AW (#1000)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* Debug 'no such file or directory benchmark_matrix.json' (#1002)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* [minimax dev_perf] remove qkv token 256 limitation for ar fusion (#1004)

* [atom-vllm benchmark] refine model case name (#995)

Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* Remove qkv 256 tok limitation

---------

Co-authored-by: junyyang-amd <junyyang@amd.com>
Co-authored-by: root <root@hjbog-srdc-15.amd.com>

* Change AW execution logic from one server one job to one server multi jobs

* Change the content as suggested

* Fix metadata naming after rebase

---------

Co-authored-by: root <root@hjbog-srdc-15.amd.com>
Co-authored-by: Yutao Xu <xytpai@foxmail.com>

* [Feat] Fused qknorm + quant for dpsk v2 model (#963)

* [Feat] Fused qknorm + quant for dpsk v2 model

* [Fix] Localize SGLang MXFP4 projection preservation

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* use ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM to enable non shuffle triton gemm (#1031)

* use ATOM_USE_FP4_TRITON_GEMM to enable non shuffle triton gemm

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

* update env name and add comments

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix(v4): drop redundant cu_seqlens_q refill in attention metadata builder (#1058)

cu_seqlens_q is already populated in ModelRunner as a variable-length
prefix sum over num_scheduled_tokens, with the [scheduled_bs+1:bs+1]
tail padded to the boundary value for cudagraph. The DeepseekV4
attention metadata builder re-filled it with a uniform np.arange sized
scheduled_bs+1, overwriting ModelRunner's correct values. Remove the
redundant fill and copy bs+1 entries so the GPU buffer matches the
range ModelRunner populates.
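The layout described above, a variable-length prefix sum with the tail padded to the boundary value, can be sketched with the standard library. `build_cu_seqlens_q` is a hypothetical helper operating on Python lists; the real code fills preallocated CPU/GPU buffers.

```python
from itertools import accumulate


def build_cu_seqlens_q(num_scheduled_tokens, bs):
    """Prefix sum over scheduled tokens, padded to length bs + 1."""
    scheduled_bs = len(num_scheduled_tokens)
    cu = [0] + list(accumulate(num_scheduled_tokens))
    cu += [cu[-1]] * (bs - scheduled_bs)  # pad the tail with the boundary value
    return cu


cu_seqlens_q = build_cu_seqlens_q([3, 1, 5], bs=5)
uniform = list(range(3 + 1))  # what a uniform arange over scheduled_bs + 1 would give
```

The uniform fill only matches the prefix sum when every sequence schedules exactly one token, which is why overwriting was incorrect for variable-length batches.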

Also split a grouped local import into per-line imports (isort).

* [ATOM-vLLM] Upgrade vLLM version to v0.22.0 (#1006)

upgrade atom-vllm vllm version to 0.22.0

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

* [feat] Add RLHF rollout integration support (verl) (#549)

* [verl] feat: add trust_remote_code arg and compilation_config dict support

* [verl] feat: add logprobs and request_id support across sampling pipeline

* [verl] feat: weight sync, memory lifecycle and DP isolation for verl integration (TP+DP)

* [verl] feat: utility command dispatch and broadcast communication

* [verl] feat: basic integration with verl - load_weights, sleep/wake_up API

* [atom] fix: rope parameters handling, remove CLI trust_remote_code, and minor fixes

* [atom] feat: implement packed weight handling in ModelRunner for FP8 parameters

* [verl] refactor: decouple RLHF rollout logic from inference engine into atom/rollout/

* [verl] feat: extend tokenIDProcessor for logprobs support and enhance ModelRunner with DP isolation handling

* fix: patch NCCL device binding for DP-isolated ModelRunner

* refactor: minimize diff against main by reverting non-functional changes

* refactor: improve code readability by formatting and organizing function parameters and comments across multiple files

* refactor: extract sleep logic from engine_core busy_loop into helper methods

* [verl] refactor: merge logprobs and DP isolation into base ModelRunner, simplify RLHFModelRunner

* refactor: rename sleep state variables and update related logic for RL training in EngineCore and ModelRunner

* fix: restore mark_trace profiler around cudagraph capture

* docs: add veRL + Megatron + ATOM environment setup guide for ROCm

* [verl] feat: add logprobs and request_id support across sampling pipeline

* [verl] refactor: unify load_weights API with auto mode selection

* fix: batch token ID processing in tokenIDProcessor

* fix: use process group size instead of config for DP-isolated mode

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* [rollout, atom] fix: align DP logic with main

* [rollout] fix: remove unnecessary DP config overrides and RLHF APIs from LLMEngine

---------

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

* fix

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* trim decode tensors for moe

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* [atom-vllm recipe] align recipe to nightly script (#1040)

Co-authored-by: perzhang <perzhang@amd.com>

* [sgl-atom][docker]add optional sglang_tag_suffix (#1068)

* add docker prefix

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

* Enable standalone DeepSeek NextN draft model (#964)

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* [Feat] enable dualstream in mtp (#1049)

* [atom-vllm-benchmark] Change matrix cell launches one server for one ISL/OSL pair + all concurrency (#1075)

---------

Co-authored-by: Jun Yan Yang

* [atom-vllm benchmark] recover warmup to concurrency

Co-authored-by: perzhang <perzhang@amd.com>

* Update SGLANG accuracy runner (#1084)

* [plugin][perf] refine pa dispatch for better perf (#1038)

* add pa dispatch for GLM-4.7 and clean code

* refine the dispatch

* fix minimax acc

* revert unnecessary change

* clean code

---------

Co-authored-by: Guanbao Yu <gyu@amd.com>

* fix fused_moe (#1076)

* fix non triton routing expert mask in moe

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fold heads to 8

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* black

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Co-authored-by: Zhu Jiale <69138280+zovonoir@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: zovonoir <zovonoir@users.noreply.github.com>
Co-authored-by: ZhangLirong <lirzhang@amd.com>
Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com>
Co-authored-by: Yutao Xu <xytpai@foxmail.com>
Co-authored-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: junyyang-amd <junyyang@amd.com>
Co-authored-by: root <root@hjbog-srdc-15.amd.com>
Co-authored-by: JiaoliangYu <Jiaoliang.Yu@amd.com>
Co-authored-by: JiaoliangYu <jiaolyu@amd.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: qichu-yun <qichu@amd.com>
Co-authored-by: ningding01 <niding@amd.com>
Co-authored-by: PerryZhang01 <Perry.Zhang@amd.com>
Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: jianhao <Jianhao.Liang@amd.com>
Co-authored-by: jianlian <jianlian@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: Jiayun <jiayyu@amd.com>
Co-authored-by: Wang, Yiting <18916612990@163.com>
Co-authored-by: Zhiwei <yanzhw5@mail3.sysu.edu.cn>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Bongwoo Bak <bongwoobak@gmail.com>
Co-authored-by: junna2016 <xingjunna.xjn@alibaba-inc.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Hexiang Wang <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Sijing Yang <Sijing.Yang@amd.com>
Co-authored-by: Ling Zhang <69022634+ZLkanyo009@users.noreply.github.com>
Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: Wang, Yiting <yitiwang@amd.com>

5 participants