[Feature] Add off-policy sequence masking algorithm proposed in DeepSeek v3.2 #999
Conversation
slime/utils/arguments.py
Outdated
```python
    help="The rollout routing replay technique from https://arxiv.org/abs/2510.11370",
)
parser.add_argument(
    "--enable-opsm",
```
Please rename this to `--use-xxx` to align the naming with the other flags. I also wonder if we could add an additional `--use-off-policy-sequence-mask`, since the original paper does not mention the "OPSM" abbreviation.

Also, please add the DeepSeek v3.2 paper to the help text.
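A minimal sketch of what the renamed flag could look like; the flag names and help wording here are this comment's suggestion, not the final API, and the paper reference is left as a placeholder rather than inventing a URL:

```python
import argparse

parser = argparse.ArgumentParser()
# Align with the other --use-xxx flags; keep a spelled-out alias since
# the paper never uses the "OPSM" abbreviation (both names hypothetical).
parser.add_argument(
    "--use-opsm",
    "--use-off-policy-sequence-mask",
    action="store_true",
    help="Off-policy sequence masking from the DeepSeek v3.2 paper "
    "(paper reference to be added here).",
)

# Either spelling sets the same destination, args.use_opsm.
args = parser.parse_args(["--use-off-policy-sequence-mask"])
print(args.use_opsm)  # prints True
```

With multiple option strings, argparse derives the destination (`use_opsm`) from the first long option, so the long alias needs no extra plumbing.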
slime/backends/fsdp_utils/actor.py
Outdated
```python
opsm_mask = None
opsm_clipfrac_num = 0
if getattr(self.args, "enable_opsm", False):
```
```diff
-if getattr(self.args, "enable_opsm", False):
+if self.args.use_opsm:
```
slime/backends/fsdp_utils/actor.py
Outdated
```python
ppo_kl = old_log_probs - log_probs

opsm_mask = None
opsm_clipfrac_num = 0
```
Please move these two values inside the `if getattr(self.args, "enable_opsm", False):` block, and always use `enable_opsm` as the flag to check whether OPSM is enabled.
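A sketch of the suggested shape, with the OPSM state created only inside the flag-gated branch so the flag is the single source of truth. The helper name, the mask condition, and `args.opsm_delta` usage are illustrative stand-ins, not slime's actual code:

```python
import torch


def maybe_compute_opsm(args, log_probs, old_log_probs):
    """Return (opsm_mask, opsm_clipfrac_num); None/0 when OPSM is disabled."""
    opsm_mask = None
    opsm_clipfrac_num = 0
    if getattr(args, "enable_opsm", False):  # the only gate for OPSM
        ppo_kl = old_log_probs - log_probs
        # Hypothetical per-token mask: drop tokens whose KL exceeds the threshold.
        opsm_mask = (ppo_kl <= args.opsm_delta).float()
        # Count masked tokens on-device; no .item() here.
        opsm_clipfrac_num = (opsm_mask == 0).sum()
    return opsm_mask, opsm_clipfrac_num
```

Downstream code can then branch on `args.enable_opsm` directly rather than on `opsm_mask is not None`.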
```python
]
# Pre-gather log probs if needed by OPSM or GSPO to avoid duplicate gathering
cp_size = mpu.get_context_parallel_world_size()
need_full_log_probs = (args.enable_opsm or args.advantage_estimator == "gspo") and cp_size > 1
```
```diff
-need_full_log_probs = (args.enable_opsm or args.advantage_estimator == "gspo") and cp_size > 1
+need_full_log_probs = (args.enable_opsm or args.advantage_estimator == "gspo")
```

It seems the code below already handles `cp_size == 1`.
```python
    ]
    if cp_size > 1
    else log_probs
)
```
There seems to be an early return in `all_gather_with_cp`, so we can always call `all_gather_with_cp` here.
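A sketch of the pattern being suggested, assuming `all_gather_with_cp` short-circuits when there is no context parallelism; this simplified helper (and its `gather_fn` parameter) is a stand-in for slime's actual implementation:

```python
import torch


def all_gather_with_cp(tensor, cp_size, gather_fn=None):
    # Early return: with no context parallelism the local tensor already
    # is the full tensor, so callers never need their own cp_size check.
    if cp_size == 1:
        return tensor
    # gather_fn stands in for the real all-gather across the
    # context-parallel group (hypothetical signature).
    return gather_fn(tensor)


# Callers can then drop the `... if cp_size > 1 else log_probs` branching:
log_probs = torch.tensor([0.1, 0.2])
full_log_probs = all_gather_with_cp(log_probs, cp_size=1)
```

Pushing the `cp_size == 1` case into the helper keeps every call site a single unconditional line.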
```python
ppo_kl = torch.cat(ppo_kl, dim=0)
# Compute OPSM mask if enabled
opsm_mask = None
opsm_clipfrac_num = 0
```
Same as the comment on the FSDP backend: please move these inside the `args.enable_opsm` branch.
```python
old_log_probs = torch.cat(old_log_probs, dim=0)
log_probs = torch.cat(log_probs, dim=0)
ppo_kl = old_log_probs - log_probs
```
When running with OPSM but without GSPO, will the shape of `ppo_kl` differ from that of `opsm_mask`?
slime/utils/ppo_utils.py
Outdated
```python
# Calculate sequence-level advantage (mean of advantage values)
# For GRPO, advantage is constant across the sequence, so mean == any element
seq_advantage = advantage.mean()
```
Hmm... from the DeepSeek v3.2 paper, it seems we don't need a per-sequence advantage here but a per-token advantage.
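A hedged sketch of the per-token variant this comment is suggesting. The masking condition mirrors the sequence-level diff below (`adv < 0` and `KL > delta`), applied elementwise instead of once per sequence; whether token-level is actually what the paper intends is exactly the open question here:

```python
import torch


def opsm_mask_per_token(advantage, ppo_kl, delta):
    """Mask individual tokens (not whole sequences) that are both
    negatively advantaged and too far off-policy: adv < 0 and KL > delta."""
    condition = (advantage < 0) & (ppo_kl > delta)
    mask = torch.where(condition, torch.zeros_like(ppo_kl), torch.ones_like(ppo_kl))
    # Count masked tokens on-device; defer any .item() to logging time.
    clipfrac_num = condition.sum()
    return mask, clipfrac_num
```

For GRPO the advantage is constant across a sequence, so the two variants differ only through the per-token KL term.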
slime/utils/ppo_utils.py
Outdated
```python
# This mask applies to the entire sequence
condition = (seq_advantage < 0) & (seq_kl > args.opsm_delta)
mask_chunk = torch.where(condition, torch.zeros_like(local_log_prob), torch.ones_like(local_log_prob))
opsm_clipfrac_num += condition.int().item()
```
Please don't call `.item()` inside the loop; it triggers a GPU-to-CPU copy and a synchronization on every iteration.
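A sketch of the suggested fix: accumulate the count as a tensor inside the loop, then make the single `.item()` call (the only forced device-to-host sync) after the loop finishes. The helper name and per-chunk condition are illustrative:

```python
import torch


def count_masked(kl_chunks, delta):
    # Accumulate on-device; each `+=` enqueues async GPU work.
    # In real code, allocate this on the same device as the inputs.
    clipfrac_num = torch.zeros((), dtype=torch.long)
    for seq_kl in kl_chunks:
        clipfrac_num += (seq_kl > delta).sum()
    # Exactly one synchronization point, after the loop.
    return clipfrac_num.item()
```

Calling `.item()` per iteration would serialize the host against the GPU once per chunk; deferring it keeps the loop fully asynchronous.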