fix: qwen3_8b_hellaswag_pp_peft recipe #1335
Conversation
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
/ok to test dfc655d
- collate_fn: nemo_automodel.components.datasets.utils.default_collater
+ collate_fn:
+   _target_: nemo_automodel.components.datasets.utils.default_collater
+   pad_seq_len_divisible: 320
Hi @ZhiyuLi-Nvidia, for this one can you share a little more context on why it is needed? I thought we needed the padding for fp8, but I'm not sure what I'm missing here.
The root cause is that torch pipeline parallelism (PP) doesn't support variable sequence lengths. This change pads the batches so they all share the same sequence length, avoiding the issue above.
Could you take a look at the logs: https://wandb.ai/nvidia/automodel-dev-zhiyul/runs/0t1oh7bz/logs
torchrun --nproc-per-node=8 nemo_automodel/recipes/llm/train_ft.py --config examples/llm_finetune/qwen/qwen3_8b_hellaswag_pp_peft.yaml
Without the change, the pipeline was compiled against a constant shape and cached [2, 168, 4096] from the training step.
When validation started, the first validation batch happened to be shorter (seq_len=72), giving a shape of [2, 72, 4096]. The pipeline's shape validation detected that 72 != 168 and raised a PipeliningShapeError.
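For reference, here is a minimal sketch of the kind of right-padding that pad_seq_len_divisible enables. The real logic lives in nemo_automodel.components.datasets.utils.default_collater; the helper name pad_to_multiple below is hypothetical and used only for illustration.

import torch

def pad_to_multiple(input_ids: torch.Tensor, multiple: int, pad_id: int = 0) -> torch.Tensor:
    # Right-pad a [batch, seq_len] tensor so seq_len becomes a multiple of `multiple`.
    # With a fixed divisor, training and validation batches land on the same padded
    # lengths, so the compiled pipeline schedule never sees a shape it wasn't traced with.
    seq_len = input_ids.size(1)
    remainder = seq_len % multiple
    if remainder == 0:
        return input_ids
    pad_len = multiple - remainder
    return torch.nn.functional.pad(input_ids, (0, pad_len), value=pad_id)

# Example: both seq_len=72 and seq_len=168 pad up to 320, so the shapes match.
batch = torch.randint(0, 100, (2, 72))
print(pad_to_multiple(batch, multiple=320).shape)  # torch.Size([2, 320])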
/ok to test 453a48f
* update recipe

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* update

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

---------

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
What does this PR do?
As titled: remove the explicitly set cache_dir and add sequence-length padding (pad_seq_len_divisible).
Changelog
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information