
fix: qwen3_8b_hellaswag_pp_peft recipe #1335

Merged
ZhiyuLi-Nvidia merged 5 commits into main from zhiyul/fix_qwen3_8b_hellaswag_pp_peft on Mar 6, 2026

Conversation

@ZhiyuLi-Nvidia
Contributor

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia commented Feb 19, 2026

What does this PR do ?

As titled: remove the explicitly set cache_dir and add sequence-length padding.

Changelog

  • Remove the explicitly set cache_dir from the recipe config.
  • Add pad_seq_len_divisible: 320 to the dataloader collate_fn so sequence lengths are padded to a fixed multiple.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@copy-pr-bot

copy-pr-bot Bot commented Feb 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ZhiyuLi-Nvidia
Contributor Author

/ok to test dfc655d

Before:

collate_fn: nemo_automodel.components.datasets.utils.default_collater

After:

collate_fn:
  _target_: nemo_automodel.components.datasets.utils.default_collater
  pad_seq_len_divisible: 320
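
For context, a minimal sketch of what a collate function with a pad_seq_len_divisible option does: it rounds the batch's maximum length up to the next multiple of that value and pads every example to that length, so all micro-batches share one sequence length. The function name, dict keys, and pad values below are illustrative assumptions, not the actual default_collater implementation.

import torch

def pad_collate(batch, pad_seq_len_divisible=320, pad_token_id=0, label_pad_id=-100):
    # Illustrative only: round the batch's max length up to the nearest
    # multiple of pad_seq_len_divisible, then pad every example to that length.
    max_len = max(len(ex["input_ids"]) for ex in batch)
    target_len = -(-max_len // pad_seq_len_divisible) * pad_seq_len_divisible

    def pad(seq, value):
        return seq + [value] * (target_len - len(seq))

    return {
        "input_ids": torch.tensor([pad(ex["input_ids"], pad_token_id) for ex in batch]),
        "labels": torch.tensor([pad(ex["labels"], label_pad_id) for ex in batch]),
    }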
Contributor


Hi @ZhiyuLi-Nvidia, for this one can you share a little more context on why it is needed? I thought we only needed the padding for fp8, but I'm not sure what I'm missing here.

Contributor Author

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia Mar 3, 2026


The root cause is that torch pipeline parallelism doesn't support variable sequence lengths across micro-batches. This change pads sequences to a fixed multiple to avoid that issue.

Could you take a look at the logs: https://wandb.ai/nvidia/automodel-dev-zhiyul/runs/0t1oh7bz/logs

torchrun --nproc-per-node=8 nemo_automodel/recipes/llm/train_ft.py --config examples/llm_finetune/qwen/qwen3_8b_hellaswag_pp_peft.yaml 

Without the change, the pipeline compilation cached the constant shape [2, 168, 4096] from the training step.
When validation started, the first validation batch happened to be shorter (seq_len=72), producing a shape of [2, 72, 4096]. The pipeline's shape validation detected that 72 != 168 and raised a PipeliningShapeError.
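
For concreteness, with pad_seq_len_divisible: 320 and the usual round-up-to-multiple behavior, both the training batch (seq_len=168) and the first validation batch (seq_len=72) are padded up to length 320, so every micro-batch presents the same [2, 320, 4096] shape to the compiled pipeline and the 72 != 168 mismatch cannot occur.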

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia enabled auto-merge (squash) March 6, 2026 07:39
@ZhiyuLi-Nvidia
Copy link
Copy Markdown
Contributor Author

/ok to test 453a48f

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia merged commit 023373b into main Mar 6, 2026
52 checks passed
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia deleted the zhiyul/fix_qwen3_8b_hellaswag_pp_peft branch March 6, 2026 08:37
SwekeR-463 pushed a commit to SwekeR-463/Automodel that referenced this pull request Mar 11, 2026
* update recipe

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* update

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

---------

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
* update recipe

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* update

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

---------

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
