fix: qwen3_8b_hellaswag_pp_peft recipe #1335
Conversation
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
/ok to test dfc655d
- collate_fn: nemo_automodel.components.datasets.utils.default_collater
+ collate_fn:
+   _target_: nemo_automodel.components.datasets.utils.default_collater
+   pad_seq_len_divisible: 320
Hi @ZhiyuLi-Nvidia, for this one can you share a little more context on why it is needed? I thought we needed the padding for fp8, but I'm not sure what I'm missing here.
The root cause is that torch pipeline parallelism (PP) doesn't support variable sequence lengths. This change pads the batches so they all share the same sequence length, avoiding the issue above.
Could you take a look at the logs: https://wandb.ai/nvidia/automodel-dev-zhiyul/runs/0t1oh7bz/logs
torchrun --nproc-per-node=8 nemo_automodel/recipes/llm/train_ft.py --config examples/llm_finetune/qwen/qwen3_8b_hellaswag_pp_peft.yaml
Without the change, the pipeline was compiled against a constant shape and cached [2, 168, 4096] from the training step.
When validation started, the first validation batch happened to be shorter (seq_len=72), giving a shape of [2, 72, 4096]. The pipeline's shape validation detected that 72 != 168 and raised a PipeliningShapeError.
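For reference, here is a minimal sketch of the kind of right-padding that pad_seq_len_divisible enables. The real logic lives in nemo_automodel.components.datasets.utils.default_collater; the helper name pad_to_multiple below is hypothetical and used only for illustration.

import torch

def pad_to_multiple(input_ids: torch.Tensor, multiple: int, pad_id: int = 0) -> torch.Tensor:
    # Right-pad a [batch, seq_len] tensor so seq_len becomes a multiple of `multiple`.
    # With a fixed divisor, training and validation batches land on the same padded
    # lengths, so the compiled pipeline schedule never sees a shape it wasn't traced with.
    seq_len = input_ids.size(1)
    remainder = seq_len % multiple
    if remainder == 0:
        return input_ids
    pad_len = multiple - remainder
    return torch.nn.functional.pad(input_ids, (0, pad_len), value=pad_id)

# Example: both seq_len=72 and seq_len=168 pad up to 320, so the shapes match.
batch = torch.randint(0, 100, (2, 72))
print(pad_to_multiple(batch, multiple=320).shape)  # torch.Size([2, 320])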
/ok to test 453a48f
* update recipe

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

* update

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

---------

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
What does this PR do?
As titled: remove the explicitly set cache_dir and add sequence-length padding (pad_seq_len_divisible).
Changelog
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information