GPTSFTChatDataset loss_mask becomes all False when prompt length > max_seq_length #8025
Comments
I ran into the same issue and hacked the code to zero out the loss on problematic micro-batches: odelalleau@b383e6a (obviously not a proper fix, but it can be useful to get unblocked).
Yeah, this will also fix the problem.
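For reference, a minimal sketch of the kind of guard described in the comment above (not the linked commit itself); the tensor names and shapes are assumptions:

```python
import torch

def masked_token_loss(per_token_loss: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Average the per-token loss over masked-in tokens, zeroing out
    micro-batches where no token contributes to the loss.

    per_token_loss: [batch, seq] unreduced loss values
    loss_mask:      [batch, seq] bool/float mask, True where the loss applies
    """
    loss_mask = loss_mask.float()
    num_valid = loss_mask.sum()
    if num_valid == 0:
        # Every token is masked out (e.g. the first turn alone exceeded
        # max_seq_length), so return a zero loss instead of 0 / 0 = nan.
        # Multiplying by 0.0 keeps the graph connected, so gradients are
        # well-defined (all zeros) rather than nan.
        return per_token_loss.sum() * 0.0
    return (per_token_loss * loss_mask).sum() / num_valid
```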
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue should stay active, since the same problem has been encountered by others as well.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
In the GPTSFTChatDataset, if the first prompt's length exceeds max_seq_length, all following turns are truncated out and the loss_mask becomes all False for the example.
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py#L359
This is problematic because, when the loss_mask is all False, the loss of the MegatronGPTModel becomes nan.
https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L1015
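To make the failure mode concrete, here is a minimal sketch in plain PyTorch (not the NeMo code itself) of how an all-False loss_mask produces a nan loss when the masked average divides by the mask sum:

```python
import torch

per_token_loss = torch.rand(1, 8)  # unreduced cross-entropy, shape [batch, seq]
loss_mask = torch.zeros(1, 8)      # all tokens masked out (all False)

# Masked-average pattern used by such loss functions:
loss = (per_token_loss * loss_mask).sum() / loss_mask.sum()
print(loss)  # tensor(nan) -- 0 / 0
```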
Steps/Code to reproduce bug
This can be reproduced by passing into GPTSFTChatDataset an example whose first-turn prompt has more than 2048 tokens when max_seq_length=2048. Then use the GPTSFTChatDataset in a supervised fine-tuning job (e.g., train_gpt_sft.py in NeMo-Aligner).
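A rough sketch of generating such an example; the JSONL field names below are assumptions about the conversational format GPTSFTChatDataset consumes and may need to be adapted:

```python
import json

# Build a single-turn conversation whose user prompt alone is far longer than
# max_seq_length=2048 tokens. NOTE: the field names ("system", "mask",
# "conversations", "from", "value") are assumptions about the chat JSONL
# layout and may differ from what GPTSFTChatDataset actually expects.
long_prompt = "word " * 5000  # well over 2048 tokens for any common tokenizer

example = {
    "system": "",
    "mask": "User",
    "conversations": [
        {"from": "User", "value": long_prompt},
        {"from": "Assistant", "value": "short answer"},
    ],
}

with open("overlong_first_turn.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

# Point the SFT / NeMo-Aligner training config's dataset file at this JSONL;
# every turn of this example ends up truncated, leaving loss_mask all False.
```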
Expected behavior
In the collate_fn function, check that the loss_mask of every example contains at least one True entry; if any does not, raise an error.
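A minimal sketch of such a check, assuming collate_fn receives a list of per-example dicts containing a loss_mask tensor (the names here are assumptions, not the actual NeMo signature):

```python
import torch

def validate_loss_masks(batch: list[dict]) -> None:
    """Raise a descriptive error if any example contributes no loss tokens.

    Intended to be called at the top of collate_fn, before padding/stacking.
    `batch` is assumed to be a list of per-example dicts with a 1-D
    `loss_mask` tensor (True/1 where the loss is computed).
    """
    for i, example in enumerate(batch):
        loss_mask = torch.as_tensor(example["loss_mask"])
        if not loss_mask.bool().any():
            raise ValueError(
                f"Example {i} has an all-False loss_mask, likely because its "
                f"first prompt exceeds max_seq_length and every response turn "
                f"was truncated away. Drop or shorten this example, otherwise "
                f"the training loss will be nan."
            )
```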