Conversation
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Zhilin123
left a comment
There was a problem hiding this comment.
LGTM, minor code style issues
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
| } | ||
| else: | ||
| self.special_tokens = special_tokens | ||
|
|
There was a problem hiding this comment.
can we do a check to see if the tokens in special_tokens are tokenizer's special tokens or not? If not (the case with llama), can we just throw a warning that we'll use text as turn tokens which might cause incorrect merging
There was a problem hiding this comment.
I have an assert in the code
assert torch.equal(torch.tensor(target[:header_len]), torch.tensor(header_tokens))which will throw an exception if the token merge happens.
There was a problem hiding this comment.
that is different, the token merge can still happen during multi-turn
what I mean is that if the turn tokens are not special tokens, we just say that there might be an error possible
There was a problem hiding this comment.
The header_len stops at the "end_of_turn". The next token is "turn_start". If the merge happens this assert will catch it. The multiple turn has the same thing. each turn ends with "end_of_turn" and the next token is "turn_start". So this one is enough to catch it.
Also I don't see the point of just giving a warning which doesn't help the user at all.
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Yi Dong <yidong@nvidia.com>
* fix dataset issues Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * all passed Signed-off-by: Yi Dong <yidong@nvidia.com> * refactor tests Signed-off-by: Yi Dong <yidong@nvidia.com> * all pass Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * use end name signal for labels Signed-off-by: Yi Dong <yidong@nvidia.com> * all fixed Signed-off-by: Yi Dong <yidong@nvidia.com> * update doc Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure nccl not timing out Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * generate example template Signed-off-by: Yi Dong <yidong@nvidia.com> * generic end of name token Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * add the chat prompt format into the config Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure sft working Signed-off-by: Yi Dong <yidong@nvidia.com> * address reviewer comment Signed-off-by: Yi Dong <yidong@nvidia.com> * fix non Signed-off-by: Yi Dong <yidong@nvidia.com> * try openAI prompt Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * remove human labels from the data Signed-off-by: Yi Dong <yidong@nvidia.com> * use hf dataset to clean Signed-off-by: Yi Dong <yidong@nvidia.com> * reviewer comments Signed-off-by: Yi Dong <yidong@nvidia.com> --------- Signed-off-by: Yi Dong <yidong@nvidia.com> Signed-off-by: Sasha Meister <sasha.meister.work@gmail.com>
* fix dataset issues Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * all passed Signed-off-by: Yi Dong <yidong@nvidia.com> * refactor tests Signed-off-by: Yi Dong <yidong@nvidia.com> * all pass Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * use end name signal for labels Signed-off-by: Yi Dong <yidong@nvidia.com> * all fixed Signed-off-by: Yi Dong <yidong@nvidia.com> * update doc Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure nccl not timing out Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * generate example template Signed-off-by: Yi Dong <yidong@nvidia.com> * generic end of name token Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * add the chat prompt format into the config Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure sft working Signed-off-by: Yi Dong <yidong@nvidia.com> * address reviewer comment Signed-off-by: Yi Dong <yidong@nvidia.com> * fix non Signed-off-by: Yi Dong <yidong@nvidia.com> * try openAI prompt Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * remove human labels from the data Signed-off-by: Yi Dong <yidong@nvidia.com> * use hf dataset to clean Signed-off-by: Yi Dong <yidong@nvidia.com> * reviewer comments Signed-off-by: Yi Dong <yidong@nvidia.com> --------- Signed-off-by: Yi Dong <yidong@nvidia.com> Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
* fix dataset issues Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * all passed Signed-off-by: Yi Dong <yidong@nvidia.com> * refactor tests Signed-off-by: Yi Dong <yidong@nvidia.com> * all pass Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * use end name signal for labels Signed-off-by: Yi Dong <yidong@nvidia.com> * all fixed Signed-off-by: Yi Dong <yidong@nvidia.com> * update doc Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure nccl not timing out Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * generate example template Signed-off-by: Yi Dong <yidong@nvidia.com> * generic end of name token Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * add the chat prompt format into the config Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure sft working Signed-off-by: Yi Dong <yidong@nvidia.com> * address reviewer comment Signed-off-by: Yi Dong <yidong@nvidia.com> * fix non Signed-off-by: Yi Dong <yidong@nvidia.com> * try openAI prompt Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * remove human labels from the data Signed-off-by: Yi Dong <yidong@nvidia.com> * use hf dataset to clean Signed-off-by: Yi Dong <yidong@nvidia.com> * reviewer comments Signed-off-by: Yi Dong <yidong@nvidia.com> --------- Signed-off-by: Yi Dong <yidong@nvidia.com>
* fix dataset issues Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * all passed Signed-off-by: Yi Dong <yidong@nvidia.com> * refactor tests Signed-off-by: Yi Dong <yidong@nvidia.com> * all pass Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * use end name signal for labels Signed-off-by: Yi Dong <yidong@nvidia.com> * all fixed Signed-off-by: Yi Dong <yidong@nvidia.com> * update doc Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure nccl not timing out Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * generate example template Signed-off-by: Yi Dong <yidong@nvidia.com> * generic end of name token Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * add the chat prompt format into the config Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure sft working Signed-off-by: Yi Dong <yidong@nvidia.com> * address reviewer comment Signed-off-by: Yi Dong <yidong@nvidia.com> * fix non Signed-off-by: Yi Dong <yidong@nvidia.com> * try openAI prompt Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * remove human labels from the data Signed-off-by: Yi Dong <yidong@nvidia.com> * use hf dataset to clean Signed-off-by: Yi Dong <yidong@nvidia.com> * reviewer comments Signed-off-by: Yi Dong <yidong@nvidia.com> --------- Signed-off-by: Yi Dong <yidong@nvidia.com>
* fix dataset issues Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * all passed Signed-off-by: Yi Dong <yidong@nvidia.com> * refactor tests Signed-off-by: Yi Dong <yidong@nvidia.com> * all pass Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * use end name signal for labels Signed-off-by: Yi Dong <yidong@nvidia.com> * all fixed Signed-off-by: Yi Dong <yidong@nvidia.com> * update doc Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure nccl not timing out Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * generate example template Signed-off-by: Yi Dong <yidong@nvidia.com> * generic end of name token Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * add the chat prompt format into the config Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure sft working Signed-off-by: Yi Dong <yidong@nvidia.com> * address reviewer comment Signed-off-by: Yi Dong <yidong@nvidia.com> * fix non Signed-off-by: Yi Dong <yidong@nvidia.com> * try openAI prompt Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * remove human labels from the data Signed-off-by: Yi Dong <yidong@nvidia.com> * use hf dataset to clean Signed-off-by: Yi Dong <yidong@nvidia.com> * reviewer comments Signed-off-by: Yi Dong <yidong@nvidia.com> --------- Signed-off-by: Yi Dong <yidong@nvidia.com>
* fix dataset issues Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * all passed Signed-off-by: Yi Dong <yidong@nvidia.com> * refactor tests Signed-off-by: Yi Dong <yidong@nvidia.com> * all pass Signed-off-by: Yi Dong <yidong@nvidia.com> * working version Signed-off-by: Yi Dong <yidong@nvidia.com> * use end name signal for labels Signed-off-by: Yi Dong <yidong@nvidia.com> * all fixed Signed-off-by: Yi Dong <yidong@nvidia.com> * update doc Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure nccl not timing out Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * generate example template Signed-off-by: Yi Dong <yidong@nvidia.com> * generic end of name token Signed-off-by: Yi Dong <yidong@nvidia.com> * style fix Signed-off-by: Yi Dong <yidong@nvidia.com> * add the chat prompt format into the config Signed-off-by: Yi Dong <yidong@nvidia.com> * make sure sft working Signed-off-by: Yi Dong <yidong@nvidia.com> * address reviewer comment Signed-off-by: Yi Dong <yidong@nvidia.com> * fix non Signed-off-by: Yi Dong <yidong@nvidia.com> * try openAI prompt Signed-off-by: Yi Dong <yidong@nvidia.com> * remove unused imports Signed-off-by: Yi Dong <yidong@nvidia.com> * remove human labels from the data Signed-off-by: Yi Dong <yidong@nvidia.com> * use hf dataset to clean Signed-off-by: Yi Dong <yidong@nvidia.com> * reviewer comments Signed-off-by: Yi Dong <yidong@nvidia.com> --------- Signed-off-by: Yi Dong <yidong@nvidia.com>
What does this PR do ?
In this PR, it genialized the chat SFT dataset that it can use customized turn start/end tokens by using chat_prompt_tokens config. e.g.
after this change, the LM is not required to have "extra_id" special tokens any more to use chat SFT dataset. In this PR, also expanded the unit test to cover more LM tokenizers.
Another feature added is to overwrite the prompt_template config with the chat prompt format.