I found that the tokenizer removes newlines ('\n') by default. Is '\n' included in the training corpora?
I was trying to use '\n' to separate multiple samples (few-shot learning), and since I am comparing with other models it is better not to change the prompt. Is it recommended to set the tokenizer's ignore_linebreak=False, so that '\n' is encoded to 20004?
Thank you very much!
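To make the setup concrete, here is a minimal, self-contained sketch of what I mean; the `encode` function below is a toy stand-in (not the actual GLM-130B/icetk tokenizer API, whose flag may be passed differently) that only illustrates how an ignore_linebreak option would change the encoding of a '\n'-separated few-shot prompt, with 20004 used as the newline id mentioned above.

```python
# Toy stand-in, NOT the real GLM-130B tokenizer: it only illustrates how an
# ignore_linebreak flag would affect a '\n'-separated few-shot prompt.
NEWLINE_ID = 20004  # id reported above for '\n'

def encode(text: str, ignore_linebreak: bool = True) -> list[int]:
    """Hash words to fake ids, optionally keeping '\n' as token 20004."""
    ids = []
    for line in text.split("\n"):
        ids.extend(hash(tok) % 10000 for tok in line.split())
        if not ignore_linebreak:
            ids.append(NEWLINE_ID)  # keep an explicit newline token
    if not ignore_linebreak and ids and ids[-1] == NEWLINE_ID:
        ids.pop()  # no separator after the last line
    return ids

# Few-shot samples joined with '\n', as in the comparison prompts.
prompt = "Q: 2+2?\nA: 4\nQ: capital of France?\nA: Paris"

print(encode(prompt))                          # default: '\n' silently dropped
print(encode(prompt, ignore_linebreak=False))  # '\n' kept as token 20004
```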
Thanks for your interest in our work. In our pre-training, the tokenizer does not take '\n' into account. In our MMLU few-shot evaluation, we found that tokenizing with ignore_linebreak=True performs slightly better (by about 1%) than with ignore_linebreak=False, so I still suggest using ignore_linebreak=True for the current version of GLM-130B.
In fact, only the Chinese corpus contains '\n' during training; the English corpus does not. It is rather unfortunate that the code data from the Pile dataset also did not include '\n' during training; hopefully this can be remedied by continued training.