Does GLM-130B support newline (\n)? #17

Closed
LorrinWWW opened this issue Aug 24, 2022 · 3 comments

@LorrinWWW

I found that the tokenizer removes newlines ('\n') by default. Is '\n' included in the training corpora?
I am trying to use '\n' to separate multiple samples (few-shot learning), and since I am comparing with other models it would be better not to change the prompt. Is it recommended to set the tokenizer's ignore_linebreak=False, so that '\n' is encoded to 20004?
Thank you very much!
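For context, here is a minimal sketch of the behaviour being asked about. The function and variable names are illustrative stand-ins, not the actual GLM-130B tokenizer API; only the flag name ignore_linebreak and the id 20004 come from this thread.

```python
# Toy stand-in (not the GLM-130B tokenizer): illustrates the effect of the
# ignore_linebreak flag discussed above. With ignore_linebreak=True the '\n'
# separators are stripped before encoding; with ignore_linebreak=False each
# '\n' is emitted as a dedicated token id (20004, per the question).
NEWLINE_ID = 20004  # id quoted above; treated as an assumption in this sketch

def toy_encode(text: str, ignore_linebreak: bool = True) -> list[int]:
    ids: list[int] = []
    for line in text.split("\n"):
        # placeholder for real subword tokenization of the non-newline text
        ids.extend(hash(tok) % 20000 for tok in line.split())
        ids.append(NEWLINE_ID)
    ids.pop()  # drop the trailing separator
    return [i for i in ids if i != NEWLINE_ID] if ignore_linebreak else ids

prompt = "Translate: cat -> chat\nTranslate: dog ->"
print(NEWLINE_ID in toy_encode(prompt, ignore_linebreak=True))   # False: '\n' dropped
print(NEWLINE_ID in toy_encode(prompt, ignore_linebreak=False))  # True: '\n' kept as 20004
```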

@Xiao9905
Member

@LorrinWWW Hi,

Thanks for your interest in our work. In our pre-training, the tokenizer does not take '\n' into account. In our MMLU few-shot evaluation, we find that tokenizing with ignore_linebreak=True works slightly better (about 1%) than with ignore_linebreak=False. So I still suggest using ignore_linebreak=True for the current version of GLM-130B.

@LorrinWWW
Author

Sure, thanks for the quick reply! I guess I shall keep ignore_linebreak=True.

@Sengxian
Contributor

In fact, only the Chinese corpus contains '\n' during training; the English corpus does not. It is rather unfortunate that the code data in the Pile dataset also did not include '\n' during training; hopefully this can be remedied by continued training.
