Does GLM-130B support newline (\n)? #17

Closed
LorrinWWW opened this issue Aug 24, 2022 · 3 comments

@LorrinWWW

I found that the tokenizer removes newlines ('\n') by default. Is '\n' included in the training corpora?
I am trying to use '\n' to separate multiple samples (few-shot learning), and since I am comparing with other models it would be better not to change the prompt. Is it recommended to set the tokenizer's ignore_linebreak=False, so that '\n' is encoded to 20004?
Thank you very much!
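For context, here is a minimal sketch of the behaviour being asked about. The function and variable names are illustrative stand-ins, not the actual GLM-130B tokenizer API; only the flag name ignore_linebreak and the id 20004 come from this thread.

```python
# Toy stand-in (not the GLM-130B tokenizer): illustrates the effect of the
# ignore_linebreak flag discussed above. With ignore_linebreak=True the '\n'
# separators are stripped before encoding; with ignore_linebreak=False each
# '\n' is emitted as a dedicated token id (20004, per the question).
NEWLINE_ID = 20004  # id quoted above; treated as an assumption in this sketch

def toy_encode(text: str, ignore_linebreak: bool = True) -> list[int]:
    ids: list[int] = []
    for line in text.split("\n"):
        # placeholder for real subword tokenization of the non-newline text
        ids.extend(hash(tok) % 20000 for tok in line.split())
        ids.append(NEWLINE_ID)
    ids.pop()  # drop the trailing separator
    return [i for i in ids if i != NEWLINE_ID] if ignore_linebreak else ids

prompt = "Translate: cat -> chat\nTranslate: dog ->"
print(NEWLINE_ID in toy_encode(prompt, ignore_linebreak=True))   # False: '\n' dropped
print(NEWLINE_ID in toy_encode(prompt, ignore_linebreak=False))  # True: '\n' kept as 20004
```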

@Xiao9905
Member

@LorrinWWW Hi,

Thanks for your interest in our work. In our pre-training, the tokenizer does not take '\n' into account. In our MMLU few-shot evaluation, we find that tokenizing with ignore_linebreak=True works slightly better (about 1%) than with ignore_linebreak=False. So I still suggest using ignore_linebreak=True for the current version of GLM-130B.

@LorrinWWW
Author

Sure, thanks for the quick reply! I guess I shall keep ignore_linebreak=True.

@Sengxian
Contributor

In fact, only the Chinese corpus contains '\n' during training; the English corpus does not. It is rather unfortunate that the code data in the Pile dataset also did not include '\n' during training; hopefully this can be remedied by continued training.
