Enable EOS and BOS tokens? #545
Very good point. I think it depends on the model and tokenizer used. E.g., GPT models don't use BOS or EOS tokens (only `<|endoftext|>` when concatenating multiple texts). But I also agree with you that there should definitely be special tokens enabled, depending on the model. What we probably need to do here is just extract this info from the `tokenizer.json` files (see the screenshot below from the Llama 2 one, which uses a BOS but not an EOS token). But then, I think it may already be taken care of by the … Any insights @carmocca?
Thanks for the quick reply! That's important surrounding code & config context I missed. I would be curious to hear @carmocca's thoughts too, because I think the tokenizer loads that config but doesn't use it to decide which special tokens to apply. Maybe around here: https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/lit_gpt/tokenizer.py#L33

There are some lines like … and you could use those for including these special tokens in the encoding function. We could still enable some sort of override in the …

One issue with this: I took a look at a Falcon config, and it looks like the info is stored slightly differently :-) https://huggingface.co/tiiuae/falcon-180B/blob/main/tokenizer_config.json
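A minimal sketch of the extraction idea discussed above: read the special-token flags from a checkpoint's tokenizer config. The `add_bos_token`/`add_eos_token` keys match the Llama 2 layout; the fallback for checkpoints that lay the info out differently (e.g. Falcon) is an assumption, not lit-gpt's actual logic.

```python
import json

def special_token_usage(config_text: str) -> tuple[bool, bool]:
    """Return (use_bos, use_eos) inferred from a tokenizer config JSON string."""
    config = json.loads(config_text)
    if "add_bos_token" in config or "add_eos_token" in config:
        # Llama 2-style explicit flags.
        return (bool(config.get("add_bos_token", False)),
                bool(config.get("add_eos_token", False)))
    # Fallback: without explicit flags, keep today's default (no BOS/EOS)
    # rather than guess from the declared token names.
    return False, False

# Llama 2-style config: BOS on, EOS off.
print(special_token_usage('{"add_bos_token": true, "add_eos_token": false}'))  # (True, False)
# Falcon-style config without the flags: falls back to current behavior.
print(special_token_usage('{"eos_token": "<|endoftext|>"}'))  # (False, False)
```

This also matches the default suggested later in the thread: when the metadata doesn't say, behave as lit-gpt does today.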
Finetuning only uses the … But you are right that we don't use BOS by default, whereas HuggingFace does:

```python
from pathlib import Path

from transformers import LlamaTokenizer

from lit_gpt.tokenizer import Tokenizer

prompt = "Hello friends!"
ours = Tokenizer(Path("checkpoints/openlm-research/open_llama_7b"))
print(ours.encode(prompt))  # [16644, 2439, 31905]
theirs = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
print(theirs.encode(prompt))  # [1, 16644, 2439, 31905]
```

And yes, this might need special consideration for the different checkpoints. GPT-NeoX-based models don't seem to use BOS (Falcon, Pythia).
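A minimal sketch (not lit-gpt's actual `encode`) of making BOS/EOS configurable, reproducing the id difference shown above. `BOS_ID`/`EOS_ID` are assumed Llama-style ids (1 and 2); a real implementation would read them from the tokenizer config instead of hard-coding them.

```python
BOS_ID, EOS_ID = 1, 2  # assumed Llama-style special-token ids

def with_special_tokens(token_ids: list[int], bos: bool = False, eos: bool = False) -> list[int]:
    """Optionally wrap already-encoded ids with BOS/EOS ids."""
    out = list(token_ids)
    if bos:
        out.insert(0, BOS_ID)  # prepend beginning-of-sequence id
    if eos:
        out.append(EOS_ID)     # append end-of-sequence id
    return out

print(with_special_tokens([16644, 2439, 31905], bos=True))
# [1, 16644, 2439, 31905] -- matches the LlamaTokenizer output above
```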
@carmocca is this a feature you think is worth adding to lit-gpt? Glad to volunteer / submit a PR that makes this special consideration for the different model types and defaults to not including the BOS (and EOS) token.
Absolutely. However, one detail:

Why would this be? This should be chosen based on the model's tokenizer configuration.
Thanks! Ah, that wasn't clear on my part. What I meant was: if for some reason I can't determine whether or not to use the BOS token (e.g., it's not evident in the metadata), I will make the default match the behavior there is now. Apologies if that's still confusing; I hope it will make more sense when I send a PR your way!
Please correct me if I'm wrong, but shouldn't there be a BOS and EOS token enabled for these kinds of tokenization?
https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L116
My thinking is that during inference, since the fine-tuned LLM doesn't know that answering a question ends with an EOS, the outputs might go on for too long.
This can be resolved by setting `eos` and `bos` to true, or by making them arguments of these functions. If people agree, I'm happy to open a PR to amend this for the various tokenization scripts.
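A toy illustration of the concern above: a generation loop stops only when the model emits an EOS id, so a model finetuned on samples that never end with EOS may run to the token cap. The three-token "model" below is hypothetical, standing in for a real sampler.

```python
EOS_ID = 2  # assumed end-of-sequence id

def generate(next_token_fn, max_new_tokens: int = 10) -> list[int]:
    """Sample tokens until EOS is produced or the cap is reached."""
    out = []
    for _ in range(max_new_tokens):
        tok = next_token_fn(out)
        if tok == EOS_ID:  # stop only if the model ever emits EOS
            break
        out.append(tok)
    return out

# A "model" that learned to emit EOS after its answer stops early...
print(generate(lambda ctx: [5, 7, EOS_ID][len(ctx)] if len(ctx) < 3 else 9))  # [5, 7]
# ...one that never saw EOS during finetuning rambles to the cap.
print(generate(lambda ctx: 9))  # [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
```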