Enable EOS and BOS tokens? #545
Very good point. I think it depends on the model and tokenizer used. E.g., GPT models don't use BOS or EOS tokens (only `<|endoftext|>` when concatenating multiple texts). But I also agree with you that there should definitely be special tokens enabled, depending on the model. What we probably need to do here is just extract this info from the `tokenizer.json` files (see the screenshot below from the Llama 2 one, which uses a BOS but not an EOS token). But then, I think it may already be taken care of by the … Any insights @carmocca?
Thanks for the quick reply! That's important surrounding code & config context I missed. I would be curious to hear @carmocca's thoughts too, because I think the tokenizer loads that config but doesn't use it to decide which special tokens to apply. Maybe around here: https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/lit_gpt/tokenizer.py#L33

There are some lines like … and you could use those for including these special tokens in the encoding function. We could still enable some sort of override in the …

One issue with this: I took a look at a Falcon config, and it looks like the info is stored slightly differently :-) https://huggingface.co/tiiuae/falcon-180B/blob/main/tokenizer_config.json
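A minimal sketch of the extraction idea discussed above: read the special-token flags from a checkpoint's tokenizer config. The `add_bos_token`/`add_eos_token` keys match the Llama 2 layout; the fallback for checkpoints that lay the info out differently (e.g. Falcon) is an assumption, not lit-gpt's actual logic.

```python
import json

def special_token_usage(config_text: str) -> tuple[bool, bool]:
    """Return (use_bos, use_eos) inferred from a tokenizer config JSON string."""
    config = json.loads(config_text)
    if "add_bos_token" in config or "add_eos_token" in config:
        # Llama 2-style explicit flags.
        return (bool(config.get("add_bos_token", False)),
                bool(config.get("add_eos_token", False)))
    # Fallback: without explicit flags, keep today's default (no BOS/EOS)
    # rather than guess from the declared token names.
    return False, False

# Llama 2-style config: BOS on, EOS off.
print(special_token_usage('{"add_bos_token": true, "add_eos_token": false}'))  # (True, False)
# Falcon-style config without the flags: falls back to current behavior.
print(special_token_usage('{"eos_token": "<|endoftext|>"}'))  # (False, False)
```

This also matches the default suggested later in the thread: when the metadata doesn't say, behave as lit-gpt does today.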
Finetuning only uses the … But you are right that we don't use BOS by default, whereas HuggingFace does:

```python
from pathlib import Path

from transformers import LlamaTokenizer

from lit_gpt.tokenizer import Tokenizer

prompt = "Hello friends!"
ours = Tokenizer(Path("checkpoints/openlm-research/open_llama_7b"))
print(ours.encode(prompt))  # [16644, 2439, 31905]
theirs = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
print(theirs.encode(prompt))  # [1, 16644, 2439, 31905]
```

And yes, this might need special consideration for the different checkpoints. GPT-NeoX-based models don't seem to use BOS (Falcon, Pythia).
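A minimal sketch (not lit-gpt's actual `encode`) of making BOS/EOS configurable, reproducing the id difference shown above. `BOS_ID`/`EOS_ID` are assumed Llama-style ids (1 and 2); a real implementation would read them from the tokenizer config instead of hard-coding them.

```python
BOS_ID, EOS_ID = 1, 2  # assumed Llama-style special-token ids

def with_special_tokens(token_ids: list[int], bos: bool = False, eos: bool = False) -> list[int]:
    """Optionally wrap already-encoded ids with BOS/EOS ids."""
    out = list(token_ids)
    if bos:
        out.insert(0, BOS_ID)  # prepend beginning-of-sequence id
    if eos:
        out.append(EOS_ID)     # append end-of-sequence id
    return out

print(with_special_tokens([16644, 2439, 31905], bos=True))
# [1, 16644, 2439, 31905] -- matches the LlamaTokenizer output above
```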
@carmocca is this a feature you think is worth adding to lit-gpt? Glad to volunteer / submit a PR that makes this special consideration for the different model types and defaults to not including the BOS (and EOS) token.
Absolutely. However, one detail:

Why would this be? This should be chosen based on the model's tokenizer configuration.
Thanks! Ah, that wasn't clear on my part. What I meant was: if for some reason I can't determine whether or not to use the BOS token (e.g., it's not evident in the metadata), I will make the default match the behavior there is now. Apologies if that's still confusing; I hope it will make more sense when I send a PR your way!
Please correct me if I'm wrong, but shouldn't there be a BOS and EOS token enabled for these kinds of tokenization?
https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L116
My thinking is that during inference, since the fine-tuned LLM doesn't know that answering a question ends with an EOS, the outputs might go on for too long.
This can be resolved by setting `eos` and `bos` to true, or by making them arguments of these functions. If people agree, I'm happy to open a PR to amend this for the various tokenization scripts.
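A toy illustration of the concern above: a generation loop stops only when the model emits an EOS id, so a model finetuned on samples that never end with EOS may run to the token cap. The three-token "model" below is hypothetical, standing in for a real sampler.

```python
EOS_ID = 2  # assumed end-of-sequence id

def generate(next_token_fn, max_new_tokens: int = 10) -> list[int]:
    """Sample tokens until EOS is produced or the cap is reached."""
    out = []
    for _ in range(max_new_tokens):
        tok = next_token_fn(out)
        if tok == EOS_ID:  # stop only if the model ever emits EOS
            break
        out.append(tok)
    return out

# A "model" that learned to emit EOS after its answer stops early...
print(generate(lambda ctx: [5, 7, EOS_ID][len(ctx)] if len(ctx) < 3 else 9))  # [5, 7]
# ...one that never saw EOS during finetuning rambles to the cap.
print(generate(lambda ctx: 9))  # [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
```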