
Enable EOS and BOS tokens? #545

Closed
nkasmanoff opened this issue Sep 13, 2023 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@nkasmanoff
Contributor

Please correct me if I'm wrong, but shouldn't there be a BOS and EOS token enabled for these kinds of tokenization?

https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L116

My thinking is that, during inference, since the fine-tuned LLM was never shown that finishing an answer ends with an EOS token, the outputs might run on for too long.

This could be resolved by setting eos & bos to true, or by making them arguments of these functions. If people agree with me, I'm happy to open a PR to amend the various tokenization scripts. (A rough sketch of the second option is below.)
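For illustration only, here's what that could look like; the helper name and the encode parameters are assumptions, not the actual prepare_dolly.py code:

# Hypothetical sketch: expose BOS/EOS handling as arguments of the data-prep
# helper instead of hard-coding it at the tokenizer.encode() call site.
def prepare_sample(example: dict, tokenizer, max_length: int,
                   bos: bool = True, eos: bool = True) -> dict:
    full_prompt_and_response = example["prompt"] + example["output"]
    encoded = tokenizer.encode(
        full_prompt_and_response, bos=bos, eos=eos, max_length=max_length
    )
    return {**example, "input_ids": encoded}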

@rasbt
Collaborator

rasbt commented Sep 13, 2023

Very good point.

I think it depends on the model and tokenizer used. E.g. GPT models don't use BOS or EOS tokens (only <|endoftext|> when concatenating multiple texts).
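For instance, with the GPT-2 BPE (illustrative only, checked via tiktoken rather than anything in lit-gpt):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("Hello friends!"))  # plain BPE ids, nothing prepended or appended
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]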

But I also agree with you: these tokens should definitely be enabled where appropriate, depending on the model.

What we need to do here is probably just extract this info from the tokenizer.json files (see screenshot below from the Llama 2 one, which uses a BOS but not EOS token).

But then, I think it's already being taken care of by the Tokenizer class: https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L51C1-L51C1

Any insights @carmocca ?

[Screenshot: Llama 2 tokenizer config showing its BOS/EOS token settings]
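A minimal sketch of that extraction, assuming Llama-style keys in the checkpoint's tokenizer_config.json (the path and keys are assumptions; other model families store this differently, as noted below):

import json
from pathlib import Path

def special_token_usage(checkpoint_dir: Path) -> tuple[bool, bool]:
    # Llama-style configs expose "add_bos_token"/"add_eos_token"; fall back to
    # False when a checkpoint's config doesn't carry these keys.
    config = json.loads((checkpoint_dir / "tokenizer_config.json").read_text())
    return bool(config.get("add_bos_token", False)), bool(config.get("add_eos_token", False))

# e.g. special_token_usage(Path("checkpoints/meta-llama/Llama-2-7b-hf")) -> (True, False)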

@nkasmanoff
Contributor Author

nkasmanoff commented Sep 13, 2023

Thanks for the quick reply! That's important surrounding code and config context I missed. I'd be curious to hear @carmocca 's thoughts too, because I think the tokenizer loads that config but doesn't use it to decide which special tokens to include or leave out.

Maybe around here (https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/lit_gpt/tokenizer.py#L33)

There are some lines like:
self.bos = config.get("add_bos_token")

and then use that flag when deciding whether to include these special tokens in the encoding function. We could still allow some sort of override in the encode function too.

One wrinkle: I took a look at a Falcon config, and it looks like this info is stored slightly differently :-)
https://huggingface.co/tiiuae/falcon-180B/blob/main/tokenizer_config.json
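To make the idea concrete, a hypothetical sketch (not the actual lit_gpt.tokenizer.Tokenizer) of how the config flag could become the default in encode while remaining overridable:

class TokenizerSketch:
    def __init__(self, config: dict, processor, bos_id: int, eos_id: int) -> None:
        self.processor = processor  # the underlying SentencePiece/HF tokenizer
        self.bos_id = bos_id
        self.eos_id = eos_id
        # Llama-style key; Falcon stores this info differently, so a
        # per-family fallback would still be needed.
        self.use_bos = bool(config.get("add_bos_token", False))

    def encode(self, string: str, bos: bool | None = None, eos: bool = False) -> list[int]:
        tokens = self.processor.encode(string)
        if bos is None:
            bos = self.use_bos  # default comes from the tokenizer config
        if bos:
            tokens = [self.bos_id] + tokens
        if eos:
            tokens = tokens + [self.eos_id]
        return tokens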

@carmocca
Contributor

carmocca commented Sep 13, 2023

Finetuning only uses the input_ids field (https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/finetune/lora.py#L283) which does set eos=True (https://github.com/Lightning-AI/lit-gpt/blob/91014a7783d83c7dcb745d384a01bedaa7a8799e/scripts/prepare_dolly.py#L117)

But you are right that we don't use bos by default, whereas HuggingFace does:

from pathlib import Path
from transformers import LlamaTokenizer
from lit_gpt.tokenizer import Tokenizer

prompt = "Hello friends!"

ours = Tokenizer(Path("checkpoints/openlm-research/open_llama_7b"))
print(ours.encode(prompt))  # [16644, 2439, 31905]


theirs = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
print(theirs.encode(prompt))  # [1, 16644, 2439, 31905]

And yes, this might need special consideration for the different checkpoints. GPT-NeoX-based models don't seem to use BOS (Falcon, Pythia).
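A quick way to check that default (the model name here is just an example):

from transformers import AutoTokenizer

# GPT-NeoX-style tokenizers (Pythia, Falcon) don't prepend a BOS id by default,
# unlike LlamaTokenizer above.
neox = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
print(neox.encode("Hello friends!"))  # no leading BOS token id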

@carmocca carmocca added the enhancement New feature or request label Sep 13, 2023
@nkasmanoff
Contributor Author

@carmocca is this a feature you think is worth adding to lit-gpt? Glad to volunteer/submit a PR that makes this special consideration for the different model types and defaults to not including the BOS (& EOS) token.

@carmocca
Contributor

Absolutely. However, one detail:

defaults to not including the BOS (& EOS) token.

Why would this be? This would be chosen based on the model's tokenizer configuration.

@nkasmanoff
Contributor Author

Thanks! And ah, I wasn't clear there: what I meant was that if for some reason I can't determine whether to use the BOS token (e.g. it's not evident in the metadata), I'll make the default keep the same behavior as there is now.

Apologies if that's still confusing; I hope it will make more sense when I send a PR your way!
