
Remove duplicated bos_token for CodeLlama #1566

Merged: 1 commit merged into Lightning-AI:main on Jul 10, 2024
Conversation

alealv (Contributor) commented on Jul 9, 2024

Fixes #1565

rasbt (Collaborator) commented on Jul 9, 2024

I just saw your issue and was about to suggest a PR. Thanks for already doing that! This is really appreciated!

The update looks good to me, but I am tagging @Andrei-Aksionov for a second pair of eyes (I think he saw something similar with other models).

Andrei-Aksionov (Collaborator) commented on Jul 10, 2024

Hello @alealv

Thanks for the PR.

I believe that prompt template was chosen because it's the one used in HF Transformers, and I assume it was provided by the author of the model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
messages = [
    {"role": "user", "content": "List all presidents of the USA."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# '<s>[INST] List all presidents of the USA. [/INST]'
```

That said, I agree that it's odd to hard-code the BOS token in the prompt when add_bos_token is set to True in tokenizer_config.json.
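
For illustration, here's a minimal sketch of how the duplication arises when a hard-coded `<s>` in the prompt meets `add_bos_token=True` (assuming the same tokenizer as above; the concrete ids are what I'd expect for CodeLlama, where BOS has id 1):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

# The prompt template hard-codes the BOS token as a literal "<s>" ...
prompt = "<s>[INST] List all presidents of the USA. [/INST]"

# ... and because add_bos_token=True, encode() prepends another BOS,
# so the BOS id appears twice at the start of the sequence.
ids = tokenizer.encode(prompt)
print(ids[:2])                                      # [1, 1]
print(ids[0] == ids[1] == tokenizer.bos_token_id)   # True
```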

Nevertheless, this issue uncovered two other problems:

  1. .encode should handle such cases.
     Update: the CodeLlama tokenizer returns a BOS token no matter what (before the if-else statement where we add BOS manually), so the check is never applied.
  2. HF AutoTokenizer doesn't return a doubled BOS token, so a test should have failed, but it didn't.
     Update: everything is correct; it's just that tokenizer.apply_chat_template performs an additional check for a duplicated BOS. Our code's output is aligned with AutoTokenizer (see the sketch after this list).
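
As a quick sanity check for the second point, here's a sketch (same tokenizer assumed) showing that apply_chat_template yields exactly one BOS id despite the `<s>` in the template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
messages = [
    {"role": "user", "content": "List all presidents of the USA."},
]

# With tokenize=True (the default), apply_chat_template returns token ids
# and takes care not to prepend a second BOS on top of the template's "<s>".
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
assert ids.count(tokenizer.bos_token_id) == 1  # exactly one BOS in the output
```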

Overall, I think we can merge the PR, and in the meantime I'll investigate these two issues.

rasbt (Collaborator) commented on Jul 10, 2024

Thanks for the feedback, @Andrei-Aksionov. Let's merge it and then investigate the two other points separately.

rasbt merged commit 83404a4 into Lightning-AI:main on Jul 10, 2024. 9 checks passed.
Closes: bos_token is added twice when using CodeLlama prompting style (#1565)