
Add LLaMA 3 tokenizer and preset #1584

Merged
10 commits merged into keras-team:master on May 17, 2024

Conversation

@tirthasheshpatel (Contributor) commented Apr 19, 2024

Closes #1583

LLaMA 3 uses a BPE tokenizer with different special start and end tokens, and unlike LLaMA 2, it no longer adds the start token when generating. To reflect these changes, this PR adds a new Llama3Tokenizer, Llama3Preprocessor, Llama3CausalLMPreprocessor, and Llama3CausalLM. The backbone hasn't changed and uses the plain old LlamaBackbone.

A checkpoint conversion script has also been added.
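For context, here's a minimal usage sketch of the new components once the presets land on Kaggle (a sketch only; the preset handle "llama3_8b_en" is illustrative):

```python
import keras_nlp

# Load the new LLaMA 3 causal LM from a preset.
# "llama3_8b_en" is an illustrative preset handle.
causal_lm = keras_nlp.models.Llama3CausalLM.from_preset("llama3_8b_en")

# Unlike LLaMA 2, no start token is prepended when generating.
print(causal_lm.generate("The capital of France is", max_length=32))
```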

TODOs:

  • Add docs for all the components.
  • Add tests for all the new components.
  • Upload presets on Kaggle.

@tirthasheshpatel (Contributor, Author) left a comment:

@mattdangerw @SamanehSaadat I initially wanted to reuse all the components we had for LLaMA but:

  • The Hugging Face repo contains a broken SentencePiece model. The only other available option is their tokenizer.json file, which contains the vocab and merges for BPE.
  • The LLaMA 3 tokenizer no longer prepends a start token to the inputs, so the preprocessor has fundamentally changed.

So, I created new Llama3* components that live under their own directory. I'm not sure this is the best way to add the model, though, since we didn't name the LLaMA 2 components Llama2*. This breaks consistency, so ideas are welcome here; please let me know if there's a better way to do this.
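For the BPE path, the rough idea is to feed the vocab and merges extracted from tokenizer.json into KerasNLP's generic byte-pair tokenizer. A minimal sketch, assuming the extracted files are saved locally (paths hypothetical):

```python
import keras_nlp

# Hypothetical local files extracted from Hugging Face's tokenizer.json.
tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary="vocab.json",  # token -> id mapping
    merges="merges.txt",      # BPE merge rules, one per line
)
token_ids = tokenizer("Hello, LLaMA 3!")
```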

@SamanehSaadat (Member) left a comment:

Thanks, Tirth! Looks good! Left some comments.

keras_nlp/models/llama/llama_presets.py (6 review threads, all outdated and resolved)
        **kwargs
    ):
        super().__init__(
            tokenizer, sequence_length, add_start_token, add_end_token, **kwargs
Member:

Just want to make sure I understand correctly: is the main difference between the llama3 preprocessor and the llama preprocessor that add_start_token=False in llama3 and add_start_token=True in llama?
Or are there other differences?

Contributor Author (@tirthasheshpatel):

Yes, as far as the preprocessor is concerned, this is the only change. But there's one more change in the tokenizer: the authors no longer provide a SentencePiece model. Instead, they provide BPE vocab and merges files, so I had to create a completely new preprocessing pipeline for LLaMA 3.
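Illustratively, the preprocessor difference boils down to this one flag. A sketch, assuming the two tokenizer objects are already constructed:

```python
import keras_nlp

# LLaMA 2: a start token is prepended to the inputs.
llama_pre = keras_nlp.models.LlamaCausalLMPreprocessor(
    tokenizer=llama_tokenizer, add_start_token=True
)
# LLaMA 3: no start token is prepended.
llama3_pre = keras_nlp.models.Llama3CausalLMPreprocessor(
    tokenizer=llama3_tokenizer, add_start_token=False
)
```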

Member:

I see!
Re: the new llama3 directory, what do you think about adding the new Llama3 classes to the existing llama files?

  • llama
    • llama_tokenizer.py
      • LlamaTokenizer
      • Llama3Tokenizer
    • llama_preprocessor.py
      • LlamaPreprocessor
      • Llama3Preprocessor
    • etc

This way, we won't have a new llama3 directory and files.
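A sketch of what that single-file layout could look like (class bodies elided; the base classes match what each model would use):

```python
# llama/llama_tokenizer.py
from keras_nlp.tokenizers import BytePairTokenizer, SentencePieceTokenizer


class LlamaTokenizer(SentencePieceTokenizer):
    """LLaMA 1/2 tokenizer, backed by a SentencePiece model."""


class Llama3Tokenizer(BytePairTokenizer):
    """LLaMA 3 tokenizer, backed by BPE vocab and merges."""
```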

Contributor Author (@tirthasheshpatel):

Good idea. I went with a separate directory to follow existing precedent in KerasNLP. I will check whether it's possible to do this without making the existing files too cumbersome.

What do you think about #1584 (review)? I am curious whether there's a better way to use the current infra to avoid creating separate classes. One option I can think of is to accept both proto and vocab/merges in the tokenizer class and branch to BPE or SentencePiece based on the inputs. I avoided that since it's an unclean approach and also pollutes the API a bit.
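Schematically, the branching approach I avoided would look something like this (a hypothetical class, not code from this PR):

```python
class BranchingLlamaTokenizer:
    """Hypothetical tokenizer that picks its backend from its inputs."""

    def __init__(self, proto=None, vocabulary=None, merges=None):
        if proto is not None:
            # LLaMA 1/2 path: a SentencePiece model proto.
            self.backend = "sentencepiece"
        elif vocabulary is not None and merges is not None:
            # LLaMA 3 path: BPE vocab and merges.
            self.backend = "bpe"
        else:
            raise ValueError(
                "Pass either `proto`, or both `vocabulary` and `merges`."
            )
```

The signature alone shows the problem: every user sees both sets of arguments, even though only one combination is ever valid.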

@tirthasheshpatel changed the title from "[WIP] Add LLaMA 3 tokenizer and preset" to "Add LLaMA 3 tokenizer and preset" on May 1, 2024
@mattdangerw (Member):

@tirthasheshpatel I think you need to run ./shell/api_gen.sh to pick up the new symbols.

@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on May 3, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 3, 2024
@mattdangerw (Member) left a comment:

LGTM! Just a few comments.

from keras_nlp.src.models.llama.llama_causal_lm import LlamaCausalLM


class Llama3CausalLM(LlamaCausalLM):
Member:

This seems inconsistent with the rest of the classes. Should we copy a docstring over here as well?

Contributor Author (@tirthasheshpatel):

Done.

from transformers import LlamaForCausalLM

from keras_nlp import upload_preset
from keras_nlp.api.models import Llama3Backbone
Member:

We shouldn't need this anywhere. Can't we just do keras_nlp.models?

Contributor Author (@tirthasheshpatel):

Done.

from keras_nlp.models import LlamaTokenizer
from keras_nlp.utils.preset_utils import save_to_preset
from keras_nlp import upload_preset
from keras_nlp.api.models import LlamaBackbone
Member:

Same here. keras_nlp.models?

Contributor Author (@tirthasheshpatel):

Done.

# LLaMA 3 shares the same architecture as its predecessors
# So, we simply create an alias for API consistency
@keras_nlp_export("keras_nlp.models.Llama3Backbone")
class Llama3Backbone(LlamaBackbone):
Member:

Also for consistency, can we copy some tests for the llama3 backbone too? A preset numerics test we should probably annotate with extra_large, given how big our presets are. Or just leave it off if it will be too much hassle.

Contributor Author (@tirthasheshpatel):

> also for consistency, can we copy some tests for llama3 backbone too?

We should. I don't have the compute to run the preset tests locally, so feel free to add them in a follow-up if you have time.

Member:

Will do as a follow-up!
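For reference, such a follow-up test might be sketched like this (the preset handle is illustrative and the assertion is a placeholder shape check, not real golden numerics; the extra_large marker follows the comment above):

```python
import numpy as np
import pytest

import keras_nlp


@pytest.mark.extra_large  # loads a full preset; keep out of regular CI runs
def test_llama3_backbone_preset_numerics():
    backbone = keras_nlp.models.Llama3Backbone.from_preset("llama3_8b_en")
    outputs = backbone(
        {
            "token_ids": np.array([[1, 2, 3]]),
            "padding_mask": np.array([[1, 1, 1]]),
        }
    )
    # A real test would compare slices of `outputs` against golden values
    # from the reference implementation; here we only check the shape.
    assert outputs.shape == (1, 3, backbone.hidden_dim)
```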

@mattdangerw (Member):

Thanks! Looks good to me!

@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@mattdangerw (Member):

Looks like we need to accept the Llama terms of service for our test account. Just did; it should hopefully propagate soon and turn CI green.

@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@mattdangerw merged commit 294304b into keras-team:master on May 17, 2024
10 checks passed
Linked issue: Any plans for Llama 3? (#1583)