
Add LLaMA 3 tokenizer and preset #1584

Merged
10 commits merged into keras-team:master on May 17, 2024

Conversation

@tirthasheshpatel (Contributor) commented Apr 19, 2024

Closes #1583

LLaMA 3 uses a BPE tokenizer with different special start and end tokens, and unlike LLaMA 2, it no longer adds the start token when generating. To reflect these changes, this PR adds a new Llama3Tokenizer, Llama3Preprocessor, Llama3CausalLMPreprocessor, and Llama3CausalLM. The backbone hasn't changed and uses the plain old LlamaBackbone.

A checkpoint conversion script has also been added.
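For context, here's a minimal usage sketch of the new components once the presets land on Kaggle (a sketch only; the preset handle "llama3_8b_en" is illustrative):

```python
import keras_nlp

# Load the new LLaMA 3 causal LM from a preset.
# "llama3_8b_en" is an illustrative preset handle.
causal_lm = keras_nlp.models.Llama3CausalLM.from_preset("llama3_8b_en")

# Unlike LLaMA 2, no start token is prepended when generating.
print(causal_lm.generate("The capital of France is", max_length=32))
```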

TODOs:

  • Add docs for all the components.
  • Add tests for all the new components.
  • Upload presets on Kaggle.

@tirthasheshpatel (Contributor, Author) left a comment:

@mattdangerw @SamanehSaadat I initially wanted to reuse all the components we had for LLaMA but:

  • The Hugging Face repo contains a broken SentencePiece model. The only other available option is their tokenizer.json file, which contains the vocab and merges for BPE.
  • The LLaMA 3 tokenizer no longer prepends a start token to the inputs, so the preprocessor has fundamentally changed.

So, I created new Llama3* components that live under their own directory. I'm not sure this is the best way to add the model, though, since we didn't name the LLaMA 2 components Llama2*. This breaks consistency, so ideas are welcome here; please let me know if there's a better way to do this.
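For the BPE path, the rough idea is to feed the vocab and merges extracted from tokenizer.json into KerasNLP's generic byte-pair tokenizer. A minimal sketch, assuming the extracted files are saved locally (paths hypothetical):

```python
import keras_nlp

# Hypothetical local files extracted from Hugging Face's tokenizer.json.
tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary="vocab.json",  # token -> id mapping
    merges="merges.txt",      # BPE merge rules, one per line
)
token_ids = tokenizer("Hello, LLaMA 3!")
```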

@SamanehSaadat (Member) left a comment:

Thanks, Tirth! Looks good! Left some comments.

keras_nlp/models/llama/llama_presets.py (6 review threads, all outdated and resolved)
        **kwargs
    ):
        super().__init__(
            tokenizer, sequence_length, add_start_token, add_end_token, **kwargs
Member:

Just want to make sure I understand correctly: is the main difference between the llama3 preprocessor and the llama preprocessor that add_start_token=False in llama3 and add_start_token=True in llama?
Or are there other differences?

Contributor Author (@tirthasheshpatel):

Yes, as far as the preprocessor is concerned, this is the only change. But there's one more change in the tokenizer: the authors no longer provide a SentencePiece model. Instead, they provide BPE vocab and merges files, so I had to create a completely new preprocessing pipeline for LLaMA 3.
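Illustratively, the preprocessor difference boils down to this one flag. A sketch, assuming the two tokenizer objects are already constructed:

```python
import keras_nlp

# LLaMA 2: a start token is prepended to the inputs.
llama_pre = keras_nlp.models.LlamaCausalLMPreprocessor(
    tokenizer=llama_tokenizer, add_start_token=True
)
# LLaMA 3: no start token is prepended.
llama3_pre = keras_nlp.models.Llama3CausalLMPreprocessor(
    tokenizer=llama3_tokenizer, add_start_token=False
)
```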

Member:

I see!
Re: the new llama3 directory, what do you think about adding the new Llama3 classes to the existing llama files?

  • llama
    • llama_tokenizer.py
      • LlamaTokenizer
      • Llama3Tokenizer
    • llama_preprocessor.py
      • LlamaPreprocessor
      • Llama3Preprocessor
    • etc

This way, we won't have a new llama3 directory and files.
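A sketch of what that single-file layout could look like (class bodies elided; the base classes match what each model would use):

```python
# llama/llama_tokenizer.py
from keras_nlp.tokenizers import BytePairTokenizer, SentencePieceTokenizer


class LlamaTokenizer(SentencePieceTokenizer):
    """LLaMA 1/2 tokenizer, backed by a SentencePiece model."""


class Llama3Tokenizer(BytePairTokenizer):
    """LLaMA 3 tokenizer, backed by BPE vocab and merges."""
```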

Contributor Author (@tirthasheshpatel):

Good idea. I went with a separate directory to follow existing precedent in KerasNLP. I will check whether it's possible to do this without making the existing files too cumbersome.

What do you think about #1584 (review)? I am curious whether there's a better way to use the current infra to avoid creating separate classes. One option I can think of is to accept both proto and vocab/merges in the tokenizer class and branch to BPE or SentencePiece based on the inputs. I avoided that since it's an unclean approach and also pollutes the API a bit.
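Schematically, the branching approach I avoided would look something like this (a hypothetical class, not code from this PR):

```python
class BranchingLlamaTokenizer:
    """Hypothetical tokenizer that picks its backend from its inputs."""

    def __init__(self, proto=None, vocabulary=None, merges=None):
        if proto is not None:
            # LLaMA 1/2 path: a SentencePiece model proto.
            self.backend = "sentencepiece"
        elif vocabulary is not None and merges is not None:
            # LLaMA 3 path: BPE vocab and merges.
            self.backend = "bpe"
        else:
            raise ValueError(
                "Pass either `proto`, or both `vocabulary` and `merges`."
            )
```

The signature alone shows the problem: every user sees both sets of arguments, even though only one combination is ever valid.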

@tirthasheshpatel changed the title from "[WIP] Add LLaMA 3 tokenizer and preset" to "Add LLaMA 3 tokenizer and preset" on May 1, 2024
@mattdangerw (Member):

@tirthasheshpatel I think you need to run ./shell/api_gen.sh to pick up the new symbols.

@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on May 3, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 3, 2024
@mattdangerw (Member) left a comment:

LGTM! Just a few comments.

from keras_nlp.src.models.llama.llama_causal_lm import LlamaCausalLM


class Llama3CausalLM(LlamaCausalLM):
Member:

This seems inconsistent with the rest of the classes. Should we copy a docstring over here as well?

Contributor Author (@tirthasheshpatel):

Done.

from transformers import LlamaForCausalLM

from keras_nlp import upload_preset
from keras_nlp.api.models import Llama3Backbone
Member:

We shouldn't need this anywhere. Can't we just do keras_nlp.models?

Contributor Author (@tirthasheshpatel):

Done.

from keras_nlp.models import LlamaTokenizer
from keras_nlp.utils.preset_utils import save_to_preset
from keras_nlp import upload_preset
from keras_nlp.api.models import LlamaBackbone
Member:

Same here. keras_nlp.models?

Contributor Author (@tirthasheshpatel):

Done.

# LLaMA 3 shares the same architecture as its predecessors
# So, we simply create an alias for API consistency
@keras_nlp_export("keras_nlp.models.Llama3Backbone")
class Llama3Backbone(LlamaBackbone):
Member:

Also for consistency, can we copy some tests for the llama3 backbone too? A preset numerics test we should probably annotate with extra_large, given how big our presets are. Or just leave it off if it will be too much hassle.

Contributor Author (@tirthasheshpatel):

> also for consistency, can we copy some tests for llama3 backbone too?

We should. I don't have the compute to run the preset tests locally, so feel free to add them in a follow-up if you have time.

Member:

Will do as a follow-up!
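For reference, such a follow-up test might be sketched like this (the preset handle is illustrative and the assertion is a placeholder shape check, not real golden numerics; the extra_large marker follows the comment above):

```python
import numpy as np
import pytest

import keras_nlp


@pytest.mark.extra_large  # loads a full preset; keep out of regular CI runs
def test_llama3_backbone_preset_numerics():
    backbone = keras_nlp.models.Llama3Backbone.from_preset("llama3_8b_en")
    outputs = backbone(
        {
            "token_ids": np.array([[1, 2, 3]]),
            "padding_mask": np.array([[1, 1, 1]]),
        }
    )
    # A real test would compare slices of `outputs` against golden values
    # from the reference implementation; here we only check the shape.
    assert outputs.shape == (1, 3, backbone.hidden_dim)
```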

@mattdangerw (Member):

Thanks! Looks good to me!

@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@mattdangerw (Member):

Looks like we need to accept the Llama terms of service for our test account. Just did; it should hopefully propagate soon and turn CI green.

@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on May 17, 2024
@mattdangerw merged commit 294304b into keras-team:master on May 17, 2024
10 checks passed
Linked issue: Any plans for Llama 3? (#1583)