-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Llama3-8b 馃 #653
base: main
Are you sure you want to change the base?
Llama3-8b 馃 #653
Conversation
c6485e2
to
70a8fa6
Compare
70a8fa6
to
4053b58
Compare
MaxText/tokenizer.py
Outdated
sp_model = model_fp.read() | ||
sp_tokenizer = tftxt.SentencepieceTokenizer(model=sp_model, add_bos=add_bos, add_eos=add_eos, reverse=reverse) | ||
return sp_tokenizer | ||
class TikToken(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq; why not import and use this: https://github.com/google/JetStream/blob/main/jetstream/engine/token_utils.py#L353
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I created a class for it and not used the jetstream is because of the error below. Our input pipeline is tfds based so we are tokenizing Symbolic Tensors instead of np array's
File "/__w/maxtext/maxtext/MaxText/tokenizer.py", line 79, in __call__ *
features[k], _ = self.sp_tokenizer.encode(str(features[k]), is_bos = self.add_bos, is_eos = self.add_eos)
File "/usr/local/lib/python3.10/dist-packages/jetstream/engine/token_utils.py", line 271, in encode *
tokens = np.array(self.vocab.encode_tf(s))
NotImplementedError: Cannot convert a symbolic tf.Tensor (SentenceTokenizer/SentenceTokenizer/SentencepieceTokenizeOp:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
Do you suggest another way to get around this problem ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the Tiktoken class you have created can handle tf.Tensor as well.
Looks like the SentencePieceTokenizer you used from tensorflow_text made it into a tf op. And I don't think tensorflow_text has tiktoken.
Modifying maxtext to pass in numpy arrays (for both tiktoken and sentencepiece) should be the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiktoken class here works with tf.Tensors. See the tests passing here: http://shortn/_YvmkdOTxIQ.
73ee9c1
to
1761056
Compare
b5f18ea
to
2f3d8a2
Compare
2f3d8a2
to
6dc3318
Compare
6dc3318
to
6a9e519
Compare
No description provided.