Llama3-8b #653

Open
wants to merge 1 commit into
base: main
Conversation

khatwanimohit
Collaborator

No description provided.

sp_model = model_fp.read()
sp_tokenizer = tftxt.SentencepieceTokenizer(model=sp_model, add_bos=add_bos, add_eos=add_eos, reverse=reverse)
return sp_tokenizer
class TikToken():
Collaborator Author

I created a separate class for it instead of using the JetStream tokenizer because of the error below. Our input pipeline is tfds-based, so we are tokenizing symbolic tensors rather than NumPy arrays.

File "/__w/maxtext/maxtext/MaxText/tokenizer.py", line 79, in __call__  *
        features[k], _ = self.sp_tokenizer.encode(str(features[k]), is_bos = self.add_bos, is_eos = self.add_eos)
    File "/usr/local/lib/python3.10/dist-packages/jetstream/engine/token_utils.py", line 271, in encode  *
        tokens = np.array(self.vocab.encode_tf(s))

    NotImplementedError: Cannot convert a symbolic tf.Tensor (SentenceTokenizer/SentenceTokenizer/SentencepieceTokenizeOp:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

Can you suggest another way to get around this problem?

I don't think the TikToken class you have created can handle tf.Tensor either.

It looks like the SentencepieceTokenizer you used from tensorflow_text turns tokenization into a TF op, and I don't think tensorflow_text has a tiktoken equivalent.

Modifying MaxText to pass in NumPy arrays (for both tiktoken and sentencepiece) should be the way to go.

Collaborator Author

The TikToken class here works with tf.Tensors. See the tests passing here: http://shortn/_YvmkdOTxIQ.
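
For readers following the thread: one common way to let a pure-Python tokenizer accept tf.Tensors inside a tf.data pipeline is to wrap it with `tf.py_function`, which runs the Python code eagerly so `.numpy()` is available. The sketch below illustrates that general technique only; it is not necessarily how the TikToken class in this PR does it. `toy_encode` is a hypothetical stand-in for a real tokenizer's encode call, and `tokenize_op` is an illustrative name.

```python
import tensorflow as tf


def toy_encode(text):
    # Hypothetical stand-in for a pure-Python tokenizer (e.g. a tiktoken encoder).
    # It maps each character to its Unicode code point.
    return [ord(c) for c in text]


def _encode_eager(text_tensor):
    # Inside tf.py_function the argument is an eager tensor, so .numpy() works.
    text = text_tensor.numpy().decode("utf-8")
    return tf.constant(toy_encode(text), dtype=tf.int32)


def tokenize_op(text_tensor):
    # Wrap the Python tokenizer so it can be called on symbolic tensors
    # from within Dataset.map.
    ids = tf.py_function(_encode_eager, inp=[text_tensor], Tout=tf.int32)
    ids.set_shape([None])  # py_function drops static shape information
    return ids


ds = tf.data.Dataset.from_tensor_slices(["hi", "abc"]).map(tokenize_op)
result = [ids.numpy().tolist() for ids in ds]
```

The trade-off is that `tf.py_function` executes under the Python interpreter, so it can limit input-pipeline parallelism and cannot be serialized with the graph. The alternative suggested in the review, tokenizing NumPy arrays outside the TF graph, avoids that.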

@rwitten rwitten assigned gobbleturk and unassigned rwitten May 17, 2024
@khatwanimohit khatwanimohit force-pushed the mohit/llama3 branch 3 times, most recently from 73ee9c1 to 1761056 Compare May 20, 2024 22:53
@khatwanimohit khatwanimohit force-pushed the mohit/llama3 branch 6 times, most recently from b5f18ea to 2f3d8a2 Compare May 24, 2024 17:33
@gobbleturk gobbleturk removed their assignment May 28, 2024
6 participants