Skip to content

remove tf dependency from tokenizer.py#3196

Merged
copybara-service[bot] merged 1 commit intomainfrom
aireen/tokenizer_no_tf
Feb 24, 2026
Merged

remove tf dependency from tokenizer.py#3196
copybara-service[bot] merged 1 commit intomainfrom
aireen/tokenizer_no_tf

Conversation

@aireenmei
Copy link
Copy Markdown
Collaborator

@aireenmei aireenmei commented Feb 19, 2026

Description

This is part of the bigger plan of removing tensorflow dependency in MaxText

  • combine SentencePieceTokenizerGrain and SentencePieceTokenizer under the name SentencePieceTokenizer
  • Uses the sentencepiece library to load sentencepiece tokenizer instead of the tensorflow_text library
  • replace tf.io.gfile with Google Storage client for gs:// path
  • move TokenizeOp to input_pipeline_utils.py, in the future will put functions using tf to a separate input_pipeline_utils_tf.py to make installing tf optional

Tests

CI test

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@aireenmei aireenmei force-pushed the aireen/tokenizer_no_tf branch from 8ae57b1 to 3638112 Compare February 19, 2026 20:50
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 35.13514% with 24 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/maxtext/utils/gcs_utils.py 15.38% 11 Missing ⚠️
src/maxtext/input_pipeline/input_pipeline_utils.py 36.36% 7 Missing ⚠️
src/maxtext/input_pipeline/tokenizer.py 45.45% 5 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## 📋 Review Summary

This PR successfully removes the tensorflow_text dependency by consolidating SentencePieceTokenizerGrain and SentencePieceTokenizer using the native sentencepiece library. The implementation correctly handles GCS loading and transitions the tokenization operation to a unified tf.py_function approach within input_pipeline_utils.py.

🔍 General Feedback

  • Simplification: Merging the two classes and moving the file logic to gcs_utils.py significantly cleans up the tokenization architecture.
  • Testing: The unit tests have been appropriately updated to accommodate the return type change (from tf.Tensor to list).
  • Completeness: The removal of tokenizer imports in tfds_data_processing_c4_mlperf.py looks clean and accurate.

Comment thread src/maxtext/input_pipeline/tokenizer.py Outdated
Comment thread src/maxtext/input_pipeline/tokenizer.py
@aireenmei aireenmei force-pushed the aireen/tokenizer_no_tf branch from 4b354db to 7f1f7e3 Compare February 20, 2026 01:04
@aireenmei aireenmei force-pushed the aireen/tokenizer_no_tf branch from 7f1f7e3 to 7963fc8 Compare February 20, 2026 01:07
Copy link
Copy Markdown
Collaborator

@hengtaoguo hengtaoguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread src/maxtext/input_pipeline/tokenizer.py
Comment thread tests/unit/tokenizer_test.py
@copybara-service copybara-service Bot merged commit e228e8a into main Feb 24, 2026
73 of 74 checks passed
@copybara-service copybara-service Bot deleted the aireen/tokenizer_no_tf branch February 24, 2026 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants