Skip GCS-dependent tokenizer unit tests under decoupled offline mode #4001

Merged
copybara-service[bot] merged 1 commit into main from decoupled-tokenizer-skips on May 28, 2026

Conversation

@darisoy (Collaborator) commented May 28, 2026

Description

This PR adds skip decorators to TrainTokenizerTest, TikTokenTest, and HFTokenizerTest in tests/unit/tokenizer_test.py so that these classes are skipped when running in decoupled offline mode (DECOUPLE_GCLOUD=TRUE).

Problem & Context

These unit tests are hardcoded to train and verify tokenizers on GCS parquet datasets (gs://maxtext-dataset/...) and copy models using the gcloud CLI in a subprocess.

In an air-gapped GKE container running offline:

  • GCS access is stubbed out, causing FileNotFoundError when JAX tries to glob the buckets.
  • The minimal container image lacks the Google Cloud SDK (gcloud CLI), causing FileNotFoundError: 'gcloud' when spawning the download subprocess.
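
The second failure mode can be reproduced without any cloud tooling. The sketch below is illustrative only and is not the code in tokenizer_test.py: try_cli and the binary name are hypothetical. It shows that Python's subprocess module raises FileNotFoundError when the requested executable is not on PATH, which is what happens when gcloud is absent from the image.

```python
import shutil
import subprocess


def try_cli(binary: str) -> str:
  """Try to launch `<binary> --version` and report how the attempt ended."""
  try:
    subprocess.run([binary, "--version"], check=True, capture_output=True)
  except FileNotFoundError:
    # Raised by subprocess when the executable cannot be found on PATH.
    return "missing"
  except subprocess.CalledProcessError:
    # The executable exists but exited with a non-zero status.
    return "failed"
  return "ok"


# Hypothetical binary name, chosen so that it is not installed anywhere.
MISSING = "definitely-not-installed-cli-xyz"

print(shutil.which(MISSING))  # None when the binary is not on PATH
print(try_cli(MISSING))
```

Checking shutil.which up front is one way to detect a missing CLI before spawning anything. The PR takes the simpler route of skipping the affected tests.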

Solution

Skip these cloud-dependent test classes with @unittest.skipIf(is_decoupled(), ...) when decoupled offline mode is enabled. The decoupled test suite then passes out of the box in offline environments, and standard online runs are unaffected.
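
A minimal sketch of the class-level skip pattern follows. It makes several assumptions: is_decoupled here is a hypothetical stand-in that reads DECOUPLE_GCLOUD directly (the repository's real helper may be implemented differently), and ExampleCloudTest, make_test_case, and count_skips are illustrative names that do not appear in the PR.

```python
import os
import unittest


def is_decoupled() -> bool:
  # Hypothetical stand-in for the repository helper; assumes the flag is
  # the DECOUPLE_GCLOUD environment variable set to TRUE.
  return os.environ.get("DECOUPLE_GCLOUD", "").upper() == "TRUE"


def make_test_case():
  """Define the test class on demand so the skip condition is re-evaluated."""

  @unittest.skipIf(is_decoupled(), "Needs GCS and the gcloud CLI; skipped in decoupled mode.")
  class ExampleCloudTest(unittest.TestCase):

    def test_needs_cloud(self):
      # Placeholder body; a real test would touch GCS here.
      self.assertTrue(True)

  return ExampleCloudTest


def count_skips() -> int:
  """Run the example class and return how many tests were skipped."""
  suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test_case())
  result = unittest.TestResult()
  suite.run(result)
  return len(result.skipped)


if __name__ == "__main__":
  os.environ["DECOUPLE_GCLOUD"] = "TRUE"
  print(count_skips())
```

The skipIf condition is evaluated when the class is defined, which for a normal test module means at import time. The environment variable therefore has to be set before the test runner starts rather than inside a test or fixture.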


Tests

This change has been fully verified on a physical TPU7x VM slice.

Commands to Reproduce & Verify:

export DECOUPLE_GCLOUD=TRUE
python3 -m pytest tests/unit/tokenizer_test.py -vv

Expected Output:
All 5 collection errors are resolved and the classes are cleanly skipped:

tests/unit/tokenizer_test.py::TrainTokenizerTest SKIPPED
tests/unit/tokenizer_test.py::TikTokenTest SKIPPED
tests/unit/tokenizer_test.py::HFTokenizerTest SKIPPED

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@darisoy force-pushed the decoupled-tokenizer-skips branch from 601a77d to 541ce2b on May 28, 2026 02:03
@darisoy force-pushed the decoupled-tokenizer-skips branch from 541ce2b to 1ede2aa on May 28, 2026 02:06
codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@copybara-service copybara-service Bot merged commit a8c4f4d into main May 28, 2026
47 checks passed
@copybara-service copybara-service Bot deleted the decoupled-tokenizer-skips branch May 28, 2026 05:04