
Fix tokenizer loading for logit check #3544

Merged

copybara-service[bot] merged 1 commit into main from shuningjin-fix-error on Apr 2, 2026
Conversation

@shuningjin (Collaborator) commented Apr 1, 2026

Description

Fix tokenizer loading for forward pass logit check. b/497054985

  • Previous behavior: if test_args.hf_model_path is local, load the tokenizer from config.tokenizer_path.
  • This requires config.tokenizer_path to be in Hugging Face format; otherwise it raises an error, e.g.,
python3 -m tests.utils.forward_pass_logit_checker /deps/src/maxtext/configs//base.yml tokenizer_path=/deps/src/maxtext/assets/tokenizer.gemma3 load_parameters_path=gs://maxtext-gemma/unified/gemma3/4b/unscanned/2025-08-05-18-18/0/items model_name=gemma3-4b use_multimodal=false scan_layers=false --hf_model_path=./tmp/hf/gemma3-4b/2026-03-27-11-42 --max_kl_div=0.015 --run_hf_model=true

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, token=hf_token)
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/deps/src/maxtext/assets/tokenizers/tokenizer.gemma3'. Use `repo_type` argument if needed.
  • Current behavior:
    • (1) First try loading from test_args.hf_model_path (eliminating the need to manually set config.tokenizer_path).
    • (2) On error, fall back to config.tokenizer_path (e.g., a dequantized HF checkpoint, such as gpt-oss, may not include a tokenizer).
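The try-then-fallback behavior above can be sketched as follows. This is a hypothetical illustration, not the PR's actual diff: the helper name and the injected loader are made up for demonstration; the real code in tests/utils/forward_pass_logit_checker.py calls transformers.AutoTokenizer.from_pretrained directly.

```python
def load_tokenizer_with_fallback(hf_model_path, tokenizer_path, load_fn):
  """Try the local HF checkpoint dir first; on any error, fall back to config.tokenizer_path."""
  try:
    print(f"Loading tokenizer from {hf_model_path}.")
    return load_fn(hf_model_path)
  except Exception as e:  # e.g. HFValidationError, or the checkpoint lacks tokenizer files
    print(f"Tokenizer loading error: {e}.")
    print(f"Loading tokenizer from {tokenizer_path}.")
    return load_fn(tokenizer_path)


# Stubbed usage mirroring Case 2 below: the dequantized checkpoint dir has no
# tokenizer, so loading falls back to the hub id from config.tokenizer_path.
def fake_load(path):
  if path == "~/gpt-oss-20b/gpt-oss-20b-bf16-v2":
    raise AttributeError("'NoneType' object has no attribute 'endswith'")
  return f"tokenizer:{path}"


tok = load_tokenizer_with_fallback(
    "~/gpt-oss-20b/gpt-oss-20b-bf16-v2", "openai/gpt-oss-20b", fake_load)
# tok is now the tokenizer loaded from the fallback path.
```

Catching a broad Exception here is deliberate: the failure mode differs by checkpoint (an HFValidationError for a non-repo path, an AttributeError for missing tokenizer files), and any of them should trigger the fallback.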

Tests

Case 1: load from test_args.hf_model_path

UNSCANNED_CKPT_PATH=gs://ml-auto-solutions/output/unowned/maxtext_stable_deepseek2-16b-v5p-8-2026-03-23-05-25-15/unscanned/0/items
HF_PATH=~/tmp/deepseek2-16b-hf-2026-03-23-22-29-46
python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml base_output_directory=gs://runner-maxtext-logs run_name=forward_logits_check load_parameters_path=${UNSCANNED_CKPT_PATH} scan_layers=false attention=dot_product per_device_batch_size=1 model_name=deepseek2-16b max_prefill_predict_length=4 max_target_length=4 async_checkpointing=false sparse_matmul=false ici_fsdp_parallelism=1 ici_expert_parallelism=1 checkpoint_storage_concurrent_gb=1024 weight_dtype=bfloat16 dtype=bfloat16 --max_kl_div=3e-2 \
hardware=cpu skip_jax_distributed_system=True \
--run_hf_model=true --hf_model_path=$HF_PATH \
tokenizer_path=deepseek-ai/DeepSeek-V2-Lite tokenizer_type=huggingface

https://paste.googleplex.com/5308075478220800

Loading tokenizer from /home/shuningjin/tmp/deepseek2-16b-hf-2026-03-23-22-29-46.

Case 2: fallback to config.tokenizer_path

UNSCANNED_CKPT_PATH=gs://shuningjin-multipod-dev/gpt-oss-20b/unscan-bf16-v2-2025-09-02-01-16-00/0/items
HF_PATH=~/gpt-oss-20b/gpt-oss-20b-bf16-v2
python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml base_output_directory=gs://runner-maxtext-logs run_name=forward_logits_check load_parameters_path=${UNSCANNED_CKPT_PATH} scan_layers=false attention=dot_product per_device_batch_size=1 model_name=gpt-oss-20b max_prefill_predict_length=4 max_target_length=4 async_checkpointing=false sparse_matmul=false ici_fsdp_parallelism=1 ici_expert_parallelism=1 checkpoint_storage_concurrent_gb=1024 weight_dtype=bfloat16 dtype=bfloat16 --max_kl_div=3e-2 \
hardware=cpu skip_jax_distributed_system=True \
--run_hf_model=true --hf_model_path=$HF_PATH \
tokenizer_path=openai/gpt-oss-20b tokenizer_type=huggingface

https://paste.googleplex.com/4519457809629184

INFO:absl:Loading tokenizer from /home/shuningjin/gpt-oss-20b/gpt-oss-20b-bf16-v2.
INFO:absl:Tokenizer loading error: 'NoneType' object has no attribute 'endswith'.
Loading tokenizer from openai/gpt-oss-20b.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Comment thread on tests/utils/forward_pass_logit_checker.py (outdated)
@shuningjin force-pushed the shuningjin-fix-error branch from 38b30e5 to 0b0281e on April 1, 2026 19:16
@shuningjin force-pushed the shuningjin-fix-error branch from 0b0281e to 48d1761 on April 1, 2026 22:04
@bvandermoon (Collaborator) left a comment


LGTM, once @RissyRan's comment is also resolved. Thank you @shuningjin for this fix

@copybara-service[bot] merged commit d370f95 into main on Apr 2, 2026
58 of 60 checks passed
@copybara-service[bot] deleted the shuningjin-fix-error branch on April 2, 2026 03:28

3 participants