Conversation

@gabrielhuang (Collaborator) commented Jun 23, 2025

Changes:

  • default to the gpt-4 tokenizer if any exception is raised while loading the requested tokenizer (a minimal sketch of the resulting behavior follows)
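
For context, a minimal sketch of the patched helper, assuming the fallback returns tiktoken's gpt-4 encoding (the actual get_tokenizer in llm_utils.py may differ in its details):

import logging

import tiktoken
from transformers import AutoTokenizer

def get_tokenizer(model_name: str):
    try:
        return AutoTokenizer.from_pretrained(model_name)
    except Exception as e:
        # Any failure (missing repo, invalid repo id, network error, ...) now
        # falls back to the gpt-4 tokenizer instead of crashing the run.
        logging.info(f"Could not find a tokenizer for model {model_name}: {e} Defaulting to gpt-4.")
        return tiktoken.encoding_for_model("gpt-4")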

How to reproduce the error

from agentlab.llm.llm_utils import count_tokens
count_tokens("this is a test", model="anthropic/claude-3.5-sonnet:beta")

Error that this PR addresses. The model id 'anthropic/claude-3.5-sonnet:beta' is not a valid Hugging Face repo id (validate_repo_id rejects the ':'), so loading raises an HFValidationError, which the previous except OSError handler did not catch:

  File "/Users/gabriel.huang/code/BrowserGym/browsergym/experiments/src/browsergym/experiments/loop.py", line 417, in run
    action = step_info.from_action(agent)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/code/BrowserGym/browsergym/experiments/src/browsergym/experiments/loop.py", line 205, in from_action
    self.action, self.agent_info = agent.get_action(self.obs.copy())
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/code/AgentLab/src/agentlab/llm/tracking.py", line 61, in wrapper
    action, agent_info = get_action(self, obs)
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/code/AgentLab/src/agentlab/agents/generic_agent/generic_agent.py", line 115, in get_action
    human_prompt = dp.fit_tokens(
                   ^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/code/AgentLab/src/agentlab/agents/dynamic_prompting.py", line 254, in fit_tokens
    max_prompt_tokens -= count_tokens(prompt, model=model_name) + 1  # +1 because why not ?
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/code/AgentLab/src/agentlab/llm/llm_utils.py", line 197, in count_tokens
    enc = get_tokenizer(model)
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/code/AgentLab/src/agentlab/llm/llm_utils.py", line 190, in get_tokenizer
    return AutoTokenizer.from_pretrained(model_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 950, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 782, in get_tokenizer_config
    resolved_config_file = cached_file(
                           ^^^^^^^^^^^^
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/transformers/utils/hub.py", line 312, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/transformers/utils/hub.py", line 523, in cached_files
    _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/transformers/utils/hub.py", line 140, in _get_cache_file_to_return
    resolved_file = try_to_load_from_cache(path_or_repo_id, full_filename, cache_dir=cache_dir, revision=revision)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/Users/gabriel.huang/mamba/envs/raa6/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'anthropic/claude-3.5-sonnet:beta'.

Description by Korbit AI

What change is being made?

Update the exception handling in get_tokenizer to catch all exceptions and log the specific error message, then default to "gpt-4".

Why are these changes being made?

Catching all exceptions, rather than only OSError, makes tokenizer loading robust to unexpected failures such as the HFValidationError above, and logging the specific exception preserves the detail needed for troubleshooting. The function then falls back to the known gpt-4 default.

@gabrielhuang self-assigned this Jun 23, 2025
@korbit-ai bot left a comment

Review by Korbit AI

Issue: Over-broad Exception Handling (category: Error Handling)

Files scanned: src/agentlab/llm/llm_utils.py


Comment on lines 189 to +192
     try:
         return AutoTokenizer.from_pretrained(model_name)
-    except OSError:
-        logging.info(f"Could not find a tokenizer for model {model_name}. Defaulting to gpt-4.")
+    except Exception as e:
+        logging.info(f"Could not find a tokenizer for model {model_name}: {e} Defaulting to gpt-4.")

Over-broad Exception Handling (category: Error Handling)

What is the issue?

Using a bare Exception catch is too broad and could mask critical errors that should be handled differently.

Why this matters

This could catch and ignore serious issues like memory errors or import errors that require different handling, potentially making debugging more difficult.

Suggested change

Catch specific exceptions that are expected in tokenizer loading:

try:
    return AutoTokenizer.from_pretrained(model_name)
except (OSError, ValueError) as e:
    logging.info(f"Could not find a tokenizer for model {model_name}: {e} Defaulting to gpt-4.")

@amanjaiswal73892 (Collaborator) left a comment

Thanks!

@amanjaiswal73892 merged commit 69e7216 into main Jun 27, 2025
6 of 7 checks passed
@amanjaiswal73892 deleted the fix_count_tokens branch June 27, 2025 21:00