chore: update mlx-swift-lm to fix/gemma4-pad-eos-token #70

Merged
solderzzc merged 5 commits into main from fix/gemma4-tool-latency on Apr 21, 2026

Conversation

@solderzzc
Member

Points to fix(Gemma4): add pad token (ID=0) to eosTokenIds to prevent infinite padding loops when Gemma-4 prompts exceed the 1024-token sliding window attention limit.

Copilot AI review requested due to automatic review settings April 21, 2026 19:32
Contributor

Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

…uery bug)

Adds option 9 to run_benchmark.sh to reproduce and track the bug where
Gemma-4 fails to call tools for vague natural-language queries.

Test structure (11 total requests):
  [1/3] Vague 'what is the news' + web_search tool — 5 runs, need ≥3 tool_calls
  [2/3] Same query, no tools — 3 runs, need 3 coherent text responses (sanity)
  [3/3] Explicit 'Use web_search...' + tool — 3 runs, need 3 tool_calls

Pass criteria: all three sections meet their thresholds.

Root cause (documented): The chat_template.jinja appends
  <|channel>thought\n<channel|>
to every non-thinking generation prompt. This flattens the first-token
logit distribution for vague queries when tools are present, causing the
model to output garbage tokens or ignore tools entirely.

Baseline (unfixed): 0/5 vague tool_calls, 3/3 explicit tool_calls.
Target (fixed):     ≥3/5 vague tool_calls, 3/3 explicit tool_calls.
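The three-section pass criteria above could be tallied with a small helper along these lines (a sketch only: the thresholds and baseline numbers come from the commit message, while the function and variable names are hypothetical, not taken from run_benchmark.sh):

```shell
#!/usr/bin/env bash
# Hypothetical tally for benchmark option 9. Defaults reproduce the
# unfixed baseline: 0/5 vague tool_calls, 3/3 sanity, 3/3 explicit.
: "${VAGUE_TOOL_CALLS:=0}"
: "${VAGUE_TEXT_OK:=3}"
: "${EXPLICIT_TOOL_CALLS:=3}"

overall=PASS

# Print PASS/FAIL for one section; mark the overall run FAIL on a miss.
check_section() {
  local name=$1 passed=$2 total=$3 need=$4
  if [ "$passed" -ge "$need" ]; then
    echo "[$name] PASS ($passed/$total, need >=$need)"
  else
    echo "[$name] FAIL ($passed/$total, need >=$need)"
    overall=FAIL
  fi
}

check_section "1/3 vague+tools"    "$VAGUE_TOOL_CALLS"    5 3
check_section "2/3 vague-no-tools" "$VAGUE_TEXT_OK"       3 3
check_section "3/3 explicit+tools" "$EXPLICIT_TOOL_CALLS" 3 3
echo "OVERALL: $overall"
```

With the baseline defaults this prints FAIL for section 1/3 and an overall FAIL; a fixed build should clear all three thresholds.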
…nchmark.sh

- Swap Quit/regression: 8=Regression, 9=Quit (conventional placement)
- Move Test 8 handler block to after BIN+FULL_MODEL are assigned
  (was incorrectly placed before model selection, causing empty $FULL_MODEL)
- Restore accidentally removed 'if [ suite_opt == 2 ]' guard
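The corrected flow those three bullets describe might look roughly like this (identifiers mirror the commit message; the menu text, binary path, and model name are illustrative stand-ins, not the actual script):

```shell
#!/usr/bin/env bash
# Sketch of the corrected control flow (illustrative, not run_benchmark.sh).
suite_opt="8"   # stand-in for the menu read; 8=Regression, 9=Quit (Quit last)

if [ "$suite_opt" == "9" ]; then
  echo "bye"
  exit 0
fi

# Restored guard. Note the $ expansion and quotes: a bare
# '[ suite_opt == 2 ]' compares the literal string "suite_opt"
# against "2" and is always false.
if [ "$suite_opt" == "2" ]; then
  echo "suite 2 setup"
fi

# Assign BIN and FULL_MODEL *before* the Test 8 handler runs; with the
# handler placed ahead of model selection it saw an empty $FULL_MODEL.
BIN="./server"
FULL_MODEL="gemma-4"

if [ "$suite_opt" == "8" ]; then
  echo "regression: $BIN $FULL_MODEL"
fi
```

The ordering is the point: menu dispatch first, model selection next, and only then the Test 8 handler that consumes `$BIN` and `$FULL_MODEL`.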
@solderzzc
Member Author

close #69

The fix for #69:

- Implemented Server.swift workaround to force enable_thinking=true for gemma4 with tools
- Extracted and tracked <|channel>thought tags correctly in prompt cache states
- Fixed run_benchmark.sh to properly parse tool testing outcomes with adjusted max_tokens and system prompts
@solderzzc solderzzc merged commit 116ee91 into main Apr 21, 2026
8 checks passed
@solderzzc solderzzc deleted the fix/gemma4-tool-latency branch April 21, 2026 23:00