[Bug]: gemma4 tool-call format produces degenerate output on natural-language prompts (reproduces at 4-bit AND 8-bit weights) #69

@psperera

Description

Summary

When --model is a Gemma-4 checkpoint (SwiftLM detects Inferred tool call format: gemma4), requests that pass a tools array and a vague/natural-language user message cause the model to emit degenerate output — CJK/Korean tokens, repetition cascades, or zero-length replies — roughly 2 runs in 3. The same request without tools produces coherent text, and the same request with tools but an explicit user message (Use bash to run: echo X) works 100%. This points at the gemma4 tool-call format implementation, not a model-quality issue.

Environment

  • SwiftLM: b517 (release tarball SwiftLM-b517-macos-arm64.tar.gz)
  • Hardware: Apple M4 Max, 64 GB unified memory, macOS 25.4.0 (Darwin 25.4.0)
  • Models tested (both exhibit the same failure pattern):
    • mlx-community/gemma-4-26b-a4b-it-4bit
    • mlx-community/gemma-4-26b-a4b-it-8bit
  • Flags: --port 5413 --host 127.0.0.1 --mem-limit 51200 (reproduces with or without --turbo-kv)
  • SwiftLM log on load: Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.gemma4)

Minimal repro

```bash
./SwiftLM --model mlx-community/gemma-4-26b-a4b-it-4bit --port 5413 --host 127.0.0.1 &

# Vague natural-language prompt — fails ~67% of runs

for i in 1 2 3; do
curl -sS -X POST http://127.0.0.1:5413/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages":[{"role":"user","content":"what is the news"}],
"max_tokens":300, "temperature":0.7,
"tools":[{"type":"function","function":{"name":"web_search","description":"Search the web","parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}}]
}' | python3 -c "import sys,json; d=json.load(sys.stdin); print('finish:', d['choices'][0]['finish_reason']); print(d['choices'][0]['message'].get('content','')[:200], d['choices'][0]['message'].get('tool_calls'))"
done
```

Observed output samples (verbatim from 3 runs)

  • Run A — `finish: stop`, no `tool_calls`, content = `"ing/sallyer/sallyer/sallyer/sallyer/..."` (800+ chars of repetition).
  • Run B — `finish: stop`, no `tool_calls`, content = 2 chars (essentially empty).
  • Run C — `finish: tool_calls`, clean `web_search({"query":"news today"})` ← works ~1/3.

Controls that narrow the bug to the gemma4 tool-call format

  1. Same request, no `tools` field → always produces a coherent English refusal ("I don't have access to real-time information…"). Isolates the bug to the tool-call code path.
  2. Same `tools`, explicit prompt (`"Use web_search to find news today"`) → 3/3 clean tool calls. The model can emit the tool-call structure reliably; the format just fails to steer the first-token distribution for natural-language queries.
  3. Reproduces at both 4-bit and 8-bit weights. When the 8-bit variant degenerates, the cascade is actually longer (hundreds of repeated tokens). Rules out low-bit weight fragility as the cause.
  4. Reproduces with/without `--turbo-kv` and across sampler combinations. Any distribution-clipping default tight enough to suppress the garbage tokens (e.g. `--top-p 0.95`, `--top-k 64`, `--min-p 0.05`) also suppresses the tool-call control tokens and breaks the explicit prompt case. The usable window between "stops gibberish" and "allows tool calls" appears to be zero with the current format. `--repeat-penalty 1.1` helps later-token repetition but does not change the first-token problem.
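To make the control deltas explicit, the three request bodies above differ only in the `tools` field and the user string; a small Python sketch of the payloads (field names copied from the curl repro):

```python
import copy

# Baseline from the repro: vague prompt + tools → degenerates ~2/3 of runs.
baseline = {
    "messages": [{"role": "user", "content": "what is the news"}],
    "max_tokens": 300,
    "temperature": 0.7,
    "tools": [{
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
}

# Control 1: identical request minus the tools field → always coherent.
no_tools = copy.deepcopy(baseline)
del no_tools["tools"]

# Control 2: identical tools, explicit imperative prompt → 3/3 clean tool calls.
explicit = copy.deepcopy(baseline)
explicit["messages"][0]["content"] = "Use web_search to find news today"

# Sanity check: the only deltas are the tools field and the user content.
assert set(baseline) - set(no_tools) == {"tools"}
assert baseline["tools"] == explicit["tools"]
```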

Hypothesis

The gemma4 tool-call format appears to produce a prompt shape where, at the "decide whether to tool-call" decision point for a vague user message, the first-token distribution has a broad, flat tail. Sampling (even at `temperature=0.7`) then frequently selects a low-probability junk token (commonly a Korean syllable or a sub-word like `/sallyer`), which cascades. Explicit prompts ("Use X to Y") sharpen the distribution onto the tool-call opener and bypass the failure mode.
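The failure rate is consistent with a toy softmax calculation (the logit values below are invented for illustration, not measured): a modest gap between the tool-call opener token and a wide flat tail of junk tokens leaves the opener with roughly a third of the mass at `temperature=0.7`, while a sharpened gap pushes it to near certainty.

```python
import math

def opener_probability(opener_logit: float, tail_size: int,
                       tail_logit: float = 0.0, temperature: float = 0.7) -> float:
    """Softmax probability of the tool-call opener against a flat junk-token tail."""
    opener = math.exp(opener_logit / temperature)
    tail = tail_size * math.exp(tail_logit / temperature)
    return opener / (opener + tail)

# Vague prompt: opener only ~4 logits above a 500-token flat tail (hypothetical numbers).
p_vague = opener_probability(opener_logit=4.0, tail_size=500)

# Explicit "Use web_search ..." prompt: assumed to widen the gap to ~8 logits.
p_explicit = opener_probability(opener_logit=8.0, tail_size=500)

print(f"vague prompt:    P(opener) = {p_vague:.2f}")     # roughly 1 in 3, matching the repro
print(f"explicit prompt: P(opener) = {p_explicit:.2f}")  # near certainty
```

This is only a plausibility check for the hypothesis, not evidence; per-token logprob traces from the server would confirm or refute it.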

Expected behavior

On any valid user message, the server should emit either a structured `tool_calls` response or coherent text content. Degenerate output on natural-language inputs is a hard failure, since (per control 4 above) it cannot be recovered via sampler tuning.

Suggested next step

Compare how `MLXLMCommon.ToolCallFormat.gemma4` builds the chat template for Gemma-4 (particularly how the tools schema is injected relative to the turn header and control tokens) against the Qwen / Hermes paths. Happy to gather additional traces, alternate prompts, or per-token timing if useful.
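One concrete way to run that comparison is to dump the fully rendered prompt from each format path and diff them. A minimal sketch of the diffing step — the two template strings below are hypothetical stand-ins, not the real templates; only the approach is the point:

```python
import difflib

# Hypothetical renderings: where the tools schema lands relative to the
# turn header is exactly the kind of difference worth checking.
gemma4_style = (
    "<start_of_turn>user\n"
    "what is the news\n"
    "[tools schema injected here?]\n"   # hypothetical placement
    "<end_of_turn>\n"
)
hermes_style = (
    "[tools schema in system message]\n"  # hypothetical placement
    "<start_of_turn>user\n"
    "what is the news\n"
    "<end_of_turn>\n"
)

diff = "\n".join(difflib.unified_diff(
    gemma4_style.splitlines(), hermes_style.splitlines(),
    fromfile="gemma4", tofile="hermes", lineterm=""))
print(diff)
```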
