Summary
When `--model` is a Gemma-4 checkpoint (SwiftLM detects `Inferred tool call format: gemma4`), requests that pass a `tools` array together with a vague, natural-language user message cause the model to emit degenerate output — CJK/Korean tokens, repetition cascades, or zero-length replies — in roughly 2 runs out of 3. The same request without `tools` produces coherent text, and the same request with `tools` but an explicit user message (e.g. `Use bash to run: echo X`) works every time. This points at the gemma4 tool-call format implementation, not at model quality.
Environment
- SwiftLM: b517 (release tarball `SwiftLM-b517-macos-arm64.tar.gz`)
- Hardware: Apple M4 Max, 64 GB unified memory, macOS 25.4.0 (Darwin 25.4.0)
- Models tested (both exhibit the same failure pattern):
  - `mlx-community/gemma-4-26b-a4b-it-4bit`
  - `mlx-community/gemma-4-26b-a4b-it-8bit`
- Flags: `--port 5413 --host 127.0.0.1 --mem-limit 51200` (reproduces with or without `--turbo-kv`)
- SwiftLM log on load: `Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.gemma4)`
Minimal repro
```bash
./SwiftLM --model mlx-community/gemma-4-26b-a4b-it-4bit --port 5413 --host 127.0.0.1 &
# Vague natural-language prompt — fails ~67% of runs
for i in 1 2 3; do
curl -sS -X POST http://127.0.0.1:5413/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages":[{"role":"user","content":"what is the news"}],
"max_tokens":300, "temperature":0.7,
"tools":[{"type":"function","function":{"name":"web_search","description":"Search the web","parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}}]
}' | python3 -c "import sys,json; d=json.load(sys.stdin); print('finish:', d['choices'][0]['finish_reason']); print(d['choices'][0]['message'].get('content','')[:200], d['choices'][0]['message'].get('tool_calls'))"
done
```
Observed output samples (verbatim from 3 runs)
- Run A — `finish: stop`, no `tool_calls`, content = `"ing/sallyer/sallyer/sallyer/sallyer/..."` (800+ chars of repetition).
- Run B — `finish: stop`, no `tool_calls`, content = 2 chars (essentially empty).
- Run C — `finish: tool_calls`, clean `web_search({"query":"news today"})` ← works ~1/3.
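To quantify the failure rate beyond 3 runs, I triage each response with a small classifier and tally the buckets. The thresholds below are my own heuristics (nothing SwiftLM-defined): near-empty content or a content string built from very few repeated chunks counts as degenerate, matching Runs A and B above.

```python
def classify(resp: dict) -> str:
    """Rough triage of a /v1/chat/completions response into the three
    buckets observed above: tool_call, degenerate, or coherent."""
    choice = resp["choices"][0]
    msg = choice["message"]
    if choice.get("finish_reason") == "tool_calls" or msg.get("tool_calls"):
        return "tool_call"
    content = (msg.get("content") or "").strip()
    if len(content) < 5:
        return "degenerate"  # near-empty reply (Run B)
    # Repetition cascade (Run A): few unique 8-char chunks relative to length.
    chunks = [content[i:i + 8] for i in range(0, len(content), 8)]
    if len(set(chunks)) < max(2, len(chunks) // 4):
        return "degenerate"
    return "coherent"
```

Piping each curl response through `classify` over 20 iterations of the vague-prompt request reproduces the reported rate: the degenerate bucket lands near two-thirds.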
Controls that narrow the bug to the gemma4 tool-call format
- Same request, no `tools` field → always produces a coherent English refusal ("I don't have access to real-time information…"). Isolates the bug to the tool-call code path.
- Same `tools`, explicit prompt (`"Use web_search to find news today"`) → 3/3 clean tool calls. The model can emit the tool-call structure reliably; the format just fails to steer the first-token distribution for natural-language queries.
- Reproduces at both 4-bit and 8-bit weights. When the 8-bit variant degenerates, the cascade is actually longer (hundreds of repeated tokens). Rules out low-bit weight fragility as the cause.
- Reproduces with/without `--turbo-kv` and across sampler combinations. Any distribution-clipping default tight enough to suppress the garbage tokens (e.g. `--top-p 0.95`, `--top-k 64`, `--min-p 0.05`) also suppresses the tool-call control tokens and breaks the explicit prompt case. The usable window between "stops gibberish" and "allows tool calls" appears to be zero with the current format. `--repeat-penalty 1.1` helps later-token repetition but does not change the first-token problem.
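The controls above differ only in the user content and the presence of the `tools` field. A tiny payload builder (the names here are mine, not SwiftLM's) makes it easy to script all three variants for larger sample counts:

```python
# The web_search tool schema from the repro above.
WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def payload(content: str, with_tools: bool) -> dict:
    """Build the chat-completions body; only `content` and the presence
    of `tools` vary between the baseline and the controls."""
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 300,
        "temperature": 0.7,
    }
    if with_tools:
        body["tools"] = [WEB_SEARCH_TOOL]
    return body

# baseline (fails ~2/3):  payload("what is the news", True)
# control, no tools:      payload("what is the news", False)
# control, explicit:      payload("Use web_search to find news today", True)
```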
Hypothesis
The gemma4 tool-call format appears to produce a prompt shape where, at the "decide whether to tool-call" decision point for a vague user message, the first-token distribution has a broad, flat tail. Sampling (even at `temperature=0.7`) then frequently selects a low-probability junk token (commonly a Korean syllable or a sub-word like `/sallyer`), which cascades. Explicit prompts ("Use X to Y") sharpen the distribution onto the tool-call opener and bypass the failure mode.
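If SwiftLM exposes OpenAI-style `logprobs`/`top_logprobs` on this endpoint (unverified — I haven't checked), the flat-tail hypothesis could be tested directly by requesting `max_tokens: 1` with top logprobs and comparing the entropy of the first-token distribution between the vague and explicit prompts. A sketch of the comparison, with the response shape assumed:

```python
import math

def first_token_entropy(top_logprobs: dict) -> float:
    """Shannon entropy (bits) of the renormalized top-k distribution for
    the first sampled token, given a mapping of token -> logprob. A broad,
    flat tail shows up as high entropy; a distribution sharpened onto the
    tool-call opener shows up as low entropy."""
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum(p / total * math.log2(p / total) for p in probs if p > 0)

# Expectation under the hypothesis: "what is the news" yields markedly
# higher first-token entropy than "Use web_search to find news today".
```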
Expected behavior
On any valid user message, the server should either emit a structured `tool_calls` response or coherent content. Degenerate output on natural-language inputs is not recoverable via sampler tuning.
Suggested next step
Compare how `MLXLMCommon.ToolCallFormat.gemma4` builds the chat template for Gemma-4 (particularly how the tools schema is injected relative to the turn header and control tokens) against the Qwen / Hermes paths. Happy to gather additional traces, alternate prompts, or per-token timing if useful.