[Bug]: gemma4 tool-call format produces degenerate output on natural-language prompts (reproduces at 4-bit AND 8-bit weights) #69

@psperera

Description

Summary

When --model is a Gemma-4 checkpoint (SwiftLM detects Inferred tool call format: gemma4), requests that pass a tools array and a vague/natural-language user message cause the model to emit degenerate output — CJK/Korean tokens, repetition cascades, or zero-length replies — roughly 2 runs in 3. The same request without tools produces coherent text, and the same request with tools but an explicit user message (Use bash to run: echo X) works 100%. This points at the gemma4 tool-call format implementation, not a model-quality issue.

Environment

  • SwiftLM: b517 (release tarball SwiftLM-b517-macos-arm64.tar.gz)
  • Hardware: Apple M4 Max, 64 GB unified memory, macOS 25.4.0 (Darwin 25.4.0)
  • Models tested (both exhibit the same failure pattern):
    • mlx-community/gemma-4-26b-a4b-it-4bit
    • mlx-community/gemma-4-26b-a4b-it-8bit
  • Flags: --port 5413 --host 127.0.0.1 --mem-limit 51200 (reproduces with or without --turbo-kv)
  • SwiftLM log on load: Inferred tool call format: Optional(MLXLMCommon.ToolCallFormat.gemma4)

Minimal repro

```bash
./SwiftLM --model mlx-community/gemma-4-26b-a4b-it-4bit --port 5413 --host 127.0.0.1 &

# Vague natural-language prompt — fails ~67% of runs

for i in 1 2 3; do
curl -sS -X POST http://127.0.0.1:5413/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"messages":[{"role":"user","content":"what is the news"}],
"max_tokens":300, "temperature":0.7,
"tools":[{"type":"function","function":{"name":"web_search","description":"Search the web","parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}}]
}' | python3 -c "import sys,json; d=json.load(sys.stdin); print('finish:', d['choices'][0]['finish_reason']); print(d['choices'][0]['message'].get('content','')[:200], d['choices'][0]['message'].get('tool_calls'))"
done
```

Observed output samples (verbatim from 3 runs)

  • Run A — `finish: stop`, no `tool_calls`, content = `"ing/sallyer/sallyer/sallyer/sallyer/..."` (800+ chars of repetition).
  • Run B — `finish: stop`, no `tool_calls`, content = 2 chars (essentially empty).
  • Run C — `finish: tool_calls`, clean `web_search({"query":"news today"})` ← works ~1/3.

Controls that narrow the bug to the gemma4 tool-call format

  1. Same request, no `tools` field → always produces a coherent English refusal ("I don't have access to real-time information…"). Isolates the bug to the tool-call code path.
  2. Same `tools`, explicit prompt (`"Use web_search to find news today"`) → 3/3 clean tool calls. The model can emit the tool-call structure reliably; the format just fails to steer the first-token distribution for natural-language queries.
  3. Reproduces at both 4-bit and 8-bit weights. When the 8-bit variant degenerates, the cascade is actually longer (hundreds of repeated tokens). Rules out low-bit weight fragility as the cause.
  4. Reproduces with/without `--turbo-kv` and across sampler combinations. Any distribution-clipping default tight enough to suppress the garbage tokens (e.g. `--top-p 0.95`, `--top-k 64`, `--min-p 0.05`) also suppresses the tool-call control tokens and breaks the explicit prompt case. The usable window between "stops gibberish" and "allows tool calls" appears to be zero with the current format. `--repeat-penalty 1.1` helps later-token repetition but does not change the first-token problem.
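To make the control deltas explicit, the three request bodies above differ only in the `tools` field and the user string; a small Python sketch of the payloads (field names copied from the curl repro):

```python
import copy

# Baseline from the repro: vague prompt + tools → degenerates ~2/3 of runs.
baseline = {
    "messages": [{"role": "user", "content": "what is the news"}],
    "max_tokens": 300,
    "temperature": 0.7,
    "tools": [{
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
}

# Control 1: identical request minus the tools field → always coherent.
no_tools = copy.deepcopy(baseline)
del no_tools["tools"]

# Control 2: identical tools, explicit imperative prompt → 3/3 clean tool calls.
explicit = copy.deepcopy(baseline)
explicit["messages"][0]["content"] = "Use web_search to find news today"

# Sanity check: the only deltas are the tools field and the user content.
assert set(baseline) - set(no_tools) == {"tools"}
assert baseline["tools"] == explicit["tools"]
```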

Hypothesis

The gemma4 tool-call format appears to produce a prompt shape where, at the "decide whether to tool-call" decision point for a vague user message, the first-token distribution has a broad, flat tail. Sampling (even at `temperature=0.7`) then frequently selects a low-probability junk token (commonly a Korean syllable or a sub-word like `/sallyer`), which cascades. Explicit prompts ("Use X to Y") sharpen the distribution onto the tool-call opener and bypass the failure mode.
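The failure rate is consistent with a toy softmax calculation (the logit values below are invented for illustration, not measured): a modest gap between the tool-call opener token and a wide flat tail of junk tokens leaves the opener with roughly a third of the mass at `temperature=0.7`, while a sharpened gap pushes it to near certainty.

```python
import math

def opener_probability(opener_logit: float, tail_size: int,
                       tail_logit: float = 0.0, temperature: float = 0.7) -> float:
    """Softmax probability of the tool-call opener against a flat junk-token tail."""
    opener = math.exp(opener_logit / temperature)
    tail = tail_size * math.exp(tail_logit / temperature)
    return opener / (opener + tail)

# Vague prompt: opener only ~4 logits above a 500-token flat tail (hypothetical numbers).
p_vague = opener_probability(opener_logit=4.0, tail_size=500)

# Explicit "Use web_search ..." prompt: assumed to widen the gap to ~8 logits.
p_explicit = opener_probability(opener_logit=8.0, tail_size=500)

print(f"vague prompt:    P(opener) = {p_vague:.2f}")     # roughly 1 in 3, matching the repro
print(f"explicit prompt: P(opener) = {p_explicit:.2f}")  # near certainty
```

This is only a plausibility check for the hypothesis, not evidence; per-token logprob traces from the server would confirm or refute it.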

Expected behavior

On any valid user message, the server should emit either a structured `tool_calls` response or coherent text content. Degenerate output on natural-language inputs is a hard failure, since (per control 4 above) it cannot be recovered via sampler tuning.

Suggested next step

Compare how `MLXLMCommon.ToolCallFormat.gemma4` builds the chat template for Gemma-4 (particularly how the tools schema is injected relative to the turn header and control tokens) against the Qwen / Hermes paths. Happy to gather additional traces, alternate prompts, or per-token timing if useful.
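One concrete way to run that comparison is to dump the fully rendered prompt from each format path and diff them. A minimal sketch of the diffing step — the two template strings below are hypothetical stand-ins, not the real templates; only the approach is the point:

```python
import difflib

# Hypothetical renderings: where the tools schema lands relative to the
# turn header is exactly the kind of difference worth checking.
gemma4_style = (
    "<start_of_turn>user\n"
    "what is the news\n"
    "[tools schema injected here?]\n"   # hypothetical placement
    "<end_of_turn>\n"
)
hermes_style = (
    "[tools schema in system message]\n"  # hypothetical placement
    "<start_of_turn>user\n"
    "what is the news\n"
    "<end_of_turn>\n"
)

diff = "\n".join(difflib.unified_diff(
    gemma4_style.splitlines(), hermes_style.splitlines(),
    fromfile="gemma4", tofile="hermes", lineterm=""))
print(diff)
```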
