llama.cpp qwen 3.5 error #62

@Firues

smallcode: latest version (1.2.3)
llama.cpp: latest version
model: unsloth/Qwen3.5-4B-Q4_K_M_unsloth.gguf

.env:
SMALLCODE_MODEL=Qwen3.5-4B-Q4_K_M_unsloth.gguf
SMALLCODE_BASE_URL=http://127.0.0.1:8080/v1

error:
[screenshot of the error; image not preserved]

llama.cpp log:
0.48.160.470 W srv operator(): got exception: {"error":{"code":400,"message":"Unable to generate parser for this template. Automatic parser generation failed: \n------------\nWhile executing CallExpression at line 85, column 32 in source:\n...first %}↵ {{- raise_exception('System message must be at the beginnin...\n ^\nError: Jinja Exception: System message must be at the beginning.","type":"invalid_request_error"}}
0.50.210.207 W srv operator(): got exception: {"error":{"code":400,"message":"Unable to generate parser for this template. Automatic parser generation failed: \n------------\nWhile executing CallExpression at line 85, column 32 in source:\n...first %}↵ {{- raise_exception('System message must be at the beginnin...\n ^\nError: Jinja Exception: System message must be at the beginning.","type":"invalid_request_error"}}

Further explanation from Claude:

Root Cause

When a request includes tools, llama.cpp automatically generates a tool-call parser and grammar from the model's Jinja chat template to enforce valid tool-call output. The log shows that step failing inside the template itself: the Qwen chat template contains a strict validation guard (line 85 according to the log):

{%- if not loop.first %}
    {{- raise_exception('System message must be at the beginning.') }}
{%- endif %}

This raises an exception if any role: "system" message appears at a position other than index 0 in the messages array.
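The same condition can be checked on the client before a request is sent. The following is a minimal sketch, not code from SmallCode or llama.cpp: the helper name findMisplacedSystemMessages is hypothetical, and it assumes messages is an OpenAI-style array of { role, content } objects.

```javascript
// Hypothetical helper (not part of SmallCode or llama.cpp).
// Returns the indices of all system messages that are not at index 0,
// i.e. the messages that would trip the template's guard.
function findMisplacedSystemMessages(messages) {
  const misplaced = [];
  messages.forEach((message, index) => {
    if (message.role === "system" && index !== 0) {
      misplaced.push(index);
    }
  });
  return misplaced;
}

// Example conversation with a system message injected mid-conversation.
const example = [
  { role: "system", content: "You are a coding assistant." },
  { role: "user", content: "hi" },
  { role: "system", content: "Plan: step 1 ..." },
];

// The third message (index 2) is reported as misplaced.
console.log(findMisplacedSystemMessages(example));
```

Logging the result of this check right before each API call would show which injection path produces the stray system message.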

SmallCode's architecture injects additional system-role content mid-conversation in several places:

  • Knowledge injection (from knowledge/ directory)
  • Working memory / task re-injection on greeting regression
  • Plan re-injection as a turn anchor
  • Multi-file edit coordination headers

If any of these are appended as a new { role: "system", content: "..." } object rather than merged into the first system message, the template raises the exception and llama.cpp returns HTTP 400 before any tokens are generated.

Expected Behavior

All dynamic system-role injections should be merged into a single system message at position 0, not appended as additional system objects.

Suggested Fix

In the function that assembles the final messages array before each API call, consolidate all system-role content:

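// Instead of pushing a new system message:
// messages.push({ role: "system", content: knowledgeInjection }); // ❌

// Merge into the existing system message at index 0.
// systemParts: array of strings (base prompt, knowledge, plan, ...); empty entries are skipped.
// history: array of { role, content } messages from the conversation so far.
function buildMessages(systemParts, history) {
  const systemContent = systemParts.filter(Boolean).join("\n\n");
  return [
    { role: "system", content: systemContent },
    // Drop any stray system messages from history; their content must already be in systemParts.
    ...history.filter(m => m.role !== "system")
  ];
}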

This ensures the messages array always has exactly one system message, always at index 0, which satisfies the ordering check in Qwen3's chat template and in any other template that enforces it.
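If the stray system messages are already in the history and their text is not available separately, they can be folded into the leading system message instead of being dropped. This is a sketch under assumptions: mergeSystemMessages is a hypothetical name, and it assumes every system message has plain string content.

```javascript
// Hypothetical alternative (not SmallCode's actual code).
// Collects the content of every system message, wherever it appears,
// and emits a single system message at index 0 followed by the rest.
function mergeSystemMessages(messages) {
  const systemParts = [];
  const rest = [];
  for (const message of messages) {
    if (message.role === "system") {
      // Assumption: content is a plain string; empty content is skipped.
      if (message.content) systemParts.push(message.content);
    } else {
      rest.push(message);
    }
  }
  // No system content at all: return the conversation unchanged in order.
  if (systemParts.length === 0) return rest;
  return [{ role: "system", content: systemParts.join("\n\n") }, ...rest];
}
```

One trade-off to keep in mind: moving mid-conversation content to the top discards its position in the dialogue, so injections that only make sense at a particular turn (for example a plan re-injection) may need wording that does not depend on where they appear.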

Additional Notes

  • This likely affects all Qwen3-family models (Qwen3-4B, 8B, 14B, 32B, etc.) and any other model whose Jinja chat template enforces system-message ordering.
  • The bug reportedly surfaces only when tools are present in the request, on the assumption that this is when llama.cpp executes the template for parser generation. Plain chat requests without tools may work even with the broken message order, but this has not been verified here.
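Whether the presence of tools matters can be tested by sending the same misordered messages with and without a tools array. The sketch below assumes Node 18+ (global fetch) and the base URL and model name from the .env above; the list_files tool and the function names are hypothetical and exist only for this reproduction.

```javascript
// Values taken from the .env in this report.
const BASE_URL = "http://127.0.0.1:8080/v1";
const MODEL = "Qwen3.5-4B-Q4_K_M_unsloth.gguf";

// Builds an OpenAI-style chat completion body with a system message
// deliberately placed mid-conversation (index 2).
function buildReproBody(withTools) {
  const body = {
    model: MODEL,
    messages: [
      { role: "system", content: "You are a coding assistant." },
      { role: "user", content: "List the files in the project." },
      { role: "system", content: "Plan: inspect the repository first." },
      { role: "user", content: "Continue." },
    ],
  };
  if (withTools) {
    // Hypothetical tool definition, used only to trigger the tool-call path.
    body.tools = [
      {
        type: "function",
        function: {
          name: "list_files",
          description: "List files in the working directory.",
          parameters: { type: "object", properties: {}, required: [] },
        },
      },
    ];
  }
  return body;
}

// Sends the request and returns the HTTP status and raw response text.
async function sendRepro(withTools) {
  const response = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildReproBody(withTools)),
  });
  return { status: response.status, text: await response.text() };
}
```

Comparing the status returned by sendRepro(true) and sendRepro(false) against a running llama.cpp server would confirm or refute the claim that only tool-bearing requests fail.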
