🔧[0.3.38] Release Note: Optimized CJK Detokenization, Sync Grammar Parser, and Patched CUDA Graph Logs #124
JamePeng announced in Announcements
🔧Release 0.3.38: Optimized CJK Detokenization, Sync Grammar Parser, and Patched CUDA Graph Logs
Hi everyone,
This release comes only two days after the previous version, but I decided to publish it because it includes a small set of practical fixes that are worth shipping, especially for CJK-heavy generation.
The main change in this version is an optimization to the detokenization buffer sizing logic. While testing Chinese/Japanese output, I found that the previous initial buffer estimate was too small for many CJK-heavy responses. The old logic used the token count itself as the initial byte buffer size, but in real CJK output the required byte size is usually much larger.
In my local tests, CJK-heavy outputs often required around 4.0x to 5.04x bytes per token, and some small-token edge cases reached about 6.0x.
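For intuition, here is a small illustration of my own (not code from the release) of why one byte per token falls far short for CJK text:

```python
# Illustration only: why a byte buffer sized to the token count is too small for CJK.
text = "今天天气很好"          # six CJK characters
utf8 = text.encode("utf-8")
print(len(text), len(utf8))    # 6 characters -> 18 bytes (3 bytes per character in UTF-8)

# If a tokenizer covers this text in, say, 4 tokens (a plausible CJK rate),
# detokenizing needs 18 bytes for 4 tokens, i.e. 4.5 bytes per token --
# far more than the 1 byte per token the old estimate assumed.
```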

Because of this, the previous implementation frequently called `llama_detokenize`, received a negative required size, allocated a larger buffer, and then called `llama_detokenize` again. In practice, this meant that many CJK detokenization calls were doing avoidable extra work. This release changes the initial buffer estimate from the raw token count to a scaled estimate that leaves headroom for multi-byte CJK output.
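Here is a minimal sketch of the old retry pattern and the new sizing, reconstructed from the description above; the binding name `_llama_detokenize` and the exact multiplier and constants are my assumptions, not the release's actual code:

```python
import ctypes

def detokenize(tokens: list[int]) -> bytes:
    # Old estimate: one byte per token -- almost always too small for CJK.
    # buffer_size = len(tokens)
    # New estimate (illustrative): scale by a factor large enough to cover
    # the ~4-6x bytes-per-token observed for CJK-heavy output.
    buffer_size = max(len(tokens) * 6, 32)

    buffer = ctypes.create_string_buffer(buffer_size)
    n = _llama_detokenize(tokens, buffer, buffer_size)  # hypothetical binding
    if n < 0:
        # A negative return means the buffer was too small; -n is the required
        # size. With the old estimate this retry ran constantly for CJK output;
        # with the scaled estimate it is rare.
        buffer_size = -n
        buffer = ctypes.create_string_buffer(buffer_size)
        n = _llama_detokenize(tokens, buffer, buffer_size)
    return buffer.raw[:n]
```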
The new estimate is still simple and integer-only, but it avoids most of the retry path for CJK-heavy outputs. In local profiling, this reduced detokenization overhead noticeably. In one py-spy comparison, `detokenize` time dropped from about 0.25s to 0.10s. In Streamlit-based testing, the same area also showed a clear reduction after the patch. Overall, I observed around a 3–5% improvement in CJK-heavy generation scenarios, depending on output length and runtime conditions.

This is not a large architectural change, but it removes a repeated cost that showed up clearly when generating long Chinese responses.
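If you want to check the effect on your own setup, a simple timing loop over the high-level API is enough to see the difference (the model path below is a placeholder):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # placeholder path
tokens = llm.tokenize("今天天气很好，我们去公园散步吧。".encode("utf-8"))

start = time.perf_counter()
for _ in range(1000):
    llm.detokenize(tokens)
elapsed = time.perf_counter() - start
print(f"1000 detokenize calls: {elapsed:.3f}s")
```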
Another small but useful patch in this release addresses the noisy CUDA Graph log output. Recent backend behavior can repeatedly print the same CUDA Graph status message during generation, which is especially visible with `verbose=True`.
These logs are not useful for most users during normal generation, and they can become distracting in applications such as Streamlit demos. This release adds a temporary filter in `ggml_log_callback` to suppress this specific noisy message. A more complete logger refactor is still planned for a future version, but this patch should make the current runtime output cleaner.
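The release's filter lives in the C++-side log callback; purely as a user-level illustration of the same idea, a similar suppression can be done from Python, assuming the `llama_log_callback` ctypes type and `llama_log_set` binding exposed by llama-cpp-python (the matched substring below is a placeholder, not the exact log text):

```python
import ctypes
import llama_cpp

# Placeholder fragment -- not the exact message the release filters.
NOISY_FRAGMENT = b"CUDA graph"

@llama_cpp.llama_log_callback
def filtered_log(level: int, text: bytes, user_data: ctypes.c_void_p) -> None:
    if NOISY_FRAGMENT in text:
        return  # drop the repetitive CUDA Graph status line
    print(text.decode("utf-8", errors="replace"), end="", flush=True)

# Install before creating any Llama instance; keep filtered_log referenced
# at module level so the ctypes callback is not garbage-collected.
llama_cpp.llama_log_set(filtered_log, ctypes.c_void_p(0))
```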
`LlamaGrammar.from_json_schema` and `json_schema_to_gbnf` now accept both string and dict schema inputs, and several upstream arguments are exposed through the public API. Some edge cases in JSON schema handling were also fixed, including empty or unconstrained schema objects and min/max integer generation when values are zero.
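As a quick usage sketch (the schema here is my own example, and I assume the module layout of current llama-cpp-python), both input forms now work:

```python
from llama_cpp.llama_grammar import LlamaGrammar, json_schema_to_gbnf

schema_dict = {"type": "object", "properties": {"name": {"type": "string"}}}
schema_str = '{"type": "object", "properties": {"name": {"type": "string"}}}'

# Both dict and string schemas are accepted now.
grammar_from_dict = LlamaGrammar.from_json_schema(schema_dict)
grammar_from_str = LlamaGrammar.from_json_schema(schema_str)

# The converter can also be called directly to inspect the generated GBNF.
print(json_schema_to_gbnf(schema_dict))
```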
The wiki has also been updated with a new `LlamaGrammar` page, and the wiki index has been refreshed.

Highlights
- Optimized CJK detokenization buffer sizing, removing most repeated `llama_detokenize` retry calls.
- Grammar parser synced with upstream ggml-org/llama.cpp@e48034dfc9e5705248fd39dc437ca887dc55a528.
- Patched the noisy CUDA Graph log output with a temporary filter.

Notes
The logger patch in this release is intentionally small. I still plan to revisit the logger design more thoroughly in a future version, including better verbosity control and cleaner integration with the underlying C++ logging behavior.
For more details, see the full comparison:
fe38cbf...fe8657a
— JamePeng