🔧[0.3.38] Release Note: Optimized CJK Detokenization, Sync Grammar Parser, and Patched CUDA Graph Logs #124
JamePeng announced in Announcements
🔧Release 0.3.38: Optimized CJK Detokenization, Sync Grammar Parser, and Patched CUDA Graph Logs
Hi everyone,
This release comes only two days after the previous version, but I decided to publish it because it includes a small set of practical fixes that are worth shipping, especially for CJK-heavy generation.
The main change in this version is an optimization to the detokenization buffer sizing logic. While testing Chinese/Japanese output, I found that the previous initial buffer estimate was too small for many CJK-heavy responses. The old logic used the token count itself as the initial byte buffer size, but in real CJK output the required byte size is usually much larger.
In my local tests, CJK-heavy outputs often required around 4.0x to 5.04x bytes per token, and some small-token edge cases reached about 6.0x.
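For intuition, here is a small illustration of my own (not code from the release) of why one byte per token falls far short for CJK text:

```python
# Illustration only: why a byte buffer sized to the token count is too small for CJK.
text = "今天天气很好"          # six CJK characters
utf8 = text.encode("utf-8")
print(len(text), len(utf8))    # 6 characters -> 18 bytes (3 bytes per character in UTF-8)

# If a tokenizer covers this text in, say, 4 tokens (a plausible CJK rate),
# detokenizing needs 18 bytes for 4 tokens, i.e. 4.5 bytes per token --
# far more than the 1 byte per token the old estimate assumed.
```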

Because of this, the previous implementation frequently called `llama_detokenize`, received a negative required size, allocated a larger buffer, and then called `llama_detokenize` again. In practice, this meant that many CJK detokenization calls were doing avoidable extra work. This release changes the initial buffer estimate from the raw token count to a scaled estimate that leaves headroom for multi-byte CJK output.
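Here is a minimal sketch of the old retry pattern and the new sizing, reconstructed from the description above; the binding name `_llama_detokenize` and the exact multiplier and constants are my assumptions, not the release's actual code:

```python
import ctypes

def detokenize(tokens: list[int]) -> bytes:
    # Old estimate: one byte per token -- almost always too small for CJK.
    # buffer_size = len(tokens)
    # New estimate (illustrative): scale by a factor large enough to cover
    # the ~4-6x bytes-per-token observed for CJK-heavy output.
    buffer_size = max(len(tokens) * 6, 32)

    buffer = ctypes.create_string_buffer(buffer_size)
    n = _llama_detokenize(tokens, buffer, buffer_size)  # hypothetical binding
    if n < 0:
        # A negative return means the buffer was too small; -n is the required
        # size. With the old estimate this retry ran constantly for CJK output;
        # with the scaled estimate it is rare.
        buffer_size = -n
        buffer = ctypes.create_string_buffer(buffer_size)
        n = _llama_detokenize(tokens, buffer, buffer_size)
    return buffer.raw[:n]
```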
The new estimate is still simple and integer-only, but it avoids most of the retry path for CJK-heavy outputs. In local profiling, this reduced detokenization overhead noticeably. In one py-spy comparison, `detokenize` time dropped from about 0.25s to 0.10s. In Streamlit-based testing, the same area also showed a clear reduction after the patch. Overall, I observed around a 3–5% improvement in CJK-heavy generation scenarios, depending on output length and runtime conditions.

This is not a large architectural change, but it removes a repeated cost that showed up clearly when generating long Chinese responses.
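If you want to check the effect on your own setup, a simple timing loop over the high-level API is enough to see the difference (the model path below is a placeholder):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # placeholder path
tokens = llm.tokenize("今天天气很好，我们去公园散步吧。".encode("utf-8"))

start = time.perf_counter()
for _ in range(1000):
    llm.detokenize(tokens)
elapsed = time.perf_counter() - start
print(f"1000 detokenize calls: {elapsed:.3f}s")
```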
Another small but useful patch in this release addresses the noisy CUDA Graph log output. Recent backend behavior can repeatedly print the same CUDA Graph status message during generation, which is especially visible with `verbose=True`.
These logs are not useful for most users during normal generation, and they can become distracting in applications such as Streamlit demos. This release adds a temporary filter in `ggml_log_callback` to suppress this specific noisy message. A more complete logger refactor is still planned for a future version, but this patch should make the current runtime output cleaner.
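The release's filter lives in the C++-side log callback; purely as a user-level illustration of the same idea, a similar suppression can be done from Python, assuming the `llama_log_callback` ctypes type and `llama_log_set` binding exposed by llama-cpp-python (the matched substring below is a placeholder, not the exact log text):

```python
import ctypes
import llama_cpp

# Placeholder fragment -- not the exact message the release filters.
NOISY_FRAGMENT = b"CUDA graph"

@llama_cpp.llama_log_callback
def filtered_log(level: int, text: bytes, user_data: ctypes.c_void_p) -> None:
    if NOISY_FRAGMENT in text:
        return  # drop the repetitive CUDA Graph status line
    print(text.decode("utf-8", errors="replace"), end="", flush=True)

# Install before creating any Llama instance; keep filtered_log referenced
# at module level so the ctypes callback is not garbage-collected.
llama_cpp.llama_log_set(filtered_log, ctypes.c_void_p(0))
```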
`LlamaGrammar.from_json_schema` and `json_schema_to_gbnf` now accept both string and dict schema inputs, and several upstream arguments are exposed through the public API. Some edge cases in JSON schema handling were also fixed, including empty or unconstrained schema objects and min/max integer generation when values are zero.
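As a quick usage sketch (the schema here is my own example, and I assume the module layout of current llama-cpp-python), both input forms now work:

```python
from llama_cpp.llama_grammar import LlamaGrammar, json_schema_to_gbnf

schema_dict = {"type": "object", "properties": {"name": {"type": "string"}}}
schema_str = '{"type": "object", "properties": {"name": {"type": "string"}}}'

# Both dict and string schemas are accepted now.
grammar_from_dict = LlamaGrammar.from_json_schema(schema_dict)
grammar_from_str = LlamaGrammar.from_json_schema(schema_str)

# The converter can also be called directly to inspect the generated GBNF.
print(json_schema_to_gbnf(schema_dict))
```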
The wiki has also been updated with a new `LlamaGrammar` page, and the wiki index has been refreshed.

Highlights
- Optimized CJK detokenization buffer sizing, removing most repeated `llama_detokenize` retry calls.
- Grammar parser synced with upstream ggml-org/llama.cpp@e48034dfc9e5705248fd39dc437ca887dc55a528.
- Patched the noisy CUDA Graph log output with a temporary filter.

Notes
The logger patch in this release is intentionally small. I still plan to revisit the logger design more thoroughly in a future version, including better verbosity control and cleaner integration with the underlying C++ logging behavior.
For more details, see the full comparison:
fe38cbf...fe8657a
— JamePeng