
[DSv4] add jinja chat template support#1153

Merged
Oseltamivir merged 1 commit into main from dsv4-chat-template
Apr 25, 2026

Conversation

@kedarpotdar-nv
Collaborator

@kedarpotdar-nv commented Apr 25, 2026

Summary

Adds a self-contained Jinja chat template encoder for DeepSeek-V4 (encoding_dsv4.py) and wires it into the benchmark serving pipeline as an alternative to the tokenizer's built-in chat template.

Changes

  • utils/bench_serving/encoding_dsv4.py (new, ~744 lines): Standalone DeepSeek-V4 encoding/decoding module that implements:

    • The full DSv4 chat prompt format (<bos><User>...<Assistant><think> framing)
    • Thinking mode support ("chat" vs "thinking") with configurable reasoning effort
    • Tool-calling encoding/decoding with DSML-based XML format
    • Multi-turn conversation context handling, tool message merging, and response parsing
    • Special token management (BOS/EOS, thinking tags, task tokens)
  • utils/bench_serving/benchmark_serving.py: Integrates the new encoder:

    • Adds --dsv4 CLI flag (under a new DeepSeek-V4 chat template options argument group)
    • New _apply_chat_template() helper that dispatches to either encoding_dsv4.encode_messages or the tokenizer's built-in jinja template
    • Threads the dsv4 flag through sample_random_requests and its multiprocessing worker chunks
    • Validates that --dsv4 requires --use-chat-template
  • benchmarks/benchmark_lib.sh: Adds --dsv4 as an optional flag to run_benchmark_serving():

    • Parses the new flag and auto-enables --use-chat-template when --dsv4 is passed
    • Forwards the flag to the Python benchmark script

Motivation

The HuggingFace tokenizer's built-in Jinja chat template may not perfectly match the prompt format DeepSeek-V4 expects (especially for thinking mode and tool-calling scenarios). This PR provides a faithful, self-contained encoder that reproduces the exact prompt structure the model was trained with, ensuring accurate benchmarking results.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@kedarpotdar-nv
Collaborator Author

@claude please generate a PR description

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 25, 2026

Claude finished @kedarpotdar-nv's task in 1m 0s


PR Description Generated

  • Analyze PR changes
  • Generate PR description

PR description has been updated with a summary of the changes, including details on the new encoding_dsv4.py module, the benchmark_serving.py integration, and the benchmark_lib.sh flag additions.

@kedarpotdar-nv
Collaborator Author

@claude please review code

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 25, 2026

Claude finished @kedarpotdar-nv's task in 1m 25s


Code Review: [DSv4] Jinja Chat Template Support

  • Read and analyze changed files
  • Review benchmarks/benchmark_lib.sh changes
  • Review utils/bench_serving/benchmark_serving.py changes
  • Review utils/bench_serving/encoding_dsv4.py (new file)
  • Post review findings

Summary

This PR adds a --dsv4 flag to the benchmark tooling so that random prompts can be formatted with the DeepSeek-V4 chat template (encoding_dsv4.py) instead of relying on the tokenizer's built-in Jinja template. The changes are well-structured: the shell library, the Python benchmark script, and the new encoder module all integrate cleanly.

Below are the issues and suggestions, ordered by severity.


Issues

1. Bare except in to_json — catches too much (Medium)

encoding_dsv4.py:105-106

def to_json(value: Any) -> str:
    try:
        return json.dumps(value, ensure_ascii=False)
    except:
        return json.dumps(value, ensure_ascii=True)

A bare except: will swallow KeyboardInterrupt, SystemExit, etc. This should be narrowed to at least except Exception:, or more specifically except (ValueError, TypeError, UnicodeEncodeError): since those are the only realistic failure modes for json.dumps.
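
A minimal sketch of the narrowed version, using the specific exception types suggested above (the function name and fallback behavior are taken from the snippet in this review):

```python
import json
from typing import Any

def to_json(value: Any) -> str:
    """Serialize to JSON, falling back to ASCII escaping on encoding failures.

    Narrowed from a bare `except:` so KeyboardInterrupt/SystemExit propagate.
    """
    try:
        return json.dumps(value, ensure_ascii=False)
    except (ValueError, TypeError, UnicodeEncodeError):
        return json.dumps(value, ensure_ascii=True)

print(to_json({"a": 1}))   # {"a": 1}
print(to_json("héllo"))    # "héllo"
```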


2. thinking_mode is hardcoded to "thinking" — should it be configurable? (Medium)

benchmark_serving.py:117-120

if dsv4:
    return dsv4_encode_messages(
        [{"role": "user", "content": prompt}],
        thinking_mode="thinking",
    )

The thinking_mode is hardcoded to "thinking" which adds the <think> start token to every prompt. For benchmarking non-thinking endpoints (or measuring non-thinking performance), there's no way to use "chat" mode. Consider exposing this as a CLI flag (e.g., --dsv4-thinking-mode {thinking,chat}, defaulting to "thinking"), or at least leave a comment explaining why "thinking" is always the correct choice for the DSv4 benchmarking use case.
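
A sketch of the suggested flag, using the hypothetical name `--dsv4-thinking-mode` proposed above (the exact flag name and argument group are up to the author; the default preserves current behavior):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dsv4", action="store_true")
parser.add_argument(
    "--dsv4-thinking-mode",
    choices=["thinking", "chat"],
    default="thinking",
    help="DSv4 thinking mode: 'thinking' appends the <think> start token, 'chat' does not.",
)

# Example invocation benchmarking a non-thinking endpoint:
args = parser.parse_args(["--dsv4", "--dsv4-thinking-mode", "chat"])
print(args.dsv4_thinking_mode)  # chat
```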

3. chat_template_len calculation may be inaccurate for DSv4 (Low-Medium)

benchmark_serving.py:189-192

chat_template_dummy = _apply_chat_template("a", tokenizer, dsv4)
tokenized_chat_template_dummy = tokenizer.encode(chat_template_dummy, add_special_tokens=False)
chat_template_len = len(tokenized_chat_template_dummy) - 1

The overhead computation subtracts 1 token (for the single-char dummy "a") then uses that to reduce the target input_len. This works for tokenizer Jinja templates where "a" is one token. But for the DSv4 encoder, the output includes BOS, <|User|>, <|Assistant|>, <think>, etc. — all as special tokens. Depending on how the tokenizer handles those special tokens in encode(), the overhead estimate could be off. Might be worth verifying this empirically with a DSv4 tokenizer.
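
One way to make the estimate robust is to subtract the dummy's own token count rather than assuming it is exactly one token. This is an illustrative sketch with a toy character-level tokenizer and a DSv4-style wrapper standing in for the real ones; only the arithmetic is the point:

```python
def chat_template_overhead(encode, apply_template, dummy="a"):
    """Template overhead = templated token count minus the dummy's own count.

    Avoids the hardcoded `- 1`, which silently assumes the dummy is one token.
    """
    return len(encode(apply_template(dummy))) - len(encode(dummy))

# Toy stand-ins: character-level "tokenizer" and a DSv4-style framing.
encode = list
template = lambda s: f"<bos><User>{s}<Assistant><think>"

print(chat_template_overhead(encode, template))               # 29
print(chat_template_overhead(encode, template, dummy="ab"))   # 29 (dummy-independent)
```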

4. Inconsistency: benchmark_lib.sh auto-enables --use-chat-template but benchmark_serving.py requires both flags (Low)

In benchmark_lib.sh:260-263, passing --dsv4 automatically sets use_chat_template=true:

--dsv4)
    dsv4=true
    use_chat_template=true
    shift
    ;;

But in benchmark_serving.py:185-186, --dsv4 without --use-chat-template raises an error:

if dsv4 and not use_chat_template:
    raise ValueError("--dsv4 requires --use-chat-template to be set.")

This means calling benchmark_serving.py --dsv4 directly (without going through the shell wrapper) will fail unless the user also passes --use-chat-template. Two options:

  • Option A: Have the Python side auto-enable use_chat_template when dsv4 is set (consistent with the shell behavior).
  • Option B: Keep the current validation but document the inconsistency in the --dsv4 help text (e.g., "When using benchmark_lib.sh, --use-chat-template is auto-enabled.").

Option A is cleaner — auto-enable in Python to match the shell.
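
Option A could look roughly like this on the Python side (a sketch, assuming argparse-style flags matching the names in this PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dsv4", action="store_true")
parser.add_argument("--use-chat-template", action="store_true")

args = parser.parse_args(["--dsv4"])
if args.dsv4 and not args.use_chat_template:
    # Auto-enable instead of raising, matching benchmark_lib.sh's behavior.
    args.use_chat_template = True

print(args.use_chat_template)  # True
```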



Nits / Minor Suggestions

5. Inconsistent variable naming: P_dsml_strs vs p_dsml_template

encoding_dsv4.py:150-151: P_dsml_strs uses an uppercase prefix while p_dsml_template uses lowercase. Prefer consistent casing (p_dsml_strs).

6. Large file scope — only encode_messages is used

encoding_dsv4.py is 745 lines, but only encode_messages is imported. The file includes full parsing/decoding machinery (parse_tool_calls, parse_message_from_completion_text, etc.) that isn't used by the benchmark. This is fine if the intent is to keep it as a self-contained reusable module, but worth noting that ~300+ lines of parsing code are dead weight for the current use case.

7. No unit tests for the new encoder

There are no tests for encoding_dsv4.py. For a 745-line encoder with non-trivial parsing logic, even a few smoke tests (e.g., round-trip encode/parse, edge cases like empty messages, multi-turn with tool calls) would increase confidence.
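
A hypothetical shape for such smoke tests. The `encode_messages` body below is an inline stub so the sketch runs standalone; the real tests would import from `encoding_dsv4` and assert against the actual `<bos><User>...<Assistant><think>` framing described in the PR:

```python
# Stub standing in for encoding_dsv4.encode_messages (signature assumed from the PR).
def encode_messages(messages, thinking_mode="thinking"):
    parts = ["<bos>"]
    for m in messages:
        tag = "<User>" if m["role"] == "user" else "<Assistant>"
        parts.append(f"{tag}{m['content']}")
    parts.append("<Assistant>")
    if thinking_mode == "thinking":
        parts.append("<think>")
    return "".join(parts)

def test_single_turn_thinking():
    out = encode_messages([{"role": "user", "content": "hi"}])
    assert out.startswith("<bos>") and out.endswith("<think>")

def test_chat_mode_omits_think():
    out = encode_messages([{"role": "user", "content": "hi"}], thinking_mode="chat")
    assert "<think>" not in out

def test_empty_messages():
    # Edge case: no turns still yields the generation prefix.
    assert encode_messages([]) == "<bos><Assistant><think>"

test_single_turn_thinking()
test_chat_mode_omits_think()
test_empty_messages()
```

Round-trip tests (encode then parse via parse_message_from_completion_text) would also exercise the otherwise-unused decoding half of the module.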


What Looks Good

  • Shell integration in benchmark_lib.sh is clean — the --dsv4 flag follows the exact same pattern as --use-chat-template and --trust-remote-code.
  • Multiprocessing support is correct — the dsv4 flag is properly threaded through both the parallel chunk args tuple and the serial fallback path.
  • Validation guard (dsv4 and not use_chat_template) is a good defensive check.
  • The _apply_chat_template abstraction keeps the diff minimal and avoids scattering conditional logic across multiple call sites.

Collaborator

@Oseltamivir left a comment


lgtm

@Oseltamivir merged commit b562888 into main Apr 25, 2026
6 checks passed
@Oseltamivir deleted the dsv4-chat-template branch April 25, 2026 20:39