
[DSv4] add jinja chat template support#1153

Merged
Oseltamivir merged 1 commit into main from dsv4-chat-template
Apr 25, 2026

Conversation

@kedarpotdar-nv
Collaborator

@kedarpotdar-nv commented Apr 25, 2026

Summary

Adds a self-contained Jinja chat template encoder for DeepSeek-V4 (encoding_dsv4.py) and wires it into the benchmark serving pipeline as an alternative to the tokenizer's built-in chat template.

Changes

  • utils/bench_serving/encoding_dsv4.py (new, ~744 lines): Standalone DeepSeek-V4 encoding/decoding module that implements:

    • The full DSv4 chat prompt format (<bos><User>...<Assistant><think> framing)
    • Thinking mode support ("chat" vs "thinking") with configurable reasoning effort
    • Tool-calling encoding/decoding with DSML-based XML format
    • Multi-turn conversation context handling, tool message merging, and response parsing
    • Special token management (BOS/EOS, thinking tags, task tokens)
  • utils/bench_serving/benchmark_serving.py: Integrates the new encoder:

    • Adds --dsv4 CLI flag (under a new DeepSeek-V4 chat template options argument group)
    • New _apply_chat_template() helper that dispatches to either encoding_dsv4.encode_messages or the tokenizer's built-in jinja template
    • Threads the dsv4 flag through sample_random_requests and its multiprocessing worker chunks
    • Validates that --dsv4 requires --use-chat-template
  • benchmarks/benchmark_lib.sh: Adds --dsv4 as an optional flag to run_benchmark_serving():

    • Parses the new flag and auto-enables --use-chat-template when --dsv4 is passed
    • Forwards the flag to the Python benchmark script

Motivation

The HuggingFace tokenizer's built-in Jinja chat template may not perfectly match the prompt format DeepSeek-V4 expects (especially for thinking mode and tool-calling scenarios). This PR provides a faithful, self-contained encoder that reproduces the exact prompt structure the model was trained with, ensuring accurate benchmarking results.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@kedarpotdar-nv
Collaborator Author

@claude please generate a PR description

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 25, 2026

Claude finished @kedarpotdar-nv's task in 1m 0s


PR Description Generated

  • Analyze PR changes
  • Generate PR description

PR description has been updated with a summary of the changes, including details on the new encoding_dsv4.py module, the benchmark_serving.py integration, and the benchmark_lib.sh flag additions.

@kedarpotdar-nv
Collaborator Author

@claude please review code

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 25, 2026

Claude finished @kedarpotdar-nv's task in 1m 25s


Code Review: [DSv4] Jinja Chat Template Support

  • Read and analyze changed files
  • Review benchmarks/benchmark_lib.sh changes
  • Review utils/bench_serving/benchmark_serving.py changes
  • Review utils/bench_serving/encoding_dsv4.py (new file)
  • Post review findings

Summary

This PR adds a --dsv4 flag to the benchmark tooling so that random prompts can be formatted with the DeepSeek-V4 chat template (encoding_dsv4.py) instead of relying on the tokenizer's built-in Jinja template. The changes are well-structured: the shell library, the Python benchmark script, and the new encoder module all integrate cleanly.

Below are the issues and suggestions, ordered by severity.


Issues

1. Bare except in to_json — catches too much (Medium)

encoding_dsv4.py:105-106

def to_json(value: Any) -> str:
    try:
        return json.dumps(value, ensure_ascii=False)
    except:
        return json.dumps(value, ensure_ascii=True)

A bare except: will swallow KeyboardInterrupt, SystemExit, etc. This should be narrowed to at least except Exception:, or more specifically except (ValueError, TypeError, UnicodeEncodeError): since those are the only realistic failure modes for json.dumps.
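
A minimal sketch of the narrowed version, using the specific exception types suggested above (the function name and fallback behavior are taken from the snippet in this review):

```python
import json
from typing import Any

def to_json(value: Any) -> str:
    """Serialize to JSON, falling back to ASCII escaping on encoding failures.

    Narrowed from a bare `except:` so KeyboardInterrupt/SystemExit propagate.
    """
    try:
        return json.dumps(value, ensure_ascii=False)
    except (ValueError, TypeError, UnicodeEncodeError):
        return json.dumps(value, ensure_ascii=True)

print(to_json({"a": 1}))   # {"a": 1}
print(to_json("héllo"))    # "héllo"
```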


2. thinking_mode is hardcoded to "thinking" — should it be configurable? (Medium)

benchmark_serving.py:117-120

if dsv4:
    return dsv4_encode_messages(
        [{"role": "user", "content": prompt}],
        thinking_mode="thinking",
    )

The thinking_mode is hardcoded to "thinking" which adds the <think> start token to every prompt. For benchmarking non-thinking endpoints (or measuring non-thinking performance), there's no way to use "chat" mode. Consider exposing this as a CLI flag (e.g., --dsv4-thinking-mode {thinking,chat}, defaulting to "thinking"), or at least leave a comment explaining why "thinking" is always the correct choice for the DSv4 benchmarking use case.
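
A sketch of the suggested flag, using the hypothetical name `--dsv4-thinking-mode` proposed above (the exact flag name and argument group are up to the author; the default preserves current behavior):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dsv4", action="store_true")
parser.add_argument(
    "--dsv4-thinking-mode",
    choices=["thinking", "chat"],
    default="thinking",
    help="DSv4 thinking mode: 'thinking' appends the <think> start token, 'chat' does not.",
)

# Example invocation benchmarking a non-thinking endpoint:
args = parser.parse_args(["--dsv4", "--dsv4-thinking-mode", "chat"])
print(args.dsv4_thinking_mode)  # chat
```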

3. chat_template_len calculation may be inaccurate for DSv4 (Low-Medium)

benchmark_serving.py:189-192

chat_template_dummy = _apply_chat_template("a", tokenizer, dsv4)
tokenized_chat_template_dummy = tokenizer.encode(chat_template_dummy, add_special_tokens=False)
chat_template_len = len(tokenized_chat_template_dummy) - 1

The overhead computation subtracts 1 token (for the single-char dummy "a") then uses that to reduce the target input_len. This works for tokenizer Jinja templates where "a" is one token. But for the DSv4 encoder, the output includes BOS, <|User|>, <|Assistant|>, <think>, etc. — all as special tokens. Depending on how the tokenizer handles those special tokens in encode(), the overhead estimate could be off. Might be worth verifying this empirically with a DSv4 tokenizer.
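
One way to make the estimate robust is to subtract the dummy's own token count rather than assuming it is exactly one token. This is an illustrative sketch with a toy character-level tokenizer and a DSv4-style wrapper standing in for the real ones; only the arithmetic is the point:

```python
def chat_template_overhead(encode, apply_template, dummy="a"):
    """Template overhead = templated token count minus the dummy's own count.

    Avoids the hardcoded `- 1`, which silently assumes the dummy is one token.
    """
    return len(encode(apply_template(dummy))) - len(encode(dummy))

# Toy stand-ins: character-level "tokenizer" and a DSv4-style framing.
encode = list
template = lambda s: f"<bos><User>{s}<Assistant><think>"

print(chat_template_overhead(encode, template))               # 29
print(chat_template_overhead(encode, template, dummy="ab"))   # 29 (dummy-independent)
```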

4. Inconsistency: benchmark_lib.sh auto-enables --use-chat-template but benchmark_serving.py requires both flags (Low)

In benchmark_lib.sh:260-263, passing --dsv4 automatically sets use_chat_template=true:

--dsv4)
    dsv4=true
    use_chat_template=true
    shift
    ;;

But in benchmark_serving.py:185-186, --dsv4 without --use-chat-template raises an error:

if dsv4 and not use_chat_template:
    raise ValueError("--dsv4 requires --use-chat-template to be set.")

This means calling benchmark_serving.py --dsv4 directly (without going through the shell wrapper) will fail unless the user also passes --use-chat-template. Two options:

  • Option A: Have the Python side auto-enable use_chat_template when dsv4 is set (consistent with the shell behavior).
  • Option B: Keep the current validation but document the inconsistency in the --dsv4 help text (e.g., "When using benchmark_lib.sh, --use-chat-template is auto-enabled.").

Option A is cleaner — auto-enable in Python to match the shell.
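
Option A could look roughly like this on the Python side (a sketch, assuming argparse-style flags matching the names in this PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dsv4", action="store_true")
parser.add_argument("--use-chat-template", action="store_true")

args = parser.parse_args(["--dsv4"])
if args.dsv4 and not args.use_chat_template:
    # Auto-enable instead of raising, matching benchmark_lib.sh's behavior.
    args.use_chat_template = True

print(args.use_chat_template)  # True
```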



Nits / Minor Suggestions

5. Inconsistent variable naming: P_dsml_strs vs p_dsml_template

encoding_dsv4.py:150-151: P_dsml_strs uses an uppercase prefix while p_dsml_template uses lowercase. Prefer consistent casing (p_dsml_strs).

6. Large file scope — only encode_messages is used

encoding_dsv4.py is 745 lines, but only encode_messages is imported. The file includes full parsing/decoding machinery (parse_tool_calls, parse_message_from_completion_text, etc.) that isn't used by the benchmark. This is fine if the intent is to keep it as a self-contained reusable module, but worth noting that ~300+ lines of parsing code are dead weight for the current use case.

7. No unit tests for the new encoder

There are no tests for encoding_dsv4.py. For a 745-line encoder with non-trivial parsing logic, even a few smoke tests (e.g., round-trip encode/parse, edge cases like empty messages, multi-turn with tool calls) would increase confidence.
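
A hypothetical shape for such smoke tests. The `encode_messages` body below is an inline stub so the sketch runs standalone; the real tests would import from `encoding_dsv4` and assert against the actual `<bos><User>...<Assistant><think>` framing described in the PR:

```python
# Stub standing in for encoding_dsv4.encode_messages (signature assumed from the PR).
def encode_messages(messages, thinking_mode="thinking"):
    parts = ["<bos>"]
    for m in messages:
        tag = "<User>" if m["role"] == "user" else "<Assistant>"
        parts.append(f"{tag}{m['content']}")
    parts.append("<Assistant>")
    if thinking_mode == "thinking":
        parts.append("<think>")
    return "".join(parts)

def test_single_turn_thinking():
    out = encode_messages([{"role": "user", "content": "hi"}])
    assert out.startswith("<bos>") and out.endswith("<think>")

def test_chat_mode_omits_think():
    out = encode_messages([{"role": "user", "content": "hi"}], thinking_mode="chat")
    assert "<think>" not in out

def test_empty_messages():
    # Edge case: no turns still yields the generation prefix.
    assert encode_messages([]) == "<bos><Assistant><think>"

test_single_turn_thinking()
test_chat_mode_omits_think()
test_empty_messages()
```

Round-trip tests (encode then parse via parse_message_from_completion_text) would also exercise the otherwise-unused decoding half of the module.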


What Looks Good

  • Shell integration in benchmark_lib.sh is clean — the --dsv4 flag follows the exact same pattern as --use-chat-template and --trust-remote-code.
  • Multiprocessing support is correct — the dsv4 flag is properly threaded through both the parallel chunk args tuple and the serial fallback path.
  • Validation guard (dsv4 and not use_chat_template) is a good defensive check.
  • The _apply_chat_template abstraction keeps the diff minimal and avoids scattering conditional logic across multiple call sites.

Collaborator

@Oseltamivir left a comment


lgtm

@Oseltamivir merged commit b562888 into main Apr 25, 2026
6 checks passed
@Oseltamivir deleted the dsv4-chat-template branch April 25, 2026 20:39