[bugfix] Add postprocessor to extract last option for MMMU datasets #238

Merged: SJTUyh merged 1 commit into AISBench:master from SijieFu:fix/mmmu-eval on May 6, 2026

@SijieFu SijieFu commented Apr 10, 2026

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Relates to #224

🔍 Motivation

Previously, when benchmarking accuracy on the MMMU dataset, postprocessing of the model's raw predictions was missing or incorrect: the postprocessed predictions were identical to the original predictions. As a result, even when the model output the correct answer, the evaluation still counted it as incorrect, producing a low accuracy score. For example:

"0": {
            "prompt": [
                {
                    "role": "HUMAN",
                    "prompt": [
                        {
                            "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nEach of the following situations relates to a different company. ",
                            "type": "text"
                        },
                        {
                            "image_url": {
                                "url": "file://{project_root}/ais_bench/datasets/mmmu/MMMU_images/1_1.jpg"
                            },
                            "type": "image_url"
                        },
                        {
                            "text": " For company B, find the missing amounts.A. $63,020\nB. $58,410\nC. $71,320\nD. $77,490\n",
                            "type": "text"
                        }
                    ]
                }
            ],
            "origin_prediction": "Let's solve this step by step.\n\n......\n\nANSWER: D",
            "predictions": [
                "Let's solve this step by step.\n\n......\n\nANSWER: D"
            ],
            "references": [
                {
                    "answer": "D",
                    "category": "Business",
                    "choices": "{\"A\": \"$63,020\", \"B\": \"$58,410\", \"C\": \"$71,320\", \"D\": \"$77,490\"}",
                    "l2-category": "Accounting",
                    "split": "dev"
                }
            ],
            "correct": [
                false
            ]
        }
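
The effect can be sketched in a few lines. This assumes the evaluator scores by strict string equality between the prediction and the reference answer, which is an inference from the record above and not a statement about how `MMMUEvaluator` is actually implemented; `is_correct` is a hypothetical helper.

```python
def is_correct(prediction: str, reference: str) -> bool:
    # Assumed scoring rule: strict equality after trimming whitespace.
    return prediction.strip() == reference.strip()


raw_prediction = "Let's solve this step by step.\n\n......\n\nANSWER: D"

# Without postprocessing, the full response is compared against "D".
unprocessed_result = is_correct(raw_prediction, "D")

# With the option letter extracted first, the comparison succeeds.
processed_result = is_correct("D", "D")

print(unprocessed_result, processed_result)
```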

📝 Modification

This PR adds a pred_postprocessor (last_option_postprocess) to the MMMU and MMMU-Pro evaluation configurations.

📐 Associated Test Results

Instead of repeating the original prediction, the postprocessed prediction is now a single choice letter:

"0": {
            "prompt": [
                {
                    "role": "HUMAN",
                    "prompt": [
                        {
                            "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nEach of the following situations relates to a different company. ",
                            "type": "text"
                        },
                        {
                            "image_url": {
                                "url": "file://{project_root}/ais_bench/datasets/mmmu/MMMU_images/1_1.jpg"
                            },
                            "type": "image_url"
                        },
                        {
                            "text": " For company B, find the missing amounts.A. $63,020\nB. $58,410\nC. $71,320\nD. $77,490\n",
                            "type": "text"
                        }
                    ]
                }
            ],
            "origin_prediction": "Let's solve this step by step.\n\n......\n\nANSWER: D",
            "predictions": [
                "D"
            ],
            "references": [
                {
                    "answer": "D",
                    "category": "Business",
                    "choices": "{\"A\": \"$63,020\", \"B\": \"$58,410\", \"C\": \"$71,320\", \"D\": \"$77,490\"}",
                    "l2-category": "Accounting",
                    "split": "dev"
                }
            ],
            "correct": [
                true
            ]
        }

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

Copilot AI review requested due to automatic review settings April 10, 2026 07:08

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the last_option_postprocess function into the evaluation configurations for the MMMU and MMMU-Pro datasets. The review identified several issues with this approach: the postprocessor is highly susceptible to false positives because it has no word-boundary checks, and it is case-sensitive, so lowercase answers are missed. This is especially concerning for Chain-of-Thought (CoT) responses and for datasets using extended option sets, where common letters like 'I' might be incorrectly matched within the model's reasoning or standard text. It is recommended to implement more robust extraction logic that targets specific answer patterns and to ensure the function is correctly registered within the project's postprocessor registry.

  mmmu_eval_cfg = dict(
-     evaluator=dict(type=MMMUEvaluator)
+     evaluator=dict(type=MMMUEvaluator),
+     pred_postprocessor=dict(type=last_option_postprocess, options="ABCD"),

high

The last_option_postprocess function is highly susceptible to false positives because its implementation (re.findall(rf'([{options}])', text)) matches any occurrence of the option letters, even when they are part of other words (e.g., matching 'B' in 'Business'). This is risky as MMMU outputs often contain category names or reasoning that may include these letters. Additionally, the postprocessor is case-sensitive and will fail to extract lowercase answers. Consider improving the postprocessor to use word boundaries (e.g., rf'\b([{options}])\b') and case-insensitivity. Also, note that last_option_postprocess is not registered in the TEXT_POSTPROCESSORS registry, which is inconsistent with other postprocessors in the project.
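
To make the failure mode concrete, here is a minimal sketch. `naive_last_option` is an assumed stand-in built from the regex quoted above, not the project's actual function, and `bounded_last_option` is a hypothetical variant that adds word boundaries and case-insensitive matching. Returning an empty string when nothing matches is also an assumption.

```python
import re


def naive_last_option(text: str, options: str) -> str:
    """Last occurrence of any option letter anywhere in the text."""
    matches = re.findall(rf"([{options}])", text)
    return matches[-1] if matches else ""


def bounded_last_option(text: str, options: str) -> str:
    """Hypothetical variant: standalone letters only, case-insensitive."""
    matches = re.findall(rf"\b([{options}])\b", text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else ""


response = "ANSWER: D\nCategory: Business"
print(naive_last_option(response, "ABCD"))    # picks up the "B" in "Business"
print(bounded_last_option(response, "ABCD"))  # picks up the standalone "D"
```

Word boundaries alone are not a complete fix: with case-insensitive matching, the English article "a" and the pronoun "I" are still standalone tokens that match.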

  mmmu_eval_cfg = dict(
-     evaluator=dict(type=MMMUEvaluator)
+     evaluator=dict(type=MMMUEvaluator),
+     pred_postprocessor=dict(type=last_option_postprocess, options="ABCD"),

high

For Chain-of-Thought (CoT) responses, using a naive last_option_postprocess is particularly risky as the model might mention multiple options during reasoning. Since the prompt explicitly requests the format ANSWER: [LETTER], it would be much more robust to use a postprocessor that specifically targets this pattern rather than just finding the last occurrence of any option letter. Furthermore, the current implementation is case-sensitive and can match letters inside words.
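
One possible shape for such a pattern-targeted extractor is sketched below. `extract_answer_letter` is a hypothetical name, and the exact regex (optional parentheses, case-insensitive keyword, last match wins, empty string when no match) is an assumption rather than anything defined in this PR or the project.

```python
import re


def extract_answer_letter(text: str, options: str = "ABCD") -> str:
    """Return the letter from the last 'ANSWER: X' occurrence, or '' if none."""
    # The negative lookahead rejects letters that start a longer word,
    # e.g. the "B" in "answer: Business".
    pattern = rf"answer\s*:\s*\(?([{options}])\)?(?![A-Za-z])"
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else ""


cot = "Option A looks plausible, but B fits better.\n\nANSWER: D"
print(extract_answer_letter(cot))

# Ten-option case: the pronoun "I" in the reasoning is ignored.
pro = "I first picked E, but I was wrong.\nAnswer: g"
print(extract_answer_letter(pro, "ABCDEFGHIJ"))
```

A real integration would still need a fallback for responses that ignore the requested format; whether to fall back to a last-letter search or to score such responses as wrong is a policy choice this sketch leaves open.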

  mmmu_pro_eval_cfg = dict(
-     evaluator=dict(type=MMMUProCotEvaluator)
+     evaluator=dict(type=MMMUProCotEvaluator),
+     pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),

high

The risk of false positives with last_option_postprocess is even higher here because the options string ABCDEFGHIJ includes the letter 'I', which frequently appears as a standalone pronoun or within words in model responses. This naive extraction will likely return 'I' as the answer if it appears after the actual choice. A more specific pattern matching the requested 'Answer: $LETTER' format should be used instead.

  mmmu_pro_eval_cfg = dict(
-     evaluator=dict(type=MMMUProEvaluator)
+     evaluator=dict(type=MMMUProEvaluator),
+     pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),

high

The last_option_postprocess function is prone to incorrect extractions, especially with the extended option set ABCDEFGHIJ. It can match letters within words and is case-sensitive. Given that MMMU-Pro evaluation relies on precise option extraction, consider using a more robust method that enforces word boundaries and handles case-insensitivity.

  mmmu_pro_eval_cfg = dict(
-     evaluator=dict(type=MMMUProCotEvaluator)
+     evaluator=dict(type=MMMUProCotEvaluator),
+     pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),

high

Using last_option_postprocess for CoT evaluation is risky as it may pick up letters from the model's reasoning steps. Since the prompt specifies a clear format ('Answer: $LETTER'), the postprocessor should ideally look for that specific pattern to ensure accuracy, rather than simply taking the last matching character in the text.

  mmmu_pro_eval_cfg = dict(
-     evaluator=dict(type=MMMUProEvaluator)
+     evaluator=dict(type=MMMUProEvaluator),
+     pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),

high

The last_option_postprocess implementation is too naive for reliable evaluation. It lacks word boundary checks and case-insensitivity, which can lead to incorrect results if the model output contains words like 'Business' or 'Image' (containing 'B' or 'I'). A more robust extraction logic is recommended.

Copilot AI left a comment

Pull request overview

Adds prediction postprocessing to MMMU-family dataset evaluation configs so that generated answers are reduced to a final multiple-choice option letter before scoring, fixing cases where verbose model outputs were previously mis-scored.

Changes:

  • Add last_option_postprocess as pred_postprocessor for MMMU (A–D) evaluation configs.
  • Add last_option_postprocess as pred_postprocessor for MMMU-Pro evaluation configs (A–J).
  • Import the postprocessor into the affected dataset config files.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| ----- | ----- |
| ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen.py | Adds eval-time pred_postprocessor to extract final A–D option. |
| ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py | Adds eval-time pred_postprocessor to extract final A–D option for CoT outputs. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_vision_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro vision. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_vision_cot_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro vision CoT. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_options10_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro options10. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_options10_cot_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro options10 CoT. |


wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 17, 2026
…cture in xlite backend (#8046)

This PR is the result of the last two commits on
`feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses
`config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map
and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite
initialization flow while still applying the MoE-specific config and
weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the
`Qwen3VLMoeForConditionalGeneration` architecture. The previous
implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE
variants, so Qwen3-VL MoE models were not routed into the xlite path.

At the same time, the shared rotary embedding precomputation was
normalized to use `config.rope_head_dim`, which keeps the base
`initialize` implementation generic and avoids duplicated
subclass-specific logic.

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration`
architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and
the vLLM server started successfully with the xlite backend, including
launching the worker process and processing requests without crashing
(see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`,
`Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`)
were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request

# using Docker image for Ascend A3

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
--host 0.0.0.0 \
--port 8000 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--tensor-parallel-size 16 \
--served-model-name mymodel \
--max-num-seqs 16 \
--max-model-len 40960 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9 \
--block-size 128 \
--allowed-local-media-path /path/to/media \
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

# single request example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": 0.0
}'
```

Further tests were performed to confirm the accuracy of the outputs
using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`,
multiple datasets were evaluated, including the multimodal dataset `mmmu`.
The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
aime2024_gen_0_shot_chat_prompt \
ceval_gen_0_shot_cot_chat_prompt \
gpqa_gen_0_shot_cot_chat_prompt \
gsm8k_gen_0_shot_cot_chat_prompt \
math_prm800k_500_0shot_cot_gen \
mmlu_gen_0_shot_cot_chat_prompt \
mmmu_gen_cot \
--work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
--mode all \
--dump-eval-details --merge-ds \
--max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding
`ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`
was modified as follows:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure correct evaluation on the `mmmu` dataset.


For other previously supported models, their accuracies were also
validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results confirm that the xlite backend is functioning correctly
for the `Qwen3VLMoeForConditionalGeneration` architecture, with no
regressions observed for previously supported models.

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
keyi-zz pushed a commit to keyi-zz/vllm-ascend that referenced this pull request Apr 20, 2026
…cture in xlite backend (vllm-project#8046)

Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Apr 20, 2026
…cture in xlite backend (vllm-project#8046)

This PR is the result of the last two commits on
`feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses
`config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map
and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite
initialization flow while still applying the MoE-specific config and
weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the
`Qwen3VLMoeForConditionalGeneration` architecture. The previous
implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE
variants, so Qwen3-VL MoE models were not routed into the xlite path.

At the same time, the shared rotary embedding precomputation was
normalized to use `config.rope_head_dim`, which keeps the base
`initialize` implementation generic and avoids duplicated
subclass-specific logic.

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration`
architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and
the vLLM server started successfully with the xlite backend, including
launching the worker process and processing requests without crashing
(see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`,
`Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`)
were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request

# using Docker image for Ascend A3

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
--host 0.0.0.0 \
--port 8000 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--tensor-parallel-size 16 \
--served-model-name mymodel \
--max-num-seqs 16 \
--max-model-len 40960 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9 \
--block-size 128 \
--allowed-local-media-path /path/to/media \
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' \

# single request example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": "0.0"
}'
```

Further tests were performed to confirm the accuracy of the outputs
using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`,
multiple datasets were evaluated, including a multimodal dataset `mmmu`.
The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
aime2024_gen_0_shot_chat_prompt \
ceval_gen_0_shot_cot_chat_prompt \
gpqa_gen_0_shot_cot_chat_prompt \
gsm8k_gen_0_shot_cot_chat_prompt \
math_prm800k_500_0shot_cot_gen \
mmlu_gen_0_shot_cot_chat_prompt \
mmmu_gen_cot \
--work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
--mode all \
--dump-eval-details --merge-ds \
--max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding
`ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`
was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen
official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7
(aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3
(LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official:
<https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> Bug fix from <AISBench/benchmark#238> were
also merged to ensure the evaluation correctness for the `mmmu` dataset.


For other previously supported models, their accuracies were also
validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results indicate that the xlite backend functions correctly
for the `Qwen3VLMoeForConditionalGeneration` architecture, with no
regressions observed on previously supported models.

- vLLM version: v0.18.0
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
anning-2026 pushed a commit to anning-2026/vllm-ascend that referenced this pull request Apr 21, 2026
…cture in xlite backend (vllm-project#8046)

This PR is the result of the last two commits on
`feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses
`config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map
and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite
initialization flow while still applying the MoE-specific config and
weight wiring in the subclass.
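The routing described above can be pictured as a lookup from architecture name to model class. The following is a minimal sketch under stated assumptions: `XLITE_STRATEGY_MAP`, `resolve_xlite_model`, and the `num_experts` key are hypothetical names for illustration, and the class bodies are stand-ins rather than the actual vllm-ascend implementation. Only the class and architecture names come from this PR.

```python
class LlamaXliteModel:
    """Stand-in for the shared base class; the body is illustrative only."""

    def initialize(self, config: dict) -> dict:
        # Shared setup that every routed architecture reuses.
        return {"model": type(self).__name__, "moe": False}


class QwenMoeXliteModel(LlamaXliteModel):
    """Stand-in for the MoE subclass; adds MoE-specific wiring on top."""

    def initialize(self, config: dict) -> dict:
        state = super().initialize(config)
        state["moe"] = True
        # "num_experts" is a hypothetical config key used for illustration.
        state["num_experts"] = config["num_experts"]
        return state


# Hypothetical registry: architecture name -> xlite model class.
XLITE_STRATEGY_MAP = {
    "Qwen3MoeForCausalLM": QwenMoeXliteModel,
    # The entry this PR describes adding:
    "Qwen3VLMoeForConditionalGeneration": QwenMoeXliteModel,
}


def resolve_xlite_model(architecture: str) -> LlamaXliteModel:
    """Return a model instance for the architecture, or raise if unsupported."""
    try:
        return XLITE_STRATEGY_MAP[architecture]()
    except KeyError:
        raise ValueError(f"{architecture} is not routed to xlite") from None
```

With a registry of this shape, supporting a new architecture that shares an existing layout amounts to adding one map entry.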

### What this PR does / why we need it?

This PR extends xlite backend coverage to the
`Qwen3VLMoeForConditionalGeneration` architecture. The previous
implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE
variants, so Qwen3-VL MoE models were not routed into the xlite path.

At the same time, the shared rotary embedding precomputation was
normalized to use `config.rope_head_dim`, which keeps the base
`initialize` implementation generic and avoids duplicated
subclass-specific logic.
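To illustrate why reading a single `rope_head_dim` field keeps the base path generic, here is a rough sketch of a rotary-embedding table precomputation. It uses the standard RoPE frequency formula; `SketchConfig`, `precompute_rope`, `rope_theta`, and `max_position` are assumptions made for this example and do not reflect the actual xlite code, which is not shown in this PR description.

```python
import math
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Each subclass is assumed to populate rope_head_dim for its architecture.
    rope_head_dim: int
    rope_theta: float = 10000.0
    max_position: int = 4


def precompute_rope(config: SketchConfig):
    """Build cos/sin tables of shape [max_position][rope_head_dim // 2]."""
    half = config.rope_head_dim // 2
    # Standard RoPE inverse frequencies: theta^(-2i / d) for i in [0, d/2).
    inv_freq = [
        config.rope_theta ** (-2.0 * i / config.rope_head_dim) for i in range(half)
    ]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(config.max_position)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(config.max_position)]
    return cos, sin


cos_table, sin_table = precompute_rope(SketchConfig(rope_head_dim=128))
```

Because the function depends only on the config field, no subclass needs its own copy of this setup.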

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration`
architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct` (routed to `QwenMoeXliteModel`) was
tested: the vLLM server started successfully with the xlite backend,
launched the worker process, and processed requests without crashing
(see the script below). Previously supported models
(`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`,
`GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request

# using Docker image for Ascend A3

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
--host 0.0.0.0 \
--port 8000 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--tensor-parallel-size 16 \
--served-model-name mymodel \
--max-num-seqs 16 \
--max-model-len 40960 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9 \
--block-size 128 \
--allowed-local-media-path /path/to/media \
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

# single request example (run from a second shell once the server is up)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": 0.0
}'
```
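The same smoke test can be driven from Python. This is a minimal sketch using only the standard library; it assumes the server above is reachable at `http://localhost:8000` and returns the usual OpenAI-style chat completion shape. `build_chat_payload` and `send_chat_request` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="mymodel", max_tokens=128, temperature=0.0):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,  # sent as a JSON number
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload and return the text of the first choice."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload("Tell me how to sleep well at night.")
# Call send_chat_request(payload) once the server is running.
```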

Further tests were performed to confirm the accuracy of the outputs
using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`,
multiple datasets were evaluated, including the multimodal dataset `mmmu`.
The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
aime2024_gen_0_shot_chat_prompt \
ceval_gen_0_shot_cot_chat_prompt \
gpqa_gen_0_shot_cot_chat_prompt \
gsm8k_gen_0_shot_cot_chat_prompt \
math_prm800k_500_0shot_cot_gen \
mmlu_gen_0_shot_cot_chat_prompt \
mmmu_gen_cot \
--work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
--mode all \
--dump-eval-details --merge-ds \
--max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding
`ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`
was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```
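As a rough illustration of what a prediction postprocessor of this kind is for, the sketch below strips a reasoning span before the answer is scored. It assumes reasoning is delimited by `<think>` and `</think>` tags; `strip_reasoning` is a hypothetical stand-in, not the `extract_non_reasoning_content` implementation shipped with the benchmark.

```python
import re

# Assumed delimiters for the reasoning span; real models may use other tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Return the response with any reasoning span removed."""
    cleaned = _THINK_BLOCK.sub("", text)
    # If only a closing tag remains (the opening tag was never emitted),
    # keep just the text that follows it.
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>")[-1]
    return cleaned.strip()
```

Responses without any reasoning tags pass through unchanged apart from whitespace trimming.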

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure correct evaluation on the `mmmu` dataset.
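The `mmmu` fix referenced in the note concerns taking the final option letter from a response. A minimal sketch of that idea follows, assuming responses end with a line of the form `ANSWER: [LETTER]` as the MMMU prompt requests. `extract_last_option` is a hypothetical name, and this is not the code merged in AISBench/benchmark#238.

```python
import re


def extract_last_option(text: str, options: str = "ABCD") -> str:
    """Return the letter from the last 'ANSWER: X' match, or '' if none."""
    # Tolerates optional parentheses around the letter, e.g. "ANSWER: (B)".
    matches = re.findall(rf"ANSWER:\s*\(?([{options}])\)?", text)
    return matches[-1] if matches else ""
```

Taking the last match rather than the first matters for step-by-step responses, where a model may state a tentative answer and then revise it.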


For other previously supported models, their accuracies were also
validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results indicate that the xlite backend functions correctly
for the `Qwen3VLMoeForConditionalGeneration` architecture, with no
regressions observed on previously supported models.
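Only two rows in the tables report both a w/ xlite and a w/o xlite score, both on `ceval`. The short sketch below computes the differences for those two pairs. The scores are copied from the tables; `xlite_deltas` is an illustrative helper, and no acceptance tolerance is stated in this PR.

```python
# (w/ xlite, w/o xlite) ceval accuracy-weighted scores taken from the tables.
PAIRED_SCORES = {
    "Qwen3-VL-235B-A22B-Instruct": (90.42, 90.19),
    "Qwen3-32B": (88.26, 88.48),
}


def xlite_deltas(scores):
    """Return the w/ xlite minus w/o xlite difference in points per model."""
    return {
        model: round(with_xlite - without_xlite, 2)
        for model, (with_xlite, without_xlite) in scores.items()
    }


deltas = xlite_deltas(PAIRED_SCORES)
```

The two differences have opposite signs and are each under a quarter of a point, which fits the reading that xlite does not systematically shift accuracy, though two data points are a thin basis.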

- vLLM version: v0.18.0
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
…cture in xlite backend (vllm-project#8046)

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
Signed-off-by: guxin108 <1252896542@qq.com>
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
…cture in xlite backend (vllm-project#8046)

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
@SJTUyh SJTUyh merged commit bb661f6 into AISBench:master May 6, 2026
13 of 14 checks passed
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
…cture in xlite backend (vllm-project#8046)

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
…cture in xlite backend (vllm-project#8046)

This PR is the result of the last two commits on
`feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses
`config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map
and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite
initialization flow while still applying the MoE-specific config and
weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the
`Qwen3VLMoeForConditionalGeneration` architecture. The previous
implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE
variants, so Qwen3-VL MoE models were not routed into the xlite path.

At the same time, the shared rotary embedding precomputation was
normalized to use `config.rope_head_dim`, which keeps the base
`initialize` implementation generic and avoids duplicated
subclass-specific logic.

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration`
architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and
the vLLM server started successfully with the xlite backend, including
launching the worker process and processing requests without crashing
(see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`,
`Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`)
were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request

# using Docker image for Ascend A3

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
--host 0.0.0.0 \
--port 8000 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--tensor-parallel-size 16 \
--served-model-name mymodel \
--max-num-seqs 16 \
--max-model-len 40960 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9 \
--block-size 128 \
--allowed-local-media-path /path/to/media \
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' \

# single request example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": "0.0"
}'
```

Further tests were performed to confirm the accuracy of the outputs
using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`,
multiple datasets were evaluated, including a multimodal dataset `mmmu`.
The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
aime2024_gen_0_shot_chat_prompt \
ceval_gen_0_shot_cot_chat_prompt \
gpqa_gen_0_shot_cot_chat_prompt \
gsm8k_gen_0_shot_cot_chat_prompt \
math_prm800k_500_0shot_cot_gen \
mmlu_gen_0_shot_cot_chat_prompt \
mmmu_gen_cot \
--work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
--mode all \
--dump-eval-details --merge-ds \
--max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding
`ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`
was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen
official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7
(aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3
(LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure
> correct evaluation on the `mmmu` dataset.

For other previously supported models, their accuracies were also
validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results indicate that the xlite backend functions correctly
for the `Qwen3VLMoeForConditionalGeneration` architecture, with no
regressions observed in previously supported models.

- vLLM version: v0.18.0
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
Signed-off-by: yangzhe-2026 <yangzhe@isrc.iscas.ac.cn>
nanxingMy pushed a commit to nanxingMy/vllm-ascend that referenced this pull request May 15, 2026
…cture in xlite backend (vllm-project#8046)

This PR is the result of the last two commits on
`feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses
`config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map
and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite
initialization flow while still applying the MoE-specific config and
weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the
`Qwen3VLMoeForConditionalGeneration` architecture. The previous
implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE
variants, so Qwen3-VL MoE models were not routed into the xlite path.

At the same time, the shared rotary embedding precomputation was
normalized to use `config.rope_head_dim`, which keeps the base
`initialize` implementation generic and avoids duplicated
subclass-specific logic.
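
The routing and shared initialization described above can be pictured with a small sketch. It is illustrative only: the class bodies, `XLITE_STRATEGY_MAP`, `select_xlite_model`, the `extra_setup` hook, the `num_experts` field, and the `LlamaForCausalLM` key are assumptions made for this example, not the vllm-ascend implementation. Only the class names, the two MoE architecture strings, and `config.rope_head_dim` come from the PR text.

```python
from types import SimpleNamespace


class LlamaXliteModel:
    """Hypothetical base class: shared setup lives in `initialize`."""

    def initialize(self, config):
        # Shared path: read the rotary dimension from the config instead
        # of recomputing it separately in every subclass.
        self.rope_dim = config.rope_head_dim
        self.extra_setup(config)
        return self

    def extra_setup(self, config):
        # Dense models need nothing beyond the shared path in this sketch.
        self.num_experts = 0


class QwenMoeXliteModel(LlamaXliteModel):
    """Hypothetical MoE subclass: only MoE-specific wiring is overridden."""

    def extra_setup(self, config):
        self.num_experts = config.num_experts


# Architecture name -> model class. The dense key is an illustrative
# assumption; the two MoE keys are the names used in the PR text.
XLITE_STRATEGY_MAP = {
    "LlamaForCausalLM": LlamaXliteModel,
    "Qwen3MoeForCausalLM": QwenMoeXliteModel,
    "Qwen3VLMoeForConditionalGeneration": QwenMoeXliteModel,
}


def select_xlite_model(architecture, config):
    """Return an initialized model, or None if xlite does not cover it."""
    model_cls = XLITE_STRATEGY_MAP.get(architecture)
    if model_cls is None:
        return None
    return model_cls().initialize(config)


config = SimpleNamespace(rope_head_dim=128, num_experts=128)
model = select_xlite_model("Qwen3VLMoeForConditionalGeneration", config)
print(type(model).__name__, model.rope_dim, model.num_experts)
```

In this shape, supporting a new architecture that fits an existing model class is a one-line addition to the map, which matches the intent stated above of keeping the base `initialize` generic.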

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration`
architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct` (routed to `QwenMoeXliteModel`) was
tested: the vLLM server started successfully with the xlite backend,
launched the worker process, and processed requests without crashing
(see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`,
`Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`)
were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request

# using Docker image for Ascend A3

export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
--host 0.0.0.0 \
--port 8000 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--tensor-parallel-size 16 \
--served-model-name mymodel \
--max-num-seqs 16 \
--max-model-len 40960 \
--max-num-batched-tokens 4096 \
--enable-expert-parallel \
--trust-remote-code \
--async-scheduling \
--gpu-memory-utilization 0.9 \
--block-size 128 \
--allowed-local-media-path /path/to/media \
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

# single request example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": 0.0
}'
```
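
For scripted smoke tests, the same single request can be built from Python with only the standard library. This is a sketch under assumptions: `build_chat_request` and `send` are hypothetical helper names, and the host, port, and `mymodel` name simply mirror the launch command above. The request is constructed but not sent, so the snippet runs without a live server.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="mymodel"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 128,
        "temperature": 0.0,  # sent as a JSON number
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the first choice's text (needs a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Tell me how to sleep well at night.")
print(request.full_url)
print(json.loads(request.data)["max_tokens"])
```

Calling `send(request)` once the server is up returns the generated text; a non-empty string is enough to confirm that the request path works end to end.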

Further tests were performed to confirm the accuracy of the outputs
using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`,
multiple datasets were evaluated, including the multimodal `mmmu`
dataset. The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
aime2024_gen_0_shot_chat_prompt \
ceval_gen_0_shot_cot_chat_prompt \
gpqa_gen_0_shot_cot_chat_prompt \
gsm8k_gen_0_shot_cot_chat_prompt \
math_prm800k_500_0shot_cot_gen \
mmlu_gen_0_shot_cot_chat_prompt \
mmmu_gen_cot \
--work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
--mode all \
--dump-eval-details --merge-ds \
--max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding
`ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`
was modified as follows:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure
> correct evaluation on the `mmmu` dataset.
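
The `mmmu` fix mentioned in the note concerns answer extraction: the prompt asks the model to end with a line of the form `ANSWER: [LETTER]`, and the evaluator has to compare that letter, not the whole response, against the reference. A minimal sketch of such a postprocessor follows. `extract_last_option` and its fallback rule are assumptions made for illustration and are not the AISBench implementation.

```python
import re


def extract_last_option(prediction, options="ABCD"):
    """Return the last option letter found in `prediction`, or "" if none."""
    # Prefer an explicit "ANSWER: X" marker, as requested by the prompt.
    marked = re.findall(
        rf"ANSWER:\s*\(?([{options}])\b", prediction, flags=re.IGNORECASE
    )
    if marked:
        return marked[-1].upper()
    # Otherwise fall back to the last standalone uppercase option letter.
    loose = re.findall(rf"\b([{options}])\b", prediction)
    return loose[-1] if loose else ""


response = "Option A looks wrong because the ratio is off.\nANSWER: C"
print(extract_last_option(response))
```

Taking the last match rather than the first matters for chain-of-thought outputs, where earlier options are often mentioned while being ruled out.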


For other previously supported models, their accuracies were also
validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results indicate that the xlite backend functions correctly
for the `Qwen3VLMoeForConditionalGeneration` architecture, with no
regressions observed in previously supported models.
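
Only two rows in the tables above report a score both with and without xlite, so those are the only direct comparisons available. The short sketch below computes the differences; the scores are copied from the tables, while `xlite_delta` is a hypothetical helper name.

```python
# (label, score with xlite, score without xlite), copied from the tables above.
paired_scores = [
    ("Qwen3-VL-235B-A22B-Instruct / ceval", 90.42, 90.19),
    ("Qwen3-32B / ceval", 88.26, 88.48),
]


def xlite_delta(with_xlite, without_xlite):
    """Score with xlite minus score without, rounded to two decimals."""
    return round(with_xlite - without_xlite, 2)


for label, with_xlite, without_xlite in paired_scores:
    print(f"{label}: {xlite_delta(with_xlite, without_xlite):+.2f}")
# Qwen3-VL-235B-A22B-Instruct / ceval: +0.23
# Qwen3-32B / ceval: -0.22
```

Both differences are under a quarter of a point and point in opposite directions. Run-to-run variance is not reported here, so whether they fall within noise remains an assumption.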

- vLLM version: v0.18.0
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
Signed-off-by: nanxing <1014662416@qq.com>
ader47 pushed a commit to ader47/vllm-ascend that referenced this pull request Jun 18, 2026
…cture in xlite backend (vllm-project#8046)

CXY-Katrina pushed a commit to CXY-Katrina/vllm-ascend-zhx that referenced this pull request Jun 27, 2026
…cture in xlite backend (vllm-project#8046)
