[bugfix] Add postprocessor to extract last option for MMMU datasets by SijieFu · Pull Request #238 · AISBench/benchmark

SijieFu · 2026-04-10T07:08:04Z

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

Related Issue
Relates to #224

🔍 Motivation

Previously, when benchmarking accuracy on the MMMU dataset, postprocessing of the model's original predictions was missing or incorrect, so the postprocessed predictions were identical to the original predictions. As a result, even when the model output the correct answer, the evaluation still counted it as incorrect, resulting in a low accuracy score. For example:

"0": {
            "prompt": [
                {
                    "role": "HUMAN",
                    "prompt": [
                        {
                            "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nEach of the following situations relates to a different company. ",
                            "type": "text"
                        },
                        {
                            "image_url": {
                                "url": "file://{project_root}/ais_bench/datasets/mmmu/MMMU_images/1_1.jpg"
                            },
                            "type": "image_url"
                        },
                        {
                            "text": " For company B, find the missing amounts.A. $63,020\nB. $58,410\nC. $71,320\nD. $77,490\n",
                            "type": "text"
                        }
                    ]
                }
            ],
            "origin_prediction": "Let's solve this step by step.\n\n......\n\nANSWER: D",
            "predictions": [
                "Let's solve this step by step.\n\n......\n\nANSWER: D"
            ],
            "references": [
                {
                    "answer": "D",
                    "category": "Business",
                    "choices": "{\"A\": \"$63,020\", \"B\": \"$58,410\", \"C\": \"$71,320\", \"D\": \"$77,490\"}",
                    "l2-category": "Accounting",
                    "split": "dev"
                }
            ],
            "correct": [
                false
            ]
        }

📝 Modification

This PR adds a pred_postprocessor to the MMMU-related evaluation configurations.

📐 Associated Test Results

Instead of repeating the original prediction text, the postprocessed predictions now contain a single choice letter:

"0": {
            "prompt": [
                {
                    "role": "HUMAN",
                    "prompt": [
                        {
                            "text": "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of A,B,C,D. Think step by step before answering.\n\nEach of the following situations relates to a different company. ",
                            "type": "text"
                        },
                        {
                            "image_url": {
                                "url": "file://{project_root}/ais_bench/datasets/mmmu/MMMU_images/1_1.jpg"
                            },
                            "type": "image_url"
                        },
                        {
                            "text": " For company B, find the missing amounts.A. $63,020\nB. $58,410\nC. $71,320\nD. $77,490\n",
                            "type": "text"
                        }
                    ]
                }
            ],
            "origin_prediction": "Let's solve this step by step.\n\n......\n\nANSWER: D",
            "predictions": [
                "D"
            ],
            "references": [
                {
                    "answer": "D",
                    "category": "Business",
                    "choices": "{\"A\": \"$63,020\", \"B\": \"$58,410\", \"C\": \"$71,320\", \"D\": \"$77,490\"}",
                    "l2-category": "Accounting",
                    "split": "dev"
                }
            ],
            "correct": [
                true
            ]
        }

✅ Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

gemini-code-assist

Code Review

This pull request introduces the last_option_postprocess function into the evaluation configurations for MMMU and MMMU-Pro datasets. Reviewers identified several issues with this approach, noting that the post-processor is highly susceptible to false positives due to its lack of word boundaries and case-insensitivity. This is especially concerning for Chain-of-Thought (CoT) responses and datasets using extended option sets where common letters like 'I' might be incorrectly matched within the model's reasoning or standard text. It is recommended to implement a more robust extraction logic that targets specific answer patterns and to ensure the function is correctly registered within the project's post-processor registry.

gemini-code-assist · 2026-04-10T07:11:03Z

 mmmu_eval_cfg = dict(
-    evaluator=dict(type=MMMUEvaluator)
+    evaluator=dict(type=MMMUEvaluator),
+    pred_postprocessor=dict(type=last_option_postprocess, options="ABCD"),


The last_option_postprocess function is highly susceptible to false positives because its implementation (re.findall(rf'([{options}])', text)) matches any occurrence of the option letters, even when they are part of other words (e.g., matching 'B' in 'Business'). This is risky as MMMU outputs often contain category names or reasoning that may include these letters. Additionally, the postprocessor is case-sensitive and will fail to extract lowercase answers. Consider improving the postprocessor to use word boundaries (e.g., rf'\b([{options}])\b') and case-insensitivity. Also, note that last_option_postprocess is not registered in the TEXT_POSTPROCESSORS registry, which is inconsistent with other postprocessors in the project.
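
To make the concern concrete, here is a minimal sketch comparing the pattern quoted in the comment above with a word-boundary, case-insensitive variant. Both helper names (`naive_last_option`, `bounded_last_option`) are hypothetical and are not the project's actual implementation; only the `re.findall(rf'([{options}])', text)` pattern comes from the review.

```python
import re


def naive_last_option(text: str, options: str) -> str:
    # Pattern quoted in the review: any character from `options`, anywhere in the text.
    matches = re.findall(rf"([{options}])", text)
    return matches[-1] if matches else ""


def bounded_last_option(text: str, options: str) -> str:
    # Hypothetical variant: only standalone letters count, in either case.
    matches = re.findall(rf"\b([{options}])\b", text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else ""


sample = "ANSWER: A. Category: Business"
print(naive_last_option(sample, "ABCD"))          # picks the 'B' of "Business"
print(bounded_last_option(sample, "ABCD"))        # picks the standalone 'A'
print(repr(naive_last_option("answer: c", "ABCD")))  # lowercase answer is missed
print(bounded_last_option("answer: c", "ABCD"))   # lowercase answer is normalized
```

One trade-off of the bounded variant is that, being case-insensitive, it also treats the English article "a" as a candidate, so it is probably still not sufficient on its own for free-form reasoning text.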

gemini-code-assist · 2026-04-10T07:11:03Z

 mmmu_eval_cfg = dict(
-    evaluator=dict(type=MMMUEvaluator)
+    evaluator=dict(type=MMMUEvaluator),
+    pred_postprocessor=dict(type=last_option_postprocess, options="ABCD"),


For Chain-of-Thought (CoT) responses, using a naive last_option_postprocess is particularly risky as the model might mention multiple options during reasoning. Since the prompt explicitly requests the format ANSWER: [LETTER], it would be much more robust to use a postprocessor that specifically targets this pattern rather than just finding the last occurrence of any option letter. Furthermore, the current implementation is case-sensitive and can match letters inside words.
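
A pattern-targeted extractor along the lines suggested here could look like the sketch below. The function name `extract_answer_letter`, the tolerance for optional parentheses, and the empty-string fallback are all assumptions made for illustration; none of them is taken from the AISBench code base.

```python
import re


def extract_answer_letter(text: str, options: str = "ABCD") -> str:
    """Return the letter from the last 'ANSWER: X' marker, or '' if there is none."""
    # The negative lookahead rejects cases such as "Answer: Both", where the
    # character after the colon is the start of a word rather than a choice.
    pattern = rf"answer\s*:\s*\(?([{options}])\)?(?![A-Za-z])"
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else ""


cot = "Option A looks too low and B ignores the tax.\n\nANSWER: D"
print(extract_answer_letter(cot))                                # D
print(repr(extract_answer_letter("I think it is probably B")))   # '' (no marker)
```

Returning an empty string when the marker is absent makes a format violation show up as a wrong answer instead of a guessed one; whether that is the desired scoring policy is a design decision for the maintainers.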

gemini-code-assist · 2026-04-10T07:11:03Z

 mmmu_pro_eval_cfg = dict(
-    evaluator=dict(type=MMMUProCotEvaluator)
+    evaluator=dict(type=MMMUProCotEvaluator),
+    pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),


The risk of false positives with last_option_postprocess is even higher here because the options string ABCDEFGHIJ includes the letter 'I', which frequently appears as a standalone pronoun or within words in model responses. This naive extraction will likely return 'I' as the answer if it appears after the actual choice. A more specific pattern matching the requested 'Answer: $LETTER' format should be used instead.
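
The following short sketch illustrates why the letter 'I' is a problem for the ten-option set, and why adding word boundaries alone would not help in this case. The sample response is invented for illustration, and the three regular expressions are assumptions, not the project's code.

```python
import re

OPTIONS = "ABCDEFGHIJ"
# Invented response: the answer is followed by a sentence starting with "I".
response = "Answer: C\n\nI hope this helps."

# Last occurrence of any option letter anywhere in the text.
last_any = re.findall(rf"[{OPTIONS}]", response)[-1]

# Last standalone option letter (word boundaries on both sides).
last_standalone = re.findall(rf"\b[{OPTIONS}]\b", response)[-1]

# Letter that directly follows the requested "Answer:" marker.
anchored = re.findall(rf"Answer:\s*([{OPTIONS}])\b", response)[-1]

print(last_any, last_standalone, anchored)  # I I C
```

Because the pronoun "I" is itself a standalone word, only the extraction anchored on the answer marker recovers the intended choice here.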

gemini-code-assist · 2026-04-10T07:11:03Z

 mmmu_pro_eval_cfg = dict(
-    evaluator=dict(type=MMMUProEvaluator)
+    evaluator=dict(type=MMMUProEvaluator),
+    pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),


The last_option_postprocess function is prone to incorrect extractions, especially with the extended option set ABCDEFGHIJ. It can match letters within words and is case-sensitive. Given that MMMU-Pro evaluation relies on precise option extraction, consider using a more robust method that enforces word boundaries and handles case-insensitivity.

gemini-code-assist · 2026-04-10T07:11:03Z

 mmmu_pro_eval_cfg = dict(
-    evaluator=dict(type=MMMUProCotEvaluator)
+    evaluator=dict(type=MMMUProCotEvaluator),
+    pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),


Using last_option_postprocess for CoT evaluation is risky as it may pick up letters from the model's reasoning steps. Since the prompt specifies a clear format ('Answer: $LETTER'), the postprocessor should ideally look for that specific pattern to ensure accuracy, rather than simply taking the last matching character in the text.

gemini-code-assist · 2026-04-10T07:11:03Z

 mmmu_pro_eval_cfg = dict(
-    evaluator=dict(type=MMMUProEvaluator)
+    evaluator=dict(type=MMMUProEvaluator),
+    pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"),


The last_option_postprocess implementation is too naive for reliable evaluation. It lacks word boundary checks and case-insensitivity, which can lead to incorrect results if the model output contains words like 'Business' or 'Image' (containing 'B' or 'I'). A more robust extraction logic is recommended.

Copilot

Pull request overview

Adds prediction postprocessing to MMMU-family dataset evaluation configs so that generated answers are reduced to a final multiple-choice option letter before scoring, fixing cases where verbose model outputs were previously mis-scored.

Changes:

Add last_option_postprocess as pred_postprocessor for MMMU (A–D) evaluation configs.
Add last_option_postprocess as pred_postprocessor for MMMU-Pro evaluation configs (A–J).
Import the postprocessor into the affected dataset config files.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.


| File | Description |
| ----- | ----- |
| ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen.py | Adds eval-time `pred_postprocessor` to extract final A–D option. |
| ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py | Adds eval-time `pred_postprocessor` to extract final A–D option for CoT outputs. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_vision_gen.py | Adds eval-time `pred_postprocessor` to extract final A–J option for MMMU-Pro vision. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_vision_cot_gen.py | Adds eval-time `pred_postprocessor` to extract final A–J option for MMMU-Pro vision CoT. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_options10_gen.py | Adds eval-time `pred_postprocessor` to extract final A–J option for MMMU-Pro options10. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_options10_cot_gen.py | Adds eval-time `pred_postprocessor` to extract final A–J option for MMMU-Pro options10 CoT. |


…cture in xlite backend (#8046)

This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic.

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request
# using Docker image for Ascend A3
export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
    --host 0.0.0.0 \
    --port 8000 \
    --api-server-count 1 \
    --data-parallel-size 1 \
    --data-parallel-size-local 1 \
    --tensor-parallel-size 16 \
    --served-model-name mymodel \
    --max-num-seqs 16 \
    --max-model-len 40960 \
    --max-num-batched-tokens 4096 \
    --enable-expert-parallel \
    --trust-remote-code \
    --async-scheduling \
    --gpu-memory-utilization 0.9 \
    --block-size 128 \
    --allowed-local-media-path /path/to/media \
    --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' \

# single request example
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": "0.0"
    }'
```

Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including a multimodal dataset `mmmu`. The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
    aime2024_gen_0_shot_chat_prompt \
    ceval_gen_0_shot_cot_chat_prompt \
    gpqa_gen_0_shot_cot_chat_prompt \
    gsm8k_gen_0_shot_cot_chat_prompt \
    math_prm800k_500_0shot_cot_gen \
    mmlu_gen_0_shot_cot_chat_prompt \
    mmmu_gen_cot \
    --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
    --mode all \
    --dump-eval-details --merge-ds \
    --max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> Bug fix from <AISBench/benchmark#238> were also merged to ensure the evaluation correctness for the `mmmu` dataset.

For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models.

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>

Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\ > Bug fix from <AISBench/benchmark#238> were also merged to ensure the evaluation correctness for the `mmmu` dataset. For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below: | model | dataset | version | metric | mode | w/ xlite | w/o xlite | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 | | Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ | | GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ | These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models. - vLLM version: v0.18.0 - vLLM main: vllm-project/vllm@29e4870 --------- Signed-off-by: Sijie Fu <fusijie@huawei.com> Co-authored-by: Sijie Fu <fusijie@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com>


[bugfix] Add postprocessor to extract last option for MMMU datasets

dcec094
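To make the idea behind this commit concrete, here is a minimal sketch of what "extracting the last option" from a model response could look like. It is not the code merged in this PR: the function name `extract_last_option`, its `options` parameter, and the fallback rule are assumptions made for illustration. The only detail taken from the PR is the `ANSWER: [LETTER]` format that the MMMU prompt asks the model to end with.

```python
import re

# Hypothetical helper for illustration; not the postprocessor merged in this PR.
# Matches the 'ANSWER: X' form requested by the prompt, case-insensitively.
_ANSWER_RE = re.compile(r"ANSWER\s*:\s*\(?([A-Z])\b", re.IGNORECASE)
# Matches any standalone capital letter, used only as a fallback.
_LETTER_RE = re.compile(r"\b([A-Z])\b")


def extract_last_option(text: str, options: str = "ABCD") -> str:
    """Return the last option letter found in `text`, or "" if there is none."""
    # Prefer explicit 'ANSWER: X' statements and keep the final one,
    # since a step-by-step response may revise an earlier answer.
    explicit = [m.upper() for m in _ANSWER_RE.findall(text) if m.upper() in options]
    if explicit:
        return explicit[-1]
    # Otherwise fall back to the last standalone capital letter that is a valid option.
    loose = [m for m in _LETTER_RE.findall(text) if m in options]
    return loose[-1] if loose else ""


if __name__ == "__main__":
    response = "Step 1: compare the totals.\nANSWER: A\nOn review, the sum differs.\nANSWER: B"
    print(extract_last_option(response))
```

Taking the last match rather than the first is the design choice that matters here: a chain-of-thought response often mentions several option letters before settling on one, so comparing the raw response (or its first letter) against the gold label would mark correct answers as wrong.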

Copilot AI review requested due to automatic review settings April 10, 2026 07:08

github-actions Bot added the bugfix label Apr 10, 2026

Copilot started reviewing on behalf of SijieFu April 10, 2026 07:08

gemini-code-assist Bot reviewed Apr 10, 2026

Copilot AI reviewed Apr 10, 2026

SijieFu mentioned this pull request Apr 10, 2026

[Feature][xlite] Support Qwen3VLMoeForConditionalGeneration architecture in xlite backend vllm-project/vllm-ascend#8046

Merged

SJTUyh approved these changes May 6, 2026

SJTUyh merged commit bb661f6 into AISBench:master May 6, 2026
13 of 14 checks passed

SJTUyh mentioned this pull request May 19, 2026

[Bug] Version 3.1.20260330: running mmmu_gen misjudges the results of a large share of questions #300

Closed

4 tasks

SJTUyh merged 1 commit into AISBench:master from SijieFu:fix/mmmu-eval

3 participants
