[bugfix] Add postprocessor to extract last option for MMMU datasets #238
Code Review
This pull request introduces the last_option_postprocess function into the evaluation configurations for MMMU and MMMU-Pro datasets. Reviewers identified several issues with this approach, noting that the post-processor is highly susceptible to false positives because it applies no word boundaries, and that it is case-sensitive and so misses lowercase answers. This is especially concerning for Chain-of-Thought (CoT) responses and datasets using extended option sets, where common letters like 'I' might be incorrectly matched within the model's reasoning or standard text. It is recommended to implement a more robust extraction logic that targets specific answer patterns and to ensure the function is correctly registered within the project's post-processor registry.
| mmmu_eval_cfg = dict( | ||
| evaluator=dict(type=MMMUEvaluator) | ||
| evaluator=dict(type=MMMUEvaluator), | ||
| pred_postprocessor=dict(type=last_option_postprocess, options="ABCD"), |
The last_option_postprocess function is highly susceptible to false positives because its implementation (re.findall(rf'([{options}])', text)) matches any occurrence of the option letters, even when they are part of other words (e.g., matching 'B' in 'Business'). This is risky as MMMU outputs often contain category names or reasoning that may include these letters. Additionally, the postprocessor is case-sensitive and will fail to extract lowercase answers. Consider improving the postprocessor to use word boundaries (e.g., rf'\b([{options}])\b') and case-insensitivity. Also, note that last_option_postprocess is not registered in the TEXT_POSTPROCESSORS registry, which is inconsistent with other postprocessors in the project.
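The difference between the two patterns can be seen on a single response. The following is a minimal sketch, not the project's implementation: `naive_last_option` reuses the regular expression quoted in the comment above, while `bounded_last_option`, the sample response, and the empty-string fallback are assumptions made for illustration.

```python
import re


def naive_last_option(text: str, options: str = "ABCD") -> str:
    """Last option character anywhere in the text (pattern quoted in the review)."""
    matches = re.findall(rf"([{options}])", text)
    # Returning "" when nothing matches is an assumption of this sketch.
    return matches[-1] if matches else ""


def bounded_last_option(text: str, options: str = "ABCD") -> str:
    """Hypothetical variant: only standalone letters count, in either case."""
    matches = re.findall(rf"\b([{options}])\b", text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else ""


response = "The answer is B. Category: Accounting and Design"
print(naive_last_option(response))    # D (the capital letter of "Design")
print(bounded_last_option(response))  # B
```

Word boundaries remove matches inside words, but case-insensitive matching also turns the article "a" into a candidate, so this variant trades one class of false positives for another.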
| mmmu_eval_cfg = dict( | ||
| evaluator=dict(type=MMMUEvaluator) | ||
| evaluator=dict(type=MMMUEvaluator), | ||
| pred_postprocessor=dict(type=last_option_postprocess, options="ABCD"), |
For Chain-of-Thought (CoT) responses, using a naive last_option_postprocess is particularly risky as the model might mention multiple options during reasoning. Since the prompt explicitly requests the format ANSWER: [LETTER], it would be much more robust to use a postprocessor that specifically targets this pattern rather than just finding the last occurrence of any option letter. Furthermore, the current implementation is case-sensitive and can match letters inside words.
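One way to target the requested format is to anchor the match on the `ANSWER:` label. This is a sketch under stated assumptions: the function name `extract_answer_letter`, the tolerance for Markdown asterisks and an opening parenthesis, and the `None` fallback are all illustrative choices, not part of the project.

```python
import re
from typing import Optional


def extract_answer_letter(text: str, options: str = "ABCD") -> Optional[str]:
    """Return the letter following the last 'ANSWER:' label, or None if absent."""
    # Allow whitespace or Markdown asterisks after the colon, plus an optional "(".
    pattern = rf"answer\s*:[\s*]*\(?([{re.escape(options)}])\b"
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


cot = "Option A looks plausible, but B fits the data better.\nANSWER: B\nI hope this helps."
print(extract_answer_letter(cot))  # B
```

Returning `None` instead of guessing keeps an unparseable response distinguishable from a wrong answer; how the evaluator should treat that case is left open here.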
| mmmu_pro_eval_cfg = dict( | ||
| evaluator=dict(type=MMMUProCotEvaluator) | ||
| evaluator=dict(type=MMMUProCotEvaluator), | ||
| pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"), |
The risk of false positives with last_option_postprocess is even higher here because the options string ABCDEFGHIJ includes the letter 'I', which frequently appears as a standalone pronoun or within words in model responses. This naive extraction will likely return 'I' as the answer if it appears after the actual choice. A more specific pattern matching the requested 'Answer: $LETTER' format should be used instead.
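The pronoun case is worth checking separately, because word boundaries alone do not help with it. The snippet below is an illustrative check on a made-up response; the variable names and the anchored pattern are assumptions, not code from the pull request.

```python
import re

OPTIONS = "ABCDEFGHIJ"
response = "Answer: C. I think the diagram rules out the other choices."

# Last option character anywhere in the text.
naive = re.findall(rf"([{OPTIONS}])", response)[-1]
# Last standalone option letter; the pronoun "I" still qualifies.
bounded = re.findall(rf"\b([{OPTIONS}])\b", response)[-1]
# Letter directly after the "Answer:" label.
anchored = re.search(rf"Answer:\s*([{OPTIONS}])\b", response).group(1)

print(naive, bounded, anchored)  # I I C
```

Only the anchored pattern recovers the intended choice here, which is why matching the requested format matters more for the A–J option set than for A–D.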
| mmmu_pro_eval_cfg = dict( | ||
| evaluator=dict(type=MMMUProEvaluator) | ||
| evaluator=dict(type=MMMUProEvaluator), | ||
| pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"), |
The last_option_postprocess function is prone to incorrect extractions, especially with the extended option set ABCDEFGHIJ. It can match letters within words and is case-sensitive. Given that MMMU-Pro evaluation relies on precise option extraction, consider using a more robust method that enforces word boundaries and handles case-insensitivity.
| mmmu_pro_eval_cfg = dict( | ||
| evaluator=dict(type=MMMUProCotEvaluator) | ||
| evaluator=dict(type=MMMUProCotEvaluator), | ||
| pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"), |
Using last_option_postprocess for CoT evaluation is risky as it may pick up letters from the model's reasoning steps. Since the prompt specifies a clear format ('Answer: $LETTER'), the postprocessor should ideally look for that specific pattern to ensure accuracy, rather than simply taking the last matching character in the text.
| mmmu_pro_eval_cfg = dict( | ||
| evaluator=dict(type=MMMUProEvaluator) | ||
| evaluator=dict(type=MMMUProEvaluator), | ||
| pred_postprocessor=dict(type=last_option_postprocess, options="ABCDEFGHIJ"), |
The last_option_postprocess implementation is too naive for reliable evaluation. It lacks word boundary checks and case-insensitivity, which can lead to incorrect results if the model output contains words like 'Business' or 'Image' (containing 'B' or 'I'). A more robust extraction logic is recommended.
Pull request overview
Adds prediction postprocessing to MMMU-family dataset evaluation configs so that generated answers are reduced to a final multiple-choice option letter before scoring, fixing cases where verbose model outputs were previously mis-scored.
Changes:
- Add `last_option_postprocess` as `pred_postprocessor` for MMMU (A–D) evaluation configs.
- Add `last_option_postprocess` as `pred_postprocessor` for MMMU-Pro evaluation configs (A–J).
- Import the postprocessor into the affected dataset config files.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen.py | Adds eval-time pred_postprocessor to extract final A–D option. |
| ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py | Adds eval-time pred_postprocessor to extract final A–D option for CoT outputs. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_vision_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro vision. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_vision_cot_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro vision CoT. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_options10_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro options10. |
| ais_bench/benchmark/configs/datasets/mmmu_pro/mmmu_pro_options10_cot_gen.py | Adds eval-time pred_postprocessor to extract final A–J option for MMMU-Pro options10 CoT. |
…cture in xlite backend (#8046)

This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic.

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions.

```bash
# Script to start the server and test a single request
# using Docker image for Ascend A3
export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
  --host 0.0.0.0 \
  --port 8000 \
  --api-server-count 1 \
  --data-parallel-size 1 \
  --data-parallel-size-local 1 \
  --tensor-parallel-size 16 \
  --served-model-name mymodel \
  --max-num-seqs 16 \
  --max-model-len 40960 \
  --max-num-batched-tokens 4096 \
  --enable-expert-parallel \
  --trust-remote-code \
  --async-scheduling \
  --gpu-memory-utilization 0.9 \
  --block-size 128 \
  --allowed-local-media-path /path/to/media \
  --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

# single request example
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mymodel",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me how to sleep well at night."}
    ],
    "max_tokens": 128,
    "temperature": "0.0"
  }'
```

Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including a multimodal dataset `mmmu`. The results are summarized in the table below.

```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
  aime2024_gen_0_shot_chat_prompt \
  ceval_gen_0_shot_cot_chat_prompt \
  gpqa_gen_0_shot_cot_chat_prompt \
  gsm8k_gen_0_shot_cot_chat_prompt \
  math_prm800k_500_0shot_cot_gen \
  mmlu_gen_0_shot_cot_chat_prompt \
  mmmu_gen_cot \
  --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
  --mode all \
  --dump-eval-details --merge-ds \
  --max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure the evaluation correctness for the `mmmu` dataset.

For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models.

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
…cture in xlite backend (vllm-project#8046) This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`: 1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup. 2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`. The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass. ### What this PR does / why we need it? This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic. ### Does this PR introduce _any_ user-facing change? Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend. ### How was this patch tested? First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions. 
```bash # Script to start the server and test a single request # using Docker image for Ascend A3 export VLLM_USE_MODELSCOPE=true export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # server launch command (with xlite enabled) vllm serve /path/to/model \ --host 0.0.0.0 \ --port 8000 \ --api-server-count 1 \ --data-parallel-size 1 \ --data-parallel-size-local 1 \ --tensor-parallel-size 16 \ --served-model-name mymodel \ --max-num-seqs 16 \ --max-model-len 40960 \ --max-num-batched-tokens 4096 \ --enable-expert-parallel \ --trust-remote-code \ --async-scheduling \ --gpu-memory-utilization 0.9 \ --block-size 128 \ --allowed-local-media-path /path/to/media \ --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' \ # single request example curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mymodel", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me how to sleep well at night."} ], "max_tokens": 128, "temperature": "0.0" }' ``` Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including a multimodal dataset `mmmu`. The results are summarized in the table below. 
```bash # command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled ais_bench --models vllm_api_general_chat --datasets \ aime2024_gen_0_shot_chat_prompt \ ceval_gen_0_shot_cot_chat_prompt \ gpqa_gen_0_shot_cot_chat_prompt \ gsm8k_gen_0_shot_cot_chat_prompt \ math_prm800k_500_0shot_cot_gen \ mmlu_gen_0_shot_cot_chat_prompt \ mmmu_gen_cot \ --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \ --mode all \ --dump-eval-details --merge-ds \ --max-num-workers 128 ``` For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as: ```python # ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py models = [ dict( attr="service", type=VLLMCustomAPIChat, abbr="vllm-api-general-chat", path="", model="mymodel", stream=False, request_rate=0, use_timestamp=False, retry=2, api_key="", host_ip="localhost", host_port=8000, url="", max_out_len=32768, batch_size=512, trust_remote_code=False, generation_kwargs=dict( temperature=0.01, top_k=10, top_p=0.95, seed=None, repetition_penalty=1.03, ignore_eos=False, ), pred_postprocessor=dict(type=extract_non_reasoning_content), ) ] ``` | dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ | | ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ | | gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ | | gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ | | livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ | | math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ | | mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 | | mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 | | mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ | > *Multimodal datasets that include image inputs.\ > 
Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\ > Bug fix from <AISBench/benchmark#238> were also merged to ensure the evaluation correctness for the `mmmu` dataset. For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below: | model | dataset | version | metric | mode | w/ xlite | w/o xlite | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 | | Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ | | GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ | These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models. - vLLM version: v0.18.0 - vLLM main: vllm-project/vllm@29e4870 --------- Signed-off-by: Sijie Fu <fusijie@huawei.com> Co-authored-by: Sijie Fu <fusijie@huawei.com>
…cture in xlite backend (vllm-project#8046) This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`: 1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup. 2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`. The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass. ### What this PR does / why we need it? This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic. ### Does this PR introduce _any_ user-facing change? Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend. ### How was this patch tested? First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions. 
```bash # Script to start the server and test a single request # using Docker image for Ascend A3 export VLLM_USE_MODELSCOPE=true export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # server launch command (with xlite enabled) vllm serve /path/to/model \ --host 0.0.0.0 \ --port 8000 \ --api-server-count 1 \ --data-parallel-size 1 \ --data-parallel-size-local 1 \ --tensor-parallel-size 16 \ --served-model-name mymodel \ --max-num-seqs 16 \ --max-model-len 40960 \ --max-num-batched-tokens 4096 \ --enable-expert-parallel \ --trust-remote-code \ --async-scheduling \ --gpu-memory-utilization 0.9 \ --block-size 128 \ --allowed-local-media-path /path/to/media \ --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' \ # single request example curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mymodel", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me how to sleep well at night."} ], "max_tokens": 128, "temperature": "0.0" }' ``` Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including a multimodal dataset `mmmu`. The results are summarized in the table below. 
```bash # command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled ais_bench --models vllm_api_general_chat --datasets \ aime2024_gen_0_shot_chat_prompt \ ceval_gen_0_shot_cot_chat_prompt \ gpqa_gen_0_shot_cot_chat_prompt \ gsm8k_gen_0_shot_cot_chat_prompt \ math_prm800k_500_0shot_cot_gen \ mmlu_gen_0_shot_cot_chat_prompt \ mmmu_gen_cot \ --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \ --mode all \ --dump-eval-details --merge-ds \ --max-num-workers 128 ``` For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as: ```python # ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py models = [ dict( attr="service", type=VLLMCustomAPIChat, abbr="vllm-api-general-chat", path="", model="mymodel", stream=False, request_rate=0, use_timestamp=False, retry=2, api_key="", host_ip="localhost", host_port=8000, url="", max_out_len=32768, batch_size=512, trust_remote_code=False, generation_kwargs=dict( temperature=0.01, top_k=10, top_p=0.95, seed=None, repetition_penalty=1.03, ignore_eos=False, ), pred_postprocessor=dict(type=extract_non_reasoning_content), ) ] ``` | dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ | | ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ | | gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ | | gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ | | livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ | | math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ | | mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 | | mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 | | mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ | > *Multimodal datasets that include image inputs.\ > 
Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\ > Bug fix from <AISBench/benchmark#238> were also merged to ensure the evaluation correctness for the `mmmu` dataset. For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below: | model | dataset | version | metric | mode | w/ xlite | w/o xlite | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 | | Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ | | GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ | These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models. - vLLM version: v0.18.0 - vLLM main: vllm-project/vllm@29e4870 --------- Signed-off-by: Sijie Fu <fusijie@huawei.com> Co-authored-by: Sijie Fu <fusijie@huawei.com>
…cture in xlite backend (vllm-project#8046) This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`: 1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup. 2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`. The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass. ### What this PR does / why we need it? This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic. ### Does this PR introduce _any_ user-facing change? Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend. ### How was this patch tested? First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions. 
```bash # Script to start the server and test a single request # using Docker image for Ascend A3 export VLLM_USE_MODELSCOPE=true export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # server launch command (with xlite enabled) vllm serve /path/to/model \ --host 0.0.0.0 \ --port 8000 \ --api-server-count 1 \ --data-parallel-size 1 \ --data-parallel-size-local 1 \ --tensor-parallel-size 16 \ --served-model-name mymodel \ --max-num-seqs 16 \ --max-model-len 40960 \ --max-num-batched-tokens 4096 \ --enable-expert-parallel \ --trust-remote-code \ --async-scheduling \ --gpu-memory-utilization 0.9 \ --block-size 128 \ --allowed-local-media-path /path/to/media \ --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' \ # single request example curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mymodel", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me how to sleep well at night."} ], "max_tokens": 128, "temperature": "0.0" }' ``` Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including a multimodal dataset `mmmu`. The results are summarized in the table below. 
```bash # command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled ais_bench --models vllm_api_general_chat --datasets \ aime2024_gen_0_shot_chat_prompt \ ceval_gen_0_shot_cot_chat_prompt \ gpqa_gen_0_shot_cot_chat_prompt \ gsm8k_gen_0_shot_cot_chat_prompt \ math_prm800k_500_0shot_cot_gen \ mmlu_gen_0_shot_cot_chat_prompt \ mmmu_gen_cot \ --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \ --mode all \ --dump-eval-details --merge-ds \ --max-num-workers 128 ``` For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as: ```python # ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py models = [ dict( attr="service", type=VLLMCustomAPIChat, abbr="vllm-api-general-chat", path="", model="mymodel", stream=False, request_rate=0, use_timestamp=False, retry=2, api_key="", host_ip="localhost", host_port=8000, url="", max_out_len=32768, batch_size=512, trust_remote_code=False, generation_kwargs=dict( temperature=0.01, top_k=10, top_p=0.95, seed=None, repetition_penalty=1.03, ignore_eos=False, ), pred_postprocessor=dict(type=extract_non_reasoning_content), ) ] ``` | dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ | | ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ | | gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ | | gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ | | livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ | | math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ | | mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 | | mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 | | mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ | > *Multimodal datasets that include image inputs.\ > 
Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\ > Bug fix from <AISBench/benchmark#238> were also merged to ensure the evaluation correctness for the `mmmu` dataset. For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below: | model | dataset | version | metric | mode | w/ xlite | w/o xlite | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | | Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 | | Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ | | GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ | These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models. - vLLM version: v0.18.0 - vLLM main: vllm-project/vllm@29e4870 --------- Signed-off-by: Sijie Fu <fusijie@huawei.com> Co-authored-by: Sijie Fu <fusijie@huawei.com>
…cture in xlite backend (vllm-project#8046) This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`: 1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup. 2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`. The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass. ### What this PR does / why we need it? This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic. ### Does this PR introduce _any_ user-facing change? Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend. ### How was this patch tested? First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions. 
```bash
# Script to start the server and test a single request
# using Docker image for Ascend A3
export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
    --host 0.0.0.0 \
    --port 8000 \
    --api-server-count 1 \
    --data-parallel-size 1 \
    --data-parallel-size-local 1 \
    --tensor-parallel-size 16 \
    --served-model-name mymodel \
    --max-num-seqs 16 \
    --max-model-len 40960 \
    --max-num-batched-tokens 4096 \
    --enable-expert-parallel \
    --trust-remote-code \
    --async-scheduling \
    --gpu-memory-utilization 0.9 \
    --block-size 128 \
    --allowed-local-media-path /path/to/media \
    --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

# single request example
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": "0.0"
    }'
```

Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including the multimodal dataset `mmmu`. The results are summarized in the table below.
```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
    aime2024_gen_0_shot_chat_prompt \
    ceval_gen_0_shot_cot_chat_prompt \
    gpqa_gen_0_shot_cot_chat_prompt \
    gsm8k_gen_0_shot_cot_chat_prompt \
    math_prm800k_500_0shot_cot_gen \
    mmlu_gen_0_shot_cot_chat_prompt \
    mmmu_gen_cot \
    --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
    --mode all \
    --dump-eval-details --merge-ds \
    --max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure evaluation correctness for the `mmmu` dataset.

For the other previously supported models, accuracy was also validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results confirm that the xlite backend functions correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed regressions in previously supported models.

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
Signed-off-by: guxin108 <1252896542@qq.com>
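The routing change described in the commit summary (a strategy map keyed by architecture name, with a shared `initialize` path that reads `config.rope_head_dim`) can be pictured with a toy sketch. Everything below is an assumption made for illustration: the `Toy*` classes, the `TOY_STRATEGY_MAP` name, the dict-based config, and the `num_experts` key are hypothetical stand-ins and do not reproduce the actual xlite backend code.

```python
class ToyLlamaXliteModel:
    """Toy stand-in for a base xlite model class (hypothetical)."""

    def initialize(self, config: dict) -> dict:
        # The shared path reads one generic field instead of letting each
        # subclass recompute its own rotary-embedding dimension.
        return {"rope_dim": config["rope_head_dim"]}


class ToyQwenMoeXliteModel(ToyLlamaXliteModel):
    """Toy stand-in for a MoE subclass that reuses the shared path."""

    def initialize(self, config: dict) -> dict:
        state = super().initialize(config)
        # MoE-specific wiring stays in the subclass ('num_experts' is assumed).
        state["num_experts"] = config["num_experts"]
        return state


# Architecture name -> model class. The first entry is an assumption about the
# pre-existing mapping; the second mirrors the routing the commit describes.
TOY_STRATEGY_MAP = {
    "Qwen3MoeForCausalLM": ToyQwenMoeXliteModel,
    "Qwen3VLMoeForConditionalGeneration": ToyQwenMoeXliteModel,
}


def select_xlite_model(architecture: str) -> type:
    """Look up the model class for an architecture, failing loudly if absent."""
    try:
        return TOY_STRATEGY_MAP[architecture]
    except KeyError:
        raise ValueError(f"no xlite strategy for {architecture!r}") from None


if __name__ == "__main__":
    model_cls = select_xlite_model("Qwen3VLMoeForConditionalGeneration")
    print(model_cls().initialize({"rope_head_dim": 64, "num_experts": 128}))
```

Under this layout, supporting a new architecture that shares an existing model class only needs one more map entry, which matches the small scope of the change described above.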
…cture in xlite backend (vllm-project#8046)

This PR is the result of the last two commits on `feat/xlite-qwen3-vl-moe`:

1. Refactor `LlamaXliteModel` so the shared `initialize` path uses `config.rope_head_dim` instead of duplicating subclass-specific setup.
2. Add `Qwen3VLMoeForConditionalGeneration` to the xlite strategy map and route it through `QwenMoeXliteModel`.

The net effect is that Qwen3-VL MoE models can reuse the existing xlite initialization flow while still applying the MoE-specific config and weight wiring in the subclass.

### What this PR does / why we need it?

This PR extends xlite backend coverage to the `Qwen3VLMoeForConditionalGeneration` architecture. The previous implementation only handled `Qwen3MoeForCausalLM` and related (non-)MoE variants, so Qwen3-VL MoE models were not routed into the xlite path. At the same time, the shared rotary embedding precomputation was normalized to use `config.rope_head_dim`, which keeps the base `initialize` implementation generic and avoids duplicated subclass-specific logic.

### Does this PR introduce _any_ user-facing change?

Yes. Users running models with the `Qwen3VLMoeForConditionalGeneration` architecture can now use the xlite backend.

### How was this patch tested?

First, `Qwen3-VL-235B-A22B-Instruct`->`QwenMoeXliteModel` was tested and the vLLM server started successfully with the xlite backend, including launching the worker process and processing requests without crashing (see the script below). Previous models (`Qwen3-32B`->`LlamaXliteModel`, `Qwen3-235B-A22B`->`QwenMoeXliteModel`, `GLM-4.7`->`Glm4MoeXliteModel`) were also re-tested to confirm no regressions.
```bash
# Script to start the server and test a single request
# using Docker image for Ascend A3
export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# server launch command (with xlite enabled)
vllm serve /path/to/model \
    --host 0.0.0.0 \
    --port 8000 \
    --api-server-count 1 \
    --data-parallel-size 1 \
    --data-parallel-size-local 1 \
    --tensor-parallel-size 16 \
    --served-model-name mymodel \
    --max-num-seqs 16 \
    --max-model-len 40960 \
    --max-num-batched-tokens 4096 \
    --enable-expert-parallel \
    --trust-remote-code \
    --async-scheduling \
    --gpu-memory-utilization 0.9 \
    --block-size 128 \
    --allowed-local-media-path /path/to/media \
    --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}'

# single request example
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mymodel",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me how to sleep well at night."}
        ],
        "max_tokens": 128,
        "temperature": "0.0"
    }'
```

Further tests were performed to confirm the accuracy of the outputs using `ais_benchmark==3.1.0`. For `Qwen3-VL-235B-A22B-Instruct`, multiple datasets were evaluated, including a multimodal dataset `mmmu`. The results are summarized in the table below.
```bash
# command to run the evaluation for `Qwen3-VL-235B-A22B-Instruct` with xlite enabled
ais_bench --models vllm_api_general_chat --datasets \
    aime2024_gen_0_shot_chat_prompt \
    ceval_gen_0_shot_cot_chat_prompt \
    gpqa_gen_0_shot_cot_chat_prompt \
    gsm8k_gen_0_shot_cot_chat_prompt \
    math_prm800k_500_0shot_cot_gen \
    mmlu_gen_0_shot_cot_chat_prompt \
    mmmu_gen_cot \
    --work-dir "outputs/Qwen3-VL-235B-A22B-Instruct-xlite" \
    --mode all \
    --dump-eval-details --merge-ds \
    --max-num-workers 128
```

For the `vllm_api_general_chat` task type, the corresponding `ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py` was modified as:

```python
# ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="mymodel",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8000,
        url="",
        max_out_len=32768,
        batch_size=512,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=False,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
```

| dataset | version | metric | mode | w/ xlite | w/o xlite | Qwen official |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| aime2024 | 544e9a | accuracy | gen | 83.33 | _N.A._ | _74.7 (aime2025)_ |
| ceval | - | accuracy-weighted | gen | 90.42 | 90.19 | _N.A._ |
| gpqa | 5d9994 | accuracy | gen | 70.20 | _N.A._ | _N.A._ |
| gsm8k | 271d0b | accuracy | gen | 96.51 | _N.A._ | _N.A._ |
| livecodebench | 270e7b | pass@1 | gen | 52.61 | _N.A._ | _54.3 (LCBV6)_ |
| math | 9eff90 | accuracy | gen | 94.40 | _N.A._ | _N.A._ |
| mmlu | - | accuracy-weighted | gen | 89.94 | _N.A._ | 88.8 |
| mmmu* | 14da4b | [validation]: Overall | gen | 71.11 | _N.A._ | 78.7 |
| mmmu* | 14da4b | [dev]: Overall | gen | 69.33 | _N.A._ | _N.A._ |

> *Multimodal datasets that include image inputs.\
> Qwen official: <https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct>\
> The bug fix from <AISBench/benchmark#238> was also merged to ensure correct evaluation on the `mmmu` dataset.

For other previously supported models, their accuracies were also validated on the `ceval` dataset, with the results below:

| model | dataset | version | metric | mode | w/ xlite | w/o xlite |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Qwen3-32B | ceval | - | accuracy-weighted | gen | 88.26 | 88.48 |
| Qwen3-235B-A22B | ceval | - | accuracy-weighted | gen | 91.01 | _N.A._ |
| GLM-4.7 | ceval | - | accuracy-weighted | gen | 90.79 | _N.A._ |

These results confirm that the xlite backend is functioning correctly for the `Qwen3VLMoeForConditionalGeneration` architecture, with no observed alterations to previous models.

- vLLM version: v0.18.0
- vLLM main: vllm-project/vllm@29e4870

---------

Signed-off-by: Sijie Fu <fusijie@huawei.com>
Co-authored-by: Sijie Fu <fusijie@huawei.com>
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
PR Type
Related Issue
Relates to #224
🔍 Motivation
Previously, when benchmarking accuracy on the MMMU dataset, postprocessing of the models' raw predictions was missing or incorrect: the postprocessed predictions were identical to the raw predictions. As a result, even when a model output the correct answer, the evaluation still counted it as incorrect, producing a low accuracy score. For example,
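A hypothetical illustration of this failure mode, assuming an evaluator that scores by exact string match against the gold option letter (the real `MMMUEvaluator` may score differently, and the model outputs below are invented for the sketch):

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions that equal their reference exactly."""
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * hits / len(references)


# Invented raw model outputs: both contain the right letter,
# but neither IS the letter, so exact match rejects them.
raw_predictions = [
    "The diagram shows a series circuit, so the answer is B.",
    "Answer: C",
]
references = ["B", "C"]

print(exact_match_accuracy(raw_predictions, references))
```

Both answers are correct in substance, yet the score is zero because the raw text is compared to the single-letter reference without any extraction step.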
📝 Modification
This PR adds a `pred_postprocessor` to the MMMU-related evaluation configurations.
📐 Associated Test Results
Instead of the original predictions, the postprocessed predictions now show a single choice letter:
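The effect can be sketched with a simplified stand-in for the postprocessor. This is not the project's `last_option_postprocess`: the name `last_option_sketch` is hypothetical, and the word boundaries follow the reviewer's suggestion rather than the implementation quoted in the review, which matches any occurrence of an option letter. In the PR itself the function is wired in through `pred_postprocessor=dict(type=last_option_postprocess, options="ABCD")`.

```python
import re


def last_option_sketch(text: str, options: str = "ABCD") -> str:
    """Return the last standalone option letter in `text`, or "" if none.

    Hypothetical helper for illustration only. The \\b anchors keep letters
    inside words (the "A" in "Answer") from being counted as options.
    """
    matches = re.findall(rf"\b([{options}])\b", text)
    return matches[-1] if matches else ""


# Invented raw outputs, reduced to a single choice letter each.
for raw in [
    "The diagram shows a series circuit, so the answer is B.",
    "Answer: C",
]:
    print(repr(raw), "->", repr(last_option_sketch(raw)))
```

Taking the last match rather than the first suits chain-of-thought outputs, where the final answer normally comes after the reasoning. The sketch remains case-sensitive, so a lowercase answer such as "b" would not be extracted.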
✅ Checklist
Before PR:
After PR: