feat: add end-to-end vLLM W4A8+FP8 mixed quantization pipeline #255

Merged
liusong1222 merged 5 commits into Tencent:main from liusong1222:feature/vllm_calibrate on Mar 11, 2026

Conversation

@liusong1222
Collaborator

feat: add end-to-end vLLM W4A8+FP8 mixed quantization pipeline for DeepSeek V3

Major changes:

  • Add weight_quantize.py: standalone weight quantization module supporting
    FP8 blockwise and INT4 symmetric per-group quantization with multi-process
    parallel processing on safetensors files (no full HF model loading required)
  • Add VLLMCalibrateEngine in engine.py: unified engine for vLLM-based
    calibration (activation/MoE stats collection) and weight quantization,
    with support for skipping calibration when stats already exist
  • Add CalibrateConfig in config_parser.py: YAML-driven calibration config
    (backend, tp_size, max_num_seqs, etc.) integrated into FullConfig
  • Add pack_weight_to_int8_gpu() in packing_utils.py: pure PyTorch
    GPU-accelerated INT4→INT8 packing (no numpy dependency)
  • Add YAML config deepseek_r1_w4a8_fp8_vllm.yaml and shell script for
    DeepSeek R1 W4A8+FP8 quantization workflow
  • Integrate vLLM calibrate path into tools/run.py via vllm_calibrate_run()
  • Delete standalone tools/run_vllm_calibrate.py (consolidated into engine)
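
To make the INT4 items above concrete, here is a minimal standard-library sketch of symmetric per-group INT4 quantization followed by packing two 4-bit values into one signed byte. It is not the code from weight_quantize.py or pack_weight_to_int8_gpu(); the function names, the divisor of 7, the clamp range, and the low-nibble-first packing order are all assumptions chosen for illustration, and the real implementation operates on GPU tensors rather than Python lists.

```python
from typing import List, Tuple


def quantize_int4_per_group(
    weights: List[float], group_size: int
) -> Tuple[List[int], List[float]]:
    """Symmetric per-group INT4 quantization of one flat weight row (sketch)."""
    if len(weights) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    q_values: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumption: map the largest magnitude in the group to 7.
        # An all-zero group gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            # Round to nearest integer, then clamp to the signed 4-bit range.
            q_values.append(max(-8, min(7, round(w / scale))))
    return q_values, scales


def pack_int4_pairs(q_values: List[int]) -> List[int]:
    """Pack pairs of signed INT4 values into signed INT8 values (sketch).

    Assumption: the first value of each pair goes into the low nibble.
    """
    if len(q_values) % 2 != 0:
        raise ValueError("need an even number of INT4 values")
    packed: List[int] = []
    for lo, hi in zip(q_values[0::2], q_values[1::2]):
        byte = ((hi & 0xF) << 4) | (lo & 0xF)
        # Reinterpret the unsigned byte as a signed 8-bit integer.
        packed.append(byte - 256 if byte >= 128 else byte)
    return packed


def unpack_int4_pairs(packed: List[int]) -> List[int]:
    """Inverse of pack_int4_pairs, used here only to check the round trip."""
    out: List[int] = []
    for value in packed:
        byte = value & 0xFF
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend the 4-bit value.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


if __name__ == "__main__":
    row = [3.5, -1.0, 0.5, 0.0, -7.0, 2.0, 7.0, 1.0]
    q, scales = quantize_int4_per_group(row, group_size=4)
    packed = pack_int4_pairs(q)
    # The packed bytes must decode back to the same INT4 values.
    assert unpack_int4_pairs(packed) == q
    print(q, scales, packed)
```

Packing halves the storage for the quantized weights; the per-group scales are kept separately so that each group can be dequantized as `q * scale`.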

Comment thread angelslim/engine.py Outdated
print("\n" + "=" * 80)
print("Calibration completed successfully!")
print(f"Results saved to: {output_dir}")
print("=" * 80)
Collaborator

Please use the print_info function consistently for all of these messages.

Collaborator Author

done

Comment thread angelslim/engine.py

return {"activation_stats": activation_stats, "moe_stats": moe_stats}

def quantize(
Collaborator

Could the quantize function be abstracted into the compressor/quant folder?

Collaborator Author

quantize is the function in the engine that controls the top-level flow, so I think it is better to keep it in the engine. I moved the _moe_expert_stats_to_input_scales function, which was under quantize, into the quant folder, which reduces the amount of quantize code in the engine.



@dataclass
class CalibrateConfig:
Collaborator

Could CalibrateConfig be placed under QuantizationConfig, similar to quant_method (for example as calibrate_method)?

Collaborator Author

I moved CalibrateConfig under CompressionConfig, alongside QuantizationConfig. Would this hierarchy work better?
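
For readers following the hierarchy discussed in this thread, the layout can be sketched roughly as below. This is an illustrative sketch only, not the definitions in config_parser.py: the class names carry a `Sketch` suffix to mark them as hypothetical, the `name` field and all default values are assumptions, and only `backend`, `tp_size`, and `max_num_seqs` are taken from the PR description.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class QuantizationConfigSketch:
    # Hypothetical field; the real QuantizationConfig has its own schema.
    name: str = "w4a8_fp8"


@dataclass
class CalibrateConfigSketch:
    # Field names come from the PR description; defaults are assumptions.
    backend: str = "vllm"
    tp_size: int = 1
    max_num_seqs: int = 1


@dataclass
class CompressionConfigSketch:
    # The two sections sit side by side, as proposed in the reply above.
    quantization: Optional[QuantizationConfigSketch] = None
    calibrate: Optional[CalibrateConfigSketch] = None


def build_section(cls: Type[T], raw: Optional[Dict[str, Any]]) -> Optional[T]:
    """Build one config section from a parsed YAML mapping, rejecting unknown keys."""
    if raw is None:
        return None
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown keys for {cls.__name__}: {sorted(unknown)}")
    return cls(**raw)


def parse_compression(raw: Dict[str, Any]) -> CompressionConfigSketch:
    """Map a parsed `compression` mapping onto the nested dataclasses."""
    return CompressionConfigSketch(
        quantization=build_section(QuantizationConfigSketch, raw.get("quantization")),
        calibrate=build_section(CalibrateConfigSketch, raw.get("calibrate")),
    )


if __name__ == "__main__":
    cfg = parse_compression(
        {
            "quantization": {"name": "w4a8_fp8"},
            "calibrate": {"backend": "vllm", "tp_size": 8},
        }
    )
    print(cfg)
```

Keeping the calibration section as a sibling of the quantization section means a config without a `calibrate` key simply yields `None` for that section, leaving the quantization schema untouched.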

yghstill previously approved these changes Mar 10, 2026
@liusong1222 liusong1222 merged commit 9985e28 into Tencent:main Mar 11, 2026
5 checks passed
dawnranger pushed a commit to dawnranger/AngelSlim that referenced this pull request Mar 11, 2026