# TensorRT-LLM Quantization Toolkit Installation Guide

## Introduction

This document introduces:

- The steps to install the TensorRT-LLM quantization toolkit.
- The Python APIs to quantize the models.

The detailed LLM quantization recipes are distributed in the README.md files of the corresponding model examples.

## Installation

1. If the dev environment is a Docker container, please launch the container with the following flags:

   ```bash
   docker run --gpus all --ipc=host --ulimit memlock=-1 --shm-size=20g -it <the docker image with TensorRT-LLM installed> bash
   ```

2. Install the quantization toolkit `modelopt` and the related dependencies on top of the TensorRT-LLM installation or Docker image:

   ```bash
   # Install Modelopt
   pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt==0.9.3
   # Install the additional requirements
   cd <this example folder>
   pip install -r requirements.txt
   ```
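
To verify the installation, a quick import check can be run; this is a minimal sketch that only exercises the modules used later in this guide:

```bash
# Should exit silently if modelopt and its quantization/export modules are importable
python -c "import modelopt.torch.quantization, modelopt.torch.export"
```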

## APIs

`quantize.py` uses the quantization toolkit to calibrate the PyTorch models and export TensorRT-LLM checkpoints. Each TensorRT-LLM checkpoint contains a config file (in `.json` format) and one or several rank weight files (in `.safetensors` format). The checkpoints can be used directly by the `trtllm-build` command to build TensorRT-LLM engines. See this doc for more details on the TensorRT-LLM checkpoint format.
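
For example, an FP8 calibration run typically looks like the following; the flags are illustrative and vary between model examples, so check `python quantize.py --help` in the corresponding example folder:

```bash
# Calibrate a Hugging Face model and write a TensorRT-LLM checkpoint (illustrative flags)
python quantize.py --model_dir <path to the HF model> \
                   --qformat fp8 \
                   --output_dir <path for the exported checkpoint>
```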

This quantization step may take a long time to finish and requires a large amount of GPU memory. Please use a server-grade GPU if a GPU out-of-memory error occurs.

If the model was trained on multiple GPUs with tensor parallelism, the PTQ calibration process requires the same number of GPUs as were used at training time.

### PTQ (Post Training Quantization)

PTQ can be achieved with simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM
import modelopt.torch.quantization as atq

model = AutoModelForCausalLM.from_pretrained(...)

# Select the quantization config, for example, FP8
config = atq.FP8_DEFAULT_CFG

# Prepare the calibration set and define a forward loop
calib_dataloader = DataLoader(...)

def calibrate_loop():
    for data in calib_dataloader:
        model(data)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
    atq.quantize(model, config, forward_loop=calibrate_loop)
```
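
In practice the calibration batches are tokenized text that must live on the same device as the model. A minimal sketch of such a forward loop, assuming a Hugging Face tokenizer and a small list of raw text samples (`calib_texts` is a hypothetical placeholder you supply):

```python
from transformers import AutoTokenizer

# Use the tokenizer that matches the model being quantized
tokenizer = AutoTokenizer.from_pretrained(...)

def calibrate_loop():
    # Run each calibration sample through the model so the inserted
    # quantizers can observe activation ranges; no gradients are needed.
    for text in calib_texts:  # e.g. 128-512 short strings of representative data
        batch = tokenizer(text, return_tensors="pt").to(model.device)
        model(**batch)
```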

### Export Quantized Model

After the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes

- One `.json` file recording the model structure and metadata, and
- One or several rank weight files storing quantized model weights and scaling factors.

The export API is

```python
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gptj, llama or gptnext.
        dtype,  # The exported weights data type as torch.dtype.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel=tp_size,  # The tensor parallelism size for inference.
        inference_pipeline_parallel=pp_size,  # The pipeline parallelism size for inference.
    )
```
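
After the export finishes, `export_dir` contains the `config.json` and per-rank `.safetensors` files described above and can be passed to `trtllm-build`. A minimal sketch follows; the exact flags depend on the model and the TensorRT-LLM version:

```bash
# Build TensorRT-LLM engines from the exported checkpoint (illustrative flags)
trtllm-build --checkpoint_dir <export_dir> \
             --output_dir <engine_dir>
```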