# NVIDIA ModelOpt Quantization Aware Training (QAT) Walkthrough

**Quantization Aware Training (QAT)** is a method in which a model learns to compensate for the effects of quantization during post-training, so that it preserves accuracy when deployed in very low-precision formats. QAT inserts quantizer nodes into the computational graph, mimicking the rounding and clamping operations that occur during actual quantization. This allows the model to adapt its weights and activations to mitigate accuracy loss.
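To make the rounding and clamping concrete, here is a minimal sketch of a "fake quantization" step on a symmetric integer grid. It is an illustration only: the function name `fake_quantize` and the integer grid are assumptions chosen for simplicity, and this is neither the NVFP4 format nor ModelOpt's implementation.

```python
def fake_quantize(values, amax, num_bits=4):
    """Quantize then dequantize floats on a symmetric integer grid (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1            # largest integer level, e.g. 7 for 4 bits
    scale = amax / qmax if amax > 0 else 1.0  # float value of one integer step
    result = []
    for value in values:
        level = round(value / scale)           # rounding to the nearest grid point
        level = max(-qmax, min(qmax, level))   # clamping to the representable range
        result.append(level * scale)           # back to float, now carrying quantization error
    return result


weights = [0.03, -0.41, 0.27, 0.70, -0.12]
print(fake_quantize(weights, amax=0.70))
```

During QAT the forward pass sees values like these instead of the original weights, so the training loss reflects the quantization error and the optimizer can adapt to it.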

This notebook demonstrates how to apply Quantization Aware Training (QAT) to an LLM, Qwen3-8B in this example, with NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrating on a small eval subset, applying NVFP4 quantization, and finally deploying the quantized model to TensorRT-LLM.

## Installing Prerequisites and Dependencies

If you haven't already, install the required dependencies for this notebook. Key dependencies include:

- nvidia-modelopt
- torch
- transformers
- jupyterlab

This repo contains an `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependencies.

In [None]:
!pip install -r requirements.txt

## Setting HuggingFace Token and Model for Download (Optional)

If your model requires authentication *(not required for Qwen3-8B)*, set the `HF_TOKEN` environment variable, making sure to update it to include your token (e.g. `%env HF_TOKEN=hf_abdxyz...`). Be careful to remove your token from this notebook before checking in your code to any public repository.

In [None]:
%env HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
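One way to confirm the variable is set without echoing the secret into the notebook output is sketched below. The helper name `describe_hf_token` is hypothetical, and treating a value starting with `<` as the unfilled placeholder is an assumption based on the cell above.

```python
import os


def describe_hf_token(env=os.environ):
    """Report whether HF_TOKEN is set without revealing its value."""
    token = env.get("HF_TOKEN", "")
    # Assumption: a value starting with "<" is the unfilled placeholder from the cell above.
    if not token or token.startswith("<"):
        return "HF_TOKEN is not set (fine for public models)"
    return f"HF_TOKEN is set ({len(token)} characters, value hidden)"


print(describe_hf_token())
```

Because only the length is printed, the cell output is safe to commit even if the token itself was set in the environment.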

We will use **Qwen/Qwen3‑8B** in this example, but you can change the model name to what suits your needs.

In [1]:
model_name = "Qwen/Qwen3-8B"

## Import Required Libraries

In [2]:
import modelopt.torch.opt as mto

# Enable automatic save/load of the ModelOpt state alongside Hugging Face checkpoints.
# The ModelOpt state is saved automatically to "modelopt_state.pth".
mto.enable_huggingface_checkpointing()

ModelOpt save/restore enabled for `transformers` library.
ModelOpt save/restore enabled for `diffusers` library.
ModelOpt save/restore enabled for `peft` library.


## Model Configuration

Configure the model parameters including the model path, attention implementation, and data type. Set up the model configuration and prepare the model loading arguments.

In [3]:
from trl import ModelConfig

model_args = ModelConfig(
    model_name_or_path=model_name,
    attn_implementation="eager",
    torch_dtype="bfloat16",
)
model_kwargs = {
    "revision": model_args.model_revision,
    "trust_remote_code": model_args.trust_remote_code,
    "attn_implementation": model_args.attn_implementation,
    "torch_dtype": model_args.torch_dtype,
    "use_cache": False,
    "device_map": "auto",
}
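The `torch_dtype="bfloat16"` setting means each weight occupies two bytes once loaded. The back-of-the-envelope sketch below compares that with a 4-bit block-scaled format. The parameter count of 8e9 is a rough assumption for an "8B" model, and the 4.5 bits per weight figure assumes 4-bit values plus one 8-bit scale per 16-element block; layers that stay unquantized would raise the real number.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB for a given number of bits per parameter."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # assumption: rough parameter count of an "8B" model

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
# Assumption: 4-bit values plus one 8-bit scale shared by each block of 16 values.
nvfp4_gib = weight_memory_gib(NUM_PARAMS, 4 + 8 / 16)

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"NVFP4 weights (approx.): {nvfp4_gib:.1f} GiB")
```

This only covers weights; activations, the KV cache, and optimizer state during training add to the footprint.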

## Load the Model and Tokenizer

Load the pre-trained model and tokenizer with the specified configuration.

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)

tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
)

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

## Dataset Configuration

Set up the dataset parameters for training and evaluation. This includes specifying the dataset name, train/test splits, and test size ratio.

In [5]:
from trl import ScriptArguments

script_args = ScriptArguments(
    dataset_name="HuggingFaceH4/Multilingual-Thinking",
    dataset_train_split="train",
    dataset_test_split="test",
)
test_size = 0.1

## Load and Prepare Dataset

Load the dataset and split it into training and evaluation sets. The dataset is split with the specified test size ratio and random seed for reproducibility.

In [6]:
from datasets import load_dataset

dataset = load_dataset(script_args.dataset_name)
# split the dataset into train and test
dataset = dataset[script_args.dataset_train_split].train_test_split(test_size=test_size, seed=42)
train_dataset = dataset[script_args.dataset_train_split]
eval_dataset = dataset[script_args.dataset_test_split]
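The role of the seed can be illustrated with a small stand-alone sketch. This is not the `datasets` implementation; the helper `seeded_split` and the choice to round the test size up are assumptions made for illustration. The point is that a fixed seed yields the same partition on every run.

```python
import math
import random


def seeded_split(items, test_size, seed):
    """Shuffle indices with a fixed seed, then carve off a test fraction (illustrative only)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # deterministic for a given seed
    n_test = math.ceil(len(items) * test_size)    # assumption: round the test size up
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test


examples = [f"example-{i}" for i in range(20)]
train_a, test_a = seeded_split(examples, test_size=0.1, seed=42)
train_b, test_b = seeded_split(examples, test_size=0.1, seed=42)
print(len(train_a), len(test_a), test_a == test_b)
```

Keeping the evaluation set fixed across runs matters here because the same subset is later reused for calibration and for tracking validation loss.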

## Training Configuration

Configure the training parameters including epochs, batch sizes, learning rate, gradient accumulation, and evaluation strategy. This sets up the SFT configuration for supervised fine-tuning.

In [7]:
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen3-8b-qat-multilingual-reasoner",
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    max_length=4096,
    warmup_ratio=0.03,
    eval_strategy="steps",
    eval_on_start=True,
    logging_steps=50,
    save_steps=450,
    eval_steps=50,
    save_total_limit=2,
)
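The batch settings above determine how many optimizer steps one epoch takes, which in turn decides how often evaluation and checkpointing fire. The sketch below works through that arithmetic. The figure of 900 training examples is an assumption inferred from the throughput reported by the training output later in this notebook (about 1.5 samples/s over roughly 599 s), and a single data-parallel device is assumed; the trainer's exact rounding at the tail of an epoch may differ.

```python
import math


def optimizer_steps_per_epoch(num_samples, per_device_batch, grad_accum, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, math.ceil(num_samples / effective_batch)


NUM_TRAIN_SAMPLES = 900  # assumption, see the note above
EVAL_STEPS = 50          # matches eval_steps in the SFTConfig above

effective_batch, steps = optimizer_steps_per_epoch(
    NUM_TRAIN_SAMPLES, per_device_batch=1, grad_accum=2
)
# One evaluation every EVAL_STEPS steps, plus the initial one from eval_on_start=True.
evaluations = steps // EVAL_STEPS + 1

print(effective_batch, steps, evaluations)
```

Under these assumptions the epoch ends at step 450, which is also the `save_steps` value, so a single checkpoint is written at the end of training.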

## Initialize Trainer

Set up the SFT trainer with the model, dataset, and training configuration.

In [8]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
)

[2025-09-04 09:34:55,478] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcufile: No such file or directory
collect2: error: ld returned 1 exit status


[2025-09-04 09:34:56,319] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False


## Quantization Aware Training

Configure the quantization parameters and prepare the calibration dataset. This step sets up the quantization configuration, creates a calibration subset from the evaluation dataset, and defines a forward loop function for model calibration. The calibration process helps determine optimal quantization scales for the model weights and activations.

In [9]:
import torch

import modelopt.torch.quantization as mtq

# Some configs don't need calibration, but other quantization configurations may require it.
quantization_config = mtq.NVFP4_DEFAULT_CFG
calib_size = 128

# Use a small slice of the evaluation set as calibration data.
calib_dataset = torch.utils.data.Subset(
    trainer.eval_dataset, list(range(min(len(trainer.eval_dataset), calib_size)))
)
data_loader = trainer.get_eval_dataloader(calib_dataset)


def forward_loop(model):
    """Run the calibration batches through the model so the quantizers can collect statistics."""
    for data in data_loader:
        model(**data)
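Conceptually, calibration watches the values that flow through each quantizer during the forward loop and derives a scale from them. The toy class below sketches a max-based version of that idea in plain Python. `ToyMaxCalibrator` is a hypothetical name, the batches are made-up numbers, and ModelOpt's real calibrators operate on tensors inside the quantizer modules.

```python
class ToyMaxCalibrator:
    """Track the largest absolute value seen across calibration batches (illustrative only)."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        # Update the running maximum magnitude with the values from one batch.
        self.amax = max(self.amax, max(abs(x) for x in batch))

    def scale(self, qmax):
        # Map the observed range onto the largest representable magnitude.
        return self.amax / qmax


calibrator = ToyMaxCalibrator()
for batch in ([0.2, -1.5, 0.7], [3.0, -0.1], [-2.4, 0.9]):  # made-up activations
    calibrator.observe(batch)

print(calibrator.amax, calibrator.scale(qmax=6.0))
```

A calibration set that is too small or unrepresentative can miss large activations, which is why the subset size (`calib_size`) is worth tuning.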

Apply quantization to the model using the prepared configuration and calibration data.

In [10]:
mtq.quantize(model, quantization_config, forward_loop)
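For intuition about what the inserted quantizers simulate, the sketch below fake-quantizes a list of floats to a 4-bit floating-point grid with one scale per block. It assumes NVFP4 stores E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) in blocks of 16 that share a scale. It deliberately ignores that the real format also stores the block scales in reduced precision, and it is not ModelOpt's code; the function names are hypothetical.

```python
import math

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # assumed FP4 (E2M1) grid
BLOCK_SIZE = 16                                              # assumed NVFP4 block size


def fake_quantize_block(block):
    """Snap one block to the FP4 grid using a single shared scale."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return list(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # the largest value in the block maps to 6.0
    quantized = []
    for value in block:
        # Pick the nearest representable magnitude, then restore the sign.
        magnitude = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(value) / scale - g))
        quantized.append(math.copysign(magnitude * scale, value))
    return quantized


def fake_quantize_nvfp4(values):
    """Apply block-wise fake quantization over a flat list of floats."""
    result = []
    for start in range(0, len(values), BLOCK_SIZE):
        result.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return result


print(fake_quantize_nvfp4([6.0, 2.4, -0.2, 0.0]))
```

Because each block gets its own scale, a block of small values keeps fine resolution even when another block in the same tensor contains large outliers.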

Once the model is quantized, you can start the post-training process for QAT. With the settings above, the trainer computes validation loss every 50 steps (`eval_steps`) and saves checkpoints to `output_dir` every 450 steps (`save_steps`). Adjust these and the other `training_args` as needed.

In [11]:
trainer.train()

| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0 | No log | 2.010943 | 0.638744 | 0.0 | 0.653663 |
| 50 | 1.140700 | 0.940151 | 0.973159 | 50336.0 | 0.741705 |
| 100 | 0.879300 | 0.899297 | 0.885603 | 108411.0 | 0.74618 |
| 150 | 0.931500 | 0.887055 | 0.896838 | 162282.0 | 0.74811 |
| 200 | 0.902700 | 0.877437 | 0.890031 | 215991.0 | 0.747941 |
| 250 | 0.889600 | 0.875439 | 0.881631 | 266630.0 | 0.749481 |
| 300 | 0.898200 | 0.874025 | 0.868418 | 321852.0 | 0.751207 |
| 350 | 0.929300 | 0.871051 | 0.879694 | 379927.0 | 0.749429 |
| 400 | 0.884500 | 0.873047 | 0.881471 | 433364.0 | 0.750523 |
| 450 | 0.925100 | 0.874171 | 0.879648 | 489988.0 | 0.749288 |


Saved ModelOpt state to qwen3-8b-qat-multilingual-reasoner/checkpoint-450/modelopt_state.pth


TrainOutput(global_step=450, training_loss=0.9311961958143447, metrics={'train_runtime': 598.9058, 'train_samples_per_second': 1.503, 'train_steps_per_second': 0.751, 'total_flos': 2.225056725656371e+16, 'train_loss': 0.9311961958143447, 'epoch': 1.0})
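A quick way to read the evaluation log is to look for the step with the lowest validation loss and compare it with the final step. The sketch below does this with the validation losses copied from the training output above; the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the training output above.
eval_log = [
    (0, 2.010943), (50, 0.940151), (100, 0.899297), (150, 0.887055),
    (200, 0.877437), (250, 0.875439), (300, 0.874025), (350, 0.871051),
    (400, 0.873047), (450, 0.874171),
]

best_step, best_loss = min(eval_log, key=lambda row: row[1])
final_step, final_loss = eval_log[-1]
relative_gap = (final_loss - best_loss) / best_loss

print(f"best: step {best_step} ({best_loss}), final: step {final_step} ({final_loss})")
print(f"final loss is {relative_gap:.2%} above the best")
```

In this run the loss has flattened out by step 350 and the final checkpoint is within a fraction of a percent of the best value, which suggests that a single epoch was sufficient for this dataset.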

**Note:** A QAT checkpoint for the `nvfp4` config can also be created by passing `--quant_cfg NVFP4_DEFAULT_CFG` to the QAT example.

For more details on deploying quantized models, see the [llm_ptq README](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/cb98b221e1b1730226257e20b4c81ebb259fc2d6/examples/llm_ptq/README.md).

# Deploying the QAT Model with TensorRT-LLM

After completing the QAT workflow above, you should have a checkpoint folder, `./qwen3-8b-qat-multilingual-reasoner/checkpoint-450`, containing the model files, including the checkpoint and tokenizer. You can use this folder to serve the QAT NVFP4 model in TensorRT-LLM via Docker.

## Running TensorRT-LLM Docker

The easiest way to get started with TensorRT-LLM is to run a TensorRT-LLM docker container. Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use.

Open a new bash shell and run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match the latest release):

In [None]:
docker run --rm --ipc=host -it \
  --ulimit stack=67108864   --ulimit memlock=-1 \
  --gpus all   -p 8000:8000   -e TRTLLM_ENABLE_PDL=1 \
  -v ~/.cache:/root/.cache:rw --name tensorrt_llm \
  -v $(pwd)/qwen3-8b-qat-multilingual-reasoner/:/app/tensorrt_llm/qat \
  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2  /bin/bash

## Exporting the Quantized Model for Deployment

Before deploying the model with TensorRT-LLM, you need to export the model checkpoint files. This is similar to the step you take for a PTQ-quantized model. To export a unified Hugging Face checkpoint, which can be deployed on the TensorRT-LLM PyTorch backend, vLLM, and SGLang, run the [huggingface_example.sh](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/scripts/huggingface_example.sh) script found in the TensorRT Model Optimizer repo.

**Clone the TensorRT Model Optimizer repo inside the Docker container**

In [None]:
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git

**Install ModelOpt prerequisites**

In [None]:
cd TensorRT-Model-Optimizer/
pip install -e .

**Run HuggingFace checkpoint conversion script**

In [None]:
# set export path for converted checkpoints. The script saves the converted checkpoint in ${ROOT_SAVE_PATH}/saved_models_${MODEL_FULL_NAME}
export ROOT_SAVE_PATH=/app/tensorrt_llm

# run conversion script
cd ..
bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh --model $(pwd)/qat/checkpoint-450/ --quant nvfp4 --export_fmt hf
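To find the exported checkpoint afterwards, it helps to know where the script is expected to write it. The sketch below builds that path. The naming pattern `saved_models_<model dir>_<quant>_<format>` is an assumption inferred from the path used in the serving command later in this notebook, not taken from the script itself, and `expected_export_dir` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def expected_export_dir(root_save_path, model_path, quant, export_fmt):
    """Guess the export directory from the inputs (naming pattern is an assumption)."""
    model_dir_name = PurePosixPath(model_path).name  # last path component, e.g. "checkpoint-450"
    folder = f"saved_models_{model_dir_name}_{quant}_{export_fmt}"
    return str(PurePosixPath(root_save_path) / folder)


print(expected_export_dir(
    "/app/tensorrt_llm", "/app/tensorrt_llm/qat/checkpoint-450/", "nvfp4", "hf"
))
```

If the directory printed here does not exist after the script finishes, check the script's own log output for the actual save location rather than relying on this guess.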

## Serving the Model

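Run the following `trtllm-serve` command to launch the inference server. Set `--tp_size` to the number of GPUs you want to shard the model across; the value of 8 used here assumes an 8-GPU node.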

In [None]:
trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/  \
  --max_batch_size 1 --max_num_tokens 1024 \
  --max_seq_len 4096 --tp_size 8 --pp_size 1 \
  --host 0.0.0.0 --port 8000 \
  --kv_cache_free_gpu_memory_fraction 0.95

## Sending an Inference Request to TensorRT-LLM Server

In another terminal, or in the cell below, run the example `curl` command to send an inference request to the server.

In [12]:
%%sh
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen3/qwen3-8b-qat-multilingual-reasoner",
    "messages": [
        {
            "role": "user",
            "content": "What is NVIDIAs advantage for inference?"
        }
    ],
    "max_tokens": 1024,
    "top_p": 0.9
}' -w "\n"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   240    0     0  100   240      0    108  0:00:02  0:00:02 --:--:--   108

{"id":"chatcmpl-6ec07862eaaf4487b8ea85d5d37c90ac","object":"chat.completion","created":1757004926,"model":"Qwen3/qwen3-8b-qat-multilingual-reasoner","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n\n</think>\n\nNVIDIA's advantage in **inference** (the process of using a trained machine learning model to make predictions or decisions) lies in its **dedicated hardware and software ecosystem**, which provides **high performance, efficiency, and scalability** for AI workloads. Here's a breakdown of key advantages:\n\n---\n\n### **1. Hardware Specialization: GPUs (Graphics Processing Units)**\n- **High Parallelism**: NVIDIA GPUs are optimized for parallel processing, making them ideal for the massive computations required in inference (e.g., image recognition, NLP, autonomous driving).\n- **Tensor Cores**: NVIDIA's **Tensor Cores** (introduced in the Volta and Turing architectures) accelerate matrix operations critical for deep learning, enabling faster inference wit

100  5699  100  5459  100   240   2294    100  0:00:02  0:00:02 --:--:--  2395
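The same request can be sent from Python using only the standard library. The sketch below mirrors the payload of the `curl` command and reads the reply from `choices[0].message.content`, as seen in the response above. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the timeout value is an arbitrary choice.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="Qwen3/qwen3-8b-qat-multilingual-reasoner"):
    """Build a POST request carrying the same payload as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "top_p": 0.9,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is NVIDIAs advantage for inference?")
# Uncomment once the server is running:
# print(send_chat_request(request))
```

Building the request separately from sending it makes it easy to inspect the payload before the server is up.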





## Stop the TensorRT-LLM Docker Container

Finally, clean up by stopping the TensorRT-LLM server and exiting the Docker container. Alternatively, run the cell below to stop the running container. Because the container was started with the `--rm` flag, it is deleted automatically once stopped.

In [13]:
!docker stop tensorrt_llm

tensorrt_llm
