# SDXL Int8 Quantization Solution by Ammo

### Note:
This notebook requires nvidia-ammo > 0.9.x. An example command to launch the container:

```
docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo --shm-size=8g \
     -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 \
     <your_nemo_container>
```
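
Once inside the container, it can help to confirm which version of the package is actually installed before running the rest of the notebook. The snippet below is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `nvidia-ammo` is taken from the note above.

```python
from importlib import metadata


def installed_version(package="nvidia-ammo"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("nvidia-ammo:", version if version else "not installed")
```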

This tutorial shows how to use Ammo to calibrate and quantize the U-Net of SDXL within the NeMo framework.

Please note that NeMo provides an end-to-end training framework for SDXL, and this quantization pipeline is designed to work with a `.nemo` checkpoint trained on your own text-image dataset. In this tutorial, an open-source checkpoint is converted to `.nemo` format for illustration purposes.

### Download SDXL checkpoint

In [None]:
## Download U-Net checkpoint
! mkdir -p /sdxl_ckpts/stable-diffusion-xl-base-1.0/unet && wget -P /sdxl_ckpts/stable-diffusion-xl-base-1.0/unet https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/unet/diffusion_pytorch_model.safetensors
## Download VAE checkpoint
! mkdir -p /sdxl_ckpts/stable-diffusion-xl-base-1.0/vae && wget -P /sdxl_ckpts/stable-diffusion-xl-base-1.0/vae https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/vae/diffusion_pytorch_model.safetensors

### Convert downloaded checkpoint into `.nemo` format

In [None]:
WORKDIR = '/quantization'
! torchrun /opt/NeMo/examples/multimodal/text_to_image/convert_hf_ckpt_to_nemo.py \
    --model_type sdxl \
    --ckpt_path /sdxl_ckpts/stable-diffusion-xl-base-1.0/unet/diffusion_pytorch_model.safetensors \
    --hparams_file /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/conf/sd_xl_base_train.yaml \
    --nemo_file_path $WORKDIR/sdxl_base.nemo

### Run the quantization script with the default config and export the quantized U-Net to ONNX

##### Quantization config

```yaml
quantize:
  exp_name: nemo_test
  n_steps: 20          # number of inference steps
  format: 'int8'       # only int8 quantization is supported now
  percentile: 1.0      # Controls the range over which quantization scaling factors (amax) are collected: the minimum amax is taken over `(n_steps * percentile)` steps. Recommendation: 1.0
  batch_size: 1        # batch size calling sdxl inference pipeline during calibration
  calib_size: 32       # For SDXL, we recommend 32, 64 or 128
  quant_level: 2.5     # Which layers to quantize. 1: `CNN`, 2: `CNN + FFN`, 2.5: `CNN + FFN + QKV`, 3: `CNN + Linear`. Recommendation: 2, 2.5, or 3, depending on the required image quality and speedup.
  alpha: 0.8           # A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL
```
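
To make `percentile` and `alpha` more concrete, here is a small sketch in plain Python. It is an illustration only, not Ammo's implementation. `select_amax` follows one reading of the `percentile` comment (take the minimum amax over the first `n_steps * percentile` steps; that the window starts at the first step is an assumption). `smoothquant_scales` applies the per-channel SmoothQuant scale formula, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`. Both function names and all numbers are made up for this example.

```python
def select_amax(per_step_amax, n_steps=20, percentile=1.0):
    """Pick one activation amax from the values recorded at each inference step.

    Assumption: the collection window is the first int(n_steps * percentile) steps.
    """
    if len(per_step_amax) != n_steps:
        raise ValueError("expected one amax value per inference step")
    window = max(1, int(n_steps * percentile))
    return min(per_step_amax[:window])


def smoothquant_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-input-channel SmoothQuant scales for one linear layer.

    Activations are divided by the scale and weights multiplied by it, so a
    larger alpha moves more of the quantization difficulty onto the weights.
    """
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Made-up calibration statistics, for illustration only.
amax_history = [6.0, 5.5, 5.0, 4.0, 3.0]
print(select_amax(amax_history, n_steps=5, percentile=1.0))  # window covers all 5 steps
print(select_amax(amax_history, n_steps=5, percentile=0.4))  # window covers the first 2 steps

# Two input channels: one with a large activation outlier, one without.
print(smoothquant_scales([16.0, 1.0], [1.0, 1.0], alpha=0.5))
```

With `percentile: 1.0`, as recommended above, every inference step contributes to the window.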

##### ONNX export config

```yaml
onnx_export:
  onnx_dir: nemo_onnx    # Path to save onnx files
  pretrained_base: ${model.restore_from_path}  # Path to nemo checkpoint for sdxl
  quantized_ckpt: nemo.unet.state_dict.${quantize.exp_name}.pt  # Path to save quantized unet checkpoint
  format: int8
```

The following command restores the pre-trained SDXL model from `$WORKDIR/sdxl_base.nemo`, produced in the conversion step above.
The quantized U-Net checkpoint is saved to `quantize.quantized_ckpt`, and the converted ONNX file is saved to `onnx_export.onnx_dir`.

In [None]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_quantize.py \
    model.restore_from_path=$WORKDIR/sdxl_base.nemo \
    onnx_export.onnx_dir=$WORKDIR/nemo_onnx \
    quantize.quantized_ckpt=$WORKDIR/nemo.unet.state_dict.nemo.pt

### Build the TensorRT engine from the ONNX file

In [None]:
! trtexec --onnx=$WORKDIR/nemo_onnx/unet.onnx --shapes=x:8x4x128x128,timesteps:8,context:8x80x2048,y:8x2816 --fp16 --int8 --builderOptimizationLevel=4 --saveEngine=$WORKDIR/nemo_unet_xl.plan
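
The `--shapes` argument above fixes the input dimensions the engine is built for. The sketch below shows how that string could be assembled from a batch size and an image resolution. It assumes 4 latent channels and a VAE downsampling factor of 8, so the `128x128` latent would correspond to 1024x1024 images. The context width (2048), the `y` width (2816), and the sequence length (80) are copied from the command as-is. `unet_shapes` is a hypothetical helper, and whether the exported ONNX file accepts other shapes depends on how it was exported, which is not shown here.

```python
def unet_shapes(batch=8, height=1024, width=1024, context_len=80):
    """Build a trtexec --shapes string for the SDXL U-Net inputs.

    Assumptions: 4 latent channels and a VAE downsampling factor of 8;
    the 2048 and 2816 widths are copied from the trtexec command above.
    """
    latent_h, latent_w = height // 8, width // 8
    dims = {
        "x": (batch, 4, latent_h, latent_w),
        "timesteps": (batch,),
        "context": (batch, context_len, 2048),
        "y": (batch, 2816),
    }
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in shape)}" for name, shape in dims.items()
    )


# With the defaults, this reproduces the shapes used in the command above.
print(unet_shapes())
```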

### Build end to end TRT inference pipeline
To run end-to-end inference with the quantized U-Net engine, we need to export and build engines for the other components in SDXL, which include the VAE and the two CLIP encoders. The following script restores SDXL from the `.nemo` checkpoint and saves the corresponding engine files to `infer.out_path`.

In [None]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_export.py model.restore_from_path=$WORKDIR/sdxl_base.nemo infer.out_path=$WORKDIR
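
Before launching inference, a quick check that the engine files exist can save a failed run. The sketch below assumes the export step writes `unet_xl.plan`, `vae.plan`, `clip1.plan`, and `clip2.plan` into `$WORKDIR/plan`, which is inferred from the paths used by the inference commands below rather than from the export script itself. `missing_engines` is a hypothetical helper.

```python
from pathlib import Path

# Engine names inferred from the inference commands in this notebook.
ENGINE_NAMES = ("unet_xl", "vae", "clip1", "clip2")


def missing_engines(workdir, names=ENGINE_NAMES):
    """Return the expected .plan files that are not present under <workdir>/plan."""
    plan_dir = Path(workdir) / "plan"
    expected = [plan_dir / f"{name}.plan" for name in names]
    return [str(path) for path in expected if not path.is_file()]


# An empty list means all four engines were found.
print(missing_engines("/quantization"))
```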

### Run TRT inference pipeline with original engines

In [None]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_trt_inference.py \
    out_path=$WORKDIR/trt_output_fp16 \
    unet_xl=$WORKDIR/plan/unet_xl.plan \
    vae=$WORKDIR/plan/vae.plan \
    clip1=$WORKDIR/plan/clip1.plan \
    clip2=$WORKDIR/plan/clip2.plan
    

### Run TRT inference pipeline with quantized U-Net engine

In [None]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_trt_inference.py \
    out_path=$WORKDIR/trt_output_int8 \
    unet_xl=$WORKDIR/nemo_unet_xl.plan \
    vae=$WORKDIR/plan/vae.plan \
    clip1=$WORKDIR/plan/clip1.plan \
    clip2=$WORKDIR/plan/clip2.plan