# SDXL Int8 Quantization Solution by Ammo

### Note:
This notebook requires nvidia-ammo > 0.9.x, which comes with NeMo framework container > 23.05. An example command to launch the container:

```
docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo --shm-size=8g \
     -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 \
     <your_nemo_container>
```
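
Once inside the container, it can help to confirm which version of the package is installed before going further. The snippet below is a minimal sketch that uses only the Python standard library. The helper name `installed_version` is hypothetical, and the distribution name `nvidia-ammo` is taken from the note above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what the current environment provides; compare it against the
# requirement stated in the note by eye.
print("nvidia-ammo:", installed_version("nvidia-ammo") or "not installed")
```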

This tutorial shows how to use Ammo to calibrate and quantize the UNet of SDXL within the NeMo framework.

Note that NeMo provides an end-to-end training framework for SDXL, and this quantization pipeline is meant to work with a `.nemo` checkpoint that users have trained on their own text-image dataset. In this tutorial, an open-source checkpoint is converted to `.nemo` format for illustration.

### Download SDXL checkpoint

In [None]:
## Download UNet checkpoint
! mkdir -p /sdxl_ckpts/stable-diffusion-xl-base-1.0/unet && wget -P /sdxl_ckpts/stable-diffusion-xl-base-1.0/unet https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/unet/diffusion_pytorch_model.safetensors
## Download VAE checkpoint
! mkdir -p /sdxl_ckpts/stable-diffusion-xl-base-1.0/vae && wget -P /sdxl_ckpts/stable-diffusion-xl-base-1.0/vae https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/vae/diffusion_pytorch_model.safetensors

### Convert downloaded checkpoint into `.nemo` format

In [1]:
WORKDIR = '/quantization'
! torchrun /opt/NeMo/examples/multimodal/text_to_image/convert_hf_ckpt_to_nemo.py \
    --model_type sdxl \
    --ckpt_path /sdxl_ckpts/stable-diffusion-xl-base-1.0/unet/diffusion_pytorch_model.safetensors \
    --hparams_file /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/conf/sd_xl_base_train.yaml \
    --nemo_file_path $WORKDIR/sdxl_base.nemo

FlashAttention Installed
[NeMo I 2024-04-24 22:13:11 distributed:42] Initializing torch.distributed with local_rank: 0, rank: 0, world_size: 1
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2024-04-24 22:13:12 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: tensor_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-04-24 22:13:12 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-04-24 22:13:12 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: pipeline_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-04-24 22:13:12

### Run the quantization script with the default config and export the quantized UNet to ONNX

##### Quantization config

```yaml
quantize:
  exp_name: nemo_test
  n_steps: 20          # number of inference steps
  format: 'int8'       # only int8 quantization is supported now
  percentile: 1.0      # Controls the range of steps used to collect quantization scaling factors (amax): the minimum amax over `(n_steps * percentile)` steps is collected. Recommendation: 1.0
  batch_size: 1        # batch size used when calling the SDXL inference pipeline during calibration
  calib_size: 32       # For SDXL, we recommend 32, 64 or 128
  quant_level: 2.5     # Which layers to quantize, 1: `CNNs`, 2: `CNN + FFN`, 2.5: `CNN + FFN + QKV`, 3: `CNN + Linear`. Recommendation: 2, 2.5 or 3, depending on the requirements for image quality and speedup.
  alpha: 0.8           # A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL
```
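
To make the `percentile` and `alpha` settings more concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions, not the code Ammo runs. `calibrated_amax` follows a literal reading of the config comment and assumes the window covers the first `n_steps * percentile` steps. `smoothquant_scales` applies the per-channel scale from the SmoothQuant paper, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`. Both function names and the sample numbers are hypothetical.

```python
def calibrated_amax(amax_per_step, percentile=1.0):
    """Minimum amax over the first `n_steps * percentile` steps (assumed window)."""
    n_steps = len(amax_per_step)
    window = max(1, int(n_steps * percentile))  # always look at one step or more
    return min(amax_per_step[:window])


def smoothquant_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-channel SmoothQuant scale: max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Hypothetical per-step amax values recorded for one layer during calibration.
steps = [4.0, 3.0, 5.0, 2.0]
print(calibrated_amax(steps, percentile=1.0))  # considers all four steps
print(calibrated_amax(steps, percentile=0.5))  # considers the first two steps only

# Hypothetical per-channel absolute maxima for activations and weights.
print(smoothquant_scales([16.0, 4.0], [1.0, 2.0], alpha=0.8))
```

A larger `alpha` moves more of the quantization difficulty from the activations to the weights, which is why the scale grows with the activation range.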

##### Onnx export config

```yaml
onnx_export:
  onnx_dir: nemo_onnx    # Path to save onnx files
  pretrained_base: ${model.restore_from_path}  # Path to nemo checkpoint for sdxl
  quantized_ckpt: nemo.unet.state_dict.${quantize.exp_name}.pt  # Path to save quantized unet checkpoint
  format: int8
```

##### TRT export config

```yaml
trt_export:
  static_batch: False # static batch engines have better latency
  min_batch_size: 1   # minimum batch size with dynamic batch; must equal max_batch_size and infer.num_samples with static batch
  max_batch_size: 1   # maximum batch size with dynamic batch; must equal min_batch_size and infer.num_samples with static batch
  int8: True          # Allow the engine builder to use int8 precision
  builder_optimization_level: 4  # set to 1-5; a higher optimization level gives better latency but a longer build time
  trt_engine: int8_unet_xl.plan  # path to save trt engine
```
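
The batch-size constraints in the comments above can be checked before building an engine. The following is a minimal sketch: `check_batch_config` is a hypothetical helper, not part of NeMo. The static-batch rules come from the config comments, while the dynamic-batch range check on `infer.num_samples` is an assumption added for illustration.

```python
def check_batch_config(static_batch, min_batch_size, max_batch_size, num_samples):
    """Return a list of problems with the trt_export batch settings (empty if none)."""
    problems = []
    if min_batch_size > max_batch_size:
        problems.append("min_batch_size exceeds max_batch_size")
    if static_batch:
        # From the config comments: all three values must match for static batch.
        if min_batch_size != max_batch_size:
            problems.append("static batch needs min_batch_size == max_batch_size")
        if num_samples != max_batch_size:
            problems.append("static batch needs infer.num_samples == max_batch_size")
    elif not (min_batch_size <= num_samples <= max_batch_size):
        # Assumption: a dynamic-batch engine only serves sizes inside its range.
        problems.append("infer.num_samples outside [min_batch_size, max_batch_size]")
    return problems


# The default values shown above pass the check.
print(check_batch_config(static_batch=False, min_batch_size=1, max_batch_size=1, num_samples=1))
```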

The following command restores a pre-trained SDXL model from `$WORKDIR/sdxl_base.nemo`, which was produced in the conversion step above.
The quantized U-Net checkpoint is saved to `quantize.quantized_ckpt`, the converted ONNX file to `onnx_export.onnx_dir`, and the TRT engine to `trt_export.trt_engine`.
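
The command passes these locations as Hydra-style `key=value` overrides. As a sketch, the same override list can be assembled in Python so that every path derives from a single working directory. `quantize_overrides` is a hypothetical helper. The keys, file names and script path mirror the ones used in this notebook.

```python
from pathlib import PurePosixPath

SCRIPT = "/opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_quantize.py"


def quantize_overrides(workdir):
    """Build the key=value overrides for the quantization script from one directory."""
    root = PurePosixPath(workdir)
    return [
        f"model.restore_from_path={root / 'sdxl_base.nemo'}",
        f"onnx_export.onnx_dir={root / 'nemo_onnx'}",
        f"quantize.quantized_ckpt={root / 'nemo.unet.state_dict.nemo.pt'}",
        f"trt_export.trt_engine={root / 'int8_unet_xl.plan'}",
    ]


# Print the full command line for inspection; nothing is executed here.
print(" ".join(["torchrun", SCRIPT, *quantize_overrides("/quantization")]))
```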

In [6]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_quantize.py \
    model.restore_from_path=$WORKDIR/sdxl_base.nemo \
    onnx_export.onnx_dir=$WORKDIR/nemo_onnx \
    quantize.quantized_ckpt=$WORKDIR/nemo.unet.state_dict.nemo.pt \
    trt_export.trt_engine=$WORKDIR/int8_unet_xl.plan


FlashAttention Installed
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
[NeMo W 2024-04-24 19:42:59 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:563: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
    
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2024-04-24 19:43:09 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: tensor_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-04-24 19:43:09 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: context_parallel_size in its cfg. Add this key to c

### Build end-to-end TRT inference pipeline

To run end-to-end inference with the quantized U-Net engine, we need to export and build engines for the other components of SDXL: the VAE and the two CLIP encoders. The following script restores SDXL from the `.nemo` checkpoint and saves the corresponding engine files to `infer.out_path`.

In [2]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_export.py model.restore_from_path=$WORKDIR/sdxl_base.nemo infer.out_path=$WORKDIR

FlashAttention Installed
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
[NeMo W 2024-04-24 22:17:42 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:563: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
    
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2024-04-24 22:17:50 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: tensor_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-04-24 22:17:50 megatron_base_model:1172] The model: MegatronDiffusionEngine() does not have field.name: context_parallel_size in its cfg. Add this key to c

### Run TRT inference pipeline with original engines
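
Before launching inference, it can be useful to confirm that the engine files the command expects are present. The sketch below assumes the engines sit in a `plan` subdirectory of the working directory, as the paths in the command suggest. `missing_engines` is a hypothetical helper, not part of NeMo.

```python
from pathlib import Path

# Engine file names as they appear in the inference command.
ENGINE_FILES = ("unet_xl.plan", "vae.plan", "clip1.plan", "clip2.plan")


def missing_engines(workdir, names=ENGINE_FILES):
    """Return the engine file names not found under `<workdir>/plan`."""
    plan_dir = Path(workdir) / "plan"
    return [name for name in names if not (plan_dir / name).is_file()]


# An empty list means every expected engine file exists.
print(missing_engines("/quantization"))
```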

In [None]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_trt_inference.py \
    out_path=$WORKDIR/trt_output_fp16 \
    unet_xl=$WORKDIR/plan/unet_xl.plan \
    vae=$WORKDIR/plan/vae.plan \
    clip1=$WORKDIR/plan/clip1.plan \
    clip2=$WORKDIR/plan/clip2.plan
    

FlashAttention Installed
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
Loading TensorRT engine: /quantization/plan/unet_xl.plan
[I] Loading bytes from /quantization/plan/unet_xl.plan
unet_xl trt engine loaded successfully
Loading TensorRT engine: /quantization/plan/vae.plan
[I] Loading bytes from /quantization/plan/vae.plan
vae trt engine loaded successfully
Loading TensorRT engine: /quantization/plan/clip1.plan
[I] Loading bytes from /quantization/plan/clip1.plan
clip1 trt engine loaded successfully
Loading TensorRT engine: /quantization/plan/clip2.plan
[I] Loading bytes from /quantization/plan/clip2.plan
clip2 trt engine loaded successfully
[NeMo I 2024-04-24 22:46:17 utils:108] Getting module=<nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.discretizer>, cls=<LegacyDDPMDiscretization>
[NeMo I 2024-04-24 22:46:17 utils:108] Getting module=<nemo.collections.multimodal.modules.stab

### Run TRT inference pipeline with quantized U-Net engine

In [5]:
! torchrun /opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_trt_inference.py \
    out_path=$WORKDIR/trt_output_int8 \
    unet_xl=$WORKDIR/int8_unet_xl.plan \
    vae=$WORKDIR/plan/vae.plan \
    clip1=$WORKDIR/plan/clip1.plan \
    clip2=$WORKDIR/plan/clip2.plan

^C
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py", line 3109, in _dep_map
  File "/usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py", line 2902, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/NeMo/examples/multimodal/text_to_image/stable_diffusion/sd_xl_trt_inference.py", line 25, in <module>
    from nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.denoiser import DiscreteDenoiser
  File "/opt/NeMo/nemo/collections/multimodal/modules/stable_diffusion/diffusionmodules/denoiser.py", line 17, in <module>
    from nemo.collections.multimodal.parts.stable_diffusion.utils import append_dims, instantiate_from_config
  File "/opt/NeMo/nemo/collections/multimodal/parts/stable_diffusion/utils.py", line 25, in <module>
    from nemo.uti