<p> SPDX-FileCopyrightText: Copyright (c) <2024> NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 </p>

# Quantization

**Quantization** is a technique used to reduce the size of a model's weights, thereby decreasing memory usage and potentially increasing the speed of the model. It achieves this by converting weights from a high-precision format, such as 32-bit floating-point, to a lower-precision format, such as 4-bit or 8-bit integers. This can significantly reduce the model's size and enhance its prediction speed.
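
As a rough illustration of the size reduction described above, the following sketch estimates the storage needed for the weights alone at several precisions. The parameter count is an assumption chosen for illustration (roughly an 8B-parameter model), and `weight_memory_gib` is a hypothetical helper, not part of any library. Real quantized checkpoints also store scaling factors, so actual sizes are somewhat larger.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumed parameter count for illustration only.
NUM_PARAMS = 8_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```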

The quantization process involves the following steps:

- **Loading a model checkpoint:** Utilize a suitable parallelism strategy to load the model.
- **Calibrating the model:** Determine the appropriate algorithm-specific scaling factors.
- **Outputting the model:** Generate an output directory with the quantized model.

### Imports and Dependencies
Begin by importing the required libraries and dependencies.

In [None]:
# Dependencies
import torch
import os
import time
import torch.nn as nn
import modelopt.torch.quantization as mtq
from transformers import AutoTokenizer, AutoModelForCausalLM
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from peft import PeftModel
from datasets import load_dataset
from torch.utils.data import DataLoader

#### Constants:
- **Hugging Face Access Token**: Import the value of the Hugging Face Access Token from the environment variables or a separate file.
- **Model ID**: Add the ID of the base model to the code.
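
One possible way to read the token from either an environment variable or a separate file is sketched below. The helper name `load_token` and its `token_file` parameter are hypothetical and introduced only for this example. It returns `None` when no token is found, so the caller decides how to handle a missing value.

```python
import os
from pathlib import Path


def load_token(env_var="HF_TOKEN", token_file=None, environ=None):
    """Return the token from the environment, else from a file, else None."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_var, "").strip()
    if value:
        return value
    # Fall back to a plain-text file containing only the token (assumed layout).
    if token_file is not None and Path(token_file).is_file():
        return Path(token_file).read_text().strip() or None
    return None


token = load_token()
print("token found" if token else "no token found")
```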

In [None]:
hf_token = os.environ.get("HF_TOKEN")
# Change model_id
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

### Quantization
Post-Training Quantization (PTQ) enables deploying a model in a low-precision format (FP8, INT4, or INT8) for efficient serving. Several quantization methods are available, including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.

Next, define the path to the recently merged model and select the appropriate device based on the available hardware, so the code can run on either GPU or CPU.

**Ensure the correct merged checkpoint path is specified below.**

In [None]:
merged_model = "/project/data/scratch/merged"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

This example uses the INT4 AWQ method. Activation-aware Weight Quantization (AWQ) is a technique for compressing and accelerating Large Language Models (LLMs) by reducing the precision of the model weights. AWQ focuses on low-bit, weight-only quantization.
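
To make "low-bit, weight-only quantization" more concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization applied per group of weights. It is a simplified illustration and not the AWQ algorithm itself: AWQ additionally uses activation statistics to choose scales, which is omitted here. All function names and sample values are assumptions made for this example.

```python
def quantize_group_int4(weights):
    """Symmetric 4-bit quantization of one group of weights sharing one scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7 if amax > 0 else 1.0  # signed 4-bit integers span [-8, 7]
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def quantize_blockwise(weights, group_size):
    """Quantize consecutive groups of `group_size` weights independently."""
    return [
        quantize_group_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


# Sample weights: four small values followed by four large ones.
weights = [0.02, -0.05, 0.01, 0.07, 1.5, -2.0, 0.8, 3.5]

groups = quantize_blockwise(weights, group_size=4)
restored = [v for q, s in groups for v in dequantize_group(q, s)]

# For comparison: one scale shared by all eight weights.
single_q, single_scale = quantize_group_int4(weights)

print("per-group integers:", [q for q, _ in groups])
print("single-scale integers:", single_q)
```

With a single shared scale, the large weights dominate and the four small weights all round to zero. Giving each group its own scale preserves them, which is the motivation for storing per-group scaling factors alongside the low-bit weights.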

In [None]:
# Select the quantization config
config = mtq.INT4_AWQ_CFG

calib_size = 32
block_size = 512
batch_size = 1

Now, we will load the tokenizer and the recently merged model. After loading, we will calibrate the model. The calibration process in Post-Training Quantization (PTQ) is a crucial step that involves adjusting the quantization parameters of a model to minimize the loss of accuracy that typically occurs when the model's numerical precision is reduced.
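
The idea behind calibration can be sketched with a simple "max" calibrator: observe representative data, track the largest absolute value, and derive a scaling factor from it. This is only one basic strategy, shown under the assumption of symmetric integer quantization. The algorithm selected by the quantization config in this notebook is more elaborate. The class name and sample batches below are hypothetical.

```python
class AmaxCalibrator:
    """Tracks the largest absolute value seen across calibration batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        """Update the running maximum with one batch of values."""
        self.amax = max(self.amax, max(abs(x) for x in batch))

    def scale(self, num_bits=8):
        """Scaling factor that maps the observed range onto signed integers."""
        qmax = 2 ** (num_bits - 1) - 1
        return self.amax / qmax if self.amax > 0 else 1.0


# Made-up calibration batches standing in for real activations.
calib_batches = [[0.1, -0.4, 0.25], [1.27, -0.9, 0.0], [0.3, 0.6, -0.2]]

calibrator = AmaxCalibrator()
for batch in calib_batches:
    calibrator.observe(batch)

int8_scale = calibrator.scale(num_bits=8)
print(f"amax = {calibrator.amax}, INT8 scale = {int8_scale:.4f}")
```

Because the scale depends on the data observed, the calibration set should resemble the inputs the model will see at inference time.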

In [None]:
# The forward loop passes data through the model in order to collect statistics for calibration.
# It should wrap around the calibration dataloader and the model.
def calibrate_loop(model):
	"""Runs the calibration batches through the model so that statistics can be collected."""
	for idx, data in enumerate(calib_dataloader):
		print(f"Calibrating batch {idx}")
		model(data)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=merged_model)
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=merged_model, torch_dtype=torch.float16, device_map=device)

tokenizer.pad_token = tokenizer.eos_token

# Prepare calibration data
dataset2 = load_dataset("cnn_dailymail", name="3.0.0", split="train").select(range(512))
dataset2 = dataset2["article"][:calib_size]
batch_encoded = tokenizer.batch_encode_plus(dataset2, return_tensors="pt", padding=True, truncation=True, max_length=block_size)
batch_encoded = batch_encoded.to(device)
batch_encoded = batch_encoded["input_ids"]
calib_dataloader = DataLoader(batch_encoded, batch_size=batch_size, shuffle=False)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
	print("starting quantization (mtq.quantize API call)...")
	start_time = time.time()
	model = mtq.quantize(model, config, calibrate_loop)
	end_time = time.time()
	print(f"done, time taken = {end_time - start_time} seconds")

### Exporting the Quantized Model
Now that the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes:

- One JSON file recording the model structure and metadata, and
- One or several rank weight files storing quantized model weights and scaling factors.

In [None]:
export_dir = "/project/data/scratch/merged-int4"  # Change the checkpoint directory as needed
decoder_type = "llama"  # Change decoder_type to match the supported model
inference_tensor_parallel = 1
inference_pipeline_parallel = 1

# Export Quantized Model
with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,
        decoder_type,
        torch.float16,
        export_dir,
        inference_tensor_parallel,
        inference_pipeline_parallel
    )

print("\ntrt-llm checkpoint export done\n")
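
After the export finishes, it can be useful to confirm that the output directory contains the expected files. The following is a minimal sketch that lists files and their sizes. `summarize_checkpoint_dir` is a hypothetical helper that assumes nothing about the exact file names written by the exporter, and the path is the `export_dir` value used above.

```python
from pathlib import Path


def summarize_checkpoint_dir(directory):
    """Return sorted (file name, size in bytes) pairs for files directly under directory."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())


# Same path as export_dir above; prints nothing if the directory does not exist.
for name, size in summarize_checkpoint_dir("/project/data/scratch/merged-int4"):
    print(f"{name}: {size / 2**20:.1f} MiB")
```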