# SPDX-FileCopyrightText: Copyright (c) <2024> NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Quantization

This notebook assumes the model has already been fine-tuned with the Llamafactory app. It walks through two processes: **Weight Merging**, which integrates the weights obtained from fine-tuning back into the original model, and **Model Optimization (Quantization)**, which reduces the model's size and can increase its inference speed.

The first part of the notebook shows how to merge the fine-tuned weights with the base model. Merging combines the broad knowledge captured by the original model with the specialized improvements gained through fine-tuning, and produces a single standalone model that no longer needs a separate adapter at inference time.

The second part covers **Quantization**, which stores the model's weights at lower precision so they take up less space. Model quantization has two primary benefits:

1. Reduce model memory requirements.
2. Increase inference throughput.

This is done by converting the weights from a higher-precision format (such as 32-bit or 16-bit floating point) to a lower-precision format (such as 4-bit or 8-bit integers). The model becomes much smaller, and inference can become faster.
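
As a rough illustration of the memory benefit, the sketch below estimates the storage needed for the weights alone at several precisions. The parameter count is a hypothetical round figure for a 7B-class model, not a measured value, and the estimate ignores scaling factors and any layers left unquantized, so real checkpoints will be somewhat larger.

```python
# Approximate weight-only memory footprint at different precisions.
# NUM_PARAMS is a hypothetical round figure for a "7B" model, not a measured count.
NUM_PARAMS = 7_000_000_000

BITS_PER_WEIGHT = {"fp32": 32, "bf16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: int, bits: int) -> float:
    """Return the storage needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight storage by a factor of four.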

The quantization process consists of the following steps:

- Loading a model checkpoint using an appropriate parallelism strategy
- Calibrating the model to obtain appropriate algorithm-specific scaling factors
- Exporting the quantized checkpoint to an output directory

### Imports and Dependencies
Begin by importing the required libraries and dependencies.

In [None]:
# Dependencies
import torch
import os
import time
import torch.nn as nn
import modelopt.torch.quantization as mtq
from transformers import AutoTokenizer, AutoModelForCausalLM
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from peft import  PeftModel 
from datasets import load_dataset
from torch.utils.data import DataLoader

#### Constants:
- **Hugging Face Access Token**: Import the value of the Hugging Face Access Token from the environment variables or a separate file.
- **LoRA Adapter Path**: Provide the path to the saved LoRA adapters.
- **Base Model ID**: Set the Hugging Face ID of the base model that was fine-tuned.

In [2]:
hf_token = os.environ.get("HUGGING_FACE_HUB_TOKEN")
adapter_path = '/project/adapter/checkpoint-100'
# Change model_id
model_id = "mistralai/Mistral-7B-v0.1"
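
Before loading anything large, it can help to confirm that these constants point at usable inputs. The following is a minimal sketch, not part of any library: `check_notebook_inputs` is a hypothetical helper, and it assumes the adapter directory was written by PEFT, which saves an `adapter_config.json` alongside the adapter weights.

```python
import os


def check_notebook_inputs(adapter_dir, token_env_var="HUGGING_FACE_HUB_TOKEN"):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper for this notebook, not part of any library.
    """
    problems = []
    if not os.environ.get(token_env_var):
        problems.append(f"environment variable {token_env_var} is not set")
    if not os.path.isdir(adapter_dir):
        problems.append(f"adapter directory not found: {adapter_dir}")
    elif not os.path.isfile(os.path.join(adapter_dir, "adapter_config.json")):
        problems.append(f"adapter_config.json missing in: {adapter_dir}")
    return problems


# Same adapter path as in the cell above.
for problem in check_notebook_inputs("/project/adapter/checkpoint-100"):
    print("WARNING:", problem)
```

A missing token is only reported as a warning here, since it matters only when the base model requires authentication.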

### 1. Weight Merging

To start, we use the `AutoModelForCausalLM` class from the Hugging Face Transformers library to load the base model. This class sets up models for causal language modeling tasks. We load the pre-trained weights with its `from_pretrained` method.


In [3]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",  # "cpu" avoids GPU memory limits; use "auto" if the GPU has enough VRAM.
                        # Since we are not running inference, it doesn't matter where the model is loaded.
    trust_remote_code=True,
    token=hf_token,
    torch_dtype=torch.bfloat16,
)

Downloading shards: 100%|██████████| 2/2 [02:57<00:00, 88.95s/it] 
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  3.45it/s]


Next, we load the tokenizer with specific configurations, load the LoRA adapters, fuse the LoRA weights into the base model, and finally save both the merged model and the tokenizer.

In [4]:
# Load the tokenizer with specific configurations
# This tokenizer is configured to handle inputs up to 512 tokens in length,
# pads from the left, and adds an end-of-sequence token to each sequence.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token= hf_token,
    model_max_length=512,
    padding_side="left",
    add_eos_token=True)

# Load the LoRA adapters and attach them to the base model.
# The adapter weights are loaded in bfloat16 to match the base model.
ft_model = PeftModel.from_pretrained(base_model, adapter_path, torch_dtype=torch.bfloat16)

# Fuse the LoRA weights into the base model and return the merged model.
# After this step the adapters are folded into the base weights, so the result
# is a single model with no separate adapter layers.
merged_model = ft_model.merge_and_unload()

# Save the merged model and tokenizer
merged_model.save_pretrained("mistral/merged")
tokenizer.save_pretrained("mistral/merged")
tokenizer.save_pretrained("/project/models/mistral/tokenizer")
print("merging done")
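
Conceptually, merging a LoRA adapter adds a scaled low-rank product to each adapted weight matrix: `W' = W + (alpha / r) * B @ A`. The sketch below checks this on a toy example in plain Python. The matrices, `ALPHA`, and `R` are made-up values and the helper names are hypothetical. It assumes the standard LoRA scaling of `alpha / r`, and it is only meant to show why the merged model gives the same outputs as the base model plus adapter.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return the merged weight W + (alpha / r) * B @ A."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)  # shape d_out x d_in, same as W
    return [
        [w_ij + scaling * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy values (made up): d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]  # r x d_in
B = [[0.5], [-1.0]]    # d_out x r
ALPHA, R = 2.0, 1

merged = merge_lora(W, A, B, ALPHA, R)

# The adapter route (base output plus scaled low-rank output) and the merged
# weight should give the same result for an input column vector x.
x = [[1.0], [1.0], [1.0]]
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
with_adapter_out = [[bo[0] + (ALPHA / R) * lo[0]] for bo, lo in zip(base_out, lora_out)]
merged_out = matmul(merged, x)
print(with_adapter_out, merged_out)
```

Because the two routes agree, the adapter can be dropped after merging, which is what makes the merged checkpoint suitable as input to the quantization step.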


### 2. Quantization
Post-Training Quantization (PTQ) enables deploying a model in a low-precision format (FP8, INT4, or INT8) for efficient serving. Several quantization methods are available, including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.

Next, we define the path to the recently merged model and select the device based on the available hardware, so the notebook can run on either GPU or CPU.

In [None]:
merged_model = "mistral/merged"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

We use the INT4 AWQ method. Activation-aware Weight Quantization (AWQ) is a technique for compressing and accelerating Large Language Models (LLMs) by reducing the precision of the model weights. AWQ focuses on low-bit, weight-only quantization.

In [2]:
# Select the quantization config
config = mtq.INT4_AWQ_CFG

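calib_size = 32   # number of calibration samples
block_size = 512  # maximum length, in tokens, of each calibration sample
batch_size = 1    # calibration batch size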
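
To make "low-bit, weight-only quantization" concrete, here is a minimal sketch of symmetric 4-bit quantization applied to small groups of weights, each with its own scaling factor. It is a simplification under stated assumptions: it leaves out the activation-aware scale search that distinguishes AWQ, the weights are made up, and `GROUP_SIZE` is a hypothetical value. Note that `block_size` in the cell above is the calibration sequence length, not a quantization group size.

```python
def quantize_group_int4(weights):
    """Symmetric 4-bit quantization of one group: returns (codes, scale)."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7 if amax > 0 else 1.0  # map the largest magnitude to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate floating-point weights from 4-bit codes."""
    return [c * scale for c in codes]


def quantize_int4(weights, group_size):
    """Quantize a flat list of weights in independent groups of `group_size`."""
    return [
        quantize_group_int4(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


# Made-up weights; GROUP_SIZE is a hypothetical value chosen for readability.
weights = [0.7, -0.35, 0.1, 0.0, 14.0, -2.0, 6.0, 1.0]
GROUP_SIZE = 4

for codes, scale in quantize_int4(weights, GROUP_SIZE):
    restored = [round(v, 3) for v in dequantize_group(codes, scale)]
    print(codes, round(scale, 4), restored)
```

Using one scale per group keeps a single large weight from degrading the resolution of every other weight in the tensor, at the cost of storing the extra scaling factors.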

Now, we will load the tokenizer and the recently merged model. After loading, we will calibrate the model. The calibration process in Post-Training Quantization (PTQ) is a crucial step that involves adjusting the quantization parameters of a model to minimize the loss of accuracy that typically occurs when the model's numerical precision is reduced.

In [None]:
# The forward loop passes data through the model in order to collect statistics for calibration.
# It wraps the calibration dataloader and the model. calib_dataloader is defined further down
# in this cell, before mtq.quantize calls this function.
def calibrate_loop(model):
    """Run the calibration batches through the model so statistics can be collected."""
    for idx, data in enumerate(calib_dataloader):
        print(f"Calibrating batch {idx}")
        model(data)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=merged_model)
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=merged_model, torch_dtype=torch.float16, device_map=device)

tokenizer.pad_token = tokenizer.eos_token

# Prepare calibration data
dataset2 = load_dataset("cnn_dailymail", name="3.0.0", split="train").select(range(512))
dataset2 = dataset2["article"][:calib_size]
batch_encoded = tokenizer.batch_encode_plus(dataset2, return_tensors="pt", padding=True, truncation=True, max_length=block_size)
batch_encoded = batch_encoded.to(device)
batch_encoded = batch_encoded["input_ids"]
calib_dataloader = DataLoader(batch_encoded, batch_size=batch_size, shuffle=False)

# PTQ with in-place replacement to quantized modules
with torch.no_grad():
    print("starting quantization (mtq.quantize API call)...")
    start_time = time.time()
    model = mtq.quantize(model, config, calibrate_loop)
    end_time = time.time()
    print(f"done, time taken = {end_time - start_time} seconds")
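
The idea behind the forward loop, collecting statistics from representative data and turning them into scaling factors, can be sketched with a generic max calibrator. This is an illustration only and not ModelOpt's implementation: `MaxCalibrator` and `run_calibration` are hypothetical names, the values are made up, and AWQ uses its calibration data in a more involved way to choose its scaling factors.

```python
class MaxCalibrator:
    """Tracks the largest absolute value seen across calibration batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        """Update the running maximum with one batch of values."""
        self.amax = max(self.amax, max(abs(v) for v in batch))

    def scale(self, num_bits=8):
        """Scaling factor that maps amax to the largest signed integer level."""
        qmax = 2 ** (num_bits - 1) - 1
        return self.amax / qmax


def run_calibration(calibrator, batches):
    """Analogue of a forward loop: feed every calibration batch through once."""
    for batch in batches:
        calibrator.observe(batch)


# Made-up values standing in for what would be observed during forward passes.
batches = [[0.1, -0.5], [2.54, 1.0], [-1.27]]
calibrator = MaxCalibrator()
run_calibration(calibrator, batches)
print(calibrator.amax, calibrator.scale(num_bits=8))
```

The statistics depend entirely on the data passed through, which is why the calibration samples should resemble the inputs expected at inference time.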

### 3. Exporting Quantized model
Now that the model is quantized, it can be exported to a TensorRT-LLM checkpoint, which includes:

- One json file recording the model structure and metadata, and
- One or several rank weight files storing quantized model weights and scaling factors.

In [None]:
export_dir = "/project/models/mistral/checkpoints"  # change the checkpoint directory as needed
decoder_type = "llama"  # change decoder_type according to the supported model
inference_tensor_parallel = 1
inference_pipeline_parallel = 1

# Export Quantized Model
with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,
        decoder_type,
        torch.float16,
        export_dir,
        inference_tensor_parallel,
        inference_pipeline_parallel
    )

print("\ntrt-llm checkpoint export done\n")
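
After the export finishes, a quick look at the output directory confirms that the JSON metadata and the weight files were written. The helper below is a minimal sketch using only the standard library. `summarize_checkpoint` is a hypothetical name, and it makes no assumption about the exact file names the exporter produces.

```python
import json
import os


def summarize_checkpoint(export_dir):
    """Return file sizes and parsed JSON files found directly in `export_dir`."""
    summary = {"files": {}, "json": {}}
    for name in sorted(os.listdir(export_dir)):
        path = os.path.join(export_dir, name)
        if not os.path.isfile(path):
            continue
        summary["files"][name] = os.path.getsize(path)
        if name.endswith(".json"):
            with open(path, encoding="utf-8") as f:
                summary["json"][name] = json.load(f)
    return summary


export_dir = "/project/models/mistral/checkpoints"  # same path as in the export cell
if os.path.isdir(export_dir):
    summary = summarize_checkpoint(export_dir)
    for name, size in summary["files"].items():
        print(f"{name}: {size / 1e6:.1f} MB")
```

The parsed JSON in `summary["json"]` can then be inspected to confirm that the recorded model structure and quantization metadata match what was intended.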