
# Guide: Quantizing LLMs with LLM Compressor for vLLM

This notebook provides a step-by-step guide for quantizing a Hugging Face language model with Neural Magic's `llm-compressor`. Quantization reduces a model's memory footprint and can speed up inference by storing weights (and, in some schemes, activations) at lower precision: instead of 16- or 32-bit floating-point numbers (such as BF16 or FP32), values are represented as low-bit integers (such as INT8 or INT4).

This process is essential for deploying large models efficiently on resource-constrained hardware. We will walk through the entire workflow:

1.  **Setup**: Install necessary libraries.
2.  **Configuration**: Define the model and quantization parameters.
3.  **Download**: Fetch the base model from the Hugging Face Hub.
4.  **Quantize**: Apply a one-shot quantization recipe using `llm-compressor`.
5.  **Evaluate**: (Optional) Measure the accuracy of the quantized model on a benchmark.
6.  **Upload**: (Optional) Push the quantized model to an S3-compatible object store.

This guide is designed for AI Platform Engineers and Consultants who are building and optimizing AI services. The final quantized model will be compatible with high-performance inference engines like vLLM.
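
To make the idea of converting floating-point weights to integers concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only, not how `llm-compressor` implements quantization; the helper names and the sample weights are made up for this example, and the integer range is simplified to be symmetric around zero.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (illustrative helper)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to show the effect of the bit width.
weights = [0.12, -0.53, 0.97, -1.27, 0.04]

q8, s8 = quantize_symmetric(weights, num_bits=8)
q4, s4 = quantize_symmetric(weights, num_bits=4)

err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))

print("8-bit integers:", q8, "max round-trip error:", err8)
print("4-bit integers:", q4, "max round-trip error:", err4)
```

Fewer bits mean coarser steps and a larger round-trip error, which is the accuracy cost that the calibration and recipe steps later in this notebook try to keep small.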


In [None]:

# 1. Setup: Install required packages
# Note: This process requires a GPU for both quantization and evaluation.
%pip install llmcompressor accelerate vllm datasets transformers torch lm_eval==0.4.3 huggingface-hub boto3



## 2. Configuration

Here, we'll define all the key parameters for our quantization process. You can easily change these values to quantize a different model or adjust the quantization settings.


In [None]:

import os

# --- Model Configuration ---
# The Hugging Face model ID to download and quantize.
MODEL_ID = "ibm-granite/granite-3.2-2b-instruct"

# --- Quantization Configuration ---
# Choose "int8" or "int4".
# INT8 offers a good balance of performance and accuracy.
# INT4 provides maximum compression but may have a higher accuracy loss.
QUANTIZATION_TYPE = "int8"

# --- Path Configuration ---
# Directory to store the original, full-precision model.
BASE_MODEL_PATH = "base_model"
# Directory to save the final quantized model.
OPTIMIZED_MODEL_PATH = f"optimized_model_{QUANTIZATION_TYPE}"

print("Configuration Summary:")
print(f"  - Model ID: {MODEL_ID}")
print(f"  - Quantization Type: {QUANTIZATION_TYPE}")
print(f"  - Base Model Path: ./{BASE_MODEL_PATH}")
print(f"  - Optimized Model Path: ./{OPTIMIZED_MODEL_PATH}")
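
As a rough guide when choosing between `int8` and `int4`, the weight memory of a model scales linearly with the bits stored per weight. The sketch below is a back-of-the-envelope estimate only: the parameter count is an assumed placeholder rather than a measured value for the configured model, and the estimate ignores quantization scales, layers left unquantized (such as `lm_head`), and runtime memory like the KV cache.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


# Assumption for illustration: a model with roughly 2.5 billion parameters.
ASSUMED_NUM_PARAMS = 2.5e9

for label, bits in [("16-bit", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gib(ASSUMED_NUM_PARAMS, bits):.2f} GiB")
```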



## 3. Download the Base Model

First, we download the pre-trained model weights and tokenizer from the Hugging Face Hub using the `snapshot_download` function. This saves the complete model repository to our local `BASE_MODEL_PATH`.


In [None]:

from huggingface_hub import snapshot_download

print(f"Downloading model '{MODEL_ID}' from Hugging Face Hub...")
snapshot_download(repo_id=MODEL_ID, local_dir=BASE_MODEL_PATH)
print(f"Model downloaded successfully to ./{BASE_MODEL_PATH}")
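
Before moving on, it can help to confirm that the download produced a usable model directory. The following is a small sketch using only the standard library; `summarize_model_dir` is a hypothetical helper, and the check assumes the repository ships a `config.json` and weight files ending in `.safetensors` or `.bin`.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Report whether a config exists, which weight files are present, and the total size."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    weight_suffixes = {".safetensors", ".bin"}
    return {
        "has_config": (root / "config.json").is_file(),
        "weight_files": sorted(p.name for p in files if p.suffix in weight_suffixes),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

Calling `summarize_model_dir(BASE_MODEL_PATH)` after the download should report `has_config` as `True` and at least one weight file; an empty `weight_files` list suggests the download was incomplete.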



## 4. Quantize the Model

This is the core of the notebook. We will perform one-shot quantization, which compresses the model without requiring a full retraining cycle. This process involves several sub-steps.



### 4.1. Load Model and Tokenizer

We load the downloaded model and its tokenizer using the `transformers` library. We pass `device_map="auto"` so the weights are placed on the available GPU(s), and `torch_dtype="auto"` so the model is loaded in the precision stored in its checkpoint.


In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM

print("Loading model and tokenizer...")
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
print("Model and tokenizer loaded successfully.")



### 4.2. Prepare Calibration Data

Quantization algorithms need a small, representative sample of data for calibration. Running these samples through the model exposes the activation ranges and layer inputs that the algorithms use to choose quantization parameters, which helps minimize accuracy loss. We'll use a standard calibration dataset from Neural Magic and preprocess it.


In [None]:

from datasets import load_dataset

# Parameters for data calibration
NUM_CALIBRATION_SAMPLES = 256
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"

print(f"Loading and preparing calibration dataset: {DATASET_ID}")

# Load and preprocess the dataset
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# The dataset already exposes a "text" column; this step only makes that explicit.
def preprocess(example):
    return {"text": example["text"]}
ds = ds.map(preprocess)

# Tokenize the data
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        truncation=False,
        add_special_tokens=True,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)

print("Calibration data is ready.")
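
Because the samples are tokenized without truncation, it is worth checking how long they are before calibration. The sketch below is a hypothetical helper that summarizes a list of token counts and counts how many exceed a chosen limit; it uses only the standard library and makes no assumptions about the dataset beyond receiving a list of integers.

```python
import statistics


def token_length_stats(lengths, max_seq_length):
    """Summarize token counts and count the samples longer than max_seq_length."""
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        "num_over_limit": sum(1 for n in lengths if n > max_seq_length),
    }
```

With the dataset prepared above, the lengths could be collected as `[len(ids) for ids in ds["input_ids"]]` and passed to `token_length_stats` together with the sequence length you plan to use for calibration.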



### 4.3. Define the Quantization Recipe

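A "recipe" in `llm-compressor` is a set of instructions that defines how to modify the model. We'll use GPTQ, a post-training quantization algorithm that quantizes weights layer by layer while compensating for the rounding error it introduces. For INT8, we'll also add SmoothQuant as a pre-processing step, because the W8A8 scheme quantizes activations as well as weights and activations are harder to quantize.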

The parameters below (`DAMPENING_FRAC`, `GROUP_SIZE`, etc.) are hyperparameters for the quantization algorithm that can be tuned for optimal performance.
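
To give some intuition for what a group size does, here is a simplified sketch of group-wise quantization in plain Python: every block of `group_size` consecutive weights gets its own scale, so an outlier only degrades the precision of its own group. This is an illustration under simplifying assumptions (symmetric range, made-up weights, a tiny group size of 4), not the GPTQ algorithm or the `llm-compressor` implementation.

```python
def quantize_groupwise(row, group_size, num_bits=4):
    """Quantize a row of weights with one scale per group of consecutive values."""
    qmax = 2 ** (num_bits - 1) - 1
    quantized, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return quantized, scales


def dequantize_groupwise(quantized, scales, group_size):
    """Rebuild approximate floats using the scale of the group each value belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical row: four small weights followed by four much larger ones.
row = [0.01, -0.02, 0.03, 0.015, 1.4, -0.7, 0.35, 0.0]

# One scale for the whole row versus one scale per group of four.
q_all, s_all = quantize_groupwise(row, group_size=len(row))
q_grp, s_grp = quantize_groupwise(row, group_size=4)

small = slice(0, 4)
err_all = max_error(row[small], dequantize_groupwise(q_all, s_all, len(row))[small])
err_grp = max_error(row[small], dequantize_groupwise(q_grp, s_grp, 4)[small])
print("error on small weights, single scale:", err_all)
print("error on small weights, per-group scales:", err_grp)
```

Smaller groups track the local weight range more closely at the cost of storing more scales, which is the trade-off behind tuning `GROUP_SIZE`.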


In [None]:

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Hyperparameters for quantization
DAMPENING_FRAC = 0.1
OBSERVER = "mse"
GROUP_SIZE = 128
ignore = ["lm_head"]

print(f"Creating quantization recipe for type: {QUANTIZATION_TYPE}")

if QUANTIZATION_TYPE == "int8":
    # For INT8, we use SmoothQuant + GPTQ.
    # SmoothQuant shifts the quantization difficulty from activations to weights.
    mappings = [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
        [["re:.*down_proj"], "re:.*up_proj"]
    ]
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.7, ignore=ignore, mappings=mappings),
        GPTQModifier(
            targets=["Linear"],
            ignore=ignore,
            scheme="W8A8",  # 8-bit weights, 8-bit activations
            dampening_frac=DAMPENING_FRAC,
            observer=OBSERVER,
        )
    ]
elif QUANTIZATION_TYPE == "int4":
    # For INT4, we use GPTQ directly.
    recipe = [
        GPTQModifier(
            targets=["Linear"],
            ignore=ignore,
            scheme="W4A16",  # 4-bit weights, 16-bit activations
            dampening_frac=DAMPENING_FRAC,
            observer=OBSERVER,
            group_size=GROUP_SIZE,
        )
    ]
else:
    raise ValueError(f"Quantization type {QUANTIZATION_TYPE} not supported")

print("Recipe created successfully.")
print(recipe)
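
The comment about SmoothQuant shifting difficulty from activations to weights can be illustrated with a toy example. The sketch below follows the scaling rule from the SmoothQuant method, where each input channel `j` is divided by `s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)` in the activations and multiplied by the same factor in the weights. It assumes that `alpha` plays the role of `smoothing_strength` in the recipe; the matrices are invented, and this is not the library's implementation.

```python
def smoothing_scales(activations, weights, alpha=0.7):
    """Per-input-channel scales: max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in activations)
        weight_max = max(abs(w) for w in weights[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical activations (tokens x channels); channel 1 contains large outliers.
X = [[0.5, 40.0], [-0.3, -60.0]]
# Hypothetical weights (channels x outputs).
W = [[0.2, -0.4], [0.1, 0.3]]

s = smoothing_scales(X, W, alpha=0.7)
X_smooth = [[x / s_j for x, s_j in zip(row, s)] for row in X]
W_smooth = [[w * s_j for w in row] for row, s_j in zip(W, s)]

print("largest activation before:", max(abs(x) for row in X for x in row))
print("largest activation after: ", max(abs(x) for row in X_smooth for x in row))
print("outputs unchanged:", matmul(X, W), matmul(X_smooth, W_smooth))
```

The layer output is mathematically unchanged, but the activation outliers shrink while the weights absorb the scale, which makes 8-bit activation quantization less lossy.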



### 4.4. Apply the Recipe with `oneshot`

The `oneshot` function from `llm-compressor` applies our recipe to the model. It uses the calibration data we prepared to execute the quantization process without any backpropagation or gradient updates, which makes it much cheaper than retraining the model.

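The `max_seq_length` argument sets the maximum number of tokens per calibration sample. Choose a value that fits within your model's context window and your available GPU memory. We use 8196 for the Granite model.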


In [None]:

from llmcompressor.transformers import oneshot

print("Applying one-shot quantization... This may take several minutes.")

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    max_seq_length=8196,
)

print("One-shot quantization complete.")



### 4.5. Save the Quantized Model

Finally, we save the modified model to disk. It's crucial to set `save_compressed=True` to ensure the quantization changes are correctly stored in a format that engines like vLLM can load. We also save the tokenizer for completeness.


In [None]:

print(f"Saving quantized model to ./{OPTIMIZED_MODEL_PATH}...")

# Save to disk in compressed format
model.save_pretrained(OPTIMIZED_MODEL_PATH, save_compressed=True)
tokenizer.save_pretrained(OPTIMIZED_MODEL_PATH)

print("Quantized model saved successfully.")
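
A quick way to sanity-check the saved artifact is to look for quantization metadata in its `config.json` and compare directory sizes. The sketch below assumes the compressed checkpoint records its settings under a `quantization_config` key in `config.json`; treat that key name as an assumption to verify against your saved files. Both helpers are hypothetical and use only the standard library.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the 'quantization_config' entry from config.json, or None if it is absent."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config.get("quantization_config")


def dir_size_mib(model_dir):
    """Total size of all files under model_dir, in MiB."""
    total = sum(p.stat().st_size for p in Path(model_dir).rglob("*") if p.is_file())
    return total / 1024**2
```

For example, comparing `dir_size_mib(BASE_MODEL_PATH)` with `dir_size_mib(OPTIMIZED_MODEL_PATH)` shows how much disk space the compressed format saves, and a `None` result from `read_quantization_config(OPTIMIZED_MODEL_PATH)` would suggest the model was not saved in compressed form.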



## 5. (Optional) Evaluate the Model

After quantization, evaluate the model on a standard benchmark to measure any accuracy degradation. We'll use `lm-evaluation-harness` (the `lm_eval` command) to run the `gsm8k` benchmark, which tests grade-school math reasoning. Running the same command against `BASE_MODEL_PATH` gives a baseline to compare with.

**Note:** This step requires a GPU and will be executed via a subprocess. The results will be printed below.


In [None]:

import subprocess
import torch

# Check for GPU
if not torch.cuda.is_available():
    print("Evaluation skipped: No GPU detected.")
else:
    print("Starting model evaluation with lm-eval-harness...")
    command = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={OPTIMIZED_MODEL_PATH},add_bos_token=true,dtype=auto",
        "--tasks", "gsm8k",
        "--num_fewshot", "5",
        "--limit", "250",
        "--batch_size", "auto",
        "--trust_remote_code"
    ]

    # Execute the command
    result = subprocess.run(command, capture_output=True, text=True)

    # Print the output
    if result.returncode == 0:
        print("Model evaluated successfully:")
        print(result.stdout)
    else:
        print("Error evaluating the model:")
        print(result.stderr)
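
A benchmark score for the quantized model is easier to interpret relative to the unquantized baseline. One common way to express this is accuracy recovery, the quantized score as a percentage of the baseline score. The sketch below shows the arithmetic; `accuracy_recovery` is a hypothetical helper and the two scores are placeholder values, not results from this notebook.

```python
def accuracy_recovery(baseline_score, quantized_score):
    """Quantized score expressed as a percentage of the baseline score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * quantized_score / baseline_score


# Placeholder scores for illustration only; substitute your own lm_eval results.
example_baseline = 0.50
example_quantized = 0.49
print(f"Recovery: {accuracy_recovery(example_baseline, example_quantized):.1f}%")
```

Keep in mind that with `--limit 250` the benchmark runs on a subset of the data, so small differences between the two scores may fall within sampling noise.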



## 6. (Optional) Upload Model to S3

For production workflows, you'll often need to store your model artifacts in a central object store. This section shows how to upload the entire quantized model directory to an S3-compatible bucket using `boto3`.

**Action Required**: You must configure your S3 credentials below for this step to work.


In [None]:

import boto3

# --- S3 Configuration ---
# IMPORTANT: Replace these with your actual S3 details.
# For security, it's best to load these from environment variables or a secrets manager.
S3_ENDPOINT_URL = "https://s3.example.com"  # e.g., 'https://s3.us-east-1.amazonaws.com'
S3_ACCESS_KEY = "YOUR_ACCESS_KEY"
S3_SECRET_KEY = "YOUR_SECRET_KEY"
S3_BUCKET_NAME = "your-models-bucket"
S3_PATH_IN_BUCKET = f"quantized-models/{MODEL_ID.replace('/', '_')}-{QUANTIZATION_TYPE}"

# --- Upload Logic ---
def upload_to_s3(local_path, s3_bucket, s3_prefix):
    if S3_ACCESS_KEY == "YOUR_ACCESS_KEY":
        print("Upload skipped: S3 credentials not configured.")
        return

    print(f"Starting upload to bucket '{s3_bucket}' at '{S3_ENDPOINT_URL}'...")
    s3_client = boto3.client(
        's3',
        endpoint_url=S3_ENDPOINT_URL,
        aws_access_key_id=S3_ACCESS_KEY,
        aws_secret_access_key=S3_SECRET_KEY,
        verify=False,  # Disables TLS certificate verification; set to True when the endpoint has a valid certificate
    )

    try:
        for root, dirs, files in os.walk(local_path):
            for file in files:
                local_file_path = os.path.join(root, file)
                # Build the S3 key with forward slashes, regardless of the local OS path separator
                relative_path = os.path.relpath(local_file_path, local_path)
                s3_file_path = f"{s3_prefix}/{relative_path.replace(os.sep, '/')}"
                s3_client.upload_file(local_file_path, s3_bucket, s3_file_path)
                print(f"  - Uploaded {s3_file_path}")
        print("Finished uploading results.")
    except Exception as e:
        print(f"An error occurred during upload: {e}")

# Execute the upload
upload_to_s3(
    local_path=OPTIMIZED_MODEL_PATH,
    s3_bucket=S3_BUCKET_NAME,
    s3_prefix=S3_PATH_IN_BUCKET
)
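
Following the advice to keep credentials out of the notebook, the settings can be read from environment variables instead of being hard-coded. The sketch below is one possible approach using only the standard library; the environment variable names and both helper functions are assumptions made for this example. It also shows a portable way to build object keys that always use forward slashes.

```python
import os
import posixpath


def load_s3_settings(env=None):
    """Read S3 settings from environment variables (names are illustrative)."""
    env = os.environ if env is None else env
    required = ["S3_ENDPOINT_URL", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET_NAME"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing S3 settings: {', '.join(missing)}")
    return {name: env[name] for name in required}


def build_s3_key(prefix, local_root, local_file):
    """Join the prefix and the file's path relative to local_root using '/' separators."""
    relative = os.path.relpath(local_file, local_root)
    return posixpath.join(prefix, *relative.split(os.sep))
```

The dictionary returned by `load_s3_settings()` could then supply the values passed to `boto3.client` and `upload_file` in place of the constants defined above.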



## Conclusion

You have quantized a large language model using `llm-compressor`. The resulting model has a smaller memory footprint, is typically faster at inference, and is ready to be served with vLLM.

From here, you can:
- Integrate the quantized model path into your vLLM deployment configurations.
- Experiment with `int4` quantization for even greater compression.
- Tune the quantization hyperparameters (`DAMPENING_FRAC`, `GROUP_SIZE`, etc.) to optimize the trade-off between performance and accuracy for your specific use case.
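
As a starting point for the first item, the quantized directory can be loaded with vLLM's offline Python API. The following is a minimal sketch, assuming `vllm` is installed and a GPU is available; `generate_with_vllm` is a hypothetical wrapper, and the sampling settings are arbitrary choices for a quick smoke test.

```python
def generate_with_vllm(model_path, prompt, max_tokens=64):
    """Load a model directory with vLLM and return the generated text for one prompt."""
    # Imported lazily so this function can be defined on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    # temperature=0.0 requests greedy decoding, which keeps the smoke test repeatable.
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `generate_with_vllm(OPTIMIZED_MODEL_PATH, "What is quantization?")` should return a short completion if the compressed checkpoint loads correctly.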
