# Post-training Quantization (PTQ) using Amazon SageMaker AI 🚀
---

Quantization is a technique used to compress large language models by reducing the precision of their weights and activations, often from 16-bit or 32-bit floating-point numbers down to lower bit-width integers (like int8 or int4). This compression reduces model size, lowers memory bandwidth requirements, and speeds up inference on supported hardware — all while trying to maintain acceptable model accuracy.
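
As a minimal sketch of the idea (not the method applied later in this notebook), the snippet below runs symmetric per-tensor int8 quantization on a handful of made-up weights. The function names and sample values are illustrative assumptions.

```python
def quantize_symmetric_int8(values):
    """Map floats to int8 codes using one shared scale (symmetric, per-tensor)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.05, 0.88, -0.31]  # made-up example weights
codes, scale = quantize_symmetric_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a one-byte code plus a single shared scale, and the round-trip error per weight stays within half a quantization step.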

Post-Training Quantization (PTQ) applies quantization to a pretrained model without requiring any additional fine-tuning. Instead, it uses a small calibration dataset to estimate activation statistics and determine optimal quantization parameters. PTQ is especially useful when retraining is expensive or infeasible. In this notebook, we’ll demonstrate how PTQ works and evaluate the impact on model size and inference performance.
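
To illustrate what "estimating activation statistics" can mean, here is a simplified sketch that derives an asymmetric 8-bit scale and zero-point from the minimum and maximum values seen in a few calibration batches. Min/max is only the simplest calibration rule; GPTQ and AWQ use more elaborate procedures, and all names and values below are assumptions for illustration.

```python
def calibrate_asymmetric_uint8(calibration_batches):
    """Derive (scale, zero_point) from the value range observed during calibration."""
    lo = min(min(batch) for batch in calibration_batches)
    hi = max(max(batch) for batch in calibration_batches)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to an unsigned 8-bit code, clamping out-of-range values."""
    return max(0, min(255, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an 8-bit code back to an approximate float."""
    return (q - zero_point) * scale


# Made-up activation values standing in for two calibration batches.
batches = [[-0.5, 0.1, 2.0], [0.3, 1.5, -1.0]]
scale, zero_point = calibrate_asymmetric_uint8(batches)
print(scale, zero_point, quantize(2.0, scale, zero_point))
```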

## 01. Setup

Install the latest version of the SageMaker Python SDK to get up-to-date features.

In [None]:
%pip install -Uq sagemaker

In [None]:
import sagemaker
from sagemaker.huggingface import HuggingFace
from sagemaker.pytorch import PyTorch

In [None]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

## 02. Running Post-Training Quantization on Amazon SageMaker

---

To quantize large language models at scale, we use Amazon SageMaker Training Jobs to execute a PTQ (Post-Training Quantization) script on a GPU-backed instance. While the name *Training Job* suggests model training, in this case we are **not retraining or fine-tuning the model**. Instead, we use the Training Job infrastructure only to run the quantization workload on a managed GPU instance (e.g., `ml.g6e.2xlarge`, the instance type used in this notebook), which SageMaker provisions for the job and releases when the job completes.

The script we run (`post_training_sagemaker_quantizer.py`) automates all steps of PTQ. It loads the model in full or half-precision, preprocesses a calibration dataset, and applies either GPTQ or AWQ quantization using the [`llm-compressor`](https://github.com/vllm-project/llm-compressor) library. This is a one-shot quantization process that computes activation statistics from a small number of input sequences and generates a compressed version of the model, reducing memory and compute footprint without needing any labeled data or training.
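
For a rough sense of the memory savings, the sketch below estimates the weight footprint at different bit widths. It assumes a round 8 billion parameters for an "8B" model and ignores quantization scales, zero-points, and layers left unquantized, so real checkpoints will differ somewhat.

```python
def estimate_weight_gib(num_params, weight_bits):
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * weight_bits / 8 / 1024**3


NUM_PARAMS = 8_000_000_000  # assumption: rough parameter count for an "8B" model

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label} weights: ~{estimate_weight_gib(NUM_PARAMS, bits):.1f} GiB")
```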

Once the Training Job completes, the quantized model is automatically saved to Amazon S3. From there, it can be untarred and deployed behind a fully managed SageMaker Endpoint using a prebuilt inference container (like `lmi-dist` with `vLLM`). The following code snippet shows how to launch the Training Job, pass in hyperparameters like quantization scheme and number of calibration samples, and prepare your model for efficient, low-latency inference.
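
In script mode, SageMaker hands each hyperparameter to the entry point as a `--key value` command-line argument. The sketch below mimics that hand-off locally with `argparse`. The parser is a hypothetical subset written for illustration, not the actual contents of `post_training_sagemaker_quantizer.py`, and `hyperparameters_to_argv` is an assumed helper name.

```python
import argparse


def hyperparameters_to_argv(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    argv = []
    for key, value in hyperparameters.items():
        argv.extend([f"--{key}", str(value)])
    return argv


def build_parser():
    """Hypothetical subset of the quantizer's flags, for illustration only."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True)
    parser.add_argument("--dataset-id", required=True)
    parser.add_argument("--algorithm", choices=["awq", "gptq"])
    parser.add_argument("--num-calibration-samples", type=int)
    return parser


example = {
    "model-id": "meta-llama/Llama-3.1-8B-Instruct",
    "dataset-id": "HuggingFaceH4/ultrachat_200k",
    "algorithm": "gptq",
    "num-calibration-samples": 256,
}
argv = hyperparameters_to_argv(example)
args = build_parser().parse_args(argv)
print(args.algorithm, args.num_calibration_samples)
```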

---

The post-training quantization script takes the following arguments:

```bash
usage: post_training_sagemaker_quantizer.py [-h] --model-id MODEL_ID [--sequential-loading SEQUENTIAL_LOADING] --dataset-id DATASET_ID
                                            [--dataset-split DATASET_SPLIT] [--dataset-seed DATASET_SEED] [--num-calibration-samples NUM_CALIBRATION_SAMPLES]
                                            [--max-sequence-length MAX_SEQUENCE_LENGTH] [--vision-enabled] [--transformer-model-name TRANSFORMER_MODEL_NAME]
                                            [--vision-sequential-targets VISION_SEQUENTIAL_TARGETS] [--algorithm {awq,gptq}] [--ignore-layers IGNORE_LAYERS]
                                            [--include-targets INCLUDE_TARGETS] [--awq-quantization-scheme {W4A16_ASYM,W4A16}]
                                            [--gptq-quantization-scheme {W4A16,W4A16_ASYM,W8A8,W8A16}] [--sm-model-dir SM_MODEL_DIR]

Quantize a language model using AWQ

options:
  -h, --help            show this help message and exit
  --model-id MODEL_ID   Hugging Face model ID
  --sequential-loading SEQUENTIAL_LOADING
                        If the quantization model size GPU set this param to true to run sequential loading to optimize on a single GPU
  --dataset-id DATASET_ID
                        Hugging Face dataset ID
  --dataset-split DATASET_SPLIT
                        Dataset split to use for calibration
  --dataset-seed DATASET_SEED
                        Deterministic dataset seed
  --num-calibration-samples NUM_CALIBRATION_SAMPLES
                        Number of samples for calibration, larger value <> better quantized model
  --max-sequence-length MAX_SEQUENCE_LENGTH
                        Maximum sequence length for tokenization
  --vision-enabled      Weather to use images during quanitzation with vision models
  --transformer-model-name TRANSFORMER_MODEL_NAME
                        Need a dynamic transformer import mechanism for varying types
  --vision-sequential-targets VISION_SEQUENTIAL_TARGETS
                        Vision model sequential targets
  --algorithm {awq,gptq}
                        Quantization Algorithm to use
  --ignore-layers IGNORE_LAYERS
                        Ignore layers to quantize, comma separated
  --include-targets INCLUDE_TARGETS
                        Targets to quantize including, comma separated
  --awq-quantization-scheme {W4A16_ASYM,W4A16}
                        AWQ Param: Quantization scheme to use
  --gptq-quantization-scheme {W4A16,W4A16_ASYM,W8A8,W8A16}
                        GPTQ Param: Quantization scheme to use
  --sm-model-dir SM_MODEL_DIR
                        Directory to save quantized model
```

Use the `--gptq-*` parameters to configure GPTQ runs and the `--awq-*` parameters to configure AWQ runs.
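
Because each scheme flag accepts only the values listed in the usage text, a small pre-flight check can catch a mismatch before a GPU instance is launched. This is a sketch with an assumed helper name (`find_scheme_problems`). It also treats a scheme flag set for the non-selected algorithm as a likely mistake, which is an assumption: the script itself may simply ignore such a flag.

```python
# Allowed values copied from the usage text of the quantization script.
ALLOWED_SCHEMES = {
    "awq": {"W4A16_ASYM", "W4A16"},
    "gptq": {"W4A16", "W4A16_ASYM", "W8A8", "W8A16"},
}


def find_scheme_problems(hyperparameters):
    """Return human-readable problems found in the algorithm/scheme settings."""
    algorithm = hyperparameters.get("algorithm")
    if algorithm not in ALLOWED_SCHEMES:
        return [f"unknown or missing algorithm: {algorithm!r}"]

    problems = []
    for name, schemes in ALLOWED_SCHEMES.items():
        key = f"{name}-quantization-scheme"
        if key not in hyperparameters:
            continue
        value = hyperparameters[key]
        if name != algorithm:
            problems.append(f"{key} is set but algorithm is {algorithm}")
        elif value not in schemes:
            problems.append(f"{value} is not a valid {name} scheme")
    return problems


ok = find_scheme_problems({"algorithm": "gptq", "gptq-quantization-scheme": "W8A16"})
bad = find_scheme_problems({"algorithm": "awq", "awq-quantization-scheme": "W8A8"})
print(ok, bad)
```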

In [None]:
# Hyperparameters passed to the training job - example with GPTQ
hyperparameters = {
    'model-id': 'meta-llama/Llama-3.1-8B-Instruct',
    'dataset-id': 'HuggingFaceH4/ultrachat_200k',
    'dataset-split': 'train_sft',
    'dataset-seed': 42,
    'algorithm': 'gptq',
    'max-sequence-length': 2048,
    'num-calibration-samples': 256,
    'ignore-layers': 'lm_head',
    'include-targets': 'Linear',
    'gptq-quantization-scheme': 'W8A16',
}

If you're quantizing a gated model, such as one from the [Meta Llama](https://huggingface.co/meta-llama) series, provide `HF_TOKEN` in the estimator's `environment` so the job can pull the model weights from the Hugging Face Hub.
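
One way to avoid pasting the token into the notebook is to read it from the local environment. The helper below is a hypothetical convenience (the name `build_job_environment` is an assumption) that fails early when the variable is missing.

```python
import os


def build_job_environment(token_variable="HF_TOKEN"):
    """Build the estimator's environment dict from a locally set token variable."""
    token = os.environ.get(token_variable, "")
    if not token:
        raise RuntimeError(
            f"{token_variable} is not set; gated models cannot be downloaded without it"
        )
    return {"HF_TOKEN": token}


# Usage in the estimator call (requires HF_TOKEN to be set in the notebook's environment):
# environment=build_job_environment()
```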

In [None]:
quantization_estimator = PyTorch(
    entry_point='post_training_sagemaker_quantizer.py',
    source_dir='./scripts',
    instance_type='ml.g6e.2xlarge', # Change the instance size based on Quota or choice
    instance_count=1,
    role=role,
    framework_version='2.4.0',
    py_version='py311',
    hyperparameters=hyperparameters,
    environment={"HF_TOKEN": ""}
)

🚀 Go!

In [None]:
quantization_estimator.fit()

In [None]:
print(f"Quantized model available under: {quantization_estimator.model_data}")

## Download Quantized Model

In [None]:
import tarfile
import os
from sagemaker.s3 import S3Downloader

The `model.tar.gz` archive is downloaded to, and extracted in, the local directory defined below.

⚠️ NOTE: if you're using a large model, check with `df -h` that you have sufficient EBS storage for both the archive and its extracted contents.
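
The same check can be done from Python with the standard library. This is a small sketch; `free_disk_gib` is an assumed helper name.

```python
import shutil


def free_disk_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


print(f"Free space in working directory: {free_disk_gib():.1f} GiB")
```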

In [None]:
model_download_basepath = os.path.join(os.getcwd(), "quantized-model-tj")

In [None]:
def download_and_extract_model_from_s3(s3_uri: str, local_tar_path: str, extract_dir: str):
    """
    Downloads a .tar.gz file from S3 using SageMaker's Downloader and extracts it.

    Parameters:
        s3_uri (str): Full S3 URI to the .tar.gz file (e.g., 's3://my-bucket/path/model.tar.gz').
        local_tar_path (str): Local directory where the .tar.gz will be saved.
        extract_dir (str): Local directory to extract the tar.gz file to.
    """

    file_name = os.path.basename(s3_uri)
    
    # Create extract directory if it doesn't exist
    os.makedirs(extract_dir, exist_ok=True)

    # Download from S3 using SageMaker Downloader
    print(f"Downloading {s3_uri} to {local_tar_path}")
    S3Downloader.download(s3_uri, local_tar_path)

    # Extract tar.gz archive
    tarball_path = os.path.join(local_tar_path, file_name)
    print(f"Extracting {tarball_path} to {extract_dir}")
    with tarfile.open(tarball_path, "r:gz") as tar:
        tar.extractall(path=extract_dir)

    print("Download and extraction complete.")

In [None]:
download_and_extract_model_from_s3(
    s3_uri=quantization_estimator.model_data,
    local_tar_path=model_download_basepath,
    extract_dir=model_download_basepath
)
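
`tarfile.extractall` writes whatever paths the archive contains, so when an archive comes from a source you do not fully control, it can be worth listing members that would land outside the target directory before extracting. The sketch below is an optional precaution under assumed names (`find_unsafe_members`); it checks member paths only and does not inspect symlink targets. The demo archive is built in a temporary directory purely for illustration.

```python
import io
import os
import tarfile
import tempfile


def find_unsafe_members(tar_path, extract_dir):
    """Return names of archive members that would be written outside extract_dir."""
    root = os.path.realpath(extract_dir)
    unsafe = []
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(root, member.name))
            if os.path.commonpath([root, target]) != root:
                unsafe.append(member.name)
    return unsafe


# Demo: build a small archive with one normal member and one path-traversal member.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(demo_tar, "w:gz") as tar:
        for name in ("model/config.json", "../escape.txt"):
            payload = b"{}"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    flagged = find_unsafe_members(demo_tar, os.path.join(tmp, "out"))

print(flagged)
```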

## Upload Quantized Model to S3

In [None]:
from sagemaker.s3 import S3Uploader

In [None]:
local_quant_model_path = os.path.join(
    model_download_basepath, 
    [model_path for model_path in os.listdir(model_download_basepath) if 'AWQ' in model_path or 'GPTQ' in model_path][0]
)
assert os.path.exists(local_quant_model_path), f"model path does not exist: {local_quant_model_path}"
print(f"reference local model path: {local_quant_model_path}")

In [None]:
quant_prefix = '-'.join(os.path.basename(local_quant_model_path).split('-')[4:]).replace('_', '-')
print(f"leveraging quant prefix: {quant_prefix}")

In [None]:
remote_upload_s3uri = os.path.join(
    os.path.dirname(quantization_estimator.model_data), 
    os.path.basename(local_quant_model_path)
)
print(f"s3 target dir to upload quantized model > {remote_upload_s3uri}")

In [None]:
print(f"uploading model from: {local_quant_model_path} to remote: {remote_upload_s3uri}")
S3Uploader.upload(
    local_path=local_quant_model_path, 
    desired_s3_uri=remote_upload_s3uri
)

In [None]:
!aws s3 ls {remote_upload_s3uri}/

## Deploy Quantized Model

---
The `lmi-dist` (Large Model Inference - Distributed) container in Amazon SageMaker Hosting is purpose-built to serve large or optimized models, with features such as model partitioning, tensor parallelism, and inference optimization. It can load a model directly from Amazon S3 when the S3 path is passed as the `HF_MODEL_ID` environment variable at model creation. This makes it well suited to serving quantized models, such as those compressed with GPTQ or AWQ, including across multiple GPUs.

To deploy your quantized model, upload the uncompressed model artifacts (e.g., `model.pt` or `model.safetensors`) to an S3 prefix, then create a SageMaker model that uses the `lmi-dist` container and points to that prefix. The container loads the model, handles parallel execution, and exposes an inference endpoint.

In [None]:
from datetime import datetime
from sagemaker.huggingface import get_huggingface_llm_image_uri

All available images can be found here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md

In [None]:
prebaked_inference_image_uri = f"763104351884.dkr.ecr.{sagemaker.Session().boto_session.region_name}.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

In [None]:
model_name = f"quantized-model-{quant_prefix}-{datetime.now().strftime('%y%m%d-%H%M%S')}"
endpoint_name = f"{model_name}-ep"
print(f"choosing model name > {model_name}")
print(f"choosing endpoint name > {endpoint_name}")

In [None]:
quant_model = sagemaker.Model(
    image_uri=prebaked_inference_image_uri,
    env={
        "HF_MODEL_ID": f"{remote_upload_s3uri}/",
        "OPTION_MAX_MODEL_LEN": "12000",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
    },
    role=role,
    name=model_name,
    sagemaker_session=sagemaker.Session()
)

In [None]:
pretrained_predictor = quant_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=False
)
print(f"Endpoint creation started for {endpoint_name}; wait until it is InService before invoking it.")
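
Because `deploy` is called with `wait=False`, the call returns while the endpoint is still being created. A simple polling loop, sketched below, can block until the endpoint is ready. The helper name and its parameters are assumptions, and the demo uses a scripted status sequence so it runs without AWS access.

```python
import time


def wait_until_in_service(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it returns 'InService'; raise on failure or timeout."""
    waited = 0
    while True:
        status = get_status()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demo: a scripted status sequence stands in for a real endpoint.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_until_in_service(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `get_status` could wrap the boto3 call `describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]` on a SageMaker client. boto3 also ships an `endpoint_in_service` waiter that serves the same purpose.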

## Inference with LiteLLM

In [None]:
%pip install -Uq litellm

In [None]:
import os 
from litellm import completion

In [None]:
response = completion(
    model=f"sagemaker/{endpoint_name}", 
    messages=[
        {"role": "system", "content": "You are a helpful assistant that follows instructions"},
        {"role": "user", "content": "Hello"},
    ],
    temperature=0.1,
    max_tokens=64
)

In [None]:
def detokenize_gpt_style(text):
    return text.replace("Ċ", "\n").replace("Ġ", " ")
    
print(detokenize_gpt_style(response.choices[0].message.content))