# LLM Compressor - Model Quantization

This notebook demonstrates how to quantize the Llama 3.1 8B model with LLM Compressor, using either SmoothQuant or GPTQ.

## Overview

The Lean RAG Accelerator uses model quantization to:
- Reduce weight memory by roughly 50% (INT8) to 75% (INT4) relative to FP16
- Enable 2x to 4x higher throughput, depending on hardware and batch size
- Retain more than 95% of the baseline model's accuracy
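
The 50% and 75% figures follow directly from the number of bits stored per weight. A rough sketch of that arithmetic, assuming a round 8 billion parameters and counting weights only (activations, KV cache, and quantization scales are ignored; `weight_memory_gb` is a hypothetical helper, not part of LLM Compressor):

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # assumed round figure for an 8B model

baseline = weight_memory_gb(NUM_PARAMS, "fp16")
for precision in ("fp16", "int8", "int4"):
    size = weight_memory_gb(NUM_PARAMS, precision)
    saving = 1 - size / baseline
    print(f"{precision}: {size:.1f} GB ({saving:.0%} smaller than fp16)")
```

Under these assumptions the weights shrink from about 16 GB in FP16 to 8 GB in INT8 and 4 GB in INT4. Real checkpoints land slightly above these numbers because scales and any unquantized layers are stored at higher precision.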

## Prerequisites

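- LLM Compressor installed (`pip install llmcompressor`)
- Llama 3.1 8B model weights (or a compatible causal language model)
- Calibration dataset of representative text
- GPU with enough memory to load the model (about 16 GB for the FP16 weights alone)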
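
Before running the notebook, it can help to confirm that the required packages are importable. The check below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the import name `llmcompressor` is an assumption about how LLM Compressor is installed in your environment.

```python
from importlib.util import find_spec

# Top-level import names for the packages this notebook relies on.
# `llmcompressor` is assumed to be the import name of LLM Compressor.
REQUIRED_PACKAGES = ("torch", "transformers", "llmcompressor")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that each package can be located; it does not verify versions or GPU availability.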


## Step 1: Import Libraries


In [None]:
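import json
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# LLM Compressor is imported as `llmcompressor`; its modifiers live in submodules.
# A recipe is a list of modifier instances, so no separate recipe class is imported.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier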


## Step 2: Configuration


In [None]:
# Model configuration
MODEL_NAME = "meta-llama/Llama-3.1-8B"
MODEL_PATH = "/mnt/models/llama-3.1-8b"  # Update with your model path

# Quantization configuration
QUANTIZATION_METHOD = "smoothquant"  # Options: "smoothquant", "gptq"
TARGET_PRECISION = "int8"  # Options: "int8" (smoothquant), "int4" (gptq)

# Output configuration
OUTPUT_PATH = "/mnt/models/llama-3.1-8b-quantized"  # Update with output path
OUTPUT_FORMAT = "onnx"  # Options: "onnx", "safetensors", "pytorch"

# Calibration dataset
CALIBRATION_DATASET = "/mnt/data/calibration"  # Update with calibration dataset path
NUM_CALIBRATION_SAMPLES = 512
BATCH_SIZE = 8
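
The settings in the configuration cell depend on each other: each method is listed with one precision, and the sample count and batch size together determine how many calibration batches are run. The following is a minimal sketch of a sanity check for those values. `SUPPORTED_PRECISIONS`, `validate_config`, and `num_calibration_batches` are hypothetical names introduced for illustration, and the pairings simply mirror the comments in the configuration cell.

```python
# Values copied from the configuration cell.
QUANTIZATION_METHOD = "smoothquant"
TARGET_PRECISION = "int8"
NUM_CALIBRATION_SAMPLES = 512
BATCH_SIZE = 8

# Method/precision pairings as listed in the configuration comments.
SUPPORTED_PRECISIONS = {"smoothquant": {"int8"}, "gptq": {"int4"}}


def validate_config(method: str, precision: str) -> None:
    """Raise ValueError if the method/precision pair is not a listed option."""
    if method not in SUPPORTED_PRECISIONS:
        raise ValueError(f"Unknown quantization method: {method!r}")
    if precision not in SUPPORTED_PRECISIONS[method]:
        raise ValueError(
            f"{precision!r} is not a listed precision for method {method!r}"
        )


def num_calibration_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover all samples (ceiling division)."""
    return -(-num_samples // batch_size)


validate_config(QUANTIZATION_METHOD, TARGET_PRECISION)
print(num_calibration_batches(NUM_CALIBRATION_SAMPLES, BATCH_SIZE))  # 64
```

Failing early on a mismatched pair such as `"gptq"` with `"int8"` is cheaper than discovering it after the model has been loaded.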
