Describe the bug
After processing all layers, quantization attempts to pack them and uses an EXTREME amount of system RAM: 600+GB on top of the already-quantized model (the model sits at ~442GB of system RAM, then the packing stage consumes 600+GB more) on a 1TB RAM system.
The process is then killed by Linux (OOM).
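For anyone trying to reproduce the numbers above, system RAM can be sampled alongside the quantization run; a minimal sketch, assuming psutil is installed (it is not part of the original script, and the interval is illustrative):

import time
import psutil  # assumed dependency, not used by the quantization script itself

def log_system_ram(interval_s: float = 10.0):
    """Periodically print total system RAM usage in GB (run in a separate terminal)."""
    while True:
        mem = psutil.virtual_memory()
        print(f"used: {mem.used / 1e9:.1f} GB / total: {mem.total / 1e9:.1f} GB ({mem.percent:.0f}%)")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_system_ram()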
GPU Info
Show output of:
I can't run it because I would have to rent the system again, but it was eight H100s.
Software Info
Ubuntu 22.04, Python 3.10
Show output of:
pip show gptqmodel torch transformers accelerate triton
I can get these if needed, but I am running the latest releases (4.0, etc.).
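If it helps, the same version info can be dumped from Python directly; a minimal sketch using only the standard library (package names taken from the pip command above):

from importlib.metadata import version, PackageNotFoundError

# Packages listed in the issue template's pip command
for pkg in ("gptqmodel", "torch", "transformers", "accelerate", "triton"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")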
To Reproduce
Run my script below:
import os
import yaml
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
import torch

def main():
    # Load configuration from YAML file relative to the script's location
    script_dir = os.path.dirname(os.path.abspath(__file__))
    config_path = os.path.join(script_dir, "config.yaml")
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    # Set CUDA_VISIBLE_DEVICES for multi-GPU support
    os.environ["CUDA_VISIBLE_DEVICES"] = str(config["gpu_devices"])

    # Verify that GPUs are available
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please check your installation.")
    num_gpus = torch.cuda.device_count()
    print(f"Using {num_gpus} GPU(s): {os.environ['CUDA_VISIBLE_DEVICES']}")

    # --- Quantization Configuration ---
    quantization_config = config["quantization_config"]
    quant_config = QuantizeConfig(
        bits=quantization_config["bits"],
        group_size=quantization_config["group_size"],
        sym=quantization_config["sym"],
        desc_act=quantization_config["desc_act"],
        v2=quantization_config["v2"],
        damp_percent=quantization_config["damp_percent"]
    )

    # --- Calibration Dataset ---
    calibration_config = config["calibration"]
    calibration_dataset = load_dataset(
        calibration_config["dataset"],
        data_files=calibration_config["subset"],
        split="train"
    ).select(range(calibration_config["num_samples"]))["text"]

    # --- Model Loading ---
    model_id = config["model_id"]
    print(f"Loading model: {model_id}")
    model = GPTQModel.load(
        model_id,
        quant_config,
        device_map="auto"
    )

    # --- Quantization ---
    print("Starting quantization...")
    model.quantize(
        calibration_dataset,
        batch_size=calibration_config["batch_size"]
    )
    print("Quantization complete.")

    # --- Saving the Quantized Model ---
    output_dir = config["output_dir"]
    print(f"Saving quantized model to: {output_dir}")
    model.save(output_dir)
    print("Model saved successfully.")

if __name__ == "__main__":
    main()
With config.yaml of:
# Configuration for GPTQv2 8-bit quantization script

# Base model to be quantized from Hugging Face Hub or a local path
model_id: "TheDrummer/Agatha-111B-v1"

# Directory to save the quantized model. This will be created inside the repository.
output_dir: "./quantized-models/TheDrummer/Agatha-111B-v1"

# GPU devices to use for quantization. For multi-GPU, list as a comma-separated string (e.g., "0,1")
# This will be used to set the CUDA_VISIBLE_DEVICES environment variable.
gpu_devices: "0,1,2,3,4,5,6,7"

# Quantization configuration
quantization_config:
  bits: 8
  group_size: 128    # Use -1 for per-channel quantization
  sym: true          # Symmetric quantization
  desc_act: false    # Set to true to enable activation order, which might improve accuracy
  v2: true           # Use GPTQ v2 algorithm
  damp_percent: 0.01

# Calibration dataset configuration
calibration:
  dataset: "allenai/c4"
  subset: "en/c4-train.00001-of-01024.json.gz"
  num_samples: 512   # Number of samples to use for calibration
  batch_size: 1      # Batch size for quantization. Adjust based on your VRAM.
Expected behavior
After processing, the model is packed and saved on disk.
Model/Datasets
TheDrummer/Agatha-111B-v1
Screenshots
See screenshot below: at this point it goes from 442GB of system RAM used to using ALL system RAM, up to 1008GB.
Additional context
This script works fine on CohereLabs/c4ai-command-r7b-12-2024, which is what I use for testing.