
[BUG] MASSIVE memory leak - GPTQv2 Quantization #1662

@phaelon74

Description

Describe the bug

After processing all layers, quantization attempts to pack them and uses an extreme amount of system RAM: the loaded model alone sits at ~442 GB of system RAM, and the packing stage then consumes another 600+ GB on top of that, on a 1 TB RAM system.

The process is then killed by Linux.
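
To capture the growth curve before the OOM kill, I can attach a small RSS logger on the next run; a minimal sketch using psutil (assumed installed via pip install psutil, not part of GPTQModel):

import threading
import time

import psutil

def log_rss(interval_s: float = 10.0) -> threading.Event:
    """Print this process's resident set size every interval_s seconds until stopped."""
    stop = threading.Event()
    proc = psutil.Process()  # current process

    def worker():
        while not stop.is_set():
            rss_gb = proc.memory_info().rss / 1e9
            print(f"[mem] RSS = {rss_gb:.1f} GB")
            time.sleep(interval_s)

    threading.Thread(target=worker, daemon=True).start()
    return stop

# Usage: stop = log_rss(); model.quantize(...); stop.set()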

GPU Info

Show output of:

I can't run it because I would have to rent the system again, but it was eight H100s.
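
For future runs, I can also log the GPU info programmatically at startup so it is captured even after the rental ends; a small sketch using standard torch calls:

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB VRAM")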

Software Info

Ubuntu 22.04, Python 3.10

Show output of:

pip show gptqmodel torch transformers accelerate triton

I can get these, but I am running the latest releases of everything (gptqmodel 4.0, etc.).
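
For the record, the same versions can be dumped from inside the script (standard-library importlib.metadata, equivalent to the pip show above):

from importlib.metadata import PackageNotFoundError, version

for pkg in ("gptqmodel", "torch", "transformers", "accelerate", "triton"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")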

To Reproduce

Run my script below:

import os
import yaml
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
import torch

def main():
    # Load configuration from YAML file relative to the script's location
    script_dir = os.path.dirname(os.path.abspath(__file__))
    config_path = os.path.join(script_dir, "config.yaml")
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    # Set CUDA_VISIBLE_DEVICES for multi-GPU support
    os.environ["CUDA_VISIBLE_DEVICES"] = str(config["gpu_devices"])
    
    # Verify that GPUs are available
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please check your installation.")
    
    num_gpus = torch.cuda.device_count()
    print(f"Using {num_gpus} GPU(s): {os.environ['CUDA_VISIBLE_DEVICES']}")

    # --- Quantization Configuration ---
    quantization_config = config["quantization_config"]
    quant_config = QuantizeConfig(
        bits=quantization_config["bits"],
        group_size=quantization_config["group_size"],
        sym=quantization_config["sym"],
        desc_act=quantization_config["desc_act"],
        v2=quantization_config["v2"],
        damp_percent=quantization_config["damp_percent"]
    )

    # --- Calibration Dataset ---
    calibration_config = config["calibration"]
    calibration_dataset = load_dataset(
        calibration_config["dataset"],
        data_files=calibration_config["subset"],
        split="train"
    ).select(range(calibration_config["num_samples"]))["text"]

    # --- Model Loading ---
    model_id = config["model_id"]
    print(f"Loading model: {model_id}")
    model = GPTQModel.load(
        model_id,
        quant_config,
        device_map="auto"
    )

    # --- Quantization ---
    print("Starting quantization...")
    model.quantize(
        calibration_dataset,
        batch_size=calibration_config["batch_size"]
    )
    print("Quantization complete.")

    # --- Saving the Quantized Model ---
    output_dir = config["output_dir"]
    print(f"Saving quantized model to: {output_dir}")
    model.save(output_dir)
    print("Model saved successfully.")

if __name__ == "__main__":
    main() 

With config.yaml of:

# Configuration for GPTQv2 8-bit quantization script

# Base model to be quantized from Hugging Face Hub or a local path
model_id: "TheDrummer/Agatha-111B-v1"

# Directory to save the quantized model. This will be created inside the repository.
output_dir: "./quantized-models/TheDrummer/Agatha-111B-v1"

# GPU devices to use for quantization. For multi-GPU, list as a comma-separated string (e.g., "0,1")
# This will be used to set the CUDA_VISIBLE_DEVICES environment variable.
gpu_devices: "0,1,2,3,4,5,6,7"

# Quantization configuration
quantization_config:
  bits: 8
  group_size: 128 # Use -1 for per-channel quantization
  sym: true # Symmetric quantization
  desc_act: false # Set to true to enable activation order, which might improve accuracy
  v2: true # Use GPTQ v2 algorithm
  damp_percent: 0.01

# Calibration dataset configuration
calibration:
  dataset: "allenai/c4"
  subset: "en/c4-train.00001-of-01024.json.gz"
  num_samples: 512 # Number of samples to use for calibration
  batch_size: 1 # Batch size for quantization. Adjust based on your VRAM. 
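
Since the script reads several nested keys from this file, a quick pre-flight check can fail fast before burning rental time; a minimal sketch that only assumes the key names shown above:

import yaml

REQUIRED = {
    "model_id": str,
    "output_dir": str,
    "gpu_devices": str,
    "quantization_config": dict,
    "calibration": dict,
}

def validate_config(path: str = "config.yaml") -> dict:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for key, typ in REQUIRED.items():
        if key not in cfg:
            raise KeyError(f"config.yaml is missing required key: {key}")
        if not isinstance(cfg[key], typ):
            raise TypeError(f"{key} should be a {typ.__name__}")
    return cfg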

Expected behavior

After processing, the model is packed and saved to disk.
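
On the next attempt I also plan to explicitly release memory between quantize() and save(), to rule out lingering calibration tensors; a minimal mitigation sketch (an experiment, not a confirmed fix):

import gc
import torch

def release_memory():
    gc.collect()                  # reclaim unreferenced Python objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver

# In main(), after model.quantize(...):
# release_memory()
# model.save(output_dir)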

Model/Datasets

TheDrummer/Agatha-111B-v1

Screenshots

See the screenshot below: at this point usage goes from 442 GB of system RAM to all available RAM, topping out at 1008 GB.

[Screenshot: system RAM usage climbing from ~442 GB to 1008 GB during the packing stage]

Additional context

This script works fine on CohereLabs/c4ai-command-r7b-12-2024, which is what I use for testing.
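
For scale, rough back-of-envelope arithmetic (my estimate, not a measurement) of what the weights alone should occupy at each stage:

params = 111e9  # Agatha-111B parameter count, approximate

fp32_gb = params * 4 / 1e9  # ~444 GB; close to the ~442 GB baseline observed
fp16_gb = params * 2 / 1e9  # ~222 GB if weights were held in fp16/bf16
int8_gb = params * 1 / 1e9  # ~111 GB for the 8-bit packed result

print(f"fp32: ~{fp32_gb:.0f} GB, fp16: ~{fp16_gb:.0f} GB, int8 packed: ~{int8_gb:.0f} GB")
# Even an fp32 copy plus a full packed copy is ~555 GB, so climbing from
# 442 GB to 1008 GB during packing looks like additional full-size
# intermediates are retained instead of being freed layer by layer.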
