# `quanto` integration in `transformers`

Welcome to this tutorial, which shows how to use the `quanto` library together with `transformers` to quantize any model to 8-, 4-, or even 2-bit precision on GPU, CPU, and MPS devices! Let's get started 🔥

## Download requirements

First, install the dependencies below. As these features are only available on the `main` branch, `transformers` is installed from source.

In [None]:
!pip install -U -q git+https://github.com/huggingface/transformers.git
!pip install -U -q quanto
!pip install -U -q accelerate
!pip install -q datasets

## Basic usage

### Quantize the model

You can quantize a model by passing a `QuantoConfig` object to the `from_pretrained` method!

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
import torch

model_id = "bigscience/bloom-560m"
quantization_config = QuantoConfig(weights="int8")

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, quantization_config=quantization_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

If you print the model, you will see that most of the `nn.Linear` layers have been replaced by `quanto.nn.QLinear` layers! Printing a quantized weight also shows that the scale has the same dtype as the original model.

In [None]:
print(model.transformer.h[0].self_attention.dense.weight)

QTensor(tensor([[ 73, -21,  29,  ..., -22,  25,  19],
        [-31, -59,  -4,  ..., -14, -46, -10],
        [ 23, -10,  36,  ...,  27,  -3, -14],
        ...,
        [ -1,   7,  -9,  ...,   4,  17,   3],
        [ -7,  13, -22,  ...,  -6,  20, -44],
        [ 11,   2,  -4,  ...,  -1,  23, -29]], device='cuda:0',
       dtype=torch.int8), scale=tensor([[0.0004],
        [0.0003],
        [0.0004],
        ...,
        [0.0009],
        [0.0003],
        [0.0006]], device='cuda:0'), public_dtype=torch.float32)
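
To make the printed `QTensor` easier to read, here is a minimal sketch of symmetric per-row int8 quantization in plain Python. It only mirrors the layout shown above (integer data plus one scale per row) and is not quanto's actual implementation; the helper names `quantize_rows` and `dequantize_rows` and the sample values are hypothetical.

```python
# Illustrative only: symmetric per-row int8 quantization in plain Python.
# This is NOT quanto's implementation; it only mirrors the
# (int8 data, one scale per row) layout of the QTensor printed above.

def quantize_rows(weight, qmax=127):
    """Return (integer rows, scales), using one scale per row."""
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        # Map the largest magnitude in the row to +/- qmax.
        scale = absmax / qmax if absmax > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Recover approximate float values: integer value times the row scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical sample weights, two rows of three values each.
weight = [[0.0292, -0.0084, 0.0116], [-0.0093, -0.0177, -0.0012]]
q_rows, scales = quantize_rows(weight)
restored = dequantize_rows(q_rows, scales)

# Largest absolute difference between original and restored values.
max_error = max(
    abs(a - b)
    for row_a, row_b in zip(weight, restored)
    for a, b in zip(row_a, row_b)
)
print(q_rows)
print(scales)
print(max_error)
```

In this scheme the rounding error per value is bounded by half of the row's scale, which is why small scales such as the ones printed above keep the quantized weights close to the originals.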


Once loaded, run a prediction as you would with a regular model:

In [None]:
text = "Hello my name is"
device = "cuda"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is John, I am a professional photographer and I am a member of the Photography Society of the


Let's try it on a bigger model such as Mistral 7B! In 8-bit, the weights should need only around 7B parameters × 1 byte (= 8 bits) = 7 GB!
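
The same back-of-the-envelope arithmetic can be repeated for other bit widths. The sketch below assumes that the `weights` strings `"int8"`, `"int4"`, and `"int2"` correspond to 8, 4, and 2 bits per parameter, uses the round figure of 7B parameters, and ignores scales, non-quantized layers, and activations, so the numbers are rough lower bounds. The names `BITS_PER_WEIGHT` and `weight_memory_gb` are hypothetical.

```python
# Rough weight-memory estimate per bit width (weights only).
# Assumption: "int8", "int4", "int2" mean 8, 4, 2 bits per parameter;
# "float32" and "float16" are unquantized baselines for comparison.
BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8, "int4": 4, "int2": 2}


def weight_memory_gb(num_params, weights):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[weights] / 8 / 1e9


num_params = 7e9  # round figure for a 7B-parameter model
for name in BITS_PER_WEIGHT:
    print(f"{name:>8}: {weight_memory_gb(num_params, name):.2f} GB")
```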

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
quantization_config = QuantoConfig(weights="int8")

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, quantization_config=quantization_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
text = "Hello my name is"
device = "cuda"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Hello my name is Katie and I am a 20 year old college student. I am currently studying to be


To save the quantized model, you just need to call the `save_pretrained` method.

If the model is too big to fit on the GPU, you can also use CPU/disk offload by passing `device_map="auto"` or a custom `device_map`! You can check where each module was placed by inspecting `model.hf_device_map`. Read more about how offloading works in the `accelerate` documentation.

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

quantization_config = QuantoConfig(weights="int8")

device_map = {'model.embed_tokens': 0,
 'model.layers.0': 0,
 'model.layers.1': 0,
 'model.layers.2': 0,
 'model.layers.3': 0,
 'model.layers.4': 0,
 'model.layers.5': 0,
 'model.layers.6': 0,
 'model.layers.7': 0,
 'model.layers.8': 0,
 'model.layers.9': 0,
 'model.layers.10': 0,
 'model.layers.11': 0,
 'model.layers.12': 0,
 'model.layers.13': 0,
 'model.layers.14': 0,
 'model.layers.15': 0,
 'model.layers.16': 0,
 'model.layers.17': 0,
 'model.layers.18': 0,
 'model.layers.19': 0,
 'model.layers.20': 0,
 'model.layers.21': 0,
 'model.layers.22': 0,
 'model.layers.23': 0,
 'model.layers.24': 0,
 'model.layers.25': 0,
 'model.layers.26': 0,
 'model.layers.27': 'cpu',
 'model.layers.28': 'cpu',
 'model.layers.29': 'cpu',
 'model.layers.30': 'cpu',
 'model.layers.31': 'disk',
 'model.norm': 'cpu',
 'lm_head':0}
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, quantization_config=quantization_config, device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
text = "Hello my name is"
device = "cuda"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Hello my name is Katie and I am a 20 year old college student. I am currently studying to be
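
A custom `device_map` like the one used in this section does not have to be written out by hand. The sketch below rebuilds the same mapping programmatically; the helper `build_device_map` and its parameters are hypothetical, and the module names are taken from the dictionary shown earlier in this section.

```python
def build_device_map(num_layers=32, gpu_layers=27, cpu_layers=4, gpu=0):
    """Put the first `gpu_layers` decoder layers on `gpu`, the next
    `cpu_layers` on CPU, and any remaining layers on disk."""
    device_map = {"model.embed_tokens": gpu}
    for i in range(num_layers):
        if i < gpu_layers:
            device = gpu
        elif i < gpu_layers + cpu_layers:
            device = "cpu"
        else:
            device = "disk"
        device_map[f"model.layers.{i}"] = device
    # Same placement as the hand-written map: final norm on CPU, head on GPU.
    device_map["model.norm"] = "cpu"
    device_map["lm_head"] = gpu
    return device_map


device_map = build_device_map()
print(
    device_map["model.layers.26"],
    device_map["model.layers.27"],
    device_map["model.layers.31"],
)
```

The resulting dictionary can be passed as `device_map=device_map` to `from_pretrained`, and changing `gpu_layers` or `cpu_layers` is a quick way to experiment with how much of the model is offloaded.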


## Quantize models from other modalities

You can easily use quanto to quantize models from other modalities! Let's see below how to use transformers, quanto, and Whisper for an automatic speech recognition task.

For this demo we use [`openai/whisper-large-v3`](https://huggingface.co/openai/whisper-large-v3)

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline, QuantoConfig
from datasets import load_dataset

model_id = "openai/whisper-large-v3"
quanto_config = QuantoConfig(weights="int8")

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quanto_config
)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch.float16
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.


You can also compile the quantized model with `torch.compile`! See the example below:

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline, QuantoConfig
from datasets import load_dataset

model_id = "openai/whisper-large-v3"
quanto_config = QuantoConfig(weights="int8")

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    quantization_config=quanto_config,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
)

# Compile the model
pipe.model = torch.compile(model)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
