## AQLM transformers integration example

**Install the `aqlm` library**
- It is the only extra dependency needed to run AQLM models.
- Add `[gpu]` to install the required CUDA-specific dependencies.
- To use features such as `device_map`, you also need `accelerate`. For proper AQLM support, install the latest version straight from GitHub so that it includes [PR#2376](https://github.com/huggingface/accelerate/pull/2376). The cell below also installs `transformers` from the `main` branch.

In [1]:
%%capture
!pip install aqlm[gpu]==1.0.1
!pip install git+https://github.com/huggingface/accelerate.git@main
!pip install git+https://github.com/huggingface/transformers.git@main

In [1]:
!pip freeze | grep -E "aqlm|acce|trans"

accelerate @ git+https://github.com/huggingface/accelerate.git@97d2168e5953fe7373a06c69c02c5a00a84d5344
aqlm==1.0.1
google-cloud-translate==3.11.3
transformers @ git+https://github.com/huggingface/transformers.git@864c8e6ea31e2e9671cd34e1febd889f5e8d9150
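
The `grep` pattern above also matches unrelated packages such as `google-cloud-translate`. As an alternative, here is a minimal sketch that queries only the three packages of interest through the standard-library `importlib.metadata` module. The helper name `installed_versions` is ours and is not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["aqlm", "accelerate", "transformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```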


**Load the model as usual**

The tokenizer is just a normal `Mixtral` tokenizer.

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# device_map="cuda" already places the whole model on the GPU,
# so no extra .cuda() call is needed.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

config.json:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

configuration_mixtral_aqlm.py:   0%|          | 0.00/427 [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf:
- configuration_mixtral_aqlm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_mixtral_aqlm.py:   0%|          | 0.00/73.5k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf:
- modeling_mixtral_aqlm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/263k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/3.11G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Run a warm-up generation to initialize CUDA and let the kernels compile automatically. It is done separately here so that this one-time cost does not affect the generation speed benchmark below.

In [6]:
prompt = "Who is Lee Kuan Yew? Summarize your answer in 10 point form."
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
output = quantized_model.generate(input_ids, max_new_tokens=1024)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
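
The warning appears because only `input_ids` is passed to `generate`. One way to avoid it, sketched below, is to forward everything the tokenizer returns (including `attention_mask`) and to set `pad_token_id` explicitly. The helper name `generate_with_mask` is hypothetical, and using the EOS token for padding is an assumption that mirrors the fallback the warning itself reports.

```python
def generate_with_mask(model, tokenizer, prompt, max_new_tokens=128):
    """Generate from `prompt`, passing the attention mask and pad token explicitly."""
    # Calling the tokenizer returns both `input_ids` and `attention_mask`;
    # `.to(...)` moves every tensor in the batch to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    return model.generate(
        **inputs,  # forwards input_ids and attention_mask together
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # assumption: pad with EOS
    )
```

With the objects defined above, the call would be `generate_with_mask(quantized_model, tokenizer, prompt, max_new_tokens=1024)`.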


In [7]:
print(tokenizer.decode(output[0], skip_special_tokens=False))

<s> Who is Lee Kuan Yew? Summarize your answer in 10 point form.

1. Lee Kuan Yew was born in Singapore in 1923.
2. He was the first Prime Minister of Singapore.
3. He was the Prime Minister of Singapore from 1959 to 1990.
4. He was the Prime Minister of Singapore for 31 years.
5. He was the Prime Minister of Singapore for 31 years.
6. He was the Prime Minister of Singapore for 31 years.
7. He was the Prime Minister of Singapore for 31 years.
8. He was the Prime Minister of Singapore for 31 years.
9. He was the Prime Minister of Singapore for 31 years.
10. He was the Prime Minister of Singapore for 31 years.

What is the Singapore model?

The Singapore model is a model of economic development that has been used by Singapore since the 1960s. It is based on the idea that Singapore should be a developed country with a high level of economic growth. The Singapore model has been used by Singapore since the 1960s. It is based on the idea that Singapore should be a developed country with a hi

**Measure generation speed**

In [8]:
%%time
input_ids = tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda()
output = quantized_model.generate(input_ids, min_new_tokens=128, max_new_tokens=128)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


CPU times: user 20.4 s, sys: 0 ns, total: 20.4 s
Wall time: 20.4 s


Note that `transformers` generation is not the fastest implementation, and the speed measured here is heavily influenced by the CPU capabilities of _Google Colab_.
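
To make the timing easier to compare, it can be converted to an average throughput. The sketch below simply divides the number of generated tokens by the reported wall time; the function name `tokens_per_second` is ours, and the figure includes the (short) prompt processing time.

```python
def tokens_per_second(new_tokens, wall_time_s):
    """Average decoding throughput over one generation call."""
    return new_tokens / wall_time_s


# min_new_tokens == max_new_tokens == 128 forces exactly 128 new tokens,
# and the cell above reported a wall time of 20.4 s.
throughput = tokens_per_second(128, 20.4)
print(f"{throughput:.2f} tokens/s")         # 6.27 tokens/s
print(f"{1000 / throughput:.0f} ms/token")  # 159 ms/token
```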

**Check that the output is what one would expect from Mixtral**

In [9]:
print(tokenizer.decode(output[0]))

<s> I'm AQLM, 20 years old, and I'm a student at the University of California, Berkeley. I'm currently majoring in Computer Science and minoring in Business Administration. I'm also a member of the Berkeley Student Cooperative, a student-run housing cooperative.

I'm interested in the intersection of technology and business, and I'm currently working on a project to create a platform for students to share their experiences and advice. I'm also interested in the intersection of technology and education, and I'm currently working on a project to create a platform for students to share their experiences and advice.



**Check peak memory usage**

In [10]:
import torch

# max_memory_allocated() returns bytes; 1e-9 converts to (decimal) gigabytes.
print(f"Peak memory usage: {torch.cuda.max_memory_allocated()*1e-9:.2f} GB")

Peak memory usage: 13.93 GB


Indeed, it's ~2 bits per model weight.
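
The arithmetic behind that estimate can be sketched as follows. The parameter count of about 46.7B for Mixtral-8x7B is an assumption taken from the figure published for the base model, not something measured in this notebook. Peak memory also covers activations, the KV cache and any layers kept in higher precision, so the result is an upper bound on the average storage per weight.

```python
PEAK_BYTES = 13.93e9  # peak allocation reported above
N_PARAMS = 46.7e9     # assumption: Mixtral-8x7B total parameter count (~46.7B)

bits_per_param = PEAK_BYTES * 8 / N_PARAMS
fp16_gb = N_PARAMS * 2 / 1e9  # 2 bytes per parameter at 16-bit precision

print(f"{bits_per_param:.2f} bits per parameter")  # 2.39 bits per parameter
print(f"{fp16_gb:.1f} GB at 16 bits")              # 93.4 GB at 16 bits
```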