# Overview
This notebook demonstrates how to run the Mixtral-8x7B large language model with **AQLM (Additive Quantization of Language Models)**, a technique that compresses the model's weights to about 2 bits each so that inference fits in the memory of limited hardware such as a Google Colab GPU.

# Purpose
- Load and run a 2-bit AQLM-quantized LLM using the 🤗 transformers library.

- Benchmark generation time and evaluate output quality.

- Explore how quantization affects performance, memory usage, and usability.

## AQLM transformers
**Additive Quantization of Language Models (AQLM)** is an advanced technique designed to compress large language models (LLMs) by reducing their **memory footprint** while maintaining performance. It achieves this through a method called **Multi-Codebook Quantization (MCQ)**, which efficiently approximates weight matrices using learned codebooks.
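
To make the codebook idea concrete, here is a minimal pure-Python sketch of additive (multi-codebook) decoding. It is a toy illustration, not the `aqlm` library's implementation: the function names, the random codebooks, and the toy sizes are assumptions made for this example. The last line additionally assumes that the `1x16` in the model name used later means one codebook with 16-bit codes, applied to groups of 8 weights.

```python
import random


def bits_per_weight(num_codebooks, code_bits, group_size):
    """Index storage per weight: one code per codebook for each weight group."""
    return num_codebooks * code_bits / group_size


def decode_group(codes, codebooks):
    """Reconstruct one group of weights as the sum of the selected codewords."""
    group_size = len(codebooks[0][0])
    weights = [0.0] * group_size
    for codebook, code in zip(codebooks, codes):
        for j, value in enumerate(codebook[code]):
            weights[j] += value
    return weights


# Toy setup (hypothetical sizes): 2 codebooks, 4-bit codes, groups of 8 weights.
rng = random.Random(0)
num_codebooks, code_bits, group_size = 2, 4, 8
codebooks = [
    [[rng.gauss(0.0, 1.0) for _ in range(group_size)] for _ in range(2**code_bits)]
    for _ in range(num_codebooks)
]

codes = [3, 11]  # one index per codebook; only these integers are stored per group
approx_weights = decode_group(codes, codebooks)

print(len(approx_weights))                                    # 8
print(bits_per_weight(num_codebooks, code_bits, group_size))  # 1.0
print(bits_per_weight(1, 16, 8))                              # 2.0
```

The sketch only covers decoding. Learning codebooks and codes that approximate a real weight matrix well is the hard part and is not shown here.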

**1. Install the `aqlm` library**
- The only extra dependency to run AQLM models.
- Add `[gpu]` to install the required CUDA-specific dependencies.

In [1]:
%%capture
# Quote each requirement so the shell does not treat ">=" as output redirection
!pip install "aqlm[gpu]>=1.0.1"
!pip install "accelerate>=0.27.0"
!pip install "transformers>=4.38.0"
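
Because a version constraint can be silently dropped (for example, the shell interprets an unquoted `>=` as output redirection), it can be worth confirming what was actually installed. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the version parsing is deliberately simplified to leading numeric components.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '4.38.0' or '4.38.0.dev0' into (4, 38, 0), ignoring non-numeric suffixes."""
    numbers = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in numbers.group(0).split(".")) if numbers else ()


def meets_minimum(package, minimum):
    """Return True if `package` is installed at `minimum` or newer, else False."""
    try:
        installed = version_tuple(version(package))
    except PackageNotFoundError:
        return False
    required = version_tuple(minimum)
    width = max(len(installed), len(required))
    # Pad with zeros so that '4.38' and '4.38.0' compare as equal.
    installed += (0,) * (width - len(installed))
    required += (0,) * (width - len(required))
    return installed >= required


for package, minimum in [("aqlm", "1.0.1"), ("accelerate", "0.27.0"), ("transformers", "4.38.0")]:
    status = "ok" if meets_minimum(package, minimum) else "missing or too old"
    print(f"{package}>={minimum}: {status}")
```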

**2. Load the model as usual**

The tokenizer is just a normal `Mixtral` tokenizer.
- Here we load a 2-bit AQLM-quantized version of the Mixtral-8x7B model from the Hugging Face Hub using the `transformers` library.

- Model: `Mixtral-8x7B`, compressed using AQLM 2-bit quantization, reducing memory usage while retaining performance.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the quantized Mixtral model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,
)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")

The download output lists the files fetched from the Hub (sizes as reported by the progress bars):

- `config.json`: 6.13 kB
- `model.safetensors.index.json`: 263 kB
- `model-00001-of-00003.safetensors`: 4.99 GB
- `model-00002-of-00003.safetensors`: 5.00 GB
- `model-00003-of-00003.safetensors`: 3.11 GB
- `generation_config.json`: 111 B
- `tokenizer_config.json`: 967 B
- `tokenizer.model`: 493 kB
- `tokenizer.json`: 1.80 MB
- `special_tokens_map.json`: 72.0 B

Run a few forward passes first to initialize CUDA and let the kernels compile automatically. This is done in a separate cell so that the one-time setup cost does not affect the generation speed benchmark below.

In [3]:
%%capture
import torch

# Tokenize an empty prompt and move input IDs to GPU
input_ids = tokenizer("", return_tensors="pt")["input_ids"].cuda()

# Generate up to 10 new tokens
output = quantized_model.generate(input_ids, max_new_tokens=10)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
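
The warm-up-then-measure pattern used here can be sketched generically. The helper `time_after_warmup` and the stand-in workload `fake_generate` are hypothetical names invented for this illustration. The sleep simulates a one-time compilation cost and is not a measurement of the real model.

```python
import time


def time_after_warmup(fn, warmup_runs=2):
    """Call `fn` a few times untimed, then return (result, seconds) for one timed call."""
    for _ in range(warmup_runs):
        fn()  # one-time setup costs (lazy init, compilation, caches) are paid here
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload with an expensive first call, mimicking lazy kernel compilation.
state = {"compiled": False, "calls": 0}


def fake_generate():
    state["calls"] += 1
    if not state["compiled"]:
        time.sleep(0.2)  # simulated one-time compilation
        state["compiled"] = True
    return "tokens"


result, seconds = time_after_warmup(fake_generate)
print(result, f"{seconds:.3f}s")
```

With the notebook's model, `fn` could be something like `lambda: quantized_model.generate(input_ids, max_new_tokens=10)`.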


**3. Measure inference time with the AQLM-quantized model**
- Here we measure how long it takes the AQLM-quantized Mixtral model to generate 128 new tokens from a given prompt.

In [4]:
%%time

# Prepare input prompt
prompt = "I'm AQLM, "
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()

# Generate text with fixed output length
output = quantized_model.generate(
    input_ids,
    min_new_tokens=128,
    max_new_tokens=128
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


CPU times: user 34.4 s, sys: 0 ns, total: 34.4 s
Wall time: 35.1 s
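
The attention-mask warning in the output appears because only `input_ids` is passed to `generate`. One way to avoid it, sketched below, is to forward the whole tokenizer output, which also contains `attention_mask`. The wrapper name `generate_with_mask` is hypothetical, and reusing the EOS token as the padding token is an assumption that mirrors what the log message says the library does by default.

```python
def generate_with_mask(model, tokenizer, prompt, num_new_tokens):
    """Generate a fixed number of tokens, passing the attention mask explicitly."""
    # The tokenizer output holds both `input_ids` and `attention_mask`.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    return model.generate(
        **inputs,  # forwards input_ids and attention_mask together
        min_new_tokens=num_new_tokens,
        max_new_tokens=num_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # assumption: reuse EOS as the padding token
    )
```

With the objects defined earlier, the benchmark call would read `output = generate_with_mask(quantized_model, tokenizer, "I'm AQLM, ", 128)`.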


Note that `transformers` generation is not the fastest implementation, and its speed is heavily influenced by the CPU capabilities of the _Google Colab_ runtime.
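
To put the timing in more familiar units, the wall time can be converted to throughput. This is plain arithmetic on the numbers reported above (128 new tokens, 35.1 s wall time). It describes that single run only and is not a general benchmark.

```python
new_tokens = 128     # min_new_tokens == max_new_tokens in the benchmark cell
wall_time_s = 35.1   # "Wall time" reported by %%time

tokens_per_second = new_tokens / wall_time_s
ms_per_token = 1000 * wall_time_s / new_tokens

print(f"{tokens_per_second:.2f} tokens/s")  # 3.65 tokens/s
print(f"{ms_per_token:.0f} ms per token")   # 274 ms per token
```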

**4. Check that the output is what one would expect from Mixtral**

In [5]:
print(tokenizer.decode(output[0]))

<s> I'm AQLM, 20 years old, and I'm a student at the University of California, Berkeley. I'm a member of the Berkeley Student Union, and I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student


**5. Check peak memory usage**

In [6]:
import torch

print(f"Peak memory usage: {torch.cuda.max_memory_allocated()*1e-9:.2f} GB")

Peak memory usage: 13.22 GB


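Indeed, this corresponds to roughly 2 bits per model weight. The peak sits slightly above the bare 2-bit figure because it also covers the codebooks, the layers that are not quantized, and the activations used during generation.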
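
The estimate can be checked with a short calculation. The parameter count below is an assumption (the commonly cited total of about 46.7 billion parameters for Mixtral-8x7B) and does not come from this notebook. The peak memory figure is the one printed above.

```python
peak_bytes = 13.22e9  # peak memory reported above
num_params = 46.7e9   # assumption: approximate total parameter count of Mixtral-8x7B

peak_bits_per_weight = peak_bytes * 8 / num_params
fp16_gb = num_params * 2 / 1e9  # 2 bytes per weight in 16-bit precision

print(f"{peak_bits_per_weight:.2f} bits per weight at peak")  # 2.26 bits per weight at peak
print(f"{fp16_gb:.1f} GB for 16-bit weights alone")           # 93.4 GB for 16-bit weights alone
```

Under that assumption, the same weights stored in 16-bit precision would need roughly seven times the memory measured here.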