<a href="https://colab.research.google.com/github/Alishaw99/Quantize_LLLMs/blob/main/Quantization_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Simple Model

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define a simple feedforward neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Load a sample dataset: MNIST
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)

In [2]:
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(1):  # one epoch for simplicity
    model.train()
    for images, labels in train_loader:
        images = images.view(-1, 28*28)  # flatten the images
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

In [3]:
model.eval()  # Set the model to inference mode
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
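
For intuition about what `dtype=torch.qint8` implies, here is a minimal, dependency-free sketch of per-tensor affine quantization. The helper names `affine_quantize` and `affine_dequantize` and the sample weights are hypothetical, and PyTorch's actual implementation differs in its details; the sketch only illustrates the scale and zero-point idea.

```python
def affine_quantize(values, qmin=-128, qmax=127):
    """Map floats to int8 codes using one scale and zero point for the whole list."""
    # The range is widened to include 0.0 so that zero maps to an exact integer code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def affine_dequantize(codes, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.3, 0.0, 0.5, 1.5]  # illustrative values only
codes, scale, zero_point = affine_quantize(weights)
restored = affine_dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale, "zero_point:", zero_point)
print("max round-trip error:", max_error)
```

The round-trip error stays within about half a quantization step (`scale / 2`), which is one way to see why the accuracy of a small quantized model can remain close to the original.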

In [4]:
def evaluate_model(model, data_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in data_loader:
            images = images.view(-1, 28*28)  # flatten the images
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total

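# Note: accuracy is measured on the training set here; a held-out test split would give a fairer estimate
accuracy_original = evaluate_model(model, train_loader)
accuracy_quantized = evaluate_model(quantized_model, train_loader)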

print(f'Accuracy of original model: {accuracy_original}%')
print(f'Accuracy of quantized model: {accuracy_quantized}%')

Accuracy of original model: 92.91833333333334%
Accuracy of quantized model: 92.90833333333333%


In [5]:
import numpy as np

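def quantile_quantization(data, num_bins):
    # num_bins + 1 edges, from the minimum (q=0) to the maximum (q=1)
    quantiles = np.quantile(data, np.linspace(0, 1, num_bins+1))
    # Digitize against the interior edges only, so indices run from 0 to num_bins-1
    # (the maximum value would otherwise land in an extra bin with index num_bins)
    quantized_data = np.digitize(data, quantiles[1:-1])
    return quantized_data, quantiles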

# Example data and quantization
data = np.random.randn(1000)  # Normally distributed data
num_bins = 10  # Number of bins
quantized_data, quantiles = quantile_quantization(data, num_bins)

print("Quantized data:", quantized_data[:10])
print("Quantiles:", quantiles)

Quantized data: [5 0 3 7 6 2 0 1 2 7]
Quantiles: [-3.34755413 -1.41377064 -0.94495911 -0.59658283 -0.36101944 -0.08902172
  0.17986573  0.42686126  0.78925232  1.26518361  3.30240347]
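
The cell above only encodes values into bin indices. As a sketch of the reverse direction, the snippet below decodes each index to a representative value and measures the reconstruction error. It uses only the standard library (`statistics.quantiles` and `bisect`) instead of NumPy, and the choice of the per-bin median as the representative value is an assumption made for illustration, not something the notebook prescribes.

```python
import bisect
import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # normally distributed sample
num_bins = 10

# statistics.quantiles returns the num_bins - 1 interior cut points.
edges = statistics.quantiles(data, n=num_bins)

# Bin index = number of cut points at or below the value, so indices are 0 .. num_bins - 1.
bin_ids = [bisect.bisect_right(edges, x) for x in data]

# Hypothetical decoder: represent every bin by the median of its members.
members = [[] for _ in range(num_bins)]
for x, b in zip(data, bin_ids):
    members[b].append(x)
codebook = [statistics.median(m) for m in members]

restored = [codebook[b] for b in bin_ids]
mean_abs_error = sum(abs(x - r) for x, r in zip(data, restored)) / len(data)
bin_counts = [len(m) for m in members]

print("Bin counts:", bin_counts)
print("Mean absolute reconstruction error:", mean_abs_error)
```

Because the bin edges follow the data distribution, every bin holds roughly the same number of samples, which is the property that distinguishes quantile quantization from uniform binning.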


## A language model

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
import string

# Define a character-based dataset
class CharDataset(Dataset):
    def __init__(self, length=1000, seq_length=10):
        self.seq_length = seq_length
        self.chars = string.ascii_lowercase + " "
        self.data = "".join(np.random.choice(list(self.chars), length))
        self.vocab_size = len(self.chars)

    def __len__(self):
        return len(self.data) - self.seq_length

    def __getitem__(self, index):
        start = index
        end = start + self.seq_length + 1
        input_seq = torch.tensor([self.chars.index(c) for c in self.data[start:end-1]], dtype=torch.long)
        target_char = torch.tensor(self.chars.index(self.data[end-1]), dtype=torch.long)
        return input_seq, target_char

# Define the RNN model
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(SimpleRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)  # x shape: [batch_size, seq_len, embed_dim]
        x, _ = self.rnn(x)  # x shape: [batch_size, seq_len, hidden_dim]
        x = self.fc(x[:, -1, :])  # Use the last output, shape: [batch_size, output_dim]
        return x

# Instantiate the dataset and dataloader
dataset = CharDataset(1000, 10)
loader = DataLoader(dataset, batch_size=10, shuffle=True)

# Initialize the model
vocab_size = len(string.ascii_lowercase + " ")  # 27 characters
model = SimpleRNN(vocab_size, 10, 20, vocab_size)

# Set up training components
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Train the model
for epoch in range(10):  # Limited training for demonstration
    model.train()
    for inputs, targets in loader:
        optimizer.zero_grad()
        outputs = model(inputs)  # Outputs shape: [batch_size, vocab_size]
        loss = criterion(outputs, targets)  # Targets shape: [batch_size]
        loss.backward()
        optimizer.step()

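# Apply dynamic quantization
# Note: plain nn.RNN does not appear to be among the module types that dynamic
# quantization converts, so most likely only the nn.Linear layer is swapped here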
model.eval()
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.RNN, nn.Linear}, dtype=torch.qint8
)

# Evaluate the model
def evaluate_model(model, loader):
    correct, total = 0, 0
    model.eval()
    with torch.no_grad():
        for inputs, targets in loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == targets).sum().item()
            total += targets.size(0)
    return 100 * correct / total

accuracy_original = evaluate_model(model, loader)
accuracy_quantized = evaluate_model(quantized_model, loader)

print(f"Accuracy of the original model: {accuracy_original:.2f}%")
print(f"Accuracy of the quantized model: {accuracy_quantized:.2f}%")

Accuracy of the original model: 23.94%
Accuracy of the quantized model: 24.04%


## BERT-Based Transformer


In [7]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.quantization import quantize_dynamic

In [9]:
# Load pre-trained model and tokenizer
model_name = "bert-base-uncased"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Example input text
text = "Hello, this is an example to test BERT model performance."

In [11]:
# Encode text
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

In [12]:
# Perform inference before quantization
with torch.no_grad():
    original_output = model(**inputs)

In [13]:
# Apply dynamic quantization
quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

In [14]:
# Perform inference with quantized model
with torch.no_grad():
    quantized_output = quantized_model(**inputs)

In [15]:
# Compare the outputs (optional, for demonstration)
print("Original output:", original_output.logits)
print("Quantized output:", quantized_output.logits)

Original output: tensor([[ 0.1966, -0.3992]])
Quantized output: tensor([[ 0.1932, -0.2385]])


In [16]:
# Save the quantized model using PyTorch
torch.save(quantized_model.state_dict(), "./quantized_bert_model.pth")

In [17]:
# Load the quantized model
model_loaded = BertForSequenceClassification.from_pretrained(model_name)  # Load the original configuration
model_loaded = quantize_dynamic(model_loaded, {torch.nn.Linear}, dtype=torch.qint8)  # Re-apply quantization
model_loaded.load_state_dict(torch.load("./quantized_bert_model.pth"))

# Perform inference with the loaded quantized model
model_loaded.eval()
with torch.no_grad():
    output_loaded = model_loaded(**inputs)

print("Output from loaded quantized model:", output_loaded.logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Output from loaded quantized model: tensor([[ 0.1932, -0.2385]])


## GPT2

In [18]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import time

In [19]:
# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [20]:
# Sample input text
input_text = "The science of today is the technology of tomorrow."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

In [21]:
# Measure performance before quantization
start_time = time.time()
with torch.no_grad():
    original_outputs = model.generate(input_ids, max_length=50)
original_duration = time.time() - start_time
original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [22]:
# Apply dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

In [23]:
quantized_model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): DynamicQuantizedLinear(in_features=768, out_features=50257, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
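
In the printout above, only `lm_head` has become a `DynamicQuantizedLinear`; the `Conv1D` modules inside the attention and MLP blocks are unchanged, presumably because they are not `nn.Linear` instances. As a rough back-of-the-envelope sketch, the arithmetic below estimates the weight storage of that one layer before and after conversion. It counts weights only and ignores the small overhead of quantization parameters, which is a simplifying assumption.

```python
in_features, out_features = 768, 50257  # lm_head shape taken from the printout

num_weights = in_features * out_features
fp32_bytes = num_weights * 4  # 4 bytes per float32 weight
int8_bytes = num_weights * 1  # 1 byte per int8 weight (scale/zero-point overhead ignored)

print(f"lm_head weights: {num_weights:,}")
print(f"float32: {fp32_bytes / 2**20:.1f} MiB")
print(f"int8:    {int8_bytes / 2**20:.1f} MiB")
```

Since the transformer blocks are left in float32, the overall memory and speed gains for this model are limited to what the output projection contributes.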

In [24]:
# Measure performance after quantization
start_time = time.time()
with torch.no_grad():
    quantized_outputs = quantized_model.generate(input_ids, max_length=50)
quantized_duration = time.time() - start_time
quantized_text = tokenizer.decode(quantized_outputs[0], skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [25]:
# Output results
print("Original Text:", original_text)
print("Quantized Text:", quantized_text)
print(f"Time taken (original): {original_duration:.3f} seconds")
print(f"Time taken (quantized): {quantized_duration:.3f} seconds")

Original Text: The science of today is the technology of tomorrow. The technology of tomorrow is the technology of tomorrow.

The science of today is the technology of tomorrow.

The science of today is the technology of tomorrow.

The science of today
Quantized Text: The science of today is the technology of tomorrow. Newly, the "fiscal-policy-in-artificial-utility" and "sustainable" and "sustainable-in-fiscal-policy" and "sustainable-in
Time taken (original): 1.912 seconds
Time taken (quantized): 1.563 seconds
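
The two timings above each come from a single run, so they are sensitive to warm-up effects and background noise. A minimal sketch of a more robust measurement is shown below; the `benchmark` helper and the stand-in `workload` function are hypothetical, and in the notebook the callable could instead wrap a `generate` call.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5):
    """Return the median wall-clock seconds of fn() over `repeats` timed calls."""
    for _ in range(warmup):
        fn()  # untimed warm-up calls absorb one-off setup costs
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def workload():
    # Stand-in for the real work, e.g. a call to model.generate(...)
    return sum(i * i for i in range(10_000))


median_seconds = benchmark(workload)
print(f"Median time: {median_seconds:.6f} seconds")
```

Reporting the median over several runs makes a comparison between the original and quantized model less likely to hinge on one unusually slow or fast call.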


ExLlamaV2 does not integrate well with Hugging Face transformers, so the models must first be downloaded locally. Only the safetensors weights are needed: ExLlamaV2 reads safetensors, so the ".bin" files can be skipped.

In [36]:
!pip install -qqq bitsandbytes accelerate datasets

In [37]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
else:
    compute_dtype = torch.float16

model_name = "microsoft/Phi-3-mini-4k-instruct"
quant_path = 'Phi-3-mini-4k-instruct-bnb-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, trust_remote_code=True
)


model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`
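
Since the cell above did not run to completion, the following dependency-free sketch may help illustrate the general idea behind block-wise 4-bit quantization. It is not the NF4 scheme: it uses uniformly spaced levels and a tiny block size as simplifying assumptions, whereas `bitsandbytes` uses its own non-uniform codebook and block sizes. All function names here are hypothetical.

```python
def quantize_block_4bit(block):
    """Quantize one block of floats to signed codes in [-7, 7] with an absmax scale."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(v / absmax * 7) for v in block]
    return codes, absmax


def dequantize_block_4bit(codes, absmax):
    """Recover approximate floats from the codes and the block's scale."""
    return [c / 7 * absmax for c in codes]


def quantize_blockwise(values, block_size=4):
    """Split `values` into fixed-size blocks and quantize each block separately."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [quantize_block_4bit(b) for b in blocks]


# Illustrative weights: the first block is small in magnitude, the second is large.
weights = [0.02, -0.01, 0.03, 0.00, 1.20, -0.80, 0.40, 0.10]
packed = quantize_blockwise(weights)
restored = [v for codes, absmax in packed for v in dequantize_block_4bit(codes, absmax)]

for codes, absmax in packed:
    print("codes:", codes, "absmax:", absmax)
print("restored:", restored)
```

Giving each block its own scale keeps the small-magnitude weights in the first block from being rounded to zero by the large values in the second. The per-block scales are themselves extra data to store, which is the overhead that the `bnb_4bit_use_double_quant` option is meant to reduce by quantizing those constants as well.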

In [38]:
!pip install -qqq auto-gptq optimum

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch
model_path = 'microsoft/Phi-3-mini-4k-instruct'
w = 4  # Quantize to 4-bit; change to 2, 3, or 8 to quantize with another precision

quant_path = 'Phi-3-mini-4k-instruct-gptq-'+str(w)+'bit'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

quantized_model.save_pretrained("./"+quant_path, safe_serialization=True)
tokenizer.save_pretrained("./"+quant_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Token indices sequence length is longer than the specified maximum sequence length for this model (4130 > 4096). Running this sequence through the model will result in indexing errors


Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/4 [00:00<?, ?it/s]


