# Tutorial 6: Mixed Precision Quantization Search with Mase and Optuna

In this tutorial, we'll see how Mase can be integrated with Optuna for searches beyond NAS, such as finding the best mixed-precision quantization configuration.

As we'll see, this is very similar to running NAS and involves the same three steps: **Define the search space**, **Write the model constructor** and **Write the objective function**.

## The difference caused by different precisions

We start by importing the necessary libraries and specifying the model checkpoint, tokenizer checkpoint and dataset. We'll use the same model as in the previous tutorial.

We also define a helper function, `get_accuracy`, that evaluates a model's accuracy on the dataset. We'll reuse it later in the objective function.

In [7]:
import chop.passes as passes

from chop import MaseGraph

from chop.tools import get_tokenized_dataset, get_trainer
from transformers import AutoModelForSequenceClassification
from pathlib import Path

import copy


checkpoint = "prajjwal1/bert-tiny"
tokenizer_checkpoint = "bert-base-uncased"
dataset_name = "imdb"


def get_accuracy(mg, dataset, tokenizer):
    t = get_trainer(
        model=mg.model,
        tokenized_dataset=dataset,
        tokenizer=tokenizer,
        evaluate_metric="accuracy")
    e = t.evaluate()
    return e["eval_accuracy"]

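We load the pretrained model and convert it to the MASE graph format, then replace that graph with the one stored in the `tutorial_2_lora` checkpoint. We also instantiate the dataset and tokenizer and define two quantization configurations for the linear layers: a high-precision one (16-bit width, 10 fractional bits) and a low-precision one (4-bit width, 2 fractional bits).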

In [8]:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.config.problem_type = "single_label_classification"

mg = MaseGraph(
    model,
    hf_input_names=[
        "input_ids",
        "attention_mask",
        "labels",
    ],
)

mg, _ = passes.init_metadata_analysis_pass(mg)
mg, _ = passes.add_common_metadata_analysis_pass(mg)
mg = MaseGraph.from_checkpoint(f"{Path.home()}/tutorial_2_lora")

dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)
quantization_config = {
    "by": "type",
    "default": {
        "config": {
            "name": None,
        }
    },
    "linear": {
        "config": {
            "name": "integer",
            # data
            "data_in_width": 16,
            "data_in_frac_width": 10,
            # weight
            "weight_width": 16,
            "weight_frac_width": 10,
            # bias
            "bias_width": 16,
            "bias_frac_width": 10,
        }
    },
}
# test with two different precision setups
quantization_config_low = copy.deepcopy(quantization_config)
quantization_config_high = copy.deepcopy(quantization_config)

quantization_config_low["linear"]["config"]["data_in_width"] = 4
quantization_config_low["linear"]["config"]["data_in_frac_width"] = 2
quantization_config_low["linear"]["config"]["weight_width"] = 4
quantization_config_low["linear"]["config"]["weight_frac_width"] = 2
quantization_config_low["linear"]["config"]["bias_width"] = 4
quantization_config_low["linear"]["config"]["bias_frac_width"] = 2

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
`past_key_values` were not specified as input names, but model.config.use_cache = True. Setting model.config.use_cache = False.
INFO     Getting dummy input for prajjwal1/bert-tiny.
  loaded_model = torch.load(f)
INFO     Tokenizing dataset imdb with AutoTokenizer for bert-base-uncased.


Using the latest cached version of the dataset since imdb couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'plain_text' at /Users/yz10513/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31 (last modified on Sun Dec  1 05:38:09 2024).


We apply the quantization transform with the two precision configurations, each on its own copy of the same model.

In [9]:
copy_mg = copy.deepcopy(mg)
mg_high, _ = passes.quantize_transform_pass(
    copy_mg,
    pass_args=quantization_config_high,
)

copy_mg = copy.deepcopy(mg)
mg_low, _ = passes.quantize_transform_pass(
    copy_mg,
    pass_args=quantization_config_low,
)

dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)

accuracy = get_accuracy(mg, dataset, tokenizer)
print(f"Original Accuracy: {accuracy}")


accuracy_high = get_accuracy(mg_high, dataset, tokenizer)
accuracy_low = get_accuracy(mg_low, dataset, tokenizer)
print(f"Accuracy high: {accuracy_high}")
print(f"Accuracy low: {accuracy_low}")

INFO     Tokenizing dataset imdb with AutoTokenizer for bert-base-uncased.
Using the latest cached version of the dataset since imdb couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'plain_text' at /Users/yz10513/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31 (last modified on Sun Dec  1 05:38:09 2024).


[2025-01-02 10:49:38,098] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to mps (auto detect)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
W0102 10:49:39.030000 25837 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
100%|██████████| 3125/3125 [01:22<00:00, 38.11it/s]


Original Accuracy: 0.8218


100%|██████████| 3125/3125 [01:21<00:00, 38.29it/s]
100%|██████████| 3125/3125 [01:18<00:00, 40.05it/s]

Accuracy high: 0.82176
Accuracy low: 0.5





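Comparing the results shows how strongly precision affects accuracy: the 16-bit configuration (0.82176) essentially matches the original model (0.8218), whereas the 4-bit configuration drops to 0.5, which is chance level for the binary IMDb sentiment task.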
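To build intuition for why the 4-bit setup loses so much, here is a minimal sketch of fixed-point rounding. It assumes the `integer` format behaves like a signed fixed-point number with `width` total bits and `frac_width` fractional bits, using round-to-nearest and saturation. The function `quantize_fixed_point` is a hypothetical helper written for illustration, not Mase's quantizer, whose rounding details may differ.

```python
def quantize_fixed_point(x: float, width: int, frac_width: int) -> float:
    """Illustrative signed fixed-point quantizer (assumed behaviour, not Mase's code)."""
    scale = 2 ** frac_width            # number of grid steps per unit
    q_min = -(2 ** (width - 1))        # smallest representable integer code
    q_max = 2 ** (width - 1) - 1       # largest representable integer code
    q = round(x * scale)               # snap to the nearest grid point
    q = max(q_min, min(q_max, q))      # saturate values that fall outside the range
    return q / scale


for value in (0.1, 3.7, -5.0):
    high = quantize_fixed_point(value, width=16, frac_width=10)
    low = quantize_fixed_point(value, width=4, frac_width=2)
    print(f"{value:>5}: (16, 10) -> {high}, (4, 2) -> {low}")
```

Under these assumptions, a (4, 2) format can only represent multiples of 0.25 between -2.0 and 1.75, so small values collapse to zero and larger ones saturate, while (16, 10) resolves steps of 1/1024 over roughly ±32.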

## Mixed Precision Search

In the example above, the weights, activations and biases of every linear layer are all quantized to the same format. We can instead give each of them its own width and fractional width. This is called mixed-precision quantization.
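As a small illustration, the sketch below builds one mixed-precision configuration for the linear layers by hand. The helper `make_mixed_config` is hypothetical, and the chosen (width, frac_width) pairs are arbitrary examples rather than recommended settings. The key names follow the `quantization_config` used earlier in this tutorial.

```python
import copy

# Uniform 16-bit settings, mirroring the "linear" entry of quantization_config.
base_linear_config = {
    "name": "integer",
    "data_in_width": 16,
    "data_in_frac_width": 10,
    "weight_width": 16,
    "weight_frac_width": 10,
    "bias_width": 16,
    "bias_frac_width": 10,
}


def make_mixed_config(base, data_in, weight, bias):
    """Hypothetical helper: return a copy of `base` with per-tensor precisions."""
    config = copy.deepcopy(base)
    choices = {"data_in": data_in, "weight": weight, "bias": bias}
    for name, (width, frac_width) in choices.items():
        config[f"{name}_width"] = width
        config[f"{name}_frac_width"] = frac_width
    return config


# Arbitrary example: 8-bit activations, 6-bit weights, 16-bit biases.
mixed = make_mixed_config(
    base_linear_config, data_in=(8, 4), weight=(6, 4), bias=(16, 10)
)
print(mixed)
```

The resulting dictionary could be placed under `["linear"]["config"]` of a copy of `quantization_config` and passed to the quantization pass in the same way as the uniform configurations.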

The difficulty is that we rarely know in advance which precision configuration is best for a given model, so we use a search tool, here Optuna, to find one.

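In the code below, we define a search space that lets the weights, biases and activations of the linear layers take different precisions within the integer format. The model constructor samples one (width, frac_width) pair for each of them, and the objective function returns the accuracy of the resulting quantized model.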

In [5]:
search_space = {
    # each entry is a (width, frac_width) pair
    "data_in": [
        (4, 2), (4, 3),
        (6, 2), (6, 4), 
        (8, 2), (8, 4), (8, 6), 
        (16, 2), (16, 4), (16, 6), (16, 8), (16, 10), (16, 12)],
    "weight": [
        (4, 2), (4, 3),
        (6, 2), (6, 4), 
        (8, 2), (8, 4), (8, 6), 
        (16, 2), (16, 4), (16, 6), (16, 8), (16, 10), (16, 12)],
    "bias": [
        (4, 2), (4, 3),
        (6, 2), (6, 4), 
        (8, 2), (8, 4), (8, 6), 
        (16, 2), (16, 4), (16, 6), (16, 8), (16, 10), (16, 12)],
}


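def construct_model(trial):
    # Start from the base config so that every trial is independent
    config = copy.deepcopy(quantization_config)
    for param in ["data_in", "weight", "bias"]:
        # Sample an index into the candidate list, then look up the (width, frac_width) pair
        chosen_idx = trial.suggest_int(param, 0, len(search_space[param]) - 1)
        width, frac_width = search_space[param][chosen_idx]
        config["linear"]["config"][f"{param}_width"] = width
        config["linear"]["config"][f"{param}_frac_width"] = frac_width
    # Quantize a copy so the original graph stays untouched across trials
    ori_mg = copy.deepcopy(mg)
    mg_q, _ = passes.quantize_transform_pass(
        ori_mg,
        pass_args=config,
    )
    return mg_q


def objective(trial):
    # Optuna maximizes this value: the accuracy of the quantized model
    mg_q = construct_model(trial)
    return get_accuracy(mg_q, dataset, tokenizer)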


import optuna
from optuna.samplers import GridSampler, RandomSampler, TPESampler

sampler = RandomSampler()

study = optuna.create_study(
    direction="maximize",
    study_name="bert-tiny-mixed-q-study",
    sampler=sampler,
)

study.optimize(
    objective,
    n_trials=10,
    timeout=60 * 60 * 24,
)

[I 2025-01-02 11:45:27,961] A new study created in memory with name: bert-tiny-mixed-q-study
100%|██████████| 3125/3125 [01:15<00:00, 41.15it/s]
[I 2025-01-02 11:46:47,202] Trial 0 finished with value: 0.73568 and parameters: {'data_in': 1, 'weight': 6, 'bias': 3}. Best is trial 0 with value: 0.73568.
 20%|██        | 633/3125 [00:18<01:11, 34.68it/s]
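Because the trials sample indices into the candidate lists, the parameters Optuna reports (for example `{'data_in': 1, 'weight': 6, 'bias': 3}` for trial 0 above) need to be mapped back to (width, frac_width) pairs. The following sketch shows one way to do that. The helper `decode_trial_params` is hypothetical, and the search space is redefined here only so that the snippet runs on its own.

```python
import math

# Same 13 candidates per tensor type as in the search space above.
candidates = [
    (4, 2), (4, 3),
    (6, 2), (6, 4),
    (8, 2), (8, 4), (8, 6),
    (16, 2), (16, 4), (16, 6), (16, 8), (16, 10), (16, 12),
]
search_space = {name: list(candidates) for name in ("data_in", "weight", "bias")}


def decode_trial_params(params, space):
    """Hypothetical helper: turn sampled indices into (width, frac_width) pairs."""
    return {name: space[name][idx] for name, idx in params.items()}


# Parameters reported for trial 0 in the log above.
trial_0_params = {"data_in": 1, "weight": 6, "bias": 3}
decoded = decode_trial_params(trial_0_params, search_space)
print(decoded)

# Total number of distinct configurations in this search space.
n_configs = math.prod(len(options) for options in search_space.values())
print(n_configs)
```

Once the study has finished, `study.best_params` can be passed to the same helper. The space holds 13 × 13 × 13 = 2197 configurations, and at roughly 80 seconds per evaluation in the logs above an exhaustive sweep would take about two days, which is why only a handful of configurations are sampled here.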