This notebook benchmarks the memory efficiency, inference speed, and downstream-task accuracy of Llama 2 7B, Llama 2 13B, and Mistral 7B, each quantized to 2-bit, 3-bit, 4-bit, and 8-bit precision with GPTQ.
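
As a rough point of reference for the memory measurements, the footprint of the weights alone can be estimated as parameters × bits / 8 bytes. The sketch below is a back-of-envelope calculation, not part of the benchmark. The parameter counts are nominal values assumed for illustration, and the estimate ignores GPTQ metadata (scales and zero points), layers that stay unquantized, activations, and the KV cache, so measured consumption will be higher.

```python
# Back-of-envelope estimate of weight memory at a given bit width.
# Parameter counts are nominal (assumed for illustration), not exact model sizes.
NOMINAL_PARAMS = {
    "Llama 2 7B": 7e9,
    "Llama 2 13B": 13e9,
    "Mistral 7B": 7e9,
}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Return the size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


for name, num_params in NOMINAL_PARAMS.items():
    # 16-bit is included as the unquantized float16 baseline.
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(num_params, bits):.2f} GB"
        for bits in (2, 3, 4, 8, 16)
    )
    print(f"{name}: {row}")
```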



We will need the following libraries to benchmark with optimum-benchmark:

In [None]:
!pip install --upgrade transformers auto-gptq accelerate datasets bitsandbytes
!python -m pip install git+https://github.com/huggingface/optimum.git
!python -m pip install git+https://github.com/huggingface/optimum-benchmark.git

Collecting transformers
  Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)
Collecting auto-gptq
  Downloading auto_gptq-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)

I use optimum-benchmark to track memory consumption. It also reports inference speed.

The example configuration file below uses Llama 2 13B. To benchmark the other models, replace "Llama-2-13b-hf" with "Llama-2-7b-hf" or "Mistral-7B-v0.1".

Results will be stored in experiments_ob/.

In [None]:
import os
for w in [2,3,4,8]:
  YAML_DEFAULT="""
  defaults:
    - backend: pytorch # default backend
    - benchmark: inference # we will monitor the inference
    - launcher: process
    - experiment # inheriting from experiment config
    - _self_ # for hydra 1.1 compatibility
    - override hydra/job_logging: colorlog # colorful logging
    - override hydra/hydra_logging: colorlog # colorful logging

  hydra:
    run:
      dir: experiments_ob/${experiment_name} #The results will be reported in this directory. Note that "experiment_name" refers to the configuration field name "experiment_name" below
    sweep:
      dir: experiments_ob/${experiment_name}
    job:
      chdir: true
      env_set: #These are environment variables that you may want to set before running the benchmark
        CUDA_VISIBLE_DEVICES: 0
        CUDA_DEVICE_ORDER: PCI_BUS_ID

  experiment_name: kaitchup/Llama-2-13b-hf-gptq-%sbit
  model: kaitchup/Llama-2-13b-hf-gptq-%sbit #The model that we want to evaluate. It can be from the Hugging Face Hub or local directory
  device: cuda #Which device to use for the benchmark. We will use CUDA, i.e., the GPU

  backend:
    torch_dtype: float16

  benchmark:
    memory: true #We will monitor the memory usage
    warmup_runs: 10 #Before the monitoring starts, the inference will be run 10 times for warming up

    new_tokens: 1000 #Inference will generate 1000 tokens
    input_shapes:
      sequence_length: 512 #Prompt will have 512 tokens
      batch_size: 2
  """ % (str(w), str(w))

  with open("llama2_13b_ob.yaml", 'w') as f:
    f.write(YAML_DEFAULT)
  os.system("optimum-benchmark --config-dir ./ --config-name llama2_13b_ob")
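
To cover all three models without editing the file by hand, the model name can become a second loop variable. The following is a minimal sketch, not the code used to produce the results: `TEMPLATE` stands in for the full YAML above (only the two lines that change are shown), and `render_config` and `config_name` are hypothetical helpers. Placeholders are filled in with `str.replace` so that Hydra's `${experiment_name}` interpolation in the full configuration would be left untouched.

```python
# Sketch: render one config per (model, bit width) pair.
# TEMPLATE is abbreviated; in practice it would hold the full YAML shown above,
# with __MODEL__ and __BITS__ in place of the model name and bit width.
TEMPLATE = """
experiment_name: kaitchup/__MODEL__-gptq-__BITS__bit
model: kaitchup/__MODEL__-gptq-__BITS__bit
"""

MODELS = ["Llama-2-7b-hf", "Llama-2-13b-hf", "Mistral-7B-v0.1"]
BIT_WIDTHS = [2, 3, 4, 8]


def render_config(model_name: str, bits: int) -> str:
    """Fill the two placeholders of the template for one model and bit width."""
    return TEMPLATE.replace("__MODEL__", model_name).replace("__BITS__", str(bits))


def config_name(model_name: str, bits: int) -> str:
    """Build a file stem for the config (hypothetical naming scheme).

    Dots are replaced as a cautious choice, not a documented requirement.
    """
    return f"{model_name}_{bits}bit_ob".replace(".", "_")


# One rendered config per combination: 3 models x 4 bit widths.
configs = {
    config_name(model, bits): render_config(model, bits)
    for model in MODELS
    for bits in BIT_WIDTHS
}

for name in configs:
    print(name)
```

Each rendered config can then be written to `<name>.yaml` and passed to `optimum-benchmark --config-dir ./ --config-name <name>`, as in the cell above.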

We will need the following libraries to benchmark with the Evaluation Harness:

In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
!pip install bitsandbytes
!pip install --upgrade transformers
!pip install auto-gptq optimum

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-0_4_vubf
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-0_4_vubf
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 45a8f709cd3fa903a0e2ff7275694d441bcf0cac
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate>=0.21.0 (from lm_eval==0.4.0)
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting evaluate (from lm_eval==0.4.0)
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)

Here is an example of benchmarking on the Winogrande, HellaSwag, and ARC Challenge tasks with 5-shot prompting. The results are stored in eval_harness/.

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Mistral-7B-v0.1-gptq-4bit --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 5 --batch_size 2 --output_path ./eval_harness/Mistral-7B-v0.1-gptq-4bit

2024-01-23:13:55:37,530 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
2024-01-23:13:55:37,791 INFO     [config.py:58] PyTorch version 2.1.0+cu121 available.
2024-01-23:13:55:37,792 INFO     [config.py:95] TensorFlow version 2.15.0 available.
2024-01-23:13:55:37,793 INFO     [config.py:108] JAX version 0.4.23 available.
2024-01-23 13:55:38.458320: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-23 13:55:38.458379: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-23 13:55:38.459763: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 
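
The same command can be repeated for every model and bit width. The sketch below only builds the argument lists, using the flags from the command above. `build_lm_eval_command` is a hypothetical helper, and the repository names for the other models and bit widths are assumed to follow the `kaitchup/<model>-gptq-<w>bit` pattern used earlier in this notebook.

```python
import shlex

MODELS = ["Llama-2-7b-hf", "Llama-2-13b-hf", "Mistral-7B-v0.1"]
BIT_WIDTHS = [2, 3, 4, 8]
TASKS = "winogrande,hellaswag,arc_challenge"


def build_lm_eval_command(model_name: str, bits: int) -> list[str]:
    """Return the lm_eval argument list for one quantized model."""
    quantized = f"{model_name}-gptq-{bits}bit"
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained=kaitchup/{quantized}",
        "--tasks", TASKS,
        "--device", "cuda:0",
        "--num_fewshot", "5",
        "--batch_size", "2",
        "--output_path", f"./eval_harness/{quantized}",
    ]


commands = [build_lm_eval_command(m, w) for m in MODELS for w in BIT_WIDTHS]

for cmd in commands:
    # Print only; to execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later handed to `subprocess.run`.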

# Appendix: Quantize with GPTQ

I used the following code to quantize Llama 2 7B, Llama 2 13B, and Mistral-7B-v0.1 to 8-bit, 4-bit, 3-bit, and 2-bit precision.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch
model_path = 'mistralai/Mistral-7B-v0.1'

for w in [2,3,4,8]:
  quant_path = 'Mistral-7B-v0.1-gptq-'+str(w)+'bit'

  # Load the tokenizer and a fresh FP16 copy of the model for this bit width
  tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
  model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
  # Quantize with GPTQ, using C4 as the calibration dataset
  quantizer = GPTQQuantizer(bits=w, dataset="c4", model_seqlen=4096)
  quantized_model = quantizer.quantize_model(model, tokenizer)
  # Push the quantized model and its tokenizer to the HF Hub
  quantized_model.push_to_hub("kaitchup/"+quant_path)
  tokenizer.push_to_hub("kaitchup/"+quant_path)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

