### Running the LLaMA-2-13B W2A16 quantized model

#### Download the prebuilt quantized model
We provide the prebuilt quantized model on Hugging Face. Because the weight files are large, downloading them requires Git LFS.

In [None]:
!git lfs install

# Download the LLaMA-2-13B W2A16 quantized model
!git clone https://huggingface.co/FRM-PTQ/llama-2-13b-w2a16-frm-ptq

In [1]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "5"

In [2]:
from accelerate import infer_auto_device_map, dispatch_model
import torch
from datautils import get_loaders, test_ppl

@torch.no_grad()
def evaluate(model, tokenizer):
    '''
    Note: evaluation simply moves the model to a single GPU.
    Therefore, to evaluate a large model such as Llama-2-70B on a single A100-80GB,
    please enable '--real_quant'.
    '''
    # Keep each decoder block on one device when building the device map.
    block_class_name = model.model.layers[0].__class__.__name__
    device_map = infer_auto_device_map(
        model,
        max_memory={i: '40GB' for i in range(torch.cuda.device_count())},
        no_split_module_classes=[block_class_name],
    )
    model = dispatch_model(model, device_map=device_map)

    # Evaluate perplexity on C4 and WikiText-2.
    datasets = ["c4", "wikitext2"]
    ppl_results = test_ppl(model, tokenizer, datasets, 2048)
    for dataset in ppl_results:
        print(f'{dataset} perplexity: {ppl_results[dataset]:.2f}')



In [None]:
from quantize.int_linear_real import load_quantized_model
from accelerate import infer_auto_device_map, dispatch_model
import torch

model_path = './llama-2-13b-w2a16-frm-ptq'
wbits = 2
abits = 16
group_size = 128
use_act_quant = False
sensitive_group = [0, 39, 3, 1, 28, 2, 38, 24]
robust_group = [27, 31, 29, 30, 34, 35]
model, tokenizer = load_quantized_model(
    model_path=model_path,
    wbits=wbits,
    abits=abits,
    group_size=group_size,
    use_act_quant=use_act_quant,
    sensitive_group=sensitive_group,
    robust_group=robust_group,
)
model = model.cuda()
# Test PPL
evaluate(model, tokenizer)

Loading quantized model from ./llama-2-13b-w2a16-frm-ptq


100%|██████████| 40/40 [00:00<00:00, 93.39it/s]


Loading pre-computed quantized weights...
Loading pre-computed quantized weights Successfully
get_c4


Generating train split: 356317 examples [00:01, 179607.44 examples/s]
Generating validation split: 45576 examples [00:00, 125660.30 examples/s]
100%|██████████| 256/256 [02:00<00:00,  2.13it/s]


c4:9.418624877929688
get_wikitext2


Downloading readme: 100%|██████████| 10.5k/10.5k [00:00<00:00, 8.99MB/s]
Downloading data: 100%|██████████| 733k/733k [00:00<00:00, 817kB/s]
Downloading data: 100%|██████████| 6.36M/6.36M [00:01<00:00, 6.18MB/s]
Downloading data: 100%|██████████| 657k/657k [00:00<00:00, 923kB/s]
Generating test split: 100%|██████████| 4358/4358 [00:00<00:00, 146146.03 examples/s]
Generating train split: 100%|██████████| 36718/36718 [00:00<00:00, 204676.88 examples/s]
Generating validation split: 100%|██████████| 3760/3760 [00:00<00:00, 194482.46 examples/s]
100%|██████████| 166/166 [01:15<00:00,  2.20it/s]

wikitext2:7.127840995788574
c4 perplexity: 9.42
wikitext2 perplexity: 7.13
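
The perplexity figures above are conventionally the exponential of the mean per-token negative log-likelihood (NLL), accumulated over fixed-length windows of the evaluation text. The progress bars suggest 256 windows for c4 and 166 for wikitext2, and the `2048` passed to `test_ppl` is presumably the window length. The sketch below illustrates that convention only; it assumes `test_ppl` follows it, and the `perplexity` helper and its inputs are hypothetical, not part of `datautils`.

```python
import math


def perplexity(window_nlls, seqlen):
    """Perplexity from per-window summed NLLs (hypothetical helper).

    window_nlls: one float per window, the NLL in nats summed over
                 the `seqlen` tokens of that window.
    seqlen:      number of tokens per window (assumed to be 2048 above).
    """
    total_tokens = len(window_nlls) * seqlen
    # Mean NLL per token, then exponentiate.
    return math.exp(sum(window_nlls) / total_tokens)


# Sanity check: a model that assigns probability 1/4 to every token
# has a per-token NLL of ln(4), so its perplexity is 4.
uniform_nll = math.log(4) * 8  # summed over a toy window of 8 tokens
print(perplexity([uniform_nll, uniform_nll], seqlen=8))
```

Read this way, a wikitext2 perplexity of 7.13 corresponds to a mean per-token NLL of about ln(7.13) ≈ 1.96 nats.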





In [4]:
# Zero-shot evaluation
import lm_eval
from lm_eval.models.huggingface import HFLM
from lm_eval.utils import make_table
eval_tasks = 'piqa,arc_easy,arc_challenge,hellaswag,boolq,winogrande,mmlu'
task_list = eval_tasks.split(',')
model = HFLM(pretrained=model, batch_size=8)
task_manager = lm_eval.tasks.TaskManager()
results = lm_eval.simple_evaluate(
    model=model,
    tasks=task_list,
    num_fewshot=0,
    task_manager=task_manager,
)
print(make_table(results))
total_acc = 0
for task in task_list:
    total_acc += results['results'][task]['acc,none']
print(f'Average Acc: {total_acc/len(task_list)*100:.2f}%')

Downloading builder script: 100%|██████████| 5.67k/5.67k [00:00<00:00, 4.99MB/s]
2024-12-15:17:53:04,118 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Downloading builder script: 100%|██████████| 5.36k/5.36k [00:00<00:00, 9.27MB/s]
Downloading readme: 100%|██████████| 8.41k/8.41k [00:00<00:00, 8.18MB/s]
Downloading data: 100%|██████████| 1.82M/1.82M [00:00<00:00, 9.19MB/s]
Downloading data: 100%|██████████| 815k/815k [00:00<00:00, 4.45MB/s]
Generating train split: 100%|██████████| 16113/16113 [00:00<00:00, 48941.64 examples/s]
Generating test split: 100%|██████████| 3084/3084 [00:00<00:00, 52533.35 examples/s]
Generating validation split: 100%|██████████| 1838/1838 [00:00<00:00, 50027.13 examples/s]

|                 Tasks                 |Version|Filter|n-shot| Metric |Value |   |Stderr|
|---------------------------------------|-------|------|-----:|--------|-----:|---|-----:|
|mmlu                                   |N/A    |none  |     0|acc     |0.3586|±  |0.0040|
| - humanities                          |N/A    |none  |     0|acc     |0.3445|±  |0.0068|
|  - formal_logic                       |      0|none  |     0|acc     |0.3016|±  |0.0410|
|  - high_school_european_history       |      0|none  |     0|acc     |0.4364|±  |0.0387|
|  - high_school_us_history             |      0|none  |     0|acc     |0.4118|±  |0.0345|
|  - high_school_world_history          |      0|none  |     0|acc     |0.4557|±  |0.0324|
|  - international_law                  |      0|none  |     0|acc     |0.5041|±  |0.0456|
|  - jurisprudence                      |      0|none  |     0|acc     |0.4444|±  |0.0480|
|  - logical_fallacies                  |      0|none  |     0|acc     |0.3620|±  |0.0378|