# Running Mixtral-8x7B on a single A100 80GB

This notebook has three sections:
- [(Optional) Train the quantization parameters of Mixtral-8x7B yourself](#optional-train-the-quantization-parameters-of-mixtral-8x7b-yourself)
- [Download the pre-quantized models](#download-the-pre-quantized-models)
- [Let's Infer!](#lets-infer)

## (Optional) Train the quantization parameters of Mixtral-8x7B yourself

This section shows how to train the quantization parameters of Mixtral-8x7B yourself. You can skip it, because pre-quantized models are provided in [Download the pre-quantized models](#download-the-pre-quantized-models).

In [None]:
!CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/Mixtral-8x7B-v0.1 \
--epochs 10 --output_dir ./log/mixtral-8x7b_w4a16g128 \
--wbits 4 --abits 16 --group_size 128 --lwc  \
--nsamples 128 --net mixtral-8x7b \
--real_quant --eval_ppl \
--save_dir ./checkpoint/mixtral-8x7b_w4a16g128
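
The `w4a16g128` setting above means 4-bit weights, 16-bit activations, and one set of quantization parameters per group of 128 weights. As a rough illustration of what group-wise weight quantization does, here is a minimal pure-Python sketch. It uses plain round-to-nearest with min/max ranges, which is an assumption made for illustration; it does not reproduce the learned clipping that `--lwc` trains, and the function names are hypothetical.

```python
import math

def quantize_group(weights, bits=4):
    """Asymmetric uniform quantization of one group (round-to-nearest sketch)."""
    qmax = 2 ** bits - 1
    # Include 0 in the range so the zero point fits in [0, qmax].
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]

def quantize_row(row, bits=4, group_size=128):
    """Quantize one weight row group by group; each group gets its own scale and zero point."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]

# Toy weight row with 256 entries, so it splits into two groups of 128.
row = [math.sin(i) for i in range(256)]
groups = quantize_row(row, bits=4, group_size=128)
codes, scale, zero = groups[0]
restored = dequantize_group(codes, scale, zero)
max_err = max(abs(a - b) for a, b in zip(row[:128], restored))
print(f"groups: {len(groups)}, scale: {scale:.4f}, zero: {zero}, max error: {max_err:.4f}")
```

Smaller groups give each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales and zero points.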

## Download the pre-quantized models

We provide the pre-quantized models on Hugging Face. Downloading the large weight files requires `git lfs`.

In [None]:
!conda install -y git git-lfs
!git lfs install

In [None]:
!mkdir -p pre_quantized_models/

# Download Mixtral-8x7B-v0.1 with w4a16g128 quantization
!git clone https://huggingface.co/ChenMnZ/Mixtral-8x7B-v0.1-OmniQuantv1-w4a16g128 ./pre_quantized_models/Mixtral-8x7B-v0.1-OmniQuantv1-w4a16g128

# Download Mixtral-8x7B-Instruct-v0.1 with w4a16g128 quantization (uncomment to use)
# !git clone https://huggingface.co/ChenMnZ/Mixtral-8x7B-Instruct-v0.1-OmniQuantv1-w4a16g128 ./pre_quantized_models/Mixtral-8x7B-Instruct-v0.1-OmniQuantv1-w4a16g128
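
If `git lfs` was not set up when the repository was cloned, the weight files on disk are small text pointers rather than the real data, and loading fails later. The sketch below is one possible way to check for this. It relies on the fact that a git-lfs pointer file begins with the line `version https://git-lfs.github.com/spec/v1`; the helper names are hypothetical and not part of any library.

```python
import os

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """Return True if the file starts with the git-lfs pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

def find_unfetched_files(model_dir):
    """List files under model_dir that are still git-lfs pointers."""
    unfetched = []
    for root, dirs, files in os.walk(model_dir):
        # Skip git's own metadata directory.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            path = os.path.join(root, name)
            if is_lfs_pointer(path):
                unfetched.append(path)
    return sorted(unfetched)

model_dir = "./pre_quantized_models/Mixtral-8x7B-v0.1-OmniQuantv1-w4a16g128"
if os.path.isdir(model_dir):
    pointers = find_unfetched_files(model_dir)
    print("unfetched files:", pointers if pointers else "none")
```

If any files are listed, running `git lfs pull` inside the cloned directory should fetch the actual weights.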

## Let's Infer!

Restrict the process to a single GPU.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
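
To see why 4-bit weights matter for fitting on one 80 GB card, the back-of-the-envelope estimate below compares fp16 and w4g128 weight storage. It is only a sketch under stated assumptions: the parameter count of roughly 46.7 billion is the publicly reported total for Mixtral-8x7B, every weight is treated as quantized (the notebook actually leaves the router gates unquantized), each group is assumed to store one 16-bit scale and one 4-bit zero point, and activations and the KV cache are ignored.

```python
GIB = 1024 ** 3

def quantized_weight_gib(n_params, wbits, group_size, scale_bits=16):
    """Estimate weight storage in GiB for group-wise quantization.

    Assumed layout: each group of `group_size` weights stores one scale
    (`scale_bits` wide) and one zero point (`wbits` wide).
    """
    overhead_bits_per_weight = (scale_bits + wbits) / group_size
    total_bits = n_params * (wbits + overhead_bits_per_weight)
    return total_bits / 8 / GIB

# Assumption: approximate total parameter count of Mixtral-8x7B.
N_PARAMS = 46.7e9

fp16_gib = N_PARAMS * 16 / 8 / GIB
w4_gib = quantized_weight_gib(N_PARAMS, wbits=4, group_size=128)

print(f"fp16 weights:   {fp16_gib:.1f} GiB")
print(f"w4g128 weights: {w4_gib:.1f} GiB")
```

Under these assumptions the fp16 weights alone exceed 80 GiB, while the 4-bit weights leave most of the card free for activations and the KV cache.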

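Load the model: build the model skeleton without allocating weights, replace every linear layer except the router gates with a quantized linear layer, then load the pre-quantized checkpoint.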

In [None]:
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
import auto_gptq.nn_modules.qlinear.qlinear_cuda as qlinear_cuda
import auto_gptq.nn_modules.qlinear.qlinear_triton as qlinear_triton
from tqdm import tqdm
import gc
import time

def get_named_linears(module):
    return {name: m for name, m in module.named_modules() if isinstance(m, torch.nn.Linear) and 'gate' not in name}

def set_op_by_name(layer, name, new_module):
    levels = name.split('.')
    if len(levels) > 1:
        mod_ = layer
        for l_idx in range(len(levels)-1):
            if levels[l_idx].isdigit():
                mod_ = mod_[int(levels[l_idx])]
            else:
                mod_ = getattr(mod_, levels[l_idx])
        setattr(mod_, levels[-1], new_module)
    else:
        setattr(layer, name, new_module)

# Set model_path, weight bit width, and group size to match the checkpoint you downloaded.
model_path = './pre_quantized_models/Mixtral-8x7B-v0.1-OmniQuantv1-w4a16g128'
wbits = 4
group_size = 128



config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
enc = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config=config,torch_dtype=torch.float16, trust_remote_code=True)

layers = model.model.layers
for i in tqdm(range(len(layers))):
    layer = layers[i]
    named_linears = get_named_linears(layer)
    for name, module in named_linears.items():
        # 2- and 4-bit layers use the Triton kernel; 3-bit layers use the CUDA kernel.
        if wbits in [2,4]:
            q_linear = qlinear_triton.QuantLinear(wbits, group_size, module.in_features,module.out_features,module.bias is not None,kernel_switch_threshold=128)
        elif wbits == 3:
            q_linear = qlinear_cuda.QuantLinear(wbits, group_size, module.in_features,module.out_features,module.bias is not None,kernel_switch_threshold=128)
        else:
            raise NotImplementedError("Only 2,3,4 bits are supported.")
        q_linear.to(next(layer.parameters()).device)
        set_op_by_name(layer, name, q_linear)
torch.cuda.empty_cache()
gc.collect()
model.tie_weights()
device_map = infer_auto_device_map(model)
print("Loading pre-computed quantized weights...")
load_checkpoint_in_model(model,checkpoint=model_path,device_map=device_map,offload_state_dict=True)
print("Loaded pre-computed quantized weights successfully.")
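
The layer replacement in the cell above relies on `set_op_by_name` resolving a dotted name such as `block_sparse_moe.experts.1.w1`, where numeric parts index into a list of submodules. The toy example below illustrates that behavior with plain Python objects instead of `torch` modules. The helper is a condensed rewrite with the same logic, and the attribute names and string placeholders are illustrative stand-ins rather than real layers.

```python
from types import SimpleNamespace

def set_op_by_name(layer, name, new_module):
    """Replace the attribute addressed by a dotted name; digits index into sequences."""
    levels = name.split('.')
    parent = layer
    for level in levels[:-1]:
        parent = parent[int(level)] if level.isdigit() else getattr(parent, level)
    setattr(parent, levels[-1], new_module)

# Toy stand-in for one decoder layer (strings play the role of modules).
layer = SimpleNamespace(
    self_attn=SimpleNamespace(q_proj="fp16-linear"),
    block_sparse_moe=SimpleNamespace(
        gate="fp16-router",
        experts=[SimpleNamespace(w1="fp16-linear") for _ in range(2)],
    ),
)

set_op_by_name(layer, "self_attn.q_proj", "quant-linear")
set_op_by_name(layer, "block_sparse_moe.experts.1.w1", "quant-linear")

print(layer.self_attn.q_proj)
print([expert.w1 for expert in layer.block_sparse_moe.experts])
print(layer.block_sparse_moe.gate)
```

Only the addressed attributes change; the router gate and the untouched expert keep their original values, which mirrors how the notebook skips any layer whose name contains `gate`.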


Start inference.

In [None]:
model.eval()
prompt = "Hello my name is"
input_ids = enc(prompt, return_tensors='pt').input_ids.cuda()
model = model.cuda()
start_time = time.time()
output = model.generate(inputs=input_ids, max_new_tokens=128)
end_time = time.time()
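# Count only newly generated tokens; output[0] also contains the prompt tokens.
new_tokens = output.shape[1] - input_ids.shape[1]
speed = new_tokens/(end_time-start_time)
print(enc.decode(output[0], skip_special_tokens=True))
print(f"speed: {speed:.2f} tokens/s, max memory: {torch.cuda.max_memory_allocated(model.device)/ 1024**2:.2f} MiB")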