# Running Falcon-180B on a single A100 80GB

## Download the pre-quantized models

First, let's download the pre-quantized model weights from Hugging Face. The weight files are large, so the repository has to be cloned with `git lfs`.

In [None]:
!conda install -y git git-lfs
!git lfs install

Download the prebuilt quantized model:

In [None]:
!mkdir -p pre_quantized_models/
# Download Falcon-180B quantized with OmniQuant w3a16g512
# (3-bit weights, 16-bit activations, group size 512)
!git clone https://huggingface.co/ChenMnZ/falcon-180b-omniquant-w3a16g512 ./pre_quantized_models/falcon-180b-omniquant-w3a16g512
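
To see why this quantization setting matters for an 80GB card, here is a rough back-of-the-envelope estimate of the weight footprint. It is only a sketch: the parameter count is the nominal 180B from the model name, and the assumed storage layout (one fp16 scale and one 3-bit zero point per group of 512 weights) is an illustration, not a description of the checkpoint's exact file format. Unquantized parts such as embeddings, as well as activations and the KV cache, are ignored.

```python
def quantized_weight_gb(num_params, wbits, group_size, scale_bits=16, zero_bits=None):
    """Approximate storage for group-quantized weights, in GB (10^9 bytes).

    Assumption: each group of `group_size` weights shares one scale
    (`scale_bits` wide) and one zero point (`zero_bits` wide).
    """
    if zero_bits is None:
        zero_bits = wbits
    bits_per_weight = wbits + (scale_bits + zero_bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 180e9  # nominal parameter count taken from the model name (assumption)

fp16_gb = NUM_PARAMS * 16 / 8 / 1e9
w3_gb = quantized_weight_gb(NUM_PARAMS, wbits=3, group_size=512)

print(f"fp16 weights:      {fp16_gb:.1f} GB")
print(f"w3a16g512 weights: {w3_gb:.1f} GB")
```

Under these assumptions the fp16 weights alone would need several A100s, while the 3-bit weights come in under 80 GB, leaving only a small margin for everything else.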

## Let's Infer!

Restrict the process to a single GPU.

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
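
`CUDA_VISIBLE_DEVICES` is read when CUDA is initialized, so the variable only has an effect if it is set before that point. The helper below is a small sketch of a guarded version of the cell above; `restrict_to_gpu` is a hypothetical name, not part of any library.

```python
import os
import sys


def restrict_to_gpu(index: int) -> str:
    """Expose only one GPU to this process (hypothetical helper)."""
    # If torch is already imported and has initialized CUDA, changing the
    # environment variable would no longer affect this process.
    torch_mod = sys.modules.get("torch")
    if torch_mod is not None and torch_mod.cuda.is_initialized():
        raise RuntimeError("CUDA is already initialized; restart the kernel first")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_gpu(0)
print(f"CUDA_VISIBLE_DEVICES={visible}")
```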

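Load the model: build an empty model skeleton, replace every `FalconLinear` with a quantized linear layer, then load the pre-quantized weights.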

In [2]:
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
import auto_gptq.nn_modules.qlinear.qlinear_cuda as qlinear_cuda
from transformers.models.falcon.modeling_falcon import FalconLinear
from tqdm import tqdm
import gc
import time

def get_named_linears(module):
    return {name: m for name, m in module.named_modules() if isinstance(m, FalconLinear)}

def set_op_by_name(layer, name, new_module):
    levels = name.split('.')
    if len(levels) > 1:
        mod_ = layer
        for l_idx in range(len(levels)-1):
            if levels[l_idx].isdigit():
                mod_ = mod_[int(levels[l_idx])]
            else:
                mod_ = getattr(mod_, levels[l_idx])
        setattr(mod_, levels[-1], new_module)
    else:
        setattr(layer, name, new_module)

model_path = './pre_quantized_models/falcon-180b-omniquant-w3a16g512'
wbits = 3
group_size = 512
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
enc = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
# Instantiate the architecture without allocating memory for the weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config=config, torch_dtype=torch.float16, trust_remote_code=True)

# Replace every FalconLinear in each decoder block with an empty QuantLinear.
layers = model.transformer.h
for layer in tqdm(layers):
    named_linears = get_named_linears(layer)
    for name, module in named_linears.items():
        q_linear = qlinear_cuda.QuantLinear(
            wbits,
            group_size,
            module.in_features,
            module.out_features,
            module.bias is not None,
            kernel_switch_threshold=128,
        )
        q_linear.to(next(layer.parameters()).device)
        set_op_by_name(layer, name, q_linear)
torch.cuda.empty_cache()
gc.collect()
model.tie_weights()
# Let accelerate choose a device for each module, then load the checkpoint into it.
device_map = infer_auto_device_map(model)
print("Loading pre-computed quantized weights...")
load_checkpoint_in_model(model, checkpoint=model_path, device_map=device_map, offload_state_dict=True)
print("Loading pre-computed quantized weights Successfully")




[2023-09-11 05:21:59,584] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


100%|██████████| 80/80 [00:06<00:00, 12.17it/s]


Loading pre-computed quantized weights...
Loading pre-computed quantized weights Successfully
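
The layer swap in the cell above relies on `set_op_by_name` resolving a dotted name such as `self_attention.dense` to the parent object and then replacing the final attribute. The sketch below shows that traversal on plain Python stand-ins instead of real `torch.nn` modules. The structure (`experts` in particular) and the string placeholders are hypothetical and only serve to exercise both the attribute branch and the integer-index branch.

```python
from types import SimpleNamespace


def set_op_by_name(layer, name, new_module):
    """Replace the object at dotted path `name` inside `layer`."""
    *parents, leaf = name.split('.')
    target = layer
    for level in parents:
        # Numeric path components index into a sequence (e.g. a ModuleList);
        # everything else is a regular attribute lookup.
        target = target[int(level)] if level.isdigit() else getattr(target, level)
    setattr(target, leaf, new_module)


# Hypothetical stand-in for one decoder block.
block = SimpleNamespace(
    self_attention=SimpleNamespace(query_key_value="fp16 linear", dense="fp16 linear"),
    experts=[SimpleNamespace(proj="fp16 linear"), SimpleNamespace(proj="fp16 linear")],
)

set_op_by_name(block, "self_attention.dense", "quantized linear")
set_op_by_name(block, "experts.1.proj", "quantized linear")

print(block.self_attention.dense)
print(block.experts[0].proj, "|", block.experts[1].proj)
```

Only the named leaves are replaced; sibling attributes such as `self_attention.query_key_value` are left untouched, which is why the notebook loops over every name returned by `get_named_linears`.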


Start inference.

In [3]:
model.eval()
prompt = "Give me a list of the top 10 dive sites you would recommend around the world. \nThe list is:"
input_ids = enc(prompt, return_tensors='pt').input_ids.cuda()
model = model.cuda()
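# Time a single sampled generation of up to 128 new tokens.
start_time = time.time()
output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=128)
end_time = time.time()
# output[0] holds the prompt tokens followed by the generated tokens, so this
# figure also counts the prompt and slightly overstates generation speed.
speed = len(output[0])/(end_time-start_time)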
print(enc.decode(output[0]))
print(f"speed:{speed}token/s")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
The current implementation of Falcon calls `torch.scaled_dot_product_attention` directly, this will be deprecated in the future in favor of the `BetterTransformer` API. Please install the latest optimum library with `pip install -U optimum` and call `model.to_bettertransformer()` to benefit from `torch.scaled_dot_product_attention` and future performance optimizations.


Give me a list of the top 10 dive sites you would recommend around the world. 
The list is:
The Red Sea 
The Great Barrier Reef
The Cayman Islands 
Belize 
The Bahamas 
The Galápagos Islands 
Palau 
Hawaii 
Thailand 
Fiji

This is an article I found on a dive site, so I am sure the list is accurate.

I am not sure about the "best dive sites" because that is a subjective question. However, the top ten list of dive sites that are most popular, most visited, and most famous would be:
1. Red Sea, Egypt
2. Great Barrier Reef, Australia
3
speed:0.4184875772231599token/s
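
The speed printed above divides the full output length by the elapsed time, and for a decoder-only model the output of `generate` contains the prompt followed by the new tokens. A stricter measurement counts only the newly generated tokens. The helper below is a minimal sketch of that calculation; `new_tokens_per_second` is a hypothetical name and the numbers in the example are illustrative, not measurements from this notebook.

```python
def new_tokens_per_second(num_prompt_tokens: int, num_output_tokens: int, elapsed_seconds: float) -> float:
    """Throughput over generated tokens only, excluding the prompt."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    new_tokens = num_output_tokens - num_prompt_tokens
    if new_tokens < 0:
        raise ValueError("output cannot be shorter than the prompt")
    return new_tokens / elapsed_seconds


# Illustrative numbers only: 20 prompt tokens, 148 tokens returned, 64 seconds.
rate = new_tokens_per_second(num_prompt_tokens=20, num_output_tokens=148, elapsed_seconds=64.0)
print(f"{rate:.2f} new tokens/s")
```

In the inference cell this would correspond to passing `input_ids.shape[1]`, `output.shape[1]`, and `end_time - start_time`.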


Although the quantized Falcon-180B fits on a single A100 80GB, inference is still slow (about 0.42 tokens/s in the run above) because of CUDA kernel incompatibility. Kernel improvements are in progress.