[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LianjiaTech/BELLE/blob/main/notebook/BELLE_INFER_COLAB.ipynb) 

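# **Example: BELLE model inference on Colab**
This notebook provides code for running the BELLE model in a Colab environment. By default it loads a 4-bit quantized LLaMA model. While the model is being loaded into memory, peak RAM usage is about 7 GB; once the model has been moved to the GPU, RAM usage drops to about 4 GB and GPU memory usage is about 6 GB.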


## Check which GPU Colab has assigned (free accounts usually get a T4 with about 14 GB)



---



In [None]:
!nvidia-smi

Thu Apr  6 04:03:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
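The same check can be scripted. The sketch below is an illustration rather than part of the BELLE notebook: it calls `nvidia-smi` with its `--query-gpu` CSV options and parses the result. The helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess

def parse_gpu_csv(text):
    """Parse 'name, memory_mib' lines as printed by
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus

def query_gpus():
    """Return [(name, total_mib), ...], or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []

print(query_gpus())
```

On a runtime without a GPU the function returns an empty list, which is a convenient signal to switch the Colab runtime type before continuing.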

## Clone the BELLE repository into Colab

In [None]:
!git clone https://github.com/LianjiaTech/BELLE.git 

fatal: destination path 'BELLE' already exists and is not an empty directory.
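The `fatal` message above appears when the cell is re-run after the repository has already been cloned. A minimal sketch of a re-runnable version follows, assuming the clone should simply be skipped when the target directory exists. `clone_once` is a hypothetical helper, not part of BELLE.

```python
import os
import subprocess

def clone_once(url, dest):
    """Clone url into dest unless dest already exists.

    Returns True if a clone was run, False if it was skipped.
    """
    if os.path.isdir(dest):
        print(f"{dest} already exists, skipping clone")
        return False
    subprocess.run(["git", "clone", url, dest], check=True)
    return True

# Example call (needs network access):
# clone_once("https://github.com/LianjiaTech/BELLE.git", "BELLE")
```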


### A 14 GB GPU can currently run only the quantized version, so for now this notebook covers only quantized inference on Colab
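A rough calculation shows why. The sketch below counts weight storage only and assumes a nominal 7 billion parameters. Activations, the KV cache, and the per-group scales and zero points that GPTQ stores are ignored, so real usage is higher.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB at a given precision."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 7e9  # nominal size of a "7B" model (assumption)

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gib(N_PARAMS, bits):.1f} GiB")
```

At fp16 the weights alone come to about 13 GiB, which leaves almost no headroom on a 14 GB card, while the 4-bit weights need a little over 3 GiB.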

In [None]:
%cd BELLE/gptq

/content/BELLE/gptq


### Install the GPTQ environment

In [None]:
!pip install -r requirements.txt

['Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/',
 'Collecting git+https://github.com/huggingface/transformers (from -r requirements.txt (line 4))',
 '  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-zup77i5e',
 '  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-zup77i5e',
 '  Resolved https://github.com/huggingface/transformers to commit 15641892985b1d77acc74c9065c332cd7c3f7d7f',
 '  Installing build dependencies ... \x1b[?25l\x1b[?25hdone',
 '  Getting requirements to build wheel ... \x1b[?25l\x1b[?25hdone',
 '  Installing backend dependencies ... \x1b[?25l\x1b[?25hdone',
 '  Preparing metadata (pyproject.toml) ... \x1b[?25l\x1b[?25hdone',

In [None]:
!pip uninstall -y  transformers
!pip install  git+https://github.com/huggingface/transformers

In [None]:
!python setup_cuda.py install && CUDA_VISIBLE_DEVICES=0 python test_kernel.py


running install
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.9/quant_cuda.cpython-39-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for quant_cuda.cpython-39-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/quant_cuda.py to quant_cuda.cpython-39.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/dependency_links.txt -> build/bdist.linux
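After the build, one quick sanity check is to ask Python whether the `quant_cuda` extension named in the log above can be found. A minimal sketch using only the standard library follows; `is_importable` is a hypothetical helper.

```python
import importlib.util

def is_importable(module_name):
    """Return True if module_name can be located on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None

print("quant_cuda installed:", is_importable("quant_cuda"))
```

A freshly installed extension may only become visible to an already running kernel after the runtime is restarted.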

### Download the BELLE-7B GPTQ checkpoint to Colab


In [None]:

!git lfs install && git clone https://huggingface.co/BelleGroup/BELLE-LLAMA-7B-2M-gptq


Updated git hooks.
Git LFS initialized.
fatal: destination path 'BELLE-LLAMA-7B-2M-gptq' already exists and is not an empty directory.


In [None]:
!ls BELLE-LLAMA-7B-2M-gptq

config.json		llama7b-2m-4bit-128g.pt  special_tokens_map.json
generation_config.json	llama7b-2m-8bit-128g.pt  tokenizer_config.json
LICENSE			README.md		 tokenizer.model
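The listing shows one checkpoint file per bit width. Below is a small sketch of building the checkpoint path from the quantization settings, assuming the naming pattern seen in the listing holds for both files. `checkpoint_path` is a hypothetical helper.

```python
def checkpoint_path(model_dir, wbits, groupsize):
    """Build a path following the llama7b-2m-<bits>bit-<group>g.pt pattern."""
    return f"{model_dir}/llama7b-2m-{wbits}bit-{groupsize}g.pt"

print(checkpoint_path("BELLE-LLAMA-7B-2M-gptq", 4, 128))
```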


## BELLE **GPTQ inference**

In [None]:
import time

import torch
import torch.nn as nn

from gptq import *
from modelutils import *
from quant import *

import transformers  # referenced below as transformers.modeling_utils
from transformers import AutoTokenizer

DEV = torch.device('cuda:0')

def get_llama(model):
    import torch
    def skip(*args, **kwargs):
        pass
    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip
    from transformers import LlamaForCausalLM
    model = LlamaForCausalLM.from_pretrained(model, torch_dtype='auto')
    model.seqlen = 2048
    return model

def load_quant(model, checkpoint, wbits, groupsize):
    from transformers import LlamaConfig, LlamaForCausalLM 
    config = LlamaConfig.from_pretrained(model)
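    # Replace the default initializers with no-ops: every weight is
    # overwritten by the checkpoint, so random init would only waste time.
    def noop(*args, **kwargs):
        pass
    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    # Build the model skeleton in half precision, then restore the default dtype.
    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    model = LlamaForCausalLM(config)
    torch.set_default_dtype(torch.float)
    model = model.eval()

    # Swap every linear layer except lm_head for its quantized counterpart.
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    make_quant(model, layers, wbits, groupsize)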

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint))
    else:
        model.load_state_dict(torch.load(checkpoint))
    model.seqlen = 2048
    print('Done.')

    return model


class args:
    model = "BELLE-LLAMA-7B-2M-gptq"
    wbits = 4
    groupsize = 128
    load = "BELLE-LLAMA-7B-2M-gptq/llama7b-2m-4bit-128g.pt"
    text = None
    min_length = 10
    max_length = 1024
    top_p = 0.95
    temperature = 0.7


    
    
if type(args.load) is not str:
    args.load = args.load.as_posix()




model = load_quant(args.model, args.load, args.wbits, args.groupsize)


model.to(DEV)


Loading model ...
Done.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): QuantLinear()
          (k_proj): QuantLinear()
          (v_proj): QuantLinear()
          (o_proj): QuantLinear()
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): QuantLinear()
          (down_proj): QuantLinear()
          (up_proj): QuantLinear()
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
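The printed structure is enough to estimate the parameter count. The sketch below takes the hidden size (4096), vocabulary size (32000) and layer count (32) from the output above. The MLP width of 11008 is not shown there and is assumed from the standard LLaMA-7B configuration.

```python
HIDDEN = 4096         # from Embedding(32000, 4096)
VOCAB = 32000         # from Embedding / lm_head
LAYERS = 32           # from "(0-31): 32 x LlamaDecoderLayer"
INTERMEDIATE = 11008  # assumption: standard LLaMA-7B MLP width

attn = 4 * HIDDEN * HIDDEN        # q_proj, k_proj, v_proj, o_proj
mlp = 3 * HIDDEN * INTERMEDIATE   # gate_proj, down_proj, up_proj
norms = 2 * HIDDEN                # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

# Decoder layers + token embeddings + lm_head + final norm
total = LAYERS * per_layer + 2 * VOCAB * HIDDEN + HIDDEN
quantized = LAYERS * (attn + mlp)  # weights held in QuantLinear modules

print(f"total parameters: {total:,}")
print(f"share in QuantLinear layers: {quantized / total:.1%}")
```

Under these assumptions most of the weights sit in the `QuantLinear` modules, while the embedding table and `lm_head` stay unquantized.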

In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("BelleGroup/BELLE-LLAMA-7B-0.6M")

In [None]:
def infer_text_gen(text):
    inputs = f'Human: {text} \n\nAssistant:'
    input_ids = tokenizer.encode(inputs, return_tensors="pt").to(DEV)

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            do_sample=True,
            min_length=args.min_length,
            max_length=args.max_length,
            top_p=args.top_p,
            temperature=args.temperature,
        )

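    # Keep only the newly generated tokens. Slicing by token count is more
    # reliable than slicing the decoded string by len(inputs), because the
    # decoded text can start with a BOS marker that shifts character offsets.
    new_token_ids = generated_ids[0][input_ids.shape[1]:].tolist()
    decode_text = tokenizer.decode(new_token_ids, skip_special_tokens=True)
    return decode_text.strip()

In [None]:
print(infer_text_gen("你是谁？"))  # "Who are you?"

我叫Belle，我的名字代表着Bloom Enhanced Large Language model Engine，也就是说我是基于Bloom训练的大语言模型。



In [None]:
print(infer_text_gen("怎么让自己精力充沛，列5点建议"))  # "How can I stay energetic? Give 5 suggestions."

1. 充足的睡眠：每晚7-8小时的充足睡眠可以帮助我们恢复精力和焕发活力。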
2. 饮食健康：均衡的饮食可以提供身体所需的所有营养素，让我们保持精力充沛。
3. 锻炼身体：运动可以提高我们的身体素质和免疫力，增强我们的精力和耐力。
4. 减压放松：学习减压放松技巧可以帮助我们减少压力和焦虑，让我们保持精力充沛。
5. 规律作息：保持规律的作息可以让我们的身体和大脑有一个良好的休息和恢复的时间，从而保持精力充沛。
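
To make the `temperature` and `top_p` settings used above more concrete, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It illustrates the idea only and is not the code that `transformers` runs internally. `nucleus_filter` and the toy logits are hypothetical.

```python
import math

def nucleus_filter(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} for the smallest set of tokens whose
    cumulative probability reaches top_p, renormalized to sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Walk tokens from most to least likely until top_p mass is covered.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -3.0]
print(nucleus_filter(toy_logits))
```

A lower temperature sharpens the distribution before filtering, and a lower `top_p` drops more of the unlikely tail, so both settings make sampling more conservative.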
