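# **Example: BELLE model inference on Colab**
This notebook provides code for running the BELLE model in a Colab environment. For the model to load correctly, select Runtime -> Change runtime type -> Runtime shape -> High-RAM, because the model is first loaded into host memory. Peak RAM usage during loading is roughly 16 GB. Once the model has been moved to the GPU, RAM usage drops to about 4 GB and GPU memory usage is about 11 GB.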
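
The memory figures above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the BELLE repository: the helper names `total_ram_gb` and `runtime_is_sufficient` are hypothetical, the thresholds are the approximate numbers quoted above, and the RAM query assumes a Linux runtime such as Colab.

```python
import os

# Approximate figures quoted in the paragraph above.
PEAK_RAM_GB = 16  # peak host RAM while the checkpoint is being loaded
GPU_MEM_GB = 11   # GPU memory in use once the model is on the device


def total_ram_gb():
    """Total physical RAM in GiB; relies on Linux sysconf names (Colab is Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def runtime_is_sufficient(ram_gb, gpu_gb):
    """Return True when both approximate thresholds are met."""
    return ram_gb >= PEAK_RAM_GB and gpu_gb >= GPU_MEM_GB
```

Calling `runtime_is_sufficient(total_ram_gb(), 15)` in a fresh runtime indicates whether the High-RAM shape is active before any large download starts.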


## Check which GPU Colab has assigned (free accounts usually get a T4 with about 14 GB of memory)



---



In [1]:
!nvidia-smi

Wed Apr  5 12:30:07 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
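
The same information can be read programmatically instead of by eye. This is a small sketch that assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV mode. The helper names `parse_gpu_line` and `query_gpus` are made up for illustration.

```python
import subprocess


def parse_gpu_line(line):
    """Parse one CSV line such as 'Tesla T4, 15360 MiB' into (name, total_mib)."""
    name, memory = (field.strip() for field in line.split(","))
    return name, int(memory.split()[0])


def query_gpus():
    """Ask nvidia-smi for the name and total memory of each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

On the runtime shown above, `query_gpus()` would be expected to report a single Tesla T4 entry.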

## Clone the BELLE project into Colab

In [2]:
!git clone https://github.com/LianjiaTech/BELLE.git 

fatal: destination path 'BELLE' already exists and is not an empty directory.
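
The `fatal` message above only means the repository was already cloned by an earlier run of the cell. If a rerun-safe cell is preferred, one option is to skip the clone when the directory exists. This is a sketch with a hypothetical helper name, `clone_if_missing`, and it assumes `git` is installed.

```python
import os
import subprocess


def clone_if_missing(url, dest):
    """Clone url into dest unless dest already exists; return True if a clone ran."""
    if os.path.isdir(dest):
        return False  # already cloned in an earlier run of the cell
    subprocess.run(["git", "clone", url, dest], check=True)
    return True
```

For this notebook the call would be `clone_if_missing("https://github.com/LianjiaTech/BELLE.git", "BELLE")`.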


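### A 14 GB GPU currently supports only the quantized version, so for now this notebook covers Colab inference with the quantized version only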

In [3]:
%cd BELLE/gptq

/content/BELLE/gptq
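
A rough back-of-the-envelope calculation shows why only the quantized checkpoints fit. The sketch below counts weight storage only and takes a nominal 7 billion parameters as an assumption. The exact parameter count of BELLE-7B is not stated in this notebook, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal "7B" count; the exact BELLE-7B figure is an assumption

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under this assumption the 16-bit weights alone come to about 14 GB, which leaves no headroom on the T4, while 8-bit and 4-bit weights need about 7 GB and 3.5 GB. Actual GPU usage is higher than the estimate, presumably because some layers stay unquantized and inference needs working memory.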


### Install the GPTQ environment

In [4]:
!!pip install -r requirements.txt

['Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/',
 'Collecting git+https://github.com/huggingface/transformers (from -r requirements.txt (line 4))',
 '  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-w063dz_m',
 '  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-w063dz_m',
 '  Resolved https://github.com/huggingface/transformers to commit 2a91a9ef663776ad8259ff22fd285f3cfc888d0f',
 '  Installing build dependencies ... \x1b[?25l\x1b[?25hdone',
 '  Getting requirements to build wheel ... \x1b[?25l\x1b[?25hdone',
 '  Installing backend dependencies ... \x1b[?25l\x1b[?25hdone',
 '  Preparing metadata (pyproject.toml) ... \x1b[?25l\x1b[?25hdone',

In [5]:
!python setup_cuda.py install && CUDA_VISIBLE_DEVICES=0 python test_kernel.py


running install
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.9/quant_cuda.cpython-39-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for quant_cuda.cpython-39-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/quant_cuda.py to quant_cuda.cpython-39.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/dependency_links.txt -> build/bdist.linux

### Download the BELLE-7B-gptq checkpoints to Colab


In [6]:

!git lfs install && git clone https://huggingface.co/BelleGroup/BELLE-7B-gptq


Updated git hooks.
Git LFS initialized.
fatal: destination path 'BELLE-7B-gptq' already exists and is not an empty directory.


In [7]:
!ls BELLE-7B-gptq

bloom7b-0.2m-4bit-128g.pt  bloom7b-2m-8bit-128g.pt  special_tokens_map.json
bloom7b-0.2m-8bit-128g.pt  config.json		    tokenizer_config.json
bloom7b-2m-4bit-128g.pt    README.md		    tokenizer.json
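
The checkpoint file names appear to encode the bit width and group size, and those two values have to match the `wbits` and `groupsize` passed to the loader. As a sketch, they can be derived from the file name rather than typed by hand. The naming pattern is inferred only from the listing above, and `quant_settings` is a hypothetical helper.

```python
import re

# Pattern inferred from the file names listed above, e.g. bloom7b-2m-8bit-128g.pt
CKPT_RE = re.compile(r"-(\d+)bit-(\d+)g\.(?:pt|safetensors)$")


def quant_settings(filename):
    """Return (wbits, groupsize) encoded in a checkpoint file name."""
    match = CKPT_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    return int(match.group(1)), int(match.group(2))
```

For example, `quant_settings("BELLE-7B-gptq/bloom7b-2m-8bit-128g.pt")` yields the pair `(8, 128)`.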


## BELLE **GPTQ inference**

In [8]:
import time

import torch
import torch.nn as nn

from gptq import *
from modelutils import *
from quant import *

from transformers import AutoTokenizer

DEV = torch.device('cuda:0')

def get_bloom(model):
    # Skip random weight initialization; from_pretrained loads the real weights.
    def skip(*args, **kwargs):
        pass
    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip
    from transformers import BloomForCausalLM
    model = BloomForCausalLM.from_pretrained(model, torch_dtype='auto')
    model.seqlen = 2048
    return model

def load_quant(model, checkpoint, wbits, groupsize):
    import transformers
    from transformers import BloomConfig, BloomForCausalLM
    config = BloomConfig.from_pretrained(model)

    # Skip random weight initialization; the checkpoint supplies the weights.
    def noop(*args, **kwargs):
        pass
    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop
    transformers.modeling_utils._init_weights = False

    # Build the model skeleton in fp16, then restore the default dtype.
    torch.set_default_dtype(torch.half)
    model = BloomForCausalLM(config)
    torch.set_default_dtype(torch.float)
    model = model.eval()

    # Swap every layer found except lm_head for its quantized counterpart.
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    make_quant(model, layers, wbits, groupsize)

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint))
    else:
        model.load_state_dict(torch.load(checkpoint))
    model.seqlen = 2048
    print('Done.')

    return model



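class args:
    model = "BELLE-7B-gptq"  # local directory that holds config.json
    wbits = 8                # bit width; must match the checkpoint below
    groupsize = 128          # group size; must match the checkpoint below
    load = "BELLE-7B-gptq/bloom7b-2m-8bit-128g.pt"
    text = None
    min_length = 10
    max_length = 1024
    top_p = 0.95
    temperature = 0.7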


    
    
if type(args.load) is not str:
    args.load = args.load.as_posix()

if args.load:
    model = load_quant(args.model, args.load, args.wbits, args.groupsize)
else:
    model = get_bloom(args.model)
    model.eval()

model.to(DEV)


Loading model ...
Done.


BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 4096)
    (word_embeddings_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-29): 30 x BloomBlock(
        (input_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): QuantLinear()
          (dense): QuantLinear()
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): QuantLinear()
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): QuantLinear()
        )
      )
    )
    (ln_f): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4096, out_features=250880, bias=False)
)

In [9]:
tokenizer = AutoTokenizer.from_pretrained("BelleGroup/BELLE-7B-gptq")


In [10]:
def infer_text_gen(text):
    # Wrap the user's text in the Human/Assistant prompt template.
    inputs = f'Human: {text} \n\nAssistant:'
    input_ids = tokenizer.encode(inputs, return_tensors="pt").to(DEV)

    # Sample a continuation with nucleus sampling; no gradients are needed.
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            do_sample=True,
            min_length=args.min_length,
            max_length=args.max_length,
            top_p=args.top_p,
            temperature=args.temperature,
        )

    # Decode, then drop the echoed prompt and the end-of-sequence marker.
    decode_text = tokenizer.decode(generated_ids[0])
    decode_text = decode_text[len(inputs):]
    decode_text = decode_text.replace("</s>", "")
    return decode_text
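
The prompt handling in `infer_text_gen` can be separated from the model call and checked on its own. The sketch below reuses the same template string. The helper names `build_prompt` and `extract_reply` are hypothetical, and the `startswith` guard is an added safety check that the function above does not perform.

```python
PROMPT_TEMPLATE = "Human: {text} \n\nAssistant:"  # same template as infer_text_gen


def build_prompt(text):
    """Wrap the user's text in the Human/Assistant template."""
    return PROMPT_TEMPLATE.format(text=text)


def extract_reply(decoded, prompt, eos="</s>"):
    """Drop the echoed prompt and the end-of-sequence marker from decoded text."""
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return reply.replace(eos, "")
```

Slicing by `len(prompt)` is only safe when the decoded text reproduces the prompt character for character, which is why the guard falls back to the full text otherwise.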

In [15]:
# Prompt: "Who are you?"
print(infer_text_gen("你是谁？"))

 我是一个人工智能语言模型，没有个人身份和意识。我的开发者们是来自全球各地的科学家和工程师。


In [16]:
# Prompt: "How can I stay energetic? List 5 suggestions."
print(infer_text_gen("怎么让自己精力充沛，列5点建议"))



1. 规律作息：每天保持规律的作息时间，尽量在同一时间起床和睡觉，让身体和大脑能够适应这种规律。

2. 锻炼身体：适当的体育锻炼能够让人充满活力，同时还能增强身体的免疫力。

3. 合理饮食：保持健康的饮食习惯，摄入足够的营养物质，避免暴饮暴食和过多垃圾食品。

4. 放松心情：遇到压力和焦虑时，可以通过冥想、瑜伽、按摩等方式放松心情，缓解压力。

5. 保持良好的人际关系：和家人、朋友保持良好的关系，能够让人感到心情愉悦，并且能够提供支持和鼓励。
