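# Basic approach (baseline)

Inference with the transformers backend is very slow, so it consumes a large amount of compute units (core hours). Here we try vLLM as the inference backend for the GLM-4-9B-Chat model to reduce that consumption.

vLLM speeds up inference by maximizing throughput, and its GPU memory strategy is very aggressive: it occupies a large share of the available GPU memory. Make sure at least 36 GB of GPU memory is available.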
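As a rough illustration of why the memory footprint is so large: vLLM preallocates a fixed fraction of the GPU (the `gpu_memory_utilization` argument, 0.9 by default), loads the weights, and gives most of the remainder to the KV cache. The sketch below uses figures logged later in this notebook (39.56 GiB total capacity, 17.5635 GB of weights). The helper name is hypothetical, activation and framework overhead are ignored, and GB and GiB are treated as the same unit, so the result is only an upper-bound estimate.

```python
def kv_cache_budget_gib(total_gib: float, weights_gib: float, utilization: float = 0.9) -> float:
    """Approximate memory left for the KV cache (hypothetical helper).

    Ignores activation memory and framework overhead, so the real value is smaller.
    """
    reserved = total_gib * utilization  # share of the GPU vLLM is allowed to claim
    return max(reserved - weights_gib, 0.0)


total_gib = 39.56      # GPU capacity reported in the OOM message of this notebook
weights_gib = 17.5635  # "Loading model weights took 17.5635 GB" in the engine log

for util in (0.5, 0.8, 0.9):
    budget = kv_cache_budget_gib(total_gib, weights_gib, util)
    print(f"gpu_memory_utilization={util}: ~{budget:.2f} GiB left for the KV cache")
```

The point of the estimate is that a second engine started on the same GPU has almost nothing left to claim once the first one has reserved its share.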

## Step 1: Update or install the required environment

In [1]:
!pip install --upgrade transformers requests urllib3 tqdm pandas tiktoken vllm
!apt update > /dev/null; apt install aria2 git-lfs axel -y > /dev/null

Collecting tiktoken
  Using cached tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)




Compared with the original baseline, transformers, tiktoken, and vllm were added to the install list.
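Since the pip output above is truncated, it can help to confirm which versions actually ended up in the environment before loading the model. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


report = installed_versions(["transformers", "tiktoken", "vllm"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```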

## Step 2: Download the dataset

In [2]:
!axel -n 12 -a https://ai-contest-static.xfyun.cn/2024/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%83%BD%E5%8A%9B%E8%AF%84%E6%B5%8B%EF%BC%9A%E4%B8%AD%E6%96%87%E6%88%90%E8%AF%AD%E9%87%8A%E4%B9%89%E4%B8%8E%E8%A7%A3%E6%9E%90%E6%8C%91%E6%88%98%E8%B5%9B/test_input.csv

Initializing download: https://ai-contest-static.xfyun.cn/2024/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%83%BD%E5%8A%9B%E8%AF%84%E6%B5%8B%EF%BC%9A%E4%B8%AD%E6%96%87%E6%88%90%E8%AF%AD%E9%87%8A%E4%B9%89%E4%B8%8E%E8%A7%A3%E6%9E%90%E6%8C%91%E6%88%98%E8%B5%9B/test_input.csv
File size: 131.332 Kilobyte(s) (134484 bytes)
Opening output file test_input.csv.2
Starting download

Can't setup alternate output. Deactivating.
..... .......... .......... .......... ..........  [ 259.8KB/s]
[ 38%]  .......... .......... .......... .......... ..........  [ 467.3KB/s]
[ 76%]  .......... .......... .......... .

Downloaded 131.332 Kilobyte(s) in 0 second(s). (547.49 KB/s)
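Note that the log above shows axel writing to `test_input.csv.2`, apparently because a file named `test_input.csv` already existed from an earlier run, while step 4 reads `./test_input.csv`. A minimal sketch of a download that skips existing files, plus a decoding of the percent-encoded URL so the path is readable. The helper name `download_once` is hypothetical, and the download itself is not invoked here.

```python
import os
import urllib.request
from urllib.parse import unquote, urlsplit

URL = (
    "https://ai-contest-static.xfyun.cn/2024/"
    "%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%83%BD%E5%8A%9B%E8%AF%84%E6%B5%8B%EF%BC%9A"
    "%E4%B8%AD%E6%96%87%E6%88%90%E8%AF%AD%E9%87%8A%E4%B9%89%E4%B8%8E%E8%A7%A3%E6%9E%90"
    "%E6%8C%91%E6%88%98%E8%B5%9B/test_input.csv"
)


def download_once(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists. Returns True if a download happened."""
    if os.path.exists(dest):
        return False  # keep the existing file instead of creating test_input.csv.N copies
    urllib.request.urlretrieve(url, dest)
    return True


# Decode the percent-encoded path to see the human-readable contest name
path = unquote(urlsplit(URL).path)
filename = os.path.basename(path)
print(path)
print(filename)
```

Calling `download_once(URL, filename)` would then fetch the file only when it is missing.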


## Step 3: Build the model (using GLM-4-9B-Chat)

In [3]:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import torch

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4

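# GLM-4-9B-Chat
# If you run into OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 1024, 1
# NOTE: this value is not passed to LLM() below, so vLLM falls back to its default (0.9)
gpu_memory_utilization = 0.5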
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "列举与下面句子最符合的五个成语。只需要输出五个成语，不需要有其他的输出，写在一行中：比喻夫妻关系和谐"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, if you run into OOM, consider enabling the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=512, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0])
print(outputs[0].outputs[0].text)
torch.cuda.empty_cache()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO 10-29 03:09:00 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=THUDM/glm-4-9b-chat, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=Fa

Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00<?, ?it/s]


INFO 10-29 03:09:08 model_runner.py:1067] Loading model weights took 17.5635 GB
INFO 10-29 03:09:09 gpu_executor.py:122] # GPU blocks: 26291, # CPU blocks: 6553
INFO 10-29 03:09:09 gpu_executor.py:126] Maximum concurrency for 1024 tokens per request: 410.80x


Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.18it/s, est. speed input: 42.67 toks/s, output: 28.44 toks/s]

RequestOutput(request_id=0, prompt='[gMASK]<sop><|user|>\n列举与下面句子最符合的五个成语。只需要输出五个成语，不需要有其他的输出，写在一行中：比喻夫妻关系和谐<|assistant|>', prompt_token_ids=[151331, 151333, 151331, 151333, 151336, 198, 115571, 98381, 101534, 108259, 98430, 100498, 98314, 105053, 109924, 1773, 107073, 102162, 105053, 109924, 3837, 103628, 98318, 106911, 102162, 3837, 99032, 104120, 98351, 98322, 5122, 110573, 102885, 99172, 101938, 151337], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n相敬如宾、琴瑟和鸣、恩爱夫妻、伉俪情深、夫唱妇随', token_ids=(198, 98463, 100619, 98410, 100907, 5373, 101396, 104899, 98327, 102033, 5373, 118219, 102885, 5373, 17406, 231, 123143, 125894, 5373, 99324, 99918, 99991, 98932, 151336), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=151336)], finished=True, metrics=RequestMetrics(arrival_time=1730171352.6092703, last_token_time=1730171352.6092703, first_scheduled_time=1730171352.6124427, first_token_time=1730171352.6596




Check that the output looks normal.

The original setting "max_model_len, tp_size = 131072, 1" causes OOM.

After adjusting to "max_model_len, tp_size = 65536, 1" it runs.

GPU memory usage is 34.2 GB.
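The "Maximum concurrency" value in the engine log can be reproduced from the number of GPU blocks. The sketch below assumes a KV-cache block size of 16 tokens, which is vLLM's default but is not printed in the logs shown here; the helper name is hypothetical.

```python
BLOCK_SIZE = 16  # assumed default number of tokens per KV-cache block


def max_concurrency(gpu_blocks: int, max_model_len: int, block_size: int = BLOCK_SIZE) -> float:
    """Number of full-length requests whose KV cache fits on the GPU at once."""
    return gpu_blocks * block_size / max_model_len


gpu_blocks = 26291  # "# GPU blocks: 26291" from the engine log
for length in (1024, 2048):
    print(f"{length} tokens per request: {max_concurrency(gpu_blocks, length):.2f}x")
# prints 410.80x for 1024 tokens and 205.40x for 2048 tokens, matching the logged values
```

With the same number of blocks, a larger `max_model_len` lowers the number of requests that can be in flight together, which is why reducing it is the first suggestion when memory is tight.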

Output of print(outputs[0]):
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.72it/s, est. speed input: 13.80 toks/s, output: 27.59 toks/s]RequestOutput(request_id=0, prompt='[gMASK]<sop><|user|>\n你好<|assistant|>', prompt_token_ids=[151331, 151333, 151331, 151333, 151336, 198, 109377, 151337], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, ##text='\n你好👋！很高兴见到你，欢迎问我任何问题。'##, token_ids=(198, 109377, 9281, 239, 233, 6313, 118295, 103810, 98406, 3837, 100940, 106546, 99766, 98622, 1773, 151336), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=151336)], finished=True, metrics=RequestMetrics(arrival_time=1730035203.5149841, last_token_time=1730035203.5149841, first_scheduled_time=1730035203.5178537, first_token_time=1730035203.5585952, time_in_queue=0.0028696060180664062, finished_time=1730035204.0973136, scheduler_time=0.002600323000251592, model_forward_time=None, model_execute_time=None), lora_request=None)

Output of print(outputs[0].outputs[0].text):

你好👋！很高兴见到你，欢迎问我任何问题。

tokenizer_config.json: 100%
 6.15k/6.15k [00:00<00:00, 501kB/s]
tokenization_chatglm.py: 100%
 8.99k/8.99k [00:00<00:00, 769kB/s]
A new version of the following files was downloaded from https://huggingface.co/THUDM/glm-4-9b-chat:
- tokenization_chatglm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.model: 100%
 2.62M/2.62M [00:00<00:00, 12.0MB/s]
config.json: 100%
 1.44k/1.44k [00:00<00:00, 126kB/s]
WARNING 10-28 02:32:16 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-28 02:32:16 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=THUDM/glm-4-9b-chat, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-28 02:32:17 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
generation_config.json: 100%
 207/207 [00:00<00:00, 17.4kB/s]
INFO 10-28 02:32:18 model_runner.py:1056] Starting to load model THUDM/glm-4-9b-chat...
INFO 10-28 02:32:18 weight_utils.py:243] Using model weights format ['*.safetensors']
model-00001-of-00010.safetensors: 100%
 1.95G/1.95G [00:46<00:00, 42.8MB/s]
model-00003-of-00010.safetensors: 100%
 1.97G/1.97G [00:46<00:00, 42.5MB/s]
model-00002-of-00010.safetensors: 100%
 1.82G/1.82G [00:43<00:00, 41.3MB/s]
model-00004-of-00010.safetensors: 100%
 1.93G/1.93G [00:46<00:00, 42.8MB/s]
model-00005-of-00010.safetensors: 100%
 1.82G/1.82G [00:43<00:00, 42.3MB/s]
model-00008-of-00010.safetensors: 100%
 1.82G/1.82G [00:43<00:00, 42.2MB/s]
model-00006-of-00010.safetensors: 100%
 1.97G/1.97G [00:47<00:00, 43.3MB/s]
model-00007-of-00010.safetensors: 100%
 1.93G/1.93G [00:46<00:00, 41.2MB/s]
model-00009-of-00010.safetensors: 100%
 1.97G/1.97G [00:46<00:00, 41.9MB/s]
model-00010-of-00010.safetensors: 100%
 1.65G/1.65G [00:39<00:00, 42.2MB/s]
model.safetensors.index.json: 100%
 29.1k/29.1k [00:00<00:00, 2.23MB/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:05<00:00,  1.87it/s]
INFO 10-28 02:33:55 model_runner.py:1067] Loading model weights took 17.5635 GB
INFO 10-28 02:33:56 gpu_executor.py:122] # GPU blocks: 26291, # CPU blocks: 6553
INFO 10-28 02:33:56 gpu_executor.py:126] Maximum concurrency for 2048 tokens per request: 205.40x
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.17it/s, est. speed input: 42.02 toks/s, output: 28.01 toks/s]RequestOutput(request_id=0, prompt='[gMASK]<sop><|user|>\n列举与下面句子最符合的五个成语。只需要输出五个成语，不需要有其他的输出，写在一行中：比喻夫妻关系和谐<|assistant|>', prompt_token_ids=[151331, 151333, 151331, 151333, 151336, 198, 115571, 98381, 101534, 108259, 98430, 100498, 98314, 105053, 109924, 1773, 107073, 102162, 105053, 109924, 3837, 103628, 98318, 106911, 102162, 3837, 99032, 104120, 98351, 98322, 5122, 110573, 102885, 99172, 101938, 151337], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n相敬如宾、琴瑟和鸣、恩爱夫妻、伉俪情深、夫唱妇随', token_ids=(198, 98463, 100619, 98410, 100907, 5373, 101396, 104899, 98327, 102033, 5373, 118219, 102885, 5373, 17406, 231, 123143, 125894, 5373, 99324, 99918, 99991, 98932, 151336), cumulative_logprob=None, logprobs=None, finish_reason=stop, stop_reason=151336)], finished=True, metrics=RequestMetrics(arrival_time=1730082839.8494065, last_token_time=1730082839.8494065, first_scheduled_time=1730082839.8528595, first_token_time=1730082839.9011638, time_in_queue=0.0034530162811279297, finished_time=1730082840.709096, scheduler_time=0.003833885999938502, model_forward_time=None, model_execute_time=None), lora_request=None)

相敬如宾、琴瑟和鸣、恩爱夫妻、伉俪情深、夫唱妇随

## Step 4: Read the dataset

In [4]:
import pandas as pd
test = pd.read_csv('./test_input.csv', header=None)

In [5]:
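# Check the size of the dataset
print(f"Dataset size: {test.shape[0]}\nFirst 50 rows:\n")

# Show the first 50 rows of the contest dataset (the task is to give 5 idioms that may match each sentence)
for test_prompt in test[0].values[:50]:
    print(test_prompt)

Dataset size: 2972
First 50 rows: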

双方心往一处想，意志坚定。
两端都极为艰难，难以作出决定。
雄浑深远的意旨。细腻微妙。高超巧妙。
描述水流湍急，且快速且深远。
避免被引诱做出不道德或可疑的行为。
肘部肩窝。比喻事物发生于身边。
他张开嘴巴，吞咽着口水。
仅见一面，不足以深入了解。
向四处散发施舍，同时手中拿着碗盆。
比喻把装备脱下，放下武器。
对于任何战役，都要一败涂地。
阴谋真相已经昭然于众。
比喻因为无能为力而写下了好文章。
古代文人士大夫经常举行诗歌朗诵会。
不能形容为不高兴，只能说明没劲儿。
表现得举止端庄，很有教养。
1.洗刷兵器2.喂养战马3.准备作战
关注生命垂危者，关怀濒危者。
比喻面对挑战，坚韧不拔地前行。
过去的科举考试中被选拔为进士的称号。
相似程度极高或相差无几。
比喻不断地补充、堆砌和延伸。
以安逸快乐的生活和劳动为重。
她仍然每天教导她的儿子。
犹以火为耕，比喻原始、简朴的农耕方式。
指困境中处于不利地位。
搜集和研究其内在道理。
没有任何人帮助和支持。
这位作者的文章风格与自己非常相似。
国力强大，军事力量已经停止。
相互勾结维持；相互利用。
比喻进程飞快，日行千里。
只有一个目的，追求利润。
务必谨慎对待，慎重处理事务。
无法言表，只能感慨万千。
形容气势磅礴的文章风格。
根据贡献的大小给予奖励。
让国家蒙羞，民众蒙难。
形容有权势的人极其残忍和无礼。
坚决要求再次向某人强调。
采取措施；采取办法；采取行动；实行
心中充满了疑惑，还没有找到解答。
累累罪行，遍历无穷。形容罪恶极重。
形容人的容貌清爽俊雅，风度翩翩。
辞掉本职工作去做其他的事情。
秦汉时期，勋位至高者都佩戴金印和紫绶。
创立独特的风格，与众不同。
到处都是冰雪覆盖的环境，形容严冬天气。
旧事物被废弃；为了新事物而采取措施。
①向人行礼。②用作哀悼词或祭奠语。


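## Step 5: Output the idioms

Out-of-memory (OOM) errors are severe in this step: a single Colab A100 with 40 GB of GPU memory is currently not enough. Suggestions from more experienced users are welcome.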

In [7]:
from tqdm import tqdm
import os


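i = 0
# `test` is the DataFrame loaded in step 4
# Iterate over the first column of the test set and generate the five idioms that best match each sentence
for test_prompt in tqdm(test[0].values, total=len(test[0].values), desc="处理进度"):
    i = i + 1
    # Build the prompt asking the model for the five idioms that best match the sentence
    prompt = [{"role": "user", "content": f"列举与下面句子最符合的五个成语。只需要输出五个成语，不需要有其他的输出，写在一行中：{test_prompt}"}]

    # Initialize a list of five entries filled with the default idiom "同舟共济"
    words = ['同舟共济'] * 5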

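    # Load the tokenizer of the pretrained model; trust_remote_code allows its custom implementation
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    # Initialize the LLM engine with the tensor-parallel size, maximum model length, and so on.
    # NOTE: this builds a new engine on every iteration while the engine from step 3 is still
    # resident on the GPU, which is the likely cause of the OOM error shown below.
    llm = LLM(
        model=model_name,
        tensor_parallel_size=tp_size,
        max_model_len=max_model_len,
        trust_remote_code=True,
        enforce_eager=True,
        gpu_memory_utilization=0.8,
    )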

    # Token IDs that stop generation
    stop_token_ids = [151329, 151336, 151338]
    # Sampling parameters: temperature, maximum number of generated tokens, and stop token IDs
    sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
    # Render the chat template into a prompt string (no tokenization) and append the generation prompt
    inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
    # Generate text from the prompt with the sampling parameters
    outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

    # Take the generated text of the first completion
    response = outputs[0].outputs[0].text

    # Normalize the response: replace newlines and common punctuation with spaces so everything is on one line
    response = response.replace('\n', ' ').replace('、', ' ').replace(',', ' ').replace('，', ' ').replace('；',' ').replace('。',' ')
    # Extract the idioms: keep only non-empty tokens of exactly four characters
    words = [x for x in response.split() if len(x) == 4 and x.strip() != '']


    # Pad with the default idiom until the line holds five idioms
    # (5 idioms x 4 characters + 4 separating spaces = 24 characters)
    while True:
        text = ' '.join(words).strip()
        if len(text) < 24:
            words.append('同舟共济')
        else:
            break

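    # Append the final idiom list to the submission file
    with open('submit.csv', 'a+', encoding='utf-8') as up:
        up.write(' '.join(words) + '\n')


    # Print intermediate results every 50 rows
    if i % 50 == 0:
        tqdm.write(f"Model response #{i}:\n   {response}\n")
        tqdm.write(f"submit.csv line {i}:\n   {words}\n")

    # Optional early stop. The dataset has 2972 rows, so with this threshold the loop
    # effectively runs to the end; lower it (for example to 500) for a quick partial run.
    if i == 2973:
        break

print('submit.csv generated')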

处理进度:   0%|          | 0/2972 [00:00<?, ?it/s]

INFO 10-29 03:15:41 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=THUDM/glm-4-9b-chat, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=Fa

处理进度:   0%|          | 0/2972 [00:02<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB. GPU 0 has a total capacity of 39.56 GiB of which 26.81 MiB is free. Process 105528 has 39.52 GiB memory in use. Of the allocated memory 38.96 GiB is allocated by PyTorch, and 60.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
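
One likely cause of the error above is that the loop constructs a new LLM engine for every sentence while the engine from step 3 still holds most of the GPU. A minimal sketch of an alternative, assuming the `llm`, `tokenizer`, and `sampling_params` objects from step 3 are reused: build all prompts first, pass them to `llm.generate` as a list in one call so vLLM can batch them, then post-process each response. The helper names `extract_idioms` and `run_batch` are hypothetical, truncating to five idioms is an addition that the original loop does not have, and this sketch has not been run against the contest data.

```python
DEFAULT_IDIOM = "同舟共济"
PROMPT_PREFIX = "列举与下面句子最符合的五个成语。只需要输出五个成语，不需要有其他的输出，写在一行中："


def extract_idioms(response, n=5, default=DEFAULT_IDIOM):
    """Keep four-character tokens, truncate to n, and pad with the default idiom."""
    for sep in ("\n", "、", ",", "，", "；", "。"):
        response = response.replace(sep, " ")
    words = [w for w in response.split() if len(w) == 4][:n]
    return words + [default] * (n - len(words))


def run_batch(llm, tokenizer, sampling_params, sentences):
    """Generate idioms for all sentences with a single, already-built engine."""
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": PROMPT_PREFIX + sentence}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for sentence in sentences
    ]
    # One call for the whole dataset; no engine is created inside a loop
    outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)
    return [extract_idioms(out.outputs[0].text) for out in outputs]
```

With the step 3 objects still in memory, `rows = run_batch(llm, tokenizer, sampling_params, test[0].values)` followed by writing `' '.join(row)` for each row to submit.csv would take the place of the loop.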