
Commit

v0.4.7
Tongjilibo committed Feb 4, 2024
1 parent 4b8e008 commit bdf11e0
Showing 7 changed files with 29 additions and 28 deletions.
6 changes: 1 addition & 5 deletions README.md
@@ -78,13 +78,9 @@ pip install git+https://github.com/Tongjilibo/bert4torch

 |Date| bert4torch | torch4keras | Release notes |
 |------| ---------------- | ----------------- |----------- |
+|20240204| 0.4.7 | 0.1.9|Reworked `save_pretrained` to save a whole folder (see the sketch below), added GenerateSpeed for measuring token generation speed, fixed a t5 bug when use_states=True, fixed a bug in hierarchical position encoding, added the deepseek_moe model, fixed a concurrency error in generation, reduced large-model latency|
 |20240116| 0.4.6 | 0.1.8|Bug fixes, added `save_pretrained` for saving weights in `transformer` format, added several `embedding` models|
 |20240111| 0.4.5 | 0.1.7|Stop generating `past_key_values` during `training`, added a `streamlit` example, fixed a bug in sentence-embedding `max` pooling, merged `batch_generate` into `generate`, renamed the default `generation` arguments (old names still accepted), `past_key_values` can be kept across multi-turn conversations, moved the `mask` padding logic from `attention` into `apply_embedding`, added a `uie` `pipeline`, added `PtuningV2Trainer`|
-|20231228| 0.4.4 | 0.1.7|Added a `pipelines` module and moved `chat` into it, added a `Text2Vec` module for embedding generation, added `snapshot_download` for downloading HF models|
-|20231224| 0.4.3 | 0.1.7|Added common chat models to `chat`, simplified the code path for calling large models|
-|20231219| 0.4.2 | 0.1.7|`checkpoint_path` now accepts a folder path, added a `chat` module for quickly publishing demos/APIs, support loading `.safetensors`, raise an informative error for the `meta` device|
-|20231210| 0.4.1 | 0.1.6.post2|Added longlora, added a test module, adapted to torch4keras==0.1.6 (monitor the fit process and send an email alert on errors; resolved the torch 2.0 compile conflict; fixed a clip_grad_norm bug)|
-|20231126| 0.4.0 | 0.1.5 |Fixed a flash_attn bug, stream_generate can return only the last_token|

[More versions](https://github.com/Tongjilibo/bert4torch/blob/master/docs/Update.md)

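A minimal sketch of the reworked `save_pretrained` called out in the v0.4.7 row above. It assumes, as the changelog suggests, that the method now writes a directory of `transformer`-format files rather than a single weight file; the paths below are placeholders, not files from this repository.

```python
from bert4torch.models import build_transformer_model

# Placeholder paths -- substitute a real bert4torch config and checkpoint.
model = build_transformer_model(
    config_path='/path/to/bert4torch_config.json',
    checkpoint_path='/path/to/pytorch_model.bin',
)

# Per the v0.4.7 note, save_pretrained now targets a folder, so the argument
# below is a directory into which the weights and config are written.
model.save_pretrained('./saved_model_dir')
```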
12 changes: 5 additions & 7 deletions bert4torch/layers/attention.py
@@ -36,9 +36,7 @@ def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_pro
         # Get the flash_attention config options
         self.flash_attention_config = kwargs.get('flash_attention_config', dict())
         self.is_causal = kwargs.get('is_causal', False) or self.flash_attention_config.pop('is_causal', False)
-        if (flash_attention is None) and (int(torch.__version__.split('.')[0]) >= 2):
-            flash_attention = 'sdpa'
-        elif ((flash_attention is True) or (flash_attention == 'sdpa')) and (int(torch.__version__.split('.')[0]) < 2):
+        if ((flash_attention is True) or (flash_attention == 'sdpa')) and (int(torch.__version__.split('.')[0]) < 2):
             log_warn_once('`F.scaled_dot_product_attention` only supported in torch 2.0')
             flash_attention = None
         elif (flash_attention == 'xformers') and (not is_xformers_available()):
@@ -176,10 +174,10 @@ def forward(self, hidden_states=None, attention_mask=None, encoder_hidden_states
             context_layer = xops.memory_efficient_attention(query_layer, key_layer, value_layer, attn_bias=xops.LowerTriangularMask())
         # SDPA
         elif self.flash_attention in {True, 'sdpa'}:
-            # is_causal=True only applies when qlen == klen and the batch holds a single sample (with multiple samples the mask must be used)
-            if attention_mask.size(0)==1 and (query_layer.shape[2] == key_layer.shape[2]):
-                context_layer = F.scaled_dot_product_attention(query_layer, key_layer, value_layer, is_causal=True)
-            elif len(self.flash_attention_config) == 0:
+            # # is_causal=True only applies when qlen == klen and the batch holds a single sample (with multiple samples the mask must be used)
+            # if attention_mask.size(0)==1 and (query_layer.shape[2] == key_layer.shape[2]):
+            #     context_layer = F.scaled_dot_product_attention(query_layer, key_layer, value_layer, is_causal=True)
+            if len(self.flash_attention_config) == 0:
                 # default path
                 context_layer = F.scaled_dot_product_attention(query_layer, key_layer, value_layer, attention_mask.bool())
             else:
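The lines being commented out in the hunk above concern when `F.scaled_dot_product_attention` may build its own causal mask. Below is a standalone sketch of the two call patterns involved, with illustrative shapes and torch >= 2.0 assumed; it is not bert4torch's actual forward code.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative sizes only
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# Fast path: SDPA builds the lower-triangular mask itself. Per the comment in
# the diff, this is only safe when q_len == k_len and the batch holds a single
# sample; padded batches must express the padding in an explicit mask.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# General path: pass a boolean mask (True = position may be attended to),
# which is what the default branch above does with attention_mask.bool().
mask = torch.ones(16, 16, dtype=torch.bool).tril()
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(out_causal.shape, out_masked.shape)  # both (1, 8, 16, 64)
```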
3 changes: 2 additions & 1 deletion docs/History.md
@@ -1,6 +1,7 @@
## Update history

-- **20240121**: Added GenerateSpeed for measuring token generation speed, fixed a t5 bug when use_states=True, fixed a bug in hierarchical position encoding
+- **20240204**: Added the deepseek_moe model, fixed a concurrency error in generation, reduced large-model latency
+- **20240121**: Reworked `save_pretrained` to save a whole folder, added GenerateSpeed for measuring token generation speed, fixed a t5 bug when use_states=True, fixed a bug in hierarchical position encoding
 - **20240116**: Bug fixes, added `save_pretrained` for saving weights in `transformer` format, added several `embedding` models
 - **20240111**: Stop generating `past_key_values` during `training`, added a `streamlit` example, fixed a bug in sentence-embedding `max` pooling, merged `batch_generate` into `generate`, renamed the default `generation` arguments (old names still accepted), `past_key_values` can be kept across multi-turn conversations, moved the `mask` padding logic from `attention` into `apply_embedding`, added a `uie` `pipeline`, added `PtuningV2Trainer`
 - **20231228**: Added a `pipelines` module and moved `chat` into it, added a `Text2Vec` module for embedding generation, added `snapshot_download` for downloading HF models
5 changes: 5 additions & 0 deletions docs/Update.md
@@ -2,6 +2,11 @@

 |Date| bert4torch version | torch4keras version | Release notes |
 |------| ---------------- | ----------------- |----------- |
+|20240204| 0.4.7 | 0.1.9|Reworked `save_pretrained` to save a whole folder, added GenerateSpeed for measuring token generation speed, fixed a t5 bug when use_states=True, fixed a bug in hierarchical position encoding, added the deepseek_moe model, fixed a concurrency error in generation, reduced large-model latency|
+|20240116| 0.4.6 | 0.1.8|Bug fixes, added `save_pretrained` for saving weights in `transformer` format, added several `embedding` models|
+|20240111| 0.4.5 | 0.1.7|Stop generating `past_key_values` during `training`, added a `streamlit` example, fixed a bug in sentence-embedding `max` pooling, merged `batch_generate` into `generate`, renamed the default `generation` arguments (old names still accepted), `past_key_values` can be kept across multi-turn conversations, moved the `mask` padding logic from `attention` into `apply_embedding`, added a `uie` `pipeline`, added `PtuningV2Trainer`|
+|20231228| 0.4.4 | 0.1.7|Added a `pipelines` module and moved `chat` into it, added a `Text2Vec` module for embedding generation, added `snapshot_download` for downloading HF models|
+|20231224| 0.4.3 | 0.1.7|Added common chat models to `chat`, simplified the code path for calling large models|
 |20231219| 0.4.2 | 0.1.7|`checkpoint_path` now accepts a folder path, added a `chat` module for quickly publishing demos/APIs, support loading `.safetensors`, raise an informative error for the `meta` device|
 |20231210| 0.4.1 | 0.1.6.post2|Added longlora, added a test module, adapted to torch4keras==0.1.6 (monitor the fit process and send an email alert on errors; resolved the torch 2.0 compile conflict; fixed a clip_grad_norm bug)|
 |20231126| 0.4.0 | 0.1.5 |Fixed a flash_attn bug, stream_generate can return only the last_token|
Binary file modified docs/pics/wechat_group.jpg
17 changes: 9 additions & 8 deletions examples/basic/glm/basic_language_model_chatglm_batch.py
@@ -4,7 +4,8 @@
 import torch
 from bert4torch.models import build_transformer_model
 from transformers import AutoTokenizer
-from bert4torch.generation import AutoRegressiveDecoder, SeqGeneration
+from bert4torch.generation import SeqGeneration
+from bert4torch.snippets import Timeit2
 import time
 import os

@@ -22,29 +23,29 @@

 tokenizer = AutoTokenizer.from_pretrained(dir_path.replace('/', '\\'), trust_remote_code=True)
 encoder = build_transformer_model(config_path=config_path, checkpoint_path=checkpoint_path).to(device)
-generation = SeqGeneration(encoder, tokenizer, start_id=None, end_id=tokenizer.eos_token_id, pad_id=tokenizer.pad_token_id,
+generation = SeqGeneration(encoder, tokenizer, end_id=tokenizer.eos_token_id, pad_id=tokenizer.pad_token_id,
                            mode='random_sample', maxlen=2048, default_rtype='logits', use_states=True)


 print('===============single================')
-start = time.time()
+ti = Timeit2()
 for text in texts:
     response = generation.generate(text, topk=50, topp=0.7, temperature=0.95)
     print(response)
-print(f'Consume: {time.time()-start}s')
+ti('single')


 print('===============batch_cache================')
-start = time.time()
 response = generation.generate(texts, topk=50, topp=0.7, temperature=0.95)
 print(response)
-print(f'Consume: {time.time()-start}s')
+ti('batch_cache')


 print('===============batch_nocache================')
-start = time.time()
 generation = SeqGeneration(encoder, tokenizer, start_id=None, end_id=tokenizer.eos_token_id, pad_id=tokenizer.pad_token_id,
                            mode='random_sample', maxlen=2048, default_rtype='logits', use_states=False)
+ti.restart()
 response = generation.generate(texts, topk=50, topp=0.7, temperature=0.95)
 print(response)
-print(f'Consume: {time.time()-start}s')
+ti('batch_nocache')
+ti.end()
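Judging only from the calls visible in this example, `Timeit2` from `bert4torch.snippets` acts as a reusable stopwatch that replaces the manual `time.time()` deltas the old code used. A rough sketch of that usage pattern, with `time.sleep` standing in for the real workloads:

```python
import time
from bert4torch.snippets import Timeit2  # assumed available, as imported in the example above

ti = Timeit2()            # start the clock
time.sleep(0.1)           # stand-in for the first workload
ti('single')              # report elapsed time under the label 'single'

ti.restart()              # reset the clock before timing the next workload
time.sleep(0.1)           # stand-in for the second workload
ti('batch_nocache')       # report again under a new label
ti.end()                  # close out the timer
```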
14 changes: 7 additions & 7 deletions test/llm/test_llama.py
@@ -271,11 +271,11 @@ def test_yi(model_dir):


 if __name__=='__main__':
-    # test_baichuan('E:/pretrain_ckpt/llama/Baichuan-7B')
-    # test_belle('E:/pretrain_ckpt/llama/belle-llama-7b-2m')
-    # test_chinese_llama_alpaca('E:/pretrain_ckpt/llama/hfl@chinese_alpaca_plus_7b')
-    # test_vicuna('E:/pretrain_ckpt/llama/lmsys@vicuna-7b-v1.5')
-    # test_ziya('E:/pretrain_ckpt/llama/IDEA-CCNL@Ziya-LLaMA-13B-v1.1')
-    # test_llama2('E:/pretrain_ckpt/llama/llama-2-7b-chat')
-    # test_llama('E:/pretrain_ckpt/llama/llama-7b')
+    test_baichuan('E:/pretrain_ckpt/llama/Baichuan-7B')
+    test_belle('E:/pretrain_ckpt/llama/belle-llama-7b-2m')
+    test_chinese_llama_alpaca('E:/pretrain_ckpt/llama/hfl@chinese_alpaca_plus_7b')
+    test_vicuna('E:/pretrain_ckpt/llama/lmsys@vicuna-7b-v1.5')
+    test_ziya('E:/pretrain_ckpt/llama/IDEA-CCNL@Ziya-LLaMA-13B-v1.1')
+    test_llama2('E:/pretrain_ckpt/llama/llama-2-7b-chat')
+    test_llama('E:/pretrain_ckpt/llama/llama-7b')
     test_yi("E:/pretrain_ckpt/llama/01-ai@Yi-6B")
