update
brightmart committed May 8, 2023
1 parent c035f97 commit 2b40b23
Showing 2 changed files with 49 additions and 40 deletions.
89 changes: 49 additions & 40 deletions README.md
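@@ -158,51 +158,60 @@ SuperCLUE evaluates model capability along three dimensions: basic ability, professional
## SuperCLUE test results
Four tables: summary table, basic ability table, professional ability table, and Chinese characteristics table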

##### The leaderboard is updated regularly. Data source: www.CLUEbenchmarks.com

#### Summary table (v1.0)
| Model | Basic ability + Chinese characteristics | Basic ability | Chinese characteristics | Professional ability | All three abilities (overall) |
| -------- | ---------- | ---- | ---------- | -------- | ---------- |
| <a href="Vicuna-13B"></a> Vicuna-13B | 46.52 | 59.13 | 33.90 | Coming | Coming |
| <a href="MOSS-16B"></a> Moss-16B | 58.27 | 63.73 | 52.80 | 31.54 | 49.36 |
| <a href="ChatGLM-6B"></a> ChatGLM-6B | 61.82 | 66.77 | 56.87 | 32.56? | 52.07 |
| <a href="文心一言"></a> 文心一言 (ERNIE Bot) | 67.44 | 69.97 | 64.90 | - | - |
| <a href="MinMax"></a> MinMax | 72.94 | 70.90 | 74.97 | - | - |
| <a href="ChatGPT-3.5"></a> ChatGPT-3.5 | 73.27 | 82.17 | 64.37 | 57.1 | 67.79 |
| <a href="GPT-4.0"></a> GPT-4.0 | **81.79** | **84.17** | **79.40** | - | - |
##### The leaderboard is updated regularly. Data source: www.CLUEbenchmarks.com

#### Overall leaderboard (v1.0)

| Model | Total score | Basic ability + Chinese characteristics | Basic ability | Academic and professional ability | Chinese characteristics |
|--------------|-------|--------------|--------|------------|-------|
| Human | - | **96.50%** | **98.00%** | - | **95.00%** |
| GPT-4 | 76.77% | 79.00% | 90.00% | **72.32%** | 68.00% |
| ChatGPT | 66.21% | 72.00% | 85.00% | 54.64% | 59.00% |
| 星火 (Spark) | 53.42% | 59.00% | 74.00% | 42.27% | 44.00% |
| MiniMax | 46.35% | 50.50% | 72.00% | 38.04% | 29.00% |
| BELLE-13B | 43.70% | 46.00% | 69.00% | 39.11% | 23.00% |
| ChatGLM-6B | 42.13% | 46.50% | 60.00% | 33.39% | 33.00% |
| MOSS-16B | 36.51% | 39.50% | 52.00% | 30.54% | 27.00% |
| Vicuna-13B | 34.49% | 37.50% | 45.00% | 28.46% | 30.00% |
| 文心一言 (ERNIE Bot) | 32.44% | 32.00% | 40.00% | 33.33% | 24.00% |
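
The README does not state how the two aggregate columns are derived. The figures in the overall leaderboard are, however, consistent with plain unweighted means: the total score matches the mean of the three ability scores, and the combined column matches the mean of basic ability and Chinese characteristics. The sketch below checks that reading on a few rows copied from the table. The function name `aggregate` and the dictionary layout are illustrative assumptions, not part of SuperCLUE.

```python
# Assumption: aggregates are unweighted means. This is inferred from the
# published numbers, not taken from a documented SuperCLUE formula.

# (basic ability, academic and professional ability, Chinese characteristics), in percent
SCORES = {
    "GPT-4": (90.00, 72.32, 68.00),
    "ChatGPT": (85.00, 54.64, 59.00),
    "MiniMax": (72.00, 38.04, 29.00),
    "ChatGLM-6B": (60.00, 33.39, 33.00),
}

# (total score, basic ability + Chinese characteristics) as published in the table
PUBLISHED = {
    "GPT-4": (76.77, 79.00),
    "ChatGPT": (66.21, 72.00),
    "MiniMax": (46.35, 50.50),
    "ChatGLM-6B": (42.13, 46.50),
}


def aggregate(basic, professional, chinese):
    """Return (total, basic_plus_chinese) as unweighted means, rounded to 2 places."""
    total = (basic + professional + chinese) / 3
    basic_plus_chinese = (basic + chinese) / 2
    return round(total, 2), round(basic_plus_chinese, 2)


for model, parts in SCORES.items():
    computed = aggregate(*parts)
    print(model, "computed:", computed, "published:", PUBLISHED[model])
```

The human row has no academic and professional score, so only its combined column can be checked this way.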


#### Basic ability table (v1.0)

| Task type | ChatGPT3.5 | GPT4 | ChatGLM-6B | Moss-16B | Vicuna-13B | 文心一言 (ERNIE Bot) | MinMax |
| -------- | ---------- | ---- | ---------- | -------- | ---------- | -------- | ------ |
| Semantic understanding | 100.00 | **90.70** | 70.00 | 50.00 | 70.00 | 70.30 | 70.30 |
| Generation and creative writing | 77.30 | **82.30** | 68.00 | 63.30 | 62.30 | 68.70 | 63.30 |
| Chitchat | 75.70 | **79.30** | 74.70 | 65.70 | 72.70 | 78.30 | 76.00 |
| Dialogue | 76.00 | **84.30** | 80.00 | 76.70 | 74.00 | 68.00 | 81.00 |
| Encyclopedia and knowledge | **80.30** | 79.00 | 79.30 | 71.00 | 38.00 | 79.30 | 79.70 |
| Logic and reasoning | 77.30 | **84.70** | 24.30 | 28.00 | 30.70 | 35.00 | 45.70 |
| Calculation | 88.00 | **92.30** | 46.30 | 54.00 | 32.70 | 82.00 | 64.70 |
| Code | 85.70 | **89.00** | 68.30 | 81.70 | 78.00 | 80.00 | 76.30 |
| Role play | 82.00 | **86.00** | 80.30 | 78.00 | 74.30 | 77.30 | 78.30 |
| Safety | **79.30** | 74.00 | 76.30 | 69.00 | 58.70 | 60.70 | 73.70 |
| Total | 82.17 | **84.17** | 66.77 | 63.73 | 59.13 | 69.97 | 70.90 |

| Basic ability | Human | ChatGLM-6B | MOSS-16B | Vicuna-13B | GPT-3.5 | GPT-4 | BELLE-13B | 星火 (Spark) | 文心一言 (ERNIE Bot) | MiniMax |
|-------------|------|------------|----------|------------|---------|-------|-----------|------|---------|---------|
| Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy | Accuracy |
| Code | 0.90 | 0.40 | 0.70 | 0.60 | 0.90 | 0.90 | 0.80 | 0.50 | 0.40 | 0.60 |
| Safety | 1.00 | 0.30 | 0.50 | 0.60 | 0.90 | 0.80 | 0.70 | 0.80 | 0.30 | 0.80 |
| Dialogue | 1.00 | 0.70 | 0.50 | 0.30 | 0.90 | 1.00 | 0.80 | 0.90 | 0.20 | 0.90 |
| Generation and creative writing | 0.90 | 0.50 | 0.40 | 0.30 | 1.00 | 0.80 | 0.40 | 0.50 | 0.20 | 0.50 |
| Encyclopedia and knowledge | 1.00 | 0.50 | 0.50 | 0.30 | 0.90 | 1.00 | 0.70 | 0.90 | 0.70 | 1.00 |
| Role play | 1.00 | 1.00 | 0.70 | 0.70 | 1.00 | 1.00 | 0.80 | 1.00 | 0.50 | 0.80 |
| Calculation | 1.00 | 0.50 | 0.30 | 0.40 | 0.60 | 0.70 | 0.40 | 0.60 | 0.20 | 0.60 |
| Semantic understanding | 1.00 | 0.90 | 0.50 | 0.40 | 1.00 | 0.90 | 0.80 | 1.00 | 0.30 | 0.70 |
| Logic and reasoning | 1.00 | 0.40 | 0.10 | 0.40 | 0.30 | 0.90 | 0.50 | 0.30 | 0.30 | 0.40 |
| Chitchat | 1.00 | 0.80 | 1.00 | 0.50 | 1.00 | 1.00 | 1.00 | 0.90 | 0.90 | 0.90 |
| Total | 0.98 | 0.60 | 0.52 | 0.45 | 0.85 | 0.90 | 0.69 | 0.74 | 0.40 | 0.72 |
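
The total row of the accuracy table appears to be the unweighted mean of the ten task accuracies in each column; that is an observation from the numbers, not a documented rule. A minimal sketch, using three columns copied from the table (rows in table order, from code to chitchat) and a hypothetical helper `total_score`:

```python
from statistics import fmean

# Per-task accuracies in table order: code, safety, dialogue, generation,
# encyclopedia, role play, calculation, semantics, logic, chitchat.
TASK_ACCURACY = {
    "Human": [0.90, 1.00, 1.00, 0.90, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "GPT-4": [0.90, 0.80, 1.00, 0.80, 1.00, 1.00, 0.70, 0.90, 0.90, 1.00],
    "ChatGLM-6B": [0.40, 0.30, 0.70, 0.50, 0.50, 1.00, 0.50, 0.90, 0.40, 0.80],
}

# Values published in the total row of the table.
PUBLISHED_TOTAL = {"Human": 0.98, "GPT-4": 0.90, "ChatGLM-6B": 0.60}


def total_score(accuracies):
    """Assumed aggregation: unweighted mean over tasks, rounded to 2 places."""
    return round(fmean(accuracies), 2)


for model, accuracies in TASK_ACCURACY.items():
    print(model, total_score(accuracies), PUBLISHED_TOTAL[model])
```

Because every task carries the same weight under this reading, a weak score on a single task such as logic and reasoning moves the total by at most 0.10.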



#### Chinese characteristics table (v1.0)
| | ChatGPT3.5 | GPT-4 | ChatGLM | MOSS-16B | Vicuna-13B | 文心一言 (ERNIE Bot) | MinMax |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. Idioms | 64.0 | **74.7** | 80.7 | 59.0 | 34.0 | 72.0 | 76.7 |
| 2. Couplets | 64.0 | **73.7** | 45.0 | 52.0 | 16.0 | 51.3 | 66.3 |
| 3. Dialects | 67.3 | **81.3** | 43.7 | 23.3 | 26.0 | 54.7 | 70.0 |
| 4. Classical Chinese | 60.7 | **76.3** | 50.7 | 51.3 | 34.0 | 70.3 | 75.7 |
| 5. Syntax | 69.7 | **92.0** | 76.7 | 64.7 | 64.7 | 60.0 | 83.7 |
| 6. Character form | 81.7 | **94.0** | 47.0 | 57.0 | 31.7 | 84.0 | 83.3 |
| 7. Poetry | 50.3 | 74.0 | 55.3 | 63.0 | 28.0 | 63.3 | **77.3** |
| 8. Literature | 66.3 | **77.7** | 68.7 | 60.0 | 32.0 | 69.0 | **77.7** |
| 9. Xiehouyu (two-part allegorical sayings) | 58.0 | 68.3 | 54.7 | 41.7 | 32.3 | 63.0 | **71.3** |
| 10. Word meaning | 61.7 | **82.0** | 46.3 | 56.0 | 40.3 | 61.3 | 67.7 |
| Total | 64.37 | **79.40** | 56.87 | 52.80 | 33.90 | 64.90 | 74.97 |

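Note: Because 文心一言 (ERNIE Bot) and 讯飞星火 (iFLYTEK Spark) do not provide a test API, their questions were entered manually for evaluation.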

| Sub-ability | Human | ChatGLM-6B | MOSS-16B | Vicuna-13B | GPT-3.5 | GPT-4 | BELLE-13B | 星火 (Spark) | 文心一言 (ERNIE Bot) | MiniMax |
| ---------------- | ---- | --------- | ------- | --------- | ------ | ---- | -------- | ---- | ------ | ------- |
| Idioms | 0.80 | 0.30 | 0.10 | 0.30 | 0.70 | 0.70 | 0.30 | 0.70 | 0.40 | 0.30 |
| Literature | 1.00 | 0.50 | 0.50 | 0.10 | 0.40 | 0.40 | 0.40 | 0.30 | 0.10 | 0.40 |
| Poetry | 1.00 | 0.20 | 0.20 | 0.20 | 0.30 | 0.60 | 0.00 | 0.50 | 0.30 | 0.20 |
| Word-meaning understanding | 0.80 | 0.10 | 0.30 | 0.70 | 0.70 | 0.70 | 0.20 | 0.20 | 0.30 | 0.20 |
| Classical Chinese | 1.00 | 0.30 | 0.10 | 0.30 | 0.30 | 0.60 | 0.20 | 0.40 | 0.20 | 0.20 |
| Couplets | 1.00 | 0.20 | 0.30 | 0.40 | 0.60 | 0.60 | 0.30 | 0.30 | 0.30 | 0.20 |
| Dialects | 1.00 | 0.30 | 0.30 | 0.20 | 0.70 | 0.80 | 0.10 | 0.50 | 0.20 | 0.20 |
| Xiehouyu and proverbs | 1.00 | 0.40 | 0.30 | 0.40 | 0.70 | 0.80 | 0.10 | 0.40 | 0.30 | 0.30 |
| Chinese character form and pinyin understanding | 0.90 | 0.50 | 0.40 | 0.40 | 0.60 | 0.80 | 0.20 | 0.40 | 0.10 | 0.40 |
| Chinese syntactic analysis | 1.00 | 0.50 | 0.20 | 0.00 | 0.90 | 0.80 | 0.50 | 0.70 | 0.20 | 0.50 |
| Average | 0.95 | 0.33 | 0.27 | 0.30 | 0.59 | 0.68 | 0.23 | 0.44 | 0.24 | 0.29 |
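
Because the scores are published as a Markdown table, the average row can be re-derived directly from the table text. The sketch below parses a three-column excerpt of the table above (row labels given in English) and compares each column mean with the value in the last row. The helpers `parse_table` and `check_average_row` are hypothetical names introduced for this illustration, and the parser assumes a simple pipe table with no escaped `|` characters.

```python
TABLE = """
| Sub-ability | Human | GPT-4 | ChatGLM-6B |
| --- | --- | --- | --- |
| Idioms | 0.80 | 0.70 | 0.30 |
| Literature | 1.00 | 0.40 | 0.50 |
| Poetry | 1.00 | 0.60 | 0.20 |
| Word-meaning understanding | 0.80 | 0.70 | 0.10 |
| Classical Chinese | 1.00 | 0.60 | 0.30 |
| Couplets | 1.00 | 0.60 | 0.20 |
| Dialects | 1.00 | 0.80 | 0.30 |
| Xiehouyu and proverbs | 1.00 | 0.80 | 0.40 |
| Character form and pinyin | 0.90 | 0.80 | 0.50 |
| Syntactic analysis | 1.00 | 0.80 | 0.50 |
| Average | 0.95 | 0.68 | 0.33 |
"""


def parse_table(text):
    """Split a simple pipe table into (header cells, body rows)."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
    ]
    return rows[0], rows[2:]  # rows[1] is the |---| separator line


def check_average_row(text):
    """Map each model to (recomputed column mean, published last-row value)."""
    header, body = parse_table(text)
    *data, summary = body
    result = {}
    for col, name in enumerate(header[1:], start=1):
        mean = sum(float(row[col]) for row in data) / len(data)
        result[name] = (round(mean, 2), float(summary[col]))
    return result


for name, (computed, published) in check_average_row(TABLE).items():
    print(name, computed, published)
```

Multiplied by 100, these averages also line up with the Chinese characteristics column of the overall leaderboard (for example 0.68 and 68.00% for GPT-4).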




## Shortcomings and limitations of SuperCLUE
Binary file added bar_chart.png
