RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models

We introduce RoleEval, a bilingual benchmark designed to assess the memorization, utilization, and reasoning capabilities of large language models with respect to role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions about 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. These questions cover basic knowledge and multi-hop reasoning abilities, aiming to systematically probe aspects such as the characters' personal information, relationships, abilities, and experiences. To maintain high quality, we apply a hybrid quality-check process combining automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative.

NOTE: The English version of RoleEval is under internal review and will be released later.
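
To make the data layout concrete, here is a minimal sketch of what one Chinese-English parallel multiple-choice item could look like. The field names and the example content are illustrative assumptions, not the official release format; please consult the released data for the actual schema.

```python
# Hypothetical shape of a RoleEval-style parallel item (field names are
# illustrative only, not the official schema).
example_item = {
    "character": "Sun Wukong",           # one of the 300 people/characters
    "domain": "Fiction",                  # celebrities, anime, games, ...
    "question_zh": "孙悟空的武器是什么？",
    "question_en": "What is Sun Wukong's weapon?",
    "choices_en": ["Ruyi Jingu Bang", "Nine-Tooth Rake", "Crescent Shovel", "Sword"],
    "answer": "A",                        # single correct option (A-D)
}
```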

Leaderboard (5-shot)

To submit your model's predictions to our leaderboard, please contact us at thshen@tju.edu.cn for details.

NOTE: * indicates results computed from submitted predictions.
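
All numbers below are 5-shot accuracies. The sketch below shows one way such an evaluation loop could be run over items shaped like the hypothetical example above; the prompt layout, the field names, and `query_model` are assumptions, not the official evaluation script.

```python
import random

def build_prompt(item, demos):
    """Concatenate solved demonstrations followed by the unsolved test question."""
    parts = []
    for d in demos + [item]:
        parts.append(d["question_en"])
        for label, choice in zip("ABCD", d["choices_en"]):
            parts.append(f"{label}. {choice}")
        parts.append("Answer: " + (d["answer"] if d is not item else ""))
    return "\n".join(parts)

def five_shot_accuracy(items, query_model, n_shot=5, seed=0):
    """`query_model` maps a prompt string to a predicted option letter (A-D)."""
    rng = random.Random(seed)
    correct = 0
    for item in items:
        demos = rng.sample([x for x in items if x is not item], n_shot)
        pred = query_model(build_prompt(item, demos)).strip()[:1].upper()
        correct += pred == item["answer"]
    return 100.0 * correct / len(items)
```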

RoleEval (zh): results on the Chinese-language questions

RoleEval-Chinese (2,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-72B | 70.00 | 59.75 | 66.00 | 61.25 | 74.00 | 66.20 |
| Baichuan-NPC-Turbo* | 66.25 | 61.00 | 71.50 | 54.25 | 76.25 | 65.85 |
| Yi-34B | 65.50 | 54.50 | 70.00 | 56.00 | 77.00 | 64.60 |
| GPT-4-1106 | 62.50 | 63.25 | 63.00 | 62.00 | 63.00 | 62.75 |
| GPT-4-0613 | 57.75 | 60.25 | 57.75 | 60.00 | 58.00 | 58.75 |
| Yi-6B | 59.25 | 46.00 | 61.50 | 47.75 | 62.00 | 55.30 |
| Baichuan-NPC-Lite* | 56.00 | 51.75 | 56.75 | 47.50 | 62.00 | 54.80 |
| MiniMax | 54.00 | 55.00 | 52.75 | 57.50 | 54.00 | 54.65 |
| Qwen-14B | 56.25 | 45.50 | 54.75 | 51.50 | 56.75 | 52.95 |
| Baichuan2-13B | 54.75 | 47.75 | 54.00 | 47.50 | 60.00 | 52.80 |
| Skywork-13B | 55.25 | 45.75 | 56.00 | 48.50 | 57.50 | 52.60 |
| Baichuan2-7B | 52.25 | 43.75 | 49.00 | 47.25 | 55.00 | 49.45 |
| ChatGLM3-6B | 50.00 | 44.50 | 48.00 | 44.25 | 58.00 | 48.95 |
| Qwen-7B | 49.00 | 42.00 | 47.50 | 44.75 | 51.25 | 46.90 |
| GPT-3.5-1106 | 47.50 | 46.75 | 41.75 | 44.75 | 38.75 | 43.90 |
| GPT-3.5-0613 | 42.25 | 43.50 | 39.75 | 43.75 | 39.00 | 41.65 |
| Chinese-LLaMA-2-13B | 36.50 | 36.50 | 34.00 | 34.00 | 40.50 | 36.30 |
| LLaMA-2-70B | 36.00 | 38.00 | 36.25 | 36.25 | 34.75 | 36.25 |
| Chinese-LLaMA-2-7B | 34.50 | 29.00 | 33.00 | 30.25 | 36.25 | 32.60 |
| Mistral-7B | 32.50 | 37.50 | 26.25 | 33.25 | 31.50 | 32.20 |
| Falcon-40B | 28.25 | 33.00 | 30.25 | 29.25 | 38.50 | 31.85 |
| LLaMA-65B | 30.00 | 32.25 | 29.00 | 35.50 | 29.00 | 31.15 |
| LLaMA-2-7B | 25.75 | 28.00 | 33.75 | 29.75 | 34.50 | 30.35 |
| LLaMA-30B | 30.00 | 28.75 | 26.00 | 31.75 | 28.00 | 28.90 |
| LLaMA-2-13B | 28.75 | 30.50 | 25.25 | 29.75 | 28.25 | 28.50 |
| Falcon-7B | 24.75 | 30.50 | 31.50 | 29.75 | 25.25 | 28.35 |
| LLaMA-13B | 27.25 | 29.75 | 27.25 | 26.00 | 29.00 | 27.85 |
| LLaMA-7B | 28.50 | 24.75 | 20.50 | 27.75 | 29.00 | 26.10 |
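
The Avg. column in each table is consistent with an unweighted mean of the five category accuracies (the categories appear to contribute equal shares of the questions). A quick check against the Qwen-72B row above:

```python
# Unweighted mean of the five category scores for Qwen-72B (RoleEval-Chinese, zh).
scores = [70.00, 59.75, 66.00, 61.25, 74.00]
print(f"{sum(scores) / len(scores):.2f}")  # 66.20
```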

RoleEval-Global (4,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-1106 | 74.75 | 73.62 | 74.38 | 72.50 | 71.62 | 73.38 |
| GPT-4-0613 | 73.38 | 72.12 | 74.25 | 72.25 | 69.62 | 72.32 |
| Qwen-72B | 72.88 | 63.88 | 70.38 | 56.75 | 73.50 | 67.47 |
| Baichuan-NPC-Turbo* | 72.25 | 65.25 | 64.62 | 55.50 | 72.75 | 66.07 |
| Yi-34B | 72.38 | 60.62 | 69.75 | 53.25 | 73.12 | 65.83 |
| Baichuan-NPC-Lite* | 60.62 | 56.62 | 51.88 | 48.25 | 62.12 | 55.90 |
| MiniMax | 51.75 | 54.50 | 62.62 | 56.75 | 52.75 | 55.67 |
| Qwen-14B | 62.50 | 52.38 | 55.00 | 45.50 | 58.00 | 54.67 |
| Yi-6B | 61.88 | 51.38 | 52.38 | 45.38 | 60.75 | 54.35 |
| Baichuan2-13B | 60.25 | 52.38 | 51.00 | 46.88 | 60.75 | 54.25 |
| Skywork-13B | 59.13 | 51.75 | 51.88 | 44.50 | 58.75 | 53.20 |
| GPT-3.5-1106 | 48.75 | 51.88 | 51.25 | 49.88 | 48.38 | 50.02 |
| ChatGLM3-6B | 56.50 | 47.62 | 48.38 | 41.88 | 54.50 | 49.78 |
| Baichuan2-7B | 56.00 | 49.62 | 45.50 | 40.50 | 52.38 | 48.80 |
| GPT-3.5-0613 | 46.62 | 48.38 | 51.75 | 49.50 | 47.38 | 48.73 |
| Qwen-7B | 54.75 | 44.38 | 44.62 | 42.75 | 53.00 | 47.90 |
| LLaMA-2-70B | 53.50 | 43.25 | 39.25 | 40.25 | 47.25 | 44.70 |
| Chinese-LLaMA-2-13B | 45.38 | 38.25 | 39.88 | 31.87 | 42.12 | 39.50 |
| Falcon-40B | 39.62 | 32.25 | 32.38 | 30.00 | 45.00 | 35.85 |
| Chinese-LLaMA-2-7B | 35.62 | 36.75 | 35.62 | 35.38 | 34.38 | 35.55 |
| LLaMA-2-7B | 37.00 | 29.88 | 28.75 | 34.50 | 38.25 | 33.67 |
| LLaMA-2-13B | 36.50 | 34.00 | 33.00 | 31.87 | 31.75 | 33.42 |
| Mistral-7B | 36.12 | 33.50 | 32.00 | 30.25 | 35.00 | 33.38 |
| LLaMA-65B | 32.12 | 31.87 | 32.75 | 31.00 | 34.88 | 32.52 |
| LLaMA-30B | 24.88 | 31.13 | 30.25 | 27.75 | 28.62 | 28.52 |
| LLaMA-13B | 28.50 | 28.50 | 28.25 | 26.50 | 27.75 | 27.90 |
| LLaMA-7B | 25.50 | 31.87 | 25.87 | 26.00 | 28.88 | 27.62 |
| Falcon-7B | 23.88 | 28.12 | 24.50 | 28.00 | 28.12 | 26.52 |

RoleEval (en): results on the English-language questions

RoleEval-Chinese (2,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0613 | 54.25 | 61.75 | 63.00 | 63.00 | 63.00 | 61.00 |
| GPT-4-1106 | 57.50 | 63.50 | 60.00 | 62.50 | 58.00 | 60.30 |
| Yi-34B | 56.00 | 52.00 | 47.50 | 55.00 | 57.00 | 53.50 |
| Qwen-72B | 52.75 | 47.50 | 46.50 | 54.25 | 50.50 | 50.30 |
| GPT-3.5-0613 | 42.00 | 47.75 | 42.50 | 42.25 | 45.50 | 44.00 |
| GPT-3.5-1106 | 38.25 | 45.50 | 44.00 | 44.50 | 46.00 | 43.65 |
| LLaMA-2-70B | 43.25 | 41.50 | 40.25 | 47.50 | 43.50 | 43.20 |
| Yi-6B | 42.25 | 38.50 | 41.50 | 44.25 | 45.00 | 42.30 |
| Qwen-14B | 41.00 | 38.75 | 38.25 | 43.25 | 41.00 | 40.45 |
| LLaMA-65B | 41.50 | 38.50 | 33.50 | 43.25 | 37.50 | 38.85 |
| ChatGLM3-6B | 36.25 | 36.25 | 35.25 | 42.25 | 43.50 | 38.70 |
| Skywork-13B | 39.25 | 34.50 | 38.25 | 41.75 | 38.50 | 38.45 |
| MiniMax | 34.00 | 39.50 | 40.75 | 38.25 | 39.00 | 38.30 |
| Qwen-7B | 36.25 | 36.00 | 36.25 | 42.25 | 40.00 | 38.15 |
| Baichuan2-7B | 37.25 | 35.75 | 33.00 | 40.25 | 37.00 | 36.65 |
| Mistral-7B | 35.75 | 42.00 | 30.00 | 41.75 | 31.50 | 36.20 |
| Baichuan2-13B | 35.50 | 36.50 | 31.25 | 42.25 | 34.75 | 36.05 |
| Falcon-40B | 34.00 | 38.25 | 30.75 | 38.75 | 35.25 | 35.40 |
| LLaMA-30B | 34.75 | 35.75 | 30.75 | 40.00 | 35.00 | 35.25 |
| Chinese-LLaMA-2-13B | 34.00 | 38.50 | 27.75 | 37.50 | 34.00 | 34.35 |
| LLaMA-2-13B | 30.50 | 36.50 | 33.25 | 36.50 | 33.25 | 34.00 |
| LLaMA-13B | 32.75 | 31.75 | 30.75 | 38.50 | 32.00 | 33.15 |
| LLaMA-2-7B | 28.75 | 29.25 | 32.75 | 37.50 | 32.25 | 32.10 |
| Chinese-LLaMA-2-7B | 30.50 | 27.75 | 33.00 | 30.50 | 27.75 | 29.90 |
| LLaMA-7B | 24.00 | 27.50 | 29.75 | 33.00 | 29.25 | 28.70 |
| Falcon-7B | 27.25 | 27.75 | 27.75 | 29.75 | 28.50 | 28.20 |

RoleEval-Global (4,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0613 | 77.62 | 79.50 | 73.12 | 74.88 | 75.00 | 76.02 |
| GPT-4-1106 | 75.12 | 78.75 | 75.00 | 76.12 | 75.00 | 76.00 |
| Yi-34B | 73.12 | 61.75 | 67.88 | 57.12 | 67.25 | 65.42 |
| Qwen-72B | 70.12 | 62.00 | 69.00 | 55.75 | 69.50 | 65.27 |
| LLaMA-2-70B | 63.25 | 57.38 | 59.00 | 50.00 | 63.25 | 58.58 |
| GPT-3.5-0613 | 57.38 | 59.62 | 58.13 | 59.50 | 57.50 | 58.43 |
| GPT-3.5-1106 | 58.75 | 56.62 | 55.75 | 58.00 | 55.00 | 56.82 |
| MiniMax | 54.87 | 56.38 | 53.50 | 54.12 | 51.38 | 54.05 |
| Yi-6B | 59.25 | 52.00 | 54.12 | 47.50 | 56.25 | 53.82 |
| Qwen-14B | 61.12 | 49.00 | 53.87 | 45.38 | 56.12 | 53.10 |
| LLaMA-65B | 58.13 | 50.50 | 54.37 | 47.62 | 54.50 | 53.02 |
| Baichuan2-13B | 56.12 | 47.50 | 51.50 | 45.62 | 54.00 | 50.95 |
| Skywork-13B | 56.25 | 46.75 | 51.62 | 44.38 | 53.62 | 50.52 |
| Mistral-7B | 54.87 | 46.75 | 49.62 | 44.25 | 52.25 | 49.55 |
| ChatGLM3-6B | 55.12 | 46.62 | 49.25 | 43.25 | 52.62 | 49.37 |
| LLaMA-30B | 51.62 | 46.88 | 48.62 | 43.12 | 52.62 | 48.57 |
| Qwen-7B | 53.87 | 46.12 | 48.12 | 40.00 | 51.12 | 47.85 |
| Baichuan2-7B | 51.00 | 45.12 | 49.00 | 42.12 | 50.00 | 47.45 |
| Falcon-40B | 47.38 | 45.00 | 49.62 | 43.12 | 50.00 | 47.02 |
| Chinese-LLaMA-2-13B | 47.75 | 46.00 | 46.88 | 45.00 | 48.38 | 46.80 |
| LLaMA-2-13B | 49.38 | 43.50 | 46.50 | 44.25 | 48.25 | 46.38 |
| LLaMA-13B | 39.38 | 40.25 | 39.88 | 40.62 | 43.00 | 40.63 |
| LLaMA-2-7B | 38.88 | 37.00 | 37.50 | 41.62 | 42.38 | 39.48 |
| Chinese-LLaMA-2-7B | 36.50 | 30.75 | 31.75 | 36.25 | 39.50 | 34.95 |
| LLaMA-7B | 29.38 | 30.50 | 29.25 | 33.50 | 28.50 | 30.23 |
| Falcon-7B | 26.25 | 27.75 | 28.50 | 29.38 | 31.00 | 28.58 |

Citation

If you find our work useful, please cite our paper:

```
@article{shen2023roleeval,
  title={RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models},
  author={Tianhao Shen and Sun Li and Deyi Xiong},
  year={2023},
  eprint={2312.16132},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
