RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models

We introduce RoleEval, a bilingual benchmark designed to assess the memorization, utilization, and reasoning capabilities of large language models with respect to role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions about 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. These questions cover basic knowledge and multi-hop reasoning abilities, aiming to systematically probe aspects such as the characters' personal information, relationships, abilities, and experiences. To maintain high quality, we apply a hybrid quality-check process combining automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative.

NOTE: The English version of RoleEval is under internal review and will be released later.
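
To make the data layout concrete, here is a minimal sketch of what one Chinese-English parallel multiple-choice item could look like. The field names and the example content are illustrative assumptions, not the official release format; please consult the released data for the actual schema.

```python
# Hypothetical shape of a RoleEval-style parallel item (field names are
# illustrative only, not the official schema).
example_item = {
    "character": "Sun Wukong",           # one of the 300 people/characters
    "domain": "Fiction",                  # celebrities, anime, games, ...
    "question_zh": "孙悟空的武器是什么？",
    "question_en": "What is Sun Wukong's weapon?",
    "choices_en": ["Ruyi Jingu Bang", "Nine-Tooth Rake", "Crescent Shovel", "Sword"],
    "answer": "A",                        # single correct option (A-D)
}
```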

Leaderboard (5-shot)

To submit your model's predictions to our leaderboard, please contact us at thshen@tju.edu.cn for details.

NOTE: * indicates results computed from submitted predictions.
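
All numbers below are 5-shot accuracies. The sketch below shows one way such an evaluation loop could be run over items shaped like the hypothetical example above; the prompt layout, the field names, and `query_model` are assumptions, not the official evaluation script.

```python
import random

def build_prompt(item, demos):
    """Concatenate solved demonstrations followed by the unsolved test question."""
    parts = []
    for d in demos + [item]:
        parts.append(d["question_en"])
        for label, choice in zip("ABCD", d["choices_en"]):
            parts.append(f"{label}. {choice}")
        parts.append("Answer: " + (d["answer"] if d is not item else ""))
    return "\n".join(parts)

def five_shot_accuracy(items, query_model, n_shot=5, seed=0):
    """`query_model` maps a prompt string to a predicted option letter (A-D)."""
    rng = random.Random(seed)
    correct = 0
    for item in items:
        demos = rng.sample([x for x in items if x is not item], n_shot)
        pred = query_model(build_prompt(item, demos)).strip()[:1].upper()
        correct += pred == item["answer"]
    return 100.0 * correct / len(items)
```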

RoleEval (zh): results on the Chinese-language questions

RoleEval-Chinese (2,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-72B | 70.00 | 59.75 | 66.00 | 61.25 | 74.00 | 66.20 |
| Baichuan-NPC-Turbo* | 66.25 | 61.00 | 71.50 | 54.25 | 76.25 | 65.85 |
| Yi-34B | 65.50 | 54.50 | 70.00 | 56.00 | 77.00 | 64.60 |
| GPT-4-1106 | 62.50 | 63.25 | 63.00 | 62.00 | 63.00 | 62.75 |
| GPT-4-0613 | 57.75 | 60.25 | 57.75 | 60.00 | 58.00 | 58.75 |
| Yi-6B | 59.25 | 46.00 | 61.50 | 47.75 | 62.00 | 55.30 |
| Baichuan-NPC-Lite* | 56.00 | 51.75 | 56.75 | 47.50 | 62.00 | 54.80 |
| MiniMax | 54.00 | 55.00 | 52.75 | 57.50 | 54.00 | 54.65 |
| Qwen-14B | 56.25 | 45.50 | 54.75 | 51.50 | 56.75 | 52.95 |
| Baichuan2-13B | 54.75 | 47.75 | 54.00 | 47.50 | 60.00 | 52.80 |
| Skywork-13B | 55.25 | 45.75 | 56.00 | 48.50 | 57.50 | 52.60 |
| Baichuan2-7B | 52.25 | 43.75 | 49.00 | 47.25 | 55.00 | 49.45 |
| ChatGLM3-6B | 50.00 | 44.50 | 48.00 | 44.25 | 58.00 | 48.95 |
| Qwen-7B | 49.00 | 42.00 | 47.50 | 44.75 | 51.25 | 46.90 |
| GPT-3.5-1106 | 47.50 | 46.75 | 41.75 | 44.75 | 38.75 | 43.90 |
| GPT-3.5-0613 | 42.25 | 43.50 | 39.75 | 43.75 | 39.00 | 41.65 |
| Chinese-LLaMA-2-13B | 36.50 | 36.50 | 34.00 | 34.00 | 40.50 | 36.30 |
| LLaMA-2-70B | 36.00 | 38.00 | 36.25 | 36.25 | 34.75 | 36.25 |
| Chinese-LLaMA-2-7B | 34.50 | 29.00 | 33.00 | 30.25 | 36.25 | 32.60 |
| Mistral-7B | 32.50 | 37.50 | 26.25 | 33.25 | 31.50 | 32.20 |
| Falcon-40B | 28.25 | 33.00 | 30.25 | 29.25 | 38.50 | 31.85 |
| LLaMA-65B | 30.00 | 32.25 | 29.00 | 35.50 | 29.00 | 31.15 |
| LLaMA-2-7B | 25.75 | 28.00 | 33.75 | 29.75 | 34.50 | 30.35 |
| LLaMA-30B | 30.00 | 28.75 | 26.00 | 31.75 | 28.00 | 28.90 |
| LLaMA-2-13B | 28.75 | 30.50 | 25.25 | 29.75 | 28.25 | 28.50 |
| Falcon-7B | 24.75 | 30.50 | 31.50 | 29.75 | 25.25 | 28.35 |
| LLaMA-13B | 27.25 | 29.75 | 27.25 | 26.00 | 29.00 | 27.85 |
| LLaMA-7B | 28.50 | 24.75 | 20.50 | 27.75 | 29.00 | 26.10 |
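
The Avg. column in each table is consistent with an unweighted mean of the five category accuracies (the categories appear to contribute equal shares of the questions). A quick check against the Qwen-72B row above:

```python
# Unweighted mean of the five category scores for Qwen-72B (RoleEval-Chinese, zh).
scores = [70.00, 59.75, 66.00, 61.25, 74.00]
print(f"{sum(scores) / len(scores):.2f}")  # 66.20
```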

RoleEval-Global (4,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-1106 | 74.75 | 73.62 | 74.38 | 72.50 | 71.62 | 73.38 |
| GPT-4-0613 | 73.38 | 72.12 | 74.25 | 72.25 | 69.62 | 72.32 |
| Qwen-72B | 72.88 | 63.88 | 70.38 | 56.75 | 73.50 | 67.47 |
| Baichuan-NPC-Turbo* | 72.25 | 65.25 | 64.62 | 55.50 | 72.75 | 66.07 |
| Yi-34B | 72.38 | 60.62 | 69.75 | 53.25 | 73.12 | 65.83 |
| Baichuan-NPC-Lite* | 60.62 | 56.62 | 51.88 | 48.25 | 62.12 | 55.90 |
| MiniMax | 51.75 | 54.50 | 62.62 | 56.75 | 52.75 | 55.67 |
| Qwen-14B | 62.50 | 52.38 | 55.00 | 45.50 | 58.00 | 54.67 |
| Yi-6B | 61.88 | 51.38 | 52.38 | 45.38 | 60.75 | 54.35 |
| Baichuan2-13B | 60.25 | 52.38 | 51.00 | 46.88 | 60.75 | 54.25 |
| Skywork-13B | 59.13 | 51.75 | 51.88 | 44.50 | 58.75 | 53.20 |
| GPT-3.5-1106 | 48.75 | 51.88 | 51.25 | 49.88 | 48.38 | 50.02 |
| ChatGLM3-6B | 56.50 | 47.62 | 48.38 | 41.88 | 54.50 | 49.78 |
| Baichuan2-7B | 56.00 | 49.62 | 45.50 | 40.50 | 52.38 | 48.80 |
| GPT-3.5-0613 | 46.62 | 48.38 | 51.75 | 49.50 | 47.38 | 48.73 |
| Qwen-7B | 54.75 | 44.38 | 44.62 | 42.75 | 53.00 | 47.90 |
| LLaMA-2-70B | 53.50 | 43.25 | 39.25 | 40.25 | 47.25 | 44.70 |
| Chinese-LLaMA-2-13B | 45.38 | 38.25 | 39.88 | 31.87 | 42.12 | 39.50 |
| Falcon-40B | 39.62 | 32.25 | 32.38 | 30.00 | 45.00 | 35.85 |
| Chinese-LLaMA-2-7B | 35.62 | 36.75 | 35.62 | 35.38 | 34.38 | 35.55 |
| LLaMA-2-7B | 37.00 | 29.88 | 28.75 | 34.50 | 38.25 | 33.67 |
| LLaMA-2-13B | 36.50 | 34.00 | 33.00 | 31.87 | 31.75 | 33.42 |
| Mistral-7B | 36.12 | 33.50 | 32.00 | 30.25 | 35.00 | 33.38 |
| LLaMA-65B | 32.12 | 31.87 | 32.75 | 31.00 | 34.88 | 32.52 |
| LLaMA-30B | 24.88 | 31.13 | 30.25 | 27.75 | 28.62 | 28.52 |
| LLaMA-13B | 28.50 | 28.50 | 28.25 | 26.50 | 27.75 | 27.90 |
| LLaMA-7B | 25.50 | 31.87 | 25.87 | 26.00 | 28.88 | 27.62 |
| Falcon-7B | 23.88 | 28.12 | 24.50 | 28.00 | 28.12 | 26.52 |

RoleEval (en): results on the English-language questions

RoleEval-Chinese (2,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0613 | 54.25 | 61.75 | 63.00 | 63.00 | 63.00 | 61.00 |
| GPT-4-1106 | 57.50 | 63.50 | 60.00 | 62.50 | 58.00 | 60.30 |
| Yi-34B | 56.00 | 52.00 | 47.50 | 55.00 | 57.00 | 53.50 |
| Qwen-72B | 52.75 | 47.50 | 46.50 | 54.25 | 50.50 | 50.30 |
| GPT-3.5-0613 | 42.00 | 47.75 | 42.50 | 42.25 | 45.50 | 44.00 |
| GPT-3.5-1106 | 38.25 | 45.50 | 44.00 | 44.50 | 46.00 | 43.65 |
| LLaMA-2-70B | 43.25 | 41.50 | 40.25 | 47.50 | 43.50 | 43.20 |
| Yi-6B | 42.25 | 38.50 | 41.50 | 44.25 | 45.00 | 42.30 |
| Qwen-14B | 41.00 | 38.75 | 38.25 | 43.25 | 41.00 | 40.45 |
| LLaMA-65B | 41.50 | 38.50 | 33.50 | 43.25 | 37.50 | 38.85 |
| ChatGLM3-6B | 36.25 | 36.25 | 35.25 | 42.25 | 43.50 | 38.70 |
| Skywork-13B | 39.25 | 34.50 | 38.25 | 41.75 | 38.50 | 38.45 |
| MiniMax | 34.00 | 39.50 | 40.75 | 38.25 | 39.00 | 38.30 |
| Qwen-7B | 36.25 | 36.00 | 36.25 | 42.25 | 40.00 | 38.15 |
| Baichuan2-7B | 37.25 | 35.75 | 33.00 | 40.25 | 37.00 | 36.65 |
| Mistral-7B | 35.75 | 42.00 | 30.00 | 41.75 | 31.50 | 36.20 |
| Baichuan2-13B | 35.50 | 36.50 | 31.25 | 42.25 | 34.75 | 36.05 |
| Falcon-40B | 34.00 | 38.25 | 30.75 | 38.75 | 35.25 | 35.40 |
| LLaMA-30B | 34.75 | 35.75 | 30.75 | 40.00 | 35.00 | 35.25 |
| Chinese-LLaMA-2-13B | 34.00 | 38.50 | 27.75 | 37.50 | 34.00 | 34.35 |
| LLaMA-2-13B | 30.50 | 36.50 | 33.25 | 36.50 | 33.25 | 34.00 |
| LLaMA-13B | 32.75 | 31.75 | 30.75 | 38.50 | 32.00 | 33.15 |
| LLaMA-2-7B | 28.75 | 29.25 | 32.75 | 37.50 | 32.25 | 32.10 |
| Chinese-LLaMA-2-7B | 30.50 | 27.75 | 33.00 | 30.50 | 27.75 | 29.90 |
| LLaMA-7B | 24.00 | 27.50 | 29.75 | 33.00 | 29.25 | 28.70 |
| Falcon-7B | 27.25 | 27.75 | 27.75 | 29.75 | 28.50 | 28.20 |

RoleEval-Global (4,000 questions)

| Model | Celebrities | Anime and Comics | Movies and TV Series | Games | Fiction | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0613 | 77.62 | 79.50 | 73.12 | 74.88 | 75.00 | 76.02 |
| GPT-4-1106 | 75.12 | 78.75 | 75.00 | 76.12 | 75.00 | 76.00 |
| Yi-34B | 73.12 | 61.75 | 67.88 | 57.12 | 67.25 | 65.42 |
| Qwen-72B | 70.12 | 62.00 | 69.00 | 55.75 | 69.50 | 65.27 |
| LLaMA-2-70B | 63.25 | 57.38 | 59.00 | 50.00 | 63.25 | 58.58 |
| GPT-3.5-0613 | 57.38 | 59.62 | 58.13 | 59.50 | 57.50 | 58.43 |
| GPT-3.5-1106 | 58.75 | 56.62 | 55.75 | 58.00 | 55.00 | 56.82 |
| MiniMax | 54.87 | 56.38 | 53.50 | 54.12 | 51.38 | 54.05 |
| Yi-6B | 59.25 | 52.00 | 54.12 | 47.50 | 56.25 | 53.82 |
| Qwen-14B | 61.12 | 49.00 | 53.87 | 45.38 | 56.12 | 53.10 |
| LLaMA-65B | 58.13 | 50.50 | 54.37 | 47.62 | 54.50 | 53.02 |
| Baichuan2-13B | 56.12 | 47.50 | 51.50 | 45.62 | 54.00 | 50.95 |
| Skywork-13B | 56.25 | 46.75 | 51.62 | 44.38 | 53.62 | 50.52 |
| Mistral-7B | 54.87 | 46.75 | 49.62 | 44.25 | 52.25 | 49.55 |
| ChatGLM3-6B | 55.12 | 46.62 | 49.25 | 43.25 | 52.62 | 49.37 |
| LLaMA-30B | 51.62 | 46.88 | 48.62 | 43.12 | 52.62 | 48.57 |
| Qwen-7B | 53.87 | 46.12 | 48.12 | 40.00 | 51.12 | 47.85 |
| Baichuan2-7B | 51.00 | 45.12 | 49.00 | 42.12 | 50.00 | 47.45 |
| Falcon-40B | 47.38 | 45.00 | 49.62 | 43.12 | 50.00 | 47.02 |
| Chinese-LLaMA-2-13B | 47.75 | 46.00 | 46.88 | 45.00 | 48.38 | 46.80 |
| LLaMA-2-13B | 49.38 | 43.50 | 46.50 | 44.25 | 48.25 | 46.38 |
| LLaMA-13B | 39.38 | 40.25 | 39.88 | 40.62 | 43.00 | 40.63 |
| LLaMA-2-7B | 38.88 | 37.00 | 37.50 | 41.62 | 42.38 | 39.48 |
| Chinese-LLaMA-2-7B | 36.50 | 30.75 | 31.75 | 36.25 | 39.50 | 34.95 |
| LLaMA-7B | 29.38 | 30.50 | 29.25 | 33.50 | 28.50 | 30.23 |
| Falcon-7B | 26.25 | 27.75 | 28.50 | 29.38 | 31.00 | 28.58 |

Citation

If you find our work useful, please cite our paper:

```
@article{shen2023roleeval,
  title={RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models},
  author={Tianhao Shen and Sun Li and Deyi Xiong},
  year={2023},
  eprint={2312.16132},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
