🔎 Data | 🔨 Code | 🤗 Huggingface Leaderboard | 📑 Paper
🤖ConvRe🤯 is the benchmark proposed in our EMNLP 2023 main conference paper: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations.
It aims to evaluate LLMs' ability to understand converse relations.
A converse relation is defined as the opposite of a semantic relation while keeping the surface form of the triple unchanged.
For example, the triple (x, has part, y) is interpreted as "x has a part called y" under the normal relation, but as "y has a part called x" under the converse relation 🔁.
The experiments in our paper suggest that LLMs often resort to shortcut learning (i.e., superficial correlations) and still face challenges on our 🤖ConvRe🤯 benchmark, even for powerful models like GPT-4. The following figure shows the performance of GPT models under the zero-shot easy/hard settings on our benchmark. On both the Re2Text
and Text2Re
tasks, GPT models exhibit a positive scaling trend under the easy setting and an inverse scaling trend under the hard setting. Please check our paper 📑 or the Huggingface leaderboard 🤗 for more detailed and comprehensive results.
Read this in 中文.
- [2023/10/09] The ConvRe benchmark has been released 🌟.
- [2023/10/08] ConvRe has been accepted by EMNLP 2023.
The ConvRe benchmark is composed of 17 relations and 1240 triples drawn from six widely used knowledge graph datasets: WN18RR, FB15K-237, NELL-ONE, Wikidata5M, ICEWS14, and ConceptNet5. The number of triples for each relation in the benchmark is listed below.
Relation | # Triples | Source |
---|---|---|
hypernym | 80 | WN18RR |
has part | 78 | WN18RR |
organization, organization relationship, child | 75 | FB15K-237 |
location, location, partially contains | 77 | FB15K-237 |
athlete beat athlete | 80 | NELL-ONE |
parent of | 145 | NELL-ONE & Wikidata5M |
represented by | 79 | Wikidata5M |
side effect | 8 | Wikidata5M |
has facility | 62 | Wikidata5M |
influenced by | 65 | Wikidata5M |
owned by | 51 | Wikidata5M |
consult | 73 | ICEWS14 |
praise or endorse | 78 | ICEWS14 |
made of | 80 | ConceptNet5 |
used for | 79 | ConceptNet5 |
has property | 55 | ConceptNet5 |
has subevent | 75 | ConceptNet5 |
Total | 1240 | |
The dataset files can be found in the `data` directory. Each file is described below, followed by a short loading snippet.

- `re2text_relations.json`: the normal and converse relation definitions and the corresponding answer choices for each relation in the `re2text` task.
- `re2text_examples.json`: the few-shot examples for the `re2text` task, covering the `normal` prompt setting and the `hint+cot` setting.
- `text2re_relations`: the normal and converse relation definitions and the corresponding answer choices for each relation in the `text2re` task.
- `text2re_examples.json`: the few-shot examples for the `text2re` task, covering the `normal` prompt setting and the `hint+cot` setting.
- `triple_dataset`: the full dataset of the benchmark, including triples and correct answers.
- `triple_subset`: the subset used in our paper; it contains 328 triples and their corresponding correct answers.
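As a quick sanity check, the JSON prompt files can be inspected with a few lines of Python. This is only a minimal sketch; it does not assume anything about the internal layout of each file beyond it being valid JSON.

```python
import json
from pathlib import Path

data_dir = Path("data")

# Load the Re2Text prompt definitions and few-shot examples.
with open(data_dir / "re2text_relations.json", encoding="utf-8") as f:
    re2text_relations = json.load(f)
with open(data_dir / "re2text_examples.json", encoding="utf-8") as f:
    re2text_examples = json.load(f)

# Print how many top-level entries each file contains.
print(f"re2text_relations.json: {len(re2text_relations)} entries")
print(f"re2text_examples.json: {len(re2text_examples)} entries")
```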
The models listed below have been tested and can be run directly using the scripts in Inference.
GPT TEXT MODELS
- text-ada-001
- text-babbage-001
- text-curie-001
- text-davinci-003
- gpt-3.5-turbo
- gpt-3.5-turbo-0301
- gpt-4
- gpt-4-0314
Claude MODELS
- claude-1.3
- claude-instant-1.1
FLAN-T5 MODELS
- flan-t5-small
- flan-t5-base
- flan-t5-large
- flan-t5-xl
- flan-t5-xxl
LLAMA2 CHAT MODELS
- llama-2-7b-chat-hf
- llama-2-13b-chat-hf
- llama-2-70b-chat-hf
QWEN CHAT MODELS
- qwen-7b-chat
- qwen-14b-chat
INTERNLM MODELS
- internlm-chat-7b
- internlm-chat-20b
Our benchmark is available on Huggingface 🤗 (link). You can easily run the inference with `main_hf.py` by specifying the following three arguments.

- `model_name`: the name of the large language model; see our supported model list.
- `task`: the subtask of the ConvRe benchmark, `text2re` or `re2text`.
- `setting`: the prompt setting for the current run (`prompt1` to `prompt12`); please refer to our paper (LINK) for more details on each setting.
Example

Here is the script to run `prompt4` of the `re2text` task on `text-davinci-003` 👇
python3 main_hf.py --model_name text-davinci-003 --task re2text --setting prompt4
We also provide a more flexible way to run the experiments via `main.py`. There are eight arguments you need to specify.

- `model_name`: the name of the large language model you want to use; see our supported model list.
- `task`: the subtask of the ConvRe benchmark, `text2re` or `re2text`.
- `data_dir`: the directory where the dataset is stored.
- `prompt`: the type of prompt to use in the experiment: `normal`, `hint`, or `hint+cot`.
- `relation`: the relation type to use in the experiment: `normal` for the normal relation and `converse` for the converse relation.
- `n_shot`: the number of few-shot examples; choose a number in [0, 1, 2, 3, 4, 5, 6].
- `example_type`: the type of few-shot examples, `hard` or `regular`.
- `text_type`: the type of text to use in the experiment, `regular` or `hard`.
The argument settings for each of the 12 prompts used in our paper are listed below.
Prompt ID | prompt | relation | n_shot | example_type | text_type |
---|---|---|---|---|---|
re2text 1# | normal | normal | 0 | regular | regular |
text2re 1# | normal | normal | 0 | regular | hard |
re2text 2# | normal | normal | 0 | regular | hard |
text2re 2# | normal | normal | 0 | regular | regular |
re2text 3# | normal | converse | 0 | regular | regular |
text2re 3# | normal | converse | 0 | regular | hard |
re2text 4# | normal | converse | 0 | regular | hard |
text2re 4# | normal | converse | 0 | regular | regular |
re2text 5# | hint | converse | 0 | regular | regular |
text2re 5# | hint | converse | 0 | regular | hard |
re2text 6# | hint | converse | 0 | regular | hard |
text2re 6# | hint | converse | 0 | regular | regular |
7# | normal | converse | 3 | hard | hard |
8# | hint+cot | converse | 3 | hard | hard |
9# | normal | converse | 6 | hard | hard |
10# | normal | converse | 3 | regular | hard |
11# | hint+cot | converse | 3 | regular | hard |
12# | normal | converse | 6 | regular | hard |
Example

Here is the script to run `prompt3` of the `text2re` task on `gpt-3.5-turbo-0301` 👇
python3 main.py --model_name gpt-3.5-turbo-0301 --task text2re --data_dir data --prompt normal --relation converse --n_shot 0 --example_type regular --text_type hard
There are three arguments that need to be specified when running the evaluation script; an illustrative command follows the list.

- `file_path`: the path of the result file 📁.
- `model_family`: the model family of the result file, used to choose the corresponding evaluator. Choose from `flan-t5`, `claude`, `gpt-text`, `gpt-chat`, `llama2`, `qwen`, `internlm`.
- `mode`: we provide two evaluation modes, `strict` and `auto`. `strict` mode raises an error if the model's answer is not in the expected form; in that case, you should check the model's answer manually. `auto` mode simply ignores inconsistent answers. The performance calculated under `auto` mode may be lower than under `strict` mode, but it is very convenient and needs no human effort. 💡 The ability to follow the user's requested answer format is itself an important indicator of an LLM's capability.
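As an illustration only, assuming the evaluation entry point is a script named `evaluate.py` (the actual file name in the repository may differ) and that the three arguments above are passed as command-line flags with the same names, a run could look like this; the result file path is also hypothetical:

```bash
python3 evaluate.py --file_path results/re2text_prompt4_text-davinci-003.json --model_family gpt-text --mode auto
```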
First, create a new class that inherits from `LanguageModels` in `llms_interface.py`, and then implement the `completion` method according to the characteristics of your model (such as the structure of its API), as sketched below.
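Here is a minimal sketch (not the exact repository code) of what such a class could look like. Only the names `LanguageModels`, `completion`, and `llms_interface.py` come from the description above; the constructor and the stubbed return value are assumptions for illustration.

```python
from llms_interface import LanguageModels


class MyNewModel(LanguageModels):
    """Toy example of plugging in a new model; replace the stub with a real API call."""

    def __init__(self, model_name: str, api_key: str = ""):
        # Assumed constructor: store whatever your backend needs (endpoint, key, ...).
        self.model_name = model_name
        self.api_key = api_key

    def completion(self, prompt: str) -> str:
        # Replace this stub with the actual request to your model
        # (an HTTP API, a local transformers pipeline, etc.) and
        # return the model's raw text answer for the given prompt.
        return "A"
```

Depending on how the inference scripts choose between models, you will likely also need to register the new class in their model-selection logic.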
After obtaining the results, create a new class that inherits from `BaseEvaluator` in `llms_evaluator.py`, and then implement the `evaluate` method according to the pattern of your model's answers, as sketched below.
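A corresponding evaluator sketch might look like the following; the exact signature and return convention of `evaluate` are assumptions, so adapt them to how `BaseEvaluator` is actually defined in `llms_evaluator.py`.

```python
from llms_evaluator import BaseEvaluator


class MyNewModelEvaluator(BaseEvaluator):
    """Toy evaluator sketch: extract the chosen option from a raw model answer."""

    def evaluate(self, answer: str) -> str:
        # Adapt this pattern matching to how your model phrases its answers,
        # e.g. "The answer is (A)." versus a bare "A".
        text = answer.strip().upper()
        for option in ("A", "B"):
            if text.startswith(option) or f"({option})" in text:
                return option
        # Unparsable answer: how this should be reported (error vs. skip)
        # depends on the strict/auto evaluation mode described above.
        return ""
```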
To add a new relation to the benchmark, first check whether the relation meets the requirements in Section 2.5 of our paper. Then write the corresponding prompts for both the Re2Text and Text2Re tasks; the required fields are listed below, followed by a skeleton example.
Re2Text

Note: in this task, every question asks for the head entity.

- `normal`: the `normal` instruction of the relation.
- `converse`: the `converse` instruction of the relation.
- `normal-regular`: the `regular` description of the question under the `normal` relation.
- `normal-hard`: the `hard` description of the question under the `normal` relation.
- `converse-regular`: the `regular` description of the question under the `converse` relation.
- `converse-hard`: the `hard` description of the question under the `converse` relation.
Text2Re

- `normal`: the `normal` instruction of the relation.
- `converse`: the `converse` instruction of the relation.
- `hard`: the `hard` description of the question.
- `regular`: the `regular` description of the question.
- `normal-correct`: the `correct` choice under the `normal` relation.
- `normal-wrong`: the `wrong` choice under the `normal` relation.
- `converse-correct`: the `correct` choice under the `converse` relation.
- `converse-wrong`: the `wrong` choice under the `converse` relation.
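To make the expected structure concrete, here is a skeleton of what a single Text2Re relation entry could look like. The mapping from relation name to a dictionary of these keys is our assumption, and the placeholder strings only name the content each field should hold; the Re2Text file follows the same shape with the keys listed in the Re2Text section.

```json
{
  "has part": {
    "normal": "<normal instruction of the relation>",
    "converse": "<converse instruction of the relation>",
    "regular": "<regular question description>",
    "hard": "<hard question description>",
    "normal-correct": "<correct choice under the normal relation>",
    "normal-wrong": "<wrong choice under the normal relation>",
    "converse-correct": "<correct choice under the converse relation>",
    "converse-wrong": "<wrong choice under the converse relation>"
  }
}
```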
Feel free to add new models and relations to our benchmark 🥰
@misc{qi2023investigation,
title={An Investigation of LLMs' Inefficacy in Understanding Converse Relations},
author={Chengwen Qi and Bowen Li and Binyuan Hui and Bailin Wang and Jinyang Li and Jinwang Wu and Yuanjun Laili},
year={2023},
eprint={2310.05163},
archivePrefix={arXiv},
primaryClass={cs.CL}
}