CriticBench
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

[🌐 Website][📜 Paper][🤗 Dataset][🐱 GitHub]

Repo for "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"

💡 Introduction

The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning, and analyze the key factors affecting LLM critical reasoning.

Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in critique and correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing pattern, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.


Figure 1: An overview of the CriticBench construction.

🚀 Quick Start

⚙️ Setup

Cloning the repository

git clone git@github.com:CriticBench/CriticBench.git 
cd CriticBench/src

Preparing conda env

conda create -n criticbench python=3.10
conda activate criticbench

Install a version of torch compatible with your device, then install the remaining dependencies:

pip install -r requirements.txt

⚖️ Evaluation

You can evaluate a model's generation (G), critique (Q), and correction (C) abilities with the following commands.

Evaluate with a specified model

Some models require access permissions, which can be set with the following commands:

export HUGGING_FACE_HUB_TOKEN=<Your Huggingface token>
export OPENAI_API_KEY=<Your OpenAI API key>

Huggingface model

python evaluate.py \
    --available_gpus <GPU_IDs> \
    --tasks GQC \
    --prompt_type fs \
    --hf_model <model-name> \
    --enable_code_execution

Huggingface critique model

We provide support for Auto-J and UltraCM. You can evaluate these models with the following command.

python evaluate.py  \
    --available_gpus <GPU_IDs> \
    --tasks Q \
    --hf_critic_model <model-name> \
    --enable_code_execution

OpenAI model

python evaluate.py \
    --tasks GQC \
    --prompt_type fs \
    --openai_model <model-name> \
    --enable_code_execution

  • --tasks specifies which tasks to evaluate, with the available options being:
    • GQC for a combination of generation, critique, and correction;
    • QC for critique and correction;
    • G, Q, or C for generation, critique, or correction individually;
    • Note that correction tasks ("C") should be executed after critique tasks ("Q") or require a specified critique result file; see the sketch after this list.
  • --prompt_type further specifies the prompts used for critique and correction during evaluation:
    • fs: few-shot prompt for both critique and correction;
    • zs-crit-cot: zero-shot chain-of-thought prompt for critique;
    • zs-crit-ao-1, zs-crit-ao-2, and zs-crit-ao-3: three distinct zero-shot answer-only prompts for critique;
    • In correction, zero-shot prompts always use chain of thought (cot).
  • --enable_code_execution enables code execution for generation and correction tasks.
  • --available_gpus specifies which GPUs to use, identified by their IDs (e.g., 0,1).
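
Because correction depends on critique output, one way to run the two steps is to script them explicitly. The sketch below is not part of the repository: the model name and critique output path are hypothetical placeholders, and the exact location where evaluate.py writes its results may differ in your setup.

# A minimal sketch of chaining a critique (Q) run and a correction (C) run.
# The model name and critique output path are hypothetical placeholders;
# check where evaluate.py actually writes its results before reusing them.
import subprocess

model = "meta-llama/Llama-2-7b-chat-hf"        # hypothetical HF model
crit_file = "outputs/llama2-7b_fs_crit.jsonl"  # hypothetical path to the Q output

# Step 1: run the critique task only.
subprocess.run([
    "python", "evaluate.py",
    "--available_gpus", "0,1",
    "--tasks", "Q",
    "--prompt_type", "fs",
    "--hf_model", model,
], check=True)

# Step 2: run correction, reusing the critique results produced above.
subprocess.run([
    "python", "evaluate.py",
    "--available_gpus", "0,1",
    "--tasks", "C",
    "--prompt_type", "fs",
    "--hf_model", model,
    "--enable_code_execution",
    "--existed_crit_file", crit_file,
], check=True)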

Evaluate with existing results

You can specify paths to existing result files using --existed_gen_file, --existed_crit_file, and --existed_corr_file. For accurate answer extraction, make sure --prompt_type matches the prompts used to produce those results. Here is an example:

python evaluate.py \
    --tasks GQC \
    --prompt_type fs \
    --enable_code_execution \
    --existed_gen_file <path to generation result file> \
    --existed_crit_file <path to critique result file> \
    --existed_corr_file <path to correction result file>

Here's an example of what a JSON line in a generation result file might look like:

{
  "id": 0, 
  "final_prompt": "The final prompt for LLMs",
  "generation_result": "LLM's result for the generation task"
}
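
If you want to inspect or post-process such a file, a few lines of Python are enough. This is only a sketch: it assumes the result file is in JSON Lines format with the fields shown above, and the filename is a hypothetical placeholder.

# A minimal sketch for reading a generation result file in JSON Lines format.
# The path below is a hypothetical placeholder.
import json

path = "outputs/generation_results.jsonl"
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Each record carries the prompt sent to the LLM and its raw output.
        print(record["id"], record["generation_result"][:80])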

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@misc{lin2024criticbench,
  title={CriticBench: Benchmarking LLMs for Critique-Correct Reasoning}, 
  author={Zicheng Lin and Zhibin Gou and Tian Liang and Ruilin Luo and Haowei Liu and Yujiu Yang},
  year={2024},
  eprint={2402.14809},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
