CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning

Dingling Xu1, Ruobing Wang2, Qingfei Zhao2, Yukun Yan3, Zhichun Wang1, Daren Zha2, Shi Yu3, Zhenghao Liu4, Shuo Wang3, Xu Han3, Maosong Sun3

1Beijing Normal University, 2Institute of Information Engineering, Chinese Academy of Sciences, 3Tsinghua University, 4Northeastern University

📖 Introduction

We propose CheckRLM, a framework that employs Retrieval-Augmented Generation (RAG) to promptly identify and correct factual errors within long reasoning chains, thereby aligning them with external knowledge. CheckRLM comprises two components: in-process knowledge claim recognition and localized knowledge coherence correction via retrieval.

⚙️ Installation

Environment Setup

conda create --name checkrlm python=3.12
conda activate checkrlm

git clone https://github.com/AI9Stars/CheckRLM.git
cd CheckRLM
pip install -r requirement.txt

Elasticsearch Setup

Install Elasticsearch for BM25 retrieval:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # Start the server (runs in the foreground)
pkill -f elasticsearch # Stop the server (run from another terminal)
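Before building the BM25 index, it can help to confirm that the server is reachable. The sketch below assumes Elasticsearch is listening on its default HTTP port (9200) on localhost; the helper name `es_is_up` is illustrative and not part of this repository.

```python
import json
import urllib.request


def es_is_up(base_url: str = "http://localhost:9200", timeout: float = 2.0) -> bool:
    """Return True if an Elasticsearch node answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    # The root endpoint of a running node reports a "version" object.
    return isinstance(info, dict) and "version" in info
```

Calling `es_is_up()` after `./bin/elasticsearch` has finished starting should return `True`; a `False` result usually means the server is still booting or is bound to a different host or port.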

📁 Dataset Preparation

We evaluate on five benchmarks: hotpotqa, 2wikimultihopqa, simpleqa, musique, and iirc. For hotpotqa, 2wikimultihopqa, musique, and iirc, the evaluation splits and retrieval corpora follow the IRCoT setup. For simpleqa, we randomly sample 500 instances from the official test set and retrieve against the KILT Wikipedia knowledge source.

All per-benchmark files are under data/{dataset_name}/, with subsampled questions in data/{dataset_name}/test_subsampled.jsonl.
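To inspect a subsampled split, a minimal loader could look like the following. It assumes only that the file is in JSON Lines format (one JSON object per line), as the `.jsonl` extension suggests; the field names inside each record are not documented here, so none are assumed. The function name `load_questions` is hypothetical.

```python
import json
from pathlib import Path


def load_questions(dataset_name: str, root: str = "data") -> list[dict]:
    """Read {root}/{dataset_name}/test_subsampled.jsonl into a list of dicts."""
    path = Path(root) / dataset_name / "test_subsampled.jsonl"
    records = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records
```

For example, `len(load_questions("hotpotqa"))` gives the number of evaluation questions in that split, and `load_questions("hotpotqa")[0].keys()` shows which fields each record carries.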

To fetch retrieval corpora, run the following from the repository root (the script follows IRCoT-style sources and may install helpers such as gdown):

cd CheckRLM
bash src/download_corpus.sh

🛠️ Configuration

Set the project root and the dataset, model, retrieval, and DPO parameters in src/scripts/config.sh.

💾 Build Indices

For hotpotqa, 2wikimultihopqa, musique, and iirc:

  • BM25 retrieval based on Elasticsearch
  • Dense retrieval with FAISS index using embeddings from bge-large-en-v1.5

For simpleqa:

  • Dense retrieval with FAISS index using embeddings from bge-large-en-v1.5

BM25

cd src/scripts
bash run_build_index_bm25.sh
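For intuition about what the BM25 index computes at query time, here is a small pure-Python sketch of the textbook BM25 formula. It is not the repository's code and does not reproduce Elasticsearch's exact scores; `k1=1.2` and `b=0.75` are Elasticsearch's documented defaults, and the whitespace-tokenized toy corpus is made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the tokenized query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "paris is the capital of france".split(),
    "berlin is the capital of germany".split(),
    "the eiffel tower is in paris".split(),
]
scores = bm25_scores("capital france".split(), docs)
```

The first document matches both query terms and ranks highest, the second matches only "capital", and the third matches neither and scores zero.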

Dense Retrieval

cd src/scripts
bash run_build_index_embedding.sh
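A dense index stores one embedding per passage and returns the passages whose vectors lie closest to the query embedding. The sketch below is a brute-force, pure-Python stand-in for exact inner-product search over L2-normalized vectors (that is, cosine similarity). The toy 3-dimensional vectors replace real bge-large-en-v1.5 embeddings, and whether the repository normalizes vectors or uses a flat FAISS index is an assumption, not something stated here.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def search(query: list[float], corpus: list[list[float]],
           k: int = 2) -> list[tuple[int, float]]:
    """Return the top-k (passage index, cosine similarity) pairs."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(d))))
        for i, d in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
top = search([1.0, 0.0, 0.0], corpus, k=2)
```

Here `top` contains passages 0 and 2, in that order, because they point in nearly the same direction as the query. A FAISS index serves the same purpose but is built to scale to millions of passages.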

CheckRLM Inference

We implement four methods from the paper: Direct Reasoning, Vanilla RAG, Post-reasoning Check, and In-reasoning Check. In-reasoning Check achieves the strongest results in our experiments.
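As a rough mental model of In-reasoning Check, the loop below checks each reasoning step as it arrives and patches only the faulty claim before moving on to the next step. This is a toy sketch, not the repository's implementation: the function names, the `FACTS` dictionary, and the string-matching logic are all hypothetical stand-ins for the reasoning model, the check model, and the retriever.

```python
from typing import Callable, Iterable, Optional


def in_reasoning_check(
    steps: Iterable[str],
    extract_claims: Callable[[str], list[str]],
    retrieve: Callable[[str], Optional[str]],
    correct: Callable[[str, str, str], str],
) -> list[str]:
    """Check each step as it is produced and fix it before continuing."""
    checked = []
    for step in steps:
        for claim in extract_claims(step):
            evidence = retrieve(claim)
            # Toy coherence test: the claim disagrees with the evidence.
            if evidence is not None and evidence != claim:
                step = correct(step, claim, evidence)
        checked.append(step)
    return checked


# Hypothetical toy knowledge source standing in for a retrieval corpus.
FACTS = {"capital of Australia": "The capital of Australia is Canberra."}


def toy_extract(step: str) -> list[str]:
    # Treat every sentence that mentions a known topic as a factual claim.
    return [s.strip() + "." for s in step.split(".") if any(t in s for t in FACTS)]


def toy_retrieve(claim: str) -> Optional[str]:
    for topic, fact in FACTS.items():
        if topic in claim:
            return fact
    return None


def toy_correct(step: str, claim: str, evidence: str) -> str:
    # Localized edit: replace only the faulty claim, keep the rest of the step.
    return step.replace(claim, evidence)


chain = [
    "The capital of Australia is Sydney. So the answer relates to that city.",
    "Therefore we look up its population.",
]
fixed = in_reasoning_check(chain, toy_extract, toy_retrieve, toy_correct)
```

After the call, the first step reads "The capital of Australia is Canberra. So the answer relates to that city." and the second step is unchanged. A post-reasoning variant would run the same check only once, after the whole chain has been generated.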

Decoding hyperparameters

Configure the reasoning model in src/config/reasoning_model.yaml and the check model in src/config/check_model.yaml. The decoding parameters used for each reasoning and check model in our experiments are:

| Reasoning backbone(s) | Temperature | top_p | top_k |
|---|---|---|---|
| Qwen3-8B, Qwen3-32B | 0.7 | 0.8 | 20 |
| QwQ-32B | 0.6 | 0.95 | 40 |
| DeepSeek-R1-Distill-Llama-70B | 0.6 | 0.95 | -1 |

| Check backbone(s) | Temperature | top_p | top_k |
|---|---|---|---|
| Qwen3-8B | 0.7 | 0.8 | 20 |
| Qwen2.5-14B-Instruct, Qwen2.5-32B-Instruct, Llama-3.3-70B-Instruct | 0 | 0.95 | -1 |
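For scripted experiments, the settings listed above can be kept in a lookup table. The sketch below only transcribes those values into Python; the dictionary and function names are hypothetical, and the actual key names expected by the YAML files are not shown in this README. A temperature of 0 corresponds to greedy decoding, and a top_k of -1 conventionally disables top-k filtering in samplers such as vLLM's, though whether the repository relies on that convention is an assumption.

```python
REASONING_DECODING = {
    "Qwen3-8B": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    "Qwen3-32B": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    "QwQ-32B": {"temperature": 0.6, "top_p": 0.95, "top_k": 40},
    "DeepSeek-R1-Distill-Llama-70B": {"temperature": 0.6, "top_p": 0.95, "top_k": -1},
}

CHECK_DECODING = {
    "Qwen3-8B": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    "Qwen2.5-14B-Instruct": {"temperature": 0.0, "top_p": 0.95, "top_k": -1},
    "Qwen2.5-32B-Instruct": {"temperature": 0.0, "top_p": 0.95, "top_k": -1},
    "Llama-3.3-70B-Instruct": {"temperature": 0.0, "top_p": 0.95, "top_k": -1},
}


def decoding_params(model: str, role: str) -> dict:
    """Return a copy of the decoding settings for a model in a given role.

    role is "reasoning" or "check"; unknown models raise KeyError.
    """
    table = {"reasoning": REASONING_DECODING, "check": CHECK_DECODING}[role]
    return dict(table[model])
```

Note that Qwen3-8B appears in both roles with the same settings, while the instruction-tuned check models all decode greedily.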

Direct Reasoning

bash src/scripts/run_base.sh

Vanilla RAG

bash src/scripts/run_vanilla.sh

Post-reasoning Check

bash src/scripts/run_check_think_offline.sh

In-reasoning Check

bash src/scripts/run_check_think_online.sh

DPO Training (Optional)

Training Data Construction

Before running the data-construction scripts, edit src/config/check_model.yaml and set temperature and top_p to lists of numeric values.

bash src/scripts/gen_dpo_data.sh
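Giving lists instead of single values suggests that the check model is sampled under several decoding settings to obtain varied candidate outputs. How the script combines the two lists is not stated here; the sketch below assumes a full cross product, and both the function name and the example values are illustrative only.

```python
from itertools import product


def expand_sampling_grid(temperatures: list[float],
                         top_ps: list[float]) -> list[dict]:
    """Enumerate every (temperature, top_p) combination as a settings dict."""
    return [
        {"temperature": t, "top_p": p}
        for t, p in product(temperatures, top_ps)
    ]


grid = expand_sampling_grid([0.5, 1.0], [0.8, 0.95])
```

With two values in each list, `grid` holds four settings, starting with `{"temperature": 0.5, "top_p": 0.8}`.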

Training

bash src/scripts/train_dpo.sh

We also provide our DPO training data and the Qwen-2.5-14B-Instruct_DPO model.

📄 Acknowledgement

We acknowledge the following open-source projects that informed our code:

🥰 Citation

@inproceedings{xu-etal-2026-checkrlm,
    title = "{C}heck{RLM}: Effective Knowledge{--}Thought Coherence Checking in Retrieval-Augmented Reasoning",
    author = "Xu, Dingling  and
      Wang, Ruobing  and
      Zhao, Qingfei  and
      Yan, Yukun  and
      Wang, Zhichun  and
      Zha, Daren  and
      Yu, Shi  and
      Liu, Zhenghao  and
      Wang, Shuo  and
      Han, Xu  and
      Sun, Maosong",
    editor = "Liakata, Maria  and
      Moreira, Viviane P.  and
      Zhang, Jiajun  and
      Jurgens, David",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.1780/",
    doi = "10.18653/v1/2026.acl-long.1780",
    pages = "38403--38426",
    ISBN = "979-8-89176-390-6",
    abstract = "Reasoning Language Models (RLMs) have significantly improved performance on complex tasks by extending the reasoning chain. However, these chains are prone to containing factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose **CheckRLM**, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by timely checking and correcting factual errors. Specifically, CheckRLM extracts factual claims from the reasoning chain to identify and localize subtle knowledge inconsistencies during inference. Upon detection of errors, a refinement mechanism performs minimal-cost yet precise corrections by leveraging external knowledge, ensuring coherence between the reasoning chain and correct knowledge. Extensive experiments demonstrate that CheckRLM substantially outperforms existing baselines, exhibiting a strong capability to mitigate error accumulation in long-horizon reasoning with lower costs. The code and data are available at https://github.com/AI9Stars/CheckRLM."
}

CheckRLM: Effective Knowledge–Thought Coherence Checking in Retrieval-Augmented Reasoning (ACL 2026 Main)