How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains
This repository contains the official codebase for working with RMCB, including dataset reconstruction, feature extraction, model training, and evaluation pipelines used in the project.
The public RMCB metadata is hosted on Hugging Face:
🤗 Dataset Hub: https://huggingface.co/datasets/ledengary/RMCB
The hosted dataset contains public metadata only, which is sufficient to reconstruct the full dataset locally when combined with the original source datasets.
This project was tested on NVIDIA H200 GPUs using PyTorch built with CUDA 12.6.
- Python: 3.12.11
- GPU: NVIDIA H200
- NVIDIA Driver: >= 570.195.03
- CUDA runtime (PyTorch): 12.6
- CUDA toolkit (nvcc): 12.0
Other hardware may work but has not been systematically validated.
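As a quick sanity check (not part of the pipeline itself), you can confirm that your local PyTorch build matches the tested configuration before running anything:

```python
# Optional environment check: verify PyTorch's CUDA runtime and the visible GPU.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA runtime:    {torch.version.cuda}")      # expected: 12.6
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")  # tested on NVIDIA H200
```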
```bash
git clone https://github.com/Ledengary/RMCB.git
cd RMCB
```

The environment YAML is exported directly from the development environment and should be used as-is.

```bash
conda env create -f environment.yaml
conda activate rmcb
```

Python 3.12.11 and CUDA 12.6 were used during development and testing.
RMCB is distributed as public metadata only. Full records are reconstructed locally by matching metadata entries to the original source datasets. By running the reconstruction script, you are downloading original datasets directly from their official sources and agreeing to their respective licenses.
Script: `RMCB/reconstruct_rmcb.py`
This script:
- Downloads public metadata if not present
- Retrieves original datasets from official sources
- Reconstructs full records by matching record IDs
- Writes per-model train and test splits
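Conceptually, the matching step is a join on record IDs between the public metadata and the original source records. The sketch below illustrates the idea; the `source_records` structure and the function itself are hypothetical, not the script's actual internals:

```python
import json

def reconstruct_split(metadata_path: str, source_records: dict[str, dict]) -> list[dict]:
    """Join public metadata rows to original source records by record_id.

    `source_records` maps record_id -> {prompt, ground truth, ...} as loaded
    from the official dataset sources (hypothetical helper output).
    """
    full_records = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            meta = json.loads(line)  # record_id, model_id, model_response, grading, dataset
            source = source_records.get(meta["record_id"])
            if source is None:
                continue  # unmatched rows feed the coverage statistics
            full_records.append({**source, **meta})  # full record = source fields + metadata fields
    return full_records
```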
The public metadata file is automatically downloaded if missing.
- File: `rmcb_public_metadata.jsonl`
- Source: Hugging Face
- Contains: `record_id`, `model_id`, `model_response`, `grading`, `dataset`

This file does not include prompts or ground-truth answers.
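Since the metadata is plain JSONL with the fields listed above, it can be inspected directly, e.g. to count records per model:

```python
import json
from collections import Counter

counts = Counter()
with open("rmcb_public_metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record["model_id"]] += 1  # fields: record_id, model_id, model_response, grading, dataset

for model_id, n in counts.most_common():
    print(f"{model_id}: {n} records")
```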
During reconstruction, original datasets are loaded from their official sources. The table below lists all datasets with their sources, revisions, and download methods.
| # | Link | Dataset | Domain | Split | Source | Revision/Version | Date | Notes |
|---|------|---------|--------|-------|--------|------------------|------|-------|
| 1 | 🔗 | GSM8K | Mathematical | Train | HuggingFace: `openai/gsm8k` (config: `socratic`) | `cc7b047b6e5bb11b4f1af84efc572db110a51b3c` | Dec 2023 | Automatic |
| 2 | 🔗 | TATQA | Financial | Train | Manual Download | - | - | See manual download instructions below |
| 3 | 🔗 | MedQA | Medical | Train | Manual Download | `ddef95d268cdad413693d634279a9a679d468469` | Apr 5, 2024 | See manual download instructions below |
| 4 | 🔗 | LEXam | Legal | Train | HuggingFace: `LEXam-Benchmark/LEXam` (configs: `open_question`, `mcq_4_choices`, `mcq_perturbation`) | `68f21a324eb0e14837be42f10b644c40847c3ed4` | Oct 24, 2025 | Automatic |
| 5 | 🔗 | ARC | General | Train | HuggingFace: `allenai/ai2_arc` (config: `ARC-Challenge`) | `210d026faf9955653af8916fad021475a3f00453` | Dec 21, 2023 | Automatic |
| 6 | 🔗 | CommonsenseQA-2 | General | Train | HuggingFace: `chiayewken/commonsense-qa-2` | `15e7dc364f7906ad69cbe4a0bed697ba12f07bdf` | Jan 23, 2024 | Automatic |
| 7 | 🔗 | LogiQA | General | Train | HuggingFace: `lucasmccabe/logiqa` | `3c19b0488d794d30c36f73d132d8a22e64f42f2e` | Feb 7, 2023 | Automatic |
| 8 | 🔗 | OpenBookQA | General | Train | HuggingFace: `allenai/openbookqa` (config: `main`) | `388097ea7776314e93a529163e0fea805b8a6454` | Jan 4, 2024 | Automatic |
| 9 | 🔗 | QuaRTz | General | Train | HuggingFace: `allenai/quartz` | `28c1dbb56caf81799296cb17892fa73402e23464` | Jan 4, 2024 | Automatic |
| 10 | 🔗 | ReClor | General | Train | HuggingFace: `voidful/ReClor` (raw file: `train.json`) | `809ebe44b8dde882c4190f4178b27676b941b933` | May 20, 2023 | Automatic |
| 11 | 🔗 | MATH | Mathematical | Test | Kaggle: `awsaf49/math-dataset` | Version 1 | - | Automatic (requires Kaggle Hub) |
| 12 | 🔗 | FinQA | Financial | Test | Manual Download | `0f16e2867befa6840783e58be38c9efb9229d742` | Jun 5, 2022 | See manual download instructions below |
| 13 | 🔗 | MedMCQA | Medical | Test | HuggingFace: `openlifescienceai/medmcqa` | `91c6572c454088bf71b679ad90aa8dffcd0d5868` | Jan 4, 2024 | Automatic |
| 14 | 🔗 | LegalBench | Legal | Test | HuggingFace: `nguha/legalbench` | `e042ea68c19df12b737fe768572f22ead61e8e37` | Sep 30, 2024 | Automatic |
| 15 | 🔗 | MMLU-Pro | General | Test | HuggingFace: `TIGER-Lab/MMLU-Pro` | `dd36ce4b34827164989f100331f82c5a29741747` | Oct 25, 2025 | Automatic |
| 16 | 🔗 | BBH | General | Test | HuggingFace: `maveriq/bigbenchhard` | `d53c5b10a77edeb29da195f47e6086b29f2f7f74` | Sep 29, 2023 | Automatic (requires `datasets` < 4.0.0) |
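For the automatic Hugging Face entries, reconstruction pins each dataset to the revision listed above. Loading GSM8K at its pinned commit, for instance, looks like this (a sketch using the standard `datasets` API, not an excerpt from the reconstruction script):

```python
from datasets import load_dataset

# Pin GSM8K (config "socratic") to the exact commit from the table, so the
# download is reproducible even if the upstream dataset changes later.
gsm8k = load_dataset(
    "openai/gsm8k",
    "socratic",
    split="train",
    revision="cc7b047b6e5bb11b4f1af84efc572db110a51b3c",
)
print(gsm8k[0])
```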
Three datasets require manual download due to licensing or distribution constraints:

1. FinQA
   - File: `test.json`
   - Source: GitHub repository `czyssrs/FinQA` at commit `0f16e2867befa6840783e58be38c9efb9229d742`
   - Direct link: https://github.com/czyssrs/FinQA/tree/main/dataset
   - Place at: `RMCB/subdatasets/raw_data/finqa/test.json`

2. MedQA
   - File: `train.jsonl` (extracted from `data_clean.zip`)
   - Source: Google Drive: https://drive.google.com/file/d/1ImYUSLk9JbgHXOemfvyiDiirluZHPeQw/view
   - Instructions: Download `data_clean.zip`, extract it, and copy `train.jsonl` from the `questions/US/` directory
   - Place at: `RMCB/subdatasets/raw_data/medqa/train.jsonl`

3. TATQA
   - File: `tatqa_dataset_train.json`
   - Source: GitHub repository `NExTplusplus/TAT-QA`
   - Direct link: https://github.com/NExTplusplus/TAT-QA/tree/master/dataset_raw
   - Place at: `RMCB/subdatasets/raw_data/tatqa/tatqa_dataset_train.json`
After all manual downloads are complete, the `RMCB/subdatasets/raw_data/` directory should have the following structure:
```
RMCB/subdatasets/raw_data/
├── finqa/
│   └── test.json
├── medqa/
│   └── train.jsonl
└── tatqa/
    └── tatqa_dataset_train.json
```
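Before running reconstruction, you can verify that the manually downloaded files are in place (a small helper sketch, not part of the repository):

```python
from pathlib import Path

# Expected locations of the three manually downloaded files.
REQUIRED = [
    "RMCB/subdatasets/raw_data/finqa/test.json",
    "RMCB/subdatasets/raw_data/medqa/train.jsonl",
    "RMCB/subdatasets/raw_data/tatqa/tatqa_dataset_train.json",
]

missing = [p for p in REQUIRED if not Path(p).is_file()]
if missing:
    raise FileNotFoundError(f"Missing manually downloaded files: {missing}")
print("All manually downloaded files are in place.")
```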
Coverage statistics are printed during reconstruction to indicate matching success.
Reconstructed records are saved per model:
```
data/
└── {reconstructed_dir}/
    └── {model_id}/
        ├── train.jsonl
        └── test.jsonl
```
```bash
python rmcb_reconstruction/reconstruct_rmcb.py \
    --public-metadata data/RMCB/rmcb_public_metadata.jsonl \
    --output-dir data/final
```

If the metadata file is missing, it will be downloaded automatically.
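Once reconstruction finishes, a quick sanity check is to count records per split (a throwaway sketch; the model directory name below is hypothetical and assumes slashes in model IDs are normalized for directory names):

```python
from pathlib import Path

model_dir = Path("data/final/Qwen__Qwen2.5-7B-Instruct")  # hypothetical normalized model directory
for split in ("train.jsonl", "test.jsonl"):
    path = model_dir / split
    n = sum(1 for _ in path.open(encoding="utf-8"))
    print(f"{split}: {n} records")
```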
The codebase follows a four-stage pipeline:

Preprocessing → Feature Extraction → Training → Evaluation
Stage 1: Preprocessing

Purpose: Generate reasoning responses from LLMs and grade them to form the finalized datasets.

Inputs:
- Reconstructed RMCB dataset, or
- Raw dataset JSONLs

Outputs:
- `data/{model_id}/run/final_results/`
- `data/grading/{model_id}/final_graded/`
- `data/final/{model_id}/`

Scripts:
- Answer generation (`generate_reasoning_answers_thread_batch.py`)
- Grading (`grade_answers.py`)
- Postprocessing utilities (`postprocessing_grade.py`, `finalize_grade.ipynb`, `combined_datasets.ipynb`)
This stage can be skipped if using already reconstructed datasets.
Stage 2: Feature Extraction

Purpose: Extract hidden states, logits, and engineered features from model responses.

Inputs:
- `data/{reconstructed_dir}/{model_id}/train.jsonl`
- `data/{reconstructed_dir}/{model_id}/test.jsonl`
- LLM checkpoints via Hugging Face Transformers

Outputs:
- `representations/{method}/{model_id}/`
- `features/{method}/{model_id}/`

Script types:
- `*_hidden_states.py`
- `*_extraction.py`
- `*_feature_extraction.py`

Each script includes example CLI usage inside the file.
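The hidden-state extraction scripts build on standard Transformers functionality; the core idea is sketched below (the model name and the last-layer, last-token pooling choice are illustrative, not the scripts' actual configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative; use the model_id under study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

inputs = tokenizer("Example model response to featurize.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One common choice: the last layer's hidden state at the final token position.
last_hidden = outputs.hidden_states[-1][:, -1, :]  # shape: (1, hidden_size)
print(last_hidden.shape)
```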
Stage 3: Training

Purpose: Train confidence estimation models on the extracted features.

Inputs:
- Representations or features
- Training split JSONLs
- Utility configs

Outputs:
- `trained_models/{method}/{model_id}/`

Artifacts include:
- Model weights
- Training configs
- Optuna studies, where applicable

Training scripts support both direct training and Optuna-based hyperparameter search.
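As an illustration of this stage, a minimal confidence probe over extracted features might look like the following (a sketch using scikit-learn; the file paths are assumptions, and the repository's actual architectures, hyperparameters, and Optuna search spaces live in the training scripts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: a per-record feature matrix and binary correctness
# labels derived from the `grading` field (paths are assumed, not the repo's).
X_train = np.load("features/example_method/example_model/train_features.npy")
y_train = np.load("features/example_method/example_model/train_labels.npy")

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# The predicted probability of correctness serves as the confidence estimate.
confidences = probe.predict_proba(X_train)[:, 1]
print(confidences[:5])
```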
Stage 4: Evaluation

Purpose: Evaluate trained models on held-out test data.

Inputs:
- Trained models
- Test representations or features
- Test JSONLs

Outputs:
- `results/{method}/{model_id}/`

Saved results include:
- Per-record predictions
- Confidence scores
- Aggregate metrics
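The aggregate metrics are the standard ones for confidence estimation. For reference, expected calibration error (ECE) over confidence bins can be computed as follows (a generic sketch, not the repository's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin (bins are left-open)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples in the bin
    return ece

# Hypothetical per-record outputs from a trained estimator:
conf = np.array([0.9, 0.8, 0.6, 0.4, 0.95])
correct = np.array([1, 1, 0, 0, 1])
print(expected_calibration_error(conf, correct))
print(roc_auc_score(correct, conf), brier_score_loss(correct, conf))
```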
```
project_root/
├── data/
│   └── {reconstructed_dir}/
├── RMCB/
│   ├── subdatasets/
│   │   ├── raw_data/
│   │   │   ├── finqa/
│   │   │   │   └── test.json
│   │   │   ├── medqa/
│   │   │   │   └── train.jsonl
│   │   │   └── tatqa/
│   │   │       └── tatqa_dataset_train.json
│   │   ├── arc.py
│   │   ├── bbh.py
│   │   └── ...
│   └── reconstruct_rmcb.py
├── representations/
├── features/
├── trained_models/
├── results/
├── utils/
└── environment.yaml
```
- All scripts include example CLI commands in-file
- Model IDs use the `org/model-name` format and are normalized in directory names (see the sketch below)
- CUDA device selection is configurable via CLI arguments
- Users are responsible for complying with all original dataset licenses
- Reconstruction is designed to make dataset provenance explicit and auditable
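For instance, directory-name normalization of model IDs might simply replace the slash (an assumed convention; check the scripts for the exact rule):

```python
def normalize_model_id(model_id: str) -> str:
    # e.g. "org/model-name" -> "org__model-name" (assumed convention)
    return model_id.replace("/", "__")
```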
If you use RMCB or this codebase in your research, please cite the accompanying paper.
[TO-BE-COMPLETED]