How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains
This repository contains the official codebase for working with RMCB, including dataset reconstruction, feature extraction, model training, and evaluation pipelines used in the project.
The public RMCB metadata is hosted on Hugging Face:
🤗 Dataset Hub: https://huggingface.co/datasets/ledengary/RMCB
The hosted dataset contains public metadata only, which is sufficient to reconstruct the full dataset locally when combined with the original source datasets.
This project was tested on NVIDIA H200 GPUs using PyTorch built with CUDA 12.6.
- Python: 3.12.11
- GPU: NVIDIA H200
- NVIDIA Driver: >= 570.195.03
- CUDA runtime (PyTorch): 12.6
- CUDA toolkit (nvcc): 12.0
Other hardware may work but has not been systematically validated.
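As a quick sanity check (not part of the pipeline itself), you can confirm that your local PyTorch build matches the tested configuration before running anything:

```python
# Optional environment check: verify PyTorch's CUDA runtime and the visible GPU.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA runtime:    {torch.version.cuda}")      # expected: 12.6
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")  # tested on NVIDIA H200
```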
```bash
git clone https://github.com/Ledengary/RMCB.git
cd RMCB
```

The environment YAML is exported directly from the development environment and should be used as-is.

```bash
conda env create -f environment.yaml
conda activate rmcb
```

Python 3.12.11 and CUDA 12.6 were used during development and testing.
RMCB is distributed as public metadata only. Full records are reconstructed locally by matching metadata entries to the original source datasets. By running the reconstruction script, you are downloading original datasets directly from their official sources and agreeing to their respective licenses.
Script: `RMCB/reconstruct_rmcb.py`
This script:
- Downloads public metadata if not present
- Retrieves original datasets from official sources
- Reconstructs full records by matching record IDs
- Writes per-model train and test splits
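Conceptually, the matching step is a join on record IDs between the public metadata and the original source records. The sketch below illustrates the idea; the `source_records` structure and the function itself are hypothetical, not the script's actual internals:

```python
import json

def reconstruct_split(metadata_path: str, source_records: dict[str, dict]) -> list[dict]:
    """Join public metadata rows to original source records by record_id.

    `source_records` maps record_id -> {prompt, ground truth, ...} as loaded
    from the official dataset sources (hypothetical helper output).
    """
    full_records = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            meta = json.loads(line)  # record_id, model_id, model_response, grading, dataset
            source = source_records.get(meta["record_id"])
            if source is None:
                continue  # unmatched rows feed the coverage statistics
            full_records.append({**source, **meta})  # full record = source fields + metadata fields
    return full_records
```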
The public metadata file is automatically downloaded if missing.
- File: `rmcb_public_metadata.jsonl`
- Source: Hugging Face
- Contains: `record_id`, `model_id`, `model_response`, `grading`, `dataset`

This file does not include prompts or ground-truth answers.
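Since the metadata is plain JSONL with the fields listed above, it can be inspected directly, e.g. to count records per model:

```python
import json
from collections import Counter

counts = Counter()
with open("rmcb_public_metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record["model_id"]] += 1  # fields: record_id, model_id, model_response, grading, dataset

for model_id, n in counts.most_common():
    print(f"{model_id}: {n} records")
```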
During reconstruction, original datasets are loaded from their official sources. The table below lists all datasets with their sources, revisions, and download methods.
| # | Link | Dataset | Domain | Split | Source | Revision/Version | Date | Notes |
|---|------|---------|--------|-------|--------|------------------|------|-------|
| 1 | 🔗 | GSM8K | Mathematical | Train | HuggingFace: `openai/gsm8k` (config: `socratic`) | `cc7b047b6e5bb11b4f1af84efc572db110a51b3c` | Dec 2023 | Automatic |
| 2 | 🔗 | TATQA | Financial | Train | Manual Download | - | - | See manual download instructions below |
| 3 | 🔗 | MedQA | Medical | Train | Manual Download | `ddef95d268cdad413693d634279a9a679d468469` | Apr 5, 2024 | See manual download instructions below |
| 4 | 🔗 | LEXam | Legal | Train | HuggingFace: `LEXam-Benchmark/LEXam` (configs: `open_question`, `mcq_4_choices`, `mcq_perturbation`) | `68f21a324eb0e14837be42f10b644c40847c3ed4` | Oct 24, 2025 | Automatic |
| 5 | 🔗 | ARC | General | Train | HuggingFace: `allenai/ai2_arc` (config: `ARC-Challenge`) | `210d026faf9955653af8916fad021475a3f00453` | Dec 21, 2023 | Automatic |
| 6 | 🔗 | CommonsenseQA-2 | General | Train | HuggingFace: `chiayewken/commonsense-qa-2` | `15e7dc364f7906ad69cbe4a0bed697ba12f07bdf` | Jan 23, 2024 | Automatic |
| 7 | 🔗 | LogiQA | General | Train | HuggingFace: `lucasmccabe/logiqa` | `3c19b0488d794d30c36f73d132d8a22e64f42f2e` | Feb 7, 2023 | Automatic |
| 8 | 🔗 | OpenBookQA | General | Train | HuggingFace: `allenai/openbookqa` (config: `main`) | `388097ea7776314e93a529163e0fea805b8a6454` | Jan 4, 2024 | Automatic |
| 9 | 🔗 | QuaRTz | General | Train | HuggingFace: `allenai/quartz` | `28c1dbb56caf81799296cb17892fa73402e23464` | Jan 4, 2024 | Automatic |
| 10 | 🔗 | ReClor | General | Train | HuggingFace: `voidful/ReClor` (raw file: `train.json`) | `809ebe44b8dde882c4190f4178b27676b941b933` | May 20, 2023 | Automatic |
| 11 | 🔗 | MATH | Mathematical | Test | Kaggle: `awsaf49/math-dataset` | Version 1 | - | Automatic (requires Kaggle Hub) |
| 12 | 🔗 | FinQA | Financial | Test | Manual Download | `0f16e2867befa6840783e58be38c9efb9229d742` | Jun 5, 2022 | See manual download instructions below |
| 13 | 🔗 | MedMCQA | Medical | Test | HuggingFace: `openlifescienceai/medmcqa` | `91c6572c454088bf71b679ad90aa8dffcd0d5868` | Jan 4, 2024 | Automatic |
| 14 | 🔗 | LegalBench | Legal | Test | HuggingFace: `nguha/legalbench` | `e042ea68c19df12b737fe768572f22ead61e8e37` | Sep 30, 2024 | Automatic |
| 15 | 🔗 | MMLU-Pro | General | Test | HuggingFace: `TIGER-Lab/MMLU-Pro` | `dd36ce4b34827164989f100331f82c5a29741747` | Oct 25, 2025 | Automatic |
| 16 | 🔗 | BBH | General | Test | HuggingFace: `maveriq/bigbenchhard` | `d53c5b10a77edeb29da195f47e6086b29f2f7f74` | Sep 29, 2023 | Automatic (requires `datasets` < 4.0.0) |
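For the automatic Hugging Face entries, reconstruction pins each dataset to the revision listed above. Loading GSM8K at its pinned commit, for instance, looks like this (a sketch using the standard `datasets` API, not an excerpt from the reconstruction script):

```python
from datasets import load_dataset

# Pin GSM8K (config "socratic") to the exact commit from the table, so the
# download is reproducible even if the upstream dataset changes later.
gsm8k = load_dataset(
    "openai/gsm8k",
    "socratic",
    split="train",
    revision="cc7b047b6e5bb11b4f1af84efc572db110a51b3c",
)
print(gsm8k[0])
```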
Three datasets require manual download due to licensing or distribution constraints:

1. FinQA
   - File: `test.json`
   - Source: GitHub repository `czyssrs/FinQA` at commit `0f16e2867befa6840783e58be38c9efb9229d742`
   - Direct link: https://github.com/czyssrs/FinQA/tree/main/dataset
   - Place at: `RMCB/subdatasets/raw_data/finqa/test.json`

2. MedQA
   - File: `train.jsonl` (extracted from `data_clean.zip`)
   - Source: Google Drive: https://drive.google.com/file/d/1ImYUSLk9JbgHXOemfvyiDiirluZHPeQw/view
   - Instructions: Download `data_clean.zip`, extract it, and copy `train.jsonl` from the `questions/US/` directory
   - Place at: `RMCB/subdatasets/raw_data/medqa/train.jsonl`

3. TATQA
   - File: `tatqa_dataset_train.json`
   - Source: GitHub repository `NExTplusplus/TAT-QA`
   - Direct link: https://github.com/NExTplusplus/TAT-QA/tree/master/dataset_raw
   - Place at: `RMCB/subdatasets/raw_data/tatqa/tatqa_dataset_train.json`
After all manual downloads are complete, the `RMCB/subdatasets/raw_data/` directory should have the following structure:
```
RMCB/subdatasets/raw_data/
├── finqa/
│   └── test.json
├── medqa/
│   └── train.jsonl
└── tatqa/
    └── tatqa_dataset_train.json
```
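Before running reconstruction, you can verify that the manually downloaded files are in place (a small helper sketch, not part of the repository):

```python
from pathlib import Path

# Expected locations of the three manually downloaded files.
REQUIRED = [
    "RMCB/subdatasets/raw_data/finqa/test.json",
    "RMCB/subdatasets/raw_data/medqa/train.jsonl",
    "RMCB/subdatasets/raw_data/tatqa/tatqa_dataset_train.json",
]

missing = [p for p in REQUIRED if not Path(p).is_file()]
if missing:
    raise FileNotFoundError(f"Missing manually downloaded files: {missing}")
print("All manually downloaded files are in place.")
```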
Coverage statistics are printed during reconstruction to indicate matching success.
Reconstructed records are saved per model:
```
data/
└── {reconstructed_dir}/
    └── {model_id}/
        ├── train.jsonl
        └── test.jsonl
```
```bash
python rmcb_reconstruction/reconstruct_rmcb.py \
    --public-metadata data/RMCB/rmcb_public_metadata.jsonl \
    --output-dir data/final
```

If the metadata file is missing, it will be downloaded automatically.
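Once reconstruction finishes, a quick sanity check is to count records per split (a throwaway sketch; the model directory name below is hypothetical and assumes slashes in model IDs are normalized for directory names):

```python
from pathlib import Path

model_dir = Path("data/final/Qwen__Qwen2.5-7B-Instruct")  # hypothetical normalized model directory
for split in ("train.jsonl", "test.jsonl"):
    path = model_dir / split
    n = sum(1 for _ in path.open(encoding="utf-8"))
    print(f"{split}: {n} records")
```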
The codebase follows a four-stage pipeline:

Preprocessing → Feature Extraction → Training → Evaluation
Stage 1: Preprocessing

Purpose: Generate reasoning responses from LLMs and grade them to form the finalized datasets.

Inputs:
- Reconstructed RMCB dataset, or
- Raw dataset JSONLs

Outputs:
- `data/{model_id}/run/final_results/`
- `data/grading/{model_id}/final_graded/`
- `data/final/{model_id}/`

Scripts:
- Answer generation (`generate_reasoning_answers_thread_batch.py`)
- Grading (`grade_answers.py`)
- Postprocessing utilities (`postprocessing_grade.py`, `finalize_grade.ipynb`, `combined_datasets.ipynb`)
This stage can be skipped if using already reconstructed datasets.
Stage 2: Feature Extraction

Purpose: Extract hidden states, logits, and engineered features from model responses.

Inputs:
- `data/{reconstructed_dir}/{model_id}/train.jsonl`
- `data/{reconstructed_dir}/{model_id}/test.jsonl`
- LLM checkpoints via Hugging Face Transformers

Outputs:
- `representations/{method}/{model_id}/`
- `features/{method}/{model_id}/`

Script types:
- `*_hidden_states.py`
- `*_extraction.py`
- `*_feature_extraction.py`

Each script includes example CLI usage inside the file.
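The hidden-state extraction scripts build on standard Transformers functionality; the core idea is sketched below (the model name and the last-layer, last-token pooling choice are illustrative, not the scripts' actual configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative; use the model_id under study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

inputs = tokenizer("Example model response to featurize.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One common choice: the last layer's hidden state at the final token position.
last_hidden = outputs.hidden_states[-1][:, -1, :]  # shape: (1, hidden_size)
print(last_hidden.shape)
```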
Stage 3: Training

Purpose: Train confidence estimation models on the extracted features.

Inputs:
- Representations or features
- Training split JSONLs
- Utility configs

Outputs:
- `trained_models/{method}/{model_id}/`

Artifacts include:
- Model weights
- Training configs
- Optuna studies, where applicable

Training scripts support both direct training and Optuna-based hyperparameter search.
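As an illustration of this stage, a minimal confidence probe over extracted features might look like the following (a sketch using scikit-learn; the file paths are assumptions, and the repository's actual architectures, hyperparameters, and Optuna search spaces live in the training scripts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: a per-record feature matrix and binary correctness
# labels derived from the `grading` field (paths are assumed, not the repo's).
X_train = np.load("features/example_method/example_model/train_features.npy")
y_train = np.load("features/example_method/example_model/train_labels.npy")

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# The predicted probability of correctness serves as the confidence estimate.
confidences = probe.predict_proba(X_train)[:, 1]
print(confidences[:5])
```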
Stage 4: Evaluation

Purpose: Evaluate trained models on held-out test data.

Inputs:
- Trained models
- Test representations or features
- Test JSONLs

Outputs:
- `results/{method}/{model_id}/`

Saved results include:
- Per-record predictions
- Confidence scores
- Aggregate metrics
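The aggregate metrics are the standard ones for confidence estimation. For reference, expected calibration error (ECE) over confidence bins can be computed as follows (a generic sketch, not the repository's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin (bins are left-open)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples in the bin
    return ece

# Hypothetical per-record outputs from a trained estimator:
conf = np.array([0.9, 0.8, 0.6, 0.4, 0.95])
correct = np.array([1, 1, 0, 0, 1])
print(expected_calibration_error(conf, correct))
print(roc_auc_score(correct, conf), brier_score_loss(correct, conf))
```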
```
project_root/
├── data/
│   └── {reconstructed_dir}/
├── RMCB/
│   ├── subdatasets/
│   │   ├── raw_data/
│   │   │   ├── finqa/
│   │   │   │   └── test.json
│   │   │   ├── medqa/
│   │   │   │   └── train.jsonl
│   │   │   └── tatqa/
│   │   │       └── tatqa_dataset_train.json
│   │   ├── arc.py
│   │   ├── bbh.py
│   │   └── ...
│   └── reconstruct_rmcb.py
├── representations/
├── features/
├── trained_models/
├── results/
├── utils/
└── environment.yaml
```
- All scripts include example CLI commands in-file
- Model IDs use the `org/model-name` format and are normalized in directory names (see the sketch below)
- CUDA device selection is configurable via CLI arguments
- Users are responsible for complying with all original dataset licenses
- Reconstruction is designed to make dataset provenance explicit and auditable
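For instance, directory-name normalization of model IDs might simply replace the slash (an assumed convention; check the scripts for the exact rule):

```python
def normalize_model_id(model_id: str) -> str:
    # e.g. "org/model-name" -> "org__model-name" (assumed convention)
    return model_id.replace("/", "__")
```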
If you use RMCB or this codebase in your research, please cite the accompanying paper.
[TO-BE-COMPLETED]