How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

This repository contains the official codebase for working with RMCB, including the dataset reconstruction, feature extraction, model training, and evaluation pipelines used in the project.

Dataset

Hugging Face Dataset Hub

The public RMCB metadata is hosted on Hugging Face:

🤗 Dataset Hub: https://huggingface.co/datasets/ledengary/RMCB

The hosted dataset contains public metadata only, which is sufficient to reconstruct the full dataset locally when combined with the original source datasets.
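
For a quick look at the hosted metadata, the file can also be fetched directly with the huggingface_hub client (a minimal sketch; the reconstruction script described below downloads the same file automatically, so this step is optional):

# Minimal sketch: fetch the public metadata file from the Hugging Face Hub.
# Assumes the file is stored as rmcb_public_metadata.jsonl in the dataset repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ledengary/RMCB",
    filename="rmcb_public_metadata.jsonl",
    repo_type="dataset",
)
print("Metadata downloaded to:", path)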

System Requirements

This project was tested on NVIDIA H200 GPUs using PyTorch built with CUDA 12.6.

  • Python: 3.12.11
  • GPU: NVIDIA H200
  • NVIDIA Driver: >= 570.195.03
  • CUDA runtime (PyTorch): 12.6
  • CUDA toolkit (nvcc): 12.0

Other hardware may work but has not been systematically validated.

Setup

1. Clone the repository

git clone https://github.com/Ledengary/RMCB.git
cd RMCB

2. Create the conda environment

The environment YAML is exported directly from the development environment and should be used as-is.

conda env create -f environment.yaml
conda activate rmcb

Python 3.12.11 and CUDA 12.6 were used during development and testing.
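
After activation, a quick sanity check (a minimal sketch, not part of the official pipeline) confirms that PyTorch sees the GPU and was built against the expected CUDA runtime:

# Sanity check: confirm the PyTorch build, CUDA runtime version, and GPU visibility.
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)           # expected: 12.6
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # expected: NVIDIA H200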

RMCB Dataset Reconstruction

Why reconstruction is required

RMCB is distributed as public metadata only. Full records are reconstructed locally by matching metadata entries to the original source datasets. By running the reconstruction script, you are downloading original datasets directly from their official sources and agreeing to their respective licenses.

Reconstruction Script

Script:

RMCB/reconstruct_rmcb.py

This script:

  1. Downloads public metadata if not present
  2. Retrieves original datasets from official sources
  3. Reconstructs full records by matching record IDs
  4. Writes per-model train and test splits

Public Metadata

The public metadata file is automatically downloaded if missing.

  • File: rmcb_public_metadata.jsonl

  • Source: Hugging Face

  • Contains:

    • record_id
    • model_id
    • model_response
    • grading
    • dataset

This file does not include prompts or ground-truth answers.
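
The file is plain JSON Lines, so records can be inspected directly before reconstruction (a minimal sketch; the path matches the reconstruction command shown later):

# Minimal sketch: inspect a few public metadata records (one JSON object per line).
import json

with open("data/RMCB/rmcb_public_metadata.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # Fields per the list above: record_id, model_id, model_response, grading, dataset
        print(record["record_id"], record["model_id"], record["dataset"], record["grading"])
        if i == 4:  # show only the first five records
            break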

Original Dataset Retrieval

During reconstruction, original datasets are loaded from their official sources. The table below lists all datasets with their sources, revisions, and download methods.

Dataset Sources

| # | Dataset | Domain | Split | Source | Revision/Version | Date | Notes |
|---|---------|--------|-------|--------|------------------|------|-------|
| 1 | GSM8K | Mathematical | Train | HuggingFace: openai/gsm8k (config: socratic) | cc7b047b6e5bb11b4f1af84efc572db110a51b3c | Dec 2023 | Automatic |
| 2 | TATQA | Financial | Train | Manual Download | - | - | See manual download instructions below |
| 3 | MedQA | Medical | Train | Manual Download | ddef95d268cdad413693d634279a9a679d468469 | Apr 5, 2024 | See manual download instructions below |
| 4 | LEXam | Legal | Train | HuggingFace: LEXam-Benchmark/LEXam (configs: open_question, mcq_4_choices, mcq_perturbation) | 68f21a324eb0e14837be42f10b644c40847c3ed4 | Oct 24, 2025 | Automatic |
| 5 | ARC | General | Train | HuggingFace: allenai/ai2_arc (config: ARC-Challenge) | 210d026faf9955653af8916fad021475a3f00453 | Dec 21, 2023 | Automatic |
| 6 | CommonsenseQA-2 | General | Train | HuggingFace: chiayewken/commonsense-qa-2 | 15e7dc364f7906ad69cbe4a0bed697ba12f07bdf | Jan 23, 2024 | Automatic |
| 7 | LogiQA | General | Train | HuggingFace: lucasmccabe/logiqa | 3c19b0488d794d30c36f73d132d8a22e64f42f2e | Feb 7, 2023 | Automatic |
| 8 | OpenBookQA | General | Train | HuggingFace: allenai/openbookqa (config: main) | 388097ea7776314e93a529163e0fea805b8a6454 | Jan 4, 2024 | Automatic |
| 9 | QuaRTz | General | Train | HuggingFace: allenai/quartz | 28c1dbb56caf81799296cb17892fa73402e23464 | Jan 4, 2024 | Automatic |
| 10 | ReClor | General | Train | HuggingFace: voidful/ReClor (raw file: train.json) | 809ebe44b8dde882c4190f4178b27676b941b933 | May 20, 2023 | Automatic |
| 11 | MATH | Mathematical | Test | Kaggle: awsaf49/math-dataset | Version 1 | - | Automatic (requires Kaggle Hub) |
| 12 | FinQA | Financial | Test | Manual Download | 0f16e2867befa6840783e58be38c9efb9229d742 | Jun 5, 2022 | See manual download instructions below |
| 13 | MedMCQA | Medical | Test | HuggingFace: openlifescienceai/medmcqa | 91c6572c454088bf71b679ad90aa8dffcd0d5868 | Jan 4, 2024 | Automatic |
| 14 | LegalBench | Legal | Test | HuggingFace: nguha/legalbench | e042ea68c19df12b737fe768572f22ead61e8e37 | Sep 30, 2024 | Automatic |
| 15 | MMLU-Pro | General | Test | HuggingFace: TIGER-Lab/MMLU-Pro | dd36ce4b34827164989f100331f82c5a29741747 | Oct 25, 2025 | Automatic |
| 16 | BBH | General | Test | HuggingFace: maveriq/bigbenchhard | d53c5b10a77edeb29da195f47e6086b29f2f7f74 | Sep 29, 2023 | Automatic (requires datasets < 4.0.0) |
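
For the datasets marked "Automatic", the reconstruction script pins the revision listed above; the same pinned load can be reproduced manually if needed (a minimal sketch using GSM8K as an example):

# Minimal sketch: load a source dataset at the pinned revision from the table.
from datasets import load_dataset

gsm8k = load_dataset(
    "openai/gsm8k",
    "socratic",
    split="train",
    revision="cc7b047b6e5bb11b4f1af84efc572db110a51b3c",
)
print(len(gsm8k), list(gsm8k[0].keys()))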

Manual Downloads Required

Three datasets require manual download due to licensing or distribution constraints:

1. FinQA

2. MedQA

  • File: train.jsonl (extracted from data_clean.zip)
  • Source:
  • Instructions: Download data_clean.zip, extract it, and copy train.jsonl from the questions/US/ directory
  • Place at: RMCB/subdatasets/raw_data/medqa/train.jsonl

3. TATQA

Final Directory Structure

After all manual downloads are complete, the RMCB/subdatasets/raw_data/ directory should have the following structure:

RMCB/subdatasets/raw_data/
├── finqa/
│   └── test.json
├── medqa/
│   └── train.jsonl
└── tatqa/
    └── tatqa_dataset_train.json
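
Before running reconstruction, it can be useful to verify that the manually downloaded files are in place (a minimal sketch over the paths listed above):

# Minimal sketch: check that all manually downloaded files are where the
# reconstruction script expects them.
from pathlib import Path

expected = [
    "RMCB/subdatasets/raw_data/finqa/test.json",
    "RMCB/subdatasets/raw_data/medqa/train.jsonl",
    "RMCB/subdatasets/raw_data/tatqa/tatqa_dataset_train.json",
]
for p in expected:
    print("OK      " if Path(p).is_file() else "MISSING ", p)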

Coverage statistics are printed during reconstruction to indicate matching success.

Reconstructed records are saved per model:

data/
└── {reconstructed_dir}/
    └── {model_id}/
        ├── train.jsonl
        └── test.jsonl

Running Reconstruction

python rmcb_reconstruction/reconstruct_rmcb.py \
  --public-metadata data/RMCB/rmcb_public_metadata.jsonl \
  --output-dir data/final

If the metadata file is missing, it will be downloaded automatically.
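
Once reconstruction finishes, the per-model splits can be sanity-checked by counting records and grading labels (a minimal sketch, assuming the reconstructed records keep the grading field from the metadata):

# Minimal sketch: count reconstructed records and grading labels per model split.
import json
from collections import Counter
from pathlib import Path

output_dir = Path("data/final")  # matches --output-dir in the command above
for split_file in sorted(output_dir.glob("*/*.jsonl")):
    gradings = Counter()
    with open(split_file) as f:
        for line in f:
            gradings[json.loads(line).get("grading")] += 1
    print(split_file, sum(gradings.values()), dict(gradings))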


Models and Evaluation Workflow Structure

Overview

The codebase follows a four-stage pipeline:

Preprocessing → Feature Extraction → Training → Evaluation

Stage 0: Preprocessing

Purpose: Generate reasoning responses from LLMs and grade them to form the finalized datasets.

Inputs

  • Reconstructed RMCB dataset
  • Or raw dataset JSONLs

Outputs

data/{model_id}/run/final_results/
data/grading/{model_id}/final_graded/
data/final/{model_id}/

Scripts

  • Answer generation scripts (generate_reasoning_answers_thread_batch.py)
  • Grading scripts (grade_answers.py)
  • Postprocessing utilities (postprocessing_grade.py, finalize_grade.ipynb, combined_datasets.ipynb)

This stage can be skipped if you are working from already reconstructed datasets.

Stage 1: Feature Extraction

Purpose: Extract hidden states, logits, and engineered features from model responses.

Inputs

  • data/{reconstructed_dir}/{model_id}/train.jsonl
  • data/{reconstructed_dir}/{model_id}/test.jsonl
  • LLM checkpoints via Hugging Face Transformers

Outputs

representations/{method}/{model_id}/
features/{method}/{model_id}/

Script Types

  • *_hidden_states.py
  • *_extraction.py
  • *_feature_extraction.py

Each script includes example CLI usage inside the file.
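
As a rough illustration of what hidden-state extraction involves, states for a response can be obtained from a Hugging Face checkpoint with output_hidden_states=True (a minimal sketch with a placeholder model name; the repository's *_hidden_states.py scripts implement the actual extraction used in the project):

# Minimal sketch: extract mean-pooled last-layer hidden states for one response.
# Illustration only; not the project's extraction code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-reasoning-model"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

text = "Question ...\nModel response ..."      # placeholder prompt + response
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]        # shape: (batch, seq_len, hidden_dim)
pooled = last_hidden.mean(dim=1)               # one vector per response (mean pooling)
print(pooled.shape)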

Stage 2: Training

Purpose: Train confidence estimation models using extracted features.

Inputs

  • Representations or features
  • Training split JSONLs
  • Utility configs

Outputs

trained_models/{method}/{model_id}/

Artifacts include:

  • Model weights
  • Training configs
  • Optuna studies where applicable

Training scripts support both direct training and Optuna-based search.
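
As a rough illustration of the Optuna-based search, a simple confidence probe can be tuned on held-out features (a minimal sketch with placeholder data; the repository's training scripts define the actual models and search spaces):

# Minimal sketch: Optuna search over a logistic-regression confidence probe.
# Placeholder data; the real scripts load extracted features and graded labels.
import numpy as np
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 64)                  # placeholder feature vectors
y = np.random.randint(0, 2, size=1000)         # placeholder correctness labels
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)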

Stage 3: Evaluation

Purpose: Evaluate trained models on held-out test data.

Inputs

  • Trained models
  • Test representations or features
  • Test JSONLs

Outputs

results/{method}/{model_id}/

Saved results include:

  • Per-record predictions
  • Confidence scores
  • Aggregate metrics
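
From the saved per-record predictions and confidence scores, standard discrimination and calibration metrics can be recomputed offline (a minimal sketch with placeholder arrays; the repository's evaluation scripts produce the official numbers):

# Minimal sketch: recompute AUROC, Brier score, and a simple expected calibration
# error (ECE) from confidence scores and correctness labels.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

def expected_calibration_error(conf, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bins[i], bins[i + 1]
        mask = (conf >= lo) & (conf < hi) if i < n_bins - 1 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

conf = np.random.rand(500)                     # placeholder confidence scores
correct = np.random.randint(0, 2, size=500)    # placeholder correctness (0/1) labels
print("AUROC:", roc_auc_score(correct, conf))
print("Brier:", brier_score_loss(correct, conf))
print("ECE:  ", expected_calibration_error(conf, correct))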

Directory Structure Summary

project_root/
├── data/
│   └── {reconstructed_dir}/
├── RMCB/
│   ├── subdatasets/
│   │   ├── raw_data/
│   │   │   ├── finqa/
│   │   │   │   └── test.json
│   │   │   ├── medqa/
│   │   │   │   └── train.jsonl
│   │   │   └── tatqa/
│   │   │       └── tatqa_dataset_train.json
│   │   ├── arc.py
│   │   ├── bbh.py
│   │   └── ...
│   └── reconstruct_rmcb.py
├── representations/
├── features/
├── trained_models/
├── results/
├── utils/
└── environment.yaml

Notes

  • All scripts include example CLI commands in-file
  • Model IDs use org/model-name format and are normalized in directory names
  • CUDA device selection is configurable via CLI arguments
  • Users are responsible for complying with all original dataset licenses
  • Reconstruction is designed to make dataset provenance explicit and auditable

Citation

If you use RMCB or this codebase in your research, please cite the accompanying paper.

[TO-BE-COMPLETED]
