ReProbe
Overview

Main steps to run the whole benchmark on a new dataset/model (there are examples for each step in later sections):

  1. Generate training dataset
    • Generate texts: synthetic_dataset_generation/run_generate_texts.py
    • Split texts to steps and annotate with DeepSeek: synthetic_dataset_generation/run_extract_verify_claims.py
  2. Generate test dataset: synthetic_dataset_generation/run_create_test_dataset.py
  3. Train UHead: train_luh/run_train_luh.py
  4. Test UHead: eval_uhead.py
  5. Evaluate baselines
    • PRM: eval_prm.py
    • ReasonEval: eval_reasoneval.py
  6. Plot resulting tables: plot_results.ipynb

Before usage: paste your DeepSeek API key into configs/deepseek_api_key.txt.
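For example, the key file can be created like this (the placeholder key below is hypothetical; substitute your real key):

```shell
# Create the key file at the location the scripts expect
mkdir -p configs
# Paste your actual DeepSeek key in place of the placeholder
printf '%s' 'sk-your-deepseek-key' > configs/deepseek_api_key.txt
chmod 600 configs/deepseek_api_key.txt  # keep the key readable only by you
```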

Example usage

For the following commands, change the /cluster/... paths to local paths where you want to save models/datasets. Also change rediska0123/... to your Hugging Face path.

1. Generate training dataset

Example commands to generate training dataset:

  1. Generate annotation dataset (on GPU). Runs for ~30 mins.
python -m synthetic_dataset_generation.run_generate_texts \
  --dataset-path openai/gsm8k,main --n-samples 819 \
  --model-path Qwen/Qwen3-1.7B \
  --device cuda:0 \
  --prompt-file configs/qwen3_prompt.txt \
  --save-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B_texts
# 819 samples: all except last 500 for test
  2. Verify claims with DeepSeek (no GPU required). DeepSeek answers are cached. Runs for 30 min to 1 h with enough --n-threads.
python -m synthetic_dataset_generation.run_extract_verify_claims \
  --dataset-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B_texts \
  --model-path Qwen/Qwen3-1.7B \
  --prompt-file configs/qwen3_prompt.txt \
  --save-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B \
  --hf-save-path rediska0123/train_gsm8k_Qwen3-1.7B \
  --api-key-file configs/deepseek_api_key.txt \
  --n-threads 16
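The caching mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation; `cached_call` and the `.deepseek_cache` directory are hypothetical names:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".deepseek_cache")  # hypothetical on-disk cache location


def cached_call(prompt: str, call_api) -> str:
    """Return a cached DeepSeek answer for `prompt`, calling the API only on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["answer"]
    answer = call_api(prompt)  # the real API request happens only here
    path.write_text(json.dumps({"prompt": prompt, "answer": answer}))
    return answer
```

With a cache like this, re-running run_extract_verify_claims.py reuses previous answers, so interrupted runs resume cheaply.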

2. Generate test dataset

Apply the prompt to create the test dataset. No GPU required. Runs very quickly.

python -m synthetic_dataset_generation.run_create_test_dataset \
  --dataset-path openai/gsm8k,main --dataset-split test --start-index 819 \
  --save-path /cluster/project/sachan/ekaterina/.cache/test_gsm8k_Qwen3-1.7B \
  --hf-save-path rediska0123/test_gsm8k_Qwen3-1.7B \
  --hf-cache /cluster/project/sachan/ekaterina/.cache \
  --prompt-file configs/gsm8k_3shot_prompt.txt
# start from 819'th text until the end of the dataset
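The index arithmetic behind the two commands, assuming both draw from the GSM8K test split (which has 1319 problems): the first 819 texts feed the annotation/training data, and the remaining 500 form the test set.

```python
GSM8K_TEST_SIZE = 1319  # size of the openai/gsm8k `test` split
N_TRAIN_TEXTS = 819     # --n-samples in step 1
START_INDEX = 819       # --start-index in step 2

# Test examples run from the start index to the end of the split
test_indices = range(START_INDEX, GSM8K_TEST_SIZE)
assert len(test_indices) == 500  # "all except last 500 for test"
assert N_TRAIN_TEXTS + len(test_indices) == GSM8K_TEST_SIZE
```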

3. Train UHead

Train with the following script. Runs for several hours. It is recommended to experiment with the number of epochs.

Before running, create your wandb project and replace the WANDB_PROJECT variable, or alternatively switch off wandb logging by setting report_to: none in the YAML config.

PYTHONPATH=./ WANDB_PROJECT=ue-reasoning \
HYDRA_CONFIG=../configs/train_uhead_claim.yaml \
python train_luh/run_train_luh.py \
  model.pretrained_model_name_or_path=Qwen/Qwen3-1.7B \
  dataset.path=hf:rediska0123/train_gsm8k_Qwen3-1.7B \
  dataset.prompt_path=configs/qwen3_prompt.txt \
  training_arguments.num_train_epochs=30 \
  +save_dir=/cluster/project/sachan/ekaterina/.cache/uhead_Qwen3-1.7B_gsm8k \
  +hf_save_path=rediska0123/uhead_Qwen3-1.7B_gsm8k

Claim sampling: +claim_num_upper_bound=N limits how many claims are randomly sampled per example during training (without replacement). If N <= 0 or the option is omitted, all claims are used. See clariden_scripts/train_uh_natural_hs_2500_8_true_random_5e.sh for an example (it sets CLAIM_NUM_UPPER_BOUND and passes +claim_num_upper_bound=...). The sampling happens in train_luh/run_train_luh.py inside DataCollatorForLanguageModelingWithUncertaintyClaim, which uses random.sample to take min(N, len(claims)) claims.
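In essence, the sampling step behaves like the sketch below (a simplified standalone version of the collator logic described above; the helper name is ours):

```python
import random


def sample_claims(claims: list, claim_num_upper_bound: int) -> list:
    """Randomly subsample claims without replacement, as the collator does.

    A non-positive bound means "use all claims".
    """
    if claim_num_upper_bound <= 0:
        return claims
    # random.sample draws without replacement; never ask for more than exist
    k = min(claim_num_upper_bound, len(claims))
    return random.sample(claims, k)
```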

4. Test UHead

Example of testing your UHead along with other UE baselines (MaxProb, Perplexity, Entropy, CCP). Replace WANDB_PROJECT with your wandb project.

PYTHONPATH=./ \
WANDB_PROJECT=ue-reasoning \
DEEPSEEK_API_KEY=$(<configs/deepseek_api_key.txt) \
HYDRA_CONFIG=configs/polygraph_eval_claim_reasoning.yaml \
    python eval_uhead.py \
    model.path=Qwen/Qwen3-1.7B \
    dataset=rediska0123/test_gsm8k_Qwen3-1.7B \
    stat_calculators.2.cfg.uq_head_path=rediska0123/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/ue_manager_gsm8k_Qwen3-1.7B

Alternatively, first evaluate the UHead, then run annotation on a different machine:

# run uhead without annotation
PYTHONPATH=./ \
WANDB_PROJECT=ue-reasoning \
HYDRA_CONFIG=configs/polygraph_eval_claim_reasoning_no_annotation.yaml \
python eval_uhead.py \
    model.path=Qwen/Qwen3-1.7B \
    dataset=rediska0123/test_gsm8k_Qwen3-1.7B \
    stat_calculators.2.cfg.uq_head_path=rediska0123/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/ue_manager_gsm8k_Qwen3-1.7B

# run annotation (will update existing manager in `man-path`)
PYTHONPATH=./ \
DEEPSEEK_API_KEY=$(<configs/deepseek_api_key.txt) \
python eval_anno.py \
    --man-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --model-path Qwen/Qwen3-8B \
    --prompt-path configs/qwen3_prompt.txt \
    --n-threads 8

5. Evaluate baselines

Process-Reward Model baseline

Runs quickly and updates the manager in --hf-manager-path with PRM reward values.

python eval_prm.py \
    --hf-manager-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --base-model-path Qwen/Qwen3-1.7B \
    --prm-model-path Qwen/Qwen2.5-Math-7B-PRM800K \
    --prompt-file configs/qwen3_prompt.txt \
    --device auto

ReasonEval baseline

Runs quickly and updates the manager in --hf-manager-path with ReasonEval scores.

python eval_reasoneval.py \
    --hf-manager-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --base-model-path Qwen/Qwen3-1.7B \
    --reasoneval-model-path GAIR/ReasonEval-7B \
    --prompt-file configs/qwen3_prompt.txt \
    --device auto

6. Plot results

Use plot_results.ipynb to produce the results table.

Hyperparameters Optimization

  1. Create a sweep ID:
WANDB_PROJECT=<wandb_project_name> \
wandb sweep configs/sweep_cfg_training.yaml
  2. Run hyperparameter optimization:
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ \
WANDB_PROJECT=<wandb_project_name> \
wandb agent <entity>/<wandb_project_name>/<SWEEP_ID>

Important notes:

  1. The sweep optimizes eval/mean_f1 by default, which is the mean F1 over the main eval set and all additional_test_datasets defined in configs/sweep_cfg_training.yaml. Three datasets are already specified: GSM8k, Proofnet, Math. It might be useful to extend this list with planning / QA datasets.

  2. You can run the second command (wandb agent) simultaneously on multiple machines/GPUs to pool resources (use the single global sweep ID generated by the first command). The sweep will automatically distribute different parameter sets to different agents for faster parallel evaluation.

  3. You can monitor hyperparameter optimization progress in the Sweeps tab of your W&B project. If metrics stop improving after some reasonable number of runs (e.g. ~10-20), you can stop the sweep and pick the best run from the completed list.
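The eval/mean_f1 objective in note 1 is just the arithmetic mean of the per-dataset F1 scores. A hypothetical helper illustrating the computation (the dataset names in the example are placeholders):

```python
def mean_f1(per_dataset_f1: dict) -> float:
    """Mean F1 over the main eval set and all additional test datasets."""
    return sum(per_dataset_f1.values()) / len(per_dataset_f1)
```

This means a regression on one dataset can be masked by gains on another, which is why extending the dataset list changes what the sweep optimizes.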

About

Official codebase for ACL 2026 paper: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
