Main steps to run the whole benchmark on a new dataset/model (there are examples for each step in later sections):
- Generate the training dataset:
  - Generate texts: `synthetic_dataset_generation/run_generate_texts.py`
  - Split texts into steps and annotate them with DeepSeek: `synthetic_dataset_generation/run_extract_verify_claims.py`
- Generate the test dataset: `synthetic_dataset_generation/run_create_test_dataset.py`
- Train UHead: `train_luh/run_train_luh.py`
- Test UHead: `eval_uhead.py`
- Evaluate baselines:
  - PRM: `eval_prm.py`
  - ReasonEval: `eval_reasoneval.py`
- Plot the resulting tables: `plot_results.ipynb`
Before usage: paste your DeepSeek API key into `configs/deepseek_api_key.txt`.
For the following commands, change the `/cluster/...` paths to local paths where you want to save models/datasets, and change `rediska0123/...` to your Hugging Face namespace.
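The annotation scripts below read the key from the file passed via `--api-key-file`. A minimal sanity check you can run first (this helper is illustrative, not part of the repo):

```python
from pathlib import Path

# Illustrative check: make sure the DeepSeek key file is in place
# before launching any annotation scripts.
key_file = Path("configs/deepseek_api_key.txt")
api_key = key_file.read_text().strip() if key_file.exists() else ""
if not api_key:
    raise SystemExit("Paste your DeepSeek API key into configs/deepseek_api_key.txt")
```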
Example commands to generate the training dataset:

- Generate the annotation dataset (on GPU). Runs for ~30 minutes.

```
python -m synthetic_dataset_generation.run_generate_texts \
    --dataset-path openai/gsm8k,main --n-samples 819 \
    --model-path Qwen/Qwen3-1.7B \
    --device cuda:0 \
    --prompt-file configs/qwen3_prompt.txt \
    --save-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B_texts
# 819 samples: all except the last 500, which are reserved for the test set
```
- Verify claims with DeepSeek (no GPU required). DeepSeek answers are cached. Can run for 30 minutes to 1 hour with a large enough `--n-threads`.

```
python -m synthetic_dataset_generation.run_extract_verify_claims \
    --dataset-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B_texts \
    --model-path Qwen/Qwen3-1.7B \
    --prompt-file configs/qwen3_prompt.txt \
    --save-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B \
    --hf-save-path rediska0123/train_gsm8k_Qwen3-1.7B \
    --api-key-file configs/deepseek_api_key.txt \
    --n-threads 16
```
- Apply the prompt to create the test dataset (no GPU required). Runs very fast.

```
python -m synthetic_dataset_generation.run_create_test_dataset \
    --dataset-path openai/gsm8k,main --dataset-split test --start-index 819 \
    --save-path /cluster/project/sachan/ekaterina/.cache/test_gsm8k_Qwen3-1.7B \
    --hf-save-path rediska0123/test_gsm8k_Qwen3-1.7B \
    --hf-cache /cluster/project/sachan/ekaterina/.cache \
    --prompt-file configs/gsm8k_3shot_prompt.txt
# start from the 819th text and go until the end of the dataset
```
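The index arithmetic behind `--n-samples 819` and `--start-index 819`: GSM8K's `test` split contains 1319 problems, so the first 819 feed training-data generation and the last 500 form the held-out test set. A quick check (a sketch using the dataset names from the commands above):

```python
from datasets import load_dataset

# GSM8K's test split: 1319 problems in total.
gsm8k_test = load_dataset("openai/gsm8k", "main", split="test")
print(len(gsm8k_test))        # 1319
print(len(gsm8k_test) - 819)  # 500 problems held out for the test set
```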
Train UHead with the following script. It runs for several hours; it is recommended to experiment with different numbers of epochs.
Before running, create your wandb project and replace the `WANDB_PROJECT` variable, or switch off wandb logging by setting `report_to: none` in the YAML config.

```
PYTHONPATH=./ WANDB_PROJECT=ue-reasoning \
HYDRA_CONFIG=../configs/train_uhead_claim.yaml \
python train_luh/run_train_luh.py \
    model.pretrained_model_name_or_path=Qwen/Qwen3-1.7B \
    dataset.path=hf:rediska0123/train_gsm8k_Qwen3-1.7B \
    dataset.prompt_path=configs/qwen3_prompt.txt \
    training_arguments.num_train_epochs=30 \
    +save_dir=/cluster/project/sachan/ekaterina/.cache/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/uhead_Qwen3-1.7B_gsm8k
```

Claim sampling: `+claim_num_upper_bound=N` limits how many claims are randomly sampled per example during training (without replacement). If `N <= 0` or the option is omitted, all claims are used. See `clariden_scripts/train_uh_natural_hs_2500_8_true_random_5e.sh` for an example (it sets `CLAIM_NUM_UPPER_BOUND` and passes `+claim_num_upper_bound=...`). The sampling happens in `train_luh/run_train_luh.py` inside `DataCollatorForLanguageModelingWithUncertaintyClaim`, which uses `random.sample` to take `min(N, len(claims))` claims.
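A minimal sketch of that sampling behavior (illustrative; the actual logic lives in the data collator mentioned above):

```python
import random

def subsample_claims(claims, claim_num_upper_bound=None):
    """Keep at most `claim_num_upper_bound` randomly chosen claims.

    Mirrors the behavior described above: sampling is without replacement,
    and a bound that is omitted or <= 0 keeps all claims.
    """
    if claim_num_upper_bound is None or claim_num_upper_bound <= 0:
        return claims
    return random.sample(claims, min(claim_num_upper_bound, len(claims)))

claims = [f"claim {i}" for i in range(20)]
print(len(subsample_claims(claims, 8)))   # 8 randomly kept claims
print(len(subsample_claims(claims)))      # 20: no bound, all claims kept
```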
Example command to test your UHead along with other UE baselines (MaxProb, Perplexity, Entropy, CCP; see the sketch after the command below). Replace `WANDB_PROJECT` with your wandb project.

```
PYTHONPATH=./ \
WANDB_PROJECT=ue-reasoning \
DEEPSEEK_API_KEY=$(<configs/deepseek_api_key.txt) \
HYDRA_CONFIG=configs/polygraph_eval_claim_reasoning.yaml \
python eval_uhead.py \
    model.path=Qwen/Qwen3-1.7B \
    dataset=rediska0123/test_gsm8k_Qwen3-1.7B \
    stat_calculators.2.cfg.uq_head_path=rediska0123/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/ue_manager_gsm8k_Qwen3-1.7B
```
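For intuition about the information-based baselines: MaxProb and Perplexity can be computed from the generated tokens' log-probabilities alone (Entropy additionally needs the full next-token distributions, and CCP is claim-conditioned). A toy sketch, not the pipeline's actual implementation:

```python
import math

def maxprob_and_perplexity(token_logprobs):
    """Toy versions of two sequence-level uncertainty baselines.

    token_logprobs: log-probabilities assigned to the generated tokens.
    Higher values of both scores indicate higher uncertainty.
    """
    nll = -sum(token_logprobs)                 # MaxProb: negative log-likelihood
    ppl = math.exp(nll / len(token_logprobs))  # Perplexity: length-normalized
    return {"MaxProb": nll, "Perplexity": ppl}

# A confident generation (high token probabilities) scores low on both.
print(maxprob_and_perplexity([-0.05, -0.1, -0.02]))
print(maxprob_and_perplexity([-1.5, -2.0, -0.9]))  # less confident -> higher scores
```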
```
# run uhead without annotation
PYTHONPATH=./ \
WANDB_PROJECT=ue-reasoning \
HYDRA_CONFIG=configs/polygraph_eval_claim_reasoning_no_annotation.yaml \
python eval_uhead.py \
    model.path=Qwen/Qwen3-1.7B \
    dataset=rediska0123/test_gsm8k_Qwen3-1.7B \
    stat_calculators.2.cfg.uq_head_path=rediska0123/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/ue_manager_gsm8k_Qwen3-1.7B
```
```
# run annotation (will update the existing manager given by --man-path)
PYTHONPATH=./ \
DEEPSEEK_API_KEY=$(<configs/deepseek_api_key.txt) \
python eval_anno.py \
    --man-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --model-path Qwen/Qwen3-8B \
    --prompt-path configs/qwen3_prompt.txt \
    --n-threads 8
```
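The `--n-threads` flag controls how many DeepSeek annotation requests run in parallel. Conceptually this amounts to something like the following (an illustrative sketch, not `eval_anno.py`'s actual internals; `annotate_claim` is a hypothetical stand-in for one API call):

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_claim(claim: str) -> tuple[str, str]:
    # Hypothetical stand-in for a single DeepSeek API call
    # judging whether a claim is correct.
    return claim, "correct"

claims = ["2 + 2 = 4", "7 is even", "sqrt(9) = 3"]
with ThreadPoolExecutor(max_workers=8) as pool:  # mirrors --n-threads 8
    results = list(pool.map(annotate_claim, claims))
print(results)
```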
PRM baseline: runs fast and updates the manager in `hf-manager-path` with PRM reward values.

```
python eval_prm.py \
    --hf-manager-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --base-model-path Qwen/Qwen3-1.7B \
    --prm-model-path Qwen/Qwen2.5-Math-7B-PRM800K \
    --prompt-file configs/qwen3_prompt.txt \
    --device auto
```
ReasonEval baseline: runs fast and updates the manager in `hf-manager-path` with ReasonEval scores.

```
python eval_reasoneval.py \
    --hf-manager-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --base-model-path Qwen/Qwen3-1.7B \
    --reasoneval-model-path GAIR/ReasonEval-7B \
    --prompt-file configs/qwen3_prompt.txt \
    --device auto
```

Use `plot_results.ipynb` to get the results table.
To run hyperparameter optimization for UHead training:

- Create a sweep ID:

```
WANDB_PROJECT=<wandb_project_name> \
wandb sweep configs/sweep_cfg_training.yaml
```

- Run hyperparameter optimization:

```
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ \
WANDB_PROJECT=<wandb_project_name> \
wandb agent <entity>/<wandb_project_name>/<SWEEP_ID>
```

Important notes:
- The sweep optimizes `eval/mean_f1` by default, which is the mean F1 over the main eval set and all `additional_test_datasets` defined in `configs/sweep_cfg_training.yaml`. Three datasets are already specified: GSM8k, ProofNet, MATH. It might be useful to extend this list with planning / QA datasets. (See the sketch after this list for what the aggregated metric amounts to.)
- You can run the second command (`wandb agent`) simultaneously on multiple machines/GPUs to pool resources (use the single global sweep ID generated by the first command). The sweep will automatically distribute different parameter sets to different agents for faster parallel evaluation.
- You can monitor hyperparameter-optimization progress in the Sweeps tab of your W&B project. If metrics stop improving after a reasonable number of runs (e.g. ~10-20), you can stop the sweep and pick the best run from the completed list.
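For clarity, the aggregated sweep metric is simply an unweighted mean of per-dataset F1 scores; a sketch with made-up numbers:

```python
# Illustrative only: eval/mean_f1 averages the F1 metric over the main
# eval set and every dataset in additional_test_datasets.
f1_per_dataset = {
    "gsm8k": 0.61,     # made-up numbers, for illustration only
    "proofnet": 0.55,
    "math": 0.58,
}
mean_f1 = sum(f1_per_dataset.values()) / len(f1_per_dataset)
print(f"eval/mean_f1 = {mean_f1:.3f}")  # 0.580
```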