Main steps to run the whole benchmark on a new dataset/model (there are examples for each step in later sections):
- Generate the training dataset:
  - Generate texts: `synthetic_dataset_generation/run_generate_texts.py`
  - Split texts into steps and annotate them with DeepSeek: `synthetic_dataset_generation/run_extract_verify_claims.py`
- Generate the test dataset: `synthetic_dataset_generation/run_create_test_dataset.py`
- Train UHead: `train_luh/run_train_luh.py`
- Test UHead: `eval_uhead.py`
- Evaluate baselines:
  - PRM: `eval_prm.py`
  - ReasonEval: `eval_reasoneval.py`
- Plot the resulting tables: `plot_results.ipynb`
Before usage: paste your DeepSeek API key into `configs/deepseek_api_key.txt`.
For the following commands, change the `/cluster/...` paths to local paths where you want to save models/datasets, and change `rediska0123/...` to your Hugging Face namespace.
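The annotation scripts below read the key from the file passed via `--api-key-file`. A minimal sanity check you can run first (this helper is illustrative, not part of the repo):

```python
from pathlib import Path

# Illustrative check: make sure the DeepSeek key file is in place
# before launching any annotation scripts.
key_file = Path("configs/deepseek_api_key.txt")
api_key = key_file.read_text().strip() if key_file.exists() else ""
if not api_key:
    raise SystemExit("Paste your DeepSeek API key into configs/deepseek_api_key.txt")
```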
Example commands to generate the training dataset:

- Generate the annotation dataset (on GPU). Runs for ~30 minutes.

```
python -m synthetic_dataset_generation.run_generate_texts \
    --dataset-path openai/gsm8k,main --n-samples 819 \
    --model-path Qwen/Qwen3-1.7B \
    --device cuda:0 \
    --prompt-file configs/qwen3_prompt.txt \
    --save-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B_texts
# 819 samples: all except the last 500, which are reserved for the test set
```
- Verify claims with DeepSeek (no GPU required). DeepSeek answers are cached. Can run for 30 minutes to 1 hour with a large enough `--n-threads`.

```
python -m synthetic_dataset_generation.run_extract_verify_claims \
    --dataset-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B_texts \
    --model-path Qwen/Qwen3-1.7B \
    --prompt-file configs/qwen3_prompt.txt \
    --save-path /cluster/project/sachan/ekaterina/.cache/train_gsm8k_Qwen3-1.7B \
    --hf-save-path rediska0123/train_gsm8k_Qwen3-1.7B \
    --api-key-file configs/deepseek_api_key.txt \
    --n-threads 16
```
- Apply the prompt to create the test dataset (no GPU required). Runs very fast.

```
python -m synthetic_dataset_generation.run_create_test_dataset \
    --dataset-path openai/gsm8k,main --dataset-split test --start-index 819 \
    --save-path /cluster/project/sachan/ekaterina/.cache/test_gsm8k_Qwen3-1.7B \
    --hf-save-path rediska0123/test_gsm8k_Qwen3-1.7B \
    --hf-cache /cluster/project/sachan/ekaterina/.cache \
    --prompt-file configs/gsm8k_3shot_prompt.txt
# start from the 819th text and go until the end of the dataset
```
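The index arithmetic behind `--n-samples 819` and `--start-index 819`: GSM8K's `test` split contains 1319 problems, so the first 819 feed training-data generation and the last 500 form the held-out test set. A quick check (a sketch using the dataset names from the commands above):

```python
from datasets import load_dataset

# GSM8K's test split: 1319 problems in total.
gsm8k_test = load_dataset("openai/gsm8k", "main", split="test")
print(len(gsm8k_test))        # 1319
print(len(gsm8k_test) - 819)  # 500 problems held out for the test set
```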
Train UHead with the following script. It runs for several hours; it is recommended to experiment with different numbers of epochs.
Before running, create your wandb project and replace the `WANDB_PROJECT` variable, or switch off wandb logging by setting `report_to: none` in the YAML config.

```
PYTHONPATH=./ WANDB_PROJECT=ue-reasoning \
HYDRA_CONFIG=../configs/train_uhead_claim.yaml \
python train_luh/run_train_luh.py \
    model.pretrained_model_name_or_path=Qwen/Qwen3-1.7B \
    dataset.path=hf:rediska0123/train_gsm8k_Qwen3-1.7B \
    dataset.prompt_path=configs/qwen3_prompt.txt \
    training_arguments.num_train_epochs=30 \
    +save_dir=/cluster/project/sachan/ekaterina/.cache/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/uhead_Qwen3-1.7B_gsm8k
```

Claim sampling: `+claim_num_upper_bound=N` limits how many claims are randomly sampled per example during training (without replacement). If `N <= 0` or the option is omitted, all claims are used. See `clariden_scripts/train_uh_natural_hs_2500_8_true_random_5e.sh` for an example (it sets `CLAIM_NUM_UPPER_BOUND` and passes `+claim_num_upper_bound=...`). The sampling happens in `train_luh/run_train_luh.py` inside `DataCollatorForLanguageModelingWithUncertaintyClaim`, which uses `random.sample` to take `min(N, len(claims))` claims.
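A minimal sketch of that sampling behavior (illustrative; the actual logic lives in the data collator mentioned above):

```python
import random

def subsample_claims(claims, claim_num_upper_bound=None):
    """Keep at most `claim_num_upper_bound` randomly chosen claims.

    Mirrors the behavior described above: sampling is without replacement,
    and a bound that is omitted or <= 0 keeps all claims.
    """
    if claim_num_upper_bound is None or claim_num_upper_bound <= 0:
        return claims
    return random.sample(claims, min(claim_num_upper_bound, len(claims)))

claims = [f"claim {i}" for i in range(20)]
print(len(subsample_claims(claims, 8)))   # 8 randomly kept claims
print(len(subsample_claims(claims)))      # 20: no bound, all claims kept
```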
Example command to test your UHead along with other UE baselines (MaxProb, Perplexity, Entropy, CCP; see the sketch after the command below). Replace `WANDB_PROJECT` with your wandb project.

```
PYTHONPATH=./ \
WANDB_PROJECT=ue-reasoning \
DEEPSEEK_API_KEY=$(<configs/deepseek_api_key.txt) \
HYDRA_CONFIG=configs/polygraph_eval_claim_reasoning.yaml \
python eval_uhead.py \
    model.path=Qwen/Qwen3-1.7B \
    dataset=rediska0123/test_gsm8k_Qwen3-1.7B \
    stat_calculators.2.cfg.uq_head_path=rediska0123/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/ue_manager_gsm8k_Qwen3-1.7B
```
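For intuition about the information-based baselines: MaxProb and Perplexity can be computed from the generated tokens' log-probabilities alone (Entropy additionally needs the full next-token distributions, and CCP is claim-conditioned). A toy sketch, not the pipeline's actual implementation:

```python
import math

def maxprob_and_perplexity(token_logprobs):
    """Toy versions of two sequence-level uncertainty baselines.

    token_logprobs: log-probabilities assigned to the generated tokens.
    Higher values of both scores indicate higher uncertainty.
    """
    nll = -sum(token_logprobs)                 # MaxProb: negative log-likelihood
    ppl = math.exp(nll / len(token_logprobs))  # Perplexity: length-normalized
    return {"MaxProb": nll, "Perplexity": ppl}

# A confident generation (high token probabilities) scores low on both.
print(maxprob_and_perplexity([-0.05, -0.1, -0.02]))
print(maxprob_and_perplexity([-1.5, -2.0, -0.9]))  # less confident -> higher scores
```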
```
# run uhead without annotation
PYTHONPATH=./ \
WANDB_PROJECT=ue-reasoning \
HYDRA_CONFIG=configs/polygraph_eval_claim_reasoning_no_annotation.yaml \
python eval_uhead.py \
    model.path=Qwen/Qwen3-1.7B \
    dataset=rediska0123/test_gsm8k_Qwen3-1.7B \
    stat_calculators.2.cfg.uq_head_path=rediska0123/uhead_Qwen3-1.7B_gsm8k \
    +hf_save_path=rediska0123/ue_manager_gsm8k_Qwen3-1.7B
```
```
# run annotation (will update the existing manager given by --man-path)
PYTHONPATH=./ \
DEEPSEEK_API_KEY=$(<configs/deepseek_api_key.txt) \
python eval_anno.py \
    --man-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --model-path Qwen/Qwen3-8B \
    --prompt-path configs/qwen3_prompt.txt \
    --n-threads 8
```
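The `--n-threads` flag controls how many DeepSeek annotation requests run in parallel. Conceptually this amounts to something like the following (an illustrative sketch, not `eval_anno.py`'s actual internals; `annotate_claim` is a hypothetical stand-in for one API call):

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_claim(claim: str) -> tuple[str, str]:
    # Hypothetical stand-in for a single DeepSeek API call
    # judging whether a claim is correct.
    return claim, "correct"

claims = ["2 + 2 = 4", "7 is even", "sqrt(9) = 3"]
with ThreadPoolExecutor(max_workers=8) as pool:  # mirrors --n-threads 8
    results = list(pool.map(annotate_claim, claims))
print(results)
```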
PRM baseline: runs fast and updates the manager in `hf-manager-path` with PRM reward values.

```
python eval_prm.py \
    --hf-manager-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --base-model-path Qwen/Qwen3-1.7B \
    --prm-model-path Qwen/Qwen2.5-Math-7B-PRM800K \
    --prompt-file configs/qwen3_prompt.txt \
    --device auto
```
ReasonEval baseline: runs fast and updates the manager in `hf-manager-path` with ReasonEval scores.

```
python eval_reasoneval.py \
    --hf-manager-path rediska0123/ue_manager_gsm8k_Qwen3-1.7B \
    --base-model-path Qwen/Qwen3-1.7B \
    --reasoneval-model-path GAIR/ReasonEval-7B \
    --prompt-file configs/qwen3_prompt.txt \
    --device auto
```

Use `plot_results.ipynb` to get the results table.
To run hyperparameter optimization for UHead training:

- Create a sweep ID:

```
WANDB_PROJECT=<wandb_project_name> \
wandb sweep configs/sweep_cfg_training.yaml
```

- Run hyperparameter optimization:

```
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ \
WANDB_PROJECT=<wandb_project_name> \
wandb agent <entity>/<wandb_project_name>/<SWEEP_ID>
```

Important notes:
- The sweep optimizes `eval/mean_f1` by default, which is the mean F1 over the main eval set and all `additional_test_datasets` defined in `configs/sweep_cfg_training.yaml`. Three datasets are already specified: GSM8k, ProofNet, MATH. It might be useful to extend this list with planning / QA datasets. (See the sketch after this list for what the aggregated metric amounts to.)
- You can run the second command (`wandb agent`) simultaneously on multiple machines/GPUs to pool resources (use the single global sweep ID generated by the first command). The sweep will automatically distribute different parameter sets to different agents for faster parallel evaluation.
- You can monitor hyperparameter-optimization progress in the Sweeps tab of your W&B project. If metrics stop improving after a reasonable number of runs (e.g. ~10-20), you can stop the sweep and pick the best run from the completed list.
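For clarity, the aggregated sweep metric is simply an unweighted mean of per-dataset F1 scores; a sketch with made-up numbers:

```python
# Illustrative only: eval/mean_f1 averages the F1 metric over the main
# eval set and every dataset in additional_test_datasets.
f1_per_dataset = {
    "gsm8k": 0.61,     # made-up numbers, for illustration only
    "proofnet": 0.55,
    "math": 0.58,
}
mean_f1 = sum(f1_per_dataset.values()) / len(f1_per_dataset)
print(f"eval/mean_f1 = {mean_f1:.3f}")  # 0.580
```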