“The limits of my language mean the limits of my world.”
– Wittgenstein
This repository is based on TinyZero, a reproduction of DeepSeek R1-Zero on the Countdown task, built upon veRL.
```bash
conda create -n zero python=3.9
# install torch [or you can skip this step and let vllm install the correct version for you]
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip install vllm==0.6.3 # or you can install 0.5.4, 0.4.2 and 0.3.1
pip install ray
# verl
pip install -e .
# flash attention 2
pip install flash-attn --no-build-isolation
# quality of life
pip install wandb IPython matplotlib
# behavioral evals
pip install asynciolimiter loguru tenacity anthropic openai
```
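Optionally, as a quick sanity check (not part of the original setup), you can confirm that the core packages import and that your GPUs are visible:

```python
# Optional sanity check for the environment installed above.
import torch
import vllm
import flash_attn  # flash attention 2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)
print("GPUs visible:", torch.cuda.device_count())
```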
Data Preparation

First, you can generate the Countdown dataset from the original repo.
```bash
conda activate zero
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
```
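To verify the data looks right, you can load it with pandas. This assumes the script writes train.parquet and test.parquet under the directory you passed via --local_dir; the path below is just a placeholder:

```python
# Quick look at the generated Countdown data (placeholder path).
import pandas as pd

df = pd.read_parquet("{path_to_your_dataset}/train.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.iloc[0])  # one Countdown example
```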
To generate the priming data, we use claude-3.5-sonnet. We generate 5 datasets with different behaviors.
```bash
python generate_cot_datasets/api_gen.py --api_key {your_api_key} --dataset_type {dataset_type} --target_samples {target_samples} --output_file {output_file} --seed {seed} --max_target {max_target} --min_target {min_target}
# process the data into parquet format
sh ./scripts/process_data.sh
```
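If you want to produce all five variants in one go, a small driver loop like the one below works; the dataset_type strings are placeholders, so substitute the values api_gen.py actually accepts:

```python
# Hypothetical driver that calls api_gen.py once per behavior variant.
# The dataset_type strings are placeholders, not necessarily the values
# accepted by the script; check api_gen.py for the real choices.
import subprocess

DATASET_TYPES = ["backtracking", "verification", "subgoal", "backward_chaining", "all_strategies"]

for dtype in DATASET_TYPES:
    subprocess.run(
        [
            "python", "generate_cot_datasets/api_gen.py",
            "--api_key", "YOUR_API_KEY",
            "--dataset_type", dtype,
            "--target_samples", "1000",
            "--output_file", f"data/priming/{dtype}.json",
            "--seed", "42",
            "--max_target", "100",
            "--min_target", "10",
        ],
        check=True,
    )
```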
We generate 2 datasets with an empty CoT: one that is length-matched to the all-strategies dataset and one that simply has an empty CoT.
```bash
python generate_cot_datasets/generate_empty_cot.py --input_file {input_file} --output_file {output_file}
```
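Conceptually, this step keeps each example intact and only replaces its chain of thought, either with nothing or with filler of matching length. A minimal sketch, assuming a simple JSON list with a "cot" field (the real script may use a different schema):

```python
# Illustrative only; the real logic lives in generate_cot_datasets/generate_empty_cot.py.
# The "cot" field name and JSON layout are assumptions about the data schema.
import json

def make_empty_cot(example, length_matched=False):
    cot = example["cot"]
    # Either drop the CoT entirely, or pad with filler tokens so the
    # length roughly matches the original all-strategies CoT.
    example["cot"] = " ".join(["..."] * len(cot.split())) if length_matched else ""
    return example

with open("all_strategies.json") as f:  # placeholder input file
    data = [make_empty_cot(ex, length_matched=True) for ex in json.load(f)]

with open("empty_cot_length_matched.json", "w") as f:  # placeholder output file
    json.dump(data, f)
```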
We also convert the all-strategies dataset so that it contains only incorrect examples.
```bash
python generate_cot_datasets/generate_no_positive_cot.py --input_file {input_file} --output_file {output_file}
```
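In spirit this is just a filter that drops every solved example; a toy version, assuming a boolean correctness flag (not necessarily the repo's actual field name):

```python
# Toy filter that keeps only incorrect solutions.
# "is_correct" is an assumed field name, not the repo's actual schema.
import json

with open("all_strategies.json") as f:  # placeholder input file
    examples = json.load(f)

incorrect_only = [ex for ex in examples if not ex.get("is_correct", False)]

with open("no_positive_cot.json", "w") as f:  # placeholder output file
    json.dump(incorrect_only, f)
```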
All our experiments are run on either 4 or 8 80GB A100s or H100s. The number of GPUs can be changed in the scripts. Please see the TinyZero repo or veRL for more information on training with different compute.
We run SFT on the priming data to get a new primed base model.
```bash
chmod +x scripts/sft.sh
./scripts/sft.sh
```
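Before moving on to PPO, it can be worth sanity-checking the primed checkpoint by sampling from it; the checkpoint path below is a placeholder for wherever sft.sh saved the model:

```python
# Optional: sample from the SFT (primed) checkpoint before PPO.
# The checkpoint path is a placeholder; use wherever sft.sh saved it.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "checkpoints/sft_primed_model"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

prompt = "Using the numbers [3, 5, 7], create an equation that equals 26."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```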
We run PPO on the primed model.
```bash
sh ./scripts/train.sh
```
We run the behavioral evals with gpt-4o-mini to get the results.
```bash
sh ./scripts/behavioral_evals.sh
```
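Under the hood the evals ask gpt-4o-mini to score model rollouts for each behavior. The snippet below is only a simplified illustration of such a call, not the repo's actual eval prompt or pipeline:

```python
# Minimal sketch of a behavior-classification call with gpt-4o-mini.
# The prompt is simplified and is NOT the repo's actual eval prompt;
# see scripts/behavioral_evals.sh and the eval code for the real pipeline.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

rollout = "Let me try 3 * 7 = 21, then 21 + 5 = 26. That works, so the answer is 3*7+5."
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Count how many times the reasoning trace verifies an intermediate result. Reply with a single integer."},
        {"role": "user", "content": rollout},
    ],
)
print(resp.choices[0].message.content)
```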
First, we label the pretraining data to get the behavior counts.
```bash
# Classify behaviors
python pretraining_analysis/relabel_pretrain_offline.py --user username --start 0 --end 1000000 --save_every 10000 --dataset_name {dataset_name}
# Process and get stats
python pretraining_analysis/process_pretrain_labelled.py --dataset_name {dataset_name}
python pretraining_analysis/get_stats.py --dataset_name {dataset_name}
```
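The resulting statistics are essentially per-behavior counts over the labelled documents; a toy aggregation with made-up column names looks like this:

```python
# Toy aggregation over labelled pretraining data; "doc_id" and "behavior"
# are assumed column names, not necessarily what the repo's scripts emit.
import pandas as pd

labelled = pd.DataFrame(
    {"doc_id": [0, 0, 1, 2], "behavior": ["verification", "backtracking", "verification", "none"]}
)
stats = labelled.groupby("behavior").size().sort_values(ascending=False)
print(stats)
```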
We generate a new dataset with the labeled data.
```bash
# Label as COT
python pretraining_analysis/relabel_pretrain_qa.py --user username --start 0 --end 1000000 --save_every 10000 --dataset_name {dataset_name} --method {curated or negative}
# Format as QA and save as parquet
python pretraining_analysis/generate_qa_parquets.py --dataset_name {dataset_name}
# Trim for SFT and save as parquet
python pretraining_analysis/save_parquets.py --dataset_name {dataset_name}
```
We then run SFT on the new dataset to get a new primed base model and run PPO on it.
```bibtex
@misc{gandhi2025cognitivebehaviorsenableselfimproving,
      title={Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs},
      author={Kanishk Gandhi and Ayush Chakravarthy and Anikait Singh and Nathan Lile and Noah D. Goodman},
      year={2025},
      eprint={2503.01307},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.01307},
}
```