Jointly train a hinter policy and a reasoner policy during RL. For each hard question that yields an all-incorrect GRPO group, the hinter generates hints online conditioned on the current reasoner's failure, recovering gradient signal while optimizing for transferability to the no-hint setting.
GRPO suffers from advantage collapse: when all sampled rollouts for a question are incorrect, the group yields zero relative advantages and no gradient signal. HiLL addresses this by co-training a hinter policy alongside the reasoner with two key ideas:
1. Failure-conditioned hint generation. The hinter generates hints online, conditioned on the question, the current reasoner's failed rollout, and a reference solution. This allows hints to adapt to the reasoner's evolving weaknesses over training, unlike fixed or offline hints.
2. Hint reliance and transfer-weighted reward. Not all hints that recover GRPO signal are equally useful. A hint that performs key reasoning steps directly induces high hint reliance — correct hinted trajectories become unlikely once the hint is removed, limiting transfer to the no-hint policy. HiLL introduces a transfer-weighted reward that penalizes high-reliance hints, steering the hinter toward concise, conceptual guidance that transfers back to no-hint inference.
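The advantage collapse described above can be seen in a few lines. This is a minimal sketch of group-relative advantage normalization with hypothetical binary rewards, not HiLL's training code:

```python
# Minimal sketch of GRPO group-relative advantages.
# Reward values are hypothetical (1 = correct rollout, 0 = incorrect).
import statistics

def grpo_advantages(rewards):
    """Advantage of each rollout = (r - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically: advantages collapse to zero
        # and the group contributes no gradient signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

print(grpo_advantages([0, 0, 0, 0]))  # all-incorrect group -> [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages([1, 0, 0, 1]))  # mixed group -> [1.0, -1.0, -1.0, 1.0]
```

Hinted rollouts that turn some of those zeros into ones are exactly what restores a nonzero advantage spread for hard questions.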
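Failure-conditioned hint generation amounts to conditioning the hinter on three inputs. The template wording below is an illustrative assumption, not HiLL's actual prompt:

```python
# Hypothetical sketch of a failure-conditioned hinter prompt: the hinter
# sees the question, the reasoner's failed rollout, and a reference
# solution, so hints can target the current mistake.
def build_hinter_prompt(question: str, failed_rollout: str, reference: str) -> str:
    return (
        "You are a hint writer. A student answered incorrectly.\n"
        f"Question:\n{question}\n\n"
        f"Student's failed attempt:\n{failed_rollout}\n\n"
        f"Reference solution (do not reveal it):\n{reference}\n\n"
        "Write a short conceptual hint that addresses the student's "
        "mistake without performing the key reasoning steps."
    )

prompt = build_hinter_prompt("What is 7 * 8?", "7 * 8 = 54", "7 * 8 = 56")
print(prompt)
```

Because the failed rollout changes as the reasoner improves, the same question can elicit different hints at different points in training.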
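One way to picture the transfer-weighted reward is as hinted-correctness discounted by reliance. The function name, the reliance definition, and the linear penalty below are all illustrative assumptions, not HiLL's actual formula:

```python
# Hedged sketch of a transfer-weighted hint reward: a hint earns reward
# only when the reasoner answers correctly with it, scaled down by how
# much that correct trajectory depends on the hint being present.
# The exact penalty form here is an assumption for illustration.
def hint_reward(correct_with_hint: bool, reliance: float, alpha: float = 1.0) -> float:
    """reliance in [0, 1]: high when the correct hinted trajectory
    becomes unlikely once the hint is removed (poor transfer)."""
    if not correct_with_hint:
        return 0.0
    return max(0.0, 1.0 - alpha * reliance)

# A concise conceptual hint (low reliance) earns nearly full reward;
# a hint that performs the key steps itself (high reliance) earns little.
print(hint_reward(True, reliance=0.1))
print(hint_reward(True, reliance=0.9))
```

The intended effect is the one stated above: the hinter is steered away from hints that do the work for the reasoner and toward guidance that survives hint removal.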
- Create a new virtual environment

  ```shell
  python -m venv ~/.python/hill
  source ~/.python/hill/bin/activate
  ```

- Install dependencies and wandb login

  ```shell
  git clone https://github.com/Andree-9/HiLL.git
  cd ./HiLL
  bash hill_setup.sh
  ```

- Get training data prepared by SAGE

  ```shell
  bash scripts/get_data.sh
  ```
- Run HiLL training

  ```shell
  bash scripts/run_hill.sh
  ```

- Run evaluation

  ```shell
  bash scripts/eval.sh
  ```

If you find HiLL useful, please cite as:
```bibtex
@misc{xia2026hill,
  title={Learning to Hint for Reinforcement Learning},
  author={Yu Xia and Canwen Xu and Zhewei Yao and Julian McAuley and Yuxiong He},
  year={2026},
  eprint={2604.00698},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/pdf/2604.00698},
}
```

Our implementation builds on SAGE, whose training data and evaluation codebase we extend for HiLL. We also thank verl for the RL training framework, vllm for rollout generation, and oat for response grading.
