# Learning to Hint for Reinforcement Learning

Jointly train a hinter policy and a reasoner policy during RL. For each hard question that yields an all-incorrect GRPO group, the hinter generates hints online conditioned on the current reasoner's failure, recovering gradient signal while optimizing for transferability to the no-hint setting.


## 🌟 Overview

GRPO suffers from advantage collapse: when all sampled rollouts for a question are incorrect, the group yields zero relative advantages and no gradient signal. HiLL addresses this by co-training a hinter policy alongside the reasoner with two key ideas:
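The collapse is easy to see from the group-relative advantage itself. A minimal sketch (the function name `grpo_advantages` is illustrative, not from this codebase; some GRPO variants also divide by the group's standard deviation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage: each rollout's reward minus the group mean.
    (Illustrative helper; some GRPO variants also normalize by the std.)"""
    mean = statistics.mean(rewards)
    return [r - mean for r in rewards]

# A mixed group yields nonzero advantages, hence a gradient signal.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]

# An all-incorrect group collapses: every advantage is zero,
# so the policy gradient for this question vanishes.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```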

1. Failure-conditioned hint generation. The hinter generates hints online, conditioned on the question, the current reasoner's failed rollout, and a reference solution. This allows hints to adapt to the reasoner's evolving weaknesses over training, unlike fixed or offline hints.

2. Hint reliance and transfer-weighted reward. Not all hints that recover GRPO signal are equally useful. A hint that performs key reasoning steps directly induces high hint reliance — correct hinted trajectories become unlikely once the hint is removed, limiting transfer to the no-hint policy. HiLL introduces a transfer-weighted reward that penalizes high-reliance hints, steering the hinter toward concise, conceptual guidance that transfers back to no-hint inference.
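One way to read the second idea is that a hint only earns reward to the extent the hinted solution remains likely without the hint. A minimal sketch under that reading (the function name, arguments, and the multiplicative form are assumptions for illustration, not the paper's exact formulation):

```python
def transfer_weighted_reward(correct_with_hint, p_no_hint, alpha=1.0):
    """Hypothetical transfer-weighted hinter reward (illustrative only).

    correct_with_hint: 1.0 if the hinted rollout is correct, else 0.0.
    p_no_hint: how likely the reasoner is to produce the hinted trajectory
        once the hint is removed -- a proxy for transferability.
        Low p_no_hint means high hint reliance.
    alpha: weight on the transfer term (assumed hyperparameter).
    """
    return correct_with_hint * alpha * p_no_hint

# A correct rollout from a low-reliance (conceptual) hint outscores
# a correct rollout from a high-reliance (step-revealing) hint.
print(transfer_weighted_reward(1.0, 0.9))  # 0.9
print(transfer_weighted_reward(1.0, 0.1))  # 0.1
```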

## 📦 Setup

1. Create a new virtual environment:

   ```shell
   python -m venv ~/.python/hill
   source ~/.python/hill/bin/activate
   ```

2. Install dependencies and log in to wandb:

   ```shell
   git clone https://github.com/Andree-9/HiLL.git
   cd ./HiLL

   bash hill_setup.sh
   ```

3. Get the training data prepared by SAGE:

   ```shell
   bash scripts/get_data.sh
   ```

## ⚡ Training

```shell
bash scripts/run_hill.sh
```

## 🎓 Evaluation

```shell
bash scripts/eval.sh
```

## 📝 Citation

If you find HiLL useful, please cite as:

```bibtex
@misc{xia2026hill,
      title={Learning to Hint for Reinforcement Learning},
      author={Yu Xia and Canwen Xu and Zhewei Yao and Julian McAuley and Yuxiong He},
      year={2026},
      eprint={2604.00698},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/pdf/2604.00698},
}
```

## 🙏 Acknowledgments

Our implementation builds on SAGE, whose training data and evaluation codebase we extend for HiLL. We also thank verl for the RL training framework, vllm for rollout generation, and oat for response grading.
