Linxi000/MEDS

Overview

MEDS💊 (Memory-Enhanced Dynamic Reward Shaping) is a memory-enhanced RL training recipe for LLMs built on top of veRL. Unlike standard memoryless reward designs, MEDS incorporates historical error signals into reward shaping, allowing the training process to recognize and discourage repeated mistakes.

To achieve this, MEDS reuses layer-wise logits from the forward pass as lightweight representations of reasoning behavior, clusters similar error patterns, and applies stronger penalties to repeated failures. This encourages broader exploration and leads to better reasoning performance and greater sampling diversity.
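
The idea of penalizing repeated failures more strongly can be sketched as follows. This is a minimal illustration, not the MEDS implementation: the function name `shape_rewards`, the saturating penalty `seen / (1 + seen)`, and the assumption that cluster labels stay comparable across training steps are all hypothetical choices made for the example. Cluster labels are taken as given (for instance from HDBSCAN, where `-1` marks noise).

```python
from collections import Counter


def shape_rewards(rewards, correct, cluster_ids, memory, penalty_coef=0.1):
    """Return shaped rewards and record this batch's errors in `memory`.

    rewards:     list of float base rewards, one per response
    correct:     list of bool, True if the response was right
    cluster_ids: list of int error-cluster labels (-1 = unclustered noise)
    memory:      Counter mapping cluster id -> number of past wrong responses
    """
    shaped = []
    for reward, ok, cid in zip(rewards, correct, cluster_ids):
        if not ok and cid != -1:
            # Hypothetical penalty: grows with past occurrences, saturates
            # at penalty_coef so it cannot dominate the base reward.
            seen = memory[cid]
            shaped.append(reward - penalty_coef * seen / (1 + seen))
        else:
            shaped.append(reward)

    # Update memory only after shaping, so a response is never
    # penalized for its own occurrence.
    for ok, cid in zip(correct, cluster_ids):
        if not ok and cid != -1:
            memory[cid] += 1
    return shaped
```

With an empty memory, the first wrong response in a cluster keeps its base reward; the same error pattern appearing in a later batch is pushed below it, which is the behavior the paragraph above describes.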

Getting Started

Environment Setup

# Clone the MEDS repository
git clone https://github.com/Linxi000/MEDS.git
cd MEDS

# Create a new conda environment
conda create -n meds python=3.10
conda activate meds

# Install Python dependencies
pip install -r requirements.txt

MEDS is built on top of veRL. Follow the veRL installation guide to set up the framework, then place the MEDS directory in the root of your verl repository.

In addition to the standard veRL installation, MEDS depends on updated code components in the following veRL files:

  • verl/workers/fsdp_workers.py: includes MEDS-related worker-side logic used during rollout/training.
  • verl/workers/actor/dp_actor.py: includes MEDS-related actor-side logic used for reward shaping behavior.

Before training, make sure these two files are synced with the MEDS-integrated version in this repository.

Training with MEDS

Set the required paths and launch training:

export MODEL_PATH="${HOME}/models/Qwen2.5-Math-7B"  # Base model path
export TRAIN_FILE="${HOME}/data/unified_math.parquet"
export TEST_FILE="${HOME}/data/aime-2024.parquet"
export CKPTS_DIR="${HOME}/ckpts/MEDS/meds_7b"

bash recipe/meds/run_meds.sh

This runs MEDS training with the default configuration (Qwen2.5-Math-7B, 8 GPUs per node, 100 epochs).
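
Because the launch script reads its paths from the environment, a quick pre-flight check can catch an unset variable before a long job starts. The helper below is a hypothetical convenience, not part of the repository; only the four variable names come from the commands above.

```python
import os

# The four variables exported before launching recipe/meds/run_meds.sh.
REQUIRED_VARS = ("MODEL_PATH", "TRAIN_FILE", "TEST_FILE", "CKPTS_DIR")


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` with no argument inspects the current process environment; an empty list means all four paths are set.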

Configuration

Key hyperparameters in recipe/meds/run_meds.sh:

| Parameter | Default | Description |
| --- | --- | --- |
| cluster_method | hdbscan | Clustering algorithm for error pattern grouping |
| use_layer_diff | False | Whether to use layer-difference features for clustering |
| use_last_n_layers | 14 | Number of last transformer layers used for clustering |
| cluster_penalty_target | wrong | Which responses to penalize: wrong / right / both / none |
| penalty_coef | 0.1 | Strength of the diversity penalty |

Fine-grained Hydra config options are in recipe/meds/config/meds_trainer.yaml.
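
For reference, the defaults listed above can be mirrored in a small validated structure. This is only a sketch: the class name `MedsConfig` and the validation rules are assumptions for illustration, and the actual options are defined in the script and the Hydra config, not in a Python dataclass.

```python
from dataclasses import dataclass

# Allowed values for cluster_penalty_target, as listed in the table.
VALID_TARGETS = {"wrong", "right", "both", "none"}


@dataclass(frozen=True)
class MedsConfig:
    """Hypothetical mirror of the key hyperparameters and their defaults."""

    cluster_method: str = "hdbscan"
    use_layer_diff: bool = False
    use_last_n_layers: int = 14
    cluster_penalty_target: str = "wrong"
    penalty_coef: float = 0.1

    def __post_init__(self):
        if self.cluster_penalty_target not in VALID_TARGETS:
            raise ValueError(
                f"cluster_penalty_target must be one of {sorted(VALID_TARGETS)}"
            )
        # Assumption: at least one layer is needed to build features.
        if self.use_last_n_layers < 1:
            raise ValueError("use_last_n_layers must be at least 1")
```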

Evaluation

Our evaluation pipeline follows the official open-source LIMO implementation, specifically its evaluation module.

First, install the required dependencies under the LIMO/eval directory:

pip install -r requirements.txt

To launch the full evaluation pipeline, run:

bash eval.sh

To evaluate a specific checkpoint, update the --model_name_or_path argument in eval.sh to point to your target checkpoint directory.

To change the number of GPUs or specify particular devices, modify CUDA_VISIBLE_DEVICES in the script. For example:

CUDA_VISIBLE_DEVICES=0,1,2,3

To evaluate pass@k, adjust the following parameters:

  • n_sampling: total number of samples generated per problem
  • k: the k value used for pass@k computation
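
To make the relationship between the two parameters concrete, here is the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), where n is the number of samples per problem and c the number of correct ones. Whether the LIMO evaluation module computes pass@k in exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem (n_sampling)
    c: number of those samples that are correct
    k: the k in pass@k, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample
    # is C(n - c, k) / C(n, k); pass@k is its complement.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem values are then averaged over the benchmark. Note that n_sampling must be at least k for the estimate to be defined.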

The maximum generation length used in our experiments is max_tokens = 8192.

Data Preparation

The training set is a unified math dataset combining DAPO-Math-17K and difficulty levels 3–5 of MATH-lighteval, with deduplication applied. The validation set is AIME 2024.
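
The merge-filter-deduplicate step can be pictured with the sketch below. It is not the script used to build the released data: the field names `question` and `level`, the function name `build_training_set`, and deduplication by whitespace- and case-normalized question text are all assumptions for illustration.

```python
def build_training_set(dapo_rows, math_rows, levels=(3, 4, 5)):
    """Merge two lists of row dicts, keeping only the chosen MATH levels
    and dropping rows whose normalized question text was already seen."""
    kept_math = [row for row in math_rows if row["level"] in levels]

    seen = set()
    merged = []
    for row in list(dapo_rows) + kept_math:
        # Hypothetical dedup key: lowercase text with collapsed whitespace.
        key = " ".join(row["question"].lower().split())
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged
```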

Citation

If you find our work helpful, please consider citing:

@article{liu2026meds,
  title={The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping},
  author={Liu, Yang and Wang, Enxi and Gao, Yufei and Zhang, Weixin and Wang, Bo and Zeng, Zhiyuan and Zhang, Yikai and Zheng, Yining and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2604.11297},
  year={2026}
}

License

This project is licensed under the Apache-2.0 License.

Acknowledgments

We gratefully acknowledge the open-source projects that made this work possible. This project is built on top of the veRL framework, uses HDBSCAN for clustering, and is partly inspired by DAPO. Our models are trained from Qwen2.5 and Qwen3 base models. We sincerely thank the contributors and maintainers of these projects.
