GitHub - Linxi000/MEDS

Overview

MEDS💊 (Memory-Enhanced Dynamic Reward Shaping) is a memory-enhanced RL training recipe for LLMs built on top of veRL. Unlike standard memoryless reward designs, MEDS incorporates historical error signals into reward shaping, allowing the training process to recognize and discourage repeated mistakes.

To achieve this, MEDS reuses layer-wise logits from the forward pass as lightweight representations of reasoning behavior, clusters similar error patterns, and applies stronger penalties to repeated failures. This encourages broader exploration and leads to better reasoning performance and greater sampling diversity.
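The idea of penalizing repeated failures more strongly can be illustrated with a minimal sketch. It assumes cluster labels for responses have already been computed, and that the penalty grows linearly with how often the same error cluster was seen before for a prompt. The class `ErrorMemory`, its `shape` method, and the linear penalty rule are hypothetical illustrations, not the MEDS implementation.

```python
from collections import Counter


class ErrorMemory:
    """Hypothetical memory of error-cluster counts (illustration only)."""

    def __init__(self, penalty_coef=0.1):
        self.penalty_coef = penalty_coef
        # (prompt_id, cluster_id) -> number of wrong responses seen so far
        self.counts = Counter()

    def shape(self, prompt_id, cluster_id, base_reward, is_correct):
        """Return the shaped reward and record wrong responses in memory."""
        # Correct responses are left untouched in this sketch. A label of -1
        # is treated as "no cluster", following the HDBSCAN noise convention.
        if is_correct or cluster_id < 0:
            return base_reward
        key = (prompt_id, cluster_id)
        seen_before = self.counts[key]
        self.counts[key] += 1
        # Assumed rule: the penalty scales with earlier hits on this cluster.
        return base_reward - self.penalty_coef * seen_before


memory = ErrorMemory(penalty_coef=0.1)
# The same error pattern (cluster 3) occurs three times for prompt "q1".
repeated = [memory.shape("q1", 3, 0.0, False) for _ in range(3)]
# A different error pattern (cluster 7) is new, so it is not penalized yet.
fresh = memory.shape("q1", 7, 0.0, False)
```

In this sketch the first occurrence of an error pattern costs nothing, and each repetition costs one more multiple of `penalty_coef`, so a policy that keeps making the same mistake sees a steadily lower reward than one that fails in a new way.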

Getting Started

Environment Setup

# Clone the MEDS repository
git clone https://github.com/Linxi000/MEDS.git
cd MEDS

# Create a new conda environment
conda create -n meds python=3.10
conda activate meds

# Install Python dependencies
pip install -r requirements.txt

MEDS is built on top of veRL. Please follow the veRL installation guide to set up the framework, then place the MEDS directory into your verl repo root to get started.

In addition to the standard veRL installation, MEDS depends on modified versions of the following veRL files:

verl/workers/fsdp_workers.py: includes MEDS-related worker-side logic used during rollout/training.
verl/workers/actor/dp_actor.py: includes MEDS-related actor-side logic used for reward shaping behavior.

Before training, make sure these two files are synced with the MEDS-integrated version in this repository.
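One way to check that the two files are in sync is a byte-for-byte comparison. The sketch below uses Python's standard `filecmp` module. The function name `find_unsynced` and the two root arguments are hypothetical; point them at your MEDS checkout and your veRL repo root.

```python
import filecmp
from pathlib import Path

# The two files named above, relative to each repository root.
MEDS_FILES = [
    "verl/workers/fsdp_workers.py",
    "verl/workers/actor/dp_actor.py",
]


def find_unsynced(meds_root, verl_root, files=MEDS_FILES):
    """Return relative paths that are missing or differ between the two trees."""
    unsynced = []
    for rel in files:
        src = Path(meds_root) / rel
        dst = Path(verl_root) / rel
        if not (src.is_file() and dst.is_file()):
            unsynced.append(rel)
        elif not filecmp.cmp(src, dst, shallow=False):  # compare contents
            unsynced.append(rel)
    return unsynced


# Example call with hypothetical paths:
# print(find_unsynced("/path/to/MEDS", "/path/to/verl"))
```

An empty list means both files match; any path returned should be copied over from the MEDS-integrated version before training.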

Training with MEDS

Set the required paths and launch training:

export MODEL_PATH="${HOME}/models/Qwen2.5-Math-7B"  # Base model path
export TRAIN_FILE="${HOME}/data/unified_math.parquet"
export TEST_FILE="${HOME}/data/aime-2024.parquet"
export CKPTS_DIR="${HOME}/ckpts/MEDS/meds_7b"

bash recipe/meds/run_meds.sh

This runs MEDS training with the default configuration (Qwen2.5-Math-7B, 8 GPUs per node, 100 epochs).

Configuration

Key hyperparameters in recipe/meds/run_meds.sh:

Parameter	Default	Description
`cluster_method`	`hdbscan`	Clustering algorithm for error pattern grouping
`use_layer_diff`	`False`	Whether to use layer-difference features for clustering
`use_last_n_layers`	`14`	Number of last transformer layers used for clustering
`cluster_penalty_target`	`wrong`	Which responses to penalize: `wrong` / `right` / `both` / `none`
`penalty_coef`	`0.1`	Strength of the diversity penalty

Fine-grained Hydra config options are in recipe/meds/config/meds_trainer.yaml.
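To make the options in the table more concrete, here is a minimal sketch of how they might be interpreted. It assumes that `use_last_n_layers` selects the final layers of a per-layer feature list, that `use_layer_diff` means differences between consecutive selected layers, and that `cluster_penalty_target` filters responses by correctness. The functions `build_features` and `is_penalized` are hypothetical and are not taken from the MEDS code.

```python
def build_features(layer_vectors, use_last_n_layers=14, use_layer_diff=False):
    """Build one flat clustering feature from per-layer vectors.

    layer_vectors: list of equal-length float lists, first layer to last.
    """
    if use_last_n_layers < 1:
        raise ValueError("use_last_n_layers must be at least 1")
    selected = layer_vectors[-use_last_n_layers:]
    if use_layer_diff:
        # Assumed meaning: element-wise difference of consecutive layers.
        selected = [
            [b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(selected, selected[1:])
        ]
    # Flatten the selected layers into a single vector.
    return [x for vec in selected for x in vec]


def is_penalized(is_correct, cluster_penalty_target="wrong"):
    """Decide whether a response is eligible for the cluster penalty."""
    if cluster_penalty_target == "none":
        return False
    if cluster_penalty_target == "both":
        return True
    if cluster_penalty_target == "wrong":
        return not is_correct
    if cluster_penalty_target == "right":
        return is_correct
    raise ValueError(f"unknown cluster_penalty_target: {cluster_penalty_target}")


layers = [[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]]
plain = build_features(layers, use_last_n_layers=2)
diffed = build_features(layers, use_last_n_layers=2, use_layer_diff=True)
```

With the default `cluster_penalty_target` of `wrong`, only incorrect responses would be eligible for the penalty in this sketch, which matches the stated goal of discouraging repeated mistakes.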

Evaluation

Our evaluation pipeline follows the official open-source implementation from LIMO, specifically its evaluation module.

First, install the required dependencies under the LIMO/eval directory:

pip install -r requirements.txt

To launch the full evaluation pipeline, run:

bash eval.sh

To evaluate a specific checkpoint, update the --model_name_or_path argument in eval.sh to point to your target checkpoint directory.

To change the number of GPUs or specify particular devices, modify CUDA_VISIBLE_DEVICES in the script. For example:

CUDA_VISIBLE_DEVICES=0,1,2,3

To evaluate pass@k, adjust the following parameters:

n_sampling: total number of samples generated per problem
k: the k value used for pass@k computation

The maximum generation length used in our experiments is max_tokens = 8192.
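For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples per problem (`n_sampling` above) and c is the number of correct samples. The sketch below implements that formula. Whether the LIMO evaluation code uses exactly this estimator is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 8 samples and 2 correct, pass@1 equals the fraction correct.
p1 = pass_at_k(n=8, c=2, k=1)  # 0.25
p4 = pass_at_k(n=8, c=2, k=4)
```

Note that k cannot exceed `n_sampling`, so generate at least as many samples per problem as the largest k you intend to report.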

Data Preparation

The training set is a unified math dataset combining DAPO-Math-17K and difficulty levels 3–5 of MATH-lighteval, with deduplication applied. The validation set is AIME 2024.
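The merge described above can be sketched as follows. This is an illustration only: the row fields `problem` and `level`, the integer encoding of difficulty, and the whitespace- and case-insensitive deduplication key are all assumptions, and the actual preprocessing used to build the released data may differ.

```python
def build_unified(dapo_rows, math_rows, levels=(3, 4, 5)):
    """Merge two lists of row dicts, keeping MATH rows at the given levels.

    Each row is assumed to carry a "problem" string; MATH rows are assumed
    to carry an integer "level". Duplicates are dropped by normalized text.
    """

    def normalize(text):
        # Lowercase and collapse whitespace so trivial variants collide.
        return " ".join(text.lower().split())

    kept_math = [row for row in math_rows if row.get("level") in levels]

    seen = set()
    unified = []
    for row in list(dapo_rows) + kept_math:
        key = normalize(row["problem"])
        if key not in seen:
            seen.add(key)
            unified.append(row)
    return unified


dapo = [{"problem": "What is 1 + 1?"}]
math_rows = [
    {"problem": "what is  1 + 1?", "level": 3},  # duplicate of the DAPO row
    {"problem": "An easy one", "level": 1},      # filtered out by level
    {"problem": "A hard one", "level": 5},
]
unified = build_unified(dapo, math_rows)
```

The resulting rows could then be written to a Parquet file and referenced through `TRAIN_FILE`.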

Citation

If you find our work helpful, please consider citing:

@article{liu2026meds,
  title={The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping},
  author={Liu, Yang and Wang, Enxi and Gao, Yufei and Zhang, Weixin and Wang, Bo and Zeng, Zhiyuan and Zhang, Yikai and Zheng, Yining and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2604.11297},
  year={2026}
}

License

This project is licensed under the Apache-2.0 License.

Acknowledgments

We gratefully acknowledge the open-source projects that made this work possible. This project is built on top of the veRL framework, uses HDBSCAN for clustering, and is partly inspired by DAPO. Our models are trained from Qwen2.5 and Qwen3. We sincerely thank the contributors and maintainers of these projects for their contributions to the open-source community.
