MEDS💊 (Memory-Enhanced Dynamic Reward Shaping) is a memory-enhanced RL training recipe for LLMs built on top of veRL. Unlike standard memoryless reward designs, MEDS incorporates historical error signals into reward shaping, allowing the training process to recognize and discourage repeated mistakes.
To achieve this, MEDS reuses layer-wise logits from the forward pass as lightweight representations of reasoning behavior, clusters similar error patterns, and applies stronger penalties to repeated failures. This encourages broader exploration and leads to better reasoning performance and greater sampling diversity.
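To make the idea concrete, the sketch below keeps a running count of how often each error cluster has been seen and subtracts a penalty that grows with that count. It is a minimal illustration, not the MEDS implementation: the class name `ErrorMemory`, the logarithmic growth of the penalty, and the assumption that cluster IDs are already assigned (for example by HDBSCAN, with `-1` marking noise) are all hypothetical choices.

```python
import math
from collections import Counter


class ErrorMemory:
    """Hypothetical memory of error clusters seen in earlier training steps."""

    def __init__(self, penalty_coef: float = 0.1):
        self.penalty_coef = penalty_coef
        self.counts = Counter()  # cluster id -> number of past wrong responses

    def shape(self, rewards, cluster_ids, is_correct):
        """Return shaped rewards, then record this batch's errors in memory."""
        shaped = []
        for reward, cid, ok in zip(rewards, cluster_ids, is_correct):
            if ok or cid == -1:
                # Correct responses and unclustered errors are left untouched.
                shaped.append(reward)
                continue
            # Assumed form: the penalty grows with how often this cluster recurred.
            penalty = self.penalty_coef * math.log1p(self.counts[cid])
            shaped.append(reward - penalty)
        # Update the memory only after shaping, so a batch is judged on past errors.
        for cid, ok in zip(cluster_ids, is_correct):
            if not ok and cid != -1:
                self.counts[cid] += 1
        return shaped
```

With this sketch, the first time an error cluster appears it costs nothing extra, and each later recurrence is penalized more strongly, which is the behavior the description above calls for.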
# Clone the MEDS repository
git clone https://github.com/Linxi000/MEDS.git
cd MEDS
# Create a new conda environment
conda create -n meds python=3.10
conda activate meds
# Install Python dependencies
pip install -r requirements.txt

MEDS is built on top of veRL. Please follow the veRL installation guide to set up the framework, then place the MEDS directory into your verl repo root to get started.
In addition to the standard veRL installation, MEDS depends on updated code components in the following veRL files:
- verl/workers/fsdp_workers.py: includes MEDS-related worker-side logic used during rollout/training.
- verl/workers/actor/dp_actor.py: includes MEDS-related actor-side logic used for reward shaping behavior.
Before training, make sure these two files are synced with the MEDS-integrated version in this repository.
Set the required paths and launch training:
export MODEL_PATH="${HOME}/models/Qwen2.5-Math-7B" # Base model path
export TRAIN_FILE="${HOME}/data/unified_math.parquet"
export TEST_FILE="${HOME}/data/aime-2024.parquet"
export CKPTS_DIR="${HOME}/ckpts/MEDS/meds_7b"
bash recipe/meds/run_meds.sh

This runs MEDS training with the default configuration (Qwen2.5-Math-7B, 8 GPUs per node, 100 epochs).
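Before launching, it can help to confirm that the four path variables are actually set. The helper below is a small sketch for that purpose; `missing_vars` is a hypothetical name and is not part of MEDS or veRL.

```python
import os

# The four variables exported in the launch instructions above.
REQUIRED_VARS = ("MODEL_PATH", "TRAIN_FILE", "TEST_FILE", "CKPTS_DIR")


def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` in the shell session used for training returns an empty list when everything is exported, and otherwise lists the names that still need a value.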
Key hyperparameters in recipe/meds/run_meds.sh:
| Parameter | Default | Description |
|---|---|---|
| cluster_method | hdbscan | Clustering algorithm for error pattern grouping |
| use_layer_diff | False | Whether to use layer-difference features for clustering |
| use_last_n_layers | 14 | Number of last transformer layers used for clustering |
| cluster_penalty_target | wrong | Which responses to penalize: wrong / right / both / none |
| penalty_coef | 0.1 | Strength of the diversity penalty |
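One way to read two of these options is sketched below: `cluster_penalty_target` decides which responses are eligible for the penalty, and `use_last_n_layers` decides how many trailing layers contribute features. The functions `penalty_mask` and `select_layers` are hypothetical helpers written for illustration; the actual logic lives in the MEDS-integrated veRL files and may differ.

```python
def penalty_mask(is_correct, cluster_penalty_target="wrong"):
    """Return one boolean per response: True if it is eligible for the penalty."""
    if cluster_penalty_target == "wrong":
        return [not ok for ok in is_correct]
    if cluster_penalty_target == "right":
        return [bool(ok) for ok in is_correct]
    if cluster_penalty_target == "both":
        return [True] * len(is_correct)
    if cluster_penalty_target == "none":
        return [False] * len(is_correct)
    raise ValueError(f"unknown cluster_penalty_target: {cluster_penalty_target!r}")


def select_layers(per_layer_features, use_last_n_layers=14):
    """Keep features from the last n layers (all of them if fewer exist)."""
    if use_last_n_layers <= 0:
        raise ValueError("use_last_n_layers must be positive")
    return per_layer_features[-use_last_n_layers:]
```

Under this reading, the default `wrong` setting leaves correct responses unpenalized, and `none` disables the penalty entirely regardless of `penalty_coef`.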
Fine-grained Hydra config options are in recipe/meds/config/meds_trainer.yaml.
Our evaluation pipeline fully follows the official open-source implementation from LIMO, particularly its evaluation module.
First, install the required dependencies under the LIMO/eval directory:
pip install -r requirements.txt

To launch the full evaluation pipeline, run:
bash eval.sh

To evaluate a specific checkpoint, update the --model_name_or_path argument in eval.sh to point to your target checkpoint directory.
To change the number of GPUs or specify particular devices, modify CUDA_VISIBLE_DEVICES in the script. For example:
CUDA_VISIBLE_DEVICES=0,1,2,3

To evaluate pass@k, adjust the following parameters:
- n_sampling: total number of samples generated per problem
- k: the k value used for pass@k computation
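For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples per problem and c is how many of them are correct. The sketch below implements that estimator; whether the LIMO evaluation module uses exactly this formula is an assumption, and `pass_at_k` and `num_correct` are names chosen here for illustration.

```python
from math import comb


def pass_at_k(n_sampling: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem."""
    if not 0 < k <= n_sampling:
        raise ValueError("k must satisfy 0 < k <= n_sampling")
    if not 0 <= num_correct <= n_sampling:
        raise ValueError("num_correct must lie between 0 and n_sampling")
    # Probability that k samples drawn without replacement are all incorrect.
    all_wrong = comb(n_sampling - num_correct, k) / comb(n_sampling, k)
    return 1.0 - all_wrong
```

The estimator requires `n_sampling >= k`, and a larger `n_sampling` gives a lower-variance estimate for the same k. Averaging the per-problem values gives the benchmark-level score.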
The maximum generation length used in our experiments is max_tokens = 8192.
The training set is a unified math dataset combining DAPO-Math-17K and difficulty levels 3–5 of MATH-lighteval, with deduplication applied. The validation set is AIME 2024.
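The construction described above can be sketched as a filter followed by deduplication. This is only an illustration: the field names `problem` and `level`, the whitespace and case normalization, and the function `build_training_set` are assumptions, and the exact deduplication criterion used for the released data is not stated here.

```python
def build_training_set(dapo_rows, math_rows, levels=(3, 4, 5)):
    """Merge two lists of problem dicts, keeping the first copy of each problem."""
    # Keep only the MATH-lighteval problems at the requested difficulty levels.
    kept_math = [row for row in math_rows if row["level"] in levels]
    seen, unified = set(), []
    for row in list(dapo_rows) + kept_math:
        # Assumed duplicate key: problem text with collapsed whitespace, lowercased.
        key = " ".join(row["problem"].split()).lower()
        if key not in seen:
            seen.add(key)
            unified.append(row)
    return unified
```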
If you find our work helpful, please consider citing:
@article{liu2026meds,
title={The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping},
author={Liu, Yang and Wang, Enxi and Gao, Yufei and Zhang, Weixin and Wang, Bo and Zeng, Zhiyuan and Zhang, Yikai and Zheng, Yining and Qiu, Xipeng},
journal={arXiv preprint arXiv:2604.11297},
year={2026}
}
This project is licensed under the Apache-2.0 License.
We gratefully acknowledge the open-source projects that made this work possible. This project is built on top of the veRL framework, uses HDBSCAN for clustering, and is partly inspired by DAPO. Our models are trained based on Qwen2.5 and Qwen3. We sincerely thank the contributors and maintainers of these projects for their valuable contributions to the open-source community.
