An Agent Memory Framework Designed For Deep Research Agents
- [April 1, 2026]: The full training and evaluation codebase, models, and datasets are released.
MIA (Memory In Intelligence Agent) is a memory framework designed for deep research agents, developed by a joint team from the Shanghai Institute of Intelligence (SII) and East China Normal University (ECNU). To address the limitations of existing agents, such as ineffective memory evolution and expensive storage costs, MIA introduces a Manager-Planner-Executor architecture: a Manager for memory storage, a Planner for memory usage, and an Executor for memory-guided inference. MIA's core innovations include an Alternative Reinforcement Learning paradigm for seamless multi-agent cooperation and a Continual Test-Time Learning mechanism that allows the Planner to evolve on the fly during inference. By establishing a collaborative loop between parametric and non-parametric memories with Reflection and Unsupervised Judgment, MIA enables efficient autonomous evolution and robust reasoning in complex, open-world scenarios.
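The Manager-Planner-Executor loop described above can be sketched as a toy Python illustration (all class and method names here are assumptions for exposition, not the released implementation):

```python
# Toy sketch of a Manager-Planner-Executor loop (illustrative names only):
# the Manager stores and retrieves memories, the Planner drafts plans from
# them, and the Executor runs memory-guided steps whose outcomes are
# reflected back into memory.
class Manager:
    def __init__(self):
        self.memories = []  # non-parametric memory store

    def store(self, item):
        self.memories.append(item)

    def retrieve(self, query):
        # naive keyword match as a stand-in for real retrieval
        return [m for m in self.memories if query.lower() in m.lower()]

class Planner:
    def plan(self, task, memories):
        return f"plan({task}) using {len(memories)} memories"

    def replan(self, plan, feedback):
        # continual test-time learning would refine the plan here
        return plan + f" | revised after: {feedback}"

class Executor:
    def act(self, plan):
        return f"result of [{plan}]"

def run_episode(task, manager, planner, executor):
    memories = manager.retrieve(task)
    plan = planner.plan(task, memories)
    result = executor.act(plan)
    # reflection: the outcome is written back as a new memory
    manager.store(f"{task}: {result}")
    return result

manager = Manager()
manager.store("wiki search works best with short keyword queries")
print(run_episode("wiki search task", manager, Planner(), Executor()))
```

The point of the sketch is the collaborative loop: each episode both consumes retrieved memories and produces a new one, which is what allows the system to evolve across episodes.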
Experimental Analysis
Our comprehensive evaluation across multiple benchmarks demonstrates that MIA significantly improves the performance of Deep Research Agents:
- Elevating the State-of-the-Art (a & b): Comparative bar charts on LiveVQA (multimodal) and HotpotQA (text-only, sandbox-based Wiki search) show that MIA consistently boosts the performance of current SOTA large language models, proving its efficacy in both text and complex multimodal reasoning tasks.
- The "Small-to-Great" Leap (c): Using a Qwen-2.5-VL-7B-based Executor, MIA enables this 7B model to achieve a striking performance breakthrough across 7 diverse datasets. Remarkably, the MIA-enhanced 7B model not only outperforms the larger Qwen-2.5-VL-32B but also surpasses leading closed-source models such as GPT-4o and Gemini 2.5 Pro (in non-tool-calling settings). This underscores MIA's ability to unlock "super-model" intelligence within efficient, smaller-scale parameters.
- Superiority in Agent Memory (d): When benchmarked against contemporary SOTA agent memory frameworks using a unified Qwen-2.5-VL-7B Executor, MIA achieves top-tier results across all 7 datasets. These results establish MIA as a new benchmark in memory-augmented architectures, offering unparalleled efficiency and reasoning depth.
We also provide two MIA versions of OpenClaw skills, which not only integrate the MIA memory framework but also include a trustworthy judgment mechanism. Here are the MIA memory and trustworthy-judgment demos.
MIA Memory Demo:
Trust-Worthy Demo:
The core implementation is mainly in web_tools/server.
Open `web_tools/run.sh` and configure the Google Search Serper key:

```shell
export SERPER_KEY_ID="xxxxx"
```

Start the run script:

```shell
cd web_tools
bash ./run.sh
```

Service: `SERVICE_URL/server`, method: `SERVICE_URL/server/search`
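A minimal client sketch for querying the search endpoint. The JSON schema (a single `query` field) and the `SERVICE_URL` value are assumptions; check `web_tools/server` for the actual request format:

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:8000"  # placeholder; use your deployed URL

def build_search_request(query: str) -> urllib.request.Request:
    """Build a POST request for SERVICE_URL/server/search.

    The body schema ({"query": ...}) is an assumption; consult
    web_tools/server for the real one.
    """
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{SERVICE_URL}/server/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def search(query: str) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_search_request(query)) as resp:
        return json.loads(resp.read())
```

With the service running, `search("capital of France")` would return the decoded JSON response from the search server.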
The core implementation is mainly in local_search.
Refer to the setup instructions in search-r1. This project uses wiki25 local retrieval.
Configure the path and start the run script:

```shell
cd local_search
bash ./run.sh
```

Service: `http://localhost:8001/`, method: `http://localhost:8001/retrieve`
The image search cache used in this project: image_search_cache
Create the conda environment:

```shell
conda create -n verl python==3.10.12
```

Run the `install.sh` script in the `train` directory to install dependencies.

Flash-attention needs to be installed separately:

```shell
wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install --no-cache-dir flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Training: 🤗 Train
Our implementation is based on VeRL. Key modifications:
- The core interaction implementation is in `/Executor-Train/Train/verl/experimental/tool_agent_loop.py`. `prompt` is defined in `/Executor-Train/Train/local_search/prompt.py`.
- Custom dataset processing (`CustomRLHFDataset`) and reward score computation (`compute_score`) are in `Executor-Train/Train/local_search/mmsearch.py`.
- Tool implementations are in `verl.tools.search_tool.SearchTool` and `verl.tools.web_image_to_image_search_tool.WebImageToImageSearchTool`.
- The run script is at `/Executor-Train/Train/local_search/run_mmsearch_grpo.sh`.
1. Deploy the local text search tool.
2. Configure /Executor-Train/Train/local_search/mm_search_tool_config.yaml and /Executor-Train/Train/local_search/mmsearch.yaml:
   - `mm_search_tool_config.yaml`:
     - `tools[0].config.retrieval_service_url`: local search service URL
     - `tools[1].config.fvqa_train_cache_path`, `tools[1].config.test_cache_path`: image search cache paths for the test and validation sets
   - `mmsearch.yaml`:
     - `hydra.searchpath`: trainer config path
     - `data.custom_cls.path`: custom dataset code path
     - `actor_rollout_ref.rollout.multi_turn.tool_config_path`: path to the tool config `mm_search_tool_config.yaml`
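The tool-config keys above can be sketched as a fragment of `mm_search_tool_config.yaml`. The values and exact nesting below are illustrative assumptions that mirror the key paths listed above, not the shipped file:

```yaml
# Illustrative fragment only -- mirror the key paths listed above.
tools:
  - config:
      retrieval_service_url: "http://localhost:8001/retrieve"  # local search service URL
  - config:
      fvqa_train_cache_path: "/your_path/image_search_cache/train"
      test_cache_path: "/your_path/image_search_cache/test"
```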
3. Deploy Qwen3-32B on Node 1 as Planner & Judger:
```shell
export VLLM_USE_FLASHINFER_SAMPLER=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /your_path/Qwen/Qwen3-32B \
    --tensor-parallel-size 4 \
    --served-model-name "qwen" \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8002
```

LLM service at `your_url/8002/v1`
4. Deploy the Memory-Planner service (can be on the same node as Planner & Judger):
```shell
cd Memory-Serve
cd TRAIN_PLANNER
```

Configure the run script `run.sh`: set both `MEMORY_URL` and `PLAN_URL` to the LLM service deployed in the previous step.

To improve training efficiency, memory content and the initial plan are collected in advance. Only the replan service is needed here: `your_url/5000/replan_train`.
5. Configure the training script /Executor-Train/Train/local_search/run_mmsearch_grpo.sh:
- `JUDGE_URL`: judge service, set to `your_url/8002/v1`
- `REPLAN_URL`: replan service, set to `your_url/5000/replan_train`
- `WANDB_API_KEY`: WandB API key (optional)
- `SAVE_CHECKPOINT_DIR`: model save path
- `DATASET_TRAIN`: training dataset path
- `DATASET_VAL`: validation dataset path
- `REF_MODEL_PATH`: pretrained model path
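These variables end up as environment exports at the top of the run script. A sketch with placeholder values only (every URL, path, and the reference-model choice below are assumptions to be replaced with your own):

```shell
# Placeholder values only -- substitute your own URLs and paths.
export JUDGE_URL="http://your_url:8002/v1"
export REPLAN_URL="http://your_url:5000/replan_train"
export WANDB_API_KEY=""                          # optional
export SAVE_CHECKPOINT_DIR="/your_path/checkpoints"
export DATASET_TRAIN="/your_path/data/train.parquet"
export DATASET_VAL="/your_path/data/val.parquet"
export REF_MODEL_PATH="/your_path/pretrained_model"
```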
6. Start training on Node 2. Navigate to /Executor-Train/Train/:
```shell
bash ./local_search/run_mmsearch_grpo.sh
```

7. Export the model:

```shell
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir /your_path/actor \
    --target_dir /your_path
```

Download our trained Executor 🤗 here
Our implementation is based on VeRL. Key modifications:
- The core interaction implementation is in `/Planner-Train/mem-plan/verl/experimental/multi_turn_loop.py`. `prompt` is defined in `/Planner-Train/mem-plan/local_search/prompt.py`.
- Custom dataset processing (`CustomRLHFDataset`) and reward score computation (`compute_score`) are in `/Planner-Train/mem-plan/local_search/mmsearch.py`.
- The run script is at `/Planner-Train/mem-plan/local_search/run_mmsearch_grpo.sh`.
1. Deploy the Judger service on Node 1:
```shell
export VLLM_USE_FLASHINFER_SAMPLER=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /your_path/Qwen/Qwen3-32B \
    --tensor-parallel-size 4 \
    --served-model-name "qwen" \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8002
```

2. Deploy the Executor service on Node 2.
Deploy the trained Executor:
```shell
export VLLM_USE_FLASHINFER_SAMPLER=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /your_path/Executor \
    --tensor-parallel-size 4 \
    --served-model-name "qwen" \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8002
```

Navigate to `/Serve/Train_Planner` and configure the run script `serve.sh`:

- `AGENT_URL`: Executor service URL
- `SERVICE_URL`: offline text search service URL
- `TEST_CACHE_DIR`: image-to-image search cache path
- `MAX_LLM_CALL_PER_RUN`: maximum number of interaction rounds between the Executor and tools
Start the service:
```shell
bash serve.sh
```

3. Configure the training script `/Planner-Train/mem-plan/local_search/run_mmsearch_grpo.sh`:

- `JUDGE_URL`: judge service, set to `your_url/8002/v1`
- `PLAN_URL`: Executor service for `plan` responses, set to `your_url/5000/plan`
- `REPLAN_URL`: Executor service for `replan` responses, set to `your_url/5000/replan`
- `WANDB_API_KEY`: WandB API key (optional)
- `SAVE_CHECKPOINT_DIR`: model save path
- `DATASET_TRAIN`: training dataset path
- `DATASET_VAL`: validation dataset path
- `REF_MODEL_PATH`: pretrained model path
To improve training efficiency, memory content and image captions are collected in advance.
4. Start training on Node 3. Navigate to /Planner-Train/mem-plan/:
```shell
bash ./local_search/run_mmsearch_grpo.sh
```

5. Export the model:

```shell
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir /your_path/actor \
    --target_dir /your_path
```

Download our trained Planner 🤗 here
1. Deploy the Memory Manager & Judger service on Node 1:
```shell
export VLLM_USE_FLASHINFER_SAMPLER=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /your_path/Qwen/Qwen3-32B \
    --tensor-parallel-size 4 \
    --served-model-name "qwen" \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8002
```

2. Deploy the Executor service on Node 2:
```shell
export VLLM_USE_FLASHINFER_SAMPLER=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /your_path/Executor \
    --tensor-parallel-size 4 \
    --served-model-name "qwen" \
    --gpu-memory-utilization 0.8 \
    --host 0.0.0.0 \
    --port 8002
```

Navigate to `/Serve/MIA-TTRL` and configure the run script `serve.sh`:

- `AGENT_URL`: Executor service URL
- `SERVICE_URL`: offline/online text search service URL
- `TEST_CACHE_DIR`: image-to-image search cache path
- `MAX_LLM_CALL_PER_RUN`: maximum number of interaction rounds between the Executor and tools
- `MEMORY_URL`: Memory Manager service URL
- `TTRL_SAVE`: output path during exploration
- `PARQUET_PATH`: test set path with images (required for image datasets)
In /Serve/MIA-TTRL/call_agent.py, configure the text search method (choose one):
```python
from tool_search_local import *  # offline text search
from tool_serper import *        # online text search
```

Switch the Python script in `/Serve/MIA-TTRL/serve.sh` to select the mode (supervised / unsupervised):

```shell
python agent_serve_ttrl.py ....       # scenario where Ground-Truth is available after each question
python agent_serve_ttrl_nogt.py ....  # scenario where Ground-Truth is unavailable after each question
```

3. Configure the script `/TTRL/TTRL/local_search/run_mmsearch_grpo.sh` (supervised) or `/TTRL/TTRL-nogt/local_search/run_mmsearch_grpo.sh` (unsupervised):

- `JUDGE_URL`: judge service, set to `your_url/8002/v1`
- `MEMORY_URL`: memory retrieval service, set to `your_url/5000/memory`
- `PLAN_URL`: Executor service for `plan` responses, set to `your_url/5000/plan`
- `REPLAN_URL`: Executor service for `replan` responses, set to `your_url/5000/replan`
- `MEMORY_BANK_SAVE_URL`: service for saving memories to the buffer (while the current batch exploration is not yet complete), set to `your_url/5000/memory_bank_save`
- `BATCH_EVALUATE_URL`: service for evaluating current batch samples, set to `your_url/5000/batch_evaluate`
- `CONSOLIDATE_MEMORIES_URL`: service for extracting memories from all buffered samples, set to `your_url/5000/consolidate_memories`
- `SAVE_MEMORIES_URL`: service for saving all memories, set to `your_url/5000/save_memory`
- `WANDB_API_KEY`: WandB API key (optional)
- `SAVE_CHECKPOINT_DIR`: model save path
- `DATASET_TRAIN`: dataset path
- `DATASET_VAL`: unused, set to the same as `DATASET_TRAIN`
- `REF_MODEL_PATH`: initial Planner path
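The memory services above imply a fixed call order per batch during test-time training. The control flow can be sketched with local stub functions standing in for the HTTP services (function bodies and the loop logic are assumptions inferred from the service descriptions above, not the released implementation):

```python
# Stub sketch of the TTRL memory loop; each function stands in for one of the
# services configured above (MEMORY_BANK_SAVE_URL, BATCH_EVALUATE_URL,
# CONSOLIDATE_MEMORIES_URL, SAVE_MEMORIES_URL). Bodies are illustrative stubs.
memory_bank = []   # buffer for the current batch
memory_store = []  # long-term memory

def memory_bank_save(sample):
    """Buffer a rollout while the current batch is still being explored."""
    memory_bank.append(sample)

def batch_evaluate(batch):
    """Score the current batch samples (stub: sample length as a dummy score)."""
    return [len(s) for s in batch]

def consolidate_memories(batch, scores):
    """Extract memories from all buffered samples (stub: keep positive scorers)."""
    return [s for s, sc in zip(batch, scores) if sc > 0]

def save_memory(memories):
    """Persist consolidated memories and clear the batch buffer."""
    memory_store.extend(memories)
    memory_bank.clear()

# One TTRL step over a batch of rollouts:
for rollout in ["trace-a", "trace-b"]:
    memory_bank_save(rollout)
scores = batch_evaluate(memory_bank)
save_memory(consolidate_memories(memory_bank, scores))
```

In the real setup each call is an HTTP request to the corresponding `your_url/5000/...` endpoint; the sketch only shows the ordering of buffer, evaluate, consolidate, and save.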
Start the service:
```shell
bash serve.sh
```

4. Start the Planner training on Node 3. Navigate to `/TTRL/TTRL/` or `/TTRL/TTRL-nogt/`:

```shell
bash ./local_search/run_mmsearch_grpo.sh
```

Released under the MIT License.
PhD Students: Weicheng Meng, Yu Cheng, Zhihang Lin
Student Leader: Jingyang Qiao
Professors: Zhizhong Zhang, Xin Tan, Jingyu Gong, Zhaoxia Yin
Project Leader: Yuan Xie
We also plan to release the following versions:

- High-Efficiency Version
- Trust-worthy Version