The data and code for the paper RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems
People commonly leverage structured content to accelerate knowledge acquisition and research problem solving. Among such content, roadmaps guide researchers through hierarchical subtasks to solve complex research problems step by step. Despite progress in structured content generation, the roadmap generation task has remained unexplored. To bridge this gap, we introduce RoadMap, a novel benchmark designed to evaluate the ability of large language models (LLMs) to construct high-quality roadmaps for solving complex research problems. Using this benchmark, we identify three limitations of LLMs: (1) lack of professional knowledge, (2) unreasonable task decomposition, and (3) disordered logical relationships. To address these challenges, we propose RoadMapper, an LLM-based multi-agent system that decomposes the research roadmap generation task into three key stages (i.e., initial generation, knowledge augmentation, and iterative "critique-revise-evaluate"). Extensive experiments demonstrate that RoadMapper improves LLMs' roadmap generation ability, enhancing average performance by more than 8% while saving 84% of the time required by human experts, highlighting its effectiveness and application potential.
Figure 1. Overview of the RoadMap benchmark.
Figure 2. Overview of the RoadMapper multi-agent methodology.
The repository is organized as follows:
root/
├── chroma/ # ChromaDB related scripts and configuration
│ ├── chroma.py # ChromaDB operations (create, test, delete collections)
│ ├── config/
│ │ └── chroma.yaml # ChromaDB configuration file
│ └── start_chroma.sh # Script to start Chroma server
├── code/
│ ├── dpo/ # DPO (Direct Preference Optimization) related files
│ │ ├── qwen3-dpo.yaml # DPO training configuration
│ │ ├── qwen3-infer.sh # DPO inference script
│ │ └── qwen3-merge.yaml # DPO model merge configuration
│ ├── experiments/ # Batch processing scripts for experiments
│ │ ├── infer.py # Batch inference pipeline for RoadMapper
│ │ └── eval.py # Batch evaluation pipeline for RoadMapper
│ ├── roadmapper/ # Core RoadMapper multi-agent system
│ │ ├── agents/ # Agent modules
│ │ │ ├── evaluate_agent.py # Evaluate agent for roadmap quality assessment
│ │ │ ├── granularity_critique_agent.py # Granularity Critique agent
│ │ │ ├── init_agent.py # Init agent
│ │ │ ├── knowledge_agent.py # Knowledge agent
│ │ │ ├── logic_critique_agent.py # Logic Critique agent
│ │ │ ├── revise_agent.py # Revise agent
│ │ │ └── utils.py # Utility functions for agents
│ │ ├── direct_prompting.py # Direct prompting baseline implementation
│ │ ├── force_n_round.py # Force N rounds iteration implementation
│ │ └── roadmapper_retry.py # Main RoadMapper pipeline with retry mechanism
│ └── scripts/ # Utility scripts for extracting useful content from markdown files when constructing RoadMap
│ ├── extract-content-before.py # Content extraction utility
│ └── extract-useful-only.py # Useful content extraction utility
├── data/
│ ├── core-research-question-embeddings/ # Core research question embedding files (used by Knowledge Agent for skill point retrieval)
│ ├── golden-roadmap/ # Gold standard roadmap files (used for evaluation)
│ └── skill-points-embeddings/ # Skill point embedding files (used by the Knowledge Agent for skill point retrieval)
├── dataset/
│ ├── dataset-split-cn.json # Metadata for Chinese dataset split
│ ├── dataset-split-en.json # Metadata for English dataset split
│ ├── dataset_dpo_training.json # Dataset for DPO training
│ ├── skill-points-cn.json # Metadata for Chinese skill points
│ └── skill-points-en.json # Metadata for English skill points
├── README.md # Introduction to this repository
└── requirements.txt # Python dependencies

The metadata format of a research problem is as follows:

- `id`: Unique identifier for the research problem (e.g., "en-0001").
- `title`: The title of the dissertation to which the research problem belongs.
- `core_research_question`: The main research problem addressed in the dissertation.
- `research_field`: The academic field pertaining to the research problem.
- `research_type`: The classification of research (e.g., "Application", "Theory").
- `year`: The year of the dissertation.
Example (from dataset/dataset-split-en.json):
{
"id": "en-0014",
"title": "Graph Representation Learning-Based Recommender Systems",
"core_research_question": "How to effectively integrate graph representation learning with recommender systems to enhance recommendation quality?",
"research_field": "Computer Science",
"research_type": "Application",
"year": "2020"
}

The metadata format of a skill point is as follows:
- `id`: Unique identifier for the skill point (e.g., "en-skill-point-0001").
- `problem_description`: A brief description of the research problem or challenge addressed by the skill point.
- `skill_point_name`: The name of the skill point.
- `skill_point_description`: A detailed description of the skill point.
Example (from dataset/skill-points-en.json):
{
"id": "en-skill-point-0001",
"problem_description": "LLMs often hallucinate or generate unverified information, making them unreliable for scientific research where accuracy and traceability are paramount.",
"skill_point_name": "Two-Pass Retrieval-Augmented Generation RAG with Usefulness Metric",
"skill_point_description": "LmRaC employs a two-pass RAG approach. First, candidate text chunks are retrieved based on semantic similarity. Second, these candidates are explicitly filtered by the LLM for 'usefulness' in answering the question. This significantly improves answer quality and reduces hallucination by ensuring only the most relevant and high-utility information, traceable to paragraph-level citations, is presented."
}

The metadata format of a DPO (Direct Preference Optimization) training dataset item is as follows:
- `id`: Unique identifier for the DPO training sample (e.g., "dpo-0001").
- `type`: Type of the sample (e.g., "reason").
- `core_research_question`: The core research problem.
- `roadmap`: The roadmap content (in Markdown format) used to solve the research problem to be evaluated.
- `system_prompt`: System prompt for evaluating the quality of the roadmap.
- `user_prompt`: User prompt containing the research problem and roadmap.
- `better_eval`: Better evaluation result, containing the score within `<eval_score>` tags and detailed analysis within `<eval_reason>` tags.
- `worse_eval`: Worse evaluation result, containing the score within `<eval_score>` tags and detailed analysis within `<eval_reason>` tags.
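For orientation only, a hypothetical skeleton of a DPO training item is shown below; the values are placeholders to illustrate the field layout and are not taken from dataset/dataset_dpo_training.json:

```json
{
  "id": "dpo-0001",
  "type": "reason",
  "core_research_question": "How to ... ?",
  "roadmap": "# Roadmap\n1. Subtask A\n   1.1 ...\n2. Subtask B\n   ...",
  "system_prompt": "You are an expert reviewer who scores research roadmaps ...",
  "user_prompt": "Research problem: ...\nRoadmap:\n...",
  "better_eval": "<eval_score>85</eval_score>\n<eval_reason>...</eval_reason>",
  "worse_eval": "<eval_score>55</eval_score>\n<eval_reason>...</eval_reason>"
}
```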
Download this repository and navigate into the directory:
cd RoadMapper

Create your Conda environment with the following command:
conda create -n roadmapper python=3.11.5

Activate the environment:
conda activate roadmapper

Install the dependencies with the following command:
pip install -r requirements.txt

Chroma is a vector database used to store embedding data and retrieve relevant skill points for the Knowledge Agent.
- Navigate to the Chroma directory:
cd chroma

- Start the Chroma server:
bash start_chroma.sh

If the terminal displays the following text, the Chroma server is running successfully:
Saving data to: ./chroma-server
Connect to Chroma at: http://localhost:57778
Getting started guide: https://docs.trychroma.com/docs/overview/getting-started
OpenTelemetry is not enabled because it is missing from the config.
Listening on 0.0.0.0:57778

To proceed with the following steps, keep this terminal open to ensure the Chroma server continues running. You can open a new terminal to execute subsequent commands.
- We provide a script (`chroma.py`) for basic operations on the Chroma server within this experiment.
Run the following command for help information:
python chroma.py

- To complete our experiments, you need to create the English and Chinese skill point collections:
python chroma.py action=create-en
python chroma.py action=create-cn

The system will output the following text if a collection is successfully created (example for the English collection):
Successfully added 2493 vectors to collection

To test the Chroma server, you can run the following commands:
python chroma.py action=test-en
python chroma.py action=test-cn

To delete all collections in the Chroma server, you can run the following command:
python chroma.py action=delete-all
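For reference, once the skill-point collections have been created they can be queried through the standard chromadb HTTP client, which is how the Knowledge Agent retrieves relevant skill points. The sketch below is illustrative only: the collection name and the query embedding are assumptions, and the actual names, port, and embedding model are defined in chroma/chroma.py and chroma/config/chroma.yaml.

```python
# Illustrative query against the local Chroma server; the collection name
# ("skill-points-en") is an assumption -- check chroma/chroma.py for the real one.
import chromadb

client = chromadb.HttpClient(host="localhost", port=57778)
collection = client.get_collection("skill-points-en")

# The repository stores precomputed skill-point embeddings, so the query vector
# must come from the same embedding model and have the same dimensionality
# (the zeros below are only a placeholder).
query_embedding = [0.0] * 768

results = collection.query(query_embeddings=[query_embedding], n_results=5)
for point_id, distance in zip(results["ids"][0], results["distances"][0]):
    print(point_id, distance)
```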
We use `./code/experiments/infer.py` for the main experiment inference. Before executing the script, you need to configure the `Config` class; an illustrative sketch follows the parameter lists below.

Common Configuration Items (usually need to be modified):
- `dataset_file` (str): Dataset file path
- `model_name` (str): Model name for inference
- `model_client` (AsyncOpenAI): Model client configuration; requires setting `api_key` and `base_url`
- `evaluate_model_name` (str): Evaluate agent model name
- `evaluate_model_client` (AsyncOpenAI): Evaluate model client configuration; requires setting `api_key` and `base_url`
- `rpm` (int): Requests per minute, used to control API call frequency
- `process_count` (int): Number of data items to process; set to `-1` to process all data
- `infer_mode` (str): Inference mode. Optional values:
  - `"roadmapper"`: Use the complete RoadMapper system (default)
  - `"direct"`: Use the direct prompting baseline
  - `"force_n_round"`: Force N rounds of iteration
RoadMapper Core Parameters:
- `n_results` (int): Number of skill points retrieved by the Knowledge Agent
- `max_iteration_count` (int): Maximum number of iteration rounds for RoadMapper; default is `5`
- `passing_score` (int): Passing score threshold for evaluation; default is `80`
- `max_retry_count_roadmapper` (int): Maximum retry count within RoadMapper; default is `10`
Task Management Parameters:
- `restore_from_dir` (str): Restore running state from the specified directory (for resuming from a checkpoint); an empty string means no restoration needed
- `output_root_dir` (str): Output root directory
- `max_retry_count_per_item` (int): Maximum retry count per data item
- `shuffle` (bool): Whether to shuffle data order
- `max_processing_count` (int): Maximum concurrent processing count; set to `-1` for unlimited
- `item_process_detail_dir` (str): Directory for saving detailed processing results of each data item; default is `"item_process_detail"`
ChromaDB Configuration:
- `chroma_client` (chromadb.HttpClient): ChromaDB client
- `project_embedding_file_dir` (str): Core research question embedding file directory
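Putting the parameters above together, a Config might look roughly like the following. This is a sketch only: the authoritative attribute names and defaults are those in code/experiments/infer.py, and the API keys, URLs, paths, and numeric values not stated above (e.g., rpm, retry counts) are placeholders.

```python
# Sketch of a possible Config for code/experiments/infer.py; keys, URLs, and
# paths are placeholders, and the real defaults live in the script itself.
from openai import AsyncOpenAI
import chromadb

class Config:
    # Common configuration
    dataset_file = "dataset/dataset-split-en.json"
    model_name = "your-inference-model"
    model_client = AsyncOpenAI(api_key="YOUR_API_KEY", base_url="https://your-endpoint/v1")
    evaluate_model_name = "your-evaluation-model"
    evaluate_model_client = AsyncOpenAI(api_key="YOUR_API_KEY", base_url="https://your-endpoint/v1")
    rpm = 60
    process_count = -1          # -1 processes the whole dataset
    infer_mode = "roadmapper"   # or "direct" / "force_n_round"

    # RoadMapper core parameters
    n_results = 5
    max_iteration_count = 5     # default per the parameter list above
    passing_score = 80          # default per the parameter list above
    max_retry_count_roadmapper = 10

    # Task management
    restore_from_dir = ""       # empty string: start fresh
    output_root_dir = "output"
    max_retry_count_per_item = 3
    shuffle = False
    max_processing_count = -1   # -1: unlimited concurrency
    item_process_detail_dir = "item_process_detail"

    # ChromaDB
    chroma_client = chromadb.HttpClient(host="localhost", port=57778)
    project_embedding_file_dir = "data/core-research-question-embeddings"
```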
Then return to the project root directory and execute:
cd ..
python code/experiments/infer.py

After running the pipeline, inference results will be stored in the `output_root_dir` directory. Each execution creates a timestamped directory with the following structure:
root/
├── YYYY-MM-DD-HH-MM-SS_process_count_N/
│ ├── merged_result.json
│ ├── item_json_detail/
│ │ ├── en-0001.json
│ │ ├── en-0002.json
│ │ └── ...
│ ├── item_process_detail/
│ │ ├── en-0001/
│ │ ├── en-0002/
│ │ └── ...
│ ├── item_fail_map.json
│ ├── process.log
│ ├── runtime_info.md
│ └── infer.py

Key Output Files:
- `merged_result.json`: The final merged inference results containing all successfully processed research questions. This file contains an array of results, with each item representing the inference result for a research problem (see the loading example after this list).
- `item_json_detail/`: Contains individual JSON result files for each research question (e.g., `en-0001.json`), storing the complete inference result for that specific item.
- `item_process_detail/`: Contains detailed processing logs and intermediate files for each research question during the inference pipeline execution.
- `item_fail_map.json`: Records the failure count for each item that encountered errors during processing.
- `process.log`: Comprehensive execution log recording the pipeline execution process.
- `runtime_info.md`: Runtime statistics including success count, processing count, and item status information.
- `infer.py`: A copy of the inference script used for this execution, preserved for reproducibility.
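As noted for `merged_result.json` above, the merged results can be inspected programmatically. The sketch below assumes a placeholder run directory name; apart from the `roadmap` field that the evaluation step expects, the per-item keys shown are assumptions.

```python
# Minimal look at merged inference results. The run directory is a placeholder;
# apart from "roadmap", the per-item keys printed here are assumptions.
import json

run_dir = "output/2025-01-01-00-00-00_process_count_-1"
with open(f"{run_dir}/merged_result.json", encoding="utf-8") as f:
    results = json.load(f)

print(f"{len(results)} research problems processed")
first = results[0]
print(first.get("id"), first.get("core_research_question"))
print(first.get("roadmap", "")[:300])  # preview of the generated roadmap
```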
We use `./code/experiments/eval.py` to evaluate the inference results of the main experiments. Before executing the script, you need to configure the `Config` class (a short sketch follows the parameter lists):
Common Configuration Items (usually need to be modified):
- `dataset_file` (str): Dataset file path (should contain the inference results with the `roadmap` field)
- `model_name` (str): Model name for evaluation
- `model_client` (AsyncOpenAI): Model client configuration; requires setting `api_key` and `base_url`
- `rpm` (int): Requests per minute, used to control API call frequency
- `process_count` (int): Number of data items to process; set to `-1` to process all data
Task Management Parameters:
- `restore_from_dir` (str): Restore running state from the specified directory (for resuming from a checkpoint); an empty string means no restoration needed
- `output_root_dir` (str): Output root directory
- `max_retry_count_per_item` (int): Maximum retry count per data item
- `shuffle` (bool): Whether to shuffle data order
- `max_processing_count` (int): Maximum concurrent processing count; set to `-1` for unlimited
Evaluation Specific Parameters:
- `golden_roadmap_dir` (str): Directory containing golden roadmap files (used as the reference standard for evaluation)
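The evaluation-specific wiring might look roughly as follows. This is a sketch only, with placeholder paths and values; the actual attribute names and defaults are those in code/experiments/eval.py.

```python
# Sketch of a possible Config for code/experiments/eval.py; paths and values
# are placeholders, and the script itself is the authoritative reference.
from openai import AsyncOpenAI

class Config:
    # Point dataset_file at the merged inference output; each item must carry a "roadmap" field.
    dataset_file = "output/2025-01-01-00-00-00_process_count_-1/merged_result.json"
    model_name = "your-evaluation-model"
    model_client = AsyncOpenAI(api_key="YOUR_API_KEY", base_url="https://your-endpoint/v1")
    rpm = 60
    process_count = -1  # -1 processes all items

    # Golden roadmaps shipped in data/ serve as the reference standard.
    golden_roadmap_dir = "data/golden-roadmap"
```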
Then, run the evaluation:
python code/experiments/eval.py

The evaluation results will be saved in the `output_root_dir` directory. Each run will create a timestamped directory with the following structure:
root/
├── YYYY-MM-DD-HH-MM-SS_process_count_N/
│ ├── merged_result.json
│ ├── item_json_detail/
│ │ ├── en-0001.json
│ │ ├── en-0002.json
│ │ └── ...
│ ├── evaluation_result.json
│ ├── process.log
│ ├── runtime_info.md
│ ├── item_fail_map.json
│ ├── dataset_remark.md
│ └── eval.py

Key Output Files:
- `merged_result.json`: The final merged evaluation results containing all successfully processed research questions. Each item includes the original research problem metadata, the generated roadmap, and the evaluation results.
- `item_json_detail/`: Contains individual JSON result files for each research question (e.g., `en-0001.json`), storing the complete evaluation result for that specific item, including:
  - Original research problem metadata (`id`, `title`, `core_research_question`, etc.)
  - The generated `roadmap` content
  - `evaluation_result` containing:
    - `step_score`: Key step representation score (0-100)
    - `logic_score`: Logical coherence score (0-100)
    - `degree_score`: Average out-degree score (0-100, DegreeScore in our paper)
    - `depth_score`: Average depth score (0-100, DepthScore in our paper)
    - `eval_reason`: Detailed evaluation reasoning
- `evaluation_result.json`: Contains the average statistics of all evaluation metrics (see the snippet after this list):
  - `average_step_score`: Average key step representation score
  - `average_logic_score`: Average logical coherence score
  - `average_degree_score`: Average out-degree score (DegreeScore in our paper)
  - `average_depth_score`: Average depth score (DepthScore in our paper)
  - `average_score`: Overall average score across all metrics
- `process.log`: Comprehensive execution log recording the evaluation process.
- `runtime_info.md`: Runtime statistics including success count, processing count, and item status information.
- `item_fail_map.json`: Records the failure count for each item that encountered errors during processing.
- `dataset_remark.md`: Records the dataset file path used for evaluation.
- `eval.py`: A copy of the evaluation script used for this execution, preserved for reproducibility.
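As referenced in the `evaluation_result.json` item above, the aggregate metrics can be printed with a few lines of Python; the run directory name below is a placeholder for the timestamped folder created by eval.py.

```python
# Print aggregate evaluation metrics; the run directory name is a placeholder.
import json

run_dir = "output/2025-01-01-00-00-00_process_count_-1"
with open(f"{run_dir}/evaluation_result.json", encoding="utf-8") as f:
    stats = json.load(f)

for key in ("average_step_score", "average_logic_score", "average_degree_score",
            "average_depth_score", "average_score"):
    print(f"{key}: {stats.get(key)}")
```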


