Tracking the systems that automate scientific research — from single-purpose code agents to full idea-to-paper pipelines
Autonomous research systems have gone from weekend experiments to NeurIPS Spotlight papers in under two years. This repository catalogues 30+ active projects across the full spectrum — lightweight literature scrapers, multi-agent experiment runners, and end-to-end systems that can take a vague research direction and output a reviewable manuscript — together with a capability comparison matrix, a pipeline map, a tool selection guide, and in-depth technical reports for the most impactful systems.
Latest Additions (2026-04-08): Tier updates across all sections (ARIS 5.8k★→🏆, Aider 43k★→🏆, Tongyi DeepResearch 18.6k★→🏆, SWE-agent 19k★→🏆), 7 new in-depth reports (Aider, ARIS, Tongyi DeepResearch, DeepResearchAgent, Idea2Paper, SciAgents, AIDE), Capability Matrix corrected to 18 rows
Understanding where each tool fits in the research process is key to choosing the right one.
╔════════════════════════════════════════════════════════════════════════════════╗
║ THE AUTONOMOUS RESEARCH PIPELINE ║
╠══════════════╦════════════════╦════════════════╦════════════╦══════════════════╣
║ DISCOVER ║ SYNTHESIZE ║ HYPOTHESIZE ║ EXECUTE ║ WRITE & REVIEW ║
║ ║ ║ ║ ║ ║
║ Idea2Paper ║ STORM ║ AI-Scientist ║ OpenHands ║ AI-Scientist ║
║ SciAgents ║ GPT Researcher ║ AI-Researcher ║ SWE-agent ║ Agent Lab ║
║ ResAgent ║ PaperQA2 ║ Agent Lab ║ Aider ║ AI-Researcher ║
║ ║ OpenScholar ║ autoresearch ║ AIDE ║ ║
║ ║ DeerFlow ║ ║ ║ ║
╠══════════════╩════════════════╩════════════════╩════════════╩══════════════════╣
║ ◄── FULL PIPELINE (End-to-End) ──► ║
║ autoresearch · AI-Scientist v1/v2 · AI-Researcher · Agent Laboratory · Biomni ║
╚════════════════════════════════════════════════════════════════════════════════╝
- 📊 Capability Matrix
- 🚀 End-to-End Research Systems
- 🔍 Literature Review & Deep Research
- ⚗️ Experiment Automation & Code Agents
- ✍️ Idea Generation & Writing Assistants
- 📐 Benchmarks & Evaluation Suites
- 🎓 Academic Surveys & Papers
- 🧾 In-Depth Analysis Reports
- 🧭 How to Choose the Right Tool
- 🤝 Contributing
- 📈 Star History
## 📊 Capability Matrix

The Tier column groups systems by overall impact and maturity — this same tier label appears in every section table below, so you can quickly cross-reference.
| Tier | System | Lit Review | Hypothesis | Code Exec | Paper Writing | Peer Review | Multimodal | Fully Local |
|---|---|---|---|---|---|---|---|---|
| 🏆 | OpenHands | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
| 🏆 | autoresearch | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| 🏆 | DeerFlow | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | |
| 🏆 | STORM | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | |
| 🏆 | GPT Researcher | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | |
| 🏆 | SWE-agent | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
| 🏆 | deep-research | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | |
| 🏆 | AI-Scientist | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | |
| 🏆 | RD-Agent | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | |
| 🏆 | Open Deep Research | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | |
| 🏆 | PaperQA2 | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| 🏆 | MiroThinker | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | |
| 🏆 | ARIS | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| 🏆 | Agent Laboratory | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | |
| 🏆 | AI-Scientist-v2 | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | |
| 🏆 | AI-Researcher | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| 🌟 | EvoScientist | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | |
| 🌟 | Biomni | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | |
Tier legend: 🏆 Landmark — defined or significantly shaped the field · 🌟 Flagship — mature, widely adopted, strong results · 🔬 Notable — active, specialized, or emerging
Capability legend: ✅ Native · ⚠️ Partial / requires setup · ❌ Not supported
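To make the matrix actionable, here is a tiny Python sketch that filters it. The `MATRIX` dict transcribes four rows from the table above, and `find` is an illustrative helper, not part of any listed project:

```python
# Four rows transcribed from the capability matrix above.
# True = native support (✅), False = not supported (❌); blank cells are omitted.
MATRIX = {
    "OpenHands":     {"code_exec": True, "paper_writing": False, "fully_local": True},
    "ARIS":          {"code_exec": True, "paper_writing": True,  "fully_local": False},
    "PaperQA2":      {"lit_review": True, "code_exec": False,    "fully_local": True},
    "AI-Researcher": {"code_exec": True, "paper_writing": True,  "fully_local": True},
}

def find(**required):
    """Return systems whose transcribed capabilities include all required flags."""
    return sorted(
        name for name, caps in MATRIX.items()
        if all(caps.get(flag) for flag in required)
    )

# Systems that both execute code and run fully locally (per the rows above):
print(find(code_exec=True, fully_local=True))  # ['AI-Researcher', 'OpenHands']
```

The same pattern extends to the full 18-row matrix if you transcribe the remaining rows.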
## 🚀 End-to-End Research Systems

Systems that automate the full research lifecycle: discovery → hypothesis → experiments → manuscript. The most ambitious category — each one aims to replace or augment the entire scientific process.
| Tier | Project | Stars | Core Approach | Notes | Report |
|---|---|---|---|---|---|
| 🏆 | autoresearch<br>Andrej Karpathy | — | 630-line agent; reads its own training script, forms hypotheses, modifies code, runs hundreds of experiments overnight | Minimal & self-contained; seminal proof-of-concept | 📄 |
| 🏆 | AI-Scientist<br>SakanaAI · 2024 | — | Template-driven idea generation → experiment loop → LaTeX write-up → agentic peer review | First comprehensive end-to-end system; multiple ML research templates | 📄 |
| 🏆 | RD-Agent<br>Microsoft Research · 2025 | — | Dual-agent R&D automation: Research agent (ideation) + Development agent (implementation) with iterative loops | #1 on MLE-bench (30.22%); NeurIPS 2025; data-centric multi-domain framework | 📄 |
| 🏆 | AI-Scientist-v2<br>SakanaAI · 2025 | — | BFTS (best-first agentic tree search) + AIDE for code generation | First AI-written paper accepted through standard peer review | 📄 |
| 🌟 | DATAGEN<br>starpig1129 · 2025 | — | Multi-agent orchestration: hypothesis generation → data analysis → visualization → report generation | LangChain + LangGraph; advanced state tracking via Note Taker agent | 📄 |
| 🏆 | AI-Researcher<br>HKUDS · NeurIPS 2025 Spotlight | — | LiteLLM multi-provider + Docker-sandboxed execution + Gradio UI | Broadest LLM compatibility; strong reproducibility focus | 📄 |
| 🏆 | Agent Laboratory<br>SamuelSchmidgall · 2024 | — | Role-specialized multi-agent: Professor → PhD Student → Reviewer | arXiv + HuggingFace integration for literature and datasets | 📄 |
| 🌟 | EvoScientist<br>EvoScientist Team · 2026 | — | Six-agent team (plan, research, code, analyze, write, review) with RL self-improvement | ICAIS 2025 Best Paper; #1 on DeepResearch Bench II; human-on-the-loop paradigm | 📄 |
| 🌟 | Biomni<br>Stanford SNAP · 2025 | — | Biomedical datalake + know-how library + sandboxed code execution | Domain-specialized for biology & medicine; multimodal inputs | 📄 |
| 🔬 | MedResearcher-R1<br>AQ-MedAI · 2025 | — | KG-grounded multi-hop QA synthesis + trajectory generation for medical AI training | SOTA on MedBrowseComp; open 32B model + full training data released | 📄 |
| 🔬 | BioAgents<br>bio-xyz · 2025 | — | Specialized literature + analysis agents for biological sciences | SOTA on BixBench (48.78% open-answer); configurable dual-agent backend | 📄 |
| 🏆 | ARIS<br>wanshuiyin | — | Claude Code + MCP servers; runs overnight unattended | Cross-model review loops; Zotero + Obsidian integration | 📄 |
| 🌟 | Idea2Paper<br>AgentAlphaAGI | — | Multi-agent + Knowledge Graph alignment for novelty checking | Semantic Scholar + arXiv grounding; idea → draft pipeline | 📄 |
## 🔍 Literature Review & Deep Research

Systems specialized in information gathering, synthesis, and structured report generation. The entry point for most research workflows — and often the most practical category for daily use.
| Tier | Project | Stars | Core Approach | Notes | Report |
|---|---|---|---|---|---|
| 🏆 | deep-research<br>dzhng (Aomni) · 2025 | — | Recursive depth/breadth search with Firecrawl + LLM extraction; <500 LoC reference scaffold | Most-forked deep-research scaffold; direct inspiration for Open Deep Research and DeerFlow | 📄 |
| 🏆 | STORM<br>Stanford OVAL · NAACL 2024 | — | Multi-perspective question asking + DSPy pipeline | Generates full Wikipedia-style articles with citations; Co-STORM for collaborative mode | 📄 |
| 🏆 | GPT Researcher<br>assafelovic · 2023 | — | Parallel web scraping agents + LangGraph orchestration | Outputs 5–6 page cited report (PDF / Docx / MD); MCP server support | 📄 |
| 🏆 | MiroThinker<br>MiroMind AI · 2025 | — | RL-trained open-source agent (30B / 235B) with 256K context + 300 tool calls | SOTA on BrowseComp (88.2 H1, 74.0 open); step-verifiable long-chain reasoning | 📄 |
| 🌟 | CognitiveKernel-Pro<br>Tencent AI Lab · 2025 | — | SFT-trained Qwen3-8B + Playwright web engine + multi-agent (web/file/main) | Outperforms RL-trained WebDancer/WebSailor on GAIA using an SFT-only recipe; fully open model & data | 📄 |
| 🏆 | DeerFlow<br>ByteDance · 2025 | — | Sub-agent orchestration with persistent memory + InfoQuest + LangGraph | Uniquely combines deep research with code generation in one pipeline | 📄 |
| 🔬 | Deeper-Seeker<br>HarshJ23 · 2024 | — | Iterative research with follow-up questions + multi-step query generation + report synthesis | OSS alternative to OpenAI's Deep Research; Exa integration for web search | 📄 |
| 🌟 | PaperQA2<br>Future House · ICLR 2024 | — | Iterative RAG over full-text PDFs using a tantivy search index | Highest-accuracy Q&A over local scientific papers; outperforms Perplexity Pro | 📄 |
| 🌟 | OpenScholar<br>Asai et al. · Nature 2024 | — | Dense retrieval (Contriever) over 45M open-access papers | Outperforms PaperQA2 on scientific Q&A; evidence-grounded answers | 📄 |
| 🌟 | Open Deep Research<br>LangChain · 2025 | — | LangGraph workflow + MCP tool plugins + LangSmith tracing | Reference implementation from LangChain; highly configurable | 📄 |
| 🌟 | ToolUniverse<br>Harvard Medical School · 2025 | — | AI-Tool Interaction Protocol; 1,000+ tools (ML models, datasets, APIs, packages) | Universal LLM support (Claude, GPT, Gemini, Qwen, DeepSeek); 68+ pre-built research skills | 📄 |
| 🏆 | Tongyi DeepResearch<br>Alibaba NLP · 2025 | — | RL-trained agentic LLM (30.5B, GRPO) | SOTA on long-horizon information-seeking benchmarks; open-weight model | 📄 |
| 🌟 | DeepResearchAgent<br>Skywork AI | — | Hierarchical multi-agent + Autogenesis self-evolution | Planning agent coordinates specialized lower-level agents | 📄 |
| 🔬 | II-Researcher<br>Intelligent Internet · 2025 | — | BAML-structured LLM functions + multi-provider web search + async reflection loop | 84.12% on the Frames multi-hop benchmark; MCP server support; pip-installable | 📄 |
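Most systems in this table, deep-research most explicitly, share a recursive depth/breadth control flow: answer the query, ask follow-up questions, and recurse on each one. A minimal sketch of that pattern with the LLM and search backends stubbed out (`expand_query` and `search` are placeholders, not any project's real API):

```python
def expand_query(query: str, breadth: int) -> list[str]:
    # Placeholder: a real system would ask an LLM for follow-up queries here.
    return [f"{query} / follow-up {i}" for i in range(breadth)]

def search(query: str) -> str:
    # Placeholder: a real system would call a search/scrape backend here.
    return f"findings for: {query}"

def deep_research(query: str, depth: int, breadth: int) -> list[str]:
    """Recursively widen (breadth) and deepen (depth) a research question,
    accumulating the findings from every visited query."""
    findings = [search(query)]
    if depth > 0:
        for sub in expand_query(query, breadth):
            findings += deep_research(sub, depth - 1, breadth)
    return findings

# depth=2, breadth=2 visits 1 + 2 + 4 = 7 queries in total.
print(len(deep_research("sparse attention", depth=2, breadth=2)))  # 7
```

With depth d and breadth b the loop visits (b^(d+1) − 1)/(b − 1) queries, which is why real scaffolds keep both parameters small.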
## ⚗️ Experiment Automation & Code Agents

The "hands" of an autonomous research pipeline. These systems write, execute, debug, and iterate on code — essential when a hypothesis needs to become a running experiment.
| Tier | Project | Stars | Core Approach | Notes | Report |
|---|---|---|---|---|---|
| 🏆 | OpenHands<br>All-Hands-AI · 2024 | — | Composable Python agent library; file editing + terminal + web browsing | 72% on SWE-Bench Verified — best-in-class; production-ready UI | 📄 |
| 🏆 | SWE-agent<br>Princeton NLP · 2024 | — | Agent-Computer Interface (ACI) giving structured file/bash/edit access | ~19% on SWE-Bench (full); widely used as a research baseline | 📄 |
| 🏆 | Aider<br>Aider-AI · 2023 | — | AI pair programming in the terminal with native Git integration | ~18% on SWE-Bench; fastest daily iteration loop; supports 60+ models | 📄 |
| 🔬 | AutoGPT<br>Significant Gravitas · 2023 | — | Plugin-based autonomous agent platform + Forge builder framework | Historically seminal; sparked the autonomous agent movement | 📄 |
| 🌟 | AIDE<br>WecoAI · 2024 | — | Tree search over the ML solution space with iterative code refinement | ML-experiment-specific; used internally by AI-Scientist-v2 | 📄 |
| 🔬 | AutoDidact<br>dCaples · 2025 | — | GRPO RL + self-generated Q&A pairs to bootstrap research-agent LLMs on custom corpora | Doubles Llama-8B accuracy in 1 hr on a single RTX 4090; fully local open-source pipeline | 📄 |
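Architectures differ, but every agent in this table runs the same outer loop: propose an action, execute it in a sandboxed environment, observe the result, and repeat until the task resolves or a step budget runs out. A deliberately toy sketch of that loop; `Env` and `propose_action` are illustrative stand-ins, not any listed project's API:

```python
from dataclasses import dataclass, field

@dataclass
class Env:
    """Toy stand-in for a sandboxed shell/editor environment."""
    target: int
    state: int = 0
    log: list = field(default_factory=list)

    def step(self, action: str) -> str:
        self.log.append(action)
        if action == "increment":
            self.state += 1
        return "done" if self.state == self.target else f"state={self.state}"

def propose_action(observation: str) -> str:
    # Placeholder policy: a real agent would query an LLM with the
    # observation, file context, and task description.
    return "increment"

def run_agent(env: Env, max_steps: int = 10) -> bool:
    """Observe → act loop with a hard step budget, as most code agents use."""
    observation = "start"
    for _ in range(max_steps):
        observation = env.step(propose_action(observation))
        if observation == "done":
            return True
    return False  # budget exhausted without resolving the task

print(run_agent(Env(target=3)))  # True
```

The step budget is what separates "runs overnight" systems from runaway loops; every agent above exposes an equivalent knob.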
## ✍️ Idea Generation & Writing Assistants

Systems focused on the creative and communicative ends of research: surfacing novel hypotheses, structuring arguments, and drafting manuscripts.
| Tier | Project | Stars | Core Approach | Notes | Report |
|---|---|---|---|---|---|
| 🌟 | Idea2Paper<br>AgentAlphaAGI | — | Multi-agent pipeline with Knowledge Graph novelty alignment | Semantic Scholar + arXiv grounding; raw idea → structured research proposal | 📄 |
| 🌟 | SciAgents<br>MIT · 2024 | — | Multi-agent system with an ontology graph for scientific reasoning | Generates multi-step reasoning chains grounded in domain ontologies | 📄 |
| 🏆 | ARIS<br>wanshuiyin | — | Claude Code + MCP servers running overnight without supervision | Cross-model review loop; integrates Zotero, Obsidian, Kimi, DeepSeek | 📄 |
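The novelty checking described above can be illustrated, very loosely, as similarity scoring against an indexed corpus. This toy sketch substitutes keyword Jaccard overlap for the real knowledge-graph or embedding machinery; the function names and mini-corpus are invented for illustration:

```python
def keywords(text: str) -> set:
    # Crude tokenizer stand-in; real systems extract entities or embeddings.
    return set(text.lower().split())

def novelty(idea: str, corpus: list) -> float:
    """1 minus the maximum Jaccard overlap with any indexed abstract.
    Real systems use KG alignment or dense retrieval instead of word overlap."""
    kw = keywords(idea)
    overlap = max(
        (len(kw & keywords(doc)) / len(kw | keywords(doc)) for doc in corpus),
        default=0.0,
    )
    return 1.0 - overlap

corpus = [
    "graph neural networks for molecule property prediction",
    "transformers for protein structure",
]
# An idea identical to an indexed abstract scores zero novelty.
print(novelty("graph neural networks for molecule property prediction", corpus))  # 0.0
```

The interesting engineering in Idea2Paper and SciAgents is exactly what this sketch omits: grounding the comparison in structured literature rather than surface word overlap.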
## 📐 Benchmarks & Evaluation Suites

Principled evaluation frameworks for measuring the capabilities of autonomous research systems.
| Benchmark | Maintained By | What It Measures | Link |
|---|---|---|---|
| SWE-Bench | Princeton NLP | Software engineering task resolution on real GitHub issues | github.com/princeton-nlp/SWE-bench |
| SWE-Bench Verified | OpenAI | Human-verified subset of SWE-Bench (cleaner signal) | openai.com/research |
| MLE-Bench | OpenAI | ML engineering quality on Kaggle competition tasks | github.com/openai/mle-bench |
| CORE-Bench | — | Computational reproducibility of published research | — |
| AI-Scientist Eval | SakanaAI | Paper quality via automated + human review | AI-Scientist |
| MLGym | Meta AI Research | 13 open-ended AI research tasks (CV, NLP, RL, game theory) for benchmarking research agents | github.com/facebookresearch/MLGym · arXiv:2502.14499 |
| DeepResearch Bench | Ayanami et al. | Comprehensive multi-domain benchmark for deep research agent quality | github.com/Ayanami0730/deep_research_bench |
| BixBench | bio-xyz | Biology-focused tool-use benchmark for research agents | github.com/bio-xyz/BioAgents |
| MedBrowseComp | AQ-MedAI | Medical knowledge synthesis via multi-hop web retrieval | github.com/AQ-MedAI/MedResearcher-R1 |
💡 Contributions to this section are especially welcome — if you know of additional evaluation suites for research agents, please open an issue or submit a PR.
## 🎓 Academic Surveys & Papers

| Year | Title | Venue | Authors | Link |
|---|---|---|---|---|
| 2024 | The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery | arXiv | Lu et al. (SakanaAI) | arXiv:2408.06292 |
| 2024 | From Copilot to Pilot: Towards AI-Driven Autonomous Scientific Research | arXiv | Guo et al. | arXiv:2409.14526 |
| 2024 | Agent Laboratory: Using LLM Agents as Research Assistants | arXiv | Schmidgall et al. | arXiv:2501.04227 |
| 2024 | STORM: Assisting in Writing Wikipedia-like Articles From Scratch | NAACL | Shao et al. (Stanford) | arXiv:2402.14207 |
| 2024 | OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs | Nature | Asai et al. | Nature |
| 2024 | PaperQA2: Accurate Scientific QA through Iterative Literature Search | ICLR | Skarlinski et al. | arXiv:2312.07559 |
| 2025 | Towards Automated Research: A Survey of AI Agents for Scientific Discovery | arXiv | Various | — |
| 2025 | The AI Scientist-v2: Workshop-Level AI Research Automation | arXiv | Lu et al. (SakanaAI) | arXiv:2504.08066 |
| 2025 | EvoScientist: Automated Scientific Discovery with Evolvable Multi-Agent Collaboration | ICAIS 2025 Best Paper | EvoScientist Team | — |
| 2025 | MLGym: A New Framework and Benchmark for Advancing AI Research Agents | arXiv | Nathani et al. (Meta) | arXiv:2502.14499 |
| 2025 | SciAgents: Accelerating Scientific Discovery with Multi-Agent Intelligent Graph Reasoning | Advanced Materials | Buehler et al. (MIT) | — |
| 2025 | Tongyi DeepResearch: Reinforcement Learning for Deep Research Agents | arXiv | Alibaba NLP | — |
## 🧾 In-Depth Analysis Reports

The reports/ folder is the core value of this repository. Each file contains a structured 10-section analysis: architecture internals, component breakdowns, benchmark context, and honest assessment of strengths and limitations.
| Tier | Report | System | Category | Key Topics Covered |
|---|---|---|---|---|
| 🏆 | ai-scientist.md | AI-Scientist | End-to-End | LaTeX pipeline, template-driven idea gen, agentic review loop |
| 🏆 | ai-scientist-v2.md | AI-Scientist v2 | End-to-End | BFTS tree search, AIDE integration, peer review milestone |
| 🏆 | ai-researcher.md | AI-Researcher | End-to-End | LiteLLM multi-provider, Docker sandbox, NeurIPS 2025 |
| 🏆 | agent-laboratory.md | Agent Laboratory | End-to-End | Role-specialized agents, arXiv + HuggingFace integration |
| 🌟 | biomni.md | Biomni | End-to-End | Biomedical datalake, know-how library, multimodal inputs |
| 🔬 | bioagents.md | BioAgents | End-to-End | Specialized literature + analysis agents, BixBench SOTA (48.78%) |
| 🏆 | storm.md | STORM | Literature | DSPy pipeline, multi-perspective QA, Co-STORM |
| 🏆 | gpt-researcher.md | GPT Researcher | Literature | Parallel scraping, LangGraph orchestration, MCP |
| 🏆 | deerflow.md | DeerFlow | Literature | ByteDance InfoQuest, sub-agent memory, code execution |
| 🌟 | paperqa2.md | PaperQA2 | Literature | Iterative retrieval, tantivy indexing, ICLR results |
| 🌟 | openscholar.md | OpenScholar | Literature | 45M paper index, Contriever dense retrieval, Nature paper |
| 🌟 | open-deep-research.md | Open Deep Research | Literature | LangChain MCP integration, LangSmith tracing |
| 🏆 | openhands.md | OpenHands | Code Agent | 72% SWE-Bench Verified, composable agent architecture |
| 🏆 | swe-agent.md | SWE-agent | Code Agent | Agent-Computer Interface (ACI), Princeton NLP design |
| 🔬 | autogpt.md | AutoGPT | Code Agent | Historical context, Forge platform, Agent Protocol |
| 🏆 | autoresearch.md | autoresearch | End-to-End | 630-line self-referential experiment loop, Karpathy design philosophy |
| 🏆 | deep-research.md | deep-research | Literature | Recursive depth/breadth scaffold, Firecrawl+Exa, TypeScript reference |
| 🌟 | cognitivekernel-pro.md | CognitiveKernel-Pro | Literature | SFT-trained Qwen3-8B, Playwright web engine, Tencent AI Lab |
| 🌟 | datagen.md | DATAGEN | End-to-End | Multi-agent hypothesis gen, data analysis pipeline, state tracking |
| 🔬 | medresearcher-r1.md | MedResearcher-R1 | End-to-End | Medical KG-grounded trajectory synthesis, 32B model, MedBrowseComp SOTA |
| 🏆 | mirothinker.md | MiroThinker | Literature | RL-trained 30B/235B open models, 88.2 BrowseComp, interactive scaling |
| 🔬 | deeper-seeker.md | Deeper-Seeker | Literature | Iterative research, follow-up questions, multi-step synthesis |
| 🔬 | autodidact.md | AutoDidact | Code Agent | GRPO self-bootstrapping, Llama-8B, single-GPU research agent training |
| 🔬 | ii-researcher.md | II-Researcher | Literature | BAML structured LLM functions, 84.12% Frames, async multi-provider search |
| 🏆 | aider.md | Aider | Code Agent | AI pair programming, 60+ LLM models, SWE-Bench ~18%, Git-native commits |
| 🏆 | aris.md | ARIS | End-to-End | Claude Code + MCP overnight agent, cross-model review, Zotero + Obsidian |
| 🏆 | tongyi-deepresearch.md | Tongyi DeepResearch | Literature | RL-trained 30.5B (GRPO), SOTA long-horizon info-seeking, open-weight |
| 🌟 | deep-research-agent.md | DeepResearchAgent | Literature | Hierarchical multi-agent, Autogenesis self-evolution, Skywork AI |
| 🌟 | idea2paper.md | Idea2Paper | Idea Generation | Multi-agent + KG novelty alignment, Semantic Scholar + arXiv pipeline |
| 🌟 | sciagents.md | SciAgents | Idea Generation | Ontology graph + multi-agent reasoning, MIT Buehler lab |
| 🌟 | aide.md | AIDE | Code Agent | Tree-search over ML solution space, iterative code refinement, WecoAI |
## 🧭 How to Choose the Right Tool

Answer the questions below in order — each branch ends at a concrete recommendation.
── START HERE ────────────────────────────────────────────────────────────────
Q1: What is your end goal?
│
├─ (A) Produce a full research paper / manuscript
│ └─ go to Q2
│
├─ (B) Survey a topic, synthesize literature, or generate a research report
│ └─ go to Q5
│
├─ (C) Run, debug, or automate code / ML experiments
│ └─ go to Q8
│
└─ (D) Generate or refine novel research ideas
   └─ go to Q10
───────────────────────────────────────────────────────────────────────────────
A: FULL PAPER / MANUSCRIPT
───────────────────────────────────────────────────────────────────────────────
Q2: What research domain are you in?
│
├─ General ML / Computer Science
│ └─ go to Q3
│
├─ Biomedical / Life Sciences
│ └─ ✅ Biomni (Stanford SNAP; biomedical datalake + know-how library)
│
└─ Other / interdisciplinary
└─ go to Q3 (general systems are still useful starting points)
Q3: How much control / human involvement do you want?
│
├─ Fully autonomous — I want to set it running overnight
│ └─ go to Q4
│
└─ Semi-autonomous — I want to steer hypothesis and review results
└─ ✅ Agent Laboratory (role-based: Professor → PhD Student → Reviewer;
human can intervene at each stage)
Q4: Do you prioritize pipeline maturity or LLM flexibility?
│
├─ Mature pipeline, proven end-to-end results
│ └─ ✅ AI-Scientist v1 / v2 (SakanaAI; produced first peer-reviewed AI paper)
│
└─ Broadest LLM provider support + reproducible Docker environment
└─ ✅ AI-Researcher (HKUDS; LiteLLM + Docker; NeurIPS 2025 Spotlight)
───────────────────────────────────────────────────────────────────────────────
B: LITERATURE SURVEY / RESEARCH REPORT
───────────────────────────────────────────────────────────────────────────────
Q5: Where does your source material come from?
│
├─ The open web (news, blogs, general knowledge)
│ └─ go to Q6
│
├─ My own PDF collection (papers I've already downloaded)
│ └─ ✅ PaperQA2 (iterative full-text RAG; highest accuracy on local PDFs)
│
└─ Academic papers at large scale (no local download needed)
└─ ✅ OpenScholar (45M open-access papers; Contriever dense retrieval;
        Nature 2024; outperforms PaperQA2 on scientific Q&A)
Q6: What output format do you need?
│
├─ A structured, Wikipedia-style article with cited sections
│ └─ ✅ STORM (Stanford OVAL; DSPy pipeline; Co-STORM for collaboration;
NAACL 2024)
│
├─ A concise 5–6 page factual report (PDF / Word / Markdown)
│ └─ ✅ GPT Researcher (parallel web agents + LangGraph; MCP support;
fastest route to a cited report)
│
└─ A report that also includes runnable code or data analysis
└─ go to Q7
Q7: Do you need a production-grade, configurable pipeline?
│
├─ Yes — I'm building this into a product or workflow
│ └─ ✅ Open Deep Research (LangChain; MCP tool plugins; LangSmith
tracing; designed as a reference implementation)
│
└─ No — I need something working quickly out of the box
└─ ✅ DeerFlow (ByteDance; LangGraph + memory + code execution;
research + code in one pipeline)
───────────────────────────────────────────────────────────────────────────────
C: CODE / EXPERIMENT AUTOMATION
───────────────────────────────────────────────────────────────────────────────
Q8: What is your primary metric for choosing?
│
├─ Raw benchmark performance on software engineering tasks
│ └─ ✅ OpenHands (72% on SWE-Bench Verified; best-in-class;
composable Python library + Web UI)
│
├─ Structured, auditable, research-friendly interface
│ └─ ✅ SWE-agent (Princeton NLP; Agent-Computer Interface (ACI);
widely used as research baseline)
│
├─ Daily pair-programming with Git integration (low overhead)
│ └─ ✅ Aider (terminal-native; Git-native commits; supports 60+ models)
│
└─ ML-experiment-specific iteration (Kaggle / benchmark tasks)
└─ go to Q9
Q9: Is your task similar to Kaggle-style ML competitions?
│
├─ Yes
│ └─ ✅ AIDE (WecoAI; tree-search over solution space;
used internally by AI-Scientist-v2)
│
└─ No — I want a pioneering framework to understand the space
└─ ✅ AutoGPT (historically seminal; Forge builder; broad plugin ecosystem)
───────────────────────────────────────────────────────────────────────────────
D: NOVEL IDEA GENERATION
───────────────────────────────────────────────────────────────────────────────
Q10: What kind of grounding do you need for the ideas?
│
├─ Literature-grounded novelty checking (Semantic Scholar + arXiv KG)
│ └─ ✅ Idea2Paper (KG alignment; raw idea → structured proposal)
│
├─ Domain ontology-based scientific reasoning
│ └─ ✅ SciAgents (MIT; multi-agent + ontology graphs)
│
├─ Iterative critique loops against academic concept databases
│ └─ ✅ ResearchAgent (lightweight; good for early-stage idea exploration)
│
└─ Fully autonomous overnight ideation with cross-model review
└─ ✅ ARIS (Claude Code + MCP; runs unattended; Zotero + Obsidian)
── STILL UNSURE? ─────────────────────────────────────────────────────────────
→ Check the Capability Matrix above to compare any two systems side-by-side
→ Read the in-depth reports in reports/ for architecture and limitation details
──────────────────────────────────────────────────────────────────────────────
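If you prefer the guide in executable form, the branches above collapse into a small lookup function. This is only a transcription of a few branches of the tree, not an official API:

```python
def recommend(goal: str, **prefs) -> str:
    """Transcribes selected branches of the decision tree above."""
    if goal == "paper":                       # Branch A
        if prefs.get("domain") == "biomedical":
            return "Biomni"
        if not prefs.get("autonomous", True):
            return "Agent Laboratory"
        return "AI-Researcher" if prefs.get("llm_flexibility") else "AI-Scientist v1/v2"
    if goal == "report":                      # Branch B (partial)
        if prefs.get("sources") == "local_pdfs":
            return "PaperQA2"
        return "GPT Researcher"               # common default for a cited web report
    if goal == "code":                        # Branch C (partial)
        return "OpenHands" if prefs.get("benchmark_performance") else "Aider"
    if goal == "ideas":                       # Branch D (partial)
        return "ARIS" if prefs.get("overnight") else "Idea2Paper"
    raise ValueError(f"unknown goal: {goal}")

print(recommend("paper", domain="biomedical"))       # Biomni
print(recommend("code", benchmark_performance=True)) # OpenHands
```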
## 🤝 Contributing

- Add a project — open an issue or submit a PR to the appropriate section table
- Write an analysis report — see the report template, create `reports/<slug>.md`, and update the Reports table above
- Fix outdated info — broken links, stale star counts, new benchmark scores
- Suggest new sections — open a Discussion
Please read CONTRIBUTING.md before submitting.
## 📈 Star History

Star growth of the leading research-specific tools since their respective launch dates.
AutoGPT (170k+ ⭐) is excluded from the chart to keep the research tools readable — view full comparison including AutoGPT →
Maintained by Peizheng Li · Licensed under MIT
If this repository helped your research, please consider giving it a ⭐