A cognitively inspired multi-agent framework for generating comprehensive, multimodal deep research reports.
- [2026/04] CogGen has been accepted to Findings of ACL 2026!
CogGen is an open-source multi-agent system that generates comprehensive, multimodal research reports through cognitively inspired recursive processes. It leverages a three-agent architecture — Planner, Writer, and Reviewer — orchestrated via Google ADK and powered by any LLM through LiteLLM.
CogGen employs a Hierarchical Recursive Architecture consisting of two nested cognitive loops:
- Macro-Cognitive Loop: The Planner generates a global outline, sections are written in parallel, and the Reviewer provides structural/content feedback (Δ) to trigger replanning iterations.
- Micro-Cognitive Cycle: Per-section processing where each section undergoes independent search → plan → write → revise cycles with format, factual, and cognitive revision.
- Renderer Agent: Translates Abstract Visual Representations (AVRs) into executable ECharts/Mermaid syntax and renders them to PNG via Playwright.
| Agent | Role | Sub-agents |
|---|---|---|
| Planner Agent (Aₚ) | Information retrieval & structural planning | init_research, outline, section_plan, section_search, replan_loop, combine_plan |
| Writer Agent (Aw) | Text composition & AVR definition | section_writer, write_loop, role_inference, content_cleanup |
| Reviewer Agent (Aᵣ) | Evaluation & feedback signals (Δ) | structure_detector, content_detector, plan_restructurer, refine_loop, *_revise agents |
| Renderer Agent | Translates AVRs into executable syntax | gen_image (ECharts + Mermaid → PNG/HTML) |
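The two nested loops can be sketched in plain Python. Every function name below is a hypothetical stand-in for illustration only, not the actual CogGen API:

```python
# Hypothetical sketch of CogGen's nested cognitive loops.
# None of these function names exist in the real codebase.

def macro_loop(query, max_replans=2):
    outline = plan_outline(query)                     # Planner: global outline
    for _ in range(max_replans + 1):
        sections = [micro_cycle(s) for s in outline]  # sections written independently
        delta = review(sections)                      # Reviewer feedback signal (Δ)
        if not delta:                                 # no structural issues left
            return sections
        outline = replan(outline, delta)              # Planner reacts to feedback
    return sections

def micro_cycle(section):
    context = search(section)        # per-section retrieval
    draft = write(section, context)  # Writer composes text + AVRs
    return revise(draft)             # format / factual / cognitive revision

# Minimal stubs so the sketch runs end to end:
def plan_outline(q): return [f"{q}: background", f"{q}: findings"]
def search(s): return [s]
def write(s, ctx): return s.upper()
def revise(d): return d
def review(secs): return None        # pretend the Reviewer is satisfied
def replan(o, d): return o

print(macro_loop("quantum"))
```

The key structural point is that the Reviewer's Δ signal is what closes the outer loop: replanning happens only when review finds issues.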
- Python 3.10+
- Node.js 18+ (optional, for ECharts validation)
- A Tavily API key
- An LLM API key (OpenAI, Azure OpenAI, or any LiteLLM-supported provider)
# Clone the repository
git clone https://github.com/NJUNLP/coggen.git
cd coggen
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies (creates .venv automatically)
uv sync
# (Optional) Install Playwright for chart PNG rendering
uv sync --extra render
uv run playwright install chromium
# Copy and edit environment variables
cp .env.example .env
# Edit .env with your API keys

| Variable | Required | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | Yes* | — | OpenAI API key |
| AZURE_API_KEY | Alt* | — | Azure OpenAI API key |
| TAVILY_API_KEY | Yes | — | Tavily web search API key |
| PROMPT_LANG | No | en | en (with charts) or en_no_image (text only) |
| COGGEN_WRITING_MODEL | No | gpt-4.1 | Writing model |
| COGGEN_REASONING_MODEL | No | gpt-4.1 | Planning & revision model |
| COGGEN_CODE_MODEL | No | gpt-4.1 | Chart code generation model |
| COGGEN_TEMPERATURE | No | 0.5 | LLM sampling temperature |
| COGGEN_SEARCH_MODE | No | snippet | snippet (fast) or full (thorough) |
| ECHARTS_VALIDATION | No | simple | simple, enhanced, or full |
* At least one LLM provider key is required.
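A minimal sketch of how these variables might be consumed, with defaults mirroring the table above (the real coggen/config.py internals may differ):

```python
import os

def load_config():
    """Read CogGen settings from the environment; defaults mirror the table above."""
    return {
        "writing_model": os.getenv("COGGEN_WRITING_MODEL", "gpt-4.1"),
        "reasoning_model": os.getenv("COGGEN_REASONING_MODEL", "gpt-4.1"),
        "code_model": os.getenv("COGGEN_CODE_MODEL", "gpt-4.1"),
        "temperature": float(os.getenv("COGGEN_TEMPERATURE", "0.5")),
        "search_mode": os.getenv("COGGEN_SEARCH_MODE", "snippet"),
    }

os.environ.pop("COGGEN_SEARCH_MODE", None)  # start from a clean slate
os.environ["COGGEN_TEMPERATURE"] = "0.2"    # override one default
cfg = load_config()
print(cfg["temperature"], cfg["search_mode"])  # -> 0.2 snippet
```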
# Standard mode (full CogGen pipeline)
uv run python run.py "What are the latest advances in quantum computing?"
# With custom title and output directory
uv run python run.py "AI in healthcare" --title "AI Healthcare Report" --output ./my_reports

CogGen supports three generation modes for ablation studies:
# Standard: Full pipeline (Macro + Micro cognitive loops)
uv run python run.py "Your query" --mode standard
# Simple: w/o Macro-Cognitive Loop (no replan/refine iterations)
uv run python run.py "Your query" --mode simple
# No-Revise: w/o Micro-Cognitive Cycle (shared context, no per-section search/revision)
uv run python run.py "Your query" --mode no_revise

| Mode | Macro Loop | Micro Cycle | Per-Section Search | Description |
|---|---|---|---|---|
| Standard | ✓ | ✓ | ✓ | Full CogGen pipeline |
| Simple | ✗ | ✓ | ✓ | w/o Macro-Cognitive Loop |
| No-Revise | ✗ | ✗ | ✗ | w/o Micro-Cognitive Cycle |
| Direct | ✗ | ✗ | ✗ | Single LLM call baseline |
# From a text file (one query per line)
uv run python run.py --batch queries.txt
# From a JSONL file
uv run python run.py --batch ./data/wildseek/queries.json --mode standard --output ./batch_output

Run systematic ablation experiments to evaluate each component's contribution:
# Run all three modes on the same query set
python -m ablation.run_ablation --queries data/owid/queries.json
# Run specific modes only
python -m ablation.run_ablation --queries data/wildseek/queries.json --modes standard simple
# Direct LLM baseline (single-call generation)
python -m ablation.gen_report_once "Your query"

CogGen includes a comprehensive evaluation framework based on CLEF (Cognitive Load Evaluation Framework), featuring automated LLM-based quality scoring, pairwise comparison, citation accuracy analysis, and report statistics collection.
Batch evaluation modes require a JSONL file as input. Use the built-in collection script to scan output/reports/, de-duplicate by query (keeping the newest successful result), and produce a ready-to-use JSONL:
# Default: scan output/reports, write to output/model_reports.jsonl
python -m evaluation.scripts.collect_reports
# Custom paths
python -m evaluation.scripts.collect_reports -d output/reports -o my_reports.jsonl
# Keep all reports (no dedup), useful for comparing different runs of the same query
python -m evaluation.scripts.collect_reports --all -o all_reports.jsonl

Each line in the output JSONL contains query, markdown, title, report_dir, and report_path fields.
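A collected JSONL file can be read back with a few lines of standard-library Python. The sample record below is fabricated, but follows the field layout just described:

```python
import json, io

# One fabricated line in the format collect_reports produces.
line = json.dumps({
    "query": "AI in healthcare",
    "markdown": "# AI Healthcare Report\n...",
    "title": "AI Healthcare Report",
    "report_dir": "output/reports/20260409_103508_7401",
    "report_path": "output/reports/20260409_103508_7401/report_final.md",
})

# JSONL = one JSON object per line; parse each line independently.
for record in map(json.loads, io.StringIO(line)):
    print(record["title"], "->", record["report_path"])
```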
Evaluate a single report across 5 cognitive dimensions:
# Evaluate a single markdown report (result auto-saved to report's directory as eval_result.json)
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md
# With a specific query for context
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md --query "Your query"
# Specify a custom output path
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md --output result.json
# Evaluate from a JSONL batch file
python -m evaluation.evaluate single --model-output reports.jsonl --output results.json
# Use a specific evaluation model
python -m evaluation.evaluate single --model-output reports.jsonl --eval-model gpt-4o --output results.json

Compare model-generated reports against reference reports (ground truth) using positive-scoring rubrics:
python -m evaluation.evaluate pairwise \
--model-output my_reports.jsonl \
--ground-truth ./coggen-benchmark/owid_reports.jsonl \
--output comparison.json

The pairwise evaluator computes a relative advantage score per dimension: model_score / (model_score + ref_score), where values > 0.5 indicate the model report is stronger.
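In code, the per-dimension advantage score is simply:

```python
def advantage(model_score: float, ref_score: float) -> float:
    """Relative advantage: values > 0.5 mean the model report is stronger."""
    return model_score / (model_score + ref_score)

print(advantage(8.0, 8.0))  # evenly matched -> 0.5
print(advantage(9.0, 6.0))  # model stronger -> 0.6
```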
Evaluate citation quality by checking URL accessibility (recall) and content relevance (precision):
# CogGen citation format ([N] footnotes)
python -m evaluation.evaluate citation \
--model-output reports.jsonl \
--citation-format coggen \
--output citation_results.json
# Supported formats: coggen, mmdr, gemini
python -m evaluation.evaluate citation \
--model-output reports.jsonl \
--citation-format mmdr \
--cache-dir .cache/citations

| Metric | Description |
|---|---|
| citation_recall | Fraction of cited URLs that are accessible |
| citation_precision | Fraction of citations whose content is relevant to the context |
| f1_score | Harmonic mean of recall and precision |
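The F1 combination of the two metrics follows the usual harmonic-mean formula. A sketch (not the evaluator's exact code):

```python
def citation_f1(recall: float, precision: float) -> float:
    """Harmonic mean of citation recall and precision; 0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(round(citation_f1(0.9, 0.6), 4))  # -> 0.72
```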
Collect structural statistics (character count, word count, visualization count) from report files:
python -m evaluation.evaluate statistics --model-output reports.jsonl --output stats.json

| Dimension | Focus | Cognitive Principle | Category |
|---|---|---|---|
| D1: Visual-Text Alignment | Visual-text integration | Spatial Contiguity | Core |
| D2: Multimodal Synergy | Information gain from multimodal content | Multimedia, Redundancy | Core |
| D3: Information Organization | Hierarchical structure | Signaling, Segmenting | Control |
| D4: Content Depth | Causal explanations | Schema Construction | Control |
| D5: Content Relevance | Topic coherence & coverage | Coherence, Pre-training | Control |
| Flag | Description |
|---|---|
| --eval-model | LLM model for evaluation (default: gpt-4o) |
| --temperature | Evaluation temperature (default: 0.1) |
| --force | Ignore existing checkpoints, re-evaluate everything |
| --retry-failed | Re-attempt previously failed evaluations |
| --cache-dir | Cache directory for URL content (default: .cache/citations) |
Each report generation run produces a timestamped directory:
output/reports/20260409_103508_7401/
├── report_final.md # Final report with rendered charts and footnotes
├── report.md # Raw report content
├── report_image.md # Report with <IMAGE> placeholders replaced
├── report_image_sources.md # Report with image placeholders + reference list
├── report_code.md # Report with embedded chart code comments
├── report_0.md # Intermediate report snapshot
├── images/ # Rendered chart images
│ ├── Figure1.1__xxx.png # PNG bitmap (when Playwright is available)
│ └── Figure1.1__xxx.html # Self-contained interactive HTML (always)
├── outline.json # Final report outline
├── outline_list.json # Outline evolution across iterations
├── section_list.json # Per-section writing history
├── gen_images_list.json # Chart metadata and ECharts/Mermaid code
├── used_chunks.json # Web sources referenced in the report
├── index_context_pool.json # Full indexed context pool
├── token_consumption.json # LLM token usage breakdown
├── config.json # Run configuration
└── report.log # Detailed execution log
CogGen uses two evaluation datasets. Query definitions are included in this repository; reference reports are hosted on HuggingFace for clean separation of code and data.
The full reference reports are available on HuggingFace in the tk1111/coggen-benchmark dataset:

hf download tk1111/coggen-benchmark --repo-type dataset --local-dir ./coggen-benchmark

| Dataset | Records | Type | Reference Source |
|---|---|---|---|
| OWID | 50 (40 main + 10 extend) | Human-written expert reports | Our World in Data |
| WildSeek | 20 | LLM-generated reference reports | Gemini Deep Research |
| File | Description |
|---|---|
| data/owid/queries.json | 50 OWID queries with source_report_id linking to HuggingFace reference reports |
| data/wildseek/queries.json | 100 WildSeek queries (20 used in evaluation) with topic, intent, and domain |
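Assuming the WildSeek query file is a JSON array of objects carrying the topic, intent, and domain keys described above (the records below are fabricated, and the real file may carry additional fields), selecting queries by domain is straightforward:

```python
import json

# Fabricated records in the shape described above; the real
# data/wildseek/queries.json may carry additional fields.
raw = json.dumps([
    {"topic": "quantum error correction", "intent": "survey", "domain": "physics"},
    {"topic": "mRNA vaccine platforms", "intent": "overview", "domain": "biology"},
])

queries = json.loads(raw)
physics = [q["topic"] for q in queries if q["domain"] == "physics"]
print(physics)  # -> ['quantum error correction']
```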
- OWID: The OWID reports were originally published by their respective authors at Our World in Data and are republished here under the Creative Commons BY 4.0 license. Each report's original authors, publication date, and source URL are preserved in the dataset. Note that data and charts within these reports may originate from third-party providers (e.g., WHO, UN, World Bank) and are subject to those providers' license terms.
- WildSeek: The WildSeek queries are sourced from the WildSeek benchmark (Jiang et al., 2024). Reference reports were generated using Gemini Deep Research and are included for research evaluation purposes.
coggen/
├── run.py # CLI entry point
├── pyproject.toml # Dependencies & build config
├── .env.example # Environment variable template
│
├── coggen/ # Core package
│ ├── config.py # Centralized configuration
│ ├── types.py # ReportRequest / ReportResult
│ │
│ ├── pipeline/ # Execution orchestration
│ │ ├── runner.py # Main execution engine
│ │ ├── helpers.py # Session, logging, file I/O
│ │ ├── simple.py # Simple ablation pipeline
│ │ └── no_revise.py # No-revise ablation pipeline
│ │
│ ├── agents/ # Multi-agent system
│ │ ├── planner/ # Planner Agent (Aₚ)
│ │ ├── writer/ # Writer Agent (Aw)
│ │ ├── reviewer/ # Reviewer Agent (Aᵣ)
│ │ └── renderer/ # Renderer Agent
│ │
│ ├── prompts/ # Prompt templates
│ ├── tools/ # Tool modules (LLM, search, chunker, rendering)
│ └── utils/ # Shared utilities
│
├── evaluation/ # CLEF evaluation framework
│ ├── evaluate.py # CLI entry point (4 modes)
│ ├── evaluator.py # Batch evaluation orchestrator
│ ├── metrics/ # Evaluation metrics
│ ├── prompts/ # Evaluation rubrics
│ └── scripts/ # Helper scripts
│
├── ablation/ # Ablation experiment scripts
│ ├── run_ablation.py # Multi-mode ablation runner
│ └── gen_report_once.py # Single-LLM-call baseline
│
└── data/ # Evaluation datasets
├── owid/ # OWID expert report queries
├── wildseek/ # WildSeek collected queries
└── examples/ # CogGen-generated example reports
If you find this work useful, please consider citing:
@inproceedings{coggen2026,
  title = "CogGen: Cognition-Inspired Deep Research Report Generation via Multi-Agent Recursive Framework",
  author = "Tian, Kuo and Sun, Pengfei and Ding, Junran and Wu, Zhen and Dai, Xinyu",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
  year = "2026",
}

Other related works:
@misc{ourworldindata,
author = {Roser, Max and Ritchie, Hannah and Ortiz-Ospina, Esteban and others},
title = {Our World in Data},
year = {2025},
howpublished = {\url{https://ourworldindata.org}},
note = {Online resource, licensed under CC-BY 4.0. Accessed: 2025},
}

@inproceedings{jiang-etal-2024-unknown,
title = "Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations",
author = "Jiang, Yucheng and
Shao, Yijia and
Ma, Dekun and
Semnani, Sina and
Lam, Monica",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.554/",
doi = "10.18653/v1/2024.emnlp-main.554",
pages = "9917--9955",
}

@inproceedings{shao-etal-2024-assisting,
title = "Assisting in Writing {W}ikipedia-like Articles From Scratch with Large Language Models",
author = "Shao, Yijia and
Jiang, Yucheng and
Kanell, Theodore and
Xu, Peter and
Khattab, Omar and
Lam, Monica",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.347/",
doi = "10.18653/v1/2024.naacl-long.347",
pages = "6252--6278",
}

Our codebase is built upon Google ADK, LiteLLM, and Tavily. The evaluation benchmark incorporates data from Our World in Data (CC-BY 4.0) and queries from the WildSeek dataset collected as part of the STORM project. We extend our sincere thanks to all these projects for making their work openly available.
This project is released under the MIT License. See LICENSE for details. The CogGen Benchmark dataset is licensed under CC-BY 4.0.