CogGen

A cognitively inspired multi-agent framework for generating comprehensive, multimodal deep research reports.




News

  • [2026/04] CogGen has been accepted to Findings of ACL 2026!

Introduction

CogGen is an open-source multi-agent system that generates comprehensive, multimodal research reports through cognitively inspired recursive processes. It leverages a three-agent architecture — Planner, Writer, and Reviewer — orchestrated via Google ADK and powered by any LLM through LiteLLM.

CogGen employs a Hierarchical Recursive Architecture consisting of two nested cognitive loops:

  • Macro-Cognitive Loop: The Planner generates a global outline, sections are written in parallel, and the Reviewer provides structural/content feedback (Δ) to trigger replanning iterations.
  • Micro-Cognitive Cycle: Per-section processing where each section undergoes independent search → plan → write → revise cycles with format, factual, and cognitive revision.
  • Renderer Agent: Translates Abstract Visual Representations (AVRs) into executable ECharts/Mermaid syntax and renders them to PNG via Playwright.
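The interplay of the two nested loops can be sketched in a few lines of Python. This is an illustrative stand-in, not CogGen's actual API: the agent classes, method names, and the shape of the feedback signal Δ (here `delta`) are assumptions; the real implementations live under `coggen/agents/`.

```python
class Planner:
    """Stand-in for the Planner Agent (Ap): outline and replan."""
    def outline(self, query):
        return [f"{query}: background", f"{query}: findings"]
    def replan(self, outline, delta):
        # Feedback-driven replanning: here we simply append the requested section.
        return outline + [delta]

class Writer:
    """Stand-in for the Writer Agent (Aw): drafts one section."""
    def write(self, section):
        return f"## {section}\n(draft text)"

class Reviewer:
    """Stand-in for the Reviewer Agent (Ar): emits feedback, then converges."""
    def __init__(self):
        self.rounds = 0
    def review(self, outline, sections):
        self.rounds += 1
        return "limitations" if self.rounds == 1 else None

def generate_report(query, planner, writer, reviewer, max_iters=3):
    outline = planner.outline(query)
    sections = []
    for _ in range(max_iters):
        # Micro-cognitive cycle: each section is drafted independently
        # (CogGen runs these in parallel with per-section search and revision).
        sections = [writer.write(sec) for sec in outline]
        # Macro-cognitive loop: Reviewer feedback triggers replanning until empty.
        delta = reviewer.review(outline, sections)
        if delta is None:
            break
        outline = planner.replan(outline, delta)
    return sections

report = generate_report("quantum computing", Planner(), Writer(), Reviewer())
```

The key structural point is that the Reviewer sits outside the per-section cycle: a single feedback signal can restructure the whole outline before the next parallel writing pass.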

Three-Agent Architecture

| Agent | Role | Sub-agents |
|-------|------|------------|
| Planner Agent (Aₚ) | Information retrieval & structural planning | init_research, outline, section_plan, section_search, replan_loop, combine_plan |
| Writer Agent (Aw) | Text composition & AVR definition | section_writer, write_loop, role_inference, content_cleanup |
| Reviewer Agent (Aᵣ) | Evaluation & feedback signals (Δ) | structure_detector, content_detector, plan_restructurer, refine_loop, *_revise agents |
| Renderer Agent | Translates AVRs into executable syntax | gen_image (ECharts + Mermaid → PNG/HTML) |

Installation

Setup

# Clone the repository
git clone https://github.com/NJUNLP/coggen.git
cd coggen

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies (creates .venv automatically)
uv sync

# (Optional) Install Playwright for chart PNG rendering
uv sync --extra render
uv run playwright install chromium

# Copy and edit environment variables
cp .env.example .env
# Edit .env with your API keys

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| OPENAI_API_KEY | Yes* | | OpenAI API key |
| AZURE_API_KEY | Alt* | | Azure OpenAI API key |
| TAVILY_API_KEY | Yes | | Tavily web search API key |
| PROMPT_LANG | No | en | en (with charts) or en_no_image (text only) |
| COGGEN_WRITING_MODEL | No | gpt-4.1 | Writing model |
| COGGEN_REASONING_MODEL | No | gpt-4.1 | Planning & revision model |
| COGGEN_CODE_MODEL | No | gpt-4.1 | Chart code generation model |
| COGGEN_TEMPERATURE | No | 0.5 | LLM sampling temperature |
| COGGEN_SEARCH_MODE | No | snippet | snippet (fast) or full (thorough) |
| ECHARTS_VALIDATION | No | simple | simple, enhanced, or full |

* At least one LLM provider key is required.
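For illustration, the fallback behavior of the optional variables can be mimicked with a small helper. This is a sketch of the documented defaults only; CogGen's actual configuration loading lives in `coggen/config.py` and may differ.

```python
import os

# Defaults copied from the table above (values are stored as strings,
# as they would appear in the environment).
DEFAULTS = {
    "PROMPT_LANG": "en",
    "COGGEN_WRITING_MODEL": "gpt-4.1",
    "COGGEN_REASONING_MODEL": "gpt-4.1",
    "COGGEN_CODE_MODEL": "gpt-4.1",
    "COGGEN_TEMPERATURE": "0.5",
    "COGGEN_SEARCH_MODE": "snippet",
    "ECHARTS_VALIDATION": "simple",
}

def get_setting(name: str) -> str:
    """Return the environment override if set, else the documented default."""
    return os.environ.get(name, DEFAULTS[name])

os.environ["COGGEN_SEARCH_MODE"] = "full"
mode = get_setting("COGGEN_SEARCH_MODE")   # override wins
temp = get_setting("COGGEN_TEMPERATURE")   # falls back to the default
```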

Quick Start

Generate a Single Report

# Standard mode (full CogGen pipeline)
uv run python run.py "What are the latest advances in quantum computing?"

# With custom title and output directory
uv run python run.py "AI in healthcare" --title "AI Healthcare Report" --output ./my_reports

Ablation Modes

CogGen supports three generation modes for ablation studies:

# Standard: Full pipeline (Macro + Micro cognitive loops)
uv run python run.py "Your query" --mode standard

# Simple: w/o Macro-Cognitive Loop (no replan/refine iterations)
uv run python run.py "Your query" --mode simple

# No-Revise: w/o Micro-Cognitive Cycle (shared context, no per-section search/revision)
uv run python run.py "Your query" --mode no_revise

| Mode | Macro Loop | Micro Cycle | Per-Section Search | Description |
|------|------------|-------------|--------------------|-------------|
| Standard | ✓ | ✓ | ✓ | Full CogGen pipeline |
| Simple | ✗ | ✓ | ✓ | w/o Macro-Cognitive Loop |
| No-Revise | ✓ | ✗ | ✗ | w/o Micro-Cognitive Cycle |
| Direct | ✗ | ✗ | ✗ | Single LLM call baseline |

Batch Processing

# From a text file (one query per line)
uv run python run.py --batch queries.txt

# From a JSONL file
uv run python run.py --batch ./data/wildseek/queries.json --mode standard --output ./batch_output
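The two batch input formats can be produced like this. The plain-text layout (one query per line) follows the comment above; the `query` field name for the JSONL variant is an assumption — check `data/wildseek/queries.json` for the real schema.

```python
import json
from pathlib import Path

queries = [
    "What are the latest advances in quantum computing?",
    "AI in healthcare",
]

# Plain-text batch file: one query per line.
Path("queries.txt").write_text("\n".join(queries) + "\n", encoding="utf-8")

# JSONL batch file: one JSON object per line.
with open("queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps({"query": q}) + "\n")
```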

Ablation Experiments

Run systematic ablation experiments to evaluate each component's contribution:

# Run all three modes on the same query set
python -m ablation.run_ablation --queries data/owid/queries.json

# Run specific modes only
python -m ablation.run_ablation --queries data/wildseek/queries.json --modes standard simple

# Direct LLM baseline (single-call generation)
python -m ablation.gen_report_once "Your query"

Evaluation

CogGen includes a comprehensive evaluation framework based on CLEF (Cognitive Load Evaluation Framework), featuring automated LLM-based quality scoring, pairwise comparison, citation accuracy analysis, and report statistics collection.

Evaluation Modes

1. Collect Reports into JSONL

Batch evaluation modes require a JSONL file as input. Use the built-in collection script to scan output/reports/, de-duplicate by query (keeping the newest successful result), and produce a ready-to-use JSONL:

# Default: scan output/reports, write to output/model_reports.jsonl
python -m evaluation.scripts.collect_reports

# Custom paths
python -m evaluation.scripts.collect_reports -d output/reports -o my_reports.jsonl

# Keep all reports (no dedup), useful for comparing different runs of the same query
python -m evaluation.scripts.collect_reports --all -o all_reports.jsonl

Each line in the output JSONL contains query, markdown, title, report_dir, and report_path fields.
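The de-duplication rule (keep the newest successful result per query) can be illustrated with a toy version of the script. Field names match the documented JSONL output; comparing timestamped `report_dir` names lexicographically to decide "newest" is an assumption about the real implementation.

```python
def dedup_reports(records):
    """Keep one record per query, preferring the newest report_dir.

    Timestamped directory names like 20260409_103508_7401 sort
    chronologically when compared as strings.
    """
    newest = {}
    for rec in records:
        q = rec["query"]
        if q not in newest or rec["report_dir"] > newest[q]["report_dir"]:
            newest[q] = rec
    return list(newest.values())

records = [
    {"query": "solar power", "report_dir": "output/reports/20260408_090000_1111"},
    {"query": "solar power", "report_dir": "output/reports/20260409_103508_7401"},
    {"query": "wind power",  "report_dir": "output/reports/20260407_120000_2222"},
]
kept = dedup_reports(records)
```

Passing `--all` to the real script skips this step, which is what makes it useful for comparing repeated runs of the same query.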

2. Single-Report Evaluation (CLEF)

Evaluate a single report across 5 cognitive dimensions:

# Evaluate a single markdown report (result auto-saved to report's directory as eval_result.json)
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md

# With a specific query for context
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md --query "Your query"

# Specify a custom output path
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md --output result.json

# Evaluate from a JSONL batch file
python -m evaluation.evaluate single --model-output reports.jsonl --output results.json

# Use a specific evaluation model
python -m evaluation.evaluate single --model-output reports.jsonl --eval-model gpt-4o --output results.json

3. Pairwise Comparison

Compare model-generated reports against reference reports (ground truth) using positive-scoring rubrics:

python -m evaluation.evaluate pairwise \
    --model-output my_reports.jsonl \
    --ground-truth ./coggen-benchmark/owid_reports.jsonl \
    --output comparison.json

The pairwise evaluator computes a relative advantage score per dimension: model_score / (model_score + ref_score), where values > 0.5 indicate the model report is stronger.
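In code, the per-dimension score reduces to a one-liner; the function name here is illustrative, not the evaluator's actual API.

```python
def relative_advantage(model_score: float, ref_score: float) -> float:
    """Per-dimension advantage: > 0.5 means the model report is stronger."""
    return model_score / (model_score + ref_score)

tie = relative_advantage(4.0, 4.0)       # equal quality
edge = relative_advantage(3.0, 1.0)      # model clearly stronger
```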

4. Citation Accuracy

Evaluate citation quality by checking URL accessibility (recall) and content relevance (precision):

# CogGen citation format ([N] footnotes)
python -m evaluation.evaluate citation \
    --model-output reports.jsonl \
    --citation-format coggen \
    --output citation_results.json

# Supported formats: coggen, mmdr, gemini
python -m evaluation.evaluate citation \
    --model-output reports.jsonl \
    --citation-format mmdr \
    --cache-dir .cache/citations

| Metric | Description |
|--------|-------------|
| citation_recall | Fraction of cited URLs that are accessible |
| citation_precision | Fraction of citations whose content is relevant to the context |
| f1_score | Harmonic mean of recall and precision |
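The f1_score follows directly from the other two metrics as their harmonic mean; a minimal sketch (function name is illustrative):

```python
def citation_f1(recall: float, precision: float) -> float:
    """Harmonic mean of citation recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

f1 = citation_f1(0.8, 0.6)
```

Because the harmonic mean is dominated by the smaller value, a report with many accessible but irrelevant citations (high recall, low precision) still scores poorly.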

5. Report Statistics

Collect structural statistics (character count, word count, visualization count) from report files:

python -m evaluation.evaluate statistics --model-output reports.jsonl --output stats.json
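A rough sketch of the statistics named above; the exact counting rules used by `evaluation.evaluate statistics` are assumptions here (in particular, counting markdown image embeds as visualizations).

```python
import re

def report_stats(markdown: str) -> dict:
    """Character, word, and visualization counts for one report."""
    return {
        "char_count": len(markdown),
        "word_count": len(markdown.split()),
        # Count markdown image embeds like ![caption](path.png).
        "visualization_count": len(re.findall(r"!\[[^\]]*\]\([^)]+\)", markdown)),
    }

sample = "# Report\n\nSome text.\n\n![Figure1.1](images/Figure1.1__xxx.png)\n"
stats = report_stats(sample)
```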

CLEF Dimensions

| Dimension | Focus | Cognitive Principle | Category |
|-----------|-------|---------------------|----------|
| D1: Visual-Text Alignment | Visual-text integration | Spatial Contiguity | Core |
| D2: Multimodal Synergy | Information gain from multimodal content | Multimedia, Redundancy | Core |
| D3: Information Organization | Hierarchical structure | Signaling, Segmenting | Control |
| D4: Content Depth | Causal explanations | Schema Construction | Control |
| D5: Content Relevance | Topic coherence & coverage | Coherence, Pre-training | Control |

Common Evaluation Options

| Flag | Description |
|------|-------------|
| --eval-model | LLM model for evaluation (default: gpt-4o) |
| --temperature | Evaluation temperature (default: 0.1) |
| --force | Ignore existing checkpoints, re-evaluate everything |
| --retry-failed | Re-attempt previously failed evaluations |
| --cache-dir | Cache directory for URL content (default: .cache/citations) |

Output Structure

Each report generation run produces a timestamped directory:

output/reports/20260409_103508_7401/
├── report_final.md             # Final report with rendered charts and footnotes
├── report.md                   # Raw report content
├── report_image.md             # Report with <IMAGE> placeholders replaced
├── report_image_sources.md     # Report with image placeholders + reference list
├── report_code.md              # Report with embedded chart code comments
├── report_0.md                 # Intermediate report snapshot
├── images/                     # Rendered chart images
│   ├── Figure1.1__xxx.png      #   PNG bitmap (when Playwright is available)
│   └── Figure1.1__xxx.html     #   Self-contained interactive HTML (always)
├── outline.json                # Final report outline
├── outline_list.json           # Outline evolution across iterations
├── section_list.json           # Per-section writing history
├── gen_images_list.json        # Chart metadata and ECharts/Mermaid code
├── used_chunks.json            # Web sources referenced in the report
├── index_context_pool.json     # Full indexed context pool
├── token_consumption.json      # LLM token usage breakdown
├── config.json                 # Run configuration
└── report.log                  # Detailed execution log

Datasets

CogGen uses two evaluation datasets. Query definitions are included in this repository; reference reports are hosted on HuggingFace for clean separation of code and data.

HuggingFace Dataset

The full reference reports are available at: HuggingFace: coggen-benchmark

hf download tk1111/coggen-benchmark --repo-type dataset --local-dir ./coggen-benchmark

Dataset Summary

| Dataset | Records | Type | Reference Source |
|---------|---------|------|------------------|
| OWID | 50 (40 main + 10 extend) | Human-written expert reports | Our World in Data |
| WildSeek | 20 | LLM-generated reference reports | Gemini Deep Research |

Query Files

| File | Description |
|------|-------------|
| data/owid/queries.json | 50 OWID queries with source_report_id linking to HuggingFace reference reports |
| data/wildseek/queries.json | 100 WildSeek queries (20 used in evaluation) with topic, intent, and domain |

Source Attribution

  • OWID: The OWID reports were originally published by their respective authors at Our World in Data and are republished here under the Creative Commons BY 4.0 license. Each report's original authors, publication date, and source URL are preserved in the dataset. Note that data and charts within these reports may originate from third-party providers (e.g., WHO, UN, World Bank) and are subject to those providers' license terms.

  • WildSeek: The WildSeek queries are sourced from the WildSeek benchmark (Jiang et al., 2024). Reference reports were generated using Gemini Deep Research and are included for research evaluation purposes.

Project Structure

coggen/
├── run.py                          # CLI entry point
├── pyproject.toml                  # Dependencies & build config
├── .env.example                    # Environment variable template
│
├── coggen/                         # Core package
│   ├── config.py                   # Centralized configuration
│   ├── types.py                    # ReportRequest / ReportResult
│   │
│   ├── pipeline/                   # Execution orchestration
│   │   ├── runner.py               # Main execution engine
│   │   ├── helpers.py              # Session, logging, file I/O
│   │   ├── simple.py               # Simple ablation pipeline
│   │   └── no_revise.py            # No-revise ablation pipeline
│   │
│   ├── agents/                     # Multi-agent system
│   │   ├── planner/                # Planner Agent (Aₚ)
│   │   ├── writer/                 # Writer Agent (Aw)
│   │   ├── reviewer/               # Reviewer Agent (Aᵣ)
│   │   └── renderer/               # Renderer Agent
│   │
│   ├── prompts/                    # Prompt templates
│   ├── tools/                      # Tool modules (LLM, search, chunker, rendering)
│   └── utils/                      # Shared utilities
│
├── evaluation/                     # CLEF evaluation framework
│   ├── evaluate.py                 # CLI entry point (4 modes)
│   ├── evaluator.py                # Batch evaluation orchestrator
│   ├── metrics/                    # Evaluation metrics
│   ├── prompts/                    # Evaluation rubrics
│   └── scripts/                    # Helper scripts
│
├── ablation/                       # Ablation experiment scripts
│   ├── run_ablation.py             # Multi-mode ablation runner
│   └── gen_report_once.py          # Single-LLM-call baseline
│
└── data/                           # Evaluation datasets
    ├── owid/                       # OWID expert report queries
    ├── wildseek/                   # WildSeek collected queries
    └── examples/                   # CogGen-generated example reports

Citation

If you find this work useful, please consider citing:

@inproceedings{coggen2026,
  title={CogGen: Cognition-Inspired Deep Research Report Generation via Multi-Agent Recursive Framework},
  author={Tian, Kuo and Sun, Pengfei and Ding, Junran and Wu, Zhen and Dai, Xinyu},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026},
}

Related Work

Our World in Data

@misc{ourworldindata,
  author = {Roser, Max and Ritchie, Hannah and Ortiz-Ospina, Esteban and others},
  title = {Our World in Data},
  year = {2025},
  howpublished = {\url{https://ourworldindata.org}},
  note = {Online resource, licensed under CC-BY 4.0. Accessed: 2025},
}

WildSeek

@inproceedings{jiang-etal-2024-unknown,
    title = "Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations",
    author = "Jiang, Yucheng  and
      Shao, Yijia  and
      Ma, Dekun  and
      Semnani, Sina  and
      Lam, Monica",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.554/",
    doi = "10.18653/v1/2024.emnlp-main.554",
    pages = "9917--9955",
}

STORM

@inproceedings{shao-etal-2024-assisting,
    title = "Assisting in Writing {W}ikipedia-like Articles From Scratch with Large Language Models",
    author = "Shao, Yijia  and
      Jiang, Yucheng  and
      Kanell, Theodore  and
      Xu, Peter  and
      Khattab, Omar  and
      Lam, Monica",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.347/",
    doi = "10.18653/v1/2024.naacl-long.347",
    pages = "6252--6278",
}

Acknowledgments

Our codebase is built upon Google ADK, LiteLLM, and Tavily. The evaluation benchmark incorporates data from Our World in Data (CC-BY 4.0) and queries from the WildSeek dataset collected as part of the STORM project. We extend our sincere thanks to all these projects for making their work openly available.

License

This project is released under the MIT License. See LICENSE for details. The CogGen Benchmark dataset is licensed under CC-BY 4.0.
