CogGen

A cognitively inspired multi-agent framework for generating comprehensive, multimodal deep research reports.




News

  • [2026/04] CogGen has been accepted to Findings of ACL 2026!

Introduction

CogGen is an open-source multi-agent system that generates comprehensive, multimodal research reports through cognitively inspired recursive processes. It leverages a three-agent architecture — Planner, Writer, and Reviewer — orchestrated via Google ADK and powered by any LLM through LiteLLM.

CogGen employs a Hierarchical Recursive Architecture consisting of two nested cognitive loops:

  • Macro-Cognitive Loop: The Planner generates a global outline, sections are written in parallel, and the Reviewer provides structural/content feedback (Δ) to trigger replanning iterations.
  • Micro-Cognitive Cycle: Per-section processing where each section undergoes independent search → plan → write → revise cycles with format, factual, and cognitive revision.
  • Renderer Agent: Translates Abstract Visual Representations (AVRs) into executable ECharts/Mermaid syntax and renders them to PNG via Playwright.
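The interplay of the two nested loops can be sketched in a few lines of Python. This is an illustrative stand-in, not CogGen's actual API: the agent classes, method names, and the shape of the feedback signal Δ (here `delta`) are assumptions; the real implementations live under `coggen/agents/`.

```python
class Planner:
    """Stand-in for the Planner Agent (Ap): outline and replan."""
    def outline(self, query):
        return [f"{query}: background", f"{query}: findings"]
    def replan(self, outline, delta):
        # Feedback-driven replanning: here we simply append the requested section.
        return outline + [delta]

class Writer:
    """Stand-in for the Writer Agent (Aw): drafts one section."""
    def write(self, section):
        return f"## {section}\n(draft text)"

class Reviewer:
    """Stand-in for the Reviewer Agent (Ar): emits feedback, then converges."""
    def __init__(self):
        self.rounds = 0
    def review(self, outline, sections):
        self.rounds += 1
        return "limitations" if self.rounds == 1 else None

def generate_report(query, planner, writer, reviewer, max_iters=3):
    outline = planner.outline(query)
    sections = []
    for _ in range(max_iters):
        # Micro-cognitive cycle: each section is drafted independently
        # (CogGen runs these in parallel with per-section search and revision).
        sections = [writer.write(sec) for sec in outline]
        # Macro-cognitive loop: Reviewer feedback triggers replanning until empty.
        delta = reviewer.review(outline, sections)
        if delta is None:
            break
        outline = planner.replan(outline, delta)
    return sections

report = generate_report("quantum computing", Planner(), Writer(), Reviewer())
```

The key structural point is that the Reviewer sits outside the per-section cycle: a single feedback signal can restructure the whole outline before the next parallel writing pass.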

Three-Agent Architecture

| Agent | Role | Sub-agents |
|-------|------|------------|
| Planner Agent (Aₚ) | Information retrieval & structural planning | init_research, outline, section_plan, section_search, replan_loop, combine_plan |
| Writer Agent (Aw) | Text composition & AVR definition | section_writer, write_loop, role_inference, content_cleanup |
| Reviewer Agent (Aᵣ) | Evaluation & feedback signals (Δ) | structure_detector, content_detector, plan_restructurer, refine_loop, *_revise agents |
| Renderer Agent | Translates AVRs into executable syntax | gen_image (ECharts + Mermaid → PNG/HTML) |

Installation

Setup

# Clone the repository
git clone https://github.com/NJUNLP/coggen.git
cd coggen

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies (creates .venv automatically)
uv sync

# (Optional) Install Playwright for chart PNG rendering
uv sync --extra render
uv run playwright install chromium

# Copy and edit environment variables
cp .env.example .env
# Edit .env with your API keys

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| OPENAI_API_KEY | Yes* | | OpenAI API key |
| AZURE_API_KEY | Alt* | | Azure OpenAI API key |
| TAVILY_API_KEY | Yes | | Tavily web search API key |
| PROMPT_LANG | No | en | en (with charts) or en_no_image (text only) |
| COGGEN_WRITING_MODEL | No | gpt-4.1 | Writing model |
| COGGEN_REASONING_MODEL | No | gpt-4.1 | Planning & revision model |
| COGGEN_CODE_MODEL | No | gpt-4.1 | Chart code generation model |
| COGGEN_TEMPERATURE | No | 0.5 | LLM sampling temperature |
| COGGEN_SEARCH_MODE | No | snippet | snippet (fast) or full (thorough) |
| ECHARTS_VALIDATION | No | simple | simple, enhanced, or full |

* At least one LLM provider key is required.
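For illustration, the fallback behavior of the optional variables can be mimicked with a small helper. This is a sketch of the documented defaults only; CogGen's actual configuration loading lives in `coggen/config.py` and may differ.

```python
import os

# Defaults copied from the table above (values are stored as strings,
# as they would appear in the environment).
DEFAULTS = {
    "PROMPT_LANG": "en",
    "COGGEN_WRITING_MODEL": "gpt-4.1",
    "COGGEN_REASONING_MODEL": "gpt-4.1",
    "COGGEN_CODE_MODEL": "gpt-4.1",
    "COGGEN_TEMPERATURE": "0.5",
    "COGGEN_SEARCH_MODE": "snippet",
    "ECHARTS_VALIDATION": "simple",
}

def get_setting(name: str) -> str:
    """Return the environment override if set, else the documented default."""
    return os.environ.get(name, DEFAULTS[name])

os.environ["COGGEN_SEARCH_MODE"] = "full"
mode = get_setting("COGGEN_SEARCH_MODE")   # override wins
temp = get_setting("COGGEN_TEMPERATURE")   # falls back to the default
```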

Quick Start

Generate a Single Report

# Standard mode (full CogGen pipeline)
uv run python run.py "What are the latest advances in quantum computing?"

# With custom title and output directory
uv run python run.py "AI in healthcare" --title "AI Healthcare Report" --output ./my_reports

Ablation Modes

CogGen supports three generation modes for ablation studies:

# Standard: Full pipeline (Macro + Micro cognitive loops)
uv run python run.py "Your query" --mode standard

# Simple: w/o Macro-Cognitive Loop (no replan/refine iterations)
uv run python run.py "Your query" --mode simple

# No-Revise: w/o Micro-Cognitive Cycle (shared context, no per-section search/revision)
uv run python run.py "Your query" --mode no_revise

| Mode | Macro Loop | Micro Cycle | Per-Section Search | Description |
|------|------------|-------------|--------------------|-------------|
| Standard | ✓ | ✓ | ✓ | Full CogGen pipeline |
| Simple | ✗ | ✓ | ✓ | w/o Macro-Cognitive Loop |
| No-Revise | ✓ | ✗ | ✗ | w/o Micro-Cognitive Cycle |
| Direct | ✗ | ✗ | ✗ | Single LLM call baseline |

Batch Processing

# From a text file (one query per line)
uv run python run.py --batch queries.txt

# From a JSONL file
uv run python run.py --batch ./data/wildseek/queries.json --mode standard --output ./batch_output
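The two batch input formats can be produced like this. The plain-text layout (one query per line) follows the comment above; the `query` field name for the JSONL variant is an assumption — check `data/wildseek/queries.json` for the real schema.

```python
import json
from pathlib import Path

queries = [
    "What are the latest advances in quantum computing?",
    "AI in healthcare",
]

# Plain-text batch file: one query per line.
Path("queries.txt").write_text("\n".join(queries) + "\n", encoding="utf-8")

# JSONL batch file: one JSON object per line.
with open("queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps({"query": q}) + "\n")
```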

Ablation Experiments

Run systematic ablation experiments to evaluate each component's contribution:

# Run all three modes on the same query set
python -m ablation.run_ablation --queries data/owid/queries.json

# Run specific modes only
python -m ablation.run_ablation --queries data/wildseek/queries.json --modes standard simple

# Direct LLM baseline (single-call generation)
python -m ablation.gen_report_once "Your query"

Evaluation

CogGen includes a comprehensive evaluation framework based on CLEF (Cognitive Load Evaluation Framework), featuring automated LLM-based quality scoring, pairwise comparison, citation accuracy analysis, and report statistics collection.

Evaluation Modes

1. Collect Reports into JSONL

Batch evaluation modes require a JSONL file as input. Use the built-in collection script to scan output/reports/, de-duplicate by query (keeping the newest successful result), and produce a ready-to-use JSONL:

# Default: scan output/reports, write to output/model_reports.jsonl
python -m evaluation.scripts.collect_reports

# Custom paths
python -m evaluation.scripts.collect_reports -d output/reports -o my_reports.jsonl

# Keep all reports (no dedup), useful for comparing different runs of the same query
python -m evaluation.scripts.collect_reports --all -o all_reports.jsonl

Each line in the output JSONL contains query, markdown, title, report_dir, and report_path fields.
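The de-duplication rule (keep the newest successful result per query) can be illustrated with a toy version of the script. Field names match the documented JSONL output; comparing timestamped `report_dir` names lexicographically to decide "newest" is an assumption about the real implementation.

```python
def dedup_reports(records):
    """Keep one record per query, preferring the newest report_dir.

    Timestamped directory names like 20260409_103508_7401 sort
    chronologically when compared as strings.
    """
    newest = {}
    for rec in records:
        q = rec["query"]
        if q not in newest or rec["report_dir"] > newest[q]["report_dir"]:
            newest[q] = rec
    return list(newest.values())

records = [
    {"query": "solar power", "report_dir": "output/reports/20260408_090000_1111"},
    {"query": "solar power", "report_dir": "output/reports/20260409_103508_7401"},
    {"query": "wind power",  "report_dir": "output/reports/20260407_120000_2222"},
]
kept = dedup_reports(records)
```

Passing `--all` to the real script skips this step, which is what makes it useful for comparing repeated runs of the same query.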

2. Single-Report Evaluation (CLEF)

Evaluate a single report across 5 cognitive dimensions:

# Evaluate a single markdown report (result auto-saved to report's directory as eval_result.json)
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md

# With a specific query for context
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md --query "Your query"

# Specify a custom output path
python -m evaluation.evaluate single --report output/reports/xxx/report_final.md --output result.json

# Evaluate from a JSONL batch file
python -m evaluation.evaluate single --model-output reports.jsonl --output results.json

# Use a specific evaluation model
python -m evaluation.evaluate single --model-output reports.jsonl --eval-model gpt-4o --output results.json

3. Pairwise Comparison

Compare model-generated reports against reference reports (ground truth) using positive-scoring rubrics:

python -m evaluation.evaluate pairwise \
    --model-output my_reports.jsonl \
    --ground-truth ./coggen-benchmark/owid_reports.jsonl \
    --output comparison.json

The pairwise evaluator computes a relative advantage score per dimension: model_score / (model_score + ref_score), where values > 0.5 indicate the model report is stronger.
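In code, the per-dimension score reduces to a one-liner; the function name here is illustrative, not the evaluator's actual API.

```python
def relative_advantage(model_score: float, ref_score: float) -> float:
    """Per-dimension advantage: > 0.5 means the model report is stronger."""
    return model_score / (model_score + ref_score)

tie = relative_advantage(4.0, 4.0)       # equal quality
edge = relative_advantage(3.0, 1.0)      # model clearly stronger
```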

4. Citation Accuracy

Evaluate citation quality by checking URL accessibility (recall) and content relevance (precision):

# CogGen citation format ([N] footnotes)
python -m evaluation.evaluate citation \
    --model-output reports.jsonl \
    --citation-format coggen \
    --output citation_results.json

# Supported formats: coggen, mmdr, gemini
python -m evaluation.evaluate citation \
    --model-output reports.jsonl \
    --citation-format mmdr \
    --cache-dir .cache/citations

| Metric | Description |
|--------|-------------|
| citation_recall | Fraction of cited URLs that are accessible |
| citation_precision | Fraction of citations whose content is relevant to the context |
| f1_score | Harmonic mean of recall and precision |
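The f1_score follows directly from the other two metrics as their harmonic mean; a minimal sketch (function name is illustrative):

```python
def citation_f1(recall: float, precision: float) -> float:
    """Harmonic mean of citation recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

f1 = citation_f1(0.8, 0.6)
```

Because the harmonic mean is dominated by the smaller value, a report with many accessible but irrelevant citations (high recall, low precision) still scores poorly.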

5. Report Statistics

Collect structural statistics (character count, word count, visualization count) from report files:

python -m evaluation.evaluate statistics --model-output reports.jsonl --output stats.json
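A rough sketch of the statistics named above; the exact counting rules used by `evaluation.evaluate statistics` are assumptions here (in particular, counting markdown image embeds as visualizations).

```python
import re

def report_stats(markdown: str) -> dict:
    """Character, word, and visualization counts for one report."""
    return {
        "char_count": len(markdown),
        "word_count": len(markdown.split()),
        # Count markdown image embeds like ![caption](path.png).
        "visualization_count": len(re.findall(r"!\[[^\]]*\]\([^)]+\)", markdown)),
    }

sample = "# Report\n\nSome text.\n\n![Figure1.1](images/Figure1.1__xxx.png)\n"
stats = report_stats(sample)
```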

CLEF Dimensions

| Dimension | Focus | Cognitive Principle | Category |
|-----------|-------|---------------------|----------|
| D1: Visual-Text Alignment | Visual-text integration | Spatial Contiguity | Core |
| D2: Multimodal Synergy | Information gain from multimodal content | Multimedia, Redundancy | Core |
| D3: Information Organization | Hierarchical structure | Signaling, Segmenting | Control |
| D4: Content Depth | Causal explanations | Schema Construction | Control |
| D5: Content Relevance | Topic coherence & coverage | Coherence, Pre-training | Control |

Common Evaluation Options

| Flag | Description |
|------|-------------|
| --eval-model | LLM model for evaluation (default: gpt-4o) |
| --temperature | Evaluation temperature (default: 0.1) |
| --force | Ignore existing checkpoints, re-evaluate everything |
| --retry-failed | Re-attempt previously failed evaluations |
| --cache-dir | Cache directory for URL content (default: .cache/citations) |

Output Structure

Each report generation run produces a timestamped directory:

output/reports/20260409_103508_7401/
├── report_final.md             # Final report with rendered charts and footnotes
├── report.md                   # Raw report content
├── report_image.md             # Report with <IMAGE> placeholders replaced
├── report_image_sources.md     # Report with image placeholders + reference list
├── report_code.md              # Report with embedded chart code comments
├── report_0.md                 # Intermediate report snapshot
├── images/                     # Rendered chart images
│   ├── Figure1.1__xxx.png      #   PNG bitmap (when Playwright is available)
│   └── Figure1.1__xxx.html     #   Self-contained interactive HTML (always)
├── outline.json                # Final report outline
├── outline_list.json           # Outline evolution across iterations
├── section_list.json           # Per-section writing history
├── gen_images_list.json        # Chart metadata and ECharts/Mermaid code
├── used_chunks.json            # Web sources referenced in the report
├── index_context_pool.json     # Full indexed context pool
├── token_consumption.json      # LLM token usage breakdown
├── config.json                 # Run configuration
└── report.log                  # Detailed execution log

Datasets

CogGen uses two evaluation datasets. Query definitions are included in this repository; reference reports are hosted on HuggingFace for clean separation of code and data.

HuggingFace Dataset

The full reference reports are available at: HuggingFace: coggen-benchmark

hf download tk1111/coggen-benchmark --repo-type dataset --local-dir ./coggen-benchmark

Dataset Summary

| Dataset | Records | Type | Reference Source |
|---------|---------|------|------------------|
| OWID | 50 (40 main + 10 extend) | Human-written expert reports | Our World in Data |
| WildSeek | 20 | LLM-generated reference reports | Gemini Deep Research |

Query Files

| File | Description |
|------|-------------|
| data/owid/queries.json | 50 OWID queries with source_report_id linking to HuggingFace reference reports |
| data/wildseek/queries.json | 100 WildSeek queries (20 used in evaluation) with topic, intent, and domain |

Source Attribution

  • OWID: The OWID reports were originally published by their respective authors at Our World in Data and are republished here under the Creative Commons BY 4.0 license. Each report's original authors, publication date, and source URL are preserved in the dataset. Note that data and charts within these reports may originate from third-party providers (e.g., WHO, UN, World Bank) and are subject to those providers' license terms.

  • WildSeek: The WildSeek queries are sourced from the WildSeek benchmark (Jiang et al., 2024). Reference reports were generated using Gemini Deep Research and are included for research evaluation purposes.

Project Structure

coggen/
├── run.py                          # CLI entry point
├── pyproject.toml                  # Dependencies & build config
├── .env.example                    # Environment variable template
│
├── coggen/                         # Core package
│   ├── config.py                   # Centralized configuration
│   ├── types.py                    # ReportRequest / ReportResult
│   │
│   ├── pipeline/                   # Execution orchestration
│   │   ├── runner.py               # Main execution engine
│   │   ├── helpers.py              # Session, logging, file I/O
│   │   ├── simple.py               # Simple ablation pipeline
│   │   └── no_revise.py            # No-revise ablation pipeline
│   │
│   ├── agents/                     # Multi-agent system
│   │   ├── planner/                # Planner Agent (Aₚ)
│   │   ├── writer/                 # Writer Agent (Aw)
│   │   ├── reviewer/               # Reviewer Agent (Aᵣ)
│   │   └── renderer/               # Renderer Agent
│   │
│   ├── prompts/                    # Prompt templates
│   ├── tools/                      # Tool modules (LLM, search, chunker, rendering)
│   └── utils/                      # Shared utilities
│
├── evaluation/                     # CLEF evaluation framework
│   ├── evaluate.py                 # CLI entry point (4 modes)
│   ├── evaluator.py                # Batch evaluation orchestrator
│   ├── metrics/                    # Evaluation metrics
│   ├── prompts/                    # Evaluation rubrics
│   └── scripts/                    # Helper scripts
│
├── ablation/                       # Ablation experiment scripts
│   ├── run_ablation.py             # Multi-mode ablation runner
│   └── gen_report_once.py          # Single-LLM-call baseline
│
└── data/                           # Evaluation datasets
    ├── owid/                       # OWID expert report queries
    ├── wildseek/                   # WildSeek collected queries
    └── examples/                   # CogGen-generated example reports

Citation

If you find this work useful, please consider citing:

@inproceedings{coggen2026,
  title={CogGen: Cognition-Inspired Deep Research Report Generation via Multi-Agent Recursive Framework},
  author={Tian, Kuo and Sun, Pengfei and Ding, Junran and Wu, Zhen and Dai, Xinyu},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
  year={2026},
}

Related Work

Our World in Data

@misc{ourworldindata,
  author = {Roser, Max and Ritchie, Hannah and Ortiz-Ospina, Esteban and others},
  title = {Our World in Data},
  year = {2025},
  howpublished = {\url{https://ourworldindata.org}},
  note = {Online resource, licensed under CC-BY 4.0. Accessed: 2025},
}

WildSeek

@inproceedings{jiang-etal-2024-unknown,
    title = "Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations",
    author = "Jiang, Yucheng  and
      Shao, Yijia  and
      Ma, Dekun  and
      Semnani, Sina  and
      Lam, Monica",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.554/",
    doi = "10.18653/v1/2024.emnlp-main.554",
    pages = "9917--9955",
}

STORM

@inproceedings{shao-etal-2024-assisting,
    title = "Assisting in Writing {W}ikipedia-like Articles From Scratch with Large Language Models",
    author = "Shao, Yijia  and
      Jiang, Yucheng  and
      Kanell, Theodore  and
      Xu, Peter  and
      Khattab, Omar  and
      Lam, Monica",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.347/",
    doi = "10.18653/v1/2024.naacl-long.347",
    pages = "6252--6278",
}

Acknowledgments

Our codebase is built upon Google ADK, LiteLLM, and Tavily. The evaluation benchmark incorporates data from Our World in Data (CC-BY 4.0) and queries from the WildSeek dataset collected as part of the STORM project. We extend our sincere thanks to all these projects for making their work openly available.

License

This project is released under the MIT License. See LICENSE for details. The CogGen Benchmark dataset is licensed under CC-BY 4.0.
