SimpleDoc is a lightweight yet powerful Retrieval-Augmented Generation (RAG) framework for multi-modal document understanding, with a focus on Document Visual Question Answering (DocVQA). It introduces a dual-cue retrieval mechanism and an iterative reasoning agent that together outperform more complex multi-agent pipelines while reading fewer document pages.
- Dual-Cue Retrieval: Combines page-level visual embeddings and LLM-generated summaries to retrieve and rerank relevant pages.
- Iterative Reasoning: A single VLM-based agent dynamically updates queries and working memory to iteratively refine answers.
- Fewer Pages, Better Accuracy: Achieves up to 70.12% accuracy on DocVQA benchmarks while reading only ~3.5 pages per query.
- Simple but Effective: Outperforms multi-agent systems like MDocAgent and hybrid RAG pipelines like M3DocRAG on 3 out of 4 major benchmarks.
SimpleDoc operates in two distinct stages:

Stage 1: Offline Indexing
- Extracts visual embeddings per page via models like ColPali/ColQwen.
- Generates summaries per page using LLMs prompted with structured instructions.
- Stores both into a retrievable vector DB.
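The three indexing bullets above boil down to building one record per page that carries both retrieval cues. The sketch below is illustrative only; `embed_page` and `summarize_page` are placeholders for the actual ColPali/ColQwen encoder and the summary LLM (see `preprocess/` for the real implementations):

```python
# Illustrative Stage 1 sketch: one record per page, carrying both cues.
# embed_page / summarize_page stand in for the real models.
from dataclasses import dataclass
import numpy as np

@dataclass
class PageRecord:
    doc_id: str
    page_no: int
    embedding: np.ndarray  # page-level visual embedding (cue 1)
    summary: str           # LLM-generated page summary (cue 2)

def build_index(pages, embed_page, summarize_page):
    """pages: iterable of (doc_id, page_no, page_image) tuples."""
    return [
        PageRecord(doc_id, page_no, embed_page(img), summarize_page(img))
        for doc_id, page_no, img in pages
    ]
```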
Stage 2: Online Retrieval and Iterative Reasoning

- Embeds a user query and retrieves top-k candidate pages by embedding similarity.
- Filters and re-ranks these using a summary-aware LLM.
- A reasoning agent decides whether the current context suffices to answer or if further refinement is needed.
- The process continues iteratively, updating working memory and queries until the answer is found or the query is deemed unanswerable.
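In code terms, one pass of this loop can be sketched as below, reusing the `PageRecord` from the Stage 1 sketch. Here `embed_query`, `rerank`, and `reason` are assumed stand-ins for the embedding model, the summary-aware LLM reranker, and the VLM reasoning agent, not the repo's actual function names:

```python
# Simplified Stage 2 loop; the statuses mirror the agent's three possible
# decisions: answer, refine the query and retrieve again, or give up.
import numpy as np

def retrieve_top_k(query_emb, records, k=10):
    # cue 1: rank pages by cosine similarity between query and page embeddings
    sims = np.array([
        query_emb @ r.embedding
        / (np.linalg.norm(query_emb) * np.linalg.norm(r.embedding))
        for r in records
    ])
    return [records[i] for i in np.argsort(-sims)[:k]]

def answer_query(query, records, embed_query, rerank, reason, max_iters=3):
    memory = ""  # working memory carried across iterations
    for _ in range(max_iters):
        candidates = retrieve_top_k(embed_query(query), records)
        pages = rerank(query, candidates)       # cue 2: summary-aware LLM filter
        verdict = reason(query, pages, memory)  # VLM agent reads the page images
        if verdict["status"] == "answer":
            return verdict["answer"]
        if verdict["status"] == "unanswerable":
            return None
        query, memory = verdict["new_query"], verdict["memory"]
    return None
```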
git clone https://github.com/ag2ai/simpledoc.git
cd simpledoc
Create and activate the conda environment, then install dependencies:
conda create -n simpledoc python=3.12
conda activate simpledoc
bash install.sh
Create the data directory and download the dataset:
mkdir data
cd data
Make sure your data directory looks like this:
simpledoc/
├── data/
│   └── MMLongBench/
│       ├── samples.json
│       └── documents/
Download the dataset from Hugging Face and place it in the data directory. The PaperText documents are the same as those of PaperTab, so you can either create a symbolic link or make a copy.
For setting up the vLLM server, the official vLLM documentation provides the most reliable and up-to-date guidance. Install vLLM as described there and start the server with the `vllm serve` command. We use 2 x H100 94GB GPUs to run the models:
For Qwen3-30B-A3B:
vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1 \
    --gpu_memory_utilization=0.95 --enable_prefix_caching --tensor-parallel-size=2 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072
For Qwen2.5-VL-32B:
vllm serve Qwen/Qwen2.5-VL-32B-Instruct --dtype bfloat16 \
    --gpu_memory_utilization=0.95 --enable_prefix_caching --tensor-parallel-size=2 \
    --limit-mm-per-prompt image=20
Because we run on 2 GPUs, we set `--tensor-parallel-size=2`; set it to 1 if you want to use only 1 GPU. Note that Qwen2.5-VL-32B might not fit into a single 80GB GPU (if it does not fit, you will get an OOM error).
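Once running, `vllm serve` exposes an OpenAI-compatible API, by default on port 8000. A quick sanity check, assuming the default host and port:

```python
# Minimal connectivity check against the local vLLM server.
# vLLM's OpenAI-compatible endpoint accepts any api_key unless one was configured.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Say OK if you can read this."}],
)
print(resp.choices[0].message.content)
```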
Execute the pipeline in the following order using the provided scripts:
# Step 1: Generate LLM summaries
bash scripts/preprocess_summaries.sh
# Step 2: Generate visual embeddings
bash scripts/preprocess_embeddings.sh
# Step 3: Run the AG2-wrapped SimpleDoc multi-agent chat pipeline
bash scripts/run_simpledoc.sh
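For orientation, Step 3 wraps the pipeline in an AG2 (autogen) chat. A rough sketch of how such a two-agent chat can be wired to the local vLLM endpoint is shown below; the agent names, system messages, and division of labor here are illustrative, not the repo's actual wrapper (see `agent/` and `pipeline/` for that):

```python
# Rough sketch of an AG2 two-agent chat against the local vLLM server.
from autogen import ConversableAgent

llm_config = {"config_list": [{
    "model": "Qwen/Qwen3-30B-A3B",
    "base_url": "http://localhost:8000/v1",
    "api_key": "EMPTY",
}]}

retriever = ConversableAgent(
    "retriever", llm_config=llm_config,
    system_message="Select document pages relevant to the question.")
reasoner = ConversableAgent(
    "reasoner", llm_config=llm_config,
    system_message="Answer the question from the retrieved pages, "
                   "or ask the retriever for more pages.")

reasoner.initiate_chat(retriever, message="What accuracy does Table 2 report?",
                       max_turns=2)
```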
├── preprocess/                # Offline embedding + summary extraction (Stage 1)
│   ├── generate_embeddings.py
│   └── generate_summaries.py
├── pipeline/
│   ├── groupchat_controller.py
│   └── run_simpledoc_chat.py
├── modules/                   # Dual-cue retrieval + iterative QA and memory-based reasoning (Stage 2)
│   ├── step02_page_retrieval.py
│   └── step03_target_page_qa.py
├── prompts/                   # Prompt templates used for retrieval, QA, and memory update
│   ├── page_retrieval_prompt.txt
│   ├── general_summary_propmt.txt
│   └── doc_qa_prompt_v3.5.txt
├── scripts/                   # Bash scripts
│   ├── preprocess_embeddings.sh
│   ├── preprocess_summaries.sh
│   └── run_simpledoc.sh
├── agent/                     # AG2-compatible single-agent wrapper (SimpleDocAgent)
│   ├── reasoning_agent.py
│   └── retriever_agent.py
├── utils/                     # Utility functions (e.g. OpenAI client initialization)
│   ├── pipeline_utils.py
│   └── openai_helper.py
├── data/
│   ├── MMLongBench/
│   ├── LongDocURL/
│   ├── FetaTab/
│   └── PaperTab/
├── outputs/                   # Final pipeline outputs (answers + metadata)
└── README.md                  # Project documentation and usage guide
If you find this work useful, please cite our paper:
@article{jain2025simpledoc,
title={SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement},
author={Jain, Chelsi and Wu, Yiran and Zeng, Yifan and Liu, Jiale and Dai, Shengyu and Shao, Zhenwen and Wu, Qingyun and Wang, Huazheng},
journal={arXiv preprint arXiv:2506.14035},
year={2025}
}