Complex Docs RAG · Day 1-6 – Parsing → Chunking → Embedding → RAG → Table/Chart QA → MCP Agent

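Goals for the first six days of an enterprise-grade multimodal RAG system:

Day 1: PDF parsing that splits complex documents into structured text/table/image fragments;
Day 2: layout-aware chunking based on page layout information, with an OCR fallback;
Day 3: feeding multimodal chunks into a local vector store to complete the embedding + retrieval infrastructure;
Day 4: orchestrating the RAG pipeline (query rewrite → retrieval → rerank → answer synthesis);
Day 5: implementing Table QA (lookup/sorting/aggregation) and Chart QA (chart type recognition + OCR extraction);
Day 6: building MCP-style multi-tool orchestration (automatic tool selection, failure fallback, RAG fallback).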

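✅ Scope (Day 1)

PDF text extraction: supports two engines, PyMuPDF and pdfplumber, and preserves page numbers and bboxes.
Table structuring: tries Camelot first and falls back to pdfplumber on failure.
Image / chart export: exports native images with PyMuPDF and generates placeholder Figure fragments.
Metadata: includes parse time, page count, and parser information under a unified output schema.

Output structure (rag_docs.models.ParsedDocument):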

{
  "text_chunks": [
    {"page": 1, "text": "...", "bbox": [x0, y0, x1, y1], "metadata": {...}}
  ],
  "tables": [
    {"page": 2, "df_json": [["col1", "col2"]], "column_names": ["col1", "col2"]}
  ],
  "figures": [...],
  "images": [...],
  "metadata": {"source_path": "...", "page_count": "..."}
}
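The shape above can be mirrored in code roughly as follows. This is a minimal sketch only: the project's models.py is described as a Pydantic schema, whereas this example uses standard-library dataclasses as a stand-in, and the class names TextChunk and Table as well as the field types are assumptions inferred from the JSON sample.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TextChunk:
    page: int
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Table:
    page: int
    df_json: list[list[Any]]  # table rows as lists of cell values
    column_names: list[str]


@dataclass
class ParsedDocument:
    # Field names follow the JSON sample; the element types are assumed.
    text_chunks: list[TextChunk] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)
    figures: list[dict[str, Any]] = field(default_factory=list)
    images: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


doc = ParsedDocument(
    text_chunks=[TextChunk(page=1, text="Hello", bbox=(0.0, 0.0, 100.0, 20.0))],
    tables=[Table(page=2, df_json=[["col1", "col2"]], column_names=["col1", "col2"])],
    metadata={"source_path": "sample.pdf", "page_count": 2},
)
as_dict = asdict(doc)  # nested plain dicts/lists, ready for json.dumps
```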

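✅ Scope (Day 2)

Layout-aware chunking: detects multi-column layouts and generates length-controlled layout_chunks by column and paragraph.
Tunable chunk config: supports parameters such as max_chars and overlap_chars, making it easier to align with the embedding token limit.
OCR fallback: renders pages with PyMuPDF and runs Tesseract to automatically recover text from scanned pages or pages with missing text.
CLI enhancements: acceptance checks can be run with the --table-flavor, --enable-ocr, and --chunk-* options, and the chunk JSON can be saved separately.

ParsedDocument gains a layout_chunks field that downstream modules can consume directly.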
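To illustrate how max_chars and overlap_chars might interact with column ordering, here is a simplified sketch. It is not the project's chunker: the function names, the two-column assumption, and the block dictionary shape are all hypothetical, and a real layout-aware chunker would also respect paragraph boundaries.

```python
from __future__ import annotations


def reading_order(blocks: list[dict], page_width: float) -> list[dict]:
    """Order blocks column by column (assumes at most two columns), top to bottom."""
    def key(block: dict) -> tuple[int, float]:
        x0, y0 = block["bbox"][0], block["bbox"][1]
        column = 0 if x0 < page_width / 2 else 1
        return (column, y0)
    return sorted(blocks, key=key)


def chunk_text(text: str, max_chars: int = 1400, overlap_chars: int = 200) -> list[str]:
    """Sliding-window split: each chunk holds at most max_chars characters and
    repeats the last overlap_chars characters of the previous chunk."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < max_chars")
    chunks: list[str] = []
    step = max_chars - overlap_chars
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

A larger overlap_chars keeps more shared context between neighbouring chunks at the cost of more chunks to embed.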

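✅ Scope (Day 3)

Multimodal embedding: generates vectors separately for layout_chunks, tables, and images.
Lightweight vector store: implements SimpleVectorStore (local JSONL + cosine search) with upsert/query support.
Embedding pipeline: EmbeddingPipeline converts parse results into EmbeddingRecord objects and writes them to the vector store.
CLI: scripts/embed_demo.py ingest/query handles PDF → vector store ingestion and question retrieval in one step each.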
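The idea of a JSONL-backed store with cosine search can be sketched as below. This is an illustrative stand-in, not the project's SimpleVectorStore: the class name TinyVectorStore, its method signatures, and the record layout are assumptions, and it uses pure Python instead of numpy to stay dependency-free.

```python
from __future__ import annotations

import json
import math
from pathlib import Path


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical minimal store: one JSON record per line, keyed by id."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.records: dict[str, dict] = {}
        if self.path.exists():
            with self.path.open(encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        record = json.loads(line)
                        self.records[record["id"]] = record

    def upsert(self, record_id: str, vector: list[float], metadata: dict | None = None) -> None:
        """Insert or overwrite a record, then rewrite the JSONL file."""
        self.records[record_id] = {
            "id": record_id,
            "vector": list(vector),
            "metadata": metadata or {},
        }
        with self.path.open("w", encoding="utf-8") as fh:
            for record in self.records.values():
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        """Return up to top_k (id, score) pairs, highest cosine score first."""
        scored = [(rid, cosine(vector, rec["vector"])) for rid, rec in self.records.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

Rewriting the whole file on every upsert keeps the sketch short; it would not scale to large stores.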

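✅ Scope (Day 4)

Query rewrite: QueryRewriter normalizes the user question, expands synonyms, and outputs a record of the transformations.
Retrieval + rerank: recalls candidates from SimpleVectorStore, then SimpleReranker raises or lowers weights per modality (table boost, image penalty).
Answer synthesis: AnswerSynthesizer aggregates the top K contexts and generates citations and a reasoning chain.
CLI: scripts/rag_demo.py provides the ingest (reusing the Day 3 flow) and ask (RAG QA) commands.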
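A modality-weighted rerank of the kind described (table boost, image penalty) might look like the following sketch. The Hit dataclass, the rerank function, and the weight values 1.2 and 0.8 are assumptions chosen for illustration; the actual SimpleReranker may score differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    content_type: str  # e.g. "text", "table", "image"
    score: float       # similarity score from the vector store


# Assumed multipliers: boost tables, penalize images, leave others unchanged.
MODALITY_WEIGHTS = {"table": 1.2, "image": 0.8}


def rerank(hits: list[Hit], weights: dict[str, float] = MODALITY_WEIGHTS, top_k: int = 5) -> list[Hit]:
    """Scale each score by its modality weight and return the best top_k hits."""
    rescored = [
        Hit(h.chunk_id, h.content_type, h.score * weights.get(h.content_type, 1.0))
        for h in hits
    ]
    rescored.sort(key=lambda h: h.score, reverse=True)
    return rescored[:top_k]
```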

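✅ Scope (Day 5)

Table QA: TableQAEngine supports filters / sort / max-min-avg-sum-count aggregation, and condition combinations can be passed in from the CLI.
Chart QA: ChartQAEngine runs edge detection + OCR on images extracted from the PDF and outputs the chart type, dominant direction, and a text summary.
CLI: scripts/table_chart_demo.py table|chart demonstrates "find table + aggregate + filter" or "chart OCR + type classification" with a single command.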
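The filter / sort / aggregate combination for Table QA can be pictured with a small pure-Python sketch. It operates on rows represented as dictionaries; run_table_query and its parameters are hypothetical and do not reflect the TableQAEngine interface.

```python
from __future__ import annotations

from typing import Any, Callable

Row = dict[str, Any]

# The five aggregations named above, applied to a non-empty list of numbers.
AGGREGATIONS: dict[str, Callable[[list[float]], float]] = {
    "max": max,
    "min": min,
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
}


def run_table_query(
    rows: list[Row],
    where: Callable[[Row], bool] | None = None,
    sort_by: str | None = None,
    descending: bool = False,
    aggregate: tuple[str, str] | None = None,  # (column, operation)
) -> dict[str, Any]:
    """Filter rows, optionally sort them, and optionally aggregate one column."""
    selected = [row for row in rows if where is None or where(row)]
    if sort_by is not None:
        selected.sort(key=lambda row: row[sort_by], reverse=descending)
    result: dict[str, Any] = {"rows": selected}
    if aggregate is not None:
        column, operation = aggregate
        values = [row[column] for row in selected]
        if operation == "count":
            result["aggregate"] = len(values)
        elif values:
            result["aggregate"] = AGGREGATIONS[operation](values)
        else:
            result["aggregate"] = None  # nothing matched the filter
    return result
```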

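✅ Scope (Day 6)

MCP tools: wraps tool calls such as table_qa, chart_qa, and rag, and records call logs.
Orchestrator: MCPOrchestrator generates a plan from question keywords (table/chart → RAG fallback), automatically parses the PDF, ensures the vector store is ready, and runs the multi-step tools.
CLI: scripts/agent_demo.py ask "question" outputs the answer, the plan, and the tools actually called, to verify the multi-step flow.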
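Keyword-based planning with a RAG fallback could be sketched as follows. The keyword lists, detect_plan, and run_plan are illustrative assumptions rather than the MCPOrchestrator implementation; the tools here are plain callables that return an answer, return None, or raise.

```python
from __future__ import annotations

import re
from typing import Callable, Optional

# Assumed trigger words; the real orchestrator's keyword lists are not shown in this README.
TABLE_KEYWORDS = {"table", "row", "column", "sum", "average", "total"}
CHART_KEYWORDS = {"chart", "figure", "plot", "trend", "axis"}


def detect_plan(question: str) -> list[str]:
    """Pick specialised tools from question keywords and always end with RAG."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    plan: list[str] = []
    if words & TABLE_KEYWORDS:
        plan.append("table_qa")
    if words & CHART_KEYWORDS:
        plan.append("chart_qa")
    plan.append("rag")  # fallback when earlier tools fail or find nothing
    return plan


def run_plan(
    plan: list[str],
    tools: dict[str, Callable[[str], Optional[str]]],
    question: str,
) -> tuple[Optional[str], list[dict]]:
    """Run tools in order, log every call, and stop at the first usable answer."""
    tool_calls: list[dict] = []
    for name in plan:
        try:
            answer = tools[name](question)
        except Exception as exc:  # a failed tool falls through to the next one
            tool_calls.append({"tool": name, "ok": False, "error": str(exc)})
            continue
        tool_calls.append({"tool": name, "ok": answer is not None})
        if answer is not None:
            return answer, tool_calls
    return None, tool_calls
```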

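✅ Scope (Phase C: Structured QA & Cross-document)

Advanced Table QA: adds column-name aliases, combined filters (and/or), IN/NIN operators, group by + aggregation, Top-N results, and multi-table join examples (specifying left/right columns and join type), covering real-world "cross-table alignment + grouped aggregation" scenarios.
Advanced Chart QA: locates the X/Y axes by merging HoughLines results, uses OCR to read axis labels and ticks, extracts data points via contour/peak detection, and outputs normalized coordinates plus a trend analysis (increasing/decreasing/mixed/insufficient data).
Cross-document RAG/Agent: scripts/embed_demo.py ingest-batch vectorizes multiple PDFs in one run; RAGPipeline/Agent support document_id filtering and multi-PDF retrieval; the Agent can align and compare across documents and give a unified answer.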
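For the Chart QA trend analysis, one plausible way to go from pixel-space data points to normalized coordinates and a trend label is sketched below. The function names, the tolerance value, and the rule that fewer than two points counts as insufficient data are assumptions; the detection of axes and points (HoughLines, contours) is out of scope here.

```python
from __future__ import annotations

Point = tuple[float, float]


def normalize_points(
    points_px: list[Point], x_min: float, x_max: float, y_top: float, y_bottom: float
) -> list[Point]:
    """Map pixel coordinates into [0, 1] x [0, 1] relative to the plot area.
    Image rows grow downward, so y is flipped to make larger values mean "higher"."""
    width = x_max - x_min
    height = y_bottom - y_top
    return [((x - x_min) / width, (y_bottom - y) / height) for x, y in points_px]


def classify_trend(points: list[Point], tolerance: float = 0.02) -> str:
    """Label a series of normalized points by the sign of consecutive y changes."""
    if len(points) < 2:
        return "insufficient_data"
    ordered = sorted(points)  # left to right along the x axis
    deltas = [b[1] - a[1] for a, b in zip(ordered, ordered[1:])]
    ups = sum(1 for d in deltas if d > tolerance)
    downs = sum(1 for d in deltas if d < -tolerance)
    if ups and not downs:
        return "increasing"
    if downs and not ups:
        return "decreasing"
    return "mixed"  # includes flat series in this simplified sketch
```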

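✅ Scope (Phase D: Evaluation & Demo)

Benchmark: data/benchmark/qa.json holds hand-annotated Q/A pairs + keywords; scripts/benchmark_demo.py computes precision/recall, keyword hit, and hallucination rate, and reports parse time and average retrieval latency.
Demo / Docs: the README summarizes CLI usage, the MCP tool table, and the extension guide; docs/demo_script.md provides a screen-recording script; the resume bullets can be quoted directly.
Optional FastAPI: scripts/demo_server.py exposes an /ask endpoint (requires fastapi and uvicorn) for quickly standing up a web demo.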
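The README does not spell out how the benchmark metrics are defined, so the following is only one plausible reading: keyword hit as the fraction of annotated keywords found in the answer, and precision/recall as token overlap between the predicted and reference answers. Both function names are hypothetical.

```python
from __future__ import annotations

import re


def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of annotated keywords that appear in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = answer.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)


def token_precision_recall(predicted: str, reference: str) -> tuple[float, float]:
    """Token-overlap precision and recall between a predicted and a reference answer."""
    pred_tokens = set(re.findall(r"\w+", predicted.lower()))
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    overlap = len(pred_tokens & ref_tokens)
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall
```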

📁 Project Layout

.
├── artifacts/parsed_images      # exported images/charts
├── scripts/parse_demo.py        # Day 1/2 demo script (chunking + OCR options)
├── scripts/embed_demo.py        # Day 3 demo script (ingest/query)
├── scripts/rag_demo.py          # Day 4 demo script (ingest/ask)
├── scripts/table_chart_demo.py  # Day 5 demo script (table qa + chart qa)
├── scripts/retrieve_demo.py     # table/chart vector retrieval CLI
├── scripts/agent_demo.py        # Day 6 MCP agent demo
├── src/rag_docs
│   ├── models.py                # Pydantic schema
│   ├── parsers/pdf_parser.py    # PDFDocumentParser + backends
│   ├── chunking/...             # Layout-aware chunker
│   ├── embeddings/...           # Text/table/image embedders
│   ├── vectorstores/...         # SimpleVectorStore
│   ├── rag/...                  # Query rewrite / rerank / synthesis
│   ├── table_qa/...             # Row/col filtering & aggregation engine
│   ├── chart_qa/...             # Chart OCR & heuristic classification
│   ├── tools/...                # MCP orchestrator + tool call tracking
│   └── pipelines/
│       ├── parsing_pipeline.py
│       ├── embedding_pipeline.py
│       └── rag_pipeline.py
└── pyproject.toml

🚀 Getting Started

Install dependencies

pip install -e .

Camelot depends on Ghostscript/Tk; if they are missing from the current environment, the parser automatically falls back to pdfplumber.

Parse & chunk (Day 1/2)

python scripts/parse_demo.py path/to/sample.pdf \
  --max-pages 3 \
  --table-flavor lattice \
  --enable-ocr \
  --chunk-max-chars 1400 \
  --chunk-json artifacts/chunks.json \
  --output-json artifacts/parsed.json

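The command outputs:

Counts of text blocks, tables, images, and layout chunks
A preview of the first text segment, table, and layout chunk
The parse result (ParsedDocument + layout chunks) written to JSON for direct consumption by the later embedding stage

Embed & query (Day 3)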

python scripts/embed_demo.py ingest data/samples/publish.pdf \
  --enable-ocr \
  --store-path artifacts/vector_store.jsonl

python scripts/embed_demo.py query --store-path artifacts/vector_store.jsonl \
  "What is the study about?"

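The first command parses the PDF, vectorizes the multimodal chunks, and stores them in the local vector store; the second command runs a Top-K search against the vector store and shows scores/summaries.

To plug in a real semantic model, append parameters such as --embedding-provider sentence-transformer --embedding-model sentence-transformers/all-MiniLM-L6-v2 --vector-store-backend chroma.

RAG QA (Day 4)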

python scripts/rag_demo.py ingest data/samples/publish.pdf --enable-ocr
python scripts/rag_demo.py ask "What activities does the study compare?" --top-k 5

ask outputs the rewritten query, the Top-K context scores, the final answer, citations, and reasoning steps.

To enable real semantic retrieval and LLM answers, append --embedding-provider openai --embedding-model text-embedding-3-large --vector-store-backend chroma --llm-provider openai --llm-model gpt-4o-mini --reranker-provider cross-encoder.

Table + Chart QA (Day 5 & Phase C)

Table aggregation example:

python scripts/table_chart_demo.py table data/samples/publish.pdf \
  --table-index 0 \
  --top-k 3 \
  --filters "Region:in:APAC|EMEA" \
  --column-aliases "Region=Province|Area" \
  --group-by Region \
  --group-aggregation sum

Chart parsing example:

python scripts/table_chart_demo.py chart artifacts/parsed_images/publish_p1_img1.png

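The first command filters the specified table, applies a group-by sum aggregation, and outputs the row data; the second command analyzes the chart type, dominant direction, and OCR summary.

Advanced: --join-table 1 --join-left-on Region --join-right-on Region --join-how left joins two tables on the named columns; Chart mode now also outputs axis positions, axis labels, normalized data points, and the overall trend, which can be used for later Table/Chart QA alignment.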
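The --filters and --column-aliases values in the table example suggest a compact spec syntax. The sketch below shows one way such strings might be parsed and applied before a group-by sum. The grammar (column:operator:value|value and Canonical=Alias|Alias) is inferred from the example arguments only, and every function name as well as the value_column parameter is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any

Row = dict[str, Any]


def parse_filter(spec: str) -> tuple[str, str, list[str]]:
    """'Region:in:APAC|EMEA' -> ('Region', 'in', ['APAC', 'EMEA'])."""
    column, operator, raw_values = spec.split(":", 2)
    return column, operator, raw_values.split("|")


def parse_aliases(spec: str) -> dict[str, str]:
    """'Region=Province|Area' -> {'Province': 'Region', 'Area': 'Region'}."""
    canonical, raw_aliases = spec.split("=", 1)
    return {alias: canonical for alias in raw_aliases.split("|")}


def apply_aliases(row: Row, aliases: dict[str, str]) -> Row:
    """Rename aliased columns to their canonical name."""
    return {aliases.get(name, name): value for name, value in row.items()}


def matches(row: Row, column: str, operator: str, values: list[str]) -> bool:
    """Evaluate an IN / NIN membership filter on one row."""
    cell = str(row.get(column))
    if operator == "in":
        return cell in values
    if operator == "nin":
        return cell not in values
    raise ValueError(f"unsupported operator: {operator}")


def group_sum(rows: list[Row], filter_spec: str, alias_spec: str,
              group_by: str, value_column: str) -> dict[str, float]:
    """Alias columns, filter rows, then sum value_column per group."""
    column, operator, values = parse_filter(filter_spec)
    aliases = parse_aliases(alias_spec)
    totals: dict[str, float] = defaultdict(float)
    for raw in rows:
        row = apply_aliases(raw, aliases)
        if matches(row, column, operator, values):
            totals[row[group_by]] += row[value_column]
    return dict(totals)
```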

Tables / images already in the vector store can also be retrieved directly:

python scripts/retrieve_demo.py table "Which table mentions institutions?" \
  --store-path artifacts/vector_store.jsonl \
  --embedding-provider openai \
  --vector-store-backend chroma

python scripts/retrieve_demo.py chart "Where are trend lines discussed?" \
  --store-path artifacts/vector_store.jsonl

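The commands use metadata_filter to return only fragments of the specified content_type and show the relevant page numbers and summaries.

Cross-document ingest + Agent (Phase C)

# Vectorize multiple PDFs in one run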
python scripts/embed_demo.py ingest-batch \
  data/samples/publish.pdf \
  path/to/another_report.pdf \
  --enable-ocr --vector-store-backend chroma

# Restrict RAG QA to a subset of documents
python scripts/rag_demo.py ask \
  --store-path artifacts/vector_store.jsonl \
  --document-id publish \
  --document-id another_report \
  "Compare the key institutions mentioned across both studies."

# Agent loads multiple PDFs at once and answers
python scripts/agent_demo.py \
  --pdf data/samples/publish.pdf \
  --pdf path/to/another_report.pdf \
  --document-id publish --document-id another_report \
  --question "Which universities appear in both reports?"

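ingest-batch ensures each PDF gets its own document_id and is written to the same vector store; the --document-id filter restricts RAG/Agent answers to the specified documents, which makes cross-document comparison easier.

Benchmark (Phase D)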
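Conceptually, the document filter amounts to a metadata check before or after similarity search. The sketch below assumes that a document_id is derived from the PDF file stem (consistent with publish.pdf → publish in the commands above, though the README does not state the rule) and that each record carries it under metadata["document_id"]; both function names are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def document_id_for(pdf_path: str) -> str:
    """Assumed convention: the document_id is the file name without its extension."""
    return Path(pdf_path).stem


def filter_by_document(records: list[dict], document_ids: list[str] | None = None) -> list[dict]:
    """Keep only records whose metadata.document_id is in the requested set.
    With no ids given, every record is kept (search across all documents)."""
    if not document_ids:
        return list(records)
    allowed = set(document_ids)
    return [r for r in records if r.get("metadata", {}).get("document_id") in allowed]
```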

python scripts/benchmark_demo.py \
  --qa-path data/benchmark/qa.json \
  --pdf-path data/samples/publish.pdf \
  --store-path artifacts/chroma.sqlite3 \
  --vector-store-backend chroma \
  --report-path reports/benchmark.json

It outputs parse latency, average retrieval latency, precision/recall, and hallucination rate, and writes the details to reports/benchmark.json.

🛠 Tech Highlights

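Layout-aware chunking: merges context based on column coordinates and bboxes, controlling chunk length + overlap to fit the embedding model's input.
OCR fallback: when a page has no text, it is automatically rendered and passed to Tesseract, keeping bboxes and confidence scores, so scanned documents are supported.
Multimodal embedding: text is vectorized by hashing; tables are serialized first and then share the same text embedder, yielding layout and table embeddings together; images get a lightweight visual vector from OpenCV grayscale conversion + pooling.
Lightweight vector store + RAG: SimpleVectorStore runs cosine retrieval over JSONL + numpy; RAGPipeline chains query rewrite → retrieval → rerank → answer synthesis.
Table/Chart QA: TableQAEngine now supports column-name aliases, combined filters, IN/NIN, group by + aggregation + Top-N, and multi-table joins; ChartQAEngine detects axes via HoughLines, OCRs the labels, extracts data points, and outputs the trend.
MCP Agent: MCPOrchestrator automatically infers question intent, calls the table/chart/RAG tools, and records tool_calls; it supports multi-PDF ingest, document filtering, and cross-document retrieval/comparison, and is easy to extend to the real MCP protocol.
Pluggable configuration: PDFParserConfig / LayoutChunkerConfig / EmbeddingConfig can be swapped out, making it easy to plug in enterprise custom models.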
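Hash-based text vectorization, as mentioned for the default text embedder, can be illustrated with a feature-hashing sketch. The dimension of 64, the SHA-256 hash, the signed buckets, and the name hash_embed are all assumptions; the project's embedder may differ in each of these.

```python
from __future__ import annotations

import hashlib
import math
import re


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Feature hashing: each token adds +1 or -1 to one of dim buckets,
    then the vector is L2-normalized so cosine similarity is a dot product."""
    vector = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0          # reduces collision bias
        vector[index] += sign
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
```

Such vectors capture token overlap rather than meaning, which is why the README offers sentence-transformer and OpenAI providers for real semantic retrieval.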

📟 CLI Quick Reference

Feature	Command
Parse + chunk	`python scripts/parse_demo.py data/samples/publish.pdf --enable-ocr --chunk-max-chars 1200`
Single/multi-PDF ingest	`python scripts/embed_demo.py ingest-batch data/samples/publish.pdf data/samples/copy.pdf --vector-store-backend chroma`
RAG QA	`python scripts/rag_demo.py ask --store-path artifacts/chroma.sqlite3 --vector-store-backend chroma --document-id publish "..."`
Table QA	`python scripts/table_chart_demo.py table ... --column-aliases Region=Province`
Chart QA	`python scripts/table_chart_demo.py chart artifacts/parsed_images/publish_p1_img1.png`
Agent	`python scripts/agent_demo.py --pdf data/samples/publish.pdf --question "..."`
Benchmark	`python scripts/benchmark_demo.py --qa-path data/benchmark/qa.json --report-path reports/benchmark.json`
FastAPI Demo	`uvicorn scripts.demo_server:app --reload`

For a fuller screen-recording script, see docs/demo_script.md.

💼 Resume-ready Summary

Built a 7-day enterprise-grade multimodal RAG stack that parses complex PDFs, performs layout-aware chunking + OCR, ingests multi-modal embeddings into Simple/Chroma stores, and serves table/chart QA with MCP-style tool orchestration and OpenAI integration.
Delivered Phase-C cross-document reasoning and Phase-D evaluation: batch ingest, document filters, benchmark scripts (precision/recall/hallucination, latency profiling), CLI playbooks, resume bullet, and an optional FastAPI endpoint for stakeholder demos.

🧰 MCP Tooling & Extension Guide

Tool	Permission	Inputs	Description
`parse_pdf`	`read:pdf`	`path`, `max_pages`	Parses a PDF into a `ParsedDocument` and caches it.
`extract_tables`	`read:pdf`	`path`	Counts and extracts table structures.
`extract_figures`	`read:pdf`	`path`	Exports images/charts.
`ocr_image`	`analyze:image`	`image_path`	Runs OCR on an image (prerequisite for Chart QA).
`table_qa`	`analyze:table`	`table_index`, `question`	Aliases, combined filters, GroupBy, aggregation, Join.
`chart_qa`	`analyze:image`	`image_path`	Identifies chart type, axes, data points, and trend.
`vector_store_upsert`	`write:vector`	`document_id`	Vectorizes the parse result and writes it to Simple/Chroma.
`vector_search`	`read:vector`	`top_k`	Vector retrieval with a metadata filter.
`rag`	`reason:routing`	`top_k`, `question`	Query rewrite → retrieve → rerank → synthesize.

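Extension steps: implement the new tool (a parsing or QA module), register its metadata in MCPOrchestrator.tool_specs, and update _detect_plan and the _run_* execution flow; agent_demo.py --list-tools then shows the new tool automatically, which makes it easy to declare to an MCP platform.

🌐 FastAPI Demo (optional)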
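The registration step could be pictured with a small registry like the one below. The README names MCPOrchestrator.tool_specs but does not show its structure, so ToolSpec, ToolRegistry, and their methods are hypothetical; the columns mirror the tool table above (name, permission, inputs, description).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    permission: str
    inputs: tuple[str, ...]
    description: str


class ToolRegistry:
    """Hypothetical stand-in for the metadata side of an orchestrator."""

    def __init__(self) -> None:
        self.tool_specs: dict[str, ToolSpec] = {}
        self._handlers: dict[str, Callable[..., Any]] = {}
        self.tool_calls: list[dict[str, Any]] = []

    def register(self, spec: ToolSpec, handler: Callable[..., Any]) -> None:
        if spec.name in self.tool_specs:
            raise ValueError(f"tool already registered: {spec.name}")
        self.tool_specs[spec.name] = spec
        self._handlers[spec.name] = handler

    def call(self, name: str, **kwargs: Any) -> Any:
        """Validate declared inputs, run the handler, and log the call."""
        spec = self.tool_specs[name]
        missing = [key for key in spec.inputs if key not in kwargs]
        if missing:
            raise TypeError(f"{name} is missing inputs: {missing}")
        result = self._handlers[name](**kwargs)
        self.tool_calls.append({"tool": name, "inputs": dict(kwargs)})
        return result

    def list_tools(self) -> list[str]:
        """One tab-separated line per tool, in registration order."""
        return [
            f"{s.name}\t{s.permission}\t{', '.join(s.inputs)}"
            for s in self.tool_specs.values()
        ]
```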

pip install fastapi uvicorn
uvicorn scripts.demo_server:app --reload

By default it reads data/samples/publish.pdf and artifacts/vector_store.jsonl, which can be overridden with the RAG_DEMO_PDF / RAG_DEMO_STORE environment variables; open http://127.0.0.1:8000/docs to send questions.

📄 Known Gaps / TODO

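Camelot still struggles with borderless/multi-column/scanned tables; try --table-flavor stream on the command line or integrate a stronger table detection model.
The CLI requires many parameters to be set by hand (top_k, chunk, reranker, etc.); the plan is to introduce configuration presets and an auto-tuning mode to reduce operational complexity.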

📅 What’s Next

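The Day 1-7 core goals are complete; possible next steps:

Plug in real multimodal embeddings / an external vector store
Multi-document fusion & MCP toolchain extension
Polish the web demo / demo video / resume bullets

With the Day 1-7 foundation in place, later modules can directly consume the unified structured data + local vector store + RAG + Table/Chart QA + MCP tool orchestration, speeding up delivery of a complete enterprise-grade multimodal RAG system. Feel free to iterate! 🚀

MCP Agent (Day 6)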

python scripts/agent_demo.py --question "What activities does the study compare?" \
  --pdf-path data/samples/publish.pdf \
  --store-path artifacts/vector_store.jsonl \
  --max-pages 2 --enable-ocr

The CLI shows the agent plan (table_qa/chart_qa/rag), each tool's output, and the final answer.

If sentence-transformers/chromadb are installed, switch to real semantic retrieval with --embedding-provider sentence-transformer --embedding-model sentence-transformers/all-MiniLM-L6-v2 --vector-store-backend chroma. Use --list-tools to view the MCP tool list; append --llm-provider openai --llm-model gpt-4o-mini --reranker-provider cross-encoder to try the real LLM + cross-encoder flow.

Benchmark & Deliverables (Day 7)

python scripts/benchmark_demo.py --pdf-path data/samples/publish.pdf \
  --store-path artifacts/vector_store.jsonl \
  --max-pages 2 --enable-ocr

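The command runs the sample questions, measures average latency and keyword hit rate, and writes a JSON report to reports/benchmark.json (it can also be exported as reports/benchmark.json.txt for easier viewing). It likewise supports parameters such as --embedding-provider, --vector-store-backend, --llm-provider, and --reranker-provider to switch to real semantic models and an LLM.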

Enfoirer/complex-docs-rag-agent

Folders and files

Latest commit

History

Repository files navigation

Complex Docs RAG · Day 1-6 – Parsing → Chunking → Embedding → RAG → Table/Chart QA → MCP Agent

✅ Scope（Day 1）

✅ Scope（Day 2）

✅ Scope（Day 3）

✅ Scope（Day 4）

✅ Scope（Day 5）

✅ Scope（Day 6）

✅ Scope（Phase C：结构化 QA & 跨文档）

✅ Scope（Phase D：评估 & Demo）

📁 Project Layout

🚀 Getting Started

🛠 Tech Highlights

📟 CLI Quick Reference

💼 Resume-ready Summary

🧰 MCP Tooling & Extension Guide

🌐 FastAPI Demo（可选）

📄 Known Gaps / TODO

📅 What’s Next

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages