feat(ocr): VLM 輔助圖片偵測與裁切 --extract-figures

## Problem

> **Original text**:
> 「我覺得 macdoc 需要好好考慮這個功能，要用 D（VLM 輔助裁切：用 vision model 偵測圖片邊界框再自動裁）」
> — Source: Claude Code conversation, 2026-04-12

`macdoc ocr` 目前只提取文字，遇到掃描 PDF 中的圖片/圖表/流程圖會直接遺失。實測 78 份轉學考 PDF，38%（30/78）的試卷包含「右圖」「下圖」「下表」「如圖」等圖片引用，但 OCR 輸出的 .md 完全沒有對應圖片。

**根本原因**：大部分考古題 PDF 是整頁掃描（ccitt 壓縮），`pdfimages` 提取出來的只是整頁掃描圖，不是個別嵌入圖片。需要 VLM 層級的空間理解才能定位圖片區域。

## Type
feature

## Expected

```bash
# 基本用法：OCR + 圖片提取
macdoc ocr page.png --extract-figures --output page.md --figures-dir figures/

# 輸出 .md 自動插入圖片引用
# ![figure-1](figures/page-1-fig-1.png)

# 也支援 PDF 直接輸入
macdoc ocr exam.pdf --extract-figures --output exam.md
```

### 技術設計

**兩階段流程**（每頁）：

1. **偵測（Detection）**：送 page PNG 給 VLM，用 structured prompt 要求回傳 bounding box 座標
   ```json
   [
     {"label": "figure", "bbox": [x1, y1, x2, y2], "description": "ANOVA table"},
     {"label": "chart", "bbox": [x1, y1, x2, y2], "description": "scatter plot"}
   ]
   ```
   - 偵測目標：圖片、圖表、統計表格、流程圖、數學圖形
   - 不偵測：純文字段落、題號、頁首頁尾

2. **裁切（Crop）**：用 CoreGraphics / `sips --crop` 從原始 PNG 裁出各圖片區域
   - 加 padding（10-20px）避免切到邊緣
   - 存到 `--figures-dir`，命名 `page-{N}-fig-{M}.png`

3. **整合**：在 OCR 輸出的 .md 中，在對應位置插入 `![描述](figures/page-N-fig-M.png)`

### 模型選擇

- **偵測用**：`qwen2.5vl`（空間理解能力好，支援 bounding box 輸出）
- **OCR 用**：`glm-ocr`（文字辨識好但不擅長空間定位）
- 可能需要 `--detection-model` 參數讓使用者指定偵測模型（預設 qwen2.5vl）

### CLI 介面

```
OPTIONS:
  --extract-figures       啟用圖片偵測與裁切
  --figures-dir <path>    圖片輸出目錄（預設 figures/）
  --detection-model <m>   圖片偵測用的模型（預設 qwen2.5vl）
  --figure-padding <px>   裁切 padding（預設 15）
  --min-figure-size <px>  忽略小於此尺寸的偵測結果（預設 50x50）
```

## Actual

`macdoc ocr` 只提取文字，圖片完全遺失。含圖的題目在 .md 中變成：
```
3. 右圖是某國小老師想要探討...
```
但「右圖」指向的圖片不在 .md 中。

## Impact

- **教學場景**：轉學考/研究所考古題大量包含統計圖表（scatter plot、ANOVA 表、分配圖），沒有圖片等於題目不完整
- **自動化**：目前替代方案是手動從 PNG 裁切，每張圖要算座標，78 份試卷 × 38% 含圖 ≈ 30 份需要手動處理
- **macdoc 定位**：作為「PDF → 結構化文字」的工具，圖片提取是 OCR 的自然延伸

## Design Notes

### 為什麼用 VLM 而非傳統 CV

1. 掃描 PDF 的圖片區域沒有清晰邊界（不像網頁 `<img>` 有 DOM 邊界）
2. 需要語意理解：區分「這是一張統計圖」vs「這是一段有框線的文字」
3. VLM 可以提供 description（`"ANOVA table"`），方便在 .md 中生成有意義的 alt text
4. 同一個 Ollama 後端已經在跑，不需要額外 infra

### 與 `--parallel` 的整合

- 偵測和 OCR 可以是同一個 VLM 呼叫（一次處理文字 + 圖片偵測）
- 或者分開：先 glm-ocr 做文字，再 qwen2.5vl 做偵測
- 分開的好處：可以只在需要時啟用 `--extract-figures`，不影響純文字 OCR 速度

## Related

- PsychQuant/macdoc#73（`--parallel` 支援 — 圖片偵測也需要並行）
- PsychQuant/psychquant-claude-plugins#6（batch-ocr plugin — 會用到此功能）


## Current Status

**Phase**: diagnosed
**Last updated**: 2026-04-22 by idd-diagnose (batch)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ocr): VLM 輔助圖片偵測與裁切 --extract-figures #74

Problem

Type

Expected

技術設計

模型選擇

CLI 介面

Actual

Impact

Design Notes

為什麼用 VLM 而非傳統 CV

與 `--parallel` 的整合

Related

Current Status

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

feat(ocr): VLM 輔助圖片偵測與裁切 --extract-figures #74

Description

Problem

Type

Expected

技術設計

模型選擇

CLI 介面

Actual

Impact

Design Notes

為什麼用 VLM 而非傳統 CV

與 --parallel 的整合

Related

Current Status

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

與 `--parallel` 的整合