# 🤖 Exploring DeepSeek-OCR

> Date: 2025/10/21

<a href="https://www.tenlong.com.tw/products/9786264142915"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

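## 👨‍💻 Author Resources and Contact

### 📚 Deep Learning Book
**📖 《LangGraph 實戰開發 AI Agent 全攻略》** - my latest technical book
An in-depth look at LangGraph, Agentic AI Systems, and other cutting-edge technologies
**[Buy now](https://www.tenlong.com.tw/products/9786264142915)**

### 🌐 Social Media and Technical Exchange
If you have questions or want to discuss further, you can reach me through the following channels:

* **📖 Technical book**: [Buy my 《LangGraph 實戰開發 AI Agent 全攻略》](https://www.tenlong.com.tw/products/9786264142915)
* **💻 GitHub**: [My open-source projects](https://github.com/Heng-xiu)
* **🤗 Hugging Face**: [My models and datasets](https://huggingface.co/Heng666)
* **✍️ Blog**: [Technical articles](https://r23456999.medium.com/)

Thank you for your support! I look forward to talking with more AI enthusiasts 🚀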

<div class="align-center">
  <a href="https://ko-fi.com/hengshiousheu"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a>
</div>

<a href="https://www.tenlong.com.tw/products/9786264142915"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>



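# Chapter 1: DeepSeek-OCR
According to DeepSeek, the DeepSeek-OCR model is an initial investigation into the feasibility of compressing long text contexts via optical 2D mapping.

The model has two core components: DeepEncoder and the DeepSeek3B-MoE-A570M decoder. DeepEncoder is the core engine. It keeps activations low under high-resolution input while achieving a high compression ratio, which yields a manageable number of vision tokens.

Experiments show that when the number of text tokens is within 10 times the number of vision tokens (a compression ratio < 10×), the model reaches a decoding (OCR) precision of 97%. Even at a compression ratio of 20×, OCR accuracy stays at about 60%.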
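To make the ratio concrete, the sketch below computes it as text tokens divided by vision tokens. The function name and the sample token counts are illustrative assumptions, not figures from DeepSeek's experiments.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 900 text tokens rendered into 100 vision tokens
ratio = compression_ratio(900, 100)
print(f"{ratio:.1f}x")  # 9.0x, inside the <10x regime described above
```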

> These results show considerable promise for research directions such as long-context compression and memory-forgetting mechanisms in LLMs.

---

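# Chapter 2: Environment Setup and Prerequisites
Before we start experimenting with DeepSeek-OCR, we need to make sure the development environment is ready. A stable, correctly configured environment is the first step toward running the model successfully.

> Enabling the GPU: make sure your Colab notebook has a GPU enabled. In the menu bar, click Runtime -> Change runtime type, then select GPU under Hardware accelerator.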

In [1]:
```python
%pip install --quiet addict
%pip install --quiet transformers==4.46.3
%pip install --quiet tokenizers==0.20.3
%pip install --quiet PyMuPDF
%pip install --quiet img2pdf
%pip install --quiet einops
%pip install --quiet easydict
%pip install --quiet Pillow
%pip install --quiet numpy
```



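## 2.1 Log In to Hugging Face

In this section we confirm that the notebook environment can reach Hugging Face resources, including downloading datasets and models.

You can find your [Hugging Face token](https://huggingface.co/login?next=%2Fsettings%2Ftokens) on the Hugging Face Hub.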


In [None]:
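```python
from google.colab import userdata
from huggingface_hub import HfApi

# Read the token stored in Colab's Secrets panel under the name HF_TOKEN
HF_TOKEN = userdata.get("HF_TOKEN")

# Confirm the token works by asking the Hub which account it belongs to
api = HfApi(token=HF_TOKEN)
username = api.whoami()['name']
print(username)
```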
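Outside Colab, `google.colab` cannot be imported. The following is a minimal sketch of a fallback, assuming the token has been exported as an `HF_TOKEN` environment variable; the helper name `get_hf_token` is hypothetical and not part of any library.

```python
import os
from typing import Optional


def get_hf_token() -> Optional[str]:
    """Return the Hugging Face token from Colab secrets, else from the environment."""
    try:
        # Only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except Exception:
        # Not on Colab, or the secret is missing or not shared with this notebook
        return os.environ.get("HF_TOKEN")
```

The result can then be passed on as before, for example `HfApi(token=get_hf_token())`.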

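## 2.2 Verify the GPU Driver and CUDA Support

Working with LLMs takes substantial compute and in practice almost always depends on a GPU (Graphics Processing Unit). Confirming that your environment detects the GPU and supports CUDA is therefore an essential step.

CUDA is NVIDIA's parallel computing platform and programming model, which lets software use the GPU for general-purpose computation. PyTorch (a popular deep learning framework) uses CUDA to tap the compute power of NVIDIA GPUs.

Run the following Python code to check whether your PyTorch environment detects CUDA correctly: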

In [None]:
```python
import torch

print(f"CUDA available to PyTorch: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Current CUDA device name: {torch.cuda.get_device_name(0)}")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    # CUDA version that this PyTorch build was compiled against
    print(f"CUDA version PyTorch was built with: {torch.version.cuda}")
    # Optionally run nvidia-smi (Linux/Windows terminal; in Colab it can be run directly in a cell)
    # !nvidia-smi
else:
    print("Warning: CUDA was not detected. Models would run on the CPU, which is very slow.")
    print("Check your GPU driver installation, CUDA Toolkit setup, and PyTorch's CUDA support.")
```

# Chapter 3: Trying Out DeepSeek-OCR

## 3.1 Get the Pretrained Model: Loading DeepSeek-OCR


In [None]:
```python
%pip install --quiet flash-attn==2.7.3 --no-build-isolation
```

In [3]:
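```python
import os

import torch
from transformers import AutoModel, AutoTokenizer

# Expose only the first GPU to this process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name = "deepseek-ai/DeepSeek-OCR"

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    # Alternative: "flash_attention_2" (requires the separately installed
    # flash-attn package and an Ampere-or-newer GPU)
    _attn_implementation="eager",
    trust_remote_code=True,
    use_safetensors=True,
)
# Switch to inference mode, move to the GPU, and cast to bfloat16
model = model.eval().cuda().to(torch.bfloat16)
```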


You are using a model of type deepseek_vl_v2 to instantiate a model of type DeepseekOCR. This is not supported for all configurations of models and can yield errors.
Some weights of DeepseekOCRForCausalLM were not initialized from the model checkpoint at deepseek-ai/DeepSeek-OCR and are newly initialized: ['model.vision_model.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
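The choice between `eager` and `flash_attention_2` can be made from the GPU's compute capability. The sketch below is one way to do that, assuming FlashAttention-2 needs compute capability 8.0 or higher and an importable `flash_attn` package; the helper name `choose_attn_implementation` is hypothetical.

```python
import importlib.util
from typing import Tuple


def choose_attn_implementation(capability: Tuple[int, int]) -> str:
    """Pick an attention backend from the GPU compute capability.

    Assumption: FlashAttention-2 needs compute capability 8.0 or higher
    (Ampere and newer) and an installed `flash_attn` package.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_flash_attn and capability >= (8, 0):
        return "flash_attention_2"
    return "eager"


# A T4 reports (7, 5), so it always falls back to eager
print(choose_attn_implementation((7, 5)))  # eager
```

In the notebook, `torch.cuda.get_device_capability(0)` returns a tuple of this shape and can be passed in directly.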


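# Chapter 4: Testing the Prompt Modes

DeepSeek-OCR offers several **prompt modes**, each tuned for a different use case (plain text extraction, document structure preservation, coordinate output, image description, and so on).
The prompt you choose affects:

* Output format (plain text, Markdown, coordinates, description)
* Processing time
* Suitable scenarios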

---

## 🧩 Overview of the Four Prompt Modes

| Prompt | Syntax | Best for | Output format | Inference time |
| ------------------------ | ----------------------------------------- | ------------------- | ----------------------------------- | ------------------ |
| **Free OCR** ⭐ (recommended) | `<image>\nFree OCR.` | General text extraction, documents or articles, clean readable output | Plain text | ~24 s |
| **Markdown** | `<image>\n<\|grounding\|>Convert the document to markdown.` | Structured documents; preserves paragraph and heading formatting | Markdown (with structure) | ~39 s |
| **Grounding OCR** | `<image>\n<\|grounding\|>OCR this image.` | Cases that need text positions, UI annotation, document analysis | Text plus coordinates (`<\|ref\|>...<\|det\|>...`) | ~58 s |
| **Detailed Description** | `<image>\nDescribe this image in detail.` | Image understanding and content analysis; not primarily for OCR | Detailed descriptive text | ~9 s |
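The four prompts can be kept in one place and selected by name. The sketch below wraps the same `model.infer` call used throughout this notebook; the dictionary keys and the helper `run_mode` are hypothetical names introduced for illustration.

```python
PROMPTS = {
    "free_ocr": "<image>\nFree OCR.",
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "grounding_ocr": "<image>\n<|grounding|>OCR this image.",
    "describe": "<image>\nDescribe this image in detail.",
}


def run_mode(model, tokenizer, mode: str, image_file: str, output_path: str = "/content/"):
    """Call model.infer with the prompt for `mode` and the settings used in this notebook."""
    if mode not in PROMPTS:
        raise KeyError(f"unknown mode {mode!r}; choose from {sorted(PROMPTS)}")
    return model.infer(
        tokenizer,
        prompt=PROMPTS[mode],
        image_file=image_file,
        output_path=output_path,
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=True,
        test_compress=True,
    )
```

With the model and tokenizer loaded as in Chapter 3, `run_mode(model, tokenizer, "free_ocr", "/content/OCRTest.png")` would reproduce the Free OCR call.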


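## 4.1 Free OCR (Recommended)

**Prompt:**

```python
prompt = "<image>\nFree OCR."
```

**Best for:**

* General text extraction
* Clean, highly readable output
* Articles and documents

**Output format:**

* Plain text with naturally flowing paragraphs
* The fastest and most stable of the three OCR modes

**Example:**

```
# The perils of vibe coding

Elaine Moore

new OpenAI model arrived this month...
```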

In [5]:
```python
prompt = "<image>\nFree OCR."
image_file = "/content/OCRTest.png"
output_path = "/content/"

res = model.infer(
    tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
    base_size=1024, image_size=640, crop_mode=True,
    save_results=True, test_compress=True,
)
```

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


```
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([6, 100, 1280])
| Period Ending | Dec 31, 2008 | Dec 31, 2009 | Dec 31, 2010 |
|--------------|--------------|--------------|--------------|
| **Assets**    |              |              |              |
| **Current Assets** |              |              |              |
| Cash And Cash Equivalents | 8,656,672 | 10,198,000 | 13,630,000 |
| Short Term Investments | 7,189,099 | 14,287,000 | 21,345,000 |
| Net Receivables | 2,928,297 | 3,845,000 | 5,261,000 |
| Inventory | - | - | - |
| Other Current Assets | 1,404,114 | 837,000 | 1,326,000 |
| **Total Current Assets** | 20,178,182 | 29,167,000 | 41,562,000 |
| **Long Term Investments** | 85,160 | 129,000 | 523,000 |
| Property Plant and Equipment | 5,233,843 | 4,845,000 | 7,759,000 |
| Goodwill | 4,839,854 | 4,903,000 | 6,256,000 |
| Intangible Assets | 996,690 | 775,000 | 1,044,000 |
| Accumulated Amortization | - | - | - |
| Other Assets | 433,846 | 415,000 | 442,000 |
| Deferred L
```

image: 0it [00:00, ?it/s]
other: 0it [00:00, ?it/s]


## 4.2 Markdown Mode

**Prompt:**

```python
prompt = "<image>\n<|grounding|>Convert the document to markdown."
```

**Best for:**

* Structured documents (with headings, paragraphs, images)
* Cases where Markdown formatting must be preserved

**Output format:**

* Includes headings `##`, images `![]()`, and coordinate information
* Structured Markdown

**Example:**

```markdown
## The perils of vibe coding
TECHNOLOGY
Elaine Moore
![](images/0.jpg)
new OpenAI model arrived...
```

In [4]:
```python
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = "/content/OCRTest.png"
output_path = "/content/"

res = model.infer(
    tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
    base_size=1024, image_size=640, crop_mode=True,
    save_results=True, test_compress=True,
)
```

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and si

```
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([6, 100, 1280])
<|ref|>title<|/ref|><|det|>[[81, 70, 350, 96]]<|/det|>
# Google Inc. (GOOG)  

<|ref|>title<|/ref|><|det|>[[81, 101, 220, 119]]<|/det|>
# Balance Sheet  

<|ref|>text<|/ref|><|det|>[[588, 103, 797, 119]]<|/det|>
All numbers in thousands  

<|ref|>table<|/ref|><|det|>[[77, 139, 844, 465]]<|/det|>

<table><tr><td>Period Ending</td><td>Dec 31, 2008</td><td>Dec 31, 2009</td><td>Dec 31, 2010</td></tr><tr><td>Assets</td><td></td><td></td><td></td></tr><tr><td>Current Assets</td><td></td><td></td><td></td></tr><tr><td>Cash And Cash Equivalents</td><td>8,656,672</td><td>10,198,000</td><td>13,630,000</td></tr><tr><td>Short Term Investments</td><td>7,189,099</td><td>14,287,000</td><td>21,345,000</td></tr><tr><td>Net Receivables</td><td>2,928,297</td><td>3,845,000</td><td>5,261,000</td></tr><tr><td>Inventory</td><td></td><td></td><td></td></tr><tr><td>Other Current Assets</td><td>1,404,114</td><td>837,000</td><td>1,326,000</td
```

image: 0it [00:00, ?it/s]
other: 100%|██████████| 6/6 [00:00<00:00, 54120.05it/s]
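The raw Markdown-mode output interleaves layout tag lines with the document content. A minimal sketch for dropping those lines follows, assuming each tag sits on its own line as in the output above; the helper name `strip_layout_tags` is hypothetical, and other tag types that may appear are not handled.

```python
import re

# One layout tag line: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
TAG_LINE = re.compile(
    r"^<\|ref\|>.*?<\|/ref\|><\|det\|>.*?<\|/det\|>[ \t]*\n?",
    re.MULTILINE,
)


def strip_layout_tags(raw: str) -> str:
    """Remove the layout tag lines and keep the Markdown content."""
    return TAG_LINE.sub("", raw)


raw = (
    "<|ref|>title<|/ref|><|det|>[[81, 70, 350, 96]]<|/det|>\n"
    "# Google Inc. (GOOG)  \n"
    "\n"
    "<|ref|>title<|/ref|><|det|>[[81, 101, 220, 119]]<|/det|>\n"
    "# Balance Sheet  \n"
)
print(strip_layout_tags(raw))
```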


## 4.3 Grounding OCR (with Coordinates)

**Prompt:**

```python
prompt = "<image>\n<|grounding|>OCR this image."
```

**Best for:**

* Cases that need text coordinates
* Building annotation tools or layout analysis

**Output format:**

```
<|ref|>The perils of vibe coding<|/ref|><|det|>[[352, 30, 624, 111]]<|/det|>
```

**Output files:**

* `result_with_boxes.jpg` (with bounding boxes)
* Text coordinates printed to the console

In [7]:
```python
prompt = "<image>\n<|grounding|>OCR this image."
image_file = "/content/OCRTest.png"
output_path = "/content/"

res = model.infer(
    tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
    base_size=1024, image_size=640, crop_mode=True,
    save_results=True, test_compress=True,
)
```

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


```
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([6, 100, 1280])
<|ref|>Google Inc. (GOOG)<|/ref|><|det|>[[82, 71, 348, 96]]<|/det|>
<|ref|>Balance Sheet<|/ref|><|det|>[[82, 102, 220, 120]]<|/det|>
<|ref|>All numbers in thousands<|/ref|><|det|>[[588, 104, 797, 121]]<|/det|>
<|ref|>Period Ending<|/ref|><|det|>[[80, 142, 190, 160]]<|/det|>
<|ref|>Dec31,2008<|/ref|><|det|>[[446, 143, 546, 158]]<|/det|>
<|ref|>Dec31,2009<|/ref|><|det|>[[584, 143, 676, 158]]<|/det|>
<|ref|>Dec31,2010<|/ref|><|det|>[[700, 143, 797, 158]]<|/det|>
<|ref|>Assets<|/ref|><|det|>[[80, 163, 135, 179]]<|/det|>
<|ref|>Current Assets<|/ref|><|det|>[[80, 180, 193, 198]]<|/det|>
<|ref|>Cash And Cash Equivalents<|/ref|><|det|>[[220, 200, 417, 215]]<|/det|>
<|ref|>8,656,672<|/ref|><|det|>[[465, 200, 546, 215]]<|/det|>
<|ref|>10,198,000<|/ref|><|det|>[[590, 200, 676, 215]]<|/det|>
<|ref|>13,630,000<|/ref|><|det|>[[715, 200, 799, 215]]<|/det|>
<|ref|>Short Term Investments<|/ref|><|det|>[[240, 218, 417, 233]]<|/det|>
```


image: 0it [00:00, ?it/s]
other: 100%|██████████| 125/125 [00:00<00:00, 160479.95it/s]
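Each line of the Grounding OCR output pairs a text span with one box, so it can be turned into structured records with a regular expression. The sketch below assumes exactly one four-number box per `<|det|>` tag, as in the output above, and reads the numbers as `(x1, y1, x2, y2)`, which is also an assumption. The names `TextBox` and `parse_grounding` are hypothetical. The values are returned exactly as printed, since this notebook does not state whether they are pixels or normalized units.

```python
import re
from typing import List, NamedTuple


class TextBox(NamedTuple):
    text: str
    box: tuple  # (x1, y1, x2, y2) exactly as printed by the model


PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>"
    r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>"
)


def parse_grounding(raw: str) -> List[TextBox]:
    """Extract (text, box) pairs from Grounding OCR output."""
    return [
        TextBox(m.group(1), tuple(int(m.group(i)) for i in range(2, 6)))
        for m in PATTERN.finditer(raw)
    ]


raw = (
    "<|ref|>Google Inc. (GOOG)<|/ref|><|det|>[[82, 71, 348, 96]]<|/det|>\n"
    "<|ref|>Balance Sheet<|/ref|><|det|>[[82, 102, 220, 120]]<|/det|>\n"
)
for item in parse_grounding(raw):
    print(item.text, item.box)
```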


## 4.4 Detailed Description (Image Analysis)

**Prompt:**

```python
prompt = "<image>\nDescribe this image in detail."
```

**Best for:**

* Semantic analysis of images
* Content understanding or layout description

**Output format:**

* Describes the image content in natural language
* Not primarily OCR

**Example:**

```
The image displays a printed page from a publication,
likely a magazine or a book...
```

In [8]:
```python
prompt = "<image>\nDescribe this image in detail."
image_file = "/content/OCRTest.png"
output_path = "/content/"

res = model.infer(
    tokenizer, prompt=prompt, image_file=image_file, output_path=output_path,
    base_size=1024, image_size=640, crop_mode=True,
    save_results=True, test_compress=True,
)
```


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


```
BASE:  torch.Size([1, 256, 1280])
PATCHES:  torch.Size([6, 100, 1280])
This image displays a table with 31 rows and 9 columns. The first column is labeled "Period Ending", the second column is labeled "Dec 31, 2008", the third column is labeled "Dec 31, 2009", and the fourth column is labeled "Dec 31, 2010". The fifth column is labeled "Assets", the sixth column is labeled "Current Assets", the seventh column is labeled "Long Term Investments", the eighth column is labeled "Property Plant and Equipment", the ninth column is labeled "Goodwill", the tenth column is labeled "Intangible Assets", the eleventh column is labeled "Accumulated Amortization", the twelfth column is labeled "Other Assets", the thirteenth column is labeled "Deferred Long Term Assets", and the fourteenth column is labeled "Total Assets". The table is titled "Google Inc. (GOOG) Balance Sheet". The image is a screenshot of a spreadsheet.
image size:  (768, 1024)
valid image tokens:  792
output texts tokens (valid):  1
```

image: 0it [00:00, ?it/s]
other: 0it [00:00, ?it/s]


# Chapter 5: Additional Data

## 5.1 Image Size Modes

| Mode | base_size | image_size | crop_mode | Tokens | Characteristics |
| -------------- | --------- | ---------- | --------- | ------ | ------------- |
| Tiny           | 512       | 512        | False     | ~64    | Fastest |
| Small          | 640       | 640        | False     | ~100   | Fast |
| Base           | 1024      | 1024       | False     | ~256   | Balanced |
| Large          | 1280      | 1280       | False     | ~400   | High quality |
| **Gundam (recommended)** | 1024      | 640        | True      | 356+   | Dynamic cropping; balances speed and quality |
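The table rows map directly onto the `base_size`, `image_size`, and `crop_mode` arguments of `model.infer`. A small sketch of that mapping follows; the dictionary and the helper `resolution_kwargs` are hypothetical names, while the values are copied from the table.

```python
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def resolution_kwargs(mode: str) -> dict:
    """Return a copy of the infer() size settings for a named mode."""
    return dict(RESOLUTION_MODES[mode.lower()])


print(resolution_kwargs("Gundam"))
# {'base_size': 1024, 'image_size': 640, 'crop_mode': True}
```

The result can be unpacked into the call, for example `model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, **resolution_kwargs("small"))`.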


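## 5.2 Practical Tips

1. Choose the right prompt
   - Need text → Free OCR
   - Need document formatting → Markdown
   - Need coordinates → Grounding OCR
   - Need image content → Detailed Description

2. Adjust the image size
   - Large images → Gundam mode (`crop_mode=True`)
   - Small images → a suitable `base_size` (512/640)
   - Speed first → smaller sizes
   - Quality first → larger sizes

3. Post-processing
   - Free OCR: text flows well but may have minor line-break issues
   - Markdown: more complete structure
   - Complex layouts → may need manual cleanup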

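## 5.3 Common Problems

| Problem | Cause | Fix |
| ---------------- | --------- | ----------------------------- |
| `result.mmd` is empty | The model did not recognize any content | Switch to "Free OCR" or "Markdown" |
| Text does not flow well | Poor OCR segmentation | Adjust `crop_mode` or change the prompt |
| Missing characters or lines | Insufficient image quality or size | Increase `image_size` / `base_size` |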
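The first row of the table can be checked programmatically. The sketch below assumes that `result.mmd` is written into `output_path` when `save_results=True`; the helper name `result_is_empty` is hypothetical.

```python
from pathlib import Path


def result_is_empty(output_path: str, filename: str = "result.mmd") -> bool:
    """Return True if the result file is missing or contains only whitespace."""
    result_file = Path(output_path) / filename
    if not result_file.is_file():
        return True
    return not result_file.read_text(encoding="utf-8").strip()
```

If `result_is_empty("/content/")` returns `True` after a run, rerunning with the Free OCR or Markdown prompt is the suggested next step.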
