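# Using the Docling CLI

You can use the **Docling CLI** to convert documents into a range of supported formats.
This is handy for quick prototyping and for trying out how different documents are processed,
especially when the data is headed for RAG pipelines and generative AI applications.

👉 For the full list of commands and options, see the official documentation:  
[https://docling-project.github.io/docling/reference/cli](https://docling-project.github.io/docling/reference/cli)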


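## Prerequisites

- **Docling CLI** installed (for example with `pip install docling==2.36.1`; the lab container used here reports version 2.55.1)
- **Git CLI** installed (or download the repository as a ZIP archive instead)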


## Lab: Using the Docling CLI

If you have not downloaded the sample documents yet, clone the repository from GitHub first.  
(Skip this step if you have already downloaded it.)


In [1]:
!git clone https://github.com/RedHatQuickCourses/genai-apps.git


fatal: destination path 'genai-apps' already exists and is not an empty directory.
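
The `fatal:` message above only means the repository was cloned earlier. A minimal sketch of a guard that skips the clone when a non-empty checkout is already present; `clone_command` is a hypothetical helper name, not part of the lab:

```python
import shlex
from pathlib import Path
from typing import List, Optional

REPO_URL = "https://github.com/RedHatQuickCourses/genai-apps.git"


def clone_command(repo_url: str, parent_dir: str = ".") -> Optional[List[str]]:
    """Return the git command to run, or None when a checkout already exists."""
    # git names the target directory after the repository, minus ".git".
    name = Path(repo_url).name
    if name.endswith(".git"):
        name = name[: -len(".git")]
    target = Path(parent_dir) / name
    if target.is_dir() and any(target.iterdir()):
        return None  # non-empty directory: git would stop with the error above
    return ["git", "clone", repo_url, str(target)]


cmd = clone_command(REPO_URL)
print("already cloned, nothing to do" if cmd is None else shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` when it is not `None`.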


All sample input files are in the **dataprep** folder.  
Change into that folder.


In [3]:
%cd /work/genai-apps/dataprep


/work/genai-apps/dataprep


## Verify That Docling Is Installed

Confirm that Docling is installed in the container environment.
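
Besides `pip list`, the installed versions can also be read from Python with the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, and the package names are the ones listed in the `pip list` output of this lab:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("docling", "docling-core", "docling-ibm-models", "docling-parse"):
    print(f"{name:<22} {installed_version(name) or 'not installed'}")
```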


In [4]:
!pip list | grep docling


docling                   2.55.1
docling-core              2.48.4
docling-ibm-models        3.9.1
docling-parse             4.5.0

[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip


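## Convert a PDF to Markdown

Use the Docling CLI to convert a PDF file to Markdown.  
On the first run, Docling downloads and caches several pretrained models from Hugging Face (used to recognize tables, images, page layout, and so on), so that run may be slow; later runs are roughly 50% faster.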


In [5]:
# Run without OCR (--no-ocr)
!docling /work/genai-apps/dataprep/sample-data/docling-rpt.pdf --from pdf --to md --output /work/tmp/ --no-ocr -vv

2025-10-05 05:34:26,104 - INFO - Loading plugin 'docling_defaults'
2025-10-05 05:34:26,105 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-05 05:34:26,114 - INFO - paths: [PosixPath('/tmp/tmp8dkpb4_1/docling-rpt.pdf')]
2025-10-05 05:34:26,114 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-05 05:34:26,136 - INFO - Going to convert document batch...
2025-10-05 05:34:26,136 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 9b654b4e85723616700c42e6593aa131
2025-10-05 05:34:26,137 - INFO - Loading plugin 'docling_defaults'
2025-10-05 05:34:26,138 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-05 05:34:26,145 - INFO - Accelerator device: 'cpu'
2025-10-05 05:34:30,435 - INFO - Accelerator device: 'cpu'
2025-10-05 05:34:30,598 - INFO - Processing document docling-rpt.pdf
2025-10-05 05:34:40,867 - INFO - Finished converting document docling-rpt.pdf in 14.75 sec.
2025-10-05 05:34:40,867 -
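
The log already reports the conversion time (14.75 sec in the run above). To compare a first run against a cached run on your own machine, a small timing wrapper is enough. This is a minimal sketch: `timed_run` is a hypothetical helper name, and the demo command is a stand-in so the block runs anywhere, not a Docling call.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def timed_run(cmd: List[str]) -> Tuple[int, float]:
    """Run a command and return (exit code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    return completed.returncode, time.perf_counter() - start


# Stand-in command; replace it with the docling argument list from the
# cell above to time a real conversion.
demo_cmd = [sys.executable, "-c", "pass"]
code, seconds = timed_run(demo_cmd)
print(f"exit code {code}, {seconds:.2f} s")
```

Running the same Docling command twice through `timed_run` gives wall-clock times that include model loading, which the per-document time in the log may not cover.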

In [1]:
# Use Tesseract for OCR. The default engine is EasyOCR, which is not supported on Apple M-series CPUs.
!docling /work/genai-apps/dataprep/sample-data/docling-rpt.pdf --from pdf --to md --output /work/tmp --ocr-engine tesseract

2025-10-06 13:28:41,743 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:28:41,744 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-06 13:28:41,751 - INFO - paths: [PosixPath('/tmp/tmpy1ya1y2s/docling-rpt.pdf')]
2025-10-06 13:28:41,751 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-06 13:28:41,758 - INFO - Going to convert document batch...
2025-10-06 13:28:41,758 - INFO - Initializing pipeline for StandardPdfPipeline with options hash adcb8572a3195493a34d50c95c8ab1bd
2025-10-06 13:28:41,759 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:28:41,760 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-06 13:28:41,797 - INFO - command: tesseract --list-langs
2025-10-06 13:28:41,812 - INFO - Accelerator device: 'cpu'
2025-10-06 13:28:55,953 - INFO - Accelerator device: 'cpu'
2025-10-06 13:28:56,088 - INFO - Processing document docling-rpt.pdf
2025-10-06 13:29:01,369 - INFO - command: tesseract 

Check the `/work/tmp/` directory for a file with the same base name but a `.md` extension.  
You can view the result in VS Code or any Markdown preview tool.
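
The check can also be scripted. The sketch below assumes the output file is named after the input file's base name plus the new extension, which matches the `/work/tmp/ppt-sample.md` line in the PowerPoint log later in this lab; `expected_output` is a hypothetical helper name.

```python
from pathlib import Path


def expected_output(source: str, output_dir: str, suffix: str = ".md") -> Path:
    """Path where a converted file should land: same base name, new suffix."""
    return Path(output_dir) / (Path(source).stem + suffix)


out = expected_output(
    "/work/genai-apps/dataprep/sample-data/docling-rpt.pdf", "/work/tmp"
)
print(out, "exists" if out.exists() else "missing")
```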


## Output as JSON

Use the `--to` option to select the output format, for example `--to json` for JSON.
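
The notebook cell for this step is missing from this copy. A plausible reconstruction, assuming the same sample PDF and output directory as the Markdown example, changes only the `--to` value. The sketch prints the command in notebook form rather than running it:

```python
import shlex

source = "/work/genai-apps/dataprep/sample-data/docling-rpt.pdf"

# Same flags as the Markdown cell, with only the --to value changed.
cmd = ["docling", source, "--from", "pdf", "--to", "json", "--output", "/work/tmp/"]
print("!" + shlex.join(cmd))
```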


Check the `/work/tmp/` directory for the JSON output.  
Opening the file shows structured content such as:
```json
{
  "schema_name": "DoclingDocument",
  "version": "1.7.0",
  "name": "docling-rpt",
  "origin": {
    "mimetype": "application/pdf",
    "binary_hash": 11465328351749296000,
    "filename": "docling-rpt.pdf"
  },
  ...
}
```
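
The top-level metadata can be read back with the standard `json` module. A minimal sketch, using an inline stand-in limited to the fields shown above (the real file contains much more, and `summarize` is a hypothetical helper name):

```python
import json

# Trimmed stand-in for the JSON output, limited to fields shown in the excerpt.
sample = """
{
  "schema_name": "DoclingDocument",
  "version": "1.7.0",
  "name": "docling-rpt",
  "origin": {"mimetype": "application/pdf", "filename": "docling-rpt.pdf"}
}
"""


def summarize(doc: dict) -> str:
    """One-line summary built from the top-level metadata keys."""
    origin = doc.get("origin", {})
    return f'{doc["schema_name"]} v{doc["version"]}: {doc["name"]} ({origin.get("filename", "?")})'


print(summarize(json.loads(sample)))
```

To inspect a real conversion, load the generated file with `json.loads(Path(...).read_text())` in place of the inline sample.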


## Convert an MS Word Document

Convert an MS Word document to Markdown.

In [4]:
!docling --output /work/tmp/ /work/genai-apps/dataprep/sample-data/msword-sample.docx --ocr-engine tesseract

2025-10-06 13:40:42,799 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:40:42,800 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-06 13:40:42,805 - INFO - paths: [PosixPath('/tmp/tmpf5506u_l/msword-sample.docx')]
2025-10-06 13:40:42,806 - INFO - detected formats: [<InputFormat.DOCX: 'docx'>]
2025-10-06 13:40:42,808 - INFO - Going to convert document batch...
2025-10-06 13:40:42,808 - INFO - Initializing pipeline for SimplePipeline with options hash 6082ab411073f839a822bbff13035c3a
2025-10-06 13:40:42,809 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:40:42,809 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-06 13:40:42,809 - INFO - Processing document msword-sample.docx
2025-10-06 13:40:42,814 - INFO - deleted item in tree at stack: (0, 0, 3, 0, 11) => #/texts/17
2025-10-06 13:40:42,815 - INFO - deleted item in tree at stack: (0, 0, 3, 0, 11) => #/texts/17
2025-10-06 13:40:42,815 - INFO - deleted i

Docling detects the layout and table structure and carries them over into the Markdown output.
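
When a folder holds several documents, the per-file commands can be generated instead of typed. This is a sketch under stated assumptions: `batch_commands` is a hypothetical helper, and the suffix list covers only the formats used in this lab.

```python
from pathlib import Path
from typing import List

SUPPORTED = {".pdf", ".docx", ".pptx"}  # only the formats used in this lab


def batch_commands(input_dir: str, output_dir: str) -> List[List[str]]:
    """One docling command per convertible file, in sorted order."""
    files = sorted(
        p for p in Path(input_dir).iterdir() if p.suffix.lower() in SUPPORTED
    )
    return [["docling", "--output", output_dir, str(p)] for p in files]


sample_dir = "/work/genai-apps/dataprep/sample-data"
if Path(sample_dir).is_dir():  # skip quietly outside the lab container
    for cmd in batch_commands(sample_dir, "/work/tmp/"):
        print(" ".join(cmd))
```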


## Convert an MS PowerPoint Document

Convert a PowerPoint document to Markdown.


In [5]:
!docling --output /work/tmp/ /work/genai-apps/dataprep/sample-data/ppt-sample.pptx --ocr-engine tesseract


2025-10-06 13:41:11,168 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:41:11,169 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-06 13:41:11,175 - INFO - paths: [PosixPath('/tmp/tmpvlhl85z1/ppt-sample.pptx')]
2025-10-06 13:41:11,175 - INFO - detected formats: [<InputFormat.PPTX: 'pptx'>]
2025-10-06 13:41:11,216 - INFO - Going to convert document batch...
2025-10-06 13:41:11,216 - INFO - Initializing pipeline for SimplePipeline with options hash 6082ab411073f839a822bbff13035c3a
2025-10-06 13:41:11,218 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:41:11,219 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-06 13:41:11,219 - INFO - Processing document ppt-sample.pptx
2025-10-06 13:41:11,247 - INFO - Finished converting document ppt-sample.pptx in 0.07 sec.
2025-10-06 13:41:11,247 - INFO - writing Markdown output to /work/tmp/ppt-sample.md
2025-10-06 13:41:11,258 - INFO - Processed 1 docs, of which 0 f

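> ⚠️ If you see a warning such as:  
> `Warning: image cannot be loaded by Pillow: cannot find loader for this WMF file`  
> you can ignore it. Docling still converts each slide into a Markdown heading, places the slide content under that heading, and turns bullet points into lists.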
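
A quick way to confirm that every slide produced a heading is to list the headings in the generated Markdown. The sketch below is illustrative only: `outline` is a hypothetical helper, the demo text is made up, and the heading level Docling uses for slides is not assumed.

```python
import re
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def outline(markdown_text: str) -> List[str]:
    """Return heading titles in document order, ignoring fenced code blocks."""
    titles, in_fence = [], False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            titles.append(match.group(2))
    return titles


demo = "# Slide one\n\n- first point\n- second point\n\n# Slide two\n\nSome text\n"
print(outline(demo))
```

Comparing the number of headings with the number of slides in the deck gives a rough sanity check on the conversion.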


## Convert a Document Directly from a URL

Docling can also process documents hosted on the web. For example:


In [6]:
!docling -vv --output /work/tmp/ https://arxiv.org/pdf/2408.09869 --ocr-engine tesseract


2025-10-06 13:41:33,665 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:41:33,666 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-06 13:41:34,111 - INFO - paths: [PosixPath('/tmp/tmpt6k3m4al/2408.09869v5.pdf')]
2025-10-06 13:41:34,111 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-06 13:41:34,124 - INFO - Going to convert document batch...
2025-10-06 13:41:34,124 - INFO - Initializing pipeline for StandardPdfPipeline with options hash adcb8572a3195493a34d50c95c8ab1bd
2025-10-06 13:41:34,126 - INFO - Loading plugin 'docling_defaults'
2025-10-06 13:41:34,127 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-06 13:41:34,139 - INFO - command: tesseract --list-langs
2025-10-06 13:41:34,149 - INFO - Accelerator device: 'cpu'
2025-10-06 13:41:34,899 - INFO - Accelerator device: 'cpu'
2025-10-06 13:41:35,045 - INFO - Processing document 2408.09869v5.pdf
2025-10-06 13:41:40,342 - INFO - command: tesserac
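
Because the CLI accepts both local paths and URLs, a script that wraps it may want to check its inputs before launching a conversion. A minimal pre-flight sketch; `source_kind` is a hypothetical helper and not part of Docling:

```python
from pathlib import Path
from urllib.parse import urlparse


def source_kind(source: str) -> str:
    """Classify a source as 'url' or 'file'; raise if a local file is missing."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    if not Path(source).is_file():
        raise FileNotFoundError(source)
    return "file"


print(source_kind("https://arxiv.org/pdf/2408.09869"))
```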

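## Further Exercises

- Try converting your own PDF or MS Office documents  
- Test different input and output formats  
- For details on every option, see the official CLI documentation:  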
  [https://docling-project.github.io/docling/reference/cli](https://docling-project.github.io/docling/reference/cli)
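
For the format exercise, the combinations can be generated rather than typed one by one. This sketch only prints the commands. The `--to` values beyond `md` and `json` are assumptions to confirm with `docling --help` for your version, and the relative paths assume the current directory is the `dataprep` folder.

```python
import shlex
from itertools import product

sources = ["sample-data/docling-rpt.pdf", "sample-data/msword-sample.docx"]
# Assumed --to values; confirm the list with `docling --help` for your version.
formats = ["md", "json", "html", "text"]

commands = [
    ["docling", src, "--to", fmt, "--output", "/work/tmp/"]
    for src, fmt in product(sources, formats)
]
for cmd in commands:
    print(shlex.join(cmd))
```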
