Document → Markdown conversion server, built on PaddleOCR — with meaningful extras.
Foil Serve is a FastAPI server that converts common document formats to structured Markdown. It uses PaddleOCR-VL-1.5 as its OCR backbone, served via vLLM, and adds several practical improvements for LLM/RAG workflows.
Extracted figures can be described by any OpenAI-compatible VLM of your choice. The description is injected directly into the Markdown inside a proper <figcaption> tag, keeping the output clean and semantically structured.
PaddleOCR outputs verbose HTML tables. foil-serve post-processes each table in two steps:
- Clean — strip redundant formatting attributes and normalize whitespace (3–5× token reduction).
- Convert to Markdown — if the table structure is simple enough (no merged cells, no nested tables, no multi-level header, consistent column count), it is rendered as a plain Markdown pipe table. Complex tables that cannot be represented without semantic loss stay as cleaned HTML.
The output format for Markdown tables is controlled by table_output_format in server_config.toml ("llm" compact or "human" aligned — same setting as for spreadsheet tables).
| Format | Conversion | Post-processing |
|---|---|---|
.txt, .json, .csv, .xml, .md |
Pass-through (smart encoding detection) | — |
.pdf, .png, .jpg, .bmp |
PaddleOCR-VL natively | HTML tables → Markdown or simplified HTML, optional VLM image description |
.tiff (incl. multi-page), .webp |
Pillow → PDF → PaddleOCR-VL | HTML tables → Markdown or simplified HTML, optional VLM image description |
.xls, .xlsx, .ods |
Pandas → Markdown tables (optional fallback: LibreOffice → PDF → PaddleOCR-VL) | Cell error masking, output size guard |
.docx, .doc, .pptx, .ppt, .odt, .odp |
LibreOffice → PDF → PaddleOCR-VL | HTML tables → Markdown or simplified HTML, optional VLM image description |
PDF conversion: when converting .docx and .doc files to PDF via LibreOffice, tracked changes (revisions) are automatically accepted, so the output reflects the final state of the document. Inline comments are not captured in the conversion.
Spreadsheet processing: cell errors (#REF!, #N/A, nan, etc.) are detected and masked by default. Legacy-encrypted .xls files (empty password) are automatically converted to .xlsx via LibreOffice before processing. When a spreadsheet produces very little cell data (e.g. content is mostly text boxes or images), foil-serve can fall back to PDF+OCR conversion via LibreOffice (configurable paper format and orientation, default A3 landscape, fit-to-width). This fallback is enabled by default but can be disabled via excel_pdf_fallback_enabled. Output size is capped relative to input size (HTTP 413 if exceeded). Empty spreadsheets (no cell data at all) return HTTP 422 when the fallback is disabled. The Markdown table format (table_output_format: "llm" compact or "human" aligned) applies to both spreadsheet tables and HTML tables converted from the OCR pipeline.
Markdown files with HTML tables: some Markdown files (e.g. outputs from other conversion tools) contain embedded HTML tables and are misdetected as text/html by libmagic. foil-serve applies a heuristic (_detect_md) to distinguish these from real HTML pages: if no HTML document root markers (<!DOCTYPE html>, <html>, <head>, <body>) are found but Markdown structural elements are present (headings, lists, links), the file is treated as Markdown and passed through unchanged. Real HTML pages remain unsupported and are rejected with HTTP 422.
OOXML detection fallback: some .docx, .xlsx, and .pptx files are misidentified as application/octet-stream by libmagic. foil-serve inspects the ZIP structure ([Content_Types].xml) to resolve the actual OOXML type automatically.
Only MIME types listed in utils.mime_def are accepted.
Support for additional formats is welcome — contributions are open 🙌
| Endpoint | Method | Description |
|---|---|---|
/docs |
GET |
Interactive Swagger documentation (offline-enabled) |
/v1/process |
POST |
Convert document to Markdown + extract images (JSON response) |
/v1/process/download |
POST |
Same as above but returns a tar.zst archive |
/v1/vlm_models |
GET |
List available VLM models for image description |
/health |
GET |
Server health check, optionally validates VLM endpoints |
Request: raw file bytes (application/octet-stream) + optional query params:
image_description_model_name: name of the VLM model to use for image description (see/v1/vlm_models)
Response (ProcessedDocument):
{
"page_content": "# Markdown content ...",
"images": {
"page_1_figure_1.jpg": "<base64 JPEG>"
},
"metadata": {
"active_conversion_time_no_img_desc": 12,
"img_desc_time": 20,
"wall_clock_time": 18
}
}| Field | Type | Description |
|---|---|---|
active_conversion_time_no_img_desc |
int (seconds) |
Accumulated active CPU/GPU time for document→Markdown conversion. Excludes semaphore wait times and VLM image description time. |
img_desc_time |
int (seconds) |
Accumulated active time for VLM image description calls. Sum of individual call durations (does not account for concurrency). |
wall_clock_time |
int (seconds) |
Total wall-clock latency as seen by the client (includes all waits). |
-
NVIDIA GPU with CUDA 13.0 (may work on other devices but untested)
-
PaddleOCR-VL-1.5 served via an OpenAI-compatible endpoint (vLLM recommended — see
./docker)
- Python 3.13 (can be automatically installed by uv )
- uv for python dependency management
- System packages (for Ubuntu 24.04):
apt install -y --no-install-recommends \ libgl1 \ libreoffice \ libmagic1t64 \ fontconfig \ fonts-dejavu-core \ fonts-liberation \ fonts-noto-cjk \ fonts-wqy-microhei \ fonts-freefont-ttf # On Ubuntu <= 22.04 use 'libmagic1' instead of 'libmagic1t64' # Not tested on non-Debian based linux.
uv sync --no-dev
# Make sure the user running the app can write to the log/artifact files (defaults are /var/log/foil/)
# sudo mkdir -p /var/log/foil && sudo chown <user> /var/log/foil
uv run --no-sync src/foil_serve/main.py # default listening on 0.0.0.0:8081
# or
cd src/foil_serve && uv run --no-sync uvicorn main:app --host 0.0.0.0 --port 8081
uvicorn is included in the project dependencies — no separate installation needed.
A ready-to-use stack (foil-serve + 2× vLLM services) is available in ./docker.
Paddle native models are included in the foil-serve -offline image, but you will still need to download the models for the vllm servers (PaddleOCR-VL-1.5 and any model used for image description).
cd docker/
docker compose up -dpipeline_config.yaml— PaddleOCR-VL-1.5 pipeline settings. Update the vLLM URL to point to your model endpoint. Other pipeline settings can be changed at your own risk 😉.server_config.toml— API keys, VLM model definitions, prompts, resource limits, spreadsheet processing options, and OCR output control.
[[vlm_models]]
enabled = true
name = "my-model" # used as `image_description_model_name` query param
url = "http://host:port/v1"
endpoint_name = "org/model-id" # exact model name for the OpenAI endpoint
endpoint_api_key = "sk-..."
temperature = 0.0
max_output_tokens = 4000
max_input_ocr_length = 700 # max chars of OCR text injected into the VLM prompt
min_size = [64, 64] # minimum image dimensions to process (pixels)
max_concurrent_requests = 10
prompt = "default" # key from [prompts] section or a direct prompt string
extra_body = {chat_template_kwargs = {enable_thinking = false}} # optional extra body parameter (e.g. disable thinking on vllm)Controls whether <ocr> tags (Paddle OCR text on images) appear in the final Markdown:
| Setting | Default | Description |
|---|---|---|
output_paddle_ocr |
false |
When VLM image description is requested: keep <ocr> tags alongside <figcaption>. |
output_paddle_ocr_no_img_desc |
true |
When no VLM is requested: keep <ocr> tags. If false, the OCR-on-images pipeline step is skipped entirely (performance optimization). |
When both OCR output and VLM are disabled, <figcaption> tags are omitted, and the image-block OCR step is skipped, reducing processing time.
If an image-only document contains no text, the Paddle pipeline may produce sparse Markdown (without even referencing the image), causing the VLM description step to be skipped. This server is not recommended for pure image description use cases.
Only cell content is extracted from spreadsheet files. Embedded images, charts, and text boxes are ignored. When the extracted cell content is too sparse, foil-serve falls back to PDF+OCR, which may capture some visual elements but with lower fidelity.
Extracted images may be rotated relative to the original document.
Office documents (excluding spreadsheets) are converted to PDF before OCR. This can occasionally cause rendering issues (e.g., overlapping text).
PaddleOCR is not thread-safe and leaks GPU/CPU memory over time. The current workaround uses a spawn-based multiprocessing pool (processes=1) that recycles the worker process every N documents (max_tasks_between_pipeline_reload in server_config.toml, default 5). Each recycle takes a few seconds while the pipeline reloads.
The root cause of the memory leak has not been fully identified — feel free to investigate 🔍
Running PaddleOCR-VL-1.5 natively proved problematic (Exception from the 'vlm' worker: only 0-dimensional arrays can be converted to Python scalars). The model is instead served via vLLM through an OpenAI-compatible endpoint. This also benefits from faster inference compared to the native Paddle runtime, and greatly reduces wasted time during worker recycling.
When save_failed_artifacts = true in server_config.toml, any processing failure creates a timestamped subdirectory under failed_artifacts_dir (default /foil-log/failed):
<yy-mm-dd_hh-mm>_<mime-type>/
input_file.<ext> — copy of the input file
converted.pdf — intermediate PDF (if conversion happened before the error)
partial_output.md — last markdown state (if the pipeline ran before the error)
meta.txt — app version, mime, timing, sha256, RAM/VRAM/CPU state
trace.txt — full exception traceback
This directory is not automatically cleaned up — manage disk space manually.
Additional artifact types can be enabled independently:
save_table_conversion_artifacts: saves input spreadsheet + generated PDF when sparse fallback triggers.save_cell_error_artifacts: saves Markdown before/after error masking when cell errors are detected.
Foil Serve is not compatible with multiple uvicorn worker (because of the way we spawn and kill the LibreOffice server). This should not be too hard to solve, but I don't expect uvicorn worker to be the bottleneck. If extra ressources are available and scaling is required, it would probably be better to try increasing the number of paddle pipeline.
foil-serve would not exist without:
- PaddleOCR / PaddlePaddle — the OCR backbone powering document understanding.
- LibreOffice — handling Office format conversion. Running headless, doing its job without complaint.