Merged

Pdf2md2 #1030

2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

## How to develop

-For local development, simply use :
+For local development, simply use:

```bash
$ yarn install
7 changes: 7 additions & 0 deletions gatsby-browser.js
@@ -262,6 +262,13 @@ export const onRouteUpdate = ({ location, prevLocation }) => {
  ) {
    pageHeadTittle = "PDF Services API Extract PDF";
  } else if (
    window.location.pathname.indexOf(
      "pdf-services-api/howtos/pdf-to-markdown-api/"
    ) >= 0
  ) {
    pageHeadTittle = "PDF Services API PDF to Markdown API";
  }
  else if (
    window.location.pathname.indexOf(
      "pdf-services-api/howtos/pdf-properties/"
    ) >= 0
13 changes: 13 additions & 0 deletions gatsby-config.js
@@ -33,6 +33,11 @@ module.exports = {
    description: 'Create, combine and export PDFs',
    path: '../document-services/apis/pdf-services/'
  },
  {
    title: 'PDF to Markdown',
    description: 'Convert PDF documents to Markdown format',
    path: '../document-services/apis/pdf-to-markdown/'
  },
  {
    title: 'PDF Accessibility Auto-Tag',
    description: 'Auto-tag PDF content to improve accessibility',
@@ -229,6 +234,10 @@ module.exports = {
    title: 'Extract PDF',
    path: 'overview/pdf-services-api/howtos/extract-pdf.md'
  },
  {
    title: 'PDF to Markdown API',
    path: 'overview/pdf-services-api/howtos/pdf-to-markdown-api.md'
  },
  {
    title: 'Get PDF Properties',
    path: 'overview/pdf-services-api/howtos/pdf-properties.md'
@@ -716,6 +725,10 @@ module.exports = {
    title: 'Extract PDF',
    path: 'overview/legacy-documentation/pdf-services-api/howtos/extract-pdf.md'
  },
  {
    title: 'PDF to Markdown API',
    path: 'overview/legacy-documentation/pdf-services-api/howtos/pdf-to-markdown-api.md'
  },
  {
    title: 'Get PDF Properties',
    path: 'overview/legacy-documentation/pdf-services-api/howtos/pdf-properties.md'
2 changes: 1 addition & 1 deletion src/pages/apis/index.md
@@ -1,6 +1,6 @@
---
title: Adobe PDF Services Open API spec
description: The OpenAPI spec for Adobe PDF Services API endpoints, parameters, and responses.
-openAPISpec: https://raw.githubusercontent.com/AdobeDocs/pdfservices-api-documentation/main/src/pages/resources/openapi.json
+openAPISpec: https://raw.githubusercontent.com/AdobeDocs/pdfservices-api-documentation/develop/src/pages/resources/openapi.json
---
@@ -3,7 +3,7 @@ title: PDF Accessibility Checker | How Tos | PDF Services API | Adobe PDF Servic
---
# PDF Accessibility Checker

-The Accessibility Checker API verifies if PDF files meet the machine-verifiable requirements of PDF/UA and WCAG 2.0. It generates a report summarizing the findings of the accessibility checks. Additional human remediation may be required to ensure the reading order of elements is correct and that alternative text tags properly convey the meaning of images. The report contains links to documentation that assists in manually fixing problems using Adobe Acrobat Pro.
+The Accessibility Checker API verifies if PDF files meet the machine-verifiable requirements of PDF/UA and WCAG. It generates a report summarizing the findings of the accessibility checks. Additional human remediation may be required to ensure the reading order of elements is correct and that alternative text tags properly convey the meaning of images. The report contains links to documentation that assists in manually fixing problems using Adobe Acrobat Pro.

## API Parameters

@@ -316,7 +316,6 @@ curl --location --request POST 'https://pdf-services.adobe.io/operation/accessib
}'
```


## Check accessibility for specified pages

The sample below performs an accessibility check operation for specified pages of a given PDF.
126 changes: 126 additions & 0 deletions src/pages/overview/pdf-services-api/howtos/pdf-to-markdown-api.md
@@ -0,0 +1,126 @@
---
title: PDF to Markdown API | Adobe PDF Services
description: Learn about the PDF to Markdown API service that converts PDF documents into well-formatted Markdown text.
---

# PDF to Markdown API

The PDF to Markdown API (included with the PDF Services API) is a cloud-based web service that automatically converts PDF documents, whether native or scanned, into well-formatted Markdown text. The service preserves the document's structure and formatting while converting it into a format that is widely used in LLM workflows, content authoring, and documentation.

## Structured Information Output Format

The output of a PDF to Markdown operation includes:

- A primary `.md` file containing the converted Markdown content

### Output Structure

The following is a summary of key elements in the converted Markdown:

#### Elements

Ordered list of semantic elements converted from the PDF document, preserving the natural reading order and document structure. The conversion handles:

- Text content with proper Markdown syntax
- Document hierarchy and structure
- Inline formatting and emphasis
- Links and references
- Images and figures
- Tables and complex layouts

#### Content Types

The API processes various content types as follows:

##### Text Elements

- **Headings**: Converted to appropriate Markdown heading levels (H1-H6)
- **Paragraphs**: Preserved with proper spacing and formatting
- **Lists**: Both ordered and unordered lists with proper nesting
- **Text Emphasis**: Bold, italic, and other text formatting
- **Links**: Preserved with proper Markdown link syntax

##### Images and Figures

- Provided as base64-embedded images in the Markdown output
- Referenced correctly in the Markdown output
- Original quality preserved
- Proper alt text and captions maintained
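
As an illustration of working with embedded images, the sketch below lists the base64 images found in a converted Markdown string. It assumes the images appear as standard Markdown image links with a `data:` URI (`![alt](data:image/...;base64,...)`); the exact form the API emits is not specified here, and `extractEmbeddedImages` is a hypothetical helper name.

```javascript
// Sketch: list base64-embedded images in a Markdown string.
// Assumption: images use the form ![alt](data:image/<type>;base64,<payload>).
function extractEmbeddedImages(markdown) {
  const pattern =
    /!\[([^\]]*)\]\(data:(image\/[a-zA-Z0-9.+-]+);base64,([A-Za-z0-9+/=]+)\)/g;
  const images = [];
  for (const match of markdown.matchAll(pattern)) {
    const [, altText, mimeType, payload] = match;
    images.push({
      altText,
      mimeType,
      // Buffer is a Node.js global; decoding yields the raw image bytes.
      bytes: Buffer.from(payload, "base64"),
    });
  }
  return images;
}

// The payload here is only the 8-byte PNG signature, used as a tiny stand-in.
const sample = "Intro\n\n![Sales chart](data:image/png;base64,iVBORw0KGgo=)\n";
const found = extractEmbeddedImages(sample);
console.log(
  found.map((img) => `${img.altText} (${img.mimeType}, ${img.bytes.length} bytes)`)
);
```

The decoded bytes could then be written to disk and the `data:` URI replaced with a relative path if a downstream tool prefers file references.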

##### Tables

- Converted to Markdown table syntax
- Column alignment preserved
- Cell content formatting maintained
- Complex table structures supported
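
For readers unfamiliar with how Markdown tables carry alignment, the following sketch reads the column alignments from a table's delimiter row. It assumes GitHub-style pipe tables (`:---` left, `---:` right, `:---:` centered); whether the API emits exactly this dialect is an assumption, and `parseAlignments` is a hypothetical helper.

```javascript
// Sketch: read column alignments from the delimiter row of a Markdown table.
// Assumption: GitHub-style pipe tables with colons marking alignment.
function parseAlignments(delimiterRow) {
  return delimiterRow
    .trim()
    .replace(/^\||\|$/g, "") // drop the optional outer pipes
    .split("|")
    .map((cell) => {
      const spec = cell.trim();
      const left = spec.startsWith(":");
      const right = spec.endsWith(":");
      if (left && right) return "center";
      if (right) return "right";
      if (left) return "left";
      return "default";
    });
}

const alignments = parseAlignments("| :--- | :---: | ---: | --- |");
console.log(alignments);
```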

#### Element Types and Paths

The API recognizes and converts the following structural elements:

| Category | Element Type | Description |
| --------- | ----------------- | --------------------------------------------------------- |
| Aside | Aside | Content which is not part of regular content flow |
| Figure | Figure | Non-reflowable constructs like graphs, images, flowcharts |
| Footnote | Footnote | Footnote |
| Headings | H, H1, H2, etc. | Heading levels |
| List | L, Li, Lbl, Lbody | List and list item elements |
| Paragraph | P, ParagraphSpan | Paragraphs and paragraph segments |
| Reference | Reference | Links |
| Section | Sect | Logical section of the document |
| StyleSpan | StyleSpan | Styling variations within text |
| Table | Table, TD, TH, TR | Table elements |
| Title | Title | Document title |

### Reading Order

The reading order in the output Markdown maintains:

- Natural document flow
- Proper content hierarchy
- Column-based layouts
- Page transitions
- Inline elements and references

## Use Cases

The PDF to Markdown API is particularly valuable for:

- LLM-friendly content ingestion and prompt creation
- Training or fine-tuning LLMs on PDF content
- Content migration from PDF to documentation platforms
- Legacy document conversion
- Content repurposing for modern documentation systems
- Integration with Markdown-based workflows
- Automated document processing pipelines
- Searchable internal knowledge repositories
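
For the LLM ingestion use case, one common preprocessing step is to split the converted Markdown into heading-delimited sections. The sketch below is one possible approach, not part of the API; it assumes ATX headings (`#` through `######`) at the start of a line, and `splitByHeadings` is a hypothetical helper.

```javascript
// Sketch: split Markdown into heading-delimited sections for LLM ingestion.
// Assumption: headings are ATX style ("#" to "######") at the start of a line.
function splitByHeadings(markdown) {
  const sections = [];
  let current = { level: 0, title: "", lines: [] };
  const hasContent = (s) => s.title !== "" || s.lines.some((l) => l.trim() !== "");

  for (const line of markdown.split("\n")) {
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      // Close the previous section before starting a new one.
      if (hasContent(current)) sections.push(current);
      current = { level: heading[1].length, title: heading[2].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  if (hasContent(current)) sections.push(current);

  return sections.map((s) => ({
    level: s.level,
    title: s.title,
    body: s.lines.join("\n").trim(),
  }));
}

const chunks = splitByHeadings(
  "# Report\n\nSummary text.\n\n## Results\n\nDetails here.\n"
);
console.log(chunks);
```

This simple version does not skip `#` lines inside fenced code blocks, so a production pipeline would likely use a full Markdown parser instead.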

## API Limitations

### File Constraints

- **File Size**: Maximum of 100 MB per file
- **Page Count**:
  - Non-scanned PDFs: Up to 400 pages
  - Scanned PDFs: Up to 150 pages
- **Page Dimensions**: Between 6" and 17.5" in either dimension
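
A client could check these constraints before uploading. The sketch below is a hypothetical pre-flight helper, not part of any Adobe SDK. It assumes the caller already knows the file size, page count, whether the PDF is scanned, and each page's size in inches. It also assumes that 1 MB means 1024 × 1024 bytes and that both width and height must fall within the 6" to 17.5" range.

```javascript
// Limits copied from the File Constraints list above.
const LIMITS = {
  maxBytes: 100 * 1024 * 1024, // assumption: 1 MB = 1024 * 1024 bytes
  maxPagesNative: 400,
  maxPagesScanned: 150,
  minInches: 6,
  maxInches: 17.5,
};

// Hypothetical helper: returns a list of problems (empty if none were found).
function checkFileConstraints({ sizeBytes, pageCount, isScanned, pageSizesInches }) {
  const problems = [];
  if (sizeBytes > LIMITS.maxBytes) problems.push("file is larger than 100 MB");

  const maxPages = isScanned ? LIMITS.maxPagesScanned : LIMITS.maxPagesNative;
  if (pageCount > maxPages) problems.push(`page count ${pageCount} exceeds ${maxPages}`);

  // Assumption: width and height must each be within the allowed range.
  pageSizesInches.forEach(([width, height], index) => {
    const ok = [width, height].every(
      (d) => d >= LIMITS.minInches && d <= LIMITS.maxInches
    );
    if (!ok) problems.push(`page ${index + 1} is ${width}" x ${height}", outside 6"-17.5"`);
  });
  return problems;
}

const report = checkFileConstraints({
  sizeBytes: 5 * 1024 * 1024,
  pageCount: 200,
  isScanned: true,
  pageSizesInches: [[8.5, 11], [4, 6]],
});
console.log(report);
```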

### Processing Limits

- **Rate Limits**: Maximum 25 requests per minute
- **Language Support**: Optimized for English, supports other Latin-based languages
- **OCR Quality**: Dependent on scan quality (minimum 200 DPI recommended)
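
To stay under the rate limit, a client can throttle itself before sending requests. The following is a minimal sliding-window sketch; the `RequestWindow` class is hypothetical, and how the service itself measures the 25 requests per minute (fixed or sliding window) is not stated here, so treat this as a conservative client-side guard.

```javascript
// Sketch: client-side sliding-window limiter for a 25 requests/minute limit.
// The class name and API are hypothetical; they are not part of any Adobe SDK.
class RequestWindow {
  constructor(maxRequests = 25, windowMs = 60000, now = () => Date.now()) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.now = now; // injectable clock, useful for testing
    this.timestamps = [];
  }

  // Returns 0 if a request may be sent now (and records it),
  // otherwise the number of milliseconds to wait before retrying.
  tryAcquire() {
    const t = this.now();
    // Forget requests that have left the window.
    while (this.timestamps.length > 0 && t - this.timestamps[0] >= this.windowMs) {
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(t);
      return 0;
    }
    return this.timestamps[0] + this.windowMs - t;
  }
}

// Demo with a fake clock so the behavior is deterministic.
let fakeTime = 0;
const limiter = new RequestWindow(25, 60000, () => fakeTime);
const results = [];
for (let i = 0; i < 26; i += 1) results.push(limiter.tryAcquire());
fakeTime = 60000;
const afterWindow = limiter.tryAcquire();
console.log(results[24], results[25], afterWindow);
```

A caller would wait for the returned number of milliseconds before retrying, and should still handle rate-limit errors from the service, since other clients may share the same quota.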

### Document Requirements

- Files must be unprotected or allow content copying
- No support for:
  - Hidden objects (JavaScript, OCG)
  - XFA and fillable forms
  - Complex annotations
  - CAD drawings or vector art
  - Password-protected content

## REST API

See our public API Reference for [PDF to Markdown API](../../../apis/#tag/PDF-To-Markdown).