helgesverre/markdown

A fast PHP Markdown parser backed by md4c through PHP FFI.

It renders GitHub-flavored Markdown, supports front matter and heading TOCs, and ships prebuilt native libraries so normal installs do not need a C compiler.

Install

composer require helgesverre/markdown

Requirements:

  • PHP 8.5+
  • ext-ffi
  • ffi.enable=1 for web/FPM use, or an opcache preload setup (PHP's default, ffi.enable=preload, only permits FFI in the CLI and in preloaded scripts)

Bundled native artifacts are selected at runtime:

| Platform | Artifact |
| --- | --- |
| macOS Apple Silicon + Intel | lib/darwin/libmd4cshim.dylib |
| Linux x86-64 | lib/linux-x86_64/libmd4cshim.so |
| Linux aarch64 | lib/linux-aarch64/libmd4cshim.so |
| Windows x64 | lib/windows-x86_64/md4cshim.dll |
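
The table can be read as a lookup from OS family and CPU architecture to a relative path. The following is a minimal sketch of such a lookup, not the library's actual selection code: the function name bundledArtifact, the use of PHP_OS_FAMILY and php_uname('m'), and the list of architecture aliases are all assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: map an OS family and machine name to the bundled
 * artifact path from the table. Returns null for platforms that are not shipped.
 */
function bundledArtifact(string $osFamily, string $machine): ?string
{
    // Normalise common values reported by php_uname('m') (assumed alias list).
    $arch = match (strtolower($machine)) {
        'x86_64', 'amd64' => 'x86_64',
        'aarch64', 'arm64' => 'aarch64',
        default => null,
    };

    if ($arch === null) {
        return null;
    }

    return match ($osFamily) {
        // One universal binary covers Apple Silicon and Intel.
        'Darwin' => 'lib/darwin/libmd4cshim.dylib',
        'Linux' => "lib/linux-{$arch}/libmd4cshim.so",
        // Only x64 is listed for Windows.
        'Windows' => $arch === 'x86_64' ? 'lib/windows-x86_64/md4cshim.dll' : null,
        default => null,
    };
}

echo bundledArtifact(PHP_OS_FAMILY, php_uname('m')) ?? 'no bundled artifact', PHP_EOL;
```

A null result corresponds to the case where a local build or an explicit library path is needed.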

HelgeSverre\Markdown\Ffi\Library::path() resolves libraries in this order:

  1. $MARKDOWN_FFI_LIB
  2. the bundled lib/<platform>/ binary
  3. a local native/ build
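
The lookup order above can be sketched in plain PHP. This is an illustration under stated assumptions, not the body of Library::path(): the function name resolveLibrary is hypothetical, the file names under native/ are guesses, and falling through when the environment variable points at a missing file is an assumed behavior.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of the lookup order: environment override, bundled
 * binary, then a local build. Returns the first candidate that exists.
 *
 * @param list<string> $fallbacks bundled path first, local native/ build second
 */
function resolveLibrary(?string $envOverride, array $fallbacks): ?string
{
    // 1. An explicit $MARKDOWN_FFI_LIB wins when it points at a real file.
    if ($envOverride !== null && $envOverride !== '' && is_file($envOverride)) {
        return $envOverride;
    }

    // 2. and 3. Walk the remaining candidates in priority order.
    foreach ($fallbacks as $candidate) {
        if (is_file($candidate)) {
            return $candidate;
        }
    }

    return null;
}

$env = getenv('MARKDOWN_FFI_LIB');
$path = resolveLibrary($env === false ? null : $env, [
    __DIR__ . '/lib/linux-x86_64/libmd4cshim.so', // bundled (platform assumed)
    __DIR__ . '/native/libmd4cshim.so',           // local build (file name assumed)
]);

echo $path ?? 'no library found', PHP_EOL;
```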

Usage

Render HTML

use HelgeSverre\Markdown\Markdown;

$html = Markdown::toHtml("# Hello\n\n- a\n- b\n");

$htmls = Markdown::toHtmlBatch([
    "# One\n",
    "# Two\n",
]);

toHtml() is the fast path: Markdown in, HTML out. toHtmlBatch() packs many documents into one native call and renders them across a C thread pool where pthreads are available.

For explicit lifecycle and options, construct the parser directly:

use HelgeSverre\Markdown\Data\Dialect;
use HelgeSverre\Markdown\Parser;

$parser = new Parser(
    dialect: Dialect::GitHub,
    safe: false,
    xhtml: false,
);

$html = $parser->toHtml("# Hello\n");

Parse Documents

parse() strips YAML front matter, renders the body, injects GitHub-style heading ids, and returns a ParsedMarkdown value with HTML, front matter, and TOC data.

use HelgeSverre\Markdown\Markdown;

$doc = <<<MD
---
title: Hello World
tags: [php, markdown]
---
# Introduction

## Getting started
MD;

$result = Markdown::parse($doc);

$result->html;
$result->frontmatter; // ['title' => 'Hello World', 'tags' => ['php', 'markdown']]
$result->toc;         // [['level' => 1, 'text' => 'Introduction', 'slug' => 'introduction'], ...]
(string) $result;     // same as $result->html

Malformed front matter degrades to an empty array. Heading ids are lower-cased, ASCII-folded, and de-duplicated with suffixes like intro-1.
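
To make the slug rules concrete, here is a rough pure-PHP approximation. It is not the library's implementation (which lives behind parse()): the function name slugify, the tiny folding table, and the exact punctuation handling are assumptions chosen to mirror the described behavior.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical slugger: fold a few accented letters to ASCII, lower-case,
 * drop punctuation, hyphenate spaces, and suffix repeats with -1, -2, ...
 *
 * @param list<string> $headings
 * @return list<string>
 */
function slugify(array $headings): array
{
    // Tiny illustrative folding table; a real one covers far more characters.
    $fold = ['é' => 'e', 'É' => 'e', 'å' => 'a', 'Å' => 'a', 'ø' => 'o', 'Ø' => 'o'];
    $seen = [];
    $slugs = [];

    foreach ($headings as $text) {
        $slug = strtolower(strtr($text, $fold));
        // Keep ASCII letters, digits, spaces, underscores, and hyphens only.
        $slug = preg_replace('/[^a-z0-9 _-]+/', '', $slug) ?? '';
        $slug = str_replace(' ', '-', trim($slug));

        // First occurrence keeps the bare slug; later ones get a counter.
        $count = $seen[$slug] ?? 0;
        $seen[$slug] = $count + 1;
        $slugs[] = $count === 0 ? $slug : "{$slug}-{$count}";
    }

    return $slugs;
}

echo implode(', ', slugify(['Intro', 'Getting started', 'Intro', 'Café Ålesund'])), PHP_EOL;
// intro, getting-started, intro-1, cafe-alesund
```

The sketch ignores one edge case: a heading whose literal text already slugs to intro-1 would collide with a generated suffix.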

Front matter is decoded by a vendored libyaml FFI path (parsed to JSON in C, then decoded with json_decode()); no pure-PHP YAML parser is involved. Inputs that the shim's libyaml event walker does not support, namely anchors/aliases and << merge keys, degrade to an empty array, the same as malformed YAML.

Date scalars are strings. A bare date: 2026-06-05 in front matter is returned as the string "2026-06-05" (matching PECL yaml, spyc, and dallgoot). This differs from symfony/yaml's default, which resolves it to an integer Unix timestamp. Quote or post-process if you need a different type.
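
If you want date objects, one option is a small post-processing step over the decoded array. The sketch below uses only core PHP; the helper name toDate and the choice of UTC midnight are assumptions, and the sample array merely imitates the shape of front matter containing date: 2026-06-05.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical post-processing step: turn a "YYYY-MM-DD" front-matter string
 * into a DateTimeImmutable and leave every other value untouched.
 */
function toDate(mixed $value): mixed
{
    if (!is_string($value) || preg_match('/^\d{4}-\d{2}-\d{2}$/', $value) !== 1) {
        return $value;
    }

    // The leading "!" resets unspecified fields, so the time is midnight.
    $date = DateTimeImmutable::createFromFormat('!Y-m-d', $value, new DateTimeZone('UTC'));

    return $date === false ? $value : $date;
}

// Sample data standing in for decoded front matter.
$frontmatter = ['title' => 'Hello World', 'date' => '2026-06-05'];
$frontmatter['date'] = toDate($frontmatter['date']);

echo $frontmatter['date']->format(DATE_ATOM), PHP_EOL; // 2026-06-05T00:00:00+00:00
```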

Options

use HelgeSverre\Markdown\Data\Dialect;
use HelgeSverre\Markdown\Parser;

new Parser(
    dialect: Dialect::GitHub, // or Dialect::CommonMark
    safe: true,               // strip raw HTML
    xhtml: true,              // emit <br /> / <hr />
);

BatchParser accepts the same options. The Markdown facade uses the defaults.

Benchmarks

Run the full suite with:

composer bench

Fresh run from this checkout: PHP 8.5.5, Darwin arm64, PHPBench, opcache + tracing JIT + FFI preload. Full generated tables live in results/RESULTS.md, with machine-readable rows in results/results.json. The default corpus caps at ~256 KB (realistic document sizes plus two real-world corpora); the 1 MB and 8 MB scaling tiers are opt-in via composer bench:stress (run composer corpus first to generate them).

HTML Throughput Snapshot

toHtml() (render only) against the default corpus:

| Corpus | helgesverre/markdown | league/commonmark GFM | tempest/markdown |
| --- | --- | --- | --- |
| doc-128kb.md (135 KB) | 0.71 ms / 196 MB/s | 42.14 ms / 3.3 MB/s | 10.94 ms / 12.6 MB/s |
| commonmark-spec.md (165 KB) | 0.86 ms / 196 MB/s | 28.78 ms / 5.9 MB/s | (threw) |
| tempest-docs.md (252 KB) | 0.84 ms / 308 MB/s | 26.24 ms / 9.8 MB/s | 42.25 ms / 6.1 MB/s |

On the 252 KB Tempest docs corpus, the render fast path measured about 31x faster than league/commonmark GFM and about 50x faster than tempest/markdown. The full parse() pipeline (front matter + render + heading anchors + TOC) is benchmarked too — on that corpus it runs in ~1.12 ms (231 MB/s), still ~24x faster than league/commonmark GFM.
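
The render speedups quoted above are plain ratios of the mean times in the tempest-docs.md row. As a quick check, using a hypothetical helper named speedups and the rounded table values (so results can differ slightly from ratios computed on raw benchmark data):

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: express every mean time as a multiple of the baseline's.
 *
 * @param array<string, float> $meansMs mean time per parser, in milliseconds
 * @return array<string, float> ratio per non-baseline parser, one decimal
 */
function speedups(array $meansMs, string $baseline): array
{
    $base = $meansMs[$baseline];
    $ratios = [];

    foreach ($meansMs as $name => $ms) {
        if ($name !== $baseline) {
            $ratios[$name] = round($ms / $base, 1);
        }
    }

    return $ratios;
}

// Mean times copied from the tempest-docs.md (252 KB) row.
$row = [
    'helgesverre/markdown' => 0.84,
    'league/commonmark GFM' => 26.24,
    'tempest/markdown' => 42.25,
];

foreach (speedups($row, 'helgesverre/markdown') as $name => $ratio) {
    printf("%s: %.1fx slower\n", $name, $ratio);
}
```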

Front Matter

extract() pulls the YAML front matter without rendering the body (vendored libyaml in C → JSON → json_decode):

| Approach | Mean | Renders body? |
| --- | --- | --- |
| helgesverre/markdown extract only | 8.84 µs | no |
| helgesverre/markdown full parse | 31.86 µs | yes |
| symfony/yaml floor | 307.81 µs | no |
| league/commonmark front matter only | 344.33 µs | no |
| tempest/markdown lex (no render) | 402.79 µs | no |
| tempest/markdown full parse | 939.14 µs | yes |

Front-matter extraction measured about 35x faster than the symfony/yaml floor and about 39x faster than league/commonmark's dedicated front-matter parser. (tempest/markdown has no dedicated front-matter API — lex() is its cheapest path, full parse() its idiomatic one.)

Memory numbers in the benchmark output need context: this parser renders into a short-lived C heap buffer before copying HTML back into PHP, so PHP's memory metrics undercount part of its transient native allocation. Pure-PHP parsers keep their work on the Zend heap.

How It Works

The hot path is one FFI call into a small C shim around md4c:

char* md2html(const char* input, size_t input_len, size_t* out_len,
              unsigned int parser_flags, unsigned int renderer_flags);
void  md2html_free(char* p);

md4c renders through callbacks internally, but those callbacks stay in C. PHP passes a byte string in, receives one allocated HTML buffer back, copies it with FFI::string(), and frees it.
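
For readers unfamiliar with PHP FFI, the round trip described above looks roughly like this when written by hand. It is a sketch built only from the two signatures quoted here, not the library's internal code: the function name renderViaShim is hypothetical, the flag values of 0 are placeholders (the real flag constants are not shown in this README), and treating a NULL return as failure is an assumption.

```php
<?php

declare(strict_types=1);

/**
 * Sketch of the one-call render path: string in, C buffer out, copy, free.
 */
function renderViaShim(string $libPath, string $markdown): string
{
    $ffi = FFI::cdef(
        'char* md2html(const char* input, size_t input_len, size_t* out_len,
                       unsigned int parser_flags, unsigned int renderer_flags);
         void  md2html_free(char* p);',
        $libPath,
    );

    // Out-parameter that receives the length of the rendered HTML.
    $outLen = $ffi->new('size_t');
    $ptr = $ffi->md2html($markdown, strlen($markdown), FFI::addr($outLen), 0, 0);

    if ($ptr === null) {
        throw new RuntimeException('md2html returned NULL'); // assumed failure signal
    }

    try {
        // Copy exactly out_len bytes so embedded NUL bytes survive.
        return FFI::string($ptr, $outLen->cdata);
    } finally {
        $ffi->md2html_free($ptr); // release the C-side buffer
    }
}

// Only runs when ext-ffi is loaded and a shim path is supplied explicitly.
$lib = getenv('MARKDOWN_FFI_LIB');
if (extension_loaded('ffi') && is_string($lib) && is_file($lib)) {
    echo renderViaShim($lib, "# Hello\n");
}
```

In normal use none of this is needed; Markdown::toHtml() wraps the same steps.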

Front matter uses the same one-call shape: yaml2json() walks libyaml's event stream into a single JSON string in C, which PHP json_decodes — no per-node FFI crossings. libyaml is vendored and statically linked into the shim, so the shipped binaries carry no external runtime dependency.

For production, bench/preload.php can warm an FFI::load() scope through opcache preload. Without preload, the library falls back to FFI::cdef() automatically.
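
The fallback can be pictured as a try/catch around FFI::scope(). This is a sketch of the general pattern, not the library's loader: the function name shimFfi is hypothetical, and the scope name must match whatever FFI_SCOPE the preloaded header declares, which this README does not state.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical loader: prefer a preloaded FFI scope, otherwise bind the
 * library at runtime with FFI::cdef().
 */
function shimFfi(string $scope, string $declarations, string $libPath): FFI
{
    try {
        // Succeeds only if opcache preload already ran FFI::load() for this scope.
        return FFI::scope($scope);
    } catch (FFI\Exception) {
        // No preloaded scope: parse the declarations and load the library now.
        return FFI::cdef($declarations, $libPath);
    }
}
```

Under PHP's default ffi.enable=preload, the cdef branch works in the CLI only, which is why web/FPM setups need either preload or ffi.enable=1.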

The shim also includes a small correctness pass for md4c's permissive autolinks: explicit links whose text is itself an autolinkable URL can otherwise become invalid nested anchors. The pass collapses that generated shape while preserving user-supplied raw nested anchors.

Build From Source

Most users do not need this. Build from source when hacking on the C shim or targeting an unshipped platform.

composer build       # current platform -> native/
composer build:all   # all shipped platforms -> lib/

composer build needs a local C compiler. composer build:all uses clang for the macOS universal binary and zig cc for Linux and Windows cross-builds.

Scripts

| Command | What it does |
| --- | --- |
| composer test | Run PHPUnit |
| composer check | Run the CI correctness smoke gate |
| composer bench | Run PHPBench and regenerate results/ |
| composer bench:stress | Run the throughput bench against the 1 MB / 8 MB tiers |
| composer examples | Run every example script |
| composer build | Build the native shim for this platform |
| composer build:all | Cross-build shipped libraries |
| composer format:check | Check formatting with Mago |
| composer lint | Run Mago lint |

Tests

composer test

The suite covers GFM rendering, dialect/safe/XHTML options, generated anchor collapse without raw HTML rewrites, document parsing, front matter, heading slugs and TOCs, structural parity against league/commonmark, batch-vs-sequential output, shipped-library binding, hostile inputs, embedded NUL bytes, and leak checks.

CI runs the shipped binaries on Linux and macOS, keeps an experimental Windows shipped-binary job, and also builds the Linux shim from source.

Alternatives

  • league/commonmark is the mature pure-PHP default. If you want extensibility and no native artifact, start there.
  • tempest/markdown is a good fit inside the Tempest ecosystem, especially if you want its bundled syntax highlighting and heading behavior.

License

MIT. Bundled under their own MIT licenses: md4c (Martin Mitáš) for Markdown parsing and libyaml (Kirill Simonov et al.) for front-matter YAML — see THIRD_PARTY.md.
