
Feat/input callbacks #200

Merged
dginev merged 5 commits into master from feat/input-callbacks
May 23, 2026

@dginev
Member

@dginev dginev commented May 22, 2026

This PR adds the ability to register input callbacks, motivated by in-memory libxslt use.

As a motivating example:

use libxml::io;

// Bundled at compile time.
static MAIN: &[u8] = b"<?xml version=\"1.0\"?>\n<root/>";

io::register_input_callback(
  |url| url.starts_with("embed:///"),
  |url| match url.strip_prefix("embed:///") {
    Some("main.xsl") => Some(MAIN.to_vec()),
    _ => None,
  },
);

The PR code is AI-generated, so I'll let it stew until I do a proper review pass. Tested and works with my main use case with libxslt, so there is a baseline of "it works".

dginev and others added 2 commits May 22, 2026 04:45
New `io` module that lifts libxml2's `xmlRegisterInputCallbacks` into
a closure-friendly Rust API:

  pub fn register_input_callback<M, O>(match_url: M, open: O)
  where
    M: Fn(&str) -> bool + Send + Sync + 'static,
    O: Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static;

`match_url` claims a URL; `open` returns the bytes (or None to defer
back through the callback chain). The C trampolines are registered
with libxml2 exactly once per process; subsequent calls just append
to a Mutex<Vec<Callback>> registry that the trampolines walk on each
URL load. `Send + Sync` because libxml2 may dispatch from any thread.
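
The register-once, append-afterwards behaviour described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the code from this PR: `install_trampolines` is a hypothetical stand-in for the real `xmlRegisterInputCallbacks` call, and `INSTALL_COUNT`, `install_count` and `registered_len` exist only to make the behaviour observable. It assumes Rust 1.63 or later for the `const` `Mutex::new`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, Once};

type MatchFn = Box<dyn Fn(&str) -> bool + Send + Sync>;
type OpenFn = Box<dyn Fn(&str) -> Option<Vec<u8>> + Send + Sync>;

// In a real module the trampolines would read these fields; this sketch
// only stores them, hence the allow.
#[allow(dead_code)]
struct Callback {
    match_url: MatchFn,
    open: OpenFn,
}

static INSTALL: Once = Once::new();
static INSTALL_COUNT: AtomicUsize = AtomicUsize::new(0);
static REGISTRY: Mutex<Vec<Callback>> = Mutex::new(Vec::new());

// Hypothetical stand-in for the single xmlRegisterInputCallbacks call.
fn install_trampolines() {
    INSTALL_COUNT.fetch_add(1, Ordering::SeqCst);
}

pub fn register_input_callback<M, O>(match_url: M, open: O)
where
    M: Fn(&str) -> bool + Send + Sync + 'static,
    O: Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static,
{
    // Runs at most once per process, no matter how many registrations follow.
    INSTALL.call_once(install_trampolines);
    // Every call appends one entry to the shared registry.
    REGISTRY.lock().unwrap().push(Callback {
        match_url: Box::new(match_url),
        open: Box::new(open),
    });
}

// Observation helpers for the sketch only.
pub fn install_count() -> usize {
    INSTALL_COUNT.load(Ordering::SeqCst)
}

pub fn registered_len() -> usize {
    REGISTRY.lock().unwrap().len()
}
```

Registering two handlers leaves `install_count()` at 1 and `registered_len()` at 2, which is the property the paragraph above relies on.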

## Motivating use case

A single-binary CLI bundles its XSLT stylesheets / RNG schemas via
`include_bytes!` and serves them through a synthetic URL scheme
(e.g. `embed:///LaTeXML-html5.xsl`). The main stylesheet is parsed
from memory via `libxslt::parser::parse_bytes(bytes, "embed:///main.xsl")`
which sets the doc's base URI. Inside libxslt, `xsl:import href="…"`
composes the absolute URL against that base, then calls `xmlReadFile`
— which walks libxml2's input-callback table and finds ours. No disk
extraction needed.
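
A bundled asset table for such a scheme might look like the sketch below. The asset names other than `main.xsl`, the inline byte strings (standing in for `include_bytes!` results), and the helper names `matches_embedded` and `open_embedded` are all hypothetical and chosen for illustration.

```rust
// Hypothetical synthetic scheme prefix, following the `embed:///` example.
const SCHEME: &str = "embed:///";

// In a real binary these would be `include_bytes!("...")` results; inline
// byte strings keep the sketch self-contained.
const MAIN_XSL: &[u8] = b"<?xml version=\"1.0\"?>\n<root/>";
const COMMON_XSL: &[u8] = b"<?xml version=\"1.0\"?>\n<common/>";

// Name-to-bytes table for everything bundled into the binary.
static ASSETS: &[(&str, &[u8])] = &[
    ("main.xsl", MAIN_XSL),
    ("common.xsl", COMMON_XSL),
];

/// Claims every URL under the synthetic scheme.
pub fn matches_embedded(url: &str) -> bool {
    url.starts_with(SCHEME)
}

/// Returns the bundled bytes for a claimed URL, or `None` if no asset
/// with that name was bundled.
pub fn open_embedded(url: &str) -> Option<Vec<u8>> {
    let name = url.strip_prefix(SCHEME)?;
    ASSETS
        .iter()
        .find(|(asset_name, _)| *asset_name == name)
        .map(|(_, bytes)| bytes.to_vec())
}
```

Both functions have the shapes `Fn(&str) -> bool` and `Fn(&str) -> Option<Vec<u8>>`, so they should be usable as the `match_url` and `open` arguments of `register_input_callback` as declared above.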

The same trick handles RelaxNG `<include>` resolution from
`xmlRelaxNGParse`, DTD external subsets, and any other libxml2-side
URL load.

## Why not `Parser::parse_file`

The existing `Parser::parse_file` reads the file via Rust I/O
(`std::fs::File::open` + `xmlReadIO`) and bypasses libxml2's URL
machinery entirely. The doctest example is marked `no_run` and notes
that the callback fires from libxslt / xmlReadFile contexts, not
from the library's own `parse_file` surface.

## Tests

Three unit tests against `xmlReadFile` (the libxml2 entry point that
actually exercises the callback chain):

  * `callback_serves_registered_url` — registered URL parses through
    the callback (round-trip via xmlReadFile -> trampoline_open ->
    Rust closure -> trampoline_read -> libxml2 parse).
  * `callback_can_decline_via_none` — open returning None fails the
    load rather than returning phantom data.
  * `non_matching_url_defers_to_default_handlers` — match returning
    false leaves the default file/HTTP loaders intact (verified by
    a /nonexistent file:// URL failing through the default chain).

All 105 pre-existing tests still pass; full sweep clean.

## Notes

* libxml2 has no per-handler unregistration API (only
  `xmlCleanupInputCallbacks` which wipes the whole chain including
  the defaults), so the trampolines and the Rust registry live for
  the process lifetime. Reasonable for the embedded-asset use case;
  documented in the module docs.
* `Mutex::lock` is held only briefly during the registry walk on
  each URL load — no closures run while the lock is held that could
  re-enter libxml2.
* Callback ordering is last-registered-first, matching libxml2's own
  convention. Stacking multiple registrations for the same scheme is
  supported.

Version: 0.3.11 -> 0.3.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three #[test]s deadlocked under cargo's parallel runner on
libxml2 2.12.9 (pre-2.13 thread-safety bug in the input-callback
/ global error path); merge them into one #[test] so scenarios
run sequentially. Drive-bys from the same review:

* Drop redundant function-pointer aliases (4 non_camel_case
  warnings); Some(trampoline_*) already coerces to the bindgen
  Option<extern "C" fn> alias.
* Extract MatchFn/OpenFn (clippy::type_complexity on the Box dyn
  Fn).
* Iterate the registry newest-first in trampoline_open to match
  the module doc's "most recent wins" and libxml2's own callback
  table semantics.
* Store entries as Arc<Callback> and snapshot the Vec before
  invoking a closure, so an open() that re-enters libxml2 via
  xmlReadFile doesn't self-deadlock on the non-reentrant registry
  Mutex.
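
The snapshot-before-invoke idea from the last two bullets can be sketched without libxml2. This is an illustration only: `register`, `snapshot` and `resolve` are hypothetical names, `resolve` stands in for what the open trampoline would do, and the sketch assumes dispatch stops at the newest entry whose matcher claims the URL. The actual code in the PR may differ in any of these respects.

```rust
use std::sync::{Arc, Mutex};

type MatchFn = Box<dyn Fn(&str) -> bool + Send + Sync>;
type OpenFn = Box<dyn Fn(&str) -> Option<Vec<u8>> + Send + Sync>;

struct Callback {
    match_url: MatchFn,
    open: OpenFn,
}

static REGISTRY: Mutex<Vec<Arc<Callback>>> = Mutex::new(Vec::new());

pub fn register<M, O>(match_url: M, open: O)
where
    M: Fn(&str) -> bool + Send + Sync + 'static,
    O: Fn(&str) -> Option<Vec<u8>> + Send + Sync + 'static,
{
    REGISTRY.lock().unwrap().push(Arc::new(Callback {
        match_url: Box::new(match_url),
        open: Box::new(open),
    }));
}

// Clone the Vec of Arcs (cheap reference-count bumps); the lock is released
// when this function returns, before any closure is invoked.
fn snapshot() -> Vec<Arc<Callback>> {
    REGISTRY.lock().unwrap().clone()
}

// Stand-in for the open trampoline: newest registration first, no lock held
// while closures run, so a closure may call `resolve` or `register` itself.
pub fn resolve(url: &str) -> Option<Vec<u8>> {
    let entries = snapshot();
    for cb in entries.iter().rev() {
        if (cb.match_url)(url) {
            return (cb.open)(url);
        }
    }
    None
}
```

With this shape, an `open` closure that itself calls `resolve` for another URL completes normally, whereas holding the `Mutex` guard across the closure call would block forever on the second `lock()`.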

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@dginev dginev requested a review from triptec May 22, 2026 18:24
Collaborator

@triptec triptec left a comment


Glhf!

dginev and others added 3 commits May 23, 2026 13:35
Mirror trampoline_match's iteration order to trampoline_open
(newest-first), .unwrap() the registry mutex in snapshot() to match
register_input_callback, and add a fifth scenario asserting the
documented "most recent registration wins" semantics with atomic
counters. Comments and CHANGELOG entry compacted; behaviour
unchanged for correct callers.
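
A "most recent registration wins" scenario with atomic counters could look roughly like the following. It is a sketch against a local handler list rather than the crate's global registry; `Handler`, `counting_handler`, `dispatch` and `most_recent_wins_scenario` are hypothetical names, and the real test in this PR goes through libxml2 instead.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct Handler {
    matches: Box<dyn Fn(&str) -> bool>,
    open: Box<dyn Fn(&str) -> Option<Vec<u8>>>,
}

// Builds a handler that claims every `embed:///` URL, counts how often its
// opener runs, and serves a fixed body.
fn counting_handler(hits: Arc<AtomicUsize>, body: &'static str) -> Handler {
    Handler {
        matches: Box::new(|url: &str| url.starts_with("embed:///")),
        open: Box::new(move |_url: &str| {
            hits.fetch_add(1, Ordering::SeqCst);
            Some(body.as_bytes().to_vec())
        }),
    }
}

// Newest-first dispatch: the last-registered handler that claims the URL opens it.
fn dispatch(handlers: &[Handler], url: &str) -> Option<Vec<u8>> {
    let winner = handlers.iter().rev().find(|h| (h.matches)(url))?;
    (winner.open)(url)
}

/// Registers two handlers for the same scheme and reports
/// (older hits, newer hits, served bytes) after one load.
pub fn most_recent_wins_scenario() -> (usize, usize, Option<Vec<u8>>) {
    let older_hits = Arc::new(AtomicUsize::new(0));
    let newer_hits = Arc::new(AtomicUsize::new(0));
    let handlers = vec![
        counting_handler(Arc::clone(&older_hits), "<older/>"),
        counting_handler(Arc::clone(&newer_hits), "<newer/>"),
    ];
    let served = dispatch(&handlers, "embed:///main.xsl");
    (
        older_hits.load(Ordering::SeqCst),
        newer_hits.load(Ordering::SeqCst),
        served,
    )
}
```

The counters make the assertion about ordering direct: after one load the older handler's count stays at zero while the newer handler's count is one.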

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the archived actions-rs/* and ryankurte/action-apt with their
current-standard equivalents so the workflows run on Node 24 ahead of
GitHub's June 2026 forced migration:

  * actions-rs/toolchain@v1 -> dtolnay/rust-toolchain (@stable, plus
    @master + toolchain:/targets: for the mingw windows-gnu job)
  * actions-rs/cargo@v1     -> plain `run: cargo test|doc`
  * ryankurte/action-apt    -> plain `run: apt-get update && install`
  * actions/checkout@v2/@v4 -> @v6

Also add least-privilege `permissions:` blocks (contents: read for the
CI/test workflows; contents: write for gh-pages, which pushes rendered
docs to the gh-pages branch).

CHANGELOG: date 0.3.12 (2026-05-23) and open a 0.3.13 in-development
section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The mingw64 job sets `defaults.run.shell: msys2 {0}`, so the converted
`run: cargo test` step executed inside the msys2 login shell. With
`path-type: minimal`, msys2 strips cargo (installed by rustup to the
Windows user profile) from PATH, so the step failed with exit 127.

The previous actions-rs/cargo@v1 step was a JS action that ran in the
runner's Windows context rather than in msys2, so it always found cargo.
Restore that behavior by pinning the test step to `shell: pwsh`.
mingw64/bin is already on PATH from the prior step, so pkg-config, gcc,
and the libxml2 DLLs still resolve for the windows-gnu build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dginev
Member Author

dginev commented May 23, 2026

I updated the CI workflows to silence the recent GHA warnings and prepped a v0.3.12 release. Merging and releasing.

@dginev dginev merged commit da0f3ef into master May 23, 2026
18 checks passed