Cleaner & Safelist API revamp #2284

Open
jhy opened this issue Mar 12, 2025 · 0 comments
Labels
improvement An improvement / new feature idea planned Things to get to in the near term

Owner

jhy commented Mar 12, 2025

I'm looking at revamping and modernizing the Cleaner and the Safelist APIs.

For the Cleaner, my thought is to allow a configured Cleaner to be supplied to a Parser, whose input will be whatever the regular inputs are: string, file, connection, InputStream, etc. That way the parser (or StreamParser) can do the tokenization, tree build, and sanitization in one pass. We can also simplify the cleaning APIs and support both body fragment and full document cleaning.

It will also allow us to track both parse errors and safelist errors in the one pass, to inform the Cleaner's isValid method.

And by directly integrating via the Parser and Document.OutputSettings, we can support both HTML and XML (and configuration for case-preserving tags and attributes).

I would also like to add an API to retrieve the cleaner's error list: which tags and attributes were dropped, which attributes were added, etc.
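To make the error-list idea concrete, here is a minimal sketch of what such a report could look like. Everything in it (`CleaningReport`, `Kind`, `Entry`, `record`) is a hypothetical name for illustration, not an existing or planned jsoup API. It also assumes that added attributes are informational only and do not make the input invalid; that is a guess, not a decision.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these types exist in jsoup today.
final class CleaningReport {
    enum Kind { PARSE_ERROR, TAG_DROPPED, ATTRIBUTE_DROPPED, ATTRIBUTE_ADDED }

    static final class Entry {
        final Kind kind;
        final String detail;

        Entry(Kind kind, String detail) {
            this.kind = kind;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return kind + ": " + detail;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Called by the (hypothetical) single-pass cleaner as it parses and sanitizes.
    void record(Kind kind, String detail) {
        entries.add(new Entry(kind, detail));
    }

    // Read-only view of everything recorded during the pass.
    List<Entry> entries() {
        return Collections.unmodifiableList(entries);
    }

    // Assumption: parse errors and dropped content invalidate the input;
    // added attributes are treated as informational only.
    boolean isValid() {
        for (Entry e : entries) {
            if (e.kind != Kind.ATTRIBUTE_ADDED) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        CleaningReport report = new CleaningReport();
        report.record(Kind.ATTRIBUTE_ADDED, "a[rel=nofollow]");
        System.out.println(report.isValid());
        report.record(Kind.TAG_DROPPED, "script");
        System.out.println(report.isValid());
        for (Entry e : report.entries()) System.out.println(e);
    }
}
```

Keeping parse errors and safelist violations in one list is what lets a single `isValid` answer cover both, which is the point of doing the work in one pass.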

(As a future idea, it might be useful to allow the cleaner to be configured with a custom strategy function that applies custom logic to decide what to pass, and that could escape rather than discard undesired elements, etc.)
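One possible shape for such a strategy function, sketched with plain strings rather than jsoup nodes. The `Decision` enum, the `apply` helper, and the string-based signature are all assumptions made for illustration; a real design would presumably operate on elements.

```java
import java.util.function.Function;

// Hypothetical sketch of a pass / escape / discard strategy; not a jsoup API.
final class StrategySketch {
    enum Decision { PASS, ESCAPE, DISCARD }

    // Applies the caller's decision for a tag to that element's markup.
    static String apply(String tagName, String outerHtml, Function<String, Decision> strategy) {
        switch (strategy.apply(tagName)) {
            case PASS:
                return outerHtml;
            case ESCAPE:
                // Escape & first so the entities produced below are not double-escaped.
                return outerHtml.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
            default:
                return "";
        }
    }

    public static void main(String[] args) {
        Function<String, Decision> strategy = tag ->
            tag.equals("b") ? Decision.PASS
                : tag.equals("script") ? Decision.ESCAPE
                : Decision.DISCARD;

        System.out.println(apply("b", "<b>bold</b>", strategy));
        System.out.println(apply("script", "<script>x</script>", strategy));
        System.out.println(apply("iframe", "<iframe></iframe>", strategy).isEmpty());
    }
}
```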

For the Safelist, a long-standing request has been to supply canned lists of HTML5 elements. Now that jsoup has decent namespace support and HTML5 is actively using that for the math and svg namespaces, the Safelist will need to understand namespaces.

The API should take Tag objects (created with namespace and element names). I plan on revamping Tags and adding a TagList to allow defining custom properties, so the two can interface properly. A simple string-based interface will remain for cases where Tag objects aren't needed.
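As a rough illustration of why the safelist needs to key on namespace as well as name, the sketch below stores allowed tags per namespace, with a string-only convenience method that defaults to the HTML namespace. `NamespacedTagList`, `addTags`, `addNamespacedTags`, and `isAllowed` are hypothetical names, and it uses plain strings where the real API would take Tag objects.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the real design would use jsoup Tag objects, not strings.
final class NamespacedTagList {
    static final String HTML = "http://www.w3.org/1999/xhtml";
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    private final Map<String, Set<String>> tagsByNamespace = new HashMap<>();

    // Simple string-based form: tags are assumed to be in the HTML namespace.
    NamespacedTagList addTags(String... tagNames) {
        return addNamespacedTags(HTML, tagNames);
    }

    // Namespace-aware form, for svg / math content.
    NamespacedTagList addNamespacedTags(String namespace, String... tagNames) {
        tagsByNamespace
            .computeIfAbsent(namespace, ns -> new HashSet<>())
            .addAll(Arrays.asList(tagNames));
        return this;
    }

    boolean isAllowed(String namespace, String tagName) {
        Set<String> tags = tagsByNamespace.get(namespace);
        return tags != null && tags.contains(tagName);
    }

    public static void main(String[] args) {
        NamespacedTagList list = new NamespacedTagList()
            .addTags("p", "a", "b")
            .addNamespacedTags(SVG, "svg", "title", "path");

        // Same local name, different namespace, different answer.
        System.out.println(list.isAllowed(SVG, "title"));
        System.out.println(list.isAllowed(HTML, "title"));
    }
}
```

A name-only lookup could not tell the SVG `title` (a descriptive child element) apart from the HTML `title`, which is the kind of ambiguity namespace-aware Tags would remove.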

My plan is to add new lists suffixed with 5 like Relaxed5 etc. The existing lists will stay as-is to ensure no surprises for anyone. The new lists, in the spirit of HTML5's ongoing improvement, will be open to new elements / removals over time.

This would also be a good opportunity to add support for attribute wildcards, allowing the Safelist to pass e.g. data-*, aria-*, etc.
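A minimal sketch of how trailing-wildcard attribute matching could work. The `AttributeWildcards` class and its `allows` method are hypothetical, and the rules it applies (only a trailing `*` is supported, patterns are lowercase, matching is case-insensitive, and a bare prefix such as `data-` does not match) are assumptions rather than settled behaviour.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of wildcard attribute matching; not a jsoup API.
final class AttributeWildcards {
    private final List<String> patterns; // assumed lowercase, e.g. "data-*", "title"

    AttributeWildcards(List<String> patterns) {
        this.patterns = List.copyOf(patterns);
    }

    // True if the name equals an exact pattern or extends a trailing-* prefix.
    boolean allows(String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        for (String pattern : patterns) {
            if (pattern.endsWith("*")) {
                String prefix = pattern.substring(0, pattern.length() - 1);
                // Require at least one character after the prefix, so "data-" alone fails.
                if (name.length() > prefix.length() && name.startsWith(prefix)) return true;
            } else if (name.equals(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AttributeWildcards allowed = new AttributeWildcards(List.of("data-*", "aria-*", "title"));
        System.out.println(allowed.allows("data-id"));
        System.out.println(allowed.allows("aria-label"));
        System.out.println(allowed.allows("onclick"));
    }
}
```

Restricting wildcards to a trailing `*` keeps the match a cheap prefix check and avoids patterns like `on*` being written by accident as part of something broader.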

Plan

Cleaner API:

  • Pass configured Cleaner directly to Parser (StreamParser)
  • Single-pass tokenization, tree-build, and sanitize
  • Support HTML and XML via Syntax
  • Unify support for cleaning both body fragments and full documents
  • Track parse errors and safelist violations together (isValid)
  • API to retrieve detailed cleaning errors
  • (Future) support for custom sanitization strategies

Safelist API:

  • New HTML5-friendly safelists (e.g., Relaxed5) with namespace support
  • Defined with Tag objects
  • Keep existing safelists unchanged evermore for compatibility
  • Allow wildcard attributes (data-*, aria-*)
  • (Future) update HTML5-based safelists to match spec changes

If folks have views on this, I'm very keen to hear them.
