Cleaner & Safelist API revamp #2284

Open
@jhy

I'm looking at revamping and modernizing the Cleaner and the Safelist APIs.

For the Cleaner, my thought is to allow a configured Cleaner to be supplied to a Parser, whose input will be whatever the regular inputs are: string, file, connection, InputStream, etc. That way the parser (or StreamParser) can do the tokenization, tree build, and sanitization in one pass. We can also simplify the cleaning APIs and support both body fragment and full document cleaning.

It will also allow us to track both parse errors and safelist errors in the one pass, to inform the Cleaner's isValid method.

And by directly integrating via the Parser and Document.OutputSettings, we can support both HTML and XML (and configuration for case preserving tags and attributes).

Would also like to add an API to retrieve the cleaner's error list -- what tags and attributes were dropped, what attributes added, etc.
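To make the idea of a retrievable error list concrete, here is a minimal sketch of how such a report could be modeled. Everything in it (`CleanReport`, `Kind`, `Entry`, `record`) is a hypothetical name for illustration and not part of jsoup. It also assumes that an added attribute does not make the input invalid, which is only one possible reading of the proposal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch, not jsoup API: a log of what a single cleaning pass changed. */
final class CleanReport {
    enum Kind { PARSE_ERROR, TAG_DROPPED, ATTRIBUTE_DROPPED, ATTRIBUTE_ADDED }

    static final class Entry {
        final Kind kind;
        final String tag;     // element the entry applies to
        final String detail;  // attribute name or parse error message; may be empty

        Entry(Kind kind, String tag, String detail) {
            this.kind = kind;
            this.tag = tag;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return kind + " <" + tag + "> " + detail;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void record(Kind kind, String tag, String detail) {
        entries.add(new Entry(kind, tag, detail));
    }

    /**
     * Assumption: the input counts as valid when there were no parse errors and
     * nothing was dropped. Added attributes alone do not invalidate it.
     */
    boolean isValid() {
        for (Entry e : entries) {
            if (e.kind != Kind.ATTRIBUTE_ADDED) return false;
        }
        return true;
    }

    /** Read-only view of everything recorded during the pass. */
    List<Entry> entries() {
        return Collections.unmodifiableList(entries);
    }
}
```

Because parse errors and safelist violations land in the same list, a single `isValid()` check can cover both, and callers who want detail can walk `entries()`.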

(As a future idea, it might be useful to allow the cleaner to be configured with a custom strategy function that can apply custom logic on what to pass, and could escape vs. discard undesired elements, etc.)
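As a rough illustration of what a custom strategy function might look like, the sketch below maps a tag name to one of three actions. The names `StrategySketch`, `Action`, and `escapeUnknown` are invented for this example, and the choice to always discard `script` and `style` is an assumption rather than anything decided here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch, not jsoup API: a pluggable decision function for the cleaner. */
final class StrategySketch {
    enum Action { PASS, ESCAPE, DISCARD }

    /**
     * Example strategy: pass allowed tags, discard script/style outright,
     * and escape (rather than drop) every other tag.
     */
    static Function<String, Action> escapeUnknown(Set<String> allowed) {
        // Assumption: these two are never worth keeping, even as escaped text.
        Set<String> alwaysDiscard = new HashSet<>(Arrays.asList("script", "style"));
        return tagName -> {
            String name = tagName.toLowerCase(Locale.ROOT);
            if (allowed.contains(name)) return Action.PASS;
            if (alwaysDiscard.contains(name)) return Action.DISCARD;
            return Action.ESCAPE;
        };
    }
}
```

A real strategy would probably need more context than the tag name (attributes, namespace, ancestors), so treat the `Function<String, Action>` signature as a placeholder.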

For the Safelist, a long-standing request has been to supply canned lists of HTML5 elements. Now that jsoup has decent namespace support and HTML5 is actively using that for math and svg spaces, the Safelist will need to understand namespaces.

The API should take Tag objects (created with namespace and element names). I plan on revamping Tags (or having a TagList) to allow defining custom properties; the Safelist can then interface with these properly. A simple string-based interface will be kept for times when that's not required.
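A small sketch of why the namespace has to be part of the safelist key: the same local name (for example `a` or `title`) exists in both HTML and SVG, and allowing one should not allow the other. The class and method names here (`NamespacedSafelistSketch`, `TagKey`, `addTag`, `addTags`, `isSafeTag`) are hypothetical and do not reflect the existing Safelist or Tag classes. The string-only overload defaulting to the HTML namespace is an assumption.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch, not jsoup API: a safelist keyed by (namespace, tag name). */
final class NamespacedSafelistSketch {
    static final String HTML = "http://www.w3.org/1999/xhtml";
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    /** Value object standing in for a namespace-aware Tag. */
    static final class TagKey {
        final String namespace;
        final String name;

        TagKey(String namespace, String name) {
            this.namespace = namespace;
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TagKey)) return false;
            TagKey other = (TagKey) o;
            return namespace.equals(other.namespace) && name.equals(other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(namespace, name);
        }
    }

    private final Set<TagKey> allowed = new HashSet<>();

    /** Namespace-aware form. Names are matched exactly, since SVG names are case-sensitive. */
    NamespacedSafelistSketch addTag(String namespace, String name) {
        allowed.add(new TagKey(namespace, name));
        return this;
    }

    /** Simple string-based form; assumed here to mean the HTML namespace. */
    NamespacedSafelistSketch addTags(String... names) {
        for (String name : names) addTag(HTML, name);
        return this;
    }

    boolean isSafeTag(String namespace, String name) {
        return allowed.contains(new TagKey(namespace, name));
    }
}
```

With this shape, `addTags("p", "a")` allows the HTML `a` element but leaves the SVG `a` element blocked until it is added explicitly with its namespace.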

My plan is to add new lists suffixed with 5 like Relaxed5 etc. The existing lists will stay as-is to ensure no surprises for anyone. The new lists, in the spirit of HTML5's ongoing improvement, will be open to new elements / removals over time.

This would also be a good opportunity to add support for attribute wildcards, allowing the Safelist to pass e.g. data-*, aria-*, etc.
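One possible way to match attribute wildcards is a trailing-`*` prefix rule, sketched below. `AttributeWildcardSketch` and its methods are hypothetical and not jsoup API. The sketch assumes case-insensitive matching (as for HTML attribute names) and that a bare prefix such as `data-` with nothing after it does not match.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch, not jsoup API: exact and prefix-wildcard attribute matching. */
final class AttributeWildcardSketch {
    private final Set<String> exact = new HashSet<>();
    private final List<String> prefixes = new ArrayList<>();

    /** Patterns ending in '*' are treated as prefixes; all others must match exactly. */
    AttributeWildcardSketch add(String... patterns) {
        for (String pattern : patterns) {
            String key = pattern.toLowerCase(Locale.ROOT);
            if (key.endsWith("*")) {
                prefixes.add(key.substring(0, key.length() - 1));
            } else {
                exact.add(key);
            }
        }
        return this;
    }

    boolean isAllowed(String attributeName) {
        String key = attributeName.toLowerCase(Locale.ROOT);
        if (exact.contains(key)) return true;
        for (String prefix : prefixes) {
            // Require at least one character after the prefix, so "data-" alone is rejected.
            if (key.length() > prefix.length() && key.startsWith(prefix)) return true;
        }
        return false;
    }
}
```

For example, after `add("href", "data-*", "aria-*")`, the names `data-id` and `aria-label` are allowed while `onclick` is not.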

Plan

Cleaner API:

  • Pass configured Cleaner directly to Parser (StreamParser)
  • Single-pass tokenization, tree-build, and sanitize
  • Support HTML and XML via Syntax
  • Unify support for cleaning both body fragments and full documents
  • Track parse errors and safelist violations together (isValid)
  • API to retrieve detailed cleaning errors
  • (Future) support for custom sanitization strategies

Safelist API:

  • New HTML5-friendly safelists (e.g., Relaxed5) with namespace support
  • Defined with Tag objects
  • Keep existing safelists unchanged evermore for compatibility
  • Allow wildcard attributes (data-*, aria-*)
  • (Future) update HTML5-based safelists to match spec changes

If folks have views on this, I'm very keen to hear them.

Labels: improvement, planned
