I'm looking at revamping and modernizing the Cleaner and the Safelist APIs.
For the Cleaner, my thought is to allow a configured Cleaner to be supplied to a Parser, whose input will be whatever the regular inputs are: string, file, connection, InputStream, etc. That way the parser (or StreamParser) can do the tokenization, tree build, and sanitization in one pass. We can also simplify the cleaning APIs and support both body fragment and full document cleaning.
It will also allow us to track both parse errors and safelist errors in the one pass, to inform the Cleaner's isValid method.
And by directly integrating via the Parser and Document.OutputSettings, we can support both HTML and XML (and configuration for case preserving tags and attributes).
Would also like to add an API to retrieve the cleaner's error list -- what tags and attributes were dropped, what attributes added, etc.
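To make the error-list idea concrete, here is a minimal sketch of what such a report could hold. Everything in it (`CleanReport`, `Kind`, `Entry`, `record`) is a hypothetical name for illustration and not part of jsoup. It also assumes that only dropped tags and attributes count against validity, while added (enforced) attributes do not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not jsoup's API: collects what a cleaning pass changed.
final class CleanReport {
    enum Kind { TAG_DROPPED, ATTRIBUTE_DROPPED, ATTRIBUTE_ADDED }

    static final class Entry {
        final Kind kind;
        final String tag;
        final String attribute; // null for tag-level entries

        Entry(Kind kind, String tag, String attribute) {
            this.kind = kind;
            this.tag = tag;
            this.attribute = attribute;
        }

        @Override
        public String toString() {
            return attribute == null
                ? kind + " <" + tag + ">"
                : kind + " <" + tag + " " + attribute + ">";
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Called by the (hypothetical) cleaning pass whenever it alters the input.
    void record(Kind kind, String tag, String attribute) {
        entries.add(new Entry(kind, tag, attribute));
    }

    // Read-only view of everything recorded so far.
    List<Entry> entries() {
        return Collections.unmodifiableList(entries);
    }

    // Assumption: only drops are violations; enforced (added) attributes are not.
    boolean isValid() {
        for (Entry e : entries) {
            if (e.kind != Kind.ATTRIBUTE_ADDED) return false;
        }
        return true;
    }
}
```

A caller could then inspect `entries()` after a clean to log or display exactly what was removed, rather than only getting a boolean.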
(As a future idea, it might be useful to allow the cleaner to be configured with a custom strategy function that can apply custom logic on what to pass, and could escape rather than discard undesired elements, etc.)
For the Safelist, a long-standing request has been to supply canned lists of HTML5 elements. Now that jsoup has decent namespace support, and HTML5 actively uses namespaces for math and svg, the Safelist will need to understand namespaces.
The API should take Tag objects (created with namespace and element names). I plan on revamping Tags / have a TagList to allow defining custom properties; these can properly interface. Keep a simple string-based interface for times when that's not required.
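As a rough illustration of a namespace-aware safelist that still offers a simple string-based entry point, here is a minimal sketch. The class `NamespacedSafelist` and its methods are hypothetical and do not reflect jsoup's actual or planned API; it keys on plain strings where the real design would take Tag objects. The namespace URIs are the standard XHTML, SVG, and MathML ones.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not jsoup's API: a safelist keyed by (namespace, tag name).
final class NamespacedSafelist {
    static final String HTML = "http://www.w3.org/1999/xhtml";
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    private final Map<String, Set<String>> tagsByNamespace = new HashMap<>();

    // Namespace-aware entry point.
    NamespacedSafelist addTag(String namespace, String name) {
        tagsByNamespace.computeIfAbsent(namespace, k -> new HashSet<>()).add(name);
        return this;
    }

    // Simple string-based entry point: assumes the HTML namespace.
    NamespacedSafelist addTags(String... names) {
        for (String name : names) addTag(HTML, name);
        return this;
    }

    // A tag passes only if its name is listed under its own namespace.
    boolean isSafeTag(String namespace, String name) {
        Set<String> names = tagsByNamespace.get(namespace);
        return names != null && names.contains(name);
    }
}
```

Keying on the pair means that allowing `a` in HTML does not implicitly allow an `a` element in the SVG namespace; each namespace has to opt in separately.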
My plan is to add new lists suffixed with 5 like Relaxed5 etc. The existing lists will stay as-is to ensure no surprises for anyone. The new lists, in the spirit of HTML5's ongoing improvement, will be open to new elements / removals over time.
This would also be a good opportunity to add support for attribute wildcards, allowing the Safelist to pass e.g. data-, aria-, etc.
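One way the wildcard matching could work is a trailing-`*` prefix match, sketched below. `AttributePatterns` is a hypothetical helper, not part of jsoup. The sketch assumes case-insensitive matching (as for HTML; a case-preserving XML mode would need to skip the lowercasing) and requires at least one character after the prefix.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of jsoup: matches attribute names against
// exact entries and trailing-wildcard prefixes such as "data-*".
final class AttributePatterns {
    private final Set<String> exact = new HashSet<>();
    private final Set<String> prefixes = new HashSet<>();

    // Registers a pattern; a trailing '*' marks it as a prefix wildcard.
    AttributePatterns add(String pattern) {
        String p = pattern.toLowerCase(Locale.ROOT);
        if (p.endsWith("*")) {
            prefixes.add(p.substring(0, p.length() - 1));
        } else {
            exact.add(p);
        }
        return this;
    }

    // True if the name is listed exactly or extends a registered prefix.
    boolean matches(String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        if (exact.contains(name)) return true;
        for (String prefix : prefixes) {
            // Require something after the prefix, so a bare "data-" does not pass.
            if (name.length() > prefix.length() && name.startsWith(prefix)) return true;
        }
        return false;
    }
}
```

With patterns `href`, `data-*`, and `aria-*` registered, names like `data-id` and `aria-label` would pass while `onclick` would not.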
Plan
Cleaner API:
Pass configured Cleaner directly to Parser (StreamParser)
Single-pass tokenization, tree-build, and sanitize
Support HTML and XML via Syntax
Unify support for cleaning both body fragments and full documents
Track parse errors and safelist violations together (isValid)
API to retrieve detailed cleaning errors
(Future) support for custom sanitization strategies
Safelist API:
New HTML5-friendly safelists (e.g., Relaxed5) with namespace support
Defined with Tag objects
Keep existing safelists unchanged evermore for compatibility
Allow wildcard attributes (data-*, aria-*)
(Future) update HTML5-based safelists to match spec changes
If folks have views on this, I'm very keen to hear them.