Cleaner & Safelist API revamp

I'm looking at revamping and modernizing the Cleaner and the Safelist APIs.

For the Cleaner, my thought is to allow a configured Cleaner to be supplied to a Parser, whose input will be any of the regular inputs (string, file, connection, InputStream, etc.). That way the parser (or StreamParser) can do the tokenization, tree building, and sanitization in one pass. We can also simplify the cleaning APIs and support both body fragment and full document cleaning.

It will also allow us to track both parse errors and safelist errors in a single pass, to inform the Cleaner's `isValid` method.

And by directly integrating via the Parser and Document.OutputSettings, we can support both HTML and XML (and configuration for case-preserving tags and attributes).

I would also like to add an API to retrieve the cleaner's error list: which tags and attributes were dropped, which attributes were added, etc.
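To make the error-list idea concrete, here is a minimal sketch of what such a report could look like. Everything in it (`CleanReport`, `Kind`, `Entry`, `record`) is a hypothetical name for illustration, not an existing or planned jsoup API. It also assumes that an added (enforced) attribute is informational and does not make the input invalid, which is a design choice that would need discussion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: CleanReport and its members are not jsoup classes.
final class CleanReport {
    // The kinds of events a single parse-and-clean pass might record.
    enum Kind { PARSE_ERROR, TAG_DROPPED, ATTRIBUTE_DROPPED, ATTRIBUTE_ADDED }

    static final class Entry {
        final Kind kind;
        final String tag;
        final String detail;

        Entry(Kind kind, String tag, String detail) {
            this.kind = kind;
            this.tag = tag;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return kind + " <" + tag + "> " + detail;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void record(Kind kind, String tag, String detail) {
        entries.add(new Entry(kind, tag, detail));
    }

    // Read-only view, so callers can inspect but not modify the report.
    List<Entry> entries() {
        return Collections.unmodifiableList(entries);
    }

    // Assumption: added attributes are informational; any other entry is a violation.
    boolean isValid() {
        for (Entry e : entries) {
            if (e.kind != Kind.ATTRIBUTE_ADDED) return false;
        }
        return true;
    }
}
```

A caller could then check `isValid()` for a quick yes/no answer and walk `entries()` when it needs to explain to a user why their input was changed.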

(As a future idea, it might be useful to allow the cleaner to be configured with a custom strategy function that can apply custom logic to decide what to pass, and could escape rather than discard undesired elements, etc.)
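As a rough illustration of the strategy idea, the sketch below models a strategy as a function from an element to one of three actions. The names (`StrategySketch`, `Action`, `CleanStrategy`, `apply`) are all hypothetical, and the helper only handles a single element whose text is assumed to be already escaped; a real implementation would work on parser tokens or nodes.

```java
// Hypothetical sketch only: none of these types exist in jsoup.
final class StrategySketch {
    enum Action { PASS, ESCAPE, DISCARD }

    // Decides what to do with an element, given whether the safelist allows it.
    interface CleanStrategy {
        Action decide(String tagName, boolean onSafelist);
    }

    // Applies the strategy to one element containing already-escaped text.
    static String apply(CleanStrategy strategy, String tagName, boolean onSafelist, String text) {
        String open = "<" + tagName + ">";
        String close = "</" + tagName + ">";
        switch (strategy.decide(tagName, onSafelist)) {
            case PASS:
                return open + text + close;
            case ESCAPE:
                // Keep the markup visible as text instead of dropping it.
                return escape(open) + text + escape(close);
            default:
                // DISCARD: drop the element and its content.
                return "";
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For example, a strategy of `(tag, safe) -> safe ? Action.PASS : tag.equals("script") ? Action.DISCARD : Action.ESCAPE` would pass safelisted elements, discard scripts entirely, and show any other unknown element as escaped text.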

For the Safelist, a long-standing request has been to supply canned lists of HTML5 elements. Now that jsoup has decent namespace support and HTML5 is actively using that for the `math` and `svg` namespaces, the Safelist will need to understand namespaces.

The API should take Tag objects (created with namespace and element names). I plan on revamping Tags / adding a TagList to allow defining custom properties, so that the Safelist and the Tag definitions can interface properly. I'll keep a simple string-based interface for times when that's not required.
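A minimal sketch of why the namespace matters: if the safelist is keyed on the (namespace, name) pair rather than the name alone, allowing the HTML `a` element does not implicitly allow an SVG `a`. `TagKey` and `NamespacedSafelist` are hypothetical stand-ins for illustration, not the revamped Tag or Safelist classes; the string-only overload shows how the simple interface could default to the HTML namespace.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch only: TagKey and NamespacedSafelist are not jsoup classes.
final class TagKey {
    final String namespace;
    final String name;

    TagKey(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TagKey)) return false;
        TagKey other = (TagKey) o;
        return namespace.equals(other.namespace) && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, name);
    }
}

final class NamespacedSafelist {
    static final String HTML = "http://www.w3.org/1999/xhtml";
    static final String SVG = "http://www.w3.org/2000/svg";

    private final Set<TagKey> allowed = new HashSet<>();

    // Simple string-based form: assumes the HTML namespace.
    NamespacedSafelist addTag(String name) {
        return addTag(HTML, name);
    }

    NamespacedSafelist addTag(String namespace, String name) {
        allowed.add(new TagKey(namespace, name));
        return this;
    }

    boolean isSafe(String namespace, String name) {
        return allowed.contains(new TagKey(namespace, name));
    }
}
```

With `new NamespacedSafelist().addTag("a").addTag(NamespacedSafelist.SVG, "circle")`, the HTML `a` and SVG `circle` are allowed, while an SVG `a` or an HTML `circle` is not.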

My plan is to add new lists suffixed with `5` like `Relaxed5` etc. The existing lists will stay as-is to ensure no surprises for anyone. The new lists, in the spirit of HTML5's ongoing improvement, will be open to new elements / removals over time.

This would also be a good opportunity to add support for attribute wildcards, allowing the Safelist to pass e.g. `data-*`, `aria-*`, etc.
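One possible reading of the wildcard syntax is a simple prefix match, where a pattern ending in `*` allows any attribute that starts with the text before it. The sketch below assumes that reading, plus case-insensitive matching; `AttributePatterns` is a hypothetical helper, not part of the Safelist API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch only: AttributePatterns is not a jsoup class.
final class AttributePatterns {
    private final List<String> exact = new ArrayList<>();
    private final List<String> prefixes = new ArrayList<>();

    // A pattern ending in '*' is treated as a prefix wildcard, e.g. "data-*".
    AttributePatterns add(String pattern) {
        String p = pattern.toLowerCase(Locale.ROOT);
        if (p.endsWith("*")) {
            prefixes.add(p.substring(0, p.length() - 1));
        } else {
            exact.add(p);
        }
        return this;
    }

    boolean allows(String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        if (exact.contains(name)) return true;
        for (String prefix : prefixes) {
            // Require at least one character after the prefix, so a bare "data-" is rejected.
            if (name.length() > prefix.length() && name.startsWith(prefix)) return true;
        }
        return false;
    }
}
```

So after `add("href").add("data-*").add("aria-*")`, the attributes `href`, `data-id`, and `aria-label` pass, while `onclick` and a bare `data-` do not.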

## Plan

#### Cleaner API:
- Pass configured Cleaner directly to Parser (StreamParser)
- Single-pass tokenization, tree-build, and sanitize
- Support HTML and XML via Syntax
- Unify support for cleaning both body fragments and full documents
- Track parse errors and safelist violations together (isValid)
- API to retrieve detailed cleaning errors
- (Future) support for custom sanitization strategies

#### Safelist API:
- New HTML5-friendly safelists (e.g., `Relaxed5`) with namespace support
- Defined with Tag objects
- Keep existing safelists unchanged evermore for compatibility
- Allow wildcard attributes (`data-*`, `aria-*`)
- (Future) update HTML5-based safelists to match spec changes

---

If folks have views on this, I'm very keen to hear them.
