Address HTML entity mbstring deprecation #4638

Closed
ssddanbrown opened this issue Oct 31, 2023 · 1 comment · Fixed by #4673

Comments

ssddanbrown commented Oct 31, 2023

Found during testing for PHP 8.3 (but relevant since PHP 8.2).
Deprecation notice:

mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead

This applies to a couple of places where we use this kind of pattern:

        $doc = new DOMDocument();
        $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

This converts UTF-8 characters to HTML entities first, so they are retained no matter the output encoding. PHP won't load the content as UTF-8 unless a specific encoding is set within the loaded content. Setting the encoding on the DOMDocument itself before loading won't help, as it's overridden by the ->loadHTML() call.
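One replacement named in the deprecation notice is mb_encode_numericentity. A minimal sketch of swapping it in for the pattern above follows; the function name is hypothetical and not taken from the BookStack codebase, and the conversion map is an assumption that covers every code point from U+0080 upward.

```php
<?php
// Hypothetical helper; the name is an assumption, not BookStack code.
function loadHtmlWithNumericEntities(string $html): DOMDocument
{
    // Encode every code point from U+0080 upward as a decimal numeric entity.
    // Map format: [start, end, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}
```

Unlike the 'HTML-ENTITIES' conversion, this only ever emits numeric entities (never named ones such as &eacute;), which the HTML parser should decode to the same characters.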

Just prepending '<?xml encoding="UTF-8">' can somewhat mangle the output depending on the input (partial fragment or full HTML document).
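For partial content, another way to set an encoding within the loaded content is to wrap the fragment in a small document that declares its charset in a meta tag. This is a different hint from the XML prefix mentioned above, shown only as a sketch; the function name is hypothetical, and it assumes the input is a body-level fragment rather than a complete page.

```php
<?php
// Hypothetical helper; the name is an assumption, not BookStack code.
// Assumes $partialHtml is body-level content, not a complete HTML page.
function loadPartialHtmlAsUtf8(string $partialHtml): DOMDocument
{
    // Declare the charset inside the content so the parser reads it as UTF-8.
    $wrapped = '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $partialHtml . '</body></html>';

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}
```

The trade-off is that the wrapper's head and meta elements become part of the loaded document, so callers would need to read back only the body's children rather than the whole document.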

Need to standardise proper handling depending on what's expected as input/output.
This may be a good point to fully align (and centralise?) general high-level HTML handling, so deferring to a full release.

@ssddanbrown

DOMDocument Usages

This is a little audit of DOMDocument usages within the app.

app/Entities/Tools/PageContent.php

  • Loads text as UTF-8 via new method already.
  • Loads text as partial HTML (page content).
  • Used for:
    • Searching for and extracting base64 images
    • Finding and extracting specific page sections (includes)
    • Formatting upon page save (Apply IDs and Fix links)
    • Loading page navigation
  • Returns:
    • Specific text/attribute values
    • All inner nodes loaded (in body)
    • Specific inner sections

app/Entities/Tools/ExportFormatter.php

  • mb_convert_encoding used.
  • Loads text as entire page HTML
  • Used for:
    • Opening details blocks (applying the open attribute)
    • Replacing iframe elements with links
  • Returns:
    • All nodes

app/References/ReferenceUpdater.php

  • mb_convert_encoding used.
  • Loads partial HTML (page content within body)
  • Used for:
    • Updating link attributes
  • Returns:
    • All inner nodes loaded (in body)

app/References/CrossLinkParser.php

  • Loads text as UTF-8 via new method already.
  • Loads partial HTML (page content within body)
  • Used for:
    • Getting links within content
  • Returns:
    • Array of link attribute values

app/Search/SearchIndex.php

  • Loads text as UTF-8 via new method already.
  • Loads partial HTML (page content within body)
  • Used for:
    • Parsing all text from content
  • Returns:
    • Map of numbers by text term

app/Util/HtmlNonceApplicator.php

  • Loads text as UTF-8 via new method already.
  • Loads partial HTML
  • Used for:
    • Applying attributes to elements
  • Returns:
    • All inner nodes loaded (in body)

app/Util/HtmlContentFilter.php

  • Loads text as UTF-8 via new method already.
  • Loads partial HTML
  • Used for:
    • Removing nodes and attributes
  • Returns:
    • All inner nodes loaded (in body)
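The audit above suggests two load modes (partial content and complete pages) and a small set of return shapes (queried values, all body children, the whole document). A thin wrapper covering those could be sketched as below. The class and method names are hypothetical and not taken from the BookStack codebase, and the numeric-entity encoding step is just one assumed way to keep UTF-8 intact.

```php
<?php
// Hypothetical wrapper; class and method names are assumptions for illustration.
final class HtmlDocumentSketch
{
    private DOMDocument $document;

    // Note: DOMDocument::loadHTML() rejects an empty string, so complete-page
    // input ($partial = false) must be non-empty.
    public function __construct(string $html, bool $partial = true)
    {
        // Numeric entities keep non-ASCII characters intact regardless of
        // the encoding the parser assumes for the input.
        $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
        if ($partial) {
            // Partial page content is wrapped so it always lands inside <body>.
            $encoded = '<body>' . $encoded . '</body>';
        }

        $this->document = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $this->document->loadHTML($encoded);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    /** Run an XPath query against the loaded document. */
    public function query(string $expression): DOMNodeList
    {
        $result = (new DOMXPath($this->document))->query($expression);
        if ($result === false) {
            throw new InvalidArgumentException('Invalid XPath expression: ' . $expression);
        }
        return $result;
    }

    /** Serialise all child nodes of <body> ("all inner nodes" in the audit). */
    public function getBodyInnerHtml(): string
    {
        $html = '';
        foreach ($this->query('//body/node()') as $child) {
            $html .= $this->document->saveHTML($child);
        }
        return $html;
    }

    /** Serialise the whole document ("all nodes" in the audit). */
    public function getHtml(): string
    {
        return $this->document->saveHTML();
    }
}
```

With something like this, a link-updating caller would construct the wrapper from page content, change attributes on the nodes returned by query('//a'), and read the result back via getBodyInnerHtml().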

ssddanbrown added a commit that referenced this issue Nov 14, 2023
Adds a thin wrapper for DOMDocument to simplify and align usage within
all areas of BookStack.
Also means we move away from old deprecated mb_convert_encoding usage.

Closes #4638