Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Html Reader Non-UTF8 Charsets #4019

Merged
merged 6 commits into from
May 10, 2024
Merged

Html Reader Non-UTF8 Charsets #4019

merged 6 commits into from
May 10, 2024

Conversation

oleibman
Copy link
Collaborator

@oleibman oleibman commented May 6, 2024

Fix #3995. Fix #866. Fix #1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.

This will allow us to use:

  • UTF-8 files or snippets without BOM, with or without charset
  • UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
  • UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
  • all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.

We cannot use:

  • UTF-16BE or UTF-16LE declared in a meta tag
  • UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
  • unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?)

I will note that the way I detect the charset attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the encoding attribute, and, if it's good enough there, it ought to be good enough here.

This is:

  • a bugfix
  • a new feature
  • refactoring
  • additional unit tests

Checklist:

  • Changes are covered by unit tests
    • Changes are covered by existing unit tests
    • New unit tests have been added
  • Code style is respected
  • Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
  • CHANGELOG.md contains a short summary of the change and a link to the pull request if applicable
  • Documentation is updated as necessary

Why this change is needed?

Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.

Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.

This will allow us to use:
- UTF-8 files or snippets without BOM, with or without charset
- UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
- UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
- all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.

We cannot use:
- UTF-16BE or UTF-16LE declared in a meta tag
- UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
- unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?)

I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
@oleibman oleibman mentioned this pull request May 6, 2024
11 tasks
@oleibman
Copy link
Collaborator Author

oleibman commented May 7, 2024

Scrutinizer new issues are both false positives, and are now suppressed.

@oleibman oleibman added this pull request to the merge queue May 10, 2024
Merged via the queue into PHPOffice:master with commit 791c8cf May 10, 2024
13 of 14 checks passed
@oleibman oleibman deleted the issue3995 branch May 10, 2024 06:34
oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request May 13, 2024
Continuing work started with PR PHPOffice#4019. Improve documentation within program by making explicit what types of values are allowed for variables described as "mixed". In order to avoid broken functionality, this is done mainly through doc-blocks. This will get us closer to Phpstan Level 9, but many changes will be needed before we can consider that.

This change has more executable code changes than its predecessor. I will wait longer than normal before merging it to allow for additional testing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
1 participant