Web scraping support encodings such as EUC-JP #6112

Merged 2 commits on Feb 18, 2024

Alkarex (Member) commented Feb 18, 2024

fix #6106

// Save encoding information as XML declaration
return '<' . '?xml version="1.0" encoding="' . $httpCharsetNormalized . '" ?' . ">\n" . $html;
}
// Give up
Member

Glancing at this, I was wondering whether it might make more sense to give up with the charset from the <meta> element instead?

Alkarex (Member, Author) commented Feb 18, 2024

DOMDocument::loadHTML() already sniffs the encoding from <meta charset>. That is why we have to strip this information after converting to UTF-8.
Furthermore, we cannot always write an <?xml ... encoding="..." ?> declaration, in particular for character encodings that are not supersets of ASCII, such as UTF-32.
So as far as this function goes, I do not believe we can do much more.
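
For readers following along, the convert-then-strip idea described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: the function name `sketchHtmlToUtf8` is hypothetical, the regular expression only handles the simple `<meta charset=...>` form, and it assumes the charset name is one that mbstring recognises.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not FreshRSS code): convert HTML bytes to UTF-8,
 * remove a stale <meta charset> and label the result with an XML declaration.
 */
function sketchHtmlToUtf8(string $html, string $charset): string {
	// Assumption: $charset is a name known to mbstring, e.g. 'EUC-JP'.
	$utf8 = mb_convert_encoding($html, 'UTF-8', $charset);
	// The old <meta charset> would now be wrong, and the HTML parser would
	// sniff it, so drop it (simple form only; other variants are ignored).
	$utf8 = preg_replace('/<meta\s+charset=["\']?[^"\'\s>\/]+["\']?\s*\/?>/i', '', $utf8) ?? $utf8;
	// Prepending an ASCII declaration is only valid because UTF-8 is a
	// superset of ASCII; the same bytes in front of UTF-32 text would not be.
	return '<' . '?xml version="1.0" encoding="UTF-8" ?' . ">\n" . $utf8;
}
```

The result could then be passed to DOMDocument::loadHTML(). Converting first sidesteps the UTF-32 problem mentioned above, since the declaration is only ever written in front of UTF-8 text.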
