Web scraping support encodings such as EUC-JP #6112

Alkarex · 2024-02-18T00:15:49Z

Frenzie · 2024-02-18T08:56:58Z

lib/lib_rss.php

+		// Save encoding information as XML declaration
+		return '<' . '?xml version="1.0" encoding="' . $httpCharsetNormalized . '" ?' . ">\n" . $html;
+	}
+	// Give up


Glancing at this, I was wondering whether it might make more sense to give up with the charset from the <meta> element?

DOMDocument::loadHTML() already sniffs the encoding from <meta charset>, which is why we have to strip that information after conversion to UTF-8.
Furthermore, we cannot always write an <?xml encoding declaration, in particular for character encodings that are not supersets of ASCII, such as UTF-32.
So as far as this function goes, I do not believe we can do much more.
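
To make the constraint above concrete, here is a minimal sketch of the idea, not the actual FreshRSS code. The function name `prependEncodingDeclaration`, the header-parsing regex, and the list of excluded encodings are all assumptions made for illustration; only the declaration string itself mirrors the diff excerpt.

```php
<?php
/**
 * Hypothetical helper (not the real lib_rss.php function): save the charset
 * announced over HTTP as an XML declaration in front of the HTML, and give up
 * when that is not possible.
 */
function prependEncodingDeclaration(string $html, string $contentType): string {
	// Extract the charset parameter from a Content-Type header value
	if (!preg_match('/\bcharset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $matches)) {
		return $html;	// No charset announced: leave the document untouched
	}
	$charset = strtoupper($matches[1]);

	// Assumption: illustrative, incomplete list of encodings that are not supersets of ASCII
	$notAsciiSupersets = ['UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-32', 'UTF-32LE', 'UTF-32BE'];
	if (in_array($charset, $notAsciiSupersets, true)) {
		// Give up: a declaration written as ASCII bytes would not match the rest of the document
		return $html;
	}

	// Same declaration shape as in the diff excerpt
	return '<' . '?xml version="1.0" encoding="' . $charset . '" ?' . ">\n" . $html;
}
```

With a header such as `text/html; charset=EUC-JP`, the sketch returns the page prefixed by the declaration; with `charset=UTF-32` or no charset at all, it returns the page unchanged, which corresponds to the "give up" branch discussed in the review.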

Web scraping support encodings such as EUC-JP

b90e730

fix FreshRSS#6106

Alkarex added this to the 1.24.0 milestone Feb 18, 2024

Alkarex mentioned this pull request Feb 18, 2024

[BUG] HTTP+XPATH scraping retrieves garbled feeds from non UTF-8 contents #6106

Closed

Typo

8568276

Frenzie reviewed Feb 18, 2024

Alkarex merged commit 7d6a64a into FreshRSS:edge Feb 18, 2024
2 checks passed

Alkarex deleted the EUC-JP branch February 18, 2024 09:53

Alkarex mentioned this pull request Feb 28, 2024

Invalid char in most of articles [FULL CONTENT via CSS selector] (SimplePie problem converting to UTF-8) #6133

Closed

Alkarex linked an issue Feb 28, 2024 that may be closed by this pull request

Invalid char in most of articles [FULL CONTENT via CSS selector] (SimplePie problem converting to UTF-8) #6133

Closed
