-
-
Notifications
You must be signed in to change notification settings - Fork 270
fix: prefer HTTP header charset and fallback to UTF-8 when missing #1196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
What Content-Type do those problematic feeds return? I think the parser should prefer the charset provided by the feed (via the HTTP Content-Type header) and fall back to UTF-8 only when it’s missing. This PR forces UTF-8 for all cases, which may break feeds that legitimately specify a different encoding. |
|
Thanks for the clarification @Ashinch ! Here's what I found while debugging the problematic feeds. Some feeds (for example: with no According to RFC 3023 rules,
The feed itself is UTF-8, so decoding it as ASCII causes the broken characters. Proposed solutionAs you mentioned earlier, the parser should prefer the charset from the HTTP Content-Type header and fall back to UTF-8 only when missing, the safest approach is:
val httpContentType =
contentType?.let {
if (it.contains("charset=", ignoreCase = true)) it
else "$it; charset=UTF-8"
} ?: "text/xml; charset=UTF-8"
This keeps ROME’s charset detection intact, avoids forcing UTF-8 in cases where a different charset is explicitly declared, and fixes feeds that omit the charset but are actually UTF-8. Happy to adjust further if needed! |
Thanks for following up! This looks good, and we can go ahead with it. It might not cover every encoding issue, but it should work fine for most feeds. |
|
Feel free to @Ashinch when it's ready. |
|
Hey @Ashinch |
|
Btw, the CI builds are consistently failing due to some kind of OOM error, which doesn't seem to be related to the changes in the PRs:
It may be the GitHub Actions runner running out of RAM during Compose/Kotlin compilation. GitHub has recently updated repository cache limits, which might affect build caching or memory usage. This could be worth checking: Maybe you could review the CI configuration to see if anything needs adjustment. |
Fixes #1193
This fixes incorrectly decoded characters in some feeds that do not report a correct charset in their Content-Type header.
Rome was guessing the charset and decoding wrongly for certain sources
(e.g. feeds containing Chinese characters).
The fix is to force UTF-8 in XmlReader, which resolves the issue across feeds.
Tested with multiple feeds, including the problematic one.
Before / After