fix: prefer HTTP header charset and fallback to UTF-8 when missing #1196

blizzardOfAce · 2025-11-19T14:42:47Z

This fixes incorrectly decoded characters in some feeds that do not report a correct charset in their Content-Type header.

ROME was guessing the charset and decoding the content incorrectly for certain sources
(e.g. feeds containing Chinese characters).

The fix is to force UTF-8 in XmlReader, which resolves the issue across feeds.

Tested with multiple feeds, including the problematic one.

Before / After (screenshots not included)

Ashinch · 2025-11-20T02:09:41Z

What Content-Type do those problematic feeds return?

I think the parser should prefer the charset provided by the feed (via the HTTP Content-Type header) and fall back to UTF-8 only when it’s missing.

This PR forces UTF-8 for all cases, which may break feeds that legitimately specify a different encoding.

blizzardOfAce · 2025-11-20T11:40:54Z

Thanks for the clarification, @Ashinch! Here's what I found while debugging the problematic feeds.

Some feeds (for example: rustcc.cn/rss) return a Content-Type header of only:

text/xml

with no charset parameter.

According to RFC 3023, text/xml without a charset parameter defaults to US-ASCII, which cannot represent Chinese characters. When this header is passed into XmlReader(InputStream, httpContentType), ROME applies its default lenient charset detection:

Check BOM → none
Check Content-Type charset → missing
Check XML prolog encoding → missing
Fall back to MIME-type default → US-ASCII
UTF-8 fallback is not triggered because detection is still considered “valid”

The feed itself is UTF-8, so decoding it as ASCII causes the broken characters.
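
The effect of that mismatch can be reproduced without ROME. The following is only a minimal sketch of the decoding step, not of ROME's detection logic; `decodeAs` and the sample title are hypothetical and chosen for illustration.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper for illustration: decode raw bytes with an explicit charset.
fun decodeAs(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

fun main() {
    val title = "中文标题"                       // sample text, not taken from the feed
    val body = title.toByteArray(Charsets.UTF_8) // what a UTF-8 feed actually sends

    // US-ASCII has no mapping for bytes >= 0x80, so the text does not survive.
    println(decodeAs(body, Charsets.US_ASCII) == title) // false
    // Decoding with the charset the feed really uses round-trips cleanly.
    println(decodeAs(body, Charsets.UTF_8) == title)    // true
}
```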

Proposed solution

As you mentioned earlier, the parser should prefer the charset from the HTTP Content-Type header and fall back to UTF-8 only when it is missing. The safest approach is:

If the server includes a charset → use it
If it does not → append charset=UTF-8, since most modern feeds use UTF-8 even when they fail to declare it

    val httpContentType =
        contentType?.let {
            if (it.contains("charset=", ignoreCase = true)) it
            else "$it; charset=UTF-8"
        } ?: "text/xml; charset=UTF-8"

This keeps ROME’s charset detection intact, avoids forcing UTF-8 in cases where a different charset is explicitly declared, and fixes feeds that omit the charset but are actually UTF-8.
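
To make the three cases concrete, the expression can be wrapped in a small function and run on sample header values. This is a sketch for illustration only: the name `withCharsetFallback` and the sample headers are assumptions, not part of the PR.

```kotlin
// Hypothetical wrapper around the expression from the PR; the function name is an assumption.
fun withCharsetFallback(contentType: String?): String =
    contentType?.let {
        // Keep the header untouched when the server already declares a charset.
        if (it.contains("charset=", ignoreCase = true)) it
        // Otherwise append UTF-8 so the MIME-type default is never used.
        else "$it; charset=UTF-8"
    } ?: "text/xml; charset=UTF-8" // no Content-Type header at all

fun main() {
    println(withCharsetFallback("text/xml"))                         // text/xml; charset=UTF-8
    println(withCharsetFallback("application/rss+xml; charset=GBK")) // application/rss+xml; charset=GBK
    println(withCharsetFallback(null))                               // text/xml; charset=UTF-8
}
```

The returned string is what would then be passed as the httpContentType argument of XmlReader(InputStream, httpContentType).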

Happy to adjust further if needed!

Ashinch · 2025-11-23T03:45:00Z

@blizzardOfAce

Thanks for following up! This looks good, and we can go ahead with it. It might not cover every encoding issue, but it should work fine for most feeds.

Ashinch · 2025-11-26T03:19:56Z

Feel free to ping @Ashinch when it's ready.

blizzardOfAce · 2025-11-26T09:48:58Z

Hey @Ashinch
This should be ready for review.

blizzardOfAce · 2025-11-26T10:00:38Z

By the way, the CI builds are consistently failing with an out-of-memory (OOM) error that doesn't seem to be related to the changes in the PRs:

Not enough memory to run compilation.
Try to increase kotlin.daemon.jvmargs=-Xmx<size>

It may be the GitHub Actions runner running out of RAM during Compose/Kotlin compilation.

GitHub has recently updated repository cache limits, which might affect build caching or memory usage. This could be worth checking:
https://github.blog/changelog/2025-11-20-github-actions-cache-size-can-now-exceed-10-gb-per-repository/

Maybe you could review the CI configuration to see if anything needs adjustment.

fix: feed character decoding issue

aab1b49

fix: prefer HTTP header charset, fallback to UTF-8 for missing charset

d2f4f1f

blizzardOfAce changed the title from ~~fix: wrong character decoding in feed by enforcing UTF-8~~ to fix: prefer HTTP header charset and fallback to UTF-8 when missing on Nov 23, 2025

Ashinch approved these changes Nov 27, 2025

Ashinch merged commit 2afcc20 into ReadYouApp:main Nov 27, 2025
1 of 2 checks passed
