
Conversation

@blizzardOfAce
Contributor

Fixes #1193

This fixes incorrectly decoded characters in some feeds that do not report a correct charset in their Content-Type header.

ROME was guessing the charset and decoding certain sources incorrectly
(e.g. feeds containing Chinese characters).

The fix is to force UTF-8 in XmlReader, which resolves the issue for the affected feeds.

Tested with multiple feeds, including the problematic one.

Before / After

[Screenshots: Screenshot_1763562266 (before) and Screenshot_1763561008 (after)]

@Ashinch
Member

Ashinch commented Nov 20, 2025

What Content-Type do those problematic feeds return?

I think the parser should prefer the charset provided by the feed (via the HTTP Content-Type header) and fall back to UTF-8 only when it’s missing.

This PR forces UTF-8 for all cases, which may break feeds that legitimately specify a different encoding.

@blizzardOfAce
Contributor Author

Thanks for the clarification @Ashinch! Here's what I found while debugging the problematic feeds.

Some feeds (for example: rustcc.cn/rss) return a Content-Type header of only:

text/xml

with no charset parameter.

According to RFC 3023, text/xml without a charset parameter defaults to US-ASCII, which cannot represent Chinese characters. When this header is passed into XmlReader(InputStream, httpContentType), ROME follows its default lenient behavior for XML charset encoding detection:

  1. Check BOM → none
  2. Check Content-Type charset → missing
  3. Check XML prolog encoding → missing
  4. Fall back to MIME-type default → US-ASCII
  5. UTF-8 fallback is not triggered because detection is still considered “valid”

The feed itself is UTF-8, so decoding it as ASCII causes the broken characters.
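
For reference, a minimal sketch of the call path described above, assuming the feed is parsed with ROME's SyndFeedInput and the XmlReader(InputStream, httpContentType) constructor; the function and parameter names are illustrative, not ReadYou's actual call site:

    import com.rometools.rome.feed.synd.SyndFeed
    import com.rometools.rome.io.SyndFeedInput
    import com.rometools.rome.io.XmlReader
    import java.io.InputStream

    // Passing the raw header ("text/xml") straight through: with no BOM and no
    // encoding in the XML prolog, XmlReader settles on the MIME default
    // (US-ASCII) instead of UTF-8, producing the broken characters above.
    fun parseFeed(body: InputStream, contentTypeHeader: String): SyndFeed =
        SyndFeedInput().build(XmlReader(body, contentTypeHeader))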

Proposed solution

As you mentioned earlier, the parser should prefer the charset from the HTTP Content-Type header and fall back to UTF-8 only when it's missing. The safest approach is:

  • If the server includes a charset → use it
  • If it does not → append charset=UTF-8, since most modern feeds use UTF-8 even when they fail to declare it
    val httpContentType = contentType?.let {
        if (it.contains("charset=", ignoreCase = true)) it
        else "$it; charset=UTF-8"
    } ?: "text/xml; charset=UTF-8"

This keeps ROME’s charset detection intact, avoids forcing UTF-8 in cases where a different charset is explicitly declared, and fixes feeds that omit the charset but are actually UTF-8.
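
To make the behavior concrete, here is the same normalization pulled out into a standalone helper with a few example inputs; the helper name is hypothetical and only for illustration:

    // Hypothetical helper wrapping the normalization above (illustration only).
    fun normalizeContentType(contentType: String?): String =
        contentType?.let {
            if (it.contains("charset=", ignoreCase = true)) it
            else "$it; charset=UTF-8"
        } ?: "text/xml; charset=UTF-8"

    fun main() {
        println(normalizeContentType("text/xml"))                      // text/xml; charset=UTF-8
        println(normalizeContentType("application/xml; charset=GBK"))  // kept as-is
        println(normalizeContentType(null))                            // text/xml; charset=UTF-8
    }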

Happy to adjust further if needed!

@Ashinch
Member

Ashinch commented Nov 23, 2025


@blizzardOfAce

Thanks for following up! This looks good, and we can go ahead with it. It might not cover every encoding issue, but it should work fine for most feeds.

blizzardOfAce changed the title from "fix: wrong character decoding in feed by enforcing UTF-8" to "fix: prefer HTTP header charset and fallback to UTF-8 when missing" on Nov 23, 2025
@Ashinch
Member

Ashinch commented Nov 26, 2025

Feel free to @Ashinch when it's ready.

@blizzardOfAce
Contributor Author

Hey @Ashinch
This should be ready for review.

@blizzardOfAce
Contributor Author

Btw, the CI builds are consistently failing with an OOM error, which doesn't seem to be related to the changes in this PR:

Not enough memory to run compilation.
Try to increase kotlin.daemon.jvmargs=-Xmx<size>

It may be the GitHub Actions runner running out of RAM during Compose/Kotlin compilation.

GitHub has recently updated repository cache limits, which might affect build caching or memory usage. This could be worth checking:
https://github.blog/changelog/2025-11-20-github-actions-cache-size-can-now-exceed-10-gb-per-repository/

Maybe you could review the CI configuration to see if anything needs adjustment.
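
In case it helps, a hedged sketch of what raising the daemon heap could look like in gradle.properties; the values are guesses, not something I've verified against this project's CI:

    # gradle.properties (illustrative values only)
    org.gradle.jvmargs=-Xmx4g -XX:MaxMetaspaceSize=1g
    kotlin.daemon.jvmargs=-Xmx4g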

Ashinch merged commit 2afcc20 into ReadYouApp:main on Nov 27, 2025
1 of 2 checks passed