Use UTF-8 for input encoding of seeds (FileSpout) #542

Merged
merged 1 commit into from Mar 6, 2018

sebastian-nagel commented Mar 6, 2018

Also improves logging of IDN URLs which fail to be converted to ASCII/Punycode.

If the system locale / default encoding is not UTF-8, non-ASCII seeds may fail to get injected. I've seen this while testing in a Docker container: the seed https://арктик-тв.рф/rss/arctic-tv.xml caused the error:

2018-03-06 11:09:11.273 c.d.s.f.URLFilters Thread-14-filter-executor[5 5] [ERROR] URL filtering threw exception
java.lang.IllegalArgumentException: java.text.ParseException: A prohibited code point was found in the input????????????-????
        at java.net.IDN.toASCIIInternal(IDN.java:274) ~[?:1.8.0_151]
        at java.net.IDN.toASCII(IDN.java:122) ~[?:1.8.0_151]
        at java.net.IDN.toASCII(IDN.java:151) ~[?:1.8.0_151]
        at com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer.filter(BasicURLNormalizer.java:140) ~[stormjar.jar:?]
        at com.digitalpebble.stormcrawler.filtering.URLFilters.filter(URLFilters.java:102) [stormjar.jar:?]
        at com.digitalpebble.stormcrawler.bolt.URLFilterBolt.execute(URLFilterBolt.java:53) [stormjar.jar:?]
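
To illustrate the encoding problem behind this error, here is a minimal sketch that reads seed lines from a stream with an explicitly chosen charset. It is not the FileSpout code: the class `SeedEncodingExample` and the method `readSeeds` are hypothetical names used only for this illustration, and US-ASCII stands in for a non-UTF-8 default encoding such as the one in the Docker container.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of StormCrawler.
final class SeedEncodingExample {

    // Reads non-empty, trimmed lines using the given charset
    // instead of relying on the platform default encoding.
    static List<String> readSeeds(InputStream in, Charset charset) throws IOException {
        List<String> seeds = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    seeds.add(line);
                }
            }
        }
        return seeds;
    }

    public static void main(String[] args) throws IOException {
        // The seed from the report, written with Unicode escapes so that this
        // source file compiles regardless of the compiler's source encoding.
        String seed = "https://\u0430\u0440\u043a\u0442\u0438\u043a-\u0442\u0432.\u0440\u0444/rss/arctic-tv.xml";
        byte[] fileContent = (seed + "\n").getBytes(StandardCharsets.UTF_8);

        // Decoding as UTF-8 recovers the original seed.
        List<String> good = readSeeds(new ByteArrayInputStream(fileContent), StandardCharsets.UTF_8);
        System.out.println(good.get(0).equals(seed));

        // Decoding with a charset that cannot represent the bytes (US-ASCII is
        // an assumed stand-in) yields replacement characters in the host name.
        List<String> bad = readSeeds(new ByteArrayInputStream(fileContent), StandardCharsets.US_ASCII);
        System.out.println(bad.get(0).contains("\uFFFD"));
    }
}
```

Passing the charset explicitly makes the result independent of the locale of the machine or container the topology runs in.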

The logged part of the URL, ????????????-????, is not really helpful. The improved log message adds more context from the URL:

2018-03-06 10:59:55.551 c.d.s.f.b.BasicURLNormalizer Thread-14-filter-executor[5 5] [ERROR] Failed to convert IDN host ????????????-????.???? in https://????????????-????.????/rss/arctic-tv.xml
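
The idea behind the improved message can be sketched as follows. This is an illustration under assumptions, not the actual BasicURLNormalizer code: `IdnHostExample` and `toAsciiHost` are hypothetical names, and `System.err` stands in for the real logger. Only `java.net.IDN.toASCII` and `java.net.URL` are taken from the JDK.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only, not part of StormCrawler.
final class IdnHostExample {

    // Returns the ASCII/Punycode form of the URL's host,
    // or null if the URL is malformed or the host cannot be converted.
    static String toAsciiHost(String urlString) {
        URL url;
        try {
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            System.err.println("Malformed URL " + urlString);
            return null;
        }
        String host = url.getHost();
        try {
            return IDN.toASCII(host);
        } catch (IllegalArgumentException e) {
            // Report both the host and the full URL so the failing seed
            // can be identified even if the host itself is garbled.
            System.err.println("Failed to convert IDN host " + host + " in " + urlString);
            return null;
        }
    }

    public static void main(String[] args) {
        // Plain ASCII hosts pass through unchanged.
        System.out.println(toAsciiHost("https://example.com/index.html"));
        // A host made of replacement characters, as produced by decoding with
        // the wrong charset, is rejected and reported with its URL.
        System.out.println(toAsciiHost("https://\uFFFD\uFFFD-\uFFFD.\uFFFD\uFFFD/rss/arctic-tv.xml"));
    }
}
```

Even when the host is unreadable, the path portion of the URL (here /rss/arctic-tv.xml) usually survives a wrong decoding and helps locate the offending line in the seed file.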
Use UTF-8 for input encoding of seeds (FileSpout),
improve logging of IDN URLs which fail to be converted
to ASCII/Punycode

@jnioche jnioche added this to the 1.8 milestone Mar 6, 2018

@jnioche jnioche added the core label Mar 6, 2018

@jnioche jnioche merged commit f00501c into DigitalPebble:master Mar 6, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@sebastian-nagel sebastian-nagel deleted the sebastian-nagel:sc-seed-encoding-utf8 branch Mar 6, 2018
