Mishandling of UTF-8 in redirect targets #199
That's the most concise use case I have seen so far. :-) It seems to occur only when fetching robots.txt. Will investigate.
After some research, I found that the problem lies with the server not properly encoding its redirect URLs. The best summary explanation I found is here: http://stackoverflow.com/a/7654605/3974380
After analyzing the "Location:" entries in the HTTP headers that come back, I can confirm the redirect URL is not encoded properly. You should contact the site owner about this. I am not sure how a workaround could be implemented, other than forcing the HTTP "Location" header to be read using a specific charset, or trying to auto-detect it. That could be a risky proposition, given that most sites probably respect the standard. In this specific case, I can read the URL properly if I force it to use ISO-8859-1 (UTF-8 does not work).
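The ISO-8859-1 trick can be illustrated with a minimal sketch. Because ISO-8859-1 maps every byte value to a character, a Location value whose raw UTF-8 bytes were mis-decoded as ISO-8859-1 can be recovered losslessly by reversing that step. The class and method names below are hypothetical, not part of the collector:

```java
import java.nio.charset.StandardCharsets;

public class RedirectCharsetFix {
    // Hypothetical helper: reverse an ISO-8859-1 mis-decode of UTF-8 bytes.
    // ISO-8859-1 is byte-transparent, so the original bytes survive intact.
    static String recoverUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "гідравліка";
        // Simulate a client that read the UTF-8 header bytes as ISO-8859-1:
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(recoverUtf8(garbled).equals(original)); // prints: true
    }
}
```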
Thanks for the investigation; very interesting. I agree that non-ASCII characters shouldn't be present in HTTP headers (unless they are properly encoded / escaped). In practice, however, the standard unfortunately seems to be violated quite frequently – as is so often the case on the web. The site referenced here is obviously a major case in point, but this Google search suggests that globally this isn't as rare a problem as one might hope. I also unearthed many bug reports for both server- and client-side software components lamenting the seeming mishandling of non-ASCII redirects, further confirming that the problem is encountered somewhat frequently.

For compatibility reasons, browsers seem to be more relaxed than the RFCs would demand. At least Firefox and Chrome seem to follow the redirect "correctly" (meaning: as the site author intended). E.g. if I go to http://www.mascus.com/agriculture/used-other-tractor-accessories/other/5pen7jcp.html in either browser, I get redirected to http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html even though the redirect target is not properly encoded. For Firefox, https://bugzilla.mozilla.org/show_bug.cgi?id=1142083 details the fix, while https://bugzilla.mozilla.org/show_bug.cgi?id=439616 gives some more information about the use case.

Generally speaking, I would prefer for a crawler to behave as similarly to real-world browsers as possible, because site authors generally target the latter and not the former. If I can access a site with my web browser, I would expect the crawler to be able to access that same page (and parse it in the same manner). At the same time, however, development resources here are of course much more limited than for the major browsers, so we cannot come up with an implementation that will work as "expected" in all cases. From a philosophical standpoint as well, I would normally be opposed to programming special / edge cases into general-purpose software such as this crawler.
Nevertheless, choking on – what appears to be – a somewhat common encoding of redirects seems to be a not insignificant flaw. Thus, I would like to propose the following implementation which I think strikes a good balance between compatibility and complexity:
An interesting alternative to (1.ii.b) would be to fall back to a per-crawler default (if configured) instead. This is a feature that you have suggested in #194 (comment) and which I would find very useful. This logic could be applied to all HTTP headers, not just `Location`. While the logic sounds simple, I can't estimate the implementation effort, as I am not yet sufficiently familiar with the codebase. Please feel free to close as WONTFIX if it would be a major hassle.
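As a rough sketch of the kind of fallback chain proposed here (a hypothetical class, not the collector's actual implementation): attempt a strict UTF-8 decode of the raw header bytes first, and only fall back to a configured charset when that decode fails:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LocationDecoder {
    // Hypothetical sketch of the proposed fallback logic: a strict UTF-8
    // decode (errors reported, not replaced) tells us reliably whether the
    // bytes are valid UTF-8; if not, use the configured fallback charset.
    static String decode(byte[] rawLocation, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawLocation))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(rawLocation, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1; malformed as UTF-8
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1)); // prints: é
    }
}
```

Note that ASCII-only values pass the UTF-8 check unchanged, so standards-compliant redirects are unaffected by the fallback.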
Thanks for your research and suggestions! I agree that standards are often not respected. What is important is that we cover the standards first, but let's not limit ourselves to that; let's try to support what's out in the real world. What you are proposing makes a lot of sense, and I now plan to implement that (or something very similar).
GenericRedirectURLProvider now better handles redirect character encoding and offers encoding options. GitHub #199.
I have added a new configuration option in the latest snapshot. There is now a new `<redirectURLProvider>` configuration:

```xml
<redirectURLProvider
    class="com.norconex.collector.http.redirect.impl.GenericRedirectURLProvider"
    fallbackCharset="ISO-8859-1" />
```
This is perfect! The latest snapshot follows all redirects "correctly" when an appropriate fallback charset is configured. Thanks a lot for your diligence on this.
Given a redirect from http://www.mascus.com/agriculture/used-other-tractor-accessories/%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0-%D0%B3%D1%96%D0%B4%D1%80%D0%B0%D0%B2%D0%BB%D1%96%D0%BA%D0%B0-%D1%81%D0%BF%D0%B5%D1%86%D1%82%D0%B5%D1%85%D0%BD%D1%96%D0%BA%D0%B0/5pen7jcp.html to http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html, the collector somehow chokes on the Cyrillic characters in the (new) target URL:
Redirect:
Test-Case Config
Result
Note that the crawler detects the redirect as http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гÑдÑавлÑка-ÑпеÑÑеÑнÑка/5pen7jcp.html when it should be http://www.mascus.com/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html, and then tries to access the robots.txt at http://www.mascus.comнÑка/5pen7jcp.html/robots.txt, which is an invalid hostname, resulting in an exception.
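Once the redirect target has been decoded correctly, percent-encoding the non-ASCII path segments keeps downstream URL handling (such as extracting the host for robots.txt) well-formed. A minimal sketch using `java.net.URI` (the `RedirectTarget` class and `normalize` method are hypothetical names, not collector APIs):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class RedirectTarget {
    // Hypothetical sketch: rebuilding the decoded redirect target through the
    // multi-argument URI constructor lets toASCIIString() percent-encode the
    // Cyrillic path segments, so host extraction stays correct afterwards.
    static String normalize(String host, String path) throws URISyntaxException {
        URI uri = new URI("http", host, path, null);
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String encoded = normalize("www.mascus.com",
                "/agriculture/used-other-tractor-accessories/other-гідравліка-спецтехніка/5pen7jcp.html");
        System.out.println(encoded); // path comes out percent-encoded (%D0%B3...)
        // Parsing the encoded form now yields the real hostname:
        System.out.println(URI.create(encoded).getHost()); // prints: www.mascus.com
    }
}
```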