
Increase detect.charset.maxlength default value #537

Closed
hatemalimam opened this Issue Mar 1, 2018 · 6 comments

@hatemalimam
Contributor

hatemalimam commented Mar 1, 2018

I had a case where the main seed was goal.com/ar, an Arabic website served as UTF-8. After some debugging I found that the class responsible for detecting the charset is CharsetIdentification, and the charset detected for that seed was wrong: it should be UTF-8, but the returned value was ISO-8859-1.

Basically, the server is not providing a Content-Type header (so httpCharset is null), and the htmlCharset is not being retrieved either (because the default detect.charset.maxlength value is too short), so hintCharset is always null.

What I found in many cases (not only for this particular seed) is that, for RTL websites, a maxlength of 2048 is too small;
in my case, increasing it to 10000, for example, fixed the issue.

Here's an example:
URL:
http://www.goal.com/ar/%D8%A3%D8%AE%D8%A8%D8%A7%D8%B1/1

Expected result:
أخبار كرة القدم، صفحة 1 من 780 | Goal.com

Actual result with a maxlength of 2048:
أخبار �رة ا��د�� ص�حة 1 �� 780 | Goal.com

Is there any particular reason why the default value is 2048?

detect.charset.maxlength: 2048
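The fallback chain described above can be sketched as follows. This is an illustrative outline only, not the actual CharsetIdentification code; the class and method names (`CharsetPrecedenceSketch`, `resolve`) are hypothetical.

```java
// Illustrative sketch of the charset fallback chain described in this issue;
// NOT the actual CharsetIdentification code. All names are hypothetical.
public class CharsetPrecedenceSketch {

    // detect.charset.maxlength default at the time of this issue
    static final int MAX_LENGTH = 2048;

    /**
     * httpCharset: from the Content-Type response header, or null.
     * htmlCharset: from a meta tag found within the first MAX_LENGTH bytes, or null.
     * guessed:     the statistical detector's output, used as a last resort.
     */
    static String resolve(String httpCharset, String htmlCharset, String guessed) {
        if (httpCharset != null) {
            return httpCharset;
        }
        if (htmlCharset != null) {
            return htmlCharset;
        }
        // Neither hint available: fall back to the (possibly wrong) guess.
        return guessed;
    }
}
```

For the goal.com/ar page there is no charset in the header and the `<meta>` tag sits past the 2048-byte window, so both hints are null and the wrong guess wins.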

 

@jnioche


Member

jnioche commented Mar 1, 2018

See #391 for why we needed to limit it. As pointed out in #438, 2048 could indeed be too low.
Maybe the inputFilter lets scripts through, which influences the detection; that could be improved as well.

We could also add a configuration option so that the value set in the HTML is used when it is valid and there is nothing from the HTTP headers.

Of course, we could simply go for a larger value like the one you suggested. Do you want to open a PR for this? Thanks.

@jnioche jnioche added the core label Mar 1, 2018

@jnioche jnioche added this to the 1.8 milestone Mar 1, 2018

@kkrugler


kkrugler commented Mar 1, 2018

Actually, I see two or three potentially separate issues here...

  1. http://goal.com/ar returns a 301 (permanent redirect), and the response headers have Content-Type: text/html; charset=iso-8859-1. I assume that this charset info is ignored, but it seems curious that it matches the incorrect charset being reported.
  2. http://www.goal.com/ar (the redirect) returns in the http response headers Content-Type: text/html; charset=UTF-8. So I would assume that should be used, unless the HTML has something that overrides it.
  3. The reason the <meta charset="utf-8" /> HTML element isn't found in the first 2K of text is because there's a large <script> element right before it. This illustrates the fundamental issue around trying to read an arbitrary amount of text to find HTML elements; you can always insert enough other text to push what you're looking for out of the buffer. I think the only viable solution is to add a "smart" HTML element processor, which incrementally reads until it hits the </head> tag (or some other element that can never appear in the head block, or some really big limit).
@hatemalimam


Contributor

hatemalimam commented Mar 1, 2018

@kkrugler Interesting.
A "smart" HTML processor is indeed a nice idea, but for the time being, increasing that 2K seems to me a necessity.

@jnioche

We could also add a configuration so that the value set in the HTML is used if it is valid and there is nothing from the HTTP headers and the value is valid.

You mean like skipping the guessing part, right before hintCharset?
We could define a new config, for example: detect.force.html.charset

If that is set to true, we would skip the getCharsetFromText part and return htmlCharset directly?

Could be an idea.
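A minimal sketch of that proposal, assuming the detect.force.html.charset flag suggested above. `guessFromText` stands in for the real getCharsetFromText, and the rest of the names are hypothetical.

```java
import java.nio.charset.Charset;

// Sketch of the proposed behaviour: if detect.force.html.charset is true and
// the HTML declares a valid charset, trust it and skip the statistical guess.
// Illustrative only; identify() is not the real CharsetIdentification API.
public class ForcedHtmlCharsetSketch {

    static String identify(String httpCharset, String htmlCharset,
                           boolean forceHtmlCharset, byte[] content) {
        if (httpCharset != null) {
            return httpCharset; // HTTP header always wins
        }
        if (forceHtmlCharset && htmlCharset != null
                && Charset.isSupported(htmlCharset)) {
            return htmlCharset; // skip the guessing step entirely
        }
        return guessFromText(content); // stand-in for getCharsetFromText
    }

    private static String guessFromText(byte[] content) {
        return "ISO-8859-1"; // placeholder for the detector's guess
    }
}
```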

@hatemalimam


Contributor

hatemalimam commented Mar 1, 2018

For now I'll open a PR to increase the default value to 10K.

hatemalimam added a commit to hatemalimam/storm-crawler that referenced this issue Mar 1, 2018

hatemalimam added a commit that referenced this issue Mar 1, 2018

Merge pull request #538 from hatemalimam/master
Increase detect.charset.maxlength default value #537
@kkrugler


kkrugler commented Mar 2, 2018

@hatemalimam - the main issue with increasing the size to 10K is that you'll still hit badly-formatted HTML that has the meta element after 10K of text.

One interesting reference: see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta for their notes on charset detection. They state that the <meta> element must come within the first 1K of text, and that the Content-Type header and any BOM override what's in the <meta> element.
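Since a BOM outranks the `<meta>` element in that precedence order, a detector could check for one first. A minimal sketch covering the UTF-8 and UTF-16 BOMs (`BomSniffer` is a hypothetical name, not part of the codebase):

```java
// Sketch: detect a byte-order mark, which per the MDN notes overrides any
// <meta> charset declaration. Hypothetical helper, UTF-8/UTF-16 only.
public class BomSniffer {

    /** Returns the charset implied by a leading BOM, or null if there is none. */
    static String charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: fall back to headers, then <meta>
    }
}
```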

@jnioche


Member

jnioche commented Mar 5, 2018

Closing this one for now; we can continue the discussion on the best approach in a separate issue, or better at crawler-commons: crawler-commons/crawler-commons#171

@jnioche jnioche closed this Mar 5, 2018
