Issue #1042: Adapt parsing of robots.txt files #1055
Conversation
+1 looks good!
Yes, it's an improvement for now - until there is a way to throttle or block new "fetch items" arriving (see discussion in #1042). Because the forbid-all rule is kept in the error cache, fetches for the given host are blocked for a reasonable time (error cache default config: expireAfterWrite = 1h).
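The error-cache behaviour described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical class and method names; the actual implementation uses a configurable cache (e.g. a Guava cache with expireAfterWrite = 1h by default) rather than this hand-rolled map.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an error cache with expire-after-write semantics,
// similar to what a cache configured with expireAfterWrite = 1h provides.
public class RobotsErrorCache {
    private static final Duration EXPIRE_AFTER_WRITE = Duration.ofHours(1);

    private record Entry(String rules, Instant writtenAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public void put(String hostKey, String rules) {
        cache.put(hostKey, new Entry(rules, Instant.now()));
    }

    /** Returns the cached rules, or null if absent or expired. */
    public String get(String hostKey) {
        Entry e = cache.get(hostKey);
        if (e == null) return null;
        if (Instant.now().isAfter(e.writtenAt().plus(EXPIRE_AFTER_WRITE))) {
            cache.remove(hostKey);
            return null;
        }
        return e.rules();
    }

    public static void main(String[] args) {
        RobotsErrorCache cache = new RobotsErrorCache();
        // While the forbid-all entry is cached, fetches for this host stay blocked
        cache.put("https://example.com:443", "FORBID_ALL");
        System.out.println(cache.get("https://example.com:443")); // prints FORBID_ALL
    }
}
```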
// Parsing found rules; by default, all robots are forbidden (RFC 9309)
robotRules = FORBID_ALL_RULES;
The default for all HTTP response codes not handled explicitly is now forbid-all instead of allow-all. Are there codes other than 200, 403, 5xx and redirects that need special treatment? Maybe not.
Maybe not.
Sorry, I was wrong. In case of an HTTP 404 (robots.txt not found), crawling is allowed.
When in doubt, maybe do not swap the defaults? "In the wild" the crawler may see any HTTP status code.
Oh yes, how could I forget about status code 404?! After 200 and redirects, it is probably the status the crawler encounters most often when fetching the robots.txt.
RFC 9309 says that for all status codes in the range 400-499 the robots.txt file may be considered "unavailable" and therefore allow-all. The same holds for 300-399 if the crawler has followed at least five consecutive redirects.
This implementation might be better:
// TODO: Follow up to five redirections (at the moment, it is only one)
if (200 <= code && code <= 299) {
String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
} else if (300 <= code && code <= 499) {
robotRules = EMPTY_RULES; // allow all
if (code == 403 && !allowForbidden) {
robotRules = FORBID_ALL_RULES; // forbid all
}
// E.g. Google handles "429 Too Many Requests" similarly to a server error
// https://support.google.com/webmasters/answer/9679690#robots_details
if (code == 429) {
robotRules = FORBID_ALL_RULES; // forbid all
}
} else if (500 <= code) {
cacheRule = false;
robotRules = FORBID_ALL_RULES; // forbid all
if (allow5xx) {
robotRules = EMPTY_RULES; // allow all
}
} else {
robotRules = EMPTY_RULES; // allow all
}
I'm not sure whether, e.g., if (200 <= code && code <= 299) or if (code == 200) is more correct?
After 200 and redirects, it is probably what the crawler encounters most often when fetching the robots.txt
Yes. In the long tail the crawler gets almost every possible HTTP status code, see the attached robotstxt-cc-main-2023-14.txt.
That's why I'd be careful about if (200 <= code && code <= 299) or else if (500 <= code).
In the RFC 9309, it is written that for all status codes in the range of 400-499 the robots file may be considered as "unavailable" and therefore allow-all.
Yes. The 403 (and 401) handling is defined in the "original" norobots RFC. So there are some differences - you mentioned the 429 handling by Google's crawler.
// TODO: Follow up to five redirections (at the moment, it is only one)
Good catch!
Maybe keep this in a separate issue / PR to limit the amount of changes and ease the review. A unit test would also be useful to ensure that no change accidentally breaks something.
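Such a unit test could be sketched, framework-free, as follows. The decide() helper is hypothetical and mirrors the branching proposed earlier in this thread (with allowForbidden = false and allow5xx = false baked in); a real test would exercise HttpRobotRulesParser directly.

```java
// Hypothetical, framework-free unit-test sketch for the status-code handling
// proposed above. Policy is a stand-in for the real RobotRules constants.
public class RobotsStatusCodeTest {
    enum Policy { PARSE, ALLOW_ALL, FORBID_ALL }

    // Mirrors the proposed branching (allowForbidden = false, allow5xx = false)
    static Policy decide(int code) {
        if (200 <= code && code <= 299) return Policy.PARSE;
        if (300 <= code && code <= 499) {
            if (code == 403) return Policy.FORBID_ALL;
            if (code == 429) return Policy.FORBID_ALL;
            return Policy.ALLOW_ALL;
        }
        if (500 <= code) return Policy.FORBID_ALL;
        return Policy.ALLOW_ALL;
    }

    public static void main(String[] args) {
        assert decide(200) == Policy.PARSE;
        assert decide(404) == Policy.ALLOW_ALL;   // robots.txt missing -> allow all
        assert decide(403) == Policy.FORBID_ALL;
        assert decide(429) == Policy.FORBID_ALL;
        assert decide(503) == Policy.FORBID_ALL;
        assert decide(101) == Policy.ALLOW_ALL;   // unexpected codes fall through
        System.out.println("all assertions passed");
    }
}
```

Run with `java -ea` so the assertions are enabled.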
Yes. In the long tail the crawler gets almost every possible HTTP status code, see the attached robotstxt-cc-main-2023-14.txt.
Wow, impressive numbers. Thanks for sharing. While crawling, you always have to be prepared for the unexpected - this was not clear to me to this extent.
As a consequence, the condition else if (500 <= code) is definitely wrong and has to be replaced with else if (500 <= code && code <= 599) in order to implement more precisely what is written in the RFC. Otherwise, the crawler would restrict itself more than necessary.
Besides that, there are probably two approaches: first, stick with the old approach and parse the robots.txt file only for status code 200. In case the status code is weird and unexpected, such as 204, 711 or 1, the crawler assumes the default ALLOW_ALL.
Or second, parse the robots.txt file for any status code except 300-599. This is probably a bit closer to the wording of the RFC, which mentions no status code range and only speaks of a "successful download".
2.3.1.1. Successful Access
If the crawler successfully downloads the robots.txt file, the
crawler MUST follow the parseable rules.
I think that, in this case, the crawler wouldn't restrict itself too much either, as the parseContent() method of crawlercommons.robots.SimpleRobotRulesParser, which is subsequently called, also assumes ALLOW_ALL as the default for an empty or unparsable robots.txt file, etc.
Sorry, I forgot about HTTP 404: no robots.txt means "allow-all".
@michaeldinzinger I see that you've pushed a commit since. Is the PR ready to be reviewed or does it need more work?
@sebastian-nagel and I have briefly discussed this issue outside this PR and, following up on that discussion, I will probably add another small code modification. Our hypothesis was that for all downloads of robots.txt files with a status code that is neither 200 nor in the range 300-599, the parsing of the robots.txt file can be skipped, as it's mostly no good: in this case the crawler presumably doesn't actually retrieve a robots.txt file, but a meaningless HTML file or something similar. So it would be good to keep the best practice as implemented so far: only parse the robots.txt file for status code 200. I'll change this, then the PR can be reviewed, I assume. The reason was that the RFC only speaks of a "successful download", and it wasn't really clear to me what this concretely means.
I suppose we can just call it "download with status code 200" without losing conformance with the RFC.
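The behaviour converged on in this thread can be summarized in a small sketch. The names below are hypothetical stand-ins, not the actual HttpRobotRulesParser code: only a 200 response is parsed, 403/429 and bounded 5xx forbid crawling (subject to the configuration flags), and everything else, including weird codes like 204, 711 or 1, defaults to allow-all.

```java
public class RobotsPolicySketch {
    enum Policy { PARSE, ALLOW_ALL, FORBID_ALL }

    /**
     * Hypothetical summary of the discussed behaviour: only a 200
     * response is parsed; 403 (unless allowForbidden), 429, and
     * 500-599 (unless allow5xx) forbid crawling; any other status
     * code, including 404 and unexpected ones, means allow-all.
     */
    static Policy decide(int code, boolean allowForbidden, boolean allow5xx) {
        if (code == 200) return Policy.PARSE;
        if (code == 403 && !allowForbidden) return Policy.FORBID_ALL;
        if (code == 429) return Policy.FORBID_ALL;
        if (500 <= code && code <= 599) {
            return allow5xx ? Policy.ALLOW_ALL : Policy.FORBID_ALL;
        }
        return Policy.ALLOW_ALL; // includes 404 and any weird status code
    }

    public static void main(String[] args) {
        System.out.println(decide(200, false, false)); // PARSE
        System.out.println(decide(404, false, false)); // ALLOW_ALL
        System.out.println(decide(711, false, false)); // ALLOW_ALL
    }
}
```

Note that the bounded 500-599 range means a nonsense code like 711 falls through to allow-all instead of being treated as a server error.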
With the last commit, I have implemented the small code change that I mentioned two weeks ago. I think it should be done now.
Fab, thanks. @sebastian-nagel would you be able to review this PR? You've looked at it more than I have.
Review comments on core/src/main/java/com/digitalpebble/stormcrawler/protocol/HttpRobotRulesParser.java (resolved).
Looks good! - need to resolve conflicts with the changes that increase the number of followed redirects.
} else if (code >= 500) {
} else if (code == 403 && !allowForbidden) {
    // If the fetch of the robots.txt file is forbidden, then forbid also the fetch
    // of the other pages within this domain
(A little nitpicking) "domain" -> better "host" or "site". The robots.txt applies, strictly speaking, to the unique combination of scheme://host:port/.
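To illustrate that scoping, a robots.txt cache key derived from the unique combination scheme://host:port could be computed like this. The helper name is hypothetical, and the default-port mapping is simplified to http/https only:

```java
import java.net.URI;

public class RobotsCacheKey {
    /**
     * Hypothetical helper: robots.txt rules apply per scheme://host:port,
     * not per registered domain, so the cache key includes all three parts.
     */
    static String robotsKey(String pageUrl) {
        URI u = URI.create(pageUrl);
        int port = u.getPort();
        if (port == -1) {
            // simplified default-port handling for this sketch
            port = "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
        }
        return u.getScheme() + "://" + u.getHost().toLowerCase() + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(robotsKey("https://example.com/a/b?q=1"));  // https://example.com:443
        System.out.println(robotsKey("http://sub.example.com:8080/")); // http://sub.example.com:8080
    }
}
```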
@michaeldinzinger - to resolve the merge conflicts, could you try to rebase the branch onto the current master?
Thanks @michaeldinzinger and @sebastian-nagel for the review.
Modified implementation of robots.txt parsing as described in Issue #1042.
The motivation was the following: in case the web server answers with a server error, according to IETF RFC 9309 the website should not be crawled (unless these server errors persist over a long period, so that you can assume there is no robots.txt file). For now, this means that forbid-all rules become the default, and by setting a new configuration parameter http.robots.5xx.allow it is possible to bypass the default and assume crawling is allowed in case of server errors (the current StormCrawler behaviour).
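How such a flag could be read from the topology configuration is sketched below. The key name http.robots.5xx.allow comes from the PR description; the lookup itself is a simplified stand-in using a plain Map instead of StormCrawler's configuration utilities.

```java
import java.util.HashMap;
import java.util.Map;

public class RobotsConfSketch {
    // Key name taken from the PR description; the lookup below is a
    // simplified stand-in for StormCrawler's configuration utilities.
    static final String ALLOW_5XX_KEY = "http.robots.5xx.allow";

    static boolean allow5xx(Map<String, Object> conf) {
        Object v = conf.get(ALLOW_5XX_KEY);
        // Default is false: on server errors the forbid-all rules apply
        return v instanceof Boolean b ? b : Boolean.parseBoolean(String.valueOf(v));
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        System.out.println(allow5xx(conf)); // false (the default)
        conf.put(ALLOW_5XX_KEY, true);
        System.out.println(allow5xx(conf)); // true
    }
}
```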