Issue #1042: Adapt parsing of robots.txt files #1055

Merged 26 commits into apache:master on May 23, 2023

Conversation

michaeldinzinger
Contributor

Modified implementation of robots.txt parsing as described in issue #1042.
Motivation: if the web server answers with a server error, then according to IETF RFC 9309 the website should not be crawled (unless these server errors are encountered over a long period of time, so that one can assume there is no robots.txt file). For now, this means that forbid-all rules become the default; by setting a new configuration parameter http.robots.5xx.allow it is possible to bypass this default and assume that crawling is allowed in case of server errors (the current StormCrawler behaviour).
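For illustration, a minimal sketch of how such a flag could be read from the topology configuration; the key name follows the description above, while the class, method, and field names are made up for this example and do not reflect the actual StormCrawler code:

import java.util.Map;

public class RobotsConfigSketch {
    // Key name as introduced in this PR; default false keeps the new
    // forbid-all behaviour on server errors (RFC 9309)
    static final String ALLOW_5XX_KEY = "http.robots.5xx.allow";

    private boolean allow5xx = false;

    public void configure(Map<String, Object> conf) {
        Object value = conf.get(ALLOW_5XX_KEY);
        if (value instanceof Boolean) {
            allow5xx = (Boolean) value;
        } else if (value != null) {
            // e.g. the value comes from a YAML config file as a string
            allow5xx = Boolean.parseBoolean(value.toString());
        }
    }

    public boolean isAllow5xx() {
        return allow5xx;
    }
}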

@sebastian-nagel
Contributor

+1 looks good!

For now, it means that forbidding all rules becomes the default

Yes, it's an improvement for now - until there is a way to throttle or block new "fetch items" arriving (see discussion in #1042). Because the forbid-all rule is kept in the error cache, fetches for the given host are blocked for some reasonable time (error cache default config: expireAfterWrite = 1h).
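A rough sketch of that error-cache behaviour, assuming a Guava cache keyed by something like "protocol://host:port"; the key format and the class and method names here are illustrative, not the actual StormCrawler implementation:

import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import crawlercommons.robots.BaseRobotRules;

public class RobotsErrorCacheSketch {
    // Entries written after an error (e.g. a forbid-all rule for a 5xx
    // response) expire after one hour, matching the default mentioned above
    private final Cache<String, BaseRobotRules> errorCache =
            CacheBuilder.newBuilder()
                    .expireAfterWrite(1, TimeUnit.HOURS)
                    .maximumSize(10_000) // illustrative size
                    .build();

    public void cacheError(String hostKey, BaseRobotRules rules) {
        errorCache.put(hostKey, rules);
    }

    public BaseRobotRules getCached(String hostKey) {
        return errorCache.getIfPresent(hostKey);
    }
}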

{

// Parsing found rules; by default, all robots are forbidden (RFC 9309)
robotRules = FORBID_ALL_RULES;
sebastian-nagel (Contributor)

The default for all HTTP response codes not handled explicitly is now forbid-all instead of allow-all. Are there codes other than 200, 403, 5xx, and redirects which need special treatment? Maybe not.

sebastian-nagel (Contributor)

Maybe not.

Sorry, I was wrong. In case of an HTTP 404 (robots.txt not found), crawling is allowed.

If in doubt, maybe do not swap the defaults? "In the wild" the crawler may see any HTTP status code.

michaeldinzinger (Contributor, Author)

Oh yes, how could I forget about status code 404?! After 200 and redirects, it is probably what the crawler encounters most often when fetching the robots.txt.
In RFC 9309, it is written that for all status codes in the range 400-499 the robots.txt file may be considered "unavailable" and therefore allow-all. This also applies to 300-399 in case the crawler has followed at least five consecutive redirects.

This implementation might be better:

// TODO: Follow up to five redirections (at the moment, it is only one)

if (200 <= code && code <= 299) {
	String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
	robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
} else if (300 <= code && code <= 499) {
	robotRules = EMPTY_RULES; // allow all
	if (code == 403 && !allowForbidden) {
		robotRules = FORBID_ALL_RULES; // forbid all
	}
	// E.g. Google handles Too many requests similar to a server error
	// https://support.google.com/webmasters/answer/9679690#robots_details
	if (code == 429) {
		robotRules = FORBID_ALL_RULES; // forbid all
	}
} else if (500 <= code) {
	cacheRule = false;
	robotRules = FORBID_ALL_RULES; // forbid all
	if (allow5xx) {
		robotRules = EMPTY_RULES; // allow all
	}
} else {
	robotRules = EMPTY_RULES; // allow all
}

I'm not sure whether, e.g., if (200 <= code && code <= 299) or if (code == 200) is more correct.

sebastian-nagel (Contributor)

After 200 and redirects, it is probably what the crawler encounters most often when fetching the robots.txt

Yes. In the long tail the crawler gets almost every possible HTTP status code, see the attached robotstxt-cc-main-2023-14.txt.

That's why I'd be careful about if (200 <= code && code <= 299) or else if (500 <= code).

In the RFC 9309, it is written that for all status codes in the range of 400-499 the robots file may be considered as "unavailable" and therefore allow-all.

Yes. The 403 (and 401) handling is defined in the "original" norobots-RFC. So, there are some differences - you mentioned the 429 handled by Google's crawler.

// TODO: Follow up to five redirections (at the moment, it is only one)

Good catch!

Maybe keep this in a separate issue / PR to limit the amount of changes for an easy review. A unit test would also be useful to ensure that no change accidentally breaks something.
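A sketch of what such a unit test could look like, using a hypothetical helper statusToRules() that stands in for the status-code handling discussed in this PR; the real test would exercise the HTTP protocol implementation directly:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class RobotsStatusCodeTest {

    enum Rules { ALLOW_ALL, FORBID_ALL }

    // Hypothetical stand-in for the status-code handling discussed in this PR
    static Rules statusToRules(int code, boolean allowForbidden, boolean allow5xx) {
        if (code >= 500 && code <= 599) {
            return allow5xx ? Rules.ALLOW_ALL : Rules.FORBID_ALL;
        }
        if (code == 403 && !allowForbidden) {
            return Rules.FORBID_ALL;
        }
        if (code == 429) {
            return Rules.FORBID_ALL;
        }
        // 404 and other unexpected codes: robots.txt "unavailable" -> allow-all
        return Rules.ALLOW_ALL;
    }

    @Test
    public void testStatusCodeHandling() {
        assertEquals(Rules.ALLOW_ALL, statusToRules(404, false, false));
        assertEquals(Rules.FORBID_ALL, statusToRules(403, false, false));
        assertEquals(Rules.ALLOW_ALL, statusToRules(403, true, false));
        assertEquals(Rules.FORBID_ALL, statusToRules(429, false, false));
        assertEquals(Rules.FORBID_ALL, statusToRules(503, false, false));
        assertEquals(Rules.ALLOW_ALL, statusToRules(503, false, true));
    }
}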

michaeldinzinger (Contributor, Author)

Yes. In the long tail the crawler gets almost every possible HTTP status code, see the attached robotstxt-cc-main-2023-14.txt.

Wow, impressive numbers. Thanks for sharing. While crawling, you always have to be prepared for the unexpected; this was not clear to me to this extent.
As a consequence, the condition else if (500 <= code) is definitely wrong and has to be replaced with else if (500 <= code && code <= 599) in order to implement more precisely what is written in the RFC. Otherwise, the crawler would restrict itself more than necessary.

Besides that, there are probably two approaches: first, stick with the old approach and parse the robots.txt file only for status code 200. In case the status code is weird and unexpected, such as 204, 711 or 1, the crawler assumes the default ALLOW_ALL.
Or second, parse the robots.txt file for any status code except 300-599. This is probably a bit closer to the concrete wording of the RFC, in which no status code range is mentioned, as it only speaks of a "successful download".

2.3.1.1. Successful Access
If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.

I think, in this case, the crawler wouldn't restrict itself too much either, as the parseContent() method of crawlercommons.robots.SimpleRobotRulesParser, which is subsequently called, also assumes ALLOW_ALL as the default for an empty or unparsable robots.txt file, etc.
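To illustrate that fallback, a small example using crawler-commons, assuming the String-based parseContent() signature; the agent name and URLs are illustrative. An empty robots.txt yields rules that allow everything:

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class EmptyRobotsExample {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://example.org/robots.txt", // robots.txt URL
                new byte[0],                      // empty file content
                "text/plain",
                "mycrawler");                     // illustrative agent name
        // Prints true: an empty robots.txt results in allow-all rules
        System.out.println(rules.isAllowed("https://example.org/some/page"));
    }
}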

sebastian-nagel dismissed their stale review on April 16, 2023, 18:37

Sorry, I forgot about HTTP 404: no robots.txt means "allow-all".

sebastian-nagel self-requested a review on April 16, 2023, 18:37

@jnioche
Contributor

jnioche commented May 2, 2023

@michaeldinzinger I see that you've pushed a commit since. Is the PR ready to be reviewed, or does it need more work?
If so, @sebastian-nagel, since you started looking at this, any chance you could review this PR again?
Thanks to you both.

@michaeldinzinger
Contributor Author

@michaeldinzinger I see that you've pushed a commit since. is the PR ready to be reviewed or does it need more work? if so @sebastian-nagel since you started looking at this, any chance you could review this PR again? thanks to you both

@sebastian-nagel and I have briefly discussed this issue outside this PR and, following up on what we talked about, I will probably add another small code modification. Our hypothesis was that for all downloads of robots.txt files with a status code which is neither 200 nor in the range of 300-599, the parsing of the robots.txt file can be skipped, as it is mostly no good. In this case the crawler presumably doesn't actually retrieve a robots.txt file, but a meaningless HTML file or something like that. So it would be good to keep the best practice as implemented so far: only parse the robots.txt file for status code 200. I'll change this, then the PR can be reviewed, I assume.

The reason was that the RFC only speaks of a "successful download", and it wasn't really clear to me what this means concretely:

2.3.1.1.  Successful Access
If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.

I suppose we can just call it "download with status code 200" without losing conformity with the RFC.
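For clarity, the branching this would amount to might look roughly as follows (a sketch in the style of the snippet above, reusing its variable names; not the exact merged code):

// Parse the robots.txt only on a successful download (status code 200)
if (code == 200) {
	String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
	robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
} else if (code == 403 && !allowForbidden) {
	robotRules = FORBID_ALL_RULES; // forbid all
} else if (code == 429) {
	// Treated like a server error, as Google's crawler does
	robotRules = FORBID_ALL_RULES; // forbid all
} else if (500 <= code && code <= 599) {
	cacheRule = false;
	robotRules = allow5xx ? EMPTY_RULES : FORBID_ALL_RULES;
} else {
	// 404 and any other status code: robots.txt "unavailable" -> allow all
	robotRules = EMPTY_RULES;
}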

@michaeldinzinger
Contributor Author

With the last commit, I have implemented the small code change that I mentioned two weeks ago. I think it should be done now

@jnioche
Contributor

jnioche commented May 19, 2023

With the last commit, I have implemented the small code change that I mentioned two weeks ago. I think it should be done now

Fab thanks. @sebastian-nagel would you be able to review this PR? You've looked at it more than I have.
I'll have a quick look now

sebastian-nagel (Contributor) left a comment

Looks good! Need to resolve conflicts with the changes that increase the number of followed redirects.

} else if (code >= 500) {
} else if (code == 403 && !allowForbidden) {
// If the fetch of the robots.txt file is forbidden, then forbid also the fetch
// of the other pages within this domain
sebastian-nagel (Contributor)

(A little nitpicking:) "domain" -> better "host" or "site". The robots.txt applies, strictly speaking, to the unique combination of scheme://host:port/.
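As a small illustration of that scoping (the method name and its use as a cache key are assumptions of this sketch):

import java.net.MalformedURLException;
import java.net.URL;

public class RobotsScopeExample {
    // Builds the key a robots.txt rule set applies to: scheme://host:port
    static String robotsScope(String pageUrl) throws MalformedURLException {
        URL u = new URL(pageUrl);
        int port = u.getPort() != -1 ? u.getPort() : u.getDefaultPort();
        return u.getProtocol() + "://" + u.getHost() + ":" + port;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Both pages fall under the same robots.txt scope: https://example.org:443
        System.out.println(robotsScope("https://example.org/a/b.html"));
        System.out.println(robotsScope("https://example.org:443/c?d=e"));
    }
}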

@sebastian-nagel
Contributor

@michaeldinzinger - to resolve the merge conflicts, could you try to rebase the branch onto the current master?

jnioche and others added 15 commits May 22, 2023 23:18

…pache#1065
…"&" symbol in a parent's path apache#1059 (apache#1062): fix unmangleQueryString filter, do not analyze the full URL path, just the last child; formatting
…che#1069)
v3 version of actions
Mechanism to retrieve a more generic configuration value if a specific one is not found, fixes apache#1070; minor javadoc fix
Link to docker project
Create DeletionBolt.java: storm-crawler-solr bug, missing DeletionBolt bolt code (apache#1050); license header added; formatting
…to default topology
…#1074): Issue apache#1058: Allow 5 redirects for Robots.txt fetching; minor variable renaming
jnioche and others added 10 commits May 22, 2023 23:18

…--no-transfer-progress maven options
jnioche merged commit fc36a10 into apache:master on May 23, 2023
6 checks passed
@jnioche
Contributor

jnioche commented May 23, 2023

thanks @michaeldinzinger and @sebastian-nagel for the review

michaeldinzinger deleted the devAdaptRobotsParsing branch on May 24, 2023, 17:17