Issue #1042: Adapt parsing of robots.txt files #1055

Merged 26 commits into apache:master on May 23, 2023

Conversation

michaeldinzinger
Contributor

Modified implementation of robots.txt parsing as described in issue #1042.
Motivation: if the web server answers with a server error, then according to IETF RFC 9309 the website should not be crawled (unless these server errors are encountered over a long period of time, so that one can assume there is no robots.txt file). For now, this means that forbid-all rules become the default; by setting a new configuration parameter http.robots.5xx.allow it is possible to bypass this default and assume that crawling is allowed in case of server errors (the current StormCrawler behaviour).
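For illustration, a minimal sketch of how such a flag could be read from the topology configuration; the key name follows the description above, while the class, method, and field names are made up for this example and do not reflect the actual StormCrawler code:

import java.util.Map;

public class RobotsConfigSketch {
    // Key name as introduced in this PR; default false keeps the new
    // forbid-all behaviour on server errors (RFC 9309)
    static final String ALLOW_5XX_KEY = "http.robots.5xx.allow";

    private boolean allow5xx = false;

    public void configure(Map<String, Object> conf) {
        Object value = conf.get(ALLOW_5XX_KEY);
        if (value instanceof Boolean) {
            allow5xx = (Boolean) value;
        } else if (value != null) {
            // e.g. the value comes from a YAML config file as a string
            allow5xx = Boolean.parseBoolean(value.toString());
        }
    }

    public boolean isAllow5xx() {
        return allow5xx;
    }
}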

@sebastian-nagel
Contributor

+1 looks good!

For now, it means that forbidding all rules becomes the default

Yes, it's an improvement for now - until there is a way to throttle or block new "fetch items" arriving (see discussion in #1042). Because the forbid-all rule is kept in the error cache, fetches for the given host are blocked for some reasonable time (error cache default config: expireAfterWrite = 1h).
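A rough sketch of that error-cache behaviour, assuming a Guava cache keyed by something like "protocol://host:port"; the key format and the class and method names here are illustrative, not the actual StormCrawler implementation:

import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import crawlercommons.robots.BaseRobotRules;

public class RobotsErrorCacheSketch {
    // Entries written after an error (e.g. a forbid-all rule for a 5xx
    // response) expire after one hour, matching the default mentioned above
    private final Cache<String, BaseRobotRules> errorCache =
            CacheBuilder.newBuilder()
                    .expireAfterWrite(1, TimeUnit.HOURS)
                    .maximumSize(10_000) // illustrative size
                    .build();

    public void cacheError(String hostKey, BaseRobotRules rules) {
        errorCache.put(hostKey, rules);
    }

    public BaseRobotRules getCached(String hostKey) {
        return errorCache.getIfPresent(hostKey);
    }
}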

{

// Parsing found rules; by default, all robots are forbidden (RFC 9309)
robotRules = FORBID_ALL_RULES;
sebastian-nagel (Contributor)

The default for all HTTP response codes not handled explicitly is now forbid-all instead of allow-all. Are there codes other than 200, 403, 5xx, and redirects which need special treatment? Maybe not.

sebastian-nagel (Contributor)

Maybe not.

Sorry, I was wrong. In case of an HTTP 404 (robots.txt not found), crawling is allowed.

If in doubt, maybe do not swap the defaults? "In the wild" the crawler may see any HTTP status code.

michaeldinzinger (Contributor, Author)

Oh yes, how could I forget about status code 404?! After 200 and redirects, it is probably what the crawler encounters most often when fetching the robots.txt.
In RFC 9309, it is written that for all status codes in the range 400-499 the robots.txt file may be considered "unavailable" and therefore allow-all. This also applies to 300-399 in case the crawler has followed at least five consecutive redirects.

This implementation might be better:

// TODO: Follow up to five redirections (at the moment, it is only one)

if (200 <= code && code <= 299) {
	String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
	robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
} else if (300 <= code && code <= 499) {
	robotRules = EMPTY_RULES; // allow all
	if (code == 403 && !allowForbidden) {
		robotRules = FORBID_ALL_RULES; // forbid all
	}
	// E.g. Google handles Too many requests similar to a server error
	// https://support.google.com/webmasters/answer/9679690#robots_details
	if (code == 429) {
		robotRules = FORBID_ALL_RULES; // forbid all
	}
} else if (500 <= code) {
	cacheRule = false;
	robotRules = FORBID_ALL_RULES; // forbid all
	if (allow5xx) {
		robotRules = EMPTY_RULES; // allow all
	}
} else {
	robotRules = EMPTY_RULES; // allow all
}

I'm not sure whether, e.g., if (200 <= code && code <= 299) or if (code == 200) is more correct.

sebastian-nagel (Contributor)

After 200 and redirects, it is probably what the crawler encounters most often when fetching the robots.txt

Yes. In the long tail the crawler gets almost every possible HTTP status code, see the attached robotstxt-cc-main-2023-14.txt.

That's why I'd be careful about if (200 <= code && code <= 299) or else if (500 <= code).

In the RFC 9309, it is written that for all status codes in the range of 400-499 the robots file may be considered as "unavailable" and therefore allow-all.

Yes. The 403 (and 401) handling is defined in the "original" norobots-RFC. So, there are some differences - you mentioned the 429 handled by Google's crawler.

// TODO: Follow up to five redirections (at the moment, it is only one)

Good catch!

Maybe keep this in a separate issue / PR to limit the amount of changes for an easy review. A unit test would also be useful to ensure that no change accidentally breaks something.
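A sketch of what such a unit test could look like, using a hypothetical helper statusToRules() that stands in for the status-code handling discussed in this PR; the real test would exercise the HTTP protocol implementation directly:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class RobotsStatusCodeTest {

    enum Rules { ALLOW_ALL, FORBID_ALL }

    // Hypothetical stand-in for the status-code handling discussed in this PR
    static Rules statusToRules(int code, boolean allowForbidden, boolean allow5xx) {
        if (code >= 500 && code <= 599) {
            return allow5xx ? Rules.ALLOW_ALL : Rules.FORBID_ALL;
        }
        if (code == 403 && !allowForbidden) {
            return Rules.FORBID_ALL;
        }
        if (code == 429) {
            return Rules.FORBID_ALL;
        }
        // 404 and other unexpected codes: robots.txt "unavailable" -> allow-all
        return Rules.ALLOW_ALL;
    }

    @Test
    public void testStatusCodeHandling() {
        assertEquals(Rules.ALLOW_ALL, statusToRules(404, false, false));
        assertEquals(Rules.FORBID_ALL, statusToRules(403, false, false));
        assertEquals(Rules.ALLOW_ALL, statusToRules(403, true, false));
        assertEquals(Rules.FORBID_ALL, statusToRules(429, false, false));
        assertEquals(Rules.FORBID_ALL, statusToRules(503, false, false));
        assertEquals(Rules.ALLOW_ALL, statusToRules(503, false, true));
    }
}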

michaeldinzinger (Contributor, Author)

Yes. In the long tail the crawler gets almost every possible HTTP status code, see the attached robotstxt-cc-main-2023-14.txt.

Wow, impressive numbers. Thanks for sharing. While crawling, you always have to be prepared for the unexpected; this was not clear to me to this extent.
As a consequence, the condition else if (500 <= code) is definitely wrong and has to be replaced with else if (500 <= code && code <= 599) in order to implement more precisely what is written in the RFC. Otherwise, the crawler would restrict itself more than necessary.

Besides that, there are probably two approaches: first, stick with the old approach and parse the robots.txt file only for status code 200. In case the status code is weird and unexpected, such as 204, 711 or 1, the crawler assumes the default ALLOW_ALL.
Or second, parse the robots.txt file for any status code except 300-599. This is probably a bit closer to the concrete wording of the RFC, in which no status code range is mentioned, as it only speaks of a "successful download".

2.3.1.1. Successful Access
If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.

I think, in this case, the crawler wouldn't restrict itself too much either, as the parseContent() method of crawlercommons.robots.SimpleRobotRulesParser, which is subsequently called, also assumes ALLOW_ALL as the default for an empty or unparsable robots.txt file, etc.
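To illustrate that fallback, a small example using crawler-commons, assuming the String-based parseContent() signature; the agent name and URLs are illustrative. An empty robots.txt yields rules that allow everything:

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class EmptyRobotsExample {
    public static void main(String[] args) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://example.org/robots.txt", // robots.txt URL
                new byte[0],                      // empty file content
                "text/plain",
                "mycrawler");                     // illustrative agent name
        // Prints true: an empty robots.txt results in allow-all rules
        System.out.println(rules.isAllowed("https://example.org/some/page"));
    }
}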

sebastian-nagel dismissed their stale review on April 16, 2023, 18:37

Sorry, I forgot about HTTP 404: no robots.txt means "allow-all".

sebastian-nagel self-requested a review on April 16, 2023, 18:37

@jnioche
Contributor

jnioche commented May 2, 2023

@michaeldinzinger I see that you've pushed a commit since. Is the PR ready to be reviewed, or does it need more work?
If so, @sebastian-nagel, since you started looking at this, any chance you could review this PR again?
Thanks to you both.

@michaeldinzinger
Contributor Author

@michaeldinzinger I see that you've pushed a commit since. is the PR ready to be reviewed or does it need more work? if so @sebastian-nagel since you started looking at this, any chance you could review this PR again? thanks to you both

@sebastian-nagel and I have briefly discussed this issue outside this PR and, following up on what we talked about, I will probably add another small code modification. Our hypothesis was that for all downloads of robots.txt files with a status code which is neither 200 nor in the range of 300-599, the parsing of the robots.txt file can be skipped, as it is mostly no good. In this case the crawler presumably doesn't actually retrieve a robots.txt file, but a meaningless HTML file or something like that. So it would be good to keep the best practice as implemented so far: only parse the robots.txt file for status code 200. I'll change this, then the PR can be reviewed, I assume.

The reason was that the RFC only speaks of a "successful download", and it wasn't really clear to me what this means concretely:

2.3.1.1.  Successful Access
If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.

I suppose we can just call it "download with status code 200" without losing conformity with the RFC.
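For clarity, the branching this would amount to might look roughly as follows (a sketch in the style of the snippet above, reusing its variable names; not the exact merged code):

// Parse the robots.txt only on a successful download (status code 200)
if (code == 200) {
	String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
	robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
} else if (code == 403 && !allowForbidden) {
	robotRules = FORBID_ALL_RULES; // forbid all
} else if (code == 429) {
	// Treated like a server error, as Google's crawler does
	robotRules = FORBID_ALL_RULES; // forbid all
} else if (500 <= code && code <= 599) {
	cacheRule = false;
	robotRules = allow5xx ? EMPTY_RULES : FORBID_ALL_RULES;
} else {
	// 404 and any other status code: robots.txt "unavailable" -> allow all
	robotRules = EMPTY_RULES;
}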

@michaeldinzinger
Contributor Author

With the last commit, I have implemented the small code change that I mentioned two weeks ago. I think it should be done now

@jnioche
Contributor

jnioche commented May 19, 2023

With the last commit, I have implemented the small code change that I mentioned two weeks ago. I think it should be done now

Fab thanks. @sebastian-nagel would you be able to review this PR? You've looked at it more than I have.
I'll have a quick look now

sebastian-nagel (Contributor) left a comment

Looks good! Need to resolve conflicts with the changes that increase the number of followed redirects.

} else if (code >= 500) {
} else if (code == 403 && !allowForbidden) {
// If the fetch of the robots.txt file is forbidden, then forbid also the fetch
// of the other pages within this domain
sebastian-nagel (Contributor)

(A little nitpicking:) "domain" -> better "host" or "site". The robots.txt applies, strictly speaking, to the unique combination of scheme://host:port/.
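As a small illustration of that scoping (the method name and its use as a cache key are assumptions of this sketch):

import java.net.MalformedURLException;
import java.net.URL;

public class RobotsScopeExample {
    // Builds the key a robots.txt rule set applies to: scheme://host:port
    static String robotsScope(String pageUrl) throws MalformedURLException {
        URL u = new URL(pageUrl);
        int port = u.getPort() != -1 ? u.getPort() : u.getDefaultPort();
        return u.getProtocol() + "://" + u.getHost() + ":" + port;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Both pages fall under the same robots.txt scope: https://example.org:443
        System.out.println(robotsScope("https://example.org/a/b.html"));
        System.out.println(robotsScope("https://example.org:443/c?d=e"));
    }
}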

@sebastian-nagel
Contributor

@michaeldinzinger - to resolve the merge conflicts, could you try to rebase the branch onto the current master?

jnioche and others added 15 commits May 22, 2023 23:18

…pache#1065
…"&" symbol in a parent's path apache#1059 (apache#1062): fix unmangleQueryString filter, do not analyze the full URL path, just the last child; formatting
…che#1069)
v3 version of actions
Mechanism to retrieve a more generic configuration value if a specific one is not found, fixes apache#1070; minor javadoc fix
Link to docker project
Create DeletionBolt.java: storm-crawler-solr bug, missing DeletionBolt bolt code (apache#1050); license header added; formatting
…to default topology
…#1074): Issue apache#1058: Allow 5 redirects for Robots.txt fetching; minor variable renaming
jnioche and others added 10 commits May 22, 2023 23:18

…--no-transfer-progress maven options
jnioche merged commit fc36a10 into apache:master on May 23, 2023
6 checks passed
@jnioche
Contributor

jnioche commented May 23, 2023

thanks @michaeldinzinger and @sebastian-nagel for the review

michaeldinzinger deleted the devAdaptRobotsParsing branch on May 24, 2023, 17:17