Improvements and fixes to HttpRobotRulesParser when following redirects #1103

Merged
merged 1 commit into from
Oct 2, 2023

Conversation

sebastian-nagel
Contributor

  • remove the unsafe check for absolute URLs on redirects (HTTP Location header): a string starting with http is not necessarily a valid and resolvable URL. For safety, always resolve the redirect location using java.net.URL(URL context, String spec).
  • resolve relative redirects using the redirect source as base URL. Resolving with the original URL as base/context may lead to a wrong redirect target when a chain of redirects crosses different hosts/authorities.
  • cache the robots rules for every /robots.txt (if at the default location) in a chain of redirects
  • add unit test for the three points above
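
The first three points can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the class name `RedirectResolution`, the helper names, and the example hosts are hypothetical, and the "default location" check is only an assumption about what such a test could look like. The only API it relies on is `java.net.URL(URL context, String spec)`, which is deprecated in recent JDKs but still available.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectResolution {

    /**
     * Resolve a Location header value against the URL that issued the
     * redirect (the redirect source), not against the original URL.
     * The two-argument constructor handles absolute and relative
     * locations alike, so no "starts with http" check is needed.
     */
    static URL resolveRedirect(URL redirectSource, String location)
            throws MalformedURLException {
        return new URL(redirectSource, location);
    }

    /**
     * Hypothetical check for "robots.txt at the default location":
     * path is exactly /robots.txt and there is no query string.
     */
    static boolean isDefaultRobotsLocation(URL url) {
        return "/robots.txt".equals(url.getPath()) && url.getQuery() == null;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL original = new URL("http://example.com/robots.txt");

        // First hop: absolute redirect to a different host
        URL hop1 = resolveRedirect(original, "https://www.example.org/robots.txt");
        // Second hop: relative redirect, resolved against hop1
        URL hop2 = resolveRedirect(hop1, "/other/robots.txt");
        // Same relative location resolved against the original URL: wrong host
        URL wrong = resolveRedirect(original, "/other/robots.txt");
        // A relative location that merely starts with "http"
        URL tricky = resolveRedirect(original, "http-equiv.html");

        System.out.println(hop1 + " default location: " + isDefaultRobotsLocation(hop1));
        System.out.println(hop2 + " default location: " + isDefaultRobotsLocation(hop2));
        System.out.println(wrong);
        System.out.println(tricky);
    }
}
```

In this sketch the rules fetched at the end of the chain would be cached for `example.com` and `www.example.org`, because both hops are at `/robots.txt`, but not under the final target `/other/robots.txt`.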

Note: this PR is a result of implementing the RFC 9309 redirect rules in Nutch, see NUTCH-2990 and apache/nutch#779. I deliberately took the implementation in StormCrawler (#1058/#1074) as a starting point.

- remove unsafe check for absolute URLs on redirects (location header)
- resolve relative redirects using the redirect source as base URL
- cache robot rules for all /robots.txt (if on default location) in
  a chain of redirects

Signed-off-by: Sebastian Nagel <sebastian@commoncrawl.org>
@jnioche jnioche added this to the 2.10 milestone Oct 2, 2023
@jnioche jnioche merged commit d6f1377 into apache:master Oct 2, 2023
1 of 4 checks passed
jnioche commented Oct 2, 2023

Thanks @sebastian-nagel
