Caching of redirected robots.txt may overwrite correct robots.txt rules #573

Closed
sebastian-nagel opened this issue May 22, 2018 · 4 comments

@sebastian-nagel (Contributor)

Redirected robots.txt rules should only be cached for the target host if the URL path is /robots.txt; otherwise the redirect may overwrite the correct robots.txt rules, see NUTCH-2581.

@jnioche (Contributor) commented May 22, 2018

In the example you gave in NUTCH-2581, the target URL ends in /robots.txt, so I'm not sure I understand the problem.

@sebastian-nagel (Contributor, Author)

The path must be exactly /robots.txt, not /wyomingtheband/robots.txt or anything else. The problem is that the current logic assumes the content returned for https://www.facebook.com/wyomingtheband/robots.txt is equivalent to that of https://www.facebook.com/robots.txt, which is definitely not the case. Regarding the example: this redirect, and many more like it, masks the correct robots.txt and frequently causes the host www.facebook.com to be considered allowed for crawling. Of course, there is some randomness in which robots.txt is fetched and cached first.
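For illustration, here is a minimal Java sketch of the check being discussed, not the project's actual code: rules fetched after a redirect may only be cached under the redirect target's host when the target path is exactly /robots.txt. The class and method names are hypothetical.

```java
import java.net.URI;

// Minimal sketch (hypothetical helper, not the project's implementation):
// decide whether robots rules fetched from a redirect target may be cached
// under the target's host.
public class RobotsRedirectCheck {

    // True only if the redirect target's path is exactly "/robots.txt".
    // A simple "endsWith" check would wrongly accept paths like
    // "/wyomingtheband/robots.txt" and overwrite the host's real rules.
    static boolean cacheableForTargetHost(String redirectTarget) {
        try {
            return "/robots.txt".equals(URI.create(redirectTarget).getPath());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: never cache for the target host
        }
    }

    public static void main(String[] args) {
        // Safe: the rules really belong to www.facebook.com.
        System.out.println(cacheableForTargetHost(
                "https://www.facebook.com/robots.txt"));                     // true
        // Unsafe: a page-specific redirect target; must not overwrite the
        // cached rules for www.facebook.com.
        System.out.println(cacheableForTargetHost(
                "https://www.facebook.com/wyomingtheband/robots.txt"));      // false
    }
}
```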

@jnioche (Contributor) commented May 23, 2018

Ok, got it, thanks Sebastian. Will fix it right now.

@jnioche jnioche added this to the 1.9 milestone May 23, 2018
@jnioche jnioche added the core label May 23, 2018
@sebastian-nagel (Contributor, Author)

Thanks, @jnioche!
