Caching of redirected robots.txt may overwrite correct robots.txt rules #573

Closed
sebastian-nagel opened this Issue May 22, 2018 · 4 comments

@sebastian-nagel
Collaborator

sebastian-nagel commented May 22, 2018

Rules from a redirected robots.txt should only be cached for the target host if the target URL path is exactly /robots.txt; otherwise the redirect may overwrite the correct robots rules, see NUTCH-2581.

@jnioche


Member

jnioche commented May 22, 2018

In the example you gave in NUTCH-2581, the target URL ends in /robots.txt so I'm not sure I understand the problem.

@sebastian-nagel


Collaborator

sebastian-nagel commented May 22, 2018

It must be exactly /robots.txt, not /wyomingtheband/robots.txt or anything else. The problem is that the current logic assumes the content returned for https://www.facebook.com/wyomingtheband/robots.txt is equivalent to that of https://www.facebook.com/robots.txt, which is definitely not the case. Regarding the example: this redirect, and many more like it, mask the correct robots.txt, with the effect that the host www.facebook.com is frequently considered allowed for crawling. Of course, there is some randomness in which robots.txt is fetched and cached first.
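A minimal sketch of the check being proposed (class and method names are hypothetical, not the actual crawler-commons code): before caching redirected robots rules under the redirect target's host, verify that the target path is exactly /robots.txt.

```java
import java.net.URI;

// Hypothetical sketch: decide whether rules obtained by following a
// redirect may be cached for the redirect target's host. They may only
// be cached when the target path is exactly /robots.txt; any other path
// (e.g. /wyomingtheband/robots.txt) is an ordinary page whose content
// must not be treated as the host's robots.txt.
public class RobotsCacheCheck {

    static boolean cacheForTargetHost(String redirectTarget) {
        URI target = URI.create(redirectTarget);
        return "/robots.txt".equals(target.getPath());
    }

    public static void main(String[] args) {
        // The host's real robots.txt: safe to cache for www.facebook.com.
        System.out.println(
            cacheForTargetHost("https://www.facebook.com/robots.txt"));
        // A page that merely ends in robots.txt: must not be cached
        // as the rules for www.facebook.com.
        System.out.println(
            cacheForTargetHost("https://www.facebook.com/wyomingtheband/robots.txt"));
    }
}
```

With this check, the redirect from the NUTCH-2581 example would no longer poison the cached rules for www.facebook.com.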

@jnioche


Member

jnioche commented May 23, 2018

Ok, got it, thanks Sebastian. Will fix it right now.

@jnioche jnioche added this to the 1.9 milestone May 23, 2018

@jnioche jnioche added the core label May 23, 2018

@jnioche jnioche closed this in b37f6af May 23, 2018

@sebastian-nagel


Collaborator

sebastian-nagel commented May 23, 2018

Thanks, @jnioche!
