Caching of redirected robots.txt may overwrite correct robots.txt rules #573

Closed
sebastian-nagel opened this issue May 22, 2018 · 4 comments

@sebastian-nagel (Contributor)

Redirected robots.txt rules should only be cached for the target host if the URL path is /robots.txt; otherwise the redirect may overwrite the correct robots.txt rules, see NUTCH-2581.

@jnioche (Contributor) commented May 22, 2018

In the example you gave in NUTCH-2581, the target URL ends in /robots.txt, so I'm not sure I understand the problem.

@sebastian-nagel (Contributor, Author)

The path must be exactly /robots.txt, not /wyomingtheband/robots.txt or anything else. The problem is that the current logic assumes the content returned for https://www.facebook.com/wyomingtheband/robots.txt is equivalent to that of https://www.facebook.com/robots.txt, which is definitely not the case. Regarding the example: this redirect, and many more like it, masks the correct robots.txt and frequently causes the host www.facebook.com to be considered allowed for crawling. Of course, there is some randomness in which robots.txt is fetched and cached first.
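For illustration, here is a minimal Java sketch of the check being discussed, not the project's actual code: rules fetched after a redirect may only be cached under the redirect target's host when the target path is exactly /robots.txt. The class and method names are hypothetical.

```java
import java.net.URI;

// Minimal sketch (hypothetical helper, not the project's implementation):
// decide whether robots rules fetched from a redirect target may be cached
// under the target's host.
public class RobotsRedirectCheck {

    // True only if the redirect target's path is exactly "/robots.txt".
    // A simple "endsWith" check would wrongly accept paths like
    // "/wyomingtheband/robots.txt" and overwrite the host's real rules.
    static boolean cacheableForTargetHost(String redirectTarget) {
        try {
            return "/robots.txt".equals(URI.create(redirectTarget).getPath());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: never cache for the target host
        }
    }

    public static void main(String[] args) {
        // Safe: the rules really belong to www.facebook.com.
        System.out.println(cacheableForTargetHost(
                "https://www.facebook.com/robots.txt"));                     // true
        // Unsafe: a page-specific redirect target; must not overwrite the
        // cached rules for www.facebook.com.
        System.out.println(cacheableForTargetHost(
                "https://www.facebook.com/wyomingtheband/robots.txt"));      // false
    }
}
```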

@jnioche (Contributor) commented May 23, 2018

Ok, got it, thanks Sebastian. Will fix it right now.

@jnioche jnioche added this to the 1.9 milestone May 23, 2018
@jnioche jnioche added the core label May 23, 2018
@sebastian-nagel (Contributor, Author)

Thanks, @jnioche!
