Redirected robots.txt rules should only be cached for the target host if the URL path is exactly /robots.txt; otherwise the redirect may overwrite the correct robots rules, see NUTCH-2581.
It must be exactly /robots.txt, not /wyomingtheband/robots.txt or anything else. The problem is that the current logic assumes that the content returned for https://www.facebook.com/wyomingtheband/robots.txt is equivalent to that of https://www.facebook.com/robots.txt, which is definitely not the case. Regarding the example: this redirect and many more mask the correct robots.txt and frequently cause the host www.facebook.com to be considered allowed for crawling. Of course, there is some randomness in what is fetched and cached first.
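A minimal sketch of the intended check (not the actual crawler-commons or Nutch API; the `shouldCacheForTargetHost` helper is hypothetical): rules fetched by following a redirect are only treated as the target host's robots.txt when the final path is exactly /robots.txt.

```java
import java.net.URI;

public class RobotsRedirectCheck {

    /**
     * Rules obtained via a redirect are valid for the target host only if
     * the target path is exactly "/robots.txt"; anything else is an
     * arbitrary page, not that host's robots.txt.
     */
    static boolean shouldCacheForTargetHost(URI redirectTarget) {
        return "/robots.txt".equals(redirectTarget.getPath());
    }

    public static void main(String[] args) {
        // Path is /wyomingtheband/robots.txt, not /robots.txt: do not cache
        // this content as the robots.txt of www.facebook.com.
        System.out.println(shouldCacheForTargetHost(
                URI.create("https://www.facebook.com/wyomingtheband/robots.txt"))); // false
        // Path is exactly /robots.txt: safe to cache for www.facebook.com.
        System.out.println(shouldCacheForTargetHost(
                URI.create("https://www.facebook.com/robots.txt"))); // true
    }
}
```

With such a guard in place, a redirect chain ending anywhere other than the canonical /robots.txt path would leave the cached rules for the target host untouched.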