Keep User-Agent the same across requests #51
Hello, RobotUserAgent and UserAgent were added at the time to have a way to signal to site owners where the request came from, so that it showed up in the logs (on the robots.txt request), while a common browser user-agent (I believe it is the Firefox one) was used to request the page, since some sites still serve different pages based on the user-agent. Googlebot was used in the robots.txt request to avoid falling through the cracks, as many sites only have rules for Googlebot.

Was that a good idea? No. Was that a sane default? Probably not. For what it's worth, I haven't used any of this in fetchbot, my subsequent shot at a web crawling package. But those are just defaults (the readme explicitly suggests customizing them), and it's easy enough to set both user-agents to the same value. Also see #41 for more on this.

So I understand your point, but changing that (e.g. not using RobotUserAgent in robots requests) would break things for existing users, for little added value (it's easy to "fix" this from outside the package). Thanks
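The workaround mentioned above, giving the robots.txt request and the page requests the same User-Agent, can be sketched without depending on gocrawl itself. This is a minimal illustrative example using only the standard library; the helper name, URLs, and agent string are hypothetical, not gocrawl defaults or API.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRequest builds a GET request carrying the shared user-agent string.
// Using one such helper for both the robots.txt fetch and the page fetch
// guarantees the header is identical across requests.
func newRequest(rawURL, userAgent string) (*http.Request, error) {
	req, err := http.NewRequest("GET", rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	const agent = "examplebot/1.0 (+https://example.com/bot)"
	robots, _ := newRequest("https://example.com/robots.txt", agent)
	page, _ := newRequest("https://example.com/", agent)
	// Both requests now advertise the same identity to the site owner.
	fmt.Println(robots.Header.Get("User-Agent") == page.Header.Get("User-Agent"))
}
```

In gocrawl terms, the equivalent is simply assigning the same string to both of the crawler's user-agent options before starting a crawl.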
Hmmm, hold on a second. Having looked at the code a bit more, I think you're right: it could use the same UserAgent and use a "robot name" token to look up the relevant robots.txt rules (falling back on RobotUserAgent when there is no robot name, to avoid breaking existing users). Will take a closer look tonight.
Updated, now all requests are made with UserAgent.
I'm sorry for not reading the past issue; I had missed it. I basically agree with you that it is only a default, but the crawling target, the Internet, does change.
Oops 19sec late. Sorry. |
:) I think that the fix I implemented is basically what you mention in the last comment, RobotUserAgent is now just the token to check for a match in robots.txt. |
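The token-matching scheme described here (a robot name used only to select the applicable robots.txt group, with a wildcard fallback) can be sketched as a self-contained Go function. This is a deliberately simplified illustration, not gocrawl's actual implementation; function names are hypothetical, and real robots.txt parsing handles more cases (multiple User-agent lines per group, longest-match precedence, Allow rules).

```go
package main

import (
	"fmt"
	"strings"
)

// matchesToken reports whether a robots.txt User-agent value matches the
// crawler's robot-name token, case-insensitively.
func matchesToken(uaLine, token string) bool {
	return strings.Contains(strings.ToLower(uaLine), strings.ToLower(token))
}

// disallowedFor collects the Disallow paths that apply to the given token,
// falling back to the "*" group when no group names the token.
func disallowedFor(robotsTxt, token string) []string {
	var rules, starRules []string
	applies, star := false, false
	for _, line := range strings.Split(robotsTxt, "\n") {
		line = strings.TrimSpace(line)
		lower := strings.ToLower(line)
		switch {
		case strings.HasPrefix(lower, "user-agent:"):
			val := strings.TrimSpace(line[len("user-agent:"):])
			applies = matchesToken(val, token)
			star = val == "*"
		case strings.HasPrefix(lower, "disallow:"):
			path := strings.TrimSpace(line[len("disallow:"):])
			if applies {
				rules = append(rules, path)
			} else if star {
				starRules = append(starRules, path)
			}
		}
	}
	if len(rules) == 0 {
		return starRules // no group matched the token: use the wildcard group
	}
	return rules
}

func main() {
	robots := "User-agent: examplebot\nDisallow: /private\n\nUser-agent: *\nDisallow: /tmp\n"
	fmt.Println(disallowedFor(robots, "examplebot")) // group matching our token
	fmt.Println(disallowedFor(robots, "otherbot"))   // wildcard fallback
}
```

The key point matches the fix: the token influences only which robots.txt rules are selected, while the actual HTTP requests all carry one unchanged User-Agent header.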
Well, reading my response again, it looks like I basically just rewrote your response.
I think RobotUserAgent may well be a candidate for deprecation.

The User-Agent header of a crawler is expected to stay the same when the crawling serves the same purpose (such as text search indexing). gocrawl seems to change the User-Agent header when accessing robots.txt. It also identifies itself as Googlebot even though it is not, which may strike site owners as strange.

Sorry for posting multiple issues at once. No offense intended; I believe these ideas could make this popular crawler even politer.