
Keep User-Agent the same across requests #51

Closed
ranvis opened this issue Feb 24, 2016 · 7 comments
ranvis commented Feb 24, 2016

I think RobotUserAgent could probably be deprecated.
A crawler's User-Agent header is expected to stay the same across requests made for the same crawling purpose (such as text search indexing), yet gocrawl seems to change the User-Agent header when accessing robots.txt.
It also includes "Googlebot" even though it is not Googlebot, which can look strange to site owners.

Sorry for posting multiple issues at once.
No offense intended; I just believe these ideas would make this popular crawler even politer.

mna commented Feb 24, 2016

Hello,

RobotUserAgent and UserAgent were added at the time to give site owners a way to tell where a request came from, so that it showed up in their logs (on the robots.txt request), while a common browser user-agent (I believe it is the Firefox one) was used to request the pages themselves, since some sites still serve different content depending on the user-agent. Googlebot was used in the robots.txt request to avoid falling through the cracks, as many sites only have rules for Googlebot.

Was that a good idea? No. Was that a sane default? Probably not. For what it's worth, I haven't used any of this in fetchbot, my subsequent shot at a web crawling package. But those are just defaults (the readme explicitly suggests customizing them), and it's easy enough to set both user-agents to the same value, as in the sketch below. Also see #41 for more on this.

So I understand your point, but changing that (e.g. not using RobotUserAgent in robots requests) would break things for existing users, for little added value (it's easy to "fix" this from outside the package).

Thanks,
Martin
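
For reference, the workaround mentioned above, setting both user-agents to the same value from outside the package, would look roughly like this with gocrawl's Options (a sketch based on the project README; the crawler name and seed URL are placeholders):

```go
package main

import (
	"time"

	"github.com/PuerkitoBio/gocrawl"
)

// Ext embeds gocrawl's DefaultExtender so only the options below need to change.
type Ext struct {
	gocrawl.DefaultExtender
}

func main() {
	opts := gocrawl.NewOptions(&Ext{})

	// Use a single identity for every request: the same string is sent
	// both when fetching robots.txt and when fetching pages.
	ua := "mycrawler/1.0 (+https://example.com/bot)"
	opts.UserAgent = ua
	opts.RobotUserAgent = ua

	opts.CrawlDelay = 1 * time.Second

	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("https://example.com/")
}
```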

mna commented Feb 24, 2016

Hmmm, hold on a second. Having looked at the code a bit more, I think you're right: it could use the same UserAgent for all requests and use a "robot name" token to find the relevant robots.txt rules (falling back on RobotUserAgent when no robot name is set, to avoid breaking existing users). Will take a closer look tonight.

mna reopened this Feb 24, 2016

mna commented Feb 24, 2016

Updated: all requests are now made with Options.UserAgent, and Options.RobotUserAgent is used only to find a matching policy in the robots.txt file (no point adding a new field for that).
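
With that change, a setup along these lines (a sketch reusing the Ext extender from the earlier example; the token and URL are placeholders) sends Options.UserAgent on every request, robots.txt included, while Options.RobotUserAgent is only matched against the User-agent groups in robots.txt:

```go
opts := gocrawl.NewOptions(&Ext{})

// Sent as the User-Agent header on every request, including robots.txt.
opts.UserAgent = "mycrawler/1.0 (+https://example.com/bot)"

// No longer sent over the wire; only used to pick the matching
// "User-agent:" group in robots.txt (e.g. "User-agent: mycrawler").
opts.RobotUserAgent = "mycrawler"

c := gocrawl.NewCrawlerWithOptions(opts)
c.Run("https://example.com/")
```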

mna closed this as completed Feb 24, 2016

ranvis commented Feb 24, 2016

I'm sorry for not reading the past issue; I had missed it.
Could the default RobotUserAgent then be a robot name token, with UserAgent used as the header, as you say? That way existing users who set RobotUserAgent would still get the correct match on robots.txt.

I basically agree that these are only defaults, but the crawling target, the Internet, does change. If changing the defaults for how we interact with the Internet doesn't break crawling, can't those defaults be changed?

ranvis commented Feb 24, 2016

Oops, 19 seconds late. Sorry.

mna commented Feb 24, 2016

:) I think the fix I implemented is basically what you describe in your last comment: RobotUserAgent is now just the token checked for a match in robots.txt.

ranvis commented Feb 24, 2016

Well, reading my response again, it looks like I just rewrote yours.
Thank you for the detailed responses.
