
Keep User-Agent the same across requests #51

Closed
ranvis opened this issue Feb 24, 2016 · 7 comments
ranvis commented Feb 24, 2016

I think RobotUserAgent could probably be deprecated.
A crawler's User-Agent header is expected to stay the same across requests made for the same crawling purpose (such as text search indexing), yet gocrawl seems to change the User-Agent header when accessing robots.txt.
It also includes "Googlebot" even though it is not Googlebot, which can look strange to site owners.

Sorry for posting multiple issues at once.
No offense intended; I just believe these ideas would make this popular crawler even politer.

mna commented Feb 24, 2016

Hello,

RobotUserAgent and UserAgent were added at the time to give site owners a way to tell where a request came from, so that it showed up in their logs (on the robots.txt request), while a common browser user-agent (I believe it is the Firefox one) was used to request the pages themselves, since some sites still serve different content depending on the user-agent. Googlebot was used in the robots.txt request to avoid falling through the cracks, as many sites only have rules for Googlebot.

Was that a good idea? No. Was that a sane default? Probably not. For what it's worth, I haven't used any of this in fetchbot, my subsequent shot at a web crawling package. But those are just defaults (the readme explicitly suggests customizing them), and it's easy enough to set both user-agents to the same value, as in the sketch below. Also see #41 for more on this.

So I understand your point, but changing that (e.g. not using RobotUserAgent in robots requests) would break things for existing users, for little added value (it's easy to "fix" this from outside the package).

Thanks,
Martin
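
For reference, the workaround mentioned above, setting both user-agents to the same value from outside the package, would look roughly like this with gocrawl's Options (a sketch based on the project README; the crawler name and seed URL are placeholders):

```go
package main

import (
	"time"

	"github.com/PuerkitoBio/gocrawl"
)

// Ext embeds gocrawl's DefaultExtender so only the options below need to change.
type Ext struct {
	gocrawl.DefaultExtender
}

func main() {
	opts := gocrawl.NewOptions(&Ext{})

	// Use a single identity for every request: the same string is sent
	// both when fetching robots.txt and when fetching pages.
	ua := "mycrawler/1.0 (+https://example.com/bot)"
	opts.UserAgent = ua
	opts.RobotUserAgent = ua

	opts.CrawlDelay = 1 * time.Second

	c := gocrawl.NewCrawlerWithOptions(opts)
	c.Run("https://example.com/")
}
```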

mna commented Feb 24, 2016

Hmmm, hold on a second. Having looked at the code a bit more, I think you're right: it could use the same UserAgent for all requests and use a "robot name" token to find the relevant robots.txt rules (falling back on RobotUserAgent when no robot name is set, to avoid breaking existing users). Will take a closer look tonight.

mna reopened this Feb 24, 2016

mna commented Feb 24, 2016

Updated: all requests are now made with Options.UserAgent, and Options.RobotUserAgent is used only to find a matching policy in the robots.txt file (no point adding a new field for that).
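
With that change, a setup along these lines (a sketch reusing the Ext extender from the earlier example; the token and URL are placeholders) sends Options.UserAgent on every request, robots.txt included, while Options.RobotUserAgent is only matched against the User-agent groups in robots.txt:

```go
opts := gocrawl.NewOptions(&Ext{})

// Sent as the User-Agent header on every request, including robots.txt.
opts.UserAgent = "mycrawler/1.0 (+https://example.com/bot)"

// No longer sent over the wire; only used to pick the matching
// "User-agent:" group in robots.txt (e.g. "User-agent: mycrawler").
opts.RobotUserAgent = "mycrawler"

c := gocrawl.NewCrawlerWithOptions(opts)
c.Run("https://example.com/")
```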

mna closed this as completed Feb 24, 2016

ranvis commented Feb 24, 2016

I'm sorry for not reading the past issue; I had missed it.
Could the default RobotUserAgent then be a robot name token, with UserAgent used as the header, as you say? That way existing users who set RobotUserAgent would still get the correct match on robots.txt.

I basically agree that these are only defaults, but the crawling target, the Internet, does change. If changing the defaults for how we interact with the Internet doesn't break crawling, can't those defaults be changed?

ranvis commented Feb 24, 2016

Oops, 19 seconds late. Sorry.

mna commented Feb 24, 2016

:) I think the fix I implemented is basically what you describe in your last comment: RobotUserAgent is now just the token checked for a match in robots.txt.

ranvis commented Feb 24, 2016

Well, reading my response again, it looks like I just rewrote yours.
Thank you for the detailed responses.
