Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

httpClient.Transport #9

Closed
ghost opened this issue Dec 27, 2012 · 7 comments
Closed

httpClient.Transport #9

ghost opened this issue Dec 27, 2012 · 7 comments

Comments

@ghost
Copy link

ghost commented Dec 27, 2012

Currently I'm having a little problem with changing http client transport. It is caused by two facts:

  • httpClient (ext.go:120) is private
  • DefaultExtender.Fetch uses it

In my case I needed to change the standard transport to add timeouts:

client.Transport = &http.Transport{
            Dial: func(netw, addr string) (net.Conn, error) {
                deadline := time.Now().Add(ext.Info.Timeout)
                c, err := net.DialTimeout(netw, addr, ext.Info.Timeout)
                if err != nil {
                    return nil, err
                }
                c.SetDeadline(deadline)
                return c, nil
            },
        },

In current situation I ended up copypasting your default 'Fetch' code (as-is) and 'CheckRedirect' part of your httpClient (as-is). Also, I ended up copypasting your 'isRobotsTxt' func :)

But I think it would be better to allow somehow to change just parts of this logic, without rewriting it all.

For example, if we talk about Transport, it could be OK to add a Transport field to Options and to use it in the instantiation code (probably then it should be moved from package-level vars to DefaultExtender):

httpClient = &http.Client{Transport: options.Transport,  CheckRedirect: func(req *http.Request, via []*http.Request) error {
    // For robots.txt URLs, allow up to 10 redirects, like the default http client.
    // Rationale: the site owner explicitly tells us that this specific robots.txt
    // should be used for this domain.
    if isRobotsTxtUrl(req.URL) {
        if len(via) >= 10 {
            return errors.New("stopped after 10 redirects")
        }
        return nil
    }

    // For all other URLs, do NOT follow redirections, the default Fetch() implementation
    // will ask the worker to enqueue the new (redirect-to) URL. Returning an error
    // will make httpClient.Do() return a url.Error, with the URL field containing the new URL.
    return &EnqueueRedirectError{"redirection not followed"}
}}

I'm not sure if it is already in your package reorganization plans, I thought that I should submit it just in case.

@mna
Copy link
Member

mna commented Dec 31, 2012

Hi Dmitry,

I don't think adding a public option for Transport is the way to go, maybe allowing the http client to be specified through the Options. It could allow some flexibility for following redirects without necessarily overwriting the Fetch implementation, while allowing custom Transports.

I'll give it a look next year! :)

Martin

On 2012-12-27, at 6:16 AM, Dmitry Bondarenko notifications@github.com wrote:

Currently I'm having a little problem with changing http client transport. It is caused by two facts:

httpClient (ext.go:120) is private
DefaultExtender.Fetch uses it
In my case I needed to change the standard transport to add timeouts:

client.Transport = &http.Transport{
Dial: func(netw, addr string) (net.Conn, error) {
deadline := time.Now().Add(ext.Info.Timeout)
c, err := net.DialTimeout(netw, addr, ext.Info.Timeout)
if err != nil {
return nil, err
}
c.SetDeadline(deadline)
return c, nil
},
},
In current situation I ended up copypasting your default 'Fetch' code (as-is) and 'CheckRedirect' part of your httpClient (as-is). Also, I ended up copypasting your 'isRobotsTxt' func :)

But I think it would be better to allow somehow to change just parts of this logic, without rewriting it all.

For example, if we talk about Transport, it could be OK to add a Transport field to Options and to use it in the instantiation code (probably then it should be moved from package-level vars to DefaultExtender):

httpClient = &http.Client{Transport: options.Transport, CheckRedirect: func(req _http.Request, via []_http.Request) error {
// For robots.txt URLs, allow up to 10 redirects, like the default http client.
// Rationale: the site owner explicitly tells us that this specific robots.txt
// should be used for this domain.
if isRobotsTxtUrl(req.URL) {
if len(via) >= 10 {
return errors.New("stopped after 10 redirects")
}
return nil
}

// For all other URLs, do NOT follow redirections, the default Fetch() implementation
// will ask the worker to enqueue the new (redirect-to) URL. Returning an error
// will make httpClient.Do() return a url.Error, with the URL field containing the new URL.
return &EnqueueRedirectError{"redirection not followed"}

}}
I'm not sure if it is already in your package reorganization plans, I thought that I should submit it just in case.


Reply to this email directly or view it on GitHub.

@mna
Copy link
Member

mna commented Jan 22, 2013

Just to let you know, that's next on my "give-some-love-to-gocrawl" todo list.

@goodsign
Copy link

That's great, thanks!

@mna
Copy link
Member

mna commented Jan 22, 2013

I'm thinking about simply making gocrawl.HttpClient a public variable, just like golang's net/http.DefaultClient variable is public and customizable through its fields. What do you think? It would allow you to set a custom Transport, and it could also allow anyone to set a custom CheckRedirect strategy without necessarily overriding the default Fetch implementation.

It's a dead simple change in the code, it doesn't add bloat to an already crowded Extender interface, and it allows some useful customizations.

@goodsign
Copy link

Personally I like setting this via Options a bit more, because that way I can set different clients for different crawlers.

It is debatable, whether it is ever needed or not, but I feel that it could. For example, I may want to set different transports for crawling different web sites (E.g. different connection deadlines/dial timeouts).

So I vote for making client somehow settable via options. But in that case some 'default client constructor' would be useful, so that I can create clients, based on default one (Construct a default and then make needed changes).

@mna
Copy link
Member

mna commented Jan 23, 2013

I took a closer look at the impact of doing this, and I will opt for the public variable. I think you make a good point that using Options would allow more flexibility, but it requires a bit of tweaking and breaking changes, or a weird API (i.e.: if someone overrides the default Fetch, how can he grab the httpClient passed through the options? The signature would have to be changed to receive it as argument. Or he can refer to the httpClient internally, but then why have a useless one in the Options?). All this for what I think is a minor feature (ability to specify different http clients per crawler). The public variable makes it clear, it is not an Options field, just a default client that can still be customized, is used by the default Fetch(), and can be used by a custom Fetch().

What it doesn't do is allow for various clients per crawler, but if this is absolutely required (which I would assume is not often), then it is already possible by providing a custom Fetch() that can use whatever client it wants, so there's a fallback.

@mna mna closed this as completed in 4eb8be7 Jan 23, 2013
@goodsign
Copy link

Well, I understand your decision, actually it's not that critical and you make a good point about breaking changes. Maybe this degree of flexibility is not worth it.

Public variable is totally okay, and now if I want a crawler with its own custom client I can just create a new client based on yours, (copy CheckRedirect from your HttpClient) and then just rewrite Fetch using it.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants