Crawl timeout #44

Closed
juliend2 opened this issue Oct 2, 2015 · 4 comments

Comments

@juliend2
Contributor

juliend2 commented Oct 2, 2015

I would like crawls to stop after 1 hour, because some websites have an effectively infinite number of URLs.
So I think it would be great to have a CrawlTimeout option, or an Extender function like ShouldFinishCrawl() bool.

Note: I use it with SameHostOnly = true, to crawl only one site.

@mna
Member

mna commented Oct 2, 2015

Hi Julien,

You can set the MaxVisits option if you know how many pages you want to visit. You can also stop returning URLs to visit from your Visit method, so that no new links are added past some point.

I believe you can also call Crawler.Stop() after some time? I'm not sure how well that works to be honest, haven't used gocrawl in a long time.

HTH,
Martin

mna closed this as completed Oct 2, 2015
@juliend2
Contributor Author

juliend2 commented Oct 2, 2015

Thanks for your quick response!
I tried your solution of returning false from Visit (I also tried Filter), but it continues to loop.

2015/10/02 13:52:39 worker 1 - popped: http://www.residences-quebec.ca/fr/categorie-service/prise-de-sang-a-domicile?cat=31
2015/10/02 13:52:39 worker 1 - waiting for crawl delay
2015/10/02 13:52:40 worker 1 - using crawl-delay: 1s
STOPPING!
2015/10/02 13:52:40 worker 1 - popped: http://www.residences-quebec.ca/fr/categorie-service/soins-des-pieds?cat=32
2015/10/02 13:52:40 worker 1 - waiting for crawl delay
2015/10/02 13:52:41 worker 1 - using crawl-delay: 1s
STOPPING!
2015/10/02 13:52:42 worker 1 - popped: http://www.residences-quebec.ca/fr/categorie-service/soins-palliatifs-a-domicile?cat=33
2015/10/02 13:52:42 worker 1 - waiting for crawl delay
2015/10/02 13:52:43 worker 1 - using crawl-delay: 1s
STOPPING!

The "STOPPING!" string is printed from both Filter and Visit:

    if time.Now().After(timeoutStartTime.Add(time.Minute)) {
        fmt.Println("STOPPING!")
        return false
    }

So my question now is: how can I access the Crawler instance from the Visit or Filter method, so that I can call Stop() on it?

BTW, many thanks for gocrawl, I love it so far!

@juliend2
Contributor Author

juliend2 commented Oct 2, 2015

Ah, I see now; it's the value returned by NewCrawlerWithOptions(opts). Thanks!

@juliend2
Contributor Author

juliend2 commented Oct 2, 2015

So I ended up doing this:

    ...
    c := gocrawl.NewCrawlerWithOptions(opts)
    // New code:
    go func(crawler *gocrawl.Crawler) {
        time.Sleep(1 * time.Hour)
        crawler.Stop()
    }(c)
    // End of new code
    c.Run(baseUrl)
    ...
