Graceful & Proper Termination, both internally and via Extender #73

Open
Kleissner opened this issue Oct 27, 2019 · 2 comments

@Kleissner

The current stop functionality via Crawler.Stop is insufficient for several reasons:

  • Calling Stop twice results in a panic (because it would close c.stop twice); see the minimal sketch after this list.
  • Inside the functions Extender.Visit and Extender.Error there is currently no way to terminate the crawler. Often, though, terminating the crawler is genuinely necessary, for example when certain limits are reached (such as a maximum number of visited URLs, tracked in a custom Extender.Visit function).
  • In Crawler.collectUrls, when res.idleDeath is true, it may delete workers. However, if there are 0 workers left, it does NOT terminate the entire crawler, leaving it in limbo (a memory leak). I consider this a bug.
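
To make the first point concrete, here is a minimal standalone sketch (using a hypothetical crawler type, not gocrawl's actual code) of the underlying Go behavior: calling close on an already-closed channel always panics.

```go
package main

import "fmt"

type crawler struct {
	stop chan struct{}
}

// Stop signals termination by closing the stop channel. Without a guard,
// a second call reaches close on an already-closed channel, which panics.
func (c *crawler) Stop() {
	close(c.stop)
}

func main() {
	c := &crawler{stop: make(chan struct{})}
	c.Stop()

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("second Stop panicked:", r) // "close of closed channel"
		}
	}()
	c.Stop() // panics
}
```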

I'll create a pull request with a fix. Since it affects the Extender functions, it will break compatibility, but that's a good thing here.

@Kleissner (Author)

I've committed the code into my fork: IntelligenceX@089a8ad

I will create the PR once the other two PRs are handled.

@mna (Member) commented Oct 27, 2019

Hello Peter,

> Calling Stop twice results in a panic (because it would close c.stop twice)

It doesn't panic (outside the boundaries of the package); instead, it calls the ErrorLog if Stop is called more than once.
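
For reference, a common way to make a Stop method tolerant of repeated calls is to guard the close and report extra calls instead of panicking. The sketch below shows the general pattern only; it is not gocrawl's actual implementation, and log.Println stands in for the ErrorLog call.

```go
package main

import (
	"log"
	"sync"
)

type crawler struct {
	stop     chan struct{}
	stopOnce sync.Once
}

// Stop closes the stop channel exactly once; later calls are reported
// instead of panicking on a double close.
func (c *crawler) Stop() {
	closed := false
	c.stopOnce.Do(func() {
		close(c.stop)
		closed = true
	})
	if !closed {
		log.Println("Stop called more than once") // stand-in for the ErrorLog call
	}
}

func main() {
	c := &crawler{stop: make(chan struct{})}
	c.Stop()
	c.Stop() // logs instead of panicking
}
```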

> Inside the functions Extender.Visit and Extender.Error there is currently no way to terminate the crawler. Often, though, terminating the crawler is genuinely necessary, for example when certain limits are reached

I agree that would be useful.

> In Crawler.collectUrls, when res.idleDeath is true, it may delete workers. However, if there are 0 workers left, it does NOT terminate the entire crawler, leaving it in limbo (a memory leak). I consider this a bug.

I'm not sure I understand that part; do you have a minimal reproducible example? I'd like to get it fixed in the v1 tree if there is indeed a leak/bug there.

As for the changes, I took a look at your fork, but if we're going to break compatibility in a v2 to better support termination, I'd prefer to use the context.Context approach that is now standard in Go instead of passing around channels to close and cancellation functions to call. I think something like calling Crawler.Run(ctx, seeds) and having that context propagate through all the extender methods (where that makes sense) would go a long way toward making this cleaner and more idiomatic. That only addresses responding to cancellation, though, not initiating it (your second point); that should be taken into account in the proposed changes too.
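
As a rough illustration of the shape being discussed (the Run and Visit signatures below are hypothetical, not an actual gocrawl v2 API), caller-initiated termination then becomes plain context usage:

```go
package main

import (
	"context"
	"net/http"
	"net/url"
	"time"
)

// Extender methods would receive the context so they can react to cancellation.
// The exact parameters here are illustrative only.
type Extender interface {
	Visit(ctx context.Context, res *http.Response, doc interface{}) (harvested []*url.URL, findLinks bool)
	Error(ctx context.Context, err error)
}

type Crawler struct{}

// Run replaces the stop-channel plumbing: cancelling ctx terminates the crawl.
func (c *Crawler) Run(ctx context.Context, seeds ...string) error {
	for _, seed := range seeds {
		select {
		case <-ctx.Done():
			return ctx.Err() // caller requested termination; stop gracefully
		default:
			_ = seed // fetching the seed and calling the extender would happen here
		}
	}
	return nil
}

func main() {
	// A caller-side limit (e.g. a maximum crawl duration) is just a derived context.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	var c Crawler
	_ = c.Run(ctx, "https://example.com")
}
```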

Note that I haven't looked at / used gocrawl in a very long while, so those are just general comments and I'm not sure how well that could be integrated. Happy to bounce ideas about this with you and anyone who cares about this, though!

Thanks,
Martin
