Graceful & Proper Termination, both internally and via Extender #73

Open
Kleissner opened this issue Oct 27, 2019 · 2 comments

@Kleissner

The current stop functionality via Crawler.Stop is insufficient for several reasons:

  • Calling Stop twice results in a panic (because it would close c.stop twice); see the minimal sketch after this list.
  • Inside the functions Extender.Visit and Extender.Error there is currently no way to terminate the crawler. Often, though, terminating the crawler is genuinely necessary, for example when certain limits are reached (such as a maximum number of visited URLs, tracked in a custom Extender.Visit function).
  • In Crawler.collectUrls, when res.idleDeath is true, it may delete workers. However, if there are 0 workers left, it does NOT terminate the entire crawler, leaving it in limbo (a memory leak). I consider this a bug.
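
To make the first point concrete, here is a minimal standalone sketch (using a hypothetical crawler type, not gocrawl's actual code) of the underlying Go behavior: calling close on an already-closed channel always panics.

```go
package main

import "fmt"

type crawler struct {
	stop chan struct{}
}

// Stop signals termination by closing the stop channel. Without a guard,
// a second call reaches close on an already-closed channel, which panics.
func (c *crawler) Stop() {
	close(c.stop)
}

func main() {
	c := &crawler{stop: make(chan struct{})}
	c.Stop()

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("second Stop panicked:", r) // "close of closed channel"
		}
	}()
	c.Stop() // panics
}
```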

I'll create a pull request with a fix. Since it affects the Extender functions, it will break compatibility, but that's a good thing here.

@Kleissner (Author)

I've committed the code into my fork: IntelligenceX@089a8ad

I will create the PR once the other two PRs are handled.

@mna (Member) commented Oct 27, 2019

Hello Peter,

> Calling Stop twice results in a panic (because it would close c.stop twice)

It doesn't panic (outside the boundaries of the package); instead, it calls the ErrorLog if Stop is called more than once.
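
For reference, a common way to make a Stop method tolerant of repeated calls is to guard the close and report extra calls instead of panicking. The sketch below shows the general pattern only; it is not gocrawl's actual implementation, and log.Println stands in for the ErrorLog call.

```go
package main

import (
	"log"
	"sync"
)

type crawler struct {
	stop     chan struct{}
	stopOnce sync.Once
}

// Stop closes the stop channel exactly once; later calls are reported
// instead of panicking on a double close.
func (c *crawler) Stop() {
	closed := false
	c.stopOnce.Do(func() {
		close(c.stop)
		closed = true
	})
	if !closed {
		log.Println("Stop called more than once") // stand-in for the ErrorLog call
	}
}

func main() {
	c := &crawler{stop: make(chan struct{})}
	c.Stop()
	c.Stop() // logs instead of panicking
}
```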

> Inside the functions Extender.Visit and Extender.Error there is currently no way to terminate the crawler. Often, though, terminating the crawler is genuinely necessary, for example when certain limits are reached

I agree that would be useful.

> In Crawler.collectUrls, when res.idleDeath is true, it may delete workers. However, if there are 0 workers left, it does NOT terminate the entire crawler, leaving it in limbo (a memory leak). I consider this a bug.

I'm not sure I understand that part; do you have a minimal reproducible example? I'd like to get it fixed in the v1 tree if there is indeed a leak/bug there.

As for the changes, I took a look at your fork, but if we're going to break compatibility in a v2 to better support termination, I'd prefer to use the context.Context approach that is now standard in Go instead of passing around channels to close and cancellation functions to call. I think something like calling Crawler.Run(ctx, seeds) and having that context propagate through all the extender methods (where that makes sense) would go a long way toward making this cleaner and more idiomatic. That only addresses responding to cancellation, though, not initiating it (your second point); that should be taken into account in the proposed changes too.
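
As a rough illustration of the shape being discussed (the Run and Visit signatures below are hypothetical, not an actual gocrawl v2 API), caller-initiated termination then becomes plain context usage:

```go
package main

import (
	"context"
	"net/http"
	"net/url"
	"time"
)

// Extender methods would receive the context so they can react to cancellation.
// The exact parameters here are illustrative only.
type Extender interface {
	Visit(ctx context.Context, res *http.Response, doc interface{}) (harvested []*url.URL, findLinks bool)
	Error(ctx context.Context, err error)
}

type Crawler struct{}

// Run replaces the stop-channel plumbing: cancelling ctx terminates the crawl.
func (c *Crawler) Run(ctx context.Context, seeds ...string) error {
	for _, seed := range seeds {
		select {
		case <-ctx.Done():
			return ctx.Err() // caller requested termination; stop gracefully
		default:
			_ = seed // fetching the seed and calling the extender would happen here
		}
	}
	return nil
}

func main() {
	// A caller-side limit (e.g. a maximum crawl duration) is just a derived context.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	var c Crawler
	_ = c.Run(ctx, "https://example.com")
}
```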

Note that I haven't looked at / used gocrawl in a very long while, so those are just general comments and I'm not sure how well that could be integrated. Happy to bounce ideas about this with you and anyone who cares about this, though!

Thanks,
Martin
