Purge internal queues of tuples which have already reached time out #564

Closed
jnioche opened this Issue Apr 18, 2018 · 1 comment

Comments

@jnioche
Member

jnioche commented Apr 18, 2018

URLs can sit in the internal queues of the FetcherBolt for longer than the value set in topology.message.timeout.secs. This happens, for instance, when there aren't enough fetching threads or when the server corresponding to the queue is slow. By the time such a URL gets fetched, its tuple will already have been failed by Storm. Even with ES, where es.status.ttl.purgatory delays an acked or failed URL before it is allowed back through the topology, the same URL can re-enter the queues later on. We could deduplicate the URLs in the queues, but it is probably better and simpler to purge them once they have gone over the timeout. The URLs behind them in the queue will then also have less time to wait.
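A minimal sketch of the purging idea, using hypothetical class and field names (this is not the actual FetcherBolt code): each queue entry records when its tuple was enqueued, and anything older than topology.message.timeout.secs is dropped instead of handed to a fetch thread, since Storm will already have failed its tuple.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch only: a per-host fetch queue that purges entries whose tuples
 *  have outlived the topology message timeout. */
public class TimeoutQueue {

    /** A queued URL together with the time its tuple entered the queue. */
    static final class Entry {
        final String url;
        final long enqueuedMs;
        Entry(String url, long enqueuedMs) {
            this.url = url;
            this.enqueuedMs = enqueuedMs;
        }
    }

    private final Deque<Entry> queue = new ArrayDeque<>();
    private final long timeoutMs;

    TimeoutQueue(long timeoutSecs) {
        // mirrors topology.message.timeout.secs
        this.timeoutMs = timeoutSecs * 1000L;
    }

    void add(String url, long nowMs) {
        queue.addLast(new Entry(url, nowMs));
    }

    /** Returns the next URL still within the timeout, silently purging
     *  expired ones: their tuples have already been failed by Storm, so
     *  fetching them would be wasted work. */
    String next(long nowMs) {
        while (!queue.isEmpty()) {
            Entry e = queue.pollFirst();
            if (nowMs - e.enqueuedMs <= timeoutMs) {
                return e.url;
            }
            // expired: drop without fetching
        }
        return null;
    }

    public static void main(String[] args) {
        TimeoutQueue q = new TimeoutQueue(300); // 5-minute timeout
        q.add("http://example.com/old", 0L);
        q.add("http://example.com/fresh", 290_000L);
        // At t = 310 s the first entry has exceeded the timeout and is purged.
        System.out.println(q.next(310_000L)); // prints http://example.com/fresh
    }
}
```

Passing the clock in explicitly keeps the sketch testable; the real bolt would use the current time when a fetch thread polls its queue.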

@sebastian-nagel


Collaborator

sebastian-nagel commented Apr 18, 2018

+1

es.status.ttl.purgatory must be set close to fetcher.max.crawl.delay * fetcher.max.queue.size to reliably avoid duplicate fetches for sites which request slow crawling. That's usually a large time span (e.g. 20 min.), so it would be good if this could be set to a lower value.
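As a quick worked example of that product (the numbers are illustrative, not StormCrawler defaults):

```java
public class PurgatoryEstimate {
    public static void main(String[] args) {
        // Illustrative values only, not actual defaults:
        long maxCrawlDelaySecs = 30; // fetcher.max.crawl.delay
        int maxQueueSize = 40;       // fetcher.max.queue.size

        // Worst case: a full queue fetched at the maximum crawl delay,
        // so the last URL waits roughly delay * queue size.
        long purgatorySecs = maxCrawlDelaySecs * maxQueueSize;
        System.out.println(purgatorySecs + " s = "
                + (purgatorySecs / 60) + " min");
        // prints: 1200 s = 20 min
    }
}
```

This matches the "20 min." figure above; purging timed-out tuples from the queues would let the purgatory be shortened without risking duplicate fetches.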

@jnioche jnioche closed this May 2, 2018
