SimpleFetcherBolt to send URLs back to its own queue if time to wait above threshold #582
The SimpleFetcherBolt is less complex than the standard FetcherBolt: it does not hold internal fetch queues and instead relies on having many instances (threads managed by Storm). However, its performance is usually worse because it enforces politeness by sleeping for the necessary amount of time, which in effect prevents it from processing URLs from other servers in the meantime.
What we can do is send any tuple whose wait time is above a certain threshold back to the bolt's own queue. This would have the advantage of moving on more quickly to a URL from a different server, but a possible drawback is that a tuple could time out if it gets sent to the back of the queue too often.
By default, the threshold would be set to -1, meaning that the existing behaviour would be preserved and all delays would be slept.
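To make the proposal concrete, here is a minimal sketch of the decision rule and its effect on fetch order. It is not StormCrawler code and uses no Storm APIs. The class `RequeueSketch`, the methods `shouldRequeue` and `simulate`, the virtual clock, and the 1 ms cost charged for each re-queue are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RequeueSketch {

    /** Hypothetical rule: re-queue only if the threshold is enabled and exceeded. */
    static boolean shouldRequeue(long waitMs, long thresholdMs) {
        // A negative threshold (the proposed default of -1) keeps the existing
        // behaviour: every delay is slept.
        return thresholdMs >= 0 && waitMs > thresholdMs;
    }

    /**
     * Simulates one bolt instance with a virtual clock instead of real sleeps.
     * Each URL is a {host, path} pair; the result lists URLs in fetch order.
     */
    static List<String> simulate(List<String[]> urls, long delayMs, long thresholdMs) {
        Deque<String[]> queue = new ArrayDeque<>(urls);
        Map<String, Long> nextAllowed = new HashMap<>(); // host -> earliest next fetch time
        List<String> fetched = new ArrayList<>();
        long now = 0;
        while (!queue.isEmpty()) {
            String[] url = queue.poll();
            long wait = Math.max(0, nextAllowed.getOrDefault(url[0], 0L) - now);
            if (shouldRequeue(wait, thresholdMs)) {
                queue.add(url); // back of the queue, try another URL first
                now += 1;       // assumed cost of one re-queue, so time still advances
                continue;
            }
            now += wait;        // stands in for sleeping the politeness delay
            fetched.add(url[0] + url[1]);
            nextAllowed.put(url[0], now + delayMs);
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<String[]> urls = Arrays.asList(
                new String[] {"a.example", "/1"},
                new String[] {"a.example", "/2"},
                new String[] {"b.example", "/1"});
        System.out.println(simulate(urls, 1000, -1));  // threshold disabled
        System.out.println(simulate(urls, 1000, 100)); // re-queue if wait > 100 ms
    }
}
```

In this toy run, a threshold of -1 fetches the URLs in their original order, sleeping between the two `a.example` fetches, whereas a threshold of 100 ms lets `b.example/1` be fetched before `a.example/2`. The sketch does not model the tuple timeout mentioned above; in a real topology each pass through the queue would count against that timeout.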