Stop LinkExtraction when MaxDepth is reached #498
Comments
Not a bad idea! :-) Marking it as a feature request.
@essiembre I think 8045d19 solves this; however, it was removed in #718.
…documents having reached the max depth. To keep former behavior, use the new method HttpCrawlerConfig#keepMaxDepthLinks(...). #498
Both the solution in this ticket and the one in #718 were applying their logic to end extraction AFTER the extraction was actually performed, so no processing was saved. The only effect was to prevent extracted URLs from being queued, but that part is already handled by the queue pipeline, so it also had no effect. The only benefit I can see of having a configurable option to stop the link extraction stage prematurely is to save the extraction process itself. For that reason, I instead created a new option called keepMaxDepthLinks. This new configuration option has no effect on crawlers using the default "unlimited" max depth (-1). This, I think, addresses both this feature request and #718. Give it a try and please confirm.
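For reference, a minimal sketch of turning the former behavior back on. Only HttpCrawlerConfig#keepMaxDepthLinks(...) is named in the referenced commit, so the setter name below is an assumption following the collector's usual bean conventions, not a confirmed signature:

```java
// Sketch only: setKeepMaxDepthLinks(...) is an assumed setter name derived
// from the HttpCrawlerConfig#keepMaxDepthLinks(...) reference above.
HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
crawlerConfig.setMaxDepth(3);              // stop crawling past depth 3
crawlerConfig.setKeepMaxDepthLinks(true);  // keep extracting links at max depth
                                           // (the pre-change behavior)
```

Leaving the option at its default should give the behavior this ticket asked for: no link extraction on documents already at max depth.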
Hi Pascal, I tested this new option (keepMaxDepthLinks) and it does the job perfectly: fewer logs, less content in the CrawlDataStore, the crawl finishes more quickly, and time is saved on subsequent executions. Thanks a lot.
Hello,
I am using the Norconex collector 2.8.0 to crawl web sites.
I need to optimize the processing time. To do that, I've overridden the LinkExtractorStage, adding a simple check before link extraction so that links are not extracted when the page is already at max depth:
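The original snippet is not included in this thread; below is a minimal sketch of that check, showing only the decision logic. The surrounding LinkExtractorStage override and its pipeline-context accessors are assumptions and vary by collector version.

```java
// Depth guard evaluated before link extraction in the overridden stage.
// Only the pure decision logic is shown; the stage/context wiring is assumed.
public final class MaxDepthGuard {

    private MaxDepthGuard() {
    }

    /**
     * Returns true when links should still be extracted for a document.
     *
     * @param maxDepth configured max depth (-1 means "unlimited")
     * @param docDepth depth of the current document
     */
    public static boolean shouldExtractLinks(int maxDepth, int docDepth) {
        return maxDepth == -1 || docDepth < maxDepth;
    }
}
```

In the overridden stage, link extraction is simply skipped (and the pipeline continues) whenever shouldExtractLinks(...) returns false.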
I think it can sometimes be useful to always extract links and then reject the ones that are too deep, but in my case it's useless.
I wonder if it would be possible to add a new feature to HttpCrawlerConfig to avoid link extraction when the max depth is reached.
Thanks for your work!