
Stop LinkExtraction when MaxDepth is reached #498

Closed
stephjacq opened this issue Jun 21, 2018 · 4 comments

Comments

@stephjacq

Hello,

I am using the Norconex collector 2.8.0 to crawl web sites.

I need to optimize the processing time. To do so, I have overridden the LinkExtractorStage by adding a simple test before link extraction, so that links are not extracted when the page is already at max depth:

  
    @Override
    public boolean executeStage(HttpImporterPipelineContext ctx) {
        String reference = ctx.getCrawlData().getReference();
        int depth = ctx.getCrawlData().getDepth();
        int maxDepth = ctx.getConfig().getMaxDepth();

        // Skip link extraction entirely for pages already at max depth.
        if (depth == maxDepth) {
            LOG.debug("Max depth reached; not extracting links: " + reference);
            return true; // keep the pipeline going, just without extraction
        }

        // ... original link extraction logic follows ...

I think it can sometimes be useful to always extract links and then reject the ones that are too deep, but in my case it is unnecessary.
I wonder if it would be possible to add a new option to HttpCrawlerConfig to skip link extraction when max depth is reached.
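Something along these lines, where isExtractLinksAtMaxDepth() is only a hypothetical accessor for the proposed option (the name is for illustration, not an existing API):

    @Override
    public boolean executeStage(HttpImporterPipelineContext ctx) {
        int depth = ctx.getCrawlData().getDepth();
        int maxDepth = ctx.getConfig().getMaxDepth();

        // Hypothetical option: when it is off and a bounded max depth is set,
        // skip extraction on pages that already sit at that depth.
        if (!ctx.getConfig().isExtractLinksAtMaxDepth()
                && maxDepth != -1 && depth >= maxDepth) {
            LOG.debug("Max depth reached; skipping link extraction.");
            return true; // continue the pipeline without extracting links
        }
        // ... regular link extraction as it is done today ...
        return true;
    }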

Thanks for your work!

@essiembre
Contributor

Not a bad idea! :-) Marking it as a feature request.

simonwibberley added a commit to CASM-Consulting/collector-http that referenced this issue Apr 29, 2020
simonwibberley added a commit to CASM-Consulting/collector-http that referenced this issue Nov 3, 2020
@simonwibberley

@essiembre I think 8045d19 solves this; however, it was removed from #718.

essiembre added a commit that referenced this issue Nov 9, 2020 (commit message excerpt):
"…documents having reached the max depth. To keep former behavior, use the
new method HttpCrawlerConfig#keepMaxDepthLinks(...). #498."
@essiembre
Contributor

Both the solution in this ticket and the one in #718 applied their logic to end extraction AFTER the extraction had actually been performed, so no processing was saved. The only effect was to prevent the extracted URLs from being queued, but that part is already handled by the queue pipeline, so it had no effect either.

The only benefit I can see of having a configurable option to stop the link extraction stage prematurely is to save the extraction process itself.

For that reason, I instead created a new option called keepMaxDepthLinks in a new 2.x snapshot I just released. By default, the crawler will no longer store/extract URLs on pages having reached the specified max depth (if any). Set this new flag to true to keep the former behavior. This new logic takes place BEFORE extraction actually happens.
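Roughly, in Java configuration terms (the setter is shown here as setKeepMaxDepthLinks following the usual naming convention; the exact accessor may differ, the commit references HttpCrawlerConfig#keepMaxDepthLinks(...)):

    HttpCrawlerConfig cfg = new HttpCrawlerConfig();
    cfg.setMaxDepth(2);              // only meaningful with a bounded max depth
    cfg.setKeepMaxDepthLinks(true);  // former behavior: still extract links on
                                     // pages that have reached the max depth
    // With the default (false), link extraction is skipped entirely on such pages.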

This new configuration option has no effect on crawlers using the default "unlimited" max depth (-1).

I think this addresses both this feature request and #718. Give it a try and please confirm.

@stephjacq
Author

Hi Pascal,

I tested this new option (keepMaxDepthLinks) and it does the job perfectly: fewer logs, less content in the CrawlDataStore, crawl executions finish more quickly, and time is saved on subsequent executions.

Thanks a lot.
