
Stop LinkExtraction when MaxDepth is reached #498

Closed
stephjacq opened this issue Jun 21, 2018 · 4 comments

Comments

@stephjacq

Hello,

I am using the Norconex collector 2.8.0 to crawl web sites.

I need to optimize the processing time. To do so, I have overridden the LinkExtractorStage by adding a simple test before link extraction, so that links are not extracted when the page is already at max depth:

  
    @Override
    public boolean executeStage(HttpImporterPipelineContext ctx) {
        String reference = ctx.getCrawlData().getReference();
        int depth = ctx.getCrawlData().getDepth();
        int maxDepth = ctx.getConfig().getMaxDepth();

        // Skip link extraction entirely for pages already at max depth.
        if (depth == maxDepth) {
            LOG.debug("Max depth reached; not extracting links: " + reference);
            return true; // keep the pipeline going, just without extraction
        }

        // ... original link extraction logic follows ...

I think it can sometimes be useful to always extract links and then reject the ones that are too deep, but in my case it is unnecessary.
I wonder if it would be possible to add a new option to HttpCrawlerConfig to skip link extraction when max depth is reached.
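Something along these lines, where isExtractLinksAtMaxDepth() is only a hypothetical accessor for the proposed option (the name is for illustration, not an existing API):

    @Override
    public boolean executeStage(HttpImporterPipelineContext ctx) {
        int depth = ctx.getCrawlData().getDepth();
        int maxDepth = ctx.getConfig().getMaxDepth();

        // Hypothetical option: when it is off and a bounded max depth is set,
        // skip extraction on pages that already sit at that depth.
        if (!ctx.getConfig().isExtractLinksAtMaxDepth()
                && maxDepth != -1 && depth >= maxDepth) {
            LOG.debug("Max depth reached; skipping link extraction.");
            return true; // continue the pipeline without extracting links
        }
        // ... regular link extraction as it is done today ...
        return true;
    }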

Thanks for your work!

@essiembre
Contributor

Not a bad idea! :-) Marking it as a feature request.

simonwibberley added a commit to CASM-Consulting/collector-http that referenced this issue Apr 29, 2020
simonwibberley added a commit to CASM-Consulting/collector-http that referenced this issue Nov 3, 2020
@simonwibberley

@essiembre I think 8045d19 solves this; however, it was removed from #718.

essiembre added a commit that referenced this issue Nov 9, 2020 (commit message excerpt):
"…documents having reached the max depth. To keep former behavior, use the
new method HttpCrawlerConfig#keepMaxDepthLinks(...). #498."
@essiembre
Contributor

Both the solution in this ticket and the one in #718 applied their logic to end extraction AFTER the extraction had actually been performed, so no processing was saved. The only effect was to prevent the extracted URLs from being queued, but that part is already handled by the queue pipeline, so it had no effect either.

The only benefit I can see of having a configurable option to stop the link extraction stage prematurely is to save the extraction process itself.

For that reason, I instead created a new option called keepMaxDepthLinks in a new 2.x snapshot I just released. By default, the crawler will no longer store/extract URLs on pages having reached the specified max depth (if any). Set this new flag to true to keep the former behavior. This new logic takes place BEFORE extraction actually happens.
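Roughly, in Java configuration terms (the setter is shown here as setKeepMaxDepthLinks following the usual naming convention; the exact accessor may differ, the commit references HttpCrawlerConfig#keepMaxDepthLinks(...)):

    HttpCrawlerConfig cfg = new HttpCrawlerConfig();
    cfg.setMaxDepth(2);              // only meaningful with a bounded max depth
    cfg.setKeepMaxDepthLinks(true);  // former behavior: still extract links on
                                     // pages that have reached the max depth
    // With the default (false), link extraction is skipped entirely on such pages.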

This new configuration option has no effect on crawlers using the default "unlimited" max depth (-1).

I think this addresses both this feature request and #718. Give it a try and please confirm.

@stephjacq
Author

Hi Pascal,

I tested this new option (keepMaxDepthLinks) and it does the job perfectly: fewer logs, less content in the CrawlDataStore, crawl executions finish more quickly, and time is saved on subsequent executions.

Thanks a lot.
