HTTPCollector duplicates listeners when multiple crawlers have set them #784

Open
dutsuwak opened this issue May 2, 2022 · 1 comment
@dutsuwak

dutsuwak commented May 2, 2022

Hello!

I have been running some tests in a setup where multiple crawlers are each configured with a listener for a crawl event. When the HttpCrawlerConfigs are added to the HttpCollector, the listeners appear to be duplicated, so the logic in my program gets invoked multiple times.

Simplified example:

HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
List<HttpCrawlerConfig> httpCrawlerConfigs = new ArrayList<>();

for (int i = 0; i < urlsList.length; i++) {
    var httpCrawlerConfig = new HttpCrawlerConfig();
    httpCrawlerConfig.setEventListeners(new CrawlEventListener());
    httpCrawlerConfigs.add(httpCrawlerConfig);
}

collectorConfig.setCrawlerConfigs(
        httpCrawlerConfigs.toArray(new HttpCrawlerConfig[0]));

// From the debugging I did, it seems to happen when the collector scans the
// crawler configs here and duplicates the listeners in the event manager.
var collector = new HttpCollector(collectorConfig);
collector.start();

As a workaround I set the listener only on the first HttpCrawlerConfig, but I think it should be possible to use a separate listener for each crawler.

Regards,
Fabian

@essiembre
Contributor

Hello Fabian!

Technically, the listeners are not duplicated; rather, each listener is invoked for ALL events fired within the collector whose type matches your listener's "accept" method argument. That includes events fired by other crawlers.

This is by design, as there may be legitimate cases where one crawler wants to know what is happening in another crawler. I understand it is not the most intuitive behaviour, though.
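To illustrate what is happening, here is a minimal, self-contained sketch. The class names are hypothetical stand-ins, NOT the real Norconex API: it shows why a listener registered on one crawler's config still sees every crawler's events once everything funnels through the collector's shared event manager, and how filtering on the event's source restores per-crawler isolation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins (NOT the Norconex API) for a collector-wide
// event manager that dispatches every event to every registered listener.
public class ListenerScopeDemo {

    // A crawl event tagged with the crawler that fired it.
    record Event(String source, String name) {}

    // One event manager shared by the whole collector: every registered
    // listener receives every event, regardless of which crawler fired it.
    static class EventManager {
        private final List<Consumer<Event>> listeners = new ArrayList<>();
        void addListener(Consumer<Event> l) { listeners.add(l); }
        void fire(Event e) { listeners.forEach(l -> l.accept(e)); }
    }

    // Counts how many events each listener style sees when two crawlers
    // each fire one event through the shared manager.
    static int[] runDemo() {
        EventManager shared = new EventManager();
        int[] seen = new int[2]; // seen[0] = unfiltered, seen[1] = filtered

        // Listener intended for crawler-1, but attached to the shared
        // manager: it sees events from every crawler.
        shared.addListener(e -> seen[0]++);

        // Same idea with a source filter, isolating it to crawler-1:
        shared.addListener(e -> {
            if ("crawler-1".equals(e.source())) {
                seen[1]++;
            }
        });

        shared.fire(new Event("crawler-1", "CRAWLER_RUN_END"));
        shared.fire(new Event("crawler-2", "CRAWLER_RUN_END"));
        return seen;
    }

    public static void main(String[] args) {
        int[] seen = runDemo();
        System.out.println("unfiltered=" + seen[0]); // prints unfiltered=2
        System.out.println("filtered=" + seen[1]);   // prints filtered=1
    }
}
```

In the meantime, checking the event's source inside your own listener is one way to approximate per-crawler scoping.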

Since it is possible to configure event listeners at both the collector level and the crawler level, it would make sense to imply an event hierarchy there and isolate crawler-level listeners from other crawlers' events when they are registered for a specific crawler only.

Since there are valid use cases for both approaches, I think we need to make it more flexible and offer an easy way to adjust the listening scope and maybe change the default behaviour to the most intuitive one.

I will mark this as a feature request.
