
HTTP Collector and Hadoop #92

Open
csaezl opened this issue Apr 20, 2015 · 4 comments


csaezl commented Apr 20, 2015

Will HTTP Collector work with Hadoop in the near future?

essiembre (Contributor) commented

Not on the radar... yet. I thought this question would come up earlier, but you are the first to ask! :-)

Our focus has been maximum flexibility/extensibility over maximum quantity. In other words, "how many things can you do with one instance" over "how many docs can you process with multiple instances". That matches what we felt was needed most (and matches the requests we get).

You can still run many instances of the collector in parallel, each with different start URLs or filters, to crawl many millions of pages.
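To illustrate that parallel-instance approach, here is a rough sketch of a per-instance configuration file (the URLs and ids are placeholders, and the exact element names may vary slightly between collector versions):

```xml
<!-- instance-a.xml: one of several independent collector instances.
     A second file (e.g. instance-b.xml) would be identical except for
     its ids and its own slice of start URLs. URLs are placeholders. -->
<httpcollector id="Collector-A">
  <crawlers>
    <crawler id="Crawler-A">
      <!-- Give each instance its own portion of the URL space. -->
      <startURLs>
        <url>http://example.com/section-a/</url>
      </startURLs>
      <!-- Reference filters (not shown) can further keep each instance
           from wandering into another instance's territory. -->
    </crawler>
  </crawlers>
</httpcollector>
```

Each instance is then launched separately (typically with the collector-http launch script pointed at its own configuration file), and the instances crawl their respective URL sets in parallel.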

For cases where you want to crawl the whole internet or just truly massive sites, a distributed crawl environment would indeed be more practical.

Maybe it is time we start thinking about this. Do you want to make this a feature request?

It should not be that hard to modify the collector code so instances can run in a Hadoop cluster and share the processing of tons of URLs. Is that what you envision, or do you have something else in mind? Would you like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration is provided?


csaezl commented Apr 21, 2015

> Maybe it is time we start thinking about this.

Please do. Although not everybody needs it, I think there is a niche for it on Windows.

> Would you like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration is provided?

You are the expert. Whatever you propose sounds good to me. Just keep applying the same "flexibility/extensibility" mindset, now to "maximum quantity".

> Do you want to make this a feature request?

Yes, I do.

essiembre (Contributor) commented

I am marking the integration with Hadoop as a feature request with no set release in mind. I'll pay attention to the demand. Anybody else reading this can chime in if they have a need for it as well.


csaezl commented Apr 21, 2015

Thank you
