
HTTP Collector and Hadoop #92

Open
csaezl opened this issue Apr 20, 2015 · 4 comments


csaezl commented Apr 20, 2015

Will HTTP Collector work with Hadoop in the near future?

essiembre (Contributor) commented

Not on the radar... yet. I thought this question would come up earlier, but you are the first to ask! :-)

Our focus has been maximum flexibility/extensibility over maximum quantity. In other words, "how many things can you do with one instance" over "how many docs can you process with multiple instances". That matches what we felt was needed most (and matches the requests we get).

You can still run many instances of the collector in parallel, each with different start URLs or filters, to crawl many millions of pages.
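To illustrate that parallel-instance approach, here is a rough sketch of a per-instance configuration file (the URLs and ids are placeholders, and the exact element names may vary slightly between collector versions):

```xml
<!-- instance-a.xml: one of several independent collector instances.
     A second file (e.g. instance-b.xml) would be identical except for
     its ids and its own slice of start URLs. URLs are placeholders. -->
<httpcollector id="Collector-A">
  <crawlers>
    <crawler id="Crawler-A">
      <!-- Give each instance its own portion of the URL space. -->
      <startURLs>
        <url>http://example.com/section-a/</url>
      </startURLs>
      <!-- Reference filters (not shown) can further keep each instance
           from wandering into another instance's territory. -->
    </crawler>
  </crawlers>
</httpcollector>
```

Each instance is then launched separately (typically with the collector-http launch script pointed at its own configuration file), and the instances crawl their respective URL sets in parallel.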

For cases where you want to crawl the whole internet or just truly massive sites, a distributed crawl environment would indeed be more practical.

Maybe it is time we start thinking about this. Do you want to make this a feature request?

It should not be that hard to modify the collector code so instances can run in a Hadoop cluster and share the processing of tons of URLs. Is that what you envision, or do you have something else in mind? Would you like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration is provided?


csaezl commented Apr 21, 2015

> Maybe it is time we start thinking about this.

Please do. Although not everybody needs it, I think there is a niche for it on Windows.

> Would you like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration is provided?

You are the expert. Whatever you propose sounds good to me. Just keep applying the same "flexibility/extensibility" mindset, now to "maximum quantity".

> Do you want to make this a feature request?

Yes, I do.

essiembre (Contributor) commented

I am marking the integration with Hadoop as a feature request with no set release in mind. I'll pay attention to the demand. Anybody else reading this can chime in if they have a need for it as well.


csaezl commented Apr 21, 2015

Thank you
