HTTP Collector and Hadoop #92
Will HTTP Collector work with Hadoop in the near future?
Comments
Not on the radar... yet. I thought this question would come earlier, but you are the first one to ask! :-)

Our focus has been maximum flexibility/extensibility over maximum quantity. In other words, "how many things can you do with one instance" over "how many docs can you process with multiple instances". That matches what we felt was needed most (and matches the requests we get). You can still run many instances of the collector in parallel with different start URLs or filters to crawl many millions of pages. For cases where you want to crawl the whole internet, or just truly massive sites, a distributed crawl environment would indeed be more practical. Maybe it is time we start thinking about this. Do you want to make this a feature request?

It should not be that hard to modify the collector code so that instances can run in a Hadoop cluster and share the processing of tons of URLs. Is that what you envision, or do you have something else in mind? Would you also like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration it is given?
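To make the idea above concrete, here is a minimal, hypothetical sketch (not Norconex code) of how seed URLs might be partitioned across a Hadoop MapReduce cluster: each map task receives a slice of the seed list and would hand it to its own crawler instance. The `runCollectorOn` method is a placeholder standing in for whatever embedded collector call such an integration would expose; everything else uses standard Hadoop APIs.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedCrawlSketch {

    // Each map task gets a fixed number of seed URLs (one per input line)
    // and crawls only that partition.
    public static class CrawlMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final List<String> seeds = new ArrayList<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // Collect this task's share of the seed list.
            seeds.add(value.toString().trim());
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Hypothetical call: a real integration would configure and start
            // one HTTP Collector instance with 'seeds' as its start URLs.
            for (String url : runCollectorOn(seeds)) {
                context.write(new Text(url), NullWritable.get());
            }
        }

        // Placeholder for an embedded collector run; it simply echoes the
        // seeds so this sketch stays self-contained and runnable.
        private List<String> runCollectorOn(List<String> seedUrls) {
            return seedUrls;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give each map task 1000 seed URLs from the input file.
        conf.setInt(NLineInputFormat.LINES_PER_MAP, 1000);

        Job job = Job.getInstance(conf, "distributed-crawl-sketch");
        job.setJarByClass(DistributedCrawlSketch.class);
        job.setMapperClass(CrawlMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(NLineInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        NLineInputFormat.addInputPath(job, new Path(args[0]));  // seed list
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // crawl report
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With zero reduce tasks, each mapper's output is written straight to the output directory, so the per-partition results need no aggregation step; URL deduplication across partitions is the part a real integration would still have to solve.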
Please do. Although not everybody needs it, I think there is a niche on Windows.
You are the expert. Everything you propose sounds good to me. Just keep applying the same "flexibility/extensibility" approach, now to "maximum quantity".
Yes, I do.
I am marking the integration with Hadoop as a feature request with no set release in mind. I'll pay attention to the demand. Anybody else reading this can chime in if they have a need for it as well.
Thank you |