Hi Julien,

We've made a variety of changes that you may find useful:
1. Solr 4.1.0 support, including for SolrCloud
2. Some more control variables around indexing annotations and metadata
3. A couple of other little things
4. I've moved the LucidWorks integration off to a separate GitHub project:

Note, this merge was a bit hard for me since I was fairly far behind master, but it compiles and all tests pass as a starter. I expect we'll have a bit more work in terms of making.


DigitalPebble Ltd member

Grant, out of curiosity, why did you comment out the "provided" scope on the Hadoop dependency?

DigitalPebble Ltd member

Fixed that by removing the @Override annotations in the TikaProcessor

commit 83d919a

DigitalPebble Ltd member

Fixed in master, thanks!

[master 2b6f959] fixed counter issue for CorpusGenerator
1 files changed, 1 insertions(+), 1 deletions(-)

DigitalPebble Ltd member

Hi Grant. Thanks for sharing this. It is quite an enormous merge, so I'll probably go through the changes and cherry-pick the ones I want to merge back. I saw the new repo for LucidWorks, which makes a lot of sense and will simplify contributing back to Behemoth. I will get back to you with comments / questions on the various points of this patch. Thanks again.

DigitalPebble Ltd member

Quite a few of the differences are due to formatting. There is an eclipse-format.xml file in the Behemoth repo that can be used with Eclipse so that your code follows the same patterns as the main Behemoth repo. That would make it easier to distinguish real differences from formatting ones.

The CommonCrawl-related code could be removed from the IO module as we have a separate repo for it in

Have just pushed 1c32edb to add the timings and apply formatting to all classes.

Will look into the other files a bit later. Thanks!

DigitalPebble Ltd member

Have committed the changes to Solr (with a few fixes along the way) and left some of the Tika stuff out, as it seems to be LW-specific. Pretty much everything else has been added, apart from stuff which has been moved out, like CommonCrawl. Thanks!

@jnioche jnioche closed this Jun 3, 2013
