Solr 4.1.0 and various other updates #43

wants to merge 65 commits into from

4 participants


Hi Julien,

We've made a variety of changes that you may find useful:
1. Solr 4.1.0 support, including for SolrCloud
2. Some more control variables around indexing annotations and metadata
3. A couple of other little things
4. I've moved the LucidWorks integration off to a separate GitHub project:

Note: this merge was a bit hard for me since I was fairly far behind master, but it compiles and all tests pass, which is a start. I expect we'll have a bit more work in terms of making.


DigitalPebble Ltd member

Grant, out of curiosity, why did you comment out "provided" as the scope of the Hadoop dependency?

DigitalPebble Ltd member

Fixed that by removing the @Override annotations in the TikaProcessor

commit 83d919a

DigitalPebble Ltd member

Fixed in master, thanks!

[master 2b6f959] fixed counter issue for CorpusGenerator
1 files changed, 1 insertions(+), 1 deletions(-)

gsingers and others added some commits Mar 29, 2012
@gsingers gsingers new solr b1de721
@gsingers gsingers Start to incorporate Common Crawl 6aa3c83
@gsingers gsingers merge ae1063b
@gsingers gsingers merge 82ee3c3
@gsingers gsingers new solrj 3fe985e
@gsingers gsingers Merge branches 'master' and 'COMMON_CRAWL' into COMMON_CRAWL b8b9318
@gsingers gsingers updates to common crawl, still needs testing f4a68d0
@gsingers gsingers Merge branches 'COMMON_CRAWL' and 'LWE' into LWE c4fb63e
@gsingers gsingers input format is in gzip 30dcbe7
@gsingers gsingers update to libs 3a99530
@gsingers gsingers Merge branch 'master' of git:// 30283fd
@gsingers gsingers Merge branches 'master' and 'LWE' 82e288d
@gsingers gsingers add in standardized timing info around all job calls 07a1445
@gsingers gsingers solrj lib sync 61a504a
@gsingers gsingers add controls for metadata and annotations 0dc4456
@gsingers gsingers merge 4020953
@gsingers gsingers Merge branch 'master' of git:// 6a4ad89
@gsingers gsingers annotations are now optional for Tika 4158a10
@gsingers gsingers more tweaks to better handle metadata and annotations ed20f84
@gsingers gsingers updates e74c4c3
@mumrah mumrah Updating solr-solrj and adding a fix for LucidWorkWriter
LucidWorkWriter was not calling close/shutdown on the SolrServer and allowing
resources to leak (specifically ZooKeeper connection in the CloudSolrServer)
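To illustrate the leak described in the commit message above, here is a minimal sketch of a writer that releases its client when closed. It does not use SolrJ: `FakeClient`, `SketchWriter`, and `indexAll` are hypothetical stand-ins invented for this example, and the real LucidWorkWriter fix may look different.

```java
import java.util.ArrayList;
import java.util.List;

class WriterCloseSketch {

    /** Hypothetical stand-in for a search client that holds a connection. */
    static class FakeClient {
        boolean open = true;
        final List<String> sent = new ArrayList<>();

        void add(String doc) {
            if (!open) {
                throw new IllegalStateException("client already shut down");
            }
            sent.add(doc);
        }

        /** Releases the (simulated) connection. */
        void shutdown() {
            open = false;
        }
    }

    /** Writer that owns the client and shuts it down in close(). */
    static class SketchWriter implements AutoCloseable {
        private final FakeClient client;

        SketchWriter(FakeClient client) {
            this.client = client;
        }

        void write(String doc) {
            client.add(doc);
        }

        @Override
        public void close() {
            // Without this call the client's resources would stay allocated.
            client.shutdown();
        }
    }

    /** Writes all documents; try-with-resources guarantees close() runs. */
    static FakeClient indexAll(List<String> docs) {
        FakeClient client = new FakeClient();
        try (SketchWriter writer = new SketchWriter(client)) {
            for (String doc : docs) {
                writer.write(doc);
            }
        }
        return client;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("doc-1");
        docs.add("doc-2");
        FakeClient client = indexAll(docs);
        System.out.println(client.sent.size() + " docs sent, open=" + client.open);
    }
}
```

The point of the design is that the writer, not the caller, is responsible for the client's lifetime, so the shutdown happens even if a write throws.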
@mumrah mumrah Merge branch 'lwe' into gsingers-master
@iprovalo iprovalo SDA-177: safe refactored writer class in order to add couple of tests e4dbad2
@iprovalo iprovalo Merge branch 'master' of deed713
@gsingers gsingers mahout 0.7 e6dadf7
@mumrah mumrah Fix Annotation->Solr field mappings
There were some problems with the existing config parsing. The updated logic
allows for multiple mappings to be defined for an annotation type and/or Solr
field. Also, the SolrDocument code was using setField instead of addField
effectively eliminating the possibility of multi-valued fields.

Now,

= spam.eggs

will create a mapping for the "eggs" feature of the "spam" annotation type to
the Solr field "foo".

= spam.*
= spam

will both map the text from the "spam" annotation to the Solr field "foo".
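The difference between set-style and add-style field updates, and the idea of several mappings per Solr field, can be sketched with plain Java collections. This is an illustration only: `Doc` and `groupMappings` are hypothetical names, the configuration key format is not reproduced here, and the real SolrDocument code in Behemoth may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldMappingSketch {

    /** Hypothetical stand-in for a document with possibly multi-valued fields. */
    static class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        /** Replaces any existing values, so the field ends up single-valued. */
        void setField(String name, String value) {
            List<String> values = new ArrayList<>();
            values.add(value);
            fields.put(name, values);
        }

        /** Appends to existing values, so the field can be multi-valued. */
        void addField(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /**
     * Groups {solrField, annotationSpec} pairs by Solr field, keeping every
     * spec instead of letting a later entry overwrite an earlier one.
     */
    static Map<String, List<String>> groupMappings(String[][] entries) {
        Map<String, List<String>> byField = new LinkedHashMap<>();
        for (String[] entry : entries) {
            byField.computeIfAbsent(entry[0], k -> new ArrayList<>()).add(entry[1]);
        }
        return byField;
    }

    public static void main(String[] args) {
        // Two mappings that target the same Solr field "foo".
        String[][] entries = { { "foo", "spam.eggs" }, { "foo", "spam" } };
        System.out.println(groupMappings(entries));

        Doc replaced = new Doc();
        replaced.setField("foo", "first");
        replaced.setField("foo", "second"); // "first" is lost

        Doc appended = new Doc();
        appended.addField("foo", "first");
        appended.addField("foo", "second"); // both values are kept

        System.out.println(replaced.fields);
        System.out.println(appended.fields);
    }
}
```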
@gsingers gsingers Merge branches 'SDA-265' and 'master' 5c5a2d9
Andrzej Bialecki FOCUS-4041 Update to a more recent Solr snapshot. Upgrade to Tika 1.2.
Add better handling of metadata, field mapping and solr params needed for proper field mapping.
Andrzej Bialecki FOCUS-4123 Catch all exceptions and continue. 013f98f
Andrzej Bialecki FOCUS-4158 Fix the id-s of documents from archives. Add other types of archives.

Fall-back to adding original files if unpacking fails.
Andrzej Bialecki FOCUS-4158 Mark the status of parsing. 32dae07
Andrzej Bialecki FOCUS-4176 Use ":" instead of "!". 66966dd
Andrzej Bialecki FOCUS-4198 Copy the Tika parsing setup from LucidWorks. d3b2fb9
Andrzej Bialecki FOCUS-4198 Pass on more controls from LWS. a1aa333
Andrzej Bialecki FOCUS-4198 More fixes and better error reporting. a4a502f
@gsingers gsingers Merge branch 'master' of b23de00
@gsingers gsingers mahout snap 29ca0cd
@gsingers gsingers update for Solr 4.1.0, add Warc header helper, make SOLRWriter more c…
@gsingers gsingers moved Lucidworks module to standalone project at…
@gsingers gsingers merge 798faca
@gsingers gsingers add getHeader method fd0704a
DigitalPebble Ltd member

Hi Grant. Thanks for sharing this. It is quite an enormous merge and I'll probably go through the changes and cherry-pick the ones I want to merge back. I saw the new repo for LucidWorks, which makes a lot of sense and will simplify contributing back to Behemoth. I will get back to you with comments / questions on the various points of this patch. Thanks again

DigitalPebble Ltd member

Quite a few differences are due to the formatting. There is an eclipse-format.xml file in the Behemoth repo that could be used with Eclipse so that your code follows the same patterns as the main Behemoth repo. That would make it easier to distinguish real differences from formatting ones.

The CommonCrawl-related code could be removed from the IO module as we have a separate repo for it in

Have just pushed 1c32edb to add the timings and apply formatting to all classes

Will look into the other files a bit later. Thanks!

DigitalPebble Ltd member

Have committed the changes to SOLR (with a few fixes on the way) and leaving some of the Tika stuff out as it seems to be LW specific. Pretty much everything else has been added apart from stuff which has been moved out like CommonCrawl. Thanks!

@jnioche jnioche closed this Jun 3, 2013