Stemmer override with large rule set causing java.lang.OutOfMemoryError: Java heap space #2800

vhyza · 2013-03-19T18:42:18Z

Hello,

I was trying to create a Czech lemmatisation analyzer using the stemmer override filter and the Czech dictionary from aspell.

This dictionary contains around 300 000 words in base form plus some suffix/prefix rules. After expansion, the file looks like this:

Aakjaer Aakjaerech Aakjaery Aakjaerům Aakjaerů Aakjaerem Aakjaere Aakjaerovi Aakjaeru Aakjaera Aakjaerové
Aakjaerová Aakjaerovými Aakjaerovým Aakjaerových Aakjaerovou Aakjaerové
Aakjaerův Aakjaerovýma Aakjaerovými Aakjaerových Aakjaerovou Aakjaerovo Aakjaerovy Aakjaerovi Aakjaerovým

Each line is one word followed by its forms.

Because the rules use the format form => lemma, the final rule set expands from 300 000 to 4 364 674 lines.
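The expansion described above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the first word on each line is the base form (lemma) and every following word is one of its forms, and `expand_line` is a hypothetical helper name.

```python
def expand_line(line):
    """Turn 'lemma form1 form2 ...' into a list of 'form => lemma' rules.

    Assumption: the first word on the line is the lemma.
    """
    words = line.split()
    if not words:
        return []
    lemma, forms = words[0], words[1:]
    # dict.fromkeys drops duplicate forms while keeping their original order;
    # a form identical to the lemma needs no rule.
    return [f"{form} => {lemma}" for form in dict.fromkeys(forms) if form != lemma]


if __name__ == "__main__":
    for rule in expand_line("Aakjaer Aakjaerech Aakjaery"):
        print(rule)
```

One rule per form is what makes the file grow: going from about 300 000 lines to 4 364 674 rules works out to roughly 14 to 15 forms per word on average.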

When I tried to index Czech Wikipedia pages (around 400 000 documents) on my local machine, a java.lang.OutOfMemoryError: Java heap space error occurred after approximately 10 minutes of indexing (log file here).

I'm using a snapshot build of elasticsearch (54e7e309a5d407b2fb1123a79e6af9d62e41ea1e) with JAVA_OPTS -Xss200000 -Xms2g -Xmx2g and no other indices.

Index settings/mapping and river settings are in a separate gist.

I also tried to achieve this with the synonym token filter, because its rule format is more compact: form1, form2, form3, form4 => lemma (so there are only about 300 000 rules).
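For comparison, here is a minimal sketch of producing that more compact synonym format from the same dictionary lines. It again assumes the first word on each line is the lemma; `to_synonym_rule` is a hypothetical helper name, not part of any library.

```python
def to_synonym_rule(line):
    """Turn 'lemma form1 form2 ...' into one 'form1, form2 => lemma' rule.

    Assumption: the first word on the line is the lemma.
    Returns None when the line has no forms other than the lemma itself.
    """
    words = line.split()
    if len(words) < 2:
        return None
    lemma = words[0]
    # Keep each distinct form once, in original order, skipping the lemma.
    forms = [w for w in dict.fromkeys(words[1:]) if w != lemma]
    if not forms:
        return None
    return ", ".join(forms) + " => " + lemma


if __name__ == "__main__":
    print(to_synonym_rule("Aakjaer Aakjaerech Aakjaery"))
```

With this format each dictionary line stays a single rule, so the rule file keeps roughly the same number of lines as the dictionary.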

But it's not the same. With the stemmer override filter, when a token was not found in the rule set, the stemmer was used. I can probably do the same by adding a keyword marker and a stemmer to the filter chain, but I don't think that is the right way to do it.
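A rough sketch of what such a filter chain might look like, written as a Python dict that could be serialised to JSON for the index settings. This is an assumption-laden illustration, not the settings from the gist: the filter and analyzer names (`lemma_synonyms`, `lemma_keywords`, `czech_stemmer`, `czech_lemma`) and both file paths are hypothetical, and the keyword file is assumed to list the lemmas so that the stemmer leaves them untouched.

```python
import json


def build_analysis_settings(synonyms_path, keywords_path):
    """Build index analysis settings: synonym -> keyword marker -> stemmer.

    Both paths are hypothetical and must point at files the node can read.
    """
    return {
        "analysis": {
            "filter": {
                # Maps each known form to its lemma.
                "lemma_synonyms": {"type": "synonym", "synonyms_path": synonyms_path},
                # Marks lemmas as keywords so the stemmer skips them.
                "lemma_keywords": {"type": "keyword_marker", "keywords_path": keywords_path},
                # Fallback for tokens that are not in the dictionary.
                "czech_stemmer": {"type": "stemmer", "language": "czech"},
            },
            "analyzer": {
                "czech_lemma": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lemma_synonyms", "lemma_keywords", "czech_stemmer"],
                }
            },
        }
    }


if __name__ == "__main__":
    settings = build_analysis_settings("analysis/lemmas.txt", "analysis/keywords.txt")
    print(json.dumps(settings, indent=2))
```

The order matters in this sketch: the synonym filter has to run before the keyword marker, and the keyword marker before the stemmer, otherwise dictionary lemmas would be stemmed again.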

Please, is there a better 'compressed' format for stemmer override filter rules? Any thoughts on how to avoid the java.lang.OutOfMemoryError: Java heap space error?


s1monw · 2013-03-20T14:17:42Z

Hey, I opened a Lucene issue for this and I will port the quick fix to master. I think the main reason is that this map gets copied each time the filter is created, which seems to be what kills your application in the first place.

vhyza · 2013-03-20T14:36:30Z

Great, thanks a lot!

vhyza · 2013-03-20T14:46:53Z

Just out of curiosity, I tried the "workaround" I mentioned before, the synonym token filter in combination with a keyword marker, and everything seems OK. While indexing about 400 000 documents, heap usage grows up to 1.6 GB / 1.9 GB and then drops to around 500 MB / 1.9 GB. I'd like to ask whether this combination of synonym token filter and keyword marker will have memory/performance issues because of dictionary duplication (the synonym and keyword marker dictionaries each have 300 000 lines).


s1monw · 2013-03-20T17:22:45Z

I added a patch to Lucene with a more efficient solution, similar to SynonymFilter. I also patched ES with this implementation.

vhyza · 2013-03-20T23:20:56Z

Thanks a lot. I tried it and it works fine. When indexing, memory consumption is similar to the synonym token filter solution: heap usage grows up to 1.6 GB / 1.9 GB and then drops.

s1monw · 2013-03-21T06:54:57Z

@vhyza thanks for verifying. I will pull this in soon.

ghost assigned s1monw Mar 20, 2013

s1monw added a commit to s1monw/elasticsearch that referenced this issue Mar 20, 2013

Use more efficient StemmerOverrideFilter from Lucene trunk

15dbc1e

Closes elastic#2800

s1monw closed this as completed in 5f05c21 Mar 21, 2013
