Stemmer override with large rule set causing java.lang.OutOfMemoryError: Java heap space #2800
Comments
Hey, I opened a Lucene issue for this and I will port the quick fix to master. I think the main reason is that this map gets copied each time the filter is created, which seems to kill your application in the first place.
Great, thanks a lot!
Just out of curiosity, I tried "workaround" with
I added a patch to Lucene with a more efficient solution, similar to SynonymFilter. I also patched ES with this implementation.
Thanks a lot. I tried it and it is working fine. When indexing, memory consumption is similar to
@vhyza thanks for verifying. I will pull this in soon.
Hello,
I was trying to create a Czech lemmatisation analyzer using the stemmer override filter and a Czech dictionary from aspell.
This dictionary contains around 300 000 words in base form and some suffix/prefix rules. After expansion, the file looks like this: each line is one word with its forms.
Because the rule format is `form => lemma`, the final rule set expands from 300 000 to 4 364 674 lines.

When I tried to index Czech Wikipedia pages (around 400 000 documents) on my local machine, a `java.lang.OutOfMemoryError: Java heap space` error occurred after approximately 10 minutes of indexing (log file here).

I'm using a snapshot build of elasticsearch (54e7e309a5d407b2fb1123a79e6af9d62e41ea1e), `JAVA_OPTS -Xss200000 -Xms2g -Xmx2g`, with no other indices. Index settings/mapping and river settings are in a separate gist.
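
To make the rule-set growth concrete, here is a minimal Python sketch of the expansion step, assuming the expanded dictionary is available as grouped lines of the form `form1, form2 => lemma`. The function name `expand_rules` and the sample Czech words are illustrative only and are not taken from the actual dictionary or gist.

```python
def expand_rules(lines):
    """Turn grouped 'form1, form2 => lemma' lines into one 'form => lemma' rule per form."""
    expanded = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        forms_part, lemma = line.split("=>", 1)
        lemma = lemma.strip()
        for form in forms_part.split(","):
            form = form.strip()
            if form:
                expanded.append(f"{form} => {lemma}")
    return expanded


# Hypothetical sample input: two lemmas with a few inflected forms each.
grouped = ["psa, psovi, psem => pes", "hradu, hradem => hrad"]
rules = expand_rules(grouped)
for rule in rules:
    print(rule)
```

Two grouped lines become five single-form rules here; the same one-rule-per-form blow-up is what takes the real dictionary from about 300 000 lines to over 4 million.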
I also tried to achieve this with the synonym token filter, because its rule format is more compact: `form1, form2, form3, form4 => lemma` (so there are only about 300 000 rules). But the behaviour is not the same. With the stemmer override filter, when a token is not found in the rule set, the stemmer is used. I could probably do the same by adding a keyword marker and a stemmer to the filter chain, but I don't think that is the right way to do it.

Is there a better, 'compressed' format for stemmer override filter rules? Any thoughts on how to avoid the `java.lang.OutOfMemoryError: Java heap space` error?
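
For clarity, the behaviour I am after (dictionary lookup first, stemmer only as a fallback) can be sketched in a few lines of Python. This is only an illustration of the intended semantics, not how Lucene or elasticsearch implement the filter; `lemmatise`, `toy_stem` and the sample rules are hypothetical names and data.

```python
def lemmatise(tokens, overrides, stem):
    """Return the override lemma when a token has a rule, otherwise the stemmer's output."""
    return [overrides[t] if t in overrides else stem(t) for t in tokens]


def toy_stem(token):
    # Stand-in for a real stemmer: strips a trailing "em" and nothing else.
    return token[:-2] if token.endswith("em") else token


# Hypothetical override rules, keyed by inflected form.
overrides = {"psem": "pes", "psa": "pes"}
result = lemmatise(["psem", "hradem", "hrad"], overrides, toy_stem)
print(result)
```

Here "psem" is resolved by the rule set, while "hradem" has no rule and falls through to the stemmer. The synonym token filter alone gives only the first half of this behaviour.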