Stemmer override with large rule set causing java.lang.OutOfMemoryError: Java heap space #2800

Closed
vhyza opened this issue Mar 19, 2013 · 6 comments

Comments

@vhyza
Contributor

vhyza commented Mar 19, 2013

Hello,

I was trying to create a Czech lemmatisation analyzer using the stemmer override filter and the Czech dictionary from aspell.

This dictionary contains around 300 000 words in base form, plus some suffix/prefix rules. After expansion, the file looks like this:

Aakjaer Aakjaerech Aakjaery Aakjaerům Aakjaerů Aakjaerem Aakjaere Aakjaerovi Aakjaeru Aakjaera Aakjaerové
Aakjaerová Aakjaerovými Aakjaerovým Aakjaerových Aakjaerovou Aakjaerové
Aakjaerův Aakjaerovýma Aakjaerovými Aakjaerových Aakjaerovou Aakjaerovo Aakjaerovy Aakjaerovi Aakjaerovým 

Each line is one word with its forms.

Because the rule format is form => lemma, the final rule set expands from 300 000 to 4 364 674 lines.
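For readers following along, the expansion described above might look roughly like the sketch below. It assumes the first word on each line is the lemma and the remaining words are its forms; that layout is inferred from the sample lines, and the function name `expand_line` is hypothetical.

```python
def expand_line(line):
    """Turn 'lemma form1 form2 ...' into ['form1 => lemma', 'form2 => lemma', ...].

    Assumption: the first whitespace-separated word is the lemma.
    """
    words = line.split()
    if not words:
        return []
    lemma, forms = words[0], words[1:]
    seen = set()
    rules = []
    for form in forms:
        # Skip duplicates and forms identical to the lemma itself.
        if form != lemma and form not in seen:
            seen.add(form)
            rules.append(f"{form} => {lemma}")
    return rules


if __name__ == "__main__":
    sample = "Aakjaer Aakjaerech Aakjaery Aakjaerům"
    for rule in expand_line(sample):
        print(rule)
```

One input line with N forms becomes N rules, which is why the rule file grows by roughly an order of magnitude.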

When I tried to index Czech Wikipedia pages (around 400 000 documents) on my local machine, a java.lang.OutOfMemoryError: Java heap space error occurred after approximately 10 minutes of indexing (log file here).

I'm using a snapshot build of elasticsearch (54e7e309a5d407b2fb1123a79e6af9d62e41ea1e) with JAVA_OPTS -Xss200000 -Xms2g -Xmx2g and no other indices.

The index settings/mapping and river settings are in a separate gist.

I also tried to achieve this using the synonym token filter, because its rule format is more compact: form1, form2, form3, form4 => lemma (so there are only about 300 000 rules).
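A minimal sketch of producing that grouped synonym format from a dictionary line, again assuming the first word on the line is the lemma (the helper name `synonym_rule` is hypothetical):

```python
def synonym_rule(line):
    """Turn 'lemma form1 form2 ...' into 'form1, form2 => lemma'.

    Assumption: the first whitespace-separated word is the lemma.
    Returns None for lines that have no forms to map.
    """
    words = line.split()
    if len(words) < 2:
        return None
    lemma, forms = words[0], words[1:]
    return f"{', '.join(forms)} => {lemma}"


if __name__ == "__main__":
    print(synonym_rule("Aakjaer Aakjaerech Aakjaery Aakjaerům"))
```

Here each dictionary line yields exactly one rule, so the rule count stays at the number of lemmas rather than the number of forms.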

But it's not the same. With the stemmer override filter, the stemmer is used when a token is not found in the rule set. I can probably do the same by adding a keyword marker and a stemmer to the filter chain, but I don't think that is the right way to do it.

Is there a better, 'compressed' format for stemmer override filter rules? Any thoughts on how to avoid the java.lang.OutOfMemoryError: Java heap space error?

@ghost ghost assigned s1monw Mar 20, 2013
@s1monw
Contributor

s1monw commented Mar 20, 2013

Hey, I opened a Lucene issue for this and I will port the quick fix to master. I think the main reason is that this map gets copied each time the filter is created, which seems to be what kills your application in the first place.
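The copy-per-filter problem can be illustrated with a small, language-neutral sketch. This is not Lucene's code; the classes `CopyingFilter` and `SharingFilter` are hypothetical stand-ins that only show the difference between copying a rule map in a constructor and sharing one read-only instance.

```python
class CopyingFilter:
    """Each instance takes a private copy of the rule map (the costly pattern)."""

    def __init__(self, rules):
        self.rules = dict(rules)  # a new dict holding every entry again


class SharingFilter:
    """Each instance keeps a reference to one shared, read-only rule map."""

    def __init__(self, rules):
        self.rules = rules  # no copy; callers must not mutate it


# A small stand-in for the multi-million-entry rule set.
rules = {f"form{i}": f"lemma{i % 100}" for i in range(10_000)}

copying = [CopyingFilter(rules) for _ in range(5)]
sharing = [SharingFilter(rules) for _ in range(5)]

# Count how many distinct map objects are alive in each case.
distinct_copies = len({id(f.rules) for f in copying})
distinct_shared = len({id(f.rules) for f in sharing})

if __name__ == "__main__":
    print("maps held by copying filters:", distinct_copies)
    print("maps held by sharing filters:", distinct_shared)
```

With millions of rules and a new filter created for each analyzed field or document, the copying pattern multiplies heap usage by the number of live filters.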

@vhyza
Contributor Author

vhyza commented Mar 20, 2013

Great, thanks a lot!

@vhyza
Contributor Author

vhyza commented Mar 20, 2013

Just out of curiosity, I tried the "workaround" I mentioned before, the synonym token filter combined with a keyword marker, and everything seems OK. While indexing about 400 000 documents, heap usage grows to 1.6 GB / 1.9 GB and then drops to around 500 MB / 1.9 GB. I'd like to ask whether this combination of synonym token filter and keyword marker will have memory/performance issues because of dictionary duplication (the synonym and keyword marker dictionaries each have 300 000 lines).
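The actual settings live in the gist and are not reproduced in this thread, so the following is only a guess at what such a filter chain could look like, written as a Python dict that serialises to index settings JSON. The filter and analyzer names and both file paths are hypothetical. The filter types (`synonym`, `keyword_marker`, `stemmer`) and their parameters follow the Elasticsearch documentation as I understand it and should be checked against the version in use.

```python
import json

settings = {
    "analysis": {
        "filter": {
            # Rewrites known forms to their lemma ("form1, form2 => lemma").
            "czech_lemma_synonyms": {
                "type": "synonym",
                "synonyms_path": "analysis/czech_synonyms.txt",  # hypothetical path
            },
            # Marks lemmas as keywords so the stemmer leaves them untouched.
            "czech_lemma_keywords": {
                "type": "keyword_marker",
                "keywords_path": "analysis/czech_lemmas.txt",  # hypothetical path
            },
            # Fallback for tokens that are not in the dictionary.
            "czech_stemmer": {"type": "stemmer", "name": "czech"},
        },
        "analyzer": {
            "czech_lemmatizer": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "standard",
                # Order matters: map to lemma, protect lemma, then stem the rest.
                "filter": [
                    "czech_lemma_synonyms",
                    "czech_lemma_keywords",
                    "czech_stemmer",
                ],
            }
        },
    }
}

if __name__ == "__main__":
    print(json.dumps(settings, indent=2))
```

The two dictionary files overlap in content, which is the duplication the question above is about.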

s1monw added a commit to s1monw/elasticsearch that referenced this issue Mar 20, 2013
@s1monw
Contributor

s1monw commented Mar 20, 2013

I added a patch to Lucene with a more efficient solution, similar to SynonymFilter. I also patched ES with this implementation.
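The thread does not describe the patched data structure beyond "similar to SynonymFilter". As far as I know, Lucene's SynonymFilter keeps its rules in a finite-state transducer, which shares common prefixes between entries. The sketch below uses a plain trie as a rough analogy only; it is not the patch itself, and the helper names `build_trie` and `lookup` are hypothetical.

```python
def build_trie(rules):
    """Build a character trie from {form: lemma}; return (root, node_count)."""
    root = {}
    node_count = 1  # count the root node
    for form, lemma in rules.items():
        node = root
        for ch in form:
            if ch not in node:
                node[ch] = {}
                node_count += 1
            node = node[ch]
        node[""] = lemma  # the empty-string key marks the end of a form
    return root, node_count


def lookup(root, token):
    """Return the lemma stored for token, or None if token is not a known form."""
    node = root
    for ch in token:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("")


rules = {"Aakjaerech": "Aakjaer", "Aakjaery": "Aakjaer", "Aakjaerem": "Aakjaer"}
root, node_count = build_trie(rules)
total_chars = sum(len(form) for form in rules)

if __name__ == "__main__":
    print("characters across all forms:", total_chars)
    print("trie nodes:", node_count)
    print("Aakjaery ->", lookup(root, "Aakjaery"))
```

Inflected forms of one word share a long common prefix, so the number of stored nodes grows much more slowly than the total number of characters across all forms.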

@vhyza
Contributor Author

vhyza commented Mar 20, 2013

Thanks a lot. I tried it and it is working fine. When indexing, memory consumption is similar to the synonym token filter solution: heap usage grows to 1.6 GB / 1.9 GB and then drops.

@s1monw
Contributor

s1monw commented Mar 21, 2013

@vhyza thanks for verifying. I will pull this in soon.

@s1monw s1monw closed this as completed in 5f05c21 Mar 21, 2013