
Memory leak in mecab-python3 0.996.2? #34

Closed
orangain opened this issue Jun 8, 2019 · 7 comments

Comments

@orangain orangain commented Jun 8, 2019

When parsing a large text many times with mecab-python3 0.996.2, the Python process consumes too much memory. Eventually, the process slows down dramatically.

For example, after about 200 iterations of the loop in the following code, the Python process consumes 80% of the machine's memory. The consumed memory does not seem to be released, and the process stops making progress entirely.

import MeCab


def main():
    with open('kokoro.utf8.txt') as f:
        text = f.read()

    tagger = MeCab.Tagger()
    tagger.parse('')  # Hack for mecab-python3 0.7
    for i in range(10000):
        if i % 10 == 0:
            print(i)
        
        parse(tagger, text)


def parse(tagger, text):
    node = tagger.parseToNode(text)
    while node:
        node = node.next

if __name__ == "__main__":
    main()

kokoro.utf8.txt is available at: https://gist.github.com/orangain/d48998bf258d2118d0a6589d06ed7e98

Environment

  • VirtualBox virtual machine on macOS 10.14
  • 2GB of memory
  • Ubuntu 18.04
  • Python 3.7.1
  • mecab-python3 0.996.2

Note that:

  • mecab-python3 0.7 (with mecab 0.996 installed via apt) consumes less memory than 0.996.2.
  • The issue also reproduces on macOS 10.14 with Python 3.7.3; after 1000 loops, the Python process consumed 7 GB of memory.

P.S. I really appreciate your effort to provide wheels including fix of #3 and #19.

@Mikep86 Mikep86 commented Jul 19, 2019

I can confirm this memory leak. It appears to be related to the parseToNode method.

This code leaks:

import MeCab
tagger = MeCab.Tagger()
node = tagger.parseToNode("text")
while node:
    node = node.next

This code does not leak:

import MeCab
tagger = MeCab.Tagger()
parse_result = tagger.parse("text")
for parse_line in parse_result.split("\n"):
    if parse_line == "EOS":
        break
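
Building on the non-leaking path above, here is a workaround sketch that recovers per-token surface and feature fields from the `parse()` output instead of walking nodes. It assumes the default output format (one `surface<TAB>features` line per token, terminated by an `EOS` line); `iter_tokens` is a hypothetical helper name, not part of the library.

```python
def iter_tokens(raw):
    """Yield (surface, features) pairs from a raw MeCab parse() result.

    Assumes the default output format: one token per line as
    "surface<TAB>comma,separated,features", terminated by an "EOS" line.
    """
    for line in raw.split("\n"):
        if line in ("EOS", ""):
            break
        surface, _, features = line.partition("\t")
        yield surface, features


# Example with a hand-written parse() result in the default format:
raw = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nEOS\n"
for surface, features in iter_tokens(raw):
    print(surface, features.split(",")[0])
```

Since this path only handles ordinary Python strings returned by `parse()`, nothing from the node allocation in `parseToNode` is involved.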

Environment

  • CentOS 7
  • Python 2.7
  • mecab-python3 0.996.2
@polm polm (Collaborator) commented Jul 20, 2019

I can confirm this bug exists. When adding a patch for 0.996.1 we checked for memory leaks (see here), so this was a bit surprising. My linked code for checking memory usage shows no growth with 0.996.1 but shows growth (a leak) with 0.996.2.

I changed the example code above a little: I moved the tagger creation out of the function and changed the loop to print memory usage. Tagger creation should be checked separately, but I wanted to narrow down the issue first. Here's my code:

import MeCab
import os
import psutil # note: needs to be installed
process = psutil.Process(os.getpid())

tagger = MeCab.Tagger()
tagger.parse('')  # Hack for mecab-python3 0.7

def main():
    with open('kokoro.utf8.txt') as f:
        text = f.read()

    for i in range(10000):
        if i % 10 == 0:
            print(i, process.memory_info().rss)

        parse(tagger, text)


def parse(tagger, text):
    node = tagger.parseToNode(text)
    while node:
        node = node.next

if __name__ == "__main__":
    main()

Here's the start of output with 0.7:

0 15724544
10 273051648
20 278630400
30 284209152
40 289787904
50 295366656
60 300945408
70 306524160
80 312107008
90 317685760
100 323264512

with 0.996.1:

0 15568896
10 272306176
20 277884928
30 283463680
40 289046528
50 294625280
60 300204032
70 305782784
80 311361536
90 316940288
100 322523136
110 328101888

with 0.996.2:

0 18280448
10 236933120
20 313737216
30 390541312
40 467345408
50 544145408
60 620949504
70 697757696
80 774561792
90 851361792
100 928165888

So for all three versions, memory use increases over time, though it's noticeably faster for 0.996.2: roughly 0.56 MB per iteration for 0.7 and 0.996.1, versus about 7.7 MB per iteration for 0.996.2. I am not sure what the cause of the memory increase in 0.996.1 is...

I'll look into the cause of this in more detail later, thanks for pointing it out!

@polm polm commented Jul 28, 2019

I am still looking at this but unfortunately don't understand how to fix it. If anyone with a good knowledge of SWIG would like to step in it'd be much appreciated.

@fflexo fflexo commented Jul 29, 2019

Although the SWIG code isn't ideal, I don't think it's the problem here. %newobject adds a delete[] call in _wrap_Node_surface_get, which looks correct. The sample code is also valgrind-clean (insofar as nothing is reported as definitely lost).

I spent a while looking at the Node objects, as they don't get deleted after Python is finished with them, but the documentation for parseToNode says that's correct and I have no reason to doubt it.

My best guess right now is that some severe heap fragmentation is happening and hurting performance badly, but I have no evidence to support that theory.
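
One way to narrow down a theory like this (a sketch, not from the thread) is to compare `tracemalloc`'s view of Python-level allocations against the process RSS that polm's script prints: if RSS grows while the traced total stays flat, the growth lives in native (C++) allocations that CPython's allocator never sees, consistent with a leak or fragmentation inside libmecab rather than in the wrapper's Python objects. The workload below is a stand-in for the `parseToNode` loop.

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload; in the real test this would be the
# tagger.parseToNode(...) loop from the reproduction script.
junk = ["x" * 100 for _ in range(1000)]

# current/peak count only allocations made through Python's allocator;
# a leak inside libmecab would not show up here at all.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(current, peak)
```

`tracemalloc` is in the standard library from Python 3.4, so this works in the same environments as the psutil script above.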

@akorobko akorobko (Contributor) commented Dec 13, 2019

@fflexo @polm I have found the issue. See PR #37
@SamuraiT could you, please, merge it and update package? Thanks!

@polm polm commented Dec 14, 2019

@akorobko Thanks so much for the fix!

To be honest after this bug I got fed up with SWIG and wrote a Cython wrapper for MeCab called fugashi to avoid dealing with it.

polm added a commit to polm/mecab-python3 that referenced this issue Dec 22, 2019
Fix memory issues SamuraiT#34
@zackw zackw (Collaborator) commented Dec 24, 2019

Should be fixed by #37.

@zackw zackw closed this Dec 24, 2019