Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make node.surface return just the token (fixes #19) #22

Merged
merged 2 commits into from Dec 24, 2018

Conversation

polm
Copy link
Collaborator

@polm polm commented Dec 10, 2018

For details on the fix see #19, particularly comments by @tomoyukih-fw.

This completely reverts 5c23ea3. The previous commit was functional but
may have leaked memory.
Copy link
Collaborator

@zackw zackw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earliest I can look at this in any detail is Friday of this week. Saturday or Sunday is more likely.

I do have a concern that needs to be addressed before I can even consider merging this, though, and maybe you can look into it in the meantime. The restored function, mecab_node_t_surface_get, allocates a C-string using operator new for the token, and returns it. That C-string will be copied again into a Python string at some point inside the SWIG glue. Does anything ever free the intermediate C-string? If not, we have a serious memory leak and that needs to get fixed properly.

To be clear, I could easily imagine that the SWIG glue does know to free the C-string, I don't understand SWIG very well myself.

@polm
Copy link
Collaborator Author

polm commented Dec 10, 2018

Thank you very much for your quick reply.

Good point about leaking memory, but my understanding is that making sure the new string gets cleaned up is the purpose of the newobject directive:

If you wrap the function blah(), SWIG has no idea that the return value is a newly allocated object. As a result, the resulting extension module may produce a memory leak (SWIG is conservative and will never delete objects unless it knows for certain that the returned object was newly created).
...
When %newobject is supplied, many language modules will arrange to take ownership of the return value. This allows the value to be automatically garbage-collected when it is no longer in use. However, this depends entirely on the target language (a language module may also choose to ignore the %newobject directive).

So I think this is OK. I can write a test or check more closely if you like though.

No worries about looking at it this weekend or later; since the old version works this isn't urgent for me by any means. Thanks again for your quick reply.

@zackw
Copy link
Collaborator

zackw commented Dec 10, 2018

If you can somehow confirm that %newobject does cause the intermediate C-string to be deallocated, that would satisfy me for now. I would like to ask for a test program but we don't actually have a test suite right now (which is a problem in itself, but not one we need to address at the same time as this regression).

@polm
Copy link
Collaborator Author

polm commented Dec 10, 2018

Would showing a loop with the same operation has constant memory usage be adequate? This seems to work:

import MeCab
tokenizer = MeCab.Tagger()

import os
import psutil # note: needs to be installed
process = psutil.Process(os.getpid())

for ii in range(100000):
    node = tokenizer.parseToNode('日本語ですよ').next
    a = node.surface

    if (ii + 1) % 10000 == 0:
        # This should print memory usage 
        print(process.memory_info().rss)
        # if there is no memory leak the number should not change

I need to go to bed now, but if this isn't adequate I could look into hunting down the implementation of newobject in swig - otherwise not really sure how to prove the pointer is deallocated.

@zackw
Copy link
Collaborator

zackw commented Dec 10, 2018

That will do; I can run that program under valgrind or something for an even more stringent test. Thanks. Again, not going to actually merge the PR until I have time to dig into the changes a bit more myself, but I think we're on the right track.

@zackw zackw merged commit 8aaf058 into SamuraiT:master Dec 24, 2018
@zackw
Copy link
Collaborator

zackw commented Dec 24, 2018

I have now thoroughly verified that this code does not produce a memory leak. Also, on further investigation, this code (with trivial formatting changes) is in the latest version of the upstream MeCab.i, so clearly it is the intended API.

I've merged your PR and then committed a patch on top of it to make the formatting match the upstream file. There will be a new release on PyPI Real Soon Now (I need to sort out automatic building of binary wheels, first).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants