Caught UnicodeDecodeError when use parseToNode alone #3

Closed
graph226 opened this issue Nov 14, 2016 · 9 comments

Comments

@graph226

When we call tagger.parseToNode(text) on its own, we sometimes get an error such as:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 1: invalid start byte

To avoid this, call tagger.parse(text) before calling parseToNode.
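
The workaround above can be sketched as a small helper. This is only an illustration: `node_surfaces` is a hypothetical name, and the tagger is assumed to behave like MeCab.Tagger (parse(text) returns a string, parseToNode(text) returns the first node of a linked list whose nodes expose `.surface` and `.next`).

```python
def node_surfaces(tagger, text):
    """Return the surface strings of all nodes for `text`.

    `tagger` is assumed to expose parse() and parseToNode() the way
    MeCab.Tagger does; this helper itself is hypothetical.
    """
    tagger.parse(text)  # workaround from this thread: call parse() first
    surfaces = []
    node = tagger.parseToNode(text)
    while node:
        if node.surface:  # skip nodes with an empty surface (e.g. BOS/EOS)
            surfaces.append(node.surface)
        node = node.next
    return surfaces
```

With the real binding this would presumably be called as node_surfaces(MeCab.Tagger(), text).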

@graph226
Author

graph226 commented Nov 14, 2016

To fix this in the library, we could call the parse method inside parseToNode, but I don't know whether that would work 😇 That is, change

def parseToNode(self, *args): return _MeCab.Tagger_parseToNode(self, *args)

to something like

def parseToNode(self, *args):
    self.parse(*args)  # self is already bound, so it is not passed again
    return _MeCab.Tagger_parseToNode(self, *args)

Please let me know what you think about this.
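
The same idea can be tried without editing the generated wrapper module, by overriding parseToNode in a subclass. A minimal sketch, assuming the tagger class can be subclassed like an ordinary Python class; `ParseFirstMixin` and `SafeTagger` are hypothetical names, not part of mecab-python3.

```python
class ParseFirstMixin:
    """Mixin that calls parse() before delegating to the real parseToNode()."""

    def parseToNode(self, *args):
        # The return value of parse() is discarded; the thread only reports
        # that calling it first makes the error go away.
        self.parse(*args)
        return super().parseToNode(*args)


# Hypothetical usage with the real binding (not executed here):
#
#   import MeCab
#
#   class SafeTagger(ParseFirstMixin, MeCab.Tagger):
#       pass
```

Putting the mixin first in the base list makes its parseToNode run before the binding's own method.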

@shiumachi

I got the same error and fixed it with the above workaround.

@kei-s

kei-s commented Jan 17, 2018

I investigated the cause of this bug.

In the _wrap_Tagger_parseToNode function, this line deletes buf2 because alloc2 is SWIG_NEWOBJ:

if (alloc2 == SWIG_NEWOBJ) delete[] buf2;

In Python 2, buf2 is not deleted because alloc2 is SWIG_OLDOBJ.
(MeCab_wrap.cxx is identical to @taku910's original: https://github.com/taku910/mecab/blob/master/mecab/python/MeCab_wrap.cxx .)

So the cause of this bug lies in the SWIG_AsCharPtrAndSize function.
I think something is wrong in this block:

mecab-python3/MeCab_wrap.cxx

Lines 3461 to 3470 in 5ee7aa5

if (!alloc && cptr) {
  /* We can't allow converting without allocation, since the internal
     representation of string in Python 3 is UCS-2/UCS-4 but we require
     a UTF-8 representation.
     TODO(bhy) More detailed explanation */
  return SWIG_RuntimeError;
}
obj = PyUnicode_AsUTF8String(obj);
PyBytes_AsStringAndSize(obj, &cstr, &len);
if(alloc) *alloc = SWIG_NEWOBJ;

But I don't have a patch for this bug at this time. 😕
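
To make the suspected failure mode concrete, here is a pure-Python simulation of it. It does not touch MeCab or SWIG: a bytearray stands in for buf2, a memoryview stands in for a node's surface pointer, and overwriting two bytes stands in for the freed memory being reused. The function name and the byte values are assumptions chosen for illustration.

```python
def simulate_dangling_surface():
    """Simulation only: show how reading a reused buffer breaks UTF-8 decoding."""
    buf = bytearray("すもも".encode("utf-8"))  # stand-in for the C buffer buf2
    surface = memoryview(buf)[0:3]             # "node" pointing into that buffer
    before = bytes(surface).decode("utf-8")    # decodes fine while buf is intact

    # Pretend the buffer was freed and its memory reused for something else.
    buf[0] = 0x00
    buf[1] = 0xBB

    try:
        bytes(surface).decode("utf-8")
    except UnicodeDecodeError as exc:
        return before, exc
    return before, None


before, error = simulate_dangling_surface()
print(before)
print(error)
```

The second print shows a UnicodeDecodeError of the same kind as the one in the original report: a byte that can only continue a UTF-8 sequence appears where a start byte is expected.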

@orangain

I got the same problem and found that using the latest version of MeCab solves the problem.

My environment:

  • OS: macOS High Sierra 10.13.3
  • Python: 3.6.3
  • mecab-python3: 0.7
  • MeCab: 0.996 (BUILT FROM SOURCE taku910/mecab@3a07c4e)

This problem seems to be the same as the one reported in taku910/mecab#5, and it has been solved by taku910/mecab#24 merged in Feb 2016.

Although this problem occurs only in Python 3, it does not appear to be a bug in mecab-python3; it seems to be a memory-management issue in MeCab itself.

Unfortunately, major package managers such as Homebrew and APT currently offer an older version of MeCab, based on the February 2013 source that can be obtained from Google Drive.

To avoid this problem without using the workaround mentioned above, you need to build and install MeCab from the latest source on GitHub manually, and then reinstall mecab-python3.

@zackw
Collaborator

zackw commented Nov 4, 2018

@graph226 I believe this ought to be fixed by using the latest version of the package and the latest version of MeCab, but I cannot be sure because you did not provide a complete test case that I can run for myself. Could you please try your code again? Make sure to use mecab-python3 0.8.3, MeCab 0.996, and a current version of SWIG (I have 3.0.12).

It's been a long time since you reported this bug and perhaps you have moved on, so if I don't hear from you in a month I will close the bug (but feel free to reopen it if you don't get to this until after that, and it's still a problem).

@polm
Collaborator

polm commented Dec 17, 2018

Please see the spaCy issue linked above, which provides a Dockerfile and code to reproduce the issue. I think @orangain's explanation is exactly right.

@zackw
Collaborator

zackw commented Dec 17, 2018

@polm Thanks for the pointer. I think you're right. I am going to consider this bug a concrete reason why we need to ship binary wheels from PyPI with bundled libmecab, so it will be addressed by PR #18, which I will be reviewing and landing Real Soon Now. I'll leave the bug open till then.

@zackw
Collaborator

zackw commented Apr 16, 2019

Please try the release candidate available from https://test.pypi.org/project/mecab-python3/0.996.2rc2/ ; this bug should be corrected there. Thank you everyone for your patience. We plan to make a new official release in the next couple of weeks.

@zackw zackw mentioned this issue Apr 16, 2019
5 tasks
@zackw
Collaborator

zackw commented Apr 22, 2019

0.996.2 has been officially released and this issue should be corrected. Please file a new bug report if you are still having problems with parseToNode.

@zackw zackw closed this as completed Apr 22, 2019
jiru added a commit to Tatoeba/tatomecab that referenced this issue Feb 7, 2020
But keep compatibility for Python 2, because latest packaged mecab
binaries include a bug that makes tatomecab unusable. The only way
to have it working with Python 3 right now is to compile mecab from
source:

SamuraiT/mecab-python3#3 (comment)

Note that in Python 3, http.server.BaseHTTPRequestHandler.parse_request()
forces decoding the request line as latin1-encoded, so commands such as
curl http://127.0.0.1:8842/furigana?str=振り仮名をつけろう
won’t work any more. One needs to %-encode everything properly in the URL.

Closes #6.
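
The %-encoding mentioned in the commit message above can be done with the standard library. A minimal sketch, reusing the host, port, path, and `str` parameter from the curl example; how the server actually reads the query is an assumption here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

text = "振り仮名をつけろう"

# urlencode() percent-encodes the UTF-8 bytes of the value, so the
# resulting request line contains only ASCII characters.
url = "http://127.0.0.1:8842/furigana?" + urlencode({"str": text})
print(url)

# Assumed server side: decoding the query string recovers the original text.
query = parse_qs(urlsplit(url).query)
print(query["str"][0])
```

The printed URL can be passed to curl as-is.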
polm added a commit that referenced this issue Dec 10, 2022
* Upgrade Github Actions in build scripts

* Cache mecab build

* Add caching of mecab builds

* Action forgotten action file

* Clean up paths

* Remove leftover cd

* Fix cache path

* Fix path

* Use glob for python versions

* Don't build mecab if cache is found

* Fix caching

* Fix job keys

* Try modifying conditionals

* Clean up conditionals

Apparently they don't need the ${{ }}

* Explicitly specify Python versions again

This may be the easiest way to exclude 3.6.
6 participants