
tokenizer fails #8

Closed
mikkokotila opened this issue Apr 27, 2018 · 6 comments

Comments

@mikkokotila
Contributor

I've installed from PyPI and I'm doing the following:

import pybo as bo

# initialize the tokenizer
tok = bo.BoTokenizer('POS')

# load a string to a variable
input_str = 'འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་'

# tokenize the input
tokens = tok.tokenize(input_str)

# show the results
tokens

...at which point I get:

IndexError                                Traceback (most recent call last)
~/dev/astetik_test/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381                 if cls in self.type_pprinters:
    382                     # printer registered in self.type_pprinters
--> 383                     return self.type_pprinters[cls](obj, self, cycle)
    384                 else:
    385                     # deferred printer

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    559                 p.text(',')
    560                 p.breakable()
--> 561             p.pretty(x)
    562         if len(obj) == 1 and type(obj) is tuple:
    563             # Special case for 1-item tuples.

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    398                         if cls is not object \
    399                                 and callable(cls.__dict__.get('__repr__')):
--> 400                             return _repr_pprint(obj, self, cycle)
    401 
    402             return _default_pprint(obj, self, cycle)

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     for idx,output_line in enumerate(output.splitlines()):
    697         if idx:

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in __repr__(self)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in <listcomp>(.0)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in <listcomp>(.0)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

IndexError: string index out of range

If I don't load the output of tok.tokenize(input_str) into a variable, the error occurs at that step instead.

@drupchen
Collaborator

Hmmm. This is a good bug. It is most probably a bug in the string representation of a token in the list, which in turn comes from a bug in splitting syllables with affixes into two distinct tokens. It looks like the attribute token.syls does not have the expected content.

What line 62 does is find the actual characters in token.content by using the indices listed in token.syls. It looks like token.syls has not been correctly split in pybo.splitaffixed.py, in a private function called __split_syls().
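To make the failure mode concrete, here is a minimal sketch (not pybo's actual code) of the lookup that line 62 performs: token.syls holds lists of character indices into token.content, and __repr__() joins the characters each syllable points at. If splitting leaves a stale index past the end of the content, the same IndexError as in the traceback above appears:

```python
# Well-formed case: every index in syls is < len(content).
content = "abcde"
syls = [[0, 1], [2, 3, 4]]
joined = ' '.join(''.join(content[i] for i in syl) for syl in syls)
print(joined)  # → ab cde

# Buggy case: index 5 is out of range for a 5-character string,
# reproducing "IndexError: string index out of range".
bad_syls = [[0, 1], [2, 3, 5]]
try:
    ' '.join(''.join(content[i] for i in syl) for syl in bad_syls)
except IndexError as e:
    print(e)  # → string index out of range
```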

Could you try with the latest version I have pushed?

@drupchen
Collaborator

By the way, the tokenizer did not fail; it is the string representation of the content of the Token object that fails. There is still a bug somewhere, but not big enough to prevent the tokenizer from functioning altogether. Otherwise it would not have gotten past the line tokens = tok.tokenize(input_str).
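This distinction is easy to verify in plain Python: an object can be constructed successfully while its __repr__() raises, so the exception only surfaces when something tries to display it. A tiny illustration (Broken here is a stand-in, not a pybo class):

```python
class Broken:
    """Constructs fine, but blows up only when displayed."""
    def __repr__(self):
        return "x"[5]  # IndexError, but only when repr() is called

obj = Broken()   # succeeds, just like tok.tokenize(input_str) did
try:
    repr(obj)    # this is what IPython's pretty-printer calls
except IndexError:
    print("repr failed, but the object itself exists")
```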

@mikkokotila
Contributor Author

I tried installing from the latest master, but the issue is still there.

@drupchen
Collaborator

drupchen commented Apr 28, 2018

I don't seem to be able to reproduce the bug, using the configuration that is in the latest master.

It seems you can't print one of the produced tokens.

Maybe a way to identify it would be to do something like the following:

for num, token in enumerate(tokens):
    print(num)  # to identify which token fails to print
    print(token)  # this calls the __repr__() responsible for the failing `cleaned_content`
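If stepping through the prints is inconvenient, a variant of the loop above can collect the failing indices without aborting. This is a self-contained sketch with stand-in tokens (FakeToken is hypothetical, mimicking a Token whose __repr__() raises), since I cannot run your exact input:

```python
# Stand-in tokens: the second one's __repr__ raises, mimicking the bug.
class FakeToken:
    def __init__(self, ok):
        self.ok = ok

    def __repr__(self):
        if not self.ok:
            raise IndexError('string index out of range')
        return '<token>'

tokens = [FakeToken(True), FakeToken(False), FakeToken(True)]

failing = []
for num, token in enumerate(tokens):
    try:
        repr(token)          # same code path print(token) would trigger
    except IndexError:
        failing.append(num)  # record the offending token instead of crashing

print(failing)  # → [1]
```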

Something else: if this code executes without a problem, IPython/Jupyter may have trouble rendering the @property attributes of classes.

Could you elaborate on what you do instead of loading the output of tok.tokenize() into a variable, and how the error only happens then?

@mikkokotila
Contributor Author

Very strange: without actually reinstalling pybo, things work now. In the meantime, #9 came up as a new issue, which was easy to resolve as I mention there. I'm using an env that went through some other changes along the way, so it probably had to do with that. I will try to reproduce later with a clean env and report based on that.

@drupchen
Collaborator

Thank you!
