Importing Japanese Terms adds zero-width char to some terms #396
Comments
Hi @eujev, thanks for the issue. Lute parses each term when it is imported. Unfortunately, for some languages like Japanese, the parsing depends on context, so terms like the ones you mentioned can end up tokenized differently when they appear in a full text. I'm not sure what the best approach is here, open to suggestions. For your existing data, we could do a direct db data fix. I know that's not optimal!
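For illustration, here is a minimal sketch of that context dependence, assuming the mecab-python3 binding and an installed dictionary (Lute's actual parser wiring may differ, and the exact splits depend on the dictionary, so treat the idea rather than the output as the point):

```python
# Minimal sketch of the context dependence, using the mecab-python3 binding
# (an assumption; Lute's actual integration may differ). Exact token splits
# depend on the installed dictionary.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated surface tokens

# The bare term and the term inside a sentence can tokenize differently,
# which is why an imported term may not match Lute's parse of a full text.
print(tagger.parse("年生").strip())          # term parsed in isolation
print(tagger.parse("彼は一年生です").strip())  # same term inside a sentence
```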
Hey @jzohrab, thanks for the quick reply! I guess the zero-width characters are needed to show the individual parts of a larger Kanji word, or are they also used for something else? If it's not advisable to get rid of the zero-width characters, maybe another option would be the ability to create an alias for some words, because only a few of the words are not correctly recognized in the text (e.g. 青森県 is recognized as known in the example). Greetings
The zero-width spaces are there to delimit parsed tokens; I had a wiki page about that. In summary, the ZWS characters vastly simplify pattern matching. A db fix for these words would just be removing the ZWS chars from your words in the db, or creating separate aliases as you mentioned. There's already an issue for that, but it doesn't solve your current problem. Can you live with your data as it is now? Unfortunately I'm travelling and am really strapped for time, but if you really need a fix for some reason, let me know and I could do something like write a query for your particular case. Hacky, but I don't have time! Cheers, and thanks for the thoughts above.
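For anyone landing here later, such a one-off fix might look roughly like this (a sketch only; the `words`/`WoText`/`WoTextLC` names are assumptions based on the LWT-style schema, so verify the actual schema and back up the database first):

```python
# Hypothetical one-off data fix: strip U+200B from stored term text.
# Assumes Lute keeps its data in SQLite and that terms live in a `words`
# table with WoText/WoTextLC columns (LWT-style schema) -- verify first.
import sqlite3

ZWS = "\u200b"  # zero-width space used to delimit parsed tokens

con = sqlite3.connect("lute.db")  # db path is an assumption
with con:
    con.execute("UPDATE words SET WoText = REPLACE(WoText, ?, '')", (ZWS,))
    con.execute("UPDATE words SET WoTextLC = REPLACE(WoTextLC, ?, '')", (ZWS,))
con.close()
```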
Okay, got it.
Hi all. I have a follow-up question/proposal regarding the zero-width spaces. As I currently understand it, every time a Japanese text or term is imported or added, the text is first parsed via mecab. It is mecab that adds those zero-width spaces, correct? Therefore I have a small proposal as an extra option: during the import of such a text, the user could select a checkbox telling Lute to bypass the mecab parser, tokenizing the text just on the real spaces, potentially replacing them with zero-width spaces or something else, and displaying that parsed text to the user for reading and term recognition (see the sketch below). Would implementing something like this be feasible, and is it a desirable feature? I might be wrong, but it appeared to me that both LWT and LingQ allowed fixing a text in this way when the word boundaries were not correctly recognized, at least in the past.
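A rough sketch of what such an option might do (hypothetical; no such checkbox or function exists in Lute today):

```python
# Sketch of the proposed bypass: split on the user's real spaces and rejoin
# with zero-width spaces so Lute's existing ZWS-based matching still works.
# The option and the function name are hypothetical.
ZWS = "\u200b"

def space_tokenize(text: str) -> str:
    """Treat user-inserted spaces as the only token boundaries."""
    return ZWS.join(tok for tok in text.split(" ") if tok)

print(repr(space_tokenize("青森県 に 住んで います")))  # tokens joined by ZWS
```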
Hi @andypeeters - thanks for the note.
What happens: mecab parses the text into smaller part-of-speech tokens, and Lute joins those tokens with zero-width spaces for pattern matching. E.g., here's mecab parsing:
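(The original comment included a mecab output example here; below is an illustrative stand-in, assuming the mecab-python3 binding. The exact token splits depend on the installed dictionary, so the comment output is indicative rather than exact.)

```python
# Illustrative sketch of the two steps: mecab tokenizes, Lute joins with ZWS.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # wakati mode: surface tokens only
tokens = tagger.parse("青森県に住んでいます").split()
print(tokens)  # e.g. ['青森', '県', 'に', '住ん', 'で', 'い', 'ます']

# Lute then joins the parsed tokens with U+200B for pattern matching:
zws_text = "\u200b".join(tokens)
```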
When written onscreen, each parsed token is displayed with the zero-width space in between.

Re your suggestion (thanks for the notes!): it's the right suggestion for a project, but I'm not sure how to do it in Lute without making things more complicated in general. Importing is handled by a single thing. Having Japanese/mecab-specific settings on an import page doesn't make sense for non-jp users (of course), and then it opens up the same questions for other parsers like jieba (mandarin) or mecab-ko (korean). It's feasible to do, b/c this is software and there are usually nice ways around everything, but it gets tough, and it only impacts a subset of users. I know it's not ideal, but at the moment, given the backlog of things that really should be done and that will benefit everyone, I can't invest time in this particular thing.

Despite all of the excuses above, this is still a good suggestion from you. Lute recently introduced the idea of "parser plug-ins", so more languages might be supported that have this particular need. If so, then having "parser-specific import handlers" might have more of a payoff, and your idea would have more ROI.
Description
When importing Terms from a .csv file for Japanese, zero-width chars are added to some terms, which results in these words not being parsed or recognized correctly in the text. (The terms were originally from LWT.) Maybe related to #371? Or are the terms parsed by mecab when they are imported?
Importing Japanese Terms also seemed to take longer than importing terms for other languages.
To Reproduce
Steps to reproduce the behavior, e.g.:
Import terms from a .csv file for Japanese (Example file included)
Japanese_example_terms.csv
Create new book (for example with the following example text)
Japanese_example_text.txt
Hover over blue words (like 年生 or 日間)
See the error: the text shows 年生, while the term is stored in the database with a zero-width char (年生). Hovering shows the term with the zero-width char plus the individual Kanjis.
Screenshots