
Hiraganas does not take context into account (like numbers) #12

Open
lasyan3 opened this issue May 24, 2021 · 11 comments

@lasyan3 commented May 24, 2021

Hi,
Not sure if this is a Kawazu or LibNMeCab issue, but when converting kanji, exceptions to the regular readings are ignored.
For example, converting 300 written as 三百 outputs さんひゃく (sanhyaku), but the correct reading is さんびゃく (sanbyaku).
Same for 600 and 900.

Currently working on a workaround on my fork: lasyan3@156bf7e
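The workaround in the fork is C#; as an illustration only, here is a minimal Python sketch of the same idea: override known sound-change cases after a naive per-kanji conversion. All names are hypothetical; the readings さんびゃく and ろっぴゃく are standard Japanese.

```python
# Naive per-kanji readings that a context-free converter would concatenate
NAIVE = {"三": "さん", "六": "ろく", "百": "ひゃく"}

# Number compounds whose reading is not the concatenation of the parts
SPECIAL_CASES = {
    "三百": "さんびゃく",   # rendaku: not さんひゃく
    "六百": "ろっぴゃく",   # gemination + rendaku: not ろくひゃく
}

def convert(text: str) -> str:
    """Return a hiragana reading, applying special cases before the naive fallback."""
    for word, reading in SPECIAL_CASES.items():
        text = text.replace(word, reading)
    # Characters already converted (or unknown) pass through unchanged
    return "".join(NAIVE.get(ch, ch) for ch in text)

print(convert("三百"))  # さんびゃく, not さんひゃく
```

The lookup table approach is exactly the "handle special cases one by one" strategy discussed below; it is simple but only covers the cases someone has listed.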

@Cutano (Owner) commented May 25, 2021

I see.
I think it is because MeCab doesn't regard 三百 as a whole word. Handling special cases is an effective method, but it might be hard to cover all the cases. Still, it is currently the best way to solve this problem.

@lasyan3 (Author) commented May 25, 2021

I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time, because I need it for my web app that helps me learn Japanese, and I'll see whether I keep going in this direction.
Anyway, feel free to use the code from my fork if you want and if you think it's OK (I had to change JapaneseElement from a struct to a class to be able to set some properties).

@Cutano (Owner) commented May 25, 2021

> I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time, because I need it for my web app that helps me learn Japanese, and I'll see whether I keep going in this direction.
> Anyway, feel free to use the code from my fork if you want and if you think it's OK (I had to change JapaneseElement from a struct to a class to be able to set some properties).

Sure, if it goes well in your project, please let me know and send me a pull request if possible. Handling cases one by one is better than handling nothing. Thanks for the advice.

@ookii-tsuki (Collaborator) commented:

> I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time, because I need it for my web app that helps me learn Japanese, and I'll see whether I keep going in this direction.

I think you could use JMDICT: go through each entry in that dictionary and check whether the reading generated by NMeCab matches the dictionary's; if not, add it to a list you can use to determine the special cases. That would be faster than doing it manually one by one.
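The comparison pass suggested above can be sketched as follows. This is a hypothetical Python illustration; `jmdict_entries` and `nmecab_reading` are placeholders for the real JMDICT data and the NMeCab-generated reading, and the toy data at the bottom stands in for both.

```python
def find_special_cases(jmdict_entries, nmecab_reading):
    """Collect words whose generated reading differs from the dictionary's.

    jmdict_entries: iterable of (word, dictionary_reading) pairs
    nmecab_reading: callable mapping a word to its generated reading
    """
    mismatches = {}
    for word, dict_reading in jmdict_entries:
        if nmecab_reading(word) != dict_reading:
            mismatches[word] = dict_reading  # keep the correct reading
    return mismatches

# Toy stand-ins for JMDICT and the NMeCab output:
entries = [("三百", "さんびゃく"), ("勉強", "べんきょう")]
naive = {"三百": "さんひゃく", "勉強": "べんきょう"}
print(find_special_cases(entries, naive.get))  # {'三百': 'さんびゃく'}
```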

@lasyan3 (Author) commented May 26, 2021

Well, I downloaded the Wacton Desu NuGet package (a .NET port of JMDICT) to run some tests, and I just realized that the issue seems to appear only with kanji representing numbers. Indeed, in a sentence, words are correctly segmented from each other. For example: 日本語を勉強します --> [日本語] [を] [勉強] [し] [ます]
And in that case, as far as I can tell, the reading is correct.
But with numbers, all the kanji are split individually, so the program cannot detect the special readings.
So if I am right, we just have to handle the cases for numbers, which is a pretty small set.

@ookii-tsuki (Collaborator) commented:

Yeah, I think just fixing numbers would be enough.
Also, in case you didn't know, counters are treated the same way as numbers.
For example:
一人 (ひとり) is divided into two elements, 一 (いち) and 人 (にん), while it should be one word.
Or:
一回 (いっかい) is divided into 一 (いち) and 回 (かい).
And there are a lot of other counters.
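The counter problem above lends itself to the same table-driven fix: merge adjacent single-kanji segments when they form a known number-plus-counter word. A hypothetical Python sketch (the function and table names are illustrative; the readings listed are standard Japanese):

```python
# Counter compounds whose reading is not the concatenation of the parts
COUNTER_EXCEPTIONS = {
    "一人": "ひとり",    # not いちにん
    "二人": "ふたり",    # not ににん
    "一回": "いっかい",  # gemination: not いちかい
    "一本": "いっぽん",  # gemination + rendaku
}

def fix_counters(segments):
    """Merge adjacent segments that form a known counter word."""
    out, i = [], 0
    while i < len(segments):
        pair = segments[i] + segments[i + 1] if i + 1 < len(segments) else ""
        if pair in COUNTER_EXCEPTIONS:
            out.append(COUNTER_EXCEPTIONS[pair])  # emit the whole-word reading
            i += 2
        else:
            out.append(segments[i])
            i += 1
    return out

print(fix_counters(["一", "人"]))  # ['ひとり']
```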

@lasyan3 (Author) commented May 26, 2021

Hmm, so maybe the root cause is the way NMeCab parses sentences? Maybe there is a way to make it group counters and numbers correctly; I'll investigate in that direction.

@ookii-tsuki (Collaborator) commented:

I think the actual problem comes from IpaDic, the dictionary NMeCab uses to parse sentences.
NMeCab also supports UniDic, which I've heard is better than IpaDic and more up-to-date, but it's a lot bigger (2 GB or so), and I don't know whether it has this problem with numbers.

@lasyan3 (Author) commented May 28, 2021

I think I found a solution to deal with the issue. Maybe not the best one, but it at least seems to work.
I let Kawazu split the sentence and keep only the kanji that stand alone (in the other cases, Kawazu identified the compound and thus the proper reading).
Then I use Wacton Desu to analyze the remaining kanji and compare the readings.
You can view the detailed implementation in my repo, in the "desu" branch: lasyan3@93bb51f
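The real implementation in the "desu" branch is C#; as a hypothetical Python sketch of the approach described above, `segments` stands for Kawazu's (surface, reading) output and `dictionary_reading` stands for a Wacton Desu lookup:

```python
def is_kanji(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def fix_lone_kanji(segments, dictionary_reading):
    """Re-check only single-kanji segments; multi-kanji segments were
    matched as whole words, so their reading is trusted as-is."""
    fixed = []
    for surface, reading in segments:
        if len(surface) == 1 and is_kanji(surface):
            dict_reading = dictionary_reading(surface)
            if dict_reading and dict_reading != reading:
                reading = dict_reading  # prefer the dictionary reading
        fixed.append((surface, reading))
    return fixed

# Toy data: the multi-kanji word is left alone, the lone kanji is re-checked
segs = [("日本語", "にほんご"), ("人", "にん")]
print(fix_lone_kanji(segs, {"人": "ひと"}.get))
```

One caveat with this sketch: a single kanji often has several valid readings, so a plain lookup can only flag a mismatch, not always pick the contextually correct reading.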

@ookii-tsuki (Collaborator) commented:

I don't think depending on the Wacton library is a good idea, because it uses a lot of RAM (about 460 MB for the Japanese entries). It would be better to run this test in a separate project, collect all the cases where the reading is wrong, save them as JSON or XML, and then use that file to check for wrong readings in Kawazu.
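The offline/online split suggested above can be sketched in a few lines. This is a hypothetical Python illustration of the idea (the real code would be C#): run the heavy dictionary comparison once, dump the mismatches to JSON, and load only that small file at runtime.

```python
import json

def dump_wrong_readings(mismatches: dict, path: str) -> None:
    """Offline step: persist {word: correct_reading} pairs as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mismatches, f, ensure_ascii=False, indent=2)

def load_wrong_readings(path: str) -> dict:
    """Runtime step: a cheap lookup table, no full dictionary in RAM."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

dump_wrong_readings({"三百": "さんびゃく"}, "wrong_readings.json")
print(load_wrong_readings("wrong_readings.json"))  # {'三百': 'さんびゃく'}
```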

@lasyan3 (Author) commented Jun 7, 2021

I tried running the test to collect all the cases, but I ended up with 58,109 wrong readings, which seems abnormally large to me.
So for now I'll stick with my first idea: dealing with counters and adding exceptions each time I see a new kind.
