Errors in transliterations #16

Open
seertenedos opened this issue Oct 24, 2021 · 4 comments

@seertenedos

I could be doing things incorrectly, but I am basically trying to do two things given almost any input in a Japanese text field.

  1. Convert to Kana
  2. Convert to Romaji

The Romaji conversion throws a lot of errors most of the time, like the one below from a unit test.

        [Theory]
        [InlineData("袖ケ浦港運", "Sodegaura-kō un")]
        public async Task RomajiTransliterationTest(string input, string expectedOutput)
        {
            KawazuConverter converter = new();
            var output = await converter.Convert(input, To.Romaji, Mode.Okurigana, RomajiSystem.Hepburn);
            Assert.Equal(expectedOutput, output);
        }

System.ArgumentOutOfRangeException: Length cannot be less than zero. (Parameter 'length')
   at System.Text.StringBuilder.ToString(Int32 startIndex, Int32 length)
   at Kawazu.Division..ctor(MeCabIpaDicNode node, TextType type, RomajiSystem system)
   at Kawazu.KawazuConverter.<>c__DisplayClass6_0.<Convert>b__1(MeCabIpaDicNode node)
   at System.Linq.Enumerable.SelectArrayIterator`2.ToList()
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Kawazu.KawazuConverter.<>c__DisplayClass6_0.<Convert>b__0()
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__277_0(Object obj)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
@seertenedos
Author

I think I located the issue. Hopefully this makes sense, since I can't read or write Japanese myself. Utilities.GetTextType returns the wrong result in cases where a kana character is actually considered a kanji character, as in the name "袖ケ浦港運". In that case the "ケ" should have been treated as kanji and Utilities.GetTextType should have returned PureKanji, but it returns KanjiKanaMixed instead, which then breaks the conversion. If you force it to PureKanji, the conversion seems to work.
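
To illustrate the idea, here is a minimal sketch of a classifier that counts "ケ" (or the small "ヶ") as kanji when it sits directly between two kanji. `TextKind`, `TextKindSketch`, and `Classify` are hypothetical names made up for this example; this is not Kawazu's actual `Utilities.GetTextType`, and the character ranges are a simplifying assumption.

```csharp
public enum TextKind { PureKanji, PureKana, KanjiKanaMixed, Other }

public static class TextKindSketch
{
    // Assumption: CJK Unified Ideographs plus the iteration mark count as kanji.
    private static bool IsKanji(char c) => (c >= '\u4E00' && c <= '\u9FFF') || c == '々';

    // Assumption: the Hiragana and Katakana blocks count as kana.
    private static bool IsKana(char c) => c >= '\u3040' && c <= '\u30FF';

    public static TextKind Classify(string text)
    {
        bool hasKanji = false, hasKana = false;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // "ケ"/"ヶ" flanked by kanji on both sides is treated as kanji-like.
            bool flanked = i > 0 && i < text.Length - 1
                           && IsKanji(text[i - 1]) && IsKanji(text[i + 1]);
            if (IsKanji(c) || ((c == 'ケ' || c == 'ヶ') && flanked))
                hasKanji = true;
            else if (IsKana(c))
                hasKana = true;
            else
                return TextKind.Other; // anything else (Latin, brackets, ...) is out of scope here
        }
        if (hasKanji && hasKana) return TextKind.KanjiKanaMixed;
        if (hasKanji) return TextKind.PureKanji;
        if (hasKana) return TextKind.PureKana;
        return TextKind.Other; // empty input
    }
}
```

With this rule, `TextKindSketch.Classify("袖ケ浦港運")` yields `TextKind.PureKanji`, while ordinary katakana words next to kanji still come out as `KanjiKanaMixed`.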

@Cutano
Owner

Cutano commented Oct 25, 2021

> I think I located the issue. Hopefully this makes sense, since I can't read or write Japanese myself. Utilities.GetTextType returns the wrong result in cases where a kana character is actually considered a kanji character, as in the name "袖ケ浦港運". In that case the "ケ" should have been treated as kanji and Utilities.GetTextType should have returned PureKanji, but it returns KanjiKanaMixed instead, which then breaks the conversion. If you force it to PureKanji, the conversion seems to work.

Yes, I checked the call stack and found that the problem is caused by the ambiguity of the kana "ケ". Normally it is pronounced "ke", but in the example you gave it is pronounced "ga" (as in "Sodegaura"), so the IndexOf() call finds no match.
I'm working on this problem at the moment, but I don't yet know how to solve it cleanly.
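
The failure mode can be reproduced in isolation. The sketch below assumes the analyzer's reading for the name is "ソデガウラ" and that the code looks for the surface kana "ケ" inside that reading; both are assumptions for illustration, and `ReadingAlignmentDemo` is a made-up class, not Kawazu's `Division`. Because the reading contains "ガ" rather than "ケ", `IndexOf` returns -1, and passing that as a length to `StringBuilder.ToString(int, int)` raises the same exception type as in the stack trace.

```csharp
using System;
using System.Text;

public static class ReadingAlignmentDemo
{
    // Unguarded version: mirrors the failure when the kana is missing from the reading.
    public static string PrefixBeforeUnguarded(string reading, string kana)
    {
        int index = reading.IndexOf(kana, StringComparison.Ordinal); // -1 if not found
        return new StringBuilder(reading).ToString(0, index);        // throws when index < 0
    }

    // Guarded version: reports failure instead of throwing.
    public static bool TryPrefixBefore(string reading, string kana, out string prefix)
    {
        int index = reading.IndexOf(kana, StringComparison.Ordinal);
        if (index < 0)
        {
            prefix = string.Empty;
            return false;
        }
        prefix = reading.Substring(0, index);
        return true;
    }
}
```

Calling `PrefixBeforeUnguarded("ソデガウラ", "ケ")` throws `ArgumentOutOfRangeException`, whereas `TryPrefixBefore` simply returns `false`, which would let the caller fall back to another strategy.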

Cutano added a commit that referenced this issue Oct 25, 2021
@Cutano
Owner

Cutano commented Oct 25, 2021

I updated the NuGet package and solved the problem temporarily by filtering the ケ-related results, but there could be other problems. For now, it is just a temporary solution.

@seertenedos
Author

Thanks, Cutano. I am running people and company names through it, so it is hitting other cases that are not fixed, but that fix did work for a lot of the names in the batch I was testing on.

Below are a few more that are failing. It is not all of them, but I am hoping it is enough to give a good idea of the issue. The first fix seems to have reduced the errors by over 50%.
日本コンクリート工業(株)
東京コスモス電機(株)
日本アジア投資(株)
日精エー・エス・ビー機械(株)
(株)三ッ星
三ッ矢産業
三ツ川工業所
三ツ川浩一
白神自然学校一ッ森校
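
For triaging a batch like this, a small harness that records which inputs throw can help. This is only a sketch: `BatchConversionSketch` and `CollectFailuresAsync` are hypothetical names, and the conversion is passed in as a delegate so the example does not depend on any particular library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchConversionSketch
{
    // Runs `convert` on every input and returns the inputs that threw,
    // together with the exception type name.
    public static async Task<List<(string Input, string Error)>> CollectFailuresAsync(
        IEnumerable<string> inputs, Func<string, Task<string>> convert)
    {
        var failures = new List<(string Input, string Error)>();
        foreach (var input in inputs)
        {
            try
            {
                await convert(input);
            }
            catch (Exception ex)
            {
                failures.Add((input, ex.GetType().Name));
            }
        }
        return failures;
    }
}
```

Assuming the converter is used as in the unit test above, the delegate could be `input => converter.Convert(input, To.Romaji, Mode.Okurigana, RomajiSystem.Hepburn)`.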
