ProgramAB in Chinese? #129

hairygael · 2017-12-03T15:31:40Z

Do you guys think it's possible to use ProgramAB with the Chinese language?
There is a request for that.
I started the first AIML file, but if ProgramAB is not compatible with UTF-8, there is no point in going on.
[https://github.com/MyRobotLab/inmoov/blob/develop/InMoov/chatbot/bots/ch/_inmoovChatbot.aiml]

moz4r · 2017-12-05T10:41:31Z

Is it OK with Simplified Chinese?

hairygael · 2017-12-05T12:41:49Z

Hello Anthony, for now I used Simplified Chinese by default... Kevin is also suggesting using Simplified as a start.


kwatters · 2017-12-05T14:04:17Z

Hi Gael,
So, currently Chinese isn't supported in ProgramAB. The reason is that ProgramAB does not know where one word stops and the next one starts, because written Chinese has no spaces between words to delimit them.
This problem is known as "word segmentation", also known as "tokenization". There is some limited support in ProgramAB for Japanese currently, but I am not a native speaker so I can't speak to its accuracy. It looked pretty crude when I first looked at it, but who knows, maybe it does a good job.
To make things a bit more complicated, Chinese can actually be written 3 different ways (probably more):
Traditional Chinese (These are the more decorative Chinese characters, and I think they're typically more formal.)
Simplified Chinese (These are slightly simpler Chinese characters that school children would learn.)
Pinyin (This is a phonetic transcription of the Chinese word using Latin characters.)

If we can get webkitspeech to return Pinyin from its recognition, that would probably work right now as it is.

Traditional and Simplified Chinese both have the problem of word segmentation. Another issue is that you can write something in Simplified Chinese or in Traditional Chinese and both represent the exact same words, which means that we need to settle on one character set. I recommend we focus on Simplified Chinese, as (I think..) it's slightly more common, but I'm not a Chinese speaker so I really can't comment on it with any authority.
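If we settle on Simplified Chinese for the AIML, one possible way to still accept Traditional input would be to normalize it before matching. The following is only a rough sketch of that idea, not something ProgramAB does today. The class name `ScriptNormalizer` and the three-entry mapping table are hypothetical and purely illustrative; a real table would need thousands of entries and would have to come from a proper conversion resource.

```java
import java.util.Map;

/** Hypothetical helper: maps a few Traditional characters to Simplified ones. */
final class ScriptNormalizer {
    // Illustrative table only (assumption: a real one would be far larger).
    private static final Map<Integer, Integer> TRAD_TO_SIMP = Map.of(
            0x6A5F, 0x673A,  // 機 -> 机
            0x5011, 0x4EEC,  // 們 -> 们
            0x570B, 0x56FD); // 國 -> 国

    /** Replaces every code point found in the table, leaves the rest untouched. */
    static String toSimplified(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(TRAD_TO_SIMP.getOrDefault(cp, cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Traditional "機器人" (robot) becomes Simplified "机器人".
        System.out.println(toSimplified("機器人"));
    }
}
```

With something like this in front of the matcher, only one set of AIML files (the Simplified one) would need to be maintained.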

So, long story short: with no spaces in Chinese text, ProgramAB doesn't work. We need to introduce code into ProgramAB that can identify the start and stop of words in Chinese (maybe other languages too!) so that the AIML will match properly.

Right now, AIML for Chinese will only work with an EXACT match of the input string (which isn't very useful).
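To make the exact-match problem concrete, here is a small sketch using only the Java standard library. It shows that splitting a Chinese sentence on whitespace yields a single token, and then applies a very crude workaround: putting a space around every Han character so that each character becomes its own token. The class name `HanSpacer` is hypothetical, and this is not how ProgramAB currently behaves. It also assumes the AIML patterns would be written with the same character-by-character spacing.

```java
/** Hypothetical helper: character-level "tokenization" for Han text. */
final class HanSpacer {

    /** Surrounds every Han character with spaces, then collapses extra whitespace. */
    static String spaceOutHan(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            boolean isHan = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (isHan) {
                out.append(' ').appendCodePoint(cp).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sentence = "你好机器人";
        // Whitespace splitting sees the whole sentence as one "word".
        System.out.println(sentence.split("\\s+").length); // prints 1
        // After spacing, each character is a separate token.
        System.out.println(spaceOutHan(sentence));         // prints 你 好 机 器 人
    }
}
```

Splitting per character is not real word segmentation (it treats "机器人" as three tokens instead of one word), but it would at least let wildcards match parts of a sentence.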

There are some libraries out there that can do word segmentation, as this same technology is used in search engines. There are some tokenizers in Lucene-solr that might be able to do the trick for us. Otherwise, there's another library called icu4j that handles some of these things, and yet another one from Stanford.
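To give an idea of what dictionary-based segmentation does, here is a minimal sketch of forward maximum matching, a simple textbook technique. It is not the algorithm used by Lucene, icu4j, or the Stanford segmenter, and the class name `MaxMatchSegmenter` and the tiny word list are hypothetical. It just greedily takes the longest dictionary word at each position and falls back to a single character when nothing matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of forward maximum matching over a word list. */
final class MaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    MaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        // Longest dictionary entry bounds the candidate window (at least 1).
        this.maxWordLength = Math.max(1,
                dictionary.stream().mapToInt(String::length).max().orElse(1));
    }

    /** Splits text into words; works on UTF-16 units, so BMP characters only. */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a known word or one character.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Illustrative dictionary (assumption): "hello", "robot", "name".
        MaxMatchSegmenter segmenter =
                new MaxMatchSegmenter(Set.of("你好", "机器人", "名字"));
        // Joining with spaces gives input a whitespace-based matcher could handle.
        System.out.println(String.join(" ", segmenter.segment("你好机器人")));
        // prints 你好 机器人
    }
}
```

The quality of this approach depends entirely on the dictionary, which is one reason an existing library is probably a better bet than rolling our own.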

I found some code on Stack Overflow that is pretty relevant to what we need to do to make it work:

https://stackoverflow.com/questions/12484019/how-to-tokenize-chinese-language-document

hairygael · 2017-12-05T16:48:30Z

Thanks Kevin for all this information. I have sent the thread link to the Chinese person who is involved in the project, so that we can start deciding which of the three options to select.


hairygael added the enhancement label Dec 3, 2017
