ProgramAB in Chinese? #129

Open
hairygael opened this issue Dec 3, 2017 · 4 comments


@hairygael
Contributor

Do you guys think it's possible to use ProgramAB with the Chinese language?
There is a request for that.
I started the first AIML file, but if it's not compatible with UTF-8, there's no point in going on.
[https://github.com/MyRobotLab/inmoov/blob/develop/InMoov/chatbot/bots/ch/_inmoovChatbot.aiml]

@moz4r
Contributor

moz4r commented Dec 5, 2017

Is it OK with Simplified Chinese?

@hairygael
Contributor Author

hairygael commented Dec 5, 2017 via email

@kwatters
Contributor

kwatters commented Dec 5, 2017

Hi Gael,
So, currently Chinese isn't supported in ProgramAB. The reason is that ProgramAB can't tell where one word stops and the next one starts, because written Chinese has no spaces between words to delimit them.
This problem is known as "word segmentation" (a form of "tokenization"). There is some limited support in ProgramAB for Japanese currently, but I'm not a native speaker so I can't speak to its accuracy. It looked pretty crude when I first looked at it, but who knows, maybe it does a good job.
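
To illustrate the segmentation problem, here is a minimal Java sketch. It assumes a matcher that splits input on whitespace, which is how word-based AIML matching is usually described; it is not ProgramAB's actual code, and the class name `WhitespaceTokens` is made up for this example. The Chinese sample is 你好吗 ("how are you"), written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of ProgramAB.
final class WhitespaceTokens {
    // Split on runs of whitespace, the way a space-delimited matcher would.
    static List<String> tokenize(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // English: the spaces give us three separate words.
        System.out.println(tokenize("how are you").size());          // prints 3
        // Chinese (ni hao ma): no spaces, so the whole sentence is one token.
        System.out.println(tokenize("\u4F60\u597D\u5417").size());   // prints 1
    }
}
```

With only one token, a word-level pattern never gets to see the individual words, so wildcards between words have nothing to work with.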
To make things a bit more complicated, Chinese can be written at least 3 different ways:
Traditional Chinese (the older, more complex character forms; I think they're typically more formal)
Simplified Chinese (the same characters in simplified forms, the standard in mainland China)
Pinyin (a phonetic transcription of Chinese words using Latin characters)

If we can get webkitspeech to return Pinyin from its recognition, that would probably work right now as is, as long as the Pinyin comes back with spaces between words.

Traditional and Simplified Chinese both have the word segmentation problem. Another issue is that the same words can be written in either Simplified or Traditional characters, which means we need to settle on one character set for the AIML. I recommend we focus on Simplified Chinese, as (I think...) it's slightly more common, but I'm not a Chinese speaker so I really can't comment on it with any authority.

So, long story short: no spaces in Chinese text makes ProgramAB no worky. We need to introduce code into ProgramAB that can identify the start and stop of words in Chinese (maybe other languages too!) so that the AIML will match properly.
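
One crude stopgap, sketched below, would be to skip real word segmentation and simply treat every Chinese character as its own token by putting spaces around it before matching. This is a different and much weaker technique than finding true word boundaries, and it would only help if the AIML patterns were written with the same per-character spacing. The class and method names are hypothetical, and nothing here is existing ProgramAB behavior.

```java
// Hypothetical pre-processing step for illustration; not part of ProgramAB.
final class HanSpacer {
    // Surround every Han (Chinese) character with spaces so that a
    // whitespace-based matcher sees one token per character.
    // Note: this is per-character splitting, NOT real word segmentation.
    static String spaceOutHan(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han) {
                out.append(' ').appendCodePoint(cp).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        // Collapse the doubled spaces and drop leading/trailing ones.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // ni hao ma ("how are you") becomes three space-separated characters.
        System.out.println(spaceOutHan("\u4F60\u597D\u5417"));
        // Latin text passes through unchanged.
        System.out.println(spaceOutHan("hello world"));
    }
}
```

The trade-off is that multi-character words get broken apart, so patterns become harder to read and write. A proper segmenter would avoid that.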

Right now, AIML for Chinese will only work with an EXACT match of the whole input string (which isn't very useful).

There are some libraries out there that can do word segmentation, since the same technology is used in search engines. There are some tokenizers in Lucene/Solr that might do the trick for us. Otherwise, there's another library called icu4j that handles some of these things, and yet another one from Stanford.
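
As a rough idea of what library-based segmentation could look like, here is a sketch using the JDK's own `java.text.BreakIterator`, chosen only because it needs no extra dependency. As far as I know, icu4j's `com.ibm.icu.text.BreakIterator` offers the same `getWordInstance` / `setText` / `next` calls and uses a dictionary for Chinese, so swapping the import would be the obvious next experiment. How well the plain JDK version splits Chinese is not something this sketch claims, and the class name is made up.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of ProgramAB.
final class BreakIteratorSegments {
    // Walk the word boundaries reported by BreakIterator and collect
    // every piece between two boundaries, skipping whitespace-only pieces.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                parts.add(piece);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(segments("how are you", Locale.ENGLISH)); // prints [how, are, you]
        // ni hao ma; the boundaries found here depend on the implementation.
        System.out.println(segments("\u4F60\u597D\u5417", Locale.CHINESE));
    }
}
```

The segments could then be joined with single spaces before the text is handed to the AIML matcher, with the AIML patterns written using the same word boundaries.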

I found some code on Stack Overflow that is pretty relevant to what we need to do to make it work:

https://stackoverflow.com/questions/12484019/how-to-tokenize-chinese-language-document

@hairygael
Contributor Author

hairygael commented Dec 5, 2017 via email
