ProgramAB in Chinese? #129
Is it OK with simplified Chinese?
Hello Anthony,
For now I have used simplified Chinese by default...
Kevin is also suggesting simplified as a start.
Gael Langevin
Creator of InMoov
InMoov Robot <http://www.inmoov.fr>
@inmoov <http://twitter.com/inmoov>
2017-12-05 11:41 GMT+01:00 Anthony <notifications@github.com>:
… Is it ok with simplified chinese ?
Hi Gael,

So, currently Chinese isn't supported in ProgramAB. The reason is that ProgramAB does not know where one word stops and the next starts, because written Chinese has no spaces between words to delimit them. This is a problem known as "word segmentation", also known as "tokenization". There is some limited support in ProgramAB for Japanese currently, but I am not a native speaker so I can't speak to its accuracy; it looked pretty crude when I first looked at it, but who knows, maybe it does a good job.

To make things a bit more complicated, Chinese can actually be written three different ways (probably more):

Traditional Chinese (the more ornate characters; I think they're typically more formal)
Simplified Chinese (slightly simpler characters that school children would learn)
Pinyin (a phonetic transcription of Chinese words using Latin characters)

If we can get webkitspeech to return Pinyin from its recognition, that would probably work right now as it is. Traditional and simplified Chinese both have the word-segmentation problem. One issue is that you can write something in simplified Chinese or in traditional Chinese and they represent the exact same words, which means that we need to settle on one character set. I recommend we focus on simplified Chinese, as (I think) it's slightly more common, but I'm not a Chinese speaker so I really can't comment on it with any authority.

So, long story short: with no spaces in Chinese text, ProgramAB doesn't work. We need to introduce code into ProgramAB that can identify the start and end of words in Chinese (and maybe other languages too!) so that the AIML will match properly. Right now, AIML for Chinese will only work with an EXACT match of the input string, which isn't very useful.

There are some libraries out there that can do word segmentation, since this same technology is used in search engines; there are some tokenizers in Lucene-Solr that might be able to do the trick for us. Otherwise, there's another library called icu4j that handles some of these things, and yet another one from Stanford. I found some code on Stack Overflow that is pretty relevant to what we need to do to make it work: https://stackoverflow.com/questions/12484019/how-to-tokenize-chinese-language-document
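To make the segmentation problem concrete, here is a minimal sketch of forward maximum matching (FMM), a classic dictionary-based baseline for Chinese word segmentation. The class name and the tiny dictionary are made up for illustration; a real integration would more likely wrap a tokenizer from Lucene, icu4j, or the Stanford segmenter rather than roll its own.

```java
import java.util.*;

// Forward maximum matching: at each position, greedily take the longest
// dictionary word; fall back to a single character when nothing matches.
public class FmmSegmenter {
    private final Set<String> dict;
    private final int maxLen;

    public FmmSegmenter(Collection<String> words) {
        this.dict = new HashSet<>(words);
        int m = 1;
        for (String w : words) m = Math.max(m, w.length());
        this.maxLen = m;
    }

    public List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = text.substring(i, i + 1); // fallback: one character
            // Try candidates from longest to shortest (length >= 2).
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }

    public static void main(String[] args) {
        // "我喜欢机器人" = "I like robots"; the dictionary entries are assumptions.
        FmmSegmenter seg = new FmmSegmenter(List.of("我", "喜欢", "机器人", "机器"));
        System.out.println(String.join(" ", seg.segment("我喜欢机器人")));
        // prints: 我 喜欢 机器人
    }
}
```

Once the input is split this way, the space-delimited tokens can be fed to AIML pattern matching just like English words. Note that FMM is greedy and dictionary-dependent, which is exactly why the search-engine tokenizers mentioned above are the better long-term answer.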
Thanks Kevin for all this information.
I have sent the thread link to the Chinese person who is involved in
the project, so we can start deciding which of the three options to
select.
2017-12-05 15:04 GMT+01:00 Kevin Watters <notifications@github.com>:
… Hi Gael,
So, currently Chinese isn't supported in ProgramAB. The reason is:
ProgramAB does not know where one word stops and the next word starts
because in the Chinese language there are no spaces between words to
delimit them.
This is a problem known as "word segmentation" also known as
"tokenization". There is some limited support in ProgramAB for Japanese
currently, but I am not a native speaker so I can't talk to the accuracy of
it, it looked pretty crude when I first looked at it, but who knows, maybe
it does a good job.
To make things a bit more complicated, Chinese actually can be written 3
different ways.
(probably more)
Traditional Chinese (These are decorative kanji characters and I think
they're typically more formal.)
Simplified Chinese (These are slightly simpler kanji characters that
school children would learn)
Pinyin (This is a phonetic transcription of the Chinese word using Latin
characters. )
…
Do you guys think it's possible to use ProgramAB with the Chinese language?
There is a request for that.
I started the first AIML file, but if it's not compatible with UTF-8, there is no point in going on.
[https://github.com/MyRobotLab/inmoov/blob/develop/InMoov/chatbot/bots/ch/_inmoovChatbot.aiml]
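For reference, a category in that bot file could look roughly like the sketch below once inputs are word-segmented. The Chinese phrases here are purely illustrative and are not taken from the linked file; the key point is that the pattern tokens are space-delimited, which is what ProgramAB's matcher needs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<aiml version="2.0">
  <!-- Illustrative category only. It assumes the recognized input has
       already been segmented, so 你叫什么名字 ("what is your name")
       arrives as the space-delimited tokens below. -->
  <category>
    <pattern>你 叫 什么 名字</pattern>
    <template>我 叫 InMoov。</template>
  </category>
</aiml>
```

AIML itself is plain XML, so UTF-8 is fine as long as the file is saved in that encoding and declares it in the XML prolog, as above.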