Chinese whitespace error #7910
Hi @sara-tagger @degiz any progress?
Hi @sara-tagger @degiz How is this bug? Any progress?
Hi @gongshaojie12 what do your regex patterns for
@koaning any thoughts on this?
I am unfamiliar with
@gongshaojie12 this is just something to try; we've recently added support for spaCy 3.0, which also supports Chinese models. Can you confirm whether the issue persists with the spaCy tokenizers for Chinese?
@gongshaojie12 locally, in my notebook, I can confirm that spaCy seems to have a reasonable way of splitting up the tokens. I don't speak Chinese, however, so feel free to correct me.

```python
import spacy

text = "如何才能在下载和安装google app"
nlp = spacy.blank("zh")
for t in nlp(text):
    print(t, t.idx)
```

This is the output:
Hi @koaning Thank you for your reply. I rewrote the JiebaTokenizer class to solve this problem. Thanks!
Could you explain what you've changed? If you have any lessons to share, we might be able to think about improvements to our components for other users.
Hi @koaning The tokenize method in JiebaTokenizer does not remove spaces after word segmentation is completed, while subsequent components do remove spaces when extracting text features, which causes an inconsistency. So I rewrote the tokenize method to remove the spaces after word segmentation.
Based on my investigation using the provided examples, the problem is exactly what @gongshaojie12 found. I'll explain it here with an example:
The extra whitespace added in between is the culprit. What @gongshaojie12 tried as a solution will work very well if you have no entities in the NLU pipeline. However, if you have entities and you remove the whitespace as a post-processing step after the tokenizer, the entity alignment will be thrown off. This is because the start and end spans of entity annotations are recorded when the data is loaded, before the tokenizer processes the messages. @gongshaojie12 I would caution you about this potential pitfall; if you don't have any entities, you can definitely use your solution.

I haven't been able to try the spaCy tokenizer on this, because downloading the spaCy model took a long time and then errored out. @koaning If you already have the

I see two follow-up issues:
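A dependency-free illustration of this entity-alignment pitfall (the annotated entity and its span are hypothetical, chosen for demonstration only):

```python
text = "如何才能在下载和安装google app"

# Suppose "app" is annotated as an entity; its character span is recorded
# against the original text, before any tokenization happens.
start, end = text.index("app"), text.index("app") + len("app")
assert text[start:end] == "app"

# If a later step strips the whitespace, the recorded span no longer
# points at the entity text:
cleaned = text.replace(" ", "")
print(cleaned[start:end])  # prints "pp", not "app"
```

Every character before the removed space keeps its offset, but everything after it shifts left by one, so any entity annotated after a stripped space will be misaligned.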
@dakshvar22 the medium model does the same as the tokenizer that I tried earlier. To my knowledge, that's also how spaCy is designed: the tokenizer is the same across the blank/sm/md/lg/trf models. Interesting observation: spaCy depends on Jieba here, but it seems to add extra behavior.

```python
import spacy

nlp = spacy.load('zh_core_web_md')
# Building prefix dict from the default dictionary ...
# Dumping model to file cache /tmp/jieba.cache
# Loading model cost 0.499 seconds.
# Prefix dict has been built successfully.

[t for t in nlp("如何才能在下载和安装google app")]
# [如何, 才能, 在, 下载, 和, 安装, google, app]
```
@koaning That's why I had recommended trying to load
@dakshvar22 you are too quick! This was on my list of things to do today, but now I don't have to, thanks :)
Hello, I have the same issue here. Could you share more on how to rewrite JiebaTokenizer to solve this issue? Thanks
Hi @ljcljc, as follows:
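The snippet shared here wasn't preserved. Based on the fix described earlier in the thread, a minimal, dependency-free sketch of the idea might look like the following (the function name and the `(word, start_offset)` pair format are assumptions; a real JiebaTokenizer override would apply the same filter to jieba's segmentation output):

```python
def tokenize_without_whitespace(segments):
    """Drop whitespace-only tokens from a segmentation.

    segments: iterable of (word, start_offset) pairs, as a jieba-style
    segmenter would yield them. Surviving tokens keep their original offsets.
    """
    return [(word, start) for word, start in segments if word.strip()]

# A jieba-style segmentation of "google app" with character offsets:
segments = [("google", 0), (" ", 6), ("app", 7)]
print(tokenize_without_whitespace(segments))
# [('google', 0), ('app', 7)]
```

Note the caveat raised earlier in the thread: because entity spans are recorded before tokenization, filtering tokens like this is only safe when the pipeline extracts no entities.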
Thanks a lot for the code, it works for me now. |
Hi, @ljcljc Since I don't currently have a need for slot filling, I didn't consider entity recognition. If you have looked into entities, please share your thoughts, thank you!
I haven't looked deeply into entity extraction yet, but I want to use it in my project, and I will let you know if I find any issues after deployment. Thanks for your work.
Rasa version: 2.2.3
Rasa SDK version (if used & relevant):
Rasa X version (if used & relevant):
Python version: 3.6.12
Operating system (windows, osx, ...): windows and linux

Issue:
When the Chinese training data contains English words and spaces, the DIETClassifier cannot be used for training.

Error (including full traceback):
Command or request that led to error:
Content of configuration file (config.yml) (if relevant):
Content of domain file (domain.yml) (if relevant):
Content of nlu file (nlu.yml) (if relevant):

The DIETClassifier training error is caused by the space between `google` and `app` in the sentence "如何才能在下载和安装google app". If the space is removed, DIETClassifier trains normally. How can I solve this problem? Please help me, thanks!
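A dependency-free sketch of the mismatch behind this error (the token list mimics what a jieba-style segmenter typically produces for this sentence; it is an assumption, not captured output):

```python
# A jieba-style segmentation keeps the whitespace as its own token...
tokens = ["如何", "才能", "在", "下载", "和", "安装", "google", " ", "app"]

# ...but a downstream featurizer that strips whitespace produces one
# feature row per non-whitespace token:
features = [t for t in tokens if t.strip()]

# The counts no longer line up, which is the kind of token/feature
# inconsistency that can break DIETClassifier training.
print(len(tokens), len(features))
# 9 8
```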