# Adding Stanza #23
This definitely fits the goal of this package. It's now on the TODO. Are you interested in collaborating on this one?
Hi!
How about this: around the time that I think I might have something, could you have a peek and give a review? The main thing I'd like to get a second pair of eyes on is the …
It feels like the components could be split up though.
Yes, I can help with reviews! I can also try to answer any questions you have about stanza, since I have used it more. Estonian works with a white-space tokeniser, which is commonly used. Entity detection with stanza is quite alright, and it is the best available resource in the sense that it can easily be used in other applications as well. Several Estonian language technologies that companies in Estonia use have stanza under the hood. Lemmatisation is crucial for me and is the whole reason I started commenting on the blog. Estonian (like many other languages, including Finnish and Hungarian) is a mostly agglutinative language. This means that a noun can have 29±1 different forms and a verb ca 93 different forms (sic!). If I don't lemmatise the text, Rasa treats all those different forms as separate words, which increases the amount of training data needed and decreases Rasa's efficiency.
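To illustrate the point, here is a minimal stanza sketch (the example sentence and setup are my own, and it assumes the Estonian model has been downloaded beforehand):

```python
import stanza

# Illustrative only: assumes the Estonian model was fetched beforehand
# via stanza.download("et").
nlp = stanza.Pipeline(lang="et", processors="tokenize,pos,lemma")

# "raamatutes" (in books) and "raamatutest" (about books) are two of
# the many inflected forms of "raamat" (book).
doc = nlp("Raamatutes kirjutatakse raamatutest.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, "->", word.lemma)

# Both surface forms should lemmatise back to "raamat", so a downstream
# model sees one word instead of dozens of inflections.
```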
Interesting! Out of curiosity: considering that our pipeline has a …
I've just checked with the research team; we actually offer lemmatisation in our …
Regarding the 2/3/4-grams question: … Regarding your last question, I don't know. I tried to hack …
The Rasa word-level CountVectorizer uses the …
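For context on those n-grams: Rasa's CountVectorsFeaturizer is built on scikit-learn's CountVectorizer, and a minimal sketch of the char_wb 1-4 gram behaviour looks like this (the example words are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# char_wb builds character n-grams only from text inside word
# boundaries; this is what min_ngram: 1 / max_ngram: 4 map to in Rasa.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
vec.fit(["raamat", "raamatutes", "raamatutest"])

# The inflected forms share many sub-word grams ('raam', 'amat', ...),
# so they are no longer completely unrelated features.
print(sorted(vec.vocabulary_))
```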
@Lindafr I'm wondering what the best approach is here. I can attempt to get … Would you agree that the POS tags are probably the most important feature to get started with? I should be able to add these as sparse features for the machine learning pipeline.
Hi @koaning, a Rasa component would be easy for the end user. I, myself, was thinking along the lines of just substituting spaCy …
I've talked to the research team about lemmas and it seems like we've never really supported them directly, only indirectly for spaCy pipelines. There's interest in exploring it further, but it may take time to reach a consensus on the best approach. Until then, I'll keep that in mind for when I've got time to work on this.
I will be starting with a tokenizer first. The reason is that internally, if you add a lemma property to a token, the CountVectorsFeaturizer is able to pick up the lemma instead of the word.
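As a sketch of what that means (assuming the Rasa 1.x `Token` API; the values are illustrative):

```python
from rasa.nlu.tokenizers.tokenizer import Token

# A token whose lemma differs from its surface text. When a tokenizer
# sets this attribute, downstream featurizers can count "greeting"
# rather than the inflected "greetings".
token = Token(text="greetings", start=0, lemma="greeting")
print(token.text, token.lemma)
```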
@Lindafr I have a first implementation up and running and you should be able to play with it. I've only tried it out with English so far on my local machine, but you should be able to use the implementation on the PR that is linked for any supported Stanza language, if you've got the time and you'd like to play with it. You should be able to download the …

```python
from stanzatokenizer import StanzaTokenizer
from rasa.nlu.training_data import Message

# You can change the language setting here
tok = StanzaTokenizer(component_config={"lang": "en"})

# This is a Rasa internal thing, you need to wrap text in an object
m = Message("i am running and giving many greetings")
tok.process(m)

# You should now be able to check the properties of the message.
[t.text for t in m.as_dict()['tokens']]
[t.data.get('pos', '') for t in m.as_dict()['tokens']]
[t.lemma for t in m.as_dict()['tokens']]
```

For example, the last three lists on my machine were:

```
['i', 'am', 'running', 'and', 'giving', 'many', 'greetings', '__CLS__']
['PRON', 'AUX', 'VERB', 'CCONJ', 'VERB', 'ADJ', 'NOUN', '']
['i', 'be', 'run', 'and', 'give', 'many', 'greeting', '__CLS__']
```

The …
Hi @koaning, I just found this. I didn't know it existed before, but it seems that integrating stanza into spaCy may be even easier, and you won't have to invent a whole new pipeline. I'll take a look at this …
If either works, let me know :)
Having glanced at the docs, it seems like the model can be written to disk with:

```python
nlp.to_disk("./stanza-spacy-model")
```

To properly link it you might need to do the same steps as I'm taking here.
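For reference, the spacy-stanza README (at the time) constructs and saves the pipeline roughly like this (a sketch; it assumes the English stanza model is downloaded):

```python
import stanza
from spacy_stanza import StanzaLanguage

# Note: the StanzaLanguage wrapper needs a stanza pipeline ("snlp")
# at construction time.
snlp = stanza.Pipeline(lang="en")
nlp = StanzaLanguage(snlp)

doc = nlp("Barack Obama was born in Hawaii.")
nlp.to_disk("./stanza-spacy-model")
```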
Hi @koaning, I got an error with …
Thanks for letting me know. I'm having a lunch break now, but I might have some time to have a look at the spaCy bindings for stanza. Just to check, do you also have a Rasa project? You should now also be able to use that tokenizer in a pipeline:

```yaml
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
```

If you run it with …
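Assuming a standard Rasa project layout, trying a config like this out locally would typically look like:

```bash
rasa train nlu   # trains the NLU pipeline defined in config.yml
rasa shell nlu   # type messages and inspect tokens/intents interactively
```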
I've just tried using spaCy directly and I had some troubles. There's a caveat mentioned here, and I'm getting this error when I try to package it all up:

```
TypeError: __init__() missing 1 required positional argument: 'snlp'
```

I fear that we'd need to construct a special binding around this spacy-stanza package to get it to work for the Rasa use-case. It's doable, but it might make more sense to build the stanza feature directly for Rasa.
Did you try this on English? With Estonian I get the language error that should already be solved ("But because this package exposes a spacy_languages entry point in its setup.py that points to StanzaLanguage, spaCy knows how to initialize it.").

My hope was that this would remove the need to construct a special binding in Rasa. But if this doesn't work, then stanza support in Rasa would indeed be more sensible.
@Lindafr I've tried doing it with English, yes, but the current Rasa support for spaCy does not assume that we need to pass … If I end up having a component here for stanza, then I prefer to host a direct binding to Rasa. Fewer things to maintain that way.
I agree. I'll investigate the …
@Lindafr out of curiosity, did you notice an improvement with the …
@Lindafr not the biggest rush, but did you try the tool on a Rasa project by any chance? If you've experienced the merits of it, then I might be able to wrap up this feature this week.
I get this error when running a test project with … The config.yml: …
@Lindafr It was indeed the same issue. It should now be fixed; the only caveat is that you now need to supply the path to your stanza installation manually. That means you might need to do something like:

```yaml
language: en

pipeline:
- name: rasa_nlu_examples.tokenizers.StanzaTokenizer
  lang: "en"
  cache_dir: "tests/data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
```

If you are running this locally, then usually the stanza files are located in …
Ubuntu :)
Just to make sure I did everything accordingly, I'll describe the testing conditions here. I used the … The stanza_test folder looks something like this: … then with …

My WhiteSpaceTokenizer is in another folder and it's …

I first trained new models and then tested them. I tried to create some harder user messages that won't result in high confidence, to better review the confidence differences. Here are the results (for the sake of shortness, I excluded the messages and intent names themselves): …

As you can see, the pipeline with the StanzaTokenizer is not confident about anything and usually gets the intent wrong. The WST, however, usually gets things right, but when it's wrong, it's wrong quite confidently. The dataset I am working on has ca 32 intents, some of which have very similar keywords or situations (à la creating a passport application and receiving a passport).
@Lindafr this is very elaborate. Thanks for sharing! I may have found an issue with your setup though.

**Your current Stanza Config**

```yaml
language: et

pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
```

Notice how DIET is only using 1 epoch? This is probably why the setup is underperforming. Also, you've removed the …

**My Stanza Config Proposal**

Could you try running this?

```yaml
language: et

pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 200

policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
```

If you're interested in summary statistics, you can also run two config files and compare them. Here's a snippet that might help:

```bash
rasa test nlu --config configs/config-light.yml \
  --cross-validation --runs 1 --folds 2 \
  --out gridresults/config-light

rasa test nlu --config configs/config-heavy.yml \
  --cross-validation --runs 1 --folds 2 \
  --out gridresults/config-heavy
```

This will grab two config files (in this case config-light.yml and config-heavy.yml) …
Hi @koaning, based on the results I suspected some kind of mistake; that's why I wrote about the test conditions (I copy-pasted your example and didn't think it through). Today I'll run it again and then we'll have some real results. It might take a bit of time, because I first have to do some other things regarding Rasa that are due tomorrow.
@Lindafr no rush! I'm just super curious about the results. 😄
Renewed results are as follows (with the config proposed earlier): …

As you can see, ST got only 3 utterances wrong, while WST got 4 wrong. Overall, if you exclude the first example, WST is more confident in its wrong intents, while stanza is not confident when it's wrong. Stanza seems to be better on this manual check. Now, for the other statistics.

**WhiteSpaceTokeniser**

2020-09-09 14:26:20 INFO rasa.test - Intent evaluation results …

**StanzaTokeniser**

2020-09-09 14:43:59 INFO rasa.test - Intent evaluation results …

**Summary**

One can see that WST's test accuracy, F1 and precision are actually slightly better. A bit surprising, as I thought the extra POS info would help.

**Extra**

With …, the results are slightly better than the recommended ST, but still below WST: …
Interesting. Looking at these results, I wonder about another issue: the difference between the train/test results in both the … If you're up for it, could you try tuning the epochs down to maybe 50 for DIET? If you still see a big difference between train and test, feel free to tune it down even further to 25. The goal is to manually stop the algorithm before it starts overfitting.
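Concretely, that just means changing the last pipeline entry, e.g.:

```yaml
- name: DIETClassifier
  epochs: 50   # drop further to 25 if train/test still diverge
```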
@Lindafr is your dataset publicly available? I might also be able to run some benchmarks on your behalf if you're interested. The research team here would love to have an Estonian dataset to benchmark our algorithms on.
Some more statistics on basically the same …

**StanzaTokenizer**

- with DIET epochs 50: …
- with DIET epochs 25: …
- with DIET epochs 15: …

**WhiteSpaceTokenizer**

- with DIET epochs 50: …
- with DIET epochs 25: …
- with DIET epochs 15: …

The overfitting is still quite large. I think it's partly due to my still very small dataset: every one of the 32 intents has only 10-20 example questions, so there's not much to learn from. I have plans to make the dataset larger with test users and by hand, but right now that's all I've got.
@koaning, the dataset is not publicly available and I cannot share it. Do you think part of the overfitting can be due to my small dataset? I can try again later when I have more examples, but right now the results are as they are, and adding POS doesn't seem to bring any improvement (?).
@Lindafr yeah, those last results seem convincing. I'll add the feature just in case so that other folks can try it out, but it seems safe to say that on your use-case right now it doesn't contribute anything substantial. Thanks a lot for the feedback though 👍! It's been very(!) helpful. I'll merge the stanza feature today. It might prove useful to other folks, and you can also check if it boosts performance once you've got more data. I'll close this issue once the feature is merged. One last question out of curiosity: did you try the pretrained embeddings (BytePair/FastText)? If so, I'd love to hear if they were helpful.
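For reference, trying the BytePair embeddings from this repository would look something like the sketch below (parameter names as in the project docs at the time; treat the exact values as placeholders):

```yaml
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: et      # language of the subword embeddings
  vs: 10000     # vocabulary size of the pretrained model
  dim: 100      # embedding dimension
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 50
```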
#30 has been merged! Closing this issue, but we can pick the conversation back up here if there's a related discussion to be continued.
Just wanted to mention it here: full stanza support is now coming to spaCy (https://explosion.ai/blog/spacy-v3-nightly).
@koaning, are you sure? They don't mention it anywhere; they only compare Stanza's NER results with theirs in one table. Plus, they do have the … Btw, is there any way to use …
I recall reading elsewhere that a tighter integration is now possible, but I can't find the link anymore ... will look again! If you want FastText embeddings, can't you use the ones that are available directly in this repository?
In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatisation. That is, at least, the case for Estonian. It should also be for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.