
Adding Stanza #23

Closed · Lindafr opened this issue Aug 21, 2020 · 42 comments


Lindafr commented Aug 21, 2020

In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatization. That is the case for Estonian at least, and it should also hold for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.


koaning commented Aug 21, 2020

This definitely fits the goal of this package. It's now on the TODO.

Are you interested in collaborating on this one?


Lindafr commented Aug 25, 2020

Hi!
Collaboration would be interesting, but I doubt I have enough time. I tried to hack Rasa and add Stanza's Estonian lemmatization straight into the SpacyTokenizer, but have failed so far. I guess those who are more familiar with the code will find the task easier.


koaning commented Aug 25, 2020

How about this: around the time I think I might have something, could you have a peek and give a review? The main thing I'd like a second pair of eyes on is the stanza package, because I've never used it.


koaning commented Aug 25, 2020

It feels like the components could be split up, though.

  • There's potential for a tokeniser. Just to check, @Lindafr: does Estonian not work with a whitespace tokeniser?
  • There's potential for a POS featurizer. These would be sparse features that get added.
  • There's potential for an entity detector as well. Do you have any experience with its quality?
  • There also seems to be dependency parsing and lemmatisation in there, but it feels to me like those features are less relevant in a Rasa pipeline.


Lindafr commented Aug 25, 2020

Yes, I can help with reviews! I can also try to answer any questions you have about stanza, since I have used it more.

Estonian works with a whitespace tokeniser; it is commonly used.

Entity detection with stanza is quite alright, and it is the best available resource in the sense that it can easily be used in other applications as well. Several Estonian language technologies used by companies in Estonia have stanza under the hood.

Lemmatisation is crucial for me, and it is the whole reason I started commenting on the blog. Estonian (like many other languages, including Finnish and Hungarian) is a mostly agglutinative language. This means that a noun can have 29±1 different forms and a verb ca. 93 different forms (sic!). If I don't lemmatize the text, Rasa treats all those different forms as separate words, which may increase the amount of training data needed and decrease Rasa's efficiency.
Since we don't have any good lemmatized word vector models trained on big data available, I guessed it would be worth trying to take the non-lemmatised FastText (or BytePair) embeddings as one feature and, during tokenization, replace each word with its lemma, so that further down the pipeline Rasa has the word's non-lemmatized embedding while seeing the lemma.
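For illustration only, here is a minimal sketch of that idea; the embedding lookup and the lemma table below are toy stand-ins, not real models:

```python
# Hypothetical sketch: keep the dense embedding of the inflected surface
# form, but expose the lemma as the token text for word-level sparse features.
fasttext_vectors = {"ütlevad": [0.1, 0.2, 0.3]}   # toy embedding lookup
lemma_table = {"ütlevad": "ütlema"}               # toy lemmatizer output

def featurize(word: str):
    dense = fasttext_vectors[word]   # embedding of the non-lemmatized form
    token_text = lemma_table[word]   # lemma seen by word-level featurizers
    return token_text, dense

print(featurize("ütlevad"))  # ('ütlema', [0.1, 0.2, 0.3])
```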


koaning commented Aug 25, 2020

Interesting!

Out of curiosity: considering that our pipeline has a CountVectorizer that can also use 2/3/4-grams at the character level, I wonder what can be gained by adding lemmatisation. In the case of ütlevad and ütlen you might still have ütl as a common feature, no?
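A quick way to see this point, using scikit-learn's CountVectorizer directly (a standalone sketch, outside Rasa):

```python
# Show that "ütlevad" and "ütlen" already share character n-gram features
# such as "ütl", even without lemmatisation.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = cv.fit_transform(["ütlevad", "ütlen"])
shared = [feat for feat, col in cv.vocabulary_.items()
          if X[0, col] > 0 and X[1, col] > 0]
print(sorted(shared))  # includes 'üt', 'ütl', 'tl', ...
```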


koaning commented Aug 25, 2020

I've just checked with the research team: we actually offer lemmatisation in our CountVectorsFeaturizer if you use a spaCy model. This suggests another route that might also be useful ... how can we easily create spaCy-compatible models that rely on stanza as a backend?


Lindafr commented Aug 25, 2020

Regarding the 2/3/4-grams question:
Estonian is also a somewhat fusional language, meaning we have verbs whose root changes as well ("hüpa-ta", "hüppa-n" or "võidel-da", "võitle-n" or "tõmba-n", "tõmma-ta"). We also have tons of compound words. Lemmatisation helps to reduce the set of words or char-grams. It might not be so helpful with the 'char' analyzer, but it helps a lot with the 'word' analyzer. Experience so far has shown that the char-level CountVectorsFeaturizer does not give very good results for Estonian, so I planned to start testing Rasa with the word-level CountVectorsFeaturizer.

Regarding your last question, I don't know. I tried to hack the SpacyTokenizer so that its tokenize(self, message: Message, attribute: Text) -> List[Token] function returns Token() objects whose content is given by stanza instead. I ended up with another error down the pipeline ("ValueError: Sequence dimensions for sparse and dense features don't coincide in ...").


koaning commented Aug 25, 2020

The Rasa word-level CountVectorizer uses the .lemma_ if a SpacyTokenizer is present. This suggests that if you have a spaCy model with a custom .lemma_ implementation, it should work. I don't know how easy or hard that is to implement, but it seems worth checking out. I'll keep you posted if I learn anything.
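As a rough sketch of what such a custom .lemma_ implementation could look like (spaCy 2.x style; the external lemma lookup below is a toy stand-in for a Stanza call, and the zip assumes the tokenizations line up):

```python
import spacy

def toy_external_lemmas(text):
    # Toy stand-in for a Stanza call that returns one lemma per token.
    lookup = {"ütlevad": "ütlema", "ütlen": "ütlema"}
    return [lookup.get(w, w) for w in text.split()]

def external_lemma_component(doc):
    # Overwrite each token's .lemma_ with an externally produced lemma;
    # token.lemma_ is writable in spaCy.
    for token, lemma in zip(doc, toy_external_lemmas(doc.text)):
        token.lemma_ = lemma
    return doc

nlp = spacy.blank("et")
nlp.add_pipe(external_lemma_component, last=True)  # spaCy 2.x add_pipe
doc = nlp("ütlevad ütlen")
print([t.lemma_ for t in doc])  # ['ütlema', 'ütlema']
```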


koaning commented Aug 26, 2020

@Lindafr I'm wondering what the best approach is here. I could attempt to turn stanza into something that is spaCy-compatible, but with the advent of spaCy 3.0, as well as a lot of details in the "getting it right" department, I'm currently leaning towards just making a Rasa component.

Would you agree that the POS tags are probably the most important feature to get started with? I should be able to add these as sparse features for the machine learning pipeline.
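(To make "sparse features" concrete, here is a toy illustration; the tag set and the one-hot encoding below are illustrative, not Rasa's actual implementation.)

```python
# Toy illustration of POS tags as sparse (one-hot) features per token.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "AUX", "CCONJ"]

def pos_one_hot(tag: str):
    return [int(tag == t) for t in POS_TAGS]

print(pos_one_hot("VERB"))  # [0, 1, 0, 0, 0, 0, 0]
```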


Lindafr commented Aug 27, 2020

Hi @koaning, a Rasa component would be easiest for the end user. I myself was thinking along the lines of just substituting the spaCy .lemma_ values with Stanza values (more of a hack than a beautiful pipeline solution).
For me, the most important feature would be lemmas; POS tags (and NER?) come after that.

koaning changed the title from "Adding stanza" to "Adding Stanza" on Aug 27, 2020

koaning commented Aug 27, 2020

I've talked to the research team about lemmas, and it seems we've never really supported them directly, only indirectly for spaCy pipelines. There's interest in exploring it further, but it may take time to reach a consensus on the best approach. Until then, I'll keep in mind that POS is a good candidate to start with when I've got time to work on this.


koaning commented Sep 3, 2020

I will start with a tokenizer first. The reason is that internally, if you add a lemma property to a token, the CountVectorsFeaturizer is able to pick up the lemma instead of the word.
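Roughly, the mechanism looks like this (a hedged sketch for Rasa 1.x; the exact Token signature may differ between Rasa versions):

```python
# If a tokenizer attaches `lemma` to each Token, the word-level
# CountVectorsFeaturizer can count lemmas instead of surface forms.
from rasa.nlu.tokenizers.tokenizer import Token

tokens = [
    Token(text="ütlevad", start=0, lemma="ütlema"),
    Token(text="palju", start=8, lemma="palju"),
]
print([t.lemma for t in tokens])  # ['ütlema', 'palju']
```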

koaning mentioned this issue on Sep 3, 2020

koaning commented Sep 3, 2020

@Lindafr I have a first implementation up and running, and you should be able to play with it. I've only tried it out with English so far on my local machine, but you should be able to use the implementation on the linked PR for any supported Stanza language.

If you've got the time and you'd like to play with it, you can download the stanzatokenizer.py file from the PR locally and open it in Jupyter. You should be able to play around using code similar to:

```python
from stanzatokenizer import StanzaTokenizer
from rasa.nlu.training_data import Message

# You can change the language setting here
tok = StanzaTokenizer(component_config={"lang": "en"})

# This is a Rasa-internal thing: you need to wrap text with an object
m = Message("i am running and giving many greetings")
tok.process(m)

# You should now be able to check the properties of the message
[t.text for t in m.as_dict()['tokens']]
[t.data.get('pos', '') for t in m.as_dict()['tokens']]
[t.lemma for t in m.as_dict()['tokens']]
```

For example, the last three lists on my machine were:

```
['i', 'am', 'running', 'and', 'giving', 'many', 'greetings', '__CLS__']
['PRON', 'AUX', 'VERB', 'CCONJ', 'VERB', 'ADJ', 'NOUN', '']
['i', 'be', 'run', 'and', 'give', 'many', 'greeting', '__CLS__']
```

The `__CLS__` token is also a Rasa-internal thing; you should see it make an appearance. It is the token we use internally to represent the entire message. If you can confirm that the results you see there are sensible, then I can move on to the next phase of the implementation :)

koaning self-assigned this on Sep 3, 2020

Lindafr commented Sep 3, 2020

Hi @koaning, I just found this. I didn't know it existed before, but it seems that integrating stanza into spaCy may be even easier, and you won't have to invent a whole new pipeline.

I'll take a look at stanzatokenizer.py first thing tomorrow morning, and then I'll investigate the spacy_stanza module.


koaning commented Sep 3, 2020

If either works, let me know :)


koaning commented Sep 3, 2020

Having glanced at the docs, it seems like the spacy-stanza plugin would be the preferable route. It's even hosted by Explosion. Let me know if you're having trouble linking it with Rasa! The docs suggest you should be able to just save the model to disk via:

```python
nlp.to_disk("./stanza-spacy-model")
```

To properly link it you might need to do the same steps as I'm taking here.


Lindafr commented Sep 4, 2020

Hi @koaning,

Your code works great:

[screenshot: tokenizer output for Estonian]

I got an error with spacy-stanza while following its tutorial, so I haven't had the chance to link it with Rasa yet; I couldn't even get it to work outside Rasa (`nlp = spacy.load("./stanza-spacy-model", snlp=snlp)` doesn't work after `nlp.to_disk("./stanza-spacy-model")`). I'm investigating it at the moment, but your code gives the right results for Estonian.


koaning commented Sep 4, 2020

Thanks for letting me know. I'm having a lunch break now but I might have some time to have a look at the spaCy bindings for stanza.

Just to check: do you also have a Rasa project? You should now be able to use the tokenizer in your config.yml. The following configuration should automatically pick up the POS tags and the lemmas from the pipeline. The lemma is a bit of a hidden feature of the CountVectorsFeaturizer. This configuration assumes that you've placed the stanzatokenizer.py file in the root of your Rasa project.

```yaml
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
```

If you run it with stanzatokenizer.StanzaTokenizer vs. WhitespaceTokenizer, you should get different results in your NLU scores.


koaning commented Sep 4, 2020

I've just tried using spaCy directly and ran into some trouble. There's a caveat mentioned here, and I'm getting this error when I try to package it all up:

```
TypeError: __init__() missing 1 required positional argument: 'snlp'
```

I fear we'd need to construct a special binding around this spacy-stanza package to get it to work for the Rasa use case. It's doable, but it might make more sense to build the stanza feature directly for Rasa.


Lindafr commented Sep 4, 2020

Did you try this with English? With Estonian I get the language error that should supposedly be solved ("But because this package exposes a spacy_languages entry point in its setup.py that points to StanzaLanguage, spaCy knows how to initialize it."):

```python
import spacy
import stanza
from spacy_stanza import StanzaLanguage

# stanza.download("et")
snlp = stanza.Pipeline(lang="et")
nlp = StanzaLanguage(snlp)
nlp.to_disk("./stanza-spacy-model")
nlp = spacy.load("./stanza-spacy-model", snlp=snlp)
```

My hope was that this would remove the need to construct a special binding in Rasa. But if it doesn't work, then direct stanza support in Rasa would indeed be more sensible.


koaning commented Sep 4, 2020

@Lindafr I've tried doing it with English, yes, but the current Rasa support for spaCy does not assume that we need to pass snlp when we call spacy.load. I think that's what is causing some bugs on my side now. It might be that I'm missing a detail, but it seems that to support stanza via this route I'd need to implement a component around spacy-stanza to handle this.

If I end up building a component for stanza anyway, then I'd prefer to host a direct binding to Rasa. Fewer things to maintain that way.


Lindafr commented Sep 4, 2020

I agree. I'll investigate spacy-stanza a bit more, but it probably isn't suitable in this case.


koaning commented Sep 4, 2020

@Lindafr, out of curiosity: did you notice an improvement with stanzatokenizer.py in Rasa?


koaning commented Sep 7, 2020

@Lindafr, no big rush, but did you try the tool on a Rasa project by any chance? If you've experienced the merits of it, I might be able to wrap up this feature this week.


Lindafr commented Sep 8, 2020

I get this error when running a test project with stanzatokenizer.py.

The config.yml:

language: "et"
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
  
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"


koaning commented Sep 8, 2020

@Lindafr It was indeed the same issue. It should now be fixed; the only caveat is that you now need to supply the path to your stanza installation manually. That means you might need to do something like:

```yaml
language: en

pipeline:
- name: rasa_nlu_examples.tokenizers.StanzaTokenizer
  lang: "en"
  cache_dir: "tests/data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
```

If you are running this locally, the stanza files are usually located in `~/stanza_resources/`. Also, just to check: what platform are you using? Windows/macOS/Ubuntu?


Lindafr commented Sep 8, 2020

Ubuntu :)


Lindafr commented Sep 8, 2020

Just to make sure I did everything correctly, I'll describe the testing conditions here. I used the stanzatokenizer.py from here.

The stanza_test folder looks something like this:

```
|-PPA_stanza_test/
| | stanzatokenizer.py
| | config.yml
| | domain.yml
| | endpoints.yml
| | credentials.yml
| | actions.py
| |-tests/
| |-results/
| |-data/
| | |-stanza/
| | | |-et/
| | | | |-depparse
| | | | |-pos
| | | | |-tokenize
| | | | |-lemma
| | | | |-pretrain
| |-models/
```

Then, with a config.yml like this:

```yaml
language: et

pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
```

My WhitespaceTokenizer project is in another folder, and its config.yml looks like this:

language: "et"

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
  
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"

I first trained new models and then tested them. I tried to create some harder user messages that wouldn't result in high confidence, to better review the confidence differences. Here are the results (for brevity, I excluded the messages and the intent names themselves):

| Message | Ideal case | WhiteSpaceTokenizer (WST) | StanzaTokenizer (ST) | Which one is more confident |
| --- | --- | --- | --- | --- |
| 'abc' | X (1.0) | X (0.98) | Y (0.136) | WST got it right and is sure of it; Stanza got it wrong and is not sure of it. |
| 'abcd' | X (1.0) | Y (0.656) | X (0.148) | WST is quite confident in a wrong intent, while ST is not confident at all in the right intent. |
| 'abcde' | X (1.0) | Y (0.367) | Z (0.122) | Seems to be a bad example: both configs got it wrong. ST is a bit less confident, which is good. |
| 'abcdef' | X (1.0) | X (0.97) | Y (0.13) | WST is right and confident about it; ST is wrong and not confident about it. |
| 'abcdefg' | X (1.0) | X (0.97) | Y (0.13) | WST is right and confident about it; ST is wrong and not confident about it. |
| 'a1' | X (1.0) | X (0.868) | Y (0.13) | WST is right and quite confident about it; ST is wrong and not confident about it. |
| 'a12' | X (1.0) | X (0.827) | Y (0.136) | WST is right and quite confident about it; ST is wrong and not confident about it. |
| 'a123' | X (1.0) | Y (0.83) | Z (0.12) | WST is wrong and quite confident about it; Stanza is wrong and not at all confident about it. |
| 'b1' | X (1.0) | Y (0.666) | Z (0.115) | WST is wrong and a bit confident about it; Stanza is wrong and not at all confident about it. |
| 'b12' | X (1.0) | X (0.998) | Z (0.137) | WST is right and confident about it; ST is wrong and not confident about it. |

As you can see, the pipeline with the StanzaTokenizer is not confident about anything and usually gets the intent wrong. The WST, however, usually gets things right, but when it's wrong, it's wrong quite confidently.

The dataset I am working on has ca. 32 intents, some of which have very similar keywords or situations (e.g. creating a passport application vs. receiving a passport).


koaning commented Sep 8, 2020

@Lindafr this is very elaborate. Thanks for sharing!

I may have found an issue with your setup though.

Your current Stanza Config

```yaml
language: et

pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
```

Notice how DIET is only using 1 epoch? This is probably why the setup is underperforming. Also, you've removed the word-level CountVectorsFeaturizer, which we need to get the lemma property. It's an undocumented feature, but the CountVectorsFeaturizer will grab the lemma if it is available on the token.

My Stanza Config Proposal

Could you try running this?

```yaml
language: et

pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 200

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
```

If you're interested in summary statistics, you can also run two config files and compare the results. Here's a snippet that might help:

```bash
rasa test nlu --config configs/config-light.yml \
              --cross-validation --runs 1 --folds 2 \
              --out gridresults/config-light
rasa test nlu --config configs/config-heavy.yml \
              --cross-validation --runs 1 --folds 2 \
              --out gridresults/config-heavy
```

This will grab two config files (in this case config-light.yml and config-heavy.yml) and save the summary statistics in the gridresults/config-light and gridresults/config-heavy folders. You might enjoy using rasalit for this.


Lindafr commented Sep 9, 2020

Hi @koaning,

Based on the results I suspected some kind of mistake; that's why I wrote about the test conditions (I copy-pasted your example and didn't think it through). Today I'll run it again, and then we'll have some real results. It might take a bit of time, because I first have to finish some other Rasa-related things that are due tomorrow.


koaning commented Sep 9, 2020

@Lindafr no rush! I'm just super curious about the results. 😄


Lindafr commented Sep 9, 2020

The renewed results (with the config proposed earlier) are as follows:

| Message | Ideal case | WhiteSpaceTokenizer (WST) | StanzaTokenizer (ST) | Which one is more confident |
| --- | --- | --- | --- | --- |
| 'abc' | X (1.0) | X (0.98) | Y (0.95) | WST got it right and is sure of it; Stanza got it wrong but is very sure of it. |
| 'abcd' | X (1.0) | Y (0.656) | X (0.955) | WST is quite confident in a wrong intent, while ST is confident in the right intent. |
| 'abcde' | X (1.0) | Y (0.367) | X (0.675) | WST got it wrong. ST is not too confident in the right intent. |
| 'abcdef' | X (1.0) | X (0.97) | Y (0.13) | WST is right and confident about it; ST is wrong and not confident about it. |
| 'abcdefg' | X (1.0) | X (0.97) | X (0.999) | WST is right and confident about it; ST is also right and very confident about it. |
| 'a1' | X (1.0) | X (0.868) | X (0.895) | Both are right and quite confident about it. |
| 'a12' | X (1.0) | X (0.827) | X (0.999) | Both are right and quite confident about it; WST is less confident. |
| 'a123' | X (1.0) | Y (0.83) | Z (0.36) | WST is wrong and a bit confident about it; Stanza is wrong in another way and a bit less confident about it. |
| 'b1' | X (1.0) | Y (0.666) | X (0.678) | WST is wrong and a bit confident about it; Stanza is right and quite confident about it. |
| 'b12' | X (1.0) | X (0.998) | X (0.936) | WST is right and confident about it; ST is right and less confident about it. |

As you can see, ST got only 3 utterances wrong, while WST got 4 wrong. Overall, if you exclude the first example, WST is more confident in its wrong intents, while Stanza is not confident when it's wrong. Stanza seems to be better on this manual check.

Now, for the other statistics.

WhiteSpaceTokeniser

```
2020-09-09 14:26:20 INFO rasa.test - Intent evaluation results
2020-09-09 14:26:20 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - test Accuracy: 0.619 (0.023)
2020-09-09 14:26:20 INFO rasa.nlu.test - test F1-score: 0.612 (0.028)
2020-09-09 14:26:20 INFO rasa.nlu.test - test Precision: 0.645 (0.042)
```


StanzaTokeniser

```
2020-09-09 14:43:59 INFO rasa.test - Intent evaluation results
2020-09-09 14:43:59 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - test Accuracy: 0.596 (0.016)
2020-09-09 14:43:59 INFO rasa.nlu.test - test F1-score: 0.589 (0.009)
2020-09-09 14:43:59 INFO rasa.nlu.test - test Precision: 0.615 (0.009)
```


Summary

One can see that WST's test accuracy, F1, and precision are actually slightly better. A bit surprising, as I thought the extra POS information would help.

Extra

With this config.yml (the only difference from the WST config is the first element in the pipeline):

language: "et"

pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
  
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"

The results are slightly better than with the recommended ST config, but still below WST:

```
2020-09-09 15:01:52 INFO rasa.test - Intent evaluation results
2020-09-09 15:01:52 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - test Accuracy: 0.593 (0.003)
2020-09-09 15:01:52 INFO rasa.nlu.test - test F1-score: 0.593 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - test Precision: 0.631 (0.002)
```


koaning commented Sep 9, 2020

Interesting. Looking at these results, I wonder about another issue: the difference between the train/test results in both the Stanza and Whitespace scenarios is huge. Especially since the training accuracy is 100%, I'm thinking we're overfitting in both examples here.

If you're up for it, could you try tuning the epochs down to maybe 50 for DIET? If you still see a big difference between train and test, feel free to tune it down even further to 25. The goal is to manually stop the algorithm before it starts overfitting.
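Concretely, that would just mean changing the DIETClassifier entry in the config, e.g.:

```yaml
- name: DIETClassifier
  epochs: 50   # tune down to 25 if the train/test gap stays large
```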


koaning commented Sep 10, 2020

@Lindafr is your dataset publicly available? I might also be able to run some benchmarks on your behalf if you're interested. The research team here would love to have an Estonian dataset to benchmark our algorithms on.


Lindafr commented Sep 10, 2020

Some more statistics on basically the same config.yml files.

StanzaTokenizer

(config.yml as in the "Extra" section above.)

With DIET epochs = 100:

```
2020-09-10 09:31:00 INFO rasa.test - Intent evaluation results
2020-09-10 09:31:00 INFO rasa.nlu.test - train Accuracy: 0.984 (0.005)
2020-09-10 09:31:00 INFO rasa.nlu.test - train F1-score: 0.983 (0.006)
2020-09-10 09:31:00 INFO rasa.nlu.test - train Precision: 0.984 (0.007)
2020-09-10 09:31:00 INFO rasa.nlu.test - test Accuracy: 0.536 (0.023)
2020-09-10 09:31:00 INFO rasa.nlu.test - test F1-score: 0.508 (0.022)
2020-09-10 09:31:00 INFO rasa.nlu.test - test Precision: 0.570 (0.018)
```

With DIET epochs = 50:

```
2020-09-10 09:10:10 INFO rasa.test - Intent evaluation results
2020-09-10 09:10:10 INFO rasa.nlu.test - train Accuracy: 0.803 (0.052)
2020-09-10 09:10:10 INFO rasa.nlu.test - train F1-score: 0.775 (0.055)
2020-09-10 09:10:10 INFO rasa.nlu.test - train Precision: 0.835 (0.009)
2020-09-10 09:10:10 INFO rasa.nlu.test - test Accuracy: 0.500 (0.034)
2020-09-10 09:10:10 INFO rasa.nlu.test - test F1-score: 0.454 (0.041)
2020-09-10 09:10:10 INFO rasa.nlu.test - test Precision: 0.570 (0.021)
```

With DIET epochs = 25:

```
2020-09-10 09:14:20 INFO rasa.test - Intent evaluation results
2020-09-10 09:14:20 INFO rasa.nlu.test - train Accuracy: 0.650 (0.060)
2020-09-10 09:14:20 INFO rasa.nlu.test - train F1-score: 0.614 (0.066)
2020-09-10 09:14:20 INFO rasa.nlu.test - train Precision: 0.725 (0.071)
2020-09-10 09:14:20 INFO rasa.nlu.test - test Accuracy: 0.329 (0.003)
2020-09-10 09:14:20 INFO rasa.nlu.test - test F1-score: 0.295 (0.021)
2020-09-10 09:14:20 INFO rasa.nlu.test - test Precision: 0.367 (0.065)
```

With DIET epochs = 15:

```
2020-09-10 09:17:48 INFO rasa.test - Intent evaluation results
2020-09-10 09:17:48 INFO rasa.nlu.test - train Accuracy: 0.482 (0.109)
2020-09-10 09:17:48 INFO rasa.nlu.test - train F1-score: 0.447 (0.123)
2020-09-10 09:17:48 INFO rasa.nlu.test - train Precision: 0.578 (0.104)
2020-09-10 09:17:48 INFO rasa.nlu.test - test Accuracy: 0.298 (0.060)
2020-09-10 09:17:48 INFO rasa.nlu.test - test F1-score: 0.238 (0.072)
2020-09-10 09:17:48 INFO rasa.nlu.test - test Precision: 0.314 (0.057)
```

WhiteSpaceTokenizer

(config.yml as described above.)

With DIET epochs = 100:

```
2020-09-10 09:38:18 INFO rasa.test - Intent evaluation results
2020-09-10 09:38:18 INFO rasa.nlu.test - train Accuracy: 0.992 (0.008)
2020-09-10 09:38:18 INFO rasa.nlu.test - train F1-score: 0.992 (0.008)
2020-09-10 09:38:18 INFO rasa.nlu.test - train Precision: 0.993 (0.007)
2020-09-10 09:38:18 INFO rasa.nlu.test - test Accuracy: 0.562 (0.018)
2020-09-10 09:38:18 INFO rasa.nlu.test - test F1-score: 0.545 (0.020)
2020-09-10 09:38:18 INFO rasa.nlu.test - test Precision: 0.570 (0.013)
```

With DIET epochs = 50:

```
2020-09-10 10:11:53 INFO rasa.test - Intent evaluation results
2020-09-10 10:11:53 INFO rasa.nlu.test - train Accuracy: 0.889 (0.013)
2020-09-10 10:11:53 INFO rasa.nlu.test - train F1-score: 0.868 (0.026)
2020-09-10 10:11:53 INFO rasa.nlu.test - train Precision: 0.878 (0.046)
2020-09-10 10:11:53 INFO rasa.nlu.test - test Accuracy: 0.487 (0.016)
2020-09-10 10:11:53 INFO rasa.nlu.test - test F1-score: 0.443 (0.012)
2020-09-10 10:11:53 INFO rasa.nlu.test - test Precision: 0.491 (0.047)
```

With DIET epochs = 25:

```
2020-09-10 10:13:59 INFO rasa.test - Intent evaluation results
2020-09-10 10:13:59 INFO rasa.nlu.test - train Accuracy: 0.754 (0.018)
2020-09-10 10:13:59 INFO rasa.nlu.test - train F1-score: 0.733 (0.024)
2020-09-10 10:13:59 INFO rasa.nlu.test - train Precision: 0.849 (0.028)
2020-09-10 10:13:59 INFO rasa.nlu.test - test Accuracy: 0.412 (0.013)
2020-09-10 10:13:59 INFO rasa.nlu.test - test F1-score: 0.357 (0.011)
2020-09-10 10:13:59 INFO rasa.nlu.test - test Precision: 0.452 (0.027)
```

With DIET epochs = 15:

```
2020-09-10 10:17:19 INFO rasa.test - Intent evaluation results
2020-09-10 10:17:19 INFO rasa.nlu.test - train Accuracy: 0.490 (0.085)
2020-09-10 10:17:19 INFO rasa.nlu.test - train F1-score: 0.479 (0.052)
2020-09-10 10:17:19 INFO rasa.nlu.test - train Precision: 0.654 (0.042)
2020-09-10 10:17:19 INFO rasa.nlu.test - test Accuracy: 0.238 (0.098)
2020-09-10 10:17:19 INFO rasa.nlu.test - test F1-score: 0.205 (0.099)
2020-09-10 10:17:19 INFO rasa.nlu.test - test Precision: 0.262 (0.107)
```

The overfitting is still quite large. I think it's partly due to my still very small dataset: every intent (32 in total) has only 10-20 example questions, so there's not much to learn from. I plan to make it larger with test users and by hand, but right now that's all I have.


Lindafr commented Sep 10, 2020

@koaning, the dataset is not publicly available and I cannot share it.

Do you think part of the overfitting can be due to my small dataset? I can try again later when I have more examples, but for now the results are as they are, and adding POS doesn't seem to bring any improvement (?).


koaning commented Sep 10, 2020

@Lindafr yeah, those last results seem convincing. I'll add the feature anyway so that other folks can try it out, but it seems safe to say that on your use case right now it doesn't contribute anything substantial.

Thanks a lot for the feedback though 👍! It's been very(!) helpful.

I'll merge the stanza feature today. It might prove useful to other folks, and you can also check whether it boosts performance once you've got more data. I'll close this issue once the feature is merged.

One last question out of curiosity: did you try the pretrained embeddings (BytePair/FastText)? If so, I'd love to hear whether they were helpful.


koaning commented Sep 10, 2020

#30 has been merged! I'm closing this issue, but we can pick the conversation back up here if there's a related discussion to be continued.

koaning closed this as completed on Sep 10, 2020

koaning commented Oct 16, 2020

Just wanted to mention it here: full stanza support is now coming to spaCy: https://explosion.ai/blog/spacy-v3-nightly.


Lindafr commented Oct 20, 2020

@koaning, are you sure? They don't mention it anywhere; they only compare Stanza's NER results with theirs in one table. Plus, they do have the spacy-stanza package that we tried but failed to integrate with Rasa.

Btw, is there any way to use stanzatokenizer.py for the POS tags and lemmas and also use SpacyNLP for the FastText embeddings? There can't be two tokenizers in the same pipeline, and SpacyNLP + stanzatokenizer.StanzaTokenizer doesn't work. I guess not, but I'm asking anyway just in case there's some obvious solution I didn't think of.


koaning commented Oct 20, 2020

I recall reading elsewhere that a tighter integration is now possible, but I can't find the link anymore ... I'll look again!

If you want FastText embeddings, can't you use the ones that are available directly in this repository?
