Stanza Tokenizer #30

Merged
merged 28 commits into from
Sep 10, 2020

Conversation

koaning
Contributor

@koaning koaning commented Sep 3, 2020

This is the first implementation of the stanza tokenizer and addresses #23. There are a few nice things about this implementation:

  1. We add a tokenizer. Hopefully this works better for some languages than our basic WhitespaceTokenizer.
  2. We automatically add a lemma to each token, which means that the CountVectorsFeaturizer inside of Rasa can pick it up.
  3. We automatically add a pos tag to the data property of each token. I'm using the same technique as the SpacyTokenizer, so you should also be able to use these tags from the LexicalSyntacticFeaturizer.
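To make points 2 and 3 concrete, here is a minimal sketch of what attaching a lemma and a pos tag to each token could look like. Everything in it is an assumption for illustration: the `Token` dataclass is a hypothetical stand-in rather than Rasa's real class, `to_tokens` is a made-up helper, and the `analysed` triples are written by hand instead of coming from a stanza pipeline, so that the snippet runs without downloading a model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a Rasa-style token (not the real class)."""

    text: str
    start: int
    lemma: str
    data: Dict[str, Any] = field(default_factory=dict)


def to_tokens(text: str, analysed: List[Tuple[str, str, str]]) -> List[Token]:
    """Turn (word, lemma, pos) triples into tokens with character offsets."""
    tokens = []
    cursor = 0
    for word, lemma, pos in analysed:
        # Find the word at or after the end of the previous token.
        start = text.index(word, cursor)
        # The lemma is a first-class field; the pos tag goes into `data`.
        tokens.append(Token(word, start, lemma, {"pos": pos}))
        cursor = start + len(word)
    return tokens


# Hand-written analysis standing in for the output of a stanza pipeline.
analysed = [("The", "the", "DET"), ("dogs", "dog", "NOUN"), ("barked", "bark", "VERB")]
tokens = to_tokens("The dogs barked", analysed)
```

A featurizer that counts lemmas would then see "dog" rather than "dogs", and one that reads `data["pos"]` would see the tag for each token.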

I still need to figure out a nice way to test this on CI because the model that I'm downloading is fairly big. I also need to figure out if the caching mechanisms are appropriate, but it is ready for a first review. There are a few things that need to be checked before merging.

  • Can we make this dependency optional? It also installs a lot of PyTorch, and that might not be relevant to everybody.
  • Can we confirm the caching dir works as we expect?
  • Can we confirm/add a test that ensures that the CountVectorsFeaturizer/LexicalSyntacticFeaturizer behave differently with stanza around?
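On the first question, one common way to keep a heavy dependency optional is to check for it at runtime and only import it when the component is actually used. The sketch below assumes that approach; `module_available` and `require_stanza` are hypothetical helper names, not something this PR implements.

```python
import importlib
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_stanza():
    """Import stanza lazily and fail with a readable message if it is missing."""
    if not module_available("stanza"):
        raise ImportError(
            "The stanza tokenizer needs the optional 'stanza' package. "
            "Install it to use this component."
        )
    return importlib.import_module("stanza")
```

With this pattern the rest of the package imports fine without stanza or PyTorch installed, and the error only surfaces when someone configures the stanza tokenizer.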

@koaning
Contributor Author

koaning commented Sep 8, 2020

Edit: since we install via pip+git, it's going to be somewhat tricky/odd to make the stanza dependency optional. Work for later.

@koaning
Contributor Author

koaning commented Sep 8, 2020

The docs still need to be updated. It's annoying, but on Windows you need another pip install command to deal with the PyTorch dependencies.

Collaborator

@tabergma tabergma left a comment

Looks great 💯

Resolved review threads (now outdated):
rasa_nlu_examples/tokenizers/stanzatokenizer.py (7 threads)
config.yml (1 thread)
koaning and others added 2 commits September 8, 2020 14:29
Co-authored-by: Tanja <tabergma@gmail.com>
Co-authored-by: Tanja <tabergma@gmail.com>
@koaning koaning mentioned this pull request Sep 8, 2020
Collaborator

@tabergma tabergma left a comment

Looks great 🚀

@koaning koaning merged commit b29e76c into master Sep 10, 2020
@koaning koaning deleted the stanza-tokenizer branch September 10, 2020 10:19
@koaning koaning mentioned this pull request Sep 10, 2020