Stanza Tokenizer #30

koaning · 2020-09-03T09:33:46Z

This is the first implementation of the stanza tokenizer and addresses #23. There are some cool things about this implementation:

We add a tokenizer. Hopefully this works better for some languages than our basic WhitespaceTokenizer.
We automatically add a lemma to each token, which means that the CountVectorsFeaturizer inside of Rasa can pick it up.
We automatically add a part-of-speech (pos) tag to the data property of each token. I'm using the same technique as the SpacyTokenizer, so you should also be able to use these tags from the LexicalSyntacticFeaturizer.
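
To illustrate the lemma/pos points above, here is a minimal sketch of how per-word annotations could be mapped onto token objects. `StandInWord` and `StandInToken` are hypothetical stand-ins, not the real Stanza or Rasa classes, and the offset lookup via `str.find` is an assumption made for illustration; the actual code in this PR may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StandInWord:
    """Hypothetical stand-in for one analysed word (text, lemma, POS tag)."""
    text: str
    lemma: str
    pos: str


@dataclass
class StandInToken:
    """Hypothetical stand-in for a Rasa-style token."""
    text: str
    start: int
    end: int
    lemma: Optional[str] = None
    data: Dict[str, Any] = field(default_factory=dict)


def words_to_tokens(message: str, words: List[StandInWord]) -> List[StandInToken]:
    """Attach the lemma and the POS tag of each word to a token with character offsets."""
    tokens = []
    cursor = 0
    for word in words:
        # Search from the end of the previous token so repeated words get distinct offsets.
        start = message.find(word.text, cursor)
        if start == -1:
            # The word does not appear verbatim in the message; skip it in this sketch.
            continue
        end = start + len(word.text)
        tokens.append(
            StandInToken(word.text, start, end, lemma=word.lemma, data={"pos": word.pos})
        )
        cursor = end
    return tokens


words = [
    StandInWord("The", "the", "DET"),
    StandInWord("cats", "cat", "NOUN"),
    StandInWord("sat", "sit", "VERB"),
]
tokens = words_to_tokens("The cats sat", words)
```

A featurizer that counts lemmas would then see `cat` instead of `cats`, and one that reads `data["pos"]` would see `NOUN`.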

I still need to figure out a nice way to test this on CI because the model that I'm downloading is fairly big. I also need to figure out if the caching mechanisms are appropriate, but it is ready for a first review. There are a few things that need to be checked before merging.

Can we make this dependency optional? It also installs a lot of PyTorch packages, and that might not be relevant to everybody.
Can we confirm the caching dir works as we expect?
Can we confirm/add a test that ensures that the CountVectorsFeaturizer/LexicalSyntacticFeaturizer behave differently with stanza around?
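
For the caching question, one way to check the cache directory (for example, to skip a slow test on CI when the model is missing) could look like the sketch below. It assumes that models live in a non-empty `<cache_dir>/<lang>/` folder; that layout and the helper name `model_is_cached` are assumptions for illustration and have not been verified against this PR.

```python
from pathlib import Path


def model_is_cached(cache_dir: str, lang: str) -> bool:
    """Return True if a non-empty folder for `lang` exists under `cache_dir`.

    Assumption: downloaded models are stored in `<cache_dir>/<lang>/`.
    """
    lang_dir = Path(cache_dir) / lang
    return lang_dir.is_dir() and any(lang_dir.iterdir())
```

A test could call this helper first and skip itself when it returns `False`, so that CI only runs the heavy tests when the model was already downloaded.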

koaning · 2020-09-08T08:54:05Z

Edit: since we install via pip+git, it's going to be somewhat tricky/odd to make the stanza dependency optional. Work for later.
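
As a possible direction for that later work, a guarded import is one common way to keep a heavy dependency optional at runtime. The helper below is a hypothetical sketch (the name `import_optional` and the error message are not part of this PR); it only uses the standard library.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import a module by name, or raise an ImportError that explains how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this component but is not installed. "
            f"{install_hint}"
        ) from exc


# Hypothetical usage inside a tokenizer, so the import only happens when the
# component is actually used:
# stanza = import_optional("stanza", "Try: pip install stanza")
```

With this pattern the package itself would not need to list stanza as a hard requirement; users who never configure the tokenizer would never trigger the import.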

koaning · 2020-09-08T12:01:21Z

The docs still need to be extended. It's annoying, but on Windows you need another pip install command to deal with the PyTorch dependencies.

tabergma

Looks great 💯

rasa_nlu_examples/tokenizers/stanzatokenizer.py

config.yml

rasa_nlu_examples/tokenizers/stanzatokenizer.py

tests/test_tokenizers/test_stanza_tokenizer.py

Co-authored-by: Tanja <tabergma@gmail.com>

tests/configs/stanza-tokenizer-config.yml

docs/docs/tokenizer/stanza.md

tabergma

Looks great 🚀

Co-authored-by: Tanja <tabergma@gmail.com>

koaning added 8 commits September 3, 2020 11:27

added-tests

1d2c0fe

its clear we need to do this without spacy

c3d185b

stanza-dependency

32f6b84

Merge branch 'master' into stanza-tokenizer

5928bb4

maybe-windows-is-not-supported

7b62e6b

check-grep

afe39e7

download-stanza-online

a54f44d

cache-dir-fix

edb2425

koaning added 6 commits September 8, 2020 10:56

workflow-bug-windows

35a2854

tests-fixed

236f946

windows-workflow-yuck

7fbc4c4

extra-stanza-tests

859dbb4

winbdowzzzz

09465b0

pytorch-install-urgh

c2d0362

koaning mentioned this pull request Sep 8, 2020

Tokenizers for Less Common Languages #15

Closed

tabergma reviewed Sep 8, 2020

koaning and others added 2 commits September 8, 2020 14:29

Update rasa_nlu_examples/tokenizers/stanzatokenizer.py

05fb98a

Co-authored-by: Tanja <tabergma@gmail.com>

Apply suggestions from code review

0953b7e

Co-authored-by: Tanja <tabergma@gmail.com>

koaning mentioned this pull request Sep 8, 2020

Thai Tokenizer #35

Merged

koaning added 8 commits September 8, 2020 23:24

Update stanza.md

4925faa

feedback-tanja

2d9251c

Merge branch 'master' into stanza-tokenizer

adfca57

internal-test-refactor

ec1b6cb

Update windows-check.yml

15c7055

Update mac-os-check.yml

00d26a5

Update pythonpackage.yml

b996195

pytest-paths-fixed

30983ff

tabergma reviewed Sep 10, 2020

tests/configs/stanza-tokenizer-config.yml (outdated)

remove-policies-from-nlu-tests

af69ec6

tabergma reviewed Sep 10, 2020

docs/docs/tokenizer/stanza.md (outdated)

tabergma reviewed Sep 10, 2020

docs/docs/tokenizer/stanza.md (outdated)

tabergma approved these changes Sep 10, 2020

koaning and others added 3 commits September 10, 2020 11:49

Update docs/docs/tokenizer/stanza.md

f783241

Co-authored-by: Tanja <tabergma@gmail.com>

removed-policies-from-docs

310f494

Merge branch 'stanza-tokenizer' of github.com:RasaHQ/rasa-nlu-example…

1b4a2d5

…s into stanza-tokenizer

koaning merged commit b29e76c into master Sep 10, 2020

koaning deleted the stanza-tokenizer branch September 10, 2020 10:19

koaning mentioned this pull request Sep 10, 2020

Adding Stanza #23

Closed
