Stanza Tokenizer #30
Edit: since we install via pip+git, it's going to be somewhat tricky/odd to make the stanza dependency optional. Work for later.
The docs still need to be appended. It's annoying but for Windows you need another …
Looks great 💯
Co-authored-by: Tanja <tabergma@gmail.com>
Looks great 🚀
…s into stanza-tokenizer
This is the first implementation of the stanza tokenizer and addresses #23. There are some cool things about this implementation:

- `WhitespaceTokenizer`
- It adds `.lemma` to each token, which means that the `CountVectorsFeaturizer` inside of Rasa can pick them up.
- It adds `pos` to the `data` property of each token. I'm using the same technique as the `SpacyTokenizer`, so you should also be able to use them from the `LexicalSyntacticFeaturizer`.

I still need to figure out a nice way to test this on CI because the model that I'm downloading is fairly big. I also need to figure out whether the caching mechanisms are appropriate, but it is ready for a first review. There are a few things that need to be checked before merging.
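For reviewers who want a picture of the lemma/pos idea described above, here is a minimal sketch, not the code in this PR. The `Token` dataclass is a hypothetical stand-in for Rasa's token class, `words_to_tokens` is an illustrative helper name, and the input objects are assumed to expose `text`, `lemma`, `pos` and `start_char` attributes the way stanza words are expected to. Fake words are used so the snippet runs without downloading a model.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for Rasa's token class (assumption, not the real API)."""
    text: str
    start: int
    lemma: Optional[str] = None
    data: Dict[str, Any] = field(default_factory=dict)


def words_to_tokens(words: Iterable[Any]) -> List[Token]:
    """Copy the lemma onto each token and store the POS tag under data["pos"]."""
    tokens = []
    for word in words:
        tokens.append(
            Token(
                text=word.text,
                start=word.start_char,
                lemma=word.lemma,          # a count-based featurizer could read this
                data={"pos": word.pos},    # a lexical/syntactic featurizer could read this
            )
        )
    return tokens


# Fake "stanza-like" words (assumed attribute names), so no model download is needed.
fake_words = [
    SimpleNamespace(text="dogs", lemma="dog", pos="NOUN", start_char=0),
    SimpleNamespace(text="ran", lemma="run", pos="VERB", start_char=5),
]
tokens = words_to_tokens(fake_words)
for token in tokens:
    print(token.text, token.lemma, token.data["pos"])
```

Using stand-in objects like these may also be one way to keep CI tests light, since the conversion logic can be checked without the large model.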