Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spaCy 3.0 #7869

Merged
merged 80 commits into from
Mar 23, 2021
Merged

spaCy 3.0 #7869

merged 80 commits into from
Mar 23, 2021

Conversation

koaning
Copy link
Contributor

@koaning koaning commented Feb 2, 2021

This PR should fix #7724, fix #6906, fix #5400 and add support for many languages!

This PR will add basic support for spaCy 3.0. I'll list a bunch of things that have changed.

  • I ran poetry update which caused a lot of changes in poetry.lock
  • the spacy link command has been deprecated. That means that some of the Dockerfiles needed to be rewritten as well as our Makefile. This also means that this PR contains breaking changes. You must name your spaCy model explicitly in a pipeline from now on.
  • Every reference to a SpacyNLP component in the tests now needs a model attached. I've chosen to use en_core_web_md (for English) and de_core_news_sm (for German). These are already downloaded beforehand in the CI. Note that this means that loads of tests required an update.

@koaning
Copy link
Contributor Author

koaning commented Feb 2, 2021

Milestone: the docker containers build!

@koaning koaning mentioned this pull request Feb 2, 2021
@koaning
Copy link
Contributor Author

koaning commented Feb 3, 2021

I think, as is, this PR is a nice feature. Many more languages will be supported and many more tools can be built on the linguistic features inside of spaCy now.

I haven't check support for the new transformer models though. These will require more work because the API is slightly different. We could say "this feature ain't done until those models go in", but I'll leave this up for discussion.

Do we need to maybe do a performance check for models? The embeddings have been retrained if I recall correctly.

Things to do:

  • Add proper errors if users don't pass in a model in config.yml
  • Do a scan to remove code that we no longer use (the default language no longer works, so cleanup should be possible).
  • Make changes to the docs.
  • Plan a release for this feature.
  • Keep the transformer models as a separate PR.

@wochinge
Copy link
Contributor

wochinge commented Feb 5, 2021

@koaning Can we make this not breaking by using the language setting from the domain as fallback? E.g. if the language is en we automatically use en_core_web_md (+ deprecation warning of course)

@koaning
Copy link
Contributor Author

koaning commented Feb 5, 2021

@koaning Can we make this not breaking by using the language setting from the domain as fallback? E.g. if the language is en we automatically use en_core_web_md (+ deprecation warning of course)

I guess we could ... but it gets very tricky. The English models use en_core_web_md but, for example, the French models use fr_core_news_md. The naming standard <lang>_core_<src>_<size> technically cannot be assumed to be consistent across languages and spaCy doesn't support each language equally. We'd also need to assume that the user downloaded en_core_web_md (as opposed to en_core_web_lg) beforehand. We'd also need to consider checking the languages that spaCy supports which will change over time too.

From my perspective spaCy has made a very sensible change here, as it forces users to be explicit, which is more maintainable.

@koaning
Copy link
Contributor Author

koaning commented Feb 9, 2021

@wochinge I'm wondering timeline-wise ... should I merge all of this work on top of the main branch or is it best to target a branch with a version? If so, which one might be practical? I'm asking because I imagine this PR will need a substantial amount of reviews.

@wochinge
Copy link
Contributor

should I merge all of this work on top of the main branch or is it best to target a branch with a version?

As it's feature it should go into main. Do you expect more changes? Then might make sense to do separate PRs which target this branch.

From my perspective spaCy has made a very sensible change here, as it forces users to be explicit, which is more maintainable."

I agree. My issues are just that we shouldn't have this sort of breaking changes in a minor. Couldn't we do

  • map the main languages to explict model names
  • print a big warning in case we have to fallback to this case ("please use the explicit model name")

In this case we would at least try?

@koaning
Copy link
Contributor Author

koaning commented Mar 22, 2021

@wochinge thanks for pointing out the remaining flaws.

Once again, I think the changes are now properly addressed. I didn't resolve any issues myself to make it easy for you to confirm.

There's a new issue though, it seems like the doc tests aren't passing. Since the issue revolves around dead links on other parts of the repo, can I safely ignore these?

I'm running poetry update now as well to resolve those issues.

@koaning
Copy link
Contributor Author

koaning commented Mar 22, 2021

Ah. It seems that we're pointing users to a sanic doc that no longer exists. Will investigate.

@wochinge I think I've found the issue. Sanic moved their docs. I added a PR here #8272 to fix.

@koaning koaning mentioned this pull request Mar 22, 2021
Copy link
Contributor

@wochinge wochinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nearly there 💯

rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
koaning and others added 5 commits March 22, 2021 18:44
Co-authored-by: Tobias Wochinger <t.wochinger@rasa.com>
Co-authored-by: Tobias Wochinger <t.wochinger@rasa.com>
Co-authored-by: Tobias Wochinger <t.wochinger@rasa.com>
Copy link
Contributor

@wochinge wochinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good 🚀

@joejuzl is it okay to dismiss your "Request for changes"?

rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
rasa/nlu/utils/spacy_utils.py Outdated Show resolved Hide resolved
@@ -9,6 +10,8 @@
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.model import InvalidModelError
from rasa.nlu.constants import SPACY_DOCS, DENSE_FEATURIZABLE_ATTRIBUTES
from rasa.shared.utils.io import raise_deprecation_warning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from rasa.shared.utils.io import raise_deprecation_warning
rasa.shared.utils.io

According to our Python code conventions we don't import functions directly but reference them by their module. https://www.notion.so/rasa/Python-Code-Conventions-26835ffd72484dd699f28fa693916d2e#5b63fd5269514270aaf61ab8303ae319

koaning and others added 2 commits March 23, 2021 10:22
Copy link
Contributor

@joejuzl joejuzl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job!

@koaning koaning enabled auto-merge March 23, 2021 10:26
@koaning koaning merged commit e275958 into main Mar 23, 2021
@koaning koaning deleted the spacy-3-dot-0-support branch March 23, 2021 12:41
@koaning
Copy link
Contributor Author

koaning commented Mar 23, 2021

Booya!

@wochinge
Copy link
Contributor

Thanks for this awesome change @koaning ! 💯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components
Projects
None yet
5 participants