Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/spacy detector #64

Merged
merged 51 commits into from
Nov 5, 2020

Conversation

aCampello
Copy link
Collaborator

@aCampello aCampello commented Oct 24, 2020

Current stats for the old entity detector:

Documents generated in 62.92s
Scrubbed documents in  3.98s
                           precision    recall  f1-score   support

      name           name       0.09      1.00      0.16       558

Current stats for new spacy detector

Documents generated in 66.99s
Scrubbed documents in  78.61s
                               precision    recall  f1-score   support

named_entity     name       0.98      0.95      0.97      1143

                    micro avg       0.98      0.95      0.97      1143
                    macro avg       0.98      0.95      0.97      1143
                 weighted avg       0.98      0.95      0.97      1143

@coveralls
Copy link

coveralls commented Oct 24, 2020

Coverage Status

Coverage decreased (-2.1%) to 95.958% when pulling 856f291 on aCampello:feature/spacy_detector into 0eec66e on LeapBeyond:master.

@thomasbird
Copy link
Member

Nice work, thanks! I haven't had a chance to look through it properly, but two things popped out:

  • It would be great to have some tests for the new behaviour around iter_filth_documents and the new detector.
  • I'd like to keep Filth types related to what they are as opposed how they were detected. So if we find a name a NameFilth should be returned or if we find a organisation an OrganisationFilth could be returned. The filth_cls attribute is no longer needed on the Detector class to make it possible to return any Filth you'd like.

I look forward to playing with this next week! Thanks for the commits! 😁

@aCampello
Copy link
Collaborator Author

aCampello commented Oct 25, 2020

  • It would be great to have some tests for the new behaviour around iter_filth_documents and the new detector.

100%. This is highly unfinished. I am planning to writing tests shortly, to get coverage to the same point (or ideally higher) than it was before this PR. Just wanted to write a draft PR to get early feedback.

  • I'd like to keep Filth types related to what they are as opposed how they were detected. So if we find a name a NameFilth should be returned or if we find a organisation an OrganisationFilth could be returned. The filth_cls attribute is no longer needed on the Detector class to make it possible to return any Filth you'd like.

Yes, that is also something that occurred to me, as I was playing with it. Will implement it 👍 !

@aCampello
Copy link
Collaborator Author

aCampello commented Oct 29, 2020

I have now implemented those changes pinpointed before. Also added some initial tests to the detector.

At the moment the build fails for python3.5, because Spacy doesn't support it. So wondering how you want to go about that. Should I add the spacy detector as an "extra"/should I allow the tests to fail ? I believe that we can make travis skip tests for specific environments, but that will require some investigation.

Adding as an extra requirement would allow someone that only wants to use "vanilla" scrubadub to just use it, but if someone wants to use spacy transformers (which is arguably way slower) they could just say:

pip install scrubadub[spacy]

@thomasbird
Copy link
Member

thomasbird commented Oct 30, 2020

At the moment the build fails for python3.5, because Spacy doesn't support it

Ah. Yeah including it as an extra seems very sensible.

python 3.5 is almost end of life, but i believe it's the python still in ubuntu LTS 18.04... Don't worry about it in the MR, I'll work out something...

@aCampello aCampello marked this pull request as ready for review November 1, 2020 23:04
@aCampello
Copy link
Collaborator Author

aCampello commented Nov 1, 2020

Hi @thomasbird , I believe I am happy with the initial functionality. I added the tests. On python 3.8 I am over 98%, however since the functionality doesn't work in 3.5, there is a decrease in coverage (I set the tests to skip).

Let me know what you think would be the best approach for coveralls. Happy to hear your thoughts on the MR in general.

Copy link
Member

@thomasbird thomasbird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! Thanks for the commits!

@thomasbird thomasbird merged commit 4e43ce8 into LeapBeyond:master Nov 5, 2020
@aCampello
Copy link
Collaborator Author

Thanks for merging this. I believe this closes #18

@aCampello aCampello mentioned this pull request Nov 5, 2020
@thomasbird thomasbird linked an issue Nov 6, 2020 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use spaCy for NLP
3 participants