Feature/spacy detector #64

aCampello · 2020-10-24T17:22:05Z

Current stats for the old entity detector:

Documents generated in 62.92s
Scrubbed documents in 3.98s
                     precision    recall  f1-score   support

name          name        0.09      1.00      0.16       558

Current stats for the new spaCy detector:

Documents generated in 66.99s
Scrubbed documents in 78.61s
                     precision    recall  f1-score   support

named_entity  name        0.98      0.95      0.97      1143

           micro avg      0.98      0.95      0.97      1143
           macro avg      0.98      0.95      0.97      1143
        weighted avg      0.98      0.95      0.97      1143
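
For anyone reading the two reports side by side, here is a minimal sketch of how the f1-score column relates to precision and recall, plus the scrubbing slowdown implied by the timings above. The helper name `f1_score` is only illustrative. The report presumably computes f1 from unrounded precision and recall, so recomputing it from the two-decimal values printed above (about 0.165 and 0.965) can differ from the printed f1 in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the two reports above (already rounded to two decimals).
old_f1 = f1_score(0.09, 1.00)
new_f1 = f1_score(0.98, 0.95)

# Ratio of the "Scrubbed documents" timings: new detector vs. old detector.
slowdown = 78.61 / 3.98

print("old f1 ~ {:.3f}, new f1 ~ {:.3f}, slowdown ~ {:.1f}x".format(old_f1, new_f1, slowdown))
```

The old detector's perfect recall with very low precision means it flags nearly everything as a name; the new detector trades a little recall for far fewer false positives, at roughly twenty times the scrubbing time.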

coveralls · 2020-10-24T17:30:42Z

Coverage decreased (-2.1%) to 95.958% when pulling 856f291 on aCampello:feature/spacy_detector into 0eec66e on LeapBeyond:master.

thomasbird · 2020-10-25T00:06:53Z

Nice work, thanks! I haven't had a chance to look through it properly, but two things popped out:

- It would be great to have some tests for the new behaviour around iter_filth_documents and the new detector.
- I'd like to keep Filth types tied to what they are, as opposed to how they were detected. So if we find a name, a NameFilth should be returned, and if we find an organisation, an OrganisationFilth could be returned. The filth_cls attribute is no longer required on the Detector class, so a detector can return any Filth you'd like.
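
A minimal sketch of the second point, assuming the detector sees spaCy-style entity labels such as PERSON and ORG. The classes below are simplified, hypothetical stand-ins and not scrubadub's actual Filth classes; `LABEL_TO_FILTH` and `filth_for_entity` are illustrative names.

```python
class Filth:
    """Hypothetical stand-in for a base filth class."""
    type = "unknown"

    def __init__(self, beg, end, text, detector_name):
        self.beg = beg
        self.end = end
        self.text = text
        self.detector_name = detector_name


class NameFilth(Filth):
    type = "name"


class OrganisationFilth(Filth):
    type = "organisation"


class NamedEntityFilth(Filth):
    """Fallback for entity labels that have no dedicated filth type."""
    type = "named_entity"

    def __init__(self, beg, end, text, detector_name, label):
        super().__init__(beg, end, text, detector_name)
        self.label = label


# Assumed mapping from entity label to the filth type describing what was found.
LABEL_TO_FILTH = {"PERSON": NameFilth, "ORG": OrganisationFilth}


def filth_for_entity(label, beg, end, text, detector_name="named_entity"):
    """Return a filth typed by what the entity is, not by which detector found it."""
    filth_cls = LABEL_TO_FILTH.get(label)
    if filth_cls is None:
        return NamedEntityFilth(beg, end, text, detector_name, label)
    return filth_cls(beg, end, text, detector_name)
```

With this shape, the detector name is still recorded on each filth, so benchmarks can tell detectors apart even though the filth type no longer encodes how the text was found.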

I look forward to playing with this next week! Thanks for the commits! 😁

aCampello · 2020-10-25T21:43:42Z

> It would be great to have some tests for the new behaviour around iter_filth_documents and the new detector.

100%. This is highly unfinished. I am planning to write tests shortly, to get coverage back to the level it was at before this PR (or ideally higher). I just wanted to open a draft PR to get early feedback.

> I'd like to keep Filth types tied to what they are, as opposed to how they were detected. So if we find a name, a NameFilth should be returned, and if we find an organisation, an OrganisationFilth could be returned. The filth_cls attribute is no longer required on the Detector class, so a detector can return any Filth you'd like.

Yes, that also occurred to me as I was playing with it. Will implement it 👍!

aCampello · 2020-10-29T18:18:46Z

I have now implemented the changes pointed out above. I also added some initial tests for the detector.

At the moment the build fails for Python 3.5, because spaCy doesn't support it, so I am wondering how you want to handle that. Should I add the spaCy detector as an "extra", or should I allow the tests to fail? I believe we can make Travis skip tests for specific environments, but that will require some investigation.

Adding it as an extra requirement would let anyone who only wants "vanilla" scrubadub keep using it as is, while anyone who wants the spaCy transformers (which are arguably much slower) could just run:

pip install scrubadub[spacy]
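
A rough sketch of what the "extra" approach could look like, assuming a setuptools-based setup.py. The version pin, the `EXTRAS_REQUIRE` name, and the `spacy_is_available` helper are assumptions for illustration, not the code in this PR.

```python
# Fragment of a hypothetical setup.py; the pin below is an assumption.
EXTRAS_REQUIRE = {
    # Pulled in only by `pip install scrubadub[spacy]`. The environment marker
    # tells pip to ignore the requirement on Python 3.5.
    "spacy": ['spacy>=3.0; python_version >= "3.6"'],
}

# In setup.py this would be passed as:
#     setuptools.setup(..., extras_require=EXTRAS_REQUIRE)


def spacy_is_available():
    """Return True when the optional spaCy dependency can be imported."""
    try:
        import spacy  # noqa: F401
    except ImportError:
        # Catch ImportError rather than ModuleNotFoundError: the latter was
        # only added in Python 3.6, so it would not exist on 3.5.
        return False
    return True
```

The detector module can then check `spacy_is_available()` (or an equivalent guard) and raise a clear error that points users at the extra, instead of failing at import time for everyone.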

thomasbird · 2020-10-30T11:45:27Z

> At the moment the build fails for Python 3.5, because spaCy doesn't support it

Ah. Yeah, including it as an extra seems very sensible.

Python 3.5 is almost end of life, but I believe it's still the Python in Ubuntu LTS 18.04... Don't worry about it in the MR, I'll work something out...

aCampello · 2020-11-01T23:07:03Z

Hi @thomasbird, I believe I am happy with the initial functionality. I added the tests. On Python 3.8 coverage is over 98%; however, since the functionality doesn't work on 3.5, coverage decreases there (I set the tests to skip).

Let me know what you think would be the best approach for coveralls. Happy to hear your thoughts on the MR in general.
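
For reference, a minimal sketch of version-based skipping with the standard library's unittest. The test case name and its body are hypothetical placeholders and not the tests added in this PR.

```python
import sys
import unittest

# spaCy does not support Python 3.5, so the detector tests need 3.6 or newer.
SPACY_SUPPORTED = sys.version_info >= (3, 6)


@unittest.skipUnless(SPACY_SUPPORTED, "spaCy detector requires Python 3.6+")
class SpacyDetectorTestCase(unittest.TestCase):
    """Hypothetical placeholder for the named entity detector tests."""

    def test_runs_only_on_supported_python(self):
        self.assertGreaterEqual(sys.version_info[:2], (3, 6))


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SpacyDetectorTestCase)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Skipped tests count as passing for the build, but the detector code they would have exercised is never run on 3.5, which is why the coverage reported from that environment drops.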

thomasbird

Awesome work! Thanks for the commits!

aCampello · 2020-11-05T22:36:52Z

Thanks for merging this. I believe this closes #18

aCampello added 26 commits October 20, 2020 22:05

Add NER filth

4f31787

Add NER filth to __init__

449d150

Add NER detector

ed77716

Add iter_filth_documents basic logic

b20c212

Add support to different types for documents

e372b07

Flake8 tweaks

cd8b489

Edit f-string for 3.5 compatibility

7e5b3d5

Add iter_filth

3d9a888

Simplify name_entities type

3aaebb4

Rename named-entity-filth

7a3c679

Figure out which detectors can run on a batch of documents

c2b9aeb

Add possibility to disable detector

c12b3d6

Logic to scrubbers to detect if a detector has document iterator

7f05d28

Scrubbers to merge with document detectors

f413757

Tidy document processors merge

00c3343

Named entity filth to accept a label

9481159

Add Spacy detector to init

e148dfc

Change detector name to follow the pattern

af80edd

Update module name

86f6e6b

Remove unecessary imports

ae42eb0

Change type for NamedEntityFilth depending on label

86b4532

Revert NamedEntityFilth name because it was a bad idea

45ee26f

Change replacement string of named entity filth

5dacd62

Add spacy nightly to requirements

d319029

Add benchmark with spacy accuracy

8a63d47

Comment named entity test code

0d0b839

NamedEntityDetector to return standard Filth when it is avaliable

f6386dd

aCampello added 6 commits October 28, 2020 23:18

Scrubber simplification

3edc20d

Fix types for document_names and text

0ee6c4a

Fix types for document dictionary to include None

1b80110

Update requirements to nightly 3.0.0rc1

ffb6ed9

Comment unecessary piece of code

ecd9654

Initial tests to named entity detector

aaf36a0

aCampello added 14 commits October 31, 2020 14:22

Skip tests if python version < 3.6

2c8eedb

Add spacy as extra

c83829a

Tweak travis for python3.8

dd5a7be

Revert CI and add environment marker to requirements

491c10d

Add check for extras

3822cf9

Fix test skipping

f1b29cd

Change import error for 3.5 compatibility

9949918

Yet another test logic fix

de100c1

Test new scrubbers clean_documents signature

9941285

Remove unecessary return

92c922a

Test spacy compatibility with scrubber

46bb178

Mypy tweaks

56f400a

Flake8 tweaks

ef3b3c6

Type annotation tweak for 3.5 compatibility

856f291

aCampello requested a review from thomasbird November 1, 2020 23:04

aCampello marked this pull request as ready for review November 1, 2020 23:04

thomasbird approved these changes Nov 5, 2020

thomasbird merged commit 4e43ce8 into LeapBeyond:master Nov 5, 2020

aCampello mentioned this pull request Nov 5, 2020

use spaCy for NLP #18

Closed

thomasbird linked an issue Nov 6, 2020 that may be closed by this pull request

use spaCy for NLP #18

Closed
