parallelized scrubbing #11

deanmalmgren · 2015-10-30T11:29:39Z

Right now, we scrub for each feature (email, credentials, URLs, names, etc.) in a particular order. This order has been fine-tuned to address certain things that have come up during development, but it's easy to imagine this dependency tree becoming ever more complicated and ultimately very difficult to tune for all possible examples.

A different approach would be to do all of these scrubbings in parallel instead of in series. For example, we could have a method like Scrubber.iter_filth() that yields Filth objects (based on a MatchObject?). The iter_filth method could monitor all of the features we are scrubbing and yield them in the order in which they appear in the text, resolving any conflicts that come up along the way. Each Filth object, for example, could have a Filth.score attribute (or property or method) that signals the likelihood that this piece of Filth is actually a piece of CredentialFilth and not EmailFilth.
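
To make the idea concrete, here is a minimal sketch of what such an iterator might look like. It is not the implementation in this pull request: the `Filth` fields, the `DETECTORS` table, the regular expressions, and the score values are all assumptions made for illustration, and the conflict rule (keep only the higher-scoring of two overlapping matches) is just one possible choice.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Filth:
    """Hypothetical stand-in for the proposed Filth object."""
    beg: int
    end: int
    text: str
    type: str
    score: float  # assumed convention: higher means more likely


# Illustrative (type, pattern, score) triples; patterns and scores are made up.
DETECTORS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), 0.5),
    ("url", re.compile(r"https?://\S+"), 0.6),
]


def iter_filth(text: str) -> Iterator[Filth]:
    """Yield non-overlapping Filth in text order, resolving overlaps by score."""
    # Run every detector over the same text and collect all candidates.
    candidates: List[Filth] = []
    for type_, pattern, score in DETECTORS:
        for m in pattern.finditer(text):
            candidates.append(Filth(m.start(), m.end(), m.group(), type_, score))

    # Walk the candidates in order of appearance.
    candidates.sort(key=lambda f: (f.beg, -f.score))
    current = None
    for filth in candidates:
        if current is None:
            current = filth
        elif filth.beg < current.end:
            # Overlap: keep whichever candidate has the higher score.
            if filth.score > current.score:
                current = filth
        else:
            yield current
            current = filth
    if current is not None:
        yield current


sample = "write joe@example.com or visit http://example.com/~joe@example.com"
for filth in iter_filth(sample):
    print(filth.type, filth.text)
```

In the sample text, the email address embedded in the URL is detected by both patterns, and the sketch reports it once, as part of the URL, because the URL candidate carries the higher assumed score.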

This would allow us to properly deal with situations like this:

Your credentials are the following
username: joe@example.com
password: p@ssw0rd

which (I think) would preferably yield

Your credentials are the following
username: {{USERNAME}}
password: {{PASSWORD}}

or optionally

Your credentials are the following
username: {{EMAIL}}
password: {{PASSWORD}}

but certainly NOT

Your credentials are the following
username: {{EMAIL}}
password: p@ssw0rd
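
The credentials example above can be reproduced with a small score-based sketch. Everything here is hypothetical: the `RULES` table, the `clean` helper, the patterns, and the score values are assumptions chosen so that a context-aware credential rule outranks the generic email rule. It is not how scrubadub itself detects credentials.

```python
import re

# Hypothetical (placeholder, pattern, score) rules; scores are made-up values.
RULES = [
    ("USERNAME", re.compile(r"(?<=username: )\S+"), 0.9),
    ("PASSWORD", re.compile(r"(?<=password: )\S+"), 0.9),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), 0.5),
]


def clean(text):
    """Replace each detected span with a {{PLACEHOLDER}}, preferring higher scores."""
    matches = []
    for placeholder, pattern, score in RULES:
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), placeholder, score))

    # Highest score first, so stronger matches claim their span first.
    matches.sort(key=lambda m: -m[3])
    kept = []
    for beg, end, placeholder, score in matches:
        # Keep a match only if it overlaps nothing that was already kept.
        if all(end <= b or beg >= e for b, e, _ in kept):
            kept.append((beg, end, placeholder))

    # Rebuild the text from left to right with placeholders substituted.
    out, pos = [], 0
    for beg, end, placeholder in sorted(kept):
        out.append(text[pos:beg])
        out.append("{{%s}}" % placeholder)
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = (
    "Your credentials are the following\n"
    "username: joe@example.com\n"
    "password: p@ssw0rd\n"
)
print(clean(text))
```

Under these assumed scores the username line becomes {{USERNAME}} rather than {{EMAIL}}, and the password is scrubbed as well. Lowering the USERNAME score below the EMAIL score would give the {{EMAIL}} variant instead.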

deanmalmgren added the enhancement label Apr 13, 2015

deanmalmgren pushed a commit that referenced this pull request Apr 14, 2015

fixed problem with Skype capitalization, which is a workaround until #11 is implemented

e324353

Dean Malmgren added 26 commits April 16, 2015 06:11

added Scrubber.configure method to set up all scrubber options

8a8e8ae

worked on a design spec for the next major version of scrubadub. in rethinking the iter_filth method, it became clear that it was important to identify which methods should be public and could be internal to the package

f53b41b

started to mess around with iter_filth

f8aed60

minor changes to the design documents

7d32cb4

updated API to using Scrubber.clean()

4230a77

got Scrubber.iter_filth API sorted out

4bc5b81

added ipdb to development requirements for convenience

02506eb

NameFilth and NameDetector implemented

9e0cdc2

got the EmailDetector and EmailFilth to work

4478059

refactoring to accommodate UrlFilth and UrlDetector

3e54644

PhoneDetector and PhoneFilth

b5d53c7

added CredentialDetector and CredentialFilth

4e1d8a2

cleaning up the UrlFilth

67a1a7d

added SkypeFilth and SkypeDetector

ba0bdd3

fixed bug with empty lists of proper nouns or skype usernames

ceb2cf1

fixed some bugs when there is no erroring text

b6f6929

fixing url tests

1c437a1

cleaned up a few other tests

bfd3872

pep8 💄

6eeb3f1

added test for raising UnexpectedFilth error

9369795

making sure disallowed_names are lower case

e66be23

added a few more tests for CanonicalStringSet

7596b2e

testing Filth merging

9bf72a2

added Filth.type argument to make MergeFilth implementation straightforward

d3a51b8

cleaning up Scrubber.iter_filth method

68a8796

pep8 💄

870c34e

deanmalmgren pushed a commit that referenced this pull request Oct 31, 2015

Merge pull request #11 from datascopeanalytics/parallel

1fbb132

parallelized scrubbing

deanmalmgren merged commit 1fbb132 into master Oct 31, 2015

deanmalmgren deleted the parallel branch October 31, 2015 12:54
