parallelized scrubbing #11

Merged
merged 26 commits into from Oct 31, 2015
deanmalmgren
Collaborator

Right now, we scrub for each feature (email, credentials, URLs, names, etc.) in a particular order. This order has been fine-tuned to address certain things that have come up during development, but it's easy to imagine this dependency tree becoming ever more complicated and ultimately very difficult to tune for all possible examples.

A different approach would be to do all of these scrubbings in parallel instead of in series. For example, we could have a method like Scrubber.iter_filth() that yields Filth objects (based on a MatchObject?). The iter_filth method could monitor all of the features we are scrubbing and yield them in the order in which they appear in the text, resolving any conflicts that come up along the way. Each Filth object could have a Filth.score attribute (or property or method) that signals, for example, the likelihood that this piece of Filth is actually CredentialFilth and not EmailFilth.

This would allow us to properly deal with situations like this:

Your credentials are the following
username: joe@example.com
password: p@ssw0rd

which (I think) would preferably yield

Your credentials are the following
username: {{USERNAME}}
password: {{PASSWORD}}

or optionally

Your credentials are the following
username: {{EMAIL}}
password: {{PASSWORD}}

but certainly NOT

Your credentials are the following
username: {{EMAIL}}
password: p@ssw0rd
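
To make the idea concrete, here is a rough sketch of how iter_filth could collect candidates from every detector at once and use a score to settle overlaps. Everything in it is an assumption for illustration: the shape of the Filth class, the DETECTORS table, the regular expressions, the score values, and the greedy highest-score-first rule are not the merged implementation.

```python
import re
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Filth:
    """A matched span of sensitive text (hypothetical shape)."""
    kind: str     # placeholder label, e.g. "EMAIL"
    start: int    # inclusive start offset in the text
    end: int      # exclusive end offset in the text
    score: float  # higher score wins when two spans overlap


# (kind, pattern, score) triples; patterns and scores are illustrative only.
DETECTORS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), 0.5),
    ("USERNAME", re.compile(r"(?<=username: )\S+"), 0.9),
    ("PASSWORD", re.compile(r"(?<=password: )\S+"), 0.9),
]


def iter_filth(text: str) -> Iterator[Filth]:
    """Yield non-overlapping Filth in text order, preferring higher scores."""
    # Run every detector over the same, unmodified text.
    candidates = [
        Filth(kind, match.start(), match.end(), score)
        for kind, pattern, score in DETECTORS
        for match in pattern.finditer(text)
    ]
    # Accept the best-scoring spans first and skip any candidate that
    # overlaps a span that has already been accepted.
    accepted = []
    for filth in sorted(candidates, key=lambda f: (-f.score, f.start)):
        if all(filth.end <= a.start or filth.start >= a.end for a in accepted):
            accepted.append(filth)
    # Hand the survivors back in the order they appear in the text.
    yield from sorted(accepted, key=lambda f: f.start)


def scrub(text: str) -> str:
    """Replace each piece of Filth with a {{KIND}} placeholder."""
    pieces, cursor = [], 0
    for filth in iter_filth(text):
        pieces.append(text[cursor:filth.start])
        pieces.append("{{%s}}" % filth.kind)
        cursor = filth.end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    print(scrub(
        "Your credentials are the following\n"
        "username: joe@example.com\n"
        "password: p@ssw0rd"
    ))
```

With these made-up scores, joe@example.com is found by both the email and the username detector, and the username span wins, so the example above comes out as {{USERNAME}} and {{PASSWORD}}. Because every detector sees the original text, nothing depends on the order in which the detectors are listed. An email address outside a credentials block has no competing span and is still replaced with {{EMAIL}}.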

deanmalmgren pushed a commit that referenced this pull request Oct 31, 2015
@deanmalmgren deanmalmgren merged commit 1fbb132 into master Oct 31, 2015
@deanmalmgren deanmalmgren deleted the parallel branch October 31, 2015 12:54