parallelized scrubbing #11
Merged
deanmalmgren pushed a commit that referenced this pull request Apr 14, 2015
…ethinking the iter_filth method, it became clear that it was important to identify which methods should be public and could be internal to the package
Right now, we scrub for each feature (email, credentials, URLs, names, etc.) in a particular order. This order has been fine-tuned to address certain things that have come up during development, but it's easy to imagine this dependency tree becoming ever more complicated and ultimately becoming very difficult to tune for all possible examples.
A different approach would be to do all of these scrubbings in parallel instead of in series. For example, we could have a method like `Scrubber.iter_filth()` that yields a `Filth` object (based on a `MatchObject`?). The `iter_filth` method could be monitoring all of the features we are scrubbing and return them in the order in which they appear in the text while resolving any conflicts that come up along the way. Each `Filth` object, for example, could have a `Filth.score` attribute (or property or method) that signals the likelihood that this piece of `Filth` is actually a piece of `CredentialFilth` and not `EmailFilth`. This would allow us to properly deal with situations like this:
which (I think) would preferably yield
or optionally
but certainly NOT
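To make the `iter_filth` idea above more concrete, here is a minimal sketch of how parallel detection with score-based conflict resolution might look. Everything in it is an assumption for illustration: the `Filth` dataclass fields, the `DETECTORS` table, the regular expressions, the fixed scores, and the `{{KIND}}` placeholder format are all hypothetical and are not taken from the package. It only shows one possible way to run every detector over the same text, keep the highest-scoring match when two overlap, and yield the survivors in text order.

```python
import re
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Filth:
    """Hypothetical container for one detected span of sensitive text."""
    kind: str
    beg: int
    end: int
    text: str
    score: float


# Hypothetical detectors: (kind, pattern, score). The patterns and the
# fixed scores are deliberately simplistic placeholders.
DETECTORS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.90),
    ("url", re.compile(r"https?://\S+"), 0.80),
    ("credential", re.compile(r"(?i)password:\s*\S+"), 0.95),
]


def iter_filth(text: str) -> Iterator[Filth]:
    """Run all detectors on the same text and yield non-overlapping
    Filth in order of appearance, preferring higher scores on conflict."""
    candidates = [
        Filth(kind, m.start(), m.end(), m.group(), score)
        for kind, pattern, score in DETECTORS
        for m in pattern.finditer(text)
    ]
    accepted = []
    # Greedy resolution: consider the most likely candidates first and
    # drop any later candidate that overlaps one already accepted.
    for cand in sorted(candidates, key=lambda f: (-f.score, f.beg)):
        if all(cand.end <= a.beg or cand.beg >= a.end for a in accepted):
            accepted.append(cand)
    yield from sorted(accepted, key=lambda f: f.beg)


def scrub(text: str) -> str:
    """Replace each piece of Filth with a {{KIND}} placeholder."""
    pieces = []
    pos = 0
    for filth in iter_filth(text):
        pieces.append(text[pos:filth.beg])
        pieces.append("{{" + filth.kind.upper() + "}}")
        pos = filth.end
    pieces.append(text[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    print(scrub("contact me at john@example.com password: hunter2@x.io ok"))
```

In this made-up input, `hunter2@x.io` looks like an email address but sits inside a credential match with a higher score, so the credential wins and the overlapping email candidate is discarded. A real implementation would presumably compute `score` per match rather than per detector, and a greedy pass is only one of several reasonable ways to resolve conflicts.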