Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show which pattern was matched in the why data #1275

Merged
merged 1 commit into from Nov 26, 2017

Conversation

@quartata
Copy link
Member

quartata commented Nov 26, 2017

This adds a new function called blacklist_factory, which acts as a wrapper that takes in a list of keywords (either wb with word boundaries or nwb without word boundaries) and transparently constructs a regex out of them and returns a function that tests a given string against it. Most importantly however, it wraps each keyword in its own named group. Then, when constructing the why data, it uses the last matched named group to retrieve which keyword was matched. A bit of a hack, but it's the best we can do without hooking into the regex engine or maybe using something like Perl's Regexp::Assemble.

Currently the blacklist_factory is employed for the bad_keywords.txt keywords, bad_keywords_nwb, blacklisted_usernames.txt, blacklisted_websites.txt and watched_keywords.txt. It could be used on some of the other fixed regexes (like pattern-matching websites) to enable this effect if desired.

This closes #1133.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Nov 26, 2017

Codecov Report

Merging #1275 into master will increase coverage by 0.39%.
The diff coverage is 95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1275      +/-   ##
==========================================
+ Coverage   64.04%   64.43%   +0.39%     
==========================================
  Files           6        6              
  Lines        1819     1839      +20     
==========================================
+ Hits         1165     1185      +20     
  Misses        654      654
Impacted Files Coverage Δ
findspam.py 84.88% <95%> (+0.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a8be29...3ae8e56. Read the comment docs.

@quartata

This comment has been minimized.

Copy link
Member Author

quartata commented Nov 26, 2017

Incidentally, this pattern could also be achieved with a callable object instead of a closure. This might be better if we want to add additional state or maybe say a fetch() method to reload the blacklist.

@ArtOfCode-

This comment has been minimized.

Copy link
Member

ArtOfCode- commented Nov 26, 2017

What's the performance like on this? If you compare runs/sec for the old function and this one, what do you get?

@quartata

This comment has been minimized.

Copy link
Member Author

quartata commented Nov 26, 2017

I don't have exact numbers but when running strings through the full FindSpam suite there was no noticeable difference. NFA capture groups are fast and the last named group was already tracked, so we're mostly adding one function call and some parsing. Peanuts compared to an NS query

@quartata

This comment has been minimized.

Copy link
Member Author

quartata commented Nov 26, 2017

I was mostly concered about memory usage, which is why I delete the regex pattern after compilation (something that is not currently done at all - it's recompiled on every check)

@quartata

This comment has been minimized.

Copy link
Member Author

quartata commented Nov 26, 2017

I would add that if you're concerned about time performance, findspam is currently a hot mess: most egregiously, tests are run on all fields (title, body, username) before even checking whether those tests should be run on those. If they shouldn't, the result is thrown out. A transition to the whole_post model (one of the first things I ever added, and the only option in NG) would make at least method tests cleaner and probably faster. This would be a massive refactor though (and is the current roadblock for NG).

@ArtOfCode-

This comment has been minimized.

Copy link
Member

ArtOfCode- commented Nov 26, 2017

!!/approve

Approved with PullApprove

@ArtOfCode- ArtOfCode- merged commit fb1e32e into Charcoal-SE:master Nov 26, 2017
6 checks passed
6 checks passed
ci/circleci Your tests passed on CircleCI!
Details
code-review/pullapprove Approved by ArtOfCode-
Details
codeclimate All good!
Details
codecov/patch 95% of diff hit (target 64.04%)
Details
codecov/project 64.43% (+0.39%) compared to 6a8be29
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Undo1 added a commit that referenced this pull request Nov 29, 2017
Revert "Merge pull request #1295 from quartata/factory_optimize --autopull"

This reverts commit 9633a5d, reversing
changes made to 6f4b3b0.

Revert "Merge pull request #1276 from quartata/which_pattern"

This reverts commit a6087a2, reversing
changes made to fb1e32e.

Revert "Merge pull request #1275 from quartata/which_pattern"

This reverts commit fb1e32e, reversing
changes made to 6a8be29.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.