Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Filter] Performance increases. #2509

Merged
merged 9 commits into from Apr 3, 2019

Conversation

@mikeshardmind
Copy link
Member

commented Mar 5, 2019

Type

  • Bugfix
  • Enhancement
  • New feature

Description of the changes

The filter was already using re to split words, this just does the entire search in re instead.

[Filter] Performance increases.
The filter was already using re to split words, this just does the entire search in re instead.

A further improvement to this would cache patterns used and update them if the wordlist changes.
@Tobotimus

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

First issue with this is that there's a fatal change of behaviour - with an empty word list, it will return a set containing a single empty string, causing the message to be deleted. Obviously we can just exit early instead.

Secondly, I've timed these changes and they're not necessarily any faster than before, in fact in many cases it's significantly slower. They seem to be showing a performance increase when the word list is small and the message is large. However as the parameters go in the opposite direction (bigger word list / smaller message) it shows a bigger slowdown.

I tried to give it a fair benchmark by taking a random sample of messages and testing them against an auto-generated banned word list of increasing size. I got the random sample by using the last 10,000 messages in #support. Here's a graph of my results:
graph

@Tobotimus

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

It also seems to be performing worse when the banned words and phrases themselves have a longer average length. In the previous test I used word lists in the form ["word1", "word2", ...], but here are my results using word lists in the form ["longerword1", "longerword2", ...]:
graph

@mikeshardmind

This comment has been minimized.

Copy link
Member Author

commented Mar 6, 2019

  1. The previous version was returning an empty set on no hits as well, this is not changing the API.
  2. I can't reprofile this right now, but this was running faster with the test cases I used, so when I get home, I'll see if I can see what happened with that.
  3. This should show up as even faster on repeated use if the further changes which aren't made here to cache patterns are made, so I'll add those to this PR as well.

mikeshardmind added some commits Mar 6, 2019

@mikeshardmind

This comment has been minimized.

Copy link
Member Author

commented Mar 6, 2019

The majority of the time difference without the cache was the generator expression which created the pattern.

Adding a pattern cache makes the regex solution multiple orders of magnitude faster on any message other than the first received in a channel with a filter.

(Side by side comparison from discord conversation, running the same above tests on the updated version, but in a profiler instead of just a straight time count.)

image

Context: old_filter is the behavior of the existing, the green block with re.compile at the same level is the initial message cost, and new_filter (again at the same level) is the amount of time excluding generating the cached pattern (not visible at that scale, extra image below):

image

@Tobotimus

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

RE: 1, You seem to have fixed the issue now, the issue was not returning an empty set on no hits. I said the issue was it returned a non-empty set on everything when filter was loaded but there were no banned words.

@mikeshardmind

This comment has been minimized.

Copy link
Member Author

commented Mar 6, 2019

yeah, I realized I misunderstood your meaning on that, and added that, but forgot to comment 😓

mikeshardmind added some commits Mar 6, 2019

...
redbot/cogs/filter/filter.py Outdated Show resolved Hide resolved
@Tobotimus

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

Very nice, you managed to take it from linear time to constant time 😁
graph

mikeshardmind added some commits Mar 27, 2019

Slight further improvements, a fix, and restructure
Moved actual set creation out of the inner portion.
Reduced config lookups in case of no filter.
Fixed channel wordlist fetching.
@tekulvw

tekulvw approved these changes Apr 3, 2019

@tekulvw tekulvw merged commit 8ab3951 into Cog-Creators:V3/develop Apr 3, 2019

1 check passed

Travis CI - Pull Request Build Passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.