
[Filter] Performance increases. #2509

Merged
merged 9 commits into Cog-Creators:V3/develop from mikeshardmind:patch-21 on Apr 3, 2019

Conversation

mikeshardmind
Contributor

@mikeshardmind mikeshardmind commented Mar 5, 2019

Type

  • Bugfix
  • Enhancement
  • New feature

Description of the changes

The filter was already using re to split words; this change just does the entire search in re instead.

A further improvement would be to cache the patterns used and update them if the word list changes.
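
For reference, here's a minimal sketch of the approach described above (doing the whole search in re), not the PR's literal code; the function name, word-boundary handling, and case-insensitivity are illustrative assumptions:

```python
import re

def filter_hits(text: str, word_list: list[str]) -> set[str]:
    """Return the banned words/phrases found in ``text``."""
    if not word_list:
        # Early exit for an empty word list (see the discussion below);
        # otherwise the joined pattern would match empty strings.
        return set()
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in word_list) + r")\b",
        flags=re.IGNORECASE,
    )
    # findall with a non-capturing group returns the full matched text,
    # so the result is the set of banned words actually present.
    return set(pattern.findall(text))
```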
@Tobotimus
Member

The first issue with this is a fatal change of behaviour: with an empty word list, it will return a set containing a single empty string, causing every message to be deleted. Obviously we can just exit early instead.
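
To illustrate the edge case (a hypothetical reduction of the problem, assuming the pattern is built by joining the word list):

```python
import re

words = []                                  # filter loaded, no banned words yet
pattern = re.compile("|".join(words))       # joining nothing -> empty pattern
print(set(pattern.findall("hello there")))  # {''} -- non-empty, so the message gets deleted
# Exiting early (``if not words: return set()``) before compiling avoids this.
```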

Secondly, I've timed these changes and they're not necessarily any faster than before; in fact, in many cases they're significantly slower. They show a performance increase when the word list is small and the message is large, but as the parameters go in the opposite direction (bigger word list / smaller message) the slowdown grows.

I tried to give it a fair benchmark by taking a random sample of messages and testing them against an auto-generated banned word list of increasing size. I got the random sample by using the last 10,000 messages in #support. Here's a graph of my results:
[graph: filter run time vs. banned word list size, old vs. new implementation]
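
Rough shape of a benchmark like the one described above (sample_messages and the word-list sizes here are placeholders, not the actual data used):

```python
import timeit

def bench(filter_func, messages, word_list, repeats=5):
    # Best-of-N wall time for running the filter over every sampled message.
    return min(
        timeit.repeat(
            lambda: [filter_func(m, word_list) for m in messages],
            number=1,
            repeat=repeats,
        )
    )

# for size in (10, 100, 1000):
#     words = [f"word{i}" for i in range(size)]
#     print(size, bench(old_filter, sample_messages, words),
#           bench(new_filter, sample_messages, words))
```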

@Tobotimus
Member

It also seems to perform worse when the banned words and phrases themselves are longer on average. In the previous test I used word lists in the form ["word1", "word2", ...], but here are my results using word lists in the form ["longerword1", "longerword2", ...]:
[graph: same benchmark with longer banned words]

@mikeshardmind
Contributor Author

  1. The previous version returned an empty set on no hits as well; this isn't changing the API.
  2. I can't re-profile this right now, but it was running faster with the test cases I used, so when I get home I'll look into what happened there.
  3. This should be even faster on repeated use once the pattern-caching changes (not included here yet) are made, so I'll add those to this PR as well.

@mikeshardmind
Contributor Author

Most of the time difference without the cache came from the generator expression that builds the pattern.

Adding a pattern cache makes the regex solution multiple orders of magnitude faster on any message other than the first received in a channel with a filter.
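
A minimal sketch of what the pattern caching could look like, assuming a cache keyed on the word-list contents (the actual implementation may differ; these names are illustrative):

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def _compiled_pattern(words: tuple[str, ...]) -> "re.Pattern[str]":
    # Building and compiling the alternation is the expensive part,
    # so do it once per unique word list and reuse the compiled object.
    return re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )

def filter_hits_cached(text: str, word_list: list[str]) -> set[str]:
    if not word_list:
        return set()
    # Tuples are hashable, so the word list itself can act as the cache key;
    # a changed word list simply produces (and caches) a new pattern.
    return set(_compiled_pattern(tuple(word_list)).findall(text))
```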

(Side-by-side comparison from the Discord conversation: the same tests as above, run against the updated version, but in a profiler instead of a straight time count.)

[image: profiler comparison of old_filter vs. new_filter]

Context: old_filter is the behaviour of the existing implementation, the green block with re.compile at the same level is the one-time cost on the first message, and new_filter (again at the same level) is the time spent excluding generation of the cached pattern (not visible at that scale; extra image below):

[image: zoomed profiler view showing the cached-pattern time]

@Tobotimus
Member

RE: 1, you seem to have fixed the issue now, but the issue wasn't about returning an empty set on no hits. What I said was that it returned a non-empty set for everything when the filter was loaded but there were no banned words.

@mikeshardmind
Contributor Author

Yeah, I realized I misunderstood your meaning on that and added the fix, but forgot to comment 😓

@Tobotimus
Member

Very nice, you managed to take it from linear time to constant time 😁
[graph: run time vs. word list size with pattern caching, now roughly constant]

@Tobotimus Tobotimus added the "Type: Optimisation" label Mar 6, 2019
Michael H added 2 commits March 27, 2019 14:16
Moved actual set creation out of the inner portion.
Reduced config lookups in case of no filter.
Fixed channel wordlist fetching.
@tekulvw tekulvw merged commit 8ab3951 into Cog-Creators:V3/develop Apr 3, 2019
@mikeshardmind mikeshardmind deleted the patch-21 branch December 26, 2019 17:01
@Jackenmen Jackenmen added this to the 3.1.0 milestone Nov 17, 2020