Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findspam.py: Anchor pattern_websites #4128

Closed
wants to merge 1 commit into from

Conversation

user12986714
Copy link
Contributor

So spam.co does not match notspam.com

@NobodyNada
Copy link
Member

I'm pretty sure pattern-matching websites is intentionally non-anchored, because we're looking for specific keywords in a domain name rather than trying to match specific domain names. For instance, we currently catch the domain xxxfreehdporn DOT com, but AFAICT we would not if the reason was anchored.

@makyen
Copy link
Contributor

makyen commented Jun 29, 2020

My understanding is that many of the regexes in pattern_websites are explicitly intended to start matching from anywhere within a word, not just at word boundaries. In other words, the lack of having \b bookends is intentional. Some of the regexes may benefit, but that would be on a case by case basis. It appears that adding this would break the intended operation of at least some of those regexes.

What is the problem that this change is attempting to address? (i.e. the specific issue you've seen)

@makyen
Copy link
Contributor

makyen commented Jun 29, 2020

Also, if bookending is going to be added, it's probably preferable to do it once in the string in which the .format() is substituting into, rather than add \b to the beginning/end of each sub-pattern.

@user12986714
Copy link
Contributor Author

The specific issue is that almost all fps from this reason is due to fragment matches; however I agree that this may reduce detection and may be harmful to the overall goal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants