Regex for Spam Usernames #1

Closed
Bhargav-Rao opened this issue Jan 29, 2018 · 11 comments

Comments

@Bhargav-Rao
Member

Bhargav-Rao commented Jan 29, 2018

One of the Puzzling Stack Exchange moderators, Rubio, has provided us with a dataset of usernames that were used to spam Puzzling Stack Exchange.

The list is here: https://pastebin.com/CkeV99c5

Can we come up with a new Regex list to catch most of these spammers?

I need help creating the regex list itself, not developing anything. We can store the new regexes in a new file on GitHub and reuse the code we already use to match against the Smokey blacklisted username list.
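
For illustration, the "regexes in a file, matched by existing code" idea could look roughly like the sketch below. This is not User Stalker's actual code: the one-regex-per-line format, the case-insensitive matching, the helper names `load_patterns` and `find_match`, and the sample patterns are all assumptions made for the example.

```python
import re

def load_patterns(lines):
    """Compile one regex per non-blank line (assumed file format)."""
    compiled = []
    for line in lines:
        text = line.strip()
        if text:
            compiled.append(re.compile(text, re.IGNORECASE))
    return compiled

def find_match(username, patterns):
    """Return the source text of the first pattern that matches, or None."""
    for pattern in patterns:
        if pattern.search(username):
            return pattern.pattern
    return None

# Hypothetical file contents; a real run would read the lines from the regex file.
sample_lines = [
    "mayweather.*mcgregor\n",
    "\n",
    r"^mlbopeningday\w?$" + "\n",
]
patterns = load_patterns(sample_lines)

print(find_match("MayweatherVsMcGregor", patterns))
print(find_match("alice", patterns))
```

Returning the pattern text rather than a bare boolean makes it easier to see which entry fired when a report turns out to be a false positive.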

@rjrudman

A regex-trie may be useful here.

Here's an explanation
https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3/42789508#42789508
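
As a rough sketch of the idea (not the code from the linked answer), a trie of keywords can be folded into a single regex so that shared prefixes are only tested once. The function names and the three sample keywords below are assumptions for the example.

```python
import re

def build_trie(words):
    """Nest dicts character by character; the '' key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def trie_to_pattern(node):
    """Turn a trie node into a regex fragment matching any suffix stored below it."""
    end_here = "" in node
    branches = [re.escape(ch) + trie_to_pattern(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not end_here:
        body = branches[0]
    else:
        body = "(?:" + "|".join(branches) + ")"
    # A word that ends here makes everything below this node optional.
    return body + "?" if end_here else body

# Hypothetical keywords; the real list would come from the dataset.
keywords = ["mayweather", "mcgregor", "mlbopeningday"]
pattern_text = trie_to_pattern(build_trie(keywords))
username_regex = re.compile(pattern_text, re.IGNORECASE)

print(pattern_text)  # m(?:ayweather|cgregor|lbopeningday)
print(bool(username_regex.search("McGregorFan2017")))  # True
```

With only three keywords the gain is negligible; the approach pays off when thousands of literal names share prefixes.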

@Bhargav-Rao
Member Author

Interesting, thanks for that.

@Bhargav-Rao
Member Author

If anyone wants to contribute, drop the regexes into this file: https://github.com/SOBotics/UserStalker/blob/master/data/blacklistRegex.txt

@rjrudman

Are you happy with a computer-generated regex-trie, or are you after a human-readable regex?

@Bhargav-Rao
Member Author

Anything is fine. As long as it does the job.

@adeak
Contributor

adeak commented Jan 31, 2018

Note that my uneducated impression is that these names are just vague guidelines to help figure out filters for the future. Catching all those Mayweather vs McGregor spam accounts won't be terribly helpful going forward. Similarly, filtering for mlbopeningdayx will probably be less useful. So my layman's impression is that throwing these names into a TRIE will not necessarily be the best course of action. But I'm curious about what we can do :)

@adeak
Contributor

adeak commented Jan 31, 2018

Come to think of it, in cases of false positives one often wants to look at the pattern that was triggered. So if a TRIE is included, it would probably be prudent to additionally provide the list of keywords from which the TRIE was generated.
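
A minimal sketch of that suggestion, assuming the keyword list is kept next to the generated pattern: the combined regex does the fast yes/no check, and the plain list is consulted only on a hit to report what triggered. The keywords, the `explain_match` name, and the simple alternation standing in for a generated trie regex are all hypothetical.

```python
import re

# Hypothetical keyword list, stored alongside whatever regex is generated from it.
KEYWORDS = ["mayweather", "mcgregor", "mlbopeningday"]

# Stand-in for the generated pattern: a plain alternation of the escaped keywords.
COMBINED = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def explain_match(username):
    """Return the keywords found in username, or [] if the combined pattern misses."""
    if not COMBINED.search(username):
        return []
    lowered = username.lower()
    return [k for k in KEYWORDS if k in lowered]

print(explain_match("MayweatherVsMcGregorLive"))  # ['mayweather', 'mcgregor']
print(explain_match("alice"))                     # []
```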

@rjrudman

@adeak Yeah, you raise a good point if we're trying to catch future spammers.

@Papershine
Member

There is a blacklisted username list by Charcoal for SmokeDetector here

@Bhargav-Rao
Member Author

Bhargav-Rao commented Feb 1, 2018

Yeah, we're using that, but it certainly isn't as comprehensive as Rubio's list.

@codygray
Contributor

codygray commented Dec 5, 2021

I'm marking this as "closed", because the action items have been completed.

We are now using all of the regex blacklists available from HeatDetector (to detect offensive words) as well as those available from Charcoal's SmokeDetector project (which will detect known trolls/spammers, as well as known problematic patterns). Plus, User Stalker has its own user-name blacklist built in, which includes Rubio's list, and is meant to be kept up-to-date by any moderator who handles the User Stalker reports.

Help is, of course, always welcome on improving and/or expanding the regexes in blacklists. If you have something you want to add or improve, simply submit a pull request (PR) for one of the existing patterns (https://github.com/SOBotics/UserStalker/tree/master/patterns). If you have a more complex suggestion, please open a new Issue.

codygray closed this as completed Dec 5, 2021