New filter: website resembles username #450

Closed
Glorfindel83 opened this issue Jan 10, 2017 · 13 comments
Labels
area: spamchecks Detections or the process of testing posts. (No space in the label, is because of Hacktoberfest) type: feature request Shinies.

Comments

@Glorfindel83
Member

E.g. for these kinds of spam posts, which quite often go undetected:
https://metasmoke.erwaysoftware.com/post/52946
https://metasmoke.erwaysoftware.com/post/52841
https://metasmoke.erwaysoftware.com/post/51936

Procedure: replace the spaces in the username with \W? and check whether the post contains a link that matches the resulting pattern.
Some users have 3-character usernames, which could trigger the filter by accident. Maybe this should only apply to usernames above a certain length.
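
A minimal sketch of the procedure described above, assuming the links have already been pulled out of the post. The function names and the `MIN_USERNAME_LENGTH` cutoff are hypothetical; the issue does not specify a value.

```python
import re

# Hypothetical cutoff; the issue only says "above a certain length".
MIN_USERNAME_LENGTH = 5


def username_pattern(username):
    """Build a regex in which each run of spaces becomes an optional non-word char."""
    words = username.split()
    return re.compile(r"\W?".join(re.escape(word) for word in words), re.IGNORECASE)


def link_resembles_username(username, links, min_length=MIN_USERNAME_LENGTH):
    """Return True if any link contains the username, allowing for separators."""
    # Skip very short usernames, which could match a link by accident.
    if len(username.replace(" ", "")) < min_length:
        return False
    pattern = username_pattern(username)
    return any(pattern.search(link) for link in links)
```

With this sketch, "Price Buy" matches `http://www.price-buy.com/` because the pattern is `Price\W?Buy`, compared case-insensitively, while a short name such as "Andy" is skipped before any matching happens.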

@magisch
Member

magisch commented Jan 10, 2017

Sounds like a good idea

@ArtOfCode- ArtOfCode- self-assigned this Jan 10, 2017
@ArtOfCode-
Member

Actually, having assigned this to myself, I've just realised this isn't currently possible. We only check one of username/title/body/summary at a time, so there's no point at which the check code has access to both the username and the body.

@ArtOfCode- ArtOfCode- removed their assignment Jan 10, 2017
@magisch
Member

magisch commented Jan 11, 2017

@ArtOfCode- Wouldn't it be possible to schedule the Username check before the body check and save the username temporarily so you can access it in the body check?

@ArtOfCode-
Member

@magisch Possibly. Would have to look at that.

@Undo1
Member

Undo1 commented Jan 11, 2017

Sounds messy, I'd probably be against that. It'd be better to just make a new reason-method type that takes all parts of the post at once.
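
One way such a whole-post check type could look is sketched below. This is not SmokeDetector's actual structure: `Post`, `squash`, `username_in_body`, `run_checks`, and the length cutoff of 5 are all hypothetical names and values chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical container for the parts of a post that checks may inspect."""
    username: str
    title: str
    body: str


def squash(text):
    """Lower-case the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def username_in_body(post):
    """Whole-post check: does the squashed username occur in the squashed body?"""
    name = squash(post.username)
    return len(name) >= 5 and name in squash(post.body)


def run_checks(post, field_checks, post_checks):
    """Return the reason of every check that fires.

    field_checks: tuples of (reason, field name, function taking one string)
    post_checks:  tuples of (reason, function taking the whole Post)
    """
    reasons = []
    for reason, field, check in field_checks:
        if check(getattr(post, field)):
            reasons.append(reason)
    for reason, check in post_checks:
        if check(post):
            reasons.append(reason)
    return reasons
```

The single-field checks keep their one-string signature, and only the new kind of check receives the whole post, so no state has to be carried over from the username check to the body check.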

@ghost

ghost commented Jan 12, 2017

@Undo1 agree

@angussidney angussidney added type: feature request Shinies. area: spamchecks Detections or the process of testing posts. (No space in the label, is because of Hacktoberfest) labels Jan 23, 2017
@AWegnerGitHub
Member

Are there other test cases to run against? Right now I am using the following tests:

checks = [
	("http://www.price-buy.com/", "Price Buy"),
	("https://thebestparkourgear.com/backpack-for-parkour/", "TheBestParkourGear"),
	("httl://bestonwardticket.com", "Best onward Ticket"),
	("https://i.stack.imgur.com/eS6WQ.jpg", "Best onward Ticket"),
	("www.stackoverflow.com", "Andy"),
	("www.stackoverflow.notarealtld", "Andy"),
	("stackoverflow.notarealtld", "Andy"),
	("http://stackoverflow.notarealtld", "Andy"),
	("httl://stackoverflow.notarealtld", "Andy"),
]

I get the following results:

SIMILAR: (1.0) => Name: Price Buy, Domain: http://www.price-buy.com/
SIMILAR: (1.0) => Name: TheBestParkourGear, Domain: https://thebestparkourgear.com/backpack-for-parkour/
SIMILAR: (1.0) => Name: Best onward Ticket, Domain: httl://bestonwardticket.com
NOT SIMILAR: (0.0952380952381) => Name: Best onward Ticket, Domain: https://i.stack.imgur.com/eS6WQ.jpg
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: www.stackoverflow.com
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: www.stackoverflow.notarealtld
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: stackoverflow.notarealtld
NOT SIMILAR: (0.0) => Name: Andy, Domain: http://stackoverflow.notarealtld
NOT SIMILAR: (0.0) => Name: Andy, Domain: httl://stackoverflow.notarealtld

It's a little messier than I thought it'd be, and it requires adding a library to Smokey, but it works. My tests have been pretty simple so far: I've only passed in the domain, not the entire body of the post. Handling the full body will require an HTML parser (likely BeautifulSoup), so that would need to be included too.
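
For illustration, link extraction from a post body can be sketched with the standard library's `html.parser` as a stand-in for BeautifulSoup. The `LinkExtractor` class and `extract_links` helper are hypothetical names, and the sketch only collects `href` values of `<a>` tags, not bare URLs in text.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(body_html):
    """Return all anchor hrefs found in an HTML fragment, in document order."""
    parser = LinkExtractor()
    parser.feed(body_html)
    parser.close()
    return parser.links
```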

What I need:

  • The OK to include at least 1 new library: tld. If we don't already include BeautifulSoup, we also need to include that for parsing the links out of the body.
  • More test cases so I can throw those into here and make sure I'm not missing any other cases.
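
The scores above are consistent with `difflib.SequenceMatcher.ratio()` applied to the normalised username and the registered domain name (for example, "andy" against "stackoverflow" shares one character out of 17, giving 2/17 ≈ 0.1176), although the thread does not show the actual code. A rough sketch under that assumption follows. It uses a naive stand-in for the `tld` library that simply takes the second-to-last host label, so it does not validate TLDs and mishandles suffixes such as `co.uk`; all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def naive_domain_label(url):
    """Return the second-to-last host label (a crude stand-in for the tld library)."""
    # Prefix schemeless input with '//' so urlparse treats it as a host.
    parsed = urlparse(url if "://" in url else "//" + url)
    host = (parsed.hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    if len(labels) >= 2:
        return labels[-2]
    return labels[0] if labels else ""


def normalize(text):
    """Lower-case the text and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def similarity(username, url):
    """Score how closely the username resembles the URL's domain label (0.0 to 1.0)."""
    name = normalize(username)
    domain = normalize(naive_domain_label(url))
    if not name or not domain:
        return 0.0
    return SequenceMatcher(None, name, domain).ratio()
```

Because this stand-in accepts any TLD, `http://stackoverflow.notarealtld` is scored like `www.stackoverflow.com` here, instead of getting the 0.0 reported in the results above.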

@Undo1
Member

Undo1 commented Feb 21, 2017 via email

@Glorfindel83
Member Author

Glorfindel83 commented Feb 21, 2017

Good job! Here's another TP from today: https://metasmoke.erwaysoftware.com/post/58200
Also, one of your testcases has a httl://. I don't know that scheme.

@ArtOfCode-
Member

@Glorfindel83 HyperText Testing Language

@AWegnerGitHub
Member

No, no branch yet. I've been testing alternatives all morning though and am ready to implement. However this brings up another point of discussion. I've opened another issue because it will impact more than just this change.

Related issue: #538

@AWegnerGitHub
Member

@Glorfindel83, yes it does. That httl is from https://metasmoke.erwaysoftware.com/post/51936

@AWegnerGitHub
Member

Closed with 2860085


6 participants