Should we have a 'pharma spam site' reason? #971

Closed
superplane39 opened this issue Jul 20, 2017 · 15 comments
Labels
area: blacklists type: feedback wanted "Closed as too opinion-based."

Comments

@superplane39
Member

superplane39 commented Jul 20, 2017

We see a lot of Indian pharma spam. We know that. We also know that these spammers often rotate through many different sites in their spam.

I propose adding another reason - pharma spam site in {} - for a couple of reasons.

1.) to raise the weight of these posts

If we add another reason, one that has (I'm guessing) around 0 FPs, that would increase the weight. Increasing the weight is important, as it affects the autoflags (especially if we increase the number of flags depending on the weight, i.e. an 800-weight post would get 4 flags while a 500-weight post would get only 3).

2.) to keep a list of these sites

Why do we need a list? So that we can help get these taken down. @tripleee is working on submitting complaints (kudos!), and it's easier to submit complaints when you have a list of the problematic sites.
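To make the weight argument in point 1 concrete, here is a minimal sketch. It assumes a post's weight is the sum of the weights of the reasons it matched, which is what the proposal implies. Only the 500 -> 3 and 800 -> 4 figures come from the comment above; the other tiers, the reason weights, and the function names are hypothetical.

```python
# Hypothetical tiers. Only 500 -> 3 flags and 800 -> 4 flags come from the
# comment above; the lower thresholds are illustrative assumptions.
FLAG_TIERS = [
    (800, 4),
    (500, 3),
    (300, 2),  # assumed
    (100, 1),  # assumed
]


def flags_for_weight(weight):
    """Return the number of autoflags for a post with the given total weight."""
    for threshold, flags in FLAG_TIERS:
        if weight >= threshold:
            return flags
    return 0


def total_weight(reason_weights, matched_reasons):
    """Sum the weights of the reasons a post matched (assumed weighting model)."""
    return sum(reason_weights[reason] for reason in matched_reasons)


# Made-up reason weights, purely for illustration.
REASON_WEIGHTS = {
    "bad keyword in body": 250,
    "blacklisted website in body": 300,
    "pharma spam site in body": 280,
}

without_new_reason = total_weight(
    REASON_WEIGHTS, ["bad keyword in body", "blacklisted website in body"]
)
with_new_reason = total_weight(REASON_WEIGHTS, list(REASON_WEIGHTS))

print(without_new_reason, flags_for_weight(without_new_reason))
print(with_new_reason, flags_for_weight(with_new_reason))
```

With these made-up numbers, the extra high-precision reason is what pushes the post over the next tier.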

So - what do people think? Pros? Cons? Additional complications?

@tripleee
Member

https://gist.github.com/tripleee/ab226f77b6deaf4ffea6d22d9b976beb contains 481 domain names extracted from the 7505 current hits for reason #106

There are probably a few stray domains with just a single hit -- let me know if I should try to process this further.
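For anyone who wants to process a list like this further, the following is a rough sketch of one way to pull domains out of post bodies and drop the single-hit strays. It is not the script used to build the gist; the function names, the URL regex, and the `min_hits` parameter are assumptions for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Deliberately simple URL matcher; real post bodies may need something stricter.
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+", re.IGNORECASE)


def domains_in(text):
    """Return the set of host names found in URLs within one post body."""
    domains = set()
    for url in URL_RE.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal); skip it.
            continue
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        domains.add(host)
    return domains


def count_domains(posts, min_hits=2):
    """Count the posts each domain appears in, keeping domains with >= min_hits."""
    counts = Counter()
    for body in posts:
        # A domain repeated inside one post still counts as a single hit.
        counts.update(domains_in(body))
    return {domain: n for domain, n in counts.items() if n >= min_hits}


sample_posts = [
    "Buy now http://www.pills.example/buy http://www.pills.example/buy",
    "See https://pills.example/x and http://once.example/",
]
print(count_domains(sample_posts))
```

Counting per post rather than per URL keeps the "repeated URL at end of long post" pattern from inflating a domain's hit count.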

@angussidney
Member

angussidney commented Jul 20, 2017

I think it would be a good idea, as long as all of the existing pharma domains in the blacklists are moved over into the new rule. Otherwise we would have two reasons which trigger on the same criteria, which is a bad idea.

@tripleee
Member

On closer inspection the "repeated URL at end of long post" is not exclusively Indian pharma after all. There are hits from support telephone number spam, MP3 sites, Oracle training etc. But I'm hoping the gist would be useful as a starting point nevertheless.

@j-f1
Contributor

j-f1 commented Jul 20, 2017

Maybe there should be !!/pharm and !!/unpharm commands to modify the list.

@angussidney added area: blacklists type: feedback wanted "Closed as too opinion-based." labels Jul 20, 2017
@honnza
Member

honnza commented Jul 20, 2017

Does this separation lose sensitivity for other valid spam detections? We don't want that...

@tripleee
Member

@honnza How do you mean? Moving some domains from the general blacklist to a more focused high-precision blacklist should not lose any existing functionality.

@Undo1
Member

Undo1 commented Jul 20, 2017

Fun fact: Metasmoke ignores anything in parentheses in reason names. We could have "bad keyword in body (pharma)" as a reason name. It'd do nothing to autoflagging, but would be searchable in the why data.
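As a rough illustration of that behaviour (this is not Metasmoke's actual code, and `base_reason` is a hypothetical name), stripping the parenthesised part makes the annotated and plain names collapse to the same reason:

```python
import re
from collections import Counter

# Matches a parenthesised suffix such as " (pharma)", including leading space.
PAREN_RE = re.compile(r"\s*\([^)]*\)")


def base_reason(name):
    """Return the reason name with any parenthesised annotation removed."""
    return PAREN_RE.sub("", name).strip()


reasons = [
    "bad keyword in body (pharma)",
    "bad keyword in body",
    "blacklisted website in body",
]

# Both "bad keyword in body" variants are counted under one base reason.
print(Counter(base_reason(r) for r in reasons))
```

Under this assumption the "(pharma)" suffix carries no weight of its own, which is why it would not affect autoflagging.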

@tripleee
Member

But I want to be able to find, search, manipulate, and organize these hits in Metasmoke.

@Undo1
Member

Undo1 commented Jul 21, 2017

Absolutely. We could append the original reason set to the why data to make searching possible.

@tripleee
Member

I can search for "pharma" in "why" and that currently gets me 42/42. But the "why" data is currently awfully unstructured, and contains bits and pieces of the original post. (It's also hard to see which snippet corresponds to which reason. You see "bad keyword in body" and you can search the body hits in "why" and usually figure out which one corresponds to that reason, but it's not always straightforward.) What I'm hoping is that we could have a separate reason to make it easy and obvious how to list just the posts which belong to this set, and no others. It can be done via "why" but the way that it is currently (not) structured, avoiding false positives in the essentially free-form text is basically impossible without additional postprocessing.

For this reason, I'm hoping we could have a dedicated reason; this is a first-class Metasmoke identity that you can search unambiguously right from the Metasmoke search panel.

Granted, that would pollute the currently high-level and generic reasons hierarchy.

I can see two ways this could be avoided:

  1. Revamp "why", or replace it with a structured format which can be unambiguously searched and manipulated (and also improve the mapping between reasons and "why" indicators).
  2. Generalize the regex-based blacklists to "sets" (for lack of a better word) where each regex has a tag identifying which set it belongs to. There could be a hierarchy, like (ad hocking here, bear with) blacklist.website vs blacklist.website.pharma vs blacklist.website.supportnumber vs watch.website.pharma vs blacklist.keyword.pharma etc. I'm not entirely sure how this should be tagged, identified, and searchable in Metasmoke. As a first approximation, putting these basically machine-readable tags in "why" would at least make searching that data reasonably unambiguous.
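A minimal sketch of what option 2 could look like, purely as an assumption about one possible shape. The tag names follow the ad hoc hierarchy above; the `TaggedPattern` class, the helper functions, and the example regexes are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPattern:
    tag: str           # dotted set name, e.g. "blacklist.website.pharma"
    regex: re.Pattern  # compiled pattern belonging to that set


# Example entries only; the domains and keywords are placeholders.
PATTERNS = [
    TaggedPattern("blacklist.website.pharma", re.compile(r"pills\.example", re.I)),
    TaggedPattern("blacklist.website.supportnumber", re.compile(r"support-line\.example", re.I)),
    TaggedPattern("watch.website.pharma", re.compile(r"maybe-pills\.example", re.I)),
    TaggedPattern("blacklist.keyword.pharma", re.compile(r"\bmuscle\s+booster\b", re.I)),
]


def matching_tags(text):
    """Return the sorted, de-duplicated tags of every pattern that matches."""
    return sorted({p.tag for p in PATTERNS if p.regex.search(text)})


def in_set(tag, prefix):
    """True if tag is the given set or nested below it in the hierarchy."""
    return tag == prefix or tag.startswith(prefix + ".")


def has_segment(tag, segment):
    """True if any level of the dotted tag equals segment (e.g. "pharma")."""
    return segment in tag.split(".")


tags = matching_tags("Try muscle booster at http://pills.example/buy")
print(tags)
print([t for t in tags if in_set(t, "blacklist.website")])
print([p.tag for p in PATTERNS if has_segment(p.tag, "pharma")])
```

Writing the matched tags into "why" as whole tokens would make a search for, say, every pharma hit a matter of exact tag matching rather than free-text guessing.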

@tripleee
Member

Tangentially, I'm also thinking Metasmoke v2 should have URLs and domains exposed, cataloged, tracked, indexed, etc. Maybe at that point we could at least figure out a way to collect related domains (pharma domains, support number domains, and why not phone numbers and email addresses) by some sort of tagging (named sets or just user-assigned free-form tags?).

@Undo1
Member

Undo1 commented Jul 21, 2017

Wouldn't be that hard to do retroactively. Just need to parse them out of the post text and store them.

@tripleee
Member

Why was this closed? I think we should still pursue this as a separate reason somehow, and have been slowly working towards compiling a list of domain names which should be moved.

@ArtOfCode-
Member

Closed because there's been no discussion in a month. We can still have further discussion here, but unless someone's actively going to work on it there's no point having the issue open.


7 participants