Skip to content
This repository has been archived by the owner. It is now read-only.

Implement token blacklist filtering #37

Merged
merged 9 commits into from Jun 27, 2017
Merged

Implement token blacklist filtering #37

merged 9 commits into from Jun 27, 2017

Conversation

@c-w
Copy link
Contributor

@c-w c-w commented Jun 27, 2017

Built without using the keyword extractor so that we can offer more predictable behavior to the user: events will only get filtered out if tokens match exactly what the user specified.

Resolves #34

@kevinhartman
Copy link
Contributor

@kevinhartman kevinhartman commented Jun 27, 2017

The keyword extractor will only return exact matches. Can you explain how this is different?

There's also a requirement for conjunctive blacklist term filtering, where we filter a post only if it contains a tupleN of blacklisted terms.

I was imagining that this would be implemented by retrieving the set of blacklisted terms present using the keyword extractor, and then determining which conjunctive filter tuples were satisfied by it.

@c-w
Copy link
Contributor Author

@c-w c-w commented Jun 27, 2017

The keyword extractor is case insensitive.

@kevinhartman
Copy link
Contributor

@kevinhartman kevinhartman commented Jun 27, 2017

Why should the blacklist be case sensitive, but not the whitelist?

@@ -70,6 +72,11 @@ object Pipeline {
analysis.keywords.nonEmpty
}

def hasBlacklistedTerms(details: Details): Boolean = {
Copy link
Contributor

@kevinhartman kevinhartman Jun 27, 2017

Ideally, this method should be added to the Analyzer interface, and this implementation should exist in AnalysisDefaults. The type of the details parameter there should be changed to accept ExtendedDetails[T] to give each concrete Analyzer the chance to override it, and blacklist will need to be added as a parameter as well.

Copy link
Contributor Author

@c-w c-w Jun 27, 2017

Is there a scenario in which we do not want to blacklist terms? Also: we have a bunch of other filters that are currently not in the Analyzer trait (hasKeywords, isLanguageSupported). Do you want to include all of those too? If so: this should be a follow-up PR.

Copy link
Contributor Author

@c-w c-w Jun 27, 2017

Done in 94bf88b. Not convinced that this will or shold be ever over-written though.

Copy link
Contributor

@kevinhartman kevinhartman Jun 27, 2017

Imagine if a pipeline had yet another field that they wanted to blacklist terms from.

I think hasKeywords and isLanguageSupported belong where they are in Pipeline.scala. We extract the keywords and language from the Analyzer since we need those results for our analysis. We don't care what the blacklisted terms were, so we don't need them as an output from the Analyzer.

@c-w
Copy link
Contributor Author

@c-w c-w commented Jun 27, 2017

I'd assert that it's worse to erroneously filter out an event than to erroneously include it. Ergo, we don't want any surprises for blacklist terms. For example, I may want to blacklist "Trump" (as in "the president") but not "trump" (as in "to best").

Additionally, unless I'm missing something, re-using the keyword extractor will
(1) couple us to that implementation so we won't be able to make changes to the keyword code unless they also work for the blacklist code
(2) make the code somewhat less time efficient because we have to re-tokenize the sentence once per blacklist
(3) make the code less memory efficient as we'll store the blacklist words twice
(4) make the code a little bit more complicated to understand

At the end of the day, this implementation also works, but I'm not convinced it's superior:

class Blacklist(blacklist: Seq[Set[String]]) {
  private lazy val extractors = blacklist.map(new KeywordExtractor(_))

  def matches(text: String): Boolean = {
    extractors.zip(blacklist).exists(kv => kv._1.extractKeywords(text).size == kv._2.size)
  }
}

Copy link
Contributor

@kevinhartman kevinhartman left a comment

LGTM

@@ -19,6 +19,7 @@ private[analyzer] object AnalysisDefaults {
with EnableLocation[T]
with EnableEntity[T]
with EnableLanguage[T]
with FilterBlacklist[T]
Copy link
Contributor

@kevinhartman kevinhartman Jun 27, 2017

What about EnableBlacklist or something that starts with Enable for consistency?

Copy link
Contributor Author

@c-w c-w Jun 27, 2017


import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer

class Blacklist(blacklist: Seq[Set[String]]) {
Copy link
Contributor

@kevinhartman kevinhartman Jun 27, 2017

I like this since we can choose to implement this differently later on :)

@@ -6,16 +6,16 @@ import com.microsoft.partnercatalyst.fortis.spark.ProjectFortis.Settings
import scala.reflect.runtime.universe.TypeTag
import com.microsoft.partnercatalyst.fortis.spark.analyzer.{Analyzer, ExtendedFortisEvent}
import com.microsoft.partnercatalyst.fortis.spark.dba.ConfigurationManager
import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, FortisEvent}
import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, Details, FortisEvent}
Copy link
Contributor

@kevinhartman kevinhartman Jun 27, 2017

Is this one still needed after your latest commits?

Copy link
Contributor Author

@c-w c-w Jun 27, 2017

@c-w c-w merged commit d9329ca into master Jun 27, 2017
2 checks passed
@c-w c-w deleted the blacklist branch Jun 27, 2017
@c-w c-w removed the in progress label Jun 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants