Skip to content
This repository has been archived by the owner. It is now read-only.

Implement token blacklist filtering #37

Merged
merged 9 commits into from Jun 27, 2017

Conversation

Projects
None yet
2 participants
@c-w
Copy link
Member

commented Jun 27, 2017

Built without using the keyword extractor so that we can offer more predictable behavior to the user: events will only get filtered out if tokens match exactly what the user specified.

Resolves #34

c-w added some commits Jun 27, 2017

@c-w c-w added the in progress label Jun 27, 2017

@kevinhartman

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2017

The keyword extractor will only return exact matches. Can you explain how this is different?

There's also a requirement for conjunctive blacklist term filtering, where we filter a post only if it contains a tupleN of blacklisted terms.

I was imagining that this would be implemented by retrieving the set of blacklisted terms present using the keyword extractor, and then determining which conjunctive filter tuples were satisfied by it.

@c-w

This comment has been minimized.

Copy link
Member Author

commented Jun 27, 2017

The keyword extractor is case insensitive.

@kevinhartman

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2017

Why should the blacklist be case sensitive, but not the whitelist?

@@ -70,6 +72,11 @@ object Pipeline {
analysis.keywords.nonEmpty
}

def hasBlacklistedTerms(details: Details): Boolean = {

This comment has been minimized.

Copy link
@kevinhartman

kevinhartman Jun 27, 2017

Contributor

Ideally, this method should be added to the Analyzer interface, and this implementation should exist in AnalysisDefaults. The type of the details parameter there should be changed to accept ExtendedDetails[T] to give each concrete Analyzer the chance to override it, and blacklist will need to be added as a parameter as well.

This comment has been minimized.

Copy link
@c-w

c-w Jun 27, 2017

Author Member

Is there a scenario in which we do not want to blacklist terms? Also: we have a bunch of other filters that are currently not in the Analyzer trait (hasKeywords, isLanguageSupported). Do you want to include all of those too? If so: this should be a follow-up PR.

This comment has been minimized.

Copy link
@c-w

c-w Jun 27, 2017

Author Member

Done in 94bf88b. Not convinced that this will or shold be ever over-written though.

This comment has been minimized.

Copy link
@kevinhartman

kevinhartman Jun 27, 2017

Contributor

Imagine if a pipeline had yet another field that they wanted to blacklist terms from.

I think hasKeywords and isLanguageSupported belong where they are in Pipeline.scala. We extract the keywords and language from the Analyzer since we need those results for our analysis. We don't care what the blacklisted terms were, so we don't need them as an output from the Analyzer.

@c-w

This comment has been minimized.

Copy link
Member Author

commented Jun 27, 2017

I'd assert that it's worse to erroneously filter out an event than to erroneously include it. Ergo, we don't want any surprises for blacklist terms. For example, I may want to blacklist "Trump" (as in "the president") but not "trump" (as in "to best").

Additionally, unless I'm missing something, re-using the keyword extractor will
(1) couple us to that implementation so we won't be able to make changes to the keyword code unless they also work for the blacklist code
(2) make the code somewhat less time efficient because we have to re-tokenize the sentence once per blacklist
(3) make the code less memory efficient as we'll store the blacklist words twice
(4) make the code a little bit more complicated to understand

At the end of the day, this implementation also works, but I'm not convinced it's superior:

class Blacklist(blacklist: Seq[Set[String]]) {
  private lazy val extractors = blacklist.map(new KeywordExtractor(_))

  def matches(text: String): Boolean = {
    extractors.zip(blacklist).exists(kv => kv._1.extractKeywords(text).size == kv._2.size)
  }
}

c-w added some commits Jun 27, 2017

@kevinhartman
Copy link
Contributor

left a comment

LGTM

@@ -19,6 +19,7 @@ private[analyzer] object AnalysisDefaults {
with EnableLocation[T]
with EnableEntity[T]
with EnableLanguage[T]
with FilterBlacklist[T]

This comment has been minimized.

Copy link
@kevinhartman

kevinhartman Jun 27, 2017

Contributor

What about EnableBlacklist or something that starts with Enable for consistency?

This comment has been minimized.

Copy link
@c-w

c-w Jun 27, 2017

Author Member

import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer

class Blacklist(blacklist: Seq[Set[String]]) {

This comment has been minimized.

Copy link
@kevinhartman

kevinhartman Jun 27, 2017

Contributor

I like this since we can choose to implement this differently later on :)

@@ -6,16 +6,16 @@ import com.microsoft.partnercatalyst.fortis.spark.ProjectFortis.Settings
import scala.reflect.runtime.universe.TypeTag
import com.microsoft.partnercatalyst.fortis.spark.analyzer.{Analyzer, ExtendedFortisEvent}
import com.microsoft.partnercatalyst.fortis.spark.dba.ConfigurationManager
import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, FortisEvent}
import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, Details, FortisEvent}

This comment has been minimized.

Copy link
@kevinhartman

kevinhartman Jun 27, 2017

Contributor

Is this one still needed after your latest commits?

This comment has been minimized.

Copy link
@c-w

c-w Jun 27, 2017

Author Member

@c-w c-w merged commit d9329ca into master Jun 27, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@c-w c-w deleted the blacklist branch Jun 27, 2017

@c-w c-w removed the in progress label Jun 27, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.