-
Notifications
You must be signed in to change notification settings - Fork 4
Implement token blacklist filtering #37
Conversation
The keyword extractor will only return exact matches. Can you explain how this is different? There's also a requirement for conjunctive blacklist term filtering, where we filter a post only if it contains a tupleN of blacklisted terms. I was imagining that this would be implemented by retrieving the set of blacklisted terms present using the keyword extractor, and then determining which conjunctive filter tuples were satisfied by it. |
The keyword extractor is case insensitive. |
Why should the blacklist be case sensitive, but not the whitelist? |
@@ -70,6 +72,11 @@ object Pipeline { | |||
analysis.keywords.nonEmpty | |||
} | |||
|
|||
def hasBlacklistedTerms(details: Details): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, this method should be added to the Analyzer
interface, and this implementation should exist in AnalysisDefaults
. The type of the details parameter there should be changed to accept ExtendedDetails[T] to give each concrete Analyzer
the chance to override it, and blacklist
will need to be added as a parameter as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a scenario in which we do not want to blacklist terms? Also: we have a bunch of other filters that are currently not in the Analyzer trait (hasKeywords
, isLanguageSupported
). Do you want to include all of those too? If so: this should be a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 94bf88b. Not convinced that this will or shold be ever over-written though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagine if a pipeline had yet another field that they wanted to blacklist terms from.
I think hasKeywords
and isLanguageSupported
belong where they are in Pipeline.scala
. We extract the keywords and language from the Analyzer
since we need those results for our analysis. We don't care what the blacklisted terms were, so we don't need them as an output from the Analyzer
.
I'd assert that it's worse to erroneously filter out an event than to erroneously include it. Ergo, we don't want any surprises for blacklist terms. For example, I may want to blacklist "Trump" (as in "the president") but not "trump" (as in "to best"). Additionally, unless I'm missing something, re-using the keyword extractor will At the end of the day, this implementation also works, but I'm not convinced it's superior: class Blacklist(blacklist: Seq[Set[String]]) {
private lazy val extractors = blacklist.map(new KeywordExtractor(_))
def matches(text: String): Boolean = {
extractors.zip(blacklist).exists(kv => kv._1.extractKeywords(text).size == kv._2.size)
}
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -19,6 +19,7 @@ private[analyzer] object AnalysisDefaults { | |||
with EnableLocation[T] | |||
with EnableEntity[T] | |||
with EnableLanguage[T] | |||
with FilterBlacklist[T] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about EnableBlacklist
or something that starts with Enable
for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer | ||
|
||
class Blacklist(blacklist: Seq[Set[String]]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this since we can choose to implement this differently later on :)
@@ -6,16 +6,16 @@ import com.microsoft.partnercatalyst.fortis.spark.ProjectFortis.Settings | |||
import scala.reflect.runtime.universe.TypeTag | |||
import com.microsoft.partnercatalyst.fortis.spark.analyzer.{Analyzer, ExtendedFortisEvent} | |||
import com.microsoft.partnercatalyst.fortis.spark.dba.ConfigurationManager | |||
import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, FortisEvent} | |||
import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, Details, FortisEvent} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this one still needed after your latest commits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Built without using the keyword extractor so that we can offer more predictable behavior to the user: events will only get filtered out if tokens match exactly what the user specified.
Resolves #34