The keyword extractor will only return exact matches. Can you explain how this is different? There's also a requirement for conjunctive blacklist term filtering, where we filter a post only if it contains every term in an n-tuple of blacklisted terms. I was imagining that this would be implemented by retrieving the set of blacklisted terms present using the keyword extractor, and then determining which conjunctive filter tuples that set satisfies. |
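A minimal sketch of the two-step approach described in the comment above: collect the blacklisted terms present in the text, then check which conjunctive tuples are fully covered. `presentTerms` is a hypothetical stand-in for the keyword extractor (its real API is not shown in this thread); it assumes single-word terms and case-insensitive matching.

```scala
object ConjunctiveFilterSketch {
  // Hypothetical stand-in for the keyword extractor: returns the subset of
  // `vocabulary` that occurs in `text` as a whole word, ignoring case.
  // Assumes every term is a single word.
  def presentTerms(text: String, vocabulary: Set[String]): Set[String] = {
    val words = text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet
    vocabulary.filter(term => words.contains(term.toLowerCase))
  }

  // A conjunctive filter is satisfied only when all of its terms are present.
  // Empty tuples are skipped so that they never match every post.
  def satisfiedFilters(text: String, filters: Seq[Set[String]]): Seq[Set[String]] = {
    val present = presentTerms(text, filters.flatten.toSet)
    filters.filter(terms => terms.nonEmpty && terms.subsetOf(present))
  }
}
```

A post would then be filtered when `satisfiedFilters(text, filters).nonEmpty`.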
|
The keyword extractor is case insensitive. |
|
Why should the blacklist be case sensitive, but not the whitelist? |
| analysis.keywords.nonEmpty | ||
| } | ||
|
|
||
| def hasBlacklistedTerms(details: Details): Boolean = { |
Ideally, this method should be added to the Analyzer interface, and this implementation should exist in AnalysisDefaults. The type of the details parameter there should be changed to accept ExtendedDetails[T] to give each concrete Analyzer the chance to override it, and blacklist will need to be added as a parameter as well.
Is there a scenario in which we do not want to blacklist terms? Also: we have a bunch of other filters that are currently not in the Analyzer trait (hasKeywords, isLanguageSupported). Do you want to include all of those too? If so: this should be a follow-up PR.
Done in 94bf88b. I'm not convinced that this will or should ever be overridden, though.
Imagine if a pipeline had yet another field that they wanted to blacklist terms from.
I think hasKeywords and isLanguageSupported belong where they are in Pipeline.scala. We extract the keywords and language from the Analyzer since we need those results for our analysis. We don't care what the blacklisted terms were, so we don't need them as an output from the Analyzer.
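The suggestion in this thread (declare `hasBlacklistedTerms` on the Analyzer interface, provide a default implementation, and let a concrete analyzer override it for an extra field) could look roughly like the sketch below. Every type here is a simplified, hypothetical stand-in: the real `Analyzer`, `ExtendedDetails`, and `AnalysisDefaults` definitions are not shown in this review, and the names `EnableBlacklist`, `Post`, and `author` are assumptions made for illustration.

```scala
object BlacklistMixinSketch {
  // Hypothetical, simplified stand-in for the project's ExtendedDetails[T].
  final case class ExtendedDetails[T](original: T, title: String, body: String)

  // Hypothetical, simplified stand-in for the Analyzer interface.
  trait Analyzer[T] {
    def hasBlacklistedTerms(details: ExtendedDetails[T], blacklist: Seq[Set[String]]): Boolean
  }

  // Default behavior: a tuple matches when all of its terms appear as
  // whitespace-separated tokens in the title or body (case sensitive).
  trait EnableBlacklist[T] extends Analyzer[T] {
    protected def tokens(text: String): Set[String] =
      text.split("\\s+").filter(_.nonEmpty).toSet

    override def hasBlacklistedTerms(details: ExtendedDetails[T], blacklist: Seq[Set[String]]): Boolean = {
      val present = tokens(details.title) ++ tokens(details.body)
      blacklist.exists(terms => terms.nonEmpty && terms.subsetOf(present))
    }
  }

  // Hypothetical source type with an additional field worth checking.
  final case class Post(author: String)

  class DefaultPostAnalyzer extends EnableBlacklist[Post]

  // A pipeline that also wants to blacklist terms found in the author field.
  class AuthorAwarePostAnalyzer extends EnableBlacklist[Post] {
    override def hasBlacklistedTerms(details: ExtendedDetails[Post], blacklist: Seq[Set[String]]): Boolean =
      super.hasBlacklistedTerms(details, blacklist) ||
        blacklist.exists(terms => terms.nonEmpty && terms.subsetOf(tokens(details.original.author)))
  }
}
```

Passing the blacklist as a parameter keeps the trait free of configuration lookups, which matches the note above that the blacklist "will need to be added as a parameter as well".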
|
I'd assert that it's worse to erroneously filter out an event than to erroneously include it. Ergo, we don't want any surprises for blacklist terms. For example, I may want to blacklist "Trump" (as in "the president") but not "trump" (as in "to best"). Additionally, unless I'm missing something, re-using the keyword extractor will
At the end of the day, this implementation also works, but I'm not convinced it's superior:
class Blacklist(blacklist: Seq[Set[String]]) {
  private lazy val extractors = blacklist.map(new KeywordExtractor(_))
  def matches(text: String): Boolean = {
    extractors.zip(blacklist).exists(kv => kv._1.extractKeywords(text).size == kv._2.size)
  }
} |
| with EnableLocation[T] | ||
| with EnableEntity[T] | ||
| with EnableLanguage[T] | ||
| with FilterBlacklist[T] |
What about EnableBlacklist or something that starts with Enable for consistency?
|
|
||
| import com.microsoft.partnercatalyst.fortis.spark.transforms.nlp.Tokenizer | ||
|
|
||
| class Blacklist(blacklist: Seq[Set[String]]) { |
I like this since we can choose to implement this differently later on :)
| import com.microsoft.partnercatalyst.fortis.spark.analyzer.{Analyzer, ExtendedFortisEvent} | ||
| import com.microsoft.partnercatalyst.fortis.spark.dba.ConfigurationManager | ||
| import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, FortisEvent} | ||
| import com.microsoft.partnercatalyst.fortis.spark.dto.{Analysis, Details, FortisEvent} |
Is this one still needed after your latest commits?
Built without using the keyword extractor so that we can offer more predictable behavior to the user: events will only get filtered out if tokens match exactly what the user specified.
Resolves #34
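For illustration, exact-token conjunctive matching as described above could be sketched as follows. This is not the code from the PR: the actual `Blacklist` class uses the project's `Tokenizer`, whose API is not shown here, so this sketch assumes a simple split on non-letter, non-digit characters, and the class name `ExactBlacklistSketch` is hypothetical.

```scala
// Sketch only: case-sensitive, exact-token matching of conjunctive blacklist tuples.
class ExactBlacklistSketch(blacklist: Seq[Set[String]]) {
  // Assumed tokenization: split on anything that is not a letter or digit.
  private def tokenize(text: String): Set[String] =
    text.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet

  // True when at least one tuple has all of its terms present as exact tokens.
  def matches(text: String): Boolean = {
    val tokens = tokenize(text)
    blacklist.exists(terms => terms.nonEmpty && terms.subsetOf(tokens))
  }
}
```

With a blacklist of `Seq(Set("Trump", "rally"))`, the text "Trump held a rally" matches, while "Hearts trump spades at the rally" does not, because the comparison is case sensitive.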