-
Notifications
You must be signed in to change notification settings - Fork 3
Replace Cognitive Services language detector #92
Conversation
|
|
||
| def logIncomingEventBatch(streamId: String, connectorName: String, batchSize: Long): Unit = { | ||
| val properties = new HashMap[String, String](2) | ||
| val properties = new util.HashMap[String, String](2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would using util.HashMap be better than using HashMap here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to fix a Scala warning. More details here or here. Apparently it's a convention in the Scala community to make non-standard collections (e.g. java.util, scala.mutable) stand out more.
Mostly I just didn't want the bright yellow warning highlights on my screen since they're distracting from actual issues and these ones were very easy to fix :P
jcjimenez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| class LocalLanguageDetector extends LanguageDetector { | ||
| @transient private lazy val languageProfiles = new LanguageProfileReader().readAllBuiltIn | ||
| @transient private lazy val languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard()).withProfiles(languageProfiles).build() | ||
| @transient private lazy val textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is large text always the best choice here? I see the other relevant-sounding option is forDetectingShortCleanText. While Tweets probably aren't clean, they're also not long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't only have Tweets but also news text, Facebook posts, comments, etc. so overall the default detector is probably fine. We can also run it through both detectors to potentially increase accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in b0fff7d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we can based it off of text length af093c6
|
|
||
| def addLanguage(event: ExtendedFortisEvent[T]): ExtendedFortisEvent[T] = { | ||
| val language = analyzer.detectLanguage(event.details, languageDetector) | ||
| val language = analyzer.detectLanguage(event.details, new LocalLanguageDetector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is on a per-event code path, it'll get constructed for every event, so the lazy vals in the impl won't be reused. Are they light-weight to construct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved outside in ffc18e8
b0fff7d to
e925f0d
Compare
| client.trackEvent("batch.sink", properties, metrics) | ||
| } | ||
|
|
||
| def logLanguageDetection(language: Option[String]): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a general utility class for writing events into app insights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the class that's being used to log events to AppInsights, so that's I reckon where additional event logging should be encapsulated.
| parseResponse(response, textId) | ||
| } | ||
|
|
||
| protected def callCognitiveServices(requestBody: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we still need to call the cog svc language endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the old class that got renamed. I can delete it if you want, but given that we're keeping the Kafka sink around too, might as well keep this option around.
As per conversation yesterday: we want to reduce the number of cognitive services calls so replacing Cognitive Services language detector with a local one.