Skip to content
This repository has been archived by the owner on Mar 7, 2018. It is now read-only.

Replace Cognitive Services language detector #92

Merged
merged 8 commits into from Aug 16, 2017
Merged

Conversation

c-w
Copy link
Contributor

@c-w c-w commented Aug 16, 2017

As per conversation yesterday: we want to reduce the number of cognitive services calls so replacing Cognitive Services language detector with a local one.


class AppInsightsTelemetry extends FortisTelemetry {
private val client: TelemetryClient = new TelemetryClient(TelemetryConfiguration.createDefault())

def logIncomingEventBatch(streamId: String, connectorName: String, batchSize: Long): Unit = {
val properties = new HashMap[String, String](2)
val properties = new util.HashMap[String, String](2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would using util.HashMap be better than using HashMap here?

Copy link
Contributor Author

@c-w c-w Aug 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to fix a Scala warning. More details here or here. Apparently it's a convention in the Scala community to make non-standard collections (e.g. java.util, scala.mutable) stand out more.

Mostly I just didn't want the bright yellow warning highlights on my screen since they're distracting from actual issues and these ones were very easy to fix :P

Copy link
Contributor

@jcjimenez jcjimenez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

class LocalLanguageDetector extends LanguageDetector {
@transient private lazy val languageProfiles = new LanguageProfileReader().readAllBuiltIn
@transient private lazy val languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard()).withProfiles(languageProfiles).build()
@transient private lazy val textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is large text always the best choice here? I see the other relevant-sounding option is forDetectingShortCleanText. While Tweets probably aren't clean, they're also not long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't only have Tweets but also news text, Facebook posts, comments, etc. so overall the default detector is probably fine. We can also run it through both detectors to potentially increase accuracy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b0fff7d.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we can based it off of text length af093c6

@@ -57,7 +57,7 @@ object Pipeline {
}

def addLanguage(event: ExtendedFortisEvent[T]): ExtendedFortisEvent[T] = {
val language = analyzer.detectLanguage(event.details, languageDetector)
val language = analyzer.detectLanguage(event.details, new LocalLanguageDetector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is on a per-event code path, it'll get constructed for every event, so the lazy vals in the impl won't be reused. Are they light-weight to construct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved outside in ffc18e8

metrics.put("batchSize", batchSize.toDouble)
metrics.put("duration", duration.toDouble)

client.trackEvent("batch.sink", properties, metrics)
}

def logLanguageDetection(language: Option[String]): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a general utility class for writing events into app insights?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the class that's being used to log events to AppInsights, so that's I reckon where additional event logging should be encapsulated.

parseResponse(response, textId)
}

protected def callCognitiveServices(requestBody: String): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we still need to call the cog svc language endpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the old class that got renamed. I can delete it if you want, but given that we're keeping the Kafka sink around too, might as well keep this option around.

@c-w c-w merged commit eaa5433 into master Aug 16, 2017
@c-w c-w removed the in progress label Aug 16, 2017
@c-w c-w deleted the inproc-language-detector branch August 16, 2017 15:45
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants