This repository has been archived by the owner. It is now read-only.

Replace Cognitive Services language detector #92

Merged
c-w merged 8 commits into master from inproc-language-detector on Aug 16, 2017

5 participants
@c-w (Member) commented Aug 16, 2017

As per yesterday's conversation: we want to reduce the number of Cognitive Services calls, so this replaces the Cognitive Services language detector with a local one.

@c-w c-w requested review from kevinhartman and erikschlegel Aug 16, 2017

@c-w c-w added the in progress label Aug 16, 2017


 class AppInsightsTelemetry extends FortisTelemetry {
   private val client: TelemetryClient = new TelemetryClient(TelemetryConfiguration.createDefault())

   def logIncomingEventBatch(streamId: String, connectorName: String, batchSize: Long): Unit = {
-    val properties = new HashMap[String, String](2)
+    val properties = new util.HashMap[String, String](2)

Smarker Aug 16, 2017

Contributor

Why would using util.HashMap be better than using HashMap here?

c-w Aug 16, 2017

Author Member

This is to fix a Scala warning. More details here or here. Apparently it's a convention in the Scala community to make non-default collections (e.g. java.util, scala.collection.mutable) stand out more.

Mostly I just didn't want the bright yellow warning highlights on my screen since they're distracting from actual issues and these ones were very easy to fix :P
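The convention mentioned above can be sketched as follows. This is a minimal illustration, not the PR's code: the object name `QualifiedJavaCollections`, the helper `buildProperties`, and the map keys are assumptions made up for the example. Only the `import java.util` plus `new util.HashMap[String, String](2)` pattern comes from the diff.

```scala
import java.util

// Hypothetical wrapper object, used only to make the example self-contained.
object QualifiedJavaCollections {
  // Importing the package (not the class) means call sites must write
  // util.HashMap, so the Java collection is visibly distinct from
  // Scala's own HashMap types.
  def buildProperties(streamId: String, connectorName: String): util.Map[String, String] = {
    val properties = new util.HashMap[String, String](2)
    properties.put("streamId", streamId)           // key name is an assumption
    properties.put("connectorName", connectorName) // key name is an assumption
    properties
  }
}
```

Importing the package rather than the class keeps the `util.` prefix at every use, which is what silences the warning while making the Java type stand out.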

@jcjimenez (Contributor) left a comment

LGTM

class LocalLanguageDetector extends LanguageDetector {
  @transient private lazy val languageProfiles = new LanguageProfileReader().readAllBuiltIn
  @transient private lazy val languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard()).withProfiles(languageProfiles).build()
  @transient private lazy val textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText()

kevinhartman Aug 16, 2017

Contributor

Is large text always the best choice here? I see the other relevant-sounding option is forDetectingShortCleanText. While Tweets probably aren't clean, they're also not long.

c-w Aug 16, 2017

Author Member

We don't only have Tweets but also news text, Facebook posts, comments, etc. so overall the default detector is probably fine. We can also run it through both detectors to potentially increase accuracy.

c-w Aug 16, 2017

Author Member

Done in b0fff7d.

c-w Aug 16, 2017

Author Member

Actually, we can base it on text length: af093c6
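One way to pick the factory by text length might look like the sketch below. It does not reproduce af093c6: the presets are stand-ins for the `forDetectingShortCleanText` and `forDetectingOnLargeText` factories named in the thread, and the object name, the `presetFor` helper, and the 200-character cut-off are all assumptions chosen for illustration.

```scala
// Hypothetical sketch; none of these names come from the PR.
object TextObjectFactoryChoice {
  // Stand-ins for the two factory presets discussed in the thread.
  sealed trait FactoryPreset
  case object ShortCleanText extends FactoryPreset
  case object LargeText extends FactoryPreset

  // Assumed cut-off: the thread does not state the value actually used.
  val ShortTextMaxLength = 200

  // Short inputs (e.g. Tweets) get the short-text preset; longer inputs
  // (e.g. news articles) get the large-text preset.
  def presetFor(text: String): FactoryPreset =
    if (text.length <= ShortTextMaxLength) ShortCleanText else LargeText
}
```

In real code each preset would map to the corresponding text object factory, created once and reused across events.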

@@ -57,7 +57,7 @@ object Pipeline {
   }

   def addLanguage(event: ExtendedFortisEvent[T]): ExtendedFortisEvent[T] = {
-    val language = analyzer.detectLanguage(event.details, languageDetector)
+    val language = analyzer.detectLanguage(event.details, new LocalLanguageDetector)

kevinhartman Aug 16, 2017

Contributor

Since this is on a per-event code path, it'll get constructed for every event, so the lazy vals in the impl won't be reused. Are they light-weight to construct?

c-w Aug 16, 2017

Author Member

Moved outside in ffc18e8
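The cost raised in this thread can be illustrated with a small stand-alone sketch. `StubDetector` is a hypothetical stand-in for `LocalLanguageDetector`, and the counter exists only to make the number of initializations observable. It assumes, as in the diff, that the expensive state lives in `@transient lazy val` members.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch; StubDetector is not the PR's detector.
object DetectorReuse {
  // Counts how many times the expensive state gets built.
  val profileLoads = new AtomicInteger(0)

  class StubDetector extends Serializable {
    // Built on first use, once per instance.
    @transient private lazy val profiles: Seq[String] = {
      profileLoads.incrementAndGet()
      Seq("en", "es", "fr")
    }

    // Placeholder logic: returns the first profile for any non-blank text.
    def detect(text: String): Option[String] =
      if (text.trim.isEmpty) None else profiles.headOption
  }

  // One instance per event: the lazy val is initialized for every event.
  def perEvent(events: Seq[String]): Seq[Option[String]] =
    events.map(event => new StubDetector().detect(event))

  // One shared instance: the lazy val is initialized once and reused.
  def shared(events: Seq[String]): Seq[Option[String]] = {
    val detector = new StubDetector()
    events.map(detector.detect)
  }
}
```

With three events, `perEvent` builds the profiles three times while `shared` builds them once, which is the motivation for hoisting the detector out of the per-event path.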

@c-w c-w force-pushed the inproc-language-detector branch from b0fff7d to e925f0d Aug 16, 2017

    metrics.put("batchSize", batchSize.toDouble)
    metrics.put("duration", duration.toDouble)

    client.trackEvent("batch.sink", properties, metrics)
  }

  def logLanguageDetection(language: Option[String]): Unit = {

erikschlegel Aug 16, 2017

Contributor

Isn't this a general utility class for writing events into app insights?

c-w Aug 16, 2017

Author Member

This is the class that's being used to log events to AppInsights, so I reckon that's where additional event logging should be encapsulated.

    parseResponse(response, textId)
  }

  protected def callCognitiveServices(requestBody: String): String = {

erikschlegel Aug 16, 2017

Contributor

Why would we still need to call the cog svc language endpoint?

c-w Aug 16, 2017

Author Member

This is the old class that got renamed. I can delete it if you want, but given that we're keeping the Kafka sink around too, might as well keep this option around.

@c-w c-w merged commit eaa5433 into master Aug 16, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@c-w c-w removed the in progress label Aug 16, 2017

@c-w c-w deleted the inproc-language-detector branch Aug 16, 2017
