This repository has been archived by the owner. It is now read-only.

Replace Cognitive Services language detector #92

Merged
merged 8 commits into from Aug 16, 2017

Conversation

Contributor

@c-w c-w commented Aug 16, 2017

As per yesterday's conversation: we want to reduce the number of Cognitive Services calls, so this replaces the Cognitive Services language detector with a local one.

@c-w c-w requested review from kevinhartman and erikschlegel Aug 16, 2017

class AppInsightsTelemetry extends FortisTelemetry {
private val client: TelemetryClient = new TelemetryClient(TelemetryConfiguration.createDefault())

def logIncomingEventBatch(streamId: String, connectorName: String, batchSize: Long): Unit = {
- val properties = new HashMap[String, String](2)
+ val properties = new util.HashMap[String, String](2)
Contributor

@Smarker Smarker Aug 16, 2017

Why would using util.HashMap be better than using HashMap here?

Contributor Author

@c-w c-w Aug 16, 2017

This is to fix a Scala warning. More details here or here. Apparently it's a convention in the Scala community to make non-standard collections (e.g. java.util, scala.mutable) stand out more.

Mostly I just didn't want the bright yellow warning highlights on my screen since they're distracting from actual issues and these ones were very easy to fix :P
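
For illustration, the convention looks roughly like the sketch below: import the `java.util` package itself and qualify the class at the use site, so a reader can tell at a glance that the map is a Java collection rather than a Scala one. This is a minimal sketch, not the code from this PR; the `QualifiedCollectionExample` object, the `buildProperties` helper, and the property keys are hypothetical.

```scala
import java.util

object QualifiedCollectionExample {
  // Qualifying with `util.` makes it obvious that this is java.util.HashMap,
  // not one of the HashMap classes from the Scala collections library.
  // The keys below are illustrative assumptions.
  def buildProperties(streamId: String, connectorName: String): util.Map[String, String] = {
    val properties = new util.HashMap[String, String](2)
    properties.put("streamId", streamId)
    properties.put("connectorName", connectorName)
    properties
  }
}
```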

Contributor

@jcjimenez jcjimenez left a comment

LGTM

class LocalLanguageDetector extends LanguageDetector {
@transient private lazy val languageProfiles = new LanguageProfileReader().readAllBuiltIn
@transient private lazy val languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard()).withProfiles(languageProfiles).build()
@transient private lazy val textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText()
Contributor

@kevinhartman kevinhartman Aug 16, 2017

Is large text always the best choice here? I see the other relevant-sounding option is forDetectingShortCleanText. While Tweets probably aren't clean, they're also not long.

Contributor Author

@c-w c-w Aug 16, 2017

We have not only Tweets but also news text, Facebook posts, comments, etc., so overall the default detector is probably fine. We can also run the text through both detectors to potentially increase accuracy.

Contributor Author

@c-w c-w Aug 16, 2017

Done in b0fff7d.

Contributor Author

@c-w c-w Aug 16, 2017

Actually, we can base it off of text length: af093c6
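
Choosing by text length could be sketched as follows. This is an assumption-laden illustration, not the code in the commit: `TextProfile`, `ShortCleanText`, `LargeText`, and `profileFor` are hypothetical stand-ins for the two text-object factories discussed above, and the 200-character cut-off is made up.

```scala
object DetectorChoice {
  // Hypothetical stand-ins for the short-clean-text and large-text factories;
  // a real implementation would hold the library's factories instead.
  sealed trait TextProfile
  case object ShortCleanText extends TextProfile
  case object LargeText extends TextProfile

  // Assumed cut-off in characters; the value used in the PR is not shown here.
  val ShortTextThreshold = 200

  // Short inputs (e.g. Tweets) get the short-text profile; longer inputs
  // (e.g. news articles) get the large-text profile.
  def profileFor(text: String, threshold: Int = ShortTextThreshold): TextProfile =
    if (text.length <= threshold) ShortCleanText else LargeText
}
```

Keeping the threshold as a parameter makes it easy to tune once real traffic shows where the two profiles start to disagree.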

@@ -57,7 +57,7 @@ object Pipeline {
}

def addLanguage(event: ExtendedFortisEvent[T]): ExtendedFortisEvent[T] = {
- val language = analyzer.detectLanguage(event.details, languageDetector)
+ val language = analyzer.detectLanguage(event.details, new LocalLanguageDetector)
Contributor

@kevinhartman kevinhartman Aug 16, 2017

Since this is on a per-event code path, it'll get constructed for every event, so the lazy vals in the impl won't be reused. Are they light-weight to construct?

Contributor Author

@c-w c-w Aug 16, 2017

Moved outside in ffc18e8
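
The concern raised above can be reproduced with a small, self-contained sketch. `ExpensiveDetector` is a hypothetical stand-in (it does no real language detection), and the counter exists only to make the number of initializations observable. A `lazy val` is cached per instance, so constructing a new instance for every event repeats the expensive setup, while a single shared instance performs it once.

```scala
object LazyValReuse {
  // Counts how many times the "expensive" state is built.
  var initializations = 0

  // Hypothetical stand-in for a detector whose profiles are costly to load.
  class ExpensiveDetector {
    private lazy val profiles: Seq[String] = {
      initializations += 1
      Seq("en", "es", "fr")
    }

    // Not real detection: it only forces the lazy val and returns a placeholder.
    def detect(text: String): Option[String] =
      if (text.trim.isEmpty) None else profiles.headOption
  }

  // One instance per event: the lazy val is re-initialized for every event.
  def perEvent(events: Seq[String]): Seq[Option[String]] =
    events.map(e => new ExpensiveDetector().detect(e))

  // One shared instance: the lazy val is initialized once and then reused.
  def shared(events: Seq[String]): Seq[Option[String]] = {
    val detector = new ExpensiveDetector
    events.map(detector.detect)
  }
}
```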

@c-w c-w force-pushed the inproc-language-detector branch from b0fff7d to e925f0d Aug 16, 2017
metrics.put("batchSize", batchSize.toDouble)
metrics.put("duration", duration.toDouble)

client.trackEvent("batch.sink", properties, metrics)
}

def logLanguageDetection(language: Option[String]): Unit = {
Contributor

@erikschlegel erikschlegel Aug 16, 2017

Isn't this a general utility class for writing events into app insights?

Contributor Author

@c-w c-w Aug 16, 2017

This is the class that's being used to log events to AppInsights, so I reckon that's where additional event logging should be encapsulated.

parseResponse(response, textId)
}

protected def callCognitiveServices(requestBody: String): String = {
Contributor

@erikschlegel erikschlegel Aug 16, 2017

Why would we still need to call the cog svc language endpoint?

Contributor Author

@c-w c-w Aug 16, 2017

This is the old class that got renamed. I can delete it if you want, but given that we're keeping the Kafka sink around too, might as well keep this option around.

@c-w c-w merged commit eaa5433 into master Aug 16, 2017
2 checks passed
@c-w c-w removed the in progress label Aug 16, 2017
@c-w c-w deleted the inproc-language-detector branch Aug 16, 2017

5 participants