This repository was archived by the owner on Mar 7, 2018. It is now read-only.

Contributor

@c-w c-w commented Aug 16, 2017

As per yesterday's conversation: we want to reduce the number of Cognitive Services calls, so this replaces the Cognitive Services language detector with a local one.


  def logIncomingEventBatch(streamId: String, connectorName: String, batchSize: Long): Unit = {
-   val properties = new HashMap[String, String](2)
+   val properties = new util.HashMap[String, String](2)
Contributor

Why would using util.HashMap be better than using HashMap here?

Contributor Author

@c-w c-w Aug 16, 2017

This is to fix a Scala warning. More details here or here. Apparently it's a convention in the Scala community to make non-standard collections (e.g. java.util, scala.collection.mutable) stand out more.

Mostly I just didn't want the bright yellow warning highlights on my screen since they're distracting from actual issues and these ones were very easy to fix :P
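
For readers unfamiliar with the convention mentioned above, here is a minimal sketch of what the qualified style looks like. The object name `TelemetryProperties`, the method `batchProperties`, and the property keys are assumptions made up for illustration; only `new util.HashMap[String, String](2)` comes from the diff.

```scala
import java.util

object TelemetryProperties {
  // Importing the package (not the class) forces call sites to write
  // util.HashMap, which makes it obvious that this is the Java collection
  // rather than a Scala one.
  // The keys "streamId" and "connectorName" are hypothetical.
  def batchProperties(streamId: String, connectorName: String): util.Map[String, String] = {
    val properties = new util.HashMap[String, String](2)
    properties.put("streamId", streamId)
    properties.put("connectorName", connectorName)
    properties
  }
}
```

The same idea applies to Scala's own mutable collections, where `import scala.collection.mutable` followed by `mutable.Map` is the usual spelling.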

Contributor

@jcjimenez jcjimenez left a comment

LGTM

class LocalLanguageDetector extends LanguageDetector {
@transient private lazy val languageProfiles = new LanguageProfileReader().readAllBuiltIn
@transient private lazy val languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard()).withProfiles(languageProfiles).build()
@transient private lazy val textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText()
Contributor

Is large text always the best choice here? I see the other relevant-sounding option is forDetectingShortCleanText. While Tweets probably aren't clean, they're also not long.

Contributor Author

We don't only have Tweets but also news text, Facebook posts, comments, etc., so overall the default detector is probably fine. We could also run the text through both detectors to potentially increase accuracy.

Contributor Author

Done in b0fff7d.

Contributor Author

Actually, we can base it on text length: done in af093c6.
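
A rough sketch of what choosing by text length could look like. This is not the content of af093c6: the `TextProfile` cases are stand-ins for the factories returned by `forDetectingShortCleanText` and `forDetectingOnLargeText`, and the 200-character cut-off is an assumption picked purely for illustration.

```scala
object TextFactorySelection {
  // Stand-ins for the two text object factories discussed in this thread.
  sealed trait TextProfile
  case object ShortCleanText extends TextProfile
  case object LargeText extends TextProfile

  // Hypothetical cut-off; the real value would need tuning on actual data.
  val ShortTextMaxLength = 200

  // Short inputs (e.g. Tweets) get the short-text profile,
  // longer inputs (e.g. news articles) get the large-text profile.
  def profileFor(text: String, maxShortLength: Int = ShortTextMaxLength): TextProfile =
    if (text.length <= maxShortLength) ShortCleanText else LargeText
}
```

Keeping the threshold as a parameter makes it easy to experiment with different cut-offs per source without touching the selection logic.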


  def addLanguage(event: ExtendedFortisEvent[T]): ExtendedFortisEvent[T] = {
-   val language = analyzer.detectLanguage(event.details, languageDetector)
+   val language = analyzer.detectLanguage(event.details, new LocalLanguageDetector)
Contributor

Since this is on a per-event code path, it'll get constructed for every event, so the lazy vals in the impl won't be reused. Are they light-weight to construct?

Contributor Author

Moved outside in ffc18e8
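
To illustrate the concern raised above: a `lazy val` is initialised once per instance, so constructing a new detector for every event repeats the expensive setup each time. The sketch below uses a hypothetical `StubDetector` with a counter in place of the real profile loading; it is not the code from ffc18e8.

```scala
import java.util.concurrent.atomic.AtomicInteger

object DetectorReuse {
  // Counts how many times the (simulated) expensive initialisation runs.
  val profileLoads = new AtomicInteger(0)

  // Hypothetical stand-in for LocalLanguageDetector.
  class StubDetector {
    // Evaluated at most once per instance, on first access.
    private lazy val profiles: Set[String] = {
      profileLoads.incrementAndGet()
      Set("en", "es", "fr")
    }

    // Toy logic: always touches the profiles, then "detects" English.
    def detect(text: String): Option[String] = {
      val known = profiles
      if (text.trim.isEmpty) None else known.find(_ == "en")
    }
  }

  // One detector per event: the lazy val is re-initialised every time.
  def perEvent(events: Seq[String]): Seq[Option[String]] =
    events.map(e => new StubDetector().detect(e))

  // One detector hoisted out of the per-event path: initialised once.
  def shared(events: Seq[String]): Seq[Option[String]] = {
    val detector = new StubDetector
    events.map(detector.detect)
  }
}
```

With three events, `perEvent` runs the initialisation three times while `shared` runs it once, which is why hoisting the instance out of the per-event function matters when profile loading is not cheap.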

@c-w c-w force-pushed the inproc-language-detector branch from b0fff7d to e925f0d Compare August 16, 2017 15:34
client.trackEvent("batch.sink", properties, metrics)
}

def logLanguageDetection(language: Option[String]): Unit = {
Contributor

Isn't this a general utility class for writing events into app insights?

Contributor Author

This is the class that's being used to log events to AppInsights, so I reckon that's where additional event logging should be encapsulated.

parseResponse(response, textId)
}

protected def callCognitiveServices(requestBody: String): String = {
Contributor

Why would we still need to call the cog svc language endpoint?

Contributor Author

This is the old class that got renamed. I can delete it if you want, but given that we're keeping the Kafka sink around too, might as well keep this option around.

@c-w c-w merged commit eaa5433 into master Aug 16, 2017
@c-w c-w removed the in progress label Aug 16, 2017
@c-w c-w deleted the inproc-language-detector branch August 16, 2017 15:45