feat: add translator #1108

serena-ruan · 2021-06-30T08:48:50Z

Add translator into mmlspark

serena-ruan · 2021-07-06T10:02:52Z

/azp run

azure-pipelines · 2021-07-06T10:02:56Z

No pipelines are associated with this pull request.

serena-ruan · 2021-07-06T10:06:46Z

/azp run

azure-pipelines · 2021-07-06T10:06:51Z

No pipelines are associated with this pull request.

serena-ruan · 2021-07-08T03:11:32Z

/azp run

azure-pipelines · 2021-07-08T03:11:43Z

Azure Pipelines successfully started running 1 pipeline(s).

codecov · 2021-07-08T03:17:44Z

Codecov Report

Merging #1108 (82f026e) into master (d287be6) will decrease coverage by 0.18%.
The diff coverage is 78.16%.

@@            Coverage Diff             @@
##           master    #1108      +/-   ##
==========================================
- Coverage   85.55%   85.37%   -0.19%     
==========================================
  Files         254      257       +3     
  Lines       11805    12053     +248     
  Branches      625      629       +4     
==========================================
+ Hits        10100    10290     +190     
- Misses       1705     1763      +58

Impacted Files	Coverage Δ
.../microsoft/ml/spark/cognitive/FormRecognizer.scala	`81.00% <ø> (ø)`
.../microsoft/ml/spark/cognitive/TextTranslator.scala	`76.24% <76.24%> (ø)`
.../microsoft/ml/spark/cognitive/ComputerVision.scala	`78.57% <76.92%> (ø)`
...rosoft/ml/spark/cognitive/DocumentTranslator.scala	`81.96% <81.96%> (ø)`
...crosoft/ml/spark/cognitive/TranslatorSchemas.scala	`100.00% <100.00%> (ø)`
...ala/org/apache/spark/ml/param/DataFrameParam.scala	`66.66% <0.00%> (-16.67%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d287be6...82f026e.

serena-ruan · 2021-07-08T03:27:32Z

/azp run

azure-pipelines · 2021-07-08T03:27:41Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan · 2021-07-08T06:57:48Z

/azp run

azure-pipelines · 2021-07-08T06:58:11Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan · 2021-07-08T07:42:55Z

/azp run

azure-pipelines · 2021-07-08T07:43:05Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan · 2021-07-08T07:56:18Z

/azp run

azure-pipelines · 2021-07-08T07:56:27Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan · 2021-07-08T09:25:16Z

/azp run

azure-pipelines · 2021-07-08T09:25:25Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan · 2021-07-09T02:57:52Z

/azp run

azure-pipelines · 2021-07-09T02:58:02Z

Azure Pipelines successfully started running 1 pipeline(s).

mhamilton723 · 2021-07-09T05:21:29Z

cognitive/src/main/scala/com/microsoft/ml/spark/cognitive/DocumentTranslator.scala

+          ))).toJson.compactPrint, ContentType.APPLICATION_JSON))
+  }
+
+  private def queryForResult(key: Option[String],


Are there any similarities that would allow us to abstract this and other async querying logic into the same function?
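
One way to read this suggestion is a single polling helper, parameterized on how to fetch a status and how to recognize a terminal state, that several async services could share. The sketch below is a minimal illustration in plain Scala. `pollUntilDone`, `fetchStatus`, `isDone`, `fakeService`, and the retry settings are hypothetical names; the actual `queryForResult` signature is only partially shown in this thread.

```scala
import scala.annotation.tailrec

object AsyncPollingSketch {

  /** Polls `fetchStatus` until `isDone` accepts the result or `maxTries` runs out.
    * All names are hypothetical and do not come from DocumentTranslator.scala.
    */
  def pollUntilDone[T](fetchStatus: () => T,
                       isDone: T => Boolean,
                       maxTries: Int,
                       delayMillis: Long): Option[T] = {
    @tailrec
    def loop(triesLeft: Int): Option[T] = {
      if (triesLeft <= 0) {
        None // gave up: no terminal state was observed
      } else {
        val status = fetchStatus()
        if (isDone(status)) {
          Some(status)
        } else {
          Thread.sleep(delayMillis)
          loop(triesLeft - 1)
        }
      }
    }
    loop(maxTries)
  }

  // Stand-in for a service: reports "Running" twice, then "Succeeded".
  def fakeService(): () => String = {
    val responses = Iterator("Running", "Running", "Succeeded")
    () => responses.next()
  }
}
```

Each service would then supply only its own status request and completion check, while retry count and delay stay in one place.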

mhamilton723 · 2021-07-09T05:23:26Z

cognitive/src/main/scala/com/microsoft/ml/spark/cognitive/TextTranslator.scala

+  with HasInternalJsonOutputParser with HasCognitiveServiceInput with HasSubscriptionRegion
+  with HasSetLocation {
+
+  protected val subscriptionRegionHeaderName = "Ocp-Apim-Subscription-Region"


This is so strange, I can't believe they make you specify this in a header lol

Yes, it's quite strange... the documentation says it's optional for a global Translator resource, but if we don't add it to the header the response is '{"error":{"code":401000,"message":"The request is not authorized because credentials are missing or invalid."}}'

Aren't cognitive services so great!
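
To make the header discussion concrete, here is a minimal sketch that builds (but does not send) a request carrying both the key and the region header. It uses the JDK 11+ `java.net.http` client rather than the HTTP stack mmlspark uses. The endpoint URL, the `to=de` query parameter, and the `buildTranslateRequest` helper are assumptions for illustration, not code from this PR.

```scala
import java.net.URI
import java.net.http.HttpRequest

object RegionHeaderSketch {
  val subscriptionKeyHeaderName = "Ocp-Apim-Subscription-Key"
  val subscriptionRegionHeaderName = "Ocp-Apim-Subscription-Region"

  // Assumed endpoint; substitute the one for your own Translator resource.
  val endpoint = "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=de"

  // Builds the request only; nothing is sent over the network here.
  def buildTranslateRequest(key: String, region: String, jsonBody: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(endpoint))
      .header(subscriptionKeyHeaderName, key)
      .header(subscriptionRegionHeaderName, region) // omitting this produced the 401000 error quoted above
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
      .build()
}
```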

mhamilton723 · 2021-07-09T05:24:08Z

cognitive/src/main/scala/com/microsoft/ml/spark/cognitive/TextTranslator.scala

+
+  def setToLanguageCol(v: String): this.type = setVectorParam(toLanguage, v)
+
+  val fromLanguage = new ServiceParam[String](this, "fromLanguage", "Specifies the language of the input" +


You can abstract the fromLanguage and toLanguage methods into separate traits and "mix them together" to reduce code. Also be on the lookout for other params that can be "factored" out in this manner, as it's less maintenance work for us later on ;)

Although there are several fromLanguage and toLanguage params in this file, their 'isRequired' setting and type might differ, so I didn't factor these two out of Translate. Or do you have another approach for mixing them together?

Okay! One idea would be to add protected fromLanguageRequired: Boolean = true on the base class and just override that
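
A minimal sketch of that idea in plain Scala, assuming a simplified `ParamSpec` in place of the real `ServiceParam`. The trait names, `ParamSpec`, and `OtherService` are hypothetical; only `fromLanguageRequired` and `Translate` come from the thread.

```scala
object LanguageParamSketch {

  // Simplified stand-in for ServiceParam: carries only a name and the required flag.
  final case class ParamSpec(name: String, isRequired: Boolean)

  trait HasFromLanguage {
    // Subclasses override this when their service treats the parameter differently.
    protected def fromLanguageRequired: Boolean = true
    lazy val fromLanguage: ParamSpec = ParamSpec("fromLanguage", fromLanguageRequired)
  }

  trait HasToLanguage {
    protected def toLanguageRequired: Boolean = true
    lazy val toLanguage: ParamSpec = ParamSpec("toLanguage", toLanguageRequired)
  }

  // Illustrative body only: here the source language is treated as optional.
  class Translate extends HasFromLanguage with HasToLanguage {
    override protected def fromLanguageRequired: Boolean = false
  }

  // Hypothetical second service that keeps both defaults.
  class OtherService extends HasFromLanguage with HasToLanguage
}
```

The flag is a `def` and the param a `lazy val` so that the override is already in effect when the param is first built; a plain `val` in the trait could be initialized before the subclass override takes hold.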

mhamilton723 · 2021-07-09T05:31:05Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+class TranslateSuite extends TransformerFuzzing[Translate]
+  with TranslatorKey with Flaky with TranslatorUtils {
+
+  lazy val translate: Translate = new Translate()


You can factor out the common setters into a base method, then just .set only the differing params. Isn't the fluent API nifty!
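
A minimal sketch of the base-method pattern, using a stand-in class so it runs without Spark. `FakeTranslate`, its generic `set`/`get`, and the setting names and values are all hypothetical; the real `Translate` exposes typed setters such as `setOutputCol`.

```scala
object FluentBaseSketch {

  // Minimal stand-in for a SparkML-style transformer with fluent, mutating setters.
  class FakeTranslate {
    private var settings = Map.empty[String, String]
    def set(name: String, value: String): this.type = { settings += (name -> value); this }
    def get(name: String): Option[String] = settings.get(name)
  }

  // Shared configuration lives in one place; `def` hands each caller a fresh instance.
  def baseTranslate: FakeTranslate = new FakeTranslate()
    .set("subscriptionKey", "<key>")
    .set("location", "eastus")
    .set("outputCol", "translation")

  // Each test then sets only what differs.
  def toGerman: FakeTranslate = baseTranslate.set("toLanguage", "de")
  def englishToFrench: FakeTranslate =
    baseTranslate.set("fromLanguage", "en").set("toLanguage", "fr")
}
```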

mhamilton723 · 2021-07-09T05:31:40Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+
+  lazy val textDf2: DataFrame = Seq(List("Hello, what is your name?", "Bye")).toDF("text")
+
+  lazy val textDf3: DataFrame = Seq(List("This is bullshit.")).toDF("text")


mhamilton723 · 2021-07-09T05:32:59Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+    .setOutputCol("translation")
+    .setConcurrency(5)
+
+  test("Translate multiple pieces of text with language autodetection") {


All of these tests have a similar structure; you might want to factor this structure out so that the tests are shorter to write. I know it's just test code and who cares, but it will make your life easier when you need to update, I promise

mhamilton723 · 2021-07-09T05:34:42Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+  }
+
+  test("Handle profanity") {
+    val results = translate


When re-using the same estimator, make translate a def, not a lazy val. I know it's weird, but the setters actually MODIFY state globally, which is a weird thing on SparkML's part

Done. That solves my pain on creating multiple translators lol!
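
The state leak described here can be reproduced without Spark. The sketch below uses a hypothetical `FakeTranslate` with one mutating setter; `sharedTranslate`, `freshTranslate`, and the `"Marked"` value are illustrative names and values, not taken from the test suite.

```scala
object DefVsLazyValSketch {

  // Stand-in for a transformer whose setter mutates the instance and returns it.
  class FakeTranslate {
    private var profanityAction: Option[String] = None
    def setProfanityAction(v: String): this.type = { profanityAction = Some(v); this }
    def getProfanityAction: Option[String] = profanityAction
  }

  // One shared instance: a setter call in one test is visible to every later test.
  lazy val sharedTranslate: FakeTranslate = new FakeTranslate()

  // A fresh instance on every access: setter calls cannot leak between tests.
  def freshTranslate: FakeTranslate = new FakeTranslate()
}
```

Calling `sharedTranslate.setProfanityAction("Marked")` in one test leaves the value set for whichever test touches `sharedTranslate` next, while the same call on `freshTranslate` affects only the instance returned by that one access.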

mhamilton723 · 2021-07-09T05:36:05Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+  import spark.implicits._
+
+  // TODO: Replace all of those SAS urls after 2022-07-07
+  lazy val sourceUrl: String = "https://mmlspark.blob.core.windows.net/datasets?sp=rl&st=2021-07-06T06" +


Are you able to get away without SAS URLs? The datasets blob is public, I believe. Also, you might want to pull the root URL into a val to keep the code DRY

Hmm, I tried, but it seems I can't access it directly using "https://mmlspark.blob.core.windows.net/datasets"; it returns:
"error": {
  "code": "InvalidRequest",
  "message": "Cannot access source document location with the current permissions.",
  "target": "Operation",
  "innerError": {
    "code": "InvalidDocumentAccessLevel",
    "message": "Cannot access source document location with the current permissions."
  }
},

On folders it might not work, but it should work on files. Is this manually inspecting a folder and loading content from it, or is it taking it URL by URL?

mhamilton723 · 2021-07-09T05:37:00Z

core/src/test/scala/com/microsoft/ml/spark/Secrets.scala

@@ -48,6 +48,8 @@ object Secrets {
  lazy val AnomalyApiKey: String = getSecret("anomaly-api-key")
  lazy val AzureSearchKey: String = getSecret("azure-search-key")
  lazy val BingSearchKey: String = getSecret("bing-search-key")
+  lazy val TranslatorKey: String = getSecret("translator-key")
+  lazy val TranslatorName: String = getSecret("translator-name")


What does the translator name do? Is it state or a secret?

It's the service name

mhamilton723 · 2021-07-09T05:37:30Z

Awesome stuff, so great to see this flying out of your fingertips!

serena-ruan · 2021-07-09T08:31:54Z

/azp run

azure-pipelines · 2021-07-09T08:32:04Z

Azure Pipelines successfully started running 1 pipeline(s).

mhamilton723 · 2021-07-09T17:29:05Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+
+  lazy val transDf: DataFrame = Seq(List("こんにちは", "さようなら")).toDF("text")
+
+  lazy val transliterate: Transliterate = new Transliterate()


Can you change lazy val -> def here and elsewhere? I know it's just used once, but if we add more tests it might be important for avoiding subtle errors

mhamilton723 · 2021-07-09T17:29:20Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+class DetectSuite extends TransformerFuzzing[Detect]
+  with TranslatorKey with Flaky with TranslatorUtils {
+
+  lazy val detect: Detect = new Detect()


lazy val -> def

mhamilton723 · 2021-07-09T17:35:14Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+  import spark.implicits._
+
+  // TODO: Replace root SAS urls after 2022-07-07
+  lazy val sourceRoot: String = "?sp=rl&st=2021-07-06T06:28:26Z&se=2022-07-07T06:28:00Z" +


You might want to consider factoring this in the following way:

lazy val containerSasToken = ... (if this needs to be a SAS; otherwise we should try to remove the SAS)
lazy val urlRoot = "https://mmlspark.blob.core.windows.net/"

and use the same container for all experiments. If you find the container SAS is used all over, we might consider breaking this into a single trait for all cog service tests so that we only need to update it in one location when it expires. For file-based APIs we probably don't need the SAS, but for container and folder listing operations the SAS is probably needed
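
A minimal sketch of the suggested trait in plain Scala. `CognitiveTestStorage`, `containerUrl`, `fileUrl`, and the placeholder token are hypothetical; only `containerSasToken` and `urlRoot` come from the comment above, and the split between SAS and non-SAS access reflects the reviewer's guess rather than verified behavior.

```scala
object SasUrlSketch {

  trait CognitiveTestStorage {
    // Placeholder only: a real token would come from a secret store, not source code.
    lazy val containerSasToken: String = "?sp=rl&sig=<redacted>"
    lazy val urlRoot: String = "https://mmlspark.blob.core.windows.net/"

    // Container and folder listings are assumed to need the SAS token.
    def containerUrl(container: String): String = s"$urlRoot$container$containerSasToken"

    // Individual files in a public container are assumed to be reachable without it.
    def fileUrl(container: String, path: String): String = s"$urlRoot$container/$path"
  }

  // Test suites would mix the trait in; this object exists only to exercise it here.
  object Storage extends CognitiveTestStorage
}
```

With this shape, an expired token means updating `containerSasToken` once instead of editing every URL in the suite.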

mhamilton723 · 2021-07-09T17:41:54Z

cognitive/src/test/scala/com/microsoft/ml/spark/cognitive/split1/TranslatorSuite.scala

+      TargetInput(None, None, targetFileUrl2, "de", None))))
+    .toDF("sourceUrl", "storageType", "targets")
+
+  lazy val documentTranslator: DocumentTranslator = new DocumentTranslator()


lazy val -> def and factor out shared structure

mhamilton723

Thanks for making these changes. I have a few more ideas for how to continue tidying. Awesome work, and I appreciate the iterations on this :)

serena-ruan · 2021-07-12T04:55:19Z

/azp run

azure-pipelines · 2021-07-12T04:55:28Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan · 2021-07-13T01:34:21Z

/azp run

azure-pipelines · 2021-07-13T01:34:31Z

Azure Pipelines successfully started running 1 pipeline(s).

serena-ruan added 4 commits June 30, 2021 16:47

feat: add text translation

8796ead

Merge branch 'master' into serena/addTranslator

772a8e5

refactor textTranslator to be cleaner

7549de6

add document translation

fc7034a

serena-ruan marked this pull request as ready for review July 6, 2021 10:06

serena-ruan requested a review from mhamilton723 as a code owner July 6, 2021 10:06

serena-ruan changed the title ~~feat: add text translation~~ feat: add translator Jul 8, 2021

Merge branch 'master' into serena/addTranslator

2345292

fix textAndTranslation param name

9f3d2cf

serena-ruan added 2 commits July 8, 2021 14:54

fix ServiceParam name consistency issue & update document translation…

9e5710a

… tests

format

ff45443

fix: from is keyword in python so rename the ServiceParam and set e…

15abde6

…xemption in fuzzingTest

update fuzzingTest

4a3278d

fix from as python keyword issue

f1accfd

Merge branch 'master' into serena/addTranslator

dc8a19f

mhamilton723 requested changes Jul 9, 2021

address comments

4e00db5

mhamilton723 reviewed Jul 9, 2021

mhamilton723 requested changes Jul 9, 2021

serena-ruan and others added 2 commits July 12, 2021 11:04

Merge branch 'master' into serena/addTranslator

e1b15fc

refactor translator tests

79b324b

Merge branch 'master' into serena/addTranslator

82f026e

mhamilton723 approved these changes Jul 13, 2021

serena-ruan merged commit 84d8d24 into microsoft:master Jul 13, 2021

serena-ruan deleted the serena/addTranslator branch July 13, 2021 03:42

