feat: Add more notebook samples for documentation #1043

Merged
merged 24 commits into from May 19, 2021

Conversation

serena-ruan
Contributor

No description provided.

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov

codecov bot commented May 6, 2021

Codecov Report

Merging #1043 (3e483c6) into master (12cea2d) will increase coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1043      +/-   ##
==========================================
+ Coverage   84.92%   84.93%   +0.01%     
==========================================
  Files         203      203              
  Lines        9689     9689              
  Branches      558      558              
==========================================
+ Hits         8228     8229       +1     
+ Misses       1461     1460       -1     
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...osoft/ml/spark/io/http/PartitionConsolidator.scala | 93.61% <0.00%> (-2.13%) | ⬇️ |
| ...microsoft/ml/spark/cognitive/SpeechToTextSDK.scala | 89.84% <0.00%> (-0.79%) | ⬇️ |
| ...a/com/microsoft/ml/spark/io/http/HTTPClients.scala | 83.33% <0.00%> (+6.66%) | ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12cea2d...3e483c6.

Collaborator

@mhamilton723 mhamilton723 left a comment

Great job! Mostly just little things left.

Two larger questions:

1. There are a lot of cache, count, and repartition calls in the VW code. Could you try removing some of them to see whether they are necessary? We want to avoid keeping many DataFrames cached, but if they are needed to avoid re-fitting the model, that is OK.

2. I will also send over Jack Gerrits's Vowpal Wabbit Contextual Bandit example code when it is available. (We don't have to block on this; it can be a separate PR.)

"- Anomaly status of latest point: generates a model using preceding points and determines whether the latest point is anomalous ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc3/scala/com/microsoft/ml/spark/cognitive/DetectLastAnomaly.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc3/pyspark/mmlspark.cognitive.html#module-mmlspark.cognitive.DetectLastAnomaly))\n",
"- Find anomalies: generates a model using an entire series and finds anomalies in the series ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc3/scala/com/microsoft/ml/spark/cognitive/DetectAnomalies.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc3/pyspark/mmlspark.cognitive.html#module-mmlspark.cognitive.DetectAnomalies))\n",
"\n",
"### Web Search\n",
Collaborator

Web Search -> Search

"\n",
"### Web Search\n",
"- [Bing Image search](https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/) ([Scala](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc3/scala/com/microsoft/ml/spark/cognitive/BingImageSearch.html), [Python](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc3/pyspark/mmlspark.cognitive.html#module-mmlspark.cognitive.BingImageSearch))\n",
"- [Azure Cognitive search](https://docs.microsoft.com/en-us/azure/search/search-what-is-azure-search)\n"
Collaborator

Could you add the corresponding Scala and Python docs links?

"metadata": {},
"outputs": [],
"source": [
"train_data.show(10)"
Collaborator

Can remove

"metadata": {},
"outputs": [],
"source": [
"train_data.groupBy(\"Bankrupt?\").count().show()"
Collaborator

show -> display
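
As a sketch of what this suggestion amounts to: `display` is a notebook built-in on Databricks, while `DataFrame.show()` works anywhere PySpark runs. The helper below is hypothetical (`render` and its `display_fn` parameter are not part of MMLSpark or PySpark). It uses whichever renderer is passed in and falls back to `show`, so treat it as an illustration under those assumptions rather than what the notebooks actually do.

```python
def render(df, display_fn=None, n=20):
    """Render a DataFrame-like object.

    Hypothetical helper for illustration only:
    - display_fn: a notebook renderer such as Databricks' built-in `display`
      (assumption: the caller passes it in explicitly).
    - n: number of rows for the plain-text fallback, DataFrame.show(n).
    """
    if display_fn is not None:
        # Rich, interactive table in notebooks that provide a renderer.
        display_fn(df)
    else:
        # Plain-text fallback that works wherever PySpark runs.
        df.show(n)
```

On Databricks the call would look like `render(train_data.groupBy("Bankrupt?").count(), display_fn=display)`; elsewhere, omitting `display_fn` keeps the original `show()` behaviour.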

"outputs": [],
"source": [
"from mmlspark.lightgbm import LightGBMClassificationModel\n",
"model.saveNativeModel(\"/lgbmcmodel\")\n",
Collaborator

Perhaps we can call this lgbmclassifier.model and add a brief markdown description explaining that this lets you extract the underlying LightGBM model for fast deployment after you train on Spark.

"dt1 = spark.read.format('libsvm') \\\n",
" .load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_rank_test.libsvm\") \\\n",
" .withColumn('iid', monotonically_increasing_id())\n",
"dt2 = spark.read.format('csv').option('inferSchema', True) \\\n",
Collaborator

likewise here

@@ -0,0 +1,659 @@
{
Collaborator

Nit: to keep with the style of the others, let's make the title "Vowpal Wabbit - Overview". Likewise for the other notebooks.

"metadata": {},
"outputs": [],
"source": [
"train_data.groupBy(\"target\").count().show()"
Collaborator

display

"data = spark.read.parquet(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet\")\n",
"data = data.select([\"education\", \"marital-status\", \"hours-per-week\", \"income\"])\n",
"train, test = data.randomSplit([0.75, 0.25], seed=123)\n",
"display(train.limit(10))"
Collaborator

no need for limit

"# Making predictions\n",
"test = test.withColumn(\"label\", when(col(\"income\").contains(\"<\"), 0.0).otherwise(1.0))\n",
"prediction = vw_trained.transform(test)\n",
"display(prediction.limit(10))"
Collaborator

no need for limit

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

4 participants