@mhamilton723 mhamilton723 released this Oct 2, 2020

v1.0.0-rc3

Bug Fixes 🐞

  • fix broken test link
  • Fix incorrect indexing for determining eval prob in CB (#922)
  • Update DBC path

Features 🌈

  • Add Env variable parametrized UserAgent header
  • Add support for ContextualBandit in the VW module (#896)
  • Update text analytics api to v3 (#916)

Maintenance 🔧

  • bump version to 1.0.0-rc3

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

@jackgerrits @rohit21agrawal

@mmlspark-bot mmlspark-bot released this Sep 17, 2020

Microsoft ML for Apache Spark v1.0.0-rc2

Highlights

  • Isolation Forest on Spark: Distributed Nonlinear Outlier Detection
  • CyberML: Machine Learning Tools for Cyber Security
  • Speech To Text: Custom Speech to Text with Streaming Support
  • Conditional KNN: Scalable KNN Models with Conditional Queries
  • LightGBM + SHAP: Interpret LightGBM Models using Additive Shapley Explanations

New Features

Isolation Forest on Spark ⛺️

  • Added LinkedIn's Isolation Forest outlier detection algorithm
  • Read the original work for more info

CyberML 🧙‍♂️

  • CyberML aims to provide open source tools for distributed cybersecurity workflows. This first release includes an algorithm that learns user-resource access patterns to detect anomalous access patterns. For more information see the docs

Cognitive Services for Big Data 🧠

  • Added SpeechToTextSDK transformer. This new transformer transcribes raw audio files and live audio streams into text. Transcription supports real-time audio streaming, automatic splitting into utterances, and profanity detection. It also supports several languages and Custom Speech Models.
  • added TextSentimentV3 transformer to leverage new Cognitive Services v3 API
  • add save and load methods to AccessAnomalyModel (#905)
  • stream robustness, output audio stream to file, and custom speech
  • Add m3u8 streaming for SpeechToTextSDK
  • enable mp3 file streaming in stt sdk (#822)

Conditional K-Nearest Neighbors 🏡🏡

  • Added ConditionalKNN estimator and model for efficient search of high dimensional KNNs with conditional predicates.
  • Added Conditional KNN demo here
  • Find hidden artistic connections with the Mosaic application.

HTTP on Spark 🌐

  • Added integration with Python Requests to accelerate Python Requests with HTTP on Spark!
  • Optimized HTTP on Spark asynchronous performance

Vowpal Wabbit on Spark 🐇

  • add barrier mode support for VW (#832)
  • add support for VW readable model, invert hash and re-using a previously trained VW Spark model (#821)
  • support generic numeric types for weights and labels (#817)

LightGBM on Spark 🌳

  • add featuresShapCol to LightGBMClassifierModel (#863)
  • Expose parameter bin_construct_sample_cnt in spark for LightGBM (#780)
  • add interface function for updating learning_rate per each iteration in LightGBMDelegate (#849)
  • add delegate to monitor training (#847)
  • Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker (#791)
  • Add option to add tolerance to improvement in metric evolution (#786)
  • added pred leaf index for LightGBMClassifier
  • Adding a new param for explicitly setting slot names. (#752)
  • added the top_k param for voting parallel (#762)
  • Adding a feature for positive and negative bagging fraction params. (#754)
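
The "tolerance to improvement in metric evolution" option (#786) concerns early stopping: a round only counts as an improvement if the metric moves by more than a threshold. The sketch below illustrates that general idea in plain Python. It is not the MMLSpark or LightGBM implementation, and the function and parameter names (should_stop, patience, tolerance) are hypothetical.

```python
def should_stop(scores, patience, tolerance=0.0, higher_is_better=True):
    """Return True when the last `patience` scores did not beat the
    earlier best score by more than `tolerance`.

    Illustrative only: names and exact semantics are assumptions,
    not the library's API.
    """
    if len(scores) <= patience:
        # Not enough history to compare recent rounds with earlier ones.
        return False
    # Flip the sign so that "larger is better" holds in both cases.
    sign = 1.0 if higher_is_better else -1.0
    best_before = max(sign * s for s in scores[:-patience])
    best_recent = max(sign * s for s in scores[-patience:])
    return best_recent - best_before <= tolerance


validation_auc = [0.70, 0.80, 0.8004, 0.8005]
print(should_stop(validation_auc, patience=2))                   # tiny gains still count
print(should_stop(validation_auc, patience=2, tolerance=0.001))  # tiny gains are ignored
```

With a tolerance of zero, any gain keeps training going; a positive tolerance ends training once gains become negligible.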

Learn More

  • MosAIc Finds Hidden Connections in World Art (Article, Demo, Webinar)
  • Watch the Spark Summit Europe Keynote on MMLSpark
  • Learn about AI for Good and MMLSpark on the MSR Podcast
  • New Docs for the Cognitive Services for Big Data
  • Read our New Paper on Conditional KNN Trees
  • Read our New Paper on Microservices in Databases

Bug Fixes 🐞

  • Updating regular Docker Images for helm chart. (#885)
  • improve error message for invalid slot names (#897)
  • categorical parameter regression on dense dataset caused by missing whitespace (#909)
  • fix cyberml test imports
  • add "s" to failing publicwasb download
  • spark.executor.cores' default value based on master when counting workers (#855)
  • fix flakiness in BiLSTM notebook
  • make file type case insensitive
  • Add support for URI parameters and default filetypes
  • remove save_resume/preserve_performance_counters options as it breaks SGD/BFGS chaining (#828)
  • fix optional parsing for the CustomOutputParser (#835)
  • Fix flakiness in io tests
  • Improve codegen readability and added getters and setters to generated models
  • move tests to a separate package and refactor common code
  • added multiclass init score support (#805)
  • LightGBMRanker should repartition by grouping column (#778)
  • Fix possible multithreading issue where two scoring calls arriving in parallel may not safely fill pointer values (#799)
  • Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (#792)
  • Fix subtle bug in reverse index creation
  • add cap on max allowed port in network init (#759)
  • added min_data_in_leaf parameter (#760)
  • Reorder ADB Status Checks to fix flakiness
  • increase library install timeout (#763)
  • Fix an issue with the sparkContext not being instantiated at eval time
  • Fix GH release badge display
  • Codegen dataframe param fixes

Build 🏭

  • bump version
  • Ignore existing installation when running installPipPackageTask (#895)
  • update ffmpeg on build server
  • make python test loop easier
  • updating lightgbm to 2.3.180 (#850)
  • split cog services on spark tests
  • Split e2e and publishing (#836)
  • Add Caching to build pipeline
  • added isolation forest test to build pipeline (#800)
  • exclude scala from fat jar

Code Style 🎶

  • Removing redundant file in the root directory: sp.txt (#796)
  • ball tree style fixes

Documentation 📘

  • Adding section to readme for installing with apache livy (#785)
  • Add fix for maven resolver
  • Added two classification examples using Vowpal Wabbit (#733)

Maintenance 🔧

  • add Roy to CODEOWNERS
  • fix flaky analyze image test
  • move build to new subscription (#888)
  • Update codeowners file to fix helm owners
  • remove flaky lightGBM test and add retries to Cog service tests
  • Update CODEOWNERS (#831)
  • Add time in httpv2 tests to reduce flakiness on build VMs
  • fixes to improve test flakiness
  • updated lightgbm to 2.3.150 (#757)
  • improve efficiency of lightgbm tests
  • Add more cluster status checks
  • fix flakiness in IdentifyFacesSuite
  • bump heap size in build
  • add default UA

Acknowledgements 🙌

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

  • Ilya Matiach @imatiach-msft
  • Markus Cozowicz @eisber
  • Lucy Zhang @zhang-lucy
  • Roy Levin @rolevin
  • Keunhyun Oh @ocworld
  • James Verbus
  • Christina Lee
  • Anand Raman
  • William T Freeman
  • Lei Zhang
  • Rohit Agrawal
  • Nisheet Jain
  • Chris Hoder
  • Chris Templeman
  • Chenhui Hu @chenhuims
  • Ryan Hurey
  • Jun Ki Min @loomlike
  • Dotan Patrich
  • Addy Santo
  • Anil Francis Thomas
  • Amrit Bhattacharya
  • Moshe Israel
  • Dalitso Banda
  • Joan Fontanals @JoanFM
  • Jack Gerrits @jackgerrits
  • Akshaya Annavajhala
  • Heiko Rahmel
  • Felix Tran @felixtran39
  • Stephanie Fu
  • Parker Levy
  • Casey Hillenburg
  • Vick Wowo
  • Brendan Walsh
  • Nick Gonsalves
  • Mindren Lu
  • Nurudín Álvarez
  • Guolin Ke
  • Chris Smith @chris-smith-zocdoc
  • David Lacalle Castillo @WaterKnight1998
  • Fokko Driesprong @Fokko
  • Diego Mazon
  • Tommy Li @tommyzli
  • Azure CAT
  • Vowpal Wabbit Team
  • LightGBM Team
  • MSFT Garage Team
  • MSR Outreach Team
  • Speech SDK Team

Changes:

  • 81e73a2 chore: add Roy to CODEOWNERS
  • b12be50 build: bump version
  • b431a61 fix: Updating regular Docker Images for helm chart. (#885)
  • 96f0b77 fix: improve error message for invalid slot names (#897)
  • 95c1f8a fix: categorical parameter regression on dense dataset caused by missing whitespace (#909)
  • 040ad34 feat: add save and load methods to AccessAnomalyModel (#905)
  • 8f8c504 fix: fix cyberml test imports
  • 9aed004 chore: fix flaky analyze image test
  • 826cfc2 fix: add "s" to failing publicwasb download
  • 22e19e5 feat: CyberML (#890)
  • 54a623d build: Ignore existing installation when running installPipPackageTask (#895)
  • f1b4a94 chore: move build to new subscription (#888)
  • f07e558 Merge pull request #882 from ocworld/fix-rename-clusterutils-numcores
  • e741993 build: update ffmpeg on build server
  • 9f9ae53 feat: stream robustness, output audio stream to file, and custom speech
  • 0319650 build: make python test loop easier:
  • 65a13bc chore: Update codeowners file to fix helm owwners
  • 7409ba5 Add num tasks override parameter for LightGBM learners (#881)
  • 64481e9 fix: spark.executor.cores' default value based on master when counting workers (#855)
  • 4ae0fe8 reduce network communication overhead cost on reduce step for LightGBM learners (#869)
  • b413749 fixed shap values shape for multiclass case and improved pyspark API (#870)
  • 840781a unify APIs across LightGBM learner types and add SHAP feature importances to regressor (#864)
  • 84b392c re-disable flaky test (#866)
  • d86a937 build: updating lightgbm to 2.3.180 (#850)
  • 6bb4a45 feat: add featuresShapCol to LightGBMClassifierModel (#863)
  • 82e7a8e Bump Apache Spark to 2.4.5
  • a0db5b3 build: split cog services on spark tests
  • 537b611 1) add functions for before/after batch training (#852)
  • ed435b8 feat: Add m3u8 streaming for `SpeechToTextSDK`
  • 4d99879 feat: add interface function for updating learning_rate per each iteration in LightGBMDelegate (#849)
  • be366c5 feat: add delegate to monitor training (#847)
  • c695d7a add option for driver listen port
  • 99795bc fic: Codegen dataframe param fixes
  • 37e336e feat: add barrier mode support for VW (#832)
  • 9c9a93b fix: fix flakiness in BiLSTM notebook
  • 5d9410a fix: make file type case insensitive
  • 55765f8 chore: remove flaky lightGBM test and add retries to Cog service tests
  • b1e3797 fix: Add support for URI parameters and default filetypes
  • 5ae664a improvement: support numeric types (not just double) for weight/label (#817)
  • 9f15b6c feat: add support for VW readable model, invert hash and re-using a previous… (#821)
  • 038b26b fix: remove save_resume/preserve_performance_counters options as it breaks SGD/BFGS chaining (#828)
  • 7dd4670 build: Split e2e and publishing (#836)
  • ca05d1b extended test case to validate duplicate passes parameter (#834)
  • 2ff6a36 fix: fix optional parsing for the CustomOutputParser (#835)
  • f9a56e8 chore: Update CODEOWNERS (#831)
  • c79dd12 chore: Add time in httpv2 tests to reduce flakiness on build VMs
  • c7eed5a build: Add Caching to build pipeline
  • c5b8b15 fix: Fix flakiness in io tests
  • 3abd9b4 chore:Split up io tests into 2 sections
  • 5489271 fix:remove error prone IO from notebook tests
  • b4a60e5 fix:remove error prone IO from notebook tests
  • 2455cbe chore: fixes to improve test flakiness
  • 6d7cfb5 fix: Improve codegen readability and added getters and setters to generated models
  • 015d4ea fix: move tests to a separate package and refactor common code
  • 6b2edc3 feat: enable mp3 file streaming in stt sdk (#822)
  • 8005c17 feat: Add `TextSentimentV3` Transformer (#812)
  • df0244c fix: added multiclass init score support (#805)
  • e745784 fix: LightGBMRanker should repartition by grouping column (#778)
  • f702921 feat: Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker (#791)
  • 875f89d build: added isolation forest test to build pipeline (#800)
  • 290f5cf fix: Possible multithreading issue when two scores may come in parallel they may not safely fill pointer values (#799)
  • fb3ac99 docs: Adding section to readme for installing with apache livy (#785)
  • 7b8efa5 fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (#792)
  • 4c812d7 style: Removing redundant file in the root directory: sp.txt (#796)
  • bd2f71e feat: Integration of LinkedIn's Isolation Forest (#781)
  • 9c61053 feat: Add option to add tolerance to improvement in metric evolution (#786)
  • dbb2818 feat: Expose parameter bin_construct_sample_cnt in spark for LightGBM (#780)
  • fde2d3c fix: Fix subtle bug in reverse index creation
  • 4b4af04 feat: add demo for `ConditionalKNN`
  • cf48d53 chore: remove keys from demo
  • 2618422 feat: Add `SpeechToTextSDK` Transformer
  • 4da1ff2 style: ball tree style fixes
  • 849527d feat: Add python bindings for `ConditionalBallTree`
  • d4d4ca8 feat: Add KNN and ConditionalKNN Estimators
  • 134ddb5 fix bug in serialization
  • a00c141 fix review points
  • 9cf33ce feat: added pred leaf index for LightGBMClassifier
  • 461d27d feat: added pred leaf index for LightGBMClassifier
  • 3a7a813 feat: added pred leaf index for LightGBMClassifier
  • f3d624d feat: Adding a new param for explicitly setting slot names. (#752)
  • 280cab7 Expose dump model method on MMLSpark-LightGBM so that models can be saved as json.
  • 3da5d4f fix: add cap on max allowed port in network init (#759)
  • 91652f2 fix: added min_data_in_leaf parameter (#760)
  • 6bb0429 chore: updated lightgbm to 2.3.150 (#757)
  • 344dbbd feat: added the top_k param for voting parallel (#762)
  • ae63497 chore: improve efficiency of lightgbm tests
  • d9568dc chore: Add more cluster status checks
  • a9b05b9 chore: fix flakiness in IdentifyFacesSuite
  • 988403f fix: Reorder ADB Status Checks to fix flakiness
  • e1dc2b3 fix: increase library install timeout (#763)
  • a47922f change labelGain description
  • 43b4e63 feat: Adding a feature for positive and negative bagging fraction params. (#754)
  • 087f290 docs: Add fix for maven resolver
  • 3da1d14 docs: Added two classification examples using Vowpal Wabbit (#733)
  • dece5ae chore: bump heap size in build
  • 8bb7d86 build: exclude scala from fat jar
  • 2465d4e fix: Fix an issue with the sparkContext not being instantiated at eval time
  • d091b37 chore: add default UA
  • 614a444 perf: remove async bottlenecks from HTTP on Spark
  • 3caf8f0 feat: Add wrappers for integrating with python Requests
  • 2fdfe3e added max_bin_by_feature, min_gain_to_split, max_delta_step parameters (#712)
  • 95b7ef0 Fix scalastyle
  • 5604602 Fix default case check. Add test cases for countCardinality
  • 491c01c change getTrainingCols from Option[DataType] -> Seq[DataType]
  • 25425a0 Use a case class instead of anonymous tuple
  • c58b216 Support the group column being a string
  • f22aa73 Fix: Fix GH release bade display

This list of changes was auto generated.

@mmlspark-bot mmlspark-bot released this Nov 19, 2019

v1.0.0-rc1

Features 🌈

  • Add brands and objects to AnalyzeImage transformer
  • Add label conversion for VW binary classifier (0/1 -> -1/1) (#700)
  • Add VowpalWabbit ngram support (#696)
  • Add automatic schema inference for writing to Azure Search (#704)
  • Add metric parameter to lightgbm learners (#672)
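
The label conversion in #700 maps 0/1 labels to the -1/1 convention noted in the entry above. As a minimal sketch of that mapping only (the helper name to_vw_binary_label is hypothetical and is not part of the MMLSpark API), the arithmetic is:

```python
def to_vw_binary_label(label):
    """Map a 0/1 class label to the -1/1 convention.

    Hypothetical helper for illustration; it is not an MMLSpark function.
    """
    if label not in (0, 1):
        raise ValueError(f"expected a 0/1 label, got {label!r}")
    # 0 -> -1 and 1 -> +1
    return 2 * label - 1


labels = [0, 1, 1, 0]
print([to_vw_binary_label(y) for y in labels])
```

Rejecting anything other than 0 or 1 is a choice made for this sketch; it keeps labels that are already in -1/1 form from being silently remapped.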

Bug Fixes 🐞

  • Vowpal Wabbit kwargs + improvements (#692)
  • Fix cast errors for label, weight, and init score columns
  • Fix probabilities and some win errors
  • Fix barrier execution mode with repartition for spark standalone (#651)
  • Mitigate flakiness in SpeechToText test

Build 🏭

  • Add ability to create fat jars (#702)
  • Make Databricks tests use instance pools to remove state (#673)

Code Refactoring 💎

  • Clean up distributed and continuous HTTP tests
  • Clean up LightGBM tests

Documentation 📘

  • Example notebook of VW vs LightGBM (#641)
  • Update Cognitive Service docs (#659)
  • Fix typo in Spark Serving docs (#656)
  • Add CentOS to VW on spark docs

Maintenance 🔧

  • Improve code-quality
  • Update lightgbm to 2.2.400
  • Move build to new Azure subscription (#661)

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

Changes:

  • 8d31c02 chore: Bump Version Number to 1.0.0-rc1
  • 2701aed fixed early stopping test for validation (#711)
  • 6b07829 docs: Example notebook of VW vs LightGBM (#641)
  • 163dead fix:fix num cores per executor if config not specified (#709)
  • bc0e010 chore: ignore flaky test for now
  • ea7d899 feat: Add brands and objects to analyze image transformer
  • 04a2fbd feat: added label conversion for VW binary classifier (0/1 -> -1/1) (#700)
  • da124d7 feat: Add VowpalWabbit ngram support (#696)
  • a44dafd fix validation data and ranker preprocessing
  • 4037869 feat: Add automatic schema inference for writing to Azure Search (#704)
  • 77bb678 update lightgbm to 2.3.100, remove generateMissingLabels, fix lightgbm getting stuck on unbalanced data
  • 2e45613 build: Add ability to create fat jars (#702)
  • 035fcd9 cleanup duplication in unit tests (#695)
  • 932ec86 adding debug for client mode issue and future investigations
  • 95061d0 fix: Vowpal Wabbit kwargs + improvements (#692)
  • 3ea5bc5 fix: cast errors for label, weight and init score columns
  • f2bf39f fix categorical handling on lightgbm learners
  • 671b688 re-enabling windows tests for lightgbm
  • 8361ead add eval_at parameter to lightgbm ranker
  • c0921fb Better error message when the group column is not a Int/Long
  • 05a2bef fix: update lightgbm to 2.2.400, fix probabilities and some win errors
  • 16ea090 chore: imporve code-quality
  • ef14350 build: databricks tests use instance pools to remove state (#673)
  • 8b27d88 feat: add metric parameter to lightgbm learners (#672)
  • 9805996 fix: fix barrier execution mode with repartition for spark standalone (#651)
  • 1e186ad chore: move to new subscription (#661)
  • 360f2f7 refactor: clean up distributed HTTP tests
  • 5eedc93 fix: mitigate flakiness in speechToText test
  • 0290386 refactor: clean up continuous http tests
  • 8ed3aeb refactor: clean up LightGBM tests
  • f99c9f4 docs: Update Cog Service docs (#659)
  • df089cd docs: fix typo in spark serving docs (#656)
  • b369244 docs: add vw to related software
  • 876553a docs: add links to readme
  • 8136022 docs: change paper badge color
  • f974a6a docs: improve README
  • 8190eb5 Add links to API documentation
  • 241a486 docs: add centOS to vw on spark docs

This list of changes was auto generated.

@mmlspark-bot mmlspark-bot released this Aug 20, 2019

v0.18.1

Bug Fixes 🐞

  • fix lightgbm stuck in multiclass scenario and added stratified repartition transformer (#618)
  • fix schema issue with databricks e2e tests (#653)
  • update VW dependency to 8.7.0.2 built on CentOS and optimized for portability (#652)

Build 🏭

  • add proper secrets to publishing step (#650)

Documentation 📘

  • Remove script action section

Maintenance 🔧

  • bump version number

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

Ilya Matiach, Markus Cozowicz

Changes:

  • 62946d1 chore: bump version number
  • d518b8a fix: fix lightgbm stuck in multiclass scenario and added stratified repartition transformer (#618)
  • 85fb3fc fix: fix schema issue with databricks e2e tests (#653)
  • 258cafb fix: update VW dependency to 8.7.0.2 built on CentOS and optimized for portability (#652)
  • 376cc6a build: add proper secrets to publishing step (#650)
  • 0be08e9 docs: Remove script action section

This list of changes was auto generated.

@mmlspark-bot mmlspark-bot released this Aug 20, 2019


Microsoft ML for Apache Spark v0.18.0

Highlights

  • Vowpal Wabbit on Spark: Fast, Sparse, and Scalable Text Analytics
  • Quality and Build Refactor: New Azure Pipelines build with Code Coverage, CICD, and an organized package structure.
  • LightGBM Ranking and More: Barrier Execution mode, performance improvements, increased parameter coverage
  • Anomaly Detection and Speech To Text: New cognitive services on Spark

New Features

Vowpal Wabbit on Spark: Fast and Sparse Text Analytics

LightGBM on Spark

  • Now supports barrier execution mode
  • Added the LightGBMRanker
  • Added is_provide_training_metric to LightGBMRanker.
  • Enabled continued training with init score column
  • Added batch training support
  • Reduced memory usage
  • Fixed issues with frozen jobs
  • Fixes for multiclass classification
  • Fixed issue where multiclass classification hangs due to partitions without all classes

HTTP on Spark

  • Added AnomalyDetector and SimpleAnomalyDetector APIs
  • Added SpeechToText transformer
  • Improved service concurrency
  • Added robustness to socket timeouts

Miscellaneous

  • Codegen support for wrapping Ranker classes
  • Notebooks now leverage public blob for faster execution
  • Fixed summarize data column handling
  • Better compute model statistics error messages
  • Upgraded to Spark 2.4.3
  • Added Spark on Kubernetes Helm Charts
  • Added StratifiedRepartition transformer for ensuring partitions contain all classes
  • Fixed issue where ImageFeaturizer could not be executed on Databricks 2.4.3
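
The notes above mention that multiclass training could hang when a partition was missing some classes, and that StratifiedRepartition ensures partitions contain all classes. The sketch below shows the underlying idea in plain Python, without Spark: rows are dealt round-robin, one class at a time, across the partitions. It is an illustration under stated assumptions, not the StratifiedRepartition implementation; the function name and arguments are hypothetical, and the guarantee only holds when every class has at least as many rows as there are partitions.

```python
from collections import defaultdict


def stratified_partitions(rows, label_of, num_partitions):
    """Distribute rows so that each class is spread over all partitions.

    Hypothetical helper for illustration. Every partition receives every
    class only if each class has at least `num_partitions` rows.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    partitions = [[] for _ in range(num_partitions)]
    for members in by_label.values():
        # Deal the rows of one class round-robin across the partitions.
        for i, row in enumerate(members):
            partitions[i % num_partitions].append(row)
    return partitions


rows = [(i, i % 3) for i in range(12)]  # (id, label) pairs with labels 0, 1, 2
for part in stratified_partitions(rows, lambda r: r[1], num_partitions=2):
    print(sorted({label for _, label in part}))
```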

Build, Quality, and Infrastructure Refactor

Azure Pipelines Integration

  • Tests parallelized on Azure Pipelines. Builds now take ~25min vs ~90min!
  • Serverless Builds: Queue as many builds as needed with no machine maintenance costs
  • Test results, error messages, and time are viewable from the GitHub PR section
  • Individual Tests can be re-queued from the GitHub PR Page
  • Builds can be queued using the pull request comment: /azp run.
    • Full details can be seen by typing /azp help
  • CI pipeline entirely specified in a small .yaml file in the git repo

Local Developer Support

  • Dramatically simpler developer setup (all through SBT)
  • Local developer setup now works on any platform including Windows!
  • Local setup no longer needs VM, Vagrant, or 30 min to import the library
  • All build stages are SBT tasks and can be done locally for rapid testing
    • This includes publishing maven packages to local repositories and the MMLSpark maven repo
  • All secrets now managed by centralized Azure Key Vault
  • IntelliJ will pick up on all scalastyle rules for editor-level style feedback while typing

Code Quality Gates

  • Code Coverage now supported for every PR and reported in the comments and badge
    • Coverage is now a check-in gate and must never decrease
  • Test coverage increased and dead code removed from the library
  • Custom and auto-generated Python tests now supported
  • CODEOWNERS file for better code reviews and maintenance
  • Codacy integration for automated PR reviews

Streamlined Library Structure

  • MMLSpark now supports a true Scala/Java idiomatic package hierarchy
  • Namespace hierarchy also reflected in PySpark code
  • Note: This will require changes to existing MMLSpark programs. For support in migrating, please contact mmlspark-support@microsoft.com

Maintainability and Community Management

  • Issue and PR templates
  • Gitter channel
  • Welcome bot to greet new contributors
  • Semantic Commits for autogenerating release notes
  • Badges to display current and master versions in the README

Migration Support:

  • For those who already have MMLSpark developer setups, please read the new developer guide to reconfigure.
  • For those who have standing PRs that need rebasing assistance, please reach out to mmlspark-support@microsoft.com
  • Please report any bugs or feedback!

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

  • Ilya Matiach, Markus Cozowicz, Scott Graham, Daniel Ciborowski, Christina Lee, Dalitso Banda, Shaochen Shi, Sudarshan Raghunathan, Anand Raman, Eli Barzilay, Nick Gonsalves, Tao Wu, Jeremy Reynolds, Miguel Fierro, Robert Alexander, AI CAT Team, Azure Search Team

Contributions, Collaborations, and Feedback Welcome!

Changes:

  • 3bb48b8 chore: bump version number
  • b0797b3 docs: Improve cog services on spark docs
  • 8e966b3 docs: Docs for Cognitive Services (#647)
  • eb0a421 docs: Improve VW on Spark Docs
  • 54dbcad docs: add VowpalWabbit documentation
  • fb5b79f docs: fix vw on spark description
  • c0d5786 docs: update readme badges and icons
  • 071b6b0 docs: Add gitter badge
  • 5c34356 docs: Add VW on Spark to table
  • 1bdcdbf chore: ignore .github folder for CI
  • 01d498c build: add sonatype publishing
  • 8fab72d build: make e2e cancellable
  • ddc7a4f build: remove broken codecov flags (will reinstate when codecov fixes their service_
  • 188cbdb chore: Update issue templates
  • f67b16a chore: fix welcome bot indenting
  • eeb7eba fix: Fix logistic regression error when passing "--link logistic" (#644)
  • b6a4f93 fix: fix socket timeout error (#640)
  • 856db6d build: add mcr publishing
  • c6e44f9 fix: fix issue with socket timeout in advanced handler
  • 2425b7a fix: update detect anomaly suite to make anomaly more pronounced
  • 07c7fec style: run markdown through markdown linter
  • a0e85f5 build: increase setup timeouts
  • 5c190f8 style: Fix style issues
  • 4bf6f71 build: Add build cancel timeouts
  • 915d683 build: add release job to Azure Pipelines
  • e48f9cb build: Add github version badges
  • 73581cb build: fix flaky codecov upload
  • ce1e66d build: fix e2e notebook cluster check
  • 19aeb80 build: Add behavior bot
  • 72ccae2 build: Make task retry part of bash script
  • 16dd7f4 Update formatting
  • 3fe4db5 adding vagrant doc and fixing indentation in vagrantfile
  • d58d6f4 Vowpal Wabbit on Spark
  • 95dc734 adding vagrant file back in, updated for sbt (#622)
  • 605c98f Add flaky test retry
  • 4ebbb41 remove brittle dataset downloading from demos
  • e572a9a try to Fix codecov upload
  • fac542e Add codecov to python tests
  • b6ba62f Add test publishing tobuild
  • 5cada6f Increase coverage and remove dead code
  • ae191a6 Fix build summary
  • e18ec2e leverage codecov.io's coverage capabilities
  • 8e76263 Improve noisy neighbor problems for e2e tests
  • 6ab8916 add codecov file
  • 70881b2 improve test coverage
  • 41da2b7 improve flakiness
  • aa3c98f improve coverage
  • 237d388 Add Code Coverage badge
  • 7146b9b Add unit test timeout
  • fa87e42 Fix noisy neighbor search index tests
  • 0f98f7d add codeowners file
  • 4321809 add codeowners file
  • 80aecab Add upload to codecov.io
  • 66db39f Split LGBM tests for speed
  • a6998ec Update README.md
  • 027e6d7 Remove unused code
  • 0205b7e Squash with partition fix
  • dc1554f Add r package upload
  • 2fbd81c Fix pipeline retry
  • 0fde594 attempt to fix partition consolidator flakiness
  • 7940967 Add codecov
  • 7e8225f fix retry logic
  • d8c0eb4 Increase timeout for e2e notebook tests
  • ff059a3 Add ability to retry pipeline
  • 8cf91ca Simplify build pipeline
  • 5c8c903 Delete runme
  • 210b522 Update CNTK code in README
  • da6e497 Update pipeline.yaml for Azure Pipelines
  • e946318 Add build status bar
  • 37d36af Enable PR builds
  • 6c56326 transition to new build system
  • fb3e99e Update dockerfile
  • 637df9d Update documentation for new build
  • e9ef538 Improve test robustness
  • d34f9d1 Remove unused build scripts
  • 4034a4f Add doc publishing to build
  • 36d8c3b Fixup after rebase
  • 7c5e7b6 Get e2e tests working
  • 07316a8 Fix serialization fuzzing error
  • f6df907 Make recomendation tests faster
  • dd99937 Add python tests
  • 02a8ac6 Add publish task
  • 3a526c8 Fix Test Errors and Improve Reliability
  • 4a696c5 Parallelize Tests
  • 2b75b62 Make build windows compatible
  • 94e9b21 Add developer-readme.md
  • 5659287 Fix python testing
  • 987c7c4 Get python codegen to work
  • 90089fa Add scalastyle and unidoc
  • 79d4110 Add secrets
  • 5742c0e Refactor build
  • 77d7cb4 Move library into a single package
  • 29c15cb add barrier execution mode
  • aac0536 fix default value for double array param in codegen
  • 2bd2faf fix wrapper generator for ranker models
  • 6885ef5 added lightgbm ranker model pyspark api
  • 08b3085 fix summarize data columns
  • 044d0b5 reduce memory usage, fix frozen jobs, add more debug logging
  • 45c91f9 defer lightgbm probability calculation to native core to fix multiclass bug in some scenarios (#578)
  • 4473520 squish runs together
  • 00ebf64 use right python version
  • 216abea updated readme. more mini images
  • 3232d84 Fix flakey test
  • e9a612b Fix Entity Detector Suite
  • ba3dbd0 Improve service concurrency
  • 75819a5 Add simple Anamoly Detector
  • 17a765e Add is_provide_training_metric to LightGBMRanker.
  • ceb5291 Print metrics of validation data as well.
  • b54363c Implement is_provide_training_metric in Scala codes through JNI.
  • c7e31e6 fix query column to support long type
  • 6a6d57f Poke Build System
  • 11fe799 Fixing Cog Service Test
  • 6eba0b6 ignore flaky test
  • 53c4b9e adding LightGBMRanker
  • fa77857 add init score column for continued training
  • 32ac353 Add anomaly detection and speech to text services
  • 06273b2 improved compute model statistics error message
  • e7a309c pass through slot names to native structure
  • b295dae add batch training support in lightgbm classifier and regressor

This list of changes was auto generated.

@mhamilton723 mhamilton723 released this Apr 23, 2019

Highlights

  • LightGBM evaluation 3-4x faster!
  • Spark Serving v2
  • LightGBM training supports early stopping and regularization
  • LIME on Spark significantly faster

New Features

Spark Serving v2:

  • Both Microbatch and Continuous mode have sub-millisecond latency
  • Supports fault tolerance
  • Can reply from anywhere in the pipeline
  • Fail fast modes for warning callers of bad JSON parsing
  • Fully based on DataSource API v2

LightGBM:

  • 3-4x evaluation performance improvement
  • Add early stopping capabilities
  • Added L1 and L2 Regularization parameters
  • Made network init more robust
  • Fixed bug caused by empty partitions

LIME on Spark:

  • LIME Parallelization significantly faster for large datasets
  • Tabular LIME now supported

Other:

  • Added UnicodeNormalizer for working with complex text
  • Recognize Text exposes parameters for its polling handlers

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

  • Ilya Matiach, Markus Cozowicz, Scott Graham, Daniel Ciborowski, Jeremy Reynolds, Miguel Fierro, Robert Alexander, Tao Wu, Sudarshan Raghunathan, Anand Raman, Casey Hong, Karthik Rajendran, Dalitso Banda, Manon Knoertzer, Lars Ahlfors, The Microsoft AI Development Acceleration Program, Cognitive Search Team, Azure Search Team

@mhamilton723 mhamilton723 released this Mar 6, 2019

New Features

New Examples

Updates and Improvements

General

  • MMLSpark Image Schema now unified with Spark Core
  • Bugfixes for Text Analytics services
  • PageSplitter now propagates nulls
  • HTTP on Spark now supports socket and read timeouts
  • HyperparamBuilder python wrappers now return idiomatic python objects

LightGBM on Spark

  • Added multiclass classification
  • Added multiple types of boosting (Gradient Boosting Decision Tree, Random Forest, Dropout meet Multiple Additive Regression Trees, Gradient-based One-Side Sampling)
  • Added Windows OS support/bugfix
  • LightGBM version bumped to 2.2.200
  • Added native support for categorical columns, either through Spark's StringIndexer, MMLSpark's ValueIndexer or list of indexes/slot names parameter
  • isUnbalance parameter for unbalanced datasets
  • Added boost from average parameter

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external who helped create this version of MMLSpark.

  • Ilya Matiach, Casey Hong, Daniel Ciborowski, Karthik Rajendran, Dalitso Banda, Manon Knoertzer, Sudarshan Raghunathan, Anand Raman, Markus Cozowicz, The Microsoft AI Development Acceleration Program, Cognitive Search Team, Azure Search Team

@mhamilton723 mhamilton723 released this Nov 30, 2018

New Features

  • Add the TagImage and DescribeImage services
  • Add Ranking Cross Validator and Evaluator

New Examples

Updates and Improvements

LightGBM

  • Fix issue with raw2probabilityInPlace
  • Add weight column
  • Add getModel API to TrainClassifier and TrainRegressor
  • Improve robustness of getting executor cores

HTTP on Spark and Spark Serving

  • Improve robustness of Gateway creation and management
  • Improve Gateway documentation

Version Bumps

  • Updated to Spark 2.4.0
  • LightGBM version update to 2.1.250

Misc

  • Fix Flaky Tests
  • Remove autogeneration of scalastyle
  • Increase training dataset size in snow leopard example

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

  • Ilya Matiach, Casey Hong, Karthik Rajendran, Daniel Ciborowski, Sebastien Thomas, Eli Barzilay, Sudarshan Raghunathan, @flybywind, @wentongxin, @haal

@mhamilton723 mhamilton723 released this Oct 23, 2018

New Features

  • The Cognitive Services on Spark: A simple and scalable integration between the Microsoft Cognitive Services and SparkML
    • Bing Image Search
    • Computer Vision: OCR, Recognize Text, Recognize Domain Specific Content,
      Analyze Image, Generate Thumbnails
    • Text Analytics: Language Detector, Entity Detector, Key Phrase Extractor,
      Sentiment Detector, Named Entity Recognition
    • Face: Detect, Find Similar, Identify, Group, Verify
  • Added distributed model interpretability with LIME on Spark
  • 100x lower latencies (<1ms) with Spark Serving
  • Expanded Spark Serving to cover the full HTTP protocol
  • Added the SuperpixelTransformer for segmenting images
  • Added a Fluent API, mlTransform and mlFit, for composing pipelines more elegantly
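
The fluent style named above lets a pipeline read as a chain of calls on the data itself. The following is a minimal pure-Python sketch of that pattern under stated assumptions: Frame, Shift, and MeanCenterer are hypothetical stand-ins, and the method names ml_transform and ml_fit only mirror mlTransform and mlFit. It does not use or reproduce the MMLSpark API.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame: an ordered list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def ml_transform(self, *transformers):
        """Apply each transformer in order and return the final Frame."""
        frame = self
        for transformer in transformers:
            frame = transformer.transform(frame)
        return frame

    def ml_fit(self, estimator):
        """Fit an estimator on this Frame and return the fitted model."""
        return estimator.fit(self)


class Shift:
    """Transformer that adds a constant to one column."""

    def __init__(self, col, amount):
        self.col = col
        self.amount = amount

    def transform(self, frame):
        # Build new rows so the input Frame is left untouched.
        return Frame({**row, self.col: row[self.col] + self.amount} for row in frame.rows)


class MeanCenterer:
    """Estimator whose fitted model subtracts the column mean."""

    def __init__(self, col):
        self.col = col

    def fit(self, frame):
        mean = sum(row[self.col] for row in frame.rows) / len(frame.rows)
        return Shift(self.col, -mean)


frame = Frame([{"x": 1.0}, {"x": 3.0}])
shifted = frame.ml_transform(Shift("x", 1.0), Shift("x", 1.0))
model = shifted.ml_fit(MeanCenterer("x"))
centered = shifted.ml_transform(model)
```

Because every call returns a new Frame or a fitted model, steps compose left to right without an explicit pipeline object.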

New Examples

  • Chain together cognitive services to understand the feelings of your favorite celebrities with CognitiveServices - Celebrity Quote Analysis.ipynb
  • Explore how you can use Bing Image Search and Distributed Model Interpretability to get an Object Detection system without labeling any data in ModelInterpretation - Snow Leopard Detection.ipynb
  • See how to deploy any Spark computation as a web service on any Spark platform with the SparkServing - Deploying a Classifier.ipynb notebook

Updates and Improvements

LightGBM

  • More APIs for loading LightGBM Native Models
  • LightGBM training checkpointing and continuation
  • Added tweedie variance power to LightGBM
  • Added early stopping to LightGBM
  • Added feature importances to LightGBM
  • Added a PMML exporter for LightGBM on Spark
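
Early stopping, listed above, generally means halting training once a validation metric has stopped improving for a set number of rounds. The following is a generic pure-Python sketch of that rule, assuming a lower-is-better loss; the function name and the patience argument are hypothetical and are not MMLSpark or LightGBM parameters.

```python
def best_iteration_with_early_stopping(losses, patience):
    """Return (best_iteration, last_iteration_run) for a lower-is-better loss.

    Training stops at the first iteration that is `patience` rounds past the
    best iteration seen so far without improving on it.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    # Ran out of iterations before the stopping rule fired.
    return best_iteration, len(losses) - 1


validation_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
best, stopped_at = best_iteration_with_early_stopping(validation_losses, patience=3)
```

With a patience of 3, this run stops before reaching the final, lower loss, which shows the trade-off: a small patience saves training time but can miss late improvements.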

HTTP on Spark

  • Added the VectorizableParam for creating column parameterizable inputs
  • Added handler parameter to HTTP services
  • HTTP on Spark now propagates nulls robustly

Version Bumps

  • Updated to Spark 2.3.1
  • LightGBM version update to 2.1.250

Misc

  • Added Vagrantfile for easy Windows developer setup
  • Improved Image Reader fault tolerance
  • Reorganized Examples into Topics
  • Generalized Image Featurizer and other Image based code to handle Binary Files as well as Spark Images
  • Added ModelDownloader R wrapper
  • Added getBestModel and getBestModelInfo to TuneHyperparameters
  • Expanded Binary File Reading APIs
  • Added Explode and Lambda transformers
  • Added SparkBindings trait for automating Spark binding creation
  • Added retries and timeouts to ModelDownloader
  • Added ResizeImageTransformer to remove ImageFeaturizer dependence on OpenCV

Acknowledgements

We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark. (In alphabetical order)

  • Abhiram Eswaran, Anand Raman, Ari Green, Arvind Krishnaa Jagannathan, Ben Brodsky, Casey Hong, Courtney Cochrane, Henrik Frystyk Nielsen, Ilya Matiach, Janhavi Suresh Mahajan, Jaya Susan Mathew, Karthik Rajendran, Mario Inchiosa, Minsoo Thigpen, Soundar Srinivasan, Sudarshan Raghunathan, @terrytangyuan

@mhamilton723 mhamilton723 released this Jun 28, 2018

New Functionality:

  • Export trained LightGBM models for evaluation outside of Spark

  • LightGBM on Spark supports multiple cores per executor

  • CNTKModel works with multi-input multi-output models of any CNTK
    datatype

  • Added Minibatching and Flattening transformers for adding flexible
    batching logic to pipelines, deep networks, and web clients.

  • Added Benchmark test API for tracking model performance across
    versions

  • Added PartitionConsolidator function for aggregating streaming data
    onto one partition per executor (for use with connection/rate-limited
    HTTP services)
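
The minibatching and flattening pair above can be pictured as grouping rows into batches before an expensive call and restoring one row per element afterwards. The following is a minimal pure-Python sketch of a fixed-size variant; fixed_minibatch and flatten are hypothetical helper names, not the MMLSpark transformers.

```python
def fixed_minibatch(rows, batch_size):
    """Group rows into lists of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def flatten(batches):
    """Undo batching by yielding every row of every batch in order."""
    for batch in batches:
        yield from batch


rows = list(range(7))
batches = list(fixed_minibatch(rows, 3))
restored = list(flatten(batches))
```

Batching amortizes per-call overhead (for example, one network evaluation or web request per batch instead of per row), and flattening returns the data to its original row-wise shape.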

Updates and Improvements:

  • Updated to Spark 2.3.0

  • Added Databricks notebook tests to build system

  • CNTKModel uses significantly less memory

  • Simplified example notebooks

  • Simplified APIs for MMLSpark Serving

  • Simplified APIs for CNTK on Spark

  • LightGBM stability improvements

  • ComputeModelStatistics stability improvements

Acknowledgements:

We would like to acknowledge the external contributors who helped create
this version of MMLSpark (in order of commit history):
