Sep 23, 2018
- The Cognitive Services on Spark: A simple and scalable integration between the Microsoft Cognitive Services and SparkML

    - Bing Image Search
    - Computer Vision: OCR, Recognize Text, Recognize Domain Specific Content,
      Analyze Image, Generate Thumbnails
    - Text Analytics: Language Detector, Entity Detector, Key Phrase Extractor,
      Sentiment Detector, Named Entity Recognition
    - Face: Detect, Find Similar, Identify, Group, Verify
- Added distributed model interpretability with LIME on Spark
- **100x** lower latencies (<1 ms) with Spark Serving
- Expanded Spark Serving to cover the full HTTP protocol
- Added the `SuperpixelTransformer` for segmenting images
- Added a Fluent API, `mlTransform` and `mlFit`, for composing pipelines more elegantly

- Chain together cognitive services to understand the feelings of your favorite celebrities with `CognitiveServices - Celebrity Quote Analysis.ipynb`
- Explore how you can use Bing Image Search and Distributed Model Interpretability to get an Object Detection system without labeling any data in `ModelInterpretation - Snow Leopard Detection.ipynb`
- See how to deploy *any* Spark computation as a web service on *any* Spark platform with the `SparkServing - Deploying a Classifier.ipynb` notebook

- More APIs for loading LightGBM Native Models
- LightGBM training checkpointing and continuation
- Added tweedie variance power to LightGBM
- Added early stopping to LightGBM
- Added feature importances to LightGBM
- Added a PMML exporter for LightGBM on Spark

- Added the `VectorizableParam` for creating column parameterizable inputs
- Added `handler` parameter to HTTP services
- HTTP on Spark now propagates nulls robustly

- Updated to Spark 2.3.1
- Updated LightGBM to version 2.1.250

- Added Vagrantfile for easy Windows developer setup
- Improved Image Reader fault tolerance
- Reorganized Examples into Topics
- Generalized Image Featurizer and other Image based code to handle Binary Files as well as Spark Images
- Added `ModelDownloader` R wrapper
- Added `getBestModel` and `getBestModelInfo` to `TuneHyperparameters`
- Expanded Binary File Reading APIs
- Added `Explode` and `Lambda` transformers
- Added `SparkBindings` trait for automating Spark binding creation
- Added retries and timeouts to `ModelDownloader`
- Added `ResizeImageTransformer` to remove `ImageFeaturizer` dependence on OpenCV

We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark (in alphabetical order):

- Abhiram Eswaran, Anand Raman, Ari Green, Arvind Krishnaa Jagannathan, Ben Brodsky, Casey Hong, Courtney Cochrane, Henrik Frystyk Nielsen, Ilya Matiach, Janhavi Suresh Mahajan, Jaya Susan Mathew, Karthik Rajendran, Mario Inchiosa, Minsoo Thigpen, Soundar Srinivasan, Sudarshan Raghunathan, @terrytangyuan

Jun 28, 2018 (released by @mhamilton723)

New Functionality:

  • Export trained LightGBM models for evaluation outside of Spark

  • LightGBM on Spark supports multiple cores per executor

  • CNTKModel works with multi-input multi-output models of any CNTK

  • Added Minibatching and Flattening transformers for adding flexible
    batching logic to pipelines, deep networks, and web clients.

  • Added Benchmark test API for tracking model performance across

  • Added PartitionConsolidator function for aggregating streaming data
    onto one partition per executor (for use with connection/rate-limited
    HTTP services)
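
As a rough illustration of the batching idea behind the Minibatching and Flattening transformers listed above, the plain-Python sketch below groups rows into fixed-size batches and then flattens them back into single rows. The names `minibatch` and `flatten` are hypothetical and are not the MMLSpark API; the sketch only assumes that flattening is the inverse of batching.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def minibatch(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive rows into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def flatten(batches: Iterable[List[T]]) -> Iterator[T]:
    """Undo minibatch(): emit the rows of each batch one at a time."""
    for batch in batches:
        yield from batch


rows = list(range(7))
batches = list(minibatch(rows, 3))   # hypothetical batch size of 3
restored = list(flatten(batches))    # same rows, original order
```

Batching of this kind lets a pipeline stage (a deep network or a web client, for example) process several rows per call, while the flattening step restores the one-row-per-record shape for downstream stages.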

Updates and Improvements:

  • Updated to Spark 2.3.0

  • Added Databricks notebook tests to build system

  • CNTKModel uses significantly less memory

  • Simplified example notebooks

  • Simplified APIs for MMLSpark Serving

  • Simplified APIs for CNTK on Spark

  • LightGBM stability improvements

  • ComputeModelStatistics stability improvements


We would like to acknowledge the external contributors who helped create
this version of MMLSpark (in order of commit history):

Apr 17, 2018
New functionality:

* MMLSpark Serving: a RESTful computation engine built on Spark
  streaming.  See `docs/` for details.

* New LightGBM Binary Classification and Regression learners and
  infrastructure with a Python notebook for examples.

* MMLSpark Clients: a general-purpose, distributed, and fault-tolerant
  HTTP library usable from Spark, PySpark, and SparklyR.  See

* Add `MinibatchTransformer` and `FlattenBatch` to enable efficient,
  buffered, minibatch processing in Spark.

* Added Python wrappers and a notebook example for the
  `TuneHyperparameters` module, demonstrating parallel distributed
  hyperparameter tuning through randomized grid search.

* Add a `MultiNGram` transformer for efficiently computing variable
  length n-grams.

* Added DataType parameter for building models that are parameterized by
  Spark data types.
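
To make "variable length n-grams" from the `MultiNGram` item concrete, here is a small plain-Python sketch that emits n-grams for several lengths in one pass over a token list. The function `multi_ngrams` and its parameters are hypothetical; this is not the transformer's implementation or signature, only an illustration of the output one might expect.

```python
from typing import List, Sequence


def multi_ngrams(tokens: Sequence[str], lengths: Sequence[int]) -> List[str]:
    """Return space-joined n-grams of every requested length, grouped by length."""
    grams: List[str] = []
    for n in lengths:
        if n < 1:
            raise ValueError("n-gram lengths must be positive")
        # range() is empty when n exceeds the number of tokens
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Unigrams and bigrams for a three-token input
grams = multi_ngrams(["spark", "ml", "rocks"], [1, 2])
```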


* Update per-instance statistics module so it works for any Spark ML

* Update CNTK to version 2.4.

* Updated Spark to version 2.2.1 (the following release is likely to be
  based on Spark 2.3).

* Also updated SBT and JVM.

* Refactored readers directory into `io` directory


* Fix the Conda installation in our Docker image, resolving issues with
  importing `numpy`.

* Fix a regression in R wrappers with the latest SparklyR version.

* Additional bugfixes, stability, and notebook improvements.

Feb 8, 2018 (released by @elibarzilay)

New functionality:

  • TuneHyperparameters: parallel distributed randomized grid search for
    SparkML and TrainClassifier/TrainRegressor parameters. A sample
    notebook and Python wrappers will be added in the near future.

  • Added PowerBIWriter for writing and streaming data frames to

  • Expanded image reading and writing capabilities, including using
    images with Spark Structured Streaming. Images can be read from and
    written to paths specified in a dataframe.

  • New functionality for convenient plotting in Python.

  • UDF transformer and additional UDFs.

  • Expanded pipeline support for arbitrary user code and libraries such
    as NLTK through UDFTransformer.

  • Refactored fuzzing system and added test coverage.

  • GPU training supports multiple VMs.
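
The randomized grid search mentioned for TuneHyperparameters can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`sample_params`, `random_search`, `toy_score`) and the parameter grid are hypothetical, and a local thread pool stands in for Spark's distributed evaluation. It is not the TuneHyperparameters API.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]


def sample_params(space: Dict[str, Sequence[float]],
                  n_trials: int, seed: int) -> List[Params]:
    """Draw n_trials random points from the grid (with replacement)."""
    rng = random.Random(seed)
    return [{name: rng.choice(list(values)) for name, values in space.items()}
            for _ in range(n_trials)]


def random_search(space: Dict[str, Sequence[float]],
                  score: Callable[[Params], float],
                  n_trials: int = 8, seed: int = 0,
                  workers: int = 4) -> Tuple[Params, float]:
    """Score the sampled candidates in parallel and return the best one."""
    if n_trials < 1:
        raise ValueError("n_trials must be positive")
    candidates = sample_params(space, n_trials, seed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(params: Params) -> float:
    """Hypothetical objective; a real one would train and evaluate a model."""
    return -abs(params["learning_rate"] - 0.1) - abs(params["depth"] - 5)


space = {"learning_rate": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best_params, best_score = random_search(space, toy_score, n_trials=20, seed=42)
```

Sampling a fixed number of grid points, rather than enumerating the full grid, keeps the cost bounded as more parameters are added, and each candidate can be evaluated independently of the others.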


  • Updated to Conda 4.3.31, which comes with Python 3.6.3.

  • Also updated SBT and JVM.


  • Additional bugfixes, stability, and notebook improvements.
Feb 8, 2018
Same as v0.11, but using an older Spark v2.1.0 installation.

Nov 15, 2017 (released by @elibarzilay)

New functionality:

  • We now provide initial support for training on a GPU VM, and an ARM
    template to deploy an HDI Cluster with an associated GPU machine. See
    docs/ for instructions on setting this up.

  • New auto-generated R wrappers for estimators and transformers. To
    import them into R, you can use devtools to import from the uploaded
    zip file. Tests and sample notebooks to come.

  • A new RenameColumn transformer for renaming columns within a

New notebooks:

  • Notebook 104: An experiment to demonstrate regression models to
    predict automobile prices. This notebook demonstrates the use of
    Pipeline stages, CleanMissingData, and

  • Notebook 105: Demonstrates DataConversion to make some columns Categorical.

  • There is a 401 notebook in notebooks/gpu which demonstrates CNTK
    training when using a GPU VM. (It is not shown with the rest of the
    notebooks yet.)


  • Updated to use CNTK 2.2. Note that this version of CNTK depends on
    libpng12 and libjasper1 -- which are included in our Docker images.
    (This should get resolved in the upcoming CNTK 2.3 release.)


  • Local builds will always use a "0.0" version instead of a version
    based on the git repository. This should simplify the build process
    for developers and avoid hard-to-resolve update issues.

  • The TextPreprocessor transformer can be used to find and replace all
    key value pairs in an input map.

  • Fixed a regression in the image reader where zip files with images no
    longer displayed the full path to the image inside a zip file.

  • Additional minor bug and stability fixes.
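
One possible reading of "find and replace all key value pairs in an input map", as described for the TextPreprocessor item above, is sketched below in plain Python. The function `replace_all` is hypothetical, and its matching rules (single pass, longest key first) are assumptions; the notes above do not describe how the transformer itself resolves overlapping keys.

```python
import re
from typing import Dict


def replace_all(text: str, mapping: Dict[str, str]) -> str:
    """Replace every occurrence of each key in mapping with its value.

    The text is scanned once, so a replacement is never itself replaced.
    """
    keys = [key for key in mapping if key]  # ignore empty keys
    if not keys:
        return text
    # Try longer keys first so "new york" wins over "new"
    keys.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


cleaned = replace_all(
    "the happy sad boy drank sap",
    {"happy": "sad", "sad": "sap", "sap": "sat"},
)
```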

Nov 15, 2017
Same as v0.10, but using an older Conda installation with Python 3.5.2.

Oct 14, 2017 (released by @elibarzilay)

New functionality:

  • Refactor ImageReader and BinaryFileReader to support streaming
    images, including a Python API. Also improved performance of the
    readers. Check the 302 notebook for a usage example.

  • Add ClassBalancer estimator for improving classification performance
    on highly imbalanced datasets.

  • Create an infrastructure for automated fuzzing, serialization, and
    Python wrapper tests.

  • Added a DropColumns pipeline stage.
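
The idea of balancing classes on a highly imbalanced dataset, as in the ClassBalancer item above, can be illustrated with inverse-frequency weights. The sketch below is plain Python with a hypothetical `class_weights` helper; the exact weighting scheme that ClassBalancer uses is not stated in these notes, so the formula here (largest class count divided by each class count) is an assumption.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by (largest class count) / (its own count)."""
    counts = Counter(labels)
    if not counts:
        return {}
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}


# Hypothetical dataset: 8 "ok" rows and 2 "fraud" rows
labels: List[str] = ["ok"] * 8 + ["fraud"] * 2
weights = class_weights(labels)
row_weights = [weights[label] for label in labels]
```

With these weights the rare class contributes as much total weight as the common one, which is one common way to keep a classifier from ignoring the minority class.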

New notebooks:

  • 305: A Flowers sample notebook demonstrating deep transfer learning
    with ImageFeaturizer.


  • Our main build is now based on Spark 2.2.


  • Enable streaming through the EnsembleByKey transformer.

  • ImageReader, HDFS issue, etc.

Oct 14, 2017
Same as v0.9, but using an older Conda installation with Python 3.5.2.

Oct 12, 2017 (released by @elibarzilay)

New functionality:

  • We are now uploading MMLSpark as an Azure/mmlspark Spark package.
    Use --packages Azure:mmlspark:0.8 with the Spark command-line tools.

  • Add a bi-directional LSTM medical entity extractor to the
    ModelDownloader, and a new Jupyter notebook for medical entity
    extraction using NLTK, PubMed Word embeddings, and the Bi-LSTM.

  • Add ImageSetAugmenter for easy dataset augmentation within image
    processing pipelines.


  • Optimize the performance of CNTKModel. It now broadcasts a loaded
    model to workers and shares model weights between partitions on the
    same worker. Minibatch padding (an internal workaround of a CNTK bug)
    is no longer used, eliminating excess computations when there is a
    mismatch between the partition size and minibatch size.

  • Bugfix: CNTKModel can work with models with unnamed outputs.

Docker image improvements:

  • Environment variables are now part of the Docker image (in addition to
    being set in bash).

  • New Docker images:

    • microsoft/mmlspark:latest: plain image, as always.
    • microsoft/mmlspark:gpu: GPU variant based on an nvidia/cuda image.
    • microsoft/mmlspark:plus and microsoft/mmlspark:plus-gpu: these
      images contain additional packages for internal use; they will
      probably be based on an older Conda version too in future releases.


  • The Conda environment now includes NLTK.

  • Updated Java and SBT versions.