Commit 70be8dd: Initial content

mmlspark-bot committed Jun 5, 2017
1 parent bb5be49 · commit 70be8dd
Showing 219 changed files with 20,154 additions and 0 deletions.
29 changes: 29 additions & 0 deletions .gitignore
@@ -0,0 +1,29 @@
# include BuildArtifacts.zip which is used in some parts of the build
/BuildArtifacts*
/TestResults
# accommodate installing the build environment locally
/pkgs/
# useful env configurations
/tools/local-config.sh

# Generated by tools/build-pr
/.build-pr

# Ignore these for safety
*.class
*.jar
*.log
*.tgz
*.zip
*.exe
*.pyc
*.pyo

# Generic editors
.vscode

# Common things
*~
.#*
.*.swp
.DS_Store
38 changes: 38 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,38 @@
## Interested in contributing to MMLSpark? We're excited to work with you.

### You can contribute in many ways

* Use the library and give feedback
* Report a bug
* Request a feature
* Fix a bug
* Add examples and documentation
* Code a new feature
* Review pull requests

### How to contribute?

You can give feedback, report bugs and request new features anytime by
opening an issue. Also, you can up-vote and comment on existing issues.

To make a pull request into the repo, such as bug fixes, documentation
or new features, follow these steps:

* If it's a new feature, open an issue for preliminary discussion with
  us, to ensure your contribution is a good fit and doesn't duplicate
  ongoing work.
* Typically, you'll need to accept the Microsoft Contributor License
  Agreement (CLA).
* Familiarize yourself with coding style and guidelines.
* Fork the repository, code your contribution, and create a pull
request.
* Wait for an MMLSpark team member to review and accept it. Be patient
  as we iron out the process for a new project.

A good way to get started contributing is to look for issues with a "help
wanted" label: these are issues we want to fix but don't currently have
the resources for.

*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
22 changes: 22 additions & 0 deletions LICENSE
@@ -0,0 +1,22 @@
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
177 changes: 177 additions & 0 deletions README.md
@@ -0,0 +1,177 @@
# Microsoft Machine Learning for Apache Spark

<img title="Build Status" src="https://mmlspark.azureedge.net/icons/BuildStatus.png" align="right" />

MMLSpark provides a number of deep learning and data science tools for [Apache
Spark](https://github.com/apache/spark), including seamless integration of Spark
Machine Learning pipelines with [Microsoft Cognitive Toolkit
(CNTK)](https://github.com/Microsoft/CNTK) and [OpenCV](http://www.opencv.org/),
enabling you to quickly create powerful, highly-scalable predictive and
analytical models for large image and text datasets.

MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or
Python 3.5+. See the API documentation
[for Scala](http://mmlspark.azureedge.net/docs/scala/) and
[for PySpark](http://mmlspark.azureedge.net/docs/pyspark/).


## Salient features

* Easily ingest images from HDFS into a Spark `DataFrame` ([example:301])
* Pre-process image data using transforms from OpenCV ([example:302])
* Featurize images with pre-trained deep neural nets using CNTK ([example:301])
* Train DNN-based image classification models on N-Series GPU VMs on Azure
([example:301])
* Featurize free-form text data using convenient APIs on top of primitives in
SparkML via a single transformer ([example:201])
* Train classification and regression models easily via implicit featurization
of data ([example:101])
* Compute a rich set of evaluation metrics including per-instance metrics
([example:102])

See our [notebooks](notebooks/samples/) for all examples.

[example:101]: notebooks/samples/101%20-%20Adult%20Census%20Income%20Training.ipynb
"Adult Census Income Training"
[example:102]: notebooks/samples/102%20-%20Regression%20Example%20with%20Flight%20Delay%20Dataset.ipynb
"Regression Example with Flight Delay Dataset"
[example:201]: notebooks/samples/201%20-%20Amazon%20Book%20Reviews%20-%20TextFeaturizer.ipynb
"Amazon Book Reviews - TextFeaturizer"
[example:301]: notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb
"CIFAR10 CNTK CNN Evaluation"
[example:302]: notebooks/samples/302%20-%20Pipeline%20Image%20Transformations.ipynb
"Pipeline Image Transformations"


## A short example

Below is an excerpt from a simple example of using a pre-trained CNN to classify
images in the CIFAR-10 dataset. View the whole source code as [an example
notebook](notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb).

```python
...
import mmlspark as mml
# Initialize CNTKModel and define input and output columns
cntkModel = mml.CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(modelFile)
# Score the dataset with the model via an internal Spark pipeline
scoredImages = cntkModel.transform(imagesWithLabels)
...
```
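The other samples follow the same pattern. For instance, training a
classifier with implicit featurization and then computing evaluation metrics
looks roughly like the sketch below (modeled on the 101 and 102 notebooks;
`trainDF`, `testDF`, and the `income` label column are illustrative):

```python
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier, ComputeModelStatistics

# Fit a model; MMLSpark implicitly featurizes the raw input columns
model = TrainClassifier(model=LogisticRegression(),
                        labelCol="income").fit(trainDF)

# Score held-out data and compute a rich set of evaluation metrics
scored = model.transform(testDF)
metrics = ComputeModelStatistics().transform(scored)
```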

See [other sample notebooks](notebooks/samples/) as well as the MMLSpark
documentation for [Scala](http://mmlspark.azureedge.net/docs/scala/)
and [PySpark](http://mmlspark.azureedge.net/docs/pyspark/).


## Setup and installation

### Docker

The easiest way to evaluate MMLSpark is via our pre-built Docker container. To
do so, run the following command:

    docker run -it -p 8888:8888 microsoft/mmlspark

Navigate to <http://localhost:8888> in your web browser to run the sample
notebooks. See the
[documentation](http://mmlspark.azureedge.net/docs/pyspark/install.html)
for more on Docker use.

> Note: If you wish to run a new instance of the Docker image, make sure you
> stop & remove the container with the name `my-mml` (using `docker rm my-mml`)
> before you try to run a new instance, or run it with a `--rm` flag.
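> For example, the following runs a named, throwaway instance that is
> removed automatically on exit (standard Docker flags):
>
>     docker run -it --rm -p 8888:8888 --name my-mml microsoft/mmlspark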

### Spark package

MMLSpark can be conveniently installed on existing Spark clusters via the
`--packages` option. For example:

    spark-shell --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
                --repositories=https://mmlspark.azureedge.net/maven

    pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
            --repositories=https://mmlspark.azureedge.net/maven

    spark-submit --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
                 --repositories=https://mmlspark.azureedge.net/maven \
                 MyApp.jar

<img title="Script action submission" src="http://i.imgur.com/oQcS0R2.png" align="right" />

### HDInsight

To install MMLSpark on an existing [HDInsight Spark
Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/), you can execute a
script action on the cluster head and worker nodes. For instructions on running
script actions, see [this
guide](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#use-a-script-action-during-cluster-creation).

The script action URL is:
<https://mmlspark.azureedge.net/buildartifacts/0.5/install-mmlspark.sh>.

If you're using the Azure Portal to run the script action, go to `Script
actions` → `Submit new` in the `Overview` section of your cluster blade. In
the `Bash script URI` field, enter the script action URL provided above. Mark
the rest of the options as shown on the screenshot to the right.

Submit, and the cluster should finish configuring within 10 minutes or so.

### Databricks cloud

To install MMLSpark on the
[Databricks cloud](http://community.cloud.databricks.com), create a new
[library from Maven coordinates](https://docs.databricks.com/user-guide/libraries.html#libraries-from-maven-pypi-or-spark-packages)
in your workspace.

For the coordinates use: `com.microsoft.ml.spark:mmlspark_2.11:0.5`. Then, under
Advanced Options, use `https://mmlspark.azureedge.net/maven` for the repository.
Ensure this library is attached to all clusters you create.

Finally, ensure that your Spark cluster has at least Spark 2.1 and Scala 2.11.

You can use MMLSpark in both your Scala and PySpark notebooks.

### SBT

If you are building a Spark application in Scala, add the following lines to
your `build.sbt`:

```scala
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.5"
```

### Building from source

You can also create your own build by cloning this repo and using the main
build script, `./runme`. Run it once to install the needed dependencies, and
again to do a build. See [this guide](docs/developer-readme.md) for more
information.
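In short (the repository URL here is an assumption; substitute your own
fork as needed):

    git clone https://github.com/Azure/mmlspark.git
    cd mmlspark
    ./runme   # first run installs the build dependencies
    ./runme   # run again to build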


## Contributing & feedback

This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
see the [Code of Conduct
FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional
questions or comments.

See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.

To give feedback and/or report an issue, open a [GitHub
Issue](https://help.github.com/articles/creating-an-issue/).


## Other relevant projects

* [Microsoft Cognitive Toolkit](https://github.com/Microsoft/CNTK)

* [Azure Machine Learning
Operationalization](https://github.com/Azure/Machine-Learning-Operationalization)

*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
80 changes: 80 additions & 0 deletions docs/developer-readme.md
@@ -0,0 +1,80 @@
# MMLSpark

## Repository Layout

* `runme`: main build entry point
* `src/`: Scala and Python sources
- `core/`: shared functionality
- `project/`: sbt build-related materials
* `tools/`: build-related tools


## Build

### Build Environment

Currently, this code is developed and built on Linux. The main build entry
point, `./runme`, will install the needed packages. When everything is
installed, you can use `./runme` again to do a build.


### Development

After the initial setup, you can keep using `./runme` for builds. Alternatively, use
`sbt full-build` to do the build directly through SBT. The output will show
the individual steps that are running, and you can use them directly as usual
with SBT. For example, use `sbt "project foo-bar" test` to run the tests of
the `foo-bar` sub-project, or `sbt ~compile` to do a full compilation step
whenever any file changes.

Note that the SBT environment is set up in a way that makes *all* code in
`com.microsoft.ml.spark` available in the Scala console that you get when you
run `sbt console`. This can be a very useful debugging tool, since you get to
play with your code in an interactive REPL.

Every once in a while the installed libraries will be updated. In this case,
executing `./runme` will update the libraries, and the next run will do a build
as usual. If you're using `sbt` directly, it will warn you whenever the
library configurations have changed.

Note: the libraries are all installed in `$HOME/lib` with a few
executable symlinks in `$HOME/bin`. The environment is configured in
`$HOME/.mmlspark_profile` which will be executed whenever a shell starts.
Occasionally, `./runme` will tell you that there was an update to the
`.mmlspark_profile` file --- when this happens, you can start a new shell
to get the updated version, but you can also apply the changes to your
running shell with `. ~/.mmlspark_profile` which will evaluate its
contents and save a shell restart.


## Adding a Module

To add a new module, create a directory with an appropriate name, and in the
new directory create a `build.sbt` file. Its contents are optional and can
be completely empty: the file's presence makes the build pick up your
directory as a sub-project in SBT.

You can put the usual SBT customizations in your `build.sbt`, for example:

    version := "1.0"
    name := "A Useful Module"

In addition, `Extras` provides utilities for customizing a sub-project.
Currently, there is only one such utility:

    Extras.noJar

putting this in your `build.sbt` indicates that no `.jar` file should be
created for your sub-project in the `package` step. (Useful, for example, for
build tools and test-only directories.)

Finally, whenever SBT runs it generates an `autogen.sbt` file that specifies
the sub-projects. This file is generated automatically so there is no need to
edit a central file when you add a module, and therefore customizing what
appears in it is done via "meta comments" in your `build.sbt`. This is
currently used to specify dependencies for your sub-project --- in most cases
you will want to add this:

    //> DependsOn: core

to use the shared code in the `core` sub-project.
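Putting these pieces together, a complete `build.sbt` for a hypothetical
test-only module that depends on the shared code might look like this (the
name and version are illustrative):

    //> DependsOn: core

    name := "my-test-utils"
    version := "1.0"

    Extras.noJar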
