Commit 70be8dd (1 parent: bb5be49)

Showing 219 changed files with 20,154 additions and 0 deletions.
@@ -0,0 +1,29 @@
# include BuildArtifacts.zip which is used in some parts of the build
/BuildArtifacts*
/TestResults
# accommodate installing the build environment locally
/pkgs/
# useful env configurations
/tools/local-config.sh

# Generated by tools/build-pr
/.build-pr

# Ignore these for safety
*.class
*.jar
*.log
*.tgz
*.zip
*.exe
*.pyc
*.pyo

# Generic editors
.vscode

# Common things
*~
.#*
.*.swp
.DS_Store
@@ -0,0 +1,38 @@
## Interested in contributing to MMLSpark? We're excited to work with you.

### You can contribute in many ways

* Use the library and give feedback
* Report a bug
* Request a feature
* Fix a bug
* Add examples and documentation
* Code a new feature
* Review pull requests

### How to contribute?

You can give feedback, report bugs, and request new features anytime by
opening an issue. You can also up-vote and comment on existing issues.

To make a pull request into the repo, such as bug fixes, documentation,
or new features, follow these steps:

* If it's a new feature, open an issue for preliminary discussion with
  us, to ensure your contribution is a good fit and doesn't duplicate
  ongoing work.
* Typically, you'll need to accept the Microsoft Contributor License
  Agreement (CLA).
* Familiarize yourself with the coding style and guidelines.
* Fork the repository, code your contribution, and create a pull
  request.
* Wait for an MMLSpark team member to review and accept it. Be patient
  as we iron out the process for a new project.

A good way to get started contributing is to look for issues with a "help
wanted" label. These are issues that we do want to fix, but don't currently
have the resources to work on.

*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
@@ -0,0 +1,22 @@
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,177 @@
# Microsoft Machine Learning for Apache Spark

<img title="Build Status" src="https://mmlspark.azureedge.net/icons/BuildStatus.png" align="right" />

MMLSpark provides a number of deep learning and data science tools for [Apache
Spark](https://github.com/apache/spark), including seamless integration of Spark
Machine Learning pipelines with the [Microsoft Cognitive Toolkit
(CNTK)](https://github.com/Microsoft/CNTK) and [OpenCV](http://www.opencv.org/),
enabling you to quickly create powerful, highly scalable predictive and
analytical models for large image and text datasets.

MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or
Python 3.5+. See the API documentation
[for Scala](http://mmlspark.azureedge.net/docs/scala/) and
[for PySpark](http://mmlspark.azureedge.net/docs/pyspark/).


## Salient features

* Easily ingest images from HDFS into a Spark `DataFrame` ([example:301])
* Pre-process image data using transforms from OpenCV ([example:302])
* Featurize images using pre-trained deep neural nets from CNTK ([example:301])
* Train DNN-based image classification models on N-Series GPU VMs on Azure
  ([example:301])
* Featurize free-form text data using convenient APIs on top of primitives in
  SparkML via a single transformer ([example:201])
* Train classification and regression models easily via implicit featurization
  of data ([example:101])
* Compute a rich set of evaluation metrics, including per-instance metrics
  ([example:102])

See our [notebooks](notebooks/samples/) for all examples.

[example:101]: notebooks/samples/101%20-%20Adult%20Census%20Income%20Training.ipynb
  "Adult Census Income Training"
[example:102]: notebooks/samples/102%20-%20Regression%20Example%20with%20Flight%20Delay%20Dataset.ipynb
  "Regression Example with Flight Delay Dataset"
[example:201]: notebooks/samples/201%20-%20Amazon%20Book%20Reviews%20-%20TextFeaturizer.ipynb
  "Amazon Book Reviews - TextFeaturizer"
[example:301]: notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb
  "CIFAR10 CNTK CNN Evaluation"
[example:302]: notebooks/samples/302%20-%20Pipeline%20Image%20Transformations.ipynb
  "Pipeline Image Transformations"


## A short example

Below is an excerpt from a simple example of using a pre-trained CNN to classify
images in the CIFAR-10 dataset. View the whole source code as [an example
notebook](notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb).

```python
...
import mmlspark as mml

# Initialize CNTKModel and define input and output columns
cntkModel = mml.CNTKModel() \
    .setInputCol("images") \
    .setOutputCol("output") \
    .setModelLocation(modelFile)

# Score the dataset with the model as a Spark pipeline stage
scoredImages = cntkModel.transform(imagesWithLabels)
...
```
See [other sample notebooks](notebooks/samples/) as well as the MMLSpark
documentation for [Scala](http://mmlspark.azureedge.net/docs/scala/)
and [PySpark](http://mmlspark.azureedge.net/docs/pyspark/).


## Setup and installation

### Docker

The easiest way to evaluate MMLSpark is via our pre-built Docker container. To
do so, run the following command:

    docker run -it -p 8888:8888 microsoft/mmlspark

Navigate to <http://localhost:8888> in your web browser to run the sample
notebooks. See the
[documentation](http://mmlspark.azureedge.net/docs/pyspark/install.html)
for more on Docker use.

> Note: If you wish to run a new instance of the Docker image, make sure you
> stop and remove the container named `my-mml` (using `docker rm my-mml`)
> before you try to run a new instance, or run it with the `--rm` flag.

### Spark package | ||

MMLSpark can be conveniently installed on existing Spark clusters via the
`--packages` option, for example:

    spark-shell --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
                --repositories=https://mmlspark.azureedge.net/maven

    pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
            --repositories=https://mmlspark.azureedge.net/maven

    spark-submit --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
                 --repositories=https://mmlspark.azureedge.net/maven \
                 MyApp.jar

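Alternatively, the same coordinates can be set once so that every session picks them up. This uses generic Spark configuration keys (`spark.jars.packages` and `spark.jars.repositories`), not anything MMLSpark-specific, and is a sketch rather than a step from the MMLSpark docs:

```
# conf/spark-defaults.conf
spark.jars.packages      com.microsoft.ml.spark:mmlspark_2.11:0.5
spark.jars.repositories  https://mmlspark.azureedge.net/maven
```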
<img title="Script action submission" src="http://i.imgur.com/oQcS0R2.png" align="right" />

### HDInsight

To install MMLSpark on an existing [HDInsight Spark
Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/), you can execute a
script action on the cluster head and worker nodes. For instructions on running
script actions, see [this
guide](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#use-a-script-action-during-cluster-creation).

The script action URL is:
<https://mmlspark.azureedge.net/buildartifacts/0.5/install-mmlspark.sh>.

If you're using the Azure Portal to run the script action, go to `Script
actions` ⇒ `Submit new` in the `Overview` section of your cluster blade. In the
`Bash script URI` field, enter the script action URL provided above. Mark the
rest of the options as shown on the screenshot to the right.

Submit, and the cluster should finish configuring within 10 minutes or so.

### Databricks cloud

To install MMLSpark on the
[Databricks cloud](http://community.cloud.databricks.com), create a new
[library from Maven coordinates](https://docs.databricks.com/user-guide/libraries.html#libraries-from-maven-pypi-or-spark-packages)
in your workspace.

For the coordinates, use `com.microsoft.ml.spark:mmlspark:0.5`. Then, under
Advanced Options, use `https://mmlspark.azureedge.net/maven` for the repository.
Ensure this library is attached to all clusters you create.

Finally, ensure that your Spark cluster has at least Spark 2.1 and Scala 2.11.

You can use MMLSpark in both your Scala and PySpark notebooks.

### SBT

If you are building a Spark application in Scala, add the following lines to
your `build.sbt`:

```scala
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.5"
```
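For context, these two lines might fit into a complete minimal `build.sbt` like the sketch below. The project name and the Spark artifacts and version are illustrative assumptions, not part of the MMLSpark docs:

```scala
// Hypothetical application settings -- adjust to your project
name := "my-mmlspark-app"
scalaVersion := "2.11.8"  // MMLSpark requires Scala 2.11

resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"

libraryDependencies ++= Seq(
  // Spark is usually supplied by the cluster at runtime, hence "provided";
  // 2.1.0 is an assumed version satisfying the Spark 2.1+ requirement
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided",
  "com.microsoft.ml.spark" %% "mmlspark" % "0.5"
)
```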

### Building from source

You can also create your own build by cloning this repo and using the main
build script, `./runme`. Run it once to install the needed dependencies, and
again to do a build. See [this guide](docs/developer-readme.md) for more
information.


## Contributing & feedback

This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
see the [Code of Conduct
FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional
questions or comments.

See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.

To give feedback and/or report an issue, open a [GitHub
Issue](https://help.github.com/articles/creating-an-issue/).


## Other relevant projects

* [Microsoft Cognitive Toolkit](https://github.com/Microsoft/CNTK)

* [Azure Machine Learning
  Operationalization](https://github.com/Azure/Machine-Learning-Operationalization)

*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
@@ -0,0 +1,80 @@
# MMLSpark

## Repository Layout

* `runme`: main build entry point
* `src/`: Scala and Python sources
  - `core/`: shared functionality
  - `project/`: sbt build-related materials
* `tools/`: build-related tools


## Build

### Build Environment

Currently, this code is developed and built on Linux. The main build entry
point, `./runme`, will install the needed packages. When everything is
installed, you can use `./runme` again to do a build.


### Development

From now on, you can continue using `./runme` for builds. Alternatively, use
`sbt full-build` to do the build directly through SBT. The output will show
the individual steps that are running, and you can use them directly as usual
with SBT. For example, use `sbt "project foo-bar" test` to run the tests of
the `foo-bar` sub-project, or `sbt ~compile` to do a full compilation step
whenever any file changes.

Note that the SBT environment is set up in a way that makes *all* code in
`com.microsoft.ml.spark` available in the Scala console that you get when you
run `sbt console`. This can be a very useful debugging tool, since you get to
play with your code in an interactive REPL.

Every once in a while the installed libraries will be updated. In this case,
executing `./runme` will update the libraries, and the next run will do a build
as usual. If you're using `sbt` directly, it will warn you whenever there was
a change to the library configurations.

Note: the libraries are all installed in `$HOME/lib`, with a few
executable symlinks in `$HOME/bin`. The environment is configured in
`$HOME/.mmlspark_profile`, which is executed whenever a shell starts.
Occasionally, `./runme` will tell you that there was an update to the
`.mmlspark_profile` file --- when this happens, you can start a new shell
to get the updated version, but you can also apply the changes to your
running shell with `. ~/.mmlspark_profile`, which will evaluate its
contents and save a shell restart.


## Adding a Module

To add a new module, create a directory with an appropriate name, and in the
new directory create a `build.sbt` file. The contents of `build.sbt` are
optional and can be completely empty: its presence will make the build include
your directory as a sub-project which gets picked up by SBT.

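As a concrete sketch of those two steps (the module name `my-module` is a hypothetical example, and placing it under `src/` next to `core/` is our assumption based on the repository layout):

```shell
# Create the new sub-project directory (assumed to live under src/)
mkdir -p src/my-module

# An empty build.sbt is enough: its presence makes the build treat
# the directory as an SBT sub-project
touch src/my-module/build.sbt
```

From there, the build will pick the directory up the next time SBT runs.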
You can put the usual SBT customizations in your `build.sbt`, for example:

    version := "1.0"
    name := "A Useful Module"

In addition, there are a few utilities in `Extras` that can be useful for
specifying some things. Currently, there is only one such utility:

    Extras.noJar

Putting this in your `build.sbt` indicates that no `.jar` file should be
created for your sub-project in the `package` step. (Useful, for example, for
build tools and test-only directories.)

Finally, whenever SBT runs it generates an `autogen.sbt` file that specifies
the sub-projects. This file is generated automatically, so there is no need to
edit a central file when you add a module; customizing what appears in it is
therefore done via "meta comments" in your `build.sbt`. This is currently used
to specify dependencies for your sub-project --- in most cases you will want
to add this:

    //> DependsOn: core

to use the shared code in the `core` sub-project.
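Putting the pieces together, a sub-project's `build.sbt` might look like the sketch below. The name and version are illustrative, and `Extras.noJar` and the `DependsOn` meta comment are only needed when they apply:

```scala
// Hypothetical sub-project build.sbt
version := "1.0"
name := "A Useful Module"

// Skip .jar creation in the `package` step (e.g., for test-only directories)
Extras.noJar

// Meta comment picked up when autogen.sbt is generated
//> DependsOn: core
```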