Commit 70be8dd: Initial content

mmlspark-bot committed Jun 5, 2017
1 parent bb5be49 · commit 70be8dd
Showing 219 changed files with 20,154 additions and 0 deletions.
29 changes: 29 additions & 0 deletions .gitignore
@@ -0,0 +1,29 @@
# include BuildArtifacts.zip which is used in some parts of the build
/BuildArtifacts*
/TestResults
# accommodate installing the build environment locally
/pkgs/
# useful env configurations
/tools/local-config.sh

# Generated by tools/build-pr
/.build-pr

# Ignore these for safety
*.class
*.jar
*.log
*.tgz
*.zip
*.exe
*.pyc
*.pyo

# Generic editors
.vscode

# Common things
*~
.#*
.*.swp
.DS_Store
38 changes: 38 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,38 @@
## Interested in contributing to MMLSpark? We're excited to work with you.

### You can contribute in many ways

* Use the library and give feedback
* Report a bug
* Request a feature
* Fix a bug
* Add examples and documentation
* Code a new feature
* Review pull requests

### How to contribute?

You can give feedback, report bugs and request new features anytime by
opening an issue. Also, you can up-vote and comment on existing issues.

To make a pull request into the repo, such as bug fixes, documentation
or new features, follow these steps:

* If it's a new feature, open an issue for preliminary discussion with
  us, to ensure your contribution is a good fit and doesn't duplicate
  ongoing work.
* Typically, you'll need to accept the Microsoft Contributor License
  Agreement (CLA).
* Familiarize yourself with coding style and guidelines.
* Fork the repository, code your contribution, and create a pull
request.
* Wait for an MMLSpark team member to review and accept it. Be patient
  as we iron out the process for a new project.

A good way to get started contributing is to look for issues with a "help
wanted" label: these are issues we want to fix but don't currently have
the resources for.

*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
22 changes: 22 additions & 0 deletions LICENSE
@@ -0,0 +1,22 @@
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
177 changes: 177 additions & 0 deletions README.md
@@ -0,0 +1,177 @@
# Microsoft Machine Learning for Apache Spark

<img title="Build Status" src="https://mmlspark.azureedge.net/icons/BuildStatus.png" align="right" />

MMLSpark provides a number of deep learning and data science tools for [Apache
Spark](https://github.com/apache/spark), including seamless integration of Spark
Machine Learning pipelines with [Microsoft Cognitive Toolkit
(CNTK)](https://github.com/Microsoft/CNTK) and [OpenCV](http://www.opencv.org/),
enabling you to quickly create powerful, highly-scalable predictive and
analytical models for large image and text datasets.

MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or
Python 3.5+. See the API documentation
[for Scala](http://mmlspark.azureedge.net/docs/scala/) and
[for PySpark](http://mmlspark.azureedge.net/docs/pyspark/).


## Salient features

* Easily ingest images from HDFS into a Spark `DataFrame` ([example:301])
* Pre-process image data using transforms from OpenCV ([example:302])
* Featurize images with pre-trained deep neural nets using CNTK ([example:301])
* Train DNN-based image classification models on N-Series GPU VMs on Azure
([example:301])
* Featurize free-form text data using convenient APIs on top of primitives in
SparkML via a single transformer ([example:201])
* Train classification and regression models easily via implicit featurization
of data ([example:101])
* Compute a rich set of evaluation metrics including per-instance metrics
([example:102])

See our [notebooks](notebooks/samples/) for all examples.

[example:101]: notebooks/samples/101%20-%20Adult%20Census%20Income%20Training.ipynb
"Adult Census Income Training"
[example:102]: notebooks/samples/102%20-%20Regression%20Example%20with%20Flight%20Delay%20Dataset.ipynb
"Regression Example with Flight Delay Dataset"
[example:201]: notebooks/samples/201%20-%20Amazon%20Book%20Reviews%20-%20TextFeaturizer.ipynb
"Amazon Book Reviews - TextFeaturizer"
[example:301]: notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb
"CIFAR10 CNTK CNN Evaluation"
[example:302]: notebooks/samples/302%20-%20Pipeline%20Image%20Transformations.ipynb
"Pipeline Image Transformations"


## A short example

Below is an excerpt from a simple example of using a pre-trained CNN to classify
images in the CIFAR-10 dataset. View the whole source code as [an example
notebook](notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb).

```python
...
import mmlspark as mml
# Initialize CNTKModel and define input and output columns
cntkModel = mml.CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(modelFile)
# Score the dataset with the model via an internal Spark pipeline
scoredImages = cntkModel.transform(imagesWithLabels)
...
```
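The other samples follow the same pattern. For instance, training a
classifier with implicit featurization and then computing evaluation metrics
looks roughly like the sketch below (modeled on the 101 and 102 notebooks;
`trainDF`, `testDF`, and the `income` label column are illustrative):

```python
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier, ComputeModelStatistics

# Fit a model; MMLSpark implicitly featurizes the raw input columns
model = TrainClassifier(model=LogisticRegression(),
                        labelCol="income").fit(trainDF)

# Score held-out data and compute a rich set of evaluation metrics
scored = model.transform(testDF)
metrics = ComputeModelStatistics().transform(scored)
```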

See [other sample notebooks](notebooks/samples/) as well as the MMLSpark
documentation for [Scala](http://mmlspark.azureedge.net/docs/scala/)
and [PySpark](http://mmlspark.azureedge.net/docs/pyspark/).


## Setup and installation

### Docker

The easiest way to evaluate MMLSpark is via our pre-built Docker container. To
do so, run the following command:

    docker run -it -p 8888:8888 microsoft/mmlspark

Navigate to <http://localhost:8888> in your web browser to run the sample
notebooks. See the
[documentation](http://mmlspark.azureedge.net/docs/pyspark/install.html)
for more on Docker use.

> Note: If you wish to run a new instance of the Docker image, make sure you
> stop & remove the container with the name `my-mml` (using `docker rm my-mml`)
> before you try to run a new instance, or run it with a `--rm` flag.
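> For example, the following runs a named, throwaway instance that is
> removed automatically on exit (standard Docker flags):
>
>     docker run -it --rm -p 8888:8888 --name my-mml microsoft/mmlspark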

### Spark package

MMLSpark can be conveniently installed on existing Spark clusters via the
`--packages` option. For example:

    spark-shell --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
                --repositories=https://mmlspark.azureedge.net/maven

    pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
            --repositories=https://mmlspark.azureedge.net/maven

    spark-submit --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
                 --repositories=https://mmlspark.azureedge.net/maven \
                 MyApp.jar

<img title="Script action submission" src="http://i.imgur.com/oQcS0R2.png" align="right" />

### HDInsight

To install MMLSpark on an existing [HDInsight Spark
Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/), you can execute a
script action on the cluster head and worker nodes. For instructions on running
script actions, see [this
guide](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#use-a-script-action-during-cluster-creation).

The script action URL is:
<https://mmlspark.azureedge.net/buildartifacts/0.5/install-mmlspark.sh>.

If you're using the Azure Portal to run the script action, go to `Script
actions` → `Submit new` in the `Overview` section of your cluster blade. In
the `Bash script URI` field, enter the script action URL provided above. Mark
the rest of the options as shown on the screenshot to the right.

Submit, and the cluster should finish configuring within 10 minutes or so.

### Databricks cloud

To install MMLSpark on the
[Databricks cloud](http://community.cloud.databricks.com), create a new
[library from Maven coordinates](https://docs.databricks.com/user-guide/libraries.html#libraries-from-maven-pypi-or-spark-packages)
in your workspace.

For the coordinates use: `com.microsoft.ml.spark:mmlspark_2.11:0.5`. Then, under
Advanced Options, use `https://mmlspark.azureedge.net/maven` for the repository.
Ensure this library is attached to all clusters you create.

Finally, ensure that your Spark cluster has at least Spark 2.1 and Scala 2.11.

You can use MMLSpark in both your Scala and PySpark notebooks.

### SBT

If you are building a Spark application in Scala, add the following lines to
your `build.sbt`:

```scala
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.5"
```

### Building from source

You can also create your own build by cloning this repo and using the main
build script, `./runme`. Run it once to install the needed dependencies, and
again to do a build. See [this guide](docs/developer-readme.md) for more
information.
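In short (the repository URL here is an assumption; substitute your own
fork as needed):

    git clone https://github.com/Azure/mmlspark.git
    cd mmlspark
    ./runme   # first run installs the build dependencies
    ./runme   # run again to build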


## Contributing & feedback

This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
see the [Code of Conduct
FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional
questions or comments.

See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.

To give feedback and/or report an issue, open a [GitHub
Issue](https://help.github.com/articles/creating-an-issue/).


## Other relevant projects

* [Microsoft Cognitive Toolkit](https://github.com/Microsoft/CNTK)

* [Azure Machine Learning
Operationalization](https://github.com/Azure/Machine-Learning-Operationalization)

*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
80 changes: 80 additions & 0 deletions docs/developer-readme.md
@@ -0,0 +1,80 @@
# MMLSpark

## Repository Layout

* `runme`: main build entry point
* `src/`: Scala and Python sources
- `core/`: shared functionality
- `project/`: sbt build-related materials
* `tools/`: build-related tools


## Build

### Build Environment

Currently, this code is developed and built on Linux. The main build entry
point, `./runme`, will install the needed packages. When everything is
installed, you can use `./runme` again to do a build.


### Development

After the initial setup, you can keep using `./runme` for builds. Alternatively, use
`sbt full-build` to do the build directly through SBT. The output will show
the individual steps that are running, and you can use them directly as usual
with SBT. For example, use `sbt "project foo-bar" test` to run the tests of
the `foo-bar` sub-project, or `sbt ~compile` to do a full compilation step
whenever any file changes.

Note that the SBT environment is set up in a way that makes *all* code in
`com.microsoft.ml.spark` available in the Scala console that you get when you
run `sbt console`. This can be a very useful debugging tool, since you get to
play with your code in an interactive REPL.

Every once in a while the installed libraries will be updated. In this case,
executing `./runme` will update the libraries, and the next run will do a build
as usual. If you're using `sbt` directly, it will warn you whenever the
library configurations have changed.

Note: the libraries are all installed in `$HOME/lib` with a few
executable symlinks in `$HOME/bin`. The environment is configured in
`$HOME/.mmlspark_profile` which will be executed whenever a shell starts.
Occasionally, `./runme` will tell you that there was an update to the
`.mmlspark_profile` file --- when this happens, you can start a new shell
to get the updated version, but you can also apply the changes to your
running shell with `. ~/.mmlspark_profile` which will evaluate its
contents and save a shell restart.


## Adding a Module

To add a new module, create a directory with an appropriate name, and in the
new directory create a `build.sbt` file. Its contents are optional and can
be completely empty: the file's presence makes the build pick up your
directory as a sub-project in SBT.

You can put the usual SBT customizations in your `build.sbt`, for example:

    version := "1.0"
    name := "A Useful Module"

In addition, `Extras` provides utilities for customizing a sub-project.
Currently, there is only one such utility:

    Extras.noJar

putting this in your `build.sbt` indicates that no `.jar` file should be
created for your sub-project in the `package` step. (Useful, for example, for
build tools and test-only directories.)

Finally, whenever SBT runs it generates an `autogen.sbt` file that specifies
the sub-projects. This file is generated automatically so there is no need to
edit a central file when you add a module, and therefore customizing what
appears in it is done via "meta comments" in your `build.sbt`. This is
currently used to specify dependencies for your sub-project --- in most cases
you will want to add this:

    //> DependsOn: core

to use the shared code in the `core` sub-project.
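Putting these pieces together, a complete `build.sbt` for a hypothetical
test-only module that depends on the shared code might look like this (the
name and version are illustrative):

    //> DependsOn: core

    name := "my-test-utils"
    version := "1.0"

    Extras.noJar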
