
Initial content

mmlspark-bot committed Jun 2, 2017
1 parent bb5be49 commit 70be8dd11720fc2d34b764b642c9dcbacd540508
Showing with 20,154 additions and 0 deletions.
  1. +29 −0 .gitignore
  2. +38 −0 CONTRIBUTING.md
  3. +22 −0 LICENSE
  4. +177 −0 README.md
  5. +80 −0 docs/developer-readme.md
  6. +298 −0 docs/third-party-notices.txt
  7. +109 −0 docs/your-first-model.md
  8. +139 −0 notebooks/samples/101 - Adult Census Income Training.ipynb
  9. +161 −0 notebooks/samples/102 - Regression Example with Flight Delay Dataset.ipynb
  10. +286 −0 notebooks/samples/103 - Before and After MMLSpark.ipynb
  11. +186 −0 notebooks/samples/201 - Amazon Book Reviews - TextFeaturizer.ipynb
  12. +228 −0 notebooks/samples/202 - Amazon Book Reviews - Word2Vec.ipynb
  13. +264 −0 notebooks/samples/301 - CIFAR10 CNTK CNN Evaluation.ipynb
  14. +236 −0 notebooks/samples/302 - Pipeline Image Transformations.ipynb
  15. +107 −0 notebooks/tests/BasicDFOpsSmokeTest.ipynb
  16. +50 −0 runme
  17. +44 −0 src/.gitignore
  18. +2 −0 src/.sbtopts
  19. +11 −0 src/build.sbt
  20. +1 −0 src/checkpoint-data/build.sbt
  21. +71 −0 src/checkpoint-data/src/main/scala/CheckpointData.scala
  22. +41 −0 src/checkpoint-data/src/test/scala/CheckpointDataSuite.scala
  23. +3 −0 src/cntk-model/build.sbt
  24. +21 −0 src/cntk-model/src/main/python/CNTKModel.py
  25. +230 −0 src/cntk-model/src/main/scala/CNTKModel.scala
  26. +60 −0 src/cntk-model/src/test/scala/CNTKBindingSuite.scala
  27. +157 −0 src/cntk-model/src/test/scala/CNTKModelSuite.scala
  28. +74 −0 src/cntk-model/src/test/scala/CNTKTestUtils.scala
  29. +3 −0 src/cntk-train/build.sbt
  30. +23 −0 src/cntk-train/src/main/python/CNTKLearner.py
  31. +117 −0 src/cntk-train/src/main/scala/BrainscriptBuilder.scala
  32. +168 −0 src/cntk-train/src/main/scala/CNTKLearner.scala
  33. +117 −0 src/cntk-train/src/main/scala/CommandBuilders.scala
  34. +173 −0 src/cntk-train/src/main/scala/DataConversion.scala
  35. +41 −0 src/cntk-train/src/main/scala/TypeMapping.scala
  36. +267 −0 src/cntk-train/src/test/scala/ValidateCntkTrain.scala
  37. +28 −0 src/cntk-train/src/test/scala/ValidateConfiguration.scala
  38. +83 −0 src/cntk-train/src/test/scala/ValidateDataConversion.scala
  39. +14 −0 src/cntk-train/src/test/scala/ValidateEnvironmentUtils.scala
  40. +12 −0 src/codegen/build.sbt
  41. +79 −0 src/codegen/src/main/scala/CodeGen.scala
  42. +29 −0 src/codegen/src/main/scala/Config.scala
  43. +345 −0 src/codegen/src/main/scala/PySparkWrapper.scala
  44. +123 −0 src/codegen/src/main/scala/PySparkWrapperGenerator.scala
  45. +275 −0 src/codegen/src/main/scala/PySparkWrapperTest.scala
  46. +3 −0 src/compute-model-statistics/build.sbt
  47. +559 −0 src/compute-model-statistics/src/main/scala/ComputeModelStatistics.scala
  48. +245 −0 src/compute-model-statistics/src/test/scala/VerifyComputeModelStatistics.scala
  49. +3 −0 src/compute-per-instance-statistics/build.sbt
  50. +110 −0 src/compute-per-instance-statistics/src/main/scala/ComputePerInstanceStatistics.scala
  51. +130 −0 src/compute-per-instance-statistics/src/test/scala/VerifyComputePerInstanceStatistics.scala
  52. +1 −0 src/core/build.sbt
  53. +1 −0 src/core/contracts/build.sbt
  54. +35 −0 src/core/contracts/src/main/scala/Exceptions.scala
  55. +47 −0 src/core/contracts/src/main/scala/Metrics.scala
  56. +134 −0 src/core/contracts/src/main/scala/Params.scala
  57. +7 −0 src/core/env/build.sbt
  58. +13 −0 src/core/env/src/main/scala/CodegenTags.scala
  59. +51 −0 src/core/env/src/main/scala/Configuration.scala
  60. +52 −0 src/core/env/src/main/scala/EnvironmentUtils.scala
  61. +139 −0 src/core/env/src/main/scala/FileUtilities.scala
  62. +23 −0 src/core/env/src/main/scala/Logging.scala
  63. +194 −0 src/core/env/src/main/scala/NativeLoader.java
  64. +26 −0 src/core/env/src/main/scala/ProcessUtilities.scala
  65. +1 −0 src/core/hadoop/build.sbt
  66. +176 −0 src/core/hadoop/src/main/scala/HadoopUtils.scala
  67. +3 −0 src/core/ml/build.sbt
  68. +81 −0 src/core/ml/src/test/scala/HashingTFSpec.scala
  69. +103 −0 src/core/ml/src/test/scala/IDFSpec.scala
  70. +74 −0 src/core/ml/src/test/scala/NGramSpec.scala
  71. +102 −0 src/core/ml/src/test/scala/OneHotEncoderSpec.scala
  72. +93 −0 src/core/ml/src/test/scala/Word2VecSpec.scala
  73. +4 −0 src/core/schema/build.sbt
  74. +17 −0 src/core/schema/src/main/python/TypeConversionUtils.py
  75. +69 −0 src/core/schema/src/main/python/Utils.py
  76. +32 −0 src/core/schema/src/main/scala/BinaryFileSchema.scala
  77. +317 −0 src/core/schema/src/main/scala/Categoricals.scala
  78. +68 −0 src/core/schema/src/main/scala/DatasetExtensions.scala
  79. +46 −0 src/core/schema/src/main/scala/ImageSchema.scala
  80. +44 −0 src/core/schema/src/main/scala/SchemaConstants.scala
  81. +352 −0 src/core/schema/src/main/scala/SparkSchema.scala
  82. +131 −0 src/core/schema/src/test/scala/TestCategoricals.scala
  83. +118 −0 src/core/schema/src/test/scala/VerifyFastVectorAssembler.scala
  84. +56 −0 src/core/schema/src/test/scala/VerifySparkSchema.scala
  85. +1 −0 src/core/spark/build.sbt
  86. +70 −0 src/core/spark/src/main/scala/ArrayMapParam.scala
  87. +36 −0 src/core/spark/src/main/scala/EstimatorParam.scala
  88. +154 −0 src/core/spark/src/main/scala/FastVectorAssembler.scala
  89. +74 −0 src/core/spark/src/main/scala/MapArrayParam.scala
  90. +10 −0 src/core/spark/src/main/scala/MetadataUtilities.scala
  91. +58 −0 src/core/spark/src/main/scala/TransformParam.scala
  92. +1 −0 src/core/test/base/build.sbt
  93. +53 −0 src/core/test/base/src/main/scala/SparkSessionFactory.scala
  94. +155 −0 src/core/test/base/src/main/scala/TestBase.scala
  95. +1 −0 src/core/test/build.sbt
  96. +1 −0 src/core/test/datagen/build.sbt
  97. +68 −0 src/core/test/datagen/src/main/scala/DatasetConstraints.scala
  98. +57 −0 src/core/test/datagen/src/main/scala/DatasetOptions.scala
  99. +37 −0 src/core/test/datagen/src/main/scala/GenerateDataType.scala
  100. +114 −0 src/core/test/datagen/src/main/scala/GenerateDataset.scala
  101. +70 −0 src/core/test/datagen/src/main/scala/GenerateRow.scala
  102. +52 −0 src/core/test/datagen/src/main/scala/ModuleFuzzingTest.scala
  103. +46 −0 src/core/test/datagen/src/test/scala/VerifyGenerateDataset.scala
  104. +1 −0 src/data-conversion/build.sbt
  105. +161 −0 src/data-conversion/src/main/scala/DataConversion.scala
  106. +232 −0 src/data-conversion/src/test/scala/VerifyDataConversion.scala
  107. +1 −0 src/downloader/build.sbt
  108. +101 −0 src/downloader/src/main/python/ModelDownloader.py
  109. +260 −0 src/downloader/src/main/scala/ModelDownloader.scala
  110. +92 −0 src/downloader/src/main/scala/Schema.scala
  111. +49 −0 src/downloader/src/test/scala/DownloaderSuite.scala
  112. +3 −0 src/featurize/build.sbt
  113. +499 −0 src/featurize/src/main/scala/AssembleFeatures.scala
  114. +92 −0 src/featurize/src/main/scala/Featurize.scala
  115. +330 −0 src/featurize/src/test/scala/VerifyFeaturize.scala
  116. +12 −0 src/featurize/src/test/scala/benchmarkBasicDataTypes.json
  117. +6 −0 src/featurize/src/test/scala/benchmarkNoOneHot.json
  118. +6 −0 src/featurize/src/test/scala/benchmarkOneHot.json
  119. +5 −0 src/featurize/src/test/scala/benchmarkString.json
  120. +6 −0 src/featurize/src/test/scala/benchmarkStringIndexOneHot.json
  121. +5 −0 src/featurize/src/test/scala/benchmarkStringMissing.json
  122. +7 −0 src/featurize/src/test/scala/benchmarkVectors.json
  123. +3 −0 src/find-best-model/build.sbt
  124. +331 −0 src/find-best-model/src/main/scala/FindBestModel.scala
  125. +106 −0 src/find-best-model/src/test/scala/VerifyFindBestModel.scala
  126. +5 −0 src/fuzzing/build.sbt
  127. +254 −0 src/fuzzing/src/test/scala/Fuzzing.scala
  128. +5 −0 src/image-featurizer/build.sbt
  129. +128 −0 src/image-featurizer/src/main/scala/ImageFeaturizer.scala
  130. +66 −0 src/image-featurizer/src/test/scala/ImageFeaturizerSuite.scala
  131. +2 −0 src/image-transformer/build.sbt
  132. +96 −0 src/image-transformer/src/main/python/ImageTransform.py
  133. +314 −0 src/image-transformer/src/main/scala/ImageTransformer.scala
  134. +70 −0 src/image-transformer/src/main/scala/UnrollImage.scala
  135. +293 −0 src/image-transformer/src/test/scala/ImageTransformerSuite.scala
  136. +1 −0 src/multi-column-adapter/build.sbt
  137. +121 −0 src/multi-column-adapter/src/main/scala/MultiColumnAdapter.scala
  138. +49 −0 src/multi-column-adapter/src/test/scala/MultiColumnAdapterSpec.scala
  139. +1 −0 src/partition-sample/build.sbt
  140. +117 −0 src/partition-sample/src/main/scala/PartitionSample.scala
  141. +67 −0 src/partition-sample/src/test/scala/VerifyPartitionSample.scala
  142. +1 −0 src/pipeline-stages/build.sbt
  143. +42 −0 src/pipeline-stages/src/main/scala/Repartition.scala
  144. +63 −0 src/pipeline-stages/src/main/scala/SelectColumns.scala
  145. +50 −0 src/pipeline-stages/src/test/scala/RepartitionSuite.scala
  146. +75 −0 src/pipeline-stages/src/test/scala/SelectColumnsSuite.scala
  147. +16 −0 src/project/build.sbt
  148. +201 −0 src/project/build.scala
  149. +34 −0 src/project/lib-check.scala
  150. +108 −0 src/project/meta.sbt
  151. +5 −0 src/project/plugins.sbt
  152. +136 −0 src/project/scalastyle.scala
  153. +1 −0 src/readers/build.sbt
  154. +52 −0 src/readers/src/main/python/BinaryFileReader.py
  155. +50 −0 src/readers/src/main/python/ImageReader.py
  156. +72 −0 src/readers/src/main/scala/AzureBlobReader.scala
  157. +53 −0 src/readers/src/main/scala/AzureSQLReader.scala
  158. +79 −0 src/readers/src/main/scala/BinaryFileReader.scala
  159. +12 −0 src/readers/src/main/scala/FileFormat.scala
  160. +63 −0 src/readers/src/main/scala/ImageReader.scala
  161. +47 −0 src/readers/src/main/scala/ReaderUtils.scala
  162. +50 −0 src/readers/src/main/scala/Readers.scala
  163. +47 −0 src/readers/src/main/scala/WasbReader.scala
  164. +44 −0 src/readers/src/test/scala/BinaryFileReaderSuite.scala
  165. +75 −0 src/readers/src/test/scala/ImageReaderSuite.scala
  166. +1 −0 src/summarize-data/build.sbt
  167. +189 −0 src/summarize-data/src/main/scala/SummarizeData.scala
  168. +52 −0 src/summarize-data/src/test/scala/SummarizeDataSuite.scala
  169. +2 −0 src/text-featurizer/build.sbt
  170. +442 −0 src/text-featurizer/src/main/scala/TextFeaturizer.scala
  171. +86 −0 src/text-featurizer/src/test/scala/TextFeaturizerSpec.scala
  172. +3 −0 src/train-classifier/build.sbt
  173. +367 −0 src/train-classifier/src/main/scala/TrainClassifier.scala
  174. +560 −0 src/train-classifier/src/test/scala/VerifyTrainClassifier.scala
  175. +68 −0 src/train-classifier/src/test/scala/benchmarkMetrics.csv
  176. +2 −0 src/train-regressor/build.sbt
  177. +246 −0 src/train-regressor/src/main/scala/TrainRegressor.scala
  178. +184 −0 src/train-regressor/src/test/scala/VerifyTrainRegressor.scala
  179. +1 −0 src/utils/build.sbt
  180. +139 −0 src/utils/src/main/scala/JarLoadingUtils.scala
  181. +71 −0 src/utils/src/main/scala/ObjectUtilities.scala
  182. +55 −0 src/utils/src/main/scala/PipelineUtilities.scala
  183. +37 −0 tools/bin/mml-exec
  184. +58 −0 tools/build-pr/checkout
  185. +39 −0 tools/build-pr/report
  186. +47 −0 tools/build-pr/shared.sh
  187. +274 −0 tools/config.sh
  188. +54 −0 tools/docker/Dockerfile
  189. +203 −0 tools/docker/bin/EULA.txt
  190. +13 −0 tools/docker/bin/eula
  191. +54 −0 tools/docker/bin/eula.html
  192. +37 −0 tools/docker/bin/eula.py
  193. +24 −0 tools/docker/bin/launcher
  194. +49 −0 tools/docker/build-docker
  195. +28 −0 tools/docker/build-env
  196. +165 −0 tools/hdi/install-mmlspark.sh
  197. +34 −0 tools/hdi/setup-test-authkey.sh
  198. +25 −0 tools/hdi/update_livy.py
  199. +67 −0 tools/mmlspark-packages.spec
  200. +110 −0 tools/notebook/postprocess.py
  201. +69 −0 tools/notebook/tester/NotebookTestSuite.py
  202. +36 −0 tools/notebook/tester/TestNotebooksLocally.py
  203. +48 −0 tools/notebook/tester/TestNotebooksOnHdi.py
  204. +32 −0 tools/notebook/tester/parallel_run.sh
  205. +5 −0 tools/pip/MANIFEST.in
  206. +8 −0 tools/pip/README.txt
  207. +29 −0 tools/pip/generate-pip.sh
  208. +33 −0 tools/pip/setup.py
  209. +19 −0 tools/pytests/auto-tests
  210. +11 −0 tools/pytests/notebook-tests
  211. +16 −0 tools/pytests/shared.sh
  212. +4 −0 tools/runme/README.txt
  213. +12 −0 tools/runme/build-readme.tmpl
  214. +249 −0 tools/runme/build.sh
  215. +206 −0 tools/runme/install.sh
  216. +51 −0 tools/runme/runme.sh
  217. +7 −0 tools/runme/show-version
  218. +450 −0 tools/runme/utils.sh
  219. +70 −0 tools/tests/tags.sh
29 .gitignore
@@ -0,0 +1,29 @@
# include BuildArtifacts.zip which is used in some parts of the build
/BuildArtifacts*
/TestResults
# accommodate installing the build environment locally
/pkgs/
# useful env configurations
/tools/local-config.sh
# Generated by tools/build-pr
/.build-pr
# Ignore these for safety
*.class
*.jar
*.log
*.tgz
*.zip
*.exe
*.pyc
*.pyo
# Generic editors
.vscode
# Common things
*~
.#*
.*.swp
.DS_Store
38 CONTRIBUTING.md
@@ -0,0 +1,38 @@
## Interested in contributing to MMLSpark? We're excited to work with you.
### You can contribute in many ways
* Use the library and give feedback
* Report a bug
* Request a feature
* Fix a bug
* Add examples and documentation
* Code a new feature
* Review pull requests
### How to contribute?
You can give feedback, report bugs and request new features anytime by
opening an issue. Also, you can up-vote and comment on existing issues.
To make a pull request into the repo, whether for bug fixes, documentation,
or new features, follow these steps:
* If it's a new feature, open an issue for preliminary discussion with
  us, to ensure your contribution is a good fit and doesn't duplicate
  ongoing work.
* Typically, you'll need to accept the Microsoft Contributor License
  Agreement (CLA).
* Familiarize yourself with the coding style and guidelines.
* Fork the repository, code your contribution, and create a pull
request.
* Wait for an MMLSpark team member to review and accept it. Be patient
  as we iron out the process for a new project.
A good way to get started contributing is to look for issues with a "help
wanted" label. These are issues that we do want to fix but don't currently
have the resources to work on.
*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
22 LICENSE
@@ -0,0 +1,22 @@
MIT License
Copyright (c) Microsoft Corporation. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
177 README.md
@@ -0,0 +1,177 @@
# Microsoft Machine Learning for Apache Spark
<img title="Build Status" src="https://mmlspark.azureedge.net/icons/BuildStatus.png" align="right" />
MMLSpark provides a number of deep learning and data science tools for [Apache
Spark](https://github.com/apache/spark), including seamless integration of Spark
Machine Learning pipelines with [Microsoft Cognitive Toolkit
(CNTK)](https://github.com/Microsoft/CNTK) and [OpenCV](http://www.opencv.org/),
enabling you to quickly create powerful, highly-scalable predictive and
analytical models for large image and text datasets.
MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or
Python 3.5+. See the API documentation
[for Scala](http://mmlspark.azureedge.net/docs/scala/) and
[for PySpark](http://mmlspark.azureedge.net/docs/pyspark/).
## Salient features
* Easily ingest images from HDFS into Spark `DataFrame` ([example:301])
* Pre-process image data using transforms from OpenCV ([example:302])
* Featurize images using pre-trained deep neural nets using CNTK ([example:301])
* Train DNN-based image classification models on N-Series GPU VMs on Azure
([example:301])
* Featurize free-form text data using convenient APIs on top of primitives in
SparkML via a single transformer ([example:201])
* Train classification and regression models easily via implicit featurization
of data ([example:101])
* Compute a rich set of evaluation metrics including per-instance metrics
([example:102])
See our [notebooks](notebooks/samples/) for all examples.
[example:101]: notebooks/samples/101%20-%20Adult%20Census%20Income%20Training.ipynb
"Adult Census Income Training"
[example:102]: notebooks/samples/102%20-%20Regression%20Example%20with%20Flight%20Delay%20Dataset.ipynb
"Regression Example with Flight Delay Dataset"
[example:201]: notebooks/samples/201%20-%20Amazon%20Book%20Reviews%20-%20TextFeaturizer.ipynb
"Amazon Book Reviews - TextFeaturizer"
[example:301]: notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb
"CIFAR10 CNTK CNN Evaluation"
[example:302]: notebooks/samples/302%20-%20Pipeline%20Image%20Transformations.ipynb
"Pipeline Image Transformations"
## A short example
Below is an excerpt from a simple example of using a pre-trained CNN to classify
images in the CIFAR-10 dataset. The full source code is available as [an example
notebook](notebooks/samples/301%20-%20CIFAR10%20CNTK%20CNN%20Evaluation.ipynb).
```python
...
import mmlspark as mml
# Initialize CNTKModel and define input and output columns
cntkModel = mml.CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(modelFile)
# Score the dataset; the model runs as a standard Spark transformer
scoredImages = cntkModel.transform(imagesWithLabels)
...
```
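The same brevity carries over to the classical ML features listed above. The
following sketch is modeled on the sample notebooks (examples 101 and 102); it
assumes that `TrainClassifier` and `ComputeModelStatistics` are exposed through
the `mmlspark` package with these parameter names, and that `train` and `test`
are existing DataFrames:

```python
import mmlspark as mml
from pyspark.ml.classification import LogisticRegression

# Train a classifier; columns of mixed types are featurized implicitly
model = mml.TrainClassifier(model=LogisticRegression(), labelCol="income").fit(train)

# Score the held-out data and compute a rich set of evaluation metrics
metrics = mml.ComputeModelStatistics().transform(model.transform(test))
metrics.show()
```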
See [other sample notebooks](notebooks/samples/) as well as the MMLSpark
documentation for [Scala](http://mmlspark.azureedge.net/docs/scala/)
and [PySpark](http://mmlspark.azureedge.net/docs/pyspark/).
## Setup and installation
### Docker
The easiest way to evaluate MMLSpark is via our pre-built Docker container. To
do so, run the following command:
docker run -it -p 8888:8888 microsoft/mmlspark
Navigate to <http://localhost:8888> in your web browser to run the sample
notebooks. See the
[documentation](http://mmlspark.azureedge.net/docs/pyspark/install.html)
for more on Docker use.
> Note: If you wish to run a new instance of the Docker image, make sure you
> stop & remove the container with the name `my-mml` (using `docker rm my-mml`)
> before you try to run a new instance, or run it with a `--rm` flag.
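For instance, a disposable container that sidesteps the name clash can be
started with the standard Docker flags (`--rm` removes the container on exit,
and `my-mml` is the container name referred to in the note above):

    docker run -it --rm --name my-mml -p 8888:8888 microsoft/mmlspark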
### Spark package
MMLSpark can be conveniently installed on existing Spark clusters via the
`--packages` option. For example:
spark-shell --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
--repositories=https://mmlspark.azureedge.net/maven
pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
--repositories=https://mmlspark.azureedge.net/maven
spark-submit --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
--repositories=https://mmlspark.azureedge.net/maven \
MyApp.jar
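If you prefer to configure this from code rather than from the command line
(for example, in a standalone PySpark script), the same coordinates can be
passed through Spark's generic configuration keys. This is a sketch using
standard Spark options, not an MMLSpark-specific API, and must run before the
session's JVM is created:

```python
from pyspark.sql import SparkSession

# Equivalent of the --packages / --repositories flags above
spark = (SparkSession.builder
         .appName("MyApp")
         .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:0.5")
         .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
         .getOrCreate())
```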
<img title="Script action submission" src="http://i.imgur.com/oQcS0R2.png" align="right" />
### HDInsight
To install MMLSpark on an existing [HDInsight Spark
Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/), you can execute a
script action on the cluster head and worker nodes. For instructions on running
script actions, see [this
guide](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#use-a-script-action-during-cluster-creation).
The script action URL is:
<https://mmlspark.azureedge.net/buildartifacts/0.5/install-mmlspark.sh>.
If you're using the Azure Portal to run the script action, go to `Script
actions` → `Submit new` in the `Overview` section of your cluster blade. In the
`Bash script URI` field, input the script action URL provided above. Mark the
rest of the options as shown on the screenshot to the right.
Submit, and the cluster should finish configuring within 10 minutes or so.
### Databricks cloud
To install MMLSpark on the
[Databricks cloud](http://community.cloud.databricks.com), create a new
[library from Maven coordinates](https://docs.databricks.com/user-guide/libraries.html#libraries-from-maven-pypi-or-spark-packages)
in your workspace.
For the coordinates use: `com.microsoft.ml.spark:mmlspark:0.5`. Then, under
Advanced Options, use `https://mmlspark.azureedge.net/maven` for the repository.
Ensure this library is attached to all clusters you create.
Finally, ensure that your Spark cluster has at least Spark 2.1 and Scala 2.11.
You can use MMLSpark in both your Scala and PySpark notebooks.
### SBT
If you are building a Spark application in Scala, add the following lines to
your `build.sbt`:
```scala
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.5"
```
### Building from source
You can also easily create your own build by cloning this repo and running the
main build script, `./runme`. Run it once to install the needed dependencies,
and again to do a build. See [this guide](docs/developer-readme.md) for more
information.
## Contributing & feedback
This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
see the [Code of Conduct
FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional
questions or comments.
See [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines.
To give feedback and/or report an issue, open a [GitHub
Issue](https://help.github.com/articles/creating-an-issue/).
## Other relevant projects
* [Microsoft Cognitive Toolkit](https://github.com/Microsoft/CNTK)
* [Azure Machine Learning
Operationalization](https://github.com/Azure/Machine-Learning-Operationalization)
*Apache®, Apache Spark, and Spark® are either registered trademarks or
trademarks of the Apache Software Foundation in the United States and/or other
countries.*
80 docs/developer-readme.md
@@ -0,0 +1,80 @@
# MMLSpark
## Repository Layout
* `runme`: main build entry point
* `src/`: scala and python sources
- `core/`: shared functionality
- `project/`: sbt build-related materials
* `tools/`: build-related tools
## Build
### Build Environment
Currently, this code is developed and built on Linux. The main build entry
point, `./runme`, will install the needed packages. When everything is
installed, you can use `./runme` again to do a build.
### Development
From now on, you can continue using `./runme` for builds. Alternatively, use
`sbt full-build` to do the build directly through SBT. The output will show
the individual steps that are running, and you can use them directly as usual
with SBT. For example, use `sbt "project foo-bar" test` to run the tests of
the `foo-bar` sub-project, or `sbt ~compile` to do a full compilation step
whenever any file changes.
Note that the SBT environment is set up in a way that makes *all* code in
`com.microsoft.ml.spark` available in the Scala console that you get when you
run `sbt console`. This can be a very useful debugging tool, since you get to
play with your code in an interactive REPL.
Every once in a while the installed libraries will be updated. In this case,
executing `./runme` will update the libraries, and the next run will do a build
as usual. If you're using `sbt` directly, it will warn you whenever there was
a change to the library configurations.
Note: the libraries are all installed in `$HOME/lib` with a few
executable symlinks in `$HOME/bin`. The environment is configured in
`$HOME/.mmlspark_profile` which will be executed whenever a shell starts.
Occasionally, `./runme` will tell you that there was an update to the
`.mmlspark_profile` file --- when this happens, you can start a new shell
to get the updated version, or apply the changes to your running shell
with `. ~/.mmlspark_profile`, which evaluates its contents and saves you
a shell restart.
## Adding a Module
To add a new module, create a directory with an appropriate name, and in the
new directory create a `build.sbt` file. The contents of `build.sbt` are
optional and can even be empty: its presence makes the build include your
directory as a sub-project in the SBT build.
You can put the usual SBT customizations in your `build.sbt`, for example:
version := "1.0"
name := "A Useful Module"
In addition, there are a few utilities in `Extras` that can be useful for
customizing a sub-project. Currently, there is only one such utility:
Extras.noJar
putting this in your `build.sbt` indicates that no `.jar` file should be
created for your sub-project in the `package` step. (Useful, for example, for
build tools and test-only directories.)
Finally, whenever SBT runs it generates an `autogen.sbt` file that specifies
the sub-projects. This file is generated automatically so there is no need to
edit a central file when you add a module, and therefore customizing what
appears in it is done via "meta comments" in your `build.sbt`. This is
currently used to specify dependencies for your sub-project --- in most cases
you will want to add this:
//> DependsOn: core
to use the shared code in the `core` sub-project.