NOTE This content is no longer maintained. Visit the Azure Machine Learning Notebook project for sample Jupyter notebooks for ML and deep learning with Azure Machine Learning.
This sample demonstrates the power of simplification by implementing a binary classifier on the popular Adult Census dataset, first with the open-source mmlspark Spark package and then with standard Spark ML constructs.
As a quick comparison, here is the one-line training code using mmlspark, clean and simple:
model = TrainClassifier(model=LogisticRegression(regParam=reg), labelCol=" income", numFeatures=256).fit(train)
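For reference, this one-liner assumes the following surrounding setup; the mmlspark import path shown here may differ between package versions, and the data split is only a hypothetical sketch:
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier           # newer releases may expose this under a different module

reg = 0.1                                      # regularization strength, passed on the command line in this sample
train, test = data.randomSplit([0.75, 0.25])   # hypothetical split of the Adult Census DataFrame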
And here is the equivalent code in standard Spark ML. Notice the one-hot encoding, string-indexing and vectorization that you have to do on the training data:
# imports needed for the standard Spark ML pipeline below
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# create a new Logistic Regression model
lr = LogisticRegression(regParam=reg)
# string-index and one-hot encode the education column
si1 = StringIndexer(inputCol=' education', outputCol='ed')
ohe1 = OneHotEncoder(inputCol='ed', outputCol='ed-encoded')
# string-index and one-hot encode the marital-status column
si2 = StringIndexer(inputCol=' marital-status', outputCol='ms')
ohe2 = OneHotEncoder(inputCol='ms', outputCol='ms-encoded')
# string-index the label column into a column named "label"
si3 = StringIndexer(inputCol=' income', outputCol='label')
# assemble the encoded feature columns into a column named "features"
assembler = VectorAssembler(inputCols=['ed-encoded', 'ms-encoded', ' hours-per-week'], outputCol="features")
# put together the pipeline
pipe = Pipeline(stages=[si1, ohe1, si2, ohe2, si3, assembler, lr])
# train the model
model = pipe.fit(train)
To learn more about the mmlspark Spark package, visit: http://github.com/azure/mmlspark.
Metrics can be automatically logged from MMLSpark in Run History with the modules logging package.
When the ComputeModelStatistics function is executed, the metrics appear in the run automatically.
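For illustration, a typical evaluation step looks roughly like this (a minimal sketch; the ComputeModelStatistics import path may differ between mmlspark versions):
from mmlspark import ComputeModelStatistics    # newer releases may expose this under a different module

# score the held-out test data and compute standard classification metrics;
# with the modules logging package configured (see below), these metrics
# are also recorded in Run History
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.show()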
To add the modules logging package:
- Add a log4j.properties file to use the AmlAppender and AmlLayout (a sketch of such a file appears after this list)
- Add the modules logging package to the spark_dependencies.yml file:
- group: "com.microsoft.moduleslogging"
  artifact: "modules-logging_2.11"
  version: "1.0.0024"
- Configure log4j to use the log4j.properties file in train_mmlspark.py:
spark._jvm.org.apache.log4j.PropertyConfigurator.configure(os.getcwd() + "/log4j.properties")
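As a rough illustration, a minimal log4j.properties might look like the following; the fully qualified class names are placeholders and should be taken from the modules logging package itself:
# hypothetical log4j.properties sketch; replace the <...> placeholders with the
# actual AmlAppender and AmlLayout classes shipped in the modules logging package
log4j.rootLogger=INFO, aml
log4j.appender.aml=<fully qualified AmlAppender class>
log4j.appender.aml.layout=<fully qualified AmlLayout class>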
Run train_mmlspark.py in a local Docker container.
$ az ml experiment submit -c docker train_mmlspark.py 0.1
Configure a compute environment named myvm targeting a Docker container running on a remote VM.
$ az ml computetarget attach --name myvm --address <ip address or FQDN> --username <username> --password <pwd> --type remotedocker
# prepare the environment
$ az ml experiment prepare -c myvm
Run train_mmlspark.py in a Docker container (with Spark) in a remote VM:
$ az ml experiment submit -c myvm train_mmlspark.py 0.3
Configure a compute environment named myhdi targeting an HDInsight Spark cluster.
$ az ml computetarget attach --name myhdi --address <ip address or FQDN of the head node> --username <username> --password <pwd> --type cluster
# prepare the environment
$ az ml experiment prepare -c myhdi
Run it in a remote HDInsight cluster:
$ az ml experiment submit -c myhdi train_mmlspark.py 0.5
Get the run id of the train_mmlspark.py job from run history.
$ az ml history list -o table
And promote the trained model using the run id.
$ az ml history promote -ap ./outputs/AdultCensus.mml -n AdultCensusModel -r <run id>
Download the model to a directory.
$ az ml asset download -l ./assets/AdultCensusModel.link -d mmlspark_model
Note: The download step may fail if file paths within the project folder become too long. If that happens, create the project closer to the file system root, for example C:/AzureML/Income.
Promote the schema file.
$ az ml history promote -ap ./outputs/service_schema.json -n service_schema.json -r <run id>
Download the schema.
$ az ml asset download -l ./assets/service_schema.json.link -d mmlspark_schema
Run score_mmlspark.py in local Docker. Check the output of the job for results.
$ az ml experiment submit -c docker score_mmlspark.py
If you have not set up a Model Management deployment environment, see the Set up Model Management document under Deploy Models on the documentation page.
If you have already set up an environment, look up its name and resource group:
$ az ml env list
Set the deployment environment:
$ az ml env set -n <environment cluster name> -g <resource group>
Deploy the web service
$ az ml service create realtime -f score_mmlspark.py -m mmlspark_model -s mmlspark_schema/service_schema.json -r spark-py -n mmlsparkservice -c aml_config/conda_dependencies.yml
Use the Sample CLI command from the output of the previous call to test the web service.
$ az ml service run realtime -i mmlsparkservice -d "{\"input_df\": [{\" hours-per-week\": 35.0, \" education\": \"10th\", \" marital-status\": \"Married-civ-spouse\"}]}"
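The same request can also be sent over plain HTTP. Below is a rough sketch using Python's requests library; the scoring URL and authorization key are placeholders to be taken from the deployed service's details:
import json
import requests

# placeholders: take the real values from the deployed mmlsparkservice details
scoring_url = "<scoring URL of mmlsparkservice>"
api_key = "<service authorization key>"

payload = {"input_df": [{" hours-per-week": 35.0, " education": "10th", " marital-status": "Married-civ-spouse"}]}
headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}

# post the same sample record that the CLI command above sends
response = requests.post(scoring_url, data=json.dumps(payload), headers=headers)
print(response.text)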