Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 05. Train in Spark
* Create Workspace
* Create Experiment
* Copy relevant files to the script folder
* Configure and Run

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration](../../../configuration.ipynb) notebook first, if you haven't already, to establish your connection to the AzureML Workspace.

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.13.0


## Initialize Workspace

Initialize a workspace object from persisted configuration.

In [13]:
from azureml.core import Workspace
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

zhenzhuUKSouth
zhenzhuUKSouth
uksouth
e9b2ec51-5c94-4fa8-809a-dc1e695e4896
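
`Workspace.from_config()` reads the workspace details from a `config.json` file containing the keys `subscription_id`, `resource_group`, and `workspace_name`. If the call fails, a quick check such as the sketch below can show which key is missing. The helper `missing_config_keys` and the placeholder values are illustrative assumptions, not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys that Workspace.from_config() expects to find in config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required keys that are absent from a config.json file."""
    with open(config_path, "r") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if key not in config]


# Demo with placeholder values (replace with your own workspace details).
demo_dir = Path(tempfile.mkdtemp())
demo_config = demo_dir / "config.json"
demo_config.write_text(json.dumps({
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}))

print(missing_config_keys(demo_config))  # prints [] when nothing is missing
```

To check your own file, pass its path instead of the temporary demo file.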


## Create Experiment


In [14]:
experiment_name = 'train-on-spark-mmlspark'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

## View `train-spark.py`

For convenience, we created a training script for you. It is printed below as text, but you can also run `%pfile ./train-spark.py` in a cell to show the file.

In [3]:
script = 'train-spark.py'
with open(script, 'r') as training_script:
    print(training_script.read())

# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license.

import numpy as np
import pyspark
import os
import urllib
import sys

from pyspark.sql.functions import *
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.feature import *
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

from azureml.core.run import Run

# initialize logger
run = Run.get_context()

# start Spark session
spark = pyspark.sql.SparkSession.builder.appName('Iris').getOrCreate()

# print runtime versions
print('****************')
print('Python version: {}'.format(sys.version))
print('Spark version: {}'.format(spark.version))
print('****************')

# load iris.csv into Spark dataframe
schema = StructType([
    StructField("sepal-length", DoubleType()),
    StructField("sepal-width", DoubleType()),
    StructField("petal-length", DoubleType()),
    StructField("petal-width", Dou

## Configure & Run

**Note:** You can use Docker-based execution to run the Spark job on a local computer or a remote VM. See the `train-in-remote-vm` notebook for an example of how to configure and run in Docker mode on a VM. Make sure you choose a Docker image that has Spark installed, such as `microsoft/mmlspark:0.12`.

### Attach an AML Compute


In [17]:
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "spark-low--cpu"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4, 
                                                           vm_priority="lowpriority",
                                                           idle_seconds_before_scaledown=2400)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Configure the AzureML Environment and Custom Docker Image

Configure a custom AzureML environment to use the [mmlspark](https://mmlspark.blob.core.windows.net/website/index.html#install) Docker image.

In [18]:
from azureml.core.environment import Environment
abesparksenv = Environment(name="abesparksenv")
# Specify custom Docker base image and registry, if you don't want to use the defaults
#abesparksenv.docker.base_image="mcr.microsoft.com/mmlspark/release"
abesparksenv.python.user_managed_dependencies = True

### Choose the base image
Most Dockerfiles start from a parent image, rather than from scratch. Azure ML has a set of maintained images that serve as base images for training and inference. You can find information on these images, including the Dockerfiles used to build them, at this GitHub repo [Azure/AzureML-Containers](https://github.com/Azure/AzureML-Containers). Depending on your scenario, you should pick an appropriate CPU or GPU image to use as the parent image. The base images include Miniconda, and the GPU base images include the necessary GPU drivers needed to run GPU jobs on Azure ML. [Here](https://github.com/Azure/AzureML-Containers#featured-tags) is the list of images and associated tags for the Azure ML base images.

As an example, we will use one of the mmlspark images as the parent image for our Dockerfile:

```
FROM mcr.microsoft.com/mmlspark/release
```

### Specify conda dependencies
Running on Azure ML requires some Python packages, so we add an instruction that installs them. The instruction uses Conda, an open-source package management system, to pin the pip version, and then installs the packages with pip:

1. **azureml-defaults**, a lightweight version of the full Azure ML Python SDK that includes the `azureml-core` and `applicationinsights` packages required for tasks such as logging metrics, uploading artifacts, accessing datastores from within runs, and inference. `azureml-defaults` should be sufficient for most remote training and deployment scenarios; if for some reason you need the full SDK, you can specify pip installing the full `azureml-sdk` package instead.

```
FROM mcr.microsoft.com/mmlspark/release
RUN conda install -y pip=20.1.1 && \
    conda clean -ay && \
    pip install --no-cache-dir azureml-defaults azureml-core
```
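
The environment configured in the next cell points at `./Dockerfile`, so that file needs to exist in the notebook folder. If it does not, a minimal sketch like the following can create it from Python. The constant `DOCKERFILE_TEXT` and the helper `write_dockerfile` are illustrative names, and both packages are installed in a single `pip install` command.

```python
from pathlib import Path

# Dockerfile contents: mmlspark parent image plus the Azure ML packages.
DOCKERFILE_TEXT = """\
FROM mcr.microsoft.com/mmlspark/release
RUN conda install -y pip=20.1.1 && \\
    conda clean -ay && \\
    pip install --no-cache-dir azureml-defaults azureml-core
"""


def write_dockerfile(path="./Dockerfile"):
    """Write the Dockerfile text to `path` and return the resolved path."""
    target = Path(path)
    target.write_text(DOCKERFILE_TEXT)
    return target.resolve()


dockerfile_path = write_dockerfile()
print(dockerfile_path)
```

Running this overwrites any existing `./Dockerfile`, so pass a different path if you already maintain one by hand.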


In [19]:
# Unset the base image and build from the Dockerfile in this folder instead
abesparksenv.docker.base_image = None
abesparksenv.docker.base_dockerfile = "./Dockerfile"

### Submit the script to AzureML Compute

In [21]:
from azureml.core import ScriptRunConfig, Environment

script_run_config = ScriptRunConfig(source_directory='.',
                                    script=script,
                                    environment=abesparksenv,
                                    compute_target=cpu_cluster)
run = exp.submit(config=script_run_config)

Monitor the run using a Jupyter widget.

In [22]:
from azureml.widgets import RunDetails
RunDetails(run).show()

Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).
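
If the widget is not available (for example, when running outside Jupyter), the run can also be followed by polling its status. The sketch below assumes only that the object passed in has a `get_status()` method returning a status string, as `azureml.core.Run` does. The helper `wait_until_terminal` and its default intervals are hypothetical choices.

```python
import time

# Terminal states reported by Run.get_status().
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_terminal(run, poll_seconds=30, timeout_seconds=3600):
    """Poll run.get_status() until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = run.get_status()
        print("Run status:", status)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Run did not finish within {} seconds".format(timeout_seconds))
        time.sleep(poll_seconds)
```

Call it as `final_status = wait_until_terminal(run)`. The SDK's own `run.wait_for_completion(show_output=True)` is a built-in alternative that also streams the logs.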

After the run has finished successfully, you can check the logged metrics.

In [29]:
# get all metrics logged in the run
metrics = run.get_metrics()
print(metrics)

{'Regularization Rate': 0.01, 'Accuracy': 0.9069767441860465}
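
Before registering the model, you may want to confirm that the logged accuracy is good enough. The following is a minimal sketch using the metric values printed above. The threshold `MIN_ACCURACY` and the helper `should_register` are hypothetical and not part of the AzureML SDK.

```python
# Metric values as printed by run.get_metrics() above.
example_metrics = {'Regularization Rate': 0.01, 'Accuracy': 0.9069767441860465}

# Hypothetical acceptance threshold; choose a value that suits your use case.
MIN_ACCURACY = 0.9


def should_register(metrics, min_accuracy=MIN_ACCURACY):
    """Return True when the logged accuracy meets the threshold."""
    accuracy = metrics.get('Accuracy')
    return accuracy is not None and accuracy >= min_accuracy


print(should_register(example_metrics))  # prints True (0.9069... >= 0.9)
```

In the notebook, you would pass the `metrics` dictionary returned by `run.get_metrics()` and register the model only when the check passes.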


In [30]:
# register the generated model
model = run.register_model(model_name='iris.model', model_path='outputs/iris.model')