Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 05. Train in Spark
* Create Workspace
* Create Project
* Create `train-spark.py` file in the project folder
* Execute a PySpark script in ACI
* Execute a PySpark script in a Docker container on remote DSVM
* Execute a PySpark script in HDI

## Prerequisites
If you haven't already, go through the [00. Installation and Configuration](00.configuration.ipynb) notebook first.

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Initialize a workspace object from persisted configuration.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## Create Project and Associate with Run History
A **project** is a local folder that contains the files for your Azure ML experiments. It is associated with a **run history**, a cloud container for the run metrics and output artifacts of your experiments. You can either attach a local folder as a new project or, if the folder has been attached before, load it as a project.

In [None]:
# choose a name for the run history container in the workspace
experiment_name = 'train-on-spark'

# project folder
project_folder = './sample_projects/train-on-spark'

In [None]:
import os
from azureml.project.project import Project

project = Project.attach(workspace_object = ws,
                         experiment_name = experiment_name,
                         directory = project_folder)

print(project.project_directory, project.history.name, sep = '\n')

## Copy files


Copy `train-spark.py` and `iris.csv` into the project folder.
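
The copy step assumes that both files sit next to this notebook. As a minimal sketch, a pre-check like the one below can report which files are missing before anything is copied. The helper name `missing_files` is hypothetical and is not part of the Azure ML SDK.

```python
import os

def missing_files(folder, names):
    """Return the entries of `names` that do not exist as files in `folder`.

    Hypothetical helper for illustration; not part of the Azure ML SDK.
    """
    return [name for name in names
            if not os.path.isfile(os.path.join(folder, name))]
```

Calling `missing_files('.', ['iris.csv', 'train-spark.py'])` should return an empty list if both files are present in the notebook's folder.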

In [None]:
from shutil import copyfile

# copy the iris dataset into the project folder
copyfile('./iris.csv', os.path.join(project_folder, 'iris.csv'))

# copy train-spark.py file into project folder
# train-spark.py trains a simple LogisticRegression model using Spark ML
copyfile('./train-spark.py', os.path.join(project_folder, 'train-spark.py'))

Review the train-spark.py file in the project folder.

In [None]:
with open(os.path.join(project_folder, 'train-spark.py'), 'r') as fin:
    print(fin.read())

## Configure & Run

### Configure ACI target

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# create a new runconfig object
run_config = RunConfiguration()

# signal that you want to use ACI to execute script.
run_config.target = "containerinstance"

# ACI container groups are only supported in certain regions, which can differ from the region the Workspace is in.
run_config.container_instance.region = 'eastus'

# set the ACI CPU and Memory 
run_config.container_instance.cpu_cores = 1
run_config.container_instance.memory_gb = 2

# enable Docker 
run_config.environment.docker.enabled = True

# set Docker base image to the default CPU-based image
run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_MMLSPARK_CPU_IMAGE
print('base image is', run_config.environment.docker.base_image)
#run_config.environment.docker.base_image = 'microsoft/mmlspark:plus-0.9.9'

# use conda_dependencies.yml to create a conda environment in the Docker image for execution
# please update this file if you need additional packages.
run_config.environment.python.user_managed_dependencies = False

# auto-prepare the Docker image when used for execution (if it is not already prepared)
run_config.auto_prepare_environment = True

cd = CondaDependencies()
# add numpy as a dependency
cd.add_conda_package('numpy')
# overwrite the default conda_dependencies.yml file
cd.save_to_file(base_directory = project_folder, conda_file_path='aml_config/conda_dependencies.yml')


### Run Spark job in ACI

In [None]:
%%time 
from azureml.core.experiment import Experiment
from azureml.core.script_run_config import ScriptRunConfig

# create the experiment from the workspace and experiment name defined earlier
experiment = Experiment(ws, experiment_name)
script_run_config = ScriptRunConfig(source_directory = project.project_directory,
                                    script= 'train-spark.py',
                                    run_config = run_config)
run = experiment.submit(script_run_config)


In [None]:
run.wait_for_completion(show_output = True)

### Show the run in the web UI
**IMPORTANT**: Please use Chrome to navigate to the URL.

In [None]:
# import helpers.py
import helpers

# get the URL of the run history web page
print(helpers.get_run_history_url(run))

### Attach a remote Linux VM
To use a remote Docker compute target:
 1. Create a Linux DSVM in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-create-dsvm-hdi). Make sure you use the Ubuntu flavor, NOT CentOS.
 2. Enter the IP address, username, and password below.
 
**Note**: The example below uses port 5022. By default, SSH runs on port 22 and you don't need to specify it. If you switch to a different port for security reasons (such as 5022), append the port number to the address as shown in the example below. [Read more](../../documentation/sdk/ssh-issue.md) on this.
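
To make the `host:port` address format concrete, here is a minimal sketch that splits such an address into its parts and falls back to port 22 when none is given. The helper `split_ssh_address` is hypothetical and is not part of the Azure ML SDK. It assumes a hostname or IPv4 address, so it does not handle IPv6 literals.

```python
def split_ssh_address(address, default_port=22):
    """Split 'host[:port]' into a (host, port) tuple.

    Hypothetical helper for illustration; not part of the Azure ML SDK.
    Assumes a hostname or IPv4 address (no IPv6 literals).
    """
    host, sep, port = address.rpartition(':')
    if not sep:
        # no colon present, so the whole string is the host
        return address, default_port
    return host, int(port)
```

For the address used in this notebook, `split_ssh_address("hai2.eastus2.cloudapp.azure.com:5022")` yields the host name and the integer port 5022.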

In [None]:
from azureml.core.compute import RemoteCompute
# assumption: UserErrorException lives in azureml.exceptions in this SDK version
from azureml.exceptions import UserErrorException

try:
    # Attach the remote VM as a compute target; jobs run in Docker on the VM.
    # The non-default SSH port (5022) is appended to the address.
    RemoteCompute.attach(ws, name = "cpu-dsvm",
                         username = "ninghai",
                         address = "hai2.eastus2.cloudapp.azure.com:5022",
                         password = "<password>")
except UserErrorException as e:
    print("Caught = {}".format(e.message))
    print("Compute config already attached.")

### Configure a Spark Docker run on the VM
Run the script on the Spark engine inside a Docker container on the VM.

In [None]:
# Load the "cpu-dsvm.runconfig" file (created by the above attach operation) in memory
run_config = RunConfiguration.load(path = project_folder, name = "cpu-dsvm")

# set framework to PySpark
run_config.framework = "PySpark"

# Use Docker in the remote VM
run_config.environment.docker.enabled = True

# Use the MMLSpark CPU based image.
# https://hub.docker.com/r/microsoft/mmlspark/
run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_MMLSPARK_CPU_IMAGE
print('base image is:', run_config.environment.docker.base_image)

# do NOT use a user-managed environment;
# let the system provision a conda environment from the conda_dependencies.yml file
run_config.environment.python.user_managed_dependencies = False

# Prepare the Docker image and conda environment automatically when executed for the first time.
run_config.auto_prepare_environment = True

### Submit the Experiment
Submit the script to run on the Spark engine in the Docker container on the remote VM.

In [None]:
script_run_config = ScriptRunConfig(source_directory = project.project_directory,
                                    script= 'train-spark.py',
                                    run_config = run_config)
run = experiment.submit(script_run_config)

run.wait_for_completion(show_output = True)

In [None]:
# get the URL of the run history web page
print(helpers.get_run_history_url(run))

### Attach an HDI cluster
To use an HDI compute target:
 1. Create a Spark cluster in HDI in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-create-dsvm-hdi). Make sure you use the Ubuntu flavor, NOT CentOS.
 2. Enter the address, username, and password below.
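
Rather than typing a password directly into a notebook cell, one option is to read it from an environment variable and prompt only when it is unset. The sketch below illustrates that idea. The helper `read_password` and the variable name `HDI_SSH_PASSWORD` are hypothetical and are not defined by the Azure ML SDK.

```python
import os
import getpass

def read_password(env_var="HDI_SSH_PASSWORD"):
    """Return the password stored in `env_var`, prompting if it is unset.

    Hypothetical helper for illustration; the default variable name is an
    assumption and is not defined by the Azure ML SDK.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    # fall back to an interactive prompt that does not echo the input
    return getpass.getpass("Password ({}): ".format(env_var))
```

The returned string could then be passed as the `password` argument in the attach call, which keeps the secret out of the saved notebook.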

In [None]:
from azureml.core.compute import HDInsightCompute
# assumption: UserErrorException lives in azureml.exceptions in this SDK version
from azureml.exceptions import UserErrorException

try:
    # Attach an HDI cluster as a compute target.
    HDInsightCompute.attach(ws, name = "myhdi",
                            username = "ninghai",
                            address = "sparkhai-ssh.azurehdinsight.net",
                            password = "<pwd>")
except UserErrorException as e:
    print("Caught = {}".format(e.message))
    print("Compute config already attached.")

### Configure HDI run

In [None]:
# load the runconfig object from the "myhdi.runconfig" file generated by the attach operation above.
run_config = RunConfiguration.load(path = project_folder, name = 'myhdi')

# ask the system to prepare the conda environment automatically when executed for the first time
run_config.auto_prepare_environment = True

### Submit the script to HDI

In [None]:
script_run_config = ScriptRunConfig(source_directory = project.project_directory,
                                    script= 'train-spark.py',
                                    run_config = run_config)
run = experiment.submit(script_run_config)

run.wait_for_completion(show_output = True)

In [None]:
# get the URL of the run history web page
print(helpers.get_run_history_url(run))

In [None]:
# get all metrics logged in the run
metrics = run.get_metrics()
print(metrics)
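
Printing the raw dictionary can be hard to read once a run logs several metrics. As a minimal sketch, assuming `run.get_metrics()` returns a dictionary that maps each metric name to a single value or a list of values, the hypothetical helper below renders one sorted line per metric.

```python
def format_metrics(metrics):
    """Return one 'name: value' string per metric, sorted by name.

    Hypothetical helper for illustration; assumes `metrics` maps names to
    either a single value or a list/tuple of values.
    """
    lines = []
    for name in sorted(metrics):
        value = metrics[name]
        if isinstance(value, (list, tuple)):
            rendered = ", ".join(str(v) for v in value)
        else:
            rendered = str(value)
        lines.append("{}: {}".format(name, rendered))
    return lines
```

In the notebook this could be used as `print("\n".join(format_metrics(metrics)))`.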