Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 05. Train in Spark
* Create Workspace
* Create Experiment
* Copy relevant files to the script folder
* Configure and Run

## Prerequisites
If you have not already done so, complete the [00. Installation and Configuration](00.configuration.ipynb) notebook first.

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Initialize a workspace object from persisted configuration.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')
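
`Workspace.from_config()` reads a JSON file that identifies the workspace. As a minimal sketch, assuming that file carries the keys `subscription_id`, `resource_group`, and `workspace_name` (check the file written by your own configuration step), the hypothetical helper below reports which of them are missing or empty. The helper name is illustrative and is not part of the SDK.

```python
import json
from pathlib import Path

# Keys assumed to be present in the workspace config file (an assumption;
# verify against the file produced by the configuration notebook).
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required keys that are absent or empty in the config file."""
    config = json.loads(Path(config_path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

Pass the path of your own config file; an empty list means all three keys are set.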

## Create Experiment


In [None]:
experiment_name = 'train-on-spark'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

## View `train-spark.py`

For convenience, we created a training script for you. It is printed below as text, but you can also run `%pfile ./train-spark.py` in a cell to show the file.

In [None]:
with open('train-spark.py', 'r') as training_script:
    print(training_script.read())

## Configure & Run

### Configure an ACI run
Before you try running on an actual Spark cluster, you can use a Docker image with Spark already baked in and run it in ACI (Azure Container Instances).

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# use pyspark framework
aci_run_config = RunConfiguration(framework="pyspark")

# use ACI to run the Spark job
aci_run_config.target = 'containerinstance'
aci_run_config.container_instance.region = 'eastus2'
aci_run_config.container_instance.cpu_cores = 1
aci_run_config.container_instance.memory_gb = 2

# specify base Docker image to use
aci_run_config.environment.docker.enabled = True
aci_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_MMLSPARK_CPU_IMAGE

# specify CondaDependencies
cd = CondaDependencies()
cd.add_conda_package('numpy')
aci_run_config.environment.python.conda_dependencies = cd

### Submit script to ACI to run

In [None]:
from azureml.core import ScriptRunConfig

script_run_config = ScriptRunConfig(source_directory='.',
                                    script='train-spark.py',
                                    run_config=aci_run_config)
run = exp.submit(script_run_config)

In [None]:
run

In [None]:
run.wait_for_completion(show_output=True)
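
After `wait_for_completion` returns, it can be useful to confirm the final status explicitly before moving on. The following is a minimal sketch, assuming the run object exposes `get_status()` and that a successful run reports the string `Completed`. The helper name `require_completed` is hypothetical and not part of the SDK.

```python
def require_completed(run):
    """Raise if the run did not finish with the status 'Completed'."""
    status = run.get_status()
    if status != "Completed":
        raise RuntimeError("Run ended with status: {}".format(status))
    return status
```

Calling `require_completed(run)` in its own cell stops the notebook early instead of letting later cells work with a failed run.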

**Note:** You can also create a new VM, or attach an existing VM, and use Docker-based execution to run the Spark job. See the `04.train-in-vm` notebook for an example of how to configure and run in Docker mode in a VM.

### Attach an HDI cluster
Now we can use a real Spark cluster, HDInsight for Spark, to run this job. To use an HDI compute target:
 1. Create an HDInsight Spark cluster in Azure. Here are some [quick instructions](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql). Make sure you use the Ubuntu flavor, NOT CentOS.
 2. Enter the cluster address, username, and password below.

In [None]:
from azureml.core.compute import HDInsightCompute
from azureml.exceptions import ComputeTargetException

try:
    # if you want to connect using an SSH key instead of username/password,
    # you can provide the parameters private_key_file and private_key_passphrase
    hdi_compute = HDInsightCompute.attach(workspace=ws,
                                          name="myhdi",
                                          address="<myhdi-ssh>.azurehdinsight.net",
                                          ssh_port=22,
                                          username='<ssh-username>',
                                          password='<ssh-pwd>')
    # wait only when attach succeeded; otherwise hdi_compute is undefined
    hdi_compute.wait_for_completion(show_output=True)
except ComputeTargetException as e:
    print("Caught = {}".format(e.message))
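
The comment in the cell above mentions key-based login. As a sketch only, the hypothetical helper below assembles the keyword arguments for `HDInsightCompute.attach` in either mode, using the parameter names from that cell (`password`, `private_key_file`, `private_key_passphrase`). It assumes the username is still required with an SSH key, as with ordinary SSH; confirm this against the SDK version you have installed.

```python
def build_attach_kwargs(address, username, password=None,
                        private_key_file=None, private_key_passphrase=None,
                        ssh_port=22):
    """Build keyword arguments for attaching an HDI cluster (illustrative helper)."""
    # Exactly one authentication mode must be chosen.
    if bool(password) == bool(private_key_file):
        raise ValueError("Provide exactly one of password or private_key_file")
    kwargs = {"address": address, "ssh_port": ssh_port, "username": username}
    if password:
        kwargs["password"] = password
    else:
        kwargs["private_key_file"] = private_key_file
        if private_key_passphrase:
            kwargs["private_key_passphrase"] = private_key_passphrase
    return kwargs
```

The result can then be unpacked into the call, for example `HDInsightCompute.attach(workspace=ws, name="myhdi", **kwargs)`.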

### Configure HDI run

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies


# use pyspark framework
hdi_run_config = RunConfiguration(framework="pyspark")

# Set compute target to the HDI cluster
hdi_run_config.target = hdi_compute.name

# specify a CondaDependencies object so that the system installs numpy
cd = CondaDependencies()
cd.add_conda_package('numpy')
hdi_run_config.environment.python.conda_dependencies = cd

### Submit the script to HDI

In [None]:
from azureml.core import ScriptRunConfig

script_run_config = ScriptRunConfig(source_directory='.',
                                    script='train-spark.py',
                                    run_config=hdi_run_config)
run = exp.submit(config=script_run_config)

In [None]:
# get the URL of the run history web page
run

In [None]:
# get all metrics logged in the run
metrics = run.get_metrics()
print(metrics)
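
The dictionary returned by `run.get_metrics()` maps metric names to values. As a minimal sketch, assuming a metric logged several times comes back as a list while a metric logged once comes back as a single value, the hypothetical helper below keeps only the most recent value of each metric. The sample dictionary uses made-up names and numbers purely for illustration; it is not output from `train-spark.py`.

```python
def latest_metric_values(metrics):
    """Return a dict with the last logged value for each metric name."""
    latest = {}
    for name, value in metrics.items():
        if isinstance(value, (list, tuple)):
            if value:  # skip metrics with no recorded values
                latest[name] = value[-1]
        else:
            latest[name] = value
    return latest


# Made-up sample, not real run output.
sample = {"metric_a": [0.5, 0.75], "metric_b": 3}
print(latest_metric_values(sample))
```

With a real run, pass the result of `run.get_metrics()` in place of `sample`.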