## Training DLWP on Azure with Microsoft Azure Machine Learning service
To get started with the Microsoft Azure Machine Learning service, refer to the [Microsoft documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/).

First, let's import the core AzureML Python modules.

In [None]:
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

import os

#### Set the parameters for our model run
Here we set the directory where the dataset of predictor/target data is stored, the name of that dataset, and the name of the model to save. The optional `tags` dictionary records a few parameters for easy reference in the list of experiment runs.

In [None]:
data_directory = '/home/disk/wave2/jweyn/Data/Azure'
predictor_file = 'cfs_6h_CS48_1979-2010_z3-5-7-10_tau_sfc.nc'
model_file = 'dlwp_CS48_tau_UNET'
log_file = 'logs/CS48_tau_UNET'
tags = {'in': 'z-tau', 'out': 'z-tau', 'arch': 'UNET'}

#### Create or import a workspace
In this example, we assume a workspace already exists, but it is easy to create a workspace on-the-fly with `Workspace.create()`. Use environment variables to load sensitive information such as `subscription_id` and authentication passwords.

In [None]:
ws = Workspace.get(
    name='dlwp-ml-1',
    subscription_id=os.environ.get('AZURE_SUBSCRIPTION_ID'),
    resource_group='DLWP'
)
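Because `os.environ.get` returns `None` when a variable is unset, a missing subscription ID only shows up later as a less obvious error from the service. A minimal sketch of a guard, assuming a hypothetical helper named `require_env` (not part of AzureML), could look like this:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail with a clear message.

    Hypothetical helper for illustration; it is not part of the AzureML SDK.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError("environment variable %s is not set" % name)
    return value


# Possible use with the workspace call above:
# ws = Workspace.get(
#     name='dlwp-ml-1',
#     subscription_id=require_env('AZURE_SUBSCRIPTION_ID'),
#     resource_group='DLWP'
# )
```

Failing early here keeps the sensitive value out of the notebook while still giving a readable message when it is missing.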

#### Set up the compute cluster
This code, adapted from the Microsoft documentation example, checks for existing compute resources in the workspace or creates them if they do not exist. We use GPU nodes, of which there are a few choices:
- STANDARD_NC6: Tesla K80
- STANDARD_NC6s_v2: Tesla P100
- STANDARD_NC6s_v3: Tesla V100
- STANDARD_ND6s: Tesla P40
- STANDARD_NV6: Tesla M60

In [None]:
# Name of the cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "compute-NC12")
# Environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 2))

# Set a GPU VM type
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_NV6")

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target (%s)' % compute_name)
else:
    print('creating a new compute target (%s)' % compute_name)
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=10)

#### Copy data to the compute cluster
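This optional step is needed only if the data has not yet been uploaded to the blob storage connected to the workspace. Uncomment the `ds.upload` call to copy the contents of `data_directory` to the `DLWP` folder of the default datastore.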

In [None]:
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)

# ds.upload(src_dir=data_directory, target_path='DLWP', overwrite=False, show_progress=True)
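Since the upload can move a large amount of data, it may help to check what is in the local directory first. The sketch below uses only the standard library; `list_upload_files` is a hypothetical helper introduced for illustration and does not call AzureML.

```python
import os


def list_upload_files(src_dir):
    """Return a sorted list of (relative_path, size_in_bytes) for files under src_dir.

    Hypothetical helper for illustration; it only inspects the local file system.
    """
    if not os.path.isdir(src_dir):
        raise FileNotFoundError("data directory not found: %s" % src_dir)
    files = []
    for root, _dirs, names in os.walk(src_dir):
        for name in names:
            full_path = os.path.join(root, name)
            files.append((os.path.relpath(full_path, src_dir),
                          os.path.getsize(full_path)))
    return sorted(files)


# Possible use before uploading:
# for path, size in list_upload_files(data_directory):
#     print('%s  %.1f MB' % (path, size / 1e6))
```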

#### Create the experiment

In [None]:
experiment_name = 'dlwp-CS'

exp = Experiment(workspace=ws, name=experiment_name)

#### Create a TensorFlow estimator
Now we create a TensorFlow estimator, which defines the environment in which the training script runs on the compute cluster. Azure builds a Docker image the first time this is run; later runs can reuse the existing image and start faster. We upload all of the DLWP source code files located in the parent directory of this notebook.
The script we pass to the job is `train_cs.py`, located in this directory. Details about the option parameters (and configurable settings for the specific run) can be seen and set there.

In [None]:
from azureml.train.dnn import TensorFlow

script_params = {
    '--root-directory': ds.path('DLWP').as_mount(),
    '--predictor-file': predictor_file,
    '--model-file': model_file,
    '--log-directory': log_file,
    '--temp-dir': '/mnt/tmp'
}

tf_est = TensorFlow(source_directory=os.path.join(os.getcwd(), os.pardir),
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script=os.path.join(os.getcwd(), 'train_cs.py'),
#                     framework_version='1.12',
                    conda_packages=['scikit-learn', 'netCDF4', 'dask', 'xarray==0.12.1'],
                    pip_packages=['keras'],
                    use_gpu=True)
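The keys of `script_params` are passed to the entry script as command-line arguments. The actual parser lives in the training script and is not shown here; the following is only a sketch of how a script might read these options with `argparse`. The defaults and help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the options named in script_params (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description='Sketch of a DLWP training entry point')
    parser.add_argument('--root-directory', type=str, required=True,
                        help='mounted datastore path containing the data')
    parser.add_argument('--predictor-file', type=str, required=True,
                        help='name of the predictor/target dataset file')
    parser.add_argument('--model-file', type=str, required=True,
                        help='name under which to save the trained model')
    # The defaults below are assumptions made for this example
    parser.add_argument('--log-directory', type=str, default='logs',
                        help='directory for training logs')
    parser.add_argument('--temp-dir', type=str, default=None,
                        help='optional scratch directory on the compute node')
    return parser.parse_args(argv)


# argparse converts dashes to underscores in attribute names:
# args = parse_args()
# print(args.root_directory, args.predictor_file)
```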

#### Submit the experiment
...and also print a summary table.

In [None]:
run = exp.submit(config=tf_est, tags=tags)
run

#### Download the saved model
...once the run is complete.

In [None]:
if run.get_status() == 'Completed':
    ds.download('/Users/Jojo/Temp/', prefix='DLWP/%s' % model_file)
else:
    print("run is in '%s' status; can't download files yet" % run.get_status())
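The check above has to be re-run by hand until the run finishes. The SDK's `run.wait_for_completion(show_output=True)` blocks until the run ends; alternatively, a simple polling loop can be sketched as below. The helper `wait_for_terminal_status`, its parameters, and the set of terminal states are assumptions made for illustration. It takes any callable that returns a status string, so it can be tried without an Azure connection.

```python
import time

# Status strings treated as final in this sketch (an assumption)
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}


def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=3600,
                             sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state.

    Returns the terminal status, or raises TimeoutError once the total
    waiting time reaches timeout_seconds.
    """
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("run still in '%s' status after %d s" % (status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Possible use with the run object above:
# if wait_for_terminal_status(run.get_status) == 'Completed':
#     ds.download('/Users/Jojo/Temp/', prefix='DLWP/%s' % model_file)
```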