# Running a single experiment on an AzML Compute Cluster

In this notebook we demonstrate running a single experiment (load data, train classifier, evaluate accuracy, produce classifications) not locally on the same compute instance as the notebook, but by submitting it to a compute cluster.

To handle the elements of the processing pipeline that differ when running this way, I have created classes derived from the main `XbtDataset` and `ClassificationExperiment` classes, namely `AzureDataset` and `AzureExperiment`. The principal changes are:

* loading data from a mounted AzML Dataset
* locating where the JSON experiment file has been copied to by the submit function that launches the experiment on the cluster.
* registering results with the Run in the AzML Experiment Framework through the API.
* (TODO) writing the output classifications to an Azure Blob 
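
As a rough illustration of the third point, registering results with the Run could look like the sketch below. The helper name `log_metrics_to_run` and the metric names are made up for this example and are not taken from `AzureExperiment`. The helper only assumes the object it is given has a `log(name, value)` method, as `azureml.core.Run` does, so a small stand-in class is used here to show the behaviour.

```python
def log_metrics_to_run(run, metrics):
    """Log each scalar metric to an AzML Run-like object.

    `run` only needs a `log(name, value)` method, as azureml.core.Run has.
    Returns the metric names in the order they were logged.
    """
    logged = []
    for name, value in sorted(metrics.items()):
        run.log(name, float(value))
        logged.append(name)
    return logged


class _RecordingRun:
    """Stand-in for azureml.core.Run, used only for this demonstration."""

    def __init__(self):
        self.records = []

    def log(self, name, value):
        self.records.append((name, value))


# Hypothetical metric names and values, for illustration only.
demo_run = _RecordingRun()
logged_names = log_metrics_to_run(
    demo_run, {'recall_test_all': 0.91, 'f1_test_all': 0.9}
)
```

On the cluster, the stand-in would be replaced by the object returned from `azureml.core.Run.get_context()`.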


In [1]:
import warnings
warnings.filterwarnings('ignore')
import os
import sys
import pathlib

In [2]:
import matplotlib
import matplotlib.pyplot


In [3]:
root_repo_dir = pathlib.Path().absolute().parent
# Put the repository root at the front of the module search path
sys.path.insert(0, str(root_repo_dir))

In [4]:
import azureml.core
import azureml.core.compute
import azureml.core.compute_target
import azureml.train.sklearn

In [5]:
import classification.xbt_azureml
import xbt.common

## Set up parameters
Define some key paths for the experiment. Paths are generally not defined in the experiment description, which makes the description more portable.
Important definitions include:
* The root data directory. This should have subdirectories for the XBT input dataset and for the outputs.
* The names of the input and output subdirectories.
* The path to the JSON experiment description file.

In [6]:
# Set up some site-specific parameters for the notebook,
# falling back to 'azureml' when XBT_ENV_NAME is not set
environment = os.environ.get('XBT_ENV_NAME', 'azureml')

In [7]:
# AZURE ML SPECIFIC definitions
azure_working_root = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/xbt-test1/code/Users/stephen.haddad'
xbt_compute_cluster_name = 'xbt-cluster'
xbt_vm_size = 'STANDARD_D2_V2'
xbt_max_nodes = 4

# It would be good if AzML could infer these from the user's credentials, as I don't
# think a user can access other workspaces when logged in to a particular workspace.
azml_subscription_id = '1fedcbc3-e156-45f5-a034-c89c2fc0ac61'
azml_resource_group = 'AWSEarth'
azml_workspace_name = 'stephenHaddad_xbt_europeWest'

azml_xbt_dataset_name = 'xbt_input_files'
azml_output_datastore_name = 'misc'
azml_output_datastore_dir = 'xbt-data/results'

In [8]:
root_data_dirs = {
    'MO_scitools': '/data/users/shaddad/xbt-data/',
    'pangeo': '/data/misc/xbt-data/',
    'azureml': os.path.join(azure_working_root, 'xbt-data'),
}
env_date_ranges = {
    'MO_scitools': (1966,2015),
    'pangeo': (1966,2015),
    'azureml': (1966,2015),
}

In [9]:
# Set up some dataset specific parameters
root_data_dir = root_data_dirs[environment]
year_range = env_date_ranges[environment]
experiment_name = 'cluster_azml_single_decisionTree_country'
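
The dictionary lookups above raise a bare `KeyError` if `XBT_ENV_NAME` is set to a name that has no entry. A minimal sketch of a friendlier lookup is given below. The helper `lookup_for_environment` and the demo table are hypothetical and not part of the XBT code.

```python
def lookup_for_environment(table, environment, what):
    """Return table[environment], with a readable error for unknown names."""
    try:
        return table[environment]
    except KeyError:
        known = ', '.join(sorted(table))
        raise ValueError(
            f'No {what} configured for environment {environment!r}; '
            f'known environments: {known}'
        ) from None


# Illustrative table with the same shape as root_data_dirs in this notebook.
demo_data_dirs = {
    'pangeo': '/data/misc/xbt-data/',
    'azureml': '/tmp/xbt-data',
}
demo_root = lookup_for_environment(demo_data_dirs, 'azureml', 'root data directory')
```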

In [10]:
input_dir_name = 'csv_with_imeta'
exp_out_dir_name = 'experiment_outputs'

In [11]:
xbt_input_dir = os.path.join(root_data_dir, input_dir_name)
xbt_output_dir = os.path.join(root_data_dir, exp_out_dir_name)

In [12]:
json_params_path = os.path.join('examples', 'xbt_param_decisionTree_country.json')

## Set up the Azure ML environment

The working directory defined above (`azure_working_root`) is a location on the compute instance, not on the cluster. It is specific to the compute instance, so we need to find a better way to do this.
It would be good if this sort of thing could be defined either through the AzML API or in the compute instance environment variables, for example `$AZML_HOME_DIR`.
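
If such a variable existed, the lookup could be as simple as the sketch below. `AZML_HOME_DIR` is the hypothetical variable named in the text, not one that AzML is known to set, and `find_working_root` is an illustrative helper. The hard-coded path is kept as the fallback.

```python
import os
import pathlib


def find_working_root(default, env_var='AZML_HOME_DIR', environ=None):
    """Return the working root from an environment variable, else `default`.

    `env_var` is a hypothetical variable name; `environ` can be passed
    explicitly so the lookup is easy to try out without touching os.environ.
    """
    environ = os.environ if environ is None else environ
    return pathlib.Path(environ.get(env_var, default))


# Demonstration with explicit mappings rather than the real environment.
demo_from_env = find_working_root(
    '/fallback/dir', environ={'AZML_HOME_DIR': '/home/azureuser/work'}
)
demo_fallback = find_working_root('/fallback/dir', environ={})
```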


In [13]:
xbt_workspace = azureml.core.Workspace(azml_subscription_id, azml_resource_group, azml_workspace_name)    

In [14]:
experiment = azureml.core.Experiment(workspace=xbt_workspace, name=experiment_name)

## Prepare/access the Azure ML compute cluster
If we want to use an AzureML cluster for training, cross-validation, hyperparameter tuning, etc., we need to create an object to access (and potentially start up) a suitable compute cluster.

In [15]:
try:
    compute_target = azureml.core.compute.ComputeTarget(workspace=xbt_workspace, name=xbt_compute_cluster_name)
    print('Found existing compute target')
except azureml.core.compute_target.ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = azureml.core.compute.AmlCompute.provisioning_configuration(vm_size=xbt_vm_size, 
                                                           max_nodes=xbt_max_nodes)

    # create the cluster
    compute_target = azureml.core.compute.ComputeTarget.create(xbt_workspace, xbt_compute_cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Found existing compute target
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 1, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-09-18T09:03:36.270000+00:00', 'errors': None, 'creationTime': '2020-08-26T13:32:02.981314+00:00', 'modifiedTime': '2020-09-11T16:38:10.632722+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 2, 'nodeIdleTimeBeforeScaleDown': 'PT1200S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D13_V2'}


## Running on the cluster


In [16]:
script_params = {
    '--input-dataset-name': azml_xbt_dataset_name,
    '--json-experiment': json_params_path,
    '--output-datastore-name': azml_output_datastore_name,
    '--output-datastore-dir': azml_output_datastore_dir,
}
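
The estimator passes these parameters to the launch script as command-line arguments. The real `bin/run_azml_experiment` script is not shown in this notebook, so the following is only a sketch of how a script could read the four flags with `argparse`. The function name and help texts are assumptions.

```python
import argparse


def parse_launch_args(argv=None):
    """Parse the four flags defined in script_params (hypothetical parser)."""
    parser = argparse.ArgumentParser(
        description='Run a single XBT experiment on an AzML cluster.'
    )
    parser.add_argument('--input-dataset-name', required=True,
                        help='name of the registered AzML input dataset')
    parser.add_argument('--json-experiment', required=True,
                        help='path to the JSON experiment description')
    parser.add_argument('--output-datastore-name', required=True,
                        help='name of the datastore for the outputs')
    parser.add_argument('--output-datastore-dir', required=True,
                        help='directory within the output datastore')
    return parser.parse_args(argv)


# Same values as used for script_params in this notebook.
demo_args = parse_launch_args([
    '--input-dataset-name', 'xbt_input_files',
    '--json-experiment', 'examples/xbt_param_decisionTree_country.json',
    '--output-datastore-name', 'misc',
    '--output-datastore-dir', 'xbt-data/results',
])
```

`argparse` converts the dashes in each flag name to underscores, so the values are available as, for example, `demo_args.input_dataset_name`.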


In [17]:
conda_packages = ['python=3.8',
                  'joblib=0.13.2',
                  'pandas=1.0.1',
                  'scikit-learn=0.22.1',
                  'iris=2.4',
                 ]

In [18]:
launch_script = 'bin/run_azml_experiment'

In [23]:
xbt_estimator = azureml.train.sklearn.SKLearn(source_directory=str(root_repo_dir), 
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script=launch_script,
                    conda_packages=conda_packages,
                   )



In [24]:
xbt_single_run = experiment.submit(xbt_estimator)
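
Submitting returns immediately, so the notebook needs to wait for the run before looking at results. The sketch below wraps three methods of `azureml.core.Run` (`wait_for_completion`, `get_status` and `get_metrics`) in a hypothetical helper, `wait_and_summarise`. A stand-in class is used here so that the example runs without a workspace.

```python
def wait_and_summarise(run):
    """Block until the run finishes, then return its status and metrics."""
    run.wait_for_completion(show_output=False)
    return {'status': run.get_status(), 'metrics': run.get_metrics()}


class _FinishedRun:
    """Stand-in for azureml.core.Run, used only for this demonstration."""

    def wait_for_completion(self, show_output=False):
        return None

    def get_status(self):
        return 'Completed'

    def get_metrics(self):
        # Hypothetical metric name and value.
        return {'f1_test_all': 0.9}


demo_summary = wait_and_summarise(_FinishedRun())
```

In the notebook this would be called as `wait_and_summarise(xbt_single_run)`.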

The trained classifier objects are saved to the output directory, one per file. A JSON experiment description file is also written. It is the same as the original description, but with `inference` added to the experiment name and a list of classifier file names. This file can be used to create and run an inference job. The classifier files should be in the same directory as the JSON inference description file.
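
To make the relationship between the two description files concrete, here is a sketch of how an inference description could be derived from a training description. The key names `experiment_name` and `classifier_fnames`, the `_inference` suffix and the file names are assumptions made for illustration. The actual layout written by the experiment classes may differ.

```python
import copy
import json


def make_inference_description(train_description, classifier_fnames):
    """Copy a training description and add the inference-specific fields.

    Key names and the '_inference' suffix are illustrative assumptions.
    """
    inference = copy.deepcopy(train_description)
    inference['experiment_name'] = train_description['experiment_name'] + '_inference'
    inference['classifier_fnames'] = list(classifier_fnames)
    return inference


# Hypothetical training description and classifier file names.
demo_train = {'experiment_name': 'demo_decisionTree_country', 'learner': {}}
demo_inference = make_inference_description(
    demo_train, ['classifier_0.joblib', 'classifier_1.joblib']
)
demo_inference_json = json.dumps(demo_inference, indent=2)
```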

In [46]:
azureml.core.Dataset.get_by_name(xbt_workspace, 'xbt_nc_iquod').mount()

<azureml.dataprep.fuse.daemon.MountContext at 0x7fbfd46eee48>

In [34]:
print(os.environ)

environ({'LANG': 'en_US.UTF-8', 'PATH': '/anaconda/envs/azureml_py36/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games', 'HOME': '/home/azureuser', 'LOGNAME': 'azureuser', 'USER': 'azureuser', 'SHELL': '/bin/bash', 'AML_CloudName': 'AzureCloud', 'KERNEL_LAUNCH_TIMEOUT': '40', 'JPY_PARENT_PID': '4069', 'TERM': 'xterm-color', 'CLICOLOR': '1', 'PAGER': 'cat', 'GIT_PAGER': 'cat', 'MPLBACKEND': 'module://ipykernel.pylab.backend_inline', 'KMP_INIT_AT_FORK': 'FALSE'})


In [None]:
help(xbt_single_run.upload_file)

In [None]:
json_inf_path = exp2_cv.inference_out_json_path

In [None]:
fig_results = matplotlib.pyplot.figure('xbt_results',figsize=(25,15))
for label1, metrics1  in classifiers_cv.items():
    ax_precision = fig_results.add_subplot(3,5,label1 +1, title='precision split {0}'.format(label1))
    ax_recall = fig_results.add_subplot(3,5,label1 + 1 + 5 * 1, title='recall split {0}'.format(label1))
    ax_f1 = fig_results.add_subplot(3,5,label1 + 1 + 5 * 2, title='f1 split {0}'.format(label1))
    results_cv.plot.line(ax=ax_precision, x='year', y=[f'precision_train_{label1}_all',f'precision_test_{label1}_all'], color=['b', 'r'], ylim=(0.7,1.0))
    results_cv.plot.line(ax=ax_recall, x='year', y=[f'recall_train_{label1}_all',f'recall_test_{label1}_all'], color=['b', 'r'], ylim=(0.7,1.0))
    results_cv.plot.line(ax=ax_f1, x='year', y=[f'f1_train_{label1}_all',f'f1_test_{label1}_all'], color=['b', 'r'], ylim=(0.7,1.0))

### Inference
Once we have trained the classifiers, we want to be able to load them from the saved state files and run inference on the whole dataset, with the same results. We can use the JSON file created by the training to run the inference. The JSON inference parameters and the saved classifier object files should be in the same directory. We can then use the `run_inference` function, which performs the following steps:
* load the dataset
* load the previously trained classifiers from the file list defined in the JSON inference description
* run inference for each of the classifiers
* use the iMeta algorithm to fill in classifications where the classifiers cannot produce one
* calculate a vote-based probability using the ensemble of previously trained classifiers
* save the classification results
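
The voting and fill-in steps can be sketched in plain Python as follows. This is one simple reading of "vote-based probability" (the fraction of classifiers voting for each label) and is not necessarily how `run_inference` computes it. The function name, the use of `None` for a missing classification and the fallback labels standing in for iMeta output are all assumptions.

```python
from collections import Counter


def vote_probabilities(predictions_per_classifier, fallback_labels):
    """Return, for each sample, a dict mapping label to fraction of votes.

    predictions_per_classifier: one list of labels per classifier, with
        None where a classifier could not produce a classification.
    fallback_labels: one label per sample (standing in for iMeta output),
        used when no classifier voted for that sample.
    """
    results = []
    for i, fallback in enumerate(fallback_labels):
        votes = [preds[i] for preds in predictions_per_classifier
                 if preds[i] is not None]
        if not votes:
            # No classifier result: fall back to the supplied label.
            results.append({fallback: 1.0})
            continue
        counts = Counter(votes)
        results.append({label: n / len(votes) for label, n in counts.items()})
    return results


# Four hypothetical classifiers, three samples; the last sample has no votes.
demo_votes = vote_probabilities(
    [['A', 'B', None], ['A', 'A', None], ['B', 'A', None], ['A', 'A', None]],
    fallback_labels=['A', 'A', 'C'],
)
```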

In [None]:
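# NOTE: `experiment` was bound to the azureml.core.Experiment object earlier in this notebook,
# so this call needs the module that defines ClassificationExperiment imported under another name.
exp3_inf = experiment.ClassificationExperiment(json_inf_path, xbt_input_dir, xbt_output_dir)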


In [None]:
%%time
classifiers_reloaded = exp3_inf.run_inference()