# How to Train Models via AML Pipelines
First we need to get our Azure SDK imports, authentication, etc.

If you're running this, use your own tenant ID. The workspace authenticates through InteractiveLoginAuthentication, so my tenant ID won't work unless you're signed into my Azure account. You can find yours by searching for "tenant properties" under services in the Azure portal.

1. We get our workspace
2. We grab the dataset from our registered dataset list
3. Calling "as_mount" converts it into a DatasetConsumptionConfig object, which is what pipelines use to transfer data between places.

The Tiny ImageNet dataset already comes prepped, so we didn't need a data-prep step. For other datasets, one can be added here as an additional step.
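Since the tenant ID differs for every user, one option is to read it from an environment variable instead of editing the cell. This is only a minimal sketch: the variable name `AML_TENANT_ID` and the helper `resolve_tenant_id` are hypothetical and are not part of the Azure SDK.

```python
import os


def resolve_tenant_id(env_var="AML_TENANT_ID", environ=None):
    """Return the tenant ID stored in `env_var`, or raise if it is missing.

    `AML_TENANT_ID` is a hypothetical name chosen for this sketch; the Azure SDK
    does not read it on its own.
    """
    environ = os.environ if environ is None else environ
    tenant_id = environ.get(env_var, "").strip()
    if not tenant_id:
        raise ValueError(
            f"Set {env_var} to your own Azure tenant ID before running the notebook."
        )
    return tenant_id


# Example with an explicit mapping so the snippet runs without real settings
print(resolve_tenant_id(environ={"AML_TENANT_ID": "00000000-0000-0000-0000-000000000000"}))
```

The returned string could then be passed as the `tenant_id` argument of `InteractiveLoginAuthentication` in the cell below.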

In [1]:
# azureml-core version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset, Datastore
from azureml.core.authentication import InteractiveLoginAuthentication
# If someone else is running this, put your tenant ID here; auth is being finicky
interactive_auth = InteractiveLoginAuthentication(tenant_id="72f988bf-86f1-41af-91ab-2d7cd011db47")
# get/create experiment
ws = Workspace.get(
    subscription_id="92c76a2f-0e1c-4216-b65e-abf7a3f34c1e",
    resource_group="azureml_uw_imageclassification",
    name="tiny-image-net",
    auth=interactive_auth
)

# get dataset (FileDataset object)
tiny_imagenet = Dataset.get_by_name(ws, name='Tiny ImageNet')
# Convert to DatasetConsumptionConfig object
#ds_input = tiny_imagenet.as_named_input("tiny_imagenet_dataset")
ds_input = tiny_imagenet.as_mount(path_on_compute="./data")

In [2]:
type(ds_input)

azureml.data.dataset_consumption_config.DatasetConsumptionConfig

In [3]:
from azureml.data import OutputFileDatasetConfig

The "OutputFileDatasetConfig" import is what lets us pass output from one script to another. Here, I'm using it to store our saved models in our default datastore so I can check that everything's working.

# Compute Clusters
If the cluster we need doesn't already exist in the workspace, create it. We really only need 1 node, so ignore max_nodes. This is a WIP.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute

compute_name = "compute-cluster"
vm_size = "STANDARD_NC12"
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and isinstance(compute_target, AmlCompute):
        print('Found compute target: ' + compute_name)
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,  # STANDARD_NC12 is GPU-enabled
                                                                vm_priority='lowpriority',
                                                                min_nodes=1,
                                                                max_nodes=4)
    # create the compute target
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current cluster status, use the 'status' property
    print(compute_target.status.serialize())

Found compute target: compute-cluster


# Configure Training Environment
We need to include the dependencies required to train the model. A curated environment uses pre-built Docker images from the Microsoft Container Registry. Conda dependencies and pip dependencies don't always play well together, so here everything is installed through pip.

In [5]:
from azureml.core.runconfig import RunConfiguration, DockerConfiguration, MpiConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Environment 

# Use the 4 nodes we have
#distributed_training_config = MpiConfiguration(node_count=4)

aml_run_config = RunConfiguration()
aml_run_config.target = compute_target

USE_CURATED_ENV = False

if USE_CURATED_ENV:
    # We don't have a curated environment set up
    curated_environment = Environment.get(workspace=ws, name="AzureML-Tutorial")
    aml_run_config.environment = curated_environment
else: 
    aml_run_config.environment.python.user_managed_dependencies = False
    # base docker environment
    #aml_run_config.environment.docker = DockerConfiguration(use_docker=True)
    aml_run_config.environment.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
    
    # Add some packages relied on by data prep step
    conda_dependencies = CondaDependencies.create(
        #conda_packages=['tensorflow-gpu==2.2.0'],
        conda_packages=[], 
        pin_sdk_version=False)
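    # Only 'fuse' and 'pandas' are extras of azureml-dataprep; random, math, os and
    # warnings are standard-library modules and don't need to be installed
    pip_packages=['joblib', 'pandas', 'tensorflow==2.2.0', 'keras', 'pillow', 'azureml-sdk', 'azureml-dataprep[fuse,pandas]']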
    for pip_package in pip_packages:
        conda_dependencies.add_pip_package(pip_package)
    aml_run_config.environment.python.conda_dependencies = conda_dependencies
aml_run_config.environment.name = "InceptionV3_TinyImagenet"

In [6]:
# The environment needs a name (set in the cell above) before it can be saved to the workspace
aml_run_config.environment.register(workspace=ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "InceptionV3_TinyImagenet",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
               

# Set Datastore to get output
We'll use the default datastore for the outputs of our training.

In [8]:
def_blob_store = Datastore(ws, "workspaceblobstore")
def_blob_store

The "def_blob_store" object is essentially a reference to our default datastore's blob container. The output config object imported earlier points at it. Think of it as passing along the path to the directory where we want to save our files.

# Constructing Pipeline Steps

In [9]:
from azureml.pipeline.steps import PythonScriptStep

def_blob_store = Datastore(ws, "workspaceblobstore")
train_source_dir = "./train"
entry_point = "train.py"

# Place to save the outputted model. This is converted into a path for the model to save. 
output = OutputFileDatasetConfig(destination=(
            def_blob_store,'./model/run_{run-id}'))
# So looking inside our datastore... inside the container name listed in the cell above..., 
# you should see a directory "model" with subdirectories representing each run.

# Use the 4 nodes we have
#distributed_training_config = MpiConfiguration(node_count=4)

train_step = PythonScriptStep(
                script_name=entry_point,
                arguments=[
                        "--data_path", "./data", # Path the data is mounted to. Look in train.py
                        "--steps_per_epoch", 150, #params
                        "--num_epochs", 10,
                        "--batch_size", 64,
                        "--save_dest", output,  #The output config object from earlier. See train.py
                        "--run_eval", True # Needs to be implemented still
                          ],
                inputs=[ds_input],
                compute_target=compute_target,
                source_directory=train_source_dir,
                runconfig=aml_run_config,
                allow_reuse=True
            )
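The step above passes its settings to train.py as command-line arguments, and the output config object arrives as a plain path. train.py itself isn't shown in this notebook, so the following is only a hypothetical sketch of how it could read those arguments. The helper names and the `model.h5` file name are assumptions. Note that `--run_eval` arrives as the string "True", and argparse's `type=bool` would treat any non-empty string as true, so it is parsed explicitly here.

```python
import argparse
import os


def str_to_bool(value):
    """Parse 'True'/'False' style strings, since bool('False') would be True."""
    lowered = str(value).strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    """Argument names mirror the ones passed by the pipeline step."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    parser.add_argument("--data_path", type=str, default="./data")
    parser.add_argument("--steps_per_epoch", type=int, default=150)
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--save_dest", type=str, required=True)
    parser.add_argument("--run_eval", type=str_to_bool, default=False)
    return parser.parse_args(argv)


def prepare_model_path(save_dest, file_name="model.h5"):
    """Create the output directory if needed and return the file path to save to.

    The file name is an assumption made for this sketch.
    """
    os.makedirs(save_dest, exist_ok=True)
    return os.path.join(save_dest, file_name)
```

Inside train.py, the model would then be saved to `prepare_model_path(args.save_dest)`, which ends up under the datastore destination configured above.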

**NOTE**: The interactive login prompt I mentioned earlier will show up in the Jupyter notebook cell output. It'll ask you to go to a specific link and enter a short code. On one run I forgot about this and left it alone, so the run sat there for hours without progressing.

If you're cold-running this, it'll need to download the Docker image for Ubuntu and CUDA, which may take ~10-13 minutes. Otherwise, it should get to that authentication prompt within a few minutes.

Also, a side note: the Jupyter cell output pulls from a log.txt file that "Experiments" uses. Sometimes this lags or freezes. If that happens, go to the Experiments tab and click on the **display name** for the run.\
Then, under "Graph", click on the train.py box and open azureml-logs > 70_driver_log.txt. The prompt should be there.

In [None]:
#THIS WAS FOR TESTING. DON'T USE THIS ONE
from azureml.pipeline.steps import PythonScriptStep

def_blob_store = Datastore(ws, "workspaceblobstore")
train_source_dir = "./train"
entry_point = "test.py"

# Place to save the outputted model. This is converted into a path for the model to save. 
# Named output_data to match the references in the step definition below.
output_data = OutputFileDatasetConfig(destination=(
            def_blob_store,'outputs/run_{run-id}'))
# So looking inside our datastore... inside the container name listed in the cell above..., 
# you should see a directory "outputs" with subdirectories representing each run.

# Use the 4 nodes we have
#distributed_training_config = MpiConfiguration(node_count=4)

train_step = PythonScriptStep(
                script_name=entry_point,
                arguments=[
                        "--data_path", "./data", # Path the data is mounted to. Look in train.py
                        "--steps_per_epoch", 10, #params
                        "--num_epochs", 1,
                        #"--batch_size", 64,
                        "--save_dest", output_data,  #The output config object from earlier. See train.py
                        #"--run_eval", True
                          ],
                inputs=[ds_input],
                outputs=[output_data],
                compute_target=compute_target,
                source_directory=train_source_dir,
                runconfig=aml_run_config,
                allow_reuse=True
            )

In [None]:
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

pipeline = Pipeline(workspace=ws, steps=[train_step])
print("Pipeline made. Submitting run...")
exp = Experiment(ws, "Training_InceptionV3")
pipeline_run = exp.submit(pipeline)
#run = pipeline_run.start_logging(outputs=None, snapshot_directory=None)
print("Submitted. Waiting for completion...")
pipeline_run.wait_for_completion()
print("Completed! Experiment should appear under the Experiments page.")

In [None]:
# Show the name of the datastore the training output was written to
print(def_blob_store.name)

# Model Saved
The model's been saved to the default datastore associated with the account. We can now load it into our workspace, save a local copy, register the model, etc.

We tried using AzureML's built-in workflow by saving directly to the outputs or logs directory, but ran into issues whose cause we haven't pinned down yet.

Saving to those directories put the files in a directory named "outputs" belonging to each step, not to the experiment. This meant we couldn't get the files directly from the PipelineRun object after training. We did find another workaround that still has some merit, described below. We're still looking into why files aren't being saved to the outputs and logs folders of the experiment.

We can get a reference to the saved model's location in a couple of steps. Earlier, we specified an output path for the model on the datastore, so we can pull the files back using the same datastore object.

We use **def_blob_store.download()**, which needs the local path to save the files to and the prefix of the path on the datastore.

We told the step to save the models to model/run_{run-id}, where {run-id} is filled in with the id of the run.

We can get that run id from the PipelineRun object we used to submit the training job.

In [None]:
for step in pipeline_run.get_steps():
    run_id = step.id
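The loop above keeps the id of whichever step it visits last, which is fine while the pipeline has a single step. If more steps get added (for example a data-prep step), picking the step by name is safer. A minimal sketch, assuming each step run exposes `name` and `id` attributes and that the name matches the script name shown in the graph; `find_step_run_id` is a hypothetical helper and the stand-in objects below only imitate step runs.

```python
from types import SimpleNamespace


def find_step_run_id(step_runs, step_name):
    """Return the id of the single step run whose name equals `step_name`."""
    matches = [step.id for step in step_runs if step.name == step_name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one step named {step_name!r}, found {len(matches)}"
        )
    return matches[0]


# Stand-in objects used only to make the sketch runnable without a pipeline run
fake_steps = [
    SimpleNamespace(name="prep.py", id="1111"),
    SimpleNamespace(name="train.py", id="2222"),
]
print(find_step_run_id(fake_steps, "train.py"))
```

With a real run, the call would look like `find_step_run_id(pipeline_run.get_steps(), "train.py")`.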

We had to get the id of the step rather than of the pipeline run. Every resource we've looked into said that saving to **outputs** or **logs** would persist files to the experiment's output, since those two are special directory names.

E.g. saving an object named "example" as example.txt by calling joblib.dump(example, "outputs/example.txt") in train.py saves it in the output of the step, not the experiment. This download workaround is temporary until we can figure out why this part isn't working.

In [None]:
def_blob_store.download(target_path="./run_output",
                        prefix=f"model/run_{run_id}", 
                        show_progress=True)

With the model downloaded, we can register the model, save a local copy, etc.

In [None]:
!ls ./run_output/model

In [None]:
!ls ./run_output/model/run_b9d700aa-5b98-4ef2-84f0-851393f692cd

We can see the saved files there. This solution is not as automated as we'd like, however.
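One small step toward automating this is to build the local run directory from `run_id` instead of pasting the id into an `ls` command. A minimal sketch, assuming the download keeps the `model/run_<id>` prefix under the target path as seen above; `list_model_files` is a hypothetical helper, and the demo at the bottom uses a temporary directory with placeholder files rather than a real download.

```python
import tempfile
from pathlib import Path


def list_model_files(download_root, run_id):
    """Return the files under <download_root>/model/run_<run_id>, relative to that directory."""
    run_dir = Path(download_root) / "model" / f"run_{run_id}"
    if not run_dir.is_dir():
        raise FileNotFoundError(f"no downloaded output found at {run_dir}")
    return sorted(
        path.relative_to(run_dir).as_posix()
        for path in run_dir.rglob("*")
        if path.is_file()
    )


# Demo on a throwaway directory that imitates the downloaded layout
with tempfile.TemporaryDirectory() as tmp:
    fake_run_dir = Path(tmp) / "model" / "run_demo"
    (fake_run_dir / "variables").mkdir(parents=True)
    (fake_run_dir / "saved_model.pb").write_text("placeholder")
    (fake_run_dir / "variables" / "variables.index").write_text("placeholder")
    print(list_model_files(tmp, "demo"))
```

In the notebook this would be called as `list_model_files("./run_output", run_id)` after the download cell has run.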

# Clean Up

Clean up the downloaded model if we're done using it. Run this cell if you don't need the model files anymore.

In [None]:
!rm -rf ./run_output/
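The `rm -rf` command above only works where a Unix shell is available. If the notebook is run on Windows, the same cleanup can be done from Python. This is a small sketch, and `remove_run_output` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def remove_run_output(path="./run_output"):
    """Delete the downloaded output directory; return True if it existed beforehand."""
    target = Path(path)
    existed = target.exists()
    # ignore_errors=True makes this a no-op when the directory is already gone
    shutil.rmtree(target, ignore_errors=True)
    return existed
```

Calling `remove_run_output()` with no arguments removes `./run_output`, the directory used by the download cell.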