# Training

In this notebook, we will learn how to train an AI model in the cloud.  
Cloud AI training has a few special aspects, but it also has a lot in common with our old-school way of working.

Again, let us start by setting some global parameters.

In [1]:
BATCH_SIZE = 32
MAX_EPOCHS = 100
PATIENCE = 11
INITIAL_LEARNING_RATE = 0.01
DROPOUTRATE = 0.0
ACTIVATION_HIDDEN = 'relu' # activation function of the hidden-layer neurons
ACTIVATION_OUTPUT = 'sigmoid' # activation function of the output-layer neurons
INITIALIZER = 'RandomUniform' # type of kernel initializer
model_name = 'sequences-nn'

And of course we import the packages we need! Again, don't forget to select the right kernel in the top-right corner!

In [2]:
import numpy as np

import os
from glob import glob
import warnings

warnings.filterwarnings("ignore")
import random
SEED = 42   # set random seed
random.seed(SEED)

from typing import List

In [3]:
## Import AzureML packages
from azureml.core import Workspace
from azureml.core import Dataset
from azureml.data.datapath import DataPath
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

One special import is the set of utility scripts. You can read more about them in the `utils > utils.py` file. I import them here because they contain some helper functions we will need later on.

In [48]:
from utils.utils import *

# Step 1: Connect Workspace

Follow the same steps as the previous notebook, to set up your Workspace configuration!

In [6]:
## Either get environment variables, or a fallback name, which is the second parameter.
## Currently, fill in the fallback values. Later on, we will make sure to work with Environment values. So we're already preparing for it in here!
workspace_name = os.environ.get('WORKSPACE', 'hermans-cedric-ml')
subscription_id = os.environ.get('SUBSCRIPTION_ID', '51f06efd-190a-4388-8d72-41669512398d')
resource_group = os.environ.get('RESOURCE_GROUP', '04-AzureML')

In [7]:
ws = Workspace.get(name=workspace_name,
               subscription_id=subscription_id,
               resource_group=resource_group)

## Step 1.1 -- Create Compute Cluster

A Compute Cluster is a combination of multiple Compute Instances. Azure will scale these machines according to the number of nodes we fill into the configuration.  
Based on the amount of Jobs we want to run in parallel, multiple machines will be created.

We choose to define a minimum of 0 machines, which means Azure will need some time to create at least one machine every time we need one.
If you keep the minimum at 1, you always have one machine ready for your development.
The idle time before the cluster scales back down to 0 machines can also be configured if required.
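The scale settings described above can be gathered in one place before they are handed to the provisioning call. The sketch below is only an illustration: the helper name `read_cluster_settings`, the `AML_COMPUTE_CLUSTER_IDLE_SECONDS` variable and the 1800-second default are assumptions made for this example. It converts the environment values to integers, since environment variables always arrive as strings.

```python
import os


def read_cluster_settings(env=os.environ):
    """Collect cluster scale settings, converting environment strings to int.

    The helper name and the IDLE_SECONDS variable are hypothetical.
    """
    return {
        "vm_size": env.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2"),
        "min_nodes": int(env.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0)),
        "max_nodes": int(env.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4)),
        # Seconds a node may sit idle before the cluster scales down again.
        "idle_seconds_before_scaledown": int(
            env.get("AML_COMPUTE_CLUSTER_IDLE_SECONDS", 1800)
        ),
    }


# A plain dict stands in for os.environ here, so the example runs anywhere.
settings = read_cluster_settings({"AML_COMPUTE_CLUSTER_MAX_NODES": "2"})
print(settings)
```

The keys are chosen to match keyword arguments of `AmlCompute.provisioning_configuration`, so the dictionary could be passed on as `AmlCompute.provisioning_configuration(**settings)`.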

In [8]:
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# Environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and isinstance(compute_target, AmlCompute):
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target: cpu-cluster


## Find and download datasets

In [9]:
datasets = Dataset.get_all(workspace=ws) # Make sure to give our workspace with it
print(datasets)

{ 'NRPS': DatasetRegistration(id='cd0e7d4c-fd07-4538-8df1-5344ac913e2b', name='NRPS', version=1, description='', tags={}),
  'PKS': DatasetRegistration(id='019c4c3d-8a82-4260-98ce-2e3047dc8170', name='PKS', version=1, description='', tags={}),
  'animals-testing-set': DatasetRegistration(id='b8ec1966-c9da-4ca9-b07d-e41ee35f5e07', name='animals-testing-set', version=1, description='The Animal Images to test, resized tot 64, 64', tags={'animals': 'cats,dogs,pandas', 'AI-Model': 'CNN', 'Split size': '0.2', 'type': 'testing'}),
  'animals-training-set': DatasetRegistration(id='02ab0351-03ab-4e0e-b42f-620fa183bd4d', name='animals-training-set', version=1, description='The Animal Images to train, resized tot 64, 64', tags={'animals': 'cats,dogs,pandas', 'AI-Model': 'CNN', 'Split size': '0.8', 'type': 'training'}),
  'cats': DatasetRegistration(id='db847d56-0389-43e8-93aa-6cd3df7507c7', name='cats', version=1, description='', tags={}),
  'dogs': DatasetRegistration(id='1f40c908-e568-4434-a620

# Step 2: Create an AI model and training code

We will first create an AI model to use in our training script.  
A basic AI model has been provided in the /utils/utils.py file. You can change it there if you want to.

In this step, we will also configure a Training script. This script is an Executable Python script.  
This is slightly different from our other way of working, where we work with Notebooks.

Because Azure will be launching and running our Python scripts, we need to create one file that can be executed in one go.
It needs all our imports, packages, data ... to be ready without manual intervention.

We'll store all of these files into a scripts directory. That way we can upload that directory to our training VM later.

### Step 2.1 -- Prepare the scripts

In [49]:
script_folder = os.path.join(os.getcwd(), 'scripts')
os.makedirs(script_folder, exist_ok=True)

In [62]:
%%writefile $script_folder/train.py

import argparse
import os
from glob import glob
import random
import numpy as np

# This time we will need our Tensorflow Keras libraries, as we will be working with the AI training now
from tensorflow import keras
from tensorflow.keras.optimizers import SGD
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# This AzureML package will allow to log our metrics etc.
from azureml.core import Run

# Important to load in the utils as well!
from utils import *


### HARDCODED VARIABLES FOR NOW
### TODO for the students:
### Make sure to adapt the ArgumentParser below to include these parameters
### You can base your answer on the lines that are already there

BATCH_SIZE = 32
MAX_EPOCHS = 100
PATIENCE = 11
INITIAL_LEARNING_RATE = 0.01
DROPOUTRATE = 0.0
ACTIVATION_HIDDEN = 'relu' # activation function of the hidden-layer neurons
ACTIVATION_OUTPUT = 'sigmoid' # activation function of the output-layer neurons
INITIALIZER = 'RandomUniform' # type of kernel initializer
model_name = 'gene-cnn-test'


parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--epochs', type=int, dest='epochs', help='The amount of Epochs to train')
parser.add_argument('--initial-lr', type=float, dest='initial_learning_rate', help='The initial learning rate')
parser.add_argument('--batch-size', type=int, dest='batch_size', help='The batch size')
args = parser.parse_args()


data_folder = args.data_folder
print('Training folder:', data_folder)

INITIAL_LEARNING_RATE = args.initial_learning_rate
MAX_EPOCHS = args.epochs
BATCH_SIZE = args.batch_size

# As we're mounting the training_folder and testing_folder onto the `/mnt/data` directories, we can load in the sequences by using glob.
X_paths = glob(os.path.join('/mnt/data', '**', '*X.np'), recursive=True)
y_paths = glob(os.path.join('/mnt/data', '**', '*y.np'), recursive=True)

print("features samples:", len(X_paths))
print(X_paths)
print("target samples:", len(y_paths))
print(y_paths)

X=[]
y=[]

for xf in X_paths:
    X+=list(np.loadtxt(xf))
for yf in y_paths:
    y+=list(np.loadtxt(yf))
X_train,X_test,y_train,y_test=train_test_split(np.array(X),np.array(y),test_size=0.2)


print('Shapes:')
print(X_train.shape)
print(X_test.shape)
print(len(y_train))
print(len(y_test))

# Create an output directory where our AI model will be saved to.
# Everything inside the `outputs` directory will be logged and kept aside for later usage.
model_path = os.path.join('outputs', model_name)
os.makedirs(model_path, exist_ok=True)

## START OUR RUN context.
## We can now log interesting information to Azure, by using these methods.
run = Run.get_context()

# Save the best model, not the last
cb_save_best_model = keras.callbacks.ModelCheckpoint(filepath=model_path,
                                                         monitor='val_loss', 
                                                         save_best_only=True, 
                                                         verbose=1)

# Stop early when the val_loss hasn't improved for PATIENCE epochs
cb_early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', 
                                              patience= PATIENCE,
                                              verbose=1,
                                              restore_best_weights=True)

# Reduce the Learning Rate when not learning more for 4 epochs.
cb_reduce_lr_on_plateau = keras.callbacks.ReduceLROnPlateau(factor=.5, patience=4, verbose=1)

opt = SGD(learning_rate=INITIAL_LEARNING_RATE, decay=INITIAL_LEARNING_RATE / MAX_EPOCHS) # Define the Optimizer

model = buildModel(20, 2) # Create the AI model as defined in the utils script.

model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

# train the network
history = model.fit(X_train, y_train, batch_size=BATCH_SIZE,
                        validation_data=(X_test, y_test),
                        epochs=MAX_EPOCHS,
                        callbacks=[cb_save_best_model, cb_early_stop, cb_reduce_lr_on_plateau] )
run.log_list('val_loss', history.history['val_loss'])
run.log_list('val_accuracy', history.history['val_accuracy'])
run.log("history",history.history)

print("[INFO] evaluating network...")
predictions = model.predict(X_test, batch_size=32)
print(classification_report(y_test.argmax(axis=1), predictions.argmax(axis=1), target_names=['NRPS', 'PKS'])) # Give the target names to easier refer to them.
# If you want, you can enter the target names as a parameter as well, in case you ever adapt your AI model to more genes.

cf_matrix = confusion_matrix(y_test.argmax(axis=1), predictions.argmax(axis=1))
print(cf_matrix)

### TODO for students
### Find a way to log more information to the Run context.

# Save the confusion matrix to the outputs.
np.save('outputs/confusion_matrix.npy', cf_matrix)

print("DONE TRAINING")


Overwriting /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/scripts/train.py
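One possible way to approach the TODO in the script above is to give the remaining hyperparameters their own arguments with defaults. This is only a sketch: the flag names `--patience`, `--dropout-rate` and `--model-name`, and the `build_parser` helper, are assumptions and not names prescribed by the assignment.

```python
import argparse


def build_parser():
    """Argument parser for train.py, extended with hypothetical extra flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='data folder mounting point')
    parser.add_argument('--epochs', type=int, dest='epochs', default=100,
                        help='The number of epochs to train')
    parser.add_argument('--initial-lr', type=float, dest='initial_learning_rate',
                        default=0.01, help='The initial learning rate')
    parser.add_argument('--batch-size', type=int, dest='batch_size', default=32,
                        help='The batch size')
    # Hypothetical flags replacing the hardcoded variables.
    parser.add_argument('--patience', type=int, dest='patience', default=11,
                        help='Epochs without improvement before early stopping')
    parser.add_argument('--dropout-rate', type=float, dest='dropout_rate',
                        default=0.0, help='The dropout rate')
    parser.add_argument('--model-name', type=str, dest='model_name',
                        default='gene-cnn-test', help='Name of the saved model')
    return parser


# Passing an explicit list makes the parser easy to try out in a notebook.
args = build_parser().parse_args(['--epochs', '25', '--patience', '5'])
print(args.epochs, args.patience, args.batch_size)
```

The defaults matter: without them, an argument that is left out parses to `None`, and a line such as `MAX_EPOCHS = args.epochs` would silently overwrite the hardcoded value with `None`.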


In [63]:
# Copy the Utils file into the script_folder
import shutil
shutil.copy('utils/utils.py', script_folder)

'/mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/scripts/utils.py'

### Step 2.2 -- Prepare the environment

The training script we have just defined still needs some more information before we can start it.  
We'll need to define its Anaconda or Pip environment with all the packages that should be installed prior to training.  
We can re-use the environments later, or we can use environments other people have created for us.

You can also customize the Base Docker image to train on, if you prefer. I won't do that here.

In [13]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create an Environment name for later use
environment_name = os.environ.get('TRAINING_ENV_NAME', 'genes-classification-env-training')
env = Environment(environment_name)

# It's called CondaDependencies, but you can also use pip packages ;-)
env.python.conda_dependencies = CondaDependencies.create(
        # Using opencv-python-headless is interesting to skip the overhead of packages that we don't need in a headless-VM.
        pip_packages=['azureml-dataset-runtime[pandas,fuse]', 'azureml-defaults', 'tensorflow', 'scikit-learn', 'opencv-python-headless']
    )
# Register environment to re-use later
env.register(workspace = ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20211124.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "genes-classification-env-training",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
          

### Step 2.3 -- Prepare the ScriptRunConfig

A **ScriptRunConfig** is a configuration that contains all the information needed to launch a Job inside an Experiment.
It specifies the directory of scripts to use, the **name** of the script to start,
the **arguments** to pass into that script, the **compute** target to run the script on, and finally the **environment** to run it in.

We then need to attach such a ScriptRunConfig onto an Experiment on Azure.
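When several hyperparameters are varied at once, the argument lists for the runs can be generated with `itertools.product` instead of nested loops. The sketch below is a minimal illustration; `build_grid_arguments` is a hypothetical helper, and the `--data-folder` argument is left out because it needs a real dataset object.

```python
from itertools import product


def build_grid_arguments(epochs_options, lr_options, batch_options):
    """Return one argument list per hyperparameter combination."""
    return [
        ['--epochs', epochs, '--initial-lr', lr, '--batch-size', batch_size]
        for epochs, lr, batch_size in product(epochs_options, lr_options, batch_options)
    ]


grid = build_grid_arguments([25, 50, 75, 100],
                            [0.3, 0.1, 0.05, 0.01],
                            [16, 32, 64, 128])
print(len(grid))   # one entry per combination: 4 * 4 * 4
print(grid[0])
```

Each entry could then be extended with the data folder argument and passed as `arguments` to a ScriptRunConfig. Counting the combinations first is a cheap way to see how many jobs a sweep will submit before any compute is used.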

In [56]:
datasets['processed_genes'].as_mount('/mnt/data/')
# As we're mounting the training_folder and testing_folder onto the `/mnt/data` directories, we can load in the sequences by using glob.
X_paths = glob(os.path.join('/mnt/data','**', '*X.np'), recursive=True)
y_paths = glob(os.path.join('/mnt/data','**', '*y.np'), recursive=True)

print("features samples:", len(X_paths))
print("target samples:", len(y_paths))


features samples: 0
target samples: 0
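A likely explanation for the zero matches above is that `as_mount` only describes how the dataset should be mounted for a submitted run, so nothing exists under `/mnt/data` in the notebook session itself. The glob pattern can still be checked locally. The sketch below does so on a temporary directory; the folder layout and file names are hypothetical stand-ins for the real dataset.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout standing in for the mounted dataset.
    for gene in ('NRPS', 'PKS'):
        os.makedirs(os.path.join(root, gene))
        for suffix in ('X.np', 'y.np'):
            # Empty placeholder files are enough to test the pattern.
            open(os.path.join(root, gene, gene + '_' + suffix), 'w').close()

    # '**' with recursive=True matches files at any directory depth.
    X_paths = sorted(glob(os.path.join(root, '**', '*X.np'), recursive=True))
    y_paths = sorted(glob(os.path.join(root, '**', '*y.np'), recursive=True))

print("features samples:", len(X_paths))
print("target samples:", len(y_paths))
```

Sorting both lists keeps the feature and target files in the same order, which matters when they are loaded and concatenated one after the other.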


In [65]:
from azureml.core import ScriptRunConfig
from azureml.core import Experiment

experiment_name = os.environ.get('EXPERIMENT_NAME', 'Genes-Classification')

exp = Experiment(workspace=ws, name=experiment_name) # Create a new experiment

experiment_runs = []

# We start one run for every combination of epochs, learning rate and batch size (4 x 4 x 4 = 64 runs)
for epochs in [25, 50, 75, 100]:
    for initial_learning_rate in [0.3,0.1, 0.05, 0.01]:
        for batch_size in [16,32,64,128]:
            args = [
                '--data-folder', datasets['processed_genes'].as_mount('/mnt/data/'),
                '--epochs', epochs,
                '--initial-lr', initial_learning_rate,
                '--batch-size', batch_size]

            script_run_config = ScriptRunConfig(source_directory=script_folder,
                            script='train.py', 
                            arguments=args,
                            compute_target=compute_target,
                            environment=env)

            run = exp.submit(config=script_run_config)
            experiment_runs.append(run) # Append it to our list of experiment runs for now. This is easy for referring later!
            print('Run started!')

        

Run started!
[... the same line is printed 64 times in total, once per submitted run ...]


### Step 2.4 -- Await the results!

Now that our experiment runs are starting, we can await the logs and results.  
It can take a while to run everything, but with a maximum of 4 nodes in the cluster, up to 4 jobs should run in parallel, if everything was configured well!

The cells below can help you in viewing the results, while you head out for a coffee!

I use `experiment_runs[0]` as the run to follow. It's the first one that was started.

There are a few different options below, so everyone can pick the one they prefer :-)

#### Step 2.4.1 -- Plain text output

In [19]:
# specify show_output to True for a verbose log
experiment_runs[0].wait_for_completion(show_output=True) 

RunId: Genes-Classification_1644069031_cffbc973
Web View: https://ml.azure.com/runs/Genes-Classification_1644069031_cffbc973?wsid=/subscriptions/51f06efd-190a-4388-8d72-41669512398d/resourcegroups/04-AzureML/workspaces/hermans-cedric-ml&tid=4ded4bb1-6bff-42b3-aed7-6a36a503bf7a

Streaming azureml-logs/20_image_build_log.txt

2022/02/05 13:50:37 Downloading source code...
2022/02/05 13:50:38 Finished downloading source code
2022/02/05 13:50:39 Creating Docker network: acb_default_network, driver: 'bridge'
2022/02/05 13:50:39 Successfully set up Docker network: acb_default_network
2022/02/05 13:50:39 Setting up Docker configuration...
2022/02/05 13:50:40 Successfully set up Docker configuration
2022/02/05 13:50:40 Logging in to registry: 4bc96a278283483f8523a427ca6960f3.azurecr.io
2022/02/05 13:50:40 Successfully logged into 4bc96a278283483f8523a427ca6960f3.azurecr.io
2022/02/05 13:50:40 Executing step ID: acb_step_0. Timeout(sec): 5400, Working directory: '', Network: 'acb_default_networ

ExperimentExecutionException: ExperimentExecutionException:
	Message: The output streaming for the run interrupted.
But the run is still executing on the compute target. 
Details for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "The output streaming for the run interrupted.\nBut the run is still executing on the compute target. \nDetails for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run"
    }
}
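When the log streaming is interrupted as in the error above, the run itself keeps executing, so its state can be polled instead of streamed. The sketch below is a minimal illustration: `wait_until_finished` and `FakeRun` are hypothetical names, and the set of terminal status names is an assumption. The helper only relies on a `get_status()` method, which the AzureML `Run` object provides.

```python
import time

# Assumed names of the states in which a run no longer changes.
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}


def wait_until_finished(run, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll run.get_status() until a terminal state or max_polls is reached."""
    status = run.get_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        sleep(poll_seconds)
        status = run.get_status()
        polls += 1
    return status


class FakeRun:
    """Stand-in for an AzureML Run, used only to demonstrate the helper."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


# The sleep function is replaced so the demonstration finishes instantly.
final = wait_until_finished(FakeRun(['Queued', 'Running', 'Completed']),
                            sleep=lambda seconds: None)
print(final)
```

With a real run this would be called as `wait_until_finished(experiment_runs[0])`. The `max_polls` limit keeps the notebook cell from blocking forever if a run gets stuck.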