# Training Batch AI
This notebook sets up everything needed for the tutorial. It assumes that nothing has been set up previously and creates every resource from scratch.

**Note:** If you have already run the Batch Shipyard example, you can skip the installation of the Azure CLI.

The necessary steps are broken up into the following sections:

* [Install tools and dependencies](#section1)
* [Azure account login](#section2)
* [Setup](#section3)
* [Create Azure resources](#section4)
* [Define our model](#model)
* [Create Azure Batch AI Cluster](#section6)
* [Configure Job](#section7)
* [Submit Job](#section8)
* [Delete job and delete cluster](#section9)
* [Delete Azure resources](#section10)

<a id='section1'></a>
## Install tools and dependencies
We install Azure CLI 2.0, which we use to provision the Azure Storage account and the Batch AI resources.

In [None]:
import sys

In [None]:
!{sys.executable} -m pip install -I azure-cli

In [None]:
!az provider register -n Microsoft.BatchAI
!az provider register -n Microsoft.Batch

Check that everything is working:

In [None]:
!az --version

<a id='section2'></a>
## Azure account login
The command below will initiate a login to your Azure account. You will see a URL to browse to, where you will enter the specified code. This will log you into the Azure account within the Azure CLI.

In [None]:
!az login -o table

If you have multiple subscriptions you can select the one you need with the command below. This will not be necessary for your assigned Azure Pass account for the workshop.

In [None]:
selected_subscription = "<YOUR SUBSCRIPTION>" # Replace with the name of your subscription
!az account set --subscription "$selected_subscription"

**Note:** If you cannot login with the Azure CLI, you can create Storage accounts on the [Azure Portal](https://portal.azure.com).
- [Instructions for creating an Azure Storage Account](https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account#create-a-storage-account) (You can create an "Auto Storage" account at the same time as your Batch Account on the portal instead)

Please pay attention to special instructions regarding Azure Portal created accounts below.

<a id='section3'></a>
## Setup
Now we will define the various names for the resources needed to run Batch AI jobs.

**Note:** If you manually created your accounts on the Azure Portal, you will need to modify `GROUP_NAME` and `STORAGE_ACCOUNT_NAME` accordingly.
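Whichever way the names are chosen, the storage account name has to satisfy Azure's naming rules: 3 to 24 characters, lowercase letters and digits only. As a minimal sketch, the check below validates a name built with the same `batch{uuid}st` pattern used in the setup cell. The helper `is_valid_storage_account_name` is a hypothetical name introduced here for illustration and is not used elsewhere in the notebook.

```python
import re
import uuid

# Azure Storage account names: 3-24 characters, lowercase letters and digits only.
_STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Return True if `name` satisfies the storage account naming rules."""
    return _STORAGE_NAME_RE.fullmatch(name) is not None


# Same pattern as the setup cell: "batch" + 8 hex characters + "st"
short_uuid = str(uuid.uuid4())[:8]
candidate = "batch{uuid}st".format(uuid=short_uuid)

print(candidate, is_valid_storage_account_name(candidate))
print(is_valid_storage_account_name("My-Storage"))  # uppercase and hyphen are rejected
```

Running a check like this before calling the CLI avoids a failed provisioning call when you substitute your own names.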

In [None]:
import os
import uuid
import random
import json

def write_json_to_file(json_dict, filename):
    """ Simple function to write JSON dictionaries to files
    """
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)

LOCATION = 'eastus' # We are setting everything up in East US
                    # Be aware that you need to set things up in a region that has GPU VMs (N-Series)

# Tensorflow image
IMAGE_NAME = "tensorflow/tensorflow:1.8.0-gpu-py3"

short_uuid = str(uuid.uuid4())[:8]
GROUP_NAME = "batch{uuid}rg".format(uuid=short_uuid)
FILESHARE_NAME = "batch{uuid}share".format(uuid=short_uuid)
SHARE_DIRECTORY = "cnn_example"
STORAGE_ACCOUNT_NAME = "batch{uuid}st".format(uuid=short_uuid)
WORKSPACE='workspace'
EXPERIMENT='experiment'

<a id='section4'></a>
## Create Azure resources
### Create Resource Group
Azure encourages the use of resource groups to organise all the Azure components you deploy. That way they are easier to find, and we can delete a number of resources at once simply by deleting the resource group.

In [None]:
!az group create -n $GROUP_NAME -l $LOCATION -o table

### Create Storage account
Here we create the Storage account and a file share inside it. Once the account exists we can use the Azure CLI to query it and obtain the **storage_account_key**, which we will need for our Batch AI configuration files later.

In [None]:
json_data = !az storage account create -l $LOCATION -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME --sku Standard_LRS
print('Storage account {} provisioning state: {}'.format(STORAGE_ACCOUNT_NAME, json.loads(''.join(json_data))['provisioningState']))

In [None]:
!az storage share create --account-name $STORAGE_ACCOUNT_NAME --name $FILESHARE_NAME

Next we retrieve the **storage_account_key** which we will need for the Batch AI configuration files further down.

**Note:** If you created your Storage account in the Azure Portal, you will need to retrieve your keys in the Portal. Then set `storage_account_key` to the appropriate value instead of using the Azure CLI calls.
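When the output of a `!` command is assigned to a variable, IPython returns it as a list of output lines, which is why the cells in this section join the lines before handing them to `json.loads`. The sketch below shows that parsing step in isolation. The sample lines are made up for illustration (they are not real CLI output and the values are not real keys), and `first_key_from_cli_output` is a hypothetical helper name.

```python
import json


def first_key_from_cli_output(lines):
    """Join CLI output lines and return the `value` field of the first key entry."""
    keys = json.loads(''.join(lines))
    return keys[0]['value']


# Made-up sample shaped like a JSON array of key entries; not a real key.
sample_lines = [
    '[',
    '  {"keyName": "key1", "permissions": "Full", "value": "not-a-real-key-1"},',
    '  {"keyName": "key2", "permissions": "Full", "value": "not-a-real-key-2"}',
    ']',
]

sample_key = first_key_from_cli_output(sample_lines)
print(sample_key)
```

If the CLI call fails, the joined text is not valid JSON and `json.loads` raises an error, so a parsing failure at this step usually points back to the `az` command itself.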

In [None]:
json_data = !az storage account keys list -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME
storage_account_key = json.loads(''.join(json_data))[0]['value']

In [None]:
!az configure --defaults group=$GROUP_NAME
!az configure --defaults location=$LOCATION

Next we create the directory on the file share where we will save our script.

In [None]:
!az storage directory create --share-name $FILESHARE_NAME  --name $SHARE_DIRECTORY \
--account-name $STORAGE_ACCOUNT_NAME --account-key $storage_account_key

We save the account information to a file so that we can retrieve it if something happens to the notebook.
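Restoring the values later is the mirror image of writing them. A minimal sketch, assuming the file was written with `json.dump` as `write_json_to_file` does; `read_json_from_file` is a hypothetical helper that this notebook does not define, and the values below are placeholders.

```python
import json
import os
import tempfile


def read_json_from_file(filename):
    """Hypothetical counterpart to write_json_to_file: load a JSON file into a dict."""
    with open(filename) as infile:
        return json.load(infile)


# Round-trip demonstration with placeholder values in a temporary directory
example = {"resource_group": "batchexamplerg", "storage_account_name": "batchexamplest"}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "account_information.json")
    with open(path, "w") as outfile:
        json.dump(example, outfile)
    restored = read_json_from_file(path)

print(restored["resource_group"])
```

Because the saved file includes the storage account key in plain text, it is worth keeping it out of version control.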

In [None]:
account_information = {
    "IMAGE_NAME": IMAGE_NAME,
    "LOCATION": LOCATION,
    "resource_group": GROUP_NAME,
    "storage_account_key": storage_account_key,
    "storage_account_name": STORAGE_ACCOUNT_NAME,
}
write_json_to_file(account_information, 'account_information.json')

<a id='model'></a>
## Define Our Model
The file below contains a simple VGG-like CNN written in TensorFlow (using the `tf.layers` API). It loads the CIFAR-10 data, trains the model for a number of epochs, and then evaluates it on the test set.
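One detail of the script that is easy to miss is where the `256*4*4` in its flatten step comes from. The convolutions use `'same'` padding and so preserve the spatial size, while each of the three blocks ends in a 2x2 max pool with stride 2 and `'valid'` padding, which halves it. The sketch below works through that arithmetic in plain Python; `pooled_size` is a hypothetical helper written for this illustration.

```python
def pooled_size(size, pool=2, stride=2):
    """Output length of a 'valid' pooling window along one spatial dimension."""
    return (size - pool) // stride + 1


side = 32  # CIFAR-10 images are 32x32
for block in range(1, 4):
    side = pooled_size(side)
    print("after block {}: {}x{}".format(block, side, side))

# 256 filters in the last convolutional block
flat_features = 256 * side * side
print(flat_features)
```

If you change the input resolution or the number of pooling layers, the reshape in the script has to be updated to match.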

In [None]:
%%writefile cifar10_cnn.py
'''Train a VGG-like CNN on the CIFAR10 small images dataset.
'''

import numpy as np
import os
import sys
import tarfile
import pickle
import tensorflow as tf
import argparse
import logging
from urllib.request import urlretrieve

# Parameters
EPOCHS = 30
BATCHSIZE = 64
LR = 0.01
MOMENTUM = 0.9
N_CLASSES = 10 # There are 10 classes in the CIFAR10 dataset
DATA_FORMAT = 'channels_first'

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)


def read_pickle(src):
    with open(src, 'rb') as f:
        data = pickle.load(f, encoding='latin1')
    return data


def process_cifar():
    """ Read data
    """
    
    logger.info('Preparing train set...')
    train_list = [read_pickle('./cifar-10-batches-py/data_batch_{0}'.format(i)) for i in range(1, 6)]
    x_train = np.concatenate([t['data'] for t in train_list])
    y_train = np.concatenate([t['labels'] for t in train_list])
    
    logger.info('Preparing test set...')
    tst = read_pickle('./cifar-10-batches-py/test_batch')
    x_test = tst['data']
    y_test = np.asarray(tst['labels'])
    
    return x_train, y_train, x_test, y_test


def prepare_cifar(x_train, y_train, x_test, y_test):
    
    # Scale pixel intensity
    x_train = x_train / 255.0
    x_test = x_test / 255.0
    
    # Reshape
    x_train = x_train.reshape(-1, 3, 32, 32)
    x_test = x_test.reshape(-1, 3, 32, 32)
    
    return (x_train.astype(np.float32), 
            y_train.astype(np.int32), 
            x_test.astype(np.float32), 
            y_test.astype(np.int32))


def load_cifar(src="http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"):
    """ Load CIFAR10 Dataset
    """
    try:
        return process_cifar()
    except FileNotFoundError:
        logger.info('Data does not exist. Downloading ' + src)
        fname, h = urlretrieve(src, './delete.me')
        logger.info('Extracting files...')
        with tarfile.open(fname) as tar:
            tar.extractall()
        os.remove(fname)
    return process_cifar()


def init_model_training(m, labels, learning_rate=LR, momentum=MOMENTUM):
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=m, labels=labels)
    loss = tf.reduce_mean(cross_entropy)
    optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=momentum)
    return optimizer.minimize(loss)


def shuffle_data(X, y):
    index = np.arange(len(X))
    np.random.shuffle(index)
    return X[index], y[index]


def minibatch_from(X, y, batchsize=BATCHSIZE, shuffle=False):
    if len(X) != len(y):
        raise Exception("The length of X {} and y {} don't match".format(len(X), len(y)))
        
    if shuffle:
        X, y = shuffle_data(X, y)
    
    for i in range(0, len(X), batchsize):
        yield X[i:i + batchsize], y[i:i + batchsize]
        

def main(epochs=EPOCHS, lr=LR, mb_size=BATCHSIZE, data_format=DATA_FORMAT):
    logger.info('Learning Rate: {} Minibatch Size: {} Epochs: {}'.format(lr, mb_size, epochs))
    
    logger.info('Loading data....')
    # Data into format for library
    x_train, y_train, x_test, y_test = prepare_cifar(*load_cifar())
    logger.info('Data shape {}'.format(str((x_train.shape, x_test.shape, y_train.shape, y_test.shape))))
    logger.info('Data types {}'.format(str((x_train.dtype, x_test.dtype, y_train.dtype, y_test.dtype))))
    
    tf.reset_default_graph()
    # Place-holders
    X = tf.placeholder(tf.float32, shape=[None, 3, 32, 32])
    y = tf.placeholder(tf.int32, shape=[None])
    training = tf.placeholder(tf.bool)  # Indicator for dropout layer
    
    # Define model
    # Block 1
    conv1_1 = tf.layers.conv2d(X, 
                               filters=64, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    conv1_2 = tf.layers.conv2d(conv1_1, 
                               filters=64, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    pool1_1 = tf.layers.max_pooling2d(conv1_2, 
                                      pool_size=(2, 2), 
                                      strides=(2, 2), 
                                      padding='valid', 
                                      data_format=data_format)
    # Block 2
    conv2_1 = tf.layers.conv2d(pool1_1, 
                               filters=128, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    conv2_2 = tf.layers.conv2d(conv2_1, 
                               filters=128, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    pool2_1 = tf.layers.max_pooling2d(conv2_2, 
                                      pool_size=(2, 2), 
                                      strides=(2, 2), 
                                      padding='valid', 
                                      data_format=data_format)

    # Block 3
    conv3_1 = tf.layers.conv2d(pool2_1, 
                               filters=256, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    conv3_2 = tf.layers.conv2d(conv3_1, 
                               filters=256, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    conv3_3 = tf.layers.conv2d(conv3_2, 
                               filters=256, 
                               kernel_size=(3, 3), 
                               padding='same', 
                               data_format=data_format,
                               activation=tf.nn.relu)
    pool3_1 = tf.layers.max_pooling2d(conv3_3, 
                                      pool_size=(2, 2), 
                                      strides=(2, 2), 
                                      padding='valid', 
                                      data_format=data_format)

    relu2 = tf.nn.relu(pool3_1)
    flatten = tf.reshape(relu2, shape=[-1, 256*4*4])
    fc1 = tf.layers.dense(flatten, 4096, activation=tf.nn.relu)
    drop1 = tf.layers.dropout(fc1, 0.5, training=training)
    fc2 = tf.layers.dense(drop1, 4096, activation=tf.nn.relu)
    drop2 = tf.layers.dropout(fc2, 0.5, training=training)
    model = tf.layers.dense(drop2, N_CLASSES, name='output')

    train_model = init_model_training(model, y, learning_rate=lr)
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)

    # Accuracy logging
    correct = tf.nn.in_top_k(model, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
    logger.info('Training model...')
    
    # Train model
    for j in range(epochs):
        for data, label in minibatch_from(x_train, y_train, shuffle=True, batchsize=mb_size):
            sess.run(train_model, feed_dict={X: data, y: label, training: True})
        # Log accuracy on the last minibatch of the epoch (dropout is still active here)
        acc_train = sess.run(accuracy, feed_dict={X: data, y: label, training: True})
        logger.info("{} | Train accuracy: {}".format(j, acc_train))
    
    logger.info('Evaluating model...')
    # Build the prediction op once; creating it inside the loop adds a new graph node per batch
    pred = tf.argmax(model, 1)
    y_guess = list()
    for data, label in minibatch_from(x_test, y_test):
        output = sess.run(pred, feed_dict={X: data, training: False})
        y_guess.append(output)
    logger.info("Accuracy: {}".format(sum(np.concatenate(y_guess) == y_test)/float(len(y_test))))
    
    
if __name__=='__main__':
    logger.info('Starting script....')
    parser = argparse.ArgumentParser(description='Script to train VGG-like model on CIFAR10 dataset')
    parser.add_argument('--lr', help='Specify learning rate', type=float, default=LR)
    parser.add_argument('--mb_size', help='Minibatch size', type=int, default=BATCHSIZE)
    parser.add_argument('--epochs', help='Number of epochs to train for', type=int, default=EPOCHS)
    args = parser.parse_args()
    main(epochs=args.epochs, lr=args.lr, mb_size=args.mb_size)


Now we transfer the script we created above to the fileshare.

In [None]:
!az storage file upload --share-name $FILESHARE_NAME --source cifar10_cnn.py --path $SHARE_DIRECTORY \
--account-name $STORAGE_ACCOUNT_NAME --account-key $storage_account_key

<a id='section6'></a>
## Create Azure Batch AI Cluster
First we need to create the cluster for our jobs to run on. This can take a little time, so please be patient while the compute nodes are allocated from the Azure Cloud and the Docker images are pre-loaded onto the compute nodes.

Before creating the cluster, though, we must create a workspace with which our cluster, experiments and jobs will be associated.

In [None]:
!az batchai workspace create --workspace $WORKSPACE --location $LOCATION --resource-group $GROUP_NAME

**NOTE**
Before executing the command below, make sure you replace `<admin_username>` and `<admin_password>` with your desired username and password.

In [None]:
CLUSTER_NAME = 'gpupool'
!az batchai cluster create --name $CLUSTER_NAME --vm-size STANDARD_NC6 --image UbuntuLTS \
--workspace $WORKSPACE \
--min 1 --max 1 --storage-account-name $STORAGE_ACCOUNT_NAME \
--storage-account-key $storage_account_key \
--afs-name $FILESHARE_NAME --afs-mount-path azurefileshare \
--user-name <admin_username> --password <admin_password>

Once the cluster is created we can confirm everything by running the commands below. Wait until the state reads `steady` and the number of nodes in the idle state is 1.

In [None]:
!az batchai cluster show -n $CLUSTER_NAME --workspace $WORKSPACE

In [None]:
!az batchai cluster list -w $WORKSPACE -o table

<a id='section7'></a>
## Configure Job
The dictionary below defines the job we will execute. The job specification tells Batch AI to run the Docker image we defined earlier (`tensorflow/tensorflow:1.8.0-gpu-py3`) and to mount `$AZ_BATCHAI_MOUNT_ROOT/azurefileshare/cnn_example` to the location defined by the environment variable `AZ_BATCHAI_INPUT_SCRIPT`. You will notice that the last word in `AZ_BATCHAI_INPUT_SCRIPT` matches the id we gave to our entry in `inputDirectories`. The command simply executes `cifar10_cnn.py` inside the container.
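The naming convention just described can be sketched in a few lines. This is an illustration under two assumptions: that the variable name is always the prefix `AZ_BATCHAI_INPUT_` followed by the directory id, as the `SCRIPT` example suggests, and that paths on the Linux nodes should always use forward slashes. The helpers `input_env_var` and `mount_path` are hypothetical names. Using `posixpath.join` rather than `os.path.join` is a deliberate swap, so that the paths come out the same when the notebook itself runs on Windows.

```python
import posixpath


def input_env_var(directory_id):
    """Environment variable name assumed for an input directory id."""
    return "AZ_BATCHAI_INPUT_" + directory_id


def mount_path(*parts):
    """Join path parts under the mount root using forward slashes on any OS."""
    return posixpath.join("$AZ_BATCHAI_MOUNT_ROOT", *parts)


script_dir = mount_path("azurefileshare", "cnn_example")
command = "python ${}/cifar10_cnn.py".format(input_env_var("SCRIPT"))

print(script_dir)
print(command)
```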

In [None]:
JOBNAME="tf-training-job"
job = {
    "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2018-05-01/job.json",
    "properties": {
        "nodeCount": 1,
        "customToolkitSettings": {
            "commandLine": "python $AZ_BATCHAI_INPUT_SCRIPT/cifar10_cnn.py"
        },
        "stdOutErrPathPrefix": os.path.join('$AZ_BATCHAI_MOUNT_ROOT', 'azurefileshare'),
        "inputDirectories": [{
            "id": "SCRIPT",
            "path": os.path.join('$AZ_BATCHAI_MOUNT_ROOT', 'azurefileshare', SHARE_DIRECTORY)
        }],
        "containerSettings": {
            "imageSourceRegistry": {
                "image": IMAGE_NAME
            }
        }
    }
}

In [None]:
write_json_to_file(job,'job.json')

<a id='section8'></a>
## Submit Job
Before submitting our job we will create an experiment that we can associate the job with.

In [None]:
!az batchai experiment create -n $EXPERIMENT --workspace $WORKSPACE -g $GROUP_NAME

Now that we have confirmed everything is working we can execute our job using the command below. 

In [None]:
!az batchai job create --experiment $EXPERIMENT --workspace $WORKSPACE --name $JOBNAME \
--cluster $CLUSTER_NAME --config job.json

We can check the status of the job with the command below. The job should take around 10-15 minutes to run. If it has finished without any errors, the state will read `succeeded` and the exit code will be 0.
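If you would rather wait in code than re-run the status command by hand, a small polling loop is one option. The sketch below is generic: `wait_for_state` and its `get_state` callable are hypothetical, and the demonstration replays a fixed sequence of states instead of calling the CLI. Only `succeeded` is taken from this notebook; the other state names are placeholders for illustration.

```python
import time


def wait_for_state(get_state, done_states=("succeeded", "failed"),
                   interval=30, timeout=3600, sleep=time.sleep):
    """Call get_state() every `interval` seconds until it returns a terminal state."""
    waited = 0
    while True:
        state = get_state()
        if state in done_states:
            return state
        if waited >= timeout:
            raise TimeoutError("still '{}' after {} seconds".format(state, waited))
        sleep(interval)
        waited += interval


# Stand-in for a real status lookup: replays a fixed sequence of states
fake_states = iter(["running", "running", "succeeded"])
final_state = wait_for_state(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

In practice `get_state` would wrap whatever status query you use and return the job's state as a string. The timeout keeps a stuck job from blocking the notebook indefinitely.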

In [None]:
!az batchai job list -e $EXPERIMENT -w $WORKSPACE -o table

We can use the command below to list links to the stdout and stderr log files.

In [None]:
!az batchai job file list -e $EXPERIMENT -w $WORKSPACE --job $JOBNAME --output-directory-id stdouterr

We can stream any of the text files that are on the node while our job is executing. To stop the output, interrupt the kernel (the stop button on the toolbar above).

In [None]:
!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --job $JOBNAME --output-directory-id stdouterr --file-name stdout.txt

In [None]:
!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --job $JOBNAME --output-directory-id stdouterr --file-name stderr.txt

<a id='section9'></a>
## Delete job and delete cluster

In [None]:
!az batchai job delete -w $WORKSPACE -e $EXPERIMENT --name $JOBNAME -y

In [None]:
!az batchai cluster delete -w $WORKSPACE --name $CLUSTER_NAME -y

<a id='section10'></a>
## Delete Azure resources
Once you have deleted the cluster, the main resource that remains is the storage account.

Note that you do not need to delete your storage accounts.
- Storage costs include data stored in blobs and transactions. For the examples in these notebooks, the cost will be very small.

However, if you wish to delete your accounts, you can do so by deleting the resource group containing the accounts.

In [None]:
!az group delete --name $GROUP_NAME -y

Finally let's remove the previously configured default settings for the location and resource group.

In [None]:
!az configure --defaults group=''
!az configure --defaults location=''