# II - Single GPU Training
In the previous notebook we set up our pool of GPU nodes. In this notebook we use one of the nodes in the pool to train a deep learning model for a small number of epochs. The model and the results of the training are then uploaded to blob storage for later retrieval.

* [Setup](#section1)
* [Configure job](#section2)
* [Submit job](#section3)
* [Delete job](#section4)

<a id='section1'></a>

## Setup

Create a simple alias for Batch Shipyard:

In [1]:
%alias shipyard SHIPYARD_CONFIGDIR=config python $HOME/batch-shipyard/shipyard.py %l

Check that everything is working:

In [2]:
shipyard

Usage: shipyard.py [OPTIONS] COMMAND [ARGS]...

  Batch Shipyard: Provision and Execute Docker Workloads on Azure Batch

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  cert      Certificate actions
  data      Data actions
  fs        Filesystem in Azure actions
  jobs      Jobs actions
  keyvault  KeyVault actions
  misc      Miscellaneous actions
  pool      Pool actions
  storage   Storage actions


Get some variables stored in the Setup notebook:

In [3]:
import json

def read_json(filename):
    with open(filename, 'r') as infile:
        return json.load(infile)
    
account_info = read_json('account_information.json')

IMAGE_NAME = account_info['IMAGE_NAME']
STORAGE_ALIAS = account_info['STORAGE_ALIAS']
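
If a key is missing from `account_information.json`, the lookups above fail with a bare `KeyError`. A minimal sketch of a stricter lookup is shown below. `require_keys` is a hypothetical helper that is not part of the Setup notebook, and the values in `demo_info` are placeholders rather than real account settings.

```python
def require_keys(info, keys):
    """Return the values for `keys` from `info`, in order.

    Hypothetical helper: raises a KeyError that names every missing key
    instead of only the first one encountered.
    """
    missing = [k for k in keys if k not in info]
    if missing:
        raise KeyError('account_information.json is missing: ' + ', '.join(missing))
    return [info[k] for k in keys]


# Placeholder values for illustration only
demo_info = {'IMAGE_NAME': 'example/image:tag', 'STORAGE_ALIAS': 'examplestorage'}
image_name, storage_alias = require_keys(demo_info, ['IMAGE_NAME', 'STORAGE_ALIAS'])
```

In the notebook itself, the same call could be made on `account_info` in place of `demo_info`.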

<a id='section2'></a>
## Configure job
In the dictionary below we define the properties of the job we wish to execute. The image to use is the one we loaded into `IMAGE_NAME` at the beginning of this notebook. We also set the `gpu` switch to true, since we want the job to use the GPU. Finally, the command is essentially:

```
source /cntk/activate-cntk
python ConvNet_CIFAR10.py
```

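This activates the CNTK Anaconda environment and then runs the **ConvNet_CIFAR10.py** script, which trains and evaluates the model. The actual `COMMAND` below wraps both steps in `bash -c`, passes `-u` so that Python output is unbuffered, and uses `--datadir` to point the script at the data in the node's shared directory.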

In the jobs JSON below, `resource_files` lists the script that trains our CNN.

In [4]:
TASK_ID = 'run_cifar10' # This should be changed per task

JOB_ID = 'cntk-training-job'

COMMAND = 'bash -c "source /cntk/activate-cntk; python -u ConvNet_CIFAR10.py --datadir $AZ_BATCH_NODE_SHARED_DIR/data"'

jobs = {
    "job_specifications": [
        {
            "id": JOB_ID,
            "tasks": [
                {
                    "id": TASK_ID,
                    "image": IMAGE_NAME,
                    "remove_container_after_exit": True,
                    "command": COMMAND,
                    "gpu": True,
                    "resource_files": [
                        {
                            "file_path": "ConvNet_CIFAR10.py",
                            "blob_source": "https://batchshipyardexamples.blob.core.windows.net/code/ConvNet_CIFAR10.py",
                            "file_mode":'0777'
                        }
                    ],
                    "output_data": {
                        "azure_storage": [
                            {
                                "storage_account_settings": STORAGE_ALIAS,
                                "container": "output",
                                "source": "$AZ_BATCH_TASK_WORKING_DIR/Models"
                            },
                        ]
                    },
                }
            ],
        }
    ]
}
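
The comment on `TASK_ID` notes that it should change per task. One way to avoid editing the dictionary by hand each time is to build it from a function. The sketch below assumes only the keys already used above; `make_task` and `make_jobs` are hypothetical helpers, and the arguments in the example call are placeholders.

```python
def make_task(task_id, image, command, storage_alias):
    """Build one task entry with the same keys as the dictionary above.

    Hypothetical helper for illustration; not part of Batch Shipyard.
    """
    return {
        "id": task_id,
        "image": image,
        "remove_container_after_exit": True,
        "command": command,
        "gpu": True,
        "resource_files": [
            {
                "file_path": "ConvNet_CIFAR10.py",
                "blob_source": "https://batchshipyardexamples.blob.core.windows.net/code/ConvNet_CIFAR10.py",
                "file_mode": "0777",
            }
        ],
        "output_data": {
            "azure_storage": [
                {
                    "storage_account_settings": storage_alias,
                    "container": "output",
                    "source": "$AZ_BATCH_TASK_WORKING_DIR/Models",
                }
            ]
        },
    }


def make_jobs(job_id, tasks):
    """Wrap a list of task entries in the job_specifications structure."""
    return {"job_specifications": [{"id": job_id, "tasks": list(tasks)}]}


# Placeholder arguments for illustration only
example_jobs = make_jobs(
    "cntk-training-job",
    [make_task("run_cifar10_a", "example/image:tag", "echo hello", "examplestorage")],
)
```

In the notebook, `make_task(TASK_ID, IMAGE_NAME, COMMAND, STORAGE_ALIAS)` would produce a task entry equivalent to the one written out above.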

Write the jobs configuration to the `jobs.json` file:

In [5]:
import json
import os

def write_json_to_file(json_dict, filename):
    """ Simple function to write JSON dictionaries to files
    """
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)

write_json_to_file(jobs, os.path.join('config', 'jobs.json'))
print(json.dumps(jobs, indent=4, sort_keys=True))

{
    "job_specifications": [
        {
            "id": "cntk-training-job", 
            "tasks": [
                {
                    "command": "bash -c \"source /cntk/activate-cntk; python -u ConvNet_CIFAR10.py --datadir $AZ_BATCH_NODE_SHARED_DIR/data\"", 
                    "gpu": true, 
                    "id": "run_cifar10", 
                    "image": "microsoft/cntk:2.0-gpu-python3.5-cuda8.0-cudnn5.1", 
                    "output_data": {
                        "azure_storage": [
                            {
                                "container": "output", 
                                "source": "$AZ_BATCH_TASK_WORKING_DIR/Models", 
                                "storage_account_settings": "mystorageaccount"
                            }
                        ]
                    }, 
                    "remove_container_after_exit": true, 
                    "resource_files": [
                        {
                            "blob_source": "ht

<a id='section3'></a>

## Submit job
Check that everything is OK with our pool before we submit our job:


In [6]:
shipyard pool listnodes

2017-07-12 13:37:17,721 DEBUG - listing nodes for pool gpupool
2017-07-12 13:37:17,979 INFO - node_id=tvm-1392786932_1-20170712t132611z [state=ComputeNodeState.idle start_task_exit_code=0 scheduling_state=SchedulingState.enabled ip_address=10.0.0.6 vm_size=standard_nc6 dedicated=True total_tasks_run=0 running_tasks_count=0 total_tasks_succeeded=0]
2017-07-12 13:37:17,979 INFO - node_id=tvm-1392786932_2-20170712t132611z [state=ComputeNodeState.idle start_task_exit_code=0 scheduling_state=SchedulingState.enabled ip_address=10.0.0.5 vm_size=standard_nc6 dedicated=True total_tasks_run=0 running_tasks_count=0 total_tasks_succeeded=0]
2017-07-12 13:37:17,979 INFO - node_id=tvm-1392786932_3-20170712t132611z [state=ComputeNodeState.idle start_task_exit_code=0 scheduling_state=SchedulingState.enabled ip_address=10.0.0.4 vm_size=standard_nc6 dedicated=True total_tasks_run=0 running_tasks_count=0 total_tasks_succeeded=0]
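
Reading the node states by eye works for three nodes, but for a larger pool it may help to check them programmatically. The sketch below parses log text in the format shown above and reports each node's state. It assumes the `node_id=... [state=ComputeNodeState....` layout seen in this output, which may differ in other Batch Shipyard versions; `node_states` is a hypothetical helper and the sample lines are shortened copies of the output above.

```python
import re

# Matches the node id and the state name in a `pool listnodes` log line
NODE_LINE = re.compile(r'node_id=(\S+) \[state=ComputeNodeState\.(\w+)')


def node_states(log_text):
    """Return a dict mapping node_id to state name (hypothetical helper)."""
    return dict(NODE_LINE.findall(log_text))


# Shortened copies of the log lines shown above
sample_log = (
    "2017-07-12 13:37:17,979 INFO - node_id=tvm-1392786932_1-20170712t132611z "
    "[state=ComputeNodeState.idle start_task_exit_code=0]\n"
    "2017-07-12 13:37:17,979 INFO - node_id=tvm-1392786932_2-20170712t132611z "
    "[state=ComputeNodeState.idle start_task_exit_code=0]\n"
)

states = node_states(sample_log)
all_idle = all(state == 'idle' for state in states.values())
```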


Now that we have confirmed everything is working, we can execute our job using the command below. The `--tail` switch at the end streams the named file, here `stdout.txt`, from the node.

In [7]:
shipyard jobs add --tail stdout.txt

2017-07-12 13:37:21,013 INFO - Adding job cntk-training-job to pool gpupool
2017-07-12 13:37:21,520 INFO - uploading file /tmp/tmpfH3qcR as u'shipyardtaskrf-cntk-training-job/run_cifar10.shipyard.envlist'
2017-07-12 13:37:21,761 DEBUG - submitting 1 tasks (0 -> 0) to job cntk-training-job
2017-07-12 13:37:22,022 INFO - submitted all 1 tasks to job cntk-training-job
2017-07-12 13:37:22,022 DEBUG - attempting to stream file stdout.txt from job=cntk-training-job task=run_cifar10

************************************************************
CNTK is activated.

Please checkout tutorials and examples here:
  /cntk/Tutorials
  /cntk/Examples

To deactivate the environment run

  source /root/anaconda3/bin/deactivate

************************************************************
Creating NN model
Learning rate per sample: 0.0015625
Momentum per sample: 0.0
Finished Epoch[1 of 20]: [Training] loss = 2.110387 * 50048, metric = 80.21% * 50048 17.387s (2878.5 samples/s);
Finished Epoch[2 of 20]: [T

We can also retrieve this `stdout.txt` data independently of `--tail` above by using the `data stream` command. Note that when you delete the job, all of this information is deleted as well.

In [8]:
shipyard data stream --filespec $JOB_ID,$TASK_ID,stdout.txt

2017-07-12 13:51:59,954 DEBUG - attempting to stream file stdout.txt from job=cntk-training-job task=run_cifar10

************************************************************
CNTK is activated.

Please checkout tutorials and examples here:
  /cntk/Tutorials
  /cntk/Examples

To deactivate the environment run

  source /root/anaconda3/bin/deactivate

************************************************************
Creating NN model
Learning rate per sample: 0.0015625
Momentum per sample: 0.0
Finished Epoch[1 of 20]: [Training] loss = 2.110387 * 50048, metric = 80.21% * 50048 17.387s (2878.5 samples/s);
Finished Epoch[2 of 20]: [Training] loss = 1.856963 * 49984, metric = 70.04% * 49984 15.930s (3137.7 samples/s);
Finished Epoch[3 of 20]: [Training] loss = 1.717015 * 49984, metric = 63.58% * 49984 15.758s (3172.0 samples/s);
Finished Epoch[4 of 20]: [Training] loss = 1.592324 * 49984, metric = 58.27% * 49984 15.736s (3176.4 samples/s);
Finished Epoch[5 of 20]: [Training] loss = 1.479462 * 50
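
Once the streamed text has been saved, the per-epoch figures can be extracted for plotting or comparison. The sketch below assumes the `Finished Epoch[...]` line format shown above. `parse_epochs` is a hypothetical helper, and the sample text reuses the first complete lines from this output.

```python
import re

# Captures the epoch number, training loss and metric percentage from one log line
EPOCH_LINE = re.compile(
    r'Finished Epoch\[(\d+) of (\d+)\]: \[Training\] loss = ([\d.]+) \* \d+, '
    r'metric = ([\d.]+)%'
)


def parse_epochs(stdout_text):
    """Return one dict per completed epoch found in the text (hypothetical helper)."""
    rows = []
    for match in EPOCH_LINE.finditer(stdout_text):
        rows.append({
            'epoch': int(match.group(1)),
            'loss': float(match.group(3)),
            'metric_pct': float(match.group(4)),
        })
    return rows


# Lines copied from the output above
sample_stdout = (
    "Finished Epoch[1 of 20]: [Training] loss = 2.110387 * 50048, metric = 80.21% * 50048 17.387s (2878.5 samples/s);\n"
    "Finished Epoch[2 of 20]: [Training] loss = 1.856963 * 49984, metric = 70.04% * 49984 15.930s (3137.7 samples/s);\n"
)

epochs = parse_epochs(sample_stdout)
```

Incomplete lines, such as a line cut off mid-stream, simply do not match and are skipped.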

If something goes wrong, you can run the following command to get the stderr output from the task.

In [9]:
shipyard data stream --filespec $JOB_ID,$TASK_ID,stderr.txt

2017-07-12 13:52:04,470 DEBUG - attempting to stream file stderr.txt from job=cntk-training-job task=run_cifar10
INFO:__main__:Processing /mnt/batch/tasks/shared/data/train_map.txt...
INFO:__main__:Processing /mnt/batch/tasks/shared/data/test_map.txt...
INFO:__main__:Running network with: 
                2 convolution layers
                64  minibatch size
                for 20 epochs
Selected GPU[0] Tesla K80 as the process wide default device.
ping [requestnodes (before change)]: 1 nodes pinging each other
ping [requestnodes (after change)]: 1 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
ping [mpihelper]: 1 nodes pinging each other
-------------------------------------------------------------------
Build info: 

		Built time: May 31 2017 17:14:18
		Last modified date: Sun May 21 16:00:04 2017
		Build type: release
		Build target: GPU
		With 1bit-SGD: no
		With ASGD: yes
		Math lib:

<a id='section4'></a>

## Delete job

To delete the job use the command below. Just be aware that this will get rid of all the files created by the job and tasks.

In [None]:
shipyard jobs del -y --termtasks --wait

2017-07-12 13:52:08,691 INFO - Deleting job: cntk-training-job
2017-07-12 13:52:08,691 DEBUG - disabling job cntk-training-job first due to task termination
2017-07-12 13:52:09,564 DEBUG - Skipping termination of completed task run_cifar10 on job cntk-training-job
2017-07-12 13:52:10,050 DEBUG - waiting for job cntk-training-job to delete
2017-07-12 13:52:42,000 INFO - job cntk-training-job does not exist


[Next notebook: Scoring](03_Scoring_model.ipynb)