# Training CNTK and TensorFlow models for image classification

## Outline
- [Preparing an Azure N-Series GPU Deep Learning VM](#prepare)
   - [Provision the VM](#provision)
   - [Connect to the VM by remote desktop](#rd)
   - [Clone/download scripts and supporting files](#repo)
   - [Download training set data locally](#trainingset)
   - [(Optional) Access the VM remotely via Jupyter Notebook](#jupyter)
- [Microsoft Cognitive Toolkit](#cntk)
- [TensorFlow](#tensorflow)
   - [Training script](#tfscript)
   - [Model](#tfmodel)
   - [Running the training script](#tfrun)

<a name="prepare"></a>
## Preparing an Azure N-Series GPU Deep Learning VM

<a name="provision"></a>
### Provision the VM
Deploy a "Deep Learning toolkit for the DSVM" resource in a region that offers GPU VMs, such as East US. As of this writing (1/19), the DSVM deploys with CNTK 2.0.

<a name="rd"></a>
### Connect to the VM by remote desktop

To use remote desktop, click "Connect" on the VM's main pane to download an RDP file. When logging in, specify the "domain" (the VM name) along with your username, e.g. "mawahgpudsvm\mawah", so that the connection doesn't attempt to use your Microsoft domain.

<a name="repo"></a>
### Clone/download scripts and supporting files
Download the contents of this repo and copy the contents of the `tf` and `cntk` subfolders to appropriate locations. We have used locations on the temporary drive, e.g. `D:\tf` and `D:\cntk`.

<a name="trainingset"></a>
### Download training set data locally
During image set preparation, a training image set and descriptive files were created for use with CNTK and TensorFlow. Transfer these files to the GPU VM and store them in an appropriate location. (We have used the `D:\combined\train_subsample` folder.) If you did not generate a larger training set earlier, you can use the small training set included in this git repo. You may need to regenerate the CNTK map file if the image paths have changed.
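As a rough illustration of regenerating the map file, the sketch below writes one tab-separated `<image path>	<label index>` line per image, which is the layout CNTK's `ImageDeserializer` reads. The folder layout (one subfolder per class), the class names, and the helper name `write_map_file` are assumptions made for this example. Label indices must match the ones used during the original image set preparation, so check the ordering before reusing it.

```python
import os
import tempfile


def write_map_file(image_root, map_path, extensions=(".png",)):
    """Write a CNTK-style map file and return class names in label order.

    Assumes (hypothetically) that image_root holds one subfolder per class
    and that labels are assigned in alphabetical order of the folder names.
    """
    class_names = sorted(
        d for d in os.listdir(image_root)
        if os.path.isdir(os.path.join(image_root, d))
    )
    with open(map_path, "w") as f:
        for label, class_name in enumerate(class_names):
            class_dir = os.path.join(image_root, class_name)
            for filename in sorted(os.listdir(class_dir)):
                if filename.lower().endswith(extensions):
                    image_path = os.path.abspath(os.path.join(class_dir, filename))
                    # One line per image: absolute path, a tab, then the label index
                    f.write("{}\t{}\n".format(image_path, label))
    return class_names


# Small demonstration on a throwaway folder with two made-up classes
demo_root = tempfile.mkdtemp()
for demo_class, image_count in (("class_a", 2), ("class_b", 1)):
    os.makedirs(os.path.join(demo_root, demo_class))
    for i in range(image_count):
        open(os.path.join(demo_root, demo_class, "img{}.png".format(i)), "w").close()

demo_map_path = os.path.join(demo_root, "map.txt")
demo_classes = write_map_file(demo_root, demo_map_path)
print(demo_classes)
```

On the VM you would point `image_root` at the training image folder and `map_path` at the `map.txt` file that the CNTK training script reads.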

<a name="jupyter"></a>
### (Optional) Access the VM remotely via Jupyter Notebook

Follow these steps if you wish to be able to access the notebook server remotely:
1. In the [Azure Portal](https://portal.azure.com), navigate to the deployed VM's pane and determine its IP address.
1. In the [Azure Portal](https://portal.azure.com), navigate to the deployed VM's Network Security Group's pane and add inbound/outbound rules permitting traffic on port 9999.
1. While connected to the VM via remote desktop, launch a command prompt (press Windows key + R, then enter `cmd`) and type the following commands:

   ```
   cd C:\dsvm\tools\setup
   JupyterSetPasswordAndStart.cmd
   ```

   Follow the prompts to set your remote access password.
   
1. Connect to your VM remotely via Jupyter Notebooks using the IP address you determined earlier and port 9999, e.g. `https://[__.__.__.__]:9999`. The default directory on login will be `C:\dsvm\notebooks`.
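If the browser cannot reach the notebook server, it can help to first confirm that port 9999 is reachable from your local machine. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` and the timeout value are choices made for this example, not part of the DSVM tooling.

```python
import socket


def port_is_open(host, port=9999, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Usage (replace the placeholder with the IP address found in the Azure Portal):
# print(port_is_open("<VM IP address>"))
```

A `False` result usually points at the Network Security Group rules or at the notebook server not having been started on the VM.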

<a name="cntk"></a>
## Microsoft Cognitive Toolkit (CNTK)

The script below trains a ResNet for image classification from scratch. With the settings shown (eight stacked layers per stage), the network has 50 layers, matching the `resnet50` prefix of the saved model files. The script is adapted from the [CNTK ResNet/CIFAR10 image classification example](https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/ResNet/Python); if training on a multi-GPU VM, see that example's code for distributed training.

```python
# Copyright (c) Microsoft. All rights reserved.

# Licensed under the MIT license. See LICENSE.md file in the project root
# for full license information.
# ==============================================================================

# __future__ imports must come before any other statement
from __future__ import print_function
import os
import math
import numpy as np

from cntk.initializer import he_normal
from cntk.layers import AveragePooling, BatchNormalization, Convolution, Dense
from cntk.utils import *
from cntk.ops import input_variable, cross_entropy_with_softmax, classification_error, element_times, relu
from cntk.io import MinibatchSource, ImageDeserializer, StreamDef, StreamDefs
from cntk import Trainer, cntk_py
from cntk.learner import momentum_sgd, learning_rate_schedule, momentum_as_time_constant_schedule, UnitType
from _cntk_py import set_computation_network_trace_level

# Running parameters -- edit as necessary
data_path  = 'E:\\combined\\train_subsample2'
model_path = 'E:\\cntk\\models'
image_height = 224
image_width  = 224
num_channels = 3  # RGB
num_classes  = 7

# Helper functions for ResNet construction
def conv_bn(input, filter_size, num_filters, strides=(1,1), init=he_normal()):
    c = Convolution(filter_size, num_filters, activation=None, init=init, pad=True, strides=strides, bias=False)(input)
    r = BatchNormalization(map_rank=1, normalization_time_constant=4096, use_cntk_engine=False)(c)
    return r

def conv_bn_relu(input, filter_size, num_filters, strides=(1,1), init=he_normal()):
    r = conv_bn(input, filter_size, num_filters, strides, init)
    return relu(r)

def resnet_basic(input, num_filters):
    # Residual block with an identity shortcut
    c1 = conv_bn_relu(input, (3,3), num_filters)
    c2 = conv_bn(c1, (3,3), num_filters)
    p  = c2 + input
    return relu(p)

def resnet_basic_inc(input, num_filters, strides=(2,2)):
    # Residual block that downsamples; the shortcut uses a strided 1x1 convolution
    c1 = conv_bn_relu(input, (3,3), num_filters, strides)
    c2 = conv_bn(c1, (3,3), num_filters)
    s  = conv_bn(input, (1,1), num_filters, strides)
    p  = c2 + s
    return relu(p)

def resnet_basic_stack(input, num_stack_layers, num_filters):
    assert (num_stack_layers >= 0)
    l = input
    for _ in range(num_stack_layers):
        l = resnet_basic(l, num_filters)
    return l

def create_model(input, num_stack_layers, num_classes):
    c_map = [16, 32, 64]
    conv = conv_bn_relu(input, (3,3), c_map[0])
    r1 = resnet_basic_stack(conv, num_stack_layers, c_map[0])
    r2_1 = resnet_basic_inc(r1, c_map[1])
    r2_2 = resnet_basic_stack(r2_1, num_stack_layers-1, c_map[1])
    r3_1 = resnet_basic_inc(r2_2, c_map[2])
    r3_2 = resnet_basic_stack(r3_1, num_stack_layers-1, c_map[2])
    pool = AveragePooling(filter_shape=(8,8))(r3_2)
    z = Dense(num_classes)(pool)
    return z

# Function for accessing and preprocessing the images
def create_reader(map_file):
    if not os.path.exists(map_file):
        raise RuntimeError("File '{}' does not exist".format(map_file))

    # the only transformation applied is scaling to the expected input size
    transforms = [ImageDeserializer.scale(width=image_width,
                                          height=image_height,
                                          channels=num_channels,
                                          interpolations='linear')]
    return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
        features = StreamDef(field='image', transforms=transforms), # first column in map file is referred to as 'image'
        labels   = StreamDef(field='label', shape=num_classes))))   # and second as 'label'

# Function for coordinating training
def train(reader_train, epoch_size, max_epochs, model_location=None):
    set_computation_network_trace_level(0)
    input_var = input_variable((num_channels, image_height, image_width))
    label_var = input_variable((num_classes))

    z = create_model(input_var, 8, num_classes)
    lr_per_mb = [0.001]+[0.01]*80+[0.001]*40+[0.0001]

    # loss and metric
    ce = cross_entropy_with_softmax(z, label_var)
    pe = classification_error(z, label_var)

    minibatch_size = 16
    momentum_time_constant = -minibatch_size/np.log(0.9)
    l2_reg_weight = 0.0001

    lr_per_sample = [lr/minibatch_size for lr in lr_per_mb]
    lr_schedule = learning_rate_schedule(lr_per_sample, epoch_size=epoch_size, unit=UnitType.sample)
    mm_schedule = momentum_as_time_constant_schedule(momentum_time_constant)

    learner = momentum_sgd(z.parameters, lr_schedule, mm_schedule,
                           l2_regularization_weight = l2_reg_weight,
                           unit_gain=True)
    trainer = Trainer(z, ce, pe, learner)
    if model_location is not None:
        trainer.restore_from_checkpoint(model_location)

    input_map = {input_var: reader_train.streams.features,
                 label_var: reader_train.streams.labels}

    log_number_of_parameters(z) ; print()
    progress_printer = ProgressPrinter(tag='Training')

    for epoch in range(max_epochs):
        sample_count = 0
        while sample_count < epoch_size:
            data = reader_train.next_minibatch(min(minibatch_size, epoch_size-sample_count), input_map=input_map)
            trainer.train_minibatch(data)
            sample_count += trainer.previous_minibatch_sample_count
            progress_printer.update_with_trainer(trainer, with_metric=True)
        progress_printer.epoch_summary(with_metric=True)
        # save a snapshot of the model after every epoch
        z.save_model(os.path.join(model_path, 'resnet50_{}.dnn'.format(epoch)))

# Create data reader and run training
reader_train = create_reader(os.path.join(data_path, 'map.txt'))
train(reader_train, epoch_size=1000, max_epochs=160)
```
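To make the hyperparameters in the script easier to check, the sketch below reproduces three pieces of arithmetic in plain Python, with no CNTK dependency: the layer count implied by the number of stacked layers, the conversion of the per-minibatch learning rates to per-sample rates, and the momentum time constant. The helper names are made up for this illustration. The depth formula follows the usual convention of counting the first convolution, two convolutions per residual block, and the final dense layer, while leaving out the 1x1 shortcut convolutions.

```python
import math


def resnet_depth(num_stack_layers):
    # 1 initial convolution + 3 stages * num_stack_layers blocks * 2 convolutions + 1 dense layer
    return 6 * num_stack_layers + 2


def per_sample_learning_rates(lr_per_minibatch, minibatch_size):
    # The script divides each per-minibatch rate by the minibatch size
    return [lr / minibatch_size for lr in lr_per_minibatch]


def momentum_time_constant(momentum_per_minibatch, minibatch_size):
    # Chosen so that exp(-minibatch_size / time_constant) equals the per-minibatch momentum
    return -minibatch_size / math.log(momentum_per_minibatch)


# Values taken from the training script
lr_per_mb = [0.001] + [0.01] * 80 + [0.001] * 40 + [0.0001]
lr_per_sample = per_sample_learning_rates(lr_per_mb, minibatch_size=16)

print(resnet_depth(8))   # 50
print(len(lr_per_mb))    # 122
print(momentum_time_constant(0.9, 16))
```

The schedule lists 122 values, one per epoch of 1000 samples, while the script requests 160 epochs, so the later epochs run beyond the end of the explicit list.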

For details of the model evaluation process, please see the scoring notebook in the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.

<a name="tensorflow"></a>
## TensorFlow

<a name="tfscript"></a>
### Training script

We made use of the [`tf-slim` API](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) for TensorFlow, which provides pre-trained ResNet models and helpful scripts for retraining and scoring. During training set preparation, we converted raw PNG images to the [TFRecords](https://www.tensorflow.org/how_tos/reading_data/#file_formats) files that those scripts expect as input. (Our evaluation set images will be scored on Spark without conversion to TFRecord format.)

Our training script is a modified version of `train_image_classifier.py` from the [TensorFlow models repo's slim subdirectory](https://github.com/tensorflow/models/tree/master/slim). Changes have also been made to some of that script's dependencies. We recommend that you clone this repo and transfer the `tf` subfolder, including dependencies, to a suitable location, for example:

```python
repo_dir = 'D:\\tf'
```

<a name="tfmodel"></a>
### Model

We will retrain the logits layer of a 152-layer ResNet pretrained on ImageNet. This model is highlighted in the [TensorFlow models repo's slim subdirectory](https://github.com/tensorflow/models/tree/master/slim). The pretrained model can be obtained and unpacked with the code snippet below:

```python
import urllib.request
import tarfile
import os

# Download the pretrained checkpoint archive, unpack it into repo_dir, then delete the archive
urllib.request.urlretrieve('http://download.tensorflow.org/models/resnet_v1_152_2016_08_28.tar.gz',
                           os.path.join(repo_dir, 'resnet_v1_152_2016_08_28.tar.gz'))
with tarfile.open(os.path.join(repo_dir, 'resnet_v1_152_2016_08_28.tar.gz'), 'r:gz') as f:
    f.extractall(path=repo_dir)
os.remove(os.path.join(repo_dir, 'resnet_v1_152_2016_08_28.tar.gz'))
```

<a name="tfrun"></a>
### Running the training script

We recommend that you run the training script from an Anaconda prompt. The code cell below will help you generate the appropriate command based on your file locations.

```python
# path where retrained model and logs will be saved during training
train_dir = os.path.join(repo_dir, 'models')
if not os.path.exists(train_dir):
    os.makedirs(train_dir)

# location of the unpacked pretrained model
checkpoint_path = os.path.join(repo_dir, 'resnet_v1_152.ckpt')

# location of the TFRecords and other files generated during image set preparation
dataset_dir = 'D:\\combined\\train_subsample'

command = '''activate py35
python {0} --train_dir={1} --dataset_name=aerial --dataset_split_name=train --dataset_dir={2} --checkpoint_path={3}
'''.format(os.path.join(repo_dir, 'retrain.py'),
           train_dir,
           dataset_dir,
           checkpoint_path)

print(command)
```
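If you would rather launch training from Python than paste the printed command into a prompt, one option is to build the arguments as a list and pass them to `subprocess`. The sketch below only constructs the list. It assumes the `py35` environment is already active in the process that runs it, and the helper name `build_training_command` is made up for this example. The script name and flags are the ones used in the command above.

```python
import os
import subprocess  # used only by the commented-out call at the end


def build_training_command(repo_dir, train_dir, dataset_dir, checkpoint_path,
                           python_exe="python"):
    """Return the retraining command as an argument list.

    Passing a list avoids quoting problems when a path contains spaces.
    """
    return [
        python_exe,
        os.path.join(repo_dir, "retrain.py"),
        "--train_dir={}".format(train_dir),
        "--dataset_name=aerial",
        "--dataset_split_name=train",
        "--dataset_dir={}".format(dataset_dir),
        "--checkpoint_path={}".format(checkpoint_path),
    ]


example_args = build_training_command(
    repo_dir="D:\\tf",
    train_dir="D:\\tf\\models",
    dataset_dir="D:\\combined\\train_subsample",
    checkpoint_path="D:\\tf\\resnet_v1_152.ckpt",
)
print(example_args)

# To start training (assumes the required packages are importable by python_exe):
# subprocess.run(example_args, check=True)
```

Training can run for a long time, so an Anaconda prompt remains the simpler choice when you want to watch the log output directly.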

For details of the model evaluation process, please see the scoring notebook in the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.