# Training CNTK and TensorFlow models for image classification

## Outline
- [Provision an Azure N-Series GPU Deep Learning VM](#provision)
- [Microsoft Cognitive Toolkit](#cntk)
- [TensorFlow](#tensorflow)
   - [Training script](#tfscript)
   - [Model](#tfmodel)
   - [Running the training script](#tfrun)

<a name="provision"></a>
## Provision an Azure N-Series GPU Deep Learning VM

Deploy a "Deep Learning toolkit for the DSVM" resource in a region that offers GPU VMs, such as East US. As of this writing (1/19), the DSVM deploys with CNTK 2.0.

### Connecting to the VM by remote desktop

To use remote desktop, click "Connect" on the VM's main pane to download an RDP file. When connecting, specify the "domain" (the VM name) along with your username, e.g. "mawahgpudsvm\mawah", so that the connection doesn't attempt to use your Microsoft domain.

### Clone/download the contents of this repo

Download the contents of this repo and copy the contents of the `tf` and `cntk` subfolders to appropriate locations. We have used locations on the temporary drive, e.g. `D:\tf` and `D:\cntk`.

### Downloading the training and evaluation set locally

During image set preparation, a training image set and descriptive files were created for use with CNTK and TensorFlow. Transfer these files to the GPU VM and store them in an appropriate location. (We have used the `D:\combined\train_subsample` folder.) If you did not generate a larger training set earlier, you can use the small training set included in this git repo. You may need to regenerate the CNTK map file if the image paths have changed.
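
If the images were moved, one way to regenerate the map file is to rewrite the path column of the existing one. The sketch below assumes the map file uses the tab-separated `<image path><TAB><label>` layout, one image per line; the function name `rewrite_map_lines` and the example paths are hypothetical and not part of this repo.

```python
def rewrite_map_lines(lines, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in the path column of 'path<TAB>label' lines."""
    rewritten = []
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        # Split on the last tab so that the label is always the final column
        path, label = line.rsplit('\t', 1)
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        rewritten.append(path + '\t' + label)
    return rewritten

# Hypothetical entries: images originally stored under C:\old
example = ['C:\\old\\forest\\img_001.png\t2\n',
           'C:\\old\\water\\img_002.png\t5\n']
for entry in rewrite_map_lines(example, 'C:\\old', 'D:\\combined\\train_subsample'):
    print(entry)
```

Entries whose path does not start with `old_prefix` are left unchanged, so the function can be run safely on a map file that has already been updated.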

### (Optional) Access the VM remotely via Jupyter Notebook

Follow these steps if you wish to be able to access the notebook server remotely:
1. In the [Azure Portal](https://portal.azure.com), navigate to the deployed VM's pane and determine its IP address.
1. In the [Azure Portal](https://portal.azure.com), navigate to the deployed VM's Network Security Group's pane and add inbound/outbound rules permitting traffic on port 9999.
1. While connected to the VM via remote desktop, launch a command prompt (press Windows key + R, then enter `cmd`) and type the following commands:

   ```
   cd C:\dsvm\tools\setup
   JupyterSetPasswordAndStart.cmd
   ```

   Follow the prompts to set your remote access password.
   
1. Connect to your VM remotely via Jupyter Notebooks using the IP address you determined earlier and port 9999, e.g. `https://[__.__.__.__]:9999`. The default directory on login will be `C:\dsvm\notebooks`.
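
If the notebook page does not load, it can help to confirm that the port is reachable before troubleshooting Jupyter itself. The following is a minimal sketch using only the Python standard library; the function name `port_is_open` is hypothetical, and the check only tests whether a TCP connection succeeds, not whether Jupyter is the service answering.

```python
import socket

def port_is_open(host, port=9999, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False

# Usage from your local machine, substituting the VM's IP address:
#   port_is_open('<VM IP address>')
```

A `False` result usually points to a missing Network Security Group rule or to the notebook server not running on the VM.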

<a name="tensorflow"></a>
## TensorFlow

<a name="tfscript"></a>
### Training script

We used the [`tf-slim` API](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) for TensorFlow, which provides pre-trained ResNet models and helpful scripts for retraining and scoring. During training set preparation, we converted raw PNG images to the [TFRecords](https://www.tensorflow.org/how_tos/reading_data/#file_formats) files that those scripts expect as input. (Our evaluation set images will be scored on Spark without conversion to TFRecord format.)

Our training script is a modified version of `train_image_classifier.py` from the [TensorFlow models repo's slim subdirectory](https://github.com/tensorflow/models/tree/master/slim). Some of that script's dependencies have also been changed. We recommend that you clone this repo and transfer the `tf` subfolder, including dependencies, to a suitable location, e.g.

In [None]:
repo_dir = 'D:\\tf'

<a name="tfmodel"></a>
### Model

We will retrain the logits layer of a 152-layer ResNet pretrained on ImageNet. This model is highlighted in the [TensorFlow models repo's slim subdirectory](https://github.com/tensorflow/models/tree/master/slim). The pretrained model can be obtained and unpacked with the code snippet below:

In [None]:
import urllib.request
import tarfile
import os

urllib.request.urlretrieve('http://download.tensorflow.org/models/resnet_v1_152_2016_08_28.tar.gz',
                           os.path.join(repo_dir, 'resnet_v1_152_2016_08_28.tar.gz'))
with tarfile.open(os.path.join(repo_dir, 'resnet_v1_152_2016_08_28.tar.gz'), 'r:gz') as f:
    f.extractall(path=repo_dir)
os.remove(os.path.join(repo_dir, 'resnet_v1_152_2016_08_28.tar.gz'))

<a name="tfrun"></a>
### Running the training script

We recommend that you run the training script from an Anaconda prompt. The code cell below will help you generate the appropriate command based on your file locations.

In [None]:
# repo_dir was defined above

# path where retrained model and logs will be saved during training
train_dir = os.path.join(repo_dir, 'models')
os.makedirs(train_dir, exist_ok=True)
    
# location of the unpacked pretrained model
checkpoint_path = os.path.join(repo_dir, 'resnet_v1_152.ckpt')

# Location of the TFRecords and other files generated during image set preparation
image_dir = 'D:\\combined\\train_subsample'

command = '''activate py35
python {0} --train_dir={1} --dataset_name=aerial --dataset_split_name=train --dataset_dir={2} --checkpoint_path={3}
'''.format(os.path.join(repo_dir, 'retrain.py'),
           train_dir,
           image_dir,
           checkpoint_path)

print(command)
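
Before pasting the generated command into the Anaconda prompt, it may be worth checking that the files it refers to actually exist. The sketch below is one way to do that; the function name `find_missing_inputs` is hypothetical and not part of this repo, and it only checks for presence on disk, not that the files are valid.

```python
import os

def find_missing_inputs(script_path, checkpoint_path, dataset_dir):
    """Return a description of each training input that does not exist on disk."""
    problems = []
    if not os.path.isfile(script_path):
        problems.append('training script not found: ' + script_path)
    if not os.path.isfile(checkpoint_path):
        problems.append('pretrained checkpoint not found: ' + checkpoint_path)
    if not os.path.isdir(dataset_dir):
        problems.append('dataset directory not found: ' + dataset_dir)
    return problems

# Usage with the variables defined in the cells above:
#   for problem in find_missing_inputs(os.path.join(repo_dir, 'retrain.py'),
#                                      checkpoint_path, image_dir):
#       print(problem)
```

An empty list means all three inputs were found.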