# Scoring images on Spark

This notebook illustrates how trained Cognitive Toolkit (CNTK) and TensorFlow models can be applied to large image collections using PySpark. For more detail on image set creation and model training, please see the rest of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.

## Outline
- [Set up a Microsoft HDInsight Spark cluster and Azure Data Lake Store](#setup)
   - [Provision the resources](#provision)
   - [Transfer the image set and models](#transfer)
   - [Install Cognitive Toolkit and TensorFlow](#install)
- [Image scoring with PySpark](#pyspark)
   - [Cognitive Toolkit](#cntk)
   - [TensorFlow](#tf)

<a name="setup"></a>
## Setting up a Microsoft HDInsight Spark cluster and associated Azure Data Lake Store

<a name="provision"></a>
### Provision the resources

#### Azure Data Lake Store
1. After logging into [Azure Portal](https://ms.portal.azure.com), click the "+ New" button near the upper left to create a new resource.
1. In the search field that appears, enter "Data Lake Store" and press Enter.
1. In the search results, click on the "Data Lake Store" option published by Microsoft.
1. Click the "Create" button at the bottom of the new pane that opens to describe the Data Lake Store resource type.
1. Choose a unique name, subscription, resource group, and location for your Data Lake Store. Note the location: you will need to use the same location when deploying the Spark cluster.
   - Some Azure subscriptions limit the number of HDInsight cores that can be used by location. Ensure that you choose a location where you will be able to generate an HDInsight cluster with 48 cores.
1. Select the appropriate pricing plan for your needs. (We recommend "Pay-as-you-go"; the tutorial will use <1 TB of data.)
1. Click the "Create" button at the bottom of the pane.

#### Azure HDInsight Spark Cluster
1. After logging into [Azure Portal](https://ms.portal.azure.com), click the "+ New" button near the upper left to create a new resource.
1. In the search field that appears, enter "HDInsight" and press Enter.
1. In the search results, click on the "HDInsight" option published by Microsoft.
1. Click the "Create" button at the bottom of the new pane that opens to describe the HDInsight resource type.
1. In the "Basics" section of the "New HDInsight cluster" pane:
    1. Choose a unique cluster name and the appropriate subscription.
    1. Click on "Cluster configuration" to load a pane of settings.
       1. Set the cluster type to "Spark".
       1. Set the version to "Spark 2.0.2 (HDI 3.5)".
       1. Click the "Select" button at the bottom of the pane.
    1. Choose a password for the `admin` account. You will use this account to log into Jupyter Notebook later in the walkthrough.
    1. Select the resource group and location where your Data Lake Store is located.
    1. Click the "Next" button at the bottom of the pane.
1. In the "Storage" section of the "New HDInsight cluster" pane:
   1. Ensure that "Data Lake Store" is selected for the "Primary storage type".
   1. Click on "Select Data Lake Storage Account" to load a pane of settings.
       1. Under "Select a storage account", select your Azure Data Lake store.
       1. Change the "root path" to "/clusters".
          - The default option, "/clusters/<cluster name>", will not work unless you have previously created a "/clusters" folder in your Azure Data Lake Store.
       1. Click on "Configure Data Lake Access" to load a pane of settings.
           1. Create a new service principal with the name and password of your choice. Save the generated certificate file.
           1. Click on "Access" to load a pane of settings.
                1. Under "Select File Permissions", click the box to the left of your ADLS name. (The box may be obscured until mouseover.) Click "Select".
               1. Under "Assign Selected Permissions", click "Run".
               1. When the run completes, click "Done".
       1. Click the "Next" button at the bottom of the pane.
1. In the "Summary" section of the "New HDInsight cluster" pane:
   1. If desired, you can edit the cluster size settings to choose node counts/sizes based on your budget and time constraints. We recommend completing this tutorial using a cluster with **10** worker nodes and a node size of **D12 v2** (for both worker and head nodes). For more information, please see the [cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters) and [VM](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes#dv2-series) size guides.
1. Click the "Create" button at the bottom of the pane.
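
The 48-core figure mentioned in the Data Lake Store steps follows from the recommended cluster size. The sketch below shows the arithmetic; it assumes that a D12 v2 VM has 4 cores and that an HDInsight Spark cluster runs 2 head nodes, neither of which is stated in this walkthrough, and the function name is illustrative only.

```python
# Assumptions (not stated in the walkthrough): D12 v2 = 4 cores, 2 head nodes.
CORES_PER_D12_V2 = 4
HEAD_NODES = 2

def total_cores(worker_nodes, cores_per_node=CORES_PER_D12_V2, head_nodes=HEAD_NODES):
    """Cores used by a cluster whose head and worker nodes share one VM size."""
    return (worker_nodes + head_nodes) * cores_per_node

print(total_cores(10))  # 48 for the recommended 10-worker cluster
```

If you pick a different node count or VM size, the same calculation indicates how many cores your subscription must allow in the chosen location.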

#### Checking cluster deployment status

Cluster deployment may take approximately twenty minutes. (We recommend that you continue with the tutorial, creating all other Azure resources and transferring your image set to the Azure Data Lake Store, while you wait for the cluster to deploy.) Cluster deployment status can be checked as follows:
1. Click on the "Search Resources" magnifying glass icon along the top bar of [Azure Portal](https://ms.portal.azure.com).
1. Type in the name of your HDInsight cluster and click on its entry in the resulting drop-down list. The overview pane for your HDInsight cluster will appear.
1. During deployment, a blue bar will appear across the top of the overview pane with the title "Applying changes". When this bar disappears, deployment is complete.

#### Azure Blob Storage account
1. After logging into [Azure Portal](https://ms.portal.azure.com), click the "+ New" button near the upper left to create a new resource.
1. In the search field that appears, enter "Storage account" and press Enter.
1. In the search results, click on the "Storage account" option published by Microsoft.
1. Click the "Create" button at the bottom of the new pane that opens to describe the Storage account resource type.
1. Enter a name for the blob storage account.
1. Select the same resource group and location as the Azure Data Lake Store you created earlier.
1. Click the "Create" button to deploy the new storage account.
1. When the Storage account's deployment finishes, navigate to its overview pane using the search functionality described above.
1. In the left-hand menu, under Settings, click on Access Keys. Note the primary key, which we will use for transferring images from the GPU VM to the Azure Data Lake Store.
1. Return to the Storage account's Overview pane and click on "Blobs".
1. Click "+ Container" near the top of the Blob service pane.
1. Enter the name "balancedtest" and click "Create".

#### Azure Data Factory
1. After logging into [Azure Portal](https://ms.portal.azure.com), click the "+ New" button near the upper left to create a new resource.
1. In the search field that appears, enter "Data Factory" and press Enter.
1. In the search results, click on the "Data Factory" option published by Microsoft.
1. Click the "Create" button at the bottom of the new pane that opens to describe the Data Factory resource type.
1. Choose a unique name, subscription, resource group, and location for your Data Factory.
   - Do not worry if you cannot match the location of your Azure Data Lake Store/Storage account. This will not affect file transfer rates.
1. Click the "Create" button at the bottom of the pane.

<a name="transfer"></a>
### Transfer the image set

#### Transfer from the VM to blob storage
Once the [balanced validation image set](https://mawahstorage.blob.core.windows.net/aerialimageclassification/imagesets/balanced_validation_set.zip) has been downloaded and decompressed, the images can be transferred to the Data Lake Store via Azure Blob Storage. In the first stage, files are copied to Azure Blob Storage using [AzCopy](https://docs.microsoft.com/en-gb/azure/storage/storage-use-azcopy). After downloading and installing AzCopy, you can generate the necessary commands for the file transfer using the cell below:

In [None]:
# Be sure to fill in your credentials before running!
local_image_dir = 'D:\\balanced_validation_set\\'
blob_account_name = ''
blob_account_key = ''
blob_account_container = 'balancedtest'

command = '''
AzCopy /Source:{0} /Dest:https://{1}.blob.core.windows.net/{2} /DestKey:{3} /S
'''.format(local_image_dir, blob_account_name, blob_account_container,
           blob_account_key)

print(command)

This command should be run from a command-line interface and will take several minutes to complete. After running this command, you should find that your validation set images have been transferred to the `balancedtest` container in your Blob storage account. You can visually confirm this by navigating to the Storage account's overview page, clicking on Blobs, then clicking on the `balancedtest` container's name. The directory structure displayed can be navigated in the usual way.
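
Before or after the copy, it can help to know how many images to expect in each label folder. The following is a minimal sketch, assuming the decompressed image set keeps one subfolder per label with `.png` files inside (the layout the scoring code later relies on). The helper name `count_images_by_label` is hypothetical.

```python
import os
from collections import Counter

def count_images_by_label(root_dir, extension='.png'):
    """Count image files per immediate parent folder under root_dir."""
    counts = Counter()
    for folder, _, filenames in os.walk(root_dir):
        n_images = sum(1 for name in filenames if name.lower().endswith(extension))
        if n_images:
            counts[os.path.basename(folder)] += n_images
    return counts

# Example usage with the local directory from the cell above
# (prints an empty summary if the directory does not exist)
counts = count_images_by_label('D:\\balanced_validation_set')
print(dict(counts), sum(counts.values()))
```

The total can then be compared with the number of images reported when the scoring cells run later in this notebook.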

#### Transfer files from blob storage to Azure Data Lake Store using Azure Data Factory

1. Navigate to your data factory's overview pane in Azure Portal and click "Copy data (PREVIEW)" to launch a guided walkthrough of transfer pipeline setup.
1. Set the task cadence to "Run once now." Leaving all other settings at their default values, click "Next".
1. For the source connection, select "Azure Blob Storage" and click "Next".
1. Choose your Azure subscription's name, then the name of your Blob storage account. Click "Next".
1. Choose "balancedtest" as your input folder by clicking on its name. Click the "Choose" button.
1. Check the "Copy files recursively" and "Binary copy" checkboxes, then click "Next".
1. For the destination connection, select "Azure Data Lake Store" and click "Next".
1. Choose your Azure subscription's name, then the name of your Azure Data Lake Store. Set the authentication type to "OAuth". Click "Next".
1. Type "/balancedtest" in the Folder path field. Click "Next".
1. Click "Next" without changing the advanced settings.
1. On the Summary page, click "Authorize" next to "Linked service Destination..." and provide your credentials. Click "Next".
1. After the pipeline is deployed, data transfer will begin. You can monitor progress by navigating to your Azure Data Lake Store's overview page and clicking "Data Explorer" along the top bar.

The notebook code below has been written to install the [CNTK]() and [TF]() DNNs we generated. If you prefer to use your own model files, we recommend that you upload them to your Blob Storage account and download them to your Spark cluster. The sample files and code below can be used as a template.

<a name="install"></a>
### Install Cognitive Toolkit and TensorFlow

#### Obtain the script action

We will install Cognitive Toolkit and TensorFlow on all head and worker nodes via Script Action. We have included a sample script action in the `scoring` subdirectory of [the Embarrassingly Parallel Image Classification git repository](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification), reproduced below for your convenience:

The script action above will copy our pretrained models (the files `cntkv2rc1_50.dnn` and `tf.zip`) to your HDInsight cluster. If you prefer to use your own models, you can replace the URIs above with the URIs of models that you upload to blob storage. (You can upload a file to blob storage by navigating to the storage account's overview pane, clicking on "Blobs", clicking on the desired destination container name, and clicking "Upload".)

The script action above installs CNTK 2.0 RC1. As of this writing, other CNTK releases can be substituted as follows:
1. Navigate to the [CNTK Releases](https://github.com/Microsoft/CNTK/releases) page.
1. Click on the appropriate release's link for a Linux, CPU Only release.
1. After reading and agreeing to the mentioned licenses, copy the URL linked to the "I accept" button (e.g. from the page source) and paste over the URL in the `curl` command above.

#### Running the script action

After HDInsight cluster deployment finishes, run the script action to install Cognitive Toolkit and TensorFlow as follows:
1. Obtain the URI for the script action.
   - If using the unmodified version in the `scoring` subdirectory of this git repo, ensure that your URI points to the "raw" file (not a webpage-embedded file), e.g.:
   [https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/raw/master/scoring/script_action.sh](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/raw/master/scoring/script_action.sh)
   - If you have modified the script action, upload it to the website or Azure Blob Storage account of your choice and note its URI.
1. Click on the "Search Resources" magnifying glass icon along the top bar of [Azure Portal](https://ms.portal.azure.com).
1. Type in the name of your HDInsight cluster and click on its entry in the resulting drop-down list. The overview pane for your HDInsight cluster will appear.
1. In the search field at upper left, type in "Script actions". Click the "Script actions" option in the results list.
1. Click the "+ Submit new" button along the top of the Script Actions pane. A new pane of options will appear.
   1. Under name, type "install" (without the quotes).
   1. Under "Bash script URI", type in the URI.
   1. Ensure that "Head" and "Worker" boxes are checked.
   1. Click the "Create" button along the bottom of the pane.
   
Expect the script action to take roughly fifteen minutes to run. When the script action is complete, the blue bar at the top of the screen will disappear and a green icon will appear next to the submitted script action's name. Do not proceed until the script action has finished.
   
#### Updating the Python 3 path

The script action above installed Cognitive Toolkit and TensorFlow under a new Python environment, `cntk-py35`. Follow the steps below to direct PySpark to use this new environment:

1. Navigate back to the HDInsight cluster's overview pane by clicking "Overview" near the upper left of the pane.
1. Under "Quick links" in the main window, click the "Cluster dashboards" button. A new pane of dashboard options will appear.
1. Click "HDInsight cluster dashboard". A new window will load. You may be asked for the username (default: admin) and password you selected during deployment.
1. In the menu at left, click "Spark2".
1. In the main window, click on the "Configs" tab.
1. Scroll down to the "Custom spark2-defaults" option and expand its dropdown by clicking on the label (or triangle beside it).
1. Find the `spark.yarn.appMasterEnv.PYSPARK3_PYTHON` entry in the dropdown list. Change its path to the following:

    `/usr/bin/anaconda/envs/cntk-py35/bin/python`<br/><br/>
    
1. Click on the green "Save" button that appears at upper right.
1. When prompted, click the orange "Restart" button and select "Restart all affected".
1. When the restart concludes, close the window. This will return you to a pane of dashboard options.

### Upload and start this notebook on HDInsight Spark

1. On the Overview pane of your HDInsight Spark cluster, click on "Cluster dashboards" and select "Jupyter Notebooks".
1. If prompted, log in with the username (default: admin) and password you selected during deployment.
1. Use the "Upload" button to upload a copy of this notebook. You may be prompted to confirm the destination filename during the upload process: the default value will do.
1. Once the notebook has been uploaded, double-click on the notebook's name to launch it.
1. The PySpark3 kernel should be used to run the notebook. If necessary, change the kernel by clicking "Kernel -> Change Kernel -> PySpark3".
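
Once the notebook is running, a quick sanity check is to print the interpreter path and confirm that it points into the `cntk-py35` environment configured earlier. This is a sketch only: the helper `uses_expected_env` is hypothetical, and the path reported may differ depending on where the kernel executes the cell.

```python
import sys

# Environment root taken from the PYSPARK3_PYTHON path set in the Spark2 configs
EXPECTED_ENV = '/usr/bin/anaconda/envs/cntk-py35'

def uses_expected_env(executable, expected_env=EXPECTED_ENV):
    """Return True if the interpreter path lies inside the expected conda environment."""
    return executable.startswith(expected_env + '/')

print(sys.executable)
print(uses_expected_env(sys.executable))
```

If the second line prints `False` on the cluster, revisit the `spark.yarn.appMasterEnv.PYSPARK3_PYTHON` setting and the service restart before continuing.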

<a name="pyspark"></a>
## Image scoring with PySpark

### Define functions/variables/RDDs used by both scoring pipelines

Edit the variables below to define the name of your Azure Data Lake Store and the folder where the images have been stored. Execute the code cell to create an RDD of the images in the test set. Note that if this is the first code cell executed, there will be an additional delay as the Spark connection initiates.

In [2]:
import os
import numpy as np
import pandas as pd
from io import BytesIO
from PIL import Image
from pyspark import SparkFiles

def get_nlcd_id(my_filename):
    ''' Extracts the true label from the name of the image's parent folder '''
    folder, _ = os.path.split(my_filename)
    return(int(os.path.basename(folder)))

adls_name = ''
adls_folder = 'balancedtest'

n_workers = 10
local_tmp_dir = '/tmp/models'

dataset_dir = 'adl://{}.azuredatalakestore.net/{}'.format(adls_name, adls_folder)
image_rdd = sc.binaryFiles('{}/*/*.png'.format(dataset_dir), minPartitions=n_workers).coalesce(n_workers)
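
Each record in `image_rdd` is a `(path, bytes)` pair, and the true label is read from the name of the folder that contains the image. The sketch below repeats `get_nlcd_id` so that it runs on its own; the store name `myadls` and the file name `example_tile.png` are made up for illustration.

```python
import os

def get_nlcd_id(my_filename):
    """Return the integer label encoded in the image's parent folder name."""
    folder, _ = os.path.split(my_filename)
    return int(os.path.basename(folder))

# Hypothetical path in the same shape that sc.binaryFiles would report
example_path = 'adl://myadls.azuredatalakestore.net/balancedtest/3/example_tile.png'
print(get_nlcd_id(example_path))  # 3
```

A folder name that is not an integer raises a `ValueError`, so stray folders under the dataset directory are worth removing before scoring.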

<a name="cntk"></a>
### Score and evaluate with a trained Cognitive Toolkit (CNTK) model

#### Make the trained CNTK model available to all workers

In [3]:
from cntk import load_model

cntk_model_filepath = '{}/cntkv2rc1_50.dnn'.format(local_tmp_dir)
cntk_model_filepath_bc = sc.broadcast(cntk_model_filepath)
sc.addFile(cntk_model_filepath)

#### Define functions to be run by worker nodes

In [4]:
def cntk_get_preprocessed_image(filename):
    ''' Perform transposition and RGB -> BGR permutation '''
    image_data = np.array(Image.open(filename), dtype=np.float32)
    bgr_image = image_data[:, :, ::-1]
    image_data = np.ascontiguousarray(np.transpose(bgr_image, (2,0,1)))
    return(image_data)

def cntk_run_worker(files):
    ''' Scoring script run by each worker '''
    cntk_model_filepath = cntk_model_filepath_bc.value
    loaded_model = load_model(SparkFiles.get(cntk_model_filepath))
    
    # Iterate through the files. The first value in each tuple is the file name; the second is the image data
    for file in files:
        preprocessed_image = cntk_get_preprocessed_image(BytesIO(file[1]))
        dnn_output = loaded_model.eval({loaded_model.arguments[0]: [preprocessed_image]})
        true_label = get_nlcd_id(file[0])
        yield (file[0], true_label, np.argmax(np.squeeze(dnn_output)))
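
To see what the preprocessing step does to an array, the following sketch applies the same channel reversal and transposition to a tiny synthetic image. The function name `to_bgr_chw` and the 2x2 test image are assumptions made for illustration; no model or image file is needed.

```python
import numpy as np

def to_bgr_chw(rgb_hwc):
    """Reverse the channel order (RGB -> BGR), then move channels first (HWC -> CHW)."""
    bgr = rgb_hwc[:, :, ::-1]
    return np.ascontiguousarray(np.transpose(bgr, (2, 0, 1)))

# Hypothetical 2x2 image in which every pixel is pure red (R=255, G=0, B=0)
tiny = np.zeros((2, 2, 3), dtype=np.float32)
tiny[:, :, 0] = 255

out = to_bgr_chw(tiny)
print(out.shape)  # (3, 2, 2): channels, height, width
print(out[2])     # the red values now sit in the last channel plane
```

The call to `np.ascontiguousarray` matters because slicing and transposing only create views; the copy gives the model a densely packed buffer in the new layout.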

#### Score all test set images with the trained model

In [5]:
labeled_images = image_rdd.mapPartitions(cntk_run_worker)

from datetime import datetime

start = datetime.now()
cntk_results = labeled_images.collect()
print('Scored {} images'.format(len(cntk_results)))
stop = datetime.now()
print(stop - start)

Scored 11760 images
0:04:35.190213
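
`mapPartitions` calls the worker function once per partition and passes it an iterator over that partition's records, so the model is loaded once per partition rather than once per image. The sketch below imitates that pattern in plain Python, without Spark; `make_worker`, the stand-in "model" (the number 10), and the toy partitions are all hypothetical.

```python
def make_worker(load_model, score):
    """Build a per-partition worker plus a counter that tracks model loads."""
    load_count = {'n': 0}

    def run_worker(records):
        model = load_model()       # paid once per partition
        load_count['n'] += 1
        for name, payload in records:
            yield (name, score(model, payload))

    return run_worker, load_count

# Two toy partitions of (name, payload) records
partitions = [[('a.png', 1), ('b.png', 2)], [('c.png', 3)]]
run_worker, load_count = make_worker(load_model=lambda: 10,
                                     score=lambda model, x: model * x)

results = [row for part in partitions for row in run_worker(iter(part))]
print(results)          # [('a.png', 10), ('b.png', 20), ('c.png', 30)]
print(load_count['n'])  # 2: one load per partition
```

With `n_workers` partitions, the load cost is therefore incurred `n_workers` times in total, however many images each partition holds.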

#### Evaluate the model's performance

We first report the model's raw overall accuracy. We then calculate the overall accuracy when all undeveloped land types are grouped under the same label. (We will use the latter grouping in a subsequent notebook to simplify result interpretation.)

In [6]:
def group_undeveloped_land_types(original_label):
    if original_label in [3, 5]:  # developed and cultivated land types
        return(original_label)
    else:
        return(6)  # new grouped label for all undeveloped land types

cntk_df = pd.DataFrame(cntk_results, columns=['filename', 'true_label', 'predicted_label'])
num_correct = sum(cntk_df['true_label'] == cntk_df['predicted_label'])
num_total = len(cntk_results)
print('When using all six categories, correctly predicted ' +
      '{} of {} images ({:0.2f}%)'.format(num_correct,
                                          num_total,
                                          100 * num_correct / num_total))

cntk_df['true_label_regrouped'] = cntk_df['true_label'].apply(group_undeveloped_land_types)
cntk_df['predicted_label_regrouped'] = cntk_df['predicted_label'].apply(group_undeveloped_land_types)
num_correct = sum(cntk_df['true_label_regrouped'] == cntk_df['predicted_label_regrouped'])
print('After regrouping land use categories, correctly predicted ' +
      '{} of {} images ({:0.2f}%)'.format(num_correct,
                                          num_total,
                                          100 * num_correct / num_total))

When using all six categories, correctly predicted 9212 of 11760 images (78.33%)
After regrouping land use categories, correctly predicted 10818 of 11760 images (91.99%)
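
The effect of regrouping on accuracy can be seen on a handful of made-up predictions. The sketch below uses only the standard library; the label pairs are hypothetical and chosen so that some errors between undeveloped classes disappear after regrouping, while an error between classes 5 and 3 remains.

```python
from collections import Counter

def group_undeveloped_land_types(label):
    """Keep labels 3 and 5; map every other label to the grouped label 6."""
    return label if label in (3, 5) else 6

def accuracy(pairs):
    """Fraction of (true, predicted) pairs that match."""
    return sum(1 for true, pred in pairs if true == pred) / len(pairs)

# Hypothetical (true_label, predicted_label) pairs
pairs = [(0, 1), (1, 1), (3, 3), (5, 3), (2, 4), (4, 4)]
regrouped = [(group_undeveloped_land_types(t), group_undeveloped_land_types(p))
             for t, p in pairs]

print('{:0.2f}%'.format(100 * accuracy(pairs)))      # 50.00%
print('{:0.2f}%'.format(100 * accuracy(regrouped)))  # 83.33%
print(Counter(regrouped))  # counts per (true, predicted) pair after regrouping
```

Regrouped accuracy can never be lower than raw accuracy, because every originally correct prediction stays correct after both labels pass through the same mapping.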

<a name="tf"></a>
### Score and evaluate with a trained TensorFlow model

#### Make the trained TensorFlow model available to all workers

The cell below loads a slightly modified version of the tf-slim ResNet definition from the [TensorFlow models git repository](https://github.com/tensorflow/models).

In [7]:
sc.addPyFile(os.path.join(local_tmp_dir, 'resnet_utils.py'))
sc.addPyFile(os.path.join(local_tmp_dir, 'resnet_v1.py'))
model_dir_bc = sc.broadcast(local_tmp_dir)

import tensorflow as tf
import functools
import resnet_v1
slim = tf.contrib.slim

#### Define functions used by workers for scoring

In [8]:
def get_network_fn(num_classes, weight_decay=0.0, is_training=False):
    arg_scope = resnet_v1.resnet_arg_scope(weight_decay=weight_decay)
    func = resnet_v1.resnet_v1_50
    @functools.wraps(func)
    def network_fn(images):
        with slim.arg_scope(arg_scope):
            return func(images, num_classes, is_training=is_training)
    if hasattr(func, 'default_image_size'):
        network_fn.default_image_size = func.default_image_size
    return(network_fn)

def mean_image_subtraction(image, means):
    num_channels = image.get_shape().as_list()[-1]
    channels = tf.split(2, num_channels, image)
    for i in range(num_channels):
        channels[i] -= means[i]
    return(tf.concat(2, channels))

def get_preprocessing():
    def preprocessing_fn(image, output_height=224, output_width=224):
        image = tf.expand_dims(image, 0)
        resized_image = tf.image.resize_bilinear(image, [output_height, output_width], align_corners=False)
        resized_image = tf.squeeze(resized_image)
        resized_image.set_shape([output_height, output_width, 3])
        image = tf.to_float(resized_image)
        return(mean_image_subtraction(image, [123.68, 116.78, 103.94]))
    return(preprocessing_fn)

def tf_run_worker(files):
    model_dir = model_dir_bc.value
    results = []
    
    with tf.Graph().as_default():
        network_fn = get_network_fn(num_classes=6, is_training=False)
        image_preprocessing_fn = get_preprocessing()
        
        current_image = tf.placeholder(tf.uint8, shape=(224, 224, 3))
        preprocessed_image = image_preprocessing_fn(current_image, 224, 224)
        image  = tf.expand_dims(preprocessed_image, 0)
        logits, _ = network_fn(image)
        predictions = tf.argmax(logits, 1)
        
        with tf.Session() as sess:
            my_saver = tf.train.Saver()
            my_saver.restore(sess, tf.train.latest_checkpoint(model_dir))
            
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            try:
                for file in files:
                    imported_image_np = np.asarray(Image.open(BytesIO(file[1])), dtype=np.uint8)
                    result = sess.run(predictions, feed_dict={current_image: imported_image_np})
                    true_label = get_nlcd_id(file[0])
                    results.append([file[0], true_label, result[0]])
            finally:
                coord.request_stop()
            coord.join(threads)
    return(results)
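
The per-channel mean subtraction in the preprocessing function can be illustrated outside of a TensorFlow graph. The NumPy sketch below shows only the arithmetic, using the same three means; it is not a replacement for the graph operations above, and `subtract_channel_means` and the uniform test image are hypothetical.

```python
import numpy as np

# Per-channel means used by the preprocessing function above (R, G, B order)
CHANNEL_MEANS = np.array([123.68, 116.78, 103.94], dtype=np.float32)

def subtract_channel_means(image_hwc, means=CHANNEL_MEANS):
    """Convert to float32 and subtract one mean per channel via broadcasting."""
    return image_hwc.astype(np.float32) - means

# Hypothetical 2x2 image in which every channel value is 128
tiny = np.full((2, 2, 3), 128, dtype=np.uint8)
centered = subtract_channel_means(tiny)
print(centered[0, 0])  # approximately [4.32, 11.22, 24.06]
```

Broadcasting lines the length-3 vector of means up with the last axis of the image, which matches the split, subtract, and concatenate sequence performed along the channel axis in `mean_image_subtraction`.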

#### Score all images with trained TensorFlow model

In [9]:
labeled_images_tf = image_rdd.mapPartitions(tf_run_worker)

from datetime import datetime

start = datetime.now()
results_tf = labeled_images_tf.collect()
print('Scored {} images'.format(len(results_tf)))
stop = datetime.now()
print(stop - start)

Scored 11760 images
0:05:24.885293
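
The two recorded timings can be converted to a rough throughput figure for comparison. The sketch below uses the elapsed times and image count from the outputs recorded in this notebook; they reflect one run on one cluster, so treat the resulting rates as indicative only. The helper name is hypothetical.

```python
from datetime import timedelta

def images_per_second(n_images, elapsed):
    """Average scoring throughput over the whole collect() call."""
    return n_images / elapsed.total_seconds()

# Elapsed times copied from the recorded cell outputs
cntk_elapsed = timedelta(minutes=4, seconds=35.190213)
tf_elapsed = timedelta(minutes=5, seconds=24.885293)

cntk_rate = images_per_second(11760, cntk_elapsed)
tf_rate = images_per_second(11760, tf_elapsed)
print('CNTK: {:0.1f} images/s'.format(cntk_rate))
print('TensorFlow: {:0.1f} images/s'.format(tf_rate))
```

These figures include Spark scheduling, reading from the Data Lake Store, and model loading, so they measure the end-to-end pipeline rather than raw inference speed.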

#### Evaluate the model's performance

We first report the model's raw overall accuracy. We also report the overall accuracy when all undeveloped land types are grouped under the same label. (We will use the latter grouping in a subsequent notebook to simplify result interpretation.)

In [10]:
def group_undeveloped_land_types(original_label):
    if original_label in [3, 5]:  # developed and cultivated land types
        return(original_label)
    else:
        return(6)

tf_df = pd.DataFrame(results_tf, columns=['filename', 'true_label', 'predicted_label'])
num_correct = sum(tf_df['true_label'] == tf_df['predicted_label'])
num_total = len(results_tf)
print('When using all six categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,
                                                                                             num_total,
                                                                                             100 * num_correct / num_total))

tf_df['true_label_regrouped'] = tf_df['true_label'].apply(group_undeveloped_land_types)
tf_df['predicted_label_regrouped'] = tf_df['predicted_label'].apply(group_undeveloped_land_types)
num_correct = sum(tf_df['true_label_regrouped'] == tf_df['predicted_label_regrouped'])
print('After regrouping land use categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,
                                                                                                    num_total,
                                                                                                    100 * num_correct / num_total))


When using all six categories, correctly predicted 9611 of 11760 images (81.73%)
After regrouping land use categories, correctly predicted 10788 of 11760 images (91.73%)

<a name="next"></a>
## Next Steps

For an example of how the trained model can be applied to identify newly developed regions and explore county-level patterns in development, please see the next document in this repository: [Land Use Prediction in Middlesex County, MA](land_use_prediction.md).