# Scoring images on Spark

This notebook illustrates how trained Cognitive Toolkit (CNTK) and TensorFlow models can be applied to large image collections using PySpark.

This notebook is part of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) git repository. It assumes that a dataset and Azure N-series GPU VM have already been created for model training as described in the previous [Image Set Preparation](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/image_set_preparation.ipynb) notebook. That notebook also includes abbreviated instructions for users who would like to employ our sample image set rather than generating their own.

By default, this notebook uses our provided retrained DNNs. If you have completed the [Model Training](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/blob/master/model_training.ipynb) notebook in this repository, you can elect to use your own models by modifying the script action used during CNTK and TensorFlow installation.

## Outline
- [Set up a Microsoft HDInsight Spark cluster and Azure Data Lake Store](#setup)
   - [Provision the resources](#provision)
      - [Azure Data Lake Store](#adls)
      - [Azure HDInsight Spark cluster](#hdinsight)
      - [Check cluster deployment status](#checkstatus)
   - [Install Cognitive Toolkit, TensorFlow, and model files](#install)
      - [Run the script action](#runsa)
      - [Update the Python 3 path](#updatepath)
- [Image scoring with PySpark](#pyspark)
   - [Define functions/variables/RDDs used by both scoring pipelines](#shared)
   - [Cognitive Toolkit](#cntk)
      - [Make the trained CNTK model available to all workers](#cntkbroadcast)
      - [Define functions to be run by worker nodes](#cntkworker)
      - [Score all test set images with the trained model](#cntkscore)
      - [Evaluate the model's performance](#cntkevaluate)
   - [TensorFlow](#tf)
      - [Make the trained TensorFlow model available to all workers](#tfbroadcast)
      - [Define functions to be run by worker nodes](#tfworker)
      - [Score all test set images with the trained model](#tfscore)
      - [Evaluate the model's performance](#tfevaluate)
- [Next steps](#next)

<a name="setup"></a>
## Set up a Microsoft HDInsight Spark cluster and associated Azure Data Lake Store

In the previous notebooks, we illustrated how to use a Windows Data Science Virtual Machine to create the training/validation image sets and train DNNs for this image classification task. In this section, we illustrate how to transfer the data and models to an Azure Data Lake Store (a cloud-based HDFS). We also show how to provision and set up the HDInsight Spark cluster which will apply the models to the data in subsequent sections.

<a name="provision"></a>
### Provision the resources

<a name="adls"></a>
#### Azure Data Lake Store
We have provided directions for provisioning and setting up the ADLS through the [Azure CLI 2.0](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest), which comes preinstalled on the Data Science Virtual Machine.

1. From a command prompt, log into the Azure CLI by running the following command:

    ```
    az login
    ```

    You will be asked to visit a website and type in a temporary code. The website may ask you to provide your Azure account credentials.
1. When login is complete, return to the CLI and run the following command to determine which subscriptions are available in your account:

    ```
    az account list
    ```

    Copy the "id" of the subscription you would like to use when creating resources, then execute the command below to set it as the active subscription:

    ```
    az account set --subscription [subscription id]
    ```

1. Choose a unique resource group name, then create an Azure resource group using the command below:

    ```
    set RESOURCE_GROUP_NAME=[your resource group name]
    az group create --location eastus2 --name %RESOURCE_GROUP_NAME%
    ```
    
1. Choose a unique Data Lake Store name, then create the resource with the command below:

    ```
    set DATA_LAKE_STORE_NAME=[your data lake store name]
    az dls account create --account %DATA_LAKE_STORE_NAME% --resource-group %RESOURCE_GROUP_NAME% --location eastus2
    az dls fs create --account %DATA_LAKE_STORE_NAME% --path /clusters --folder
    ```
    
1. Upload the balanced validation dataset from your DSVM to the appropriate folder of the ADLS (expect this step to take ~10 minutes to complete):
    ```
    az dls fs upload --account %DATA_LAKE_STORE_NAME% --source-path "D:\balanced_validation_set" --destination-path "/balancedvalidationset"
    ```
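
If you would rather script these steps than type them one at a time, the commands above could be assembled in Python along the following lines. This is only a sketch: the function name `build_adls_setup_commands` and the placeholder resource names are our own illustration, and the commands are built as argument lists without being executed.

```python
def build_adls_setup_commands(resource_group, store_name, source_path,
                              location='eastus2'):
    """Return the az CLI calls from the steps above as argument lists."""
    return [
        # Create the resource group
        ['az', 'group', 'create', '--location', location,
         '--name', resource_group],
        # Create the Data Lake Store account and its /clusters folder
        ['az', 'dls', 'account', 'create', '--account', store_name,
         '--resource-group', resource_group, '--location', location],
        ['az', 'dls', 'fs', 'create', '--account', store_name,
         '--path', '/clusters', '--folder'],
        # Upload the balanced validation set
        ['az', 'dls', 'fs', 'upload', '--account', store_name,
         '--source-path', source_path,
         '--destination-path', '/balancedvalidationset'],
    ]


# Placeholder names (hypothetical); substitute your own.
commands = build_adls_setup_commands('my-resource-group', 'myadlsname',
                                     r'D:\balanced_validation_set')
for command in commands:
    print(' '.join(command))
```

After `az login`, each list could be passed to `subprocess.run(command, check=True)`. On Windows, where `az` is installed as a batch script, this may require `shell=True`.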

<a name="hdinsight"></a>
#### Azure HDInsight Spark cluster

We provide instructions for creating the HDInsight Spark cluster through the Azure Portal, a graphical interface:

1. After logging into [Azure Portal](https://ms.portal.azure.com), click the "+ New" button near the upper left to create a new resource.
1. In the search field that appears, enter "HDInsight" and press Enter.
1. In the search results, click on the "HDInsight" option published by Microsoft.
1. Click the "Create" button at the bottom of the new pane that opens to describe the HDInsight resource type.
1. In the "Basics" section of the "New HDInsight cluster" pane:
    1. Choose a unique cluster name and the appropriate subscription.
    1. Click on "Cluster configuration" to load a pane of settings.
       1. Set the cluster type to "Spark".
       1. Set the version to "Spark 2.1 (HDI 3.6)".
       1. Click the "Select" button at the bottom of the pane.
    1. Choose a password for the `admin` account. You will use this account to log into Jupyter Notebook later in the walkthrough.
    1. Select the resource group and location where your Data Lake Store is located.
    1. Click the "Next" button at the bottom of the pane.
1. In the "Storage" section of the "New HDInsight cluster" pane:
   1. Ensure that "Data Lake Store" is selected for the "Primary storage type".
   1. Click on "Select Data Lake Storage Account" to load a pane of settings.
       1. Under "Select a storage account", select your Azure Data Lake store.
       1. Click on "Configure Data Lake Access" to load a pane of settings.
           1. Create a new service principal with the name and password of your choice. Save the generated certificate file.
           1. Click on "Access" to load a pane of settings.
               1. Under "Select File Permissions", click the box to the left of your ADLS name. (The box may not be visible until mouseover.) Click "Select".
               1. Under "Assign Selected Permissions", click "Run".
               1. When the run completes, click "Done".
       1. Click the "Next" button at the bottom of the pane.
1. In the "Summary" section of the "New HDInsight cluster" pane:
   - If desired, you can edit the cluster size settings to choose node counts/sizes based on your budget and time constraints. We recommend completing this tutorial using a cluster with **10** worker nodes and a node size of **D12 v2** (for both worker and head nodes).
   - For more information, please see the [cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters) and [VM](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes#dv2-series) size guides.
1. Click the "Create" button at the bottom of the pane.

<a name="checkstatus"></a>
#### Check cluster deployment status

Cluster deployment may take approximately twenty minutes. (We recommend that you continue with the tutorial, creating all other Azure resources and transferring your image set to the Azure Data Lake Store, while you wait for the cluster to deploy.) Cluster deployment status can be checked as follows:
1. Click on the "Search Resources" magnifying glass icon along the top bar of [Azure Portal](https://ms.portal.azure.com).
1. Type in the name of your HDInsight cluster and click on its entry in the resulting drop-down list. The overview pane for your HDInsight cluster will appear.
1. During deployment, a blue bar will appear across the top of the overview pane with the title "Applying changes". When this bar disappears, deployment is complete.

<a name="install"></a>
### Install Cognitive Toolkit, TensorFlow, and model files

Once the HDInsight Spark cluster deployment is complete, we will run a script action to install CNTK, TensorFlow, and the sample trained model files we provide. As of this writing, the script action will install CNTK 2.1 and TensorFlow 1.2.

If you completed the previous model training notebook and would prefer to use your own model files, you can modify the script action to download the model files you created from the online location of your choice. Simply download a copy of the sample script action in the `scoring` subdirectory of [the Embarrassingly Parallel Image Classification git repository](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification), edit the `wget` commands to point to your own model files, and upload the modified script action to the online location of your choosing.

<a name="runsa"></a>
#### Run the script action

After HDInsight cluster deployment finishes, run the script action to install CNTK, TensorFlow, and the model files as follows:
1. Obtain the URI for the script action.
   - If using the unmodified version in the `scoring` subdirectory of this git repo, ensure that your URI points to the "raw" file (not a webpage-embedded file), e.g.:
   [https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/raw/master/scoring/script_action.sh](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification/raw/master/scoring/script_action.sh)
   - If you have modified the script action, upload it to the website or Azure Blob Storage account of your choice and note its URI.
1. Click on the "Search Resources" magnifying glass icon along the top bar of [Azure Portal](https://ms.portal.azure.com).
1. Type in the name of your HDInsight cluster and click on its entry in the resulting drop-down list. The overview pane for your HDInsight cluster will appear.
1. In the search field at upper left, type in "Script actions". Click the "Script actions" option in the results list.
1. Click the "+ Submit new" button along the top of the Script Actions pane. A new pane of options will appear.
   1. Under name, type "install" (without the quotes).
   1. Under "Bash script URI", type in the URI.
   1. Ensure that "Head" and "Worker" boxes are checked.
   1. Click the "Create" button along the bottom of the pane.
   
Expect the script action to take roughly fifteen minutes to run. When the script action is complete, the blue bar at the top of the screen will disappear and a green icon will appear next to the submitted script action's name. Do not proceed until the script action has finished.
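
The distinction between a "raw" URI and a webpage-embedded one in step 1 can be illustrated with a short helper. This is a sketch under the assumption that the webpage form of a GitHub file URL contains a `/blob/` path segment where the raw form contains `/raw/`; the function name `to_raw_github_url` is our own.

```python
def to_raw_github_url(url):
    """Convert a GitHub 'blob' (web page) URL into its 'raw' file equivalent."""
    # Only the first occurrence is replaced, in case the file path
    # itself happens to contain a folder named 'blob'.
    return url.replace('/blob/', '/raw/', 1)


page_url = ('https://github.com/Azure/Embarrassingly-Parallel-Image-Classification'
            '/blob/master/scoring/script_action.sh')
raw_url = to_raw_github_url(page_url)
print(raw_url)
```

A URI that is already in raw form passes through unchanged.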

<a name="updatepath"></a>
#### Update the Python 3 path

The script action above installed Cognitive Toolkit and TensorFlow under a new Python environment, `cntk-py35`. Follow the steps below to direct PySpark to use this new environment:

1. Navigate back to the HDInsight cluster's overview pane by clicking "Overview" near the upper left of the pane.
1. Under "Quick links" in the main window, click the "Cluster dashboards" button. A new pane of dashboard options will appear.
1. Click "HDInsight cluster dashboard". A new window will load. You may be asked for the username (default: admin) and password you selected during deployment.
1. In the menu at left, click "Spark2".
1. In the main window, click on the "Configs" tab.
1. Scroll down to the "Custom spark2-defaults" option and expand its dropdown by clicking on the label (or the triangle beside it).
1. Find the `spark.yarn.appMasterEnv.PYSPARK3_PYTHON` entry in the dropdown list. Change its path to the following:

    `/usr/bin/anaconda/envs/cntk-py35/bin/python`<br/><br/>
    
1. Click on the green "Save" button that appears at upper right.
    - You may receive a warning regarding a setting that you did not change. This value was set by default during HDInsight cluster deployment; the warning can be safely disregarded.
1. When prompted, click the orange "Restart" button and select "Restart all affected".
1. When the restart concludes, close the window. This will return you to a pane of dashboard options.
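
Once the notebook is running on the cluster (see the next section), a quick check along the following lines can confirm that the change took effect. This is a sketch rather than part of the original walkthrough: the helper `is_in_conda_env` is our own, and we are assuming that `sys.executable` in the PySpark3 kernel reflects the interpreter configured above.

```python
import sys

EXPECTED_ENV = 'cntk-py35'  # environment created by the script action


def is_in_conda_env(executable, env_name=EXPECTED_ENV):
    """Return True if the interpreter path lies inside the named environment."""
    return '/envs/{}/'.format(env_name) in executable


print(sys.executable)
print(is_in_conda_env(sys.executable))
```

If the second line printed is `False`, revisit the `spark.yarn.appMasterEnv.PYSPARK3_PYTHON` setting and restart the affected services.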

### Upload and start this notebook on HDInsight Spark

1. On the Overview pane of your HDInsight Spark cluster, click on "Cluster dashboards" and select "Jupyter Notebooks".
1. If prompted, log in with the username (default: admin) and password you selected during deployment.
1. Use the "Upload" button to upload a copy of this notebook. You may be prompted to confirm the destination filename during the upload process: the default value will do.
1. Once the notebook has been uploaded, double-click on the notebook's name to launch it.
1. The PySpark3 kernel should be used to run the notebook. If necessary, change the kernel by clicking "Kernel -> Change Kernel -> PySpark3".

If you intend to execute the code cells below, please switch to using the notebook copy you've just opened on the Spark cluster. (Note that the outline at the top of the notebook includes a link to the "Image scoring with PySpark" section.)

<a name="pyspark"></a>
## Image scoring with PySpark
<a name="shared"></a>
### Define functions/variables/RDDs used by both scoring pipelines

Edit the variables below to define the name of your Azure Data Lake Store and the folder where the images have been stored. Execute the code cell to create an RDD of the images in the test set. Note that if this is the first code cell executed, there will be an additional delay as the Spark connection initiates.

In [1]:
import os
import numpy as np
import pandas as pd
from io import BytesIO
from PIL import Image
from pyspark import SparkFiles

label_to_number_dict = {'Barren': 0,
                        'Forest': 1,
                        'Shrub': 2,
                        'Cultivated': 3,
                        'Herbaceous': 4,
                        'Developed': 5}

def get_nlcd_id(my_filename):
    ''' Extracts the true label  '''
    folder, _ = os.path.split(my_filename)
    return(label_to_number_dict[os.path.basename(folder)])

adls_name = 'mawahdemo2'
adls_folder = 'balancedvalidationset'

n_workers = 10
local_tmp_dir = '/tmp/models'

dataset_dir = 'adl://{}.azuredatalakestore.net/{}'.format(adls_name, adls_folder)
image_rdd = sc.binaryFiles('{}/*/*.png'.format(dataset_dir), minPartitions=n_workers).coalesce(n_workers)

Starting Spark application

| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 0 | application_1504722417824_0006 | pyspark3 | idle | Link | Link | ✔ |

SparkSession available as 'spark'.
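
To make the labeling convention concrete, the sketch below applies the same folder-name lookup as `get_nlcd_id` to a single path. The ADLS account name and the file name `example_tile.png` are hypothetical, and `posixpath` is used in place of `os.path` so that the example behaves identically on any operating system.

```python
import posixpath

label_to_number_dict = {'Barren': 0,
                        'Forest': 1,
                        'Shrub': 2,
                        'Cultivated': 3,
                        'Herbaceous': 4,
                        'Developed': 5}


def get_nlcd_id(my_filename):
    """Extract the true label from the name of the file's parent folder."""
    folder, _ = posixpath.split(my_filename)
    return label_to_number_dict[posixpath.basename(folder)]


# Hypothetical path in the layout <dataset_dir>/<label>/<file>.png
example_path = ('adl://myadls.azuredatalakestore.net/'
                'balancedvalidationset/Forest/example_tile.png')
print(get_nlcd_id(example_path))  # prints 1
```

This is why the glob pattern above has the form `*/*.png`: each image is expected to sit exactly one folder below the dataset directory, in a folder named after its land use class.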


<a name="cntk"></a>
### Score and evaluate with a trained Cognitive Toolkit (CNTK) model
<a name="cntkbroadcast"></a>
#### Make the trained CNTK model available to all workers

In [2]:
from cntk import load_model

cntk_model_filepath = '{}/retrained.model'.format(local_tmp_dir)
cntk_model_filepath_bc = sc.broadcast(cntk_model_filepath)
sc.addFile(cntk_model_filepath)

<a name="cntkworker"></a>
#### Define functions to be run by worker nodes

In [3]:
def cntk_run_worker(files):
    ''' Scoring script run by each worker '''
    loaded_model = load_model(SparkFiles.get(cntk_model_filepath_bc.value))
    
    # Iterate through the files. The first value in each tuple is the file name; the second is the image data
    for file in files:
        # Load the image from its byte array, with proper dimensions and color channel order
        image_data = np.array(Image.open(BytesIO(file[1])), dtype=np.float32)
        image_data = np.ascontiguousarray(np.transpose(image_data[:, :, ::-1], (2,0,1)))
        
        # Apply the model to the image and return the true and predicted labels
        dnn_output = loaded_model.eval({loaded_model.arguments[0]: [image_data]})
        true_label = get_nlcd_id(file[0])
        yield (file[0], true_label, np.argmax(np.squeeze(dnn_output)))
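
The transposition in `cntk_run_worker` takes an image stored as height × width × channel, reverses the channel axis (RGB becomes BGR for a three-channel image), and moves the channel axis to the front. The sketch below reproduces that index shuffle on a tiny made-up image using nested lists instead of NumPy, purely for illustration; the function name is our own.

```python
def hwc_rgb_to_chw_bgr(image):
    """Reorder a nested-list image from [row][col][channel] to
    [channel][row][col], with the channel order reversed."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [[[image[row][col][ch] for col in range(width)]
             for row in range(height)]
            for ch in reversed(range(channels))]


# One row of two pixels, each given as (R, G, B)
tiny_image = [[[10, 20, 30], [40, 50, 60]]]
print(hwc_rgb_to_chw_bgr(tiny_image))
# prints [[[30, 60]], [[20, 50]], [[10, 40]]]
```

The first plane of the output holds the blue values, the second the green, and the third the red, which matches the effect of `np.transpose(image_data[:, :, ::-1], (2,0,1))`.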

<a name="cntkscore"></a>
#### Score all test set images with the trained model

In [4]:
labeled_images = image_rdd.mapPartitions(cntk_run_worker)

from datetime import datetime  # pd.datetime is unavailable in recent pandas releases

start = datetime.now()
cntk_results = labeled_images.collect()
print('Scored {} images'.format(len(cntk_results)))
stop = datetime.now()
print('Time elapsed: {}'.format(stop - start))

Scored 11760 images
Time elapsed: 0:07:49.978144

Note that this step may take several minutes to complete (just under eight minutes in the run shown above).
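
Two properties of `mapPartitions` shape the cell above: the worker function is called once per partition (so the model is loaded once per partition rather than once per image), and no work happens until an action such as `collect()` is invoked, which is why the timer wraps `collect()`. The plain-Python sketch below imitates both behaviors with generators. The stand-in model and `simulate_map_partitions` are our own illustrative constructs, not Spark APIs.

```python
load_count = 0


def load_stand_in_model():
    """Stand-in for an expensive model load; it labels every image 0."""
    global load_count
    load_count += 1
    return lambda image_bytes: 0


def run_worker(files):
    model = load_stand_in_model()  # executed once per partition
    for filename, image_bytes in files:
        yield (filename, model(image_bytes))


def simulate_map_partitions(partitions, worker):
    """Lazily apply worker to each partition in turn."""
    for partition in partitions:
        yield from worker(iter(partition))


partitions = [[('a.png', b''), ('b.png', b'')], [('c.png', b'')]]
lazy_results = simulate_map_partitions(partitions, run_worker)
print(load_count)             # prints 0: nothing has run yet
results = list(lazy_results)  # analogous to collect()
print(load_count)             # prints 2: one load per partition
print(len(results))           # prints 3
```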

<a name="cntkevaluate"></a>
#### Evaluate the model's performance

We first report the model's raw overall accuracy. We then calculate the overall accuracy when all undeveloped land types are grouped under the same label. (This is done to illustrate that the majority of errors confuse different types of undeveloped land.)

In [5]:
def group_undeveloped_land_types(original_label):
    if original_label in [3, 5]:  # cultivated (3) and developed (5) land types
        return(original_label)
    else:
        return(6)  # new grouped label for all undeveloped land types

cntk_df = pd.DataFrame(cntk_results, columns=['filename', 'true_label', 'predicted_label'])
num_correct = sum(cntk_df['true_label'] == cntk_df['predicted_label'])
num_total = len(cntk_results)
print('When using all six categories, correctly predicted ' +
      '{} of {} images ({:0.2f}%)'.format(num_correct,
                                          num_total,
                                          100 * num_correct / num_total))

cntk_df['true_label_regrouped'] = cntk_df['true_label'].apply(group_undeveloped_land_types)
cntk_df['predicted_label_regrouped'] = cntk_df['predicted_label'].apply(group_undeveloped_land_types)
num_correct = sum(cntk_df['true_label_regrouped'] == cntk_df['predicted_label_regrouped'])
print('After regrouping land use categories, correctly predicted ' +
      '{} of {} images ({:0.2f}%)'.format(num_correct,
                                          num_total,
                                          100 * num_correct / num_total))

When using all six categories, correctly predicted 9528 of 11760 images (81.02%)
After regrouping land use categories, correctly predicted 10861 of 11760 images (92.36%)
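
The claim that most errors confuse different undeveloped land types can also be checked directly by counting misclassifications. The sketch below does so for a handful of made-up results in the same `(filename, true_label, predicted_label)` layout; the function name `summarize_errors` and the toy data are our own and are not drawn from the run above.

```python
from collections import Counter

# Labels other than Cultivated (3) and Developed (5), matching the
# grouping used by group_undeveloped_land_types
UNDEVELOPED = {0, 1, 2, 4}


def summarize_errors(results):
    """Count misclassifications, split by whether both the true and the
    predicted label are undeveloped land types."""
    errors = Counter()
    for _, true_label, predicted_label in results:
        if true_label == predicted_label:
            continue
        within = true_label in UNDEVELOPED and predicted_label in UNDEVELOPED
        errors['within undeveloped' if within else 'other'] += 1
    return errors


toy_results = [('a.png', 1, 1),   # correct
               ('b.png', 1, 2),   # Forest predicted as Shrub
               ('c.png', 0, 4),   # Barren predicted as Herbaceous
               ('d.png', 3, 5),   # Cultivated predicted as Developed
               ('e.png', 5, 5)]   # correct
error_counts = summarize_errors(toy_results)
print(error_counts['within undeveloped'], error_counts['other'])  # prints 2 1
```

On the cluster, `summarize_errors(cntk_results)` would give the corresponding breakdown for the real predictions.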

<a name="tf"></a>
### Score and evaluate with a trained TensorFlow model

<a name="tfbroadcast"></a>
#### Make the trained TensorFlow model available to all workers

The cell below loads a slightly modified version of the tf-slim ResNet definition from the [TensorFlow models git repository](https://github.com/tensorflow/models).

In [6]:
sc.addPyFile(os.path.join(local_tmp_dir, 'resnet_utils.py'))
sc.addPyFile(os.path.join(local_tmp_dir, 'resnet_v1.py'))
model_dir_bc = sc.broadcast(local_tmp_dir)

import tensorflow as tf
import functools
import resnet_v1
slim = tf.contrib.slim

<a name="tfworker"></a>
#### Define functions used by workers for scoring

In [7]:
def get_network_fn(num_classes, weight_decay=0.0, is_training=False):
    arg_scope = resnet_v1.resnet_arg_scope(weight_decay=weight_decay)
    func = resnet_v1.resnet_v1_50
    @functools.wraps(func)
    def network_fn(images):
        with slim.arg_scope(arg_scope):
            return func(images, num_classes, is_training=is_training)
    if hasattr(func, 'default_image_size'):
        network_fn.default_image_size = func.default_image_size
    return(network_fn)

def mean_image_subtraction(image, means):
    num_channels = image.get_shape().as_list()[-1]
    channels = tf.split(image, num_channels, 2)
    for i in range(num_channels):
        channels[i] -= means[i]
    return(tf.concat(channels, 2))

def get_preprocessing():
    def preprocessing_fn(image, output_height=224, output_width=224):
        image = tf.expand_dims(image, 0)
        resized_image = tf.image.resize_bilinear(image, [output_height, output_width], align_corners=False)
        resized_image = tf.squeeze(resized_image)
        resized_image.set_shape([output_height, output_width, 3])
        image = tf.to_float(resized_image)
        return(mean_image_subtraction(image, [123.68, 116.78, 103.94]))
    return(preprocessing_fn)

def tf_run_worker(files):
    model_dir = model_dir_bc.value
    results = []
    
    with tf.Graph().as_default():
        network_fn = get_network_fn(num_classes=6, is_training=False)
        image_preprocessing_fn = get_preprocessing()
        
        current_image = tf.placeholder(tf.uint8, shape=(224, 224, 3))
        preprocessed_image = image_preprocessing_fn(current_image, 224, 224)
        image  = tf.expand_dims(preprocessed_image, 0)
        logits, _ = network_fn(image)
        predictions = tf.argmax(logits, 1)
        
        with tf.Session() as sess:
            my_saver = tf.train.Saver()
            my_saver.restore(sess, tf.train.latest_checkpoint(model_dir))
            
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            try:
                for file in files:
                    imported_image_np = np.asarray(Image.open(BytesIO(file[1])), dtype=np.uint8)
                    result = sess.run(predictions, feed_dict={current_image: imported_image_np})
                    true_label = get_nlcd_id(file[0])
                    results.append([file[0], true_label, result[0]])
            finally:
                coord.request_stop()
            coord.join(threads)
    return(results)
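
The preprocessing step in `mean_image_subtraction` subtracts a fixed mean from each color channel. As a minimal illustration, the sketch below applies the same three means to a single made-up pixel using plain Python rather than TensorFlow; the function name and the sample pixel are our own.

```python
# Per-channel means passed to mean_image_subtraction above
CHANNEL_MEANS = [123.68, 116.78, 103.94]


def subtract_channel_means(pixel, means=CHANNEL_MEANS):
    """Subtract each channel's mean from the corresponding pixel value."""
    return [value - mean for value, mean in zip(pixel, means)]


sample_pixel = [255, 128, 0]  # hypothetical pixel, one value per channel
centered_pixel = subtract_channel_means(sample_pixel)
print(centered_pixel)
```

A pixel equal to the channel means maps to approximately zero in every channel, so the network sees inputs centered around zero.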

<a name="tfscore"></a>
#### Score all images with trained TensorFlow model

In [8]:
labeled_images_tf = image_rdd.mapPartitions(tf_run_worker)

from datetime import datetime  # pd.datetime is unavailable in recent pandas releases

start = datetime.now()
results_tf = labeled_images_tf.collect()
print('Scored {} images'.format(len(results_tf)))
stop = datetime.now()
print(stop - start)

Scored 11760 images
0:09:13.936763

Note that this step may take up to 10 minutes to complete.
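
The elapsed times reported by the two scoring cells can be converted into cluster-wide throughput. The sketch below does the arithmetic for the figures printed in this notebook; the helper names are our own, and the resulting rates apply only to the run shown here, not to other cluster sizes or models.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Parse an 'H:MM:SS.ffffff' string, as printed above, into a timedelta."""
    hours, minutes, seconds = text.split(':')
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


def images_per_second(n_images, elapsed_text):
    return n_images / parse_elapsed(elapsed_text).total_seconds()


cntk_rate = images_per_second(11760, '0:07:49.978144')
tf_rate = images_per_second(11760, '0:09:13.936763')
print('CNTK: {:0.1f} images/s, TensorFlow: {:0.1f} images/s'.format(
    cntk_rate, tf_rate))
# prints CNTK: 25.0 images/s, TensorFlow: 21.2 images/s
```

These are totals across the whole cluster; dividing by `n_workers` gives a rough per-worker rate.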

<a name="tfevaluate"></a>
#### Evaluate the model's performance

We first report the model's raw overall accuracy. We then calculate the overall accuracy when all undeveloped land types are grouped under the same label. (This is done to illustrate that the majority of errors confuse different types of undeveloped land.)

In [9]:
def group_undeveloped_land_types(original_label):
    if original_label in [3, 5]:  # cultivated (3) and developed (5) land types
        return(original_label)
    else:
        return(6)

tf_df = pd.DataFrame(results_tf, columns=['filename', 'true_label', 'predicted_label'])
num_correct = sum(tf_df['true_label'] == tf_df['predicted_label'])
num_total = len(results_tf)
print('When using all six categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,
                                                                                             num_total,
                                                                                             100 * num_correct / num_total))

tf_df['true_label_regrouped'] = tf_df['true_label'].apply(group_undeveloped_land_types)
tf_df['predicted_label_regrouped'] = tf_df['predicted_label'].apply(group_undeveloped_land_types)
num_correct = sum(tf_df['true_label_regrouped'] == tf_df['predicted_label_regrouped'])
print('After regrouping land use categories, correctly predicted {} of {} images ({:0.2f}%)'.format(num_correct,
                                                                                                    num_total,
                                                                                                    100 * num_correct / num_total))


When using all six categories, correctly predicted 8844 of 11760 images (75.20%)
After regrouping land use categories, correctly predicted 10931 of 11760 images (92.95%)

<a name="next"></a>
## Next Steps

For an example of how the trained model can be applied to identify newly developed regions and explore county-level patterns in development, please see the next document in this repository: [Land Use Prediction in Middlesex County, MA](land_use_prediction.md).