# Scoring images on Spark

This notebook illustrates how trained Cognitive Toolkit (CNTK) and TensorFlow models can be applied to large image collections using PySpark. For more detail on image set creation and model training, please see the rest of the [Embarrassingly Parallel Image Classification](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification) repository.

## Outline
- [Setting up a Microsoft HDInsight Spark cluster and Azure Data Lake Store](#setup)
   - [Provisioning the resources](#provision)
   - [Transferring the image set](#transfer)
   - [Installing Cognitive Toolkit and TensorFlow](#install)
- [Image scoring with PySpark](#pyspark)
   - [Cognitive Toolkit](#cntk)
   - [TensorFlow](#tf)

<a name="setup"></a>
## Setting up a Microsoft HDInsight Spark cluster and associated Azure Data Lake Store

<a name="provision"></a>
### Provisioning the resources

1. After logging into [Azure Portal](https://ms.portal.azure.com), click the "+ New" button near the upper left to create a new resource.
1. In the search field that appears, enter "HDInsight" and press Enter.
1. In the search results, click on the "HDInsight" option published by Microsoft.
1. Click the "Create" button at the bottom of the new pane that opens to describe the HDInsight resource type.
1. In the "New HDInsight cluster" pane, choose a unique cluster name and the appropriate subscription.
1. Click on "Cluster configuration" to load a pane of settings.
   1. Set the cluster type to "Spark".
   1. Set the version to "Spark 2.0.1 (HDI 3.5)".
   1. Click the "Select" button at the bottom of the pane.
1. Click on "Credentials" to load a pane of settings.
   1. Choose a password for the `admin` account. You will use this account to log into Jupyter Notebook later in the walkthrough.
   1. Choose a username and password for SSH access. We will not use this account in this walkthrough.
   1. Click the "Select" button at the bottom of the pane.
1. Click on "Data source" to load a pane of settings.
   1. Ensure that "Azure Storage" is selected for the "Primary storage type".
   1. Under "Select a storage account", click Create New.
   1. Choose a name for the new storage account.
   1. Under "Default container", enter "hdinsight" (without the quotes).
   1. Click the "Select" button at the bottom of the pane.
1. Click on "Cluster size" to load a pane of settings.
   1. Choose a number of workers and node sizes according to your budget and time constraints. This tutorial can be completed using a cluster with **4** worker nodes and a node size of **D12 v2** (for both worker and head nodes). For more information, please see the [cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters) and [VM](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes#dv2-series) size guides.
   1. Click the "Select" button at the bottom of the pane.
1. Choose "Create new" under resource group, and enter a unique resource group name.
1. Click the "Create" button at the bottom of the pane.

Cluster deployment will take approximately twenty minutes. (Because the Azure Data Lake Store (ADLS) deployment finishes much sooner, we recommend transferring your image set to ADLS while you wait; see the next section.) Cluster deployment status can be checked as follows:
1. Click on the "Search Resources" magnifying glass icon along the top bar of [Azure Portal](https://ms.portal.azure.com).
1. Type in the name of your HDInsight cluster and click on its entry in the resulting drop-down list. The overview pane for your HDInsight cluster will appear.
1. During deployment, a blue bar will appear across the top of the overview pane with the title "Applying changes". When this bar disappears, deployment is complete.

<a name="transfer"></a>
### Transferring the image set

Our evaluation image set was created on a Data Science Virtual Machine. To transfer these images to our Azure Data Lake Store, we first copied the images to Azure Blob Storage using [AzCopy](https://docs.microsoft.com/en-gb/azure/storage/storage-use-azcopy), then copied them from there to the Azure Data Lake Store with [AdlCopy](https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob). After following the instructions linked above to download and install AzCopy and AdlCopy, we transferred the files with the shell commands printed by the following Python code:

```python
local_image_dir = 'E:\\combined\\train'
blob_account_name = 'mawahstorage'
blob_account_key = 'o62OKYWfsL/sNki1udZPWUZkOY5y6tL7cLRlgDTMciO9ZavfwmKqa8vNTNwrJXqkjeqHl9wJULwowfQFkj4/JA=='
blob_account_container = 'training'
adl_account_name = 'mawahtensorflow'
adl_account_folder = 'training'

commands = '''
AzCopy /Source:{0} /Dest:https://{1}.blob.core.windows.net/{2} /DestKey:{3} /S
AdlCopy /source https://{1}.blob.core.windows.net/{2}/ /dest swebhdfs://{4}.azuredatalakestore.net/{5}/ /sourcekey {3}
'''.format(local_image_dir, blob_account_name, blob_account_container,
           blob_account_key, adl_account_name, adl_account_folder)

print(commands)
```

```
AzCopy /Source:E:\combined\train /Dest:https://mawahstorage.blob.core.windows.net/training /DestKey:o62OKYWfsL/sNki1udZPWUZkOY5y6tL7cLRlgDTMciO9ZavfwmKqa8vNTNwrJXqkjeqHl9wJULwowfQFkj4/JA== /S
AdlCopy /source https://mawahstorage.blob.core.windows.net/training/ /dest swebhdfs://mawahtensorflow.azuredatalakestore.net/training/ /sourcekey o62OKYWfsL/sNki1udZPWUZkOY5y6tL7cLRlgDTMciO9ZavfwmKqa8vNTNwrJXqkjeqHl9wJULwowfQFkj4/JA==
```


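The same two transfer commands can also be assembled and launched from Python without writing the storage key into the notebook. The sketch below is one possible arrangement, not the approach used to produce the image set above: the helper names `build_copy_commands` and `run_copy_commands` and the environment variable `BLOB_ACCOUNT_KEY` are hypothetical, and the command-line flags are simply those shown in the commands above.

```python
import subprocess


def build_copy_commands(local_dir, blob_account, blob_container, blob_key,
                        adl_account, adl_folder):
    """Return the AzCopy and AdlCopy invocations as argument lists.

    Only the flags that appear in the walkthrough are used; the function
    names and argument names here are illustrative.
    """
    blob_url = 'https://{}.blob.core.windows.net/{}'.format(
        blob_account, blob_container)
    adl_url = 'swebhdfs://{}.azuredatalakestore.net/{}/'.format(
        adl_account, adl_folder)
    # Step 1: local directory -> Azure Blob Storage (recursive, /S)
    azcopy = ['AzCopy', '/Source:' + local_dir, '/Dest:' + blob_url,
              '/DestKey:' + blob_key, '/S']
    # Step 2: Azure Blob Storage -> Azure Data Lake Store
    adlcopy = ['AdlCopy', '/source', blob_url + '/', '/dest', adl_url,
               '/sourcekey', blob_key]
    return [azcopy, adlcopy]


def run_copy_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Example usage (assumes AzCopy and AdlCopy are on the PATH and that the
# key was exported beforehand as the hypothetical BLOB_ACCOUNT_KEY variable):
#
#   import os
#   commands = build_copy_commands('E:\\combined\\train', 'mawahstorage',
#                                  'training', os.environ['BLOB_ACCOUNT_KEY'],
#                                  'mawahtensorflow', 'training')
#   run_copy_commands(commands)
```

Passing each command as an argument list avoids shell-quoting problems with the `/` and `=` characters that storage keys typically contain.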

<a name="install"></a>
### Installing Cognitive Toolkit and TensorFlow

#### Obtaining and (optionally) modifying the script action

We will install Cognitive Toolkit and TensorFlow on all head and worker nodes via Script Action. We have included a sample script action in the `scoring` subdirectory of [the Embarrassingly Parallel Image Classification git repository](https://github.com/Azure/Embarrassingly-Parallel-Image-Classification), reproduced below for your convenience:

The code above installed CNTK 2.0 beta release 10. As of this writing, other CNTK releases can be substituted as follows:
1. Navigate to the [CNTK Releases](https://github.com/Microsoft/CNTK/releases) page.
1. Click on the appropriate release's link for a Linux, CPU Only release.
1. After reading and agreeing to the mentioned licenses, copy the URL linked to the "I accept" button (e.g. from the page source) and paste over the URL in the `curl` command above.

#### Running the script action

After HDInsight cluster deployment finishes, run the script action to install CNTK as follows:
1. Obtain the URI for the script action.
   - If using the unmodified version in this git repo, ensure that your URI points to the "raw" file (not a webpage-embedded file).
   - If you have modified the script action, upload it to the website or Azure Blob Storage account of your choice and note its URI.
1. Click on the "Search Resources" magnifying glass icon along the top bar of [Azure Portal](https://ms.portal.azure.com).
1. Type in the name of your HDInsight cluster and click on its entry in the resulting drop-down list. The overview pane for your HDInsight cluster will appear.
1. In the search field at upper left, type in "Script actions". Click the "Script actions" option in the results list.
1. Click the "+ Submit new" button along the top of the Script Actions pane. A new pane of options will appear.
   1. Under name, type "install" (without the quotes).
   1. Under "Bash script URI", type in the URI.
   1. Ensure that "Head" and "Worker" boxes are checked.
   1. Click the "Create" button along the bottom of the pane.
   
Expect the script action to take roughly fifteen minutes to run.
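For the first step above, a GitHub file page URL can be converted to its "raw" counterpart programmatically. The sketch below relies on GitHub's standard `raw.githubusercontent.com` URL layout; the function name `github_raw_url` is hypothetical, and the file name and branch in the usage comment are placeholders rather than the actual location of the script action. It assumes the branch name contains no `/`.

```python
from urllib.parse import urlparse


def github_raw_url(page_url):
    """Convert a github.com file page URL to its raw-content URL.

    Expected input layout: https://github.com/<owner>/<repo>/blob/<branch>/<path>
    Assumes the branch name itself contains no '/'.
    """
    parsed = urlparse(page_url)
    parts = parsed.path.strip('/').split('/')
    if parsed.netloc != 'github.com' or len(parts) < 5 or parts[2] != 'blob':
        raise ValueError('Not a GitHub file page URL: ' + page_url)
    owner, repo, _, branch = parts[:4]
    file_path = '/'.join(parts[4:])
    return 'https://raw.githubusercontent.com/{}/{}/{}/{}'.format(
        owner, repo, branch, file_path)


# Example usage (the branch "master" and file name "script_action.sh" are
# placeholders; substitute the real path of the script in the repository):
#
#   github_raw_url('https://github.com/Azure/'
#                  'Embarrassingly-Parallel-Image-Classification/'
#                  'blob/master/scoring/script_action.sh')
```

The resulting URI is the value to paste into the "Bash script URI" field.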
   
#### Updating the Python 3 path

The script action above installed Cognitive Toolkit and TensorFlow under a new Python environment, `cntk-py35`. Follow the steps below to direct PySpark to use this new environment:

1. Navigate back to the HDInsight cluster's overview pane by clicking "Overview" near the upper left of the pane.
1. Under "Quick links" in the main window, click the "Cluster dashboards" button. A new pane of dashboard options will appear.
1. Click "HDInsight cluster dashboard". A new window will load.
1. In the menu at left, click "Spark2".
1. In the main window, click on the "Configs" tab.
1. Scroll down to the "Custom spark2-defaults" option and expand its dropdown by clicking on the label (or the triangle beside it).
1. Find the `spark.yarn.appMasterEnv.PYSPARK3_PYTHON` entry in the dropdown list. Change its path to the following:

    `/usr/bin/anaconda/envs/cntk-py35/bin/python`<br/><br/>
    
1. Click on the green "Save" button that appears at upper right.
1. When prompted, click the orange "Restart" button and select "Restart all affected".
1. When the restart concludes, close the window. This will return you to a pane of dashboard options.
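
Once the restart has finished, one way to check that the change took effect is to ask Spark which Python interpreter it is running. The sketch below is an illustration under stated assumptions: it expects to run in a PySpark 3 notebook where a SparkContext named `sc` is already defined, and the helper names `worker_interpreters` and `in_expected_env` are hypothetical. It checks only that each reported path lies inside the `cntk-py35` environment, since the exact executable name may differ.

```python
import sys

EXPECTED_ENV_DIR = '/usr/bin/anaconda/envs/cntk-py35/'


def worker_interpreters(sc, num_partitions=8):
    """Return the distinct Python executables reported by Spark tasks."""
    def report(_partition):
        # Imported inside the task so the value comes from the worker process
        import sys
        yield sys.executable

    rdd = sc.parallelize(range(num_partitions), num_partitions)
    return sorted(set(rdd.mapPartitions(report).collect()))


def in_expected_env(paths, env_dir=EXPECTED_ENV_DIR):
    """True if at least one path was given and all lie under env_dir."""
    paths = list(paths)
    return bool(paths) and all(p.startswith(env_dir) for p in paths)


# Example usage in a notebook cell (assumes `sc` is already defined):
#
#   print(sys.executable)                 # interpreter running this cell
#   paths = worker_interpreters(sc)
#   print(paths, in_expected_env(paths))  # interpreters used by Spark tasks
```

If any path falls outside the `cntk-py35` environment, revisit the `spark.yarn.appMasterEnv.PYSPARK3_PYTHON` setting and restart the affected services again.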

<a name="pyspark"></a>
## Image scoring with PySpark

<a name="cntk"></a>
### Cognitive Toolkit

<a name="tf"></a>
### TensorFlow