Copyright (c) Microsoft Corporation. All rights reserved. 
Licensed under the MIT License.


# Migrate from Real Time Inference to Batch Inference

This notebook shows the differences between batch inference and real-time inference. Batch inference is optimized for high-throughput, fire-and-forget predictions over a large collection of data. Real-time inference is better suited to low-latency predictions. If you have already used one of them, moving to the other is straightforward. The major differences are:
- Batch inference requires you to set up compute resources.
- Batch inference takes its input from a Dataset, whereas real-time inference can use data uploaded by the user.
- Batch inference requires building and running a batch inference pipeline, whereas real-time inference requires deploying a web service to MIR.
> **Note** Please install azureml-pipeline-steps package and azureml-contrib-mir package before running this notebook.
```
pip install azureml-sdk
pip install azureml-pipeline-steps
pip install azureml-contrib-mir
```

## Common prerequisites
These prerequisites apply to both batch inference and real-time inference.
### Connect to workspace
Create a workspace object from the existing workspace. Make sure you have downloaded `config.json` from your own workspace.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
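
`Workspace.from_config()` reads the subscription ID, resource group, and workspace name from `config.json`. As a minimal sketch, the check below reports which of those keys are missing before you try to connect. The helper name `missing_config_keys` and the placeholder values are illustrative assumptions, not part of the Azure ML SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys that a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required keys that are absent or empty in a config.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with placeholder values (hypothetical, not a real workspace).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(json.dumps({
        "subscription_id": "00000000-0000-0000-0000-000000000000",
        "resource_group": "my-resource-group",
    }), encoding="utf-8")
    missing = missing_config_keys(demo_path)

print("Missing keys:", missing)
```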

### Registering models with the workspace
We will use the MNIST model for both batch inference and real-time inference.

In [None]:
from azureml.core.model import Model

# register the MNIST model 
model = Model.register(model_path = "model/",
                       model_name = "mnist", # this is the name the model is registered as
                       tags = {'pretrained': "mnist"},
                       description = "Mnist trained tensorflow model",
                       workspace = ws)

### Specify the environment for your script
Specify the conda dependencies for your script. This will allow us to install pip packages as well as configure the inference environment.

In [None]:
from azureml.core import Environment
from azureml.core.runconfig import CondaDependencies, DEFAULT_CPU_IMAGE

# specify inference environment
conda_deps = CondaDependencies.create(conda_packages=['numpy'], 
                                      pip_packages=["tensorflow==1.13.1", "pillow", 'azureml-defaults'])
myenv = Environment(name="my_environment")
myenv.python.conda_dependencies = conda_deps
myenv.docker.enabled = True
myenv.docker.base_image = DEFAULT_CPU_IMAGE

## Batch inference prerequisites
The following three steps are needed only for batch inference.
### Create or Attach existing compute target
The code below creates the compute clusters for you if they don't already exist in your workspace.

In [None]:
import os
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
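
Values read from environment variables are always strings, while the node counts passed to the cluster configuration are meant to be integers. The helper below is a minimal sketch of one way to read such settings safely. The name `int_from_env` and the simulated environment are assumptions made for illustration.

```python
import os


def int_from_env(name, default, environ=None):
    """Read an integer setting from the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


# Simulated environment: the maximum is set (as a string), the minimum is not.
fake_env = {"AML_COMPUTE_CLUSTER_MAX_NODES": "8"}
min_nodes = int_from_env("AML_COMPUTE_CLUSTER_MIN_NODES", 0, fake_env)
max_nodes = int_from_env("AML_COMPUTE_CLUSTER_MAX_NODES", 4, fake_env)
print(min_nodes, max_nodes)
```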

### Create a FileDataset for input
We have already created a public blob container `sampledata` on an account named `pipelinedata`, containing images from the MNIST dataset. We will create a FileDataset based on this blob container.
First, create a datastore that points to the blob container.

In [None]:
from azureml.core.datastore import Datastore

account_name = "pipelinedata"
datastore_name = "mnist_datastore"
container_name = "sampledata"

mnist_data = Datastore.register_azure_blob_container(ws, 
                      datastore_name=datastore_name, 
                      container_name= container_name, 
                      account_name=account_name,
                      overwrite=True)

Create a FileDataset from the `mnist_data` datastore registered above.

In [None]:
from azureml.core.dataset import Dataset

mnist_ds_name = 'mnist_sample_data'

path_on_datastore = mnist_data.path('mnist')
input_mnist_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)
registered_mnist_ds = input_mnist_ds.register(ws, mnist_ds_name, create_new_version=True)
named_mnist_ds = registered_mnist_ds.as_named_input(mnist_ds_name)

### Configure output data
Let's specify the default datastore for the outputs and construct our PipelineData.

In [None]:
def_data_store = ws.get_default_datastore()

from azureml.pipeline.core import Pipeline, PipelineData

output_dir = PipelineData(name="inferences", 
                          datastore=def_data_store, 
                          output_path_on_compute="mnist/results")

## Create inference config
### For batch inference, create a ParallelRunConfig and a pipeline step
Create the ParallelRunConfig to wrap the inference script.

In [None]:
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="batch-script",
    entry_script="digit_identification.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=myenv,
    compute_target=compute_target,
    node_count=2)
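
The entry script `digit_identification.py` is not shown in this notebook. A ParallelRunStep entry script is expected to define `init()`, called once per worker process, and `run(mini_batch)`, called once per mini-batch; for a FileDataset the mini-batch is a list of file paths. The sketch below follows that contract with a stub in place of the TensorFlow model, and simulates locally how `mini_batch_size="5"` would split the input. The stub predictor, the `chunk` helper, the file paths, and the row format are all hypothetical.

```python
import os

model = None  # set once per worker process by init()


def _stub_predict(file_path):
    """Hypothetical stand-in for a real model: derives a digit from the file name."""
    return len(os.path.basename(file_path)) % 10


def init():
    """Load the model once. Here a stub replaces the real TensorFlow model."""
    global model
    model = _stub_predict


def run(mini_batch):
    """Score one mini-batch of file paths and return one result row per file."""
    return [f"{os.path.basename(path)}: {model(path)}" for path in mini_batch]


def chunk(items, size):
    """Split items into consecutive mini-batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Local simulation: 12 hypothetical files with a mini-batch size of 5.
files = [f"/data/mnist/{i}.png" for i in range(12)]
init()
batches = chunk(files, 5)
all_rows = [row for batch in batches for row in run(batch)]
print(len(batches), "mini-batches,", len(all_rows), "rows")
```

With `output_action="append_row"`, the rows returned by every `run()` call are collected into a single output file, which is why `run()` returns one entry per input file.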

Create the pipeline step using the script, environment configuration, and parameters. Specify the compute target you already attached to your workspace as the target of execution of the script. We will use ParallelRunStep to create the pipeline step.

In [None]:
parallelrun_step = ParallelRunStep(
    name="predict-digits-mnist",
    parallel_run_config=parallel_run_config,
    inputs=[ named_mnist_ds ],
    output=output_dir,
    models=[ model ],
    arguments=[ ],
    allow_reuse=True
)

### For real-time inference, create an InferenceConfig and deploy a web service to MIR
Create the InferenceConfig to wrap the inference script.

In [None]:
from azureml.core.model import InferenceConfig

inference_config = InferenceConfig(entry_script='real-time-script/score.py', environment=myenv)
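
The scoring script `real-time-script/score.py` is not shown in this notebook. A web service entry script is expected to define `init()`, which runs once when the service starts, and `run(raw_data)`, which receives the request body for each call. The sketch below assumes the `{"data": [...]}` request format that this notebook sends later, and uses a stub class in place of the TensorFlow model. `_StubModel` and its prediction rule are hypothetical.

```python
import json

model = None  # set once at service start-up by init()


class _StubModel:
    """Hypothetical stand-in for the TensorFlow model."""

    def predict(self, pixels):
        # Position of the brightest pixel modulo 10: NOT a real digit prediction.
        brightest = max(range(len(pixels)), key=pixels.__getitem__)
        return brightest % 10


def init():
    """Load the model once when the service starts."""
    global model
    model = _StubModel()


def run(raw_data):
    """Parse the JSON request body and return one prediction per input row."""
    try:
        rows = json.loads(raw_data)["data"]
        return [model.predict(row) for row in rows]
    except Exception as exc:
        # Return the error as JSON-serializable data instead of crashing the service.
        return {"error": str(exc)}


# Local smoke test with a tiny four-pixel "image".
init()
ok = run(json.dumps({"data": [[0.0, 0.2, 0.9, 0.1]]}))
bad = run("not valid json")
print(ok, bad)
```

Note the contrast with the batch entry script: here `run()` handles one small request at a time, while the batch script receives a mini-batch of files from a Dataset.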

Set the web service configuration and deploy the service to MIR.

In [None]:
from azureml.contrib.mir.webservice import MirWebservice

mir_config = MirWebservice.deploy_configuration(cpu_cores=4, 
                                                memory_gb=4, 
                                                sku="Standard_NC6_Promo", 
                                                num_replicas=1, 
                                                gpu_cores=1)
mir_service_name ='mnist-mir-service'

mir_service = Model.deploy(workspace = ws, 
                           name = mir_service_name,
                           models = [model],
                           inference_config = inference_config,
                           deployment_config = mir_config)
mir_service.wait_for_deployment(show_output = True)
print(mir_service.state)
print(mir_service.scoring_uri)

## Run the inference
### Run the pipeline of batch inference
At this point you can run the pipeline and examine the output it produced. The Experiment object is used to track the run of the pipeline.

In [None]:
from azureml.core import Experiment

pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
experiment = Experiment(ws, 'digit_identification')
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
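
With `output_action="append_row"`, the step gathers the rows returned by the entry script into a single text file in the output location (typically named `parallel_run_step.txt`). The row format depends entirely on what the entry script returns, which this notebook does not show. As a minimal sketch, the code below parses a local copy of such a file, assuming hypothetical rows of the form `file name: digit`; the helper `read_predictions` and the sample contents are illustrative.

```python
import os
import tempfile


def read_predictions(result_path, delimiter=":"):
    """Parse 'file name: digit' rows into a {file name: digit} dictionary."""
    predictions = {}
    with open(result_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, _, label = line.partition(delimiter)
            predictions[name.strip()] = int(label)
    return predictions


# Demonstration with a hypothetical results file written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    result_file = os.path.join(tmp_dir, "parallel_run_step.txt")
    with open(result_file, "w", encoding="utf-8") as handle:
        handle.write("0.png: 7\n1.png: 2\n\n2.png: 1\n")
    predictions = read_predictions(result_file)

print(predictions)
```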

### Run real-time inference by web service using run method
We call the web service by passing it data.
The `run()` method retrieves API keys behind the scenes to make sure that the call is authenticated. The cell below instead posts directly to the scoring URI, using a bearer token obtained from `get_access_token()`. Keep the input for real-time inference small; for large inputs, use batch inference.

In [None]:
# Used to test your webservice
import os
import urllib.request
import gzip
import numpy as np
import struct
import requests

# load compressed MNIST gz files and return numpy arrays
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        struct.unpack('I', gz.read(4))
        n_items = struct.unpack('>I', gz.read(4))
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            res = np.frombuffer(gz.read(n_items[0] * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items[0], n_rows * n_cols)
        else:
            res = np.frombuffer(gz.read(n_items[0]), dtype=np.uint8)
            res = res.reshape(n_items[0], 1)
    return res

# one-hot encode a 1-D array
def one_hot_encode(array, num_of_classes):
    return np.eye(num_of_classes)[array.reshape(-1)]

# Download test data
os.makedirs('./data/mnist', exist_ok=True)
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename='./data/mnist/test-images.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename='./data/mnist/test-labels.gz')

# Load test data from model training
X_test = load_data('./data/mnist/test-images.gz', False) / 255.0
y_test = load_data('./data/mnist/test-labels.gz', True).reshape(-1)

# send a random row from the test set to score
import json
random_index = np.random.randint(0, len(X_test))  # upper bound is exclusive
input_data = json.dumps({"data": [X_test[random_index].tolist()]})

token = mir_service.get_access_token().access_token
headers = {'Content-Type': 'application/json',
           'Authorization': ('Bearer ' + token)}
resp = requests.post(mir_service.scoring_uri, input_data, headers=headers)

print("POST to url", mir_service.scoring_uri)
print("label:", y_test[random_index])
print("prediction:", resp.text)
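
The request body is a JSON object with a `data` key holding a list of flattened images. As a minimal sketch, the helpers below build that body and the matching headers with the standard library only. The names `build_payload` and `auth_headers` and the placeholder token are assumptions for illustration; converting each value with `float()` also turns NumPy scalars into plain Python floats that `json.dumps` can serialize.

```python
import json


def build_payload(rows):
    """Serialize flattened images into the {"data": [...]} request body."""
    return json.dumps({"data": [[float(value) for value in row] for row in rows]})


def auth_headers(token):
    """Build JSON request headers carrying a bearer token."""
    return {"Content-Type": "application/json",
            "Authorization": "Bearer " + token}


# One hypothetical 28x28 image, flattened to 784 values, with a single lit pixel.
image = [0.0] * 784
image[100] = 1.0

payload = build_payload([image])
headers = auth_headers("placeholder-token")
decoded = json.loads(payload)
print(len(decoded["data"]), len(decoded["data"][0]))
```

Building the body with `json.dumps` avoids depending on how Python prints a list of numbers, which string concatenation would rely on.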

## Cleanup Compute resources

For recurring jobs, it may be wise to keep the compute resources. However, since this is a single-run job, we are free to release the allocated compute resources.

In [None]:
# uncomment below and run if compute resources are no longer needed
# compute_target.delete()