Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Model deployment

Please ensure you have run all previous notebooks in sequence before running this.

Before using the SDK to deploy your ML model to Azure Container Instances (ACI), please register the ACI resource provider in your subscription using the Azure portal: https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-supported-services#portal

In [4]:
from azureml.core import Workspace
import azureml.core
import os

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

config_path = '/dbfs/tmp/'

#'''
ws = Workspace.from_config(path=os.path.join(config_path, 'aml_config', 'config.json'))
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Resource group: ' + ws.resource_group, sep = '\n')
#'''

In [5]:
# NOTE: service deployment always reads the model from the current working directory.
model_name = "PdM_logistic_regression.mml"
model_name_dbfs = os.path.join("/dbfs", model_name)

print("copy model from dbfs to local")
model_local = "file:" + os.getcwd() + "/" + model_name
dbutils.fs.cp(model_name, model_local, True)

In [6]:
# register the model
from azureml.core.model import Model
mymodel = Model.register(model_path = model_name, # this points to a local file
                       model_name = model_name, # name under which the model is registered; here the same as the file name
                       description = "ADB trained model by an amazing data scientist",
                       workspace = ws)

print(mymodel.name, mymodel.description, mymodel.version)

## Converting your data to and from JSON

The most common way to interact with a webservice is using a [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) API, sending and receiving [JSON](https://en.wikipedia.org/wiki/JSON) data.  

We therefore need to convert our dataframe to JSON to send it to the webservice, and the webservice then has to convert it back into a dataframe so that we can use our PySpark model to score the data.

Very often this is straightforward, because Spark can infer the schema of our data correctly from the JSON. However, this is not always the case. Our use case is an example where we need to help Spark by explicitly providing the schema when converting the JSON data back to a dataframe.

Let's start with an example to illustrate that.

  **Note**: Explicitly providing the schema of data is generally good practice, because it can speed up reading data and avoids surprises. This is not only true when working with Spark, but also, for example, in *R* or *scikit-learn*.
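
The same effect can be seen without Spark. The following is a minimal sketch in plain Python, using a hypothetical record (the field values and the tuple standing in for a vector are assumptions for illustration). After a round trip through JSON, the timestamp comes back as a string and the tuple as a list, because JSON itself carries no schema.

```python
import json
from datetime import datetime

# Hypothetical record, loosely modelled on one row of the data used here.
record = {
    "datetime": datetime(2015, 10, 16, 6, 0),
    "norm_features": (0.12, 0.5, 0.33),  # a tuple standing in for a vector
    "error": 0,
}


def to_json(rec):
    # JSON has no timestamp type, so the datetime is written as an ISO string.
    return json.dumps(rec, default=lambda value: value.isoformat())


restored = json.loads(to_json(record))

# Compare the Python type of each field before and after the round trip.
for key in record:
    print(key, type(record[key]).__name__, "->", type(restored[key]).__name__)
```

Only the `error` field keeps its type here; the other two need extra information (a schema) to be restored.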

In [8]:
df = spark.read.parquet("dbfs:/FileStore/tables/preprocessed").cache()
display(df)

from datetime import datetime
from pyspark.sql.functions import col

# split the data by date: rows before 2015-10-01 are used for training,
# and five rows after 2015-10-15 serve as test data
df_train = df.filter((col('datetime') < datetime(2015, 10, 1)))
df_test = df.filter(col('datetime') > datetime(2015, 10, 15)).limit(5)

In [9]:
# test_data_path = "TestData"

# test_data_path_dbfs = os.path.join("/dbfs", test_data_path)

# df_test = spark.read.parquet(test_data_path).limit(5)

display(df_test.limit(5))

In [10]:
import json

test_json = json.dumps(df_test.toJSON().collect())

print(test_json)

In [11]:
input_list = json.loads(test_json)
input_rdd = sc.parallelize(input_list)
input_df = spark.read.json(input_rdd)
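
Note that `test_json` is encoded twice: `toJSON()` turns every row into a JSON string, and `json.dumps` then wraps the list of those strings in a JSON array. That is why `json.loads` above yields a list of strings, each of which is parsed separately by `spark.read.json`. The sketch below mimics this with two hand-written row strings. The values are hypothetical, and the JSON layout of the vector column is an assumption rather than output copied from the data.

```python
import json

# Hypothetical stand-in for df_test.toJSON().collect(): one JSON string per row.
rows_as_json = [
    '{"norm_features": {"type": 1, "values": [0.1, 0.2]}, "error": 0}',
    '{"norm_features": {"type": 1, "values": [0.3, 0.4]}, "error": 1}',
]

# First encoding layer: a JSON array whose elements are JSON strings.
payload = json.dumps(rows_as_json)

# Decoding once returns the list of row strings, not the rows themselves.
decoded_once = json.loads(payload)
print(type(decoded_once[0]).__name__)  # str

# Decoding each element a second time yields the actual row contents.
rows = [json.loads(row) for row in decoded_once]
print(rows[1]["error"])  # 1
```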

Now, let's see whether the data look as expected after the round trip through JSON.

In [13]:
print("This is the schema of the original data frame:")
df_test.printSchema()

print("This is the schema of our data frame after converting it to/from JSON:")
input_df.printSchema()

try:
  assert(df_test.schema == input_df.schema)
except AssertionError:
  print("Sadly, the schemas of the two data frames are not the same.")

## Hands-on Lab

Help Spark by explicitly providing the schema when reading the JSON data.

This requires several parts:
1. Identify the schema of the original data
1. Create a schema definition that Spark can use when reading the JSON data
1. Tell Spark to use that schema definition when reading the JSON data

In [15]:
# Let's identify the schema
df_test.schema

OK. It looks like:
- `norm_features` are encoded as a `VectorUDT`
- `error` is encoded as `IntegerType`

The schema definition further depends on the classes `StructType` and `StructField`.

Try to find where those are defined in the PySpark API, and add the import statements at the top of the next cell. Hint: you need two lines of code.

Use the search function of the PySpark API [documentation](https://spark.apache.org/docs/latest/api/python/index.html) to find where most of these classes are defined. Unfortunately, `VectorUDT` is a little bit harder to find, and will require some finesse on your side.
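
If the documentation search does not help, one possible alternative is to ask Python itself: every class records the module it was defined in. The sketch below shows the idea on a standard-library object. The helper name `defining_module` is made up for this illustration, and applying it to the `dataType` of each field in `df_test.schema` is a suggestion, not something verified here.

```python
from fractions import Fraction


def defining_module(obj):
    """Return the name of the module in which the class of `obj` is defined."""
    return type(obj).__module__


# Stand-in object from the standard library; in the notebook one might pass
# field.dataType for each field in df_test.schema instead.
print(defining_module(Fraction(1, 3)))  # fractions
```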

In [17]:
#from pyspark.<...> import <...>
#from pyspark.<...> import <...>

myschema = StructType([
                      StructField("norm_features",VectorUDT()),
                      StructField("error",IntegerType())
                      ])

In [18]:
from pyspark.sql.types import StructField, StructType, IntegerType
from pyspark.ml.linalg import VectorUDT

myschema = StructType([
                      StructField("norm_features",VectorUDT()),
                      StructField("error",IntegerType())
                      ])

Now that you have defined the schema, tell Spark to use it when reading the JSON data.

Instead of simply writing `spark.read.json(input_rdd)`, pass your schema to the reader before it reads the data.

Use this [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=spark%20read%20schema#pyspark.sql.DataFrameReader.schema) for a hint on how to do this.

In [20]:
input_list = json.loads(test_json)
input_rdd = sc.parallelize(input_list)
# todo: modify the next line 
input_df = spark.read.json(input_rdd)

In [21]:
input_list = json.loads(test_json)
input_rdd = sc.parallelize(input_list)
# solution: pass the schema to the reader before reading the JSON data
input_df = spark.read.schema(myschema).json(input_rdd)

Now, let's see whether you were successful.

In [23]:
print("This is the schema of the original data frame:")
df_test.printSchema()

print("This is the schema of our data frame after converting it to/from JSON:")
input_df.printSchema()

try:
  assert(df_test.schema == input_df.schema)
  print("You did it!")
except AssertionError:
  print("Sadly, the schemas of the two data frames are not the same.")

## End of lab

## Create a score file

The next step in creating a web service is to write a score script that defines what the webservice does.

A typical score script has two methods defined:
- `init` is executed once, when the webservice is started
- `run` is executed every time a user interacts with the webservice to score data
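
A minimal sketch of this two-function contract in plain Python follows. The threshold function is a hypothetical stand-in for a trained model, and the `features` field name is an assumption for illustration; neither is part of the actual score script.

```python
import json

model = None  # set once by init(), reused by every call to run()


def threshold_model(features):
    # Hypothetical stand-in for a trained model: flags rows whose sum exceeds 1.
    return 1.0 if sum(features) > 1.0 else 0.0


def init():
    # Executed once at start-up; expensive loading belongs here.
    global model
    model = threshold_model


def run(input_json):
    # Executed once per request; errors are returned as JSON instead of raised.
    try:
        rows = json.loads(input_json)
        predictions = [model(row["features"]) for row in rows]
        return json.dumps({"result": predictions})
    except Exception as e:
        return json.dumps({"error": str(e)})


init()
print(run(json.dumps([{"features": [0.2, 0.3]}, {"features": [0.9, 0.4]}])))
```

Keeping the model in a global that `init` fills means the loading cost is paid once, not on every request.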

Look at the score script in the code cell below (`score_sparkml`). Can you see where we made the changes that are related to explicitly providing the schema when reading JSON data?

There are several places:
1. Importing the modules for defining the schema
1. Defining a global variable for holding the schema
1. Defining the schema
1. Using the schema when reading the data

In [25]:
score_sparkml = """

import json

def init():
    # One-time initialization of PySpark and predictive model
    import pyspark
    from azureml.core.model import Model
    from pyspark.ml import PipelineModel
    from pyspark.sql.types import StructField, StructType, IntegerType
    from pyspark.ml.linalg import VectorUDT

    global trainedModel
    global spark
    global schema
    
    spark = pyspark.sql.SparkSession.builder.appName("ADB and AML notebook by an amazing data scientist").getOrCreate()
    model_name = "{model_name}" #interpolated
    model_path = Model.get_model_path(model_name)
    trainedModel = PipelineModel.load(model_path)
    
    schema = StructType([StructField("norm_features",VectorUDT()), StructField("error",IntegerType())])
    
def run(input_json):
    if isinstance(trainedModel, Exception):
        return json.dumps({{"trainedModel":str(trainedModel)}})
      
    try:
        sc = spark.sparkContext
        input_list = json.loads(input_json)
        input_rdd = sc.parallelize(input_list)
        input_df = spark.read.schema(schema).json(input_rdd)
        
        # Compute prediction
        prediction = trainedModel.transform(input_df)
        #result = prediction.first().prediction
        predictions = prediction.collect()

        #Get each scored result
        preds = [str(x['prediction']) for x in predictions]
        result = ",".join(preds)
        # you can return any data type as long as it is JSON-serializable
        return json.dumps({{"result":result}})        
    except Exception as e:
        result = str(e)
        return json.dumps({{"error":result}})
    
""".format(model_name=model_name)

exec(score_sparkml)

with open("score_sparkml.py", "w") as file:
    file.write(score_sparkml)
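
The score script is assembled with `str.format`, which is why the literal braces of its dictionaries are doubled while `{model_name}` is not. A small sketch of that behaviour, using a hypothetical template and model name:

```python
# Single braces mark a placeholder; doubled braces produce a literal brace.
template = """
model_name = "{model_name}"

def run(input_json):
    return json.dumps({{"result": input_json}})
"""

rendered = template.format(model_name="my_model")  # hypothetical model name
print(rendered)
```

Forgetting to double a literal brace would make `format` treat the dictionary contents as a placeholder and raise an error.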

Creating a webservice requires creating a Docker container in which to run our score script.

This can all be done with the Azure ML (AML) Python SDK.

First we create a conda environment file, which makes sure that all the Python dependencies are installed in the Docker container. Then we create the container image.

In [27]:
from azureml.core.conda_dependencies import CondaDependencies 

myacienv = CondaDependencies.create(conda_packages=['scikit-learn','numpy','pandas']) #showing how to add libs as an example - not needed for this model.

with open("mydeployenv.yml","w") as f:
    f.write(myacienv.serialize_to_string())

In [28]:
with open("mydeployenv.yml","r") as f:
  print(f.read())

In [29]:
# this will take 10-15 minutes to finish

service_name = "myaci"
image_name = 'myimage'
runtime = "spark-py" 
driver_file = "score_sparkml.py"
my_conda_file = "mydeployenv.yml"

# image creation
from azureml.core.image import ContainerImage
myimage_config = ContainerImage.image_configuration(execution_script = driver_file, 
                                    runtime = runtime, 
                                    conda_file = my_conda_file)

# Create container Image
myimage = ContainerImage.create(
  workspace=ws, 
  name=image_name,
  models = [mymodel],
  image_config = myimage_config)

myimage.wait_for_creation(show_output=True)

In [30]:
help(ContainerImage)

Now we create the actual webservice, using the Docker image that is stored in the Azure Container Registry. 

Before you continue, try to find your container image in the Azure portal.

In [32]:
# deploy to ACI
from azureml.core.webservice import AciWebservice, Webservice

myaci_config = AciWebservice.deploy_configuration(
    cpu_cores = 2, 
    memory_gb = 2, 
    tags = {'name':'Databricks Azure ML ACI'}, 
    description = 'This is for ADB and AML example. Azure Databricks & Azure ML SDK demo with ACI.',
    location='westus2')

In [33]:
help(azureml.core.webservice)

In [34]:
# Webservice creation
myservice = Webservice.deploy_from_image(
  workspace=ws, 
  name=service_name,
  image=myimage,
  deployment_config = myaci_config)

myservice.wait_for_deployment(show_output=True)

Let's see what we created above. Here is a summary.

In [36]:
print(myservice.serialize())

You can also print individual properties of your webservice, for example the URL used by the webservice.

In [38]:
#for using the Web HTTP API 
print(myservice.scoring_uri)
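
With the scoring URI, the service can also be called from outside the SDK as an ordinary HTTP POST with a JSON body. The sketch below only builds such a request with the standard library and does not send it. The URI and payload are placeholders, the helper name is made up, and it is an assumption that the service accepts unauthenticated requests.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, payload):
    """Build (but do not send) a POST request carrying a JSON payload."""
    body = payload.encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical URI and payload; in the notebook these would be
# myservice.scoring_uri and test_json.
request = build_scoring_request(
    "http://example.invalid/score",
    json.dumps(['{"error": 0}']),
)
print(request.get_method(), request.full_url)

# Sending it would look like this (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```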

## Test Webservice

In [40]:
# We can use the test_json data we created above. 
myservice.run(input_data = test_json)
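
The score script returns its predictions as one comma-separated string inside a JSON object, or an `error` entry if scoring failed. A possible helper for turning such a reply into numbers is sketched below. The helper name and the sample reply are hypothetical, and since it is not confirmed here whether `myservice.run` hands back the raw string or an already decoded object, the helper accepts both.

```python
import json


def parse_predictions(response):
    """Turn the score script's reply into a list of floats."""
    # Decode the reply if it is still a JSON string.
    if isinstance(response, str):
        response = json.loads(response)
    # The score script reports failures under the "error" key.
    if "error" in response:
        raise RuntimeError(response["error"])
    return [float(p) for p in response["result"].split(",")]


# Hypothetical reply in the format produced by run() in score_sparkml.py.
print(parse_predictions('{"result": "0.0,1.0,0.0"}'))  # [0.0, 1.0, 0.0]
```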

In [41]:
# comment out the line below to keep the web service
myservice.delete()

Please make sure to install **VS Code** on your computer *before* the next session tomorrow morning.

You can find binary installers for VS Code here:

[https://code.visualstudio.com/download](https://code.visualstudio.com/download)
