# Step 4b: Authoring the components for a realtime web service and testing locally

The best model is now saved as a .model file, along with the relevant schema for deployment. The functions are first tested locally before the model is operationalized with the Azure Machine Learning Model Management environment for realtime use in production.

In [1]:
## Set up our environment by importing the required libraries
import os
import csv

import pandas as pd
import io
import requests

import glob
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

# For creating pipelines and model
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, VectorIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Set up the pyspark environment
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [2]:
%%time
# load the previously created final dataset into the workspace
from azure.storage.blob import BlockBlobService
import glob
import os

# define parameters 
ACCOUNT_NAME = "pdmvienna"
ACCOUNT_KEY = "PDuXK61GpmMVWMrWdvr29THbPdlOXa61fN5RfgQV/jBO8berC1zLzZ678Nxrx+D3CRp4+ZvSff9al+lrUh8qUQ=="
CONTAINER_NAME = "featureengineering"

# define your blob service     
my_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

# create a local directory to hold the downloaded parquet files
LOCAL_DIRECT = 'model_operationalize.parquet'
if not os.path.exists(LOCAL_DIRECT):
    os.makedirs(LOCAL_DIRECT)
    print('DONE creating a local directory!')

# download the entire parquet result folder to local path for a new run 
for blob in my_service.list_blobs(CONTAINER_NAME):
    if 'featureengineering_files.parquet' in blob.name:
        local_file = os.path.join(LOCAL_DIRECT, os.path.basename(blob.name))
        my_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

data = spark.read.parquet('model_operationalize.parquet')
#data.persist()
data.show(5)
print('Feature engineering final dataset files loaded!')

DONE creating a local directory!
+---------+--------------------+------------------+--------------------+----------------------+-----------------------+-------------------+---------------------+-----------------------+------------------------+------------------+-------------------+---------------------+----------------------+------------------+--------------------+----------------------+-----------------------+------------------------+------------------------+------------------------+------------------------+------------------------+-----------------+-----------------+-----------------+-----------------+------+---+-------------+--------+-------+
|machineID|        dt_truncated|volt_rollingmean_3|rotate_rollingmean_3|pressure_rollingmean_3|vibration_rollingmean_3|volt_rollingmean_24|rotate_rollingmean_24|pressure_rollingmean_24|vibration_rollingmean_24| volt_rollingstd_3|rotate_rollingstd_3|pressure_rollingstd_3|vibration_rollingstd_3|volt_rollingstd_24|rotate_rollingstd_24|pressure_rol

# Authoring a Realtime Web Service

In this section, we show how to author a realtime web service that scores incoming data with the model saved earlier. First, make sure that the latest version of azure-ml-api-sdk is installed.

In [3]:
!pip install azure-ml-api-sdk

The directory '/home/mmlspark/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/mmlspark/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting azure-ml-api-sdk
  Downloading azure_ml_api_sdk-0.1.0a10-py2.py3-none-any.whl (76kB)
    100% |################################| 81kB 127kB/s ta 0:00:01
Collecting liac-arff (from azure-ml-api-sdk)
  Downloading liac-arff-2.1.1.tar.gz
Installing collected packages: liac-arff, azure-ml-api-sdk
  Running setup.py install for liac-arff ... done
Successfully installed azure-ml-api-sdk-0.1.0a10 liac-arff-2.1.1


In [4]:
from azure.ml.api.schema.dataTypes import DataTypes
from azure.ml.api.schema.sampleDefinition import SampleDefinition
from azure.ml.api.realtime.services import generate_schema

## Define init and run functions

We start by defining our init and run functions in the cell below, then write them to the score.py file. This file loads the model, performs the prediction, and returns the result.

The init function initializes your web service, loading any data or models that you need to score your inputs. In the example below, we load the trained model. This function runs when the Docker container containing your service initializes.

The run function defines what is executed on a scoring call. In this example, we take the input as a data frame, run our pipeline on it, and return the prediction.

In [5]:
#%%save_file -f score.py
# After testing the init() and run() functions below,
# uncomment the line above to write this cell to score.py.

# When creating score.py, move the import out of the init() function.

def init():
    # read in the model file
    from pyspark.ml import PipelineModel
    global pipeline
    pipeline = PipelineModel.load("/azureml-share/pdmrfull.model")
    
def run(input_df):
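    import json
    # imported here so that run() still works once this cell is saved as a standalone score.py
    from pyspark.ml.feature import VectorAssembler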
    response = ''
    
    try:
        #Get prediction results for the dataframe
        input_features = [
            'volt_rollingmean_3',
            'rotate_rollingmean_3',
            'pressure_rollingmean_3',
            'vibration_rollingmean_3',
            'volt_rollingmean_24',
            'rotate_rollingmean_24',
            'pressure_rollingmean_24',
            'vibration_rollingmean_24',
            'volt_rollingstd_3',
            'rotate_rollingstd_3',
            'pressure_rollingstd_3',
            'vibration_rollingstd_3',
            'volt_rollingstd_24',
            'rotate_rollingstd_24',
            'pressure_rollingstd_24',
            'vibration_rollingstd_24',
            'error1sum_rollingmean_24',
            'error2sum_rollingmean_24',
            'error3sum_rollingmean_24',
            'error4sum_rollingmean_24',
            'error5sum_rollingmean_24',
            'comp1sum',
            'comp2sum',
            'comp3sum',
            'comp4sum',
            'age',
        ]

        va = VectorAssembler(inputCols=(input_features), outputCol='features')
        data = va.transform(input_df).select('machineID','features')
        score = pipeline.transform(data)
        predictions = score.collect()

        #Get each scored result
        preds = [str(x['prediction']) for x in predictions]
        response = ",".join(preds)
    except Exception as e:
        print("Error: {0}".format(str(e)))
        return (str(e))
    
    # Return results
    print(json.dumps(response))
    return json.dumps(response)
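
Because run joins the predictions with commas and then JSON-encodes the resulting string, a caller has to decode the response in two steps. The sketch below illustrates one way to do that. `parse_predictions` is a hypothetical helper, not part of the Azure ML SDK, and it assumes the response is a successful payload; on failure, run returns the raw error text, which is not valid JSON and would make `json.loads` raise.

```python
import json


def parse_predictions(response):
    """Decode the string returned by run() into a list of floats.

    Hypothetical helper: assumes `response` is a JSON-encoded string of
    comma-separated predictions, for example '"0.0,1.0"'.
    """
    decoded = json.loads(response)  # '"0.0,1.0"' -> '0.0,1.0'
    if not decoded:                 # empty string: no rows were scored
        return []
    return [float(p) for p in decoded.split(",")]


print(parse_predictions('"0.0"'))  # prints [0.0]
```

Returning a JSON list from run instead of a comma-joined string would remove the second decoding step, at the cost of changing the response format shown in this notebook.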

## Create schema and schema file 

Create a schema for the input to the web service and generate the schema file. The schema file is used to create a Swagger file for your web service, which callers can use to discover the expected input format and sample data.

In [6]:
# define the input data frame
inputs = {"input_df": SampleDefinition(DataTypes.SPARK, data.drop("dt_truncated","failure1","label_e", "model","model_encoded"))}

In [7]:
#import score
generate_schema(run_func=run, inputs=inputs, filepath='service_schema.json')

{'input': {'input_df': {'internal': {'fields': [{'metadata': {},
      'name': 'machineID',
      'nullable': True,
      'type': 'integer'},
     {'metadata': {},
      'name': 'volt_rollingmean_3',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'rotate_rollingmean_3',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'pressure_rollingmean_3',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'vibration_rollingmean_3',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'volt_rollingmean_24',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'rotate_rollingmean_24',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'pressure_rollingmean_24',
      'nullable': True,
      'type': 'double'},
     {'metadata': {},
      'name': 'vibration_rollingmean_24',
      'nullable': True,
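
The dictionary printed above is long, so a compact view of its column names and types can be easier to check against the feature list used in run. The following is a minimal sketch that assumes the nesting visible in the output (`input` → `input_df` → `internal` → `fields`); `schema_field_names` and the two-field `sample` dictionary are hypothetical and only mirror the first entries of that output.

```python
def schema_field_names(schema, input_name="input_df"):
    """Return (name, type) pairs for the fields of one schema input.

    Assumes the nesting shown in the generate_schema output:
    schema['input'][input_name]['internal']['fields'].
    """
    fields = schema["input"][input_name]["internal"]["fields"]
    return [(f["name"], f["type"]) for f in fields]


# Hypothetical two-field sample that mirrors the start of the printed schema.
sample = {"input": {"input_df": {"internal": {"fields": [
    {"metadata": {}, "name": "machineID", "nullable": True, "type": "integer"},
    {"metadata": {}, "name": "volt_rollingmean_3", "nullable": True, "type": "double"},
]}}}}

for name, dtype in schema_field_names(sample):
    print(name, dtype)
```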
 

# Test init and run

We can then test the init and run functions right here in the notebook, before we decide to actually publish a web service.

In [8]:
# sample input row: machineID followed by the 26 feature values, in the order expected by run()
input_data = [[114, 163.375732902,333.149484586,100.183951698,44.0958812638,164.114723991,277.191815232,97.6289110707,50.8853505161,21.0049565219,67.5287259378,12.9361526861,4.61359760918,15.5377738062,67.6519885441,10.528274633,6.94129487555,0.0,0.0,0.0,0.0,0.0,489.0,549.0,549.0,564.0,18.0]]
input_data

[[114,
  163.375732902,
  333.149484586,
  100.183951698,
  44.0958812638,
  164.114723991,
  277.191815232,
  97.6289110707,
  50.8853505161,
  21.0049565219,
  67.5287259378,
  12.9361526861,
  4.61359760918,
  15.5377738062,
  67.6519885441,
  10.528274633,
  6.94129487555,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  489.0,
  549.0,
  549.0,
  564.0,
  18.0]]
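
The values in this row have to line up with the column names given to `spark.createDataFrame` and with the `input_features` list inside run. Those names follow a regular pattern, so one way to avoid typing errors is to generate them. This is a sketch only; `build_feature_columns`, `SENSORS`, and `ROLLING` are hypothetical names introduced here, and the ordering is taken from the feature list in this notebook.

```python
SENSORS = ["volt", "rotate", "pressure", "vibration"]
# (statistic, window size) pairs, in the order used by the dataset
ROLLING = [("mean", 3), ("mean", 24), ("std", 3), ("std", 24)]


def build_feature_columns():
    """Rebuild the 26 feature column names in the order run() expects."""
    cols = [
        "{0}_rolling{1}_{2}".format(sensor, stat, window)
        for stat, window in ROLLING
        for sensor in SENSORS
    ]
    cols += ["error{0}sum_rollingmean_24".format(i) for i in range(1, 6)]
    cols += ["comp{0}sum".format(i) for i in range(1, 5)]
    cols.append("age")
    return cols


columns = ["machineID"] + build_feature_columns()
print(len(columns))  # prints 27, one name per value in the sample row
```

Comparing `len(columns)` with the length of each input row before building the data frame catches a missing or extra value early.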

In [9]:
df = (spark.createDataFrame(input_data, ["machineID", "volt_rollingmean_3", "rotate_rollingmean_3", "pressure_rollingmean_3", "vibration_rollingmean_3", "volt_rollingmean_24", 
            "rotate_rollingmean_24", "pressure_rollingmean_24", "vibration_rollingmean_24", "volt_rollingstd_3", "rotate_rollingstd_3",
            "pressure_rollingstd_3", "vibration_rollingstd_3", "volt_rollingstd_24", "rotate_rollingstd_24", "pressure_rollingstd_24",
            "vibration_rollingstd_24", "error1sum_rollingmean_24", "error2sum_rollingmean_24", "error3sum_rollingmean_24",
            "error4sum_rollingmean_24", "error5sum_rollingmean_24", "comp1sum", "comp2sum", "comp3sum", "comp4sum",
            "age"]))

In [10]:
init()

In [11]:
run(df)

"0.0"


'"0.0"'