# How to create a real-time web service for a Spark model on Azure

Before running the tutorial, you must configure your DSVM as specified in the README on the [Machine Learning Operationalization](https://aka.ms/o16ncli) GitHub repo. If you have previously configured your DSVM, check the GitHub repo to ensure that you are using the most recent instructions.

In this tutorial, we walk you through loading a dataset, exploring
its features, training a model on the dataset, and then publishing a
real-time scoring API for the model.

First, read in the Boston Housing Price dataset. This dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/Housing. We have placed a copy in your ```azureml/datasets``` folder.

In [None]:
# Import Azure ML API SDK. The SDK is installed implicitly with the latest
# version of the CLI in your default python environment
from azure.ml.api.schema.dataTypes import DataTypes
from azure.ml.api.schema.sampleDefinition import SampleDefinition
from azure.ml.api.realtime.services import generate_schema

In [None]:
# Read in the housing price dataset
df2 = spark.read.csv("datasets/housing.csv", header=True, inferSchema=True)
df2.show()
df2.printSchema()

## Train your model

Using Spark's ML library, we can train a gradient boosted tree regressor on our data to produce a model that predicts median values of houses in Boston. Once you have trained the model, you can evaluate its quality using the R² (coefficient of determination) metric.

In [None]:
# Train a gradient boosted tree regressor
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import GBTRegressor

# RFormula assembles every column except MEDV into a feature vector
# and uses MEDV as the label
formula = RFormula(formula="MEDV~.")
gbt = GBTRegressor()
# Fitting the Pipeline returns a PipelineModel, stored here as `pipeline`
pipeline = Pipeline(stages=[formula, gbt]).fit(df2)

In [None]:
# Evaluate scores
from pyspark.ml.evaluation import RegressionEvaluator
scores = pipeline.transform(df2)
# Note: this scores the same data the model was trained on
print("R^2 = {}".format(RegressionEvaluator(metricName="r2").evaluate(scores)))
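
To make the evaluation metrics concrete, the following minimal sketch computes RMSE and R² by hand on a few made-up label and prediction pairs. The numbers are illustrative only and are not results from the housing model. The helper names `rmse` and `r2` are hypothetical; `RegressionEvaluator` is expected to compute equivalent quantities from the `label` and `prediction` columns.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / float(len(labels))
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only (not output of the housing model)
labels = [24.0, 21.6, 34.7, 33.4]
predictions = [25.0, 20.6, 33.7, 34.4]

print("RMSE = {}".format(rmse(labels, predictions)))
print("R^2 = {}".format(r2(labels, predictions)))
```

RMSE is expressed in the units of the label (lower is better), while R² is unitless and approaches 1 as predictions approach the labels.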

### Save your model

Once you have a model that performs well, you can package it into a scoring service. To prepare for this, save your model locally first.

In [None]:
# Save model
pipeline.write().overwrite().save("housing.model")
print("Model saved")

## Authoring a Realtime Web Service

In this section, you author a realtime web service that scores the model you saved above.

### Define ```init``` and ```run```

Start by defining your ```init``` and ```run``` functions in the cell below. 

The ```init``` function initializes the web service, loading any data or models that it needs to score your inputs. In the example below, it loads the trained model; the schema is loaded by the generated code.

The ```run``` function defines what is executed on a scoring call. In this simple example, the service receives the JSON input as a data frame and runs the pipeline on it.

In [None]:
# Prepare the web service definition by authoring
# init() and run() functions.
# The user-written init function should focus on loading the model(s).
# Schema loading is done in the generated code.
def init():
    from pyspark.ml import PipelineModel
    global pipeline
    pipeline = PipelineModel.load("housing.model")

def run(input_df):
    score = pipeline.transform(input_df)
    # Return the prediction for the first input row only
    return score.collect()[0]['prediction']
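
The division of labor between ```init``` and ```run``` can be sketched without Spark. In this hypothetical stand-in, `StubModel` replaces the loaded `PipelineModel`, and the last lines imitate a host that calls `init` once and `run` once per request. How the generated service code actually invokes these functions is an assumption here, not something taken from the SDK.

```python
# Hypothetical stand-in for a loaded model; not part of the Azure ML SDK
class StubModel(object):
    def predict(self, rows):
        # Pretend prediction: the sum of each row's feature values
        return [sum(row) for row in rows]

model = None
load_count = 0

def init():
    """Load the (expensive) model once and keep it in a module-level global."""
    global model, load_count
    model = StubModel()
    load_count += 1

def run(rows):
    """Score one request using the model that init() loaded."""
    return model.predict(rows)

# Imitate the service host: one init() call, then several run() calls
init()
first = run([[1.0, 2.0]])
second = run([[3.0, 4.0], [5.0, 6.0]])
print("first: {} second: {}".format(first, second))
```

Keeping the model in a global means the cost of loading it is paid once at startup rather than on every scoring call.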

### Create a schema file 

To generate a schema for the inputs (and outputs, for rich swagger), you define a map of input names to input sample data. Each input name must exactly match the name of the corresponding argument of the ```run``` function. For samples, use the data structures you created and used for testing the model after training.

In [None]:
inputs = {"input_df": SampleDefinition(DataTypes.SPARK, df2.drop("MEDV"))}
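
Because the input names must match the arguments of ```run```, a quick check before generating the schema can catch a mismatch early. The sketch below is a hypothetical helper, not part of the Azure ML SDK, and assumes Python 3 for `inspect.signature` (in Python 2, `inspect.getargspec` plays a similar role). The names `example_run` and `example_inputs` are placeholders so that the real `run` and `inputs` are not overwritten.

```python
import inspect

def example_run(input_df):
    # Placeholder body; only the signature matters for this check
    return input_df

# Keys mirror the inputs map; the values are placeholders here
example_inputs = {"input_df": None}

def missing_or_extra_inputs(func, input_map):
    """Compare func's argument names with input_map keys.

    Returns (missing, extra): arguments without a sample, and
    samples that match no argument.
    """
    arg_names = set(inspect.signature(func).parameters)
    keys = set(input_map)
    return sorted(arg_names - keys), sorted(keys - arg_names)

missing, extra = missing_or_extra_inputs(example_run, example_inputs)
print("missing: {} extra: {}".format(missing, extra))
```

Both lists being empty means every argument of the function has a sample and no sample is unused.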

### Create the driver and schema files

Finally, we put all of this together by calling the ```generate_schema``` function with the run function, the file path, and the input (and/or output) definitions.

This writes the schema to the file given by *filepath* (here, ```service_schema.json```), relative to the current working directory.

In [None]:
generate_schema(run_func=run, inputs=inputs, filepath='service_schema.json')

### Test ```init``` and ```run```

Before publishing the web service, you can test the ```init``` and ```run``` functions in the notebook by running the following cell.

In [None]:
# One sample row, using the same feature columns as the service input (no MEDV label)
input_data = [[0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 4.98]]
df = spark.createDataFrame(input_data, ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "LSTAT"])
init()
print(run(df))

### Use the CLI to deploy and manage your web services

SSH into the DSVM and run the following commands to deploy your service locally.

Set the environment variables that you generated when you set up your DSVM, either from the command line or from a script.

Change to the azureml folder containing the realtime notebook.

```
cd ~/notebooks/azureml/realtime
```
Next, using the driver and schema files that are output to the *output\_{timestamp}* folder, run the following commands to create the web service:

```
az ml env local
az ml service create realtime -f main.py -m housing.model -s service_schema.json -n mytestapp -r spark-py -v
```

To create and run the web service on the ACS cluster, change to the cluster mode and rerun the service creation command:

```
az ml env cluster
az ml service create realtime -f main.py -m housing.model -s service_schema.json -n mytestapp -r spark-py -v
```

To test the local web service, run the following command with a sample data input:

Linux

```
az ml service run realtime -n mytestapp -d "{\"input_df\": [{\"CRIM\": 0.00632, \"RM\": 6.575, \"TAX\": 296, \"NOX\": 0.538, \"PTRATIO\": 15.3, \"LSTAT\": 4.98, \"CHAS\": 0, \"DIS\": 4.09, \"INDUS\": 2.31, \"RAD\": 1, \"ZN\": 18.0, \"AGE\": 65.2}, {\"CRIM\": 0.02731, \"RM\": 6.421, \"TAX\": 242, \"NOX\": 0.469, \"PTRATIO\": 17.8, \"LSTAT\": 9.14, \"CHAS\": 0, \"DIS\": 4.9671, \"INDUS\": 7.07, \"RAD\": 2, \"ZN\": 0.0, \"AGE\": 78.9}, {\"CRIM\": 0.02729, \"RM\": 7.185, \"TAX\": 242, \"NOX\": 0.469, \"PTRATIO\": 17.8, \"LSTAT\": 4.03, \"CHAS\": 0, \"DIS\": 4.9671, \"INDUS\": 7.07, \"RAD\": 2, \"ZN\": 0.0, \"AGE\": 61.1}]}"
```

Windows

```
az ml service run realtime -n mytestapp -d "{\"input_df\": [{\"CRIM\": 0.00632, \"RM\": 6.575, \"TAX\": 296, \"NOX\": 0.538, \"PTRATIO\": 15.3, \"LSTAT\": 4.98, \"CHAS\": 0, \"DIS\": 4.09, \"INDUS\": 2.31, \"RAD\": 1, \"ZN\": 18.0, \"AGE\": 65.2}, {\"CRIM\": 0.02731, \"RM\": 6.421, \"TAX\": 242, \"NOX\": 0.469, \"PTRATIO\": 17.8, \"LSTAT\": 9.14, \"CHAS\": 0, \"DIS\": 4.9671, \"INDUS\": 7.07, \"RAD\": 2, \"ZN\": 0.0, \"AGE\": 78.9}, {\"CRIM\": 0.02729, \"RM\": 7.185, \"TAX\": 242, \"NOX\": 0.469, \"PTRATIO\": 17.8, \"LSTAT\": 4.03, \"CHAS\": 0, \"DIS\": 4.9671, \"INDUS\": 7.07, \"RAD\": 2, \"ZN\": 0.0, \"AGE\": 61.1}]}"
```
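
Hand-escaping a JSON payload like the ones above is error-prone. As a minimal sketch, the payload can instead be built with Python's standard `json` module and escaped for use inside double quotes. The record below copies the first sample row from the command above. The simple quote escaping assumes the payload contains no backslashes, `$`, or backticks.

```python
import json

# First sample record from the command above
record = {
    "CRIM": 0.00632, "RM": 6.575, "TAX": 296, "NOX": 0.538,
    "PTRATIO": 15.3, "LSTAT": 4.98, "CHAS": 0, "DIS": 4.09,
    "INDUS": 2.31, "RAD": 1, "ZN": 18.0, "AGE": 65.2,
}

# The top-level key must match the run() argument name
payload = json.dumps({"input_df": [record]})

# Escape inner double quotes so the payload can sit inside "..." on the command line
escaped = payload.replace('"', '\\"')
print('az ml service run realtime -n mytestapp -d "{}"'.format(escaped))
```

The printed line can be pasted into the shell, which avoids typing the backslashes by hand.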

You can retrieve the swagger document using the following command:

```
curl http://127.0.0.1:<portNumber>/swagger.json
```
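
The same document can also be fetched from Python. This is a minimal sketch assuming Python 3 and a locally running service; `swagger_url` and `fetch_swagger` are hypothetical helper names, and the port must be supplied by you, since it is not fixed in this tutorial.

```python
import json
from urllib.request import urlopen  # Python 3; in Python 2 use urllib2.urlopen

def swagger_url(port):
    """Build the local swagger URL for the given port."""
    return "http://127.0.0.1:{}/swagger.json".format(port)

def fetch_swagger(port):
    """Download and parse the swagger document; requires the local service to be running."""
    with urlopen(swagger_url(port)) as response:
        return json.loads(response.read().decode("utf-8"))

# Example (commented out because it needs a running service and its real port):
# doc = fetch_swagger(port)
# print(sorted(doc.get("paths", {})))
```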