## Building your first AzureML Spark web service

In this tutorial, we will walk you through loading a dataset, exploring
its features, training a model on the dataset, and then publishing a
realtime scoring API for the model.

First, let's read in the Boston Housing Price dataset. We have placed a copy in your azureml/datasets folder.

In [25]:
# Import Azure ML API SDK. The SDK is installed implicitly with the latest
# version of the CLI in your default python environment
from azure.ml.api.schema.dataTypes import DataTypes
from azure.ml.api.schema.sampleDefinition import SampleDefinition
from azure.ml.api.realtime.services import generate_schema

In [26]:
# Read in the housing price dataset
df2 = spark.read.csv("../datasets/housing.csv", header=True, inferSchema=True)
df2.show()
df2.printSchema()

+-------+----+-----+----+-----+-----+-----+------+---+---+-------+-----+----+
|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM|  AGE|   DIS|RAD|TAX|PTRATIO|LSTAT|MEDV|
+-------+----+-----+----+-----+-----+-----+------+---+---+-------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|296|   15.3| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|242|   17.8| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|242|   17.8| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|222|   18.7| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|222|   18.7| 5.33|36.2|
|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|222|   18.7| 5.21|28.7|
|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5605|  5|311|   15.2|12.43|22.9|
|0.14455|12.5| 7.87|   0|0.524|6.172| 96.1|5.9505|  5|311|   15.2|19.15|27.1|
|0.21124|12.5| 7.87|   0|0.524|5.631|100.0|6.0821|  5|311|   15.2|29.93|16.5|
|0.17004|12.5| 7.87|   0|0.524|6.004| 85.9|6.5921|  5|311|   15.

### Train your model

Using Spark's ML library, we can train a gradient boosted tree regressor for our data to produce a model that can predict median values of houses in Boston. Once we have trained the model, we can then evaluate it for quality using the root mean squared error metric.

In [27]:
# Train a boosted decision tree regressor
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.pipeline import Pipeline
import numpy as np
formula = RFormula(formula="MEDV~.")
gbt = GBTRegressor()
pipeline = Pipeline(stages=[formula, gbt]).fit(df2)

In [28]:
# Evaluate scores
scores = pipeline.transform(df2)
from pyspark.ml.evaluation import RegressionEvaluator
print("R^2 error = " + str(RegressionEvaluator(metricName="r2").evaluate(scores)))

R^2 error = 0.9776138989970254


### Save your model and schema

Once you have a model that performs well, you can package it into a scoring service. To prepare for this, save your model and dataset schema locally first.

In [29]:
# Save model
pipeline.write().overwrite().save("housing.model")
print("Model saved")

Model saved


## Authoring a Realtime Web service

In this section, we show you how to author a realtime web service that scores the model you saved above. 

### 1. Define ```init``` and ```run``` and create the ```score.py``` file

We start by defining our ```init``` and ```run``` functions in the cell below. 

The ```init``` function initializes your web service, loading in any data or models that you need to score your inputs. In the example below, we load in the trained model and the schema of our dataset.

The ```run``` function defines what is executed on a scoring call. In our simple example, we simply load in the json input as a data frame, and run our pipeline on the input.

The %%writefile command will save the score.py file.

In [30]:
%%writefile score.py
# After testing the below init() and run() functions,
# uncomment this cell to create the score.py.

def init():
    # read in the model file
    from pyspark.ml import PipelineModel
    global pipeline
    pipeline = PipelineModel.load("housing.model")
    
def run(input_df):
    response = ''
    
    try:
        #Get prediction results for the dataframe
        score = pipeline.transform(input_df)
        predictions = score.collect()

        #Get each scored result
        for pred in predictions:
            response += str(pred['prediction']) + ","
        # Remove the last comma
        response = response[:-1]
    except Exception as e:
        return (str(e))
    
    # Return results
    return response

Overwriting score.py


### Create Schema

Create a schema for the input to the web service.

In [31]:
# Define the input data frame
inputs = {"input_df": SampleDefinition(DataTypes.SPARK, df2.drop("MEDV"))}

### Create schema file

Generate the schema file. This will be used to create a Swagger file for your web service which can be used to discover its input and sample data when calling it.

In [32]:
import score
generate_schema(run_func=score.run, inputs=inputs, filepath='service_schema.json')

{'input': {'input_df': {'internal': {'fields': [{'metadata': {},
      'name': 'CRIM',
      'nullable': True,
      'type': 'double'},
     {'metadata': {}, 'name': 'ZN', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'INDUS', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'CHAS', 'nullable': True, 'type': 'integer'},
     {'metadata': {}, 'name': 'NOX', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'RM', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'AGE', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'DIS', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'RAD', 'nullable': True, 'type': 'integer'},
     {'metadata': {}, 'name': 'TAX', 'nullable': True, 'type': 'integer'},
     {'metadata': {}, 'name': 'PTRATIO', 'nullable': True, 'type': 'double'},
     {'metadata': {}, 'name': 'LSTAT', 'nullable': True, 'type': 'double'}],
    'type': 'struct'},
   'swagger': {'

### 3. Test ```init``` and ```run```

We can then test the ```init``` and ```run``` functions right here in the notebook, before we decide to actually publish a web service.

In [33]:
# Create the sample input dataframe
input_data = [[0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 4.98, 24.0],[0.00632, 59.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 4.98, 24.0],[0.00332, 76.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 4.98, 12.0]]
df = spark.createDataFrame(input_data, ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"])

#Call the run function to score using the model
score.init() #Score file imported above
print(score.run(df))

18.687468518340438,19.21096852021207,24.48585095195574


### 4. Use the CLI to deploy and manage your web service

#### Pre-requisites

Use the following commands to set up an environment and account to run the web service. For more info, see the Getting Started Guide and the CLI Command Reference. You can use -h flag at the end of the commands for command help.

* Create the environment (you need to do this once per environment e.g. dev or prod)

```
az ml env setup -c -n <yourclustername> --location <e.g. eastus2>
```
* Create a Model Management account (one time setup)

```
az ml account modelmanagement create --location <e.g. eastus2> -n <your-new-acctname> -g <yourresourcegroupname> --sku-capacity 1 --sku-name S1
```

*  Set the Model Management account

```
az ml account modelmanagement set -n <youracctname> -g <yourresourcegroupname>
```

*  Set the environment. The cluster name is the name used in step 1 above. The resource group name was the output of the same process and would be in the command window when the setup process is completed.

```
az ml env set -n <yourclustername> -g <yourresourcegroupname>
```

#### Deploy your web service

Switch to a bash shell, and run the following commands to deploy your service and run it.
Note that the cluster-name in the first command is from the az ml env set --cluster-name.

```
cd ~/notebooks/azureml/spark/realtime/
```
This assumes that you saved your model locally.
```
az ml service create realtime --model-file housing.model -f score.py -n housingservice -s service_schema.json -r spark-py
```
This command will return the sample run command with sample data. 
You can get the Service Id from the output of the create command above.
```
az ml service show realtime -i <yourserviceid>
```
Call the web service to get a prediction
```
az ml service run realtime -i <yourserviceid> -d "{\"input_df\": [{\"CRIM\": 0.00632, \"RM\": 6.575, \"TAX\": 296, \"NOX\": 0.538, \"PTRATIO\": 15.3, \"LSTAT\": 4.98, \"CHAS\": 0, \"DIS\": 4.09, \"INDUS\": 2.31, \"RAD\": 1, \"ZN\": 18.0, \"AGE\": 65.2}]}"
```
Prediction result:

{'result': '24.27495913312397'}