## Building your first AzureML Spark web service

In this tutorial, we will walk you through loading a dataset, exploring
its features, training a model on the dataset, and then publishing a
realtime scoring API for the model.

First, let's read in the Boston Housing Price dataset. This dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/Housing. We have placed a copy in your azureml/datasets folder.

In [2]:
# Importing Azure ML API SDK functionality. The SDK is installed implicitly with the latest
# version of the CLI in your default python environment
from azure.ml.api.schema.dataTypes import DataTypes
from azure.ml.api.schema.sampleDefinition import SampleDefinition
from azure.ml.api.realtime.services import prepare

In [3]:
# Read in the housing price dataset
df2 = spark.read.csv("datasets/housing.csv", header=True, inferSchema=True)
df2.show()
df2.printSchema()

+-------+----+-----+----+-----+-----+-----+------+---+---+-------+-----+----+
|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM|  AGE|   DIS|RAD|TAX|PTRATIO|LSTAT|MEDV|
+-------+----+-----+----+-----+-----+-----+------+---+---+-------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|296|   15.3| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|242|   17.8| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|242|   17.8| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|222|   18.7| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|222|   18.7| 5.33|36.2|
|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|222|   18.7| 5.21|28.7|
|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5605|  5|311|   15.2|12.43|22.9|
|0.14455|12.5| 7.87|   0|0.524|6.172| 96.1|5.9505|  5|311|   15.2|19.15|27.1|
|0.21124|12.5| 7.87|   0|0.524|5.631|100.0|6.0821|  5|311|   15.2|29.93|16.5|
|0.17004|12.5| 7.87|   0|0.524|6.004| 85.9|6.5921|  5|311|   15.

### Train your model

Using Spark's ML library, we can train a gradient boosted tree regressor for our data to produce a model that can predict median values of houses in Boston. Once we have trained the model, we can then evaluate it for quality using the root mean squared error metric.

In [4]:
# Train a boosted decision tree regressor
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.pipeline import Pipeline
import numpy as np
formula = RFormula(formula="MEDV~.")
gbt = GBTRegressor()
pipeline = Pipeline(stages=[formula, gbt]).fit(df2)

In [5]:
# Evaluate scores
scores = pipeline.transform(df2)
from pyspark.ml.evaluation import RegressionEvaluator
print "R^2 error =", RegressionEvaluator(metricName="r2").evaluate(scores)

R^2 error = 0.977613898997


### Save your model and schema

Once you have a model that performs well, you can package it into a scoring service. To prepare for this, save your model and dataset schema locally first.

In [6]:
# Save model
pipeline.write().overwrite().save("housing.model")
print "Model saved"

Model saved


## Authoring a Realtime Web Service

In this section, we show you how to author a realtime web service that scores the model you saved above. 

### 1. Define ```init``` and ```run```

We start by defining our ```init``` and ```run``` functions in the cell below. 

The ```init``` function initializes your web service, loading in any data or models that you need to score your inputs. In the example below, we load in the trained model and the schema of our dataset.

The ```run``` function defines what is executed on a scoring call. In our simple example, we simply load in the json input as a data frame, and run our pipeline on the input.

In [7]:
#%%save_file -f driver.py
# User written init function should mainly focus on loading the model(s) now. 
# Schema loading is done in generated code
def init():
    from pyspark.ml import PipelineModel
    global pipeline
    pipeline = PipelineModel.load("housing.model")


# Run method now takes actual objects (dataframes, numpy arrays, other types) as arguments
# so the user no longer needs to worry about HTTP body parsing to object(s), 
# the generated code does that, if schema is generated
def run(input_df):
    score = pipeline.transform(input_df)
    return score.collect()[0]['prediction']

In [8]:
# To generate schema for inputs (and outputs for rich swagger), users should define a map of input name -> input sample
# where input name needs to match exactly with the names of arguments for the run function, and for samples the
# user is advised to use the data structures he likely already created and used for testing the model after training
inputs = {"input_df": SampleDefinition(DataTypes.SPARK, df2.drop("MEDV"))}

In [9]:
# Finally, we put it all together by calling prepare with the init, run and inputs (and/or outputs) definitions.
# This will create a folder named output_{timestamp} in the current working directory (by default, or can use the
# drop_folder param to override that) and inside it it will generate the driver program - main.py and schema file -
# service_schema.json (if inputs or outputs definitions are specified).
# The user can then use the generated driver with the -f option in the CLI and schema for -s when publishing a service
prepare(run_func=run, init_func=init, input_types=inputs)

processing inputs
processing init function
processing run function
setting up output directory
Done setting up output directory, available here output_20170622202541


'output_20170622202541'

### 2. Test ```init``` and ```run```

We can then test the ```init``` and ```run``` functions right here in the notebook, before we decide to actually publish a web service.

In [10]:
# Here, if publisher wants to test the init and run functions like before in the notebook, I had to create a DF for
# the input instead of a string since an data frame is expected now. 
input_data = [[0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 4.98, 24.0]]
df = spark.createDataFrame(input_data, ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"])
init()
print(run(df))

18.6874685183


### 3. Save your web service definition

Go back to the cell where you defined your ```init``` and ```run``` functions, uncomment the magic in the first line (```#%%save_file -f testing.py```), and run the cell again. This will save the contents of the cell to a local file with the name supplied to the ```-f``` argument.

### 4. Use the Azure Machine Learning CLI to deploy and manage your web services

Switch to a bash shell, and run the following commands to deploy your service locally:
```
cd ~home/azuremluser/notebooks/azureml/realtime/
az ml env local
az ml service create realtime -f main.py -m ../housing.model -s service_schema.json -n mytestapp1 -r spark-py -v

```
### 5. Test the Web Service
Use the below command to get the usage help with sample data input:

```
az ml service 
```

### 6. Get the Swagger document

```
curl http://127.0.0.1:<portNumber>/swagger.json 
```