## Model Deployment with Spark Serving 
In this example, we train a model to predict income from the *Adult Census* dataset, then use Spark Serving to deploy the model as a real-time web service.
First, we import the packages we need:

In [1]:
import sys
import numpy as np
import pandas as pd





Now let's read the data and split it into training and test sets:

In [2]:
data = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet")
data = data.select(["education", "marital-status", "hours-per-week", "income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()


  education       marital-status  hours-per-week  income
0      10th             Divorced            25.0   <=50K
1      10th             Divorced            40.0   <=50K
2      10th             Divorced            40.0   <=50K
3      10th             Divorced            40.0   <=50K
4      10th   Married-civ-spouse            16.0   <=50K
5      10th   Married-civ-spouse            35.0   <=50K
6      10th   Married-civ-spouse            40.0   <=50K
7      10th   Married-civ-spouse            40.0   <=50K
8      10th   Married-civ-spouse            40.0   <=50K
9      10th   Married-civ-spouse            40.0   <=50K
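
`randomSplit` assigns rows to the splits at random, so the 75/25 proportions are approximate rather than exact, and fixing the seed makes the assignment repeatable for the same input. The pure-Python sketch below illustrates that idea on a plain list. `random_split` is a hypothetical helper written for this illustration; it is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row.

    Hypothetical helper for illustration only: it mimics the idea of
    independent per-row sampling, not Spark's actual code.
    """
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=123)
print(len(train_rows), len(test_rows))  # roughly 750 and 250
```

Every row lands in exactly one split, but the split sizes vary from seed to seed.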

`TrainClassifier` initializes and fits a model; it wraps SparkML classifiers.
Once it is imported, `help(TrainClassifier)` lists its parameters.

Note that it implicitly converts the data into the format expected by the algorithm. More specifically, it tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features into a vector, and so on. The parameter `numFeatures` controls the number of hashed features.
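
To make the hashing and assembling steps more concrete, here is a minimal pure-Python sketch of the general technique: string values are hashed into a fixed number of slots and the numeric columns are appended to form one flat vector. The functions `hash_index` and `featurize` are hypothetical, and `zlib.crc32` is only a stand-in hash. This is not what `TrainClassifier` does internally, just the idea behind a parameter like `numFeatures`.

```python
import zlib


def hash_index(token, num_features):
    """Map a string to a slot in [0, num_features) with a stable hash."""
    return zlib.crc32(token.encode("utf-8")) % num_features


def featurize(row, categorical_cols, numeric_cols, num_features):
    """Build one flat vector: hashed categorical slots, then numeric values."""
    vector = [0.0] * num_features
    for col in categorical_cols:
        # Include the column name in the hashed key so the same value in two
        # columns is treated as a different key; collisions are still possible.
        vector[hash_index(f"{col}={row[col]}", num_features)] += 1.0
    vector.extend(float(row[col]) for col in numeric_cols)
    return vector


row = {"education": "10th", "marital-status": "Divorced", "hours-per-week": 40.0}
features = featurize(
    row, ["education", "marital-status"], ["hours-per-week"], num_features=256
)
print(len(features))  # 257: 256 hashed slots plus one numeric column
```

A smaller `num_features` gives a shorter vector at the cost of more hash collisions.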

In [3]:
from mmlspark.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol="income", numFeatures=256).fit(train)




After the model is trained, we use it to score the test dataset and view the metrics.

In [4]:
from mmlspark.train import ComputeModelStatistics, TrainedClassifierModel
prediction = model.transform(test)
prediction.printSchema()


root
 |-- education: string (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- income: string (nullable = true)
 |-- scores: vector (nullable = true)
 |-- scored_probabilities: vector (nullable = true)
 |-- scored_labels: double (nullable = false)

In [5]:
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()


  evaluation_type  ...       AUC
0  Classification  ...  0.865245

[1 rows x 6 columns]
  Unsupported type in conversion to Arrow: MatrixUDT
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
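
The `AUC` column is the area under the ROC curve. As a reminder of what that number means, the sketch below computes AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. `auc_from_scores` is a hypothetical helper, and the labels and scores are made-up toy values, not output from the model above.

```python
def auc_from_scores(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.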

Next, we define the web service input and output.
For more information, see the [documentation for Spark Serving](https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md).

In [6]:
from pyspark.sql.types import *
from mmlspark.io import *
import uuid

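# Read HTTP requests sent to the "my_api" endpoint and parse each JSON body
# into columns using the schema of the test DataFrame.
serving_inputs = spark.readStream.server() \
    .address("localhost", 8898, "my_api") \
    .option("name", "my_api") \
    .load() \
    .parseRequest("my_api", test.schema)

# Score each request and use the predicted label as the reply.
serving_outputs = model.transform(serving_inputs) \
  .makeReply("scored_labels")

# Start the streaming query that sends the replies back to the callers.
server = serving_outputs.writeStream \
    .server() \
    .replyTo("my_api") \
    .queryName("my_query") \
    .option("checkpointLocation", "file:///tmp/checkpoints-{}".format(uuid.uuid1())) \
    .start()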


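
Conceptually, the streaming pipeline does three things for each request: it parses the JSON body into the columns the model expects, runs the model, and serializes one output column as the reply. The sketch below walks through those three steps with plain Python functions. It is a stand-in, not Spark Serving: `parse_request`, `toy_model`, `make_reply`, and the `SCHEMA` dictionary are hypothetical names for this illustration, and the scoring rule is made up.

```python
import json

# Hypothetical schema (column name -> Python type) mirroring the columns of `test`.
SCHEMA = {
    "education": str,
    "marital-status": str,
    "hours-per-week": float,
    "income": str,
}


def parse_request(body, schema):
    """Decode a JSON body into a row; fields absent from the body become None."""
    payload = json.loads(body)
    row = {}
    for name, expected_type in schema.items():
        value = payload.get(name)
        row[name] = expected_type(value) if value is not None else None
    return row


def toy_model(row):
    """Made-up scoring rule standing in for model.transform."""
    label = 1.0 if row["education"].strip() == "Masters" else 0.0
    return {**row, "scored_labels": label}


def make_reply(row, column):
    """Serialize a single column of the scored row as a compact JSON reply."""
    return json.dumps({column: row[column]}, separators=(",", ":"))


body = '{"education":" 10th","marital-status":"Divorced","hours-per-week":40.0}'
reply = make_reply(toy_model(parse_request(body, SCHEMA)), "scored_labels")
print(reply)  # {"scored_labels":0.0}
```

Note that the request does not carry an `income` field even though the schema lists it; in this sketch the missing field is simply set to `None`.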



Now we test the web service:

In [7]:
import requests
data = u'{"education":" 10th","marital-status":"Divorced","hours-per-week":40.0}'
r = requests.post(data=data, url="http://localhost:8898/my_api")
print("Response {}".format(r.text))


Response {"scored_labels":0.0}

In [8]:
import requests
data = u'{"education":" Masters","marital-status":"Married-civ-spouse","hours-per-week":40.0}'
r = requests.post(data=data, url="http://localhost:8898/my_api")
print("Response {}".format(r.text))


Response {"scored_labels":1.0}
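
The two test requests differ only in their payload, and hand-written JSON strings are easy to get wrong. Below is a sketch of a small helper that builds the same kind of POST request from a dictionary using only the standard library. `build_scoring_request` and `score` are hypothetical names, the URL is the one used above, and sending a JSON `Content-Type` header is an assumption (the cells above do not set one). Calling `score` requires the service to be running, so the example only builds the request.

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:8898/my_api"  # address of the service defined above


def build_scoring_request(payload, url=SERVICE_URL):
    """Create a POST request whose body is the JSON encoding of `payload`."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


def score(payload, url=SERVICE_URL, timeout=10.0):
    """Send the request and return the decoded JSON reply (needs a live service)."""
    request = build_scoring_request(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_scoring_request(
    {
        "education": " Masters",
        "marital-status": "Married-civ-spouse",
        "hours-per-week": 40.0,
    }
)
print(request.get_method(), request.full_url)
```

With the service running, `score({...})` would return the reply as a dictionary such as `{"scored_labels": 1.0}`.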

In [9]:
import time
time.sleep(20) # wait for server to finish setting up (just to be safe)
server.stop()

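
A fixed `time.sleep(20)` waits the full 20 seconds whether or not that much time is needed. One common alternative is to poll a condition until it holds or a timeout expires. The sketch below shows such a helper; `wait_until` is a hypothetical function, and which condition to poll (for example, a test request that succeeds) is left as an assumption to fill in for your setup.

```python
import time


def wait_until(predicate, timeout=20.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Toy predicate that becomes true on the third check.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(ready, timeout=5.0, interval=0.01))  # True
```

The helper returns as soon as the condition holds, so the 20-second budget becomes an upper bound rather than a fixed delay.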



In [10]:
spark.stop()


