## Building a sample AzureML Spark Batch web service

In this tutorial, you will use [Apache Spark](http://spark.apache.org/) to train a logistic regression model that predicts food inspection results. To do this, you will call the Spark Python API ([PySpark](http://spark.apache.org/docs/0.9.0/python-programming-guide.html)) to load a dataset, train a model on it, and publish a batch scoring API for the model.

### Load the data

The tutorial uses the *Food Inspections Data Set*, which contains the results of food inspections conducted in Chicago. For convenience, a copy of the data is in the ```azureml/datasets``` folder. The original dataset is available from the [City of Chicago data portal](https://data.cityofchicago.org/).

In [1]:
### Import the relevant PySpark bindings
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *

#### Read in the food inspections dataset and create numerical labels for training

In [2]:
def csvParse(s):
    import csv
    from StringIO import StringIO
    sio = StringIO(s)
    value = csv.reader(sio).next()
    sio.close()
    return value

inspections = sc.textFile(str("datasets/food_inspections1.csv")).map(csvParse)

schema = StructType([StructField("id", IntegerType(), False), 
                     StructField("name", StringType(), False), 
                     StructField("results", StringType(), False), 
                     StructField("violations", StringType(), True)])

df = sqlContext.createDataFrame(inspections.map(lambda l: (int(l[0]), l[1], l[12], l[13])) , schema)
df.registerTempTable('CountResults')

def labelForResults(s):
    if s == 'Fail':
        return 0.0
    elif s == 'Pass w/ Conditions' or s == 'Pass':
        return 1.0
    else:
        return -1.0
    
label = UserDefinedFunction(labelForResults, DoubleType())
labeledData = df.select(label(df.results).alias('label'), df.violations).where('label >= 0')
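
The parsing and labeling logic in the cell above can be tried on a single record without a Spark cluster. The following is a minimal sketch, written for Python 3 (so it uses `io.StringIO` and `next()` rather than the Python 2 idioms in the cell). The sample row and the names `csv_parse`, `label_for_results`, and `sample_line` are made up for illustration; only the column positions (0, 1, 12, 13) and the label mapping come from the tutorial.

```python
import csv
from io import StringIO

def csv_parse(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader(StringIO(line)))

def label_for_results(result):
    """Same mapping as labelForResults in the tutorial cell."""
    if result == 'Fail':
        return 0.0
    elif result in ('Pass w/ Conditions', 'Pass'):
        return 1.0
    return -1.0

# Hypothetical record with 14 columns: results at index 12, violations at 13.
row = ['1001', 'SAMPLE CAFE, INC.'] + [''] * 10 + \
      ['Pass w/ Conditions', '32. SURFACES, NOT CLEAN']

# Serialize the row to a CSV line so that fields containing commas get quoted.
buf = StringIO()
csv.writer(buf).writerow(row)
sample_line = buf.getvalue().strip()

# Parse the line back and keep the same four columns the tutorial keeps.
parsed = csv_parse(sample_line)
record = (int(parsed[0]), parsed[1], parsed[12], parsed[13])
label = label_for_results(record[2])
print(record, label)
```

Rows whose result is neither a pass nor a fail get the label `-1.0`, which is why the tutorial cell filters on `label >= 0` before training.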

#### Create and save the model
Next, you train a logistic regression model to predict inspection results. The following code tokenizes each "violations" string to get the individual words in it. It then uses a `HashingTF` transformer to convert each set of tokens into a feature vector, which is passed to the logistic regression algorithm to construct a model.

In [3]:
tokenizer = Tokenizer(inputCol="violations", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(labeledData)
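
To make the feature step more concrete, here is a rough plain-Python sketch of the idea behind tokenizing and term-frequency hashing. It is not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, `num_features=16` is an arbitrarily small bucket count chosen so the result is easy to read, and the helper names and sample text are hypothetical.

```python
import zlib
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(tokens, num_features=16):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash: any deterministic hash works for the illustration.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)

sample_violations = "Floors not clean floors need repair"  # made-up text
tokens = tokenize(sample_violations)
features = hashing_tf(tokens)
print(tokens)
print(features)
```

Two different words can land in the same bucket (a collision). A larger bucket count makes that less likely at the cost of a longer feature vector.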

Finally, you save the model to use when deploying the web service.

In [4]:
model.write().overwrite().save("food_inspection.model")
print "Model saved"

Model saved


## Authoring a Batch Web Service

In this section, you will author a batch web service that uses the model you saved previously to generate predictions.

### Create a PySpark script that defines the web service

To deploy a web service, you must create a PySpark script that defines it. The script specifies how the web service operates: the input parameters it expects from the caller and the outputs it produces.

When you create your batch web service using the Azure Machine Learning CLI, you pass the parameters that you identified in the script as command-line arguments.

In the sample provided, the script takes a data file as its `--input-data` argument, uses the saved logistic regression model to make predictions on the data, and then saves the predictions as a Parquet file at the path provided through the `--output-data` argument.
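
The way the script's arguments reach the code can be sketched with `argparse` alone. In this minimal example the argument names match the sample script, but the file paths are placeholders; passing a list to `parse_args` simulates the command line so the snippet runs anywhere.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input-data")
parser.add_argument("--output-data")

# Placeholder paths, passed as a list to stand in for real command-line input.
args = parser.parse_args(["--input-data", "food_inspections.csv",
                          "--output-data", "predictions.parquet"])

# argparse replaces the dash in each flag with an underscore in the attribute name.
print(args.input_data)
print(args.output_data)
```

This is why the sample script reads `args.input_data` and `args.output_data` even though the flags are spelled with dashes.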

The following cell contains the PySpark script that you pass to the AML CLI to create the web service. The save file call (```%%save_file -f batch_score.py```) in the first line of the cell saves the contents of the cell to a local file with the name supplied by the ```-f``` argument.

In [7]:
%%save_file -f batch_score.py
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
import argparse

sc = SparkContext.getOrCreate()
sqlContext = SQLContext.getOrCreate(sc)

parser = argparse.ArgumentParser()
parser.add_argument("--input-data")
parser.add_argument("--output-data")

args = parser.parse_args()
print str(args.input_data)
print str(args.output_data)

def csvParse(s):
    import csv
    from StringIO import StringIO
    sio = StringIO(s)
    value = csv.reader(sio).next()
    sio.close()
    return value

model = PipelineModel.load("food_inspection.model")

testData = sc.textFile(str(args.input_data))\
             .map(csvParse) \
             .map(lambda l: (int(l[0]), l[1], l[12], l[13]))

schema = StructType([StructField("id", IntegerType(), False), 
                     StructField("name", StringType(), False), 
                     StructField("results", StringType(), False), 
                     StructField("violations", StringType(), True)])

testDf = sqlContext.createDataFrame(testData, schema).where("results = 'Fail' OR results = 'Pass' OR results = 'Pass w/ Conditions'")

predictionsDf = model.transform(testDf)

predictionsDf.write.parquet(str(args.output_data))

Saved cell to batch_score.py


### Use the Azure Machine Learning CLI to deploy and manage your batch web service

#### Deploy to local VM

To create the batch web service locally on the Data Science Virtual Machine (DSVM), open an SSH session to your DSVM.

**Note**: When you first run the Azure ML CLI you are prompted to configure your Azure ML API key. If you do not have a key, please refer to the readme at [https://github.com/Azure/AzureML-vNext](https://github.com/Azure/AzureML-vNext).

To deploy the web service, run the following commands:

```
aml env local
aml service create batch -f batch_score.py -n batchwebservice --input=--input-data --output=--output-data
```

You can optionally provide default values for the script's arguments when you create the web service.

The following command is an example of providing a default output location (`food_inspection_output`) during web service creation.

```
aml service create batch -f batch_score.py -n batchwebservice --input=--input-data --output=--output-data:food_inspection_output
```

#### Deploy to HDInsight Cluster
For instructions on deploying a sample batch web service to your HDInsight cluster, see the GitHub repository: https://github.com/Azure/AzureML-vNext

---  
Created by a Microsoft Employee.  
Copyright (C) Microsoft. All Rights Reserved.