# PySpark and PMML Example
In this example we train an MNIST model with PySpark, export it to PMML, and wrap it with Seldon's S2I builder so that seldon-core can serve predictions from it.

## Dependencies

To run this notebook you need to set up PySpark and JPMML's Spark export package.

 * [Install PySpark along with Spark](http://spark.apache.org/downloads.html)
 * [Install JPMML Spark Package](https://github.com/jpmml/jpmml-sparkml-package)
 
After following the instructions above, add a set of environment variables of the form shown below to your shell:

```
export SPARK_HOME=<MY SPARK INSTALL FOLDER>/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYTHONPATH=<MY JPMML FOLDER>/jpmml-sparkml-package/target/jpmml_sparkml-1.4rc0-py3.6.egg
```

Then, when you run `pyspark` from this notebook's folder, it should start Jupyter with a Spark context and the JPMML libraries available, e.g.

```
pyspark --jars <MY JPMML FOLDER>/jpmml-sparkml-package/target/jpmml-sparkml-package-1.4-SNAPSHOT.jar
```

# Train MNIST Model using PySpark

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np

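# Download MNIST (if needed) with integer class labels rather than one-hot vectors
mnist = input_data.read_data_sets('data/MNIST_data', one_hot=False)
# Scale pixel intensities from [0, 1] to integers in [0, 255]
X = (mnist.train.images * 255).astype(int)
# Append the label as the last (785th) column and write everything as integer CSV
X_y = np.concatenate((X, np.expand_dims(mnist.train.labels, 1)), axis=1)
np.savetxt("mnist_train.csv", X_y, fmt='%i', delimiter=",")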


In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.feature import VectorAssembler

df = sqlContext.read.csv("./mnist_train.csv",inferSchema=True)

df = df.withColumnRenamed("_c784","label")

assembler = (VectorAssembler()
    .setInputCols(df.columns[0:784])
    .setOutputCol("features"))

lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
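
The training cell above relies on two conventions: Spark names the columns of a headerless CSV `_c0`, `_c1`, and so on, and the label is the last of the 785 columns in each row. The following is a minimal sketch of that layout in plain Python, using a synthetic row rather than real MNIST data; the helper name `default_spark_columns` is hypothetical and only mirrors Spark's naming scheme.

```python
import csv
import io

N_PIXELS = 28 * 28  # 784 pixel columns per MNIST image


def default_spark_columns(n_columns):
    """Column names Spark assigns when reading a CSV without a header."""
    return ["_c%d" % i for i in range(n_columns)]


# Synthetic row for illustration only: 784 zero pixels followed by label 7.
row = [0] * N_PIXELS + [7]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)

# Read the row back the way a headerless CSV reader would see it.
fields = next(csv.reader(io.StringIO(buffer.getvalue())))
columns = default_spark_columns(len(fields))

print(len(fields))   # number of columns in the row
print(columns[-1])   # name of the last column, which holds the label
```

The last column name is the one renamed to `label`, and the first 784 names are the ones passed to `VectorAssembler`.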

In [None]:
from jpmml_sparkml import toPMMLBytes

pmmlBytes = toPMMLBytes(sc, df, model)
with open('model.pmml', 'wb') as f:
    f.write(pmmlBytes)
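
Before building the image it can be useful to confirm that the exported file is well-formed PMML. The sketch below is one way to do that with the standard library; it assumes only that a PMML document is XML with a namespace-qualified `PMML` root element carrying a `version` attribute. The helper `describe_pmml` and the tiny sample document are hypothetical and written for illustration, not taken from a real export.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def describe_pmml(pmml_bytes):
    """Return the PMML version and the tags of the root's child elements."""
    root = ET.fromstring(pmml_bytes)
    if local_name(root.tag) != "PMML":
        raise ValueError("not a PMML document: root element is %r" % root.tag)
    return root.get("version"), [local_name(child.tag) for child in root]


# Tiny hand-written document for illustration; a real export is much larger.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header/>
  <DataDictionary/>
  <RegressionModel functionName="classification"/>
</PMML>"""

version, children = describe_pmml(sample)
print(version, children)
```

To check the real export, pass the contents of `model.pmml` read in binary mode, or the `pmmlBytes` value directly.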

In [None]:
!mv model.pmml src/main/resources

# Build Image with S2I

In [None]:
!s2i build . seldonio/seldon-core-s2i-java-build pyspark-test:0.1 --runtime-image seldonio/seldon-core-s2i-java-runtime

# Test with Docker

In [None]:
!docker run --name "pyspark_predictor" -d --rm -p 5000:5000 pyspark-test:0.1

In [None]:
!cd ../../../wrappers/testing && make build_protos

In [None]:
!python ../../../wrappers/testing/tester.py contract.json 0.0.0.0 5000 -p -t
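
The tester generates requests from `contract.json`, which is not shown in this notebook. As a rough illustration of what such a request might contain, the sketch below builds a body by hand. It assumes the model expects 784 integer pixel values per row in the usual 8-bit range of 0 to 255, and that the body uses Seldon's `data.ndarray` message shape; the helper `make_payload` is hypothetical, and nothing is sent over the network.

```python
import json
import random

N_FEATURES = 784  # assumed: one value per pixel of a 28x28 image


def make_payload(n_rows=1, seed=0):
    """Build a request body with random pixel rows (illustration only)."""
    rng = random.Random(seed)
    rows = [
        [rng.randint(0, 255) for _ in range(N_FEATURES)]
        for _ in range(n_rows)
    ]
    return {"data": {"ndarray": rows}}


payload = make_payload()
body = json.dumps(payload)  # JSON text that a client would submit

print(len(payload["data"]["ndarray"][0]))  # number of features in the row
```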

In [None]:
!docker rm pyspark_predictor --force

# Test in Minikube

In [None]:
!minikube start --memory 4096 --feature-gates=CustomResourceValidation=true --extra-config=apiserver.Authorization.Mode=RBAC

In [None]:
!kubectl create clusterrolebinding kube-system-cluster-admin --clusterrole=cluster-admin --serviceaccount=kube-system:default

In [None]:
!helm init

In [None]:
!helm install ../../../helm-charts/seldon-core-crd --name seldon-core-crd  --set usage_metrics.enabled=true
!helm install ../../../helm-charts/seldon-core --name seldon-core

In [None]:
!eval $(minikube docker-env) && s2i build . seldonio/seldon-core-s2i-java-build pyspark-test:0.1 --runtime-image seldonio/seldon-core-s2i-java-runtime

In [None]:
!kubectl create -f mnist_deployment.json

Wait until the deployment is ready, that is, until `replicas` equals `replicasAvailable` in the status printed by the next cell.

In [None]:
!kubectl get seldondeployments seldon-deployment-example -o jsonpath='{.status}'
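
The readiness condition (`replicas == replicasAvailable`) can also be checked programmatically. The sketch below assumes the status has been parsed into a dictionary, for example from `kubectl get ... -o json`, and that it holds a `predictorStatus` list whose entries carry `replicas` and `replicasAvailable`. That field layout is an assumption and may differ between seldon-core versions; the helper `is_ready` and the sample snapshots are hypothetical.

```python
def is_ready(status):
    """Return True when every predictor has all of its replicas available."""
    predictors = status.get("predictorStatus") or []
    if not predictors:
        return False  # nothing reported yet
    return all(
        p.get("replicas", 0) > 0
        and p.get("replicasAvailable", 0) == p["replicas"]
        for p in predictors
    )


# Hypothetical status snapshots, hand-written for illustration.
starting = {"predictorStatus": [{"name": "mnist", "replicas": 1, "replicasAvailable": 0}]}
running = {"predictorStatus": [{"name": "mnist", "replicas": 1, "replicasAvailable": 1}]}

print(is_ready(starting), is_ready(running))
```

Treating an empty status as not ready avoids sending test requests before the operator has reported anything.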

In [None]:
!python ../../../util/api_tester/api-tester.py contract.json \
    `minikube ip` `kubectl get svc -l app=seldon-apiserver-container-app -o jsonpath='{.items[0].spec.ports[0].nodePort}'` \
    --oauth-key oauth-key --oauth-secret oauth-secret -p

In [None]:
!minikube delete