# Kubeflow NLP end-to-end with Seldon

In this example we show how to build reusable components and combine them into an ML pipeline that can be trained and deployed at scale.

We will automate content moderation on the Reddit comments in /r/science by building a machine learning NLP model with the following components:

![](img/kubeflow-seldon-nlp-reusable-components.jpg)

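This tutorial is broken down into the following sections:

1) Before you start
2) Create project
3) Train our NLP Pipeline with Kubeflow
4) Upload our NLP pipeline
5) Train ML Pipeline via Kubeflow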

Let's get started! 🚀🔥

# Before you start
Make sure you install the following dependencies, as they are critical for this example to work:

* Helm v2.13.1+
* A Kubernetes cluster running v1.13 or above (minikube / docker-for-windows work well if they have enough RAM)
* kubectl v1.14+
* ksonnet v0.13.1+
* kfctl 0.5.2 - Please use this exact version as there are major changes every few months
* Python 3.6+
* Python DEV requirements (we'll install them below)
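
Before going further, it can be useful to confirm that the command-line tools are actually on your `PATH`. The following is a minimal sketch that only checks for presence, not versions; the executable names (in particular `ks` for ksonnet) are assumptions about a typical install, and the helper functions are hypothetical names used for illustration.

```python
import shutil
import sys

# Command-line tools from the dependency list above. The executable names
# ("ks" for ksonnet in particular) are assumptions about a typical install.
REQUIRED_TOOLS = ["helm", "kubectl", "ks", "kfctl"]


def find_missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_recent_enough(minimum=(3, 6)):
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on the PATH.")

if not python_is_recent_enough():
    print("Python 3.6+ is required for this example.")
```

This does not replace checking the versions listed above (for example with `kubectl version`), it only catches tools that are missing entirely.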


In [26]:
# You can also install the python dependencies that we'll need to build and test:
!pip install -r requirements-dev.txt

Collecting https://storage.googleapis.com/ml-pipeline/release/0.1.20/kfp.tar.gz (from -r requirements-dev.txt (line 2))
  Using cached https://storage.googleapis.com/ml-pipeline/release/0.1.20/kfp.tar.gz
Building wheels for collected packages: kfp
  Building wheel for kfp (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-4b2ca_6e/wheels/ae/bb/02/32b1356ee756181099d8f1b0950ac6567cb2b38e71b48f02e8
Successfully built kfp


# Create project
Kubeflow's CLI lets us create a project, which holds the configuration we need to deploy Kubeflow and Seldon.

In [1]:
!kfctl init kubeflow-seldon
!ls kubeflow-seldon

Now we run the following commands to launch our Kubeflow cluster with all its components.

It may take a while to download all the images for Kubeflow so feel free to make yourself a cup of ☕.

If you have a terminal you can see how the containers are created in real-time by running `kubectl get pods -n kubeflow -w`.

In [13]:
%%bash
cd kubeflow-seldon
kfctl generate all -V
kfctl apply all -V

time="2019-05-25T18:24:21+01:00" level=info msg="reading from /home/alejandro/Programming/kubernetes/seldon/seldon-core/examples/kubeflow/kubeflow-seldon/app.yaml" filename="coordinator/coordinator.go:341"
time="2019-05-25T18:24:21+01:00" level=info msg="reading from /home/alejandro/Programming/kubernetes/seldon/seldon-core/examples/kubeflow/kubeflow-seldon/app.yaml" filename="coordinator/coordinator.go:341"
time="2019-05-25T18:24:21+01:00" level=info msg="Ksonnet.Generate Name kubeflow-seldon AppDir /home/alejandro/Programming/kubernetes/seldon/seldon-core/examples/kubeflow/kubeflow-seldon Platform " filename="ksonnet/ksonnet.go:369"
time="2019-05-25T18:24:21+01:00" level=info msg="Creating environment \"default\" with namespace \"kubeflow\", pointing to \"version:v1.13.0\" cluster at address \"https://localhost:6445\"" filename="env/create.go:77"
time="2019-05-25T18:24:25+01:00" level=info msg="Generating ksonnet-lib data at path '/home/alejandro/Programming/kubernetes/seldon/seldon-

### Now let's run Seldon 
For this we'll need Helm to be running, so we'll initialise it.

In [14]:
%%bash
helm init 
kubectl rollout status deploy/tiller-deploy -n kube-system

$HELM_HOME has been configured at /home/alejandro/.helm.

Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
Happy Helming!
Waiting for deployment "tiller-deploy" rollout to finish: 0 of 1 updated replicas are available...
deployment "tiller-deploy" successfully rolled out


Once Tiller is running, we can run the installation command for Seldon.

As you can see, we are running the Seldon Operator in the Kubeflow namespace. 

In [15]:
!helm install seldon-core-operator --namespace kubeflow --repo https://storage.googleapis.com/seldon-charts

NAME:   yummy-donkey
LAST DEPLOYED: Sat May 25 18:27:34 2019
NAMESPACE: kubeflow
STATUS: DEPLOYED

RESOURCES:
==> v1/ClusterRole
NAME                          AGE
seldon-operator-manager-role  1s

==> v1/ClusterRoleBinding
NAME                                 AGE
seldon-operator-manager-rolebinding  1s

==> v1/Pod(related)
NAME                                  READY  STATUS             RESTARTS  AGE
seldon-operator-controller-manager-0  0/1    ContainerCreating  0         1s

==> v1/Secret
NAME                                   TYPE    DATA  AGE
seldon-operator-webhook-server-secret  Opaque  0     1s

==> v1/Service
NAME                                        TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)  AGE
seldon-operator-controller-manager-service  ClusterIP  10.105.250.206  <none>       443/TCP  1s

==> v1/StatefulSet
NAME                                READY  AGE
seldon-operator-controller-manager  0/1    1s

==> v1beta1/CustomResourceDefinition
NAME                            

Check that the Seldon operator pod is running:

In [18]:
!kubectl get pod -n kubeflow | grep seldon

seldon-operator-controller-manager-0                       1/1     Running   1          52s


# Train our NLP Pipeline with Kubeflow
We can access the Kubeflow dashboard to train our ML pipeline via http://localhost/_/pipeline-dashboard

If you can't access this, you need to make sure that the ambassador gateway service is accessible:

In [19]:
!kubectl get svc ambassador -n kubeflow

NAME         TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
ambassador   NodePort   10.99.81.144   <none>        80:31713/TCP   9m13s


In my case, I need to change the service type from `NodePort` to `LoadBalancer`, which can be done with the following command:

In [20]:
!kubectl patch svc ambassador --type='json' -p '[{"op":"replace","path":"/spec/type","value":"LoadBalancer"}]' -n kubeflow

service/ambassador patched
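
The `-p` argument in the command above is a JSON Patch document containing a single `replace` operation. As a rough illustration of what that operation does, the sketch below builds the same document in Python and applies it to a heavily simplified, made-up stand-in for the service object. The `build_type_patch` and `apply_replace` helpers are hypothetical and only handle this simple case; they are not part of `kubectl` or any Kubernetes client.

```python
import json


def build_type_patch(new_type):
    """Build a JSON Patch document that replaces the service type."""
    return [{"op": "replace", "path": "/spec/type", "value": new_type}]


def apply_replace(document, operation):
    """Apply one 'replace' operation to a nested dict, modifying it in place.

    Simplified on purpose: only dict keys are supported, not list indices.
    """
    keys = operation["path"].strip("/").split("/")
    target = document
    for key in keys[:-1]:
        target = target[key]
    if keys[-1] not in target:
        # 'replace' requires the target location to exist already
        raise KeyError("path does not exist: " + operation["path"])
    target[keys[-1]] = operation["value"]
    return document


# Made-up, minimal stand-in for the ambassador service object
service = {"metadata": {"name": "ambassador"}, "spec": {"type": "NodePort"}}

patch = build_type_patch("LoadBalancer")
for operation in patch:
    apply_replace(service, operation)

print(json.dumps(patch))
print(service["spec"]["type"])
```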


Now that I've changed it to a `LoadBalancer`, it has been allocated localhost as its external IP, so I can access it at http://localhost/_/pipeline-dashboard

In [21]:
!kubectl get svc ambassador -n kubeflow

NAME         TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
ambassador   LoadBalancer   10.99.81.144   localhost     80:31713/TCP   10m


If this was successful, you should see the following screen:
![](img/k-pipeline-dashboard.jpg)

# Upload our NLP pipeline
To upload our NLP pipeline we must first create it. This consists of three steps:

* x.1) Build all the docker images for each pipeline step
* x.2) Build the Kubeflow pipeline file to upload through the dashboard
* x.3) Run a new pipeline instance through the dashboard


## x.1 Build all docker images for each pipeline step

We will start by building each of the components in our ML pipeline. 

![](img/kubeflow-seldon-nlp-reusable-components.jpg)

## Let's first have a look at our clean_text step:

In [22]:
!ls pipeline/pipeline_steps/clean_text/

Transformer.py	__pycache__	pipeline_step.py
__init__.py	build_image.sh	requirements.txt


As with this step, all of the other steps can be found in the `pipeline/pipeline_steps/` folder, and all have the following structure:
* `pipeline_step.py` which exposes the functionality through a CLI 
* `Transformer.py` which transforms the data accordingly
* `requirements.txt` which states the python dependencies to run
* `build_image.sh` which uses `s2i` to build the image with one line

### Let's check out the CLI for clean_text

In [25]:
!python pipeline/pipeline_steps/clean_text/pipeline_step.py --help

Usage: pipeline_step.py [OPTIONS]

Options:
  --in-path TEXT
  --out-path TEXT
  --help           Show this message and exit.


This is actually a very simple file, as we are using the click library to define the commands:

In [27]:
!cat pipeline/pipeline_steps/clean_text/pipeline_step.py

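import click
import dill

try:
    # Relative import: used when the step is imported as a package (tests)
    from .Transformer import Transformer
except ImportError:
    # Absolute import: used when the script is run directly from the CLI
    from Transformer import Transformer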

@click.command()
@click.option('--in-path', default="/mnt/raw_text.data")
@click.option('--out-path', default="/mnt/clean_text.data")
def run_pipeline(in_path, out_path):
    clean_text_transformer = Transformer()
    with open(in_path, 'rb') as in_f:
        x = dill.load(in_f)
    y = clean_text_transformer.predict(x)
    with open(out_path, "wb") as out_f:
        dill.dump(y, out_f)

if __name__ == "__main__":
    run_pipeline()



If you want to understand how the pipeline steps pass data to each other through their CLIs, have a look at the end-to-end test in `pipeline/pipeline_tests/`:

In [29]:
!pytest ./pipeline/pipeline_tests/. --disable-pytest-warnings

platform linux -- Python 3.7.3, pytest-4.5.0, py-1.8.0, pluggy-0.11.0
rootdir: /home/alejandro/Programming/kubernetes/seldon/seldon-core/examples/kubeflow
collected 1 item

pipeline/pipeline_tests/test_pipeline.py .                               [100%]
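
The pattern used by `pipeline_step.py` is that each step loads a serialized object from its input path and writes its result to its output path, so steps can be chained by pointing one step's input at the previous step's output. The sketch below illustrates that hand-off under some assumptions: it uses the standard library's `pickle` as a stand-in for `dill`, a temporary directory instead of `/mnt`, and two made-up transformations (`lowercase` and `split_words`) instead of the real steps.

```python
import os
import pickle
import tempfile


def run_step(transform, in_path, out_path):
    """Load the input object, apply `transform`, and store the result."""
    with open(in_path, "rb") as in_f:
        x = pickle.load(in_f)
    y = transform(x)
    with open(out_path, "wb") as out_f:
        pickle.dump(y, out_f)


def lowercase(texts):
    # Made-up stand-in for a cleaning step
    return [text.lower() for text in texts]


def split_words(texts):
    # Made-up stand-in for a tokenizing step
    return [text.split() for text in texts]


with tempfile.TemporaryDirectory() as workdir:
    raw_path = os.path.join(workdir, "raw_text.data")
    clean_path = os.path.join(workdir, "clean_text.data")
    tokens_path = os.path.join(workdir, "tokens.data")

    # Seed the first input file, then chain two steps through the files
    with open(raw_path, "wb") as f:
        pickle.dump(["Hello World", "Kubeflow AND Seldon"], f)

    run_step(lowercase, raw_path, clean_path)
    run_step(split_words, clean_path, tokens_path)

    with open(tokens_path, "rb") as f:
        tokens = pickle.load(f)

print(tokens)
```

Because the steps only share files, each one can run in its own container as long as they all see the same storage.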



To build the image, each step provides a build script that contains the instructions:

In [30]:
!cat pipeline/pipeline_steps/clean_text/build_image.sh

#!/bin/bash

s2i build . seldonio/seldon-core-s2i-python3:0.6 clean_text_transformer:0.1



The only thing you need to ensure is that Seldon knows which model and file to wrap.

This is configured through the `.s2i/environment` file.

As you can see, here we just tell it to use the `Transformer` file:

In [31]:
!cat pipeline/pipeline_steps/clean_text/.s2i/environment

MODEL_NAME=Transformer
API_TYPE=REST
SERVICE_TYPE=MODEL
PERSISTENCE=0
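
To give an intuition for how a setting such as `MODEL_NAME=Transformer` can end up selecting a Python class, the sketch below parses the `KEY=VALUE` lines and then imports a module with that name and looks up a class of the same name. This is a simplified illustration under stated assumptions, not Seldon's actual loading code; `parse_environment` and `load_model_class` are hypothetical helpers, and the tiny `Transformer.py` written to a temporary directory is a placeholder.

```python
import importlib
import os
import sys
import tempfile


def parse_environment(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key] = value
    return settings


def load_model_class(model_name):
    """Import the module called `model_name` and return its same-named class."""
    module = importlib.import_module(model_name)
    return getattr(module, model_name)


env_text = "MODEL_NAME=Transformer\nAPI_TYPE=REST\nSERVICE_TYPE=MODEL\nPERSISTENCE=0\n"
settings = parse_environment(env_text)

with tempfile.TemporaryDirectory() as src_dir:
    # Placeholder module standing in for the step's real Transformer.py
    with open(os.path.join(src_dir, "Transformer.py"), "w") as f:
        f.write(
            "class Transformer:\n"
            "    def predict(self, X, feature_names=None):\n"
            "        return X\n"
        )

    sys.path.insert(0, src_dir)
    importlib.invalidate_caches()
    try:
        model_class = load_model_class(settings["MODEL_NAME"])
        model = model_class()
        result = model.predict(["some text"])
    finally:
        sys.path.remove(src_dir)

print(model_class.__name__, result)
```

Under this convention the file name and the class name must match the value of `MODEL_NAME`.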


That's it! Quite simple, right?

All that's left is to run `build_image.sh` for each of the reusable components.

Here we show the manual way to do it, but we recommend simply running `make build_pipeline_steps`.

In [34]:
%%bash
# each build script must be run from its own step directory
cd pipeline/pipeline_steps/clean_text/ && ./build_image.sh
cd ../data_downloader && ./build_image.sh
cd ../lr_text_classifier && ./build_image.sh
cd ../spacy_tokenize && ./build_image.sh
cd ../tfidf_vectorizer && ./build_image.sh

/home/alejandro/Programming/kubernetes/seldon/seldon-core/examples/kubeflow/pipeline/pipeline_steps/clean_text


---> Installing application source...
---> Installing dependencies ...
Looking in links: /whl
Collecting dill==0.2.9 (from -r requirements.txt (line 1))
Downloading https://files.pythonhosted.org/packages/fe/42/bfe2e0857bc284cbe6a011d93f2a9ad58a22cb894461b199ae72cfef0f29/dill-0.2.9.tar.gz (150kB)
Building wheels for collected packages: dill
Building wheel for dill (setup.py): started
Building wheel for dill (setup.py): finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/5b/d7/0f/e58eae695403de585269f4e4a94e0cd6ca60ec0c202936fa4a
Successfully built dill
Installing collected packages: dill
Successfully installed dill-0.2.9
You should consider upgrading via the 'pip install --upgrade pip' command.
Build completed successfully


# Train ML Pipeline via Kubeflow
Now that we've built our steps, we can actually train our ML pipeline, which looks as follows:
![](img/kubeflow-seldon-nlp-ml-pipelines-training.jpg)

To do this, we have to generate the workflow from our Kubeflow graph definition and then upload it through the UI:

In [36]:
%%bash
# Generating graph definition
python train_pipeline/nlp_pipeline.py
ls train_pipeline/

nlp_pipeline.py
nlp_pipeline.py.tar.gz
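
If you want to check what a generated archive contains before uploading it, the standard library's `tarfile` module is enough. The sketch below is self-contained, so it first writes a small stand-in archive; the member name `pipeline.yaml` and its contents are placeholders, not the real compiler output. To inspect the real file, call the hypothetical `list_archive_members` helper with `train_pipeline/nlp_pipeline.py.tar.gz` instead.

```python
import io
import os
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the names of the files stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def write_demo_archive(archive_path):
    """Write a stand-in archive with one placeholder file (not real output)."""
    payload = b"# placeholder workflow definition\n"
    info = tarfile.TarInfo(name="pipeline.yaml")
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo_pipeline.tar.gz")
    write_demo_archive(demo_path)
    members = list_archive_members(demo_path)

print(members)
```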


We now need to upload the generated `nlp_pipeline.py.tar.gz` file.

This can be done through the "Upload Pipeline" button in the UI at http://localhost/_/pipeline-dashboard

![](img/upload-pipeline.jpg)