# End-to-End Machine Learning Pipeline for Income Prediction

We use [demographic features from the 1994 US census](https://archive.ics.uci.edu/ml/datasets/census+income) to build an end-to-end machine learning pipeline. The pipeline is also annotated so it can be run as a [Kubeflow Pipeline](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/) using the [Kale](https://github.com/kubeflow-kale/kale) pipeline generator.

The notebook/pipeline stages are:

 1. Setup
   * Imports
   * Pipeline parameters
   * MinIO client test
 1. Train a simple sklearn model and push it to MinIO
 1. Prepare an Anchors explainer for the model and push it to MinIO
 1. Test the explainer
 1. Train an isolation forest outlier detector for the model and push it to MinIO
 1. Deploy a KFServing model and test it
 1. Deploy an outlier detector and test it



In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from alibi.explainers import AnchorTabular
from alibi.datasets import fetch_adult
from minio import Minio
from minio.error import ResponseError
from joblib import dump, load
import dill
import time
import json
from subprocess import run, Popen, PIPE
from alibi_detect.utils.data import create_outlier_batch

In [None]:
MINIO_HOST="minio-service.kubeflow:9000"
MINIO_ACCESS_KEY="minio"
MINIO_SECRET_KEY="minio123"
MINIO_MODEL_BUCKET="seldon"
INCOME_MODEL_PATH="sklearn/income/model"
EXPLAINER_MODEL_PATH="sklearn/income/explainer"
OUTLIER_MODEL_PATH="sklearn/income/outlier"
DEPLOY_NAMESPACE="admin"

In [None]:
def get_minio():
    return Minio(MINIO_HOST,
                    access_key=MINIO_ACCESS_KEY,
                    secret_key=MINIO_SECRET_KEY,
                    secure=False)

In [None]:
minioClient = get_minio()
buckets = minioClient.list_buckets()
for bucket in buckets:
    print(bucket.name, bucket.creation_date)

In [None]:
if not minioClient.bucket_exists(MINIO_MODEL_BUCKET):
    minioClient.make_bucket(MINIO_MODEL_BUCKET)

## Train Model

In [None]:
adult = fetch_adult()
adult.keys()

In [None]:
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map = adult.category_map

Note that for your own datasets you can use our utility function [gen_category_map](../api/alibi.utils.data.rst) to create the category map:

In [None]:
from alibi.utils.data import gen_category_map
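
To make the structure concrete: a category map is a dictionary keyed by column index, whose values list the category names for that column; a name's position in the list is the integer code used in the data. The sketch below builds one by hand from a toy table. The rows and the `build_category_map` helper are hypothetical illustrations, not alibi's implementation.

```python
def build_category_map(rows, categorical_columns):
    """Return (encoded_rows, category_map) for a list of equal-length rows."""
    category_map = {}
    for col in categorical_columns:
        # Sorted unique values; a value's position in this list is its code.
        category_map[col] = sorted({row[col] for row in rows})
    encoded_rows = []
    for row in rows:
        encoded = list(row)
        for col in categorical_columns:
            encoded[col] = category_map[col].index(row[col])
        encoded_rows.append(encoded)
    return encoded_rows, category_map


# Hypothetical toy table: age, workclass, marital status.
rows = [
    [39, "Private", "Married"],
    [50, "State-gov", "Never-Married"],
    [28, "Private", "Never-Married"],
]
encoded_rows, toy_category_map = build_category_map(rows, categorical_columns=[1, 2])
print(toy_category_map)
print(encoded_rows)
```

The numeric column is left untouched, while each categorical column is replaced by integer codes that index into the corresponding list in the map.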

Shuffle the data, then split it into training and test sets:

In [None]:
np.random.seed(0)
data_perm = np.random.permutation(np.c_[data, target])
data = data_perm[:,:-1]
target = data_perm[:,-1]

In [None]:
idx = 30000
X_train, Y_train = data[:idx, :], target[:idx]
# Start the test set at idx so that no row is dropped between the two splits.
X_test, Y_test = data[idx:, :], target[idx:]

### Create feature transformation pipeline
Create a feature pre-processor. It needs to have `fit` and `transform` methods. Different types of pre-processing can be applied to all or part of the features. In the example below we standardize the ordinal features and one-hot encode the categorical features.

Ordinal features:

In [None]:
ordinal_features = [x for x in range(len(feature_names)) if x not in list(category_map.keys())]
ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

Categorical features:

In [None]:
categorical_features = list(category_map.keys())
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

Combine the two transformers. The pre-processor is fitted later, as the first step of the model pipeline:

In [None]:
preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
                                               ('cat', categorical_transformer, categorical_features)])
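
As a rough picture of what the two transformations do to individual columns, here is a standard-library sketch of standardization and one-hot encoding. The column values and the helper names `standardize` and `one_hot` are made up for illustration; the sklearn transformers above remain the ones used in the pipeline.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(codes, n_categories):
    """Turn integer category codes into rows of 0/1 indicators."""
    return [[1 if code == k else 0 for k in range(n_categories)] for code in codes]


ages = [20, 30, 40]    # hypothetical ordinal column
workclass = [0, 2, 1]  # hypothetical integer-coded categorical column

scaled_ages = standardize(ages)
encoded_workclass = one_hot(workclass, n_categories=3)
print(scaled_ages)
print(encoded_workclass)
```

Each categorical column expands into one indicator column per category, which is why the transformed matrix is wider than the raw one.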

### Train Random Forest model

Fit a random forest on the pre-processed (imputed, standardized and one-hot encoded) data.

In [None]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)

In [None]:
model=Pipeline(steps=[("preprocess",preprocessor),("model",clf)])
model.fit(X_train,Y_train)

Define predict function

In [None]:
def predict_fn(x):
    return model.predict(x)

In [None]:
#predict_fn = lambda x: clf.predict(preprocessor.transform(x))
print('Train accuracy: ', accuracy_score(Y_train, predict_fn(X_train)))
print('Test accuracy: ', accuracy_score(Y_test, predict_fn(X_test)))

In [None]:
dump(model, 'model.joblib') 

In [None]:
print(get_minio().fput_object(MINIO_MODEL_BUCKET, f"{INCOME_MODEL_PATH}/model.joblib", 'model.joblib'))

## Train Explainer

In [None]:
model.predict(X_train)
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)

Discretize the ordinal features into quartiles

In [None]:
explainer.fit(X_train, disc_perc=[25, 50, 75])
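
As a rough illustration of what quartile discretization means, the sketch below bins a hypothetical numeric column at its 25th, 50th and 75th percentiles using only the standard library. This is not alibi's internal code; the `ages` values and the `to_bin` helper are invented for the example, and the handling of values that sit exactly on a cut point is an assumption of this sketch.

```python
from bisect import bisect_left
from statistics import quantiles

ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]  # hypothetical values

# Three cut points that split the data into four bins.
cuts = quantiles(ages, n=4, method="inclusive")


def to_bin(value):
    """Return the bin index (0 to 3); a value equal to a cut goes to the lower bin."""
    return bisect_left(cuts, value)


bins = [to_bin(a) for a in ages]
print(cuts)
print(bins)
```

After discretization an anchor can refer to a range of a numeric feature rather than to a single exact value.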

In [None]:
# The with-statement closes the file on exit, so no explicit close() is needed.
with open("explainer.dill", "wb") as dill_file:
    dill.dump(explainer, dill_file)
print(get_minio().fput_object(MINIO_MODEL_BUCKET, f"{EXPLAINER_MODEL_PATH}/explainer.dill", 'explainer.dill'))

## Get Explanation

Below, we get an anchor for the prediction of the first observation in the test set. An anchor is a sufficient condition - that is, when the anchor holds, the prediction should be the same as the prediction for this instance.

In [None]:
model.predict(X_train)
idx = 0
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

We set the precision threshold to 0.95. This means that predictions on observations where the anchor holds will be the same as the prediction on the explained instance at least 95% of the time.

In [None]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])
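
To make the two reported numbers concrete, the sketch below computes them on a small fixed sample: coverage is the share of rows on which the anchor holds, and precision is the share of those rows whose prediction matches the explained instance's prediction. The rows, predictions and anchor rule are all hypothetical, and alibi estimates these quantities with its own sampling procedure rather than with this direct count.

```python
def anchor_precision_coverage(rows, predictions, anchor, target_prediction):
    """Return (precision, coverage) of an anchor on a fixed sample."""
    covered = [pred for row, pred in zip(rows, predictions) if anchor(row)]
    coverage = len(covered) / len(rows)
    if not covered:
        return 0.0, coverage
    precision = sum(pred == target_prediction for pred in covered) / len(covered)
    return precision, coverage


def anchor_holds(row):
    """Hypothetical anchor: age > 45 AND hours per week > 40."""
    age, hours = row
    return age > 45 and hours > 40


# Hypothetical (age, hours per week) rows with hypothetical model predictions.
rows = [(25, 40), (52, 60), (47, 55), (33, 20), (58, 50), (61, 45)]
predictions = [0, 1, 1, 0, 1, 0]

precision, coverage = anchor_precision_coverage(
    rows, predictions, anchor_holds, target_prediction=1
)
print(f"Precision: {precision:.2f}")
print(f"Coverage: {coverage:.2f}")
```

A more specific anchor usually raises precision at the cost of coverage, which is the trade-off the precision threshold controls.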

## Train Outlier Detector

In [None]:
from alibi_detect.od import IForest

od = IForest(
    threshold=0.,
    n_estimators=200,
)


In [None]:
od.fit(X_train)

In [None]:
np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(X_train, Y_train, n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float'), threshold_batch.target
#X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))

In [None]:
od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))
threshold = od.threshold
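
The idea behind inferring the threshold is to pick the score below which roughly `threshold_perc` percent of the batch falls, so that about `perc_outlier` percent of instances score above it. A minimal sketch of that idea follows, using hypothetical scores and the standard library instead of alibi-detect's own code; the `percentile_threshold` helper is invented for this example.

```python
from statistics import quantiles


def percentile_threshold(scores, threshold_perc):
    """Score below which roughly threshold_perc percent of the scores fall.

    In this sketch threshold_perc must be an integer between 1 and 99.
    """
    cuts = quantiles(scores, n=100, method="inclusive")  # 99 cut points
    return cuts[threshold_perc - 1]


# Hypothetical outlier scores: 95 ordinary instances and 5 extreme ones.
scores = [i / 100 for i in range(95)] + [5.0, 6.0, 7.0, 8.0, 9.0]

toy_threshold = percentile_threshold(scores, threshold_perc=95)
flagged = [s for s in scores if s > toy_threshold]
print(toy_threshold, len(flagged))
```

With 5% of the batch made up of outliers, the 95th percentile lands between the ordinary and the extreme scores, so only the extreme ones are flagged.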

In [None]:
X_outlier = [[300,  4,  4,  2,  1,  4,  4,  0,  0,  0, 600,  9]]

In [None]:
od.predict(
    X_outlier
)

In [None]:
from alibi_detect.utils.saving import save_detector, load_detector
from os import listdir
from os.path import isfile, join

filepath="ifoutlier"
save_detector(od, filepath) 
onlyfiles = [f for f in listdir(filepath) if isfile(join(filepath, f))]
for filename in onlyfiles:
    print(filename)
    print(get_minio().fput_object(MINIO_MODEL_BUCKET, f"{OUTLIER_MODEL_PATH}/{filename}", join(filepath, filename)))

## Deploy KFServing Model

In [None]:
secret=f"""apiVersion: v1
kind: Secret
metadata:
  name: income-kf-secret
  namespace: {DEPLOY_NAMESPACE}
  annotations:
     serving.kubeflow.org/s3-endpoint: {MINIO_HOST} # replace with your s3 endpoint
     serving.kubeflow.org/s3-usehttps: "0" # by default 1, for testing with minio you need to set to 0
type: Opaque
stringData:
  awsAccessKeyID: {MINIO_ACCESS_KEY}
  awsSecretAccessKey: {MINIO_SECRET_KEY}
"""
with open("secret.yaml","w") as f:
    f.write(secret)
run("kubectl apply -f secret.yaml", shell=True)

In [None]:
secret = f"""apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
  namespace: {DEPLOY_NAMESPACE}
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: {MINIO_ACCESS_KEY}
  AWS_SECRET_ACCESS_KEY: {MINIO_SECRET_KEY}
  AWS_ENDPOINT_URL: http://{MINIO_HOST}
  USE_SSL: "false"
"""
with open("secret.yaml","w") as f:
    f.write(secret)
run("cat secret.yaml | kubectl apply -f -", shell=True)

In [None]:
sa = f"""apiVersion: v1
kind: ServiceAccount
metadata:
  name: minio-kf-sa
  namespace: {DEPLOY_NAMESPACE}
secrets:
  - name: income-kf-secret
"""
with open("sa.yaml","w") as f:
    f.write(sa)
run("kubectl apply -f sa.yaml", shell=True)

In [None]:
from kubernetes import client
from kfserving import KFServingClient
from kfserving import constants
from kfserving import utils
from kfserving import V1alpha2EndpointSpec
from kfserving import V1alpha2PredictorSpec
from kfserving import V1alpha2ExplainerSpec
from kfserving import V1alpha2AlibiExplainerSpec
from kfserving import V1alpha2SKLearnSpec
from kfserving import V1alpha2InferenceServiceSpec
from kfserving import V1alpha2InferenceService
from kfserving import V1alpha2Logger
from kubernetes.client import V1ResourceRequirements

api_version = constants.KFSERVING_GROUP + '/' + constants.KFSERVING_VERSION
default_endpoint_spec = V1alpha2EndpointSpec(
                          predictor=V1alpha2PredictorSpec(
                            service_account_name='minio-kf-sa',
                            sklearn=V1alpha2SKLearnSpec(
                              storage_uri='s3://'+MINIO_MODEL_BUCKET+'/'+ INCOME_MODEL_PATH,
                              resources=V1ResourceRequirements(
                                  requests={'cpu':'100m','memory':'1Gi'},
                                  limits={'cpu':'100m', 'memory':'1Gi'})),
                            logger=V1alpha2Logger(
                                mode='all'
                            )),
                            explainer=V1alpha2ExplainerSpec(
                              service_account_name='minio-kf-sa',
                            alibi=V1alpha2AlibiExplainerSpec(
                              type='AnchorTabular',
                              storage_uri='s3://'+MINIO_MODEL_BUCKET+'/'+ EXPLAINER_MODEL_PATH,
                              resources=V1ResourceRequirements(
                                  requests={'cpu':'100m','memory':'1Gi'},
                                  limits={'cpu':'100m', 'memory':'1Gi'}))))
    
isvc = V1alpha2InferenceService(api_version=api_version,
                          kind=constants.KFSERVING_KIND,
                          metadata=client.V1ObjectMeta(
                              name='kf-income', namespace=DEPLOY_NAMESPACE),
                          spec=V1alpha2InferenceServiceSpec(default=default_endpoint_spec))

In [None]:
KFServing = KFServingClient()
KFServing.create(isvc)

In [None]:
KFServing.get('kf-income', namespace=DEPLOY_NAMESPACE, watch=True, timeout_seconds=120)

## Test Model and Explainer

In [None]:
payload='{"instances": [[53,4,0,2,8,4,4,0,0,0,60,9]]}'
cmd=f"""curl -v -d '{payload}' \
   -H "Host: kf-income.admin.example.com" \
   -H "Content-Type: application/json" \
   http://kfserving-ingressgateway.istio-system/v1/models/kf-income:predict
"""
ret = Popen(cmd, shell=True,stdout=PIPE)
raw = ret.stdout.read().decode("utf-8")
print(raw)
res=json.loads(raw)
arr=np.array(res["predictions"])
if arr[0] > 0:
    print("Prediction: High Income")
else:
    print("Prediction: Low Income")

Make an explanation request

In [None]:
payload='{"instances": [[53,4,0,2,8,4,4,0,0,0,60,9]]}'
cmd=f"""curl -v -d '{payload}' \
   -H "Host: kf-income.admin.example.com" \
   -H "Content-Type: application/json" \
   http://kfserving-ingressgateway.istio-system/v1/models/kf-income:explain
"""
ret = Popen(cmd, shell=True,stdout=PIPE)
raw = ret.stdout.read().decode("utf-8")
res=json.loads(raw)
print(res["names"])
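
The same requests can also be built without shelling out to curl. The sketch below only constructs the request objects with the standard library and does not send them. The `build_request` helper is made up for this example, while the host header and gateway URL are copied from the curl commands above. Sending a request would be a matter of passing it to `urllib.request.urlopen`, which assumes the gateway is reachable from wherever the notebook runs.

```python
import json
from urllib.request import Request

GATEWAY = "http://kfserving-ingressgateway.istio-system"
HOST_HEADER = "kf-income.admin.example.com"


def build_request(verb, instances, model_name="kf-income"):
    """Build (but do not send) a POST request for the predict or explain endpoint."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        url=f"{GATEWAY}/v1/models/{model_name}:{verb}",
        data=body,
        headers={"Host": HOST_HEADER, "Content-Type": "application/json"},
        method="POST",
    )


instance = [[53, 4, 0, 2, 8, 4, 4, 0, 0, 0, 60, 9]]
predict_req = build_request("predict", instance)
explain_req = build_request("explain", instance)
print(predict_req.full_url)
print(explain_req.full_url)
```

Building the body with `json.dumps` avoids hand-quoting the payload inside a shell command string.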

## Deploy Outlier Detector

In [None]:
outlier_yaml=f"""apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: income-outlier
  namespace: {DEPLOY_NAMESPACE}
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"
    spec:
      containers:
      - image: seldonio/alibi-detect-server:1.2.2-dev_alibidetect
        imagePullPolicy: IfNotPresent
        args:
        - --model_name
        - adultod
        - --http_port
        - '8080'
        - --protocol
        - tensorflow.http
        - --storage_uri
        - s3://{MINIO_MODEL_BUCKET}/{OUTLIER_MODEL_PATH}
        - --reply_url
        - http://default-broker       
        - --event_type
        - org.kubeflow.serving.inference.outlier
        - --event_source
        - org.kubeflow.serving.incomeod
        - OutlierDetector
        envFrom:
        - secretRef:
            name: seldon-init-container-secret
"""
with open("outlier.yaml","w") as f:
    f.write(outlier_yaml)
run("kubectl apply -f outlier.yaml", shell=True)

In [None]:
trigger_outlier_yaml=f"""apiVersion: eventing.knative.dev/v1alpha1
kind: Trigger
metadata:
  name: income-outlier-trigger
  namespace: {DEPLOY_NAMESPACE}
spec:
  filter:
    sourceAndType:
      type: org.kubeflow.serving.inference.request
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1alpha1
      kind: Service
      name: income-outlier
"""
with open("outlier_trigger.yaml","w") as f:
    f.write(trigger_outlier_yaml)
run("kubectl apply -f outlier_trigger.yaml", shell=True)

In [None]:
run(f"kubectl rollout status -n {DEPLOY_NAMESPACE} deploy/$(kubectl get deploy -l serving.knative.dev/service=income-outlier -o jsonpath='{{.items[0].metadata.name}}' -n {DEPLOY_NAMESPACE})", shell=True)

## Deploy Knative Eventing Event Display

In [None]:
event_display=f"""apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-display
  namespace: {DEPLOY_NAMESPACE}          
spec:
  replicas: 1
  selector:
    matchLabels: &labels
      app: event-display
  template:
    metadata:
      labels: *labels
    spec:
      containers:
        - name: helloworld-go
          # Source code: https://github.com/knative/eventing-contrib/tree/master/cmd/event_display
          image: gcr.io/knative-releases/knative.dev/eventing-contrib/cmd/event_display@sha256:f4628e97a836c77ed38bd3b6fd3d0b06de4d5e7db6704772fe674d48b20bd477
---
kind: Service
apiVersion: v1
metadata:
  name: event-display
  namespace: {DEPLOY_NAMESPACE}
spec:
  selector:
    app: event-display
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: eventing.knative.dev/v1alpha1
kind: Trigger
metadata:
  name: income-outlier-display
  namespace: {DEPLOY_NAMESPACE}
spec:
  broker: default
  filter:
    attributes:
      type: org.kubeflow.serving.inference.outlier
  subscriber:
    ref:
      apiVersion: v1
      kind: Service
      name: event-display
"""
with open("event_display.yaml","w") as f:
    f.write(event_display)
run("kubectl apply -f event_display.yaml", shell=True)

In [None]:
run(f"kubectl rollout status -n {DEPLOY_NAMESPACE} deploy/event-display", shell=True)

## Test Outlier Detection

In [None]:
def predict():
    payload='{"instances": [[300,  4,  4,  2,  1,  4,  4,  0,  0,  0, 600,  9]]}'
    cmd=f"""curl -v -d '{payload}' \
       -H "Host: kf-income.admin.example.com" \
       -H "Content-Type: application/json" \
       http://kfserving-ingressgateway.istio-system/v1/models/kf-income:predict
    """
    ret = Popen(cmd, shell=True,stdout=PIPE)
    raw = ret.stdout.read().decode("utf-8")
    print(raw)

In [None]:
def get_outlier_event_display_logs():
    cmd=f"kubectl logs $(kubectl get pod -l app=event-display -o jsonpath='{{.items[0].metadata.name}}' -n {DEPLOY_NAMESPACE}) -n {DEPLOY_NAMESPACE}"
    ret = Popen(cmd, shell=True,stdout=PIPE)
    res = ret.stdout.read().decode("utf-8").split("\n")
    data= []
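    # Stop one line early so that res[i+1] always exists.
    for i in range(len(res) - 1):
        if res[i] == 'Data,':
            # The payload line is a JSON-encoded string that itself contains JSON.
            j = json.loads(json.loads(res[i+1]))
            if "is_outlier" in j["data"]:
                data.append(j)
    # Return the most recent outlier event, or None if there is none yet.
    return data[-1] if data else None
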
j = None
while j is None:
    predict()
    print("Waiting for outlier logs, sleeping")
    time.sleep(2)
    j = get_outlier_event_display_logs()
    
print(j)
print("Outlier",j["data"]["is_outlier"]==[1])

## Clean Up Resources

In [None]:
run(f"kubectl delete inferenceservice kf-income -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete ksvc income-outlier -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete sa  minio-kf-sa -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete secret seldon-init-container-secret -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete secret income-kf-secret -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete deployment event-display -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete svc event-display -n {DEPLOY_NAMESPACE}", shell=True)