# Building a Pipeline with an External Dataset

## Introduction
[Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/) helps with building entire workflows out of individual steps.

These steps can be triggered automatically by a CI/CD workflow or on demand from a command line or notebook.


A **component** performs a single step in a machine learning workflow (e.g. data ingestion, data preprocessing, data transformation, model training, hyperparameter tuning).

<ADD Diagram here>


## Prerequisites
Check whether `kfp` is installed:

In [1]:
! pip3 show kfp

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Name: kfp
Version: 1.3.0
Summary: KubeFlow Pipelines SDK
Home-page: UNKNOWN
Author: google
Author-email: None
License: UNKNOWN
Location: /home/jovyan/.local/lib/python3.6/site-packages
Requires: Deprecated, PyYAML, kfp-server-api, google-auth, kubernetes, cloudpickle, docstring-parser, tabulate, jsonschema, strip-hints, google-cloud-storage, click, requests-toolbelt, kfp-pipeline-spec
Required-by: kfp-notebook


## Configure Access to MinIO (External Data Source)

In [2]:
MINIO_ACCESS_KEY_ID='minio'
MINIO_SECRET_ACCESS_KEY='minio123'


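# Uncomment to print the base64-encoded credentials; the braces make IPython substitute the Python variables
# !echo -n {MINIO_ACCESS_KEY_ID} | base64
# !echo -n {MINIO_SECRET_ACCESS_KEY} | base64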
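
The commented-out commands above base64-encode the credentials, presumably because that is the form Kubernetes Secrets expect for their `data` values. As a minimal sketch, the same encoding can be done in Python with the standard library alone. The helper `b64` and the names `encoded_key` and `encoded_secret` are illustrative and not part of the original notebook.

```python
import base64

MINIO_ACCESS_KEY_ID = "minio"
MINIO_SECRET_ACCESS_KEY = "minio123"


def b64(value: str) -> str:
    """Base64-encode a string the way `echo -n VALUE | base64` does."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


encoded_key = b64(MINIO_ACCESS_KEY_ID)
encoded_secret = b64(MINIO_SECRET_ACCESS_KEY)
print(encoded_key, encoded_secret)
```

Keep in mind that base64 is an encoding, not encryption: anyone who can read the encoded value can decode it.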

### Upload our Training Dataset to Minio

First, we configure credentials for `mc`, the MinIO command line client.
We then use it to create a bucket, upload the dataset to it, and set access policy so that the pipeline can download it from MinIO.

Follow the steps below to download the MinIO client:
<div class="alert">
   <code>
    wget https://dl.min.io/client/mc/release/linux-amd64/mc
    chmod +x mc
    ./mc --help
    </code>

</div>




In [3]:
! wget https://dl.min.io/client/mc/release/linux-amd64/mc
! chmod +x mc
! ./mc --help

--2021-03-19 04:05:04--  https://dl.min.io/client/mc/release/linux-amd64/mc
Resolving dl.min.io (dl.min.io)... 178.128.69.202
Connecting to dl.min.io (dl.min.io)|178.128.69.202|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20287488 (19M) [application/octet-stream]
Saving to: ‘mc.1’


2021-03-19 04:05:33 (695 KB/s) - ‘mc.1’ saved [20287488/20287488]

NAME:
  mc - MinIO Client for cloud storage and filesystems.

USAGE:
  mc [FLAGS] COMMAND [COMMAND FLAGS | -h] [ARGUMENTS...]

COMMANDS:
  alias      set, remove and list aliases in configuration file
  ls         list buckets and objects
  mb         make a bucket
  rb         remove a bucket
  cp         copy objects
  mirror     synchronize object(s) to a remote site
  cat        display object contents
  head       display first 'n' lines of an object
  pipe       stream STDIN to an object
  share      generate URL for temporary access to an object
  find       search for objects
  sql        run sql queries 

In [4]:
! ./mc alias set minio http://minio-service.kubeflow:9000 minio minio123

[m[32mAdded `minio` successfully.[0m
[0m

In [5]:
! ./mc ls minio mlpipeline

[m[32m[2021-03-19 03:56:28 UTC][0m[33m     0B[0m[36;1m mlpipeline/[0m
[0m

In [6]:
! tar --dereference -czf datasets.tar.gz ./datasets
! ./mc cp datasets.tar.gz minio/mlpipeline/datasets.tar.gz
! ./mc policy set download minio/mlpipeline

...ts.tar.gz:  10.96 MiB / 10.96 MiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 104.96 MiB/s 0s
Access permission for `minio/mlpipeline` is set to `download`

## How to Implement Kubeflow Pipelines Components

In this pipeline, we have the following components:
- Download the Titanic dataset from MinIO
- Preprocess the dataset
- Engineer the features
- Train five different models
- Collect and evaluate the accuracy of the trained models
- Export the model files

In [7]:
from typing import NamedTuple
import kfp
import kfp.components as components
import kfp.dsl as dsl
from kfp.components import InputPath, OutputPath #helps define the input & output between the components

### Component 1: Download the Titanic Dataset

In [8]:
def download_dataset(data_dir: OutputPath(str)):
    """Download the Titanic data set from MinIO to the KFP volume to share it among all steps"""
    import urllib.request
    import tarfile
    import os
    import subprocess

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    url = "http://minio-service.kubeflow:9000/mlpipeline/datasets.tar.gz"
    stream = urllib.request.urlopen(url)
    tar = tarfile.open(fileobj=stream, mode="r|gz")
    tar.extractall(path=data_dir)
    
    subprocess.call(["ls", "-lha", data_dir])
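
The component relies on `tarfile`'s stream mode (`"r|gz"`), which reads the archive front to back and therefore also works on objects that cannot seek, such as the response returned by `urlopen`. The following sketch shows the same extraction on a small archive built in memory. The file content and the temporary directory are stand-ins chosen for illustration, not the real dataset.

```python
import io
import os
import tarfile
import tempfile

# Build a small gzip-compressed tar archive in memory (stand-in for datasets.tar.gz)
payload = b"PassengerId,Survived\n1,0\n"
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="datasets/train.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive.seek(0)

# "r|gz" treats the file object as a forward-only stream of gzip-compressed tar data
data_dir = tempfile.mkdtemp()
with tarfile.open(fileobj=archive, mode="r|gz") as tar:
    tar.extractall(path=data_dir)

extracted = os.path.join(data_dir, "datasets", "train.csv")
print(os.path.exists(extracted))
```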

### Component 2: Preprocess the Titanic Dataset

In [None]:
def preprocess_dataset(data_dir: InputPath(str), preprocessed_data_dir: OutputPath(str)):
    
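    import os
    import re

    import numpy as np
    import pandas as pd

    # KFP only hands over the output path; create the directory before writing to it
    os.makedirs(preprocessed_data_dir, exist_ok=True)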
    
    # data_dir is a plain string, so build the paths with f-strings
    train_df = pd.read_csv(f'{data_dir}/datasets/train.csv')
    test_df = pd.read_csv(f'{data_dir}/datasets/test.csv')
    
    data = [train_df, test_df]
    for dataset in data:
        dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
        dataset.loc[dataset['relatives'] > 0, 'not_alone'] = 0
        dataset.loc[dataset['relatives'] == 0, 'not_alone'] = 1
        dataset['not_alone'] = dataset['not_alone'].astype(int)
        
    # PassengerId does not contribute to a person's survival probability
    train_df = train_df.drop(['PassengerId'], axis=1)
   
    #dealing with missing data in cabin feature
    deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
    data = [train_df, test_df]

    for dataset in data:
        dataset['Cabin'] = dataset['Cabin'].fillna("U0")
        dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
        dataset['Deck'] = dataset['Deck'].map(deck)
        dataset['Deck'] = dataset['Deck'].fillna(0)
        dataset['Deck'] = dataset['Deck'].astype(int)
    # we can now drop the cabin feature
    train_df = train_df.drop(['Cabin'], axis=1)
    test_df = test_df.drop(['Cabin'], axis=1)
    
    #dealing with missing data in age feature
    data = [train_df, test_df]
    
    for dataset in data:
        # use the training-set statistics for both data frames
        mean = train_df["Age"].mean()
        std = train_df["Age"].std()
        is_null = dataset["Age"].isnull().sum()
        # draw one random integer age from [mean - std, mean + std) per missing value
        rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)
        # fill NaN values in the Age column with the random values generated
        age_slice = dataset["Age"].copy()
        age_slice[np.isnan(age_slice)] = rand_age
        dataset["Age"] = age_slice
        # convert the frame's own ages (not those of train_df) to integers
        dataset["Age"] = dataset["Age"].astype(int)

    #dealing with missing data in the Embarked feature
    # fill with most common value
    common_value = 'S'
    data = [train_df, test_df]

    for dataset in data:
        dataset['Embarked'] = dataset['Embarked'].fillna(common_value)
    
    train_df.to_pickle(f'{preprocessed_data_dir}/train.pkl')
    test_df.to_pickle(f'{preprocessed_data_dir}/test.pkl')
    
    print('Done!')
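
The `Deck` feature is the least obvious step of the preprocessing: a missing cabin becomes `"U0"`, the first run of letters is taken as the deck, and any letter that is not in the `deck` mapping ends up as `0`. A minimal sketch of that logic without pandas follows. The sample cabin values and the helper `deck_code` are illustrative assumptions, not part of the notebook.

```python
import re

deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}

# Sample values in the style of the Titanic `Cabin` column (None = missing)
cabins = ["C85", None, "E46", "B57 B59", "T"]


def deck_code(cabin):
    cabin = cabin if cabin is not None else "U0"        # missing cabin -> "U0"
    letter = re.search(r"([a-zA-Z]+)", cabin).group()   # first run of letters
    return deck.get(letter, 0)                          # unknown deck letter -> 0


codes = [deck_code(c) for c in cabins]
print(codes)
```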

### Component 3: Feature Engineering for the Titanic DataSet

In [None]:
def feateng_dataset(preprocessed_data_dir: InputPath(str), feature_dir: OutputPath(str)):
        
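    import os
    import pickle

    import pandas as pd

    # KFP only hands over the output path; create the directory before writing to it
    os.makedirs(feature_dir, exist_ok=True)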
    
    #loading the preprocessed data
    train_df = pd.read_pickle(f'{preprocessed_data_dir}/train.pkl')
    test_df = pd.read_pickle(f'{preprocessed_data_dir}/test.pkl')
    
    
    data = [train_df, test_df]
    for dataset in data:
        dataset['Fare'] = dataset['Fare'].fillna(0)
        dataset['Fare'] = dataset['Fare'].astype(int)
        
    #title features
    data = [train_df, test_df]
    titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

    for dataset in data:
        # extract titles
        dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
        # replace titles with a more common title or as Rare
        dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                                'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
        dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
        dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
        # convert titles into numbers
        dataset['Title'] = dataset['Title'].map(titles)
        # fill NaN with 0, to be safe
        dataset['Title'] = dataset['Title'].fillna(0)
    train_df = train_df.drop(['Name'], axis=1)
    test_df = test_df.drop(['Name'], axis=1)
    
    #mapping sex feature into numeric
    genders = {"male": 0, "female": 1}
    data = [train_df, test_df]

    for dataset in data:
        dataset['Sex'] = dataset['Sex'].map(genders)
    
    #dropping ticket feature
    train_df = train_df.drop(['Ticket'], axis=1)
    test_df = test_df.drop(['Ticket'], axis=1)
    
    #mapping embarked into numeric
    ports = {"S": 0, "C": 1, "Q": 2}
    data = [train_df, test_df]

    for dataset in data:
        dataset['Embarked'] = dataset['Embarked'].map(ports)
      
    #grouping age into categories
    data = [train_df, test_df]
    for dataset in data:
        dataset['Age'] = dataset['Age'].astype(int)
        dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
        dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
        dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
        dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
        dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
        dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
        dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
        dataset.loc[ dataset['Age'] > 66, 'Age'] = 6
     
    #grouping fare into categories
    data = [train_df, test_df]

    for dataset in data:
        dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
        dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
        dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
        dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare']   = 3
        dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare']   = 4
        dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
        dataset['Fare'] = dataset['Fare'].astype(int)
        
    #adding new feature
    #age times class
    data = [train_df, test_df]
    for dataset in data:
        dataset['Age_Class']= dataset['Age']* dataset['Pclass']
    #fare per head
    for dataset in data:
        dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['relatives']+1)
        dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)
    
    X_train = train_df.drop("Survived",axis=1)
    Y_train = train_df["Survived"]
    X_test  = test_df.drop("PassengerId",axis=1)
    X_test  = X_test.copy()
    
    #Save the train_data as a pickle file to be used by the train component.
    with open(f'{feature_dir}/train', 'wb') as f:
        pickle.dump((X_train,  Y_train), f)
    
    #Save the test_feature as a pickle file to be used.
    with open(f'{feature_dir}/test', 'wb') as f:
        pickle.dump(X_test, f)
        
    print('Done!')
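
The chain of `.loc` assignments that groups `Fare` into categories amounts to a lookup against sorted bucket edges. As a rough sketch of that idea, and not a replacement for the pandas code, the same mapping can be written with the standard library's `bisect`. The names `FARE_EDGES` and `fare_bucket` are made up for this illustration.

```python
from bisect import bisect_left

# Upper edges of the fare buckets used in the component above
FARE_EDGES = [7.91, 14.454, 31, 99, 250]


def fare_bucket(fare):
    """Return 0..5; bucket i holds fares in (FARE_EDGES[i-1], FARE_EDGES[i]]."""
    return bisect_left(FARE_EDGES, fare)


buckets = [fare_bucket(f) for f in [7, 8, 14, 31, 32, 100, 300]]
print(buckets)
```

`bisect_left` places a value that equals an edge into the lower bucket, which matches the `<=` comparisons in the component.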

### Component 4: Train the Models
Each of the following functions trains one classifier on the engineered features and stores its accuracy on the training set as a pickle file for the later steps.

In [None]:
def random_forest(feature_dir: InputPath(str), models_dir: OutputPath(str)):
    
    import os
    import pickle
    from sklearn.ensemble import RandomForestClassifier

    # create the output directory before writing to it
    os.makedirs(models_dir, exist_ok=True)
    
    #loading the train data
    with open(f'{feature_dir}/train', 'rb') as f:
        train_data = pickle.load(f)
        
    # Separate the X_train from y_train.
    X_train, Y_train = train_data
    
    #loading the test data
    with open(f'{feature_dir}/test', 'rb') as f:
        X_test = pickle.load(f)
    
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, Y_train)
    y_pred = random_forest.predict(X_test)
    acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
    
    #Save the accuracy as a pickle file to be used 
    with open(f'{models_dir}/random_forest', 'wb') as f:
        pickle.dump(acc_random_forest, f)
    
    print('Done!')
    

In [None]:
def logistic_reg(feature_dir: InputPath(str), models_dir: OutputPath(str)):
    
    import os
    import pickle
    from sklearn.linear_model import LogisticRegression

    # create the output directory before writing to it
    os.makedirs(models_dir, exist_ok=True)

    #loading the train data
    with open(f'{feature_dir}/train', 'rb') as f:
        train_data = pickle.load(f)
    # Separate the X_train from y_train.
    X_train, Y_train = train_data
    
    logreg = LogisticRegression(solver='lbfgs', max_iter=110)
    logreg.fit(X_train, Y_train)
    acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
    
    #Save the accuracy as a pickle file to be used 
    with open(f'{models_dir}/logistic_reg', 'wb') as f:
        pickle.dump(acc_log, f)
    
    print('Done!')
    

In [None]:
def gaussian_NB(feature_dir: InputPath(str), models_dir: OutputPath(str)):
    
    import os
    import pickle
    from sklearn.naive_bayes import GaussianNB

    # create the output directory before writing to it
    os.makedirs(models_dir, exist_ok=True)
    
    #loading the train data
    with open(f'{feature_dir}/train', 'rb') as f:
        train_data = pickle.load(f)
    # Separate the X_train from y_train.
    X_train, Y_train = train_data
    
    gaussian = GaussianNB()
    gaussian.fit(X_train, Y_train)
    acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
    
    #Save the accuracy as a pickle file to be used 
    with open(f'{models_dir}/gaus_NB', 'wb') as f:
        pickle.dump(acc_gaussian, f)
    
    print('Done!')

In [None]:
def SVM(feature_dir: InputPath(str), models_dir: OutputPath(str)):
    import os
    import pickle
    from sklearn.svm import SVC

    # create the output directory before writing to it
    os.makedirs(models_dir, exist_ok=True)
    
    #loading the train data
    with open(f'{feature_dir}/train', 'rb') as f:
        train_data = pickle.load(f)
    # Separate the X_train from y_train.
    X_train, Y_train = train_data
    
    linear_svc = SVC(gamma='auto')
    linear_svc.fit(X_train, Y_train)
    acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
    
    #Save the accuracy as a pickle file to be used 
    with open(f'{models_dir}/svm', 'wb') as f:
        pickle.dump(acc_linear_svc, f)
    
    print('Done!')

In [None]:
def decision_tree(feature_dir: InputPath(str), models_dir: OutputPath(str)):
    
    import os
    import pickle
    from sklearn.tree import DecisionTreeClassifier

    # create the output directory before writing to it
    os.makedirs(models_dir, exist_ok=True)
    
    #loading the train data
    with open(f'{feature_dir}/train', 'rb') as f:
        train_data = pickle.load(f)
    # Separate the X_train from y_train.
    X_train, Y_train = train_data
    
    decision_tree = DecisionTreeClassifier()
    decision_tree.fit(X_train, Y_train)
    acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
    
    #Save the accuracy as a pickle file to be used 
    with open(f'{models_dir}/decision_tree', 'wb') as f:
        pickle.dump(acc_decision_tree, f)
    
    print('Done!')
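
All five training functions hand their result to later steps in the same way: the accuracy is pickled under a fixed file name inside the output directory, and the consuming step loads it back by that name. A small sketch of this contract follows. The temporary directory stands in for the path that KFP would pass in, and the accuracy value is made up.

```python
import os
import pickle
import tempfile

models_dir = tempfile.mkdtemp()   # stand-in for the path KFP passes in
acc_random_forest = 92.48         # made-up accuracy, in percent

# Producer side (training step): store the accuracy under a well-known file name
with open(os.path.join(models_dir, "random_forest"), "wb") as f:
    pickle.dump(acc_random_forest, f)

# Consumer side (a later step): load it back using the same file name
with open(os.path.join(models_dir, "random_forest"), "rb") as f:
    loaded = pickle.load(f)

print(loaded)
```

The file name is the only thing the two sides share, so a typo on either side only shows up as a `FileNotFoundError` when the pipeline runs.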

In [None]:
def results(
    random_forest_dir: InputPath(str),
    logistic_reg_dir: InputPath(str),
    gaussian_nb_dir: InputPath(str),
    svm_dir: InputPath(str),
    decision_tree_dir: InputPath(str),
    result_dir: OutputPath(str),
):
    import os
    import pickle

    import pandas as pd

    # each training step writes its accuracy into its own output directory
    with open(f'{random_forest_dir}/random_forest', 'rb') as f:
        acc_random_forest = pickle.load(f)

    with open(f'{logistic_reg_dir}/logistic_reg', 'rb') as f:
        acc_log = pickle.load(f)

    with open(f'{gaussian_nb_dir}/gaus_NB', 'rb') as f:
        acc_gaussian = pickle.load(f)

    with open(f'{svm_dir}/svm', 'rb') as f:
        acc_linear_svc = pickle.load(f)

    with open(f'{decision_tree_dir}/decision_tree', 'rb') as f:
        acc_decision_tree = pickle.load(f)

    results = pd.DataFrame({
        'Model': ['Support Vector Machines', 'Logistic Regression',
                  'Random Forest', 'Naive Bayes', 'Decision Tree'],
        'Score': [acc_linear_svc, acc_log,
                  acc_random_forest, acc_gaussian, acc_decision_tree]})
    result_df = results.sort_values(by='Score', ascending=False)
    result_df = result_df.set_index('Score')

    # create the output directory before writing to it
    os.makedirs(result_dir, exist_ok=True)
    result_df.to_pickle(f'{result_dir}/result.pkl')
    print(f"Result saved to {result_dir}")

### Component 5: Evaluate the Models
With the following Python function the models are evaluated.
The metrics [metadata](https://www.kubeflow.org/docs/pipelines/sdk/pipelines-metrics/) (the accuracy of each model) is made available to the Kubeflow Pipelines UI.
Metadata can automatically be visualized with output viewer(s).
Please go [here](https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/) to see how to do that.

In [10]:
def evaluate_models(
    random_forest_dir: InputPath(str),
    logistic_reg_dir: InputPath(str),
    gaussian_nb_dir: InputPath(str),
    svm_dir: InputPath(str),
    decision_tree_dir: InputPath(str),
    metrics_path: OutputPath(str),
) -> NamedTuple("EvaluationOutput", [("mlpipeline_metrics", "Metrics")]):
    """Loads the accuracy stored by each training step and reports it as
    Kubeflow Pipelines metrics metadata."""

    import json
    import pickle
    from collections import namedtuple

    # each training step writes its accuracy into its own output directory
    with open(f'{random_forest_dir}/random_forest', 'rb') as f:
        acc_random_forest = pickle.load(f)

    with open(f'{logistic_reg_dir}/logistic_reg', 'rb') as f:
        acc_log = pickle.load(f)

    with open(f'{gaussian_nb_dir}/gaus_NB', 'rb') as f:
        acc_gaussian = pickle.load(f)

    with open(f'{svm_dir}/svm', 'rb') as f:
        acc_linear_svc = pickle.load(f)

    with open(f'{decision_tree_dir}/decision_tree', 'rb') as f:
        acc_decision_tree = pickle.load(f)

    # The accuracies were stored as percentages (0-100), while the PERCENTAGE
    # format expects a fraction in [0, 1]. Metric names are kept to lowercase
    # letters and hyphens to stay within the KFP metric-name pattern.
    accuracies = {
        "accuracy-svc": acc_linear_svc,
        "accuracy-logreg": acc_log,
        "accuracy-random-forest": acc_random_forest,
        "accuracy-gausnb": acc_gaussian,
        "accuracy-dectree": acc_decision_tree,
    }
    metrics = {
        "metrics": [
            {"name": name, "numberValue": float(acc) / 100, "format": "PERCENTAGE"}
            for name, acc in accuracies.items()
        ]
    }

    with open(metrics_path, "w") as f:
        json.dump(metrics, f)

    out_tuple = namedtuple("EvaluationOutput", ["mlpipeline_metrics"])

    return out_tuple(json.dumps(metrics))
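
The metrics file is a JSON document with a top-level `metrics` list whose entries carry `name`, `numberValue`, and `format`. As far as the KFP metrics documentation describes it, a `PERCENTAGE` value is given as a fraction (for example 0.92 for 92%). The sketch below builds and writes such a document with the standard library only. The accuracy values, the metric names, and the temporary file location are made up for illustration.

```python
import json
import os
import tempfile

# Made-up accuracies in percent, keyed by an illustrative model name
accuracies = {"svc": 78.23, "random-forest": 92.48}

metrics = {
    "metrics": [
        {
            "name": f"accuracy-{model}",
            "numberValue": acc / 100,   # fraction in [0, 1]
            "format": "PERCENTAGE",
        }
        for model, acc in accuracies.items()
    ]
}

metrics_path = os.path.join(tempfile.mkdtemp(), "mlpipeline-metrics.json")
with open(metrics_path, "w") as f:
    json.dump(metrics, f)

with open(metrics_path) as f:
    written = json.load(f)
print([m["name"] for m in written["metrics"]])
```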

### Component 6: Export the Model

In [11]:
def export_model(
    models_dir: InputPath(str),
    metrics: InputPath(str),
    export_bucket: str,
):
    import os
    import boto3
    from botocore.client import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio-service.kubeflow:9000",
        aws_access_key_id="minio",
        aws_secret_access_key="minio123",
        config=Config(signature_version="s3v4"),
    )

    # Create export bucket if it does not yet exist
    response = s3.list_buckets()
    export_bucket_exists = False

    for bucket in response["Buckets"]:
        if bucket["Name"] == export_bucket:
            export_bucket_exists = True

    if not export_bucket_exists:
        s3.create_bucket(ACL="public-read-write", Bucket=export_bucket)

    # Save model files to S3, using each file's path relative to models_dir as the object key
    for root, dirs, files in os.walk(models_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            s3_path = os.path.relpath(local_path, models_dir)

            s3.upload_file(
                local_path,
                export_bucket,
                s3_path,
                ExtraArgs={"ACL": "public-read"},
            )

    response = s3.list_objects(Bucket=export_bucket)
    print(f"All objects in {export_bucket}:")
    for file in response["Contents"]:
        print("{}/{}".format(export_bucket, file["Key"]))
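
The upload loop derives `s3_path`, the part of each object key that mirrors the file's location below `models_dir`. The following sketch runs the same `os.walk` and `os.path.relpath` logic on a temporary directory, without any S3 client, to show which relative paths come out. The directory layout and file names are invented for this example.

```python
import os
import tempfile

models_dir = tempfile.mkdtemp()   # stand-in for the component's input directory
os.makedirs(os.path.join(models_dir, "nested"))
for rel in ["random_forest", os.path.join("nested", "svm")]:
    with open(os.path.join(models_dir, rel), "wb") as f:
        f.write(b"placeholder")

keys = []
for root, dirs, files in os.walk(models_dir):
    for filename in files:
        local_path = os.path.join(root, filename)
        # path relative to models_dir, i.e. what the component calls s3_path
        keys.append(os.path.relpath(local_path, models_dir))

print(sorted(keys))
```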

## How to Combine the Components into a Pipeline
Note that up to this point we have not yet used the Kubeflow Pipelines SDK!

With our components (i.e. self-contained functions) defined, we can wire up the dependencies with Kubeflow Pipelines.

The call [`components.func_to_container_op(f, base_image=img)(*args)`](https://www.kubeflow.org/docs/pipelines/sdk/sdk-overview/) has the following ingredients:
- `f` is the Python function that defines a component
- `img` is the base (Docker) image used to package the function
- `*args` lists the arguments to `f`

What the `*args` mean is best explained by going forward through the graph:
- `downloadOp` is the very first step and has no dependencies; it therefore has no `InputPath`.
  Its output (i.e. `OutputPath`) is stored in `data_dir`.
- `preprocessOp` needs the data downloaded by `downloadOp`, and its signature lists `data_dir` (input) and `preprocessed_data_dir` (output).
  So, it _depends on_ `downloadOp.output` (i.e. the previous step's output) and stores its own outputs in `preprocessed_data_dir`, which can be used by another step.
  `downloadOp` is the parent of `preprocessOp`, as required.
- `featureOp` takes `preprocessOp.output` as its `preprocessed_data_dir` and stores the engineered features in `feature_dir`.
- The five training steps (`random_forestOp`, `logistic_regOp`, `gaussian_NB_Op`, `svmOp`, and `decision_treesOp`) each take `featureOp.output` as `feature_dir` and store their results in their own `models_dir`.
  They depend only on `featureOp`, not on each other, so they can run in parallel.
- `resultOp` and `evaluateOp` are each given the outputs of all five training steps.
  That way, they can only run after the successful completion of every training step.
- `exportOp` runs the function `export_model`, which accepts three parameters: `models_dir`, `metrics`, and `export_bucket`.
  From where do we get the `models_dir`?
  It is nothing but the output of a training step.
  Similarly, `metrics` comes from `evaluateOp`.
  The remaining argument, `export_bucket`, is a regular Python argument that is static for the pipeline: it does not depend on any step's output being available.
  Hence, it is defined without using `InputPath`.
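
To make the ordering concrete, the dependencies between the steps can be written down as a plain graph and sorted topologically. This is only an illustration with the standard library's `graphlib` (Python 3.9+), not how KFP schedules steps internally. The step names mirror the variables in the pipeline function below, and the edge for `exportOp` is simplified to `evaluateOp`, which already implies the training steps.

```python
from graphlib import TopologicalSorter  # Python 3.9+

train_steps = [
    "random_forestOp", "logistic_regOp", "gaussian_NB_Op", "svmOp", "decision_treesOp",
]

# step -> set of steps whose outputs it consumes
dependencies = {
    "downloadOp": set(),
    "preprocessOp": {"downloadOp"},
    "featureOp": {"preprocessOp"},
    **{step: {"featureOp"} for step in train_steps},
    "resultOp": set(train_steps),
    "evaluateOp": set(train_steps),
    "exportOp": {"evaluateOp"},
}

# one valid execution order; steps without mutual dependencies may appear in any order
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The five training steps share a single parent and have no edges between them, which is why a scheduler is free to run them side by side.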

In [12]:
def train_and_serve(
    data_dir: str,
    preprocessed_data_dir: str,
    feature_dir: str,
    models_dir: str,
    export_bucket: str,
    
):
    # For GPU support, please add the "-gpu" suffix to the base image
    BASE_IMAGE = "mesosphere/kubeflow:1.0.1-0.5.0-tensorflow-2.2.0"

    downloadOp = components.func_to_container_op(
        download_dataset, base_image=BASE_IMAGE
    )()

    preprocessOp = components.func_to_container_op(preprocess_dataset,base_image=BASE_IMAGE)(
        downloadOp.output
    )
    
    featureOp = components.func_to_container_op(feateng_dataset, base_image=BASE_IMAGE)(
        preprocessOp.output
    )
    
    random_forestOp =  components.func_to_container_op(random_forest, base_image=BASE_IMAGE)(
        featureOp.output
    )
    
    logistic_regOp = components.func_to_container_op(logistic_reg, base_image=BASE_IMAGE)(
        featureOp.output
    )
    
    gaussian_NB_Op = components.func_to_container_op(gaussian_NB, base_image=BASE_IMAGE)(
        featureOp.output
    )
    
    svmOp = components.func_to_container_op(SVM, base_image=BASE_IMAGE)(
        featureOp.output
    )
    
    decision_treesOp = components.func_to_container_op(decision_tree, base_image=BASE_IMAGE)(
        featureOp.output
    )
    
    resultOp = components.func_to_container_op(results, base_image=BASE_IMAGE)(
        random_forestOp.output, logistic_regOp.output, gaussian_NB_Op.output, svmOp.output, decision_treesOp.output
    )

    evaluateOp = components.func_to_container_op(evaluate_models, base_image=BASE_IMAGE)(
        random_forestOp.output, logistic_regOp.output, gaussian_NB_Op.output, svmOp.output, decision_treesOp.output
    )

    # Export the files of one training step (the random forest is picked here as an example).
    # evaluate_models has two outputs, so the metrics file is selected by name.
    exportOp = components.func_to_container_op(export_model, base_image=BASE_IMAGE)(
        random_forestOp.output, evaluateOp.outputs["metrics"], export_bucket
    )


Just in case it isn't obvious: you do not have to build any Docker images yourself.
Each component runs in a container based on `BASE_IMAGE`, and the Python function is passed to that container as the code to execute.
Each component _can_ use a different base image though.
This may come in handy if you want to have reusable components for automatic data and/or model analysis (e.g. to investigate bias).

Note that you did not have to use [Kubeflow Fairing](../fairing/Kubeflow%20Fairing.ipynb) or `docker build` locally at all!

<div class="alert alert-block alert-info">
    Remember when we said all dependencies have to be included in the base image?
    Well, that was not quite accurate.
    It's a good idea to have everything included and tested before you define and use your pipeline components to make sure that there are no dependency conflicts.
    There is, however, a way to add <a href="https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.func_to_container_op">packages (<code>packages_to_install</code>) and additional code to execute <em>before</em> the function code (<code>extra_code</code>)</a>.
</div>

Is that it?
Not quite!

We still have to define the pipeline itself.
Our `train_and_serve` function defines dependencies but we must use the KFP domain-specific language (DSL) to register the pipeline with its components:

In [13]:
# See: https://github.com/kubeflow/kfserving/blob/master/docs/DEVELOPER_GUIDE.md#troubleshooting
def op_transformer(op):
    op.add_pod_annotation(name="sidecar.istio.io/inject", value="false")
    return op


@dsl.pipeline(
    name="Titanic Pipeline",
    description="A sample pipeline to demonstrate different model training, evaluation and export",
)
def titanic_pipeline(
    models_dir: str = "/train/models",
    data_dir: str = "/train/data",
    preprocessed_data_dir: str = "/train/preprocessed",
    feature_dir: str = "/train/features",
    export_bucket: str = "titanic",
):
    train_and_serve(
        data_dir=data_dir,
        preprocessed_data_dir=preprocessed_data_dir,
        feature_dir=feature_dir,
        models_dir=models_dir,
        export_bucket=export_bucket,
    )
    dsl.get_pipeline_conf().add_op_transformer(op_transformer)

With that in place, let's compile the pipeline directly from our notebook:

In [14]:
pipeline_func = titanic_pipeline

# run_name, experiment_name and arguments are only needed when submitting a run;
# compiling the pipeline does not use them
run_name = pipeline_func.__name__ + " run"
experiment_name = "End-to-End-Demo"

# the keys must match the parameters of `titanic_pipeline`
arguments = {
    "models_dir": "/train/models",
    "data_dir": "/train/data",
    "preprocessed_data_dir": "/train/preprocessed",
    "feature_dir": "/train/features",
    "export_bucket": "titanic",
}

kfp.compiler.Compiler().compile(pipeline_func, 'demo.yaml')
    