## End-to-end Heart Failure Prediction Pipeline

#### Building our lightweight pipeline components using Python

### Lightweight python components

Lightweight Python components do not require you to build a new container image for every code change. They are intended for fast iteration in a notebook environment.

#### Building a lightweight python component

To build a component, define a stand-alone Python function and then call `kfp.components.func_to_container_op(func)` to convert it to a component that can be used in a pipeline.

There are several requirements for the function:

- The function should be stand-alone. It should not use any code declared outside of the function definition. Any imports should be added inside the main function. Any helper functions should also be defined inside the main function.


- The function can only import packages that are available in the base image. If you need to import a package that's not available you can try to find a container image that already includes the required packages. (As a workaround you can use the module subprocess to run pip install for the required package.)


- If the function operates on numbers, the parameters need to have type hints. Supported types are [int, float, bool]. Everything else is passed as a string.
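
As a rough illustration of these requirements, the sketch below defines a stand-alone function whose import lives inside the body and whose numeric parameter carries a type hint. The function `mean_of` and its parameters are made up for this example and are not part of the pipeline; the conversion call is left as a comment because it needs the KFP SDK to be installed.

```python
def mean_of(values_csv: str, digits: int = 2) -> float:
    """Hypothetical stand-alone function that follows the component rules."""
    # Imports live inside the function so they travel with the component code.
    import statistics

    # Non-numeric inputs arrive as strings, so the list is parsed here.
    values = [float(v) for v in values_csv.split(",")]
    return round(statistics.mean(values), digits)


# A plain Python call is a quick way to test the logic before converting it.
print(mean_of("1,2,6"))

# With the SDK installed, the conversion would look like:
# mean_of_op = kfp.components.func_to_container_op(mean_of)
```

Testing the function locally first keeps the feedback loop short, since a failed container run is slower to diagnose than a failed function call.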

### Building Python function-based components

A Kubeflow Pipelines component is a self-contained set of code that performs one step in your ML workflow. A pipeline component is composed of:

- The component code, which implements the logic needed to perform a step in your ML workflow.

- A component specification, which defines the following:
    - The component's metadata, its name and description.
    - The component's interface, the component's inputs and outputs.
    - The component's implementation, the Docker container image to run, how to pass inputs to your component code, and how to get the component's outputs.
    

Python function-based components make it easier to iterate quickly by letting you build your component code as a Python function and generating the component specification for you.
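
To make the three parts of a specification more concrete, the sketch below lays them out as a Python dictionary. It is only an approximation of what the SDK generates: the component name, description, and command are placeholders chosen for this example, and the real generated command is not reproduced here.

```python
import json

# Hypothetical specification for a preprocessing step, written as a Python dict.
component_spec = {
    # Metadata: name and description
    "name": "Preprocess",
    "description": "Cleans the heart failure dataset.",
    # Interface: inputs (this step declares no outputs)
    "inputs": [{"name": "data_path", "type": "String"}],
    # Implementation: container image and how inputs reach the code
    "implementation": {
        "container": {
            "image": "tensorflow/tensorflow:latest-gpu-py3",
            "command": ["python3", "-u", "-c", "<generated program text>"],
            "args": ["--data-path", {"inputValue": "data_path"}],
        }
    },
}

print(json.dumps(component_spec, indent=2))
```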

## Setup

In [1]:
!python -m pip install --user --upgrade pip

Requirement already up-to-date: pip in /home/jovyan/.local/lib/python3.6/site-packages (20.2.4)


In [2]:
# !pip3 install -U --user numpy==1.19.3 

In [3]:
from IPython import get_ipython 
!python -m pip install pandas 
!pip install pandas==0.23.4 matplotlib==3.3.1 scipy==1.2.1 scikit-learn==0.22 tensorflow==2.1.0 keras==1.2.2 seaborn==0.10.1 facets-overview==1.0.0 --user
!pip install IPython==7.12.0 numpy==1.16.1 imblearn==0.0 jsonlib==1.6.1 tensorboard==2.1.0 DateTime==4.1.1 --user

Defaulting to user installation because normal site-packages is not writeable
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.


In [4]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

## Install or update the pipelines SDK

#### Run the following command to install the Kubeflow Pipelines SDK.

In [5]:
# You may need to restart your notebook kernel after updating the kfp sdk
!pip3 install --user --upgrade kfp

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Requirement already up-to-date: kfp in /home/jovyan/.local/lib/python3.6/site-packages (1.1.1)


`Restart the kernel before you proceed`

In [6]:
# Restart kernel after the pip install
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

`Check if the install was successful:`
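
One possible way to run this check from Python is sketched below. The helper `installed_version` is a name invented for this example; it relies on `importlib.metadata` where available and falls back to `pkg_resources` on older interpreters such as the Python 3.6 kernel seen in the output above.

```python
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        # Older interpreters: fall back to setuptools' pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the kfp version, or None when the SDK is not installed.
print("kfp version:", installed_version("kfp"))
```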

## Build the Components

#### Import the kfp and kfp.components packages.

In [7]:
import kfp                  # the Pipelines SDK. 
from kfp import compiler
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp
import os
import subprocess
import json

from kfp.dsl.types import Integer, GCSPath, String
import kfp.notebook

In [8]:
# where the outputs are stored
out_dir = "/home/jovyan/stage-f-07-heart-failure/data/out/"

## Create a release experiment in the Kubeflow pipeline

#### Kubeflow Pipelines requires an experiment before you can create a run. An experiment is a group of comparable runs.

In [9]:
EXPERIMENT_NAME = 'Heart Failure Prediction Pipeline'        # Name of the experiment in the UI
BASE_IMAGE = "tensorflow/tensorflow:latest-gpu-py3"    # Base image used for components in the pipeline

PROJECT_NAME = "Kubeflow-mlops-pipeline"

#### Create an instance of the kfp.Client class

In [10]:
client = kfp.Client()
exp = client.create_experiment(name=EXPERIMENT_NAME)

## Building Python function-based components

#### Define your component's code as a standalone python function.

### Preprocessing Function

In [11]:
def preprocess(data_path): 
    
    # func_to_container_op requires packages to be imported inside of the function.
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pip==20.2.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22'])
    import pandas as pd
    import numpy as np
    from pandas import Series, DataFrame,read_csv
    import pickle
    
    # Read the dataset as a csv file 
    df = pd.read_csv("https://raw.githubusercontent.com/HamoyeHQ/stage-f-07-heart-failure/master/data/heart_failure_clinical_records_dataset.csv")
    
    
    # Convert the binary (0/1) features to boolean labels and 'sex' to a string label
    df['anaemia'] = np.where(df['anaemia'] == 1, True, False)
    df['diabetes'] = np.where(df['diabetes'] == 1, True, False)
    df['high_blood_pressure'] = np.where(df['high_blood_pressure'] == 1, True, False)
    df['smoking'] = np.where(df['smoking'] == 1, True, False)
    df['sex'] = np.where(df['sex'] == 1, 'Male', 'Female')
    
    
    # Print the number of missing values in each column
    print(df.isnull().sum())
    
    # Drop rows that contain missing values
    df = df.dropna(how='any', axis=0)
    
    # Save Dataframe using the pickle extension
    df.to_pickle(f'{data_path}/preprocessed-data.pkl')
    print("Preprocessing Done")

### Analysis Function

In [12]:
# Exploratory Data Analysis
def Analyze(data_path):
    
    # func_to_container_op requires packages to be imported inside of the function.
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pip==20.2.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22']) 
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'matplotlib==3.3.1']) 
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'seaborn==0.10.1']) 
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'facets-overview==1.0.0'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'IPython==7.12.0'])
    import pandas as pd
    import numpy as np
    import pickle
    from pandas import Series
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Read the dataset as a csv file
    df = pd.read_csv("https://raw.githubusercontent.com/HamoyeHQ/stage-f-07-heart-failure/master/data/heart_failure_clinical_records_dataset.csv")
    
    # Summary statistics of the data (printed so they appear in the component logs)
    print(df.describe())
    
    # Split into Features and Labels
    x = df.drop('DEATH_EVENT', axis = 1)
    y = df['DEATH_EVENT']
    
#     @title Install the facets_overview pip package.
#     import facets-overview
    
    train_data = x[0:150] 
    test_data = x[150: ]
    
    
    # Create the feature stats for the datasets and stringify it.
    import base64
    from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

    gfsg = GenericFeatureStatisticsGenerator()
    proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train_data},
                                  {'name': 'test', 'table': test_data}])
    protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
    
    
    # Display the facets overview visualization for this data
    from IPython.core.display import display, HTML

    HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
    html = HTML_TEMPLATE.format(protostr=protostr)
    display(HTML(html))
    
    
    # Compare, for each factor, the patients who died with those who survived
    fig,ax = plt.subplots(2,3,figsize=(15,8))
    ax1,ax2,ax3,ax4, ax5, ax6 = ax.flatten()

    sns.countplot(df['anaemia'], hue = df["DEATH_EVENT"],ax=ax1)
    sns.countplot(df['diabetes'],hue = df["DEATH_EVENT"],ax=ax2)
    sns.countplot(df['high_blood_pressure'],hue = df["DEATH_EVENT"], ax=ax3)
    sns.countplot(df['sex'],hue = df["DEATH_EVENT"], ax=ax4)
    sns.countplot(df['smoking'],hue = df["DEATH_EVENT"], ax=ax5)
    sns.countplot(df['DEATH_EVENT'],hue = df["DEATH_EVENT"], ax=ax6)

### Feature Engineering Function

In [13]:
# # preprocess(data_path)
# def feature_engineer(data_path):
    
#      # func_to_container_op requires packages to be imported inside of the function.
#     import sys, subprocess;
#     subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
#     subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22']) 
#     subprocess.run([sys.executable, '-m', 'pip', 'install', 'numpy==1.16.1'])
#     import pandas as pd
#     import numpy as np
#     import pickle
#     from pandas import Series, DataFrame,read_csv
    
#     # load the preprocessed data
#     # preprocess(data_path)
#     df = pd.read_pickle(f'{data_path}/preprocessed-data.pkl')
    
    
#     # Re-engineer some features based on generally accepted medical values for those features
    
#     # creatinine_phosphokinase normal values range from 10 to 120 micrograms per liter (mcg/L)
#     def set_cpk(row):
#         if row["creatinine_phosphokinase"] >= 10 and row["creatinine_phosphokinase"] <= 120:
#             return 'Normal'
#         else:
#             return "High"
#     df['cp_desc'] =  df.apply(set_cpk, axis=1)
    
#     # Range of EJECTION FRACTION for Heart Failure
#     def set_eject_fract(row):
#         if row["ejection_fraction"] <= 35:
#             return "Low"
#         elif row["ejection_fraction"] > 35 and row["ejection_fraction"] <= 49:
#             return "Below_Normal"
#         elif row["ejection_fraction"] >= 50 and row["ejection_fraction"] <= 75:
#             return "Normal"
#         else:
#             return "High"
#     df['ejection_fraction_desc'] =  df.apply(set_eject_fract, axis =1)
    
#     # Range of PLATELETS for Male and Female
#     def set_platelets(row):
#         if row["sex"] == 'Female':  #females
#             if row["platelets"] < 157000:
#                 return "Low"
#             elif row["platelets"] >=157000 and row["platelets"] <= 371000:
#                 return "Normal"
#             else:
#                 return "High"
            
#         elif row["sex"] == 'Male':  #males
#             if row["platelets"] < 135000:
#                 return "Low"
#             if row["platelets"] >= 135000 and row["platelets"] <= 317000:
#                 return "Normal"
#             else:
#                 return "High"
#     df['platelets_desc'] = df.apply(set_platelets, axis = 1)
    
#     # Range of SERUM SODIUM for Heart Failure
#     def set_sodium(row):
#         if row["serum_sodium"] < 135:
#             return "Low"
#         elif row["serum_sodium"] >=135 and row["serum_sodium"] <= 145:
#             return "Normal"
#         else:
#             return "High"
#     df['sodium_desc'] = df.apply(set_sodium, axis =1)
    
    
#     # Range of SERUM CREATININE for Heart Failure (Varies for male and female)
#     def set_creatinine(row):
#         if row['sex'] == 'Female':  # females
#             if  row['serum_creatinine'] >= 0.5 and  row['serum_creatinine'] <= 1.1:
#                 return 'Normal'
#             else:
#                 return "High"
            
#         elif row['sex'] == 'Male':
#             if  row['serum_creatinine'] >= 0.6 and row['serum_creatinine'] <= 1.2:
#                 return 'Normal'
#             else:
#                 return "High"
#     df['serum_creatinine_desc'] = df.apply(set_creatinine, axis = 1)
    
    
#     # Save the dataframe
#     df.to_pickle(f'{data_path}/feature-engineered.pkl')
#     print("Feature Engineering Done")

### Scaling and Transformation Function

In [14]:
def scale_transform(data_path):
    
    # func_to_container_op requires packages to be imported inside of the function.
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pip==20.2.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'numpy==1.16.1'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'imblearn==0.0'])
    import pandas as pd
    import numpy as np
    import pickle
    from sklearn.utils import shuffle
    import imblearn
    from imblearn.over_sampling import SMOTENC
    from sklearn.preprocessing import MinMaxScaler 
    from sklearn.compose import ColumnTransformer
    
    # Read the preprocessed pickle data file
    df = pd.read_pickle(f'{data_path}/preprocessed-data.pkl')
    
    # Features and labels
    x = df.drop('DEATH_EVENT', axis = 1)
    y = df['DEATH_EVENT']
    
    
    # Use SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features)
    # to correct the imbalance between the target classes.
    # fit_resample replaces the older fit_sample name, which newer imbalanced-learn releases dropped.
    smote = SMOTENC(random_state=1,categorical_features=[0,1,3,5,9,10])
    x_bal, y_bal = smote.fit_resample(x, y)
    x_bal = pd.DataFrame(x_bal, columns = x.columns)
    
    # One-hot encode the categorical 'sex' column
    encode = ['sex']
    x = pd.get_dummies(x_bal, columns = encode, drop_first = True)
    
    # Columns (Features)
    col = ['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time',
       'anaemia','diabetes','high_blood_pressure','smoking','sex_Male']
    
    # Scale the data
    col_trans = ColumnTransformer(remainder='passthrough',
                              transformers = [('scaler',MinMaxScaler(),[0,2,4,6,7,8,10])])
    trans = col_trans.fit_transform(x)
    trans = pd.DataFrame(trans,columns = col)
    
    #output file to path
    np.savez_compressed(f'{data_path}/scale_transform-data.npz', 
                       x=trans,
                       y_bal=y_bal)
    print("Scale and transform Done")
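
The hand-written `col` list in the function above has to follow the column order that the transformer emits: the scaled columns come first, followed by the passthrough columns in their original order. The sketch below reproduces that ordering in plain Python. The helper name is made up for this example, and the input column order is an assumption based on the dataset's CSV header with the `sex_Male` dummy column appended at the end.

```python
def column_order_after_transform(columns, scaled_idx):
    """Return column names in scaled-first, passthrough-last order."""
    scaled_set = set(scaled_idx)
    scaled = [columns[i] for i in scaled_idx]
    passthrough = [c for i, c in enumerate(columns) if i not in scaled_set]
    return scaled + passthrough


# Assumed column order after one-hot encoding 'sex' (the dummy column goes last).
encoded_columns = [
    'age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
    'ejection_fraction', 'high_blood_pressure', 'platelets',
    'serum_creatinine', 'serum_sodium', 'smoking', 'time', 'sex_Male',
]

# Same indices as the MinMaxScaler entry in the ColumnTransformer above.
order = column_order_after_transform(encoded_columns, [0, 2, 4, 6, 7, 8, 10])
print(order)
```

Deriving the names this way, rather than typing them by hand, avoids silently mislabeled columns if the scaled indices ever change.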

### Training Function

In [15]:
# !pip install -q tf-nightly-2.0-preview

In [16]:
# import tensorflow
# tensorflow.__version__

In [17]:
# %load_ext tensorboard

In [18]:
# import tensorflow as tf
# import datetime, os

# logs_base_dir = "./logs"
# os.makedirs(logs_base_dir, exist_ok=True)
# %tensorboard --logdir {logs_base_dir}

#### If your component returns multiple outputs, annotate your function with the typing.NamedTuple type hint and use the collections.namedtuple function to return your function's outputs as a new subclass of tuple.

- You can also return metadata and metrics from your function.

    - Metadata helps you visualize pipeline results.
    - Metrics help you compare pipeline runs.
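
A minimal sketch of this pattern is shown below, using only the standard library. The function `evaluate` and its inputs are hypothetical; the metrics dictionary follows the same layout as the one used later in the validation step. Note that the returned tuple lists its fields in the same order as the declared `NamedTuple`.

```python
import json
from collections import namedtuple
from typing import NamedTuple


def evaluate(correct: int, total: int) -> NamedTuple(
        'EvaluateOutputs',
        [('accuracy', float), ('mlpipeline_metrics', 'Metrics')]):
    """Hypothetical component function with two outputs."""
    accuracy = correct / total

    # Metrics are returned as a JSON string.
    metrics = {
        'metrics': [{
            'name': 'accuracy-score',
            'numberValue': accuracy,
            'format': 'PERCENTAGE',
        }]
    }

    # Field names and order mirror the declared return type.
    outputs = namedtuple('EvaluateOutputs', ['accuracy', 'mlpipeline_metrics'])
    return outputs(accuracy, json.dumps(metrics))


result = evaluate(45, 60)
print(result.accuracy)
print(result.mlpipeline_metrics)
```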

In [19]:
from typing import NamedTuple
def training(data_path, classifier_file) -> NamedTuple(
    'TrainingOutput',
    [
        ('mlpipeline_ui_metadata', 'UI_metadata')
#         ('mlpipeline_metrics', 'Metrics')
    ]):
    
    # func_to_container_op requires packages to be imported inside of the function.
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pip==20.2.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'numpy==1.16.1'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'imblearn==0.0']) 
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'jsonlib==1.6.1']) 
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'tensorboard==2.1.0'])  
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'DateTime==4.1.1'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'IPython==7.12.0'])
    import pandas as pd
    import numpy as np
    import pickle
    import imblearn
    from imblearn.over_sampling import SMOTENC
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    #load the transformed data
    scale_transformed_data = np.load(f'{data_path}/scale_transform-data.npz')
    x = scale_transformed_data['x']
    y = scale_transformed_data['y_bal']
    
    # split data into training and testing set
    x_train,x_test,y_train,y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
    
    # Instantiate classifier with obtained optimum parameters for training
    classifier = RandomForestClassifier(max_features= 'auto',random_state = 3,
                                    min_samples_leaf = 1, min_samples_split = 2,n_estimators = 100)
    
    
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.python.lib.io import file_io
    import json
#     import datetime, os
    from datetime import datetime
#     %load_ext tensorboard 
    
#     logdir = "/home/jovyan/stage-f-07-heart-failure/pipeline/logs/" + datetime.now().strftime("%d/%m/%Y - %H:%M:%S")
#     tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
    
    
    # Fit to x_train and y_train
    classifier.fit(x_train, y_train)
    
    
    # Export a sample tensorboard
    metadata = {
      'outputs' : [{
        'type': 'tensorboard',
        'source': 'gs://ml-pipeline-dataset/tensorboard-train',
      }]
    }
    
    with open('/mlpipeline-ui-metadata.json', 'w') as f:
      json.dump(metadata, f)
          
    # Write the train/test split to the data path
    np.savez_compressed(f'{data_path}/train-test-data.npz', 
                       x_train=x_train,
                       x_test=x_test,
                       y_train=y_train,
                       y_test=y_test)
    
    # Save the classifier model to the designated file
    with open(f'{data_path}/{classifier_file}', 'wb') as file:
        pickle.dump(classifier, file)
        
    
        
    from collections import namedtuple
    # Field names and order must match the NamedTuple declared in the signature
    training_output = namedtuple(
        'TrainingOutput',
        ['mlpipeline_ui_metadata'])
    return training_output(json.dumps(metadata))

### Model Validation Function

In [20]:
from typing import NamedTuple
def model_validation(data_path, classifier_file) -> NamedTuple(
    'ModelvalidationOutputs',
    [
      ('recall', float),
      ('accuracy', float),
      ('precision', float),
      ('f1score', float),
#       ('mlpipeline_ui_metadata', 'UI_metadata'),
      ('mlpipeline_metrics', 'Metrics')
    ]):
    
    # func_to_container_op requires packages to be imported inside of the function.
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pip==20.2.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'numpy==1.16.1'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'jsonlib==1.6.1']) 
    import pandas as pd
    import numpy as np
    import json
    import pickle
    from sklearn.metrics import classification_report, recall_score, accuracy_score,precision_score, f1_score, confusion_matrix
    
    # load the transformed data
    train_test_data = np.load(f'{data_path}/train-test-data.npz')
    x_train = train_test_data['x_train']
    x_test  = train_test_data['x_test']
    y_train = train_test_data['y_train']
    y_test  = train_test_data['y_test']
    
    # Load the saved classifier model
    with open(f'{data_path}/{classifier_file}', 'rb') as file:
        classifier = pickle.load(file)
    
    # predict on x_test
    y_pred = classifier.predict(x_test)
    
    
    # Model Evaluation
    recall = recall_score(y_test,y_pred)
    accuracy = accuracy_score(y_test,y_pred)
    precision = precision_score(y_test,y_pred)
    f1score = f1_score(y_test,y_pred)
    
    # Classification Report table
    report = classification_report(y_test,y_pred)
    print(report)

    # Export metrics
    metrics = {
      'metrics': [{
        'name': 'accuracy-score', # The name of the metric. Visualized as the column name in the runs table.
        'numberValue':  accuracy, # The value of the metric. Must be a numeric value.
        'format': "PERCENTAGE",   # The optional format of the metric. Supported values are "RAW" (displayed in raw format) and "PERCENTAGE" (displayed in percentage format).
      },{
        'name': 'recall-score',
        'numberValue': recall,
        'format': "PERCENTAGE",
      },{
        'name': 'precision-score',
        'numberValue': precision,
        'format': "PERCENTAGE",
      },{
        'name': 'f1score',
        'numberValue': f1score,
        'format': "PERCENTAGE",
      }]}
    
    
    # The Report file
    with open(f'{data_path}/result.txt', 'w') as result:
        result.write("Report: {} ".format(report))
    
    # Write the test data and predictions to the data path
    np.savez_compressed(f'{data_path}/validated-data.npz', 
                       x_test=x_test,
                       y_test=y_test,
                       y_pred=y_pred)

    # Save y_pred and y_test as pickle files (the with blocks close the files)
    with open(f'{data_path}/y_pred.pkl', 'wb') as file:
        pickle.dump(y_pred, file)
    with open(f'{data_path}/y_test.pkl', 'wb') as file:
        pickle.dump(y_test, file)
    
    # Save the classifier model to the designated file
    with open(f'{data_path}/{classifier_file}', 'wb') as file:
        pickle.dump(classifier, file)
        
        
    with open(f'{data_path}/classifier_result.txt', 'w') as result:
        result.write(" Prediction: {},\n\nActual: {} ".format(y_pred, y_test))
        
    from collections import namedtuple
    # Field order must match the NamedTuple declared in the signature
    model_eval_output = namedtuple(
        'ModelvalidationOutputs',
        ['recall', 'accuracy', 'precision', 'f1score', 'mlpipeline_metrics'])
    return model_eval_output(recall, accuracy, precision, f1score, json.dumps(metrics))

In [21]:
from typing import NamedTuple
def Confusion_matrix(data_path, classifier_file):
    
    # func_to_container_op requires packages to be imported inside of the function.
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pip==20.2.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'pandas==0.23.4'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'scikit-learn==0.22'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'numpy==1.16.1'])
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'matplotlib==3.3.1']) 
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'jsonlib==1.6.1']) 
    import json
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import pickle
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import plot_confusion_matrix
    
    # Load the saved classifier model
    with open(f'{data_path}/{classifier_file}', 'rb') as file:
        classifier = pickle.load(file)
        
    # Load the y_pred data file
    with open(f'{data_path}/y_pred.pkl', 'rb') as pickle_in:
        y_pred = pickle.load(pickle_in)
    
    # Load the y_test data file
    with open(f'{data_path}/y_test.pkl', 'rb') as pickle_in:
        y_test = pickle.load(pickle_in)
    
    # Confusion matrix (rows are actual classes, columns are predicted classes)
    matrix = confusion_matrix(y_test, y_pred)
    print(matrix)
    
#     from collections import namedtuple
#     confusion_matrix_output = namedtuple(
#         'Confusionmatrix',
#         ['mlpipeline_metrics']) 
#     return confusion_matrix_output(json.dumps(metrics)) 


#      metadata = {
#     'outputs' : [{
#       'type': 'confusion_matrix',
#       'format': 'csv',
#       'schema': [
#         {'name': 'target', 'type': 'CATEGORY'},
#         {'name': 'predicted', 'type': 'CATEGORY'},
#         {'name': 'count', 'type': 'NUMBER'},
#       ],
#       'source': <CONFUSION_MATRIX_CSV_FILE>,
#       # Convert vocab to string because for boolean values we want "True|False" to match csv data.
#       'labels': list(map(str, vocab)),
#     }]
#   }
#   with file_io.FileIO('/mlpipeline-ui-metadata.json', 'w') as f:
#     json.dump(metadata, f)


#     with open(f'{data_path}/classifier_result.txt', 'w') as result:
#         result.write(" Prediction: {},\nActual: {} ".format(y_pred, y_test))
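
The commented-out metadata above describes a CSV source with `target`, `predicted`, and `count` columns. One way to produce rows in that shape is sketched below in plain Python. The helper names and the sample labels are made up for this illustration, and writing the CSV to the location that the UI reads from is left out.

```python
import csv
import io
from collections import Counter


def confusion_rows(y_true, y_pred, labels):
    """Build (target, predicted, count) rows for every pair of labels."""
    counts = Counter(zip(y_true, y_pred))
    return [(str(t), str(p), counts[(t, p)]) for t in labels for p in labels]


def rows_to_csv(rows):
    """Serialize the rows as CSV text without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up labels, only to show the shape of the output.
rows = confusion_rows([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], labels=[0, 1])
print(rows_to_csv(rows))
```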

## Build a pipeline component from the function

#### Convert the function to a pipeline operation.

- Use `kfp.components.func_to_container_op` to return a factory function that you can use to create `kfp.dsl.ContainerOp` class instances for the pipeline. We also specify the base container image to run this function in.

In [22]:
# Create the preprocessing lightweight component.
preprocess_op = comp.func_to_container_op(preprocess, base_image=BASE_IMAGE)

# Create the analysis lightweight components.
analyze_op = comp.func_to_container_op(Analyze, base_image=BASE_IMAGE)

# Create the feature Engineering lightweight components.
# feature_engineer_op = comp.func_to_container_op(feature_engineer, base_image=BASE_IMAGE)

# Create the scale and transform lightweight components.
scale_transform_op = comp.func_to_container_op(scale_transform, base_image=BASE_IMAGE)

# Create the training lightweight components.
training_op = comp.func_to_container_op(training, base_image=BASE_IMAGE)

# Create the model evaluation lightweight components.
model_validation_op = comp.func_to_container_op(model_validation, base_image=BASE_IMAGE)

# Create the confusion matrix lightweight components.
confusion_matrix_op = comp.func_to_container_op(Confusion_matrix, base_image=BASE_IMAGE)

# Create predict_classifier lightweight components.
# training_op = comp.func_to_container_op(training, base_image=BASE_IMAGE)

## Build Kubeflow Pipeline

- The components are now built, so the next step is to wire them into a pipeline. Define the pipeline using the *@dsl.pipeline* decorator.


- The pipeline function takes a number of parameters that are fed into the components during execution. Kubeflow Pipelines are defined declaratively: calling the pipeline function only describes the steps and their dependencies, and the component code does not run until the compiled pipeline is executed as a run.


- A [Persistent Volume Claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) can be created quickly with `dsl.VolumeOp` to save and persist data between the components.
   - This works well locally; you could also use a cloud bucket for persistent storage.
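
The difference between describing a pipeline and running it can be illustrated with a toy model in plain Python. Nothing below is part of the KFP API: `ToyStep`, `build_toy_pipeline`, and `run_toy_pipeline` are invented for this sketch, which only shows that wiring the steps together executes none of them.

```python
class ToyStep:
    """Stand-in for a pipeline step: it records a function instead of running it."""

    def __init__(self, name, func, after=()):
        self.name = name
        self.func = func
        self.after = tuple(after)


def build_toy_pipeline(log):
    # Only wires the steps together; none of the step functions run here.
    prep = ToyStep("preprocess", lambda: log.append("preprocess ran"))
    train = ToyStep("train", lambda: log.append("train ran"), after=[prep])
    return [prep, train]


def run_toy_pipeline(steps):
    # Steps are listed in dependency order; check that before running each one.
    done = set()
    for step in steps:
        if not all(dep.name in done for dep in step.after):
            raise RuntimeError(f"{step.name} ran before its dependencies")
        step.func()
        done.add(step.name)


log = []
steps = build_toy_pipeline(log)
print(log)  # nothing has run yet
run_toy_pipeline(steps)
print(log)
```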

In [23]:
# Define the pipeline with the KFP domain-specific language (DSL).
# data_path and classifier_file are the parameters fed into the pipeline.
@dsl.pipeline(
    name='Heart Failure Prediction Pipeline',
    description='End-to-end machine learning pipeline that predicts mortality from heart failure.'
)
def Heart_Failure_container_pipeline(
    data_path: str,  # DATA_PATH
    classifier_file: str  # CLASSIFIER_PATH    
):
    
    # Create a persistent volume to share data between components
    vop = dsl.VolumeOp(
        name="create_volume",
        resource_name="data-volume",
        size="1Gi",
        modes=dsl.VOLUME_MODE_RWO)
    
    # Define Pipeline Components and dependencies
    # We do this with ContainerOp, an object that defines a pipeline component from a container.
    
    # Create Heart Failure preprocessing component.
    heart_failure_preprocessing_container = preprocess_op(data_path).add_pvolumes({data_path: vop.volume})
    
    # Create Heart Failure analysis component
    heart_failure_analyze_container = analyze_op(data_path).add_pvolumes({data_path: vop.volume})
    
    # Create Heart Failure Feature Engineering component
#     heart_failure_feature_engineer_container = feature_engineer_op(data_path) \
#                                                 .add_pvolumes({data_path: heart_failure_preprocessing_container.pvolume})
    
    # Create Heart Failure Scale and transform component
    heart_failure_scale_transform_container = scale_transform_op(data_path) \
                                                .add_pvolumes({data_path: heart_failure_preprocessing_container.pvolume})
    
    # Create Heart Failure training component
    heart_failure_training_container = training_op(data_path, classifier_file) \
                                        .add_pvolumes({data_path: heart_failure_scale_transform_container.pvolume})
    
    # Create Heart Failure model evaluation component
    heart_failure_model_validation_container = model_validation_op(data_path, classifier_file) \
                                        .add_pvolumes({data_path: heart_failure_training_container.pvolume})
    
    # Create Heart Failure confusion matrix component
    heart_failure_confusion_matrix_container = confusion_matrix_op(data_path, classifier_file) \
                                        .add_pvolumes({data_path: heart_failure_model_validation_container.pvolume})
    
    # Create Heart Failure ROC Curve component
#     heart_failure_roc_container = roc_op(data_path, classifier_file) \
#                                         .add_pvolumes({data_path: heart_failure_model_validation.pvolume})


    
    # Print the result of the prediction
    Heart_Failure_result_container = dsl.ContainerOp(
        name="Heart Failure prediction",  # the name displayed for the component execution during runtime.
        image='library/bash:4.4.23',      # Image tag for the Docker container to be used.
        pvolumes={data_path: heart_failure_model_validation_container.pvolume}, # dictionary of paths and associated Persistent Volumes to be mounted to the container before execution.
        arguments=['cat', f'{data_path}/classifier_result.txt'] # command to be run by the container at runtime.
    )

## Compile and run the pipeline

- Finally, we feed our pipeline definition into the compiler and run it as an experiment. This prints two links below the cell that lead to the [Kubeflow Pipelines UI](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/), where you can check logs, artifacts, and inputs/outputs, and watch the progress of your pipeline.


- Kubeflow Pipelines lets you group pipeline runs by Experiments. You can create a new experiment, or call `kfp.Client().list_experiments()` to see existing ones. If you don't specify the experiment name, the Default experiment will be used.

Define some constants that are used as inputs at various points in the pipeline.

In [24]:
DATA_PATH = '/mnt'  # path where the shared volume is mounted inside each container
CLASSIFIER_PATH = 'heart_main.pkl'

In [25]:
pipeline_func = Heart_Failure_container_pipeline

In [26]:
experiment_name=EXPERIMENT_NAME
run_name = pipeline_func.__name__ + ' run'


arguments = {"data_path":DATA_PATH,
             "classifier_file":CLASSIFIER_PATH}


# Compile pipeline to generate compressed YAML definition of the pipeline.
kfp.compiler.Compiler().compile(pipeline_func,'{}.zip'.format(experiment_name))



# Submit pipeline directly from pipeline function
run_result = client.create_run_from_pipeline_func(pipeline_func, 
                                                  experiment_name=experiment_name, 
                                                  run_name=run_name, 
                                                  arguments=arguments)

