# Demo KFP pipeline with fairness and energy monitoring

This notebook demonstrates fairness and energy consumption monitoring in a single-run OSS pipeline. The monitoring relies on the following open-source GitHub repositories:

- Data: https://github.com/socialfoundations/folktables
- Fairness: https://github.com/Trusted-AI/AIF360
- Energy consumption: https://github.com/hubblo-org/scaphandre

The data is 2014 US Census PUMS data (https://www.census.gov/programs-surveys/acs/microdata.html) from California, which the Folktables suite preprocesses into the ACSIncome format. ACSIncome aims to replicate the prediction task standardized in ML fairness research by the widely used UCI Adult dataset. The task is to build a model that uses the available features to predict whether an individual has an income higher than $50,000. The label is stored in the column PINCP (0 for no, 1 for yes).

The sensitive attributes of this data are age (AGEP), gender (SEX), and ethnicity (RAC1P), with encodings found here: https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2014-2018.pdf. In this simple demonstration we set the privileged group to white men aged 50 or under ([{"AGEP":1, "SEX": 1, "RAC1P": 1}]) and the unprivileged group to non-white women over 50 ([{"AGEP":0, "SEX": 0, "RAC1P": 0}]). The binarization thresholds are therefore 50 for AGEP, 1 for SEX, and 1 for RAC1P: values at or below the threshold become 1 and values above it become 0. AIF360 supports more fine-grained groups by adding more dict elements, such as {'AGEP': 1}, to the privileged or unprivileged groups. Because the groups used here are defined this coarsely, the resulting scores should be seen only as a technical demonstration and nothing else.
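The binarization described above can be sketched in plain Python. This is a minimal illustration rather than the pipeline's own code: the raw records are made up, and `binarize_row` and `in_group` are hypothetical helper names. The group definitions use the list-of-dicts form from the text, where a row belongs to a group list if it matches every key of at least one dict.

```python
# Thresholds as stated above: a value at or below the threshold maps to 1.
THRESHOLDS = {"AGEP": 50, "SEX": 1, "RAC1P": 1}


def binarize_row(row: dict, thresholds: dict) -> dict:
    """Map each sensitive attribute to 1 (<= threshold) or 0 (> threshold)."""
    return {name: int(row[name] <= limit) for name, limit in thresholds.items()}


def in_group(row: dict, groups: list) -> bool:
    """A row is in the group list if it matches all keys of any one dict."""
    return any(all(row[k] == v for k, v in group.items()) for group in groups)


# Hypothetical raw records, made up for illustration only.
raw_rows = [
    {"AGEP": 34, "SEX": 1, "RAC1P": 1},
    {"AGEP": 62, "SEX": 2, "RAC1P": 6},
]
binary_rows = [binarize_row(row, THRESHOLDS) for row in raw_rows]

privileged_groups = [{"AGEP": 1, "SEX": 1, "RAC1P": 1}]
unprivileged_groups = [{"AGEP": 0, "SEX": 0, "RAC1P": 0}]

for row in binary_rows:
    print(row, in_group(row, privileged_groups), in_group(row, unprivileged_groups))
```

Rows that match neither group, such as a white woman under 50, are left out of both groups with these definitions, which is one reason the scores are only a technical demonstration.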

# Dashboard screenshot
![Dashboard screenshot](dashboard-screenshot.png)

# Scaphandre, Prometheus alerts and Grafana dashboard setup

Before we run the pipeline, we need to set up Scaphandre, Prometheus alerts, and a Grafana dashboard to handle the metrics.

To set up Scaphandre, we use the following official documentation, in the given order, to get the required commands:

- Kubernetes: https://hubblo-org.github.io/scaphandre-documentation/tutorials/kubernetes.html
- Prometheus: https://hubblo-org.github.io/scaphandre-documentation/references/exporter-prometheus.html
- Grafana: https://hubblo-org.github.io/scaphandre-documentation/how-to_guides/get-process-level-power-in-grafana.html

We first need to clone the Scaphandre GitHub repository with the command 'git clone https://github.com/hubblo-org/scaphandre'. I cloned it into the folder where I stored the OSS clone, but any other location should work. When the cloning is done, run 'cd scaphandre', check that Helm is installed with 'helm version', and then run 'helm install scaphandre helm/scaphandre'.

However, before the last step you might need to change the default port value from 8080 to 8081 in 'scaphandre/helm/scaphandre/values.yaml', as shown in the provided 'modified-scaphandre-values.yaml', because 8080 is also used by KFP, KServe, and Prometheus. If Scaphandre is already installed and you want to run it with the modified YAML, remove the release with 'helm delete scaphandre' (use 'helm list' to confirm the name) and rerun the install command.
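If it is unclear whether a port is already taken on the machine where the port-forwards run, a quick check like the following can help. This is only a sketch using the Python standard library: `port_in_use` is a hypothetical helper, and it tests local listeners only, not ports inside the cluster.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds
        return sock.connect_ex((host, port)) == 0


for candidate in (8080, 8081):
    state = "in use" if port_in_use(candidate) else "free"
    print(f"Port {candidate} is {state}")
```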

You can check the Scaphandre configuration by running 'kubectl get pods' and then 'kubectl describe pod (pod name)', which should show 8081/TCP under Containers and Port. A faster way of checking this is 'kubectl get services'. To get Scaphandre to send metrics to Prometheus, open a shell in the pod with 'kubectl exec -it (pod name) -- /bin/bash' and then run 'scaphandre prometheus --port 8081'.

If no errors are given, the setup is ready, and Prometheus should be able to query the energy consumption metrics. The Prometheus exporter can also be started with plain 'scaphandre prometheus', but this causes errors because the default port is already in use. If a correct setup starts to throw errors unrelated to the port, caused by either a non-optimal configuration or an unsuitable Prometheus query, stop the exporter with CTRL + C and rerun the start command.
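To confirm that Prometheus can query the energy metrics, one option is its HTTP instant-query endpoint. The sketch below only builds the query URL and parses a response body, so it runs without a cluster. The base URL is an assumption (for example a local port-forward), the helper names are hypothetical, and the sample response is made up in the shape the Prometheus API returns. `scaph_host_power_microwatts` is one of the metrics exported by Scaphandre.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumed address, adjust to your setup
METRIC = "scaph_host_power_microwatts"    # Scaphandre host power metric


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL of a Prometheus instant query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body: str) -> list:
    """Return the sample values of an instant-query response as floats."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    # each result holds a [timestamp, "value"] pair
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


# Illustrative response body; the numbers are made up.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": METRIC}, "value": [1700000000.0, "12345678"]},
        ],
    },
})

print(build_query_url(PROMETHEUS_URL, METRIC))
print(extract_values(sample_body))
```

Against a live cluster, the body would come from an HTTP GET on the built URL, for example with `requests.get(url).text`. An empty result list suggests that Prometheus is not scraping the exporter yet.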

To set up Prometheus alerts for fairness metrics, we must modify the default 'prometheus-config-map.yaml' to include fairness alerts. We recommend first moving the default YAML somewhere safe and then renaming the provided 'fairness-alert-prometheus-config-map.yaml' to 'prometheus-config-map.yaml'. After that, we only need to run 'kubectl apply -k deployment/monitoring' and 'kubectl rollout restart deployment/prometheus-deployment -n monitoring', and wait a moment for the modifications to take effect.

To set up the Grafana dashboard for fairness and energy consumption metrics, we only need to click Import under Create and upload the provided JSON file named 'grafana_fairness_consumption_monitoring_1.json'. The dashboard will be empty, except for the Prometheus alerts and consumption plots, until the KFP pipeline has completed the evaluation step. As long as Prometheus can query the fairness and energy consumption metrics after the KFP run, Grafana should also work.

# KFP setup

The requirements for running this code in Jupyter using a virtual environment are:

- pip install notebook
- pip install kfp~=1.8.14

Below we provide the necessary imports and set up the KFP client.

In [None]:
# Imports
import warnings
warnings.filterwarnings("ignore")

import kfp
import kfp.dsl as dsl
from kfp.aws import use_aws_secret
from kfp.v2.dsl import (
    component,
    Input,
    Output,
    Dataset,
    Metrics,
    Artifact,
    Model
)

## 1. Connect to client

The default way of accessing Kubeflow is via port-forward. This enables you to get started quickly without imposing any requirements on your environment. Run the following to port-forward Istio's Ingress-Gateway to local port `8080`:

```sh
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```

In [None]:
import re
import requests
from urllib.parse import urlsplit

def get_istio_auth_session(url: str, username: str, password: str) -> dict:
    """
    Determine if the specified URL is secured by Dex and try to obtain a session cookie.
    WARNING: only Dex `staticPasswords` and `LDAP` authentication are currently supported
             (we default to using `staticPasswords` if both are enabled)

    :param url: Kubeflow server URL, including protocol
    :param username: Dex `staticPasswords` or `LDAP` username
    :param password: Dex `staticPasswords` or `LDAP` password
    :return: auth session information
    """
    # define the default return object
    auth_session = {
        "endpoint_url": url,    # KF endpoint URL
        "redirect_url": None,   # KF redirect URL, if applicable
        "dex_login_url": None,  # Dex login URL (for POST of credentials)
        "is_secured": None,     # True if KF endpoint is secured
        "session_cookie": None  # Resulting session cookies in the form "key1=value1; key2=value2"
    }

    # use a persistent session (for cookies)
    with requests.Session() as s:

        ################
        # Determine if Endpoint is Secured
        ################
        resp = s.get(url, allow_redirects=True)
        if resp.status_code != 200:
            raise RuntimeError(
                f"HTTP status code '{resp.status_code}' for GET against: {url}"
            )

        auth_session["redirect_url"] = resp.url

        # if we were NOT redirected, then the endpoint is UNSECURED
        if len(resp.history) == 0:
            auth_session["is_secured"] = False
            return auth_session
        else:
            auth_session["is_secured"] = True

        ################
        # Get Dex Login URL
        ################
        redirect_url_obj = urlsplit(auth_session["redirect_url"])

        # if we are at `/auth?=xxxx` path, we need to select an auth type
        if re.search(r"/auth$", redirect_url_obj.path):

            #######
            # TIP: choose the default auth type by including ONE of the following
            #######

            # OPTION 1: set "staticPasswords" as default auth type
            redirect_url_obj = redirect_url_obj._replace(
                path=re.sub(r"/auth$", "/auth/local", redirect_url_obj.path)
            )
            # OPTION 2: set "ldap" as default auth type
            # redirect_url_obj = redirect_url_obj._replace(
            #     path=re.sub(r"/auth$", "/auth/ldap", redirect_url_obj.path)
            # )

        # if we are at `/auth/xxxx/login` path, then no further action is needed (we can use it for login POST)
        if re.search(r"/auth/.*/login$", redirect_url_obj.path):
            auth_session["dex_login_url"] = redirect_url_obj.geturl()

        # else, we need to be redirected to the actual login page
        else:
            # this GET should redirect us to the `/auth/xxxx/login` path
            resp = s.get(redirect_url_obj.geturl(), allow_redirects=True)
            if resp.status_code != 200:
                raise RuntimeError(
                    f"HTTP status code '{resp.status_code}' for GET against: {redirect_url_obj.geturl()}"
                )

            # set the login url
            auth_session["dex_login_url"] = resp.url

        ################
        # Attempt Dex Login
        ################
        resp = s.post(
            auth_session["dex_login_url"],
            data={"login": username, "password": password},
            allow_redirects=True
        )
        if len(resp.history) == 0:
            raise RuntimeError(
                f"Login credentials were probably invalid - "
                f"No redirect after POST to: {auth_session['dex_login_url']}"
            )

        # store the session cookies in a "key1=value1; key2=value2" string
        auth_session["session_cookie"] = "; ".join([f"{c.name}={c.value}" for c in s.cookies])

    return auth_session

In [None]:
import kfp

KUBEFLOW_ENDPOINT = "http://localhost:8080"
KUBEFLOW_USERNAME = "user@example.com"
KUBEFLOW_PASSWORD = "12341234"

auth_session = get_istio_auth_session(
    url=KUBEFLOW_ENDPOINT,
    username=KUBEFLOW_USERNAME,
    password=KUBEFLOW_PASSWORD
)

client = kfp.Client(host=f"{KUBEFLOW_ENDPOINT}/pipeline", cookies=auth_session["session_cookie"])
# print(client.list_experiments())

# Pull data

Here we create a KFP component, which uses Folktables functions to get the data and preprocess it into a suitable format for the prediction task. This data is then made into an artifact for further usage. 

In [None]:
@component(
    base_image="python:3.10",
    packages_to_install=["pandas~=1.4.2","numpy","folktables"],
    output_component_file='components/pull_data_component.yaml',
)
def pull_data(
    state: str, 
    year: int, 
    data: Output[Dataset]
):
    """
    Pull data component.
    """
    import pandas as pd
    from pathlib import Path
    import numpy as np
    from folktables import ACSDataSource, ACSIncome
    
    pull_data_component_landmark = 'KFP_component'
    
    source = ACSDataSource(survey_year=year, horizon='1-Year', survey='person')
    state_data = source.get_data(states=[state], download=True)
    state_features, state_labels, _ = ACSIncome.df_to_pandas(state_data)
    df = pd.concat([state_features,state_labels],axis=1)
    df.to_csv(data.path, index=None)

# Preprocess

Here we create a component that turns the sensitive attributes into binary columns using the given thresholds, scales the remaining features (everything except the sensitive attributes and the label), divides the data into three parts (train, test, and drift), and stores these parts as artifacts for later use.
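The three-way split can be illustrated on its own with the standard library. This is a minimal sketch of the same idea (shuffle with a fixed seed, then cut consecutive slices), not the component itself: `split_indices` is a hypothetical helper, and unlike the component it corrects the last part for rounding in both directions.

```python
import random
from itertools import islice


def split_indices(n_rows: int, splits: list, seed: int = 42) -> list:
    """Shuffle row indices reproducibly and cut them into consecutive parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    amounts = [round(n_rows * share) for share in splits]
    # rounding can make the parts sum to more or less than n_rows,
    # so the last part absorbs the difference
    amounts[-1] += n_rows - sum(amounts)

    it = iter(indices)
    # each islice call consumes the next `amount` items from the shared iterator
    return [list(islice(it, amount)) for amount in amounts]


train_idx, test_idx, drift_idx = split_indices(10, [0.6, 0.2, 0.2])
print(len(train_idx), len(test_idx), len(drift_idx))
```

Because all slices read from one shared iterator, the parts cannot overlap, and together they cover every row exactly once.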

In [None]:
@component(
    base_image="python:3.10",
    packages_to_install=["pandas~=1.4.2", "scikit-learn~=1.0.2", "numpy"],
    output_component_file='components/preprocess_component.yaml',
)
def preprocess(
    data: Input[Dataset],
    train_set: Output[Dataset],
    test_set: Output[Dataset],
    drift_set: Output[Dataset],
    label_attribute: str,
    sensitive_attributes: list,
    splits: list,
    group_thresholds: list
):
    """
    Preprocess component.
    """
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    import numpy as np
    import random
    from itertools import islice
    
    preprocess_component_landmark = 'KFP_component'
    
    l_a = label_attribute
    s_a = sensitive_attributes
    g_t = group_thresholds
   
    data = pd.read_csv(data.path)
    
    attribute_amount = len(s_a) + 1
    non_scalable_attribute_identity = []
    non_scalable_attribute_values = []

    for i in range(0,attribute_amount):
        if i == attribute_amount-1:
            non_scalable_attribute_identity.append(l_a)
            non_scalable_attribute_values.append(data[l_a].astype(int))
            del data[l_a]
            continue

        # values at or below the threshold become 1, values above it become 0
        values = (data[s_a[i]] <= g_t[i]).astype(int)
        non_scalable_attribute_identity.append(s_a[i])
        non_scalable_attribute_values.append(values)
        del data[s_a[i]]

    scaler = StandardScaler()
    bin_df = pd.DataFrame(scaler.fit_transform(data), 
                                 columns=data.columns) 
    
    index = 0
    for name in non_scalable_attribute_identity:
        bin_df[name] = np.array(non_scalable_attribute_values[index]).astype(int)
        index = index + 1

    index = np.array(bin_df.index)
    random.seed(42)
    random.shuffle(index)

    amounts = [round(bin_df.shape[0]*splits[0]), 
               round(bin_df.shape[0]*splits[1]), 
               round(bin_df.shape[0]*splits[2])]

    # rounding can overshoot the row count; trim the last split if so
    if bin_df.shape[0] < sum(amounts):
        amounts[2] = amounts[2] + (bin_df.shape[0] - sum(amounts))

    it = iter(index)

    sliced = [list(islice(it, 0, i)) for i in amounts]

    train_data = bin_df.loc[sliced[0]]
    test_data = bin_df.loc[sliced[1]]
    drift_data = bin_df.loc[sliced[2]]

    train_data.to_csv(train_set.path, index=None)
    test_data.to_csv(test_set.path, index=None)
    drift_data.to_csv(drift_set.path, index=None)

# Train

Here we create a component that defines the metrics, converts the datasets into a form suitable for AIF360 metrics (with a favorable label of 1 and an unfavorable label of 0), trains a logistic regression model, calculates the metrics, and stores them in MLflow for model comparison. Some of the code is reused from the original OSS pipeline. The fairness metrics are statistical parity, disparate impact, equal opportunity difference, average odds difference, and Theil index, which are the most common ones in the provided AIF360 tutorials. Notice that there are fewer dataset metrics than model metrics.
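To make the group fairness metrics more concrete, the sketch below computes three of them by hand on made-up labels and predictions. It follows the usual definitions (unprivileged minus privileged for the differences, unprivileged over privileged for the ratio). The helper names are hypothetical, and the pipeline itself relies on AIF360 for these values.

```python
def favorable_rate(labels: list) -> float:
    """Share of favorable outcomes (label 1) in a group."""
    return sum(labels) / len(labels)


def true_positive_rate(y_true: list, y_pred: list) -> float:
    """Share of actual positives that were predicted positive."""
    predicted_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(predicted_for_positives) / len(predicted_for_positives)


# Made-up ground truth and predictions for each group.
priv_true, priv_pred = [1, 1, 0, 0], [1, 1, 1, 0]
unpriv_true, unpriv_pred = [1, 1, 0, 0], [1, 0, 0, 0]

# Statistical parity difference: 0 means equal favorable rates
spd = favorable_rate(unpriv_pred) - favorable_rate(priv_pred)
# Disparate impact: 1 means equal favorable rates
di = favorable_rate(unpriv_pred) / favorable_rate(priv_pred)
# Equal opportunity difference: 0 means equal true positive rates
eod = true_positive_rate(unpriv_true, unpriv_pred) - true_positive_rate(priv_true, priv_pred)

print(f"SPD={spd:.3f} DI={di:.3f} EOD={eod:.3f}")
```

Negative differences and a ratio below 1 indicate that the unprivileged group receives the favorable outcome less often than the privileged group.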

In [None]:
from typing import NamedTuple

@component(
    base_image="python:3.10",
    packages_to_install=["numpy", 
                         "pandas~=1.4.2",
                         "aif360",
                         "scikit-learn~=1.0.2", 
                         "mlflow~=2.4.1", 
                         "boto3~=1.21.0"],
    output_component_file='components/train_component.yaml',
)
def train(
    train_set: Input[Dataset],
    test_set: Input[Dataset],
    saved_model: Output[Model],
    model_name: str,
    label_attribute: str,
    sensitive_attributes: list,
    privilaged_groups: list,
    unprivilaged_groups: list,
    mlflow_experiment_name: str,
    mlflow_tracking_uri: str,
    mlflow_s3_endpoint_url: str
) -> NamedTuple("Output", [('storage_uri', str), ('run_id', str),]):
    """
    Train component.
    """
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    
    from aif360.datasets import BinaryLabelDataset
    from aif360.metrics import BinaryLabelDatasetMetric
    from aif360.metrics import ClassificationMetric
    from sklearn.metrics import accuracy_score,confusion_matrix,precision_score,recall_score,f1_score
    
    import mlflow
    import mlflow.sklearn
    import os
    import logging
    import pickle
    from collections import namedtuple
    
    train_component_landmark = 'KFP_component'
    
    l_a = label_attribute
    s_a = sensitive_attributes
    p_g = privilaged_groups
    u_g = unprivilaged_groups

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    def dataset_fairness(dataset,p_g,u_g):
        metrics_list = []
        dataset_metrics = BinaryLabelDatasetMetric(dataset, 
                                       privileged_groups = p_g,
                                       unprivileged_groups = u_g)
        
        SP = dataset_metrics.mean_difference()
        DI = dataset_metrics.disparate_impact()
        
        # Statistical parity
        metrics_list.append({'name': 'D_SP', 
                             'value': SP })
        
        # Disparate impact
        metrics_list.append({'name': 'D_DI', 
                             'value': DI })
        
        return metrics_list
    
    def model_metrics(dataset, pred, p_g, u_g):
        metrics_list = []
        
        Acc = accuracy_score(dataset.labels, pred)
        
        # Accuracy
        metrics_list.append({'name': 'M_Acc', 
                             'value': Acc })
        
        matrix = confusion_matrix(dataset.labels, pred)
        
        # scikit-learn orders the matrix by label, so for labels 0 and 1
        # the layout is [[TN, FP], [FN, TP]]
        # True positives
        metrics_list.append({'name': 'M_TP', 
                             'value': matrix[1][1]})
        # False positives
        metrics_list.append({'name': 'M_FP', 
                             'value': matrix[0][1]})
        # False negatives
        metrics_list.append({'name': 'M_FN', 
                             'value': matrix[1][0]})
        # True negatives
        metrics_list.append({'name': 'M_TN', 
                             'value': matrix[0][0]})
        
        dataset_pred = dataset.copy()
        dataset_pred.labels = pred
        
        model_metrics = ClassificationMetric(
                        dataset,
                        dataset_pred,
                        privileged_groups = p_g,
                        unprivileged_groups = u_g)
        
        BA = (model_metrics.true_positive_rate() + model_metrics.true_negative_rate()) / 2
         
        # Balanced accuracy
        metrics_list.append({'name': 'M_BA', 
                             'value': BA})   
            
        SP = model_metrics.mean_difference()
        DI = model_metrics.disparate_impact()
        AOD = model_metrics.average_odds_difference()
        EOD = model_metrics.equal_opportunity_difference()
        TI = model_metrics.theil_index()
        
        # Statistical parity
        metrics_list.append({'name': 'M_SP', 
                             'value': SP})
        # Disparate impact
        metrics_list.append({'name': 'M_DI', 
                             'value': DI})
        
        # Average odds difference
        metrics_list.append({'name': 'M_AOD', 
                             'value': AOD})
        
        # Equal opportunity difference
        metrics_list.append({'name': 'M_EOD', 
                             'value': EOD})
        
        # Theil index
        metrics_list.append({'name': 'M_TI', 
                             'value': TI})
        
        return metrics_list
    
    os.environ['MLFLOW_S3_ENDPOINT_URL'] = mlflow_s3_endpoint_url

    # load data
    logger.info("Setting up data")
    train_data = pd.read_csv(train_set.path)
    test_data = pd.read_csv(test_set.path)
    
    train = BinaryLabelDataset(
                   favorable_label = 1,
                   unfavorable_label = 0,
                   df = train_data,
                   label_names = [l_a],
                   protected_attribute_names = s_a)

    test = BinaryLabelDataset(
               favorable_label = 1,
               unfavorable_label = 0,
               df = test_data,
               label_names = [l_a],
               protected_attribute_names= s_a)
    
    logger.info("Checking training and test data fairness")
    train_fairness = dataset_fairness(train,p_g,u_g)
    test_fairness = dataset_fairness(test,p_g,u_g)

    # The predicted label column (label_attribute) is either 0 or 1
    train_x = train.features 
    test_x = test.features 
    train_y = train.labels 
    test_y = test.labels 
    
    logger.info(f"Using MLflow tracking URI: {mlflow_tracking_uri}")
    mlflow.set_tracking_uri(mlflow_tracking_uri)

    logger.info(f"Using MLflow experiment: {mlflow_experiment_name}")
    mlflow.set_experiment(mlflow_experiment_name)

    with mlflow.start_run() as run:

        run_id = run.info.run_id
        logger.info(f"Run ID: {run_id}")

        model = LogisticRegression(random_state=42)
        
        logger.info("Fitting model...")
        model.fit(train_x, train_y)

        logger.info("Predicting...")
        
        predicted_qualities = model.predict(test_x)
        
        model_metrics = model_metrics(test, predicted_qualities, p_g, u_g)
        
        logger.info("Logging training data metrics to MLflow")
        for pair in train_fairness:
            name = 'Tr_' + pair['name']
            mlflow.log_metric(name, pair['value'])
        
        logger.info("Logging test data metrics to MLflow")
        for pair in test_fairness:
            name = 'Te_' + pair['name']
            mlflow.log_metric(name, pair['value'])
        
        logger.info("Logging model metrics to MLflow")
        for pair in model_metrics:
            mlflow.log_metric(pair['name'], pair['value'])
        
        # save model to mlflow
        logger.info("Logging trained model")
        mlflow.sklearn.log_model(
            model,
            model_name,
            registered_model_name="USCensusLR",
            serialization_format="pickle"
        )

        logger.info("Logging predictions artifact to MLflow")
        np.save("predictions.npy", predicted_qualities)
        mlflow.log_artifact(
            local_path="predictions.npy", artifact_path="predicted_qualities/"
        )

        # save model as KFP artifact
        logger.info(f"Saving model to: {saved_model.path}")
        with open(saved_model.path, 'wb') as fp:
            pickle.dump(model, fp, pickle.HIGHEST_PROTOCOL)

        # prepare output
        output = namedtuple('Output', ['storage_uri', 'run_id'])

        # return str(mlflow.get_artifact_uri())
        return output(mlflow.get_artifact_uri(), run_id)

# Evaluate

Here we define a component that gets the stored metrics, pushes them to a Prometheus Pushgateway, and evaluates them against the given thresholds before going to the next phase. The Pushgateway is a ready-made component of the OSS pipeline, which Prometheus scrapes once it is given the correct gateway URL. Prometheus is set up with a slightly modified YAML configuration to alert when accuracy and fairness metrics cross the given thresholds. Prometheus also enables Grafana to easily visualize the metrics to provide a general overview of the deployed model and the cluster.
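The threshold check itself is simple enough to try in isolation. The sketch below mirrors the rule used in the component, where every threshold is treated as a minimum and a missing metric counts as a failure. The metric values and thresholds are made up, and `failed_metrics` is a hypothetical helper name.

```python
def failed_metrics(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if name not in metrics or metrics[name] < minimum
    ]


# Made-up values for illustration only.
metrics = {"M_Acc": 0.81, "M_DI": 0.72}
thresholds = {"M_Acc": 0.75, "M_DI": 0.80, "M_TI": 0.0}

failed = failed_metrics(metrics, thresholds)
print("Evaluation passed" if not failed else f"Evaluation failed for: {failed}")
```

Note that a minimum-only rule fits metrics where higher is better. For difference metrics whose ideal value is 0, a threshold on the absolute value may be a better fit, which this sketch does not cover.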

In [None]:
@component(
    base_image="python:3.10",
    packages_to_install=["numpy", "mlflow~=1.25.0", "prometheus_client"],
    output_component_file='components/evaluate_component.yaml',
)
def evaluate(
    run_id: str,
    mlflow_tracking_uri: str,
    threshold_metrics: dict
) -> bool:
    """
    Evaluate component: Compares metrics from training with given thresholds.

    Args:
        run_id (string):  MLflow run ID
        mlflow_tracking_uri (string): MLflow tracking URI
        threshold_metrics (dict): Minimum threshold values for each metric
    Returns:
        Bool indicating whether evaluation passed or failed.
    """
    from mlflow.tracking import MlflowClient
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
    import requests
    import json
    import logging
    
    evaluate_component_landmark = 'KFP_component'
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    client = MlflowClient(tracking_uri=mlflow_tracking_uri)
    info = client.get_run(run_id)
    training_metrics = info.data.metrics

    logger.info(f"Training metrics: {training_metrics}")
    
    registry = CollectorRegistry()
    url = 'http://prometheus-pushgateway.monitoring.svc.cluster.local:9091'
    for key, value in training_metrics.items():
        metric = Gauge(key, 'Metric', registry = registry)
        metric.set(value)
    push_to_gateway(url, job = 'Metrics', registry = registry)
    
    # compare the evaluation metrics with the defined thresholds
    for key, value in threshold_metrics.items():
        if (key not in training_metrics) or (training_metrics[key] < value):
            logger.error(f"Metric {key} failed. Evaluation not passed!")
            return False
    return True

# Deploy model

Here we define a component that deploys the passed model into an inference service. This component is identical to the deploy model component of the original OSS demo pipeline.

In [None]:
@component(
    base_image="python:3.9",
    packages_to_install=["kserve==0.11.0"],
    output_component_file='components/deploy_model_component.yaml',
)
def deploy_model(model_name: str, storage_uri: str):
    """
    Deploy the model as an inference service with KServe.
    """
    from kubernetes import client
    from kserve import KServeClient
    from kserve import constants
    from kserve import utils
    from kserve import V1beta1InferenceService
    from kserve import V1beta1InferenceServiceSpec
    from kserve import V1beta1PredictorSpec
    from kserve import V1beta1SKLearnSpec
    import logging
    
    deploy_model_component_landmark = 'KFP_component'
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    model_uri = f"{storage_uri}/{model_name}"
    logger.info("MODEL URI: %s", model_uri)

    namespace = utils.get_default_target_namespace()
    kserve_version='v1beta1'
    api_version = constants.KSERVE_GROUP + '/' + kserve_version

    isvc = V1beta1InferenceService(
        api_version=api_version,
        kind=constants.KSERVE_KIND,
        metadata=client.V1ObjectMeta(
            name=model_name,
            namespace=namespace,
            annotations={'sidecar.istio.io/inject':'false'}
        ),
        spec=V1beta1InferenceServiceSpec(
            predictor=V1beta1PredictorSpec(
                service_account_name="kserve-sa",
                sklearn=V1beta1SKLearnSpec(
                    storage_uri=model_uri
                )
            )
        )
    )
    KServe = KServeClient()
    KServe.create(isvc)

# Inference

Here we define a component that tests the deployed inference service. The only difference between this and the original OSS pipeline component is that this one sends two already preprocessed samples for simplicity.
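The request that the component sends can be assembled without a cluster, which makes its shape easier to see. The sketch below builds the URL, Host header, and JSON body in the same way as the component. The model name, namespace, and short sample rows are made up, and `build_predict_request` is a hypothetical helper name.

```python
import json

GATEWAY = "http://istio-ingressgateway.istio-system.svc.cluster.local:80"


def build_predict_request(model_name: str, namespace: str, instances: list) -> tuple:
    """Return the URL, headers, and JSON body of a KServe v1 predict call."""
    url = f"{GATEWAY}/v1/models/{model_name}:predict"
    # the Host header routes the request to the right inference service
    headers = {"Host": f"{model_name}.{namespace}.example.com"}
    body = json.dumps({"instances": instances})
    return url, headers, body


# Hypothetical names and shortened sample rows for illustration.
url, headers, body = build_predict_request(
    model_name="demo-model",
    namespace="demo-namespace",
    instances=[[0.1, -0.2, 1.0], [0.3, 0.4, 0.0]],
)
print(url)
print(headers)
print(body)
```

In the real component each instance must contain all features in the order the model was trained on, with the scaled features and the binarized sensitive attributes already applied.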

In [None]:
@component(
    base_image="python:3.9",  # kserve on python 3.10 comes with a dependency that fails to get installed
    packages_to_install=["kserve==0.11.0", "scikit-learn~=1.0.2"],
    output_component_file='components/inference_component.yaml',
)
def inference(
    model_name: str
):
    """
    Test inference.
    """
    from kserve import KServeClient
    from kserve import utils
    import requests
    import pickle
    import logging
    from urllib.parse import urlsplit
    import re
    
    inference_component_landmark = 'KFP_component'
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    def get_istio_auth_session(url: str, username: str, password: str) -> dict:
        """
        Determine if the specified URL is secured by Dex and try to obtain a session cookie.
        WARNING: only Dex `staticPasswords` and `LDAP` authentication are currently supported
                 (we default to using `staticPasswords` if both are enabled)
    
        :param url: Kubeflow server URL, including protocol
        :param username: Dex `staticPasswords` or `LDAP` username
        :param password: Dex `staticPasswords` or `LDAP` password
        :return: auth session information
        """
        # define the default return object
        auth_session = {
            "endpoint_url": url,    # KF endpoint URL
            "redirect_url": None,   # KF redirect URL, if applicable
            "dex_login_url": None,  # Dex login URL (for POST of credentials)
            "is_secured": None,     # True if KF endpoint is secured
            "session_cookie": None  # Resulting session cookies in the form "key1=value1; key2=value2"
        }
    
        # use a persistent session (for cookies)
        with requests.Session() as s:
    
            ################
            # Determine if Endpoint is Secured
            ################
            resp = s.get(url, allow_redirects=True)
            if resp.status_code != 200:
                raise RuntimeError(
                    f"HTTP status code '{resp.status_code}' for GET against: {url}"
                )
    
            auth_session["redirect_url"] = resp.url
    
            # if we were NOT redirected, then the endpoint is UNSECURED
            if len(resp.history) == 0:
                auth_session["is_secured"] = False
                return auth_session
            else:
                auth_session["is_secured"] = True
    
            ################
            # Get Dex Login URL
            ################
            redirect_url_obj = urlsplit(auth_session["redirect_url"])
    
            # if we are at `/auth?=xxxx` path, we need to select an auth type
            if re.search(r"/auth$", redirect_url_obj.path):
    
                #######
                # TIP: choose the default auth type by including ONE of the following
                #######
    
                # OPTION 1: set "staticPasswords" as default auth type
                redirect_url_obj = redirect_url_obj._replace(
                    path=re.sub(r"/auth$", "/auth/local", redirect_url_obj.path)
                )
                # OPTION 2: set "ldap" as default auth type
                # redirect_url_obj = redirect_url_obj._replace(
                #     path=re.sub(r"/auth$", "/auth/ldap", redirect_url_obj.path)
                # )
    
            # if we are at `/auth/xxxx/login` path, then no further action is needed (we can use it for login POST)
            if re.search(r"/auth/.*/login$", redirect_url_obj.path):
                auth_session["dex_login_url"] = redirect_url_obj.geturl()
    
            # else, we need to be redirected to the actual login page
            else:
                # this GET should redirect us to the `/auth/xxxx/login` path
                resp = s.get(redirect_url_obj.geturl(), allow_redirects=True)
                if resp.status_code != 200:
                    raise RuntimeError(
                        f"HTTP status code '{resp.status_code}' for GET against: {redirect_url_obj.geturl()}"
                    )
    
                # set the login url
                auth_session["dex_login_url"] = resp.url
    
            ################
            # Attempt Dex Login
            ################
            resp = s.post(
                auth_session["dex_login_url"],
                data={"login": username, "password": password},
                allow_redirects=True
            )
            if len(resp.history) == 0:
                raise RuntimeError(
                    f"Login credentials were probably invalid - "
                    f"No redirect after POST to: {auth_session['dex_login_url']}"
                )
    
            # store the session cookies in a "key1=value1; key2=value2" string
            auth_session["session_cookie"] = "; ".join([f"{c.name}={c.value}" for c in s.cookies])
    
        return auth_session
    
    KUBEFLOW_ENDPOINT = "http://istio-ingressgateway.istio-system.svc.cluster.local:80"
    KUBEFLOW_USERNAME = "user@example.com"
    KUBEFLOW_PASSWORD = "12341234"
    
    auth_session = get_istio_auth_session(
        url=KUBEFLOW_ENDPOINT,
        username=KUBEFLOW_USERNAME,
        password=KUBEFLOW_PASSWORD,
    )
    TOKEN = auth_session["session_cookie"].replace("authservice_session=", "")
    print("Token:", TOKEN)

    namespace = utils.get_default_target_namespace()
    
    input_sample = [
        [-0.07237661,  0.92825499,  1.31739921, -1.31196944, -0.73834281,
         -0.09041844,  0.18720611,  1.        ,  0.        ,  1.        ],
        [-0.60712221, -1.55312352,  1.31739921,  0.05298942,  1.63138481,
         -0.54927344,  0.18720611,  1.        ,  0.        ,  1.        ]]
    
    # get inference service
    KServe = KServeClient()

    # wait for deployment to be ready
    KServe.get(model_name, namespace=namespace, watch=True, timeout_seconds=120)

    inference_service = KServe.get(model_name, namespace=namespace)
    is_url = f"http://istio-ingressgateway.istio-system.svc.cluster.local:80/v1/models/{model_name}:predict"
    header = {"Host": f"{model_name}.{namespace}.example.com"}

    logger.info(f"\nInference service status:\n{inference_service['status']}")
    logger.info(f"\nInference service URL:\n{is_url}\n")
    
    inference_input = {
        'instances': input_sample
    }
    response = requests.post(
        is_url,
        json=inference_input,
        headers=header,
        cookies={"authservice_session": TOKEN}
        
    )
    if response.status_code != 200:
        # use response.text here: an error body is not guaranteed to be valid JSON
        raise RuntimeError(f"HTTP status code '{response.status_code}': {response.text}")
    logger.info(f"\nPrediction response:\n{response.text}\n")

# Pipeline

The code below combines the previously defined components into a KFP pipeline. It modifies the original OSS pipeline by adding the parameters state, year, label_attribute, sensitive_attributes, splits, privilaged_groups, unprivilaged_groups, and group_thresholds, which the pull, preprocess, and train steps need to run the new code.

In [None]:
@dsl.pipeline(
      name='aif-pipeline',
      description='A single-run pipeline for the UCI Adult prediction task',
)
def pipeline(
    state: str,
    year: int,
    label_attribute: str,
    sensitive_attributes: list,
    splits: list,
    privilaged_groups: list,
    unprivilaged_groups: list,
    group_thresholds: list,
    mlflow_experiment_name: str,
    mlflow_tracking_uri: str,
    mlflow_s3_endpoint_url: str,
    model_name: str,
    threshold_metrics: dict
):
    """
    Pull, preprocess, train, and evaluate; deploy the model and run inference only if the evaluation passes.
    """
    pipeline_landmark = 'KFP_pipeline'
    
    pull_task = pull_data(state = state, year = year)

    preprocess_task = preprocess(data=pull_task.outputs["data"],
                                 label_attribute = label_attribute,
                                 sensitive_attributes = sensitive_attributes,
                                 splits = splits, group_thresholds = group_thresholds)

    train_task = train(
        train_set = preprocess_task.outputs["train_set"],
        test_set = preprocess_task.outputs["test_set"],
        mlflow_experiment_name=mlflow_experiment_name,
        mlflow_tracking_uri=mlflow_tracking_uri,
        mlflow_s3_endpoint_url=mlflow_s3_endpoint_url,
        model_name=model_name,
        label_attribute = label_attribute,
        sensitive_attributes = sensitive_attributes,
        privilaged_groups = privilaged_groups,
        unprivilaged_groups = unprivilaged_groups
    )
    
    train_task.apply(use_aws_secret(secret_name="aws-secret"))

    evaluate_task = evaluate(
        run_id=train_task.outputs["run_id"],
        mlflow_tracking_uri=mlflow_tracking_uri,
        threshold_metrics=threshold_metrics
    )
    
    eval_passed = evaluate_task.output

    with dsl.Condition(eval_passed == "true"):
        deploy_model_task = deploy_model(
            model_name=model_name,
            storage_uri=train_task.outputs["storage_uri"],
        )

        inference_task = inference(
            model_name=model_name
        )
        
        inference_task.after(deploy_model_task)

# Arguments

Here we define the arguments used by the pipeline. The data is split with a 0.5-0.3-0.2 ratio into train, test, and indrift datasets, and the privilaged_groups and unprivilaged_groups values define the groups that AIF360 compares. To use other PUMS data, change the state and year: pick any state abbreviation listed at https://www.bls.gov/respondents/mwr/electronic-data-interchange/appendix-d-usps-state-abbreviations-and-fips-codes.htm and a year in the range [2014,2018].

However, the storage requirements will increase, since folktables downloads the data and KFP then needs to create a new artifact for it. KFP may sometimes create separate artifacts for the same data, but this has not happened in this pipeline setup. To check the created artifacts, go to the KFP dashboard and click artifacts and then unknown.

If the inference step of the pipeline shows an error in the KFP dashboard, you must either change the model_name given in the arguments or remove some of the existing inference services. To remove them, list the services with 'kubectl get isvc -n kserve-inference' and use the listed names in 'kubectl -n kserve-inference delete isvc (model_name)'. You can check these and other existing services with 'kubectl get services -A'.

In [None]:
# If we want the pipeline to obey certain thresholds, we can set them here
eval_threshold_metrics = {'M_Acc': 0.60}

arguments = {
    "state": "CA",
    "year": 2014,
    "label_attribute": "PINCP",
    "sensitive_attributes": ["AGEP","SEX","RAC1P"],
    "splits": [0.5,0.3,0.2],
    "group_thresholds": [50,1,1],
    "privilaged_groups": [{"AGEP":1, "SEX": 1, "RAC1P": 1}],
    "unprivilaged_groups": [{"AGEP":0, "SEX": 0, "RAC1P": 0}],
    "mlflow_tracking_uri": "http://mlflow.mlflow.svc.cluster.local:5000",
    "mlflow_s3_endpoint_url": "http://mlflow-minio-service.mlflow.svc.cluster.local:9000",
    "mlflow_experiment_name": "demo-aif-notebook",
    "model_name": "demo-aif-lr",
    "threshold_metrics": eval_threshold_metrics
}
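
Inconsistent values in this dictionary only surface once the run is already executing, so it can help to sanity-check them before submitting. The following is a minimal sketch and not part of the original pipeline: `check_arguments` is a hypothetical helper, and the rules it enforces (splits sum to 1, one threshold per sensitive attribute, group keys taken from the sensitive attributes) are assumptions inferred from the values above.

```python
import math


def check_arguments(args: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found.

    Hypothetical helper: the rules are assumptions inferred from the argument
    values, not checks performed by the pipeline components themselves.
    """
    problems = []
    attrs = args["sensitive_attributes"]

    # The train/test/indrift ratios should cover the whole dataset.
    if not math.isclose(sum(args["splits"]), 1.0):
        problems.append(f"splits sum to {sum(args['splits'])}, expected 1.0")

    # One binarization threshold per sensitive attribute, in the same order.
    if len(args["group_thresholds"]) != len(attrs):
        problems.append("group_thresholds and sensitive_attributes differ in length")

    # Every group may only refer to declared sensitive attributes.
    for key in ("privilaged_groups", "unprivilaged_groups"):
        for group in args[key]:
            unknown = set(group) - set(attrs)
            if unknown:
                problems.append(f"{key} uses undeclared attributes: {sorted(unknown)}")

    return problems


# A copy of the fairness-related arguments, so the sketch runs on its own.
example_arguments = {
    "sensitive_attributes": ["AGEP", "SEX", "RAC1P"],
    "splits": [0.5, 0.3, 0.2],
    "group_thresholds": [50, 1, 1],
    "privilaged_groups": [{"AGEP": 1, "SEX": 1, "RAC1P": 1}],
    "unprivilaged_groups": [{"AGEP": 0, "SEX": 0, "RAC1P": 0}],
}
print(check_arguments(example_arguments))
```

Running the helper on the real `arguments` dictionary works the same way, because it only reads the fairness-related keys.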

# Submit run

This block runs the constructed pipeline. To update the pipeline, rerun the code you have changed and then rerun this block. For some reason, running this KFP pipeline creates temporary files of 2 to 5 GB each; possible causes are enable_caching, metadata writing, artifacts, or a suboptimal Docker configuration. On their own, these files are not a problem. However, if the root file system has, for example, 105 GB of disk space with only around 20 GB free because it is the default place for all kinds of software, you need to either delete these files manually (on Ubuntu 22.04, likely places are /tmp and /run; look at the modification dates) or restart the computer after around 4-5 KFP runs before you can rerun the pipeline. Thus, checking the free disk space before and after running the KFP pipeline is recommended. It might also be worthwhile to check how much storage Minio is using with the following port forwards:

- MLFlow Minio = kubectl -n mlflow port-forward svc/mlflow-minio-service 9001:9001 (user and password are both minioadmin)
- KFP Minio = kubectl port-forward -n kubeflow svc/minio-service 9000:9000 (user is minio and password is minio123)

The localhost URLs are:

- MLFlow Minio = http://localhost:9001/
- KFP Minio = http://localhost:9000/
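
The disk check recommended above can be sketched with the Python standard library. This is only an illustration: `free_gib` and `consumed_gib` are hypothetical helper names, and measuring the root file system "/" is an assumption that fits the setup described here.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Return the free space of the file system containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def consumed_gib(before: float, after: float) -> float:
    """Return how many GiB were used up between two measurements."""
    return before - after


before = free_gib("/")
# ... submit the run here and wait for it to finish ...
after = free_gib("/")
print(f"Free before: {before:.1f} GiB, after: {after:.1f} GiB, "
      f"consumed: {consumed_gib(before, after):.1f} GiB")
```

Passing "/tmp" or "/run" instead of "/" only changes the result if those directories are on a separate file system.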

In [None]:
run_name = "demo-aif-run"
experiment_name = "demo-aif-experiment"

client.create_run_from_pipeline_func(
    pipeline_func=pipeline,
    run_name=run_name,
    experiment_name=experiment_name,
    arguments=arguments,
    mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE,
    enable_caching=False,
    namespace="kubeflow-user-example-com"
)

# Demonstration confirmation

If the previous block did not raise any errors, check how the run goes by port forwarding KFP, MLFlow, Pushgateway, Prometheus, and Grafana. Since the KFP and Prometheus port forwards use the same local port, I recommend first waiting for the pipeline to finish in the KFP dashboard, shutting that port forward down, and then port forwarding Prometheus. The same applies to Grafana, since it uses the same local port as MLFlow. Here are the commands required for these port forwards:

- KFP = kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
- MLFlow = kubectl -n mlflow port-forward svc/mlflow 5000:5000
- Pushgateway = kubectl port-forward svc/prometheus-pushgateway 9091 -n monitoring
- Prometheus = kubectl port-forward svc/prometheus-service 8080 -n monitoring
- Grafana = kubectl port-forward svc/grafana 5000:3000 --namespace monitoring

The localhost URLs are:

- KFP = http://localhost:8080
- MLFlow = http://localhost:5000/#/
- Pushgateway = http://localhost:9091/
- Prometheus = http://localhost:8080/alerts
- Grafana = http://localhost:5000/ (user and password are admin)
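
To see at a glance which of these port forwards cannot run at the same time, the local ports can be grouped programmatically. This is a small sketch: the port numbers are copied from the commands above, and `find_port_conflicts` is a hypothetical helper.

```python
from collections import defaultdict

# Local ports taken from the port-forward commands listed above.
PORT_FORWARDS = {
    "KFP": 8080,
    "MLFlow": 5000,
    "Pushgateway": 9091,
    "Prometheus": 8080,
    "Grafana": 5000,
}


def find_port_conflicts(forwards: dict) -> dict:
    """Map each local port used more than once to the services that claim it."""
    by_port = defaultdict(list)
    for service, port in forwards.items():
        by_port[port].append(service)
    return {port: names for port, names in by_port.items() if len(names) > 1}


for port, names in find_port_conflicts(PORT_FORWARDS).items():
    print(f"Port {port} is shared by: {', '.join(names)}")
```

Choosing a different local port for one service of each conflicting pair would let all dashboards stay open at once.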

Signs that everything went fine with the pipeline are that the run 'demo-aif-run' found in runs is green in KFP, the MLFlow experiment 'demo-aif-notebook' shows numbers in all of its metric columns, Pushgateway has a job named metrics with scores for all the cases seen in the code, the Prometheus alerts are working, and the Grafana dashboard shows varying numbers. Demonstration-wise, you have reached the end, but we will still review Scaphandre metrics and demonstration debugging.

# Scaphandre metrics

The relevant energy consumption metrics in Prometheus and Grafana queries, as described in https://hubblo-org.github.io/scaphandre-documentation/references/exporter-prometheus.html, are:

- scaph_host_power_microwatts
- scaph_process_power_consumption_microwatts (This shows all cluster processes)
- scaph_host_energy_microjoule
- scaph_socket_power_microwatts

We can filter these metrics by giving suitable labels inside {} and using regex. For example, to get the power consumption of the whole cluster in watts, we need to write the query 'sum(scaph_process_power_consumption_microwatts{app_kubernetes_io_managed_by="Helm"}) / 1000000' either in Prometheus or Grafana. The "Helm" label filter is needed because the same metrics are duplicated in the current configuration, and dividing by 1000000 converts microwatts into watts.

If we want to get more granular, for example the power consumption of Prometheus alone, we need to write the query 'scaph_process_power_consumption_microwatts{app_kubernetes_io_managed_by="Helm", exe="prometheus"} / 1000000'. Similarly, we can get the power consumption of KFP (except for the inference steps) with the regex-based query 'sum(scaph_process_power_consumption_microwatts{app_kubernetes_io_managed_by="Helm", cmdline =~".*kfp.*"}) / 1000000'.
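
The same queries can also be sent from Python through the Prometheus HTTP API (`/api/v1/query`). The sketch below is an illustration under assumptions: `build_power_query` and `query_prometheus` are hypothetical helpers, the builder always wraps the metric in sum(), and the base URL http://localhost:8080 presumes the Prometheus port forward from the previous section is running.

```python
import json
import urllib.parse
import urllib.request

METRIC = "scaph_process_power_consumption_microwatts"


def build_power_query(extra_filters: str = "") -> str:
    """Build a PromQL expression for the summed power in watts.

    `extra_filters` is added inside {} after the Helm label,
    for example 'exe="prometheus"'.
    """
    filters = 'app_kubernetes_io_managed_by="Helm"'
    if extra_filters:
        filters += f", {extra_filters}"
    return f"sum({METRIC}{{{filters}}}) / 1000000"


def query_prometheus(base_url: str, promql: str) -> list:
    """Run an instant query and return the list of result series."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]


print(build_power_query())
print(build_power_query('cmdline=~".*kfp.*"'))
# With the Prometheus port forward running (assumed address):
# print(query_prometheus("http://localhost:8080", build_power_query()))
```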

Other interesting labels like kubernetes_namespace only show default, most likely because Scaphandre is not configured to read the cluster namespaces. The instance label could provide even better granularity if the cluster IP addresses are stable. The most specific label is PID, which is, unfortunately, different for every process. Another way is the cmdline label, which allows identifying KFP components (except for the inference steps) through the landmark variables placed in their code. For example, the power consumption of running the code of every component except inference can be queried with 'sum(scaph_process_power_consumption_microwatts{app_kubernetes_io_managed_by="Helm", cmdline =~".*landmark.*"}) / 1000000'. To be more specific, we only need the following filters for each component except inference:

- Pull data = cmdline=~".*(pull_data_component|python3-mpipinstall--quiet--no-warn-script-locationpandas~=1.4.2numpyfolktableskfp==1.8.22).*"
- Preprocess = cmdline=~".*(preprocess_component|python3-mpipinstall--quiet--no-warn-script-locationpandas~=1.4.2scikit-learn~=1.0.2numpykfp==1.8.22).*"
- Train = cmdline=~".*(train_component|python3-mpipinstall--quiet--no-warn-script-locationnumpypandas~=1.4.2aif360scikit-learn~=1.0.2mlflow~=1.25.0boto3~=1.21.0kfp==1.8.22).*"
- Evaluate = cmdline=~".*(evaluate_component|python3-mpipinstall--quiet--no-warn-script-locationnumpymlflow~=1.25.0prometheus_clientkfp==1.8.22).*"
- Deploy = cmdline=~".*(deploy_model_component|python3-mpipinstall--quiet--no-warn-script-locationkservekfp==1.8.22).*"
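
The filters above follow one pattern: the component landmark, or the pip install command of the component with all spaces removed. A minimal sketch of generating such a filter is shown below; `build_cmdline_filter` is a hypothetical helper, and the package list has to be given in the same order as in the component definition. Like the filters above, it leaves the dots in version numbers unescaped, so they act as regex wildcards.

```python
PIP_PREFIX = "python3-mpipinstall--quiet--no-warn-script-location"


def build_cmdline_filter(landmark: str, packages: list) -> str:
    """Build a cmdline matcher in the style of the filters listed above.

    The package specifiers are concatenated without separators because
    that is how the cmdline label appears in those filters.
    """
    pip_part = PIP_PREFIX + "".join(packages)
    return f'cmdline=~".*({landmark}|{pip_part}).*"'


print(build_cmdline_filter(
    "pull_data_component",
    ["pandas~=1.4.2", "numpy", "folktables", "kfp==1.8.22"],
))
```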

It is unknown why this matching technique does not work for inference services. For them, we must first use regex negations to narrow the processes down to the KFP pipeline runs. This can be done with the following filters:

- app_kubernetes_io_managed_by="Helm"
- cmdline!~".*(bin/|app/|conf/|--loglevelinfo|scaphandreprometheus--port8081).*" 
- exe!~".*(containerd-shim|nginx|postgres|sleep|workflow-contro|pause|minio|grafana-server|systemd-journal|manager|etcd|kube-apiserver|kube-controller|kube-scheduler|local-path-prov|mysqld|node|persistence_age).*"

Now, we can add further negations to isolate inference service related processes with the following filters:

- app_kubernetes_io_managed_by="Helm"
- cmdline!~".*(bin/|app/|conf/|--loglevelinfo|scaphandreprometheus--port8081|preprocess|pull_data|train|evaluate|deploy_model|numpy|python3-mpipinstall--quiet--no-warn-script-locationkservekfp==1.8.22|metadata|msklearnserver|python3server.py).*"
- exe!~".*(containerd-shim|nginx|postgres|sleep|workflow-contro|pause|minio|grafana-server|systemd-journal|manager|etcd|kube-apiserver|kube-controller|kube-scheduler|local-path-prov|mysqld|node|persistence_age).*"
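
How these matchers narrow down the series can be tried out locally. PromQL regex matchers are fully anchored, which `re.fullmatch` reproduces, and a missing label is treated as an empty string. The sketch below is only an emulation: `selects` is a hypothetical helper, the label sets are invented examples, and the patterns are shortened versions of the ones above.

```python
import re


def selects(labels, equal=None, regex=None, not_regex=None):
    """Emulate the PromQL matchers =, =~ and !~ on the labels of one series."""
    for name, value in (equal or {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in (regex or {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    for name, pattern in (not_regex or {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is not None:
            return False
    return True


# Invented label sets; real series carry more labels and longer cmdline values.
series = [
    {"app_kubernetes_io_managed_by": "Helm", "exe": "nginx", "cmdline": "nginx"},
    {"app_kubernetes_io_managed_by": "Helm", "exe": "python3",
     "cmdline": "python3train_component"},
    {"app_kubernetes_io_managed_by": "Helm", "exe": "python3",
     "cmdline": "python3model_server"},
]

# Shortened versions of the negations above, for readability only.
pipeline_filter = {
    "equal": {"app_kubernetes_io_managed_by": "Helm"},
    "not_regex": {"exe": ".*(containerd-shim|nginx|postgres).*",
                  "cmdline": ".*(bin/|app/|conf/).*"},
}
inference_filter = {
    "equal": {"app_kubernetes_io_managed_by": "Helm"},
    "not_regex": {"exe": ".*(containerd-shim|nginx|postgres).*",
                  "cmdline": ".*(bin/|app/|conf/|train|evaluate).*"},
}

kfp_like = [s["cmdline"] for s in series if selects(s, **pipeline_filter)]
inference_like = [s["cmdline"] for s in series if selects(s, **inference_filter)]
print(kfp_like)
print(inference_like)
```

Python's `re` and the RE2 engine used by Prometheus differ in some advanced syntax, but simple alternations like these behave the same in both.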

Unfortunately, these filters are not code-agnostic. They rely on code-specific regex matches, which means the negated substrings must be updated whenever the code is modified. This problem can be fixed by properly configuring Scaphandre in the cluster, because the current setup is awkward: it requires manual actions and cannot provide more relevant label metadata. Additionally, the expression sum(sum_over_time({}[1d])) / (1000000 * 1000 * 24) used in the Grafana plots only approximates the cumulative daily energy consumption per hour and does not account for possible downtime, which is why sum_over_time(avg_over_time(avg())) might be a better option.
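
For intuition on how power samples relate to energy: each sample is a power reading, so it has to be multiplied by the time it covers before the readings can be added up. The sketch below shows that arithmetic under stated assumptions. It is not the Grafana expression used in the dashboard, `energy_wh` is a hypothetical helper, and the 60 second scrape interval is an assumed value.

```python
def energy_wh(power_samples_microwatts: list, scrape_interval_s: float) -> float:
    """Approximate energy in watt-hours from evenly spaced power samples.

    Each sample is assumed to hold for one scrape interval (rectangle rule),
    so missing samples (downtime) simply contribute no energy.
    """
    joules = sum(p / 1_000_000 * scrape_interval_s for p in power_samples_microwatts)
    return joules / 3600


# One hour of constant 1 W readings at an assumed 60 s scrape interval.
samples = [1_000_000] * 60
print(energy_wh(samples, 60))
```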

# Demonstration debugging

If the run did produce errors, the first place to look is the KFP logs of the components, which can be found by going into runs, clicking the latest run, clicking the component marked with a red error, and opening its logs tab. Usually, the cause is a coding error, so after fixing it, rerun the modified components and start the run again.

If this doesn't solve the issue, use the cluster tests to check that the parts are green. The cluster is fine as long as the component test reports that the cluster is ready and the number of passed tests is either 36 or 37. The latter case can be caused by the website storing the test data being down. There might also be other errors, but as long as KFP is capable of running, these can be ignored. The tests are:

- python tests/wait_deployment_ready.py --timeout 30 (virtual environment recommended and OSS root directory)
- pytest (virtual environment recommended)

If the tests fail, removing the cluster and reinstalling everything is usually easier. A more surgical approach is to use kubectl to check the logs of the pods in the given namespace and adjust the used YAMLs to fix the issue. When the modified YAMLs have been saved, apply them and then rollout restart the pods. It is recommended to stop any dashboards before doing this. The required commands for cluster removal and kubectl fixing are:

Cluster removal:
- Cluster deletion = kind delete cluster --name kind-ep
- Registry deletion = docker rm -f $(docker ps -aqf "name=kind-registry")

Optional docker clean up:
- Show docker configuration = docker info
- Show containers = docker ps
- Show all containers = docker ps -a
- Delete containers = docker system prune (Be specific if you have other containers)
- Show all images = docker images -a
- Remove all images = docker image prune -a (Be specific if you have other images)

Kubectl fixing:
- Show pods of monitoring namespace = kubectl get pods -n monitoring
- Show deployment of monitoring namespace = kubectl get deployment -n monitoring
- Apply YAMLs for monitoring = kubectl apply -k deployment/monitoring (OSS root directory)
- Show logs of a pod in monitoring namespace = kubectl logs (pod ID) -n monitoring
- Restart prometheus of monitoring namespace = kubectl rollout restart deployment prometheus-deployment -n monitoring

As a final note, it is highly recommended that the Docker root directory (Docker Root Dir in docker info) is located in a place where it does not take disk space from critical programs like the operating system. Docker and KFP produce a lot of data, which in the worst case fills the disk so much that the OS cannot start up normally without the help of IT support. To lessen the impact of such mishaps, it is good practice to keep up-to-date, working backups in GitHub or any other cloud service of your choice.
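
Whether the Docker root directory shares a file system with the operating system can be checked with a few lines of Python. This is a sketch under assumptions: `parse_docker_root_dir` and `on_same_filesystem` are hypothetical helpers, and the sample text is an abbreviated, invented excerpt in the layout of `docker info` output.

```python
import os


def parse_docker_root_dir(docker_info_text: str):
    """Return the value of the 'Docker Root Dir' line, or None if it is absent."""
    for line in docker_info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Docker Root Dir":
            return value.strip()
    return None


def on_same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both paths live on the same device (file system)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Abbreviated, invented sample in the layout of `docker info` output.
sample = """Server:
 Containers: 12
 Docker Root Dir: /var/lib/docker
"""
print(parse_docker_root_dir(sample))

# On a machine with Docker installed, the real text could be obtained with:
# import subprocess
# text = subprocess.run(["docker", "info"], capture_output=True, text=True, check=True).stdout
# print(on_same_filesystem(parse_docker_root_dir(text), "/"))
```

If the check reports that both paths are on the same file system, Docker data competes with the operating system for disk space.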