### Installation
Install the packages required for executing this notebook.

Some of the source code is based on
https://towardsdatascience.com/how-to-set-up-custom-vertex-ai-pipelines-step-by-step-467487f81cad

In [1]:
# Install the packages
! pip3 install --user --no-cache-dir --upgrade "kfp>2" "google-cloud-pipeline-components>2" \
                                        google-cloud-aiplatform

Collecting kfp>2
  Downloading kfp-2.9.0.tar.gz (595 kB)
  Preparing metadata (setup.py) ... done
Collecting kfp-pipeline-spec==0.4.0 (from kfp>2)
  Downloading kfp_pipeline_spec-0.4.0-py3-none-any.whl.metadata (301 bytes)
Collecting kfp-server-api<2.4.0,>=2.1.0 (from kfp>2)
  Downloading kfp_server_api-2.3.0.tar.gz (84 kB)
  Preparing metadata (setup.py) ... done


## Restart the kernel
Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [2]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

Check the versions of the packages you installed. The KFP SDK version should be 2.x, because the install command above requires "kfp>2".

In [1]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! pip3 freeze | grep aiplatform
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"

KFP SDK version: 2.7.0
google-cloud-aiplatform==1.70.0
google_cloud_pipeline_components version: 2.17.0
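
The same check can be made from Python instead of by eye. This is a minimal sketch that assumes plain dotted version strings like the ones printed above; `parse_version` and `meets_minimum` are hypothetical helper names, and pre-release suffixes are simply cut off.

```python
def parse_version(version: str) -> tuple:
    """Turn '2.7.0' into (2, 7, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# The first version string is taken from the output above.
print(meets_minimum("2.7.0", "2.0.0"))  # the KFP SDK reported above
print(meets_minimum("1.6.2", "2.0.0"))  # a 1.x SDK would be too old
```

Tuples compare element by element, which avoids the pitfalls of comparing version strings directly.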


In [25]:
import kfp
import typing
from typing import Dict, List
from typing import NamedTuple
from kfp import dsl
from kfp.dsl import (Artifact,
                        Dataset,
                        Input,
                        Model,
                        Output,
                        Metrics,
                        ClassificationMetrics,
                        component, 
                        OutputPath, 
                        InputPath)
import google.cloud.aiplatform as aip
from google_cloud_pipeline_components.v1.model import ModelUploadOp
from google_cloud_pipeline_components.v1.endpoint import (EndpointCreateOp,ModelDeployOp)
from google_cloud_pipeline_components.types import artifact_types

#### Project and Pipeline Configurations

In [26]:
#The Google Cloud project that this pipeline runs in.
PROJECT_ID = "endless-mile-435507-h9"
# The region that this pipeline runs in
REGION = "europe-west1"
# Specify a Cloud Storage URI that your pipelines service account can access. The artifacts of your pipeline runs are stored within the pipeline root.
PIPELINE_ROOT = "gs://hairloss-de-temp"   # e.g., gs://temp_de2024
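
A malformed pipeline root usually only surfaces once the job is submitted, so it can be worth checking the value up front. The sketch below is one possible check, not part of any Google library; `split_gcs_uri` is a hypothetical helper name.

```python
def split_gcs_uri(uri: str) -> tuple:
    """Split 'gs://bucket/prefix' into (bucket, prefix); raise on anything else."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"no bucket name in {uri!r}")
    return bucket, prefix.strip("/")


# The pipeline root used in this notebook has a bucket and no prefix.
bucket, prefix = split_gcs_uri("gs://hairloss-de-temp")
print(bucket, repr(prefix))
```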

#### Create Pipeline Components

We can create a component from a Python function (inline) or from a container. We will first try inline Python functions.
Refer to https://www.kubeflow.org/docs/components/pipelines/v2/components/lightweight-python-components/ for more information.
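
Every component in this notebook imports its dependencies inside the function body. The reason is that a lightweight component is run from the function's own source in a separate container, so names defined elsewhere in the notebook are not available. The sketch below only mimics that isolation with `exec` in an empty namespace; it does not use KFP, and `run_isolated` and `count_rows` are hypothetical names.

```python
import textwrap

# Self-contained: the imports live inside the function body.
GOOD = textwrap.dedent('''
    def count_rows(csv_text: str) -> int:
        import csv, io
        return sum(1 for _ in csv.reader(io.StringIO(csv_text))) - 1
''')

# Not self-contained: relies on imports made somewhere outside the function.
BAD = textwrap.dedent('''
    def count_rows(csv_text: str) -> int:
        return sum(1 for _ in csv.reader(io.StringIO(csv_text))) - 1
''')


def run_isolated(source: str, func_name: str, *args):
    """Define the function in an empty namespace and call it."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


sample = "a,b\n1,2\n3,4\n"
print(run_isolated(GOOD, "count_rows", sample))
try:
    run_isolated(BAD, "count_rows", sample)
except NameError as error:
    print("fails in isolation:", error)
```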

### Data Ingestion

In [27]:
@dsl.component(
    packages_to_install=["pandas","google-cloud-storage"],
    base_image="python:3.10.7-slim"
)
def download_data(project_id: str, bucket: str, file_name: str, dataset: Output[Dataset]):
    '''download data'''
    from google.cloud import storage
    import pandas as pd
    import logging 
    import sys
    
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    
    # Downloading the file from a Google Cloud Storage bucket
    client = storage.Client(project=project_id)
    bucket = client.bucket(bucket)
    blob = bucket.blob(file_name)
    blob.download_to_filename(dataset.path + ".csv")
    logging.info('Downloaded Data!')
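
The component writes to `dataset.path + ".csv"`, not to `dataset.path` itself, so every downstream component has to append the same suffix when reading. The sketch below illustrates that convention with the standard library only; `FakeArtifact` is a hypothetical stand-in that carries just a `path` attribute and is not a KFP class.

```python
import csv
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FakeArtifact:
    """Hypothetical stand-in for a KFP artifact: only the `path` attribute."""
    path: str


def write_dataset(dataset: FakeArtifact, rows: list) -> None:
    # Producer side: mirrors download_to_filename(dataset.path + ".csv").
    with open(dataset.path + ".csv", "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_dataset(features: FakeArtifact) -> list:
    # Consumer side: must append the same suffix the producer used.
    with open(features.path + ".csv", newline="") as handle:
        return list(csv.reader(handle))


with tempfile.TemporaryDirectory() as workdir:
    artifact = FakeArtifact(path=os.path.join(workdir, "dataset"))
    write_dataset(artifact, [["Age", "Hair Loss"], ["34", "1"]])
    loaded = read_dataset(artifact)
    bare_path_exists = os.path.exists(artifact.path)

print(loaded)
print("file at the bare artifact path:", bare_path_exists)
```

Nothing is ever written at the bare artifact path, so a consumer that forgets the suffix will not find the file.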

#### Pipeline Component : Training DecisionTree

In [28]:
@dsl.component(
    packages_to_install=["pandas", "scikit-learn==1.3.2"],
    base_image="python:3.10.7-slim"
)
def train_dt(features: Input[Dataset], out_model: Output[Model]) -> NamedTuple('outputs', metrics=dict):
    '''train a DecisionTreeClassifier with default parameters'''
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    import pickle
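    import logging
    import sys

    # Without this, logging.info() below is dropped at the default WARNING level
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)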
    
    data = pd.read_csv(features.path + ".csv")
    x_train, x_test, y_train, y_test = train_test_split(data.drop('Hair Loss',axis=1), 
                                                    data['Hair Loss'], test_size=0.30, 
                                                    random_state=101)
    model_dt = DecisionTreeClassifier()
    model_dt.fit(x_train, y_train)
    
    metrics_dict = {
        "accuracy": model_dt.score(x_test, y_test)
    }
    logging.info(metrics_dict)  
    
    out_model.metadata["file_type"] = ".pkl"
    out_model.metadata["algo"] = "dt"
    # Save the model
    m_file = out_model.path + ".pkl"
    with open(m_file, "wb") as model_file:
        pickle.dump(model_dt, model_file)   
    
    outputs = NamedTuple('outputs', metrics=dict)
    return outputs(metrics_dict)
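
The keyword form `NamedTuple('outputs', metrics=dict)` used above runs on the Python 3.10 base image, but it is deprecated from Python 3.13 onward. The sketch below shows the equivalent list-of-pairs form in plain Python. Whether a given KFP version treats it identically as a return annotation is an assumption worth checking before switching, and the accuracy value is purely illustrative.

```python
from typing import NamedTuple

# Functional form with a list of (field name, type) pairs.
Outputs = NamedTuple("outputs", [("metrics", dict)])

# Illustrative value only, not a score produced by this pipeline.
result = Outputs(metrics={"accuracy": 0.87})

print(result._fields)
print(result.metrics["accuracy"])
```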

### Training Logistic Regression

In [39]:
@dsl.component(
    packages_to_install=["pandas", "scikit-learn==1.3.2"],
    base_image="python:3.10.7-slim"
)
def train_lr(features: Input[Dataset], out_model: Output[Model]) -> NamedTuple('outputs', metrics=dict):
    '''train a LogisticRegression with default parameters'''
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    import json
    import logging 
    import sys
    import os
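    import pickle

    # Without this, logging.info() below is dropped at the default WARNING level
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)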
    
    data = pd.read_csv(features.path + ".csv")
    x_train, x_test, y_train, y_test = train_test_split(data.drop('Hair Loss',axis=1), 
                                                    data['Hair Loss'], test_size=0.30, 
                                                    random_state=101)
    model_lr = LogisticRegression()
    model_lr.fit(x_train, y_train)
    
    metrics_dict = {
        "accuracy": model_lr.score(x_test, y_test)
    }
    logging.info(metrics_dict)  
    
    out_model.metadata["file_type"] = ".pkl"
    out_model.metadata["algo"] = "lr"
    # Save the model
    m_file = out_model.path + ".pkl"
    with open(m_file, "wb") as model_file:
        pickle.dump(model_lr, model_file)   
    
    outputs = NamedTuple('outputs', metrics=dict)
    return outputs(metrics_dict)
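
Both training components follow the same save pattern: pickle the estimator to `out_model.path + ".pkl"` and record the suffix and algorithm in the artifact metadata so later steps can rebuild the file name. The sketch below shows that round trip with the standard library only. `save_model` and `load_model` are hypothetical helpers, and a plain dict stands in for the fitted estimator.

```python
import os
import pickle
import tempfile


def save_model(model, base_path: str, metadata: dict, algo: str) -> str:
    """Pickle `model` to base_path + '.pkl' and record how it was saved."""
    metadata["file_type"] = ".pkl"
    metadata["algo"] = algo
    model_file = base_path + metadata["file_type"]
    with open(model_file, "wb") as handle:
        pickle.dump(model, handle)
    return model_file


def load_model(base_path: str, metadata: dict):
    """Rebuild the file name from the metadata and unpickle the model."""
    with open(base_path + metadata["file_type"], "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    meta = {}
    stand_in = {"kind": "placeholder for a fitted estimator"}
    base = os.path.join(workdir, "model")
    saved_to = save_model(stand_in, base, meta, algo="lr")
    restored = load_model(base, meta)

print(os.path.basename(saved_to), meta, restored == stand_in)
```

Unpickling a real estimator also requires a compatible scikit-learn version in the loading environment, which is why the components pin `scikit-learn==1.3.2`.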

### Prediction: DecisionTree

In [40]:
@dsl.component(
    packages_to_install=["pandas", "scikit-learn==1.3.2"],
    base_image="python:3.10.7-slim"
)
def predict_dt(model: Input[Model], features: Input[Dataset], results: Output[Dataset]):
    '''predict with the trained DecisionTreeClassifier and save the results'''
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    import pickle
    import logging
    import sys
    
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    
    data = pd.read_csv(features.path + ".csv")
    xNew = data.loc[:, ['Genetics', 'Hormonal Changes', 'Medical Conditions', 'Medications & Treatments',
               'Nutritional Deficiencies', 'Stress', 'Age', 'Poor Hair Care Habits',
               'Environmental Factors', 'Smoking', 'Weight Loss']].values
    # Load the saved model
    filename = model.path + ".pkl"
    with open(filename, 'rb') as model_file:
        model_dt = pickle.load(model_file)

    dfcp = data.copy()
    # predict() already returns one class label per row, so no argmax is needed
    y_classes = model_dt.predict(xNew)
    logging.info(y_classes)
    dfcp['pHairloss'] = y_classes.tolist()
    dfcp.to_csv(results.path + ".csv" , index=False, encoding='utf-8-sig')

### Prediction LR

In [41]:
@dsl.component(
    packages_to_install=["pandas", "scikit-learn==1.3.2"],
    base_image="python:3.10.7-slim"
)
def predict_lr(model: Input[Model], features: Input[Dataset], results: Output[Dataset]):
    '''predict with the trained LogisticRegression model and save the results'''
    import pandas as pd
    import pickle  
    import json
    import logging
    import sys
    import os
    
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    
    data = pd.read_csv(features.path + ".csv")
    xNew = data.loc[:, ['Genetics', 'Hormonal Changes', 'Medical Conditions', 'Medications & Treatments',
               'Nutritional Deficiencies', 'Stress', 'Age', 'Poor Hair Care Habits',
               'Environmental Factors', 'Smoking', 'Weight Loss']].values
    # Load the saved model
    filename = model.path + ".pkl"
    with open(filename, 'rb') as model_file:
        model_lr = pickle.load(model_file)

    dfcp = data.copy()
    # predict() already returns one class label per row, so no argmax is needed
    y_classes = model_lr.predict(xNew)
    logging.info(y_classes)
    dfcp['pHairloss'] = y_classes.tolist()
    dfcp.to_csv(results.path + ".csv" , index=False, encoding='utf-8-sig')

### Algorithm Selection

In [42]:
@dsl.component(
    base_image="python:3.10.7-slim"
)
def compare_model(dt_metrics: dict, lr_metrics: dict) -> str:
    import logging
    import json
    import sys
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    logging.info(dt_metrics)
    logging.info(lr_metrics)
    if dt_metrics.get("accuracy") > lr_metrics.get("accuracy"):
        return "DT"
    else :
        return "LR"
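
Two details of this comparison are easy to miss: a tie goes to "LR", and a missing "accuracy" key makes `.get()` return `None`, which then fails in the `>` comparison with a `TypeError`. The sketch below is a plain-Python variant that makes both behaviours explicit; `choose_model` is a hypothetical name, and the scores are illustrative.

```python
def choose_model(dt_metrics: dict, lr_metrics: dict, key: str = "accuracy") -> str:
    """Return 'DT' only when the decision tree is strictly better on `key`."""
    dt_score = dt_metrics.get(key)
    lr_score = lr_metrics.get(key)
    if dt_score is None or lr_score is None:
        raise ValueError(f"both metric dicts need a {key!r} entry")
    return "DT" if dt_score > lr_score else "LR"


# Illustrative scores only.
print(choose_model({"accuracy": 0.91}, {"accuracy": 0.88}))
print(choose_model({"accuracy": 0.88}, {"accuracy": 0.88}))
```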

### Upload model and metrics to google Bucket

In [43]:
@dsl.component(
    packages_to_install=["google-cloud-storage"],
    base_image="python:3.10.7-slim"
)
def upload_model_to_gcs(project_id: str, model_repo: str, model: Input[Model]):
    '''upload model to GCS'''
    from google.cloud import storage   
    import logging 
    import sys
    
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)    
  
    # upload the model to GCS
    client = storage.Client(project=project_id)
    bucket = client.bucket(model_repo)
    blob = bucket.blob(str(model.metadata["algo"]) + '_model' + str(model.metadata["file_type"])) 
    blob.upload_from_filename(model.path + str(model.metadata["file_type"]))       
    
    print("Saved the model to GCS bucket : " + model_repo)
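
The object name and the local file name are both derived from the metadata written by the training components. The sketch below isolates that string logic so it can be checked without touching Cloud Storage; `model_blob_name` and `local_model_file` are hypothetical helper names, and the local path is a made-up example.

```python
def model_blob_name(metadata: dict) -> str:
    """Object name in the model bucket, e.g. 'dt_model.pkl'."""
    return f"{metadata['algo']}_model{metadata['file_type']}"


def local_model_file(artifact_path: str, metadata: dict) -> str:
    """Local file written by the training step: artifact path plus suffix."""
    return artifact_path + str(metadata["file_type"])


meta = {"algo": "dt", "file_type": ".pkl"}
print(model_blob_name(meta))
print(local_model_file("/tmp/example/out_model", meta))  # made-up path
```

Because the object name depends only on the algorithm, each run overwrites the model uploaded by the previous run of the same algorithm.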

In [44]:
# @dsl.component(
#     packages_to_install=["google-cloud-storage"],
#     base_image="python:3.10.7-slim"
# )
# def upload_model_to_gcs(project_id: str, model_repo: str, model: Input[Model]):
#     '''upload model to gsc'''
#     from google.cloud import storage   
#     import logging 
#     import sys
    
#     logging.basicConfig(stream=sys.stdout, level=logging.INFO)    
  
#     # upload the model to GCS
#     client = storage.Client(project=project_id)
#     bucket = client.bucket(model_repo)
#     blob = bucket.blob('model.pkl')
#     source_file_name= model.path + '.pkl'
   
#     blob.upload_from_filename(source_file_name)    
    
#     print(f"File {source_file_name} uploaded to {model_repo}.")

### Trigger another CI-CD Pipeline

In [45]:
@dsl.component(
    packages_to_install=["google-cloud-build"],
    base_image="python:3.10.7-slim"
)
def run_build_trigger(project_id:str, trigger_id:str):
    import sys
    from google.cloud.devtools import cloudbuild_v1    
    import logging 
    logging.basicConfig(stream=sys.stdout, level=logging.INFO) 
    
    # Create a client
    client = cloudbuild_v1.CloudBuildClient()
    name = f"projects/{project_id}/locations/europe-west1/triggers/{trigger_id}"
    # Initialize request argument(s)
    request = cloudbuild_v1.RunBuildTriggerRequest(        
        project_id=project_id,
        trigger_id=trigger_id,
        name=name
    )

    # Make the request
    operation = client.run_build_trigger(request=request)
    
    logging.info("Trigger the CI-CD Pipeline: " + trigger_id)
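
The trigger's resource name hard-codes `europe-west1`, so the component only works for triggers created in that region. One way to lift that restriction is to build the name from a location argument, as sketched below; `trigger_resource_name` is a hypothetical helper, and the default simply mirrors the region used in this notebook.

```python
def trigger_resource_name(project_id: str, trigger_id: str,
                          location: str = "europe-west1") -> str:
    """Build 'projects/<project>/locations/<location>/triggers/<trigger>'."""
    for label, value in (("project_id", project_id),
                         ("trigger_id", trigger_id),
                         ("location", location)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"projects/{project_id}/locations/{location}/triggers/{trigger_id}"


# Values taken from this notebook's configuration.
print(trigger_resource_name("endless-mile-435507-h9",
                            "5fd4d88c-0e8d-4b6b-a093-b19d2f1e6eb4"))
```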

### Deploy the model at Vertex AI 
We can use Google's pre-built Kubeflow components, such as EndpointCreateOp, ModelUploadOp, and ModelDeployOp, to deploy the models directly on Vertex AI.

***This is only for testing. In your assignment, please use custom serving applications and CI-CD pipelines to deploy models. We should be able to deploy a given model to any given production environment, and CI-CD pipelines are the best way to do that.***

<img src="imgs/EnterpriseMlOps_Model_Deployment.png">

source: https://cloud.redhat.com/blog/enterprise-mlops-reference-design

https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction

https://cloud.google.com/vertex-ai/docs/pipelines/gcpc-list

#### Define the Pipeline

In [46]:
# # Define the workflow of the pipeline.
# @kfp.dsl.pipeline(
#     name="hairloss-predictor-training-pipeline")
# def pipeline(project_id: str, data_bucket: str, trainset_filename: str, model_repo: str, testset_filename: str):    
    
#     di_op = download_data(
#         project_id=project_id,
#         bucket=data_bucket,
#         file_name=trainset_filename
#     )
    
#     train_test_split_op = train_test_split(dataset=di_op.output)
        
#     training_lr_job_run_op = train_lr(features=train_test_split_op.outputs["dataset_train"])
    
#     model_evaluation_op = lr_model_evaluation(
#         test_set=train_test_split_op.outputs["dataset_test"],
#         model_lr=training_lr_job_run_op.outputs["model"],
#         thresholds_dict_str=thresholds_dict_str, # deploy the model only if its performance is above the threshold
#     )
    
#     with dsl.If(
#         model_evaluation_op.outputs["approval"]== True,
#         name="approve-model",
#     ):
#         upload_model_to_gc_op = upload_model_to_gcs(
#             project_id=project_id,
#             model_repo=model_repo,
#             model=training_lr_job_run_op.outputs['model']
#         )    
        
#         import_unmanaged_model_task = dsl.importer(
#             artifact_uri=model_repo_uri,
#             artifact_class=artifact_types.UnmanagedContainerModel,
#             metadata={
#                 "containerSpec": {
#                     "imageUri": "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",  # see https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers  
#                 },
#             },
#         ).after(upload_model_to_gc_op)      
       
#         # using Google's pre-built components for uploading and deploying the model.
       
#         model_upload_op = ModelUploadOp(
#             project=project_id,
#             display_name="hairloss-prediction-model",
#             unmanaged_container_model=import_unmanaged_model_task.outputs["artifact"],
#         ).after(import_unmanaged_model_task)       
               
#         create_endpoint_op = EndpointCreateOp(
#             project=project_id,
#             display_name="hairloss-prediction-service",
#         ).after(model_upload_op)      
        
#         model_deploy_op = ModelDeployOp(
#             model=model_upload_op.outputs["model"],
#             endpoint=create_endpoint_op.outputs['endpoint'],
#             deployed_model_display_name="hairloss-prediction-model",
#             dedicated_resources_machine_type="n1-standard-4",
#             dedicated_resources_min_replica_count=1,
#             dedicated_resources_max_replica_count=1,
#             traffic_split={"0": 100},
#         ).after(create_endpoint_op)   

# Define the workflow of the pipeline.
@kfp.dsl.pipeline(
    name="hairloss-predictor-training-pipeline")
def pipeline(project_id: str, data_bucket: str, trainset_filename: str, model_repo: str, testset_filename: str, trigger_id: str):   
    
    # Step 1: Download training data
    di_op = download_data(
        project_id=project_id,
        bucket=data_bucket,
        file_name=trainset_filename
    )
    training_dt = train_dt(
        features=di_op.outputs['dataset'])
    
    training_lr = train_lr(
        features=di_op.outputs['dataset'])
    
    pre_di_op = download_data(
        project_id=project_id,
        bucket=data_bucket,
        file_name=testset_filename
    ).after(training_dt, training_lr)
    
    comp_model__op = compare_model(dt_metrics=training_dt.outputs["metrics"],
                                       lr_metrics=training_lr.outputs["metrics"]).after(training_dt, training_lr)  
    # defining the branching condition
    with dsl.If(comp_model__op.output=="DT"):
        predict_dt_job_run_op = predict_dt(
            model=training_dt.outputs["out_model"],      
            features=pre_di_op.outputs["dataset"]
        )
        upload_model_dt_to_gc_op = upload_model_to_gcs(
            project_id=project_id,
            model_repo=model_repo,
            model=training_dt.outputs['out_model']
        ).after(predict_dt_job_run_op)
        
        trigger_model_deployment_cicd = run_build_trigger(
            project_id=project_id,
            trigger_id=trigger_id
        ).after(upload_model_dt_to_gc_op)  
        
    with dsl.If(comp_model__op.output=="LR"):
        predict_lr_job_run_op = predict_lr(
            model=training_lr.outputs["out_model"],     
            features=pre_di_op.outputs["dataset"]
        )
        upload_model_lr_to_gc_op = upload_model_to_gcs(
            project_id=project_id,
            model_repo=model_repo,
            model=training_lr.outputs['out_model']
        ).after(predict_lr_job_run_op) 
        
        trigger_model_deployment_cicd = run_build_trigger(
            project_id=project_id,
            trigger_id=trigger_id
        ).after(upload_model_lr_to_gc_op) 
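
The execution order of this pipeline follows from two things: data passed between tasks and the explicit `.after()` calls. The sketch below writes the dependencies of the "DT" branch out by hand and sorts them with the standard library's `graphlib`. It is a reading aid for the graph, not something KFP produces, and the task labels are shorthand chosen here.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for (DT branch only).
dependencies = {
    "download_train": set(),
    "train_dt": {"download_train"},
    "train_lr": {"download_train"},
    "download_test": {"train_dt", "train_lr"},          # via .after(...)
    "compare_model": {"train_dt", "train_lr"},
    "predict_dt": {"compare_model", "download_test", "train_dt"},
    "upload_model_to_gcs": {"predict_dt", "train_dt"},  # via .after(...)
    "run_build_trigger": {"upload_model_to_gcs"},       # via .after(...)
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Several valid orders exist because independent tasks, such as the two training steps, can run in parallel.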


#### Compile the pipeline into a YAML file

In [47]:
from kfp import compiler
compiler.Compiler().compile(pipeline_func=pipeline,
        package_path='hairloss_predictor_training_pipeline.yaml')
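
A misspelled or missing pipeline parameter is normally only reported when the run is submitted. The sketch below checks a parameter dictionary against a function signature beforehand. `pipeline_signature` is a hypothetical stand-in with the same parameter names as the pipeline function above, `check_parameters` is a hypothetical helper, and one key is misspelled on purpose.

```python
import inspect


def pipeline_signature(project_id: str, data_bucket: str, trainset_filename: str,
                       model_repo: str, testset_filename: str, trigger_id: str):
    """Stand-in with the same parameter names as the pipeline function."""


def check_parameters(func, parameter_values: dict) -> tuple:
    """Return (missing, unexpected) parameter names, each sorted."""
    expected = set(inspect.signature(func).parameters)
    provided = set(parameter_values)
    return sorted(expected - provided), sorted(provided - expected)


candidate = {
    "project_id": "endless-mile-435507-h9",
    "data_bucket": "hairloss-de-data",
    "trainset_file": "training_set.csv",  # deliberate typo for the demo
    "testset_filename": "test_set.csv",
    "model_repo": "hairloss-de-models",
    "trigger_id": "5fd4d88c-0e8d-4b6b-a093-b19d2f1e6eb4",
}

missing, unexpected = check_parameters(pipeline_signature, candidate)
print("missing:", missing)
print("unexpected:", unexpected)
```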

In [48]:
import google.cloud.aiplatform as aip

# Before initializing, make sure to set the GOOGLE_APPLICATION_CREDENTIALS
# environment variable to the path of your service account key file.
aip.init(
    project=PROJECT_ID,
    location=REGION,
)

# Prepare the pipeline job
job = aip.PipelineJob(
    display_name="hairloss-predictor",
    enable_caching=False,
    template_path="hairloss_predictor_training_pipeline.yaml",
    pipeline_root=PIPELINE_ROOT,
    location=REGION,
    parameter_values={
        'project_id': 'endless-mile-435507-h9', # make sure to use your project ID
        'data_bucket': 'hairloss-de-data',  # make sure to use your data bucket name
        'trainset_filename': 'training_set.csv',     # make sure to upload this to your data bucket from DE2024/lab4/data
        'testset_filename': 'test_set.csv',    # make sure to upload this to your data bucket from DE2024/lab4/data
        'model_repo':'hairloss-de-models', # make sure to use your model bucket name
        'trigger_id':'5fd4d88c-0e8d-4b6b-a093-b19d2f1e6eb4'
    }
)

job.run()

Creating PipelineJob
PipelineJob created. Resource name: projects/136177505402/locations/europe-west1/pipelineJobs/hairloss-predictor-training-pipeline-20241029121021
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/136177505402/locations/europe-west1/pipelineJobs/hairloss-predictor-training-pipeline-20241029121021')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/europe-west1/pipelines/runs/hairloss-predictor-training-pipeline-20241029121021?project=136177505402
PipelineJob projects/136177505402/locations/europe-west1/pipelineJobs/hairloss-predictor-training-pipeline-20241029121021 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/136177505402/locations/europe-west1/pipelineJobs/hairloss-predictor-training-pipeline-20241029121021 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/136177505402/locations/europe-west1/pipelineJobs/hairloss-predictor-training-pipeline-202410