# Skin Cancer Pipeline

This pipeline performs the following steps:
1. **Fetch Dataset:** Check whether the dataset already exists in MinIO; if it does not, download it from Kaggle and upload it.
2. **Process Data:** Load the metadata CSV, merge in the image paths, and plot the data.
3. **Feature Engineering:** Perform feature engineering on the processed metadata.
4. **Train Random Forest Model:** Train a Random Forest model on the engineered dataset.
5. **Train CNN Model:** Train a CNN on a sample of the engineered dataset.

Steps 4 and 5 are alternatives: each pipeline run trains one model, chosen by the `target_model` pipeline argument.

Below is the complete pipeline implementation using MLRun in a Jupyter Notebook.

In [None]:
!pip install mlrun

## Setup the Project
Create the project with `mlrun.get_or_create_project`, which loads the project if it already exists, and set the directory where we'll store the project's artifacts:

In [None]:
import mlrun
import os

# Set our project's name:
project_name = "mlop"
project_dir = os.path.abspath('./')

# Create the project:
project = mlrun.get_or_create_project(project_name, project_dir, user_project=False)

A project in MLRun is based on the MLRun Functions it can run. In this notebook we will see two ways to create an MLRun Function:

* `mlrun.code_to_function`: Create our own MLRun Function from code (used below to create the data-loading job).
* `mlrun.import_function`: Import from [MLRun's functions marketplace](https://www.mlrun.org/hub/) - a [functions hub](https://docs.mlrun.org/en/v1.1.2/runtimes/load-from-marketplace.html) intended to be a centralized location for open source contributions of function components (will be used for downloading the data in section 2).

## Build project images
Build the image that satisfies the project's requirements. The `commands` list installs the Python packages that the pipeline functions need.

In [None]:
project.build_image(
    image=".mlop-base",
    base_image='mlrun/mlrun',
    commands=[
        'pip install kagglehub',
        'pip install tensorflow~=2.16.0',
        'pip install minio',
        'pip install pandas',
        'pip install matplotlib==3.5.3',
        'pip install seaborn',
        'pip install scikit-learn',
        'pip install opencv-python-headless',
        'pip install Pillow',
        'pip install numpy',
    ],
    extra_args="--skip-tls-verify",
    overwrite_build_params=True
)
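The `commands` list above repeats `pip install` once per package. As a small sketch, the same list could be generated from a plain list of requirement strings, which keeps the version pins in one place. The `pip_commands` helper and the `requirements` name are hypothetical and not part of MLRun.

```python
# Hypothetical helper (not part of MLRun): build the `commands` list
# passed to project.build_image from a list of pip requirement strings.
def pip_commands(requirements):
    """Return one 'pip install <requirement>' command per entry, in order."""
    return [f"pip install {req}" for req in requirements]


# The same packages and pins as in the build cell above.
requirements = [
    "kagglehub",
    "tensorflow~=2.16.0",
    "minio",
    "pandas",
    "matplotlib==3.5.3",
    "seaborn",
    "scikit-learn",
    "opencv-python-headless",
    "Pillow",
    "numpy",
]

commands = pip_commands(requirements)
```

The resulting `commands` list can be passed unchanged as the `commands` argument of `project.build_image`.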

## Create the MLRun Function & Run
We will use `mlrun.code_to_function` to create an MLRun Function from the code in `functions/load.py`.

The data-loading step runs first, as a Job, so we set the `kind` parameter to `"job"`. The cell below defines the parameters and node-selector labels used in this notebook:

In [None]:
metadata_filename = "HAM10000_metadata.csv" 
source_bucket = "data" 
processed_bucket = "processed-data"
processed_metadata_filename = "processed_metadata.pkl"
images_dir = "images"

mem_label = {"dedicated": "highmem"}
cpu_label = {"dedicated": "highcpu"}

use_gpu = False
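The `use_gpu` flag is defined here but is not used by the cells shown in this notebook. One way it might be applied is sketched below, under assumptions: the hypothetical helper `configure_resources` requests a GPU through `with_limits(gpus=...)` when the flag is set, and otherwise pins the function to the `highcpu` nodes with `with_node_selection`. The `_RecordingFunction` class is only a stand-in for an MLRun function so that the sketch runs without a cluster.

```python
class _RecordingFunction:
    """Stand-in for an MLRun function; it only records the calls made on it."""

    def __init__(self):
        self.calls = []

    def with_limits(self, **kwargs):
        self.calls.append(("with_limits", kwargs))

    def with_node_selection(self, **kwargs):
        self.calls.append(("with_node_selection", kwargs))


def configure_resources(fn, use_gpu, cpu_label, gpus=1):
    """Hypothetical helper: request GPUs if use_gpu is set, else select CPU nodes."""
    if use_gpu:
        # Assumption: the training function should get `gpus` GPU(s).
        fn.with_limits(gpus=gpus)
    else:
        fn.with_node_selection(node_selector=cpu_label)
    return fn


cpu_label = {"dedicated": "highcpu"}
cpu_fn = configure_resources(_RecordingFunction(), use_gpu=False, cpu_label=cpu_label)
gpu_fn = configure_resources(_RecordingFunction(), use_gpu=True, cpu_label=cpu_label)
```

With a real MLRun function, the same helper would be called on the training function before it is run.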

### Example task

In [None]:
# Create the function by parsing the given code file with 'code_to_function'.
# The function is named "load-data" and runs on the ".mlop-base" image built above:
load_function = mlrun.code_to_function(
    filename="functions/load.py",
    name="load-data",
    kind="job",
    image=".mlop-base"
)

load_function.with_node_selection(node_selector=mem_label)

load_run = load_function.run(
    name="Load data into Minio Object Storage",
    handler="fetch_dataset",
    params={
        "metadata_filename": metadata_filename,
        "bucket_name": source_bucket, 
        "images_dir": images_dir,
    },
    local=False
)

# Wait for completion and show the results.
load_run.wait_for_completion()
load_run.show()
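The `fetch_dataset` handler lives in `functions/load.py` and is not shown in this notebook. The following is a rough sketch of the check-then-download flow described in step 1, under stated assumptions: `InMemoryStore` is a hypothetical stand-in for the MinIO bucket, `fake_download` is a hypothetical stand-in for the Kaggle download, and the file contents are placeholders. None of this reflects the real handler's code.

```python
class InMemoryStore:
    """Hypothetical stand-in for an object-storage bucket."""

    def __init__(self):
        self._objects = {}

    def exists(self, key):
        return key in self._objects

    def put(self, key, data):
        self._objects[key] = data

    def keys(self):
        return sorted(self._objects)


def fetch_dataset_sketch(store, metadata_filename, download):
    """Upload the dataset only if the metadata file is missing from the store.

    Returns True if a download and upload happened, False if it was skipped.
    """
    if store.exists(metadata_filename):
        return False
    for key, data in download().items():
        store.put(key, data)
    return True


def fake_download():
    """Placeholder for the Kaggle download: maps object names to bytes."""
    return {
        "HAM10000_metadata.csv": b"placeholder metadata",
        "images/example.jpg": b"placeholder image",
    }


store = InMemoryStore()
first = fetch_dataset_sketch(store, "HAM10000_metadata.csv", fake_download)
second = fetch_dataset_sketch(store, "HAM10000_metadata.csv", fake_download)
```

The first call uploads the files; the second call finds the metadata file already in the store and skips the download.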

## Kubeflow Pipeline

In [None]:
load_data_fn = project.set_function(
    "functions/load.py",
    name="load-data",
    kind="job",
    image='.mlop-base'
)
processing_fn = project.set_function(
    "functions/process.py",
    name="processing",
    kind="job",
    image='.mlop-base'
)
feature_engineering_fn = project.set_function(
    "functions/feature_engineering.py",
    name="feature-engineering",
    kind="job",
    image='.mlop-base'
)
model_fn = project.set_function(
    "functions/model.py",
    name="model",
    kind="job",
    image='.mlop-base'
)

load_data_fn.with_node_selection(node_selector=mem_label)
processing_fn.with_node_selection(node_selector=mem_label)
feature_engineering_fn.with_node_selection(node_selector=mem_label)
model_fn.with_node_selection(node_selector=cpu_label)

mlrun.mlconf.is_ce_mode()

In [None]:
%%writefile workflow.py
import mlrun
import kfp
from kfp import dsl
from typing import Literal

@kfp.dsl.pipeline(name="Skin Cancer Detection Pipeline")
def kfpipeline(
    target_model: Literal["rf", "cnn"],
    segmented_samples: int = 2000,
    sample: int = 400,
    batch_size: int = 32,
    epochs: int = 10,
):     
    # Defaults
    metadata_filename = "HAM10000_metadata.csv" 
    source_bucket = "data" 
    processed_bucket = "processed-data"
    processed_metadata_filename = "processed_metadata.pkl"
    images_dir = "images"
    # Get the project object
    project = mlrun.get_current_project()

    load_run = mlrun.run_function(
        name="load-data-into-minio-object-storage",
        function="load-data",
        handler="fetch_dataset",
        params={
            "metadata_filename": metadata_filename,
            "bucket_name": source_bucket, 
            "images_dir": images_dir,
        },
        local=False
    )

    process_run = mlrun.run_function(
        name="preprocess-data",
        function="processing",
        handler="process_metadata",
        params={
            "metadata_filename": metadata_filename,
            "source_bucket": source_bucket, 
            "processed_bucket": processed_bucket,
            "images_dir": images_dir,
            "processed_metadata_filename": processed_metadata_filename,
        },
        local=False
    ).after(load_run)
    
    create_segmented_run = mlrun.run_function(
        name="preprocess-images",
        function="processing",
        handler="create_segmented_images",
        params={
            "processed_bucket": processed_bucket,
            "processed_metadata_filename": processed_metadata_filename,
            "n_samples": segmented_samples,
        },
        local=False,
        watch=True,  # <- Turn on the logs.
    ).after(process_run)

    feature_engineering_run = mlrun.run_function(
        name="feature-engineering",
        function="feature-engineering",
        handler="feature_engineer",
        params={
            "processed_bucket": processed_bucket,
            "processed_metadata_filename": processed_metadata_filename,
        },
        local=False,
        watch=True,  # <- Turn on the logs.
    ).after(create_segmented_run)

    with dsl.Condition(target_model == "rf"): 
        model_run = mlrun.run_function(
            name="train-random-forest-model",
            function="model",
            handler="train_random_forest",
            params={
                "processed_bucket": processed_bucket,
                "processed_metadata_filename": processed_metadata_filename,
            },
            local=False,
            watch=True,  # <- Turn on the logs.
        ).after(feature_engineering_run)

    with dsl.Condition(target_model == "cnn"): 
        training_cnn_run = mlrun.run_function(
            name="train-cnn",
            function="model",
            handler="train_cnn",
            params={
                "processed_bucket": processed_bucket,
                "processed_metadata_filename": processed_metadata_filename,
                "sample": sample,
                "batch_size": batch_size,
                "epochs": epochs,
            },
            local=False
        ).after(feature_engineering_run)
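Only one of the two `dsl.Condition` branches executes for a given `target_model`. To make the resulting step order explicit, here is a plain-Python sketch. The `planned_steps` helper is hypothetical and does not call Kubeflow or MLRun; the step names are the `name` values used in the workflow.

```python
# Steps that run for every value of target_model, in dependency order.
COMMON_STEPS = [
    "load-data-into-minio-object-storage",
    "preprocess-data",
    "preprocess-images",
    "feature-engineering",
]

# The training step selected by each target_model value.
TRAINING_STEP = {
    "rf": "train-random-forest-model",
    "cnn": "train-cnn",
}


def planned_steps(target_model):
    """Return the step names that execute for the given target_model."""
    if target_model not in TRAINING_STEP:
        raise ValueError(f"target_model must be 'rf' or 'cnn', got {target_model!r}")
    return COMMON_STEPS + [TRAINING_STEP[target_model]]


rf_steps = planned_steps("rf")
cnn_steps = planned_steps("cnn")
```

Both lists share the first four steps and differ only in the final training step.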

In [None]:
# Register the workflow file:
workflow_name = "skin_cancer_detection_workflow"
project.set_workflow(workflow_name, "workflow.py")

# Save the project:
project.save()

In [None]:
project.run(
    name=workflow_name,
    arguments={
        "target_model": "cnn",
        "sample": 1000,
    },
    watch=True
)
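The `arguments` dictionary overrides only the pipeline parameters it names; the others keep the defaults from the `kfpipeline` signature. The sketch below shows the resulting values for this run. The `effective_arguments` helper is hypothetical, and the defaults are copied from `workflow.py`.

```python
# Defaults copied from the kfpipeline signature in workflow.py.
# target_model has no default, so it must always be supplied.
PIPELINE_DEFAULTS = {
    "segmented_samples": 2000,
    "sample": 400,
    "batch_size": 32,
    "epochs": 10,
}


def effective_arguments(arguments):
    """Hypothetical helper: merge run arguments over the pipeline defaults."""
    if "target_model" not in arguments:
        raise ValueError("target_model is required")
    unknown = set(arguments) - set(PIPELINE_DEFAULTS) - {"target_model"}
    if unknown:
        raise ValueError(f"unknown pipeline arguments: {sorted(unknown)}")
    return {**PIPELINE_DEFAULTS, **arguments}


run_args = effective_arguments({"target_model": "cnn", "sample": 1000})
```

For the run above, `sample` becomes 1000 while `segmented_samples`, `batch_size`, and `epochs` keep their defaults.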