# Skin Cancer Pipeline

This pipeline performs the following steps:
1. **Fetch Dataset:** Check whether the dataset already exists in MinIO; if it does not, download it from Kaggle and upload it.
2. **Process Data:** Load the metadata CSV, merge image paths, and perform some plotting.
3. **Feature Engineering:** Perform feature engineering on the processed metadata.
4. **Train Random Forest Model:** Train a Random Forest model on the engineered dataset.
5. **Train CNN Model:** Train a CNN on a sample of the engineered dataset.
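
The check-then-upload logic of step 1 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual fetch function: `InMemoryStore`, `ensure_dataset`, and the `/tmp` path are hypothetical names used only to keep the example runnable without a MinIO server or Kaggle credentials.

```python
from typing import Callable, Dict, Tuple


class InMemoryStore:
    """Tiny stand-in for an object store, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.objects: Dict[Tuple[str, str], str] = {}

    def has_object(self, bucket: str, key: str) -> bool:
        return (bucket, key) in self.objects

    def upload(self, bucket: str, key: str, local_path: str) -> None:
        self.objects[(bucket, key)] = local_path


def ensure_dataset(store, bucket: str, key: str, download: Callable[[], str]) -> bool:
    """Upload the dataset only when it is missing; return True if an upload happened."""
    if store.has_object(bucket, key):
        return False  # already in the store, nothing to fetch
    local_path = download()  # e.g. a Kaggle download that returns a local file path
    store.upload(bucket, key, local_path)
    return True


store = InMemoryStore()
fake_download = lambda: "/tmp/HAM10000_metadata.csv"  # hypothetical local path
first = ensure_dataset(store, "data", "HAM10000_metadata.csv", fake_download)
second = ensure_dataset(store, "data", "HAM10000_metadata.csv", fake_download)
print(first, second)  # True False
```

With the real MinIO Python client, `has_object` would typically wrap `stat_object` and `upload` would wrap `fput_object`; the second call shows that a dataset already in the bucket is not downloaded again.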

Below is the complete pipeline implementation using MLRun in a Jupyter Notebook.

In [None]:
!pip install minio
!pip install mlrun

## Setup the Project
Create the project with `mlrun.get_or_create_project`, which loads the project if it already exists and creates it otherwise, along with the paths where we'll store the project's artifacts:

In [None]:
import mlrun
import os

# Set our project's name:
project_name = "skin-cancer-detection"
project_dir = os.path.abspath('./')

# Create the project:
project = mlrun.get_or_create_project(project_name, project_dir, user_project=False)

A project in MLRun is based on the MLRun Functions it can run. In this notebook we will see two ways to create an MLRun Function:

* `mlrun.code_to_function`: Create our own MLRun Function from code (will be used for training and evaluation in section 4).
* `mlrun.import_function`: Import from [MLRun's functions marketplace](https://www.mlrun.org/hub/) - a [functions hub](https://docs.mlrun.org/en/v1.1.2/runtimes/load-from-marketplace.html) intended to be a centralized location for open source contributions of function components (will be used for downloading the data in section 2).

Before we continue, **please select the desired framework** (comment and uncomment the below lines as needed):

In [None]:
framework = "tf-keras"
# framework = "torch"

# To train on a GPU, set this variable to True; otherwise leave it False:
use_gpu = False
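
The image build cell later in this notebook only branches on `'tf-keras'` and `'torch'`; any other value leaves `commands` undefined and the cell fails with a `NameError`. A small guard can surface a bad selection earlier. This is only a sketch, and `check_framework` is a hypothetical helper, not part of MLRun.

```python
SUPPORTED_FRAMEWORKS = ("tf-keras", "torch")  # the two values the build cell branches on


def check_framework(framework: str) -> str:
    """Return the framework name unchanged, or raise if no build branch would handle it."""
    if framework not in SUPPORTED_FRAMEWORKS:
        raise ValueError(
            f"Unsupported framework {framework!r}; expected one of {SUPPORTED_FRAMEWORKS}"
        )
    return framework


selected = check_framework("tf-keras")
```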

## Build project images
Build the images that satisfy the project's requirements, according to the framework and GPU selection above.

In [None]:
import os
import mlrun

# `framework` ('tf-keras' or 'torch') and `use_gpu` are set in the selection cell above.


if framework == 'tf-keras':
    commands = [
        'pip install kaggle==1.5.12',
        'pip install minio==7.1.5',
        'pip install tensorflow~=2.9.0',
        'pip install horovod==0.25.0',
        'pip install pandas==1.4.0',
        'pip install matplotlib==3.5.1',
        'pip install seaborn==0.11.2',
        'pip install scikit-learn==1.1.1',
        'pip install opencv-python==4.5.5.64',
    ]
    builder_env = {
        'HOROVOD_WITH_MPI': '1',
        'HOROVOD_WITH_TENSORFLOW': '1'
    }
elif framework == 'torch':
    commands = [
        'pip install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html',
        'pip install torchvision==0.14.0+cpu -f https://download.pytorch.org/whl/torch_stable.html',
        'pip install tensorboard==2.5.0',
        'pip install horovod==0.25.0',
        'pip install onnx~=1.15.0',
        'pip install pandas==1.4.0',
        'pip install matplotlib==3.5.1',
        'pip install seaborn==0.11.2',
        'pip install scikit-learn==1.1.1',
        'pip install opencv-python==4.5.5.64',
        'pip install Pillow==9.0.1',
        'pip install numpy==1.21.5',
    ]
    builder_env = {
        'HOROVOD_WITH_MPI': '1',
        'HOROVOD_WITH_PYTORCH': '1'
    }

# Without a GPU, prepend an apt command that installs the build toolchain and set the matching env flag.
if not use_gpu:
    apt_cmd = [
        # A single shell command: the two adjacent string literals are concatenated.
        "apt update -qqq --fix-missing && apt upgrade -y && apt install -y build-essential cmake gcc "
        "&& apt clean && rm -rf /var/lib/apt/lists/*",
    ]
    commands = apt_cmd + commands
    builder_env['HOROVOD_WITHOUT_GLOO'] = '1'
# Build the image.
# The image tag will be constructed as: ".cancer-detection-<framework>".
# The base image is set to "mlrun/mlrun-gpu" if using GPU, else "mlrun/mlrun".
project.build_image(
    image=f".cancer-detection-{framework}",
    base_image='mlrun/mlrun-gpu' if use_gpu else 'mlrun/mlrun',
    commands=commands,
    builder_env=builder_env,
    extra_args="--skip-tls-verify",
    overwrite_build_params=True
)
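
Assuming each entry in `commands` is executed as its own build step (the usual behavior when commands become separate Dockerfile `RUN` lines), a chain joined with `&&` has to be passed as one string; an entry that starts with `&&` would be a shell syntax error. The hypothetical `chain` helper below is one way to keep such chains readable without splitting them across list entries.

```python
def chain(*steps: str) -> str:
    """Join shell steps into one command that stops at the first failing step."""
    return " && ".join(step.strip() for step in steps)


apt_setup = chain(
    "apt update -qqq --fix-missing",
    "apt upgrade -y",
    "apt install -y build-essential cmake gcc",
    "apt clean",
    "rm -rf /var/lib/apt/lists/*",
)

# The whole apt chain is a single list entry, followed by ordinary pip commands.
example_commands = [apt_setup, "pip install minio==7.1.5"]
```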

In [None]:
project.build_image(
    image=".load-data",
    base_image='mlrun/mlrun',
    commands=[
        'pip install kagglehub',
        'pip install minio',
    ],
    extra_args="--skip-tls-verify",
    overwrite_build_params=True
)

project.build_image(
    image=".process-data",
    base_image='mlrun/mlrun',
    commands=[
        'pip install minio',
        'pip install pandas',
        'pip install matplotlib==3.5.3',   # Pin matplotlib to a compatible version
        'pip install seaborn',
        'pip install scikit-learn',
        'pip install opencv-python-headless',
        'pip install Pillow',
        'pip install numpy',
    ],
    extra_args="--skip-tls-verify",
    overwrite_build_params=True
)
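
The `.load-data` and `.process-data` images leave most packages unpinned, so a later rebuild may pull different versions. One option, sketched here under the assumption that you want all pins in one place, is a small helper that generates the `pip install` commands from a mapping. `pip_commands` is a hypothetical name, and the `minio` version is simply the one pinned in the framework image above.

```python
from typing import Dict, List, Optional


def pip_commands(packages: Dict[str, Optional[str]]) -> List[str]:
    """Turn {"name": "version" or None} into a list of 'pip install' commands."""
    commands = []
    for name, version in packages.items():
        # Pin with '==' when a version is given, otherwise install the latest release.
        spec = f"{name}=={version}" if version else name
        commands.append(f"pip install {spec}")
    return commands


load_data_commands = pip_commands({"kagglehub": None, "minio": "7.1.5"})
print(load_data_commands)  # ['pip install kagglehub', 'pip install minio==7.1.5']
```

The resulting list can be passed as the `commands` argument of `project.build_image` in place of the hand-written list.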

## Create the MLRun Function & Run
We use MLRun's `mlrun.code_to_function` to create an MLRun Function from the code in `functions/process.py`.

We want it to run as a job, so we set the `kind` parameter to `"job"`.

In [None]:
metadata_filename = "HAM10000_metadata.csv" 
source_bucket = "data" 
processed_bucket = "processed-data"
processed_metadata_filename = "processed_metadata.pkl"
images_dir = "images"

In [None]:
# Create the function from the code in the given file using 'code_to_function':
process_function = mlrun.code_to_function(
    filename="functions/process.py",
    name="process-data",
    kind="job",
    image='.process-data'
)

process_function.with_requests(mem="12G")  # lower bound
process_function.with_limits(mem="12G")  # upper bound

process_run = process_function.run(
    name="Preprocess data",
    handler="process_metadata",
    params={
        "metadata_filename": metadata_filename,
        "source_bucket": source_bucket, 
        "processed_bucket": processed_bucket,
        "images_dir": images_dir,
        "processed_metadata_filename": processed_metadata_filename,
    },
    local=False
)

# Wait for completion and show the results.
process_run.wait_for_completion()
process_run.show()
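
The `process_metadata` handler lives in `functions/process.py` and is not shown in this notebook. As a rough illustration of the "merge image paths" step from the overview, the sketch below attaches an image path to each metadata row using only the standard library. The `image_id` column follows the public HAM10000 metadata layout; the `.jpg` extension, the `attach_image_paths` name, and the sample rows are assumptions made for the example.

```python
import csv
import io
from typing import Dict, List


def attach_image_paths(metadata_csv: str, images_dir: str) -> List[Dict[str, str]]:
    """Parse metadata CSV text and add an 'image_path' field to every row."""
    rows = []
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        # Assumed naming scheme: <images_dir>/<image_id>.jpg
        row["image_path"] = f"{images_dir}/{row['image_id']}.jpg"
        rows.append(row)
    return rows


# Illustrative rows only, shaped like the HAM10000 metadata file.
sample = (
    "lesion_id,image_id,dx\n"
    "HAM_0000118,ISIC_0027419,bkl\n"
    "HAM_0002730,ISIC_0026769,bkl\n"
)
rows = attach_image_paths(sample, "images")
print(rows[0]["image_path"])  # images/ISIC_0027419.jpg
```

A real handler would more likely do this with pandas on the full CSV and write the result to the processed bucket.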

In [None]:
# Create the function that will create segmented images
create_segmented_function = mlrun.code_to_function(
    filename="functions/process.py",
    name="process-images",
    kind="job",
    image='.process-data'
)

# create_segmented_function.with_requests(mem="1G", cpu=1)  # lower bound
# create_segmented_function.with_limits(mem="2G", cpu=2)  # upper bound

create_segmented_run = create_segmented_function.run(
    name="Preprocess images",
    handler="create_segmented_images",
    params={
        "processed_bucket": processed_bucket,
        "processed_metadata_filename": processed_metadata_filename,
        "n_samples": 2000,
    },
    local=False,
    watch=True,  # <- Turn on the logs.
)

# Wait for completion and show the results.
create_segmented_run.wait_for_completion()
create_segmented_run.show()