# Scikit-Learn PCA
This notebook uses BREASTCANCER_VIEW from SAP Datasphere. The view has 569 records.

## Install fedml_gcp package

In [None]:
pip install fedml_gcp

## Import Libraries

In [None]:
import os
import time
from fedml_gcp import dwcgcp

## Some constant variables to use throughout the notebook

In [None]:
PROJECT_ID = '<project_id>'
REGION = '<region>'

BUCKET_NAME = '<bucket_name>'
BUCKET_URI = "gs://"+BUCKET_NAME
BUCKET_FOLDER = 'dimreduction'
MODEL_OUTPUT_DIR = BUCKET_URI+'/'+BUCKET_FOLDER
GCS_PATH_TO_MODEL_ARTIFACTS = MODEL_OUTPUT_DIR+'/model/'

TRAINING_PACKAGE_PATH = 'Dimensionality-Reduction'
PREDICTOR_PACKAGE_PATH = 'DimensionalityReductionPredictor'
JOB_NAME = "dimreduction-training"

MODEL_DISPLAY_NAME = "dimreduction-model"
DEPLOYED_MODEL_DISPLAY_NAME = 'dimreduction-deployed-model'

TAR_BUNDLE_NAME = 'DimReduction.tar.gz'

CONTAINER_REGISTRY_REPOSITORY = 'dimreduction'
IMAGE = 'image-'+str(int(time.time()))

## Create DwcGCP Instance to access class methods and train model

It is expected that the bucket name passed here already exists in Cloud Storage.

For information on this constructor, please refer to the readme.

In [None]:
params = {'project':PROJECT_ID,
         'location':REGION, 
         'staging_bucket':BUCKET_URI}

In [None]:
dwc = dwcgcp.DwcGCP(params)


## Create tar bundle of script folder so GCP can use it for training

Please refer to the readme for more information on the dwc.make_tar_bundle() function.

Before running this cell, please ensure that the script package has all the necessary files for a training job.
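
A quick pre-flight check can catch a missing file before the bundle is uploaded. The sketch below assumes the training package follows the usual Python source-distribution layout: a `setup.py` at the top level plus a `trainer` package containing `task.py` (the `trainer.task` module name is the one used later in this notebook). The `setup.py` requirement and the helper name `missing_training_files` are assumptions for illustration, not part of fedml_gcp; adjust the list to match your package.

```python
from pathlib import Path

# Files assumed to be required. Only trainer/task.py is implied by this notebook;
# setup.py and __init__.py are assumptions about a typical package layout.
EXPECTED_FILES = ("setup.py", "trainer/__init__.py", "trainer/task.py")


def missing_training_files(package_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from package_dir."""
    root = Path(package_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_training_files(TRAINING_PACKAGE_PATH)` before `dwc.make_tar_bundle()` returns an empty list when nothing is missing.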

In [None]:
dwc.make_tar_bundle(TAR_BUNDLE_NAME, 
                    TRAINING_PACKAGE_PATH, 
                    BUCKET_FOLDER+'/train/'+TAR_BUNDLE_NAME)


## Determine which training image and deployment image you want to use

Please refer here for the training pre-built containers: https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container

Please refer here for the deployment pre-built containers: https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers

In [None]:
TRAIN_VERSION = "scikit-learn-cpu.0-23"
DEPLOY_VERSION = "sklearn-cpu.1-0"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

## Training using a custom python package and prebuilt container
For information on the dwc.train_model() function, please refer to the readme.

In the training inputs, we are using a script. When using a script, we also have to pass the packages it requires.

We are also passing args, which hold the name of the table to read from and some other values we want to access in our training script. Before running the following cell, you should have a config.json uploaded to the bucket you specified above, so that the training job can read it at the path '/gcs/' + bucket_name + '/config.json'. This path is specified in the training script, inside the function called get_dwc_data. It is used as the url parameter to DbConnection() so DbConnection knows where to find your credentials for access to SAP Datasphere.
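
To make the expected location concrete: Vertex AI training containers expose Cloud Storage buckets through a Cloud Storage FUSE mount under `/gcs/`, so a file uploaded to the bucket root as `config.json` appears at `/gcs/<bucket_name>/config.json`. The helpers below are a hypothetical sketch of how a training script could build and read that path. The function names and the `gcs_root` parameter are assumptions for illustration, and the real `get_dwc_data` in the training package may be written differently.

```python
import json
import posixpath


def config_path(bucket_name, gcs_root="/gcs"):
    """Build the path where the training job expects to find config.json."""
    return posixpath.join(gcs_root, bucket_name, "config.json")


def load_config(path):
    """Read the credentials file; raises FileNotFoundError if it was not uploaded."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


print(config_path("my-bucket"))  # /gcs/my-bucket/config.json
```

A missing file then fails early with a clear `FileNotFoundError` instead of surfacing later as a connection error.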

You should also have the following view, BREASTCANCER_VIEW, created in your SAP Datasphere. To obtain this data, please refer to https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

In [None]:
table_name = 'BREASTCANCER_VIEW'
num_components = 3
job_dir = 'gs://'+BUCKET_NAME

cmd_args = [
    "--table_name=" + str(table_name),
    "--num_components=" + str(num_components),
    "--job-dir=" + str(job_dir),
    "--bucket_name=" + str(BUCKET_NAME),
    "--bucket_folder=" + str(BUCKET_FOLDER),
    "--package_name=" + 'trainer'
]
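
On the training side, these strings arrive as command-line arguments to the `trainer.task` module. The following is a minimal sketch of how such a script might parse them with `argparse`. The flag names mirror `cmd_args` above, but the parser itself, the `parse_training_args` name, the defaults, and the choice of types are assumptions rather than the actual contents of the training package.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the flags sent in cmd_args (hypothetical training entry point)."""
    parser = argparse.ArgumentParser(description="Hypothetical PCA training entry point")
    parser.add_argument("--table_name", required=True)
    # Values are sent as strings, so convert the component count back to int.
    parser.add_argument("--num_components", type=int, default=3)
    parser.add_argument("--job-dir", dest="job_dir")
    parser.add_argument("--bucket_name", required=True)
    parser.add_argument("--bucket_folder", default="")
    parser.add_argument("--package_name", default="trainer")
    return parser.parse_args(argv)


# Example values only; the bucket name here is a placeholder.
args = parse_training_args([
    "--table_name=BREASTCANCER_VIEW",
    "--num_components=3",
    "--job-dir=gs://example-bucket",
    "--bucket_name=example-bucket",
    "--bucket_folder=dimreduction",
    "--package_name=trainer",
])
print(args.table_name, args.num_components)
```

Because every entry in `cmd_args` is built with `str()`, numeric settings such as `num_components` need an explicit `type=int` (or equivalent) on the receiving end.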

In [None]:
inputs ={
    'display_name':JOB_NAME,
    'python_package_gcs_uri':BUCKET_URI + '/' + BUCKET_FOLDER+'/train/'+TAR_BUNDLE_NAME,
    'python_module_name':'trainer.task',
    'container_uri':TRAIN_IMAGE,
    'model_serving_container_image_uri':DEPLOY_IMAGE,
}

In [None]:
run_job_params = {'model_display_name':MODEL_DISPLAY_NAME,
                  'args':cmd_args,
                  'replica_count':1,
                  'base_output_dir':MODEL_OUTPUT_DIR,
                  'sync':True}

In [None]:
job = dwc.train_model(training_inputs=inputs,
                      training_type='customPythonPackage',
                      params=run_job_params)

## Deployment

For information on the dwc.deploy() function, please refer to the readme.

Here we are deploying a custom predictor for the model we trained above.

In [None]:
from DimensionalityReductionPredictor.predictor import MyPredictor

In [None]:
cpr_model_config = {
    'src_dir': PREDICTOR_PACKAGE_PATH,
    'output_image_uri':f"gcr.io/{PROJECT_ID}/{CONTAINER_REGISTRY_REPOSITORY}/{IMAGE}",
    'predictor':MyPredictor,
    'requirements_path':os.path.join(PREDICTOR_PACKAGE_PATH, "requirements.txt"),
    'no_cache':True
}
upload_config = {
    'display_name':DEPLOYED_MODEL_DISPLAY_NAME,
    'artifact_uri':GCS_PATH_TO_MODEL_ARTIFACTS,
}

In [None]:
model = dwc.upload_custom_predictor(cpr_model_config, upload_config)

In [None]:
model_config = {'machine_type': "n1-standard-4"}
endpoint = dwc.deploy(model, model_config)

## Prediction

Once the model is deployed to an endpoint, we can run predictions on it.

For information on the dwc.predict() function, please refer to the readme.

Since we are using DbConnection here, we will need to have the config.json in this notebook instance as well.

In [None]:
from fedml_gcp import DbConnection
import pandas as pd
import numpy as np

In [None]:
db = DbConnection()
org_data = db.get_data_with_headers(table_name="BREASTCANCER_VIEW", size=1)
org_data = pd.DataFrame(org_data[0], columns=org_data[1])
org_data = org_data.sample(frac=1).reset_index(drop=True)
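# Keep the rows from position 500 onward and replace missing values with 0.
# Chaining avoids calling an in-place fillna on a slice of the DataFrame.
org_data = org_data[500:].fillna(0)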
y = org_data['diagnosis']
X = org_data.drop(['diagnosis'], axis=1)

In [None]:
X

In [None]:
params = {'instances':X.astype('float64').values.tolist()}
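
The `instances` value is a list of rows, each row a plain list of floats, which is what makes the payload JSON-serializable. A stdlib-only sketch of the same shape, using made-up numbers rather than the real view, is shown below. `build_payload` is a hypothetical helper for illustration, not a fedml_gcp function.

```python
import json


def build_payload(rows):
    """Convert an iterable of numeric rows into a prediction request body."""
    instances = [[float(value) for value in row] for row in rows]
    # Every row should have the same number of features.
    widths = {len(row) for row in instances}
    if len(widths) > 1:
        raise ValueError(f"rows have inconsistent lengths: {sorted(widths)}")
    return {"instances": instances}


payload = build_payload([(1, 2.5, 3), (4, 5, 6.25)])
print(json.dumps(payload))
# {"instances": [[1.0, 2.5, 3.0], [4.0, 5.0, 6.25]]}
```

The length check is a design choice for this sketch: a ragged row is easier to diagnose locally than as an error returned by the endpoint.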

In [None]:
predictions = dwc.predict(endpoint=endpoint, predict_params=params)

In [None]:
predictions