## Scikit-Learn Hierarchical Clustering
### Using MALL_CUSTOMERS_VIEW from DWC. This view has 200 records

## Install fedml_gcp package

In [1]:
pip install fedml_gcp-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_gcp-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Collecting google
  Using cached google-3.0.0-py2.py3-none-any.whl (45 kB)
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.2.1-py3-none-any.whl (33 kB)
Installing collected packages: soupsieve, beautifulsoup4, hdbcli, google, fedml-gcp
  Attempting uninstall: soupsieve
    Found existing installation: soupsieve 2.2.1
    Uninstalling soupsieve-2.2.1:
      Successfully uninstalled soupsieve-2.2.1
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.10.0
    Uninstalling beautifulsoup4-4.10.0:
      Successfully uninstalled beautifulsoup4-4.10.0
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: goo

## Import Libraries

In [2]:
from fedml_gcp import DwcGCP
import numpy as np
import pandas as pd

## Create DwcGCP Instance to access class methods and train model

It is expected that the bucket name passed here already exists in Cloud Storage.

In [3]:
dwc = DwcGCP(project_name='fed-ml',
             bucket_name='fedml-bucket')

### Create tar bundle of script folder so GCP can use it for training

Before running this cell, please ensure that the script package has all the necessary files for a training job.
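One way to check the script package before bundling it is a small helper that lists missing files. This is a minimal sketch only: the file list below is an assumed layout (a `trainer` package containing `task.py`, matching the `trainer.task` module used for training in this notebook, plus a `config.json` whose exact location is a guess), so adjust it to your own package.

```python
from pathlib import Path


def missing_files(package_dir, required):
    """Return the entries of `required` that do not exist under `package_dir`."""
    root = Path(package_dir)
    return [rel for rel in required if not (root / rel).is_file()]


# Assumed layout, not confirmed by fedml_gcp: edit this list to match your package.
REQUIRED = [
    "setup.py",
    "trainer/__init__.py",
    "trainer/task.py",
    "trainer/config.json",
]

missing = missing_files("HierarchicalClustering", REQUIRED)
print("Missing files:", missing if missing else "none")
```

If the list is not empty, add the missing files before creating the tar bundle.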

In [4]:
dwc.make_tar_bundle('HierarchicalClustering.tar.gz', 'HierarchicalClustering', 'hc/train/HierarchicalClustering.tar.gz')


File HierarchicalClustering.tar.gz uploaded to hc/train/HierarchicalClustering.tar.gz.


### Create tar bundle of predictor script folder so GCP can use it for inferencing

Before running this cell, please ensure that the predictor package has all the necessary files for inferencing.

In [5]:
dwc.make_tar_bundle('HierarchicalClusteringPredictor.tar.gz', 'HierarchicalClusteringPredictor', 'hc/prediction/HierarchicalClusteringPredictor.tar.gz')


File HierarchicalClusteringPredictor.tar.gz uploaded to hc/prediction/HierarchicalClusteringPredictor.tar.gz.
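The predictor package itself is not shown in this notebook. As a rough illustration of what a `predictor.MyPredictor` class could look like, here is a sketch that follows the AI Platform custom prediction routine interface (a `from_path` class method and a `predict` method). The prediction logic is an assumption, not the author's implementation: because scikit-learn's agglomerative clustering has no `predict` method for unseen points, the sketch assigns each instance to the nearest cluster centroid, loaded from a hypothetical `centroids.json` file in the model directory.

```python
import json
import os


def _squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class MyPredictor(object):
    """Sketch of a custom predictor that assigns instances to the nearest centroid."""

    def __init__(self, centroids):
        self._centroids = centroids

    def predict(self, instances, **kwargs):
        """Return the index of the closest centroid for every instance."""
        labels = []
        for row in instances:
            distances = [_squared_distance(row, c) for c in self._centroids]
            labels.append(distances.index(min(distances)))
        return labels

    @classmethod
    def from_path(cls, model_dir):
        """Build the predictor from a hypothetical centroids.json in model_dir."""
        with open(os.path.join(model_dir, "centroids.json")) as f:
            return cls(json.load(f))


# Local check with two made-up centroids.
print(MyPredictor([[0, 0], [10, 10]]).predict([[1, 2], [9, 7]]))  # prints [0, 1]
```

In this sketch the training script would need to write `centroids.json` next to the model artifacts; that step is likewise an assumption.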


### Train Model

GCP takes in training inputs that are specific to the training job and the environment needed.

In the training inputs, we are passing the Python module. This is the name of the module in your script package, and it references the task.py file inside the script package.

We are also passing args, which hold the name of the table to get data from. Before running the following cell, you should have a config.json file in the script package with the values needed to access DWC.

You should also have the following view, `MALL_CUSTOMERS_VIEW`, created in your DWC. To gather this data, please refer to https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis/data

In [6]:
training_inputs = {
    'scaleTier': 'BASIC',
    'packageUris': ['gs://fedml-bucket/hc/train/HierarchicalClustering.tar.gz',
                    'gs://fedml-bucket/fedml_gcp-1.0.0-py3-none-any.whl'],
    'pythonModule': 'trainer.task',
    'args': ['--table_name', 'MALL_CUSTOMERS_VIEW',
             '--table_size', '1',
             '--bucket_name', 'fedml-bucket'],
    'region': 'us-east1',
    'jobDir': 'gs://fedml-bucket',
    'runtimeVersion': '2.5',
    'pythonVersion': '3.7',
    'scheduling': {'maxWaitTime': '3600s', 'maxRunningTime': '7200s'}
}
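The `args` list above reaches the training module as ordinary command-line flags. The actual task.py is not shown here, so the following is only a sketch of how such a script might read them with `argparse`. The help texts and the decision to keep `--table_size` as a string are assumptions; `parse_known_args` is used so that extra flags appended by the service (such as `--job-dir` when `jobDir` is set) do not cause an error.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed through the 'args' entry of the training inputs."""
    parser = argparse.ArgumentParser(description="Hierarchical clustering training job")
    parser.add_argument("--table_name", required=True,
                        help="DWC view or table to read the training data from")
    parser.add_argument("--table_size", default="1",
                        help="kept as a string; its meaning is defined by the training script")
    parser.add_argument("--bucket_name", required=True,
                        help="Cloud Storage bucket used by the job")
    # Ignore flags this sketch does not know about instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--table_name", "MALL_CUSTOMERS_VIEW",
                   "--table_size", "1",
                   "--bucket_name", "fedml-bucket"])
print(args.table_name, args.table_size, args.bucket_name)
```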

In [7]:
dwc.train_model('h_clustering_final_train2', training_inputs)

Training Job Submitted Succesfully
Job status for fed-ml.h_clustering_final_train2:
    state : QUEUED
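A freshly submitted job starts in the `QUEUED` state, and deployment presumably needs the model artifacts that the job writes, so it can help to wait until the job reaches a terminal state before deploying. The helper below is a generic polling sketch: `get_state` is a hypothetical callable that you would supply (for example, a wrapper around whatever status call you use), not a fedml_gcp function. The state names are those reported by AI Platform Training jobs.

```python
import time

# States after which an AI Platform Training job no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30, timeout_seconds=7200, sleep=time.sleep):
    """Call get_state() until it returns a terminal state, then return that state."""
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError("job still in state %s after %d seconds" % (state, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake sequence of states instead of a real status call.
fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
print(wait_for_job(lambda: next(fake_states), poll_seconds=0))  # prints SUCCEEDED
```

Only continue to deployment if the returned state is `SUCCEEDED`.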


### Deploy model

In [11]:
dwc.deploy(model_name='h_clustering_final_deploy3', model_location='/hc/model/', version='v1', region='us-east1',
           prediction_location='hc/prediction/', custom_predict='HierarchicalClusteringPredictor.tar.gz',
           module_name='predictor.MyPredictor')


{'name': 'projects/fed-ml/models/h_clustering_final_deploy3', 'regions': ['us-east1'], 'etag': 'zp24dvA7Or8='}
{'name': 'projects/fed-ml/operations/create_h_clustering_final_deploy3_version-1633569329916', 'metadata': {'@type': 'type.googleapis.com/google.cloud.ml.v1.OperationMetadata', 'createTime': '2021-10-07T01:15:30Z', 'operationType': 'CREATE_VERSION', 'modelName': 'projects/fed-ml/models/h_clustering_final_deploy3', 'version': {'name': 'projects/fed-ml/models/h_clustering_final_deploy3/versions/version', 'deploymentUri': 'gs://fedml-bucket/hc/model/', 'createTime': '2021-10-07T01:15:29Z', 'runtimeVersion': '2.5', 'packageUris': ['gs://fedml-bucket/hc/prediction/HierarchicalClusteringPredictor.tar.gz'], 'etag': 'cw5dx7ydW2c=', 'machineType': 'mls1-c1-m2', 'pythonVersion': '3.7', 'predictionClass': 'predictor.MyPredictor'}}}
