## Scikit-Learn Hierarchical Clustering
### Using `MALL_CUSTOMERS_VIEW` from DWC, a view with 200 records

## Install fedml_aws library

In [1]:
%pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall

Processing ./fedml_aws-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-aws
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 1.0.0
    Uninstalling fedml-aws-1.0.0:
      Successfully uninstalled fedml-aws-1.0.0
Successfully installed fedml-aws-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='scikit-learn/hierarchical-clustering', 
                    bucket_name='fedml-bucket')


## Create a DbConnection instance to get data from DWC

Before running the following cell, make sure a `config.json` file with the values required to access DWC is in the same directory as this notebook.

You should also have the view `MALL_CUSTOMERS_VIEW` created in your DWC. To obtain the underlying data, refer to https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis/data
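Before connecting, it can help to confirm that `config.json` is present and parses as JSON. The sketch below is a minimal check under stated assumptions: `load_config` is a hypothetical helper, not part of `fedml_aws`, and it assumes the file holds a JSON object at the top level. It makes no assumption about which keys the library expects.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Return the parsed JSON config, or raise a clear error if it is unusable."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"Expected a config file at {config_path.resolve()}")
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)  # raises json.JSONDecodeError on malformed JSON
    # Assumption: the connection settings are stored as a top-level JSON object.
    if not isinstance(config, dict):
        raise ValueError("config.json should contain a JSON object at the top level")
    return config


if __name__ == "__main__":
    try:
        cfg = load_config()
    except (FileNotFoundError, ValueError) as err:
        print(f"Config check failed: {err}")
    else:
        # Print only the key names so credentials never end up in notebook output.
        print("Config keys found:", sorted(cfg))
```

Printing only the key names is a deliberate choice here, since the file is likely to contain credentials.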

In [4]:
%%time
db = DbConnection()
data = db.get_data_with_headers(table_name="MALL_CUSTOMERS_VIEW", size=1)
data = pd.DataFrame(data[0], columns=data[1])
data = data.sample(frac=1).reset_index(drop=True)
data = data.head(100)
data

CPU times: user 50.4 ms, sys: 803 µs, total: 51.2 ms
Wall time: 102 ms


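| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 92 | Male | 18 | 59 | 41 |
| 1 | 31 | Male | 60 | 30 | 4 |
| 2 | 12 | Female | 35 | 19 | 99 |
| 3 | 11 | Male | 67 | 19 | 14 |
| 4 | 1 | Male | 19 | 15 | 39 |
| ... | ... | ... | ... | ... | ... |
| 95 | 163 | Male | 19 | 81 | 5 |
| 96 | 62 | Male | 19 | 46 | 55 |
| 97 | 120 | Female | 50 | 67 | 57 |
| 98 | 117 | Female | 63 | 65 | 43 |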
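The cell above shuffles the rows, renumbers the index, and keeps the first 100 rows, which amounts to drawing a random sample of 100 customers without replacement. The sketch below walks through the same three calls on a toy frame. The values are made up for illustration, and `random_state` is added only so the sketch is repeatable; the notebook cell does not set it.

```python
import pandas as pd

# Toy stand-in for the view; the values are invented for illustration.
toy = pd.DataFrame({"CustomerID": [1, 2, 3, 4, 5, 6],
                    "Age": [19, 21, 20, 23, 31, 22]})

# frac=1 draws every row exactly once, i.e. a random permutation of the rows.
shuffled = toy.sample(frac=1, random_state=0)

# reset_index(drop=True) discards the old row labels and renumbers from 0.
shuffled = shuffled.reset_index(drop=True)

# head(3) keeps the first three shuffled rows: a random sample without replacement.
subset = shuffled.head(3)
print(subset)
```

An equivalent and shorter form would be `toy.sample(n=3).reset_index(drop=True)`, which draws the three rows directly.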


## Train Scikit-Learn Model
`train_data` is the data you want to train your model with. 

To train and deploy a model on AWS with the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train and deploy the model. Pass the path to that script to the `train_sklearn_model` function through the `train_script` parameter.

`instance_type` specifies the SageMaker instance type, and therefore how much computing power AWS allocates for the training job.
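The contents of `train_script.py` are not shown in this notebook. As a rough illustration of what such a script might look like, the sketch below follows the usual SageMaker scikit-learn script-mode conventions: read the data location and model directory from environment variables, fit the model, save it with `joblib`, and expose a `model_fn` loader. Several details are assumptions rather than the author's script: the feature columns, the `--n-clusters` hyperparameter, Ward linkage, the `SM_CHANNEL_TRAIN` channel name, and the idea that the uploaded training data arrives as CSV files with headers.

```python
import argparse
import glob
import os

import joblib
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Assumed feature columns, taken from the DataFrame shown earlier.
FEATURES = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]


def fit_clusters(df, n_clusters=5):
    """Fit Ward-linkage agglomerative clustering on the numeric features."""
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    model.fit(df[FEATURES])
    return model


def model_fn(model_dir):
    """Loader that the SageMaker scikit-learn container calls at deployment."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-clusters", type=int, default=5)  # hypothetical hyperparameter
    # SageMaker exposes these locations through environment variables.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    args, _ = parser.parse_known_args()

    # Assumption: the training channel holds one or more CSV files with headers.
    csv_paths = sorted(glob.glob(os.path.join(args.train, "*.csv")))
    df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

    model = fit_clusters(df, n_clusters=args.n_clusters)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))


REQUIRED_ENV = ("SM_MODEL_DIR", "SM_CHANNEL_TRAIN")
if __name__ == "__main__" and all(name in os.environ for name in REQUIRED_ENV):
    # Only run end to end when SageMaker has provided the environment.
    main()
```

One design point to keep in mind: `AgglomerativeClustering` has no `predict` method, so a deployed endpoint built on a script like this would likely also need a custom `predict_fn`, for example one that refits on the incoming batch or assigns points to the nearest learned cluster.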

In [5]:
clf = dwcs.train_sklearn_model(data,
                               train_script='train_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True)

Training data uploaded
2021-10-06 22:57:01 Starting - Starting the training job...
2021-10-06 22:57:26 Starting - Launching requested ML instancesProfilerReport-1633561020: InProgress
......
2021-10-06 22:58:26 Starting - Preparing the instances for training............
2021-10-06 23:00:27 Downloading - Downloading input data...
2021-10-06 23:00:47 Training - Downloading the training image..
2021-10-06 23:01:09,641 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-10-06 23:01:09,643 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:01:09,654 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-10-06 23:01:10,222 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:01:11,651 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-10-06 23:01:11,665 