## Scikit-Learn Hierarchical Clustering
### Using MALL_CUSTOMERS_VIEW from SAP Datasphere. This view has 200 records

## Install the fedml-aws library

In [1]:
pip install fedml-aws --force-reinstall

Note: you may need to restart the kernel to use updated packages.


## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='<prefix>', bucket_name='<bucket_name>')

Bucket created in  us-east-1


## Create DbConnection instance to get data from SAP Datasphere

Before running the following cell, make sure a `config.json` file containing the required connection values is in the same directory as this notebook, so that you can access SAP Datasphere.
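
A quick check before connecting can save a confusing stack trace. The sketch below only verifies that the file exists and that a set of keys is filled in. The key names in `ASSUMED_KEYS` and the helper `check_config` are assumptions made for illustration, not taken from this notebook; consult the fedml-aws documentation for the keys your version of `DbConnection` actually reads.

```python
import json
from pathlib import Path

# Assumed key names, for illustration only; adjust them to match
# what your version of fedml-aws expects in config.json.
ASSUMED_KEYS = ("address", "port", "user", "password", "schema")


def check_config(path="config.json", required=ASSUMED_KEYS):
    """Return the required keys that are missing or empty in the config file."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"{config_path} was not found")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # A key counts as missing if it is absent, null, or an empty string.
    return [key for key in required if config.get(key) in (None, "")]
```

Calling `check_config()` from the notebook directory returns an empty list when every assumed key is present and non-empty.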

You should also have the view `MALL_CUSTOMERS_VIEW` created in your SAP Datasphere. To obtain the data, see https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis/data

In [4]:
%%time
db = DbConnection()
data = db.get_data_with_headers(table_name="MALL_CUSTOMERS_VIEW", size=1)
data = pd.DataFrame(data[0], columns=data[1])
data = data.sample(frac=1).reset_index(drop=True)
train_data = data.head(100)
train_data

CPU times: user 50 ms, sys: 162 µs, total: 50.2 ms
Wall time: 134 ms


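|     | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|-----|------------|--------|-----|--------------------|------------------------|
| 0   | 44         | Female | 31  | 39                 | 61                     |
| 1   | 184        | Female | 29  | 98                 | 88                     |
| 2   | 156        | Female | 27  | 78                 | 89                     |
| 3   | 134        | Female | 31  | 72                 | 71                     |
| 4   | 129        | Male   | 59  | 71                 | 11                     |
| ... | ...        | ...    | ... | ...                | ...                    |
| 95  | 179        | Male   | 59  | 93                 | 14                     |
| 96  | 191        | Female | 34  | 103                | 23                     |
| 97  | 34         | Male   | 18  | 33                 | 92                     |
| 98  | 19         | Male   | 52  | 23                 | 29                     |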


## Train Scikit-Learn Model
`train_data` is the data you want to train your model with. 

To deploy a model to AWS with the SageMaker Scikit-learn SDK, you need a script that tells SageMaker how to train and deploy the model. Pass the path to that script to the `train_sklearn_model` function through the `train_script` parameter.

`instance_type` selects the AWS instance type, and therefore the amount of compute, allocated for the job (here `ml.c4.xlarge`).
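
For orientation, here is a minimal sketch of what a script-mode training script for the SageMaker scikit-learn container can look like. It is not the `train_script.py` used in this notebook. It assumes the training channel is named `train`, that the uploaded CSV keeps the view's column headers, and that two of those columns are used as features; the `--n-clusters` argument is a hypothetical hyperparameter.

```python
import argparse
import glob
import os

# Assumption: the uploaded CSV keeps these column headers from the view.
FEATURES = ["Annual Income (k$)", "Spending Score (1-100)"]


def parse_args(argv=None):
    """Read the locations SageMaker provides to a script-mode training job."""
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    # Hypothetical hyperparameter, not taken from the original notebook.
    parser.add_argument("--n-clusters", type=int, default=5)
    # Ignore any extra arguments the caller may pass along.
    args, _ = parser.parse_known_args(argv)
    return args


def train(args):
    """Fit a hierarchical clustering model and save it to the model directory."""
    # Imported lazily so argument parsing works without the ML stack installed.
    import joblib
    import pandas as pd
    from sklearn.cluster import AgglomerativeClustering

    files = sorted(glob.glob(os.path.join(args.train, "*.csv")))
    frame = pd.concat(pd.read_csv(f) for f in files)
    model = AgglomerativeClustering(n_clusters=args.n_clusters).fit(frame[FEATURES])
    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Load the saved model; the serving container calls this at startup."""
    import joblib
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Run training only when SageMaker has provided a training channel.
if __name__ == "__main__" and os.environ.get("SM_CHANNEL_TRAIN"):
    train(parse_args())
```

Note that scikit-learn's `AgglomerativeClustering` has no `predict` method, so a deployed endpoint would likely need a custom `predict_fn` (for example, one that calls `fit_predict` on the incoming batch) rather than the container's default prediction handler.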

In [5]:
clf = dwcs.train_sklearn_model(train_data,
                               train_script='train_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True)

Training data uploaded
2022-01-26 22:44:24 Starting - Starting the training job...
2022-01-26 22:44:50 Starting - Launching requested ML instancesProfilerReport-1643237063: InProgress
......
2022-01-26 22:45:54 Starting - Preparing the instances for training.........
2022-01-26 22:47:14 Downloading - Downloading input data...
2022-01-26 22:47:50 Training - Downloading the training image...
2022-01-26 22:48:22 Uploading - Uploading generated training model
2022-01-26 22:48:22 Completed - Training job completed
2022-01-26 22:48:10,121 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-01-26 22:48:10,123 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 22:48:10,133 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-01-26 22:48:10,523 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[34m2022-01-26 22:48:10