## Scikit-Learn PCA
### Using BREASTCANCER_VIEW from SAP Datasphere. This view has 569 records

## Install the fedml-aws library

In [1]:
pip install fedml-aws --force-reinstall

Note: you may need to restart the kernel to use updated packages.


## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create a DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='<prefix>', bucket_name='<bucket_name>')

## Create DbConnection instance to get data from SAP Datasphere

Before running the following cell, make sure a `config.json` file containing your SAP Datasphere connection details is in the same directory as this notebook.

You should also have the view `BREASTCANCER_VIEW` created in your SAP Datasphere. To obtain the data, see https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
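A quick check of `config.json` before opening the connection can save a failed cell run. The sketch below is a minimal example, not part of the library: the key names in `EXPECTED_KEYS` are an assumption about what the connection needs, and `missing_config_keys` is a hypothetical helper. Check the fedml-aws documentation for the exact set of keys.

```python
import json
from pathlib import Path

# Assumed key names for illustration; confirm them against the fedml-aws docs.
EXPECTED_KEYS = ("address", "port", "user", "password", "schema")


def missing_config_keys(path="config.json", expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from the JSON file at `path`."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return [key for key in expected if key not in config]
```

Calling `missing_config_keys()` in the notebook directory returns an empty list when every assumed key is present.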

In [4]:
%%time
db = DbConnection()
res, column_headers = db.get_data_with_headers(table_name="BREASTCANCER_VIEW", size=1)
data = pd.DataFrame(res, columns=column_headers)
data

CPU times: user 60.7 ms, sys: 4 ms, total: 64.7 ms
Wall time: 206 ms


| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | column32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | |
| 2 | 84300903 | M | 19.69 | 21.25 | 130 | 1203 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 926424 | M | 21.56 | 22.39 | 142 | 1479 | 0.111 | 0.1159 | 0.2439 | 0.1389 | ... | 26.4 | 166.1 | 2027 | 0.141 | 0.2113 | 0.4107 | 0.2216 | 0.206 | 0.07115 | |
| 565 | 926682 | M | 20.13 | 28.25 | 131.2 | 1261 | 0.0978 | 0.1034 | 0.144 | 0.09791 | ... | 38.25 | 155 | 1731 | 0.1166 | 0.1922 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | |
| 566 | 926954 | M | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0.1023 | 0.09251 | 0.05302 | ... | 34.12 | 126.7 | 1124 | 0.1139 | 0.3094 | 0.3403 | 0.1418 | 0.2218 | 0.0782 | |
| 567 | 927241 | M | 20.6 | 29.33 | 140.1 | 1265 | 0.1178 | 0.277 | 0.3514 | 0.152 | ... | 39.42 | 184.6 | 1821 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 | |


In [5]:
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'column32'],
      dtype='object')
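The column listing includes an identifier (`id`), the label (`diagnosis`), and `column32`, which is empty in the rows displayed earlier. PCA operates on numeric features only, so these columns would normally be set aside first. The sketch below shows one way to do that with pandas; `split_features` and `NON_FEATURE_COLUMNS` are hypothetical names, and whether the training script used later performs this step itself is not shown in this notebook.

```python
import pandas as pd

# Columns from the listing above that are not numeric measurements.
NON_FEATURE_COLUMNS = ["id", "diagnosis", "column32"]


def split_features(frame: pd.DataFrame):
    """Return (features, labels): the numeric measurements and the diagnosis column."""
    features = frame.drop(columns=NON_FEATURE_COLUMNS, errors="ignore")
    # Values fetched from a database may arrive as strings; coerce them to numbers.
    features = features.apply(pd.to_numeric, errors="coerce")
    labels = frame["diagnosis"] if "diagnosis" in frame.columns else None
    return features, labels
```

With the DataFrame loaded above, `features, labels = split_features(data)` would leave the 30 measurement columns in `features`.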

## Train Scikit-Learn Model
`train_data` is the data you want to train your model with.

To deploy a model to AWS using the SageMaker Scikit-learn SDK, you must provide a script that tells SageMaker how to train and deploy the model. Pass the path to this script to the `train_sklearn_model` function through the `train_script` parameter.

`instance_type` specifies the SageMaker instance type, and therefore how much computing power AWS allocates for training.
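The contents of `pca_script.py` are not shown in this notebook. As a rough illustration of what a SageMaker Scikit-learn training script for PCA can look like, here is a minimal sketch. It assumes the training data arrives as CSV files with a header row in a channel named `train`, that the hyperparameter is passed as `--n_components`, and that `id` should be excluded from the features. `fit_pca` is a hypothetical helper; the actual script may differ.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.decomposition import PCA


def fit_pca(frame, n_components):
    """Fit PCA on the numeric, fully populated columns of `frame`."""
    # Assumption: `id` is an identifier, not a feature, so it is dropped.
    numeric = frame.drop(columns=["id"], errors="ignore").select_dtypes(include="number")
    # Drop columns containing missing values (for example an empty trailing column).
    numeric = numeric.dropna(axis="columns", how="any")
    return PCA(n_components=n_components).fit(numeric)


def model_fn(model_dir):
    """Load the fitted model; the SageMaker Scikit-learn container calls this when serving."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def main():
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed to the script as command-line arguments.
    parser.add_argument("--n_components", type=int, default=2)
    # SageMaker sets these environment variables inside the training container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    args, _unknown = parser.parse_known_args()

    # Assumption: the training channel holds one or more CSV files with headers.
    csv_files = sorted(f for f in os.listdir(args.train) if f.endswith(".csv"))
    frame = pd.concat(pd.read_csv(os.path.join(args.train, f)) for f in csv_files)

    model = fit_pca(frame, args.n_components)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))


# Run only when the SageMaker training channel is present, so importing or
# testing this file locally has no side effects.
if __name__ == "__main__" and os.environ.get("SM_CHANNEL_TRAIN"):
    main()
```

With `hyperparameters={'n_components': 3}`, a script of this shape would keep the first three principal components of the numeric columns.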

In [6]:
clf = dwcs.train_sklearn_model(data,
                               train_script='pca_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               download_output=True,
                               hyperparameters={'n_components': 3})

Training data uploaded
2022-01-26 21:47:00 Starting - Starting the training job...
2022-01-26 21:47:24 Starting - Launching requested ML instancesProfilerReport-1643233620: InProgress
......
2022-01-26 21:48:28 Starting - Preparing the instances for training............
2022-01-26 21:50:24 Downloading - Downloading input data...
2022-01-26 21:50:45 Training - Downloading the training image..
2022-01-26 21:51:07,440 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-01-26 21:51:07,443 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 21:51:07,455 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-01-26 21:51:07,855 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 21:51:07,868 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 21:51:07,882 