## Scikit-Learn PCA and Logistic Regression Pipeline
### Using BREASTCANCER_VIEW from DWC. This view has 569 records

## Install fedml aws library

In [1]:
pip install fedml-aws --force-reinstall

Processing ./fedml_aws-2.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.12.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Collecting pyyaml
  Using cached PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (603 kB)
Installing collected packages: pyyaml, hdbcli, fedml-aws
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 6.0
    Uninstalling PyYAML-6.0:
      Successfully uninstalled PyYAML-6.0
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.12.13
    Uninstalling hdbcli-2.12.13:
      Successfully uninstalled hdbcli-2.12.13
  Attempting uninstall: fedml-aws
    Found existing installation: fedml-aws 2.0.0
    Uninstalling fedml-aws-2.0.0:
      Successfully uninstalled fedml-aws-2.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
docker-co

## Import Libraries 

In [2]:
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
import numpy as np
import pandas as pd
import json

## Create DwcSagemaker instance to access the library's functions

In [3]:
dwcs = DwcSagemaker(prefix='<prefix>', bucket_name='<bucket_name>')

2022-03-23 16:05:08,058: fedml_aws.dwcsagemaker INFO: Bucket created in us-east-1


## Create DbConnection instance to get data from DWC

Before running the following cell, you should have a config.json file in the same directory as this notebook, containing the values needed to access DWC.

You should also have the view `BREASTCANCER_VIEW` created in your DWC. To obtain the data for it, please refer to https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
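
Because the next cell depends on config.json, a missing or malformed file only shows up when the connection fails. As a minimal sketch, the check below confirms that the file exists and parses as JSON before connecting. The helper name `load_config` is hypothetical, and the sketch does not know which keys fedml-aws expects, so the caller supplies them.

```python
import json
from pathlib import Path


def load_config(path="config.json", required_keys=()):
    """Load a JSON config file and verify that the expected keys are present.

    `required_keys` is supplied by the caller; this sketch makes no claim
    about which keys fedml-aws itself requires.
    """
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"{config_path} was not found")

    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)

    # Report every missing key at once rather than failing on the first.
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return config
```

A call such as `load_config(required_keys=("user", "password"))` would then fail early with a readable message; the key names in that call are placeholders only.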

In [4]:
%%time
db = DbConnection()
res, column_headers = db.get_data_with_headers(table_name="BREASTCANCER_VIEW", size=1)
data = pd.DataFrame(res, columns=column_headers)
data

CPU times: user 63.5 ms, sys: 4.94 ms, total: 68.4 ms
Wall time: 247 ms


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,column32
0,842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130,1203,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142,1479,0.111,0.1159,0.2439,0.1389,...,26.4,166.1,2027,0.141,0.2113,0.4107,0.2216,0.206,0.07115,
565,926682,M,20.13,28.25,131.2,1261,0.0978,0.1034,0.144,0.09791,...,38.25,155,1731,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,...,34.12,126.7,1124,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,
567,927241,M,20.6,29.33,140.1,1265,0.1178,0.277,0.3514,0.152,...,39.42,184.6,1821,0.165,0.8681,0.9387,0.265,0.4087,0.124,


In [5]:
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'column32'],
      dtype='object')

## Train SciKit Model
`train_data` is the data you want to train your model with; here it is the `data` DataFrame, passed as the first argument.

In order to deploy a model to AWS using the Scikit-learn SageMaker SDK, you must have a script that tells SageMaker how to train and deploy the model. The path to the script is passed to the `train_sklearn_model` function in the `train_script` parameter.

`instance_type` specifies the AWS instance type, and therefore how much computing power is allocated to the training job.
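
To make the role of the training script concrete, here is a rough sketch of what a file such as `pca_pipeline_script.py` might contain; it is not the script used in this notebook. It follows the usual SageMaker scikit-learn conventions: hyperparameters arrive as command-line arguments, the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables point to the model and training-data directories, and `model_fn` reloads the model at inference time. The training file name (`train_data.csv`), the `diagnosis` label column, and the exact pipeline steps are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided directories."""
    parser = argparse.ArgumentParser()
    # Hyperparameters passed to the estimator arrive as --name value pairs.
    parser.add_argument("--n_components", type=int, default=3)
    # SageMaker sets these environment variables inside the training container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    return parser.parse_args(argv)


def build_pipeline(n_components):
    """Scale the features, project them with PCA, then classify."""
    # Imported here so argument parsing can be tried without scikit-learn.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    return Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def train(args):
    """Fit the pipeline and save it where SageMaker expects the model."""
    import joblib
    import pandas as pd

    # Assumption: the training data is a CSV file with a 'diagnosis' label column.
    df = pd.read_csv(os.path.join(args.train, "train_data.csv")).fillna(0)
    X = df.drop(columns=["diagnosis"])
    y = df["diagnosis"]

    model = build_pipeline(args.n_components).fit(X, y)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Called by the SageMaker scikit-learn container to load the model."""
    import joblib

    return joblib.load(os.path.join(model_dir, "model.joblib"))


def main(argv=None):
    # In a real script, call main() under an `if __name__ == "__main__":` guard.
    train(parse_args(argv))
```

With a script of this shape, a hyperparameter dictionary such as `{'n_components': 3}` would reach the script as `--n_components 3`.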

In [6]:
clf = dwcs.train_sklearn_model(data,
                               train_script='pca_pipeline_script.py',
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               download_output=False,
                               hyperparameters={'n_components': 3})

2022-03-23 16:05:09,204: fedml_aws.dwcsagemaker INFO: Training data uploaded
2022-03-23 16:05:09 Starting - Starting the training job...
2022-03-23 16:05:36 Starting - Preparing the instances for trainingProfilerReport-1648051509: InProgress
.........
2022-03-23 16:07:03 Downloading - Downloading input data...
2022-03-23 16:07:33 Training - Downloading the training image......
2022-03-23 16:08:34 Uploading - Uploading generated training model
2022-03-23 16:08:26,768 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-03-23 16:08:26,770 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-03-23 16:08:26,785 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-03-23 16:08:27,246 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-03-23 16:08:27,263 sagemaker-training-toolkit INFO     No GPUs detected (normal if 

## Using the fedml_aws deploy to kyma function

In [7]:
!aws configure set aws_access_key_id '<aws_access_key_id>' --profile 'sample-pr'
!aws configure set aws_secret_access_key '<aws_secret_access_key>' --profile 'sample-pr'
!aws configure set region '<region>' --profile 'sample-pr'

In [None]:
dwcs.deploy_to_kyma(clf, initial_instance_count=1, profile_name='sample-pr')

## Using the fedml_aws invoke kyma endpoint function

In [65]:
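# Shuffle the rows, then keep rows 500 onward as the sample sent to the endpoint
org_data = data.sample(frac=1).reset_index(drop=True)
# fillna returns a new frame, which avoids modifying a slice in place
org_data = org_data[500:].fillna(0)
y = org_data['diagnosis']
X = org_data.drop(['diagnosis'], axis=1)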

In [66]:
result = dwcs.invoke_kyma_endpoint(api='<endpoint>', 
             payload=X.to_json(), 
             content_type='application/json')

In [67]:
result = result.content.decode()

In [68]:
result

'["M", "B", "M", "M", "M", "M", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "M", "B", "M", "B", "M", "B", "B", "M", "M", "M", "B", "B", "B", "M", "B", "M", "B", "B", "B", "B", "B", "B", "M", "M", "B", "B", "B", "M", "M", "B", "B", "M", "B", "M", "B", "B", "B", "M", "B", "M", "B", "B", "M", "B", "B", "B", "B", "B", "B", "B", "M", "M", "B"]'
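
The labels held back in `y` are not used again in this notebook, but they allow a quick sanity check of the response. As a minimal sketch, the hypothetical helper below parses the JSON response and reports the fraction of predictions that match the known labels. Because these rows were also part of the training data, the number is only a smoke test and not an estimate of how well the model generalizes.

```python
import json


def prediction_accuracy(response_text, true_labels):
    """Return the fraction of predicted labels that equal the true labels."""
    predictions = json.loads(response_text)
    true_labels = list(true_labels)
    if len(predictions) != len(true_labels):
        raise ValueError("prediction and label counts differ")
    if not predictions:
        raise ValueError("no predictions to score")
    matches = sum(p == t for p, t in zip(predictions, true_labels))
    return matches / len(predictions)


# Toy stand-ins for `result` and `y`
print(prediction_accuracy('["M", "B", "M", "B"]', ["M", "B", "B", "B"]))  # 0.75
```

In the notebook this would be called as `prediction_accuracy(result, y)`.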

## Write back to DWC

In [69]:
X.columns

Index(['id', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'column32'],
      dtype='object')

In [70]:
X.dtypes

id                          int64
radius_mean                object
texture_mean               object
perimeter_mean             object
area_mean                  object
smoothness_mean            object
compactness_mean           object
concavity_mean             object
concave points_mean        object
symmetry_mean              object
fractal_dimension_mean     object
radius_se                  object
texture_se                 object
perimeter_se               object
area_se                    object
smoothness_se              object
compactness_se             object
concavity_se               object
concave points_se          object
symmetry_se                object
fractal_dimension_se       object
radius_worst               object
texture_worst              object
perimeter_worst            object
area_worst                 object
smoothness_worst           object
compactness_worst          object
concavity_worst            object
concave points_worst       object
symmetry_worst

In [71]:
# Column names that contain spaces (e.g. 'concave points_mean') are written with underscores in the table definition

db.create_table("CREATE TABLE PCA_Pipeline_Table (ID INTEGER PRIMARY KEY, radius_mean FLOAT(2), texture_mean FLOAT(2), perimeter_mean FLOAT(2), area_mean FLOAT(2), smoothness_mean FLOAT(2), compactness_mean FLOAT(2), concavity_mean FLOAT(2), concave_points_mean FLOAT(2), symmetry_mean FLOAT(2), fractal_dimension_mean FLOAT(2), radius_se FLOAT(2), texture_se FLOAT(2), perimeter_se FLOAT(2), area_se FLOAT(2), smoothness_se FLOAT(2), compactness_se FLOAT(2), concavity_se FLOAT(2), concave_points_se FLOAT(2), symmetry_se FLOAT(2), fractal_dimension_se FLOAT(2), radius_worst FLOAT(2), texture_worst FLOAT(2), perimeter_worst FLOAT(2), area_worst FLOAT(2), smoothness_worst FLOAT(2), compactness_worst FLOAT(2), concavity_worst FLOAT(2), concave_points_worst FLOAT(2), symmetry_worst FLOAT(2), fractal_dimension_worst FLOAT(2), column32 INTEGER, diagnosis_predict VARCHAR(100))")


creating table...
CREATE TABLE PCA_Pipeline_Table (ID INTEGER PRIMARY KEY, radius_mean FLOAT(2), texture_mean FLOAT(2), perimeter_mean FLOAT(2), area_mean FLOAT(2), smoothness_mean FLOAT(2), compactness_mean FLOAT(2), concavity_mean FLOAT(2), concave_points_mean FLOAT(2), symmetry_mean FLOAT(2), fractal_dimension_mean FLOAT(2), radius_se FLOAT(2), texture_se FLOAT(2), perimeter_se FLOAT(2), area_se FLOAT(2), smoothness_se FLOAT(2), compactness_se FLOAT(2), concavity_se FLOAT(2), concave_points_se FLOAT(2), symmetry_se FLOAT(2), fractal_dimension_se FLOAT(2), radius_worst FLOAT(2), texture_worst FLOAT(2), perimeter_worst FLOAT(2), area_worst FLOAT(2), smoothness_worst FLOAT(2), compactness_worst FLOAT(2), concavity_worst FLOAT(2), concave_points_worst FLOAT(2), symmetry_worst FLOAT(2), fractal_dimension_worst FLOAT(2), column32 INTEGER, diagnosis_predict VARCHAR(100), INSERTED_AT TIMESTAMP NOT NULL)
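
Writing a 33-column `CREATE TABLE` statement by hand is easy to get wrong. As a sketch, the statement passed to `db.create_table` could instead be generated from the DataFrame's column names. The helper `build_create_table_sql` and its type-mapping rules are illustrative assumptions that mirror the statement above; they are not part of fedml-aws.

```python
def build_create_table_sql(table_name, columns, key_column="id",
                           integer_columns=("column32",),
                           text_columns=("diagnosis_predict",)):
    """Build a CREATE TABLE statement from a list of column names.

    Spaces in column names are replaced with underscores, the key column
    becomes an INTEGER PRIMARY KEY, and all other columns default to FLOAT(2).
    """
    definitions = []
    for name in columns:
        sql_name = name.replace(" ", "_")
        if name == key_column:
            definitions.append(f"{sql_name.upper()} INTEGER PRIMARY KEY")
        elif name in integer_columns:
            definitions.append(f"{sql_name} INTEGER")
        elif name in text_columns:
            definitions.append(f"{sql_name} VARCHAR(100)")
        else:
            definitions.append(f"{sql_name} FLOAT(2)")
    return f"CREATE TABLE {table_name} ({', '.join(definitions)})"


# Shortened column list for illustration
columns = ["id", "radius_mean", "concave points_mean", "column32", "diagnosis_predict"]
print(build_create_table_sql("PCA_Pipeline_Table", columns))
```

In the notebook, `list(X.columns) + ['diagnosis_predict']` would supply the full column list.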


In [72]:
res = result.strip('][').split(', ')
res

['"M"',
 '"B"',
 '"M"',
 '"M"',
 '"M"',
 '"M"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"M"',
 '"B"',
 '"M"',
 '"B"',
 '"M"',
 '"B"',
 '"B"',
 '"M"',
 '"M"',
 '"M"',
 '"B"',
 '"B"',
 '"B"',
 '"M"',
 '"B"',
 '"M"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"M"',
 '"M"',
 '"B"',
 '"B"',
 '"B"',
 '"M"',
 '"M"',
 '"B"',
 '"B"',
 '"M"',
 '"B"',
 '"M"',
 '"B"',
 '"B"',
 '"B"',
 '"M"',
 '"B"',
 '"M"',
 '"B"',
 '"B"',
 '"M"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"B"',
 '"M"',
 '"M"',
 '"B"']
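
Splitting the raw string leaves the double quotes inside each label, which is why the values appear as `'"M"'` here. Since the response is a JSON array, `json.loads` is one way to obtain clean labels. The sketch below compares the two approaches on a shortened, made-up response string.

```python
import json

# Shortened stand-in for the endpoint response held in `result`
result = '["M", "B", "M"]'

split_labels = result.strip('][').split(', ')
json_labels = json.loads(result)

print(split_labels)  # ['"M"', '"B"', '"M"']
print(json_labels)   # ['M', 'B', 'M']
```

With the parsed list, `dwc_data.assign(diagnosis_predict=json_labels)` would store `M` and `B` without the embedded quotes.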

In [81]:
dwc_data = X
dwc_data = dwc_data.assign(diagnosis_predict = res)

In [82]:
dwc_data.columns = ['id', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave_points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'column32', 'diagnosis_predict']

In [83]:
for i in dwc_data.columns[1:-1]:
    dwc_data[i] = dwc_data[i].astype('float64')

In [84]:
dwc_data

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,column32,diagnosis_predict
500,853201,17.570,15.05,115.00,955.1,0.09847,0.11570,0.09875,0.07953,0.1739,...,134.90,1227.0,0.1255,0.2812,0.24890,0.14560,0.2756,0.07919,0.0,"""M"""
501,898143,9.606,16.84,61.64,280.5,0.08481,0.09228,0.08422,0.02292,0.2036,...,71.25,353.6,0.1233,0.3416,0.43410,0.08120,0.2982,0.09825,0.0,"""B"""
502,881046502,20.580,22.14,134.70,1290.0,0.09090,0.13480,0.16400,0.09561,0.1765,...,158.30,1656.0,0.1178,0.2920,0.38610,0.19200,0.2909,0.05865,0.0,"""M"""
503,844981,13.000,21.82,87.50,519.8,0.12730,0.19320,0.18590,0.09353,0.2350,...,106.20,739.3,0.1703,0.5401,0.53900,0.20600,0.4378,0.10720,0.0,"""M"""
504,84358402,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,152.20,1575.0,0.1374,0.2050,0.40000,0.16250,0.2364,0.07678,0.0,"""M"""
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,877501,12.230,19.56,78.54,461.0,0.09586,0.08087,0.04187,0.04107,0.1979,...,92.15,638.4,0.1429,0.2042,0.13770,0.10800,0.2668,0.08174,0.0,"""B"""
565,891716,12.720,13.78,81.78,492.1,0.09667,0.08393,0.01288,0.01924,0.1638,...,88.54,553.7,0.1298,0.1472,0.05233,0.06343,0.2369,0.06922,0.0,"""B"""
566,88206102,20.510,27.81,134.40,1319.0,0.09159,0.10740,0.15540,0.08340,0.1448,...,162.70,1872.0,0.1223,0.2761,0.41460,0.15630,0.2437,0.08328,0.0,"""M"""
567,919555,20.550,20.86,137.80,1308.0,0.10460,0.17390,0.20850,0.13220,0.2127,...,160.20,1809.0,0.1268,0.3135,0.44330,0.21480,0.3077,0.07569,0.0,"""M"""


In [85]:
dwc_data.dtypes

id                           int64
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave_points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave_points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave_points_worst

In [86]:
db.insert_into_table('PCA_Pipeline_Table', dwc_data)

inserting into table...
INSERT INTO PCA_Pipeline_Table (id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave_points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave_points_worst, symmetry_worst, fractal_dimension_worst, column32, diagnosis_predict, "INSERTED_AT") VALUES (853201, 17.57, 15.05, 115.0, 955.1, 0.09847, 0.1157, 0.09875, 0.07953, 0.1739, 0.06149, 0.6003, 0.8225, 4.655, 61.1, 0.005627, 0.03033, 0.03407, 0.01354, 0.01925, 0.003742, 20.01, 19.52, 134.9, 1227.0, 0.1255, 0.2812, 0.2489, 0.1456, 0.2756, 0.07919, 0.0, '"M"', '2022-03-23 16:46:53')
INSERT INTO PCA_Pipeline_Table (id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactnes