## Scikit-Learn PCA and Logistic Regression Pipeline
### Using BREASTCANCER_VIEW from DWC. This view has 569 records

## Install fedml_azure package

In [1]:
pip install fedml_azure --force-reinstall

Processing ./fedml_azure_test-2.0.0-py3-none-any.whl
Collecting ruamel.yaml
  Using cached ruamel.yaml-0.17.21-py3-none-any.whl (109 kB)
Collecting hdbcli
  Using cached hdbcli-2.12.20-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Collecting ruamel.yaml.clib>=0.2.6; platform_python_implementation == "CPython" and python_version < "3.11"
  Using cached ruamel.yaml.clib-0.2.6-cp36-cp36m-manylinux1_x86_64.whl (552 kB)
Installing collected packages: ruamel.yaml.clib, ruamel.yaml, hdbcli, fedml-azure-test
  Attempting uninstall: ruamel.yaml.clib
    Found existing installation: ruamel.yaml.clib 0.2.6
    Uninstalling ruamel.yaml.clib-0.2.6:
      Successfully uninstalled ruamel.yaml.clib-0.2.6
  Attempting uninstall: ruamel.yaml
    Found existing installation: ruamel.yaml 0.17.21
    Uninstalling ruamel.yaml-0.17.21:
      Successfully uninstalled ruamel.yaml-0.17.21
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.12.20
    Uninstalling hdbcli-2.12.20:
      Successful

## Import the libraries needed in this notebook

In [2]:
from fedml_azure import create_workspace
from fedml_azure import DbConnection
from fedml_azure import create_compute
from fedml_azure import create_environment
from fedml_azure import DwcAzureTrain
from fedml_azure import deploy
from fedml_azure import predict

## Set up

### Initialize the workspace

The create_workspace method takes a dictionary as input for parameter workspace_args.

Replace the 'subscription_id', 'resource_group', and 'workspace_name' values in the cell below with the values from your 'config.json' file. To obtain this file, see step 2 of the prerequisites (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/sample-notebooks/FedML-With-Federated-Data-From-Athena-BigQuery/docs/prerequisites.md).

Refer to the documentation on the 'create_workspace' method and parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_workspace).
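As a minimal sketch, assuming the 'config.json' file contains the keys 'subscription_id', 'resource_group', and 'workspace_name', the workspace_args dictionary could be read from the file rather than typed by hand. The helper name load_workspace_args is hypothetical and is not part of fedml_azure.

```python
import json

# The three keys that the cell below passes in workspace_args.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_args(config_path):
    """Hypothetical helper: build workspace_args from a JSON config file."""
    with open(config_path) as config_file:
        config = json.load(config_file)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing keys: {missing}")
    # Ignore any extra keys so only the expected ones are forwarded.
    return {key: config[key] for key in REQUIRED_KEYS}
```

The returned dictionary could then be passed as workspace_args=load_workspace_args('config.json').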


In [3]:
workspace=create_workspace(workspace_args={
                                            "subscription_id": '<subscription-id>',
                                            "resource_group": '<resource-group>',
                                            "workspace_name": '<workspace_name>'
                                            }
)

2022-04-02 00:06:54,013: fedml_azure.logger INFO: Getting existing Workspace


### Create a Compute target

The create_compute method takes the workspace, a compute_type, and compute_args as parameters. The following code creates a Compute Cluster with the name 'fedml-test1' for training.

Refer to the documentation on the 'create_compute' method and parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_compute).


In [5]:
compute=create_compute(workspace=workspace,
                   compute_type='AmlComputeCluster',
                   compute_args={'vm_size':'Standard_D2',
                                'compute_name':'fedml-test1',
                                'max_nodes':6,
                                }
                )

2022-04-02 00:07:07,549: fedml_azure.logger INFO: Creating Compute_target.
2022-04-02 00:07:07,859: fedml_azure.logger INFO: Found compute target. just use it. fedml-test1


### Create an Environment

The create_environment method takes the workspace, environment_type, and environment_args as parameters.

Pass 'fedml_azure' in pip_packages. To use scikit-learn, you must also pass its name in conda_packages.

Refer to the documentation on the 'create_environment' method and parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_environment).

In [6]:
environment=create_environment(workspace=workspace,
                           environment_type='CondaPackageEnvironment',
                           environment_args={'name':'pca-sklearn',
                                             'pip_packages':['joblib','fedml_azure'],
                                             'conda_packages':['scikit-learn']})


2022-04-02 00:07:11,748: fedml_azure.logger INFO: Creating Environment.


## Now, let's train the model

### Creating a Training object and setting the workspace, compute target, and environment.

Before running the cell below, ensure that you have a workspace and that you have replaced the subscription_id, resource_group, and workspace_name with your information.

The whl file for the fedml_azure library must be passed to the pip_wheel_files key in environment_args. To use scikit-learn, you must also pass its name to conda_packages.

Refer to the documentation on the 'DwcAzureTrain' class (https://github.com/SAP-samples/dwc-fedml/blob/main/Azure/docs/fedml_azure.md#dwcazuretrain-class).

In [7]:
train=DwcAzureTrain(workspace=workspace,
                    environment=environment,
                    experiment_args={'name':'federated-experiment'},
                    compute=compute)

2022-04-02 00:07:15,748: fedml_azure.logger INFO: Assigning Workspace.
2022-04-02 00:07:15,749: fedml_azure.logger INFO: Creating Experiment
2022-04-02 00:07:15,898: fedml_azure.logger INFO: Assigning compute.
2022-04-02 00:07:15,899: fedml_azure.logger INFO: Assigning Environment.


### Generate the run config

The run config packages the specified configuration so that a training job can be submitted.

Before running the following cell, you should have a config.json file with the values that allow you to access DWC. Provide the path of this file to config_file_path in the cell below.

You should also have the view BREASTCANCER_VIEW created in your DWC. To obtain the data, refer to https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Refer to the documentation on the 'generate_run_config' method and parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#generate_run_config).

In [8]:
src=train.generate_run_config(config_file_path='dwc_configs/config.json',
                          config_args={
                                          'source_directory':'Scikit-Learn-PCAPipeline',
                                          'script':'pca_script.py',
                                          'arguments':['--model_file_name','regression.pkl', '--table_name', 'BREASTCANCER_VIEW', '--n_components', '3']
                                          }
                            )

2022-04-02 00:07:21,301: fedml_azure.logger INFO: Generating script run config.
2022-04-02 00:07:21,302: fedml_azure.logger INFO: Copying config file for db connection to script_directory Scikit-Learn-PCAPipeline
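The 'arguments' list above is passed to the training script on the command line. The notebook does not show pca_script.py, so the following is only a sketch of how such a script might read these three arguments with argparse. The types and the default for --n_components are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical argument parsing for a script such as pca_script.py."""
    parser = argparse.ArgumentParser(description="PCA pipeline training arguments")
    parser.add_argument("--model_file_name", type=str, required=True,
                        help="file name of the saved model, e.g. regression.pkl")
    parser.add_argument("--table_name", type=str, required=True,
                        help="DWC view or table to read the training data from")
    parser.add_argument("--n_components", type=int, default=3,
                        help="number of principal components (default is an assumption)")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


args = parse_args(["--model_file_name", "regression.pkl",
                   "--table_name", "BREASTCANCER_VIEW",
                   "--n_components", "3"])
```

Because every command-line value arrives as a string, declaring type=int for --n_components converts '3' to the integer 3 before it is used.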


### Submit the training job with the option to download the model outputs

Refer to the documentation on the 'submit_run' method and parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#submit_run).

In [9]:
run=train.submit_run(src)

2022-04-02 00:07:28,645: fedml_azure.logger INFO: Submitting training run.
RunId: federated-experiment_1648858048_c54e5388
Web View: https://ml.azure.com/runs/federated-experiment_1648858048_c54e5388?wsid=/subscriptions/2a9092f3-f89d-4589-b9aa-8a1f9ff8b776/resourcegroups/fedml_rg/workspaces/fedml_ws&tid=997ff79a-373b-416b-8f24-a97cb0225c59

Execution Summary
RunId: federated-experiment_1648858048_c54e5388
Web View: https://ml.azure.com/runs/federated-experiment_1648858048_c54e5388?wsid=/subscriptions/2a9092f3-f89d-4589-b9aa-8a1f9ff8b776/resourcegroups/fedml_rg/workspaces/fedml_ws&tid=997ff79a-373b-416b-8f24-a97cb0225c59
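The submitted run executes the training script, which is not shown in this notebook. As a rough sketch of what a PCA and logistic regression pipeline of this kind could look like in scikit-learn, the code below chains a scaler, PCA, and a classifier and saves the fitted model under an output directory. The StandardScaler step, the max_iter value, and the function names are assumptions and may differ from the actual pca_script.py.

```python
import os

import joblib
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(n_components):
    """Scale the features, project them with PCA, then classify."""
    return Pipeline([
        ("scaler", StandardScaler()),              # assumption: features are standardized first
        ("pca", PCA(n_components=n_components)),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])


def train_and_save(X, y, n_components, model_file_name, output_dir="outputs"):
    """Fit the pipeline and write it to output_dir/model_file_name."""
    model = build_pipeline(n_components)
    model.fit(X, y)
    os.makedirs(output_dir, exist_ok=True)
    model_path = os.path.join(output_dir, model_file_name)
    joblib.dump(model, model_path)
    return model, model_path
```

Saving the whole pipeline rather than only the classifier means the same scaling and projection are applied automatically at prediction time.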



### Register the model

Pass 'outputs/model_file_name.pkl' to the 'model_path' key of model_args, where 'model_file_name' is the name of the .pkl model file specified in the generate_run_config step (here, 'regression.pkl').

Provide the desired model name to the 'model_name' key of model_args in the cell below. The 'is_sklearn_model' flag specifies whether a scikit-learn model is being registered.

Refer to the documentation on the 'register_model' method and parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#register_model).

In [12]:
model=train.register_model(run=run,
                           model_args={'model_name':'sklearn_pcapipeline_model',
                                       'model_path':'outputs/regression.pkl'},
                            resource_config_args={'cpu':1, 'memory_in_gb':0.5},
                            is_sklearn_model=True
                           )

2022-04-02 00:13:17,569: fedml_azure.logger INFO: Registering the model.
2022-04-02 00:13:17,570: fedml_azure.logger INFO: Configuring parameters for sklearn model.


### Read test data from SAP DWC

In [13]:
import pandas as pd
import numpy as np
from fedml_azure import DbConnection

In [14]:
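# Connect to DWC using the credentials in config.json
db = DbConnection(url='Scikit-Learn-PCAPipeline/config.json')
res, column_headers = db.get_data_with_headers(table_name="BREASTCANCER_VIEW", size=1)
data = pd.DataFrame(res, columns=column_headers)
# Shuffle the rows, then keep rows 500 onward as the test set
org_data = data.sample(frac=1).reset_index(drop=True)
org_data = org_data[500:].copy()  # copy to avoid modifying a slice of the shuffled frame
org_data = org_data.fillna(0)
# Separate the target column from the features
y = org_data['diagnosis']
X = org_data.drop(['diagnosis'], axis=1)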

#### Change the decimal datatype in the dataframe to float for serialization

In [15]:
X = X.apply(pd.to_numeric, downcast='float')

In [16]:
import json
test_data = json.dumps(X.values.tolist())
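The conversion to float matters because Python's json module cannot serialize decimal.Decimal values. The small example below illustrates this with made-up values, assuming the numeric columns arrive from the database as Decimal objects.

```python
import json
from decimal import Decimal

# Illustrative row, standing in for values read from a DECIMAL column.
rows = [[Decimal("11.27"), Decimal("15.5")]]

try:
    json.dumps(rows)
    decimal_is_serializable = True
except TypeError:
    # The json module has no default encoding for Decimal objects.
    decimal_is_serializable = False

# Converting each value to float makes the rows serializable.
float_rows = [[float(value) for value in row] for row in rows]
serialized = json.dumps(float_rows)
```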

### Deploy the model as a webservice to Kyma Kubernetes

Before running this cell,

1. Ensure a service principal is created, and specify the path of the config file containing the service principal credentials in the 'sp_config_path' key of deploy_args in the cell below.

2. Pass the path of the kubeconfig.yaml file used to connect to Kyma Kubernetes to the 'kubeconfig_path' key of deploy_args in the cell below.

Refer to the documentation on 'deploy' for more details (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#deploy).
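Since the deployment depends on several local files, it can help to confirm that they exist before starting. The helper below is a hypothetical convenience using only the standard library and is not part of fedml_azure.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [str(path) for path in paths if not Path(path).is_file()]
```

For the cell below, it could be called with 'Scikit-Learn-PCAPipeline/predict.py', 'Scikit-Learn-PCAPipeline/kubeconfig.yaml', and 'Scikit-Learn-PCAPipeline/sp_config.json'. An empty list means all three files were found.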

In [20]:
kyma_endpoint=deploy(compute_type='Kyma',
                    inference_config_args={'entry_script':'Scikit-Learn-PCAPipeline/predict.py', 'environment':environment},
                    deploy_args={'workspace':workspace,
                                'name':'pcawebservice',
                                'models':[model],
                                'kubeconfig_path':'Scikit-Learn-PCAPipeline/kubeconfig.yaml',
                                'sp_config_path':'Scikit-Learn-PCAPipeline/sp_config.json'
                                })

2022-04-02 00:14:52,474: fedml_azure.logger INFO: Installing Kubectl

2022-04-02 00:14:54,395: fedml_azure.logger INFO: kubectl: OK

2022-04-02 00:14:55,165: fedml_azure.logger INFO: Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}

2022-04-02 00:14:55,167: fedml_azure.logger INFO: Installing jq

2022-04-02 00:14:55,240: fedml_azure.logger INFO: Reading package lists...

2022-04-02 00:14:55,408: fedml_azure.logger INFO: Building dependency tree...

2022-04-02 00:14:55,409: fedml_azure.logger INFO: Reading state information...

2022-04-02 00:14:55,548: fedml_azure.logger INFO: jq is already the newest version (1.5+dfsg-2).

2022-04-02 00:14:55,549: fedml_azure.logger INFO: The following packages were automatically installed and are no longer required:

2022-04-02 00:14:55,550: fedml_azure.logge

### Run inference on the kyma_endpoint by passing the test data

Refer to the documentation on 'predict' for more details (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#predict).

In [22]:
result=predict(endpoint_url=kyma_endpoint,compute_type='kyma',data=test_data)
result

2022-04-02 00:16:27,904: fedml_azure.logger INFO: Using the parameters 'endpoint_url' and 'compute_type' for inferencing.


{'result': ['B',
  'B',
  'M',
  'B',
  'B',
  'B',
  'B',
  'M',
  'B',
  'B',
  'B',
  'M',
  'B',
  'M',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'M',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'M',
  'B',
  'B',
  'B',
  'B',
  'B',
  'M',
  'M',
  'B',
  'B',
  'B',
  'B',
  'B',
  'M',
  'B',
  'B',
  'M',
  'B',
  'M',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'B',
  'M',
  'B',
  'M',
  'M',
  'B',
  'M',
  'B',
  'B',
  'B',
  'M',
  'B',
  'B',
  'B']}
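In this dataset, 'B' stands for a benign diagnosis and 'M' for a malignant one. A long list of labels is easier to read as counts, so here is a small sketch that tallies a response of the same shape. The sample response is illustrative and is not the output shown above.

```python
from collections import Counter


def summarize_predictions(response):
    """Map each predicted label to (count, share of all predictions)."""
    labels = response["result"]
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, count / total) for label, count in counts.items()}


# Illustrative response with the same structure as the endpoint output.
sample_response = {"result": ["B", "B", "M", "B"]}
summary = summarize_predictions(sample_response)
```

The same function could be applied to the notebook's own output with summarize_predictions(result).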

### Write the result back to SAP DWC

#### Create table in SAP DWC

In [23]:
from fedml_azure import DbConnection
db = DbConnection(url='Scikit-Learn-PCAPipeline/config.json')

In [25]:
db.create_table("CREATE TABLE PCA_BREASTCANCER_VIEW (ID INTEGER PRIMARY KEY, radius_mean FLOAT(2), texture_mean FLOAT(2), perimeter_mean FLOAT(2), area_mean FLOAT(2), smoothness_mean FLOAT(2), compactness_mean FLOAT(2), concavity_mean FLOAT(2), concave_points_mean FLOAT(2), symmetry_mean FLOAT(2), fractal_dimension_mean FLOAT(2), radius_se FLOAT(2), texture_se FLOAT(2), perimeter_se FLOAT(2), area_se FLOAT(2), smoothness_se FLOAT(2), compactness_se FLOAT(2), concavity_se FLOAT(2), concave_points_se FLOAT(2), symmetry_se FLOAT(2), fractal_dimension_se FLOAT(2), radius_worst FLOAT(2), texture_worst FLOAT(2), perimeter_worst FLOAT(2), area_worst FLOAT(2), smoothness_worst FLOAT(2), compactness_worst FLOAT(2), concavity_worst FLOAT(2), concave_points_worst FLOAT(2), symmetry_worst FLOAT(2), fractal_dimension_worst FLOAT(2), column32 INTEGER, result VARCHAR(100))")

2022-04-02 00:16:38,750: fedml_azure.logger INFO: creating table...
2022-04-02 00:16:38,751: fedml_azure.logger INFO: CREATE TABLE PCA_BREASTCANCER_VIEW (ID INTEGER PRIMARY KEY, radius_mean FLOAT(2), texture_mean FLOAT(2), perimeter_mean FLOAT(2), area_mean FLOAT(2), smoothness_mean FLOAT(2), compactness_mean FLOAT(2), concavity_mean FLOAT(2), concave_points_mean FLOAT(2), symmetry_mean FLOAT(2), fractal_dimension_mean FLOAT(2), radius_se FLOAT(2), texture_se FLOAT(2), perimeter_se FLOAT(2), area_se FLOAT(2), smoothness_se FLOAT(2), compactness_se FLOAT(2), concavity_se FLOAT(2), concave_points_se FLOAT(2), symmetry_se FLOAT(2), fractal_dimension_se FLOAT(2), radius_worst FLOAT(2), texture_worst FLOAT(2), perimeter_worst FLOAT(2), area_worst FLOAT(2), smoothness_worst FLOAT(2), compactness_worst FLOAT(2), concavity_worst FLOAT(2), concave_points_worst FLOAT(2), symmetry_worst FLOAT(2), fractal_dimension_worst FLOAT(2), column32 INTEGER, result VARCHAR(100), INSERTED_AT TIMESTAMP NOT NU
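The CREATE TABLE statement above follows a regular pattern: ten base measurements, each with a 'mean', 'se', and 'worst' variant. As a sketch, the same statement can be generated rather than typed by hand. The helper name build_create_table_sql is hypothetical, and the names are interpolated directly into the SQL text, so this is only suitable for trusted, fixed column names.

```python
def build_create_table_sql(table_name, feature_columns):
    """Hypothetical helper: assemble the CREATE TABLE statement used above."""
    column_defs = ["ID INTEGER PRIMARY KEY"]
    column_defs += [f"{name} FLOAT(2)" for name in feature_columns]
    column_defs += ["column32 INTEGER", "result VARCHAR(100)"]
    joined = ", ".join(column_defs)
    return f"CREATE TABLE {table_name} ({joined})"


base_features = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave_points", "symmetry",
                 "fractal_dimension"]
suffixes = ["mean", "se", "worst"]
# All 'mean' columns first, then 'se', then 'worst', matching the statement above.
feature_columns = [f"{feature}_{suffix}" for suffix in suffixes for feature in base_features]

sql = build_create_table_sql("PCA_BREASTCANCER_VIEW", feature_columns)
```

The resulting string could then be passed to db.create_table(sql).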

#### Storing the result in the dataframe

In [26]:
import pandas as pd
result_df=pd.DataFrame(result['result'])
result_df.rename( columns={0:'result'}, inplace=True )

In [28]:
X['result']=result_df['result'].values
X

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,column32,result
500,903011.0,11.270000,15.500000,73.379997,392.000000,0.08365,0.11140,0.10070,0.02757,0.1810,...,79.730003,450.000000,0.11020,0.28090,0.30210,0.08272,0.2157,0.10430,0.0,B
501,917897.0,9.847000,15.680000,63.000000,293.200012,0.09492,0.08419,0.02330,0.02416,0.1387,...,74.320000,376.500000,0.14190,0.22430,0.08434,0.06528,0.2502,0.09209,0.0,B
502,91376704.0,17.850000,13.230000,114.599998,992.099976,0.07838,0.06217,0.04445,0.04178,0.1220,...,127.099998,1210.000000,0.09862,0.09976,0.10480,0.08341,0.1783,0.05871,0.0,M
503,874662.0,11.810000,17.389999,75.269997,428.899994,0.10070,0.05562,0.02353,0.01553,0.1718,...,79.570000,489.500000,0.13560,0.10000,0.08803,0.04306,0.3200,0.06576,0.0,B
504,906539.0,11.570000,19.040001,74.199997,409.700012,0.08546,0.07722,0.05485,0.01428,0.2031,...,86.430000,520.500000,0.12490,0.19370,0.25600,0.06664,0.3035,0.08284,0.0,B
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,914580.0,12.470000,17.309999,80.449997,480.100006,0.08928,0.07630,0.03609,0.02369,0.1526,...,92.820000,607.299988,0.12760,0.25060,0.20280,0.10530,0.3035,0.07661,0.0,B
565,877989.0,17.540001,19.320000,115.099998,951.599976,0.08968,0.11980,0.10360,0.07488,0.1506,...,139.500000,1239.000000,0.13810,0.34200,0.35080,0.19390,0.2928,0.07867,0.0,M
566,854941.0,13.030000,18.420000,82.610001,523.799988,0.08983,0.03766,0.02562,0.02923,0.1467,...,84.459999,545.900024,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169,0.0,B
567,911654.0,14.200000,20.530001,92.410004,618.400024,0.08931,0.11080,0.05063,0.03058,0.1506,...,112.099998,828.500000,0.11530,0.34290,0.25120,0.13390,0.2534,0.07858,0.0,B


#### Renaming the columns

In [29]:
X = X.rename(columns={'concave points_mean': 'concave_points_mean', 'concave points_se': 'concave_points_se', 'concave points_worst':'concave_points_worst'})
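The rename is needed because the target table's column names use underscores instead of spaces. Instead of listing each column by hand, the mapping can be derived from the column names. This is a sketch with a hypothetical helper name and a shortened, illustrative column list.

```python
def build_rename_map(columns):
    """Map every column name containing a space to an underscore version."""
    return {name: name.replace(" ", "_") for name in columns if " " in name}


# Shortened column list for illustration.
columns = ["id", "radius_mean", "concave points_mean",
           "concave points_se", "concave points_worst"]
rename_map = build_rename_map(columns)
```

With the DataFrame from this notebook, the equivalent call would be X = X.rename(columns=build_rename_map(X.columns)).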

In [30]:
X

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,column32,result
500,903011.0,11.270000,15.500000,73.379997,392.000000,0.08365,0.11140,0.10070,0.02757,0.1810,...,79.730003,450.000000,0.11020,0.28090,0.30210,0.08272,0.2157,0.10430,0.0,B
501,917897.0,9.847000,15.680000,63.000000,293.200012,0.09492,0.08419,0.02330,0.02416,0.1387,...,74.320000,376.500000,0.14190,0.22430,0.08434,0.06528,0.2502,0.09209,0.0,B
502,91376704.0,17.850000,13.230000,114.599998,992.099976,0.07838,0.06217,0.04445,0.04178,0.1220,...,127.099998,1210.000000,0.09862,0.09976,0.10480,0.08341,0.1783,0.05871,0.0,M
503,874662.0,11.810000,17.389999,75.269997,428.899994,0.10070,0.05562,0.02353,0.01553,0.1718,...,79.570000,489.500000,0.13560,0.10000,0.08803,0.04306,0.3200,0.06576,0.0,B
504,906539.0,11.570000,19.040001,74.199997,409.700012,0.08546,0.07722,0.05485,0.01428,0.2031,...,86.430000,520.500000,0.12490,0.19370,0.25600,0.06664,0.3035,0.08284,0.0,B
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,914580.0,12.470000,17.309999,80.449997,480.100006,0.08928,0.07630,0.03609,0.02369,0.1526,...,92.820000,607.299988,0.12760,0.25060,0.20280,0.10530,0.3035,0.07661,0.0,B
565,877989.0,17.540001,19.320000,115.099998,951.599976,0.08968,0.11980,0.10360,0.07488,0.1506,...,139.500000,1239.000000,0.13810,0.34200,0.35080,0.19390,0.2928,0.07867,0.0,M
566,854941.0,13.030000,18.420000,82.610001,523.799988,0.08983,0.03766,0.02562,0.02923,0.1467,...,84.459999,545.900024,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169,0.0,B
567,911654.0,14.200000,20.530001,92.410004,618.400024,0.08931,0.11080,0.05063,0.03058,0.1506,...,112.099998,828.500000,0.11530,0.34290,0.25120,0.13390,0.2534,0.07858,0.0,B


#### Inserting the data into table

In [31]:
db.insert_into_table('PCA_BREASTCANCER_VIEW',X)

2022-04-02 00:17:13,458: fedml_azure.logger INFO: inserting into table...
2022-04-02 00:17:13,459: fedml_azure.logger INFO: INSERT INTO PCA_BREASTCANCER_VIEW (id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave_points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave_points_worst, symmetry_worst, fractal_dimension_worst, column32, result, INSERTED_AT) VALUES (:id, :radius_mean, :texture_mean, :perimeter_mean, :area_mean, :smoothness_mean, :compactness_mean, :concavity_mean, :concave_points_mean, :symmetry_mean, :fractal_dimension_mean, :radius_se, :texture_se, :perimeter_se, :area_se, :smoothness_se, :compactness_se, :concavity_se, :concave_points_se, :symmetry_se, :