## Scikit-Learn PCA and Logistic Regression Pipeline
### Using BREASTCANCER_VIEW from SAP Datasphere. This view has 569 records

## Install fedml_azure package

In [None]:
pip install fedml_azure --force-reinstall

## Import the libraries needed in this notebook

In [None]:
from fedml_azure import create_workspace
from fedml_azure import DbConnection
from fedml_azure import create_compute
from fedml_azure import create_environment
from fedml_azure import DwcAzureTrain
from fedml_azure import deploy
from fedml_azure import predict

## Set up

### Initialize the workspace

The create_workspace method takes a dictionary as input for parameter workspace_args.

Before running the cell below, make sure you have an Azure Machine Learning workspace, and replace subscription_id, resource_group, and workspace_name with your own values. For details on creating a workspace, see https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python

Refer to the documentation on the ‘create_workspace’ method and its parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_workspace).


In [None]:
workspace=create_workspace(workspace_args={
                                            "subscription_id": '<subscription-id>',
                                            "resource_group": '<resource-group>',
                                            "workspace_name": '<workspace_name>'
                                            }
)
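Because the cell above ships with template values, a quick check before calling create_workspace can catch placeholders that were left unfilled. The following is a minimal sketch; `unfilled_keys` is a hypothetical helper and not part of fedml_azure, and the resource group value is an example only.

```python
import re

# Hypothetical helper, not part of fedml_azure: matches values that still
# look like "<placeholder>" templates.
PLACEHOLDER = re.compile(r"^<[^<>]*>$")

def unfilled_keys(args):
    """Return the keys whose values are empty or still template placeholders."""
    return [key for key, value in args.items()
            if not str(value).strip() or PLACEHOLDER.match(str(value).strip())]

workspace_args = {
    "subscription_id": "<subscription-id>",
    "resource_group": "my-resource-group",   # example value only
    "workspace_name": "<workspace_name>",
}

missing = unfilled_keys(workspace_args)
if missing:
    print("Fill in these keys before calling create_workspace:", missing)
```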

### Create a Compute target

The create_compute method takes the workspace, a compute_type, and compute_args as parameters. The following code creates a compute cluster named 'fedml-test1' for training.

Refer to the documentation on the ‘create_compute’ method and its parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_compute).


In [None]:
compute=create_compute(workspace=workspace,
                   compute_type='AmlComputeCluster',
                   compute_args={'vm_size':'Standard_D2',
                                'compute_name':'fedml-test1',
                                'max_nodes':6,
                                }
                )

### Create an Environment

The create_environment method takes the workspace, environment_type, and environment_args as parameters.

Pass 'fedml_azure' as a pip package. To use scikit-learn, you must also pass its name to conda_packages.

Refer to the documentation on the ‘create_environment’ method and its parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_environment).

In [None]:
environment=create_environment(workspace=workspace,
                           environment_type='CondaPackageEnvironment',
                           environment_args={'name':'pca-sklearn',
                                             'pip_packages':['joblib','fedml_azure'],
                                             'conda_packages':['scikit-learn']})


## Now, let's train the model

### Creating a Training object and setting the workspace, compute target, and environment.

The training object reuses the workspace, compute target, and environment created in the previous cells, so make sure those cells have run successfully.

The environment created above already lists fedml_azure under pip_packages and scikit-learn under conda_packages, which is what the training run needs.

Refer to the documentation on the ‘DwcAzureTrain’ class (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#dwcazuretrain-class).

In [None]:
train=DwcAzureTrain(workspace=workspace,
                    environment=environment,
                    experiment_args={'name':'federated-experiment'},
                    compute=compute)

### Generate the run config

The run config packages the configuration specified above so that a training job can be submitted.

Before running the following cell, you need a config.json file containing the connection values that give you access to SAP Datasphere. Pass the path of this file to config_file_path in the cell below.

You also need the view BREASTCANCER_VIEW in your SAP Datasphere. The data is available at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Refer to the documentation on the ‘generate_run_config’ method and its parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#generate_run_config).

In [None]:
src=train.generate_run_config(config_file_path='dwc_configs/config.json',
                          config_args={
                                          'source_directory':'Scikit-Learn-PCAPipeline',
                                          'script':'pca_script.py',
                                          'arguments':['--model_file_name','regression.pkl', '--table_name', 'BREASTCANCER_VIEW', '--n_components', '3']
                                          }
                            )
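The 'arguments' list above is handed to pca_script.py as command-line arguments. The script itself is not shown in this notebook; the sketch below illustrates one way such a script could read the three flags with argparse. The function name `parse_args` and the default for `--n_components` are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the flags passed through config_args['arguments'] (sketch)."""
    parser = argparse.ArgumentParser(
        description="Train a PCA + logistic regression pipeline")
    parser.add_argument("--model_file_name", type=str, required=True)
    parser.add_argument("--table_name", type=str, required=True)
    parser.add_argument("--n_components", type=int, default=3)
    # parse_known_args ignores extra arguments, in case the caller passes any
    args, _unknown = parser.parse_known_args(argv)
    return args

# Same values as in the run config above
args = parse_args(["--model_file_name", "regression.pkl",
                   "--table_name", "BREASTCANCER_VIEW",
                   "--n_components", "3"])
print(args.model_file_name, args.table_name, args.n_components)
```

Declaring `--n_components` with `type=int` converts the string '3' from the arguments list into an integer before it reaches the model code.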

### Submit the training job with the option to download the model outputs

Refer to the documentation on the ‘submit_run’ method and its parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#submit_run).

In [None]:
run=train.submit_run(src)
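The submitted training script is not reproduced in this notebook. As an illustration only, a PCA followed by logistic regression can be expressed as a scikit-learn Pipeline as sketched below. The sketch uses scikit-learn's bundled Wisconsin breast cancer dataset as a stand-in for BREASTCANCER_VIEW; the scaling step, the 80/20 split, and the step names are assumptions rather than the contents of pca_script.py.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled Wisconsin breast cancer dataset
features, labels = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),          # PCA is sensitive to feature scale
    ("pca", PCA(n_components=3)),         # matches --n_components 3 above
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_X, train_y)

accuracy = pipeline.score(test_X, test_y)
print(f"Held-out accuracy: {accuracy:.3f}")
# A training script would then typically persist the fitted pipeline,
# for example with joblib.dump(pipeline, "outputs/regression.pkl").
```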

### Register the model

Pass ‘outputs/model_file_name.pkl’ to the 'model_path' key of model_args, where ‘model_file_name’ is the name of the .pkl model file specified in the run config arguments.

Provide the desired model name to the ‘model_name’ key of model_args in the cell below. The 'is_sklearn_model' flag specifies whether a scikit-learn model is being registered.

Refer to the documentation on the ‘register_model’ method and its parameters (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#register_model).

In [None]:
model=train.register_model(run=run,
                           model_args={'model_name':'sklearn_pcapipeline_model',
                                       'model_path':'outputs/regression.pkl'},
                            resource_config_args={'cpu':1, 'memory_in_gb':0.5},
                            is_sklearn_model=True
                           )
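The value passed to 'model_path' has to match the file name given to --model_file_name in the run config. A small sketch of deriving one from the other, so the two cannot drift apart; `output_model_path` is a hypothetical helper, not a fedml_azure function.

```python
import posixpath

def output_model_path(model_file_name):
    """Build the 'outputs/<file>' path used when registering the model."""
    return posixpath.join("outputs", model_file_name)

model_file_name = "regression.pkl"   # same value as --model_file_name
model_path = output_model_path(model_file_name)
print(model_path)   # outputs/regression.pkl
```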

### Read test data from SAP Datasphere

In [None]:
import pandas as pd
import numpy as np
from fedml_azure import DbConnection

In [None]:
db = DbConnection(url='Scikit-Learn-PCAPipeline/config.json')
res, column_headers = db.get_data_with_headers(table_name="BREASTCANCER_VIEW", size=1)
data = pd.DataFrame(res, columns=column_headers)
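# Shuffle the rows, then keep everything from row 500 onward as test data
org_data = data.sample(frac=1).reset_index(drop=True)
# Chain fillna on the slice instead of modifying a slice in place
org_data = org_data[500:].fillna(0)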
y = org_data['diagnosis']
X = org_data.drop(['diagnosis'], axis=1)
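To make the size of the test set concrete: assuming all 569 records of the view are fetched, shuffling and slicing from row 500 leaves 69 rows. The sketch below reproduces that arithmetic with plain Python lists standing in for the DataFrame.

```python
import random

n_records = 569                      # size of BREASTCANCER_VIEW stated at the top
indices = list(range(n_records))
random.Random(0).shuffle(indices)    # plays the role of data.sample(frac=1)
held_out = indices[500:]             # same slice as org_data[500:]
print(len(held_out))                 # 69
```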

#### Change the decimal datatype in the dataframe to float for serialization

In [None]:
X = X.apply(pd.to_numeric, downcast='float')

In [None]:
import json
test_data = json.dumps(X.values.tolist())
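The conversion to float matters because the standard json module cannot serialize decimal values. The sketch below shows the failure and the fix on a single illustrative row, assuming the database driver returns `decimal.Decimal` objects; the numbers are made up for the example.

```python
import json
from decimal import Decimal

row = [Decimal("17.99"), Decimal("10.38")]   # illustrative values

try:
    json.dumps([row])
except TypeError as exc:
    error_message = str(exc)
    print("Decimal fails:", error_message)

# Converting to float first makes the payload serializable
as_floats = [[float(value) for value in row]]
payload = json.dumps(as_floats)
print(payload)
```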

### Deploy the model as a webservice to Kyma Kubernetes

Before running this cell,

1. Ensure a service principal is created, and pass the path of the config file containing the service principal credentials to the 'sp_config_path' key of deploy_args in the cell below.

2. Pass the path of the kubeconfig.yaml file used to connect to Kyma Kubernetes to the 'kubeconfig_path' key of deploy_args in the cell below.

Refer to the documentation on 'deploy' for more details (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#deploy).

In [None]:
kyma_endpoint=deploy(compute_type='Kyma',
                    inference_config_args={'entry_script':'Scikit-Learn-PCAPipeline/predict.py', 'environment':environment},
                    deploy_args={'workspace':workspace,
                                'name':'pcawebservice',
                                'models':[model],
                                'kubeconfig_path':'Scikit-Learn-PCAPipeline/kubeconfig.yaml',
                                'sp_config_path':'Scikit-Learn-PCAPipeline/sp_config.json'
                                })
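The deploy call depends on three local files. A quick existence check before deploying can surface a wrong path early. This is a minimal sketch; `missing_files` is a hypothetical helper, and the paths are the ones used in the cell above.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

required = [
    "Scikit-Learn-PCAPipeline/predict.py",
    "Scikit-Learn-PCAPipeline/kubeconfig.yaml",
    "Scikit-Learn-PCAPipeline/sp_config.json",
]

not_found = missing_files(required)
if not_found:
    print("Create or correct these paths before deploying:", not_found)
```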

### Run inference against the Kyma endpoint with the test data

Refer to the documentation on 'predict' for more details (https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#predict).

In [None]:
result=predict(endpoint_url=kyma_endpoint,compute_type='kyma',data=test_data)
result

### Write the result back to SAP Datasphere

#### Create table in SAP Datasphere

In [None]:
db.create_table("CREATE TABLE PCA_BREASTCANCER_VIEW (ID INTEGER PRIMARY KEY, radius_mean FLOAT(2), texture_mean FLOAT(2), perimeter_mean FLOAT(2), area_mean FLOAT(2), smoothness_mean FLOAT(2), compactness_mean FLOAT(2), concavity_mean FLOAT(2), concave_points_mean FLOAT(2), symmetry_mean FLOAT(2), fractal_dimension_mean FLOAT(2), radius_se FLOAT(2), texture_se FLOAT(2), perimeter_se FLOAT(2), area_se FLOAT(2), smoothness_se FLOAT(2), compactness_se FLOAT(2), concavity_se FLOAT(2), concave_points_se FLOAT(2), symmetry_se FLOAT(2), fractal_dimension_se FLOAT(2), radius_worst FLOAT(2), texture_worst FLOAT(2), perimeter_worst FLOAT(2), area_worst FLOAT(2), smoothness_worst FLOAT(2), compactness_worst FLOAT(2), concavity_worst FLOAT(2), concave_points_worst FLOAT(2), symmetry_worst FLOAT(2), fractal_dimension_worst FLOAT(2), column32 INTEGER, result VARCHAR(100))")

#### Storing the result in the dataframe

In [None]:
import pandas as pd
result_df=pd.DataFrame(result['result'])
result_df.rename( columns={0:'result'}, inplace=True )

In [None]:
X['result']=result_df['result'].values
X

#### Renaming the columns

In [None]:
X = X.rename(columns={'concave points_mean': 'concave_points_mean', 'concave points_se': 'concave_points_se', 'concave points_worst':'concave_points_worst'})
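The rename is needed because the target table's column names use underscores where the source columns contain spaces. Instead of listing each column by hand, the mapping can be derived from the column names. The sketch below uses a plain list of names; `sql_safe_names` is a hypothetical helper.

```python
def sql_safe_names(columns):
    """Map every column name containing spaces to an underscore version."""
    return {name: name.replace(" ", "_") for name in columns if " " in name}

columns = ["radius_mean", "concave points_mean",
           "concave points_se", "concave points_worst"]
rename_map = sql_safe_names(columns)
print(rename_map)
# With a DataFrame, X.rename(columns=sql_safe_names(X.columns)) applies the mapping.
```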

In [None]:
X

#### Inserting the data into the table

In [None]:
db.insert_into_table('PCA_BREASTCANCER_VIEW',X)