# Notebook to build a Scikit-Learn Linear Regression machine learning model on Azure by federating the training data from Amazon Athena and Google BigQuery via SAP Data Warehouse Cloud


## Install fedml_azure package

Pass the path of the fedml_azure library file (the .whl file uploaded earlier) to pip install.

In [1]:
pip install fedml_azure --force-reinstall

Processing ./fedml_azure-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.20-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-azure
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.20
    Uninstalling hdbcli-2.10.20:
      Successfully uninstalled hdbcli-2.10.20
  Attempting uninstall: fedml-azure
    Found existing installation: fedml-azure 1.0.0
    Uninstalling fedml-azure-1.0.0:
      Successfully uninstalled fedml-azure-1.0.0
Successfully installed fedml-azure-1.0.0 hdbcli-2.10.20
Note: you may need to restart the kernel to use updated packages.


## Initialization of AzureML resources required for training

The following steps provide a simple way to create the resources required for training:

### Initialize the workspace

The create_workspace method takes a dictionary as input for its workspace_args parameter.

Replace the values of the ‘subscription_id’, ‘resource_group’ and ‘workspace_name’ keys of workspace_args in the cell below with the values from the ‘config.json’ file downloaded in step 2 of the "Pre-requisites for the Federated ML Library for Azure ML" ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/sample-notebooks/FedML-With-Federated-Data-From-Athena-BigQuery/docs/prerequisites.md)).

Refer to the documentation on the ‘create_workspace’ method and its parameters ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_workspace)).


In [2]:
from fedml_azure import create_workspace
workspace=create_workspace(workspace_args={
                        "subscription_id": "<subscription_id>",
                        "resource_group": "<resource_group>",
                        "workspace_name": "<workspace_name>"
                          })

Getting existing Workspace
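
Instead of copying the three values by hand, they can be read from the downloaded ‘config.json’. The following is a minimal sketch, assuming the file uses the standard Azure ML layout with the same three key names. The helpers select_workspace_args and load_workspace_args are hypothetical and not part of fedml_azure.

```python
import json
from pathlib import Path

# Keys that workspace_args needs, as listed in the text above.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def select_workspace_args(config):
    """Return only the workspace keys from a parsed config.json dictionary."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError("config.json is missing: " + ", ".join(missing))
    return {key: config[key] for key in REQUIRED_KEYS}


def load_workspace_args(config_path):
    """Parse config.json from disk and build the workspace_args dictionary."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return select_workspace_args(config)
```

The result could then be passed on as create_workspace(workspace_args=load_workspace_args('config.json')), where the file path is a placeholder.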


### Create a Compute target

The create_compute method takes the workspace, a compute_type, and compute_args as parameters. The following code creates a Compute Cluster named 'cpu-cluster' for training.

Refer to the documentation on the ‘create_compute’ method and its parameters ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_compute)).


In [3]:
from fedml_azure import create_compute
compute=create_compute(workspace=workspace,
                   compute_type='AmlComputeCluster',
                   compute_args={'vm_size':'Standard_D12_v2',
                                'vm_priority':'lowpriority',
                                'compute_name':'cpu-cluster',
                                'min_nodes':0,
                                'max_nodes':4,
                                'idle_seconds_before_scaledown':1700
                                }
                )

Creating Compute_target
Creating a new compute target...
InProgress......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-11-19T23:56:45.630000+00:00', 'errors': None, 'creationTime': '2021-11-19T23:56:25.535399+00:00', 'modifiedTime': '2021-11-19T23:56:51.499094+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT1700S'}, 'vmPriority': 'LowPriority', 'vmSize': 'STANDARD_D12_V2'}
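
The status output above reports the 'idle_seconds_before_scaledown' value of 1700 as the ISO 8601 duration 'PT1700S', and with 'min_nodes' set to 0 the cluster holds no nodes until a run is submitted. The sketch below shows one way to sanity-check compute_args before calling create_compute. Both helper functions are hypothetical and only illustrate the relationship between the input values and the reported scale settings.

```python
def idle_time_as_iso8601(idle_seconds):
    """Format a scale-down delay in seconds as an ISO 8601 duration string."""
    if idle_seconds < 0:
        raise ValueError("idle_seconds must be non-negative")
    return f"PT{int(idle_seconds)}S"


def check_compute_args(compute_args):
    """Validate the node range and return the expected scale-down duration."""
    min_nodes = compute_args["min_nodes"]
    max_nodes = compute_args["max_nodes"]
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return idle_time_as_iso8601(compute_args["idle_seconds_before_scaledown"])


# Same values as the compute_args used in the cell above.
expected_idle = check_compute_args({
    "vm_size": "Standard_D12_v2",
    "vm_priority": "lowpriority",
    "compute_name": "cpu-cluster",
    "min_nodes": 0,
    "max_nodes": 4,
    "idle_seconds_before_scaledown": 1700,
})
```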


### Create an Environment

The create_environment method takes the workspace, environment_type, and environment_args as parameters.

To install fedml_azure from a local library file, pass its path to the 'pip_wheel_files' key of environment_args; the cell below instead lists the package by name under 'pip_packages'. To use the scikit-learn package, you must also pass its name to 'conda_packages'.

Refer to the documentation on the ‘create_environment’ method and its parameters ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/docs/fedml_azure.md#create_environment)).


In [4]:
from fedml_azure import create_environment
environment=create_environment(workspace=workspace,
                           environment_type='CondaPackageEnvironment',
                           environment_args={'name':'sklearn-env',
                                             'conda_packages':['scikit-learn'],
                                             'pip_packages':['fedml_azure']})

Creating Environment


## Training the model

### Instantiate the training class which assigns the resources required for training

The name of the experiment to be created must be passed to the 'name' key of experiment_args. Refer to the documentation on the ‘DwcAzureTrain’ class ([link](https://github.com/SAP-samples/dwc-fedml/blob/main/Azure/docs/fedml_azure.md#dwcazuretrain-class)).

In [5]:
from fedml_azure import DwcAzureTrain
train=DwcAzureTrain(workspace=workspace,
                    environment=environment,
                    experiment_args={'name':'federated-experiment'},
                    compute=compute)

Assigning Workspace
Creating Experiment
Assigning compute
Assigning Environment


### Read the federated data from Amazon Athena and Google BigQuery via SAP Data Warehouse Cloud in the training script

Refer to the 'db.execute_query()' call in the 'get_data(table_name)' function of the training script, which reads the data from the 'SALES_VIEW' view in SAP Data Warehouse Cloud.

Querying 'SALES_VIEW' returns the federated data from Amazon Athena and Google BigQuery without replicating it from the original data stores.

Refer to the documentation for more details on the DbConnection class ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/dbconnection.md)).
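
The training script itself is not reproduced in this notebook. As a rough sketch of what a get_data function along these lines might look like, the code below assumes that execute_query returns a pair of (rows, column names) and that the view can be queried without a schema qualifier; both are assumptions to check against the DbConnection documentation. StubConnection and its sample rows are made up so that the sketch runs without database credentials, and the connection is passed in explicitly to keep the function testable.

```python
class StubConnection:
    """Hypothetical stand-in for DbConnection; returns made-up sample rows."""

    def __init__(self):
        self.queries = []

    def execute_query(self, query):
        # Record the query so the caller can inspect what was sent.
        self.queries.append(query)
        columns = ["PRODUCT", "UNITS", "REVENUE"]  # illustrative column names
        rows = [("A", 2, 20.0), ("B", 5, 50.0)]    # illustrative data only
        return rows, columns


def get_data(db, table_name):
    """Select every row of a table or view and return a list of dictionaries."""
    rows, columns = db.execute_query(f'SELECT * FROM "{table_name}"')
    return [dict(zip(columns, row)) for row in rows]


db = StubConnection()
records = get_data(db, "SALES_VIEW")
```

In the real training script, the stub would be replaced by a DbConnection instance, and the resulting records would typically be loaded into a DataFrame before fitting the model.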

### Generate the run config which packages together the configuration information needed to submit a run in Azure ML

The file path of the ‘config.json’ created in step 4 of the "Pre-requisites for the Federated ML Library for Azure ML" ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/sample-notebooks/FedML-With-Federated-Data-From-Athena-BigQuery/docs/prerequisites.md)) must be passed to ‘config_file_path’ in the cell below.

The path of the training folder created in step 3 of "Steps to build a Machine Learning Model on Azure using Federated ML Library for Azure ML" ([link](https://github.com/SAP-samples/data-warehouse-cloud-fedml/blob/main/Azure/sample-notebooks/FedML-With-Federated-Data-From-Athena-BigQuery/docs/model_build_steps.md)) must be passed to the ‘source_directory’ key of config_args, and the name of the training script must be passed to the ‘script’ key.

The name of the model file to be created must be passed to the ‘model_file_name’ argument, and the name of the table or view to be queried must be passed to the 'table_name' argument. These arguments are forwarded to the training script.

Refer to the documentation on the ‘generate_run_config’ method and its parameters ([link](https://github.com/SAP-samples/dwc-fedml/blob/main/Azure/docs/fedml_azure.md#generate_run_config)).

In [17]:
#generating the run config
src=train.generate_run_config(config_file_path='dwc_configs/config.json',
                              config_args={
                                          'source_directory':'Scikit-Learn-Linear-Regression',
                                          'script':'train_script.py',
                                          'arguments':['--model_file_name','regression.pkl',
                                                      '--table_name', 'SALES_VIEW']
                                          }
                            )

Generating script run config
Copying config file for db connection to script_directory Scikit-Learn-Linear-Regression
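
On the receiving side, the training script has to parse the two values listed under 'arguments'. A minimal sketch using argparse from the Python standard library is shown below; the flag names come from the cell above, while the parse_args helper and the help texts are illustrative and may differ from the actual train_script.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments forwarded by the run config."""
    parser = argparse.ArgumentParser(description="Training script arguments")
    parser.add_argument("--model_file_name", type=str, required=True,
                        help="File name of the model to write, e.g. regression.pkl")
    parser.add_argument("--table_name", type=str, required=True,
                        help="Table or view to query for the training data")
    return parser.parse_args(argv)


# Same argument list as the 'arguments' key of config_args above.
args = parse_args(["--model_file_name", "regression.pkl",
                   "--table_name", "SALES_VIEW"])
```

When the script runs on the compute target, parse_args() would be called without a list so that the values are taken from the command line.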


### Submit the training job with the option to download the model outputs

Refer to the documentation on the ‘submit_run’ method and its parameters ([link](https://github.com/SAP-samples/dwc-fedml/blob/main/Azure/docs/fedml_azure.md#submit_run)).

In [18]:
run=train.submit_run(src)

Submitting training run
RunId: federated-experiment_1637367641_cc3f802a
Web View: https://ml.azure.com/runs/federated-experiment_1637367641_cc3f802a?wsid=/subscriptions/cb97564e-cea8-45a4-9c5c-a3357e8f7ee4/resourcegroups/AI_Strategy_AzureML_Resource/workspaces/AIStrategy_AzureML_Worskpace&tid=42f7676c-f455-423c-82f6-dc2d99791af7

Execution Summary
RunId: federated-experiment_1637367641_cc3f802a
Web View: https://ml.azure.com/runs/federated-experiment_1637367641_cc3f802a?wsid=/subscriptions/cb97564e-cea8-45a4-9c5c-a3357e8f7ee4/resourcegroups/AI_Strategy_AzureML_Resource/workspaces/AIStrategy_AzureML_Worskpace&tid=42f7676c-f455-423c-82f6-dc2d99791af7

This run might be using a new job runtime with improved performance and error reporting. The logs from your script are in user_logs/std_log.txt. Please let us know if you run into any issues, and if you would like to opt-out, please add the environment variable AZUREML_COMPUTE_USE_COMMON_RUNTIME to the environment variables section of the j

### Register the model

Pass the path of the .pkl model file specified earlier to the 'model_path' key of model_args. The model file is created in the outputs directory of the run.

Provide the desired model name to the ‘model_name’ key of model_args in the cell below.

The 'is_sklearn_model' flag specifies whether a scikit-learn model is being registered. If it is set to True, 'model_framework' of model_args defaults to Model.Framework.SCIKITLEARN and 'model_framework_version' of model_args defaults to sklearn.\_\_version\_\_.

Refer to the documentation on the ‘register_model’ method and its parameters ([link](https://github.com/SAP-samples/dwc-fedml/blob/main/Azure/docs/fedml_azure.md#register_model)).

In [16]:
model=train.register_model(run=run,
                           model_args={'model_name':'sklearn_linReg_model',
                                       'model_path':'outputs/regression.pkl'},
                            resource_config_args={'cpu':1, 'memory_in_gb':0.5},
                            is_sklearn_model=True
                           )

Registering the model
Configuring parameters for sklearn model
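
The 'model_path' value 'outputs/regression.pkl' only resolves if the training script wrote the model into the run's outputs directory, which Azure ML uploads with the run. The sketch below shows one way a training script could do that with pickle; whether the actual script uses pickle or joblib is an assumption, the save_model helper is hypothetical, and a plain dictionary stands in for the fitted scikit-learn estimator so that the example runs without extra packages.

```python
import os
import pickle
import tempfile


def save_model(model, model_file_name, output_dir="outputs"):
    """Serialize a model object into the outputs directory and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, model_file_name)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


# Placeholder object; a real script would pass the fitted LinearRegression model.
placeholder_model = {"coef": [1.5], "intercept": 0.2}

# A temporary directory keeps the demonstration from touching the working folder.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(placeholder_model, "regression.pkl",
                            output_dir=os.path.join(tmp, "outputs"))
    with open(saved_path, "rb") as handle:
        restored_model = pickle.load(handle)
```

Inside the training script, the default output_dir of 'outputs' would be kept so that the file ends up at the path given to 'model_path'.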


### Success!