## Scikit-Learn Preprocessing and Training Pipeline
##### from sklearn.feature_extraction.text import TfidfVectorizer
##### from sklearn.naive_bayes import MultinomialNB
### Using data from Azure datastore and DWC

## Install fedml_azure package

In [1]:
pip install fedml_azure --force-reinstall

Processing ./fedml_azure-1.0.0-py3-none-any.whl
Collecting hdbcli
  Using cached hdbcli-2.10.13-cp34-abi3-manylinux1_x86_64.whl (11.7 MB)
Installing collected packages: hdbcli, fedml-azure
  Attempting uninstall: hdbcli
    Found existing installation: hdbcli 2.10.13
    Uninstalling hdbcli-2.10.13:
      Successfully uninstalled hdbcli-2.10.13
  Attempting uninstall: fedml-azure
    Found existing installation: fedml-azure 1.0.0
    Uninstalling fedml-azure-1.0.0:
      Successfully uninstalled fedml-azure-1.0.0
Successfully installed fedml-azure-1.0.0 hdbcli-2.10.13
Note: you may need to restart the kernel to use updated packages.


## Import the libraries needed in this notebook

In [2]:
from fedml_azure import DwcAzureTrain

## Set up
### Creating a Training object and setting the workspace, compute target, and environment.

Before running the cell below, ensure that you have a workspace, and replace subscription_id, resource_group, and workspace_name with your own values.

The fedml_azure library must be available in the environment: either pass its whl file to the pip_wheel_files key in environment_args or, as in the cell below, list the package under pip_packages. To use scikit-learn, you must also pass its name to conda_packages.


In [3]:
# Create the training object; the constructor sets up the workspace,
# experiment, environment, and compute target.
training = DwcAzureTrain(
    workspace_args={
        'subscription_id': '<subscription_id>',
        'resource_group': '<resource_group>',
        'workspace_name': '<workspace_name>',
    },
    experiment_args={'name': 'test-2'},
    environment_type='CondaPackageEnvironment',
    environment_args={
        'name': 'test-env-prep',
        'conda_packages': ['scikit-learn'],
        'pip_packages': ['fedml_azure'],
    },
    compute_type='AmlComputeCluster',
    compute_args={
        'vm_size': 'Standard_D12_v2',
        'vm_priority': 'lowpriority',
        'compute_name': 'cpu-clu-prep',
        'min_nodes': 0,
        'max_nodes': 1,
        'idle_seconds_before_scaledown': 1700,
    },
)


Getting existing Workspace
Creating Experiment
Creating Compute_target
Found compute target. just use it. cpu-clu-prep
Creating Environment


### Since this model is using data stored in Azure and DWC, we need to get the data that was uploaded to Azure so we can pass it to the training script.
For information on how this specific data was uploaded to Azure, please refer to `upload_data_to_datastore.ipynb`.


In [4]:
from azureml.core import Dataset, Datastore
datastore = Datastore.get(training.workspace, 'workspaceblobstore')
datastore

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-f29b6b92-835b-480f-b06a-79cd942c7451",
  "account_name": "sampleaistorage",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [5]:
train_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'dataset/imdb_train.csv')])
df = train_dataset.to_pandas_dataframe()
df.head()

Unnamed: 0,0,1
0,"This film is absolutely awful, but nevertheles...",0
1,Well since seeing part's 1 through 3 I can hon...,0
2,I got to see this film at a preview and was da...,1
3,This adaptation positively butchers a classic ...,0
4,Råzone is an awful movie! It is so simple. It ...,0
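
The training script itself is not shown in this notebook. As a minimal sketch of the kind of pipeline named in the title, the example below chains `TfidfVectorizer` and `MultinomialNB` in a scikit-learn `Pipeline`. It assumes that column `0` holds the review text and column `1` a binary sentiment label (1 taken to mean positive); the toy lists stand in for those columns and are not part of the real dataset.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the two data columns shown above:
# column "0" (review text) and column "1" (assumed label, 1 = positive).
reviews = [
    "an awful movie, simple and boring",
    "this adaptation butchers a classic",
    "a wonderful film, I was dazzled",
    "great acting and a wonderful story",
]
labels = [0, 0, 1, 1]

# Preprocessing (TF-IDF) and the classifier travel together in one object,
# so raw text can be passed directly to fit() and predict().
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(reviews, labels)

# Round-trip through pickle, the format suggested by the pipeline.pkl file name.
restored = pickle.loads(pickle.dumps(pipeline))
prediction = restored.predict(["a wonderful and dazzling film"])
print(prediction)
```

Bundling the vectorizer with the classifier means that the deployed model can accept raw review strings without a separate preprocessing step.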


### Next, we generate the run config, which packages the specified configuration so that a training job can be submitted.

Before running the following cell, you should have a config.json file containing the values that allow you to access DWC. Provide the path to this file as config_file_path in the cell below.

You should also have the view IMDB_TEST_VIEW created in your DWC. To gather this data, please refer to https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv and download the test dataset.

For background on script run configs, see the Azure ML ScriptRunConfig reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrunconfig?view=azure-ml-py
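
Since a missing or incomplete config.json only surfaces once the job is running, it can help to check the file locally first. The sketch below is one hedged way to do that. The key names (`address`, `port`, `user`, `password`) are an assumption about what a DWC connection needs and are not confirmed by this notebook, and `missing_config_keys` is a hypothetical helper, not part of fedml_azure.

```python
import json
import tempfile
from pathlib import Path

# Assumed key names for a DWC connection; not confirmed by this notebook.
ASSUMED_REQUIRED_KEYS = ("address", "port", "user", "password")


def missing_config_keys(config_path, required=ASSUMED_REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the JSON file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in required if config.get(key) in (None, "")]


# Demonstration on a throwaway file with placeholder values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(
        json.dumps({
            "address": "<dwc-host>",
            "port": 443,
            "user": "<user>",
            "password": "",  # left empty on purpose to trigger the check
        }),
        encoding="utf-8",
    )
    missing = missing_config_keys(demo_path)

print("Missing or empty keys:", missing)
```

In practice the same helper could be pointed at `dwc_configs/config.json` before calling `generate_run_config`.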

In [6]:
# Generate the run config
src = training.generate_run_config(
    config_file_path='dwc_configs/config.json',
    config_args={
        'source_directory': 'Scikit-Learn-Preprocessor-Training-Pipeline',
        'script': 'train_script.py',
        'arguments': [
            '--model_file_name', 'pipeline.pkl',
            '--table_name', 'IMDB_TEST_VIEW',
            '--table_size', 1,
            '--data', train_dataset.as_named_input('train_data'),
        ],
    },
)

Generating script run config
Config file already exists in the script_directory Scikit-Learn-Preprocessor-Training-Pipeline
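
The contents of train_script.py are not shown here. As a minimal sketch, assuming the script reads its arguments with `argparse`, the four flags passed above might be parsed as follows. The types and defaults are assumptions, in particular the meaning of `--table_size`, and the value received for `--data` is assumed to be a string identifier for the named input. Azure ML uploads files written under `outputs/` with the run, which is consistent with the `outputs/pipeline.pkl` path used when the model is registered.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags passed via 'arguments' in the run config (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Hypothetical train_script.py entry point")
    parser.add_argument("--model_file_name", type=str, default="pipeline.pkl")
    parser.add_argument("--table_name", type=str, required=True)
    # Assumption: table_size is numeric; its exact meaning is not stated in the notebook.
    parser.add_argument("--table_size", type=float, default=1.0)
    # Assumption: the named dataset input arrives as a string identifier.
    parser.add_argument("--data", type=str, default=None)
    return parser.parse_args(argv)


# Simulate the command line that the run config above would produce.
args = parse_args([
    "--model_file_name", "pipeline.pkl",
    "--table_name", "IMDB_TEST_VIEW",
    "--table_size", "1",
    "--data", "<dataset-id>",
])

# Files saved under outputs/ are uploaded with the run.
model_path = os.path.join("outputs", args.model_file_name)
print(args.table_name, args.table_size, model_path)
```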


### Submitting the job for training

In [7]:
# Submit the training run
run = training.submit_run(src)

Submitting training run
RunId: test-2_1633632047_af7ab0e7
Web View: https://ml.azure.com/runs/test-2_1633632047_af7ab0e7?wsid=/subscriptions/cb97564e-cea8-45a4-9c5c-a3357e8f7ee4/resourcegroups/Sample2_AzureML_Resource/workspaces/Sample2_AzureML_Worskpace&tid=42f7676c-f455-423c-82f6-dc2d99791af7

Streaming azureml-logs/70_driver_log.txt

2021/10/07 18:44:17 Got JobInfoJson from env
2021/10/07 18:44:17 Starting App Insight Logger for task:  runTaskLet
2021/10/07 18:44:17 Version: 3.0.01734.0003 Branch: .SourceBranch Commit: 21dafbb
2021/10/07 18:44:17 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/info
2021/10/07 18:44:17 Send process info logs to master server succeeded
2021/10/07 18:44:17 Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
2021/10/07 18:44:17 Send process info logs to master server succeeded
[2021-10-07T18:44:17.441880] Entering context manager injector.
[2021-10-07T18:44:17.884890] context_manager_injector.py Command line Op

## Register the model for deployment

In [8]:
model = training.register_model(
    run=run,
    model_args={
        'model_name': 'sklearn_pipeline_model',
        'model_path': 'outputs/pipeline.pkl',
    },
    resource_config_args={'cpu': 1, 'memory_in_gb': 0.5},
    is_sklearn_model=True,
)
print('Name:', model.name)
print('Version:', model.version)

Registering the model
Configuring parameters for sklearn model
Name: sklearn_pipeline_model
Version: 2
