## Author:  Bryan Cafferky Copyright 08/25/2020

Original from: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/azure-databricks/automl/automl-databricks-local-with-deployment.ipynb

You can install the Azure ML SDK as a library from the Databricks GUI. When attaching a library, follow https://docs.databricks.com/user-guide/libraries.html and add the string below as your PyPI package. You can attach the library to all clusters or to just one cluster.

**Install azureml-sdk with Automated ML**
* Source: Upload Python Egg or PyPI
* PyPI Name: `azureml-sdk[automl]`
* Select Install Library

# AutoML: Classification with Local Compute on Azure Databricks with Deployment to ACI

In this example we use bike sales data from the `aw.t_salesinfo` table to showcase how you can use AutoML for a simple classification problem: predicting the bike `Subcategory` from customer attributes.

In this notebook you will learn how to:
1. Create an Azure Machine Learning `Workspace` object and initialize your notebook directory to easily reload this object from a configuration file.
2. Create an `Experiment` in an existing `Workspace`.
3. Configure AutoML using `AutoMLConfig`.
4. Train the model using Azure Databricks.
5. Explore the results.
6. Register the model.
7. Deploy the model.
8. Test the best fitted model.

Prerequisites:
Before running this notebook, please follow the readme for installing necessary libraries to your cluster.

## Set Up the Model Training Data

In [0]:
%python

spdf_salesinfo = spark.sql('''
select  split(Subcategory, ' ')[0] as Subcategory, 
        AgeBand, CommuteDistance, HasChildren, Education, Salary
FROM aw.t_salesinfo 
WHERE Category = 'Bikes' and FiscalYear in (2012, 2013) 
''')

In [0]:
# Randomly split the data into training and test sets; set a seed for reproducibility.
(trainingData, testData) = spdf_salesinfo.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())
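To illustrate what a seeded random split does, here is a minimal plain-Python sketch. It is not Spark's implementation; the `random_split` helper is hypothetical and only mirrors the idea that each row is assigned to a split at random according to the weights, so the resulting sizes are approximately (not exactly) 70/30, and the same seed reproduces the same assignment.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the final bound.
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=100)
train_again, test_again = random_split(rows, [0.7, 0.3], seed=100)

print(len(train), len(test))   # sizes are close to 700 and 300, not exact
print(train == train_again)    # the same seed gives the same split
```

Because the sizes are only approximate, printing the counts after the split (as the cell above does) is a useful sanity check.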

In [0]:
pdf_training = trainingData.toPandas()
pdf_training.to_csv("/dbfs/mnt/awdata/train.csv", index=False, header=True)

pdf_testing = testData.toPandas()
pdf_testing.to_csv("/dbfs/mnt/awdata/test.csv", index=False, header=True)

In [0]:
%fs ls /mnt/awdata/

path,name,size
dbfs:/mnt/awdata/DimOrganization.csv,DimOrganization.csv,499
dbfs:/mnt/awdata/product.csv/,product.csv/,0
dbfs:/mnt/awdata/test.csv,test.csv,105579
dbfs:/mnt/awdata/testing.csv/,testing.csv/,0
dbfs:/mnt/awdata/train.csv,train.csv,243645
dbfs:/mnt/awdata/training.csv/,training.csv/,0


In [0]:
display(spark.read.csv("/mnt/awdata/train.csv",header=True,inferSchema=True))

Subcategory,AgeBand,CommuteDistance,HasChildren,Education,Salary
Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
Mountain,Golden,0-1 Miles,N,Bachelors,20000.0
Mountain,Golden,0-1 Miles,N,Bachelors,20000.0
Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
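The cells above refer to the same files in three ways: pandas writes to `/dbfs/mnt/awdata/...`, `%fs ls` lists `dbfs:/mnt/awdata/...`, and Spark reads `/mnt/awdata/...`. On Databricks, DBFS is exposed to local file APIs (such as pandas) under the `/dbfs` prefix. The sketch below shows that mapping with a hypothetical helper, `dbfs_to_local`, which is not part of any Databricks API.

```python
def dbfs_to_local(path):
    """Map a DBFS path (with or without the dbfs: scheme) to its /dbfs local-file form."""
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if not path.startswith("/"):
        path = "/" + path
    return "/dbfs" + path


# Both forms used in this notebook resolve to the same local path.
print(dbfs_to_local("dbfs:/mnt/awdata/train.csv"))
print(dbfs_to_local("/mnt/awdata/train.csv"))
```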


## Register Machine Learning Services Resource Provider
`Microsoft.MachineLearningServices` only needs to be registered once per subscription. To register it:
1. Open the Azure portal.
2. Select All services and then Subscriptions.
3. Select the subscription that you want to use.
4. Click Resource providers.
5. Click the Register link next to `Microsoft.MachineLearningServices`.

### Check the Azure ML Core SDK Version to Validate Your Installation

In [0]:
import azureml.core

print("SDK Version:", azureml.core.VERSION)

## Initialize an Azure ML Workspace
### What is an Azure ML Workspace and Why Do I Need One?

An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows.  In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, operationalization, and the monitoring of operationalized models.


### What do I Need?

To create or access an Azure ML workspace, you will need to import the Azure ML library and specify the following information:
* A name for your workspace. You can choose one.
* Your subscription ID. Use the `id` value from the output of the `az account show` command.
* The resource group name. The resource group organizes Azure resources and provides a default region for the resources in the group. The resource group will be created if it doesn't exist. Resource groups can be created and viewed in the [Azure portal](https://portal.azure.com).
* The region. Supported regions include `eastus2`, `eastus`, `westcentralus`, `southeastasia`, `westeurope`, `australiaeast`, `westus2`, `southcentralus`.

In [0]:
subscription_id = "bxxxx-8333-3322-9999-8bbbbbbb" #you should be owner or contributor
resource_group = "rg_awazureml" #you should be owner or contributor
workspace_name = "awazureml" #your workspace name
workspace_region = "eastus2" #your region

In [0]:
subscription_id = "<your subscription id>" #you should be owner or contributor
resource_group = "<AML workspace resource group name>" #you should be owner or contributor
workspace_name = "<AML workspace name>" #your workspace name
workspace_region = "<your region>" #your region

## Creating a Workspace

If you followed the book, you have already created an Azure ML workspace.

However, it is also possible to create a workspace using Python in your notebook.

The code below creates a new Azure ML workspace. You do not need it now; it is included for completeness.

```python
# Import the Workspace class and create the workspace (or reuse it if it already exists).
from azureml.core import Workspace

ws = Workspace.create(name = workspace_name,
                      subscription_id = subscription_id,
                      resource_group = resource_group, 
                      location = workspace_region,                      
                      exist_ok=True)
ws.get_details()
```

**Note:** Creation of a new workspace can take several minutes.

## Configuring Your Local Environment
The cell below validates that you have access to the specified workspace by retrieving it. To also save a configuration file for later reloads, call `ws.write_config()`; the original notebook gives `./aml_config/config.json` as the default location.

In [0]:
from azureml.core.workspace import Workspace

ws = Workspace.get(name=workspace_name,
               subscription_id=subscription_id,
               resource_group=resource_group)

ws.name

In [0]:
ws
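As a rough illustration of what a workspace configuration file holds, the sketch below writes and reloads a `config.json` using only the standard library. In the SDK itself, `ws.write_config()` and `Workspace.from_config()` play these roles. The three key names reflect the SDK's config format as I understand it, the values are placeholders, and a temporary directory stands in for the notebook's working directory.

```python
import json
import os
import tempfile

# Placeholder values; substitute your own workspace details.
settings = {
    "subscription_id": "<your subscription id>",
    "resource_group": "<AML workspace resource group name>",
    "workspace_name": "<AML workspace name>",
}

# A temporary directory stands in for the notebook's working directory.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")

# Persist the settings so that a later session can reload them.
with open(config_path, "w") as config_file:
    json.dump(settings, config_file, indent=2)

# Reload the settings, as a later session would.
with open(config_path) as config_file:
    loaded = json.load(config_file)

print(loaded["workspace_name"])
```

Keeping these values in a file avoids repeating the subscription and resource group details in every notebook.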

## Create a Folder to Host Sample Projects
Finally, create a folder where all the sample projects will be hosted.

In [0]:
import os

projects_folder = './awml'

if not os.path.isdir(projects_folder):
    os.mkdir(projects_folder)
    
print('Projects will be created in {}.'.format(projects_folder))
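A possible alternative to the check-then-create pattern above is `os.makedirs` with `exist_ok=True`, which also creates missing parent folders. That matters for nested paths such as `./awml/automl-local-classification`. This is a minimal sketch; the temporary base directory is an assumption made so the example does not touch your working directory.

```python
import os
import tempfile

# A temporary base directory keeps the example self-contained.
base = tempfile.mkdtemp()
project_folder = os.path.join(base, "awml", "automl-local-classification")

# Creates both 'awml' and the nested project folder in one call.
os.makedirs(project_folder, exist_ok=True)

# Calling it again is safe: exist_ok=True suppresses the "already exists" error.
os.makedirs(project_folder, exist_ok=True)

print(os.path.isdir(project_folder))
```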

## Create an Experiment

As part of the setup you have already created an Azure ML `Workspace` object. For AutoML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [0]:
import logging
import os
import random
import time
import json

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

In [0]:
# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-local-classification'
project_folder = './awml/automl-local-classification'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

Unnamed: 0,Unnamed: 1
SDK version,1.5.0
Subscription ID,b25e4a1a-807f-4e08-9eb4-7f162b1ecd74
Workspace Name,awazureml
Resource Group,rg_awazureml
Location,eastus2
Project Directory,./awml/automl-local-classification
Experiment Name,automl-local-classification


## Registering Datastore

A datastore saves the connection information for a storage service (e.g. Azure Blob, Azure Data Lake, Azure SQL) to your workspace, so you can access the storage without exposing credentials in your code. The first thing you will need to do is register a datastore; refer to the [Python SDK documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py) for how to register datastores. __Note: as a security best practice, do not check code that registers datastores with secrets into source control.__

The code below registers a datastore for the `awdata` blob container, using a storage account key read from a Databricks secret scope.

In [0]:
blobacctkey = dbutils.secrets.get(scope = "aw-key-vault-secrets", key = "blobacctkey")

In [0]:
from azureml.core import Datastore

datastore_name = 'awamltraining'
container_name = 'awdata' 
account_name = 'awstorageaccounteda'

Datastore.register_azure_blob_container(
    workspace = ws, 
    datastore_name = datastore_name, 
    container_name = container_name, 
    account_name = account_name,
    account_key = blobacctkey
)

Below is an example of how to register a private blob container:
```python
datastore = Datastore.register_azure_blob_container(
    workspace = ws, 
    datastore_name = 'example_datastore', 
    container_name = 'example-container', 
    account_name = 'storageaccount',
    account_key = 'accountkey'
)
```
The example below shows how to register an Azure Data Lake Store. Please make sure you have granted the necessary permissions for the service principal to access the data lake.
```python
datastore = Datastore.register_azure_data_lake(
    workspace = ws,
    datastore_name = 'example_datastore',
    store_name = 'adlsstore',
    tenant_id = 'tenant-id-of-service-principal',
    client_id = 'client-id-of-service-principal',
    client_secret = 'client-secret-of-service-principal'
)
```

## Load Training Data Using Dataset

Automated ML takes a `TabularDataset` as input.

You are free to use the data preparation libraries and tools of your choice to do the required preparation. Once you are done, you can write the data to a datastore and create a `TabularDataset` from it.

You will get the datastore you registered previously and pass it to `Dataset` for reading. The data comes from the training and test CSV files written earlier in this notebook. `DataPath` points to a specific location within a datastore.

More information on Datastores at: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data

In [0]:
# List all datastores registered in the current workspace
datastores = ws.datastores

for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

In [0]:
datastore = ws.get_default_datastore()
datastore

TabularDataset reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py

In [0]:
from azureml.core.dataset import Dataset
from azureml.data.datapath import DataPath

datastore = Datastore.get(workspace = ws, datastore_name = datastore_name)

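# Read the CSV files saved earlier; 'test.csv' matches the file name used when the test set was written.
train_data = Dataset.Tabular.from_delimited_files(datastore.path('train.csv'), header=True)
test_data = Dataset.Tabular.from_delimited_files(datastore.path('test.csv'), header=True)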

## Review the TabularDataset
You can peek at the result of a TabularDataset at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the TabularDataset, which makes it fast even against large datasets.

In [0]:
train_data.take(5).to_pandas_dataframe()

Unnamed: 0,Subcategory,AgeBand,CommuteDistance,HasChildren,Education,Salary
0,Mountain,Golden,0-1 Miles,False,Bachelors,10000.0
1,Mountain,Golden,0-1 Miles,False,Bachelors,10000.0
2,Mountain,Golden,0-1 Miles,False,Bachelors,10000.0
3,Mountain,Golden,0-1 Miles,False,Bachelors,10000.0
4,Mountain,Golden,0-1 Miles,False,Bachelors,20000.0
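The lazy `skip`/`take` behavior described above can be pictured with a plain-Python analogy using `itertools.islice`. This is only an analogy, not how the SDK is implemented, and `read_rows` and `skip_take` are hypothetical helpers. In this sketch the skipped rows are still read from the generator, but nothing beyond the requested window is ever produced.

```python
from itertools import islice


def read_rows(log):
    """Lazily yield rows from a large source, recording each row actually read."""
    for index in range(1_000_000):
        log.append(index)
        yield {"row": index}


def skip_take(rows, skip, take):
    """Return `take` rows after skipping the first `skip`, without reading further."""
    return list(islice(rows, skip, skip + take))


log = []
window = skip_take(read_rows(log), skip=10, take=5)

print([row["row"] for row in window])  # rows 10 through 14
print(len(log))                        # only 15 of the 1,000,000 rows were read
```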


In [0]:
X_train = train_data.drop_columns("Subcategory")
X_train.take(2).to_pandas_dataframe()

Unnamed: 0,AgeBand,CommuteDistance,HasChildren,Education,Salary
0,Golden,0-1 Miles,False,Bachelors,10000.0
1,Golden,0-1 Miles,False,Bachelors,10000.0


In [0]:
y_train = train_data.keep_columns("Subcategory")

In [0]:
y_train.take(5).to_pandas_dataframe()

Unnamed: 0,Subcategory
0,Mountain
1,Mountain
2,Mountain
3,Mountain
4,Mountain


## Configure AutoML

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**spark_context**|Spark context object. For Databricks, use `spark_context=sc`.|
|**max_concurrent_iterations**|Maximum number of iterations to execute in parallel. This should be <= number of worker nodes in your Azure Databricks cluster.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]. The configuration cell in this notebook passes `training_data` and `label_column_name` instead of `X` and `y`.|
|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
|**preprocess**|Set this to True to enable pre-processing of data, e.g. converting strings to numeric values using one-hot encoding.|
|**exit_score**|Target score for the experiment, associated with the primary metric. For example, `exit_score=0.995` ends the experiment once that score is reached.|

For the full list of configuration options, see: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train

In [0]:
automl_classifier = AutoMLConfig(
                                task='classification',
                                debug_log = 'automl_errors.log',
                                verbosity = logging.INFO,
                                featurization = 'auto',
                                primary_metric='AUC_weighted',
                                iteration_timeout_minutes = 10,
                                iterations = 5,
                                experiment_timeout_minutes=30,
                                max_concurrent_iterations = 4,  # change it based on number of worker nodes
                                blacklist_models=['XGBoostClassifier'],
                                training_data=train_data,
                                label_column_name='Subcategory',
                                n_cross_validations=5)
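The guidance that `max_concurrent_iterations` should not exceed the number of worker nodes can be checked before submitting a run. The sketch below uses a hypothetical helper, `check_automl_settings`, which is not part of the SDK. It also estimates a rough worst-case training time by assuming iterations run in equal batches and each takes its full timeout; real runs include overhead that this ignores.

```python
import math


def check_automl_settings(settings, worker_nodes):
    """Return a list of warnings for settings that look inconsistent."""
    warnings = []

    concurrency = settings["max_concurrent_iterations"]
    if concurrency > worker_nodes:
        warnings.append(
            "max_concurrent_iterations ({}) exceeds worker nodes ({})".format(
                concurrency, worker_nodes))

    # Rough worst case: iterations run in batches, each taking the full timeout.
    batches = math.ceil(settings["iterations"] / concurrency)
    worst_case_minutes = batches * settings["iteration_timeout_minutes"]
    if worst_case_minutes > settings["experiment_timeout_minutes"]:
        warnings.append(
            "worst-case iteration time ({} min) exceeds experiment timeout ({} min)".format(
                worst_case_minutes, settings["experiment_timeout_minutes"]))

    return warnings


# Values taken from the AutoMLConfig cell above.
settings = {
    "iterations": 5,
    "iteration_timeout_minutes": 10,
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
}

print(check_automl_settings(settings, worker_nodes=4))  # no warnings
print(check_automl_settings(settings, worker_nodes=2))  # concurrency warning
```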

## Train the Models

Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.

In [0]:
local_run = experiment.submit(automl_classifier, show_output = True)

## Explore the Results

#### Portal URL for Monitoring Runs

The following will provide a link to the web interface to explore individual run details and status. In the future we might support output displayed in the notebook.

In [0]:
displayHTML("<a href='{}' target='_blank'>Azure Portal: {}</a>".format(local_run.get_portal_url(), local_run.id))

#### Retrieve All Child Runs After the Experiment Completes
Once the parent run has completed, you can use SDK methods to fetch all the child runs and see the individual metrics logged for each one.

In [0]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}    
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Unnamed: 0,0,1,2,3,4
AUC_macro,0.62,0.62,0.6,0.63,0.63
AUC_micro,0.71,0.71,0.68,0.72,0.72
AUC_weighted,0.61,0.61,0.59,0.62,0.62
accuracy,0.54,0.54,0.49,0.55,0.55
average_precision_score_macro,0.43,0.42,0.41,0.43,0.43
average_precision_score_micro,0.53,0.53,0.49,0.54,0.54
average_precision_score_weighted,0.48,0.47,0.47,0.48,0.49
balanced_accuracy,0.4,0.37,0.4,0.4,0.39
f1_score_macro,0.37,0.31,0.36,0.35,0.33
f1_score_micro,0.54,0.54,0.49,0.55,0.55


## Deploy

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method on `local_run` returns the best run and the fitted model for the last invocation. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [0]:
best_run, fitted_model = local_run.get_output()

In [0]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-local-classification,AutoML_69f164d5-611c-4a36-9563-fc1c20ea5e7c_4,,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Download the conda environment file
From the *best_run* download the conda environment file that was used to train the AutoML model.

In [0]:
from automl.client.core.common import constants

conda_env_file_name = 'conda_env.yml'
best_run.download_file(name="outputs/conda_env_v_1_0_0.yml", output_file_path=conda_env_file_name)

with open(conda_env_file_name, "r") as conda_file:
    conda_file_contents = conda_file.read()
    print(conda_file_contents)

### Download the model scoring file
From the *best_run* download the scoring file to get the predictions from the AutoML model.

In [0]:
from automl.client.core.common import constants

script_file_name = 'scoring_file.py'
best_run.download_file(name="outputs/scoring_file_v_1_0_0.py", output_file_path=script_file_name)

with open(script_file_name, "r") as scoring_file:
    scoring_file_contents = scoring_file.read()
    print(scoring_file_contents)

## Register the Fitted Model for Deployment
If neither `metric` nor `iteration` is specified in the `register_model` call, the iteration with the best primary metric is registered.

In [0]:
description = 'AutoML Bike Model'
tags = None
model = local_run.register_model(description = description, tags = tags)
local_run.model_id # This will be written to the scoring script file later in the notebook.

### Deploy the model as a Web Service on Azure Container Instance

Create the configuration needed for deploying the model as a web service.

In [0]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.environment import Environment

myenv = Environment.from_conda_specification(name="myenv", file_path=conda_env_file_name)
inference_config = InferenceConfig(entry_script=script_file_name, environment=myenv)

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                               memory_gb = 1, 
                                               tags = {'bike': "subcategory", 'type': "automl_classification"}, 
                                               description = 'bike subcategory service for Automl Classification')

### Code below is not re-runnable.  It will fail if the webservice already exists.

In [0]:
from azureml.core.webservice import Webservice
from azureml.core.model import Model

aci_service_name = 'automl-databricks-local'
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

In [0]:
aci_service

In [0]:
pdf_testing.dtypes

In [0]:
indata = pdf_testing.drop(['Subcategory'], axis=1)
X_test = indata[:3].values
X_test

In [0]:
import json

test_samples = json.dumps({"data": X_test.tolist()})
test_samples = bytes(test_samples, encoding='utf8')

# predict using the deployed model
result = aci_service.run(input_data=test_samples)

print('Complete')

In [0]:
print(test_samples)
print(result)

In [0]:
print(aci_service.scoring_uri)

In [0]:
import requests
import json

scoring_uri = aci_service.scoring_uri   # URL for the web service

# If the service is authenticated, set the key or token
key = '<your key or token>'

# Two sets of data to score, so we get two results back
data = {"data":
        [
            [
                "Golden",
                "0-1 Miles",
                "N",
                "Bachelors",
                10000.0
            ],
            [
                "Golden",
                "0-1 Miles",
                "N",
                "Bachelors",
                30000.0
            ]
        ]
        }
# Convert to JSON string
input_data = json.dumps(data)

# Set the content type
headers = {'Content-Type': 'application/json'}

# If authentication is enabled, set the authorization header
#headers['Authorization'] = f'Bearer {key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)
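The request-building steps in the cell above can be factored into a small function so that the payload and headers can be inspected without calling the service. This is a minimal sketch: `build_scoring_request` is a hypothetical helper, and the `Bearer` header follows the commented-out line in the cell above.

```python
import json


def build_scoring_request(rows, key=None):
    """Build the JSON body and headers for a scoring request."""
    body = json.dumps({"data": rows})
    headers = {"Content-Type": "application/json"}
    # Only add the authorization header when the service requires a key or token.
    if key is not None:
        headers["Authorization"] = f"Bearer {key}"
    return body, headers


# Two rows to score, matching the column order of the training features.
rows = [
    ["Golden", "0-1 Miles", "N", "Bachelors", 10000.0],
    ["Golden", "0-1 Miles", "N", "Bachelors", 30000.0],
]

body, headers = build_scoring_request(rows)
print(body)
print(headers)
```

The returned values can then be passed to `requests.post(scoring_uri, body, headers=headers)` as in the cell above.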

In [0]:
aci_service.delete()
