# AutoML - distributed training for regression

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](../../../resources/workspace/workspace.ipynb) 
- A Compute Cluster. [Check this notebook to create a compute cluster](../../../resources/compute/compute.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section

## 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

#### 1.1. Import the required libraries

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl
from azure.ai.ml import Input

import json
import pandas as pd
import os
import shutil
import yaml
from matplotlib import pyplot as plt

#### 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters: a subscription ID, a resource group and a workspace name. We will pass these details to the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
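
`MLClient.from_config` looks for a `config.json` file that holds the three workspace identifiers. A minimal sketch of writing such a file is shown below. The `write_workspace_config` helper and the placeholder values are assumptions made for illustration and are not part of the SDK.

```python
import json
import os
import tempfile


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three workspace identifiers (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = os.path.join(folder, "config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config_path


# Example with placeholder values, written to a temporary folder
demo_folder = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_folder, "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
)
print(config_path)
```

Placing a file like this next to the notebook (with real values) should let the `try` branch above succeed without editing the code.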

###### Show Azure ML Workspace information

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
output

## 2. Data

We will download the data and save it in the respective folders, created at the same location (in the repo) as this notebook.
For large data we recommend that the dataset be registered in your workspace.

In [None]:
import urllib.request

data_path = "./data/"
os.makedirs(data_path, exist_ok=True)

# Download dataset files and copy them in data path

data_download_url = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/machineData.csv"
data_file = data_path + "data.csv"
urllib.request.urlretrieve(data_download_url, filename=data_file)

print("Dataset files downloaded...")
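
Before building the MLTable, it can help to confirm that the downloaded CSV contains the label column used later in this notebook (`ERP`). The following is a minimal sketch using only the standard library. The helper names and the tiny demo file are assumptions for illustration, and the demo values are invented.

```python
import csv
import os
import tempfile


def csv_header(csv_path, delimiter=",", encoding="ascii"):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="", encoding=encoding) as f:
        return next(csv.reader(f, delimiter=delimiter), [])


def has_column(csv_path, column_name, delimiter=",", encoding="ascii"):
    """Check whether a column name appears in the CSV header."""
    return column_name in csv_header(csv_path, delimiter, encoding)


# Demo on a tiny made-up file; the values are invented for illustration
demo_csv = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_csv, "w", newline="", encoding="ascii") as f:
    writer = csv.writer(f)
    writer.writerow(["CACH", "CHMIN", "PRP", "ERP"])
    writer.writerow([256, 16, 198, 199])

print(has_column(demo_csv, "ERP"))
```

In the notebook itself, the same check would be `has_column(data_file, "ERP")`.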

#### Create the Azure Machine Learning MLTable

In [None]:
def create_folder_and_ml_table(csv_file, output, delimiter=",", encoding="ascii"):
    os.makedirs(output, exist_ok=True)
    fname = os.path.split(csv_file)[-1]

    mltable = {
        "paths": [{"file": f"./{fname}"}],
        "transformations": [
            {"read_delimited": {"delimiter": delimiter, "encoding": encoding}}
        ],
    }
    with open(os.path.join(output, "MLTable"), "w") as f:
        f.write(yaml.dump(mltable))
    shutil.copy(csv_file, os.path.join(output, fname))

In [None]:
train_mltable_path = "./data/training-mltable-folder"
create_folder_and_ml_table(data_file, train_mltable_path)

In [None]:
# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=train_mltable_path)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my_training_mltable")

For documentation on creating your own MLTable assets for jobs beyond this notebook:
- https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable details how to write MLTable YAMLs (required for each MLTable asset).
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK covers how to work with them in the v2 CLI/SDK.

## 3. Create or Attach existing AmlCompute

[Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) is a managed-compute infrastructure that allows the user to easily create a single or multi-node compute. In this tutorial, you will create or attach an AmlCompute cluster as your training compute resource.

<b>Creation of AmlCompute takes approximately 5 minutes.</b>

If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azure.core.exceptions import ResourceNotFoundError
from azure.ai.ml.entities import AmlCompute

cluster_name = "automl-distributed-cluster"

try:
    # Retrieve an already attached Azure Machine Learning Compute.
    compute = ml_client.compute.get(cluster_name)
except ResourceNotFoundError as e:
    compute = AmlCompute(
        name=cluster_name,
        size="STANDARD_DS12_V2",
        type="amlcompute",
        min_instances=4,
        max_instances=6,
        idle_time_before_scale_down=120,
    )
    poller = ml_client.begin_create_or_update(compute)
    poller.wait()

## 4. Configure and run the Distributed Regression training job

In this section we will configure and run the AutoML job to train the model.

#### 4.1 Configure the job through the regression() factory function

##### regression() function parameters:

The `regression()` factory function allows the user to configure AutoML for the regression task for the most common scenarios with the following properties.

|Property|Description|
|-|-|
|**target_column_name**|The name of the label column.|
|**primary_metric**|The metric that you want to optimize. If this parameter is set to `None`, `normalized_root_mean_squared_error` is used.|
|**training_data**|The training data to be used for this experiment. You can use a registered MLTable in the workspace, referenced as `<mltable_name>:<version>`, or a local folder that contains an MLTable file, for example `Input(type=AssetTypes.MLTABLE, path="./data/training-mltable-folder")` as in this notebook. The `training_data` parameter must always be provided.|
|**compute**|The compute on which the AutoML job will run. In this example we are using the compute cluster called 'automl-distributed-cluster' created in the previous section. You can replace it with any other compute in the workspace.|
|**name**|The name of the Job/Run. This is an optional property. If not specified, a random name will be generated.|
|**experiment_name**|The name of the Experiment. An Experiment is like a folder with multiple runs in Azure ML Workspace that should be related to the same logical machine learning experiment. For example, if a user runs this notebook multiple times, there will be multiple runs associated with the same Experiment name.|


##### set_limits() parameters:
This is an optional configuration method to configure limits parameters such as timeouts.

|Property|Description|
|-|-|
|**timeout_minutes**|Maximum time in minutes that the whole AutoML job can take before it terminates. This timeout includes setup, featurization and training runs, but not the ensembling and model explainability runs at the end of the process, since those can only happen once all the trials (child jobs) are done. If not specified, the default total timeout is 6 days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size is not greater than 10,000,000 (rows times columns), or an error results. It is hard to say what the timeout should be, because runtimes depend on multiple factors such as the number of rows and columns and the statistical properties of the data. If your dataset is within that size limit, you can try setting the timeout to 1 hour. If fewer than 30 child jobs complete in that time, increase the timeout and re-run the experiment.|
|**trial_timeout_minutes**|Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.|
|**max_trials**|The maximum number of trials/runs each with a different combination of algorithm and hyperparameters to try during an AutoML job. If not specified, the default is 1000 trials. If you are setting the `enable_early_termination=True` the number of trials will be smaller.|
|**max_concurrent_trials**|Represents the maximum number of trials (children jobs) that would be executed in parallel. It's a good practice to set this number equal to the number of nodes in your cluster.|
|**enable_early_termination**|Whether to enable early termination if the score is not improving over 10 iterations. Early stopping window starts only after first 20 iterations. This means that the first iteration where stopping can occur is the 31st.|
|**max_nodes**|The maximum number of nodes to use for `distributed` training. For classification/regression tasks, the minimum value that needs to be set is 4.|
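
The early termination rule described in the table can be illustrated with a small simulation. This is only a sketch of the arithmetic (20 warm-up iterations, then 10 iterations without improvement), not AutoML's actual implementation. The function name, its parameters and the sample scores are assumptions made for illustration.

```python
def first_stop_iteration(scores, warm_up=20, patience=10):
    """Return the 1-based iteration at which stopping would trigger, or None.

    Illustrative only: higher scores are treated as better, and iterations
    without improvement are counted only after the warm-up window.
    """
    best = float("-inf")
    stale = 0  # consecutive post-warm-up iterations without improvement
    for i, score in enumerate(scores, start=1):
        if stale >= patience:
            return i  # stopping triggers before this iteration runs
        improved = score > best
        if improved:
            best = score
        if i > warm_up:
            stale = 0 if improved else stale + 1
    return None


# A score that never improves after the first iteration
flat_scores = [0.5] * 40
print(first_stop_iteration(flat_scores))

# An improvement at iteration 25 resets the count
late_improvement = [0.5] * 24 + [0.6] * 20
print(first_stop_iteration(late_improvement))
```

With a flat score, iterations 21 to 30 are the first ten counted, so the earliest stop is at iteration 31, which matches the description of `enable_early_termination`.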


##### set_training() parameters:
This is an optional configuration method to configure training parameters.

|Property|Description|
|-|-|
|**allowed_training_algorithms**|A list of regression algorithms to try out as the base model for model training in an experiment. If it is omitted or set to `None`, all supported algorithms are used during the experiment.|
|**enable_model_explainability**|A flag to turn on model explainability, such as feature importance, for the best model evaluated by the Automated ML system.|
|**training_mode**|[Experimental] The training mode to use. The possible values are:<ul><li>`distributed`: enables distributed training for supported algorithms.</li><li>`non_distributed`: disables distributed training.</li><li>`auto`: currently the same as `non_distributed`; this might change in the future.</li></ul><i>Note: This parameter is in public preview and may change in the future.</i>|

In [None]:
# general job parameters
max_trials = 5
exp_name = "regression_task_distributed_lightgbm"
target_column_name = "ERP"
primary_metric = "R2Score"

#### 4.2 Create the AutoML regression job

In [None]:
from azure.ai.ml.automl import ColumnTransformer

transformer_params = {
    "imputer": [
        ColumnTransformer(fields=["CACH"], parameters={"strategy": "most_frequent"}),
        ColumnTransformer(fields=["PRP"], parameters={"strategy": "most_frequent"}),
    ],
}

In [None]:
# Create the AutoML regression job with the related factory-function.

regression_job = automl.regression(
    compute=cluster_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name=target_column_name,
    primary_metric=primary_metric,
    tags={"my_custom_tag": "My custom value"},
)

regression_job.set_featurization(
    mode="custom",
    transformer_params=transformer_params,
    blocked_transformers=["LabelEncoder"],
    column_name_and_types={"CHMIN": "Categorical"},
)

# Limits are all optional
regression_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=120,
    enable_early_termination=True,
    max_trials=max_trials,
    max_cores_per_trial=-1,
    max_concurrent_trials=4,
    max_nodes=6,
)

regression_job.set_training(
    allowed_training_algorithms=["LightGBM"],
    enable_model_explainability=True,
    training_mode="distributed",
)
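
The constraints stated in the parameter tables above can be checked before submitting the job. The sketch below encodes only two of them: the dataset size limit for timeouts of 60 minutes or less, and the minimum `max_nodes` of 4 for distributed training. The `check_limits` helper and the example row and column counts are hypothetical, and the service performs its own validation.

```python
def check_limits(n_rows, n_columns, timeout_minutes, max_nodes, training_mode):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    cells = n_rows * n_columns  # dataset size as rows times columns
    if timeout_minutes <= 60 and cells > 10_000_000:
        problems.append(
            f"timeout_minutes={timeout_minutes} requires at most 10,000,000 "
            f"cells, but the dataset has {cells:,}"
        )
    if training_mode == "distributed" and max_nodes < 4:
        problems.append(
            f"max_nodes={max_nodes} is below the minimum of 4 for distributed training"
        )
    return problems


# Settings similar to the ones used in this notebook (row/column counts are made up)
print(check_limits(1_000, 10, timeout_minutes=600, max_nodes=6, training_mode="distributed"))

# A configuration that violates both rules
for problem in check_limits(2_000_000, 10, 60, 2, "distributed"):
    print(problem)
```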

#### 4.3 Train the AutoML model

Using the `MLClient` created earlier, we will execute the following commands to train the model.

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    regression_job, experiment_name=exp_name
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
# Wait until AutoML training runs are finished
ml_client.jobs.stream(returned_job.name)

In [None]:
# print job name
print(returned_job.name)

In [None]:
# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

## 5. Retrieve the Best Trial (Best Model's trial/run)

Use the MLflow client (`MlflowClient`) to access the results (such as models, artifacts and metrics) of a previously completed AutoML trial.

#### 5.1 Initialize MLflow Client
The models and artifacts produced by AutoML can be accessed via the MLflow interface.
Initialize the MLflow client here, and set Azure ML as its tracking backend.

*IMPORTANT*: you need to have installed the latest MLflow packages with:

    pip install azureml-mlflow

    pip install mlflow

#### 5.2 Obtain the tracking URI for MLflow

In [None]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

#### 5.3 Get the AutoML parent Job

In [None]:
job_name = returned_job.name

# Example if providing a specific Job name/ID
# job_name = "b4e95546-0aa1-448e-9ad6-002e3207b4fc"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

In [None]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

#### 5.4 Get the AutoML best child run

In [None]:
# Get the best model's child run
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)
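
If the parent job has not completed, the `automl_best_child_run_id` tag may not be present yet, and the dictionary lookup above fails with a bare `KeyError`. A small sketch of a lookup with a clearer error message follows. The helper name and the sample tag value are assumptions made for illustration.

```python
def get_best_child_run_id(tags):
    """Return the best child run ID from a parent run's tags, or raise a clear error."""
    tag_name = "automl_best_child_run_id"
    if tag_name not in tags:
        raise LookupError(
            f"Tag '{tag_name}' not found. The AutoML parent job may still be "
            "running or may have failed before a best trial was selected."
        )
    return tags[tag_name]


# Example with a made-up tag dictionary
sample_tags = {"automl_best_child_run_id": "example-parent-run_3"}
print(get_best_child_run_id(sample_tags))
```

In the notebook, the call would be `get_best_child_run_id(mlflow_parent_run.data.tags)`.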

## 6. Model Evaluation

#### 6.1 Get best model run's metrics

Access the results (such as Models, Artifacts, Metrics) of a previously completed AutoML Run.

In [None]:
best_run.data.metrics

#### 6.2 Download the best model

In [None]:
# Create local folder
local_dir = "./artifact_downloads"
os.makedirs(local_dir, exist_ok=True)
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

# Show the contents of the MLFlow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")

##### Featurization Summary
We can look at the engineered feature names generated during featurization via the JSON file named 'engineered_feature_names.json' under the run outputs.

In [None]:
import json

with open(os.path.join(local_path, "engineered_feature_names.json"), "r") as f:
    records = json.load(f)
records

##### View featurization summary
You can also see what featurization steps were performed on different raw features in the user data. For each raw feature in the user data, the following information is displayed:

+ Raw feature name
+ Number of engineered features formed out of this raw feature
+ Type detected
+ If feature was dropped
+ List of feature transformations for the raw feature

In [None]:
# Render the JSON as a pandas DataFrame
with open(os.path.join(local_path, "featurization_summary.json"), "r") as f:
    records = json.load(f)
fs = pd.DataFrame.from_records(records)

# View a summary of the featurization
fs[
    [
        "RawFeatureName",
        "TypeDetected",
        "Dropped",
        "EngineeredFeatureCount",
        "Transformations",
    ]
]
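
The featurization summary can also be aggregated, for example to count raw features per detected type and to total the engineered features. The sketch below works on plain dictionaries that use the same keys as the table above. The sample records are invented for illustration and do not come from an actual run.

```python
from collections import Counter


def summarize_featurization(records):
    """Count raw features per detected type and total the engineered features."""
    type_counts = Counter(record["TypeDetected"] for record in records)
    total_engineered = sum(int(record["EngineeredFeatureCount"]) for record in records)
    return dict(type_counts), total_engineered


# Made-up records that follow the structure of featurization_summary.json
sample_records = [
    {"RawFeatureName": "CACH", "TypeDetected": "Numeric", "EngineeredFeatureCount": 1},
    {"RawFeatureName": "CHMIN", "TypeDetected": "Categorical", "EngineeredFeatureCount": 3},
    {"RawFeatureName": "PRP", "TypeDetected": "Numeric", "EngineeredFeatureCount": 1},
]

type_counts, total_engineered = summarize_featurization(sample_records)
print(type_counts, total_engineered)
```

Passing the `records` list loaded from `featurization_summary.json` gives the same aggregation for the real run.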

#### 6.3 Prediction using managed online endpoint

Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. We will deploy the model to a managed online endpoint and score data that has the same schema as the training dataset (without the label column).

The inference runs on managed compute provisioned for the deployment (a `Standard_DS3_V2` instance in this example), not on the training cluster.

##### 6.3.1 Create a model endpoint
First, we create the online endpoint. The model is registered and deployed to it in the following steps.

In [None]:
# import required libraries
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    ProbeSettings,
)

# Endpoint names must be unique within an Azure region; change this name if it is already taken
online_endpoint_name = "distributed-regression-model"

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is a sample online endpoint for mlflow model",
    auth_mode="key",
    tags={"foo": "bar"},
)

ml_client.begin_create_or_update(endpoint).wait()

##### 6.3.2 Register best model

In [None]:
model_name = "distributed-regression-model01"
model = Model(
    path=f"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/mlflow-model/",
    name=model_name,
    description="my sample regression model",
    type=AssetTypes.MLFLOW_MODEL,
)

# for downloaded file
# model = Model(path="artifact_downloads/outputs/model.pkl", name=model_name)

registered_model = ml_client.models.create_or_update(model)

print(registered_model.id)

##### 6.3.3 Deploy the registered model

In [None]:
deployment = ManagedOnlineDeployment(
    name="hardware-performance-deploy",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="Standard_DS3_V2",
    instance_count=1,
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        timeout=2,
        period=10,
        initial_delay=2000,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=10,
        success_threshold=1,
        timeout=10,
        period=10,
        initial_delay=2000,
    ),
)

In [None]:
ml_client.online_deployments.begin_create_or_update(deployment).result()

In [None]:
# deployment to take 100% traffic
endpoint.traffic = {"hardware-performance-deploy": 100}
ml_client.begin_create_or_update(endpoint).result()

##### 6.3.4 Test the deployment

In [None]:
# Test the deployment with some sample data
import json
import pandas as pd

test_data = pd.read_csv("./data/training-mltable-folder/data.csv")

# Drop the label column so that only the features are sent
test_data = test_data.drop(target_column_name, axis=1)

# Wrap the records in the {"input_data": {"data": [...]}} request format
records = json.loads(test_data.to_json(orient="records"))
data = json.dumps({"input_data": {"data": records}}, indent=4)

request_file_name = "sample-request-hardware-performance.json"

with open(request_file_name, "w") as request_file:
    request_file.write(data)

# Invoke the deployment with the request file
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="hardware-performance-deploy",
    request_file=request_file_name,
)
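
The request file written above wraps a list of records in an `input_data` object. A minimal standalone sketch of that payload shape, using only the standard library, is shown below. The helper name and the sample values are assumptions for illustration. Only the `input_data`/`data` structure is taken from the cell above.

```python
import json


def build_request_body(records):
    """Wrap a list of feature records in the request structure used above."""
    return json.dumps({"input_data": {"data": records}}, indent=4)


# One made-up record using a few of the dataset's feature columns
sample_request_records = [{"CACH": 256, "CHMIN": 16, "PRP": 198}]
body = build_request_body(sample_request_records)
print(body)
```

Building the body with `json.dumps` rather than string concatenation keeps the file valid JSON even when values contain quotes or other special characters.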

In [None]:
# Get the details for online endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

##### 6.3.5 Delete the deployment and endpoint

In [None]:
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

## Next Step: Load the best model and try predictions

Loading the model locally assumes that you are running the notebook in an environment compatible with the model. The list of dependencies expected by the model is specified in the MLflow model produced by AutoML (in the 'conda.yaml' file within the mlflow-model folder).

Since the AutoML model was trained remotely, in an environment whose dependencies differ from those of the local conda environment where you are running this notebook, you have several options for loading the model:

1. A recommended way to locally load the model in memory and try predictions is to create a new/clean conda environment with the dependencies specified in the conda.yaml file within the MLflow model's folder, then use MLflow to load the model and call .predict() as explained in the notebook **mlflow-model-local-inference-test.ipynb** in this same folder.

2. You can install all the packages/dependencies specified in conda.yaml into the conda environment you currently use for the Azure ML SDK and AutoML. The MLflow SDK also has a method to install the dependencies in the current environment. However, this option carries a risk of package version conflicts, depending on what is installed in your current environment.

3. You can also use: mlflow models serve -m 'xxxxxxx'