# Tutorial: Azure Machine Learning Day 1

[!INCLUDE [sdk v2](../../includes/machine-learning-sdk-v2.md)]

In this tutorial, you learn how to use automated machine learning (AutoML) in Azure Machine Learning to train a model. AutoML reduces the manual effort of finding a model that fits your data: it cycles through multiple algorithms and sets of hyperparameters and reports which combination performs best. AutoML supports several task types: classification, regression, and forecasting. This example uses the associated credit card dataset to show how you can use AutoML for a classification problem. The goal is to predict whether a credit card transaction is fraudulent.

You'll also see how to save time by letting AutoML train multiple models on your data, and then choose the model you want to develop further.

The steps you'll take are:

> [!div class="checklist"]
> * Connect to your Azure ML workspace
> * Create your compute resource and job environment
> * Create an MLTable Data Asset
> * Create a Compute Target 
> * Create and run your AutoML Job 
> * View the results from the Best AutoML Run
> * Call the Azure ML endpoint for inferencing


## Prerequisites

* Complete the [Quickstart: Get started with Azure Machine Learning](quickstart-create-resources.md) to:
    * Create a workspace.
    * Create a cloud-based compute instance to use for your development environment.

* Create a new notebook or copy our notebook.
    * Follow the [Quickstart: Run Jupyter notebook in Azure Machine Learning studio](quickstart-run-notebooks.md) steps to create a new notebook.
    * Or use the steps in the quickstart to [clone the v2 tutorials folder](quickstart-run-notebooks.md#learn-from-sample-notebooks), then open the notebook from the **tutorials/azureml-in-a-day/azureml-in-a-day.ipynb** folder in your **File** section.

## Run your notebook

1. On the top bar, select the compute instance you created during the  [Quickstart: Get started with Azure Machine Learning](quickstart-create-resources.md)  to use for running the notebook.

2. Make sure that the kernel, found on the top right, is `Python 3.10 - SDK v2`.  If not, use the dropdown to select this kernel.


> [!Important]
> The rest of this tutorial contains cells of the tutorial notebook.  Copy/paste them into your new notebook, or switch to the notebook now if you cloned it.
>
> To run a single code cell in a notebook, click the code cell and hit **Shift+Enter**. Or, run the entire notebook by choosing **Run all** from the top toolbar.

## Download required packages

Before you dive into the code, run the following cell to install the required packages.

In [None]:
# Install Azure Machine Learning packages
!pip install --pre azure-ai-ml
!pip install azureml-mlflow
!pip install mlflow

## Connect to Azure Machine Learning Workspace
The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

We're using `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` handles most Azure SDK authentication scenarios.

#### Import the required libraries

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.identity import AzureCliCredential
from azure.ai.ml import MLClient, automl, Input

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.automl import (
    classification,
    ClassificationPrimaryMetrics,
    ClassificationModels,
)

#### Configure workspace details and get a handle to the workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these details to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the default Azure authentication for this tutorial. Check the configuration notebook for more details on how to configure credentials and connect to a workspace.

`MLClient` is part of the Azure Machine Learning Python SDK and provides a way to access the Azure Machine Learning cloud platform. Instead of working in Azure Machine Learning studio, you can use `MLClient` and the Python SDK to create and manage workspaces, train and deploy models, and create and use data assets, all from your local text editor. You can also use `MLClient` for the same purpose inside the studio when you run code in the **Notebooks** tab.

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter the details of your Azure Machine Learning workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
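`MLClient.from_config` in the cell above looks for a `config.json` file that describes the workspace. As a minimal sketch, assuming the usual three-key layout (`subscription_id`, `resource_group`, `workspace_name`), you could create that file yourself so the fallback branch isn't needed. The values below are placeholders, not real identifiers, and the sketch skips writing if a `config.json` already exists.

```python
import json
from pathlib import Path

# Placeholder values (assumptions for illustration); replace with your own.
workspace_config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<AML_WORKSPACE_NAME>",
}

config_path = Path("config.json")

# Avoid overwriting a config.json that is already in place.
if not config_path.exists():
    config_path.write_text(json.dumps(workspace_config, indent=2))

# Read the file back to confirm it is valid JSON with the expected keys.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))
```

On a compute instance a `config.json` is typically already available, in which case this step is unnecessary.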

#### Show Azure Machine Learning Workspace Information

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

subscription_id = ml_client.subscription_id
resource_group = workspace.resource_group
workspace_name = ml_client.workspace_name

output = {}
output["Workspace"] = workspace_name
output["Subscription ID"] = subscription_id
output["Resource Group"] = resource_group
output["Location"] = workspace.location
output

## MLTable with input Training Data

**Create MLTable data input:**
AutoML requires your input data to be in MLTable format. This example explains how to create an MLTable asset from a local notebook. The process is similar if you are following along in a notebook in Azure Machine Learning studio.

1. Create a folder with the name of your dataset. Example: `demo-dataset-mltable-folder`
1. Name your data file and upload it in .csv format to the folder you created in step 1. Example: `demo-dataset.csv`
1. Create a YAML file named `MLTable` (with no file extension) in the folder you created in step 1.
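The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tutorial's official asset: it assumes `demo-dataset.csv` is comma-delimited with a header row, and it writes the definition file under the name `MLTable`, which is the file name the MLTable format looks for in the folder. The folder and file names are the example names from the steps.

```python
from pathlib import Path

# Folder and file names follow the examples in the steps above.
dataset_folder = Path("demo-dataset-mltable-folder")
dataset_folder.mkdir(parents=True, exist_ok=True)

# Assumption: demo-dataset.csv is comma-delimited and has a header row.
mltable_definition = """\
paths:
  - file: ./demo-dataset.csv
transformations:
  - read_delimited:
      delimiter: ","
      encoding: utf8
      header: all_files_same_headers
"""

# The definition file sits next to the data file inside the folder.
mltable_path = dataset_folder / "MLTable"
mltable_path.write_text(mltable_definition)
print(mltable_path.read_text())
```

Copy your .csv file into the same folder before you point an `Input` at it.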

This folder serves as the dataset for this job. We use the path to the folder to train our model.
First, we create a data input that points to the folder. Your `path` is whatever you named your folder in the previous steps. If you used the example name, your path is `./demo-dataset-mltable-folder`. The following cell uses `./data/training-mltable-folder`, so replace that value with your own folder path.

In [None]:
# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)

# WITH REMOTE PATH: If available already in the cloud/workspace-blob-store
# my_training_data_input  = Input(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/Classification/Train")

## Compute target setup

**Create or Attach existing AmlCompute**
A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

**Creation of AmlCompute takes approximately 5 minutes.**
If an AmlCompute with that name already exists in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (for example, AmlCompute) associated with the Azure Machine Learning service. Read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) for the default limits and how to request more quota.

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "cpu-cluster"

try:
    _ = ml_client.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="STANDARD_DS12_V2",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=6,
    )
    ml_client.begin_create_or_update(compute_config).result()

## Configure and run the AutoML classification job
In this section we will configure and run the AutoML classification job.

**Configure the job through the classification() factory function**

*classification() parameters:*

The `classification()` factory function lets you configure AutoML for the classification task in the most common scenarios, using the following properties.

- `target_column_name` - The name of the column to predict. It must always be specified. This parameter applies to `training_data`, `validation_data`, and `test_data`.
- `primary_metric` - The metric that AutoML optimizes for classification model selection.
- `training_data` - The data to be used for training. It should contain both the training feature columns and the target column. Optionally, this data can be split to set aside a validation or test dataset.
  You can use an MLTable registered in the workspace, referenced as `azureml:<mltable_name>:<version>`, or a local file or folder as an MLTable. For example, `Input(type=AssetTypes.MLTABLE, path="azureml:my_mltable:1")` or `Input(type=AssetTypes.MLTABLE, path="./data")`.
  The `training_data` parameter must always be provided.
- `compute` - The compute on which the AutoML job runs. This example uses a compute called `cpu-cluster` that is present in the workspace. You can replace it with any other compute in the workspace.
- `name` - The name of the job/run. This is an optional property. If not specified, a random name is generated.
- `experiment_name` - The name of the experiment. An experiment is like a folder in the Azure ML workspace that holds multiple runs belonging to the same logical machine learning experiment.

*set_limits() function parameters:*

This optional method configures limit parameters such as timeouts.
    
- `timeout_minutes` - Maximum amount of time in minutes that the whole AutoML job can take before the job terminates. This timeout includes setup, featurization, and training runs, but it does not include the ensembling and model explainability runs at the end of the process, because those actions can only happen once all the trials (child jobs) are done. If not specified, the job's default total timeout is 6 days (8,640 minutes). To specify a timeout of 1 hour (60 minutes) or less, make sure your dataset's size is not greater than 10,000,000 (rows times columns), or an error results.

- `trial_timeout_minutes` - Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
    
- `max_trials` - The maximum number of trials/runs each with a different combination of algorithm and hyperparameters to try during an AutoML job. If not specified, the default is 1000 trials. If using 'enable_early_termination' the number of trials used can be smaller.
    
- `max_concurrent_trials` - The maximum number of trials (child jobs) that are executed in parallel. It's a good practice to match this number with the number of nodes in your cluster.
    
- `enable_early_termination` - Whether to enable early termination if the score is not improving in the short term. 
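To get a feel for how these limits interact, here is a rough back-of-the-envelope sketch. It is not an AutoML API: `estimate_trial_minutes` and `short_timeout_allowed` are hypothetical helpers. The estimate assumes trials run in parallel up to `max_concurrent_trials` and that every trial uses its full `trial_timeout_minutes`, so it is a worst-case bound on trial time only. It ignores setup, ensembling, and explainability. The values 5, 20, and 600 match the job configured later in this tutorial, the concurrency of 4 comes from the commented-out example there, and the dataset shape is made up.

```python
import math


def estimate_trial_minutes(
    max_trials, max_concurrent_trials, trial_timeout_minutes, timeout_minutes
):
    """Worst-case minutes spent running trials, capped by the job timeout."""
    # Each worker handles at most ceil(max_trials / concurrency) trials in turn.
    rounds = math.ceil(max_trials / max_concurrent_trials)
    return min(timeout_minutes, rounds * trial_timeout_minutes)


def short_timeout_allowed(rows, columns, limit=10_000_000):
    """Check the rows-times-columns size rule for timeouts of 60 minutes or less."""
    return rows * columns <= limit


# 5 trials, 4 at a time, 20 minutes each, 600-minute job timeout.
print(estimate_trial_minutes(5, 4, 20, 600))

# Hypothetical dataset shape: 250,000 rows by 30 columns.
print(short_timeout_allowed(250_000, 30))
```

With these inputs the trials need two rounds, so the bound is 40 minutes, well under the 600-minute job timeout.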
    

Since AutoML runs multiple instances of each model with different combinations of hyperparameters, it is important to set `max_trials` so that AutoML stops once it has tried a bounded number of model and parameter combinations. While we are only looking for a baseline model and set of hyperparameters, we set `max_trials` to 5. If you want a more accurate model, you can set a higher number of trials, but this increases the job's training time. Once we have found a model and a set of hyperparameters we like, we can improve on that model manually or by increasing the number of trials.

In [None]:
# General job parameters
max_trials = 5
exp_name = "dpv2-classifier-experiment"

# Note: This job can take 15 minutes or more

In [None]:
# Create the AutoML classification job with the related factory-function.

classification_job = automl.classification(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials=4,
    # max_cores_per_trial=-1,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=[ClassificationModels.LOGISTIC_REGRESSION],
    enable_onnx_compatible_models=True,
)

**Submit the job**
Using the `MLClient` created earlier, we now submit this AutoML job to the workspace.

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

## Get the Azure Machine Learning studio link to the models AutoML ran
`ml_client.jobs.stream(returned_job.name)` waits until the specified job is finished.
This call can take a while (15 minutes or more) and outputs a link to Azure Machine Learning studio, where you can view the models that AutoML trained on the input data. The list also includes the accuracy scores and other metrics for each model.

In [None]:
ml_client.jobs.stream(returned_job.name)

In [None]:
# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

In [None]:
print(returned_job.name)

## Retrieve the Best Trial (Best Model's trial/run)
Use the `MlflowClient` to access the results (such as models, artifacts, and metrics) of a previously completed AutoML trial.

Since AutoML trains several models with different hyperparameters, we need a way to keep track of each model's results. The MLflow package lets you track the metrics and results of each model that AutoML trains. We use MLflow to get the parent run of the AutoML job, and then use the parent run's tags to find the best child run, which holds the best model for our data.

**Initialize MLflow Client:**
The models and artifacts that AutoML produces can be accessed through the MLflow interface. Initialize the MLflow client here, and set Azure ML as the backend through the tracking URI.

**Obtain the tracking URI for MLFlow:**

In [None]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

**Get AutoML Parent Job:**

In [None]:
job_name = returned_job.name

# Example if providing a specific Job name/ID
# job_name = "b4e95546-0aa1-448e-9ad6-002e3207b4fc"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

In [None]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

**Get best child run:**

In [None]:
# Get the best model's child run

best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)
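The direct dictionary lookup above raises a bare `KeyError` if the `automl_best_child_run_id` tag is missing. A small defensive wrapper can make that failure easier to diagnose. This is a sketch with a hypothetical helper name. The claim that the tag might be absent, for example while the parent job is still running, is an assumption and is not confirmed by this tutorial. The example dictionary uses a made-up run ID in place of `mlflow_parent_run.data.tags`.

```python
BEST_CHILD_TAG = "automl_best_child_run_id"


def get_best_child_run_id(tags):
    """Return the best child run ID from a parent run's tags.

    Raises a KeyError with a descriptive message when the tag is missing
    or empty, instead of the bare KeyError from a direct lookup.
    """
    run_id = tags.get(BEST_CHILD_TAG)
    if not run_id:
        raise KeyError(
            f"Tag '{BEST_CHILD_TAG}' not found. "
            "Check that the parent AutoML job has completed."
        )
    return run_id


# Stand-in for mlflow_parent_run.data.tags; the run ID is made up.
example_tags = {BEST_CHILD_TAG: "example-parent-run_3"}
print(get_best_child_run_id(example_tags))
```

In the notebook you would call `get_best_child_run_id(mlflow_parent_run.data.tags)` instead.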

**Get best model run's metric:**

In [None]:
best_run.data.metrics
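`best_run.data.metrics` is a plain dictionary that maps metric names to numeric values, so it can be hard to scan when many metrics are logged. As a minimal sketch, the hypothetical helper below prints the metrics sorted by name in aligned columns. The example values are made up for illustration and are not results from this tutorial's job.

```python
def format_metrics(metrics, digits=4):
    """Return one line per metric, sorted by name, with aligned values."""
    width = max((len(name) for name in metrics), default=0)
    lines = []
    for name in sorted(metrics):
        lines.append(f"{name:<{width}}  {metrics[name]:.{digits}f}")
    return "\n".join(lines)


# Stand-in for best_run.data.metrics; the values are made up.
example_metrics = {"accuracy": 0.9, "AUC_weighted": 0.95}
print(format_metrics(example_metrics))
```

In the notebook you would call `print(format_metrics(best_run.data.metrics))`.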

## Make improvements to the model