# Day 1: Training a model



Learn how a data scientist uses Azure Machine Learning (Azure ML) to train a classification model. This tutorial will help you become familiar with the core concepts of Azure ML and their most common usage: training a model. In this example, we use the associated credit card dataset to show how you can train a classifier with your own training script. The goal is to predict whether a customer has a high likelihood of defaulting on a credit card payment. Before you start, make sure you have completed the prerequisite tutorial for prototyping. [link-to-notebook]

This article is paired with a ready-to-run Python notebook [link-to-notebook] that you can use to train a model from Azure Machine Learning studio or your local machine. To train a model, you need to submit a *job*. In this tutorial, you'll learn how to submit a *command job* to run your *training script* on a specified *compute resource*, configured with the *job environment* necessary to run the script. Submitting a command job lets you run a custom training script for your model.

The training script handles the data preparation, then trains and registers a model. This tutorial takes you through the steps to submit a cloud-based training job (command job). If you would like to learn more about how to load your data into Azure, please follow this link.


The steps you'll take are:

> * Connect to your Azure ML workspace
> * Create your compute resource and job environment
> * Create your training script
> * Create and run your command job to run the training script on the compute resource, configured with the appropriate job environment and the data source
> * View the output of your training script

## Prerequisites

* An Azure subscription. If you don't have an Azure subscription, [create a free account](https://aka.ms/AMLFree) before you begin.
* A working Azure ML workspace. A workspace can be created via Azure Portal, Azure CLI, or Python SDK. [Read more](https://docs.microsoft.com/azure/machine-learning/how-to-manage-workspace?tabs=python).
* A workspace and compute instance, which you can create by completing the [Quickstart: Get started with Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/quickstart-create-resources#create-compute-instance)

## Different ways to train models in Azure Machine Learning

Azure Machine Learning offers several ways to train models. You can choose a training method based on the complexity of the model, the size of the data, and your training speed requirements. Here are some of the ways to train in Azure Machine Learning:

1. Command job: A command job runs a script or command in a specified environment, which lets you train a model with your own training script. This is also called a custom training job. You can use command jobs to train models, process data, or run any other custom code in the cloud.
1. AutoML: AutoML reduces the time a data scientist spends finding the model that works best with their data. Instead of rewriting a training script for each model, AutoML tries each model automatically and tunes each model's hyperparameters to help ensure its accuracy. After AutoML has found a model you're happy with, you can keep improving it by tweaking the script or continuing hyperparameter tuning.
1. GitHub examples: The Azure Machine Learning examples repo contains completed tutorials paired with Python notebooks, so you can run the code as you learn to train a model. You can modify and run existing scripts from the repo, which covers scenarios including classification, natural language processing, and anomaly detection.


In this tutorial, we will focus on using a command job to create a custom training job that trains a model. Any custom training job requires the following items:
> * compute 
> * environment
> * data
> * command job 
> * training script

In this tutorial we will provide all items listed above for the purpose of our example: creating a classifier to predict customers who have a high likelihood of defaulting on credit card payments.


## Connect to the workspace

Before you dive into the code, you'll need to connect to your Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

You'll use `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` handles most Azure SDK authentication scenarios.

If it doesn't work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/python/api/azure-identity/azure.identity?view=azure-python).

To connect your code to Azure ML, you'll use `MLClient`. With `MLClient`, code that runs locally or in Azure Machine Learning studio can work with the resources in your cloud workspace.

In [3]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()


If you want to use a browser to log in and authenticate, you can use the following code instead. In this example, you'll use the `DefaultAzureCredential`.

In [2]:
# Handle to the workspace
# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.
1. Copy the value for workspace, resource group and subscription ID into the code.  
1. You'll need to copy one value, close the area and paste, then come back for the next one.

In [4]:
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)


The result is a handle to the workspace that you'll use to manage other resources and jobs.
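
If you'd rather not hard-code the workspace identifiers in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper `load_workspace_settings` and the variable names `AML_SUBSCRIPTION_ID`, `AML_RESOURCE_GROUP_NAME` and `AML_WORKSPACE_NAME` are assumptions made for this example, not names that Azure ML defines.

```python
import os

# Keyword arguments that MLClient takes to identify a workspace
REQUIRED_KEYS = ("subscription_id", "resource_group_name", "workspace_name")


def load_workspace_settings(env=None):
    """Collect workspace identifiers from environment variables.

    The AML_* variable names are hypothetical and chosen for this sketch.
    """
    env = os.environ if env is None else env
    settings = {key: env.get(f"AML_{key.upper()}", "") for key in REQUIRED_KEYS}
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing workspace settings: {', '.join(missing)}")
    return settings


# Example with an explicit mapping instead of the real environment
example = load_workspace_settings(
    {
        "AML_SUBSCRIPTION_ID": "<SUBSCRIPTION_ID>",
        "AML_RESOURCE_GROUP_NAME": "<RESOURCE_GROUP>",
        "AML_WORKSPACE_NAME": "<AML_WORKSPACE_NAME>",
    }
)
print(example["workspace_name"])
```

The returned dictionary can then be unpacked into the client, for example `MLClient(credential=credential, **load_workspace_settings())`.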

> [!IMPORTANT]
> Creating MLClient will not connect to the workspace. The client initialization is lazy: it waits until the first time it needs to make a call (in the notebook below, that happens during compute creation).

## Create a compute resource to run your job

You'll need a compute resource to run a job. It can be a single-node or multi-node machine with a Linux or Windows OS, or a specific compute fabric like Spark. In Azure ML, there are two compute resources to choose from: a compute instance and a compute cluster. A compute instance contains one node of computation resources, while a compute cluster contains several. For training, we recommend a compute cluster because it lets you distribute calculations across multiple nodes, which results in a faster training experience.

In Azure, a job can refer to several tasks that Azure allows its users to do: training, pipeline creation, deployment, etc. For this tutorial and our purpose of training a machine learning model we will be using *job* as a reference to running training computations (*training job*).


You'll provision a Linux compute cluster. See the [full list on VM sizes and prices](https://azure.microsoft.com/pricing/details/machine-learning/).

For this example, you only need a basic cluster, so you'll use a Standard_DS3_v2 model with 4 vCPU cores and 14 GB of RAM to create an Azure ML compute cluster.


In [4]:

from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster1 = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds a node keeps running after the job terminates
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's begin_create_or_update method
    # and wait for the returned long-running operation to finish
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster1).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

You already have a cluster named cpu-cluster, we'll reuse it as is.
AMLCompute with name cpu-cluster is created, the compute size is STANDARD_DS3_V2


## Create a job environment

To run your AzureML job on your compute resource, you'll need an [environment](https://docs.microsoft.com/azure/machine-learning/concept-environments). An environment lists the software runtime and libraries that you want installed on the compute where you’ll be training. It's similar to your Python environment on your local machine.

AzureML provides many curated or ready-made environments, which are useful for common training and inference scenarios. You can also create your own custom environments using a Docker image or a conda configuration.

In this example, you'll create a custom conda environment for your jobs, using a conda yaml file.

First, create a directory to store the file in.

In [5]:
import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

Now, create the file in the dependencies directory.

In [6]:
%%writefile {dependencies_dir}/conda.yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - xlrd==2.0.1
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - psutil>=5.8,<5.9
    - tqdm>=4.59,<4.60
    - ipykernel~=6.0
    - matplotlib

Overwriting ./dependencies/conda.yml



The specification contains some common packages that you'll use in your job (numpy, pip).

Reference this *yaml* file to create and register this custom environment in your workspace:

In [7]:
from azure.ai.ml.entities import Environment

custom_env_name = "aml-scikit-learn"

custom_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Credit Card Defaults job",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)

Environment with name aml-scikit-learn is registered to workspace, the environment version is 2
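
The `tags` value above repeats the scikit-learn version that is pinned in *conda.yml*, so the two can drift apart when one of them is edited. As a rough sketch, you could read the exact pins back from the file text and compare. The helper `pinned_versions` is hypothetical and only understands simple `name=version` or `name==version` list entries, not the full conda or pip syntax.

```python
import re

# Matches list entries such as "- numpy=1.21.2" or "- xlrd==2.0.1".
# Range specifiers (">=", "<", "~=") are intentionally not matched.
PIN_PATTERN = re.compile(
    r"^\s*-\s*([A-Za-z0-9_.-]+)(?:\[[^\]]*\])?\s*={1,2}\s*([0-9][\w.]*)\s*$"
)


def pinned_versions(conda_text):
    """Return {package: version} for exactly pinned entries in a conda file."""
    pins = {}
    for line in conda_text.splitlines():
        match = PIN_PATTERN.match(line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


# A shortened sample in the same shape as conda.yml
sample = """
dependencies:
  - python=3.8
  - scikit-learn=0.24.2
  - pandas>=1.1,<1.2
  - pip:
    - xlrd==2.0.1
"""
pins = pinned_versions(sample)
print(pins["scikit-learn"])
```

You could then build the tag from the parsed value, for example `tags={"scikit-learn": pins["scikit-learn"]}`, instead of typing the version twice.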


## Configure a training job using the command function


You'll create an Azure ML *command job* to train a model for credit default prediction. The command job is used to run a *training script* in a specified environment on a specified compute resource. You've already created the environment and the compute resource. Next you'll create the training script. In our specific case, we will train a classifier on our dataset using the `GradientBoostingClassifier` model.

The *training script* handles the data preparation, training and registering of the trained model. The function `train_test_split` splits the dataset into test and training data. In this tutorial, you'll create a Python training script.

Command jobs can be run from the CLI, the Python SDK, or the studio interface. In this tutorial, you'll use the Azure ML Python SDK v2 to create and run the command job.
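
To make the split step concrete, here is a plain-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation: `simple_split` and its `seed` parameter are hypothetical names used only for illustration, and the way it rounds the test-set size is an assumption of this sketch.

```python
import math
import random


def simple_split(rows, test_ratio, seed=None):
    """Shuffle rows and hold out roughly test_ratio of them for testing."""
    if not 0 < test_ratio < 1:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_ratio)
    # The first n_test shuffled rows become the test set, the rest train
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = simple_split(range(20), test_ratio=0.25, seed=0)
print(len(train_rows), len(test_rows))
```

In the training script, the corresponding setting is `--test_train_ratio`, which is passed to `train_test_split` as `test_size`. The script doesn't set `random_state`, so each run produces a different split; passing a fixed `random_state` makes runs repeatable.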

After running the training job, if you need to learn more about deploying the model you can follow the next tutorial in this series.


## Create training script

Let's start by creating the training script - the *main.py* python file.

First create a source folder for the script:

In [8]:
import os

train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)

This script handles the preprocessing of the data, splitting it into test and train data. It then consumes this data to train a tree based model and return the output model. 

[MLflow](https://mlflow.org/docs/latest/tracking.html) will be used to log the parameters and metrics during our job. The MLflow package allows you to keep track of metrics and results for each model Azure trains. We will use MLflow to first get the best model for our data, then view the model's metrics in Azure Machine Learning studio. If you would like to learn more about how MLflow works, [please visit this link](./concept-mlflow). If you would like to learn more about how Azure Machine Learning uses the MLflow model concept to enable deployment workflows, [please visit this link](./concept-mlflow-models.md).

In [9]:
%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()
   
    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    
    credit_df = pd.read_excel(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    #Split train and test datasets
    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    
    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()

Overwriting ./src/main.py
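
To see how a command-line string turns into values inside the script, the sketch below rebuilds the same parser and feeds it a sample argument list. The helper name `build_parser` and the sample values (`credit.xls`, `demo_model`) are made up for this example; the argument names and defaults are copied from *main.py*.

```python
import argparse


def build_parser():
    """Recreate the argument parser used in main.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    return parser


# Parse an explicit list instead of sys.argv so this runs anywhere
args = build_parser().parse_args(
    ["--data", "credit.xls", "--learning_rate", "0.25", "--registered_model_name", "demo_model"]
)
print(vars(args))
```

Arguments that aren't passed, such as `--n_estimators` here, fall back to their defaults.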



In this script, once the model is trained, the model file is saved and registered to the workspace. Registering your model allows you to store and version your models in the Azure cloud, in your workspace. Once you register a model, you can find it alongside all your other registered models in one place in Azure Machine Learning studio, called the model registry. The model registry helps you organize and keep track of your trained models. You'll also be able to use registered models when deploying endpoints, which is covered in the next tutorial in this series. [insert-hyper-link]

## Configure the command


Now that you have a script that can perform the classification task, you'll use the general-purpose **command** to run it. A command can run command-line actions, either by calling system commands directly or by running a script.

Here, you'll create input variables to specify the input data, split ratio, learning rate and registered model name. The command job will:
* Use the compute created earlier to run this command.
* Use the environment created earlier - you can use the `@latest` notation to indicate the latest version of the environment when the command is run.
* Configure some metadata like display name, experiment name etc. An *experiment* is a container for all the iterations you do on a certain project. All the jobs submitted under the same experiment name would be listed next to each other in Azure ML studio.
* Configure the command line action itself - `python main.py` in this case. The inputs/outputs are accessible in the command via the `${{ ... }}` notation.
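
To illustrate the `${{ ... }}` notation, the sketch below mimics the substitution locally with a regular expression. This is only an approximation for intuition: Azure ML resolves the placeholders itself when the job runs, and for a data input such as `inputs.data` the script typically receives a path that the service provides rather than the literal value you typed. The function `render_command` is a hypothetical helper.

```python
import re

# Matches placeholders of the form ${{inputs.<name>}}
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(template, inputs):
    """Replace ${{inputs.<name>}} placeholders with values from inputs."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"No input named '{name}'")
        return str(inputs[name])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python main.py --test_train_ratio ${{inputs.test_train_ratio}} "
    "--learning_rate ${{inputs.learning_rate}}"
)
print(render_command(template, {"test_train_ratio": 0.2, "learning_rate": 0.25}))
```

A placeholder that names an input you didn't define raises an error in this sketch, which is a useful check to run over a long command string before you submit it.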

In [10]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "credit_defaults_model"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}}",
    environment="aml-scikit-learn@latest",
    compute="cpu-cluster",
    experiment_name="train_model_credit_default_prediction",
    display_name="credit_default_prediction",
)

## Submit the job 


It's now time to submit the job to run in AzureML. This time you'll use `create_or_update` on `ml_client.jobs`.

In [11]:
ml_client.jobs.create_or_update(job)

Experiment,Name,Type,Status,Details Page
train_model_credit_default_prediction,jovial_celery_yvzx4cgjdr,command,Starting,Link to Azure Machine Learning studio


## View job output and wait for job completion


*[Reference] You can view the result of a training job by clicking the URL generated after submitting a job. Alternatively, you can also click Jobs on the left navigation menu. A job is a grouping of many runs from a specified script or piece of code. Information for the run is stored under that job.*

*Overview is where you can see the status of the job.*
*Metrics would display different visualizations of the metrics you specified in the script.*
*Images is where you can view any image artifacts that you have logged with MLflow.*
*Child jobs contains child jobs if you added them.*
*Outputs + logs contains log files you need for troubleshooting or other monitoring purposes.*
*Code contains the script/code used in the job.*
*Explanations and Fairness are used to see how your model performs against responsible AI standards. They are currently preview features and require additional package installations.*
*Monitoring is where you can view metrics for the performance of compute resources.*


View the job in Azure ML studio by selecting the link in the output of the previous cell. The output of this job will look like this in Azure ML studio. Explore the tabs for various details like metrics, outputs etc. Once completed, the job will register a model in your workspace as a result of training. 

![Screenshot that shows the job overview](media/view-job.gif "View the job in studio")

> [!IMPORTANT]
> Wait until the status of the job is complete before returning to this notebook to continue. The job will take 2 to 3 minutes to run. It could take longer (up to 10 minutes) if the compute cluster has been scaled down to zero nodes and the custom environment is still building.
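
If you prefer to wait from code rather than watching the studio page, a simple polling loop is one option. The sketch below is generic: `wait_for_job` is a hypothetical helper, `get_status` is any callable you supply that returns the current status string, and the set of terminal status names is an assumption that you should check against the statuses your jobs actually report.

```python
import time

# Assumed names for states in which a job no longer changes
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until a terminal state is reached or time runs out."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned statuses and no real sleeping
fake_statuses = iter(["Starting", "Running", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

With the SDK, `get_status` could be something like `lambda: ml_client.jobs.get(job_name).status`, where `job_name` is the value shown in the Name column of the submission output.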

After you submit the job, the output of the cell above includes a link to the job's details page in Azure Machine Learning studio. Alternatively, you can select Jobs on the left navigation menu. A job is a grouping of many runs from a specified script or piece of code, and information for each run is stored under that job. The details page gives an overview of the job, such as the time it took to run and when it was created. The page also has tabs with other information about the job, such as Metrics, Outputs + logs, and Code. Some of the tabs available on the job's details page are:
> * Overview: Provides basic information about the job, including its status, start and end times, and the type of job that was run.
> * Inputs: Lists the data and code that were used as inputs for the job. This can include datasets, scripts, environment configurations, and other resources that were used during training.
> * Outputs + logs: Contains the logs generated while the job was running. This tab helps with troubleshooting if anything goes wrong with your training script or model creation.
> * Metrics: Shows key performance metrics from your model, such as training score, F1 score, and precision score.
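
For reference, the precision, recall and F1 values that `classification_report` prints for a class come from simple counts of correct and incorrect predictions. The sketch below computes them by hand for a toy label list; `binary_scores` and the sample labels are made up for illustration and are not output from this tutorial's job.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 means "defaults next month", 0 means "does not default"
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_scores(y_true, y_pred))
```

Precision answers "of the customers predicted to default, how many did?", while recall answers "of the customers who defaulted, how many were caught?". F1 is the harmonic mean of the two.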







## Next Steps

Learn about creating a multistep pipeline for this script in [Create production ML pipelines in a Jupyter notebook](https://github.com/Azure/azureml-examples/blob/main/tutorials/e2e-ds-experience/e2e-ml-workflow.ipynb).