# Tutorial: Train your first ML model (Part 2 of 3)

## Introduction

In the [previous tutorial](1.hello-world.ipynb), you ran a trivial "hello world!" script in the cloud using Azure Machine Learning's Python SDK. This time you take it a step further by submitting a script that trains a machine learning model. This example shows how Azure Machine Learning keeps behavior consistent between local debugging (on a compute instance or a laptop) and remote runs.

By the end of this session, you can:

* Use Conda to define an Azure Machine Learning environment
* Train a model in the cloud
* Log metrics to Azure Machine Learning

---

## Your machine learning code

This tutorial shows you how to train a PyTorch model on the CIFAR-10 dataset using an Azure Machine Learning cluster. In this case you use a CPU cluster, but a GPU cluster would work equally well. Although this tutorial uses PyTorch, the steps apply to *any* machine learning code.

The code is based on [this introductory example from PyTorch](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html).

## Write files

In [None]:
%%writefile model.py
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
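
The `16 * 5 * 5` input size of `fc1` follows from the image size and the layer settings. As a rough check, assuming the standard 32x32 CIFAR-10 images and the default stride of 1 and padding of 0 in `nn.Conv2d`, the arithmetic can be sketched in plain Python. The helper `conv_out` is a hypothetical name introduced for this illustration; it is not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                            # assumed CIFAR-10 image height/width
size = conv_out(size, kernel=5)      # conv1: 5x5 kernel
size = conv_out(size, kernel=2, stride=2)  # pool: 2x2 window, stride 2
size = conv_out(size, kernel=5)      # conv2: 5x5 kernel
size = conv_out(size, kernel=2, stride=2)  # pool again

# conv2 has 16 output channels, so the flattened feature count is:
flat_features = 16 * size * size
print(size, flat_features)
```

If you change the kernel sizes or add padding in `model.py`, the same calculation tells you which value to use in `fc1` and in `x.view(...)`.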

In [None]:
%%writefile train.py
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

from model import Net

# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2
)

if __name__ == "__main__":

    # define convolutional network
    net = Net()

    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # train the network
    for epoch in range(2):

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")
                running_loss = 0.0

    print("Finished Training")

In [None]:
%%writefile pytorch.yml
name: an-introduction-tutorial
channels:
  - defaults
  - pytorch
  - anaconda
  - conda-forge
dependencies:
  - python=3.6
  - pip
  - pip:
    - torch
    - torchvision
    - mlflow
    - matplotlib
    - azureml-mlflow
    - azureml-dataprep

## Submit your machine learning code to Azure Machine Learning

The difference between the control script below and the one used to submit "hello world" is that the environment is now created from the Conda dependencies file you wrote earlier.

> <span style="color:purple; font-weight:bold">! NOTE <br>
> The first time you run this script, Azure Machine Learning builds a new Docker image from your PyTorch environment. The whole run can take 5-10 minutes to complete. You can see the Docker build logs in the widget by selecting `20_image_build_log.txt` in the log files dropdown. This image is reused in future runs, which makes them much quicker.</span>


In [None]:
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.widgets import RunDetails

ws = Workspace.from_config()
env = Environment.from_conda_specification(
    name="pytorch-tutorial", file_path="pytorch.yml",
)
exp = Experiment(workspace=ws, name="an-introduction-train-tutorial")
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target="cpu-cluster",
    environment=env,
)

run = exp.submit(src)
RunDetails(run).show()

### Understand the control code

Compared to the control script that submitted the "hello world" example, this control script introduces the following:

| Code | Description |
| --- | --- |
| `env = Environment.from_conda_specification( ...)` | Azure Machine Learning provides the concept of an `Environment` to represent a reproducible, versioned Python environment for running experiments. Here you create it from a Conda dependencies YAML file. |

**There are many ways to create AML environments, including [from a pip requirements.txt](https://docs.microsoft.com/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py&preserve-view=true#from-pip-requirements-name--file-path-), or even [from an existing local Conda environment](https://docs.microsoft.com/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py&preserve-view=true#from-existing-conda-environment-name--conda-environment-name-).**


Once your image is built, select `70_driver_log.txt` to see the output of your training script, which should look like:

```txt
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
...
Files already downloaded and verified
epoch=1, batch= 2000: loss 2.19
...
epoch=2, batch=12000: loss 1.27
Finished Training
```
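
The batch numbers in this log follow from the dataset size and the settings in `train.py`. As a small sketch, assuming the standard 50,000-image CIFAR-10 training split, the reported batches per epoch can be worked out as follows (the variable names are chosen for this illustration only):

```python
train_images = 50_000   # assumed size of the CIFAR-10 training split
batch_size = 4          # from the DataLoader in train.py
report_every = 2000     # from the `i % 2000 == 1999` check in train.py

batches_per_epoch = train_images // batch_size

# Batch numbers (1-based, as printed) at which train.py reports the loss
reported = [
    i + 1
    for i in range(batches_per_epoch)
    if i % report_every == report_every - 1
]

# Batches after the last report are trained on but never printed
unreported_tail = batches_per_epoch - reported[-1]
print(batches_per_epoch, reported, unreported_tail)
```

Under these assumptions the last report in each epoch is at batch 12000, which matches the final loss line in the log above.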

Environments can be registered to a workspace with `env.register(ws)`, allowing them to be easily shared, reused, and versioned. Environments make it easy to reproduce previous results and to collaborate with your team.

Azure Machine Learning also maintains a collection of curated environments. These environments cover common ML scenarios and are backed by cached Docker images. Cached Docker images make the first remote run faster.

In short, using registered environments can save you time! For more details, see the [environments documentation](./how-to-use-environments.md).

In [None]:
run.wait_for_completion(show_output=True)

## Log training metrics

Now that you have a model training in Azure Machine Learning, start tracking some performance metrics.
The current training script prints metrics to the terminal. Azure Machine Learning provides a
mechanism for logging metrics with more functionality. By adding a few lines of code, you gain the ability to visualize metrics in the studio and to compare metrics between multiple runs.

### Machine learning code updates

In the `../../code/train/pytorch/cifar10-cnn` directory, the [train-with-logging.py](../../code/train/pytorch/cifar10-cnn/train-with-logging.py) script differs from `train.py` by two additional lines: an `mlflow` import and a call that logs the loss to Azure Machine Learning studio:

```python
# in train-with-logging.py
import mlflow
...
mlflow.log_metric("loss", loss)
```

Metrics in Azure Machine Learning are:

- Organized by experiment and run, so it's easy to keep track of and compare metrics.
- Equipped with a UI, so you can visualize training performance in the studio or in the notebook widget.
- Designed to scale: you can submit concurrent experiments, and the Azure Machine Learning cluster scales out (up to its maximum node count) to run them in parallel.
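
The logging pattern used in the training script, averaging the loss over a window of batches and logging one value per window, can be sketched without PyTorch or MLflow. This is a minimal illustration, not the tutorial's script: `report_windowed_loss` and its parameters are hypothetical names, and the `log_metric` argument stands in for `mlflow.log_metric`, which accepts a metric name and a value in the same order.

```python
def report_windowed_loss(batch_losses, log_metric, report_every=2000):
    """Log the mean loss of each full window of `report_every` batches."""
    running_loss = 0.0
    for i, batch_loss in enumerate(batch_losses):
        running_loss += batch_loss
        # Same check as the training script: fire on the last batch of a window
        if i % report_every == report_every - 1:
            log_metric("loss", running_loss / report_every)
            running_loss = 0.0


# Stand-in logger that records calls instead of sending them to a tracking server
logged = []


def fake_log_metric(key, value):
    logged.append((key, value))


# Five made-up batch losses with a window of two: the fifth batch is never logged
report_windowed_loss([1.0, 3.0, 2.0, 4.0, 9.0], fake_log_metric, report_every=2)
print(logged)
```

Passing the logger as an argument keeps the loop easy to test locally; in a real run you would pass `mlflow.log_metric` instead of the stand-in.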

### Submit your machine learning code to Azure Machine Learning
Submit your code once more. This time the widget includes metrics, so you can see live updates of the model's training loss!

In [None]:
%%writefile train-with-logging.py
import mlflow
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

from model import Net

# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2
)

if __name__ == "__main__":

    # define convolutional network
    net = Net()

    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # train the network
    for epoch in range(2):

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                # ADDITIONAL CODE: log loss metric to AML
                mlflow.log_metric("loss", loss)
                print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")
                running_loss = 0.0

    print("Finished Training")

In [None]:
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.widgets import RunDetails

ws = Workspace.from_config()
env = Environment.from_conda_specification(
    name="pytorch-env-tutorial", file_path="pytorch.yml",
)
exp = Experiment(
    workspace=ws, name="an-introduction-train-with-logging-tutorial"
)
src = ScriptRunConfig(
    source_directory=".",
    script="train-with-logging.py",
    compute_target="cpu-cluster",
    environment=env,
)

run = exp.submit(src)
RunDetails(run).show()

## Next steps

In this session, you upgraded from a basic "hello world!" script to a more realistic
training script that required a specific Python environment to run. You saw how
to take a local Conda environment to the cloud with Azure Machine Learning Environments. Finally, you
saw how in a few lines of code you can log metrics to Azure Machine Learning.

In the next session, you'll see how to work with data in Azure Machine Learning by uploading the CIFAR-10
dataset to Azure.

[Tutorial: Bring your own data](3.pytorch-model-cloud-data.ipynb)


In [None]:
run.wait_for_completion(show_output=True)