# Chapter 11. Machine Learning in the Cloud

[*Applied Machine Learning for Health and Fitness*](https://www.apress.com/9781484257715) by Kevin Ashley (Apress, 2020).

[*Video Course*](http://ai-learning.vhx.tv) Need a deep dive? Watch my [*video course*](http://ai-learning.vhx.tv) that complements this book with additional examples and video-walkthroughs. 

[*Web Site*](http://activefitness.ai) for research and supplemental materials.

## Overview

![](images/ch11/fig_11-1.png)

The bulk of machine learning compute today happens in data centers. As a data scientist, you may have started your research on your local computer, playing with various models, frameworks and datasets, but there's a good chance that once people start using your project, your experiments will need resources that the cloud provides. This chapter walks through examples of how to deploy your data science project to the cloud, store data, train your models and ultimately give your customers access to the predictions the model provides.

## Containers

![](images/ch11/fig_11-2.png)

At the beginning of this book, when we touched on the tools used by data scientists, we discussed virtual environments. A virtual environment makes it easy to isolate the tools and libraries needed by one experiment from those needed by another. Such an environment typically includes a list of libraries and dependencies, making it relatively easy to replicate the set of components for your project. Containers take virtual environments even further. With containers you can package your data science experiment and models together with all supporting components and, when you are ready, distribute everything to a data center or turn it into a cloud service. Even if you don't use containers explicitly, most cloud services today use them behind the scenes to isolate and package your resources.

As a data scientist, you are familiar with various ways to get Python on your system. What's the Docker way to get it up and running? This one-liner runs Python in a new Docker container on your system:

```bash
docker run -it python:3.7 python
```

The magic here is that even if this version of Python is not installed on your system, Docker pulls it from the online repository of images, creates a container and starts a Python shell in that container:

```
Unable to find image 'python:3.7' locally
3.7: Pulling from library/python
Status: Downloaded newer image for python:3.7
Python 3.7.7 (default, Mar 11 2020, 00:27:03)
GCC 8.3.0 on linux
```

Earlier in the book we used lots of Jupyter notebooks. But what if your notebook requires a set of components that you need to configure in a different environment? Conveniently, you can also run notebooks from a docker container, and the Jupyter team has provided a set of public images you can start with:

```bash
docker run -p 8888:8888 jupyter/scipy-notebook
```

When this container starts, you should be able to connect to a fully functional notebook through the web browser, using the URL (including its access token) that the container prints to the console. Docker containers have become the de facto standard for packaging and distributing applications. A template with a set of instructions that defines your package is called an *image*, and you can create your own image with a *Dockerfile*. A container is essentially a running instance of an image. As a practicing data scientist working with the cloud, you may occasionally need to wrap your model into a container.
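To make the idea of an image definition more concrete, here is a minimal sketch that renders the text of a small Dockerfile from Python. The helper name `render_dockerfile`, the script name `score.py` and the package list are assumptions made for illustration; they are not part of the book's project:

```python
def render_dockerfile(base_image, pip_packages, entry_script):
    """Return the text of a small Dockerfile (illustrative helper)."""
    lines = [
        f"FROM {base_image}",   # start from a public base image
        "WORKDIR /app",         # working directory inside the image
        "COPY . /app",          # copy the project files into the image
    ]
    if pip_packages:
        lines.append("RUN pip install --no-cache-dir " + " ".join(pip_packages))
    # command executed when a container is started from the image
    lines.append(f'CMD ["python", "{entry_script}"]')
    return "\n".join(lines) + "\n"


dockerfile_text = render_dockerfile("python:3.7", ["torch", "torchvision"], "score.py")
print(dockerfile_text)
```

Saving this text as a file named `Dockerfile` and running `docker build -t my-model .` in the same folder would build the image; the tag `my-model` is again only a placeholder.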

Notebooks in the Cloud
======================

![](images/ch11/fig_11-3.png)

Running Jupyter notebooks on your local machine is not the only way to run your data science experiments. Most major cloud vendors provide great (and often free to start) services that can get you started quickly with notebooks. Some of the major notebook services available today include Microsoft Azure Notebooks, Google Colaboratory and Amazon SageMaker. Beyond offering basic Jupyter notebooks, they often provide tools that simplify managing machine learning workflows and getting and processing data.

We'll start by creating a free notebook using Azure Machine Learning and connecting to the workspace environment:

```python
import azureml.core
azureml.core.VERSION
```

```
'1.1.5'
```

```python
from azureml.core import Workspace

# Load the workspace described by the config.json configuration file
workspace = Workspace.from_config()
```

This Python snippet takes advantage of the Azure Machine Learning SDK for Python, a set of classes and methods that simplifies working with objects in the workspace. If you want to connect to the cloud workspace from a local Jupyter notebook, export the workspace configuration file `config.json` and place it in the directory where you run your local notebook. The `Workspace.from_config()` call will then use your local configuration file to connect to your cloud environment.
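As a rough illustration of what that configuration file contains, the sketch below writes a `config.json` with the three fields the SDK reads: `subscription_id`, `resource_group` and `workspace_name`. The values are placeholders, and in practice you would usually download the file from the workspace page in the Azure portal rather than write it by hand:

```python
import json
import os
import tempfile

# Placeholder values: replace them with the details of your own workspace
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder here; for real use, place the file next to
# your notebook so that Workspace.from_config() can find it
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read the file back to confirm it is valid JSON
with open(config_path) as f:
    loaded = json.load(f)
print(sorted(loaded.keys()))
```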

Data in the Cloud
=================

One of the most important advantages of developing your machine learning experiments in the cloud is the use of cloud-based storage. Storage in the cloud is a relatively inexpensive resource that you can easily scale, with added benefits of security, durability and high availability across different geographical regions and accessibility anywhere in the world from many languages and platforms.

## Project: Using cloud storage for machine learning

In the previous chapters, when we discussed various ways to capture data from athletes, I mentioned IMUs: inertial measurement units that aggregate data from several sensors (accelerometers, gyroscopes, magnetometers) to provide accurate, high-frequency information about an athlete's movements. To illustrate the use of this data in a cloud machine learning environment, I provided a motion capture file of a high-level skier performing slalom turns. This data was captured with a high-quality Xsens mocap suit that combines multiple sensors.

In addition to storing data in the cloud, as a data scientist you are likely to spend some time parsing the file and getting it into various models for training and further processing. In the example below, we'll load the output from such a set of IMUs into the cloud-based machine learning workspace. The following code snippet takes a comma-separated file containing center of mass data and creates a tabular dataset, registering it in the cloud workspace:

![](images/ch11/fig_11-4.png)

```python
import os
from azureml.core import Workspace, Datastore, Dataset

# Upload the CSV file to the default datastore of the workspace
datastore = workspace.get_default_datastore()
source_dir = os.getcwd()
store_path = 'center_of_mass'

datastore.upload_files(
    files=[os.path.join(source_dir, f) for f in ['skier_center_of_mass.csv']],
    relative_root=source_dir,
    target_path=store_path,
    overwrite=True)

# Create a tabular dataset from the uploaded file and register it in the workspace
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, store_path))
dataset = dataset.register(workspace=workspace,
                           name='center_of_mass',
                           description='skier center of mass')
```
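Before uploading a file like this, it can help to check its contents locally. The sketch below parses a few rows of center of mass data with Python's `csv` module. The column names (`time`, `x`, `y`, `z`) and the sample values are made up for illustration; the actual file may use a different layout:

```python
import csv
import io

# Hypothetical sample: the real skier_center_of_mass.csv may use other columns
sample = """time,x,y,z
0.00,0.10,0.95,0.02
0.01,0.12,0.94,0.03
0.02,0.15,0.92,0.05
"""

def read_center_of_mass(text_file):
    """Parse rows into dictionaries of floats keyed by column name."""
    reader = csv.DictReader(text_file)
    return [{name: float(value) for name, value in row.items()} for row in reader]

rows = read_center_of_mass(io.StringIO(sample))

# Vertical range of the center of mass over the sample
y_values = [row["y"] for row in rows]
print(len(rows), min(y_values), max(y_values))
```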


Labeling data in the cloud
==========================

For machine learning tasks like classification and object detection, you'll often need to label data for training. In the previous chapters, when we discussed deep computer vision and classification, you used a labeled dataset of different sport activities and a pretrained model to classify activities. If you recall, we had two classes of actions, 'tennis' and 'surfing', and we trained our model with a set of pre-classified images. We also used transfer learning from a pretrained model, which reduced the number of images we needed to supply to the model. Labeling often has to be done by a distributed team working through thousands of images, and this is where the cloud comes in very handy.

Once the data is labeled, it can be exported in COCO (Common Objects in Context) format, the standard we used earlier in the book to store joint information when we experimented with human body poses. The COCO data format is frequently used for object and keypoint detection, segmentation and captioning.
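For reference, a COCO file is a JSON document built around three main lists: `images`, `annotations` and `categories`. The sketch below builds a tiny example by hand and looks up the category names for each image. The file names, ids and box coordinates are invented for illustration, and a real export from a labeling project may carry additional or different fields:

```python
import json

# A hand-written, minimal COCO-style document (all values are illustrative)
coco = {
    "images": [
        {"id": 1, "file_name": "surfing_001.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "tennis_001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "surfing"},
        {"id": 2, "name": "tennis"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 60, 200, 150]},
        {"id": 11, "image_id": 2, "category_id": 2, "bbox": [120, 40, 180, 300]},
    ],
}

def labels_by_file(doc):
    """Map each image file name to the list of category names annotated on it."""
    names = {c["id"]: c["name"] for c in doc["categories"]}
    files = {i["id"]: i["file_name"] for i in doc["images"]}
    result = {}
    for ann in doc["annotations"]:
        result.setdefault(files[ann["image_id"]], []).append(names[ann["category_id"]])
    return result

# Round-trip through JSON to mimic reading the document from a file
labels = labels_by_file(json.loads(json.dumps(coco)))
print(labels)
```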

![Using cloud based labeling project for activity classification](images/ch11/fig_11-6.png)

## Project: Training a classification model on a labeled dataset in the cloud

Let's use the dataset we just labeled in the cloud to train our activity classification model. You may wonder at this point where the labeled dataset lives in the cloud and how to get access to it. The following code snippet obtains the dataset from the workspace in the cloud, then loads it as a pandas dataframe:

```python
# initial experiment configuration
experiment_name = 'activity-classification'
script_folder = 'activity-classification'
cluster_name = "compute-experiments"
model_file_name = 'activities.pkl'
labeled_dataset_name = 'Classifying activities-2020-03-15 00:54:26'
output_folder = './outputs'
local_download_folder = './download/'
```

```python
from azureml.core import Dataset
from azureml.contrib.dataset import FileHandlingOption

# Get the labeled dataset by name and download the image files locally
dataset = Dataset.get_by_name(workspace, name=labeled_dataset_name)
dataset_pd = dataset.to_pandas_dataframe(
    file_handling_option=FileHandlingOption.DOWNLOAD,
    target_path=local_download_folder,
    overwrite_download=True)
dataset_pd
```

Note that although the labeled dataset contains URLs to images, you can also download the image files locally by using `FileHandlingOption.DOWNLOAD`. Once the data is labeled and exported, you can visualize it using standard Python libraries. Since this is a labeled dataset, each image now includes a label of the activity ('surfing' or 'tennis'):

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg  # needed for mpimg.imread below

fig = plt.figure(figsize=(15, 15))
plt.subplots_adjust(hspace=0.001)
columns = 2
rows = 2
# show a 2 x 2 grid of labeled images (dataframe rows 6 to 9)
for i in range(1, columns*rows + 1):
    img = mpimg.imread(dataset_pd.loc[i+5, 'image_url'])
    ax = fig.add_subplot(rows, columns, i)
    ax.title.set_text(dataset_pd.loc[i+5, 'label'])
    ax.axis('off')
    plt.imshow(img)
plt.show()
```

![Displaying labeled images in Python](images/ch11/fig_11-7.png)
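Before training, it is also worth checking how balanced the classes are. Here is a minimal sketch that counts labels with `collections.Counter`. The list of records stands in for rows of the labeled dataframe, and its values are invented for illustration:

```python
from collections import Counter

# Stand-in for the rows of the labeled dataset (values are illustrative)
records = [
    {"image_url": "surfing_001.jpg", "label": "surfing"},
    {"image_url": "surfing_002.jpg", "label": "surfing"},
    {"image_url": "tennis_001.jpg", "label": "tennis"},
]

# Count how many images carry each label
label_counts = Counter(record["label"] for record in records)
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / len(records):.0%})")
```

With the pandas dataframe loaded earlier, `dataset_pd['label'].value_counts()` should give a similar summary, assuming the label column holds plain strings.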

In the earlier chapters, we used PyTorch's torchvision to load our dataset locally. Conveniently, our cloud-labeled dataset can be easily converted to a torchvision dataset, containing torch tensors:

```python
from torchvision.transforms import functional as F

# Convert the labeled dataset to a torchvision dataset of (image, label) pairs
pytorch_dataset = dataset.to_torchvision()
img = pytorch_dataset[0][0]
print(type(img))
```

Preparing for training
======================

Before we start training our model, we need to tell the cloud where the model is trained by specifying a compute target: a VM or a compute cluster that satisfies the needs of your model, including GPU support and size. You can connect to an existing compute target created with your workspace, or add a new one. In this case I connect to a compute target called 'compute-experiments':

```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

try:
    # Connect to the compute target if it already exists
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
except ComputeTargetException:
    # Otherwise provision a new cluster
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2',
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

print(compute_target.get_status().serialize())
```

```
{'errors': None, 'creationTime': '2020-03-12T14:08:17.927990+00:00', 'createdBy': {'userId': 'e180613e-2ad1-41cc-8aae-8d4183f7b2fd', 'userOrgId': '72f988bf-86f1-41af-91ab-2d7cd011db47'}, 'modifiedTime': '2020-03-12T14:09:04.221524+00:00', 'state': 'Running', 'vmSize': 'STANDARD_D3_V2'}
```


As the snippet shows, you can also create a new compute target with the `ComputeTarget.create()` method. When you configured your local computer for machine learning in the previous chapters, you probably used Anaconda to manage virtual environments. Similarly, in the cloud you can provision the environment that your model needs alongside your compute target. Note that you can include Python packages in the `CondaDependencies` of your environment (pretty cool, huh?):

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Define the Python packages the training script needs on the compute target
conda_env = Environment('conda-env')
conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk',
                                                                             'azureml-contrib-dataset',
                                                                             'torch','torchvision',
                                                                             'azureml-dataprep[pandas,fuse]'])
```

Data scientists deal with many frameworks and libraries to train models: PyTorch, Keras, scikit-learn, TensorFlow, Chainer etc. Most model development follows the same pattern: first you specify an environment to train your model, including dependencies, parameters and scripts that define your experiment and how the model is trained, then the model is trained and saved or registered in the workspace. The Azure ML SDK provides two useful abstractions: one that wraps our experiments in the `Experiment` object, and another, called `Estimator`, that simplifies model training. In the following code snippet, I create an experiment and an estimator with a script named train.py, which we'll discuss in the next section:

```python
import os
from azureml.train.estimator import Estimator
from azureml.core import Experiment
from azureml.core import Dataset
from azureml.contrib.dataset import FileHandlingOption

experiment = Experiment(workspace=workspace, name=experiment_name)
os.makedirs(script_folder, exist_ok=True)
dataset = Dataset.get_by_name(workspace, name=labeled_dataset_name)

# Arguments passed to the training script
script_params = {
    '--output-folder': output_folder,
    '--model-file': model_file_name
}

# The estimator ties together the script, its inputs, the compute target and the environment
estimator = Estimator(source_directory=script_folder,
                      entry_script='train.py',
                      script_params=script_params,
                      inputs=[dataset.as_named_input('activities')],
                      compute_target=compute_target,
                      environment_definition=conda_env)
```
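The entries in `script_params` reach the training script as command-line arguments. The following sketch shows the idea with `argparse` alone, without any Azure dependency: the dictionary is flattened into an argument list and parsed the way a training script might parse it. The helper names `to_argv` and `parse_args` are assumptions for illustration, and the exact command line the SDK builds may differ:

```python
import argparse

script_params = {
    "--output-folder": "./outputs",
    "--model-file": "activities.pkl",
}

def to_argv(params):
    """Flatten {'--flag': value} pairs into a list like sys.argv[1:]."""
    argv = []
    for flag, value in params.items():
        argv.extend([flag, str(value)])
    return argv

def parse_args(argv):
    """Parse the two arguments the training script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-folder", dest="output_folder", required=True)
    parser.add_argument("--model-file", dest="model_file_name", required=True)
    return parser.parse_args(argv)

args = parse_args(to_argv(script_params))
print(args.output_folder, args.model_file_name)
```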

Model training in the cloud
===========================

In Chapter 6, you used PyTorch to train a model to classify a sport activity. We used a local notebook to run our training, and our dataset was already labeled: all images were placed in the folders corresponding to the names of the classes: *surfing* or *tennis*.

In this cloud-based project, we will use the *activities* dataset we labeled using the cloud workflow from the previous section. Since we already told the estimator where our training entry point will live, we'll place all our training code in the script train.py. Fortunately, we can reuse most of the model training code used for classification in Chapter 6, adjusting it to run in the cloud. When the training script runs in the cloud, the `Run` object maintains context information about our experiment environment, including the input datasets we send to the model for training. You can obtain the context of the experiment with the `Run.get_context()` call, and then get our labeled activities dataset from `run.input_datasets['activities']`:

```python
import os
import tempfile
from azureml.core import Dataset, Run
import azureml.contrib.dataset
from azureml.contrib.dataset import FileHandlingOption, LabeledDatasetTask

run = Run.get_context()
# get input dataset by name
labeled_dataset = run.input_datasets['activities']

mounted_path = tempfile.mkdtemp()
# mount dataset onto the mounted_path of a Linux-based compute
mount_context = labeled_dataset.mount(mounted_path)
mount_context.start()
print(os.listdir(mounted_path))
print(mounted_path)
```

The `load()` method below loads images from the labeled dataset and applies the transformations that the ResNet model requires. Remember that a pretrained model needs all images normalized in the same way. The model expects 224 x 224 pixel images with 3 RGB channels, normalized using `mean = [0.485, 0.456, 0.406]` and standard deviation `std = [0.229, 0.224, 0.225]`. As the script loads images, it also performs normalization. We will also split the dataset between training and testing, as we did in the Chapter 6 example when the model was trained using a local notebook:

```python
import torch
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
from torch.utils.data.sampler import SubsetRandomSampler

f = './download/workspaceblobstore/activities'

def load(f, size = .2):

    # resize, crop and normalize images the way the pretrained ResNet expects
    t = transforms.Compose([transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean = [0.485, 0.456, 0.406],
        std = [0.229, 0.224, 0.225])])

    train = datasets.ImageFolder(f, transform=t)
    test = datasets.ImageFolder(f, transform=t)
    # shuffle the indices and hold out a fraction `size` of images for testing
    n = len(train)
    indices = list(range(n))
    split = int(np.floor(size * n))
    np.random.shuffle(indices)
    train_idx, test_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    test_sampler = SubsetRandomSampler(test_idx)
    trainloader = torch.utils.data.DataLoader(train, sampler=train_sampler, batch_size=64)
    testloader = torch.utils.data.DataLoader(test, sampler=test_sampler, batch_size=64)
    return trainloader, testloader

trainloader, testloader = load(f, .2)
print(trainloader.dataset.classes)
images, labels = next(iter(trainloader))
grid = torchvision.utils.make_grid(images)
plt.imshow(grid.permute(1,2,0))
```
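To see what the normalization step does to a single pixel, here is the arithmetic written out in plain Python. `ToTensor` scales 8-bit values to the range 0 to 1, and `Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation. The pixel value is an arbitrary example:

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value / 255 - mean) / std to each channel of an 8-bit RGB pixel."""
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD)]

# an arbitrary example pixel
normalized = normalize_pixel((128, 64, 255))
print([round(v, 3) for v in normalized])
```

Channels darker than the ImageNet mean come out negative and brighter ones positive, which is why the normalized image grid can look washed out or clipped when plotted directly.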

Just like last time, we will use a pretrained ResNet model, trained on ImageNet, and apply transfer learning. We freeze the pretrained weights by setting `requires_grad` to False, so PyTorch does not compute gradients for them during backpropagation. Then we replace the last fully connected layer with a Linear classifier for the 2 classes of our labeled dataset, *surfing* and *tennis*:

```python
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models
from azureml.core import Dataset, Run
import azureml.contrib.dataset
from azureml.contrib.dataset import FileHandlingOption, LabeledDatasetTask

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet18(pretrained=True)

# freeze the pretrained weights; only the new classifier layer is trained
for param in model.parameters():
    param.requires_grad = False

run = Run.get_context()

# get input dataset by name
#labeled_dataset = run.input_datasets['activities']
#pytorch_dataset = labeled_dataset.to_torchvision()

# one output per class (the number of classes, not the size of a batch)
features = model.fc.in_features
model.fc = nn.Linear(features, len(trainloader.dataset.classes))
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

def train_model(epochs=3):
    for epoch in range(epochs):
        # training pass over all batches
        model.train()
        train_loss = 0
        for inputs, labels in trainloader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(inputs)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        scheduler.step()

        # evaluation pass on the held-out images, once per epoch
        test_loss = 0
        accuracy = 0
        model.eval()
        with torch.no_grad():
            for inputs, labels in testloader:
                inputs, labels = inputs.to(device), labels.to(device)
                logits = model(inputs)
                test_loss += criterion(logits, labels).item()
                top_class = logits.argmax(dim=1)
                accuracy += (top_class == labels).float().mean().item()
        train_losses.append(train_loss/len(trainloader))
        test_losses.append(test_loss/len(testloader))
        print(f"Epoch {epoch+1}/{epochs}.. "
              f"Train loss: {train_loss/len(trainloader):.3f}.. "
              f"Test loss: {test_loss/len(testloader):.3f}.. "
              f"Test accuracy: {accuracy/len(testloader):.3f}")
    return model

train_losses, test_losses = [], []
model = train_model(epochs=3)
torch.save(model, model_file_name)
```

![](images/ch11/fig_11-7.png)

Finally, we call the `train_model` method, and when the training is finished, our model is saved in the experiment instance's ./outputs folder:

```python
train_losses, test_losses = [], []
model = train_model(epochs=3)
print('Finished training, saving model')
os.makedirs(output_folder, exist_ok=True)
torch.save(model, os.path.join(output_folder, model_file_name))
```

Now, let's make our notebook write the whole train.py file:

```python
%%writefile $experiment_name/train.py

import argparse
import os
import tempfile
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import datasets, models, transforms
from torch.utils.data.sampler import SubsetRandomSampler
from azureml.core import Dataset, Run
import azureml.contrib.dataset
from azureml.contrib.dataset import FileHandlingOption, LabeledDatasetTask

def load(f, size = .2):

    # resize, crop and normalize images the way the pretrained ResNet expects
    t = transforms.Compose([transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean = [0.485, 0.456, 0.406],
        std = [0.229, 0.224, 0.225])])

    train = datasets.ImageFolder(f, transform=t)
    test = datasets.ImageFolder(f, transform=t)
    # shuffle the indices and hold out a fraction `size` of images for testing
    n = len(train)
    indices = list(range(n))
    split = int(np.floor(size * n))
    np.random.shuffle(indices)
    train_idx, test_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    test_sampler = SubsetRandomSampler(test_idx)
    trainloader = torch.utils.data.DataLoader(train, sampler=train_sampler, batch_size=64)
    testloader = torch.utils.data.DataLoader(test, sampler=test_sampler, batch_size=64)
    return trainloader, testloader

def get_mounting_path(labeled_dataset):

    # mount the dataset onto a temporary folder of the Linux-based compute
    mounted_path = tempfile.mkdtemp()
    mount_context = labeled_dataset.mount(mounted_path)
    mount_context.start()
    print(os.listdir(mounted_path))
    print(mounted_path)
    print(os.listdir(mounted_path+'/workspaceblobstore'))
    return mounted_path + '/workspaceblobstore/activities'

def start(output_folder, model_file_name):

    run = Run.get_context()
    labeled_dataset = run.input_datasets['activities']

    data_path = get_mounting_path(labeled_dataset)

    trainloader, testloader = load(data_path, .2)
    classes = trainloader.dataset.classes
    print(classes)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = models.resnet18(pretrained=True)

    # freeze the pretrained weights; only the new classifier layer is trained
    for param in model.parameters():
        param.requires_grad = False

    # one output per class (the number of classes, not the size of a batch)
    features = model.fc.in_features
    model.fc = nn.Linear(features, len(classes))
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

    # train the model
    train_losses, test_losses = [], []
    epochs = 3
    for epoch in range(epochs):
        # training pass over all batches
        model.train()
        train_loss = 0
        for inputs, labels in trainloader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(inputs)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        scheduler.step()

        # evaluation pass on the held-out images, once per epoch
        test_loss = 0
        accuracy = 0
        model.eval()
        with torch.no_grad():
            for inputs, labels in testloader:
                inputs, labels = inputs.to(device), labels.to(device)
                logits = model(inputs)
                test_loss += criterion(logits, labels).item()
                top_class = logits.argmax(dim=1)
                accuracy += (top_class == labels).float().mean().item()
        train_losses.append(train_loss/len(trainloader))
        test_losses.append(test_loss/len(testloader))
        print(f"Epoch {epoch+1}/{epochs}.. "
              f"Train loss: {train_loss/len(trainloader):.3f}.. "
              f"Test loss: {test_loss/len(testloader):.3f}.. "
              f"Test accuracy: {accuracy/len(testloader):.3f}")

    print('Finished training')
    os.makedirs(output_folder, exist_ok=True)
    torch.save(model, os.path.join(output_folder, model_file_name))
    print('Model saved:', model_file_name)

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument("--output-folder", default=None, type=str, dest='output_folder', required=True, help="Output folder for the model")
    parser.add_argument("--model-file", default=None, type=str, dest='model_file_name', required=True, help="Output model file")
    args = parser.parse_args()
    if args.output_folder:
        os.makedirs(args.output_folder, exist_ok=True)
    output_folder = args.output_folder
    model_file_name = args.model_file_name
    print('Output folder:', output_folder)
    print('Model file:', model_file_name)
    start(output_folder, model_file_name)
```

```
Overwriting activity-classification/train.py
```


Running experiments in the cloud
================================

Now that you have created your compute target, experiment, training script and estimator, you can submit your experiment and wait until the model is trained! Everything we've created earlier in this chapter (our training script, the definition of our experiment and the cloud environment) was a prelude to these two lines of code. This is where all the action takes place:

```python
run = experiment.submit(estimator)
run.wait_for_completion(show_output=True)
```

![](images/ch11/fig_11-9.png)

**Note:** the above command may take a long time! Your compute target is first provisioned with all required dependencies before it starts the actual training.
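If you prefer not to block the notebook, one option is to poll the status of the run yourself. The sketch below assumes an object with a `get_status()` method that returns strings such as 'Running' or 'Completed', which is how the `Run` class in the SDK reports status. The `FakeRun` class and the `wait_until_done` helper are hypothetical stand-ins so that the example runs without a workspace:

```python
import time

# States after which a run no longer changes
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def wait_until_done(run, poll_seconds=30, max_polls=120):
    """Poll run.get_status() until it reports a terminal state or polls run out."""
    status = run.get_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        time.sleep(poll_seconds)
        status = run.get_status()
        polls += 1
    return status

class FakeRun:
    """Stand-in for a submitted run; it walks through a fixed list of states."""
    def __init__(self, states):
        self._states = list(states)

    def get_status(self):
        # return the next state, and keep returning the last one afterwards
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]

final_status = wait_until_done(FakeRun(["Queued", "Running", "Completed"]), poll_seconds=0)
print(final_status)
```

Capping the number of polls keeps the loop from waiting forever if the run gets stuck; the default values here are arbitrary.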

Model management
================

After you have trained your model, you can register it in the cloud. A trained model is the 'brain' of your AI, and can be used from an API, a Web service, or any other endpoint to provide meaningful information to your customers. In this example, the model we trained to provide activity classification is registered as `activities`:

```python
model = run.register_model(model_name='activities', model_path=output_folder+"/"+model_file_name)
print(model.name, model.id, model.version, sep='\t')
```

```
activities	activities:3	3
```


Alternatively, you can download the model to your local device and load it in PyTorch:

```python
run.download_file(name=output_folder+"/"+model_file_name, output_file_path='./models')
```

Summary
=======

Using cloud-based machine learning methods is a natural step in bringing your experiment to your customer. In this chapter we looked at some familiar tasks, like using notebooks, loading and processing data, labeling, classification and training your models. Everything you've done on your local computer with notebooks (and more!) you can do in the cloud. You learned how to take a familiar task, image classification for different sport activities, and create a pipeline for training it in the cloud. Some new concepts include creating an environment for your experiment, including its Python package dependencies, defining a compute target to run your training and registering your model in the cloud environment.
