<img src="./display_images/NV_AWS.PNG" width="500">

## Welcome to the Computer Vision Exercise!

Nature devoted half of the human brain to visual processing, which makes sight critical to how we perceive the world. Endowing machines with sight has been a challenging endeavor, but advancements in compute, algorithms, and data quality have made computer vision (CV) more accessible than ever before. From mobile cameras to industrial mechanic lenses, biological labs to hospital imaging, and self-driving cars to security cameras, images are one of the most valuable types of data available. In this exercise, you will use common CV libraries and build an end-to-end object detection model on Amazon SageMaker using NVIDIA GPUs.

In this module we will walk through:
- Image Basics
- How to load and process images
- How to use images with PyTorch 
- Using models for image classification, object detection, and semantic segmentation
- Finetuning an object detection model
- Deploying an object detection model

Have fun!

### Install libraries
First, though, we need to install some libraries:

In [None]:
!pip install jupyter
!pip install ipywidgets
!pip install imgaug
!pip install tqdm
!pip install -U torchvision
!pip install sagemaker-experiments
!pip install seaborn
!pip install -U sagemaker==2.60.0

import IPython
IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel after installations

### Import libraries
This block imports our libraries, sets some variables, and instantiates our SageMaker session, which we will use later in the notebook.

In [None]:
%pylab inline
import json
import math
import os
import boto3
import sagemaker
import cv2
import imgaug
import pandas as pd
from datetime import timezone
from seaborn import heatmap
from tqdm import tqdm
from glob import glob
from matplotlib import patches
from PIL import Image, ImageFilter, ImageOps
from sklearn.metrics import average_precision_score, confusion_matrix, precision_score, recall_score
from skimage import transform

# import SageMaker specific libraries
from sagemaker.session import Session
from sagemaker.pytorch.estimator import PyTorch, PyTorchModel
from sagemaker.predictor import RealTimePredictor
from sagemaker.debugger import ProfilerConfig, FrameworkProfile, DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig, Rule, ProfilerRule, rule_configs

# import PyTorch libraries
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms

# set device for PyTorch to use, if on a GPU instance use cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# get the execution role that provides permissions to run operations in SageMaker like training or deploying endpoints
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
# set the S3 bucket you'll use
bucket = sagemaker_session.default_bucket()
prefix_output = 'prefix_output'
b3sess = boto3.Session()
# get region your notebook is running in
region = b3sess.region_name
sm = b3sess.client('sagemaker')

## Image Basics

Let's start with some image basics. [Digital images](https://en.wikipedia.org/wiki/Digital_image) are unlike other data in that their "columns" and "rows" are filled with pixel values. An image is fundamentally a matrix of pixels. Each pixel's x and y coordinates denote its spatial position, and its value tells the machine displaying it how bright the color at that position should be. If you view a single column or row of the matrix on its own, it won't make much sense because it needs the context of the other pixels. The following cell creates a simple binary matrix forming an "image" of an X. Run it to visualize the result!

In [None]:
# create an "image" array
x_img = np.array([
          [1,0,0,0,0,0,0,0,0,1],
          [0,1,0,0,0,0,0,0,1,0],
          [0,0,1,0,0,0,0,1,0,0],
          [0,0,0,1,0,0,1,0,0,0],
          [0,0,0,0,1,1,0,0,0,0],
          [0,0,0,0,1,1,0,0,0,0],
          [0,0,0,1,0,0,1,0,0,0],
          [0,0,1,0,0,0,0,1,0,0],
          [0,1,0,0,0,0,0,0,1,0],
          [1,0,0,0,0,0,0,0,0,1]
         ])
# plot our image
plt.figure(figsize=(12,7))
plt.imshow(x_img, cmap='gray') # set color scheme to grayscale 
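
To make the row and column idea concrete, here is a minimal sketch that rebuilds the same X pattern and indexes into it. The name `x_demo` is illustrative only and is not used elsewhere in this notebook.

```python
import numpy as np

# rebuild the 10x10 X pattern: ones on both diagonals, zeros elsewhere
x_demo = np.clip(np.eye(10, dtype=int) + np.fliplr(np.eye(10, dtype=int)), 0, 1)

# NumPy indexes images as [row, column], i.e. [y, x]
print('Shape (rows, columns):', x_demo.shape)
print('Pixel at row 0, column 0:', x_demo[0, 0])
print('Pixel at row 0, column 1:', x_demo[0, 1])
# a single row loses the context that makes the X recognizable
print('Row 3 on its own:', x_demo[3])
```

Note that the first index selects the row (vertical position) and the second selects the column (horizontal position).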

## Color Images

[Color images](https://en.wikipedia.org/wiki/Color_image) typically have three channels: red, green, and blue (RGB). Each channel value, or color intensity, ranges from 0 to 255. The combination of the three channels tells the machine displaying the image which color to show and how intense it should be. The following array shows the same X image, but instead of 0s and 1s, we make our X red by setting the red channel to 255 and the other channels to 0.

Try playing with the color channel values!

In [None]:
# color values, try changing them!
red_val = 255
blue_val = 0
green_val = 0
# define our color image array
x_color = np.array([
          [[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val]],
          [[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0]],
          [[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0]],
          [[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0]],
          [[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0]],
          [[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0]],
          [[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0]],
          [[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0]],
          [[0,0,0],[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val],[0,0,0]],
          [[red_val,green_val,blue_val],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[red_val,green_val,blue_val]]
])
plt.figure(figsize=(12,7))
plt.imshow(x_color)
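
As a small illustration of how the three channels combine, the sketch below builds a hypothetical 2x2 RGB image and splits it into its single-channel matrices. The array `tiny_rgb` and the channel variable names are made up for this example.

```python
import numpy as np

# a 2x2 RGB image: height x width x channels, values 0-255
tiny_rgb = np.zeros((2, 2, 3), dtype=np.uint8)
tiny_rgb[0, 0] = [255, 0, 0]      # pure red
tiny_rgb[0, 1] = [0, 255, 0]      # pure green
tiny_rgb[1, 0] = [0, 0, 255]      # pure blue
tiny_rgb[1, 1] = [255, 255, 0]    # red + green displays as yellow

# split into the three single-channel matrices
red_channel = tiny_rgb[:, :, 0]
green_channel = tiny_rgb[:, :, 1]
blue_channel = tiny_rgb[:, :, 2]
print('Image shape:', tiny_rgb.shape)
print('Red channel:\n', red_channel)
```

Each channel on its own is just a grayscale matrix like the X image from earlier; color only appears once the three are stacked.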

## Download data

Our X images were useful to demonstrate the concept of digital images but most images have far more complex pixel patterns. Now, let's download an actual image dataset to work with. The following dataset is Caltech-UCSD Birds 200 (CUB-200), which is a simple object detection dataset consisting of bird images. The following cell will download the dataset, decompress it, and remove the compressed version. The file structure is such that each class is a separate folder--for example, *001.Black_footed_Albatross*, which contains all of the images belonging to that class. 

(Welinder P., Branson S., Mita T., Wah C., Schroff F., Belongie S., Perona, P. “Caltech-UCSD Birds 200”. Distributed by the California Institute of Technology. 2010. CNS-TR-2010-001. http://www.vision.caltech.edu/visipedia/CUB-200.html)


In [None]:
# download CUB_200_2010
os.makedirs('CUB_200_2010', exist_ok=True)
! wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/DEV-AWS-MO-Nvidia/images.tgz -O CUB_200_2010/images.tgz
! wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/DEV-AWS-MO-Nvidia/annotations-mat.tgz -O CUB_200_2010/annotations-mat.tgz
! wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/DEV-AWS-MO-Nvidia/attributes.tgz -O CUB_200_2010/attributes.tgz    

# unpack the downloaded compressed tar files (they are removed further down)
!cd CUB_200_2010 && gunzip -c images.tgz | tar xopf - 
!cd CUB_200_2010 && gunzip -c annotations-mat.tgz | tar xopf - 
!cd CUB_200_2010 && gunzip -c attributes.tgz | tar xopf - 

# clean up artifacts
cats = glob('CUB_200_2010/images/*')
for cat in tqdm(cats):
    !rm {cat}/._* 

!rm CUB_200_2010/images.tgz
!rm CUB_200_2010/annotations-mat.tgz
!rm CUB_200_2010/attributes.tgz

## Load image

We will use Pillow, which is a fork of the [Python Imaging Library](https://pillow.readthedocs.io/en/stable/handbook/tutorial.html#using-the-image-class), otherwise known as PIL. This library is useful for processing images and has a variety of useful features and functions. It's all centered around PIL's Image class. The Image class has an open method that allows it to load an image stored on disk. PIL uses [lazy loading](https://en.wikipedia.org/wiki/Lazy_loading), meaning it identifies the file and opens it but waits to load the pixel data into memory until you try to process it. This is useful when you have a limited amount of memory but want to process a large number of images. Let's load one of our downloaded images, display it, and print out some summary information.

In [None]:
# use PIL's Image class to open our image
bird_img = Image.open('CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0004_2731401028.jpg')
plt.figure(figsize=(14,10))
plt.title('Black Footed Albatross', fontdict={'fontsize':20})
plt.imshow(bird_img)
print('Image height:', bird_img.height)
print('Image width:', bird_img.width)
print('Total pixels:', bird_img.width * bird_img.height) 

## Image transforms

Images are a unique data type because they can be heavily manipulated, but can still be interpreted by a human. If you adjust the brightness, resize the image, or change the color scheme, humans often can still interpret what the image portrays. There are many reasons you may want to manipulate images: you may need to resize them to fit into memory, you may want to crop out specific parts of your image, or maybe you want to blur sensitive information. PIL provides a large number of built-in transforms we can apply to our images. The following cell shows a few of those transforms, and then shows how to perform a custom affine shift transform.

In [None]:
# crop image by providing crop coordinates
cropimg = bird_img.crop([0,5,200,200])
# filter image using a PIL.ImageFilter class
filtimg = bird_img.filter(filter=ImageFilter.BLUR)
# rotate image by a specified number of degrees
rotimg = bird_img.rotate(90)
# resize image 
resizeimg = bird_img.resize((bird_img.width//2,bird_img.height//2))
# make the image grayscale
singlechanimg = ImageOps.grayscale(bird_img)

# rotation functions used in affine shift
def rot_x(angle, ptx, pty):
    return math.cos(angle)*ptx + math.sin(angle)*pty

def rot_y(angle, ptx, pty):
    return -math.sin(angle)*ptx + math.cos(angle)*pty

def affine_shift(img, deg=45):
    """
    Affine shift function utilizing PIL transform
    """
    angle = math.radians(deg)
    (x,y) = img.size
    # get extreme points
    xextremes = [rot_x(angle,0,0), rot_x(angle,0,y-1), rot_x(angle,x-1,0), rot_x(angle,x-1,y-1)]
    yextremes = [rot_y(angle,0,0), rot_y(angle,0,y-1), rot_y(angle,x-1,0), rot_y(angle,x-1,y-1)]
    mnx = min(xextremes)
    mxx = max(xextremes)
    mny = min(yextremes)
    mxy = max(yextremes)
    # perform transform using the Image class transform function
    affineimg = img.transform((int(round(mxx-mnx)), int(round((mxy-mny)))), 
                              Image.AFFINE, 
                              (math.cos(angle),math.sin(angle),-mnx,-math.sin(angle),math.cos(angle),-mny), 
                              resample=Image.BILINEAR)
    return affineimg

affineimg = affine_shift(bird_img, deg=30)
    
# plot different image manipulations
fig, axes = plt.subplots(nrows=2,ncols=3, figsize=(26,16))
axes[0,0].imshow(cropimg)
axes[0,0].set_title('Cropped image: \n cropimg = img.crop([0,5,200,200])', fontdict={'fontsize':20})
axes[0,1].imshow(filtimg)
axes[0,1].set_title('Filtered image: \n filtimg = img.filter(filter=ImageFilter.BLUR)', fontdict={'fontsize':20})
axes[0,2].imshow(rotimg)
axes[0,2].set_title('Rotated image: \n rotimg = img.rotate(90)', fontdict={'fontsize':20})
axes[1,0].imshow(resizeimg)
axes[1,0].set_title('Resized image: \n resizeimg = img.resize((img.width//2,img.height//2))', fontdict={'fontsize':20})
axes[1,1].imshow(affineimg)
axes[1,1].set_title('Affine shifted image: \n affineimg = img.transform(size, method, data)', fontdict={'fontsize':20})
axes[1,2].imshow(singlechanimg, cmap='gray')
axes[1,2].set_title('Grayscale image: \n singlechanimg = ImageOps.grayscale(img)', fontdict={'fontsize':20})
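
The affine shift above sizes its output canvas by rotating the four image corners and measuring their extent. The sketch below isolates that calculation so you can try different angles without loading an image. The helper name `rotated_canvas_size` is hypothetical; it mirrors the corner logic of `affine_shift`, including its use of `width - 1` and `height - 1`.

```python
import math

def rotated_canvas_size(width, height, deg):
    """Return the (width, height) of the canvas spanned by the rotated corners."""
    angle = math.radians(deg)
    corners = [(0, 0), (0, height - 1), (width - 1, 0), (width - 1, height - 1)]
    # rotate each corner with the same formulas used by rot_x and rot_y
    xs = [math.cos(angle) * x + math.sin(angle) * y for x, y in corners]
    ys = [-math.sin(angle) * x + math.cos(angle) * y for x, y in corners]
    return int(round(max(xs) - min(xs))), int(round(max(ys) - min(ys)))

print(rotated_canvas_size(500, 300, 0))
print(rotated_canvas_size(500, 300, 90))
print(rotated_canvas_size(500, 300, 30))
```

A 90 degree rotation swaps the two dimensions, while intermediate angles need a canvas larger than the original in both directions.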

## Convert to Torch tensor

While PIL is a great library, we can’t use it on its own to create deep learning models. [PyTorch](https://pytorch.org/) is a commonly used deep learning framework that works out of the box with NVIDIA GPUs. It's based around PyTorch tensors, a specialized data structure that is very similar to [Numpy's ndarrays](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html). PyTorch has great support for computer vision tasks and makes it easy to convert images into [torch tensors](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html). PyTorch organizes image batches as (batch dimension x color channels x height x width), so we need to reshape our image. PyTorch's convolutional operations expect the first dimension to be the batch dimension. For that reason, we will add an additional dimension using the unsqueeze method.

In [None]:
# reshape our numpy array and turn into a torch tensor
orig_shape = np.array(singlechanimg).shape
print('Original image shape:', orig_shape)
# reshape original image
reshaped_image_array = np.reshape(singlechanimg, (1, orig_shape[0], orig_shape[1])) 
# convert into a torch tensor
torch_img = torch.tensor(reshaped_image_array, device=device)
# add a batch dimension
torch_img = torch_img.unsqueeze(0)
print('Torch image tensor shape:' ,torch_img.shape)
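
For a single-channel image, reshaping is enough because it only adds an axis. For a color image the channel axis has to be moved, and that is where reshape and transpose differ. The following NumPy sketch uses a made-up 2x3 RGB array (`hwc`) to show the difference; the same idea applies to the permute method on torch tensors.

```python
import numpy as np

# a tiny 2x3 RGB image in the channel-last layout used by PIL and Matplotlib (H, W, C)
hwc = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# transpose moves the channel axis to the front without scrambling pixels (C, H, W)
chw = hwc.transpose(2, 0, 1)
# add a leading batch axis to get (N, C, H, W)
nchw = chw[np.newaxis, ...]
print('Channel-last shape:', hwc.shape)
print('Batched channel-first shape:', nchw.shape)

# reshape keeps the memory order, so it does NOT produce the same result as transpose
wrong = hwc.reshape(3, 2, 3)
print('reshape matches transpose:', np.array_equal(wrong, chw))
```

Both results have the shape (3, 2, 3), but only the transposed version keeps each channel's pixels together.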

## Convolution example

Now that we have our image represented as a tensor, let's try convolving over it! [Convolution](https://en.wikipedia.org/wiki/Convolution) is an operation where you take a weighted sum over a target using a kernel of fixed size. For images, we can use this kernel as a feature detector, looking for features like lines and edges. In the following example, we take two kernels: one that is looking for edges, and another that is essentially preserving the original image. 

[Convolutional neural networks (CNNs)](https://cs231n.github.io/convolutional-networks/) are specialized neural networks that use convolutions to take advantage of the structure and spatial relationships within images. A feed forward neural network would need to flatten images in order to perform tasks like classification. Convolutions allow CNNs to preserve existing spatial information and use a variety of different kernels that are adjusted dynamically to build layered feature maps. Convolutions are used with other network layers like [pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer), [activation](https://en.wikipedia.org/wiki/Convolutional_neural_network#ReLU_layer), [normalization](https://en.wikipedia.org/wiki/Batch_normalization), and [fully connected layers](https://en.wikipedia.org/wiki/Convolutional_neural_network#Fully_connected_layer) that allow CNNs to efficiently classify, detect objects, and segment images.

![convolution.gif](display_images/2D_Convolution_Animation.gif)

## Convolve over our X image

Let's try out the PyTorch [functional 2D convolution operation](https://pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html#torch.nn.functional.conv2d). PyTorch allows users to construct neural networks containing a variety of different operations in a neural network class (to see an example check out their [neural networks tutorial](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html)), but it also has a functional API that lets users try individual operations. Run the following cell to see the original image, the convolutional kernel, and the output of our 2D convolution operation! Try modifying the convolutional kernel and note how the convolutional output changes.

In [None]:
# define our convolutional kernel
conv_kernel = torch.tensor([[[[0,0,1],
                              [0,1,0],
                              [1,0,0]]]], device=device, dtype=torch.float32)
# create our image tensor and add batch and channel dimensions
x_tensor = torch.tensor(x_img, device=device, dtype=torch.float32).unsqueeze(dim=0).unsqueeze(dim=0)
# convolve our kernel over our image
conv_out = torch.nn.functional.conv2d(x_tensor, weight=conv_kernel, stride=1)

# plot the image, kernel, and output
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18,7))
ax[0].imshow(x_img, cmap='gray')
ax[0].set_title('Original Image', fontdict = {"fontsize":20})
ax[1].imshow(conv_kernel.detach().cpu().numpy().squeeze(), cmap='gray')
ax[1].set_title('Convolutional Kernel', fontdict = {"fontsize":20})
ax[2].imshow(conv_out.detach().cpu().numpy().squeeze(), cmap='gray')
ax[2].set_title('Convolution Map Output', fontdict = {"fontsize":20})
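
To see what the convolution is doing under the hood, here is a minimal NumPy sketch of the same sliding weighted sum with stride 1 and no padding. The function `conv2d_valid` is a hypothetical helper written for this explanation, not a PyTorch API. Like `torch.nn.functional.conv2d`, it does not flip the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take weighted sums."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# same X pattern and anti-diagonal kernel as the cell above
x_demo = np.clip(np.eye(10) + np.fliplr(np.eye(10)), 0, 1)
anti_diag_kernel = np.fliplr(np.eye(3))
feature_map = conv2d_valid(x_demo, anti_diag_kernel)
print('Output shape:', feature_map.shape)
print('Strongest response:', feature_map.max())
```

The output is 8x8 because a 3x3 kernel fits into a 10x10 image at 8 positions per axis, and the strongest responses sit where the kernel lines up with the matching stroke of the X.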

## Convolve over a real image

Now that we have seen what a convolution looks like on our X image, let's take a look at what it does to a real image. Let's specify two kernels: one that looks for right-facing edges, and one that is essentially an identity convolution. 

We will utilize another operation common in convolutional neural networks, [max pooling](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html). Max pooling is a simple method for reducing a network's computational load by downsampling the convolutional outputs. It takes the maximum value over a kernel of fixed size. In 2D pooling this kernel is typically 2x2, meaning that if we run this operation on our convolutional output, it halves both the height and the width. Convolutions tend to be followed by pooling operations because pooling allows the network to preserve important features of the input while reducing the amount of computation.

In [None]:
# define our kernels
# identity kernel
id_filt = [
             [0,0,0],
             [0,1,0],
             [0,0,0]]
# edge kernel, will look for edges 
edge_filt = [
             [0,1,0],
             [0,0,1],
             [0,1,0]]
# combine our kernels so we can apply them in parallel
filt = torch.tensor([[edge_filt], [id_filt]], device=device, dtype=torch.float32) 
# convolve our kernels over the image
conv_out = torch.nn.functional.conv2d(torch_img.float(), weight=filt) 
# pool our convolution output using a 2x2 kernel
pool_out = torch.nn.functional.max_pool2d(conv_out, 2)
# convert our outputs back into NumPy arrays
conv_out = np.array(conv_out.detach().cpu(), dtype=np.uint8)
pool_out = np.array(pool_out.detach().cpu(), dtype=np.uint8)
print('Original image tensor shape:', torch_img.shape)
print('Convolutional output shape:', conv_out.shape)
print('Pooled output shape:', pool_out.shape)
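
The pooling step can be sketched in a few lines of NumPy. The helper `max_pool_2x2` below is an illustrative stand-in for the 2x2 max pooling used above, assuming non-overlapping windows and dropping any trailing odd row or column.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample by taking the maximum over non-overlapping 2x2 windows."""
    h, w = feature_map.shape
    # drop a trailing row or column if the size is odd
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # group pixels into (h2, 2, w2, 2) blocks and reduce over each block
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

demo_map = np.array([[1, 2, 0, 0],
                     [3, 4, 0, 1],
                     [5, 0, 9, 8],
                     [0, 6, 7, 2]])
pooled = max_pool_2x2(demo_map)
print(pooled)
```

Each output value is the largest entry of one 2x2 window, so the strongest activations survive while the map shrinks to a quarter of its original area.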

### View convolutional output

In [None]:
# plot convolutional maps
fig,ax = plt.subplots(nrows=2, ncols=2, figsize=(28,16))
ax[0,0].imshow(conv_out.squeeze()[0], cmap='gray')
ax[0,0].set_title('Edge Kernel', fontdict = {'fontsize':20})
ax[0,1].imshow(conv_out.squeeze()[1], cmap='gray')
ax[0,1].set_title('Identity Kernel', fontdict = {'fontsize':20})
ax[1,0].imshow(pool_out.squeeze()[0], cmap='gray')
ax[1,0].set_title('Pooled Edge Kernel', fontdict = {'fontsize':20})
ax[1,1].imshow(pool_out.squeeze()[1], cmap='gray')
ax[1,1].set_title('Pooled Identity Kernel', fontdict = {'fontsize':20})

## Visualize data with labels

We have walked through some methods for processing and manipulating images. While that is interesting, we are here to train machine learning models! Supervised models need labels to learn from, and luckily our dataset comes with a set of bounding box labels. To get an idea of what your model will be training on, run the following cell to collect labels and the next cell to visualize a random bird. To visualize, we will load the image using PIL and also load the bounding box coordinates. We can plot bounding boxes using Matplotlib's patches module. First we will organize our annotations into a dictionary we can use to overlay them on our images.

In [None]:
from scipy.io import loadmat

!rm CUB_200_2010/annotations-mat/*m
!rm -rf CUB_200_2010/annotations-mat/Flickr*
!rm CUB_200_2010/annotations-mat/readme.pdf

# grab file paths for annotations
cats = glob('CUB_200_2010/annotations-mat/*') 
cats.sort()
label_dict = {}
# load annotations and put them into a dictionary
for cat in tqdm(cats):
    img_paths = glob(f'{cat}/*')
    for imgp in img_paths:
        label = loadmat(imgp)
        label_dict[imgp] = label

### Visualize labeled image

In [None]:
# pick a random index
i = np.random.randint(low=0,high=100)

# open the label file
with open('CUB_200_2010/attributes/images-dirs.txt','r') as f:
    images = f.read()
# split into lines
image = images.split('\n')
# get file path
imgp = image[i].split(' ')[-1]

# load image
full_path = f'CUB_200_2010/images/{imgp}'
random_bird = Image.open(full_path)
# load bounding boxes
bboxes = label_dict[full_path.replace('images', 'annotations-mat').replace('.jpg','.mat')]['bbox']
bboxes = np.concatenate(bboxes.tolist()).squeeze()

fig,ax = plt.subplots(1, figsize=(20,12))
ax.set_title(f"{imgp.split('/')[-1]}", fontdict={'fontsize':20})

# Display the image
ax.imshow(random_bird)
# add bounding box vis
rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='r',facecolor='none') 
ax.add_patch(rect)
# add text label
plt.text(bboxes[0], bboxes[1]-10, f"{imgp} at {bboxes}", bbox=dict(facecolor='white', alpha=0.5)) 

## Image classification

Now that we have gone over the basics of the convolution operation, let's take a look at a common convolutional neural network (CNN), [ResNet](https://arxiv.org/abs/1512.03385). Despite ResNet being a relatively old architecture (proposed in 2015), it's still incredibly powerful and many state-of-the-art CNNs are built using similar architectures. ResNet is a neural network built on a series of stackable convolutional blocks. These convolutional blocks contain the same operations we spoke of earlier but also contain skip connections to one another. The skip connections are what made ResNet revolutionary since it could be built arbitrarily deep and the network could decide to "skip" specific layers that didn't contribute to predicting the target. Other networks at the time could not be built as deeply, or they would suffer from the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) (an issue where the gradient would become so small that it essentially prevented the weights from changing their values). The blocks consist of 2D convolutional, normalization, and activation layers. Since the blocks are stackable, there are a variety of variants of ResNet, ranging from ResNet18 (18 layers) to ResNet152 (152 layers!). We will download ResNet18 from [torchvision, PyTorch's computer vision library](https://pytorch.org/vision/stable/models.html), which contains a variety of pretrained models. 

![](display_images/residual_block.png)
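
The skip connection can be illustrated with a toy example. The sketch below is a deliberate simplification: it uses two small matrix multiplications in place of the convolution and normalization layers of a real ResNet block, and the names `residual_block` and `zero_w` are made up for this illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, weight_1, weight_2):
    """Toy residual block: output = relu(F(x) + x), with F as two small linear layers."""
    fx = relu(x @ weight_1) @ weight_2   # the learned transformation F(x)
    return relu(fx + x)                  # the skip connection adds the input back

x = np.array([1.0, 2.0, 3.0])
# with all-zero weights F(x) is zero, so the block passes its input through unchanged
zero_w = np.zeros((3, 3))
print(residual_block(x, zero_w, zero_w))
```

Because the input is added back, a block whose weights contribute nothing still forwards its input, which is the sense in which the network can "skip" a layer.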

### Load ResNet18 model

In [None]:
# load a pretrained ResNet18 from torchvision
# (newer torchvision releases prefer the weights= argument over pretrained=True)
resnet = torchvision.models.resnet.resnet18(pretrained=True)
# set model to evaluation mode
resnet.eval() 
# send model to the selected device (GPU if available, otherwise CPU)
resnet.to(device)
# load the model's label scheme (parsed as a Python literal instead of executed with eval)
import ast
with open('labels.txt', 'r') as f:
    labs = ast.literal_eval(f.read())
    

### Get model predictions

Now that we've downloaded a model, we can try it out on our image! Let's resize the image so that it's 224x224, which is the size our model expects.

In [None]:
imgp = image[2291].split(' ')[-1] 
print('Image path:',imgp)

# load image
bird_img = Image.open(f'CUB_200_2010/images/{imgp}')
# resize our image to 224x224 
bird_img = bird_img.resize((224,224))
# format our image as a torch tensor and move the channel dimension first: (H, W, C) -> (C, H, W)
bird_tensor = torch.tensor(np.array(bird_img), dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)
# run inference with a PyTorch context manager that disables gradient calculation
with torch.no_grad():
    output = resnet(bird_tensor.to(device))
print('Output label:', labs[int(torch.argmax(output))])

# resize and plot image
plot_img = bird_img.resize((224,224))
plt.figure(figsize=(14,7))
plt.imshow(plot_img)
plt.title(f'Output label: {labs[int(torch.argmax(output))]} \n')
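
The model returns one raw score per class, and taking the argmax picks the highest-scoring label. If you also want a confidence value, a softmax converts those raw scores into probabilities. The sketch below uses hypothetical scores and label names (`demo_logits`, `demo_labels`) rather than real model output.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# hypothetical raw scores for a three-class model
demo_logits = np.array([2.0, 0.5, -1.0])
demo_labels = ['albatross', 'goldfish', 'tench']
probs = softmax(demo_logits)
best = int(np.argmax(probs))
print(f'Predicted: {demo_labels[best]} ({probs[best]:.1%})')
```

Softmax does not change which class wins; it only rescales the scores so they can be read as relative confidence.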

## Using a dataloader

The model did not do a great job with our resized image. The reason is that we didn't normalize our input image, which the model expects because it was trained on normalized images. While we can implement this normalization ourselves, a much simpler way to achieve the results we want is to use torchvision transforms together with a [DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)!

While you can classify images you load individually, PyTorch allows you to utilize DataLoaders, which can efficiently load your images from disk and perform transformations on them in parallel. There are two components to a DataLoader: the Dataset object and the DataLoader wrapper. The Dataset object specifies how to load the images and annotations with a `__getitem__` method. The DataLoader wrapper specifies how large our batch size is, how many workers we want to fetch data, and whether we want to do things like shuffle our inputs. Torchvision provides a series of transforms you can perform on your image inputs. Here, we are using torchvision's built-in ImageFolder Dataset object, which allows us to specify the location of our image root folder and the transforms we want to use on our images. 

Our transformations consist of resizing our image to 224x224, center cropping it, converting it into a torch tensor, and normalizing it. These transforms mimic the ones applied to images during model training and give our model a more consistent input.

In [None]:
test_folder = 'CUB_200_2010/images/'
# setup normalization transform
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
# put together a sequence of transforms 
test_transform = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
# initialize our dataset object
test_set = datasets.ImageFolder(root=test_folder, transform=test_transform)
# create our data loader
test_loader = DataLoader(test_set,
                         batch_size=1,
                         shuffle=False,
                         num_workers=0,
                         pin_memory=True)
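
To show what the tensor conversion and normalization steps do numerically, here is a NumPy sketch applied to a single hypothetical pixel. It assumes the usual behavior of scaling 8-bit values to the 0-1 range before subtracting the per-channel mean and dividing by the per-channel standard deviation.

```python
import numpy as np

# channel statistics taken from the Normalize transform above
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# a hypothetical 1x1 RGB pixel with 8-bit values
pixel_uint8 = np.array([[[124, 116, 104]]], dtype=np.uint8)

# scale to [0, 1], then standardize each channel
scaled = pixel_uint8 / 255.0
normalized = (scaled - mean) / std
print('Scaled:', scaled.round(3))
print('Normalized:', normalized.round(3))
```

A pixel close to the channel means ends up near zero after normalization, which is the input range the pretrained model saw during training.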


### Accessing data from our dataset object

A DataLoader might seem like an abstract concept, but we can access data from our DataLoader in several ways to make it more concrete. We can directly reference the Dataset object within the DataLoader and use the `__getitem__` method to return an image tensor and its label for a given index.

In [None]:
test_loader.dataset.__getitem__(0)
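
As a rough picture of what a Dataset object provides, the following pure-Python sketch defines a minimal class with `__len__` and `__getitem__`. `ToyDataset` is a hypothetical stand-in, not the torchvision implementation, and its samples are plain numbers rather than images.

```python
class ToyDataset:
    """Minimal map-style dataset: anything with __len__ and __getitem__."""

    def __init__(self, samples, transform=None):
        self.samples = samples          # list of (value, label) pairs
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        value, label = self.samples[index]
        if self.transform is not None:
            value = self.transform(value)
        return value, label

toy_set = ToyDataset([(1, 'a'), (2, 'b'), (3, 'c')], transform=lambda v: v * 10)
# square-bracket indexing calls __getitem__ under the hood
print(toy_set[0])
print(toy_set.__getitem__(0) == toy_set[0])
```

The transform runs every time an item is fetched, which is why the images returned by the dataset are already resized and normalized.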

### Classify image from DataLoader

Now, let's classify the image from our DataLoader. We can wrap our DataLoader in the iter() function to return an iterator object, and then use the next() function to return the next item from the iterator. This is similar to how we return data batches for training jobs.

The DataLoader will return items in order unless you set the shuffle argument to True. This means the image our DataLoader returns here is going to be the first file in the first class folder. Let's classify two images, one that is the next item from our DataLoader, and another from a specific index in our Dataset object.

In [None]:
# predict class
# get torch arrays
imgs = next(iter(test_loader))
# retrieve the image at index 3 (the fourth image in the dataset)
imgs2 = test_loader.dataset.__getitem__(3)[0].unsqueeze(dim=0)

# run inference, the torch.no_grad() is a context manager that disables gradient calculation, which isn't necessary for inference
with torch.no_grad():
    output = resnet(imgs[0].to(device))
    output2 = resnet(imgs2.to(device))

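# load the first image in the dataset, which is the one the DataLoader returned above
plot_img = Image.open(test_loader.dataset.samples[0][0]).resize((224,224))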

# load image using PIL
imgp2= 'CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0004_2731401028.jpg'
img2 = Image.open(imgp2)
plot_img2 = img2.resize((224,224))

# map our output back to human readable text
output_lab = labs[int(torch.argmax(output[0]))] 
output_lab2 = labs[int(torch.argmax(output2[0]))]

# Create figure and axes
fig, ax = plt.subplots(ncols=2,nrows=1, figsize=(20, 16))
ax[0].set_title(f"Prediction: {output_lab}", fontdict={'fontsize':20})
ax[1].set_title(f"Prediction: {output_lab2}", fontdict={'fontsize':20})

# Display the image
ax[0].imshow(plot_img)
ax[1].imshow(plot_img2)

While the model didn't work for the first image, for the second image it appears that now we have the correct label. The first image is more complex in that there are more objects and the bird is partially occluded. This illustrates the need for fine-tuning models for your use case.  

Let's see how our DataLoader transformed the image before it passed it to the model. We can't plot a torch tensor, so we first need to convert it back to a NumPy array using the following commands, and then plot the image. Since PyTorch stores tensors channel first--and Matplotlib expects data to be channel last--reshaping does not automatically rearrange the dimensions, so we have to use the permute method.

In [None]:
# convert to a numpy array and rearrange dimensions
dataloader_img = imgs[0].squeeze().permute(1,2,0).detach().cpu().numpy()
# after permute the array is already (224, 224, 3), so this reshape leaves it unchanged
dataloader_img = np.reshape(dataloader_img, (224,224,3))
plt.figure(figsize=(14,7))
plt.title('DataLoader Image')
plt.imshow(dataloader_img)
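
The plotted image looks color-shifted because it is still normalized, and Matplotlib clips float values outside the 0-1 range. If you want to see the original colors, one option is to invert the normalization first. The helper `unnormalize` below is a hypothetical sketch that assumes the same mean and standard deviation used in the transform.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

def unnormalize(chw_image):
    """Undo Normalize on a (C, H, W) array and return an (H, W, C) array in [0, 1]."""
    hwc = chw_image.transpose(1, 2, 0)       # channel first -> channel last
    restored = hwc * std + mean              # invert (x - mean) / std
    return np.clip(restored, 0.0, 1.0)       # guard against rounding outside [0, 1]

# hypothetical normalized image: a mid-gray 4x4 picture
gray = np.full((4, 4, 3), 0.5)
normalized = ((gray - mean) / std).transpose(2, 0, 1)
restored = unnormalize(normalized)
print(restored.shape, restored[0, 0])
```

With a real batch you could pass `imgs[0].squeeze().cpu().numpy()` to this helper and hand the result to `plt.imshow`.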

## Object detection

Now, let's take a look at some object detection models. Object detection models tend to use similar operations as image classification models but with some tweaks at the end so they can output not only a class prediction but output bounding box coordinates as well.

Let's start with [Faster RCNN](https://arxiv.org/abs/1506.01497). Faster RCNN is a model that predicts both bounding boxes and class scores for potential objects in the image. It uses a variant of the ResNet model we used previously for classification as a "backbone". Instead of using the ResNet model as-is, Faster RCNN removes the final output layer. Rather than returning class predictions, the backbone returns the penultimate activations, which can be considered a latent representation of our image. This backbone operates as a feature detector and passes information on to a [region proposal network (RPN)](https://paperswithcode.com/method/rpn). The region proposal network is a fully convolutional network (meaning it has no fully connected layers) that proposes object boundaries and "objectness" scores, which are essentially its confidence that an object exists at that position. These areas are called regions of interest, or ROIs. In the following image, you can see ROIs are generated and then pooled, since we tend to have far more ROIs than actual objects in an image. The features from those ROIs are then flattened and used to predict the class and bounding box coordinates of objects in the image.

![Faster RCNN](display_images/faster_rcnn.png)

### Download Faster RCNN

Let's download a [Faster RCNN](https://pytorch.org/vision/stable/models.html#object-detection-instance-segmentation-and-person-keypoint-detection) object detection model from torchvision that was pretrained on [COCO 2017](https://cocodataset.org/#home) and try it out.

In [None]:
# load a model pre-trained on COCO 2017 with a ResNet50 backbone
od_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# send model to the selected device
od_model.to(device)
# set model to evaluation mode (assigning the result suppresses the printed model summary)
evl = od_model.eval()

### Get model predictions

Let's use an image from our DataLoader and check the output of our model:

In [None]:
# retrieve an image from the dataset object by indexing into it
# (alternatively, iterate over the DataLoader with the commented line below)
# imgs = next(iter(test_loader))
imgs = test_loader.dataset.__getitem__(3)[0].unsqueeze(dim=0)

# copy the tensor and rescale it by its maximum value
img_tensor = imgs.clone()
img_tensor = img_tensor/img_tensor.max()
# run detection on our image without tracking gradients
with torch.no_grad():
    output = od_model(img_tensor.to(device))
# load COCO class labels
with open('coco_classes.txt', 'r') as f:
    labels = f.read()
labels = labels.split('\n')

output

## Visualize output

The model returned multiple bounding boxes, which seems strange since there is really only one object in the image. Since these models have to guess how many objects are in an image, it's not uncommon for them to return excess boxes. Typically, we want to only return bounding boxes above a specific confidence threshold. Let's set this threshold to 0.5, but feel free to adjust the threshold to see more predicted boxes overlaid on our image.

In [None]:
# Create figure and axes
fig, ax = plt.subplots(1, figsize=(20, 12))
imgp = '001.Black_footed_Albatross/Black_footed_Albatross_0004_2731401028.jpg'
ax.set_title(f"Frame {imgp}")
# confidence threshold, try adjusting it!
threshold = 0.5

# Display the image
ax.imshow(plot_img2)

# loop through predictions
for i, annot in enumerate(output[0]['scores']):
    if annot >= threshold: # plot the bounding box if the score meets the confidence threshold
        rect = patches.Rectangle(
            (float(output[0]['boxes'][i,0]), float(output[0]['boxes'][i,1])), #["left"] ["top"]
            float(output[0]['boxes'][i,2])-float(output[0]['boxes'][i,0]), # ["width"]
            float(output[0]['boxes'][i,3])-float(output[0]['boxes'][i,1]), # ['height']
            linewidth=1,
            edgecolor="r",
            facecolor="none",
        )
        ax.add_patch(rect)
        score = output[0]['scores'][i]*100
        plt.text(
            output[0]['boxes'][i,0],
            output[0]['boxes'][i,1] - 5,
            f"{labels[output[0]['labels'][i]]} {score.round()}%",
            bbox=dict(facecolor="white", alpha=0.5),
        )
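
The thresholding step can also be applied before plotting. The following is a minimal sketch, assuming a hypothetical `filter_detections` helper and plain Python lists with made-up values standing in for the `boxes`, `labels`, and `scores` tensors that the model returns:

```python
def filter_detections(detection, threshold=0.5):
    """Keep only detections whose score is at or above the threshold.

    `detection` is a dict with parallel 'boxes', 'labels', and 'scores'
    sequences, mirroring one element of the model's output list.
    """
    keep = [i for i, score in enumerate(detection["scores"]) if score >= threshold]
    return {key: [values[i] for i in keep] for key, values in detection.items()}


# made-up values for illustration only
example = {
    "boxes": [[10, 20, 200, 180], [15, 30, 90, 70], [0, 0, 50, 50]],
    "labels": [16, 49, 1],
    "scores": [0.91, 0.42, 0.07],
}
kept = filter_detections(example, threshold=0.5)
print(kept["labels"], kept["scores"])
```

With real tensors, a boolean mask such as `output[0]['scores'] >= threshold` can be used to index `boxes`, `labels`, and `scores` in the same way.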

## Visualize predicted box and ground truth box

It looks like the model got the class wrong and didn't do much better with the bounding box. Let's visually compare this bounding box to the actual ground truth bounding box. Something to note: because we resized our image to 224x224, our predicted bounding boxes are generated for those dimensions. To compare them with our ground truth labels, we need to rescale either our ground truth bounding box or our predicted bounding box. Another consideration: are the bounding boxes you output in (x1, y1, x2, y2) format, or are they in (x, y, width, height) format? Make sure you are comparing boxes in the same format!

In [None]:
# get image using previously defined filepath
albatross = Image.open(f'CUB_200_2010/images/{imgp}')
# get size of original image
x,y = albatross.size
albatross = albatross.resize((224,224))
# build the full image path, which we also use to look up the label
full_path = f'CUB_200_2010/images/{imgp}'
# load bounding boxes
bboxes = label_dict[full_path.replace('images', 'annotations-mat').replace('.jpg','.mat')]['bbox']
bboxes = np.concatenate(bboxes.tolist()).squeeze()
bboxes = bboxes.astype(np.float16)
# adjust so that bounding box fits 224x224 image
bboxes[0] = bboxes[0]*224/x
bboxes[1] = bboxes[1]*224/y
bboxes[2] = bboxes[2]*224/x
bboxes[3] = bboxes[3]*224/y

fig,ax = plt.subplots(1, figsize=(20,12))
ax.set_title(f'Black Footed Albatross', fontdict={'fontsize':20})

# Display the image
ax.imshow(albatross)
rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='g',facecolor='none') 
ax.add_patch(rect)
plt.text(bboxes[0], bboxes[1]-5, f"Ground Truth Albatross", bbox=dict(facecolor='white', alpha=0.5)) 

# we use an argmax operation that outputs the index of the highest value
index = torch.argmax(output[0]['scores'])
# that argmax index is used to determine which bounding box to use
pred_box = output[0]['boxes'][int(index)]
# create bounding box for predicted box
rect = patches.Rectangle(
    (float(pred_box[0]), float(pred_box[1])), #["left"] ["top"]
    float(pred_box[2])-float(pred_box[0]), # ["width"]
    float(pred_box[3])-float(pred_box[1]), # ['height']
    linewidth=1,
    edgecolor="r",
    facecolor="none",
)
ax.add_patch(rect)
plt.text(
    output[0]['boxes'][index,0],
    output[0]['boxes'][index,1] - 5,
    f"{labels[output[0]['labels'][index]]}",
    bbox=dict(facecolor="white", alpha=0.5),
)
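
The two box formats and the rescaling step can be made explicit with a few small helpers. This is a sketch only: the function names `xywh_to_xyxy`, `xyxy_to_xywh`, and `scale_xyxy` are hypothetical, and the example box and image sizes are made up.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


def scale_xyxy(box, orig_size, new_size):
    """Rescale an (x1, y1, x2, y2) box between image sizes given as (width, height)."""
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


# made-up box on a 500x400 image that is resized to 224x224
box = (100, 80, 300, 240)
print(xyxy_to_xywh(box))
print(scale_xyxy(box, (500, 400), (224, 224)))
```

For tensors, `torchvision.ops.box_convert` offers the same kind of format conversion through its `in_fmt` and `out_fmt` arguments.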

## Bounding Box Evaluation

Now we have seen our predicted box next to the ground truth box, but how can we get a numeric representation of box accuracy? 

[Intersection over union (IoU)](https://giou.stanford.edu/) gives us a metric that measures the overlap between the ground truth bounding box and our predicted bounding box. Torchvision has a set of [operators](https://pytorch.org/vision/stable/ops.html), including a box_iou utility that we can use to calculate IoU between our prediction and our target. The result is a value between 0 and 1, calculated by dividing the area of intersection of the two boxes by the area of their union. An IoU over 0.5 tends to be considered a successful detection, but this definition varies for different use cases. 


In [None]:
# turn into a float tensor so the coordinates are not truncated to integers
gt_box = torch.tensor(np.array(bboxes, dtype=np.float32))
print('Ground Truth Box dimensions', gt_box)
print('Predicted Box dimensions', pred_box)

# run box_iou operation to calculate IoU between pred and ground truth
box_iou = torchvision.ops.box_iou(pred_box.unsqueeze(dim=0).cpu(), gt_box.unsqueeze(dim=0)) 
print('Bounding Box IoU:',round(float(box_iou),4))
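
To see what `box_iou` computes, here is a minimal from-scratch sketch for a single pair of boxes in (x1, y1, x2, y2) format. The helper names `box_area` and `iou` and the example boxes are illustrative and are not part of torchvision.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # corners of the overlapping rectangle (empty if the boxes do not overlap)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


# made-up boxes: two 10x10 squares overlapping in a 5x5 corner
a = (0, 0, 10, 10)
b = (5, 5, 15, 15)
print(round(iou(a, b), 4))
```

The intersection here is 25 and the union is 100 + 100 - 25 = 175, so the IoU is 25/175.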

## Try a different model

The default torchvision Faster RCNN model seemed to mistake the bird's wing for a knife, returning not only an incorrect bounding box but also an incorrect class label, which isn't very helpful for us. Is there a better pretrained option available? 

One easy-to-use asset is NVIDIA’s single shot detector, or [SSD model](https://arxiv.org/abs/1512.02325). SSD is an efficient single-stage object detection model that is built to be fast and accurate. NVIDIA’s pretrained SSD model has been optimized for both accuracy and inference speed. This model is available through either NGC or [PyTorch Hub](https://pytorch.org/hub/nvidia_deeplearningexamples_ssd/). It has been trained on the COCO 2017 dataset and out of the box can generate accurate inference results on the dataset's 80 object categories (81 labels when the background class is counted). Let's download it from PyTorch Hub and try it out!

![](display_images/ssd.png)

### Download SSD model

In addition to downloading the model, we will also load some utilities that will correctly prep our image for inference. 

In [None]:
# load our model, but don't get the pretrained version as it expects to load on a GPU
ssd_model = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', pretrained=False) 
# get model utilities
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd_processing_utils')
# get our model checkpoint and map it to the device our notebook is running on
checkpoint = torch.hub.load_state_dict_from_url("https://api.ngc.nvidia.com/v2/models/nvidia/ssdpyt_fp32/versions/2/zip", map_location=device) 
# change the key names, when running distributed training it often changes the names of your weight keys
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["model"].items()}
# load our weights
ssd_model.load_state_dict(state_dict)
# send our model to our torch device
ssd_model.to(device)
# set our model to evaluation mode
ssde = ssd_model.eval()

### Prepare image tensors

We need to prepare our image inputs and transform them into tensors. We can do so using the prepare_tensor utility function. Below we've also defined a CPU-compatible version that you can use if you are not running on a GPU.

In [None]:
# use the NVIDIA provided utilities to transform and normalize our image input 
full_path = f'CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0004_2731401028.jpg'
inputs = [utils.prepare_input(full_path)]

# define rescale function
def rescale(img, input_height, input_width):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    aspect = img.shape[1] / float(img.shape[0])
    if (aspect > 1):
        # landscape orientation - wide image
        res = int(aspect * input_height)
        imgScaled = transform.resize(img, (input_width, res))
    if (aspect < 1):
        # portrait orientation - tall image
        res = int(input_width / aspect)
        imgScaled = transform.resize(img, (res, input_height))
    if (aspect == 1):
        imgScaled = transform.resize(img, (input_width, input_height))
    return imgScaled

# create a new version of the utils.prepare_tensor utility that can work outside of a GPU
def prepare_tensor(inputs, fp16=False):
    NHWC = np.array(inputs)
    NCHW = np.swapaxes(np.swapaxes(NHWC, 1, 3), 2, 3)
    tensor = torch.from_numpy(NCHW)
    tensor = tensor.contiguous()
    # .to() returns a new tensor rather than moving it in place, so keep the result
    tensor = tensor.to(device)
    tensor = tensor.float()
    if fp16:
        tensor = tensor.half()
    return tensor

# prepare our tensor for classification
if device.type == 'cuda':
    tensor = utils.prepare_tensor(inputs)
else:
    tensor = prepare_tensor(inputs)
    

In [None]:
import os
file_with_coco_names = "category_names.txt"

if not os.path.exists(file_with_coco_names):
    print("Downloading COCO annotations.")
    import urllib.request
    import zipfile
    import json
    import shutil
    urllib.request.urlretrieve("https://dsoaws.s3.amazonaws.com/cfn_templates/annotations_trainval2017.zip", "cocoanno.zip")
    with zipfile.ZipFile("cocoanno.zip", "r") as f:
        f.extractall()
    print("Downloading finished.")
    with open("annotations/instances_val2017.json", 'r') as COCO:
        js = json.loads(COCO.read())
    classes_to_labels = [category['name'] for category in js['categories']]
    with open(file_with_coco_names, 'w') as f:
        f.writelines([c+"\n" for c in classes_to_labels])
    os.remove("cocoanno.zip")
    shutil.rmtree("annotations")
else:
    with open(file_with_coco_names) as f:
        classes_to_labels = [c.strip() for c in f.readlines()]
print(classes_to_labels)

### Run inference

Now that we've prepared our data, let's test it out! The following block will return object detections with a model confidence of 0.5 or higher.

In [None]:
%%time

# try adjusting the confidence threshold
confidence_threshold = 0.5

# get model predictions without tracking gradients
with torch.no_grad():
    detections_batch = ssd_model(tensor)
# decode our results, SSD by default returns 8732 boxes
results_per_input = utils.decode_results(detections_batch)
# filter our results
best_results_per_input = [utils.pick_best(results, confidence_threshold) for results in results_per_input]
# display the filtered boxes, class ids, and confidences
best_results_per_input

### Rescale our ground truth boxes and image

Let's load our ground truth bounding box and work out how the SSD preprocessing rescaled and center cropped our image.

In [None]:
# get original and rescaled image dimensions 

img2 = Image.open(full_path)
img2_arr = np.array(img2)

def get_crop_data(img_arr, cropx=300, cropy=300):
    # get image dimensions
    ydim, xdim, cdim = img_arr.shape
    # get rescaled image and dimensions
    rescaled_img = rescale(img_arr, cropx, cropy)
    rescaled_y, rescaled_x, rescaled_c = rescaled_img.shape
    # find starting positions for center cropping
    start_x = xdim // 2 - (cropx // 2)
    start_y = ydim // 2 - (cropy // 2)
    return rescaled_x, rescaled_y, start_x, start_y


rescaled_x, rescaled_y, start_x, start_y = get_crop_data(img2_arr, cropx=300, cropy=300)
print("Crop offsets:" ,start_x, start_y)

# load bounding boxes
bboxes = label_dict[full_path.replace('images', 'annotations-mat').replace('.jpg','.mat')]['bbox']
bboxes = np.concatenate(bboxes.tolist()).squeeze()
bboxes = bboxes.astype(np.float16)
print('Ground truth box:',bboxes)

### Plot results

To plot SSD's detections, we need to adjust their dimensions. The SSD utilities rescale the image and then center crop it, and the model outputs box coordinates normalized to the 0-1 range. We use rescaled_x and rescaled_y to scale the SSD output back to pixels, and then use the crop start position to shift the box onto the original image. Let's view our detections on the original image:

In [None]:
def rescale_bbox(startx, starty, img, bboxes, rescaled_x, rescaled_y, size=300):
    left, bot, right, top = bboxes
    # adjust detections to rescaled image size
    x, w  = [val * rescaled_x for val in [left, right-left]] 
    y, h = [val * rescaled_y for val in [bot, top - bot]] 
    # adjust for center cropping
    x = x + startx/(img.size[0]/size)
    y = y + starty/(img.size[1]/size)
    w = w - startx/(img.size[0]/size) 
    h = h - starty/(img.size[1]/size) 
    return x,y,w,h

# plot our detections
for image_idx in range(len(best_results_per_input)):
    fig, ax = plt.subplots(1, figsize = (20,12))
    # Show original, denormalized image
    ax.imshow(img2_arr)
    rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='g',facecolor='none') 
    ax.add_patch(rect)
    plt.text(bboxes[0], bboxes[1]-5, f"Ground Truth Albatross at {bboxes}", bbox=dict(facecolor='white', alpha=0.5), fontdict = {'fontsize':12}) 

    # show detections
    sbboxes, classes, confidences = best_results_per_input[image_idx]
    for idx in range(len(sbboxes)):
        # adjust detections to rescaled image size
        x,y,w,h = rescale_bbox(start_x, start_y, img2, sbboxes[idx], rescaled_x, rescaled_y, size=300)
        rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x, y+5, f"Prediction: {classes_to_labels[classes[idx] - 1]} {round(confidences[idx]*100,2)}%", 
                bbox=dict(facecolor='white', alpha=0.5),
               fontdict = {'fontsize':12})
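
As a cross-check on the coordinate bookkeeping, the mapping from a normalized box back to original pixels can be written out step by step. This is a sketch under stated assumptions, not the `rescale_bbox` helper above: it assumes the preprocessing resizes the image so that its shorter side equals `size` and then center crops a `size` x `size` square, it ignores integer rounding, and the function name `crop_box_to_original` is hypothetical.

```python
def crop_box_to_original(box, orig_w, orig_h, size=300):
    """Map a normalized (x1, y1, x2, y2) box from a center-cropped square
    back to pixel coordinates in the original image."""
    # scale factor that makes the shorter side equal to `size`
    scale = size / min(orig_w, orig_h)
    resized_w, resized_h = orig_w * scale, orig_h * scale
    # offset of the crop's top-left corner inside the resized image
    off_x = (resized_w - size) / 2
    off_y = (resized_h - size) / 2
    x1, y1, x2, y2 = box
    # normalized -> crop pixels -> resized-image pixels -> original pixels
    return tuple(
        (value * size + offset) / scale
        for value, offset in ((x1, off_x), (y1, off_y), (x2, off_x), (y2, off_y))
    )


# a box covering the whole crop of a made-up 600x300 image
print(crop_box_to_original((0, 0, 1, 1), orig_w=600, orig_h=300))
```

For a 600x300 image the crop is the central 300x300 square, so a box covering the whole crop maps back to the region from x=150 to x=450.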

### Calculate IoU

Now let's assess the IoU of the predicted box. The ground truth box appears to include some extra space around the bird, so the prediction may fit the bird better than the IoU value suggests!

In [None]:
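# turn coordinates into a tensor
gt_box = torch.tensor(bboxes)
# idx is left over from the plotting loop above, so this scores the last detection that was drawn
x,y,w,h = rescale_bbox(start_x, start_y, img2, sbboxes[idx], rescaled_x, rescaled_y, size=300)
# box_iou expects (x1, y1, x2, y2), so convert width and height into the bottom-right corner
x2 = x + w
y2 = y + h
pred_box = torch.tensor([x,y,x2,y2])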

print('Ground Truth box:', gt_box)
print('Pred box', pred_box)

# run box_iou operation to calculate IoU between pred and ground truth
box_iou = torchvision.ops.box_iou(pred_box.unsqueeze(dim=0), gt_box.unsqueeze(dim=0)) 
print('Bounding Box IoU:',round(float(box_iou),4))

## Visualize different IoUs

Our IoU was pretty good using the SSD model! One way to think about bounding box predictions is through the lens of precision and recall, applied spatially rather than in the traditional sense. Did the prediction encompass the entire ground truth box? If so, you could say it had good recall. But did it also include a lot of extra space? If so, you could say its precision needs improvement. 

Let's take a look at a couple different examples of bounding boxes with similar IoU values. The ground truth boxes are visualized in green and the predicted boxes are red.

In [None]:
# create fake bounding box
interim_box = torch.tensor([ 140. , 40. ,250., 250.])
full_path = 'CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0004_2731401028.jpg'
albatross = Image.open(full_path)
bboxes = label_dict[full_path.replace('images', 'annotations-mat').replace('.jpg','.mat')]['bbox']
bboxes = np.concatenate(bboxes.tolist()).squeeze()
bboxes = bboxes.astype(np.float16)
# calculate IoU
iou = torchvision.ops.box_iou(interim_box.unsqueeze(dim=0), torch.tensor(bboxes).unsqueeze(dim=0))

fig,ax = plt.subplots(nrows=1, ncols = 2, figsize=(24,16))

# Display the image
ax[0].imshow(albatross)
# create bounding box rectangle
rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='g',facecolor='none') 
ax[0].add_patch(rect)
# add text label
ax[0].text(bboxes[0], bboxes[1]-10, f"Albatross bbox1 {bboxes}", bbox=dict(facecolor='white', alpha=0.5)) 

rect = patches.Rectangle((interim_box[0], interim_box[1]), interim_box[2]-interim_box[0], interim_box[3]-interim_box[1] ,linewidth=1,edgecolor='r',facecolor='none') 
ax[0].add_patch(rect)
ax[0].text(interim_box[0], interim_box[1], f"Albatross bbox2 {interim_box}", bbox=dict(facecolor='white', alpha=0.5)) 
iou = torchvision.ops.box_iou(interim_box.unsqueeze(dim=0), torch.tensor(bboxes).unsqueeze(dim=0))
ax[0].set_title(f'Black Footed Albatross IoU: {round(float(iou),4)}', fontdict={'fontsize':20})

interim_box = torch.tensor([ 5. , 5. ,300., 250.])
iou = torchvision.ops.box_iou(interim_box.unsqueeze(dim=0), torch.tensor(bboxes).unsqueeze(dim=0))

# Display the image
ax[1].imshow(albatross)
rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='g',facecolor='none') 
ax[1].add_patch(rect)
ax[1].text(bboxes[0], bboxes[1]-10, f"Albatross bbox1 {bboxes}", bbox=dict(facecolor='white', alpha=0.5)) 

rect = patches.Rectangle((interim_box[0], interim_box[1]), interim_box[2]-interim_box[0], interim_box[3]-interim_box[1] ,linewidth=1,edgecolor='r',facecolor='none') 
ax[1].add_patch(rect)
ax[1].text(interim_box[0], interim_box[1], f"Albatross bbox2 {interim_box}", bbox=dict(facecolor='white', alpha=0.5)) 
iou = torchvision.ops.box_iou(interim_box.unsqueeze(dim=0), torch.tensor(bboxes).unsqueeze(dim=0))
ax[1].set_title(f'Black Footed Albatross IoU: {round(float(iou),4)}', fontdict={'fontsize':20})

The previous visualizations show bounding boxes with roughly the same IoU, but with drastically different sizes. Since IoU looks at the intersection between the boxes over the union of the total space of both boxes, it makes sense the IoU values are similar for these predicted boxes. Typically, objects in a scene don't comprise such a large portion of the image so predicting a very large box would be penalized more heavily. 
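
The spatial view of precision and recall can be made concrete. The sketch below uses a hypothetical `spatial_precision_recall` helper and made-up boxes, and assumes both boxes have positive area. It shows a prediction that is too small and one that is too large ending up with the same IoU.

```python
def spatial_precision_recall(pred, gt):
    """Return (precision, recall, iou) for two (x1, y1, x2, y2) boxes.

    Precision is the share of the predicted box that overlaps the ground truth;
    recall is the share of the ground truth box that the prediction covers.
    """
    iw = max(0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / pred_area, inter / gt_area, inter / (pred_area + gt_area - inter)


gt = (100, 100, 200, 200)          # made-up ground truth box
too_small = (100, 100, 200, 150)   # covers only the top half of the ground truth
too_large = (100, 100, 300, 200)   # contains the ground truth plus extra space

print(spatial_precision_recall(too_small, gt))
print(spatial_precision_recall(too_large, gt))
```

The small box has perfect precision but only half the recall, the large box has perfect recall but only half the precision, and both score an IoU of 0.5.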

Now we know how to evaluate bounding box error, but how do you evaluate the error for all the objects in your image, and for all of the images in your dataset? We can do so with mean average precision, otherwise known as [mAP](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision).

Mean average precision allows us to calculate the accuracy of our models by setting IoU thresholds (often 0.5 is used) and calculating how many bounding boxes the model predicted correctly. Mean average precision is essentially a representation of the [precision-recall curve](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html).

## Compute average precision

To calculate average precision, we first need some inference results. The mAP calculation factors in both the IoU between the predicted box and the ground truth box as well as the prediction confidence.

In [None]:
input_list = ['CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0004_2731401028.jpg',
             'CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0002_2293084168.jpg',
             'CUB_200_2010/images/001.Black_footed_Albatross/Black_footed_Albatross_0005_2755588934.jpg']
# set plot to True to visualize all bounding boxes predicted for an image
plot = True
if plot:
    fig, ax = plt.subplots(nrows=1, ncols=3, figsize = (20,12))

box_list = []
class_list = []
confidences = []
for i, inp in enumerate(input_list):
    # prepare images for inference
    inputs = [utils.prepare_input(inp)]
    if device.type == 'cuda':
        tensor = utils.prepare_tensor(inputs)
    else:
        tensor = prepare_tensor(inputs)
    bird_img = Image.open(inp)
    img_arr = np.array(bird_img)
    rescaled_x, rescaled_y, start_x, start_y = get_crop_data(img_arr, cropx=300, cropy=300)

    bboxes = label_dict[inp.replace('images', 'annotations-mat').replace('.jpg','.mat')]['bbox']
    bboxes = np.concatenate(bboxes.tolist()).squeeze()
    bboxes = bboxes.astype(np.float16)

    # generate predictions
    with torch.no_grad():
        detections_batch = ssd_model(tensor)
    results_per_input = utils.decode_results(detections_batch)
    best_results_per_input = [utils.pick_best(results, confidence_threshold) for results in results_per_input]
    
    # in our case we have one object per image
    sbboxes, classes, confidence = results_per_input[0]
    print('Number of boxes:', len(results_per_input[0][-1]))
    gt_box = torch.tensor(bboxes)

    # calculate IoUs
    box_ious = []
    for j in range(sbboxes.shape[0]):
        x,y,w,h = rescale_bbox(start_x, start_y, bird_img, sbboxes[j], rescaled_x, rescaled_y, size=300)
        w = w + x
        h = h + y 
        pred_box = torch.tensor([x,y,w,h])
        box_iou = torchvision.ops.box_iou(pred_box.unsqueeze(dim=0), gt_box.unsqueeze(dim=0)) 
        box_iou = float(box_iou.detach())
        box_ious.append(box_iou)

    print('Predictions:', sbboxes)
    print('GT box:', gt_box)
    print('Best bounding Box IoU:', round(float(max(box_ious)),4))
    print('------------')
    box_list.append(box_ious)
    class_list.append(classes)
    confidences.append(confidence)
    
    # plot our detections
    if plot:
        # show the original image with its ground truth box
        ax[i].imshow(bird_img)
        rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='g',facecolor='none') 
        ax[i].add_patch(rect)
        ax[i].text(bboxes[0], bboxes[1]-5, f"Ground Truth Albatross at {bboxes}", bbox=dict(facecolor='white', alpha=0.5), fontdict = {'fontsize':12}) 
        ax[i].set_title(f"{(inp.split('/')[-1])}")
        # overlay each prediction once with its label, confidence, and IoU
        for idx in range(sbboxes.shape[0]):
            x,y,w,h = rescale_bbox(start_x, start_y, bird_img, sbboxes[idx], rescaled_x, rescaled_y, size=300)
            rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none') 
            ax[i].add_patch(rect)
            ax[i].text(x, y+5, f"Prediction: {classes_to_labels[classes[idx] - 1]} {round(confidence[idx]*100,2)}% with IoU of: {round(box_ious[idx], 3)}", 
                       bbox=dict(facecolor='white', alpha=0.5),
                       fontdict = {'fontsize':12})


### Calculate mAP

There are several ways in which you can evaluate mAP. [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/), which was a popular computer vision challenge, simply used 0.5 as the IoU cutoff for evaluating object detection accuracy and, in its 2007 edition, averaged the interpolated precision at 11 evenly spaced recall levels from 0 to 1. Common Objects in Context, or [COCO](https://cocodataset.org/#home), uses a different method for evaluating object detection accuracy. COCO evaluates detections at multiple IoU thresholds, ranging from 0.5 to 0.95 in steps of 0.05, and averages the results to get a more holistic view of detection accuracy. We are going to use a simplified, PASCAL VOC-style version of mAP to demonstrate the concept. 

As previously mentioned, average precision is a summarization of the precision-recall curve into a single value. Since we can have more predictions than ground truth labels in object detection, we can't simply calculate precision and recall directly. In order to calculate mAP we need to run the following steps:
- For each predicted bounding box in a given image we calculate the IoU with the ground truth boxes.
- We sort our bounding box predictions by confidence.
- We set a confidence threshold and only consider predictions above the threshold. In this exercise we step through thresholds of 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1, mirroring the 11 evenly spaced levels of PASCAL VOC's 11-point metric.
- If the IoU is at or above our threshold, in this case 0.5, and we don't already have a bounding box above our IoU threshold for the given object, we consider this a true positive. Otherwise, if the IoU is below the threshold or we already have a bounding box above the IoU threshold for that object, it is considered a false positive. If there isn't a prediction for a given object we consider this a false negative.
- We calculate precision and recall based on how we classified our predicted bounding boxes for each confidence level.
- The precision-recall curve is created for each confidence level and then we generate the interpolated precision-recall curve.
- We find the area under the precision-recall curves and average them to generate our average precision metric.
- If we have multiple classes we average those to get the mean average precision.
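
The ranking and true/false positive steps above can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the `precision_recall_points` helper is hypothetical, the predictions are made up, and each prediction is assumed to arrive already paired with the index of its best-overlapping ground truth box and the IoU with that box.

```python
def precision_recall_points(predictions, num_gt, iou_thresh=0.5):
    """Return running precision and recall values for ranked predictions.

    `predictions` is a list of (confidence, gt_index, iou) tuples and
    `num_gt` is the number of ground truth boxes.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    matched = set()  # ground truth boxes that already have a true positive
    tp = fp = 0
    precisions, recalls = [], []
    for _confidence, gt_index, iou in ranked:
        if iou >= iou_thresh and gt_index not in matched:
            matched.add(gt_index)
            tp += 1
        else:
            # low overlap, or a duplicate detection of an already matched object
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    return precisions, recalls


# made-up predictions for an image with two ground truth boxes
predictions = [(0.9, 0, 0.8), (0.8, 0, 0.7), (0.6, 1, 0.3), (0.5, 1, 0.6)]
precisions, recalls = precision_recall_points(predictions, num_gt=2)
print(precisions)
print(recalls)
```

The second prediction is a duplicate of the first object and the third has too little overlap, so both count as false positives before the fourth prediction recovers the second object.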

### Retrieve precision and recall values

For the predictions in a given image let's create that table. You need to sort everything by confidence, then calculate precision and recall at each step. 

In [None]:
# sort predictions by descending confidence, then reorder the IoUs and classes to match
i = 0
sort_ind = np.argsort(confidences[i])[::-1]
confidences_sorted = np.array(confidences[i])[sort_ind]
box_list_sorted = np.array(box_list[i])[sort_ind]
classes_sorted = np.array(class_list[i])[sort_ind]

def gen_precision_recall_curve(box_list_sorted, confidences_sorted, conf_thresh = 0, iou_thresh = 0.5, target = (15,), pos_label = 15,
                              verbose=False):
    """
    Generate our precision-recall curve by outputting precision and recall values for object predictions.
    """
    # copy the targets so that padding them below never modifies the caller's list or the default value
    target = list(target)
    precisions = []
    recalls = []
    tps = []
    preds = []
    # for each box in our list of ranked boxes
    for i,box in enumerate(box_list_sorted):
        # if it meets the IoU and confidence thresholds and you have the same or fewer predictions as GT add as true positive 
        if (box >= iou_thresh) and (len(tps) <= len(target)) and (confidences_sorted[i] >= conf_thresh):
            tps.append(True)
            preds.append(pos_label)
        # otherwise if it's below the IoU threshold but above the confidence threshold, mark it as a false positive
        elif confidences_sorted[i] >= conf_thresh:
            tps.append(False)
            preds.append(pos_label)
            # if target and predictions are different lengths, append zeros to target to make them the same length.
            len_diff = len(preds) - len(target)
            if len_diff > 0:
                target.extend([0]* len_diff )
        # if it's below the confidence threshold and if we have no predictions we add a single negative prediction 
        elif len(target) > len(preds):
            preds.append(0)
            tps.append(False)
        # if the predictions are below the confidence threshold and targets is the same length as preds we break 
        # since predictions are already sorted by confidence 
        else:
            break
            
        if verbose:
            print(target, preds)
        # calculate precision and recall at each step
        precision = precision_score(y_true=target, y_pred=preds, average='binary', pos_label=pos_label) 
        precisions.append(precision)
        recall = recall_score(y_true=target, y_pred=preds, average = 'binary', pos_label=pos_label)
        recalls.append(recall)
    return precisions, recalls, tps
    
precisions, recalls, tps = gen_precision_recall_curve(box_list_sorted, confidences_sorted, conf_thresh = 0,
                                                      iou_thresh = 0.5, target = [15], pos_label = 15)
print('True positives:', tps)
print('Precision values:', precisions)
print('Recall values:', recalls)

### Visualize precision-recall curve

Now that we have our precisions and recalls at each step, let's put them in a [Pandas DataFrame](https://pandas.pydata.org/) and visualize the precision-recall curve. We can see that for our example, as our confidence goes down, our recall remains the same but our precision decreases. Typically you would see the recall increase as the precision drops, but since our most confident prediction also found the only object in the scene, recall simply remains 100%.

In [None]:
# create rank table 
pr_frame = pd.DataFrame().from_dict({
    "True Positive":tps,
    "Precision":precisions,
    "Recall":recalls,
})
plt.plot( pr_frame.Recall, pr_frame.Precision, marker=11)
plt.xlabel('Recall')
plt.ylabel('Precision')
pr_frame

For our example we want to look at the interpolated precision-recall curve. This means we take the max precision value at each recall level. Since our most confident prediction correctly inferred the only object in our image, our recall is simply 1 for all of our predictions. In this case that means that our interpolated precision is actually 1. 

In [None]:
interpolated_pr_frame = pr_frame.groupby('Recall').max()
# generate average precision using the interpolated values 
print('Average precision:', np.average(interpolated_pr_frame.Precision))
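
For comparison, a common way to turn an interpolated curve into a single number is to take the area under it. The sketch below shows the all-point interpolation used by later PASCAL VOC editions, which is not the exact calculation in the cell above. The `average_precision` helper and the example values are illustrative.

```python
def average_precision(precisions, recalls):
    """Area under the interpolated precision-recall curve.

    `precisions` and `recalls` are running values for predictions ranked by
    descending confidence, so `recalls` must be non-decreasing.
    """
    # sentinel points so the curve spans recall 0 to 1
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # interpolate: each precision becomes the max precision at any higher recall
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # sum the rectangles between consecutive recall values
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


# made-up running values for four ranked predictions
example_precisions = [1.0, 0.5, 1 / 3, 0.5]
example_recalls = [0.5, 0.5, 0.5, 1.0]
print(average_precision(example_precisions, example_recalls))
```

Precision is 1.0 up to a recall of 0.5 and 0.5 for the remaining half of the recall range, which gives 0.5 * 1.0 + 0.5 * 0.5 = 0.75.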

### Calculate mAP across confidence thresholds

The previous example had no confidence cutoff, meaning we took all predictions into account. Mean average precision is designed to take confidence into account, so in this simplified version we calculate average precision at multiple confidence thresholds and then average the results. 


In [None]:
APs = []
conf_thres = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
for conf in tqdm(conf_thres):
    # get precisions and recalls
    precisions, recalls, tps = gen_precision_recall_curve(box_list_sorted, confidences_sorted, conf_thresh = conf,
                                                          iou_thresh = 0.5, target = [15], pos_label = 15)
    # get precision-recall curve
    pr_frame = pd.DataFrame().from_dict({
        "True Positive":tps,
        "Precision":precisions,
        "Recall":recalls,
    })
    # interpolate precision-recall curve
    interpolated_pr_frame = pr_frame.groupby('Recall').max()
    interpolated_pr_frame = pr_frame.merge(interpolated_pr_frame, on='Recall')
    AP = np.average(interpolated_pr_frame.Precision_y)
    APs.append(AP)
    
print('Average precisions',APs)
mAP = np.average(APs)
print('Mean Average Precision', mAP)

Our mAP is good in this case, but we are only evaluating simple images with large and unoccluded objects. When you have to calculate it for many different objects of varying size and class like in the COCO dataset, achieving a mAP this high is incredibly difficult.

Instead of implementing it from scratch, there are a variety of libraries that have mAP implementations like [this one](https://github.com/Cartucho/mAP).

# Semantic Segmentation

The final computer vision task we will cover is semantic segmentation. Semantic segmentation is a task where we classify every pixel in the image as belonging to one of our output classes and generate an output mask with the pixel classifications. There are a variety of semantic segmentation models available. A commonly used one is [DeepLabV3](https://pytorch.org/hub/pytorch_vision_deeplabv3_resnet101/). Let's download a pretrained model and test it on our dataset.

In [None]:
# download a pretrained deeplabv3 model
seg_model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True) 
seg_model.to(device)
evl = seg_model.eval()

### Get model predictions

In [None]:
# get a segmentation mask from our model
imgs = next(iter(test_loader))
with torch.no_grad():
    output = seg_model(imgs[0].to(device))

### Mask shape

So what exactly does the output of a semantic segmentation model look like? Let's view the shape:

In [None]:
print(output['out'].shape)

Our output is 1x21x224x224. It's shaped this way because the model predicts 21 classes (20 object classes plus background), so it returns 21 different segmentation masks. Each mask holds a raw score for every pixel. Let's look at the different classes:

In [None]:
coco_seg_classes = ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']


### Visualize output

Before we view our output, we need to consider that what our model returned is a set of raw, unnormalized scores rather than probabilities. We can use a [softmax function](https://en.wikipedia.org/wiki/Softmax_function) across the class dimension to normalize our outputs and put them on a scale between 0 and 1. Let's run this operation on our output and visualize the normalized probabilities and the raw scores. 

In [None]:
# apply softmax across the 21 class channels, then select the bird channel (index 3)
soft_tensor = torch.nn.functional.softmax(output['out'][0], dim=0)[3]
albatross = Image.open(f'CUB_200_2010/images/{imgp}')
albatross = albatross.resize((224,224))

fig, ax = plt.subplots(ncols=3, figsize=(20, 12))
ax[0].set_title('Albatross raw score Segmentation', fontdict = {'fontsize':18})
ax[0].imshow(output['out'][0,3,:,:].detach().cpu().numpy().squeeze())
ax[1].set_title('Albatross softmax output Segmentation', fontdict = {'fontsize':18})
ax[1].imshow(soft_tensor.detach().cpu().numpy()*255) 
ax[2].set_title('Original image', fontdict = {'fontsize':18})
ax[2].imshow(albatross)

Now that we have softmaxed our outputs, we typically set a confidence threshold to classify pixels. The threshold can simply be 0.5, or can be tuned for your specific use case.
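
To make the thresholding step concrete, here is a minimal, framework-free sketch. It assumes a toy 3-class, 2x2 score map with made-up values rather than the real 21x224x224 DeepLabV3 output, and the helper name `softmax_over_classes` is hypothetical. The key point is that the softmax runs across the class axis, so the class probabilities of each pixel sum to 1.

```python
import math

def softmax_over_classes(logits):
    """logits[c][y][x] -> probs[c][y][x], normalized across classes c for each pixel."""
    n_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(n_classes)]
    for y in range(height):
        for x in range(width):
            pixel_logits = [logits[c][y][x] for c in range(n_classes)]
            peak = max(pixel_logits)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in pixel_logits]
            total = sum(exps)
            for c in range(n_classes):
                probs[c][y][x] = exps[c] / total
    return probs

# toy scores (made-up values): class 0 = background, class 1 = bird, class 2 = boat
toy_logits = [
    [[4.0, 0.5], [0.2, 3.0]],
    [[0.1, 3.5], [2.9, 0.3]],
    [[0.0, 0.1], [0.3, 0.2]],
]
toy_probs = softmax_over_classes(toy_logits)

# binary mask for the "bird" class at a 0.5 confidence threshold
threshold = 0.5
bird_mask = [[p > threshold for p in row] for row in toy_probs[1]]
print(bird_mask)  # [[False, True], [True, False]]
```

With the real model output, the equivalent normalization would be `torch.softmax(output['out'][0], dim=0)`, followed by a comparison against the threshold on the channel of interest.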

## Finetuning an Object Detection Model

Now that we have gone through all of the different tasks, let's turn our focus to object detection. Let's take a look at how we can take an object detection model with a pre-trained backbone and fine-tune it for a different object detection task. The original CUB_200 dataset contains 200 separate classes. Since each class has only around 40 examples, it winds up being a complex classification task for such a small amount of data per class, even when using transfer learning. To reduce training times and improve our model's accuracy, we have created a subset of 10 bird classes that we will train our model on. 

In [None]:
# load our labels
with open('label_files/samples_birds_10_train.json', 'r') as f:
    bird_labs = json.load(f)
with open('label_files/samples_birds_10_val.json', 'r') as f:
    bird_labs_val = json.load(f)
with open('label_files/samples_birds_10_test.json', 'r') as f:
    bird_labs_test = json.load(f)
    
print('Training set size:',len(bird_labs['annotations']))
print('Validation set size:',len(bird_labs_val['annotations']))
print('Test set size:',len(bird_labs_test['annotations']))

# move labels into our data folder
!cp -r label_files/samples_birds_10* CUB_200_2010
# view an example annotation
bird_labs['annotations'][0]

## Initialize Dataloader

We previously created a dataloader using the ImageFolder dataset class, but that dataloader only output images. That's only half of the equation for training. We also need bounding box coordinates so that we can calculate the error that our model propagates backwards through its weights.

Typically when you are training on a relatively small dataset, you want to use some form of augmentation. Torchvision has a variety of transforms, but another great library we can use for generating image augmentations is the library [imgaug](https://imgaug.readthedocs.io/en/latest/). Imgaug allows you to specify sequences of transforms with different probabilities. Take a look at the get_transform function to see how the sequence is constructed. Let's try building a dataloader from scratch using imgaug to implement some transforms.
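
One detail that makes detection augmentation different from classification augmentation is that geometric transforms must be applied to the boxes as well as to the pixels. The sketch below shows the coordinate arithmetic for a horizontal flip on an `xyxy` box. It illustrates the idea only and is not imgaug's implementation; `flip_box_horizontally` is a hypothetical helper and the box values are made up.

```python
def flip_box_horizontally(box_xyxy, image_width):
    """Mirror an (x1, y1, x2, y2) box around the vertical centre line of the image."""
    x1, y1, x2, y2 = box_xyxy
    # the old right edge becomes the new left edge, and vice versa
    return (image_width - x2, y1, image_width - x1, y2)

box = (30, 40, 130, 200)  # made-up box in a 256-pixel-wide image
flipped = flip_box_horizontally(box, image_width=256)
print(flipped)  # (126, 40, 226, 200)
```

Flipping twice returns the original box, which is a quick sanity check for this kind of transform.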

In [None]:
from torchvision import transforms as T
from imgaug import augmenters as iaa


def create_targ_list(target):
    """
    Create list of bounding boxes
    """
    targ_list = []
    box_list = []
    lab_list = []
    image_id_list = []
    for targ in target:
        for i in range(len(target[targ])):
            if targ=='boxes':
                box_list.append(target[targ][i])
            elif targ=='labels':
                lab_list.append(target[targ][i])
            elif targ=='image_id':
                image_id_list.append(target[targ][i])

    for i in range(len(box_list)):
        targ = {
            "boxes":box_list[i].unsqueeze(0),
            "labels":lab_list[i].unsqueeze(0),
            "image_id":image_id_list[i].unsqueeze(0)
        }
        targ_list.append(targ)
    return targ_list

def get_transform(train, size=224):
    """
    Create imgaug transformation pipeline.
    """
    if train:
        # create sequence of transforms
        seq = iaa.Sequential([
            # flip left or right
        iaa.flip.Fliplr(p=0.5),
            # flip up or down
        iaa.flip.Flipud(p=0.5),
            # randomly adjust brightness
        iaa.MultiplyBrightness(mul=(0.9, 1.1)),
            # resize image
        iaa.size.Resize((size,size))
        ])
    else:
        seq = iaa.Sequential([
            iaa.size.Resize((size,size))
        ])
    return seq

normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])

class BirdDataset(torch.utils.data.Dataset):
    """
    Bird dataset object. Implements methods for retrieving images and bounding boxes
    """
    def __init__(self, root, transforms, sample_file='samples_birds_10_train.json', size=256, channels=3): # samples_birds_10_train
        self.root = root
        self.transforms = transforms
        self.size = size
        self.channels = channels
        self.sample_file = sample_file
        with open(os.path.join(self.root, self.sample_file), 'r') as f:
            labs = json.load(f)
        self.imgs = labs['images']
        self.labs = labs['annotations']
        # we use a JSON file containing paths to our images as well as our annotations

    def __getitem__(self, idx):
        # load images and boxes
        img_path = os.path.join(self.root,'images', self.imgs[idx]['file_name'])
        img = Image.open(img_path)
        annot = self.labs[idx]
        if self.channels==1:
            img = ImageOps.grayscale(img)

        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(annot['bbox'], dtype=torch.float32)
        # convert box from xywh to xyxy
        boxes = torchvision.ops.boxes.box_convert(boxes, 'xywh', 'xyxy')
        labels = torch.tensor((annot['category_id'],), dtype=torch.long)
        image_id = torch.tensor([idx], dtype=torch.long)
        area = torch.tensor(annot['area'], dtype=torch.long)
        iscrowd = torch.tensor(annot['iscrowd'], dtype=torch.bool)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target['iscrowd'] = iscrowd

        if self.transforms:
            img, boxes = self.transforms(image=np.array(img), bounding_boxes=np.reshape(np.array(boxes),(1,1,np.array(boxes).shape[0]))) 
            boxes = np.clip(boxes, 0, max(img.shape))
            target['boxes'] = torch.tensor(boxes, dtype=torch.float32).squeeze(dim=1)
            img = torch.tensor(np.reshape(img/255, ( self.channels, self.size, self.size)), dtype=torch.float32) 
            img = normalize(img)
            
        return img, target

    def __len__(self):
        return len(self.imgs)

# initialize our transforms
seq = get_transform(train=True, size=256)
# initialize our dataset object
bdataset = BirdDataset('CUB_200_2010', seq, size=256)
# initialize dataloader
train_loader = DataLoader(bdataset,
                         batch_size=1,
                         shuffle=False,
                         num_workers=0,
                         pin_memory=False)

### DataLoader output

Let's look at the target output of our dataloader

In [None]:
img, target = next(iter(train_loader))
target

## Visualize dataloader output

Now that we've created our dataloader, let's take a look at the output image. We should be able to notice some transforms being randomly applied to our input image. We can adjust the probability of applying a given transform to our data by changing the parameters in our transform sequence.

In [None]:
fig,ax = plt.subplots(1, figsize=(20,12))
ax.set_title(f'Bird', fontdict={'fontsize':20})
bboxes = target['boxes'].squeeze()

# Display the image
ax.imshow(np.reshape(img.squeeze().detach().numpy(),(256,256,3)))
rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2]-bboxes[0], bboxes[3]-bboxes[1] ,linewidth=1,edgecolor='r',facecolor='none') 
ax.add_patch(rect)
plt.text(bboxes[0], bboxes[1]-10, f"Bird at {bboxes[:]}", bbox=dict(facecolor='white', alpha=0.5)) 

# Run an example through our model

While we can run training on our local machine, it's often better to use an instance specialized for training like [P3](https://aws.amazon.com/ec2/instance-types/p3/) or [P4 instances](https://aws.amazon.com/ec2/instance-types/p4/). We will launch a training job using SageMaker's training feature. Before we do so, though, let's run an example from our dataloader through a Faster R-CNN model to see what the output looks like during training.

In [None]:
# download model for "training loop"
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, num_classes=100)
model_out = model.to(device)

### Reformat targets

We need to slightly reformat our targets in order to work with our model

In [None]:
# create list of targets
targ_list = create_targ_list(target)
# send targets to the device; here that's just the CPU, but with a GPU we would send them to a CUDA device
targets = [{k: v.squeeze(dim=1).to(device) for k, v in t.items()} for t in targ_list]
targets

### Get results from single batch

Let's run an image and set of targets through our model and check our results

In [None]:
# run our image and targets through our model
model(img.to(device), targets)

### Understanding loss outputs

The results contain a set of different losses from the loss functions used by our model. We have 4 loss outputs:
- loss_classifier: the classification loss. Did the model correctly identify the class of the target? Since our model hasn't been trained on these classes, we'd expect this loss to be high.
- loss_box_reg: the bounding box regression loss. How far off was the model in regressing the bounding box coordinates?
- loss_objectness: a loss specific to the region proposal network (RPN). It evaluates the RPN's ability to distinguish an object from background in the image.
- loss_rpn_box_reg: another loss specific to the RPN. It evaluates the RPN's ability to regress bounding box coordinates.
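
During training, these four terms are typically added together into a single scalar that the optimizer minimizes. Here is a minimal sketch of that step, using made-up loss values in a plain dictionary in place of the scalar tensors the model returns:

```python
# hypothetical loss values; the real model returns a dict of scalar tensors
loss_dict = {
    "loss_classifier": 4.61,
    "loss_box_reg": 0.02,
    "loss_objectness": 0.69,
    "loss_rpn_box_reg": 0.01,
}

# the training objective is the sum of the individual loss terms
total_loss = sum(loss_dict.values())
print(f"total_loss: {total_loss:.2f}")  # total_loss: 5.33
```

With tensors, the same sum can be followed by `total_loss.backward()` to compute gradients.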

## Send data to S3

SageMaker has a variety of data source options for training, including [Amazon EFS](https://aws.amazon.com/efs/) (short for Elastic File System) and [FSx for Lustre](https://aws.amazon.com/fsx/lustre/), a high-performance file system ideal for deep learning at massive scale. However, the cheapest and simplest option is [Amazon S3](https://aws.amazon.com/s3/). Amazon S3 is an object storage service that is massively scalable, available, secure, and performant. 

When SageMaker launches a training job, it spins up an instance/instances that are up for the duration of the training job. To get data to these instances, in the case of EFS or FSx for Lustre, these file systems are mounted onto the instance. In the case of S3, the data is copied over from S3 to the EBS volume attached to the training instance, or streamed directly from S3 to the model. 

Since our dataset is a small one, let's send our data to S3 and then our training instances can download it and use it for training. This process will take a couple minutes. 

In [None]:
!aws s3 cp --recursive CUB_200_2010/ s3://{bucket}/fsx_sync/coco-birds --quiet
!aws s3 cp --recursive label_files s3://{bucket}/fsx_sync/coco-birds --quiet
print('Done!')

## Launch SageMaker Training job

Similar to running training in our notebook, we can run the same training job using [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html). SageMaker Training allows us to run a training job that is decoupled from our notebook instance. It's based on [Docker container images](https://www.docker.com/resources/what-container) and allows users to specify a Docker image containing all of the dependencies for running their training code. This allows us to run training on larger multi-GPU instances, like the p4d.24xlarge that has 8 A100 GPUs, only for the duration of the training job. For those who are less familiar with Docker, or who are interested in building on top of prebuilt images, SageMaker has a variety of different [prebuilt Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that can be used as is or extended for model training.

![](display_images/sagemaker-architecture.png)

## Create metric definitions

Since we aren't training on the same instance our notebook is hosted on, we need a way to capture our performance metrics. SageMaker allows users to collect metrics from the output logs of their training jobs. In our case we are going to capture the 4 loss outputs from our Faster R-CNN model as well as the total loss, the learning rate, and the number of training iterations. The following definition specifies the name of each metric collected and the regex used to extract it from the logs. 

In [None]:
# define metrics

# raw strings keep the regex escapes (\s, \.) intact without invalid-escape warnings
metric_definitions=[{
        "Name": "total_loss",
        "Regex": r".*total_loss:\s([0-9\.]+)\s*"
    },
    {
        "Name": "loss_cls",
        "Regex": r".*loss_cls:\s([0-9\.]+)\s*"
    },
    {
        "Name": "loss_box_reg",
        "Regex": r".*loss_box_reg:\s([0-9\.]+)\s*"
    },
    {
        "Name": "loss_rpn_cls",
        "Regex": r".*loss_rpn_cls:\s([0-9\.]+)\s*"
    },
    {
        "Name": "loss_rpn_loc",
        "Regex": r".*loss_rpn_loc:\s([0-9\.]+)\s*"
    }, 
    {
        "Name": "lr",  
        "Regex": r".*lr:\s([0-9\.]+)\s*"
    },
    {
        "Name": "iter",  
        "Regex": r".*iter:\s([0-9\.]+)\s*"
    }
]
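
It can help to sanity-check a metric regex locally before launching a job. The sketch below applies the `total_loss` pattern to a made-up log line with Python's `re` module. The log line format is an assumption for illustration (the real format depends on the training script), and SageMaker applies the pattern with its own log scraper, so this is only a rough local check.

```python
import re

# same pattern shape as the total_loss metric definition above
total_loss_regex = r".*total_loss:\s([0-9\.]+)\s*"

# hypothetical log line; the real format depends on the training script
log_line = "iter: 20  total_loss: 1.234  loss_cls: 0.567  lr: 0.0005"

match = re.search(total_loss_regex, log_line)
captured = match.group(1)
print(captured)  # 1.234
```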

## SageMaker Experiments 

Now that we have specified our training metrics above, we are going to need a way to organize and compare our training runs. [Amazon SageMaker experiments](https://aws.amazon.com/blogs/aws/amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings/) lets you organize, track, compare and evaluate machine learning experiments and model versions. We can add experiments tracking to our training jobs using a couple simple hooks. There is a small amount of setup required before we can hook it into our estimators. We first are going to create our experiment, and within our experiment create a trial for our new training job.

In [None]:
# create d2 experiment

from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

d2_experiment = Experiment.create(
    experiment_name=f"d2-birds-demo-{int(time.time())}", 
    description="Detectron2 training on COCO birds", 
    sagemaker_boto_client=sm)
print(d2_experiment,'\n')

## Detectron2 

To this point we have been using models from Torch Hub and torchvision. For our training job, we are going to use a framework that has more sophisticated dependencies, but gives you more flexibility in the types of models you can use. [Detectron2](https://github.com/facebookresearch/detectron2) is a popular PyTorch-based computer vision framework that allows users to train a wide variety of CV deep learning models by supplying YAML configuration files. It has ready-made configurations for object detection, segmentation, and pose estimation models. It performs all of the same functions we performed manually with PyTorch, but simplifies the setup for more complex training options, like multi-GPU and multi-node training. We are going to use Detectron2 to fine-tune a Faster R-CNN variant.

## Define the Estimator

In SageMaker, training jobs are created by initializing an estimator class where we define our training container, our entrypoint, our hyperparameters, and instance types in addition to a few other variables and then launching our training job on the instance or instances we specify.

We first define a set of hyperparameters that we pass to our estimator. When we launch our training job, these hyperparameters, in addition to any source directory we define, will be packaged up and uploaded to our training instance running our Docker image. To see the source code used for this training job, look in the container_training folder. d2_train.py is the primary training script, and it imports from train_funcs.py and build.py. When our training job is launched, this code will be packaged into a tarball and uploaded as sourcedir.tar.gz.

SageMaker Estimators provide a number of built-in hooks to other SageMaker features. SageMaker Debugger gives data scientists the ability to debug, monitor, and profile training jobs in real time! SageMaker Debugger's profiling feature allows us to collect both system and framework level information about our training job. This gives us information ranging from CPU/GPU utilization to detailed descriptions of the most time-consuming operations in our training job. When we set up our profiling configuration, we tell our estimator how often to record both system and framework level information on our training job. To profile your jobs, simply add a profiler_config to your estimator. To learn more about training job profiling, check out the documentation: [SageMaker Debugger's Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).

For our specific training job, Detectron2 has a wide variety of model architectures with pretrained weights that we can use as a starting point. In our hyperparameters we can define the yaml configuration file that tells the Detectron2 framework what model architecture we want to use. In this case we are using a faster RCNN model with a variant on ResNet 101 as a backbone, but feel free to experiment with different backbones. 

This training job is set to run for 500 iterations, which with a batch size of 8 means we are running 4000 total examples through the model. Since our dataset is 271 images, this equates to roughly training for 15 epochs and will take ~25 minutes. For more accurate results try 2000 iterations, but be aware this will take ~40 minutes.
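
The epoch estimate above is simple arithmetic, reproduced here so it is easy to redo when you change the iteration count or batch size. The variable names mirror the hyperparameters used in this notebook; the dataset size is the figure quoted in the text.

```python
num_iters = 500      # training iterations
ims_per_batch = 8    # images per iteration
dataset_size = 271   # training images, as stated above

images_seen = num_iters * ims_per_batch
epochs = images_seen / dataset_size
print(images_seen, round(epochs, 1))  # 4000 14.8
```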

Our training script is set up for distributed training, but multiple GPU instances aren't necessary for our use case, so let's configure our job to launch on an ml.p3.2xlarge GPU instance!

In [None]:
# run detectron2 training job 

# create experiment trial
trial_name = f"d2-demo-training-job-{int(time.time())}"
d2_trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=d2_experiment.experiment_name,
    sagemaker_boto_client=sm,
)
print(d2_trial)

# our training container, this is a SageMaker PyTorch container
image_uri = f'763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04'

# set our hyperparameters
hyperparameters = {
    # number of iterations
    'num_iters': 500,
    # can try other configs:  # retinanet_R_101_FPN_3x.yaml # faster_rcnn_R_101_FPN_3x
    'model_config': 'COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml',
    # number of top-scoring proposals to keep before non-maximum suppression
    "pre_nms_topk_train":4000,
    # number of bird classes
    'sample_size': 10,
    # images per batch
    'ims_per_batch':8,
    # data loader workers
    'num_dataloader_workers':4,
    # learning rate steps for our learning rate scheduler
    'lr_steps':'750,1250',
    # whether to use PyTorch distributed data parallel
    'ddp':1,
    # non-maximum suppression threshold
    "nms_thresh":.8,
    "opts":""
}

# SageMaker profiler configuration, this tells the profiler how often to check the training job
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=1000,
)

estimator = PyTorch(role=role,
                    # number of instances to use for training
                      instance_count=1,
                    # the instance types to use for training 
                      instance_type= 'ml.p3.2xlarge',
                    # our training entrypoint
                      entry_point='d2_train.py',
                    # our source directory containing all dependencies needed for training
                      source_dir='container_training',
                    # the ECR URI for our training container
                      image_uri=image_uri,
                    # the size of the EBS volume attached to our training instance/s
                      volume_size=200,
                    # where to store the output of our training job
                      output_path=f"s3://{bucket}/{prefix_output}",
                    # the name of our training job
                      base_job_name=f"frcnn101-bsz{hyperparameters['ims_per_batch']}-iters{hyperparameters['num_iters']}-{hyperparameters['sample_size']}bird-p3-2x", 
                    # our SageMaker Debugger Profiler configuration
                      profiler_config=profiler_config,
                    # our hyperparameters specified above
                      hyperparameters=hyperparameters,
                    # specify our metric definitions 
                    metric_definitions=metric_definitions,

                     )

### Launch training job

Now that we've defined our estimator, we call estimator.fit() to launch the training job. Within the fit method we specify our data source (in this case S3), whether to wait (which blocks the notebook until the job finishes and prints the logs), and any experiment configuration we want to pass in. 

Run the below cell to start training. The first ten minutes of the job consist of downloading the data and the training container. We are setting wait=True so that we can print out all of the training logs to our notebook. You will want to right click on the cell and select "Enable Scrolling for Outputs" to allow for scrolling through the cell output.

You can also monitor the job's progress using SageMaker Studio's Experiments and trials viewer. To see how the viewer works take a look at the gif below.

In [None]:
estimator.fit({'train' : f's3://{bucket}/fsx_sync/coco-birds'}, 
              wait=True,
              experiment_config={
            "ExperimentName": d2_experiment.experiment_name,
            "TrialName": d2_trial.trial_name,
            "TrialComponentDisplayName": "Training-p3-3",
        })

training_job_name = estimator.latest_training_job.name
print(training_job_name)
print(d2_trial.trial_name)

### Query job status

We can check the status of our training job by calling the describe method. We want to wait until our job is completed before deploying an endpoint as otherwise our model artifacts will not be ready to deploy.

In [None]:
# query our job status
estimator.latest_training_job.describe()['TrainingJobStatus']
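
If you launch a job without waiting, you can poll the status until it reaches a terminal state. The helper below is a hypothetical sketch, not part of the SageMaker SDK. It takes any callable that returns the current status string, so it can be exercised here with a stand-in instead of a real training job.

```python
import time

def wait_for_training_job(get_status, poll_seconds=30, max_polls=240):
    """Poll `get_status()` until the job reaches a terminal state and return it."""
    terminal_states = {"Completed", "Failed", "Stopped"}
    for _ in range(max_polls):
        status = get_status()
        if status in terminal_states:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not finish within the polling budget")

# stand-in status source so the sketch runs without an AWS account
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_training_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)  # Completed
```

In this notebook, the callable could be `lambda: estimator.latest_training_job.describe()['TrainingJobStatus']`.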

## Monitor training job using SageMaker Studio

One of the benefits of using SageMaker Studio is you can access the built in job viewer. The below gif demonstrates how to visualize a given job using the Components and registries tab. If we open our training job's debugger report, we can get real time updates on system utilization and framework level metrics. Go take a look at your training job as it's progressing!

![](display_images/debugger-studio-insights-open.gif)

## Get experiment results

Once our training job is complete, we want to evaluate the results of our training run. An easy way to do so is by using SageMaker Experiments. In addition to the UI in SageMaker Studio, SageMaker Experiments allows users to import their results as a dataframe so they can easily evaluate their training runs.

### View experiments DataFrame

Once we have associated our experiment trials, we can import them as a DataFrame. We can incorporate search expressions to narrow down training runs with specific attributes and sort our trials by specified metrics. The experiments will track all of your set hyperparameters, making it easier to evaluate the effects of changing them. 

In [None]:
### to create a search expression use the following syntax
# search_expression = {
#     "Filters":[
#         {
#             "Name": "DisplayName",
#             "Operator": "Equals",
#             "Value": "Training",
#         }
#     ],
# }

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,  
    experiment_name=d2_experiment.experiment_name,
#     search_expression=search_expression,
    sort_by="metrics.test:accuracy.max",
    sort_order="Descending",
#     metric_names=['test:accuracy'],
)

trial_df = trial_component_analytics.dataframe()
trial_df

## Deploy an endpoint

Our model has now been fine-tuned for our new object detection task and performed well on our validation set. What do we do now? SageMaker provides the ability for data scientists to easily deploy trained models to persistent endpoints using our pre-built serving containers. We simply need to specify our model object, an entrypoint for our serving container, and an execution role. Since we trained our model in SageMaker, it's already in the correct format. If you are using a pre-defined SageMaker hosting image, which we are, we only have to specify a few specific functions in our inference.py script:
- input_fn: specifies how our endpoint handles inputs from our client
- model_fn: specifies how our endpoint loads the model
- predict_fn: specifies how our endpoint runs inference
- output_fn: specifies how our endpoint returns predictions to the client


Take a look at inference.py inside of the container_serving folder to get a closer look. For more information on SageMaker deployment check out the [SageMaker deployment documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html).
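
To show how the four functions fit together, here is a minimal sketch of the handler chain. It is not the inference.py from this workshop: the model is a stand-in callable that returns a fixed, made-up prediction, and the bodies are deliberately trivial. Only the four function names and the prediction keys come from this notebook.

```python
import json

def model_fn(model_dir):
    """Load the model; a stand-in callable here instead of real weights."""
    def dummy_model(image_bytes):
        # made-up output that mimics the keys used later in this notebook
        return {"scores": [0.9], "pred_boxes": [[10.0, 20.0, 110.0, 220.0]], "pred_classes": [0]}
    return dummy_model

def input_fn(request_body, request_content_type):
    """Validate and decode the incoming request."""
    if request_content_type != "image/jpeg":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return request_body

def predict_fn(input_data, model):
    """Run inference on the decoded input."""
    return model(input_data)

def output_fn(prediction, accept):
    """Serialize the prediction for the client."""
    return json.dumps(prediction)

# chain the handlers the way a serving container would for a single request
loaded_model = model_fn("/opt/ml/model")
decoded = input_fn(b"fake-jpeg-bytes", "image/jpeg")
serialized = output_fn(predict_fn(decoded, loaded_model), "json")
print(serialized)
```

Keeping the four steps separate means you can change how requests are decoded or how responses are serialized without touching the model loading code.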

Let's try deploying our model to a persistent endpoint! This process will take a few minutes as SageMaker has to provision and spin up the hosting instances.

In [None]:

predictor = PyTorchModel( model_data = estimator.model_data.replace('model.tar.gz','output.tar.gz'),
             role=role,
             entry_point = 'inference.py',
             source_dir = 'container_serving',
            framework_version = '1.7.1',
            py_version='py36'
    )

endpoint_name = f'bird-d2-frcnn-detector-{int(time.time())}'

if estimator.latest_training_job.describe()['TrainingJobStatus'].lower() == 'completed':
    predictor = predictor.deploy(initial_instance_count=1, 
                                 instance_type='ml.g4dn.xlarge', 
                                 endpoint_name = endpoint_name)
else:
    print('Please wait to deploy your endpoint until your training job has completed!')

### Setup label categories 

To map our model output to human readable text we need to specify a dictionary with our class labels. 

In [None]:
# open our test set 
with open('label_files/samples_birds_10_test.json', 'r') as f:
    bird_labs_test = json.load(f)
    
# dictionary of class names
bird_cats = {
    0:"001.Black_footed_Albatross",
    1:"005.Crested_Auklet",
    2:"009.Brewer_Blackbird",
    3:"013.Bobolink",
    4:"017.Cardinal",
    5:"021.Eastern_Towhee",
    6:"025.Pelagic_Cormorant",
    7:"029.American_Crow",
    8:"033.Yellow_billed_Cuckoo",
    9:"037.Acadian_Flycatcher"
}

## Get predictions from our endpoint

Now that we have deployed our model, we can make requests to it by sending over encoded images. There are several options for calling our endpoint: we can call it using our predictor object, or we can call it using the [boto3 API](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). The boto3 API allows you to easily integrate model predictions into applications using tools like [AWS Lambda](https://aws.amazon.com/lambda/). You can also make a SageMaker endpoint into a [standalone REST API](https://aws.amazon.com/blogs/machine-learning/creating-a-machine-learning-powered-rest-api-with-amazon-api-gateway-mapping-templates-and-amazon-sagemaker/).

Now that your endpoint is deployed, try it out on some example images!

In [None]:
%%time

i = 0
pred_image = f"CUB_200_2010/images/{bird_labs_test['images'][i]['file_name']}"
print(pred_image.split('/')[-1])

client = boto3.client('sagemaker-runtime')
accept_type = "json" # "json" or "detectron2". Won't impact predictions, just different deserialization pipelines.
content_type = 'image/jpeg'
headers = {'content-type': content_type}
# read the image bytes; the context manager closes the file for us
with open(pred_image, 'rb') as f:
    payload = f.read()

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=payload,
    ContentType=content_type,
    Accept = accept_type
)

pred = eval(response['Body'].read().decode('utf-8'))
pred_frame = pd.DataFrame()
pred_frame['scores'] = pred['scores']
pred_frame['pred_boxes'] = pred['pred_boxes']
pred_frame['pred_classes'] = pred['pred_classes']
# pred_frame = pred_frame[pred_frame.scores>0.5]
pred_frame.head()

## View our predictions

Let's visualize the predictions from our endpoint.

In [None]:
# load image
img = Image.open(pred_image)
# load bounding boxes
bboxes = bird_labs_test['annotations'][i]['bbox']

fig,ax = plt.subplots(1, figsize=(20,12))
gt_label = bird_labs_test['images'][i]['file_name'].split('/')[0]
ax.set_title(f"{gt_label}", fontdict={'fontsize':20})

# Display the image
ax.imshow(img)
# add bounding box vis
# the ground truth bbox is in xywh format, so width and height are used directly
rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2], bboxes[3], linewidth=1, edgecolor='g', facecolor='none')
ax.add_patch(rect)
# add text label
plt.text(bboxes[0], bboxes[1]+5, f"Ground Truth {gt_label} at {bboxes[:]}", bbox=dict(facecolor='green', alpha=0.5)) 

pred = pred_frame.loc[0]
pred_box = pred['pred_boxes']
pred_box = torchvision.ops.boxes.box_convert(torch.tensor(pred_box), 'xyxy', 'xywh')
rect = patches.Rectangle((pred_box[0], pred_box[1]), pred_box[2], pred_box[3],linewidth=1,edgecolor='r',facecolor='none')  # -pred_box[0] -pred_box[1] 
ax.add_patch(rect)
# add text label
plt.text(pred_box[0], pred_box[1]-5, f"Predicted {bird_cats[pred['pred_classes']]} probability {round(pred['scores'],2)} at {pred_box[:].detach().numpy()}", bbox=dict(facecolor='red', alpha=0.5)) 

## Evaluate our predictions

Let's take a look at how we did from an IoU perspective. The `box_iou` operation expects both boxes in xyxy format. Our ground truth bounding box is in xywh format, and we converted our predicted box to xywh for plotting, so we convert both to xyxy first.

In [None]:
# convert both boxes to xyxy, the format box_iou expects
gt_box = torchvision.ops.boxes.box_convert(torch.tensor(bboxes, dtype=torch.float32), 'xywh', 'xyxy')
pred_box_xyxy = torchvision.ops.boxes.box_convert(torch.as_tensor(pred_box, dtype=torch.float32), 'xywh', 'xyxy')
print('Ground Truth box:',gt_box)
print('Predicted box:', pred_box_xyxy)

# run box_iou operation
box_iou = torchvision.ops.box_iou(pred_box_xyxy.unsqueeze(dim=0), gt_box.unsqueeze(dim=0)) 
print('\n Bounding Box IoU:',round(float(box_iou),4))
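
To see what the IoU operation computes, here is a minimal pure-Python sketch with made-up boxes. The helper names `xywh_to_xyxy` and `iou_xyxy` are hypothetical; in the notebook itself torchvision's `box_convert` and `box_iou` do this work.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def iou_xyxy(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

gt = xywh_to_xyxy((10, 10, 100, 100))  # -> (10, 10, 110, 110)
pred = (60, 10, 160, 110)              # already xyxy
iou = iou_xyxy(gt, pred)
print(round(iou, 4))  # 0.3333
```

The two boxes overlap in a 50x100 region, so the IoU is 5000 / (10000 + 10000 - 5000). Mixing formats (for example, passing an xywh box where xyxy is expected) silently produces a wrong value, which is why the conversion step matters.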

## Evaluate our test set

Now that we have set up our endpoint and gotten some results, let's evaluate our test set. 

In [None]:
pred_dict = {}
pred_classes = []
gt_classes = []
bbox_ious = []
accept_type = "json" # "json" or "detectron2". Won't impact predictions, just different deserialization pipelines.
content_type = 'image/jpeg'
for i in tqdm(range(len(bird_labs_test['images']))):
    pred_image = f"CUB_200_2010/images/{bird_labs_test['images'][i]['file_name']}"

    headers = {'content-type': content_type}
    # read the image bytes; the context manager closes the file after each request
    with open(pred_image, 'rb') as f:
        payload = f.read()

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=payload,
        ContentType=content_type,
        Accept = accept_type
    )

    pred = eval(response['Body'].read().decode('utf-8'))
    pred_frame = pd.DataFrame()
    pred_frame['scores'] = pred['scores']
    pred_frame['pred_boxes'] = pred['pred_boxes']
    pred_frame['pred_classes'] = pred['pred_classes']
    pred_frame = pred_frame[pred_frame.scores>0.5]
    pred_dict[pred_image] = {}
    pred_dict[pred_image]['pred_dict'] = pred_frame
    pred_dict[pred_image]['gt'] = bird_labs_test['annotations'][i]
    try:
        # box_iou expects both boxes as (x1, y1, x2, y2): the prediction already is,
        # so convert the xywh ground truth box instead
        pred_box = torch.tensor(pred_dict[pred_image]['pred_dict']['pred_boxes'][0]).unsqueeze(dim=0)
        gt_box = torch.tensor(pred_dict[pred_image]['gt']['bbox'], dtype=torch.float).unsqueeze(dim=0)
        gt_box = torchvision.ops.boxes.box_convert(gt_box, 'xywh', 'xyxy')
        box_iou = torchvision.ops.box_iou(pred_box, gt_box) 
        bbox_ious.append(float(box_iou))
        pred_classes.append(pred_dict[pred_image]['pred_dict']['pred_classes'][0])
        gt_classes.append(pred_dict[pred_image]['gt']['category_id'])
        pred_dict[pred_image]['box_iou'] = float(box_iou)
        pred_dict[pred_image]['pred_class'] = pred_dict[pred_image]['pred_dict']['pred_classes'][0]
        pred_dict[pred_image]['gt_class'] = pred_dict[pred_image]['gt']['category_id']-1
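    except Exception:  # no detection above the score threshold (a bare except would also swallow KeyboardInterrupt)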
        print(pred_image, "returned empty")
        pred_dict[pred_image]['box_iou'] = 0
        pred_dict[pred_image]['pred_class'] = 0
        pred_dict[pred_image]['gt_class'] = pred_dict[pred_image]['gt']['category_id']-1
        bbox_ious.append(0)
        pred_classes.append(11)
        gt_classes.append(pred_dict[pred_image]['gt']['category_id'])
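
The loop above keeps only detections scoring above 0.5 and falls back to a placeholder when nothing survives. A minimal sketch of that filtering step in isolation is below. It uses a hypothetical response body that contains only the three fields the loop reads (`scores`, `pred_boxes`, `pred_classes`). The values are made up, and `top_detection` is an illustrative helper, not part of the lab code.

```python
import json

# hypothetical response body, shaped like the fields the evaluation loop reads
sample_body = json.dumps({
    "scores": [0.92, 0.41, 0.07],
    "pred_boxes": [[12.0, 30.5, 210.0, 180.0],
                   [5.0, 5.0, 60.0, 40.0],
                   [0.0, 0.0, 20.0, 20.0]],
    "pred_classes": [3, 7, 3],
})


def top_detection(body, min_score=0.5):
    """Return (score, box, class) for the best detection above min_score, or None."""
    pred = json.loads(body)
    detections = [
        (score, box, cls)
        for score, box, cls in zip(pred["scores"], pred["pred_boxes"], pred["pred_classes"])
        if score > min_score
    ]
    if not detections:
        return None  # the caller decides how to record "no prediction"
    return max(detections, key=lambda d: d[0])


print(top_detection(sample_body))                  # highest-scoring detection
print(top_detection(sample_body, min_score=0.95))  # None: nothing clears the bar
```

Returning `None` explicitly makes the empty case visible to the caller instead of relying on an exception to signal it.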

## Heatmap visualization

If we set an IoU threshold, we can evaluate precision and recall much as we did when calculating mAP. Let's visualize our performance with a heatmap:

In [None]:
bird_cats[10] = 'no_prediction'
def plot_heatmap(pred_classes, bbox_ious, gt_classes, bird_cats, threshold = 0.5):
    plt.figure(figsize=(20,12))
    plt.title('Model Test Performance', fontdict={'fontsize':20})
    # relabel predictions whose IoU falls below the threshold as "no prediction" (label 11)
    filt_preds = np.array(pred_classes)
    filt_preds[np.array(bbox_ious) < threshold] = 11
    heatmap(confusion_matrix(filt_preds, gt_classes), annot=True, xticklabels=bird_cats.values(), yticklabels=bird_cats.values())

plot_heatmap(pred_classes, bbox_ious, gt_classes, bird_cats, threshold = 0.5)
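
To put numbers next to the heatmap, here is a hedged sketch of precision and recall at a fixed IoU threshold. It assumes one ground-truth box per image, at most one prediction per image, and that predicted and ground-truth labels share the same indexing. Adjust it if your labels are offset. The function name, the `no_pred_label` argument, and the toy lists are illustrative only.

```python
def precision_recall_at_iou(pred_labels, ious, gt_labels, no_pred_label, threshold=0.5):
    """Precision and recall when a hit needs the right class AND IoU >= threshold."""
    tp = fp = 0
    for pred, iou, gt in zip(pred_labels, ious, gt_labels):
        if pred == no_pred_label:
            continue  # no detection: hurts recall only, not precision
        if pred == gt and iou >= threshold:
            tp += 1
        else:
            fp += 1  # wrong class, or box overlaps too little

    n_pred = tp + fp          # detections actually made
    n_gt = len(gt_labels)     # one ground-truth object per image (assumption)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall


# toy data: 99 stands in for "no prediction"
toy_preds = [0, 1, 2, 99, 1]
toy_ious = [0.8, 0.3, 0.7, 0.0, 0.9]
toy_gts = [0, 1, 1, 2, 1]

toy_precision, toy_recall = precision_recall_at_iou(toy_preds, toy_ious, toy_gts, no_pred_label=99)
print(toy_precision, toy_recall)  # 0.5 0.4
```

Of the four detections made, two have the right class and enough overlap, giving a precision of 2/4. Those two hits out of five ground-truth objects give a recall of 2/5.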

Overall pretty good results! You just deployed and tested a model you fine-tuned on SageMaker! This is just scratching the surface of SageMaker's capabilities. Even though we've deployed our model, the ML cycle doesn't stop there. We will want to monitor our model endpoint to make sure it keeps performing as we expect, and if it doesn't, we need to label more training data and retrain the model. Once your model is in production, you can use tools like [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) to automate these processes and build continuous integration and continuous delivery directly into your ML workflows!

## (Optional) Deploy an SSD endpoint

What if we just want to deploy an already-trained model as an endpoint, or a model we trained outside of SageMaker? It's easy! First we download the pre-trained SSD weights with `torch.hub`, package them up as a tarball, and upload them to S3. You can follow similar steps for a model you trained outside of SageMaker.

In [None]:
import tarfile

checkpoint = torch.hub.load_state_dict_from_url("https://api.ngc.nvidia.com/v2/models/nvidia/ssdpyt_fp32/versions/2/zip", 
                                                map_location=device) 
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["model"].items()}
torch.save(state_dict, 'nvidia_ssd.pth')
with tarfile.open('nvidia_model.tar.gz', mode='w:gz') as f:
    f.add('nvidia_ssd.pth')

!aws s3 cp nvidia_model.tar.gz s3://{bucket}/ssd_model/nvidia_model.tar.gz
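
Before uploading, it can help to confirm what actually went into the archive, since SageMaker unpacks the tarball into the model directory that the serving script reads from. The sketch below builds a throwaway archive from a dummy file in a temporary directory and lists its members. The file names are placeholders, not the real weights.

```python
import os
import tarfile
import tempfile

# build a throwaway archive from a dummy file so nothing in the working directory is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "dummy_weights.pth")  # placeholder file name
    with open(weights_path, "wb") as f:
        f.write(b"not real weights")

    archive_path = os.path.join(tmp_dir, "dummy_model.tar.gz")
    with tarfile.open(archive_path, mode="w:gz") as tar:
        # arcname keeps the file at the archive root instead of embedding the temp path
        tar.add(weights_path, arcname="dummy_weights.pth")

    # read the archive back and list what is inside
    with tarfile.open(archive_path, mode="r:gz") as tar:
        member_names = tar.getnames()

print(member_names)  # ['dummy_weights.pth']
```

The same `getnames()` check works on `nvidia_model.tar.gz`. If a member shows up with a directory prefix, the serving script has to look for it under that sub-path.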

### Deploy pre-trained model

Once we have our model weights in S3, we can deploy our endpoint with our serving script.

In [None]:
endpoint_name = f'nvidia-ssd-gpu-uint8-{int(time.time())}'

# describe the model: weights in S3 plus the serving script in container_serving/
ssd_model = PyTorchModel(
    model_data=f's3://{bucket}/ssd_model/nvidia_model.tar.gz',
    role=role,
    entry_point='inference_ssd.py',
    source_dir='container_serving',
    framework_version='1.7.1',
    py_version='py36',
)

# deploy() returns a predictor bound to the new endpoint
ssd_predictor = ssd_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',  # GPU instance; ml.m5.xlarge is a CPU alternative
    endpoint_name=endpoint_name,
)
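
The `entry_point` script is where the SageMaker PyTorch serving container looks for its handler functions: `model_fn`, `input_fn`, `predict_fn`, and `output_fn`. The contents of `inference_ssd.py` are not shown in this notebook, so the sketch below is not that script. It only illustrates the handler contract, with a stand-in "model" that returns a fixed, made-up detection so the call order can be dry-run locally without a GPU or real weights.

```python
import json


def model_fn(model_dir):
    """Called once at container start; a real script loads weights from model_dir."""
    def dummy_model(image_bytes):
        # stand-in for a network forward pass: always returns one made-up detection
        return {"scores": [0.9], "pred_boxes": [[0.1, 0.2, 0.5, 0.6]], "pred_classes": [1]}
    return dummy_model


def input_fn(request_body, request_content_type):
    """Turn the raw request into something predict_fn can consume."""
    if request_content_type != "image/jpeg":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return request_body  # a real script would decode the JPEG into a tensor here


def predict_fn(input_data, model):
    """Run the model on the decoded input."""
    return model(input_data)


def output_fn(prediction, accept):
    """Serialize the prediction for the client."""
    return json.dumps(prediction)


# local dry run in the same order the serving container calls the handlers
dry_run_model = model_fn("/opt/ml/model")
dry_run_input = input_fn(b"\xff\xd8fake-jpeg-bytes", "image/jpeg")
dry_run_output = output_fn(predict_fn(dry_run_input, dry_run_model), "json")
print(dry_run_output)
```

Keeping each handler small makes it easy to test the request path on your laptop before paying for an endpoint.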

### Get model predictions

Now let's test out our deployed model!

In [None]:
%%time

i = 2
pred_image = f"CUB_200_2010/images/{bird_labs_test['images'][i]['file_name']}"
print(pred_image)
# endpoint_name = 'nvidia-ssd-gpu-uint8-1629904899'

runtime_client = boto3.client('sagemaker-runtime')
accept_type = "json" # "json" or "detectron2". Won't impact predictions, just different deserialization pipelines.
content_type = 'image/jpeg'
# read the image bytes; the context manager closes the file once they are loaded
with open(pred_image, 'rb') as f:
    payload = f.read()

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=payload,
    ContentType=content_type,
    Accept = accept_type
)

# parse the JSON response body (safer than eval on data returned over the network)
response = json.loads(response['Body'].read().decode('utf-8'))
response

### Visualize detections

Let's verify our predictions look how we expect them to.

In [None]:
# plot our detections
for image_idx in range(len(best_results_per_input)):
    fig, ax = plt.subplots(1, figsize = (20,12))
    # Show original, denormalized image...
    image = inputs[image_idx] / 2 + 0.5
    ax.imshow(image)
#     rect = patches.Rectangle((bboxes[0], bboxes[1]), bboxes[2], bboxes[3] ,linewidth=1,edgecolor='g',facecolor='none') 
#     ax.add_patch(rect)
#     plt.text(bboxes[0], bboxes[1]-5, f"Ground Truth Albatross", bbox=dict(facecolor='white', alpha=0.5)) 

    # ...with detections
    sbboxes, classes, confidences = best_results_per_input[image_idx]
    for idx in range(len(sbboxes)):
        left, bot, right, top = sbboxes[idx]
        x, y, w, h = [val * 300 for val in [left, bot, right - left, top - bot]]
        rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x, y, "Prediction: {} {:.0f}%".format(classes_to_labels[classes[idx] - 1], confidences[idx]*100), bbox=dict(facecolor='white', alpha=0.5),
               fontdict = {'fontsize':12})


## Conclusion

Congratulations, you made it through the lab! 

We've walked through working with images, using PyTorch, evaluating object detection models, and training and deploying them on SageMaker. We sincerely hope you've enjoyed this tutorial. Now it's time for you to take what you've learned here and apply it to your own computer vision problems!

## Cleanup

Please delete your deployed endpoints so they stop incurring charges.

In [None]:
predictor.delete_endpoint()
ssd_predictor.delete_endpoint()