# The Architecture of the Visual Cortex

David H. Hubel and Torsten Wiesel performed a series of experiments on cats in
1958 and 1959 (and a few years later on monkeys), giving crucial insights into
the structure of the visual cortex. The authors received the Nobel Prize in
Physiology or Medicine in 1981 for this work.

In particular, they showed that many neurons in the visual cortex have a small
**local receptive field**, meaning they react only to visual stimuli located in
a limited region of the visual field (see Figure 12-1, in which the local
receptive fields of five neurons are represented by dashed circles). The
receptive fields of different neurons may overlap, and together they tile the
whole visual field.

**Figure 12-1.** Biological neurons in the visual cortex respond to specific
patterns in small regions of the visual field called receptive fields; as the
visual signal makes its way through consecutive brain modules, neurons respond
to more complex patterns in larger receptive fields.

Moreover, the authors showed that some neurons react only to images of
horizontal lines, while others react only to lines with different orientations
(two neurons may have the same receptive field but react to different line
orientations). They also noticed that some neurons have larger receptive
fields, and they react to more complex patterns that are combinations of the
lower-level patterns.

These observations led to the idea that higher-level neurons are based on the
outputs of neighboring lower-level neurons. In Figure 12-1, each neuron is
connected only to nearby neurons from the previous layer. This powerful
architecture is able to detect all sorts of complex patterns in any area of the
visual field.

These studies of the visual cortex inspired the **neocognitron**, introduced in
1980, which gradually evolved into what we now call **convolutional neural
networks (CNNs)**. An important milestone was a 1998 paper by Yann LeCun et al.
that introduced the famous **LeNet-5** architecture, which became widely used by
banks to recognize handwritten digits on checks.

This architecture has some building blocks that you already know, such as fully
connected layers and sigmoid activation functions, but it also introduces two
new building blocks:

- **Convolutional layers**
- **Pooling layers**

We will examine these next.

---

## Note

Why not simply use a deep neural network with fully connected layers for image
recognition tasks?

Although this works fine for small images (e.g., Fashion MNIST), it breaks down
for larger images because of the huge number of parameters it requires. For
example, a 100 × 100 pixel image has 10,000 pixels. If the first layer has just
1,000 neurons (which already severely restricts the amount of information
transmitted to the next layer), this results in **10 million connections**—and
that is only the first layer.

CNNs solve this problem using **partially connected layers** and **weight
sharing**.
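
The arithmetic in this note can be checked with a few lines of Python. This is only a minimal sketch: the helper name `dense_connections` and the 224 × 224 RGB example are illustrative assumptions, not part of the original text.

```python
def dense_connections(height, width, channels, n_neurons):
    """Number of weights in a fully connected layer fed by a flattened image."""
    return height * width * channels * n_neurons

# The example from the note: 100 x 100 pixels (one channel), 1,000 neurons
print(dense_connections(100, 100, 1, 1_000))   # 10000000

# Hypothetical larger input: a 224 x 224 RGB image with the same layer
print(dense_connections(224, 224, 3, 1_000))   # 150528000
```

The count grows linearly with the number of pixels, which is why fully connected first layers become impractical for large images.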


# Convolutional Layers

The most important building block of a CNN is the convolutional layer. Neurons in a convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields.

Each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This allows the network to focus on low-level features first, then combine them into higher-level features in deeper layers.

This hierarchical structure is well-suited for real-world images, which are composed of complex patterns and objects.


## Local Receptive Fields

All the multilayer neural networks we looked at previously had layers composed of a long line of neurons, requiring images to be flattened into 1D vectors.

In a CNN, layers are represented in 2D, making it easier to preserve spatial relationships.

A neuron at row *i*, column *j* is connected to inputs in rows *i* to *i + fh − 1* and columns *j* to *j + fw − 1*, where *fh* and *fw* are the receptive field height and width.


## Zero Padding and Stride

To keep the output size the same as the input size, zeros can be added around the input image — this is called **zero padding**.

Stride defines how far the receptive field moves between applications. A larger stride reduces the spatial dimensions and computational cost.

For stride *(sh, sw)*, the neuron at position *(i, j)* connects to input rows *i × sh* to *i × sh + fh − 1* and columns *j × sw* to *j × sw + fw − 1*.
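
The index ranges above can be written out as a small helper. This is a sketch for illustration only; the function name `receptive_field` and the example values are assumptions.

```python
def receptive_field(i, j, fh, fw, sh=1, sw=1):
    """Return the input row and column ranges seen by the neuron at (i, j)."""
    rows = range(i * sh, i * sh + fh)   # rows i*sh to i*sh + fh - 1
    cols = range(j * sw, j * sw + fw)   # columns j*sw to j*sw + fw - 1
    return rows, cols

# A 3 x 3 receptive field with a stride of 2 in both directions
rows, cols = receptive_field(i=2, j=3, fh=3, fw=3, sh=2, sw=2)
print(list(rows), list(cols))  # [4, 5, 6] [6, 7, 8]
```

With a stride of 1 (the defaults), the ranges reduce to the ones given in the Local Receptive Fields section.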


## Filters (Kernels)

A neuron's weights can be visualized as a small image called a **filter** or **kernel**.

For example:
- A vertical-line filter highlights vertical edges
- A horizontal-line filter highlights horizontal edges

Each filter produces a **feature map**, emphasizing areas where the filter activates strongly.

Filters are learned automatically during training, so you do not have to define them manually.


## Stacking Multiple Feature Maps

A convolutional layer contains multiple filters, producing multiple feature maps.

Each feature map:
- Has one neuron per pixel
- Shares the same weights across all locations
- Uses a unique filter and bias

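This parameter sharing drastically reduces the number of parameters. It also means that once the layer has learned to recognize a pattern in one location, it can recognize it in any other location.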


## Multichannel Inputs

Input images usually have multiple channels:
- RGB images → 3 channels
- Grayscale → 1 channel
- Satellite images → many channels

Each convolutional filter spans **all input channels**, not just one.


## Mathematical Definition (Equation 12-1)

The output of a neuron is computed as a weighted sum of all values in its receptive field across all input channels, plus a bias term.

Variables:
- `z_{i,j,k}` → output at row i, column j, feature map k
- `x` → input values
- `w` → kernel weights
- `b_k` → bias for feature map k
- `sh, sw` → strides
- `fh, fw` → filter dimensions
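
Written out, the weighted sum described above can be expressed as follows. This is a reconstruction from the variable list, not a quotation: the summation indices $u$, $v$, and $k'$, and the symbol $f_{n'}$ for the number of input channels (feature maps in the previous layer), are notational assumptions.

$$
z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \; \sum_{v=0}^{f_w - 1} \; \sum_{k'=0}^{f_{n'} - 1} x_{i',j',k'} \cdot w_{u,v,k',k}
\quad \text{with} \quad
\begin{cases}
i' = i \times s_h + u \\
j' = j \times s_w + v
\end{cases}
$$

Here $x_{i',j',k'}$ is the input at row $i'$, column $j'$, channel $k'$, and $w_{u,v,k',k}$ is the weight connecting position $(u, v)$ of the receptive field in input channel $k'$ to feature map $k$.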


# Implementing Convolutional Layers with PyTorch


In [None]:
import numpy as np
import torch
from sklearn.datasets import load_sample_images

# Load sample images
sample_images = np.stack(load_sample_images()["images"])
sample_images = torch.tensor(sample_images, dtype=torch.float32) / 255


In [None]:
# Inspect shape
sample_images.shape


The tensor has shape:
[batch_size, height, width, channels]

PyTorch expects:
[batch_size, channels, height, width]


In [None]:
# Reorder dimensions
sample_images_permuted = sample_images.permute(0, 3, 1, 2)
sample_images_permuted.shape


In [None]:
import torchvision.transforms.v2 as T

# Center crop images
cropped_images = T.CenterCrop((70, 120))(sample_images_permuted)
cropped_images.shape


## Creating a Convolutional Layer


In [None]:
import torch.nn as nn

torch.manual_seed(42)

conv_layer = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=7
)

fmaps = conv_layer(cropped_images)
fmaps.shape


### Output Shape Explanation

- 32 output channels → 32 feature maps
- Height and width shrink due to no padding
- Kernel size 7 removes 6 pixels per dimension (3 per side), so the 70 × 120 inputs produce 64 × 114 feature maps


## Padding Options

- `padding=0` or `"valid"` → no padding
- `padding="same"` → output size equals input size


In [None]:
conv_layer = nn.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=7,
    padding="same"
)

fmaps = conv_layer(cropped_images)
fmaps.shape


## Stride Effects

A stride greater than 1 reduces spatial dimensions.

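Example (kernel size 7, `padding=3`):
- Input: 70 × 120
- Stride: 2
- Output: 35 × 60

PyTorch does not accept `padding="same"` when the stride is greater than 1, so the padding must be specified explicitly in this case.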

Using large padding with large stride is discouraged.
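
The shapes reported in this section follow from the usual output-size formula for a convolution without dilation. The sketch below assumes that formula and a hypothetical helper named `conv_output_size`; the stride example assumes a padding of 3 on each side.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

# No padding, kernel size 7: each dimension loses 6 pixels
print(conv_output_size(70, 7), conv_output_size(120, 7))  # 64 114

# Padding of 3 on each side with kernel size 7 preserves the size
print(conv_output_size(70, 7, padding=3), conv_output_size(120, 7, padding=3))  # 70 120

# Stride 2 with padding 3 halves each dimension
print(conv_output_size(70, 7, 2, 3), conv_output_size(120, 7, 2, 3))  # 35 60
```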


## Inspecting Layer Parameters


In [None]:
conv_layer.weight.shape


In [None]:
conv_layer.bias.shape


### Parameter Shapes

- Weights: [out_channels, in_channels, kernel_height, kernel_width]
- Biases: [out_channels]

Image size does NOT affect parameter size.
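
As a quick check, the parameter count of the layer created above can be derived from these two shapes alone. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import math

weight_shape = (32, 3, 7, 7)   # [out_channels, in_channels, kernel_height, kernel_width]
bias_shape = (32,)             # [out_channels]

n_params = math.prod(weight_shape) + math.prod(bias_shape)
print(n_params)  # 4736
```

Neither shape mentions the image height or width, so the same 4,736 parameters are used whether the input is 70 × 120 or much larger.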


## Activation Functions and Initialization

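A convolutional layer only performs a linear operation, so a stack of convolutional layers with no activation functions in between would be equivalent to a single convolutional layer. An activation function is therefore needed after each one.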

- Use **ReLU** with **He initialization**
- Biases are typically initialized to zero

Hyperparameters include:
- Number of filters
- Kernel size
- Padding
- Stride
- Activation function


# Pooling Layers

Once you understand how convolutional layers work, the pooling layers are quite easy to grasp. Their goal is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting).

Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just like before. However, a pooling neuron has **no weights or biases**; all it does is aggregate the inputs using an aggregation function such as the **max** or **mean**.

## Max Pooling

Figure 12-9 shows a **max pooling layer**, which is the most common type of pooling layer. In this example, we use a **2 × 2 pooling kernel** with a **stride of 2** and **no padding**. Only the maximum input value in each receptive field is propagated to the next layer, while the other inputs are dropped.

For example, in the lower-left receptive field in Figure 12-9, the input values are 1, 5, 3, and 2, so only the maximum value, **5**, is kept. Because of the stride of 2, the output image has **half the height and half the width** of the input image (rounded down since no padding is used).

> **NOTE**  
> A pooling layer typically operates independently on each input channel, so the output depth (number of channels) remains the same as the input depth.
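
To make the operation concrete, here is a minimal pure-Python sketch of 2 × 2 max pooling with a stride of 2 and no padding on a single channel. The function name and the input grid are hypothetical; only the lower-left block (1, 5, 3, 2) is taken from the example above.

```python
def max_pool_2x2(image):
    """2 x 2 max pooling with stride 2 and no padding on a list of rows."""
    out_h, out_w = len(image) // 2, len(image[0]) // 2  # sizes are rounded down
    return [
        [max(image[2 * r][2 * c], image[2 * r][2 * c + 1],
             image[2 * r + 1][2 * c], image[2 * r + 1][2 * c + 1])
         for c in range(out_w)]
        for r in range(out_h)
    ]

# Hypothetical 4 x 5 input; the lower-left 2 x 2 block holds 1, 5, 3, 2
image = [
    [4, 0, 7, 1, 9],
    [2, 6, 3, 8, 9],
    [1, 5, 0, 2, 9],
    [3, 2, 4, 4, 9],
]
print(max_pool_2x2(image))  # [[6, 8], [5, 4]]
```

The last column is dropped because the width (5) is odd and no padding is used.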

## Translation Invariance

In addition to reducing computation and memory usage, max pooling introduces a degree of **invariance to small translations**. Figure 12-10 illustrates this effect using three images (A, B, C) processed through a max pooling layer with a 2 × 2 kernel and stride 2.

Images B and C are shifted versions of image A. The outputs for images A and B are identical, demonstrating **translation invariance**. For image C, the output is shifted by one pixel, showing partial invariance. By inserting max pooling layers periodically, CNNs can gain translation invariance at larger scales.

Max pooling also provides limited **rotational** and **scale invariance**, which can be useful for tasks like image classification where exact spatial alignment is not critical.

## Limitations of Max Pooling

Max pooling is highly **destructive**. Even with a small 2 × 2 kernel and stride of 2, the output area is reduced by a factor of four, discarding **75% of the input values**.

In some tasks, invariance is undesirable. For example, in **semantic segmentation**, where each pixel must be classified, the output should shift exactly when the input shifts. This property is called **equivariance**, not invariance: small changes in the input should result in corresponding small changes in the output.


# Implementing Pooling Layers with PyTorch

Pooling layers are used to subsample (shrink) feature maps in convolutional neural networks. This helps reduce computational cost, memory usage, and overfitting.

The most common pooling operation is **max pooling**, which keeps only the maximum value in each local receptive field.


In [None]:
max_pool = nn.MaxPool2d(kernel_size=2)


## Average Pooling

To create an **average pooling** layer, we use `nn.AvgPool2d`.  
It behaves exactly like max pooling, except it computes the **mean** instead of the maximum.

Although average pooling loses less information, **max pooling is more popular** because it preserves the strongest features, introduces stronger translation invariance, and is slightly more efficient.


In [None]:
avg_pool = nn.AvgPool2d(kernel_size=2)


## Depthwise Pooling

Pooling can also be applied along the **depth (channel) dimension** instead of the spatial dimensions.  
This can help a CNN become invariant to transformations such as rotation, thickness, brightness, or color.

PyTorch does not include a built-in depthwise max pooling layer, but we can implement one using `max_pool1d`.


In [None]:
import torch.nn.functional as F

class DepthPool(nn.Module):
    def __init__(self, kernel_size, stride=None, padding=0):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride if stride is not None else kernel_size
        self.padding = padding

    def forward(self, inputs):
        batch, channels, height, width = inputs.shape
        Z = inputs.view(batch, channels, height * width)  # merge spatial dims
        Z = Z.permute(0, 2, 1)  # swap spatial and channel dims
        Z = F.max_pool1d(
            Z,
            kernel_size=self.kernel_size,
            stride=self.stride,
            padding=self.padding
        )
        Z = Z.permute(0, 2, 1)  # swap back
        return Z.view(batch, -1, height, width)  # restore spatial dims


### Example Shape Transformation

Assume the input has shape `[2, 32, 70, 120]`:
- 2 images
- 32 channels
- spatial size 70 × 120

Using `kernel_size = 4`, stride = 4, and no padding:

1. Merge spatial dimensions → `[2, 32, 8400]`
2. Permute dimensions → `[2, 8400, 32]`
3. Apply depthwise max pooling → `[2, 8400, 8]`
4. Permute back → `[2, 8, 8400]`
5. Restore spatial dimensions → `[2, 8, 70, 120]`


## Global Average Pooling

Global average pooling computes the mean of each entire feature map, producing **one value per channel per image**.

This is often used just before the output layer to reduce parameters and overfitting.


In [None]:
global_avg_pool = nn.AdaptiveAvgPool2d(output_size=1)


The same result can also be achieved using `torch.mean` directly:


In [None]:
output = cropped_images.mean(dim=(2, 3), keepdim=True)


You now know all the key pooling layers used in modern convolutional neural networks.  
Next, these components will be assembled into full CNN architectures.


# CNN Architectures

Typical CNN architectures stack a few convolutional layers (each generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on.

As the image progresses through the network, its spatial resolution decreases, but the number of feature maps (depth) typically increases. At the top of the stack, a regular feedforward neural network is added, composed of fully connected layers, and the final layer outputs the prediction (e.g., class probabilities using softmax).


## Design Tip

Instead of using a convolutional layer with a 5 × 5 kernel, it is generally preferable to stack two layers with 3 × 3 kernels. This uses fewer parameters, requires fewer computations, and usually performs better.

An exception is the first convolutional layer, which can use a large kernel (e.g., 5 × 5 or 7 × 7) with a stride of 2 or more to reduce spatial dimensions early without losing much information.
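
The parameter savings mentioned in this tip can be verified with simple arithmetic. The sketch below assumes square kernels, biases included, and a hypothetical layer with 64 input and 64 output channels; two stacked 3 × 3 layers cover the same 5 × 5 region of the input as one 5 × 5 layer.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one convolutional layer with a square kernel."""
    return kernel_size ** 2 * in_channels * out_channels + out_channels

channels = 64  # hypothetical: same number of input and output channels

one_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)
print(one_5x5, two_3x3)  # 102464 73856
```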


In [None]:
from functools import partial
import torch
import torch.nn as nn


## A Basic CNN for Fashion MNIST

Below is a CNN implementation suitable for the Fashion MNIST dataset. It follows the common pattern:
- Convolution + ReLU
- Pooling
- Repeat
- Fully connected layers at the top


In [None]:
DefaultConv2d = partial(nn.Conv2d, kernel_size=3, padding="same")

model = nn.Sequential(
    DefaultConv2d(in_channels=1, out_channels=64, kernel_size=7), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),

    DefaultConv2d(in_channels=64, out_channels=128), nn.ReLU(),
    DefaultConv2d(in_channels=128, out_channels=128), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),

    DefaultConv2d(in_channels=128, out_channels=256), nn.ReLU(),
    DefaultConv2d(in_channels=256, out_channels=256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),

    nn.Flatten(),
    nn.Linear(in_features=2304, out_features=128), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(in_features=128, out_features=64), nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(in_features=64, out_features=10),
)


### Explanation

- `DefaultConv2d` is defined using `functools.partial` to avoid repeating kernel size and padding.
- The number of filters doubles after each pooling layer.
- Max pooling reduces spatial dimensions by a factor of 2.
- Dropout layers reduce overfitting.
- The output layer has 10 units (one per class).
- Softmax is omitted because `nn.CrossEntropyLoss` expects logits.


### Why 2,304 Input Features?

Fashion MNIST images start at 28 × 28 pixels.
Pooling reduces them to:
- 14 × 14
- 7 × 7
- 3 × 3

With 256 feature maps:
256 × 3 × 3 = 2,304 input features.
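
The same computation can be traced in a few lines. This sketch assumes, as in the model above, that every convolution uses `"same"` padding (so only the pooling layers change the spatial size).

```python
size = 28        # Fashion MNIST images are 28 x 28
sizes = []
for _ in range(3):
    size //= 2   # MaxPool2d(kernel_size=2) halves the size, rounding down
    sizes.append(size)
print(sizes)     # [14, 7, 3]

in_features = 256 * size * size  # 256 feature maps of 3 x 3 values each
print(in_features)               # 2304
```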


## Classical CNN Architectures

Over time, CNN architectures evolved rapidly, especially through competitions like ImageNet (ILSVRC). The top-5 error rate dropped from over 26% to under 2.3% in just six years.

We now review several landmark architectures.


## LeNet-5 (1998)

LeNet-5 was created by Yann LeCun for handwritten digit recognition (MNIST).

It consists of:
- Convolutional layers
- Average pooling layers
- Fully connected layers
- tanh activations

Today, ReLU and softmax would typically be used instead.


## AlexNet (2012)

AlexNet won the 2012 ImageNet challenge with a top-5 error rate of 17%.

Key contributions:
- Much deeper than LeNet-5
- Stacked convolutional layers
- ReLU activations
- Dropout for regularization
- Data augmentation

It popularized deep CNNs.


### Data Augmentation

Data augmentation increases training data by generating realistic variants:
- Shifts
- Rotations
- Resizing
- Color changes
- Horizontal flips

This reduces overfitting and improves generalization.


## GoogLeNet (Inception, 2014)

GoogLeNet introduced **inception modules**, allowing the network to:
- Capture patterns at multiple scales
- Use far fewer parameters than AlexNet

It won the 2014 ImageNet challenge with under 7% top-5 error.


### Inception Modules

An inception module runs several layers in parallel:
- 1 × 1 convolutions
- 3 × 3 convolutions
- 5 × 5 convolutions
- Max pooling

All outputs are concatenated along the depth dimension.
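
The parallel structure can be sketched in PyTorch as follows. This is a simplified illustration, not the exact GoogLeNet module: the class name `InceptionSketch` and the channel counts are assumptions, and the 1 × 1 bottleneck convolutions that the real module places around the larger branches are omitted.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Simplified inception-style module with hypothetical channel counts."""
    def __init__(self, in_channels, branch_channels=16):
        super().__init__()
        # Padding is chosen so every branch preserves the spatial size
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        outputs = [
            torch.relu(self.branch1(x)),
            torch.relu(self.branch3(x)),
            torch.relu(self.branch5(x)),
            self.pool(x),
        ]
        return torch.cat(outputs, dim=1)  # concatenate along the channel dimension

module = InceptionSketch(in_channels=8)
out = module(torch.rand(2, 8, 28, 28))
print(out.shape)  # torch.Size([2, 56, 28, 28])
```

The output has 3 × 16 channels from the convolutional branches plus the 8 input channels passed through the pooling branch.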


## ResNet (2015)

ResNet introduced **skip connections**, enabling very deep networks (up to 152 layers).

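Instead of modeling the target function h(x) directly, the layers inside a residual unit model the residual:
f(x) = h(x) − x

The skip connection then adds the input back, so the unit outputs f(x) + x.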

This greatly improves gradient flow and training stability.


### Residual Units

Each residual unit:
- Contains two or three convolutional layers
- Preserves spatial dimensions
- Adds the input to the output

When dimensions change, a 1 × 1 convolution is used to match shapes.


## Xception

Xception replaces inception modules with **depthwise separable convolutions**.

These separate:
- Spatial feature extraction
- Cross-channel feature extraction

This reduces parameters and often improves performance.


In [None]:
class SeparableConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.depthwise_conv = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels
        )
        self.pointwise_conv = nn.Conv2d(
            in_channels, out_channels, kernel_size=1
        )

    def forward(self, x):
        return self.pointwise_conv(self.depthwise_conv(x))
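

To see where the parameter reduction comes from, the counts for a regular convolution and for the depthwise-plus-pointwise pair defined above can be compared directly. This is a sketch with hypothetical helper names and channel counts; biases are included because `nn.Conv2d` adds them by default.

```python
def regular_conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (weights + biases)."""
    return k * k * c_in * c_out + c_out

def separable_conv_params(k, c_in, c_out):
    """Parameters of a depthwise k x k convolution followed by a 1 x 1 convolution."""
    depthwise = k * k * c_in + c_in    # one k x k filter per input channel
    pointwise = c_in * c_out + c_out   # 1 x 1 convolution across channels
    return depthwise + pointwise

print(regular_conv_params(3, 128, 256))    # 295168
print(separable_conv_params(3, 128, 256))  # 34304
```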


## SENet (2017)

SENet introduces **Squeeze-and-Excitation (SE) blocks**, which:
- Analyze feature map importance
- Recalibrate channels dynamically
- Improve performance with minimal overhead


### SE Block Structure

An SE block consists of:
1. Global average pooling
2. Dense layer (ReLU)
3. Dense layer (sigmoid)
4. Channel-wise scaling of feature maps
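
These four steps map fairly directly onto PyTorch modules. The following is a minimal sketch rather than a reference implementation: the class name `SEBlock` and the reduction ratio of 16 for the bottleneck dense layer are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(output_size=1)  # global average pooling
        self.excite = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        scales = self.excite(self.squeeze(x))  # shape [batch, channels], values in (0, 1)
        return x * scales[:, :, None, None]    # rescale each feature map

se_block = SEBlock(channels=64)
out = se_block(torch.rand(2, 64, 14, 14))
print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The block leaves the tensor shape unchanged, which is what allows it to be inserted after existing units in an architecture.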


## Other Noteworthy Architectures

- **VGGNet** – deep but simple, many parameters
- **ResNeXt** – grouped convolutions
- **DenseNet** – dense connections between layers
- **MobileNet** – lightweight, mobile-friendly
- **EfficientNet** – compound scaling (depth, width, resolution)
- **ConvNeXt** – CNNs inspired by vision transformers


## GPU Memory: Training vs Inference

During inference, activations can be freed layer by layer.
During training, all activations must be stored for backpropagation.

This makes training far more memory-intensive than inference.


### Memory Optimization Techniques

- Reduce batch size
- Use mixed precision (FP16)
- Gradient accumulation
- Activation checkpointing
- Model parallelism
- Reversible networks (RevNets)


In [None]:
from torch.utils.checkpoint import checkpoint
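

As a rough illustration of activation checkpointing, the sketch below wraps a small block with `checkpoint` so that its intermediate activations are recomputed during the backward pass instead of being stored. The block and the tensor sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical block whose intermediate activations we choose not to store
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
)

x = torch.rand(2, 3, 32, 32, requires_grad=True)
# Activations inside `block` are recomputed when gradients are needed
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([2, 3, 32, 32])
```

This trades extra computation (one additional forward pass through the block) for lower memory usage.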


## Reversible Residual Networks (RevNets)

RevNets avoid storing most activations by making each layer reversible.
Inputs can be recomputed from outputs during backpropagation, saving memory.
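
One common way to build such a layer is an additive coupling: the input is split into two halves, and each half is updated using a function of the other. The sketch below uses plain floats and two hypothetical stand-in functions `f` and `g` in place of real residual sub-networks, just to show that the inputs can be recovered exactly from the outputs.

```python
import math

# Hypothetical stand-ins for the two residual functions of a reversible block
def f(v):
    return math.tanh(v)

def g(v):
    return 0.5 * v * v

def forward(x1, x2):
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - g(y1)   # undo the second update first
    x1 = y1 - f(x2)   # then undo the first update
    return x1, x2

y1, y2 = forward(0.3, -1.2)
x1, x2 = inverse(y1, y2)
print(math.isclose(x1, 0.3), math.isclose(x2, -1.2))  # True True
```

Note that `f` and `g` themselves do not need to be invertible; only the additions are undone.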


You now have a full overview of modern CNN architectures and design principles.

Next, we’ll implement a popular CNN architecture from scratch using PyTorch.


# Implementing a ResNet-34 CNN Using PyTorch

Most CNN architectures described so far can be implemented pretty naturally using PyTorch (although generally you would load a pretrained network instead, as you will see).

To illustrate the process, we will implement a **ResNet-34** from scratch using PyTorch.


## Residual Unit

We start by defining a **ResidualUnit** layer. This corresponds directly to the residual blocks used in ResNet architectures.

Each residual unit consists of:
- A main path with two convolutional layers
- A skip (identity) connection
- An elementwise addition followed by a ReLU activation


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial


In [None]:
class ResidualUnit(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        DefaultConv2d = partial(
            nn.Conv2d, kernel_size=3, stride=1, padding=1, bias=False
        )

        self.main_layers = nn.Sequential(
            DefaultConv2d(in_channels, out_channels, stride=stride),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            DefaultConv2d(out_channels, out_channels),
            nn.BatchNorm2d(out_channels),
        )

        if stride > 1 or in_channels != out_channels:
            self.skip_connection = nn.Sequential(
                nn.Conv2d(
                    in_channels, out_channels,
                    kernel_size=1, stride=stride, padding=0, bias=False
                ),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.skip_connection = nn.Identity()

    def forward(self, inputs):
        return F.relu(self.main_layers(inputs) + self.skip_connection(inputs))


### Residual Unit Explanation

This implementation closely matches the standard ResNet residual block.

- The **main layers** correspond to the right-hand path of the residual diagram.
- The **skip connection** is either:
  - A 1 × 1 convolution with stride > 1 when dimensions must change, or
  - An identity mapping when dimensions stay the same.
- In the forward pass, the outputs of both paths are added together and passed through a ReLU activation.


## Building the ResNet-34 Architecture

Now that we have the ResidualUnit, we can assemble the full **ResNet-34** architecture.

The network is essentially a large sequential stack:
- Initial convolution and max pooling
- Four groups of residual units
- Global average pooling
- A fully connected output layer


In [None]:
class ResNet34(nn.Module):
    def __init__(self):
        super().__init__()

        layers = [
            nn.Conv2d(
                in_channels=3, out_channels=64,
                kernel_size=7, stride=2, padding=3, bias=False
            ),
            nn.BatchNorm2d(num_features=64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        ]

        prev_filters = 64

        for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
            stride = 1 if filters == prev_filters else 2
            layers.append(ResidualUnit(prev_filters, filters, stride=stride))
            prev_filters = filters

        layers += [
            nn.AdaptiveAvgPool2d(output_size=1),
            nn.Flatten(),
            nn.LazyLinear(10),
        ]

        self.resnet = nn.Sequential(*layers)

    def forward(self, inputs):
        return self.resnet(inputs)


### Architecture Breakdown

- The first convolution uses a 7 × 7 kernel with stride 2, followed by batch normalization and max pooling.
- Residual units are stacked in the following pattern:
  - 3 blocks with 64 filters
  - 4 blocks with 128 filters
  - 6 blocks with 256 filters
  - 3 blocks with 512 filters
- When the number of filters increases, the stride is set to 2 to downsample the spatial dimensions.
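
The effect of these strides on the feature map size, and the origin of the "34" in the name, can be traced with a short calculation. This sketch assumes a hypothetical 224 × 224 input and the standard output-size formula; the helper name `out_size` is illustrative.

```python
def out_size(n, kernel_size, stride, padding):
    """Output length along one dimension of a convolution or pooling layer."""
    return (n + 2 * padding - kernel_size) // stride + 1

size = 224                                                  # hypothetical input size
size = out_size(size, kernel_size=7, stride=2, padding=3)   # first convolution
size = out_size(size, kernel_size=3, stride=2, padding=1)   # max pooling
print(size)  # 56

blocks = [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3
prev = 64
for filters in blocks:
    stride = 1 if filters == prev else 2
    # Only the first convolution of a unit may use a stride of 2
    size = out_size(size, kernel_size=3, stride=stride, padding=1)
    prev = filters
print(size)  # 7

# 1 initial convolution + 2 convolutions per residual unit + 1 output layer
print(1 + 2 * len(blocks) + 1)  # 34
```

The 1 × 1 convolutions on the skip connections are conventionally not included in the layer count.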


### Final Layers

- `AdaptiveAvgPool2d(output_size=1)` performs **global average pooling**, producing one value per feature map.
- `Flatten()` converts the tensor into a vector.
- `LazyLinear(10)` creates the final classification layer with 10 outputs (e.g., for CIFAR-10).


## Summary

In just about **45 lines of code**, we have implemented ResNet-34 from scratch.

This highlights:
- The elegance of the ResNet architecture
- The expressiveness and flexibility of PyTorch

Although implementing other CNN architectures would take more code, it would not be significantly harder. In practice, however, PyTorch’s **TorchVision** library provides many of these models preimplemented and pretrained, making them easy to use in real-world applications.


# Using TorchVision’s Pretrained Models

In practice, you rarely need to implement standard CNN architectures like GoogLeNet, ResNet, or ConvNeXt manually. TorchVision provides many **pretrained models** that can be loaded with just a few lines of code.


## Tip: Other Sources of Pretrained Models

- **TIMM** is a popular PyTorch-based library that offers a large collection of pretrained image classification models, along with utilities for data loading, augmentation, optimizers, and schedulers.
- **Hugging Face Hub** is another excellent resource for pretrained models across many domains.

Both libraries integrate well with PyTorch.


In [None]:
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"  # used by .to(device) below


## Loading a Pretrained ConvNeXt Model

TorchVision provides several ConvNeXt variants: tiny, small, base, and large.  
The following example loads the **ConvNeXt Base** model pretrained on ImageNet.


In [None]:
weights = torchvision.models.ConvNeXt_Base_Weights.IMAGENET1K_V1
model = torchvision.models.convnext_base(weights=weights).to(device)


This code automatically downloads the pretrained weights (about 338 MB) from Torch Hub and caches them locally for future use.

Some models have multiple versions of pretrained weights (for example, `IMAGENET1K_V2`). To explore available models and weights, you can use:

- `torchvision.models.list_models()`
- `torchvision.models.get_model_weights("convnext_base")`

You can also browse the official documentation on the TorchVision website.


## Preprocessing the Input Images

Before passing images to the model, they must be preprocessed exactly as expected during training.

ConvNeXt models expect **224 × 224** pixel images. Instead of manually resizing and normalizing, it is best to use the transforms provided by the pretrained weights object.


In [None]:
transforms = weights.transforms()
preprocessed_images = transforms(sample_images_permuted)


These transforms:
- Resize the images to the correct dimensions
- Normalize pixel intensities using ImageNet’s channel-wise means and standard deviations

This ensures compatibility with the pretrained model.
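
For intuition, the normalization step amounts to standardizing each color channel. The sketch below uses the per-channel ImageNet means and standard deviations that are commonly used with TorchVision models; treat the constants and the helper name `standardize_pixel` as assumptions, since the transforms returned by the weights object remain the authoritative source.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # assumed per-channel means (R, G, B)
IMAGENET_STD = (0.229, 0.224, 0.225)   # assumed per-channel standard deviations

def standardize_pixel(rgb):
    """Standardize one RGB pixel whose values are already scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

# A pixel equal to the channel means maps to zero in every channel
print(standardize_pixel((0.485, 0.456, 0.406)))  # (0.0, 0.0, 0.0)
```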


## Running Inference

Before making predictions:
- Switch the model to **evaluation mode**
- Disable gradient computation to save memory and computation


In [None]:
model.eval()

with torch.no_grad():
    y_logits = model(preprocessed_images.to(device))


The output is a tensor of shape **[2, 1000]**, since ImageNet contains 1,000 classes.  
Each row contains the logits for one image.


## Predicted Classes

To obtain the predicted class for each image, we select the index of the maximum logit.


In [None]:
y_pred = torch.argmax(y_logits, dim=1)
y_pred


The resulting tensor contains the ImageNet class IDs predicted for each image.


## Mapping Class IDs to Human-Readable Labels

The pretrained weights object contains metadata, including the ImageNet class names.


In [None]:
class_names = weights.meta["categories"]
[class_names[class_id] for class_id in y_pred]


For example, the predictions may be:
- **palace**
- **daisy**

These labels make sense even if the exact object is not present in ImageNet, as the model selects the closest available class.


## Top-3 Predictions

Instead of only looking at the top prediction, we can inspect the top three most likely classes using `topk()`.


In [None]:
y_top3_logits, y_top3_class_ids = y_logits.topk(k=3, dim=1)
[[class_names[class_id] for class_id in top3] for top3 in y_top3_class_ids]


## Estimated Probabilities

The logits can be converted into probabilities using the softmax function.


In [None]:
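# Apply softmax over all 1,000 logits, then keep the three largest probabilities
y_logits.softmax(dim=1).topk(k=3, dim=1).values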


These probabilities indicate how confident the model is in each of its top predictions.


## Summary

TorchVision makes it extremely easy to download and use pretrained models, and they perform very well out of the box on ImageNet classes.

When your task involves different classes (such as specific flower species), pretrained models are still highly valuable through **transfer learning**, which we will explore next.


# Using TorchVision’s Pretrained Models

In general, you won’t have to implement standard models like GoogLeNet, ResNet, or ConvNeXt manually, since pretrained networks are readily available with a couple of lines of code using TorchVision.


## TIP

TIMM is another very popular library built on PyTorch: it provides a collection of pretrained image classification models, as well as many related tools such as data loaders, data augmentation utilities, optimizers, schedulers, and more.

Hugging Face’s Hub is also a great place to get all sorts of pretrained models (see Chapter 14).


For example, you can load a ConvNeXt model pretrained on ImageNet with the following code. There are several variants of the ConvNeXt model—tiny, small, base, and large—and this code loads the base variant:


In [None]:
weights = torchvision.models.ConvNeXt_Base_Weights.IMAGENET1K_V1
model = torchvision.models.convnext_base(weights=weights).to(device)


That’s all! This code automatically downloads the weights (338 MB) from the Torch Hub, an online repository of pretrained models. The weights are saved and cached for future use (e.g., in ~/.cache/torch/hub; run torch.hub.get_dir() to find the exact path on your system).

Some models have newer weights versions (e.g., IMAGENET1K_V2) or other weight variants. For the full list of available models, run:

- torchvision.models.list_models()

To find the list of pretrained weights available for a given model, such as convnext_base, run:

- list(torchvision.models.get_model_weights("convnext_base"))

Alternatively, visit https://pytorch.org/vision/main/models.


Let’s use this model to classify the two sample images we loaded earlier.

Before we can do this, we must first ensure that the images are preprocessed exactly as the model expects. In particular, they must have the right size. A ConvNeXt model expects 224 × 224 pixel images (other models may expect other sizes, such as 299 × 299).

Since our sample images are 427 × 640 pixels, we need to resize them. We could do this using TorchVision’s CenterCrop and/or Resize transform, but it’s much easier and safer to use the transforms returned by weights.transforms(), as they are specifically designed for this particular pretrained model:


In [None]:
transforms = weights.transforms()
preprocessed_images = transforms(sample_images_permuted)


Importantly, these transforms also normalize the pixel intensities just like during training. In this case, the transforms standardize the pixel intensities separately for each color channel, using ImageNet’s means and standard deviations for each channel (we will see how to do this manually later in this chapter).


Next we can move the images to the GPU and pass them to the model. As always, remember to switch the model to evaluation mode before making predictions—the model is in training mode by default—and also turn off autograd:


In [None]:
model.eval()
with torch.no_grad():
    y_logits = model(preprocessed_images.to(device))


The result is a 2 × 1,000 tensor containing the class logits for each image (recall that ImageNet has 1,000 classes). As we did in Chapter 10, we can use torch.argmax() to get the predicted class for each image (i.e., the class with the maximum logit):


In [None]:
y_pred = torch.argmax(y_logits, dim=1)
y_pred


So far, so good, but what exactly do these classes represent?

Well, you could find the ImageNet class names online, but once again it’s simpler and safer to get the class names directly from the weights object. Indeed, its meta attribute is a dictionary containing metadata about the pretrained model, including the class names:


In [None]:
class_names = weights.meta["categories"]
[class_names[class_id] for class_id in y_pred]


There you have it: the first image is classified as a palace, and the second as a daisy. Since the ImageNet dataset does not have classes for Chinese towers or dahlia flowers, a palace and a daisy are reasonable substitutes (the tower is part of the Summer Palace in Beijing).


Let’s look at the top-three predictions using topk():


In [None]:
y_top3_logits, y_top3_class_ids = y_logits.topk(k=3, dim=1)
[[class_names[class_id] for class_id in top3] for top3 in y_top3_class_ids]


Let’s look at the estimated probabilities for each of these classes:


In [None]:
y_top3_logits.softmax(dim=1)
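One subtlety is worth checking here: applying softmax to the top-three logits only gives probabilities renormalized over those three classes, so each row sums to 1. To get probabilities over all classes, you can apply softmax to the full logits first, then take the top three. The following sketch uses made-up logits (not the model's actual outputs) to illustrate the difference:

```python
import torch

# Hypothetical logits for one image and five classes (illustration only)
toy_logits = torch.tensor([[4.0, 3.0, 2.0, 1.0, 0.0]])

# Softmax over the top-3 logits only: renormalized among three classes
top3_logits, top3_ids = toy_logits.topk(k=3, dim=1)
relative_probas = top3_logits.softmax(dim=1)

# Softmax over all logits, then top-3: probabilities over all classes
absolute_probas, absolute_ids = toy_logits.softmax(dim=1).topk(k=3, dim=1)

print(relative_probas.sum().item())  # sums to 1 (up to rounding)
print(absolute_probas.sum().item())  # less than 1
```

Both approaches rank the classes identically; only the scale of the reported probabilities differs.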


As you can see, TorchVision makes it easy to download and use pretrained models, and it works quite well out of the box for ImageNet classes.

But what if you need to classify images into classes that don’t belong to the ImageNet dataset, such as various flower species? In that case, you may still benefit from the pretrained models by using them to perform transfer learning.


# Classification and Localization

Localizing an object in a picture can be expressed as a regression task, as discussed in Chapter 9: to predict a bounding box around the object, a common approach is to predict the location of the bounding box’s center, as well as its width and height (alternatively, you could predict the horizontal and vertical coordinates of the object’s upper-left and lower-right corners).

This means we have four numbers to predict.


Predicting these four numbers does not require much change to the ConvNeXt model; we just need to add a second dense output layer with four units (e.g., on top of the global average pooling layer).

Here’s a FlowerLocator model that adds a localization head to a given base model, such as our ConvNeXt model:


In [None]:
class FlowerLocator(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.localization_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(base_model.classifier[2].in_features, 4)
        )

    def forward(self, X):
        features = self.base_model.features(X)
        pool = self.base_model.avgpool(features)
        logits = self.base_model.classifier(pool)
        bbox = self.localization_head(pool)
        return logits, bbox


torch.manual_seed(42)
locator_model = FlowerLocator(model).to(device)


This locator model has two heads: the first outputs class logits, while the second outputs the bounding box.

The localization head has the same number of inputs as the nn.Linear layer of the classification head, but it outputs just four numbers.

The forward() method takes a batch of preprocessed images as input and outputs both the predicted class logits (102 per image) and the predicted bounding boxes (1 per image).


After training this model, you can use it as follows:


In [None]:
preproc_images = [...]  # a batch of preprocessed images
y_pred_logits, y_pred_bbox = locator_model(preproc_images.to(device))


But how can we train this model?

Well, we saw how to train a model with two or more outputs in Chapter 10, and this one is no different: in this case, we can use the nn.CrossEntropyLoss for the classification head, and the nn.MSELoss for the localization head.

The final loss can just be a weighted sum of the two. Voilà, that’s all there is to it.
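As a minimal sketch of such a weighted sum, the following code combines the two losses on random stand-in tensors. The batch size, the `bbox_weight` value, and all variable names are assumptions made for illustration; in practice the logits and boxes would come from the locator model, and the weight would be tuned like any other hyperparameter:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Made-up batch: 8 images, 102 classes, 4 bounding box numbers per image
logits = torch.randn(8, 102)
bbox_pred = torch.rand(8, 4)
labels = torch.randint(0, 102, (8,))
bbox_target = torch.rand(8, 4)

xentropy = nn.CrossEntropyLoss()
mse = nn.MSELoss()
bbox_weight = 0.5  # hypothetical weight for the localization loss

class_loss = xentropy(logits, labels)
bbox_loss = mse(bbox_pred, bbox_target)
total_loss = class_loss + bbox_weight * bbox_loss
print(total_loss)  # a scalar tensor, ready for total_loss.backward()
```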


Hey, not so fast! We have a problem: the Flowers102 dataset does not include any bounding boxes, so we need to add them ourselves.

This is often one of the hardest and most costly parts of a machine learning project: labeling and annotating the data.


To annotate images with bounding boxes, you may want to use an open source labeling tool like:

- Label Studio
- OpenLabeler
- ImgLab
- Labelme
- VoTT
- VGG Image Annotator

Or perhaps a commercial tool like:

- LabelBox
- Supervisely
- Roboflow
- RectLabel


Many of these are now AI-assisted, greatly speeding up the annotation task. You may also want to consider crowdsourcing platforms such as Amazon Mechanical Turk if you have a very large number of images to annotate.

However, it is quite a lot of work to set up a crowdsourcing platform, prepare the form to be sent to the workers, supervise them, and ensure that the quality of the bounding boxes they produce is good, so make sure it is worth the effort.


If there are just a few hundred or even a couple of thousand images to label, and you don’t plan to do this frequently, it may be preferable to do it yourself: with the right tools, it will only take a few days, and you’ll also gain a better understanding of your dataset and task.


You can then create a custom dataset (see Chapter 10) where each entry contains an image, a label, and a bounding box.

TorchVision conveniently includes a BoundingBoxes class that represents a list of bounding boxes.


For example, the following code creates a bounding box for the largest flower in the first image of the Flowers102 training set (for now we only consider one bounding box per image, but we’ll discuss multiple bounding boxes per image later in this chapter):


In [None]:
import torchvision.tv_tensors

bbox = torchvision.tv_tensors.BoundingBoxes(
    [[377, 199, 248, 262]],  # center x=377, center y=199, width=248, height=262
    format="CXCYWH",        # other possible formats: "XYXY" and "XYWH"
    canvas_size=(500, 754) # raw image size before preprocessing
)
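To make the three formats named in the comment above concrete, here is a small plain-Python sketch that converts the same box from CXCYWH to the two other formats. The helper names are hypothetical; for tensors, `torchvision.ops.box_convert()` performs this kind of conversion:

```python
def cxcywh_to_xyxy(box):
    """(center x, center y, width, height) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

def cxcywh_to_xywh(box):
    """(center x, center y, width, height) -> (x1, y1, width, height)."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, w, h]

box = [377, 199, 248, 262]
print(cxcywh_to_xyxy(box))  # [253.0, 68.0, 501.0, 330.0]
print(cxcywh_to_xywh(box))  # [253.0, 68.0, 248, 262]
```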



The BoundingBoxes class is a subclass of TVTensor, which is a subclass of torch.Tensor, so you can treat bounding boxes exactly like regular tensors, with extra features.

Most importantly, you can transform bounding boxes using TorchVision’s transforms API v2.


For example, let’s use the transform we defined earlier to preprocess this bounding box:


In [None]:
transform(bbox)


BoundingBoxes([[ 90,  91, 120, 154]], format=BoundingBoxFormat.CXCYWH,
              canvas_size=(224, 224), clamping_mode=soft)


## WARNING

Resizing and cropping a bounding box works as expected, but rotation is special: the bounding box can’t be rotated since it doesn’t have any rotation parameter, so instead it is resized to fit the rotated box (not the rotated object).

As a result, it may end up being a bit too large for the object.


You can pass a nested data structure to a transform and the output will have the same structure, except with all the images and bounding boxes transformed.


In [None]:
first_image = [...]  # load the first training image without any preprocessing

preproc_image, preproc_target = transform(
    (first_image, {"label": 0, "bbox": bbox})
)

preproc_bbox = preproc_target["bbox"]


## TIP

When using the MSE, a 10-pixel error for a large bounding box will be penalized just as much as a 10-pixel error for a small bounding box.

To avoid this, you can use a custom loss function that computes the square root of the width and height—for both the target and the prediction—before computing the MSE.
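Here is one possible sketch of such a loss, assuming boxes in CXCYWH format. The function name and the small clamping constant are assumptions, not part of any library:

```python
import torch
import torch.nn.functional as F

def sqrt_wh_mse_loss(pred, target):
    """MSE on (cx, cy, sqrt(w), sqrt(h)); boxes are in CXCYWH format."""
    def sqrt_wh(boxes):
        centers = boxes[:, :2]
        # Clamp to avoid taking the square root of a negative prediction
        sizes = boxes[:, 2:].clamp(min=1e-6).sqrt()
        return torch.cat([centers, sizes], dim=1)
    return F.mse_loss(sqrt_wh(pred), sqrt_wh(target))

small_target = torch.tensor([[50., 50., 16., 16.]])
small_pred = torch.tensor([[50., 50., 25., 25.]])    # 9 pixels too wide and tall
large_target = torch.tensor([[50., 50., 400., 400.]])
large_pred = torch.tensor([[50., 50., 409., 409.]])  # also 9 pixels off

small_loss = sqrt_wh_mse_loss(small_pred, small_target)
large_loss = sqrt_wh_mse_loss(large_pred, large_target)
print(small_loss, large_loss)  # the small box is penalized more
```

With the plain MSE, both errors would yield the same loss; with the square roots, the same 9-pixel error costs much more on the small box than on the large one.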


The MSE is simple and often works fairly well to train the model, but it is not a great metric to evaluate how well the model can predict bounding boxes.


The most common metric for this is the intersection over union (IoU, also known as the Jaccard index): it is the area of overlap between the target bounding box T and the predicted bounding box P, divided by the area of their union P ∪ T.

In short, IoU = |P ∩ T| / |P ∪ T|, where |x| is the area of x.


The IoU ranges from 0 (no overlap) to 1 (perfect overlap). It is implemented by the torchvision.ops.box_iou() function.


The IoU is not great for training because it is equal to zero whenever P and T have no overlap, regardless of the distance between them or their shapes: in this case the gradient is also equal to zero and therefore gradient descent cannot make any progress.


Luckily, it’s possible to fix this flaw by incorporating extra information.

For example, the Generalized IoU (GIoU), introduced in a 2019 paper by H. Rezatofighi et al., considers the smallest box S that contains both P and T, and it subtracts from the IoU the fraction of S’s area that is not covered by P or T.


In short:

GIoU = IoU – |S – (P ∪ T)| / |S|

Since we want to maximize the GIoU, the GIoU loss is equal to 1 – GIoU.

This loss quickly became popular, and it is implemented by the torchvision.ops.generalized_box_iou_loss() function.
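To see why the GIoU gives a useful training signal where the IoU does not, here is a plain-Python sketch that computes both for a single pair of boxes in XYXY format. The helper names are hypothetical, and this is only an illustration of the formulas above, not TorchVision's implementation:

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; 0 if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou_and_giou(p, t):
    # Intersection box (empty if P and T do not overlap)
    inter = (max(p[0], t[0]), max(p[1], t[1]), min(p[2], t[2]), min(p[3], t[3]))
    inter_area = area(inter)
    union_area = area(p) + area(t) - inter_area
    # Smallest box S enclosing both P and T
    s = (min(p[0], t[0]), min(p[1], t[1]), max(p[2], t[2]), max(p[3], t[3]))
    iou = inter_area / union_area
    giou = iou - (area(s) - union_area) / area(s)
    return iou, giou

target = (0, 0, 10, 10)
near = (12, 0, 22, 10)  # no overlap, 2 pixels away from the target
far = (50, 0, 60, 10)   # no overlap, 40 pixels away from the target

iou_near, giou_near = iou_and_giou(near, target)
iou_far, giou_far = iou_and_giou(far, target)
print(iou_near, iou_far)    # 0.0 0.0
print(giou_near, giou_far)  # the nearer box gets the higher (less negative) GIoU
```

Both predictions have an IoU of zero, but the GIoU still ranks the nearer box higher, so gradient descent has a direction to follow.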


Another important variant of the IoU is the Complete IoU (CIoU), introduced in a 2020 paper by Z. Zheng et al.

It considers three geometric factors:
- the IoU (the more overlap, the better)
- the distance between the centers of P and T (the closer, the better)
- the similarity between the aspect ratios of P and T (the closer, the better)


The loss is 1 – CIoU, and it is implemented by the torchvision.ops.complete_box_iou_loss() function.

It generally performs better than the MSE or the GIoU, converging faster and leading to more accurate bounding boxes, so it is becoming the default loss for localization.


Classifying and localizing a single object is nice, but what if the images contain multiple objects (as is often the case in the flowers dataset)?


# Object Detection

The task of classifying and localizing multiple objects in an image is called object detection.

Until a few years ago, a common approach was to take a CNN that was trained to classify and locate a single object roughly centered in the image, then slide this CNN across the image and make predictions at each step.


The CNN was generally trained to predict not only class probabilities and a bounding box, but also an objectness score: this is the estimated probability that the image does indeed contain an object centered near the middle.

This is a binary classification output; it can be produced by a dense output layer with a single unit, using the sigmoid activation function and trained using the binary cross-entropy loss.


## NOTE

Instead of an objectness score, a “no-object” class was sometimes added, but in general this did not work as well.

The questions “Is an object present?” and “What type of object is it?” are best answered separately.


This sliding-CNN approach is illustrated in Figure 12-25.

In this example, the image was chopped into a 5 × 7 grid, and we see a CNN—the thick black rectangle—sliding across all 3 × 3 regions and making predictions at each step.


**Figure 12-25.** Detecting multiple objects by sliding a CNN across the image


In this figure, the CNN has already made predictions for three of these 3 × 3 regions:


When looking at the top-left 3 × 3 region (centered on the red-shaded grid cell located in the second row and second column), it detected the leftmost rose.

Notice that the predicted bounding box exceeds the boundary of this 3 × 3 region. That’s absolutely fine: even though the CNN could not see the bottom part of the rose, it was able to make a reasonable guess as to where it might be.


It also predicted class probabilities, giving a high probability to the “rose” class.

Lastly, it predicted a fairly high objectness score, since the center of the bounding box lies within the central grid cell (in this figure, the objectness score is represented by the thickness of the bounding box).


When looking at the next 3 × 3 region, one grid cell to the right (centered on the shaded blue square), it did not detect any flower centered in that region, so it predicted a very low objectness score.

Therefore, the predicted bounding box and class probabilities can safely be ignored.


Finally, when looking at the next 3 × 3 region, again one grid cell to the right (centered on the shaded green cell), it detected the rose at the top, although not perfectly.

This rose is not well centered within this region, so the predicted objectness score was not very high.


You can imagine how sliding the CNN across the whole image would give you a total of 15 predicted bounding boxes, organized in a 3 × 5 grid, with each bounding box accompanied by its estimated class probabilities and objectness score.


Since objects can have varying sizes, you may then want to slide the CNN again across 2 × 2 and 4 × 4 regions as well, to capture smaller and larger objects.
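As a quick check of the arithmetic, the following sketch counts how many positions a square region can take on the 5 × 7 grid, assuming it slides one grid cell at a time as in the figure. The function name is hypothetical:

```python
def count_window_positions(grid_rows, grid_cols, window):
    """Number of positions of a window x window region in the grid."""
    return (grid_rows - window + 1) * (grid_cols - window + 1)

for window in (2, 3, 4):
    print(window, count_window_positions(5, 7, window))
# 2 24
# 3 15
# 4 8
```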


This technique is fairly straightforward, but as you can see it will often detect the same object multiple times, at slightly different positions.

Some post-processing is needed to get rid of all the unnecessary bounding boxes.


A common approach for this is called **non-max suppression (NMS)**.

Here’s how it works:


1. First, get rid of all the bounding boxes for which the objectness score is below some threshold.


2. Find the remaining bounding box with the highest objectness score, and get rid of all the other remaining bounding boxes that overlap a lot with it (e.g., with an IoU greater than 60%).


3. Repeat step 2 until there are no more bounding boxes to get rid of.
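The three steps above can be sketched in plain Python as follows. The function names, thresholds, and boxes are made up for illustration; for real workloads, `torchvision.ops.nms()` provides an efficient implementation of the suppression step:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.6):
    # Step 1: drop boxes with a low objectness score
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    # Steps 2 and 3: keep the best remaining box, drop its near-duplicates, repeat
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept

boxes = [(10, 10, 110, 110), (15, 12, 115, 112), (200, 50, 300, 150),
         (205, 55, 305, 155), (60, 60, 160, 160)]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
print(non_max_suppression(boxes, scores))  # [0, 2, 4]
```

Box 1 is dropped because it overlaps heavily with box 0, and box 3 is dropped in step 1 because of its low score.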


This simple approach to object detection works pretty well, but it requires running the CNN many times, so it is quite slow.

Fortunately, there is a much faster way to slide a CNN across an image: using a fully convolutional network (FCN).


# Fully Convolutional Networks

The idea of FCNs was first introduced in a 2015 paper by Jonathan Long et al., for semantic segmentation.


The authors pointed out that you could replace the dense layers at the top of a CNN with convolutional layers.


Suppose a dense layer with 200 neurons sits on top of a convolutional layer that outputs 100 feature maps, each of size 7 × 7.

Each neuron computes a weighted sum of all 100 × 7 × 7 activations.


Now replace this dense layer with a convolutional layer using 200 filters of size 7 × 7 and "valid" padding.

The output will be 200 feature maps of size 1 × 1.


In other words, provided the dense layer’s weights are reshaped into the filters, the output values are exactly the same; only the tensor shape differs:

- Dense: [batch size, 200]
- Conv: [batch size, 200, 1, 1]
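The following sketch checks this equivalence numerically using the sizes from the example above. The layers are randomly initialized stand-ins, and the weight copy is just one way to transfer the dense layer's parameters into the convolutional layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
dense = nn.Linear(100 * 7 * 7, 200)
conv = nn.Conv2d(100, 200, kernel_size=7, padding="valid")

# Copy the dense weights into the conv layer, reshaped to
# [filters, input channels, kernel height, kernel width]
with torch.no_grad():
    conv.weight.copy_(dense.weight.view(200, 100, 7, 7))
    conv.bias.copy_(dense.bias)

feature_maps = torch.randn(2, 100, 7, 7)
with torch.no_grad():
    dense_out = dense(feature_maps.flatten(start_dim=1))
    conv_out = conv(feature_maps)

print(dense_out.shape)  # torch.Size([2, 200])
print(conv_out.shape)   # torch.Size([2, 200, 1, 1])
print(torch.allclose(dense_out, conv_out.flatten(start_dim=1), atol=1e-4))
```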


## TIP

To convert a dense layer to a convolutional layer:

- Number of filters = number of dense units
- Kernel size = input feature map size
- Padding = "valid"


Why is this important?

Because convolutional layers can process images of any spatial size, while dense layers cannot.

This means FCNs can be trained and run on images of any size.


For example, if an FCN is trained on 224 × 224 images and later receives a 448 × 448 image, it will naturally output a larger prediction grid — without retraining.
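To illustrate, here is a toy, untrained FCN (a hypothetical stand-in, not a real architecture): a single convolution with an overall stride of 32 plays the role of the backbone, and a 7 × 7 convolution plays the role of the former dense layer. Feeding it the two image sizes shows how the prediction grid grows:

```python
import torch
import torch.nn as nn

toy_fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=32, stride=32),  # stand-in backbone, stride 32
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=7),             # replaces the dense layer
)

with torch.no_grad():
    small_out = toy_fcn(torch.randn(1, 3, 224, 224))
    large_out = toy_fcn(torch.randn(1, 3, 448, 448))

print(small_out.shape)  # torch.Size([1, 10, 1, 1])
print(large_out.shape)  # torch.Size([1, 10, 8, 8])
```

The 448 × 448 image yields an 8 × 8 grid of predictions, each equivalent to running the network on one 224 × 224 window of the image.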


This allows the CNN to make many predictions in a single forward pass, instead of sliding a window manually.

In fact, **YOLO** — *You Only Look Once* — is based on this idea.


**Figure 12-26.** The same fully convolutional network processing a small image (left) and a large one (right)


# You Only Look Once (YOLO)

YOLO is a fast and accurate object detection architecture proposed by Joseph Redmon et al. in 2015.

It is fast enough to run in real time on video.


YOLO differs from basic FCNs in several important ways:


- Each grid cell only predicts objects whose bounding box center lies within that cell
- Bounding box coordinates are relative to the grid cell
- Width and height may extend beyond the cell


- YOLO predicts multiple bounding boxes per grid cell
- Each bounding box has its own objectness score


- Class probabilities are predicted per grid cell, not per bounding box


YOLO has evolved through many versions (YOLOv2, YOLOv3, YOLO9000, and beyond), adding improvements such as:

- Anchor priors
- More bounding boxes per cell
- More classes
- Skip connections
- Tiny versions for real-time inference


# Mean Average Precision (mAP)

A very common metric used in object detection is the mean average precision (mAP).


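To compute the mAP, we first compute the **average precision (AP)** for each class: for each of several recall levels between 0% and 100% (e.g., 0%, 10%, 20%, and so on), we find the maximum precision achievable with at least that recall, then we average these maximum precisions.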


The mAP is then computed by averaging the AP values across all classes.
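The sketch below implements one classic variant of this computation, using 11 evenly spaced recall levels. Other benchmarks sample recall differently, so treat this as an illustration rather than a reference implementation. The function name and the toy detections are made up; each detection is assumed to be already sorted by decreasing confidence and marked as a true or false positive:

```python
def average_precision(is_true_positive, n_ground_truths, n_levels=11):
    """AP from detections sorted by decreasing confidence."""
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / n_ground_truths)
    max_precisions = []
    for i in range(n_levels):
        level = i / (n_levels - 1)
        # Maximum precision achievable with at least this recall
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        max_precisions.append(max(reachable, default=0.0))
    return sum(max_precisions) / n_levels

ap_rose = average_precision([True, True, False, True, False], n_ground_truths=4)
ap_daisy = average_precision([True, False], n_ground_truths=1)
mean_ap = (ap_rose + ap_daisy) / 2
print(ap_rose, ap_daisy, mean_ap)
```

For the "rose" class, one of the four ground-truth objects is never detected, so the maximum precision is zero at the highest recall levels, which lowers the AP.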


In object detection, a prediction is only considered correct if:

- The predicted class is correct
- The IoU with the ground truth box exceeds a threshold (e.g., 0.5)


This leads to metrics such as:

- mAP@0.5
- mAP@[.50:.95]


TorchMetrics provides a ready-to-use MeanAveragePrecision metric that handles all of this.


TorchVision does not include YOLO models, but you can use the Ultralytics library, which provides pretrained YOLO models based on PyTorch.


In [None]:
from ultralytics import YOLO

model = YOLO('yolov9m.pt')  # n=nano, s=small, m=medium, x=large
images = ["https://homl.info/soccer.jpg", "https://homl.info/traffic.jpg"]
results = model(images)


The output is a list of Results objects.

For example, here is the first detected object in the first image:


In [None]:
results[0].summary()[0]


{'name': 'sports ball',
 'class': 32,
 'confidence': 0.96214,
 'box': {'x1': 245.35733, 'y1': 286.03003, 'x2': 300.62509, 'y2': 343.57184}}


## TIP

The Ultralytics library also provides a simple API to train YOLO models on your own datasets.

See https://docs.ultralytics.com/modes/train for more details.


Several other pretrained object detection models are available via TorchVision:


- Faster R-CNN  
- SSD  
- SSDlite  
- RetinaNet  
- FCOS


So far, we’ve only considered detecting objects in single images.

But what about videos?

Objects must not only be detected in each frame, they must also be tracked over time.


# Object Tracking

Object tracking is a challenging task: objects move, they may grow or shrink as they get closer or further away, their appearance may change as they turn around or move to different lighting conditions or backgrounds, they may be temporarily occluded by other objects, and so on.


One of the most popular object tracking systems is **DeepSORT**. It is based on a combination of classical algorithms and deep learning:


- It uses **Kalman filters** to estimate the most likely current position of an object given prior detections, and assuming that objects tend to move at a constant speed.


- It uses a **deep learning model** to measure the resemblance between new detections and existing tracked objects.


- Lastly, it uses the **Hungarian algorithm** to map new detections to existing tracked objects (or to new tracked objects). This algorithm efficiently finds the combination of mappings that minimizes the distance between the detections and the predicted positions of tracked objects, while also minimizing the appearance discrepancy.


For example, imagine a red ball that just bounced off a blue ball traveling in the opposite direction.


Based on the previous positions of the balls, the Kalman filter will predict that the balls will go through each other; indeed, it assumes that objects move at a constant speed, so it will not expect the bounce.


If the Hungarian algorithm only considered positions, then it would happily map the new detections to the wrong balls, as if they had just gone through each other and swapped colors.


But thanks to the resemblance measure, the Hungarian algorithm will notice the problem. Assuming the balls are not too similar, the algorithm will map the new detections to the correct balls.
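The bouncing balls scenario can be sketched in a few lines of plain Python. This is a heavily simplified illustration, not DeepSORT itself: a constant-velocity prediction stands in for the Kalman filter, a made-up one-dimensional "redness" value stands in for the learned appearance embedding, and a brute-force search over all mappings stands in for the Hungarian algorithm (it finds the same optimum, just less efficiently):

```python
from itertools import permutations

# Hypothetical 1D tracks: last position, velocity per frame, appearance
tracks = {
    "red ball": {"pos": 4.0, "vel": +2.0, "look": 1.0},
    "blue ball": {"pos": 6.0, "vel": -2.0, "look": 0.0},
}
# After the bounce, each ball is detected back where it started
detections = [{"pos": 4.0, "look": 0.95}, {"pos": 6.0, "look": 0.05}]

def predict(track):
    return track["pos"] + track["vel"]  # constant-velocity motion model

def best_assignment(tracks, detections, appearance_weight):
    names = list(tracks)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(detections)), len(names)):
        cost = 0.0
        for name, det_idx in zip(names, perm):
            det = detections[det_idx]
            cost += abs(predict(tracks[name]) - det["pos"])
            cost += appearance_weight * abs(tracks[name]["look"] - det["look"])
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(zip(names, best_perm))

position_only = best_assignment(tracks, detections, appearance_weight=0.0)
with_appearance = best_assignment(tracks, detections, appearance_weight=10.0)
print(position_only)    # {'red ball': 1, 'blue ball': 0}  (swapped)
print(with_appearance)  # {'red ball': 0, 'blue ball': 1}  (correct)
```

For larger problems, `scipy.optimize.linear_sum_assignment()` solves the same assignment problem efficiently from a cost matrix.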


The Ultralytics library supports object tracking. It uses the **BoT-SORT** algorithm by default: this algorithm is very similar to DeepSORT but it’s faster and more accurate thanks to improvements such as camera-motion compensation and tweaks to the Kalman filter.


In this example, we also print the ID of each tracked object at every frame, and we save a copy of the video with annotations (its path is displayed at the end):


In [None]:
my_video = "https://homl.info/cars.mp4"

results = model.track(
    source=my_video,
    stream=True,
    save=True
)

for frame_results in results:
    summary = frame_results.summary()  # similar summary as earlier + track id
    track_ids = [obj["track_id"] for obj in summary]
    print("Track ids:", track_ids)


So far we have located objects using bounding boxes.

This is often sufficient, but sometimes you need to locate objects with much more precision—for example, to remove the background behind a person during a videoconference call.

Let’s see how to go down to the pixel level.


# Semantic Segmentation


In semantic segmentation, each pixel is classified according to the class of the object it belongs to (e.g., road, car, pedestrian, building, etc.), as shown in Figure 12-27.


Note that different objects of the same class are not distinguished. For example, all the bicycles on the right-hand side of the segmented image end up as one big lump of pixels.


The main difficulty in this task is that when images go through a regular CNN, they gradually lose their spatial resolution (due to the layers with strides greater than 1).


So, a regular CNN may end up knowing that there’s a person somewhere in the bottom left of the image, but it might not be much more precise than that.


**Figure 12-27.** Semantic segmentation


Just like for object detection, there are many different approaches to tackle this problem, some quite complex.


However, a fairly simple solution was proposed in the 2015 paper by Jonathan Long et al. on fully convolutional networks (FCNs).


The authors start by taking a pretrained CNN and turning it into an FCN.


The CNN applies an overall stride of 32 to the input image (i.e., if you multiply all the strides), meaning the last layer outputs feature maps that are 32 times smaller than the input image.


This is clearly too coarse, so they added a single upsampling layer that multiplies the resolution by 32.


There are several solutions available for upsampling (increasing the size of an image), such as bilinear interpolation, but that only works reasonably well up to ×4 or ×8.


Instead, they use a transposed convolutional layer.


This is equivalent to first stretching the image by inserting empty rows and columns (full of zeros), then performing a regular convolution.


Alternatively, some people prefer to think of it as a regular convolutional layer that uses fractional strides.


The transposed convolutional layer can be initialized to perform something close to linear interpolation, but since it is a trainable layer, it will learn to do better during training.


In PyTorch, you can use the `nn.ConvTranspose2d` layer.


## NOTE

In a transposed convolutional layer, the stride defines how much the input will be stretched, not the size of the filter steps, so the larger the stride, the larger the output (unlike for convolutional layers or pooling layers).
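The following sketch compares output sizes for a few strides, using untrained single-channel layers and arbitrary input sizes chosen for illustration:

```python
import torch
import torch.nn as nn

x_small = torch.randn(1, 1, 2, 2)  # input for the transposed convolutions
x_large = torch.randn(1, 1, 8, 8)  # input for the regular convolutions

transposed_sizes, regular_sizes = {}, {}
with torch.no_grad():
    for stride in (1, 2, 3):
        up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=stride)
        down = nn.Conv2d(1, 1, kernel_size=3, stride=stride)
        transposed_sizes[stride] = up(x_small).shape[-1]
        regular_sizes[stride] = down(x_large).shape[-1]

print(transposed_sizes)  # {1: 4, 2: 5, 3: 6}: larger stride, larger output
print(regular_sizes)     # {1: 6, 2: 3, 3: 2}: larger stride, smaller output
```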


**Figure 12-28.** Upsampling using a transposed convolutional layer


## Other PyTorch Convolutional Layers


**nn.Conv1d**

A convolutional layer for 1D inputs, such as time series or text (sequences of letters or words).


**À-trous convolutional layer**

Setting the dilation hyperparameter of any convolutional layer to a value of 2 or more creates an à-trous convolutional layer (à trous is French for “with holes”).


This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros.


This lets the convolutional layer have a larger receptive field at no extra computational cost and using no extra parameters.
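As a small check, the sketch below compares a regular 3 × 3 convolution with a dilated one on an arbitrary 7 × 7 input. Both have the same number of parameters, but the dilated filter spans a 5 × 5 region, so its output is smaller:

```python
import torch
import torch.nn as nn

regular = nn.Conv2d(1, 1, kernel_size=3)
atrous = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

def n_params(layer):
    return sum(p.numel() for p in layer.parameters())

x = torch.randn(1, 1, 7, 7)
with torch.no_grad():
    regular_size = regular(x).shape[-1]
    atrous_size = atrous(x).shape[-1]

print(n_params(regular), n_params(atrous))  # 10 10 (9 weights + 1 bias each)
print(regular_size)  # 5: each output sees a 3 x 3 region
print(atrous_size)   # 3: each output sees a 5 x 5 region
```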


Using transposed convolutional layers for upsampling is OK, but still too imprecise.


To do better, Long et al. added skip connections from lower layers.


They upsampled the output image by a factor of 2 and added the output of a lower layer that had this double resolution.


Then they upsampled the result by a factor of 16, leading to a total upsampling factor of 32.


This recovered some of the spatial resolution that was lost in earlier pooling layers.
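The shape arithmetic of this scheme can be sketched as follows. The tensors are random stand-ins for class score maps computed from two depths of a backbone, and the number of classes, kernel sizes, and paddings are assumptions chosen so that the sizes line up; this is not the authors' exact configuration:

```python
import torch
import torch.nn as nn

n_classes = 21  # hypothetical number of classes

# Stand-ins for class score maps, assuming a 224 x 224 input image
coarse = torch.randn(1, n_classes, 7, 7)    # overall stride 32
finer = torch.randn(1, n_classes, 14, 14)   # overall stride 16 (lower layer)

upsample_x2 = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=4,
                                 stride=2, padding=1)
upsample_x16 = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=32,
                                  stride=16, padding=8)

with torch.no_grad():
    fused = upsample_x2(coarse) + finer  # skip connection at double resolution
    output = upsample_x16(fused)         # back to the input resolution

print(fused.shape)   # torch.Size([1, 21, 14, 14])
print(output.shape)  # torch.Size([1, 21, 224, 224])
```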


In their best architecture, they used a second similar skip connection to recover even finer details from an even lower layer.


It is even possible to scale up beyond the size of the original image, which can be used for super-resolution.


**Figure 12-29.** Skip layers recover some spatial resolution from lower layers


## TIP

The FCN model is available in TorchVision, along with a couple of other semantic segmentation models.


Instance segmentation is similar to semantic segmentation, but instead of merging all objects of the same class, each object is distinguished from the others.


For example, Mask R-CNN extends Faster R-CNN by additionally producing a pixel mask for each bounding box.


So you get a bounding box, class probabilities, and a pixel mask for each object.


This model is available in TorchVision, pretrained on the COCO 2017 dataset.


## TIP

TorchVision’s transforms API v2 can apply to masks and videos, just like it applies to bounding boxes.


As you can see, deep computer vision is a vast and fast-paced field, with new architectures appearing every year.


Since 2020, Transformers have also entered the computer vision space.


Researchers are now tackling harder problems such as adversarial learning, explainability, realistic image generation, single-shot learning, video prediction, and multimodal models.


Next, we move on to sequential data such as time series, using recurrent neural networks and convolutional neural networks.
