# Week 14

More Neural Networks for images... and CNNs...

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/image_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/nn_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/WK14/raw/main/WK14_utils.py

!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/releases/latest/download/lfw.tar.gz | tar xz

In [None]:
import torch

from torch import nn
from torch import Tensor

from torchvision.models import resnet34, ResNet34_Weights
from torchvision.transforms import v2

from data_utils import LFWUtils, classification_error, display_confusion_matrix
from image_utils import make_image
from nn_utils import get_labels, get_num_params

from WK14_utils import display_activation_grids, display_kernel_grids

## INTRO !

In [None]:
train, test = LFWUtils.train_test_split(0.5)

iw,ih = LFWUtils.IMAGE_SIZE
nc = len(train["pixels"][0][0]) if isinstance(train["pixels"][0][0], list) else 1

x_train = Tensor(train["pixels"]).reshape(-1, ih, iw, nc).movedim(-1,1)
y_train = Tensor(train["labels"]).long()

x_test = Tensor(test["pixels"]).reshape(-1, ih, iw, nc).movedim(-1,1)
y_test = Tensor(test["labels"]).long()

print("Dataset Samples")
print("\tTrain:", len(x_train))
print("\tTest:", len(x_test))

print("\nDataset Shape:", list(x_train.shape))
print("\nSample Shape:", list(x_train[0].shape))

## Review

....

### Model, Optimizer, Cost/Loss Function

This is the model from last week.

## Transfer Learning

The CNN architecture is so stable that models can be made very deep, some with $100\text{s}$ of layers.

The internal layers of these models are so abstract and generic that once a model has been trained on millions of data samples (images), it retains information not only about the images in the dataset, but also about the general visual patterns it learned in the process.

It's not uncommon to use a previously trained model for a similar-but-different project, even if the images have nothing in common. Generic information about images can be transferred to new datasets and problem spaces.

<!-- <img src="./imgs/resnet_activation_00.jpg" height="300px" /> -->
<img src="https://i.postimg.cc/tR3twzmz/resnet-activation-00.jpg" height="300px" />

<!-- <img src="./imgs/resnet_activation_01.jpg" height="300px" /> -->
<img src="https://i.postimg.cc/hPKbB7kR/resnet-activation-01.jpg" height="300px" />

<!-- <img src="./imgs/resnet_activation_02.jpg" height="300px" /> -->
<img src="https://i.postimg.cc/15M053Sn/resnet-activation-03.jpg" height="300px" />

### Residual Networks

There are a couple of families of CNNs that get used as the starting point for many different types of visual models (and also audio and text). One such architecture is [ResNet](https://arxiv.org/abs/1512.03385).

ResNet comes in a few sizes/depths, and PyTorch has at least [5 pre-trained ResNet models](https://pytorch.org/hub/pytorch_vision_resnet/) that we can use.

These PyTorch ResNet models were trained on the [ImageNet](https://image-net.org/download.php) dataset. This dataset has $1\text{,}281\text{,}167$ training images and classifies objects into $1\text{,}000$ classes.

We'll use the `ResNet34` model, which is not the largest, but will fit nicely into small GPUs.

<!-- <img src="./imgs/resnet34_00.jpg" width="900px" /> -->
<img src="https://i.postimg.cc/XNc8xdqy/resnet34-00.jpg" width="900px" />

<!-- <img src="./imgs/resnet34_01.jpg" width="900px" /> -->
<img src="https://i.postimg.cc/hP20Rn9D/resnet34-01.jpg" width="900px" />


### Instantiating ResNet

Is easy:

In [None]:
model = resnet34(weights=ResNet34_Weights.DEFAULT)
display(model)

### Adjust inputs

From https://pytorch.org/hub/pytorch_vision_resnet/:

_All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]._

We can use `torchvision` transformation functions to achieve this, but this means that we'll now have some transformations that always have to happen and some that only happen on the training dataset.

In [None]:
res_transforms = v2.Compose([
  v2.ToDtype(torch.uint8),
  v2.Resize(224),
  v2.Grayscale(3),
  v2.ToDtype(torch.float32, scale=True),
  v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

In [None]:
x_train_res = res_transforms(x_train)
x_test_res = res_transforms(x_test)
x_train_res.shape

In [None]:
x_train_res[0].min(), x_train_res[0].max()
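
As a rough sanity check for the values printed above, the normalization arithmetic can be worked out by hand. This is a minimal sketch that only uses the mean and standard deviation quoted from the PyTorch documentation; the variable names are made up for illustration. Pixel values that start in $[0, 1]$ should end up within these bounds after `v2.Normalize`:

```python
# ImageNet channel statistics quoted above
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Normalize maps a pixel value p in [0, 1] to (p - mean) / std, per channel
lows = [(0.0 - m) / s for m, s in zip(mean, std)]
highs = [(1.0 - m) / s for m, s in zip(mean, std)]

for name, lo, hi in zip("RGB", lows, highs):
  print(f"{name}: [{lo:.4f}, {hi:.4f}]")

overall_min, overall_max = min(lows), max(highs)
print(f"overall: [{overall_min:.4f}, {overall_max:.4f}]")
```

A minimum close to $-2.12$ or a maximum close to $2.64$ would suggest the image contains pure black or pure white pixels.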

In [None]:
mdevice = "cuda" if torch.cuda.is_available() else "cpu"

model.fc = nn.Linear(model.fc.in_features, len(LFWUtils.LABELS))
model = model.to(mdevice)

learning_rate = 5e-3
optim = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)

loss_fn = nn.CrossEntropyLoss()

out = model(x_train_res[::3].to(mdevice))

print("Input shape:", x_train_res.shape)
print("Output shape:", out.shape)
print("Parameters:", get_num_params(model))
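
The cell above replaces the final `fc` layer and then lets the optimizer update every layer of the network. A common variation in transfer learning is to freeze the pre-trained layers and train only the new classification head. The sketch below illustrates that idea on a tiny stand-in network, so `TinyNet`, `tiny`, `sketch_optim` and `num_classes` are all hypothetical names that are not part of this notebook, and this is not necessarily the approach used here:

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained network: a small "backbone" plus an `fc` head
class TinyNet(nn.Module):
  def __init__(self):
    super().__init__()
    self.backbone = nn.Sequential(
      nn.Conv2d(3, 8, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.AdaptiveAvgPool2d(1),
      nn.Flatten(),
    )
    self.fc = nn.Linear(8, 1000)

  def forward(self, x):
    return self.fc(self.backbone(x))

tiny = TinyNet()

# Freeze every existing parameter ...
for p in tiny.parameters():
  p.requires_grad = False

# ... then swap in a new head; freshly created layers are trainable by default
num_classes = 5  # assumed value, for illustration only
tiny.fc = nn.Linear(tiny.fc.in_features, num_classes)

# Only hand the trainable parameters to the optimizer
trainable = [p for p in tiny.parameters() if p.requires_grad]
sketch_optim = torch.optim.SGD(trainable, lr=5e-3, momentum=0.9)

out_shape = list(tiny(torch.zeros(2, 3, 32, 32)).shape)
print("trainable tensors:", len(trainable))
print("output shape:", out_shape)
```

Freezing the backbone trains faster and uses less memory, while updating all layers usually adapts better when the new images differ a lot from the original training data.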

In [None]:
batch_step = 3

for e in range(16):
  model.train()
  for si in range(batch_step):
    optim.zero_grad()
    labels_pred = model(x_train_res[si::batch_step].to(mdevice))
    loss = loss_fn(labels_pred, y_train[si::batch_step].to(mdevice))
    loss.backward()
    optim.step()

  if e % 4 == 3:
    train_predictions = get_labels(model, x_train_res.to(mdevice))
    test_predictions = get_labels(model, x_test_res.to(mdevice))
    train_error = classification_error(y_train, train_predictions)
    test_error = classification_error(y_test, test_predictions)
    print(f"Epoch: {e} loss: {loss.item():.4f}, train error: {train_error:.4f}, test error: {test_error:.4f}")
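
The training loop above builds its mini-batches with strided slicing (`[si::batch_step]`) instead of a data loader. A small sketch with plain Python lists shows what that indexing does; `sample_ids` is a made-up stand-in for the positions of $10$ training samples:

```python
batch_step = 3
sample_ids = list(range(10))  # stand-in for the indices of 10 training samples

# Each step takes every batch_step-th sample, starting at offset si
batches = [sample_ids[si::batch_step] for si in range(batch_step)]
for si, batch in enumerate(batches):
  print(si, batch)

# Together the batches cover every sample exactly once
covered = sorted(i for batch in batches for i in batch)
print(covered == sample_ids)
```

Each epoch therefore sees the whole training set, split into `batch_step` interleaved batches of roughly equal size. Tensors slice the same way along their first dimension.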

In [None]:
train_predictions = get_labels(model, x_train_res.to(mdevice))
test_predictions = get_labels(model, x_test_res.to(mdevice))
train_error = classification_error(y_train, train_predictions)
test_error = classification_error(y_test, test_predictions)
print(f"train error: {train_error:.4f}, test error: {test_error:.4f}")

display_confusion_matrix(y_train, train_predictions, display_labels=LFWUtils.LABELS)
display_confusion_matrix(y_test, test_predictions, display_labels=LFWUtils.LABELS)

### Visualize Layers

That worked really well. The information learned by the `ResNet` network on more than 1 million images seems to transfer to our classification of faces, and we can leverage its pattern-recognition layers to build a more accurate model in a short amount of time.

Let's take a look at some of the filtered images in our `ResNet` model. We can do this with an untrained model, but it's better to look at one that has been recently trained.

When we displayed the model layers above we saw that the model has $4$ main groups of convolution layers. The further down the model we go, the smaller the images are, and the more abstract the activation patterns will be.

At the very last layer we might have $512$ _images_ that are only $4 \times 4$ pixels, but that light up under very specific conditions, like: is there a bird in the image? is there a face?
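
How small the images get depends on the input size. The sketch below applies the standard output-size formula for convolutions, $\lfloor (n + 2p - k) / s \rfloor + 1$, to the downsampling steps of a `ResNet34`: the first $7 \times 7$ convolution, the max-pool, and the stride-$2$ convolution at the start of groups $2$, $3$ and $4$. The helper names are made up for illustration:

```python
def conv_out(size, kernel, stride, padding):
  # Standard output-size formula for convolution and pooling layers
  return (size + 2 * padding - kernel) // stride + 1

def resnet_feature_sizes(size):
  sizes = {}
  size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
  size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
  sizes["layer1"] = size                                # layer1 keeps the size
  for name in ["layer2", "layer3", "layer4"]:
    # the first block of each later group halves the size (3x3 kernel, stride 2)
    size = conv_out(size, kernel=3, stride=2, padding=1)
    sizes[name] = size
  return sizes

print(resnet_feature_sizes(224))  # {'layer1': 56, 'layer2': 28, 'layer3': 14, 'layer4': 7}
print(resnet_feature_sizes(128))  # {'layer1': 32, 'layer2': 16, 'layer3': 8, 'layer4': 4}
```

So an edge of $224$ pixels shrinks to $7$ by the last group, and an edge of $128$ pixels shrinks to $4$. Since `v2.Resize(224)` with a single number resizes the shorter edge, non-square images will likely have one longer side at every stage.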

To see slightly larger images we'll look at some activations on layers $1$ and $2$.

We'll use the `hook` mechanism from `PyTorch` to add some auxiliary logic to the layers we are interested in looking at. This allows us to run some extra code on a layer's inputs and outputs every time it processes an image.

Our `hook` function will just save the layer's input and output tensors to external dictionaries that we can visualize later.

In [None]:
activations_in = {}
activations_out = {}
layer_kernels = {}

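def get_activation(name):
  # Build a forward hook that stores data under the given layer name
  def hook(module, inputs, output):
    # Save the layer's kernels the first time the hook runs
    if name not in layer_kernels:
      layer_kernels[name] = module.weight.detach()
    # inputs is a tuple with the layer's positional arguments
    activations_in[name] = inputs[0].detach()
    activations_out[name] = output.detach()
  return hook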

model.conv1.register_forward_hook(get_activation('conv1'))
model.layer1[1].conv2.register_forward_hook(get_activation('layer1.1.conv2'))
model.layer1[2].conv2.register_forward_hook(get_activation('layer1.2.conv2'))
model.layer2[0].conv2.register_forward_hook(get_activation('layer2.0.conv2'))
model.layer4[0].conv2.register_forward_hook(get_activation('layer4.0.conv2'))
model.layer4[2].conv2.register_forward_hook(get_activation('layer4.2.conv2'))
model = model.to(mdevice)
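
One detail worth knowing: `register_forward_hook` returns a handle, and re-running the cell above registers the same hooks a second time. The sketch below shows the mechanism in isolation on a tiny made-up network (`net`, `saved` and `save_output` are hypothetical names), including how to remove a hook once it's no longer needed:

```python
import torch
from torch import nn

# Hypothetical two-layer network, only here to demonstrate the hook mechanism
net = nn.Sequential(
  nn.Conv2d(3, 4, kernel_size=3, padding=1),
  nn.ReLU(),
)

saved = {}

def save_output(module, inputs, output):
  # inputs is a tuple with the layer's positional arguments
  saved["in"] = inputs[0].detach()
  saved["out"] = output.detach()

# register_forward_hook returns a handle that can remove the hook later
handle = net[0].register_forward_hook(save_output)

with torch.no_grad():
  net(torch.zeros(2, 3, 8, 8))

in_shape = list(saved["in"].shape)
out_shape = list(saved["out"].shape)
print(in_shape, out_shape)

# After removal, forward passes no longer call the hook
handle.remove()
saved.clear()
with torch.no_grad():
  net(torch.zeros(2, 3, 8, 8))
print(len(saved))
```

Keeping the handles returned by the `register_forward_hook` calls would make it possible to clean up the hooks on the `ResNet` model in the same way.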

Once we have our `hook` in place we have to pass some data through the network so it saves the inputs and outputs for us.

In [None]:
with torch.no_grad():
  model(x_train_res[0:128].to(mdevice))

Now we can display activations for specific images in the processed batch:

In [None]:
img_idx = 0
channel_idx = 0

img_t = x_train[img_idx, channel_idx]

display(make_image(img_t, width=img_t.shape[-1]))
display_activation_grids(activations_out, img_idx)

And we can also visualize the kernel values in each layer.

These are tiny $3 \times 3$ or $7 \times 7$ convolution kernels that get multiplied to filter the images.

Other than the first layer, where the kernels actually act on 3-channel images, the colors on the other kernels are artificial. They're the result of combining the kernels into groups of $3$, but in reality these kernels have $64$ or more channels.
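
To make the channel counts concrete, here's a small sketch that builds two standalone convolution layers with the same configuration as the first `ResNet` layer and as a convolution inside the first group, and prints the shape of their weight tensors. The names `first` and `inner` are made up, and these layers are freshly initialized, not taken from our trained model:

```python
from torch import nn

# Same configuration as ResNet's first layer: 3 input channels, 64 kernels of 7x7
first = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Same configuration as a conv layer inside layer1: 64 input channels, 64 kernels of 3x3
inner = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)

# Weight layout is (out_channels, in_channels, kernel_height, kernel_width)
first_shape = list(first.weight.shape)
inner_shape = list(inner.weight.shape)
print(first_shape)  # [64, 3, 7, 7]
print(inner_shape)  # [64, 64, 3, 3]
```

Only the first layer's kernels have exactly $3$ input channels, which is why they're the only ones that map directly to RGB colors.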

In [None]:
display_kernel_grids(layer_kernels)