
# CSCI E-25   
## Image classification with Vision Transformer
## Steve Elston

> **Attribution:** This notebook is a modification of the original [Keras example notebook](https://keras.io/examples/vision/image_classification_with_vision_transformer/) by [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/), 2021/01/18.


## Introduction

The Vision Transformer (ViT) model, [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929), was an early application of a transformer architecture to images. While this model is not state of the art in terms of performance or computational efficiency, one can learn a lot from it about how transformer architectures are applied to images. Thus, our goal is to develop an understanding of image transformer models rather than to achieve state-of-the-art performance.

Vision transformers apply the concept of **attention** to generating feature maps, or image embeddings. Convolutional neural networks (CNNs) are effective at creating feature maps for images, exhibiting advantageous **inductive bias**. However, CNNs generally have a small, local **receptive field**. In contrast, attention is computed globally over the image. While CNNs are effective at mapping small-scale features, attention provides mappings for features covering larger, specific parts of the image. For example, a CNN will effectively capture edges and local textures, whereas attention attends to entire parts of an image that are important to the task being learned. If the task is to identify objects, attention will capture features of entire objects such as vehicles, road surfaces, people, animals, etc.

### The ViT model

The ViT model is a pure transformer architecture, with no convolutional layers. There are three major steps in the ViT algorithm.     
1. The image is tokenized. Tokenization divides the image into small patches that can be efficiently processed. **Positional encoding** is used to encode the positions of the patches in the original image. The concepts of tokenization and positional encoding are inherited from transformer models used for natural language processing (NLP). As the processing of the tokens proceeds to deeper layers the tokens become more abstracted.  
2. An activation tensor is computed using multiple multi-headed layers of **scaled dot product attention (SDPA)** transformers. Multiple attention heads in each layer create attention-based feature maps or embeddings. Each head learns a different layer of the feature map. In each of the heads of a layer SDPA is computed as the product of the value (V) with the softmax activation of the dot product of the key (K) and query (Q), scaled by the square root of the dimension of the key, $d_k$.
$$SDPA = softmax \Bigg( \frac{Q K^T}{\sqrt{d_k}}\Bigg) V$$
3. Once the multiple heads have computed the activation tensor the layers are mixed using a **multi-layer perceptron (MLP)**. The purpose of this so-called **token mixing** is twofold. First, the dimensionality of the tensor is reduced to the embedding dimension. Second, the information in the tensor is linearly reweighted by the learned weights of the MLP.

### Understanding attention

The arguments to the foregoing equation are the vectors $V$, $K$ and $Q$. These vectors are embeddings of tokens. Each embedding is computed by matrix multiplication of the token vectors, $T_v, T_k, T_q$, with a **learned weight matrix**, $W_v, W_k, W_q$.

\begin{align}
    V &= T_v \cdot W_v^T \\
    K &= T_k \cdot W_k^T \\
    Q &= T_q \cdot W_q^T
\end{align}

We can interpret the product of $Q$ and $K$ as the **dot product similarity** between the query and the key. $Q$ and $K$ are normalized before the dot product is computed. As a result, the dot product similarity is the same as the cosine similarity. This similarity is then scaled by the square root of the key dimension, $d_k$, and a softmax activation is applied. The result is an activation tensor which is then multiplied by the value vector $V$ to give attention.
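
As a rough illustration of the equations above, the following is a minimal NumPy sketch of scaled dot product attention for a single head. The toy sizes, the random weight matrices, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions made for this example, and the normalization of $Q$ and $K$ described above is omitted for brevity. It is not the Keras implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # [num_tokens, num_tokens] similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
num_tokens, token_dim, d_k = 4, 6, 8    # hypothetical toy sizes

T = rng.normal(size=(num_tokens, token_dim))  # one token vector per row
W_q = rng.normal(size=(d_k, token_dim))       # stand-ins for learned weights
W_k = rng.normal(size=(d_k, token_dim))
W_v = rng.normal(size=(d_k, token_dim))

# Embeddings, following V = T W_v^T, K = T W_k^T, Q = T W_q^T.
Q, K, V = T @ W_q.T, T @ W_k.T, T @ W_v.T

attention, weights = scaled_dot_product_attention(Q, K, V)
print(attention.shape)       # one d_k-dimensional output per token
print(weights.sum(axis=-1))  # softmax rows sum to 1
```

Each row of `weights` says how strongly one token attends to every other token, which is why attention is described as global over the image.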

### Multiheaded attention

A given attention layer attends to a particular type of feature, such as color, shape or texture. To create feature maps we use **multiheaded attention**, where each head learns different features. Each head learns different weight matrices, $W_v, W_k, W_q$, giving different vector embeddings, $V, K, Q$. The output of the multiple heads is combined by **mixing the attention layers**. The result of the mixing operation is a vector with the length of the embedding dimension. Intuitively, we can think of the concept of using multiple attention heads as analogous to using multiple channels in convolutional layers of a CNN.
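
The idea of several heads with separate weights can be sketched in NumPy as follows. This is an illustration under stated assumptions: the sizes are toy values, the weights are random rather than learned, and the mixing step is represented by a single hypothetical output projection matrix `W_o` applied to the concatenated head outputs.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(T, heads, W_o):
    """T: [num_tokens, embed_dim]; heads: list of (W_q, W_k, W_v) tuples."""
    outputs = []
    for W_q, W_k, W_v in heads:
        # Each head has its own weights, hence its own Q, K and V.
        Q, K, V = T @ W_q.T, T @ W_k.T, T @ W_v.T
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    # Concatenate the heads: [num_tokens, num_heads * d_k].
    concatenated = np.concatenate(outputs, axis=-1)
    # Mix the heads back to the embedding dimension.
    return concatenated @ W_o.T

rng = np.random.default_rng(1)
num_tokens, embed_dim, d_k, n_heads = 5, 16, 4, 4  # hypothetical toy sizes

T = rng.normal(size=(num_tokens, embed_dim))
heads = [
    tuple(rng.normal(size=(d_k, embed_dim)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(size=(embed_dim, n_heads * d_k))

mixed = multi_head_self_attention(T, heads, W_o)
print(mixed.shape)  # one embed_dim-length vector per token
```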

### Self attention

For the classification task example in this notebook we use just one token vector, $T_v = T_k = T_q$. Applying the learned weights, $W_v, W_k, W_q$, to this token vector, followed by the softmax activation, gives **self attention**. Conceptually, self attention creates feature maps attending to features used by the classifier head.

### Dataset

The example in this notebook implements a ViT model for image classification, using the **[CIFAR-100 dataset](https://www.cs.toronto.edu/~kriz/cifar.html)**. CIFAR-100 has low resolution $32 \times 32$ images of 100 classes of objects, with 500 training and 100 testing images for each class. While 500 training examples per category may seem like a lot of data, given the small size of the images, the complexity of the model (number of trainable parameters), and the visual similarity between some of the classes, this is a challenging classification task!

### Alternative Transformer Models  

As has already been mentioned we are using the now obsolete ViT model as a baseline for learning the basic principles of transformer models. Given the limitations of this model, the small training dataset, and the limited computing power used, we do not expect anything approaching state-of-the-art results. 

You can find an extensive library of pre-trained Keras image transformer models in the [*Keras_cv_attention_models*](https://github.com/leondgarse/keras_cv_attention_models) repository and package.

### Setup to Run this Notebook

This notebook was created and tested using a Google Colab Pro+ account. While the models in this notebook are not considered large by current standards, training them is computationally intensive. Expect long run-times for model training in any environment. You are free to run this notebook in any environment of your choosing that has sufficient resources.

To run the notebook in Colab you will need a [Google Colaboratory account](https://colab.research.google.com/) if you do not already have one. Log into your Google account. You can then *Upload* this notebook into your Colab work space. Make sure you configure the Runtime to use an appropriate GPU, such as an A100. Large memory should not be required. Further, a dedicated Google cloud storage account (not Google Drive) is required. It appears that conflicts arise in the stack when using an H100 GPU with the JAX backend.

To import the packages required to run this notebook execute the code in the cell below.

In [None]:
import os

## os.environ["KERAS_BACKEND"] = "tensorflow"  # @param ["tensorflow", "jax", "torch"]
os.environ["KERAS_BACKEND"] = "jax"  # @param ["tensorflow", "jax", "torch"]

import keras
from keras import layers, models, ops

import sklearn.metrics as metrics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Prepare the Dataset

The widely used CIFAR-100 dataset is provided in the `keras.datasets` package. The code in the cell below loads the train and test images and labels. Execute this code.

In [None]:
num_classes = 100
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")
print(f"Number of unique classes: {len(np.unique(y_train))}")

## Configure the Hyperparameters

The code in this notebook has quite a few hyperparameters. To set these hyperparameters execute the code in the cell below.

In [None]:
learning_rate = 0.0001
weight_decay = 0.001
batch_size = 256
num_epochs = 25
image_size = 72  # We'll resize input images to this size
patch_size = 6  # Size of the patches to be extracted from the input images
num_patches = (image_size // patch_size) ** 2
projection_dim = 64
num_heads = 4
transformer_units = [
    projection_dim * 2,
    projection_dim,
]  # Size of the transformer layers
transformer_layers = 8
mlp_head_units = [
    2048,
    1024,
]  # Size of the dense layers of the final classifier

## Data Augmentation

As has already been mentioned, there are only 500 training images per category in CIFAR-100. We can improve on this situation by defining **data augmentation** layers for our ViT model. Execute the code in the cell below to instantiate the data augmentation object.

In [None]:
data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
        layers.RandomContrast(factor=0.1),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)
# Compute the mean and the variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)

> **Exercise 5-1:** Two of the layers defined above perform standard preprocessing of the images rather than augmentation. Answer these questions.
> 1. What are these layers and what is their purpose?
> 2. How are the pixel values of the images represented after applying these layers?

> **Answers:**
> 1.            
> 2.            

## Implement a Multilayer Perceptron (MLP) Layer       

The ViT model requires an MLP layer to mix the output of the heads in the attention layers. An MLP layer is also required for the classification head. The code in the cell below defines the layers of the MLP. Execute this code to instantiate this function.

In [None]:
def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=keras.activations.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x

## Patch Creation Layer     

It is now time to explore the code to tokenize the images, by creating patches. Execute the code in the cell below to create the Patches class.    

In [None]:
class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        input_shape = ops.shape(images)
        batch_size = input_shape[0]
        height = input_shape[1]
        width = input_shape[2]
        channels = input_shape[3]
        num_patches_h = height // self.patch_size
        num_patches_w = width // self.patch_size
        patches = keras.ops.image.extract_patches(images, size=self.patch_size)
        patches = ops.reshape(
            patches,
            (
                batch_size,
                num_patches_h * num_patches_w,
                self.patch_size * self.patch_size * channels,
            ),
        )
        return patches

    def get_config(self):
        config = super().get_config()
        config.update({"patch_size": self.patch_size})
        return config
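
To make the tokenization step concrete, here is a NumPy sketch that splits a batch of images into flattened, non-overlapping patches using only `reshape` and `transpose`. The function name `extract_patches_np` and the synthetic input are assumptions for illustration. This is an independent sketch of the idea, not the implementation behind `keras.ops.image.extract_patches`.

```python
import numpy as np

def extract_patches_np(images, patch_size):
    """Split [batch, height, width, channels] images into flattened patches."""
    b, h, w, c = images.shape
    n_h, n_w = h // patch_size, w // patch_size
    # Crop any remainder so height and width are multiples of patch_size.
    x = images[:, : n_h * patch_size, : n_w * patch_size, :]
    # Separate the patch-grid axes from the within-patch axes.
    x = x.reshape(b, n_h, patch_size, n_w, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # [b, n_h, n_w, patch, patch, c]
    # One row per patch, one column per pixel value in the patch.
    return x.reshape(b, n_h * n_w, patch_size * patch_size * c)

# Synthetic stand-in for a batch of two resized 72 x 72 RGB images.
images = np.arange(2 * 72 * 72 * 3, dtype="float32").reshape(2, 72, 72, 3)
patches = extract_patches_np(images, patch_size=6)
print(patches.shape)
```

With this layout the first row of `patches[0]` holds the pixels of the top-left patch, and subsequent rows move left to right and then top to bottom across the image.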

Execute the code below to display patches for a sample image.

In [None]:
plt.figure(figsize=(4, 4))
image = x_train[np.random.choice(range(x_train.shape[0]))]
plt.imshow(image.astype("uint8"))
plt.axis("off")

resized_image = ops.image.resize(
    ops.convert_to_tensor([image]), size=(image_size, image_size)
)

# Cast the image to float32 in a backend-independent way before passing to Patches
patches = Patches(patch_size)(ops.cast(resized_image, "float32"))

print(f"Image size: {image_size} X {image_size}")
print(f"Patch size: {patch_size} X {patch_size}")
print(f"Patches per image: {patches.shape[1]}")
print(f"Elements per patch: {patches.shape[-1]}")

n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = ops.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(ops.convert_to_numpy(patch_img).astype("uint8"))
    plt.axis("off")

> **Exercise 5-2:** Examine the code for the Patches object and the example images and resulting patches and answer these questions. *Note:* if the object in the randomly selected image is not clear, execute the code in the cell above until you have a clear image.
> 1. Given the dimensions of the input image and of the patches, how many horizontal and vertical patches are created?
> 2. What is the upper limit on the number of non-overlapping tokens one can create from the image? What do these small tokens represent?
> 3. The `keras.ops.reshape` function is applied to the patches tensor. Consider the dimensions of the resulting output tensor. Explain what the number of rows of this tensor represents. Explain what the dimension of the row vectors represents.
> 4. Examine the image of the patches. Do some patches contain more of the object to be classified, as opposed to other items or background in the image, and what does this mean for the tokens that should be attended to in order to optimize performance of the task-specific head?

> **Answers:**
> 1.            
> 2.             
> 3.              
> 4.            

## Implement the patch encoding layer

Once patches are found, the token embedding must be computed. The embedding vector is created in two steps.
1. Linear projection of the image tokens into the embedding space.
2. Adding positional encoding to the embedded token vectors.   

Execute the code in the cell below to create the PatchEncoder class.

In [None]:
class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = ops.expand_dims(
            ops.arange(start=0, stop=self.num_patches, step=1), axis=0
        )
        projected_patches = self.projection(patch)
        encoded = projected_patches + self.position_embedding(positions)
        return encoded

    def get_config(self):
        config = super().get_config()
        config.update({"num_patches": self.num_patches})
        return config
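
The two embedding steps can also be written out with plain arrays. The sketch below is an illustration only: the random arrays `W`, `b` and `position_table` are stand-ins for the weights that the `Dense` and `Embedding` layers would learn, and the shapes follow the hyperparameters used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, patch_dim, projection_dim = 2, 144, 108, 64

patches = rng.normal(size=(batch, num_patches, patch_dim))

# Stand-ins for learned parameters (random here, for illustration only).
W = rng.normal(size=(patch_dim, projection_dim)) * 0.02  # Dense kernel
b = np.zeros(projection_dim)                             # Dense bias
position_table = rng.normal(size=(num_patches, projection_dim)) * 0.02

# Step 1: linear projection of each flattened patch into the embedding space.
projected = patches @ W + b  # [batch, num_patches, projection_dim]

# Step 2: look up one learned vector per patch position and add it.
positions = np.arange(num_patches)               # 0, 1, ..., num_patches - 1
encoded = projected + position_table[positions]  # broadcast over the batch

print(encoded.shape)
```

Because the position vectors are indexed only by patch position, every image in the batch receives the same positional offsets.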

> **Exercise 5-3:** Examine the code in the `__init__` and `call` methods above and answer these questions in one or a few sentences.   
> 1. Explain how the code computes and applies the projection weight matrix to compute the token embedding.
> 2. Explain how the positional embedding is created and added to the token embedding. It may help you to read the documentation for the [Keras embedding layer](https://keras.io/api/layers/core_layers/embedding/).      

> **Answers:**
> 1.             
> 2.         

## Build the ViT model

We now have all the pieces required to build the complete ViT model. This model executes the following steps.     
1. Data augmentation is applied to the batch of images.     
2. The patches of the augmented images in the batch are embedded and positionally encoded.   
3. Multiple transformer layers compute an attention tensor of dimension $[batch\_size,\ num\_patches,\ projection\_dim]$. The transformer uses the [Keras multiheaded attention layer](https://keras.io/api/layers/attention_layers/multi_head_attention/). Reading the documentation for this layer will help your understanding of the hyperparameters and the arguments.
4. The classifier head computes the most probable category.     

Execute this code to instantiate the function.   

In [None]:
def create_vit_classifier():
    inputs = keras.Input(shape=input_shape)
    # Augment data.
    augmented = data_augmentation(inputs)
    # Create patches.
    patches = Patches(patch_size)(augmented)
    # Encode patches.
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Create multiple layers of the Transformer block.
    for _ in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        # Apply the MLP to the x3 tensor
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])

    # Normalize the [batch_size, num_patches, projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    # Flatten to [batch_size, num_patches * projection_dim] and apply dropout regularization
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Add MLP layer for the classifier
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    # Classify outputs.
    logits = layers.Dense(num_classes)(features)
    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model
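
It can help to trace the per-image tensor shapes through the model by hand. The short sketch below does this with plain Python arithmetic, using hyperparameter values copied from the configuration cell. The list of stage names is an informal summary written for this example, not output produced by Keras.

```python
# Hyperparameters copied from the configuration cell of this notebook.
image_size, patch_size, channels = 72, 6, 3
projection_dim, num_classes = 64, 100
mlp_head_units = [2048, 1024]

num_patches = (image_size // patch_size) ** 2
patch_dim = patch_size * patch_size * channels
flattened = num_patches * projection_dim

# Per-image shapes (the batch dimension is omitted).
shapes = [
    ("resized image", (image_size, image_size, channels)),
    ("patches", (num_patches, patch_dim)),
    ("encoded patches", (num_patches, projection_dim)),
    ("each transformer block output", (num_patches, projection_dim)),
    ("flattened representation", (flattened,)),
    ("MLP head output", (mlp_head_units[-1],)),
    ("logits", (num_classes,)),
]
for name, shape in shapes:
    print(f"{name:32s} {shape}")

# Weights plus biases of the first dense layer in the classifier head.
first_head_dense_params = flattened * mlp_head_units[0] + mlp_head_units[0]
print(f"First head Dense parameters: {first_head_dense_params:,}")
```

The flattened representation is large because every token contributes `projection_dim` values, so the first dense layer of the classifier head holds a substantial share of the trainable parameters.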

> **Exercise 5-4:** Examine the code in the cell above and provide short answers to the following questions in one or a few sentences. Notice that the loop constructs the stack of transformer layers.
> 1. Explain how the normalization layer applied before `MultiHeadAttention` affects the interpretation of the dot product computed.
> 2. Explain how the arguments to `MultiHeadAttention` result in computing self-attention.
> 3. The normalized tensor from `MultiHeadAttention` is passed to an MLP. What is the purpose of this MLP and what are the input and output tensor dimensions?
> 4. What does the last layer in the loop do and why is this function important?
> 5. The scalability limitations of transformer models are widely known. Compute how the relative computational demands of the model will change if the patch size is changed from $6 \times 6$ pixels to:
>    - a) $3 \times 3$ pixels
>    - b) $12 \times 12$ pixels?    
>      Perform the simple algebraic calculation to find the ratio of complexity for these cases.    
> 6. What does the result of your scaling calculation tell you about the trade-off between spatial resolution of a tokenized image transformer model and computational complexity?  

> **Answers:**
> 1.           
> 2.                
> 3.           
> 4.             
> 5. 
> 6.           

## Compile, Train, and Evaluate the Model         

With the model constructed it is time to compile, train and evaluate the model. Execute the code in the cell below and examine the results. Expect the training to take some time. On Colab Pro+ running an A100 GPU the training took over one hour.    

As a first step, execute the code in the cell below to instantiate the model and print a summary.

In [None]:
## Instantiate the model
vit_classifier = create_vit_classifier()

## Print the model summary
vit_classifier.summary()

> **Exercise 5-5:** Examine the model summary and answer these questions.
> 1. To get a feel for how many trainable parameters there are in a ViT block, sum the learnable parameters for the last block, from the `layer_normalization` to the second `add` following `multi-head attention`.
> 2. Now, compare the total trainable parameters of the ViT model and the model you created for Exercise 4-10. How do you think this difference in number of free parameters affects how difficult it is to train the ViT model?

> **Answers:**
> 1.           
> 2.             

The code in the cell below trains the ViT model. The function returns the trained model along with the training history.    

Hyperparameters for the model are set in the *Configure the Hyperparameters* section of this notebook, above. An exhaustive search has not been conducted. A limited search was conducted by fitting the model for 15 epochs only and observing the results, which are summarized here.    

| Weight Decay | Learning Rate | Val Accuracy / Top-5 Accuracy | Comments |
| :----: | :----: | :----: | :----: |
| 0.0001 | 0.001 | 0.15/0.54 | Erratic train and val curves, learning slows after 10 epochs |
| 0.0001 | 0.0001 | 0.14/0.40 | Less erratic train curves, learning slows after 10 epochs |
| 0.001 | 0.0001 | 0.13/0.35 | Less erratic train and val curves, learning continues for 15 epochs |

Using an A100 GPU on Colab each epoch took approximately 6 minutes to run. If you find the training time excessive, you can reduce the `num_epochs` hyperparameter to 15. Execute the code to train the model.    

In [None]:
def run_experiment(model):
    optimizer = keras.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )

    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )

    checkpoint_filepath = "/tmp/checkpoint.weights.h5"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")

    return model, history


## Train the model
vit_classifier, history = run_experiment(vit_classifier)
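
The model outputs raw logits, which is why the loss is configured with `from_logits=True`. As an illustration of what the loss and the top-5 metric measure, here is a NumPy sketch on random data. The helper names and the random logits are assumptions for this example, and the code is a conceptual equivalent rather than the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    # Log-softmax computed stably by subtracting the per-row maximum.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-probability assigned to the true class.
    return -log_probs[np.arange(len(labels)), labels].mean()

def top_k_accuracy(logits, labels, k=5):
    # Indices of the k largest logits per row (order within the k is irrelevant).
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    return np.mean([label in row for label, row in zip(labels, top_k)])

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 100))     # hypothetical batch: 8 images, 100 classes
labels = rng.integers(0, 100, size=8)  # integer class labels, as in y_train

loss = sparse_categorical_crossentropy_from_logits(logits, labels)
top1 = np.mean(logits.argmax(axis=1) == labels)
top5 = top_k_accuracy(logits, labels, k=5)
print(f"loss={loss:.3f}  top-1={top1:.3f}  top-5={top5:.3f}")
```

A model that assigns equal logits to all 100 classes has a loss of $\ln(100) \approx 4.6$, which gives a useful reference point when reading the loss curve.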

To better understand the training of this model, execute the code in the cell below to display the learning curve.   

In [None]:
def plot_hist(hist):
    _,ax = plt.subplots(1,2, figsize = (12,6))
    ax[0].plot(hist.history["accuracy"], label="train")
    ax[0].plot(hist.history["val_accuracy"], label="validation")
    ax[0].plot(hist.history['top-5-accuracy'], label="Top 5 train accuracy")
    ax[0].plot(hist.history['val_top-5-accuracy'], label="Top 5 validation accuracy")
    ax[0].set_title("model accuracy")
    ax[0].set_ylabel("accuracy")
    ax[0].set_xlabel("epoch")
    ax[0].legend(loc="upper left")
    ax[1].plot(hist.history["loss"], label="train")
    ax[1].plot(hist.history["val_loss"], label="validation")
    ax[1].set_title("model loss")
    ax[1].set_ylabel("loss")
    ax[1].set_xlabel("epoch")
    ax[1].legend(loc="upper right")
    plt.show()

# Plot the training history
plot_hist(history)

As stated at the beginning of this notebook, we are not expecting state-of-the-art results from this model. To provide some perspective, the results reported by [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929) are achieved by pre-training the ViT model on the massive JFT-300M dataset. The pre-trained model is then fine-tuned on the target dataset.

> **Exercise 5-6:** Examine the results of the model training and provide brief answers to these questions.
> 1. Examine the accuracy and loss curves from the model training. What evidence is there that the model has learned over the limited number of epochs, and has the learning completed and why?
> 2. Is there evidence that with the chosen hyperparameters the model is exhibiting significant over-fitting and why?
> 3. A higher capacity transformer model might give better performance, but at a cost in scalability. Describe two options for expanding the capacity of the transformer model.
> 4. We are training this model using conventional supervised machine learning, employing a gradient descent algorithm, AdamW, to minimize categorical cross entropy. What other more sophisticated and complex approach could be used to effectively train this model using the limited labeled data available? Why do you expect this approach to produce better results and what are the costs of applying this approach?

> **Answers:**
> 1.             
> 2.               
> 3.          
> 4.               

The code in the cell below uses the test dataset to compute summary model performance statistics and to display a confusion matrix. This code will require considerable computing time to perform the required inferences. Execute the code and examine the results.

In [None]:
def print_model_performance(test_labels, ds_test, test_model):
    ## Compute predicted labels
    predictions = test_model.predict(ds_test, batch_size=1)
    predicted = predictions.argmax(axis=1)

    k = 5
    print('Overall accuracy = ' + str(round(metrics.accuracy_score(test_labels, predicted), 4)))
    print('Top 5 accuracy = ' + str(round(metrics.top_k_accuracy_score(test_labels, predictions, k=k),4)))

    unique_labels, label_counts = np.unique(test_labels, return_counts=True)
    class_precision = metrics.precision_score(test_labels, predicted, labels=unique_labels, average=None)
    class_recall = metrics.recall_score(test_labels, predicted, labels=unique_labels, average=None)

    sum_label_counts = np.sum(label_counts)
    weighted_average = lambda x: round(np.sum(np.divide(x * label_counts, sum_label_counts)), 4)
    print('Average precision = ' + str(weighted_average(class_precision)))
    print('Average recall = ' + str(weighted_average(class_recall)))
    return predicted

def plot_confusion_matrix(test_labels, predicted):
    confusion_matrix = metrics.confusion_matrix(test_labels, predicted)

    plt.figure(figsize = (12,9))
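    # Divide each row (true class) by its own total; keepdims makes the totals a column vector.
    p = plt.imshow(np.log(np.divide(confusion_matrix + 1.0, np.sum(confusion_matrix, axis=1, keepdims=True))))
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    cb = plt.colorbar(p)
    _=cb.set_label('Log of row-normalized count')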

test_labels = y_test.flatten()
## Compute predictions and display performance metrics
predicted = print_model_performance(test_labels, x_test, vit_classifier)

## Display the confusion matrix
plot_confusion_matrix(test_labels, predicted)
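
When normalizing a confusion matrix by the number of examples in each true class, the shape of the row totals matters. The sketch below uses a small hypothetical 3-class matrix with unequal class counts to show the difference between dividing by a column vector of totals and dividing by a flat vector. In the CIFAR-100 test set every class has 100 images, so the two computations happen to give the same numbers there.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 5, 15]])

# keepdims=True gives shape (3, 1): one total per true class, aligned with the rows.
row_totals = cm.sum(axis=1, keepdims=True)
row_normalized = cm / row_totals  # each row now sums to 1

# Without keepdims the totals have shape (3,) and broadcast across columns,
# so column j is divided by the total of row j instead.
column_broadcast = cm / cm.sum(axis=1)

print(row_normalized.sum(axis=1))
print(np.allclose(row_normalized, column_broadcast))
```

After row normalization the diagonal entries can be read as per-class recall.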

As was noted previously, this is a challenging classification problem with many similar categories. Further, the training of the ViT model is incomplete.    

> **Exercise 5-7:** Examine the confusion matrix, noting the pattern of errors. What does this pattern of errors tell you about the difficulty of identifying certain categories of objects in the images?

> **Answer:**               

## Exploring the Attention Layers

It will be interesting to explore the attention activations from different layers in this network.    

As a first step, execute the code in the cell below to create and display a list of the layer names for the network.  

In [None]:
with pd.option_context('display.max_rows', None):
  print(pd.Series([layer.name for layer in vit_classifier.layers]))

We will now view activations from three of the multi-head self-attention (MSA) layers. To do so, we display the results from the following normalization layer, which will limit issues with the scale of the activations.

The code in the cell below extracts the normalized activation patches for the specified layers for a single image and displays them. Execute this code and examine the results.        

In [None]:
def plot_patches(patches, patch_display_size=8):
  n = int(np.sqrt(patches.shape[1]))
  plt.figure(figsize=(8, 8))
  for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    # Reshape the feature vector (projection_dim elements) into a square grayscale image
    patch_img = ops.reshape(patch, (patch_display_size, patch_display_size))
    plt.imshow(ops.convert_to_numpy(patch_img), cmap='gray')
    plt.axis("off")
  # Adjust the spacing once all subplots exist
  plt.tight_layout()

layer_names = [layer.name for layer in vit_classifier.layers] # create list of layer names
layer_indx = [13, 22, 76] # indices of layers we want to visualize
layer_outputs = [vit_classifier.get_layer(layer_names[i]).output for i in layer_indx]

activation_model = models.Model(inputs=vit_classifier.inputs[0], outputs=layer_outputs)

img_tensor = np.expand_dims(image, axis=0)
activations = activation_model.predict(img_tensor)

layer_label=[layer_names[i] for i in layer_indx]
for i in range(len(layer_label)):
  print(layer_label[i])
  plot_patches(activations[i], patch_display_size=int(np.sqrt(projection_dim)))
  plt.show()
  print("\n\n\n")

The activations are organized by the image patches, with lighter colors indicating higher activation. Notice that some patches have higher activations, indicating high attention in that patch. Further, the patterns within each patch change from layer to layer, showing that attention is on different features of the image.

#### Portions of this document are copyright 2026, Stephen F Elston. All rights reserved.  