# Homework - DINO
### Sharif University - Deep Learning Course - Spring 2024

*Instructor:  Dr. Soleymani*

---

*Full Name:* Ali Alvandi

*SID:* 400104748

---

In this homework we will learn to use the DINO model.

In this homework you need to complete the notebook and run all the cells.
We have specified the parts to be completed with `TODO` tags inside the code blocks.

**NOTES**:
* Read all the code and text blocks carefully, even if you are eager to jump straight into completing the missing code.
* This notebook was tested on the *Google Colab* and *Kaggle* free runtimes, and you can use them to test your code.
* Ensure all cells are executable and perform their intended functions.
* You can ask your questions on [Quera Class](https://quera.org/course/16605/)
* Write clear, commented code when necessary.

# Introduction

Listen up, folks! You know how they say dinosaurs are extinct? Well, they lied to us. The DINO model is proof that these ancient beasts are still roaming the earth, but this time, they're here to help us with feature extraction and downstream tasks. Imagine a T-Rex with a fancy deep learning algorithm strapped to its back, stomping around and making sense of all the data in its path.

But wait, it gets better! This dino doesn't just extract features; it does it in a self-supervised manner, which means it's like a kid who learned to tie its own shoelaces without any help from its parents (or, in this case, labeled data). And once it's done extracting those juicy features, it's ready to tackle any downstream task you throw its way, whether it's image classification, object detection, or even predicting the next hot dinosaur-themed movie.

In this homework assignment, we will be utilizing the DINO model to extract meaningful visual features from satellite imagery data. The self-supervised DINO model has proven to be an effective tool for extracting rich representations from visual data without the need for labeled examples during pre-training.

Specifically, we will leverage the DINO model's capabilities to extract visual features from satellite images. These extracted features will then be used to train a classifier on top of the DINO backbone. The goal of this classifier is to predict whether a given satellite image contains solar panels or not.

Moving on to the second part of the assignment, we will explore the transformer attention maps produced by the DINO model. By analyzing these attention maps, we aim to estimate the size of the solar panels present in the positive examples from the dataset.

While not as exhilarating as envisioning a dinosaur with deep learning capabilities, this assignment presents an opportunity to gain hands-on experience with a state-of-the-art self-supervised model and its applications in computer vision tasks. I'm sure you'll find the process insightful and rewarding.

# Installations and imports

As usual, imports are our first step.

In [1]:
!pip install einops

Collecting einops
  Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Downloading einops-0.8.0-py3-none-any.whl (43 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.2/43.2 kB 590.7 kB/s eta 0:00:00
Installing collected packages: einops
Successfully installed einops-0.8.0


In [2]:
!pip install rasterio



In [3]:
!conda install -y gdown

Retrieving notices: ...working... done
Channels:
 - rapidsai
 - nvidia
 - conda-forge
 - defaults
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - gdown


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2024.6.2           |     pyhd8ed1ab_0         157 KB  conda-forge
    filelock-3.15.4            |     pyhd8ed1ab_0          17 KB  conda-forge
    gdown-5.2.0                |     pyhd8ed1ab_0          21 KB  conda-forge
    openssl-3.3.1              |       h4ab18f5_1         2.8 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following NEW packages will be INSTALLED:

  filelock           conda-forge/noarch::filelock-3.15.4-pyhd8ed1a

In [4]:
!gdown --id 1AgBzfGrOSEcuItEGBv7VpSw6bfJBichK

Downloading...
From (original): https://drive.google.com/uc?id=1AgBzfGrOSEcuItEGBv7VpSw6bfJBichK
From (redirected): https://drive.google.com/uc?id=1AgBzfGrOSEcuItEGBv7VpSw6bfJBichK&confirm=t&uuid=8f35b1cf-259a-4dd4-9bfc-3a9d88e9a63d
To: /kaggle/working/uk20K.zip
100%|███████████████████████████████████████| 1.92G/1.92G [00:13<00:00, 142MB/s]


In [5]:
import glob
import torch
import rasterio
import numpy as np
import einops as eo
import random as rnd
import torch.nn as nn
from PIL import Image
from pathlib import Path
import matplotlib.pyplot as plt
import torch.nn.functional as F
from matplotlib.colors import Normalize
from torch.utils.data import Dataset, DataLoader

# Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [6]:
# Move the dataset to this colab session.
# We suggest you add a shortcut from the given file to your own google drive account and then copy the file from that shortcut to here.

# ======================= Your Code =======================
!cp '/content/drive/My Drive/uk20K.zip' '/content/'
# ======================= Your Code =======================

cp: cannot stat '/content/drive/My Drive/uk20K.zip': No such file or directory


In [7]:
!unzip uk20K.zip

Archive:  uk20K.zip
   creating: uk20K_v2/
  inflating: uk20K_v2/262711-N.tif   
  inflating: uk20K_v2/338394-N.tif   
  inflating: uk20K_v2/632183-N.tif   
  inflating: uk20K_v2/107870-N.tif   
  inflating: uk20K_v2/139125-N.tif   
  inflating: uk20K_v2/213270-N.tif   
  inflating: uk20K_v2/47080-P.tif    
  inflating: uk20K_v2/108053-N.tif   
  inflating: uk20K_v2/671716-N.tif   
  inflating: uk20K_v2/351590-N.tif   
  inflating: uk20K_v2/63310-P.tif    
  inflating: uk20K_v2/178551-P.tif   
  inflating: uk20K_v2/669802-N.tif   
  inflating: uk20K_v2/87718-N.tif    
  inflating: uk20K_v2/109704-P.tif   
  inflating: uk20K_v2/359002-N.tif   
  inflating: uk20K_v2/71334-P.tif    
  inflating: uk20K_v2/150342-P.tif   
  inflating: uk20K_v2/332348-N.tif   
  inflating: uk20K_v2/230652-N.tif   
  inflating: uk20K_v2/1518-P.tif     
  inflating: uk20K_v2/82360-P.tif    
  inflating: uk20K_v2/439983-N.tif   
  inflating: uk20K_v2/175022-P.tif   
  inflating: uk20K_v2/14206-P.tif    
  infla

Let's prepare the train and validation datasets.
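
The dataset class in the next cell calls `normalize` with a mean and standard deviation of 0.5 per channel, which maps raw pixel values in [0, 255] to [-1, 1]. The following is a minimal NumPy sketch of that arithmetic and its inverse. `normalize_chw` and `denormalize_chw` are hypothetical stand-ins written for illustration (they assume a single channels-first image), not the notebook's helpers.

```python
import numpy as np

def normalize_chw(image, mean, std):
    """Scale a (C, H, W) image from [0, 255] to [0, 1], then standardize per channel."""
    mean = np.asarray(mean, dtype=np.float64).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float64).reshape(-1, 1, 1)
    return (image / 255.0 - mean) / std

def denormalize_chw(image, mean, std):
    """Invert normalize_chw and return values on the original [0, 255] scale."""
    mean = np.asarray(mean, dtype=np.float64).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float64).reshape(-1, 1, 1)
    return (image * std + mean) * 255.0

# A tiny 3-channel, 2x3 image whose columns hold the values 0, 128 and 255.
pixels = np.array([0.0, 128.0, 255.0])
image = np.broadcast_to(pixels.reshape(1, 1, 3), (3, 2, 3)).copy()

normalized = normalize_chw(image, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
restored = denormalize_chw(normalized, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(normalized.min(), normalized.max())
print(np.allclose(restored, image))
```

The round trip only recovers the input when the same mean and standard deviation are passed to both functions.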

In [8]:
def normalize(image, MEAN = [0.485, 0.456, 0.406], STD = [0.229, 0.224, 0.225]):
    # Defaults are the standard ImageNet channel statistics.
    image = image / 255
    # Channels sit on axis 0 for a single (C, H, W) image and on axis 1 for a (B, C, H, W) batch.
    source, dest = 0 if len(image.shape) == 3 else 1, -1
    return np.moveaxis((np.moveaxis(image, source, dest) - MEAN) / STD, dest, source)

def denormalize(image, MEAN = [0.485, 0.456, 0.406], STD = [0.229, 0.224, 0.225]):
    # Inverse of normalize; pass the same MEAN and STD that were used to normalize.
    source, dest = 0 if len(image.shape) == 3 else 1, -1
    image = np.moveaxis((np.moveaxis(image, source, dest) * STD) + MEAN, dest, source)
    return (image * 255).astype(int)


class SolarDataset(Dataset):
    def __init__(self, file_names):
        self.file_names = file_names

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, index):
        path = self.file_names[index]
        # Read the raster as a (channels, height, width) array and close the file handle afterwards.
        with rasterio.open(path) as src:
            x = src.read()
        # Filenames ending in '-P.tif' are positive examples; all others are negative.
        y = torch.tensor(1 if path.endswith('-P.tif') else 0, dtype=torch.long)
        x = normalize(x, MEAN=[0.5, 0.5, 0.5], STD=[0.5, 0.5, 0.5])
        return torch.as_tensor(x.copy()).float(), y, path

The dataset consists of .tif image files, labeled by the presence or absence of solar panels. A filename ending in "-P" is a positive label: the image contains at least one solar panel. A filename ending in "-N" is a negative label: no solar panels are present in the image. The dataset does not provide any information about the size of the solar panels in the images.
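
As a quick illustration of this naming convention, the sketch below derives a binary label from a filename and counts the classes. `label_from_filename` is a hypothetical helper, and the three paths are taken from the unzip listing above purely as examples.

```python
from collections import Counter
from pathlib import Path

def label_from_filename(path):
    """Return 1 for a '-P' (positive) file and 0 for a '-N' (negative) file."""
    stem = Path(path).stem  # filename without directory or extension
    if stem.endswith('-P'):
        return 1
    if stem.endswith('-N'):
        return 0
    raise ValueError(f"Unexpected filename suffix: {path}")

example_paths = [
    'uk20K_v2/262711-N.tif',
    'uk20K_v2/47080-P.tif',
    'uk20K_v2/63310-P.tif',
]

# Count how many examples fall into each class.
label_counts = Counter(label_from_filename(p) for p in example_paths)
print(label_counts)
```

Running the same count over the full file list is a cheap way to check the class balance before training.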

Split the dataset into train and validation sets and create the dataloaders.
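
If the two classes are imbalanced, a plain random split can leave the validation set with a different positive rate than the training set. One option, sketched below under that assumption, is to split each class separately. `stratified_split` is a hypothetical helper and the file names are synthetic. It is not what the cell below does.

```python
import random

def stratified_split(paths, val_fraction=0.2, seed=42):
    """Split file paths so each class contributes val_fraction of its files to validation."""
    rng = random.Random(seed)
    train, val = [], []
    for suffix in ('-P.tif', '-N.tif'):
        # Sort first so the result depends only on the seed, not on the input order.
        group = sorted(p for p in paths if p.endswith(suffix))
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Synthetic example: 10 positive and 40 negative files.
paths = [f'{i}-P.tif' for i in range(10)] + [f'{i}-N.tif' for i in range(40)]
train_paths, val_paths = stratified_split(paths)

print(len(train_paths), len(val_paths))
print(sum(p.endswith('-P.tif') for p in val_paths))
```

scikit-learn's `train_test_split` also accepts a `stratify` argument that achieves the same effect when given the list of labels.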

In [9]:
# ======================= Your Code =======================
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

file_names = glob.glob('uk20K_v2/*.tif')

train_file_names, val_file_names = train_test_split(file_names, test_size=0.2, random_state=42)
train_dataset = SolarDataset(train_file_names)
val_dataset = SolarDataset(val_file_names)
# ======================= Your Code =======================

In [10]:
# ======================= Your Code =======================
batch_size = 32
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(val_dataset, batch_size=batch_size)
# ======================= Your Code =======================

print(len(train_dl), len(val_dl))

500 125


# Model definition

This section focuses on defining the models. The first component is a DINO backbone, which serves as a feature extractor. The code for DINO is already written and does not require any modifications. The second component is a classifier head that will be placed on top of the DINO features. The init function for this model has been provided, and your task is to complete the forward method.
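
To make the tensor shapes concrete, here is a small sketch of the token arithmetic, assuming 224x224 inputs and the 14-pixel patch size of the `vits14` family (the input size is inferred from the 16x16 and 257-token figures in the comments of the head class). `vit_token_counts` is a hypothetical helper.

```python
def vit_token_counts(image_size=224, patch_size=14, num_cls_tokens=1):
    """Return (grid side, number of patch tokens, sequence length including cls)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size
    num_patches = grid * grid
    return grid, num_patches, num_patches + num_cls_tokens

grid, num_patches, seq_len = vit_token_counts()
print(grid, num_patches, seq_len)  # 16 256 257
```

Under these assumptions the head receives a cls embedding of shape `(B, d_model)` and patch embeddings of shape `(B, 256, d_model)`, which concatenate into a sequence of 257 tokens.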

In [11]:
class DinoBackbone(nn.Module):
    def __init__(self, dino_size='small') -> None:
        super().__init__()
        if dino_size == 'small':
            self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
            self.d_model = 384
        elif dino_size == 'base':
            self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
            self.d_model = 768
        elif dino_size == 'giant':
            self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
            self.d_model = 1536
        else:
            raise ValueError(f"Unsupported dino_size: {dino_size!r}")

    def forward(self, x):
        x = self.dinov2.forward_features(x)
        cls_token = x["x_norm_clstoken"]
        patch_tokens = x["x_norm_patchtokens"]
        return cls_token, patch_tokens

In [12]:
class TransformerEncoderLinearHead(nn.Module):

    def __init__(self, d_model, output_size) -> None:
        super().__init__()
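        # A single encoder layer with 8 attention heads over the [cls + patch] token sequence
        self.transformer = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        # Linear classifier on the cls output (output_size=2: panel / no panel)
        self.fc = nn.Linear(d_model, output_size)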

    def forward(self, x_feats):
        cls_embs, patch_embs = x_feats

        # ==================== Your Code ====================
        # 1. Pass the cls and patch embeddings as a single sequence to the transformer layer
        # 2. Pass the cls output of the transformer to the linear model
        # 3. Return the output of the linear model as the prediction
        x = torch.cat((cls_embs.unsqueeze(1), patch_embs), dim=1)
        transformer_output = self.transformer(x)
        cls_output = transformer_output[:, 0, :]
        output = self.fc(cls_output)
        return output
        # ==================== Your Code ====================

    def get_size_estimate(self, x_feats, vis=False, images=None):

        # ==================== Your Code ====================
        # Note: This function is used in the second part of the notebook for size estimation.
        # 1. Pass cls and patch embeddings to the self-attention layer of the transformer defined in the init.
        #           use "self.transformer.self_attn()"
        # 2. When using the self_attn layer make sure to set the need_weights=True. (This will give you the attention scores)
        # 3. The previous step computes the attention of each token with all other tokens. (shape: hx257x257 where h is the number of heads)
        # 4. Get the attention score of cls token with all patch tokens (shape: hx256)
        # 5. Reshape this into hx16x16.
        # 6. Upsample this 16x16 image by a factor of 14. (You get a hx224x224 image)
        # 7. Sum across the heads (shape: 224x224)
        # 8. Normalize this 224x224 into the [0,1] range for all pixels.
        # 9. Create a binary mask from this 224x224 using a threshold (You should choose this threshold.)
        # 10. This mask is an estimation of the solar panels in the image (if it exists). You can use it to estimate the size of the solar panel.
        # 11. Good Luck.
        cls_embs, patch_embs = x_feats

        # 1. Concatenate cls and patch embeddings to form a single sequence (shape: Bx257xd)
        x = torch.cat((cls_embs.unsqueeze(1), patch_embs), dim=1)

        # 2. Run the self-attention layer; average_attn_weights=False keeps one
        #    attention map per head (shape: Bxhx257x257) instead of the head average
        _, attn_weights = self.transformer.self_attn(
            x, x, x, need_weights=True, average_attn_weights=False
        )

        # 3-5. Attention of the cls token to every patch token, reshaped to Bxhx16x16
        cls_attn = attn_weights[:, :, 0, 1:].reshape(x.shape[0], -1, 16, 16)

        # 6. Upsample the attention map by a factor of 14 (shape: Bxhx224x224)
        upsampled_attn = torch.nn.functional.interpolate(cls_attn, scale_factor=14, mode='bilinear', align_corners=False)

        # 7. Sum across the heads to get a single attention map per image
        summed_attn = upsampled_attn.sum(dim=1)

        # 8. Normalize each image's map to [0, 1] independently of the rest of the batch
        attn_min = summed_attn.amin(dim=(1, 2), keepdim=True)
        attn_max = summed_attn.amax(dim=(1, 2), keepdim=True)
        normalized_attn = (summed_attn - attn_min) / (attn_max - attn_min + 1e-8)

        # 9. Create a binary mask using a threshold
        threshold = 0.5  # You can adjust this threshold
        attention = (normalized_attn > threshold).float()
        # ==================== Your Code ====================

        h, w = attention.shape[-2:]
        visualization_images = []
        if vis:
            for i in range(images.shape[0]):
                img = images[i].detach().cpu().numpy().transpose((1, 2, 0))
                normalized_img = Normalize(vmin=img.min(), vmax=img.max())(img)

                # Matplotlib and NumPy need a CPU array, not a (possibly CUDA) tensor
                mask = attention[i].detach().cpu().numpy()
                reds = plt.cm.Reds(mask)
                alpha_max_value = 1.00
                gamma = 0.5

                rgba_img = np.zeros((h, w, 4))
                rgba_img[..., :3] = normalized_img
                rgba_img[..., 3] = 1

                rgba_mask = np.zeros((h, w, 4))
                rgba_mask[..., :3] = reds[..., :3]
                rgba_mask[..., 3] = np.power(mask, gamma) * alpha_max_value

                rgba_all = np.zeros((h, 2*w, 4))
                rgba_all[:, :w, :] = rgba_img
                rgba_all[:, w:, :] = rgba_mask

                visualization_images.append(rgba_all)

        counts = attention.sum((1, 2))
        return counts, visualization_images
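
The post-processing steps listed in `get_size_estimate` (reshape, upsample, sum over heads, normalize, threshold, count) can be tried in isolation on a synthetic attention map. The NumPy sketch below is an illustration under stated assumptions, not the notebook's implementation: it uses nearest-neighbour upsampling instead of bilinear interpolation, and `attention_to_mask` and the synthetic input are hypothetical.

```python
import numpy as np

def attention_to_mask(cls_attention, grid=16, scale=14, threshold=0.5):
    """Turn per-head cls-to-patch attention of shape (heads, grid*grid) into a pixel mask."""
    heads = cls_attention.shape[0]
    maps = cls_attention.reshape(heads, grid, grid)
    # Nearest-neighbour upsampling: each patch score is repeated over its scale x scale pixels.
    upsampled = maps.repeat(scale, axis=1).repeat(scale, axis=2)
    # Combine the heads into a single map.
    summed = upsampled.sum(axis=0)
    # Min-max normalize to [0, 1]; the epsilon guards against a constant map.
    normalized = (summed - summed.min()) / (summed.max() - summed.min() + 1e-8)
    mask = normalized > threshold
    return mask, int(mask.sum())

# Synthetic attention: 8 heads that all focus on a 2x3 block of patches.
heads, grid = 8, 16
attn = np.full((heads, grid, grid), 0.001)
attn[:, 4:6, 4:7] = 0.05

mask, pixel_count = attention_to_mask(attn.reshape(heads, -1))
print(mask.shape, pixel_count)
```

Here the mask covers exactly the 6 highlighted patches, so the pixel count is 6 x 14 x 14. On real attention maps the count depends strongly on the chosen threshold.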

# Train

Now initialize the model, define the optimizer and loss function, and train the classifier. Train your model for 2 epochs (this should be pretty fast). Make sure your code produces output similar to the notebook's, so that both training loss and validation accuracy are reported.
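
Validation accuracy can be accumulated batch by batch from the classifier's logits. The sketch below shows the bookkeeping with NumPy arrays standing in for model outputs. `validation_accuracy` is a hypothetical helper and the logits are made-up numbers.

```python
import numpy as np

def validation_accuracy(batches):
    """Accumulate correct predictions over (logits, labels) batches and return the accuracy."""
    correct = 0
    total = 0
    for logits, labels in batches:
        # The predicted class is the index of the largest logit in each row.
        predictions = np.argmax(logits, axis=1)
        correct += int((predictions == labels).sum())
        total += len(labels)
    return correct / total

logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.8]])
labels = np.array([0, 1, 1, 1])

# Two batches of two examples each.
batches = [(logits[:2], labels[:2]), (logits[2:], labels[2:])]
accuracy = validation_accuracy(batches)
print(accuracy)  # 0.75
```

Counting correct predictions and dividing once at the end avoids the bias that averaging per-batch accuracies introduces when the last batch is smaller.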

In [13]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

backbone = DinoBackbone(dino_size='small')
backbone = backbone.to(device)
backbone.eval()
head = TransformerEncoderLinearHead(backbone.d_model, 2)
head = head.to(device)

Downloading: "https://github.com/facebookresearch/dinov2/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_pretrain.pth" to /root/.cache/torch/hub/checkpoints/dinov2_vits14_reg4_pretrain.pth
100%|██████████| 84.2M/84.2M [00:00<00:00, 214MB/s] 


In [14]:
lr = 0.001
optimizer = torch.optim.Adam(head.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [15]:
iter = 0
running_loss = 0

# ==================== Your Code ====================
# Write the training loop
# Be careful: in each iteration you first have to extract the DINO features and
# then pass the features to your classifier network.
# Pay attention that you must freeze the DINO weights so that it isn't trained. (Use a torch.no_grad() block.)

# ==================== Your Code ====================

num_epochs = 10  # Set the number of epochs for training

for epoch in range(num_epochs):
    head.train()  # Set the classifier head to training mode

    for i, (images, labels, paths) in enumerate(train_dl):
        images, labels = images.to(device), labels.to(device)

        # Extract DINO features with no gradient calculation
        with torch.no_grad():
            cls_token, patch_tokens = backbone(images)

        # Forward pass through the classifier head
        outputs = head((cls_token, patch_tokens))

        # Calculate loss
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update running loss and iteration count
        running_loss += loss.item()
        iter += 1

        # Print statistics every 100 iterations
        if iter % 100 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Iteration [{iter}], Loss: {running_loss/100:.4f}")
            running_loss = 0

# Save the trained classifier head model
torch.save(head.state_dict(), 'final_model.pth')

Epoch [1/10], Iteration [100], Loss: 0.3353
Epoch [1/10], Iteration [200], Loss: 0.1982
Epoch [1/10], Iteration [300], Loss: 0.1995
Epoch [1/10], Iteration [400], Loss: 0.2049
Epoch [1/10], Iteration [500], Loss: 0.2047
Epoch [2/10], Iteration [600], Loss: 0.1795
Epoch [2/10], Iteration [700], Loss: 0.1816
Epoch [2/10], Iteration [800], Loss: 0.1977
Epoch [2/10], Iteration [900], Loss: 0.1955
Epoch [2/10], Iteration [1000], Loss: 0.2026
Epoch [3/10], Iteration [1100], Loss: 0.1809
Epoch [3/10], Iteration [1200], Loss: 0.1839
Epoch [3/10], Iteration [1300], Loss: 0.2016
Epoch [3/10], Iteration [1400], Loss: 0.1841
Epoch [3/10], Iteration [1500], Loss: 0.1812
Epoch [4/10], Iteration [1600], Loss: 0.1624
Epoch [4/10], Iteration [1700], Loss: 0.1888
Epoch [4/10], Iteration [1800], Loss: 0.1858
Epoch [4/10], Iteration [1900], Loss: 0.1825
Epoch [4/10], Iteration [2000], Loss: 0.1734
Epoch [5/10], Iteration [2100], Loss: 0.1817
Epoch [5/10], Iteration [2200], Loss: 0.1873
Epoch [5/10], Itera

In [17]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

backbone = DinoBackbone(dino_size='small')
backbone = backbone.to(device)
backbone.eval()
head = TransformerEncoderLinearHead(backbone.d_model, 2)
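# map_location lets a checkpoint saved on GPU load on a CPU-only runtime as well
head.load_state_dict(torch.load('final_model.pth', map_location=device))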
head = head.to(device)

Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main


Now go through the validation set and, for every image, predict whether it contains a solar panel. Then, from all the images predicted to contain solar panels, visualize some with large panels and some with small panels according to your size estimation module. Your outputs should look something like the following. Remember: only look at size estimates for images that are predicted positive (contain a solar panel). The size estimation module doesn't work for negative images, because there are no panels whose size could be estimated.
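
The selection logic (keep only predicted positives, then rank them by estimated size) can be sketched independently of the model. In the NumPy example below, `pick_extremes` is a hypothetical helper and the predictions and size estimates are made-up values.

```python
import numpy as np

def pick_extremes(predictions, size_estimates, k=2):
    """Return indices of the k smallest and k largest size estimates among predicted positives."""
    predictions = np.asarray(predictions)
    size_estimates = np.asarray(size_estimates, dtype=float)
    # Only predicted positives have a meaningful size estimate.
    positive_idx = np.flatnonzero(predictions == 1)
    # Order the positive indices by their size estimate, ascending.
    order = positive_idx[np.argsort(size_estimates[positive_idx])]
    return order[:k], order[-k:]

predictions = [1, 0, 1, 1, 0, 1]
sizes = [300.0, 9999.0, 50.0, 1200.0, 0.0, 700.0]

smallest, largest = pick_extremes(predictions, sizes)
print(smallest, largest)
```

The size estimates attached to predicted negatives (indices 1 and 4 here) are ignored, however large or small they are.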

In [None]:
# ======================== Your Code ========================
import matplotlib.pyplot as plt
import numpy as np

# Set models to evaluation mode
head.eval()

# Variables to store predictions and size estimates
predictions = []
size_estimates = []
images_with_panels = []
paths_with_panels = []

# Go through the validation set
with torch.no_grad():
    for i, (images, labels, paths) in enumerate(val_dl):
        images = images.to(device)
        
        # Extract DINO features
        cls_token, patch_tokens = backbone(images)
        
        # Predict solar panel presence
        outputs = head((cls_token, patch_tokens))
        _, preds = torch.max(outputs, 1)
        
        for j in range(images.size(0)):
            if preds[j] == 1:  # If the image is predicted to contain solar panels
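                # Undo the mean=0.5, std=0.5 normalization so imshow receives values in [0, 1]
                img = images[j].cpu().numpy().transpose((1, 2, 0)) * 0.5 + 0.5
                images_with_panels.append(np.clip(img, 0.0, 1.0))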
                paths_with_panels.append(paths[j])
                
                # Estimate the size of the solar panels
                size_estimate, _ = head.get_size_estimate((cls_token[j].unsqueeze(0), patch_tokens[j].unsqueeze(0)))
                size_estimates.append(size_estimate.item())
                
        predictions.extend(preds.cpu().numpy())

# Convert size estimates to numpy array for easy manipulation
size_estimates = np.array(size_estimates)

# Sort images by size estimates
sorted_indices = np.argsort(size_estimates)

# Select a few images with the smallest and largest panels
num_images_to_visualize = 5  # Adjust as needed
smallest_panel_indices = sorted_indices[:num_images_to_visualize]
largest_panel_indices = sorted_indices[-num_images_to_visualize:]

# Function to visualize images
def visualize_images(indices, title):
    plt.figure(figsize=(20, 10))
    for i, idx in enumerate(indices):
        plt.subplot(1, len(indices), i+1)
        plt.imshow(images_with_panels[idx])
        plt.title(f"Size Estimate: {size_estimates[idx]:.2f}")
        plt.axis('off')
    plt.suptitle(title)
    plt.show()

# Visualize images with the smallest panels
visualize_images(smallest_panel_indices, "Images with Smallest Solar Panels")

# Visualize images with the largest panels
visualize_images(largest_panel_indices, "Images with Largest Solar Panels")

# ======================== Your Code ========================