# Homework - DINO
### Sharif University - Deep Learning Course - Spring 2024

*Instructor:  Dr. Soleymani*

---

*Full Name:* Esra Kashaninia

*SID:* 402210676

---

In this homework we will learn to use the DINO model.

In this homework you need to complete the notebook and run all the cells.
We have specified the parts to be completed with `TODO` tags inside the code blocks.

**NOTES**:
* It is important that you read all the code and text blocks carefully, even if you are eager to jump straight into completing the missing code.
* This notebook is tested with *Google Colab* and *Kaggle* free runtimes, and you can use them for testing your code.
* Ensure all cells are executable and perform their intended functions.
* You can ask your questions on [Quera Class](https://quera.org/course/16605/)
* Write clear, commented code when necessary.

# Introduction

Listen up, folks! You know how they say dinosaurs are extinct? Well, they lied to us. The DINO model is proof that these ancient beasts are still roaming the earth, but this time, they're here to help us with feature extraction and downstream tasks. Imagine a T-Rex with a fancy deep learning algorithm strapped to its back, stomping around and making sense of all the data in its path.

But wait, it gets better! This dino doesn't just extract features; it does it in a self-supervised manner, which means it's like a kid who learned to tie its own shoelaces without any help from its parents (or, in this case, labeled data). And once it's done extracting those juicy features, it's ready to tackle any downstream task you throw its way, whether it's image classification, object detection, or even predicting the next hot dinosaur-themed movie.

In this homework assignment, we will be utilizing the DINO model to extract meaningful visual features from satellite imagery data. The self-supervised DINO model has proven to be an effective tool for extracting rich representations from visual data without the need for labeled examples during pre-training.

Specifically, we will leverage the DINO model's capabilities to extract visual features from satellite images. These extracted features will then be used to train a classifier on top of the DINO backbone. The goal of this classifier is to predict whether a given satellite image contains solar panels or not.

Moving on to the second part of the assignment, we will explore the transformer attention maps produced by the DINO model. By analyzing these attention maps, we aim to estimate the size of the solar panels present in the positive examples from the dataset.

While not as exhilarating as envisioning a dinosaur with deep learning capabilities, this assignment presents an opportunity to gain hands-on experience with a state-of-the-art self-supervised model and its applications in computer vision tasks. I'm sure you'll find the process insightful and rewarding.

# Installations and imports

As usual, imports are our first step.

In [1]:
!pip install einops

Collecting einops
  Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Installing collected packages: einops
Successfully installed einops-0.8.0


In [2]:
!pip install rasterio

Collecting rasterio
  Downloading rasterio-1.3.10-cp310-cp310-manylinux2014_x86_64.whl (21.5 MB)
Collecting affine (from rasterio)
  Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Collecting snuggs>=1.4.1 (from rasterio)
  Downloading snuggs-1.4.7-py3-none-any.whl (5.4 kB)
Installing collected packages: snuggs, affine, rasterio
Successfully installed affine-2.4.0 rasterio-1.3.10 snuggs-1.4.7


In [3]:
import glob
import torch
import rasterio
import numpy as np
import einops as eo
import random as rnd
import torch.nn as nn
from PIL import Image
from pathlib import Path
import matplotlib.pyplot as plt
import torch.nn.functional as F
from matplotlib.colors import Normalize
from torch.utils.data import Dataset, DataLoader

# Dataset

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Move the dataset to this colab session.
# We suggest you add a shortcut from the given file to your own google drive account and then copy the file from that shortcut to here.

# ======================= Your Code =======================
!cp -r /content/drive/MyDrive/HW5/uk20K.zip /content/

# ======================= Your Code =======================

In [6]:
!unzip uk20K.zip

Streaming output truncated to the last 5000 lines.
  inflating: uk20K_v2/111412-P.tif   
  inflating: uk20K_v2/3374-N.tif     
  inflating: uk20K_v2/486881-N.tif   
  inflating: uk20K_v2/6129-P.tif     
  inflating: uk20K_v2/92423-P.tif    
  inflating: uk20K_v2/709543-N.tif   
  inflating: uk20K_v2/54256-P.tif    
  inflating: uk20K_v2/111122-N.tif   
  inflating: uk20K_v2/697872-N.tif   
  inflating: uk20K_v2/269296-N.tif   
  inflating: uk20K_v2/161532-N.tif   
  inflating: uk20K_v2/94179-P.tif    
  inflating: uk20K_v2/600055-N.tif   
  inflating: uk20K_v2/105520-P.tif   
  inflating: uk20K_v2/578873-N.tif   
  inflating: uk20K_v2/59564-N.tif    
  inflating: uk20K_v2/110893-N.tif   
  inflating: uk20K_v2/71071-P.tif    
  inflating: uk20K_v2/714753-N.tif   
  inflating: uk20K_v2/37732-P.tif    
  inflating: uk20K_v2/88134-P.tif    
  inflating: uk20K_v2/315686-N.tif   
  inflating: uk20K_v2/343921-N.tif   
  inflating: uk20K_v2/125646-P.tif   
  inflating: uk20K_v2/5

Let's prepare the train and validation datasets.

In [7]:
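# The default MEAN and STD are the ImageNet channel statistics.
def normalize(image, MEAN = [0.485, 0.456, 0.406], STD = [0.229, 0.224, 0.225]):
    # Scale pixel values to [0, 1], then standardize each channel.
    image = image / 255
    # The channel axis is 0 for a single (C, H, W) image and 1 for a (B, C, H, W) batch.
    source, dest = 0 if len(image.shape) == 3 else 1, -1
    return np.moveaxis((np.moveaxis(image, source, dest) - MEAN) / STD, dest, source)

def denormalize(image, MEAN = [0.485, 0.456, 0.406], STD = [0.229, 0.224, 0.225]):
    # Inverse of normalize: undo the standardization and map back to [0, 255].
    source, dest = 0 if len(image.shape) == 3 else 1, -1
    image = np.moveaxis((np.moveaxis(image, source, dest) * STD) + MEAN, dest, source)
    return (image * 255).astype(int)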


class SolarDataset(Dataset):
    def __init__(self, file_names):
        self.file_names = file_names

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, index):
        path = self.file_names[index]
        # Read all bands as a (C, H, W) array and close the file afterwards.
        with rasterio.open(path) as src:
            x = src.read()
        # The filename suffix encodes the label: "-P" is positive, "-N" is negative.
        y = torch.tensor(1 if path.endswith('-P.tif') else 0, dtype=torch.long)
        x = normalize(x, MEAN=[0.5, 0.5, 0.5], STD=[0.5, 0.5, 0.5])
        return torch.as_tensor(x.copy()).float(), y, path
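
To make the effect of the normalization concrete, here is a minimal sketch, assuming a single `(C, H, W)` image and the mean and standard deviation of 0.5 that the dataset passes in. The two helpers below are simplified re-implementations written for this illustration (they take lowercase `mean` and `std` and handle only the unbatched case), not the functions from the cell above. With these statistics, pixel values in [0, 255] land in [-1, 1], and the inverse recovers the original integers when it rounds instead of truncating.

```python
import numpy as np

def normalize(image, mean, std):
    """Scale a (C, H, W) image from [0, 255] to [0, 1], then standardize per channel."""
    image = image / 255.0
    mean = np.asarray(mean, dtype=float).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=float).reshape(-1, 1, 1)
    return (image - mean) / std

def denormalize(image, mean, std):
    """Undo normalize; round before casting so float error cannot drop a value by one."""
    mean = np.asarray(mean, dtype=float).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=float).reshape(-1, 1, 1)
    return np.rint((image * std + mean) * 255).astype(int)

# Synthetic 3-channel image, used only for this demonstration.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 4, 4))

scaled = normalize(pixels, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
restored = denormalize(scaled, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(scaled.min(), scaled.max())          # both values lie within [-1, 1]
print(np.array_equal(restored, pixels))    # round trip recovers the input
```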

The dataset consists of .tif image files, labeled by the presence or absence of solar panels. A filename ending with "-P" indicates a positive label: the image contains at least one solar panel. A filename ending with "-N" indicates a negative label: no solar panels are present in the image. The dataset does not provide any information about the size of the solar panels in the images.
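
The naming convention can be turned into a label with a few lines of Python. This is a minimal sketch: `label_from_filename` is a hypothetical helper introduced here for illustration, and raising an error on an unexpected suffix is a design choice of this sketch, not something the dataset class does.

```python
from pathlib import Path

def label_from_filename(path: str) -> int:
    """Return 1 for a positive ("-P") file and 0 for a negative ("-N") file."""
    stem = Path(path).stem  # file name without directory or extension
    if stem.endswith("-P"):
        return 1
    if stem.endswith("-N"):
        return 0
    raise ValueError(f"Unexpected file name (no -P/-N suffix): {path}")

print(label_from_filename("uk20K_v2/111412-P.tif"))  # 1
print(label_from_filename("uk20K_v2/3374-N.tif"))    # 0
```

Failing loudly on an unknown suffix avoids silently labeling a misnamed file as negative.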

Split the dataset into train and test sets and create the dataloaders.

In [8]:
# ======================= Your Code =======================
import os
import random
from glob import glob
from torch.utils.data import random_split

# Collect all .tif files and keep 10% of them so the run stays short.
data = glob(os.path.join("/content/uk20K_v2/", "*.tif"))
data = data[:len(data) // 10]
random.shuffle(data)
data = SolarDataset(data)

# Derive the test size from the train size so the two always sum to len(data).
n_train = int(len(data) * 0.8)
trainset, testset = random_split(data, [n_train, len(data) - n_train])





# ======================= Your Code =======================

In [9]:
# ======================= Your Code =======================
batch_size = 64
train_dl = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=4)
val_dl = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=4)
# ======================= Your Code =======================

print(len(train_dl), len(val_dl))

25 7




# Model definition

This section focuses on defining the models. The first component is a DINO backbone, which serves as a feature extractor. The code for DINO is already written and does not require any modifications. The second component is a classifier head that will be placed on top of the DINO features. The init function for this model has been provided, and your task is to complete the forward method.

In [10]:
class DinoBackbone(nn.Module):
    def __init__(self, dino_size='small') -> None:
        super().__init__()
        if dino_size == 'small':
            self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
            self.d_model = 384
        elif dino_size == 'base':
            self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
            self.d_model = 768
        elif dino_size == 'giant':
            self.dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
            self.d_model = 1536

    def forward(self, x):
        x = self.dinov2.forward_features(x)
        cls_token = x["x_norm_clstoken"]
        patch_tokens = x["x_norm_patchtokens"]
        return cls_token, patch_tokens

In [19]:
import matplotlib.pyplot as plt


class TransformerEncoderLinearHead(nn.Module):

    def __init__(self, d_model, output_size) -> None:
        super().__init__()
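        # A single encoder layer with 8 attention heads over the cls + patch tokens.
        self.transformer = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        # Use output_size so the number of classes is not hard-coded.
        self.fc = nn.Linear(d_model, output_size)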

    def forward(self, x_feats):
        cls_embs, patch_embs = x_feats

        # ==================== Your Code ====================
        # 1. Pass the cls and patch embeddings as a single sequence to the transformer layer
        # 2. Pass the cls output of the transformer to the linear model
        # 3. Return the output of the linear model as the prediction

        # ==================== Your Code ====================
        full_emb = torch.cat((cls_embs.unsqueeze(1), patch_embs), dim=1)
        transformed = self.transformer(full_emb)
        cls = transformed[:, 0, :]
        out = self.fc(cls)
        return out

    def get_size_estimate(self, x_feats, vis=False, images=None):

        # ==================== Your Code ====================
        # Note: This function is used in the second part of the notebook for size estimation.
        # 1. Pass cls and patch embeddings to the self-attention layer of the transformer defined in the init.
        #           use "self.transformer.self_attn()"
        # 2. When using the self_attn layer make sure to set the need_weights=True. (This will give you the attention scores)
        # 3. The previous step computes the attention of each token with all other tokens. (shape: hx257x257 where h is the number of heads)
        # 4. Get the attention score of cls token with all patch tokens (shape: hx256)
        # 5. Reshape this into hx16x16.
        # 6. Upsample this 16x16 image by a factor of 14. (You get a hx224x224 image)
        # 7. Sum accross the heads (shape: 224x224)
        # 8. Normalize this 224x224 into the [0,1] range for all pixels.
        # 9. Create a binary mask from this 224x224 using a threshold (You should choose this threshold.)
        # 10. This mask is an estimation of the solar panels in the image (if it exists). You can use it to estimate the size of the solar panel.
        # 11. Good Luck.

        # ==================== Your Code ====================

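        cls_embs, patch_embs = x_feats
        full_emb = torch.cat((cls_embs.unsqueeze(1), patch_embs), dim=1)
        # Self-attention needs query, key and value; average_attn_weights=False
        # keeps one attention map per head: (B, h, 257, 257).
        _, w = self.transformer.self_attn(full_emb, full_emb, full_emb,
                                          need_weights=True, average_attn_weights=False)
        scores = w[:, :, 0, 1:]  # attention of the cls token to the 256 patch tokens
        B, h = scores.shape[:2]
        scores = scores.reshape(B, h, 16, 16)

        # Upsample by the patch size (14) to 224x224, then sum across heads.
        scores = F.interpolate(scores, size=(224, 224), mode='bicubic', align_corners=False)
        scores = scores.sum(dim=1)

        # Min-max normalize each map to [0, 1] before thresholding.
        lo = scores.amin(dim=(-1, -2), keepdim=True)
        hi = scores.amax(dim=(-1, -2), keepdim=True)
        normalized = (scores - lo) / (hi - lo + 1e-8)
        mask = (normalized > 0.5).float()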

        if vis:
            for i in range(B):
                f, axarr = plt.subplots(1, 3, figsize=(9, 3))
                axarr[0].imshow(images[i].permute(1, 2, 0).detach().cpu().numpy())
                axarr[0].set_title("Image", size=8)
                axarr[1].imshow(scores[i].detach().cpu().numpy(), cmap='hot')
                axarr[1].set_title("Scores", size=8)
                axarr[2].imshow(mask[i].detach().cpu().numpy(), cmap='gray')
                axarr[2].set_title("Mask", size=8)

                axarr[0].get_xaxis().set_visible(False)
                axarr[0].get_yaxis().set_visible(False)
                axarr[1].get_xaxis().set_visible(False)
                axarr[1].get_yaxis().set_visible(False)
                axarr[2].get_xaxis().set_visible(False)
                axarr[2].get_yaxis().set_visible(False)

                plt.show()

        return mask
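
The numbered steps in `get_size_estimate`, from the cls-to-patch attention scores to a binary mask, can be traced on a synthetic attention tensor. This NumPy sketch is an illustration only: random softmax-normalized weights stand in for real attention, nearest-neighbour repetition replaces the interpolation a PyTorch implementation would use, and `cls_attention_to_mask` with its `grid`, `patch` and `threshold` parameters is a hypothetical helper that handles one image at a time.

```python
import numpy as np

def cls_attention_to_mask(attn, grid=16, patch=14, threshold=0.5):
    """Turn per-head attention weights for one image into a heat map and binary mask.

    attn has shape (heads, 1 + grid*grid, 1 + grid*grid); token 0 is the cls token.
    """
    cls_to_patches = attn[:, 0, 1:]                           # (h, 256)
    maps = cls_to_patches.reshape(-1, grid, grid)             # (h, 16, 16)
    # Nearest-neighbour upsampling: repeat every cell `patch` times per axis.
    maps = maps.repeat(patch, axis=1).repeat(patch, axis=2)   # (h, 224, 224)
    heat = maps.sum(axis=0)                                   # sum across heads
    # Min-max normalize into [0, 1]; the epsilon guards against a constant map.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    mask = heat > threshold
    return heat, mask

# Synthetic stand-in for attention: random logits with a softmax over the last axis.
rng = np.random.default_rng(0)
heads, tokens = 8, 257
logits = rng.normal(size=(heads, tokens, tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

heat, mask = cls_attention_to_mask(attn)
print(heat.shape, mask.shape)                # (224, 224) (224, 224)
print("mask area fraction:", mask.mean())    # share of pixels above the threshold
```

The area fraction of the mask is one simple candidate for a size estimate; the threshold of 0.5 is an arbitrary starting point and should be tuned on real attention maps.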



# Train

Now initialize the model, define the optimizer and loss function, and train the classifier. Train your model for 2 epochs (this should be pretty fast). Make sure your code produces output similar to the one in the notebook, so that the training loss and validation accuracy are reported.

In [20]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

backbone = DinoBackbone(dino_size='small')
backbone = backbone.to(device)
backbone.eval()
head = TransformerEncoderLinearHead(backbone.d_model, 2)
head = head.to(device)

Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main


In [21]:
lr = 0.001
optimizer = torch.optim.Adam(head.parameters(), lr=lr)
# I added this
criterion = nn.CrossEntropyLoss()

In [22]:
iter = 0
train_loss = 0

for epoch in range(1):

# ==================== Your Code ====================
# Write the training loop
# Be careful: in each iteration you first have to extract the DINO features and
# then pass the features to your classifier network.
# You must freeze the DINO weights so that the backbone isn't trained. (Use a torch.no_grad() block.)

# ==================== Your Code ====================
    correct = 0
    total = 0
    iter_loss = 0
    head.train()
    for inputs, labels, dummy in train_dl:
        inputs, labels = inputs.to(device), labels.to(device)

        with torch.no_grad():
            embs = backbone(inputs)

        output = head(embs)
        optimizer.zero_grad()
        loss = criterion(output, labels)
        iter_loss += loss.item()
        train_loss += loss.item()
        loss.backward()
        optimizer.step()

        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

        iter += 1
        if iter % 10 == 0:
            print("Train Phase: Batch No", iter - 9, "to", iter + 1,
                  "Loss:", iter_loss)
            iter_loss = 0
    train_loss /= len(train_dl)
    print("Train Loss:", train_loss, "Accuracy:", correct * 100 / total)

print("-" * 150)

val_loss = 0
correct = 0
total = 0
with torch.no_grad():
    head.eval()
    for inputs, labels, dummy in val_dl:
        inputs, labels = inputs.to(device), labels.to(device)
        embs = backbone(inputs)
        output = head(embs)
        loss = criterion(output, labels)
        val_loss += loss.item()

        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

val_loss /= len(val_dl)
print("Validation Loss:", val_loss, "Validation Accuracy:", correct * 100 / total)



torch.save(head.state_dict(), 'final_model.pth')

Train Phase: Batch No 1 to 11 Loss: 8.463289886713028
Train Phase: Batch No 11 to 21 Loss: 4.069504901766777
Train Loss: 0.5486022764444352 Accuracy: 77.0
------------------------------------------------------------------------------------------------------------------------------------------------------
Validation Loss: 0.2302708774805069 Validation Accuracy: 90.0


In [23]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

backbone = DinoBackbone(dino_size='small')
backbone = backbone.to(device)
backbone.eval()
head = TransformerEncoderLinearHead(backbone.d_model, 2)
head.load_state_dict(torch.load('final_model.pth', map_location=device))
head = head.to(device)

Using cache found in /root/.cache/torch/hub/facebookresearch_dinov2_main


Now go through the validation set and, for every image, predict whether it contains a solar panel. Then, from all the images that contain solar panels, visualize some with large panels and some with small panels, based on your size estimation module. Your outputs should be something like the following. Remember: you have to look at size estimates only for images that are predicted positive (contain a solar panel). The size estimation module doesn't work for negative images, because there are no panels whose size could be estimated.
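
One possible way to organize this step is to collect one record per validation image and rank only the predicted positives by estimated size. This is a minimal pure-Python sketch under stated assumptions: the record keys `path`, `pred` and `size`, the helpers `mask_area_fraction` and `pick_extremes`, and the file names in the example are all hypothetical, and the size is taken to be the fraction of mask pixels set to 1.

```python
def mask_area_fraction(mask):
    """Fraction of pixels set to 1 in a 2-D binary mask given as a list of rows."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

def pick_extremes(records, k=2):
    """Return the k smallest and k largest predicted-positive records by size."""
    positives = [r for r in records if r["pred"] == 1]  # skip predicted negatives
    ranked = sorted(positives, key=lambda r: r["size"])
    smallest = ranked[:k]
    largest = ranked[-k:][::-1]  # largest first
    return smallest, largest

# Made-up records: prediction 1 means "contains a solar panel".
records = [
    {"path": "a-P.tif", "pred": 1, "size": 0.30},
    {"path": "b-N.tif", "pred": 0, "size": 0.90},  # ignored: predicted negative
    {"path": "c-P.tif", "pred": 1, "size": 0.05},
    {"path": "d-P.tif", "pred": 1, "size": 0.12},
    {"path": "e-P.tif", "pred": 1, "size": 0.55},
]

smallest, largest = pick_extremes(records, k=2)
print([r["path"] for r in smallest])  # ['c-P.tif', 'd-P.tif']
print([r["path"] for r in largest])   # ['e-P.tif', 'a-P.tif']
print(mask_area_fraction([[0, 1], [1, 1]]))  # 0.75
```

Filtering on the prediction before ranking keeps spurious masks from negative images out of the visualization.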

In [None]:
# ======================== Your Code ========================

# A few hours' deadline extension could've done wonders :/

# ======================== Your Code ========================