HHU Deep Learning, SS2022/23, 05.05.2023, Prof. Dr. Markus Kollmann

Lectures and tutoring are provided by Tim Kaiser, Nikolas Adaloglou and Felix Michels.

# Assignment 05 - Contrastive self-supervised learning: SimCLR on STL10 with ResNet18


## Contents

1. Preparation and imports
2. Implement the augmentation pipeline used in SimCLR
3. Implement the SimCLR Contrastive loss (NT-Xent)
4. Load and modify resnet18
5. Gradient Accumulation: Implement the `training_step`  and `pretrain_one_epoch_grad_acc`
6. Putting everything together and train the model
7. Linear probing + T-SNE visualization of features
8. Compare SimCLR versus supervised Imagenet-pretrained weights and random init on STL10 train/val split
9. Plot the val accuracies for the 3 different initializations

# Introduction 

Contrastive loss is a way of training a machine learning model in a self-supervised manner, where the goal is to learn meaningful representations of the input data without any explicit labels or annotations.

The basic idea is to take a pair of input samples (such as two augmented views from the same image), and compare them to see if they are similar or dissimilar. The model is then trained to push similar pairs closer together in the representation space, while pushing dissimilar pairs farther apart.

To do this, the contrastive loss function measures the similarity between the representations of the two input samples (this forms the numerator of the loss), and encourages the model to maximize this similarity if the samples form a positive pair, and to minimize it otherwise.


You can also consult the [SimCLR paper](https://arxiv.org/abs/2002.05709).

# Part I. Preparation and imports

In [None]:
import os

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.models as models
import torchvision.transforms as T
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision.datasets import STL10
from tqdm import tqdm  # `pretrain` below calls tqdm(...) directly

# Local imports
from utils import *

# Part II. Implement the augmentation pipeline used in SimCLR

In contrastive self-supervised learning, there are several image augmentations that are commonly used to create pairs of images that are transformed versions of each other. These augmentations are designed to ensure that the resulting views have enough differences between them so that the model can learn to distinguish between them, while also preserving the label-related information.

Implement the following transformations **presented in random order**:


- Random flipping: This involves randomly flipping the image horizontally or vertically. Choose the direction that best fits the data and apply it with a probability of 50%.
- Normalization: Normalize the images with an appropriate mean and standard deviation.
- Color jitter: This involves randomly changing the brightness, contrast, saturation and hue (20%) of the image. This augmentation helps the model learn to recognize objects or scenes under different lighting conditions. Apply this augmentation with a probability of 80%. Distort the brightness, contrast and saturation in the range `[0.2, 1.8]`.
- Random cropping: This involves randomly cropping a portion of the image to create a new image. We then resize the crops to 64x64 instead of 96x96 to reduce the computational cost of training the model. Use a scale of 10-100% of the initial image size.
- Gaussian blur: This augmentation helps the model learn to recognize objects or scenes that are slightly out of focus. Use a `kernel_size` of 3 and a standard deviation between 0.1 and 2.0.


The above augmentations are typically applied randomly to each image in a pair, resulting in two slightly different versions of the same image that can be used for contrastive learning.

Your task is to define the augmentations and decide in which order they should be applied.

In [None]:
class Augment:
    """
    A stochastic data augmentation module
    Transforms any given data example randomly
    resulting in two correlated views of the same example,
    denoted x ̃i and x ̃j, which we consider as a positive pair.
    """
    def __init__(self, img_size):
        ### START CODE HERE ### (≈ 5 lines of code)
        
    def __call__(self, x):
        # This function applies the same stochastic augmentation pipeline to an image twice and returns both views.
        
    ### END CODE HERE ###

def load_data(batch_size=128, train_split="unlabeled", test_split="test", transf=T.ToTensor()):
    # Returns a train and validation dataloader for STL10 dataset
    ### START CODE HERE ### (≈ 6 lines of code)

    ### END CODE HERE ###
    return train_dl, val_dl

# Part III. Implement the SimCLR Contrastive loss (NT-Xent)

Let $\operatorname{sim}(u,v)$ denote the dot product between two L2-normalized vectors $u$ and $v$ (i.e. their cosine similarity). Then the loss function for a **positive pair**
of examples $(i,j)$ is defined as:
$$
\ell_{i, j}=-\log \frac{\exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{j}\right) / \tau\right)}{\sum_{k=1}^{2 N} \mathbb{1}_{[k \neq i]} \exp \left(\operatorname{sim}\left(\boldsymbol{z}_{i}, \boldsymbol{z}_{k}\right) / \tau\right)}
$$

where $\mathbb{1}_{[k \neq i]} \in \{0,1\}$ is an indicator function evaluating to 1 iff $k \neq i$, and $\tau$ denotes a temperature parameter. The final loss is computed by summing $\ell_{i,j}$ over all positive pairs, counting both $(i,j)$ and $(j,i)$, and dividing by $2N$, i.e. views $\times$ batch size.
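
To make the formula concrete, here is a small worked example in plain Python. It is only a sketch: the toy embeddings, the temperature of 0.5 and the helper names `cosine` and `nt_xent_pair` are assumptions made for illustration, and the vectorized PyTorch version is still left to you.

```python
import math

def cosine(u, v):
    """Cosine similarity, i.e. the dot product of the L2-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_pair(z, i, j, tau):
    """Loss l_{i,j} for the positive pair (i, j) among the 2N embeddings in z."""
    numerator = math.exp(cosine(z[i], z[j]) / tau)
    # The indicator 1_[k != i] removes only the self-similarity term.
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / tau) for k in range(len(z)) if k != i
    )
    return -math.log(numerator / denominator)

# Toy setup with N = 2: indices 0 and 1 are the first views, 2 and 3 the
# second views, so the positive pairs are (0, 2) and (1, 3).
z = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
pairs = [(0, 2), (2, 0), (1, 3), (3, 1)]

# Sum over all positive pairs and divide by 2N = len(z).
loss = sum(nt_xent_pair(z, i, j, tau=0.5) for i, j in pairs) / len(z)
print(f"NT-Xent loss: {loss:.4f}")
```

Because the numerator is one of the terms in the denominator, each $\ell_{i,j}$ is strictly positive, which is what the test further below relies on.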

There are different ways to implement the contrastive loss.


#### Hints
Here we provide you with some hints about the main algorithm:

- Apply L2 normalization to the features and concatenate them along the batch dimension.

- Calculate the similarity (logits) of all pairs. Output shape: [batch_size $\times$ views, batch_size $\times$ views]

- Create an identity matrix of size (batch_size, batch_size) to use as a mask.

- Repeat the mask in both directions, once per view (in SimCLR the number of views is 2), so that it has size (batch_size $\times$ views, batch_size $\times$ views).
For batch_size=5 and 2 views:
```
[1., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]
```

- Make a mask to index the positive pairs by masking out the self-contrast entries.
The mask has the shape of the logits, [batch_size $\times$ views, batch_size $\times$ views], with ones on the diagonals that are $\pm$ batch_size away from the main diagonal. It will be used to index the positive pairs.
Example for a [6,6] matrix (batch_size=3, views=2):
```
[0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0.]
``` 
The ones here mark the positive elements for the numerator.
Alternatively, you can use `torch.diag()` to take the positives from the [6,6] similarity matrix (aka logits).

- Use the positives to form the numerator. Divide the similarities by the temperature. There are batch_size $\times$ views positive pairs.

- Calculate the denominator by summing the masked logits in the correct dimension.

- Don't forget to apply `-log(result)`.

- Calculate the final loss as in the above equation.
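
The mask-related hints can be sketched as follows. This covers only the masks, not the full loss; the variable names (`tiled`, `self_mask`, `pos_mask`, `denom_mask`) are illustrative choices, and `sim` is a random placeholder standing in for the real similarity matrix.

```python
import torch

batch_size, views = 3, 2
n = batch_size * views

# Identity of size (batch_size, batch_size), tiled once per view in both directions.
tiled = torch.eye(batch_size).repeat(views, views)   # shape [6, 6]

# Subtracting the main diagonal (self-contrast) leaves ones only on the
# diagonals that are +-batch_size away: the positive pairs.
self_mask = torch.eye(n)
pos_mask = tiled - self_mask

# The denominator keeps every entry except the self-similarity.
denom_mask = 1.0 - self_mask

# Alternative for the positives: read them directly off the offset diagonals.
sim = torch.randn(n, n)   # placeholder for the [6, 6] similarity matrix
positives = torch.cat([torch.diag(sim, batch_size), torch.diag(sim, -batch_size)])

print(pos_mask)
```

Both routes select the same `batch_size * views` entries in the same row order, so either can feed the numerator.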


#### A note on L2 normalization

L2 normalization is a common technique used in contrastive learning to normalize the embedding vectors before computing the contrastive loss. 

This is because L2 normalization scales the vectors to have unit length. Without L2 normalization, the magnitude of the embedding vectors can have a large influence on the contrastive loss. 

This can result in the optimization process focusing more on adjusting the magnitude of the vectors rather than their direction, leading to suboptimal solutions. 

By normalizing the embeddings, the contrastive loss only considers the angular difference between embedding vectors.
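
A quick way to convince yourself of this is to rescale one set of embeddings and compare the similarities with and without normalization. The sketch below uses random toy embeddings and a hypothetical helper `pairwise_sim`; only `F.normalize` is the actual PyTorch call you would reuse in the loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
u = torch.randn(4, 128)   # toy batch of embeddings (view 1)
v = torch.randn(4, 128)   # toy batch of embeddings (view 2)

def pairwise_sim(a, b, normalize):
    """All-pairs dot products, optionally after L2-normalizing each row."""
    if normalize:
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    return a @ b.T

raw_before = pairwise_sim(u, v, normalize=False)
raw_after = pairwise_sim(10 * u, v, normalize=False)    # same directions, 10x the norm
cos_before = pairwise_sim(u, v, normalize=True)
cos_after = pairwise_sim(10 * u, v, normalize=True)

# Raw dot products grow with the vector norm ...
print(torch.allclose(raw_after, 10 * raw_before, rtol=1e-4, atol=1e-4))
# ... while cosine similarities depend only on the direction.
print(torch.allclose(cos_before, cos_after, atol=1e-5))
```

Without normalization, the network could lower the loss simply by inflating the norm of its embeddings; with normalization, the logits stay in `[-1, 1]` before the temperature is applied.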




In [None]:
import torch
import torch.nn as nn

class ContrastiveLoss(nn.Module):
    """
    Vanilla contrastive loss (NT-Xent), also called InfoNCE loss, as used in the SimCLR paper.
    There are different ways to implement the contrastive loss. Here are some hints about the main algorithm:
        1- create an identity matrix as a mask (bsz, bsz)
        2- repeat the mask in both directions to the number of views (in SimCLR the number of views is 2)
        3- modify the mask to remove the self-contrast cases
        4- calculate the similarity between all features. *Note: the final size should be [bsz * views, bsz * views]
        5- apply the mask on the similarity matrix
        6- calculate the final loss
    """
    ### START CODE HERE ###  (≈ 19 lines of code)
    
    def forward(self, proj_1, proj_2):
        """
        proj_1 and proj_2 are batched embeddings [batch, embedding_dim]
        where corresponding indices are pairs
        z_i, z_j in the SimCLR paper
        """

        return loss # scalar!
    ### END CODE HERE ###

def test_ContrastiveLoss():
    batch_size = 8
    temperature = 0.1
    criterion = ContrastiveLoss(batch_size, temperature)
    proj_1 = torch.rand(batch_size, 128)
    proj_2 = torch.rand(batch_size, 128)
    loss = criterion(proj_1, proj_2)
    assert loss.shape == torch.Size([]), "ContrastiveLoss output shape is wrong"
    assert loss.item() >= 0, "ContrastiveLoss output is negative"
    print("ContrastiveLoss test passed!")

test_ContrastiveLoss()

# Part IV. Load and modify resnet18

- Load and modify the resnet18.
- Add an MLP with batch normalization after the resnet18 backbone, as illustrated below:
```python
Sequential(
  (0): Linear(in_features=in_features, out_features=in_features, bias=False)
  (1): BatchNorm(in_features)
  (2): ReLU()
  (3): Linear(in_features=in_features, out_features=embedding_size, bias=False)
  (4): BatchNorm(embedding_size))
```

In [None]:
class ResNetSimCLR(nn.Module):
    def __init__(self, embedding_size=128):
        super(ResNetSimCLR, self).__init__()
        ### START CODE HERE ### (≈ 10 lines of code)
        # load resnet18 pretrained on imagenet
        # self.backbone = ...
        # add mlp projection head
        # self.projection = ....

    def forward(self, x, return_embedding=False):

    ### END CODE HERE ###

# Part V. Implement the `training_step`  and `pretrain_one_epoch_grad_acc`

### Gradient accumulation and mixed precision

- `training_step` should load a batch of two image views and feed them to the model. The loss function then computes the SimCLR loss implemented above.
- Gradient accumulation sums the gradients over $N$ steps: for each batch, the gradients are computed and training proceeds to the next batch without updating the parameters. Remember that when you call `loss.backward()` the newly computed gradients are added to the existing ones. After $N$ steps, the parameter update is performed. The loss should be scaled down (divided by $N$) so that the accumulated gradient is an average.

Note: SimCLR training requires a large batch size. You should be able to train SimCLR with a batch size of at least 256 on Google Colab.

#### Explanation of accumulated gradients

When training large neural networks, computing the gradient over all training examples at once is prohibitively expensive, and even the size of a single batch is limited by GPU memory. Gradient accumulation is a technique to increase the effective batch size used for each update of the network weights.

Instead of applying the gradients to the model's parameters after each batch, the gradients are accumulated over several batches and then used for a single update. Averaging over more training examples reduces the noise in the gradients, which can lead to more stable updates to the model's parameters. Less noisy gradients can also permit larger update steps, which may speed up the training process.

For example, with a batch size of 32 the network processes 32 examples at a time and computes the gradient averaged over these examples. Instead of updating the weights right away, you keep accumulating gradients and update only every $N$ steps, using the average of the accumulated gradients. This gives an effective batch size of $32 \times N$.

> Importantly, gradient accumulation slows down training, since a parameter update happens only every $N$ steps, but you should expect to see the loss dropping steadily and possibly faster, depending on the method.
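
The following toy check illustrates why the loss is divided by the number of accumulation steps. It is a sketch under simplifying assumptions: a linear model with a mean-reduced MSE loss (not the SimCLR loss) and equally sized micro-batches. With these assumptions, the accumulated gradient matches the gradient of one large batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, y = torch.randn(32, 10), torch.randn(32, 1)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()   # averages over the examples in the batch

# Reference: a single backward pass on the full batch of 32.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 8, each loss divided by accum_iter.
accum_iter = 4
model.zero_grad()
for x_mb, y_mb in zip(x.chunk(accum_iter), y.chunk(accum_iter)):
    loss = loss_fn(model(x_mb), y_mb) / accum_iter
    loss.backward()      # adds the new gradients to the ones already in .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

Note that this exact equivalence holds for losses that average over independent examples. For the contrastive loss, each micro-batch only sees its own negatives, and batch normalization statistics are also computed per micro-batch, so accumulation approximates a larger batch rather than reproducing it.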

### Mixed Precision

At this point, we introduce another technique that reduces GPU memory usage and thereby allows larger batch sizes: mixed precision. The idea is to perform as many operations as possible in fp16, instead of the standard fp32, during training. This is not as simple as casting everything to fp16, however, because some values are sensitive to underflow (being rounded to 0), especially the gradients.

Luckily, there is a torch package for this, `torch.cuda.amp`. Feel free to check out the docs [here](https://pytorch.org/docs/stable/amp.html#) and some examples [here](https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples). This package takes care of the intricate things and you can go ahead and train. 

We use two components from the package here, `autocast` and `GradScaler`. `autocast` takes care of casting the appropriate tensors to fp16 and leaving the others unchanged. The `GradScaler` then scales the loss so that the gradients in the backward pass avoid numerical instabilities.

Feel free to use this technique in future exercises to save some memory and speed up your training. 
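
For orientation, a single mixed-precision update step might look like the sketch below, which follows the pattern from the linked AMP examples. The toy linear model, the random data and the `use_amp` flag are assumptions for illustration; passing `enabled=use_amp` lets the same code fall back to plain fp32 on a machine without a GPU.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"   # assumption: only use fp16 autocast on a GPU

model = nn.Linear(16, 4).to(device)   # toy model, not the SimCLR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler(enabled=use_amp)

x = torch.randn(8, 16, device=device)
target = torch.randn(8, 4, device=device)
w_before = model.weight.detach().clone()   # kept only to check that an update happened

optimizer.zero_grad()
with autocast(enabled=use_amp):
    # The forward pass and the loss run under autocast.
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # backward pass on the scaled loss
scaler.step(optimizer)          # unscales the gradients, then calls optimizer.step()
scaler.update()                 # adjusts the scale factor for the next iteration

print(f"loss: {loss.item():.4f}")
```

When combining this with gradient accumulation, `scaler.step` and `scaler.update` would be called only on the iterations where the weights are actually updated.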

In [None]:
from torch.cuda.amp import autocast, GradScaler

def training_step(model, loss_function, data):
    ### START CODE HERE ### (≈ 5 lines of code)
   
    ### END CODE HERE ###
    return loss

def pretrain_one_epoch_grad_acc(model, loss_function, train_dataloader, 
                                    optimizer, device, accum_iter=1, amp=False):
    model.train()
    total_loss = 0
    num_batches = len(train_dataloader)
    optimizer.zero_grad()
    scaler = GradScaler() if amp else None
    for batch_idx,data in enumerate(train_dataloader):
        ### START CODE HERE ### ( > 6 lines of code)
        if amp:
            # ....
        else:
            #.......
        
        # weights update

        # scale back the loss
        # total_loss = ....

        ### END CODE HERE ###
    return total_loss/num_batches
    


def pretrain(model, optimizer, num_epochs, train_loader, criterion, device, accum_iter=1, amp=False):
    dict_log = {"train_loss":[]}
    best_loss = 1e8
    model = model.to(device)
    pbar = tqdm(range(num_epochs))
    for epoch in pbar:
        train_loss = pretrain_one_epoch_grad_acc(model, criterion, train_loader, optimizer,
                                                    device, accum_iter, amp=amp)
        msg = (f'Ep {epoch}/{num_epochs}: || Loss: Train {train_loss:.3f}')
        pbar.set_description(msg)
        dict_log["train_loss"].append(train_loss)
        
        # Use this code to save the model with the lowest loss
        if train_loss < best_loss:
            best_loss = train_loss
            save_model(model, 'best_model_min_train_loss.pth', epoch, optimizer, train_loss)
        if epoch == num_epochs - 1:
            save_model(model, f'last_model_ep{epoch}.pth', epoch, optimizer, train_loss)
    return dict_log

# Part VI. Putting everything together and train the model

Hint: ~50 epochs should be sufficient to see the learned features.

A small training trick here. We will exclude batch normalization parameters from weight decay in `define_param_groups`
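
The idea behind this trick can be sketched with optimizer parameter groups. The helper below, `split_weight_decay_groups`, is a hypothetical stand-in written for illustration; the actual `define_param_groups` in `utils.py` may have a different signature and behavior.

```python
import torch
import torch.nn as nn

def split_weight_decay_groups(model, weight_decay):
    """Hypothetical helper: apply weight decay to everything except batch-norm parameters."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    bn_param_ids = {
        id(p)
        for module in model.modules()
        if isinstance(module, bn_types)
        for p in module.parameters(recurse=False)
    }
    trainable = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in trainable if id(p) not in bn_param_ids]
    no_decay = [p for p in trainable if id(p) in bn_param_ids]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy model: one Linear weight (decayed) and the BatchNorm weight and bias (not decayed).
toy = nn.Sequential(nn.Linear(8, 8, bias=False), nn.BatchNorm1d(8), nn.ReLU())
groups = split_weight_decay_groups(toy, weight_decay=1e-6)
optimizer = torch.optim.Adam(groups, lr=3e-4)
print([len(g["params"]) for g in optimizer.param_groups])  # → [1, 2]
```

Each dictionary becomes its own parameter group, so the optimizer applies the group's `weight_decay` instead of a single global value.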

Note on complexity: about 10.7 GB of VRAM and ~156 minutes were needed for an effective batch size > 1024, images of 64x64 and 60 epochs.

In case you face problems with Google Colab, download the model every 5 epochs or, better, mount your Google Drive and save the model there in case you get disconnected.

For example, to save and download a checkpoint:
```python
from google.colab import files

PATH = './best_model.ckpt'
torch.save(model_simclr.state_dict(), PATH)
files.download(PATH)
```

In [None]:
class Hparams:
    def __init__(self):
        # This is what we used, feel free to change those parameters.
        # You only need to specify the temperature in the config object
        self.seed = 77777 # randomness seed
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.img_size = 64 #image shape
        self.load = False # load pretrained checkpoint
        self.batch_size = 512
        self.lr = 3e-4 # for Adam only
        self.weight_decay = 1e-6
        self.embedding_size = 128 # the paper's value is 128
       
        self.epochs = 100
        self.accum_iter = 1 # gradient accumulation
        self.amp = True # automatic mixed precision
        ############################################
        # START CODE HERE ### (≈ 1 line of code)
        self.temperature = ........
        ### END CODE HERE ###

### START CODE HERE ### (>10 lines of code)


# Launch training i.e :
# dict_log = pretrain(model, optimizer, config.epochs,
#                     train_dl, criterion, 
#                     config.device, accum_iter=config.accum_iter,
#                     amp=config.amp)

### END CODE HERE ###

# Part VII. Linear probing + T-SNE visualization of features

As in the previous exercise, check the results of linear probing on the supervised training split and the T-SNE visualization.

Code for the T-SNE visualization exists in `utils.py`.

In [None]:
### START CODE HERE ### (> 10 lines of code)
# model = ResNetSimCLR(embedding_size=config.embedding_size)
# model = load_model(model, "simclr.pth")


# Linear evaluation


# TSNE plot


### END CODE HERE ###

### Expected results
```
Model simclr.pth is loaded from epoch 99 , loss 5.342101926069994
Ep 199/200: Accuracy : Train:87.80 	 Val:78.41 || Loss: Train 0.360 	 Val 0.612
```

# Part VIII. Compare SimCLR versus supervised Imagenet-pretrained weights and random init on STL10 train/val split

- Don't forget to use the train split of STL10 for supervised training.
- For simplicity, don't use augmentations here, although it's possible and it would lead to better results.
- Since we are not using any augmentations at this step, SimCLR will have the same results as before.


Variants to be tested:
- SimCLR weights trained for at least 50 epochs
- ImageNet initialization
- Random initialization

Afterward, print the best validation accuracy for all 3 models!

In [None]:
def main(mode='simclr'):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    ### START CODE HERE ### (≈ 15 lines of code)
        
    if mode == 'random':

    elif mode == 'imagenet':

    elif mode == 'simclr':
        
    ### END CODE HERE ###
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    dict_log = linear_eval(model, optimizer, 20, train_dl, val_dl, device)
    return dict_log
    

dict_log_simclr = main('simclr')
acc1 = np.max(dict_log_simclr["val_acc_epoch"])
dict_log_in = main('imagenet')
acc2 = np.max(dict_log_in["val_acc_epoch"])
dict_log_ran = main('random')
acc3 = np.max(dict_log_ran["val_acc_epoch"])
print(f"Fine-tuning best results: SimCLR: {acc1:.2f}%, ImageNet: {acc2:.2f} %, Random: {acc3:.2f} %")

### Expected results

By fine-tuning all variants for 20 epochs this is what we got: 

```
Fine-tuning best results: SimCLR: 77.26%, ImageNet: 76.25 %, Random: 53.83 %

```

# Part IX. Plot the val accuracies for the 3 different initializations

In [None]:
# Provided
import matplotlib.pyplot as plt  # needed here unless `utils` already exports `plt`
plt.figure(figsize=(10, 5))
plt.plot(dict_log_simclr["val_acc_epoch"], label="SimCLR")
plt.plot(dict_log_in["val_acc_epoch"], label="ImageNet")
plt.plot(dict_log_ran["val_acc_epoch"], label="Random")
plt.legend()
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.title("Fine tuning results on STL-10")
plt.savefig("fine_tuning_results_stl10.png")
plt.show()

# Conclusion and Bonus reads

That's the end of this exercise. If you reached this point, congratulations!


### Optional stuff

- Improve SimCLR. Add the [LARS optimizer](https://gist.github.com/black0017/3766fc7c62bdd274df664f8ec03715a2) with linear warm-up + [cosine scheduler](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html?highlight=cosine%20scheduler#torch.optim.lr_scheduler.CosineAnnealingLR) + train for 200 epochs. Then make a new comparison!
- Train on CIFAR100 and compare rotation prediction VS SimCLR pretraining on both datasets. Which pretext task is likely to work better there?