HHU Deep Learning, SS2022/23, 09.06.2023, Prof. Dr. Markus Kollmann

Lecturing and tutoring are done by Tim Kaiser, Nikolas Adaloglou and Felix Michels.

# Assignment 10 - Contrastive Language-Image Pre-training for unsupervised out-of-distribution detection

Copyright © 2023 Nikolas Adaloglou, Tim Kaiser and Felix Michels

---

Submit the solved notebook (not a zip) with your full name plus assignment number in the filename as an indicator, e.g. `max_mustermann_a1.ipynb` for assignment 1. Since this assignment spans 2 weeks, completing it is worth 2 points. If we feel that you have genuinely tried to solve the exercise, you will receive 2 points for this assignment regardless of the quality of your solution, otherwise 0.

## <center> DUE FRIDAY 23.06.2023 2:30 pm </center>

Drop-off link: [https://uni-duesseldorf.sciebo.de/s/inBTRVg7lLmccdN](https://uni-duesseldorf.sciebo.de/s/inBTRVg7lLmccdN)

---

## Contents

1. Basic imports
2. Get the visual features of the CLIP model
3. Compute the k-NN similarity as the OOD score
4. Compute MSP using the text encoder and the label names
5. Linear probing on the pseudolabels
6. Mahalanobis distance as OOD score
7. Mahalanobis distance using the real labels without linear probing
8. K-means clusters combined with Mahalanobis distance

---

## Overview
We will apply the representations learned by a Contrastive Language-Image Pre-training (CLIP) model to the downstream task of out-of-distribution (OOD) detection.

`Note`: I used the pretrained models from `open_clip_torch`; you can install it with `!pip install open_clip_torch`.

We will be using the model 'convnext_base_w' pretrained on 'laion2b_s13b_b82k' throughout this tutorial.

Information and examples on how to use CLIP models for inference are provided in [openclip](https://github.com/mlfoundations/open_clip#usage).

- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
- [Contrastive Language-Image Pretrained (CLIP) Models are Powerful Out-of-Distribution Detectors](https://arxiv.org/abs/2303.05828)



# Part I. Basic imports

In [1]:
import os
from pathlib import Path
from tqdm import tqdm
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

import torch
import open_clip
from torch import nn
from torch.nn import functional as F
import torchvision
import torchvision.transforms as T
from torch.utils.data import Subset, DataLoader, Dataset

out_dir = Path('./features/').resolve()
out_dir.mkdir(parents=True, exist_ok=True)
# Local import
from utils import *

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Helper function
def auroc_score(score_in, score_out):
    if type(score_in) == torch.Tensor:
        score_in = score_in.cpu().numpy()
        score_out = score_out.cpu().numpy()
    labels = np.concatenate((np.ones_like(score_in), np.zeros_like(score_out)))
    return roc_auc_score(labels, np.concatenate((score_in, score_out))) * 100

# Part II. Get the visual features of the CLIP model

- We will use `CIFAR100` as the in-distribution, and `CIFAR10` as the out-distribution.
- When you are only loading the visual CLIP backbone, you must remove the final linear layer that projects the features to the shared feature space of the image-text encoder.
- Load the data, compute the visual features and save them in the `features` folder.
- For the in-distribution you need both the train and test splits, while for the out-distribution we will only use the test split.


### Optional structure

```python
def load_datasets(indist="CIFAR100", ood="CIFAR10", batch_size=256, transform=None):
    # ....
    return indist_train_loader, indist_test_loader, ood_loader

# visual is a boolean that controls whether the visual backbone is only returned or the whole CLIP model
def get_model(visual, name, pretrained):
    # .....
    if visual:
        return backbone, preprocess
    return model, preprocess, tokenizer
    
# Load everything .......

feats, labels = get_features(backbone, dl, device)
# Save features 
# ....
```

In [2]:
### START CODE HERE ### (≈ 31 lines of code)
def load_datasets(indist = "CIFAR100", ood = "CIFAR10", batch_size = 256, transform = None,num_workers = 0):
    indist_train_ds = getattr(torchvision.datasets, indist)(train = True, download = True, root = './data', transform = transform)
    indist_test_ds = getattr(torchvision.datasets, indist)(train = False, download = True, root = './data', transform = transform)
    ood_ds = getattr(torchvision.datasets, ood)(train = False, download = True, root = './data', transform = transform)

    indist_train_loader = DataLoader(indist_train_ds, batch_size=batch_size, shuffle = True, num_workers = num_workers)
    indist_test_loader = DataLoader(indist_test_ds, batch_size=batch_size, shuffle = True, num_workers = num_workers)
    ood_loader = DataLoader(ood_ds, batch_size=batch_size, shuffle = True, num_workers = num_workers)

    return indist_train_loader, indist_test_loader, ood_loader

def get_model(visual, name = 'convnext_base_w', pretrained  = 'laion2b_s13b_b82k'):
    model,_,preprocess = open_clip.create_model_and_transforms(name,pretrained = pretrained)
    tokenizer = open_clip.get_tokenizer(name)
    if visual:
        model = model.visual
        model.head.proj = nn.Identity()

    return model, preprocess, tokenizer



In [None]:
backbone,preprocess,tokenizer = get_model(visual = True)
indist_train_loader, indist_test_loader, ood_loader = load_datasets(transform = preprocess)

indist_train_features, indist_train_labels = get_features(backbone, indist_train_loader,device)
indist_test_features, indist_test_labels = get_features(backbone, indist_test_loader,device)
ood_features, ood_labels = get_features(backbone, ood_loader,device)

torch.save(indist_train_features, 'features/cifar100_train_feats.pt')
torch.save(indist_train_labels, 'features/cifar100_train_labels.pt')

torch.save(indist_test_features, 'features/cifar100_test_feats.pt')
torch.save(indist_test_labels, 'features/cifar100_test_labels.pt')

torch.save(ood_features, 'features/cifar10_test_feats.pt')
torch.save(ood_labels,'features/cifar10_test_labels.pt')


### END CODE HERE ###

# feature test
for name, N in [('cifar100_train', 50000), ('cifar100_test', 10000), ('cifar10_test', 10000)]:
    feats = torch.load(f'features/{name}_feats.pt')
    labels = torch.load(f'features/{name}_labels.pt')
    assert feats.shape == (N, 1024)
    assert labels.shape == (N,)
print('Success!')

# Part III. Compute the k-NN similarity as the OOD score

- For each test image of the in- and out-distribution, compute the top-1 cosine similarity to the in-distribution training features and use it as the OOD score.
- Report the resulting AUROC score.
- Note: Use the image features and not the images!
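
As a minimal illustration of the scoring rule above, the following NumPy sketch computes the top-k cosine similarity on made-up toy features. The function name `knn_cosine_score` and the toy data are assumptions for illustration only; the exercise itself operates on the saved CLIP features.

```python
import numpy as np

def knn_cosine_score(train_feats, test_feats, k=1):
    """Return, for every test row, the k-th largest cosine similarity to the training set."""
    # L2-normalise rows so that a dot product equals the cosine similarity
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                      # shape: (n_test, n_train)
    # sort each row in ascending order and pick the k-th largest entry
    return np.sort(sims, axis=1)[:, -k]

# Toy data (made up): in-distribution features cluster around one direction,
# out-of-distribution features around an orthogonal one
rng = np.random.default_rng(0)
train = 0.1 * rng.normal(size=(200, 16)) + np.eye(16)[0]
test_in = 0.1 * rng.normal(size=(50, 16)) + np.eye(16)[0]
test_out = 0.1 * rng.normal(size=(50, 16)) + np.eye(16)[1]

score_in = knn_cosine_score(train, test_in)
score_out = knn_cosine_score(train, test_out)
print(score_in.mean(), score_out.mean())
```

A higher score means "more in-distribution", which matches the label convention of `auroc_score` (in-distribution samples are labelled 1).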

In [9]:
@torch.no_grad()
def OOD_classifier_knn(train_features, test_features, k=1):
    ### START CODE HERE ### (≈ 13 lines of code)
    norm_train = nn.functional.normalize(train_features, dim =-1, p = 2)
    norm_test = nn.functional.normalize(test_features, dim = -1,p = 2)
    cos_sim,_ = torch.matmul(norm_test, norm_train.T).topk(k, largest=True, sorted=True, dim=-1)
    ### END CODE HERE ###
    return cos_sim 

# load the computed features and compute scores
### START CODE HERE ### (≈ 5 lines of code)
indist_train = torch.load('features/cifar100_train_feats.pt')
indist_test = torch.load('features/cifar100_test_feats.pt')
ood_test = torch.load('features/cifar10_test_feats.pt')
score_in = OOD_classifier_knn(indist_train, indist_test)
score_out =  OOD_classifier_knn(indist_train, ood_test)
### END CODE HERE ###
print(f'CIFAR100-->CIFAR10 AUROC: {auroc_score(score_in, score_out):.2f}')

CIFAR100-->CIFAR10 AUROC: 83.55


### Expected result

```
CIFAR100-->CIFAR10 AUROC: 83.55
```


# Part IV. Compute MSP using the text encoder and the label names

We will now consider the case where the in-distribution label names are available.

Your task is to apply zero-shot classification and get the maximum softmax probability (MSP) as the OOD score.

In short:
- compute image and text embeddings
- compute the image-text similarity matrix (logits)
- apply softmax to the logits for each image to get a probability distribution of the classes.
- compute maximum softmax probability (MSP)

- `Note`: After loading the saved image features you need to apply the linear projection layer from the visual backbone of CLIP
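
The steps above can be sketched on toy embeddings as follows. This is an illustrative NumPy version, not the reference solution: `msp_score` and its `scale` argument are hypothetical names, and the embeddings are made up. With `scale=1.0` the logits are plain cosine similarities.

```python
import numpy as np

def softmax(logits):
    # subtract the row-wise maximum for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def msp_score(img_embs, text_embs, scale=1.0):
    """Return (maximum softmax probability, predicted class) per image."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = scale * img @ txt.T               # image-text similarity matrix
    probs = softmax(logits)
    return probs.max(axis=1), probs.argmax(axis=1)

# Toy example: three class "text" embeddings and two "image" embeddings
text_embs = np.eye(3)
img_embs = np.array([[1.0, 0.0, 0.0],          # clearly class 0
                     [1.0, 1.0, 1.0]])         # equally similar to all classes
msp, pred = msp_score(img_embs, text_embs, scale=10.0)
print(msp, pred)
```

The first image gets an MSP close to 1, while the ambiguous second image gets the uniform value 1/3, so the MSP behaves as a confidence score.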

In [5]:
### optional to use
def compute_logits(model, text_embs, img_embs, device):
    ### START CODE HERE ### (≈ 5 lines of code)
    img_embs = img_embs.to(device)
    img_embeds = model.visual.head.proj(img_embs)
    img_embeds_norm = nn.functional.normalize(img_embeds, dim = -1, p = 2)
    text_embeds_norm = nn.functional.normalize(text_embs, dim = -1 , p = 2)
    logits = torch.matmul(img_embeds_norm, text_embeds_norm.T)
    ### END CODE HERE ###
    return logits

### optional to use
def compute_text_embeds(model, class_tokens):
    ### START CODE HERE ### (≈ 3 lines of code)
    text_embs = model.encode_text(class_tokens)
    ### END CODE HERE ###
    return text_embs

def compute_msp(label_names, model, class_tokens, indist_test, ood_test, device):
    ### START CODE HERE ### (≈ 7 lines of code)
    text_embs = compute_text_embeds(model, class_tokens)
    logits_in = compute_logits(model, text_embs, indist_test, device)
    logits_out = compute_logits(model, text_embs, ood_test, device)

    score_in = torch.nn.functional.softmax(logits_in, dim = 1).max(dim = 1)[0].detach()
    score_out = torch.nn.functional.softmax(logits_out, dim = 1).max(dim = 1)[0].detach()
    ### END CODE HERE ###
    return score_in, score_out
        

# Load model and features
### START CODE HERE ### (≈ 4 lines of code)
indist_test = torch.load('features/cifar100_test_feats.pt')
ood_test = torch.load('features/cifar10_test_feats.pt')
model, preprocess, tokenizer = get_model(visual = False)
model =model.to(device)
### END CODE HERE ###

### Provided 
label_names = torchvision.datasets.CIFAR100(root='../data', train=True, download=True).classes
prompts = ['an image of a ' + lab.replace('_', ' ') for lab in label_names]
class_tokens = tokenizer(prompts).to(device)
score_in, score_out = compute_msp(label_names, model, class_tokens, indist_test, ood_test, device)
print(f'CIFAR100-->CIFAR10 AUROC: {auroc_score(score_in, score_out):.2f}')

Files already downloaded and verified
CIFAR100-->CIFAR10 AUROC: 84.18


### Expected result

```
CIFAR100-->CIFAR10 AUROC: 76.38
```


# Part V. Linear probing on the pseudolabels

- Your task is to train a linear layer using the CLIP pseudolabels as targets.
- The pseudolabels are the argmax of the logits computed above, i.e., take the class with the maximum probability as the class label
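
To make the idea concrete, here is a small NumPy sketch on made-up data: pseudolabels are taken as the argmax of cosine-similarity logits, and a linear layer is then fitted to them by softmax regression with plain gradient descent. The prototypes, features, learning rate and number of steps are all assumptions for illustration; the notebook itself trains an `nn.Linear` layer with Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (made up): 3 class prototypes and 300 features scattered around them
prototypes = np.eye(3, 8)                                  # shape (3, 8)
true_cls = rng.integers(0, 3, size=300)
feats = prototypes[true_cls] + 0.1 * rng.normal(size=(300, 8))

# Zero-shot logits = cosine similarity to each prototype; pseudolabel = argmax
f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
pseudo = (f @ p.T).argmax(axis=1)

# Linear probe: softmax regression trained on the pseudolabels
W = np.zeros((8, 3))
b = np.zeros(3)
onehot = np.eye(3)[pseudo]
for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot) / len(feats)                   # d(mean cross-entropy)/d(logits)
    W -= 1.0 * feats.T @ grad
    b -= 1.0 * grad.sum(axis=0)

train_acc = ((feats @ W + b).argmax(axis=1) == pseudo).mean()
print(f"accuracy w.r.t. pseudolabels: {train_acc:.2f}")
```

Note that the accuracy is measured against the pseudolabels, not against ground-truth labels, which is also what the training log in this part reports.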

In [6]:
def compute_score_probe_msp(lin_layer, indist_loader, ood_loader, device):
    """
    Computes the MSP scores for a linear layer for both in- and out- distribution.
    """
    ### START CODE HERE ### (≈ 4 lines of code)
    in_list = []
    out_list = []
    for imgs, _ in indist_loader:
        imgs = imgs.to(device)
        logits = lin_layer(imgs)
        in_list.append(nn.functional.softmax(logits, dim = 1).max(dim = 1)[0].detach())

    for imgs, _ in ood_loader:
        imgs = imgs.to(device)
        logits = lin_layer(imgs)
        out_list.append(nn.functional.softmax(logits, dim = 1).max(dim = 1)[0].detach())

    score_in = torch.cat(in_list)
    score_out = torch.cat(out_list)

    ### END CODE HERE ###
    return score_in, score_out

### START CODE HERE ### (≈ 17 lines of code) 
# get CLIP model
# get text embeds from label names
model, preprocess, tokenizer = get_model(visual = False)
model =model.to(device)
text_embeds = compute_text_embeds(model, class_tokens)
# load features
indist_train = torch.load('features/cifar100_train_feats.pt')
indist_test = torch.load('features/cifar100_test_feats.pt')
ood_test = torch.load('features/cifar10_test_feats.pt')
# compute CLIP logits of image features based on text encoder
logits_train = compute_logits(model, text_embeds, indist_train, device)
logits_val = compute_logits(model,text_embeds, indist_test,device)
# get target pseudo labels from CLIP logits
pseudo_labels_train = logits_train.argmax(dim=1)
pseudo_labels_val = logits_val.argmax(dim = 1)
# create dataset and dataloaders for linear probing
train_dataset = torch.utils.data.TensorDataset(indist_train, pseudo_labels_train)
val_dataset = torch.utils.data.TensorDataset(indist_test,pseudo_labels_val)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = 128, shuffle = True, drop_last = False)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size = 128, shuffle = True, drop_last = False)
### END CODE HERE ###

# The code below is provided based on our implementation. Optional to use!
# Run linear probing
embed_dim = train_dataset[0][0].shape[0]
lin_layer = nn.Linear(embed_dim, 100).to(device)
optimizer = torch.optim.Adam(lin_layer.parameters(), lr=1e-3)
num_epochs = 20
dict_log = linear_eval(lin_layer, optimizer, num_epochs, train_loader, val_loader, device)
# compute MSP scores
lin_layer = load_model(lin_layer, "CLIP_best_max_train_acc.pth")
ood_dataset = torch.utils.data.TensorDataset(ood_test, torch.zeros(ood_test.shape[0], dtype=torch.long))
ood_loader = torch.utils.data.DataLoader(ood_dataset, batch_size=128, shuffle=False, drop_last=False)
score_in, score_out = compute_score_probe_msp(lin_layer, val_loader, ood_loader, device)
print(f'CIFAR100-->CIFAR10 AUROC: {auroc_score(score_in, score_out):.2f}')

Ep 19/20: Accuracy : Train:99.66 	 Val:91.71 || Loss: Train 0.055 	 Val 0.229: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]

Model CLIP_best_max_train_acc.pth is loaded from epoch 19
CIFAR100-->CIFAR10 AUROC: 75.90





### Expected results

AUROC may slightly vary due to random initialization of linear probing.
```
CIFAR100-->CIFAR10 AUROC: 74.81
```

# Part VI. Mahalanobis distance as OOD score
- Use the output of the linear layer from Part V (linear probing) as features to compute the Mahalanobis distance and the relative Mahalanobis distance.
- To compute the Mahalanobis distance, group the features by their pseudolabels and compute the mean and covariance matrix for each class.
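
For reference, one common way to write these quantities is given below. This is a sketch: the symbols $C$, $\mu_c$, $\Sigma_c$, $\mu_0$ and $\Sigma_0$ are our own notation, and taking the shared covariance $\Sigma$ as the average of the class-wise covariances is an assumption about the intended estimator. Here $\mu_0$ and $\Sigma_0$ are the mean and covariance of all training features, ignoring labels.

```latex
\begin{aligned}
\mathrm{MD}_c(z)  &= (z-\mu_c)^\top \Sigma^{-1} (z-\mu_c), \qquad \Sigma = \frac{1}{C}\sum_{c=1}^{C}\Sigma_c, \\
\mathrm{MD}_0(z)  &= (z-\mu_0)^\top \Sigma_0^{-1} (z-\mu_0), \\
\mathrm{RMD}_c(z) &= \mathrm{MD}_c(z) - \mathrm{MD}_0(z), \\
s(z) &= -\min_{c}\, \mathrm{MD}_c(z) \quad \text{or} \quad s(z) = -\min_{c}\, \mathrm{RMD}_c(z).
\end{aligned}
```

The sign is flipped so that a larger score $s(z)$ means "more in-distribution", consistent with the other scores in this notebook.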

In [7]:
### optional to use
def calc_maha_distance(embeds, means_c, inv_cov_c):
    diff = embeds - means_c
    dist = np.matmul(diff,inv_cov_c)*diff
    dist = np.sum(dist,axis=1)
    return dist

def OOD_classifier_maha(train_embeds_in, train_labels_in, test_embeds_in, test_embeds_outs, num_classes,
                        relative=False):
    # optional to use our code!
    class_covs = []
    class_means = []
    used_classes = 0
    if type(train_labels_in) == torch.Tensor:
        train_labels_in = train_labels_in.cpu().numpy()
    if type(train_embeds_in) == torch.Tensor:
        train_embeds_in = train_embeds_in.cpu().numpy()
        test_embeds_in = test_embeds_in.cpu().numpy()
        test_embeds_outs = test_embeds_outs.cpu().numpy()
    ### START CODE HERE ### (≈ 23 lines of code)
    # calculate class-wise means and covariances
    classes = np.unique(train_labels_in)
    N = train_embeds_in.shape[0]
    for c in classes:
        embeds_c = train_embeds_in[train_labels_in == c]
        mu_c = np.mean(embeds_c, axis = 0)
        class_means.append(mu_c)
        sigma_c = np.cov((embeds_c - (mu_c.reshape([1,-1]))).T )
        # sigma_c = (embeds_c-mu_c).T@(embeds_c-mu_c)
        class_covs.append(sigma_c)
        used_classes +=1

    # estimate the shared covariance as the average of the class-wise covariances
    sigma = (1/used_classes)*np.sum(class_covs, axis = 0)
    sigma_inv = np.linalg.inv(sigma)
    # RMD: subtracting the average train score if relative is True
    if relative:
        mu_0 = np.mean(train_embeds_in, axis = 0)
        # sigma_0 = (1/N)* ((train_embeds_in - mu_0).T @ (train_embeds_in - mu_0))
        sigma_0_inv = np.linalg.inv(np.cov((train_embeds_in-mu_0.reshape([1,-1])).T))

        in_center = calc_maha_distance(test_embeds_in, mu_0, sigma_0_inv)
        out_center = calc_maha_distance(test_embeds_outs, mu_0, sigma_0_inv)
    else:
        in_center = np.zeros(test_embeds_in.shape[0])
        out_center = np.zeros(test_embeds_outs.shape[0])

    # Get OOD score for each datapoint
    all_scores_in = []
    all_scores_out = []
    for i,c in enumerate(classes):
        md_in = calc_maha_distance(test_embeds_in,class_means[i],sigma_inv) - in_center
        md_out = calc_maha_distance(test_embeds_outs,class_means[i],sigma_inv) - out_center

        all_scores_in.append(md_in)
        all_scores_out.append(md_out)

    
    all_scores_in = np.stack(all_scores_in)
    all_scores_out = np.stack(all_scores_out)
 
    scores_in = -np.min(all_scores_in, axis = 0)
    scores_out = -np.min(all_scores_out, axis = 0)
    ### END CODE HERE ###
    return scores_in, scores_out

# The code below is provided based on our implementation. Optional to use!
num_classes = 100
lin_layer = load_model(lin_layer, "CLIP_best_max_train_acc.pth")
logits_indist_train, indist_pseudolabels_train = get_features(lin_layer, train_loader, device)
logits_indist_test, indist_pseudolabels_test = get_features(lin_layer, val_loader, device)
logits_ood, _ = get_features(lin_layer, ood_loader, device)
# convert to numpy
indist_pseudolabels_train = indist_pseudolabels_train.cpu().numpy()
indist_pseudolabels_test = indist_pseudolabels_test.cpu().numpy()
logits_indist_train = logits_indist_train.cpu().numpy()
logits_indist_test = logits_indist_test.cpu().numpy()
logits_ood = logits_ood.cpu().numpy()

# run OOD classifier based on mahalanobis distance
scores_in, scores_out = OOD_classifier_maha(logits_indist_train, indist_pseudolabels_train, 
                                            logits_indist_test, logits_ood, num_classes, relative=False)
print(f'Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_in, scores_out):.2f}')
scores_in, scores_out = OOD_classifier_maha(logits_indist_train, indist_pseudolabels_train, 
                                            logits_indist_test, logits_ood, num_classes, relative=True)
print(f'Relative Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_in, scores_out):.2f}')                                    

Model CLIP_best_max_train_acc.pth is loaded from epoch 19
Maha: CIFAR100-->CIFAR10 AUROC: 84.56
Relative Maha: CIFAR100-->CIFAR10 AUROC: 88.72


### Expected results
(can differ based on linear probing performance)

```
Maha: CIFAR100-->CIFAR10 AUROC: 83.31
Relative Maha: CIFAR100-->CIFAR10 AUROC: 80.88
```

# Part VII. Mahalanobis distance using the real labels without linear probing
- Again, compute the (relative) Mahalanobis distance as the OOD score.
- This time, instead of using the pseudolabels and the output of the linear probing layer, use the real labels of the training data and the features computed in Part II.
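
The label-based variant can be sketched compactly in NumPy. The helper names `maha_sq` and `maha_ood_scores` and the 2-D toy data are made up for illustration, and the shared covariance is assumed to be the average of the class-wise covariances; treat it as one possible formulation rather than the reference implementation.

```python
import numpy as np

def maha_sq(x, mean, inv_cov):
    """Squared Mahalanobis distance of every row of x to `mean`."""
    diff = x - mean
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

def maha_ood_scores(train, labels, test, relative=False):
    classes = np.unique(labels)
    means = [train[labels == c].mean(axis=0) for c in classes]
    # shared covariance: average of the class-wise covariances (assumption)
    cov = np.mean([np.cov(train[labels == c], rowvar=False) for c in classes], axis=0)
    inv_cov = np.linalg.inv(cov)
    dists = np.stack([maha_sq(test, m, inv_cov) for m in means])   # (n_classes, n_test)
    if relative:
        # subtract the distance to a single Gaussian fitted on all training data
        inv_cov0 = np.linalg.inv(np.cov(train, rowvar=False))
        dists = dists - maha_sq(test, train.mean(axis=0), inv_cov0)
    return -dists.min(axis=0)                   # higher = more in-distribution

# Toy data (made up): two in-distribution classes and one far-away OOD blob
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
labels = rng.integers(0, 2, size=400)
train = centers[labels] + 0.5 * rng.normal(size=(400, 2))
test_in = centers[rng.integers(0, 2, size=100)] + 0.5 * rng.normal(size=(100, 2))
test_out = np.array([2.0, 5.0]) + 0.5 * rng.normal(size=(100, 2))

for relative in (False, True):
    s_in = maha_ood_scores(train, labels, test_in, relative)
    s_out = maha_ood_scores(train, labels, test_out, relative)
    print(relative, s_in.mean(), s_out.mean())
```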

In [156]:
### START CODE HERE ### (≈ 7 lines of code)
# load features
indist_train = torch.load('features/cifar100_train_feats.pt').cpu().numpy()
indist_test = torch.load('features/cifar100_test_feats.pt').cpu().numpy()
ood_test = torch.load('features/cifar10_test_feats.pt').cpu().numpy()
# load labels
indist_train_labels = torch.load('features/cifar100_train_labels.pt')
# run OOD classifier based on mahalanobis distance
scores_md_in, scores_md_out = OOD_classifier_maha(indist_train,indist_train_labels,indist_test,ood_test,num_classes,relative = False)
scores_rmd_in, scores_rmd_out = OOD_classifier_maha(indist_train,indist_train_labels,indist_test,ood_test,num_classes,relative = True)
### END CODE HERE ###
print(f'Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_md_in, scores_md_out):.2f}')
print(f'Relative Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_rmd_in, scores_rmd_out):.2f}')

Maha: CIFAR100-->CIFAR10 AUROC: 71.70
Relative Maha: CIFAR100-->CIFAR10 AUROC: 73.67


### Expected results
```
Maha: CIFAR100-->CIFAR10 AUROC: 71.71
Relative Maha: CIFAR100-->CIFAR10 AUROC: 84.93
```

# Part VIII. K-means clusters combined with Mahalanobis distance

The paper [SSD: A Unified Framework for Self-Supervised Outlier Detection](https://arxiv.org/abs/2103.12051) has proposed another unsupervised method for OOD detection. Instead of using the (real or pseudo) labels to define the class-wise means, we will now use the clusters found by k-means. In more detail:

- Find k=10,50,100 clusters using k-means on the in-distribution training data (you can use the sklearn `KMeans` implementation).
- Get the cluster centers.
- Use them as class-wise means for the Mahalanobis distance classifier.
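
The clustering step can be illustrated with a tiny NumPy version of Lloyd's algorithm. This is only a dependency-free stand-in for sklearn's `KMeans`; the function names `nearest_center` and `lloyd_kmeans` and the toy blobs are assumptions for illustration.

```python
import numpy as np

def nearest_center(x, centers):
    # squared Euclidean distance of every point to every center, then argmin
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def lloyd_kmeans(x, k, n_iter=20, seed=0):
    """Tiny k-means (Lloyd's algorithm); returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # initialise the centers with k distinct data points
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_center(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:                # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return nearest_center(x, centers), centers

# Toy data (made up): three 2-D blobs
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
x = np.concatenate([c + 0.3 * rng.normal(size=(50, 2)) for c in blob_centers])

labels, centers = lloyd_kmeans(x, k=3)
print(np.bincount(labels, minlength=3))
```

The resulting cluster assignments play the role of class labels: passing them as `train_labels_in` to a function like `OOD_classifier_maha` makes the per-cluster means act as class-wise means.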

In [8]:
# The code below is provided based on our implementation. Optional to use!
# load features - modify names if you use different names
indist_train = torch.load('features/cifar100_train_feats.pt').cpu().numpy()
indist_test = torch.load('features/cifar100_test_feats.pt').cpu().numpy()
ood_test = torch.load('features/cifar10_test_feats.pt').cpu().numpy()
results_md = []
results_rmd = []
for N in [10,50,100]:
    ### START CODE HERE ### (≈ 7 lines of code)
    print(N)
    kmeans = KMeans(n_clusters = N, n_init = 'auto').fit(indist_train)
    print(f'{N} fitted')
    train_labels = kmeans.labels_
    print(f'{N} Labels calculated')
    scores_md_in, scores_md_out = OOD_classifier_maha(indist_train, train_labels,indist_test,ood_test, num_classes = N, relative = False)
    print(f'{N} md scores calced')
    scores_rmd_in, scores_rmd_out = OOD_classifier_maha(indist_train, train_labels,indist_test,ood_test, num_classes = N, relative = True)
    print(f'{N} rmd scores calced')
    ### END CODE HERE ###
    print(f'Kmeans (k={N}) + MD: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_md_in, scores_md_out):.2f}')
    print(f'Kmeans (k={N}) + RMD: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_rmd_in, scores_rmd_out):.2f}')
    print("-"*100)

10



### Expected results
Can differ based on KMeans performance.
```
Kmeans (k=10) + MD: CIFAR100-->CIFAR10 AUROC: 67.87
Kmeans (k=10) + RMD: CIFAR100-->CIFAR10 AUROC: 42.38
----------------------------------------------------------------------------------------------------
Kmeans (k=50) + MD: CIFAR100-->CIFAR10 AUROC: 72.18
Kmeans (k=50) + RMD: CIFAR100-->CIFAR10 AUROC: 58.73
----------------------------------------------------------------------------------------------------
Kmeans (k=100) + MD: CIFAR100-->CIFAR10 AUROC: 72.84
Kmeans (k=100) + RMD: CIFAR100-->CIFAR10 AUROC: 68.59
----------------------------------------------------------------------------------------------------
```

That's the end of this exercise. If you reached this point, **congratulations**!