# Tip-of-the-Tongue: Doodle-Image Retrieval Engine

__Group 1:__ 
- Rishabh Anand (A0220603Y)
- New Jun Jie (TODO)
- Ai Bo (TODO)

---

__Tip of the tongue__ refers to the situation where we have a vague idea of an object in memory but simply cannot name it. More often than not, retrieval of the object's name feels imminent. Even so, we can usually draw a doodle of the object when asked to. The objective of this CS4243 project is to investigate the design of learning algorithms that retrieve a collection of real-world images from such hand-drawn doodles.

As this module is on Computer Vision, our project focuses _only_ on dataset collection and preprocessing, and on model selection, training, and testing. The trained models could then be packaged into a search engine that takes in a doodle and returns the top-k matching images.
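The top-k retrieval step mentioned above could look roughly like the sketch below. It assumes that doodles and real images have already been mapped to feature vectors by one of the models; the function name `top_k_matches` and the toy vectors are hypothetical and not part of the project code.

```python
import numpy as np

def top_k_matches(query_feat, gallery_feats, k=5):
    """Return indices and cosine similarities of the k gallery vectors closest to the query."""
    # L2-normalise so that a dot product equals cosine similarity
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                    # one similarity per gallery image
    order = np.argsort(-sims)[:k]   # highest similarity first
    return order, sims[order]

# Toy example: four gallery "image" embeddings and one "doodle" embedding
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
doodle = np.array([1.0, 0.1])
idx, scores = top_k_matches(doodle, gallery, k=2)
print(idx, scores)
```

Cosine similarity is used here only because it is a common choice for comparing embeddings; any other distance could be swapped in.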

As a taster, here are some interesting results:
<!-- ADD RESULTS -->

In [1]:
import random

import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset, DataLoader, Sampler
from torchvision import transforms
from torchinfo import summary

## Training and Testing Dataset

The dataset consists of an amalgam of over 1 million doodles web-scraped from the following sources:

- Google Quick, Draw!
- Sketchy

It also features real-life images web-scraped from:

- Google Images 
-

Our final dataset is an unpaired 

In [None]:
def combined_dataset(datasets, size):
    """Merge several {class_name: images} datasets into one dict of stacked arrays."""
    combined = {}
    for dataset in datasets.values():
        for class_name, class_data in dataset.items():
            # resize data so they can be stacked
            resized = np.stack(
                [cv2.resize(data, (size, size), interpolation=cv2.INTER_AREA) for data in class_data],
                axis=0,
            )
            combined.setdefault(class_name, []).append(resized)
    # merge the per-source arrays of each class into a single array
    for class_name, lst_datasets in combined.items():
        combined[class_name] = np.concatenate(lst_datasets, axis=0)
    return combined


class ImageDataset(Dataset):
    DATASET_DIR = {True: 'dataset/dataset_train.npy', False: 'dataset/dataset_test.npy'}

    def __init__(self, doodles_list, real_list, doodle_size, real_size, train: bool):
        super(ImageDataset, self).__init__()

        dataset = np.load(self.DATASET_DIR[train], allow_pickle=True)[()]

        doodle_datasets = {name: data for name, data in dataset.items() if name in doodles_list}
        real_datasets = {name: data for name, data in dataset.items() if name in real_list}
        self.doodle_dict = combined_dataset(doodle_datasets, doodle_size)
        self.real_dict = combined_dataset(real_datasets, real_size)

        # sanity check
        assert set(self.doodle_dict.keys()) == set(self.real_dict.keys()), \
            'doodle and real images label classes do not match'

        # process classes
        label_idx = {}
        for key in self.doodle_dict.keys():
            if key not in label_idx:
                label_idx[key] = len(label_idx)
        self.label_idx = label_idx

        # parse data and labels
        self.doodle_data, self.doodle_label = self._return_x_y_pairs(self.doodle_dict, label_idx)
        self.real_data, self.real_label = self._return_x_y_pairs(self.real_dict, label_idx)

        # data preprocessing
        self.doodle_preprocess = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(doodle_size),
            transforms.ToTensor(),
            transforms.Normalize((self.doodle_data/255).mean(), (self.doodle_data/255).std())   # IMPORTANT / 255
        ])

        self.real_preprocess = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize(real_size),
            transforms.ToTensor(),
            transforms.Normalize((self.real_data/255).mean(axis=(0, 1, 2)), (self.real_data/255).std(axis=(0, 1, 2)))
        ])

        print(f'Train = {train}. Doodle list: {doodles_list}, \n real list: {real_list}. \n classes: {label_idx.keys()} \n'
              f'Doodle data size {len(self.doodle_data)}, real data size {len(self.real_data)}, '
              f'ratio {len(self.doodle_data)/len(self.real_data)}')

    def _return_x_y_pairs(self, data_dict, category_mapping):
        xs, ys = [], []
        for key in data_dict.keys():
            data = data_dict[key]
            labels = [category_mapping[key]] * len(data)
            xs.append(data)
            ys.extend(labels)
        return np.concatenate(xs, axis=0), np.array(ys)

    def __getitem__(self, idx):
        # naive sampling scheme - sample with replacement
        # sample label first so that doodle and real data belong to the same category
        label = random.choice(list(self.label_idx.keys()))
        doodle_data = self.doodle_preprocess(random.choice(self.doodle_dict[label]))
        real_data = self.real_preprocess(random.choice(self.real_dict[label]))
        numer_label = self.label_idx[label]
        return doodle_data, numer_label, real_data, numer_label

    def __len__(self):
        return max(len(self.doodle_data), len(self.real_data))     # could be arbitrary number

In [None]:
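# doodles_list and real_list must name the sub-datasets stored in the .npy files;
# they are required by the constructor but are not defined in this notebook yet.
dataset = ImageDataset(doodles_list, real_list, doodle_size=64, real_size=32, train=True)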
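
`ImageDataset` reads its data with `np.load(..., allow_pickle=True)[()]`, which implies that each `.npy` file holds a pickled nested dictionary of the form `{dataset_name: {class_name: images}}`. The sketch below builds and reloads a toy file with that layout. The sub-dataset names, class names, and image shapes are placeholders chosen for illustration, not the project's real ones.

```python
import os
import tempfile

import numpy as np

# Hypothetical sub-dataset and class names, for illustration only
toy = {
    "toy_doodles": {"cat": np.zeros((3, 28, 28), dtype=np.uint8),
                    "dog": np.zeros((2, 28, 28), dtype=np.uint8)},
    "toy_photos": {"cat": np.zeros((4, 32, 32, 3), dtype=np.uint8),
                   "dog": np.zeros((5, 32, 32, 3), dtype=np.uint8)},
}

path = os.path.join(tempfile.mkdtemp(), "dataset_toy.npy")
np.save(path, toy)  # the dict is stored as a pickled 0-d object array

# [()] unwraps the 0-d object array back into the original dict
loaded = np.load(path, allow_pickle=True)[()]
for name, classes in loaded.items():
    print(name, {cls: arr.shape for cls, arr in classes.items()})
```

With such a file, `doodles_list` and `real_list` would be the lists of top-level keys that hold doodles and real images respectively.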

## Models and Approaches

1. Version 1: Multilayer Perceptron Classification
2. Version 2: Convolutional Neural Network Classification
3. Version 3: Convolutional Neural Network with Contrastive Loss
4. Version 4: Convolutional Neural Network with multiple Contrastive Losses
5. Version 5: ConvNeXt<sup>1</sup> with multiple Contrastive Losses

---

<sup>1</sup> Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.

## Version 1: Multilayer Perceptron Classification

The final architecture and pipeline look like so:

In [3]:
class MLP(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, dropout=0.2):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hid_dim)
        self.l2 = nn.Linear(hid_dim, hid_dim)
        self.l3 = nn.Linear(hid_dim, hid_dim)
        self.l4 = nn.Linear(hid_dim, out_dim)
        self.relu = nn.LeakyReLU(negative_slope=0.2)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, return_feats=False):
        x = x.flatten(1) # img to vector
        x = self.relu(self.l1(x))
        x = self.dropout(x)
        x = self.relu(self.l2(x))
        x = self.l3(x)
        feat = x
        x = self.relu(x)
        x = self.dropout(x)
        x = self.l4(x)

        if return_feats:
            return x, feat

        return x

## Version 2: Convolutional Neural Network Classification

The final pipeline and architecture look like so:

In [4]:
def conv_block(in_channels, out_channels, kernel_size, stride, padding, bias):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    )

class ConvNet(nn.Module):
    def __init__(self, num_classes, dropout=0.2):
        super().__init__()
        layer1 = nn.Sequential(
            conv_block(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            conv_block(64, 128, kernel_size=3, stride=1, padding=1, bias=True)  # 128 channels to match layer2's input
        )
        
        layer2 = conv_block(128, 192, kernel_size=3, stride=2, padding=1, bias=True)
        layer3 = conv_block(192, 256, kernel_size=3, stride=2, padding=1, bias=True)        
        pool = nn.AvgPool2d((2,2))
        
        self.layers = nn.Sequential(layer1, layer2, layer3, pool)
        self.nn = nn.Linear(2 * 2 * 256, num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, return_feats=False):
        feats = self.layers(x).flatten(1)
        out = self.nn(self.dropout(feats))

        if return_feats:
            return out, feats

        return out

In [6]:
class BetterCNN(nn.Module):
    def __init__(self, in_channels, classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, (3,3))
        self.conv2 = nn.Conv2d(32, 32, (3,3))
        self.conv3 = nn.Conv2d(32, 64, (3,3))
        self.conv4 = nn.Conv2d(64, 64, (3,3))
        self.mp = nn.MaxPool2d((2,2))
        self.flatten = nn.Flatten(1)

        self.l1 = nn.Linear(1600, 512)
        self.l2 = nn.Linear(512, 128)
        self.l3 = nn.Linear(128, classes)
        self.relu = nn.LeakyReLU()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.mp(self.conv2(x)))
        x = self.relu(self.conv3(x))
        x = self.relu(self.mp(self.conv4(x)))
        print (x.shape)
        x = self.flatten(x)
        print (x.shape)
        x = self.relu(self.l1(x))
        x = self.relu(self.l2(x))
        out = torch.softmax(self.l3(x), 1)

        return out

In [7]:
x = torch.rand(100, 3, 32, 32)
net = BetterCNN(3, 10)
y = net(x)
print (y.shape)

torch.Size([100, 64, 5, 5])
torch.Size([100, 1600])
torch.Size([100, 10])
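
The `1600` input features of `self.l1` follow from the layer sizes: each unpadded 3x3 convolution shrinks the side length by 2 and each 2x2 max-pool halves it. The helper `conv_out` below is a hypothetical utility written for this check, using the standard output-size formula for convolutions without dilation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a conv or pooling layer (floor convention, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 32
side = conv_out(side, kernel=3)            # conv1 -> 30
side = conv_out(side, kernel=3)            # conv2 -> 28
side = conv_out(side, kernel=2, stride=2)  # max-pool -> 14
side = conv_out(side, kernel=3)            # conv3 -> 12
side = conv_out(side, kernel=3)            # conv4 -> 10
side = conv_out(side, kernel=2, stride=2)  # max-pool -> 5
flat_features = 64 * side * side           # 64 output channels from conv4
print(side, flat_features)  # 5 1600
```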


## Version 3: Convolutional Neural Network with Contrastive Loss

#### Contrastive Loss
We follow the Contrastive Loss from SimCLR<sup>2</sup>:

$$
l_{i, j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}
$$

The total loss is the arithmetic mean of the losses for all positive pairs in a batch:

$$
L = \frac{1}{2N} \sum_{k=1}^{N} \left[ l_{2k-1, 2k} + l_{2k, 2k-1} \right]
$$
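
As a worked instance of the two formulas above, here is a minimal NumPy sketch. It assumes the embeddings are ordered so that consecutive rows form a positive pair and that `sim` is cosine similarity, as in SimCLR. The function name `nt_xent` and the toy inputs are hypothetical, and this is an illustration of the equations rather than the loss function used in the training code.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """z has shape (2N, d); rows 2k and 2k+1 (0-indexed) form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / tau               # pairwise cosine similarity over temperature
    np.fill_diagonal(sim, -np.inf)      # the indicator 1[k != i]: drop self-similarity
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    partner = np.arange(len(z)) ^ 1     # index of each row's positive: 0<->1, 2<->3, ...
    losses = log_denominator - sim[np.arange(len(z)), partner]
    return losses.mean()                # (1 / 2N) * sum over both orderings of each pair

# Positive pairs that agree should give a lower loss than pairs that disagree
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(nt_xent(aligned), nt_xent(misaligned))
```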

The architecture and pipeline look like so:


---

<sup>2</sup> Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.

In [None]:
def conv_block(in_channels, out_channels, kernel_size, stride, padding, bias):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    )

class ConvNet(nn.Module):
    def __init__(self, num_classes, dropout=0.2):
        super().__init__()
        layer1 = nn.Sequential(
            conv_block(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            conv_block(64, 128, kernel_size=3, stride=1, padding=1, bias=True)  # 128 channels to match layer2's input
        )
        
        layer2 = conv_block(128, 192, kernel_size=3, stride=2, padding=1, bias=True)
        layer3 = conv_block(192, 256, kernel_size=3, stride=2, padding=1, bias=True)        
        pool = nn.AvgPool2d((2,2))
        
        self.layers = nn.Sequential(layer1, layer2, layer3, pool)
        self.nn = nn.Linear(2 * 2 * 256, num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, return_feats=False):
        feats = self.layers(x).flatten(1)
        x = self.nn(self.dropout(feats))

        return x, feats

In [2]:
def compute_sim_matrix(feats):
    """
    Takes in a batch of features of size (bs, feat_len).
    """
    sim_matrix = F.cosine_similarity(feats.unsqueeze(2).expand(-1, -1, feats.size(0)),
                                     feats.unsqueeze(2).expand(-1, -1, feats.size(0)).transpose(0, 2),
                                     dim=1)

    return sim_matrix


def compute_target_matrix(labels):
    """
    Takes in a label vector of size (bs)
    """
    label_matrix = labels.unsqueeze(-1).expand((labels.shape[0], labels.shape[0]))
    trans_label_matrix = torch.transpose(label_matrix, 0, 1)
    target_matrix = (label_matrix == trans_label_matrix).type(torch.float)

    return target_matrix


def contrastive_loss(pred_sim_matrix, target_matrix, temperature):
    # softmax over each row (dim=1); log_softmax is more stable than softmax(...).log()
    return F.kl_div(F.log_softmax(pred_sim_matrix / temperature, dim=1),
                    F.softmax(target_matrix / temperature, dim=1),
                    reduction="batchmean", log_target=False)


def compute_contrastive_loss_from_feats(feats, labels, temperature):
    sim_matrix = compute_sim_matrix(feats)
    target_matrix = compute_target_matrix(labels)
    loss = contrastive_loss(sim_matrix, target_matrix, temperature)
    return loss
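
To make the target of the KL divergence concrete, the NumPy sketch below reproduces what `compute_target_matrix` yields for a small label vector, followed by the row-wise temperature softmax that `contrastive_loss` applies to it. The labels and the temperature value are arbitrary examples.

```python
import numpy as np

labels = np.array([0, 1, 0, 2])
# Entry (i, j) is 1.0 when samples i and j share a label, mirroring compute_target_matrix
target = (labels[:, None] == labels[None, :]).astype(np.float32)
print(target)

# Row-wise softmax with temperature, mirroring what contrastive_loss applies to the target
temperature = 0.5
logits = target / temperature
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.round(3))
```

Each row of `probs` puts most of its mass on the samples that share the row's label, which is the distribution the predicted similarities are pushed towards.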

## Version 4: Convolutional Neural Network with multiple Contrastive Losses

We add two more losses to the Contrastive Loss from Version 3.

#### Loss 2

#### Loss 3

The final architecture and pipeline look like so:

## Version 5: ConvNeXt with multiple Contrastive Losses

<!-- TODO: Talk about ConvNeXt -->

Finally, we train ConvNeXt with the three losses used in Version 4. ConvNeXt can be thought of as a "modernised" ConvNet designed to compete head-on with Transformers.

ConvNeXt is an improvement over the standard ConvNet that brings together innovations from the Transformer<sup>3</sup> and ResNet<sup>4</sup>. Here is the list of enhancements we wish to showcase in this CS4243 project:

1. Block-based 

The final architecture and pipeline look like so:

---
<sup>3</sup> Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

<sup>4</sup> He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

In [None]:
class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, (7,7), padding=3, groups=dim)
        self.lin1 = nn.Linear(dim, 4 * dim)
        self.lin2 = nn.Linear(4 * dim, dim)
        self.ln = nn.LayerNorm(dim)
        self.gelu = nn.GELU()

    def forward(self, x):
        res_inp = x
        x = self.conv1(x)
        x = x.permute(0, 2, 3, 1) # NCHW -> NHWC
        x = self.ln(x)
        x = self.lin1(x)
        x = self.gelu(x)  # ConvNeXt places its single GELU between the two pointwise layers
        x = self.lin2(x)
        x = x.permute(0, 3, 1, 2) # NHWC -> NCHW
        out = x + res_inp

        return out
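
A quick parameter count shows why the 7x7 convolution in this block is affordable: with `groups=dim` it is depthwise, so it needs one filter per channel instead of `dim` filters per channel. The helper below is a hypothetical back-of-the-envelope calculation for the layers defined in `ConvNeXtBlock`; `dim = 96` is only an example width.

```python
def convnext_block_params(dim):
    depthwise = dim * 7 * 7 + dim      # one 7x7 filter per channel (groups=dim) plus bias
    layer_norm = 2 * dim               # scale and shift
    expand = dim * 4 * dim + 4 * dim   # Linear(dim, 4 * dim)
    project = 4 * dim * dim + dim      # Linear(4 * dim, dim)
    return depthwise + layer_norm + expand + project

dim = 96
block_total = convnext_block_params(dim)
dense_7x7 = dim * dim * 7 * 7 + dim    # a regular (non-depthwise) 7x7 conv, for comparison
print(block_total, dense_7x7)
```

For this width, a single dense 7x7 convolution would hold several times more parameters than the entire block.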

In [None]:
class ConvNeXt(nn.Module):
    def __init__(self):
        super().__init__()
        pass
    
    def forward(self, x):
        pass

## Metrics

While the quality of real-life images returned by the model for a given doodle is subjective, we use classification accuracy
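
Classification accuracy can be computed from a model's output scores as in the minimal sketch below. The function name `accuracy` and the toy scores are illustrative assumptions, not taken from the evaluation code.

```python
import numpy as np

def accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the ground-truth label."""
    preds = np.argmax(logits, axis=1)
    return float((preds == labels).mean())

# Four samples, three classes; the third sample is misclassified
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 0.1, 1.5],
                   [0.9, 1.1, 0.4],
                   [0.3, 0.2, 0.1]])
labels = np.array([0, 2, 0, 0])
print(accuracy(logits, labels))  # 0.75
```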

## Results and Evaluation

## Analysis and Ablations

### t-SNE

### GradCAM on ConvNet and ConvNeXt