## Introduction


*Disclaimer: This notebook is based on my understanding of the [detectron2](https://github.com/facebookresearch/detectron2) and the [visualbert](https://github.com/uclanlp/visualbert) repositories. Hence, I do not guarantee that this is the "correct" or "recommended" way to get visual embeddings from detectron2. Having said that, I'm definitely looking to improve this notebook and open to any criticism/suggestions. You can reach me at chhablani.gunjan@gmail.com with any issues that concern you regarding this notebook.*

This notebook is based on the concept in the [script to extract image features](https://github.com/uclanlp/visualbert/blob/master/utils/get_image_features/extract_image_features_nlvr.py) for the NLVR2 task in the [visualbert](https://github.com/uclanlp/visualbert) repository. You can refer to this script for a "safer" way to extract visual embeddings. The script uses [detectron](https://github.com/facebookresearch/Detectron) and should be fairly easy to use (I hope) without getting into the nitty-gritty.

However, for the sake of using detectron2, which will have better support (for the foreseeable future) than detectron, I present this notebook example to you.



For extracting visual embeddings, we need the features from various regions in the image which are used in the classification. This means that we need to "detect" the regions which might have objects in them.

The detectron2 library, off-the-shelf, does not support intermediate tensor extraction. But, there are ways the user can get the values of these tensors with some effort. See the docs [here](https://detectron2.readthedocs.io/en/latest/tutorials/models.html#partially-execute-a-model). In this notebook, I will be using the *partial execution* method as described in the docs. I admit that the other approaches might be easier or better suited, but this is just a start ;).

**Tip:** If you're looking to play with detectron2, you might like this [Colab tutorial](https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=h9tECBQCvMv3).

For the purpose of this notebook, I will be using an example from the [VQA v2](https://visualqa.org/download.html) validation set, as it is one of the tasks VisualBert has been used for. VisualBert authors used pre-generated embeddings for VQA v2, however.

## How does it work?

The model checkpoint that we will be using for this notebook is a MaskRCNN+ResNet-101+FPN checkpoint.

First, the ResNet+FPN backbone generates image features at several scales. These features are then passed to the region proposal network (RPN), which generates 1000 region proposals. The ROI heads pool a fixed-size feature for each proposal with the ROIAlign layer, run the box head on it to perform classification and box regression, and then pass the resulting predictions on to the Mask R-CNN mask head.

We want to extract the box features in the ROI heads that are used for classification. However, we don't want to keep all the proposals (there are 1000 of them!). To reduce them, we apply non-maximum suppression (NMS) with an overlap threshold for each class, and then filter the remaining boxes further with a class score threshold.
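
To make the selection step concrete, here is a minimal pure-Python sketch of greedy NMS followed by a score threshold. It is only an illustration: the helper names `iou`, `nms_indices`, and `select`, the `(x1, y1, x2, y2)` box format, and the threshold values are assumptions of this sketch. The notebook itself relies on `detectron2.layers.nms`, applied once per class.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_indices(boxes, scores, iou_thresh):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def select(boxes, scores, iou_thresh=0.5, score_thresh=0.5):
    """Apply NMS, then keep only the survivors whose score reaches the threshold."""
    return [i for i in nms_indices(boxes, scores, iou_thresh) if scores[i] >= score_thresh]


# Hypothetical toy data: box 1 overlaps box 0 heavily, box 3 has a low score
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = select(boxes, scores)
print(kept)
```

In this toy case, box 1 is suppressed because it overlaps the higher-scoring box 0, and box 3 survives NMS but is dropped by the score threshold.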

### Install Detectron2

In [1]:
import torch
torch.__version__

'1.11.0'

In [2]:
torch.cuda.is_available()

True

In [3]:
%%capture
# See https://detectron2.readthedocs.io/tutorials/install.html for instructions
!pip install pyyaml==5.1
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git@05bc8439ca10e11300d9d34e4fe0dd1d3f42773a'

### Imports

In [1]:
import pandas as pd

In [4]:
import torch, torchvision
import matplotlib.pyplot as plt
import json
import cv2
import numpy as np
from copy import deepcopy

In [3]:
from transformers import BertTokenizer, VisualBertForPreTraining

In [8]:
from detectron2.modeling import build_model
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.structures.image_list import ImageList
from detectron2.data import transforms as T
from detectron2.modeling.box_regression import Box2BoxTransform
from detectron2.modeling.roi_heads.fast_rcnn import FastRCNNOutputLayers
from detectron2.modeling.roi_heads.fast_rcnn import FastRCNNOutputs
from detectron2.structures.boxes import Boxes
from detectron2.layers import nms
from detectron2 import model_zoo
from detectron2.config import get_cfg

In [None]:
cfg_path = "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"

def load_config_and_model_weights(cfg_path):
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(cfg_path))

    # ROI HEADS SCORE THRESHOLD
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

    # Uncomment the next line to run on CPU instead of 'cuda'
    #cfg['MODEL']['DEVICE']='cpu'

    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(cfg_path)

    return cfg


def get_model(cfg):
    # build model
    model = build_model(cfg)

    # load weights
    checkpointer = DetectionCheckpointer(model)
    checkpointer.load(cfg.MODEL.WEIGHTS)

    # eval mode
    model.eval()
    return model


def prepare_image_inputs(cfg, img_list):
    # Resizing the image according to the configuration
    transform_gen = T.ResizeShortestEdge(
                [cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST], cfg.INPUT.MAX_SIZE_TEST
            )
    img_list = [transform_gen.get_transform(img).apply_image(img) for img in img_list]

    # Convert to C,H,W format
    convert_to_tensor = lambda x: torch.Tensor(x.astype("float32").transpose(2, 0, 1))

    batched_inputs = [{"image":convert_to_tensor(img), "height": img.shape[0], "width": img.shape[1]} for img in img_list]

    # Normalizing the image
    num_channels = len(cfg.MODEL.PIXEL_MEAN)
    pixel_mean = torch.Tensor(cfg.MODEL.PIXEL_MEAN).view(num_channels, 1, 1)
    pixel_std = torch.Tensor(cfg.MODEL.PIXEL_STD).view(num_channels, 1, 1)
    normalizer = lambda x: (x - pixel_mean) / pixel_std
    images = [normalizer(x["image"]) for x in batched_inputs]

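    # Convert to ImageList and move it to the model's device
    # NOTE: `model` is read from the global scope, so get_model() must have run first
    images = ImageList.from_tensors(images, model.backbone.size_divisibility).to(model.device)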
    
    return images, batched_inputs


def get_features(model, images):
    features = model.backbone(images.tensor)
    return features


def get_proposals(model, images, features):
    proposals, _ = model.proposal_generator(images, features)
    return proposals


def get_box_features(model, features, proposals, batch_size):
    features_list = [features[f] for f in ['p2', 'p3', 'p4', 'p5']]
    box_features = model.roi_heads.box_pooler(features_list, [x.proposal_boxes for x in proposals])
    box_features = model.roi_heads.box_head.flatten(box_features)
    box_features = model.roi_heads.box_head.fc1(box_features)
    box_features = model.roi_heads.box_head.fc_relu1(box_features)
    box_features = model.roi_heads.box_head.fc2(box_features)

    box_features = box_features.reshape(batch_size, 1000, 1024) # depends on your config and batch size
    return box_features, features_list


def get_prediction_logits(model, features_list, proposals):
    cls_features = model.roi_heads.box_pooler(features_list, [x.proposal_boxes for x in proposals])
    cls_features = model.roi_heads.box_head(cls_features)
    pred_class_logits, pred_proposal_deltas = model.roi_heads.box_predictor(cls_features)
    return pred_class_logits, pred_proposal_deltas


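def get_box_scores(cfg, pred_class_logits, pred_proposal_deltas):
    # NOTE: `proposals` is not an argument; FastRCNNOutputs below reads it from the
    # global scope, so it must be defined there before this function is called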
    box2box_transform = Box2BoxTransform(weights=cfg.MODEL.ROI_BOX_HEAD.BBOX_REG_WEIGHTS)
    smooth_l1_beta = cfg.MODEL.ROI_BOX_HEAD.SMOOTH_L1_BETA

    outputs = FastRCNNOutputs(
        box2box_transform,
        pred_class_logits,
        pred_proposal_deltas,
        proposals,
        smooth_l1_beta,
    )

    boxes = outputs.predict_boxes()
    scores = outputs.predict_probs()
    image_shapes = outputs.image_shapes

    return boxes, scores, image_shapes


def get_output_boxes(boxes, batched_inputs, image_size):
    proposal_boxes = boxes.reshape(-1, 4).clone()
    scale_x, scale_y = (batched_inputs["width"] / image_size[1], batched_inputs["height"] / image_size[0])
    output_boxes = Boxes(proposal_boxes)

    output_boxes.scale(scale_x, scale_y)
    output_boxes.clip(image_size)

    return output_boxes


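def select_boxes(cfg, output_boxes, scores):
    test_score_thresh = cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST
    test_nms_thresh = cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST
    cls_prob = scores.detach()
    # In detectron2 the background class is the LAST score column, not the first
    num_classes = cls_prob.shape[1] - 1
    # One box per (proposal, class); avoids hardcoding 1000 proposals and 80 classes
    cls_boxes = output_boxes.tensor.detach().reshape(-1, num_classes, 4)
    max_conf = torch.zeros(cls_boxes.shape[0], device=cls_prob.device)
    for cls_ind in range(num_classes):
        cls_scores = cls_prob[:, cls_ind]
        det_boxes = cls_boxes[:, cls_ind, :]
        # Indices of the proposals that survive NMS for this class
        keep = nms(det_boxes, cls_scores, test_nms_thresh)
        max_conf[keep] = torch.where(cls_scores[keep] > max_conf[keep], cls_scores[keep], max_conf[keep])
    # Keep proposals whose best per-class score reaches the threshold
    keep_boxes = torch.where(max_conf >= test_score_thresh)[0]
    return keep_boxes, max_conf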



MIN_BOXES=10
MAX_BOXES=100

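def filter_boxes(keep_boxes, max_conf, min_boxes, max_boxes):
    # Proposal indices sorted by descending confidence
    order = torch.argsort(max_conf, descending=True).cpu().numpy()
    if len(keep_boxes) < min_boxes:
        keep_boxes = order[:min_boxes]
    elif len(keep_boxes) > max_boxes:
        keep_boxes = order[:max_boxes]
    else:
        # Always return a NumPy array so that get_visual_embeds can call .copy()
        keep_boxes = keep_boxes.cpu().numpy()
    return keep_boxes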


def get_visual_embeds(box_features, keep_boxes):
    return box_features[keep_boxes.copy()]

In [95]:
@torch.no_grad()  # inference only, so gradients are not needed
def get_visual_embeddings(cfg, model, paths):
    img_arr = []
    for path in paths:
        # cv2.imread returns a uint8 BGR array, the format Detectron2 expects;
        # plt.imread would return floats in [0, 1] (possibly RGBA) for PNG files
        img_bgr = cv2.imread(path)
        img_arr.append(img_bgr)

    # Backbone features -> region proposals -> per-proposal box features
    images, batched_inputs = prepare_image_inputs(cfg, img_arr)
    features = get_features(model, images)
    proposals = get_proposals(model, images, features)
    box_features, features_list = get_box_features(model, features, proposals, len(images))
    pred_class_logits, pred_proposal_deltas = get_prediction_logits(model, features_list, proposals)

    boxes, scores, image_shapes = get_box_scores(cfg, pred_class_logits, pred_proposal_deltas)

    output_boxes = [get_output_boxes(boxes[i], batched_inputs[i], proposals[i].image_size) for i in range(len(proposals))]

    # Per image: NMS and score threshold, then clamp the number of boxes
    temp = [select_boxes(cfg, output_boxes[i], scores[i]) for i in range(len(scores))]
    keep_boxes, max_conf = [], []
    for keep_box, mx_conf in temp:
        keep_boxes.append(keep_box)
        max_conf.append(mx_conf)

    keep_boxes = [filter_boxes(keep_box, mx_conf, MIN_BOXES, MAX_BOXES) for keep_box, mx_conf in zip(keep_boxes, max_conf)]

    visual_embeds = [get_visual_embeds(box_feature, keep_box) for box_feature, keep_box in zip(box_features, keep_boxes)]

    return visual_embeds

In [None]:

from google.colab import files

In [None]:
# Upload the file (kaggle.json)
files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
!kaggle datasets download -d parthplc/facebook-hateful-meme-dataset

In [None]:
!unzip facebook-hateful-meme-dataset.zip

In [100]:
        self.data = [json.loads(l) for l in open(data_path)]
        self.data_dir = os.path.dirname(data_path)
        self.transforms = transforms
            
    def __getitem__(self, index: int):
        #image = Image.open(os.path.join(self.data_dir, self.data[index]["img"]))   
        
        path = os.path.join(self.data_dir, self.data[index]["img"])

In [103]:
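cfg = load_config_and_model_weights(cfg_path)

# `device` was not defined earlier in the notebook; use the device set in the config
device = torch.device(cfg.MODEL.DEVICE)

model = get_model(cfg)
model = model.to(device)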

In [10]:
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader

In [18]:
import torch
import json
import os
import cv2


class HatefulMemesDataset(torch.utils.data.Dataset):
    """Yields (image path, meme text, label) triples from a Hateful Memes .jsonl file."""

    def __init__(self, data_path):
        # Each line of the .jsonl file is one JSON record
        with open(data_path) as f:
            self.data = [json.loads(line) for line in f]
        self.data_dir = os.path.dirname(data_path)

    def __getitem__(self, index: int):
        # Image paths in the file are relative to the directory that contains it
        path = os.path.join(self.data_dir, self.data[index]["img"])
        text = self.data[index]["text"]
        label = self.data[index]["label"]
        return path, text, label

    def __len__(self):
        return len(self.data)

In [2]:
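import os

data_dir = r'E:\datasets\MADE\3_graduation\parthplc\archive\data'

# os.path.join avoids the doubled trailing backslash in the raw string
train_path = os.path.join(data_dir, 'train.jsonl')
dev_path = os.path.join(data_dir, 'dev.jsonl')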


train_data = pd.read_json(train_path, lines=True)
test_data = pd.read_json(dev_path, lines=True)

test_data.head(3)

|   | id | img | label | text |
|---|---|---|---|---|
| 0 | 8291 | img/08291.png | 1 | white people is this a shooting range |
| 1 | 46971 | img/46971.png | 1 | bravery at its finest |
| 2 | 3745 | img/03745.png | 1 | your order comes to $37.50 and your white priv... |


In [20]:
train_dataset = HatefulMemesDataset(train_path)
val_dataset = HatefulMemesDataset(dev_path)

In [21]:
for paths, texts, labels in tqdm(DataLoader(val_dataset, batch_size=8)):
    print(paths)
    print(texts)
    print(labels)
    break

('E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/08291.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/46971.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/03745.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/83745.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/80243.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/05279.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/01796.png', 'E:\\datasets\\MADE\\3_graduation\\parthplc\\archive\\data\\img/53046.png')
('white people is this a shooting range', 'bravery at its finest', 'your order comes to $37.50 and your white privilege discount brings the total to $37.50', 'it is time.. to send these parasites back to the desert', 'mississippi wind chime', "knowing white people , that's probably the baby father", 'life hack #23 how to get stoned with no weed', "you've heard of elf on a shelf, now get r

In [109]:
assert False

## Tips for putting it all together

Note that these steps can be grouped into separate stages to make the pipeline more efficient:
1. Build the model once and store it in a variable.
2. Transform the images and create the batched inputs separately.
3. Generate the visual embeddings by running the detectron2 model on the batched inputs.

Ideally, you want to build a class around this for ease of use. The class should hold all the methods, the model, and the configuration details, and it should process a batch of images and convert them to embeddings.
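
One possible shape for such a class is sketched below. It is a hedged outline rather than a tested detectron2 integration: the class name `VisualEmbedder`, its `embed_fn` argument, and the default `batch_size` are assumptions made for illustration. The idea is that `embed_fn` would be a function like `get_visual_embeddings(cfg, model, paths)` from this notebook; a stand-in function is used here so that the sketch runs on its own.

```python
from typing import Any, Callable, List, Sequence


class VisualEmbedder:
    """Holds the config and model once and embeds image paths in fixed-size batches."""

    def __init__(self, cfg: Any, model: Any,
                 embed_fn: Callable[[Any, Any, Sequence[str]], List[Any]],
                 batch_size: int = 8):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.cfg = cfg
        self.model = model
        self.embed_fn = embed_fn  # e.g. get_visual_embeddings from this notebook
        self.batch_size = batch_size

    def __call__(self, paths: Sequence[str]) -> List[Any]:
        embeddings: List[Any] = []
        for start in range(0, len(paths), self.batch_size):
            batch = paths[start:start + self.batch_size]
            # embed_fn is expected to return one entry per image in the batch
            embeddings.extend(self.embed_fn(self.cfg, self.model, batch))
        return embeddings


# Stand-in for the real embedding function, used only to show the call pattern
def fake_embed_fn(cfg, model, paths):
    return [f"embedding({p})" for p in paths]


embedder = VisualEmbedder(cfg=None, model=None, embed_fn=fake_embed_fn, batch_size=2)
result = embedder(["a.png", "b.png", "c.png"])
print(result)
```

Passing the embedding function in as an argument keeps the batching logic independent of detectron2, so the wrapper can be exercised without loading a model.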

## Using the embeddings with VisualBert

In [115]:
!pip install transformers

In [117]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [118]:
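# `question1` (a question string) must be defined before running this cell; its definition is not shown here
questions = [question1]  # or [question1, question2] for a batch of two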
tokens = tokenizer(questions, padding='max_length', max_length=50)

In [119]:
tokens

{'input_ids': [[101, 2054, 2003, 1996, 2611, 3061, 2006, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

In [120]:
input_ids = torch.tensor(tokens["input_ids"])
attention_mask = torch.tensor(tokens["attention_mask"])
token_type_ids = torch.tensor(tokens["token_type_ids"])

In [121]:
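# `visual_embeds` is the list returned by get_visual_embeddings(cfg, model, paths);
# torch.stack requires every image to have the same number of boxes
visual_embeds = torch.stack(visual_embeds)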
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

In [122]:
model = VisualBertForPreTraining.from_pretrained('uclanlp/visualbert-nlvr2-coco-pre') # this checkpoint has 1024 dimensional visual embeddings projection

In [123]:
input_ids

tensor([[ 101, 2054, 2003, 1996, 2611, 3061, 2006, 1029,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0]])

In [124]:
input_ids.shape

torch.Size([1, 50])

In [125]:
outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, visual_embeds=visual_embeds, visual_attention_mask=visual_attention_mask, visual_token_type_ids=visual_token_type_ids)

In [126]:
outputs

VisualBertForPreTrainingOutput(loss=None, prediction_logits=tensor([[[ -6.4359,  -6.2968,  -6.5019,  ...,  -6.6693,  -7.2687,  -7.0636],
         [ -9.0872,  -8.8805,  -8.9193,  ..., -10.0792,  -8.8213,  -8.9826],
         [-10.0700, -10.0092, -10.0334,  ..., -10.7274, -10.3397,  -7.7949],
         ...,
         [ -6.0317,  -5.9081,  -6.0876,  ...,  -6.1065,  -6.7837,  -7.5046],
         [ -6.2400,  -6.2881,  -6.4031,  ...,  -6.6600,  -7.0552,  -7.0159],
         [ -5.7199,  -5.8479,  -5.8892,  ...,  -6.4322,  -6.5893,  -7.2381]]],
       grad_fn=<ViewBackward0>), seq_relationship_logits=tensor([[0.2899, 0.2922]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [127]:
outputs.prediction_logits.shape

torch.Size([1, 150, 30522])

In [129]:
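# Flatten everything after the batch dimension (150 * 30522 values per example);
# this avoids relying on `batch_size`, which is not defined in this notebook
outputs.prediction_logits.flatten(start_dim=1).shape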

torch.Size([1, 4578300])

## References

1. [Detectron2 Colab Tutorial](https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=h9tECBQCvMv3)
2. [Detectron Repository](https://github.com/facebookresearch/Detectron)
3. [Detectron2 Repository](https://github.com/facebookresearch/detectron2)
4. [Detectron2 Docs](https://detectron2.readthedocs.io/en/latest/index.html)
5. [VisualBert Repository](https://github.com/uclanlp/visualbert)
6. [Medium Article on Detectron2 by Hiroto Honda](https://medium.com/@hirotoschwert/digging-into-detectron-2-47b2e794fabd)