<a href="https://colab.research.google.com/github/TeamMAMI/MAMI/blob/GenerateVisualEmbeddings/Generating_Visual_Embeddings_with_MAMI_trial_images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction


*Disclaimer: This notebook is based on my understanding of the [detectron2](https://github.com/facebookresearch/detectron2) and the [visualbert](https://github.com/uclanlp/visualbert) repositories. Hence, I do not guarantee that this is the "correct" or "recommended" way to get visual embeddings from detectron2. Having said that, I'm definitely looking to improve this notebook and open to any criticism/suggestions. You can reach me at chhablani.gunjan@gmail.com with any issues that concern you regarding this notebook.*

This notebook is based on the [script to extract image features](https://github.com/uclanlp/visualbert/blob/master/utils/get_image_features/extract_image_features_nlvr.py) for the NLVR2 task in the [visualbert](https://github.com/uclanlp/visualbert) repository. You can refer to that script for a "safer" way to extract visual embeddings. The script uses [detectron](https://github.com/facebookresearch/Detectron), and it should be fairly easy to use (I hope) without getting into the nitty-gritty.

However, for the sake of using detectron2, which will have better support (for the foreseeable future) than detectron, I present this notebook example to you.



In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


For extracting visual embeddings, we need the features from various regions in the image which are used in the classification. This means that we need to "detect" the regions which might have objects in them.

The detectron2 library, off-the-shelf, does not support intermediate tensor extraction. But, there are ways the user can get the values of these tensors with some effort. See the docs [here](https://detectron2.readthedocs.io/en/latest/tutorials/models.html#partially-execute-a-model). In this notebook, I will be using the *partial execution* method as described in the docs. I admit that the other approaches might be easier or better suited, but this is just a start ;).
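
For comparison with partial execution, a forward hook is one of the other approaches: it records a submodule's output while the full model runs. This is a minimal sketch on a toy `torch.nn.Sequential`; the toy network, the `captured` dict and the `save_output` function are illustrative assumptions, not detectron2 objects.

```python
import torch
from torch import nn

# Toy stand-in for a detection model; the layer sizes are arbitrary.
toy_model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

captured = {}

def save_output(module, inputs, output):
    # PyTorch calls this right after `module.forward` finishes.
    captured["hidden"] = output.detach()

# Hook the ReLU so that its activation is kept.
handle = toy_model[1].register_forward_hook(save_output)

with torch.no_grad():
    logits = toy_model(torch.randn(2, 8))

handle.remove()  # stop recording once we have what we need
print(captured["hidden"].shape)  # torch.Size([2, 16])
print(logits.shape)              # torch.Size([2, 4])
```

The hook leaves the forward pass untouched, whereas partial execution (used in the rest of this notebook) calls the submodules one by one.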

**Tip:** If you're looking to play with detectron2, you might like this [Colab tutorial](https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=h9tECBQCvMv3).

For the purpose of this notebook, I will be using an example from the [VQA v2](https://visualqa.org/download.html) validation set, as it is one of the tasks VisualBert has been used for. VisualBert authors used pre-generated embeddings for VQA v2, however.

## How it works

The model checkpoint that we will be using for this notebook is a MaskRCNN+ResNet-101+FPN checkpoint.

First, the ResNet+FPN backbone generates image features at several scales. These features are passed to the region proposal network (RPN), which generates up to 1000 region proposals. The proposals are passed to the ROI heads, which perform classification and box regression; the predicted boxes are then pooled with an ROIAlign layer and sent to the Mask R-CNN mask head.

We want to extract the box features that the ROI heads use for classification. However, we don't want to keep all the proposals (there can be 1000 of them!). To reduce them, we apply non-maximum suppression (NMS) with an IoU threshold, and then filter the remaining boxes with a class score threshold.
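
To make the two filtering steps concrete, here is a small NumPy sketch of greedy NMS followed by a score threshold. It is only an illustration of the idea; the notebook itself uses `detectron2.layers.nms`, and the boxes, scores and thresholds below are made up.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_thresh):
    """Return the indices of the kept boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.3])

kept = greedy_nms(boxes, scores, iou_thresh=0.5)
# Second filter: keep only the confident boxes.
confident = [i for i in kept if scores[i] >= 0.5]
print(kept, confident)
```

Box 1 overlaps box 0 heavily and is suppressed; box 2 survives NMS but falls below the score threshold.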

### Install Detectron2

In [None]:
import torch
torch.__version__

'1.10.0+cu111'

In [None]:
torch.cuda.is_available()

True

In [None]:
%%capture
# The cell magic has to be the first line of the cell.
# See https://detectron2.readthedocs.io/tutorials/install.html for instructions
!pip install pyyaml==5.1
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

### Imports

In [None]:
import cv2
import json
import numpy as np
from pandas import read_csv  # only read_csv is used below
from PIL import Image
from copy import deepcopy
import torch, torchvision
import matplotlib.pyplot as plt

In [None]:
from detectron2 import model_zoo
from detectron2.layers import nms
from detectron2.config import get_cfg
from detectron2.layers import ShapeSpec
from detectron2.data import transforms as T
from detectron2.modeling import build_model
from detectron2.structures.boxes import Boxes
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.structures.image_list import ImageList
from detectron2.modeling.box_regression import Box2BoxTransform
from detectron2.modeling.roi_heads.fast_rcnn import FastRCNNOutputLayers

### Download the VQA v2 Validation Set

### Load Examples
The next few cells load the examples listed in `trial.csv` (file name, label and text transcription). Only the image of each example is used for the visual embeddings.

In [None]:
!ls

drive  sample_data


In [None]:
%cd drive/Shareddrives/team_MAMI/MAMI/TRIAL
!ls

/content/drive/Shareddrives/team_MAMI/MAMI/TRIAL
Images	trial.csv


In [None]:
# with open('v2_OpenEnded_mscoco_val2014_questions.json') as f:
    # q = json.load(f)
data = read_csv('trial.csv', sep='\t')
print(data)
f_name = data['file_name'].tolist()
msgyn = data['misogynous'].tolist()
txt_trnscrpt = data['Text Transcription'].tolist()
print(f_name)
print(msgyn)
print(txt_trnscrpt)

   file_name  ...                                 Text Transcription
0     28.jpg  ...  not now, dad. We should burn Jon Snow. stop it...
1     30.jpg  ...  there may have been a mixcommunication with th...
2     33.jpg  ...                      i shouldn't have sold my boat
3     58.jpg  ...    Bitches be like, It was my fault i made him mad
4     89.jpg  ...  find a picture of 4 girls together on FB make ...
..       ...  ...                                                ...
95  1380.jpg  ...  Rape culture.  It's what every oxymoronic femi...
96  1381.jpg  ...  walking, running, telereporting, not going to ...
97  1384.jpg  ...  taking the time to get her pussy wet. always p...
98  1408.jpg  ...         what men play with vs what women play with
99  1440.jpg  ...  Girls boys school travel sport shopping going ...

[100 rows x 7 columns]
['28.jpg', '30.jpg', '33.jpg', '58.jpg', '89.jpg', '97.jpg', '104.jpg', '122.jpg', '126.jpg', '133.jpg', '142.jpg', '156.jpg', '157.jpg', '161.jpg',

In [None]:
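# plt.imread returns RGB arrays, while the default detectron2 configs expect
# BGR input (cfg.INPUT.FORMAT), so the channels are swapped on load.
image_list = [cv2.cvtColor(plt.imread(f'Images/{name}'), cv2.COLOR_RGB2BGR) for name in f_name]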


### Load Config and Model Weights

I am using the MaskRCNN ResNet-101 FPN checkpoint, but you can use any checkpoint of your preference. This checkpoint is pre-trained on the COCO dataset. You can check other checkpoints/configs on the [Model Zoo](https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md) page.

In [None]:
cfg_path = "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"

def load_config_and_model_weights(cfg_path):
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(cfg_path))

    # ROI HEADS SCORE THRESHOLD
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

    # Comment the next line if you're using 'cuda'
    cfg['MODEL']['DEVICE']='cpu'

    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(cfg_path)

    return cfg

cfg = load_config_and_model_weights(cfg_path)

### Load the Object Detection Model
The `build_model` method builds the model from the configuration; the checkpoint weights then have to be loaded with the `DetectionCheckpointer`.

In [None]:
def get_model(cfg):
    # build model
    model = build_model(cfg)

    # load weights
    checkpointer = DetectionCheckpointer(model)
    checkpointer.load(cfg.MODEL.WEIGHTS)

    # eval mode
    model.eval()
    return model

model = get_model(cfg)

model_final_a3ec72.pkl: 254MB [00:11, 22.1MB/s]                           


### Convert Image to Model Input
detectron2 resizes and normalizes the images according to the configuration parameters, and expects the input as an `ImageList`. `model.backbone.size_divisibility` controls the padding so that the image sizes are divisible by the backbone stride, which keeps the FPN lateral and output feature maps at matching dimensions.
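
The three preprocessing steps can be sketched with plain NumPy. This is my reading of the shortest-edge resize rule rather than the library code; `shortest_edge_size`, `pad_to_divisible` and the pixel statistics below are hypothetical names and values used only for illustration.

```python
import numpy as np

def shortest_edge_size(h, w, short_edge, max_size):
    """Output (h, w) when the shorter side is scaled to `short_edge`,
    capped so that the longer side does not exceed `max_size`."""
    scale = short_edge / min(h, w)
    new_h, new_w = h * scale, w * scale
    if max(new_h, new_w) > max_size:
        cap = max_size / max(new_h, new_w)
        new_h, new_w = new_h * cap, new_w * cap
    return int(new_h + 0.5), int(new_w + 0.5)

def pad_to_divisible(image_chw, divisor):
    """Zero-pad the bottom/right of a (C, H, W) array to multiples of `divisor`."""
    _, h, w = image_chw.shape
    pad_h = (-h) % divisor
    pad_w = (-w) % divisor
    return np.pad(image_chw, ((0, 0), (0, pad_h), (0, pad_w)))

new_h, new_w = shortest_edge_size(480, 640, short_edge=30, max_size=50)

# Hypothetical per-channel statistics, only to show the broadcasting.
pixel_mean = np.array([100.0, 110.0, 120.0]).reshape(3, 1, 1)
pixel_std = np.array([1.0, 1.0, 1.0]).reshape(3, 1, 1)

image = np.full((3, new_h, new_w), 128.0)      # stand-in for a resized image
normalized = (image - pixel_mean) / pixel_std  # same shape as `image`
padded = pad_to_divisible(normalized, divisor=32)
print((new_h, new_w), padded.shape)
```

With a divisor of 32, a 30 x 40 image is padded to 32 x 64, so every image in a batch ends up with sides that the backbone can halve repeatedly without remainder.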

In [None]:
def prepare_image_inputs(cfg, img_list):
    # Resizing the image according to the configuration
    transform_gen = T.ResizeShortestEdge(
                # [cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST], cfg.INPUT.MAX_SIZE_TEST
                [30, 30], 50
            )
    img_list = [transform_gen.get_transform(img).apply_image(img) for img in img_list]

    # Convert to C,H,W format
    convert_to_tensor = lambda x: torch.Tensor(x.astype("float32").transpose(2, 0, 1))

    batched_inputs = [{"image":convert_to_tensor(img), "height": img.shape[0], "width": img.shape[1]} for img in img_list]

    # Normalizing the image
    num_channels = len(cfg.MODEL.PIXEL_MEAN)
    pixel_mean = torch.Tensor(cfg.MODEL.PIXEL_MEAN).view(num_channels, 1, 1)
    pixel_std = torch.Tensor(cfg.MODEL.PIXEL_STD).view(num_channels, 1, 1)
    normalizer = lambda x: (x - pixel_mean) / pixel_std
    images = [normalizer(x["image"]) for x in batched_inputs]

    # Convert to ImageList
    images =  ImageList.from_tensors(images,model.backbone.size_divisibility)
    
    return images, batched_inputs

images, batched_inputs = prepare_image_inputs(cfg, image_list)
print(len(images))               # number of images in the batch
print(images.tensor.shape)       # (N, C, H, W) after padding
print(len(batched_inputs))       # one dict per image
print(batched_inputs[0].keys())  # 'image', 'height', 'width'

### Get ResNet+FPN features
The ResNet backbone, in combination with the FPN, generates five feature maps per image at different scales. For more details, refer to the FPN paper or this [article](https://medium.com/@hirotoschwert/digging-into-detectron-2-47b2e794fabd). For this tutorial, just know that `p2`, `p3`, `p4`, `p5`, `p6` are the features needed by the RPN (Region Proposal Network). The proposals, in combination with `p2`, `p3`, `p4`, `p5`, are then used by the ROI (Region of Interest) heads to generate box predictions.
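
As a rough guide to what "different scales" means here, the sketch below computes the spatial size of each level from the usual FPN strides (4 to 64). The stride table and the helper `fpn_feature_sizes` are assumptions for illustration; the authoritative values come from the backbone itself.

```python
# Strides of the FPN levels relative to the (padded) input image.
FPN_STRIDES = {"p2": 4, "p3": 8, "p4": 16, "p5": 32, "p6": 64}

def fpn_feature_sizes(height, width):
    """Spatial size of each FPN level for an input of height x width,
    assuming both are multiples of the largest stride."""
    return {name: (height // s, width // s) for name, s in FPN_STRIDES.items()}

sizes = fpn_feature_sizes(64, 64)
for name, size in sizes.items():
    print(name, size)
```

For a 64 x 64 padded input this gives a 16 x 16 map at `p2`, halving at each level down to 1 x 1 at `p6`.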

In [None]:
def get_features(model, images):
    features = model.backbone(images.tensor)
    return features

features = get_features(model, images)

In [None]:
# features is a dict keyed by level name, so it is indexed with 'p2', not 0
print(features.keys())
print(len(features))
print(features['p2'].shape)

dict_keys(['p2', 'p3', 'p4', 'p5', 'p6'])

### Get region proposals from RPN
The RPN takes the images and features and generates the proposals. With the configuration we chose, it returns up to 1000 proposals per image; small inputs can yield fewer.

In [None]:
def get_proposals(model, images, features):
    proposals, _ = model.proposal_generator(images, features)
    return proposals

proposals = get_proposals(model, images, features)
print(len(proposals))
print(len(proposals[0]))
print(len(proposals[0][0]))

100
329
1


### Get Box Features for the proposals

The proposals and features are then used by the ROI heads to get the predictions. In this case, the partial execution of layers becomes significant. We want the `box_features` to be the `fc2` outputs of the regions. Hence, I use only the layers that are needed until that step. 
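
One detail worth keeping in mind: the pooler concatenates the regions of all images along the first dimension, and each image may contribute a different number of proposals. The NumPy sketch below, with made-up counts and a tiny feature size, shows how the flat output can be split back into one block per image.

```python
import numpy as np

# Made-up numbers: three images with a different number of proposals each.
proposals_per_image = [3, 5, 2]
feature_dim = 4  # 1024 for the box head used in this notebook

# Stand-in for the fc2 output: one row per region, all images concatenated.
flat_features = np.arange(sum(proposals_per_image) * feature_dim, dtype=float)
flat_features = flat_features.reshape(-1, feature_dim)

# Split points are the cumulative counts, excluding the final total.
split_points = np.cumsum(proposals_per_image)[:-1]
per_image = np.split(flat_features, split_points)

for i, feats in enumerate(per_image):
    print(i, feats.shape)
```

A fixed-size `reshape` only works when every image has the same number of proposals, which is why splitting by the actual counts is the safer choice.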

In [None]:
print(len(image_list))

100


In [None]:
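# Scratch check of the full box head; the pooler expects a list of feature maps
features_in = [features[f] for f in ['p2', 'p3', 'p4', 'p5']]
box_features = model.roi_heads.box_pooler(features_in, [x.proposal_boxes for x in proposals])
box_features = model.roi_heads.box_head(box_features)
predictions = model.roi_heads.box_predictor(box_features)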

In [None]:
# The RPN can return a different number of proposals for each image, so the
# box features are split per image instead of reshaped to a fixed size.
def get_box_features(model, features, proposals):
    features_list = [features[f] for f in ['p2', 'p3', 'p4', 'p5']]
    print(len(features_list))
    print(len(features_list[0]))
    print(len(features_list[0][0]))
    print(len(features_list[0][0][0]))
    print(len(features_list[0][0][0][0]))
    box_features = model.roi_heads.box_pooler(features_list, [x.proposal_boxes for x in proposals])
    print(box_features.size())
    box_features = model.roi_heads.box_head.flatten(box_features)
    print(box_features.size())
    box_features = model.roi_heads.box_head.fc1(box_features)
    print(box_features.size())
    box_features = model.roi_heads.box_head.fc_relu1(box_features)
    print(box_features.size())
    # fc2 is applied exactly once; its output is the per-region embedding
    box_features = model.roi_heads.box_head.fc2(box_features)
    print(box_features.size())

    # One (num_proposals_i, 1024) tensor per image
    box_features = box_features.split([len(p) for p in proposals])
    return box_features, features_list

box_features, features_list = get_box_features(model, features, proposals)

4
100
256
16
16
torch.Size([22603, 256, 7, 7])
torch.Size([22603, 12544])
torch.Size([22603, 1024])
torch.Size([22603, 1024])
torch.Size([22603, 1024])

### Get prediction logits and boxes
The ROI heads return the class logits and the box regression deltas. These are used in the next step to get the boxes and scores via `FastRCNNOutputLayers`.


In [None]:
def get_prediction_logits(model, features_list, proposals):
    cls_features = model.roi_heads.box_pooler(features_list, [x.proposal_boxes for x in proposals])
    cls_features = model.roi_heads.box_head(cls_features)
    pred_class_logits, pred_proposal_deltas = model.roi_heads.box_predictor(cls_features)
    return pred_class_logits, pred_proposal_deltas

pred_class_logits, pred_proposal_deltas = get_prediction_logits(model, features_list, proposals)

### Get FastRCNN scores and boxes

This results in the softmax scores and the boxes.
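
Conceptually, the two calls amount to a softmax over the class logits and a decoding of the regression deltas against the proposal boxes. The NumPy sketch below illustrates both; the weights `(10, 10, 5, 5)` are assumed to be the usual `BBOX_REG_WEIGHTS`, and the clamping of large `dw`/`dh` values that the library applies is left out.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def apply_deltas(deltas, boxes, weights=(10.0, 10.0, 5.0, 5.0)):
    """Decode (dx, dy, dw, dh) deltas against (x1, y1, x2, y2) boxes."""
    wx, wy, ww, wh = weights
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    # Shift the centre and rescale the size of each proposal.
    new_ctr_x = deltas[:, 0] / wx * widths + ctr_x
    new_ctr_y = deltas[:, 1] / wy * heights + ctr_y
    new_w = np.exp(deltas[:, 2] / ww) * widths
    new_h = np.exp(deltas[:, 3] / wh) * heights

    return np.stack([
        new_ctr_x - 0.5 * new_w, new_ctr_y - 0.5 * new_h,
        new_ctr_x + 0.5 * new_w, new_ctr_y + 0.5 * new_h,
    ], axis=1)

toy_boxes = np.array([[0.0, 0.0, 10.0, 20.0], [0.0, 0.0, 10.0, 20.0]])
toy_deltas = np.array([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]])
decoded = apply_deltas(toy_deltas, toy_boxes)
probs = softmax(np.array([[0.0, 0.0], [2.0, 0.0]]))
print(decoded)
print(probs)
```

Zero deltas leave a proposal unchanged, and a `dx` equal to the weight shifts it by one full box width.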

In [None]:
def get_box_scores(cfg, pred_class_logits, pred_proposal_deltas, proposals):
    predictions = (pred_class_logits, pred_proposal_deltas)

    # Only predict_probs/predict_boxes are called here. They rely on the
    # softmax and the box2box transform, not on this layer's own weights, so
    # the input size and num_classes below are just placeholders.
    outputs = FastRCNNOutputLayers(
        ShapeSpec(channels=8),
        box2box_transform=Box2BoxTransform(weights=cfg.MODEL.ROI_BOX_HEAD.BBOX_REG_WEIGHTS),
        num_classes=1,
    )

    scores = outputs.predict_probs(predictions, proposals)
    boxes = outputs.predict_boxes(predictions, proposals)
    return boxes, scores

boxes, scores = get_box_scores(cfg, pred_class_logits, pred_proposal_deltas, proposals)

### Rescale the boxes to original image size
We want to rescale the boxes to original size as this is done in the detectron2 library. This is done for sanity and to keep it similar to the visualbert repository.
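
A small NumPy sketch of the same idea, with made-up sizes and boxes. The helper `rescale_boxes` is hypothetical, and clipping to the original image size is a choice made for this sketch; it is not a copy of the `Boxes` methods used in the cell.

```python
import numpy as np

def rescale_boxes(boxes, model_size, original_size):
    """Map (x1, y1, x2, y2) boxes from the model's input size back to the
    original image size, then clip them to the image. Sizes are (height, width)."""
    scale_x = original_size[1] / model_size[1]
    scale_y = original_size[0] / model_size[0]
    out = boxes * np.array([scale_x, scale_y, scale_x, scale_y])
    out[:, [0, 2]] = out[:, [0, 2]].clip(0, original_size[1])
    out[:, [1, 3]] = out[:, [1, 3]].clip(0, original_size[0])
    return out

toy_boxes = np.array([[3.0, 6.0, 20.0, 15.0], [35.0, 25.0, 45.0, 40.0]])
rescaled = rescale_boxes(toy_boxes, model_size=(30, 40), original_size=(480, 640))
print(rescaled)
```

The second box extends past the resized image, so after scaling it is clipped to the 640 x 480 bounds.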

In [None]:
def get_output_boxes(boxes, batched_inputs, image_size):
    proposal_boxes = boxes.reshape(-1, 4).clone()
    scale_x, scale_y = (batched_inputs["width"] / image_size[1], batched_inputs["height"] / image_size[0])
    output_boxes = Boxes(proposal_boxes)

    output_boxes.scale(scale_x, scale_y)
    output_boxes.clip(image_size)

    return output_boxes

output_boxes = [get_output_boxes(boxes[i], batched_inputs[i], proposals[i].image_size) for i in range(len(proposals))]

### Select the Boxes using NMS
We need two thresholds: an NMS threshold for the NMS box selection, and a score threshold for the score-based selection.

First, NMS is performed for each class, and the maximum score of each proposal box over all classes is updated.

Then the class score threshold is used to select the boxes from those.
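
The bookkeeping can be sketched in NumPy as follows. The probabilities and the per-class NMS results are made up, `fg_probs` is assumed to contain foreground classes only, and `max_confidence` is a hypothetical helper.

```python
import numpy as np

def max_confidence(fg_probs, keep_per_class):
    """fg_probs: (num_boxes, num_classes) foreground probabilities.
    keep_per_class[c]: indices of the boxes that survived NMS for class c."""
    max_conf = np.zeros(fg_probs.shape[0])
    for cls_ind, keep in enumerate(keep_per_class):
        cls_scores = fg_probs[:, cls_ind]
        # Only boxes that survived NMS for this class can raise their score.
        max_conf[keep] = np.maximum(max_conf[keep], cls_scores[keep])
    return max_conf

# Made-up probabilities for 3 boxes and 2 foreground classes.
fg_probs = np.array([
    [0.7, 0.2],
    [0.6, 0.2],
    [0.1, 0.6],
])
# Made-up NMS results: box 1 was suppressed for class 0, box 0 for class 1.
keep_per_class = [np.array([0, 2]), np.array([1, 2])]

max_conf_toy = max_confidence(fg_probs, keep_per_class)
selected = np.where(max_conf_toy >= 0.5)[0]
print(max_conf_toy, selected)
```

Box 1 scores 0.6 for class 0 but was suppressed there, so its best surviving score is only 0.2 and it is filtered out by the threshold.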

In [None]:
for i in range(10):
  # 80 COCO classes, so there are 80 boxes per proposal
  print(len(output_boxes[i]) // 80)



In [None]:
def select_boxes(cfg, output_boxes, scores):
    test_score_thresh = cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST
    test_nms_thresh = cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST
    cls_prob = scores.detach()
    # detectron2 puts the background class in the last column of the scores
    num_classes = cls_prob.shape[1] - 1
    cls_boxes = output_boxes.tensor.detach().reshape(-1, num_classes, 4)
    max_conf = torch.zeros((cls_boxes.shape[0]))
    for cls_ind in range(num_classes):
        cls_scores = cls_prob[:, cls_ind]
        det_boxes = cls_boxes[:, cls_ind, :]
        keep = nms(det_boxes, cls_scores, test_nms_thresh)
        max_conf[keep] = torch.where(cls_scores[keep] > max_conf[keep], cls_scores[keep], max_conf[keep])
    keep_boxes = torch.where(max_conf >= test_score_thresh)[0]
    return keep_boxes, max_conf

In [None]:
temp = [select_boxes(cfg, output_boxes[i], scores[i]) for i in range(len(scores))]
keep_boxes, max_conf = [],[]
for keep_box, mx_conf in temp:
    keep_boxes.append(keep_box)
    max_conf.append(mx_conf)

### Limit the total number of boxes
In order to get the box features for the best few proposals and limit the sequence length, we set minimum and maximum boxes and pick those box features.

In [None]:
MIN_BOXES=3
MAX_BOXES=4
def filter_boxes(keep_boxes, max_conf, min_boxes, max_boxes):
    # Box indices sorted by decreasing confidence
    order = np.argsort(max_conf.numpy())[::-1]
    if len(keep_boxes) < min_boxes:
        keep_boxes = order[:min_boxes]
    elif len(keep_boxes) > max_boxes:
        keep_boxes = order[:max_boxes]
    else:
        # Return a NumPy array in every branch so the later .copy() calls work
        keep_boxes = keep_boxes.numpy()
    return keep_boxes

keep_boxes = [filter_boxes(keep_box, mx_conf, MIN_BOXES, MAX_BOXES) for keep_box, mx_conf in zip(keep_boxes, max_conf)]

### Get the visual embeddings :) 
Finally, the embeddings are picked from the `box_features` tensor using the `keep_boxes` indices.

In [None]:
zip(box_features, keep_boxes)

In [None]:
print(len(box_features))
print(len(keep_boxes))

In [None]:
print("box_features: ")
print(len(box_features)) 
print(len(box_features[0])) 
print(len(box_features[0][0])) 
print()
print("keep boxes:")
print(len(keep_boxes))
print(len(keep_boxes[0])) 



In [None]:
# box_features[0][keep_boxes[0].copy()]
kb = keep_boxes[0].copy()
# box_features[0][keep_boxes[0]]

In [None]:
box_features[0][kb]

In [None]:
def get_visual_embeds(box_features, keep_boxes):
    return box_features[keep_boxes.copy()]
get_visual_embeds(box_features[0], keep_boxes[0])

In [None]:
def get_visual_embeds(box_features, keep_boxes):
    return box_features[keep_boxes.copy()]

# visual_embeds = [get_visual_embeds(box_feature, keep_box) for box_feature, keep_box in zip(box_features, keep_boxes)]
visual_embeds = []
for box_feature, keep_box in zip(box_features, keep_boxes):
  visual_embeds.append(get_visual_embeds(box_feature, keep_box))

In [None]:

print(len(visual_embeds[0]))

In [None]:
print(len(visual_embeds))

## Tips for putting it all together

Note that these methods can be combined into different parts to make it more efficient: 
1. Get the model and store it in a variable.
2. Transform and create batched inputs separately.
3. Generate visual embeddings from the detectron on the batched inputs and models.

Ideally, you want to build a class around this for ease of use - The class should contain all the methods, the model and the configuration details. And it should process a batch of images and convert to embeddings.
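
One possible shape for such a class is sketched below. `VisualEmbedder`, `prepare_fn` and `embed_fn` are hypothetical names; in practice the two callables would wrap the preprocessing and the detectron2 steps from this notebook. Toy stand-ins are used here so that the skeleton runs on its own.

```python
class VisualEmbedder:
    """Hypothetical wrapper: holds the model and config once and exposes a
    single call that turns a batch of images into visual embeddings."""

    def __init__(self, cfg, model, prepare_fn, embed_fn):
        self.cfg = cfg
        self.model = model
        self.prepare_fn = prepare_fn  # (cfg, images) -> model inputs
        self.embed_fn = embed_fn      # (model, inputs) -> list of embeddings

    def __call__(self, images, batch_size=8):
        embeddings = []
        # Process the images in fixed-size chunks to bound memory use.
        for start in range(0, len(images), batch_size):
            batch = images[start:start + batch_size]
            inputs = self.prepare_fn(self.cfg, batch)
            embeddings.extend(self.embed_fn(self.model, inputs))
        return embeddings

# Toy stand-ins so that the skeleton can be run without detectron2.
embedder = VisualEmbedder(
    cfg={"scale": 2},
    model=lambda x: x + 1,
    prepare_fn=lambda cfg, batch: [cfg["scale"] * img for img in batch],
    embed_fn=lambda model, inputs: [model(x) for x in inputs],
)
toy_result = embedder([1, 2, 3], batch_size=2)
print(toy_result)
```

Processing in chunks matters here because running 100 images through the backbone and ROI heads in one batch can use a lot of memory.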

## Using the embeddings with VisualBert

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer, VisualBertForPreTraining

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
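# question1/question2 are not defined in this notebook. As an assumption, the
# text transcriptions loaded from trial.csv (one per image) are used instead.
questions = txt_trnscrpt
tokens = tokenizer(questions, padding='max_length', max_length=50, truncation=True)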

In [None]:
input_ids = torch.tensor(tokens["input_ids"])
attention_mask = torch.tensor(tokens["attention_mask"])
token_type_ids = torch.tensor(tokens["token_type_ids"])

In [None]:
visual_embeds = torch.stack(visual_embeds)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
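
`torch.stack` needs every image to contribute the same number of boxes, which is not guaranteed when `MIN_BOXES` and `MAX_BOXES` differ. A possible workaround, sketched here with toy tensors, is to pad the embeddings to a common length and mark the padding in the attention mask. `pad_visual_embeds` is a hypothetical helper, not part of `transformers`.

```python
import torch

def pad_visual_embeds(embeds_list):
    """Pad a list of (num_boxes_i, dim) tensors to (batch, max_boxes, dim)
    and return a mask with 1 for real boxes and 0 for padding."""
    max_boxes = max(e.shape[0] for e in embeds_list)
    dim = embeds_list[0].shape[1]
    padded = torch.zeros(len(embeds_list), max_boxes, dim)
    mask = torch.zeros(len(embeds_list), max_boxes, dtype=torch.long)
    for i, e in enumerate(embeds_list):
        padded[i, : e.shape[0]] = e
        mask[i, : e.shape[0]] = 1
    return padded, mask

# Toy embeddings: one image with 3 boxes, one with 4, feature size 8.
toy_embeds = [torch.ones(3, 8), torch.ones(4, 8)]
padded_embeds, padded_mask = pad_visual_embeds(toy_embeds)
print(padded_embeds.shape, padded_mask.tolist())
```

The returned mask would take the place of the all-ones `visual_attention_mask`, so that the padded positions are ignored by attention.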

In [None]:
model = VisualBertForPreTraining.from_pretrained('uclanlp/visualbert-nlvr2-coco-pre') # this checkpoint has 1024 dimensional visual embeddings projection

In [None]:
outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, visual_embeds=visual_embeds, visual_attention_mask=visual_attention_mask, visual_token_type_ids=visual_token_type_ids)

In [None]:
outputs

## References

1. [Detectron2 Colab Tutorial](https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=h9tECBQCvMv3)
2. [Detectron Repository](https://github.com/facebookresearch/Detectron)
3. [Detectron2 Repository](https://github.com/facebookresearch/detectron2)
4. [Detectron2 Docs](https://detectron2.readthedocs.io/en/latest/index.html)
5. [VisualBert Repository](https://github.com/uclanlp/visualbert)
6. [Medium Article on Detectron2 by Hiroto Honda](https://medium.com/@hirotoschwert/digging-into-detectron-2-47b2e794fabd)