# TSV vs Extractor
This notebook compares the features stored in the TSV file, which was produced with the Caffe model, against the features produced by the extractor.   
We compare three images in the download directory: filled_with_0.png, filled_with_255.png, and 000542.jpeg.   
The TSV file was created in advance by extracting features from these images.

In [1]:
import torch
import numpy as np
import PIL.Image

from eval_vl_glue import VoltaImageFeature, load_tsv
from eval_vl_glue.extractor import BUTDDetector

In [2]:
image_paths = [
    '../download/filled_with_0.png', 
    '../download/filled_with_255.png',
    '../download/000542.jpeg', 
]

## Load the TSV file

In [3]:
tsv_path = '../conversion/test_obj36.tsv'
feature_dict = load_tsv(tsv_path)
# Change raw features into inputs for transformers_volta
tsv_features = {k: VoltaImageFeature.from_dict(v) for k, v in feature_dict.items()}

## Configure extractor

In [4]:
model_path = '../download/resnet101_faster_rcnn_final_iter_320000.pt'
extractor = BUTDDetector()
# map_location='cpu' lets the checkpoint load on machines without a GPU
extractor.load_state_dict(torch.load(model_path, map_location='cpu'))
extractor = extractor.eval()

In [5]:
# Use the GPU when one is available; otherwise stay on the CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
extractor = extractor.to(device)

## Feature matching function
- Each of the two tensors holds 36 region features (excluding the global image feature).     
  We pair these features so that the sum of pair distances stays small, using two measures: the mean absolute difference and the cosine similarity.
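
As a rough illustration of why both measures are reported, the sketch below computes them on made-up toy vectors in plain Python (the notebook itself uses torch tensors; the helper names `mean_abs_diff` and `cosine_similarity` here are hypothetical). Scaling a vector leaves the cosine similarity at 1 while the mean absolute difference grows, so the two measures capture different kinds of mismatch.

```python
import math

def mean_abs_diff(x, y):
    """Mean of element-wise absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Made-up toy vectors, not real region features
u = [1.0, 0.0, 2.0, 0.0]
v = [2.0 * a for a in u]      # same direction as u, doubled magnitude
w = [0.0, 3.0, 0.0, 1.0]      # orthogonal to u

scaled = (mean_abs_diff(u, v), cosine_similarity(u, v))
orthogonal = (mean_abs_diff(u, w), cosine_similarity(u, w))
print(scaled)      # nonzero difference, cosine similarity close to 1
print(orthogonal)  # larger difference, cosine similarity of 0
```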

In [6]:
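# Match features greedily: repeatedly take the closest remaining pair
def greedy_match(v1, v2, func=None):
    # Default distance: mean absolute difference over the feature dimension
    func = func or (lambda x, y: (x - y).abs().mean(dim=-1))
    # ds[r, c] is the distance between v2[r] and v1[c]
    ds = func(v1[None], v2[:, None])
    n_rows, n_cols = ds.shape
    n = min(n_rows, n_cols)
    # Sentinel larger than any distance, used to mask matched rows and columns
    not_available = ds.max().item() + 100

    pairs = []
    dists = torch.zeros((n,), dtype=torch.float32)
    for k in range(n):
        dists[k] = ds.min().item()
        # argmin() indexes the flattened matrix; recover (row, column)
        i = ds.argmin().item()
        r, c = divmod(i, n_cols)
        pairs.append((r, c))
        # Exclude the matched row and column from later rounds
        ds[r] = not_available
        ds[:, c] = not_available
    # pairs holds (index into v2, index into v1); dists is in ascending order
    return pairs, dists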
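
Greedy matching keeps the pairing cheap, but it does not guarantee the minimum total distance. The sketch below shows this on a hypothetical 2x2 distance matrix in plain Python, comparing a greedy pass with a brute-force search over all one-to-one pairings. The function names `greedy_total` and `optimal_total` are illustrative and are not part of `eval_vl_glue`.

```python
from itertools import permutations

def greedy_total(dist):
    """Total cost when repeatedly taking the smallest remaining entry."""
    n = len(dist)
    free_rows, free_cols = set(range(n)), set(range(n))
    total = 0.0
    while free_rows:
        r, c = min(((r, c) for r in free_rows for c in free_cols),
                   key=lambda rc: dist[rc[0]][rc[1]])
        total += dist[r][c]
        free_rows.remove(r)
        free_cols.remove(c)
    return total

def optimal_total(dist):
    """Minimum total cost over all one-to-one pairings (brute force, small n only)."""
    n = len(dist)
    return min(sum(dist[r][c] for r, c in enumerate(perm))
               for perm in permutations(range(n)))

# Hypothetical distance matrix chosen so that the greedy choice is suboptimal
dist = [[1.0, 2.0],
        [2.0, 100.0]]
print(greedy_total(dist))   # takes 1.0 first, then is forced into 100.0
print(optimal_total(dist))  # the off-diagonal pairing costs 2.0 + 2.0
```

Brute force is only feasible for tiny matrices. For 36 features per image, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the practical way to obtain the optimal pairing, should the greedy result need a cross-check.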

In [7]:
def compare(v1, v2):
    # Element-wise absolute values
    pairs, dists = greedy_match(v1, v2, func=lambda x, y: (x - y).abs().mean(axis=-1))
    print('absolute mean:', torch.cat([v1, v2], axis=0).abs().mean().item())
    print('averaged absolute defference:', dists.mean().item())
    print('defference', dists)
    
    # Vector-wise cosine similarity
    pairs, dists = greedy_match(v1, v2, func=lambda x, y: -torch.nn.functional.cosine_similarity(x, y, dim=-1))
    dists *= -1
    print('averaged cossim:', dists.mean().item())
    print('cossim', dists)

## Comparison

In [8]:
for image_path in image_paths:
    tsv_feature = tsv_features[image_path.split('/')[-1].split('.')[0]]
    
    # Extract features from an image using the extractor
    image = PIL.Image.open(image_path)
    regions = extractor.detect(image)
    ext_feature = VoltaImageFeature.from_regions(regions)
    
    print(image_path)
    # We do not consider global image features.
    compare(tsv_feature.features[1:], ext_feature.features[1:])
    print()
../download/filled_with_0.png
absolute mean: 0.504785418510437
averaged absolute defference: 0.11803480982780457
defference tensor([0.0047, 0.0065, 0.0149, 0.0152, 0.0181, 0.0294, 0.0369, 0.0380, 0.0384,
        0.0423, 0.0467, 0.0493, 0.0594, 0.0605, 0.0637, 0.0645, 0.0659, 0.0754,
        0.0764, 0.0852, 0.0958, 0.0964, 0.0995, 0.1179, 0.1259, 0.1299, 0.1310,
        0.1329, 0.1444, 0.1654, 0.2164, 0.2514, 0.2981, 0.3437, 0.4064, 0.6026])
averaged cossim: 0.9547709226608276
cossim tensor([1.0000, 0.9999, 0.9996, 0.9996, 0.9996, 0.9986, 0.9982, 0.9979, 0.9972,
        0.9967, 0.9964, 0.9956, 0.9953, 0.9953, 0.9928, 0.9927, 0.9922, 0.9916,
        0.9909, 0.9888, 0.9872, 0.9825, 0.9807, 0.9783, 0.9765, 0.9729, 0.9721,
        0.9702, 0.9537, 0.9377, 0.8852, 0.8740, 0.8709, 0.8584, 0.7484, 0.5042])

../download/filled_with_255.png
absolute mean: 0.4255666434764862
averaged absolute defference: 0.12673912942409515
defference tensor([2.8457e-04, 4.1648e-03, 7.4035e-03, 7.9637e-03, 9.9676e

- The average element-wise absolute difference ranges from 0.1 to 0.16 across the images, which is roughly 20-30% of the absolute mean of the elements.
- The average vector-wise cosine similarity is approximately 0.95.
- Although the two models behave similarly to some extent, they are not identical.
- Debugging shows that raw outputs of the neural networks, such as rpn_cls_prob_reshape in the forward function, differ slightly, and this seems to affect boundary candidates in the remove_small_boxes function.
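
The relative size of the difference can be checked directly from the printed output. The sketch below uses only the two images whose absolute mean and averaged absolute difference are both visible above; the dictionary layout is just for illustration.

```python
# Values copied from the printed output above
# (absolute mean, averaged absolute difference)
results = {
    'filled_with_0.png': (0.504785418510437, 0.11803480982780457),
    'filled_with_255.png': (0.4255666434764862, 0.12673912942409515),
}

ratios = {}
for name, (abs_mean, avg_diff) in results.items():
    # Averaged difference expressed as a fraction of the absolute mean
    ratios[name] = avg_diff / abs_mean
    print(f'{name}: {ratios[name]:.1%} of the absolute mean')
```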