# **QPIC HOI TUTORIAL**

Tutorial for the paper **`QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information`** (CVPR 2021), by Masato Tamura, Hiroki Ohashi, and Tomoaki Yoshinaga.

  - Tutorial Author: [Esteve Valls Mascaro](https://github.com/Evm7/Tutorials-Computer-Vision)
  - Repository used: https://github.com/hitachi-rd-cv/qpic/tree/main


The original paper does not include a section on running HOI inference.
This tutorial walks through the steps needed to run the released model on your own image.

Keep in mind:

*   Enable the GPU runtime; the code below assumes a CUDA device.
*   Inference is set up for a single image at a time.
*   Boxes drawn in the same colour belong to one human-object (H-O) interaction; the verb is written above the human box.



## Installation and environment setup

In [1]:
! git clone https://github.com/hitachi-rd-cv/qpic.git

Cloning into 'qpic'...
remote: Enumerating objects: 68, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 68 (delta 22), reused 62 (delta 16), pack-reused 0
Unpacking objects: 100% (68/68), done.


In [2]:
! pip install -q numpy

In [3]:
! pip install -qr /content/qpic/requirements.txt

  Building wheel for pycocotools (setup.py) ... done
  Building wheel for panopticapi (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.10.0 requires torch==1.9.0, but you have torch 1.5.1 which is incompatible.


In [4]:
!wget https://github.com/hitachi-rd-cv/qpic/releases/download/v1.0/qpic_resnet50_vcoco.pth

--2021-09-30 12:50:16--  https://github.com/hitachi-rd-cv/qpic/releases/download/v1.0/qpic_resnet50_vcoco.pth
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/345977575/50bd7a80-8651-11eb-976f-b9a0b3fabf29?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210930%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210930T125016Z&X-Amz-Expires=300&X-Amz-Signature=9f40ea16e4f95ccd54fbd784fe8d59d2694a6f7d9f262a2962dbeba6786dc80b&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=345977575&response-content-disposition=attachment%3B%20filename%3Dqpic_resnet50_vcoco.pth&response-content-type=application%2Foctet-stream [following]
--2021-09-30 12:50:16--  https://github-releases.githubusercontent.com/345977575/50bd7a80-8651-11eb-976f-b9a0b3fabf29?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA

## Development of the project

In [7]:
import sys
sys.path.append('/content/qpic')

In [8]:
import argparse
from pathlib import Path
import numpy as np
import copy
import pickle

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

from datasets.vcoco import build as build_dataset
from models.backbone import build_backbone
from models.transformer import build_transformer
import util.misc as utils
from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou
from util.misc import (NestedTensor, nested_tensor_from_tensor_list,
                       accuracy, get_world_size, interpolate,
                       is_dist_avail_and_initialized)
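`box_cxcywh_to_xyxy`, imported above, converts the model's normalized `(cx, cy, w, h)` boxes to corner coordinates; the post-processing step later multiplies them by the image width and height. A minimal pure-Python sketch of that arithmetic for a single box follows. The name `cxcywh_to_pixels` and the sample numbers are illustrative and not part of the repository.

```python
def cxcywh_to_pixels(box, img_w, img_h):
    """Convert one normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    # Centre/size -> top-left and bottom-right corners, still in [0, 1].
    x0, y0 = cx - 0.5 * w, cy - 0.5 * h
    x1, y1 = cx + 0.5 * w, cy + 0.5 * h
    # Scale x coordinates by the image width and y coordinates by the height.
    return (x0 * img_w, y0 * img_h, x1 * img_w, y1 * img_h)


# A box centred in a 640x480 image that covers half of each side.
print(cxcywh_to_pixels((0.5, 0.5, 0.5, 0.5), img_w=640, img_h=480))
# (160.0, 120.0, 480.0, 360.0)
```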

### Utils


In [28]:
valid_obj_ids = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13,
                     14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
                     24, 25, 27, 28, 31, 32, 33, 34, 35, 36,
                     37, 38, 39, 40, 41, 42, 43, 44, 46, 47,
                     48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
                     58, 59, 60, 61, 62, 63, 64, 65, 67, 70,
                     72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
                     82, 84, 85, 86, 87, 88, 89, 90)

verb_classes = ['hold_obj', 'stand', 'sit_instr', 'ride_instr', 'walk', 'look_obj', 'hit_instr', 'hit_obj',
                'eat_obj', 'eat_instr', 'jump_instr', 'lay_instr', 'talk_on_phone_instr', 'carry_obj',
                'throw_obj', 'catch_obj', 'cut_instr', 'cut_obj', 'run', 'work_on_computer_instr',
                'ski_instr', 'surf_instr', 'skateboard_instr', 'smile', 'drink_instr', 'kick_obj',
                'point_instr', 'read_obj', 'snowboard_instr']

class Args:
    """Minimal stand-in for the command-line arguments the repository's builders expect."""

    def __init__(self):
        self.backbone = 'resnet50'
        self.batch_size = 2
        self.dec_layers = 6
        self.device = 'cuda'
        self.dilation = False
        self.dim_feedforward = 2048
        self.dropout = 0.1
        self.enc_layers = 6
        self.hidden_dim = 256
        self.hoi_path = None
        self.lr_backbone = 0
        self.masks = False
        self.missing_category_id = 80
        self.nheads = 8
        self.num_queries = 100
        self.num_workers = 2
        self.param_path = '/content/qpic_resnet50_vcoco.pth'
        self.position_embedding = 'sine'
        self.pre_norm = False
        self.save_path = None
        self.subject_category_id = 0

args = Args()
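The two tables above can be read as follows. The model predicts a contiguous object class index, and `valid_obj_ids` lists the (non-contiguous) COCO category ids in that order; the verb names follow the V-COCO convention in which `_obj` and `_instr` mark the role of the paired object, and verbs without a suffix (such as `walk`) take no object. The sketch below assumes that a predicted label indexes directly into `valid_obj_ids` and that `missing_category_id` (80) means "no object". Both helper names are hypothetical.

```python
def coco_category_id(label, valid_obj_ids, missing_category_id=80):
    """Map a contiguous class index to a COCO category id (None if no object)."""
    if label == missing_category_id or not 0 <= label < len(valid_obj_ids):
        return None
    return valid_obj_ids[label]


def split_verb(verb_class):
    """Split 'ride_instr' into ('ride', 'instr'); role-less verbs give (verb, None)."""
    for role in ('obj', 'instr'):
        suffix = '_' + role
        if verb_class.endswith(suffix):
            return verb_class[:-len(suffix)], role
    return verb_class, None


# Shortened copy of the tutorial's id table, for illustration only.
sample_obj_ids = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13)
print(coco_category_id(11, sample_obj_ids))  # index 11 -> COCO id 13
print(split_verb('ride_instr'))
print(split_verb('walk'))
```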


In [25]:
class DETRHOI(nn.Module):

    def __init__(self, backbone, transformer, num_obj_classes, num_verb_classes, num_queries):
        super().__init__()
        self.num_queries = num_queries
        self.transformer = transformer
        hidden_dim = transformer.d_model
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        self.obj_class_embed = nn.Linear(hidden_dim, num_obj_classes + 1)
        self.verb_class_embed = nn.Linear(hidden_dim, num_verb_classes)
        self.sub_bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        self.obj_bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
        self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1)
        self.backbone = backbone

    def forward(self, samples: NestedTensor):
        if not isinstance(samples, NestedTensor):
            samples = nested_tensor_from_tensor_list(samples)
        features, pos = self.backbone(samples)

        src, mask = features[-1].decompose()
        assert mask is not None
        hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

        outputs_obj_class = self.obj_class_embed(hs)
        outputs_verb_class = self.verb_class_embed(hs)
        outputs_sub_coord = self.sub_bbox_embed(hs).sigmoid()
        outputs_obj_coord = self.obj_bbox_embed(hs).sigmoid()
        out = {'pred_obj_logits': outputs_obj_class[-1], 'pred_verb_logits': outputs_verb_class[-1],
               'pred_sub_boxes': outputs_sub_coord[-1], 'pred_obj_boxes': outputs_obj_coord[-1]}
        return out

class MLP(nn.Module):
    """ Very simple multi-layer perceptron (also called FFN)"""

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers
        h = [hidden_dim] * (num_layers - 1)
        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        return x
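The `zip` expression in `MLP.__init__` is compact, so it can help to see which layer shapes it produces. This small sketch reproduces only that list construction, without PyTorch. `mlp_layer_shapes` is an illustrative name, and the arguments mirror the box heads built in `DETRHOI` with `hidden_dim = 256`.

```python
def mlp_layer_shapes(input_dim, hidden_dim, output_dim, num_layers):
    """Return the (in_features, out_features) pair of every linear layer."""
    h = [hidden_dim] * (num_layers - 1)
    # Pair each layer's input size with its output size.
    return list(zip([input_dim] + h, h + [output_dim]))


# Same arguments as MLP(hidden_dim, hidden_dim, 4, 3) with hidden_dim = 256.
print(mlp_layer_shapes(256, 256, 4, 3))
# [(256, 256), (256, 256), (256, 4)]
```

The last layer maps to 4 values, one per box coordinate, and is the only one not followed by a ReLU.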

In [26]:
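def buildModel(param_path ='/content/qpic_resnet50_vcoco.pth',  eval=True):
  # The tutorial assumes a CUDA-capable runtime.
  device = torch.device('cuda:0')
  backbone = build_backbone(args)
  transformer = build_transformer(args)
  # 80 valid COCO ids + 1 = 81 object classes; DETRHOI adds one more "no object" logit.
  model = DETRHOI(backbone, transformer, len(valid_obj_ids)+1, len(verb_classes),
                  args.num_queries)

  # Load the weights on the CPU first, then move the whole model to the GPU once.
  checkpoint = torch.load(param_path, map_location='cpu')
  model.load_state_dict(checkpoint['model'])
  model = model.to(device)

  if eval:
    # Switch to inference mode (disables dropout).
    model.eval()
  return model, device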

In [12]:
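from PIL import Image

import torchvision.transforms as transforms
# Convert the PIL image to a float tensor in [0, 1] with shape (C, H, W).
transform = transforms.Compose([
    transforms.ToTensor()
])

def loadImage(input_path = '/content/kitchen.png'):
  # Relies on the global `device` returned by buildModel(), so build the model first.
  image = Image.open(input_path).convert("RGB")
  return image, transform(image).to(device)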
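`loadImage` applies only `ToTensor()`. DETR-style pipelines, on which QPIC builds, usually also resize the shorter image side to 800 px (capping the longer side at 1333 px) and normalize with the ImageNet mean and standard deviation. Whether the released checkpoint needs this to reach its best accuracy is an assumption here, not something this tutorial verifies. The pure-Python sketch below only illustrates the arithmetic; both function names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_shape(width, height, short_side=800, max_side=1333):
    """Scale so the shorter side is `short_side`, unless the longer side would exceed `max_side`."""
    scale = short_side / min(width, height)
    if max(width, height) * scale > max_side:
        scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


print(resized_shape(640, 480))
print(normalize_pixel((1.0, 1.0, 1.0)))
```

With torchvision, the normalization step would correspond to adding `transforms.Normalize(mean, std)` after `transforms.ToTensor()` in the `transform` pipeline.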

In [13]:
import numpy as np
def processOutputs(outputs, image, threshold = 0.2):
  out_obj_logits, out_verb_logits, out_sub_boxes, out_obj_boxes = outputs['pred_obj_logits'], \
                                                                outputs['pred_verb_logits'], \
                                                                outputs['pred_sub_boxes'], \
                                                                outputs['pred_obj_boxes']

  obj_prob = F.softmax(out_obj_logits, -1)
  obj_scores, obj_labels = obj_prob[..., :-1].max(-1)
  
  ch, img_h, img_w = image.shape
  img_w = torch.tensor(img_w)
  img_h = torch.tensor(img_h)

  verb_scores = out_verb_logits.sigmoid()
  scale_fct = torch.stack([img_w, img_h, img_w, img_h]).to(verb_scores.device)
  sub_boxes = box_cxcywh_to_xyxy(out_sub_boxes)
  sub_boxes = sub_boxes * scale_fct
  obj_boxes = box_cxcywh_to_xyxy(out_obj_boxes)
  obj_boxes = obj_boxes * scale_fct

  obj_scores = obj_scores.detach()
  obj_labels = obj_labels.detach()
  verb_scores = verb_scores.detach()
  sub_boxes = sub_boxes.detach()
  obj_boxes = obj_boxes.detach()

  results = []
  for os, ol, vs, sb, ob in zip(obj_scores, obj_labels, verb_scores, sub_boxes, obj_boxes):
      sl = torch.full_like(ol, 0)
      l = torch.cat((sl, ol))
      b = torch.cat((sb, ob))
      bboxes = [{'bbox': bbox, 'category_id': label} for bbox, label in zip(b.to('cpu').numpy(), l.to('cpu').numpy())]

      hoi_scores = vs * os.unsqueeze(1)

      # Use the tensor's own device instead of relying on a global `device`.
      verb_labels = torch.arange(hoi_scores.shape[1], device=hoi_scores.device).view(1, -1).expand(
          hoi_scores.shape[0], -1)
      object_labels = ol.view(-1, 1).expand(-1, hoi_scores.shape[1])


      ids = torch.arange(b.shape[0])

      hois = [{'subject_id': subject_id, 'object_id': object_id, 'category_id': category_id, 'score': score} for
              subject_id, object_id, category_id, score in zip(ids[:ids.shape[0] // 2].to('cpu').numpy(),
                                                                ids[ids.shape[0] // 2:].to('cpu').numpy(),
                                                                verb_labels.to('cpu').numpy(), hoi_scores.to('cpu').numpy())]

      results.append({
          'predictions': bboxes,
          'hoi_prediction': hois
      })

  results_filtered = []
  results = results[0]
  for h in results['hoi_prediction']:
    score = np.max(h['score'])
    if score > threshold:
      obj_id =  h['object_id']
      sub_id =  h['subject_id']
      index = np.argmax(h['score'])
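      # 'subject_id' stores the subject's prediction entry (bbox + category id),
      # which is what drawImage reads via r['subject_id']['bbox'].
      dict_ = { 'category_id': h['category_id'][index],
                'object_id': obj_id,
                'score': score,
                'object_bbox' : results['predictions'][obj_id],
                'subject_id' : results['predictions'][sub_id],
              }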
      results_filtered.append(dict_)
  return results_filtered 
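The scoring rule inside `processOutputs` can be seen in isolation on plain Python lists. For each query, the object score is the largest softmax probability excluding the trailing "no object" class, every verb score is the sigmoid of its logit multiplied by that object score, and the query is kept only if its best product exceeds the threshold. The helper names and sample logits below are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_query(obj_logits, verb_logits, threshold=0.2):
    """Return (object_label, verb_label, score), or None if below the threshold."""
    obj_prob = softmax(obj_logits)[:-1]  # drop the trailing "no object" class
    obj_label = max(range(len(obj_prob)), key=obj_prob.__getitem__)
    obj_score = obj_prob[obj_label]
    hoi_scores = [sigmoid(v) * obj_score for v in verb_logits]
    verb_label = max(range(len(hoi_scores)), key=hoi_scores.__getitem__)
    best = hoi_scores[verb_label]
    return (obj_label, verb_label, best) if best > threshold else None


# Confident object (class 0) with a confident first verb: kept.
print(score_query([5.0, 0.0, 0.0], [3.0, -3.0]))
# The "no object" class dominates, so the query is dropped.
print(score_query([0.0, 0.0, 5.0], [3.0, -3.0]))
```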

In [15]:
import cv2

def drawImage(image, results):
  image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
  # One random colour per detected interaction, shared by the human box, object box, and verb label.
  COLORS = np.random.uniform(0, 255, size=(len(results), 3))
  for color_id, r in enumerate(results):
    category_id_verb = r['category_id']
    object_bbox = r['object_bbox']['bbox']
    # processOutputs stores the subject's box entry under 'subject_id'.
    subject_bbox = r['subject_id']['bbox']

    color = tuple(int(c) for c in COLORS[color_id])
    cv2.rectangle(
        image,
        (int(subject_bbox[0]), int(subject_bbox[1])),
        (int(subject_bbox[2]), int(subject_bbox[3])),
        color, 2
    )

    cv2.rectangle(
        image,
        (int(object_bbox[0]), int(object_bbox[1])),
        (int(object_bbox[2]), int(object_bbox[3])),
        color, 2
    )

    cv2.putText(image, verb_classes[category_id_verb], (int(subject_bbox[0]), int(subject_bbox[1]-5)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2,
                lineType=cv2.LINE_AA)
  return image
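`drawImage` picks its colours with `np.random.uniform`, so two interactions can end up with similar colours and the palette changes on every run. One possible alternative, shown here only as a sketch, spaces the hues evenly with the standard-library `colorsys` module and returns BGR tuples in the order OpenCV expects. `distinct_colors` is a hypothetical helper, not part of the tutorial's code.

```python
import colorsys


def distinct_colors(n):
    """Return n BGR tuples (0-255) with evenly spaced hues."""
    colors = []
    for i in range(n):
        # Full saturation and brightness; only the hue varies.
        r, g, b = colorsys.hsv_to_rgb(i / max(n, 1), 1.0, 1.0)
        colors.append((int(b * 255), int(g * 255), int(r * 255)))
    return colors


# Three interactions get pure red, green, and blue (in BGR order).
print(distinct_colors(3))
```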

### Inference 

In [16]:
# Download a sample image (a PNG saved under a .jpg name; PIL detects the format from the file contents)
!wget https://d2hl4mfiesch9e.cloudfront.net/surfersmag/wp-content/uploads/2018/07/Bildschirmfoto-2018-07-05-um-12.12.06.png -O image.jpg

--2021-09-30 12:51:46--  https://d2hl4mfiesch9e.cloudfront.net/surfersmag/wp-content/uploads/2018/07/Bildschirmfoto-2018-07-05-um-12.12.06.png
Resolving d2hl4mfiesch9e.cloudfront.net (d2hl4mfiesch9e.cloudfront.net)... 13.249.90.104, 13.249.90.181, 13.249.90.134, ...
Connecting to d2hl4mfiesch9e.cloudfront.net (d2hl4mfiesch9e.cloudfront.net)|13.249.90.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1418839 (1.4M) [image/png]
Saving to: ‘image.jpg’


2021-09-30 12:51:47 (2.92 MB/s) - ‘image.jpg’ saved [1418839/1418839]



In [29]:
model, device = buildModel()

Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/checkpoints/resnet50-19c8e357.pth

In [30]:
device = torch.device('cuda:0')
threshold = 0.1
image_path = '/content/image.jpg'
image, image_device = loadImage(image_path)
# Inference only: no gradients are needed.
with torch.no_grad():
  outputs = model([image_device])
results = processOutputs(outputs, image_device, threshold = threshold)
image_save = drawImage(image, results)

In [31]:
cv2.imwrite("image_out" +str(threshold)+".jpg", image_save)

True
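To read the detections as text rather than only in the saved image, each entry of `results` can be summarised from keys that `processOutputs` already produces (`category_id`, `score`, `object_bbox`). `describe_interaction` is a hypothetical helper, and the sample dictionary below stands in for real model output.

```python
def describe_interaction(result, verb_classes):
    """Format one filtered HOI result as a short line of text."""
    verb = verb_classes[int(result['category_id'])]
    obj_label = int(result['object_bbox']['category_id'])
    score = float(result['score'])
    return f"{verb} (object class index {obj_label}, score {score:.2f})"


# Stand-in values; real entries come from processOutputs().
sample_verbs = ['hold_obj', 'stand', 'sit_instr']
sample_result = {'category_id': 2, 'score': 0.5,
                 'object_bbox': {'bbox': [0, 0, 10, 10], 'category_id': 3}}
print(describe_interaction(sample_result, sample_verbs))
# sit_instr (object class index 3, score 0.50)
```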