# **CDN for HOI TUTORIAL**

Tutorial for the paper **`Mining the Benefits of Two-stage and One-stage HOI Detection`** (13 Oct 2021), by Aixi Zhang*, Yue Liao*, Si Liu, Miao Lu, Yongliang Wang, Chen Gao and Xiaobo Li.

The paper's model is used here to detect Human-Object Interactions (HOI) in images.

  - Tutorial Author: [Esteve Valls Mascaro](https://github.com/Evm7/Tutorials-Computer-Vision)
  - Repository used: https://github.com/YueLiao/CDN


The original paper has no section devoted to running HOI inference.
To use the model for inference, follow this tutorial.

Keep in mind:

*   Turn on the GPU runtime.
*   Inference here is set up to process a single image at a time.
*   A human and an object drawn in the same colour are in an interaction; the interaction label is written above the human box.
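
Because later cells hard-code `cuda:0`, it can help to confirm that a GPU is actually visible before running them. A minimal sketch, assuming PyTorch is the framework in use; `pick_device` is a hypothetical helper and not part of the CDN repository:

```python
def pick_device() -> str:
    """Return 'cuda:0' when PyTorch can see a CUDA GPU, otherwise 'cpu'."""
    try:
        import torch  # may not be installed before the requirements step
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

print(pick_device())
```

If this prints `cpu` in Colab, the runtime type has not been switched to GPU.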



## Installation and preparation of the environment

In [1]:
! git clone https://github.com/YueLiao/CDN.git

Cloning into 'CDN'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 43 (delta 9), reused 30 (delta 3), pack-reused 0
Unpacking objects: 100% (43/43), done.


In [2]:
! pip install -qr /content/CDN/requirements.txt

     |████████████████████████████████| 753.2 MB 14 kB/s 
     |████████████████████████████████| 6.6 MB 23.8 MB/s 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.10.0 requires torch==1.9.0, but you have torch 1.5.1 which is incompatible.

In [3]:
!pip install -q gdown

In [4]:
%cd CDN

/content/CDN


In [5]:
!gdown https://drive.google.com/u/0/uc?id=1-GuJ4FGTGJAktH2NVR6Qp_N0zimg7uRr  # model weights; the GitHub repository lists other checkpoints that can be swapped in

Downloading...
From: https://drive.google.com/u/0/uc?id=1-GuJ4FGTGJAktH2NVR6Qp_N0zimg7uRr
To: /content/CDN/hico_cdn_s.pth
100% 167M/167M [00:01<00:00, 87.1MB/s]


In [6]:
!gdown https://drive.google.com/u/0/uc?id=1EeHNHuYyJI-qqDk_-5nay7Mb07tzZLsl # verbs lists for HICO model

Downloading...
From: https://drive.google.com/u/0/uc?id=1EeHNHuYyJI-qqDk_-5nay7Mb07tzZLsl
To: /content/CDN/hico_list_vb.txt
100% 2.38k/2.38k [00:00<00:00, 2.01MB/s]


In [7]:
!gdown https://drive.google.com/u/0/uc?id=1geCHW-yukOnEPjkiD9n9N5rWGczpzX4p # objects lists for HICO model

Downloading...
From: https://drive.google.com/u/0/uc?id=1geCHW-yukOnEPjkiD9n9N5rWGczpzX4p
To: /content/CDN/hico_list_obj.txt
100% 1.64k/1.64k [00:00<00:00, 3.15MB/s]


## Inference of the model

### Utils
Class-name lists and arguments needed to turn the model's predicted indices into readable object and verb labels.

In [8]:
import sys
import pandas as pd

# The list files use a two-space delimiter; multi-character separators are handled by the Python parser engine
hico_list_obj = pd.read_csv("/content/CDN/hico_list_obj.txt", delimiter="  ", engine="python")
hico_list_obj = hico_list_obj.drop(0)
objects = hico_list_obj.to_dict()
valid_obj_names= objects[' object']

hico_list_verb = pd.read_csv("/content/CDN/hico_list_vb.txt", delimiter="  ", engine="python")
hico_list_verb = hico_list_verb.drop(0)
verbs = hico_list_verb.to_dict()
verb_classes_dict= verbs[' verb']


In [21]:
verb_classes = list(verb_classes_dict.values())

In [9]:
import argparse

parser = argparse.ArgumentParser('Set transformer detector', add_help=False)
parser.add_argument('--lr', default=1e-4, type=float)
parser.add_argument('--lr_backbone', default=1e-5, type=float)
parser.add_argument('--batch_size', default=2, type=int)
parser.add_argument('--weight_decay', default=1e-4, type=float)
parser.add_argument('--epochs', default=90, type=int)
parser.add_argument('--lr_drop', default=60, type=int)
parser.add_argument('--clip_max_norm', default=0.1, type=float,
                    help='gradient clipping max norm')

# Model parameters
parser.add_argument('--frozen_weights', type=str, default=None,
                    help="Path to the pretrained model. If set, only the mask head will be trained")
# * Backbone
parser.add_argument('--backbone', default='resnet50', type=str,
                    help="Name of the convolutional backbone to use")
parser.add_argument('--dilation', action='store_true',
                    help="If true, we replace stride with dilation in the last convolutional block (DC5)")
parser.add_argument('--position_embedding', default='sine', type=str, choices=('sine', 'learned'),
                    help="Type of positional embedding to use on top of the image features")

# * Transformer
parser.add_argument('--enc_layers', default=6, type=int,
                    help="Number of encoding layers in the transformer")
parser.add_argument('--dec_layers_hopd', default=3, type=int,
                    help="Number of hopd decoding layers in the transformer")
parser.add_argument('--dec_layers_interaction', default=3, type=int,
                    help="Number of interaction decoding layers in the transformer")
parser.add_argument('--dim_feedforward', default=2048, type=int,
                    help="Intermediate size of the feedforward layers in the transformer blocks")
parser.add_argument('--hidden_dim', default=256, type=int,
                    help="Size of the embeddings (dimension of the transformer)")
parser.add_argument('--dropout', default=0.1, type=float,
                    help="Dropout applied in the transformer")
parser.add_argument('--nheads', default=8, type=int,
                    help="Number of attention heads inside the transformer's attentions")
parser.add_argument('--num_queries', default=100, type=int,
                    help="Number of query slots")
parser.add_argument('--pre_norm', action='store_true')

# * Segmentation
parser.add_argument('--masks', action='store_true',
                    help="Train segmentation head if the flag is provided")

# HOI
parser.add_argument('--num_obj_classes', type=int, default=80,
                    help="Number of object classes")
parser.add_argument('--num_verb_classes', type=int, default=117,
                    help="Number of verb classes")
parser.add_argument('--pretrained', type=str, default='',
                    help='Pretrained model path')
parser.add_argument('--subject_category_id', default=0, type=int)
parser.add_argument('--verb_loss_type', type=str, default='focal',
                    help='Loss type for the verb classification')

# Loss
parser.add_argument('--no_aux_loss', dest='aux_loss', action='store_false',
                    help="Disables auxiliary decoding losses (loss at each layer)")
parser.add_argument('--use_matching', action='store_true',
                    help="Use obj/sub matching 2class loss in first decoder, default not use")

# * Matcher
parser.add_argument('--set_cost_class', default=1, type=float,
                    help="Class coefficient in the matching cost")
parser.add_argument('--set_cost_bbox', default=2.5, type=float,
                    help="L1 box coefficient in the matching cost")
parser.add_argument('--set_cost_giou', default=1, type=float,
                    help="giou box coefficient in the matching cost")
parser.add_argument('--set_cost_obj_class', default=1, type=float,
                    help="Object class coefficient in the matching cost")
parser.add_argument('--set_cost_verb_class', default=1, type=float,
                    help="Verb class coefficient in the matching cost")
parser.add_argument('--set_cost_matching', default=1, type=float,
                    help="Sub and obj box matching coefficient in the matching cost")

# * Loss coefficients
parser.add_argument('--mask_loss_coef', default=1, type=float)
parser.add_argument('--dice_loss_coef', default=1, type=float)
parser.add_argument('--bbox_loss_coef', default=2.5, type=float)
parser.add_argument('--giou_loss_coef', default=1, type=float)
parser.add_argument('--obj_loss_coef', default=1, type=float)
parser.add_argument('--verb_loss_coef', default=2, type=float)
parser.add_argument('--alpha', default=0.5, type=float, help='focal loss alpha')
parser.add_argument('--matching_loss_coef', default=1, type=float)
parser.add_argument('--eos_coef', default=0.1, type=float,
                    help="Relative classification weight of the no-object class")

# dataset parameters
parser.add_argument('--dataset_file', default='coco')
parser.add_argument('--coco_path', type=str)
parser.add_argument('--coco_panoptic_path', type=str)
parser.add_argument('--remove_difficult', action='store_true')
parser.add_argument('--hoi_path', type=str)

parser.add_argument('--output_dir', default='',
                    help='path where to save, empty for no saving')
parser.add_argument('--device', default='cuda',
                    help='device to use for training / testing')
parser.add_argument('--seed', default=42, type=int)
parser.add_argument('--resume', default='', help='resume from checkpoint')
parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
                    help='start epoch')
parser.add_argument('--eval', action='store_true')
parser.add_argument('--num_workers', default=2, type=int)

# distributed training parameters
parser.add_argument('--world_size', default=1, type=int,
                    help='number of distributed processes')
parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')

# decoupling training parameters
parser.add_argument('--freeze_mode', default=0, type=int)
parser.add_argument('--obj_reweight', action='store_true')
parser.add_argument('--verb_reweight', action='store_true')
parser.add_argument('--use_static_weights', action='store_true', 
                    help='use static weights or dynamic weights, default use dynamic')
parser.add_argument('--queue_size', default=4704*1.0, type=float,
                    help='Maxsize of queue for obj and verb reweighting, default 1 epoch')
parser.add_argument('--p_obj', default=0.7, type=float,
                    help='Reweighting parameter for obj')
parser.add_argument('--p_verb', default=0.7, type=float,
                    help='Reweighting parameter for verb')

# hoi eval parameters
parser.add_argument('--use_nms_filter', action='store_true', help='Use pair nms filter, default not use')
parser.add_argument('--thres_nms', default=0.7, type=float)
parser.add_argument('--nms_alpha', default=1.0, type=float)
parser.add_argument('--nms_beta', default=0.5, type=float)
parser.add_argument('--json_file', default='results.json', type=str)

args = parser.parse_args([])
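
Passing an empty list to `parse_args` makes the parser ignore the notebook kernel's own command-line arguments and return only the defaults, which can then be overridden as plain attributes. A reduced sketch of that pattern, using two of the options defined above:

```python
import argparse

parser = argparse.ArgumentParser("demo", add_help=False)
parser.add_argument("--num_queries", default=100, type=int)
parser.add_argument("--use_nms_filter", action="store_true")

# An explicit empty list means "no CLI arguments": every option takes its default.
args = parser.parse_args([])
print(args.num_queries, args.use_nms_filter)

# Defaults are ordinary attributes on the Namespace and can be overridden in place.
args.num_queries = 64
args.use_nms_filter = True
print(vars(args))
```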

### Define Functions

In [10]:
from PIL import Image

import torchvision.transforms as transforms
transform = transforms.Compose([
    transforms.ToTensor()
])

def loadImage(input_path = '/content/kitchen.png'):
  image = Image.open(input_path).convert("RGB")
  return image, transform(image).to(device)

In [11]:
import numpy as np
import torch.nn.functional as F
from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou
def processOutputs(outputs, image, threshold = 0.2):
  out_obj_logits, out_verb_logits, out_sub_boxes, out_obj_boxes = outputs['pred_obj_logits'], \
                                                                outputs['pred_verb_logits'], \
                                                                outputs['pred_sub_boxes'], \
                                                                outputs['pred_obj_boxes']

  obj_prob = F.softmax(out_obj_logits, -1)
  obj_scores, obj_labels = obj_prob[..., :-1].max(-1)
  
  ch, img_h, img_w = image.shape
  img_w = torch.tensor(img_w)
  img_h = torch.tensor(img_h)

  verb_scores = out_verb_logits.sigmoid()
  scale_fct = torch.stack([img_w, img_h, img_w, img_h]).to(verb_scores.device)
  sub_boxes = box_cxcywh_to_xyxy(out_sub_boxes)
  sub_boxes = sub_boxes * scale_fct
  obj_boxes = box_cxcywh_to_xyxy(out_obj_boxes)
  obj_boxes = obj_boxes * scale_fct

  obj_scores = obj_scores.detach()
  obj_labels = obj_labels.detach()
  verb_scores = verb_scores.detach()
  sub_boxes = sub_boxes.detach()
  obj_boxes = obj_boxes.detach()

  results = []
  for os, ol, vs, sb, ob in zip(obj_scores, obj_labels, verb_scores, sub_boxes, obj_boxes):
      sl = torch.full_like(ol, 0)
      l = torch.cat((sl, ol))
      b = torch.cat((sb, ob))
      bboxes = [{'bbox': bbox, 'category_id': label} for bbox, label in zip(b.to('cpu').numpy(), l.to('cpu').numpy())]

      hoi_scores = vs * os.unsqueeze(1)

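      # Build the labels on the same device as the scores instead of relying on a global `device`
      verb_labels = torch.arange(hoi_scores.shape[1], device=hoi_scores.device).view(1, -1).expand(
          hoi_scores.shape[0], -1)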
      object_labels = ol.view(-1, 1).expand(-1, hoi_scores.shape[1])


      ids = torch.arange(b.shape[0])

      hois = [{'subject_id': subject_id, 'object_id': object_id, 'category_id': category_id, 'score': score} for
              subject_id, object_id, category_id, score in zip(ids[:ids.shape[0] // 2].to('cpu').numpy(),
                                                                ids[ids.shape[0] // 2:].to('cpu').numpy(),
                                                                verb_labels.to('cpu').numpy(), hoi_scores.to('cpu').numpy())]

      results.append({
          'predictions': bboxes,
          'hoi_prediction': hois
      })

  results_filtered = []
  results = results[0]
  for h in results['hoi_prediction']:
    score = np.max(h['score'])
    if score > threshold:
      obj_id =  h['object_id']
      sub_id =  h['subject_id']
      index = np.argmax(h['score'])
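      dict_ = { 'category_id': h['category_id'][index],
                'object_id': obj_id,
                'score': score,
                'object_bbox' : results['predictions'][obj_id],
                # 'subject_id' holds the subject's prediction dict (bbox and category), as drawImage expects;
                # the earlier duplicate key with the integer id was always overwritten and has been removed
                'subject_id' : results['predictions'][sub_id],
              }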
      results_filtered.append(dict_)
  return results_filtered
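
The box handling in `processOutputs` can be checked by hand. The sketch below mirrors, in plain Python, what `box_cxcywh_to_xyxy` followed by the `scale_fct` multiplication is assumed to do: convert a normalised (centre x, centre y, width, height) box into pixel corner coordinates. `cxcywh_to_xyxy_pixels` is a hypothetical name used only for this illustration.

```python
def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Convert a normalised (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0, y0 = cx - 0.5 * w, cy - 0.5 * h
    x1, y1 = cx + 0.5 * w, cy + 0.5 * h
    # Same ordering as scale_fct = [img_w, img_h, img_w, img_h]
    return (x0 * img_w, y0 * img_h, x1 * img_w, y1 * img_h)

# A box centred in a 640x480 image that covers half of each dimension
print(cxcywh_to_xyxy_pixels((0.5, 0.5, 0.5, 0.5), img_w=640, img_h=480))
# → (160.0, 120.0, 480.0, 360.0)
```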

In [19]:
import cv2

def drawImage(image, results):
  image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
  COLORS = np.random.uniform(0, 255, size=(len(verb_classes_dict)+1, 3))
  for color_id, r in enumerate(results):
    category_id_verb = r['category_id']
    object_bbox = r['object_bbox']['bbox']
    object_id = r['object_bbox']['category_id']
    subject_bbox = r['subject_id']['bbox']
    subject_id = 0

    color = COLORS[color_id]
    cv2.rectangle(
        image,
        (int(subject_bbox[0]), int(subject_bbox[1])),
        (int(subject_bbox[2]), int(subject_bbox[3])),
        color, 2
    )

    cv2.rectangle(
        image,
        (int(object_bbox[0]), int(object_bbox[1])),
        (int(object_bbox[2]), int(object_bbox[3])),
        color, 2
    )

    cv2.putText(image, verb_classes[category_id_verb], (int(subject_bbox[0]), int(subject_bbox[1]-5)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2,
                lineType=cv2.LINE_AA)
  return image
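
`drawImage` draws a fresh random palette on every call, so the colour assigned to a given pair index changes between calls, and therefore between video frames. If stable colours are preferred, one option is a deterministic palette. A sketch using only the standard library; `pair_color` is a hypothetical helper and the golden-ratio hue step is an arbitrary choice:

```python
import colorsys

def pair_color(index):
    """Deterministic BGR colour (ints in 0-255) for the index-th detected pair."""
    hue = (index * 0.618033988749895) % 1.0  # golden-ratio step spreads hues apart
    r, g, b = colorsys.hsv_to_rgb(hue, 0.9, 1.0)
    return (int(b * 255), int(g * 255), int(r * 255))  # OpenCV expects BGR order

print([pair_color(i) for i in range(3)])
```

Inside `drawImage`, `color = pair_color(color_id)` would then stand in for the `COLORS[color_id]` lookup.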

In [13]:
import torch
from models import build_model

args.pretrained = "/content/CDN/hico_cdn_s.pth"
args.num_obj_classes=80
args.backbone='resnet50'
args.num_verb_classes=117
args.dataset_file='hico'
args.num_queries=64
args.dec_layers_hopd=3
args.dec_layers_interaction =3
args.use_nms_filter=True
args.eval=True

def load_model(args):
  device = torch.device(args.device)
  model, criterion, postprocessors = build_model(args)
  model = model.to(device)
  # Load the checkpoint on CPU first, then copy the weights into the model
  checkpoint = torch.load(args.pretrained, map_location='cpu')
  model.load_state_dict(checkpoint['model'])
  model.eval()  # inference mode: disables dropout
  return model

### Inference

In [23]:
!mkdir -p /content/outputs

In [14]:
!wget https://www.westend61.de/images/0001490931pw/happy-business-people-cooking-food-together-in-office-kitchen-PESF02401.jpg -O image.jpg

--2021-10-19 15:05:04--  https://www.westend61.de/images/0001490931pw/happy-business-people-cooking-food-together-in-office-kitchen-PESF02401.jpg
Resolving www.westend61.de (www.westend61.de)... 94.130.134.142, 2a01:4f8:13b:356f::2
Connecting to www.westend61.de (www.westend61.de)|94.130.134.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 130380 (127K) [image/jpeg]
Saving to: ‘image.jpg’


2021-10-19 15:05:05 (578 KB/s) - ‘image.jpg’ saved [130380/130380]



In [16]:
device = torch.device('cuda:0')

In [28]:
threshold = 0.04
image_path = '/content/CDN/image.jpg' 
image, image_device = loadImage(image_path)
model = load_model(args)

In [29]:
outputs = model([image_device])
results = processOutputs(outputs, image_device, threshold = threshold)
image_save = drawImage(image, results)
cv2.imwrite("/content/outputs/image_out" +str(threshold)+".jpg", image_save)


True
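
The `threshold` argument is compared against the best HOI score of each human-object pair, which `processOutputs` computes as the sigmoid of each verb logit multiplied by the object's softmax confidence. A plain-Python sketch with made-up logits (the numbers are illustrative, not model output; `hoi_scores` and `best_verb` are hypothetical helpers):

```python
import math

def hoi_scores(verb_logits, obj_score):
    """Per-verb HOI scores: sigmoid(verb logit) * object confidence."""
    return [obj_score / (1.0 + math.exp(-v)) for v in verb_logits]

def best_verb(verb_logits, obj_score, threshold):
    """Return (verb_index, score) for the top verb, or None if it is below threshold."""
    scores = hoi_scores(verb_logits, obj_score)
    idx = max(range(len(scores)), key=scores.__getitem__)
    return (idx, scores[idx]) if scores[idx] > threshold else None

# Hypothetical pair: three verb logits, object confidence 0.5
print(best_verb([-3.0, 0.0, -1.0], obj_score=0.5, threshold=0.04))  # → (1, 0.25)
print(best_verb([-3.0, 0.0, -1.0], obj_score=0.5, threshold=0.30))  # → None
```

Lower thresholds keep more pairs at the cost of more spurious interactions being drawn.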

### Inference on Videos

In [30]:
!wget https://j.gifs.com/325l3R@facebook.gif -O gif.mp4


--2021-10-19 15:12:59--  https://j.gifs.com/325l3R@facebook.gif
Resolving j.gifs.com (j.gifs.com)... 23.42.158.216, 23.42.158.192
Connecting to j.gifs.com (j.gifs.com)|23.42.158.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7020749 (6.7M) [image/gif]
Saving to: ‘gif.mp4’


2021-10-19 15:13:00 (13.6 MB/s) - ‘gif.mp4’ saved [7020749/7020749]



In [31]:
from IPython.display import HTML
from base64 import b64encode
import os
!mkdir -p /content/outputs
def showVideo(path):
  compressed_path = "/content/outputs/"+os.path.basename(path)+"compressed.mp4"

  os.system(f"ffmpeg -i {path} -vcodec libx264 {compressed_path}")

  # Show video
  mp4 = open(compressed_path,'rb').read()
  data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
  return HTML("""
  <video width=400 controls>
        <source src="%s" type="video/mp4">
  </video>
  """ % data_url)

In [37]:
import sys, time, torch, cv2

def prepare_video(input_path, output_path, scale=1, slow_down=1):
    cap = cv2.VideoCapture(input_path)
    if not cap.isOpened():
        print('Could not open {} video'.format(input_path), flush=True)
        sys.exit()

    fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) / scale), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) / scale))
    out = cv2.VideoWriter(str(output_path), fourcc, int(fps/slow_down), size)
    frame_length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    return cap, out, device, (fps, size, frame_length)

def rotate_image(frame, size, angle=270):
    image = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    image_center = tuple(np.array(image.shape[1::-1]) / 2)
    rot_mat = cv2.getRotationMatrix2D(image_center, angle, 0.5)
    result = cv2.warpAffine(image, rot_mat, image.shape[1::-1], flags=cv2.INTER_LINEAR)
    return result

def reformat_frame(frame, size, rotate=False):
    # Re-scale the frame to the desired resolution
    resized_frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    if rotate:
        image = rotate_image(resized_frame, size)
        return image
    return resized_frame
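
`prepare_video` opens the writer at `int(fps / slow_down)` frames per second, which truncates to 0 whenever the source frame rate is lower than `slow_down`. A hedged sketch of a guard against that case; `output_fps` is a hypothetical helper and not part of the notebook:

```python
def output_fps(source_fps, slow_down=1, minimum=1):
    """Writer frame rate: source rate divided by slow_down, never below minimum."""
    if slow_down <= 0:
        raise ValueError("slow_down must be positive")
    return max(minimum, int(source_fps / slow_down))

print(output_fps(30, slow_down=3))  # → 10
print(output_fps(2, slow_down=3))   # → 1 (clamped instead of 0)
```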

In [38]:
@torch.no_grad()
def processVideo(model, input_video_path, output_video_path, threshold = 0.05, scale=1, slow_down=1):

    cap, out, device, (fps, size, frame_length) = prepare_video(input_video_path, output_video_path,  scale=scale, slow_down=slow_down)
    frame_count = 0  # to count total frames
    total_fps = 0  # to get the final frames per second
    start_time_tot = time.time()
    model = model.to(device)
    # read until end of video
    print("[INFO] Video is being processed...")
    while (cap.isOpened()):
        # capture each frame of the video
        ret, frame = cap.read()
        if ret:
            frame = reformat_frame(frame, size)
            frame = cv2.cvtColor(np.asarray(frame), cv2.COLOR_BGR2RGB)

            img = transform(frame).to(device)
            outputs = model([img])
            results = processOutputs(outputs, img, threshold = threshold)
            image_save = drawImage(frame, results)
            out.write(image_save)

            frame_count += 1
            # Stop once every reported frame is processed; guard against a zero frame count
            if frame_length > 0 and frame_count >= frame_length:
                break
        else:
            break

    print("[INFO] Video already processed", flush=True)
    end_time = time.time()
    # Average throughput over the frames actually processed
    avg_fps = frame_count / (end_time - start_time_tot)
    # release VideoCapture()
    cap.release()
    out.release()

In [40]:
processVideo(model, '/content/CDN/gif.mp4', '/content/outputs/out.mp4', threshold = 0.05, scale=1, slow_down=3)

[INFO] Video is being processed...
[INFO] Video already processed


In [36]:
showVideo('/content/outputs/out.mp4')