# Gaze Follow Tutorial

Tutorial for the paper **Detecting Attended Visual Targets in Video** by Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M. Rehg (CVPR 2020).


*   Tutorial Author [Esteve Valls Mascaro](https://github.com/Evm7/Tutorials-Computer-Vision)
*   Repository used: https://github.com/ejcgt/attention-target-detection
*   Tracking-Detection: https://github.com/mikel-brostrom/Yolov5_DeepSort_Pytorch


This tutorial is developed step by step:
*   (1) is the baseline from the official repository: the gaze-following code estimates the focus of a single person whose head has already been detected.
*   (2) adds multi-person handling, still using pre-computed detections.
*   (3) estimates the focus in any video in which only one person appears.
*   (4) is the final system: it detects and tracks the face of each person in the video and then computes gaze following for each of them.

(4) can be used for any video.





## 0. Installation and Preparation of the Gaze-Follow Environment

In [1]:
! git clone https://github.com/ejcgt/attention-target-detection.git

Cloning into 'attention-target-detection'...
remote: Enumerating objects: 156, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 156 (delta 1), reused 1 (delta 0), pack-reused 151
Receiving objects: 100% (156/156), 111.53 MiB | 33.80 MiB/s, done.
Resolving deltas: 100% (10/10), done.


In [2]:
# ! sh /content/attention-target-detection/download_models.sh # for all the models
! wget https://www.dropbox.com/s/vt8hua06r1yoi2i/model_demo.pt 


--2021-10-04 15:37:51--  https://www.dropbox.com/s/vt8hua06r1yoi2i/model_demo.pt
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/vt8hua06r1yoi2i/model_demo.pt [following]
--2021-10-04 15:37:51--  https://www.dropbox.com/s/raw/vt8hua06r1yoi2i/model_demo.pt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb27f50e8b9318b16be618603e3.dl.dropboxusercontent.com/cd/0/inline/BXaPseyAvbIG3Cw_y8hzog7O4TKHvxtrwl_p2p6x1lG9TC7MuTIBwPeRQb63gLMvQLiicqwpyQfIUdMAh8QaG359HpXJK-KhPVfThnA29ahNqUnazonJ9AdZk0N-5Eu1P92xbETjKZBlKRr0yVJylIuH/file# [following]
--2021-10-04 15:37:52--  https://ucb27f50e8b9318b16be618603e3.dl.dropboxusercontent.com/cd/0/inline/BXaPseyAvbIG3Cw_y8hzog7O4TKHvxtrwl_p2p6x1lG9TC7MuTIBwPeRQb63gLMvQLiicqwpyQfIUdMAh8QaG35

## 1. Single-Person Gaze Following from Pre-Detections
This chapter describes the original use of the repository.
To run gaze following on a video here, the video must already be split into frames, and each frame must already be annotated with the head bounding box of the person of interest.

This is the baseline for the method, using the original demo video and detections from the repository.
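
As a rough illustration of the expected input, the sketch below parses head annotations and pads each box. It assumes one comma-separated line per frame in the form `frame,left,top,right,bottom` (the column names used by `getHead` in the next section). The sample rows and the helper names `expand_box` and `read_head_boxes` are hypothetical. The padding here is computed from the original width and height, so the box grows by the same amount on every side.

```python
import csv
import io

def expand_box(left, top, right, bottom, ratio=0.1):
    """Grow a (left, top, right, bottom) box by `ratio` of its size on every side."""
    pad_x = (right - left) * ratio
    pad_y = (bottom - top) * ratio
    return (left - pad_x, top - pad_y, right + pad_x, bottom + pad_y)

def read_head_boxes(text, ratio=0.1):
    """Map frame name -> padded box from 'frame,left,top,right,bottom' lines."""
    boxes = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        frame, coords = row[0], [float(v) for v in row[1:5]]
        boxes[frame] = expand_box(*coords, ratio=ratio)
    return boxes

# Hypothetical sample annotations (not taken from the demo data)
sample = "00001.jpg,100,100,200,300\n00002.jpg,110,105,210,305\n"
head_boxes = read_head_boxes(sample)
print(head_boxes["00001.jpg"])
```

Keying the result by frame name mirrors how the annotation table is indexed by the frame column, so a frame's box can be looked up directly from its file name.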

### Define Functions

In [30]:
import sys
sys.path.append("/content/attention-target-detection")

In [31]:
import argparse, os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import datasets, transforms
import pandas as pd
import numpy as np
from PIL import Image
import cv2

import glob
from IPython.display import HTML
from base64 import b64encode

from model import ModelSpatial
from utils import imutils, evaluation
from config import *

In [32]:
def build_model(device, model_weights):
    model = ModelSpatial()
    model_dict = model.state_dict()
    pretrained_dict =  torch.load(model_weights, map_location='cpu')
    pretrained_dict = pretrained_dict['model']
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict)

    model.to(device)
    model = model.train(False)
    return model

In [33]:
def _get_transform():
    transform_list = []
    transform_list.append(transforms.Resize((input_resolution, input_resolution)))
    transform_list.append(transforms.ToTensor())
    transform_list.append(transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]))
    return transforms.Compose(transform_list)

In [34]:
def getHead(head_data_path):
    column_names = ['frame', 'left', 'top', 'right', 'bottom']
    df = pd.read_csv(head_data_path, names=column_names, index_col=0)
    df['left'] -= (df['right']-df['left'])*0.1
    df['right'] += (df['right']-df['left'])*0.1
    df['top'] -= (df['bottom']-df['top'])*0.1
    df['bottom'] += (df['bottom']-df['top'])*0.1
    return df

In [35]:

def prepareVideo(image_dir,output_path, fps=10):
    # Join the pattern explicitly so it works with or without a trailing slash in image_dir
    files = glob.glob(os.path.join(image_dir, "**", "*.jpg"), recursive=True)
    file_path = files[0]
    frame_raw = Image.open(file_path).convert('RGB')
    size = frame_raw.size
    fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    out_video = cv2.VideoWriter(output_path, fourcc, fps, size)
    frame_length = int(len(files))
    return out_video, frame_length, size

In [36]:
def drawImage(image, head_box, video_estimation, out_threshold=100, color = (1, 1, 0)):
    image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
    cv2.rectangle(
            image,
            (int(head_box[0]), int(head_box[1])),
            (int(head_box[2]), int(head_box[3])),
            color, 2
        )
    inout, raw_hm = video_estimation
    height, width, _ = image.shape
    if inout < out_threshold: # in-frame gaze
        pred_x, pred_y = evaluation.argmax_pts(raw_hm)
        norm_p = [pred_x/output_resolution, pred_y/output_resolution]
        cv2.circle(image, (int(norm_p[0]*width), int(norm_p[1]*height)), int(height/30.0), color, 2)
        cv2.line(image, (int((head_box[0]+head_box[2])/2),int((head_box[1]+head_box[3])/2)), (int(norm_p[0]*width),int(norm_p[1]*height)), color, 2)

    return image
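
To make the coordinate mapping in `drawImage` explicit, here is a pure-Python sketch. It assumes the heatmap is a square grid (of side `output_resolution` in the notebook) and that the arg-max is reported as an `(x, y)` cell. The helper names and the tiny 4×4 heatmap are illustrative only.

```python
def argmax_xy(heatmap):
    """Return (x, y) = (column, row) of the largest value in a 2D list."""
    best_x, best_y, best_v = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_v:
                best_x, best_y, best_v = x, y, value
    return best_x, best_y

def heatmap_to_pixel(heatmap, width, height):
    """Scale the heatmap peak to pixel coordinates of a width x height frame."""
    resolution = len(heatmap)  # assumes a square heatmap
    x, y = argmax_xy(heatmap)
    return int(x / resolution * width), int(y / resolution * height)

toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.0],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(heatmap_to_pixel(toy_heatmap, width=640, height=480))  # (320, 120)
```

The peak sits at column 2, row 1 of a 4×4 grid, so it lands at half the frame width and a quarter of the frame height.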

In [37]:
def prepareVideoFrames(image_dir, head_data, output_video):
  try:
      os.makedirs("outputs")
  except FileExistsError:
      print("New video will be saved in "+str(output_video))
  df = getHead(head_data)
  test_transforms = _get_transform()
  out_video, frame_length, size = prepareVideo(image_dir, output_video, fps=10)
  return out_video, df, test_transforms

In [38]:
def processVideoFrames(image_dir, head_data, output_video, model):
    out_video, df, test_transforms = prepareVideoFrames(image_dir, head_data, output_video)
    with torch.no_grad():
        for i in df.index:
            frame_raw = Image.open(os.path.join(image_dir, i)).convert('RGB')
            width, height = frame_raw.size

            head_box = [df.loc[i,'left'], df.loc[i,'top'], df.loc[i,'right'], df.loc[i,'bottom']]

            head = frame_raw.crop((head_box)) # head crop

            head = test_transforms(head) # transform inputs
            frame = test_transforms(frame_raw)
            head_channel = imutils.get_head_box_channel(head_box[0], head_box[1], head_box[2], head_box[3], width, height,
                                                        resolution=input_resolution).unsqueeze(0)

            head = head.unsqueeze(0).cuda()
            frame = frame.unsqueeze(0).cuda()
            head_channel = head_channel.unsqueeze(0).cuda()

            # forward pass
            raw_hm, _, inout = model(frame, head_channel, head)

            # heatmap modulation
            raw_hm = raw_hm.cpu().detach().numpy() * 255
            raw_hm = raw_hm.squeeze()
            inout = inout.cpu().detach().numpy()
            inout = 1 / (1 + np.exp(-inout))
            inout = (1 - inout) * 255
            norm_map = cv2.resize(raw_hm, (height, width)) - inout

            # vis
            image = drawImage(frame_raw, head_box, (inout, raw_hm), out_threshold=100)
            out_video.write(image)

        print('DONE!')
    out_video.release()

In [39]:
def showVideo(path):
  compressed_path = "/content/outputs/"+os.path.basename(path)+"compressed.mp4"

  # -y overwrites a previously compressed file instead of waiting for confirmation
  os.system(f"ffmpeg -y -i {path} -vcodec libx264 {compressed_path}")

  # Show video
  mp4 = open(compressed_path,'rb').read()
  data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
  return HTML("""
  <video width=400 controls>
        <source src="%s" type="video/mp4">
  </video>
  """ % data_url)

### Inference

In [None]:
model_weights = '/content/model_demo.pt'

image_dir  = '/content/attention-target-detection/data/demo/frames'
head_data = '/content/attention-target-detection/data/demo/person1.txt' # contains the head bbox for each frame of the video
output_video= "/content/outputs/gaze_follows.mp4"

# Important to activate the GPU environment
device = torch.device('cuda:0')


In [None]:
model =  build_model(device, model_weights)

In [None]:
processVideoFrames(image_dir, head_data, output_video, model)


DONE!


In [None]:
showVideo(output_video)


## 2. Multi-Person Gaze Following from Pre-Detections

The previous section only shows how to compute gaze following for one person. What if we are interested in the gaze of two people independently?

In this chapter we develop a multi-person gaze follower for videos that are already split into frames and in which the face location of each person is already annotated.

It uses the original demo frames and detections from the repository.
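
Each person is described by their own annotation file and is drawn in their own color. As a hedged sketch of that bookkeeping, the snippet below pairs each file with a deterministic, evenly spaced hue rather than a random color. `distinct_bgr_colors` is a hypothetical helper, and the BGR ordering assumes the frame is drawn with OpenCV.

```python
import colorsys

def distinct_bgr_colors(n):
    """Return n evenly spaced hues as (B, G, R) tuples with values in 0..255."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / max(n, 1), 1.0, 1.0)
        colors.append((int(b * 255), int(g * 255), int(r * 255)))
    return colors

person_files = ["person1.txt", "person2.txt"]  # annotation files, one per person
for name, color in zip(person_files, distinct_bgr_colors(len(person_files))):
    print(name, color)
```

Deterministic colors keep each person's overlay the same across runs, which makes output videos easier to compare.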

### Define Functions

In [40]:
def prepareVideoFrames_multi(image_dir, head_datas, output_video):
  try:
      os.makedirs("outputs")
  except FileExistsError:
      print("New video will be saved in "+str(output_video))

  dfs = []
  for hd in head_datas:
      dfs.append(getHead(hd))
  test_transforms = _get_transform()
  out_video, frame_length, size = prepareVideo(image_dir, output_video, fps=10)
  return out_video, dfs, test_transforms

In [41]:
def processImage(frame_raw, draw_image, model, path_name, df, test_transforms, color=(0,1,0)):
    head_box = [df.loc[path_name,'left'], df.loc[path_name,'top'], df.loc[path_name,'right'], df.loc[path_name,'bottom']]

    head = frame_raw.crop((head_box)) # head crop

    head = test_transforms(head) # transform inputs
    frame = test_transforms(frame_raw)

    width, height = frame_raw.size

    head_channel = imutils.get_head_box_channel(head_box[0], head_box[1], head_box[2], head_box[3], width, height,
                                                resolution=input_resolution).unsqueeze(0)

    head = head.unsqueeze(0).cuda()
    frame = frame.unsqueeze(0).cuda()
    head_channel = head_channel.unsqueeze(0).cuda()

    # forward pass
    raw_hm, _, inout = model(frame, head_channel, head)

    # heatmap modulation
    raw_hm = raw_hm.cpu().detach().numpy() * 255
    raw_hm = raw_hm.squeeze()
    inout = inout.cpu().detach().numpy()
    inout = 1 / (1 + np.exp(-inout))
    inout = (1 - inout) * 255
    norm_map = cv2.resize(raw_hm, (height, width)) - inout

    # vis
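    # draw_image is already BGR, so use drawImage_multi, which draws without converting colors again
    draw_image = drawImage_multi(draw_image, head_box, (inout, raw_hm), out_threshold=100, color=color)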

    return draw_image

In [42]:
def drawImage_multi(image, head_box, video_estimation, out_threshold=100, color = (1, 1, 0)):
    cv2.rectangle(
            image,
            (int(head_box[0]), int(head_box[1])),
            (int(head_box[2]), int(head_box[3])),
            color, 2
        )
    inout, raw_hm = video_estimation
    height, width, _ = image.shape
    if inout < out_threshold: # in-frame gaze
        pred_x, pred_y = evaluation.argmax_pts(raw_hm)
        norm_p = [pred_x/output_resolution, pred_y/output_resolution]
        cv2.circle(image, (int(norm_p[0]*width), int(norm_p[1]*height)), int(height/30.0), color, 2)
        cv2.line(image, (int((head_box[0]+head_box[2])/2),int((head_box[1]+head_box[3])/2)), (int(norm_p[0]*width),int(norm_p[1]*height)), color, 2)

    return image

In [43]:
def processVideoFrames_multi(image_dir, head_datas, output_video, model):
    out_video, dfs, test_transforms = prepareVideoFrames_multi(image_dir, head_datas, output_video)
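    # Join the pattern explicitly so it works with or without a trailing slash in image_dir
    files = [os.path.basename(f) for f in glob.glob(os.path.join(image_dir, "**", "*.jpg"), recursive=True)]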
    COLORS = np.random.uniform(0, 255, size=(len(dfs)+1, 3))
    files.sort()
    with torch.no_grad():
        for frame_path in files:
            image = Image.open(os.path.join(image_dir, frame_path)).convert('RGB')
            draw_image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)

            for index, df in enumerate(dfs):
              draw_image = processImage(image, draw_image, model, frame_path, df, test_transforms, COLORS[index])
              #cv2.imwrite("outputs/frame"+str(frame_path)+str(index)+".jpg", draw_image)
            out_video.write(draw_image)

        print('DONE!')
    out_video.release()

### Inference

In [None]:
model_weights = '/content/model_demo.pt'

image_dir  = '/content/attention-target-detection/data/demo/frames'

head_data1 = '/content/attention-target-detection/data/demo/person1.txt' # contains the head bbox for each frame of the video for person 1
head_data2 = '/content/attention-target-detection/data/demo/person2.txt' # contains the head bbox for each frame of the video for person 2

output_multi_video= "outputs/multi_gaze_follows.mp4"

# Important to activate the GPU environment
device = torch.device('cuda:0')


In [None]:
model =  build_model(device, model_weights)

In [None]:
processVideoFrames_multi(image_dir, [head_data1, head_data2], output_multi_video, model)

In [None]:
showVideo(output_multi_video)


## 3. Single-Person Gaze Following for Videos + Detection

Until now we have only been able to use the model on frames with already annotated heads.

In this chapter we add a head detector, which is then used to build a proper end-to-end system.

### Create the environment for Face Detection

In [13]:
! git clone https://github.com/deepakcrk/yolov5-crowdhuman.git
sys.path.append("/content/yolov5-crowdhuman")
! pip install -qr /content/yolov5-crowdhuman/requirements.txt
os.chdir("yolov5-crowdhuman/")

Cloning into 'yolov5-crowdhuman'...
remote: Enumerating objects: 5028, done.
remote: Total 5028 (delta 0), reused 0 (delta 0), pack-reused 5028
Receiving objects: 100% (5028/5028), 7.92 MiB | 18.94 MiB/s, done.
Resolving deltas: 100% (3421/3421), done.
     |████████████████████████████████| 636 kB 5.2 MB/s

`gdown` is used to download the [crowdhuman_yolov5m.pt](https://github.com/deepakcrk/yolov5-crowdhuman.git) weights into this environment.

In [14]:
!pip install -q gdown
! gdown https://drive.google.com/u/1/uc?id=1gglIwqxaH2iTvy6lZlXuAcMpd_U0GCUb

Downloading...
From: https://drive.google.com/u/1/uc?id=1gglIwqxaH2iTvy6lZlXuAcMpd_U0GCUb
To: /content/yolov5-crowdhuman/crowdhuman_yolov5m.pt
169MB [00:01, 158MB/s]


### Inference on ETRI Database
The ETRI-Activity3D dataset collects videos of older people performing daily activities.
Here we use some of its sample videos to run the attention module on them without any pre-computed head detections.

Note that this chapter uses only a detection model, without any tracking, so it relies on exactly one person appearing in every frame.
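
Since there is no tracker, each frame is handled independently: read that frame's label file, keep the head detection, and convert it to pixel corners. A minimal sketch, assuming the label files use the normalized `class cx cy w h` layout implied by `getHeadImage` later in this chapter, with class 1 for heads. The sample label text and the helper names are made up for illustration.

```python
def yolo_to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a normalized center/size box to pixel (left, top, right, bottom)."""
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)

def first_head_box(label_text, img_w, img_h, head_class=1):
    """Return the first box of class `head_class`, or None if there is none."""
    for line in label_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or int(parts[0]) != head_class:
            continue
        cx, cy, w, h = (float(v) for v in parts[1:5])
        return yolo_to_corners(cx, cy, w, h, img_w, img_h)
    return None

# Hypothetical label file content: one person (class 0) and one head (class 1)
labels = "0 0.50 0.60 0.30 0.80\n1 0.50 0.25 0.10 0.20\n"
print(first_head_box(labels, img_w=640, img_h=480))
```

Returning `None` when no head is found lets the caller skip that frame instead of failing on an empty detection list.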

In [15]:
! gdown https://drive.google.com/u/1/uc?id=19hgw-kp2qM0rs4-KoaR8Ou8xrjgiQOoD
! unzip ETRI-Activity3D_Sample_en.zip

Downloading...
From: https://drive.google.com/u/1/uc?id=19hgw-kp2qM0rs4-KoaR8Ou8xrjgiQOoD
To: /content/yolov5-crowdhuman/ETRI-Activity3D_Sample_en.zip
95.3MB [00:00, 143MB/s]
Archive:  ETRI-Activity3D_Sample_en.zip
  inflating: ETRI-Activity3D.xlsx    
   creating: SampleVideos/
  inflating: SampleVideos/10_DS.mp4  
  inflating: SampleVideos/10_RGB.mp4  
  inflating: SampleVideos/1_DS.mp4   
  inflating: SampleVideos/1_RGB.mp4  
  inflating: SampleVideos/24_DS.mp4  
  inflating: SampleVideos/24_RGB.mp4  
  inflating: SampleVideos/29_DS.mp4  
  inflating: SampleVideos/29_RGB.mp4  


### Face Detection Inference on Real Video

In this subchapter we see how inference with the face detection model works on a sample video from the ETRI dataset.

In [16]:
! python detect.py --weights crowdhuman_yolov5m.pt --source /content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4 --heads --save-txt  --conf-thres 0.5 


Namespace(agnostic_nms=False, augment=False, classes=None, conf_thres=0.5, device='', exist_ok=False, heads=True, img_size=640, iou_thres=0.45, name='exp', person=False, project='runs/detect', save_conf=False, save_txt=True, source='/content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4', update=False, view_img=False, weights=['crowdhuman_yolov5m.pt'])
YOLOv5 v4.0-114-g285bd44 torch 1.9.0+cu102 CUDA:0 (Tesla K80, 11441.1875MB)

Fusing layers... 
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Model Summary: 308 layers, 21041679 parameters, 0 gradients, 50.3 GFLOPS
video 1/1 (1/143) /content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4: 384x640 1 person, 1 head, Done. (0.103s)
video 1/1 (2/143) /content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4: 384x640 1 person, 1 head, Done. (0.065s)
video 1/1 (3/143) /content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4: 384x640 1 person, 1 head, Done. (0.064s)
video 1/1 (4/143) /content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4:

In [18]:
!mkdir /content/outputs

In [19]:
showVideo('/content/yolov5-crowdhuman/runs/detect/exp/1_RGB.mp4') # We can see that the person's face is perfectly detected during the whole video

### Define Functions

In [20]:
def getHeadImage(head_data_path):
    column_names = ['class_', 'cx', 'cy', 'w', 'h']
    df = pd.read_csv(head_data_path, names=column_names, sep=' ')      
    df = df[df['class_']==1]
    return df

def prepareFrame(frame_id, labels_path):
  hd = getHeadImage(labels_path+"1_RGB_"+str(frame_id)+'.txt')
  return hd

def prepareVideo2(video_path ,output_path):
    # Get the video from the input fpath
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print('Could not open {} video'.format(video_path), flush=True)
        sys.exit()
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) ), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) ))
    fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    out_video = cv2.VideoWriter(output_path, fourcc, fps, size)
    frame_length =  int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    return cap, out_video, frame_length, size

In [21]:
def processImage2(draw_image, model, head_box, test_transforms, color=(0,1,0)):
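    # OpenCV frames are BGR; convert to RGB so the model sees the channel order it expects
    frame_raw = Image.fromarray(cv2.cvtColor(draw_image, cv2.COLOR_BGR2RGB))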

    head = frame_raw.crop((head_box)) # head crop

    head = test_transforms(head) # transform inputs
    frame = test_transforms(frame_raw)

    width, height = frame_raw.size

    head_channel = imutils.get_head_box_channel(head_box[0], head_box[1], head_box[2], head_box[3], width, height,
                                                resolution=input_resolution).unsqueeze(0)

    head = head.unsqueeze(0).cuda()
    frame = frame.unsqueeze(0).cuda()
    head_channel = head_channel.unsqueeze(0).cuda()

    # forward pass
    raw_hm, _, inout = model(frame, head_channel, head)

    # heatmap modulation
    raw_hm = raw_hm.cpu().detach().numpy() * 255
    raw_hm = raw_hm.squeeze()
    inout = inout.cpu().detach().numpy()
    inout = 1 / (1 + np.exp(-inout))
    inout = (1 - inout) * 255
    norm_map = cv2.resize(raw_hm, (height, width)) - inout
    draw_image = cv2.cvtColor(draw_image, cv2.COLOR_RGB2BGR)

    # vis
    draw_image = drawImage(draw_image, head_box, (inout, raw_hm), out_threshold=100, color=color)

    return draw_image

In [44]:
def processVideo(video_path, head_data_paths, output_video, model):
    in_video, out_video, frame_length, size = prepareVideo2(video_path, output_video)
    test_transforms = _get_transform()
    width, height = size
    frame_id  =  1
    with torch.no_grad():
      while in_video.isOpened():
          # capture each frame of the video
          ret, frame = in_video.read()
          if not ret:  # end of stream or read error: stop instead of looping forever
              break

          df =  prepareFrame(frame_id, head_data_paths)
          row  =  df.iloc[0]  # single-person assumption: use the first detected head
          # YOLO labels are normalized (cx, cy, w, h); convert them to pixel corners
          l = (row['cx']- row['w']/2) * width
          t = (row['cy']- row['h']/2) * height
          r = (row['cx']+ row['w']/2) * width
          b = (row['cy']+ row['h']/2) * height
          head_boxes = [l, t, r, b]
          draw_image = processImage2(frame, model, head_boxes, test_transforms, (0,1,0))
          out_video.write(draw_image)
          frame_id +=1
          if frame_id == frame_length:
              break

      print('DONE!')
    in_video.release()
    out_video.release()

### Inference

In [47]:
model_weights = '/content/model_demo.pt'

input__video = "/content/yolov5-crowdhuman/SampleVideos/1_RGB.mp4"
head_data = "/content/yolov5-crowdhuman/runs/detect/exp/labels/"
output_video= "/content/outputs/gaze_follows__ETRI.mp4"

# Important to activate the GPU environment
device = torch.device('cuda:0')


In [45]:
model =  build_model(device, model_weights)

In [48]:
processVideo(input__video, head_data, output_video, model)


DONE!


In [49]:
! rm /content/outputs/gaze_follows__ETRI.mp4compressed.mp4
showVideo(output_video) # We can see that the person's face is perfectly detected during the whole video

rm: cannot remove '/content/outputs/gaze_follows__ETRI.mp4compressed.mp4': No such file or directory


## 4. Multi-Person Gaze Following for Videos + Detection + Tracking (suitable for any video)

This chapter combines several modules that work together to obtain the attention of each person in the video:

*   YOLOv5 face-detection module: works spatially, on each frame independently (no temporal information)
*   Deep SORT tracker: works over time, associating detections across frames and maintaining a consistent ID
*   Gaze-following attention: obtains the attention target for each person in the video
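
To illustrate what associating detections over time means, the following sketch matches boxes between two consecutive frames by greedy IoU overlap. This is deliberately not Deep SORT, which additionally relies on a Kalman motion model and appearance features. All names and boxes here are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, next_id, min_iou=0.3):
    """Greedily carry IDs from `tracks` (id -> box) over to new detections."""
    assigned, free_ids = {}, set(tracks)
    for det in detections:
        best = max(free_ids, key=lambda t: iou(tracks[t], det), default=None)
        if best is not None and iou(tracks[best], det) >= min_iou:
            assigned[best] = det
            free_ids.remove(best)
        else:  # no sufficiently overlapping track: start a new identity
            assigned[next_id] = det
            next_id += 1
    return assigned, next_id

tracks = {1: (100, 100, 150, 160), 2: (300, 120, 350, 180)}
detections = [(305, 122, 355, 182), (102, 101, 152, 161), (500, 50, 540, 100)]
tracks, next_id = associate(tracks, detections, next_id=3)
print(tracks)
```

The two detections that overlap existing tracks keep IDs 1 and 2, while the third box, which overlaps nothing, receives a new ID. A real tracker also has to handle occlusions and missed detections, which this sketch ignores.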



### Installation of the Multi-Object Tracker: Deep SORT

In [50]:
os.chdir("/content/")

In [51]:
!git clone --recurse-submodules https://github.com/mikel-brostrom/Yolov5_DeepSort_Pytorch.git



Cloning into 'Yolov5_DeepSort_Pytorch'...
remote: Enumerating objects: 904, done.
remote: Counting objects: 100% (898/898), done.
remote: Compressing objects: 100% (427/427), done.
remote: Total 904 (delta 414), reused 877 (delta 409), pack-reused 6
Receiving objects: 100% (904/904), 25.71 MiB | 29.38 MiB/s, done.
Resolving deltas: 100% (414/414), done.
Submodule 'yolov5' (https://github.com/ultralytics/yolov5.git) registered for path 'yolov5'
Cloning into '/content/Yolov5_DeepSort_Pytorch/yolov5'...
remote: Enumerating objects: 9463, done.        
remote: Counting objects: 100% (27/27), done.        
remote: Compressing objects: 100% (12/12), done.        
remote: Total 9463 (delta 15), reused 25 (delta 15), pack-reused 9436        
Receiving objects: 100% (9463/9463), 9.92 MiB | 24.71 MiB/s, done.
Resolving deltas: 100% (6569/6569), done.
Submodule path 'yolov5': checked out 'aa1859909c96d5e1fc839b2746b45038ee8465c9'


In [52]:
os.chdir("/content/Yolov5_DeepSort_Pytorch")

### Define Functions

In [53]:
def getHeads(head_data_path):
    column_names =  ['frame_idx', 'id', 'bbox_left', 'bbox_top', 'bbox_w', 'bbox_h', 'ns1', 'ns2', 'ns3', 'ns4', 'nan']
    # column_names = ['class_', 'cx', 'cy', 'w', 'h']
    df = pd.read_csv(head_data_path, names=column_names, sep=' ')      
    return df

def prepareHeads(heads_path):
  df = getHeads(heads_path)
  ids = sorted(df['id'].unique())
  return ids,df

In [54]:
def prepareVideo2(video_path ,output_path):
    # Get the video from the input fpath
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print('Could not open {} video'.format(video_path), flush=True)
        sys.exit()
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) ), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) ))
    fourcc = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    out_video = cv2.VideoWriter(output_path, fourcc, fps, size)
    frame_length =  int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    return cap, out_video, frame_length, size

In [55]:
def processImage2(draw_image, model, head_box, test_transforms, color=(0,1,0)):
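    # OpenCV frames are BGR; convert to RGB so the model sees the channel order it expects
    frame_raw = Image.fromarray(cv2.cvtColor(draw_image, cv2.COLOR_BGR2RGB))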

    head = frame_raw.crop((head_box)) # head crop

    head = test_transforms(head) # transform inputs
    frame = test_transforms(frame_raw)

    width, height = frame_raw.size

    head_channel = imutils.get_head_box_channel(head_box[0], head_box[1], head_box[2], head_box[3], width, height,
                                                resolution=input_resolution).unsqueeze(0)

    head = head.unsqueeze(0).cuda()
    frame = frame.unsqueeze(0).cuda()
    head_channel = head_channel.unsqueeze(0).cuda()

    # forward pass
    raw_hm, _, inout = model(frame, head_channel, head)

    # heatmap modulation
    raw_hm = raw_hm.cpu().detach().numpy() * 255
    raw_hm = raw_hm.squeeze()
    inout = inout.cpu().detach().numpy()
    inout = 1 / (1 + np.exp(-inout))
    inout = (1 - inout) * 255
    norm_map = cv2.resize(raw_hm, (height, width)) - inout
    draw_image = cv2.cvtColor(draw_image, cv2.COLOR_RGB2BGR)

    # vis
    draw_image = drawImage(draw_image, head_box, (inout, raw_hm), out_threshold=100, color=color)

    return draw_image

In [56]:
def processVideo(video_path, head_data_paths, output_video, model):
    in_video, out_video, frame_length, size = prepareVideo2(video_path, output_video)
    test_transforms = _get_transform()
    width, height = size
    ids, df = prepareHeads(head_data_paths)
    # One random color per track, indexed by position in `ids`
    # because tracker IDs are not guaranteed to be contiguous
    COLORS = np.random.uniform(0, 255, size=(len(ids)+1, 3))

    frame_id  =  1
    with torch.no_grad():
      while in_video.isOpened():
          # capture each frame of the video
          ret, frame = in_video.read()
          if not ret:  # end of stream or read error: stop instead of writing an empty frame
              break
          heads_frame = df[df['frame_idx'] ==frame_id]
          for _, row in heads_frame.iterrows():
              track_id  = int(row['id'])
              # tracker boxes are (left, top, width, height) in pixels
              l = int(row['bbox_left'])
              t = int(row['bbox_top'])
              r = int(row['bbox_left'])+ int(row['bbox_w'])
              b = int(row['bbox_top'])+ int(row['bbox_h'])
              head_boxes = [l, t, r, b]
              frame = processImage2(frame, model, head_boxes, test_transforms, COLORS[ids.index(track_id)])
          out_video.write(frame)
          frame_id +=1
          percentage = float(frame_id / frame_length) * 100
          if int(percentage) % 20 == 0:
              print("Percentage of frames detected: " + "%0.2f" % percentage + '%', flush=True)
          if frame_id == frame_length:
            break


      print('DONE!')
    in_video.release()
    out_video.release()


#['frame_idx', 'id', 'bbox_left', 'bbox_top', 'bbox_w', 'bbox_h', 'ns1', 'ns2', 'ns3', 'ns4', 'nan']
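
The column list above suggests MOT-style rows, one per tracked box, with `(left, top, width, height)` in pixels. Below is a minimal sketch for grouping such rows by frame, assuming space-separated values and using a made-up sample. `boxes_by_frame` is a hypothetical helper.

```python
from collections import defaultdict

def boxes_by_frame(text):
    """Map frame index -> list of (track_id, (left, top, right, bottom))."""
    frames = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # skip empty or malformed rows
        frame_idx, track_id, left, top, w, h = (int(float(v)) for v in parts[:6])
        frames[frame_idx].append((track_id, (left, top, left + w, top + h)))
    return dict(frames)

# Hypothetical tracker output: frame, id, left, top, width, height, then unused fields
sample = (
    "1 1 100 80 40 50 -1 -1 -1 -1 \n"
    "1 2 300 90 42 52 -1 -1 -1 -1 \n"
    "2 1 102 81 40 50 -1 -1 -1 -1 \n"
)
print(boxes_by_frame(sample))
```

Grouping once up front avoids filtering the whole table again for every frame of the video.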

### Inference

For the inference we use a GIF found online. However, any other video can be downloaded or supplied instead, and the system will infer gaze following for each person in it.

#### Download the video

In [57]:
!wget https://j.gifs.com/325l3R@facebook.gif -O gif.mp4

--2021-10-04 15:44:16--  https://j.gifs.com/325l3R@facebook.gif
Resolving j.gifs.com (j.gifs.com)... 104.119.188.67, 104.119.188.17
Connecting to j.gifs.com (j.gifs.com)|104.119.188.67|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7020749 (6.7M) [image/gif]
Saving to: ‘gif.mp4’


2021-10-04 15:44:18 (5.19 MB/s) - ‘gif.mp4’ saved [7020749/7020749]



#### Define Variables

In [58]:
model_weights = '/content/model_demo.pt'
faces_weights  = "/content/yolov5-crowdhuman/crowdhuman_yolov5m.pt"

input__video = "/content/Yolov5_DeepSort_Pytorch/gif.mp4"
head_data = "/content/Yolov5_DeepSort_Pytorch/inference/output/gif.txt"
output_video= "outputs/gaze_follows__def.mp4"

# Important to activate the GPU environment
device = torch.device('cuda:0')
!mkdir outputs

#### Detect and Track the Faces for the people in the video

In [59]:
! python3 track.py --source $input__video --yolo_weights $faces_weights --classes 1  --save-txt # tracks faces

Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...
100% 755k/755k [00:00<00:00, 19.0MB/s]
Downloading https://github.com/mikel-brostrom/Yolov5_DeepSort_Pytorch/releases/download/v3.0/ckpt.t7 to deep_sort_pytorch/deep_sort/deep/checkpoint/ckpt.t7...
100% 43.9M/43.9M [00:01<00:00, 26.4MB/s]

video 1/1 (1/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.062s)
video 1/1 (2/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.058s)
video 1/1 (3/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.055s)
video 1/1 (4/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.053s)
video 1/1 (5/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.052s)
video 1/1 (6/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.048s)
video 1/1 (7/201) /content/Yolov5_DeepSort_Pytorch/gif.mp4: 384x640 2 heads, Done. (0.050s)
video 1/1 (8/201) /co

In [60]:
output_faces_detections  = "/content/Yolov5_DeepSort_Pytorch/inference/output/gif.txt" # CHANGE: depends on the output file written by the tracking cell above


#### Infer the Eye-Gaze Following from the people

In [61]:
model =  build_model(device, model_weights)

In [62]:
processVideo(input__video, head_data, output_video, model)


Percentage of frames detected: 1.00%
Percentage of frames detected: 20.40%
Percentage of frames detected: 20.90%
Percentage of frames detected: 40.30%
Percentage of frames detected: 40.80%
Percentage of frames detected: 60.20%
Percentage of frames detected: 60.70%
Percentage of frames detected: 80.10%
Percentage of frames detected: 80.60%
Percentage of frames detected: 100.00%
DONE!


In [63]:
showVideo('/content/Yolov5_DeepSort_Pytorch/outputs/gaze_follows__def.mp4') # Shows the gaze-following result for every tracked person in the video