<center><h1> MultiParty Tracking with MediaPipe: Top-View Hand Tracking  </h1>


<h3> Wim Pouw ( wim.pouw@donders.ru.nl )<br>James Trujillo ( james.trujillo@donders.ru.nl )<br>
    18-11-2021 </h3>
    
<img src="./images/BOOTCAMP.png"> </center>

<h3> Info documents </h3>
In this module, we demonstrate how to perform motion tracking using the lightweight tool MediaPipe, and consider some of the pros and cons of this method. Specifically, we'll be using MediaPipe for hand tracking in situations where multiple people are in frame and the camera records from a top view.
<br><br>

* location code: 
https://github.com/WimPouw/EnvisionBootcamp2021/tree/main/Python/MediaBodyTracking

* citation: 
Pouw, W.,  &  Trujillo, J.P. (2021-11-18). <i> MultiParty Tracking with MediaPipe: Top-View Hand Tracking  </i> \[day you visited the site]. Retrieved from: https://github.com/WimPouw/EnvisionBootcamp2021/tree/main/Python/MediaBodyTracking 


<h3> Introduction </h3>
The first thing we will cover is how to use MediaPipe to track hand motion from multiple people. MediaPipe offers a computationally lightweight solution for capturing hand motion from multiple people (or just one person). We'll first go over some code to get the hand tracking.

<h4>resources</h4>
* https://github.com/google/mediapipe
<br><br>
* https://google.github.io/mediapipe/solutions/hands.html
<br><br>
* Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
<br><br>
<h3> Hand Tracking</h3>
The hand tracking algorithm provided below captures the x,y,z keypoints of just the hands, from everyone in frame.  Let's do some tracking and see what we get!
<br>
First, let's load some packages and set our paths.

In [1]:
%config Completer.use_jedi = False
import cv2
import sys
import mediapipe
import pandas as pd
import numpy as np
import csv
from os import listdir
from os.path import isfile, join
  
#initialize modules
drawingModule = mediapipe.solutions.drawing_utils #the module(s) used from the mediapipe package
handsModule = mediapipe.solutions.hands           #the module(s) used from the mediapipe package

In [2]:
#list all videos in mediafolder
mypath = "./MediaToAnalyze/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))] # get all files that are in mediatoanalyze
#time series output folder
foldtime = "./Timeseries_Output/"

In [3]:
################################some preparatory functions and lists for saving the data
#take a MediaPipe (Google protobuf) result object and convert it into a list of strings, one per printed line
def makegoginto_str(gogobj):
    gogobj = str(gogobj).strip("[]")
    gogobj = gogobj.split("\n")
    return(gogobj[:-1]) #ignore last element as this has nothing

#Hand landmarks
markers = ['WRIST', 'THUMB_CMC', 'THUMB_MCP', 'THUMB_IP', 'THUMB_TIP', 
 'INDEX_MCP', 'INDEX_PIP', 'INDEX_DIP', 'INDEX_TIP', 
 'MIDDLE_MCP', 'MIDDLE_PIP', 'MIDDLE_DIP','MIDDLE_TIP', 
 'RING_MCP', 'RING_PIP', 'RING_DIP', 'RING_TIP', 
 'PINKY_MCP', 'PINKY_PIP', 'PINKY_DIP', 'PINKY_TIP']

#turn the stringified position traces into clean values
def listpostions(newsamplelmarks):
    tracking_p = []
    for value in newsamplelmarks:
        stripped = value.split(':', 1)[1]
        stripped = stripped.strip() #remove spaces in the string if present
        tracking_p.append(stripped) #add to this list  
    return(tracking_p)

#a function that only retrieves the numerical info in a string
def only_numerics(seq):
    seq_type= type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
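
To make the two string helpers above more concrete, here is a minimal sketch of what they do to a printed landmark list. The `sample` string is a hand-written stand-in (an assumption about how the landmark container prints; the exact text format may differ between mediapipe/protobuf versions), and `to_lines` and `to_values` are hypothetical names that mirror `makegoginto_str` and `listpostions`.

```python
# Hand-written stand-in for str(handLandmarks.landmark) with two landmarks.
# ASSUMPTION: the real printed format may differ per mediapipe/protobuf version.
sample = "[x: 0.36\ny: 0.57\nz: -0.01\n, x: 0.38\ny: 0.59\nz: -0.02\n]"

def to_lines(obj):
    """Mirror of makegoginto_str: drop the brackets, split into lines."""
    text = str(obj).strip("[]")
    return text.split("\n")[:-1]  # the last element is an empty string

def to_values(lines):
    """Mirror of listpostions: keep only the text after the first colon."""
    return [line.split(":", 1)[1].strip() for line in lines]

lines = to_lines(sample)
values = to_values(lines)
print(lines)   # one 'name: value' string per coordinate
print(values)  # only the numeric strings, in x, y, z order per landmark
```

Because the values come out in x, y, z order for each landmark, they line up with the `X_`, `Y_`, `Z_` column names that are built from the `markers` list.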

Now we'll perform the actual tracking. This block goes through each video file in your directory, gets the video frames (images) using cv2, creates an output video file, and then collects the tracked points. The keypoint coordinates are drawn onto the video frame to visualize the tracking, and are also saved to a .csv file for later analysis. <br>

In [6]:
#loop through the video files in the media folder
for ff in onlyfiles:
    #capture the video and save some video properties
    capture = cv2.VideoCapture(mypath+ff)
    frameWidth = capture.get(cv2.CAP_PROP_FRAME_WIDTH)
    frameHeight = capture.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = capture.get(cv2.CAP_PROP_FPS)

    print(frameWidth, frameHeight, fps ) #print some video info to the console
    
    #make a video file where we will project keypoints on
    samplerate = fps #make the same as current 
    fourcc = cv2.VideoWriter_fourcc(*'XVID') #(*'XVID')
    out = cv2.VideoWriter('Videotracking_output/'+ff[:-4]+'.avi', fourcc, fps= samplerate, frameSize = (int(frameWidth), int(frameHeight))) #make sure that frame height/width is the same as the original

    #make a variable list with x, y, z, info where data is appended to
    markerxyz = []
    for mark in markers:
        for pos in ['X', 'Y', 'Z']:
            nm = pos + "_" + mark
            markerxyz.append(nm)
    addvariable = ['index', 'confidence', 'hand', 'time']
    addvariable.extend(markerxyz)
    time = 0
    fr = 1
    timeseries = [addvariable]
    #MAIN ROUTINE
    #For finetuning the tracking, see: https://google.github.io/mediapipe/solutions/hands.html
    with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.5, min_tracking_confidence=0.75, max_num_hands=6) as hands:
        while (True):
            ret, frame = capture.read()
            if ret == True:
                results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                # the results.multi_hand_landmarks should contain sets of x,y,z values for each landmark
                # However, they have no label or ID, just raw coordinates. 
                # we do know which set of coordinates corresponds to which joint:
                # see https://google.github.io/mediapipe/solutions/hands.html and figure 2.21 on that page
                if results.multi_hand_landmarks is not None: 
                    #attach an id based on location                    
                    for handLandmarks, handinfo in zip(results.multi_hand_landmarks,results.multi_handedness):
                        # these first few lines just convert the results output into something more workable
                        newsamplelmarks = makegoginto_str(handLandmarks.landmark)
                        newsamplelmarks = listpostions(newsamplelmarks)
                        newsampleinfo = makegoginto_str(handinfo) #get info the hands
                        # now we compile the data into a complete row, and add it to our dataframe
                        fuldataslice = [fr, newsampleinfo[2], newsampleinfo[3]]
                        fuldataslice.extend([str(time)]) #add time
                        fuldataslice.extend(newsamplelmarks) #add positions
                        timeseries.append(fuldataslice)
                        #note: newsampleinfo[2] is the confidence score and newsampleinfo[3] the handedness label (see the output columns)
                        for point in handsModule.HandLandmark:
                            normalizedLandmark = handLandmarks.landmark[point]
                            # now draw the landmark onto the video frame
                            pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, int(frameWidth), int(frameHeight))
                            if pixelCoordinatesLandmark is not None: # None is returned when the landmark falls outside the frame
                                cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)
                if results.multi_hand_landmarks is None:
                    timeseries.append(["NA"]) #add a row of NAs
                cv2.imshow('Test hand', frame)
                out.write(frame)  #comment this out if you don't want to write the output video
                time = round(time+1000/samplerate)
                fr = fr+1
                if cv2.waitKey(1) == 27:
                    break
            if ret == False:
                break
    out.release()
    capture.release()
    cv2.destroyAllWindows()

    ####################################################### data to be written row-wise in csv file
    data = timeseries

    # open the csv file in 'w+' mode and write all rows
    with open(foldtime+ff[:-4]+'.csv', 'w+', newline ='') as file:
        write = csv.writer(file)
        write.writerows(data)

1440.0 1080.0 29.97017053449149
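
A note on the timestamps: the loop above computes `time` by repeatedly rounding `time+1000/samplerate`, so a small rounding error is added at every frame. The sketch below compares that running sum with a timestamp computed directly from the frame index. The function names are hypothetical and only serve the illustration; this is one possible alternative, not a change to the tracking code above.

```python
fps = 29.97017053449149  # frame rate printed for the sample video above

def accumulated_times(n_frames, fps):
    """Timestamps (ms) as in the tracking loop: round after every step."""
    times, t = [], 0
    for _ in range(n_frames):
        times.append(t)
        t = round(t + 1000 / fps)
    return times

def indexed_times(n_frames, fps):
    """Timestamps (ms) computed from the frame index: round only once."""
    return [round(i * 1000 / fps) for i in range(n_frames)]

acc = accumulated_times(300, fps)
idx = indexed_times(300, fps)
print(acc[:4], idx[:4])         # the first few timestamps of each method
print(idx[-1] - acc[-1], "ms")  # drift after 300 frames (about 10 seconds)
```

For short clips the difference is negligible, but for longer recordings, or when the time series has to be aligned with audio or other sensors, computing the time from the frame index avoids the drift.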


Let's take a first look at the data to see what kind of output we get. 

In [5]:
print(foldtime+ff[:-4]+'.csv')
df = pd.read_csv(foldtime+ff[:-4]+'.csv')
df.head()

./Timeseries_Output/sampletopview.csv


Unnamed: 0,index,confidence,hand,time,X_WRIST,Y_WRIST,Z_WRIST,X_THUMB_CMC,Y_THUMB_CMC,Z_THUMB_CMC,...,Z_PINKY_MCP,X_PINKY_PIP,Y_PINKY_PIP,Z_PINKY_PIP,X_PINKY_DIP,Y_PINKY_DIP,Z_PINKY_DIP,X_PINKY_TIP,Y_PINKY_TIP,Z_PINKY_TIP
0,1,score: 0.9747177958488464,"label: ""Right""",0,0.36233,0.57738,2.766056e-07,0.379316,0.5907,-0.019496,...,-0.000562,0.423765,0.517262,-0.006143,0.43966,0.520554,-0.009332,0.452664,0.525868,-0.01083
1,1,score: 0.9968168139457703,"label: ""Right""",0,0.589795,0.797021,1.672304e-07,0.608119,0.779976,-0.004363,...,-0.010393,0.563613,0.718554,-0.013781,0.56236,0.700599,-0.015491,0.562121,0.686501,-0.016087
2,2,score: 0.927975058555603,"label: ""Right""",33,0.363977,0.577015,2.70593e-07,0.38088,0.589034,-0.019743,...,0.001318,0.425784,0.51405,-0.005703,0.440824,0.519644,-0.009894,0.453483,0.525963,-0.011872
3,2,score: 0.9930723905563354,"label: ""Right""",33,0.59047,0.799091,1.585804e-07,0.608578,0.779634,-0.002823,...,-0.010179,0.563655,0.718578,-0.013116,0.562503,0.700143,-0.014482,0.562225,0.685411,-0.014995
4,3,score: 0.9250645041465759,"label: ""Right""",66,0.364033,0.577941,2.389433e-07,0.380352,0.589998,-0.019178,...,0.002812,0.425795,0.515333,-0.003822,0.440245,0.519717,-0.008195,0.452456,0.524815,-0.010517


Above we have the first 5 rows of our output data. The first named column, "index", provides you with the frame number. Note that each frame may have multiple rows, if multiple hands are tracked in that frame. We also get a confidence score, a handedness label (right or left), and x,y,z coordinates (x and y scaled to 0-1; see below) for each keypoint. <br>
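
As the table shows, the "confidence" and "hand" columns still carry their text prefixes (`score: ` and `label: `). Below is a minimal sketch of how these cells could be cleaned; `parse_confidence` and `parse_hand` are hypothetical helper names, and the example strings are copied from the first row above.

```python
def parse_confidence(cell):
    """Turn a cell like 'score: 0.97' into a float."""
    return float(cell.split(":", 1)[1])

def parse_hand(cell):
    """Turn a cell like 'label: "Right"' into the bare label."""
    return cell.split(":", 1)[1].strip().strip('"')

confidence = parse_confidence("score: 0.9747177958488464")
hand = parse_hand('label: "Right"')
print(confidence, hand)
```

With the dataframe loaded above, these helpers could be applied per column, for example `df['confidence'].map(parse_confidence)`, after dropping the "NA" rows that mark frames without a detected hand.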
<h3> Output Details </h3>
The 3D coordinate output is certainly an advantage of MediaPipe, as it is able to provide some sense of depth, even if you don't have multiple camera angles or an actual depth image (e.g., as recorded by infrared sensors). The authors of MediaPipe achieve this by training their detector model on a synthetic dataset where they could vary the pose and orientation of the hand in many ways, but always have ground-truth 3D coordinates. As they state in the Zhang et al., 2020 paper:  <i>"Synthetic dataset: To even better cover the possible hand poses and provide additional supervision for depth, we render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates. We use a commercial 3D hand model that is rigged with 24 bones and includes 36 blendshapes, which control fingers and palm thickness. The model also provides 5 textures with different skin tones. We created video sequences of transformation between hand poses and sampled 100K images from the videos." Zhang et al., 2020 </i><br>
It is important to note, however, that the coordinates in this output do not correspond to real-world units such as meters. Instead, x and y are normalized by the image width and height: a point with x,y coordinates = 0.5, 0.5 would be in the center of the image, while x,y = 0.25, 0.75 would indicate that the point is 1/4 of the way from left to right, and 3/4 of the way from top to bottom (x,y = 0,0 is the top left corner). Depth (z) is relative to the wrist. In other words, the wrist is taken as the origin (0 depth), and smaller values are estimated to be closer to the camera, and larger values further away. <br>
This relative scaling makes it difficult to compare across videos with different camera set-ups, but is quite intuitive when looking at the coordinates compared to the actual video.
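
To relate the normalized coordinates back to the video, they can be multiplied by the frame size. The sketch below uses a hypothetical helper, `normalized_to_pixel` (not MediaPipe's own function), with the 1440 x 1080 frame size printed for the sample video and the first wrist position from the table above.

```python
def normalized_to_pixel(x, y, width, height):
    """Map normalized x,y (0-1, origin at top left) to integer pixel positions."""
    px = min(int(x * width), width - 1)    # clamp so x = 1.0 stays inside the frame
    py = min(int(y * height), height - 1)
    return px, py

frame_width, frame_height = 1440, 1080  # sample video dimensions printed above

center = normalized_to_pixel(0.5, 0.5, frame_width, frame_height)
wrist = normalized_to_pixel(0.36233, 0.57738, frame_width, frame_height)
print(center, wrist)
```

Note that this only converts x and y; the z value has no pixel equivalent, since it only expresses depth relative to the wrist.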

<br>However, at the moment we don't know whether the first row in frame 1 is the same hand as the first row in frame 2. This is easier when there is just one person, as MediaPipe does differentiate between left and right hands. But with multiple people in frame, we don't know which left and right hands belong together! We'll cover a potential solution to this in the module on linking and pairing hands. 
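
To give a first intuition for the row-order problem, here is a minimal sketch that pairs hands across two consecutive frames by wrist distance. This is an illustrative assumption, not necessarily the approach taken in the linking and pairing module; `match_hands` is a hypothetical function, and the wrist positions are taken from frames 1 and 2 in the table above.

```python
import math

def match_hands(prev_wrists, curr_wrists):
    """Greedily pair each current wrist with the closest unused previous wrist.

    Returns a dict mapping the row position in the current frame to the
    row position in the previous frame.
    """
    unused = set(range(len(prev_wrists)))
    matches = {}
    for ci, wrist in enumerate(curr_wrists):
        if not unused:
            break  # more hands now than in the previous frame
        best = min(unused, key=lambda pi: math.dist(prev_wrists[pi], wrist))
        matches[ci] = best
        unused.remove(best)
    return matches

frame1 = [(0.36233, 0.57738), (0.589795, 0.797021)]    # X_WRIST, Y_WRIST, frame 1
frame2 = [(0.363977, 0.577015), (0.59047, 0.799091)]   # X_WRIST, Y_WRIST, frame 2

print(match_hands(frame1, frame2))
print(match_hands(frame1, frame2[::-1]))  # same hands, listed in reverse order
```

A greedy nearest-neighbour match like this can fail when hands cross or move quickly between frames, which is one reason a more careful linking step is needed.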