# **Sutton Trust Music & Science Workshop**

Instructor: Huw Cheston, PhD researcher @ Centre for Music & Science, University of Cambridge

![ST](https://summerschools.suttontrust.com/wp-content/themes/sutton-trust-summer-programme/assets/img/summer_school_logo.png)

© Huw Cheston 2023, hwc31@cam.ac.uk

# Pose estimation using neural networks

![NN](https://forum.khadas.com/uploads/default/original/2X/f/f96d91501a7613128e0881b6adfdb5cff02bb309.gif)

In this workbook, we'll use pose estimation via neural networks to track the movement of dancers and musicians in videos of performances. These networks learn from a large and varied set of examples of people moving, which lets them locate the parts of the human body in each frame of a video and so translate a performance into data. This technology has many potential uses, from improving training and feedback for artists to enabling immersive digital experiences.

We'll use a pose estimation library implemented in [OpenCV](https://opencv.org/). This is much simpler than the example above, which uses [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose), so you shouldn't expect the results to be quite as good. However, it has the advantage of being able to run in the cloud, and it is much quicker. We'll pull performances directly from YouTube, so you won't need to download anything beforehand. You also don't need any programming experience to use this workbook, and all the options will be explained as you go.

## Setup

**Before you do anything else**, hit the *Play* button below, next to the **Show code** line. You may need to move your mouse for this to appear. Please let me know if you get any errors when running this!

In [None]:
# @title
!git clone https://github.com/misbah4064/human-pose-estimation-opencv.git
!pip install yt-dlp
%cd human-pose-estimation-opencv/

import cv2 as cv
import numpy as np
from google.colab.patches import cv2_imshow
from tqdm import tqdm
from IPython.display import HTML
from base64 import b64encode
import os
import pandas as pd

## Run the model

First things first, go to [YouTube](https://youtube.com) and choose a video of a music or dance performance you want to work with. The video must show **one performer or dancer only**: our algorithm can only work with one performer at a time! Try and choose a video where the camera doesn't move too much, as well.

Aside from this, there are no restrictions on genre here, so choose whatever you think might lead to some interesting results! Once you've found a video, copy the link into the field *yt_link* below. It should look something like https://www.youtube.com/watch?v=NlZ0e5FqZEU

If the performance takes a while to start (maybe the video has a long intro), you can use the *starting_position* slider to skip ahead. So, if the music starts 10 seconds into the video, you'd set the slider to 10. Only the 10 seconds of video after this point will be analysed.
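To make the slider concrete, here is a minimal sketch of how the start and end of the clip are worked out. It mirrors the line `end_pos = starting_position + 10` in the code cell below; the helper name `clip_window` is made up purely for illustration and isn't used anywhere else in the workbook.

```python
def clip_window(starting_position, clip_length=10):
    """Return the (start, end) times, in seconds, of the clip that gets analysed."""
    return starting_position, starting_position + clip_length

# If the slider is set to 10, we analyse seconds 10 to 20 of the video
start, end = clip_window(10)
print(f"Analysing seconds {start} to {end} of the video")
```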

Once you've set all the parameters, hit the big "Play" icon as before and wait a minute for the video to process. You should see a progress bar appear to let you know how much longer you'll have to wait.

In [None]:
yt_link = 'https://www.youtube.com/watch?v=zV1qLYukTH8' # @param {type:"string"}
starting_position = 40 # @param {type:"slider", min:1, max:100, step:1}

!yt-dlp $yt_link --force-overwrites -f  "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" -o youtube.mp4
end_pos = starting_position + 10
!ffmpeg -y -hide_banner -loglevel error -i youtube.mp4 -ss $starting_position -to $end_pos -c copy cut.mp4

BODY_PARTS = { "Nose": 0, "Neck": 1, "RShoulder": 2, "RElbow": 3, "RWrist": 4,
               "LShoulder": 5, "LElbow": 6, "LWrist": 7, "RHip": 8, "RKnee": 9,
               "RAnkle": 10, "LHip": 11, "LKnee": 12, "LAnkle": 13, "REye": 14,
               "LEye": 15, "REar": 16, "LEar": 17, "Background": 18 }

POSE_PAIRS = [ ["Neck", "RShoulder"], ["Neck", "LShoulder"], ["RShoulder", "RElbow"],
               ["RElbow", "RWrist"], ["LShoulder", "LElbow"], ["LElbow", "LWrist"],
               ["Neck", "RHip"], ["RHip", "RKnee"], ["RKnee", "RAnkle"], ["Neck", "LHip"],
               ["LHip", "LKnee"], ["LKnee", "LAnkle"], ["Neck", "Nose"], ["Nose", "REye"],
               ["REye", "REar"], ["Nose", "LEye"], ["LEye", "LEar"] ]

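# Size (in pixels) that each frame is resized to before it is passed to the network
width = 368
height = 368
inWidth = width
inHeight = height

# Load the pre-trained pose estimation network
net = cv.dnn.readNetFromTensorflow("graph_opt.pb")
# Minimum confidence needed before we accept a detected body part
thr = 0.2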
def poseDetector(frame):
    frameWidth = frame.shape[1]
    frameHeight = frame.shape[0]
    net.setInput(cv.dnn.blobFromImage(frame, 1.0, (inWidth, inHeight), (127.5, 127.5, 127.5), swapRB=True, crop=False))
    out = net.forward()
    out = out[:, :19, :, :]
    assert(len(BODY_PARTS) == out.shape[1])
    points = []
    data = {}
    for part, i in BODY_PARTS.items():
        # Heatmap of how likely this body part is to be at each location
        heatMap = out[0, i, :, :]
        # Take the most confident location and rescale it to the size of the frame
        _, conf, _, point = cv.minMaxLoc(heatMap)
        x = (frameWidth * point[0]) / out.shape[3]
        y = (frameHeight * point[1]) / out.shape[2]
        # Only keep the point if the network is confident enough
        found = (int(x), int(y)) if conf > thr else None
        points.append(found)
        data[part] = found
    for pair in POSE_PAIRS:
        partFrom = pair[0]
        partTo = pair[1]
        assert(partFrom in BODY_PARTS)
        assert(partTo in BODY_PARTS)
        idFrom = BODY_PARTS[partFrom]
        idTo = BODY_PARTS[partTo]
        if points[idFrom] and points[idTo]:
            cv.line(frame, points[idFrom], points[idTo], (0, 255, 0), 3)
            cv.ellipse(frame, points[idFrom], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)
            cv.ellipse(frame, points[idTo], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)
    t, _ = net.getPerfProfile()
    return frame, data

def frame_iter(capture, description):
  def _iterator():
      while capture.grab():
          yield capture.retrieve()[1]
  return tqdm(
      _iterator(),
      desc=description,
      total=int(capture.get(cv.CAP_PROP_FRAME_COUNT)),
  )

cap = cv.VideoCapture('cut.mp4')
# Ask the video for its size directly, so that the first frame isn't read and then skipped
frame_width = int(cap.get(cv.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv.CAP_PROP_FRAME_HEIGHT))
# The annotated frames are written to output.mp4 at 10 frames per second
out = cv.VideoWriter('output.mp4', cv.VideoWriter_fourcc('M','J','P','G'), 10, (frame_width,frame_height))
all_data = []
frame_num = 1
for frame in frame_iter(cap, 'Processing video ...'):
  output, motion = poseDetector(frame)
  motion['frame'] = frame_num
  all_data.append(motion)
  out.write(output)
  frame_num += 1
cap.release()
out.release()
print("... done!")
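If you're curious about what `poseDetector` is doing for each body part, the following is a rough sketch of the idea rather than the exact code used above. The network produces a "heatmap" for each body part, and we take the location of the highest value and rescale it to the size of the video frame. The cell above does this with `cv.minMaxLoc`; the sketch uses `np.argmax` as a stand-in so it runs on its own. The function name `keypoint_from_heatmap` and the toy heatmap values are made up for illustration.

```python
import numpy as np

def keypoint_from_heatmap(heat_map, frame_width, frame_height, threshold=0.2):
    """Find the most confident location in a heatmap and scale it to frame pixels.

    Returns (x, y) in frame coordinates, or None if the peak is not confident enough.
    """
    # Row and column of the highest value in the heatmap
    row, col = np.unravel_index(np.argmax(heat_map), heat_map.shape)
    confidence = heat_map[row, col]
    if confidence <= threshold:
        return None
    # Rescale from the heatmap grid to the size of the video frame
    x = int(frame_width * col / heat_map.shape[1])
    y = int(frame_height * row / heat_map.shape[0])
    return x, y

# Toy heatmap with a single confident peak (made-up values)
toy = np.zeros((46, 46), dtype=np.float32)
toy[23, 10] = 0.9
print(keypoint_from_heatmap(toy, frame_width=640, frame_height=480))  # (139, 240)
```

Raising the threshold makes the model pickier: fewer body parts are drawn, but the ones that remain are more likely to be in the right place.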


## Create a graph

Once you're happy with how the pose estimation is working, you can press the play button on the next cell to create a graph showing the change in X and Y positions of the detected body parts for every frame of the video.
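As a small illustration of what "change" means here, the sketch below takes some made-up X positions for one body part and works out how far it moved between each pair of frames. The next cell does the same thing with `diff()` for every body part; the numbers and the column name `RWrist_x` in this example are invented.

```python
import numpy as np
import pandas as pd

# Made-up x positions (in pixels) of one body part over five frames;
# NaN marks a frame where the body part was not detected
positions = pd.Series([100, 104, 103, np.nan, 110], name="RWrist_x")

# diff() subtracts the previous frame's value from the current one
change = positions.diff()
print(change.tolist())  # [nan, 4.0, -1.0, nan, nan]
```

The first frame has no previous frame to compare with, so its change is missing. A missed detection also leaves a gap in the frame after it, which is why the lines in the graph can have breaks in them.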

In [None]:
# @title
df = pd.DataFrame(all_data)
df = df.fillna(value=np.nan)

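res = []
for col in df.columns[:-1]:  # every column except 'frame'
  r = []
  for val in df[col]:
    # Each value is an (x, y) tuple, or NaN when the body part wasn't detected
    x, y = val if isinstance(val, tuple) else (np.nan, np.nan)
    r.append({f'{col}_x': x, f'{col}_y': y})
  fmt = pd.DataFrame(r)
  # Convert the positions into frame-to-frame changes
  for c in [f'{col}_x', f'{col}_y']:
    fmt[c] = fmt[c].diff()
  res.append(fmt)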

import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=len(res), ncols=1, sharex=True, sharey=False, figsize=(8, 1*len(res)))
for n, (part, a) in enumerate(zip(res, ax.flatten())):
  for col in part.columns:
    label = col.split('_')[-1].title()
    a.plot(part.index, part[col], label=label)
  a.set(title=part.columns[-1].replace('_y', '').title())
  if n == 0:
    a.legend()

fig.supxlabel('Frames')
fig.supylabel('Change [px]')
fig.subplots_adjust(hspace=0.5, left=0.1, bottom=0.05)

## Evaluate the output

Congratulations, you just used a neural network for the first time! How do the results look? You can try different videos (or different starting positions) by changing the parameters above and pressing the "Play" button once again.

If you can't think of which videos to use, you can try the following:

*   Ballet: https://www.youtube.com/watch?v=zV1qLYukTH8
*   Jazz: https://www.youtube.com/watch?v=-Zi5Xq-1jSU

## Discussion questions

1.   Do particular styles of music or dance lend themselves better to pose estimation? Which styles work better, and what connects them?
2.   What are some of the potential applications of this technology, both in research and in practice?