# MediaPipe Holistic Keypoint Extraction

## Overview

This notebook extracts body, hand, and face keypoints from videos using **Google's MediaPipe Holistic** solution. The extracted data is saved as CSV files for further analysis in multimodal interaction research.

### What This Notebook Does

1. **Extracts MediaPipe Holistic landmarks** from video files:
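   - 33 body landmarks (X, Y, Z coordinates plus a visibility score for each)
   - 42 hand landmarks (21 per hand, with X, Y, Z coordinates)
   - 478 face landmarks (with X, Y, Z coordinates; the count is 478 because the notebook sets `refine_face_landmarks=True`, which adds 10 iris landmarks to the 468 face-mesh landmarks)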

2. **Exports time series data** as CSV files:
   - `*_body.csv` - Body keypoints
   - `*_hands.csv` - Hand keypoints  
   - `*_face.csv` - Face keypoints

### Key Features

- Batch processing of multiple videos
- Automatic handling of missing landmarks (filled with NaN values)
- Frame-by-frame time series output with millisecond timestamps
- Simple folder-based workflow (input videos → output CSV files)
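
Because missing detections are written as NaN, the share of frames with a detection can be measured directly from an exported file. The following is a minimal sketch, assuming pandas is available; the helper name `detection_rate` and the small in-memory table are hypothetical stand-ins for a real `*_hands.csv`.

```python
import io

import pandas as pd


def detection_rate(df, prefix):
    """Fraction of frames where at least one column containing `prefix` is not NaN."""
    cols = [c for c in df.columns if prefix in c]
    detected = df[cols].notna().any(axis=1)
    return float(detected.mean())


# Hypothetical stand-in for a hands CSV: 4 frames, left wrist missing in one of them.
# csv.writer stores np.nan as the text "nan", which pandas reads back as NaN.
csv_text = """time,X_LEFT_WRIST,Y_LEFT_WRIST,Z_LEFT_WRIST
0.0,0.41,0.62,-0.01
33.4,0.42,0.61,-0.01
66.7,nan,nan,nan
100.1,0.44,0.60,-0.02
"""
hands = pd.read_csv(io.StringIO(csv_text))
rate = detection_rate(hands, "LEFT_")
print(f"Left hand detected in {rate:.0%} of frames")
```

With a real file, `pd.read_csv` would take the CSV path instead of the in-memory text.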



## 📑 Notebook overview
1. **Environment setup**: import the required packages and set the input and output folders
2. **Landmark definitions**: name the body, hand, and face landmarks, build the CSV column headers, and define the parsing helpers
3. **Processing**: run MediaPipe Holistic on every video and write the CSV files
4. **Output and next steps**: the files that are produced and what they can be used for

## ⚙️ Environment setup

In [1]:
# Import required packages
import mediapipe as mp
import cv2
import numpy as np
import pandas as pd
import csv
import os
from os import listdir
from os.path import isfile, join

# Configure folder paths
input_folder = "../input_videos/"
output_folder = "../Mediapipe_results/"

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# List all video files in input folder
video_files = [f for f in listdir(input_folder) if isfile(join(input_folder, f)) and not f.startswith('.')]

# Display configuration
print("MediaPipe Holistic Keypoint Extraction")
print("="*50)
print(f"Input folder: {os.path.abspath(input_folder)}")
print(f"Output folder: {os.path.abspath(output_folder)}")
print(f"\nVideos to process: {len(video_files)}")
for i, vf in enumerate(video_files, 1):
    print(f"  {i}. {vf}")

MediaPipe Holistic Keypoint Extraction
Input folder: /Users/ferdinandpaar/Library/Mobile Documents/com~apple~CloudDocs/Projects/MPI_Max_Planck_Institut/MPI_Github/MPI_website_github/MediaPipe_keypoints_extraction/input_videos
Output folder: /Users/ferdinandpaar/Library/Mobile Documents/com~apple~CloudDocs/Projects/MPI_Max_Planck_Institut/MPI_Github/MPI_website_github/MediaPipe_keypoints_extraction/Mediapipe_results

Videos to process: 1
  1. salma_hayek_short.mp4
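
The listing above picks up every non-hidden file in the input folder, including files that are not videos. One possible refinement is to keep only files with a video extension. The sketch below is an assumption rather than part of the original workflow: `VIDEO_EXTENSIONS` and `list_videos` are hypothetical names, and the extension set is only illustrative.

```python
import os
import tempfile

# Illustrative set of extensions (an assumption); extend it to match your data.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def list_videos(folder):
    """Return the sorted, non-hidden files in `folder` that have a video extension."""
    videos = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        if os.path.isfile(path) and not name.startswith(".") and extension in VIDEO_EXTENSIONS:
            videos.append(name)
    return videos


# Demonstration on a temporary folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["clip_b.MP4", "clip_a.mov", "notes.txt", ".DS_Store"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_videos(tmp)

print(found)
```

Sorting the names also makes the processing order reproducible across machines, since `os.listdir` returns entries in arbitrary order.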


In [2]:
# Initialize MediaPipe modules
mp_holistic = mp.solutions.holistic

# Define landmark names for body (33 landmarks)
markers_body = [
    'NOSE', 'LEFT_EYE_INNER', 'LEFT_EYE', 'LEFT_EYE_OUTER', 'RIGHT_EYE_INNER', 'RIGHT_EYE', 'RIGHT_EYE_OUTER',
    'LEFT_EAR', 'RIGHT_EAR', 'MOUTH_LEFT', 'MOUTH_RIGHT', 'LEFT_SHOULDER', 'RIGHT_SHOULDER', 'LEFT_ELBOW', 
    'RIGHT_ELBOW', 'LEFT_WRIST', 'RIGHT_WRIST', 'LEFT_PINKY', 'RIGHT_PINKY', 'LEFT_INDEX', 'RIGHT_INDEX',
    'LEFT_THUMB', 'RIGHT_THUMB', 'LEFT_HIP', 'RIGHT_HIP', 'LEFT_KNEE', 'RIGHT_KNEE', 'LEFT_ANKLE', 'RIGHT_ANKLE',
    'LEFT_HEEL', 'RIGHT_HEEL', 'LEFT_FOOT_INDEX', 'RIGHT_FOOT_INDEX'
]

# Define landmark names for hands (42 landmarks - 21 per hand)
markers_hands = [
    'LEFT_WRIST', 'LEFT_THUMB_CMC', 'LEFT_THUMB_MCP', 'LEFT_THUMB_IP', 'LEFT_THUMB_TIP', 'LEFT_INDEX_FINGER_MCP',
    'LEFT_INDEX_FINGER_PIP', 'LEFT_INDEX_FINGER_DIP', 'LEFT_INDEX_FINGER_TIP', 'LEFT_MIDDLE_FINGER_MCP', 
    'LEFT_MIDDLE_FINGER_PIP', 'LEFT_MIDDLE_FINGER_DIP', 'LEFT_MIDDLE_FINGER_TIP', 'LEFT_RING_FINGER_MCP', 
    'LEFT_RING_FINGER_PIP', 'LEFT_RING_FINGER_DIP', 'LEFT_RING_FINGER_TIP', 'LEFT_PINKY_FINGER_MCP', 
    'LEFT_PINKY_FINGER_PIP', 'LEFT_PINKY_FINGER_DIP', 'LEFT_PINKY_FINGER_TIP',
    'RIGHT_WRIST', 'RIGHT_THUMB_CMC', 'RIGHT_THUMB_MCP', 'RIGHT_THUMB_IP', 'RIGHT_THUMB_TIP', 'RIGHT_INDEX_FINGER_MCP',
    'RIGHT_INDEX_FINGER_PIP', 'RIGHT_INDEX_FINGER_DIP', 'RIGHT_INDEX_FINGER_TIP', 'RIGHT_MIDDLE_FINGER_MCP', 
    'RIGHT_MIDDLE_FINGER_PIP', 'RIGHT_MIDDLE_FINGER_DIP', 'RIGHT_MIDDLE_FINGER_TIP', 'RIGHT_RING_FINGER_MCP', 
    'RIGHT_RING_FINGER_PIP', 'RIGHT_RING_FINGER_DIP', 'RIGHT_RING_FINGER_TIP', 'RIGHT_PINKY_FINGER_MCP', 
    'RIGHT_PINKY_FINGER_PIP', 'RIGHT_PINKY_FINGER_DIP', 'RIGHT_PINKY_FINGER_TIP'
]

# Define landmark names for face (478 landmarks)
markers_face = [str(x) for x in range(478)]

# Create column headers for CSV files
# Body: time, X_LANDMARK, Y_LANDMARK, Z_LANDMARK, visibility_LANDMARK
columns_body = ['time']
for mark in markers_body:
    for pos in ['X', 'Y', 'Z', 'visibility']:
        columns_body.append(f"{pos}_{mark}")

# Hands: time, X_LANDMARK, Y_LANDMARK, Z_LANDMARK
columns_hands = ['time']
for mark in markers_hands:
    for pos in ['X', 'Y', 'Z']:
        columns_hands.append(f"{pos}_{mark}")

# Face: time, X_LANDMARK, Y_LANDMARK, Z_LANDMARK
columns_face = ['time']
for mark in markers_face:
    for pos in ['X', 'Y', 'Z']:
        columns_face.append(f"{pos}_{mark}")

# Helper functions
def num_there(s):
    """Check if there are numbers in a string"""
    return any(i.isdigit() for i in s)

def convert_landmarks_to_str(landmarks_obj):
    """Convert MediaPipe landmark object to string list"""
    landmarks_str = str(landmarks_obj).strip("[]")
    landmarks_str = landmarks_str.split("\n")
    return landmarks_str[:-1]  # Ignore last empty element

def extract_positions(landmarks):
    """Extract numerical position values from landmark strings"""
    landmarks_str = convert_landmarks_to_str(landmarks)
    positions = []
    for value in landmarks_str:
        if num_there(value):
            stripped = value.split(':', 1)[1].strip()
            positions.append(stripped)
    return positions

print(f"Body landmarks: {len(markers_body)}")
print(f"Hand landmarks: {len(markers_hands)}")
print(f"Face landmarks: {len(markers_face)}")
print(f"\nTotal CSV columns:")
print(f"  Body CSV: {len(columns_body)} columns")
print(f"  Hands CSV: {len(columns_hands)} columns")
print(f"  Face CSV: {len(columns_face)} columns")

Body landmarks: 33
Hand landmarks: 42
Face landmarks: 478

Total CSV columns:
  Body CSV: 133 columns
  Hands CSV: 127 columns
  Face CSV: 1435 columns
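
The helpers above recover coordinates by parsing the printed form of a landmark object. The same values are also exposed as attributes: every entry of `landmarks.landmark` carries `x`, `y`, `z` and, for pose landmarks, `visibility`. The sketch below reads them directly and returns floats instead of strings. `landmarks_to_row` is a hypothetical name, and `SimpleNamespace` objects stand in for real MediaPipe results so the example runs without a video.

```python
import math
from types import SimpleNamespace


def landmarks_to_row(landmarks, n_landmarks, with_visibility=False):
    """Flatten a landmark list to [x0, y0, z0, ...]; return NaNs if nothing was detected."""
    fields = ("x", "y", "z", "visibility") if with_visibility else ("x", "y", "z")
    if landmarks is None:
        return [math.nan] * (n_landmarks * len(fields))
    return [float(getattr(lm, f)) for lm in landmarks.landmark for f in fields]


# Stand-in for results.left_hand_landmarks: 21 landmarks with x, y, z attributes.
fake_hand = SimpleNamespace(
    landmark=[SimpleNamespace(x=0.1 * i, y=0.5, z=-0.01) for i in range(21)]
)

row = landmarks_to_row(fake_hand, 21)
missing = landmarks_to_row(None, 21)
print(len(row), len(missing))
```

Inside the processing loop this could be called as `landmarks_to_row(results.pose_landmarks, 33, with_visibility=True)` for the body and `landmarks_to_row(results.face_landmarks, 478)` for the face. It removes the dependency on how the landmark object happens to be printed.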


## 2. Process Videos and Extract Keypoints


In [None]:
# Process each video
for video_idx, video_file in enumerate(video_files, 1):
    print(f"\n{'='*60}")
    print(f"Processing video {video_idx}/{len(video_files)}: {video_file}")
    print(f"{'='*60}")
    
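    # Open video
    video_path = os.path.join(input_folder, video_file)
    capture = cv2.VideoCapture(video_path)
    
    # Skip files that OpenCV cannot open (e.g. non-video files in the input folder)
    if not capture.isOpened():
        print(f"  Could not open {video_file}, skipping")
        capture.release()
        continue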
    
    # Get video properties
    fps = capture.get(cv2.CAP_PROP_FPS)
    frame_width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    
    print(f"Video properties:")
    print(f"  Resolution: {frame_width}x{frame_height}")
    print(f"  FPS: {fps}")
    print(f"  Total frames: {total_frames}")
    
    # Initialize time series data storage
    time = 0
    ts_body = [columns_body]
    ts_hands = [columns_hands]
    ts_face = [columns_face]
    
    frame_count = 0
    
    # Process video with MediaPipe Holistic
    with mp_holistic.Holistic(
        static_image_mode=False,
        model_complexity=1,
        enable_segmentation=False,
        refine_face_landmarks=True
    ) as holistic:
        
        while True:
            ret, frame = capture.read()
            
            if not ret:
                break
            
            frame_count += 1
            
            # Display progress every 30 frames (the reported frame count is an estimate and may be 0)
            if frame_count % 30 == 0 and total_frames > 0:
                progress = (frame_count / total_frames) * 100
                print(f"  Progress: {frame_count}/{total_frames} frames ({progress:.1f}%)", end='\r')
            
            # Convert BGR to RGB
            image_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            
            # Process with MediaPipe
            results = holistic.process(image_rgb)
            
            # Extract landmarks or fill with NaN if not detected
            if results.pose_landmarks is not None:
                # Extract body landmarks
                sample_body = extract_positions(results.pose_landmarks)
                sample_body.insert(0, time)
                
                # Extract hand landmarks with NaN for missing hands
                sample_left_hand = extract_positions(results.left_hand_landmarks)
                sample_right_hand = extract_positions(results.right_hand_landmarks)
                
                # Fill missing left hand with NaN
                if len(sample_left_hand) == 0:
                    sample_left_hand = [np.nan for _ in range(63)]  # 21 landmarks * 3 coordinates = 63
                
                # Fill missing right hand with NaN
                if len(sample_right_hand) == 0:
                    sample_right_hand = [np.nan for _ in range(63)]  # 21 landmarks * 3 coordinates = 63
                
                # Combine hands
                sample_hands = sample_left_hand + sample_right_hand
                sample_hands.insert(0, time)
                
                # Extract face landmarks
                sample_face = extract_positions(results.face_landmarks) if results.face_landmarks else []
                if len(sample_face) == 0:
                    sample_face = [np.nan for _ in range(1434)]  # 478 landmarks * 3 coordinates = 1434
                sample_face.insert(0, time)
                
            else:
                # No landmarks detected - fill with NaN
                sample_body = [np.nan for _ in range(len(columns_body) - 1)]
                sample_body.insert(0, time)
                
                sample_hands = [np.nan for _ in range(len(columns_hands) - 1)]
                sample_hands.insert(0, time)
                
                sample_face = [np.nan for _ in range(len(columns_face) - 1)]
                sample_face.insert(0, time)
            
            # Append to time series
            ts_body.append(sample_body)
            ts_hands.append(sample_hands)
            ts_face.append(sample_face)
            
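            # Timestamp of the next frame in milliseconds, computed from the frame
            # index so floating-point error does not accumulate over long videos
            time = frame_count * 1000 / fps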
    
    # Release video capture
    capture.release()
    
    print(f"\n  Completed processing {frame_count} frames")
    
    # Save CSV files
    base_filename = os.path.splitext(video_file)[0]
    
    # Save body CSV
    body_csv_path = os.path.join(output_folder, f"{base_filename}_body.csv")
    with open(body_csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(ts_body)
    print(f"  Saved: {base_filename}_body.csv")
    
    # Save hands CSV
    hands_csv_path = os.path.join(output_folder, f"{base_filename}_hands.csv")
    with open(hands_csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(ts_hands)
    print(f"  Saved: {base_filename}_hands.csv")
    
    # Save face CSV
    face_csv_path = os.path.join(output_folder, f"{base_filename}_face.csv")
    with open(face_csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(ts_face)
    print(f"  Saved: {base_filename}_face.csv")

print(f"\n{'='*60}")
print("Processing complete!")
print(f"All CSV files saved to: {os.path.abspath(output_folder)}")
print(f"{'='*60}")



Processing video 1/1: salma_hayek_short.mp4
Video properties:
  Resolution: 1920x1080
  FPS: 29.97002997002997
  Total frames: 1136


I0000 00:00:1761036019.467228 12537653 gl_context.cc:369] GL version: 2.1 (2.1 Metal - 89.4), renderer: Apple M1 Pro
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W0000 00:00:1761036019.564038 12540535 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.

  Progress: 1110/1136 frames (97.7%)
  Completed processing 1110 frames
  Saved: salma_hayek_short_body.csv
  Saved: salma_hayek_short_hands.csv
  Saved: salma_hayek_short_face.csv

Processing complete!
All CSV files saved to: /Users/ferdinandpaar/Library/Mobile Documents/com~apple~CloudDocs/Projects/MPI_Max_Planck_Institut/MPI_Github/MPI_website_github/MediaPipe_keypoints_extraction/Mediapipe_results
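
The run above processed 1110 frames although the container reported 1136, so the reported frame count is best treated as an estimate. The `time` column depends only on the frame index and the reported FPS. Here is a small sketch of that arithmetic using the values printed above; `frame_time_ms` is a hypothetical helper name.

```python
fps = 29.97002997002997   # FPS reported for the example video
frames_processed = 1110   # frames actually read from the example video


def frame_time_ms(frame_index, fps):
    """Timestamp in milliseconds of a zero-based frame index."""
    return frame_index * 1000.0 / fps


last_ms = frame_time_ms(frames_processed - 1, fps)
duration_s = frames_processed / fps
print(f"Last timestamp: {last_ms:.1f} ms, duration: {duration_s:.2f} s")
```

Computing each timestamp from the index, rather than adding `1000 / fps` repeatedly, keeps rounding error from building up over long recordings.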


## Done!

Your MediaPipe keypoint data has been extracted and saved as CSV files in the `Mediapipe_results/` folder.

### Output Files

For each video, three CSV files are created:
- `<video_name>_body.csv` - 33 body landmarks with X, Y, Z, visibility
- `<video_name>_hands.csv` - 42 hand landmarks with X, Y, Z
- `<video_name>_face.csv` - 478 face landmarks with X, Y, Z
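
Since the columns follow a fixed order (all fields of one landmark, then the next landmark), a table can be reshaped into an array indexed by frame, landmark, and field. The sketch below assumes pandas and NumPy; `body_to_array` is a hypothetical helper, and a two-landmark synthetic table stands in for a real `*_body.csv`. MediaPipe reports X and Y normalized by image width and height, so multiplying by the frame size gives pixel coordinates.

```python
import numpy as np
import pandas as pd


def body_to_array(df, fields=("X", "Y", "Z", "visibility")):
    """Reshape a body table into an array of shape (frames, landmarks, fields)."""
    values = df.drop(columns="time").to_numpy(dtype=float)
    return values.reshape(len(df), -1, len(fields))


# Synthetic stand-in for a body CSV with only two landmarks and two frames.
names = ["NOSE", "LEFT_EYE_INNER"]
columns = ["time"] + [f"{p}_{m}" for m in names for p in ("X", "Y", "Z", "visibility")]
df = pd.DataFrame(
    [
        [0.0, 0.50, 0.30, -0.2, 0.99, 0.52, 0.28, -0.2, 0.98],
        [33.4, 0.51, 0.30, -0.2, 0.99, 0.53, 0.28, -0.2, 0.98],
    ],
    columns=columns,
)

arr = body_to_array(df)
print(arr.shape)

# Normalized X of the first landmark converted to pixels for a 1920-pixel-wide frame.
nose_x_px = arr[:, 0, 0] * 1920
```

For the hands and face files, which have no visibility column, the same helper would be called with `fields=("X", "Y", "Z")`.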

### Next Steps

These CSV files can be used for:
- Smoothing and normalization (see `Smoothing/` folder)
- Kinematic analysis - speed, acceleration, jerk (see `Speed_Acceleration_Jerk/` folder)
- Gesture segmentation (see `Submovements_Holds/` folder)
- Multimodal merging with ELAN annotations (see `Merging_Motion_ELAN/` folder)
- And more!
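
As an illustration of the kinematic step, the speed of a single keypoint can be estimated from its X and Y columns and the `time` column. This is only a sketch under stated assumptions, not the method used in the `Speed_Acceleration_Jerk/` folder: `keypoint_speed` is a hypothetical helper, gaps are filled by linear interpolation, and the result is in normalized image units per second.

```python
import numpy as np
import pandas as pd


def keypoint_speed(df, marker):
    """Speed of one keypoint in normalized image units per second."""
    t = df["time"].to_numpy(dtype=float) / 1000.0  # milliseconds to seconds
    xy = df[[f"X_{marker}", f"Y_{marker}"]].interpolate(limit_direction="both")
    xy = xy.to_numpy(dtype=float)
    vx = np.gradient(xy[:, 0], t)  # finite-difference velocity along X
    vy = np.gradient(xy[:, 1], t)  # finite-difference velocity along Y
    return np.hypot(vx, vy)


# Synthetic stand-in for a body CSV: one keypoint moving at constant velocity,
# with one frame missing (NaN) to show the gap filling.
df = pd.DataFrame(
    {
        "time": [0.0, 100.0, 200.0, 300.0],
        "X_RIGHT_WRIST": [0.0, 0.3, np.nan, 0.9],
        "Y_RIGHT_WRIST": [0.0, 0.4, np.nan, 1.2],
    }
)
speed = keypoint_speed(df, "RIGHT_WRIST")
print(speed)
```

On real data, smoothing the coordinates before differentiating is usually advisable, because frame-to-frame jitter is amplified by the derivative.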
