# 🏃‍♂️ MediaPipe Pose & Holistic — Hands‑on Notebook
Practical test‑bed for MediaPipe pipelines.

1. Run on **your own video** or **web‑cam**.
2. Compare **model types** (Pose vs Holistic) and **model_complexity** settings.
3. See the impact of **tracking** (landmark smoothing).

In [1]:
import os, sys, cv2, numpy as np
from pathlib import Path
from datetime import datetime
import mediapipe as mp
mp_pose, mp_holistic = mp.solutions.pose, mp.solutions.holistic
mp_draw, mp_styles = mp.solutions.drawing_utils, mp.solutions.drawing_styles
print('MediaPipe version:', mp.__version__)

MediaPipe version: 0.10.21


## 🔧 Key Parameters — MediaPipe Configuration

Tune these settings to control your pipeline behavior:

- **VIDEO_SOURCE**: `int` or `str`  
   – `0` for webcam or `'/path/to/video.mp4'`
- **MODEL**: `{'pose', 'holistic'}`  
   – Choose between Pose-only or full Holistic (face, hands, pose)
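- **MODEL_COMPLEXITY**: `0 | 1 | 2`  
   – Trade-off between inference speed and landmark accuracy (pose landmark model: 0 = lite, 1 = full, 2 = heavy)
- **ENABLE_TRACKING**: `bool` (`True | False`)  
   – Maps to MediaPipe's `smooth_landmarks`: filters landmarks across frames (less jitter at the cost of slight lag). Frame-to-frame tracking itself is always active, because the pipeline is created with `static_image_mode=False`.
- **SAVE_OVERLAY**: `bool` (`True | False`)  
   – Write a video with the drawn landmarks for later review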

Experiment with these to balance performance, accuracy, and output needs!
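To run those comparisons systematically, a small helper can enumerate every combination of the parameters. This is a hypothetical sketch: `config_grid` is not part of MediaPipe or of this notebook; each dict simply uses the keyword names of `run_inference`, defined further down.

```python
from itertools import product


def config_grid(models=('pose', 'holistic'), complexities=(0, 1, 2), tracking=(True, False)):
    """Hypothetical helper: every combination of the key parameters as a keyword dict."""
    return [dict(model=m, model_complexity=c, enable_tracking=t)
            for m, c, t in product(models, complexities, tracking)]
```

For example, once `run_inference` is defined, `{tuple(cfg.values()): run_inference(VIDEO_SOURCE, **cfg)[0] for cfg in config_grid(models=('holistic',))}` collects one keypoint array per configuration. Use a video file rather than the webcam so every run sees the same frames.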


In [2]:
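def create_pipeline(model='holistic', model_complexity=1, enable_tracking=True):
    """Build a MediaPipe Pose or Holistic solution configured for video input."""
    if model not in {'pose', 'holistic'}:
        raise ValueError("model must be 'pose' or 'holistic'")
    kw = dict(model_complexity=model_complexity,   # 0 = lite, 1 = full, 2 = heavy
              smooth_landmarks=enable_tracking,     # temporal filtering: less jitter, slight lag
              enable_segmentation=False,
              min_detection_confidence=0.5,
              min_tracking_confidence=0.5)
    # static_image_mode=False: detect once, then track landmarks across frames
    solution = mp_pose.Pose if model == 'pose' else mp_holistic.Holistic
    return solution(static_image_mode=False, **kw)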


In [4]:
def run_inference(source, model='holistic', model_complexity=1, enable_tracking=True,
                  save_overlay=False, out_dir='results'):
    """Run Pose/Holistic on a video file or webcam, draw landmarks, return keypoints.

    kpts has shape (n_frames, n_landmarks, 4) with [x, y, z, visibility] per row:
    33 pose rows, plus 21 left-hand and 21 right-hand rows for 'holistic'.
    Parts that were not detected in a frame are zero-filled, so every frame has the same shape.
    """
    cap = cv2.VideoCapture(source)
    if not cap.isOpened():
        raise RuntimeError(f'Cannot open {source}')
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    base = Path(source).stem if isinstance(source, str) else 'webcam'
    tag = f'{base}_model_{model}_tracking_{enable_tracking}_complexity_{model_complexity}'
    overlay_path = out_dir / f'{tag}_overlay.mp4'  # Path join avoids a double slash
    writer = None
    if save_overlay:
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        fps = cap.get(cv2.CAP_PROP_FPS) or 25  # fall back when the source reports 0 FPS
        w, h = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(str(overlay_path), fourcc, fps, (w, h))
    kpts = []
    with create_pipeline(model, model_complexity, enable_tracking) as pipe:  # closes the graph on exit
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            res = pipe.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # MediaPipe expects RGB
            if model == 'holistic':
                mp_draw.draw_landmarks(frame, res.face_landmarks, mp_holistic.FACEMESH_CONTOURS,
                                       connection_drawing_spec=mp_styles.get_default_face_mesh_contours_style())
                mp_draw.draw_landmarks(frame, res.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
                mp_draw.draw_landmarks(frame, res.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
            # Pose is drawn in both modes (POSE_CONNECTIONS is the same set in pose and holistic)
            mp_draw.draw_landmarks(frame, res.pose_landmarks, mp_pose.POSE_CONNECTIONS,
                                   landmark_drawing_spec=mp_styles.get_default_pose_landmarks_style())
            # Pose rows first; zero-fill when no person was found so the array stays rectangular
            if res.pose_landmarks:
                frame_k = [[lm.x, lm.y, lm.z, lm.visibility] for lm in res.pose_landmarks.landmark]
            else:
                frame_k = [[0, 0, 0, 0]] * 33
            if model == 'holistic':
                for hand in (res.left_hand_landmarks, res.right_hand_landmarks):
                    if hand:
                        # Hand landmarks carry no visibility; 1.0 marks "hand detected"
                        frame_k += [[lm.x, lm.y, lm.z, 1.0] for lm in hand.landmark]
                    else:
                        frame_k += [[0, 0, 0, 0]] * 21
            kpts.append(frame_k)
            if writer:
                writer.write(frame)
            cv2.imshow('MediaPipe', frame)
            if cv2.waitKey(1) == 27:  # ESC stops early
                break
    cap.release()
    cv2.destroyAllWindows()
    if writer:
        writer.release()
    kpts = np.array(kpts, dtype=np.float32)
    np.save(out_dir / f'{tag}_kpts.npy', kpts)  # one file per configuration, so runs don't overwrite each other
    return kpts, (str(overlay_path) if save_overlay else None)


## 📝 Exercise 1: Your video or webcam
Set parameters below and run. Press **ESC** to stop.

In [5]:
VIDEO_SOURCE = "input_videos/salma_hayek_short.mp4"  # 0 for webcam, or the path to your video file
MODEL = 'holistic'        # 'pose' or 'holistic'
MODEL_COMPLEXITY = 2      # 0, 1 or 2
ENABLE_TRACKING = True    # landmark smoothing (smooth_landmarks)
SAVE_OVERLAY = True       # write the annotated video to results/

kpts, overlay = run_inference(VIDEO_SOURCE, MODEL, MODEL_COMPLEXITY,
                              ENABLE_TRACKING, SAVE_OVERLAY)
print('Keypoints shape:', kpts.shape)
if overlay: print('Overlay saved to', overlay)

I0000 00:00:1749903076.070117 17051478 gl_context.cc:369] GL version: 2.1 (2.1 Metal - 88.1), renderer: Apple M3 Pro
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W0000 00:00:1749903076.134921 17051882 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.

Keypoints shape: (27, 75, 4)
Overlay saved to results//salma_hayek_short_model_holistic_tracking_True_complexity_2_overlay.mp4


## 📝 Exercise 2 — Tracking Off vs On

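1. Pick a short video file (5–10 s) with moderate motion, set it as `VIDEO_SOURCE`, and run Exercise 1 with `ENABLE_TRACKING=True`. Store the result under its own name, e.g. `kpts_on = kpts.copy()`.
2. Rerun with  
    ```python
    ENABLE_TRACKING=False
    ```
    and store this result too, e.g. `kpts_off = kpts.copy()`. Use a file rather than the webcam so both runs see identical frames.
3. Compare **jitter vs lag**:  
    - Plot the x-position of the nose (pose landmark 0) over time for both runs on the same axes.  
    - Compute the mean frame-to-frame |Δx| and its standard deviation for each run.  
4. Summarize your findings:  
    - Does smoothing reduce the variance? By how much?  
    - How much additional latency does it introduce (in ms)? A lag of *k* frames corresponds to *k* × 1000 / FPS ms.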
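The statistics in step 3 and the latency in step 4 can be computed with a few lines of NumPy. This is a sketch under stated assumptions: the helper names are hypothetical, and estimating lag as the shift that maximizes the correlation between the raw and smoothed trajectories is one simple choice, not a method prescribed by MediaPipe.

```python
import numpy as np


def jitter_stats(x):
    """Mean and standard deviation of the frame-to-frame displacement |dx|."""
    dx = np.abs(np.diff(np.asarray(x, dtype=float)))
    return float(dx.mean()), float(dx.std())


def estimate_lag_frames(raw, smoothed, max_lag=10):
    """Shift k >= 0 (frames) for which smoothed[t] best matches raw[t - k].

    Uses the Pearson correlation of the overlapping parts of the two series.
    """
    raw = np.asarray(raw, dtype=float)
    smoothed = np.asarray(smoothed, dtype=float)
    n = min(len(raw), len(smoothed))
    best_k, best_r = 0, -np.inf
    for k in range(min(max_lag, n - 2) + 1):  # keep at least 2 overlapping samples
        r = np.corrcoef(raw[:n - k], smoothed[k:n])[0, 1]
        if r > best_r:
            best_k, best_r = k, r
    return best_k


def lag_ms(lag_frames, fps):
    """Convert a lag in frames to milliseconds."""
    return lag_frames * 1000.0 / fps
```

Assuming the two runs are stored as `kpts_on` and `kpts_off`, the nose x-coordinate is `kpts_on[:, 0, 0]` and `kpts_off[:, 0, 0]`; compare `jitter_stats` for both, then `lag_ms(estimate_lag_frames(x_off, x_on), fps)`. Exclude frames where the nose row is all zeros (no person detected), since they produce artificial jumps.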



## 📝 Exercise 3 — Landmark Indices & Trajectories

1. List all pose and hand landmark indices:  
    ```python
    from pprint import pprint
    pprint({i: lm.name for i, lm in enumerate(mp_pose.PoseLandmark)})
    pprint({i: lm.name for i, lm in enumerate(mp_holistic.HandLandmark)})
    ```
2. Choose three landmarks (e.g., NOSE, LEFT_WRIST, RIGHT_WRIST).  
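3. Extract their 2D trajectories from `kpts` and plot them over time. Pose landmark *i* is row *i* of `kpts`; in Holistic mode, left-hand landmark *j* is row `33 + j` and right-hand landmark *j* is row `54 + j`:  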
    ```python
    import matplotlib.pyplot as plt

    # example indices
    nose_idx = mp_pose.PoseLandmark.NOSE.value
    lw_idx   = mp_pose.PoseLandmark.LEFT_WRIST.value
    rw_idx   = mp_pose.PoseLandmark.RIGHT_WRIST.value

    t = np.arange(kpts.shape[0])
    for idx, label in [(nose_idx, 'Nose'), (lw_idx, 'L-Wrist'), (rw_idx, 'R-Wrist')]:
        x, y = kpts[:, idx, :2].T
        plt.plot(t, x, label=f'{label} x')
        plt.plot(t, y, '--', label=f'{label} y')
    plt.xlabel('Frame'); plt.ylabel('Normalized coord')
    plt.legend(); plt.show()
    ```
4. **Bonus**:  
    - Identify frames where `visibility < 0.5` for each of these landmarks.  
    - Overlay a small marker (e.g., red dot) on the video at those low-confidence frames.
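For the bonus, two small helpers make the row layout and the visibility test explicit. This is a sketch with hypothetical names (`kpt_row`, `low_visibility_frames`), assuming the Holistic layout built by `run_inference`: 33 pose rows, then 21 left-hand rows, then 21 right-hand rows per frame.

```python
import numpy as np

N_POSE, N_HAND = 33, 21  # landmark counts of MediaPipe Pose and Hands


def kpt_row(part, idx):
    """Row of `kpts` holding landmark `idx` of `part` ('pose', 'left_hand' or 'right_hand')."""
    offsets = {'pose': 0, 'left_hand': N_POSE, 'right_hand': N_POSE + N_HAND}
    limit = N_POSE if part == 'pose' else N_HAND
    if not 0 <= idx < limit:
        raise IndexError(f'{part} has no landmark {idx}')
    return offsets[part] + idx


def low_visibility_frames(kpts, row, thresh=0.5):
    """Frame indices where the visibility channel (column 3) of `row` is below `thresh`."""
    return np.flatnonzero(kpts[:, row, 3] < thresh)
```

To mark those frames while re-reading the video, `cv2.circle(frame, (int(x * w), int(y * h)), 6, (0, 0, 255), -1)` draws a filled red dot (OpenCV colors are BGR), where `x, y` are the normalized coordinates from `kpts` and `w, h` the frame size. Remember that hand rows hold 1.0/0.0 rather than a real visibility score.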


## 🧐 Discussion

- **Model Complexity Trade-off**  
   - Complexity 0 vs 1 vs 2: how does inference **FPS** change?  
      • Time how long it takes to process the entire provided clip, then divide the number of frames by that time.  
   - Does higher complexity give **more accurate** landmarks?  
      • Visually inspect the overlays at key joints (e.g. wrists and elbows).  

- **Holistic vs Pose-Only**  
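   - **Landmark count**: Pose-only returns 33 landmarks; Holistic adds 468 face landmarks and 2 × 21 = 42 hand landmarks. `run_inference` stores only the pose and hand landmarks (75 rows per frame), not the face mesh.  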

- **Tracking (Smoothing) On vs Off**  
   - Jitter vs Lag:  
      • With `ENABLE_TRACKING=True`, landmarks are smoother but react more slowly to sudden motion. This suits analyses that need stable trajectories more than instant response.  
      • With `ENABLE_TRACKING=False`, landmarks jitter more. Does the model complexity affect how much?
   - Do you see a difference in the **stability** of landmarks?  

- **Keypoints Array Structure**  
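   - Shape: `(n_frames, n_landmarks, 4)` → `(frame, [x, y, z, visibility])`, with `n_landmarks` = 33 for Pose and 75 for Holistic (33 pose + 21 left hand + 21 right hand). `x` and `y` are normalized by the image width and height.  
   - Visibility: **NOTE**: for pose landmarks, MediaPipe outputs `visibility` as a float in [0, 1], the estimated likelihood that the landmark is visible in the image (present and not occluded); it is not a binary detected/not-detected flag. **It is only available for pose landmarks, not hand or face landmarks**: for hand rows, `run_inference` writes 1.0 when the hand was detected and 0 for zero-filled rows.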
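For the FPS question, a small timing wrapper is enough. This is a sketch, not part of MediaPipe: `measure_fps` is a hypothetical helper that calls any per-frame function and reports throughput using `time.perf_counter`.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def measure_fps(process: Callable[[Any], Any], frames: Iterable[Any]) -> Tuple[float, int]:
    """Call `process` on every frame; return (frames per second, number of frames)."""
    n = 0
    start = time.perf_counter()
    for frame in frames:
        process(frame)
        n += 1
    elapsed = time.perf_counter() - start
    return (n / elapsed if elapsed > 0 else float('inf')), n
```

For each complexity `c` in `(0, 1, 2)`, something like `with create_pipeline('holistic', c) as pipe: print(c, measure_fps(lambda f: pipe.process(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)), frames))` compares the settings. If `frames` is a generator that calls `cap.read()`, decoding time is included, which matches "end-to-end" runtime; a preloaded list excludes it but needs a lot of memory for a long 1080p clip.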

## 📝 Gesture Segmentation Subset

For gesture segmentation, we only need a handful of pose landmarks and all hand landmarks from the Holistic model. The code below extracts these and saves them to a `.npy` file.

Please run the code and check the overlay video to see which landmarks are extracted. We have already selected suitable parameters for this task.

In [6]:
from utils.extract_mp_pose import extract_keypoints
# Path to the video you want to analyse
video_path = "input_videos/salma_hayek_short.mp4"  # or specify a path to your video file
# video_path = 0 # ← use this to use your webcam as input

# Extract keypoints. The function returns a dictionary with useful metadata.
pose_data = extract_keypoints(
    vidf=video_path,
    save_video=True,
    model_complexity=MODEL_COMPLEXITY
)

OpenCV: FFMPEG: tag 0x5634504d/'MP4V' is not supported with codec id 12 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'
I0000 00:00:1749903096.392733 17051478 gl_context.cc:369] GL version: 2.1 (2.1 Metal - 88.1), renderer: Apple M3 Pro


Video resolution: 1920.0x1080.0, FPS: 29.97002997002997
Number of frames in the video: 1136


Processing frames:   0%|          | 0/1136 [00:00<?, ?frame/s]
W0000 00:00:1749903096.485310 17052299 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.