<a href="https://colab.research.google.com/github/ChelsaMJ/Vision-Based-Detection-of-Emotion-Suppression-Using-Facial-Motor-Dynamics/blob/main/03_audio_visual_latency_voxceleb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In unconstrained, real-world videos, does facial activation precede speech, and can we observe variable delays that are consistent with inhibition / regulation?

### Understanding the dataset

VoxCeleb provides:
- real faces
- real speech
- no acting
- no scripted silence
- no emotion labels

This makes it well suited to studying:

- latency
- inhibition
- temporal analysis

### What signals do we extract?

From `vox2_test_mp4`

We extract:
- frames (OpenCV)
- facial landmarks (MediaPipe)
- facial motion over time
- face activation onset
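
The last two bullets can be sketched as follows. This is a minimal illustration, not the notebook's final method: it assumes the landmarks for a clip are already stacked into an array of shape `(T, K, 2)`, and the function names, the 10-frame baseline window, and the `k` multiplier are all hypothetical choices.

```python
import numpy as np

def motion_energy(landmarks):
    """Mean landmark displacement between consecutive frames.

    landmarks: array of shape (T, K, 2) holding x, y per landmark per frame.
    Returns an array of length T - 1.
    """
    pts = np.asarray(landmarks, dtype=float)
    step = np.diff(pts, axis=0)                      # (T - 1, K, 2)
    return np.linalg.norm(step, axis=2).mean(axis=1)

def activation_onset(energy, fps, baseline_frames=10, k=3.0):
    """Time in seconds of the first frame that moves more than the baseline.

    The threshold is baseline mean + k * baseline std (an assumed rule).
    Returns None when the threshold is never exceeded.
    """
    energy = np.asarray(energy, dtype=float)
    base = energy[:baseline_frames]
    threshold = base.mean() + k * base.std()
    above = np.flatnonzero(energy[baseline_frames:] > threshold)
    if above.size == 0:
        return None
    # energy[i] describes the step from frame i to frame i + 1
    return (baseline_frames + above[0] + 1) / fps
```

A fixed baseline window assumes the face is roughly still at the start of the clip, which unconstrained VoxCeleb footage does not guarantee, so the window would need checking per sample.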

From `vox2_test_aac`

We extract:
- waveform (librosa)
- silence → speech transition
- speech onset time
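
One simple way to find the silence → speech transition is short-time RMS energy with a relative threshold. The sketch below uses only NumPy and assumes the waveform `y` and sample rate `sr` have already been loaded (for example with `librosa.load`, provided an audio backend that can decode the files is available). The function name, frame sizes, and the 10% threshold are illustrative assumptions.

```python
import numpy as np

def speech_onset(y, sr, frame_length=400, hop_length=160, rel_threshold=0.1):
    """Start time (s) of the first frame whose RMS exceeds
    rel_threshold * (maximum RMS in the clip). Returns None for pure silence."""
    y = np.asarray(y, dtype=float)
    n_frames = 1 + max(0, len(y) - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    if rms.max() == 0:
        return None
    first = int(np.argmax(rms > rel_threshold * rms.max()))
    return first * hop_length / sr
```

Subtracting a face-activation onset time from this value gives a latency in seconds; a positive result means the face moved before speech began. A relative threshold will also fire on loud non-speech noise, so it is only a starting point.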

# Setup & imports

In [1]:
!pip install mediapipe librosa opencv-python



In [9]:
!pip uninstall -y mediapipe



In [10]:
!pip install mediapipe==0.10.21


Collecting mediapipe==0.10.21
  Downloading mediapipe-0.10.21-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.7 kB)
Collecting numpy<2 (from mediapipe==0.10.21)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting protobuf<5,>=4.25.3 (from mediapipe==0.10.21)
  Downloading protobuf-4.25.8-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
INFO: pip is looking at multiple versions of jax to determine which version is compatible with other requirements. This could take a while.
Collecting jax (from mediapipe==0.10.21)
  Downloading jax-0.9.0-py3-none-any.whl.metadata (13 kB)
Collecting jaxlib (from mediapipe==0.10.21)
  Downloading jaxlib-0.9.0-cp312-cp312-manylinux_2_27_x86_64.whl.metadata (1.3 kB)
Collecting jax (from mediapipe==0.10.21)
  Downloading jax-0.8.3-py3-none-any.whl.metadata (13 kB)

In [1]:
import mediapipe as mp

print(mp.__version__)
print("solutions" in dir(mp))
print(mp.solutions.face_mesh.FaceMesh)


0.10.21
True
<class 'mediapipe.python.solutions.face_mesh.FaceMesh'>


## Mount Drive and imports

In [3]:
from google.colab import drive
drive.mount('/content/drive')

import os
import glob
import cv2
import numpy as np
import pandas as pd
import librosa
import matplotlib.pyplot as plt

import mediapipe as mp


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Paths & sample selection

In [4]:
BASE = "/content/drive/MyDrive/Research Dataset/datasets/voxCeleb_test"

VIDEO_BASE = os.path.join(BASE, "vox2_test_mp4", "mp4")
AUDIO_BASE = os.path.join(BASE, "vox2_test_aac", "aac")
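
For sample selection, each video needs its matching audio file. The helper below is a sketch that assumes the usual VoxCeleb2 layout of `<speaker id>/<video id>/<clip>.mp4` mirrored by `<speaker id>/<video id>/<clip>.m4a`; that layout, the `.m4a` extension, and the name `paired_samples` are assumptions to verify against the Drive copy.

```python
import glob
import os

def paired_samples(video_base, audio_base, limit=None):
    """Return (video_path, audio_path) tuples for clips present in both trees.

    Assumes both trees share the same relative structure:
    <speaker id>/<video id>/<clip>.<ext>
    """
    pattern = os.path.join(video_base, "*", "*", "*.mp4")
    pairs = []
    for video_path in sorted(glob.glob(pattern)):
        if limit is not None and len(pairs) >= limit:
            break
        rel = os.path.relpath(video_path, video_base)
        audio_path = os.path.join(audio_base, os.path.splitext(rel)[0] + ".m4a")
        # Skip videos that have no audio counterpart
        if os.path.exists(audio_path):
            pairs.append((video_path, audio_path))
    return pairs
```

Calling `paired_samples(VIDEO_BASE, AUDIO_BASE, limit=20)` would then give a small, reproducible subset to develop against before running on the full test split.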

### We will:

- Use MediaPipe Face Mesh as a feature extractor

- Track specific muscle-related regions:
  - Eyebrows (AU1, AU2 proxy)
  - Eyes (blink / tension proxy)
  - Mouth corners & lips (AU12, AU25 proxy)
- Measure temporal dynamics, not static emotion

This matches the focus of the study:
> Emotion suppression + thought-to-speech latency
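
As an illustration of measuring dynamics rather than static expression, the sketch below turns one frame of landmarks into two mouth measurements and then differentiates them over time. The indices (13, 14, 61, 291, 33, 263) come from the key landmark list in the Face Mesh cell below; normalising by the distance between the outer eye corners, and the function names, are assumptions made for this example.

```python
import numpy as np

MOUTH_UPPER, MOUTH_LOWER = 13, 14      # inner lip midpoints
MOUTH_CORNER_A, MOUTH_CORNER_B = 61, 291
EYE_OUTER_A, EYE_OUTER_B = 33, 263     # used only as a face-size reference

def mouth_features(points):
    """points: (N, 2) array of landmark x, y for one frame (N >= 292).

    Returns (opening, width), both divided by the outer-eye distance so the
    values do not depend on how large the face is in the frame.
    """
    pts = np.asarray(points, dtype=float)
    scale = np.linalg.norm(pts[EYE_OUTER_A] - pts[EYE_OUTER_B])
    if scale == 0:
        return None
    opening = np.linalg.norm(pts[MOUTH_UPPER] - pts[MOUTH_LOWER]) / scale
    width = np.linalg.norm(pts[MOUTH_CORNER_A] - pts[MOUTH_CORNER_B]) / scale
    return opening, width

def feature_velocity(values, fps):
    """Frame-to-frame rate of change of a feature, in units per second."""
    return np.diff(np.asarray(values, dtype=float)) * fps
```

Lip opening is a rough stand-in for AU25 and mouth width for AU12; neither is a validated action-unit estimate.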

## MediaPipe Face Mesh (key points only)

In [5]:
mp_face = mp.solutions.face_mesh

face_mesh = mp_face.FaceMesh(
    static_image_mode=False,
    max_num_faces=1,
    refine_landmarks=False,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

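# Key landmark indices (eyes, brows, mouth).
# Side labels follow MediaPipe's convention, i.e. the subject's own left/right.
KEY_LANDMARKS = [
    33, 133, 159, 145,      # subject's right eye: outer corner, inner corner, upper lid, lower lid
    362, 263, 386, 374,    # subject's left eye: inner corner, outer corner, upper lid, lower lid
    70, 105, 336, 300,     # eyebrows: two points on the right brow, two on the left
    61, 291, 13, 14        # mouth: both corners, inner upper lip, inner lower lip
]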

AttributeError: module 'mediapipe' has no attribute 'solutions'

In [6]:
import mediapipe as mp

print(mp.__version__)
print(dir(mp))


0.10.32
['Image', 'ImageFormat', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'tasks']
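
The two version checks explain the `AttributeError`: under 0.10.21 the module exposes `solutions`, while the 0.10.32 listing above contains only `Image`, `ImageFormat`, and `tasks`. A small guard run before the Face Mesh cell makes that failure explicit. This is a sketch; `require_solutions` is a hypothetical helper, not part of MediaPipe.

```python
def require_solutions(mp_module):
    """Return mp_module.solutions.face_mesh, or fail with a clear message.

    Pass the imported mediapipe module. The check relies only on whether the
    `solutions` attribute exists, as seen in the outputs above.
    """
    if not hasattr(mp_module, "solutions"):
        version = getattr(mp_module, "__version__", "unknown")
        raise RuntimeError(
            f"mediapipe {version} does not expose `solutions`; "
            "install mediapipe==0.10.21 and restart the runtime."
        )
    return mp_module.solutions.face_mesh
```

In the notebook this would be used as `mp_face = require_solutions(mp)`. The runtime restart matters because an already imported package is not replaced by a later `pip install` in the same session.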
