# Feature Extraction Code Demonstrations

This notebook demonstrates our feature extraction codebase with simple explanations of how each component works. It covers the full pipeline from video input to processed data.

## Table of Contents

1. [Environment Setup](#environment-setup)
   - [Setup Python Environment](#setup-python-environment)
   - [Check Module Imports](#check-module-imports)
   - [Torch and CUDA Check](#torch-and-cuda-check)
   - [Demo Data Overview](#demo-data-overview)
2. [Keypoint Processing](#keypoint-processing)
   - [YOLOv8 Pose Estimation](#yolov8-pose-estimation)
   - [Video to Keypoints Dataframe](#video-to-keypoints-dataframe)
   - [Filtering and Normalization](#filtering-and-normalization)
   - [Full Keypoint Processing Pipeline](#full-keypoint-processing-pipeline)
3. [Audio Processing](#audio-processing)
   - [Extracting Audio from Video](#extracting-audio-from-video)
   - [Speech Recognition](#speech-recognition)
   - [Speaker Diarization](#speaker-diarization)
   - [Additional Audio Analysis](#additional-audio-analysis)
   - [Complete Audio Processing Pipeline](#complete-audio-processing-pipeline)
4. [Visualization and Integration](#visualization-and-integration)
   - [Visualizing Keypoint Data](#visualizing-keypoint-data)
   - [Adding Annotations to Videos](#adding-annotations-to-videos)
   - [Creating Complete Annotated Videos](#creating-complete-annotated-videos)

Let's begin our exploration of these methods!

# Environment Setup

## Setup Python Environment

Follow the instructions in [README.md](README.md) to set up the Python environment.

## Check Module Imports

Run the next cell to check that all modules import correctly. A `ModuleNotFoundError` is _usually_ fixed by installing the missing library with `pip install <library_name>` in a terminal. For example:

```bash
ModuleNotFoundError: No module named 'ultralytics'

> pip install ultralytics
```

There are a few exceptions where the package name differs from the module name:  
For `pyannote` > `pip install pyannote.audio`  
For `dotenv` > `pip install python-dotenv`  

Then restart the kernel and run the cell again.

In [1]:
import os
import sys
import cv2
import json
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ultralytics import YOLO



## Torch and CUDA Check

If you have a GPU and want to use it, make sure that torch is installed with CUDA support. You can check by running the next cell.  
If it returns `True`, you are good to go.  
If it returns `False`, it may be best to start over in a new environment. See the torch install instructions at https://pytorch.org/get-started/locally/

In [2]:
torch.cuda.is_available()
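
If you would rather have later cells fall back to the CPU automatically, a small helper like the one below can choose the device. This is a sketch rather than part of the project code: `pick_device` is a hypothetical name, and it relies only on `torch.cuda.is_available()`.

```python
def pick_device():
    """Return "cuda" when a CUDA-enabled torch build sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed at all, so only the CPU path can work
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `torch.device(...)` wherever a device is needed.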



Import our custom modules and check that they are working.

In [3]:
# first, add project to path so we can import the modules
project_root = os.path.join("..")
sys.path.append(project_root)

# import the functions for video processing
import src.utils as utils

# imported as `keypoint_processor` because the cells below use that name
import src.processors.keypoint_processor as keypoint_processor
import src.processors.audio_processor as audio_processor
import src.processors.video_processor as video_processor
import src.processors.face_processor as face_processor
import src.processors.object_processor as object_processor
import src.processors.video_understanding as video_understanding



# 0.1 Demonstrating some of the function calls we use during feature extraction 

This section gives a set of simple examples showing how to call the models and parse the data they return.


## Demo Data Overview

Where will we find videos, images and audio for our examples? Two videos, their associated audio files and a set of images are available in the `data/demo` directory.

In [4]:


demo_data = os.path.join("..","data", "demo")

# a couple of videos for testing
VIDEO_FILE = os.path.join(demo_data, "2UWdXP.joke1.rep2.take1.Peekaboo_h265.mp4")
VIDEO_FILE2 = os.path.join(demo_data, "2UWdXP.joke2.rep1.take1.NomNomNom_h265.mp4")

AUDIO_FILE = os.path.join(demo_data, "2UWdXP.joke1.rep2.take1.Peekaboo.mp3")
AUDIO_FILE2 = os.path.join(demo_data, "2UWdXP.joke2.rep1.take1.NomNomNom.mp3")

IMAGE1 = os.path.join(demo_data, "mother-and-baby.jpg")
IMAGE2 = os.path.join(demo_data, "peekaboo.png")
IMAGE3 = os.path.join(demo_data, "twopeople.jpg")

videoset = [VIDEO_FILE, VIDEO_FILE2]
audioset = [AUDIO_FILE, AUDIO_FILE2]
photoset = [IMAGE1, IMAGE2, IMAGE3]

## YOLOv8 Pose Estimation

Go to [docs.ultralytics.com](https://docs.ultralytics.com/) for detailed documentation and lots of examples. We demonstrate only a few features here.


In [5]:
# Load YOLOv8 model with pose estimation capability
# The 'n' in yolov8n-pose.pt stands for 'nano' - the smallest and fastest model variant
model = YOLO("yolov8n-pose.pt")

# Run inference on an image
results = model(IMAGE3)
print(f"Results type: {type(results)}")
print(f"Number of results: {len(results)}")
print(f"Fields in first result: {dir(results[0])[:10]}...")





In [6]:
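# Display the image with keypoints, skeleton, and bounding boxes
# plot() returns a BGR array (OpenCV convention), so convert to RGB for matplotlib
labelledimage = cv2.cvtColor(results[0].plot(), cv2.COLOR_BGR2RGB)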
plt.figure(figsize=(10, 8))
plt.imshow(labelledimage)
plt.title("YOLOv8 Pose Estimation Result")
plt.axis("off")
plt.show()

# Extract keypoints as numpy arrays
keypoints = results[0].keypoints.cpu().numpy()
print(f"Keypoints shape: {keypoints.xy.shape} - (persons, keypoints, xy)")
print("\nKeypoint coordinates (x,y):")
print(keypoints.xy)
print("\nKeypoint confidence scores:")
print(keypoints.conf)
print("\nFull keypoint data (x,y,confidence):")
print(keypoints.data)





In [7]:
# YOLOv8 returns keypoints as a 3D tensor with x, y, confidence values
# For storage in dataframes, we typically flatten it to a 1D list
if len(keypoints.data) > 0:  # Check if any people were detected
    xyc = keypoints.data[0].flatten().tolist()  # Flatten first person's keypoints
    print(f"Flattened keypoint data (length: {len(xyc)}):\n{xyc}")
    
    # Explain the structure
    print("\nStructure: Each keypoint has 3 values - [x, y, confidence]")
    print("Example of the first few keypoints:")
    for i in range(0, 15, 3):  # Show first 5 keypoints
        keypoint_idx = i // 3
        print(f"Keypoint {keypoint_idx}: x={xyc[i]:.2f}, y={xyc[i+1]:.2f}, confidence={xyc[i+2]:.2f}")



## Video to Keypoints Dataframe


### Keypoints Dataframe Structure

Our keypoints dataframe has the following structure:

![keypoints dataframe](../docs/keypointsdf.png)

For each video `frame`:
- Each detected person gets a row with `person` label and `index`
- Bounding box coordinates: `x`, `y`, `w`, `h` (center x, center y, width, height)
- Detection confidence: `conf`
- 17 COCO keypoints: Each has 3 values (x, y, confidence)
  - Keypoints include: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
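
To make the column layout concrete, the sketch below maps one person's flattened keypoint list onto named columns. The `k0_x`, `k0_y`, `k0_c` pattern is the one used later in this notebook; we assume it extends to all 17 keypoints. The helper names `keypoint_columns` and `keypoints_to_row` are hypothetical and are not part of the project code.

```python
# COCO keypoint order used by YOLOv8 pose models
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def keypoint_columns(n_keypoints=17):
    """Column names k0_x, k0_y, k0_c, ..., one triple per keypoint."""
    return [f"k{i}_{axis}" for i in range(n_keypoints) for axis in ("x", "y", "c")]


def keypoints_to_row(xyc):
    """Map a flattened [x, y, conf, x, y, conf, ...] list to named columns."""
    columns = keypoint_columns(len(xyc) // 3)
    return dict(zip(columns, xyc))


# Example with made-up values for a single detected person
fake_xyc = [float(v) for v in range(51)]  # 17 keypoints * 3 values
row = keypoints_to_row(fake_xyc)
print(COCO_KEYPOINTS[0], "->", row["k0_x"], row["k0_y"], row["k0_c"])
print(COCO_KEYPOINTS[5], "->", row["k5_x"], row["k5_y"], row["k5_c"])
```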

Now, let's use the convenience function to process a complete video:

In [9]:
# Process a full video with a single function call
# We limit to 60 frames for demonstration purposes
df = utils.videotokeypoints(model, VIDEO_FILE, track=False, max_frames=60)

print(f"Processed {df['frame'].max()+1} frames and extracted {len(df)} rows of pose data")

# Output directory for processed data. This path is an assumption; adjust it to your setup.
data_out = os.path.join("..", "data", "out")
os.makedirs(data_out, exist_ok=True)

# Save to CSV for later use
stemname = os.path.splitext(VIDEO_FILE)[0]
csvpath = os.path.join(data_out, os.path.basename(stemname) + ".csv")
df.to_csv(csvpath, index=False)
print(f"Saved keypoint data to {csvpath}")

# Display a sample
df_read = pd.read_csv(csvpath, index_col=None)
display(df_read.head(3))



## Tracking with model.track()

YOLOv8 also provides a `model.track` method, which tries to keep a consistent ID for each detected person across the frames of a video.

Using it is straightforward. Instead of calling  
`results = model(video_path, stream=True)`

we call  
`results = model.track(video_path, stream=True)`

https://docs.ultralytics.com/modes/track/#persisting-tracks-loop
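
As a minimal sketch of how the tracking results can be read, the helper below collects the track IDs seen in each frame. The name `track_ids_per_frame` is hypothetical; the sketch assumes that each result exposes `boxes.id`, which is `None` when nothing is being tracked in that frame.

```python
def track_ids_per_frame(model, video_path, max_frames=60):
    """Run model.track() on a video and collect the track IDs seen in each frame."""
    rows = []
    results = model.track(video_path, stream=True)  # generator, one result per frame
    for frame_idx, result in enumerate(results):
        if frame_idx >= max_frames:
            break
        ids = result.boxes.id  # None when no track is active in this frame
        track_ids = [] if ids is None else [int(i) for i in ids.tolist()]
        rows.append((frame_idx, track_ids))
    return rows


# Usage with the model and video defined earlier in this notebook:
# for frame_idx, ids in track_ids_per_frame(model, VIDEO_FILE, max_frames=10):
#     print(frame_idx, ids)
```

If the same IDs appear from frame to frame, the tracker is following the same people through the video.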

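## Filtering and Normalization

The next cells return to the keypoint data extracted above. We read the video dimensions, then look at the functions that filter keypoints by confidence, normalize coordinates, and interpolate missing data.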

In [None]:
# Get video dimensions for normalization
video = cv2.VideoCapture(VIDEO_FILE)
width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
video.release()
print(f"Video dimensions: {width}×{height}")

# Let's examine the keypoint processing functions
print("\nKeypoint processor functions:")
print("1. Filter by confidence:")
print(keypoint_processor.filter_keypoints_by_confidence.__doc__)
print("\n2. Normalize coordinates:")
print(keypoint_processor.normalize_keypoints.__doc__)
print("\n3. Interpolate missing data:")
print(keypoint_processor.interpolate_missing_keypoints.__doc__)

In [None]:
# Demonstration of filtering by confidence
# First, let's count how many valid keypoints we have in the original data
conf_columns = [col for col in df_read.columns if col.endswith('_c')]
print("Original data:")
print(f"Total keypoints: {len(df_read) * len(conf_columns)}")

# Apply filtering with different thresholds
for threshold in [0.0, 0.3, 0.5, 0.7]:
    # Using the function from keypoint_processor
    filtered_df = keypoint_processor.filter_keypoints_by_confidence(
        df_read.copy(), confidence_threshold=threshold
    )
    
    # Count keypoints below the threshold. We count in the original data so the
    # result does not depend on how the filter marks the keypoints it removes.
    below_threshold = int((df_read[conf_columns] < threshold).sum().sum())
    
    print(f"Threshold {threshold}: {below_threshold} keypoints filtered out")

# Demonstrate normalization
# Apply filtering first with threshold 0.5
filtered_df = keypoint_processor.filter_keypoints_by_confidence(df_read.copy(), 0.5)

# Then normalize coordinates
normalized_df = keypoint_processor.normalize_keypoints(filtered_df, height, width)

# Compare original and normalized coordinates for one keypoint
print("\nOriginal vs. Normalized coordinates (first row, keypoint 0):")
orig_x = filtered_df['k0_x'].iloc[0]
orig_y = filtered_df['k0_y'].iloc[0]
norm_x = normalized_df['k0_x'].iloc[0]
norm_y = normalized_df['k0_y'].iloc[0]
print(f"Original: ({orig_x:.2f}, {orig_y:.2f}) pixels")
print(f"Normalized: ({norm_x:.4f}, {norm_y:.4f}) [0-1 range]")
print(f"Verification: {norm_x:.4f} = {orig_x:.2f}/{width}, {norm_y:.4f} = {orig_y:.2f}/{height}")
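
To show what these two steps do conceptually, here is a minimal pure-Python sketch. It is not the project's implementation: marking low-confidence keypoints with `None` is an assumption, and the function names are hypothetical. The `(height, width)` argument order follows the `normalize_keypoints` call above.

```python
def filter_by_confidence(keypoints, threshold=0.5):
    """Replace x and y with None for keypoints whose confidence is below threshold."""
    return [
        (x, y, c) if c >= threshold else (None, None, c)
        for x, y, c in keypoints
    ]


def normalize(keypoints, height, width):
    """Scale pixel coordinates to the 0-1 range: x / width, y / height."""
    return [
        (x / width, y / height, c) if x is not None else (None, None, c)
        for x, y, c in keypoints
    ]


# Three made-up keypoints (x, y, confidence) in a 640x480 frame
kps = [(320.0, 240.0, 0.9), (100.0, 50.0, 0.2), (640.0, 480.0, 0.7)]
filtered = filter_by_confidence(kps, threshold=0.5)
normalized = normalize(filtered, height=480, width=640)
print(normalized)
```

Filtering comes first so that unreliable coordinates are never scaled and mistaken for valid positions.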

## Full Keypoint Processing Pipeline

The `keypoint_processor.py` module provides a comprehensive function `process_keypoints_for_modeling()` that applies all processing steps in sequence. This is the recommended way to prepare keypoint data for analysis.

In [None]:
# Look at the full processing pipeline function
print("Full keypoint processing pipeline:")
print(keypoint_processor.process_keypoints_for_modeling.__doc__)

# For demonstration purposes, create a simple configuration
# In practice, this would be imported from src.config
keypoint_processor.KEYPOINT_CONFIG = {
    'confidence_threshold': 0.5,
    'interpolate_missing': True
}

# Apply the full processing pipeline
processed_df = keypoint_processor.process_keypoints_for_modeling(df_read, height, width)

print(f"\nProcessed {len(processed_df)} rows of keypoint data")
print("\nProcessed dataframe sample:")
display(processed_df[['frame', 'person', 'index', 'k0_x', 'k0_y', 'k0_c']].head(3))

# Verify the range of normalized coordinates
x_cols = [col for col in processed_df.columns if col.endswith('_x')]
y_cols = [col for col in processed_df.columns if col.endswith('_y')]

print(f"\nX coordinate range: [{processed_df[x_cols].min().min():.3f}, {processed_df[x_cols].max().max():.3f}]")
print(f"Y coordinate range: [{processed_df[y_cols].min().min():.3f}, {processed_df[y_cols].max().max():.3f}]")
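
The `interpolate_missing` option fills gaps left by filtering. As an illustration only, the sketch below fills missing values in one coordinate series by linear interpolation between the nearest known neighbours. The project's function may use a different method; `interpolate_gaps` is a hypothetical name.

```python
def interpolate_gaps(values):
    """Linearly fill runs of None that have a known value on both sides."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# One keypoint's x coordinate over seven frames, with missing detections
series = [None, 0.0, None, None, None, 4.0, None]
print(interpolate_gaps(series))
```

Leading and trailing gaps are left as `None` because there is no value on one side to interpolate from.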

# Audio Processing

Our `audio_processor.py` module contains a suite of functions for extracting and analyzing audio from videos. Let's explore these functions step by step.

## Extracting Audio from Video

First, we need to extract the audio track from a video file using the `extract_audio()` function:

In [None]:
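# Extra imports used by the audio cells
import librosa
from IPython.display import Audio

# Directory for intermediate files. This path is an assumption; adjust it to your setup.
temp_out = os.path.join("..", "data", "temp")
os.makedirs(temp_out, exist_ok=True)

# Look at the audio extraction function
print("Audio extraction function:")
print(audio_processor.extract_audio.__doc__)

# Extract audio from video
audio_path = audio_processor.extract_audio(VIDEO_FILE, temp_out, output_ext="wav")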
print(f"\nExtracted audio saved to: {audio_path}")

# Play the extracted audio
display(Audio(audio_path))

# Show audio information
try:
    y, sr = librosa.load(audio_path)
    duration = librosa.get_duration(y=y, sr=sr)
    
    print(f"\nAudio information:")
    print(f"  Sample rate: {sr} Hz")
    print(f"  Duration: {duration:.2f} seconds")
    print(f"  Number of samples: {len(y)}")
    
    # Plot waveform
    plt.figure(figsize=(12, 3))
    plt.plot(np.linspace(0, duration, len(y)), y)
    plt.title("Audio Waveform")
    plt.xlabel("Time (seconds)")
    plt.ylabel("Amplitude")
    plt.show()
except Exception as e:
    print(f"Error analyzing audio: {e}")

## Speech Recognition

Next, we can transcribe the speech in the audio using OpenAI's Whisper model. Our `transcribe_audio()` function handles this process:

In [None]:
# Look at the transcription function
print("Audio transcription function:")
print(audio_processor.transcribe_audio.__doc__)

try:
    # Transcribe the extracted audio
    print("\nTranscribing audio (this may take a moment)...")
    transcript_path, transcript_data = audio_processor.transcribe_audio(
        audio_path, temp_out, model_name="base"
    )
    
    if transcript_path and transcript_data:
        print(f"Transcript saved to: {transcript_path}")
        
        print("\nFull transcript:")
        print(transcript_data['text'])
        
        # Show the segments with timestamps
        print("\nTranscript segments:")
        for i, segment in enumerate(transcript_data['segments'][:3]):  # Show first 3 segments
            print(f"Segment {i+1}: {segment['start']:.2f}s - {segment['end']:.2f}s")
            print(f"  '{segment['text']}'")
except Exception as e:
    print(f"Error during transcription: {e}")
    print("This may be due to Whisper not being installed or other dependencies missing.")

In [None]:
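# Look up the caption shown at a given frame of this video.
# SPEECH_FILE must point to a Whisper transcript JSON for the video, and
# `display` here is the project's display module, not IPython's display().
framenum = 60  # example frame, the same one used later in this notebook
with open(SPEECH_FILE) as f: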
    speechjson = json.load(f)
caption = display.WhisperExtractCurrentCaption(speechjson, framenum, 15)
print(caption)
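
The lookup itself is simple once the transcript segments carry `start` and `end` times. The sketch below shows one way to do it, assuming that the third argument above (15) is the video frame rate; the project's `WhisperExtractCurrentCaption` may behave differently. `caption_at_frame` is a hypothetical name and the transcript is made up.

```python
def caption_at_frame(transcript, framenum, fps):
    """Return the text of the segment that covers the given frame, or "" if none does."""
    t = framenum / fps  # frame index -> seconds
    for segment in transcript["segments"]:
        if segment["start"] <= t < segment["end"]:
            return segment["text"].strip()
    return ""


# Made-up transcript with the same keys used in the transcription cell above
sample = {
    "text": "Peekaboo! Where did you go?",
    "segments": [
        {"start": 0.0, "end": 2.0, "text": " Peekaboo!"},
        {"start": 3.0, "end": 5.5, "text": " Where did you go?"},
    ],
}

# Frame 60 at 15 fps corresponds to t = 4.0 seconds
print(caption_at_frame(sample, framenum=60, fps=15))
```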

# 0.3 Facial Emotion Recognition

We use DeepFace, wrapped in a function that stores the results in a dataframe indexed by detected person and video frame number.

In [None]:
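# Let's check that face detection works.
# NOTE: `backends` and `extract_faces_from_image` are not defined earlier in this
# notebook. The list below is an illustrative subset of DeepFace detector backends,
# and the function is assumed to come from the project's face processing code.
features = ("emotion", "age", "gender")
backends = ["opencv", "ssd", "mtcnn", "retinaface"]

for backend in backends:
    print(f"Using backend: {backend}")
    for photo in photoset:
        print(photo)
        img = cv2.imread(photo)
        faces = extract_faces_from_image(img, backend=backend, features=features, precision=5, debug=True)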
        
## Speaker Diarization

We can use pyannote to diarize the audio, that is, to work out who speaks when. The resulting speaker turns can then be used to extract each speaker's speech from the audio.

The code is in our `utils.py` file.

In [None]:
# `utils` was already imported above (import src.utils as utils)

diarization = utils.diarize_audio(AUDIO_FILE)

# Save the speaker turns in RTTM format
with open("output.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

print(diarization)  # diarization as text values

diarization # as visual timeline
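
The RTTM file written above is plain text with one speaker turn per line, so it is easy to read back without pyannote. The sketch below parses the standard RTTM fields (turn onset in column 4, duration in column 5, speaker name in column 8) and totals the speaking time per speaker. The helper names and the sample lines are made up for illustration.

```python
def parse_rttm(lines):
    """Parse RTTM lines into (start, end, speaker) tuples."""
    turns = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank or non-speaker lines
        start = float(fields[3])
        duration = float(fields[4])
        turns.append((start, start + duration, fields[7]))
    return turns


def speaking_time(turns):
    """Total seconds spoken by each speaker."""
    totals = {}
    for start, end, speaker in turns:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals


sample_rttm = [
    "SPEAKER demo 1 0.50 2.00 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER demo 1 3.00 1.50 <NA> <NA> SPEAKER_01 <NA> <NA>",
    "SPEAKER demo 1 5.00 1.00 <NA> <NA> SPEAKER_00 <NA> <NA>",
]
turns = parse_rttm(sample_rttm)
print(speaking_time(turns))
```

To use it on the real output, pass the open file instead of the sample list: `with open("output.rttm") as f: turns = parse_rttm(f)`.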

## Additional Audio Analysis

Our `audio_processor.py` module also provides functions for extracting additional audio features:

In [None]:
# Look at the fundamental frequency extraction function
print("Fundamental frequency extraction function:")
print(audio_processor.extract_fundamental_frequency.__doc__)

try:
    print("\nExtracting fundamental frequency (this may take a moment)...")
    f0_path = audio_processor.extract_fundamental_frequency(audio_path, temp_out)
    
    if f0_path:
        print(f"F0 data saved to {f0_path}")
        
        # Load and analyze the F0 data
        f0_data = np.load(f0_path)
        f0 = f0_data['f0']
        voiced_flag = f0_data['voiced_flag']
        
        # Basic statistics
        n_voiced = np.sum(voiced_flag)
        voiced_f0 = f0[voiced_flag]
        
        print(f"Total frames: {len(f0)}")
        print(f"Voiced frames: {n_voiced} ({n_voiced/len(f0)*100:.1f}%)")
        
        if len(voiced_f0) > 0:
            print(f"\nF0 statistics (Hz):")
            print(f"  Mean: {np.mean(voiced_f0):.1f}")
            print(f"  Median: {np.median(voiced_f0):.1f}")
            print(f"  Min: {np.min(voiced_f0):.1f}")
            print(f"  Max: {np.max(voiced_f0):.1f}")
            
            # Plot the F0 contour
            plt.figure(figsize=(12, 4))
            
            # Create time axis
            y, sr = librosa.load(audio_path)
            duration = librosa.get_duration(y=y, sr=sr)
            time_axis = np.linspace(0, duration, len(f0))
            
            # Plot only voiced frames
            plt.plot(time_axis[voiced_flag], f0[voiced_flag], 'b-')
            plt.title('Fundamental Frequency (F0) Contour')
            plt.xlabel('Time (seconds)')
            plt.ylabel('Frequency (Hz)')
            plt.grid(alpha=0.3)
            plt.show()
except Exception as e:
    print(f"Error extracting fundamental frequency: {e}")

# Look at the laughter detection function
print("\nLaughter detection function:")
print(audio_processor.detect_laughter.__doc__)
print("\nNote: Laughter detection requires additional installation steps.")
print("We'll show sample output for demonstration purposes.")

# Sample laughter detection output
sample_laughter = [
    {"start": 3.2, "end": 5.7, "prob": 0.89},
    {"start": 12.5, "end": 14.3, "prob": 0.76}
]

print("\nSample laughter detection output:")
for i, segment in enumerate(sample_laughter):
    print(f"Laughter {i+1}: {segment['start']:.1f}s - {segment['end']:.1f}s (probability: {segment['prob']:.2f})")

## Complete Audio Processing Pipeline

The `audio_processor.py` module provides a comprehensive function `process_audio()` that combines all audio processing steps:

In [None]:
# Look at the complete audio processing function
print("Complete audio processing pipeline:")
print(audio_processor.process_audio.__doc__)

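# For demonstration, we'll create a short clip to process
import moviepy.editor as mp  # moviepy 1.x API, which provides VideoFileClip.subclip()

# Hugging Face token used by pyannote for diarization. The variable name HF_TOKEN
# is an assumption; use whichever name your environment defines.
hf_token = os.environ.get("HF_TOKEN")

try: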
    # Create a short clip for faster processing
    short_clip_path = os.path.join(temp_out, "demo_short.mp4")
    video_clip = mp.VideoFileClip(VIDEO_FILE).subclip(0, 5)  # First 5 seconds
    video_clip.write_videofile(short_clip_path, codec="libx264", audio_codec="aac", logger=None)
    video_clip.close()
    
    print(f"\nProcessing short video clip: {os.path.basename(short_clip_path)}")
    print("This will demonstrate the full audio processing pipeline...")
    
    # Process with most features enabled, but disable diarization if no token
    results = audio_processor.process_audio(
        short_clip_path, 
        temp_out, 
        enable_whisper=True,
        enable_diarization=(hf_token is not None),
        enable_f0=True,
        enable_laughter=False,  # Disable for demo as it requires additional setup
        force_process=False
    )
    
    print("\nProcessing complete! Results:")
    for key, value in results.items():
        if isinstance(value, str):
            print(f"  {key}: {os.path.basename(value)}")
        else:
            print(f"  {key}: {type(value)}")
except Exception as e:
    print(f"\nError in audio processing pipeline: {e}")

# Visualization and Integration

## Visualizing Keypoint Data

This section shows some of the calculations that help us visualise the movement of participants over time.


In [None]:
# calcs.avgxys averages the x and y coordinates of a set of keypoints,
# using only those whose confidence score is above a threshold
xycs = np.array(
    [
        [1, 2, 0.9],
        [2, 3, 0.8],
        [3, 4, 0.7],
        [4, 5, 0.6],
        [5, 6, 0.5],
        [6, 7, 0.4],
        [7, 8, 0.3],
        [8, 9, 0.2],
        [9, 10, 0.1],
    ]
)

avgx, avgy = calcs.avgxys(xycs, threshold=0.5)

print(avgx, avgy)
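
For readers without the project's `calcs` module to hand, the calculation can be sketched in plain Python. This version keeps keypoints whose confidence is strictly greater than the threshold; whether `calcs.avgxys` uses a strict or inclusive comparison is an assumption here, and `avg_xy` is a hypothetical name.

```python
def avg_xy(xycs, threshold=0.5):
    """Average x and y over keypoints whose confidence is above the threshold."""
    kept = [(x, y) for x, y, c in xycs if c > threshold]
    if not kept:
        return None, None  # nothing reliable enough to average
    avg_x = sum(x for x, _ in kept) / len(kept)
    avg_y = sum(y for _, y in kept) / len(kept)
    return avg_x, avg_y


rows = [(1, 2, 0.9), (2, 3, 0.8), (3, 4, 0.7), (4, 5, 0.6), (5, 6, 0.5), (6, 7, 0.4)]
print(avg_xy(rows, threshold=0.5))
```

Only the first four rows pass the threshold, so the averages are taken over those four points.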

In [None]:
framenumber = 34
bboxlabels, bboxes, xycs = utils.getFrameKpts(df, framenumber)

print(bboxlabels)
print(bboxes)
print(xycs)

video = cv2.VideoCapture(VIDEO_FILE)
video.set(cv2.CAP_PROP_POS_FRAMES, framenumber)
success, image = video.read()
video.release()

image = display.drawOneFrame(image, bboxlabels, bboxes, xycs)

plt.imshow(image)

## Adding Annotations to Videos

In [None]:
processedvideos = utils.getProcessedVideos(data_out)
processedvideos.head()

In [None]:
# let's grab a single frame from the video

framenum = 60
video = cv2.VideoCapture(VIDEO_FILE)
video.set(cv2.CAP_PROP_POS_FRAMES, framenum)
ret, frame = video.read()
video.release()

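if ret:
    # OpenCV reads frames as BGR; convert to RGB so matplotlib shows true colours
    plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    plt.show()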

videoname = os.path.basename(VIDEO_FILE)

kpts = utils.getKeyPoints(processedvideos, videoname)
kpts.head()

In this demo, the movement extraction step has mislabelled the adult and the child (the labels are assigned arbitrarily), so we need to swap them. Because the two labels trade places, the swap goes through a temporary label.

In [10]:
kpts = utils.relabelPersonIndex(
    kpts, person="child", index=0, newPerson="temp", newIndex=100
)
kpts = utils.relabelPersonIndex(
    kpts, person="adult", index=1, newPerson="child", newIndex=0
)
kpts = utils.relabelPersonIndex(
    kpts, person="temp", index=100, newPerson="adult", newIndex=1
)
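
The temporary label is needed because relabelling child to adult directly would leave two sets of rows carrying the same label, with no way to tell them apart in the next step. As an alternative sketch, the same swap can be done in a single pass over the rows. The function name and the sample rows are hypothetical; this is not how `utils.relabelPersonIndex` is implemented.

```python
def swap_labels(rows, a, b):
    """Swap two (person, index) labels in one pass, without a temporary label."""
    swapped = []
    for row in rows:
        label = (row["person"], row["index"])
        if label == a:
            person, index = b
        elif label == b:
            person, index = a
        else:
            person, index = label
        # copy the row so the input list is left unchanged
        swapped.append({**row, "person": person, "index": index})
    return swapped


rows = [
    {"frame": 0, "person": "child", "index": 0, "k0_x": 0.31},
    {"frame": 0, "person": "adult", "index": 1, "k0_x": 0.62},
]
fixed = swap_labels(rows, ("child", 0), ("adult", 1))
for row in fixed:
    print(row)
```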

## Creating Complete Annotated Videos



In [None]:
annotatedVideo = display.createAnnotatedVideo(
    VIDEO_FILE, kpts, facedata, speechdata, temp_out, True
)