# 📓 Notebook 1 – Exploratory Data Analysis (EDA) of Pose Outputs

## 1. Introduction & objectives

In this notebook, we will explore pose estimation outputs generated with SuperAnimal ModelZoo on 10-minute top-view mouse videos.

**Learning goals:**
- Understand the structure of .h5 output files
- Explore metadata and summary statistics
- Visualize likelihoods, trajectories, and skeletons
- Detect and correct errors (missing points, jumps)
- Compare outputs from clear vs challenging videos
- Prepare cleaned data for further analysis
  

<img src="https://raw.githubusercontent.com/LizbethMG-Teaching/pose2behav-book/main/assets/notebook-image1.png" width="50%">

**Narrative**

Imagine you are a junior researcher in a neuroscience lab. Your colleague just handed you pose estimation outputs generated with SuperAnimal ModelZoo from 5-minute videos of mice exploring an arena. Before you can ask scientific questions about locomotion, posture, or social behavior, you need to verify the quality of these model predictions. Are all the keypoints tracked reliably? Do some body parts drop out in certain conditions? How does tracking quality differ between a clear video and a more challenging one?

In this notebook, you will take the role of a data detective: opening the .h5 pose files, exploring the structure, visualizing likelihoods and trajectories, spotting errors, and applying simple corrections. By the end, you will produce a short “quality report” that prepares you for deeper behavioral analysis in the next notebooks.

--- 

**Instructions**

This notebook mixes pre-filled code cells (nothing to change) and coding exercises that you will complete.

👉 Here’s how to work through it:
1. Read each section carefully before running the cells.
2. When a cell requires you to write code, you’ll see a TODO comment.
3. The TODO will tell you how many lines to write.
4. Write your code only between the markers:
    
```python
# >>>>>>>>>>>>>>>>>>>
# your code goes here
# <<<<<<<<<<<<<<<<<<<
```

✋ Do not edit anything outside these markers.

⚡ After finishing the course, feel free to experiment and modify the notebook as you like!

✨ Example

What you will see in the notebook:

```python
# >>>>>>>>>>>>>>>>>>>
# TODO (2 lines): compute the duration of the video and print it 
# variables: frame_count, fps
# YOUR CODE: duration = 
# YOUR CODE: print(...)
# <<<<<<<<<<<<<<<<<<< 
```

What you are expected to write: 

```python
# >>>>>>>>>>>>>>>>>>>
# TODO (2 lines): compute the duration of the video and print it 
# variables: frame_count, fps
duration = frame_count / fps
print(f"Duration (s): {duration}")
# <<<<<<<<<<<<<<<<<<< 
```
---

## 2. Data Loading & Format Inspection

👉 Goal: learn to open .h5 files and understand their structure.
- Load one file into a pandas DataFrame
- Inspect the column levels: scorer, individuals, bodyparts, coords (x, y, likelihood)
- Count frames and list bodyparts

Exercise 1:
List all detected bodyparts and classify them as head / body / tail.

**2.1 ✅ Install & Download (prefilled)**

📥 Download a dataset file from Google Drive (with gdown), save it locally (path depends on Colab vs local), and verify the download.

In [None]:

# Install and import the required libraries:
!pip -q install gdown tables

import os
from pathlib import Path
import gdown, pandas as pd, numpy as np


# Detect if running in Google Colab
if "COLAB_RELEASE_TAG" in os.environ or "COLAB_GPU" in os.environ:
    DEST = Path("/content/dlc_output.h5")
else:
    DEST = Path("dlc_output.h5")  # save in current folder locally
print("Saving to:", DEST)

# Select the experiment you want to download and comment out the others:
# Opt 1: Single mouse - arena with bedding
# FILE_ID = 
# Opt 2: Single mouse - arena without clear floor
# FILE_ID = 
# Opt 3: Single mouse - beatbox
FILE_ID = "11zcVPSS4D-JLQQ11hkMbPwmqs-cd6Am2"
URL = f"https://drive.google.com/uc?id={FILE_ID}"

print("Downloading from Drive...")
_ = gdown.download(URL, str(DEST), quiet=False)

# Basic checks
assert DEST.exists() and DEST.stat().st_size > 0, "❌ Download failed or empty file."
print(f"✅ Downloaded to {DEST} ({DEST.stat().st_size/1_000_000:.2f} MB)")



Saving to: dlc_output.h5
Downloading from Drive...


Downloading...
From: https://drive.google.com/uc?id=11zcVPSS4D-JLQQ11hkMbPwmqs-cd6Am2
To: /Users/lix/Library/CloudStorage/OneDrive-Personnel/3-work/teaching/2025_BehavioralAnalysis/pose2behav-book/notebooks/dlc_output.h5
100%|██████████| 97.8M/97.8M [00:08<00:00, 11.6MB/s]

✅ Downloaded to dlc_output.h5 (97.80 MB)





**2.2 👩🏻‍💻 Load the HDF5 into a DataFrame (TODO)**

In [None]:
# Load the HDF5 pose output into a pandas DataFrame.
# TODO below: verify that the loaded DataFrame is not empty (one line).

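def read_pose_h5(path: Path) -> pd.DataFrame:
    """Try the keys commonly used for pose tables, then fall back to the default key."""
    for key in ("df_with_missing", "df", "tracks", "pose"):
        try:
            return pd.read_hdf(path, key=key)
        except Exception:
            # Key not present (or not readable): try the next candidate
            pass

    # No candidate key worked: let pandas pick the only table in the file
    return pd.read_hdf(path)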

df = read_pose_h5(DEST)

# Basic sanity check
# >>>>>>>>>>>>>>>>>>>
# TODO (1 line): Check that the DataFrame "df" has at least 1 row.
# Clue: ( assert <logical statement>, “message to return if assertion fails” )
#   If the file was loaded but empty (no rows), this condition is False.
#   If condition is True, nothing happens, code continues.

assert df.shape[0] > 0, "Empty DataFrame after loading. Check file."
# <<<<<<<<<<<<<<<<<<<

print("✅ H5 loaded.")

# 1) Pick the FIRST animal label in the H5 (usually 'animal0')
first_animal = df.columns.get_level_values('individuals').unique()[0]
A = df.xs(first_animal, axis=1, level='individuals')   # columns: (scorer, bodyparts, coords)

print(f"✅ Focusing on first animal: {first_animal}")
print("Frames:", df.shape[0])
print("Bodyparts:", list(A.columns.levels[1]))
print("Coords:", list(A.columns.levels[2]))

# 2) Flatten columns for easier use: "nose_x", "nose_y", "nose_likelihood", ...
df_flat = A.copy()
df_flat.columns = [f"{bp}_{coord}" for _, bp, coord in df_flat.columns]
print("\nFlattened columns (sample):", df_flat.columns[:9].tolist())
display(df_flat.head())

# 3) Likelihood-based QC per bodypart (H5 only)
#    SuperAnimal often uses -1 for “no detection”. Convert <0 to NaN before stats.
L = A.xs('likelihood', axis=1, level='coords')      # (frames, bodyparts)
L_valid = L.where(L >= 0)                            # drop -1 sentinel -> NaN

per_bp = pd.DataFrame({
    'coverage'        : L_valid.notna().mean(axis=0),          # fraction of frames with any detection
    'frac_conf>=0.5'  : (L_valid >= 0.5).mean(axis=0),         # fraction of frames confidently detected
    'mean_likelihood' : L_valid.mean(axis=0),                  # average likelihood (ignoring -1)
}).sort_values(['frac_conf>=0.5','coverage','mean_likelihood'], ascending=False)
per_bp.index.name = 'bodypart'

print("\n=== Per-bodypart QC for first animal (top 10) ===")
display(per_bp.head(10))

# 4) Frame-level quick view (how many bodyparts detected per frame)
detected_per_frame = L_valid.notna().sum(axis=1)  # integer count per frame
print("\nDetected bodyparts per frame — summary:")
display(detected_per_frame.describe())

# 5) Access helpers (examples)
#    - Single series: nose_x / nose_y / nose_likelihood
#    - First 5 rows shown for demonstration
for name in ["nose_x", "nose_y", "nose_likelihood"]:
    if name in df_flat.columns:
        print(f"\n{name} (first 5):")
        print(df_flat[name].head())
    else:
        print(f"\n{name} not found in columns (check bodypart names).")

# 6) Sanity checks students can discuss:
#    - If many bodyparts show coverage ≈ 0 and mean_likelihood ≈ NaN,
#      the animal may not have been detected (or wrong slot chosen).
low_cov = (per_bp['coverage'] < 0.05).mean()
if low_cov == 1.0:
    print("\n⚠️ All bodyparts have very low coverage (<5%). "
          "This first animal slot may be empty in this file.")


✅ H5 loaded.
✅ Focusing on first animal: animal0
Frames: 15053
Bodyparts: ['head_midpoint', 'left_ear', 'left_ear_tip', 'left_eye', 'left_hip', 'left_midside', 'left_shoulder', 'mid_back', 'mid_backend', 'mid_backend2', 'mid_backend3', 'mouse_center', 'neck', 'nose', 'right_ear', 'right_ear_tip', 'right_eye', 'right_hip', 'right_midside', 'right_shoulder', 'tail1', 'tail2', 'tail3', 'tail4', 'tail5', 'tail_base', 'tail_end']
Coords: ['likelihood', 'x', 'y']

Flattened columns (sample): ['nose_x', 'nose_y', 'nose_likelihood', 'left_ear_x', 'left_ear_y', 'left_ear_likelihood', 'right_ear_x', 'right_ear_y', 'right_ear_likelihood']


|  | nose_x | nose_y | nose_likelihood | left_ear_x | left_ear_y | left_ear_likelihood | right_ear_x | right_ear_y | right_ear_likelihood | left_ear_tip_x | ... | right_midside_likelihood | right_hip_x | right_hip_y | right_hip_likelihood | tail_end_x | tail_end_y | tail_end_likelihood | head_midpoint_x | head_midpoint_y | head_midpoint_likelihood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 1 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 2 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 3 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 4 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |



=== Per-bodypart QC for first animal (top 10) ===


Scorer (identical for every row): `superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004`

| bodyparts | coverage | frac_conf>=0.5 | mean_likelihood |
|---|---|---|---|
| right_hip | 0.628446 | 0.553511 | 0.76014 |
| left_ear | 0.628446 | 0.537102 | 0.743324 |
| left_ear_tip | 0.628446 | 0.535973 | 0.656686 |
| right_ear | 0.628446 | 0.475852 | 0.657428 |
| nose | 0.628446 | 0.45559 | 0.673105 |
| right_eye | 0.628446 | 0.444496 | 0.576161 |
| mid_back | 0.628446 | 0.443832 | 0.594053 |
| left_midside | 0.628446 | 0.438517 | 0.609447 |
| left_eye | 0.628446 | 0.429084 | 0.581836 |
| head_midpoint | 0.628446 | 0.419518 | 0.557033 |



Detected bodyparts per frame — summary:


count    15053.000000
mean        16.968046
std         13.047374
min          0.000000
25%          0.000000
50%         27.000000
75%         27.000000
max         27.000000
dtype: float64


nose_x (first 5):
0   -1.0
1   -1.0
2   -1.0
3   -1.0
4   -1.0
Name: nose_x, dtype: float64

nose_y (first 5):
0   -1.0
1   -1.0
2   -1.0
3   -1.0
4   -1.0
Name: nose_y, dtype: float64

nose_likelihood (first 5):
0   -1.0
1   -1.0
2   -1.0
3   -1.0
4   -1.0
Name: nose_likelihood, dtype: float64


In [20]:
# --- Pose H5 loader + active-animal detector (SuperAnimal / DLC multi-animal) ---

from pathlib import Path
import pandas as pd
import numpy as np

# If you already defined DEST earlier, it will be reused.
# Otherwise, set it here explicitly, e.g.:
# DEST = Path("dlc_output.h5")

# -------------------------
# 1) Robust H5 DataFrame loader
# -------------------------
def read_pose_h5(path: Path) -> pd.DataFrame:
    """
    Load a DeepLabCut/SuperAnimal HDF5 pose file as a pandas DataFrame.
    Tries common keys; falls back to root.
    """
    preferred = ("df_with_missing", "df", "tracks", "pose")
    with pd.HDFStore(path, mode='r') as store:
        keys = [k.strip('/') for k in store.keys()]
    for k in preferred:
        if k in keys:
            df = pd.read_hdf(path, key=k)
            if not isinstance(df, pd.DataFrame):
                raise TypeError(f"H5 key '{k}' did not return a DataFrame.")
            return df
    # Fallback: try without key (some exports store a single root)
    df = pd.read_hdf(path)
    if not isinstance(df, pd.DataFrame):
        raise TypeError("H5 did not contain a DataFrame at root.")
    return df

# -------------------------
# 2) Active-animal detection (robust, uses level NAMES)
# -------------------------
def detect_active_animals(df: pd.DataFrame, conf_thresh: float = 0.5) -> pd.DataFrame:
    """
    Returns a per-animal summary with:
      - mean_likelihood         : raw mean (often -1.0 for totally missing)
      - frac_conf>=thresh       : fraction of points with likelihood >= conf_thresh
      - mean_xy_variance        : average variance across x/y (only where detected)
    Sorted so the most likely real animal is on top.
    """
    # Expect a 4-level MultiIndex: scorer / individuals / bodyparts / coords
    if not isinstance(df.columns, pd.MultiIndex):
        raise ValueError("Expected MultiIndex columns (scorer/individuals/bodyparts/coords).")
    expected = ['scorer', 'individuals', 'bodyparts', 'coords']
    if list(df.columns.names) != expected:
        raise ValueError(f"Unexpected column levels: {df.columns.names} (expected {expected})")

    animals = list(df.columns.levels[df.columns.names.index('individuals')])

    idx = pd.IndexSlice  # for clean multi-index slicing with .loc
    rows = []
    for a in animals:
        # Slice one animal using level NAME
        A  = df.xs(a, axis=1, level='individuals')

        # Likelihoods: (frames, bodyparts) – single label is OK with .xs
        L  = A.xs('likelihood', axis=1, level='coords')

        # Coordinates: (frames, bodyparts*2) – multiple labels -> use .loc + IndexSlice
        XY = A.loc[:, idx[:, :, ['x', 'y']]]

        # 1) Mean likelihood (SuperAnimal often uses -1.0 as "no detection")
        mean_L = float(L.mean().mean())

        # 2) Fraction confident: drop sentinel < 0 first, then threshold
        L_valid = L.where(L >= 0)
        frac_conf = float((L_valid >= conf_thresh).mean().mean())

        # 3) Movement variance only where detected:
        # Build a 3-level boolean mask with SAME columns as XY.
        det_mask = L_valid.notna()                  # (frames, bodyparts)
        # Duplicate for x/y and elevate to 3-level MI with coords at last level
        mask3 = pd.concat([det_mask, det_mask], axis=1, keys=['x', 'y'])
        # mask3 levels currently ('coords','scorer','bodyparts') – reorder to match XY
        mask3 = mask3.swaplevel(0, 2, axis=1).swaplevel(0, 1, axis=1)
        mask3 = mask3.sort_index(axis=1)

        # Align and mask XY
        mask3 = mask3.reindex(columns=XY.columns)   # ensure exact alignment
        XY_masked = XY.where(mask3)

        mov_var = float(XY_masked.var(ddof=0).mean())  # average variance across all x/y cols

        rows.append((a, mean_L, frac_conf, mov_var))

    conf_col = f"frac_conf>={conf_thresh}"  # column name reflects the threshold actually used
    out = pd.DataFrame(rows, columns=['animal', 'mean_likelihood', conf_col, 'mean_xy_variance'])
    out = out.set_index('animal').sort_values(
        [conf_col, 'mean_xy_variance', 'mean_likelihood'],
        ascending=False
    )
    return out

# -------------------------
# 3) Pick best animal and flatten columns
# -------------------------
def pick_top_animal(df: pd.DataFrame, conf_thresh: float = 0.5):
    """
    Detects active animals, returns (best_animal_label, flattened_dataframe_for_that_animal, summary).
    Flattened columns are like 'nose_x', 'nose_y', 'nose_likelihood'.
    """
    summary = detect_active_animals(df, conf_thresh=conf_thresh)
    top = summary.index[0]  # best-ranked animal

    # Slice that animal and flatten to 'bodypart_coord' (drop scorer level in names)
    A = df.xs(top, axis=1, level='individuals').copy()  # copy so renaming columns never touches df
    A.columns = [f"{bp}_{coord}" for _, bp, coord in A.columns]  # (scorer, bodyparts, coords)
    return top, A, summary

# -------------------------
# 4) (Optional) Compute duration if FPS is known
# -------------------------
def compute_duration_from_df(df: pd.DataFrame, fps: float) -> dict:
    """
    Returns frames, fps, seconds, minutes.
    Note: DLC H5 usually doesn't store FPS; you must supply it (e.g., from the source video).
    """
    if fps <= 0:
        raise ValueError("FPS must be > 0.")
    n_frames = int(df.shape[0])
    seconds = n_frames / fps
    return {
        "frames": n_frames,
        "fps": float(fps),
        "seconds": float(seconds),
        "minutes": float(seconds / 60.0),
    }

# -------------------------
# 5) Run
# -------------------------
df = read_pose_h5(Path(DEST))

# Basic sanity check: ensure at least one row
assert df.shape[0] > 0, "Empty DataFrame after loading. Check file."

print("✅ H5 loaded.")
print("Shape:", df.shape)
print("Column level names:", df.columns.names)
print("First 6 columns:", df.columns[:6])

# Detect animals and flatten the top one
best_animal, df_flat, summary = pick_top_animal(df, conf_thresh=0.5)

print("\n=== Active-animal summary (sorted) ===")
print(summary)

print(f"\nBest animal picked: {best_animal}")
print("Flattened view (first rows):")
display(df_flat.head())

# Example (optional): if you know FPS, compute duration
# duration = compute_duration_from_df(df, fps=30)
# print("\nVideo duration estimate:", duration)

✅ H5 loaded.
Shape: (15053, 810)
Column level names: ['scorer', 'individuals', 'bodyparts', 'coords']
First 6 columns: MultiIndex([('superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004', ...),
            ('superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004', ...),
            ('superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004', ...),
            ('superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004', ...),
            ('superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004', ...),
            ('superanimal_topviewmouse_snapshot-fasterrcnn_mobilenet_v3_large_fpn-004_snapshot-hrnet_w32-004', ...)],
           names=['scorer', 'individuals', 'bodyparts', 'coords'])

=== Active-animal summary (sorted) ===
         mean_likelihood  frac_conf>=0.5  mean_xy_variance
animal      

|  | nose_x | nose_y | nose_likelihood | left_ear_x | left_ear_y | left_ear_likelihood | right_ear_x | right_ear_y | right_ear_likelihood | left_ear_tip_x | ... | right_midside_likelihood | right_hip_x | right_hip_y | right_hip_likelihood | tail_end_x | tail_end_y | tail_end_likelihood | head_midpoint_x | head_midpoint_y | head_midpoint_likelihood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 1 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 2 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 3 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 4 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |


## 3. Metadata & basic summary

👉 Goal: extract key metadata and get a first impression of data quality.
- Print frame rate, duration, number of frames (the .h5 file usually does not store the frame rate, so take it from the source video)
- Compute % of missing/low-confidence points per bodypart
- Create summary table of likelihoods

Exercise 2:
Which bodypart is most reliably detected? Which one is least?
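
As a starting point, here is a minimal sketch of such a summary on a toy likelihood table (rows are frames, columns are bodyparts). The values and the `fps` of 25 are made up; the `-1` entries mimic the "no detection" sentinel seen in the outputs above. It is one possible way to build the table, not the reference solution.

```python
import pandas as pd

# Toy likelihoods: -1 marks "no detection".
lik = pd.DataFrame({
    "nose":      [0.9, 0.8, -1.0, 0.3],
    "tail_base": [0.95, 0.9, 0.85, 0.7],
})
fps = 25.0  # assumed value; read the true frame rate from the source video

n_frames = len(lik)
duration_s = n_frames / fps

# Turn the sentinel into NaN so it does not distort the statistics.
valid = lik.where(lik >= 0)

summary = pd.DataFrame({
    "pct_missing": valid.isna().mean() * 100,
    # NaN < 0.5 is False, so missing frames are not double-counted here
    "pct_low_conf": (valid < 0.5).mean() * 100,
    "mean_likelihood": valid.mean(),
})

most = summary["mean_likelihood"].idxmax()
least = summary["mean_likelihood"].idxmin()
print(f"{n_frames} frames, {duration_s:.2f} s at {fps} fps")
print(summary)
print("Most reliable:", most, "| least reliable:", least)
```

Ranking by mean likelihood is only one choice; sorting by `pct_missing` or `pct_low_conf` can give a different answer and is worth comparing.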

## 4. Likelihood distributions


👉 Goal: visualize the reliability of detections.
- Histograms of likelihood per bodypart
- Violin plots comparing bodyparts
- Fraction of frames below confidence threshold

Exercise 3:
Compare the tail base vs nose likelihood distributions. What do you observe?
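
The numbers behind a histogram can be computed without any plotting. The sketch below uses synthetic likelihoods for two placeholder keypoints (`keypoint_a`, `keypoint_b`, drawn from beta distributions chosen only for illustration) and reports bin counts plus the fraction of frames under a confidence threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic likelihoods: one keypoint mostly confident, one much less so.
likelihoods = {
    "keypoint_a": rng.beta(8, 2, size=1000),
    "keypoint_b": rng.beta(2, 2, size=1000),
}

bins = np.linspace(0.0, 1.0, 11)  # ten bins of width 0.1
threshold = 0.5

results = {}
for name, values in likelihoods.items():
    counts, _ = np.histogram(values, bins=bins)
    frac_below = float((values < threshold).mean())
    results[name] = {"counts": counts, "frac_below": frac_below}
    print(f"{name}: {frac_below:.1%} of frames below {threshold}")
```

The `counts` arrays can be drawn with `matplotlib.pyplot.bar`, or the raw values can be passed directly to `matplotlib.pyplot.hist` with the same `bins`.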

## 5. Time series inspection

👉 Goal: detect failures and instability across time.
- Plot time series of x,y positions for nose (or other keypoints)
- Plot likelihood as a function of time

Exercise 4:
Spot at least two segments where the model clearly failed (likelihood drops).
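
Besides looking at the plot, low-likelihood segments can be located programmatically. This sketch labels consecutive runs of frames under a threshold in a short, invented likelihood series; the 0.5 threshold is an assumption you may want to tune.

```python
import pandas as pd

# Invented likelihood trace for one keypoint.
lik = pd.Series([0.9, 0.9, 0.2, 0.1, 0.8, 0.9, 0.3, 0.3, 0.3, 0.95])
low = lik < 0.5

# The run id increases every time the low/high state changes,
# so consecutive frames in the same state share one id.
run_id = (low != low.shift()).cumsum()

# Keep only the low-likelihood runs and record their first and last frame.
segments = []
for _, run in lik[low].groupby(run_id[low]):
    segments.append((int(run.index[0]), int(run.index[-1])))

print("Low-likelihood segments (start, end):", segments)
```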

## 6. Spatial distributions

👉 Goal: understand where in the arena each bodypart was detected.
- Scatter plot of nose positions
- Kernel density estimate heatmap of occupancy
- Overlay all bodypart scatter plots

Exercise 5:
Does the mouse explore the arena uniformly or are there preferences (corners, walls)?
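
A coarse occupancy map is a quick way to put numbers on this question. The sketch below bins synthetic positions (deliberately biased towards one corner of a hypothetical 400-pixel arena) into a 4×4 grid with `np.histogram2d` and compares the busiest bin with what uniform exploration would give. A kernel density estimate would produce a smoother picture; the 2D histogram is used here only because it is simpler.

```python
import numpy as np

rng = np.random.default_rng(2)
arena = 400  # hypothetical arena side length in pixels

# Synthetic positions clustered near the corner at (80, 80).
x = np.clip(rng.normal(80, 40, 5000), 0, arena)
y = np.clip(rng.normal(80, 40, 5000), 0, arena)

counts, xedges, yedges = np.histogram2d(
    x, y, bins=4, range=[[0, arena], [0, arena]]
)
occupancy = counts / counts.sum()   # fraction of frames spent in each bin
uniform = 1.0 / occupancy.size      # expected fraction under uniform exploration

peak_bin = tuple(int(i) for i in np.unravel_index(occupancy.argmax(), occupancy.shape))
print("Busiest bin (x index, y index):", peak_bin)
print(f"Occupancy there: {occupancy.max():.2f} vs uniform {uniform:.2f}")
```

On real data, remember to drop undetected frames (the `-1` sentinel) first, otherwise they pile up in a single bin.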

## 7. Visual diagnostics 

👉 Goal: overlay skeletons on frames and create animations.
- Pick random frames and overlay skeleton on image
- Short animation (GIF or video snippet) of 200 frames with skeleton overlay
- Compare clear video vs challenging video

Exercise 6:
Compare skeleton overlays between clear and noisy video. What errors do you see?
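
The geometric part of a skeleton overlay can be prepared separately from the image handling. The sketch below assumes a flattened frame with columns such as `nose_x`, `nose_y`, `nose_likelihood`, and a hypothetical skeleton given as pairs of bodyparts; both the edge list and the coordinate values are invented for illustration. Edges touching a low-confidence keypoint are skipped.

```python
import pandas as pd

# Hypothetical skeleton: pairs of bodyparts to connect.
skeleton = [("nose", "neck"), ("neck", "mid_back"), ("mid_back", "tail_base")]

# One frame of flattened pose data (values invented for illustration).
frame = pd.Series({
    "nose_x": 10.0, "nose_y": 20.0, "nose_likelihood": 0.9,
    "neck_x": 15.0, "neck_y": 25.0, "neck_likelihood": 0.8,
    "mid_back_x": 25.0, "mid_back_y": 35.0, "mid_back_likelihood": 0.7,
    "tail_base_x": 40.0, "tail_base_y": 50.0, "tail_base_likelihood": 0.2,
})

def skeleton_segments(frame, skeleton, min_likelihood=0.5):
    """Return ((x0, y0), (x1, y1)) for every edge whose two endpoints are confident."""
    segments = []
    for a, b in skeleton:
        if min(frame[f"{a}_likelihood"], frame[f"{b}_likelihood"]) < min_likelihood:
            continue  # skip edges that touch an unreliable keypoint
        segments.append(
            ((frame[f"{a}_x"], frame[f"{a}_y"]), (frame[f"{b}_x"], frame[f"{b}_y"]))
        )
    return segments

segments = skeleton_segments(frame, skeleton)
print(segments)
```

Each segment can then be drawn with `plt.plot([x0, x1], [y0, y1])` on top of the video frame shown with `plt.imshow`.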

## 8. Outlier & Error Detection

👉 Goal: identify extreme jumps and suspicious frames.
- Compute frame-to-frame displacement for each keypoint
- Histogram of displacements; flag outliers
- Mark “bad frames” with low likelihood or jumps

Exercise 7:
For the tail tip (`tail_end`), how many frames show a frame-to-frame jump larger than 30 pixels?
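
Frame-to-frame displacement is the Euclidean distance between consecutive positions. The sketch below computes it for a short, invented trajectory containing one obviously wrong frame and counts how many steps exceed the threshold.

```python
import numpy as np
import pandas as pd

# Invented trajectory of one keypoint; frame 3 is a tracking glitch.
xy = pd.DataFrame({
    "x": [100.0, 102.0, 101.0, 160.0, 103.0, 104.0],
    "y": [50.0, 51.0, 50.0, 90.0, 52.0, 52.0],
})

# Distance moved since the previous frame (NaN for the first frame).
step = np.hypot(xy["x"].diff(), xy["y"].diff())

jump_threshold = 30.0  # pixels
is_jump = step > jump_threshold
n_jumps = int(is_jump.sum())

print(step.round(2).tolist())
print("Frames flagged as jumps:", is_jump[is_jump].index.tolist())
```

Note that a single bad frame produces two large steps, one when the point jumps away and one when it returns, so the number of flagged frames is often about twice the number of glitches.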

## 9. Filtering & Correction

👉 Goal: correct noisy or missing data.
- Apply interpolation to missing points
- Apply smoothing (rolling median or spline)
- Compare raw vs corrected trajectories

Exercise 8:
Apply interpolation to the left ear (`left_ear`) trajectory and plot it before vs after.
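
A common pattern is to mask unreliable points, interpolate across the gaps, then smooth. The sketch below applies it to an invented x-coordinate with two missing frames and one spike; the 0.5 confidence threshold, the gap limit of 5 frames, and the window of 3 are assumptions to adjust for your data.

```python
import pandas as pd

# Invented x-coordinate and likelihood for one keypoint.
x = pd.Series([10.0, 11.0, -1.0, -1.0, 14.0, 15.0, 40.0, 17.0, 18.0])
lik = pd.Series([0.9, 0.9, -1.0, -1.0, 0.8, 0.9, 0.9, 0.9, 0.9])

# 1) Mask missing or low-confidence points.
x_clean = x.where(lik >= 0.5)

# 2) Fill short gaps linearly; limit_area="inside" avoids extrapolating at the edges.
x_interp = x_clean.interpolate(method="linear", limit=5, limit_area="inside")

# 3) A centered rolling median removes isolated spikes such as the 40.0 at frame 6.
x_smooth = x_interp.rolling(window=3, center=True, min_periods=1).median()

print(pd.DataFrame({"raw": x, "interpolated": x_interp, "smoothed": x_smooth}))
```

Interpolation handles the gap at frames 2 and 3, while the spike at frame 6 has a high likelihood and is only removed by the median filter, which is why the two steps complement each other.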

## 10. Comparative Analysis (Clear vs Noisy Video)

👉 Goal: see how conditions affect pose quality.
- Load outputs from two videos: one clear, one dark/low contrast
- Create summary table: % of low-confidence frames per bodypart
- Violin plots comparing likelihood distributions

Exercise 9:
Which video shows more missingness for the nose keypoint? Why might that be?
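
The comparison table can be built by applying the same metric to both recordings. The sketch below uses two tiny, invented likelihood tables standing in for the clear and the challenging video; `pct_low_conf` is a hypothetical helper that counts a frame as unreliable when the keypoint is missing (`-1`) or under the threshold.

```python
import pandas as pd

def pct_low_conf(lik: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Percentage of frames per bodypart that are missing or below the threshold."""
    valid = lik.where(lik >= 0)  # -1 sentinel -> NaN
    return ((valid < threshold) | valid.isna()).mean() * 100

# Invented likelihoods for two recordings.
clear = pd.DataFrame({"nose": [0.9, 0.8, 0.95, 0.7], "tail_base": [0.9, 0.9, 0.6, 0.8]})
dark = pd.DataFrame({"nose": [0.4, -1.0, 0.9, 0.2], "tail_base": [0.9, 0.3, 0.6, 0.8]})

comparison = pd.DataFrame({
    "clear": pct_low_conf(clear),
    "challenging": pct_low_conf(dark),
})
comparison["difference"] = comparison["challenging"] - comparison["clear"]
print(comparison)
```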

In [None]:


# Install helpers (quiet)
!pip -q install gdown tables

import gdown, os, pandas as pd, numpy as np
from pathlib import Path

# 👇 Replace with your real Google Drive FILE ID (not the whole link!)
FILE_ID = "11zcVPSS4D-JLQQ11hkMbPwmqs-cd6Am2"

# Build a direct-download URL for gdown
URL = f"https://drive.google.com/uc?id={FILE_ID}"

DEST = Path("/content/dlc_output.h5")
print("Downloading from Drive...")
gdown.download(URL, str(DEST), quiet=False)

# Quick sanity check
assert DEST.exists() and DEST.stat().st_size > 0, "Download failed or empty file."
print(f"✅ Downloaded to {DEST} ({DEST.stat().st_size/1_000_000:.2f} MB)")


In [None]:
# DeepLabCut H5 often stores under keys like 'df_with_missing' or 'df'
# We'll try common keys, and fall back to listing what's available.

def load_dlc_h5(path: Path):
    try:
        # Try default (let pandas pick)
        return pd.read_hdf(path)
    except (KeyError, ValueError):
        # Inspect keys and try common ones
        with pd.HDFStore(path, mode="r") as store:
            keys = [k.strip("/") for k in store.keys()]
        print("Available keys in H5:", keys)
        for k in ["df_with_missing", "df", "pose", "table"]:
            if k in keys:
                return pd.read_hdf(path, key=k)
        # Last resort: first key
        if keys:
            return pd.read_hdf(path, key=keys[0])
        raise RuntimeError("No readable tables found in this H5.")

df = load_dlc_h5(DEST)
print("✅ Loaded H5 into DataFrame:", df.shape)
display(df.head(3))


In [None]:
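# DLC H5 columns are a MultiIndex: (scorer, bodyparts, coords) for single-animal files,
# or (scorer, individuals, bodyparts, coords) for multi-animal files like this one.
if isinstance(df.columns, pd.MultiIndex):
    # Join every level so the coord ("x", "y", "likelihood") is always the last part
    df.columns = ["/".join(map(str, lvl)) for lvl in df.columns]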
print("Columns (first 10):")
print(df.columns[:10])


In [None]:
import matplotlib.pyplot as plt

# Try to find any '/x' and matching '/y' columns
xcols = [c for c in df.columns if c.endswith("/x")]
assert len(xcols) > 0, "Couldn't find any '/x' columns. Inspect df.columns."
xcol = xcols[0]
ycol = xcol[:-2] + "/y"

plt.figure()
plt.plot(df[xcol].values, -df[ycol].values)  # invert y for display
plt.title(f"Trajectory preview: {xcol.split('/')[-2] if '/' in xcol else xcol}")  # second-to-last part is the bodypart
plt.xlabel("x"); plt.ylabel("y (top=up)")
plt.show()


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LizbethMG-Teaching/pose2behav-book/blob/main/notebooks/EDA.ipynb)