# Gaze Heatmap & Crops Pipeline (Google Colab)

This notebook processes **eye-tracking gaze data** from the NEAR Experiment Design dataset. For each user folder it:

1. **Loads gaze data** (from Pupil exports CSV or pldata).
2. **Splits the recording into time windows** (e.g. 3 s).
3. **For each window:** grabs a video frame, builds a **heatmap** (2D histogram + Gaussian blur + jet colormap overlay), saves full-frame heatmap and source images, computes a **region of interest** from the heatmap, and saves **cropped** versions of both.
4. **Builds an MP4** from the heatmap frames for easy playback.
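
As a rough illustration of step 2, the sketch below splits a recording of known duration into consecutive windows and reports the mid-time at which a frame would be grabbed. The helper name `iter_windows` is hypothetical and is not part of the notebook; the last window is simply shortened to end at the recording's duration.

```python
def iter_windows(duration_sec, interval_sec=3.0):
    """Yield (start, end, mid) tuples for consecutive windows covering the recording."""
    k = 0
    while k * interval_sec < duration_sec:
        start = k * interval_sec
        # The final window is clipped so it never runs past the end of the recording
        end = min(start + interval_sec, duration_sec)
        yield start, end, 0.5 * (start + end)
        k += 1

# A 10 s recording with 3 s windows gives three full windows and one 1 s remainder
windows = list(iter_windows(10.0, 3.0))
for start, end, mid in windows:
    print(f"{start:5.1f}-{end:5.1f} s, frame at {mid:.1f} s")
```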

**Outputs per user** (under `BASE_OUTPUT_PATH/<user_name>/frames/`):

- `src_000-003s.png`, `src_003-006s.png`, and so on: source video frame at mid-window.
- `heat_000-003s.png`, `heat_003-006s.png`, and so on: heatmap overlay (same style as Data_Analysis original).
- `src_*_crop.png`, `heat_*_crop.png`: crops around the gaze-dense region.
- `<user_name>_heatmap.mp4`: video of heatmap frames in time order.
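
Time order matters once a recording passes 999 s: the window tags are zero-padded to three digits, so a plain alphabetical sort would place `heat_1002-1005s.png` before `heat_999-1002s.png`. The sketch below shows the difference using a numeric key on the start time; `window_start` is a hypothetical helper name and the filenames are made up for the example.

```python
import re

def window_start(name):
    """Extract the window start time (in seconds) from a name like heat_003-006s.png."""
    m = re.search(r"_(\d+)-\d+s", name)
    # Names without a window tag are pushed to the end
    return int(m.group(1)) if m else 10**9

names = ["heat_1002-1005s.png", "heat_999-1002s.png", "heat_003-006s.png"]

alphabetical = sorted(names)                 # string comparison: "1002" sorts before "999"
by_time = sorted(names, key=window_start)    # numeric comparison on the start time

print(alphabetical)
print(by_time)
```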

Heatmap method matches the original Data_Analysis pipeline (2D histogram, Gaussian smoothing, jet colormap, matplotlib figure at 200 dpi).
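
A minimal sketch of the histogram-plus-blur idea on synthetic gaze points is shown below. The function name `gaze_heat`, the grid size, and the fixed `sigma` are assumptions chosen for the example and are not the notebook's parameters. Y is flipped because Pupil's normalized coordinates put the origin at the bottom-left, while image rows count from the top.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_heat(norm_x, norm_y, h, w, sigma):
    """2D histogram of normalized gaze points on an (h, w) grid, then Gaussian smoothing."""
    rows = 1.0 - np.asarray(norm_y, dtype=float)   # flip Y: bottom-left origin -> top-left origin
    cols = np.asarray(norm_x, dtype=float)
    hist, _, _ = np.histogram2d(rows, cols, bins=[h, w], range=[[0, 1], [0, 1]])
    return gaussian_filter(hist, sigma=sigma)

# Synthetic gaze cluster around x = 0.7, y = 0.8 (upper right of the scene)
rng = np.random.default_rng(0)
x = np.clip(rng.normal(0.7, 0.03, 500), 0, 1)
y = np.clip(rng.normal(0.8, 0.03, 500), 0, 1)

heat = gaze_heat(x, y, h=60, w=100, sigma=3)
peak_row, peak_col = np.unravel_index(np.argmax(heat), heat.shape)
print(heat.shape, peak_row, peak_col)
```

The peak should land near row 12 and column 70, that is, in the upper-right part of the grid, which is where the synthetic cluster sits once Y is flipped.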

## 1. Mount Google Drive

Mount Drive so we can read the source dataset and write outputs. Run this cell first.

In [30]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Imports

Libraries for gaze loading (including Pupil pldata via msgpack), video (OpenCV), heatmap (matplotlib, scipy), and MP4 writing (imageio, PIL).  
If you get an import error, run: `%pip install msgpack scipy pillow imageio imageio-ffmpeg`
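
If it is unclear which package is missing, a small check like the following can list them before running the imports cell. This is only a sketch: `REQUIRED` and `missing_packages` are hypothetical names, and the mapping from import names to pip names is an assumption based on the usual PyPI package names.

```python
import importlib.util

# Import name -> pip package name (assumed mapping, not taken from the notebook)
REQUIRED = {
    "msgpack": "msgpack",
    "cv2": "opencv-python",
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "PIL": "pillow",
    "imageio": "imageio",
    "imageio_ffmpeg": "imageio-ffmpeg",
}

def missing_packages(required):
    """Return pip names for modules that cannot be found in the current environment."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; try: %pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```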

In [31]:
# Import required libraries (heatmap from Data_Analysis_old_version)
# If needed: %pip install msgpack scipy pillow imageio imageio-ffmpeg
import os
import glob
import re
import msgpack
import cv2
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
from collections import Counter
from PIL import Image
import imageio.v2 as imageio

## 3. Paths and user list

- **BASE_SOURCE_PATH:** folder containing one subfolder per user (e.g. `Ayu_1`, `AT_1`), each with `world.mp4` and `exports/` (or pldata).
- **BASE_OUTPUT_PATH:** where to write per-user `frames/` and heatmap MP4s.  
We list all non-hidden subfolders of the source path as user folders (excluding `1_Data_Analysis`).
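
Before a long run it can help to confirm that each user folder has the expected inputs. The sketch below checks a single folder; `missing_inputs` is a hypothetical helper, and it looks only at the top level of the folder (the pipeline itself also searches subfolders for `world.mp4`). The demo uses a temporary directory rather than the real dataset.

```python
import os
import tempfile

def missing_inputs(user_dir):
    """Return a list of expected inputs that are absent from one user folder."""
    missing = []
    if not os.path.isfile(os.path.join(user_dir, "world.mp4")):
        missing.append("world.mp4")
    has_exports = os.path.isdir(os.path.join(user_dir, "exports"))
    has_pldata = os.path.isfile(os.path.join(user_dir, "gaze.pldata"))
    if not (has_exports or has_pldata):
        missing.append("exports/ or gaze.pldata")
    return missing

# Demo on a throwaway folder: empty first, then with a video file and an exports/ folder
with tempfile.TemporaryDirectory() as demo_dir:
    before = missing_inputs(demo_dir)
    open(os.path.join(demo_dir, "world.mp4"), "w").close()
    os.mkdir(os.path.join(demo_dir, "exports"))
    after = missing_inputs(demo_dir)

print(before)
print(after)
```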

In [32]:
# Where to read data (one subfolder per user: Ayu_1, AT_1, ...) and where to write results
BASE_SOURCE_PATH = "/content/drive/MyDrive/NEAR_Experiment_Design/PilotData_V1_10232025"
BASE_OUTPUT_PATH = "/content/drive/MyDrive/NEAR_Experiment_Design_Output"
os.makedirs(BASE_OUTPUT_PATH, exist_ok=True)

# Discover all user folders (each must contain world.mp4 and exports/ or pldata)
user_folders = sorted([f for f in os.listdir(BASE_SOURCE_PATH) 
                       if os.path.isdir(os.path.join(BASE_SOURCE_PATH, f)) 
                       and not f.startswith('.')
                       and f not in ['1_Data_Analysis']])  # Skip non-user folders
print(f"Found {len(user_folders)} user folders:")
for folder in user_folders[:10]:  # Show first 10
    print(f"  - {folder}")
if len(user_folders) > 10:
    print(f"  ... and {len(user_folders) - 10} more")

Found 26 user folders:
  - AT_1
  - AT_2
  - AT_3_1
  - AT_3_2
  - Ayu_1
  - Ayu_2
  - Ayu_3
  - JC_1
  - JC_2
  - JC_3_1
  ... and 16 more


## 4. Helper functions

- **Gaze:** `load_gaze_dataframe(task_dir)` loads from `exports/**/gaze_positions.csv` or, if missing, from Pupil `.pldata`; returns DataFrame with `timestamp`, `norm_pos_x`, `norm_pos_y`, `confidence`.
- **Video:** `open_world_video(task_dir)`, `grab_frame_at_time(cap, fps, t_sec)`: open `world.mp4` and seek to a time.
- **Heatmap:** `compute_heat(df_window, h, w)`: 2D histogram of gaze (Y flipped) + Gaussian blur; used for crop bbox.  
  `plot_and_save_heatmap(bg_rgb, df_window, path)`: same heatmap drawn with matplotlib (jet, alpha 0.5), saved as PNG (200 dpi).
- **Crop:** `bbox_from_heatmap_only(heat, pad, thresh)`: bounding box from heat above threshold.
- **MP4:** `sort_by_window(files)` (by start time in filename), `to_even_size`, `make_mp4_from_folder(folder, pattern, out_mp4, fps)`: build MP4 from heatmap PNGs.
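
To make the crop step concrete, here is a small sketch of a thresholded bounding box on a toy heat array. The name `bbox_above_threshold` is hypothetical; this variant returns exclusive end coordinates so the result can be used directly in a NumPy slice such as `frame[y0:y1, x0:x1]`.

```python
import numpy as np

def bbox_above_threshold(heat, thresh=0.5, pad=0):
    """Bounding box (x0, y0, x1, y1) of cells whose normalized heat exceeds thresh."""
    if heat.size == 0 or heat.max() <= 0:
        return None
    ys, xs = np.where(heat / heat.max() > thresh)
    if len(xs) == 0:
        return None
    h, w = heat.shape
    x0 = max(0, int(xs.min()) - pad)
    y0 = max(0, int(ys.min()) - pad)
    x1 = min(w, int(xs.max()) + 1 + pad)   # exclusive end, clipped to the array width
    y1 = min(h, int(ys.max()) + 1 + pad)   # exclusive end, clipped to the array height
    return x0, y0, x1, y1

heat = np.zeros((6, 8))
heat[2:4, 3:6] = 1.0    # the gaze-dense region
heat[0, 0] = 0.2        # weak activity below the 0.5 threshold, ignored

tight = bbox_above_threshold(heat)
padded = bbox_above_threshold(heat, pad=1)
print(tight, padded)
```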

## 5. Main processing function

`process_user_folder(user_name, source_base_path, output_base_path, ...)` does everything for one user:

1. Load gaze and open video; align gaze time to relative seconds (`t_rel`).
2. For each time window: grab frame at window mid-time; save source PNG; build and save heatmap PNG; compute heat array and bbox; save source and heatmap crops (heatmap crop uses bbox scaled to figure size 2000×1200).
3. Build `<user_name>_heatmap.mp4` from all `heat_*-*s.png` frames.
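
The bbox scaling in step 2 can be sketched as below. `scale_bbox` is a hypothetical helper, and the 1280×720 video size is an assumed example value. The sketch also assumes the video frame fills the whole saved figure; if matplotlib leaves margins around the image, the scaled box is only approximate.

```python
def scale_bbox(bbox, src_size, dst_size):
    """Scale (x0, y0, x1, y1) from src_size = (w, h) pixels to dst_size = (w, h) pixels."""
    x0, y0, x1, y1 = bbox
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    # Integer arithmetic keeps the result exact and avoids float rounding surprises
    return (x0 * dst_w // src_w, y0 * dst_h // src_h,
            x1 * dst_w // src_w, y1 * dst_h // src_h)

# A bbox found on a 1280x720 video frame, mapped onto the 2000x1200 heatmap PNG
fig_bbox = scale_bbox((320, 180, 640, 360), src_size=(1280, 720), dst_size=(2000, 1200))
print(fig_bbox)
```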

Parameters: `interval_sec` (window length in seconds), `pad` (pixels around bbox for crop), `sample_windows` (cap number of windows if set).

In [33]:
# Heatmap pipeline from Data_Analysis_old_version (Ayu_1_heatmap.mp4 style)
def load_pldata_file(directory, topic):
    ts_file = os.path.join(directory, f"{topic}_timestamps.npy")
    mp_file = os.path.join(directory, f"{topic}.pldata")
    data_list = []
    if not (os.path.exists(ts_file) and os.path.exists(mp_file)):
        return [], np.array([])
    timestamps = np.load(ts_file)
    with open(mp_file, "rb") as fh:
        unpacker = msgpack.Unpacker(fh, raw=False, use_list=False)
        for tpc, payload in unpacker:
            datum = msgpack.unpackb(payload, raw=False, use_list=False)
            data_list.append(datum)
    return data_list, timestamps

def pldata_gaze_to_dataframe(directory):
    def _normalize_to_unit(series):
        v = series.astype(float)
        vmin, vmax = float(v.min()), float(v.max())
        rng = vmax - vmin
        if (vmin < 0) or (vmax > 1):
            if rng <= 1.2:
                v = (v - vmin) / max(rng, 1e-9)
            elif rng <= 2.2 and (vmin >= -1.1) and (vmax <= 1.1):
                v = (v + 1.0) / 2.0
            else:
                v = v.clip(0, 1)
        return v
    data_list, ts = load_pldata_file(directory, "gaze")
    if len(ts) == 0:
        return pd.DataFrame()
    rows = []
    for d, tt in zip(data_list, ts):
        row = {"timestamp": float(d.get("timestamp", tt))}
        if "norm_pos" in d:
            nx, ny = d["norm_pos"][0], d["norm_pos"][1]
        else:
            nx, ny = d.get("norm_pos_x", np.nan), d.get("norm_pos_y", np.nan)
        row["norm_pos_x"] = float(nx) if nx is not None else np.nan
        row["norm_pos_y"] = float(ny) if ny is not None else np.nan
        row["confidence"] = float(d.get("confidence", np.nan))
        rows.append(row)
    df = pd.DataFrame(rows).dropna(subset=["norm_pos_x", "norm_pos_y"])
    if (df["norm_pos_x"].min() < 0) or (df["norm_pos_x"].max() > 1) or (df["norm_pos_y"].min() < 0) or (df["norm_pos_y"].max() > 1):
        df["norm_pos_x"] = _normalize_to_unit(df["norm_pos_x"])
        df["norm_pos_y"] = _normalize_to_unit(df["norm_pos_y"])
    return df

def exports_csv_to_dataframe(exports_dir):
    candidates = glob.glob(os.path.join(exports_dir, "**", "gaze_positions.csv"), recursive=True)
    if not candidates:
        return pd.DataFrame()
    path = candidates[0]
    df = pd.read_csv(path)
    ts_col = next((c for c in ["gaze_timestamp", "timestamp", "world_timestamp", "time", "system_time"] if c in df.columns), None)
    if ts_col is None:
        return pd.DataFrame()
    if {"norm_pos_x", "norm_pos_y"}.issubset(df.columns):
        out = pd.DataFrame()
        out["timestamp"] = df[ts_col].astype(float)
        out["norm_pos_x"] = df["norm_pos_x"].astype(float)
        out["norm_pos_y"] = df["norm_pos_y"].astype(float)
        out["confidence"] = df["confidence"].astype(float) if "confidence" in df.columns else np.nan
        return out.sort_values("timestamp").reset_index(drop=True)
    if {"gaze_point_2d_x", "gaze_point_2d_y"}.issubset(df.columns):
        out = pd.DataFrame()
        out["timestamp"] = df[ts_col].astype(float)
        out["norm_pos_x"] = df["gaze_point_2d_x"].astype(float)
        out["norm_pos_y"] = df["gaze_point_2d_y"].astype(float)
        out["confidence"] = df["confidence"].astype(float) if "confidence" in df.columns else np.nan
        return out.sort_values("timestamp").reset_index(drop=True)
    return pd.DataFrame()

def load_gaze_dataframe(task_dir):
    exports_dir = os.path.join(task_dir, "exports")
    df_csv = exports_csv_to_dataframe(exports_dir)
    if not df_csv.empty:
        return df_csv
    df_pl = pldata_gaze_to_dataframe(task_dir)
    if not df_pl.empty:
        return df_pl
    raise FileNotFoundError("Could not load gaze data from exports CSV or pldata.")

def open_world_video(task_dir):
    mp4_path = os.path.join(task_dir, "world.mp4")
    if not os.path.exists(mp4_path):
        cand = glob.glob(os.path.join(task_dir, "**", "world.mp4"), recursive=True)
        if cand:
            mp4_path = cand[0]
    cap = cv2.VideoCapture(mp4_path)
    if not cap.isOpened():
        raise FileNotFoundError(f"Cannot open video at: {mp4_path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    duration = frame_count / fps if fps > 0 else 0.0
    return cap, fps, frame_count, duration, width, height

def grab_frame_at_time(cap, fps, t_sec):
    frame_idx = int(round(t_sec * fps))
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    return frame if ok else None

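def compute_heat(df_window, h, w, blur_ratio=0.05):
    """Return an (h, w) gaze-density array: 2D histogram of gaze points, Gaussian-smoothed."""
    gx = df_window["norm_pos_x"].to_numpy()
    # Flip Y so that row 0 corresponds to the top of the image
    gy = 1.0 - df_window["norm_pos_y"].to_numpy()
    hist, _, _ = np.histogram2d(gy, gx, bins=[h, w], range=[[0, 1], [0, 1]])
    # Odd values of about blur_ratio * size; they are passed to gaussian_filter as sigma (pixels)
    fh = max(1, (int(blur_ratio * h) // 2) * 2 + 1)
    fw = max(1, (int(blur_ratio * w) // 2) * 2 + 1)
    heat = gaussian_filter(hist, sigma=(fh, fw), order=0)
    return heat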

def plot_and_save_heatmap(bg_image_rgb, df_window, out_path_png, blur_ratio=0.05):
    os.makedirs(os.path.dirname(out_path_png), exist_ok=True)
    h, w = bg_image_rgb.shape[:2]
    # Reuse compute_heat so the overlay and the crop bbox come from the same computation
    heat = compute_heat(df_window, h, w, blur_ratio=blur_ratio)
    plt.figure(figsize=(10, 6))
    plt.imshow(bg_image_rgb)
    plt.imshow(heat, cmap="jet", alpha=0.5)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.savefig(out_path_png, dpi=200, bbox_inches=None, pad_inches=0)
    plt.close()

def bbox_from_heatmap_only(hmap, pad=0, thresh=0.5):
    if hmap.size == 0 or hmap.max() <= 0:
        return None
    hmap_n = hmap / hmap.max()
    mask = hmap_n > thresh
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return None
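    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    # +1 makes the end coordinates exclusive, so slicing [y0:y1, x0:x1] keeps the last hot row/column
    x1, y1 = min(hmap.shape[1], x1 + 1 + pad), min(hmap.shape[0], y1 + 1 + pad)
    return x0, y0, x1, y1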

def sort_by_window(files):
    def key_fn(p):
        m = re.search(r"_(\d+)-\d+s", os.path.basename(p))
        return int(m.group(1)) if m else 10**9
    return sorted(files, key=key_fn)

def to_even_size(size):
    w, h = size
    w, h = w - (w % 2), h - (h % 2)
    return (max(2, w), max(2, h))

def make_mp4_from_folder(folder, pattern, out_mp4_path, fps=2):
    files = sort_by_window(glob.glob(os.path.join(folder, pattern)))
    if not files:
        return
    sizes = [Image.open(p).size for p in files]
    common_size = Counter(sizes).most_common(1)[0][0]
    target_size = to_even_size(common_size)
    with imageio.get_writer(out_mp4_path, fps=fps, codec="libx264", quality=8, macro_block_size=None) as writer:
        for p in files:
            im = Image.open(p).convert("RGB")
            if im.size != target_size:
                im = im.resize(target_size, Image.LANCZOS)
            writer.append_data(np.array(im))
    print(f"[video] Saved: {out_mp4_path} ({len(files)} frames)")



In [34]:
def process_user_folder(user_name, source_base_path, output_base_path,
                        interval_sec=3.0, pad=20, alpha=0.5, sample_windows=None):
    """Process one user: load gaze + video, build heatmaps per window, save crops and MP4."""
    user_source_path = os.path.join(source_base_path, user_name)
    frames_dir = os.path.join(output_base_path, user_name, "frames")

    # Check the source first so no empty output folder is created for a missing user
    if not os.path.isdir(user_source_path):
        print(f"⚠️  {user_name}: folder not found")
        return False
    os.makedirs(frames_dir, exist_ok=True)

    print(f"📊 Processing {user_name}...")
    try:
        gaze_df = load_gaze_dataframe(user_source_path)
    except Exception as e:
        print(f"❌ {user_name}: Failed to load gaze - {e}")
        return False
    try:
        cap, fps, frame_count, duration, W, H = open_world_video(user_source_path)
    except Exception as e:
        print(f"❌ {user_name}: Failed to open video - {e}")
        return False

    # Align gaze time to 0 and compute number of windows
    t_min = float(gaze_df["timestamp"].min())
    gaze_df = gaze_df.assign(t_rel=gaze_df["timestamp"] - t_min)
    max_duration = min(duration, float(gaze_df["t_rel"].max()))
    num_windows = int(max_duration // interval_sec) + 1
    if sample_windows is not None:
        num_windows = min(num_windows, sample_windows)

    processed = 0
    for k in range(num_windows):
        start_t = k * interval_sec
        end_t = min((k + 1) * interval_sec, max_duration)
        if end_t <= start_t + 1e-6:
            continue
        win_mask = (gaze_df["t_rel"] >= start_t) & (gaze_df["t_rel"] < end_t)
        df_win = gaze_df.loc[win_mask]
        if df_win.empty:
            continue

        mid_t = 0.5 * (start_t + end_t)
        frame_bgr = grab_frame_at_time(cap, fps, mid_t)
        if frame_bgr is None:
            continue
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

        win_tag = f"{int(start_t):03d}-{int(end_t):03d}s"
        heat_path = os.path.join(frames_dir, f"heat_{win_tag}.png")
        src_path = os.path.join(frames_dir, f"src_{win_tag}.png")

        cv2.imwrite(src_path, frame_bgr)
        plot_and_save_heatmap(frame_rgb, df_win, heat_path)

        heat = compute_heat(df_win, H, W)
        bbox = bbox_from_heatmap_only(heat, pad=pad)
        if bbox is not None:
            x0, y0, x1, y1 = bbox
            crop_src = frame_bgr[y0:y1, x0:x1]
            cv2.imwrite(os.path.join(frames_dir, f"src_{win_tag}_crop.png"), crop_src)
            saved_heat = cv2.imread(heat_path)
            if saved_heat is not None:
                # Heatmap PNG is 2000x1200 (matplotlib fig 10x6 @ 200 dpi); read the size from the
                # saved image instead of hard-coding it, then scale the bbox to crop it.
                # This assumes the frame fills the whole figure, so the crop is approximate.
                H_fig, W_fig = saved_heat.shape[:2]
                sx, sy = W_fig / W, H_fig / H
                x0f, x1f = int(x0 * sx), int(x1 * sx)
                y0f, y1f = int(y0 * sy), int(y1 * sy)
                crop_heat = saved_heat[y0f:y1f, x0f:x1f]
                cv2.imwrite(os.path.join(frames_dir, f"heat_{win_tag}_crop.png"), crop_heat)
        # Count every window that produced a heatmap, whether or not a crop was possible
        processed += 1

    cap.release()

    heat_mp4_path = os.path.join(frames_dir, f"{user_name}_heatmap.mp4")
    make_mp4_from_folder(frames_dir, "heat_*-*s.png", heat_mp4_path, fps=2)

    print(f"✅ {user_name}: Processed {processed} intervals, heatmap MP4: {heat_mp4_path}")
    return True



## 6. Run pipeline

Set **SINGLE_USER** to a folder name (e.g. `"Ayu_1"`) to process only that user, or set to **`None`** to process **all users** in `user_folders`.  
Outputs go under `BASE_OUTPUT_PATH/<user_name>/frames/`.

In [None]:
# --- Configuration ---
INTERVAL_SEC = 3.0      # Time window length in seconds (e.g. 3 = one heatmap every 3 s)
PAD = 20               # Extra pixels around heatmap ROI for crop
SAMPLE_WINDOWS = None  # None = process full recording; set to int to limit number of windows per user

# Process one user (set name) or all users (set to None)
SINGLE_USER = None     # e.g. None for all users, or "Ayu_1" to run only that folder

# --- Run ---
users_to_run = [SINGLE_USER] if SINGLE_USER else user_folders
print(f"Processing {len(users_to_run)} user(s): {users_to_run[:5]}{'...' if len(users_to_run) > 5 else ''}\n")

successful = 0
failed = 0
for user in users_to_run:
    try:
        result = process_user_folder(user, BASE_SOURCE_PATH, BASE_OUTPUT_PATH,
                                     interval_sec=INTERVAL_SEC, pad=PAD, alpha=0.5,
                                     sample_windows=SAMPLE_WINDOWS)
        if result:
            successful += 1
        else:
            failed += 1
    except Exception as e:
        print(f"❌ {user}: {e}")
        failed += 1

print(f"\n✅ Done: {successful} ok, {failed} failed. Outputs under: {BASE_OUTPUT_PATH}")

Processing 26 user(s): ['AT_1', 'AT_2', 'AT_3_1', 'AT_3_2', 'Ayu_1']...

📊 Processing AT_1...
[video] Saved: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_1/frames/AT_1_heatmap.mp4 (24 frames)
✅ AT_1: Processed 24 intervals, heatmap MP4: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_1/frames/AT_1_heatmap.mp4
📊 Processing AT_2...
[video] Saved: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_2/frames/AT_2_heatmap.mp4 (10 frames)
✅ AT_2: Processed 10 intervals, heatmap MP4: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_2/frames/AT_2_heatmap.mp4
📊 Processing AT_3_1...
[video] Saved: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_3_1/frames/AT_3_1_heatmap.mp4 (7 frames)
✅ AT_3_1: Processed 7 intervals, heatmap MP4: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_3_1/frames/AT_3_1_heatmap.mp4
📊 Processing AT_3_2...
[video] Saved: /content/drive/MyDrive/NEAR_Experiment_Design_Output/AT_3_2/frames/AT_3_2_heatmap.mp4 (15 

In [None]:
# To process only one user (e.g. for testing), set SINGLE_USER in the cell above:
#   SINGLE_USER = "Ayu_1"
# Then re-run the "Run pipeline" cell. To process everyone again, set:
#   SINGLE_USER = None

## 7. Check outputs

Summarizes how many frames and crops were written per user. Run after the pipeline has finished.
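
One way to count outputs by category is to parse the filenames with a single pattern, as in the sketch below. `WINDOW_RE` and `summarize` are hypothetical names and the filenames in the demo are made up; the pattern follows the naming scheme described in the overview (`src_*`, `heat_*`, optional `_crop` suffix).

```python
import re
from collections import Counter

# Matches names such as src_000-003s.png or heat_003-006s_crop.png
WINDOW_RE = re.compile(r"^(src|heat)_(\d+-\d+s)(_crop)?\.png$")

def summarize(filenames):
    """Count full frames, crops, and MP4 files in a list of output filenames."""
    counts = Counter()
    for name in filenames:
        m = WINDOW_RE.match(name)
        if m:
            kind = m.group(1) + ("_crop" if m.group(3) else "_full")
            counts[kind] += 1
        elif name.endswith(".mp4"):
            counts["mp4"] += 1
    return dict(counts)

names = [
    "src_000-003s.png", "heat_000-003s.png",
    "src_000-003s_crop.png", "heat_000-003s_crop.png",
    "src_003-006s.png", "heat_003-006s.png",
    "Ayu_1_heatmap.mp4",
]
summary = summarize(names)
print(summary)
```

A window whose full frames exist but whose crops do not (the second window in this example) shows up as a gap between the `_full` and `_crop` counts.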

In [None]:
def check_output_files(user_name):
    """Print a short summary of generated files for one user."""
    frames_dir = os.path.join(BASE_OUTPUT_PATH, user_name, "frames")
    if not os.path.exists(frames_dir):
        print(f"  {user_name}: no output")
        return
    files = sorted(os.listdir(frames_dir))
    src_full = [f for f in files if f.startswith("src_") and "_crop" not in f]
    heat_full = [f for f in files if f.startswith("heat_") and "_crop" not in f and not f.endswith(".mp4")]
    heat_crop = [f for f in files if "heat_" in f and "_crop" in f]
    mp4s = [f for f in files if f.endswith(".mp4")]
    print(f"  {user_name}: {len(src_full)} frames, {len(heat_crop)} crops, MP4: {len(mp4s)}")

# Summary for every user that has output (after running the pipeline)
print("Output summary (users with frames/ folder):")
for user in user_folders:
    check_output_files(user)


📁 Output files for Ayu_1:
  Source full frames: 20
  Source crops:       20
  Heat full frames:   20
  Heat crops:         20
  Total files:        81
