# Tutorial: Use Precomputed IMWUT Embeddings in Another Repo

This notebook shows a minimal workflow to reuse embeddings generated by
`trustME_CLUES/source/segment_imwut.py` in any other project.

Artifacts expected:
- `moment_embeddings.npz`
- `segments_metadata.parquet`
- (optional) `manifest.json`


## 1) Point to the embedding artifact directory

If this notebook is copied to another repo, set `EMB_DIR` to the absolute path
of the exported embedding folder.


In [1]:
from pathlib import Path
import json

import numpy as np
import pandas as pd

# Change this path in your other repo:
EMB_DIR = Path('/home/ppg/eyetracking/moment4ET/trustME_CLUES/data/processed/imwut_tobii')

EMB_PATH = EMB_DIR / 'moment_embeddings.npz'
META_PATH = EMB_DIR / 'segments_metadata.parquet'
MANIFEST_PATH = EMB_DIR / 'manifest.json'

print('Embedding dir:', EMB_DIR)
for p in [EMB_PATH, META_PATH, MANIFEST_PATH]:
    print(f'{p.name:28s} exists={p.exists()}')


Embedding dir: /home/ppg/eyetracking/moment4ET/trustME_CLUES/data/processed/imwut_tobii
moment_embeddings.npz        exists=True
segments_metadata.parquet    exists=True
manifest.json                exists=True
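
A hard-coded absolute path is easy to forget when the notebook moves between machines. One option, sketched below, is to read the directory from an environment variable and fall back to a default. The variable name `IMWUT_EMB_DIR` and the helpers `resolve_emb_dir` and `missing_artifacts` are hypothetical names chosen for this sketch; they are not part of `trustME_CLUES`.

```python
import os
from pathlib import Path


def resolve_emb_dir(default, env_var="IMWUT_EMB_DIR"):
    """Return the embedding directory, preferring an environment variable.

    `env_var` is a hypothetical name used only in this sketch.
    """
    raw = os.environ.get(env_var) or str(default)
    return Path(raw).expanduser()


def missing_artifacts(emb_dir, required=("moment_embeddings.npz",
                                         "segments_metadata.parquet")):
    """List the required artifact file names that are absent from `emb_dir`."""
    return [name for name in required if not (Path(emb_dir) / name).exists()]
```

With these in place, `EMB_DIR = resolve_emb_dir('/path/to/imwut_tobii')` followed by a check that `missing_artifacts(EMB_DIR)` is empty fails early instead of at load time.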


## 2) Load embeddings + metadata and validate alignment

In [2]:
emb_payload = np.load(EMB_PATH, allow_pickle=False)
X = emb_payload['embeddings'].astype(np.float32)
seg_ids = emb_payload['segment_id'].astype(str)

meta = pd.read_parquet(META_PATH)
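kept = meta.loc[meta['kept']].copy()
# Compare IDs as strings so the dtype matches seg_ids
kept['segment_id'] = kept['segment_id'].astype(str)

assert X.shape[0] == len(seg_ids), 'Embeddings and IDs misaligned'
assert len(set(seg_ids)) == len(seg_ids), 'Duplicate segment IDs in embeddings'
assert kept['segment_id'].is_unique, 'Duplicate segment IDs in metadata kept rows'
assert set(seg_ids) == set(kept['segment_id']), 'ID mismatch against metadata kept rows'

# Reorder metadata to embedding row order
kept = kept.set_index('segment_id').loc[seg_ids].reset_index()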

print('X shape:', X.shape)
print('Kept metadata shape:', kept.shape)
print('Labels:', kept['Label'].nunique(), '| Subjects:', kept['Subject'].nunique())


X shape: (10110, 1024)
Kept metadata shape: (10110, 16)
Labels: 15 | Subjects: 25
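
The alignment step can also be factored into a reusable function for the other repo. The sketch below uses only NumPy and returns the row order to apply to the metadata. The name `alignment_order` is hypothetical, and the function assumes segment IDs are unique once cast to strings.

```python
import numpy as np


def alignment_order(emb_ids, meta_ids):
    """Return indices that reorder metadata rows into embedding row order.

    Raises ValueError on duplicate or mismatched IDs.
    """
    emb_ids = np.asarray(emb_ids).astype(str)
    meta_ids = np.asarray(meta_ids).astype(str)
    if len(np.unique(emb_ids)) != len(emb_ids):
        raise ValueError("duplicate segment IDs in the embedding file")
    if len(np.unique(meta_ids)) != len(meta_ids):
        raise ValueError("duplicate segment IDs in the metadata")
    if set(emb_ids) != set(meta_ids):
        raise ValueError("embedding and metadata IDs differ")
    # Map each metadata ID to its row position, then look up in embedding order.
    position = {sid: i for i, sid in enumerate(meta_ids)}
    return np.array([position[sid] for sid in emb_ids], dtype=int)
```

Usage would look like `kept = kept.iloc[alignment_order(seg_ids, kept['segment_id'])].reset_index(drop=True)`.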


## 3) (Optional) Read provenance from manifest

In [3]:
if MANIFEST_PATH.exists():
    manifest = json.loads(MANIFEST_PATH.read_text(encoding='utf-8'))
    print('Pipeline version:', manifest.get('pipeline_version'))
    print('Model:', manifest.get('model', {}).get('name'))
    print('Counts:', manifest.get('counts'))
else:
    print('manifest.json not found (this is optional).')


Pipeline version: 1.0.0
Model: AutonLab/MOMENT-1-large
Counts: {'segments_dropped': 7253, 'segments_kept': 10110, 'segments_total_filtered': 17363}
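
The manifest counts can double as a sanity check on what was loaded. In the output above, `segments_kept` matches the number of embedding rows and `segments_kept + segments_dropped` equals `segments_total_filtered`. The sketch below assumes those two relations hold for every manifest written by the pipeline, which the source does not state explicitly. `check_manifest_counts` is a hypothetical helper.

```python
def check_manifest_counts(counts, n_embeddings):
    """Return a list of human-readable problems; an empty list means consistent.

    Assumes kept + dropped == total_filtered, as in the example manifest.
    """
    problems = []
    kept = counts.get("segments_kept")
    dropped = counts.get("segments_dropped")
    total = counts.get("segments_total_filtered")
    if kept is not None and kept != n_embeddings:
        problems.append(
            f"segments_kept={kept} but {n_embeddings} embedding rows loaded"
        )
    if None not in (kept, dropped, total) and kept + dropped != total:
        problems.append(f"kept + dropped = {kept + dropped}, expected {total}")
    return problems
```

Calling `check_manifest_counts(manifest.get('counts', {}), X.shape[0])` returns an empty list when the artifacts and manifest agree.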


## 4) Build labels for your downstream task

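The example below builds a binary target: segments labeled `passive_viewing`,
`listen_music`, or `rest` become `no_load`, and every other label becomes `load`.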


In [4]:
no_load_labels = {'passive_viewing', 'listen_music', 'rest'}
y_binary = np.where(kept['Label'].isin(no_load_labels), 'no_load', 'load')

print(pd.Series(y_binary).value_counts())


load       7704
no_load    2406
Name: count, dtype: int64
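
One caveat with a set-membership mapping is that a misspelled label name fails silently: the affected segments simply fall into the `load` class. A minimal guard, sketched below, raises if any configured label never occurs in the data. `binary_labels` is a hypothetical helper, and the labels in the usage line other than `rest` are made up for illustration.

```python
def binary_labels(labels, negative_labels,
                  negative_name="no_load", positive_name="load"):
    """Map raw labels to two classes, rejecting configured labels that never occur."""
    labels = list(labels)
    negative_labels = set(negative_labels)
    missing = sorted(negative_labels - set(labels))
    if missing:
        raise ValueError(f"labels never seen in the data: {missing}")
    return [negative_name if lab in negative_labels else positive_name
            for lab in labels]


# "task_a" is an invented label used only for this demonstration.
print(binary_labels(["rest", "task_a", "rest"], {"rest"}))
```

In the notebook this would be called as `binary_labels(kept['Label'], no_load_labels)`, wrapped in `np.array(...)` if an array is needed downstream.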


## 5) Train a quick baseline in your other repo

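Minimal baseline: logistic regression on frozen embeddings, evaluated with
subject-wise `GroupKFold` so that no subject appears in both the training and
test folds.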


In [5]:
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

le = LabelEncoder()
y = le.fit_transform(y_binary)
groups = kept['Subject'].to_numpy()  # subject-wise CV (recommended)

clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, class_weight='balanced')
)

gkf = GroupKFold(n_splits=min(5, len(np.unique(groups))))
scores = []
for tr, te in gkf.split(X, y, groups=groups):
    clf.fit(X[tr], y[tr])
    y_pred = clf.predict(X[te])
    scores.append(balanced_accuracy_score(y[te], y_pred))

print('GroupKFold balanced accuracy mean:', float(np.mean(scores)))
print('GroupKFold balanced accuracy std :', float(np.std(scores)))


GroupKFold balanced accuracy mean: 0.6077306337869969
GroupKFold balanced accuracy std : 0.012702279464084086
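
To put the score in context, recall that balanced accuracy is the mean of per-class recall, so a classifier that always predicts one class scores 0.5 on a binary task no matter how imbalanced the classes are. The sketch below re-implements the metric in NumPy (a stand-in for illustration, not a replacement for `balanced_accuracy_score`) and applies it to a constant predictor using the class counts printed in section 4.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in `y_true`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Class counts from section 4: 7704 `load` (encoded 0), 2406 `no_load` (encoded 1).
y_true = np.array([0] * 7704 + [1] * 2406)
always_load = np.zeros_like(y_true)

print(balanced_accuracy(y_true, always_load))  # 0.5
```

The mean of roughly 0.61 reported above therefore sits modestly above the 0.5 chance level for this metric.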


## 6) Save a portable package (optional)

To move the embeddings to another machine or repo, archive the artifact files:
- `moment_embeddings.npz`
- `segments_metadata.parquet`
- (optional) `manifest.json`

Unpack the archive anywhere and point `EMB_DIR` at the extracted folder.


In [6]:
# Example packaging command (run in terminal, not required in notebook):
# tar -czf imwut_tobii_embeddings.tar.gz -C /home/ppg/eyetracking/moment4ET/trustME_CLUES/data/processed imwut_tobii
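
If a shell is not convenient, the same archive can be built from Python with the standard-library `tarfile` module. This is a sketch of one way to do it. `pack_embeddings` is a hypothetical helper, and it skips files that do not exist, so the optional `manifest.json` may be absent.

```python
import tarfile
from pathlib import Path


def pack_embeddings(emb_dir, out_path,
                    names=("moment_embeddings.npz",
                           "segments_metadata.parquet",
                           "manifest.json")):
    """Write existing artifact files into a gzip tarball; return the names added."""
    emb_dir = Path(emb_dir)
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for name in names:
            path = emb_dir / name
            if path.exists():
                # Store under "<dir name>/<file>" so unpacking recreates the folder.
                tar.add(path, arcname=f"{emb_dir.name}/{name}")
                added.append(name)
    return added
```

For example, `pack_embeddings(EMB_DIR, 'imwut_tobii_embeddings.tar.gz')` produces an archive with the same layout as the `tar` command above.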
