# Phase 4 (Revised): Dark-Cycle Estrous Discovery (Vehicle Cages Only)
## Morph2REP Study 1001 (2025v3.3) — cycle-aware inference from Envision video-derived behavior

### What we’re trying to do
We **do not have estrous ground-truth labels** (no vaginal cytology).  
Our goal is to test whether we can recover a **within-mouse ~4–5 day cyclic signal** consistent with estrous using **home-cage, video-derived behavioral bouts**.

### Why dark-cycle focus?
Mice are nocturnal: in home-cage data, light-phase behavior is dominated by rest, which leaves little behavioral variance in which to detect a cycle. Estrous-linked modulation frequently appears as changes in:
- locomotion/exploration intensity
- sleep / inactivity structure
- fragmentation of bouts

We therefore compute features **per “night”** (6 PM → 6 AM EST), which spans two calendar dates.

### Strategy
1. Load **vehicle-only cages** using the same DuckDB → S3 `read_parquet` approach as Phase 3.
2. Apply **data cleaning** (spillover, cage change, termination days).
3. Build **dark-cycle nightly features** per mouse.
4. Normalize **within mouse**:
   - raw nightly metrics
   - within-mouse z-scores
   - leakage-safe rolling z-scores (past-only baseline)
5. Dimensionality reduction (PCA).
6. Fit a **4-state time-aware model** (HMM) to infer latent “phase-like” states.
7. Validate with:
   - per-mouse periodicity near 4–5 days (Lomb–Scargle on PC1)
   - dwell times & transition matrix sanity checks

### Data available (used here)
- `animal_bouts.parquet`: bout start/end times, state name, bout duration seconds
- `animal_bout_metrics.parquet` (optional): bout-level metrics such as distance by state (schema varies)

> This notebook is designed to be robust to `animal_bout_metrics.parquet` schema differences by attempting to detect the correct metric columns.

---
## 1) Setup: packages and imports
We install and import packages for:
- loading parquet from S3: DuckDB + PyArrow
- feature engineering: pandas/numpy
- modeling: scikit-learn + hmmlearn
- periodicity: SciPy Lomb–Scargle
- plotting: matplotlib

In [None]:
!pip -q install duckdb pyarrow pandas numpy scikit-learn scipy matplotlib hmmlearn

In [None]:
import warnings
warnings.filterwarnings("ignore")

import duckdb
import numpy as np
import pandas as pd

from datetime import date, datetime, timedelta
from typing import Dict, List, Tuple, Optional

import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from scipy.signal import lombscargle

---
## 2) Configuration: vehicle cages, dates, and exclusions (data cleaning)
We follow your study metadata and keep **vehicle cages only**:

- Rep 1 vehicle cages: 4918, 4922, 4923  
- Rep 2 vehicle cages: 4928, 4929, 4934

### Cleaning rules (from your Phase 3 cleaning table)
We exclude:
- acclimation spillover days: **Jan 9** (Rep1), **Jan 24** (Rep2)
- cage change days: **Jan 15** (Rep1), **Jan 29** (Rep2)
- termination/unstable days: **Feb 3–4** (Rep2)

### Time alignment
Facility schedule is **EST**:
- Lights ON 6:00 AM EST
- Lights OFF 6:00 PM EST

Timestamps are stored as **UTC**. The analysis windows fall in January–February, so no daylight-saving shift applies and EST = UTC − 5 hours throughout:
- 6:00 AM EST = 11:00 UTC
- 6:00 PM EST = 23:00 UTC

**Dark cycle = outside [11:00, 23:00) UTC**, i.e., `hour < 11` OR `hour >= 23`.
We assign each bout to a **night_date**, the EST calendar date on which that dark phase begins:
- 23:00–23:59 UTC (6–7 PM EST): night_date = UTC date of the bout
- 00:00–10:59 UTC (7 PM–6 AM EST): night_date = UTC date minus 1 day (the bout belongs to the **previous** night)
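
As a quick check of the rule above, here is a minimal sketch that maps a few UTC timestamps to their night. The helper `night_date_for` and the fixed-offset `EST` object are illustrative names, not part of the notebook's pipeline, and the fixed −5 h offset is an assumption that only holds outside daylight-saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset (no DST), as stated for this study window.
EST = timezone(timedelta(hours=-5))

def night_date_for(ts_utc):
    """Return the EST date on which the dark phase began, or None during lights-on."""
    local = ts_utc.astimezone(EST)
    if local.hour >= 18:            # 6 PM to midnight EST: tonight
        return local.date()
    if local.hour < 6:              # midnight to 6 AM EST: still last night
        return local.date() - timedelta(days=1)
    return None                     # 6 AM to 6 PM EST: light phase

examples = [
    datetime(2025, 1, 10, 23, 30, tzinfo=timezone.utc),  # 6:30 PM EST, Jan 10
    datetime(2025, 1, 11, 3, 0, tzinfo=timezone.utc),    # 10:00 PM EST, Jan 10
    datetime(2025, 1, 11, 10, 59, tzinfo=timezone.utc),  # 5:59 AM EST, Jan 11
    datetime(2025, 1, 11, 11, 0, tzinfo=timezone.utc),   # 6:00 AM EST, lights on
]
for ts in examples:
    print(ts.isoformat(), "->", night_date_for(ts))
```

The first three timestamps all land on the night of Jan 10, which is the same answer the UTC-hour rule (`hour < 11` means subtract one day) gives.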

In [None]:
from datetime import date

S3_BASE = "s3://jax-envision-public-data/study_1001/2025v3.3/tabular"

VEHICLE_CAGES = {
    "Rep1": {
        "cages": [4918, 4922, 4923],
        "analysis_start": "2025-01-10",  # after acclimation (Jan 7–9)
        "analysis_end": "2025-01-22",
        "cage_change": date(2025, 1, 15),
    },
    "Rep2": {
        "cages": [4928, 4929, 4934],
        "analysis_start": "2025-01-25",  # after acclimation (Jan 22–24)
        "analysis_end": "2025-02-04",
        "cage_change": date(2025, 1, 29),
    },
}

EXCLUDE_DATES = {
    "Rep1": {date(2025, 1, 9), date(2025, 1, 15)},
    "Rep2": {date(2025, 1, 24), date(2025, 1, 29), date(2025, 2, 3), date(2025, 2, 4)},
}

LIGHT_START_UTC = 11  # 6 AM EST
LIGHT_END_UTC = 23    # 6 PM EST

print("Vehicle cages & cleaning:")
for rep, cfg in VEHICLE_CAGES.items():
    print(f"- {rep}: cages={cfg['cages']}, window={cfg['analysis_start']}..{cfg['analysis_end']}, exclude={sorted(EXCLUDE_DATES[rep])}")

---
## 3) Loading utilities (DuckDB → parquet on S3)
We load the same tables you used previously via DuckDB `read_parquet` from S3 partitions:
- `animal_bouts.parquet`
- `animal_bout_metrics.parquet`

We only pull partitions for the cages + cleaned dates.
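
One consequence of filtering by partition date is worth checking before modeling. If the `date=` partitions are keyed by UTC calendar date and each bout is filed under its start date (both are assumptions here, not confirmed by the schema), then a night N needs partition N for 23:00–23:59 UTC and partition N+1 for 00:00–10:59 UTC. Dropping one date therefore truncates two nights. The sketch below uses a hypothetical helper, `complete_nights`, to list the nights that have both halves.

```python
from datetime import date, timedelta

def complete_nights(loaded_dates):
    """Nights whose own UTC date and the following UTC date were both loaded."""
    loaded = set(loaded_dates)
    return sorted(d for d in loaded if d + timedelta(days=1) in loaded)

# Rep1 window from the configuration: Jan 10..22 with the Jan 15 cage change removed.
rep1_dates = [date(2025, 1, 10) + timedelta(days=i) for i in range(13)]
rep1_dates = [d for d in rep1_dates if d != date(2025, 1, 15)]

nights = complete_nights(rep1_dates)
print(len(nights), "complete nights:", [d.isoformat() for d in nights])
```

Under these assumptions the nights of Jan 14, Jan 15, and Jan 22 are partial, and bouts loaded from the early hours of Jan 10 form a partial "Jan 9" night. It may be safer to drop such nights than to let their low totals look like a behavioral change.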

In [None]:
def date_range(start: str, end: str) -> List[date]:
    start_d = datetime.strptime(start, "%Y-%m-%d").date()
    end_d = datetime.strptime(end, "%Y-%m-%d").date()
    out = []
    d = start_d
    while d <= end_d:
        out.append(d)
        d += timedelta(days=1)
    return out

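def load_table_for_cages_dates(table_name: str, cages: List[int], dates: List[date]) -> pd.DataFrame:
    """Load one table for every (cage, date) partition; unreadable partitions are skipped and reported."""
    conn = duckdb.connect(database=":memory:")
    all_data = []
    failed = []  # (cage_id, date_str, error type) for partitions that could not be read
    for cage_id in cages:
        for d in dates:
            date_str = d.strftime("%Y-%m-%d")
            path = f"{S3_BASE}/cage_id={cage_id}/date={date_str}/{table_name}"
            try:
                df = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
                df["cage_id"] = cage_id
                df["date"] = date_str
                all_data.append(df)
            except Exception as e:
                # An occasional missing partition is expected, but a systematic failure
                # (network, credentials, S3 access) would otherwise pass silently.
                failed.append((cage_id, date_str, type(e).__name__))
    conn.close()
    if failed:
        print(f"{table_name}: skipped {len(failed)} of {len(cages) * len(dates)} partitions, e.g. {failed[:3]}")
    if not all_data:
        return pd.DataFrame()
    return pd.concat(all_data, ignore_index=True)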

TABLE_BOUTS = "animal_bouts.parquet"
TABLE_BOUT_METRICS = "animal_bout_metrics.parquet"

def load_vehicle_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    all_bouts = []
    all_metrics = []
    for rep, cfg in VEHICLE_CAGES.items():
        cages = cfg["cages"]
        days = date_range(cfg["analysis_start"], cfg["analysis_end"])
        # cleaning: remove cage change and explicit exclusions
        days = [d for d in days if d != cfg["cage_change"]]
        days = [d for d in days if d not in EXCLUDE_DATES[rep]]

        df_b = load_table_for_cages_dates(TABLE_BOUTS, cages, days)
        df_m = load_table_for_cages_dates(TABLE_BOUT_METRICS, cages, days)

        if not df_b.empty:
            df_b["replicate"] = rep
            all_bouts.append(df_b)
        if not df_m.empty:
            df_m["replicate"] = rep
            all_metrics.append(df_m)

    bouts = pd.concat(all_bouts, ignore_index=True) if all_bouts else pd.DataFrame()
    metrics = pd.concat(all_metrics, ignore_index=True) if all_metrics else pd.DataFrame()
    return bouts, metrics

df_bouts, df_metrics = load_vehicle_data()
print("Loaded:")
print("  bouts rows:", len(df_bouts))
print("  metrics rows:", len(df_metrics))
print("\nBouts columns:", df_bouts.columns.tolist())
print("\nMetrics columns:", df_metrics.columns.tolist()[:40], "..." if len(df_metrics.columns)>40 else "")
df_bouts.head()

---
## 4) Feature engineering: Dark-cycle nightly features (fixed + robust)
This is the *fixed* version of your `compute_dark_cycle_features`:

### Fixes applied
✅ **Timezone correctness:** `pd.to_datetime(..., utc=True)` so `hour_utc` is truly UTC.  
✅ **Night assignment correctness:** bouts with `hour_utc < 11` belong to the previous night.  
✅ **Metrics correctness:** we only sum distance-like metrics *if we can identify them* from schema; otherwise we skip distance features safely.  
✅ **State handling:** works whether `state_name` is full (e.g., `animal_bouts.active`) or short (e.g., `active`).

### Outputs
One row per `(cage_id, animal_id, night_date)` with:
- per-state duration / count / mean bout length
- derived composites (physically_demanding, sleep_related, feeding_resourcing, activity_amplitude, fragmentation, exploration_intensity if distance is available)
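
To make the output shape concrete, here is a toy version of the aggregate-then-pivot step on five made-up bouts. The values are invented for illustration; only the column names `state_short` and `bout_length_seconds` follow the notebook.

```python
import pandas as pd

# Invented bouts: two animals, one night.
toy = pd.DataFrame({
    "animal_id": [1, 1, 1, 1, 2],
    "night_date": ["2025-01-10"] * 5,
    "state_short": ["active", "active", "locomotion", "inferred_sleep", "active"],
    "bout_length_seconds": [30.0, 90.0, 20.0, 600.0, 45.0],
})

keys = ["animal_id", "night_date"]

# Long format: one row per (animal, night, state).
agg = (
    toy.groupby(keys + ["state_short"])["bout_length_seconds"]
       .agg(duration="sum", bout_count="count", mean_bout="mean")
       .reset_index()
)

# Wide format: one row per (animal, night), one column per state x statistic.
wide = agg.pivot(index=keys, columns="state_short")
wide.columns = [f"{state}_{stat}" for stat, state in wide.columns]
wide = wide.reset_index()
print(wide.T)
```

Animal 2 has no locomotion or sleep bouts, so those cells come out as NaN. The notebook fills them with 0, which is reasonable for durations and counts but makes a mean bout length of 0 mean "state absent" rather than "very short bouts".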

In [None]:
def _parse_utc(ts: pd.Series) -> pd.Series:
    # Force UTC to avoid ambiguous 'hour_utc' extraction
    return pd.to_datetime(ts, utc=True, errors="coerce")

def _is_dark_hour(hour_utc: pd.Series) -> pd.Series:
    return ~((hour_utc >= LIGHT_START_UTC) & (hour_utc < LIGHT_END_UTC))

def _night_date_from_start(ts_utc: pd.Series) -> pd.Series:
    ts = _parse_utc(ts_utc)
    hour = ts.dt.hour
    night = ts.dt.date
    # before 11:00 UTC belongs to previous night (6am EST boundary)
    mask = hour < LIGHT_START_UTC
    night = pd.Series(night, index=ts.index)
    night.loc[mask] = (ts.loc[mask] - pd.Timedelta(days=1)).dt.date
    return night

def _normalize_state_name(s: pd.Series) -> pd.Series:
    # Map 'animal_bouts.active' -> 'active', also handle already-short names
    s = s.astype(str)
    return s.str.split(".").str[-1].str.lower()

def _detect_metric_columns(df_m: pd.DataFrame) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """
    Try to infer:
    - time column: start_time or time or timestamp
    - metric name column: metric_name or name
    - metric value column: metric_value or value
    Returns (time_col, name_col, value_col)
    """
    cols = {c.lower(): c for c in df_m.columns}
    time_col = None
    for cand in ["start_time", "time", "timestamp", "time_stamp"]:
        if cand in cols:
            time_col = cols[cand]
            break
    name_col = None
    for cand in ["metric_name", "name"]:
        if cand in cols:
            name_col = cols[cand]
            break
    value_col = None
    for cand in ["metric_value", "value"]:
        if cand in cols:
            value_col = cols[cand]
            break
    return time_col, name_col, value_col

def compute_dark_cycle_features_fixed(df_bouts: pd.DataFrame, df_metrics: pd.DataFrame) -> pd.DataFrame:
    # --------- BOUTS ---------
    df_b = df_bouts.copy()
    # Robust timestamp parse
    df_b["start_time"] = _parse_utc(df_b["start_time"])
    df_b = df_b.dropna(subset=["start_time", "animal_id", "state_name", "bout_length_seconds"])
    df_b["hour_utc"] = df_b["start_time"].dt.hour
    df_b["is_dark"] = _is_dark_hour(df_b["hour_utc"])
    df_b = df_b[df_b["is_dark"]].copy()

    df_b["night_date"] = _night_date_from_start(df_b["start_time"])
    df_b["state_short"] = _normalize_state_name(df_b["state_name"])

    # Aggregate per state
    grp = df_b.groupby(["cage_id", "animal_id", "night_date", "state_short"])
    agg = grp["bout_length_seconds"].agg(["sum", "count", "mean"]).reset_index()
    agg = agg.rename(columns={"sum": "duration_s", "count": "bout_count", "mean": "mean_bout_s"})

    # Pivot wide
    def pivot_value(val_col: str, suffix: str) -> pd.DataFrame:
        wide = agg.pivot_table(index=["cage_id", "animal_id", "night_date"], columns="state_short", values=val_col, aggfunc="first")
        wide.columns = [f"{c}_{suffix}" for c in wide.columns]
        return wide.reset_index()

    dur = pivot_value("duration_s", "duration")
    cnt = pivot_value("bout_count", "bout_count")
    mb  = pivot_value("mean_bout_s", "mean_bout")

    df_summary = dur.merge(cnt, on=["cage_id","animal_id","night_date"], how="outer").merge(mb, on=["cage_id","animal_id","night_date"], how="outer")

    # Total duration (dark bouts)
    total = df_b.groupby(["cage_id","animal_id","night_date"])["bout_length_seconds"].sum().reset_index().rename(columns={"bout_length_seconds":"total_duration"})
    df_summary = df_summary.merge(total, on=["cage_id","animal_id","night_date"], how="left")

    # Fill NaNs for duration/count/mean features with 0 (state absent that night)
    for c in df_summary.columns:
        if c.endswith(("_duration","_bout_count","_mean_bout")) or c in ["total_duration"]:
            df_summary[c] = df_summary[c].fillna(0.0)

    # --------- METRICS (distance-like) ---------
    if df_metrics is not None and len(df_metrics) > 0:
        df_m = df_metrics.copy()
        time_col, name_col, value_col = _detect_metric_columns(df_m)

        if time_col is not None and value_col is not None:
            df_m[time_col] = _parse_utc(df_m[time_col])
            df_m = df_m.dropna(subset=[time_col, value_col])
            df_m["hour_utc"] = df_m[time_col].dt.hour
            df_m["is_dark"] = _is_dark_hour(df_m["hour_utc"])
            df_m = df_m[df_m["is_dark"]].copy()
            df_m["night_date"] = _night_date_from_start(df_m[time_col])

            # Determine which rows correspond to distance.
            # If we have a metric-name column, filter by 'distance' keyword.
            if name_col is not None:
                name_series = df_m[name_col].astype(str).str.lower()
                dist_mask = name_series.str.contains("distance|dist|path_length|displacement", regex=True, na=False)
                df_m_dist = df_m[dist_mask].copy()
            else:
                # No metric name column; cannot safely interpret as distance
                df_m_dist = pd.DataFrame()

            # If there is a state column, normalize it for per-state distance; else only compute total distance
            state_col = None
            for cand in ["state_name", "state", "behavior_state"]:
                if cand in {c.lower(): c for c in df_m.columns}:
                    state_col = {c.lower(): c for c in df_m.columns}[cand]
                    break

            if len(df_m_dist) > 0:
                df_m_dist[value_col] = pd.to_numeric(df_m_dist[value_col], errors="coerce").fillna(0.0)

                dist_total = df_m_dist.groupby(["cage_id","animal_id","night_date"])[value_col].sum().reset_index().rename(columns={value_col:"total_distance"})
                df_summary = df_summary.merge(dist_total, on=["cage_id","animal_id","night_date"], how="left")

                if state_col is not None:
                    df_m_dist["state_short"] = _normalize_state_name(df_m_dist[state_col])
                    dist_state = df_m_dist.groupby(["cage_id","animal_id","night_date","state_short"])[value_col].sum().reset_index()
                    dist_state = dist_state.pivot_table(index=["cage_id","animal_id","night_date"], columns="state_short", values=value_col, aggfunc="first")
                    dist_state.columns = [f"{c}_distance" for c in dist_state.columns]
                    dist_state = dist_state.reset_index()
                    df_summary = df_summary.merge(dist_state, on=["cage_id","animal_id","night_date"], how="left")

                # Fill missing distances with 0
                for c in df_summary.columns:
                    if c.endswith("_distance") or c == "total_distance":
                        df_summary[c] = df_summary[c].fillna(0.0)
            else:
                # No distance rows found; skip
                pass

    # --------- DERIVED FEATURES (Khatiz-style composites) ---------
    # Make sure required base columns exist (if state absent, column might be missing; create as 0)
    def ensure(col):
        if col not in df_summary.columns:
            df_summary[col] = 0.0

    for base in ["locomotion_duration", "climbing_duration", "inferred_sleep_duration", "feeding_duration", "drinking_duration", "active_duration"]:
        ensure(base)
    for base in ["inferred_sleep_bout_count", "active_bout_count"]:
        ensure(base)

    df_summary["physically_demanding"] = df_summary["locomotion_duration"] + df_summary["climbing_duration"]
    df_summary["sleep_related"] = df_summary["inferred_sleep_duration"]
    df_summary["feeding_resourcing"] = df_summary["feeding_duration"] + df_summary["drinking_duration"]
    df_summary["activity_amplitude"] = df_summary["active_duration"] + df_summary["locomotion_duration"]

    df_summary["sleep_fragmentation"] = df_summary["inferred_sleep_bout_count"] / (df_summary["inferred_sleep_duration"] + 1.0)
    df_summary["active_fragmentation"] = df_summary["active_bout_count"] / (df_summary["active_duration"] + 1.0)

    if "total_distance" in df_summary.columns:
        df_summary["exploration_intensity"] = df_summary["total_distance"] / (df_summary["activity_amplitude"] + 1.0)

    # Sort for downstream rolling operations
    df_summary["night_date"] = pd.to_datetime(df_summary["night_date"])
    df_summary = df_summary.sort_values(["animal_id","night_date"])

    return df_summary

nightly = compute_dark_cycle_features_fixed(df_bouts, df_metrics)
print("Nightly feature table:", nightly.shape)
nightly.head()

---
## 5) Within-mouse normalization (nightly)
Estrous is a **within-mouse** cycle, so we create:
- raw nightly features (for reference)
- within-mouse z-scores across nights
- leakage-safe rolling z-scores using only past nights

We will model on within-mouse z by default.
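
The past-only baseline is easy to get subtly wrong, because the shift and the rolling window both have to stay inside one mouse. The sketch below shows one way to do that with `groupby(...).transform`; `past_only_z` is an illustrative helper and the toy numbers are invented. It assumes rows are already sorted by night within each animal.

```python
import pandas as pd

def past_only_z(df, col, group="animal_id", window=5, min_periods=3, eps=1e-8):
    """Z-score each value against the mean/std of the same animal's previous nights."""
    def _z(s):
        past = s.shift(1)  # exclude the current night from its own baseline
        mu = past.rolling(window, min_periods=min_periods).mean()
        sd = past.rolling(window, min_periods=min_periods).std(ddof=0)
        return (s - mu) / (sd + eps)
    # transform applies _z separately to each animal's series
    return df.groupby(group)[col].transform(_z)

toy = pd.DataFrame({
    "animal_id": ["a"] * 5 + ["b"] * 5,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0, 110.0, 120.0, 130.0, 140.0],
})
toy["x_roll_z"] = past_only_z(toy, "x")
print(toy)
```

The first three nights of each animal are NaN because fewer than `min_periods` past nights exist. In particular, animal `b` does not borrow a baseline from animal `a`, even though their rows are adjacent in the table.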

In [None]:
# Identify feature columns
key_cols = ["cage_id", "animal_id", "night_date"]
feature_cols = [c for c in nightly.columns if c not in key_cols]

# Ensure numeric for feature cols
for c in feature_cols:
    nightly[c] = pd.to_numeric(nightly[c], errors="coerce")

# Simple within-mouse z-score
df_within = nightly.copy()
for c in feature_cols:
    df_within[c] = df_within.groupby("animal_id")[c].transform(lambda s: (s - s.mean()) / (s.std(ddof=0) + 1e-8))

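# Rolling z-score using past only (no leakage).
# shift() and rolling() are applied inside each animal via transform, so the
# baseline never mixes nights from different mice.
df_roll = nightly.copy()
WINDOW = 5
MINP = 3
for c in feature_cols:
    g = nightly.groupby("animal_id")[c]
    mu = g.transform(lambda s: s.shift(1).rolling(window=WINDOW, min_periods=MINP).mean())
    sd = g.transform(lambda s: s.shift(1).rolling(window=WINDOW, min_periods=MINP).std(ddof=0))
    df_roll[c] = (nightly[c] - mu) / (sd + 1e-8)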

display(nightly.head(3))
display(df_within.head(3))
display(df_roll.head(3))

---
## 6) Build model matrix + PCA
We:
1) choose a normalized table (within-mouse z by default)
2) impute missing values (median)
3) standardize features
4) run PCA and keep enough PCs to explain ~85% variance (capped at 10 PCs)
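
The component-count rule in step 4 can be written as a small function. This is a sketch of the selection logic only, using an invented variance profile; `choose_k` is an illustrative name, and the floor of 2 components mirrors what the cell below does.

```python
import numpy as np

def choose_k(explained_ratio, target=0.85, k_min=2, k_max=10):
    """Smallest number of PCs whose cumulative variance reaches `target`, clipped to [k_min, k_max]."""
    cum = np.cumsum(explained_ratio)
    # searchsorted returns the index of the first cumulative value >= target
    k = int(np.searchsorted(cum, target) + 1)
    return max(k_min, min(k, k_max, len(cum)))

# Invented per-component variance ratios (cumulative: 0.50, 0.70, 0.80, 0.88, ...).
ratios = [0.50, 0.20, 0.10, 0.08, 0.07, 0.05]
print(choose_k(ratios))        # four components are needed to pass 0.85
print(choose_k([0.90, 0.10]))  # one would suffice, but the floor keeps two
```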

In [None]:
MODEL_DF = df_within.copy()  # swap to df_roll if you want leakage-safe baseline in modeling

X = MODEL_DF[feature_cols].copy()

imputer = SimpleImputer(strategy="median")
X_imp = imputer.fit_transform(X)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imp)

# n_components cannot exceed either the number of rows or the number of features
pca = PCA(n_components=min(10, *X_scaled.shape))
X_pca = pca.fit_transform(X_scaled)

explained = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(explained, 0.85) + 1)
k = max(2, min(k, X_pca.shape[1]))
X_pca_k = X_pca[:, :k]

print("X_scaled:", X_scaled.shape, "X_pca_k:", X_pca_k.shape)

plt.figure()
plt.plot(np.arange(1, len(explained)+1), explained, marker="o")
plt.xlabel("Number of PCs")
plt.ylabel("Cumulative explained variance")
plt.title("PCA explained variance (nightly, vehicle)")
plt.tight_layout()
plt.show()

---
## 7) Fit a 4-state HMM and decode nightly states
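We use a Gaussian HMM (hmmlearn). Unlike per-night clustering, it models the order of nights through a transition matrix.

- 4 states is a natural starting point for estrous-like phase structure (proestrus, estrus, metestrus, diestrus).
- We initialize with a “sticky” transition matrix (high self-transition) to reduce noise-driven flips. This is only a starting point; EM re-estimates the matrix during fitting. Note that a self-transition of 0.95 implies an expected dwell of about 20 nights, far longer than the 1–2 nights per stage that a 4–5 day cycle would give, so the learned matrix should be inspected rather than trusted.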
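
Two details matter when fitting one HMM to several mice: how the sticky matrix is built, and where one mouse's sequence ends and the next begins. The sketch below covers both with plain NumPy; `sticky_transmat` and `sequence_lengths` are illustrative helpers. In hmmlearn, an array like the second result is what the `lengths` argument of `fit` and `predict` expects, and hand-set start and transition probabilities are only kept as the starting point if `init_params` omits `"s"` and `"t"`.

```python
import numpy as np

def sticky_transmat(n_states, p_stay=0.95):
    """Row-stochastic matrix with p_stay on the diagonal and the rest spread evenly."""
    off_diag = (1.0 - p_stay) / (n_states - 1)
    T = np.full((n_states, n_states), off_diag)
    np.fill_diagonal(T, p_stay)
    return T

def sequence_lengths(ids):
    """Lengths of consecutive runs of identical ids (rows must be contiguous per animal)."""
    ids = np.asarray(ids)
    change = np.flatnonzero(ids[1:] != ids[:-1]) + 1   # indices where a new animal starts
    bounds = np.concatenate(([0], change, [len(ids)]))
    return np.diff(bounds)

T = sticky_transmat(4, p_stay=0.95)
lengths = sequence_lengths(["m1", "m1", "m1", "m2", "m2", "m3"])
print(T)
print(lengths)               # run lengths in row order
print(1.0 / (1.0 - 0.95))    # expected dwell (nights) implied by p_stay
```

Without sequence lengths, the last night of one mouse is treated as the night before the first night of the next mouse, which adds spurious transitions.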

In [None]:
states = None
post = None

try:
    from hmmlearn.hmm import GaussianHMM
    hmm_ok = True
except Exception as e:
    hmm_ok = False
    print("Could not import hmmlearn:", e)

if hmm_ok:
    n_states = 4
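    # One sequence per mouse, so the HMM does not chain the last night of one mouse
    # to the first night of the next. Assumes MODEL_DF rows are contiguous per animal
    # and sorted by night (true after the sort in section 4).
    lengths = MODEL_DF.groupby("animal_id", sort=False).size().to_numpy()

    model = GaussianHMM(
        n_components=n_states,
        covariance_type="full",
        n_iter=250,
        random_state=42,
        init_params="mc",  # keep the startprob_/transmat_ set below; auto-init means/covars only
    )

    # sticky initialization (would be overwritten by fit() if init_params included "s"/"t")
    transmat = np.full((n_states, n_states), 0.05 / (n_states - 1))
    np.fill_diagonal(transmat, 0.95)
    model.transmat_ = transmat
    model.startprob_ = np.full(n_states, 1.0 / n_states)

    model.fit(X_pca_k, lengths=lengths)
    states = model.predict(X_pca_k, lengths=lengths)
    post = model.predict_proba(X_pca_k, lengths=lengths)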

    print("Learned transition matrix:")
    display(pd.DataFrame(model.transmat_))

---
## 8) Results table (mouse-night) + quick visualization
We attach:
- inferred state (0..3)
- max posterior confidence
- PC1 (for periodicity tests)

Then we plot a heatmap: mice x nights.

In [None]:
results = MODEL_DF[["cage_id", "animal_id", "night_date"]].copy()
results["night_date"] = pd.to_datetime(results["night_date"]).dt.date

results["PC1"] = X_pca_k[:, 0]
if states is not None:
    results["state"] = states
    results["state_conf"] = post.max(axis=1)
else:
    results["state"] = np.nan
    results["state_conf"] = np.nan

# Heatmap
pivot = results.pivot_table(index="animal_id", columns="night_date", values="state", aggfunc="first").sort_index()

plt.figure(figsize=(12, max(4, 0.15 * len(pivot))))
plt.imshow(pivot.values, aspect="auto", interpolation="nearest")
plt.yticks(np.arange(len(pivot.index)), pivot.index)
plt.xticks(np.arange(len(pivot.columns)), [d.strftime("%m-%d") for d in pivot.columns], rotation=90)
plt.colorbar(label="Inferred state")
plt.title("Inferred 4-state sequence per mouse-night (vehicle, dark cycle)")
plt.tight_layout()
plt.show()

results.head()

---
## 9) Validation: per-mouse periodicity near 4–5 days (Lomb–Scargle on PC1)
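For each mouse we locate the highest Lomb–Scargle peak in the 2.5–8 day range and summarize the best period per mouse. This identifies the dominant period but is not a significance test: after cleaning, each mouse contributes only about 8–12 nights, so peaks are broad and the longest periods in the range are covered by barely one cycle.
If estrous-like modulation exists, we expect many mice to show a peak around ~4–5 days.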
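
Before reading real peaks, it helps to confirm that the periodogram recovers a known period at this sampling density. The sketch below builds a noise-free 4.5-day sinusoid sampled once per night with one missing night (synthetic data, not study data) and scans the same 2.5–8 day range. `scipy.signal.lombscargle` expects angular frequencies, hence the `2π / period` conversion.

```python
import numpy as np
from scipy.signal import lombscargle

# Synthetic series: 13-day span, night 5 missing, true period 4.5 days.
t = np.array([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12], dtype=float)
y = np.sin(2 * np.pi * t / 4.5)

periods = np.linspace(2.5, 8.0, 600)
omega = 2 * np.pi / periods                 # angular frequencies (rad/day)
power = lombscargle(t, y - y.mean(), omega)

best = periods[int(np.argmax(power))]
print(f"best period: {best:.2f} days")
```

Adding noise of realistic size to `y` and re-running is a cheap way to see how far the recovered peak can drift with only a dozen nights.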

In [None]:
def lomb_best_period(t_days, y, min_period=2.5, max_period=8.0, n_freq=600):
    t = np.asarray(t_days, dtype=float)
    y = np.asarray(y, dtype=float)
    m = np.isfinite(t) & np.isfinite(y)
    t, y = t[m], y[m]
    if len(t) < 6:
        return np.nan
    periods = np.linspace(min_period, max_period, n_freq)
    freqs = 2 * np.pi / periods
    y0 = y - y.mean()
    pgram = lombscargle(t, y0, freqs, normalize=True)
    return float(periods[int(np.argmax(pgram))])

peaks = []
for aid, g in results.groupby("animal_id"):
    g = g.sort_values("night_date")
    t0 = pd.to_datetime(g["night_date"]).min()
    t_days = (pd.to_datetime(g["night_date"]) - t0).dt.days.values
    peaks.append((aid, lomb_best_period(t_days, g["PC1"].values)))

peak_df = pd.DataFrame(peaks, columns=["animal_id", "best_period_days"]).dropna()
display(peak_df.describe())

plt.figure()
plt.hist(peak_df["best_period_days"].values, bins=20)
plt.xlabel("Best period (days)")
plt.ylabel("Count of mice")
plt.title("Distribution of best Lomb–Scargle periods (PC1, vehicle, dark)")
plt.tight_layout()
plt.show()

---
## 10) Validation: dwell times + transition matrix sanity checks
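Estrous-like states should not jump randomly. We compute:
- empirical transition matrix from decoded sequences
- dwell time distribution per state

Excluded dates leave gaps in each mouse's sequence. The code below treats consecutive rows as consecutive nights, so transitions and dwell runs that span a gap should be read with caution.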
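
Dwell times are run lengths of the decoded sequence. A compact way to compute them, and to avoid counting a run across a missing night, is sketched below with `itertools.groupby`. Both helpers, `split_on_gaps` and `dwell_lengths`, are illustrative and operate on plain lists of day offsets and state labels.

```python
from itertools import groupby

def split_on_gaps(days, states):
    """Split a state sequence into segments of strictly consecutive days."""
    segments, current = [], []
    prev_day = None
    for day, state in zip(days, states):
        if prev_day is not None and day - prev_day != 1:
            segments.append(current)   # a night is missing: close the segment
            current = []
        current.append(state)
        prev_day = day
    if current:
        segments.append(current)
    return segments

def dwell_lengths(seq):
    """Map each state to the list of its run lengths in seq."""
    out = {}
    for state, run in groupby(seq):
        out.setdefault(state, []).append(len(list(run)))
    return out

days = [0, 1, 2, 4, 5]      # day 3 was excluded
states = [0, 0, 1, 1, 1]
segments = split_on_gaps(days, states)
print(segments)
print([dwell_lengths(seg) for seg in segments])
```

Ignoring the gap would report one run of state 1 lasting three nights. Splitting gives runs of one and two nights, both of which are lower bounds because the excluded night is unobserved.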

In [None]:
res_sorted = results.sort_values(["animal_id", "night_date"])
trans_counts = np.zeros((4,4), dtype=int)
dwell = {s: [] for s in range(4)}

for aid, g in res_sorted.groupby("animal_id"):
    seq = g["state"].dropna().astype(int).values
    if len(seq) < 2:
        continue
    for a, b in zip(seq[:-1], seq[1:]):
        trans_counts[a, b] += 1

    # run-length encoding for dwell
    cur = seq[0]; run = 1
    for x in seq[1:]:
        if x == cur:
            run += 1
        else:
            dwell[cur].append(run)
            cur = x; run = 1
    dwell[cur].append(run)

trans_prob = trans_counts / np.maximum(trans_counts.sum(axis=1, keepdims=True), 1)
display(pd.DataFrame(trans_prob))

print("Mean dwell length (nights) per state:")
for s in range(4):
    vals = dwell[s]
    if vals:
        print(f"state {s}: mean={np.mean(vals):.2f}, n={len(vals)}")
    else:
        print(f"state {s}: no data")

---
## 11) Export
We save a CSV of mouse-night inferred states and PC1.
You can join this back to other nightly features or use it to search for “cycle day” anchors.

In [None]:
# Written to the working directory; adjust the path for your environment.
out_path = "phase4_vehicle_darkcycle_cycle_states.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)
results.head()

---
## 12) Next steps (recommended)
If you see a clear ~4–5 day peak in many mice:

1) Add a **null test**:
   - shuffle night order within each mouse; the 4–5 day peak should disappear.
2) Upgrade HMM to **cyclic-constrained** transitions (P→E→M→D→P).
3) Add modalities:
   - drinking totals (`animal_drinking`)
   - respiration (`animal_respiration` / `animal_tsdb_mvp`)
   - sociability (`animal_sociability_pairwise`)

These modalities may increase estrous signal-to-noise; whether they do on this dataset remains to be checked.
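
The shuffle null in step 1 can be sketched as a permutation test on the peak periodogram power inside the 4–5 day band. This is one possible formulation rather than an established protocol for this study: `shuffle_pvalue` is an illustrative helper, the band limits and permutation count are arbitrary choices, and the demo series is synthetic.

```python
import numpy as np
from scipy.signal import lombscargle

def band_peak_power(t, y, omega):
    """Highest Lomb-Scargle power over the supplied angular frequencies."""
    return float(lombscargle(t, y - y.mean(), omega).max())

def shuffle_pvalue(t, y, omega, n_perm=200, seed=0):
    """Fraction of night-order shuffles whose peak power matches or beats the observed one."""
    rng = np.random.default_rng(seed)
    observed = band_peak_power(t, y, omega)
    null = np.array([band_peak_power(t, rng.permutation(y), omega) for _ in range(n_perm)])
    # +1 in numerator and denominator keeps the estimate away from exactly zero
    return (1 + int(np.sum(null >= observed))) / (1 + n_perm)

# Synthetic demo: a clean 4.5-day rhythm over 13 nights.
t = np.arange(13, dtype=float)
y = np.sin(2 * np.pi * t / 4.5)
omega_band = 2 * np.pi / np.linspace(4.0, 5.0, 100)   # periods between 4 and 5 days

p = shuffle_pvalue(t, y, omega_band)
print(f"permutation p-value: {p:.3f}")
```

Shuffling destroys temporal order but keeps each mouse's value distribution, so a small p-value says the ordering carries the rhythm. With several mice tested, the per-mouse p-values would still need a multiple-comparison correction.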
