## Literature Foundation: Khatiz et al. (2025)

**"Real-time behavioral monitoring of C57BL/6J mice during reproductive cycle"**  
*Frontiers in Neuroscience, 19:1509822*

### Key Behavioral Markers

| Estrus (High Estrogen) | Metestrus/Diestrus (Low Estrogen) |
|------------------------|-----------------------------------|
| 30% more physically demanding activity | Lower overall activity |
| Sustained activity bouts (low fragmentation) | Fragmented activity (more bouts) |
| Higher exploratory behavior | More sleep-related behavior |
| Lower feeding during dark cycle | Higher feeding/habituation |
| More locomotion bouts | Fewer locomotion bouts |
| Less sleep fragmentation | More sleep fragmentation (more rousings) |

### Statistical Methods from Khatiz

1. **Hierarchical Clustering** - Group behaviors by statistical relationships
2. **Factor Analysis** - Identify underlying behavioral dimensions
3. **PCA** - Reduce dimensionality, find primary axes of differentiation
4. **K-Means Clustering** - Cluster nights into groups
5. **ANOVA + Fisher's LSD** - Compare groups (not applicable here, since we have no ground-truth cycle labels)
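
As a rough illustration of how methods 1 and 3 behave on nightly feature vectors, the sketch below runs PCA and hierarchical clustering on synthetic data. The data, the number of features, and the choice of Ward linkage are assumptions made for the example; they are not taken from Khatiz et al.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for nightly features (not real data):
# 30 "low-activity" nights and 10 "high-activity" nights, 4 features each.
low = rng.normal(loc=0.0, scale=1.0, size=(30, 4))
high = rng.normal(loc=6.0, scale=1.0, size=(10, 4))
X = StandardScaler().fit_transform(np.vstack([low, high]))

# PCA: project the nights onto the two main axes of variation.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Hierarchical clustering (Ward linkage, an assumption), cut into two clusters.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Cluster sizes:", np.bincount(hier_labels)[1:])
```

On data this well separated, the first principal component carries most of the variance and the two-cluster cut recovers the two groups of nights; real nightly data should be expected to be far noisier.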

---

## Analysis Pipeline

1. **Data Loading:** Load bout data from S3, filter to dark cycle
2. **Feature Engineering:** Compute duration AND bout-based metrics (per Khatiz)
3. **Normalization:** Z-score within each animal to remove individual differences
4. **Classification Approaches:**
   - Weighted composite score (based on Khatiz markers)
   - K-Means clustering (unsupervised)
   - Hierarchical clustering (unsupervised)
5. **Validation:** Factor analysis, dimensionality reduction, method comparison
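
Step 3 is easiest to see on a toy example. The sketch below uses two hypothetical animals with very different baseline activity; the column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: animal A is generally far more active than animal B.
toy = pd.DataFrame({
    "animal_id": ["A"] * 4 + ["B"] * 4,
    "activity":  [100, 110, 105, 160, 40, 42, 41, 70],
})

# Z-score each night against the same animal's own mean and standard deviation.
grouped = toy.groupby("animal_id")["activity"]
toy["activity_z"] = (toy["activity"] - grouped.transform("mean")) / grouped.transform("std")

# The most active night of each animal stands out, regardless of baseline level.
top_nights = toy.loc[toy.groupby("animal_id")["activity_z"].idxmax()]
print(top_nights)
```

In raw units, animal B's peak night (70) is below animal A's quietest night (100), so a pooled threshold would never flag it. After within-animal z-scoring, both peak nights are the top-ranked night for their animal.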

---

## Expected Outcomes

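- ~25% of nights classified as "estrus-like", in line with estrus occupying 1-2 days of a 4-5 day cycle (with a per-animal 75th-percentile cutoff on the weighted score, this share is fixed by construction)
- Significant feature differences between classified states
- Agreement between the literature-guided (weighted score) and unsupervised (clustering) methods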
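
One way to quantify the agreement in the last point is the adjusted Rand index (ARI) between the two label sets. The sketch below is a minimal example on synthetic z-scored features; the weights, the feature count, and the data are illustrative assumptions, not the values used in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)

# Synthetic z-scored features (not real data): 30 baseline nights, 10 high-activity nights.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 3)),
    rng.normal(3.0, 0.5, size=(10, 3)),
])

# Rule-based labels: weighted sum of features, top 25% called "estrus-like".
weights = np.array([2.0, 1.5, 1.0])  # illustrative weights only
score = X @ weights
score_labels = (score >= np.percentile(score, 75)).astype(int)

# Unsupervised labels: two-cluster K-Means on the same features.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ARI ignores how clusters are numbered: 1.0 means identical partitions, ~0 means chance level.
ari = adjusted_rand_score(score_labels, kmeans_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

Because ARI is invariant to label permutation, it does not matter which K-Means cluster happens to be numbered 0 or 1.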

---

## References

1. Khatiz A, et al. (2025). Real-time behavioral monitoring of C57BL/6J mice during reproductive cycle. *Front. Neurosci.* 19:1509822.
2. Levy DR, et al. (2023). Mouse spontaneous behavior reflects individual variation rather than estrous state. *Curr Biol.* 33:1358-1364.
3. Wollnik F, Turek FW. (1988). Estrous correlated modulations of circadian and ultradian wheel-running activity rhythms. *Physiol Behav.* 43:389-396.

In [1]:
!pip install duckdb pyarrow astropy umap-learn -q


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import duckdb
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Statistical tools
from scipy import stats
from scipy.stats import shapiro, normaltest, ttest_ind
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# ML tools
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, adjusted_rand_score
import umap

print("All imports successful!")


All imports successful!


In [5]:
# Configuration
S3_BASE = "s3://jax-envision-public-data/study_1001/2025v3.3/tabular"

# Vehicle control cages (14-16 days of unconfounded baseline)
VEHICLE_CAGES = {
    'Rep1': {
        'cages': [4918, 4922, 4923],
        'start_date': '2025-01-07',
        'end_date': '2025-01-22',
    },
    'Rep2': {
        'cages': [4928, 4929, 4934],
        'start_date': '2025-01-22',
        'end_date': '2025-02-04',
    }
}

# Light cycle parameters (EST → UTC)
# Lights ON: 6:00 AM EST = 11:00 UTC
# Lights OFF: 6:00 PM EST = 23:00 UTC
# Dark phase: hour >= 23 OR hour < 11 (UTC)

print("Configuration loaded!")
print(f"Vehicle cages: {[c for rep in VEHICLE_CAGES.values() for c in rep['cages']]}")


Configuration loaded!
Vehicle cages: [4918, 4922, 4923, 4928, 4929, 4934]
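
The dark-phase rule in the configuration comments (hour >= 23 or hour < 11, UTC) also implies that one night spans two UTC calendar dates. The sketch below shows the rule and the night-date assignment on four boundary timestamps. `assign_night` is a hypothetical helper written for this example, and it assumes the facility clock stays on EST (UTC-5) for the whole recording, which holds for January and February dates.

```python
import pandas as pd

def assign_night(timestamps_utc):
    """Flag dark-phase timestamps and label each with the UTC date its night began."""
    ts = pd.to_datetime(pd.Series(timestamps_utc))
    hour = ts.dt.hour
    is_dark = (hour >= 23) | (hour < 11)
    # Hours 00:00-10:59 UTC belong to the night that started at 23:00 UTC the previous day.
    night_start = ts.where(hour >= 11, ts - pd.Timedelta(days=1))
    return pd.DataFrame({
        "time_utc": ts,
        "is_dark": is_dark,
        "night_date": night_start.dt.date,
    })

example = assign_night([
    "2025-01-07 22:59:00",  # lights still on
    "2025-01-07 23:00:00",  # dark phase begins
    "2025-01-08 10:59:00",  # same night, after midnight UTC
    "2025-01-08 11:00:00",  # lights on again
])
print(example)
```

The two dark-phase rows share the night date 2025-01-07 even though they fall on different UTC calendar days.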


In [6]:
def load_bout_data(cage_id, start_date, end_date):
    """Load animal bout data for a single cage across date range."""
    conn = duckdb.connect()
    conn.execute("INSTALL httpfs; LOAD httpfs;")
    conn.execute("SET s3_region='us-east-1';")
    
    dates = pd.date_range(start_date, end_date, freq='D')
    all_data = []
    
    for date in dates:
        date_str = date.strftime('%Y-%m-%d')
        path = f"{S3_BASE}/cage_id={cage_id}/date={date_str}/animal_bouts.parquet"
        
        try:
            df = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
            df['cage_id'] = cage_id
            df['date'] = date_str
            all_data.append(df)
        except Exception as e:
            # Skip missing dates
            continue
    
    conn.close()
    
    if all_data:
        return pd.concat(all_data, ignore_index=True)
    return pd.DataFrame()

def load_all_vehicle_data():
    """Load bout data for all vehicle control cages."""
    all_data = []
    
    for rep_name, rep_config in VEHICLE_CAGES.items():
        print(f"Loading {rep_name}...")
        for cage_id in rep_config['cages']:
            print(f"  Cage {cage_id}...", end=" ")
            df = load_bout_data(cage_id, rep_config['start_date'], rep_config['end_date'])
            if len(df) > 0:
                df['replicate'] = rep_name
                all_data.append(df)
                print(f"{len(df):,} bouts")
            else:
                print("No data")
    
    # Avoid a ValueError from pd.concat when no cage returned any data
    return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

In [7]:
# Check what exploration/movement data is available
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region='us-east-1';")

sample_cage = 4918
sample_date = '2025-01-10'

# 1. Check animal_activity_features.parquet
print("="*70)
print("1. animal_activity_features.parquet")
print("="*70)
path = f"{S3_BASE}/cage_id={sample_cage}/date={sample_date}/animal_activity_features.parquet"
try:
    df_features = conn.execute(f"SELECT * FROM read_parquet('{path}') LIMIT 5").fetchdf()
    print(f"Columns ({len(df_features.columns)}):")
    # Show columns related to movement/exploration
    movement_cols = [c for c in df_features.columns if any(x in c.lower() for x in 
                    ['displacement', 'velocity', 'distance', 'motion', 'stationary', 'travel'])]
    print(f"Movement-related columns: {movement_cols}")
    print(f"\nSample data:")
    if movement_cols:
        print(df_features[['time', 'predicted_identity'] + movement_cols[:5]].head())
except Exception as e:
    print(f"Error: {e}")

# 2. Check animal_activity_db.parquet
print("\n" + "="*70)
print("2. animal_activity_db.parquet")
print("="*70)
path = f"{S3_BASE}/cage_id={sample_cage}/date={sample_date}/animal_activity_db.parquet"
try:
    df_activity = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
    print(f"Columns: {df_activity.columns.tolist()}")
    print(f"\nUnique metric names:")
    print(df_activity['name'].unique())
except Exception as e:
    print(f"Error: {e}")

# 3. Check animal_bout_metrics.parquet
print("\n" + "="*70)
print("3. animal_bout_metrics.parquet")
print("="*70)
path = f"{S3_BASE}/cage_id={sample_cage}/date={sample_date}/animal_bout_metrics.parquet"
try:
    df_bout_metrics = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
    print(f"Columns: {df_bout_metrics.columns.tolist()}")
    print(f"\nUnique metric names:")
    print(df_bout_metrics['metric_name'].unique())
    print(f"\nUnique state names:")
    print(df_bout_metrics['state_name'].unique())
except Exception as e:
    print(f"Error: {e}")

conn.close()

1. animal_activity_features.parquet


Columns (157):
Movement-related columns: ['start_to_end_displacement', 'total_displacement', 'average_velocity', 'max_velocity', 'variance_velocity', 'sum_displacement_cents', 'average_displacement_cents', 'min_displacement_cents', 'max_displacement_cents', 'sum_displacement_nose', 'average_displacement_nose', 'min_displacement_nose', 'max_displacement_nose', 'stationary_ratio', 'avg_distance_per_timestep', 'max_distance_per_timestep', 'min_distance_per_timestep', 'total_distance_moved', 'avg_total_distance_moved', 'first_to_last_avg_distance', 'first_to_last_max_distance', 'first_to_last_min_distance', 'avg_kpt_4_to_6_distance', 'max_kpt_4_to_6_distance', 'min_kpt_4_to_6_distance', 'avg_kpt_1_to_6_distance', 'max_kpt_1_to_6_distance', 'min_kpt_1_to_6_distance', 'avg_kpt_1_to_5_distance', 'max_kpt_1_to_5_distance', 'min_kpt_1_to_5_distance', 'avg_distance_to_0', 'avg_distance_to_1', 'avg_distance_to_2']

Sample data:
                 time predicted_identity  start_to_end_displacement  

Columns: ['predicted_identity', 'time', 'resolution', 'name', 'value', 'units', 'version_str', 'organization_id', 'cage_id', 'study_id', 'device_id', 'run_id', 'animal_id', 'ULID', '__index_level_0__', 'filename', 'source_file', 'date']

Unique metric names:
['animal_bouts.active' 'animal_bouts.climbing' 'animal_bouts.inactive'
 'animal_bouts.locomotion']

3. animal_bout_metrics.parquet
Columns: ['predicted_identity', 'start_time', 'end_time', 'state_name', 'organization_id', 'cage_id', 'study_id', 'device_id', 'animal_id', 'bout_length_seconds', 'metric_name', 'metric_agg', 'metric_value', 'source_file', 'date']

Unique metric names:
['animal.distance_travelled']

Unique state names:
['animal_bouts.drinking' 'animal_bouts.locomotion' 'animal_bouts.active'
 'animal_bouts.social.in_proximity_other' 'animal_bouts.inactive'
 'animal_bouts.climbing' 'animal_bouts.feeding'
 'animal_bouts.social.isolated_other' 'animal_bouts.inferred_sleep'
 'animal_bouts.social.proximal_all' 'animal_bouts.s

In [8]:
def load_activity_features(cage_id, start_date, end_date):
    """Load animal_activity_features.parquet for exploration metrics."""
    conn = duckdb.connect()
    conn.execute("INSTALL httpfs; LOAD httpfs;")
    conn.execute("SET s3_region='us-east-1';")
    
    dates = pd.date_range(start_date, end_date, freq='D')
    all_data = []
    
    for date in dates:
        date_str = date.strftime('%Y-%m-%d')
        path = f"{S3_BASE}/cage_id={cage_id}/date={date_str}/animal_activity_features.parquet"
        
        try:
            df = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
            df['cage_id'] = cage_id
            df['date'] = date_str
            all_data.append(df)
        except Exception:
            # Skip missing dates (a bare except would also swallow KeyboardInterrupt)
            continue
    
    conn.close()
    
    if all_data:
        return pd.concat(all_data, ignore_index=True)
    return pd.DataFrame()

# Load for one cage first to see what we get
print("Loading activity features for cage 4918...")
df_features_test = load_activity_features(4918, '2025-01-07', '2025-01-10')
print(f"Loaded {len(df_features_test)} rows")

# Check movement-related columns
movement_cols = [c for c in df_features_test.columns if any(x in c.lower() for x in 
                ['displacement', 'velocity', 'distance', 'motion', 'stationary', 'travel', 'speed'])]
print(f"\nMovement columns found: {movement_cols}")

Loading activity features for cage 4918...


Loaded 799200 rows

Movement columns found: ['start_to_end_displacement', 'total_displacement', 'average_velocity', 'max_velocity', 'variance_velocity', 'sum_displacement_cents', 'average_displacement_cents', 'min_displacement_cents', 'max_displacement_cents', 'sum_displacement_nose', 'average_displacement_nose', 'min_displacement_nose', 'max_displacement_nose', 'avg_speed', 'max_speed', 'variance_speed', 'stationary_ratio', 'avg_distance_per_timestep', 'max_distance_per_timestep', 'min_distance_per_timestep', 'total_distance_moved', 'avg_total_distance_moved', 'first_to_last_avg_distance', 'first_to_last_max_distance', 'first_to_last_min_distance', 'avg_kpt_4_to_6_distance', 'max_kpt_4_to_6_distance', 'min_kpt_4_to_6_distance', 'avg_kpt_1_to_6_distance', 'max_kpt_1_to_6_distance', 'min_kpt_1_to_6_distance', 'avg_kpt_1_to_5_distance', 'max_kpt_1_to_5_distance', 'min_kpt_1_to_5_distance', 'avg_distance_to_0', 'avg_distance_to_1', 'avg_distance_to_2']


In [9]:
def load_bout_metrics(cage_id, start_date, end_date):
    """Load animal_bout_metrics.parquet with distance traveled per bout."""
    conn = duckdb.connect()
    conn.execute("INSTALL httpfs; LOAD httpfs;")
    conn.execute("SET s3_region='us-east-1';")
    
    dates = pd.date_range(start_date, end_date, freq='D')
    all_data = []
    
    for date in dates:
        date_str = date.strftime('%Y-%m-%d')
        path = f"{S3_BASE}/cage_id={cage_id}/date={date_str}/animal_bout_metrics.parquet"
        
        try:
            df = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
            df['cage_id'] = cage_id
            df['date'] = date_str
            all_data.append(df)
        except Exception:
            # Skip missing dates (a bare except would also swallow KeyboardInterrupt)
            continue
    
    conn.close()
    
    if all_data:
        return pd.concat(all_data, ignore_index=True)
    return pd.DataFrame()

# Load for all vehicle cages
print("Loading bout metrics (distance traveled)...")
print("="*60)

all_bout_metrics = []
for rep_name, rep_config in VEHICLE_CAGES.items():
    print(f"{rep_name}:")
    for cage_id in rep_config['cages']:
        print(f"  Cage {cage_id}...", end=" ")
        df = load_bout_metrics(cage_id, rep_config['start_date'], rep_config['end_date'])
        if len(df) > 0:
            df['replicate'] = rep_name
            all_bout_metrics.append(df)
            print(f"{len(df):,} rows")
        else:
            print("No data")

df_bout_metrics = pd.concat(all_bout_metrics, ignore_index=True)
print("="*60)
print(f"Total bout metrics loaded: {len(df_bout_metrics):,}")

Loading bout metrics (distance traveled)...
Rep1:
  Cage 4918... 2,153,458 rows
  Cage 4922... 2,114,699 rows
  Cage 4923... 2,112,969 rows
Rep2:
  Cage 4928... 1,967,936 rows
  Cage 4929... 1,828,562 rows
  Cage 4934... 1,945,821 rows
Total bout metrics loaded: 12,123,445
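
The three loaders in this notebook differ only in the file name they request, so the per-day path construction could be shared. The sketch below shows one possible helper; `build_daily_paths` is a hypothetical name introduced here, and the path layout is copied from the loaders above.

```python
import pandas as pd

S3_BASE = "s3://jax-envision-public-data/study_1001/2025v3.3/tabular"

def build_daily_paths(cage_id, start_date, end_date, filename, base=S3_BASE):
    """Return one S3 path per calendar day (inclusive) for the given cage and file."""
    dates = pd.date_range(start_date, end_date, freq="D")
    return [
        f"{base}/cage_id={cage_id}/date={d:%Y-%m-%d}/{filename}"
        for d in dates
    ]

paths = build_daily_paths(4918, "2025-01-07", "2025-01-10", "animal_bout_metrics.parquet")
for p in paths:
    print(p)
```

Each loader could then iterate over these paths and keep its existing skip-on-error behaviour for dates with no file.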


In [10]:
print("Bout Metrics Structure:")
print("="*60)
print(f"Columns: {df_bout_metrics.columns.tolist()}")
print(f"\nMetric names: {df_bout_metrics['metric_name'].unique()}")
print(f"State names: {df_bout_metrics['state_name'].unique()}")
print(f"\nSample data:")
print(df_bout_metrics[['start_time', 'animal_id', 'state_name', 'bout_length_seconds', 
                       'metric_name', 'metric_value']].head(10))

# Check distance traveled by state
print("\n" + "="*60)
print("Mean distance traveled by behavioral state:")
print("-"*60)
distance_by_state = df_bout_metrics.groupby('state_name')['metric_value'].agg(['mean', 'std', 'count'])
print(distance_by_state.round(2))

Bout Metrics Structure:
Columns: ['predicted_identity', 'start_time', 'end_time', 'state_name', 'organization_id', 'cage_id', 'study_id', 'device_id', 'animal_id', 'bout_length_seconds', 'metric_name', 'metric_agg', 'metric_value', 'source_file', 'date', 'replicate']

Metric names: ['animal.distance_travelled']
State names: ['animal_bouts.active' 'animal_bouts.climbing' 'animal_bouts.locomotion'
 'animal_bouts.feeding' 'animal_bouts.inactive'
 'animal_bouts.social.in_proximity_other'
 'animal_bouts.social.isolated_other' 'animal_bouts.inferred_sleep'
 'animal_bouts.drinking' 'animal_bouts.social.isolated_all'
 'animal_bouts.social.proximal_all']

Sample data:
           start_time  animal_id               state_name  \
0 2025-01-07 23:00:00       9259      animal_bouts.active   
1 2025-01-07 23:06:37       9259    animal_bouts.climbing   
2 2025-01-07 23:05:49       9259  animal_bouts.locomotion   
3 2025-01-07 23:08:33       9258  animal_bouts.locomotion   
4 2025-01-07 23:08:37      

In [14]:
# Load bout data for all vehicle control cages
def load_bout_data(cage_id, start_date, end_date):
    """Load animal_bouts.parquet for a single cage across date range."""
    conn = duckdb.connect()
    conn.execute("INSTALL httpfs; LOAD httpfs;")
    conn.execute("SET s3_region='us-east-1';")
    
    dates = pd.date_range(start_date, end_date, freq='D')
    all_data = []
    
    for date in dates:
        date_str = date.strftime('%Y-%m-%d')
        path = f"{S3_BASE}/cage_id={cage_id}/date={date_str}/animal_bouts.parquet"
        
        try:
            df = conn.execute(f"SELECT * FROM read_parquet('{path}')").fetchdf()
            df['cage_id'] = cage_id
            df['date'] = date_str
            all_data.append(df)
        except Exception:
            # Skip missing dates (a bare except would also swallow KeyboardInterrupt)
            continue
    
    conn.close()
    
    if all_data:
        return pd.concat(all_data, ignore_index=True)
    return pd.DataFrame()

# Load for all vehicle cages
print("Loading bout data (animal_bouts.parquet)...")
print("="*60)

all_bouts = []
for rep_name, rep_config in VEHICLE_CAGES.items():
    print(f"{rep_name}:")
    for cage_id in rep_config['cages']:
        print(f"  Cage {cage_id}...", end=" ")
        df = load_bout_data(cage_id, rep_config['start_date'], rep_config['end_date'])
        if len(df) > 0:
            df['replicate'] = rep_name
            all_bouts.append(df)
            print(f"{len(df):,} bouts")
        else:
            print("No data")

df_bouts = pd.concat(all_bouts, ignore_index=True)
print("="*60)
print(f"Total bouts loaded: {len(df_bouts):,}")

Loading bout data (animal_bouts.parquet)...
Rep1:
  Cage 4918... 2,558,344 bouts
  Cage 4922... 2,401,455 bouts
  Cage 4923... 2,458,754 bouts
Rep2:
  Cage 4928... 2,259,981 bouts
  Cage 4929... 2,071,012 bouts
  Cage 4934... 2,201,662 bouts
Total bouts loaded: 13,951,208


In [15]:
def compute_nightly_summaries_with_exploration(df_bouts, df_bout_metrics):
    """
    Compute comprehensive nightly summaries including:
    - Duration metrics (from animal_bouts)
    - Bout count metrics (from animal_bouts)  
    - Exploration metrics (from animal_bout_metrics - distance traveled)
    """
    from datetime import timedelta
    
    STATE_MAP = {
        'active': 'animal_bouts.active',
        'inactive': 'animal_bouts.inactive',
        'locomotion': 'animal_bouts.locomotion',
        'feeding': 'animal_bouts.feeding',
        'drinking': 'animal_bouts.drinking',
        'climbing': 'animal_bouts.climbing',
        'inferred_sleep': 'animal_bouts.inferred_sleep',
    }
    
    # ============================================
    # Process bouts data
    # ============================================
    df_bouts = df_bouts.copy()
    df_bouts['start_time'] = pd.to_datetime(df_bouts['start_time'])
    df_bouts['hour_utc'] = df_bouts['start_time'].dt.hour
    df_bouts['is_dark'] = (df_bouts['hour_utc'] >= 23) | (df_bouts['hour_utc'] < 11)
    
    # Night date assignment
    df_bouts['night_date'] = df_bouts['start_time'].dt.date
    mask_early = df_bouts['hour_utc'] < 11
    df_bouts.loc[mask_early, 'night_date'] = (
        pd.to_datetime(df_bouts.loc[mask_early, 'start_time']) - timedelta(days=1)
    ).dt.date
    
    df_dark = df_bouts[df_bouts['is_dark']].copy()
    
    # ============================================
    # Process bout metrics (distance traveled)
    # ============================================
    df_metrics = df_bout_metrics.copy()
    df_metrics['start_time'] = pd.to_datetime(df_metrics['start_time'])
    df_metrics['hour_utc'] = df_metrics['start_time'].dt.hour
    df_metrics['is_dark'] = (df_metrics['hour_utc'] >= 23) | (df_metrics['hour_utc'] < 11)
    
    df_metrics['night_date'] = df_metrics['start_time'].dt.date
    mask_early = df_metrics['hour_utc'] < 11
    df_metrics.loc[mask_early, 'night_date'] = (
        pd.to_datetime(df_metrics.loc[mask_early, 'start_time']) - timedelta(days=1)
    ).dt.date
    
    df_metrics_dark = df_metrics[df_metrics['is_dark']].copy()
    
    # ============================================
    # Aggregate bout-based metrics
    # ============================================
    results = []
    
    for (cage_id, animal_id, night_date), group in df_dark.groupby(['cage_id', 'animal_id', 'night_date']):
        row = {
            'cage_id': cage_id,
            'animal_id': animal_id,
            'night_date': night_date,
        }
        
        # Duration and bout count per state
        for short_name, full_name in STATE_MAP.items():
            state_data = group[group['state_name'] == full_name]
            row[f'{short_name}_duration'] = state_data['bout_length_seconds'].sum()
            row[f'{short_name}_bout_count'] = len(state_data)
            row[f'{short_name}_mean_bout'] = state_data['bout_length_seconds'].mean() if len(state_data) > 0 else 0
        
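        # NOTE: sums bouts from every state in the file. If states overlap in time
        # (e.g. social states alongside activity states), this exceeds the true dark-phase duration.
        row['total_dark_seconds'] = group['bout_length_seconds'].sum()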
        results.append(row)
    
    df_summary = pd.DataFrame(results)
    
    # ============================================
    # Aggregate exploration metrics (distance traveled)
    # ============================================
    exploration_results = []
    
    for (cage_id, animal_id, night_date), group in df_metrics_dark.groupby(['cage_id', 'animal_id', 'night_date']):
        row = {
            'cage_id': cage_id,
            'animal_id': animal_id,
            'night_date': night_date,
        }
        
        # Total distance traveled (all states). If states overlap in time,
        # the same movement is counted once per overlapping state.
        row['total_distance'] = group['metric_value'].sum()
        
        # Distance by state
        for short_name, full_name in STATE_MAP.items():
            state_data = group[group['state_name'] == full_name]
            row[f'{short_name}_distance'] = state_data['metric_value'].sum()
            row[f'{short_name}_distance_per_bout'] = state_data['metric_value'].mean() if len(state_data) > 0 else 0
        
        exploration_results.append(row)
    
    df_exploration = pd.DataFrame(exploration_results)
    
    # ============================================
    # Merge bout and exploration summaries
    # ============================================
    df_summary = df_summary.merge(
        df_exploration, 
        on=['cage_id', 'animal_id', 'night_date'], 
        how='left'
    )
    
    # ============================================
    # Compute derived features
    # ============================================
    
    # Activity metrics
    df_summary['activity_amplitude'] = df_summary['active_duration'] + df_summary['locomotion_duration']
    
    # Fragmentation metrics (bouts per unit time)
    df_summary['sleep_fragmentation'] = df_summary['inferred_sleep_bout_count'] / (df_summary['inferred_sleep_duration'] + 1)
    df_summary['active_fragmentation'] = df_summary['active_bout_count'] / (df_summary['active_duration'] + 1)
    
    # Ratios
    df_summary['feeding_ratio'] = df_summary['feeding_duration'] / (df_summary['active_duration'] + 1)
    df_summary['sleep_ratio'] = df_summary['inferred_sleep_duration'] / (df_summary['total_dark_seconds'] + 1)
    
    # Exploration metrics (NEW!)
    df_summary['exploration_intensity'] = df_summary['total_distance'] / (df_summary['activity_amplitude'] + 1)  # Distance per active second
    df_summary['locomotion_efficiency'] = df_summary['locomotion_distance'] / (df_summary['locomotion_duration'] + 1)  # Distance per locomotion second
    
    return df_summary

# Compute enhanced summaries
print("Computing nightly summaries with exploration metrics...")
df_nightly = compute_nightly_summaries_with_exploration(df_bouts, df_bout_metrics)

# Filter out animal_id = 0
df_nightly = df_nightly[df_nightly['animal_id'] != 0].copy()

print(f"Nightly summaries: {len(df_nightly)} animal-nights")
print(f"Animals: {df_nightly['animal_id'].nunique()}")

Computing nightly summaries with exploration metrics...
Nightly summaries: 258 animal-nights
Animals: 18
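
The fragmentation features divide a bout count by the time spent in that state, so two nights with equal total sleep can score very differently. The sketch below works through this on two hypothetical nights; the numbers are invented, and only the formula mirrors `sleep_fragmentation` above.

```python
import pandas as pd

# Two hypothetical nights with the same total sleep (3600 s) split differently.
nights = pd.DataFrame({
    "night": ["consolidated", "fragmented"],
    "inferred_sleep_bout_count": [4, 40],
    "inferred_sleep_duration": [3600.0, 3600.0],
})

# Same formula as the pipeline: bouts per second in the state, +1 to avoid division by zero.
nights["sleep_fragmentation"] = (
    nights["inferred_sleep_bout_count"] / (nights["inferred_sleep_duration"] + 1)
)
# Mean bout length is the complementary view: longer bouts mean less fragmentation.
nights["mean_bout_s"] = nights["inferred_sleep_duration"] / nights["inferred_sleep_bout_count"]
print(nights)
```

With duration held constant, ten times as many bouts gives ten times the fragmentation index and one tenth the mean bout length.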


In [16]:
print("New Exploration Features:")
print("="*70)

exploration_features = [
    'total_distance',
    'locomotion_distance', 
    'active_distance',
    'locomotion_distance_per_bout',
    'exploration_intensity',
    'locomotion_efficiency',
]

for feat in exploration_features:
    if feat in df_nightly.columns:
        print(f"{feat:30}: mean={df_nightly[feat].mean():.2f}, std={df_nightly[feat].std():.2f}")
    else:
        print(f"{feat:30}: NOT FOUND")

# Correlation with activity
print("\n" + "="*70)
print("Correlation of exploration metrics with activity_amplitude:")
print("-"*70)
for feat in exploration_features:
    if feat in df_nightly.columns:
        corr = df_nightly[feat].corr(df_nightly['activity_amplitude'])
        print(f"{feat:30}: r = {corr:.3f}")

New Exploration Features:
total_distance                : mean=58561.80, std=21644.41
locomotion_distance           : mean=12027.49, std=4634.39
active_distance               : mean=13271.41, std=2700.43
locomotion_distance_per_bout  : mean=7.19, std=0.82
exploration_intensity         : mean=2.11, std=0.80
locomotion_efficiency         : mean=3.65, std=0.23

Correlation of exploration metrics with activity_amplitude:
----------------------------------------------------------------------
total_distance                : r = 0.282
locomotion_distance           : r = 0.437
active_distance               : r = 0.864
locomotion_distance_per_bout  : r = -0.105
exploration_intensity         : r = -0.177
locomotion_efficiency         : r = -0.213


In [17]:
# Updated classification weights including exploration features
# Based on Khatiz: Estrus = MORE exploration, MORE distance traveled

CLASSIFICATION_WEIGHTS_V2 = {
    # Duration-based (existing)
    'activity_amplitude_z':       (+2.0, 'Higher activity → Estrus'),
    'locomotion_duration_z':      (+1.5, 'More locomotion → Estrus'),
    'inactive_duration_z':        (-1.5, 'Less inactive time → Estrus'),
    'inferred_sleep_duration_z':  (-1.0, 'Less sleep → Estrus'),
    
    # Bout-based (existing)
    'locomotion_bout_count_z':    (+1.0, 'More locomotion bouts → Estrus'),
    'active_mean_bout_z':         (+1.0, 'Longer bouts (sustained) → Estrus'),
    'active_fragmentation_z':     (-1.0, 'Less fragmented activity → Estrus'),
    'sleep_fragmentation_z':      (-0.5, 'Less sleep fragmentation → Estrus'),
    
    # Feeding
    'feeding_ratio_z':            (-1.0, 'Less feeding → Estrus'),
    
    # Exploration (NEW!)
    'total_distance_z':           (+1.5, 'More distance traveled → Estrus'),
    'locomotion_distance_z':      (+1.0, 'More locomotion distance → Estrus'),
    'exploration_intensity_z':    (+1.0, 'Higher exploration intensity → Estrus'),
    
    # Vertical exploration
    'climbing_duration_z':        (+0.5, 'More climbing → Estrus'),
    'climbing_bout_count_z':      (+0.5, 'More climbing bouts → Estrus'),
}

print("Updated Classification Weights (with Exploration):")
print("="*70)
for feature, (weight, direction) in CLASSIFICATION_WEIGHTS_V2.items():
    sign = "+" if weight > 0 else ""
    print(f"  {feature:30} weight={sign}{weight:<5} ({direction})")

print(f"\nTotal features: {len(CLASSIFICATION_WEIGHTS_V2)}")

Updated Classification Weights (with Exploration):
  activity_amplitude_z           weight=+2.0   (Higher activity → Estrus)
  locomotion_duration_z          weight=+1.5   (More locomotion → Estrus)
  inactive_duration_z            weight=-1.5  (Less inactive time → Estrus)
  inferred_sleep_duration_z      weight=-1.0  (Less sleep → Estrus)
  locomotion_bout_count_z        weight=+1.0   (More locomotion bouts → Estrus)
  active_mean_bout_z             weight=+1.0   (Longer bouts (sustained) → Estrus)
  active_fragmentation_z         weight=-1.0  (Less fragmented activity → Estrus)
  sleep_fragmentation_z          weight=-0.5  (Less sleep fragmentation → Estrus)
  feeding_ratio_z                weight=-1.0  (Less feeding → Estrus)
  total_distance_z               weight=+1.5   (More distance traveled → Estrus)
  locomotion_distance_z          weight=+1.0   (More locomotion distance → Estrus)
  exploration_intensity_z        weight=+1.0   (Higher exploration intensity → Estrus)
  climbin

In [18]:
# Make sure we have z-scored features
FEATURES_TO_ZSCORE = [
    'activity_amplitude', 'locomotion_duration', 'inactive_duration',
    'inferred_sleep_duration', 'feeding_duration', 'climbing_duration',
    'locomotion_bout_count', 'active_bout_count', 'climbing_bout_count',
    'active_mean_bout', 'active_fragmentation', 'sleep_fragmentation',
    'feeding_ratio', 'sleep_ratio',
    'total_distance', 'locomotion_distance', 'active_distance',
    'climbing_distance', 'exploration_intensity', 'locomotion_efficiency',
]

def compute_within_animal_zscores(df, features):
    df_out = df.copy()
    for feature in features:
        if feature in df_out.columns:
            zscore_col = f'{feature}_z'
            df_out[zscore_col] = df_out.groupby('animal_id')[feature].transform(
                lambda x: (x - x.mean()) / x.std() if x.std() > 0 else 0
            )
    return df_out

df_nightly = compute_within_animal_zscores(df_nightly, FEATURES_TO_ZSCORE)

# Classification weights
CLASSIFICATION_WEIGHTS_V2 = {
    'activity_amplitude_z':       (+2.0, 'Higher activity → Estrus'),
    'locomotion_duration_z':      (+1.5, 'More locomotion → Estrus'),
    'inactive_duration_z':        (-1.5, 'Less inactive time → Estrus'),
    'inferred_sleep_duration_z':  (-1.0, 'Less sleep → Estrus'),
    'climbing_duration_z':        (+0.5, 'More climbing → Estrus'),
    'locomotion_bout_count_z':    (+1.0, 'More locomotion bouts → Estrus'),
    'climbing_bout_count_z':      (+0.5, 'More climbing bouts → Estrus'),
    'active_mean_bout_z':         (+1.0, 'Longer bouts (sustained) → Estrus'),
    'active_fragmentation_z':     (-1.0, 'Less fragmented activity → Estrus'),
    'sleep_fragmentation_z':      (-0.5, 'Less sleep fragmentation → Estrus'),
    'feeding_ratio_z':            (-1.0, 'Less feeding → Estrus'),
    'total_distance_z':           (+1.5, 'More distance traveled → Estrus'),
    'locomotion_distance_z':      (+1.0, 'More locomotion distance → Estrus'),
    'exploration_intensity_z':    (+1.0, 'Higher exploration intensity → Estrus'),
}

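# Compute estrous score as a weighted sum of the available z-scored features
df_nightly['estrous_score'] = 0.0  # float, so the weighted sum is not built on an int column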
used = 0
for feature, (weight, _) in CLASSIFICATION_WEIGHTS_V2.items():
    if feature in df_nightly.columns:
        df_nightly['estrous_score'] += weight * df_nightly[feature]
        used += 1
print(f"Used {used}/{len(CLASSIFICATION_WEIGHTS_V2)} features for scoring")

# Classify
def classify_estrous_state(df, threshold_percentile=75):
    """Label nights at or above each animal's own score percentile as 'Estrus-like'."""
    df_out = df.copy()
    # Per-animal threshold, broadcast back to every night of that animal.
    # Using transform avoids mutating the group inside groupby.apply.
    threshold = df_out.groupby('animal_id')['estrous_score'].transform(
        lambda s: np.percentile(s, threshold_percentile)
    )
    df_out['estrous_state'] = np.where(
        df_out['estrous_score'] >= threshold, 'Estrus-like', 'Diestrus-like'
    )
    return df_out

df_nightly = classify_estrous_state(df_nightly, threshold_percentile=75)

# Results
state_counts = df_nightly['estrous_state'].value_counts()
n_total = len(df_nightly)
n_estrus = state_counts.get('Estrus-like', 0)

print("\n" + "="*70)
print("CLASSIFICATION RESULTS (WITH EXPLORATION FEATURES)")
print("="*70)
print(f"  Estrus-like nights:   {n_estrus:3d} ({100*n_estrus/n_total:.1f}%)")
print(f"  Diestrus-like nights: {n_total-n_estrus:3d} ({100*(n_total-n_estrus)/n_total:.1f}%)")
print(f"  Expected: ~25% Estrus-like")

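# Validate: compare feature means between the two labelled groups.
# The labels come from a score built mostly on these same features, so the
# p-values below are descriptive rather than independent evidence.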
print("\n" + "="*70)
print("FEATURE VALIDATION")
print("="*70)

estrus_df = df_nightly[df_nightly['estrous_state'] == 'Estrus-like']
diestrus_df = df_nightly[df_nightly['estrous_state'] == 'Diestrus-like']

validation_features = [
    ('activity_amplitude', 'Higher in Estrus'),
    ('locomotion_duration', 'Higher in Estrus'),
    ('total_distance', 'Higher in Estrus'),
    ('locomotion_distance', 'Higher in Estrus'),
    ('climbing_distance', 'Higher in Estrus'),
    ('exploration_intensity', 'Higher in Estrus'),
    ('active_mean_bout', 'Higher in Estrus'),
    ('inactive_duration', 'Lower in Estrus'),
    ('inferred_sleep_duration', 'Lower in Estrus'),
    ('active_fragmentation', 'Lower in Estrus'),
    ('feeding_ratio', 'Lower in Estrus'),
]

print(f"\n{'Feature':<25} {'Estrus':<12} {'Diestrus':<12} {'p-value':<10} {'Dir?':<5} {'Sig?'}")
print("-"*75)

sig_count = 0
correct_count = 0

for feat, expected in validation_features:
    if feat not in df_nightly.columns:
        continue
    
    e_vals = estrus_df[feat].dropna()
    d_vals = diestrus_df[feat].dropna()
    
    if len(e_vals) == 0 or len(d_vals) == 0:
        continue
    
    e_mean = e_vals.mean()
    d_mean = d_vals.mean()
    _, p_val = ttest_ind(e_vals, d_vals)
    
    if 'Higher' in expected:
        correct = e_mean > d_mean
    else:
        correct = e_mean < d_mean
    
    dir_mark = "✓" if correct else "✗"
    sig_mark = "*" if p_val < 0.05 else ""
    
    if p_val < 0.05:
        sig_count += 1
    if correct:
        correct_count += 1
    
    print(f"{feat:<25} {e_mean:<12.2f} {d_mean:<12.2f} {p_val:<10.4f} {dir_mark:<5} {sig_mark}")

print("-"*75)
print(f"Significant (p<0.05): {sig_count}/{len(validation_features)}")
print(f"Correct direction:    {correct_count}/{len(validation_features)}")

Used 14/14 features for scoring

CLASSIFICATION RESULTS (WITH EXPLORATION FEATURES)
  Estrus-like nights:    72 (27.9%)
  Diestrus-like nights: 186 (72.1%)
  Expected: ~25% Estrus-like

FEATURE VALIDATION

Feature                   Estrus       Diestrus     p-value    Dir?  Sig?
---------------------------------------------------------------------------
activity_amplitude        29303.86     27727.74     0.0339     ✓     *
locomotion_duration       4057.39      2971.91      0.0000     ✓     *
total_distance            74143.72     52530.09     0.0000     ✓     *
locomotion_distance       15111.03     10833.85     0.0000     ✓     *
climbing_distance         18946.65     10520.73     0.0000     ✓     *
exploration_intensity     2.59         1.92         0.0000     ✓     *
active_mean_bout          7.29         8.23         0.0000     ✗     *
inactive_duration         9006.81      10379.78     0.0002     ✓     *
inferred_sleep_duration   2794.29      3982.05      0.0000     ✓     *
activ
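
Because `classify_estrous_state` thresholds at the 75th percentile of each animal's own scores and uses `>=`, roughly a quarter of every animal's nights are labelled estrus-like whatever the data look like. The 27.9% above is therefore largely a property of the rule rather than a check on the ~25% expectation. A minimal sketch on purely random scores illustrates this; the animal IDs and night counts are made up for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 6 animals x 43 nights of pure-noise scores (no biology at all)
toy = pd.DataFrame({
    'animal_id': np.repeat([f'A{i}' for i in range(6)], 43),
    'estrous_score': rng.normal(size=6 * 43),
})

# Same rule as classify_estrous_state: per-animal 75th percentile, '>=' comparison
threshold = toy.groupby('animal_id')['estrous_score'].transform(
    lambda s: np.percentile(s, 75)
)
toy['estrous_state'] = np.where(
    toy['estrous_score'] >= threshold, 'Estrus-like', 'Diestrus-like'
)

frac_estrus = (toy['estrous_state'] == 'Estrus-like').mean()
print(f"Estrus-like fraction on random scores: {100 * frac_estrus:.1f}%")
```

The fraction lands slightly above 25% because the night at or just above the interpolated percentile is included; animals with fewer nights push it further above 25%.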

In [19]:
# ============================================================
# CORRECTED APPROACH: Data-driven weights, not manual weights
# Based on Khatiz et al. methodology
# ============================================================

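from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis, PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score  # used in the clustering cells
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind  # used in the validation cell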

# Features to use (all standardized equally - no manual weights!)
FEATURES_FOR_ANALYSIS = [
    # Duration metrics
    'activity_amplitude',
    'locomotion_duration', 
    'inactive_duration',
    'inferred_sleep_duration',
    'feeding_duration',
    'climbing_duration',
    
    # Bout metrics
    'locomotion_bout_count',
    'climbing_bout_count',
    'active_mean_bout',
    'active_fragmentation',
    'sleep_fragmentation',
    
    # Exploration metrics
    'total_distance',
    'locomotion_distance',
    'exploration_intensity',
]

# Prepare data - drop rows with missing values
df_analysis = df_nightly.dropna(subset=FEATURES_FOR_ANALYSIS).copy()
X = df_analysis[FEATURES_FOR_ANALYSIS].values

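# STANDARDIZE (like Khatiz - puts all metrics on equal footing)
# Note: StandardScaler pools all nights from all animals, so this is not the
# within-animal z-score computed by compute_within_animal_zscores.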
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Data prepared: {X_scaled.shape[0]} nights × {X_scaled.shape[1]} features")
print(f"Features (all equally weighted after standardization):")
for f in FEATURES_FOR_ANALYSIS:
    print(f"  - {f}")

Data prepared: 258 nights × 14 features
Features (all equally weighted after standardization):
  - activity_amplitude
  - locomotion_duration
  - inactive_duration
  - inferred_sleep_duration
  - feeding_duration
  - climbing_duration
  - locomotion_bout_count
  - climbing_bout_count
  - active_mean_bout
  - active_fragmentation
  - sleep_fragmentation
  - total_distance
  - locomotion_distance
  - exploration_intensity
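
The cell above standardizes the raw features with `StandardScaler`, which pools nights from all animals, whereas the pipeline overview calls for z-scoring within each animal. With pooled scaling, stable differences between animals stay in the data and can dominate the clustering. The sketch below contrasts the two on a made-up two-animal example; the values and animal IDs are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical: two animals with different baselines but the same night-to-night pattern
toy = pd.DataFrame({
    'animal_id': ['A'] * 4 + ['B'] * 4,
    'total_distance': [10.0, 12.0, 11.0, 15.0, 50.0, 52.0, 51.0, 55.0],
})

# Pooled standardization (what StandardScaler does on the full matrix)
toy['pooled_z'] = StandardScaler().fit_transform(toy[['total_distance']]).ravel()

# Within-animal standardization (same idea as compute_within_animal_zscores)
grouped = toy.groupby('animal_id')['total_distance']
toy['within_z'] = (
    (toy['total_distance'] - grouped.transform('mean')) / grouped.transform('std')
)

# Per-animal means: pooled_z keeps the baseline offset, within_z removes it
animal_means = toy.groupby('animal_id')[['pooled_z', 'within_z']].mean()
print(animal_means.round(2))
```

If within-animal scaling is wanted for the clustering steps, one option would be to build `X` from the `*_z` columns instead of the raw features; whether that matches the source paper's preprocessing is not established here.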


In [20]:
# Factor Analysis - let the DATA determine the weights (loadings)
# Mirrors the factor-analysis step listed for Khatiz et al. (2025)

n_factors = 3  # Khatiz found ~3 main factors
fa = FactorAnalysis(n_components=n_factors, random_state=42)
fa.fit(X_scaled)

# Get loadings (these are the DATA-DERIVED weights!)
loadings = pd.DataFrame(
    fa.components_.T,
    index=FEATURES_FOR_ANALYSIS,
    columns=[f'Factor_{i+1}' for i in range(n_factors)]
)

print("="*70)
print("FACTOR ANALYSIS LOADINGS (Data-Derived Weights)")
print("="*70)
print(loadings.round(3).to_string())

# Interpret factors
print("\n" + "="*70)
print("FACTOR INTERPRETATION:")
print("="*70)
for i in range(n_factors):
    col = f'Factor_{i+1}'
    print(f"\n{col}:")
    
    # Top positive loadings
    top_pos = loadings[col].nlargest(3)
    print(f"  HIGH: {', '.join([f'{idx} ({val:+.2f})' for idx, val in top_pos.items()])}")
    
    # Top negative loadings
    top_neg = loadings[col].nsmallest(3)
    print(f"  LOW:  {', '.join([f'{idx} ({val:+.2f})' for idx, val in top_neg.items()])}")

FACTOR ANALYSIS LOADINGS (Data-Derived Weights)
                         Factor_1  Factor_2  Factor_3
activity_amplitude          0.494    -0.437    -0.231
locomotion_duration         1.000     0.003    -0.009
inactive_duration           0.060    -0.995    -0.004
inferred_sleep_duration    -0.021    -0.825    -0.042
feeding_duration            0.215    -0.095     0.061
climbing_duration           0.476     0.021     0.831
locomotion_bout_count       0.981    -0.050    -0.020
climbing_bout_count         0.548     0.013     0.731
active_mean_bout           -0.739     0.229    -0.232
active_fragmentation        0.759    -0.138     0.252
sleep_fragmentation        -0.194     0.323     0.027
total_distance              0.758    -0.053     0.635
locomotion_distance         0.985     0.020     0.058
exploration_intensity       0.514     0.128     0.783

FACTOR INTERPRETATION:

Factor_1:
  HIGH: locomotion_duration (+1.00), locomotion_distance (+0.98), locomotion_bout_count (+0.98)
  LOW:  act
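
Unrotated loadings often spread a feature across several factors (for example `total_distance` and `climbing_duration` load on both Factor_1 and Factor_3 above). scikit-learn's `FactorAnalysis` accepts a `rotation` argument that can make the structure easier to read. The sketch below applies varimax to synthetic data standing in for `X_scaled`; the feature names are hypothetical, and whether the source paper used a rotation is not stated in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for X_scaled: 2 latent factors driving 6 observed features
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.9, 0.8, 0.0, 0.0, 0.1],
                   [0.0, 0.1, 0.0, 1.0, 0.9, 0.8]])
noise = 0.3 * rng.normal(size=(200, 6))
X_toy = StandardScaler().fit_transform(latent @ mixing + noise)

feature_names = [f'feat_{i}' for i in range(6)]  # hypothetical names

# Varimax rotation pushes each feature toward loading on a single factor
fa_rot = FactorAnalysis(n_components=2, rotation='varimax', random_state=42)
fa_rot.fit(X_toy)

loadings_rot = pd.DataFrame(
    fa_rot.components_.T, index=feature_names, columns=['Factor_1', 'Factor_2']
)
print(loadings_rot.round(2))
```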

In [21]:
# PCA - standardized, no manual weights
# Khatiz used this to find axes of differentiation

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Add PCA coordinates to dataframe
for i in range(5):
    df_analysis[f'PC{i+1}'] = X_pca[:, i]

print("="*70)
print("PCA RESULTS (Standardized - Equal Feature Weighting)")
print("="*70)
print(f"\nVariance explained:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {var*100:.1f}%")
print(f"  Total (5 PCs): {sum(pca.explained_variance_ratio_)*100:.1f}%")

# PCA loadings
pca_loadings = pd.DataFrame(
    pca.components_.T,
    index=FEATURES_FOR_ANALYSIS,
    columns=[f'PC{i+1}' for i in range(5)]
)
print("\nPCA Loadings (first 3 components):")
print(pca_loadings[['PC1', 'PC2', 'PC3']].round(3).to_string())

PCA RESULTS (Standardized - Equal Feature Weighting)

Variance explained:
  PC1: 49.2%
  PC2: 18.1%
  PC3: 10.6%
  PC4: 8.2%
  PC5: 5.7%
  Total (5 PCs): 91.8%

PCA Loadings (first 3 components):
                           PC1    PC2    PC3
activity_amplitude      -0.145 -0.399 -0.342
locomotion_duration     -0.338 -0.067 -0.317
inactive_duration       -0.058 -0.504  0.283
inferred_sleep_duration -0.009 -0.503  0.394
feeding_duration        -0.101 -0.206 -0.026
climbing_duration       -0.296  0.185  0.346
locomotion_bout_count   -0.336 -0.111 -0.318
climbing_bout_count     -0.316  0.148  0.265
active_mean_bout         0.325  0.023  0.036
active_fragmentation    -0.328  0.042 -0.022
sleep_fragmentation      0.074  0.372 -0.198
total_distance          -0.358  0.078  0.155
locomotion_distance     -0.342 -0.039 -0.270
exploration_intensity   -0.293  0.264  0.345
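
To decide how many components to carry forward, the cumulative explained variance is more useful than the per-component values. The sketch below uses the rounded ratios printed above; the 90% cut-off is an assumed example threshold, not one taken from the source:

```python
import numpy as np

# Explained-variance ratios as printed above (rounded to 0.1%)
explained = np.array([0.492, 0.181, 0.106, 0.082, 0.057])
cumulative = np.cumsum(explained)

target = 0.90  # assumed cut-off for illustration
# First position where the running total reaches the target, as a 1-based count
n_needed = int(np.searchsorted(cumulative, target) + 1)

for i, c in enumerate(cumulative, start=1):
    print(f"PC1..PC{i}: {100 * c:.1f}%")
print(f"Components needed for >= {target:.0%}: {n_needed}")
```

PC1 loads negatively on most locomotion features here. The sign of a principal component is arbitrary, so PC1 scores can be negated if a "more active = higher" axis is easier to read.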


In [22]:
# K-Means - no weights, let algorithm find natural clusters
# Khatiz used k=5 (4 estrous phases + male); here we scan k=2..6 and apply k=2

print("="*70)
print("K-MEANS CLUSTERING (Data-Driven Grouping)")
print("="*70)

# Try different k values
print(f"\n{'k':<5} {'Silhouette':<12} {'Interpretation'}")
print("-"*40)

silhouettes = []
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    silhouettes.append(sil)
    
    if k == 2:
        interp = "High vs Low estrogen?"
    elif k == 4:
        interp = "4 estrous phases?"
    else:
        interp = ""
    print(f"{k:<5} {sil:<12.3f} {interp}")

optimal_k = range(2, 7)[np.argmax(silhouettes)]
print(f"\nOptimal k by silhouette: {optimal_k}")

# Apply k=2 for binary classification
kmeans_2 = KMeans(n_clusters=2, random_state=42, n_init=10)
df_analysis['kmeans_cluster'] = kmeans_2.fit_predict(X_scaled)

# Determine which cluster is "estrus-like" based on activity
cluster_means = df_analysis.groupby('kmeans_cluster')[FEATURES_FOR_ANALYSIS].mean()
print("\nCluster means (key features):")
print(cluster_means[['activity_amplitude', 'total_distance', 'inactive_duration']].round(1))

# Higher activity cluster = estrus-like
estrus_cluster = cluster_means['activity_amplitude'].idxmax()
df_analysis['kmeans_state'] = df_analysis['kmeans_cluster'].apply(
    lambda x: 'Estrus-like' if x == estrus_cluster else 'Diestrus-like'
)

km_counts = df_analysis['kmeans_state'].value_counts()
print(f"\nK-Means Classification (k=2):")
print(f"  Estrus-like:   {km_counts.get('Estrus-like', 0)} ({100*km_counts.get('Estrus-like', 0)/len(df_analysis):.1f}%)")
print(f"  Diestrus-like: {km_counts.get('Diestrus-like', 0)} ({100*km_counts.get('Diestrus-like', 0)/len(df_analysis):.1f}%)")

K-MEANS CLUSTERING (Data-Driven Grouping)

k     Silhouette   Interpretation
----------------------------------------
2     0.272        High vs Low estrogen?
3     0.331        
4     0.227        4 estrous phases?
5     0.201        
6     0.189        

Optimal k by silhouette: 3

Cluster means (key features):
                activity_amplitude  total_distance  inactive_duration
kmeans_cluster                                                       
0                          28054.7         45488.5            10158.9
1                          28329.4         77308.4             9764.0

K-Means Classification (k=2):
  Estrus-like:   106 (41.1%)
  Diestrus-like: 152 (58.9%)
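
The estrus-like cluster is chosen by the higher `activity_amplitude` mean, but the two cluster means printed above differ by only about 1% (28054.7 vs 28329.4), while `total_distance` differs far more (45488.5 vs 77308.4). Labelling from a single weakly separating feature can flip with small changes in the data. One possible alternative, sketched here on synthetic data, averages the standardized centroids over several activity features; the choice of features is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ['activity_amplitude', 'total_distance', 'locomotion_duration']

# Synthetic nights: two groups that differ mainly in distance and locomotion
low = rng.normal([100, 50, 30], [10, 5, 3], size=(60, 3))
high = rng.normal([101, 80, 45], [10, 5, 3], size=(40, 3))
X_toy = StandardScaler().fit_transform(np.vstack([low, high]))

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_toy)

# Centroids are in standardized units, so features can be averaged directly
activity_idx = [feature_names.index(f) for f in ('total_distance', 'locomotion_duration')]
centroid_activity = km.cluster_centers_[:, activity_idx].mean(axis=1)
estrus_cluster = int(np.argmax(centroid_activity))

states = np.where(km.labels_ == estrus_cluster, 'Estrus-like', 'Diestrus-like')
print(f"Estrus-like cluster: {estrus_cluster}, n = {(states == 'Estrus-like').sum()}")
```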


In [23]:
# Hierarchical clustering with Ward linkage (like Khatiz)

print("="*70)
print("HIERARCHICAL CLUSTERING")
print("="*70)

linkage_matrix = linkage(X_scaled, method='ward')

# Cut at k=2
hc_labels = fcluster(linkage_matrix, 2, criterion='maxclust')
df_analysis['hc_cluster'] = hc_labels

# Determine which cluster is estrus-like
hc_means = df_analysis.groupby('hc_cluster')[FEATURES_FOR_ANALYSIS].mean()
hc_estrus = hc_means['activity_amplitude'].idxmax()

df_analysis['hc_state'] = df_analysis['hc_cluster'].apply(
    lambda x: 'Estrus-like' if x == hc_estrus else 'Diestrus-like'
)

hc_counts = df_analysis['hc_state'].value_counts()
print(f"Hierarchical Classification (k=2):")
print(f"  Estrus-like:   {hc_counts.get('Estrus-like', 0)} ({100*hc_counts.get('Estrus-like', 0)/len(df_analysis):.1f}%)")
print(f"  Diestrus-like: {hc_counts.get('Diestrus-like', 0)} ({100*hc_counts.get('Diestrus-like', 0)/len(df_analysis):.1f}%)")

HIERARCHICAL CLUSTERING
Hierarchical Classification (k=2):
  Estrus-like:   207 (80.2%)
  Diestrus-like: 51 (19.8%)
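
The k=2 Ward cut gives an 80/20 split, and the larger cluster receives the "Estrus-like" label because the label follows the `activity_amplitude` mean rather than cluster size. Looking at cluster sizes across several cut levels can help show whether the first split mainly separates a small outlying group. A minimal sketch on synthetic data with three made-up groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Synthetic stand-in for X_scaled: one large group and two smaller, well-separated ones
X_toy = np.vstack([
    rng.normal(0, 1, size=(50, 4)),
    rng.normal(4, 1, size=(20, 4)),
    rng.normal(-4, 1, size=(10, 4)),
])

Z = linkage(X_toy, method='ward')

# Cluster sizes when the same tree is cut into 2, 3 and 4 clusters
sizes_by_k = {}
for k in (2, 3, 4):
    labels = fcluster(Z, k, criterion='maxclust')  # labels start at 1
    sizes_by_k[k] = np.bincount(labels)[1:]
    print(f"k={k}: cluster sizes = {sizes_by_k[k].tolist()}")
```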


In [24]:
# Use Factor Analysis loadings as DATA-DERIVED weights
# This is the closest to Khatiz's approach

print("="*70)
print("FACTOR SCORE CLASSIFICATION (Data-Derived Weights)")
print("="*70)

# Get factor scores (projection onto factor axes)
factor_scores = fa.transform(X_scaled)
df_analysis['Factor_1'] = factor_scores[:, 0]
df_analysis['Factor_2'] = factor_scores[:, 1]
df_analysis['Factor_3'] = factor_scores[:, 2]

# Identify which factor best captures "activity vs rest" dimension
# Look at factor loadings to interpret
print("\nFactor loadings for key features:")
key_features = ['activity_amplitude', 'locomotion_duration', 'inactive_duration', 
                'total_distance', 'inferred_sleep_duration']
print(loadings.loc[key_features].round(3))

# The factor with high activity/distance loadings and low inactive/sleep loadings
# is the candidate "estrus factor". Compare the mean loading of each feature group.
activity_features = ['activity_amplitude', 'locomotion_duration', 'total_distance']
rest_features = ['inactive_duration', 'inferred_sleep_duration']

# Check which factor has this pattern
estrus_factor = None  # remains None when no factor meets both thresholds
for i in range(n_factors):
    col = f'Factor_{i+1}'
    activity_loadings = loadings.loc[activity_features, col].mean()
    rest_loadings = loadings.loc[rest_features, col].mean()
    print(f"\n{col}: Activity mean={activity_loadings:.3f}, Rest mean={rest_loadings:.3f}")
    if activity_loadings > 0.3 and rest_loadings < -0.1:
        estrus_factor = col
        print(f"  → This appears to be the 'Activity/Estrus' factor")
    elif rest_loadings > 0.3:
        print(f"  → This appears to be the 'Rest/Diestrus' factor")

FACTOR SCORE CLASSIFICATION (Data-Derived Weights)

Factor loadings for key features:
                         Factor_1  Factor_2  Factor_3
activity_amplitude          0.494    -0.437    -0.231
locomotion_duration         1.000     0.003    -0.009
inactive_duration           0.060    -0.995    -0.004
total_distance              0.758    -0.053     0.635
inferred_sleep_duration    -0.021    -0.825    -0.042

Factor_1: Activity mean=0.751, Rest mean=0.019

Factor_2: Activity mean=-0.162, Rest mean=-0.910

Factor_3: Activity mean=0.131, Rest mean=-0.023
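
No factor is labelled by the loop above: Factor_1 has a high activity mean (0.751) but a rest mean of 0.019, which is not below -0.1, and Factor_2 has a rest mean of -0.910, which fails the `> 0.3` test only because of its sign. Factor signs are arbitrary, so a magnitude-based rule may be more robust. The sketch below applies one such rule to the loadings printed above; the rule itself is an assumption for illustration, not part of the source method:

```python
import pandas as pd

# Loadings copied from the table printed above
loadings_key = pd.DataFrame(
    {
        'Factor_1': [0.494, 1.000, 0.060, 0.758, -0.021],
        'Factor_2': [-0.437, 0.003, -0.995, -0.053, -0.825],
        'Factor_3': [-0.231, -0.009, -0.004, 0.635, -0.042],
    },
    index=['activity_amplitude', 'locomotion_duration', 'inactive_duration',
           'total_distance', 'inferred_sleep_duration'],
)
activity_features = ['activity_amplitude', 'locomotion_duration', 'total_distance']
rest_features = ['inactive_duration', 'inferred_sleep_duration']

activity_mean = loadings_key.loc[activity_features].mean()
rest_mean = loadings_key.loc[rest_features].mean()

# Pick each axis by loading magnitude, then record the sign needed to orient it
activity_factor = activity_mean.abs().idxmax()
rest_factor = rest_mean.abs().idxmax()
rest_sign = 1 if rest_mean[rest_factor] > 0 else -1

print(f"Activity axis: {activity_factor} (mean loading {activity_mean[activity_factor]:+.3f})")
print(f"Rest axis:     {rest_factor} (mean loading {rest_mean[rest_factor]:+.3f}); "
      f"multiply its scores by {rest_sign} so that higher = more rest")
```

Under this rule Factor_1 is the locomotor activity axis and Factor_2 is an inverted rest axis, so activity and rest appear as two separate dimensions rather than opposite ends of a single "estrus factor".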


In [25]:
# Compare classifications from different methods

print("="*70)
print("COMPARISON OF DATA-DRIVEN CLASSIFICATION METHODS")
print("="*70)

methods = {
    'K-Means': 'kmeans_state',
    'Hierarchical': 'hc_state',
}

print("\nClassification Distribution:")
print("-"*50)
for name, col in methods.items():
    if col in df_analysis.columns:
        counts = df_analysis[col].value_counts()
        e_pct = 100 * counts.get('Estrus-like', 0) / len(df_analysis)
        print(f"  {name:15}: {e_pct:.1f}% Estrus-like")

# Agreement between methods
print("\n" + "-"*50)
print("Method Agreement (Adjusted Rand Index):")

if 'kmeans_state' in df_analysis.columns and 'hc_state' in df_analysis.columns:
    km_binary = (df_analysis['kmeans_state'] == 'Estrus-like').astype(int)
    hc_binary = (df_analysis['hc_state'] == 'Estrus-like').astype(int)
    ari = adjusted_rand_score(km_binary, hc_binary)
    print(f"  K-Means vs Hierarchical: ARI = {ari:.3f}")

# Use K-Means as the primary classification for the validation step
df_analysis['estrous_state'] = df_analysis['kmeans_state']
print(f"\nUsing K-Means as primary classification")

COMPARISON OF DATA-DRIVEN CLASSIFICATION METHODS

Classification Distribution:
--------------------------------------------------
  K-Means        : 41.1% Estrus-like
  Hierarchical   : 80.2% Estrus-like

--------------------------------------------------
Method Agreement (Adjusted Rand Index):
  K-Means vs Hierarchical: ARI = 0.320

Using K-Means as primary classification
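
An ARI of 0.320 indicates only partial agreement, and the two methods label very different fractions of nights as estrus-like (41.1% vs 80.2%). A contingency table shows where the disagreements fall, which the single ARI number hides. The sketch below uses made-up labels for ten nights (1 = estrus-like, 0 = diestrus-like) and also shows that ARI compares partitions, not label names:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for 10 nights from two methods (1 = estrus-like)
method_a = pd.Series([1, 1, 0, 0, 0, 1, 0, 0, 1, 0], name='method_a')
method_b = pd.Series([1, 1, 1, 0, 0, 1, 1, 0, 1, 0], name='method_b')

# Rows: method A, columns: method B; off-diagonal cells are disagreements
table = pd.crosstab(method_a, method_b)
raw_agreement = (method_a == method_b).mean()
ari = adjusted_rand_score(method_a, method_b)

# ARI ignores label names: swapping 0 and 1 in one partition leaves it unchanged
ari_swapped = adjusted_rand_score(method_a, 1 - method_b)

print(table)
print(f"Raw agreement: {raw_agreement:.2f}, ARI: {ari:.3f}, ARI after swap: {ari_swapped:.3f}")
```

Because ARI is invariant to relabelling, it does not check that both methods attached the "Estrus-like" name to corresponding clusters; the contingency table on the named states does.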


In [26]:
# Validate that classified groups differ on key features

print("="*70)
print("CLASSIFICATION VALIDATION")
print("="*70)

estrus_df = df_analysis[df_analysis['estrous_state'] == 'Estrus-like']
diestrus_df = df_analysis[df_analysis['estrous_state'] == 'Diestrus-like']

validation_features = [
    ('activity_amplitude', 'Higher in Estrus'),
    ('locomotion_duration', 'Higher in Estrus'),
    ('total_distance', 'Higher in Estrus'),
    ('exploration_intensity', 'Higher in Estrus'),
    ('climbing_duration', 'Higher in Estrus'),
    ('inactive_duration', 'Lower in Estrus'),
    ('inferred_sleep_duration', 'Lower in Estrus'),
    ('feeding_duration', 'Lower in Estrus'),
]

print(f"\n{'Feature':<25} {'Estrus':<12} {'Diestrus':<12} {'p-value':<10} {'Expected?'}")
print("-"*70)

sig_count = 0
correct_count = 0

for feat, expected in validation_features:
    if feat not in df_analysis.columns:
        continue
    
    e_vals = estrus_df[feat].dropna()
    d_vals = diestrus_df[feat].dropna()
    
    e_mean = e_vals.mean()
    d_mean = d_vals.mean()
    _, p_val = ttest_ind(e_vals, d_vals)
    
    if 'Higher' in expected:
        correct = e_mean > d_mean
    else:
        correct = e_mean < d_mean
    
    dir_mark = "✓" if correct else "✗"
    sig_mark = "***" if p_val < 0.001 else "**" if p_val < 0.01 else "*" if p_val < 0.05 else ""
    
    if p_val < 0.05:
        sig_count += 1
    if correct:
        correct_count += 1
    
    print(f"{feat:<25} {e_mean:<12.2f} {d_mean:<12.2f} {p_val:<10.4f} {dir_mark} {sig_mark}")

print("-"*70)
print(f"Significant (p<0.05): {sig_count}/{len(validation_features)}")
print(f"Correct direction:    {correct_count}/{len(validation_features)}")

CLASSIFICATION VALIDATION

Feature                   Estrus       Diestrus     p-value    Expected?
----------------------------------------------------------------------
activity_amplitude        28329.41     28054.74     0.6864     ✓ 
locomotion_duration       4178.69      2644.51      0.0000     ✓ ***
total_distance            77308.36     45488.53     0.0000     ✓ ***
exploration_intensity     2.78         1.64         0.0000     ✓ ***
climbing_duration         4767.00      2452.18      0.0000     ✓ ***
inactive_duration         9763.97      10158.88     0.2387     ✓ 
inferred_sleep_duration   3345.65      3863.23      0.0128     ✓ *
feeding_duration          2858.08      2792.55      0.6942     ✗ 
----------------------------------------------------------------------
Significant (p<0.05): 5/8
Correct direction:    7/8
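
The table reports eight t-tests at an uncorrected 0.05 level. The sketch below applies a Bonferroni correction to the printed p-values to show how the significance count changes. Bonferroni is chosen here only for simplicity; the notebook does not specify a correction, and the printed p-values are rounded to four decimals:

```python
# p-values as printed in the table above (0.0000 means p < 0.00005)
reported_p = {
    'activity_amplitude': 0.6864,
    'locomotion_duration': 0.0000,
    'total_distance': 0.0000,
    'exploration_intensity': 0.0000,
    'climbing_duration': 0.0000,
    'inactive_duration': 0.2387,
    'inferred_sleep_duration': 0.0128,
    'feeding_duration': 0.6942,
}

alpha = 0.05
bonferroni_alpha = alpha / len(reported_p)  # 0.05 / 8 = 0.00625

# Features that stay significant under the stricter per-test threshold
survivors = [feat for feat, p in reported_p.items() if p < bonferroni_alpha]

print(f"Bonferroni threshold: {bonferroni_alpha:.5f}")
print(f"Still significant: {len(survivors)}/{len(reported_p)} -> {survivors}")
```

All eight validation features were also inputs to the clustering that defined the groups, and nights from the same animal are not independent samples, so even the corrected p-values are best read as descriptive.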
