# SWIM Research — Phase 1: Data Audit, EDA & Monolithic Baselines

**Project:** Agentic AI for Multi-Modal Earth Observation  
**RQ1:** Does decomposing environmental prediction into specialized agents outperform monolithic models?  

---

### What this notebook does:
1. Loads & audits all 4 data sources (53K ml_ready + 49K in-situ + 8.7K satellite + lake JSON bloom labels)
2. Checks data quality — flags synthetic data, missing values, suspicious patterns
3. Builds a unified research dataset merging all sources
4. Creates proper bloom labels (multi-criteria)
5. Temporal train/val/test split (no data leakage)
6. Trains monolithic baselines (GradientBoosting, RandomForest, Ensemble)
7. Evaluates with research metrics (AUROC, AUPRC, Brier, ECE)
8. Exports everything as a zip

### Data included in this package:
```
data/
  processed/ml_ready_data.csv          (53K rows, 10 features)
  raw/in_situ/LUBW_BW_...csv           (49K rows, 23 features)
  raw/satellite/*.csv                   (8.7K rows, spectral indices)
  raw/lake_data/*.json                  (bloom probability, status, toxin levels)
```

---
## 0. Setup & Environment Detection

In [None]:
# Detect environment (Colab vs local Jupyter)
import os, sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('Running in Google Colab')
    # If running in Colab, upload the zip or mount Google Drive
    # Option A: Upload zip manually
    # from google.colab import files
    # uploaded = files.upload()  # Upload research_data.zip
    # !unzip research_data.zip -d /content/research_data
    # DATA_ROOT = '/content/research_data/data'
    
    # Option B: Mount Google Drive (if you uploaded data there)
    # from google.colab import drive
    # drive.mount('/content/drive')
    # DATA_ROOT = '/content/drive/MyDrive/SWIM_Research/data'
    
    # Default: assume zip was extracted to /content/data
    DATA_ROOT = '/content/data'
    print(f'  Data root: {DATA_ROOT}')
else:
    print('Running in local Jupyter')
    # Look for the data relative to the current working directory.
    # Falls back to the first candidate so the check below reports what is missing.
    candidates = ['../research_package/data', './data']
    DATA_ROOT = next((c for c in candidates if os.path.exists(c)), candidates[0])
    print(f'  Data root: {os.path.abspath(DATA_ROOT)}')

# Verify data exists
required = [
    os.path.join(DATA_ROOT, 'processed', 'ml_ready_data.csv'),
    os.path.join(DATA_ROOT, 'raw', 'in_situ'),
    os.path.join(DATA_ROOT, 'raw', 'satellite'),
    os.path.join(DATA_ROOT, 'raw', 'lake_data'),
]
for p in required:
    exists = os.path.exists(p)
    print(f'  {"OK" if exists else "MISSING"}: {p}')
    if not exists:
        print(f'    -> Fix DATA_ROOT above or upload data')
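
The verification loop above only prints missing paths, so a wrong `DATA_ROOT` would surface later as a read error in section 1. A minimal sketch of a fail-fast variant is shown here; `check_required_paths` is a hypothetical helper (not part of this notebook), and the demo runs against a temporary directory rather than the real data layout.

```python
import os
import tempfile

def check_required_paths(paths):
    """Raise FileNotFoundError listing every missing path (hypothetical helper)."""
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError('Missing required data: ' + ', '.join(missing))
    return True

# Demo on a throwaway directory (assumed layout, for illustration only)
with tempfile.TemporaryDirectory() as root:
    present = os.path.join(root, 'processed')
    os.makedirs(present)
    absent = os.path.join(root, 'raw', 'satellite')

    ok = check_required_paths([present])
    try:
        check_required_paths([present, absent])
        raised = False
    except FileNotFoundError:
        raised = True
```

In the notebook this could be called as `check_required_paths(required)` right after the list is built, if stopping early is preferred over a printed warning.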

In [None]:
# Install dependencies (Colab may need these)
# !pip install -q pandas numpy matplotlib seaborn scikit-learn pyarrow

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import glob
import pickle
import shutil
import zipfile
from pathlib import Path
from datetime import datetime
from collections import OrderedDict
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (roc_auc_score, average_precision_score, brier_score_loss,
                             classification_report, confusion_matrix, 
                             roc_curve, precision_recall_curve)
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.calibration import calibration_curve

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

# Output directory for all results
RESULTS_DIR = Path('results')
RESULTS_DIR.mkdir(exist_ok=True)
(RESULTS_DIR / 'figures').mkdir(exist_ok=True)
(RESULTS_DIR / 'models').mkdir(exist_ok=True)
(RESULTS_DIR / 'data').mkdir(exist_ok=True)

DATA_ROOT = Path(DATA_ROOT)
print(f'Results will be saved to: {RESULTS_DIR.absolute()}')
print('Setup complete.')

---
## 1. Load All Datasets

In [None]:
# ─── 1A: ML-ready data (pre-processed, 10 features) ───
ml_data = pd.read_csv(DATA_ROOT / 'processed' / 'ml_ready_data.csv', parse_dates=['timestamp'])
print(f'1. ml_ready_data: {ml_data.shape[0]:,} rows x {ml_data.shape[1]} cols')
print(f'   Date range: {ml_data["timestamp"].min()} → {ml_data["timestamp"].max()}')
print(f'   Lakes: {ml_data["lake"].nunique()} unique')
print(f'   Top lakes: {ml_data["lake"].value_counts().head(5).to_dict()}')
display(ml_data.head(3))

In [None]:
# ─── 1B: In-situ measurements (raw, 23 features) ───
# sorted() makes the choice deterministic; glob order is not guaranteed
insitu_path = sorted((DATA_ROOT / 'raw' / 'in_situ').glob('*.csv'))[0]
insitu = pd.read_csv(insitu_path, parse_dates=['timestamp'])
print(f'2. in_situ: {insitu.shape[0]:,} rows x {insitu.shape[1]} cols')
print(f'   Date range: {insitu["timestamp"].min()} → {insitu["timestamp"].max()}')
print(f'   Locations: {insitu["location_name"].nunique()} unique')
print(f'   Columns: {list(insitu.columns)}')
display(insitu.head(3))

In [None]:
# ─── 1C: Satellite data (all CSVs combined) ───
sat_files = sorted((DATA_ROOT / 'raw' / 'satellite').glob('*.csv'))
sat_dfs = []
for f in sat_files:
    df = pd.read_csv(f, parse_dates=['acquisition_date'])
    df['source_file'] = f.name
    sat_dfs.append(df)
    print(f'   {f.name}: {len(df):,} rows')

satellite = pd.concat(sat_dfs, ignore_index=True)
print(f'\n3. satellite combined: {satellite.shape[0]:,} rows x {satellite.shape[1]} cols')
print(f'   Date range: {satellite["acquisition_date"].min()} → {satellite["acquisition_date"].max()}')
print(f'   Lakes: {satellite["lake_name"].nunique()} → {satellite["lake_name"].value_counts().to_dict()}')

In [None]:
# ─── 1D: Lake JSON data (has bloom labels!) ───
lake_json_files = sorted((DATA_ROOT / 'raw' / 'lake_data').glob('*.json'))
lake_records = []
for f in lake_json_files:
    if 'current_week' not in str(f):
        with open(f) as fh:
            data = json.load(fh)
            for rec in data:
                rec['source_file'] = f.name
            lake_records.extend(data)
            print(f'   {f.name}: {len(data):,} records')

# Flatten nested JSON into a dataframe
lake_flat = []
for rec in lake_records:
    flat = {
        'date': rec.get('date'),
        'lake': rec.get('lake'),
        'source': rec.get('source'),
        'quality_score': rec.get('quality_score'),
        'source_file': rec.get('source_file'),
    }
    for k, v in rec.get('parameters', {}).items():
        flat[f'param_{k}'] = v
    for k, v in rec.get('habs_indicators', {}).items():
        flat[f'hab_{k}'] = v
    lake_flat.append(flat)

lake_data = pd.DataFrame(lake_flat)
lake_data['date'] = pd.to_datetime(lake_data['date'])

print(f'\n4. lake_json combined: {lake_data.shape[0]:,} rows')
print(f'   Date range: {lake_data["date"].min()} → {lake_data["date"].max()}')
print(f'   Lakes: {lake_data["lake"].value_counts().to_dict()}')
print(f'   Bloom statuses: {lake_data["hab_bloom_status"].value_counts().to_dict()}')
print(f'   Bloom probability: mean={lake_data["hab_bloom_probability"].mean():.3f}, '
      f'std={lake_data["hab_bloom_probability"].std():.3f}')
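
The manual flattening loop in 1D can also be expressed with `pd.json_normalize`. The sketch below is one possible equivalent under the assumption that every record nests its values under `parameters` and `habs_indicators` as the loop expects; the toy records and their values are made up for illustration.

```python
import pandas as pd

# Toy records mimicking the nested shape read in 1D (values are made up)
records = [
    {'date': '2024-06-01', 'lake': 'Bodensee',
     'parameters': {'chlorophyll_a': 12.5, 'temperature': 19.0},
     'habs_indicators': {'bloom_probability': 0.4, 'bloom_status': 'Moderate'}},
    {'date': '2024-06-02', 'lake': 'Bodensee',
     'parameters': {'chlorophyll_a': 8.0},
     'habs_indicators': {'bloom_probability': 0.1}},
]

# Nested keys become 'parameters.chlorophyll_a', 'habs_indicators.bloom_status', ...
flat = pd.json_normalize(records)

# Rename to the same 'param_' / 'hab_' prefixes used by the manual loop
flat.columns = [
    c.replace('parameters.', 'param_').replace('habs_indicators.', 'hab_')
    for c in flat.columns
]
flat['date'] = pd.to_datetime(flat['date'])
```

Keys that are absent from a record (here `temperature` and `bloom_status` in the second one) come out as `NaN`, which matches what the manual loop produces through `pd.DataFrame`.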

---
## 2. Data Quality Audit
Flag synthetic data, zeros, missing values, suspicious uniformity

In [None]:
# ─── 2A: ML-ready data audit ───
print('=' * 70)
print('ML-READY DATA QUALITY AUDIT')
print('=' * 70)

feature_cols = ['chlorophyll_a', 'water_temperature', 'turbidity', 'dissolved_oxygen',
                'ph', 'total_nitrogen', 'total_phosphorus', 'solar_radiation',
                'wind_speed', 'precipitation']

audit_rows = []
for col in feature_cols:
    n = len(ml_data)
    zeros = (ml_data[col] == 0).sum()
    nulls = ml_data[col].isna().sum()
    unique = ml_data[col].nunique()
    mean_val = ml_data[col].mean()
    std_val = ml_data[col].std()
    flag = []
    if zeros / n > 0.5: flag.append('HIGH_ZEROS')
    if nulls / n > 0.3: flag.append('HIGH_NULLS')
    if std_val < 0.001: flag.append('NO_VARIANCE')
    if unique < 5: flag.append('LOW_CARDINALITY')
    audit_rows.append({
        'feature': col, 'zeros': zeros, 'zero_pct': f'{zeros/n*100:.1f}%',
        'nulls': nulls, 'unique': unique,
        'mean': f'{mean_val:.3f}', 'std': f'{std_val:.3f}',
        'flags': ', '.join(flag) if flag else 'OK'
    })

audit_df = pd.DataFrame(audit_rows)
display(audit_df)

# Lake name issues
print(f'\nLake column issues:')
print(f'  "object" as lake name: {(ml_data["lake"] == "object").sum()} rows')
print(f'  Empty/null lake: {ml_data["lake"].isna().sum()} rows')
print(f'  Valid lakes: {ml_data[ml_data["lake"] != "object"]["lake"].unique().tolist()}')
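
The flags in 2A cover zeros, nulls, variance and cardinality, but not the "suspicious uniformity" mentioned in the section intro. One rough heuristic, offered here as an assumption rather than the notebook's method, is to measure how flat a feature's histogram is: values drawn from a uniform random generator tend to fill all bins evenly, while field measurements are usually skewed. `uniformity_score` is a hypothetical helper and the two samples are generated, not taken from the project data.

```python
import numpy as np
import pandas as pd

def uniformity_score(values, bins=20):
    """Coefficient of variation of histogram bin counts (0 = perfectly flat)."""
    vals = pd.Series(values).dropna().to_numpy()
    counts, _ = np.histogram(vals, bins=bins)
    return counts.std() / counts.mean()

rng = np.random.default_rng(0)
flat_sample = rng.uniform(0, 50, 5000)        # stands in for uniformly generated data
skewed_sample = rng.lognormal(1.5, 0.8, 5000) # stands in for a skewed measurement

score_flat = uniformity_score(flat_sample)
score_skewed = uniformity_score(skewed_sample)
print(f'flat={score_flat:.2f}  skewed={score_skewed:.2f}')
```

A low score is only a prompt for a closer look; any cutoff would need to be chosen per feature.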

In [None]:
# ─── 2B: In-situ data audit ───
print('=' * 70)
print('IN-SITU DATA QUALITY AUDIT')
print('=' * 70)

insitu_features = ['chlorophyll_a', 'turbidity', 'dissolved_oxygen', 'ph',
                   'conductivity', 'temperature', 'wind_speed', 'air_temperature', 'humidity']

audit_rows2 = []
for col in insitu_features:
    if col in insitu.columns:
        n = len(insitu)
        zeros = (insitu[col] == 0).sum()
        nulls = insitu[col].isna().sum()
        mn, mx = insitu[col].min(), insitu[col].max()
        audit_rows2.append({
            'feature': col, 'min': f'{mn:.2f}', 'max': f'{mx:.2f}',
            'mean': f'{insitu[col].mean():.2f}', 'std': f'{insitu[col].std():.2f}',
            'zeros': zeros, 'nulls': nulls
        })

display(pd.DataFrame(audit_rows2))
print(f'\nData sources: {insitu["data_source"].value_counts().to_dict()}')
print(f'Locations: {insitu["location_name"].unique().tolist()}')

In [None]:
# ─── 2C: Cross-source chlorophyll comparison ───
fig, axes = plt.subplots(1, 4, figsize=(22, 5))

for ax, (name, series, color) in zip(axes, [
    ('ml_ready', ml_data['chlorophyll_a'].dropna(), 'steelblue'),
    ('in_situ', insitu['chlorophyll_a'].dropna(), 'coral'),
    ('satellite', satellite['chlorophyll_index'].dropna(), 'seagreen'),
    ('lake_json', lake_data['param_chlorophyll_a'].dropna(), 'orchid'),
]):
    ax.hist(series, bins=50, alpha=0.7, color=color, edgecolor='white')
    ax.set_title(f'{name}\nn={len(series):,}  '
                 f'\u03bc={series.mean():.2f}  \u03c3={series.std():.2f}')
    ax.set_xlabel('Chlorophyll (\u03bcg/L or index)')
    ax.axvline(series.mean(), color='black', linestyle='--', alpha=0.5)

plt.suptitle('Chlorophyll Distribution Across All Data Sources', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig1_chlorophyll_cross_source.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved: figures/fig1_chlorophyll_cross_source.png')

---
## 3. Temporal Coverage Analysis

In [None]:
# ─── 3A: Records per lake per month (heatmaps) ───
fig, axes = plt.subplots(2, 2, figsize=(20, 14))

datasets = [
    ('ML-Ready', ml_data[ml_data['lake'] != 'object'].copy(), 'timestamp', 'lake'),
    ('In-Situ', insitu.copy(), 'timestamp', 'location_name'),
    ('Satellite', satellite.copy(), 'acquisition_date', 'lake_name'),
    ('Lake JSON', lake_data.copy(), 'date', 'lake'),
]

for ax, (title, df_copy, date_col, lake_col) in zip(axes.flat, datasets):
    # Force datetime conversion (handles mixed timezone/format issues)
    df_copy[date_col] = pd.to_datetime(df_copy[date_col], errors='coerce', utc=True)
    df_copy = df_copy.dropna(subset=[date_col])
    # Drop the timezone before to_period, which does not keep tz information
    df_copy['yearmonth'] = df_copy[date_col].dt.tz_localize(None).dt.to_period('M').astype(str)
    pivot = df_copy.groupby([lake_col, 'yearmonth']).size().unstack(fill_value=0)
    if pivot.shape[1] > 30:  # too many columns, sample
        pivot = pivot.iloc[:, ::max(1, pivot.shape[1]//30)]
    sns.heatmap(pivot, ax=ax, cmap='YlOrRd', cbar_kws={'label': 'Records'},
                xticklabels=True, yticklabels=True)
    ax.set_title(f'{title} ({len(df_copy):,} rows)', fontsize=12, fontweight='bold')
    ax.tick_params(axis='x', rotation=90, labelsize=7)
    ax.tick_params(axis='y', labelsize=8)

plt.suptitle('Temporal Coverage: Records per Lake per Month', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig2_temporal_coverage.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved: figures/fig2_temporal_coverage.png')

---
## 4. Feature Distributions & Correlations

In [None]:
# ─── 4A: In-situ feature distributions (richest source) ───
plot_features = ['chlorophyll_a', 'turbidity', 'dissolved_oxygen', 'ph',
                 'temperature', 'conductivity', 'wind_speed', 'air_temperature', 'humidity']

fig, axes = plt.subplots(3, 3, figsize=(18, 14))
for i, col in enumerate(plot_features):
    ax = axes[i // 3, i % 3]
    vals = insitu[col].dropna()
    ax.hist(vals, bins=50, alpha=0.7, color='steelblue', edgecolor='white')
    ax.axvline(vals.mean(), color='red', linestyle='--', alpha=0.7, label=f'mean={vals.mean():.1f}')
    ax.axvline(vals.median(), color='orange', linestyle=':', alpha=0.7, label=f'median={vals.median():.1f}')
    ax.set_title(f'{col}  (n={len(vals):,})', fontweight='bold')
    ax.legend(fontsize=8)

plt.suptitle('In-Situ Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig3_insitu_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# ─── 4B: Correlation matrices ───
fig, axes = plt.subplots(1, 2, figsize=(22, 9))

# In-situ
corr1 = insitu[plot_features].corr()
mask1 = np.triu(np.ones_like(corr1, dtype=bool))
sns.heatmap(corr1, mask=mask1, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            ax=axes[0], vmin=-1, vmax=1, square=True, linewidths=0.5)
axes[0].set_title('In-Situ Feature Correlations', fontweight='bold')

# ML-ready
corr2 = ml_data[feature_cols].corr()
mask2 = np.triu(np.ones_like(corr2, dtype=bool))
sns.heatmap(corr2, mask=mask2, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            ax=axes[1], vmin=-1, vmax=1, square=True, linewidths=0.5)
axes[1].set_title('ML-Ready Feature Correlations', fontweight='bold')

plt.suptitle('Feature Correlation Comparison: In-Situ vs ML-Ready', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig4_correlations.png', dpi=150, bbox_inches='tight')
plt.show()

---
## 5. Bloom Label Analysis
The lake JSON's `habs_indicators` block provides `bloom_probability`, `bloom_status`, `cyanobacteria_density`, and `toxin_levels`; the flattening in 1D prefixes these columns with `hab_`.

In [None]:
# ─── 5A: Bloom indicator statistics ───
print('=' * 70)
print('BLOOM INDICATOR ANALYSIS (from lake JSON monitoring data)')
print('=' * 70)

bloom_cols = ['hab_bloom_probability', 'hab_cyanobacteria_density', 'hab_toxin_levels']
for col in bloom_cols:
    print(f'\n  {col}:')
    print(f'    {lake_data[col].describe().to_dict()}')

print(f'\n  hab_bloom_status:')
print(f'    {lake_data["hab_bloom_status"].value_counts().to_dict()}')

In [None]:
# ─── 5B: Bloom indicator visualizations ───
fig, axes = plt.subplots(2, 3, figsize=(20, 10))

# Bloom probability distribution
lake_data['hab_bloom_probability'].hist(bins=30, ax=axes[0, 0], color='crimson', alpha=0.7, edgecolor='white')
axes[0, 0].axvline(0.5, color='black', linestyle='--', label='threshold=0.5')
axes[0, 0].axvline(0.3, color='gray', linestyle=':', label='threshold=0.3')
axes[0, 0].set_title('Bloom Probability Distribution', fontweight='bold')
axes[0, 0].legend()

# Bloom status counts
status_order = ['None', 'Low', 'Moderate', 'High', 'Critical']
status_counts = lake_data['hab_bloom_status'].value_counts().reindex(status_order).fillna(0)
colors_status = ['#2ecc71', '#f1c40f', '#e67e22', '#e74c3c', '#8e44ad']
status_counts.plot.bar(ax=axes[0, 1], color=colors_status[:len(status_counts)], edgecolor='white')
axes[0, 1].set_title('Bloom Status Counts', fontweight='bold')
axes[0, 1].tick_params(axis='x', rotation=45)

# Cyanobacteria density
lake_data['hab_cyanobacteria_density'].hist(bins=30, ax=axes[0, 2], color='seagreen', alpha=0.7, edgecolor='white')
axes[0, 2].set_title('Cyanobacteria Density', fontweight='bold')

# Bloom probability by lake
lake_data.boxplot(column='hab_bloom_probability', by='lake', ax=axes[1, 0])
axes[1, 0].set_title('Bloom Probability by Lake', fontweight='bold')
axes[1, 0].tick_params(axis='x', rotation=45)
fig.suptitle('')

# Chlorophyll vs bloom probability
axes[1, 1].scatter(lake_data['param_chlorophyll_a'], lake_data['hab_bloom_probability'],
                   alpha=0.3, s=10, c='steelblue')
axes[1, 1].set_xlabel('Chlorophyll-a (\u03bcg/L)')
axes[1, 1].set_ylabel('Bloom Probability')
axes[1, 1].set_title('Chlorophyll-a vs Bloom Probability', fontweight='bold')

# Temperature vs bloom probability
axes[1, 2].scatter(lake_data['param_temperature'], lake_data['hab_bloom_probability'],
                   alpha=0.3, s=10, c='coral')
axes[1, 2].set_xlabel('Temperature (\u00b0C)')
axes[1, 2].set_ylabel('Bloom Probability')
axes[1, 2].set_title('Temperature vs Bloom Probability', fontweight='bold')

plt.suptitle('Bloom Label Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig5_bloom_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# ─── 5C: Compare labeling strategies ───
print('=' * 70)
print('LABELING STRATEGY COMPARISON')
print('=' * 70)

strategies = {
    'bloom_prob >= 0.5': (lake_data['hab_bloom_probability'] >= 0.5).astype(int),
    'bloom_prob >= 0.3': (lake_data['hab_bloom_probability'] >= 0.3).astype(int),
    'status in [High,Critical]': lake_data['hab_bloom_status'].isin(['High', 'Critical']).astype(int),
    'status in [Mod,High,Crit]': lake_data['hab_bloom_status'].isin(['Moderate', 'High', 'Critical']).astype(int),
    'chlorophyll_a > 20': (lake_data['param_chlorophyll_a'] > 20).astype(int),
    'chlorophyll_a > 10': (lake_data['param_chlorophyll_a'] > 10).astype(int),
}

print(f'{"Strategy":<30s} {"Positive":>10s} {"Total":>8s} {"Rate":>8s} {"Imbalance":>12s}')
print('-' * 70)
for name, labels in strategies.items():
    pos = labels.sum()
    total = len(labels)
    ratio = f'1:{(total - pos) // max(pos, 1)}'  # positives : negatives
    print(f'{name:<30s} {pos:>10,} {total:>8,} {pos/total*100:>7.1f}% {ratio:>12s}')
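
The table above compares positive rates, but two strategies with similar rates can still disagree on which rows are positive. A minimal sketch of a pairwise agreement check follows, using raw agreement and Cohen's kappa from scikit-learn. The two label vectors are toy stand-ins for entries of `strategies`, with made-up values.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Toy label vectors standing in for two entries of `strategies` (made-up values)
prob_rule = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])
chl_rule = pd.Series([1, 0, 0, 0, 1, 0, 1, 0])

agreement = (prob_rule == chl_rule).mean()      # raw fraction of matching labels
kappa = cohen_kappa_score(prob_rule, chl_rule)  # agreement corrected for chance
print(f'agreement={agreement:.2f}  kappa={kappa:.2f}')
```

With imbalanced labels, raw agreement is inflated by the many shared negatives, which is why the chance-corrected value is reported alongside it.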

---
## 6. Build Unified Research Dataset
Merge in-situ (features) + satellite (spectral) + lake_json (bloom labels)

In [None]:
# ─── 6A: Prepare in-situ as primary feature source ───
research_insitu = insitu[['location_name', 'timestamp', 'latitude', 'longitude', 'station_id',
                           'chlorophyll_a', 'turbidity', 'dissolved_oxygen', 'ph',
                           'temperature', 'conductivity', 'wind_speed', 'air_temperature',
                           'humidity', 'quality_score', 'data_source']].copy()

# Standardize lake names ("Bodensee - Überlingen" → "Bodensee")
research_insitu['lake'] = research_insitu['location_name'].str.split(' - ').str[0].str.strip()
research_insitu['date'] = research_insitu['timestamp'].dt.normalize()

print(f'In-situ prepared: {len(research_insitu):,} rows')
print(f'Lakes: {research_insitu["lake"].value_counts().to_dict()}')

In [None]:
# ─── 6B: Aggregate to daily per lake ───
daily_insitu = research_insitu.groupby(['lake', 'date']).agg({
    'chlorophyll_a': 'mean',
    'turbidity': 'mean',
    'dissolved_oxygen': 'mean',
    'ph': 'mean',
    'temperature': 'mean',
    'conductivity': 'mean',
    'wind_speed': 'mean',
    'air_temperature': 'mean',
    'humidity': 'mean',
    'latitude': 'first',
    'longitude': 'first',
    'quality_score': 'mean',
}).reset_index()

daily_insitu['date'] = pd.to_datetime(daily_insitu['date'])
print(f'Daily in-situ: {len(daily_insitu):,} rows')
print(f'Per lake: {daily_insitu.groupby("lake").size().to_dict()}')

In [None]:
# ─── 6C: Prepare satellite features (daily per lake) ───
sat_features = satellite[['lake_name', 'acquisition_date', 'ndvi', 'surface_temperature',
                           'chlorophyll_index', 'turbidity_index', 'cloud_coverage',
                           'reflectance_blue', 'reflectance_green', 'reflectance_red',
                           'reflectance_nir']].copy()
sat_features = sat_features.rename(columns={'lake_name': 'lake', 'acquisition_date': 'date'})
sat_features['date'] = sat_features['date'].dt.normalize()

daily_sat = sat_features.groupby(['lake', 'date']).mean(numeric_only=True).reset_index()
print(f'Daily satellite: {len(daily_sat):,} rows')
print(f'Per lake: {daily_sat.groupby("lake").size().to_dict()}')

In [None]:
# ─── 6D: Prepare bloom labels (daily per lake) ───
lake_labels = lake_data[['date', 'lake', 'hab_bloom_probability', 'hab_bloom_status',
                          'hab_cyanobacteria_density', 'hab_toxin_levels']].copy()
lake_labels['date'] = lake_labels['date'].dt.normalize()

daily_labels = lake_labels.groupby(['lake', 'date']).agg({
    'hab_bloom_probability': 'mean',
    'hab_cyanobacteria_density': 'mean',
    'hab_toxin_levels': 'mean',
    'hab_bloom_status': 'first',
}).reset_index()

print(f'Daily labels: {len(daily_labels):,} rows')
print(f'Per lake: {daily_labels.groupby("lake").size().to_dict()}')

In [None]:
# ─── 6E: Merge all three sources ───
# Strategy: in-situ as base → merge_asof satellite (±3 days) → merge_asof labels (±7 days)

research_df = daily_insitu.sort_values(['lake', 'date']).copy()

# Merge satellite
merged_parts = []
for lake_name in research_df['lake'].unique():
    left = research_df[research_df['lake'] == lake_name].copy()
    right = daily_sat[daily_sat['lake'] == lake_name].sort_values('date').copy()
    if len(right) > 0:
        merged = pd.merge_asof(left.sort_values('date'), right.drop(columns=['lake']),
                               on='date', tolerance=pd.Timedelta('3D'), direction='nearest')
    else:
        merged = left
    merged_parts.append(merged)
research_df = pd.concat(merged_parts, ignore_index=True)

# Merge bloom labels
merged_parts2 = []
for lake_name in research_df['lake'].unique():
    left = research_df[research_df['lake'] == lake_name].sort_values('date').copy()
    right = daily_labels[daily_labels['lake'] == lake_name].sort_values('date').copy()
    if len(right) > 0:
        merged = pd.merge_asof(left, right.drop(columns=['lake']),
                               on='date', tolerance=pd.Timedelta('7D'), direction='nearest')
    else:
        merged = left
    merged_parts2.append(merged)
research_df = pd.concat(merged_parts2, ignore_index=True)

print(f'\n{"=" * 70}')
print(f'UNIFIED RESEARCH DATASET')
print(f'{"=" * 70}')
print(f'Shape: {research_df.shape}')
print(f'Date range: {research_df["date"].min()} → {research_df["date"].max()}')
print(f'Lakes: {research_df["lake"].value_counts().to_dict()}')
print(f'\nColumns ({len(research_df.columns)}): {list(research_df.columns)}')
print(f'\nNull counts (top 10):')
display(research_df.isnull().sum().sort_values(ascending=False).head(10))
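
The per-lake loops in 6E can likely be collapsed into a single call, since `pd.merge_asof` accepts a `by` argument that restricts matches to rows sharing the same key. The sketch below shows this on toy frames with made-up values; column names mirror `daily_insitu` and `daily_sat`, but nothing here is read from the project data.

```python
import pandas as pd

# Toy frames (made-up values) with the same keys as daily_insitu / daily_sat
left = pd.DataFrame({
    'lake': ['A', 'B', 'A'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10']),
    'chlorophyll_a': [5.0, 7.0, 9.0],
})
right = pd.DataFrame({
    'lake': ['A', 'B'],
    'date': pd.to_datetime(['2024-01-02', '2024-01-20']),
    'ndvi': [0.3, 0.5],
})

# merge_asof needs both frames sorted by the `on` key; `by` limits
# candidate matches to rows of the same lake.
merged = pd.merge_asof(
    left.sort_values('date'), right.sort_values('date'),
    on='date', by='lake',
    tolerance=pd.Timedelta('3D'), direction='nearest',
)
print(merged)
```

Only the first lake-A row finds a satellite record within 3 days; the other two rows keep `NaN` for `ndvi`. Note that the result is ordered by date rather than grouped by lake, and that `direction='nearest'` can pull in a record dated after the in-situ sample.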

---
## 7. Create Bloom Labels

In [None]:
# ─── 7A: Multi-criteria bloom label ───
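def create_bloom_label(row, threshold=0.5):
    """Priority: bloom_probability → chlorophyll+temp threshold → 0.

    Note: the fallback rule reads chlorophyll_a and temperature, which are also
    model inputs in 9A, so rows labeled by it are trivially predictable.
    """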
    if pd.notna(row.get('hab_bloom_probability')):
        return int(row['hab_bloom_probability'] >= threshold)
    chl = row.get('chlorophyll_a', np.nan)
    temp = row.get('temperature', np.nan)
    if pd.notna(chl):
        if chl > 20: return 1
        if chl > 10 and pd.notna(temp) and temp > 18: return 1
    return 0

research_df['bloom_label'] = research_df.apply(create_bloom_label, axis=1)

# Continuous risk score for regression
research_df['bloom_risk'] = research_df['hab_bloom_probability'].fillna(
    research_df['chlorophyll_a'].clip(0, 50) / 50
)

pos = research_df['bloom_label'].sum()
total = len(research_df)
print(f'Binary label: {pos} positive / {total} total ({pos/total*100:.1f}%)')
print(f'Continuous risk: mean={research_df["bloom_risk"].mean():.3f}, std={research_df["bloom_risk"].std():.3f}')
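
Because the fallback branch of `create_bloom_label` is computed from `chlorophyll_a` and `temperature`, it may help to record which branch produced each label so those rows can be reported or excluded separately. The sketch below is a vectorized restatement of the same rule with an extra `label_source` column. `label_with_source` and `label_source` are hypothetical names, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

def label_with_source(df, threshold=0.5):
    """Vectorized form of the 7A rule that also records which branch fired."""
    has_prob = df['hab_bloom_probability'].notna()
    prob_label = df['hab_bloom_probability'] >= threshold
    chl, temp = df['chlorophyll_a'], df['temperature']
    # Comparisons against NaN evaluate to False, matching the pd.notna guards
    fallback = (chl > 20) | ((chl > 10) & (temp > 18))
    out = df.copy()
    out['bloom_label'] = np.where(has_prob, prob_label, fallback).astype(int)
    out['label_source'] = np.where(has_prob, 'bloom_probability', 'chl_temp_rule')
    return out

# Toy rows (made-up values) covering each branch of the rule
toy = pd.DataFrame({
    'hab_bloom_probability': [0.7, 0.2, np.nan, np.nan, np.nan, np.nan],
    'chlorophyll_a':         [5.0, 30.0, 25.0, 12.0, 12.0, np.nan],
    'temperature':           [15.0, 22.0, 10.0, 20.0, 10.0, 25.0],
})
labeled = label_with_source(toy)
print(labeled[['bloom_label', 'label_source']])
```

A `value_counts()` on `label_source` then shows how much of the dataset depends on the threshold fallback.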

In [None]:
# ─── 7B: Label quality visualization ───
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

research_df['bloom_label'].value_counts().sort_index().plot.bar(
    ax=axes[0], color=['steelblue', 'crimson'], edgecolor='white')
axes[0].set_title('Binary Label Balance', fontweight='bold')
axes[0].set_xticklabels(['No Bloom (0)', 'Bloom (1)'], rotation=0)
for i, v in enumerate(research_df['bloom_label'].value_counts().sort_index()):
    axes[0].text(i, v + 10, str(v), ha='center', fontweight='bold')

research_df['bloom_risk'].hist(bins=30, ax=axes[1], color='orchid', alpha=0.7, edgecolor='white')
axes[1].axvline(0.5, color='black', linestyle='--', label='threshold=0.5')
axes[1].set_title('Continuous Bloom Risk', fontweight='bold')
axes[1].legend()

bloom_by_lake = research_df.groupby('lake')['bloom_label'].mean().sort_values(ascending=False)
bloom_by_lake.plot.bar(ax=axes[2], color='seagreen', edgecolor='white')
axes[2].set_title('Bloom Rate by Lake', fontweight='bold')
axes[2].set_ylabel('Fraction with bloom=1')
axes[2].tick_params(axis='x', rotation=45)

plt.suptitle('Label Quality Assessment', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig6_label_quality.png', dpi=150, bbox_inches='tight')
plt.show()

---
## 8. Temporal Train / Val / Test Split
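**Temporal split, not random.** The first 70% of the date range is train, the next 15% is val, and the last 15% is test. The cut points are fractions of the time span, so row counts per split can differ from 70/15/15 when sampling density changes over time.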

In [None]:
# ─── 8A: Split ───
research_df = research_df.sort_values('date').reset_index(drop=True)

date_min = research_df['date'].min()
date_max = research_df['date'].max()
date_range = (date_max - date_min).days

train_end = date_min + pd.Timedelta(days=int(date_range * 0.70))
val_end   = date_min + pd.Timedelta(days=int(date_range * 0.85))

train = research_df[research_df['date'] <= train_end].copy()
val   = research_df[(research_df['date'] > train_end) & (research_df['date'] <= val_end)].copy()
test  = research_df[research_df['date'] > val_end].copy()

print(f'Date range: {date_min.date()} → {date_max.date()} ({date_range} days)')
print(f'\n{"Split":<8s} {"Rows":>8s} {"From":>14s} {"To":>14s} {"Bloom %":>10s}')
print('-' * 60)
for name, df in [('Train', train), ('Val', val), ('Test', test)]:
    print(f'{name:<8s} {len(df):>8,} {str(df["date"].min().date()):>14s} '
          f'{str(df["date"].max().date()):>14s} {df["bloom_label"].mean()*100:>9.1f}%')
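
Cutting at fractions of the calendar span, as in 8A, gives uneven row counts when sampling density varies. The sketch below contrasts that with an alternative that places the cuts at the dates of the 70th and 85th percentile rows. This alternative is an assumption for comparison, not the notebook's method, and the toy dates are made up.

```python
import pandas as pd

# Toy dates (made up): 20 sparse early samples, then 80 dense daily samples
sparse = pd.date_range('2023-01-01', periods=20, freq='10D')
dense = pd.date_range(sparse[-1] + pd.Timedelta(days=10), periods=80, freq='D')
df = pd.DataFrame({'date': sparse.append(dense)})

# Cut at 70% of the calendar span, as in 8A
span_days = (df['date'].max() - df['date'].min()).days
time_cut = df['date'].min() + pd.Timedelta(days=int(span_days * 0.70))
n_train_time = int((df['date'] <= time_cut).sum())

# Alternative: cut at the dates of the 70th and 85th percentile rows
dates = df['date'].sort_values().reset_index(drop=True)
train_end = dates.iloc[round(len(dates) * 0.70) - 1]
val_end = dates.iloc[round(len(dates) * 0.85) - 1]
n_train = int((df['date'] <= train_end).sum())
n_val = int(((df['date'] > train_end) & (df['date'] <= val_end)).sum())
n_test = int((df['date'] > val_end).sum())

print(f'time-based train rows: {n_train_time}')
print(f'row-based train/val/test rows: {n_train}/{n_val}/{n_test}')
```

Both variants remain strictly temporal. When several rows share the cut date, the `<=` comparison keeps them in the same split, so row-based counts can drift slightly from the target fractions.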

In [None]:
# ─── 8B: Visualize split ───
fig, ax = plt.subplots(figsize=(18, 4))

for i, (name, df, color) in enumerate([
    ('Train', train, 'steelblue'), ('Val', val, 'orange'), ('Test', test, 'crimson')
]):
    ax.scatter(df['date'], [i]*len(df), alpha=0.4, s=8, c=color, label=f'{name} ({len(df):,})')

ax.axvline(train_end, color='black', linestyle='--', alpha=0.5, label=f'train_end={train_end.date()}')
ax.axvline(val_end, color='black', linestyle=':', alpha=0.5, label=f'val_end={val_end.date()}')
ax.set_yticks([0, 1, 2])
ax.set_yticklabels(['Train', 'Val', 'Test'])
ax.set_title('Temporal Train / Val / Test Split', fontweight='bold')
ax.legend(loc='upper left')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig7_temporal_split.png', dpi=150, bbox_inches='tight')
plt.show()

---
## 9. Monolithic Baselines (RQ1)
Train GradientBoosting and RandomForest on all available features; the ensemble is the unweighted mean of their predicted probabilities.

In [None]:
# ─── 9A: Define feature sets ───
INSITU_FEATURES = ['chlorophyll_a', 'turbidity', 'dissolved_oxygen', 'ph',
                   'temperature', 'conductivity', 'wind_speed', 'air_temperature', 'humidity']

SATELLITE_FEATURES = ['ndvi', 'surface_temperature', 'chlorophyll_index',
                      'turbidity_index', 'cloud_coverage']

ALL_FEATURES = INSITU_FEATURES + SATELLITE_FEATURES
TARGET = 'bloom_label'

# Use only features that exist in the merged dataset
available_features = [f for f in ALL_FEATURES if f in research_df.columns]
missing_features = [f for f in ALL_FEATURES if f not in research_df.columns]
print(f'Available features ({len(available_features)}): {available_features}')
if missing_features:
    print(f'Missing features ({len(missing_features)}): {missing_features}')

In [None]:
# ─── 9B: Prepare X, y ───
def prepare_split(df, features, target):
    X = df[features].copy()
    y = df[target].copy()
    return X, y

X_train, y_train = prepare_split(train, available_features, TARGET)
X_val, y_val = prepare_split(val, available_features, TARGET)
X_test, y_test = prepare_split(test, available_features, TARGET)

# Impute missing values with median, then standardize
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

X_train_proc = scaler.fit_transform(imputer.fit_transform(X_train))
X_val_proc = scaler.transform(imputer.transform(X_val))
X_test_proc = scaler.transform(imputer.transform(X_test))

print(f'Train: X={X_train_proc.shape}, y={y_train.shape} (bloom={y_train.mean():.3f})')
print(f'Val:   X={X_val_proc.shape}, y={y_val.shape} (bloom={y_val.mean():.3f})')
print(f'Test:  X={X_test_proc.shape}, y={y_test.shape} (bloom={y_test.mean():.3f})')
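
The impute, scale, and fit steps in 9B and 9C could also be bundled into a scikit-learn pipeline, so the fitted imputer and scaler travel with the model when it is saved or reused. This is a minimal sketch on randomly generated toy data, with a smaller `n_estimators` than the notebook uses to keep it quick.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with missing values (random, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan   # knock out roughly 10% of entries

pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    GradientBoostingClassifier(n_estimators=50, random_state=42),
)
pipe.fit(X[:150], y[:150])               # medians and scales come from training rows only
proba = pipe.predict_proba(X[150:])[:, 1]
```

Tree ensembles are insensitive to monotone feature scaling, so the scaler mainly matters if a scale-sensitive model is added to the comparison later.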

In [None]:
# ─── 9C: Train models ───
print('Training models...')

models = OrderedDict()

models['GradientBoosting'] = GradientBoostingClassifier(
    n_estimators=200, max_depth=5, learning_rate=0.1,
    subsample=0.8, min_samples_leaf=10, random_state=42
)
models['RandomForest'] = RandomForestClassifier(
    n_estimators=200, max_depth=10, min_samples_leaf=5,
    random_state=42, n_jobs=-1
)

for name, model in models.items():
    t0 = datetime.now()
    model.fit(X_train_proc, y_train)
    elapsed = (datetime.now() - t0).total_seconds()
    print(f'  {name}: trained in {elapsed:.1f}s')

print('Done.')

In [None]:
# ─── 9D: Evaluate all models ───
def evaluate(name, y_true, y_prob):
    y_pred = (y_prob >= 0.5).astype(int)
    n_classes = len(np.unique(y_true))
    return {
        'model': name,
        'AUROC': roc_auc_score(y_true, y_prob) if n_classes > 1 else np.nan,
        'AUPRC': average_precision_score(y_true, y_prob) if n_classes > 1 else np.nan,
        'Brier': brier_score_loss(y_true, y_prob),
        'Accuracy': (y_pred == y_true).mean(),
        'TP': ((y_pred == 1) & (y_true == 1)).sum(),
        'FP': ((y_pred == 1) & (y_true == 0)).sum(),
        'FN': ((y_pred == 0) & (y_true == 1)).sum(),
        'TN': ((y_pred == 0) & (y_true == 0)).sum(),
    }

# Validation results
print('VALIDATION SET RESULTS')
print('=' * 80)
val_results = []
ens_prob_val = np.zeros(len(y_val))
ens_prob_test = np.zeros(len(y_test))

for name, model in models.items():
    prob_val = model.predict_proba(X_val_proc)[:, 1]
    prob_test = model.predict_proba(X_test_proc)[:, 1]
    ens_prob_val += prob_val / len(models)
    ens_prob_test += prob_test / len(models)
    val_results.append(evaluate(name, y_val, prob_val))

val_results.append(evaluate('Ensemble (GB+RF)', y_val, ens_prob_val))
val_df = pd.DataFrame(val_results)
display(val_df)

# Test results
print('\nTEST SET RESULTS (Final)')
print('=' * 80)
test_results = []
for name, model in models.items():
    prob = model.predict_proba(X_test_proc)[:, 1]
    test_results.append(evaluate(name, y_test, prob))
test_results.append(evaluate('Ensemble (GB+RF)', y_test, ens_prob_test))
test_df = pd.DataFrame(test_results)
display(test_df)
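
The metric list at the top of the notebook includes ECE, which `evaluate` above does not compute. A minimal sketch of expected calibration error with equal-width bins follows; `expected_calibration_error` is a hypothetical helper, and the choice of 10 bins is an assumption that mirrors the `n_bins=10` used for the calibration plot.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed positive rate and mean predicted
    probability, over equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1 (1.0 lands in the last bin)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the share of samples in the bin
    return ece

# Toy checks (made-up values)
ece_perfect = expected_calibration_error([0, 0, 1, 1], [0.0, 0.0, 1.0, 1.0])
ece_underconfident = expected_calibration_error([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

It could be added to the dictionary returned by `evaluate` as `'ECE': expected_calibration_error(y_true, y_prob)`.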

In [None]:
# ─── 9E: ROC, PR curves, Feature Importance, Calibration ───
fig, axes = plt.subplots(2, 2, figsize=(16, 14))

# ROC
ax = axes[0, 0]
for name, model in models.items():
    prob = model.predict_proba(X_test_proc)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    auc = roc_auc_score(y_test, prob)
    ax.plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})', linewidth=2)
fpr_e, tpr_e, _ = roc_curve(y_test, ens_prob_test)
auc_e = roc_auc_score(y_test, ens_prob_test)
ax.plot(fpr_e, tpr_e, '--', label=f'Ensemble (AUC={auc_e:.3f})', linewidth=2)
ax.plot([0, 1], [0, 1], 'k:', alpha=0.3)
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve (Test Set)', fontweight='bold')
ax.legend()

# PR
ax = axes[0, 1]
for name, model in models.items():
    prob = model.predict_proba(X_test_proc)[:, 1]
    prec, rec, _ = precision_recall_curve(y_test, prob)
    ap = average_precision_score(y_test, prob)
    ax.plot(rec, prec, label=f'{name} (AP={ap:.3f})', linewidth=2)
prec_e, rec_e, _ = precision_recall_curve(y_test, ens_prob_test)
ap_e = average_precision_score(y_test, ens_prob_test)
ax.plot(rec_e, prec_e, '--', label=f'Ensemble (AP={ap_e:.3f})', linewidth=2)
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curve (Test Set)', fontweight='bold')
ax.legend()

# Feature importance
ax = axes[1, 0]
gb_imp = models['GradientBoosting'].feature_importances_
rf_imp = models['RandomForest'].feature_importances_
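# Order features by GradientBoosting importance (descending); RandomForest bars follow the same order
idx = np.argsort(gb_imp)[::-1]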
x_pos = np.arange(len(available_features))
ax.barh(x_pos - 0.2, gb_imp[idx], 0.4, label='GradientBoosting', color='steelblue', alpha=0.8)
ax.barh(x_pos + 0.2, rf_imp[idx], 0.4, label='RandomForest', color='coral', alpha=0.8)
ax.set_yticks(x_pos)
ax.set_yticklabels([available_features[i] for i in idx])
ax.set_xlabel('Feature Importance')
ax.set_title('Feature Importance Comparison', fontweight='bold')
ax.legend()
ax.invert_yaxis()

# Calibration
ax = axes[1, 1]
for name, prob in [('GB', models['GradientBoosting'].predict_proba(X_test_proc)[:, 1]),
                   ('RF', models['RandomForest'].predict_proba(X_test_proc)[:, 1]),
                   ('Ensemble', ens_prob_test)]:
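    try:
        fraction_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10, strategy='uniform')
        ax.plot(mean_pred, fraction_pos, 's-', label=name, linewidth=2, markersize=6)
    except ValueError as err:
        # Report the failure instead of silently dropping the curve from the plot
        print(f'  Calibration curve skipped for {name}: {err}')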
ax.plot([0, 1], [0, 1], 'k:', alpha=0.3, label='Perfect calibration')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title('Calibration Plot (Test Set)', fontweight='bold')
ax.legend()

plt.suptitle('RQ1 Monolithic Baseline — Full Evaluation', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.savefig(RESULTS_DIR / 'figures' / 'fig8_baseline_evaluation.png', dpi=150, bbox_inches='tight')
plt.show()
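
The calibration panel above can be summarized as one number, the expected calibration error (ECE). The following is a minimal sketch, assuming equal-width bins on the positive-class probability (the same binning as `calibration_curve(..., strategy='uniform')`). The function name `expected_calibration_error` is hypothetical, and the notebook's own `evaluate` helper may compute ECE differently.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE: sample-weighted mean of |observed positive rate - mean predicted probability|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each probability to an equal-width bin; a probability of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight the gap by the fraction of samples in this bin
    return float(ece)

# Toy inputs for illustration only (not data from this notebook).
toy_y = [0, 0, 1, 1, 1]
toy_p = [0.1, 0.4, 0.35, 0.8, 0.9]
print(f'Toy ECE: {expected_calibration_error(toy_y, toy_p):.3f}')
```

In this notebook the call would look like `expected_calibration_error(y_test, ens_prob_test)`. Lower is better, and 0 means the mean predicted probability matches the observed bloom rate in every populated bin.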

In [None]:
# ─── 9F: Modality ablation (quick preview for RQ1) ───
# What happens when we remove satellite OR in-situ features?
print('=' * 70)
print('MODALITY ABLATION (preview for RQ1 - graceful degradation)')
print('=' * 70)

ablation_results = []

feature_sets = {
    'All Features': available_features,
    'In-Situ Only': [f for f in INSITU_FEATURES if f in available_features],
    'Satellite Only': [f for f in SATELLITE_FEATURES if f in available_features],
}

for set_name, feats in feature_sets.items():
    if len(feats) == 0:
        print(f'  {set_name}: NO FEATURES AVAILABLE, skipping')
        continue
    
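    # Fit the imputer and scaler on the training split only, then apply them to the test split (no leakage)
    imp = SimpleImputer(strategy='median')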
    sc = StandardScaler()
    X_tr = sc.fit_transform(imp.fit_transform(train[feats]))
    X_te = sc.transform(imp.transform(test[feats]))
    
    gb_abl = GradientBoostingClassifier(n_estimators=200, max_depth=5, learning_rate=0.1,
                                        subsample=0.8, random_state=42)
    gb_abl.fit(X_tr, y_train)
    prob = gb_abl.predict_proba(X_te)[:, 1]
    
    m = evaluate(set_name, y_test, prob)
    m['n_features'] = len(feats)
    m['features'] = ', '.join(feats)
    ablation_results.append(m)

ablation_df = pd.DataFrame(ablation_results)[['model', 'n_features', 'AUROC', 'AUPRC', 'Brier', 'Accuracy']]
display(ablation_df)

print('\n→ Compare each single-modality row against "All Features" to see how much each modality contributes.')
print('  In RQ1, we compare this monolithic ablation to the agentic architecture.')
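
One way to read the ablation table is as the metric drop of each single-modality model relative to the full feature set. The sketch below assumes records shaped like `ablation_df.to_dict(orient='records')`, with a `model` column and a metric column. The helper name `metric_drop_vs_full` is hypothetical, and the numbers are placeholders, not results from this notebook.

```python
def metric_drop_vs_full(records, metric='AUROC', full_name='All Features'):
    """Return {feature_set: full_score - score} for every ablated feature set."""
    scores = {r['model']: r[metric] for r in records}
    if full_name not in scores:
        raise KeyError(f'No row named {full_name!r} in the ablation records')
    full = scores[full_name]
    return {name: full - score for name, score in scores.items() if name != full_name}

# Placeholder values for illustration only; NOT results from this notebook.
toy_records = [
    {'model': 'All Features', 'AUROC': 0.90},
    {'model': 'In-Situ Only', 'AUROC': 0.85},
    {'model': 'Satellite Only', 'AUROC': 0.70},
]
for name, drop in metric_drop_vs_full(toy_records).items():
    print(f'{name}: AUROC drop vs. all features = {drop:.3f}')
```

With the real table, `metric_drop_vs_full(ablation_df.to_dict(orient='records'))` gives the same view. A larger drop means the removed modality carried more of the signal.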

---
## 10. Save Everything & Export ZIP

In [None]:
# ─── 10A: Save research dataset ───
research_df.to_parquet(RESULTS_DIR / 'data' / 'unified_research_dataset.parquet', index=False)
train.to_parquet(RESULTS_DIR / 'data' / 'train.parquet', index=False)
val.to_parquet(RESULTS_DIR / 'data' / 'val.parquet', index=False)
test.to_parquet(RESULTS_DIR / 'data' / 'test.parquet', index=False)

# Save models
for name, model in models.items():
    with open(RESULTS_DIR / 'models' / f'baseline_{name.lower()}.pkl', 'wb') as f:
        pickle.dump(model, f)
with open(RESULTS_DIR / 'models' / 'scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
with open(RESULTS_DIR / 'models' / 'imputer.pkl', 'wb') as f:
    pickle.dump(imputer, f)

# Save results tables
val_df.to_csv(RESULTS_DIR / 'val_results.csv', index=False)
test_df.to_csv(RESULTS_DIR / 'test_results.csv', index=False)
ablation_df.to_csv(RESULTS_DIR / 'ablation_results.csv', index=False)
audit_df.to_csv(RESULTS_DIR / 'data_audit.csv', index=False)

# Save experiment metadata
experiment_meta = {
    'experiment': 'RQ1_Phase1_Monolithic_Baselines',
    'date': datetime.now().isoformat(),
    'dataset': {
        'total_records': len(research_df),
        'train': {'records': len(train), 'date_range': [str(train['date'].min()), str(train['date'].max())], 'bloom_rate': float(y_train.mean())},
        'val': {'records': len(val), 'date_range': [str(val['date'].min()), str(val['date'].max())], 'bloom_rate': float(y_val.mean())},
        'test': {'records': len(test), 'date_range': [str(test['date'].min()), str(test['date'].max())], 'bloom_rate': float(y_test.mean())},
        'lakes': research_df['lake'].unique().tolist(),
        'n_lakes': int(research_df['lake'].nunique()),
        'features_used': available_features,
        'n_features': len(available_features),
    },
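    'models': {
        # Read hyperparameters from the fitted estimators so the metadata matches what was actually trained
        name: {k: v for k, v in model.get_params().items()
               if k in ('n_estimators', 'max_depth', 'learning_rate')}
        for name, model in models.items()
    },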
    'test_results': test_df.to_dict(orient='records'),
    'ablation_results': ablation_df.to_dict(orient='records'),
}

with open(RESULTS_DIR / 'experiment_metadata.json', 'w') as f:
    json.dump(experiment_meta, f, indent=2, default=str)

print('All results saved to results/ directory.')

In [None]:
# ─── 10B: Create ZIP of all results ───
zip_path = 'swim_research_phase1_results.zip'

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
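    # Store paths relative to the parent of RESULTS_DIR so archive entries start with the
    # results folder name, regardless of the current working directory
    for root, _dirs, files in os.walk(RESULTS_DIR):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, RESULTS_DIR.parent)
            zf.write(file_path, arcname)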

zip_size_mb = os.path.getsize(zip_path) / (1024 * 1024)
print(f'\nResults ZIP created: {zip_path} ({zip_size_mb:.1f} MB)')
print('\nContents:')
with zipfile.ZipFile(zip_path, 'r') as zf:
    for info in zf.infolist():
        print(f'  {info.filename:<55s} {info.file_size/1024:>8.1f} KB')
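
Before sharing the archive, it can be worth checking that it is readable and contains the expected files. This is a minimal sketch using only the standard library. The helper name `verify_zip` and the demo file names are assumptions for illustration.

```python
import os
import tempfile
import zipfile

def verify_zip(zip_path, required=()):
    """Return (first corrupt member or None, sorted list of required names missing from the archive)."""
    with zipfile.ZipFile(zip_path, 'r') as zf:
        corrupt = zf.testzip()  # None when every member passes its CRC check
        names = set(zf.namelist())
    missing = sorted(n for n in required if n not in names)
    return corrupt, missing

# Demo on a throwaway archive so the sketch runs without the real results.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('results/test_results.csv', 'model,AUROC\n')
    demo_result = verify_zip(
        demo_zip, required=['results/test_results.csv', 'results/val_results.csv'])
print(demo_result)  # (None, ['results/val_results.csv'])
```

For the real archive, pass `zip_path` and the entry names printed in the contents listing above.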

In [None]:
# ─── 10C: Download ZIP (Colab only) ───
if IN_COLAB:
    from google.colab import files
    files.download(zip_path)
    print('Download started!')
else:
    print(f'Results zip ready at: {os.path.abspath(zip_path)}')
    print('Copy it from your Jupyter file browser.')

---
## Summary & Next Steps

### What we did:
1. Loaded & audited 4 data sources (~110K total records)
2. Built unified research dataset (in-situ + satellite + bloom labels)
3. Created multi-criteria bloom labels
4. Temporal train/val/test split
5. Trained monolithic baselines (GB, RF, Ensemble)
6. Evaluated: AUROC, AUPRC, Brier, Accuracy, Calibration
7. Modality ablation (in-situ only vs satellite only vs all)
8. Exported everything as ZIP

### Results saved in ZIP:
```
results/
  figures/          (8 publication-quality plots)
  models/           (GB, RF, scaler, imputer pickles)
  data/             (unified dataset + train/val/test parquets)
  val_results.csv
  test_results.csv
  ablation_results.csv
  data_audit.csv
  experiment_metadata.json
```
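
A later notebook can reload the pickled artifacts from `results/models/`. The sketch below assumes the file names produced by the save cell (`baseline_{name.lower()}.pkl`, `scaler.pkl`, `imputer.pkl`). The helper `load_artifacts` is hypothetical, and the demo uses plain strings as stand-ins so it runs without the real models.

```python
import pickle
import tempfile
from pathlib import Path

# File names as written by the save cell; adjust if the model keys differ.
ARTIFACT_FILES = {
    'imputer': 'imputer.pkl',
    'scaler': 'scaler.pkl',
    'GradientBoosting': 'baseline_gradientboosting.pkl',
    'RandomForest': 'baseline_randomforest.pkl',
}

def load_artifacts(models_dir):
    """Unpickle every expected artifact in models_dir; raise if one is missing."""
    models_dir = Path(models_dir)
    loaded = {}
    for key, filename in ARTIFACT_FILES.items():
        path = models_dir / filename
        if not path.exists():
            raise FileNotFoundError(f'Missing artifact: {path}')
        with open(path, 'rb') as f:
            loaded[key] = pickle.load(f)
    return loaded

# Demo with stand-in objects (plain strings), not real estimators.
with tempfile.TemporaryDirectory() as tmp:
    for key, filename in ARTIFACT_FILES.items():
        with open(Path(tmp) / filename, 'wb') as f:
            pickle.dump(f'stand-in for {key}', f)
    demo = load_artifacts(tmp)
print(sorted(demo))  # ['GradientBoosting', 'RandomForest', 'imputer', 'scaler']
```

With the real artifacts, new rows would presumably go through the imputer and then the scaler before `predict_proba`, mirroring the order used in the ablation cell. Only unpickle files from a trusted source, ideally with the same scikit-learn version that wrote them.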

### Next notebooks:
- **02**: Deep learning baselines (LSTM, Transformer) — needs GPU
- **03**: Agentic model comparison (run same data through SWIM agents)
- **04**: Communication protocol experiments (RQ2)
- **05**: Conflict resolution & fusion experiments (RQ3)