# EDON CAV Dataset Builder

This notebook builds the Context-Aware Vectors (CAV) dataset by:
1. Loading/extracting physiological features (WESAD)
2. Fetching environmental data from APIs
3. Computing 128-dimensional embeddings using PCA
4. Saving to JSON format
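
Step 3 can be sketched with plain NumPy. This is a minimal illustration, not the code in `src.pipeline`: the function name `pca_embed`, the random input matrix, and its 200-column width are assumptions. It centers the features, projects them onto the top 128 principal axes obtained from an SVD, and L2-normalizes each row, which is the property Step 4 of this notebook checks.

```python
import numpy as np

def pca_embed(features: np.ndarray, n_components: int = 128) -> np.ndarray:
    """Project rows onto the top principal components, then L2-normalize each row."""
    # PCA operates on mean-centered data
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Guard against division by zero for an all-zero row
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.clip(norms, 1e-12, None)

# Hypothetical input: 500 records with 200 raw features each
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 200))
cav = pca_embed(raw)
print(cav.shape)  # (500, 128)
```

Note that this only yields 128 components when the input has at least 128 features and at least 128 rows; how the actual pipeline reaches 128 dimensions is not shown in this notebook.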

## Data Sources
- **WESAD**: Physiological signals (HR, HRV, EDA, Respiration, Accelerometer)
- **OpenWeatherMap**: Temperature, humidity, cloud coverage
- **AirNow (EPA)**: Air quality (AQI, PM2.5, Ozone)
- **WorldTimeAPI**: Circadian rhythm (hour, daylight)
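
As a rough illustration of how an API response might become environmental features, the sketch below flattens an OpenWeatherMap current-weather payload. The `main.temp`, `main.humidity`, and `clouds.all` fields belong to that API's JSON response; the function name `env_features`, the output keys other than `temp_c`, and the sample payload are assumptions made for illustration and are not taken from `src.pipeline`.

```python
from datetime import datetime, timezone

def env_features(weather: dict, when: datetime) -> dict:
    """Flatten an OpenWeatherMap current-weather payload into simple features."""
    return {
        "temp_c": weather["main"]["temp"],        # Celsius if requested with units=metric
        "humidity": weather["main"]["humidity"],  # relative humidity, percent
        "cloud_pct": weather["clouds"]["all"],    # cloud coverage, percent
        "hour": when.hour,                        # hour of day as a circadian feature
    }

# Made-up payload for illustration only, not real measurements
sample = {"main": {"temp": 21.5, "humidity": 60}, "clouds": {"all": 40}}
features = env_features(sample, datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc))
print(features)
```

Keeping the parsing separate from the HTTP call makes it easy to test against saved payloads without network access or an API key.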


In [None]:
import sys
import os
sys.path.insert(0, os.path.abspath('../'))

import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from src.pipeline import build_cav_dataset

print("✓ Imports successful")


## Step 1: Load WESAD Data (Optional)

If you have the WESAD dataset files, load them here. Otherwise, the pipeline generates synthetic physiological data.


In [None]:
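# Load WESAD data if available.
# Uncomment and modify the path if you have the WESAD dataset.
# The public WESAD release stores each subject as a pickle file (e.g. S2/S2.pkl)
# written with Python 2, hence encoding="latin1".
# import pickle
# wesad_path = "path/to/WESAD/S2/S2.pkl"
# with open(wesad_path, "rb") as f:
#     wesad_data = pickle.load(f, encoding="latin1")
# print("✓ WESAD data loaded")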

# For demo, we'll use None (synthetic data will be generated)
wesad_data = None
print("Using synthetic physiological data (WESAD not loaded)")


## Step 2: Build CAV Dataset

Generate 10,000 CAV records (`n_samples=10000`), each with a 128-dimensional embedding, and write them to `../data/edon_cav.json`.


In [None]:
# Build dataset
df = build_cav_dataset(
    n_samples=10000,
    output_path="../data/edon_cav.json",
    wesad_data=wesad_data,
    lat=40.7128,  # NYC coordinates
    lon=-74.0060,
    model_dir="../models"
)


## Step 3: Explore Generated Data


In [None]:
# Load and inspect the generated data
with open('../data/edon_cav.json', 'r') as f:
    data = json.load(f)

print(f"Total records: {len(data)}")
print("\nSample record:")
print(json.dumps(data[0], indent=2))

# Verify embedding dimensions
print(f"\n✓ Embedding dimension: {len(data[0]['cav128'])}")
print(f"✓ All embeddings are 128-D: {all(len(r['cav128']) == 128 for r in data)}")


In [None]:
# Statistics
df_stats = pd.DataFrame([
    {
        'hr': r['bio']['hr'],
        'hrv_rmssd': r['bio']['hrv_rmssd'],
        'eda_mean': r['bio']['eda_mean'],
        'temp_c': r['env']['temp_c'],
        'aqi': r['env']['aqi'],
        'activity': r['activity']
    }
    for r in data
])

print("Feature Statistics:")
print(df_stats.describe())
print("\nActivity Distribution:")
print(df_stats['activity'].value_counts())
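
The cells above index directly into each record, so one malformed record raises a `KeyError`. A small validator like the following sketch can report problems first. It checks only the fields this notebook reads (`bio.hr`, `bio.hrv_rmssd`, `bio.eda_mean`, `env.temp_c`, `env.aqi`, `activity`, `cav128`); the helper name `validate_record` and the example record are hypothetical, and the real schema may contain more fields.

```python
from typing import List

def validate_record(record: dict, dim: int = 128) -> List[str]:
    """Return a list of problems found in one CAV record (empty if none)."""
    problems = []
    # Top-level keys read by this notebook
    for key in ("bio", "env", "activity", "cav128"):
        if key not in record:
            problems.append(f"missing key: {key}")
    # Nested fields used in the statistics cell
    nested = (("bio", ("hr", "hrv_rmssd", "eda_mean")), ("env", ("temp_c", "aqi")))
    for group, fields in nested:
        for field in fields:
            if field not in record.get(group, {}):
                problems.append(f"missing field: {group}.{field}")
    # Embedding length
    if len(record.get("cav128", [])) != dim:
        problems.append(f"cav128 does not have {dim} values")
    return problems

# Made-up record for illustration only
example = {
    "bio": {"hr": 70, "hrv_rmssd": 40, "eda_mean": 0.5},
    "env": {"temp_c": 20, "aqi": 50},
    "activity": "rest",
    "cav128": [0.0] * 128,
}
print(validate_record(example))  # []
```

With the loaded dataset, `[i for i, r in enumerate(data) if validate_record(r)]` would list the indices of records that need attention.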


## Step 4: Verify Embeddings

Check that each embedding has unit L2 norm (within an absolute tolerance of 1e-5). Only the first 1,000 records are checked, for speed.


In [None]:
# Check embedding properties on the first 1,000 records (a sample, for speed)
embeddings = np.array([r['cav128'] for r in data[:1000]])
norms = np.linalg.norm(embeddings, axis=1)  # per-record L2 norm, computed once

print(f"Embedding shape: {embeddings.shape}")
print(f"Mean L2 norm: {norms.mean():.4f}")
print(f"Std L2 norm: {norms.std():.4f}")
print(f"✓ Embeddings are L2 normalized: {np.allclose(norms, 1.0, atol=1e-5)}")
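
The check above returns a single boolean for a sample. If it ever reports `False`, it helps to know which rows are off. The sketch below is one way to find them; the helper name `non_unit_rows` and the toy array are assumptions for illustration.

```python
import numpy as np

def non_unit_rows(embeddings: np.ndarray, atol: float = 1e-5) -> np.ndarray:
    """Return indices of rows whose L2 norm differs from 1 by more than atol."""
    norms = np.linalg.norm(embeddings, axis=1)
    return np.flatnonzero(np.abs(norms - 1.0) > atol)

# Toy example: rows 0 and 1 have unit norm, row 2 has norm 2
demo = np.array([[1.0, 0.0], [0.6, 0.8], [2.0, 0.0]])
print(non_unit_rows(demo))  # [2]
```

Running it on `np.array([r['cav128'] for r in data])` would cover the full dataset rather than the first 1,000 records, at the cost of holding all embeddings in memory.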


## Summary

- ✓ Dataset generated with 10,000 records
- ✓ Each record contains physiological, environmental, and contextual features
- ✓ 128-dimensional CAV embeddings computed using PCA
- ✓ Data saved to `data/edon_cav.json`
- ✓ Models saved to `models/` directory
