# FedSense Data Exploration

This notebook explores synthetic wearable sensor data for federated anomaly detection.

**Goals:**
- Generate and examine synthetic time-series data (HR + accelerometer)
- Analyze data distribution across federated clients
- Visualize normal vs anomalous patterns
- Demonstrate windowing and preprocessing steps

In [None]:
# Import Required Libraries
import sys
import os
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# FedSense imports
from fedsense.config import get_config
from fedsense.datasets import generate_synthetic_data, create_federated_splits
from fedsense.features import make_windows, get_dataset_stats

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# Load Configuration
config = get_config()

print("FedSense Configuration:")
print(f"  Window Length: {config.window_len} samples")
print(f"  Stride: {config.stride} samples")
print(f"  Number of Clients: {config.n_clients}")
print(f"  Random Seed: {config.random_seed}")
print(f"  Data Directory: {config.data_dir}")

# Set up random seed for reproducibility
np.random.seed(config.random_seed)

## 1. Generate Synthetic Data

First, let's generate synthetic wearable sensor data that mimics real physiological signals.

In [None]:
# Generate synthetic wearable data
synthetic_df = generate_synthetic_data(
    n_samples=10000,
    fs=50.0,  # 50 Hz sampling rate
    window_len=config.window_len,
    anomaly_rate=0.1,  # 10% anomalies
    random_seed=config.random_seed
)

print(f"Generated dataset shape: {synthetic_df.shape}")
print(f"Columns: {list(synthetic_df.columns)}")
print(f"Anomaly rate: {synthetic_df['label'].mean():.3f}")
print(f"Time range: {synthetic_df['timestamp'].min()} to {synthetic_df['timestamp'].max()}")

# Display first few rows
synthetic_df.head(10)
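
For readers without the `fedsense` package at hand, the shape of such a dataset can be sketched with a small stand-in generator. Everything below (`toy_wearable_data`, the baseline values, the way anomalies are injected) is a hypothetical illustration and not the logic inside `generate_synthetic_data`; only the column names are taken from the cells in this notebook.

```python
import numpy as np
import pandas as pd

def toy_wearable_data(n_samples=1000, fs=50.0, anomaly_rate=0.1, seed=0):
    """Hypothetical stand-in generator (NOT FedSense's implementation)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / fs  # timestamps in seconds

    # Each sample is independently flagged as anomalous (a simplification)
    label = (rng.random(n_samples) < anomaly_rate).astype(int)

    # Heart rate: slow oscillation around 70 BPM plus noise, raised during anomalies
    hr = 70 + 5 * np.sin(2 * np.pi * t / 60) + rng.normal(0, 1.5, n_samples)
    hr = hr + 30 * label

    # Accelerometer: small noise with gravity on the z axis, noisier during anomalies
    accel = rng.normal(0, 0.3, (n_samples, 3))
    accel[:, 2] += 9.81
    accel += label[:, None] * rng.normal(0, 2.0, (n_samples, 3))

    return pd.DataFrame({
        "timestamp": t,
        "hr": hr,
        "accel_x": accel[:, 0],
        "accel_y": accel[:, 1],
        "accel_z": accel[:, 2],
        "label": label,
    })

toy_df = toy_wearable_data()
print(toy_df.shape)
print(toy_df.groupby("label")["hr"].mean().round(1))
```

Real anomalies tend to occur as contiguous episodes, not isolated samples, so a more faithful generator would probably inject them in bursts.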

## 2. Visualize Raw Time-Series Data

In [None]:
# Plot a subset of the time series data
plot_duration = 300  # Plot up to 300 seconds (5 minutes)
plot_samples = min(int(plot_duration * 50), len(synthetic_df))  # 50 Hz, capped at the available rows
plot_data = synthetic_df.head(plot_samples).copy()

fig, axes = plt.subplots(4, 1, figsize=(15, 12))

# Time vector for x-axis
time_seconds = np.arange(len(plot_data)) / 50.0

# Heart Rate
axes[0].plot(time_seconds, plot_data['hr'], 'b-', linewidth=0.8, label='Heart rate')
anomaly_mask = plot_data['label'] == 1
if anomaly_mask.any():
    axes[0].scatter(time_seconds[anomaly_mask], plot_data['hr'][anomaly_mask], 
                   c='red', s=10, alpha=0.7, label='Anomaly')
axes[0].set_ylabel('Heart Rate (BPM)')
axes[0].set_title('Synthetic Wearable Sensor Data')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accelerometer X
axes[1].plot(time_seconds, plot_data['accel_x'], 'g-', linewidth=0.8)
if anomaly_mask.any():
    axes[1].scatter(time_seconds[anomaly_mask], plot_data['accel_x'][anomaly_mask], 
                   c='red', s=10, alpha=0.7)
axes[1].set_ylabel('Accel X (m/s²)')
axes[1].grid(True, alpha=0.3)

# Accelerometer Y
axes[2].plot(time_seconds, plot_data['accel_y'], 'orange', linewidth=0.8)
if anomaly_mask.any():
    axes[2].scatter(time_seconds[anomaly_mask], plot_data['accel_y'][anomaly_mask], 
                   c='red', s=10, alpha=0.7)
axes[2].set_ylabel('Accel Y (m/s²)')
axes[2].grid(True, alpha=0.3)

# Accelerometer Z
axes[3].plot(time_seconds, plot_data['accel_z'], 'purple', linewidth=0.8)
if anomaly_mask.any():
    axes[3].scatter(time_seconds[anomaly_mask], plot_data['accel_z'][anomaly_mask], 
                   c='red', s=10, alpha=0.7)
axes[3].set_ylabel('Accel Z (m/s²)')
axes[3].set_xlabel('Time (seconds)')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

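# Report what was actually plotted; the dataset may be shorter than the requested duration
print(f"Plotted {len(plot_data) / 50.0:.0f} seconds of data ({len(plot_data)} samples)")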
print(f"Anomalies in this segment: {anomaly_mask.sum()} ({anomaly_mask.mean():.2%})")

## 3. Windowing and Preprocessing

This section shows how the raw time series is sliced into fixed-length windows, the input format used for machine learning.

In [None]:
# Create windows from the synthetic data
X, y = make_windows(
    synthetic_df, 
    window_len=config.window_len, 
    stride=config.stride,
    target_col='label'
)

print(f"Window shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Feature interpretation: (n_windows={X.shape[0]}, window_len={X.shape[1]}, n_features={X.shape[2]})")
print("Features are: [HR, Accel_X, Accel_Y, Accel_Z]")
print(f"Window anomaly rate: {y.mean():.3f}")

# Visualize a few example windows
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

# Show up to 2 normal and 2 anomalous windows
normal_indices = np.where(y == 0)[0][:2]
anomaly_indices = np.where(y == 1)[0][:2]

window_indices = list(normal_indices) + list(anomaly_indices)
# Build titles from the windows actually found so they stay aligned with the data
titles = ([f'Normal Window {k + 1}' for k in range(len(normal_indices))]
          + [f'Anomaly Window {k + 1}' for k in range(len(anomaly_indices))])
colors = ['blue', 'green', 'red', 'orange']

for i, (window_idx, title, color) in enumerate(zip(window_indices, titles, colors)):
    window_data = X[window_idx]  # Shape: (window_len, n_features)
    time_steps = np.arange(window_data.shape[0]) / 50.0  # Convert to seconds
    
    ax = axes[i]
    
    # Plot heart rate
    ax2 = ax.twinx()
    ax.plot(time_steps, window_data[:, 0], 'r-', linewidth=2, alpha=0.8, label='HR')
    
    # Plot acceleration magnitude
    accel_mag = np.sqrt(window_data[:, 1]**2 + window_data[:, 2]**2 + window_data[:, 3]**2)
    ax2.plot(time_steps, accel_mag, color=color, linewidth=2, alpha=0.7, label='Accel Mag')
    
    ax.set_xlabel('Time (s)')
    ax.set_ylabel('Heart Rate (BPM)', color='r')
    ax2.set_ylabel('Accel Magnitude (m/s²)', color=color)
    ax.set_title(f'{title} (Label: {y[window_idx]})')
    ax.grid(True, alpha=0.3)
    
    # Add legends
    lines1, labels1 = ax.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax.legend(lines1 + lines2, labels1 + labels2, loc='upper right')

plt.tight_layout()
plt.show()
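
The slicing that `make_windows` performs can be sketched in plain NumPy. This is a minimal sketch under two assumptions that the notebook does not confirm: windows start every `stride` samples, and a window is labeled anomalous if any sample inside it is. `sliding_windows` is a hypothetical name, not a FedSense function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_windows(values, labels, window_len, stride):
    """Slice (n_samples, n_features) values into (n_windows, window_len, n_features).

    Assumption: a window's label is 1 if any sample in it is labeled 1.
    """
    # sliding_window_view appends the window axis last:
    # shape (n_samples - window_len + 1, n_features, window_len)
    all_windows = sliding_window_view(values, window_len, axis=0)
    # Keep every `stride`-th start, then reorder to (n_windows, window_len, n_features)
    X = np.transpose(all_windows[::stride], (0, 2, 1))

    label_windows = sliding_window_view(labels, window_len)[::stride]
    y = (label_windows.max(axis=1) > 0).astype(int)
    return X, y

# Tiny example: 10 samples, 2 features, one anomalous sample at index 7
values = np.arange(20, dtype=float).reshape(10, 2)
labels = np.zeros(10, dtype=int)
labels[7] = 1

X_demo, y_demo = sliding_windows(values, labels, window_len=4, stride=2)
print(X_demo.shape)
print(y_demo)
```

With `stride < window_len` the windows overlap, so a single anomalous sample can mark several consecutive windows; this is one reason the window-level anomaly rate can exceed the sample-level rate.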

## 4. Federated Data Distribution

Analyze how data is distributed across federated clients in a non-IID manner.

In [None]:
# Create federated data splits
data_splits = create_federated_splits(
    df=synthetic_df,
    n_clients=config.n_clients,
    alpha=0.1,  # Low alpha = more non-IID
    random_seed=config.random_seed
)

# Analyze distribution across clients
client_stats = []
for client_id, client_data in data_splits.items():
    stats = {
        'client_id': client_id,
        'n_samples': len(client_data),
        'anomaly_rate': client_data['label'].mean(),
        'avg_hr': client_data['hr'].mean(),
        'std_hr': client_data['hr'].std(),
        'avg_accel_x': client_data['accel_x'].mean(),
        'avg_accel_y': client_data['accel_y'].mean(),
        'avg_accel_z': client_data['accel_z'].mean(),
    }
    client_stats.append(stats)

client_stats_df = pd.DataFrame(client_stats)
print("Federated Client Statistics:")
print(client_stats_df.round(3))

# Visualize client distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Anomaly rate per client
axes[0, 0].bar(client_stats_df['client_id'], client_stats_df['anomaly_rate'])
axes[0, 0].set_xlabel('Client ID')
axes[0, 0].set_ylabel('Anomaly Rate')
axes[0, 0].set_title('Anomaly Rate Distribution Across Clients')
axes[0, 0].grid(True, alpha=0.3)

# Sample count per client
axes[0, 1].bar(client_stats_df['client_id'], client_stats_df['n_samples'])
axes[0, 1].set_xlabel('Client ID')
axes[0, 1].set_ylabel('Number of Samples')
axes[0, 1].set_title('Data Volume per Client')
axes[0, 1].grid(True, alpha=0.3)

# Average heart rate per client
axes[1, 0].bar(client_stats_df['client_id'], client_stats_df['avg_hr'])
axes[1, 0].set_xlabel('Client ID')
axes[1, 0].set_ylabel('Average HR (BPM)')
axes[1, 0].set_title('Average Heart Rate per Client')
axes[1, 0].grid(True, alpha=0.3)

# Heart rate spread per client (standard deviation of HR samples, not clinical HRV)
axes[1, 1].bar(client_stats_df['client_id'], client_stats_df['std_hr'])
axes[1, 1].set_xlabel('Client ID')
axes[1, 1].set_ylabel('HR Standard Deviation (BPM)')
axes[1, 1].set_title('Heart Rate Standard Deviation per Client')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNon-IID Statistics:")
print(f"Anomaly rate range: {client_stats_df['anomaly_rate'].min():.3f} - {client_stats_df['anomaly_rate'].max():.3f}")
print(f"Sample count range: {client_stats_df['n_samples'].min()} - {client_stats_df['n_samples'].max()}")
print(f"HR mean range: {client_stats_df['avg_hr'].min():.1f} - {client_stats_df['avg_hr'].max():.1f} BPM")
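
The comment `Low alpha = more non-IID` suggests a Dirichlet-style partition, although the notebook does not show how `create_federated_splits` works internally. The following is a minimal sketch of label-based Dirichlet partitioning under that assumption; `dirichlet_label_split` is a hypothetical name.

```python
import numpy as np

def dirichlet_label_split(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients, class by class, with Dirichlet proportions.

    Smaller alpha concentrates each class on fewer clients (more non-IID).
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]

    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Share of this class that each client receives
        proportions = rng.dirichlet(np.full(n_clients, alpha))
        # Convert cumulative shares into split positions within idx
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(part.tolist())

    return [np.array(sorted(ix), dtype=int) for ix in client_indices]

# Example: 90 normal and 10 anomalous samples spread over 4 clients
demo_labels = np.array([0] * 90 + [1] * 10)
parts = dirichlet_label_split(demo_labels, n_clients=4, alpha=0.1, seed=42)
for client_id, ix in enumerate(parts):
    rate = demo_labels[ix].mean() if len(ix) else float("nan")
    print(f"client {client_id}: n={len(ix)}, anomaly_rate={rate:.2f}")
```

This sketch assigns individual samples, which breaks up the time series. A split intended for windowed data would more plausibly assign whole windows or contiguous segments to each client.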

## 5. Feature Correlation and Distribution Analysis

In [None]:
# Feature correlation analysis
feature_cols = ['hr', 'accel_x', 'accel_y', 'accel_z', 'label']
correlation_matrix = synthetic_df[feature_cols].corr()

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Correlation heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0,
            square=True, ax=axes[0])
axes[0].set_title('Feature Correlation Matrix')

# Feature distributions by class
normal_data = synthetic_df[synthetic_df['label'] == 0]
anomaly_data = synthetic_df[synthetic_df['label'] == 1]

# Heart rate distributions (normalized so the smaller anomaly class stays visible)
axes[1].hist(normal_data['hr'], bins=50, alpha=0.7, label='Normal', color='blue', density=True)
axes[1].hist(anomaly_data['hr'], bins=50, alpha=0.7, label='Anomaly', color='red', density=True)
axes[1].set_xlabel('Heart Rate (BPM)')
axes[1].set_ylabel('Density')
axes[1].set_title('Heart Rate Distribution by Class')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics by class
print("Summary Statistics by Class:")
print("\nNormal samples:")
print(normal_data[['hr', 'accel_x', 'accel_y', 'accel_z']].describe().round(2))

print("\nAnomalous samples:")
print(anomaly_data[['hr', 'accel_x', 'accel_y', 'accel_z']].describe().round(2))
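
One way to put a number on the separation seen in the histogram is a standardized mean difference (Cohen's d) per feature. This is an optional add-on and not part of FedSense; `cohens_d` is a hypothetical helper, and the toy frame below only stands in for `synthetic_df`.

```python
import numpy as np
import pandas as pd

def cohens_d(normal, anomalous):
    """Standardized mean difference using the pooled sample standard deviation."""
    n0, n1 = len(normal), len(anomalous)
    pooled_var = (
        (n0 - 1) * normal.var(ddof=1) + (n1 - 1) * anomalous.var(ddof=1)
    ) / (n0 + n1 - 2)
    return (anomalous.mean() - normal.mean()) / np.sqrt(pooled_var)

# Toy stand-in for synthetic_df with the same 'hr' and 'label' columns
toy_df = pd.DataFrame({
    "hr": [70.0, 71.0, 72.0, 102.0, 103.0, 104.0],
    "label": [0, 0, 0, 1, 1, 1],
})

d_hr = cohens_d(
    toy_df.loc[toy_df["label"] == 0, "hr"],
    toy_df.loc[toy_df["label"] == 1, "hr"],
)
print(f"Cohen's d for hr: {d_hr:.1f}")
```

Applied to the notebook's data, the same call would take `normal_data[col]` and `anomaly_data[col]` for each feature column.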

## 6. Summary and Next Steps

### Key Findings

1. **Data Quality**: The synthetic data is designed to mimic physiological patterns; the checks in this notebook are visual and do not compare against real recordings
2. **Anomaly Patterns**: HR and acceleration differ between normal and anomalous periods in the time-series plots and the per-class summary statistics
3. **Non-IID Distribution**: Anomaly rates and signal characteristics vary across clients, as intended with `alpha=0.1`
4. **Windowing**: 250-sample windows (5 seconds at 50 Hz) are the unit of input for modeling

### Data Characteristics

- **Sampling Rate**: 50 Hz (realistic for wearable devices)
- **Window Length**: 5 seconds (250 samples) 
- **Features**: Heart rate + 3-axis accelerometer
- **Class Balance**: ~10% anomalies (realistic for health monitoring)
- **Federated Splits**: 8 clients with varying data distributions
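
The window figures above follow from simple arithmetic, sketched here. The stride of 125 samples is a hypothetical value for illustration, since the notebook only reads it from `config.stride`, and the count formula assumes windows that fit entirely inside the signal.

```python
def window_duration_s(window_len, fs):
    """Duration of one window in seconds."""
    return window_len / fs

def count_windows(n_samples, window_len, stride):
    """Number of full windows when a new window starts every `stride` samples."""
    if n_samples < window_len:
        return 0
    return (n_samples - window_len) // stride + 1

print(window_duration_s(250, 50.0))     # 5.0
print(count_windows(10_000, 250, 125))  # 79 (stride of 125 is an assumed value)
```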

### Next Steps

1. **Feature Engineering**: Explore FFT features, rolling statistics
2. **Model Training**: Train JAX/Flax CNN on windowed data
3. **Federated Learning**: Compare centralized vs federated performance
4. **Differential Privacy**: Evaluate privacy-accuracy tradeoffs
5. **Model Deployment**: Export to ONNX and serve via Triton
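
As a starting point for the feature engineering step, per-window spectral and rolling features could look like the sketch below. The helper names and the choice of features (dominant frequency, mean spectral power, rolling mean and standard deviation) are assumptions for illustration, not an existing FedSense API.

```python
import numpy as np
import pandas as pd

def spectral_features(window, fs):
    """Dominant frequency (Hz) and mean spectral power for each channel.

    window: array of shape (window_len, n_features)
    """
    centered = window - window.mean(axis=0)  # remove the DC offset per channel
    power = np.abs(np.fft.rfft(centered, axis=0)) ** 2
    freqs = np.fft.rfftfreq(window.shape[0], d=1.0 / fs)
    dominant_freq = freqs[power.argmax(axis=0)]
    return dominant_freq, power.mean(axis=0)

def rolling_stats(signal, span):
    """Rolling mean and standard deviation; the first span-1 entries are NaN."""
    series = pd.Series(signal)
    return series.rolling(span).mean().to_numpy(), series.rolling(span).std().to_numpy()

# Example: a 5-second window at 50 Hz with a 2 Hz and a 5 Hz sine channel
fs = 50.0
t = np.arange(250) / fs
window = np.column_stack([np.sin(2 * np.pi * 2.0 * t), np.sin(2 * np.pi * 5.0 * t)])

dominant_freq, mean_power = spectral_features(window, fs)
print(dominant_freq)

roll_mean, roll_std = rolling_stats(window[:, 0], span=25)
print(roll_mean[-1], roll_std[-1])
```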

### Model Development Pipeline

```bash
# Generate federated data
make data

# Train centralized baseline
make train_local

# Run federated learning  
make fl_server    # Terminal 1
make fl_client    # Terminal 2-N (multiple clients)

# Export and deploy
make export
make triton_up
make api_up
```