## 📂 Step 1: Setup & Configuration

**What:** Import libraries and define paths/parameters.

| Parameter | Value | Description |
|-----------|-------|-------------|
| `window_size` | 200 | Samples per window (4 sec @ 50Hz) |
| `overlap` | 50% | Windows overlap by half |
| `step` | 100 | Samples between window starts |
| `conversion_factor` | 0.00981 | milliG to m/s² |

**Input:** None  
**Output:** `CONFIG` dict, path variables

In [None]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime

# Paths (CORRECT folder names)
PROJECT_ROOT = Path.cwd().parent
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'
DATA_PREPARED = PROJECT_ROOT / 'data' / 'prepared'

# Configuration
CONFIG = {
    'input_file': DATA_PROCESSED / 'sensor_fused_50Hz.csv',
    'window_size': 200,
    'overlap': 0.5,
    'step': 100,
    'conversion_factor': 0.00981,  # milliG to m/s²
    'sensor_columns': ['Ax', 'Ay', 'Az', 'Gx', 'Gy', 'Gz'],
    'accel_columns': ['Ax', 'Ay', 'Az'],
}

print(f"✓ Project Root: {PROJECT_ROOT}")
print(f"✓ Data Processed: {DATA_PROCESSED}")
print(f"✓ Data Prepared: {DATA_PREPARED}")
print(f"\n📥 Input File: {CONFIG['input_file']}")
print(f"📊 Window Size: {CONFIG['window_size']} samples (4 sec @ 50Hz)")
print(f"🔄 Overlap: {CONFIG['overlap']*100:.0f}%")
print(f"🔢 Step: {CONFIG['step']} samples")

## 📥 Step 2: Load Production Data

**What:** Load the production CSV file and validate columns.

### Expected Columns
| Column | Sensor | Unit (raw) |
|--------|--------|------------|
| Ax, Ay, Az | Accelerometer | milliG or m/s² |
| Gx, Gy, Gz | Gyroscope | deg/s |

**Input:** `data/processed/sensor_fused_50Hz.csv`  
**Output:** `df` DataFrame with ~181,699 rows × 6+ columns

In [None]:
# Load CSV
df = pd.read_csv(CONFIG['input_file'])

print(f"✓ Loaded: {CONFIG['input_file'].name}")
print(f"  Shape: {df.shape}")
print(f"  Columns: {list(df.columns)}")
print(f"\n📋 First 5 rows:")
df.head()

In [None]:
# Validate required columns exist
missing = [col for col in CONFIG['sensor_columns'] if col not in df.columns]
if missing:
    raise ValueError(f"❌ Missing columns: {missing}")

print(f"✓ All sensor columns present: {CONFIG['sensor_columns']}")

## 🔍 Step 3: Unit Detection & Conversion

**What:** Automatically detect whether accelerometer data is in milliG or m/s², then convert if needed.

### Detection Logic

| Condition | Units Detected | Action |
|-----------|----------------|--------|
| max absolute > 100 | **milliG** | Convert × 0.00981 |
| max absolute < 50 | **m/s²** | No conversion |
| 50-100 inclusive (ambiguous) | unknown | Assume m/s², no conversion |
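
The thresholds in the table can be isolated as a small pure function, which makes the boundary cases easy to check. This is a minimal sketch: `classify_accel_units` is a hypothetical helper name, not part of the notebook, and it only mirrors the three rows above.

```python
def classify_accel_units(max_abs):
    """Classify accelerometer units from the largest absolute reading.

    Hypothetical helper mirroring the detection table:
    > 100 -> 'milliG', < 50 -> 'm/s²', otherwise 'unknown'.
    """
    if max_abs > 100:
        return 'milliG'
    if max_abs < 50:
        return 'm/s²'
    return 'unknown'  # ambiguous band, 50 to 100 inclusive


# Made-up readings, including both boundary values
examples = {value: classify_accel_units(value) for value in (1000.0, 9.81, 50.0, 100.0)}
print(examples)
```

Note that both boundary values (exactly 50 and exactly 100) fall into the ambiguous band, because the comparisons are strict.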

### Why This Matters
- Training data was in **m/s¬≤**
- Production data from Garmin is often in **milliG**
- Mismatch = wrong predictions!

**Input:** `df` (raw sensor data)  
**Output:** `df_converted`, `units_detected`, `conversion_applied`

### 📖 Understanding Units: milliG vs m/s²

**milliG (milligravity):**
- 1 milliG = 1/1000 of standard gravity (1 g ≈ 9.81 m/s²)
- Used by: Garmin watches, fitness trackers
- Example: Az at rest = -1000 milliG (gravity pointing down)

**m/s² (meters per second squared):**
- Standard physics unit of acceleration
- Used by: Scientific datasets, ML models
- Example: Az at rest = -9.81 m/s² (Earth's gravity)

**Conversion:**
```python
milli_g = -1000.0          # example reading in milliG (Az at rest)
m_s2 = milli_g * 0.00981   # ≈ -9.81 m/s²

# Why? 1 milliG = 0.001 g and 1 g = 9.81 m/s², so 1 milliG = 0.00981 m/s²
```
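
The same conversion can be applied to all accelerometer columns at once. The sketch below uses a tiny made-up DataFrame (the values are illustrative, not taken from the production file) and leaves the original frame untouched.

```python
import pandas as pd

CONVERSION_FACTOR = 0.00981  # m/s² per milliG, same factor as in Step 1

# Illustrative readings in milliG (made-up values, device at rest)
raw = pd.DataFrame({'Ax': [10.0, -20.0], 'Ay': [5.0, 0.0], 'Az': [-1000.0, -998.0]})

accel_cols = ['Ax', 'Ay', 'Az']
converted = raw.copy()
# Multiplying the column subset scales all three axes in one step
converted[accel_cols] = raw[accel_cols] * CONVERSION_FACTOR

print(converted.round(3))
```

Working on a copy keeps the raw milliG values available for a before/after comparison.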

**Why Critical:**
- ⚠️ Training data: m/s² (values around -9.8)
- ⚠️ Garmin production: milliG (values around -1000)
- ⚠️ Mismatch = ~100× wrong scale = bad predictions!

**Example:**
```
Training learns: "Standing has Az ≈ -9.8"
Production sends: Az = -1000 (not converted)
Model thinks: "This is 100× gravity!" → Wrong prediction
```

In [None]:
def detect_and_convert_units(df, accel_cols, conversion_factor=0.00981):
    """
    Automatically detect accelerometer units and convert if needed.
    
    Returns:
        df_converted: DataFrame with converted values
        units_detected: 'milliG', 'm/s²', or 'unknown' (ambiguous range, treated as m/s²)
        conversion_applied: True if conversion was applied
    """
    # Check max absolute value
    max_abs = df[accel_cols].abs().max().max()
    mean_abs = df[accel_cols].abs().mean().mean()
    
    print(f"📊 Accelerometer Statistics:")
    print(f"  Max absolute value: {max_abs:.2f}")
    print(f"  Mean absolute value: {mean_abs:.2f}")
    
    # Detection
    if max_abs > 100:
        units_detected = 'milliG'
        conversion_applied = True
        print(f"\n🔍 Detected: milliG (max > 100)")
        print(f"🔄 Converting with factor: {conversion_factor}")
        
        # Convert
        df_converted = df.copy()
        for col in accel_cols:
            df_converted[col] = df[col] * conversion_factor
        
        # Validate
        az_mean = df_converted['Az'].mean()
        print(f"\n✅ Validation:")
        print(f"  Az mean after conversion: {az_mean:.2f} m/s²")
        if -11 < az_mean < -8:
            print(f"  ✓ Valid (expected ≈ -9.8 for gravity)")
        else:
            print(f"  ⚠ Warning: Az not near -9.8, check data")
            
    elif max_abs < 50:
        units_detected = 'm/s²'
        conversion_applied = False
        df_converted = df.copy()
        print(f"\n🔍 Detected: m/s² (max < 50)")
        print(f"✓ No conversion needed")
    else:
        # Ambiguous range
        units_detected = 'unknown'
        conversion_applied = False
        df_converted = df.copy()
        print(f"\n⚠ Ambiguous range (50-100), assuming m/s²")
    
    return df_converted, units_detected, conversion_applied

# Apply detection and conversion
df_converted, units_detected, conversion_applied = detect_and_convert_units(
    df, CONFIG['accel_columns'], CONFIG['conversion_factor']
)

print(f"\n📝 Result: units={units_detected}, converted={conversion_applied}")

In [None]:
# Show before/after comparison
if conversion_applied:
    comparison = pd.DataFrame({
        'Column': CONFIG['accel_columns'],
        'Before (milliG)': [df[col].mean() for col in CONFIG['accel_columns']],
        'After (m/s²)': [df_converted[col].mean() for col in CONFIG['accel_columns']],
    })
    print("📊 Before/After Conversion (means):")
    display(comparison)
else:
    print("✓ No conversion applied - data treated as m/s²")

## 🧹 Step 4: Handle NaN Values

**What:** Fill missing values using forward-fill then backward-fill.

### Strategy
```
ffill → Use previous valid value
bfill → Use next valid value (for leading NaNs)
```

**Why:** Windows with NaN cannot be used for inference.
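
To see why both passes are needed, the sketch below runs the two fills on a short made-up series with a leading gap, an interior gap, and a trailing gap.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading NaN, an interior gap, and a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward_only = s.ffill()      # leading NaN survives: nothing precedes it
filled = s.ffill().bfill()    # bfill then fills the leading NaN from the next valid value

print(forward_only.tolist())
print(filled.tolist())
```

Forward-fill alone handles interior and trailing gaps; the backward-fill pass only matters when the series starts with missing values.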

**Input:** `df_converted` (may contain NaN)  
**Output:** `df_clean` (no NaN)

In [None]:
# Check for NaN
nan_count = df_converted[CONFIG['sensor_columns']].isna().sum().sum()
print(f"🔍 NaN values in sensor columns: {nan_count}")

if nan_count > 0:
    # Fill NaN using forward fill then backward fill
    df_clean = df_converted.copy()
    df_clean[CONFIG['sensor_columns']] = df_clean[CONFIG['sensor_columns']].ffill().bfill()
    
    remaining_nan = df_clean[CONFIG['sensor_columns']].isna().sum().sum()
    print(f"✓ After ffill+bfill: {remaining_nan} NaN remaining")
else:
    df_clean = df_converted.copy()
    print("✓ No NaN values to handle")

## 📏 Step 5: Normalization (StandardScaler)

**What:** Normalize data using the **same scaler as training**.

### Formula
```
normalized = (value - mean) / scale
```

### Important
- Mean and scale come from `config.json` (saved during training)
- Using different scaler = bad predictions!

**Input:** `df_clean[sensor_columns]` + `config.json`  
**Output:** `sensor_normalized` (range ~[-3, 3])
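
The formula is applied per feature column. The sketch below uses made-up scaler parameters for two features (stand-ins for the arrays stored in `config.json`) and checks that the scaler supplies one mean and one scale per column before relying on NumPy broadcasting.

```python
import numpy as np

# Made-up scaler parameters for two features (stand-ins for config.json values)
scaler_mean = np.array([3.0, -1.0])
scaler_scale = np.array([2.0, 0.5])

# Three samples x two features
data = np.array([[5.0, -1.0],
                 [3.0, -0.5],
                 [1.0, -2.0]])

# The scaler must supply one mean and one scale per feature column
assert data.shape[1] == scaler_mean.shape[0] == scaler_scale.shape[0]

# Broadcasting applies each column's mean and scale to every row
normalized = (data - scaler_mean) / scaler_scale
print(normalized)
```

A length check like this catches a column-order or column-count mismatch early, instead of letting broadcasting fail or silently misalign.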

### 📖 Why Use SAME Scaler as Training?

**What StandardScaler Does:**
```python
normalized = (value - mean) / std
```

**During Training:**
```python
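# Fit scaler on training data (all_users_data_labeled.csv)
import json
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(training_data)  # training_data: array of shape (samples, 6)

# This calculates one value per sensor column (illustrative numbers):
#   scaler.mean_  -> [0.12, -0.08, 9.81, ...]
#   scaler.scale_ -> [2.34, 1.98, 0.87, ...]

# Save these values to config.json
with open('config.json', 'w') as f:
    json.dump({'scaler_mean': scaler.mean_.tolist(),
               'scaler_scale': scaler.scale_.tolist()}, f, indent=2)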
```

**During Production (MUST use same scaler):**
```python
# Load the mean and scale saved during training
with open('data/prepared/config.json') as f:
    cfg = json.load(f)
mean = np.array(cfg['scaler_mean'])   # Same values from training!
std = np.array(cfg['scaler_scale'])

# Apply to production data
production_normalized = (production_data - mean) / std
```

**Why Same Scaler is Critical:**

✅ **Correct (same scaler):**
```
Training:   (5.0 - 3.0) / 2.0 = 1.0
Production: (5.0 - 3.0) / 2.0 = 1.0  ← Same normalized value!
Model recognizes this ✓
```

❌ **Wrong (fit new scaler on production):**
```
Training:   (5.0 - 3.0) / 2.0 = 1.0
Production: (5.0 - 7.0) / 4.0 = -0.5  ← Different normalized value!
Model confused ✗
```
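
The effect can be reproduced with a few numbers. In this sketch the data is made up and `standardize` is a hypothetical helper; the production values are the training values shifted by +4. Reusing the training parameters keeps that shift visible, while refitting on production erases it.

```python
import numpy as np


def standardize(values, mean, scale):
    """Apply (value - mean) / scale with explicitly supplied parameters."""
    return (values - mean) / scale


# Made-up single-feature data
train = np.array([1.0, 3.0, 5.0])        # mean 3.0
production = np.array([5.0, 7.0, 9.0])   # same spread, shifted by +4

# np.std defaults to the population standard deviation, as StandardScaler does
train_mean, train_scale = train.mean(), train.std()
prod_mean, prod_scale = production.mean(), production.std()

same_scaler = standardize(production, train_mean, train_scale)
refit_scaler = standardize(production, prod_mean, prod_scale)

print(same_scaler)   # shift relative to training is preserved
print(refit_scaler)  # shift is erased: identical to normalized training data
```

With a refit scaler the model would see inputs that look exactly like normalized training data, even though the underlying readings differ.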

**Real Example:**
```
Training: Az_norm = (9.8 - 9.81) / 0.5 = -0.02  ← Model learned "standing"
Production (same scaler): Az_norm = (9.85 - 9.81) / 0.5 = 0.08  ← Recognized as "standing" ✓
Production (new scaler): Az_norm = (9.85 - 10.2) / 0.8 = -0.44  ← Model thinks different activity ✗
```

**Where Scaler is Stored:**
```
data/prepared/config.json
```

In [None]:
# Load scaler parameters from training config
scaler_config_path = DATA_PREPARED / 'config.json'

if scaler_config_path.exists():
    with open(scaler_config_path, 'r') as f:
        scaler_config = json.load(f)
    
    scaler_mean = np.array(scaler_config['scaler_mean'])
    scaler_scale = np.array(scaler_config['scaler_scale'])
    
    print(f"✓ Loaded scaler from: {scaler_config_path.name}")
    print(f"  Mean: {scaler_mean}")
    print(f"  Scale: {scaler_scale}")
    
    # Apply normalization
    sensor_data = df_clean[CONFIG['sensor_columns']].values
    sensor_normalized = (sensor_data - scaler_mean) / scaler_scale
    
    print(f"\n📊 Normalized data:")
    print(f"  Shape: {sensor_normalized.shape}")
    print(f"  Range: [{sensor_normalized.min():.2f}, {sensor_normalized.max():.2f}]")
    print(f"  Mean: {sensor_normalized.mean():.4f}")
else:
    print(f"⚠ Scaler config not found at {scaler_config_path}")
    print("Using raw data without normalization")
    sensor_normalized = df_clean[CONFIG['sensor_columns']].values

## 🪟 Step 6: Create Sliding Windows

**What:** Split continuous data into fixed-size windows for model input.

### Parameters
| Parameter | Value | Meaning |
|-----------|-------|---------|
| Window size | 200 | 200 timesteps = 4 seconds @ 50Hz |
| Overlap | 50% | Each window shares 100 samples with next |
| Step | 100 | Start of next window = current + 100 |

### Visualization
```
Data:     [--------------------181,699 samples--------------------]
Window 1: [######]
Window 2:    [######]
Window 3:       [######]
...
```

**Input:** `sensor_normalized` (181,699 × 6)  
**Output:** `X_prod` (N × 200 × 6), `window_metadata`

### 📖 Understanding Windows, Overlap, and Hz

**What is a Window?**
- Fixed-size chunk of continuous sensor data
- Like a 4-second video clip from a longer video
- Model analyzes one window at a time

**Window Size = 200 samples:**
```
200 samples ÷ 50 Hz = 4 seconds of data
```
- ✅ Long enough to capture activity patterns (walking = 2 steps/sec)
- ✅ Short enough to detect activity changes
- ✅ Model architecture expects exactly 200 timesteps

**What is Overlap = 50%?**
```
Window 1:  [########]          samples 0-199
           0       199
Window 2:      [########]      samples 100-299
              100       299
Window 3:          [########]  samples 200-399
                  200       399
```
- Each window shares 100 samples (50%) with next window
- **Why?** Captures transitions between activities
- Without overlap: might miss moment when activity changes!

**What is Hz (Hertz)?**
- Sampling rate = samples per second
- 50 Hz = 50 samples every second
- 1 sample every 0.02 seconds (20 milliseconds)

**Why 50 Hz?**
- ✅ Human activities: Walking ~2 Hz, Running ~3 Hz
- ✅ 50 Hz captures all movements well (25× faster than walking)
- ✅ Standard in activity recognition research

**Step = 100 samples:**
- Distance between window starts
- Step = Window_size × (1 - Overlap)
- Step = 200 × 0.5 = 100
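
The relationship between window size, overlap, and step can be checked directly. This small sketch derives the step from the overlap and lists the inclusive sample ranges of the first three windows; the variable names are illustrative.

```python
window_size = 200
overlap = 0.5

# Step = window_size × (1 − overlap), rounded to a whole number of samples
step = int(round(window_size * (1 - overlap)))

# Inclusive sample ranges of the first three windows
first_windows = [(i * step, i * step + window_size - 1) for i in range(3)]

# Samples shared by two consecutive windows
shared = window_size - step

print(step, shared, first_windows)
```

The ranges match the diagram above: each window starts 100 samples after the previous one and shares its second half with it.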

**Result:**
```
From 181,699 samples → ~1,772 windows
Each window = 4 seconds of sensor data
```
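
The number of full windows follows from the sample count, the window size, and the step. The sketch below uses a hypothetical `count_windows` helper with the same formula as the windowing code in this step. For exactly 181,699 samples and no skipped windows it gives 1,815, so a count near 1,772 would correspond to fewer usable rows (roughly 177,300) or to windows being dropped; the shape printed by the notebook is the number to rely on.

```python
def count_windows(n_samples, window_size, step):
    """Number of full windows that fit; 0 if the data is shorter than one window."""
    if n_samples < window_size:
        return 0
    return (n_samples - window_size) // step + 1


print(count_windows(181_699, 200, 100))  # full dataset size quoted in this notebook
print(count_windows(1_000, 200, 100))    # small example
print(count_windows(150, 200, 100))      # shorter than one window
```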

In [None]:
def create_windows(data, window_size, step):
    """
    Create sliding windows from sensor data.
    
    Args:
        data: numpy array of shape (samples, features)
        window_size: number of timesteps per window
        step: step size between windows
        
    Returns:
        windows: numpy array of shape (n_windows, window_size, features)
        metadata: list of dicts with window info
    """
    windows = []
    metadata = []
    
    n_samples = len(data)
    # Guard against inputs shorter than one window (the count would go negative)
    n_windows = max(0, (n_samples - window_size) // step + 1)
    
    for i in range(n_windows):
        start = i * step
        end = start + window_size
        window = data[start:end]
        
        # Skip if window has NaN
        if np.isnan(window).any():
            continue
            
        windows.append(window)
        metadata.append({
            'window_index': len(windows) - 1,
            'start_sample': start,
            'end_sample': end,
        })
    
    return np.array(windows), metadata

# Create windows
X_prod, window_metadata = create_windows(
    sensor_normalized,
    CONFIG['window_size'],
    CONFIG['step']
)

print(f"✓ Windows created:")
print(f"  Shape: {X_prod.shape}")
print(f"  Format: (windows, timesteps, features)")
print(f"  Total: {len(X_prod):,} windows")

## 💾 Step 7: Save Preprocessed Data

**What:** Export windows and metadata for model inference.

### Output Files

| File | Format | Contains |
|------|--------|----------|
| `production_X.npy` | NumPy | Windows array (N, 200, 6) |
| `production_metadata.json` | JSON | Pipeline info, unit detection result |

### Metadata Contents
- Source file name
- Units detected (milliG / m/s² / unknown)
- Conversion applied (true/false)
- Window parameters
- Total windows created

**Input:** `X_prod`, pipeline info  
**Output:** Files in `data/prepared/`
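
A consumer of these two files can cross-check them before running inference. The sketch below writes a small stand-in array and metadata to a temporary directory (sizes are made up; the keys match the ones this notebook saves), reloads both, and confirms that the array shape agrees with the metadata.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Stand-in array and metadata (made-up sizes, same keys as the notebook writes)
    X = np.zeros((3, 200, 6), dtype=np.float32)
    np.save(out_dir / 'production_X.npy', X)
    meta = {'window_size': 200, 'overlap': 0.5, 'total_windows': len(X)}
    (out_dir / 'production_metadata.json').write_text(json.dumps(meta, indent=2))

    # Consumer side: reload both files while the directory still exists
    X_loaded = np.load(out_dir / 'production_X.npy')
    meta_loaded = json.loads((out_dir / 'production_metadata.json').read_text())

# The array dimensions should agree with what the metadata records
consistent = (
    X_loaded.shape[0] == meta_loaded['total_windows']
    and X_loaded.shape[1] == meta_loaded['window_size']
)
print(consistent)
```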

### 📖 Why .npy Format?

**What is production_X.npy?**
- NumPy binary file containing windowed arrays
- Shape: **(1772, 200, 6)** = 1772 windows × 200 timesteps × 6 sensor channels
- Ready for direct model input

**Why .npy vs other formats?**

| Format | Size | Load Speed | Preserves Shape | Use Case |
|--------|------|------------|-----------------|----------|
| **.npy** | 8.5 MB | ⚡ 0.1s | ✅ Yes | Model input |
| **.csv** | 45 MB | 🐌 2s | ❌ No | Human reading |
| **.pkl** | 9 MB | ⚡ 0.2s | ✅ Yes | Python objects |

**Advantages:**
1. **Fast loading:** 20× faster than CSV
2. **Smaller size:** 80% smaller than CSV
3. **Preserves structure:** No reshaping needed
4. **Type safety:** Keeps float32/float64
5. **Direct use:** `model.predict(np.load('X.npy'))`

**Example:**
```python
# Save
np.save('production_X.npy', X)  # Shape: (1772, 200, 6)

# Load (no reshaping needed!)
X = np.load('production_X.npy')
predictions = model.predict(X)  # Works directly ✓
```

**CSV would require:**
```python
# Load CSV
df = pd.read_csv('production.csv')  # 2 seconds, loses shape
X = df.values.reshape(1772, 200, 6)  # Manual reshaping needed
predictions = model.predict(X)
```
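
The shape and dtype difference can be demonstrated without touching disk. This sketch round-trips a tiny stand-in array through in-memory buffers, once as `.npy` and once as CSV-style text; the array size is made up to keep the example small.

```python
import io

import numpy as np

# Tiny stand-in for the (N, 200, 6) windows array
X = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# .npy round trip through an in-memory buffer: shape and dtype survive
buf = io.BytesIO()
np.save(buf, X)
buf.seek(0)
X_npy = np.load(buf)

# CSV-style round trip: text is 2-D, so the window axis has to be flattened first
text = io.StringIO()
np.savetxt(text, X.reshape(-1, X.shape[-1]), delimiter=',')
text.seek(0)
X_csv = np.loadtxt(text, delimiter=',')

print(X_npy.shape, X_npy.dtype)
print(X_csv.shape, X_csv.dtype)
```

After the CSV round trip the caller has to know the original window length to reshape the data, and the dtype falls back to the loader's default.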

In [None]:
# Create output directory
DATA_PREPARED.mkdir(parents=True, exist_ok=True)

# Save windows
output_X = DATA_PREPARED / 'production_X.npy'
np.save(output_X, X_prod)
print(f"✓ Saved: {output_X}")
print(f"  Shape: {X_prod.shape}")
print(f"  Size: {X_prod.nbytes / 1024 / 1024:.2f} MB")

# Save metadata
output_meta = DATA_PREPARED / 'production_metadata.json'
meta_summary = {
    'created': datetime.now().isoformat(),
    'source_file': str(CONFIG['input_file'].name),
    'units_detected': units_detected,
    'conversion_applied': conversion_applied,
    'conversion_factor': CONFIG['conversion_factor'] if conversion_applied else None,
    'window_size': CONFIG['window_size'],
    'overlap': CONFIG['overlap'],
    'total_windows': len(X_prod),
    'original_samples': len(df),
}

with open(output_meta, 'w') as f:
    json.dump(meta_summary, f, indent=2)
print(f"✓ Saved: {output_meta}")

## ✅ Step 8: Summary & Next Steps

**What:** Print final summary and show what to do next.

### Pipeline Complete!
```
Input:  sensor_fused_50Hz.csv (181,699 samples)
Output: production_X.npy (~1,772 windows)
```

### Next Steps
1. **Load model:** `keras.models.load_model('model.keras')`
2. **Run inference:** `predictions = model.predict(X_prod)`
3. **Analyze results:** Check prediction distribution & confidence
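
For the third step, a quick look at the class distribution and confidence might look like the sketch below. It assumes the model returns one row of class probabilities per window (for example from a softmax output layer), which this notebook does not confirm; the `predictions` array and the 0.5 cut-off are made up for illustration.

```python
import numpy as np

# Made-up class probabilities for 4 windows and 3 classes (stand-in for model.predict output)
predictions = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
])

predicted_class = predictions.argmax(axis=1)   # most likely class per window
confidence = predictions.max(axis=1)           # probability assigned to that class
class_counts = np.bincount(predicted_class, minlength=predictions.shape[1])

# Windows whose top probability falls below an arbitrary cut-off
low_confidence = np.flatnonzero(confidence < 0.5)

print(class_counts, confidence.mean(), low_confidence)
```

A heavily skewed `class_counts` or many low-confidence windows is a hint to revisit the unit conversion and scaler steps before trusting the predictions.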

In [None]:
print("="*70)
print("🎉 PREPROCESSING COMPLETE")
print("="*70)
print(f"\n📥 Input:")
print(f"  File: {CONFIG['input_file'].name}")
print(f"  Samples: {len(df):,}")
print(f"\n🔍 Unit Detection:")
print(f"  Detected: {units_detected}")
print(f"  Converted: {conversion_applied}")
if conversion_applied:
    print(f"  Factor: {CONFIG['conversion_factor']}")
print(f"\n📤 Output:")
print(f"  Windows: {len(X_prod):,}")
print(f"  Shape: {X_prod.shape}")
print(f"  Files:")
print(f"    - production_X.npy")
print(f"    - production_metadata.json")
print(f"\n🚀 Next: Load model and run inference")
print("="*70)