# Data Preparation for New 1-Second Data (1sec_new)
---

This notebook preprocesses the **new 1-second resolution data** from the `1sec_new` folder.

## Why Keep 1-Second Resolution?
According to NILM literature, **1 second is significantly better than 10 seconds** for Transformer-based NILM:
- Better accuracy for short-cycle appliances (kettles, microwaves)
- Standard resolution for state-of-the-art models (ELECTRIcity, Energformer, STNILM)
- Preserves transient events that get "smoothed out" at lower resolutions
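
To make the smoothing effect concrete, here is a minimal sketch on synthetic data. The 3-second, 2 kW spike and all variable names are illustrative assumptions, not values from this dataset. It averages a 1-second series into 10-second bins and compares the peak power seen at each resolution.

```python
import pandas as pd

# Synthetic 1-second series: 60 s of zero load with a 3 s, 2 kW spike (illustrative values)
idx = pd.date_range('2025-01-01 00:00:00', periods=60, freq='1s')
power_1s = pd.Series(0.0, index=idx)
power_1s.iloc[20:23] = 2.0  # kW

# Downsample to 10-second means, as a lower-resolution log would report
power_10s = power_1s.resample('10s').mean()

peak_1s = power_1s.max()
peak_10s = power_10s.max()
print(f'Peak at 1 s:  {peak_1s:.2f} kW')
print(f'Peak at 10 s: {peak_10s:.2f} kW')
```

In this toy case the spike keeps its full height at 1 second but is averaged down to less than a third of it in the 10-second series.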

## Data Overview
- **21 monthly files raw** (March 2024 - December 2025; June 2024 is missing)
- **494 days usable** after filtering to gap-free periods
- **~40M rows** at 1-second resolution

## Usable Periods (Gap-Free)
| Period | Start | End | Days |
|--------|-------|-----|------|
| A | 2024-04-15 | 2024-05-31 | 46 |
| B | 2024-07-01 | 2024-09-30 | 92 |
| C | 2024-10-09 | 2025-09-30 | 356 |

## Key Processing Steps (Based on EDA - see eda_1sec_new.ipynb):
1. **clip(lower=0) for ALL appliances**: dishwasher, washing_machine, stove, oven, etc.
   - EDA shows: 85-99% of samples in dishwasher, washing_machine and stove are negative, but the values are SMALL (-8W to -15W)
   - Positives are LARGE (1000-3600W) when ON
   - This is a **SENSOR OFFSET**, NOT an inverted CT!
   - With an inverted CT we would see LARGE negatives when ON
2. **Fix double counting**: `garage_cabinet -= ev_charger + ev_socket`
   - Confirmed via analysis: garage >= EV 100% of time
3. **Noise thresholding**: Values < 5W → 0
   - Clean "OFF" states for model training
4. Keep negative values for Solar, Grid, Battery (bidirectional energy flow)
5. Fill EVCharger/EVSocket with 0 before Aug 2024
6. **NO resampling - keep native 1sec resolution**
7. Add cyclical temporal features
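
The check behind step 2 can be sketched as follows on a tiny synthetic frame. The kW values are made up for illustration; only the column names come from this notebook. It measures how often the parent circuit covers its children, then subtracts them and clips any residual negatives.

```python
import pandas as pd

# Synthetic readings in kW (illustrative only): garage_cabinet is the parent circuit
df_demo = pd.DataFrame({
    'garage_cabinet': [0.10, 7.50, 3.90, 0.12],
    'ev_charger':     [0.00, 7.40, 0.00, 0.00],
    'ev_socket':      [0.00, 0.00, 3.70, 0.00],
})

children_sum = df_demo[['ev_charger', 'ev_socket']].sum(axis=1)

# Share of rows where the parent is at least as large as the sum of its children
share_parent_ge_children = (df_demo['garage_cabinet'] >= children_sum).mean()

# Remove the EV share from the parent; clip guards against small negative residuals
garage_only = (df_demo['garage_cabinet'] - children_sum).clip(lower=0)
print(f'parent >= children in {share_parent_ge_children:.0%} of rows')
print(garage_only.round(2).tolist())
```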

## Corrections Rationale (Validated via eda_1sec_new.ipynb):
| Issue | Columns Affected | Solution | Evidence |
|-------|-----------------|----------|----------|
| **OFFSET (NOT inverted CT!)** | dishwasher, washing_machine, stove | `clip(0)` | Neg: -8 to -15W, Pos: 1000-3600W |
| Sensor noise | dryer, heat_pump, range_hood, etc. | `clip(0)` | <5% negative, small magnitude |
| Double counting | garage_cabinet | Subtract EV | garage >= EV 100% of time |
| Small positive noise | All appliances | Threshold < 5W → 0 | Dryer shows 3W when OFF |

**IMPORTANT**: abs() is WRONG for this data because:
- Negative values are TINY (-8W to -15W) = sensor offset when OFF
- Positive values are LARGE (1000W+) = real consumption when ON
- abs() would turn the offsets into 8-15W positives, which exceed the 5W noise threshold and would survive as false load in the OFF state
- clip(0) correctly zeros out the small offsets
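
A minimal sketch of the difference, using a synthetic trace. The values and the helper `apply_noise_threshold` are illustrative assumptions, not taken from the EDA notebook; only the 5 W threshold matches this notebook.

```python
import pandas as pd

NOISE_THRESHOLD_KW = 0.005  # 5 W, same value as used in this notebook

# Synthetic appliance trace in kW: small negative offset when OFF, about 2 kW when ON
raw = pd.Series([-0.010, -0.012, 2.100, 2.050, -0.009])

def apply_noise_threshold(s, threshold=NOISE_THRESHOLD_KW):
    """Set small positive values below the threshold to 0."""
    in_noise_band = (s > 0) & (s < threshold)
    return s.where(~in_noise_band, 0.0)

clipped = apply_noise_threshold(raw.clip(lower=0))
flipped = apply_noise_threshold(raw.abs())

print('clip(0):', clipped.tolist())
print('abs():  ', flipped.tolist())
```

With `clip(lower=0)` the OFF samples become exactly 0. With `abs()` they become 9-12 W, which is above the threshold, so the OFF state is never clean.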

## Output:
- NILM-ready dataset at **1-second resolution** (~40M rows)
- Consistent feature set: 11 appliances + Solar/Grid + 6 temporal features

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')

# Paths - OUTPUT IS 1sec (NO resampling!)
BASE_DIR = Path('.').resolve().parent.parent
RAW_DIR = BASE_DIR / 'data' / 'raw' / '1sec_new'
OUTPUT_DIR = BASE_DIR / 'data' / 'processed' / '1sec_processed'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f'Base directory: {BASE_DIR}')
print(f'Raw data: {RAW_DIR}')
print(f'Output: {OUTPUT_DIR}')
print(f'\nFiles available:')
for f in sorted(RAW_DIR.glob('*.csv')):
    print(f'   {f.name}')

Base directory: C:\Users\gamek\School\TeamProject\MTS3-MCTE-Team-Project-Energy-G1
Raw data: C:\Users\gamek\School\TeamProject\MTS3-MCTE-Team-Project-Energy-G1\data\raw\1sec_new
Output: C:\Users\gamek\School\TeamProject\MTS3-MCTE-Team-Project-Energy-G1\data\processed\1sec_processed

Files available:
   household_2024_03.csv
   household_2024_04.csv
   household_2024_05.csv
   household_2024_07.csv
   household_2024_08.csv
   household_2024_09.csv
   household_2024_10.csv
   household_2024_11.csv
   household_2024_12.csv
   household_2025_01.csv
   household_2025_02.csv
   household_2025_03.csv
   household_2025_04.csv
   household_2025_05.csv
   household_2025_06.csv
   household_2025_07.csv
   household_2025_08.csv
   household_2025_09.csv
   household_2025_10.csv
   household_2025_11.csv
   household_2025_12.csv


---
## 1. Define Target Columns & Mapping

In [2]:
# Standard column set for output
# Using lowercase with underscores to match new data format

TARGET_COLUMNS = {
    'time': 'Time',
    
    # Aggregate columns - pick the main building
    'building_33a8340b-f03c-4851-9f9f-99b98e2c4cc9': 'Aggregate',
    'building_33a8340b-f03c-4851-9f9f-99b98e2c4cc12': 'Aggregate',  # Older files
    
    # Appliances
    'heat_pump': 'HeatPump',
    'dishwasher': 'Dishwasher',
    'washing_machine': 'WashingMachine',
    'dryer': 'Dryer',
    'oven': 'Oven',
    'stove': 'Stove',
    'range_hood': 'RangeHood',
    'ev_charger': 'EVCharger',
    'ev_socket': 'EVSocket',
    'garage_cabinet': 'GarageCabinet',
    'rainwater_pump': 'RainwaterPump',
    
    # Solar & Grid (keep signs for energy flow analysis)
    'solar': 'Solar',
    'grid': 'Grid',
    
    # Battery (only in late 2025)
    'battery': 'Battery'
}

# =============================================================================
# CORRECTION SETTINGS 
# Based on raw data analysis (see temp/analyze_raw_1sec.py)
# =============================================================================

# Columns where negative is meaningful (bidirectional energy flow)
ALLOW_NEGATIVE = ['solar', 'grid', 'battery', 
                  'building_33a8340b-f03c-4851-9f9f-99b98e2c4cc9',
                  'building_33a8340b-f03c-4851-9f9f-99b98e2c4cc12']

# =============================================================================
# NOTE: NO abs() needed!
# EDA analysis (eda_1sec_new.ipynb) shows:
# - dishwasher, washing_machine, stove have 85-99% negative values
# - BUT: negatives are SMALL (-8W to -15W) = sensor offset when OFF
# - AND: positives are LARGE (1000-3600W) = real consumption when ON
# - This is OFFSET, NOT inverted CT!
# - If inverted CT: we'd see LARGE negatives (-2000W) when ON
# - Solution: clip(lower=0), NOT abs()
# =============================================================================

# Noise threshold: Values below this are sensor noise (set to 0)
# Applied AFTER clip(0) correction
# Reason: Some appliances show ~3W offset even when OFF
NOISE_THRESHOLD_KW = 0.005  # 5 Watts

# Double counting hierarchy (same as 15-min: Kast garage contains EV chargers)
# garage_cabinet = garage_cabinet - ev_charger - ev_socket
DOUBLE_COUNTING_PARENT = 'garage_cabinet'
DOUBLE_COUNTING_CHILDREN = ['ev_charger', 'ev_socket']

print('Target columns defined.')
print(f'\n📋 Correction Strategy:')
print(f'   1. clip(lower=0) for ALL appliances (offset removal, NOT abs!)')
print(f'   2. Noise threshold: {NOISE_THRESHOLD_KW*1000:.0f}W (values below → 0)')
print(f'   3. Double counting fix: {DOUBLE_COUNTING_PARENT} -= {DOUBLE_COUNTING_CHILDREN}')
print(f'   4. Keep negatives: {[c for c in ALLOW_NEGATIVE if "building" not in c]}')

Target columns defined.

📋 Correction Strategy:
   1. clip(lower=0) for ALL appliances (offset removal, NOT abs!)
   2. Noise threshold: 5W (values below → 0)
   3. Double counting fix: garage_cabinet -= ['ev_charger', 'ev_socket']
   4. Keep negatives: ['solar', 'grid', 'battery']


---
## 1b. Define Usable Periods (Gap-Free)
Based on detailed exploration of 1sec_new data, we identified 3 continuous periods without gaps.
These periods have all essential columns present with no Building NULL values.
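
The scripts that produced these periods are not shown in this notebook. As a rough sketch of how continuous segments could be derived from a timestamp column, assuming a maximum allowed gap of 10 minutes (the function `find_continuous_periods` below and the toy timestamps are hypothetical, not the author's implementation):

```python
import pandas as pd

def find_continuous_periods(times, max_gap=pd.Timedelta(minutes=10)):
    """Return (start, end, n_rows) for each run of timestamps with no gap > max_gap."""
    times = pd.Series(pd.to_datetime(times)).sort_values().reset_index(drop=True)
    # A new segment starts wherever the step from the previous sample exceeds max_gap
    segment_id = (times.diff() > max_gap).cumsum()
    return [(g.iloc[0], g.iloc[-1], len(g)) for _, g in times.groupby(segment_id)]

# Toy example: two 5-sample runs at 1-second spacing, separated by a 1-hour gap
t1 = pd.date_range('2024-04-15 00:00:00', periods=5, freq='1s')
t2 = pd.date_range('2024-04-15 01:00:00', periods=5, freq='1s')
periods = find_continuous_periods(t1.append(t2))
for start, end, n in periods:
    print(f'{start} -> {end} ({n} rows)')
```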

In [3]:
# =============================================================================
# USABLE PERIODS (from gap analysis)
# =============================================================================
# These are continuous periods with no gap longer than 10 minutes
# Identified via deep_explore_v2.py and find_continuous_periods.py

USABLE_PERIODS = [
    # Period A: April-May 2024 (partial)
    # 46 days, ~1.2M rows - some early gaps excluded
    ('2024-04-15 00:00:00', '2024-05-31 23:59:59'),
    
    # Period B: July-September 2024
    # 92 days, ~7.9M rows - June 2024 completely missing
    ('2024-07-01 00:00:00', '2024-09-30 23:59:59'),
    
    # Period C: October 2024 - September 2025
    # 356 days, ~30.8M rows - MAIN PERIOD
    # Starts from Oct 9 due to early October gap
    ('2024-10-09 12:15:00', '2025-09-30 23:59:59'),
]

# Note: October-December 2025 have gaps and should NOT be used
# Battery column (Nov-Dec 2025 only) is also excluded for consistency

print('='*70)
print('📅 USABLE PERIODS DEFINED')
print('='*70)
for i, (start, end) in enumerate(USABLE_PERIODS):
    print(f'Period {chr(65+i)}: {start} → {end}')
print('='*70)

📅 USABLE PERIODS DEFINED
Period A: 2024-04-15 00:00:00 → 2024-05-31 23:59:59
Period B: 2024-07-01 00:00:00 → 2024-09-30 23:59:59
Period C: 2024-10-09 12:15:00 → 2025-09-30 23:59:59


---
## 2. Load and Process All Files

In [4]:
def process_single_file(filepath):
    """Process a single CSV file with all necessary corrections.
    
    Corrections applied (based on EDA analysis - eda_1sec_new.ipynb):
    1. clip(lower=0) for ALL appliance columns
       - Negative values are SMALL sensor offsets (-8W to -15W)
       - NOT inverted CT (which would show LARGE negatives when ON)
    2. Fix double counting: garage_cabinet -= ev_charger + ev_socket
    3. Apply noise threshold: values < 5W → 0 (sensor noise when OFF)
    """
    
    df = pd.read_csv(filepath)
    
    # 1. Parse timestamp
    df['time'] = pd.to_datetime(df['time'])
    
    # 2. Clip ALL appliance negatives to 0
    # EDA shows: negatives are SMALL offsets (-8W to -15W)
    # Positives are LARGE when ON (1000W+) → this is OFFSET, not inverted CT
    for col in df.columns:
        if col not in ALLOW_NEGATIVE and col != 'time':
            if df[col].dtype in ['float64', 'int64']:
                df[col] = df[col].clip(lower=0)
    
    # 3. Fix double counting: garage_cabinet contains EV chargers
    # Same logic as 15-min: Kast garage -= Smappee_laadpaal + Laadpaal_stopcontact
    if DOUBLE_COUNTING_PARENT in df.columns:
        subtraction = 0
        for child in DOUBLE_COUNTING_CHILDREN:
            if child in df.columns:
                subtraction = subtraction + df[child]
        df[DOUBLE_COUNTING_PARENT] = (df[DOUBLE_COUNTING_PARENT] - subtraction).clip(lower=0)
    
    # 4. Apply noise threshold: small positive values → 0
    # Reason: Some appliances show ~3W offset when OFF
    for col in df.columns:
        if col not in ALLOW_NEGATIVE and col != 'time' and df[col].dtype in ['float64', 'int64']:
            mask = (df[col] > 0) & (df[col] < NOISE_THRESHOLD_KW)
            df.loc[mask, col] = 0
    
    # 5. Handle missing columns - find aggregate column
    agg_col = None
    for col in ['building_33a8340b-f03c-4851-9f9f-99b98e2c4cc9', 
                'building_33a8340b-f03c-4851-9f9f-99b98e2c4cc12']:
        if col in df.columns:
            agg_col = col
            break
    
    if agg_col is None:
        print(f"WARNING: No aggregate column found in {filepath.name}")
        return None
    
    # 6. Create standardized dataframe
    result = pd.DataFrame()
    result['Time'] = df['time']
    result['Aggregate'] = df[agg_col].clip(lower=0)  # Clip negatives (CT noise)
    
    # Add appliances
    appliance_cols = ['heat_pump', 'dishwasher', 'washing_machine', 'dryer',
                     'oven', 'stove', 'range_hood', 'ev_charger', 'ev_socket',
                     'garage_cabinet', 'rainwater_pump']
    
    for col in appliance_cols:
        if col in df.columns:
            result[TARGET_COLUMNS.get(col, col)] = df[col]
        else:
            result[TARGET_COLUMNS.get(col, col)] = 0.0  # Fill missing with zeros
    
    # Add solar, grid, battery (keep signs for energy flow)
    if 'solar' in df.columns:
        result['Solar'] = df['solar']
    if 'grid' in df.columns:
        result['Grid'] = df['grid']
    if 'battery' in df.columns:
        result['Battery'] = df['battery']
    else:
        result['Battery'] = 0.0
    
    return result

# Test on first file
test_file = sorted(RAW_DIR.glob('*.csv'))[0]
test_df = process_single_file(test_file)
print(f'Tested on: {test_file.name}')
print(f'Result columns: {list(test_df.columns)}')
print(f'Rows: {len(test_df):,}')

# Verify corrections
print(f'\n✅ Corrections Applied:')
print(f'   1. clip(0) for ALL appliances (offset removal, NOT abs!)')
print(f'   2. Double counting: GarageCabinet -= EVCharger + EVSocket')
print(f'   3. Noise threshold: values < {NOISE_THRESHOLD_KW*1000:.0f}W → 0')

Tested on: household_2024_03.csv
Result columns: ['Time', 'Aggregate', 'HeatPump', 'Dishwasher', 'WashingMachine', 'Dryer', 'Oven', 'Stove', 'RangeHood', 'EVCharger', 'EVSocket', 'GarageCabinet', 'RainwaterPump', 'Solar', 'Grid', 'Battery']
Rows: 154,236

✅ Corrections Applied:
   1. clip(0) for ALL appliances (offset removal, NOT abs!)
   2. Double counting: GarageCabinet -= EVCharger + EVSocket
   3. Noise threshold: values < 5W → 0


In [None]:
# Process all files
print('='*70)
print('📂 LOADING ALL FILES')
print('='*70)

all_files = sorted(RAW_DIR.glob('*.csv'))
dfs = []

for f in tqdm(all_files, desc='Processing'):
    df_file = process_single_file(f)
    if df_file is not None:
        dfs.append(df_file)
        print(f'   ✅ {f.name}: {len(df_file):,} rows')

# Combine
df = pd.concat(dfs, ignore_index=True)
df = df.sort_values('Time').reset_index(drop=True)

print(f'\n📊 Combined (all): {len(df):,} rows')
print(f'   Time: {df["Time"].min()} → {df["Time"].max()}')

# =============================================================================
# FILTER TO USABLE PERIODS ONLY
# =============================================================================
print(f'\n🔧 FILTERING TO USABLE PERIODS...')
rows_before = len(df)

# Create mask for usable periods
period_mask = pd.Series(False, index=df.index)

for i, (start, end) in enumerate(USABLE_PERIODS):
    start_dt = pd.to_datetime(start)
    end_dt = pd.to_datetime(end)
    mask_i = (df['Time'] >= start_dt) & (df['Time'] <= end_dt)
    rows_in_period = mask_i.sum()
    period_mask = period_mask | mask_i
    print(f'   Period {chr(65+i)}: {start} → {end}')
    print(f'           {rows_in_period:,} rows ({rows_in_period/rows_before*100:.1f}%)')

# Apply filter
df = df[period_mask].reset_index(drop=True)
rows_after = len(df)

print(f'\n📊 After filtering:')
print(f'   Rows: {rows_before:,} → {rows_after:,}')
print(f'   Removed: {rows_before - rows_after:,} rows ({(rows_before - rows_after)/rows_before*100:.1f}%)')
print(f'   Time: {df["Time"].min()} → {df["Time"].max()}')

📂 LOADING ALL FILES

   ✅ household_2024_03.csv: 154,236 rows
   ✅ household_2024_04.csv: 218,821 rows
   ✅ household_2024_05.csv: 1,080,770 rows
   ✅ household_2024_07.csv: 2,673,691 rows
   ✅ household_2024_08.csv: 2,677,885 rows
   ✅ household_2024_09.csv: 2,590,681 rows
   ✅ household_2024_10.csv: 2,046,755 rows
   ✅ household_2024_11.csv: 2,589,684 rows
   ✅ household_2024_12.csv: 2,675,505 rows
   ✅ household_2025_01.csv: 2,673,103 rows
   ✅ household_2025_02.csv: 2,419,200 rows
   ✅ household_2025_03.csv: 2,678,348 rows
   ✅ household_2025_04.csv: 2,581,542 rows
   ✅ household_2025_05.csv: 2,676,913 rows
   ✅ household_2025_06.csv: 2,591,220 rows
   ✅ household_2025_07.csv: 2,678,081 rows
   ✅ household_2025_08.csv: 2,672,659 rows
   ✅ household_2025_09.csv: 2,590,780 rows
   ✅ household_2025_10.csv: 2,463,145 rows
   ✅ household_2025_11.csv: 2,496,891 rows
   ✅ household_2025_12.csv: 1,412,067 rows

Processing: 100%|██████████| 21/21 [02:32<00:00,  7.27s/it]

---
## 3. Data Quality Check

In [None]:
print('='*70)
print('🔍 DATA QUALITY CHECK')
print('='*70)

# Check for nulls
null_counts = df.isna().sum()
print(f'\nNull values:')
for col in df.columns:
    if null_counts[col] > 0:
        print(f'   {col}: {null_counts[col]:,} ({null_counts[col]/len(df)*100:.2f}%)')

if null_counts.sum() == 0:
    print('   ✅ No null values!')

# Check for remaining negatives in appliances
print(f'\nNegative value check (appliances):')
appliance_cols = ['HeatPump', 'Dishwasher', 'WashingMachine', 'Dryer', 'Oven', 
                  'Stove', 'RangeHood', 'EVCharger', 'EVSocket', 'GarageCabinet', 'RainwaterPump']
for col in appliance_cols:
    if col in df.columns:
        neg_count = (df[col] < 0).sum()
        if neg_count > 0:
            print(f'   ⚠️ {col}: {neg_count:,} negative values')

# Show statistics
print(f'\n📊 Column Statistics:')
numeric_cols = df.select_dtypes(include=[np.number]).columns
stats_df = df[numeric_cols].describe().round(4)
print(stats_df.to_string())

---
## 4. Verify 1-Second Resolution (NO Resampling)

In [None]:
print('='*70)
print('⏱️ VERIFYING 1-SECOND RESOLUTION (NO RESAMPLING)')
print('='*70)

# Check current resolution
time_diffs = df['Time'].diff().dt.total_seconds().dropna()
print(f'\nResolution distribution:')
for diff, count in time_diffs.value_counts().head(5).items():
    print(f'   {diff:.0f}sec: {count:,} ({count/len(time_diffs)*100:.1f}%)')

# Verify mostly 1-second
pct_1sec = (time_diffs == 1).sum() / len(time_diffs) * 100
print(f'\n✅ {pct_1sec:.1f}% of data is at 1-second resolution')
print(f'   Total rows: {len(df):,}')
print(f'\n📌 KEEPING NATIVE 1-SECOND RESOLUTION FOR BEST NILM ACCURACY')

---
## 5. Final Negative Value Cleanup

In [None]:
print('='*70)
print('✂️ FINAL CLEANUP: Clipping Small Negatives')
print('='*70)

# For appliances: clip to 0
appliance_cols = ['HeatPump', 'Dishwasher', 'WashingMachine', 'Dryer', 'Oven', 
                  'Stove', 'RangeHood', 'EVCharger', 'EVSocket', 'GarageCabinet', 
                  'RainwaterPump', 'Aggregate']

print('\nNegative values before clipping:')
for col in appliance_cols:
    if col in df.columns:
        neg_count = (df[col] < 0).sum()
        if neg_count > 0:
            neg_pct = neg_count / len(df) * 100
            print(f'   {col}: {neg_count:,} ({neg_pct:.1f}%)')
            df[col] = df[col].clip(lower=0)

# For Solar/Grid/Battery: keep negative values (meaningful)
print(f'\n✅ Solar, Grid, Battery: keeping negative values (energy flow)')

print('\n✅ Cleanup complete')

---
## 6. Time Gap Analysis
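The note printed at the end of this section's cell says that windows crossing a gap are excluded during training. One way to do that is sketched here, assuming fixed-length windows over a sorted `Time` column. The function `valid_window_starts` and the window length are hypothetical; the training notebook may handle this differently.

```python
import numpy as np
import pandas as pd

def valid_window_starts(times, window_len, step_s=1.0):
    """Return start indices of windows whose consecutive samples are all step_s apart."""
    diffs = times.diff().dt.total_seconds().to_numpy()
    # bad[i] is True when the step from row i-1 to row i is not the expected spacing
    bad = diffs != step_s
    bad[0] = False  # the first row has no predecessor
    # csum[k] = number of irregular steps among rows 0..k-1
    csum = np.concatenate([[0], np.cumsum(bad)])
    n_windows = max(len(times) - window_len + 1, 0)
    starts = np.arange(n_windows)
    # Irregular steps inside a window are those at rows start+1 .. start+window_len-1
    bad_in_window = csum[starts + window_len] - csum[starts + 1]
    return starts[bad_in_window == 0]

# Toy series: 6 contiguous seconds, a 1-hour gap, then 4 contiguous seconds
t = pd.Series(
    pd.date_range('2025-01-01 00:00:00', periods=6, freq='1s').append(
        pd.date_range('2025-01-01 01:00:00', periods=4, freq='1s')))
starts = valid_window_starts(t, window_len=4)
print(starts.tolist())
```

Only windows that lie entirely on one side of the gap are returned, so no training sample mixes data from before and after a gap.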

In [None]:
print('='*70)
print('🕳️ TIME GAP ANALYSIS')
print('='*70)

df['Time'] = pd.to_datetime(df['Time'])
time_diffs = df['Time'].diff().dt.total_seconds()

# Analyze gaps
gaps_1min = (time_diffs > 60).sum()
gaps_1hour = (time_diffs > 3600).sum()
max_gap = time_diffs.max()

print(f'\nTotal rows: {len(df):,}')
print(f'Gaps > 1 minute: {gaps_1min:,}')
print(f'Gaps > 1 hour: {gaps_1hour:,}')
print(f'Max gap: {max_gap/3600:.2f} hours')

if gaps_1hour > 0:
    print(f'\nLarge gaps (> 1 hour):')
    large_gaps = time_diffs[time_diffs > 3600]
    for idx in large_gaps.index[:10]:
        gap_h = time_diffs.loc[idx] / 3600
        gap_time = df.loc[idx-1, 'Time'] if idx > 0 else 'N/A'
        print(f'   {gap_h:.2f}h after {gap_time}')

print('\n📌 Note: Gaps will be handled during model training (windowing excludes gap-crossing windows)')

---
## 7. Add Temporal Features
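The sine/cosine encoding used in the cell below keeps times that are adjacent on the clock adjacent in feature space, which a raw hour value does not (23:30 and 00:30 differ by 23 as raw hours). A small illustrative check, where the helper `encode_hour` is hypothetical and not part of the notebook:

```python
import numpy as np

def encode_hour(hour):
    """Map a fractional hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

late = encode_hour(23.5)   # 23:30
early = encode_hour(0.5)   # 00:30
noon = encode_hour(12.5)   # 12:30

# Euclidean distance between the encoded points
dist_across_midnight = np.linalg.norm(late - early)
dist_to_noon = np.linalg.norm(late - noon)
print(f'23:30 vs 00:30: {dist_across_midnight:.3f}')
print(f'23:30 vs 12:30: {dist_to_noon:.3f}')
```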

In [None]:
print('='*70)
print('⏰ ADDING TEMPORAL FEATURES')
print('='*70)

# Extract time components
df['hour'] = df['Time'].dt.hour + df['Time'].dt.minute / 60 + df['Time'].dt.second / 3600
df['dow'] = df['Time'].dt.dayofweek  # 0=Monday
df['month'] = df['Time'].dt.month

# Cyclical encoding
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

df['dow_sin'] = np.sin(2 * np.pi * df['dow'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dow'] / 7)

df['month_sin'] = np.sin(2 * np.pi * (df['month'] - 1) / 12)
df['month_cos'] = np.cos(2 * np.pi * (df['month'] - 1) / 12)

# Drop intermediate columns
df = df.drop(columns=['hour', 'dow', 'month'])

print('\nTemporal features added:')
print('   • hour_sin, hour_cos (24h cycle)')
print('   • dow_sin, dow_cos (7-day cycle)')
print('   • month_sin, month_cos (12-month cycle)')

print(f'\nFinal columns ({len(df.columns)}): {list(df.columns)}')

---
## 8. Ghost Load Analysis
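In the cell below, "ghost load" is the part of the aggregate that the submetered appliances do not account for, i.e. `Aggregate - sum(appliances)`. A toy example with made-up kW values shows how the reported quantities are computed:

```python
import pandas as pd

# Made-up readings in kW for three timestamps (illustrative only)
toy = pd.DataFrame({
    'Aggregate':  [1.00, 3.00, 0.50],
    'HeatPump':   [0.50, 0.50, 0.40],
    'Dishwasher': [0.00, 2.00, 0.20],
})

sum_appliances = toy[['HeatPump', 'Dishwasher']].sum(axis=1)
ghost_load = toy['Aggregate'] - sum_appliances  # unmetered remainder per row
ghost_pct = ghost_load.mean() / toy['Aggregate'].mean() * 100

# Rows where the submeters add up to more than the aggregate
exceed_rows = int((sum_appliances > toy['Aggregate']).sum())
print(f'Ghost load: {ghost_load.mean():.3f} kW ({ghost_pct:.1f}%), exceed rows: {exceed_rows}')
```

The third row has a negative ghost load because the two appliances sum to more than the aggregate; such rows are what the "Sum > Aggregate" count reports.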

In [None]:
print('='*70)
print('👻 GHOST LOAD ANALYSIS')
print('='*70)

# Appliance columns for sum
appliance_cols = ['HeatPump', 'Dishwasher', 'WashingMachine', 'Dryer', 'Oven', 
                  'Stove', 'RangeHood', 'EVCharger', 'EVSocket', 'GarageCabinet', 'RainwaterPump']
existing_appliances = [c for c in appliance_cols if c in df.columns]

# Calculate sum of appliances
df['_sum_appliances'] = df[existing_appliances].sum(axis=1)
df['_ghost_load'] = df['Aggregate'] - df['_sum_appliances']

agg_mean = df['Aggregate'].mean()
sum_mean = df['_sum_appliances'].mean()
ghost_mean = df['_ghost_load'].mean()
ghost_pct = ghost_mean / agg_mean * 100 if agg_mean > 0 else 0

print(f'\n📊 Energy Balance:')
print(f'   Aggregate mean:      {agg_mean:.4f} kW')
print(f'   Sum(Appliances):     {sum_mean:.4f} kW')
print(f'   Ghost Load:          {ghost_mean:.4f} kW ({ghost_pct:.1f}%)')

# Correlation
corr = df['Aggregate'].corr(df['_sum_appliances'])
print(f'\n   Correlation Aggregate vs Sum: {corr:.4f}')

# Analysis of exceed cases
exceed_cases = (df['_sum_appliances'] > df['Aggregate']).sum()
exceed_pct = exceed_cases / len(df) * 100
print(f'\n   Cases where Sum > Aggregate: {exceed_cases:,} ({exceed_pct:.1f}%)')

# Drop temporary columns
df = df.drop(columns=['_sum_appliances', '_ghost_load'])

---
## 9. Visualization

In [None]:
# Sample for visualization
sample_size = min(10000, len(df))
df_sample = df.sample(sample_size, random_state=42).sort_values('Time')

# Recalculate sum for visualization
appliance_cols = ['HeatPump', 'Dishwasher', 'WashingMachine', 'Dryer', 'Oven', 
                  'Stove', 'RangeHood', 'EVCharger', 'EVSocket', 'GarageCabinet', 'RainwaterPump']
existing_appliances = [c for c in appliance_cols if c in df_sample.columns]
df_sample['_sum'] = df_sample[existing_appliances].sum(axis=1)

fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# 1. Time series: Aggregate vs Sum
ax1 = axes[0]
ax1.plot(df_sample['Time'], df_sample['Aggregate'], alpha=0.7, label='Aggregate', linewidth=0.5)
ax1.plot(df_sample['Time'], df_sample['_sum'], alpha=0.7, label='Sum(Appliances)', linewidth=0.5)
ax1.set_xlabel('Time')
ax1.set_ylabel('Power (kW)')
ax1.set_title('Aggregate vs Sum of Appliances (random sample of 1sec data)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Scatter: Aggregate vs Sum
ax2 = axes[1]
ax2.scatter(df_sample['_sum'], df_sample['Aggregate'], alpha=0.3, s=5)
max_val = max(df_sample['Aggregate'].max(), df_sample['_sum'].max())
ax2.plot([0, max_val], [0, max_val], 'r--', label='y=x (perfect match)')
ax2.set_xlabel('Sum(Appliances) (kW)')
ax2.set_ylabel('Aggregate (kW)')
corr = df_sample['Aggregate'].corr(df_sample['_sum'])
ax2.set_title(f'Correlation: {corr:.3f}')
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. Solar and Grid
ax3 = axes[2]
if 'Solar' in df_sample.columns:
    ax3.plot(df_sample['Time'], df_sample['Solar'], alpha=0.7, label='Solar', linewidth=0.5, color='orange')
if 'Grid' in df_sample.columns:
    ax3.plot(df_sample['Time'], df_sample['Grid'], alpha=0.7, label='Grid', linewidth=0.5, color='blue')
ax3.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax3.set_xlabel('Time')
ax3.set_ylabel('Power (kW)')
ax3.set_title('Solar Generation and Grid Exchange (negative = export)')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 10. Final Null Handling
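Row-limited filling only closes short runs of missing values; longer runs stay partly null and are dropped afterwards. A minimal sketch with a toy series, where the limit of 2 is chosen for illustration only:

```python
import numpy as np
import pandas as pd

# Toy series: a run of 5 missing values and a single missing value
s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 2.0, np.nan, 3.0])

# Fill at most 2 consecutive rows forward, then at most 2 backward
filled = s.ffill(limit=2).bfill(limit=2)

remaining = int(filled.isna().sum())
print(filled.tolist())
print(f'Remaining nulls: {remaining}')
```

The single missing value is filled, while the middle of the 5-row run stays null and would be removed by the final `dropna()`.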

In [None]:
print('='*70)
print('🔧 FINAL NULL HANDLING')
print('='*70)

# Check for nulls
null_counts = df.isna().sum()
total_nulls = null_counts.sum()

print(f'\nNull values before handling: {total_nulls}')

if total_nulls > 0:
    for col in df.columns:
        if null_counts[col] > 0:
            print(f'   {col}: {null_counts[col]:,}')
    
    # Forward fill, then backward fill, small gaps (up to 1 minute = 60 rows at 1sec)
    df = df.ffill(limit=60)
    df = df.bfill(limit=60)
    
    # Drop remaining
    remaining_nulls = df.isna().sum().sum()
    if remaining_nulls > 0:
        print(f'\n   Remaining nulls after interpolation: {remaining_nulls}')
        df = df.dropna()
        print(f'   After dropping: {len(df):,} rows')
    else:
        print(f'\n   ✅ All nulls filled')
else:
    print('   ✅ No null values!')

print(f'\nFinal dataset: {len(df):,} rows')

---
## 11. Export Dataset

In [None]:
print('='*70)
print('💾 EXPORTING DATASET (1-SECOND RESOLUTION)')
print('='*70)

# Reorder columns for consistency
temporal_cols = ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'month_sin', 'month_cos']

# Define column order
appliance_order = ['Aggregate', 'HeatPump', 'Dishwasher', 'WashingMachine', 'Dryer', 
                   'Oven', 'Stove', 'RangeHood', 'EVCharger', 'EVSocket', 
                   'GarageCabinet', 'RainwaterPump']

# Note: Battery excluded - only present in Nov-Dec 2025 which is outside usable periods
energy_flow = ['Solar', 'Grid']

# Filter to existing columns
appliance_order = [c for c in appliance_order if c in df.columns]
energy_flow = [c for c in energy_flow if c in df.columns]

final_cols = ['Time'] + appliance_order + energy_flow + temporal_cols
df = df[final_cols]

print(f'\nFinal dataset:')
print(f'   Rows: {len(df):,}')
print(f'   Columns: {len(df.columns)}')
print(f'   Resolution: 1 SECOND (native, no resampling)')
print(f'   Time range: {df["Time"].min()} → {df["Time"].max()}')
print(f'   Columns: {list(df.columns)}')

# Export - using parquet for efficiency with large dataset
parquet_path = OUTPUT_DIR / 'nilm_ready_1sec.parquet'
df.to_parquet(parquet_path, index=False)

print(f'\n✅ Exported:')
print(f'   Parquet: {parquet_path}')
print(f'   Size: {parquet_path.stat().st_size / 1e9:.2f} GB')

# Also save metadata
metadata = {
    'rows': len(df),
    'columns': list(df.columns),
    'time_start': str(df['Time'].min()),
    'time_end': str(df['Time'].max()),
    'resolution': '1 second (native)',
    'source': '1sec_new (filtered to usable periods)',
    'usable_periods': [
        {'period': 'A', 'start': '2024-04-15', 'end': '2024-05-31', 'days': 46},
        {'period': 'B', 'start': '2024-07-01', 'end': '2024-09-30', 'days': 92},
        {'period': 'C', 'start': '2024-10-09', 'end': '2025-09-30', 'days': 356}
    ],
    'preprocessing_notes': [
        'Applied clip(lower=0) to all appliances (sensor offset removal, NOT abs(); dishwasher, washing_machine, stove show small negatives when OFF)',
        'Fixed double counting: garage_cabinet -= ev_charger + ev_socket (aligned with 15-min)',
        'Applied noise thresholding: values < 5W → 0 (sensor noise when OFF)',
        'Kept negative values for solar, grid, battery (bidirectional energy flow)',
        'Battery column excluded (only in Nov-Dec 2025 outside usable periods)',
        'EVCharger/EVSocket filled with 0 before Aug 2024',
        'Missing columns filled with 0',
        'NO RESAMPLING - kept native 1-second resolution for best NILM accuracy',
        'Added cyclical temporal features (hour, dow, month sin/cos)',
        'Filtered to 3 continuous gap-free periods (494 days total)'
    ],
    'sensor_offset_columns': ['dishwasher', 'washing_machine', 'stove', 'oven'],
    'noise_threshold_kw': 0.005
}

import json
with open(OUTPUT_DIR / 'dataset_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f'   Metadata: {OUTPUT_DIR / "dataset_metadata.json"}')

In [None]:
# Final summary
print('='*70)
print('📊 FINAL SUMMARY')
print('='*70)
print()
print(df.describe().round(4).to_string())

---
## ✅ Dataset Ready for Training

### Output Specifications
| Aspect | Value |
|--------|-------|
| **Resolution** | **1 second (native)** |
| Total rows | ~40M |
| Usable periods | 494 days from 3 continuous periods |
| Columns | 21 (Time + Aggregate + 11 appliances + Solar/Grid + 6 temporal) |

### Why 1-Second Resolution?
According to NILM literature:
- **Better accuracy** for short-cycle appliances
- **Standard for state-of-the-art** Transformer models (ELECTRIcity, Energformer)
- **Preserves transient events** that get smoothed out at 10s

### Usable Periods (Gap-Free)
| Period | Start | End | Duration |
|--------|-------|-----|----------|
| A | 2024-04-15 | 2024-05-31 | 46 days |
| B | 2024-07-01 | 2024-09-30 | 92 days |
| C | 2024-10-09 | 2025-09-30 | 356 days |

### Appliances (11)
1. HeatPump
2. Dishwasher
3. WashingMachine
4. Dryer
5. Oven
6. Stove
7. RangeHood
8. EVCharger (from Aug 2024)
9. EVSocket (from Aug 2024)
10. GarageCabinet
11. RainwaterPump

### Recommended Train/Val/Test Split (in Pretraining Notebook)
- **Train**: All Period A + Period B + Period C until June 2025 (~13 months of data)
- **Validation**: July - August 2025 (2 months)
- **Test**: September 2025 (1 month)
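
As a sketch of how this chronological split could be expressed, assuming a frame with a `Time` column like the exported dataset. The boundary dates follow the list above; the function `split_by_time` is hypothetical and the pretraining notebook may implement the split differently.

```python
import pandas as pd

def split_by_time(df, val_start='2025-07-01', test_start='2025-09-01'):
    """Chronological split: train < val_start <= validation < test_start <= test."""
    val_start, test_start = pd.Timestamp(val_start), pd.Timestamp(test_start)
    train = df[df['Time'] < val_start]
    val = df[(df['Time'] >= val_start) & (df['Time'] < test_start)]
    test = df[df['Time'] >= test_start]
    return train, val, test

# Toy frame with one row per day across both boundaries
demo = pd.DataFrame({'Time': pd.date_range('2025-06-29', '2025-09-02', freq='D')})
train, val, test = split_by_time(demo)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps validation and test data strictly later than the training data.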