# 1.2.4.2 Feature Definition & Exploration Plan

This notebook defines the feature engineering strategy and exploratory data analysis (EDA) plan for the AB Data Challenge project.

## Objectives
- Define comprehensive feature families for anomaly detection
- Create detailed EDA plan for Iteration 2
- Establish data cleaning rules and preprocessing requirements
- Plan feature engineering approach for model development


## Feature Families for Anomaly Detection

### 1. Temporal Window Features
These features capture consumption patterns over different time windows:

#### Short-term Windows (1-24 hours)
- **Rolling Mean**: 1h, 3h, 6h, 12h, 24h rolling averages
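- **Rolling Std**: 1h, 3h, 6h, 12h, 24h rolling standard deviations (with hourly readings a 1h window holds a single value, so its standard deviation is undefined; the 1h window is only meaningful for sub-hourly data)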
- **Rolling Min/Max**: 1h, 3h, 6h, 12h, 24h rolling minimums and maximums
- **Rolling Percentiles**: 25th, 50th, 75th, 90th, 95th percentiles over 24h window
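
A minimal sketch of the short-term window features, assuming one consumption series indexed by a sorted `DatetimeIndex`. The function name `rolling_window_features` and the output column names are illustrative and not part of the project code. Time-based windows such as `"3h"` cover the interval `(t - 3h, t]`, so they remain correct when some readings are missing.

```python
import pandas as pd

def rolling_window_features(series: pd.Series,
                            windows=("3h", "6h", "12h", "24h")) -> pd.DataFrame:
    """Rolling statistics for one consumption series on a sorted DatetimeIndex."""
    features = pd.DataFrame(index=series.index)
    for window in windows:
        rolled = series.rolling(window)  # time-based window, closed on the right
        features[f"mean_{window}"] = rolled.mean()
        features[f"std_{window}"] = rolled.std()  # NaN when the window holds one value
        features[f"min_{window}"] = rolled.min()
        features[f"max_{window}"] = rolled.max()
    # Percentiles are computed on the 24h window only
    rolled_24h = series.rolling("24h")
    for pct in (25, 50, 75, 90, 95):
        features[f"p{pct}_24h"] = rolled_24h.quantile(pct / 100)
    return features
```

For a multi-municipality frame, one option is to call the function once per group, for example inside a loop over `df.groupby("municipality")`, after setting `timestamp` as the index of each group.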

#### Medium-term Windows (1-7 days)
- **Daily Aggregates**: Mean, median, std, min, max consumption per day
- **Daily Patterns**: Hourly consumption patterns within each day
- **Weekend vs Weekday**: Different patterns for weekend vs weekday consumption
- **Rolling 7-day**: 7-day rolling statistics for trend analysis

#### Long-term Windows (1-12 months)
- **Weekly Aggregates**: Mean, median, std consumption per week
- **Monthly Aggregates**: Mean, median, std consumption per month
- **Seasonal Patterns**: Quarterly and seasonal consumption trends
- **Year-over-Year**: Comparison with same period previous year

### 2. Baseline Features
These features establish normal consumption patterns:

#### Municipality-specific Baselines
- **Historical Mean**: Long-term average consumption per municipality
- **Historical Median**: Long-term median consumption per municipality
- **Historical Std**: Long-term standard deviation per municipality
- **Percentile Baselines**: 25th, 50th, 75th, 90th, 95th percentiles

#### Time-based Baselines
- **Hourly Baseline**: Average consumption for each hour of day
- **Daily Baseline**: Average consumption for each day of week
- **Monthly Baseline**: Average consumption for each month of year
- **Seasonal Baseline**: Average consumption for each season
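
The time-based baselines can be sketched as group-wise means. This example assumes a long-format frame with `municipality`, `timestamp` (datetime) and `consumption` columns, the same names used in the visualization cell at the end of this notebook; `add_time_baselines` and the new column names are hypothetical.

```python
import pandas as pd

def add_time_baselines(df: pd.DataFrame) -> pd.DataFrame:
    """Attach per-municipality mean consumption by hour, weekday and month."""
    out = df.copy()
    out["hour"] = out["timestamp"].dt.hour
    out["day_of_week"] = out["timestamp"].dt.dayofweek  # Monday = 0
    out["month"] = out["timestamp"].dt.month
    for key in ("hour", "day_of_week", "month"):
        out[f"{key}_baseline"] = (
            out.groupby(["municipality", key])["consumption"].transform("mean")
        )
    # Signed distance of each reading from its hour-of-day baseline
    out["hour_baseline_deviation"] = out["consumption"] - out["hour_baseline"]
    return out
```

Note that these baselines are computed over the whole frame, so each one includes the reading it is compared against as well as later data. For model evaluation it is probably safer to fit the baselines on the training period only and join them onto later periods.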

### 3. Variability Features
These features measure consumption variability and stability:

#### Statistical Variability
- **Coefficient of Variation**: Std/Mean ratio for different time windows
- **Range**: Max - Min consumption over time windows
- **Interquartile Range**: 75th - 25th percentile range
- **Skewness**: Distribution asymmetry measure
- **Kurtosis**: Distribution tail heaviness measure
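
As a rough illustration, the statistical variability measures for one window of readings could be computed as follows. The helper name `variability_features` is an assumption; the pandas methods are standard, and `Series.kurt` returns excess kurtosis (a normal distribution scores 0).

```python
import pandas as pd

def variability_features(values: pd.Series) -> dict:
    """Variability summary for one window of consumption readings."""
    mean = values.mean()
    std = values.std()
    q25, q75 = values.quantile([0.25, 0.75])
    return {
        # Coefficient of variation is undefined when the mean is zero
        "cv": std / mean if mean != 0 else float("nan"),
        "range": values.max() - values.min(),
        "iqr": q75 - q25,
        "skewness": values.skew(),
        "kurtosis": values.kurt(),  # excess kurtosis
    }
```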

#### Stability Measures
- **Consumption Stability**: Variance in consumption over time
- **Pattern Consistency**: Consistency of daily/weekly patterns
- **Trend Stability**: Stability of consumption trends
- **Volatility**: Rate of change in consumption

### 4. Seasonality Features
These features capture seasonal and cyclical patterns:

#### Cyclical Components
- **Hour of Day**: Cyclical encoding of hour (sin/cos)
- **Day of Week**: Cyclical encoding of weekday (sin/cos)
- **Day of Month**: Cyclical encoding of day in month
- **Month of Year**: Cyclical encoding of month (sin/cos)
- **Day of Year**: Cyclical encoding of day in year
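
Cyclical encoding maps a periodic value `v` with period `P` to the pair `(sin(2πv/P), cos(2πv/P))`, so that hour 23 and hour 0 end up close together. The sketch below assumes a datetime column named `timestamp`; the function name and output column names are illustrative. Day of month and day of year have a period that varies by row (28 to 31 days, 365 or 366 days), which the sketch handles with per-row periods.

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Add sin/cos encodings of the calendar components of `ts_col`."""
    out = df.copy()
    ts = out[ts_col]
    # (zero-based value, period); the period may be a scalar or a per-row Series
    components = {
        "hour": (ts.dt.hour, 24),
        "dow": (ts.dt.dayofweek, 7),
        "dom": (ts.dt.day - 1, ts.dt.days_in_month),
        "month": (ts.dt.month - 1, 12),
        "doy": (ts.dt.dayofyear - 1, 365 + ts.dt.is_leap_year.astype(int)),
    }
    for name, (value, period) in components.items():
        angle = 2 * np.pi * value / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)
    return out
```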

#### Seasonal Indicators
- **Season**: Spring, Summer, Fall, Winter indicators
- **Quarter**: Q1, Q2, Q3, Q4 indicators
- **Holiday Periods**: Special holiday and vacation periods
- **Weather Season**: Hot/Cold weather periods

### 5. Aggregate Features
These features combine multiple data points:

#### Time-based Aggregates
- **Daily Totals**: Total consumption per day
- **Weekly Totals**: Total consumption per week
- **Monthly Totals**: Total consumption per month
- **Peak Consumption**: Maximum consumption in time windows
- **Off-peak Consumption**: Minimum consumption in time windows

#### Statistical Aggregates
- **Moving Averages**: Different window sizes for trend analysis
- **Exponential Smoothing**: Weighted averages with decay
- **Cumulative Sums**: Running totals over time
- **Growth Rates**: Period-over-period growth rates
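
A small sketch of three of the statistical aggregates, assuming a series of daily totals. The smoothing factor `alpha` and the function name are illustrative choices, not values fixed by this plan. With `adjust=False`, pandas applies the recursive form `s_t = alpha * x_t + (1 - alpha) * s_{t-1}`.

```python
import pandas as pd

def aggregate_trend_features(daily_total: pd.Series, alpha: float = 0.3) -> pd.DataFrame:
    """Exponential smoothing, running total and day-over-day growth of daily totals."""
    return pd.DataFrame({
        "ewm_mean": daily_total.ewm(alpha=alpha, adjust=False).mean(),
        "cumulative": daily_total.cumsum(),
        # fill_method=None: do not forward-fill gaps before computing the change
        "growth_rate": daily_total.pct_change(fill_method=None),
    })
```

Growth rates become infinite after a zero total, so zero-consumption days may need separate handling before this feature is used.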

### 6. Data Quality Features
These features help identify data quality issues:

#### Completeness Features
- **Missing Value Count**: Number of missing values in time windows
- **Missing Value Rate**: Percentage of missing values
- **Data Availability**: Percentage of available data points
- **Gap Length**: Length of consecutive missing value periods
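
Gap length can be derived by labelling runs of consecutive missing values. The sketch below assumes the series already sits on a complete, regularly spaced time grid, so that a missing reading appears as `NaN` rather than as an absent row; `gap_lengths` is a hypothetical helper name.

```python
import pandas as pd

def gap_lengths(values: pd.Series) -> pd.Series:
    """Length of the missing-value run each row belongs to (0 for observed rows)."""
    missing = values.isna()
    # A new run starts wherever the missing flag differs from the previous row
    run_id = (missing != missing.shift()).cumsum()
    run_size = missing.groupby(run_id).transform("size")
    return run_size.where(missing, 0).astype(int)
```

On a series with a `DatetimeIndex`, the missing value rate over a trailing window can be obtained in a similar spirit with `values.isna().rolling("24h").mean()`.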

#### Consistency Features
- **Value Consistency**: Consistency of values within time windows
- **Pattern Consistency**: Consistency of consumption patterns
- **Timestamp Consistency**: Consistency of timestamp intervals
- **Range Consistency**: Consistency of value ranges

#### Anomaly Indicators
- **Negative Value Count**: Number of negative consumption values
- **Zero Value Count**: Number of zero consumption values
- **Extreme Value Count**: Number of extreme consumption values
- **Outlier Indicators**: Statistical outlier detection flags


## EDA Plan for Iteration 2

### 1. Data Coverage Analysis
- **Temporal Coverage**: Analyze data availability across time periods
- **Municipality Coverage**: Assess data completeness by municipality
- **Seasonal Coverage**: Evaluate data availability across seasons
- **Gap Analysis**: Identify and quantify data gaps and missing periods

### 2. Distribution Analysis
- **Consumption Distributions**: Analyze consumption value distributions by municipality
- **Temporal Distributions**: Examine consumption patterns across different time periods
- **Outlier Analysis**: Identify and analyze extreme consumption values
- **Normality Tests**: Assess distribution normality and transformation needs

### 3. Stability Analysis
- **Trend Analysis**: Identify long-term consumption trends
- **Seasonality Analysis**: Detect seasonal patterns and cycles
- **Volatility Analysis**: Measure consumption volatility over time
- **Pattern Consistency**: Assess consistency of consumption patterns

### 4. Correlation Analysis
- **Feature Correlations**: Analyze correlations between different features
- **Temporal Correlations**: Examine autocorrelations and lag relationships
- **Cross-Municipality Correlations**: Identify correlations between municipalities
- **External Factor Correlations**: Explore correlations with external factors
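
For the temporal correlations, a quick first look is the autocorrelation at lags that match the expected cycles. The lags below (1 hour, 24 hours, 168 hours) are an assumption based on hourly readings, and the example uses synthetic data rather than project data.

```python
import numpy as np
import pandas as pd

def lag_autocorrelations(series: pd.Series, lags=(1, 24, 168)) -> dict:
    """Pearson autocorrelation of a regularly spaced series at the given lags."""
    return {lag: series.autocorr(lag=lag) for lag in lags}

# Synthetic check: a pure daily cycle repeats after 24 steps and inverts after 12,
# so the lag-24 value is close to 1 and the lag-12 value is close to -1.
hours = np.arange(24 * 14)
daily_cycle = pd.Series(np.sin(2 * np.pi * hours / 24))
cycle_acf = lag_autocorrelations(daily_cycle, lags=(12, 24))
```

`Series.autocorr` shifts by row count, not by time, so it is only meaningful after the series has been placed on a gap-free hourly grid.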

### 5. Anomaly Pattern Analysis
- **Anomaly Frequency**: Analyze frequency and patterns of anomalies
- **Anomaly Clustering**: Identify clusters and patterns in anomalous data
- **Anomaly Severity**: Categorize anomalies by severity and impact
- **Anomaly Context**: Analyze context and conditions surrounding anomalies

### 6. Feature Engineering Validation
- **Feature Importance**: Assess importance of different feature families
- **Feature Stability**: Evaluate stability of engineered features
- **Feature Interactions**: Identify important feature interactions
- **Feature Redundancy**: Detect and address redundant features


## Handover Notes: Cleaning Rules → Feature Engineering

### Data Cleaning Rules for Iteration 2

#### 1. Missing Value Handling
- **Strategy**: Forward fill for short gaps (< 6 hours), interpolation for medium gaps (6-24 hours), municipality-specific baseline for long gaps (> 24 hours)
- **Validation**: Flag records with missing values for quality assessment
- **Documentation**: Track missing value patterns and imputation methods
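
The tiered strategy above could be sketched as follows, assuming an hourly series on a complete, sorted `DatetimeIndex` (so gap length in rows equals gap length in hours) and a precomputed `baseline` series on the same index. The function name, the `baseline` argument and the label strings are hypothetical.

```python
import pandas as pd

def impute_by_gap_length(values: pd.Series, baseline: pd.Series) -> pd.DataFrame:
    """Impute missing readings with a method chosen by the length of each gap."""
    missing = values.isna()
    run_id = (missing != missing.shift()).cumsum()
    gap_len = missing.groupby(run_id).transform("size").where(missing, 0)

    short = missing & (gap_len < 6)                        # < 6 hours
    medium = missing & (gap_len >= 6) & (gap_len <= 24)    # 6-24 hours
    long_gap = missing & (gap_len > 24)                    # > 24 hours

    filled = values.copy()
    filled[short] = values.ffill()[short]
    # Time-weighted linear interpolation between the surrounding observations
    filled[medium] = values.interpolate(method="time", limit_area="inside")[medium]
    filled[long_gap] = baseline[long_gap]

    # Keep the flag and the method so imputed rows stay traceable
    method = pd.Series("observed", index=values.index)
    method[short] = "ffill"
    method[medium] = "interpolate"
    method[long_gap] = "baseline"
    return pd.DataFrame({"consumption": filled,
                         "was_missing": missing,
                         "imputation": method})
```

A gap at the very start or end of the series has no neighbour on one side, so forward fill or interpolation can leave it empty; such rows would need a fallback, for example the baseline.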

#### 2. Negative Value Treatment
- **Identification**: Flag all negative consumption values as data quality issues
- **Treatment**: Replace with municipality-specific median or set to zero
- **Validation**: Cross-check with meter readings and maintenance records
- **Documentation**: Log all negative value corrections

#### 3. Outlier Detection and Treatment
- **Statistical Outliers**: Use the IQR method (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) for initial detection
- **Business Logic Outliers**: Flag values > 3x municipality-specific 95th percentile
- **Treatment**: Cap outliers at 99th percentile or investigate further
- **Validation**: Manual review of extreme outliers
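
A minimal sketch of the detection and capping rules for the readings of a single municipality. It applies the standard two-sided IQR fences, the 3x 95th percentile business rule and the 99th percentile cap; the function and column names are assumptions, and the thresholds are exposed as parameters so they can be tuned per municipality.

```python
import pandas as pd

def flag_and_cap_outliers(values: pd.Series, k: float = 1.5,
                          business_multiplier: float = 3.0,
                          cap_quantile: float = 0.99) -> pd.DataFrame:
    """Flag statistical and business-rule outliers and return a capped copy."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    statistical = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    business = values > business_multiplier * values.quantile(0.95)
    # Capping only limits the upper tail; flagged rows stay available for review
    capped = values.clip(upper=values.quantile(cap_quantile))
    return pd.DataFrame({
        "consumption": values,
        "is_statistical_outlier": statistical,
        "is_business_outlier": business,
        "consumption_capped": capped,
    })
```

Because the quantiles are computed from the same data that contains the outliers, a heavily contaminated series shifts its own thresholds. Computing them on a trusted reference period is one possible alternative.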

#### 4. Timestamp Validation
- **Consistency**: Ensure hourly intervals are consistent
- **Completeness**: Identify and flag missing time periods
- **Validation**: Check for future dates and invalid timestamps
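- **Correction**: Reindex to the full hourly grid so that missing timestamps become explicit gaps (then imputed under the missing value rules), or flag the period for exclusion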
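
The timestamp checks can be sketched as a small report function. It assumes timezone-naive timestamps on an hourly grid; `validate_hourly_timestamps`, the `now` parameter and the report keys are illustrative names.

```python
import pandas as pd

def validate_hourly_timestamps(timestamps: pd.Series, now=None) -> dict:
    """Count invalid, future, duplicated and off-grid timestamps; list missing hours."""
    now = pd.Timestamp.now() if now is None else pd.Timestamp(now)
    parsed = pd.to_datetime(timestamps, errors="coerce")  # unparseable values become NaT
    invalid = parsed.isna()
    future = parsed > now
    valid = parsed[~invalid & ~future]
    report = {
        "n_invalid": int(invalid.sum()),
        "n_future": int(future.sum()),
        "n_duplicated": int(valid.duplicated().sum()),
        "n_off_grid": int((valid != valid.dt.floor("h")).sum()),
        "missing_hours": pd.DatetimeIndex([]),
    }
    if not valid.empty:
        # Every hour between the first and last valid reading should be present
        expected = pd.date_range(valid.min().floor("h"), valid.max().floor("h"), freq="h")
        report["missing_hours"] = expected.difference(pd.DatetimeIndex(valid.dt.floor("h")))
    return report
```

In a multi-municipality frame the check would be run per municipality, since the same timestamp legitimately appears once for each of them.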

#### 5. Municipality-specific Rules
- **Barcelona**: Higher consumption thresholds, more complex patterns
- **L'Hospitalet**: Moderate consumption patterns, industrial areas
- **Santa Coloma**: Lower consumption, residential focus
- **Viladecans**: Mixed patterns, smaller dataset

### Feature Engineering Pipeline

#### Phase 1: Basic Features (Week 1)
1. **Temporal Features**: Hour, day, month, season extraction
2. **Rolling Statistics**: 1h, 6h, 24h rolling means and standard deviations
3. **Baseline Features**: Municipality-specific historical averages
4. **Data Quality Features**: Missing value flags, outlier indicators

#### Phase 2: Advanced Features (Week 2)
1. **Seasonality Features**: Cyclical encoding, seasonal indicators
2. **Variability Features**: Coefficient of variation, stability measures
3. **Aggregate Features**: Daily/weekly/monthly totals and patterns
4. **Interaction Features**: Municipality × time interactions

#### Phase 3: Model-specific Features (Week 3)
1. **Anomaly Features**: Distance from baseline, z-scores
2. **Trend Features**: Growth rates, trend indicators
3. **Pattern Features**: Consistency measures, pattern deviations
4. **Ensemble Features**: Combined feature scores

### Quality Assurance

#### Feature Validation
- **Range Checks**: Ensure features are within expected ranges
- **Consistency Checks**: Validate feature consistency across time
- **Correlation Analysis**: Identify and address feature redundancy
- **Performance Impact**: Monitor feature engineering impact on model performance

#### Documentation Requirements
- **Feature Definitions**: Clear documentation of all engineered features
- **Transformation Logic**: Document all data transformations
- **Validation Results**: Record all validation and quality checks
- **Performance Metrics**: Track feature engineering impact on model performance

### Success Criteria

#### Technical Criteria
- **Feature Completeness**: All planned features implemented and validated
- **Data Quality**: <5% missing values, <1% negative values, <2% outliers
- **Feature Stability**: Features stable across different time periods
- **Performance Impact**: Features improve model performance by >10%

#### Business Criteria
- **Interpretability**: Features are interpretable and explainable
- **Scalability**: Feature engineering pipeline can handle larger datasets
- **Maintainability**: Code is well-documented and maintainable
- **Reproducibility**: Results are reproducible across different runs


```python
# Optional: simple deterministic overview plots, drawn only if the sample data is available
import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data_path = '../data/dataset_sample.parquet'

try:
    if os.path.exists(data_path):
        df = pd.read_parquet(data_path)
        print("✓ Data loaded successfully for visualization")

        # Parse timestamps up front so that the .dt accessor and sorting work
        if 'timestamp' in df.columns:
            df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

        # Set up the plotting style
        plt.style.use('default')
        sns.set_palette("husl")

        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        # Plot 1: Consumption distribution by municipality
        if 'municipality' in df.columns and 'consumption' in df.columns:
            # Strictly positive readings only; zeros and negatives are data quality cases
            consumption_clean = df[df['consumption'].notna() & (df['consumption'] > 0)]
            if len(consumption_clean) > 0:
                consumption_clean.boxplot(column='consumption', by='municipality', ax=axes[0, 0])
                axes[0, 0].set_title('Consumption Distribution by Municipality')
                axes[0, 0].set_xlabel('Municipality')
                axes[0, 0].set_ylabel('Consumption')
                axes[0, 0].tick_params(axis='x', rotation=45)

        # Plot 2: Time series of consumption (sample)
        if 'timestamp' in df.columns and 'consumption' in df.columns:
            # First 1000 rows of the file, plotted in chronological order
            sample_data = (
                df.head(1000)
                  .dropna(subset=['timestamp', 'consumption'])
                  .sort_values('timestamp')
            )
            if len(sample_data) > 0:
                axes[0, 1].plot(sample_data['timestamp'], sample_data['consumption'], alpha=0.7)
                axes[0, 1].set_title('Consumption Time Series (Sample)')
                axes[0, 1].set_xlabel('Time')
                axes[0, 1].set_ylabel('Consumption')
                axes[0, 1].tick_params(axis='x', rotation=45)

        # Plot 3: Missing values by municipality
        if 'municipality' in df.columns:
            # Missing cells per row, summed within each municipality
            missing_by_municipality = df.isnull().sum(axis=1).groupby(df['municipality']).sum()
            missing_by_municipality.plot(kind='bar', ax=axes[1, 0])
            axes[1, 0].set_title('Missing Values by Municipality')
            axes[1, 0].set_xlabel('Municipality')
            axes[1, 0].set_ylabel('Missing Values Count')
            axes[1, 0].tick_params(axis='x', rotation=45)

        # Plot 4: Data availability over time
        if 'timestamp' in df.columns:
            # Count records per calendar month without adding a helper column to df
            monthly_counts = df.groupby(df['timestamp'].dt.to_period('M')).size()
            monthly_counts.plot(kind='line', ax=axes[1, 1])
            axes[1, 1].set_title('Data Availability Over Time')
            axes[1, 1].set_xlabel('Month')
            axes[1, 1].set_ylabel('Record Count')
            axes[1, 1].tick_params(axis='x', rotation=45)

        # pandas' boxplot(by=...) replaces the figure title, so set it last
        fig.suptitle('Water Consumption Data Overview', fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.show()

        print("✓ Visualization completed successfully")

    else:
        print("⚠ Dataset file not found - skipping visualization")

except Exception as e:
    print(f"⚠ Error during visualization: {e}")
    print("  This is expected if the dataset is not available or has a different structure")
```
