# Energy Consumption Predictor - Complete Workflow

This notebook demonstrates the complete workflow of the Energy Consumption Predictor project, from data exploration to model training and prediction.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading and Exploration](#data-exploration)
3. [Data Preprocessing](#preprocessing)
4. [Feature Engineering](#feature-engineering)
5. [Model Training and Evaluation](#modeling)
6. [Predictions and Visualization](#predictions)
7. [API Usage Examples](#api-usage)
8. [Results Analysis](#analysis)

## 1. Setup and Imports {#setup}

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
import sys
import os
sys.path.append('../src')

from data_preprocessing import DataPreprocessor
from feature_engineering import FeatureEngineer
from modeling import EnergyPredictionPipeline
from train_pipeline import ModelTrainer
from visualization import EnergyVisualizationSuite
from utils import load_config, Timer

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

print("Setup completed successfully!")

## 2. Data Loading and Exploration {#data-exploration}

In [None]:
# Load raw data
electricity_path = '../data/raw/electricity_consumption.csv'
weather_path = '../data/raw/weather_data.csv'

# Check if sample data exists, if not generate it
if not os.path.exists(electricity_path) or not os.path.exists(weather_path):
    print("Sample data not found. Generating sample data...")
    import subprocess
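    # check=True raises if the generator script fails, instead of failing later on read_csv
    subprocess.run([sys.executable, '../generate_sample_data.py'], check=True)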

# Load electricity consumption data
electricity_df = pd.read_csv(electricity_path)
electricity_df['timestamp'] = pd.to_datetime(electricity_df['timestamp'])
electricity_df.set_index('timestamp', inplace=True)

# Load weather data
weather_df = pd.read_csv(weather_path)
weather_df['timestamp'] = pd.to_datetime(weather_df['timestamp'])
weather_df.set_index('timestamp', inplace=True)

print(f"Electricity data shape: {electricity_df.shape}")
print(f"Weather data shape: {weather_df.shape}")
print(f"\nElectricity data columns: {list(electricity_df.columns)}")
print(f"Weather data columns: {list(weather_df.columns)}")

In [None]:
# Display basic statistics
print("Electricity Consumption Statistics:")
print(electricity_df.describe())

print("\nWeather Data Statistics:")
print(weather_df.describe())

In [None]:
# Visualize raw data
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Electricity consumption over time
axes[0, 0].plot(electricity_df.index, electricity_df['kwh'])
axes[0, 0].set_title('Electricity Consumption Over Time')
axes[0, 0].set_ylabel('kWh')
axes[0, 0].tick_params(axis='x', rotation=45)

# Temperature over time
axes[0, 1].plot(weather_df.index, weather_df['temperature_c'], color='red')
axes[0, 1].set_title('Temperature Over Time')
axes[0, 1].set_ylabel('Temperature (°C)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Consumption distribution
axes[1, 0].hist(electricity_df['kwh'], bins=30, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Distribution of Electricity Consumption')
axes[1, 0].set_xlabel('kWh')
axes[1, 0].set_ylabel('Frequency')

# Temperature vs Consumption scatter
merged_temp = pd.merge(electricity_df, weather_df[['temperature_c']], left_index=True, right_index=True, how='inner')
axes[1, 1].scatter(merged_temp['temperature_c'], merged_temp['kwh'], alpha=0.6)
axes[1, 1].set_title('Temperature vs Electricity Consumption')
axes[1, 1].set_xlabel('Temperature (°C)')
axes[1, 1].set_ylabel('kWh')

plt.tight_layout()
plt.show()

## 3. Data Preprocessing {#preprocessing}

In [None]:
# Initialize data preprocessor
preprocessor = DataPreprocessor(electricity_path, weather_path)

# Run complete preprocessing pipeline
with Timer("Data Preprocessing"):
    processed_data = preprocessor.preprocess_pipeline(
        missing_method='interpolate',
        frequency='H',
        outlier_method='iqr',
        remove_outliers=True
    )

print(f"Processed data shape: {processed_data.shape}")
print(f"Date range: {processed_data.index.min()} to {processed_data.index.max()}")

# Display processed data sample
print("\nProcessed Data Sample:")
print(processed_data.head())

In [None]:
# Get data summary
summary = preprocessor.get_data_summary()

print("Data Summary:")
print(f"Total records: {summary['total_records']}")
print(f"Duration: {summary['date_range']['duration_days']} days")
print(f"Average consumption: {summary['electricity_stats']['mean_kwh']:.2f} kWh")
print(f"Peak consumption: {summary['electricity_stats']['max_kwh']:.2f} kWh")
print(f"Min consumption: {summary['electricity_stats']['min_kwh']:.2f} kWh")
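
The internals of `DataPreprocessor.preprocess_pipeline` are not shown in this notebook. As a rough illustration of what the `frequency`, `outlier_method='iqr'`, and `missing_method='interpolate'` options usually mean, the sketch below resamples a synthetic series to hourly steps, masks IQR outliers, and interpolates the gaps. The function name `clean_hourly_series` and the `iqr_factor` parameter are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def clean_hourly_series(series: pd.Series, iqr_factor: float = 1.5) -> pd.Series:
    """Resample to hourly steps, mask IQR outliers, and interpolate gaps.

    Hypothetical helper for illustration only; the project's
    DataPreprocessor may work differently.
    """
    # Resample to a regular hourly grid ("60min" works across pandas versions)
    hourly = series.resample("60min").mean()

    # Classic IQR fences: anything outside [Q1 - k*IQR, Q3 + k*IQR] is an outlier
    q1, q3 = hourly.quantile(0.25), hourly.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    is_outlier = (hourly < lower) | (hourly > upper)

    # Replace outliers with NaN, then fill all gaps by time-based interpolation
    masked = hourly.mask(is_outlier)
    return masked.interpolate(method="time").ffill().bfill()


# Synthetic demo: a daily sine pattern with one spike and one missing value
index = pd.date_range("2024-01-01", periods=48, freq="60min")
demo = pd.Series(2.0 + np.sin(np.arange(48) * 2 * np.pi / 24), index=index)
demo.iloc[10] = 50.0    # injected outlier
demo.iloc[20] = np.nan  # injected gap

cleaned = clean_hourly_series(demo)
print(cleaned.describe())
```

Masking outliers before interpolating keeps the hourly index intact, which matters for the lag features built in the next section.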

## 4. Feature Engineering {#feature-engineering}

In [None]:
# Initialize feature engineer
feature_engineer = FeatureEngineer(target_col='kwh')

# Run complete feature engineering
with Timer("Feature Engineering"):
    engineered_data = feature_engineer.engineer_all_features(
        processed_data,
        lag_hours=[1, 2, 3, 6, 12, 24, 48, 72, 168],
        rolling_windows=[6, 12, 24, 48, 168],
        rolling_stats=['mean', 'std', 'min', 'max'],
        ema_alphas=[0.1, 0.3, 0.5, 0.7]
    )

print(f"Engineered data shape: {engineered_data.shape}")
print(f"Total features created: {len(feature_engineer.get_feature_names())}")

# Display feature groups
feature_groups = feature_engineer.get_feature_groups()
print("\nFeature Groups:")
for group, features in feature_groups.items():
    print(f"{group}: {len(features)} features")
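
`FeatureEngineer.engineer_all_features` is a project module whose code is not shown here. The sketch below shows one common way to build lag, rolling, and exponential moving average features with pandas. The column names `kwh_lag_24h`, `kwh_rolling_24h_mean`, `hour`, and `is_weekend` match those used later in this notebook; the function name, the EMA column name, and the choice to shift by one hour before rolling (to avoid using the current target value) are assumptions.

```python
import numpy as np
import pandas as pd


def add_basic_features(df, target_col="kwh", lag_hours=(1, 24),
                       rolling_windows=(24,), ema_alphas=(0.3,)):
    """Hypothetical sketch of lag, rolling, and EMA features on an hourly index."""
    out = df.copy()

    # Shift by one step so rolling/EMA statistics only see past values (assumption)
    past = out[target_col].shift(1)

    # Lag features: the target value N hours earlier
    for lag in lag_hours:
        out[f"{target_col}_lag_{lag}h"] = out[target_col].shift(lag)

    # Rolling statistics over the previous `window` hours
    for window in rolling_windows:
        out[f"{target_col}_rolling_{window}h_mean"] = past.rolling(window).mean()
        out[f"{target_col}_rolling_{window}h_std"] = past.rolling(window).std()

    # Exponential moving averages of past values
    for alpha in ema_alphas:
        out[f"{target_col}_ema_{alpha}"] = past.ewm(alpha=alpha, adjust=False).mean()

    # Calendar features derived from the DatetimeIndex
    out["hour"] = out.index.hour
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)

    # Rows without a full history contain NaN and are dropped
    return out.dropna()


# Synthetic demo: three days of hourly data with a daily cycle
index = pd.date_range("2024-01-05", periods=72, freq="60min")
demo = pd.DataFrame({"kwh": 2.0 + np.sin(np.arange(72) * 2 * np.pi / 24)}, index=index)

features = add_basic_features(demo)
print(features.shape)
print(list(features.columns))
```

The longest lag or window determines how many leading rows are lost, so a 168-hour lag costs the first week of data.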

In [None]:
# Visualize some engineered features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Original vs lag features
axes[0, 0].plot(engineered_data.index[-168:], engineered_data['kwh'].iloc[-168:], label='Original', alpha=0.8)
axes[0, 0].plot(engineered_data.index[-168:], engineered_data['kwh_lag_24h'].iloc[-168:], label='Lag 24h', alpha=0.8)
axes[0, 0].set_title('Original vs Lag Features (Last Week)')
axes[0, 0].set_ylabel('kWh')
axes[0, 0].legend()
axes[0, 0].tick_params(axis='x', rotation=45)

# Rolling mean features
axes[0, 1].plot(engineered_data.index[-168:], engineered_data['kwh'].iloc[-168:], label='Original', alpha=0.8)
axes[0, 1].plot(engineered_data.index[-168:], engineered_data['kwh_rolling_24h_mean'].iloc[-168:], label='Rolling 24h Mean', alpha=0.8)
axes[0, 1].set_title('Original vs Rolling Mean')
axes[0, 1].set_ylabel('kWh')
axes[0, 1].legend()
axes[0, 1].tick_params(axis='x', rotation=45)

# Hourly patterns
hourly_avg = engineered_data.groupby('hour')['kwh'].mean()
axes[1, 0].plot(hourly_avg.index, hourly_avg.values, marker='o')
axes[1, 0].set_title('Average Consumption by Hour')
axes[1, 0].set_xlabel('Hour of Day')
axes[1, 0].set_ylabel('Average kWh')
axes[1, 0].set_xticks(range(0, 24, 2))

# Weekend vs weekday patterns
weekend_avg = engineered_data[engineered_data['is_weekend'] == 1]['kwh'].mean()
weekday_avg = engineered_data[engineered_data['is_weekend'] == 0]['kwh'].mean()
axes[1, 1].bar(['Weekday', 'Weekend'], [weekday_avg, weekend_avg], alpha=0.7)
axes[1, 1].set_title('Average Consumption: Weekday vs Weekend')
axes[1, 1].set_ylabel('Average kWh')

plt.tight_layout()
plt.show()

## 5. Model Training and Evaluation {#modeling}

In [None]:
# Initialize prediction pipeline
pipeline = EnergyPredictionPipeline(target_col='kwh')

# Train all models
with Timer("Model Training"):
    results = pipeline.train_pipeline(engineered_data, test_size=0.2)

print("\nModel Training Results:")
print(f"Models trained: {len(pipeline.models)}")
print(f"Features used: {len(pipeline.feature_names)}")

# Display model performance
print("\nModel Performance Summary:")
metrics_df = pd.DataFrame(pipeline.metrics).T
metrics_df = metrics_df.sort_values('rmse')
print(metrics_df.round(4))
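
The table above reports the `mae`, `rmse`, `mape`, and `r2` keys stored in `pipeline.metrics`. How the pipeline computes them is not shown, so the following is only a minimal NumPy sketch of the standard definitions. The function name `regression_metrics` is hypothetical, and skipping zero targets in MAPE is an assumption.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Standard regression metrics; hypothetical helper, not the pipeline's code."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))

    # MAPE is undefined where the true value is zero, so those points are skipped
    nonzero = y_true != 0
    if nonzero.any():
        mape = np.mean(np.abs(errors[nonzero] / y_true[nonzero])) * 100
    else:
        mape = float("nan")

    # R² compares the model against always predicting the mean of y_true
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": float(mae), "rmse": float(rmse), "mape": float(mape), "r2": float(r2)}


demo_metrics = regression_metrics([2, 4, 6, 8], [3, 4, 5, 8])
print(demo_metrics)
```

RMSE penalizes large errors more heavily than MAE, which is why the models are ranked by RMSE here while MAE is shown alongside it.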

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# MAE comparison
mae_values = [pipeline.metrics[model].get('mae', 0) for model in pipeline.metrics.keys()]
model_names = list(pipeline.metrics.keys())
axes[0].bar(model_names, mae_values, alpha=0.7)
axes[0].set_title('Mean Absolute Error (MAE)')
axes[0].set_ylabel('MAE')
axes[0].tick_params(axis='x', rotation=45)

# RMSE comparison
rmse_values = [pipeline.metrics[model].get('rmse', 0) for model in pipeline.metrics.keys()]
axes[1].bar(model_names, rmse_values, alpha=0.7, color='orange')
axes[1].set_title('Root Mean Square Error (RMSE)')
axes[1].set_ylabel('RMSE')
axes[1].tick_params(axis='x', rotation=45)

# R² comparison
r2_values = [pipeline.metrics[model].get('r2', 0) for model in pipeline.metrics.keys()]
axes[2].bar(model_names, r2_values, alpha=0.7, color='green')
axes[2].set_title('R-squared (R²)')
axes[2].set_ylabel('R²')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Get feature importance from best model
best_model = min(pipeline.metrics.items(), key=lambda x: x[1].get('rmse', float('inf')))[0]
importance = pipeline.get_feature_importance(best_model)

if importance:
    # Plot top 15 most important features
    sorted_features = sorted(importance.items(), key=lambda x: x[1], reverse=True)[:15]
    features, importances = zip(*sorted_features)
    
    plt.figure(figsize=(10, 8))
    plt.barh(range(len(features)), importances, alpha=0.7)
    plt.yticks(range(len(features)), [f.replace('_', ' ') for f in features])
    plt.xlabel('Feature Importance')
    plt.title(f'Top 15 Feature Importance - {best_model.title()}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    print(f"\nTop 10 Most Important Features ({best_model}):")
    for feature, imp in sorted_features[:10]:
        print(f"{feature}: {imp:.4f}")
else:
    print("Feature importance not available for the best model")

## 6. Predictions and Visualization {#predictions}

In [None]:
# Visualize predictions vs actual values
y_test = pipeline.predictions['y_test']
predictions_to_plot = {}

# Get predictions for available models
for model_name, pred in pipeline.predictions.items():
    if model_name != 'y_test' and isinstance(pred, np.ndarray) and len(pred) == len(y_test):
        predictions_to_plot[model_name] = pred

# Plot first week of predictions
week_length = min(168, len(y_test))  # 1 week or available data
y_test_week = y_test.iloc[:week_length]

plt.figure(figsize=(15, 8))
plt.plot(y_test_week.index, y_test_week.values, label='Actual', linewidth=3, alpha=0.8)

colors = plt.cm.Set1(np.linspace(0, 1, len(predictions_to_plot)))
for i, (model_name, pred) in enumerate(predictions_to_plot.items()):
    plt.plot(y_test_week.index, pred[:week_length], 
             label=model_name.replace('_', ' ').title(), 
             color=colors[i], linewidth=2, alpha=0.7)

plt.title('Energy Consumption: Predictions vs Actual (First Week of Test Data)', fontsize=16)
plt.xlabel('Time', fontsize=12)
plt.ylabel('Energy Consumption (kWh)', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Residual analysis for the best model
if best_model in predictions_to_plot:
    best_pred = predictions_to_plot[best_model]
    residuals = y_test.values - best_pred
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Residuals vs Predicted
    axes[0, 0].scatter(best_pred, residuals, alpha=0.6)
    axes[0, 0].axhline(y=0, color='red', linestyle='--', alpha=0.8)
    axes[0, 0].set_xlabel('Predicted Values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Predicted')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Histogram of residuals
    axes[0, 1].hist(residuals, bins=30, alpha=0.7, edgecolor='black')
    axes[0, 1].set_xlabel('Residuals')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Distribution of Residuals')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Residuals over time
    axes[1, 0].plot(y_test.index, residuals, alpha=0.7)
    axes[1, 0].axhline(y=0, color='red', linestyle='--', alpha=0.8)
    axes[1, 0].set_xlabel('Time')
    axes[1, 0].set_ylabel('Residuals')
    axes[1, 0].set_title('Residuals Over Time')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].grid(True, alpha=0.3)
    
    # Q-Q plot
    from scipy import stats
    stats.probplot(residuals, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title('Q-Q Plot')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.suptitle(f'Residual Analysis - {best_model.title()}', fontsize=16)
    plt.tight_layout()
    plt.show()
    
    print(f"Residual Statistics for {best_model}:")
    print(f"Mean: {np.mean(residuals):.4f}")
    print(f"Std: {np.std(residuals):.4f}")
    print(f"Min: {np.min(residuals):.4f}")
    print(f"Max: {np.max(residuals):.4f}")
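
The "Residuals Over Time" panel is meant to reveal leftover structure, such as a daily cycle the model failed to capture. A numeric complement is the autocorrelation of the residuals at a few lags. The sketch below is an addition for illustration, with a hypothetical helper name and synthetic residuals; it is not part of the project's visualization module.

```python
import numpy as np


def residual_autocorr(residuals, lag):
    """Sample autocorrelation of residuals at a given lag (hypothetical helper)."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    denom = np.sum(r ** 2)
    if lag <= 0 or lag >= len(r) or denom == 0:
        return float("nan")
    # Correlate the series with a copy of itself shifted by `lag` steps
    return float(np.sum(r[lag:] * r[:-lag]) / denom)


# Synthetic residuals with a 24-hour cycle the model "missed"
hours = np.arange(240)
demo_residuals = np.sin(hours * 2 * np.pi / 24)

ac_24 = residual_autocorr(demo_residuals, 24)
ac_12 = residual_autocorr(demo_residuals, 12)
print(f"lag 24: {ac_24:.3f}, lag 12: {ac_12:.3f}")
```

Values near zero at all lags suggest the residuals are close to noise; a strong value at lag 24 or 168 points to an unmodeled daily or weekly pattern.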

## 7. API Usage Examples {#api-usage}

In [None]:
# Example of how to use the trained model for new predictions
# This simulates what the API does internally

# Get recent consumption data (last 24 hours)
recent_data = engineered_data.tail(24)
recent_consumption = recent_data['kwh'].tolist()

# Simulate weather forecast data
weather_forecast = {
    'temperature_c': 22.5,
    'humidity_percent': 65.0,
    'wind_speed_kmh': 12.0,
    'precipitation_mm': 0.0,
    'cloud_cover_percent': 40.0,
    'solar_irradiance_wm2': 450.0
}

print("API Request Example:")
api_request = {
    "recent_consumption": recent_consumption,
    "weather_forecast": weather_forecast,
    "forecast_hours": 24
}

print(f"Recent consumption (last 5 values): {recent_consumption[-5:]}")
print(f"Weather forecast: {weather_forecast}")
print(f"Forecast hours requested: {api_request['forecast_hours']}")
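
Before a payload like `api_request` is sent, it can be checked and serialized to JSON. The field names below come from the cell above, but the validation rules (at least 24 consumption values, all six weather keys, a positive integer horizon) are assumptions for illustration; the actual API schema is not shown in this notebook.

```python
import json

# Weather keys taken from the example request above
REQUIRED_WEATHER_KEYS = {
    "temperature_c", "humidity_percent", "wind_speed_kmh",
    "precipitation_mm", "cloud_cover_percent", "solar_irradiance_wm2",
}


def validate_request(payload):
    """Return a list of problems found in the payload (hypothetical rules)."""
    problems = []

    consumption = payload.get("recent_consumption", [])
    if len(consumption) < 24:
        problems.append("recent_consumption needs at least 24 hourly values")

    missing = REQUIRED_WEATHER_KEYS - set(payload.get("weather_forecast", {}))
    if missing:
        problems.append(f"weather_forecast is missing: {sorted(missing)}")

    hours = payload.get("forecast_hours")
    if not isinstance(hours, int) or hours <= 0:
        problems.append("forecast_hours must be a positive integer")

    return problems


example_request = {
    "recent_consumption": [1.5] * 24,
    "weather_forecast": {key: 0.0 for key in REQUIRED_WEATHER_KEYS},
    "forecast_hours": 24,
}

print(validate_request(example_request))  # an empty list means no problems found
body = json.dumps(example_request)        # JSON string suitable for an HTTP request body
```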

In [None]:
# Demonstrate making predictions with the trained model
# Use the best model for prediction
if best_model in pipeline.models:
    model = pipeline.models[best_model]
    
    # Use the last available features for prediction
    last_features = engineered_data.iloc[-1:].copy()
    
    # Select exactly the feature columns the pipeline was trained on
    feature_cols = list(pipeline.feature_names)
    X_pred = last_features[feature_cols]
    
    # Fill any NaN values
    X_pred = X_pred.fillna(0)
    
    # Make prediction
    if hasattr(model, 'predict'):
        try:
            prediction = model.predict(X_pred)[0]
            print(f"Next hour prediction using {best_model}: {prediction:.3f} kWh")
            
            # Compare with the most recent observed consumption for reference
            last_actual = recent_consumption[-1]
            print(f"Last actual consumption: {last_actual:.3f} kWh")
            print(f"Prediction difference: {abs(prediction - last_actual):.3f} kWh")
            
        except Exception as e:
            print(f"Error making prediction: {e}")
    else:
        print(f"Model {best_model} doesn't support direct prediction")
else:
    print("Best model not found in trained models")
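
The cell above produces a single prediction, while the example request asks for `forecast_hours: 24`. One common way to extend a one-step model to a longer horizon is recursive (iterated) forecasting, where each prediction is appended to the history and used as input for the next step. Whether the project's API works this way is not stated, so this is only a sketch: `recursive_forecast` is a hypothetical helper, and a seasonal-naive rule stands in for the trained model.

```python
from typing import Callable, List, Sequence


def recursive_forecast(history: Sequence[float],
                       predict_next: Callable[[List[float]], float],
                       steps: int) -> List[float]:
    """Roll a one-step predictor forward, feeding predictions back as inputs."""
    window = list(history)
    forecast = []
    for _ in range(steps):
        next_value = float(predict_next(window))
        forecast.append(next_value)
        window.append(next_value)  # the prediction becomes part of the history
    return forecast


def seasonal_naive(window: List[float]) -> float:
    """Stand-in model: predict the value observed 24 hours earlier."""
    return window[-24]


# Two days of synthetic hourly history with a repeating daily shape
history = [1.0 + (hour % 24) / 10 for hour in range(48)]
forecast = recursive_forecast(history, seasonal_naive, steps=48)
print(forecast[:5])
```

With a real model, `predict_next` would rebuild the lag and rolling features from `window` before calling `model.predict`. Errors compound over the horizon in this scheme, so accuracy typically degrades for later hours.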

## 8. Results Analysis {#analysis}

In [None]:
# Comprehensive results summary
print("="*60)
print("ENERGY CONSUMPTION PREDICTOR - RESULTS SUMMARY")
print("="*60)

print("\nDataset Information:")
print(f"  Total records: {len(engineered_data)}")
print(f"  Date range: {engineered_data.index.min()} to {engineered_data.index.max()}")
print(f"  Duration: {(engineered_data.index.max() - engineered_data.index.min()).days} days")
print(f"  Features engineered: {len(feature_engineer.get_feature_names())}")

print("\nConsumption Statistics:")
print(f"  Average: {engineered_data['kwh'].mean():.3f} kWh")
print(f"  Median: {engineered_data['kwh'].median():.3f} kWh")
print(f"  Std Dev: {engineered_data['kwh'].std():.3f} kWh")
print(f"  Min: {engineered_data['kwh'].min():.3f} kWh")
print(f"  Max: {engineered_data['kwh'].max():.3f} kWh")

print("\nModel Performance (sorted by RMSE):")
sorted_models = sorted(pipeline.metrics.items(), key=lambda x: x[1].get('rmse', float('inf')))
for i, (model_name, metrics) in enumerate(sorted_models, 1):
    mae = metrics.get('mae', 0)
    rmse = metrics.get('rmse', 0)
    mape = metrics.get('mape', 0)
    r2 = metrics.get('r2', 0)
    print(f"  {i}. {model_name.ljust(15)} MAE: {mae:.4f}, RMSE: {rmse:.4f}, MAPE: {mape:.2f}%, R²: {r2:.4f}")

print(f"\nBest Model: {best_model}")
best_metrics = pipeline.metrics[best_model]
print(f"  Achieved {best_metrics.get('rmse', 0):.4f} RMSE")
print(f"  Mean Absolute Percentage Error: {best_metrics.get('mape', 0):.2f}%")
print(f"  R-squared Score: {best_metrics.get('r2', 0):.4f}")

if importance:
    print("\nTop 5 Most Important Features:")
    for feature, imp in sorted(importance.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  - {feature}: {imp:.4f}")

print("\nProject Status: ✅ COMPLETED SUCCESSFULLY")
print("Models trained and ready for deployment!")
print("="*60)

## Conclusion

This notebook has demonstrated the complete Energy Consumption Predictor workflow:

1. **Data Loading**: Successfully loaded electricity consumption and weather data
2. **Preprocessing**: Cleaned data, handled missing values, and removed outliers
3. **Feature Engineering**: Created comprehensive time-based, lag, and statistical features
4. **Model Training**: Trained multiple models including XGBoost, LightGBM, and Random Forest
5. **Evaluation**: Compared model performance using MAE, RMSE, MAPE, and R² metrics
6. **Visualization**: Created comprehensive plots for analysis

### Key Insights:
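- The best-performing model is the one with the lowest RMSE on the held-out 20% test split; its MAE, RMSE, MAPE, and R² are reported in the Results Analysis section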
- Lag features and rolling statistics are among the most important predictors
- Time-based features help capture daily and weekly patterns
- Weather features provide additional predictive power

### Next Steps:
1. Deploy the API server: `python main.py serve`
2. Test the API endpoints with real data
3. Set up automated model retraining
4. Monitor model performance in production
5. Collect more data to improve accuracy

Once deployed and monitored as outlined above, the trained models can help optimize energy consumption and reduce electricity costs.