# Time-Series Electricity Consumption Prediction with SHAP Explanations

This notebook demonstrates how to use the `timeseries.py` module to explain time-series predictions using SHAP (SHapley Additive exPlanations).

SHAP provides a unified measure of feature importance that satisfies desirable properties like local accuracy and consistency.
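To make local accuracy concrete, here is a minimal sketch that computes exact Shapley values for a tiny model by averaging each feature's marginal contribution over every feature ordering. The model `toy_model`, the instance `x`, and the `baseline` are illustrative assumptions and are not part of `timeseries.py`; replacing "absent" features with baseline values is one common convention, not the only one.

```python
from itertools import permutations
from math import factorial

def toy_model(x):
    # Hypothetical model with an interaction term (not from timeseries.py)
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]

def exact_shapley(model, x, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)      # start with every feature "absent"
        prev = model(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its real value
            new = model(current)
            phi[i] += new - prev      # marginal contribution in this ordering
            prev = new
    return [p / factorial(n) for p in phi]

x = [3.0, 1.0, 4.0]
baseline = [1.0, 0.0, 2.0]
phi = exact_shapley(toy_model, x, baseline)

# Local accuracy: the attributions sum to f(x) - f(baseline)
gap = toy_model(x) - toy_model(baseline)
print(phi, sum(phi), gap)
```

Exact enumeration grows factorially with the number of features, which is why practical explainers rely on model-specific shortcuts or sampling.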

## Setup and Imports

In [71]:
# Add src folder to path
import sys
sys.path.insert(0, './src')

# Import helper functions for data and model management
from helpers.timeseries_data import (
    generate_synthetic_electricity_data,
    create_time_series_features,
    prepare_train_test_split
)

from helpers.model_training import (
    create_models,
    train_model
)

# Import SHAP explainability functions
from timeseries import (
    compute_shap_values,
    compute_feature_importance,
    get_shap_summary_stats,
    save_shap_summary_plot,
    save_shap_waterfall_plot,
    save_feature_importance_plot,
    get_prediction_explanation,
    save_results_summary,
    explain_timeseries_predictions
)

import warnings
warnings.filterwarnings('ignore')

## Step 1: Generate Synthetic Data

Create a realistic time-series dataset with seasonal and weekly patterns.
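The implementation of `generate_synthetic_electricity_data` is not shown in this notebook. As a rough sketch of how hourly data with daily and weekly structure could be produced, the snippet below combines a base load, a daily sine cycle, a weekend dip, and noise. Every constant and component here is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
n_days = 30
hours = np.arange(n_days * 24)           # one sample per hour

hour_of_day = hours % 24
day_of_week = (hours // 24) % 7          # 0..6; days 5 and 6 act as the weekend

# Hypothetical components: base load, daily cycle, weekend dip, noise
daily = 200.0 * np.sin(2 * np.pi * (hour_of_day - 6) / 24)
weekend = np.where(day_of_week >= 5, -100.0, 0.0)
noise = rng.normal(0.0, 30.0, size=hours.size)
consumption = 1000.0 + daily + weekend + noise

print(consumption.shape)
```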

In [72]:
# Generate synthetic electricity consumption data
print("[Step 1] Generating synthetic electricity consumption data...")
df = generate_synthetic_electricity_data(n_days=30)  # Using smaller dataset for notebook
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nData summary:")
print(df[['consumption', 'temperature', 'hour', 'day_of_week']].describe())

[Step 1] Generating synthetic electricity consumption data...
Generating synthetic electricity consumption data...
Dataset shape: (720, 8)

First few rows:
             datetime  consumption  temperature  hour  day_of_week  \
0 2023-01-01 00:00:00   612.404915    21.662276     0            6   
1 2023-01-01 01:00:00   614.282436    19.757341     1            6   
2 2023-01-01 02:00:00   675.560280    22.115199     2            6   
3 2023-01-01 03:00:00   883.838972    24.741223     3            6   
4 2023-01-01 04:00:00   901.338134    19.469673     4            6   

   day_of_year  month  is_weekend  
0            1      1           1  
1            1      1           1  
2            1      1           1  
3            1      1           1  
4            1      1           1  

Data summary:
       consumption  temperature        hour  day_of_week
count   720.000000   720.000000  720.000000   720.000000
mean    991.605641    22.577184   11.500000     3.000000
std     423.639109   

## Step 2: Feature Engineering

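Create lagged features, rolling statistics, and time-based features for time-series prediction. The longest lag (`lag_168h`) needs a full week of history, which is consistent with the row count falling from 720 to 552 in the output below.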
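The three feature families can be sketched with pandas as follows. This is not the code of `create_time_series_features`, which is not shown here; the toy data, the choice of lags, and the `shift(1)` before the rolling window (to keep the current target out of its own feature) are assumptions.

```python
import numpy as np
import pandas as pd

n = 72
df = pd.DataFrame({
    "hour": np.arange(n) % 24,
    "consumption": 1000.0 + 10.0 * np.arange(n),   # made-up values
})

# Lag features: the value observed k hours earlier
for k in (1, 24):
    df[f"lag_{k}h"] = df["consumption"].shift(k)

# Rolling mean over the previous 6 hours; shift(1) excludes the current row
df["rolling_mean_6h"] = df["consumption"].shift(1).rolling(6).mean()

# Cyclical encoding so that hour 23 sits next to hour 0
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Rows without a full history contain NaN and are dropped
features = df.dropna().reset_index(drop=True)
print(features.shape)
```

With a longest lag of 24 hours, the first 24 of the 72 toy rows are dropped, the same mechanism that shortens the notebook's dataset.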

In [73]:
# Engineer features for time-series modeling
print("[Step 2] Creating time-series features...")
df_features = create_time_series_features(df, target_col='consumption')
print(f"Feature matrix shape: {df_features.shape}")
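# Counts every column except the target. This still includes 'datetime',
# so the models in later steps see 29 feature columns.
print(f"Number of features: {len(df_features.columns) - 1}")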
print(f"\nFeature columns:")
print(df_features.columns.tolist())

[Step 2] Creating time-series features...
Creating time-series features...
Feature matrix shape: (552, 31)
Number of features: 30

Feature columns:
['datetime', 'consumption', 'temperature', 'hour', 'day_of_week', 'day_of_year', 'month', 'is_weekend', 'lag_1h', 'lag_2h', 'lag_3h', 'lag_24h', 'lag_48h', 'lag_168h', 'rolling_mean_6h', 'rolling_std_6h', 'rolling_mean_12h', 'rolling_std_12h', 'rolling_mean_24h', 'rolling_std_24h', 'rolling_mean_48h', 'rolling_std_48h', 'hour_sin', 'hour_cos', 'day_of_week_sin', 'day_of_week_cos', 'month_sin', 'month_cos', 'temp_squared', 'temp_cooling_degree', 'temp_heating_degree']


## Step 3: Train Multiple Models

Compare predictions from different regression models.
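Whether `prepare_train_test_split` shuffles rows or splits by time is not shown in this notebook. For time-series data a chronological hold-out is a common choice, because it keeps future observations out of the training set. A minimal sketch, with a hypothetical helper name:

```python
import math

def chronological_split(rows, test_size=0.2):
    """Hold out the final `test_size` fraction without shuffling."""
    n_test = math.ceil(len(rows) * test_size)
    split = len(rows) - n_test
    return rows[:split], rows[split:]

rows = list(range(552))            # stand-in for 552 time-ordered rows
train, test = chronological_split(rows)
print(len(train), len(test))
```

Every training index precedes every test index, so lagged features in the test set never leak information backward into training.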

In [74]:
# Split data and create models
print("[Step 3] Preparing data and creating models...")
X_train, X_test, y_train, y_test, feature_cols = prepare_train_test_split(
    df_features, target_col='consumption', test_size=0.2
)
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

# Create models
models_dict = create_models(['randomforest', 'linear'])
print(f"\nCreated models: {list(models_dict.keys())}")

# Train each model
training_results = {}
for model_name, model in models_dict.items():
    result = train_model(model, X_train, y_train, X_test, y_test, model_name)
    training_results[model_name] = result

[Step 3] Preparing data and creating models...
Training set size: 441
Test set size: 111

Created models: ['Random Forest', 'Linear Regression']

Training Random Forest...
  Train MSE: 797.80, R²: 0.9958
  Test MSE: 7347.18, R²: 0.9513

Training Linear Regression...
  Train MSE: 2303.51, R²: 0.9880
  Test MSE: 3169.37, R²: 0.9790


## Step 4: Evaluate Models

Assess model performance using MSE and R² on the training and test sets.
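For reference, both metrics can be computed by hand. The sketch below uses the standard definitions with made-up numbers; it does not reproduce how `train_model` computes them internally.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [900.0, 1000.0, 1100.0, 1200.0]   # made-up targets
y_pred = [950.0, 1000.0, 1050.0, 1250.0]   # made-up predictions
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

MSE is in squared units of the target, so it is only comparable between models evaluated on the same data; R² is unitless.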

In [66]:
# Display model performance metrics
print("[Step 4] Model Performance Summary:")
print()
for model_name, result in training_results.items():
    metrics = result['metrics']
    print(f"{model_name}:")
    print(f"  Train MSE: {metrics['train_mse']:.2f}, R²: {metrics['train_r2']:.4f}")
    print(f"  Test MSE:  {metrics['test_mse']:.2f}, R²: {metrics['test_r2']:.4f}")
    print()

[Step 4] Model Performance Summary:

Random Forest:
  Train MSE: 797.80, R²: 0.9958
  Test MSE:  7347.18, R²: 0.9513

Linear Regression:
  Train MSE: 2303.51, R²: 0.9880
  Test MSE:  3169.37, R²: 0.9790



## Step 5: SHAP Explanations

Use SHAP to explain which features are most important for predictions from each model.
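The body of `compute_feature_importance` is not shown here. A common convention, and the one assumed in this sketch, ranks features by their mean absolute SHAP value. To stay self-contained, the sketch uses the closed form for a linear model whose features are treated as independent, where each SHAP value is the weight times the feature's deviation from its mean. The data, weights, and intercept are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["lag_1h", "hour_sin", "temperature"]  # names borrowed from the notebook
X = rng.normal(size=(50, 3))                           # made-up data
w = np.array([3.0, -1.0, 0.5])                         # made-up linear weights
b = 10.0                                               # made-up intercept

# Closed-form SHAP values for a linear model under feature independence
shap_values = w * (X - X.mean(axis=0))                 # shape (50, 3)

# Global importance: mean absolute SHAP value per feature
importance = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]

# Each row of SHAP values plus the base value recovers that row's prediction
preds = X @ w + b
base_value = preds.mean()
assert np.allclose(shap_values.sum(axis=1) + base_value, preds)
print(ranking)
```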

In [75]:
# Compute SHAP values for each model
print("[Step 5] Computing SHAP explanations...")
print()
shap_results = {}

for model_name, model in models_dict.items():
    print(f"\nProcessing {model_name}...")
    result = compute_shap_values(
        model,
        model_name,
        X_train,
        X_test,
        num_samples=20
    )
    
    if result:
        shap_results[model_name] = result
        
        # Get feature importance
        feature_importance = compute_feature_importance(
            result['shap_values'],
            result['X_sample'].columns
        )
        
        # Get SHAP statistics
        stats = get_shap_summary_stats(result['shap_values'])
        
        print(f"  SHAP Values Shape: {stats['shape']}")
        print(f"  Mean |SHAP|: {stats['mean_abs']:.3f}")
        print(f"  Top 5 Features:")
        for idx, row in feature_importance.head(5).iterrows():
            print(f"    {row['feature']}: {row['importance']:.4f}")
    else:
        print(f"  Could not compute SHAP for {model_name}")

[Step 5] Computing SHAP explanations...


Processing Random Forest...
  Creating SHAP explainer for Random Forest...
  SHAP Values Shape: (20, 29)
  Mean |SHAP|: 12.783
  Top 5 Features:
    lag_168h: 300.0021
    lag_1h: 12.5235
    rolling_std_12h: 8.4451
    hour_sin: 7.5591
    rolling_mean_24h: 4.1362

Processing Linear Regression...
  Creating SHAP explainer for Linear Regression...
  SHAP Values Shape: (10, 29)
  Mean |SHAP|: 55.391
  Top 5 Features:
    rolling_mean_6h: 643.8386
    rolling_mean_12h: 385.6861
    temp_squared: 92.5029
    lag_3h: 89.4762
    temp_cooling_degree: 80.9713


## Understanding SHAP Explanations

### Key Concepts:

1. **SHAP Values**: Measure each feature's contribution to the prediction
   - Positive values: increase prediction
   - Negative values: decrease prediction
   - Magnitude: strength of impact

2. **Summary Plot**: Shows the distribution of SHAP values for each feature
   - Features at top have highest importance
   - Color indicates feature value (high=red, low=blue)

3. **Dependence Plot**: Shows how a feature's value affects its SHAP values
   - Reveals non-linear relationships
   - Helps identify feature interactions

### Interpreting Results:

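- **Lagged Consumption Features**: Past electricity consumption is a strong predictor. In the Step 5 output, the one-week lag `lag_168h` dominates the Random Forest (mean |SHAP| of about 300, versus about 12.5 for `lag_1h`)
- **Rolling Averages**: Rolling means over 6 to 48 hours summarize recent load; `rolling_mean_6h` and `rolling_mean_12h` rank highest for Linear Regression
- **Time Features**: Hour of day and day of week encode recurring patterns; `hour_sin` is among the Random Forest's top five features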

## Step 6: Visualization - Summary Plots

Save SHAP summary plots to visualize feature importance distributions.


In [68]:
# Save summary plots for each model
import os
print("[Step 6] Saving SHAP summary visualizations...")
print()

output_dir = './outputs/timeseries'
os.makedirs(output_dir, exist_ok=True)

for model_name, shap_data in shap_results.items():
    print(f"Creating visualizations for {model_name}...")
    
    # Save summary plot
    summary_output = os.path.join(output_dir, f'{model_name}_shap_summary.png')
    save_shap_summary_plot(
        shap_data['shap_values'],
        shap_data['X_sample'],
        summary_output,
        model_name
    )
    print(f"  ✓ Saved summary plot: {summary_output}")
    
    # Save feature importance plot
    feature_imp = compute_feature_importance(
        shap_data['shap_values'],
        shap_data['X_sample'].columns
    )
    imp_output = os.path.join(output_dir, f'{model_name}_feature_importance.png')
    save_feature_importance_plot(feature_imp, imp_output, top_n=15)
    print(f"  ✓ Saved feature importance plot: {imp_output}")

[Step 6] Saving SHAP summary visualizations...

Creating visualizations for Random Forest...
  ✓ Saved summary plot: ./outputs/timeseries/Random Forest_shap_summary.png
  ✓ Saved feature importance plot: ./outputs/timeseries/Random Forest_feature_importance.png
Creating visualizations for Linear Regression...
  ✓ Saved summary plot: ./outputs/timeseries/Linear Regression_shap_summary.png
  ✓ Saved feature importance plot: ./outputs/timeseries/Linear Regression_feature_importance.png


## Step 7: Individual Prediction Explanations

Explain individual predictions using waterfall plots and detailed breakdowns.


In [69]:
# Explain individual predictions
import pandas as pd
import numpy as np
print("[Step 7] Explaining individual predictions...")
print()

for model_name, shap_data in shap_results.items():
    print(f"\n{model_name} - Sample Explanations:")
    print("=" * 60)
    
    X_sample = shap_data['X_sample']
    shap_vals = shap_data['shap_values']
    
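    # Placeholder targets for demonstration only: these are random values, not
    # true consumption, so the "Actual" and "Error" figures printed below are
    # meaningless. Substitute the real targets for the sampled rows in practice.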
    y_sample = pd.Series(np.random.randn(len(X_sample)), index=range(len(X_sample)))
    
    # Use valid indices: first and middle samples
    sample_indices = [0, len(X_sample)//2]
    
    for sample_idx in sample_indices:
        # Get detailed explanation for this prediction
        explanation = get_prediction_explanation(
            models_dict[model_name],
            shap_vals,
            X_sample,
            y_sample,
            sample_idx=sample_idx,
            top_n=5
        )
        
        print(f"\nSample Index {sample_idx} (of {len(X_sample)}):")
        print(f"  Predicted: {explanation['predicted']:.2f}")
        print(f"  Actual: {explanation['actual']:.2f}")
        print(f"  Error: {explanation['error']:.2f}")
        print(f"  Top Contributing Features:")
        for _, row in explanation['top_features'].iterrows():
            print(f"    {row['feature']}: {row['shap_value']:+.4f}")
        
        # Save waterfall plot for first sample of each model
        if sample_idx == 0:
            waterfall_output = os.path.join(output_dir, f'{model_name}_waterfall_sample.png')
            save_shap_waterfall_plot(
                shap_data['explainer'],
                shap_data['shap_values'],
                X_sample,
                waterfall_output,
                sample_idx=0
            )
            print(f"  ✓ Saved waterfall plot: {waterfall_output}")

[Step 7] Explaining individual predictions...


Random Forest - Sample Explanations:

Sample Index 0 (of 20):
  Predicted: 1705.86
  Actual: 0.75
  Error: 1705.11
  Top Contributing Features:
    lag_168h: +629.0347
    temperature: +10.0653
    temp_cooling_degree: +8.4069
    temp_squared: +7.9650
    lag_1h: +7.0206
  ✓ Saved waterfall plot: ./outputs/timeseries/Random Forest_waterfall_sample.png

Sample Index 10 (of 20):
  Predicted: 619.71
  Actual: -1.01
  Error: 620.71
  Top Contributing Features:
    lag_168h: -398.7914
    lag_1h: -14.1958
    rolling_mean_24h: +9.6605
    hour_sin: -8.4446
    rolling_std_48h: +4.4244

Linear Regression - Sample Explanations:

Sample Index 0 (of 10):
  Predicted: 1713.00
  Actual: -1.21
  Error: 1714.21
  Top Contributing Features:
    rolling_mean_6h: +1196.6963
    rolling_mean_12h: -425.0115
    temp_squared: -160.2524
    temp_cooling_degree: +159.8309
    lag_3h: -140.9400
  ✓ Saved waterfall plot: ./outputs/timeseries/Linear Regression_

## Step 8: Save Results Summary

Create a comprehensive summary report of all analyses.


In [None]:
# Save comprehensive results summary
print("[Step 8] Saving results summary...")
print()

# Prepare results dictionary
results = {
    'models': list(models_dict.keys()),
    'training_results': training_results,
    'shap_results': {
        model_name: {
            'shape': shap_data['shap_values'].shape,
            'mean_abs': float(np.abs(shap_data['shap_values']).mean()),
            'std': float(np.std(shap_data['shap_values'])),
        }
        for model_name, shap_data in shap_results.items()
    },
    'feature_importance': {
        model_name: compute_feature_importance(
            shap_data['shap_values'],
            shap_data['X_sample'].columns
        ).to_dict()
        for model_name, shap_data in shap_results.items()
    }
}

# Save summary report
summary_output = os.path.join(output_dir, 'analysis_summary.txt')
save_results_summary(results, summary_output)
print(f"✓ Saved results summary: {summary_output}")
print()

# Display summary statistics
print("Summary Statistics:")
print("-" * 60)
for model_name, metrics in results['training_results'].items():
    print(f"\n{model_name}:")
    print(f"  Test R²: {metrics['metrics']['test_r2']:.4f}")
    print(f"  Test MSE: {metrics['metrics']['test_mse']:.2f}")
    if model_name in results['shap_results']:
        print(f"  Mean |SHAP|: {results['shap_results'][model_name]['mean_abs']:.4f}")

[Step 8] Saving results summary...

✓ Saved results summary: ./outputs/timeseries/analysis_summary.txt

Summary Statistics:
------------------------------------------------------------

Random Forest:
  Test R²: 0.9513
  Test MSE: 7347.18
  Mean |SHAP|: 12.7825

Linear Regression:
  Test R²: 0.9790
  Test MSE: 3169.37
  Mean |SHAP|: 55.3914


## Step 9: High-Level Function - One-Shot Analysis

Use the high-level orchestrator function to run the complete analysis in one call. The cell below only prints an example call; it does not execute it.


In [50]:
# Demonstrate the high-level orchestrator function
print("[Step 9] High-Level Analysis - One-Shot Approach")
print("=" * 60)
print()
print("The 'explain_timeseries_predictions' function handles all steps:")
print("1. Data generation and feature engineering")
print("2. Model training")
print("3. SHAP value computation")
print("4. Visualization generation")
print("5. Results summary")
print()

# This is how users can run the entire analysis in one call:
print("Example usage:")
print("""
from timeseries import explain_timeseries_predictions

results = explain_timeseries_predictions(
    n_days=30,                          # Data size
    model_types=['randomforest', 'linear'],  # Models to train
    n_samples=20,                       # SHAP samples
    output_dir='./analysis_output'      # Where to save results
)

# Results contain everything needed for further analysis
print(results['model_performance'])
print(results['feature_importance'])
""")

print()
print("Note: This function is ideal for batch analyses or automation.")


[Step 9] High-Level Analysis - One-Shot Approach

The 'explain_timeseries_predictions' function handles all steps:
1. Data generation and feature engineering
2. Model training
3. SHAP value computation
4. Visualization generation
5. Results summary

Example usage:

from timeseries import explain_timeseries_predictions

results = explain_timeseries_predictions(
    n_days=30,                          # Data size
    model_types=['randomforest', 'linear'],  # Models to train
    n_samples=20,                       # SHAP samples
    output_dir='./analysis_output'      # Where to save results
)

# Results contain everything needed for further analysis
print(results['model_performance'])
print(results['feature_importance'])


Note: This function is ideal for batch analyses or automation.


## Available Functions Reference

### Data & Model Preparation (helpers)
- `generate_synthetic_electricity_data()` - Create realistic time-series data
- `create_time_series_features()` - Engineer lagged and rolling features
- `prepare_train_test_split()` - Split and prepare data
- `create_models()` - Create multiple regression models
- `train_model()` - Train and evaluate a model

### Explainability Functions (timeseries.py)
1. **`compute_shap_values()`** - Compute SHAP explanations
2. **`compute_feature_importance()`** - Calculate feature importance from SHAP values
3. **`get_shap_summary_stats()`** - Get statistical summaries
4. **`save_shap_summary_plot()`** - Visualize SHAP value distributions
5. **`save_shap_waterfall_plot()`** - Plot individual prediction breakdowns
6. **`save_feature_importance_plot()`** - Rank and visualize top features
7. **`get_prediction_explanation()`** - Get detailed prediction analysis
8. **`save_results_summary()`** - Generate comprehensive summary report
9. **`explain_timeseries_predictions()`** - One-shot analysis orchestrator

### Typical Workflow
1. Generate/load data → `generate_synthetic_electricity_data()`
2. Create features → `create_time_series_features()`
3. Split data → `prepare_train_test_split()`
4. Create models → `create_models()`
5. Train models → `train_model()`
6. Compute SHAP → `compute_shap_values()`
7. Analyze → `compute_feature_importance()`, `get_prediction_explanation()`
8. Visualize → `save_shap_summary_plot()`, `save_feature_importance_plot()`
9. Report → `save_results_summary()`

Or use `explain_timeseries_predictions()` to do all steps automatically.


## Key Concepts and Interpretations

### Understanding SHAP Values
- **Positive values**: Feature increases the prediction
- **Negative values**: Feature decreases the prediction
- **Magnitude**: Strength of the effect

### Reading Summary Plots
- **Features on top**: Most important for model
- **Red points**: High feature values
- **Blue points**: Low feature values
- **Position on x-axis**: SHAP value, i.e. the direction and size of the impact

### Waterfall Plots
- **Base value**: The explainer's expected model output (the average prediction over its background data)
- **Arrows**: Feature contributions (up/down)
- **Final prediction**: Base value plus the sum of all contributions
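A waterfall plot is a running sum: it starts at the base value and adds one contribution at a time until it reaches the prediction. The sketch below prints that walk as text. The base value and contributions are hypothetical numbers, not taken from the notebook's explainer.

```python
# Hypothetical base value and SHAP contributions for one prediction
base_value = 1000.0
contributions = {"lag_168h": 620.0, "lag_1h": -15.0, "hour_sin": 8.0}

# Walk from the base value to the prediction, largest effect first
running = base_value
steps = []
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    running += value
    steps.append((name, value, running))

for name, value, total in steps:
    print(f"{name:>10}: {value:+8.1f} -> {total:8.1f}")

prediction = running   # base value plus all contributions
```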

### Feature Importance Types
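- **SHAP-based**: Built from per-prediction attributions, so the same values support both views below
- **Global**: Which features matter on average (for example, the mean |SHAP| across samples)
- **Local**: Which features matter for one prediction (a single row of SHAP values)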

### Time-Series Specifics
- **Lagged features**: Past values are often most important
- **Rolling averages**: Capture trends and patterns
- **Time features**: Hour/day encode important seasonality
- **Recent vs. historical**: Recent lags often carry the most weight, but not always: in this run the Random Forest leans mostly on the one-week lag `lag_168h`


## Next Steps

1. **Analyze Different Time Periods**: Compare explanations for different seasons
2. **Add Real Data**: Replace synthetic data with actual electricity consumption data
3. **Feature Selection**: Use SHAP values to select most important features
4. **Model Improvement**: Focus on features identified as important by SHAP
5. **Uncertainty Estimation**: Add confidence intervals to predictions
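As one possible starting point for the feature-selection step, the sketch below keeps the highest-ranked features until they cover a chosen share of the total importance. The importances are the five Random Forest values printed in Step 5; the other 24 features were not printed, so shares here are relative to these five only. The helper name and the 0.95 threshold are assumptions.

```python
# Top-5 mean |SHAP| values printed for the Random Forest in Step 5
importance = {
    "lag_168h": 300.0021,
    "lag_1h": 12.5235,
    "rolling_std_12h": 8.4451,
    "hour_sin": 7.5591,
    "rolling_mean_24h": 4.1362,
}

def select_by_coverage(importance, coverage=0.95):
    """Keep the highest-ranked features until `coverage` of the total is reached."""
    total = sum(importance.values())
    selected, running = [], 0.0
    for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
        selected.append(name)
        running += value
        if running / total >= coverage:
            break
    return selected

selected = select_by_coverage(importance)
print(selected)
```

Any selection made this way should be validated by retraining on the reduced feature set and comparing test metrics.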