# Electricity Usage Forecasting: Model Comparison

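This notebook compares time series forecasting models (ARIMA, SARIMA, Prophet, and LSTM) for electricity usage prediction. It works from precomputed metrics and forecasts, and also evaluates a simple-average ensemble.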

In [1]:
# %pip installs into the environment of the running kernel
%pip install pandas numpy matplotlib seaborn scikit-learn



In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Set plotting style
plt.style.use('bmh')
sns.set_palette('viridis')

%matplotlib inline

## Load the Results

In [4]:
# Load the model results
with open('../results/model_results.pkl', 'rb') as f:
    results = pickle.load(f)

# Display the results
results_df = pd.DataFrame(results).T
results_df

FileNotFoundError: [Errno 2] No such file or directory: '../results/model_results.pkl'
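
The `FileNotFoundError` above means the pickle was not at that relative path when the cell ran, which often happens when the notebook is started from a different working directory. A minimal sketch of a guarded load follows; `load_results` is a hypothetical helper, not part of the original notebook, and it assumes the file (when present) holds the same pickled object the cell above expects.

```python
import pickle
from pathlib import Path


def load_results(path):
    """Return the unpickled results, or None if the file is missing.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.exists():
        # Print the resolved path so a wrong working directory is easy to spot.
        print(f"Results file not found: {path.resolve()}")
        return None
    with path.open("rb") as f:
        return pickle.load(f)
```

In the loading cell this would be used as `results = load_results('../results/model_results.pkl')`, followed by a check for `None` before building `results_df`.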

## Visualize Model Performance

In [None]:
# Plot the metrics for each model
metrics = results_df.columns

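# squeeze=False keeps `axes` an array even when there is only one metric
fig, axes = plt.subplots(len(metrics), 1, figsize=(12, 4 * len(metrics)), squeeze=False)
axes = axes.ravel()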

for i, metric in enumerate(metrics):
    sns.barplot(x=results_df.index, y=results_df[metric], ax=axes[i])
    axes[i].set_title(f'Model Comparison - {metric}')
    axes[i].set_ylabel(metric)
    axes[i].grid(True, axis='y')
    
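    # Add value labels just above each bar (va='bottom' avoids a fixed offset
    # that would be too small or too large depending on the metric's scale)
    for j, v in enumerate(results_df[metric]):
        axes[i].text(j, v, f'{v:.2f}', ha='center', va='bottom')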

plt.tight_layout()
plt.show()

## Load the Forecasts

In [None]:
# Load the test data, parsing the timestamp column and using it as the index
test_data = pd.read_csv('../data/processed/test_data.csv',
                        parse_dates=['timestamp'], index_col='timestamp')

# Load the forecasts the same way (one column per model)
forecasts = pd.read_csv('../results/forecasts.csv',
                        parse_dates=['timestamp'], index_col='timestamp')
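
The later cells subtract forecasts from actual values by index, so any timestamp present in only one of the two files would silently produce `NaN` errors. The sketch below shows one way to check and align the indexes first. `align_on_timestamps` is a hypothetical helper, and the toy frames only stand in for `test_data` and `forecasts`.

```python
import pandas as pd


def align_on_timestamps(actual, predicted):
    """Keep only timestamps present in both frames (hypothetical helper).

    Returns the two aligned frames and the number of timestamps dropped.
    """
    common = actual.index.intersection(predicted.index)
    dropped = len(actual.index.union(predicted.index)) - len(common)
    return actual.loc[common], predicted.loc[common], dropped


# Toy data: two hourly series offset by one hour (values are made up)
one_hour = pd.Timedelta(hours=1)
idx_actual = pd.date_range("2024-01-01 00:00", periods=4, freq=one_hour)
idx_forecast = pd.date_range("2024-01-01 01:00", periods=4, freq=one_hour)
toy_actual = pd.DataFrame({"consumption": [1.0, 2.0, 3.0, 4.0]}, index=idx_actual)
toy_forecasts = pd.DataFrame({"model_a": [2.1, 2.9, 4.2, 5.0]}, index=idx_forecast)

aligned_actual, aligned_forecasts, n_dropped = align_on_timestamps(
    toy_actual, toy_forecasts
)
print(f"kept {len(aligned_actual)} timestamps, dropped {n_dropped}")
```

A nonzero dropped count is a hint that the two files cover different periods or use different sampling intervals.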

## Visualize Forecasts

In [None]:
# Plot the forecasts against the actual values
plt.figure(figsize=(15, 8))

# Plot actual values
plt.plot(test_data.index, test_data['consumption'], label='Actual', linewidth=2)

# Plot forecasts
for model in forecasts.columns:
    plt.plot(forecasts.index, forecasts[model], label=f'{model} Forecast', linestyle='--')

plt.title('Forecast Comparison')
plt.xlabel('Date')
plt.ylabel('Consumption')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

## Forecast Error Analysis

In [None]:
# Calculate forecast errors
errors = {}

for model in forecasts.columns:
    errors[model] = test_data['consumption'] - forecasts[model]

errors_df = pd.DataFrame(errors)

# Plot error distributions
plt.figure(figsize=(15, 8))

for model in errors_df.columns:
    sns.kdeplot(errors_df[model], label=model)

plt.title('Forecast Error Distribution')
plt.xlabel('Error')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
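
The density plot shows the shape of each error distribution; a small table of summary statistics can make the comparison easier to read. The sketch below assumes a frame of errors computed as actual minus forecast, as in `errors_df`. `summarize_errors` is a hypothetical helper and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def summarize_errors(errors):
    """Per-model bias (mean error), MAE and RMSE from an errors frame."""
    return pd.DataFrame({
        "bias": errors.mean(),                  # > 0 means under-forecasting on average
        "MAE": errors.abs().mean(),
        "RMSE": np.sqrt((errors ** 2).mean()),
    })


# Toy errors: model_a is unbiased, model_b under-forecasts by a constant 2
toy_errors = pd.DataFrame({
    "model_a": [1.0, -1.0, 1.0, -1.0],
    "model_b": [2.0, 2.0, 2.0, 2.0],
})
summary = summarize_errors(toy_errors)
print(summary)
```

With the sign convention used here, a positive bias means the model tends to predict less than the actual consumption.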

## Error by Time of Day

In [None]:
# Add hour of day
errors_df['hour'] = errors_df.index.hour

# Calculate mean absolute error by hour for every model column
model_cols = errors_df.columns.drop('hour')
hourly_mae_df = errors_df[model_cols].abs().groupby(errors_df['hour']).mean()

# Plot
plt.figure(figsize=(15, 8))

for model in hourly_mae_df.columns:
    plt.plot(hourly_mae_df.index, hourly_mae_df[model], marker='o', label=model)

plt.title('Mean Absolute Error by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('MAE')
plt.xticks(range(0, 24))
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
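
To read the peak-error hours off the plot programmatically, the hours with the largest MAE can be listed per model. This is a minimal sketch assuming a frame shaped like `hourly_mae_df` (index is hour of day, one column per model). `worst_hours` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def worst_hours(hourly_mae, k=3):
    """Return, per model, the k hours with the largest mean absolute error."""
    return {
        model: hourly_mae[model].nlargest(k).index.tolist()
        for model in hourly_mae.columns
    }


# Toy hourly MAE: flat at 1.0 with bumps at 07:00 and 18:00
toy_mae = pd.Series(1.0, index=range(24))
toy_mae.loc[7] = 3.0
toy_mae.loc[18] = 2.5
toy_hourly = pd.DataFrame({"model_a": toy_mae})

peaks = worst_hours(toy_hourly, k=2)
print(peaks)
```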

## Model Ensemble

In [None]:
# Create an ensemble forecast (simple average)
forecasts['Ensemble'] = forecasts.mean(axis=1)

# Calculate ensemble error
ensemble_error = test_data['consumption'] - forecasts['Ensemble']

# Calculate metrics
ensemble_mae = mean_absolute_error(test_data['consumption'], forecasts['Ensemble'])
ensemble_rmse = np.sqrt(mean_squared_error(test_data['consumption'], forecasts['Ensemble']))
# MAPE is undefined wherever actual consumption is 0 (division by zero)
ensemble_mape = np.mean(np.abs(ensemble_error / test_data['consumption'])) * 100

print("Ensemble Model Performance:\n")
print(f"MAE: {ensemble_mae:.2f}")
print(f"RMSE: {ensemble_rmse:.2f}")
print(f"MAPE: {ensemble_mape:.2f}%")
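
MAPE divides by the actual value, so any interval with zero consumption yields an infinite or undefined result. One common workaround, sketched here as an assumption rather than the notebook's method, is to skip those intervals. `safe_mape` is a hypothetical helper.

```python
import numpy as np


def safe_mape(actual, predicted):
    """MAPE in percent, computed only where the actual value is nonzero.

    Returns NaN if every actual value is zero.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    if not mask.any():
        return float("nan")
    ratio = np.abs((actual[mask] - predicted[mask]) / actual[mask])
    return float(ratio.mean() * 100)


# Toy example: the middle point has zero actual consumption and is skipped
toy_mape = safe_mape([100.0, 0.0, 200.0], [110.0, 5.0, 180.0])
```

Skipping zero-consumption intervals changes what the metric measures, so the number of excluded points is worth reporting alongside the value.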

## Visualize Ensemble Forecast

In [None]:
# Plot the ensemble forecast against the actual values
plt.figure(figsize=(15, 8))

# Plot actual values
plt.plot(test_data.index, test_data['consumption'], label='Actual', linewidth=2)

# Plot ensemble forecast
plt.plot(forecasts.index, forecasts['Ensemble'], label='Ensemble Forecast', linestyle='--', linewidth=2, color='red')

plt.title('Ensemble Forecast vs Actual')
plt.xlabel('Date')
plt.ylabel('Consumption')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

## Conclusion

Based on our model comparison:

1. The LSTM model generally performs best for this electricity consumption forecasting task, with the lowest MAE and RMSE.
2. The Prophet model shows good performance for capturing seasonal patterns.
3. The SARIMA model outperforms the simpler ARIMA model, indicating that seasonal components are important.
4. The ensemble model, a simple average of the individual forecasts, provides a robust forecast by combining the strengths of all models.
5. All models show higher errors during transition periods (morning and evening) when consumption patterns change rapidly.

For production use, we recommend either the LSTM model or the ensemble approach, depending on the specific requirements for interpretability and computational resources.
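
The ranking behind these conclusions can be read directly from a metrics table shaped like `results_df` (one row per model, one column per metric). The sketch below assumes every column is an error metric where lower is better, which would not hold for a column such as R². `best_model_per_metric` is a hypothetical helper, and the toy numbers and model names are made up, not results from this notebook.

```python
import pandas as pd


def best_model_per_metric(metrics_df):
    """Return the model with the lowest value for each metric column.

    Assumes lower is better for every column.
    """
    return metrics_df.astype(float).idxmin()


# Toy metrics table (illustrative values only)
toy_metrics = pd.DataFrame(
    {"MAE": [3.0, 2.0, 4.0], "RMSE": [5.0, 4.0, 3.0]},
    index=["model_a", "model_b", "model_c"],
)
best = best_model_per_metric(toy_metrics)
print(best)
```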