# 9. Overall Summary and Introduction of Custom Evaluation Metrics

## Overview
- Reviewed the contents of notebooks 01-08, aggregated the results of the five methods, introduced custom evaluation metrics believed to be useful for data understanding, and compared the results of each method.

### 9.1 Characteristics of MASE and Its Suitability for the Current Data

#### 9.1.1 Definition of MASE

MASE is the MAE over the forecast period divided by the MAE of the seasonal naive forecast (or the plain naive forecast when $m=1$) against the actual values $y$ over the training period.

$$
\text{MASE} = \frac{\frac{1}{H} \sum\limits_{t=N+1}^{N+H} |y_t - \hat{y}_t|}{\frac{1}{N-m} \sum\limits_{t=m+1}^{N} |y_t - y_{t-m}|}
$$

| Symbol | Meaning |
| :--- | :--- |
| $\text{MASE}$ | Mean Absolute Scaled Error |
| $y_t$ | **Actual value** at time $t$ |
| $\hat{y}_t$ | **Forecast value** at time $t$ |
| $N$ | Total number of observations in the **training data** |
| $H$ | Number of observations in the **test data** (forecast period) (from $t=N+1$ to $t=N+H$) |
| $m$ | **Seasonal period** (e.g., $m=7$ for a weekly cycle in daily data) |
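
The definition above can be sketched in a few lines of NumPy. This is only an illustration of the formula: the function name `mase` and the toy numbers are assumptions made here, and it is not the implementation inside `src.evaluation_utils`.

```python
import numpy as np


def mase(y_train, y_test, y_pred, m):
    """MASE: forecast-period MAE scaled by the in-sample seasonal naive MAE."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if len(y_train) <= m:
        raise ValueError("training series must be longer than the seasonal period m")

    # Numerator: MAE over the H forecast points
    mae_forecast = np.mean(np.abs(y_test - y_pred))
    # Denominator: MAE of y_t against y_{t-m} for t = m+1 .. N (N - m terms)
    mae_naive_train = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae_forecast / mae_naive_train


# Toy example with a seasonal period of m = 2 (hypothetical numbers)
y_train = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y_test = [5.0, 8.0]
y_pred = [4.0, 6.0]
print(mase(y_train, y_test, y_pred, m=2))
```

In this toy case both the forecast MAE and the in-sample seasonal naive MAE equal 1.5, so the ratio is 1: the forecast is exactly as accurate as the seasonal naive forecast was on the training data.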

#### 9.1.2 MASE and Suitability for the Current Data

The models used here are Holt-Winters, SARIMAX (with Fourier terms), Prophet, LightGBM, and GRU.
Across these models:
- The solar power generation forecasts had a MASE of approximately 1.9-2.4, with a seasonal period of 1 day (48 frames for 30-minute interval data).
- The electricity demand forecasts had a MASE of approximately 0.5-0.8, with a seasonal period of 1 week (336 frames for 30-minute interval data).

In conjunction with graph observations,
- For solar power, the value after 24 hours is not expected to change drastically and is primarily influenced by weather factors. While total solar radiation and sunrise times differ significantly between January 1st and July 1st, the difference after just one day would be quite limited.
- On the other hand, for electricity consumption, the difference from the same time one week later is expected to be somewhat larger due to human activities, weather, temperature, events, etc.

Given these assumptions and the experimental setup (a 60-day training period, a forecast horizon of 1 day, and forecasts produced over 30 days), the seasonal naive error in the MASE denominator is averaged over the longer training period and is expected to be comparatively small for solar power generation.

Consequently, there is a tendency for solar power generation forecasts to yield higher MASE values, so to confirm this situation, the following metric was introduced:

#### 9.1.3 Introducing the Custom Metric My_Eval_Index

The underlying idea is:

"Given that we are comparing the MAE of naive seasonal forecasts with the forecasts from each method, wouldn't it be appropriate to use the MAE of the naive seasonal forecast for the test period as the denominator? Furthermore, since we are comparing and dividing errors from the same data over the same period, there is inherent consistency."

Therefore, the formula is: MAE of test period / MAE of seasonal naive forecast for test period
* Data periods excluded during the creation of the seasonal naive forecast are also excluded from the numerator's MAE calculation.

$$
\mathrm{My\_Eval\_Index} = \frac{\mathrm{MAE}_{\mathrm{test}}}{\mathrm{MAE}_{\mathrm{test,\ seasonal\ naive}}}
$$

Upon searching, I also encountered the term 'relative MAE'. However, this term appears to refer to measuring the relative magnitude of error using the formula: relative error = |predicted value - actual value| / actual value, which seems to differ from the concept I devised.

From here on, we will compare the evaluation results of the five methods, also utilizing this My_Eval_Index.
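
A minimal sketch of the idea in 9.1.3 follows. It assumes that the seasonal naive forecast is built inside the test period by shifting the actual values by $m$ steps, so that the first $m$ test points have no naive forecast and are dropped from both the numerator and the denominator. The function name `my_eval_index` and the toy data are hypothetical; the actual computation is done by `evaluate_forecast_result`, whose internals are not shown here.

```python
import numpy as np


def my_eval_index(y_test, y_pred, m):
    """MAE(adjusted) / MAE_Naive(Test), both computed over the same test points."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if len(y_test) <= m:
        raise ValueError("test series must be longer than the seasonal period m")

    # Seasonal naive forecast inside the test period: y_hat_t = y_{t-m}.
    # Assumption: the first m test points have no naive forecast and are excluded.
    actual = y_test[m:]
    naive_pred = y_test[:-m]

    mae_naive_test = np.mean(np.abs(actual - naive_pred))
    # The model's MAE is restricted to the same points ("MAE(adjusted)")
    mae_adjusted = np.mean(np.abs(actual - y_pred[m:]))
    return mae_adjusted / mae_naive_test


# Toy example with m = 2 (hypothetical numbers)
y_test = [10.0, 20.0, 12.0, 23.0, 11.0, 19.0]
y_pred = [11.0, 19.0, 11.0, 22.0, 12.0, 21.0]
print(my_eval_index(y_test, y_pred, m=2))
```

Here the model's adjusted MAE is 1.25 and the seasonal naive MAE is 2.5, so the index is 0.5: a value below 1 means the model beats the seasonal naive forecast over the same period.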

### 9.2 Aggregation of Prediction Result Evaluation Including Custom Metrics

In [1]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings

# Import common modules
from src.data_utils import load_timeseries_data
from src.evaluation_utils import evaluate_forecast_result

# Display settings
pd.options.display.float_format = '{:.4f}'.format

# Style settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [15, 7]
plt.rcParams['font.family'] = 'Meiryo' # For Windows. For Mac, use 'Hiragino Sans' etc.

# Hide warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data
BASE_DIR = Path().resolve()
DATA_DIR = BASE_DIR.parent / "data"
target_file = DATA_DIR / "e_gen_demand.csv"

df = load_timeseries_data(target_file)

print("Data shape:", df.shape)
print("Data period:", df.index.min(), "to", df.index.max())
df.head()

# Define training period length (rows before this index are treated as Train)
TRAIN_LENGTH = 2880  # 60 days * 48 points/day

# Extract training data (use the constant rather than a repeated literal)
y_train_solar = df['solar_gen_mw'].iloc[:TRAIN_LENGTH]
y_train_demand = df['e_demand_mw'].iloc[:TRAIN_LENGTH]

print(f"Train Solar Len: {len(y_train_solar)}")
print(f"Train Demand Len: {len(y_train_demand)}")

Data shape: (39408, 2)
Data period: 2023-01-01 00:00:00 to 2025-03-31 23:30:00
Train Solar Len: 2880
Train Demand Len: 2880


In [3]:
results = []
preds_dir = '../results/preds/'  # Pickle save destination
file_list = [f for f in os.listdir(preds_dir) if f.endswith('.pkl')]

# Configure seasonal periods
seasonal_map = {'solar': 48, 'demand': 336}
train_map = {'solar': y_train_solar, 'demand': y_train_demand}

for filename in file_list:
    # Filename rule: "solar_lightgbm.pkl" → target="solar", model="lightgbm"
    # Please adjust the split logic as needed
    target = 'solar' if 'solar' in filename else 'demand'
    
    # Extract model name (simple processing)
    model_name = filename.replace('.pkl', '').replace(f'{target}_', '')
    
    # Load Pickle
    pred_df = pd.read_pickle(os.path.join(preds_dir, filename))
    
    # Execute evaluation
    metrics = evaluate_forecast_result(
        pred_df=pred_df,
        y_train=train_map[target],
        seasonal_period=seasonal_map[target]
    )
    
    # Store results
    res_dict = metrics.to_dict()
    res_dict['Model'] = model_name
    res_dict['Target'] = target
    results.append(res_dict)

# Convert to DataFrame
df_results = pd.DataFrame(results)

In [4]:
# Reshape for better readability: rows as models, columns as target x metric
pivot_df = df_results.pivot(
    index='Model', columns='Target',
    values=['MAE', 'RMSE', 'MAE(adjusted)', 'MAE_Naive(Test)', 'My_Eval_Index',
            'MAE_Naive(Train)', 'MASE (Train)'])

# Reorder columns for better readability (group Solar and Demand)
pivot_df = pivot_df.swaplevel(0, 1, axis=1).sort_index(axis=1)

print("=== Model Evaluation Results Summary ===")
display(pivot_df)

# Save to CSV
pivot_df.to_csv('../results/final_model_comparison_summary.csv')
print("Saved comparison to ../results/final_model_comparison_summary.csv")

=== Model Evaluation Results Summary ===


Target: demand

| Model | MAE | MAE(adjusted) | MAE_Naive(Test) | MAE_Naive(Train) | MASE (Train) | My_Eval_Index | RMSE |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GRU | 2119.5621 | 1986.4520 | 1911.2595 | 3200.9591 | 0.6622 | 1.0393 | 2908.0458 |
| SARIMAX | 1712.4219 | 1695.4076 | 1908.2428 | 3200.9591 | 0.5350 | 0.8885 | 2372.9847 |
| hw | 1663.8908 | 1664.6051 | 1908.2428 | 3200.9591 | 0.5198 | 0.8723 | 2212.8888 |
| lightGBM | 2657.4638 | 2660.7534 | 1911.2595 | 3200.9591 | 0.8302 | 1.3921 | 3356.7358 |
| prophet | 2163.3559 | 2193.3998 | 1908.2428 | 3200.9591 | 0.6758 | 1.1494 | 2753.3141 |

Target: solar

| Model | MAE | MAE(adjusted) | MAE_Naive(Test) | MAE_Naive(Train) | MASE (Train) | My_Eval_Index | RMSE |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GRU | 1340.2721 | 1352.8759 | 1227.9985 | 721.5784 | 1.8574 | 1.1017 | 2316.5246 |
| SARIMAX | 1395.8209 | 1405.8266 | 1228.4986 | 721.5784 | 1.9344 | 1.1443 | 2278.5357 |
| hw | 1663.8680 | 1680.0904 | 1228.4986 | 721.5784 | 2.3059 | 1.3676 | 2740.7090 |
| lightGBM | 1358.2313 | 1349.7915 | 1227.9985 | 721.5784 | 1.8823 | 1.0992 | 2728.5854 |
| prophet | 1730.1959 | 1729.4279 | 1228.4986 | 721.5784 | 2.3978 | 1.4078 | 2446.8022 |


Saved comparison to ../results/final_model_comparison_summary.csv


The columns in the table above are defined as follows:
```
MAE: MAE of regular predicted values,
MAE(adjusted): Forecast MAE corresponding to the seasonal naive forecast period,
MAE_Naive(Test): MAE of seasonal naive forecast for the test period,
My_Eval_Index: MAE(adjusted) / MAE_Naive(Test),
MAE_Naive(Train): MAE of seasonal naive forecast for the Train period,
MASE (Train): Regular MASE: MAE / MAE_Naive(Train),
RMSE: Regular RMSE
```

### 9.3 Considerations for Evaluation Metrics

- As expected, the MAE of the seasonal naive forecast during the training period, MAE_Naive(Train), is small for solar power (solar) at 721.6, and large for electricity demand (demand) at 3201.0.
- Consequently, with standard MASE, solar power tends to have a larger value due to its smaller denominator, whereas electricity demand tends to have a value slightly below 1 due to its larger denominator.
- On the other hand, with My_Eval_Index, which aligns the target period during the test period, both values are around 1, making it easier to understand how much better or worse the forecast is compared to the seasonal naive forecast over the same period.
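
As a quick arithmetic check of the points above, the sketch below recomputes both ratios for the GRU row using the values from the table in 9.2. The dictionary layout and variable names are illustrative only and are not part of `evaluate_forecast_result`.

```python
# Values copied from the GRU row of the results table in 9.2
gru = {
    "solar": {"MAE": 1340.2721, "MAE(adjusted)": 1352.8759,
              "MAE_Naive(Test)": 1227.9985, "MAE_Naive(Train)": 721.5784},
    "demand": {"MAE": 2119.5621, "MAE(adjusted)": 1986.4520,
               "MAE_Naive(Test)": 1911.2595, "MAE_Naive(Train)": 3200.9591},
}

ratios = {}
for target, v in gru.items():
    # Standard MASE: test MAE scaled by the training-period seasonal naive MAE
    mase_train = v["MAE"] / v["MAE_Naive(Train)"]
    # My_Eval_Index: adjusted test MAE scaled by the test-period seasonal naive MAE
    my_eval_index = v["MAE(adjusted)"] / v["MAE_Naive(Test)"]
    ratios[target] = (mase_train, my_eval_index)
    print(f"{target}: MASE (Train) = {mase_train:.4f}, My_Eval_Index = {my_eval_index:.4f}")
```

This reproduces the tabulated values (1.8574 and 1.1017 for solar, 0.6622 and 1.0393 for demand): the same model looks very different under MASE for the two targets, while My_Eval_Index places both close to 1.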

### 9.4 Relative Evaluation of Prediction Methods

- Although My_Eval_Index clarified the characteristics of the data, the relative comparison of the five prediction methods can be made with any of these metrics.

- Solar power generation ranges from 0 to 15000 MW, while electricity demand (i.e., consumption) ranges from approximately 20000 to 56000 MW, which is significantly larger. Since the absolute errors for the two targets are of the same order of magnitude, solar power generation forecasts have the larger error relative to the scale of the data.

- Furthermore, for solar power generation prediction, LightGBM and GRU were found to be superior (though LightGBM's RMSE is large, indicating a tendency to produce large errors). For electricity demand prediction, Holt-Winters and SARIMAX were superior (both of which have relatively low RMSE).
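
The ranking described above can be reproduced from the table in 9.2 with a few lines of pandas. This is a sketch: the `summary` DataFrame is rebuilt by hand from the tabulated My_Eval_Index and RMSE values, and the `rank_*` column names are introduced here for illustration.

```python
import pandas as pd

# My_Eval_Index and RMSE copied from the results table in 9.2
rows = [
    ("GRU", "demand", 1.0393, 2908.0458), ("GRU", "solar", 1.1017, 2316.5246),
    ("SARIMAX", "demand", 0.8885, 2372.9847), ("SARIMAX", "solar", 1.1443, 2278.5357),
    ("hw", "demand", 0.8723, 2212.8888), ("hw", "solar", 1.3676, 2740.7090),
    ("lightGBM", "demand", 1.3921, 3356.7358), ("lightGBM", "solar", 1.0992, 2728.5854),
    ("prophet", "demand", 1.1494, 2753.3141), ("prophet", "solar", 1.4078, 2446.8022),
]
summary = pd.DataFrame(rows, columns=["Model", "Target", "My_Eval_Index", "RMSE"])

# Rank the models within each target; rank 1 is the smallest error
for col in ["My_Eval_Index", "RMSE"]:
    summary[f"rank_{col}"] = summary.groupby("Target")[col].rank(method="min").astype(int)

print(summary.sort_values(["Target", "rank_My_Eval_Index"]).to_string(index=False))
```

Placing the two rank columns side by side makes cases such as LightGBM on solar visible, where a good average error coexists with a poor RMSE rank.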

### 9.5 Feature Selection and Challenges of the 5 Methods

#### 9.5.1 Feature Selection

In feature selection for electricity demand, the usual LASSO behavior was not observed: normally, as the regularization coefficient increases, the number of selected features decreases and the error increases. We therefore set the regularization coefficient manually; however, other feature selection methods, such as feature importance from Random Forest or LightGBM, could also be considered.

#### 9.5.2 Holt-Winters

Although I initially thought it would be limited by its inability to handle multiple seasonal periods, it yielded good results for electricity demand, which is expected to have weekly, daily, and possibly higher-frequency seasonality. Some challenges remain for further performance improvement:
- As with every method applied to solar power generation, it completely fails to account for the periods of zero output at night. This was addressed by manually setting the predictions to zero during the expected nighttime hours.
- Choices such as additive versus multiplicative components, and whether to enable `damped_trend` (trend damping), still leave room for exploration.
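
The nighttime correction mentioned above can be sketched as follows. The daytime window (05:00 to 19:00) and the function name `zero_night_predictions` are assumptions made for illustration; the actual hours used in the earlier notebooks are not stated here, and in practice they depend on season and location.

```python
import numpy as np
import pandas as pd


def zero_night_predictions(pred, start_hour=5, end_hour=19):
    """Return a copy of `pred` with values outside [start_hour, end_hour) set to zero."""
    hours = pred.index.hour
    is_daytime = (hours >= start_hour) & (hours < end_hour)
    # Series.where keeps values where the condition is True and replaces the rest
    return pred.where(is_daytime, 0.0)


# One day of 30-minute forecasts with a constant dummy value (hypothetical data)
idx = pd.date_range("2023-01-01", periods=48, freq="30min")
raw = pd.Series(np.full(48, 100.0), index=idx)
fixed = zero_night_predictions(raw)
print(fixed.between_time("04:00", "06:00"))
```

The original series is left untouched, so the raw and corrected forecasts can be evaluated side by side.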

#### 9.5.3 SARIMAX

Initially, estimating the parameters of a SARIMA model without Fourier terms as exogenous variables was time-consuming, often exceeding 10 hours. Introducing Fourier terms improved this and led to relatively good results.
- A potential improvement involves introducing annual periodicity (considering that factors such as the sun's angle and sunrise/sunset times significantly impact solar power, while temperature, other weather conditions, and seasonal events significantly impact consumption).
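
Fourier terms for several periods, including an annual one, can be generated as sketched below. The helper `fourier_terms` and the chosen orders (3, 2, 1) are assumptions for illustration and are not the settings used in the SARIMAX notebook; the resulting array is the kind of matrix that would be passed as the `exog` argument of statsmodels' `SARIMAX`.

```python
import numpy as np


def fourier_terms(n_obs, period, order, start=0):
    """Return an (n_obs, 2 * order) array of sin/cos pairs for one seasonal period."""
    t = np.arange(start, start + n_obs)
    columns = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)


POINTS_PER_DAY = 48  # 30-minute interval data
N_TRAIN = 2880       # 60 days of training data

daily = fourier_terms(N_TRAIN, period=POINTS_PER_DAY, order=3)
weekly = fourier_terms(N_TRAIN, period=7 * POINTS_PER_DAY, order=2)
annual = fourier_terms(N_TRAIN, period=365.25 * POINTS_PER_DAY, order=1)

exog = np.hstack([daily, weekly, annual])
print(exog.shape)  # (2880, 12)
```

For the forecast period, the same function would be called with `start=N_TRAIN` so that the phase continues from the training data. Note that a 60-day training window covers only about one sixth of an annual cycle, so whether the annual term can be estimated reliably is an open question.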

#### 9.5.4 Prophet

Prophet can incorporate holidays and days off, and other general exogenous variables can also be added.
- First, by including holidays as exogenous variables and exploring various settings and a wide range of parameters, performance improvement can be expected. Details on parameters, etc., are described at the end of 05_Prophet_Forecast.ipynb.

#### 9.5.5 LightGBM

- Because it is non-linear, I thought it would effectively distinguish things like zero solar power at night, but no significant advantage over other methods was observed. On the other hand, it seemed to struggle with electricity demand, which likely has complex seasonality, and its performance was not good.
- Parameter tuning was not very successful, and performance degraded slightly after tuning. Revisiting the tuning approach and widening the parameter search range could be considered.

#### 9.5.6 GRU

- Parameter tuning selected a small number of layers, so the model avoids a complex configuration.
- Tuning improved performance. Making the model more complex, for example by increasing the number of layers, introduces trade-offs such as longer prediction times, so common, general-purpose improvement strategies are preferable.

### 9.6 Overall Review

- For solar power generation, once periodicity is captured to a certain extent, excluding zero generation at night can lead to performance improvement.
- Furthermore, incorporating the solar angle and sunrise/sunset times, which directly determine solar irradiance over the year, has high potential to improve accuracy. Beyond that, "Will it be sunny, rainy, or cloudy tomorrow?" is likely to remain a significant factor in 24-hour forecasts.
- This means it comes down to obtaining accurate weather forecast information. Beyond that, it would be necessary to consider near-real-time forecasts with a lag of only a few hours (e.g., "It was sunny today, so power generation is likely to increase following this pattern").

- Unlike solar power, electricity demand forecasting is likely the sum of many intricate factors such as economic activity, making it inherently challenging.
- Heating and cooling demand shares one element with solar power: it would be much easier to forecast if temperature could be predicted accurately. Other factors include events and economic activities; some events can be predicted to a certain extent in advance, but others, like a pandemic, are difficult to forecast.
- Since there's no method here like excluding zero values at night, implementing improvement measures for each model will likely be the primary means of performance enhancement.