### 1) What data the framework expects
The framework expects a single CSV with a timestamp column (default: `measurement_time`) and a set of feature columns.

In [1]:
from src.data_preparation import load_data
import pvlib

In [2]:
# Read the processed dataset into a pandas DataFrame.
# The config file is covered later.
# load_data (src.data_preparation) loads the CSV, parses the timestamp column, ensures UTC,
# sets it as the index, and sorts chronologically.
PROCESSED_DATA_PATH = "data/processed/dayTime_NAM_dayahead_features_processed.csv"

df = load_data(PROCESSED_DATA_PATH, date_col="measurement_time")


Loaded 15,078 records
Date range: 2014-01-03 14:00:00+00:00 to 2016-12-31 03:00:00+00:00
Timezone: UTC
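
The loading steps described in the cell comment can be sketched with plain pandas. This is a minimal illustration, not the actual `src.data_preparation.load_data`; the helper name `load_data_sketch` and the inline CSV are assumptions made for the example.

```python
import io

import pandas as pd


def load_data_sketch(path_or_buffer, date_col="measurement_time"):
    """Hypothetical stand-in for load_data: parse, force UTC, index, sort."""
    df = pd.read_csv(path_or_buffer)
    # utc=True converts tz-aware strings to UTC and localizes naive ones as UTC
    df[date_col] = pd.to_datetime(df[date_col], utc=True)
    return df.set_index(date_col).sort_index()


# Two rows deliberately given out of order (made-up values)
csv_text = (
    "measurement_time,ghi\n"
    "2014-01-03 15:00:00+00:00,12.9\n"
    "2014-01-03 14:00:00+00:00,0.0\n"
)
demo = load_data_sketch(io.StringIO(csv_text))
print(demo.index.tz, demo.index.is_monotonic_increasing)
```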


In [3]:
df.head()

measurement_time,ghi,dni,solar_zenith,time_gap_hours,time_gap_norm,day_boundary_flag,hour_progression,absolute_hour,GHI_cs,DNI_cs,...,CSI_dni,season_flag,hour_sin,hour_cos,month_sin,month_cos,nam_ghi,nam_dni,nam_cc,nam_target_time
2014-01-03 14:00:00+00:00,0.0,0.0,100.276046,0.166667,0.006944,0.0,14.491667,24.491667,0.0,0.0,...,,2,-0.5,-0.8660254,0.5,0.866025,0.0,0.0,2.0,2014-01-03 15:00:00+00:00
2014-01-03 15:00:00+00:00,12.9145,89.245833,89.724162,0.016667,0.000694,0.0,15.491667,25.491667,0.0,0.0,...,1.2,2,-0.707107,-0.7071068,0.5,0.866025,99.625,497.069277,2.0,2014-01-03 16:00:00+00:00
2014-01-03 16:00:00+00:00,143.7755,636.475,80.146724,0.016667,0.000694,0.0,16.491667,26.491667,35.1595,182.29784,...,1.2,2,-0.866025,-0.5,0.5,0.866025,274.75,760.162689,0.0,2014-01-03 17:00:00+00:00
2014-01-03 17:00:00+00:00,289.955,783.193333,71.988814,0.016667,0.000694,0.0,17.491667,27.491667,184.554711,540.120302,...,1.2,2,-0.965926,-0.258819,0.5,0.866025,416.375,871.248807,2.0,2014-01-03 18:00:00+00:00
2014-01-03 18:00:00+00:00,402.643333,843.05,65.816158,0.016667,0.000694,0.0,18.491667,28.491667,321.992462,692.514065,...,1.2,2,-1.0,-1.83697e-16,0.5,0.866025,510.875,919.980698,8.0,2014-01-03 19:00:00+00:00


In [4]:
df.columns

Index(['ghi', 'dni', 'solar_zenith', 'time_gap_hours', 'time_gap_norm',
       'day_boundary_flag', 'hour_progression', 'absolute_hour', 'GHI_cs',
       'DNI_cs', 'CSI_ghi', 'CSI_dni', 'season_flag', 'hour_sin', 'hour_cos',
       'month_sin', 'month_cos', 'nam_ghi', 'nam_dni', 'nam_cc',
       'nam_target_time'],
      dtype='object')
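
The `hour_sin`/`hour_cos` and `month_sin`/`month_cos` values in the preview are consistent with a standard cyclical encoding (for example, 14:00 gives `hour_sin = -0.5` and `hour_cos ≈ -0.866`). A minimal sketch of such an encoding follows; the function name `cyclical_time_features` is hypothetical, and the framework's own implementation may differ.

```python
import numpy as np
import pandas as pd


def cyclical_time_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Encode hour-of-day and month-of-year as points on the unit circle."""
    hour_angle = 2 * np.pi * index.hour.to_numpy() / 24
    month_angle = 2 * np.pi * index.month.to_numpy() / 12
    return pd.DataFrame(
        {
            "hour_sin": np.sin(hour_angle),
            "hour_cos": np.cos(hour_angle),
            "month_sin": np.sin(month_angle),
            "month_cos": np.cos(month_angle),
        },
        index=index,
    )


idx = pd.DatetimeIndex(["2014-01-03 14:00", "2014-01-03 18:00"], tz="UTC")
feats = cyclical_time_features(idx)
print(feats.round(6))
```

Encoding time this way keeps 23:00 and 00:00 (or December and January) close together in feature space, which a raw integer hour or month would not.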

We create three datasets that differ in their feature selection.

1. Phase 1: Simple Baseline
    - Inputs:
        - GHI_CS, Elevation.
    - Output:
        - Day ahead (GHI -> GHI_kt)

2. Phase 2: Simple Baseline
    - Inputs:
        - Historical sequence: 3 Days (Before Issuing) × 3 features (ghi_kt , elevation)
            - GHI_kt,
            - Elevation
            - Temporal and Geo Features.
        - NAM: Nearest grid point only (26-39 horizons ahead  (1 day) × 2 features) (From Issue Time)
            - NAM GHI forecast (DSWRF)
            - NAM cloud cover (TCDC)
    - Output:
        - Day ahead (GHI -> GHI_kt)

3. Phase 3: Simple Baseline
    - Inputs:
        - Historical sequence: 3 Days (Before Issuing) × 3 features (ghi_kt , elevation)
            - GHI_kt,
            - Elevation
        - NAM: Nearest grid point only (26-39 horizons ahead  (1 day) × 2 features) (From Issue Time)
            - NAM GHI forecast (DSWRF)
            - NAM cloud cover (TCDC)
    - Output:
        - Day ahead (GHI -> GHI_kt)

To control which features are fed into the models for different experiments, edit `config.py` and add a named set to `FEATURE_SETS`, for example `"vaf_only": [...]`:

``` 
FEATURE_SETS = {
    "all": [...],
    "vaf_only": [...],
    "weather_minimal": [...],
}

```

Then, in any script, set `selected_columns = FEATURE_SETS["vaf_only"]`. To tell the model which set to use, set `config["feature_selection"] = "vaf_only"` in the experiment config (the two relevant keys are marked):

```
LSTM_CONFIG = {
    "experiment_name": "solar_lstm_kbins",
    "model_type": "LSTM",
    "model_config": {
        "hidden_size": 64,
        "num_layers": 2,
        "dropout": 0.2,
        "bidirectional": True,
    },
    "data_prefix": DEFAULT_DATA_PREFIX,
    "feature_cols": FEATURE_COLS,              # <-- feature columns
    "feature_selection": DEFAULT_FEATURE_SET,  # <-- selected feature set
    "target_col": TARGET_COL,
    "batch_size": 32,
    "num_epochs": 50,
    "learning_rate": 0.001,
    "early_stopping_patience": 20,
    "max_folds": 3,
}
``` 

This will be covered later.
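
As a rough illustration of how a `feature_selection` value could be turned into a list of columns, the sketch below accepts either a set name or an explicit list. The function `resolve_feature_selection` and the contents of the example sets are hypothetical; the real resolver in the framework is not shown in this notebook.

```python
# Illustrative placeholder sets; the real FEATURE_SETS lives in config.py
FEATURE_SETS = {
    "all": ["solar_zenith", "CSI_ghi", "nam_ghi", "nam_cc"],
    "weather_minimal": ["CSI_ghi", "nam_cc"],
}


def resolve_feature_selection(selection, feature_sets=FEATURE_SETS):
    """Return column names for a named set, or pass an explicit list through."""
    if isinstance(selection, str):
        if selection not in feature_sets:
            raise KeyError(
                f"Unknown feature set {selection!r}; known sets: {sorted(feature_sets)}"
            )
        return list(feature_sets[selection])
    return list(selection)


print(resolve_feature_selection("weather_minimal"))
print(resolve_feature_selection(["CSI_ghi"]))
```

Failing loudly on an unknown name is a deliberate choice here: a silent fallback to another set would make experiment results hard to trace.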

### 2) Prepare splits (once)

Module: `data_preparation.py` 

1. Fixed holdout (`fixed_holdout_split`): train/val/test by calendar ranges.
2. Rolling-origin evaluation (`rolling_origin_evaluation`): month-by-month rolling folds (train up to month t, validate on the next window), with fold metadata saved to `splits/rolling_origin_splits.json`.

#### Fixed Holdout Split

In [5]:
from src.data_preparation import fixed_holdout_split, rolling_origin_evaluation

# 1. Fixed holdout split: use predefined calendar dates for the splits.
df_ph1_train_df, df_ph1_val_df, df_ph1_test_df, df_ph1_split_indices = fixed_holdout_split(df=df)




FIXED HOLDOUT SPLIT
Training:   2014-01-01 to 2015-09-30 → 8,705 records
Validation: 2015-10-01 to 2015-12-31 → 1,275 records
Test:       2016-01-01 to 2016-12-31 → 5,079 records
Total:      15,059 records
Original:   15,078 records
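
A calendar-based holdout split can be sketched with boolean masks on the UTC index. The boundaries below are taken from the printed ranges above; the function name `fixed_holdout_sketch` and the synthetic hourly frame are assumptions for illustration, and the real `fixed_holdout_split` may treat boundaries differently.

```python
import pandas as pd


def fixed_holdout_sketch(df, val_start="2015-10-01", test_start="2016-01-01",
                         test_end="2017-01-01"):
    """Split a UTC-indexed frame by calendar boundaries (half-open intervals)."""
    val_start, test_start, test_end = (
        pd.Timestamp(t, tz="UTC") for t in (val_start, test_start, test_end)
    )
    idx = df.index
    train = df[idx < val_start]
    val = df[(idx >= val_start) & (idx < test_start)]
    test = df[(idx >= test_start) & (idx < test_end)]
    return train, val, test


# Synthetic hourly data from 2015-09-30 through 2016-01-02 (95 days)
hours = pd.date_range("2015-09-30", periods=24 * 95,
                      freq=pd.Timedelta(hours=1), tz="UTC")
demo = pd.DataFrame({"ghi": 0.0}, index=hours)
train, val, test = fixed_holdout_sketch(demo)
print(len(train), len(val), len(test))
```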


#### Rolling Origin Evaluation

In [6]:

# 2. Rolling-origin split (better suited to time series; analogous to K-fold)
from src.data_preparation import fixed_holdout_split, rolling_origin_evaluation, save_splits_info

rollingSplits_df = rolling_origin_evaluation(df=df, start_train = '2014-01-2',
    end_train = '2016-12-31')
save_splits_info({}, rollingSplits_df, experiment_name="exp-002")



ROLLING ORIGIN EVALUATION
Total folds: 35
Frequency: MS
Data range: 2014-01-03 to 2016-12-31

Fold Summary:
  Fold 1: Train [2014-01-03 to 2014-02-01] (378 records) → Val [2014-02-02 to 2014-02-28] (235 records)
  Fold 2: Train [2014-01-03 to 2014-03-01] (627 records) → Val [2014-03-02 to 2014-03-31] (407 records)
  Fold 3: Train [2014-01-03 to 2014-04-01] (1,061 records) → Val [2014-04-02 to 2014-04-30] (393 records)
  Fold 4: Train [2014-01-03 to 2014-05-01] (1,481 records) → Val [2014-05-02 to 2014-05-31] (407 records)
  Fold 5: Train [2014-01-03 to 2014-06-01] (1,915 records) → Val [2014-06-02 to 2014-06-30] (393 records)
  Fold 6: Train [2014-01-03 to 2014-07-01] (2,335 records) → Val [2014-07-02 to 2014-07-31] (407 records)
  Fold 7: Train [2014-01-03 to 2014-08-01] (2,769 records) → Val [2014-08-02 to 2014-08-31] (407 records)
  Fold 8: Train [2014-01-03 to 2014-09-01] (3,203 records) → Val [2014-09-02 to 2014-09-30] (393 records)
  Fold 9: Train [2014-01-03 to 2014-10-01] (3,6

Metadata about the resulting splits is stored in (splits/...json).

`rollingSplits_df` is a list of folds, indexed from 0 to N; each fold holds metadata about its train, test, and validation partitions:
- start, end, and size
- indices 
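
The expanding-window idea behind rolling-origin evaluation can be sketched as follows, using month-start cut points (the `MS` frequency reported above). The function name, the fold dictionary keys, and the exact boundary handling are assumptions; the real `rolling_origin_evaluation` places its boundaries slightly differently, as the fold summary shows.

```python
import numpy as np
import pandas as pd


def rolling_origin_sketch(index: pd.DatetimeIndex) -> list:
    """Train on everything before a month start, validate on that month."""
    cuts = pd.date_range(index.min().normalize(), index.max(), freq="MS")
    folds = []
    for i, cut in enumerate(cuts):
        val_mask = index >= cut
        if i + 1 < len(cuts):
            val_mask &= index < cuts[i + 1]  # stop at the next month start
        train_idx = np.flatnonzero(index < cut)
        val_idx = np.flatnonzero(val_mask)
        if len(train_idx) == 0 or len(val_idx) == 0:
            continue  # skip folds with no data on either side
        folds.append({
            "fold": len(folds) + 1,
            "train_start": index[train_idx[0]], "train_end": index[train_idx[-1]],
            "val_start": index[val_idx[0]], "val_end": index[val_idx[-1]],
            "train_size": len(train_idx), "val_size": len(val_idx),
            "train_indices": train_idx.tolist(), "val_indices": val_idx.tolist(),
        })
    return folds


days = pd.date_range("2014-01-03", "2014-04-30", freq="D", tz="UTC")
folds = rolling_origin_sketch(days)
for f in folds:
    print(f["fold"], f["train_size"], f["val_size"])
```

Because the training window always ends before the validation window starts, no future information leaks into training, which is the main reason to prefer this scheme over shuffled K-fold for time series.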

### Data preview: is it consistent with our expectations?
The data has a fixed length per day, bounded by the hours of the day that were chosen to match the CASIO forecast and NAM forecast horizon.

Does the data meet our expectations?
- Are NAM records one hour ahead of the measurements?
- Do all sequences have the same length?
- Do we have zero values in the features?

In [7]:
import pandas as pd
import numpy as np
df_noIndex = df.reset_index()
df_noIndex['nam_target_time'] = pd.to_datetime(df_noIndex['nam_target_time'])
df_noIndex['measurement_time'] = pd.to_datetime(df_noIndex['measurement_time'])
df_noIndex['time_diff'] = (df_noIndex['nam_target_time'] - df_noIndex['measurement_time']).dt.total_seconds() / 3600  # Calculate the time difference in hours
    
# Check if the time difference is exactly 1 hour
is_nam_lag_correct = np.allclose(df_noIndex['time_diff'], 1)  # All time differences should be 1 hour
print(f"Does NAM records have a 1-hour lag ahead from measurements? {'Yes' if is_nam_lag_correct else 'No'}")

Does NAM records have a 1-hour lag ahead from measurements? Yes


In [8]:
# 2. Check if sequences have the same length across each day
# Extract date from the 'measurement_time'
df_noIndex['date'] = df_noIndex['measurement_time'].dt.date
    
# Group by date and count records per day
daily_counts = df_noIndex.groupby('date').size()
    
# Find dates where the record count is not equal to 14 (the expected records per day)
inconsistent_dates = daily_counts[daily_counts != 14]
    
# Print dates where the record count differs from 14
if not inconsistent_dates.empty:
    print("Dates with record counts different from 14 hours:")
    print(inconsistent_dates)
else:
    print("All dates have the expected 14-hour record count.")

Dates with record counts different from 14 hours:
date
2014-01-03    10
2014-01-30     4
2014-02-05    10
2014-02-13     4
2014-02-19    10
2015-08-23     4
2015-08-24    10
2015-12-31     4
2016-01-01    10
2016-07-13     4
2016-07-15    10
2016-12-31     4
dtype: int64


In [9]:
# Create a list of dates to drop
dates_to_drop = [
    '2014-01-03', '2014-01-30', '2014-02-05', '2014-02-13', '2014-02-19',
    '2015-08-23', '2015-08-24', '2015-12-31', '2016-01-01', '2016-07-13',
    '2016-07-15', '2016-12-31'
]

# Print the first few rows of the index to see its format
print("Index sample before processing:")
print(df.index[:5])
print("\nIndex timezone:", df.index.tz)

# Convert dates to datetime with UTC timezone
dates_to_drop = pd.to_datetime(dates_to_drop).tz_localize('UTC')

# Get the original size of the DataFrame
original_size = len(df)

df = df[~df.index.normalize().isin(dates_to_drop)]

# Print the number of rows dropped
print(f"\nOriginal number of records: {original_size}")
print(f"Number of records after dropping dates: {len(df)}")
print(f"Number of records dropped: {original_size - len(df)}")

Index sample before processing:
DatetimeIndex(['2014-01-03 14:00:00+00:00', '2014-01-03 15:00:00+00:00',
               '2014-01-03 16:00:00+00:00', '2014-01-03 17:00:00+00:00',
               '2014-01-03 18:00:00+00:00'],
              dtype='datetime64[ns, UTC]', name='measurement_time', freq=None)

Index timezone: UTC

Original number of records: 15078
Number of records after dropping dates: 14994
Number of records dropped: 84
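
Instead of hard-coding the list of dates, the incomplete days could be derived directly from the per-day counts. The helper below is a sketch under that assumption; `drop_incomplete_days` and the synthetic frame are not part of the framework.

```python
import pandas as pd


def drop_incomplete_days(df, expected_per_day=14):
    """Keep only rows whose UTC calendar day has exactly the expected count."""
    days = df.index.normalize()
    counts = days.value_counts()
    complete_days = counts.index[counts.to_numpy() == expected_per_day]
    return df[days.isin(complete_days)]


# One complete day (14 records) and one short day (10 records), made-up data
full_day = pd.date_range("2014-01-04 00:00", periods=14,
                         freq=pd.Timedelta(hours=1), tz="UTC")
short_day = pd.date_range("2014-01-05 00:00", periods=10,
                          freq=pd.Timedelta(hours=1), tz="UTC")
demo = pd.DataFrame({"ghi": 1.0}, index=full_day.append(short_day))
cleaned = drop_incomplete_days(demo)
print(len(demo), "->", len(cleaned))
```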


In [10]:
# 3. Check for zero values and NaN values in the features
features_to_check = [
    'ghi', 'dni', 'solar_zenith', 'GHI_cs', 'DNI_cs', 'CSI_ghi', 
    'CSI_dni', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos', 
    'nam_ghi', 'nam_dni', 'nam_cc'
]

# 1. Count exact zeros across the checked features
zero_values_in_features = (df[features_to_check] == 0).sum().sum()
print(f"Do we have zero values in features? {'Yes' if zero_values_in_features > 0 else 'No'}")

# 2. Calculate the number of NaN values for each feature
# We use .isna() instead of == 0
nans_per_feature = df[features_to_check].isna().sum()

# 3. Filter to get only features that actually have NaN values
features_with_nans = nans_per_feature[nans_per_feature > 0]

# 4. Report the findings for which features have NaNs
if features_with_nans.empty:
    print("No NaN (missing) values found in any of the specified features.")
else:
    print("--- Features With NaN Values ---")
    print("The following features have NaN values, with the total count for each:")
    # Sort for clearer output
    print(features_with_nans.sort_values(ascending=False))
    print("\n" + "="*40 + "\n")

    # 5. Analyze the distribution of hours for rows containing NaNs
    print("--- Distribution of NaN-Value Records by Hour ---")
    
    # Create a boolean mask for rows that contain *at least one* NaN
    # in the specified columns
    rows_with_any_nan = df[features_to_check].isna().any(axis=1)
    
    if rows_with_any_nan.sum() > 0:
        # Get the index for these rows
        nan_rows_index = df.index[rows_with_any_nan]
        
        # Extract the hour from the DatetimeIndex and get the value counts
        hour_distribution = nan_rows_index.hour.value_counts().sort_index()
        
        print("Distribution of records (rows) containing at least one NaN, by hour:")
        print(hour_distribution)
        
        # Optional: Print total number of affected rows
        print(f"\nTotal number of rows with at least one NaN: {rows_with_any_nan.sum()}")
    else:
        # This case shouldn't be hit if features_with_nans was not empty,
        # but it's good practice to include.
        print("No rows found with NaN values (this is unexpected, check logic).")

Do we have zero values in features? Yes
--- Features With NaN Values ---
The following features have NaN values, with the total count for each:
CSI_dni    1807
CSI_ghi    1623
nam_ghi     280
nam_dni     280
nam_cc      280
dtype: int64


--- Distribution of NaN-Value Records by Hour ---
Distribution of records (rows) containing at least one NaN, by hour:
measurement_time
0      20
1     207
2     471
3     777
14    335
15     48
16     20
17     20
18     20
19     20
20     20
21     20
22     20
23     20
Name: count, dtype: int64

Total number of rows with at least one NaN: 2018


In [11]:
df_interpolated = df.interpolate(method='linear')

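
Note that `method='linear'` ignores the index and treats rows as equally spaced, which matters here because consecutive rows can be separated by an overnight gap (03:00 to 14:00 in the preview). The toy series below, made up for illustration, contrasts it with `method='time'`; whether time-weighted interpolation is preferable for this dataset is a judgment call this notebook does not settle.

```python
import numpy as np
import pandas as pd

# Three samples: one hour apart, then nine hours apart (made-up values)
idx = pd.DatetimeIndex(
    ["2014-01-04 00:00", "2014-01-04 01:00", "2014-01-04 10:00"], tz="UTC"
)
s = pd.Series([0.0, np.nan, 2.0], index=idx)

linear = s.interpolate(method="linear")  # rows treated as equally spaced
timewise = s.interpolate(method="time")  # weighted by elapsed time

print(linear.iloc[1], timewise.iloc[1])
```

With positional interpolation the gap is filled halfway between its neighbours, while the time-weighted version stays close to the nearer sample.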


In [12]:
# 3. Check if there are zero values in any of the features
zero_values_in_features = (df_interpolated[['ghi', 'dni', 'solar_zenith', 'GHI_cs', 'DNI_cs', 'CSI_ghi', 
                                   'CSI_dni', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos', 
                                   'nam_ghi', 'nam_dni', 'nam_cc']] == 0).sum().sum()
print(f"Do we have zero values in features? {'Yes' if zero_values_in_features > 0 else 'No'}")


features_to_check = [
    'ghi', 'dni', 'solar_zenith', 'GHI_cs', 'DNI_cs', 'CSI_ghi', 
    'CSI_dni', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos', 
    'nam_ghi', 'nam_dni', 'nam_cc'
]

# 2. Calculate the number of NaN values for each feature
# We use .isna() instead of == 0
nans_per_feature = df_interpolated[features_to_check].isna().sum()

# 3. Filter to get only features that actually have NaN values
features_with_nans = nans_per_feature[nans_per_feature > 0]

# 4. Report the findings for which features have NaNs
if features_with_nans.empty:
    print("No NaN (missing) values found in any of the specified features.")
else:
    print("--- Features With NaN Values ---")
    print("The following features have NaN values, with the total count for each:")
    # Sort for clearer output
    print(features_with_nans.sort_values(ascending=False))
    print("\n" + "="*40 + "\n")

    # 5. Analyze the distribution of hours for rows containing NaNs
    print("--- Distribution of NaN-Value Records by Hour ---")
    
    # Create a boolean mask for rows that contain *at least one* NaN
    # in the specified columns
    rows_with_any_nan = df_interpolated[features_to_check].isna().any(axis=1)
    
    if rows_with_any_nan.sum() > 0:
        # Get the index for these rows
        nan_rows_index = df_interpolated.index[rows_with_any_nan]
        
        # Extract the hour from the DatetimeIndex and get the value counts
        hour_distribution = nan_rows_index.hour.value_counts().sort_index()
        
        print("Distribution of records (rows) containing at least one NaN, by hour:")
        print(hour_distribution)
        
        # Optional: Print total number of affected rows
        print(f"\nTotal number of rows with at least one NaN: {rows_with_any_nan.sum()}")
    else:
        # This case shouldn't be hit if features_with_nans was not empty,
        # but it's good practice to include.
        print("No rows found with NaN values (this is unexpected, check logic).")

Do we have zero values in features? Yes
No NaN (missing) values found in any of the specified features.


In [13]:
df_interpolated.head()

measurement_time,ghi,dni,solar_zenith,time_gap_hours,time_gap_norm,day_boundary_flag,hour_progression,absolute_hour,GHI_cs,DNI_cs,...,CSI_dni,season_flag,hour_sin,hour_cos,month_sin,month_cos,nam_ghi,nam_dni,nam_cc,nam_target_time
2014-01-04 00:00:00+00:00,43.827167,139.092167,86.467556,0.016667,0.000694,0.016667,0.491667,34.491667,77.958793,323.747434,...,0.429632,2,0.0,1.0,0.5,0.866025,107.083333,403.544129,21.333333,2014-01-04 01:00:00+00:00
2014-01-04 01:00:00+00:00,0.0,0.0,96.742105,0.016667,0.000694,0.0,1.491667,35.491667,0.0,0.0,...,0.583705,2,0.258819,0.965926,0.5,0.866025,53.541667,201.772064,22.666667,2014-01-04 02:00:00+00:00
2014-01-04 02:00:00+00:00,0.0,0.0,107.755671,0.016667,0.000694,0.0,2.491667,36.491667,0.0,0.0,...,0.737779,2,0.5,0.866025,0.5,0.866025,0.0,0.0,24.0,2014-01-04 03:00:00+00:00
2014-01-04 03:00:00+00:00,0.0,0.0,119.225607,0.016667,0.000694,0.0,3.491667,37.491667,0.0,0.0,...,0.891853,2,0.707107,0.707107,0.5,0.866025,0.0,0.0,24.666667,2014-01-04 04:00:00+00:00
2014-01-04 14:00:00+00:00,0.0,0.0,100.295605,0.166667,0.006944,0.0,14.491667,48.491667,0.0,0.0,...,1.045926,2,-0.5,-0.866025,0.5,0.866025,0.0,0.0,0.0,2014-01-04 15:00:00+00:00


In [14]:
df_phase1 = df_interpolated[['CSI_ghi']]

df_phase2 = df_interpolated[['solar_zenith', 'time_gap_hours', 'time_gap_norm', 'day_boundary_flag',
       'hour_progression', 'absolute_hour', 'CSI_ghi', 'CSI_dni',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos']]

df_phase3 = df_interpolated[['solar_zenith', 'time_gap_hours', 'time_gap_norm', 'day_boundary_flag',
       'hour_progression', 'absolute_hour', 'CSI_ghi', 'CSI_dni',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos',
       'nam_ghi', 'nam_dni', 'nam_cc', 'nam_target_time']]

----

### 3) Convert daily series → K-bin tensors

Modules: `preprocessing.process_splits_to_kbins` and `preprocessing.build_model_arrays`.

- Why this kind of sequence formatting?
    - To solve the problem of variable-length sequences.
- How?
    - Normalize each day into K bins (default K_BINS = 60) aligned to solar phase (sunrise→sunset mapped to [0,1]).
    - Handle different column “strategies” (`DEFAULT_STRATEGY`):

        - We can define a specific sampling strategy for each kind of variable.
            - Irradiance (ghi, dni, CSI_ghi, CSI_dni): resampled carefully, either to preserve daily energy (conservative method) or with smooth, shape-preserving PCHIP interpolation and optional re-normalization (KBinConfig.irradiance_mode).
            - Continuous signals: interpolated (linear/PCHIP).
            - Categorical: nearest fill.
            - Temporal recompute: recomputed from the new K-bin index (hour_sin/cos, month_sin/cos, etc.), so they stay consistent after re-binning.
        - ```
            DEFAULT_STRATEGY = {
                "ghi": "irradiance",
                "dni": "irradiance",
                "CSI_ghi": "irradiance",
                "CSI_dni": "irradiance",

                "air_temp": "continuous",
                "relhum": "continuous",
                "windsp": "continuous",

                "solar_zenith": "continuous",
                "solar_elevation": "continuous",
                "GHI_cs": "continuous",
                "DNI_cs": "continuous",

                "winddirection_sin": "windvec",
                "winddirection_cos": "windvec",

                "season_flag": "categorical",
                "is_daylight": "categorical",
                "day_boundary_flag": "categorical",

                "hour_sin": "temporal_recompute",
                "hour_cos": "temporal_recompute",
                "month_sin": "temporal_recompute",
                "month_cos": "temporal_recompute",
                "absolute_hour": "temporal_recompute",
                "hour_progression": "temporal_recompute",
                "time_gap_hours": "temporal_recompute",
                "time_gap_norm": "temporal_recompute",
            }
            ```
    - `process_splits_to_kbins(...)`: takes daily series and remaps each daylight period to a fixed K-bin grid (sunrise→sunset → [0..K-1]), guaranteeing the same length per day. It produces a normalized day-level MultiIndex [date, bin_id, target_time] with added per-day metadata (solar_phase, sunrise, sunset, day_len_hours).
    - `build_model_arrays(...)`: slides a window of `history_days` past days to predict `horizon_days` ahead, packing tensors as:
        - X: (N, history_days, K, F)
        - Y: (N, horizon_days, K)
        - labels: the datetime indices of the predicted (target) days
        - meta: shapes, features, etc.
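
The core of the K-bin idea, mapping each day's daylight samples onto a fixed grid in solar phase, can be sketched with plain NumPy. Linear interpolation stands in for the irradiance and PCHIP strategies here, and the function name, arguments, and sample values are all assumptions; this is not the code in `preprocessing.process_splits_to_kbins`.

```python
import numpy as np


def resample_day_to_kbins(times_h, values, sunrise_h, sunset_h, k=60):
    """Map one day's samples onto k bin centres in solar phase [0, 1]."""
    times_h = np.asarray(times_h, dtype=float)
    values = np.asarray(values, dtype=float)
    # Solar phase: 0 at sunrise, 1 at sunset
    phase = (times_h - sunrise_h) / (sunset_h - sunrise_h)
    daylight = (phase >= 0.0) & (phase <= 1.0)
    grid = (np.arange(k) + 0.5) / k  # bin centres
    # np.interp expects increasing x values, so times_h must be sorted
    return np.interp(grid, phase[daylight], values[daylight])


# A short winter day and a long summer day end up with the same length
short_day = resample_day_to_kbins([8, 10, 12, 14, 16], [0, 300, 500, 300, 0], 8, 16)
long_day = resample_day_to_kbins([5, 8, 11, 14, 17, 20], [0, 400, 800, 750, 350, 0], 5, 20)
print(short_day.shape, long_day.shape)
```

Because every day is expressed on the same phase grid, days of different length can be stacked into one tensor without padding.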

In [13]:
from src.preprocessing import KBinConfig, process_splits_to_kbins, build_model_arrays
USE_KBINS = True
k_bins = 60
selected_columns = df_phase1.columns
feature_cols = df_phase1.columns
TARGET_COL = "CSI_ghi"
history_days = 3
horizon_days = 1

if USE_KBINS:
        print("\n--- Step 2: Converting full dataset to K-Bins format ---")
        cfg = KBinConfig(K=k_bins)
        processed = process_splits_to_kbins({"full_data": df}, cfg, feature_cols=selected_columns)
        norm_df = processed.get("full_data")
        if norm_df is None or norm_df.empty:
            raise ValueError("Normalized dataframe is empty after K-bin processing.")

        print("\n--- Step 3: Building model arrays (X, Y) ---")
        ph1_kbins_X, ph1_kbins_Y, ph1_kbins_labels_list = build_model_arrays(
            norm_df,
            feature_cols=feature_cols,
            target_col=TARGET_COL,
            history_days=history_days,
            horizon_days=horizon_days,
        )


--- Step 2: Converting full dataset to K-Bins format ---

--- Step 3: Building model arrays (X, Y) ---


-------
### No KBins

In [19]:
from src.preprocessing import KBinConfig, process_splits_to_kbins, build_model_arrays, to_fixedgrid_multiindex
USE_KBINS = False
k_bins = 60
selected_columns = df_phase1.columns
feature_cols = df_phase1.columns
TARGET_COL = "CSI_ghi"
history_days = 7
horizon_days = 1

if USE_KBINS:
        print("\n--- Step 2: Converting full dataset to K-Bins format ---")
        cfg = KBinConfig(K=k_bins)
        processed = process_splits_to_kbins({"full_data": df}, cfg, feature_cols=selected_columns)
        norm_df = processed.get("full_data")
        if norm_df is None or norm_df.empty:
            raise ValueError("Normalized dataframe is empty after K-bin processing.")

        print("\n--- Step 3: Building model arrays (X, Y) ---")
        ph1_kbins_X, ph1_kbins_Y, ph1_kbins_labels_list = build_model_arrays(
            norm_df,
            feature_cols=feature_cols,
            target_col=TARGET_COL,
            history_days=history_days,
            horizon_days=horizon_days,
        )
else:
    print("\n--- Step 2: Building model arrays (X, Y) ---")
    fixed_df = to_fixedgrid_multiindex(df_phase1, timestamp_col="measurement_time", expected_T=None)  # or set T
    ph1_X, ph1_Y, ph1_labels_list = build_model_arrays(
        fixed_df,
        feature_cols=selected_columns,  # must include your chosen inputs
        target_col=TARGET_COL,          # target must be present in columns
        history_days=history_days,
        horizon_days=horizon_days,
    )


--- Step 2: Building model arrays (X, Y) ---


In [20]:
ph1_X.shape, ph1_Y.shape

((1064, 7, 14, 1), (1064, 1, 14))

- 1064 → number of samples (sliding windows).
- 7 → history_days: how many past days per sample.
- 14 → T: fixed time steps per day (the within-day grid; here 14 records per day).
- 1 → F: number of features in feature_cols (here only CSI_ghi).
- 1 in Y → horizon_days: predict the next day only.
- 14 in Y → same within-day grid as X.
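
How the sliding window produces these shapes can be sketched as follows. This assumes a gap-free stack of days with shape (D, T, F); the function name `build_windows_sketch` is hypothetical, and the real `build_model_arrays` may handle missing days and labels differently.

```python
import numpy as np


def build_windows_sketch(daily, target, history_days, horizon_days):
    """daily: (D, T, F) features per day; target: (D, T) target per day."""
    daily = np.asarray(daily)
    target = np.asarray(target)
    n = daily.shape[0] - history_days - horizon_days + 1
    if n <= 0:
        raise ValueError("Not enough days for the requested window sizes.")
    # Each sample uses history_days of inputs followed by horizon_days of targets
    X = np.stack([daily[i:i + history_days] for i in range(n)])
    Y = np.stack([
        target[i + history_days:i + history_days + horizon_days] for i in range(n)
    ])
    return X, Y


rng = np.random.default_rng(0)
daily = rng.random((10, 14, 1))  # 10 days, 14 steps per day, 1 feature
target = daily[:, :, 0]
X, Y = build_windows_sketch(daily, target, history_days=7, horizon_days=1)
print(X.shape, Y.shape)  # (3, 7, 14, 1) (3, 1, 14)
```

Under this formula, 14,994 records at 14 per day (1,071 days) with `history_days=7` and `horizon_days=1` give 1,071 - 7 - 1 + 1 = 1,064 windows, which is consistent with the shapes printed above.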

In [62]:
# Controlling the selected features example
selected_columns = ['solar_zenith', 'time_gap_hours', 'time_gap_norm', 'day_boundary_flag',
       'hour_progression', 'absolute_hour', 'CSI_ghi', 'CSI_dni',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos',
       'nam_ghi', 'nam_dni', 'nam_cc']
print("\n--- Step 2: Building model arrays (X, Y) ---")
fixed_df = to_fixedgrid_multiindex(df, timestamp_col="measurement_time", expected_T=None)  # or set T
# Zero-impute everything (features + target)
fixed_df = fixed_df.fillna(0.0)
ph1_X, ph1_Y, ph1_labels_list = build_model_arrays(
        fixed_df,
        feature_cols=selected_columns,  # must include your chosen inputs
        target_col=TARGET_COL,          # target must be present in columns
        history_days=history_days,
        horizon_days=horizon_days,
    )


--- Step 2: Building model arrays (X, Y) ---


In [21]:
ph1_X.shape, ph1_Y.shape

((1064, 7, 14, 1), (1064, 1, 14))

In [22]:
TARGET_COL = 'CSI_ghi'
feature_cols = [c for c in df_phase1.columns.tolist() if c != TARGET_COL]
import pandas as pd
from src.utils import DataManager
data_manager = DataManager()
data_manager.save_arrays(
    ph1_X, ph1_Y,
    pd.DataFrame(index=pd.to_datetime(ph1_labels_list, utc=True)),
    filename_prefix='phas1_data',
    feature_cols=feature_cols,
    target_col=TARGET_COL,
    metadata={
        "input_csv": "data/processed/dayTime_NAM_dayahead_features_processed.csv",
        "timestamp_col": "measurement_time",
        "feature_set": feature_cols,     # for traceability
        "history_days": 7,
        "horizon_days": 1,
        "k_bins": None,
    }
)

INFO:src.utils:Saved arrays to data/phas1_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 1), Y shape: (1064, 1, 14)


The arrays are saved as `.npy` files (`data/..X|Y.npy`).

## Running Experiment on Models

In [23]:
import torch
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    else "cpu"
)

In [24]:
device

'cuda'

In [25]:
df_phase1.head()

measurement_time,CSI_ghi
2014-01-04 00:00:00+00:00,0.562184
2014-01-04 01:00:00+00:00,0.689747
2014-01-04 02:00:00+00:00,0.81731
2014-01-04 03:00:00+00:00,0.944874
2014-01-04 14:00:00+00:00,1.072437


In [None]:
#feature_cols = df_phase1.columns.tolist()   
#FEATURE_COLS= feature_cols
#DEFAULT_FEATURE_SET = feature_cols
feature_cols = [c for c in df_phase1.columns.tolist() if c != TARGET_COL]  # note: [] if df_phase1 holds only the target column, as defined above


LSTM_CONFIG = {
    "experiment_name": "phas1_exp01_overfitting_solve",
    "model_type": "LSTM",
    "model_config": {
        "hidden_size": 64,
        "num_layers": 2,
        "dropout": 0.25,
        "bidirectional": True,
    },
    "data_prefix": "phas1_data",
    "splits_file": "exp-001/exp-001rolling_origin_splits.json",
    "feature_cols": feature_cols,
    "feature_selection": feature_cols,
    "target_col": TARGET_COL,
    "batch_size": 32,
    "num_epochs": 50,
    "learning_rate": 0.001,
    "loss_function": "Huber",
    "early_stopping_patience": 20,
    "max_folds": 35,
}

In case the config needs to be updated after it has been created:

In [22]:
FEATURE_COLS = feature_cols
DEFAULT_FEATURE_SET = feature_cols  # ok if your resolver accepts raw lists

LSTM_CONFIG.update({
    "feature_cols": FEATURE_COLS,
    "feature_selection": DEFAULT_FEATURE_SET,
})

In [28]:
from src.pipeline import SolarForecastingPipeline

pipeline = SolarForecastingPipeline(LSTM_CONFIG)
_, summary = pipeline.run()

INFO:src.pipeline:Loading data...
INFO:src.utils:Loaded arrays from data/phas1_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 1), Y shape: (1064, 1, 14)
INFO:src.pipeline:Reference dataframe prepared for NAM comparison (15162 rows).
INFO:src.utils:Loaded 35 folds from exp-001/exp-001rolling_origin_splits.json
INFO:src.pipeline:
=== Running Fold 1 ===
INFO:src.pipeline:Train samples: 19, Val samples: 16
INFO:src.pipeline:Model parameters: 242,062
INFO:src.engine:Epoch [10/50] - Train Loss: 0.034855, Val Loss: 0.085353, Train MAE: 0.185549, Val MAE: 0.312215, LR: 0.001000
INFO:src.engine:Epoch [20/50] - Train Loss: 0.023681, Val Loss: 0.054794, Train MAE: 0.151009, Val MAE: 0.276369, LR: 0.001000
INFO:src.engine:Epoch [30/50] - Train Loss: 0.019172, Val Loss: 0.089405, Train MAE: 0.130675, Val MAE: 0.322971, LR: 0.001000
INFO:src.engine:Epoch [40/50] - Train Loss: 0.015449, Val Loss: 0.130248, Train MAE: 0.118386, Val MAE: 0.393226, LR: 0.000500
INFO:src.engine:Early stopping at epoch 
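
The `early_stopping_patience` setting and the "Early stopping" log line suggest a patience-based rule. A generic sketch of that rule is shown below; the function name and the toy loss values are assumptions, and the actual logic in `src.engine` is not shown in this notebook.

```python
import math


def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = math.inf
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None  # patience never exhausted


# Best loss at epoch 2, then three epochs without improvement (made-up values)
print(early_stopping_epoch([1.0, 0.9, 0.95, 0.97, 0.99], patience=3))  # 5
```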

-----

### EXP 02

In [25]:
df_phase2.head()

measurement_time,solar_zenith,time_gap_hours,time_gap_norm,day_boundary_flag,hour_progression,absolute_hour,CSI_ghi,CSI_dni,season_flag,hour_sin,hour_cos,month_sin,month_cos
2014-01-04 00:00:00+00:00,86.467556,0.016667,0.000694,0.016667,0.491667,34.491667,0.562184,0.429632,2,0.0,1.0,0.5,0.866025
2014-01-04 01:00:00+00:00,96.742105,0.016667,0.000694,0.0,1.491667,35.491667,0.689747,0.583705,2,0.258819,0.965926,0.5,0.866025
2014-01-04 02:00:00+00:00,107.755671,0.016667,0.000694,0.0,2.491667,36.491667,0.81731,0.737779,2,0.5,0.866025,0.5,0.866025
2014-01-04 03:00:00+00:00,119.225607,0.016667,0.000694,0.0,3.491667,37.491667,0.944874,0.891853,2,0.707107,0.707107,0.5,0.866025
2014-01-04 14:00:00+00:00,100.295605,0.166667,0.006944,0.0,14.491667,48.491667,1.072437,1.045926,2,-0.5,-0.866025,0.5,0.866025


In [32]:
# Controlling the selected features example
selected_columns = ['solar_zenith',
       'hour_progression', 'absolute_hour', 'CSI_ghi',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos']

TARGET_COL = 'CSI_ghi'
print("\n--- Step 2: Building model arrays (X, Y) ---")
fixed_df = to_fixedgrid_multiindex(df_phase2, timestamp_col="measurement_time", expected_T=None)  # or set T
ph2_X, ph2_Y, ph2_labels_list = build_model_arrays(
        fixed_df,
        feature_cols=selected_columns,  # must include your chosen inputs
        target_col=TARGET_COL,          # target must be present in columns
        history_days=7,
        horizon_days=1,
    )


--- Step 2: Building model arrays (X, Y) ---


In [33]:
ph2_X.shape, ph2_Y.shape

((1064, 7, 14, 9), (1064, 1, 14))

In [34]:
import pandas as pd
from src.utils import DataManager
from src.pipeline import SolarForecastingPipeline


TARGET_COL = 'CSI_ghi'
feature_cols = [c for c in df_phase2.columns.tolist() if c != TARGET_COL]

data_manager = DataManager()
data_manager.save_arrays(
    ph2_X, ph2_Y,
    pd.DataFrame(index=pd.to_datetime(ph2_labels_list, utc=True)),
    filename_prefix='phas2_data',
    feature_cols=feature_cols,
    target_col=TARGET_COL,
    metadata={
        "input_csv": "data/processed/dayTime_NAM_dayahead_features_processed.csv",
        "timestamp_col": "measurement_time",
        "feature_set": feature_cols,     # for traceability
        "history_days": 7,
        "horizon_days": 1,
        "k_bins": None,
    }
)



LSTM_CONFIG = {
    "experiment_name": "phas2_exp_test",
    "model_type": "LSTM",
    "model_config": {
        "hidden_size": 128,
        "num_layers": 2,
        "dropout": 0.2,
        "bidirectional": True,
    },
    "data_prefix": "phas2_data",
    "splits_file": "exp-001/exp-001rolling_origin_splits.json",
    "feature_cols": feature_cols,
    "feature_selection": feature_cols,
    "target_col": TARGET_COL,
    "batch_size": 32,
    "num_epochs": 50,
    "learning_rate": 0.001,
    "early_stopping_patience": 20,
    "max_folds": 35,
}


pipeline = SolarForecastingPipeline(LSTM_CONFIG)
_, summary = pipeline.run()

INFO:src.utils:Saved arrays to data/phas2_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 9), Y shape: (1064, 1, 14)
INFO:src.pipeline:Loading data...
INFO:src.utils:Loaded arrays from data/phas2_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 9), Y shape: (1064, 1, 14)
INFO:src.pipeline:Reference dataframe prepared for NAM comparison (15162 rows).
INFO:src.utils:Loaded 35 folds from exp-001/exp-001rolling_origin_splits.json
INFO:src.pipeline:
=== Running Fold 1 ===
INFO:src.pipeline:Train samples: 19, Val samples: 16
INFO:src.pipeline:Model parameters: 973,582
INFO:src.engine:Epoch [10/50] - Train Loss: 0.052913, Val Loss: 0.197582, Train MAE: 0.145659, Val MAE: 0.375049, LR: 0.001000
INFO:src.engine:Epoch [20/50] - Train Loss: 0.032827, Val Loss: 0.643124, Train MAE: 0.120945, Val MAE: 0.648740, LR: 0.000500
INFO:src.engine:Early stopping at epoch 26
INFO:src.pipeline:Fold 1 - Validation Metrics:
INFO:src.pipeline:  mse: 0.592511
INFO:src.pipeline:  rmse: 0.769748
INFO:src.pipeline
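
The log above stops at epoch 26 although `num_epochs` is 50, which is the kind of behaviour `early_stopping_patience: 20` is meant to produce. The exact rule used by `src.engine` is not shown here. The following is a minimal sketch of one common rule: stop once the validation loss has failed to improve for `patience` consecutive epochs. The function name and the synthetic loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Synthetic curve: improves for 6 epochs, then degrades steadily
val_losses = [1.0 - 0.1 * e for e in range(6)] + [0.6 + 0.02 * e for e in range(44)]
print(early_stopping_epoch(val_losses, patience=20))  # 26
```

Under this rule, stopping at epoch 26 with a patience of 20 would correspond to the best validation loss occurring at epoch 6. Whether the pipeline counts epochs this way is an assumption.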

----

In [30]:
selected_columns = ['solar_zenith',
       'hour_progression', 'absolute_hour', 'CSI_ghi',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos',
       'nam_ghi', 'nam_dni', 'nam_cc']


print("\n--- Step 2: Building model arrays (X, Y) ---")
fixed_df = to_fixedgrid_multiindex(df_phase3, timestamp_col="measurement_time", expected_T=None)  # or set T
# Zero-impute everything (features + target)
ph3_X, ph3_Y, ph3_labels_list = build_model_arrays(
        fixed_df,
        feature_cols=selected_columns,  # must include your chosen inputs
        target_col=TARGET_COL,          # target must be present in columns
        history_days=7,
        horizon_days=1,
    )

print(ph3_X.shape, ph3_Y.shape)


--- Step 2: Building model arrays (X, Y) ---
(1064, 7, 14, 12) (1064, 1, 14)


In [31]:


print("=====================================\n")

import pandas as pd
from src.utils import DataManager
from src.pipeline import SolarForecastingPipeline


TARGET_COL = 'CSI_ghi'
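# NOTE: ph3_X was built from selected_columns (12 features, see the shape above),
# while this list holds every df_phase3 column except the target, so the saved
# feature names may not line up with the last axis of ph3_X.
feature_cols = [c for c in df_phase3.columns.tolist() if c != TARGET_COL]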

data_manager = DataManager()
data_manager.save_arrays(
    ph3_X, ph3_Y,
    pd.DataFrame(index=pd.to_datetime(ph3_labels_list, utc=True)),
    filename_prefix='phas3_data',
    feature_cols=feature_cols,
    target_col=TARGET_COL,
    metadata={
        "input_csv": "data/processed/dayTime_NAM_dayahead_features_processed.csv",
        "timestamp_col": "measurement_time",
        "feature_set": feature_cols,     # for traceability
        "history_days": 7,
        "horizon_days": 1,
        "k_bins": None,
    }
)



LSTM_CONFIG = {
    "experiment_name": "phas3_exp_test",
    "model_type": "LSTM",
    "model_config": {
        "hidden_size": 128,
        "num_layers": 2,
        "dropout": 0.2,
        "bidirectional": True,
    },
    "data_prefix": "phas3_data",
    "splits_file": "exp-001/exp-001rolling_origin_splits.json",
    "feature_cols": feature_cols,
    "feature_selection": feature_cols,
    "target_col": TARGET_COL,
    "batch_size": 32,
    "num_epochs": 50,
    "learning_rate": 0.001,
    "early_stopping_patience": 20,
    "max_folds": 35,
}


pipeline = SolarForecastingPipeline(LSTM_CONFIG)
_, summary = pipeline.run()

INFO:src.utils:Saved arrays to data/phas3_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 12), Y shape: (1064, 1, 14)
INFO:src.pipeline:Loading data...
INFO:src.utils:Loaded arrays from data/phas3_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 12), Y shape: (1064, 1, 14)
INFO:src.pipeline:Reference dataframe prepared for NAM comparison (15162 rows).
INFO:src.utils:Loaded 35 folds from exp-001/exp-001rolling_origin_splits.json
INFO:src.pipeline:
=== Running Fold 1 ===
INFO:src.pipeline:Train samples: 19, Val samples: 16
INFO:src.pipeline:Model parameters: 978,958
INFO:src.engine:Epoch [10/50] - Train Loss: 0.050164, Val Loss: 0.188769, Train MAE: 0.147607, Val MAE: 0.370701, LR: 0.001000
INFO:src.engine:Epoch [20/50] - Train Loss: 0.036586, Val Loss: 0.356881, Train MAE: 0.131588, Val MAE: 0.474056, LR: 0.000500
INFO:src.engine:Early stopping at epoch 27
INFO:src.pipeline:Fold 1 - Validation Metrics:
INFO:src.pipeline:  mse: 0.551429
INFO:src.pipeline:  rmse: 0.742583
INFO:src.pipeline:  mae: 0.602680
INFO:src.pipeline:  r2: -2.700127
INFO:src.pipeline:  mape: 126.177391
INFO:src.pipeline:  model_rmse_ghi: 115.457624
INFO:src.pipeline:  nam_rmse_ghi: 47.325744
INFO:src.pipeline:  model_mae_ghi: 84.649267
INFO:src.pipeline:  nam_mae_ghi: 39.566995
INFO:s
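
Two things in the metrics above are worth unpacking. `rmse` is the square root of `mse` (0.742583 ≈ √0.551429), and a negative `r2` means the predictions have a larger squared error than a constant predictor that always outputs the mean of the validation targets. The sketch below shows those standard definitions in NumPy. It is not the code in `src.pipeline`; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 over all elements of the arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))                        # model squared error
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # mean-predictor squared error
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Toy data: targets vary little, predictions miss by a wide margin
y_true = np.array([0.9, 1.0, 1.1, 1.0])
y_pred = np.array([0.5, 1.5, 0.6, 1.4])
m = regression_metrics(y_true, y_pred)
print(m)
```

In the toy data the targets barely vary, so even moderate errors push `r2` far below zero. This is a typical pattern when the validation set is small, as with the 16 validation samples in fold 1.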

----

USING HUBER LOSS FUNCTION
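
Huber loss is quadratic for residuals up to a threshold δ and linear beyond it, which makes it less sensitive to large errors than MSE. How the pipeline implements the `"loss_function": "Huber"` setting, and which δ it uses, is not shown here. The following is a NumPy sketch of the standard definition, with δ = 1.0 as an assumed value.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

# One small residual (0.5) and one large residual (3.0)
print(huber([0.0, 0.0], [0.5, 3.0]))  # 1.3125
```

For the same two residuals MSE would be (0.25 + 9.0) / 2 = 4.625, so the large residual contributes far less under Huber.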

In [35]:
selected_columns = ['solar_zenith',
       'hour_progression', 'absolute_hour', 'CSI_ghi',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos',
       'nam_ghi', 'nam_dni', 'nam_cc']


print("\n--- Step 2: Building model arrays (X, Y) ---")
fixed_df = to_fixedgrid_multiindex(df_phase3, timestamp_col="measurement_time", expected_T=None)  # or set T
# Zero-impute everything (features + target)
ph3_X, ph3_Y, ph3_labels_list = build_model_arrays(
        fixed_df,
        feature_cols=selected_columns,  # must include your chosen inputs
        target_col=TARGET_COL,          # target must be present in columns
        history_days=7,
        horizon_days=1,
    )

print(ph3_X.shape, ph3_Y.shape)


--- Step 2: Building model arrays (X, Y) ---
(1064, 7, 14, 12) (1064, 1, 14)


In [36]:


print("=====================================\n")

import pandas as pd
from src.utils import DataManager
from src.pipeline import SolarForecastingPipeline


TARGET_COL = 'CSI_ghi'
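# NOTE: ph3_X was built from selected_columns (12 features, see the shape above),
# while this list holds every df_phase3 column except the target, so the saved
# feature names may not line up with the last axis of ph3_X.
feature_cols = [c for c in df_phase3.columns.tolist() if c != TARGET_COL]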

data_manager = DataManager()
data_manager.save_arrays(
    ph3_X, ph3_Y,
    pd.DataFrame(index=pd.to_datetime(ph3_labels_list, utc=True)),
    filename_prefix='phas3_data',
    feature_cols=feature_cols,
    target_col=TARGET_COL,
    metadata={
        "input_csv": "data/processed/dayTime_NAM_dayahead_features_processed.csv",
        "timestamp_col": "measurement_time",
        "feature_set": feature_cols,     # for traceability
        "history_days": 7,
        "horizon_days": 1,
        "k_bins": None,
    }
)



LSTM_CONFIG = {
    "experiment_name": "phas3_exp_test",
    "model_type": "LSTM",
    "model_config": {
        "hidden_size": 128,
        "num_layers": 2,
        "dropout": 0.2,
        "bidirectional": True,
    },
    "data_prefix": "phas3_data",
    "splits_file": "exp-001/exp-001rolling_origin_splits.json",
    "feature_cols": feature_cols,
    "feature_selection": feature_cols,
    "target_col": TARGET_COL,
    "batch_size": 32,
    "loss_function": "Huber",
    "num_epochs": 50,
    "learning_rate": 0.001,
    "early_stopping_patience": 20,
    "max_folds": 35,
}


pipeline = SolarForecastingPipeline(LSTM_CONFIG)
_, summary = pipeline.run()

INFO:src.utils:Saved arrays to data/phas3_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 12), Y shape: (1064, 1, 14)
INFO:src.pipeline:Loading data...
INFO:src.utils:Loaded arrays from data/phas3_data_*.npy
INFO:src.utils:X shape: (1064, 7, 14, 12), Y shape: (1064, 1, 14)
INFO:src.pipeline:Reference dataframe prepared for NAM comparison (15162 rows).
INFO:src.utils:Loaded 35 folds from exp-001/exp-001rolling_origin_splits.json
INFO:src.pipeline:
=== Running Fold 1 ===
INFO:src.pipeline:Train samples: 19, Val samples: 16
INFO:src.pipeline:Model parameters: 978,958
INFO:src.engine:Epoch [10/50] - Train Loss: 0.048711, Val Loss: 0.135200, Train MAE: 0.142769, Val MAE: 0.298243, LR: 0.001000
INFO:src.engine:Epoch [20/50] - Train Loss: 0.034094, Val Loss: 0.429463, Train MAE: 0.116191, Val MAE: 0.500429, LR: 0.001000
INFO:src.engine:Epoch [30/50] - Train Loss: 0.031853, Val Loss: 0.853458, Train MAE: 0.116793, Val MAE: 0.742078, LR: 0.000500
INFO:src.engine:Early stopping at epoch 30
INFO:src.pipeline:Fold 1 - Validation Metrics:
INFO:src.pipeline:  mse: 0.853458
INFO:src.pipeline:  rmse: 0.923828
INFO:src.pipeline:  mae: 0.742078
INFO:src.pipeline:  r2: -4.726761
INFO:src.pipeline:  mape: 178.220566
INFO:src.pipeline:  model_rmse_ghi: 133.039244
INFO:src.pi

-----

In [None]:
from src.preprocessing import KBinConfig, process_splits_to_kbins, build_model_arrays, to_fixedgrid_multiindex

selected_columns = ['solar_zenith',
       'hour_progression', 'absolute_hour', 'CSI_ghi',
       'season_flag', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos',
       'nam_ghi', 'nam_dni', 'nam_cc']


TARGET_COL = 'CSI_ghi'
feature_cols = [c for c in df_phase3.columns.tolist() if c != TARGET_COL]

print("\n--- Step 2: Building model arrays (X, Y) ---")
fixed_df = to_fixedgrid_multiindex(df_phase3, timestamp_col="measurement_time", expected_T=None)  # or set T
# Zero-impute everything (features + target)
ph3_X, ph3_Y, ph3_labels_list = build_model_arrays(
        fixed_df,
        feature_cols=selected_columns,  # must include your chosen inputs
        target_col=TARGET_COL,          # target must be present in columns
        history_days=7,
        horizon_days=5,
    )

print(ph3_X.shape, ph3_Y.shape)



print("=====================================\n")

import pandas as pd
from src.utils import DataManager
from src.pipeline import SolarForecastingPipeline



data_manager = DataManager()
data_manager.save_arrays(
    ph3_X, ph3_Y,
    pd.DataFrame(index=pd.to_datetime(ph3_labels_list, utc=True)),
    filename_prefix='phas3_data',
    feature_cols=selected_columns,
    target_col=TARGET_COL,
    metadata={
        "input_csv": "data/processed/dayTime_NAM_dayahead_features_processed.csv",
        "timestamp_col": "measurement_time",
        "feature_set": selected_columns,     # for traceability
        "history_days": 7,
        "horizon_days": 5,
        "k_bins": None,
    }
)



LSTM_CONFIG = {
    "experiment_name": "phas3_exp_test",
    "model_type": "LSTM",
    "model_config": {
        "hidden_size": 64,
        "num_layers": 2,
        "dropout": 0.2,
        "bidirectional": True,
    },
    "data_prefix": "phas3_data",
    "splits_file": "exp-001/exp-001rolling_origin_splits.json",
    "feature_cols": selected_columns,
    "feature_selection": selected_columns,
    "target_col": TARGET_COL,
    "batch_size": 32,
    "loss_function": "Huber",
    "num_epochs": 50,
    "learning_rate": 0.001,
    "early_stopping_patience": 20,
    "max_folds": 35,
}


pipeline = SolarForecastingPipeline(LSTM_CONFIG)
_, summary = pipeline.run()

INFO:src.utils:Saved arrays to data/phas3_data_*.npy
INFO:src.utils:X shape: (1060, 7, 14, 12), Y shape: (1060, 5, 14)
INFO:src.pipeline:Loading data...
INFO:src.utils:Loaded arrays from data/phas3_data_*.npy
INFO:src.utils:X shape: (1060, 7, 14, 12), Y shape: (1060, 5, 14)



--- Step 2: Building model arrays (X, Y) ---
(1060, 7, 14, 12) (1060, 5, 14)



INFO:src.pipeline:Reference dataframe prepared for NAM comparison (15162 rows).
INFO:src.utils:Loaded 35 folds from exp-001/exp-001rolling_origin_splits.json
INFO:src.pipeline:
=== Running Fold 1 ===
INFO:src.pipeline:Train samples: 15, Val samples: 16
INFO:src.pipeline:Model parameters: 986,182
INFO:src.engine:Epoch [10/50] - Train Loss: 0.015084, Val Loss: 0.083551, Train MAE: 0.122177, Val MAE: 0.303775, LR: 0.001000
INFO:src.engine:Epoch [20/50] - Train Loss: 0.010508, Val Loss: 0.091602, Train MAE: 0.097487, Val MAE: 0.340067, LR: 0.001000
INFO:src.engine:Epoch [30/50] - Train Loss: 0.008677, Val Loss: 0.098078, Train MAE: 0.082239, Val MAE: 0.346897, LR: 0.000500
INFO:src.engine:Early stopping at epoch 36
INFO:src.pipeline:Fold 1 - Validation Metrics:
INFO:src.pipeline:  mse: 0.248999
INFO:src.pipeline:  rmse: 0.498998
INFO:src.pipeline:  mae: 0.390676
INFO:src.pipeline:  r2: -0.731214
INFO:src.pipeline:  mape: 116.896164
INFO:src.pipeline:
=== Running Fold 2 ===
INFO:src.pipelin
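
Every run above loads 35 folds from a rolling-origin splits file whose construction is not shown in this section. As a rough illustration of the idea only, the sketch below builds expanding-window folds in which each validation block starts where the training window ends. The function name, its parameters and the sizes used are hypothetical and are not taken from `exp-001rolling_origin_splits.json`.

```python
def rolling_origin_splits(n_samples, initial_train, val_size, step, max_folds=None):
    """Return a list of (train_indices, val_indices) with an expanding train window."""
    folds = []
    train_end = initial_train
    while train_end + val_size <= n_samples:
        train_idx = list(range(0, train_end))                   # everything before the origin
        val_idx = list(range(train_end, train_end + val_size))  # block right after the origin
        folds.append((train_idx, val_idx))
        if max_folds is not None and len(folds) >= max_folds:
            break
        train_end += step  # move the forecast origin forward
    return folds

folds = rolling_origin_splits(n_samples=100, initial_train=20, val_size=10, step=10)
for k, (tr, va) in enumerate(folds, start=1):
    print(f"fold {k}: train 0..{tr[-1]}, val {va[0]}..{va[-1]}")
```

Because validation indices always come after training indices, no fold trains on data from the period it is evaluated on.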

---