# Modeling & Validation

This notebook covers the modeling and validation phase for solar power generation forecasting. We will:
- Load processed features and target
- Define a baseline
- Train and validate several regression models
- Compare results using MAE and TimeSeriesSplit
- Select the best model for deployment

### 0. Import Libraries

We import the required libraries and define the data path, model path, and cross-validation settings. The processed dataset itself is loaded in Section 1.

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib

from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error

# from data_loader import load_data
# from baselines import naive_last_value_baseline
# from model_selection import evaluate_model_cv

# External models
try:
    from xgboost import XGBRegressor
except ImportError:
    XGBRegressor = None
try:
    from lightgbm import LGBMRegressor
except ImportError:
    LGBMRegressor = None
try:
    from catboost import CatBoostRegressor
except ImportError:
    CatBoostRegressor = None

# Paths
DATA_PATH = Path("../data/processed/solar_features.csv")
MODEL_PATH = Path("../models/final_model.joblib")

# Config
N_SPLITS = 5
RANDOM_STATE = 42

In [62]:
# This code is saved into data_loader.py
#=====================================================
TARGET_COL = "power_generated_kw"

def load_data(path: Path):
    """
    Load features and target from processed CSV.
    """
    df = pd.read_csv(path)
    X = df.drop(columns=[TARGET_COL])
    y = df[TARGET_COL]
    return X, y
#=====================================================

In [63]:
# This code is saved into baselines.py
#=====================================================
def naive_last_value_baseline(y):
    """
    Naive baseline: predict previous timestep value.
    """
    y_true = y.iloc[1:]
    y_pred = y.shift(1).iloc[1:]
    mae = mean_absolute_error(y_true, y_pred)
    return mae
#=====================================================

In [64]:
# This code is saved into model_selection.py
#=====================================================
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_model_cv(model, X, y, tscv, use_scaler=False):
    """
    Manual TimeSeries CV evaluation.

    Fits a fresh copy of `model` on each training fold and returns the
    mean MAE over the validation folds. With use_scaler=True, a
    StandardScaler is fit on the training fold only.
    """
    maes = []

    for train_idx, val_idx in tscv.split(X):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Clone so that no fitted state carries over between folds
        estimator = clone(model)
        if use_scaler:
            estimator = Pipeline([
                ("scaler", StandardScaler()),
                ("model", estimator)
            ])

        estimator.fit(X_train, y_train)
        preds = estimator.predict(X_val)

        maes.append(mean_absolute_error(y_val, preds))

    return np.mean(maes)
#=====================================================
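
To illustrate the `use_scaler=True` branch, the sketch below checks that a `StandardScaler` placed inside a `Pipeline` learns its statistics from the training fold only. The data are synthetic, and the names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions, not part of the project code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, drifting features (assumption: stands in for the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3)).cumsum(axis=0)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=60)

# Take the first fold of a 3-split time series CV
train_idx, val_idx = next(iter(TimeSeriesSplit(n_splits=3).split(X_demo)))

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipe.fit(X_demo[train_idx], y_demo[train_idx])

# The scaler's mean comes from the training rows, not the full dataset
scaler_mean = pipe.named_steps["scaler"].mean_
train_mean = X_demo[train_idx].mean(axis=0)
print("Scaler fit on training fold only:", np.allclose(scaler_mean, train_mean))
```

Scaling the full dataset before splitting would instead let validation rows influence the scaler, which is what the pipeline avoids.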

### 1. Load Data

In [65]:
# Load processed features and target
X, y = load_data(DATA_PATH)
tscv = TimeSeriesSplit(n_splits=N_SPLITS)
X.head()

| | day_of_year | is_daylight | distance_to_solar_noon | average_temperature_day | average_wind_direction_day | average_wind_speed_day | sky_cover | visibility | relative_humidity | average_wind_speed_period | average_barometric_pressure_period | first_hour_of_period_sin | first_hour_of_period_cos | day_of_year_sin | day_of_year_cos | distance_to_noon_squared | solar_potential |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 245 | 0 | 0.859897 | 69 | 28 | 7.5 | 0 | 10.0 | 75 | 8.0 | 29.82 | 0.258819 | 0.965926 | -0.874481 | -0.48506 | 0.739423 | 0.246604 |
| 1 | 245 | 0 | 0.628535 | 69 | 28 | 7.5 | 0 | 10.0 | 77 | 5.0 | 29.85 | 0.866025 | 0.5 | -0.874481 | -0.48506 | 0.395056 | 0.449311 |
| 2 | 245 | 1 | 0.397172 | 69 | 28 | 7.5 | 0 | 10.0 | 70 | 0.0 | 29.89 | 0.965926 | -0.258819 | -0.874481 | -0.48506 | 0.157746 | 0.652019 |
| 3 | 245 | 1 | 0.16581 | 69 | 28 | 7.5 | 0 | 10.0 | 33 | 0.0 | 29.91 | 0.5 | -0.866025 | -0.874481 | -0.48506 | 0.027493 | 0.854726 |
| 4 | 245 | 1 | 0.065553 | 69 | 28 | 7.5 | 0 | 10.0 | 21 | 3.0 | 29.89 | -0.258819 | -0.965926 | -0.874481 | -0.48506 | 0.004297 | 0.942566 |


### 2. Data Preprocessing

The data has already been cleaned and processed in previous steps. We check for missing values and confirm data types.

In [66]:
# Check for missing values and data types
print(X.info())
X.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2920 entries, 0 to 2919
Data columns (total 17 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   day_of_year                         2920 non-null   int64  
 1   is_daylight                         2920 non-null   int64  
 2   distance_to_solar_noon              2920 non-null   float64
 3   average_temperature_day             2920 non-null   int64  
 4   average_wind_direction_day          2920 non-null   int64  
 5   average_wind_speed_day              2920 non-null   float64
 6   sky_cover                           2920 non-null   int64  
 7   visibility                          2920 non-null   float64
 8   relative_humidity                   2920 non-null   int64  
 9   average_wind_speed_period           2920 non-null   float64
 10  average_barometric_pressure_period  2920 non-null   float64
 11  first_hour_of_period_sin            2920 no

day_of_year                           0
is_daylight                           0
distance_to_solar_noon                0
average_temperature_day               0
average_wind_direction_day            0
average_wind_speed_day                0
sky_cover                             0
visibility                            0
relative_humidity                     0
average_wind_speed_period             0
average_barometric_pressure_period    0
first_hour_of_period_sin              0
first_hour_of_period_cos              0
day_of_year_sin                       0
day_of_year_cos                       0
distance_to_noon_squared              0
solar_potential                       0
dtype: int64

### 3. Feature Engineering

Features were engineered in previous scripts. Here, we review the feature set and check for redundancy or leakage.

In [67]:
# Review feature columns
print(f"Features: {list(X.columns)}")

Features: ['day_of_year', 'is_daylight', 'distance_to_solar_noon', 'average_temperature_day', 'average_wind_direction_day', 'average_wind_speed_day', 'sky_cover', 'visibility', 'relative_humidity', 'average_wind_speed_period', 'average_barometric_pressure_period', 'first_hour_of_period_sin', 'first_hour_of_period_cos', 'day_of_year_sin', 'day_of_year_cos', 'distance_to_noon_squared', 'solar_potential']
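
One way to check for redundancy is to list feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below is a minimal illustration on a toy frame; the helper `highly_correlated_pairs`, the 0.9 threshold, and the toy column names are assumptions made for the example, not part of the project code.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |corr|) for every pair at or above the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy data: a feature, its square, and an unrelated column
rng = np.random.default_rng(0)
d = rng.uniform(0.0, 1.0, size=200)
toy = pd.DataFrame({
    "distance": d,
    "distance_squared": d ** 2,
    "noise": rng.normal(size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data this could be called as `highly_correlated_pairs(X)`. A high correlation is only a hint: tree-based models tolerate correlated inputs, while linear models are more sensitive to them.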


### 4. Train-Test Split

We use TimeSeriesSplit for validation, as random splits would leak future information. No separate holdout set is created at this stage.

In [68]:
# Set up TimeSeriesSplit
N_SPLITS = 5
tscv = TimeSeriesSplit(n_splits=N_SPLITS)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold+1}: Train {len(train_idx)} | Val {len(val_idx)}")

Fold 1: Train 490 | Val 486
Fold 2: Train 976 | Val 486
Fold 3: Train 1462 | Val 486
Fold 4: Train 1948 | Val 486
Fold 5: Train 2434 | Val 486
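
To make the ordering guarantee concrete, the following sketch prints the index ranges that `TimeSeriesSplit` produces on a toy array of 12 samples (the array `X_toy` is an assumption for illustration). Every validation window starts right after the last training index, so no fold trains on later observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples
folds = list(TimeSeriesSplit(n_splits=3).split(X_toy))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(
        f"Fold {fold}: train {train_idx.min()}-{train_idx.max()} "
        f"| val {val_idx.min()}-{val_idx.max()}"
    )
# Fold 1: train 0-2 | val 3-5
# Fold 2: train 0-5 | val 6-8
# Fold 3: train 0-8 | val 9-11
```

The same expanding-window pattern explains the fold sizes above: the training set grows by one validation block (486 rows) per fold.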


### 5. Model Selection and Training

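We compute a naive last-value baseline and define the candidate regression models: Linear Regression, Ridge, ElasticNet, Random Forest, Extra Trees, and HistGradientBoostingRegressor, plus XGBoost, LightGBM, and CatBoost when those libraries are installed.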

In [69]:
# ---- Baseline ----
baseline_mae = naive_last_value_baseline(y)
print(f"Naive baseline (last value) MAE: {baseline_mae:.3f}")

Naive baseline (last value) MAE: 6077.422
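
The baseline MAE above is computed over the whole series, whereas the models are scored on validation folds only. A possible variant, sketched here under the assumption that a like-for-like comparison is wanted, scores the last-value prediction on the same folds. The helper `naive_baseline_cv` and the toy series are hypothetical, not part of `baselines.py`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

def naive_baseline_cv(y: pd.Series, tscv) -> float:
    """Mean last-value MAE over the validation folds of `tscv`."""
    y_prev = y.shift(1)  # value observed one timestep earlier
    maes = []
    for _, val_idx in tscv.split(y):
        # Validation folds never contain index 0, so y_prev has no NaN here
        maes.append(mean_absolute_error(y.iloc[val_idx], y_prev.iloc[val_idx]))
    return float(np.mean(maes))

# Toy series that rises by 2 each step, so every last-value error is 2
y_toy = pd.Series([0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
toy_mae = naive_baseline_cv(y_toy, TimeSeriesSplit(n_splits=3))
print(f"Fold-wise naive baseline MAE: {toy_mae:.3f}")  # 2.000
```

With the notebook's objects this would be called as `naive_baseline_cv(y, tscv)`.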


In [70]:
# ---- Models ----
models = {
    "LinearRegression": {
        "model": LinearRegression(),
        "use_scaler": True
    },
    "Ridge": {
        "model": Ridge(alpha=10.0, random_state=RANDOM_STATE),
        "use_scaler": True
    },
    "ElasticNet": {
        "model": ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=RANDOM_STATE),
        "use_scaler": True
    },
    "RandomForest": {
        "model": RandomForestRegressor(
            n_estimators=200,
            max_depth=10,
            random_state=RANDOM_STATE
        ),
        "use_scaler": False
    },
    "ExtraTrees": {
        "model": ExtraTreesRegressor(
            n_estimators=200,
            max_depth=10,
            random_state=RANDOM_STATE
        ),
        "use_scaler": False
    },
    "HistGBR": {
        "model": HistGradientBoostingRegressor(
            max_iter=200,
            learning_rate=0.05,
            random_state=RANDOM_STATE
        ),
        "use_scaler": False
    },
}

In [71]:
# Add external models if available
if XGBRegressor is not None:
    models["XGBoost"] = {
        "model": XGBRegressor(n_estimators=200, max_depth=10, learning_rate=0.05, random_state=RANDOM_STATE, verbosity=0),
        "use_scaler": False
    }
if LGBMRegressor is not None:
    models["LightGBM"] = {
        "model": LGBMRegressor(n_estimators=200, max_depth=10, learning_rate=0.05, random_state=RANDOM_STATE, verbose=-1),
        "use_scaler": False
    }
if CatBoostRegressor is not None:
    models["CatBoost"] = {
        "model": CatBoostRegressor(iterations=200, depth=8, learning_rate=0.05, random_state=RANDOM_STATE, verbose=0),
        "use_scaler": False
    }

In [72]:
best_name = None
best_mae = float("inf")
best_model = None
best_use_scaler = False

### 6. Model Evaluation

We compute the cross-validated MAE of each model, averaged over all folds, and keep the model with the lowest value.

In [73]:
# ---- CV evaluation ----
for name, cfg in models.items():
    print(f"\nEvaluating {name}...")
    mae = evaluate_model_cv(
        cfg["model"], X, y, tscv, use_scaler=cfg["use_scaler"]
    )
    print(f"{name} CV MAE: {mae:.3f}")

    if mae < best_mae:
        best_mae = mae
        best_name = name
        best_model = cfg["model"]
        best_use_scaler = cfg["use_scaler"]

print(f"\nBest model: {best_name} (MAE: {best_mae:.3f})")


Evaluating LinearRegression...
LinearRegression CV MAE: 8917.241

Evaluating Ridge...
Ridge CV MAE: 3873.254

Evaluating ElasticNet...
ElasticNet CV MAE: 5077.591

Evaluating RandomForest...
RandomForest CV MAE: 2064.896

Evaluating ExtraTrees...
ExtraTrees CV MAE: 1932.558

Evaluating HistGBR...
HistGBR CV MAE: 2154.621

Evaluating XGBoost...
XGBoost CV MAE: 2123.268

Evaluating LightGBM...
LightGBM CV MAE: 2212.100

Evaluating CatBoost...
CatBoost CV MAE: 2222.669

Best model: ExtraTrees (MAE: 1932.558)
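
The loop above reports only the mean MAE per model. As a complementary sketch, per-fold scores can be collected with scikit-learn's `cross_val_score` to see how much each model varies between folds. The data here are synthetic (`make_regression`), and the candidate list is shortened for illustration, so the numbers say nothing about the solar dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in data (assumption: not the solar features)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
tscv_demo = TimeSeriesSplit(n_splits=5)

candidates = {
    "Ridge": Ridge(alpha=10.0),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, max_depth=10, random_state=0),
}

rows = []
for name, est in candidates.items():
    # The scorer returns negative MAE, so flip the sign
    fold_mae = -cross_val_score(
        est, X_demo, y_demo, cv=tscv_demo, scoring="neg_mean_absolute_error"
    )
    rows.append({"model": name, "mae_mean": fold_mae.mean(), "mae_std": fold_mae.std()})

summary = pd.DataFrame(rows).sort_values("mae_mean").reset_index(drop=True)
print(summary)
```

A large `mae_std` relative to the gap between two models suggests their ranking may not be stable across folds.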


In [74]:
# ---- Train final model ----
print("\nTraining final model on full dataset...")

if best_use_scaler:
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    final_model = Pipeline([
        ("scaler", StandardScaler()),
        ("model", best_model)
    ])
else:
    final_model = best_model

final_model.fit(X, y)


Training final model on full dataset...


ExtraTreesRegressor(max_depth=10, n_estimators=200, random_state=42)


### 7. Save best model

In [75]:
MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(final_model, MODEL_PATH)
print(f"Model saved to {MODEL_PATH}")

# ---- Sanity check ----
loaded_model = joblib.load(MODEL_PATH)
preds = loaded_model.predict(X.iloc[:5])
print("Test predictions:", preds)

Model saved to ..\models\final_model.joblib
Test predictions: [    0.             0.          4715.79100257 23149.73339716
 29908.79373521]
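
The check above confirms that the saved file loads and predicts. A slightly stricter variant, sketched below on synthetic data, compares the reloaded model's predictions with those of the in-memory model. The temporary directory, the file name `model.joblib`, and the small `ExtraTreesRegressor` are assumptions for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic model (assumption: stands in for final_model)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
model = ExtraTreesRegressor(n_estimators=20, max_depth=5, random_state=0)
model.fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# Predictions before and after the round trip should match
round_trip_ok = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Round trip OK:", round_trip_ok)
```

In the notebook, the equivalent comparison would be between `final_model.predict(X)` and `loaded_model.predict(X)`.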


#### Explanation: Model Training Pipeline

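- **Baseline:** The naive baseline predicts the previous timestep's value (last observed power). Its MAE (6077.422) is the reference that every model must beat. It is computed over the full series, while the model MAEs are averaged over validation folds, so the comparison is approximate.
- **Model Candidates:** Linear (LinearRegression), regularized (Ridge, ElasticNet), tree ensemble (RandomForest, ExtraTrees), and boosting (HistGBR) models are compared. XGBoost, LightGBM, and CatBoost are added if the libraries are installed.
- **Validation:** All models are evaluated with a 5-fold TimeSeriesSplit. Each fold trains only on observations that precede its validation window, so no model is trained on future data.
- **Scaler:** Linear models use StandardScaler inside a pipeline, so the scaler is fit on each training fold only. Tree-based models do not require scaling.
- **Selection:** The model with the lowest mean cross-validated MAE is selected. In this run that is ExtraTrees, with an MAE of 1932.558.
- **Final Training:** The selected model is retrained on the full dataset and saved to `MODEL_PATH`.
- **Sanity Check:** The notebook reloads the saved model and prints predictions for the first five rows of the training data. This confirms that saving and loading work; it is not a measure of accuracy.

This procedure follows the project guidelines: candidates are compared on time-ordered folds, and the selected model is the one saved for deployment. Because no separate holdout set is created, the cross-validated MAE is the only out-of-sample estimate available at this stage.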