# Model Training

First, let's test linear models with transformed features. 
The models include Linear Regression, Ridge Regression, and Lasso Regression.

### Final choice of features for linear models:

- H0_MJ: computed from a formula; strongly correlated with the target

- daylength: almost identical to sunlight_hours, but it showed a better correlation, so only daylength is kept

- lat_cos: cosine-transformed latitude

- lon_cos: cosine-transformed longitude

- Cloud_Cover_Mean_24h: cloud cover from AgERA5

- elevation_log: log-transformed elevation from the DEM dataset

- doy_sin and doy_cos: sine and cosine encodings of the day of year
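
For readers who want to reproduce these features, the sketch below shows one plausible way to derive them. It assumes that H0_MJ is the daily extraterrestrial radiation (MJ/m²/day) and that both it and daylength follow the standard FAO-56 equations; the notebook does not say which formula was actually used. The exact forms of the other transforms (cosine of the angle in radians, `log1p` for elevation, a 365-day period for the day of year) are also assumptions, and both function names are hypothetical.

```python
import math

GSC = 0.0820  # solar constant in MJ m^-2 min^-1 (value used by FAO-56)


def extraterrestrial_radiation(lat_deg, doy):
    """Return (H0 in MJ/m^2/day, daylength in hours) from the FAO-56 equations."""
    phi = math.radians(lat_deg)
    angle = 2 * math.pi * doy / 365
    dr = 1 + 0.033 * math.cos(angle)        # inverse relative Earth-Sun distance
    delta = 0.409 * math.sin(angle - 1.39)  # solar declination in radians
    # Sunset hour angle; the clamp covers polar day and polar night.
    cos_ws = max(-1.0, min(1.0, -math.tan(phi) * math.tan(delta)))
    ws = math.acos(cos_ws)
    h0 = (24 * 60 / math.pi) * GSC * dr * (
        ws * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(ws)
    )
    daylength = 24 / math.pi * ws
    return h0, daylength


def transformed_features(lat_deg, lon_deg, elevation_m, doy):
    """Assumed forms of the transformed features (hypothetical helper)."""
    angle = 2 * math.pi * doy / 365
    return {
        "lat_cos": math.cos(math.radians(lat_deg)),
        "lon_cos": math.cos(math.radians(lon_deg)),
        # Elevations below sea level are clipped to 0 before the log (an assumption).
        "elevation_log": math.log1p(max(elevation_m, 0.0)),
        "doy_sin": math.sin(angle),
        "doy_cos": math.cos(angle),
    }


# 3 September (day 246) at 20 degrees south
h0, n_hours = extraterrestrial_radiation(lat_deg=-20.0, doy=246)
print(f"H0 = {h0:.1f} MJ/m^2/day, daylength = {n_hours:.1f} h")
```

For 3 September at 20°S this gives about 32.2 MJ/m²/day and 11.7 hours of daylight. The sine and cosine pair keeps 31 December and 1 January close together in feature space, which a raw day-of-year number would not.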

In [28]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Load the data and prepare features
df = pd.read_csv('../data/processed/clean_with_transformed_features.csv')
features = ['H0_MJ', 'daylength', 'lat_cos', 'lon_cos', 'Cloud_Cover_Mean_24h', 
           'elevation_log', 'doy_sin', 'doy_cos']
X = df[features]
y = df['Solar_Radiation_Flux']

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
}

# Cross-validation setup
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Store results
results = {}

print("Training and evaluating models with 3-fold cross-validation...")
print("="*70)

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Cross-validation scores
    cv_r2_scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    cv_mae_scores = -cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    cv_rmse_scores = np.sqrt(-cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error'))
    
    # Store results
    results[name] = {
        'R2_mean': cv_r2_scores.mean(),
        'R2_std': cv_r2_scores.std(),
        'MAE_mean': cv_mae_scores.mean(),
        'MAE_std': cv_mae_scores.std(),
        'RMSE_mean': cv_rmse_scores.mean(),
        'RMSE_std': cv_rmse_scores.std()
    }
    
    print(f"R² Score: {cv_r2_scores.mean():.4f} (±{cv_r2_scores.std():.4f})")
    print(f"MAE: {cv_mae_scores.mean():.4f} (±{cv_mae_scores.std():.4f})")
    print(f"RMSE: {cv_rmse_scores.mean():.4f} (±{cv_rmse_scores.std():.4f})")

print("\n" + "="*70)
print("MODEL PERFORMANCE SUMMARY")
print("="*70)

# Create results DataFrame for better visualization
results_df = pd.DataFrame(results).T
print(results_df.round(4))

Training and evaluating models with 3-fold cross-validation...

Training Linear Regression...
R² Score: 0.8484 (±0.0033)
MAE: 2.2716 (±0.0095)
RMSE: 3.0570 (±0.0300)

Training Ridge Regression...
R² Score: 0.8484 (±0.0033)
MAE: 2.2715 (±0.0095)
RMSE: 3.0570 (±0.0300)

Training Lasso Regression...
R² Score: 0.8440 (±0.0039)
MAE: 2.2927 (±0.0135)
RMSE: 3.1014 (±0.0347)

MODEL PERFORMANCE SUMMARY
                   R2_mean  R2_std  MAE_mean  MAE_std  RMSE_mean  RMSE_std
Linear Regression   0.8484  0.0033    2.2716   0.0095     3.0570    0.0300
Ridge Regression    0.8484  0.0033    2.2715   0.0095     3.0570    0.0300
Lasso Regression    0.8440  0.0039    2.2927   0.0135     3.1014    0.0347


All three linear models perform almost identically. Linear Regression and Ridge are tied (R² 0.8484, RMSE 3.0570), while Lasso is slightly behind (R² 0.8440, RMSE 3.1014).
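
Ridge and Lasso penalize coefficient size, so their fit depends on the scale of each feature. One possible variant, sketched below on synthetic data, standardizes the features inside each fold with a pipeline and collects all three metrics in a single `cross_validate` pass instead of three separate runs. The synthetic data and the variable names are illustrative assumptions; the `alpha` values are carried over from the cell above, and this variant has not been run on the real dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: four features on very different scales.
rng = np.random.default_rng(42)
scales = np.array([1.0, 10.0, 100.0, 1000.0])
X_demo = rng.normal(size=(300, 4)) * scales
y_demo = (X_demo / scales) @ np.array([3.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

demo_models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
}
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}
cv_demo = KFold(n_splits=3, shuffle=True, random_state=42)

demo_results = {}
for name, estimator in demo_models.items():
    # The scaler is fitted on the training folds only, so no statistics leak from the held-out fold.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_validate(pipeline, X_demo, y_demo, cv=cv_demo, scoring=scoring)
    demo_results[name] = {
        "R2_mean": scores["test_r2"].mean(),
        "MAE_mean": -scores["test_mae"].mean(),    # scorers return negated errors
        "RMSE_mean": -scores["test_rmse"].mean(),
    }
    print(name, {k: round(float(v), 4) for k, v in demo_results[name].items()})
```

Plain Linear Regression gives the same predictions with or without scaling, so any change would show up only in the Ridge and Lasso rows.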

## 🌳 Non-Linear Model Testing with Raw Features

After testing linear models with transformed features, this step evaluates tree-based ensemble methods using raw (non-transformed) features. 

Tree-based models can naturally handle non-linear relationships and feature interactions without requiring manual transformations.

Now let's test more complex models (Random Forest, Gradient Boosting, XGBoost) trained on the non-transformed feature set:

H0_MJ, daylength, lat, lon, Cloud_Cover_Mean_24h, elevation, doy_sin, doy_cos.

Key differences from the linear models:

- lat, lon: raw coordinates (not cosine-transformed)

- elevation: raw elevation (not log-transformed)

- No standardization: tree models are insensitive to feature scale

The cyclical features doy_sin and doy_cos are kept in both feature sets.
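
The point about scale can be checked directly: a decision tree splits on the ordering of feature values, and a monotonic transform such as a logarithm preserves that ordering. The sketch below uses synthetic, elevation-like values (an assumption, not the project's dataset) and compares a tree fitted on raw values with one fitted on `log1p` values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic elevation-like feature and a noisy non-linear target.
rng = np.random.default_rng(0)
elevation = np.linspace(1.0, 3000.0, 200)
target = np.sqrt(elevation) + rng.normal(scale=0.5, size=elevation.size)

X_raw = elevation.reshape(-1, 1)
X_log = np.log1p(X_raw)  # monotonic transform: same ordering of samples

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, target)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_log, target)

# Both trees partition the training samples identically, so their predictions match.
same_predictions = np.allclose(tree_raw.predict(X_raw), tree_log.predict(X_log))
print("Identical training predictions:", same_predictions)
```

The same invariance does not hold for linear models, which fit one coefficient per feature and so depend on the shape of the transform.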

In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Load cleaned dataset
df_model = pd.read_csv("../data/processed/clean_with_transformed_features.csv")

# Define features and target
features = [
    'H0_MJ',
    'daylength',
    'lat',
    'lon',
    'Cloud_Cover_Mean_24h',
    'elevation',
    'doy_sin',
    'doy_cos',
]
target = 'Solar_Radiation_Flux'

X = df_model[features]
y = df_model[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define models
models = {
    "Random Forest": RandomForestRegressor(
        n_estimators=200,
        max_depth=20,
        random_state=42,
        n_jobs=-1
    ),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.1,
        random_state=42
    ),
    "XGBoost": XGBRegressor(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        n_jobs=-1
    )
}

# Cross-validation setup
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect results
results = []


# ML models evaluation
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    cv_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

    results.append({
        "Model": name,
        "R² Test": r2,
        "RMSE Test": rmse,
        "CV R² Mean": cv_scores.mean(),
        "CV R² Std": cv_scores.std()
    })

    # Feature importance (when available)
    if hasattr(model, "feature_importances_"):
        importances = pd.DataFrame({
            'Feature': features,
            'Importance': model.feature_importances_
        }).sort_values(by='Importance', ascending=False)

        print(f"\n=== {name.upper()} FEATURE IMPORTANCES ===")
        print(importances.to_string(index=False))

# Print summary table
results_df = pd.DataFrame(results)
print("\n=== MODEL COMPARISON ===")
print(results_df.sort_values("R² Test", ascending=False).to_string(index=False))



=== RANDOM FOREST FEATURE IMPORTANCES ===
             Feature  Importance
               H0_MJ    0.718002
Cloud_Cover_Mean_24h    0.187115
           elevation    0.023619
                 lon    0.016558
             doy_sin    0.016201
                 lat    0.014348
           daylength    0.012352
             doy_cos    0.011805

=== GRADIENT BOOSTING FEATURE IMPORTANCES ===
             Feature  Importance
               H0_MJ    0.758174
Cloud_Cover_Mean_24h    0.190950
           elevation    0.014456
                 lat    0.009358
             doy_sin    0.009056
           daylength    0.007493
                 lon    0.006934
             doy_cos    0.003579

=== XGBOOST FEATURE IMPORTANCES ===
             Feature  Importance
               H0_MJ    0.557615
           daylength    0.177280
Cloud_Cover_Mean_24h    0.157253
                 lat    0.028157
           elevation    0.022879
             doy_cos    0.020941
             doy_sin    0.018669
               

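## Results

R² > 0.90: the models explain more than 90% of the variance in solar radiation, compared with about 85% for the linear models.

RMSE is low (~2.4 MJ/m²), down from about 3.06 for the linear models.

CV scores are consistent across folds, which suggests the models generalize well.

XGBoost performs best, with:

- the best R² (0.905 on the test set and 0.904 in CV)

- the lowest RMSE (2.42 MJ/m²)

- the most stable cross-validation scores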
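
To make the two headline metrics concrete, the sketch below computes R² and RMSE by hand on four made-up values (not taken from the dataset). R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual, so it is expressed in the target's units (MJ/m² here). The function name is hypothetical.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)


r2, rmse = r2_and_rmse([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 13.0, 17.0])
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")  # R2 = 0.850, RMSE = 0.866
```

These are the same definitions that `r2_score` and `mean_squared_error` from scikit-learn implement.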