<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Solar_Panel_Performance_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ☀️ Solar Panel Performance Optimization Challenge ☀️

**Predicting Degradation and Failures for a Sustainable Future**

---

## 🎯 1. The Challenge: Maximizing Solar Energy Output

Solar energy is a cornerstone of sustainable power. However, the efficiency of Photovoltaic (PV) panels can degrade over time or due to unforeseen failures. Traditional maintenance is often reactive, leading to:

*   📉 **Energy Loss:** Suboptimal performance means less clean energy generated.
*   💰 **Increased Costs:** Reactive repairs and downtime are expensive.

**Our Mission:** To develop a sophisticated Machine Learning model that predicts `efficiency` (our target variable), enabling **predictive maintenance**. This proactive approach will help maintain peak performance and reduce operational interruptions.

---

## 📊 2. Understanding Our Data

We're provided with a rich dataset containing sensor readings and panel characteristics.

*   **`train.csv`**: The training ground for our model (20,000 samples, 17 features including `efficiency`).
*   **`test.csv`**: The unseen data where we'll make our predictions (12,000 samples, 16 features).
*   **`sample_submission.csv`**: The blueprint for our final submission file.

### Key Data Features at a Glance:

| Feature Category    | Column Examples                                 | Description                                                                    |
| :------------------ | :---------------------------------------------- | :----------------------------------------------------------------------------- |
| 🆔 **Identifiers**   | `id`, `string_id`                               | Unique row and panel group identifiers.                                        |
| 🌡️ **Environmental** | `temperature`, `irradiance`, `humidity`, `cloud_coverage`, `wind_speed`, `pressure` | Ambient conditions influencing panel operation.                |
| 🛠️ **Panel Specifics**| `panel_age`, `maintenance_count`, `soiling_ratio`, `module_temperature`, `error_code`, `installation_type` | Panel history, condition, and setup.                               |
| ⚡ **Electrical**   | `voltage`, `current`                            | Measured electrical output.                                                    |
| 🏆 **Target**        | **`efficiency`**                                | **The crucial variable we need to predict!** (0.0 - 1.0 scale)             |

*A detailed description of each column is available in the problem statement.*

---

## 🚀 3. Our Game Plan: Building a Winning Model

We'll follow a structured, iterative approach to tackle this prediction task:

1.  **⚙️ Initial Setup & Environment Configuration:**
    *   Importing essential Python libraries (Pandas, NumPy, Scikit-learn, LightGBM, XGBoost, CatBoost, Plotly, Optuna, SHAP).
    *   Loading the datasets into our workspace.

2.  **🔍 Exploratory Data Analysis (EDA) - Unveiling Insights:**
    *   Deep dive into data distributions, missing values, and potential outliers.
    *   Visualizing feature relationships and their correlation with `efficiency` using:
        *   **Matplotlib & Seaborn:** For static, foundational plots.
        *   **Plotly:** For dynamic, interactive visualizations to uncover subtle patterns.

3.  **✨ Feature Engineering - Crafting Predictive Power:**
    *   Creating new, informative features from existing ones (e.g., interaction terms like `temperature * irradiance`, differences like `module_temperature - temperature`). The goal is to provide the model with richer signals.

4.  **🧹 Data Preprocessing - Preparing for Modeling:**
    *   **Missing Value Imputation:** Strategically filling in any data gaps.
    *   **Categorical Encoding:** Transforming text-based features (like `string_id`, `error_code`) into a numerical format (One-Hot Encoding).
    *   **Feature Scaling:** Normalizing numerical features (`StandardScaler`) to ensure fair contribution from all variables.

5.  **🧠 Model Building & Cross-Validation - The Core Engine:**
    *   **Algorithm Selection:** Focusing on state-of-the-art gradient boosting models:
        *   **LightGBM (LGBM):** Known for speed and efficiency.
        *   **XGBoost:** A robust and widely-used powerhouse.
        *   **CatBoost:** Optional third model, enabled through the configuration cell.
    *   **K-Fold Cross-Validation:** Training and evaluating models on different subsets of the data to get a reliable performance estimate. This helps detect overfitting, which a single train/validation split can hide.

6.  **🛠️ Hyperparameter Optimization - Fine-Tuning for Excellence:**
    *   Leveraging **Optuna**, an automated hyperparameter optimization framework. Optuna will intelligently search for the best set of model settings (e.g., learning rate, tree depth) to maximize our chosen metric.

7.  **🤝 Model Ensembling - The Power of Collaboration:**
    *   **Blending:** Combining the predictions from our fine-tuned LGBM and XGBoost models (plus CatBoost when it is enabled). The idea is that different models capture different aspects of the data, and their combined wisdom is often superior to any single model. We'll optimize the blending weights.

8.  **📜 Prediction & Submission - Delivering Results:**
    *   Applying our final, ensembled model to the `test.csv` data.
    *   Generating the `submission.csv` file in the specified format.
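
The blending idea from step 7 can be sketched on synthetic data. This is only an illustration under stated assumptions: `pred_a` and `pred_b` are made-up stand-ins for two models with independent errors, and a simple grid search over one weight replaces the Optuna search used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: a "true" efficiency and two models with independent errors.
y_true = rng.uniform(0.2, 0.9, size=1000)
pred_a = y_true + rng.normal(0.0, 0.05, size=1000)  # hypothetical model A
pred_b = y_true + rng.normal(0.0, 0.07, size=1000)  # hypothetical model B


def rmse(y, p):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y - p) ** 2)))


# Try weights 0.00, 0.01, ..., 1.00 for model A; model B gets the remainder.
weights = np.linspace(0.0, 1.0, 101)
scores = [rmse(y_true, w * pred_a + (1.0 - w) * pred_b) for w in weights]
best_idx = int(np.argmin(scores))
best_w, best_rmse = float(weights[best_idx]), scores[best_idx]

print(f"RMSE A: {rmse(y_true, pred_a):.4f}, RMSE B: {rmse(y_true, pred_b):.4f}")
print(f"Best weight for A: {best_w:.2f}, blended RMSE: {best_rmse:.4f}")
```

Because the grid includes the weights 0 and 1, the best blend can never score worse than the better single model on the data used to pick the weight. In practice the weight should be chosen on out-of-fold predictions, not on the training fit.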

---

## 📈 4. Measuring Success: The Evaluation Metric

Our model's prowess will be judged by a custom scoring formula:

**Score = 100 \* (1 - RMSE)**

Where `RMSE` (Root Mean Squared Error) is calculated as:
`RMSE = sqrt(mean_squared_error(actual_efficiency, predicted_efficiency))`

**A higher score indicates a more accurate model.** Our goal is to maximize this score!
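
As a quick worked example of the formula, the sketch below computes the score for four made-up efficiency values. The helper name `competition_score` and the numbers are assumptions for illustration; they do not come from the competition data.

```python
import numpy as np


def competition_score(y_true, y_pred):
    """Return 100 * (1 - RMSE), the score defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * (1.0 - rmse)


# Hypothetical efficiencies, made up for illustration only.
actual = [0.50, 0.60, 0.70, 0.80]
predicted = [0.60, 0.50, 0.80, 0.70]  # every prediction is off by 0.10

score = competition_score(actual, predicted)
print(round(score, 2))  # RMSE is 0.10, so the score is 90.0
```

Since `efficiency` lies on a 0.0 to 1.0 scale, an RMSE of 0.10 costs 10 points, and a perfect prediction scores 100.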

---

## 🏁 Let's Begin the Journey!

The code cells below will bring this plan to life. We'll document each step, share our findings, and strive for the best possible prediction model.

In [17]:
!pip install pandas numpy matplotlib seaborn plotly scikit-learn lightgbm xgboost optuna shap kaleido catboost -q


In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly imports
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer
import lightgbm as lgb
import xgboost as xgb
import catboost as cb # Assuming CatBoost will be used
from sklearn.metrics import mean_squared_error

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

import shap

import os
import warnings
warnings.filterwarnings('ignore')

In [19]:
# --- Configuration ---
RUN_OPTUNA = True
OPTUNA_TRIALS_LGBM = 30     # Number of Optuna trials
OPTUNA_TRIALS_XGB = 30
OPTUNA_TRIALS_CATBOOST = 30 # Set to 0 to skip CatBoost
N_SPLITS = 5
RANDOM_STATE = 42
USE_ITERATIVE_IMPUTER = False # Set to True to try (slower)
SHAP_SAMPLE_SIZE = 500     # Sample size for SHAP explanations (faster) or use len(X_processed_df) for full

In [20]:
# --- Directory Setup ---
BASE_DIR = "solar_panel_analysis_full_v3"
EDA_PLOTS_DIR = os.path.join(BASE_DIR, "eda_plotly_plots")
OPTUNA_PLOTS_DIR = os.path.join(BASE_DIR, "optuna_plots")
SHAP_PLOTS_DIR = os.path.join(BASE_DIR, "shap_plots")

for dir_path in [BASE_DIR, EDA_PLOTS_DIR, OPTUNA_PLOTS_DIR, SHAP_PLOTS_DIR]:
    os.makedirs(dir_path, exist_ok=True)  # No error if the directory already exists
print(f"Outputs will be saved in '{BASE_DIR}' subdirectories.")

Outputs will be saved in 'solar_panel_analysis_full_v3' subdirectories.


In [21]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
# --- 0. Data Loading ---
print("\n--- 0. Data Loading ---")
# Assuming Google Drive is mounted at /content/drive.
# Adjust the path if your files are elsewhere.
DRIVE_PATH = "/content/drive/MyDrive/zelestra_data/"
try:
    train_df_orig = pd.read_csv(os.path.join(DRIVE_PATH, "train.csv"))
    test_df_orig = pd.read_csv(os.path.join(DRIVE_PATH, "test.csv"))
    sample_submission_df = pd.read_csv(os.path.join(DRIVE_PATH, "sample_submission.csv"))
    print(f"Train data shape: {train_df_orig.shape}, Test data shape: {test_df_orig.shape}")
except FileNotFoundError:
    print(f"ERROR: CSV files not found in {DRIVE_PATH}. Please check the path.")
    raise  # Re-raise so later cells do not run without the data


--- 0. Data Loading ---
Train data shape: (20000, 17), Test data shape: (12000, 16)


In [23]:
# --- 1. Data Cleaning & Initial Type Conversion ---
print("\n--- 1. Data Cleaning & Initial Type Conversion ---")
def clean_data(df, df_name="DataFrame"):
    df_cleaned = df.copy()
    cols_to_numeric = ['humidity', 'wind_speed', 'pressure']
    print(f"\nCleaning {df_name}:")
    for col in cols_to_numeric:
        if col in df_cleaned.columns:
            original_dtype = df_cleaned[col].dtype
            df_cleaned[col] = pd.to_numeric(df_cleaned[col], errors='coerce')
            new_nans = df_cleaned[col].isnull().sum()
            # print(f"  Converted '{col}' from {original_dtype} to numeric. Introduced/confirmed {new_nans} NaNs.")
    return df_cleaned

train_df_cleaned = clean_data(train_df_orig, "Train Data")
test_df_cleaned = clean_data(test_df_orig, "Test Data")


--- 1. Data Cleaning & Initial Type Conversion ---

Cleaning Train Data:

Cleaning Test Data:
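
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up column. The values in `raw` are hypothetical; the actual non-numeric entries in `humidity`, `wind_speed`, and `pressure` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw sensor column: numbers stored as text, plus unusable entries.
raw = pd.Series(["45.2", "error", "51.0", "unknown", None], name="humidity", dtype="object")

# Anything that cannot be parsed as a number becomes NaN instead of raising.
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned.tolist())
print("NaNs after coercion:", int(cleaned.isna().sum()))
```

The coerced NaNs are then handled by the imputation step in the preprocessing pipeline.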


In [24]:
# --- 2. In-Depth EDA (with Plotly) ---
print("\n--- 2. In-Depth EDA (Post-Cleaning) ---")
train_eda = train_df_cleaned.copy()
test_eda = test_df_cleaned.copy()
TARGET = 'efficiency'

numerical_features_eda = train_eda.select_dtypes(include=np.number).columns.tolist()
if TARGET in numerical_features_eda: numerical_features_eda.remove(TARGET)
if 'id' in numerical_features_eda: numerical_features_eda.remove('id')

categorical_features_eda = train_eda.select_dtypes(include='object').columns.tolist()
if 'id' in categorical_features_eda: categorical_features_eda.remove('id')

def plot_target_distribution(df, target_col, save_dir):
    # Histogram of the target with a box-plot marginal.
    fig = px.histogram(df, x=target_col, nbins=50, title=f'Distribution of Target ({target_col})',
                       marginal="box", color_discrete_sequence=['#636EFA'])
    fig.update_layout(bargap=0.1)
    if save_dir: fig.write_html(os.path.join(save_dir, "plotly_target_distribution.html"))

def plot_numerical_distributions_train_test(train_df, test_df, num_cols, save_dir):
    for col in num_cols:
        if col not in train_df.columns or col not in test_df.columns: continue
        fig = go.Figure()
        fig.add_trace(go.Histogram(x=train_df[col].dropna(), name='Train', nbinsx=40, marker_color='#EF553B', opacity=0.75))
        fig.add_trace(go.Histogram(x=test_df[col].dropna(), name='Test', nbinsx=40, marker_color='#00CC96', opacity=0.75))
        fig.update_layout(barmode='overlay', title_text=f'Distribution of {col} (Train vs Test)')
        fig.update_traces(opacity=0.7)
        if save_dir: fig.write_html(os.path.join(save_dir, f"plotly_dist_{col}_train_test.html"))

def plot_correlation_heatmap(df, num_cols, target_col, save_dir):
    valid_num_cols = [col for col in num_cols if col in df.columns]
    if not valid_num_cols: return
    # Ensure target_col is also numeric and present
    cols_for_corr = valid_num_cols
    if target_col in df.columns and pd.api.types.is_numeric_dtype(df[target_col]):
        cols_for_corr = valid_num_cols + [target_col]
    else:
        print(f"Warning: Target column '{target_col}' not found or not numeric for correlation heatmap.")

    correlation_matrix = df[cols_for_corr].corr()
    fig = px.imshow(correlation_matrix, text_auto=".2f", aspect="auto",
                    color_continuous_scale='RdBu_r', title='Correlation Matrix')
    if save_dir: fig.write_html(os.path.join(save_dir, "plotly_correlation_heatmap.html"))

def plot_categorical_vs_target(df, cat_cols, target_col, save_dir):
    for col in cat_cols:
        if col not in df.columns or target_col not in df.columns : continue
        df_to_plot = df[[col, target_col]].copy().dropna(subset=[col, target_col])
        if df_to_plot.empty: continue

        unique_cats = df_to_plot[col].nunique()
        if unique_cats > 20:
            top_categories = df_to_plot[col].value_counts().nlargest(20).index
            df_filtered = df_to_plot[df_to_plot[col].isin(top_categories)]
            title_suffix = " (Top 20 Categories)"
        else:
            df_filtered = df_to_plot
            title_suffix = ""
        if df_filtered.empty: continue

        fig = px.box(df_filtered, x=col, y=target_col,
                     title=f'{target_col} vs {col}{title_suffix}',
                     color=col, color_discrete_sequence=px.colors.qualitative.Plotly)
        if save_dir: fig.write_html(os.path.join(save_dir, f"plotly_boxplot_{target_col}_vs_{col}.html"))

print("Generating EDA plots post-cleaning...")
plot_target_distribution(train_eda, TARGET, EDA_PLOTS_DIR)
plot_numerical_distributions_train_test(train_eda, test_eda, numerical_features_eda, EDA_PLOTS_DIR)
plot_correlation_heatmap(train_eda, numerical_features_eda, TARGET, EDA_PLOTS_DIR)
plot_categorical_vs_target(train_eda, categorical_features_eda, TARGET, EDA_PLOTS_DIR)
print(f"EDA plots saved to {EDA_PLOTS_DIR}")


--- 2. In-Depth EDA (Post-Cleaning) ---
Generating EDA plots post-cleaning...
EDA plots saved to solar_panel_analysis_full_v3/eda_plotly_plots


In [25]:
# --- 3. Strategic Feature Engineering ---
print("\n--- 3. Strategic Feature Engineering ---")
def feature_engineer(df):
    df_fe = df.copy()
    # Interaction Features
    if 'temperature' in df_fe and 'irradiance' in df_fe: df_fe['temp_x_irradiance'] = df_fe['temperature'] * df_fe['irradiance']
    if 'voltage' in df_fe and 'current' in df_fe: df_fe['voltage_x_current'] = df_fe['voltage'] * df_fe['current']
    if 'panel_age' in df_fe and 'maintenance_count' in df_fe: df_fe['age_x_maintenance'] = df_fe['panel_age'] * (df_fe['maintenance_count'].fillna(0) + 1e-6)
    if 'irradiance' in df_fe and 'soiling_ratio' in df_fe: df_fe['irradiance_eff_soiling'] = df_fe['irradiance'] * df_fe['soiling_ratio']
    # Ratio/Difference Features
    if 'temperature' in df_fe and 'humidity' in df_fe: df_fe['temp_humidity_ratio'] = df_fe['temperature'] / (df_fe['humidity'].replace(0, 1e-6).fillna(1e-6) + 1e-6)
    if 'module_temperature' in df_fe and 'temperature' in df_fe: df_fe['temp_diff_module_ambient'] = df_fe['module_temperature'] - df_fe['temperature']
    if 'irradiance' in df_fe and 'cloud_coverage' in df_fe: df_fe['irradiance_per_cloud'] = df_fe['irradiance'] / (df_fe['cloud_coverage'].fillna(0) + 1)
    # Polynomials
    for col in ['irradiance', 'temperature', 'module_temperature', 'voltage', 'current', 'panel_age']:
         if col in df_fe.columns: df_fe[f'{col}_sq'] = df_fe[col] ** 2
    if 'module_temperature' in df_fe and 'temperature' in df_fe and 'irradiance' in df_fe:
        df_fe['temp_rise_per_irradiance'] = (df_fe['module_temperature'] - df_fe['temperature']) / (df_fe['irradiance'].replace(0,1e-6).fillna(1e-6) + 1e-6)
    return df_fe

train_df_fe = feature_engineer(train_df_cleaned)
test_df_fe = feature_engineer(test_df_cleaned)

X = train_df_fe.drop([TARGET, 'id'], axis=1, errors='ignore')
y_target = train_df_fe[TARGET]
X_test_full = test_df_fe.drop('id', axis=1, errors='ignore')
test_ids = test_df_orig['id']

train_cols = X.columns; test_cols = X_test_full.columns
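# Keep the training column order; a set intersection would make the order vary between runs.
common_cols = [col for col in train_cols if col in test_cols]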
X = X[common_cols]; X_test = X_test_full[common_cols]

numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Shape of X: {X.shape}, X_test: {X_test.shape}")
print(f"Num features: {len(numerical_features)}, Cat features: {len(categorical_features)}")


--- 3. Strategic Feature Engineering ---
Shape of X: (20000, 29), X_test: (12000, 29)
Num features: 26, Cat features: 3


In [26]:
# --- 4. Robust Data Preprocessing ---
print("\n--- 4. Robust Data Preprocessing ---")
numerical_imputer = IterativeImputer(max_iter=10, random_state=RANDOM_STATE) if USE_ITERATIVE_IMPUTER else SimpleImputer(strategy='median')
numerical_pipeline = Pipeline([('imputer', numerical_imputer), ('scaler', StandardScaler())])
categorical_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                                 ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preprocessor = ColumnTransformer(
    [('num', numerical_pipeline, numerical_features),
     ('cat', categorical_pipeline, categorical_features)],
    remainder='drop', n_jobs=-1) # Drop any columns not specified to avoid errors

X_processed = preprocessor.fit_transform(X)
X_test_processed = preprocessor.transform(X_test)

try:
    ohe_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
    all_feature_names = numerical_features + list(ohe_feature_names)
    X_processed_df = pd.DataFrame(X_processed, columns=all_feature_names, index=X.index)
    X_test_processed_df = pd.DataFrame(X_test_processed, columns=all_feature_names, index=X_test.index)
except Exception as e:
    print(f"Fallback: Using generic feature names due to: {e}")
    X_processed_df = pd.DataFrame(X_processed, index=X.index)
    X_test_processed_df = pd.DataFrame(X_test_processed, index=X_test.index)
    all_feature_names = [f"feat_{i}" for i in range(X_processed_df.shape[1])]
    X_processed_df.columns = all_feature_names
    X_test_processed_df.columns = all_feature_names
print(f"Processed data shapes: X_processed_df {X_processed_df.shape}, X_test_processed_df {X_test_processed_df.shape}")


--- 4. Robust Data Preprocessing ---
Processed data shapes: X_processed_df (20000, 36), X_test_processed_df (12000, 36)


In [27]:
# --- 5. Model Building & CV ---
print("\n--- 5. Model Building & CV ---")
def custom_score_func(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return 100 * (1 - rmse)
kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)


--- 5. Model Building & CV ---
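
The training cells that follow fill out-of-fold (OOF) arrays such as `oof_lgbm`. The pattern can be sketched on synthetic data; here `Ridge` is used only as a lightweight stand-in for the boosting models, and `X_demo`, `y_demo`, and `kf_demo` are hypothetical names for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                      # synthetic features
noise = rng.normal(0.0, 0.05, size=200)
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + noise    # synthetic target

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.full(len(X_demo), np.nan)

for train_idx, val_idx in kf_demo.split(X_demo):
    model = Ridge(alpha=1.0).fit(X_demo[train_idx], y_demo[train_idx])
    # Each row is predicted exactly once, by a model that never saw it.
    oof[val_idx] = model.predict(X_demo[val_idx])

oof_rmse = float(np.sqrt(np.mean((y_demo - oof) ** 2)))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Because every OOF prediction comes from a model trained without that row, the OOF error is a fairer estimate of test performance than the training error, and the OOF arrays can later be reused to fit blending weights.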


In [None]:
# --- 6. Hyperparameter Optimization (Optuna) & Training ---
print("\n--- 6. Hyperparameter Optimization & Training ---")
# --- LightGBM ---
def objective_lgbm(trial, X_data, y_data):
    # Hyperparameter search space for LightGBM.
    # Note: 'regression_l1' trains on MAE, while each trial is scored by RMSE.
    params = {
        'objective': 'regression_l1', 'metric': 'rmse', 'random_state': RANDOM_STATE,
        'n_estimators': trial.suggest_int('n_estimators', 300, 2500),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.05, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150), 'max_depth': trial.suggest_int('max_depth', 4, 12),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0), 'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-2, 10.0, log=True), 'reg_lambda': trial.suggest_float('reg_lambda', 1e-2, 10.0, log=True),
        'verbose': -1, 'n_jobs': -1
    }
    fold_rmses = []
    for train_idx, val_idx in kf.split(X_data, y_data):
        model = lgb.LGBMRegressor(**params)
        model.fit(X_data.iloc[train_idx], y_data.iloc[train_idx],
                  eval_set=[(X_data.iloc[val_idx], y_data.iloc[val_idx])],
                  eval_metric='rmse', callbacks=[lgb.early_stopping(100, verbose=False)])
        preds_val = model.predict(X_data.iloc[val_idx])
        fold_rmses.append(np.sqrt(mean_squared_error(y_data.iloc[val_idx], preds_val)))
    return np.mean(fold_rmses)

best_params_lgbm = {}
if RUN_OPTUNA and OPTUNA_TRIALS_LGBM > 0:
    study_lgbm = optuna.create_study(direction='minimize', study_name="LGBM_Opt")
    study_lgbm.optimize(lambda trial: objective_lgbm(trial, X_processed_df, y_target), n_trials=OPTUNA_TRIALS_LGBM, n_jobs=-1)
    best_params_lgbm = study_lgbm.best_params
    print("Best LGBM Params:", best_params_lgbm)
    if OPTUNA_PLOTS_DIR:
        optuna.visualization.plot_optimization_history(study_lgbm).write_html(os.path.join(OPTUNA_PLOTS_DIR, "optuna_lgbm_history.html"))
        optuna.visualization.plot_slice(study_lgbm).write_html(os.path.join(OPTUNA_PLOTS_DIR, "optuna_lgbm_slice.html"))
else: best_params_lgbm = {'n_estimators': 1500, 'learning_rate': 0.01} # Basic default

oof_lgbm = np.zeros(len(X_processed_df)); test_preds_lgbm = np.zeros(len(X_test_processed_df)); lgbm_models = []
final_lgbm_params = {**{'objective': 'regression_l1', 'metric': 'rmse', 'random_state': RANDOM_STATE, 'verbose': -1, 'n_jobs': -1}, **best_params_lgbm}
for fold, (train_idx, val_idx) in enumerate(kf.split(X_processed_df, y_target)):
    model = lgb.LGBMRegressor(**final_lgbm_params)
    model.fit(X_processed_df.iloc[train_idx], y_target.iloc[train_idx],
              eval_set=[(X_processed_df.iloc[val_idx], y_target.iloc[val_idx])],
              eval_metric='rmse', callbacks=[lgb.early_stopping(100, verbose=False)])
    oof_lgbm[val_idx] = model.predict(X_processed_df.iloc[val_idx])
    test_preds_lgbm += model.predict(X_test_processed_df) / N_SPLITS
    lgbm_models.append(model)
lgbm_cv_score = custom_score_func(y_target, oof_lgbm)
print(f"LightGBM CV Custom Score: {lgbm_cv_score:.4f}")

# --- XGBoost ---
def objective_xgb(trial, X_data, y_data):
    # Hyperparameter search space for XGBoost.
    params = {
        'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'random_state': RANDOM_STATE,
        'n_estimators': trial.suggest_int('n_estimators', 300, 2500),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.05, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10), 'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0), 'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 1e-2, 1.0, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-2, 10.0, log=True), 'reg_lambda': trial.suggest_float('reg_lambda', 1e-2, 10.0, log=True),
        'n_jobs': -1, 'tree_method': 'hist'
    }
    fold_rmses = []
    for train_idx, val_idx in kf.split(X_data, y_data):
        # XGBoost >= 2.0 takes early_stopping_rounds in the constructor, not in fit().
        model = xgb.XGBRegressor(**params, early_stopping_rounds=100)
        model.fit(X_data.iloc[train_idx], y_data.iloc[train_idx],
                  eval_set=[(X_data.iloc[val_idx], y_data.iloc[val_idx])],
                  verbose=False)
        preds_val = model.predict(X_data.iloc[val_idx])
        fold_rmses.append(np.sqrt(mean_squared_error(y_data.iloc[val_idx], preds_val)))
    return np.mean(fold_rmses)

best_params_xgb = {}
if RUN_OPTUNA and OPTUNA_TRIALS_XGB > 0:
    study_xgb = optuna.create_study(direction='minimize', study_name="XGB_Opt")
    study_xgb.optimize(lambda trial: objective_xgb(trial, X_processed_df, y_target), n_trials=OPTUNA_TRIALS_XGB, n_jobs=-1)
    best_params_xgb = study_xgb.best_params
    print("Best XGB Params:", best_params_xgb)
    if OPTUNA_PLOTS_DIR:
        optuna.visualization.plot_optimization_history(study_xgb).write_html(os.path.join(OPTUNA_PLOTS_DIR, "optuna_xgb_history.html"))
        optuna.visualization.plot_slice(study_xgb).write_html(os.path.join(OPTUNA_PLOTS_DIR, "optuna_xgb_slice.html"))
else: best_params_xgb = {'n_estimators': 1500, 'learning_rate': 0.01} # Basic default

oof_xgb = np.zeros(len(X_processed_df)); test_preds_xgb = np.zeros(len(X_test_processed_df)); xgb_models = []
final_xgb_params = {**{'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'random_state': RANDOM_STATE, 'n_jobs': -1, 'tree_method': 'hist'}, **best_params_xgb}
for fold, (train_idx, val_idx) in enumerate(kf.split(X_processed_df, y_target)):
    # XGBoost >= 2.0 takes early_stopping_rounds in the constructor, not in fit().
    model = xgb.XGBRegressor(**final_xgb_params, early_stopping_rounds=100)
    model.fit(X_processed_df.iloc[train_idx], y_target.iloc[train_idx],
              eval_set=[(X_processed_df.iloc[val_idx], y_target.iloc[val_idx])],
              verbose=False)
    oof_xgb[val_idx] = model.predict(X_processed_df.iloc[val_idx])
    test_preds_xgb += model.predict(X_test_processed_df) / N_SPLITS
    xgb_models.append(model)
xgb_cv_score = custom_score_func(y_target, oof_xgb)
print(f"XGBoost CV Custom Score: {xgb_cv_score:.4f}")

# --- CatBoost ---
oof_cat = np.zeros(len(X_processed_df)); test_preds_cat = np.zeros(len(X_test_processed_df)); cat_models = []
catboost_cv_score = -np.inf
if OPTUNA_TRIALS_CATBOOST > 0:
    def objective_catboost(trial, X_data, y_data):
        params = {
            'objective': 'RMSE', 'eval_metric': 'RMSE', 'random_seed': RANDOM_STATE,
            'iterations': trial.suggest_int('iterations', 300, 2500),
            'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.05, log=True),
            'depth': trial.suggest_int('depth', 4, 10), 'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 10.0, log=True),
            'border_count': trial.suggest_int('border_count', 32, 255), 'verbose': 0
        }
        fold_rmses = []
        # Note: If using CatBoost with original categorical features, pass cat_features_indices
        # Here, data is OHE, so no cat_features needed for CatBoostRegressor
        for train_idx, val_idx in kf.split(X_data, y_data):
            model = cb.CatBoostRegressor(**params)
            model.fit(X_data.iloc[train_idx], y_data.iloc[train_idx],
                      eval_set=[(X_data.iloc[val_idx], y_data.iloc[val_idx])],
                      early_stopping_rounds=100, verbose=0)
            preds_val = model.predict(X_data.iloc[val_idx])
            fold_rmses.append(np.sqrt(mean_squared_error(y_data.iloc[val_idx], preds_val)))
        return np.mean(fold_rmses)

    best_params_cat = {}
    if RUN_OPTUNA:
        study_cat = optuna.create_study(direction='minimize', study_name="CatBoost_Opt")
        study_cat.optimize(lambda trial: objective_catboost(trial, X_processed_df, y_target), n_trials=OPTUNA_TRIALS_CATBOOST, n_jobs=-1)
        best_params_cat = study_cat.best_params
        print("Best CatBoost Params:", best_params_cat)
        if OPTUNA_PLOTS_DIR:
            optuna.visualization.plot_optimization_history(study_cat).write_html(os.path.join(OPTUNA_PLOTS_DIR, "optuna_catboost_history.html"))
            optuna.visualization.plot_slice(study_cat).write_html(os.path.join(OPTUNA_PLOTS_DIR, "optuna_catboost_slice.html"))
    else: best_params_cat = {'iterations': 1500, 'learning_rate': 0.01}

    final_cat_params = {**{'objective': 'RMSE', 'eval_metric': 'RMSE', 'random_seed': RANDOM_STATE, 'verbose': 0}, **best_params_cat}
    for fold, (train_idx, val_idx) in enumerate(kf.split(X_processed_df, y_target)):
        model = cb.CatBoostRegressor(**final_cat_params)
        model.fit(X_processed_df.iloc[train_idx], y_target.iloc[train_idx],
                  eval_set=[(X_processed_df.iloc[val_idx], y_target.iloc[val_idx])],
                  early_stopping_rounds=100, verbose=0)
        oof_cat[val_idx] = model.predict(X_processed_df.iloc[val_idx])
        test_preds_cat += model.predict(X_test_processed_df) / N_SPLITS
        cat_models.append(model)
    catboost_cv_score = custom_score_func(y_target, oof_cat)
    print(f"CatBoost CV Custom Score: {catboost_cv_score:.4f}")


--- 6. Hyperparameter Optimization & Training ---


In [None]:
# --- 7. Explainable AI (XAI with SHAP) ---
print("\n--- 7. Explainable AI (XAI with SHAP) ---")
X_shap_sample_df = X_processed_df.sample(min(SHAP_SAMPLE_SIZE, len(X_processed_df)), random_state=RANDOM_STATE)
shap_plot_data = {}

if lgbm_models:
    try:
        explainer = shap.TreeExplainer(lgbm_models[0]) # Explain first fold model
        shap_values = explainer.shap_values(X_shap_sample_df)
        shap_plot_data['lgbm'] = (shap_values, X_shap_sample_df, explainer.expected_value)
    except Exception as e: print(f"SHAP for LGBM failed: {e}")
if xgb_models:
    try:
        explainer = shap.TreeExplainer(xgb_models[0])
        shap_values = explainer.shap_values(X_shap_sample_df)
        shap_plot_data['xgb'] = (shap_values, X_shap_sample_df, explainer.expected_value)
    except Exception as e: print(f"SHAP for XGBoost failed: {e}")
if cat_models:
    try:
        explainer = shap.TreeExplainer(cat_models[0])
        shap_values = explainer.shap_values(X_shap_sample_df)
        shap_plot_data['cat'] = (shap_values, X_shap_sample_df, explainer.expected_value)
    except Exception as e: print(f"SHAP for CatBoost failed: {e}")

for model_name, (s_values, s_df, exp_val) in shap_plot_data.items():
    print(f"Generating SHAP plots for {model_name.upper()}...")
    plt.figure(); shap.summary_plot(s_values, s_df, plot_type="bar", show=False); plt.title(f"SHAP Global Importance ({model_name.upper()})")
    plt.tight_layout(); plt.savefig(os.path.join(SHAP_PLOTS_DIR, f"shap_{model_name}_global_bar.png")); plt.close()

    plt.figure(); shap.summary_plot(s_values, s_df, show=False); plt.title(f"SHAP Summary Plot ({model_name.upper()})")
    plt.tight_layout(); plt.savefig(os.path.join(SHAP_PLOTS_DIR, f"shap_{model_name}_summary_beeswarm.png")); plt.close()

    shap.initjs()
    for i in range(min(3, len(s_df))):
        try: # Force plot saving can be tricky
            plt.figure() # Create a new figure for each force plot
            shap.force_plot(exp_val, s_values[i,:], s_df.iloc[i,:], matplotlib=True, show=False)
            # plt.title(f"SHAP Force Plot ({model_name.upper()}) - Sample {i}") # Title often overlaps
            plt.savefig(os.path.join(SHAP_PLOTS_DIR, f"shap_{model_name}_force_plot_sample_{i}.png"), bbox_inches='tight', dpi=150)
            plt.close()
        except Exception as e_force:
            print(f"Could not save force plot for {model_name} sample {i}: {e_force}")

print(f"SHAP plots saved to {SHAP_PLOTS_DIR}")

In [None]:
# --- 8. Sophisticated Ensembling ---
print("\n--- 8. Sophisticated Ensembling ---")
oof_models_dict = {'lgbm': oof_lgbm, 'xgb': oof_xgb}
test_preds_models_dict = {'lgbm': test_preds_lgbm, 'xgb': test_preds_xgb}
if OPTUNA_TRIALS_CATBOOST > 0 and catboost_cv_score > -np.inf:
    oof_models_dict['cat'] = oof_cat
    test_preds_models_dict['cat'] = test_preds_cat

def blend_objective_multi(trial, oofs_dict, target):
    weights = [trial.suggest_float(f"w_{name}", 0, 1) for name in oofs_dict.keys()]
    total_weight = sum(weights)
    if total_weight == 0: return 1e6
    normalized_weights = [w / total_weight for w in weights]
    blended_oof = sum(norm_w * oofs_dict[name] for norm_w, name in zip(normalized_weights, oofs_dict.keys()))
    return np.sqrt(mean_squared_error(target, blended_oof))

study_blend = optuna.create_study(direction='minimize', study_name="Blend_Opt_Multi")
study_blend.optimize(lambda trial: blend_objective_multi(trial, oof_models_dict, y_target), n_trials=50)
best_blend_weights_raw = study_blend.best_params
total_raw_blend_weight = sum(best_blend_weights_raw.values())
optimized_blend_weights = {name: best_blend_weights_raw[f"w_{name}"] / total_raw_blend_weight for name in oof_models_dict.keys()}

print("Optimized Blend Weights:")
for name, weight in optimized_blend_weights.items():
    print(f"  {name.upper()}: {weight:.4f}")
blended_test_preds = sum(w * test_preds_models_dict[name] for name, w in optimized_blend_weights.items())
blended_oof_final = sum(w * oof_models_dict[name] for name, w in optimized_blend_weights.items())
blended_cv_score = custom_score_func(y_target, blended_oof_final)
print(f"Blended OOF Custom Score: {blended_cv_score:.4f}")
final_test_predictions = blended_test_preds

fig_oof_blend = px.scatter(x=y_target, y=blended_oof_final, labels={'x': 'Actual', 'y': 'Blended OOF Predicted'},
                           title='Blended OOF vs. Actual', opacity=0.5, marginal_y='histogram', marginal_x='histogram')
fig_oof_blend.add_shape(type="line", x0=y_target.min(), y0=y_target.min(), x1=y_target.max(), y1=y_target.max(), line=dict(color="Red", dash="dash"))
if BASE_DIR: fig_oof_blend.write_html(os.path.join(BASE_DIR, "plotly_blended_oof_vs_actual.html"))

In [None]:
# --- 9. Prediction & Submission ---
print("\n--- 9. Prediction & Submission ---")
# Optional: Clip predictions
# final_test_predictions = np.clip(final_test_predictions, y_target.min() * 0.9, y_target.max() * 1.1)
# final_test_predictions = np.clip(final_test_predictions, 0.0, 1.0)

submission_df = pd.DataFrame({'id': test_ids, TARGET: final_test_predictions})
submission_df.columns = ['id', 'efficiency']
submission_filename = os.path.join(BASE_DIR, "submission_full_v3.csv")
submission_df.to_csv(submission_filename, index=False)
print(f"Submission file '{submission_filename}' created successfully.")

print("\n--- Final CV Scores Summary ---")
print(f"LGBM CV Custom Score: {lgbm_cv_score:.4f}")
print(f"XGBoost CV Custom Score: {xgb_cv_score:.4f}")
if OPTUNA_TRIALS_CATBOOST > 0 and catboost_cv_score > -np.inf: print(f"CatBoost CV Custom Score: {catboost_cv_score:.4f}")
print(f"Blended OOF Custom Score: {blended_cv_score:.4f}")
print(f"\nAll outputs saved in '{BASE_DIR}' directory.")