<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Solar_Panel_Performance_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ☀️ Solar Panel Performance Optimization Challenge ☀️

**Predicting Degradation and Failures for a Sustainable Future**

---

## 🎯 1. The Challenge: Maximizing Solar Energy Output

Solar energy is a cornerstone of sustainable power. However, the efficiency of Photovoltaic (PV) panels can degrade over time or due to unforeseen failures. Traditional maintenance is often reactive, leading to:

*   📉 **Energy Loss:** Suboptimal performance means less clean energy generated.
*   💰 **Increased Costs:** Reactive repairs and downtime are expensive.

**Our Mission:** To develop a sophisticated Machine Learning model that predicts `efficiency` (our target variable), enabling **predictive maintenance**. This proactive approach will help maintain peak performance and reduce operational interruptions.

---

## 📊 2. Understanding Our Data

We're provided with a rich dataset containing sensor readings and panel characteristics.

*   **`train.csv`**: The training ground for our model (20,000 samples, 17 features including `efficiency`).
*   **`test.csv`**: The unseen data where we'll make our predictions (12,000 samples, 16 features).
*   **`sample_submission.csv`**: The blueprint for our final submission file.

### Key Data Features at a Glance:

| Feature Category    | Column Examples                                 | Description                                                                    |
| :------------------ | :---------------------------------------------- | :----------------------------------------------------------------------------- |
| 🆔 **Identifiers**   | `id`, `string_id`                               | Unique row and panel group identifiers.                                        |
| 🌡️ **Environmental** | `temperature`, `irradiance`, `humidity`, `cloud_coverage`, `wind_speed`, `pressure` | Ambient conditions influencing panel operation.                |
| 🛠️ **Panel Specifics**| `panel_age`, `maintenance_count`, `soiling_ratio`, `module_temperature`, `error_code`, `installation_type` | Panel history, condition, and setup.                               |
| ⚡ **Electrical**   | `voltage`, `current`                            | Measured electrical output.                                                    |
| 🏆 **Target**        | **`efficiency`**                                | **The crucial variable we need to predict!** (0.0 - 1.0 scale)             |

*A detailed description of each column is available in the problem statement.*

---

## 🚀 3. Our Game Plan: Building a Winning Model

We'll follow a structured, iterative approach to tackle this prediction task:

1.  **⚙️ Initial Setup & Environment Configuration:**
    *   Importing essential Python libraries (Pandas, NumPy, Scikit-learn, LightGBM, XGBoost, Plotly, Optuna).
    *   Loading the datasets into our workspace.

2.  **🔍 Exploratory Data Analysis (EDA) - Unveiling Insights:**
    *   Deep dive into data distributions, missing values, and potential outliers.
    *   Visualizing feature relationships and their correlation with `efficiency` using:
        *   **Matplotlib & Seaborn:** For static, foundational plots.
        *   **Plotly:** For dynamic, interactive visualizations to uncover subtle patterns.

3.  **✨ Feature Engineering - Crafting Predictive Power:**
    *   Creating new, informative features from existing ones (e.g., interaction terms like `temperature * irradiance`, ratios like `module_temperature - temperature`). The goal is to provide the model with richer signals.

4.  **🧹 Data Preprocessing - Preparing for Modeling:**
    *   **Missing Value Imputation:** Strategically filling in any data gaps.
    *   **Categorical Encoding:** Transforming text-based features (like `string_id`, `error_code`) into a numerical format (One-Hot Encoding).
    *   **Feature Scaling:** Normalizing numerical features (`StandardScaler`) to ensure fair contribution from all variables.

5.  **🧠 Model Building & Cross-Validation - The Core Engine:**
    *   **Algorithm Selection:** Focusing on state-of-the-art gradient boosting models:
        *   **LightGBM (LGBM):** Known for speed and efficiency.
        *   **XGBoost:** A robust and widely-used powerhouse.
    *   **K-Fold Cross-Validation:** Training and evaluating models on different subsets of the data to ensure robustness and get a reliable performance estimate. This helps prevent overfitting.

6.  **🛠️ Hyperparameter Optimization - Fine-Tuning for Excellence:**
    *   Leveraging **Optuna**, an automated hyperparameter optimization framework. Optuna will intelligently search for the best set of model settings (e.g., learning rate, tree depth) to maximize our chosen metric.

7.  **🤝 Model Ensembling - The Power of Collaboration:**
    *   **Blending:** Combining the predictions from our fine-tuned LGBM and XGBoost models. The idea is that different models capture different aspects of the data, and their combined wisdom is often superior to any single model. We'll optimize the blending weights.

8.  **📜 Prediction & Submission - Delivering Results:**
    *   Applying our final, ensembled model to the `test.csv` data.
    *   Generating the `submission.csv` file in the specified format.

---

## 📈 4. Measuring Success: The Evaluation Metric

Our model's prowess will be judged by a custom scoring formula:

**Score = 100 \* (1 - RMSE)**

Where `RMSE` (Root Mean Squared Error) is calculated as:
`RMSE = sqrt(mean_squared_error(actual_efficiency, predicted_efficiency))`

**A higher score indicates a more accurate model.** Our goal is to maximize this score!

---

## 🏁 Let's Begin the Journey!

The code cells below will bring this plan to life. We'll document each step, share our findings, and strive for the best possible prediction model.

In [1]:
!pip install pandas numpy matplotlib seaborn plotly scikit-learn lightgbm xgboost optuna shap kaleido -q

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m386.6/386.6 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.9/79.9 MB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m242.5/242.5 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly imports
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import mean_squared_error

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

import shap # For XAI

import os
import warnings
warnings.filterwarnings('ignore')

In [3]:
# --- Configuration ---
RUN_OPTUNA = True  # Set to False to skip Optuna and use default/tuned params
OPTUNA_TRIALS_LGBM = 30 # Number of Optuna trials for LightGBM (can be increased for more thorough search)
OPTUNA_TRIALS_XGB = 30  # Number of Optuna trials for XGBoost
N_SPLITS = 5 # Number of K-Fold splits
RANDOM_STATE = 42

In [4]:
# --- Directory Setup for Visualizations ---
BASE_DIR = "solar_panel_analysis"
EDA_PLOTS_DIR = os.path.join(BASE_DIR, "eda_plotly_plots")
OPTUNA_PLOTS_DIR = os.path.join(BASE_DIR, "optuna_plots")
SHAP_PLOTS_DIR = os.path.join(BASE_DIR, "shap_plots")

for D in [BASE_DIR, EDA_PLOTS_DIR, OPTUNA_PLOTS_DIR, SHAP_PLOTS_DIR]:
    if not os.path.exists(D):
        os.makedirs(D)

print(f"Plots will be saved in '{BASE_DIR}' subdirectories.")

Plots will be saved in 'solar_panel_analysis' subdirectories.


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
# --- 1. Setup & Data Loading ---
print("\n--- 1. Setup & Data Loading ---")
train_df_orig = pd.read_csv("/content/drive/MyDrive/zelestra_data/train.csv")
test_df_orig = pd.read_csv("/content/drive/MyDrive/zelestra_data/test.csv")
sample_submission_df = pd.read_csv("/content/drive/MyDrive/zelestra_data/sample_submission.csv")

print(f"Train data shape: {train_df_orig.shape}")
print(f"Test data shape: {test_df_orig.shape}")


--- 1. Setup & Data Loading ---
Train data shape: (20000, 17)
Test data shape: (12000, 16)


In [13]:
train_df_orig.head()

Unnamed: 0,id,temperature,irradiance,humidity,panel_age,maintenance_count,soiling_ratio,voltage,current,module_temperature,cloud_coverage,wind_speed,pressure,string_id,error_code,installation_type,efficiency
0,0,7.817315,576.17927,41.24308670850264,32.135501,4.0,0.803199,37.403527,1.963787,13.691147,62.494044,12.82491203459621,1018.8665053152532,A1,,,0.562096
1,1,24.785727,240.003973,1.3596482765960705,19.97746,8.0,0.479456,21.843315,0.241473,27.545096,43.851238,12.012043660984917,1025.6238537572883,D4,E00,dual-axis,0.396447
2,2,46.652695,687.612799,91.26536837560256,1.496401,4.0,0.822398,48.222882,4.1918,43.363708,,1.814399755560454,1010.9226539809572,C3,E00,,0.573776
3,3,53.339567,735.141179,96.1909552117616,18.491582,3.0,0.837529,46.295748,0.960567,57.720436,67.361473,8.736258932034128,1021.8466633134252,A1,,dual-axis,0.629009
4,4,5.575374,12.241203,27.495073003585222,30.722697,6.0,0.551833,0.0,0.898062,6.786263,3.632,0.52268384077164,1008.5559577591928,B2,E00,fixed,0.341874


In [14]:
train_df_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  20000 non-null  int64  
 1   temperature         18999 non-null  float64
 2   irradiance          19013 non-null  float64
 3   humidity            20000 non-null  object 
 4   panel_age           18989 non-null  float64
 5   maintenance_count   18973 non-null  float64
 6   soiling_ratio       18990 non-null  float64
 7   voltage             19007 non-null  float64
 8   current             19023 non-null  float64
 9   module_temperature  19022 non-null  float64
 10  cloud_coverage      18990 non-null  float64
 11  wind_speed          20000 non-null  object 
 12  pressure            20000 non-null  object 
 13  string_id           20000 non-null  object 
 14  error_code          14088 non-null  object 
 15  installation_type   14972 non-null  object 
 16  effi

In [10]:
# --- 2. In-Depth EDA (with Plotly) ---
print("\n--- 2. In-Depth EDA ---")
train_eda = train_df_orig.copy()
test_eda = test_df_orig.copy() # For comparing distributions

TARGET = 'efficiency'
y_eda = train_eda[TARGET]

numerical_features_orig = train_eda.select_dtypes(include=np.number).columns.tolist()
if TARGET in numerical_features_orig: numerical_features_orig.remove(TARGET)
if 'id' in numerical_features_orig: numerical_features_orig.remove('id')

categorical_features_orig = train_eda.select_dtypes(include='object').columns.tolist()
if 'id' in categorical_features_orig: categorical_features_orig.remove('id')

# --- EDA Plotting Functions (using Plotly) ---
def plot_target_distribution(df, target_col, save_dir):
    fig = px.histogram(df, x=target_col, nbins=50, title=f'Distribution of Target ({target_col})',
                       marginal="box", color_discrete_sequence=['#636EFA'])
    fig.update_layout(bargap=0.1)
    fig.write_html(os.path.join(save_dir, "plotly_target_distribution.html"))
    # fig.show() # Uncomment for interactive view in Colab

def plot_numerical_distributions_train_test(train_df, test_df, num_cols, save_dir):
    for col in num_cols:
        fig = go.Figure()
        fig.add_trace(go.Histogram(x=train_df[col], name='Train', nbinsx=40, marker_color='#EF553B', opacity=0.75))
        fig.add_trace(go.Histogram(x=test_df[col], name='Test', nbinsx=40, marker_color='#00CC96', opacity=0.75))
        fig.update_layout(barmode='overlay', title_text=f'Distribution of {col} (Train vs Test)')
        fig.update_traces(opacity=0.7)
        fig.write_html(os.path.join(save_dir, f"plotly_dist_{col}_train_test.html"))
        # fig.show()

def plot_correlation_heatmap(df, num_cols, target_col, save_dir):
    correlation_matrix = df[num_cols + [target_col]].corr()
    fig = px.imshow(correlation_matrix, text_auto=".2f", aspect="auto",
                    color_continuous_scale='RdBu_r', title='Correlation Matrix (Numerical Features & Target)')
    fig.write_html(os.path.join(save_dir, "plotly_correlation_heatmap.html"))
    # fig.show()

def plot_categorical_vs_target(df, cat_cols, target_col, save_dir):
    for col in cat_cols:
        top_n = df[col].nunique()
        if top_n > 15: # Limit for very high cardinality features
            top_categories = df[col].value_counts().nlargest(15).index
            df_filtered = df[df[col].isin(top_categories)]
            title_suffix = " (Top 15 Categories)"
        else:
            df_filtered = df
            title_suffix = ""

        fig = px.box(df_filtered, x=col, y=target_col,
                     title=f'{target_col} vs {col}{title_suffix}',
                     color=col, color_discrete_sequence=px.colors.qualitative.Plotly)
        fig.write_html(os.path.join(save_dir, f"plotly_boxplot_{target_col}_vs_{col}.html"))
        # fig.show()

# Generate EDA plots
print("Generating EDA plots...")
plot_target_distribution(train_eda, TARGET, EDA_PLOTS_DIR)
plot_numerical_distributions_train_test(train_eda, test_eda, numerical_features_orig, EDA_PLOTS_DIR)
plot_correlation_heatmap(train_eda, numerical_features_orig, TARGET, EDA_PLOTS_DIR)
plot_categorical_vs_target(train_eda, categorical_features_orig, TARGET, EDA_PLOTS_DIR)
print(f"EDA plots saved to {EDA_PLOTS_DIR}")


--- 2. In-Depth EDA ---
Generating EDA plots...
EDA plots saved to solar_panel_analysis/eda_plotly_plots


In [11]:
# --- 3. Strategic Feature Engineering ---
print("\n--- 3. Strategic Feature Engineering ---")
def feature_engineer(df):
    df_fe = df.copy()
    # Interaction Features
    df_fe['temp_x_irradiance'] = df_fe['temperature'] * df_fe['irradiance']
    df_fe['voltage_x_current'] = df_fe['voltage'] * df_fe['current'] # Proxy for Power
    df_fe['age_x_maintenance'] = df_fe['panel_age'] * (df_fe['maintenance_count'] + 1e-6) # Add epsilon for 0 maintenance
    df_fe['irradiance_eff_soiling'] = df_fe['irradiance'] * df_fe['soiling_ratio']

    # Ratio/Difference Features
    df_fe['temp_humidity_ratio'] = df_fe['temperature'] / (df_fe['humidity'] + 1e-6)
    df_fe['temp_diff_module_ambient'] = df_fe['module_temperature'] - df_fe['temperature']
    df_fe['irradiance_per_cloud'] = df_fe['irradiance'] / (df_fe['cloud_coverage'] + 1) # +1 to avoid div by zero

    # Polynomials (use with caution, can lead to overfitting - keep simple for now)
    for col in ['irradiance', 'temperature', 'module_temperature']:
         if col in df_fe.columns:
            df_fe[f'{col}_sq'] = df_fe[col] ** 2

    # Time-based (if applicable - panel_age is already there)
    # Cyclical features if datetime components were present (e.g., hour of day)

    return df_fe

train_df = feature_engineer(train_df_orig.copy())
test_df = feature_engineer(test_df_orig.copy())

print(f"Train shape after FE: {train_df.shape}")
print(f"Test shape after FE: {test_df.shape}")

# Prepare data for modeling
X = train_df.drop([TARGET, 'id'], axis=1)
y_target = train_df[TARGET]
X_test_full = test_df.drop('id', axis=1) # Keep 'id' for submission
test_ids = test_df['id']

# Align columns - crucial if FE creates columns present in train but not test or vice-versa
common_cols = list(set(X.columns) & set(X_test_full.columns))
X = X[common_cols]
X_test = X_test_full[common_cols]

print(f"Aligned X shape: {X.shape}, Aligned X_test shape: {X_test.shape}")

# Update feature lists after FE
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include='object').columns.tolist()


--- 3. Strategic Feature Engineering ---


TypeError: can only concatenate str (not "float") to str

In [12]:
# --- 4. Robust Data Preprocessing ---
print("\n--- 4. Robust Data Preprocessing ---")
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
], remainder='passthrough', n_jobs=-1)

X_processed = preprocessor.fit_transform(X)
X_test_processed = preprocessor.transform(X_test)

# Get feature names after OHE for models that can use them (and for XAI)
try:
    ohe_feature_names = preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
    all_feature_names = numerical_features + list(ohe_feature_names)
    X_processed_df = pd.DataFrame(X_processed, columns=all_feature_names)
    X_test_processed_df = pd.DataFrame(X_test_processed, columns=all_feature_names)
    print(f"Total features after preprocessing: {X_processed_df.shape[1]}")
except Exception as e: # Fallback if get_feature_names_out is not available or fails
    print(f"Could not get feature names from OHE. Error: {e}")
    X_processed_df = pd.DataFrame(X_processed)
    X_test_processed_df = pd.DataFrame(X_test_processed)


--- 4. Robust Data Preprocessing ---


NameError: name 'numerical_features' is not defined