<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Solar_Panel_Performance_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ☀️ Solar Panel Performance Optimization Challenge ☀️

**Predicting Degradation and Failures for a Sustainable Future**

---

## 🎯 1. The Challenge: Maximizing Solar Energy Output

Solar energy is a cornerstone of sustainable power. However, the efficiency of Photovoltaic (PV) panels can degrade gradually over time or drop suddenly because of unforeseen failures. Traditional maintenance is often reactive, leading to:

*   📉 **Energy Loss:** Suboptimal performance means less clean energy generated.
*   💰 **Increased Costs:** Reactive repairs and downtime are expensive.

**Our Mission:** To develop a sophisticated Machine Learning model that predicts `efficiency` (our target variable), enabling **predictive maintenance**. This proactive approach will help maintain peak performance and reduce operational interruptions.

---

## 📊 2. Understanding Our Data

We're provided with a rich dataset containing sensor readings and panel characteristics.

*   **`train.csv`**: The training ground for our model (20,000 samples, 17 features including `efficiency`).
*   **`test.csv`**: The unseen data where we'll make our predictions (12,000 samples, 16 features).
*   **`sample_submission.csv`**: The blueprint for our final submission file.

### Key Data Features at a Glance:

| Feature Category    | Column Examples                                 | Description                                                                    |
| :------------------ | :---------------------------------------------- | :----------------------------------------------------------------------------- |
| 🆔 **Identifiers**   | `id`, `string_id`                               | Unique row and panel group identifiers.                                        |
| 🌡️ **Environmental** | `temperature`, `irradiance`, `humidity`, `cloud_coverage`, `wind_speed`, `pressure` | Ambient conditions influencing panel operation.                |
| 🛠️ **Panel Specifics**| `panel_age`, `maintenance_count`, `soiling_ratio`, `module_temperature`, `error_code`, `installation_type` | Panel history, condition, and setup.                               |
| ⚡ **Electrical**   | `voltage`, `current`                            | Measured electrical output.                                                    |
| 🏆 **Target**        | **`efficiency`**                                | **The crucial variable we need to predict!** (0.0 - 1.0 scale)             |

*A detailed description of each column is available in the problem statement.*
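
Before any modeling, it can help to confirm that a loaded frame matches the layout described above. The sketch below is a minimal, hypothetical helper (`check_schema` is not part of the notebook). It assumes the column names in the table are the complete schema, which is consistent with the stated 17 train columns but is not confirmed by the problem statement.

```python
import pandas as pd

# Column names copied from the feature table above. Treating them as the
# complete schema is an assumption (the table labels them "examples").
FEATURE_COLUMNS = [
    "id", "string_id",
    "temperature", "irradiance", "humidity", "cloud_coverage", "wind_speed", "pressure",
    "panel_age", "maintenance_count", "soiling_ratio", "module_temperature",
    "error_code", "installation_type",
    "voltage", "current",
]
TARGET_COLUMN = "efficiency"


def check_schema(df: pd.DataFrame, is_train: bool) -> dict:
    """Report missing and unexpected columns, plus NaN counts per column."""
    expected = FEATURE_COLUMNS + ([TARGET_COLUMN] if is_train else [])
    return {
        "missing": sorted(set(expected) - set(df.columns)),
        "unexpected": sorted(set(df.columns) - set(expected)),
        "nan_counts": df.isna().sum().to_dict(),
    }


# Tiny made-up frame, only to show the call; not real competition data.
demo = pd.DataFrame({"id": [0, 1], "temperature": [25.0, None], "efficiency": [0.5, 0.4]})
report = check_schema(demo, is_train=True)
print(report["missing"])
```

Running the same check on the test frame with `is_train=False` would flag `efficiency` as unexpected if it leaked into the test file.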

---

## 🚀 3. Our Game Plan: Building a Winning Model

We'll follow a structured, iterative approach to tackle this prediction task:

1.  **⚙️ Initial Setup & Environment Configuration:**
    *   Importing essential Python libraries (Pandas, NumPy, Matplotlib, Seaborn, Plotly, Scikit-learn, LightGBM, XGBoost, Optuna, SHAP).
    *   Loading the datasets into our workspace.

2.  **🔍 Exploratory Data Analysis (EDA) - Unveiling Insights:**
    *   Deep dive into data distributions, missing values, and potential outliers.
    *   Visualizing feature relationships and their correlation with `efficiency` using:
        *   **Matplotlib & Seaborn:** For static, foundational plots.
        *   **Plotly:** For dynamic, interactive visualizations to uncover subtle patterns.

3.  **✨ Feature Engineering - Crafting Predictive Power:**
    *   Creating new, informative features from existing ones (e.g., interaction terms like `temperature * irradiance`, differences like `module_temperature - temperature`). The goal is to give the model richer signals.

4.  **🧹 Data Preprocessing - Preparing for Modeling:**
    *   **Missing Value Imputation:** Strategically filling in any data gaps.
    *   **Categorical Encoding:** Transforming text-based features (like `string_id`, `error_code`) into a numerical format (One-Hot Encoding).
    *   **Feature Scaling:** Normalizing numerical features (`StandardScaler`) to ensure fair contribution from all variables.

5.  **🧠 Model Building & Cross-Validation - The Core Engine:**
    *   **Algorithm Selection:** Focusing on state-of-the-art gradient boosting models:
        *   **LightGBM (LGBM):** Known for speed and efficiency.
        *   **XGBoost:** A robust and widely-used powerhouse.
    *   **K-Fold Cross-Validation:** Training and evaluating models on different subsets of the data to get a reliable performance estimate. Because every sample is scored by a model that never saw it during training, this helps detect overfitting.

6.  **🛠️ Hyperparameter Optimization - Fine-Tuning for Excellence:**
    *   Leveraging **Optuna**, an automated hyperparameter optimization framework. Optuna will intelligently search for the best set of model settings (e.g., learning rate, tree depth) to maximize our chosen metric.

7.  **🤝 Model Ensembling - The Power of Collaboration:**
    *   **Blending:** Combining the predictions from our fine-tuned LGBM and XGBoost models. The idea is that different models capture different aspects of the data, and their combined wisdom is often superior to any single model. We'll optimize the blending weights.

8.  **📜 Prediction & Submission - Delivering Results:**
    *   Applying our final, ensembled model to the `test.csv` data.
    *   Generating the `submission.csv` file in the specified format.
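
Steps 3 to 5 of the plan can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the notebook's actual pipeline: the data is made up, the helper `add_features` and the new column names are hypothetical, and scikit-learn's `GradientBoostingRegressor` stands in for LightGBM/XGBoost to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: add the two example features named in the plan."""
    out = df.copy()
    out["temp_x_irradiance"] = out["temperature"] * out["irradiance"]
    out["module_temp_delta"] = out["module_temperature"] - out["temperature"]
    return out


# Synthetic stand-in data (made up, not the competition files).
rng = np.random.default_rng(42)
n = 300
raw = pd.DataFrame({
    "temperature": rng.normal(25, 5, n),
    "irradiance": rng.uniform(100, 1000, n),
    "module_temperature": rng.normal(35, 6, n),
    "error_code": rng.choice(["E00", "E01", "E02"], n),
})
raw.loc[rng.choice(n, 20, replace=False), "temperature"] = np.nan  # simulate gaps
y = np.clip(0.3 + 0.0004 * raw["irradiance"] + rng.normal(0, 0.02, n), 0, 1).to_numpy()

X = add_features(raw)
numeric_cols = ["temperature", "irradiance", "module_temperature",
                "temp_x_irradiance", "module_temp_delta"]
categorical_cols = ["error_code"]

# Step 4: impute and scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Step 5: preprocessing lives inside the pipeline, so it is refit per fold.
model = Pipeline([
    ("prep", preprocess),
    ("reg", GradientBoostingRegressor(n_estimators=50, random_state=42)),
])

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, valid_idx in kf.split(X):
    model.fit(X.iloc[train_idx], y[train_idx])
    pred = model.predict(X.iloc[valid_idx])
    fold_rmse.append(float(np.sqrt(mean_squared_error(y[valid_idx], pred))))

print(f"Mean CV RMSE: {np.mean(fold_rmse):.4f}")
```

Keeping the imputer, scaler, and encoder inside the `Pipeline` means their statistics are learned from the training folds only, which avoids leaking validation data into preprocessing.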

---

## 📈 4. Measuring Success: The Evaluation Metric

Our model's prowess will be judged by a custom scoring formula:

**Score = 100 \* (1 - RMSE)**

Where `RMSE` (Root Mean Squared Error) is calculated as:
`RMSE = sqrt(mean_squared_error(actual_efficiency, predicted_efficiency))`

**A higher score indicates a more accurate model.** Our goal is to maximize this score!
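
The scoring formula is easy to reproduce locally, and the same function can drive the blending-weight search from step 7. The sketch below uses NumPy only. The function names and the sample arrays are hypothetical, and the simple grid search is one possible way to pick a weight, not necessarily the one used later in the notebook.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def competition_score(y_true, y_pred) -> float:
    """Score = 100 * (1 - RMSE), as defined above."""
    return 100.0 * (1.0 - rmse(y_true, y_pred))


def best_blend_weight(y_true, pred_a, pred_b, steps: int = 101):
    """Grid-search w in [0, 1] for the blend w * pred_a + (1 - w) * pred_b."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    weights = np.linspace(0.0, 1.0, steps)
    scores = [competition_score(y_true, w * pred_a + (1 - w) * pred_b) for w in weights]
    best = int(np.argmax(scores))
    return float(weights[best]), float(scores[best])


# Made-up numbers, only to show the calls.
y_true = np.array([0.50, 0.40, 0.60, 0.55])
pred_lgbm = np.array([0.52, 0.41, 0.57, 0.56])
pred_xgb = np.array([0.47, 0.40, 0.62, 0.53])

w, score = best_blend_weight(y_true, pred_lgbm, pred_xgb)
print(f"weight on first model: {w:.2f}, blended score: {score:.3f}")
```

In practice the weight should be chosen on out-of-fold predictions rather than on data the models were trained on, otherwise the blended score is optimistic.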

---

## 🏁 Let's Begin the Journey!

The code cells below will bring this plan to life. We'll document each step, share our findings, and strive for the best possible prediction model.

In [1]:
!pip install pandas numpy matplotlib seaborn plotly scikit-learn lightgbm xgboost optuna shap kaleido -q


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly imports
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import mean_squared_error

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

import shap # For XAI

import os
import warnings
warnings.filterwarnings('ignore')

In [5]:
# --- Configuration ---
RUN_OPTUNA = True  # Set to False to skip Optuna and use default/tuned params
OPTUNA_TRIALS_LGBM = 30 # Number of Optuna trials for LightGBM (can be increased for more thorough search)
OPTUNA_TRIALS_XGB = 30  # Number of Optuna trials for XGBoost
N_SPLITS = 5 # Number of K-Fold splits
RANDOM_STATE = 42

In [6]:
# --- Directory Setup for Visualizations ---
BASE_DIR = "solar_panel_analysis"
EDA_PLOTS_DIR = os.path.join(BASE_DIR, "eda_plotly_plots")
OPTUNA_PLOTS_DIR = os.path.join(BASE_DIR, "optuna_plots")
SHAP_PLOTS_DIR = os.path.join(BASE_DIR, "shap_plots")

for D in [BASE_DIR, EDA_PLOTS_DIR, OPTUNA_PLOTS_DIR, SHAP_PLOTS_DIR]:
    os.makedirs(D, exist_ok=True)  # no-op if the directory already exists

print(f"Plots will be saved in '{BASE_DIR}' subdirectories.")

Plots will be saved in 'solar_panel_analysis' subdirectories.


In [9]:
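from google.colab import drive
drive.mount('/content/drive')

# --- 1. Setup & Data Loading ---
print("\n--- 1. Setup & Data Loading ---")
# Mounting Drive does not change the working directory, so bare file names
# are resolved against the current directory (/content on Colab).
# DATA_DIR = "" keeps that behavior; point it at the folder that actually
# holds the CSVs (for example, a folder under /content/drive/MyDrive).
DATA_DIR = ""
train_df_orig = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
test_df_orig = pd.read_csv(os.path.join(DATA_DIR, "test.csv"))
sample_submission_df = pd.read_csv(os.path.join(DATA_DIR, "sample_submission.csv"))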

print(f"Train data shape: {train_df_orig.shape}")
print(f"Test data shape: {test_df_orig.shape}")

Mounted at /content/drive

--- 1. Setup & Data Loading ---


FileNotFoundError: [Errno 2] No such file or directory: 'train.csv'