<img src="https://upload.wikimedia.org/wikipedia/commons/0/06/Imperial_College_London_new_logo.png" alt="Imperial Logo" width="400">

### **Course:** CIVE70111 Machine Learning
### Task 4 PV Plant Modelling and Machine Learning Pipeline

**Project:** Classification of operating conditions

**Date:** 09/12/2025  

<p align="right">
Created by: Michael Wong
</p>

# Table of Contents

1. **Project Overview**
2. **Workflow Summary**
3. **Imports & Paths**
4. **Helper Functions**
5. **Machine Learning Helpers**
6. **End-to-End Classification Pipeline**


# 1. Project Overview

This project focuses on detecting **suboptimal inverter operating conditions** in two solar power plants.
Each plant contains multiple inverters and weather sensors recording AC/DC power, yield, irradiance,
and temperature. The dataset contains numerous real-world issues including missing values, inconsistent
measurements, noisy power output at night, and non-monotonic yield counters.

The goal is to develop a **robust and interpretable machine learning model** that:

- Predicts inverter state as **Optimal (0)** or **Suboptimal (1)**
- Uses strict **time-based splitting** to avoid data leakage
- Is evaluated using F1-score with emphasis on Suboptimal detection
- Incorporates **data cleaning, outlier removal, feature engineering**
- Provides **engineering interpretability** using ALE and Drop-Column Importance

The final system integrates preprocessing, model training, evaluation,
and interpretability into a fully automated pipeline.


# 2. Workflow Summary

The overall workflow is divided into six major stages:

1. **Imports & Paths**
   - Load required Python libraries
   - Define file locations for Plant 1 and Plant 2 datasets

2. **Helper Functions**
   - Weather cleaning
   - AC/DC cleaning
   - Daily and total yield correction
   - Outlier removal
   - Merging inverter and weather data

3. **Machine Learning Helper Functions**
   - Label construction
   - Feature engineering (AC/IRRA, DC/IRRA)
   - Train/validation/test splitting
   - Threshold optimisation for Suboptimal F1
   - ALE plotting and drop-column importance

4. **End-to-End Classification Pipeline**
   - Assemble datasets
   - Clean and engineer features
   - Split chronologically
   - Train Logistic Regression and Linear SVM (scaled/unscaled)
   - Generate evaluation metrics
   - Produce ALE interpretability plots
   - Compute drop-column feature importance

5. **Experiments**
   - With vs. without outlier removal
   - Before vs. after feature selection
   - Plant 1 vs. Plant 2 comparison

6. **Results Interpretation**
   - Performance comparison across plants and models
   - Importance of each input feature
   - Impact of outlier removal
   - Engineering insights into inverter performance


# 3. Imports & Paths

In [25]:
import os
import datetime as dt

import numpy as np
import pandas as pd

# Disable all plot display
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    precision_recall_curve, classification_report, confusion_matrix,
    f1_score, average_precision_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.utils.class_weight import compute_class_weight

from PyALE import ale

import pickle
from tqdm import tqdm
import logging
logging.getLogger("PyALE").setLevel(logging.WARNING)


# 4. Helper Functions 
### Weather, AC/DC, Yield, Outliers

In [26]:
def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    os.makedirs(path, exist_ok=True)


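def regression_outlier_detection_graph(df, x_col="IRRADIATION_CLEAN",
                                       y_col="AC_CLEAN", z_thresh=3, plot=True):
    """
    Drop rows whose residual from a linear fit of y_col on x_col has |z| > z_thresh.
    Rows with NaN in either column are kept unchanged.
    The `plot` argument is currently unused.
    """
    df = df.copy()
    mask_valid = df[[x_col, y_col]].notna().all(axis=1)
    if mask_valid.sum() < 10:
        return df

    X = df.loc[mask_valid, [x_col]].values
    y = df.loc[mask_valid, y_col].values

    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)

    residuals = y - y_pred
    resid_std = residuals.std(ddof=0)
    if resid_std == 0:
        return df  # perfect fit: no residual spread, so nothing to flag
    z = (residuals - residuals.mean()) / resid_std
    outlier_mask = np.abs(z) > z_thresh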

    df_valid = df.loc[mask_valid].copy()
    df_valid["outlier_reg"] = outlier_mask

    df_clean = df_valid.loc[~df_valid["outlier_reg"]].drop(columns=["outlier_reg"])
    df_rest = df.loc[~mask_valid]
    df_result = pd.concat([df_clean, df_rest], axis=0).sort_index()
    return df_result
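
The residual z-score rule above can be tried on synthetic data. The sketch below is illustrative only: it swaps `LinearRegression` for `np.polyfit` to stay NumPy-only, and the function name `residual_zscore_outliers` and the generated data are assumptions, not part of the project code.

```python
import numpy as np

def residual_zscore_outliers(x, y, z_thresh=3.0):
    """Flag points whose residual from a straight-line fit has |z| > z_thresh."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    std = residuals.std(ddof=0)
    if std == 0:
        return np.zeros(len(x), dtype=bool)
    z = (residuals - residuals.mean()) / std
    return np.abs(z) > z_thresh

# Synthetic example: AC power roughly proportional to irradiation, plus noise.
rng = np.random.default_rng(0)
irradiation = np.linspace(0.0, 1.0, 200)
ac_power = 1000.0 * irradiation + rng.normal(0.0, 10.0, size=200)
ac_power[50] = 0.0  # simulated daytime dropout

mask = residual_zscore_outliers(irradiation, ac_power)
print("flagged rows:", np.flatnonzero(mask))
```

Note that a single large deviation inflates the residual standard deviation, so the rule is conservative when several gross errors are present.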

### Weather Cleaning

In [27]:
def clean_weather(df_weather_raw):
    """
    Create IRRADIATION_CLEAN using simple 6:00–18:30 day/night rule,
    drop SOURCE_KEY, and set DATE_TIME as index.
    """
    dfw = df_weather_raw.copy()
    dfw["DATE_TIME"] = pd.to_datetime(dfw["DATE_TIME"])

    day_start = dt.time(6, 0)
    day_end   = dt.time(18, 30)
    dfw["expected_day"] = dfw["DATE_TIME"].dt.time.between(day_start, day_end)

    dfw["IRRADIATION_CLEAN"] = dfw["IRRADIATION"].copy()
    dfw.loc[(~dfw["expected_day"]) & (dfw["IRRADIATION_CLEAN"] > 0), "IRRADIATION_CLEAN"] = 0

    dfw.set_index("DATE_TIME", inplace=True)
    if "SOURCE_KEY" in dfw.columns:
        dfw = dfw.drop(columns=["SOURCE_KEY"])

    return dfw
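
As a quick illustration of the day/night rule, the sketch below applies the same 06:00 to 18:30 window to four made-up readings. It uses `Series.where`, which zeroes every night-time value, whereas `clean_weather` only overwrites positive ones; the sample values are assumptions.

```python
import datetime as dt
import pandas as pd

# Four hypothetical sensor readings: two at night, two inside the day window.
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 02:00", "2020-05-15 12:00",
        "2020-05-15 18:30", "2020-05-15 21:00",
    ]),
    "IRRADIATION": [0.05, 0.80, 0.02, 0.03],
})

# The window is inclusive at both ends, so 18:30 still counts as daytime.
is_day = weather["DATE_TIME"].dt.time.between(dt.time(6, 0), dt.time(18, 30))
weather["IRRADIATION_CLEAN"] = weather["IRRADIATION"].where(is_day, 0.0)
print(weather)
```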

### Aggregating Generation Data by Inverter

In [28]:
def aggregate_inverters(df_gen_clean):
    """
    Aggregate generation data per inverter and time, and count Optimal/Suboptimal.
    Returns dict: {source_key: aggregated_df}
    """
    agg_dict = {}
    grouped = df_gen_clean.groupby("SOURCE_KEY")
    for sk, g in grouped:
        agg_df = g.groupby("DATE_TIME").agg(
            SOURCE_KEY=("SOURCE_KEY", "first"),
            DC_POWER=("DC_POWER", "first"),
            AC_POWER=("AC_POWER", "first"),
            DAILY_YIELD=("DAILY_YIELD", "first"),
            TOTAL_YIELD=("TOTAL_YIELD", "first"),
            NUM_OPT=("Operating_Condition", lambda x: (x == "Optimal").sum()),
            NUM_SUBOPT=("Operating_Condition", lambda x: (x == "Suboptimal").sum())
        ).reset_index()
        agg_dict[sk] = agg_df
    return agg_dict
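
The `NUM_OPT` / `NUM_SUBOPT` counts are later turned into a single label in `clean_total_yield_dict` by the rule `NUM_OPT > NUM_SUBOPT`. The toy example below, with invented readings, shows the consequence of that rule: a tie is labelled Suboptimal.

```python
import numpy as np
import pandas as pd

# Hypothetical duplicate readings for one inverter at two timestamps.
gen = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 10:00"] * 3 + ["2020-05-15 10:15"] * 2),
    "AC_POWER": [500.0, 500.0, 500.0, 520.0, 520.0],
    "Operating_Condition": ["Optimal", "Optimal", "Suboptimal", "Suboptimal", "Optimal"],
})

agg = gen.groupby("DATE_TIME").agg(
    AC_POWER=("AC_POWER", "first"),
    NUM_OPT=("Operating_Condition", lambda s: (s == "Optimal").sum()),
    NUM_SUBOPT=("Operating_Condition", lambda s: (s == "Suboptimal").sum()),
).reset_index()

# Same comparison as in clean_total_yield_dict: strict majority for Optimal.
agg["LABEL"] = np.where(agg["NUM_OPT"] > agg["NUM_SUBOPT"], "Optimal", "Suboptimal")
print(agg)
```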

### Merge Inverter + Weather

In [29]:
def merge_inverter_weather(agg_inv_dict, df_weather_clean):
    """
    Inner-join each inverter df with weather df on matching DATE_TIME index.
    Returns dict: {source_key: joined_df}
    """
    joined = {}
    for sk, inv_df in agg_inv_dict.items():
        d = inv_df.copy()
        d["DATE_TIME"] = pd.to_datetime(d["DATE_TIME"])
        d.set_index("DATE_TIME", inplace=True)
        join_df = d.join(df_weather_clean, how="inner")
        joined[sk] = join_df
    return joined


### Clean AC/DC Power

In [30]:
def clean_ac_dc_dict(wea_inv_dict):
    """
    Clean AC_POWER and DC_POWER into AC_CLEAN/DC_CLEAN based on IRRADIATION_CLEAN.
    Returns dict on the same keys.
    """
    cleaned = {}
    for sk, df_join in wea_inv_dict.items():
        d = df_join.copy()
        d["AC_CLEAN"] = d["AC_POWER"].copy()
        d["DC_CLEAN"] = d["DC_POWER"].copy()

        night_mask = d["IRRADIATION_CLEAN"] == 0
        d.loc[night_mask & (d["AC_CLEAN"] > 0), "AC_CLEAN"] = 0
        d.loc[night_mask & (d["DC_CLEAN"] > 0), "DC_CLEAN"] = 0

        day_mask = d["IRRADIATION_CLEAN"] > 0
        d.loc[day_mask & (d["AC_CLEAN"] == 0), "AC_CLEAN"] = float("nan")
        d.loc[day_mask & (d["DC_CLEAN"] == 0), "DC_CLEAN"] = float("nan")

        d["AC_CLEAN"] = d["AC_CLEAN"].interpolate(method="linear")
        d["DC_CLEAN"] = d["DC_CLEAN"].interpolate(method="linear")

        d["AC_CLEAN"] = d["AC_CLEAN"].fillna(0)
        d["DC_CLEAN"] = d["DC_CLEAN"].fillna(0)

        cleaned[sk] = d
    return cleaned
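
A minimal sketch of the same three steps (zero at night, NaN for daytime zeros, linear interpolation) on five invented rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "IRRADIATION_CLEAN": [0.0, 0.2, 0.5, 0.7, 0.0],
    "AC_POWER":          [3.0, 200.0, 0.0, 700.0, 5.0],
})

ac = df["AC_POWER"].astype(float).copy()
night = df["IRRADIATION_CLEAN"] == 0
day = df["IRRADIATION_CLEAN"] > 0

ac[night & (ac > 0)] = 0.0       # no generation is possible without irradiation
ac[day & (ac == 0)] = np.nan     # a daytime zero is treated as a missing reading
ac = ac.interpolate(method="linear").fillna(0.0)
print(ac.tolist())
```

The daytime zero in the third row is replaced by the midpoint of its neighbours, and both night-time readings are forced to zero.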


### Clean DAILY_YIELD

In [31]:
def clean_daily_yield_dict(acdc_dict):
    """
    Enforce DAILY_YIELD_CLEAN:
      - 0 at night
      - monotonic increasing during daytime
      - flat after sunset
    Returns dict with DAILY_YIELD_CLEAN added.
    """
    cleaned = {}
    for sk, df_in in acdc_dict.items():
        d = df_in.copy()
        d.index = pd.to_datetime(d.index)
        d["DAILY_YIELD_CLEAN"] = d["DAILY_YIELD"].copy()

        dates = np.unique(d.index.date)
        for day in dates:
            mask_day_full = d.index.date == day
            df_day = d.loc[mask_day_full]

            irr_pos = df_day["IRRADIATION_CLEAN"] > 0
            if not irr_pos.any():
                d.loc[mask_day_full, "DAILY_YIELD_CLEAN"] = 0.0
                continue

            day_start_idx = df_day[irr_pos].index[0]
            day_end_idx   = df_day[irr_pos].index[-1]

            night_mask   = mask_day_full & (d.index < day_start_idx)
            day_mask     = mask_day_full & (d.index >= day_start_idx) & (d.index <= day_end_idx)
            evening_mask = mask_day_full & (d.index > day_end_idx)

            d.loc[night_mask, "DAILY_YIELD_CLEAN"] = 0.0
            val_end = d.at[day_end_idx, "DAILY_YIELD"]
            d.loc[evening_mask, "DAILY_YIELD_CLEAN"] = val_end

            day_idx = d.loc[day_mask].index
            if len(day_idx) == 0:
                continue

            raw_vals = d.loc[day_idx, "DAILY_YIELD_CLEAN"].values.astype(float)
            invalid = np.zeros(len(raw_vals), dtype=bool)

            invalid |= raw_vals <= 0
            if len(raw_vals) > 1:
                drops = np.diff(raw_vals) < 0
                invalid[1:][drops] = True

            d.loc[day_idx[invalid], "DAILY_YIELD_CLEAN"] = np.nan
            d.loc[day_idx, "DAILY_YIELD_CLEAN"] = (
                d.loc[day_idx, "DAILY_YIELD_CLEAN"]
                .interpolate(method="linear", limit_direction="both")
            )

            prev_val = d.at[day_idx[0], "DAILY_YIELD_CLEAN"]
            for t in day_idx[1:]:
                cur = d.at[t, "DAILY_YIELD_CLEAN"]
                if pd.isna(cur) or cur < prev_val:
                    d.at[t, "DAILY_YIELD_CLEAN"] = prev_val
                else:
                    prev_val = cur

            d.loc[night_mask, "DAILY_YIELD_CLEAN"] = 0.0
            d.loc[evening_mask, "DAILY_YIELD_CLEAN"] = val_end

        cleaned[sk] = d
    return cleaned
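
The daytime repair can be illustrated on a single short series. This sketch assumes one day of made-up readings and uses `cummax()` in place of the explicit forward loop, which has the same effect of never letting the counter decrease.

```python
import numpy as np
import pandas as pd

# Hypothetical daytime DAILY_YIELD readings with a zero and two drops.
raw = pd.Series([0.0, 100.0, 250.0, 180.0, 400.0, 390.0, 600.0])

vals = raw.copy()
# Invalid: non-positive readings, or readings lower than their predecessor.
invalid = (vals <= 0) | (vals.diff() < 0)
vals[invalid] = np.nan
vals = vals.interpolate(method="linear", limit_direction="both")

# Running maximum guarantees a non-decreasing counter.
monotonic = vals.cummax()
print(monotonic.tolist())
```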

### Clean TOTAL_YIELD

In [32]:
def clean_total_yield_dict(daily_dict):
    """
    Clean TOTAL_YIELD into TOTAL_YIELD_CLEAN using increments in DAILY_YIELD_CLEAN.
    Returns dict with TOTAL_YIELD_CLEAN added, and trimmed columns + OPERATING_CONDITION_CLEAN.
    """
    cleaned = {}
    for sk, df_in in daily_dict.items():
        d = df_in.copy()
        d["TOTAL_YIELD_CLEAN"] = d["TOTAL_YIELD"].copy()
        timestamps = d.index

        for i in range(1, len(timestamps)):
            t_prev = timestamps[i - 1]
            t_curr = timestamps[i]

            TY_prev = d.at[t_prev, "TOTAL_YIELD_CLEAN"]
            TY_now  = d.at[t_curr, "TOTAL_YIELD"]
            DY_prev = d.at[t_prev, "DAILY_YIELD_CLEAN"]
            DY_now  = d.at[t_curr, "DAILY_YIELD_CLEAN"]

            is_new_day = t_curr.date() != t_prev.date()
            if is_new_day:
                d.at[t_curr, "TOTAL_YIELD_CLEAN"] = TY_prev
                continue

            delta_dy = DY_now - DY_prev
            TY_expected = TY_prev + delta_dy

            if TY_now < TY_prev:
                d.at[t_curr, "TOTAL_YIELD_CLEAN"] = TY_expected
            else:
                d.at[t_curr, "TOTAL_YIELD_CLEAN"] = TY_now

        cols_keep = [
            "PLANT_ID", "SOURCE_KEY",
            "AC_CLEAN", "DC_CLEAN",
            "DAILY_YIELD_CLEAN", "TOTAL_YIELD_CLEAN",
            "AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE",
            "IRRADIATION_CLEAN", "NUM_OPT", "NUM_SUBOPT"
        ]
        cols_keep = [c for c in cols_keep if c in d.columns]
        d = d[cols_keep]

        d["OPERATING_CONDITION_CLEAN"] = np.where(
            d["NUM_OPT"] > d["NUM_SUBOPT"], "Optimal", "Suboptimal"
        )
        d = d.drop(columns=["NUM_OPT", "NUM_SUBOPT"])

        cleaned[sk] = d
    return cleaned
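
The within-day repair rule can be sketched in plain Python. The helper name `rebuild_total_yield` and the numbers are hypothetical, and the sketch leaves out the new-day branch handled by the function above.

```python
def rebuild_total_yield(total_raw, daily_clean):
    """Repair a cumulative counter using increments of the cleaned daily counter."""
    total_clean = [total_raw[0]]
    for i in range(1, len(total_raw)):
        prev = total_clean[-1]
        if total_raw[i] < prev:
            # Counter went backwards: trust the daily-yield increment instead.
            total_clean.append(prev + (daily_clean[i] - daily_clean[i - 1]))
        else:
            total_clean.append(total_raw[i])
    return total_clean

daily_clean = [0.0, 10.0, 25.0, 40.0]
total_raw = [1000.0, 1010.0, 900.0, 1040.0]  # third reading is a glitch
total_clean = rebuild_total_yield(total_raw, daily_clean)
print(total_clean)
```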

### Outlier Removal Wrapper

In [33]:
def remove_outliers_ps_dict(df_ps_dict):
    """
    Apply regression_outlier_detection_graph to each inverter df.
    """
    out_dict = {}
    for sk, df_in in df_ps_dict.items():
        out_dict[sk] = regression_outlier_detection_graph(
            df_in, x_col="IRRADIATION_CLEAN", y_col="AC_CLEAN",
            z_thresh=3, plot=False
        )
    return out_dict

# 5. Machine Learning Helpers

### Label Creation

In [34]:
def make_label(df_all):
    """
    Label: Optimal -> 0, Suboptimal -> 1
    """
    return (df_all["OPERATING_CONDITION_CLEAN"].str.lower() == "suboptimal").astype(int)


### Feature Engineering


In [35]:
def engineer_features(df_all):
    """
    Sort by DATE_TIME per SOURCE_KEY and add AC/IRRA, DC/IRRA.
    """
    # A two-key sort gives the same ordering as a per-inverter groupby/apply
    # and avoids the pandas deprecation warning about grouping columns.
    df_feat = df_all.sort_values(["SOURCE_KEY", "DATE_TIME"])
    df_feat["DC/IRRA"] = df_feat["DC_CLEAN"] / (df_feat["IRRADIATION_CLEAN"] + 1e-3)
    df_feat["AC/IRRA"] = df_feat["AC_CLEAN"] / (df_feat["IRRADIATION_CLEAN"] + 1e-3)
    return df_feat


### Combine All Inverter Data


In [36]:
def assemble_all_from_df_ps(df_ps_dict):
    """
    Combine all inverter dfs into one dataframe.
    """
    parts = []
    for sk, df_inv in df_ps_dict.items():
        d = df_inv.copy()
        d = d.reset_index()  # bring DATE_TIME back as a column
        parts.append(d)

    df_all = pd.concat(parts, ignore_index=True).drop_duplicates()
    df_all["DATE_TIME"] = pd.to_datetime(df_all["DATE_TIME"])

    mask = (~df_all["OPERATING_CONDITION_CLEAN"].isna()) & (~df_all["IRRADIATION_CLEAN"].isna())
    df_all = df_all[mask]

    counts = df_all["OPERATING_CONDITION_CLEAN"].value_counts()
    print("\n=== Operating Condition Counts ===")
    print(f"Number of Optimal (0):     {counts.get('Optimal', 0)}")
    print(f"Number of Suboptimal (1):  {counts.get('Suboptimal', 0)}")

    return df_all


### Time-Based Splitting (Prevents leakage)


In [37]:
def time_split(df_feat, y, test_days=10, val_days=3):
    """
    Chronological split into train/val/test.
    """
    last_time = df_feat["DATE_TIME"].max()
    test_start = last_time - pd.Timedelta(days=test_days)
    val_start  = test_start - pd.Timedelta(days=val_days)

    mask_test = df_feat["DATE_TIME"] >= test_start
    mask_val  = (df_feat["DATE_TIME"] >= val_start) & (~mask_test)
    mask_train = df_feat["DATE_TIME"] < val_start

    X_tr = df_feat[mask_train]
    X_val = df_feat[mask_val]
    X_te = df_feat[mask_test]

    y_tr = y[mask_train]
    y_val = y[mask_val]
    y_te = y[mask_test]

    return X_tr, X_val, X_te, y_tr, y_val, y_te
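
To see where the boundaries fall, the sketch below applies the same arithmetic to 30 invented daily timestamps. Because the test mask uses `>=`, a 10-day window measured back from the last timestamp covers 11 daily stamps.

```python
import pandas as pd

# Hypothetical data: one row per day for 30 days.
df = pd.DataFrame({
    "DATE_TIME": pd.date_range("2020-05-15", periods=30, freq="D"),
    "x": range(30),
})

last_time = df["DATE_TIME"].max()
test_start = last_time - pd.Timedelta(days=10)
val_start = test_start - pd.Timedelta(days=3)

test = df[df["DATE_TIME"] >= test_start]
val = df[(df["DATE_TIME"] >= val_start) & (df["DATE_TIME"] < test_start)]
train = df[df["DATE_TIME"] < val_start]

print(len(train), len(val), len(test))
```

Every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp, so no future information reaches the model during fitting or threshold selection.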

### Preprocessing Pipeline (StandardScaler on numeric columns)


In [38]:
def make_preprocessor(df_feat, drop_col):
    """
    StandardScaler on numeric columns not in drop_col.
    """
    num_cols = [
        c for c in df_feat.columns
        if c not in drop_col and df_feat[c].dtype.kind in "fcui"
    ]
    pre = ColumnTransformer(
        [("num", Pipeline([("scaler", StandardScaler())]), num_cols)]
    )
    return pre

### Select Threshold that Maximises F1 for Suboptimal Class


In [39]:
def Suboptimal_f1_threshold(y_true, scores_suboptimal):
    """
    Pick threshold that maximises F1 for the Suboptimal (1) class.
    """
    p, r, thr = precision_recall_curve(y_true, scores_suboptimal)
    if len(thr) == 0:
        return 0.0

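    # p and r have one more entry than thr: p[i], r[i] pair with thr[i], and the
    # final (precision=1, recall=0) point has no threshold, so drop the last entry.
    f1 = 2 * p[:-1] * r[:-1] / (p[:-1] + r[:-1] + 1e-12)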
    best_ix = np.nanargmax(f1)
    return float(thr[best_ix])
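
A small worked example of F1-based threshold selection, using six invented scores. scikit-learn's `precision_recall_curve` returns one more precision/recall value than thresholds, so the sketch pairs `precision[i]` and `recall[i]` with `thresholds[i]` and ignores the final point.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Hypothetical validation labels (1 = Suboptimal) and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Drop the last precision/recall pair, which has no matching threshold.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_thr = float(thresholds[np.argmax(f1)])

# Predictions use >=, matching how the curve itself is defined.
preds = (scores >= best_thr).astype(int)
print(best_thr, f1_score(y_true, preds))
```

Here the best threshold is 0.8: it keeps both high-scoring Suboptimal samples and rejects every Optimal one, at the cost of missing the Suboptimal sample scored 0.35.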


### Evaluation: Confusion Matrix, Classification Report, PR-AUC


In [40]:
def Suboptimal_evaluate(name, y_true, scores_suboptimal, thr, tag):
    """
    Print confusion matrix + classification report + PR-AUC focused on suboptimal.
    """
    preds = (scores_suboptimal >= thr).astype(int)
    ap = average_precision_score(y_true, scores_suboptimal)
    print(f"\n==== {name} | {tag} ====")
    print(f"Suboptimal focused Threshold: {thr:.4f} | PR-AUC: {ap:.4f}")
    print(classification_report(y_true, preds, digits=3))
    print("Suboptimal focused Confusion Matrix:\n", confusion_matrix(y_true, preds))

### Compute F1 Score Using a Custom Threshold


In [41]:
def f1_threshold_scorer(model, X, y_true, thr):
    """
    Compute F1 (Suboptimal=1) for a given model and threshold.
    """
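    # Pipelines ending in LinearSVC expose decision_function but not predict_proba.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X)[:, 1]
    else:
        scores = model.decision_function(X)
    # Use >= so predictions match precision_recall_curve and Suboptimal_evaluate.
    preds = (scores >= thr).astype(int)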
    return f1_score(y_true, preds, pos_label=1)

### 1-D ALE Plots for Model Interpretability


In [42]:
def plot_ale_1d(model, X, feature, bins=20, save_path=None):
    # Run ALE
    ale(X=X, model=model, feature=[feature], include_CI=False, grid_size=bins)

    # Sanitize filename
    safe_feature = str(feature)
    for bad in ["/", "\\", ":", "*", "?", "\"", "<", ">", "|"]:
        safe_feature = safe_feature.replace(bad, "_")

    plt.title(f"ALE for {feature}")
    plt.tight_layout()

    if save_path:
        file = os.path.join(save_path, f"ALE_{safe_feature}.png")
        plt.savefig(file)

    plt.close()  # close the figure to free memory; the Agg backend cannot display it


### Drop-Column Importance (Re-trains SVM per feature)


In [43]:
def drop_column_importance(df_feat, baseline_f1, drop_col,
                           X_tr, y_tr, X_val, y_val, X_te, y_te):
    """
    Drop-column importance using LinearSVC: importance = baseline_f1 - dropped_f1.
    """
    importances = {}
    base_drop_cols = set(drop_col)

    for col in X_tr.columns:
        if col in base_drop_cols:
            continue

        X_tr_d = X_tr.drop(columns=[col])
        X_val_d = X_val.drop(columns=[col])
        X_te_d = X_te.drop(columns=[col])

        df_feat_d = df_feat.drop(columns=[col])
        pre_d = make_preprocessor(df_feat_d, drop_col)

        svm_d = Pipeline([
            ("pre", pre_d),
            ("clf", LinearSVC(class_weight="balanced", max_iter=5000))
        ])
        svm_d.fit(X_tr_d, y_tr)

        thr_d = Suboptimal_f1_threshold(y_val, svm_d.decision_function(X_val_d))
        dropped_f1 = f1_threshold_scorer(svm_d, X_te_d, y_te, thr_d)

        importances[col] = baseline_f1 - dropped_f1

    return importances
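
The idea behind drop-column importance can be shown on a two-feature toy problem. This is a simplified sketch, not the project pipeline: it uses `LogisticRegression` with the default 0.5 cut-off instead of `LinearSVC` with a tuned threshold, and the `signal` / `noise` columns are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic data: the label depends only on "signal"; "noise" is irrelevant.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y = (X["signal"] > 0).astype(int)

X_tr, X_te = X.iloc[:300], X.iloc[300:]
y_tr, y_te = y.iloc[:300], y.iloc[300:]

def fit_and_score(cols):
    """Retrain on the given columns and return the test F1 for class 1."""
    model = LogisticRegression(max_iter=1000).fit(X_tr[cols], y_tr)
    return f1_score(y_te, model.predict(X_te[cols]))

baseline = fit_and_score(["signal", "noise"])
importances = {
    col: baseline - fit_and_score([c for c in X.columns if c != col])
    for col in X.columns
}
print(baseline, importances)
```

A large positive value means the model loses performance without that column; a value near zero (or negative) suggests the column can be dropped.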

# 6. End-to-End Classification Pipeline for a Plant


In [44]:
def run_classification_on_df_ps(df_ps_dict, test_days=10, val_days=3, drop_col=None):
    """
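    Full pipeline: fit the models, tune thresholds on the validation split,
    save ALE plots, the SVM decision histogram, and a results pickle.
    ALE is computed on the training data.
    """

    # Timestamp-based run ID, so repeated calls do not overwrite each other's
    # histogram and pickle files.
    run_count = dt.datetime.now().strftime("%Y%m%d_%H%M%S")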

    # ================================================================
    # FOLDER SETUP
    # ================================================================

############################################################################################################################################
    
    # Change here 

    base_path = r"C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data"

######################################################################################################################################################################
    
    folder_main = os.path.join(base_path, "03 ALE SVM Decision")
    folder_plots = os.path.join(folder_main, "Plots")
    folder_ale = os.path.join(folder_plots, "ALE")
    folder_svm = os.path.join(folder_plots, "SVM")

    ensure_dir(folder_main)
    ensure_dir(folder_plots)
    ensure_dir(folder_ale)
    ensure_dir(folder_svm)

    # ================================================================
    # DATA PREPARATION
    # ================================================================
    if drop_col is None:
        drop_col = ["OPERATING_CONDITION_CLEAN", "DATE_TIME", "PLANT_ID", "SOURCE_KEY"]

    df_all = assemble_all_from_df_ps(df_ps_dict)
    y = make_label(df_all)
    df_feat = engineer_features(df_all)

    # train/val/test split
    X_tr, X_val, X_te, y_tr, y_val, y_te = time_split(df_feat, y, test_days, val_days)

    X_tr_model = X_tr.drop(columns=drop_col)
    X_val_model = X_val.drop(columns=drop_col)
    X_te_model = X_te.drop(columns=drop_col)

    pre = make_preprocessor(df_feat, drop_col)

    class_weights_arr = compute_class_weight("balanced", classes=np.array([0,1]), y=y_tr)
    class_weights = {0: class_weights_arr[0], 1: class_weights_arr[1]}

    # ================================================================
    # MODELS
    # ================================================================
    print("\n=== LogReg with scaling ===")
    lr = Pipeline([("pre", pre),("clf", LogisticRegression(max_iter=5000, class_weight=class_weights))])
    lr.fit(X_tr_model, y_tr)
    print("\n=== LogReg without scaling ===")
    lr_ns = LogisticRegression(max_iter=5000, class_weight=class_weights)
    lr_ns.fit(X_tr_model, y_tr)
    print("\n=== LinearSVC with scaling ===")
    svm = Pipeline([("pre", pre),("clf", LinearSVC(class_weight="balanced", max_iter=5000))])
    svm.fit(X_tr_model, y_tr)
    print("\n=== LinearSVC without scaling ===")
    svm_ns = LinearSVC(class_weight="balanced", max_iter=5000)
    svm_ns.fit(X_tr_model, y_tr)

    # ================================================================
    # THRESHOLDS
    # ================================================================
    thr_lr = Suboptimal_f1_threshold(y_val, lr.predict_proba(X_val_model)[:,1])
    thr_lr_ns = Suboptimal_f1_threshold(y_val, lr_ns.predict_proba(X_val_model)[:,1])
    thr_svm = Suboptimal_f1_threshold(y_val, svm.decision_function(X_val_model))
    thr_svm_ns = Suboptimal_f1_threshold(y_val, svm_ns.decision_function(X_val_model))

    # ================================================================
    # ALE PLOTS: SAVE ONLY (on X_tr_model)
    # ================================================================
    print("\n=== Saving ALE plots (train set, correct) ===")
    for feat in tqdm(X_tr_model.columns):
        plot_ale_1d(
            svm,                      # model
            X_tr_model,               # ALE is computed on the training data
            feat,
            save_path=folder_ale      # figures are written to disk
        )
    # No plt.show() here: the Agg backend selected at import time cannot display figures.

    # ================================================================
    # DROP-COLUMN IMPORTANCE
    # ================================================================
    baseline_f1 = f1_threshold_scorer(svm, X_te_model, y_te, thr_svm)

    svm_importance = drop_column_importance(
        df_feat, baseline_f1, drop_col,
        X_tr_model, y_tr, X_val_model, y_val, X_te_model, y_te
    )

    # ================================================================
    # SVM DECISION HISTOGRAM — UNIQUE FILENAME
    # ================================================================
    safe_name = f"Run_{run_count}"
    hist_file = os.path.join(folder_svm, f"SVM_Decision_Histogram_{safe_name}.png")

    scores_te = svm.decision_function(X_te_model)

    plt.hist(scores_te[y_te == 0], bins=50, alpha=0.6, label="Optimal")
    plt.hist(scores_te[y_te == 1], bins=50, alpha=0.6, label="Suboptimal")
    plt.axvline(thr_svm, linestyle="--", label="boundary")
    plt.xlabel("SVM decision function")
    plt.ylabel("Count")
    plt.legend()
    plt.savefig(hist_file)
    plt.close()

    print(f"Saved SVM histogram → {hist_file}")

    # ================================================================
    # SAVE RESULTS (PKL) — UNIQUE FILE NAME
    # ================================================================
    pkl_file = os.path.join(folder_main, f"results_Run_{run_count}.pkl")

    results_dict = {
        "LogReg_scaled": lr,
        "LogReg_no_scaling": lr_ns,
        "SVM_scaled": svm,
        "SVM_no_scaling": svm_ns,
        "thresholds": {
            "lr": thr_lr,
            "lr_ns": thr_lr_ns,
            "svm": thr_svm,
            "svm_ns": thr_svm_ns
        },
        "drop_column_importance": svm_importance,
        "baseline_f1": baseline_f1,
        "features": list(X_te_model.columns),
    }

    with open(pkl_file, "wb") as f:
        pickle.dump(results_dict, f)

    print(f"Saved results to → {pkl_file}\n")


### File Paths

In [45]:
# ============================================================
# 0. PATHS
# ============================================================

############################################################################################################################################
# Change here 

folder = r"C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\In"

############################################################################################################################################

gen_path_1     = os.path.join(folder, "Plant_1_Generation_Data_updated.csv")   # Plant 1 generation
weather_path_1 = os.path.join(folder, "Plant_1_Weather_Sensor_Data.csv")       # Plant 1 weather

gen_path_2     = os.path.join(folder, "Plant_2_Generation_Data.csv")           # Plant 2 generation
weather_path_2 = os.path.join(folder, "Plant_2_Weather_Sensor_Data.csv")       # Plant 2 weather


### Main Pipeline: Plant 1


In [46]:
# ============================================================
# 3. MAIN PIPELINE
# ============================================================

# ------------------ Plant 1 ------------------

print("\n=== PLANT 1: LOADING DATA ===")
df_p1_gen_raw = pd.read_csv(gen_path_1, parse_dates=["DATE_TIME"])
df_p1_weather_raw = pd.read_csv(weather_path_1, parse_dates=["DATE_TIME"])

# Drop rows with any missing value (including Operating_Condition),
# then drop PLANT_ID and 'day' if present
df_p1_gen = df_p1_gen_raw.dropna().copy()
for col_drop in ["PLANT_ID", "day"]:
    if col_drop in df_p1_gen.columns:
        df_p1_gen = df_p1_gen.drop(columns=[col_drop])

# Aggregate by inverter (DATE_TIME stays a regular column for the groupby)
agg_inv_p1 = aggregate_inverters(df_p1_gen)

# Clean weather
df_p1_weather = clean_weather(df_p1_weather_raw)

# Join inverter + weather
wea_inv_p1 = merge_inverter_weather(agg_inv_p1, df_p1_weather)

# Clean AC/DC, DAILY_YIELD, TOTAL_YIELD
p1_step1 = clean_ac_dc_dict(wea_inv_p1)
p1_step2 = clean_daily_yield_dict(p1_step1)
df_ps1 = clean_total_yield_dict(p1_step2)

# Outlier removal
df_ps1_outlier = remove_outliers_ps_dict(df_ps1)


=== PLANT 1: LOADING DATA ===


### Main Pipeline: Plant 2

In [47]:
# ------------------ Plant 2 ------------------

print("\n=== PLANT 2: LOADING DATA ===")
df_p2_gen_raw = pd.read_csv(gen_path_2, parse_dates=["DATE_TIME"])
df_p2_weather_raw = pd.read_csv(weather_path_2, parse_dates=["DATE_TIME"])

# Drop PLANT_ID from the generation data (DATE_TIME stays a regular column)
if "PLANT_ID" in df_p2_gen_raw.columns:
    df_p2_gen = df_p2_gen_raw.drop(columns=["PLANT_ID"]).copy()
else:
    df_p2_gen = df_p2_gen_raw.copy()

agg_inv_p2 = aggregate_inverters(df_p2_gen)
df_p2_weather = clean_weather(df_p2_weather_raw)
wea_inv_p2 = merge_inverter_weather(agg_inv_p2, df_p2_weather)

p2_step1 = clean_ac_dc_dict(wea_inv_p2)
p2_step2 = clean_daily_yield_dict(p2_step1)
df_ps2 = clean_total_yield_dict(p2_step2)

df_ps2_outlier = remove_outliers_ps_dict(df_ps2)



=== PLANT 2: LOADING DATA ===


### Experiments: With/Without Outliers, Before/After Feature Selection


In [48]:
# ============================================================
# 4. EXPERIMENTS
# ============================================================

drop_base = ["OPERATING_CONDITION_CLEAN", "DATE_TIME", "PLANT_ID", "SOURCE_KEY"]

print("\n\n==============================")
print("PLANT 1: SVM BEFORE FEATURE SELECTION (WITH OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps1, drop_col=drop_base)

drop1 = drop_base + ['AC/IRRA', 'DC/IRRA', 'MODULE_TEMPERATURE','TOTAL_YIELD_CLEAN', 'DC_CLEAN', 'AC_CLEAN']
print("\n\n==============================")
print("PLANT 1: AFTER FEATURE SELECTION (WITH OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps1, drop_col=drop1)

print("\n\n==============================")
print("PLANT 1: SVM BEFORE FEATURE SELECTION (WITHOUT OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps1_outlier, drop_col=drop_base)

drop2 = drop_base + ['AC/IRRA', 'DC/IRRA', 'MODULE_TEMPERATURE','TOTAL_YIELD_CLEAN', 'DC_CLEAN', 'AC_CLEAN']
print("\n\n==============================")
print("PLANT 1: AFTER FEATURE SELECTION (WITHOUT OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps1_outlier, drop_col=drop2)

# Note: the rows removed as outliers are largely Optimal-condition samples that
# the model already predicted correctly, so removing them worsens model performance.

print("\n\n==============================")
print("PLANT 2: Before FEATURE SELECTION MODEL (WITH OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps2, drop_col=drop_base)

drop3 = drop_base + ['DAILY_YIELD_CLEAN', 'AMBIENT_TEMPERATURE','MODULE_TEMPERATURE', 'AC_CLEAN','TOTAL_YIELD_CLEAN','DC/IRRA']
print("\n\n==============================")
print("PLANT 2: After FEATURE SELECTION MODEL (WITH OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps2, drop_col=drop3)

print("\n\n==============================")
print("PLANT 2: SVM BEFORE FEATURE SELECTION (WITHOUT OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps2_outlier, drop_col=drop_base)

drop4 = drop_base + ['TOTAL_YIELD_CLEAN']
print("\n\n==============================")
print("PLANT 2: AFTER FEATURE SELECTION (WITHOUT OUTLIERS)")
print("==============================")
run_classification_on_df_ps(df_ps2_outlier, drop_col=drop4)
# Note: for this configuration, feature selection does not improve performance.




PLANT 1: SVM BEFORE FEATURE SELECTION (WITH OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     7656
Number of Suboptimal (1):  38024

=== LogReg with scaling ===

=== LogReg without scaling ===





=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===


 

Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 1: AFTER FEATURE SELECTION (WITH OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     7656
Number of Suboptimal (1):  38024

=== LogReg with scaling ===





=== LogReg without scaling ===

=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===




Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 1: SVM BEFORE FEATURE SELECTION (WITHOUT OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     6817
Number of Suboptimal (1):  37737

=== LogReg with scaling ===

=== LogReg without scaling ===





=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===



Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 1: AFTER FEATURE SELECTION (WITHOUT OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     6817
Number of Suboptimal (1):  37737

=== LogReg with scaling ===

=== LogReg without scaling ===

=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===




Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 2: Before FEATURE SELECTION MODEL (WITH OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     7414
Number of Suboptimal (1):  60284

=== LogReg with scaling ===

=== LogReg without scaling ===





=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===



Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 2: After FEATURE SELECTION MODEL (WITH OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     7414
Number of Suboptimal (1):  60284

=== LogReg with scaling ===

=== LogReg without scaling ===

=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===




Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 2: SVM BEFORE FEATURE SELECTION (WITHOUT OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     6383
Number of Suboptimal (1):  59435

=== LogReg with scaling ===

=== LogReg without scaling ===





=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===



Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



PLANT 2: AFTER FEATURE SELECTION (WITHOUT OUTLIERS)

=== Operating Condition Counts ===
Number of Optimal (0):     6383
Number of Suboptimal (1):  59435

=== LogReg with scaling ===

=== LogReg without scaling ===





=== LinearSVC with scaling ===

=== LinearSVC without scaling ===

=== Saving ALE plots (train set, correct) ===



Saved SVM histogram → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\Plots\SVM\SVM_Decision_Histogram_Run_0.png
Saved results to → C:\Users\B.KING\OneDrive - Imperial College London\CIVE70111 Machine Learning\CouseWork\Group-11\data\03 ALE SVM Decision\results_Run_0.pkl



### Summary
All models perform worse when trained without feature scaling. Scaling prevents large-magnitude features (such as AC and DC power) from dominating the loss function while small-magnitude features (such as irradiation) are effectively ignored.
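
The effect of scaling on feature magnitudes can be illustrated with a minimal sketch. The value ranges below are hypothetical stand-ins, not figures taken from the plant data, and `StandardScaler` is assumed here as the scaling method. The scaler is fitted on the training split only, so no statistics leak from the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def make_split(n):
    """Build a toy feature matrix with very different column magnitudes (hypothetical ranges)."""
    return np.column_stack([
        rng.uniform(0, 1400, size=n),    # stand-in for AC power
        rng.uniform(0, 14000, size=n),   # stand-in for DC power
        rng.uniform(0, 1.2, size=n),     # stand-in for irradiation
    ])


X_train = make_split(200)
X_test = make_split(100)

# Fit on the training split only, then apply the same transform to the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("std before scaling:", X_train.std(axis=0))
print("std after scaling: ", X_train_scaled.std(axis=0))
```

After scaling, every training column has zero mean and unit standard deviation, so the regularised loss treats the three features on a comparable footing.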

Feature selection was performed using the drop-column method. Each feature was removed in turn, and the model was retrained to measure the change in performance relative to the full-feature baseline. A positive importance value indicates that removing the feature lowers performance, so the feature contributes to the model. A value of zero suggests the feature is redundant, and a negative value indicates the feature reduces model performance. Features were removed until all remaining features had positive importance values.
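
A minimal sketch of drop-column importance is given below, assuming importance is defined as the baseline F1-score minus the F1-score obtained after retraining without the feature. The function name `drop_column_importance`, the synthetic `signal`/`noise` features, and the 70/30 ordered split are illustrative assumptions, not the notebook's actual helper or data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def drop_column_importance(model, X_train, y_train, X_val, y_val, names):
    """Return {feature name: baseline F1 - F1 after retraining without that feature}."""

    def fit_score(cols):
        # Retrain a fresh copy of the model on the selected columns only.
        fitted = clone(model).fit(X_train[:, cols], y_train)
        return f1_score(y_val, fitted.predict(X_val[:, cols]))

    all_cols = list(range(len(names)))
    baseline = fit_score(all_cols)
    return {
        name: baseline - fit_score([c for c in all_cols if c != i])
        for i, name in enumerate(names)
    }


# Synthetic data: the label depends on "signal" only; "noise" is irrelevant.
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Ordered split (earlier rows train, later rows validate), mimicking a time-based split.
split = int(0.7 * n)
model = make_pipeline(StandardScaler(), LogisticRegression())
importance = drop_column_importance(
    model, X[:split], y[:split], X[split:], y[split:], ["signal", "noise"]
)
print(importance)
```

In this toy case the informative feature receives a clearly positive importance, while the irrelevant one stays close to zero and would be a candidate for removal.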


For Plant 1, the retained features are daily yield, ambient temperature, and irradiation.


For Plant 2, the retained features are DC power, irradiation, and the ratio of AC power to irradiation (AC/IRRA).

The effect of each feature on model predictions was calculated using accumulated local effects (ALE), a method that remains reliable even when features are correlated. For the LinearSVC model, the prediction value is a signed decision score rather than a calibrated probability: more negative values place the inverter more firmly in the Optimal class, and more positive values place it more firmly in the Suboptimal class.

For Plant 1, the ALE plots show that increases in daily yield and irradiation lead to more negative prediction values, indicating a higher likelihood of optimal performance. In contrast, higher ambient temperatures raise the prediction value, which signals a greater likelihood of suboptimal performance. In summary, the model associates optimal operation in Plant 1 with lower ambient temperatures and with sufficient irradiation, which in turn allows daily yield to increase.

For Plant 2, higher irradiation and higher DC output both move the inverter toward an optimal state. An increase in the AC-to-irradiation ratio, however, shifts the inverter toward suboptimal performance. To support optimal operation in Plant 2, locating the plant in an area with strong and consistent sunlight is beneficial.
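
The idea behind a first-order ALE curve can be sketched by hand as follows. This is a simplified illustration, not the PyALE implementation used for the plots: the function `ale_1d`, the quantile binning, and the linear `score` function standing in for a fitted decision function are all assumptions made for the example.

```python
import numpy as np


def ale_1d(predict, X, j, n_bins=10):
    """Simplified first-order ALE of feature j for a scoring function `predict`."""
    x = X[:, j]
    # Quantile bin edges, so each bin holds a similar number of samples.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)

    effects = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j] = edges[k]       # move feature j to the lower bin edge
        hi[:, j] = edges[k + 1]   # move feature j to the upper bin edge
        # Local effect: average change in score across the bin, other features untouched.
        effects[k] = np.mean(predict(hi) - predict(lo))
        counts[k] = len(rows)

    # Accumulate local effects, then centre using count-weighted bin midpoints.
    ale = np.concatenate([[0.0], np.cumsum(effects)])
    midpoints = (ale[:-1] + ale[1:]) / 2
    ale -= np.sum(midpoints * counts) / np.sum(counts)
    return edges, ale


# Two correlated features and a hypothetical linear decision score.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.8 * X[:, 0]


def score(A):
    return -2.0 * A[:, 0] + 3.0 * A[:, 1]


edges, ale = ale_1d(score, X, j=0)
print(np.round(ale, 2))
```

A downward-sloping curve, as for feature 0 here, means that larger feature values push the score in the negative direction, which under the sign convention above corresponds to the Optimal class.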

Data Quality Issues:


A key limitation was the uneven distribution between optimal and suboptimal operating states. Optimal samples are relatively rare (for example, 7,656 Optimal against 38,024 Suboptimal in Plant 1 with outliers retained), leading the model to bias predictions toward the majority class. Although threshold tuning and class weighting help, the imbalance fundamentally limits the model's ability to learn the subtle patterns that distinguish the rare Optimal class.
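
Class weighting and threshold tuning can be sketched in plain Python as follows. The weights use the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, applied to the Plant 1 counts printed in the log above. The decision scores and labels in the threshold sweep are made-up values for illustration only, and both function names are hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def best_f1_threshold(scores, labels):
    """Sweep candidate thresholds; predict class 1 when score >= threshold."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Plant 1 (with outliers) class counts from the pipeline output.
weights = balanced_class_weights({0: 7656, 1: 38024})
print(weights)

# Illustrative decision scores and true labels (not real model output).
scores = [-2.0, -1.0, -0.5, 0.2, 0.4, 1.0]
labels = [0, 0, 1, 0, 1, 1]
best_t, best_f1 = best_f1_threshold(scores, labels)
print(best_t, round(best_f1, 3))
```

The rare Optimal class receives the larger weight, and the sweep shows that the default threshold of zero is not necessarily the one that maximises F1.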


Model Assumptions and Limitations:


LinearSVC assumes that the two classes can be separated by a linear decision boundary in feature space. In reality, inverter performance is influenced by complex relationships such as non-linear efficiency curves, interactions between temperature and irradiation, and operational hysteresis effects. These relationships are difficult for a linear model to capture, which limits predictive accuracy.


Challenges:


It is difficult to identify which outliers affect the performance of the classification model, even though outliers were removed using a linear regression model.
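
One way regression-based outlier removal can work is sketched below, assuming a straight-line fit of DC power against irradiation and a cut-off of three residual standard deviations. The cut-off, the synthetic data, and the function name `residual_outlier_mask` are assumptions for illustration; the notebook's own helper may differ.

```python
import numpy as np


def residual_outlier_mask(x, y, k=3.0):
    """Flag points whose residual from a straight-line fit exceeds k residual std devs."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return np.abs(residuals) > k * residuals.std()


# Synthetic example: DC power roughly proportional to irradiation, plus noise.
rng = np.random.default_rng(0)
irradiation = np.linspace(0.1, 1.0, 100)
dc_power = 12000.0 * irradiation + rng.normal(0, 100, size=100)
dc_power[90] -= 8000.0  # inject one underperforming reading

mask = residual_outlier_mask(irradiation, dc_power)
print("flagged indices:", np.flatnonzero(mask))
```

A rule like this flags points that deviate from the fitted line, but it says nothing about whether those same points help or hurt the downstream classifier, which is the difficulty described above.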



Data Collection and Quality Improvements:


Increase representative coverage of the minority class. The model's difficulty in detecting Optimal states stems largely from the limited number of Optimal examples, which could be improved by collecting more data under optimal operating conditions.


Alternative Modelling Approaches to Improve Prediction:


Even though LinearSVC performs reasonably, alternative models that can capture non-linearities or provide complementary insights are worth considering. Options such as RBF-SVM, XGBoost, and Logistic Regression with engineered polynomial features can capture non-linear feature interactions.
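
The benefit of engineered polynomial features can be shown with a small sketch on synthetic data, where the label depends purely on the interaction between two features. The data, the degree-2 expansion, and the ordered 600/200 split are illustrative assumptions, not results from the plant datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic interaction problem: the class depends on the sign of x0 * x1,
# which no straight line in (x0, x1) can separate.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(800, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
split = 600

linear = make_pipeline(StandardScaler(), LogisticRegression())
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds x0^2, x0*x1, x1^2
    StandardScaler(),
    LogisticRegression(),
)

acc_linear = linear.fit(X[:split], y[:split]).score(X[split:], y[split:])
acc_poly = poly.fit(X[:split], y[:split]).score(X[split:], y[split:])
print(f"linear features:     {acc_linear:.2f}")
print(f"polynomial features: {acc_poly:.2f}")
```

The classifier itself is unchanged; only the feature space is expanded, so the coefficients remain interpretable while the decision boundary becomes non-linear in the original features.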


Real-World Application:


One application is to use the model's results to identify the factors that drive inverters into suboptimal operation, so that corresponding measures can be taken to prevent it.