<a href="https://colab.research.google.com/github/Mabinogit/AI-Image-Classification/blob/main/Project_Creating_Features(solar_fault_prediction).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

## Project: Power Generation Unit Performance Anomaly Detection

**Project Goal:** To identify abnormal or suboptimal operating conditions in individual power generation units (e.g., wind turbines, solar inverters, or thermal power plant components) using sensor data and operational logs. This can enable predictive maintenance, optimize energy output, and prevent costly failures.

**Problem Type:** Anomaly Detection (unsupervised or semi-supervised) or Classification (if you have historical labels for "optimal" vs. "suboptimal" operation). We'll lean towards **Anomaly Detection**, which is often more realistic in real-world scenarios where "faulty" labels might be scarce.

**Dataset:**
This project relies heavily on time-series sensor data from a single generation unit or a fleet of similar units.

* **Wind Turbine SCADA (Supervisory Control and Data Acquisition) Data:** This is an excellent choice. Many open datasets exist on platforms like Kaggle (e.g., "Wind Turbine Scada Dataset"). These typically contain measurements like:
    * `timestamp`
    * `wind_speed`
    * `wind_direction`
    * `power_output_kW`
    * `rotor_speed`
    * `generator_speed`
    * `blade_pitch_angle`
    * `nacelle_temp`
    * `gearbox_temp`
    * `vibration_sensors_readings` (if available)
    * `ambient_temp`
* **Solar PV Plant Data:** Data from solar farms (irradiance, module temperature, inverter output).
* **Simulated or Publicly Available Conventional Power Plant Component Data:** Data from sensors on pumps, turbines, or boilers (pressure, temperature, flow rates, vibration).

For this example, let's assume we're working with **Wind Turbine SCADA Data**.

**Core Features (Example Dataset Columns):**

* `timestamp`: Date and time of the sensor reading.
* `turbine_id`: Unique identifier (if multiple turbines).
* `wind_speed_mps`: Wind speed in meters per second.
* `power_output_kW`: Actual power generated by the turbine.
* `rotor_rpm`: Rotations per minute of the rotor.
* `blade_pitch_angle_deg`: Angle of the turbine blades.
* `gearbox_temp_c`: Temperature of the gearbox.
* `ambient_temp_c`: Ambient air temperature.
* `generator_current_amps`: Current drawn by the generator.
* `vibration_x`, `vibration_y`, `vibration_z`: Readings from vibration sensors.

---

### Feature Engineering Tasks by Topic:

**1. Mathematical Transforms**

* **Power Curve Deviation:** This is critical for wind turbines. A theoretical power curve maps `wind_speed` to expected `power_output_kW`. Calculate the deviation of actual power output from this theoretical curve. Large deviations could indicate underperformance or a fault.
    * `expected_power = f(wind_speed)` (You might need to fit a regression model or use a provided curve).
    * `power_deviation = power_output_kW - expected_power`
* **Efficiency Ratios:**
    * `mechanical_efficiency = power_output_kW / (rotor_rpm * blade_pitch_angle_deg)` (Simplified, but the idea is a ratio of output to input/control).
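    * `thermal_ratio = gearbox_temp_c / ambient_temp_c` (Ratio of internal component temperature to external temperature, which can point to heat dissipation issues. In Celsius this ratio is undefined at 0 °C and changes sign below it, so convert to Kelvin first or use the difference `gearbox_temp_c - ambient_temp_c` instead.)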
* **Log/Square Root Transforms:** Apply to skewed sensor readings (e.g., `vibration_x`, `generator_current_amps`) to normalize their distribution, which can help clustering or distance-based anomaly detection algorithms.
* **Polynomial Features:** For relationships like temperature and component wear, or wind speed and vibration, polynomial terms (`wind_speed^2`, `gearbox_temp_c^2`) might capture non-linear effects leading to anomalies.
* **Rate of Change (Derivatives):** Calculate the difference in sensor readings over time (e.g., 5-minute or 1-hour intervals). Rapid changes can signify an issue.
    * `d_gearbox_temp_dt = gearbox_temp_c - gearbox_temp_c.shift(1)`
    * `d_power_output_dt = power_output_kW - power_output_kW.shift(1)`
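
As a minimal sketch of the power curve deviation and rate-of-change features above: the data here is synthetic, the cubic-then-flat curve and the 2000 kW rating are made-up assumptions, and the "expected" curve is estimated as the median power per 1 m/s wind-speed bin rather than taken from a manufacturer curve.

```python
import numpy as np
import pandas as pd

# Synthetic SCADA-like frame; column names follow the list above, values are made up.
rng = np.random.default_rng(0)
n = 500
df_scada = pd.DataFrame({
    "wind_speed_mps": rng.uniform(3, 15, n),
    "gearbox_temp_c": 60 + rng.normal(0, 2, n).cumsum() * 0.1,
})
# Hypothetical cubic-then-flat curve, capped at an assumed rated power of 2000 kW.
true_power = np.minimum(2000.0, 0.6 * df_scada["wind_speed_mps"] ** 3)
df_scada["power_output_kW"] = true_power + rng.normal(0, 30, n)

# Empirical power curve: median power within 1 m/s wind-speed bins.
speed_bin = df_scada["wind_speed_mps"].round()
df_scada["expected_power"] = (
    df_scada.groupby(speed_bin)["power_output_kW"].transform("median")
)
df_scada["power_deviation"] = df_scada["power_output_kW"] - df_scada["expected_power"]

# Rate of change: first difference between consecutive readings.
df_scada["d_gearbox_temp_dt"] = df_scada["gearbox_temp_c"].diff()
```

With real data, a fitted regression or the manufacturer's curve would likely replace the binned median, and the difference should be divided by the sampling interval if readings are not evenly spaced.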

**2. Counts**

* **Count of "Stall" Events:** Define a "stall" condition (e.g., `rotor_rpm` very low despite high `wind_speed`). Count the occurrences of such events over a sliding window (e.g., last 24 hours).
* **Count of "Over-Temperature" Events:** Count how many times `gearbox_temp_c` exceeded a safe threshold in the last few hours or days.
* **Count of "High Vibration" Events:** Similar to temperature, count spikes in vibration readings.
* **Consecutive "Off-Nominal" Readings:** Count how many consecutive readings fall outside a normal range for specific parameters. This can indicate a persistent problem rather than a transient spike.
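
The count features above could be sketched as follows. The series is a toy example at 10-minute resolution, and the 80 °C limit is an illustrative assumption, not a manufacturer value.

```python
import pandas as pd

# Toy 10-minute gearbox temperature series (values are made up).
idx = pd.date_range("2024-01-01", periods=12, freq="10min")
temps = pd.Series(
    [70, 82, 85, 79, 81, 83, 84, 70, 69, 90, 91, 60], index=idx, dtype=float
)
over_temp = temps > 80.0  # assumed safe threshold

# Number of over-temperature readings in the trailing 60 minutes (time-based window).
over_temp_count_1h = over_temp.astype(int).rolling("60min").sum()

# Length of the current run of consecutive over-temperature readings:
# a new run id starts whenever the flag changes value.
run_id = (over_temp != over_temp.shift()).cumsum()
consecutive_over_temp = over_temp.astype(int).groupby(run_id).cumsum()
```

The run-length counter resets to zero as soon as a reading returns to the normal range, which is what separates a persistent problem from a transient spike.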

**3. Building-Up and Breaking-Down Features**

* **Break Down `timestamp`:**
    * `hour_of_day`, `day_of_week`, `month`, `season` (as cyclic features, e.g., using sine/cosine transformations for `hour_of_day` and `day_of_year` to capture periodicity: `sin(2 * pi * hour / 24)`).
    * `is_daylight_hours` (binary based on sunrise/sunset, or just `hour_of_day` for simplicity).
* **Categorizing Continuous Variables:** If a sensor reading has natural thresholds, convert it into categorical bins. For example, `wind_speed_category` (e.g., "Low Wind", "Optimal Wind", "High Wind"). Then, one-hot encode these.
* **Building Up Operation States:** Create combined categorical features representing the "state" of the turbine.
    * Example: `operation_state = { "HighWind_OptimalPitch", "LowWind_Idle", "OptimalWind_HighPower" }`
    * This could involve combining `wind_speed_category` and `blade_pitch_angle_deg` ranges.
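
A short sketch of the timestamp breakdown and binning ideas above, on a synthetic index. The bin edges (4 and 12 m/s) are illustrative assumptions; real cut-in and rated speeds depend on the turbine.

```python
import numpy as np
import pandas as pd

ts = pd.date_range("2024-06-01", periods=48, freq="30min")
feats = pd.DataFrame({"wind_speed_mps": np.linspace(0, 23.5, 48)}, index=ts)

# Cyclic encoding so that 23:30 and 00:00 end up close together.
hour = feats.index.hour.to_numpy() + feats.index.minute.to_numpy() / 60
feats["hour_sin"] = np.sin(2 * np.pi * hour / 24)
feats["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Bin a continuous reading into named categories, then one-hot encode.
feats["wind_speed_category"] = pd.cut(
    feats["wind_speed_mps"],
    bins=[-np.inf, 4, 12, np.inf],
    labels=["Low Wind", "Optimal Wind", "High Wind"],
)
wind_dummies = pd.get_dummies(feats["wind_speed_category"], prefix="wind")
```

Using both sine and cosine matters: either one alone maps two different hours to the same value.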

**4. Group Transforms**

* **Rolling Window Statistics:** This is crucial for anomaly detection in time series. For each sensor reading, calculate:
    * **Moving Average:** `power_output_kW.rolling(window=10).mean()` (a 10-minute average if readings arrive once per minute; an integer window counts rows, not minutes)
    * **Moving Standard Deviation:** `gearbox_temp_c.rolling(window=20).std()` (captures volatility)
    * **Moving Min/Max:** `rotor_rpm.rolling(window=5).min()` / `.max()`
    * **Moving Skewness/Kurtosis:** To detect changes in the shape of the data distribution over time.
* **Group by `turbine_id` (if multiple turbines):**
    * **Deviation from Fleet Average:** For a given `wind_speed`, how does *this* turbine's `power_output_kW` compare to the average of *all other* turbines at that same `wind_speed`? This requires a bit more complex grouping (e.g., a multi-level group-by or a lookup table).
    * `current_power_dev_from_fleet_avg_at_wind_speed = power_output_kW - df.groupby('wind_speed_bin')['power_output_kW'].transform('mean')`
* **Operating Regime Aggregations:** Group data by `operation_state` (if created) and calculate average sensor readings for each state. This helps establish "normal" ranges for different operational modes.
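
One possible way to compute the group transforms above, assuming a synthetic three-turbine fleet with a made-up linear power response. Note that the fleet mean in this sketch includes the turbine itself; excluding it would need a leave-one-out mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
fleet = pd.DataFrame({
    "turbine_id": np.repeat(["T1", "T2", "T3"], 100),
    "wind_speed_mps": rng.uniform(3, 15, 300),
})
fleet["power_output_kW"] = 100 * fleet["wind_speed_mps"] + rng.normal(0, 20, 300)

# Rolling statistics computed per turbine so windows never mix units.
grouped = fleet.groupby("turbine_id")["power_output_kW"]
fleet["power_roll_mean"] = grouped.transform(
    lambda s: s.rolling(10, min_periods=1).mean()
)
fleet["power_roll_std"] = grouped.transform(
    lambda s: s.rolling(10, min_periods=2).std()
)

# Deviation from the fleet mean within 1 m/s wind-speed bins.
fleet["wind_speed_bin"] = fleet["wind_speed_mps"].round()
fleet_mean = fleet.groupby("wind_speed_bin")["power_output_kW"].transform("mean")
fleet["power_dev_from_fleet"] = fleet["power_output_kW"] - fleet_mean
```

Pandas rolling windows are trailing by default, so each value only uses the current and earlier rows of the same turbine.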

**5. Combine and Transform Features**

* **Temperature-Corrected Vibration:** `vibration_x / gearbox_temp_c`. A high vibration at low temperatures might be more anomalous than at high temperatures due to expansion/contraction.
* **Power-to-Wind-Speed Ratio:** `power_output_kW / wind_speed_mps`. This can be a simplified measure of efficiency, especially at optimal wind speeds.
* **Difference from Rolling Baseline:** `gearbox_temp_c - gearbox_temp_c.rolling(window=100).median()`. This highlights deviations from the recent typical behavior.
* **Lagged Interactions:** Multiply a current sensor reading by a lagged reading of another sensor. For example, `power_output_kW * gearbox_temp_c.shift(5)` might reveal interactions that lead to anomalies.
* **Custom Anomaly Scores (based on domain knowledge):** If you know that `(high_vibration AND high_gearbox_temp)` is a strong indicator of a problem, create a boolean or numerical feature for this specific combination.
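
The rolling-baseline difference and the domain-rule flag above might look like this. The data is synthetic with one injected joint spike, and both thresholds (5 °C above baseline, vibration above 2.0) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
sensors = pd.DataFrame({
    "gearbox_temp_c": 60 + rng.normal(0, 1, 300),
    "vibration_x": np.abs(rng.normal(1.0, 0.2, 300)),
})
# Inject one joint temperature and vibration spike at row 250.
sensors.loc[250, ["gearbox_temp_c", "vibration_x"]] = [75.0, 3.0]

# Difference from a trailing median; shift(1) keeps the current
# reading out of its own baseline.
baseline = sensors["gearbox_temp_c"].rolling(100, min_periods=20).median().shift(1)
sensors["temp_minus_baseline"] = sensors["gearbox_temp_c"] - baseline

# Domain-rule flag: hot relative to baseline AND high vibration.
sensors["hot_and_shaky"] = (
    (sensors["temp_minus_baseline"] > 5.0) & (sensors["vibration_x"] > 2.0)
).astype(int)
```

Rows without enough history have a missing baseline and are left unflagged, since comparisons against NaN evaluate to False.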

---

### Project Steps:

1.  **Data Acquisition and Preprocessing:** Get your wind turbine (or other generation unit) SCADA data. Handle missing values (interpolation is common for time series), outliers, and ensure timestamps are correctly parsed and set as index.
2.  **Exploratory Data Analysis (EDA):** Visualize time series for various sensors. Look for trends, seasonality, sudden jumps, and correlations between features. Plot power curves (`power_output_kW` vs. `wind_speed`).
3.  **Implement Feature Engineering:** Systematically create all the features outlined above. Be mindful of look-ahead bias if you use rolling windows – ensure you only use past data.
4.  **Feature Scaling:** Scaling (e.g., `StandardScaler`) is needed for distance-based anomaly detectors such as One-Class SVM and LOF, and for autoencoders. Tree-based methods such as Isolation Forest are largely insensitive to feature scale.
5.  **Anomaly Detection Model:**
    * **Unsupervised:** Algorithms like Isolation Forest, One-Class SVM, Local Outlier Factor (LOF), or Autoencoders are common. Train the model on the *assumed normal* operating data (or a representative subset).
    * **Semi-supervised (if you have a few labeled anomalies):** You could use these labels to fine-tune thresholding for unsupervised methods or even train a binary classifier if you have enough examples of "normal" and "anomalous" periods.
6.  **Anomaly Thresholding and Evaluation:**
    * For unsupervised models, you'll get an "anomaly score." You'll need to determine a threshold to classify points as anomalous. This often involves domain expertise or examining the distribution of scores.
    * Evaluate the effectiveness of your anomaly detection (e.g., using precision, recall, F1-score if you have ground truth labels, or qualitatively by reviewing flagged anomalies).
7.  **Interpretation and Root Cause Analysis:**
    * Once an anomaly is detected, which features contributed most to its identification? This can help pinpoint the likely cause of the suboptimal performance (e.g., high gearbox temperature, unusual blade pitch).
    * Visualize the anomalous periods with the engineered features to confirm intuition.
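
Steps 4 to 6 could be sketched roughly as below, using Isolation Forest as one of the unsupervised options named above. The feature matrix is random stand-in data, and the 1st-percentile threshold is an arbitrary assumption that would normally be tuned with domain input.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_normal = rng.normal(0, 1, size=(500, 4))   # stand-in for engineered features
X_new = np.vstack([
    rng.normal(0, 1, size=(20, 4)),          # ordinary new readings
    np.full((1, 4), 8.0),                    # one clearly extreme reading
])

# Fit the scaler and the model on assumed-normal data only.
scaler = StandardScaler().fit(X_normal)
model = IsolationForest(n_estimators=200, random_state=0)
model.fit(scaler.transform(X_normal))

# score_samples: lower means more anomalous.
train_scores = model.score_samples(scaler.transform(X_normal))
threshold = np.quantile(train_scores, 0.01)  # assumed cut-off
new_scores = model.score_samples(scaler.transform(X_new))
is_anomaly = new_scores < threshold
```

Scaling is kept here so the same pipeline also works if the model is swapped for a distance-based detector.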

This project offers a deep dive into time-series feature engineering specifically for identifying operational issues in crucial energy generation assets, moving beyond simple forecasting into actionable insights for optimization and maintenance.

# Preprocessing

In [3]:
import pandas as pd

In [4]:
df = pd.read_excel('/content/Train(B).xlsx')

In [5]:
df.head()

Unnamed: 0,I1,I2,I1MAX,I1MIN,I1VAR,I2MAX,I2MIN,I2VAR,I3,I4,...,Vdcmax1,Vdcmin1,Pdcmean1,IR,T,range 1,range 2,range 3,range 4,class
0,3.464132,3.464132,3.776515,3.433125,5.4e-05,3.776515,3.433125,5.4e-05,3.75505,3.75505,...,514.369415,509.002655,169.603373,660,22,0.34339,0.34339,1.33227e-14,5.77316e-15,1
1,2.244766,2.244766,2.611714,2.210978,0.00035,2.611714,2.210978,0.00035,2.572122,2.572122,...,521.970593,503.241431,116.954097,450,15,0.400736,0.400736,1.55431e-14,0.0,1
2,3.87836,3.87836,4.101282,3.854098,1e-05,4.101282,3.854098,1e-05,4.094577,4.094577,...,511.837039,509.378912,184.213307,720,24,0.247183,0.247183,2.13163e-14,7.10543e-15,1
3,2.816389,2.816389,3.285793,2.758149,0.000192,3.285793,2.758149,0.000192,3.247759,3.247759,...,518.028478,507.378348,147.283635,570,19,0.527644,0.527644,1.19904e-14,1.33227e-14,1
4,1.238798,1.238798,5.455667,0.141519,0.001153,5.455667,0.141519,0.001153,5.46564,5.46564,...,500.827418,500.127036,238.608844,960,32,5.314147,5.314147,2.44249e-14,2.04281e-14,1


In [6]:
# build mutual information function

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
import pandas as pd

def compute_mutual_information(X, y, problem_type ='classification', discrete_features='auto', normalize=False):
    """
    Computes mutual information scores between each feature in X and the target y.

    Parameters:
    -----------
    X : pd.DataFrame
        Feature matrix
    y : pd.Series or np.array
        Target variable
    problem_type : str
        'classification' or 'regression'
    discrete_features : bool, array-like, or 'auto'
        Whether to consider features as discrete
    normalize : bool
        If True, normalize scores between 0 and 1

    Returns:
    --------
    pd.Series
        Mutual information scores for each feature
    """

    if problem_type == 'classification':
        mi = mutual_info_classif(X, y, discrete_features=discrete_features)
    elif problem_type == 'regression':
        mi = mutual_info_regression(X, y, discrete_features=discrete_features)
    else:
        raise ValueError("problem_type must be 'classification' or 'regression'")

    mi_series = pd.Series(mi, index=X.columns).sort_values(ascending=False)

    if normalize:
        mi_series = mi_series / mi_series.max()

    return mi_series
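
One thing worth knowing about the estimator used above: for continuous features, scikit-learn's mutual information estimate involves a small random perturbation, so scores can vary slightly between runs unless `random_state` is set. A minimal sketch on synthetic data (the feature names `informative` and `noise` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(4)
y_demo = rng.integers(0, 2, 400)
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(0, 0.3, 400),  # depends on the label
    "noise": rng.normal(0, 1, 400),                   # independent of the label
})

# Fixing random_state makes repeated calls return identical scores.
mi_a = mutual_info_classif(X_demo, y_demo, random_state=0)
mi_b = mutual_info_classif(X_demo, y_demo, random_state=0)
mi_demo = pd.Series(mi_a, index=X_demo.columns).sort_values(ascending=False)
```

Passing a `random_state` through `compute_mutual_information` would be one way to make the rankings in the cells below reproducible.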



Which variables best help us predict the fault class?

In [7]:
X = df.drop('class', axis=1)
y = df['class']

# Compute mutual information, assuming a classification problem based on the 'class' column
mi_scores = compute_mutual_information(X, y, problem_type='classification')
display(mi_scores)

# A high MI score means that knowing the value of a feature makes it much easier to determine which class (0, 1, 2, etc.) a sample belongs to.

Unnamed: 0,0
range 3,0.941825
range 1,0.727036
range 2,0.700252
range 4,0.627684
Itotalmax1,0.472632
Itotal1,0.354826
Itotalmin1,0.352785
I1,0.238922
I1MIN,0.226105
I2VAR,0.190626


Which variables are closely related to class 0?

In [8]:
class0 = df.copy()
class0['final class'] = (df['class'] == 0).astype(int)
display(df.head())

Unnamed: 0,I1,I2,I1MAX,I1MIN,I1VAR,I2MAX,I2MIN,I2VAR,I3,I4,...,Vdcmax1,Vdcmin1,Pdcmean1,IR,T,range 1,range 2,range 3,range 4,class
0,3.464132,3.464132,3.776515,3.433125,5.4e-05,3.776515,3.433125,5.4e-05,3.75505,3.75505,...,514.369415,509.002655,169.603373,660,22,0.34339,0.34339,1.33227e-14,5.77316e-15,1
1,2.244766,2.244766,2.611714,2.210978,0.00035,2.611714,2.210978,0.00035,2.572122,2.572122,...,521.970593,503.241431,116.954097,450,15,0.400736,0.400736,1.55431e-14,0.0,1
2,3.87836,3.87836,4.101282,3.854098,1e-05,4.101282,3.854098,1e-05,4.094577,4.094577,...,511.837039,509.378912,184.213307,720,24,0.247183,0.247183,2.13163e-14,7.10543e-15,1
3,2.816389,2.816389,3.285793,2.758149,0.000192,3.285793,2.758149,0.000192,3.247759,3.247759,...,518.028478,507.378348,147.283635,570,19,0.527644,0.527644,1.19904e-14,1.33227e-14,1
4,1.238798,1.238798,5.455667,0.141519,0.001153,5.455667,0.141519,0.001153,5.46564,5.46564,...,500.827418,500.127036,238.608844,960,32,5.314147,5.314147,2.44249e-14,2.04281e-14,1


In [9]:
X = class0.drop(['class', 'final class'], axis=1)
y = class0['final class']

# Compute mutual information against the binary 'final class' label (class 0 vs. rest)
mi_scores = compute_mutual_information(X, y, problem_type='classification')
display(mi_scores)

Unnamed: 0,0
range 1,0.363063
range 3,0.180553
range 2,0.130572
range 4,0.087271
I1,0.056178
I1MIN,0.048705
Itotalmax1,0.046499
I2MIN,0.044659
Itotalmin1,0.043898
I1VAR,0.040229


Which variables are closely related to class 1?

In [10]:
class1 = df.copy()
class1['final class'] = (df['class'] == 1).astype(int)
display(class1.head())

Unnamed: 0,I1,I2,I1MAX,I1MIN,I1VAR,I2MAX,I2MIN,I2VAR,I3,I4,...,Vdcmin1,Pdcmean1,IR,T,range 1,range 2,range 3,range 4,class,final class
0,3.464132,3.464132,3.776515,3.433125,5.4e-05,3.776515,3.433125,5.4e-05,3.75505,3.75505,...,509.002655,169.603373,660,22,0.34339,0.34339,1.33227e-14,5.77316e-15,1,1
1,2.244766,2.244766,2.611714,2.210978,0.00035,2.611714,2.210978,0.00035,2.572122,2.572122,...,503.241431,116.954097,450,15,0.400736,0.400736,1.55431e-14,0.0,1,1
2,3.87836,3.87836,4.101282,3.854098,1e-05,4.101282,3.854098,1e-05,4.094577,4.094577,...,509.378912,184.213307,720,24,0.247183,0.247183,2.13163e-14,7.10543e-15,1,1
3,2.816389,2.816389,3.285793,2.758149,0.000192,3.285793,2.758149,0.000192,3.247759,3.247759,...,507.378348,147.283635,570,19,0.527644,0.527644,1.19904e-14,1.33227e-14,1,1
4,1.238798,1.238798,5.455667,0.141519,0.001153,5.455667,0.141519,0.001153,5.46564,5.46564,...,500.127036,238.608844,960,32,5.314147,5.314147,2.44249e-14,2.04281e-14,1,1


In [11]:
X = class1.drop(['class', 'final class'], axis=1)
y = class1['final class']

# Compute mutual information against the binary 'final class' label (class 1 vs. rest)
mi_scores = compute_mutual_information(X, y, problem_type='classification')
display(mi_scores)

Unnamed: 0,0
range 2,0.538422
Itotalmax1,0.459558
Itotal1,0.412858
Itotalmin1,0.408663
range 3,0.262122
range 1,0.220151
I2MIN,0.214094
I6,0.1865
I5,0.1865
IR,0.176477


Which variables are closely related to class 2?

In [12]:
class2 = df.copy()
class2['final class'] = (df['class'] == 2).astype(int)
display(class2.head())

Unnamed: 0,I1,I2,I1MAX,I1MIN,I1VAR,I2MAX,I2MIN,I2VAR,I3,I4,...,Vdcmin1,Pdcmean1,IR,T,range 1,range 2,range 3,range 4,class,final class
0,3.464132,3.464132,3.776515,3.433125,5.4e-05,3.776515,3.433125,5.4e-05,3.75505,3.75505,...,509.002655,169.603373,660,22,0.34339,0.34339,1.33227e-14,5.77316e-15,1,0
1,2.244766,2.244766,2.611714,2.210978,0.00035,2.611714,2.210978,0.00035,2.572122,2.572122,...,503.241431,116.954097,450,15,0.400736,0.400736,1.55431e-14,0.0,1,0
2,3.87836,3.87836,4.101282,3.854098,1e-05,4.101282,3.854098,1e-05,4.094577,4.094577,...,509.378912,184.213307,720,24,0.247183,0.247183,2.13163e-14,7.10543e-15,1,0
3,2.816389,2.816389,3.285793,2.758149,0.000192,3.285793,2.758149,0.000192,3.247759,3.247759,...,507.378348,147.283635,570,19,0.527644,0.527644,1.19904e-14,1.33227e-14,1,0
4,1.238798,1.238798,5.455667,0.141519,0.001153,5.455667,0.141519,0.001153,5.46564,5.46564,...,500.127036,238.608844,960,32,5.314147,5.314147,2.44249e-14,2.04281e-14,1,0


In [13]:
X = class2.drop(['class', 'final class'], axis=1)
y = class2['final class']

# Compute mutual information against the binary 'final class' label (class 2 vs. rest)
mi_scores = compute_mutual_information(X, y, problem_type='classification')
display(mi_scores)

Unnamed: 0,0
range 3,0.407172
range 1,0.22165
range 2,0.18371
range 4,0.121716
Itotalmax1,0.092608
I1,0.065592
I2VAR,0.058764
I1MIN,0.057013
Itotalmin1,0.055324
I2,0.03695


# Building AdaBoost models (Baseline)

In [18]:
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

The first stage only needs to separate faulty from healthy samples, so the multi-class `class` column is collapsed into a binary fault / no-fault label.

In [19]:
data = df.copy()
data['final class'] = (df['class'] != 0).astype(int)
X = data.drop(['class','final class'], axis=1)
y = data['final class']

display(X.head())

Unnamed: 0,I1,I2,I1MAX,I1MIN,I1VAR,I2MAX,I2MIN,I2VAR,I3,I4,...,Vdcmean1,Vdcmax1,Vdcmin1,Pdcmean1,IR,T,range 1,range 2,range 3,range 4
0,3.464132,3.464132,3.776515,3.433125,5.4e-05,3.776515,3.433125,5.4e-05,3.75505,3.75505,...,513.522799,514.369415,509.002655,169.603373,660,22,0.34339,0.34339,1.33227e-14,5.77316e-15
1,2.244766,2.244766,2.611714,2.210978,0.00035,2.611714,2.210978,0.00035,2.572122,2.572122,...,516.959237,521.970593,503.241431,116.954097,450,15,0.400736,0.400736,1.55431e-14,0.0
2,3.87836,3.87836,4.101282,3.854098,1e-05,4.101282,3.854098,1e-05,4.094577,4.094577,...,511.459214,511.837039,509.378912,184.213307,720,24,0.247183,0.247183,2.13163e-14,7.10543e-15
3,2.816389,2.816389,3.285793,2.758149,0.000192,3.285793,2.758149,0.000192,3.247759,3.247759,...,515.75792,518.028478,507.378348,147.283635,570,19,0.527644,0.527644,1.19904e-14,1.33227e-14
4,1.238798,1.238798,5.455667,0.141519,0.001153,5.455667,0.141519,0.001153,5.46564,5.46564,...,500.334599,500.827418,500.127036,238.608844,960,32,5.314147,5.314147,2.44249e-14,2.04281e-14


Stage 1: AdaBoost model trained on the full dataset to classify:

* Fault
* No Fault

In [20]:
# Stage 1: AdaBoost fault detector (binary)
ada_fault_detector = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)
ada_fault_detector.fit(X, y)  # labels: 0 = no fault, 1 = fault

In [21]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import numpy as np
# Test the model. Note: this reloads the training file, so the scores below
# measure fit on data the model has already seen, not generalization.
test = pd.read_excel('/content/Train(B).xlsx')

test['final class'] = (test['class'] != 0).astype(int)
X_test = test.drop(['class', 'final class'], axis=1)
y_test = test['final class']


# Step 1: Run stage 1 model (binary classification)
stage1_preds = ada_fault_detector.predict(X_test)

# Step 2: Evaluate Stage 1
print("=== Stage 1: Fault/No Fault ===")
print(confusion_matrix(y_test, stage1_preds))
print(classification_report(y_test, stage1_preds))

=== Stage 1: Fault/No Fault ===
[[100   0]
 [  0 500]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       100
           1       1.00      1.00      1.00       500

    accuracy                           1.00       600
   macro avg       1.00      1.00      1.00       600
weighted avg       1.00      1.00      1.00       600



In [22]:
# Stage 2: Fault type classifier (only on fault samples)
# Extract fault samples

fault = df.copy()
fault = fault[fault['class'] != 0]
display(fault.head())

X_fault = fault.drop(['class'], axis=1)
y_fault = fault['class']

Unnamed: 0,I1,I2,I1MAX,I1MIN,I1VAR,I2MAX,I2MIN,I2VAR,I3,I4,...,Vdcmax1,Vdcmin1,Pdcmean1,IR,T,range 1,range 2,range 3,range 4,class
0,3.464132,3.464132,3.776515,3.433125,5.4e-05,3.776515,3.433125,5.4e-05,3.75505,3.75505,...,514.369415,509.002655,169.603373,660,22,0.34339,0.34339,1.33227e-14,5.77316e-15,1
1,2.244766,2.244766,2.611714,2.210978,0.00035,2.611714,2.210978,0.00035,2.572122,2.572122,...,521.970593,503.241431,116.954097,450,15,0.400736,0.400736,1.55431e-14,0.0,1
2,3.87836,3.87836,4.101282,3.854098,1e-05,4.101282,3.854098,1e-05,4.094577,4.094577,...,511.837039,509.378912,184.213307,720,24,0.247183,0.247183,2.13163e-14,7.10543e-15,1
3,2.816389,2.816389,3.285793,2.758149,0.000192,3.285793,2.758149,0.000192,3.247759,3.247759,...,518.028478,507.378348,147.283635,570,19,0.527644,0.527644,1.19904e-14,1.33227e-14,1
4,1.238798,1.238798,5.455667,0.141519,0.001153,5.455667,0.141519,0.001153,5.46564,5.46564,...,500.827418,500.127036,238.608844,960,32,5.314147,5.314147,2.44249e-14,2.04281e-14,1


In [30]:
# second model
# This model will classify between the three fault types (classes 1, 2, 3)
multi_class_fault_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # you can adjust this
    n_estimators=100,
    algorithm='SAMME'  # 'SAMME.R' is deprecated in recent scikit-learn releases
)

multi_class_fault_model.fit(X_fault, y_fault)



In [40]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import numpy as np
# test model
test = pd.read_excel('/content/test(B).xlsx')

# Keep only fault samples (drop class 0), since stage 2 only ever sees faults
test = test[test['class'] != 0]
X_test= test.drop(['class'], axis=1)
y_test= test['class']


(75, 30)

In [41]:
# Step 1: Run the stage 2 model (fault type classification)
stage2_preds = multi_class_fault_model.predict(X_test)

# Step 2: Evaluate stage 2
print("=== Stage 2: Fault Type ===")
print(confusion_matrix(y_test, stage2_preds))
print(classification_report(y_test, stage2_preds))

=== Stage 2: Fault Type ===
[[25  0  0]
 [ 0 25  0]
 [ 0  0 25]]
              precision    recall  f1-score   support

           1       1.00      1.00      1.00        25
           2       1.00      1.00      1.00        25
           3       1.00      1.00      1.00        25

    accuracy                           1.00        75
   macro avg       1.00      1.00      1.00        75
weighted avg       1.00      1.00      1.00        75



In [None]:
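def predict_fault_type(x):
    """Run the two-stage pipeline on one sample (a pd.Series of features)."""
    # Wrap the sample in a one-row DataFrame so feature names match training
    sample = pd.DataFrame([x])

    # Stage 1: Detect fault
    fault_pred = ada_fault_detector.predict(sample)[0]

    if fault_pred == 1:  # Fault detected
        # Stage 2: Classify fault type
        fault_type_pred = multi_class_fault_model.predict(sample)[0]
        return f"Fault detected: Type {fault_type_pred}"
    else:
        return "No Fault detected"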


AdaBoost final system

In [58]:
predict_fault_type(df.iloc[300].drop('class'))



'Fault detected: Type 2'

# Download model

In [None]:
import joblib

# Save the Stage 1 fault detector model
filename_fault_detector = 'ada_fault_detector_model.joblib'
joblib.dump(ada_fault_detector, filename_fault_detector)

# Save the Stage 2 fault type classifier model
filename_fault_classifier = 'multi_class_fault_model.joblib'
joblib.dump(multi_class_fault_model, filename_fault_classifier)

print(f"Models saved as {filename_fault_detector} and {filename_fault_classifier}")
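
For completeness, a saved model can be restored with `joblib.load`. The sketch below round-trips a tiny stand-in model through a temporary directory; in the notebook the path would instead be one of the `.joblib` files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model; any fitted scikit-learn estimator works the same way.
X_tiny = np.array([[0.0], [1.0], [2.0], [3.0]])
y_tiny = np.array([0, 0, 1, 1])
model_out = DecisionTreeClassifier(random_state=0).fit(X_tiny, y_tiny)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model_out, path)
    model_in = joblib.load(path)  # only load files from sources you trust

# The restored model should reproduce the original predictions.
same_predictions = bool(
    (model_in.predict(X_tiny) == model_out.predict(X_tiny)).all()
)
```

Joblib files are pickle-based, so they should be loaded with the same (or a compatible) scikit-learn version that wrote them.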