### Dependency Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

from sklearn.model_selection import train_test_split
from scipy.stats import chi2_contingency

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, precision_recall_curve, auc
from sklearn.utils.class_weight import compute_class_weight

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier



from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

In [2]:
df= pd.read_csv("C:/sai files/projects/predictive-maintenance-end2end/test.csv")

### 1. Preprocessing Strategy and Experiment Design 

This preprocessing phase is designed to **operationalize insights from EDA** and prepare the dataset for systematic machine learning experimentation. The steps below clearly separate **mandatory preprocessing** from **experimental preprocessing**, ensuring clarity, reproducibility, and fair comparison across modeling approaches.

---
**1. EDA Summary and Preprocessing Rationale**

Exploratory analysis established that:
- Machine failures are **rare and highly imbalanced**.
- Failure events arise from **localized, non-linear operational regimes** driven by feature interactions and thresholds.
- Individual failure mode indicators are **not available at inference time** and therefore cannot be used directly for prediction.
- Linear correlations between individual features and the failure target are weak, reinforcing the need for **feature engineering and imbalance-aware modeling strategies**.

The primary objective of preprocessing is to:
- Clean non-informative and leakage-prone columns
- Construct a realistic target variable for machine failure prediction
- Prepare multiple feature representations to evaluate their impact on downstream model performance
- Ensure that scaling, encoding, and feature selection are applied **in a controlled, pipeline-based manner** to avoid data leakage

The preprocessing stage acts as the **bridge between EDA and model development**, while aligning all steps with the **primary evaluation metric**, which will focus on metrics suitable for imbalanced classification (e.g., Recall or PR-AUC).

---

**2. Mandatory Preprocessing (Applied Once, Common to All Approaches)**

The following steps are **non-negotiable** and will be applied uniformly before any experimentation:

1. **Remove non-informative identifier columns**
   - Drop `id` and `Product ID` as they have no predictive value and may introduce leakage.

2. **Target variable construction**
   - Create a binary target variable `machine_failure`, where:
     - `1` if any of `TWF`, `HDF`, `PWF`, `OSF`, or `RNF` equals 1
     - `0` otherwise

3. **Leakage prevention**
   - Drop individual failure mode columns (`TWF`, `HDF`, `PWF`, `OSF`, `RNF`) after target construction to reflect a realistic deployment scenario.

4. **Train–test split**
   - Perform a stratified split on `machine_failure` to preserve class imbalance.
   - Fix the random seed to ensure reproducibility across experiments.

5. **Pipeline-aware preprocessing**
   - Apply feature scaling **within model pipelines** only when required by the algorithm to avoid leakage.
   - The categorical variable `Type` (L, M, H) will be handled consistently via ordinal encoding, since its levels have a natural order.

These steps establish a **clean, consistent, and reproducible baseline dataset**.
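
The pipeline-aware scaling described in step 5 could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: `make_scaled_pipeline` and the toy arrays are hypothetical, and only scale-sensitive estimators (here logistic regression and SVC) are wrapped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_scaled_pipeline(estimator):
    """Wrap a scale-sensitive estimator so scaling is learned inside fit()."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", estimator),
    ])


# Hypothetical toy data standing in for the real training features.
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=[300.0, 1500.0], scale=[2.0, 200.0], size=(200, 2))
y_toy = (X_toy[:, 1] < 1400).astype(int)

scaled_models = {
    "LR_scaled": make_scaled_pipeline(LogisticRegression(max_iter=1000)),
    "SVC_scaled": make_scaled_pipeline(SVC(probability=True, random_state=42)),
}

for name, pipe in scaled_models.items():
    # The scaler's mean and variance are estimated only from the data given
    # to fit(), so fitting on the training split keeps test statistics out.
    pipe.fit(X_toy, y_toy)
```

Because the scaler lives inside the pipeline, calling `predict` or `predict_proba` on held-out rows reuses the training-set statistics automatically.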

---

**3. Experimental Preprocessing Approaches**

After mandatory preprocessing, three experimental approaches will be evaluated. Each approach modifies **only one aspect** of the preprocessing or learning setup, enabling fair comparison.

---
**Approach 1: Baseline Features with Class Weighting**

**Objective:**  
Evaluate whether handling class imbalance at the **algorithmic level** is effective without altering the feature space.

**Planned Steps:**
- Use cleaned, original operational features only.
- Apply class weighting during model training to penalize misclassification of failure events.
- Apply primarily to models that natively support class weighting (tree-based, linear models).

This approach isolates the impact of **loss-level imbalance handling**.

---

**Approach 2: Domain-Driven Feature Engineering**

**Objective:**  
Assess whether incorporating physically meaningful, interaction-based features improves failure prediction.

**Planned Steps:**
- Create derived features informed by EDA:
  - **Power** from rotational speed and torque
  - **Temperature difference** from process and air temperatures
- Retain original features alongside engineered features initially.
- Optionally evaluate a reduced feature set to assess information compression versus enrichment.
- Ensure all engineered features use **only instantaneous sensor values** to avoid temporal or target leakage.
- Encode categorical variables consistently within pipelines.

This approach focuses on **feature representation** and allows evaluation of **feature impact** independent of imbalance handling.
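
As a rough sketch of the two derived features listed above, the example below computes power in physical units and the temperature gap. The column names (`rpm`, `torque`, `air_temp`, `process_temp`) and the helper `add_physical_features` are assumptions for illustration. Converting rpm to rad/s only rescales the feature by a constant, so it matters for interpretability rather than for tree-based models.

```python
import numpy as np
import pandas as pd


def add_physical_features(X: pd.DataFrame) -> pd.DataFrame:
    """Add row-wise derived features; column names are assumed, not confirmed."""
    out = X.copy()
    # Mechanical power in watts: torque [Nm] times angular velocity [rad/s].
    omega = out["rpm"] * 2 * np.pi / 60.0
    out["power_w"] = out["torque"] * omega
    # Gap between process and air temperature, in kelvin.
    out["temp_diff"] = out["process_temp"] - out["air_temp"]
    return out


# Hypothetical single reading used only to show the arithmetic.
sample = pd.DataFrame({
    "rpm": [1500.0], "torque": [40.0],
    "air_temp": [298.0], "process_temp": [308.5],
})
features = add_physical_features(sample)
```

Both features use only values from the same row, which is what keeps them free of temporal or target leakage.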


---

**Approach 3: Data-Level Imbalance Handling (Resampling)**

**Objective:**  
Examine whether modifying the training data distribution improves failure detection.

**Planned Steps:**
- Apply resampling techniques such as:
  - Synthetic oversampling (e.g., SMOTE)
  - Hybrid methods (e.g., SMOTE with Tomek links)
- Apply resampling **only on the training set**, keeping the test set unchanged.
- Limit resampling experiments to models compatible with synthetic data (primarily tree-based algorithms).

This approach isolates the impact of **data-level imbalance handling** while preserving realistic evaluation conditions.
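
To make the idea of synthetic oversampling concrete, here is a simplified sketch of the interpolation step behind SMOTE. It is not imbalanced-learn's implementation: `interpolate_minority` and the toy data are hypothetical, and the function only shows how new minority points are placed on segments between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def interpolate_minority(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 42) -> np.ndarray:
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])


# Hypothetical minority-class training rows (never test rows).
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 2))
X_synthetic = interpolate_minority(X_minority, n_new=30)
```

Because new points are interpolated, an integer-coded column such as `type` can receive fractional values; imbalanced-learn offers `SMOTENC` for data that mixes numerical and categorical features.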

---

**4. Experimental Consistency and Evaluation Scope**

Across all experimental approaches:
- The same train–test split will be used.
- The same evaluation metrics will be applied.
- Differences in performance will be attributed solely to preprocessing and imbalance-handling choices.

This controlled setup ensures **interpretable and defensible conclusions**.
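
One way to keep the metrics identical across experiments is to route every model through a single scoring helper. The sketch below is an assumption about how that could be organized: `evaluate_failure_class` is a hypothetical name, and the 0.5 threshold simply mirrors the default behaviour of `predict`.

```python
import numpy as np
from sklearn.metrics import (
    auc,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
)


def evaluate_failure_class(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Return failure-class precision, recall, F1 and PR-AUC from scores."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    precision_curve, recall_curve, _ = precision_recall_curve(y_true, y_prob)
    return {
        "precision_1": precision_score(y_true, y_pred, zero_division=0),
        "recall_1": recall_score(y_true, y_pred, zero_division=0),
        "f1_1": f1_score(y_true, y_pred, zero_division=0),
        "pr_auc": auc(recall_curve, precision_curve),
    }


# Tiny hand-made example: one of two failures is caught at threshold 0.5.
scores = evaluate_failure_class([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Calling the same helper with the same `y_test` for every approach makes it harder for metric definitions to drift between experiments.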

---

**5. Summary**

The preprocessing strategy is intentionally structured into:
- **Mandatory preprocessing** to ensure data integrity, realism, and reproducibility
- **Experimental preprocessing** to systematically evaluate feature engineering and imbalance-handling strategies

This design provides a robust foundation for subsequent **model training, tuning, and explainability analysis**, while maintaining clarity for technical reviewers and stakeholders. Final pipeline selection will additionally consider **performance stability, model complexity, and deployment feasibility**.

### 2. Statistical Validation of Feature–Target Relationships

#### 2.1 Two-Sample T-Test: Numerical Feature Differences Between Failure and Non-Failure Cases

In [3]:
# 1. Define your numerical features
num_features = [
    'Air temperature [K]', 
    'Process temperature [K]', 
    'Rotational speed [rpm]', 
    'Torque [Nm]', 
    'Tool wear [min]'
]

# 2. Create a list to store the results
t_test_results = []

failure_cols = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']
df['machine_failure'] = (df[failure_cols].sum(axis=1)>0).astype(int)    

for col in num_features:
    # Separate the two groups
    group_no_failure = df[df['machine_failure'] == 0][col]
    group_failure = df[df['machine_failure'] == 1][col]
    
    # Perform T-Test
    t_stat, p_val = stats.ttest_ind(group_no_failure, group_failure)
    
    # Store results in a dictionary
    t_test_results.append({
        'Feature': col,
        'T-Statistic': round(t_stat, 4),
        'P-Value': format(p_val, '.10f'), # Format to show many decimals
        'Significant?': 'Yes' if p_val < 0.05 else 'No'
    })

# 3. Convert to DataFrame for a clean view
t_test_df = pd.DataFrame(t_test_results)

# Display the result
t_test_df

| | Feature | T-Statistic | P-Value | Significant? |
|---|---|---|---|---|
| 0 | Air temperature [K] | -21.0085 | 0.0 | Yes |
| 1 | Process temperature [K] | -10.8477 | 0.0 | Yes |
| 2 | Rotational speed [rpm] | 18.8057 | 0.0 | Yes |
| 3 | Torque [Nm] | -40.6174 | 0.0 | Yes |
| 4 | Tool wear [min] | -12.7802 | 0.0 | Yes |


#### 2.2 Chi-Square Test of Independence: Product Type vs. Machine Failure

In [4]:
# 1. Create the cross-tab
contingency_table = pd.crosstab(df['Type'], df['machine_failure'])

# 2. Perform the Test
chi2, p_val, dof, expected = chi2_contingency(contingency_table)

# 3. Store the result in a one-row DataFrame
chi2_results_df = pd.DataFrame({
    'Feature': ['Type'],
    'Chi-Square Stat': [round(chi2, 4)],
    'P-Value': [format(p_val, '.10f')],
    'Significant?': ['Yes' if p_val < 0.05 else 'No']
})

# To view the result in a notebook
chi2_results_df

| | Feature | Chi-Square Stat | P-Value | Significant? |
|---|---|---|---|---|
| 0 | Type | 22.6564 | 1.20289e-05 | Yes |
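
A significant chi-square statistic says little about how strong the association is, because the statistic grows with sample size. As a hedged back-of-the-envelope check, Cramér's V can be computed from the reported statistic. The row count below is an assumption (the sum of the train and test counts printed in Section 3), as is the number of `Type` levels.

```python
import math

chi2_stat = 22.6564      # chi-square statistic reported above
n_rows = 90954           # assumption: 72763 train rows + 18191 test rows
n_type_levels = 3        # assumption: Type takes the values L, M, H
n_target_levels = 2      # machine_failure is binary

# Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
min_dim = min(n_type_levels - 1, n_target_levels - 1)
cramers_v = math.sqrt(chi2_stat / (n_rows * min_dim))
print(f"Cramér's V = {cramers_v:.4f}")
```

Under these assumptions V is roughly 0.016, a very weak association despite the small p-value.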


To statistically validate the patterns observed during EDA, inferential tests were conducted to assess the relationship between key features and the binary machine failure target.


A two-sample t-test was conducted to compare the distributions of key operational features between **failure** and **non-failure** cases. The goal was to assess whether observed differences in feature values are **statistically significant**, not merely visually apparent.

---

**Key Findings**

- **All analyzed features show statistically significant differences** between failure and non-failure groups (p-value < 0.05).
- Extremely small p-values (≈ 0) indicate that the observed differences are **highly unlikely to be due to random chance**, given the large sample size.

---

**Feature-Level Interpretation**

- **Torque [Nm]**
  - Exhibits the **largest magnitude t-statistic**, indicating the strongest separation between failure and non-failure cases.
  - Confirms torque as a **primary driver of machine failure**, consistent with mechanical overstrain and power-related failure regimes.

- **Rotational Speed [rpm]**
  - Shows a strong and significant difference, with failures occurring at **lower average speeds**.
  - Reinforces the importance of **low-speed, high-load operating conditions** in failure scenarios.

- **Tool Wear [min]**
  - Statistically significant difference supports the role of **accumulated degradation** in triggering failures.
  - Aligns with earlier findings on threshold-driven mechanical failures.

- **Air Temperature [K] and Process Temperature [K]**
  - Both are significantly higher in failure cases. Air temperature has the second-largest |t| (21.0), while process temperature has the smallest (10.8).
  - A large t-statistic at this sample size does not by itself imply strong discrimination; **raw temperature values alone may still be weak discriminators**, and interaction-based thermal features are expected to be more informative.

---

**Important Interpretation Note**

- Statistical significance here reflects **detectable distributional differences**, not predictive strength.
- Given the large dataset size, even modest shifts in means can yield very small p-values.
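
The gap between significance and practical separation can be illustrated with an effect size such as Cohen's d. The sketch below uses synthetic data, not the project dataset, and also shows Welch's variant of the t-test (`equal_var=False`), which does not assume equal group variances; the notebook itself uses SciPy's default equal-variance test. `cohens_d` is a hypothetical helper.

```python
import numpy as np
from scipy import stats


def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Synthetic groups whose means differ by only 0.05 standard deviations.
rng = np.random.default_rng(0)
no_failure = rng.normal(loc=0.00, scale=1.0, size=100_000)
failure = rng.normal(loc=0.05, scale=1.0, size=100_000)

t_stat, p_val = stats.ttest_ind(no_failure, failure, equal_var=False)
d = cohens_d(no_failure, failure)

print(f"p-value: {p_val:.3g}, Cohen's d: {d:.3f}")
```

With samples this large the p-value is far below 0.05 even though the standardised difference is tiny, which is why the t-statistics above are best read alongside an effect size.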

---

**Overall Insight**

The t-test results confirm that **failure and non-failure cases differ systematically across all core operational features**, with the largest t-statistics observed for **torque, air temperature, and rotational speed**. These findings validate earlier EDA conclusions and further support the use of **interaction-aware, non-linear models**, rather than relying on individual features or linear assumptions for failure prediction.

### 3. Mandatory Preprocessing

In [5]:
def mandatory_preprocessing(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42):
    """
    Perform mandatory preprocessing:
    - Drop identifiers, Construct binary target, Remove leakage columns, Stratified train-test split, apply ordinal encoding to type
    
    Parameters
    ----------
    df : pd.DataFrame
        Raw dataset including failure columns
    test_size : float
        Fraction of data for test set
    random_state : int
        Random seed for reproducibility
        
    Returns
    -------
    X_train, X_test, y_train, y_test : pd.DataFrame / pd.Series
        Preprocessed train-test split
    """
    
    # --- 1. Drop non-informative identifiers ---
    df = df.drop(columns=['id', 'Product ID'], errors='ignore')
    
    # --- 2. Create binary target and drop columns ---
    failure_cols = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']
    df['machine_failure'] = (df[failure_cols].sum(axis=1)>0).astype(int)    
    df = df.drop(columns=failure_cols, errors='ignore')
    
    # 3. Mapping Low to 0, Medium to 1, and High to 2
    quality_mapping = {'L': 0, 'M': 1, 'H': 2}
    if 'Type' in df.columns:
        df['Type'] = df['Type'].map(quality_mapping)

    # 4. rename columns 
    df.rename(columns={"Type":"type", "Air temperature [K]": "air_temp", "Process temperature [K]": "process_temp", "Rotational speed [rpm]": "rpm", 
                       "Torque [Nm]": "torque", "Tool wear [min]": "tool_wear"}, inplace=True)

    # --- 5. Separate features and target ---
    X = df.drop(columns=['machine_failure'])
    y = df['machine_failure']
    
    # --- 6. Stratified train-test split ---
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = mandatory_preprocessing(df)

print(f"columns: {X_train.columns.to_list()}")
print(f" Imbalanced data counts train dataset : {pd.DataFrame(y_train.value_counts())}")
print(f" Imbalanced data counts test dataset : {pd.DataFrame(y_test.value_counts())}")

columns: ['type', 'air_temp', 'process_temp', 'rpm', 'torque', 'tool_wear']
 Imbalanced data counts train dataset :                  count
machine_failure       
0                71715
1                 1048
 Imbalanced data counts test dataset :                  count
machine_failure       
0                17929
1                  262


### 4. Experimental Preprocessing Approaches

#### 4.1 Baseline Features with Class-Weighted Learning

In [6]:
classes = np.unique(y_train)
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=classes,
    y=y_train
)
class_weight_dict = dict(zip(classes, class_weights))

scale_pos_weight = class_weight_dict[1] / class_weight_dict[0]

print("Class weights:", class_weight_dict)
print("scale_pos_weight:", scale_pos_weight)


Class weights: {np.int64(0): np.float64(0.5073067001324688), np.int64(1): np.float64(34.715171755725194)}
scale_pos_weight: 68.43034351145039
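
The printed weights can be reproduced by hand. With `class_weight="balanced"`, scikit-learn assigns each class the weight `n_samples / (n_classes * class_count)`; the short sketch below redoes that arithmetic using the training counts printed in Section 3.

```python
n_neg, n_pos = 71715, 1048          # training counts printed in Section 3
n_samples = n_neg + n_pos
n_classes = 2

# Balanced weight: n_samples / (n_classes * count of that class)
w_neg = n_samples / (n_classes * n_neg)
w_pos = n_samples / (n_classes * n_pos)

# The ratio of the two weights simplifies to n_neg / n_pos.
ratio = w_pos / w_neg

print(w_neg, w_pos, ratio)
```

The ratio is the negative-to-positive count ratio, which is the value passed to XGBoost as `scale_pos_weight`.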


In [7]:
models = {
    # Decision Tree
    "DT_normal": DecisionTreeClassifier(random_state=42),
    "DT_weighted": DecisionTreeClassifier(class_weight=class_weight_dict, random_state=42),

    # Random Forest
    "RF_normal": RandomForestClassifier(n_estimators=200, random_state=42),
    "RF_weighted": RandomForestClassifier(
        n_estimators=200, class_weight=class_weight_dict, random_state=42
    ),

    # Logistic Regression
    "LR_normal": LogisticRegression(max_iter=1000, random_state=42),
    "LR_weighted": LogisticRegression(
        class_weight=class_weight_dict, max_iter=1000, random_state=42
    ),

    # SVC
    "SVC_normal": SVC(probability=True, random_state=42),
    "SVC_weighted": SVC(
        class_weight=class_weight_dict, probability=True, random_state=42
    ),

    # LightGBM
    "LGBM_normal": LGBMClassifier(random_state=42),
    "LGBM_weighted": LGBMClassifier(
        class_weight=class_weight_dict, random_state=42
    ),

    # XGBoost
    "XGB_normal": XGBClassifier(
        eval_metric="logloss",
        use_label_encoder=False,
        random_state=42
    ),
    "XGB_weighted": XGBClassifier(
        eval_metric="logloss",
        scale_pos_weight=scale_pos_weight,
        use_label_encoder=False,
        random_state=42
    ),
}


In [8]:
results = []

for name, model in models.items():
    print(f"Training: {name}")
    
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    report = classification_report(y_test, y_pred, output_dict=True)
    
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    pr_auc = auc(recall, precision)
    
    results.append({
        "model": name,
        "weighted": "weighted" in name,
        "precision_1": report["1"]["precision"],
        "recall_1": report["1"]["recall"],
        "f1_1": report["1"]["f1-score"],
        "pr_auc": pr_auc
    })


Training: DT_normal
Training: DT_weighted
Training: RF_normal
Training: RF_weighted
Training: LR_normal
Training: LR_weighted
Training: SVC_normal


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


Training: SVC_weighted
Training: LGBM_normal
[LightGBM] [Info] Number of positive: 1048, number of negative: 71715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006913 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 932
[LightGBM] [Info] Number of data points in the train set: 72763, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.014403 -> initscore=-4.225816
[LightGBM] [Info] Start training from score -4.225816
Training: LGBM_weighted
[LightGBM] [Info] Number of positive: 1048, number of negative: 71715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006269 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 932
[LightGBM] [Info] Number of data points in the train set: 72763, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[Lig

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Training: XGB_weighted


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [9]:
results_df = pd.DataFrame(results).sort_values(
    by=['model', 'weighted', "recall_1", "pr_auc"], ascending=False
)

results_df

| | model | weighted | precision_1 | recall_1 | f1_1 | pr_auc |
|---|---|---|---|---|---|---|
| 11 | XGB_weighted | True | 0.269815 | 0.610687 | 0.374269 | 0.375126 |
| 10 | XGB_normal | False | 0.58871 | 0.278626 | 0.378238 | 0.357883 |
| 7 | SVC_weighted | True | 0.055789 | 0.759542 | 0.103944 | 0.108995 |
| 6 | SVC_normal | False | 0.0 | 0.0 | 0.0 | 0.026361 |
| 3 | RF_weighted | True | 0.592593 | 0.244275 | 0.345946 | 0.359739 |
| 2 | RF_normal | False | 0.652542 | 0.293893 | 0.405263 | 0.406911 |
| 5 | LR_weighted | True | 0.042834 | 0.706107 | 0.080768 | 0.160869 |
| 4 | LR_normal | False | 0.5 | 0.007634 | 0.015038 | 0.227324 |
| 9 | LGBM_weighted | True | 0.17162 | 0.683206 | 0.27433 | 0.420789 |
| 8 | LGBM_normal | False | 0.597015 | 0.305344 | 0.40404 | 0.373565 |


**Experimental Preprocessing: Baseline Features with Class Weighting — Insights & Decisions**

This section compares **normal vs. class-weighted training** using the same baseline features to evaluate whether **algorithm-level imbalance handling alone** improves failure prediction.

---

**Model-Level Observations (Weighted vs. Normal)**

- **Decision Tree**
  - Normal and weighted versions perform similarly.
  - Class weighting does **not provide a clear benefit** for single-tree models.

- **Random Forest**
  - Normal training performs **better than weighted**, with higher F1-score and PR-AUC.
  - Indicates Random Forest already handles imbalance reasonably well without weighting.

- **Logistic Regression**
  - Class weighting dramatically increases recall (0.01 → 0.71) but causes **severe precision collapse** (0.50 → 0.04).
  - Normal version fails to detect failures, while weighted version **overpredicts failures**.

- **Support Vector Classifier (SVC)**
  - Weighted model achieves very high recall (0.76) but **very low precision** (0.06).
  - Normal version predicts no failures at all. Both SVC and Logistic Regression were fit on unscaled features in this experiment, so these results may partly reflect the missing scaling step rather than the model families themselves.

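- **LightGBM**
  - Class weighting more than doubles recall (0.31 → 0.68) and yields the highest PR-AUC among the displayed results (0.421 vs. 0.374 unweighted).
  - Precision (0.60 → 0.17) and F1 (0.40 → 0.27) drop substantially at the default threshold.
  - Indicates **boosting models benefit from class weighting** when recall and ranking quality are the priority.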

- **XGBoost**
  - Class-weighted model significantly improves recall compared to normal training.
  - Precision drops, but PR-AUC improves slightly, indicating a **useful trade-off**.

---

**Overall Observations**

- Class weighting **consistently increases recall** across most models.
- Precision often degrades sharply, especially for linear and margin-based models.
- The effectiveness of class weighting is **highly model-dependent**.
- Boosting models handle the recall–precision trade-off **better than other model families**.

---

**Conclusions**

- Class-weighted learning **should not be used as a universal solution**.
- It is **ineffective or harmful** for the linear and SVC models as trained here (without feature scaling).
- It is **beneficial for boosting-based models**, where recall gains outweigh precision loss.
- Compared to feature engineering, class weighting produces **less stable and less interpretable improvements**.

---

**Decisions and Future Steps**

- **Retain class weighting as an optional enhancement** for boosting models only, not as a default preprocessing choice.
- **Exclude class-weighted linear and SVC models** from further consideration.
- Apply class weighting **on top of feature-augmented features**, not as a standalone strategy.
- In future experiments, evaluate whether combining:
  - Feature-augmented features
  - Selective class weighting  
  leads to improved recall without excessive false positives.

This experiment confirms that **class weighting is a secondary optimization tool**, not a replacement for robust feature engineering.

#### 4.2 Domain-Driven Feature Engineering

In [10]:
# Feature Engineering Function
def add_engineered_features(X: pd.DataFrame) -> pd.DataFrame:
    X_fe = X.copy()
    
    X_fe["power"] = X_fe["rpm"] * X_fe["torque"]
    X_fe["temp_diff"] = X_fe["process_temp"] - X_fe["air_temp"]
    X_fe["torque_per_rpm"] = X_fe["torque"] / (X_fe["rpm"] + 1e-6)
    
    return X_fe

In [11]:
# Baseline (no feature engineering)
X_train_base = X_train.copy()
X_test_base = X_test.copy()

# Feature-Augmented
X_train_fe_aug = add_engineered_features(X_train)
X_test_fe_aug = add_engineered_features(X_test)

# Feature-Only
independent_cols = ["type", "tool_wear"]
engineered_cols = ["power", "temp_diff", "torque_per_rpm"]

X_train_fe_only = X_train_fe_aug[independent_cols + engineered_cols]
X_test_fe_only = X_test_fe_aug[independent_cols + engineered_cols]

In [12]:
models = {
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=200, random_state=42),
    "LGBM": LGBMClassifier(random_state=42),
    "XGB": XGBClassifier(
        eval_metric="logloss",
        use_label_encoder=False,
        random_state=42
    ),
}


In [13]:
results_approach2_feature_eng = []

feature_sets = {
    "baseline": (X_train_base, X_test_base),
    "fe_aug": (X_train_fe_aug, X_test_fe_aug),
    "fe_only": (X_train_fe_only, X_test_fe_only),
}

for feature_name, (Xtr, Xte) in feature_sets.items():
    print(f"\n=== Feature Set: {feature_name} ===")
    
    for model_name, model in models.items():
        model.fit(Xtr, y_train)
        
        y_pred = model.predict(Xte)
        y_prob = model.predict_proba(Xte)[:, 1]
        
        report = classification_report(y_test, y_pred, output_dict=True)
        precision, recall, _ = precision_recall_curve(y_test, y_prob)
        pr_auc = auc(recall, precision)
        
        results_approach2_feature_eng.append({
            "model": model_name,
            "feature_set": feature_name,
            "precision_1": report["1"]["precision"],
            "recall_1": report["1"]["recall"],
            "f1_1": report["1"]["f1-score"],
            "pr_auc": pr_auc
        })



=== Feature Set: baseline ===
[LightGBM] [Info] Number of positive: 1048, number of negative: 71715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006407 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 932
[LightGBM] [Info] Number of data points in the train set: 72763, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.014403 -> initscore=-4.225816
[LightGBM] [Info] Start training from score -4.225816


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



=== Feature Set: fe_aug ===
[LightGBM] [Info] Number of positive: 1048, number of negative: 71715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004902 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1544
[LightGBM] [Info] Number of data points in the train set: 72763, number of used features: 9
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.014403 -> initscore=-4.225816
[LightGBM] [Info] Start training from score -4.225816


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



=== Feature Set: fe_only ===
[LightGBM] [Info] Number of positive: 1048, number of negative: 71715
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000532 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 858
[LightGBM] [Info] Number of data points in the train set: 72763, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.014403 -> initscore=-4.225816
[LightGBM] [Info] Start training from score -4.225816


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [14]:
results_approach2_feature_eng = pd.DataFrame(
    results_approach2_feature_eng
).sort_values(
    by=["recall_1", "pr_auc"], ascending=False
)

results_approach2_feature_eng


| | model | feature_set | precision_1 | recall_1 | f1_1 | pr_auc |
|---|---|---|---|---|---|---|
| 6 | LGBM | fe_aug | 0.601307 | 0.351145 | 0.443373 | 0.400084 |
| 5 | RF | fe_aug | 0.674419 | 0.332061 | 0.445013 | 0.429231 |
| 0 | DT | baseline | 0.289037 | 0.332061 | 0.309059 | 0.316499 |
| 8 | DT | fe_only | 0.320896 | 0.328244 | 0.324528 | 0.326466 |
| 4 | DT | fe_aug | 0.294737 | 0.320611 | 0.30713 | 0.310667 |
| 11 | XGB | fe_only | 0.594203 | 0.312977 | 0.41 | 0.378569 |
| 10 | LGBM | fe_only | 0.632812 | 0.30916 | 0.415385 | 0.382623 |
| 2 | LGBM | baseline | 0.597015 | 0.305344 | 0.40404 | 0.373565 |
| 7 | XGB | fe_aug | 0.588235 | 0.305344 | 0.40201 | 0.364917 |
| 1 | RF | baseline | 0.652542 | 0.293893 | 0.405263 | 0.406911 |


**Experimental Preprocessing: Domain-Driven Feature Engineering — Insights & Decisions**

This section compares **three feature representations** across models:
- **Baseline** (original features)
- **Feature-Augmented (fe-aug)** (baseline + engineered interaction features)
- **Feature-Only (fe-only)** (engineered features with minimal baseline context)

The goal is to understand **which representation consistently improves model performance**.

---

**Model-wise Comparison:**

- **Decision Tree**
  - Performance is similar across all feature sets.
  - Feature augmentation offers **no clear advantage**, confirming limited model capacity.

- **Random Forest**
  - **Feature-augmented features perform best** (recall 0.332, PR-AUC 0.429), ahead of baseline (0.294, 0.407); the feature-only row does not appear in the displayed results.
  - Indicates that Random Forest benefits from interaction features when combined with baseline inputs.

- **LightGBM**
  - **Feature-augmented representation clearly performs best**, followed by feature-only, then baseline.
  - Confirms LightGBM’s strong ability to exploit engineered interaction features.

- **XGBoost**
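  - **Feature-only performs best** (recall 0.313, PR-AUC 0.379), narrowly ahead of feature-augmented (0.305, 0.365); both improve on the baseline (0.279, 0.358).
  - Shows that XGBoost benefits from the engineered features, although adding the raw inputs back alongside them did not help further.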

---

**Overall Observations**

- **Feature-augmented (fe-aug) consistently outperforms baseline features** across all ensemble and boosting models.
- **Feature-only representations improve upon baseline in some cases**; they edge out feature-augmented for XGBoost and the Decision Tree, but not for LightGBM.
- Single-tree models show minimal benefit from feature engineering.
- Improvements achieved through feature engineering are **more balanced and stable** than those observed using class weighting alone.

---

**Conclusions**

- **Feature-augmented representations are the most effective feature strategy** identified so far.
- Ensemble and boosting models (Random Forest, LightGBM, XGBoost) **benefit the most** from engineered interaction features.
- Feature engineering provides **clear and consistent gains**, unlike class weighting, which primarily trades precision for recall.

---

**Decisions and Future Scope**

- **Adopt feature-augmented features as the default input representation** for all future modeling.
- Use **class-weighted learning selectively** on top of feature-augmented features, particularly for boosting models.
- In future experiments, evaluate whether combining:
  - Feature-augmented features
  - Selective class weighting
  - Data-level resampling  
  leads to further improvements in recall without excessive precision loss.

This structured comparison confirms that **feature engineering is the strongest lever for improving model performance**, and should form the foundation of the final modeling pipeline.

#### 4.3 Data-Level Imbalance Handling (Resampling)

In [15]:
# --- Define models ---
models = {
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
    "LGBM": LGBMClassifier(random_state=42),
    "XGB": XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
}

# --- Define resampling strategies ---
resampling_methods = {
    "baseline": (X_train, y_train),
    "SMOTE": SMOTE(random_state=42).fit_resample(X_train, y_train),
    "SMOTE_Tomek": SMOTETomek(random_state=42).fit_resample(X_train, y_train)
}

# --- Prepare results dataframe ---
results_resampling = []

# --- Iterate over resampling methods ---
for method_name, (X_res, y_res) in resampling_methods.items():
    for model_name, model in models.items():
        # Train model
        model.fit(X_res, y_res)
        # Predict scores for PR-AUC
        if hasattr(model, "predict_proba"):
            y_probs = model.predict_proba(X_test)[:, 1]
        else:
            # Classifiers without predict_proba: fall back to decision_function scores
            y_probs = model.decision_function(X_test)
        # Predict classes
        y_pred = model.predict(X_test)
        # Metrics for minor class (1)
        precision_1 = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
        recall_1 = recall_score(y_test, y_pred, pos_label=1, zero_division=0)
        f1_1 = f1_score(y_test, y_pred, pos_label=1, zero_division=0)
        # PR-AUC
        precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_probs)
        pr_auc = auc(recall_curve, precision_curve)
        # Append results
        results_resampling.append({
            "model": model_name,
            "resampling": method_name,
            "precision_1": precision_1,
            "recall_1": recall_1,
            "f1_1": f1_1,
            "pr_auc": pr_auc
        })


[LightGBM] [Info] Number of positive: 1048, number of negative: 71715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002138 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 932
[LightGBM] [Info] Number of data points in the train set: 72763, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.014403 -> initscore=-4.225816
[LightGBM] [Info] Start training from score -4.225816


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[LightGBM] [Info] Number of positive: 71715, number of negative: 71715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005585 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1276
[LightGBM] [Info] Number of data points in the train set: 143430, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000



[LightGBM] [Info] Number of positive: 71471, number of negative: 71471
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001075 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1276
[LightGBM] [Info] Number of data points in the train set: 142942, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000



In [16]:
# --- Convert to dataframe ---
results_resampling_df = pd.DataFrame(results_resampling)

# --- Display sorted results ---
results_resampling_df.sort_values(by=["pr_auc"], ascending=False)

| | model | resampling | precision_1 | recall_1 | f1_1 | pr_auc |
|---|---|---|---|---|---|---|
| 1 | RF | baseline | 0.647059 | 0.293893 | 0.404199 | 0.40124 |
| 10 | LGBM | SMOTE_Tomek | 0.270979 | 0.591603 | 0.371703 | 0.386878 |
| 6 | LGBM | SMOTE | 0.265442 | 0.60687 | 0.369338 | 0.374451 |
| 2 | LGBM | baseline | 0.597015 | 0.305344 | 0.40404 | 0.373565 |
| 11 | XGB | SMOTE_Tomek | 0.283096 | 0.530534 | 0.36919 | 0.363796 |
| 3 | XGB | baseline | 0.58871 | 0.278626 | 0.378238 | 0.357883 |
| 7 | XGB | SMOTE | 0.293878 | 0.549618 | 0.382979 | 0.355726 |
| 5 | RF | SMOTE | 0.287257 | 0.507634 | 0.366897 | 0.334764 |
| 9 | RF | SMOTE_Tomek | 0.287234 | 0.515267 | 0.368852 | 0.327551 |
| 0 | DT | baseline | 0.289037 | 0.332061 | 0.309059 | 0.316499 |


**Experimental Preprocessing: Data-Level Imbalance Handling (Resampling) — Insights & Decisions**

This experiment evaluates **data-level imbalance handling** by comparing:
- **Baseline (no resampling)**
- **SMOTE**
- **SMOTE + Tomek Links**

Each model is trained on the corresponding training set (original for the baseline, resampled otherwise) and evaluated on the original, unchanged test set.

---

**Model-wise Comparison**

- **Decision Tree (DT)**
  - Resampling increases recall but **significantly reduces precision and PR-AUC**.
  - Baseline performs better overall, indicating DT is **highly sensitive to synthetic samples**.

- **Random Forest (RF)**
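  - Baseline achieves the **best PR-AUC** among RF variants (0.401 vs. 0.335 with SMOTE and 0.328 with SMOTE-Tomek) and the highest PR-AUC overall.
  - SMOTE and SMOTE-Tomek increase recall (0.29 to about 0.51) but at a **clear cost to precision** (0.65 to 0.29) **and PR-AUC**.
  - This suggests RF already handles imbalance reasonably well without resampling.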

- **LightGBM (LGBM)**
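  - SMOTE and SMOTE-Tomek roughly double recall compared to baseline (0.31 to about 0.60), while precision falls from 0.60 to about 0.27.
  - PR-AUC matches or slightly exceeds the LGBM baseline (0.374 with SMOTE and 0.387 with SMOTE-Tomek vs. 0.374), though it stays below the baseline RF (0.401).
  - Indicates **boosting models benefit from controlled resampling**, especially when recall is prioritized.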

- **XGBoost (XGB)**
  - Resampling improves recall over baseline (0.28 to 0.53-0.55).
  - PR-AUC stays close to the baseline value of 0.358: slightly higher with SMOTE-Tomek (0.364) and slightly lower with SMOTE (0.356).
  - Shows **moderate benefit**, but gains are not consistently superior.
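
The PR-AUC values compared above are the trapezoidal area under the precision-recall curve, as computed in the evaluation loop. The sketch below reproduces that calculation on synthetic data (the dataset and model settings are assumptions) and contrasts it with `average_precision_score`, a step-wise summary of the same curve. It also shows why continuous scores matter: hard 0/1 predictions leave at most three points on the curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption, for illustration only)
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# PR-AUC from class-1 probabilities (same recipe as the loop above)
y_probs = model.predict_proba(X_test)[:, 1]
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_probs)
pr_auc = auc(recall_curve, precision_curve)

# Step-wise alternative: average precision over the same scores
ap = average_precision_score(y_test, y_probs)

# Hard labels collapse the curve to a handful of points
y_pred = model.predict(X_test)
p_hard, r_hard, _ = precision_recall_curve(y_test, y_pred)
pr_auc_hard = auc(r_hard, p_hard)

print(f"PR-AUC (probabilities): {pr_auc:.3f}")
print(f"Average precision:      {ap:.3f}")
print(f"PR-AUC (hard labels):   {pr_auc_hard:.3f} from {len(p_hard)} curve points")
```

The two summaries usually differ slightly because the trapezoidal rule interpolates linearly between operating points, so comparisons are only meaningful when every model is scored with the same one.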

---

**Cross-Resampling Observations**

- **SMOTE and SMOTE-Tomek consistently increase recall** across all models.
- Precision generally decreases, indicating **more false positives**.
- **SMOTE-Tomek yields slightly higher PR-AUC than SMOTE alone** for LightGBM and XGBoost (but not for RF), which is consistent with Tomek-link cleaning removing noisy or borderline samples.
- Resampling benefits are **model-dependent** and strongest for boosting-based models.
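
To make these cross-method comparisons easier to read, the long results table can be pivoted so that each model is a row and each resampling method a column. The sketch below uses the rows shown in the output above (copied by hand, so the `rows` list is a stand-in for `results_resampling_df`); the DT rows for SMOTE and SMOTE-Tomek are not in the displayed output and therefore appear as `NaN`.

```python
import pandas as pd

# (model, resampling, precision_1, recall_1, pr_auc) copied from the table above
rows = [
    ("RF", "baseline", 0.647059, 0.293893, 0.401240),
    ("LGBM", "SMOTE_Tomek", 0.270979, 0.591603, 0.386878),
    ("LGBM", "SMOTE", 0.265442, 0.606870, 0.374451),
    ("LGBM", "baseline", 0.597015, 0.305344, 0.373565),
    ("XGB", "SMOTE_Tomek", 0.283096, 0.530534, 0.363796),
    ("XGB", "baseline", 0.588710, 0.278626, 0.357883),
    ("XGB", "SMOTE", 0.293878, 0.549618, 0.355726),
    ("RF", "SMOTE", 0.287257, 0.507634, 0.334764),
    ("RF", "SMOTE_Tomek", 0.287234, 0.515267, 0.327551),
    ("DT", "baseline", 0.289037, 0.332061, 0.316499),
]
results = pd.DataFrame(
    rows, columns=["model", "resampling", "precision_1", "recall_1", "pr_auc"]
)

# One row per model, one column per resampling method
pr_auc_by_method = results.pivot(index="model", columns="resampling", values="pr_auc")
recall_by_method = results.pivot(index="model", columns="resampling", values="recall_1")

# Change relative to each model's own baseline
pr_auc_delta = pr_auc_by_method.sub(pr_auc_by_method["baseline"], axis=0)
recall_delta = recall_by_method.sub(recall_by_method["baseline"], axis=0)

print(pr_auc_delta.round(3))
print(recall_delta.round(3))
```

Subtracting each model's own baseline separates the effect of resampling from the differences between model families.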

---

**Overall Conclusions**

- Data-level resampling is effective for **improving minority-class recall**, but often at the expense of precision.
- Tree-based ensemble models tolerate resampling better than single trees.
- Compared to class-weighted learning:
  - Resampling provides **more balanced improvements** for boosting models.
  - However, it still does not outperform **feature-augmented representations** in terms of stability.
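
The precision and recall figures in this experiment are measured at the default decision threshold used by `predict`. As a complementary illustration (not part of the experiments above), the sketch below picks the threshold with the highest recall that still meets a minimum precision. The helper `pick_threshold`, the `min_precision` value, and the toy scores are all hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, y_probs, min_precision=0.5):
    """Return (threshold, precision, recall) with the highest recall
    among thresholds whose precision is at least min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_probs)
    # The last precision/recall pair (1.0, 0.0) has no matching threshold
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    # Mask out thresholds that miss the precision floor, then maximise recall
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    return thresholds[best], precision[best], recall[best]


# Toy scores (assumption): three failures among eight machines
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_probs = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

threshold, prec, rec = pick_threshold(y_true, y_probs, min_precision=0.7)
print(threshold, prec, rec)
```

In practice the threshold would be chosen on a validation split rather than on the test set, so that the reported test metrics stay unbiased.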

---

**Decisions and Future Scope**

- **Do not apply resampling as a default strategy** across all models.
- Prefer **SMOTE or SMOTE-Tomek selectively** with boosting models (LightGBM, XGBoost).
- Avoid resampling with single Decision Trees.
- Use resampling **only after feature engineering**, not on raw baseline features.
- In future experiments, evaluate:
  - Feature-augmented representation + SMOTE-Tomek
  - Feature-augmented representation + selective class weighting
  - Feature-augmented representation + combined resampling and weighting
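
For the class-weighting variants listed above, the weights can be derived directly from the training class counts. The sketch below uses the baseline counts reported in the LightGBM log (1048 positives, 71715 negatives); rebuilding `y_train` from those counts is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Baseline training class counts taken from the LightGBM log above
n_neg, n_pos = 71715, 1048

# Label vector with the same class balance (stand-in for the real y_train)
y_train = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# "balanced" weights: n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weight = dict(zip([0, 1], weights))

# Ratio commonly passed as scale_pos_weight to boosting models
scale_pos_weight = n_neg / n_pos

print(class_weight)
print(round(scale_pos_weight, 2))
```

The dictionary can be passed as `class_weight` to estimators such as `RandomForestClassifier` or `LGBMClassifier`, while `scale_pos_weight` is the corresponding parameter on `XGBClassifier` and `LGBMClassifier`. Combining either with SMOTE-based resampling would weight an already balanced training set, so the weights would need to be recomputed after resampling.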

This experiment confirms that **resampling is a powerful but sensitive tool**, best used selectively and in combination with robust feature engineering rather than as a standalone solution.

### 5. Technical Summary of Preprocessing and Baseline Model Results

**Project Status Summary**

- This project has completed a comprehensive exploratory analysis and preprocessing evaluation to understand machine failure behavior and identify the most effective modeling strategy.

**Key Findings:**

- Machine failures are rare and highly imbalanced, requiring specialized handling.

- Failures are driven by localized, non-linear interactions between operational variables rather than simple linear trends.

- Among multiple preprocessing strategies:

    - Domain-driven feature engineering provided the most consistent and stable performance improvements.

    - Class weighting and resampling improved failure recall but introduced trade-offs in precision and stability.

- Tree-based ensemble and boosting models (Random Forest, LightGBM, XGBoost) consistently outperformed linear and single-tree models.

**Strategic Decisions:**

- Feature-augmented data representation is selected as the default input for modeling.

- LightGBM and XGBoost are identified as primary candidate models.

- Imbalance-handling techniques will be applied selectively, not universally, to balance recall and false-alarm risk.

**Next Phase Objectives:**

- Final model training and hyperparameter optimization

- Controlled use of imbalance-handling techniques

- Model explainability and deployment readiness

This structured approach ensures robust, explainable, and production-relevant predictive maintenance models.