### Dependency imports

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

### 1. Preprocessing Strategy and Experiment Design 

This preprocessing phase is designed to **operationalize insights from EDA** and prepare the dataset for systematic machine learning experimentation. The steps below clearly separate **mandatory preprocessing** from **experimental preprocessing**, ensuring clarity, reproducibility, and fair comparison across modeling approaches.

---
**1. EDA Summary and Preprocessing Rationale**

Exploratory analysis established that:
- Machine failures are **rare and highly imbalanced**.
- Failure events arise from **localized, non-linear operational regimes** driven by feature interactions and thresholds.
- Individual failure mode indicators are **not available at inference time** and therefore cannot be used directly for prediction.
- Linear correlations between individual features and the failure target are weak, reinforcing the need for **feature engineering and imbalance-aware modeling strategies**.

The primary objective of preprocessing is to:
- Clean non-informative and leakage-prone columns
- Construct a realistic target variable for machine failure prediction
- Prepare multiple feature representations to evaluate their impact on downstream model performance
- Ensure that scaling, encoding, and feature selection are applied **in a controlled, pipeline-based manner** to avoid data leakage

The preprocessing stage acts as the **bridge between EDA and model development**, and every step is aligned with the **primary evaluation metric**, which will be one suited to imbalanced classification (e.g., Recall or PR-AUC).

---

**2. Mandatory Preprocessing (Applied Once, Common to All Approaches)**

The following steps are **non-negotiable** and will be applied uniformly before any experimentation:

1. **Remove non-informative identifier columns**
   - Drop `id` and `Product ID` as they have no predictive value and may introduce leakage.

2. **Target variable construction**
   - Create a binary target variable `machine_failure`, where:
     - `1` if any of `TWF`, `HDF`, `PWF`, `OSF`, or `RNF` equals 1
     - `0` otherwise

3. **Leakage prevention**
   - Drop individual failure mode columns (`TWF`, `HDF`, `PWF`, `OSF`, `RNF`) after target construction to reflect a realistic deployment scenario.

4. **Train–test split**
   - Perform a stratified split on `machine_failure` so that the train and test sets keep the same failure rate as the full dataset.
   - Fix the random seed to ensure reproducibility across experiments.

5. **Pipeline-aware preprocessing**
   - Apply feature scaling **within model pipelines** only when required by the algorithm to avoid leakage.
   - Encode the categorical `Type` column consistently via ordinal encoding (`L` < `M` < `H`), since the product quality levels are ordered.

These steps establish a **clean, consistent, and reproducible baseline dataset**.
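
The pipeline-aware scaling in step 5 can be sketched as follows. This is a minimal illustration on synthetic stand-in data, not the project dataset; the names `X_demo`, `y_demo`, and `scaled_model`, and the choice of `LogisticRegression`, are assumptions made for the example. The point it shows is that the scaler's statistics are learned from the fitting rows only and then reused on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), not the project dataset.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=300.0, scale=5.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 305.0).astype(int)

# Simple positional split, used here only to keep the sketch short.
X_fit, X_holdout = X_demo[:150], X_demo[150:]
y_fit = y_demo[:150]

# The scaler lives inside the pipeline, so fit() computes its mean and
# variance from X_fit only; predict() reuses them on X_holdout.
scaled_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scaled_model.fit(X_fit, y_fit)
holdout_pred = scaled_model.predict(X_holdout)
```

Because the scaler is a pipeline step, the same object can be passed to cross-validation utilities without the held-out folds influencing the scaling statistics.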

---

**3. Experimental Preprocessing Approaches**

After mandatory preprocessing, three experimental approaches will be evaluated. Each approach modifies **only one aspect** of the preprocessing or learning setup, enabling fair comparison.

---
**Approach 1: Baseline Features with Class Weighting**

**Objective:**  
Evaluate whether handling class imbalance at the **algorithmic level** is effective without altering the feature space.

**Planned Steps:**
- Use cleaned, original operational features only.
- Apply class weighting during model training to penalize misclassification of failure events.
- Apply primarily to models that natively support class weighting (tree-based, linear models).

This approach isolates the impact of **loss-level imbalance handling**.
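
A minimal sketch of class weighting in scikit-learn, assuming the `"balanced"` heuristic is an acceptable starting point. The toy arrays and the specific estimators are assumptions for illustration; the notebook has not yet fixed which models will be used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels (hypothetical): 95 healthy rows, 5 failures.
y_demo = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare class receives a proportionally larger weight.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
class_weights = dict(zip([0, 1], weights))

# Both estimators accept class_weight directly; the features are untouched.
linear_model = LogisticRegression(class_weight="balanced", max_iter=1000)
tree_model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)
linear_model.fit(X_demo, y_demo)
tree_model.fit(X_demo, y_demo)
```

With a 95:5 split the failure class is weighted 19 times more heavily than the healthy class, which is how misclassified failures come to cost more during training.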

---

**Approach 2: Domain-Driven Feature Engineering**

**Objective:**  
Assess whether incorporating physically meaningful, interaction-based features improves failure prediction.

**Planned Steps:**
- Create derived features informed by EDA:
  - **Power** from rotational speed and torque
  - **Temperature difference** from process and air temperatures
- Retain original features alongside engineered features initially.
- Optionally evaluate a reduced feature set to assess information compression versus enrichment.
- Ensure all engineered features use **only instantaneous sensor values** to avoid temporal or target leakage.
- Encode categorical variables consistently within pipelines.

This approach focuses on **feature representation** and allows evaluation of **feature impact** independent of imbalance handling.
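
The two derived features could be computed as in the sketch below. It assumes the renamed columns produced by the mandatory preprocessing (`air_temp`, `process_temp`, `rpm`, `torque`); the helper name `add_domain_features` and the output column names `power_w` and `temp_diff` are hypothetical. Converting rpm to rad/s so that power is in watts is also an assumption; a plain `torque * rpm` product would carry the same information up to a constant factor.

```python
import numpy as np
import pandas as pd


def add_domain_features(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with power and temperature-difference columns."""
    X = X.copy()
    # Mechanical power [W] = torque [Nm] * angular speed [rad/s].
    X["power_w"] = X["torque"] * X["rpm"] * 2 * np.pi / 60
    # Difference between process and air temperature [K].
    X["temp_diff"] = X["process_temp"] - X["air_temp"]
    return X


# Two made-up rows, for illustration only.
sample = pd.DataFrame({
    "air_temp": [298.1, 300.5],
    "process_temp": [308.6, 310.0],
    "rpm": [1500, 1400],
    "torque": [40.0, 50.0],
})
enriched = add_domain_features(sample)
```

Both features are row-wise functions of the current sensor readings, so they can be applied to the train and test sets independently without leaking information between them.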


---

**Approach 3: Data-Level Imbalance Handling (Resampling)**

**Objective:**  
Examine whether modifying the training data distribution improves failure detection.

**Planned Steps:**
- Apply resampling techniques such as:
  - Synthetic oversampling (e.g., SMOTE)
  - Hybrid methods (e.g., SMOTE with Tomek links)
- Apply resampling **only on the training set**, keeping the test set unchanged.
- Limit resampling experiments to models compatible with synthetic data (primarily tree-based algorithms).

This approach isolates the impact of **data-level imbalance handling** while preserving realistic evaluation conditions.
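
The sketch below illustrates the train-only rule with plain random oversampling via `sklearn.utils.resample`. This is a dependency-free stand-in, not SMOTE: SMOTE and SMOTE with Tomek links are provided by the separate imbalanced-learn package and would replace the resampling call here. The helper name `oversample_minority` and the toy data are hypothetical, and the helper assumes a binary target.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(X: pd.DataFrame, y: pd.Series, random_state: int = 42):
    """Randomly duplicate minority rows until both classes are the same size."""
    train = X.copy()
    train["_target"] = y.to_numpy()
    majority_label = train["_target"].value_counts().idxmax()
    majority = train[train["_target"] == majority_label]
    minority = train[train["_target"] != majority_label]
    # Sample minority rows with replacement up to the majority count.
    minority_up = resample(
        minority, replace=True, n_samples=len(majority), random_state=random_state
    )
    # Concatenate and shuffle so the classes are not ordered in blocks.
    balanced = pd.concat([majority, minority_up]).sample(
        frac=1.0, random_state=random_state
    )
    return balanced.drop(columns="_target"), balanced["_target"]


# Toy training data (hypothetical): 18 healthy rows, 2 failures.
X_toy = pd.DataFrame({"rpm": range(1400, 1420), "torque": range(30, 50)})
y_toy = pd.Series([0] * 18 + [1] * 2)
X_res, y_res = oversample_minority(X_toy, y_toy)
```

Only the training split would be passed to such a helper; the test split keeps its original class distribution so that the evaluation reflects deployment conditions.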

---

**4. Experimental Consistency and Evaluation Scope**

Across all experimental approaches:
- The same train–test split will be used.
- The same evaluation metrics will be applied.
- Differences in performance will be attributed solely to preprocessing and imbalance-handling choices.

This controlled setup ensures **interpretable and defensible conclusions**.
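
As a small worked example of the metrics mentioned earlier, Recall and PR-AUC can be computed with scikit-learn as shown below. The labels and scores are made up for illustration, the 0.5 decision threshold is an assumption, and PR-AUC is approximated here by average precision.

```python
import numpy as np
from sklearn.metrics import average_precision_score, recall_score

# Made-up ground truth and predicted failure probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.8, 0.4])

# Recall needs hard predictions, so a threshold must be chosen (0.5 here).
y_pred = (y_score >= 0.5).astype(int)
recall = recall_score(y_true, y_pred)

# Average precision summarizes the precision-recall curve and is
# threshold-free, since it works on the scores directly.
pr_auc = average_precision_score(y_true, y_score)

print(f"recall={recall:.3f}, pr_auc={pr_auc:.3f}")
```

In this toy case one of the two failures is caught at the 0.5 threshold, giving a recall of 0.5, while average precision is 0.5 · 1 + 0.5 · (2/3) ≈ 0.833.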

---

**5. Summary**

The preprocessing strategy is intentionally structured into:
- **Mandatory preprocessing** to ensure data integrity, realism, and reproducibility
- **Experimental preprocessing** to systematically evaluate feature engineering and imbalance-handling strategies

This design provides a robust foundation for subsequent **model training, tuning, and explainability analysis**, while maintaining clarity for technical reviewers and stakeholders. Final pipeline selection will additionally consider **performance stability, model complexity, and deployment feasibility**.

### 2. Mandatory Preprocessing

In [24]:
df = pd.read_csv("C:/sai files/projects/predictive-maintenance-end2end/test.csv")

In [25]:
def mandatory_preprocessing(df: pd.DataFrame, test_size: float = 0.2, random_state: int = 42):
    """
    Perform mandatory preprocessing:
    - Drop identifier columns
    - Construct the binary target `machine_failure`
    - Remove the failure-mode (leakage) columns
    - Ordinal-encode `Type` and rename columns
    - Stratified train-test split

    Parameters
    ----------
    df : pd.DataFrame
        Raw dataset including failure columns
    test_size : float
        Fraction of data for test set
    random_state : int
        Random seed for reproducibility

    Returns
    -------
    X_train, X_test : pd.DataFrame
        Feature matrices of the train and test sets
    y_train, y_test : pd.Series
        Binary targets of the train and test sets
    """

    # --- 1. Drop non-informative identifiers (returns a new frame, so the caller's df is untouched) ---
    df = df.drop(columns=['id', 'Product ID'], errors='ignore')

    # --- 2. Create binary target, then drop the failure-mode columns ---
    failure_cols = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']
    df['machine_failure'] = (df[failure_cols].sum(axis=1) > 0).astype(int)
    df = df.drop(columns=failure_cols)

    # --- 3. Ordinal-encode Type: L (low) -> 0, M (medium) -> 1, H (high) -> 2 ---
    # Note: any value outside the mapping becomes NaN.
    quality_mapping = {'L': 0, 'M': 1, 'H': 2}
    if 'Type' in df.columns:
        df['Type'] = df['Type'].map(quality_mapping)

    # --- 4. Rename columns to short snake_case names ---
    df = df.rename(columns={
        "Type": "type",
        "Air temperature [K]": "air_temp",
        "Process temperature [K]": "process_temp",
        "Rotational speed [rpm]": "rpm",
        "Torque [Nm]": "torque",
        "Tool wear [min]": "tool_wear",
    })

    # --- 5. Separate features and target ---
    X = df.drop(columns=['machine_failure'])
    y = df['machine_failure']

    # --- 6. Stratified train-test split ---
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = mandatory_preprocessing(df)

print(f"columns: {X_train.columns.to_list()}")
print(f" Imbalanced data counts train dataset : {pd.DataFrame(y_train.value_counts())}")
print(f" Imbalanced data counts test dataset : {pd.DataFrame(y_test.value_counts())}")

columns: ['type', 'air_temp', 'process_temp', 'rpm', 'torque', 'tool_wear']
 Imbalanced data counts train dataset :                  count
machine_failure       
0                71715
1                 1048
 Imbalanced data counts test dataset :                  count
machine_failure       
0                17929
1                  262
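
As a quick sanity check on the stratified split, the failure rate of each set can be derived from the counts printed above. The numbers below are copied from that output; the helper name `failure_rate` is hypothetical.

```python
# Class counts copied from the printed output above.
train_counts = {0: 71715, 1: 1048}
test_counts = {0: 17929, 1: 262}


def failure_rate(counts: dict) -> float:
    """Share of rows labelled as failures (class 1)."""
    return counts[1] / (counts[0] + counts[1])


train_rate = failure_rate(train_counts)
test_rate = failure_rate(test_counts)
# Share of all rows that ended up in the test set.
test_share = sum(test_counts.values()) / (
    sum(train_counts.values()) + sum(test_counts.values())
)

print(f"train failure rate: {train_rate:.2%}")
print(f"test failure rate:  {test_rate:.2%}")
print(f"test share of rows: {test_share:.2%}")
```

Both sets have a failure rate of roughly 1.44%, and the test set holds about 20% of the rows, which is consistent with `stratify=y` and `test_size=0.2`.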
