# **Dataset Preprocessing**
This notebook demonstrates the complete preprocessing pipeline for the C-MAPSS datasets, starting with cleaning.

## **1-Cleaning** 
### 📌 Step 1: Define Sensor Groups
We start by defining two sets of sensors based on prior research:

- **Known Constant Sensors**  
  These sensors provide no useful variation and are dropped: `sensor_1`, `sensor_5`, `sensor_6`, `sensor_10`, `sensor_16`, `sensor_18`, `sensor_19`.

- **Critical Sensors for RUL Prediction**  
  These are the operating conditions and sensors that prior research has found predictive of RUL:
  - Operating conditions: `setting1`, `setting2`, `setting3`
  - Temperatures: `sensor_2`, `sensor_3`, `sensor_4`
  - Pressure ratios: `sensor_7`, `sensor_8`, `sensor_9`
  - HPC/LPT: `sensor_11`, `sensor_12`, `sensor_13`
  - Flow ratios: `sensor_14`, `sensor_15`
  - Coolant/Bleed: `sensor_17`, `sensor_20`, `sensor_21`

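One way to sanity-check a list of constant sensors is to look for columns with (near-)zero standard deviation. The sketch below uses a made-up toy frame and a hypothetical helper, `find_constant_columns`, neither of which is part of the notebook; the tolerance `tol` is also an assumption.

```python
import pandas as pd

# Toy frame standing in for a loaded C-MAPSS file (values are made up).
toy = pd.DataFrame({
    "sensor_1": [518.67, 518.67, 518.67, 518.67],  # never changes
    "sensor_2": [641.82, 642.15, 642.35, 642.37],  # varies
    "sensor_5": [14.62, 14.62, 14.62, 14.62],      # never changes
})

def find_constant_columns(df, tol=1e-8):
    """Return the columns whose standard deviation is at or below `tol`."""
    std = df.std()
    return std[std <= tol].index.tolist()

constant_cols = find_constant_columns(toy)
print(constant_cols)
```

Running a check like this on each raw file would show whether the hard-coded list applies to that particular dataset.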
### 📌 Step 2: Cleaning Function

We define `clean_train_df(filepath)` to process each training dataset:

1. **Load raw data**  
 - Reads space-separated text files.  
 - Assigns column names: `id`, `cycle`, 3 settings, and 21 sensors.

2. **Compute RUL**  
 - For each engine (`id`), RUL is the number of cycles left before that engine's final recorded cycle:

\[
\text{RUL} = \max(\text{cycle}) - \text{cycle}
\]

3. **Drop constant sensors**  
 - Removes the sensors listed in `CONSTANT_SENSORS`.

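4. **Normalize by operating regime**  
 - Discretizes `setting1` into 5 equal-width bins (`op_regime`) to capture different operating conditions.  
 - Standardizes each remaining sensor within its regime (subtract the regime mean, divide by the regime standard deviation), producing `sensor_*_norm` columns.

5. **Retain critical features**  
 - Keeps only `id`, `cycle`, `RUL`, the critical settings and sensors, and the normalized `_norm` columns.

6. **Correlation filtering**  
 - For pairs of normalized sensors with absolute correlation > 0.9, removes one sensor, keeping the one more correlated with RUL.

7. **Variance check**  
 - Drops normalized sensors with zero variance.

8. **Return**  
 - Cleaned DataFrame (which still includes the `RUL` column).  
 - Separate RUL series.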

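As a small worked example of the RUL step, the sketch below applies the same `groupby(...).transform('max')` pattern to a made-up frame with two engines. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Two hypothetical engines: engine 1 runs for 3 cycles, engine 2 for 2 cycles.
toy = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
})

# RUL = (last cycle of this engine) - (current cycle)
toy["RUL"] = toy.groupby("id")["cycle"].transform("max") - toy["cycle"]
print(toy)
```

Each engine's RUL counts down to 0 at its final recorded cycle: engine 1 gets 2, 1, 0 and engine 2 gets 1, 0.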
### 📌 Step 3: Process All Training Sets

We process the four CMAPSS training datasets:

```python
trains_path = [
  '.../train_FD001.txt',
  '.../train_FD002.txt',
  '.../train_FD003.txt',
  '.../train_FD004.txt'
]

train_datasets = [clean_train_df(f) for f in trains_path]
```

## Install Dependencies

In [None]:

!pip install pandas numpy matplotlib scikit-learn

In [1]:
from pathlib import Path
import numpy as np 
import pandas as pd

# Place your raw .txt datasets inside the "dataset" folder.
# You can find them in the datasets folder in the repository.

In [11]:
# 1. CONFIGURATION (User Editable)
from pathlib import Path
import os

try:
    BASE_DIR = Path(__file__).resolve().parent
except NameError:
    BASE_DIR = Path(os.getcwd())   # fallback for Jupyter

DATASET_DIR = BASE_DIR / "dataset"
OUTPUT_DIR = BASE_DIR / "CLEANED_DATA"
OUTPUT_DIR.mkdir(exist_ok=True)

TRAIN_DIR = OUTPUT_DIR / "train"
TRAIN_DIR.mkdir(parents=True, exist_ok=True)
TEST_DIR = OUTPUT_DIR / "test"
TEST_DIR.mkdir(parents=True, exist_ok=True)

In [3]:
# Known constant sensors in C-MAPSS (from literature)
CONSTANT_SENSORS = ['sensor_1', 'sensor_5', 'sensor_6', 'sensor_10', 
                    'sensor_16', 'sensor_18', 'sensor_19']

# Critical sensors for RUL prediction (proven in research)
CRITICAL_SENSORS = [
    'setting1', 'setting2', 'setting3',  # Operating conditions
    'sensor_2', 'sensor_3', 'sensor_4',   # Temperatures
    'sensor_7', 'sensor_8', 'sensor_9',   # Pressure ratios
    'sensor_11', 'sensor_12', 'sensor_13', # HPC/LPT
    'sensor_14', 'sensor_15',              # Flow ratios
    'sensor_17', 'sensor_20', 'sensor_21'  # Coolant/bleed
]

In [20]:
def clean_train_df(filepath):
    """
    Clean a single C-MAPSS training dataset.

    Args:
        filepath: Path to raw training file.

    Returns:
        tuple: (cleaned DataFrame including RUL, RUL Series).
    """
    # Read raw data (whitespace-separated, no header row)
    df = pd.read_csv(filepath, sep=r'\s+', header=None)
    
    # Define columns
    column_names = ["id", "cycle", "setting1", "setting2", "setting3"] + \
                   [f"sensor_{i}" for i in range(1, 22)]
    df.columns = column_names
    df = df.copy()
    # Compute RUL: cycles remaining until each engine's last recorded cycle
    df['RUL'] = df.groupby('id')['cycle'].transform('max') - df['cycle']
    # Drop constant sensors
    df = df.drop(columns=CONSTANT_SENSORS, errors="ignore")
    # Discretize setting1 into 5 equal-width operating regimes
    df['op_regime'] = pd.cut(df['setting1'], bins=5, labels=False)
    sensor_cols = [c for c in df.columns if c.startswith('sensor_')]
    for sensor in sensor_cols:
        # Normalize by operating regime mean/std
        df[f'{sensor}_norm'] = df.groupby('op_regime')[sensor].transform(
            lambda x: (x - x.mean()) / (x.std() + 1e-8)
        )
    
    # Retain critical features plus the normalized sensors
    keep_cols = ['id', 'cycle', 'RUL'] + \
                [c for c in CRITICAL_SENSORS if c in df.columns] + \
                [c for c in df.columns if c.endswith('_norm')]
    
    df = df[keep_cols]
    # Remove highly correlated features (|corr| > 0.9)
    sensor_norm_cols = [c for c in df.columns if c.endswith('_norm')]
    corr_matrix = df[sensor_norm_cols].corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Absolute correlation of each normalized sensor with RUL
    rul_corr = df[sensor_norm_cols].corrwith(df['RUL']).abs()
    
    to_drop = set()
    for column in upper.columns:
        # Rows above the diagonal that are highly correlated with this column
        partners = upper.index[(upper[column] > 0.9).to_numpy()]
        for partner in partners:
            if column in to_drop:
                break
            if partner in to_drop:
                continue
            # Drop whichever member of the pair is less correlated with RUL
            if rul_corr[column] < rul_corr[partner]:
                to_drop.add(column)
            else:
                to_drop.add(partner)
    
    df = df.drop(columns=list(to_drop), errors='ignore')
    
    # Drop any remaining normalized sensors with zero variance
    valid_norm_cols = [c for c in sensor_norm_cols if c in df.columns]
    variance = df[valid_norm_cols].var()
    
    zero_var = variance[variance == 0].index.tolist()
    df = df.drop(columns=zero_var, errors='ignore')
    df_rul = df['RUL']
    return df.copy(), df_rul

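The correlation-filtering rule described in Step 2 (for each highly correlated pair, keep the feature more correlated with RUL) can be illustrated in isolation. This is a minimal sketch on made-up data; the helper name `drop_correlated` and the toy columns `a`, `b`, `c` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, feature_cols, target="RUL", threshold=0.9):
    """For each feature pair with |corr| > threshold, drop the member
    that is less correlated (in absolute value) with `target`."""
    corr = df[feature_cols].corr().abs()
    target_corr = df[feature_cols].corrwith(df[target]).abs()
    dropped = set()
    for i, a in enumerate(feature_cols):
        for b in feature_cols[i + 1:]:
            if a in dropped or b in dropped:
                continue  # one member of this pair is already gone
            if corr.loc[a, b] > threshold:
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    dropped = sorted(dropped)
    return df.drop(columns=dropped), dropped

# Toy data: `a` tracks RUL exactly, `b` is a noisier copy of `a`,
# and `c` is unrelated to both.
rul = np.arange(9, -1, -1, dtype=float)
a = 2 * rul + 1
b = a + np.array([1, -1] * 5, dtype=float)
c = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)
toy = pd.DataFrame({"a": a, "b": b, "c": c, "RUL": rul})

filtered, dropped = drop_correlated(toy, ["a", "b", "c"])
print(dropped)
```

Here `a` and `b` are correlated above 0.9, and `b` is the weaker predictor of `RUL`, so `b` is the column that gets removed while `c` is left alone.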

In [12]:
train_files = [
    DATASET_DIR / "train_FD001.txt",
    DATASET_DIR / "train_FD002.txt",
    DATASET_DIR / "train_FD003.txt",
    DATASET_DIR / "train_FD004.txt"
]




In [21]:
def clean_test_df(filepath: str) -> pd.DataFrame:
    """
    Clean a single C-MAPSS test dataset.
    
    Args:
        filepath (str): Path to raw test file.
    
    Returns:
        pd.DataFrame: Cleaned test dataset.
    """
    # Read raw data
    df = pd.read_csv(filepath, sep=r'\s+', header=None)
    
    # Define columns
    column_names = ["id", "cycle", "setting1", "setting2", "setting3"] + \
                   [f"sensor_{i}" for i in range(1, 22)]
    df.columns = column_names
    df = df.copy()

    # Drop constant sensors
    df = df.drop(columns=CONSTANT_SENSORS, errors='ignore')
    
    # Discretize setting1 into 5 equal-width operating regimes
    df['op_regime'] = pd.cut(df['setting1'], bins=5, labels=False)
    
    sensor_cols = [c for c in df.columns if c.startswith('sensor_')]
    for sensor in sensor_cols:
        # Normalize by operating regime mean/std
        df[f'{sensor}_norm'] = df.groupby('op_regime')[sensor].transform(
            lambda x: (x - x.mean()) / (x.std() + 1e-8)
        )
    
    # Keep critical features plus normalized sensors
    keep_cols = ['id', 'cycle'] + \
                [c for c in CRITICAL_SENSORS if c in df.columns] + \
                [c for c in df.columns if c.endswith('_norm')]
    
    df = df[keep_cols]

    return df.copy()

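The per-regime normalization used in both cleaning functions can be seen on a small made-up frame. The values and the choice of `bins=2` below are assumptions chosen to keep the example readable; the notebook itself uses `bins=5`.

```python
import pandas as pd

# Two hypothetical operating regimes with very different sensor levels.
toy = pd.DataFrame({
    "setting1": [0.0, 0.0, 0.0, 42.0, 42.0, 42.0],
    "sensor_2": [640.0, 642.0, 644.0, 548.0, 550.0, 552.0],
})

# Equal-width binning of setting1 into integer regime labels.
toy["op_regime"] = pd.cut(toy["setting1"], bins=2, labels=False)

# Z-score each sensor within its own regime.
toy["sensor_2_norm"] = toy.groupby("op_regime")["sensor_2"].transform(
    lambda x: (x - x.mean()) / (x.std() + 1e-8)
)
print(toy)
```

After normalization both regimes are on the same scale (roughly -1, 0, 1 in each group), even though the raw readings differ by about 90 units. Note that `pd.cut` with an integer `bins` derives its edges from the range of the data it is given, so train and test regimes line up only if both files span a similar `setting1` range.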

In [13]:

test_files = [
    DATASET_DIR / "test_FD001.txt",
    DATASET_DIR / "test_FD002.txt",
    DATASET_DIR / "test_FD003.txt",
    DATASET_DIR / "test_FD004.txt"
]

In [22]:
"""
Align Train/Test Datasets
-------------------------
Ensures that training and test sets share the same feature columns.
Outputs aligned CSVs for each FD dataset into CLEANED_DATA/train and CLEANED_DATA/test.
"""

# After cleaning
train_datasets = [clean_train_df(f) for f in train_files]   # list of (df_cleaned, df_rul)
test_datasets  = [clean_test_df(f) for f in test_files]     # list of df_cleaned

# Align
aligned_pairs = []
for i, ((df_cleaned, df_rul), test_df) in enumerate(zip(train_datasets, test_datasets), start=1):
    # Shared feature columns, kept in training-column order so the CSV layout is reproducible
    common_cols = [c for c in df_cleaned.columns if c in test_df.columns]
    train_cols = common_cols + ["RUL"]

    train_aligned = df_cleaned.reindex(columns=train_cols)
    test_aligned  = test_df.reindex(columns=common_cols)

    train_aligned.to_csv(TRAIN_DIR / f"train_FD00{i}_aligned.csv", index=False)
    test_aligned.to_csv(TEST_DIR / f"test_FD00{i}_aligned.csv", index=False)

    aligned_pairs.append((train_aligned, test_aligned))

print(" Aligned datasets have been saved to CLEANED_DATA/train and CLEANED_DATA/test")


 Aligned datasets have been saved to CLEANED_DATA/train and CLEANED_DATA/test

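The alignment step can be illustrated on toy frames. This sketch builds the shared column list in training-column order so that the output layout is the same on every run; the frames and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical cleaned frames: train lost sensor_9_norm during filtering,
# and test lists its columns in a different order.
train = pd.DataFrame({
    "id": [1], "cycle": [1],
    "sensor_2_norm": [0.1], "sensor_14_norm": [0.3], "RUL": [5],
})
test = pd.DataFrame({
    "id": [1], "cycle": [1],
    "sensor_14_norm": [0.2], "sensor_2_norm": [0.0], "sensor_9_norm": [0.4],
})

# Columns present in both frames, in the order they appear in train.
common_cols = [c for c in train.columns if c in test.columns]

train_aligned = train.reindex(columns=common_cols + ["RUL"])
test_aligned = test.reindex(columns=common_cols)
print(list(train_aligned.columns))
print(list(test_aligned.columns))
```

The test frame ends up with exactly the training features minus `RUL`, in the same order, which is what a model trained on the aligned training CSV would expect at inference time.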

In [23]:
for i, (train_df, test_df) in enumerate(aligned_pairs, 1):
    print(f"\nFD00{i} Dataset:")
    print(f"Train columns ({len(train_df.columns)}): {sorted(train_df.columns)}")
    print(f"Test columns ({len(test_df.columns)}): {sorted(test_df.columns)}")



FD001 Dataset:
Train columns (33): ['RUL', 'cycle', 'id', 'sensor_11', 'sensor_11_norm', 'sensor_12', 'sensor_12_norm', 'sensor_13', 'sensor_13_norm', 'sensor_14', 'sensor_15', 'sensor_15_norm', 'sensor_17', 'sensor_17_norm', 'sensor_2', 'sensor_20', 'sensor_20_norm', 'sensor_21', 'sensor_21_norm', 'sensor_2_norm', 'sensor_3', 'sensor_3_norm', 'sensor_4', 'sensor_4_norm', 'sensor_7', 'sensor_7_norm', 'sensor_8', 'sensor_8_norm', 'sensor_9', 'sensor_9_norm', 'setting1', 'setting2', 'setting3']
Test columns (32): ['cycle', 'id', 'sensor_11', 'sensor_11_norm', 'sensor_12', 'sensor_12_norm', 'sensor_13', 'sensor_13_norm', 'sensor_14', 'sensor_15', 'sensor_15_norm', 'sensor_17', 'sensor_17_norm', 'sensor_2', 'sensor_20', 'sensor_20_norm', 'sensor_21', 'sensor_21_norm', 'sensor_2_norm', 'sensor_3', 'sensor_3_norm', 'sensor_4', 'sensor_4_norm', 'sensor_7', 'sensor_7_norm', 'sensor_8', 'sensor_8_norm', 'sensor_9', 'sensor_9_norm', 'setting1', 'setting2', 'setting3']

FD002 Dataset:
Train colu