# 02 – Feature Engineering

Create features that help predict each engine's remaining useful life (RUL).

Input:  
- `train_processed.csv` and `test_processed.csv` from `data/processed/`

Output:  
- Final datasets with engineered features for modeling

In [1]:
# Importing required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

In [2]:
# Load cleaned datasets from the processed folder
train_df = pd.read_csv("../data/processed/train_processed.csv")
test_df = pd.read_csv("../data/processed/test_processed.csv")

## Check dataset shapes

In [3]:
# Check the shape of loaded datasets
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

Train shape: (20631, 20)
Test shape: (100, 20)


## Quick look at the processed datasets

In [4]:
# Display the first rows of the training dataset
train_df.head()

Unnamed: 0,unit_number,time_in_cycles,operational_setting_1,operational_setting_2,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,sensor_measurement_9,sensor_measurement_11,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_17,sensor_measurement_20,sensor_measurement_21,RUL
0,1,1,-0.0007,-0.0004,641.82,1589.7,1400.6,21.61,554.36,2388.06,9046.19,47.47,521.66,2388.02,8138.62,8.4195,392,39.06,23.419,191
1,1,2,0.0019,-0.0003,642.15,1591.82,1403.14,21.61,553.75,2388.04,9044.07,47.49,522.28,2388.07,8131.49,8.4318,392,39.0,23.4236,190
2,1,3,-0.0043,0.0003,642.35,1587.99,1404.2,21.61,554.26,2388.08,9052.94,47.27,522.42,2388.03,8133.23,8.4178,390,38.95,23.3442,189
3,1,4,0.0007,0.0,642.35,1582.79,1401.87,21.61,554.45,2388.11,9049.48,47.13,522.86,2388.08,8133.83,8.3682,392,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,642.37,1582.85,1406.22,21.61,554.0,2388.06,9055.15,47.28,522.19,2388.04,8133.8,8.4294,393,38.9,23.4044,187


In [5]:
# Display the first rows of the test dataset
test_df.head()

Unnamed: 0,unit_number,time_in_cycles,operational_setting_1,operational_setting_2,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,sensor_measurement_9,sensor_measurement_11,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_17,sensor_measurement_20,sensor_measurement_21,true_RUL
0,1,31,-0.0006,0.0004,642.58,1581.22,1398.91,21.61,554.42,2388.08,9056.4,47.23,521.79,2388.06,8130.11,8.4024,393,38.81,23.3552,112
1,2,49,0.0018,-0.0001,642.55,1586.59,1410.83,21.61,553.52,2388.1,9044.77,47.67,521.74,2388.09,8126.9,8.4505,391,38.81,23.2618,98
2,3,126,-0.0016,0.0004,642.88,1589.75,1418.89,21.61,552.59,2388.16,9049.26,47.88,520.83,2388.14,8131.46,8.4119,395,38.93,23.274,69
3,4,106,0.0012,0.0004,642.78,1594.53,1406.88,21.61,552.64,2388.13,9051.3,47.65,521.88,2388.11,8133.64,8.4634,395,38.58,23.2581,82
4,5,98,-0.0013,-0.0004,642.27,1589.94,1419.36,21.61,553.29,2388.1,9053.99,47.46,521.0,2388.15,8125.74,8.4362,394,38.75,23.4117,91


- Both datasets have 20 columns; the leading `Unnamed: 0` in the previews above is the row index, not a column.
- `unit_number`, `time_in_cycles`: engine and cycle identifiers.
- `operational_setting_*` and `sensor_measurement_*`: input features.
- Target: `RUL` (train), `true_RUL` (test).

## Generate Rolling Features (Train)

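We compute a rolling mean and standard deviation for each sensor over a 5-cycle window.  
The windows are computed separately for each engine (`unit_number`) so that no window mixes readings from two different engines.  
With `min_periods=1`, the first four cycles of each engine use a shorter window. These features capture short-term trends in sensor behavior.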

In [6]:
# Define sensor columns
sensor_cols = [col for col in train_df.columns if col.startswith('sensor_measurement_')]

# Window size for rolling calculations
window_size = 5

# Apply rolling mean and std per engine
for col in sensor_cols:
    train_df[f'{col}_rolling_mean'] = (
        train_df.groupby('unit_number')[col]
        .rolling(window=window_size, min_periods=1)
        .mean()
        .reset_index(level=0, drop=True)
    )
    train_df[f'{col}_rolling_std'] = (
        train_df.groupby('unit_number')[col]
        .rolling(window=window_size, min_periods=1)
        .std()
        .reset_index(level=0, drop=True)
    )

# Check new shape
print("Train shape after rolling features:", train_df.shape)

Train shape after rolling features: (20631, 50)
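
To illustrate why the rolling window is grouped by engine, here is a minimal sketch on made-up toy data (the column name `sensor` and all values are illustrative, not taken from the dataset). It compares the grouped rolling mean with an ungrouped one, using a window of 2 to keep the example small.

```python
import pandas as pd

# Toy data: two engines with three cycles each (values are invented for illustration)
toy = pd.DataFrame({
    "unit_number": [1, 1, 1, 2, 2, 2],
    "sensor": [10.0, 12.0, 14.0, 100.0, 102.0, 104.0],
})

# Grouped rolling mean: the window restarts at each engine boundary
toy["grouped_mean"] = (
    toy.groupby("unit_number")["sensor"]
    .rolling(window=2, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)

# Ungrouped rolling mean: the first window of engine 2 also
# contains the last reading of engine 1
toy["naive_mean"] = toy["sensor"].rolling(window=2, min_periods=1).mean()

print(toy)
```

In the ungrouped column, the first row of engine 2 averages 14.0 and 100.0, which is the kind of cross-engine contamination the `groupby` avoids.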


## Generate Rolling Features (Test)

Apply the same per-engine 5-cycle rolling mean and standard deviation to the test set.

In [7]:
# Apply same rolling features to test set

# Define sensor columns
sensor_cols = [col for col in test_df.columns if col.startswith('sensor_measurement_')]

# Window size for rolling calculations
window_size = 5

# Apply rolling mean and std per engine
for col in sensor_cols:
    test_df[f'{col}_rolling_mean'] = (
        test_df.groupby('unit_number')[col]
        .rolling(window=window_size, min_periods=1)
        .mean()
        .reset_index(level=0, drop=True)
    )
    test_df[f'{col}_rolling_std'] = (
        test_df.groupby('unit_number')[col]
        .rolling(window=window_size, min_periods=1)
        .std()
        .reset_index(level=0, drop=True)
    )

# Check new shape
print("Test shape after rolling features:", test_df.shape)

Test shape after rolling features: (100, 50)


## Verify rolling feature values for test set

Display the first 5 rows of rolling mean and std for two sensors to confirm correct computation.


In [8]:
# Preview rolling mean and std for selected sensors in test set
test_df[['sensor_measurement_2_rolling_mean', 'sensor_measurement_2_rolling_std',
         'sensor_measurement_3_rolling_mean', 'sensor_measurement_3_rolling_std']].head()

Unnamed: 0,sensor_measurement_2_rolling_mean,sensor_measurement_2_rolling_std,sensor_measurement_3_rolling_mean,sensor_measurement_3_rolling_std
0,642.58,,1581.22,
1,642.55,,1586.59,
2,642.88,,1589.75,
3,642.78,,1594.53,
4,642.27,,1589.94,


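The `rolling_std` values are NaN in every row because the test set holds a single row per engine, so each window contains only one data point and the sample standard deviation is undefined. For the same reason, each rolling mean simply equals the raw sensor value. These NaNs will be replaced with 0 to indicate no observed variability for single-point windows.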
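
The NaN comes from the default sample estimate (`ddof=1`) that pandas uses for `std`. A small sketch with a single made-up reading shows the difference between the sample and population estimates:

```python
import numpy as np
import pandas as pd

# A single reading, as in a one-point rolling window
single = pd.Series([642.58])

sample_std = single.std()            # ddof=1 (pandas default): undefined for one value
population_std = single.std(ddof=0)  # ddof=0: defined, and equal to zero

# The rolling std uses the sample estimate as well
rolling_std = single.rolling(window=5, min_periods=1).std()

print("sample std:    ", sample_std)
print("population std:", population_std)
print("rolling std:   ", rolling_std.iloc[0])
```

Filling the NaN with 0 is therefore equivalent to adopting the population convention for single-point windows only.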

### Fix rolling std NaNs

Fill NaN values in the rolling standard deviation columns with 0, treating a single-point window as having no variability (the population standard deviation of one value is 0).

In [9]:
# Identify rolling_std columns explicitly
rolling_std_cols = [col for col in test_df.columns if col.endswith('_rolling_std')]

# Fill NaN with 0 for each rolling_std column
for col in rolling_std_cols:
    test_df[col] = test_df[col].fillna(0)

# Verify the fix on the first few rows
test_df[rolling_std_cols[:2]].head()

Unnamed: 0,sensor_measurement_2_rolling_std,sensor_measurement_3_rolling_std
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0


### Fix rolling std NaNs in training set

Apply the same NaN → 0 replacement for rolling_std features on the training data.

In [10]:
# Identify rolling_std columns in train_df
rolling_std_cols_train = [col for col in train_df.columns if col.endswith('_rolling_std')]

# Fill NaN with 0 for each rolling_std column
for col in rolling_std_cols_train:
    train_df[col] = train_df[col].fillna(0)

# Verify the fix on the first few rows
train_df[rolling_std_cols_train[:2]].head()

Unnamed: 0,sensor_measurement_2_rolling_std,sensor_measurement_3_rolling_std
0,0.0,0.0
1,0.233345,1.499066
2,0.267644,1.918654
3,0.250117,3.855909
4,0.234776,4.075678


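- The first cycle of each engine (row 0 above) has `rolling_std = 0` because its window holds a single data point.  
- Subsequent cycles show positive `rolling_std`, representing variability over a window of up to 5 cycles (cycles 2 to 4 use fewer points).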

## Normalized cycle feature

Normalized cycle feature (`cycle_ratio`) = current cycle / maximum recorded cycle for each engine.  
Values range from just above 0 (first cycle) to 1 (last recorded cycle), indicating each observation’s position in the engine’s recorded history.

In [11]:
# Compute normalized cycle feature
train_df['cycle_ratio'] = train_df['time_in_cycles'] / train_df.groupby('unit_number')['time_in_cycles'].transform('max')
test_df['cycle_ratio']  = test_df['time_in_cycles'] / test_df.groupby('unit_number')['time_in_cycles'].transform('max')

# Check shapes after adding cycle_ratio
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

# Preview cycle_ratio feature
print(train_df[['unit_number', 'time_in_cycles', 'cycle_ratio']].head())
print("\nTest sample:")
print(test_df[['unit_number', 'time_in_cycles', 'cycle_ratio']].head())

Train shape: (20631, 51)
Test shape: (100, 51)
   unit_number  time_in_cycles  cycle_ratio
0            1               1     0.005208
1            1               2     0.010417
2            1               3     0.015625
3            1               4     0.020833
4            1               5     0.026042

Test sample:
   unit_number  time_in_cycles  cycle_ratio
0            1              31          1.0
1            2              49          1.0
2            3             126          1.0
3            4             106          1.0
4            5              98          1.0


The shapes confirm `cycle_ratio` has been added (train: 51 columns, test: 51 columns).  
In the training set, `cycle_ratio` ranges from near 0 up to 1 over each engine’s life.  
In the test set, `cycle_ratio` is constant at 1 because only the last known cycle of each engine is included.
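
A minimal sketch of the same computation on toy frames (the helper name `add_cycle_ratio` and the toy frames are illustrative; the test cycles mirror the first two rows previewed above):

```python
import pandas as pd

def add_cycle_ratio(df):
    """Return a copy of df with cycle_ratio = cycle / max cycle of the same engine."""
    out = df.copy()
    max_cycle = out.groupby("unit_number")["time_in_cycles"].transform("max")
    out["cycle_ratio"] = out["time_in_cycles"] / max_cycle
    return out

# One engine observed for four cycles (as in a training set with full histories)
train_toy = add_cycle_ratio(
    pd.DataFrame({"unit_number": [1, 1, 1, 1], "time_in_cycles": [1, 2, 3, 4]})
)

# Two engines with one row each (as in the test set)
test_toy = add_cycle_ratio(
    pd.DataFrame({"unit_number": [1, 2], "time_in_cycles": [31, 49]})
)

print(train_toy)
print(test_toy)
```

With one row per engine, the current cycle is also the maximum cycle, so the ratio is 1 regardless of how many cycles the engine has run.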

## Generate Delta Features

Compute the difference between each sensor’s current value and its value in the previous cycle of the same engine.  
Delta features capture the cycle-to-cycle change in sensor readings. The first cycle of each engine has no predecessor, so its delta is set to 0.


In [12]:
# Define original sensor columns (exclude rolling and other engineered features)
sensor_cols = [col for col in train_df.columns 
               if col.startswith('sensor_measurement_') and 
                  'rolling' not in col and 
                  'delta' not in col]

# Compute delta features for both train and test
for df in [train_df, test_df]:
    for col in sensor_cols:
        df[f'{col}_delta'] = df.groupby('unit_number')[col].diff().fillna(0)

# Check shapes after adding delta features
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

Train shape: (20631, 66)
Test shape: (100, 66)


After adding delta features, the dataset shapes update to:
- Train: (20631, 66)
- Test:  (100, 66)

This confirms 15 new delta columns (one per sensor) have been appended to both datasets.
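
The grouped `diff` can be sketched on toy data as follows (the column name `sensor` and the values are made up for illustration):

```python
import pandas as pd

# Two engines: three cycles for engine 1, two cycles for engine 2
toy = pd.DataFrame({
    "unit_number": [1, 1, 1, 2, 2],
    "sensor": [10.0, 12.0, 11.0, 100.0, 103.0],
})

# diff() within each engine; the first row of each engine has no predecessor,
# so its NaN is replaced with 0
toy["sensor_delta"] = toy.groupby("unit_number")["sensor"].diff().fillna(0)

# With a single row per engine, every delta is 0
one_row_each = pd.DataFrame({"unit_number": [1, 2, 3], "sensor": [5.0, 6.0, 7.0]})
one_row_each["sensor_delta"] = (
    one_row_each.groupby("unit_number")["sensor"].diff().fillna(0)
)

print(toy)
print(one_row_each)
```

The second frame mirrors the layout of the test set, where each engine contributes only its last recorded cycle.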


## Remove constant features

Drop features that have zero variance in the test set, as they provide no information for prediction.  
We identify columns with a single unique value in `test_df` and remove them from both datasets.


In [13]:
# Identify constant columns in test set
constant_cols = [col for col in test_df.columns if test_df[col].nunique() <= 1]

# Drop these columns from both train and test
train_df.drop(columns=constant_cols, inplace=True)
test_df.drop(columns=constant_cols, inplace=True)

# Verify shapes after removal
print("Dropped columns:", constant_cols)
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

Dropped columns: ['sensor_measurement_2_rolling_std', 'sensor_measurement_3_rolling_std', 'sensor_measurement_4_rolling_std', 'sensor_measurement_6_rolling_std', 'sensor_measurement_7_rolling_std', 'sensor_measurement_8_rolling_std', 'sensor_measurement_9_rolling_std', 'sensor_measurement_11_rolling_std', 'sensor_measurement_12_rolling_std', 'sensor_measurement_13_rolling_std', 'sensor_measurement_14_rolling_std', 'sensor_measurement_15_rolling_std', 'sensor_measurement_17_rolling_std', 'sensor_measurement_20_rolling_std', 'sensor_measurement_21_rolling_std', 'cycle_ratio', 'sensor_measurement_2_delta', 'sensor_measurement_3_delta', 'sensor_measurement_4_delta', 'sensor_measurement_6_delta', 'sensor_measurement_7_delta', 'sensor_measurement_8_delta', 'sensor_measurement_9_delta', 'sensor_measurement_11_delta', 'sensor_measurement_12_delta', 'sensor_measurement_13_delta', 'sensor_measurement_14_delta', 'sensor_measurement_15_delta', 'sensor_measurement_17_delta', 'sensor_measurement_20_

Resulting dataset shapes:
- Train: (20631, 35)
- Test:  (100, 35)

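All removed features were constant in the test set: with one row per engine, every rolling std and delta is 0 and every `cycle_ratio` is 1, so these columns cannot distinguish one test engine from another.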
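
The constant-column check depends on which dataset it is run against. The sketch below uses a hypothetical helper `constant_columns` and made-up frames to show that a column can vary in train while being constant in test:

```python
import pandas as pd

def constant_columns(df):
    """Return the columns of df that hold at most one distinct value."""
    return [col for col in df.columns if df[col].nunique() <= 1]

# Column "b" is constant everywhere; column "c" varies in train only
train_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [5.0, 5.0, 5.0],
    "c": [0.0, 0.4, 0.9],
})
test_toy = pd.DataFrame({
    "a": [1.5, 2.5],
    "b": [5.0, 5.0],
    "c": [0.0, 0.0],
})

print("Constant in train:", constant_columns(train_toy))
print("Constant in test: ", constant_columns(test_toy))
```

This notebook uses the test-set criterion; checking the training set instead would be a different design choice with a different set of dropped columns.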

## Feature Scaling

Apply standard scaling (zero mean, unit variance) to all feature columns.  
We fit the scaler on the training set and transform both train and test to ensure consistency.

In [14]:
# Identify feature columns (exclude identifiers and target)
feature_cols = [col for col in train_df.columns 
                if col not in ['unit_number', 'time_in_cycles', 'RUL']]

# Initialize scaler and fit on train features
scaler = StandardScaler()
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])

# Transform test features
test_df[feature_cols] = scaler.transform(test_df[feature_cols])

# Verify scaling: train feature means ~0, std ~1
print("Train feature means (first 5):\n", train_df[feature_cols].mean().round(2).head())
print("\nTrain feature stds (first 5):\n", train_df[feature_cols].std().round(2).head())

Train feature means (first 5):
 operational_setting_1    0.0
operational_setting_2   -0.0
sensor_measurement_2     0.0
sensor_measurement_3    -0.0
sensor_measurement_4     0.0
dtype: float64

Train feature stds (first 5):
 operational_setting_1    1.0
operational_setting_2    1.0
sensor_measurement_2     1.0
sensor_measurement_3     1.0
sensor_measurement_4     1.0
dtype: float64
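
As a small illustration of fitting on train and reusing those statistics on test, here is a sketch with made-up one-column arrays. It also shows that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to the sample estimate (`ddof=1`); on a dataset with many rows the two are nearly identical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column (values are invented for illustration)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

# Fit on train only, then apply the same mean and scale to both sets
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print("train mean used by scaler:", scaler.mean_[0])
print("train scaled, std ddof=0: ", train_scaled.std())
print("train scaled, std ddof=1: ", train_scaled.std(ddof=1))
print("test scaled:              ", test_scaled.ravel())
```

The scaled test values are not expected to have zero mean or unit variance, since they are expressed relative to the training distribution.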


## Correlation-based feature selection

From EDA, these sensor pairs showed high correlation (>0.80):
- sensor_measurement_9 & sensor_measurement_14  
- sensor_measurement_8 & sensor_measurement_13  
- sensor_measurement_4 & sensor_measurement_11  
- sensor_measurement_7 & sensor_measurement_12  

We drop one feature from each pair to reduce redundancy:
- Drop: `sensor_measurement_14`, `sensor_measurement_13`, `sensor_measurement_11`, `sensor_measurement_12`.

In [15]:
# Drop highly correlated features identified in EDA
drop_cols = [
    'sensor_measurement_14',
    'sensor_measurement_13',
    'sensor_measurement_11',
    'sensor_measurement_12'
]

train_df.drop(columns=drop_cols, inplace=True)
test_df.drop(columns=drop_cols, inplace=True)

# Verify shapes after dropping
print("Train shape after corr selection:", train_df.shape)
print("Test shape after corr selection: ", test_df.shape)

Train shape after corr selection: (20631, 31)
Test shape after corr selection:  (100, 31)


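After dropping the four highly correlated raw sensor columns, the dataset shapes update to:
- Train: (20631, 31)
- Test:  (100, 31)

Only the raw sensor columns are removed here; the matching `_rolling_mean` columns of these four sensors remain in both datasets.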
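
The pairs listed above come from the EDA notebook. As a sketch of how such pairs could be located programmatically, the hypothetical helper `high_corr_pairs` below scans the upper triangle of a correlation matrix. It assumes the threshold applies to the absolute Pearson correlation, which the EDA may or may not have done, and the toy data is made up.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.80):
    """Return (col_a, col_b, |r|) for column pairs whose absolute
    Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle: each pair once
            r = corr.iloc[i, j]
            if r > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, 7.0, 9.0, 11.0],   # exactly 2*x + 1
    "z": [1.0, -2.0, 2.0, -2.0, 1.0],  # uncorrelated with x and y
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Which member of each pair to drop is a separate decision that this helper does not make.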

## Save feature-engineered datasets

Save the final train and test sets with all engineered features for modeling.

In [16]:
# Save the feature-engineered datasets
train_df.to_csv("../data/processed/train_features.csv", index=False)
test_df.to_csv("../data/processed/test_features.csv", index=False)

## Summary

1. Loaded data (train: 20631×20, test: 100×20)  
2. Added rolling features (mean & std over 5 cycles) → train: 20631×50, test: 100×50  
3. Replaced NaN in rolling_std with 0  
4. Added `cycle_ratio` → train: 20631×51, test: 100×51 (constant in test)  
5. Added delta features (difference to previous cycle) → train: 20631×66, test: 100×66  
6. Dropped zero-variance columns → train: 20631×35, test: 100×35  
7. Standard-scaled all features (fit on train, transform both)  
8. Removed four highly correlated raw sensor columns → final: train: 20631×31, test: 100×31  

Datasets saved as `train_features.csv` and `test_features.csv`.
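
The column counts in the summary can be cross-checked with simple arithmetic. This sketch uses only numbers reported in this notebook (15 sensor columns, 20 starting columns, 4 dropped sensors); the dictionary keys are illustrative labels.

```python
n_sensors = 15  # sensor_measurement_* columns in the processed data

counts = {"loaded": 20}
counts["rolling"] = counts["loaded"] + 2 * n_sensors     # mean + std per sensor
counts["cycle_ratio"] = counts["rolling"] + 1            # one extra column
counts["delta"] = counts["cycle_ratio"] + n_sensors      # one delta per sensor

# Constant in test: 15 rolling std + cycle_ratio + 15 deltas
counts["constant_dropped"] = counts["delta"] - (2 * n_sensors + 1)

# Four raw sensor columns removed after the correlation review
counts["corr_dropped"] = counts["constant_dropped"] - 4

for step, n_cols in counts.items():
    print(f"{step:>17}: {n_cols}")
```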