# Baseline RUL Modeling (FD001)

This notebook builds interpretable baseline models for Remaining Useful Life prediction using the FD001 subset of the NASA CMAPSS dataset.

The goal is not maximum predictive performance but to establish:
- a leakage-free modeling pipeline
- a transparent baseline restricted to linear models (no time-series models or deep learning at this stage)
- a reference point for more complex models later (no hyperparameter tuning at this stage)


## Imports and global configuration for plots

In [70]:
# import data analysis and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

# import modeling libraries
import statsmodels.api as sm
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# import data loader module and path utilities
from src.data_loader import load_fd001
from pathlib import Path

# Random state
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [71]:
# Global plot appearance settings
mpl.rcParams.update({

    # Figure & Axes
    "figure.figsize": (10, 6),
    "figure.dpi": 120,
    "figure.facecolor": "#0b0f1a",
    "axes.facecolor": "#0b0f1a",
    "axes.edgecolor": "#8aa2c8",
    "axes.linewidth": 0.8,

    # Grid
    "axes.grid": True,
    "grid.color": "#1f2a44",
    "grid.linestyle": "--",
    "grid.linewidth": 0.6,
    "grid.alpha": 0.6,

    # Text
    "text.color": "#e6e6eb",
    "axes.labelcolor": "#e6e6eb",
    "xtick.color": "#c7d0e0",
    "ytick.color": "#c7d0e0",
    "axes.titleweight": "bold",
    "axes.titlesize": 14,
    "axes.labelsize": 12,

    # Ticks
    "xtick.major.size": 4,
    "ytick.major.size": 4,
    "xtick.minor.size": 2,
    "ytick.minor.size": 2,

    # Lines
    "lines.linewidth": 2.0,
    "lines.markersize": 5,

    # Legend
    "legend.facecolor": "#0b0f1a",
    "legend.edgecolor": "#8aa2c8",
    "legend.framealpha": 0.9,
    "legend.fontsize": 10,

    # Color Cycle
    "axes.prop_cycle": mpl.cycler(color=[
        "#4cc9f0",
        "#f72585",
        "#b5179e",
        "#7209b7",
        "#560bad",
        "#480ca8",
        "#3a86ff",
        "#ffd166"
    ])
})

## Load preprocessed data

In [72]:
# Establish path and load in data using data loading function
PROJECT_ROOT = Path().resolve().parents[0]
DATA_PATH = PROJECT_ROOT / 'data' / 'processed' / 'fd001_processed.csv'
df = load_fd001(DATA_PATH)

In [73]:
# Sensors used as model features
SENSORS = ["sensor_4", "sensor_11", "sensor_15"]

# Check that the load was successful
df.describe()

statistic,unit_number,time_in_cycles,operation_setting_1,operation_setting_2,operation_setting_3,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,...,sensor_14,sensor_15,sensor_16,sensor_17,sensor_18,sensor_19,sensor_20,sensor_21,RUL,RUL_capped
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,...,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,...,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705,107.807862,86.829286
std,29.227633,68.88099,0.002187,0.000293,0.0,0.0,0.500053,6.13115,9.000605,1.7764e-15,...,19.076176,0.037505,1.3878120000000003e-17,1.548763,0.0,0.0,0.180746,0.108251,68.88099,41.673699
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,...,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942,0.0,0.0
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,...,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218,51.0,51.0
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,...,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979,103.0,103.0
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,...,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668,155.0,125.0
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,...,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184,361.0,125.0
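
The `RUL` and `RUL_capped` columns are produced upstream of this notebook, so their exact construction is not shown here. The summary above is consistent with `RUL` being the number of cycles left before an engine's last recorded cycle and `RUL_capped` being that value clipped at 125. A minimal sketch under that assumption, using a hypothetical toy frame (`rul_demo`) and an assumed constant `RUL_CAP`:

```python
import numpy as np
import pandas as pd

RUL_CAP = 125  # assumption: inferred from the RUL_capped maximum in the summary

# Hypothetical toy data: engine 1 runs 200 cycles, engine 2 runs 100 cycles
rul_demo = pd.DataFrame({
    "unit_number": np.concatenate([np.full(200, 1), np.full(100, 2)]),
    "time_in_cycles": np.concatenate([np.arange(1, 201), np.arange(1, 101)]),
})

# Last observed cycle of each engine, broadcast back to every row
last_cycle = rul_demo.groupby("unit_number")["time_in_cycles"].transform("max")

# Remaining cycles until the last observation, then clipped from above
rul_demo["RUL"] = last_cycle - rul_demo["time_in_cycles"]
rul_demo["RUL_capped"] = rul_demo["RUL"].clip(upper=RUL_CAP)
```

Capping flattens the target early in an engine's life, where sensor readings carry little degradation signal.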


## Engine-level train/validation split

In [74]:
# Each engine should start at cycle 1
starts = df.groupby("unit_number")["time_in_cycles"].min()
assert (starts == 1).all(), "Some engines do not start at cycle 1"

# Cycles strictly increase within each engine, checked on the stored row order
# (sorting first would make the monotonicity check pass trivially)
assert (
    df.groupby("unit_number")["time_in_cycles"]
      .apply(lambda x: x.is_monotonic_increasing and x.is_unique)
      .all()
), "Cycle ordering issue detected"


In [75]:
# Grouping by engine id
groups = df['unit_number']

# Engine level train/validation split
gss = GroupShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=RANDOM_STATE
)

train_idx, val_idx = next(gss.split(df, groups=groups))

train_df = df.iloc[train_idx].copy()
val_df = df.iloc[val_idx].copy()

# Collect the engine IDs present in each split
train_engines = set(train_df['unit_number'])
val_engines = set(val_df['unit_number'])

# Stops execution if any engine appears in both sets - preventing engine data leaks
assert train_engines.isdisjoint(val_engines), "Engine leak detected"

# Define features and target
X_train = train_df[SENSORS]
y_train = train_df['RUL_capped']

X_val = val_df[SENSORS]
y_val = val_df["RUL_capped"]

# Check that splits are properly proportioned
len(train_engines), len(val_engines)

(80, 20)
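
To illustrate why the split is done on engines rather than rows, here is a minimal sketch on a hypothetical toy frame (10 engines, 5 cycles each). The names `toy`, `splitter`, and the `toy_*` variables are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 engines with 5 cycles each
toy = pd.DataFrame({
    "unit_number": np.repeat(np.arange(1, 11), 5),
    "time_in_cycles": np.tile(np.arange(1, 6), 10),
})

# test_size is applied to the number of groups (engines), not rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
toy_train_idx, toy_val_idx = next(splitter.split(toy, groups=toy["unit_number"]))

toy_train_units = set(toy.iloc[toy_train_idx]["unit_number"])
toy_val_units = set(toy.iloc[toy_val_idx]["unit_number"])

# Engines that contribute rows to both sides (should be none)
engine_overlap = toy_train_units & toy_val_units
```

Because every row of an engine lands on the same side, the validation score reflects performance on engines the model has never seen.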

## Feature scaling and preprocessing

In [76]:
# Initialize scaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same
# transformation to the validation data to avoid leakage
X_train_scaled_array = scaler.fit_transform(X_train)
X_val_scaled_array = scaler.transform(X_val)

# Convert the arrays back to DataFrames, keeping column names and row indices
X_train_scaled = pd.DataFrame(
    X_train_scaled_array,
    columns=SENSORS,
    index=X_train.index
)

X_val_scaled = pd.DataFrame(
    X_val_scaled_array,
    columns=SENSORS,
    index=X_val.index
)

In [79]:
# Check features are standardized
X_train_scaled

index,sensor_4,sensor_11,sensor_15
192,-1.963014,-2.294365,-1.373754
193,-1.756664,-1.127747,-0.675217
194,-1.106382,-1.203012,-1.648369
195,-1.427619,-1.654606,-1.072476
196,-0.709298,-1.090114,-1.475068
...,...,...,...
20626,2.203027,1.995779,1.428392
20627,2.755153,1.882880,1.916301
20628,2.152834,2.071044,3.268050
20629,1.968792,3.200030,2.582844
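
For a leakage-free pipeline, the scaler's mean and standard deviation should come from the training engines only and then be reused on the validation engines. A minimal sketch on synthetic arrays (all `demo_*` names and the random data are hypothetical) shows what that implies for each split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for three sensor columns; validation is slightly shifted
demo_train = rng.normal(loc=1400.0, scale=9.0, size=(200, 3))
demo_val = rng.normal(loc=1403.0, scale=9.0, size=(50, 3))

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # statistics learned here only
demo_val_scaled = demo_scaler.transform(demo_val)          # reuses training mean and std

# Training columns are centered with unit variance by construction
train_means = demo_train_scaled.mean(axis=0)
train_stds = demo_train_scaled.std(axis=0)

# Validation columns generally are not, which is expected
val_means = demo_val_scaled.mean(axis=0)
```

A validation mean that is not exactly zero is not a problem. It simply reflects that the validation engines differ slightly from the training engines.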


## Baseline model choice
RUL prediction is a regression problem, so we start with linear regression as the baseline reference. It provides a simple, interpretable benchmark before moving to more complex models. Many of the sensors identified in EDA show approximately monotonic degradation, so a linear function should capture at least the first-order relationship between sensor values and RUL. Linear regression is also a high-bias, low-variance model, which makes it well suited to establishing a performance floor that later machine learning and time-series models can improve upon.
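
As a rough illustration of what such a baseline looks like, the sketch below fits scikit-learn's `LinearRegression` to synthetic data and computes the three metrics imported earlier. The data, the coefficients used to generate it, and every variable name are hypothetical. The notebook's own model may instead be built with `statsmodels` or `RidgeCV`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic standardized features; the target falls as the first two features rise
X_demo = rng.normal(size=(500, 3))
noise = rng.normal(scale=5.0, size=500)
y_demo = np.clip(85.0 - 20.0 * X_demo[:, 0] - 10.0 * X_demo[:, 1] + noise, 0, 125)

# Simple holdout; rows are i.i.d. here, so no grouping is needed in this toy case
X_fit, X_hold = X_demo[:400], X_demo[400:]
y_fit, y_hold = y_demo[:400], y_demo[400:]

baseline = LinearRegression().fit(X_fit, y_fit)
y_pred = baseline.predict(X_hold)

mae = mean_absolute_error(y_hold, y_pred)
rmse = np.sqrt(mean_squared_error(y_hold, y_pred))
r2 = r2_score(y_hold, y_pred)
```

The fitted `baseline.coef_` values can be read directly as the change in predicted RUL per one standard deviation of each feature, which is what makes the baseline easy to interpret.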