### **Exploratory Data Analysis (EDA)**

#### **1. Imports**

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from scipy.stats import skew, gaussian_kde

# Show all rows and all columns in output (no truncation)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

#### **2. Functions**

In [2]:
def add_RUL(df):
    """RUL = max cycle - current cycle """
    max_cycle = df.groupby("engine_id")["time_in_cycles"].transform("max")
    df["RUL"] = max_cycle - df["time_in_cycles"]
    return df

def outlier_detection(df):
    """Outlier detection using IQR"""
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers_mask = (df < lower) | (df > upper)
    return outliers_mask.sum()
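
Note that `add_RUL` writes the new column into the frame it receives. A minimal sketch of both helpers on toy data is given here, assuming a non-mutating variant is acceptable; the names `with_rul` and `iqr_outlier_count` and all values are illustrative and not part of the notebook.

```python
import pandas as pd

def with_rul(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with RUL = last observed cycle of the engine - current cycle."""
    max_cycle = df.groupby("engine_id")["time_in_cycles"].transform("max")
    return df.assign(RUL=max_cycle - df["time_in_cycles"])

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Toy data (hypothetical): engine 1 runs 3 cycles, engine 2 runs 2 cycles
toy = pd.DataFrame({
    "engine_id":      [1, 1, 1, 2, 2],
    "time_in_cycles": [1, 2, 3, 1, 2],
})
toy_rul = with_rul(toy)
print(toy_rul["RUL"].tolist())  # [2, 1, 0, 1, 0]

# Q1 = 2, Q3 = 4, IQR = 2, so the bounds are [-1, 7] and only 100 is flagged
n_out = iqr_outlier_count(pd.Series([1, 2, 3, 4, 100]))
print(n_out)  # 1
```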



#### **3. EDA**

In [3]:
header_names = ['engine_id', 'time_in_cycles', 'operational_setting_1', 'operational_setting_2', 'operational_setting_3', 'sensor_measurement_1', 'sensor_measurement_2', 'sensor_measurement_3', 'sensor_measurement_4', 'sensor_measurement_5', 'sensor_measurement_6', 'sensor_measurement_7', 'sensor_measurement_8', 'sensor_measurement_9', 'sensor_measurement_10', 'sensor_measurement_11', 'sensor_measurement_12', 'sensor_measurement_13', 'sensor_measurement_14', 'sensor_measurement_15', 'sensor_measurement_16', 'sensor_measurement_17', 'sensor_measurement_18', 'sensor_measurement_19', 'sensor_measurement_20', 'sensor_measurement_21']

#### **3.1 FD001**

In [4]:
# Reading data
df_1 = pd.read_csv("../data/CMAPSSData/train_FD001.txt", sep=r"\s+", header=None, names=header_names)
df_1.head()

Unnamed: 0,engine_id,time_in_cycles,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,sensor_measurement_9,sensor_measurement_10,sensor_measurement_11,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044


In [5]:
# Adding target variable 
df_1 = add_RUL(df_1)
df_1.head()

Unnamed: 0,engine_id,time_in_cycles,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,sensor_measurement_9,sensor_measurement_10,sensor_measurement_11,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187


In [15]:
# Checking for NA or NULL values
df_1.info()

<class 'pandas.DataFrame'>
RangeIndex: 20631 entries, 0 to 20630
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   engine_id              20631 non-null  int64  
 1   time_in_cycles         20631 non-null  int64  
 2   operational_setting_1  20631 non-null  float64
 3   operational_setting_2  20631 non-null  float64
 4   operational_setting_3  20631 non-null  float64
 5   sensor_measurement_1   20631 non-null  float64
 6   sensor_measurement_2   20631 non-null  float64
 7   sensor_measurement_3   20631 non-null  float64
 8   sensor_measurement_4   20631 non-null  float64
 9   sensor_measurement_5   20631 non-null  float64
 10  sensor_measurement_6   20631 non-null  float64
 11  sensor_measurement_7   20631 non-null  float64
 12  sensor_measurement_8   20631 non-null  float64
 13  sensor_measurement_9   20631 non-null  float64
 14  sensor_measurement_10  20631 non-null  float64
 15  sensor_measur

In [7]:
print(f"(Rows, Columns): ({df_1.shape[0]},{df_1.shape[1]})")

(Rows, Columns): (20631,27)


In [10]:
# Number of cycles per engine
rows_per_engine_df_1 = df_1.groupby('engine_id').size().reset_index(name='cycles')
fig = px.bar(
    rows_per_engine_df_1, x='engine_id', y='cycles',
    title='Number of cycles per engine',
    labels={'engine_id': 'Engine ID', 'cycles': 'Cycles (lifecycle length)'},
)
fig.update_layout(xaxis_title='Engine ID', yaxis_title='Cycles')
fig.show()

In [30]:
df_1.describe()

Unnamed: 0,engine_id,time_in_cycles,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,sensor_measurement_9,sensor_measurement_10,sensor_measurement_11,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,RUL
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,21.609803,553.367711,2388.096652,9065.242941,1.3,47.541168,521.41347,2388.096152,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705,107.807862
std,29.227633,68.88099,0.002187,0.000293,0.0,0.0,0.500053,6.13115,9.000605,5.3292e-15,0.001389,0.885092,0.070985,22.08288,0.0,0.267087,0.737553,0.071919,19.076176,0.037505,3.469531e-18,1.548763,0.0,0.0,0.180746,0.108251,68.88099
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,21.6,549.85,2387.9,9021.73,1.3,46.85,518.69,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942,0.0
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,21.61,552.81,2388.05,9053.1,1.3,47.35,520.96,2388.04,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218,51.0
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,21.61,553.44,2388.09,9060.66,1.3,47.51,521.48,2388.09,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979,103.0
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,21.61,554.01,2388.14,9069.42,1.3,47.7,521.95,2388.14,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668,155.0
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,21.61,556.06,2388.56,9244.59,1.3,48.53,523.38,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184,361.0


##### 3.1.3 Numeric summary & data quality
Describe, duplicates, and basic data quality checks.

In [31]:
# Checking for duplicates
df_1.duplicated().sum()

np.int64(0)

In [32]:
# Checking for constant columns 
df_1.nunique()

engine_id                 100
time_in_cycles            362
operational_setting_1     158
operational_setting_2      13
operational_setting_3       1
sensor_measurement_1        1
sensor_measurement_2      310
sensor_measurement_3     3012
sensor_measurement_4     4051
sensor_measurement_5        1
sensor_measurement_6        2
sensor_measurement_7      513
sensor_measurement_8       53
sensor_measurement_9     6403
sensor_measurement_10       1
sensor_measurement_11     159
sensor_measurement_12     427
sensor_measurement_13      56
sensor_measurement_14    6078
sensor_measurement_15    1918
sensor_measurement_16       1
sensor_measurement_17      13
sensor_measurement_18       1
sensor_measurement_19       1
sensor_measurement_20     120
sensor_measurement_21    4745
RUL                       362
dtype: int64

In [33]:
# Dropping constant columns
df_1.drop(columns=['operational_setting_3', 'sensor_measurement_1', 'sensor_measurement_5', 'sensor_measurement_10', 'sensor_measurement_16', 'sensor_measurement_18', 'sensor_measurement_19'], inplace=True)
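
The hard-coded list above matches the columns with a single unique value in the `nunique()` output. As a sketch, the same list could be derived programmatically; the toy frame and its column names are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values) with one varying and two constant columns
toy = pd.DataFrame({
    "sensor_a":  [641.8, 642.2, 642.4],
    "sensor_b":  [518.67, 518.67, 518.67],
    "setting_c": [100.0, 100.0, 100.0],
})

# A column is constant when it has at most one distinct value
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)  # ['sensor_b', 'setting_c']
```

Deriving the list this way avoids retyping column names when the same check is repeated on another subset.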

In [34]:
sensor_cols = [c for c in df_1.columns if 'sensor' in c]
for i in sensor_cols:
    print(f"{i}: {round((outlier_detection(df_1[i])/df_1[i].shape[0]) * 100, 2)}%")

sensor_measurement_2: 0.62%
sensor_measurement_3: 0.8%
sensor_measurement_4: 0.58%
sensor_measurement_6: 1.97%
sensor_measurement_7: 0.53%
sensor_measurement_8: 1.55%
sensor_measurement_9: 8.17%
sensor_measurement_11: 0.81%
sensor_measurement_12: 0.71%
sensor_measurement_13: 0.78%
sensor_measurement_14: 7.48%
sensor_measurement_15: 0.58%
sensor_measurement_17: 0.39%
sensor_measurement_20: 0.57%
sensor_measurement_21: 0.66%


##### **Since the CMAPSS dataset includes simulated measurement noise and does not provide physical sensor bounds, extreme values were not removed blindly. Instead, smoothing-based feature engineering was used to preserve potential degradation signals while mitigating noise.**
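
The notebook does not show which smoothing was applied. One common option, assumed here purely for illustration, is a rolling mean computed separately per engine so that values never leak across engines. The window length `WINDOW` and the toy column `sensor_x` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two engines with very different sensor levels
toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 1, 2, 2, 2],
    "sensor_x":  [10.0, 12.0, 11.0, 13.0, 50.0, 52.0, 51.0],
})

WINDOW = 3  # assumed window length, not taken from the notebook

# Rolling mean within each engine; min_periods=1 keeps the first rows instead of NaN
toy["sensor_x_smooth"] = (
    toy.groupby("engine_id")["sensor_x"]
       .transform(lambda s: s.rolling(WINDOW, min_periods=1).mean())
)
print(toy["sensor_x_smooth"].round(2).tolist())
# [10.0, 11.0, 11.0, 12.0, 50.0, 51.0, 51.0]
```

The first smoothed value of engine 2 is 50.0, not an average with engine 1, which is the point of grouping before rolling.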

In [35]:
fig = px.histogram(df_1, x="RUL", nbins=50, title="RUL distribution")
fig.update_layout(xaxis_title="RUL (cycles)", yaxis_title="Count")
fig.show()

In [36]:
print("RUL skewness:", skew(df_1["RUL"]))

RUL skewness: 0.49986761846946354


##### **RUL pooled across all rows is moderately right-skewed (skewness ≈ 0.50): every engine contributes rows at low RUL values (0, 1, 2, …), while only long-lived engines contribute rows at high RUL. The skewness therefore reflects the mix of engine lifespans and the one-row-per-cycle structure, not a countdown bias within a single engine.**
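
A small sketch of this pooling effect, using three hypothetical lifespans rather than the FD001 engines. It uses `Series.skew()` from pandas, which applies a bias correction and so differs slightly from `scipy.stats.skew` with default arguments.

```python
import numpy as np
import pandas as pd

# Hypothetical lifespans (in cycles) for three engines
lifespans = [100, 150, 300]

# Each engine contributes one row per cycle, i.e. RUL values lifespan-1, ..., 1, 0
pooled_rul = pd.Series(np.concatenate([np.arange(n) for n in lifespans]))

# Low RUL values occur in every engine, high RUL values only in the longest one
rows_per_rul = pooled_rul.value_counts()
print(rows_per_rul.loc[0], rows_per_rul.loc[120], rows_per_rul.loc[200])  # 3 2 1

# Positive skewness: the pooled distribution has a longer right tail
print(pooled_rul.skew() > 0)  # True
```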

In [37]:
# 1. Filter for the first 100 engines (FD001 has exactly 100 engines, so this keeps every row)
df_subset = df_1[df_1['engine_id'] <= 100].copy()

# 2. Convert engine_id to string so Plotly treats it as a discrete category (distinct colors)
df_subset['engine_id'] = df_subset['engine_id'].astype(str)

# 3. Plot all lines at once
fig = px.line(
    df_subset, 
    x='time_in_cycles', 
    y='RUL', 
    color='engine_id',
    title='RUL vs Cycle (Engines 1-100)',
    labels={'time_in_cycles': 'Cycle', 'RUL': 'Remaining Useful Life'}
)

# Optional: Hide the legend if 100 lines make it too cluttered
fig.update_layout(showlegend=False)

fig.show()

In [38]:
# Most rows come from the healthy phase; only a small share is near failure (RUL < 20)
print(f"Checking for RUL Imbalance: {round((df_1['RUL'] < 20).mean().item() * 100,2)}%")


Checking for RUL Imbalance: 9.69%


In [39]:
corr_with_rul = df_1[[c for c in df_1.columns if 'sensor' in c]].corrwith(df_1['RUL']).dropna()
corr_with_rul = corr_with_rul.sort_values()

fig = px.bar(
    x=corr_with_rul.values,
    y=corr_with_rul.index,
    orientation='h',
    title='Correlation of sensors with RUL',
    labels={'x': 'Correlation', 'y': 'Sensor'},
    color=corr_with_rul.values,
    color_continuous_scale='RdBu_r',
    range_color=[-1, 1],
)
fig.update_layout(height=500, yaxis={'categoryorder': 'total ascending'})
fig.show()

In [40]:
sensors = [c for c in df_1.columns if 'sensor' in c]
for s in sensors:
    fig = px.histogram(df_1, x=s, nbins=50, title=f"Distribution: {s}")
    fig.update_layout(height=300, width=600)
    fig.show()

| Shape | Description | Interpretation |
|-------|-------------|----------------|
| 🟢 Normal-looking | Smooth bell curve | Good, stable sensor with useful variation |
| 🟡 Skewed | Right/left skew | Might indicate degradation trend |
| 🔴 Very tight spike | Almost one value | Near-constant sensor (possibly useless) |
| 🔵 Multi-modal | Two peaks | Different operating conditions (common in CMAPSS due to multiple op settings) |

| Shape | Applies to FD001? |
|-------|-------------------|
| Bell curve | Yes - means sensor has useful variation |
| Skewed | Somewhat - but check time-series plots for real degradation trends |
| Tight spike | Yes - 6 constant sensors (1, 5, 10, 16, 18, 19) plus near-constant sensor 6 (2 distinct values) |
| Multi-modal | No - FD001 has one condition; multi-modal applies to FD002/FD004 |
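
The "tight spike" case can also be checked numerically instead of by eye. A minimal sketch, assuming a column counts as near-constant when one value covers at least 95% of rows; the cutoff `THRESHOLD` and the toy columns are illustrative assumptions.

```python
import pandas as pd

# Toy frame (hypothetical): one two-level column and one clearly varying column
toy = pd.DataFrame({
    "two_level": [21.61] * 98 + [21.60] * 2,
    "varying":   [float(i) for i in range(100)],
})

THRESHOLD = 0.95  # assumed cutoff for "near-constant"

# Share of rows taken by the most frequent value in each column
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])
near_constant = top_share[top_share >= THRESHOLD].index.tolist()
print(near_constant)  # ['two_level']
```

A check like this catches columns that `nunique() == 1` misses because a second value appears in a handful of rows.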

##### **Dropping sensor 6 because it takes only 2 distinct values and is effectively constant**

In [41]:
df_1.drop(columns=['sensor_measurement_6'], inplace=True)
df_1.head()

Unnamed: 0,engine_id,time_in_cycles,operational_setting_1,operational_setting_2,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_7,sensor_measurement_8,sensor_measurement_9,sensor_measurement_11,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_17,sensor_measurement_20,sensor_measurement_21,RUL
0,1,1,-0.0007,-0.0004,641.82,1589.7,1400.6,554.36,2388.06,9046.19,47.47,521.66,2388.02,8138.62,8.4195,392,39.06,23.419,191
1,1,2,0.0019,-0.0003,642.15,1591.82,1403.14,553.75,2388.04,9044.07,47.49,522.28,2388.07,8131.49,8.4318,392,39.0,23.4236,190
2,1,3,-0.0043,0.0003,642.35,1587.99,1404.2,554.26,2388.08,9052.94,47.27,522.42,2388.03,8133.23,8.4178,390,38.95,23.3442,189
3,1,4,0.0007,0.0,642.35,1582.79,1401.87,554.45,2388.11,9049.48,47.13,522.86,2388.08,8133.83,8.3682,392,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,642.37,1582.85,1406.22,554.0,2388.06,9055.15,47.28,522.19,2388.04,8133.8,8.4294,393,38.9,23.4044,187


In [43]:
sensors = [c for c in df_1.columns if 'sensor' in c]

for sensor in sensors:
    # Group by RUL so all engines align at failure
    stats = df_1.groupby("RUL")[sensor].agg(["mean", "std"]).reset_index()
    stats = stats.sort_values("RUL", ascending=False)

    fig = go.Figure()
    # Shaded band (mean ± std)
    fig.add_trace(go.Scatter(
        x=stats["RUL"], y=stats["mean"] + stats["std"],
        mode="lines", line=dict(width=0), showlegend=False,
    ))
    fig.add_trace(go.Scatter(
        x=stats["RUL"], y=stats["mean"] - stats["std"],
        mode="lines", line=dict(width=0), fill="tonexty", fillcolor="rgba(68,114,196,0.2)",
        showlegend=False,
    ))
    # Mean line
    fig.add_trace(go.Scatter(
        x=stats["RUL"], y=stats["mean"],
        mode="lines", line=dict(color="rgb(68,114,196)", width=2), name="Mean",
    ))

    fig.update_layout(
        title=f"{sensor} — mean ± std across all engines (aligned by RUL)",
        xaxis_title="RUL (cycles remaining)",
        yaxis_title=sensor,
        xaxis=dict(autorange="reversed"),
        height=450,
    )
    fig.show()

#### Sensor Degradation Trends (Mean ± Std aligned by RUL)

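These plots show the **average sensor value** (blue line) and **±1 standard deviation band** (shaded area) across the engines, aligned by **RUL** (Remaining Useful Life). RUL = 0 (failure) is on the right. At high RUL only the long-lived engines contribute rows, so the left end of each curve is based on fewer engines than the right end.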

##### Sensors that INCREASE as engine approaches failure (negative correlation with RUL)

| Sensor | Paper Name | Observation |
|--------|-----------|-------------|
| sensor_measurement_2 | T24 (LPC outlet temp) | Mean rises as RUL → 0 |
| sensor_measurement_3 | T30 (HPC outlet temp) | Mean rises as RUL → 0 |
| sensor_measurement_4 | T50 (LPT outlet temp) | Mean rises as RUL → 0 |
| sensor_measurement_8 | Nf (fan speed) | Slight increase near failure |
| sensor_measurement_11 | Ps30 (HPC static pressure) | Rises as engine degrades (strongest signal) |
| sensor_measurement_13 | NRf (corrected fan speed) | Slight increase near failure |
| sensor_measurement_15 | BPR (bypass ratio) | Rises toward failure |
| sensor_measurement_17 | htBleed (bleed enthalpy) | Rises toward failure |

##### Sensors that DECREASE as engine approaches failure (positive correlation with RUL)

| Sensor | Paper Name | Observation |
|--------|-----------|-------------|
| sensor_measurement_7 | P30 (HPC outlet pressure) | Mean drops as RUL → 0 |
| sensor_measurement_12 | phi (fuel flow / Ps30) | Drops toward failure |
| sensor_measurement_20 | W31 (HPT coolant bleed) | Drops toward failure |
| sensor_measurement_21 | W32 (LPT coolant bleed) | Drops toward failure |

##### Sensors with weak or flat trend

| Sensor | Paper Name | Observation |
|--------|-----------|-------------|
| sensor_measurement_9 | Nc (core speed) | Slight trend but very noisy (wide band) |
| sensor_measurement_14 | NRc (corrected core speed) | Slight trend, very noisy |

##### Key Takeaways

- **Degradation is visible:** Most sensors show a clear trend as RUL decreases: temperatures and Ps30 rise, while P30, phi, and the coolant bleeds (W31, W32) drop.
- **Degradation is gradual, not sudden:** Smooth drifts, not abrupt jumps — supports time-series / sequential modeling.
- **Best sensors for RUL prediction:** sensor_11 (Ps30), sensor_4 (T50), sensor_7 (P30), sensor_12 (phi), sensor_15 (BPR), sensor_2 (T24).
- **Weakest sensors:** sensor_9 (Nc), sensor_14 (NRc).
- **Noise is present but manageable:** Trends are stronger than noise for the best sensors.
- **Consistent with the paper's physics:** HPC degradation → efficiency loss → temperatures up, pressures/flows down.
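
The statement that trends are stronger than noise can be turned into a rough number. The sketch below uses an ad hoc ratio, assumed here for illustration: the range of the per-RUL mean curve divided by the average per-RUL standard deviation. The data is synthetic and the function name `trend_to_noise` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_engines, n_cycles = 20, 100

# RUL counts down from n_cycles - 1 to 0 for each synthetic engine
rul = np.tile(np.arange(n_cycles)[::-1], n_engines)
toy = pd.DataFrame({
    "RUL": rul,
    "trending": 0.1 * (n_cycles - rul) + rng.normal(0, 1, rul.size),  # drifts up toward failure
    "flat": rng.normal(0, 1, rul.size),                               # noise only
})

def trend_to_noise(df: pd.DataFrame, col: str) -> float:
    """Range of the mean-by-RUL curve divided by the average std-by-RUL."""
    stats = df.groupby("RUL")[col].agg(["mean", "std"])
    return float((stats["mean"].max() - stats["mean"].min()) / stats["std"].mean())

ratio_trending = trend_to_noise(toy, "trending")
ratio_flat = trend_to_noise(toy, "flat")
print(ratio_trending > ratio_flat)  # True
```

Applied to the real sensors, a ratio like this would give a simple ranking to compare against the visual impression from the plots.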

In [45]:
sensors = [c for c in df_1.columns if 'sensor' in c]
corr_matrix = df_1[sensors].corr()
fig = px.imshow(
    corr_matrix,
    color_continuous_scale="RdBu_r",
    zmin=-1, zmax=1,
    title="Correlation Heatmap",
    text_auto=".2f",
    height=800, width=900,
)
fig.show()

#### Correlation Heatmap – Sensor Relationships

##### 1. Two main sensor groups (red blocks)

The heatmap shows two clusters of highly correlated sensors:

**Group A (positive with each other):**
- Sensors: sensor_2 (T24), sensor_3 (T30), sensor_4 (T50), sensor_8 (Nf), sensor_11 (Ps30), sensor_13 (NRf), sensor_15 (BPR), sensor_17 (htBleed)
- These all **increase** together as the engine degrades.
- Correlations within this group: ~0.60–0.83.

**Group B (positive with each other):**
- Sensors: sensor_7 (P30), sensor_12 (phi), sensor_20 (W31), sensor_21 (W32)
- These all **decrease** together as the engine degrades.
- Correlations within this group: ~0.69–0.76.

**Between the two groups:** strong **negative** correlation (dark blue), ~−0.63 to −0.85 — one group goes up, the other goes down during degradation.

---

##### 2. Sensor 9 (Nc) and sensor 14 (NRc)

- Highly correlated with **each other** (~0.96) but weakly correlated with most other sensors (~0.16–0.34).
- They form a separate mini-cluster and carry different (or weaker) degradation signal.

---

##### 3. Multicollinearity within groups

- Within Group A: many pairwise correlations 0.70–0.83 → **redundant** information.
- Within Group B: many pairwise correlations 0.69–0.76 → **redundant** information.

**For modeling:**
- **Tree-based models:** Can keep all sensors; multicollinearity does not usually hurt predictive accuracy, though it spreads feature importance across correlated sensors.
- **Linear / neural nets:** Consider **PCA** or **feature selection** to reduce redundancy.
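
One simple form of feature selection for redundant sensors is greedy correlation pruning: keep a column only if its absolute correlation with every column already kept is below a threshold. This is a sketch under assumptions; the function `prune_correlated`, the threshold of 0.95, and the synthetic columns are illustrative and not the notebook's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "s_a": base,
    "s_b": base + rng.normal(scale=0.1, size=500),  # near-duplicate of s_a
    "s_c": rng.normal(size=500),                    # independent of s_a
})

def prune_correlated(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Greedily keep columns whose |corr| with all kept columns is below threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

kept = prune_correlated(toy)
print(kept)  # ['s_a', 's_c']
```

The result depends on column order, since the first column of each correlated pair is the one that survives.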

---

##### 4. Summary table

| Observation | Implication |
|-------------|-------------|
| Two anti-correlated groups | Degradation physics: some sensors up, some down. Good for prediction. |
| High correlation within groups | Redundancy; PCA or feature selection can help for linear models. |
| sensor_9, sensor_14 isolated | Weak degradation signal; may contribute little. |
| sensor_9 ↔ sensor_14 ≈ 0.96 | Nearly identical; keeping one is enough. |

**Conclusion:** Sensors split into two anti-correlated groups (temps/pressures up vs flows/pressures down). There is high redundancy within each group. Sensor_9 and sensor_14 form a separate weak cluster. PCA or feature selection can reduce redundancy for linear models.

In [54]:
max_cycles = df_1.groupby('engine_id')['time_in_cycles'].max().reset_index()
x = max_cycles['time_in_cycles']

# Histogram
fig = go.Figure()
fig.add_trace(go.Histogram(x=x, nbinsx=30, name='Engines', marker_color='#636EFA', opacity=0.7))

# KDE line (smooth curve) - scale to count
kde = gaussian_kde(x)
x_line = np.linspace(x.min(), x.max(), 200)
# Scale the KDE to the count axis: area under the KDE is 1, so multiply by n * bin_width.
# This is approximate: nbinsx=30 is only an upper bound for Plotly, so the drawn bin width may differ.
bin_width = (x.max() - x.min()) / 30
kde_vals = kde(x_line) * len(x) * bin_width

fig.add_trace(go.Scatter(
    x=x_line, y=kde_vals, mode='lines',
    name='Density', line=dict(color='red', width=2),
))

fig.update_layout(
    title='Distribution of Maximum Engine Life (Total Cycles)',
    xaxis_title='Total Cycles until Failure',
    yaxis_title='Number of Engines',
    bargap=0.1, showlegend=True,
)
fig.show()
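
The scaling used for the density line follows from the fact that a count histogram with equal-width bins has total area `n * bin_width`, while a density integrates to 1. A minimal sketch with explicit bin edges, using synthetic lifespans (the normal distribution and its parameters are assumptions for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(loc=200, scale=40, size=100)   # hypothetical engine lifespans

n_bins = 30
counts, edges = np.histogram(x, bins=n_bins)  # explicit, equal-width bins
bin_width = edges[1] - edges[0]

kde = gaussian_kde(x)
x_line = np.linspace(edges[0], edges[-1], 200)
kde_counts = kde(x_line) * len(x) * bin_width  # density -> expected count per bin

# Histogram area is n * bin_width; the scaled KDE area is slightly smaller
# because part of the KDE mass lies outside [min(x), max(x)]
hist_area = counts.sum() * bin_width
kde_area = np.sum((kde_counts[1:] + kde_counts[:-1]) / 2 * np.diff(x_line))  # trapezoid rule
print(kde_area / hist_area)
```

To make the Plotly bars use exactly this bin width, the histogram trace can be given `xbins=dict(start=..., end=..., size=bin_width)` instead of `nbinsx`.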

#### Engine lifecycle distribution (max cycles per engine)

##### 1. Variability in failure time

- Lifespans range from **~125 to 362 cycles** (362 is the maximum `time_in_cycles` in the `describe()` output), a spread of roughly 235 cycles.
- Failure time is **not fixed** — some engines fail much earlier, some much later.
- **Implication:** RUL cannot be predicted from cycle number alone; sensor-based degradation matters.
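
The range and spread quoted above can be computed directly instead of being read off the plot. A sketch on a toy run-to-failure log (the three engines and their lifespans are hypothetical):

```python
import pandas as pd

# Toy run-to-failure log (hypothetical): engines 1, 2, 3 last 3, 5 and 10 cycles
toy = pd.DataFrame({
    "engine_id": [1] * 3 + [2] * 5 + [3] * 10,
    "time_in_cycles": list(range(1, 4)) + list(range(1, 6)) + list(range(1, 11)),
})

# Lifespan of an engine = its last recorded cycle
lifespans = toy.groupby("engine_id")["time_in_cycles"].max()
summary = {
    "min": int(lifespans.min()),
    "median": float(lifespans.median()),
    "max": int(lifespans.max()),
    "spread": int(lifespans.max() - lifespans.min()),
    "skew": float(lifespans.skew()),  # > 0 means a longer right tail
}
print(summary["min"], summary["median"], summary["max"], summary["spread"])  # 3 5.0 10 7
```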

##### 2. Shape: right-skewed

- Most engines sit in a **main band** (~140–220 cycles).
- **Long right tail:** a few engines last much longer (250–350 cycles).
- **Implication:** Typical life is in the 140–220 range; a minority are long-lived. The model will see more medium-life engines.

##### 3. Peak (most common lifespan)

- Highest bar around **185–195 cycles** (~15 engines).
- Secondary peaks around **145–155** and **175–185** cycles.
- **Implication:** There is a typical lifespan band, with real spread around it.

##### 4. Consistency across engines

- **Moderate:** Many engines cluster in the middle, but the distribution is wide.
- Initial wear and degradation paths differ (as in the paper).
- **Implication:** Model should use engine-specific information (sensors, early-cycle behavior), not only a global average life.

##### 5. Outliers

- A few engines **< 130** cycles (early failure).
- A few **> 300** cycles (very long life).
- **Implication:** Both short- and long-lived engines are present; the model should learn from this spread.

##### 6. Summary table

| Observation | Modeling takeaway |
|-------------|-------------------|
| Wide range (125–350) | Predict RUL from **sensor evolution**, not cycle index. |
| Right skew | More data in medium RUL; also evaluate on high-RUL (long-lived) engines. |
| One dominant peak + tail | One main behavior mode with a long-lived minority. |

**Conclusion:** Engine lifespans are right-skewed with a main band around 185 cycles and a long tail to 350+. Variability is high, so RUL prediction should rely on sensor-based degradation; evaluation should account for both typical and long-lived engines.