# NASA Turbofan Dataset – Structured Data Exploration

Purpose:
- Understand dataset structure
- Extract schema requirements
- Identify ingestion constraints
- Quantify data statistics

Outputs:
- Dataset summary
- Schema implications
- Manifest design inputs


In [6]:
from pathlib import Path
import pandas as pd
import numpy as np

# -------------------------------------------------------------------
# Configuration
# -------------------------------------------------------------------

DATA_PATH = Path("../data/raw/nasa_turbofan")
SUBSET = "FD001"

# -------------------------------------------------------------------
# Dataset Schema Definition
# -------------------------------------------------------------------

COLUMNS = (
    ["engine_id", "cycle"]
    + [f"op_setting_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)

In [7]:
# -------------------------------------------------------------------
# Load dataset
# -------------------------------------------------------------------

df = pd.read_csv(
    DATA_PATH / f"train_{SUBSET}.txt",
    sep=r"\s+",
    header=None,
    names=COLUMNS
)

# -------------------------------------------------------------------
# Basic validation
# -------------------------------------------------------------------

print("Shape:", df.shape)
print("Unique engines:", df["engine_id"].nunique())
print("\nColumn check:", len(df.columns), "columns")

df.info()


Shape: (20631, 26)
Unique engines: 100

Column check: 26 columns
<class 'pandas.DataFrame'>
RangeIndex: 20631 entries, 0 to 20630
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   engine_id     20631 non-null  int64  
 1   cycle         20631 non-null  int64  
 2   op_setting_1  20631 non-null  float64
 3   op_setting_2  20631 non-null  float64
 4   op_setting_3  20631 non-null  float64
 5   sensor_1      20631 non-null  float64
 6   sensor_2      20631 non-null  float64
 7   sensor_3      20631 non-null  float64
 8   sensor_4      20631 non-null  float64
 9   sensor_5      20631 non-null  float64
 10  sensor_6      20631 non-null  float64
 11  sensor_7      20631 non-null  float64
 12  sensor_8      20631 non-null  float64
 13  sensor_9      20631 non-null  float64
 14  sensor_10     20631 non-null  float64
 15  sensor_11     20631 non-null  float64
 16  sensor_12     20631 non-null  float64
 17  sensor_13     2

In [8]:
# Rows per engine
engine_lengths = df.groupby("engine_id").size()

print("Engine lifetime stats:")
print(engine_lengths.describe())

# Cycle range
print("\nCycle range:")
print(df["cycle"].describe())


Engine lifetime stats:
count    100.000000
mean     206.310000
std       46.342749
min      128.000000
25%      177.000000
50%      199.000000
75%      229.250000
max      362.000000
dtype: float64

Cycle range:
count    20631.000000
mean       108.807862
std         68.880990
min          1.000000
25%         52.000000
50%        104.000000
75%        156.000000
max        362.000000
Name: cycle, dtype: float64
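
The summaries above are consistent with each engine logging exactly one row per cycle, starting at cycle 1: the longest engine has 362 rows and the largest cycle value is also 362. Aggregate statistics do not prove this, though, so an ingestion step may want an explicit per-engine check. The following is a minimal sketch under that assumption; the helper name `find_noncontiguous_engines` and the toy frame are illustrative and not part of the notebook.

```python
import pandas as pd


def find_noncontiguous_engines(df: pd.DataFrame) -> list:
    """Return engine_ids whose cycles are not exactly 1, 2, ..., N."""
    stats = df.groupby("engine_id")["cycle"].agg(["min", "max", "nunique", "size"])
    # N distinct integer cycles with min 1 and max N must be exactly 1..N.
    ok = (
        (stats["min"] == 1)
        & (stats["max"] == stats["size"])
        & (stats["nunique"] == stats["size"])
    )
    return stats.index[~ok].tolist()


# Toy frame (not real data): engine 1 is clean, engine 2 skips cycle 2.
toy = pd.DataFrame(
    {"engine_id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 3]}
)
bad_engines = find_noncontiguous_engines(toy)
print(bad_engines)
```

On the real data the call would be `find_noncontiguous_engines(df)`; an empty list would mean the one-row-per-cycle assumption holds for every engine in the subset.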


In [9]:
# -------------------------------------------------------------------
# Compute Remaining Useful Life (RUL)
# -------------------------------------------------------------------

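# Last recorded cycle per engine. In the run-to-failure training files this
# is the cycle at which the engine failed.
max_cycles = df.groupby("engine_id")["cycle"].max()

# RUL = last cycle - current cycle, so the final row of each engine has RUL 0.
# Note: a row-wise apply upcasts each mixed int/float row to float64, which is
# why RUL is displayed as 191.0 rather than 191.
df["RUL"] = df.apply(
    lambda row: max_cycles[row["engine_id"]] - row["cycle"],
    axis=1
)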

# Quick validation
df[["engine_id", "cycle", "RUL"]].head()


   engine_id  cycle    RUL
0          1      1  191.0
1          1      2  190.0
2          1      3  189.0
3          1      4  188.0
4          1      5  187.0
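
The row-wise `df.apply` in the cell above calls a Python lambda once per row, and it is most likely the reason RUL comes back as a float. A vectorized alternative is sketched below, assuming the same `engine_id` and `cycle` columns; the helper name `add_rul` and the toy frame are illustrative, not part of the notebook.

```python
import pandas as pd


def add_rul(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an RUL column (last cycle - current cycle)."""
    out = df.copy()
    # transform("max") broadcasts each engine's last cycle back to its rows.
    last_cycle = out.groupby("engine_id")["cycle"].transform("max")
    out["RUL"] = last_cycle - out["cycle"]
    return out


# Toy frame (not real data): engine 1 runs 3 cycles, engine 2 runs 2.
toy = pd.DataFrame(
    {
        "engine_id": [1, 1, 1, 2, 2],
        "cycle": [1, 2, 3, 1, 2],
        "sensor_1": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)
toy_rul = add_rul(toy)
print(toy_rul[["engine_id", "cycle", "RUL"]])
```

Because both operands are integer columns, RUL keeps an integer dtype here, which may be preferable for a schema that treats RUL as a cycle count.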


In [10]:
print("RUL statistics:")
print(df["RUL"].describe())

RUL statistics:
count    20631.000000
mean       107.807862
std         68.880990
min          0.000000
25%         51.000000
50%        103.000000
75%        155.000000
max        361.000000
Name: RUL, dtype: float64
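
The RUL statistics mirror the cycle statistics shifted down by one: the mean is 107.807862 versus 108.807862, the standard deviation is identical, and every quantile is lower by exactly 1. This is what one would expect if each engine's cycles run 1..N without gaps, because RUL then takes the values N-1..0, which is the same multiset as `cycle - 1`. The sketch below checks that identity on a toy frame; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (not real data): two engines with contiguous cycles.
toy = pd.DataFrame({"engine_id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})
toy["RUL"] = toy.groupby("engine_id")["cycle"].transform("max") - toy["cycle"]

# Compare the sorted values: RUL should be a reordering of (cycle - 1).
shift_holds = np.array_equal(
    np.sort(toy["RUL"].to_numpy()),
    np.sort(toy["cycle"].to_numpy() - 1),
)
mean_gap = toy["cycle"].mean() - toy["RUL"].mean()
print("RUL is cycle - 1 reordered:", shift_holds)
print("mean(cycle) - mean(RUL):", mean_gap)
```

Running the same comparison on `df` would be a cheap sanity check on the RUL column: a mismatch would point to missing or duplicated cycles rather than to a modelling issue.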
