# Predictive Maintenance: Remaining Useful Life (RUL) Prediction for Turbofan Engines

## Project Overview

This notebook demonstrates a predictive maintenance solution for turbofan engines. The goal is to predict the Remaining Useful Life (RUL) of aircraft engines based on sensor data, allowing for proactive maintenance scheduling and preventing unexpected failures. This project leverages time-series data and Long Short-Term Memory (LSTM) neural networks, a type of recurrent neural network (RNN) particularly well-suited for sequence prediction problems.

## Dataset

The dataset used in this project is the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset. The cells below load the FD002 and FD004 training files and the FD001 test file. The data is a multivariate time series from multiple turbofan engines. Each engine starts with a different degree of initial wear and manufacturing variation, and operates under varying operational settings. Sensor measurements are recorded at each operational cycle until failure.

**Key features of the dataset:**

* `engine_id`: Unique identifier for each engine.
* `cycle`: Operational cycle number.
* `op_setting_1`, `op_setting_2`, `op_setting_3`: Operational settings.
* `sensor_1` to `sensor_21`: Various sensor measurements.
* `sensor_22`, `sensor_23`: Redundant or constant sensor readings (to be dropped).

*Author: Krishna Singh*
*Date: July 5, 2025*

## Table of Contents

1.  [Project Overview](#Project-Overview)
2.  [Dataset](#Dataset)
3.  [Data Loading and Initial Exploration](#1.-Data-Loading-and-Initial-Exploration)
    * [1.1 Importing Libraries](#1.1-Importing-Libraries)
    * [1.2 Loading Raw Data Files](#1.2-Loading-Raw-Data-Files)
    * [1.3 Verifying Unique Engine IDs in Each Dataset](#1.3-Verifying-Unique-Engine-IDs-in-Each-Dataset)
    * [1.4 Consolidating Training Data](#1.4-Consolidating-Training-Data)
    * [1.5 Identifying Low-Variance Features](#1.5-Identifying-Low-Variance-Features)
    * [1.6 Data Types and Missing Values](#1.6-Data-Types-and-Missing-Values)
    * [1.7 Statistical Summary of Features](#1.7-Statistical-Summary-of-Features)
4.  [Feature Engineering: Remaining Useful Life (RUL) Calculation](#2.-Feature-Engineering:-Remaining-Useful-Life-(RUL)-Calculation)
    * [2.1 Verifying RUL at End-of-Life](#2.1-Verifying-RUL-at-End-of-Life)
    * [2.2 Final Columns After RUL Addition](#2.2-Final-Columns-After-RUL-Addition)
5.  [Data Preprocessing: Scaling and Sequence Generation](#3.-Data-Preprocessing:-Scaling-and-Sequence-Generation)
    * [3.1 Feature Scaling](#3.1-Feature-Scaling)
    * [3.2 Saving Scalers](#3.2-Saving-Scalers)
6.  [Feature Engineering: Creating Rolling Window Statistics](#4.-Feature-Engineering:-Creating-Rolling-Window-Statistics)
    * [4.1 Updated Feature Set for Model Training](#4.1-Updated-Feature-Set-for-Model-Training)
7.  [Sequence Generation for LSTM Model](#5.-Sequence-Generation-for-LSTM-Model)
    * [5.1 Preparing Test Data Sequences](#5.1-Preparing-Test-Data-Sequences)
    * [5.2 Final Data Shapes for LSTM Input](#5.2-Final-Data-Shapes-for-LSTM-Input)

## 1. Data Loading and Initial Exploration

This section focuses on loading the raw sensor data, assigning meaningful column names, and performing initial data inspections to understand its structure and identify any immediate issues like missing values or constant columns.

### 1.1 Importing Libraries

This cell imports the necessary Python libraries for data manipulation, numerical operations, machine learning preprocessing, and model building:
- `pandas` for DataFrame operations.
- `numpy` for numerical operations.
- `tensorflow` for building and training the neural network model.
- `joblib` for saving and loading Python objects, such as scalers.
- `sklearn.preprocessing.MinMaxScaler` for data scaling.
- `custom_functions` for specific utility functions (e.g., `import_data`, `max_cycles`, `rolling_mean_std`, `create_sequences`, `create_single_last_sequence`).

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import joblib
from sklearn.preprocessing import MinMaxScaler
from custom_functions import *


### 1.2 Loading Raw Data Files

The C-MAPSS dataset is provided as separate text files for training and testing, one pair per subset (FD001-FD004). We use a custom `import_data` function (defined in `custom_functions.py`) to load `train_FD002.txt`, `train_FD004.txt`, and `test_FD001.txt` into pandas DataFrames and assign standard column names as per the dataset documentation. We then display the head of `data_2` to inspect its initial structure.

The `head()` output below confirms the data has been loaded correctly with appropriate column names. We can see the `engine_id`, `cycle`, operational settings (`op_setting_1`, `op_setting_2`, `op_setting_3`), and the sensor readings (`sensor_1` to `sensor_21`; `sensor_10` and `sensor_16` do not appear among the loaded columns). This initial view gives us an understanding of the data's raw format.
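Since `custom_functions.py` is not shown in this notebook, the following is only a minimal sketch of what a loader like `import_data` might look like. The name `import_data_sketch`, the whitespace-separated parsing, and the choice to drop `sensor_10` and `sensor_16` are assumptions inferred from the column lists printed in this notebook, not the author's actual implementation.

```python
import io
import pandas as pd

# Standard C-MAPSS layout: id, cycle, 3 operational settings, 21 sensors
COLUMN_NAMES = (
    ["engine_id", "cycle"]
    + [f"op_setting_{i}" for i in range(1, 4)]
    + [f"sensor_{i}" for i in range(1, 22)]
)

def import_data_sketch(path_or_buffer, drop_cols=("sensor_10", "sensor_16")):
    """Hypothetical stand-in for import_data: read a whitespace-separated file."""
    df = pd.read_csv(path_or_buffer, sep=r"\s+", header=None, names=COLUMN_NAMES)
    # Assumption: these two sensors are dropped at load time, as the outputs suggest
    return df.drop(columns=list(drop_cols))

# Tiny synthetic file with two rows of 26 values each (not real C-MAPSS data)
sample = "\n".join(" ".join(["1", str(c)] + ["0.5"] * 24) for c in (1, 2))
demo = import_data_sketch(io.StringIO(sample))
print(demo.shape)
```

With the two columns dropped, the sketch yields 24 columns, which matches the column count reported later by `data.info()`.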

In [2]:
data_2 = import_data('train/train_FD002.txt')
data_4 = import_data('train/train_FD004.txt')
test_df = import_data('test/test_FD001.txt')

print(data_2.head())

   engine_id  cycle  op_setting_1  op_setting_2  op_setting_3  sensor_1  \
0          1      1       34.9983        0.8400         100.0    449.44   
1          1      2       41.9982        0.8408         100.0    445.00   
2          1      3       24.9988        0.6218          60.0    462.54   
3          1      4       42.0077        0.8416         100.0    445.00   
4          1      5       25.0005        0.6203          60.0    462.54   

   sensor_2  sensor_3  sensor_4  sensor_5  ...  sensor_11  sensor_12  \
0    555.32   1358.61   1137.23      5.48  ...      42.02     183.06   
1    549.90   1353.22   1125.78      3.91  ...      42.20     130.42   
2    537.31   1256.76   1047.45      7.05  ...      36.69     164.22   
3    549.51   1354.03   1126.38      3.91  ...      41.96     130.72   
4    537.07   1257.71   1047.93      7.05  ...      36.89     164.31   

   sensor_13  sensor_14  sensor_15  sensor_17  sensor_18  sensor_19  \
0    2387.72    8048.56     9.3461        334

### 1.3 Verifying Unique Engine IDs in Each Dataset

Before concatenating the datasets, it's important to verify the number of unique engines in each FD (Flight Dataset) file. The output confirms:
* **FD002:** 260 unique engines
* **FD004:** 249 unique engines

This step ensures we understand the scope of each individual training dataset.

In [3]:
print(data_2['engine_id'].nunique())
print(data_4['engine_id'].nunique())

260
249


### 1.4 Consolidating Training Data

To create a unified training dataset, we first re-index the `engine_id` column of `data_4` so that its IDs (1-249) become 261-509, continuing after the 260 engines of `data_2`. This ensures that each engine across both FD datasets has a unique identifier, preventing ID clashes when concatenating. Following re-indexing, the two training DataFrames are concatenated into a single `data` DataFrame, which will be used for training the RUL prediction model. This unified dataset provides a richer and more diverse set of engine degradation patterns than either subset alone.
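The same idea generalizes to any number of subsets. The sketch below uses a hypothetical helper, `concat_with_unique_ids`, on toy frames; it assumes IDs in each frame start at 1, as they do in the C-MAPSS files.

```python
import pandas as pd

def concat_with_unique_ids(frames, id_col="engine_id"):
    """Shift each frame's IDs past the largest ID seen so far, then concatenate."""
    offset = 0
    shifted = []
    for frame in frames:
        frame = frame.copy()               # leave the caller's frame untouched
        frame[id_col] = frame[id_col] + offset
        offset = int(frame[id_col].max())  # next frame continues after this one
        shifted.append(frame)
    return pd.concat(shifted, ignore_index=True)

# Toy frames: both start numbering their engines at 1
a = pd.DataFrame({"engine_id": [1, 1, 2], "cycle": [1, 2, 1]})
b = pd.DataFrame({"engine_id": [1, 2, 2], "cycle": [1, 1, 2]})
combined = concat_with_unique_ids([a, b])
print(combined["engine_id"].tolist())  # [1, 1, 2, 3, 4, 4]
```

Deriving the offset from the data rather than hard-coding it keeps the step correct if a different subset is added later.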

In [4]:
# Shift FD004 engine IDs (1-249) to 261-509 so they follow the 260 engines of FD002
data_4['engine_id'] = data_4['engine_id'] + 260

data = pd.concat([data_2,data_4],ignore_index=True)

### 1.5 Identifying Low-Variance Features

We calculate the variance for all columns in the combined training dataset to identify features with very little or no variation. Features with extremely low variance often provide little to no useful information for a predictive model and can sometimes be considered constant or near-constant.

The output shows the five columns with the lowest variance:
* `op_setting_2`: The lowest raw variance (about 0.096). Its values only span roughly 0 to 0.84, so a small variance is expected and does not by itself mean the column is constant.
* `sensor_15`, `sensor_11`, `sensor_5`, `sensor_19`: Also show relatively low variance compared to other sensors.

Because the variance is computed on raw, unscaled values, it depends on each column's units, so this ranking is only a rough guide.

Some columns (`op_setting_3`, `sensor_1`, `sensor_5`, `sensor_18`, `sensor_19`) are known to be constant or near-constant in parts of the C-MAPSS dataset and are often dropped in the literature. We proceed with the current set for now, relying on the model to learn their significance (or lack thereof). For a production system, however, these might be candidates for removal.
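As an illustrative sketch (not part of the original pipeline), one way to make the comparison independent of units is to flag truly constant columns with `nunique()` and to compare variances after min-max normalization. The toy DataFrame and variable names below are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 1.0, 1.0, 1.0],          # truly constant
    "b": [0.0, 10.0, 0.0, 10.0],        # large raw variance
    "c": [100.0, 100.1, 100.0, 100.1],  # tiny raw variance, same pattern as b
})

# Columns with a single distinct value carry no information
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Normalize each column to [0, 1] before comparing variances;
# a zero span is replaced by 1 to avoid dividing by zero
span = toy.max() - toy.min()
normalized = (toy - toy.min()) / span.where(span > 0, 1.0)
scaled_var = normalized.var().sort_values()

print(constant_cols)
print(scaled_var)
```

After normalization, `b` and `c` have the same variance even though their raw variances differ by several orders of magnitude, while `a` stays at zero.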

In [5]:
variances = data.var()
least_variance_columns = variances.sort_values().head(5)
print("\nTop 5 least variance columns:")
print(least_variance_columns)


Top 5 least variance columns:
op_setting_2     0.096336
sensor_15        0.562811
sensor_11       10.489534
sensor_5        13.094532
sensor_19       28.803589
dtype: float64


### 1.6 Data Types and Missing Values

The `data.info()` output provides a summary of the DataFrame, including the number of non-null entries for each column and their data types.

**Observations:**
* **No Missing Values:** All columns show `115008 non-null` entries, matching the 115,008 rows of the DataFrame. This confirms there are no missing values in the combined training dataset, which simplifies preprocessing.
* **Data Types:** Most sensor readings and operational settings are `float64`, while `engine_id`, `cycle`, `sensor_17`, and `sensor_18` are `int64`. These data types are appropriate for numerical analysis.
* **Memory Usage:** With 24 eight-byte columns and 115,008 rows, the DataFrame occupies roughly 21 MB of memory.

This inspection confirms the data's integrity and readiness for further processing.
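The same checks can be made explicit instead of read off the `info()` printout. The helper below, `summarize_quality`, is a hypothetical name introduced for this sketch and is demonstrated on a toy frame.

```python
import numpy as np
import pandas as pd

def summarize_quality(df):
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [1, 2, 3]})
summary = summarize_quality(toy)
print(summary)

# A preprocessing script could fail fast if anything is missing:
# assert summary["n_missing"].sum() == 0
```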

In [6]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115008 entries, 0 to 115007
Data columns (total 24 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   engine_id     115008 non-null  int64  
 1   cycle         115008 non-null  int64  
 2   op_setting_1  115008 non-null  float64
 3   op_setting_2  115008 non-null  float64
 4   op_setting_3  115008 non-null  float64
 5   sensor_1      115008 non-null  float64
 6   sensor_2      115008 non-null  float64
 7   sensor_3      115008 non-null  float64
 8   sensor_4      115008 non-null  float64
 9   sensor_5      115008 non-null  float64
 10  sensor_6      115008 non-null  float64
 11  sensor_7      115008 non-null  float64
 12  sensor_8      115008 non-null  float64
 13  sensor_9      115008 non-null  float64
 14  sensor_11     115008 non-null  float64
 15  sensor_12     115008 non-null  float64
 16  sensor_13     115008 non-null  float64
 17  sensor_14     115008 non-null  float64
 18  sens

### 1.7 Statistical Summary of Features

The `data.describe()` output summarizes each numeric column with its count, mean, standard deviation, minimum, quartiles, and maximum.

**Observations:**
* **Row Count:** Every column reports a count of 115,008, consistent with the `data.info()` output.
* **Engine IDs:** `engine_id` ranges from 1 to 509, confirming the re-indexing from section 1.4.
* **Engine Lifetimes:** `cycle` ranges from 1 to 543, with a median of 113.
* **Operating Conditions:** `op_setting_1` spans 0 to about 42 and `op_setting_2` spans 0 to about 0.84, so features sit on very different scales before normalization.

In [7]:
print(data.describe())

           engine_id          cycle   op_setting_1   op_setting_2  \
count  115008.000000  115008.000000  115008.000000  115008.000000   
mean      265.950395     122.552257      23.999161       0.571679   
std       146.004517      81.777999      14.765080       0.310381   
min         1.000000       1.000000       0.000000       0.000000   
25%       140.000000      57.000000      10.004600       0.250700   
50%       274.000000     113.000000      25.001400       0.700000   
75%       393.000000     173.000000      41.998000       0.840000   
max       509.000000     543.000000      42.008000       0.842000   

        op_setting_3       sensor_1       sensor_2       sensor_3  \
count  115008.000000  115008.000000  115008.000000  115008.000000   
mean       94.038328     472.895417     579.538010    1418.866258   
std        14.245249      26.414703      37.317816     106.068820   
min        60.000000     445.000000     535.480000    1242.670000   
25%       100.000000     445.0000

## 2. Feature Engineering: Remaining Useful Life (RUL) Calculation

Predicting RUL requires a target variable that represents the time remaining until engine failure. For the C-MAPSS dataset, this is not directly provided but can be derived. The `max_cycles` function (from `custom_functions.py`) calculates the maximum `cycle` for each `engine_id` and then computes the RUL as `max_cycle_for_engine - current_cycle`. This creates a linearly decreasing target variable for each engine, going from its maximum operational cycle down to 0 at failure.

The `head()` output below for `engine_id`, `cycle`, and `RUL` demonstrates this calculation for the first engine. As the `cycle` increases, the `RUL` value correctly decreases, starting from 148 cycles remaining at cycle 1 down to 0 (which will be seen at the tail of the data for each engine). This is a critical step in preparing the target variable for our predictive model.
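Because `max_cycles` lives in `custom_functions.py` and is not shown here, the following is a minimal sketch of the calculation described above. The name `add_rul` is hypothetical, and the actual helper may differ in details.

```python
import pandas as pd

def add_rul(df):
    """Add RUL = (last cycle of the engine) - (current cycle)."""
    out = df.copy()
    # transform("max") broadcasts each engine's final cycle back onto its rows
    max_cycle = out.groupby("engine_id")["cycle"].transform("max")
    out["RUL"] = max_cycle - out["cycle"]
    return out

toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "cycle":     [1, 2, 3, 1, 2],
})
with_rul = add_rul(toy)
print(with_rul["RUL"].tolist())  # [2, 1, 0, 1, 0]
```

Using `transform` avoids an explicit merge and keeps the row order of the original DataFrame.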

In [8]:
data = max_cycles(data)


# Inspecting the RUL column
print(data[['engine_id', 'cycle', 'RUL']].head())

   engine_id  cycle  RUL
0          1      1  148
1          1      2  147
2          1      3  146
3          1      4  145
4          1      5  144


### 2.1 Verifying RUL at End-of-Life

By inspecting the `tail()` of the combined dataset, we can observe the behavior of the `RUL` column for an engine nearing its failure point. The output below shows the last few cycles of `engine_id` 509. As expected, the `RUL` value decreases to `0` at the final recorded cycle (`cycle` 255 for `engine_id` 509), confirming the correct calculation of the Remaining Useful Life. This validates that our target variable is appropriately defined for the prediction task.
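Looking at `tail()` only checks the last engine. A sketch of a check that covers every engine is shown below on toy data; it assumes an unscaled `RUL` column, and the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "cycle":     [1, 2, 3, 1, 2],
    "RUL":       [2, 1, 0, 1, 0],
})

# The row holding each engine's final cycle should have RUL == 0
last_rows = toy.loc[toy.groupby("engine_id")["cycle"].idxmax()]
all_end_at_zero = bool((last_rows["RUL"] == 0).all())

# Within one engine, cycle + RUL should be the same on every row
totals = (toy["cycle"] + toy["RUL"]).groupby(toy["engine_id"]).nunique()
constant_total = bool(totals.eq(1).all())

print(all_end_at_zero, constant_total)  # True True
```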

In [9]:
# Show df.tail() for an engine to verify RUL decreases to 0
print(data.tail())

        engine_id  cycle  op_setting_1  op_setting_2  op_setting_3  sensor_1  \
115003        509    251        9.9998        0.2500         100.0    489.05   
115004        509    252        0.0028        0.0015         100.0    518.67   
115005        509    253        0.0029        0.0000         100.0    518.67   
115006        509    254       35.0046        0.8400         100.0    449.44   
115007        509    255       42.0030        0.8400         100.0    445.00   

        sensor_2  sensor_3  sensor_4  sensor_5  ...  sensor_12  sensor_13  \
115003    605.33   1516.36   1315.28     10.52  ...     380.16    2388.73   
115004    643.42   1598.92   1426.77     14.62  ...     535.02    2388.46   
115005    643.68   1607.72   1430.56     14.62  ...     535.41    2388.48   
115006    555.77   1381.29   1148.18      5.48  ...     187.92    2388.83   
115007    549.85   1369.75   1147.45      3.91  ...     134.32    2388.66   

        sensor_14  sensor_15  sensor_17  sensor_18  sens

### 2.2 Final Columns After RUL Addition

This output shows all the columns currently present in our consolidated training DataFrame, including the newly added `RUL` column. This confirms that all relevant features and the target variable are ready for the next stages of preprocessing, such as feature scaling and sequence generation.

In [10]:
# Inspecting the columns of dataset
print(data.columns)

Index(['engine_id', 'cycle', 'op_setting_1', 'op_setting_2', 'op_setting_3',
       'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'sensor_6',
       'sensor_7', 'sensor_8', 'sensor_9', 'sensor_11', 'sensor_12',
       'sensor_13', 'sensor_14', 'sensor_15', 'sensor_17', 'sensor_18',
       'sensor_19', 'sensor_20', 'sensor_21', 'RUL'],
      dtype='object')


## 3. Data Preprocessing: Scaling and Sequence Generation

Neural networks, especially LSTMs, perform best when input features are scaled to a common range. This prevents features with larger numerical values from dominating the learning process.

### 3.1 Feature Scaling

We use `MinMaxScaler` from `sklearn.preprocessing` to scale all operational settings and sensor measurements (`feature_cols`) to a range between 0 and 1. The `RUL` target variable is also scaled using a separate `MinMaxScaler` (`rul_scaler`). This is important because RUL values can be large, and scaling them helps the model converge faster and more stably.

* **`feature_scaler.pkl`**: The scaler fitted on `feature_cols` from the training data is saved to ensure that the same scaling transformation can be applied to new, unseen test data.
* **`rul_scaler.pkl`**: Similarly, the scaler for the `RUL` target variable is saved. This will be crucial for inverse-transforming the model's predicted RUL values back to their original scale for meaningful interpretation.

The `head()` output below shows the training DataFrame after all features and the RUL target have been scaled. Notice how all values are now within the [0, 1] range.
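To make the fit-on-train, transform-on-test pattern concrete, here is a small sketch with made-up numbers. It also shows a consequence worth keeping in mind: test values outside the training range land outside [0, 1], since `MinMaxScaler` does not clip by default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # min=10, max=30 learned here
test_scaled = scaler.transform(test)        # reuses the training min/max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [-0.25  0.75  1.5 ]
```

Each value is mapped as `(x - min) / (max - min)` using the training minimum and maximum, so 5 becomes -0.25 and 40 becomes 1.5.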

In [11]:
# Defining feature columns
feature_cols = [col for col in data.columns if col not in ['RUL',"engine_id"]]
print(feature_cols)

# Scaling the features that will be used to train and test
scaler = MinMaxScaler()
data[feature_cols] = scaler.fit_transform(data[feature_cols])
test_df[feature_cols] = scaler.transform(test_df[feature_cols])

# Scaling the labels that will be used to train
rul_scaler = MinMaxScaler()
data['RUL'] = rul_scaler.fit_transform(data['RUL'].values.reshape(-1, 1))

# Checking the dataframe after scaling
print('The Training Dataframe after Scaling:')
print(data.head())

['cycle', 'op_setting_1', 'op_setting_2', 'op_setting_3', 'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'sensor_6', 'sensor_7', 'sensor_8', 'sensor_9', 'sensor_11', 'sensor_12', 'sensor_13', 'sensor_14', 'sensor_15', 'sensor_17', 'sensor_18', 'sensor_19', 'sensor_20', 'sensor_21']


The Training Dataframe after Scaling:
   engine_id     cycle  op_setting_1  op_setting_2  op_setting_3  sensor_1  \
0          1  0.000000      0.833134      0.997625           1.0  0.060269   
1          1  0.001845      0.999767      0.998575           1.0  0.000000   
2          1  0.003690      0.595096      0.738480           0.0  0.238089   
3          1  0.005535      0.999993      0.999525           1.0  0.000000   
4          1  0.007380      0.595137      0.736698           0.0  0.238089   

   sensor_2  sensor_3  sensor_4  sensor_5  ...  sensor_12  sensor_13  \
0  0.181952  0.313072  0.272086  0.146592  ...   0.133804   0.992367   
1  0.132245  0.298518  0.244628  0.000000  ...   0.005157   0.992202   
2  0.016783  0.038047  0.056787  0.293184  ...   0.087761   0.001267   
3  0.128668  0.300705  0.246067  0.000000  ...   0.005890   0.992064   
4  0.014582  0.040612  0.057938  0.293184  ...   0.087981   0.001185   

   sensor_14  sensor_15  sensor_17  sensor_18  sensor_19  se

### 3.2 Saving Scalers

It is crucial to save the `MinMaxScaler` objects (`feature_scaler.pkl` and `rul_scaler.pkl`) after fitting them to the training data. This ensures that:
1.  **Consistency:** The exact same scaling transformation (based on the training data's min/max values) can be applied to the test data or any future unseen data.
2.  **Inverse Transformation:** The `rul_scaler` can be used to convert the model's predicted, scaled RUL values back into their original, interpretable cycle counts.

Fitting the scalers on the training data only, and reusing the saved objects unchanged, keeps test-set statistics out of preprocessing and allows for consistent deployment of the pipeline.
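The round trip can be sketched as follows, using a temporary directory and made-up RUL values. The file name `rul_scaler_demo.pkl` and the example predictions are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit a scaler on toy RUL values spanning 0 to 200 cycles
rul = np.array([0.0, 50.0, 100.0, 200.0]).reshape(-1, 1)
rul_scaler = MinMaxScaler().fit(rul)

# Save and reload, as a later prediction notebook would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rul_scaler_demo.pkl")
    joblib.dump(rul_scaler, path)
    restored = joblib.load(path)

# Pretend these are scaled model outputs and map them back to cycles
scaled_pred = np.array([[0.25], [0.9]])
cycles = restored.inverse_transform(scaled_pred)
print(cycles.ravel())  # [ 50. 180.]
```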

In [12]:
# Save the fitted scalers for later use on test data and for inverse transformation.
# joblib was already imported in section 1.1; dump() returns the list of written file names.
joblib.dump(scaler, 'feature_scaler.pkl')
joblib.dump(rul_scaler, 'rul_scaler.pkl')

['rul_scaler.pkl']

## 4. Feature Engineering: Creating Rolling Window Statistics

To capture the temporal dependencies and trends in the sensor data, we generate **rolling mean** and **rolling standard deviation** features for the selected sensor measurements. A `window_size` of 30 cycles is used. This means for each cycle, the rolling features are calculated based on the preceding 30 cycles (including the current one).

* **Rolling Mean:** Provides a smoothed trend of the sensor readings, indicating general deterioration.
* **Rolling Standard Deviation:** Captures the variability or volatility of sensor readings within the window, which can be an indicator of increasing instability as an engine degrades.

The output below demonstrates the original `sensor_2` values alongside the 30-cycle rolling mean and standard deviation for the first engine. Notice how the rolling mean equals the current value at the first cycle and gradually smooths out as more data points fill the window. The rolling standard deviation provides insight into the local variability of the sensor. These features give the LSTM an explicit, smoothed view of recent trends in addition to the raw readings.
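The helper `rolling_mean_std` is defined in `custom_functions.py` and not shown here, so the following is only a sketch of how such a function could be written. The name `rolling_mean_std_sketch`, the per-engine grouping, `min_periods=1`, and filling the undefined first-row standard deviation with 0 are assumptions; `min_periods=1` is at least consistent with the rolling mean equalling the raw value at the first cycle.

```python
import pandas as pd

def rolling_mean_std_sketch(df, window, cols, id_col="engine_id"):
    """Add per-engine rolling mean/std columns for each column in cols."""
    out = df.copy()
    for col in cols:
        grouped = out.groupby(id_col)[col]
        # min_periods=1 lets the window grow from a single row at the start
        out[f"{col}_rolling_mean_{window}"] = grouped.transform(
            lambda s: s.rolling(window, min_periods=1).mean()
        )
        # The std of a single observation is undefined (NaN); assume 0 here
        out[f"{col}_rolling_std_{window}"] = grouped.transform(
            lambda s: s.rolling(window, min_periods=1).std()
        ).fillna(0.0)
    return out

toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "sensor_2":  [1.0, 3.0, 5.0, 10.0, 20.0],
})
rolled = rolling_mean_std_sketch(toy, window=2, cols=["sensor_2"])
print(rolled["sensor_2_rolling_mean_2"].tolist())  # [1.0, 2.0, 4.0, 10.0, 15.0]
```

Grouping by engine prevents the window from mixing the end of one engine's life with the start of the next.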

In [13]:
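# Window size and the sensor columns that receive rolling statistics
window_size = 30
selected_sensors = [col for col in feature_cols if col not in ['cycle','op_setting_1','op_setting_2','op_setting_3']]

# Rolling features are built for the sensors only, matching the 61-column list printed in section 4.1.
# The result is assigned back to test_df so later cells see the new columns.
data = rolling_mean_std(data,window_size,selected_sensors)
test_df = rolling_mean_std(test_df,window_size,selected_sensors)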

print("\n--- Displaying Rolling Features for Engine 1 (first 30 cycles) ---")
# Show how the rolling features look for engine 1
engine1_df = data[data['engine_id'] == 1]
print(engine1_df[['engine_id', 'cycle', 'sensor_2', f'sensor_2_rolling_mean_{window_size}', f'sensor_2_rolling_std_{window_size}']].head(15))


--- Displaying Rolling Features for Engine 1 (first 30 cycles) ---
    engine_id     cycle  sensor_2  sensor_2_rolling_mean_30  \
0           1  0.000000  0.181952                  0.181952   
1           1  0.001845  0.132245                  0.157098   
2           1  0.003690  0.016783                  0.110326   
3           1  0.005535  0.128668                  0.114912   
4           1  0.007380  0.014582                  0.094846   
5           1  0.009225  0.014123                  0.081392   
6           1  0.011070  0.130778                  0.088447   
7           1  0.012915  0.659941                  0.159884   
8           1  0.014760  0.127018                  0.156232   
9           1  0.016605  0.127018                  0.153311   
10          1  0.018450  0.131420                  0.151321   
11          1  0.020295  0.983309                  0.220653   
12          1  0.022140  0.662051                  0.254607   
13          1  0.023985  0.128302                 

### 4.1 Updated Feature Set for Model Training

After generating the rolling mean and standard deviation for all relevant sensor features, our feature set (`feature_cols`) has significantly expanded. The output below confirms the new list of features that will be used as input to the LSTM model. It now includes the original `cycle` and `op_setting` features, plus the original sensor readings, and their corresponding rolling mean and standard deviation features. This comprehensive set aims to provide the model with a rich representation of the engine's health over time.

In [14]:
# updating the feature_cols
feature_cols = [col for col in data.columns if col not in ['engine_id','RUL']]
print(feature_cols)
print(f'({len(feature_cols)})')

['cycle', 'op_setting_1', 'op_setting_2', 'op_setting_3', 'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5', 'sensor_6', 'sensor_7', 'sensor_8', 'sensor_9', 'sensor_11', 'sensor_12', 'sensor_13', 'sensor_14', 'sensor_15', 'sensor_17', 'sensor_18', 'sensor_19', 'sensor_20', 'sensor_21', 'sensor_1_rolling_mean_30', 'sensor_1_rolling_std_30', 'sensor_2_rolling_mean_30', 'sensor_2_rolling_std_30', 'sensor_3_rolling_mean_30', 'sensor_3_rolling_std_30', 'sensor_4_rolling_mean_30', 'sensor_4_rolling_std_30', 'sensor_5_rolling_mean_30', 'sensor_5_rolling_std_30', 'sensor_6_rolling_mean_30', 'sensor_6_rolling_std_30', 'sensor_7_rolling_mean_30', 'sensor_7_rolling_std_30', 'sensor_8_rolling_mean_30', 'sensor_8_rolling_std_30', 'sensor_9_rolling_mean_30', 'sensor_9_rolling_std_30', 'sensor_11_rolling_mean_30', 'sensor_11_rolling_std_30', 'sensor_12_rolling_mean_30', 'sensor_12_rolling_std_30', 'sensor_13_rolling_mean_30', 'sensor_13_rolling_std_30', 'sensor_14_rolling_mean_30', 'sensor_1

## 5. Sequence Generation for LSTM Model

LSTMs require input data to be in a sequence format (samples, timesteps, features). For our RUL prediction task, each sequence represents a fixed "look-back window" of an engine's operational history. The `create_sequences` function (from `custom_functions.py`) is used to transform our flattened DataFrame into this 3D format.

For each engine:
1.  It iterates through the engine's data, creating sequences of `sequence_length` (e.g., 30) cycles.
2.  Each sequence's target label is the RUL value at the *end* of that sequence.

This approach allows the LSTM to learn the temporal patterns leading up to a specific RUL value.
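The two steps above can be sketched as a sliding window with a stride of one. The function `create_sequences_sketch` is a hypothetical stand-in for the `create_sequences` helper in `custom_functions.py`, whose actual code is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def create_sequences_sketch(engine_df, sequence_length, feature_cols, label_col="RUL"):
    """Slide a fixed-length window over one engine's rows (stride 1)."""
    features = engine_df[feature_cols].to_numpy()
    labels = engine_df[label_col].to_numpy()
    X, y = [], []
    for end in range(sequence_length, len(engine_df) + 1):
        X.append(features[end - sequence_length:end])  # rows [end-L, end)
        y.append(labels[end - 1])                      # RUL at the window's last row
    return X, y

toy = pd.DataFrame({
    "f1":  [0.1, 0.2, 0.3, 0.4, 0.5],
    "f2":  [1.0, 0.9, 0.8, 0.7, 0.6],
    "RUL": [4, 3, 2, 1, 0],
})
X, y = create_sequences_sketch(toy, sequence_length=3, feature_cols=["f1", "f2"])
print(np.array(X).shape, y)  # (3, 3, 2) with labels 2, 1, 0
```

An engine with `n` rows yields `n - sequence_length + 1` windows under this scheme, and none if it has fewer rows than the window length.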

In [15]:
# Define the sequence length (hyperparameter)
sequence_length = 30  # Look-back window of 30 cycles

# Initialize empty lists to store all sequences and labels from all engines
X_train_sequences = []
y_train_labels = []

# Group the DataFrame by engine_id and iterate through each group (each engine)
# This is crucial to keep sequences from different engines separate
for engine_id, engine_df in data.groupby('engine_id'):
    # Generate sequences and labels for the current engine
    sequences_X, labels_y = create_sequences(engine_df, sequence_length, feature_cols)
    
    # Extend the main lists with the sequences and labels from this engine
    X_train_sequences.extend(sequences_X)
    y_train_labels.extend(labels_y)

### 5.1 Preparing Test Data Sequences

For the test set, we are interested in predicting the RUL for each engine based on its *last available operational cycle*. The `create_single_last_sequence` function (from `custom_functions.py`) is designed for this purpose. For each engine in the test dataset, it extracts only the final `sequence_length` (e.g., 30) cycles as a single sequence. This single sequence represents the most recent operational history, which is then fed into the trained LSTM model to predict the RUL. This mimics a real-world scenario where a model would predict RUL based on the current observed state of an engine.
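A minimal sketch of this extraction is shown below. `last_sequence_sketch` is a hypothetical stand-in for `create_single_last_sequence`; returning `None` for engines shorter than the window is an assumption suggested by the `if sequence is not None` check in the next cell (the real helper might pad instead).

```python
import numpy as np
import pandas as pd

def last_sequence_sketch(engine_df, sequence_length, feature_cols):
    """Return the final sequence_length rows as a 2D array, or None if too short."""
    if len(engine_df) < sequence_length:
        return None  # assumption: short engines are skipped rather than padded
    return engine_df[feature_cols].to_numpy()[-sequence_length:]

toy = pd.DataFrame({"f1": [0.1, 0.2, 0.3, 0.4], "f2": [1.0, 0.9, 0.8, 0.7]})
seq = last_sequence_sketch(toy, sequence_length=3, feature_cols=["f1", "f2"])
too_short = last_sequence_sketch(toy.head(2), sequence_length=3, feature_cols=["f1", "f2"])

print(seq.shape)   # (3, 2)
print(too_short)   # None
```

If engines are skipped this way, the rows of `X_test` no longer line up one-to-one with all test engine IDs, so the IDs of the retained engines would need to be tracked alongside the sequences.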

In [16]:
# Initialize an empty list to store the sequences for test prediction
X_test_sequences = []

# Iterate through each engine in your test_df
for engine_id, engine_df_group in test_df.groupby('engine_id'): # Use engine_id as column name
    # Ensure you pass the grouped DataFrame (engine_df_group) to the function
    sequence = create_single_last_sequence(engine_df_group.copy(), sequence_length, feature_cols)
    
    if sequence is not None:
        X_test_sequences.append(sequence)

### 5.2 Final Data Shapes for LSTM Input

After sequence generation, the data is converted into NumPy arrays, which is the required input format for Keras/TensorFlow models.

The output confirms the shapes of our processed arrays:
* **`X_train` (Training Sequences):** `(100247, 30, 61)`
    * **100247:** Number of training samples (sequences).
    * **30:** `sequence_length` (timesteps per sequence). This means each input to the LSTM will consist of 30 historical cycles.
    * **61:** Number of features per timestep (the `feature_cols`).
* **`y_train` (Training Labels):** `(100247, 1)`
    * **100247:** Number of corresponding RUL labels, one per sequence.
    * **1:** Each label is a single RUL value.
* **`X_test` (Test Sequences):** Will have a shape like `(num_test_engines, 30, 61)` (though not explicitly printed here, it's inferred).

These reshaped arrays are now in the correct format to be fed into an LSTM neural network for training and prediction. The saving of `X_train.npy`, `X_test.npy`, and `y_train.npy` ensures that these preprocessed datasets can be easily loaded for model training without re-running the entire preprocessing pipeline.
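The sample count in the printed shape can be cross-checked with a little arithmetic. The sketch below assumes a stride of one and that every training engine has at least `sequence_length` cycles, so each engine with `n` rows contributes `n - sequence_length + 1` sequences; both are assumptions about `create_sequences`.

```python
total_rows = 115008    # rows reported by data.info()
n_engines = 509        # 260 engines (FD002) + 249 engines (FD004)
sequence_length = 30

# Each engine loses (sequence_length - 1) rows that cannot end a full window
expected_sequences = total_rows - n_engines * (sequence_length - 1)
print(expected_sequences)  # 100247
```

The result agrees with the first dimension of `X_train` printed by the cell below, which suggests no engine was shorter than the window.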

In [17]:
# Convert the lists into NumPy arrays for Keras
# This is where we get our 3D array (samples, timesteps, features)
X_train = np.array(X_train_sequences)
y_train = np.array(y_train_labels)
X_test = np.array(X_test_sequences)

# Saving the processed arrays for later use
np.save('X_train.npy', X_train)
np.save('X_test.npy', X_test)

# Reshape y_train to be 2D for model fitting (if not already)
# This step is good practice to ensure compatibility with Keras
y_train = y_train.reshape(-1, 1)
np.save('y_train.npy', y_train)

# Checking all the Outputs of create_sequences
print("Shape of X_train (sequences):", X_train.shape)
print("Shape of y_train (RUL labels):", y_train.shape)
print("\nFirst sequence from X_train:")
print(X_train[0])

Shape of X_train (sequences): (100247, 30, 61)
Shape of y_train (RUL labels): (100247, 1)

First sequence from X_train:
[[0.00000000e+00 8.33134165e-01 9.97624703e-01 ... 1.02748109e-01
  1.56455773e-01 1.00503812e-01]
 [1.84501845e-03 9.99766711e-01 9.98574822e-01 ... 1.02748109e-01
  8.53888457e-02 1.00503812e-01]
 [3.69003690e-03 5.95096172e-01 7.38479810e-01 ... 7.83484323e-02
  1.06564026e-01 7.99729512e-02]
 ...
 [4.98154982e-02 5.23709770e-05 0.00000000e+00 ... 2.79075281e-01
  1.96865442e-01 2.76437467e-01]
 [5.16605166e-02 1.66634927e-05 0.00000000e+00 ... 3.09139835e-01
  2.23182526e-01 3.06224670e-01]
 [5.35055351e-02 4.76242620e-01 8.32660333e-01 ... 3.07530130e-01
  2.31916247e-01 3.04677394e-01]]
