### CS 418 Final Project Report: Team 13 (James Cook, Datta Sai VVN, Heba Syed, Faaizah Ismail, Lasya Sahithi)
Relevant Notebooks: https://github.com/cs418-fa25/project-check-in-team-13

### RailLife: Introduction
Traditionally, railway maintenance has been performed on a fixed schedule, with parts replaced on predetermined maintenance days. This method comes with a whole host of problems, from the wasted time and effort of over-maintenance to the potential for unexpected breakdowns well before the scheduled repair dates. This can lead to safety issues and extended downtime, all due to inefficient maintenance scheduling. Repair costs for train air production units can reach over $50,000 in such cases, which brings us to the problem we focus on in this project. Our goal is to develop a machine learning approach for predicting the Remaining Useful Life (RUL) of train air production units. We want to predict the RUL with less than 50 hours of error and so enable condition-based maintenance, where repairs are performed before failure occurs.

### Dataset Summary — MetroPT-3
In order to develop our predictive models, we utilized sensor data from the UCI repository, with records from February - September 2020. There are four recorded failure events used to predict the RUL. The key features within the dataset are as follows:
| Sensor | Description | Degradation Indicator |
|------|---------|-------------|
| Motor_current | Electrical current draw (Amperes) | Increased load from wear |
| Oil_temperature | Lubricant temperature (°C) | Friction, cooling failure |
| TP2, TP3 | Pressure readings at the compressor and pneumatic panel (bar) | Pressure loss, air leaks |
| H1 | Air intake pressure valve reading | Supply degradation |
| DV_pressure | Distributor valve pressure (bar) | Pressure regulation failure |
| Reservoirs | Air reservoir pressure | System capacity decline |
| COMP | Compressor output pressure | Core functionality metric |
| MPG | Main pressure gauge | Overall system health |

The four recorded failure events occur as follows:
| ID | Start | End | Fault | Maintenance |
|----|--------|------|--------|-------------|
| 1 | 2020-04-18 00:00 | 2020-04-18 23:59 | Air leak | — |
| 2 | 2020-05-29 23:30 | 2020-05-30 06:00 | Air leak | May 30 12:00 |
| 3 | 2020-06-05 10:00 | 2020-06-07 14:30 | Air leak | Jun 8 16:00 |
| 4 | 2020-07-15 14:30 | 2020-07-15 19:00 | Air leak | Jul 16 00:00 |
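
The failure start times in this table are what turn hourly timestamps into RUL targets. The following is a minimal sketch of such a labeling step, assuming RUL is defined as the number of hours until the start of the next failure and that rows after the last failure are left unlabeled (censored). The function name `label_rul` and these conventions are illustrative assumptions, not necessarily what our notebooks do.

```python
import numpy as np
import pandas as pd

# Failure start times copied from the table above.
FAILURE_STARTS = pd.to_datetime([
    "2020-04-18 00:00",
    "2020-05-29 23:30",
    "2020-06-05 10:00",
    "2020-07-15 14:30",
])

def label_rul(index: pd.DatetimeIndex, failure_starts=FAILURE_STARTS) -> pd.Series:
    """Hours until the next failure start; NaN after the last failure (censored)."""
    starts = np.sort(failure_starts.values)
    # Position of the first failure start at or after each timestamp.
    pos = np.searchsorted(starts, index.values, side="left")
    rul = np.full(len(index), np.nan)
    has_next = pos < len(starts)
    delta = starts[pos[has_next]] - index.values[has_next]
    rul[has_next] = delta / np.timedelta64(1, "h")
    return pd.Series(rul, index=index, name="rul_hours")

# Hypothetical usage: four hourly timestamps on the day before failure 1.
hours = pd.date_range("2020-04-17 00:00", "2020-04-17 03:00", freq="h")
print(label_rul(hours))
```

Under this convention the RUL decreases by one per hourly row and reaches zero at the failure start.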

The raw dataset is sampled at 1 Hz, i.e. one sample per second. We experimented with different downsampling rates/bin sizes, mainly to mitigate overfitting and make computations more feasible, and settled on downsampling to one-hour intervals by taking the mean of all values within each hour. Hence all models predict hourly RUL. A challenge in this project is that the dataset, despite containing a huge number of samples and being very feature-rich, contains only four failure events. To obtain evaluations that indicate the general predictive ability of our models, we elected to test the models on two of the four failure events, leaving only the remaining two events to train on.
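
The hourly mean downsampling described above can be sketched with pandas as follows. This is an illustration on synthetic values, assuming the readings sit in a DataFrame with a `DatetimeIndex`; the helper name `downsample_hourly` is hypothetical.

```python
import numpy as np
import pandas as pd

def downsample_hourly(raw: pd.DataFrame) -> pd.DataFrame:
    """Average all readings within each clock hour (expects a DatetimeIndex)."""
    return raw.resample("1h").mean()

# Synthetic stand-in for two hours of 1 Hz readings (not real MetroPT-3 data).
idx = pd.date_range("2020-02-01 00:00:00", periods=7200, freq="s")
raw = pd.DataFrame({"Oil_temperature": np.linspace(60.0, 62.0, 7200)}, index=idx)

hourly = downsample_hourly(raw)
print(hourly.shape)  # one row per hour, one column per sensor
```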

As the table shows, the third failure follows the second very quickly (only about 100 hours later), whereas the first and second failures occur after about 1800 and 1000 hours respectively. This is challenging for any model that relies on long-term trends in the dataset. Later in the report we try reframing the question: rather than continuously predicting RUL in a regression context, we train classifiers to predict “will there be a failure event within 100 hours?” (We’ll see this does not improve our results.)

Additionally, in an attempt to improve the general predictive capabilities of the models, we experimented with generating synthetic data: we approximated the distributions of the features in the original dataset and drew new samples from those approximate distributions with some Gaussian noise added. We then combined these synthetic samples with the real samples in various ways. Perhaps this idea could have been developed into a success, but in our experiments any incorporation of the synthetic data led to much worse metrics, so we discarded it for this project.


<img src ="corrheatmap.png" width ="350" height= "350"> <img src="boxplot.png" width="450">

Before conducting the feature engineering, we examined the trends within the dataset. The correlation heatmap shows physically consistent sensor groups: strongly linked pressure pairs (TP2–H1, TP3–Reservoirs), a moderate link between motor current and oil temperature, and some redundant pressure signals that are useful for cross-checking data quality. The sensor boxplot shows a compressor that is idle most of the time with distinct off-load and full-load modes, while pressures stay tightly clustered and oil temperature and current vary widely across operating conditions.

### Feature Engineering
Feature engineering is the critical bridge between raw sensor data and predictive models. For the RailLife project, we transformed 10 raw sensor readings into 233 engineered features designed to capture temporal degradation patterns, cumulative stress indicators, and early warning signals of equipment failure. The core challenge was to create features that would be production-viable (no data leakage), physically meaningful (reflect real degradation mechanisms), computationally efficient (scalable for real-time deployment), and temporally aware (capture how equipment degrades over time).

Feature Engineering Results:
<br>Total features: 233 | Features per sensor: ~23

Breakdown by category:
- Base sensors: 10
- Z-score normalized: 10
- EWMAs (3 spans × 10 sensors): 30
- Delta features (2 scales × 10 sensors): 20
- Rolling statistics (5 metrics × 3 windows × 10 sensors): 150
- Degradation signatures: 15
- Baseline deviations: 18

(10 features) For each sensor $x$, we computed the z-score:
<br>$z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation across the training set.
<br>This standardization ensures that all sensors contribute equally regardless of scale, that gradient descent converges faster in neural networks, and that outliers are clearly identified ($|z| > 3$).
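
A minimal sketch of this standardization, assuming the statistics are estimated on the training rows only and then reused unchanged on the test rows. The helper names, the `_z` suffix, and the guard for constant columns are illustrative assumptions.

```python
import pandas as pd

def fit_zscore(train: pd.DataFrame):
    """Estimate per-sensor mean and standard deviation on the training rows only."""
    mu = train.mean()
    # Guard against constant columns, which would otherwise divide by zero.
    sigma = train.std(ddof=0).replace(0.0, 1.0)
    return mu, sigma

def apply_zscore(df: pd.DataFrame, mu: pd.Series, sigma: pd.Series) -> pd.DataFrame:
    """Standardize any split with statistics fitted on the training split."""
    return ((df - mu) / sigma).add_suffix("_z")

# Made-up readings for illustration.
train = pd.DataFrame({"Motor_current": [2.0, 4.0, 6.0],
                      "Oil_temperature": [60.0, 65.0, 70.0]})
test = pd.DataFrame({"Motor_current": [8.0], "Oil_temperature": [65.0]})

mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
test_z = apply_zscore(test, mu, sigma)
print(test_z)
```

Fitting `mu` and `sigma` on the training split alone keeps information from the test period out of the features.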

(30 features) For each sensor, we calculated EWMAs with spans of 3h, 12h, and 24h:
<br>$EWMA_t = \alpha \cdot x_t + (1 - \alpha) \cdot EWMA_{t-1}$, where $\alpha = \frac{2}{span + 1}$
<br>EWMAs smooth out noise while giving more weight to recent observations, making them ideal for detecting gradual trends.
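
The recursion above corresponds to pandas' `ewm(span=..., adjust=False)`, which uses $\alpha = \frac{2}{span + 1}$ and starts from the first observation. The sketch below assumes hourly rows, so a span of 3 means 3 hours; the function and column names are hypothetical.

```python
import pandas as pd

def add_ewma_features(df: pd.DataFrame, spans=(3, 12, 24)) -> pd.DataFrame:
    """One EWMA column per (sensor, span); adjust=False gives the plain recursion."""
    out = {}
    for col in df.columns:
        for span in spans:
            out[f"{col}_ewma_{span}h"] = df[col].ewm(span=span, adjust=False).mean()
    return pd.DataFrame(out, index=df.index)

# Made-up readings: span=3 gives alpha = 0.5.
demo = pd.DataFrame({"Motor_current": [1.0, 2.0, 4.0]})
ewma = add_ewma_features(demo, spans=(3,))
print(ewma["Motor_current_ewma_3h"].tolist())  # [1.0, 1.5, 2.75]
```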

(20 features) We computed first-order differences at two time scales: 
<br>$\Delta x_t^{(1h)} = x_t - x_{t-1}$
<br>$\Delta x_t^{(3h)} = x_t - x_{t-3}$
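
Both differences map directly onto `Series.diff` when rows are hourly. A short sketch with hypothetical names and made-up values:

```python
import pandas as pd

def add_delta_features(df: pd.DataFrame, lags=(1, 3)) -> pd.DataFrame:
    """x_t - x_{t-lag} for each sensor; the first `lag` rows are NaN."""
    out = {
        f"{col}_delta_{lag}h": df[col].diff(lag)
        for col in df.columns
        for lag in lags
    }
    return pd.DataFrame(out, index=df.index)

demo = pd.DataFrame({"TP2": [10.0, 11.0, 13.0, 16.0, 20.0]})
deltas = add_delta_features(demo)
print(deltas)
```

The leading NaN rows need to be dropped or filled before training; which of the two to use is a separate design choice.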

(150 features) For windows of 6h, 12h, and 24h, we computed the mean, standard deviation, maximum, minimum, and range of each sensor. A high standard deviation indicates erratic behavior, often a precursor to failure.
<br>(15 features) Custom features were designed from domain knowledge: Temperature trend, motor-temperature interaction, volatility metrics, recent extreme values. These capture the relationship between electrical load and thermal stress.
<br>(18 features) We established "healthy operation" baselines from the first 10% of data (assumed healthy). Deviation from a healthy baseline is a direct measure of degradation.
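
The rolling statistics and baseline deviations can be sketched as below. This assumes trailing windows (each row only sees past hours, so no future data leaks in) and defines the baseline as the per-sensor mean over the first 10% of rows. The exact definitions behind our 15 degradation signatures and 18 baseline features are not reproduced here, and all names are hypothetical.

```python
import pandas as pd

def add_rolling_features(df: pd.DataFrame, windows=(6, 12, 24)) -> pd.DataFrame:
    """Trailing-window mean, std, max, min and range for every sensor."""
    out = {}
    for col in df.columns:
        for w in windows:
            roll = df[col].rolling(window=w, min_periods=w)
            out[f"{col}_roll{w}h_mean"] = roll.mean()
            out[f"{col}_roll{w}h_std"] = roll.std()
            out[f"{col}_roll{w}h_max"] = roll.max()
            out[f"{col}_roll{w}h_min"] = roll.min()
            out[f"{col}_roll{w}h_range"] = roll.max() - roll.min()
    return pd.DataFrame(out, index=df.index)

def baseline_deviation(df: pd.DataFrame, healthy_fraction=0.10) -> pd.DataFrame:
    """Difference from the mean of the earliest rows, which are assumed healthy."""
    n_healthy = max(1, int(len(df) * healthy_fraction))
    baseline = df.iloc[:n_healthy].mean()
    return (df - baseline).add_suffix("_baseline_dev")

# Made-up ramp of 30 hourly readings for one sensor.
demo = pd.DataFrame({"Oil_temperature": [float(v) for v in range(30)]})
rolled = add_rolling_features(demo)
dev = baseline_deviation(demo)
print(rolled.shape, dev.iloc[-1, 0])
```

One sensor yields 5 × 3 = 15 rolling columns, so ten sensors yield 150.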

### Machine Learning Models
For the vast majority of our trials, we treated prediction of RUL over time as a regression problem and sought to predict its graph over the test data. To this end we trained three machine learning models using the scikit-learn library: a gradient boosted decision tree regressor (GBR), a Cox proportional hazards regressor (Cox), and an ensemble learner combining the GBR and Cox models. We also trained a long short-term memory (LSTM) deep learning network which will be discussed in later sections.
#### Gradient Boosted (Decision Tree) Regressor (GBR)
The most successful of the standard machine learning models was the GBR, chosen as a robust general-purpose regressor for our structured data. After tuning hyperparameters with the bias-variance tradeoff as the primary consideration, the best-performing GBR used 200 estimators (boosting stages/trees) with a max depth of 5, a subsample of 0.8, and a learning rate of 0.1. These are fairly standard, moderate choices when overfitting is a concern, as it is on this dataset. The GBR learns during training which features matter most. In early trials, effectively only the feature “time since last failure” contributed to its predictions, with an importance orders of magnitude larger than that of every other feature. Such predictions fit the train data quite well and correctly reproduced the linear rate of RUL decay over the test data, but the model did so by essentially guessing a starting RUL value and decaying linearly from it, so its predictions were ultimately poor.
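
For reference, the hyperparameters quoted above map onto scikit-learn's `GradientBoostingRegressor` as in the sketch below. The data is synthetic and only one column carries signal, which reproduces in miniature the situation where a single feature dominates `feature_importances_`. The `random_state` and the helper name `make_gbr` are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def make_gbr(random_state=0) -> GradientBoostingRegressor:
    """GBR configured with the hyperparameters reported in the text."""
    return GradientBoostingRegressor(
        n_estimators=200,
        max_depth=5,
        subsample=0.8,
        learning_rate=0.1,
        random_state=random_state,
    )

# Synthetic stand-in data: the target depends on column 0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 100.0 - 20.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

model = make_gbr().fit(X, y)
importances = model.feature_importances_
print(importances.round(3))
```

Inspecting `feature_importances_` in this way is how a dominant feature such as “time since last failure” shows up.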

This obvious problem aside, we also do not want to over-rely on “time since last failure” as a matter of principle. As seen in the data, mechanical failures can and do happen at wildly uneven intervals. This occurs often in the real world, which is one reason predicting RUL is a hard and important problem in general. We tried penalizing the feature “time since last failure” to various degrees in favor of the sensor data, but including the feature at all led the models to perform worse than simply discarding it, so we discarded it. The 10 features most important to the final GBR are shown in the table below.

|Feature|Importance |Category|
|------|---------|-------------|
| MPG 24hr EWMA | 0.391222 | Pressure |
| COMP 24hr EWMA | 0.146122 | Pressure |
| Temp trend 720hr | 0.082274 | Temperature |
| DV pressure rolling 24hr min | 0.065527 | Pressure |
| Caudal impulses | 0.056647 | Flow |
| H1 24hr EWMA | 0.024132 | Pressure |
| LPS 24hr EWMA | 0.020982 | Pressure |
| Temp recent 24hr max | 0.018370 | Temperature |
| DV pressure rolling 24hr max | 0.015326 | Pressure |
| Motor 48hr volatility | 0.013633 | Motor |

With these features, the model performed reasonably, achieving a mean absolute error (MAE) of 171.44 hours. Its main defect is that the predictions are noisy and volatile. The predicted RUL on the train and test data is plotted below.

<img src="GBR.png" width = "650">

#### Cox proportional hazards model and Cox/GBR ensemble
Another model we experimented with is Cox regression (also known as Cox proportional hazards model or Cox survival) equipped with L2 regularization. We chose this model because it was invented specifically for purposes of survival analysis, and essentially asks how each feature affects the risk of some event happening at any particular time (to be precise it models the hazard function as a product of some baseline hazard $h(t)$ dependent on time $t$ and an exponential term $e^{\beta^T X}$ dependent on covariates $X$ and parameters $\beta$). Hence it seemed like a good candidate for a model to predict RUL. 
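
Written out in the notation above, the model and the hazard ratio it implies between two covariate vectors $X_1$ and $X_2$ (symbols introduced here for illustration) are:

$$h(t \mid X) = h(t)\, e^{\beta^T X}, \qquad \frac{h(t \mid X_1)}{h(t \mid X_2)} = e^{\beta^T (X_1 - X_2)}$$

The ratio does not depend on $t$, which is the proportional hazards assumption: covariates scale the baseline risk up or down by a constant factor. With L2 regularization, the coefficients are typically fit by maximizing the partial log-likelihood $\ell(\beta)$ minus a penalty $\frac{\lambda}{2} \lVert \beta \rVert_2^2$, where the penalty strength $\lambda$ is a hyperparameter whose value is not given in this report.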

<img src="cox.png" width = "350" style="float: left; margin-right: 10px;">

The Cox model predicted worse than the GBR, achieving an MAE of 292.7 hours. Interestingly, however, the Cox model was the one that best detected the third (“outlier”) failure event: it insisted RUL was very low leading up to it, as can be seen in the graph. We thought we might be able to combine the respective strengths of the GBR and Cox regressors in an ensemble learner that votes on the predictions of both regressors (various voting weights were tried). This model achieved an MAE of 216.0 hours, an improvement over the Cox model by itself (292.7) but worse than the GBR by itself (171.4).
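
One simple way to realize such a vote is a weighted average of the two RUL predictions. The sketch below assumes that form; the function names, the default weight, and the numbers are illustrative, not our tuned ensemble.

```python
import numpy as np

def weighted_ensemble(pred_gbr, pred_cox, w_gbr=0.5):
    """Convex combination of two RUL predictions (weights sum to 1)."""
    if not 0.0 <= w_gbr <= 1.0:
        raise ValueError("w_gbr must be in [0, 1]")
    pred_gbr = np.asarray(pred_gbr, dtype=float)
    pred_cox = np.asarray(pred_cox, dtype=float)
    return w_gbr * pred_gbr + (1.0 - w_gbr) * pred_cox

def mae(y_true, y_pred):
    """Mean absolute error in the units of the target (hours here)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Made-up predictions for two hourly rows.
blended = weighted_ensemble([100.0, 200.0], [300.0, 400.0], w_gbr=0.75)
print(blended, mae([120.0, 260.0], blended))
```

Because the blend is a convex combination, its absolute error at each time step cannot exceed the larger of the two models' errors at that step, but it can easily exceed the smaller one, which matches the ensemble landing between the Cox and GBR results.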



#### LSTM Model Development
Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) specifically designed to remember long-term dependencies, learn temporal patterns, and handle variable-length sequences. The key advantage for predictive maintenance: LSTM learns that "oil temperature rising from 65°C to 82°C over 24 hours" is different from "constant 82°C operation," even though the final temperature is the same.

<img src="LSTM.png" height="300" width="175" style="float: left; margin-right: 10px;">
Pictured on the left is the architecture of our LSTM network. It consists of 2 LSTM layers: the first (64 units) learns high-level temporal patterns across all 233 features, and the second (32 units) refines these patterns into failure-relevant sequences. More layers would risk overfitting given our limited 4 failure events. The unit counts were chosen to keep the network from simply memorizing training sequences. As a rule of thumb, each layer has at most 50% as many units as its input has features: 233 features → 64 units (27%) → 32 units (50% of 64).
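
To make the capacity of these two layers concrete, the sketch below counts trainable weights using the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). It covers the LSTM layers only and ignores dense and batch-normalization layers; the helper name is hypothetical.

```python
def lstm_param_count(input_dim: int, units: int) -> int:
    """Trainable weights in a standard LSTM layer with biases."""
    return 4 * (units * (input_dim + units) + units)

layer1 = lstm_param_count(input_dim=233, units=64)  # first LSTM layer
layer2 = lstm_param_count(input_dim=64, units=32)   # second LSTM layer
print(layer1, layer2, layer1 + layer2)  # 76288 12416 88704
```

Even this small network has tens of thousands of weights, which is why regularization matters so much with only a few thousand training sequences.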

Dropout randomly deactivates neurons during training, and so values of 0.2-0.3 were chosen to force the network to learn robust patterns, which is vital with only 4 failure events. Batch normalization normalizes activations between layers, speeds up training and improves stability, and helps with gradient flow in deep networks. Lastly, ReLU activation in dense layers is computationally efficient and standard for hidden layers in modern neural networks.

A 24-hour lookback window was chosen because it matches daily operational cycles and is long enough to capture degradation trends while still short enough to leave sufficient training sequences. All 233 engineered features are used at each timestep, so every hour in a sequence carries the full vector of sensor and engineered features.
<br>Censored data (where no failure is observed) creates ambiguity for RUL targets. For a sequence at time t with no observed failure, what's the "true" RUL? Unknown. Training on events ensures clear ground truth, and so the data was split in the following way:

Training: 3,091 event samples → 3,022 sequences
<br>Validation: Split 80/20 from training → 2,417 train, 605 validation
<br>Test: 319 event samples → 296 sequences
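
The relationship between samples and sequences follows from the sliding window: a contiguous run of $n$ hourly rows yields $n - 24 + 1$ sequences. The sketch below assumes one contiguous segment and labels each window with the RUL of its last hour; the function name and the zero-filled data are placeholders.

```python
import numpy as np

def build_sequences(features, rul, lookback=24):
    """Stack overlapping windows of `lookback` rows; target is the RUL at the window's last row."""
    features = np.asarray(features)
    rul = np.asarray(rul)
    n_windows = len(features) - lookback + 1
    if n_windows <= 0:
        return np.empty((0, lookback, features.shape[1])), np.empty((0,))
    X = np.stack([features[i:i + lookback] for i in range(n_windows)])
    y = rul[lookback - 1:]
    return X, y

# Placeholder data with the test-set dimensions quoted above.
demo_features = np.zeros((319, 233), dtype=np.float32)
demo_rul = np.arange(319, 0, -1, dtype=np.float32)
X, y = build_sequences(demo_features, demo_rul)
print(X.shape, y.shape)  # (296, 24, 233) (296,)
```

Each contiguous segment loses 23 rows to the lookback, so 319 test samples give 296 sequences. The training count (3,091 → 3,022, a loss of 69 = 3 × 23) would be consistent with three separately windowed segments, although that segmentation is an inference rather than something stated here.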

| Epoch | Training Loss | Validation Loss | Notes |
|----|----------|--------|-------------|
| 1 | 830000 | 530859 | Initial high loss |
| 5 | 710000 | 449974 | Rapid improvement |
| 8 | 550000 | 389226 | Best validation loss |
| 13 | 320000 | 361391 | Early stopping triggered |
| 25 | 140000 | 358000 | (Hypothetical without stopping) |
| 50 | 58000 | 380000 | Training continues to decrease |

We can observe that the gap between training and validation loss widens after epoch 8, indicating overfitting. The divergence between training (58k) and validation (380k) loss by epoch 50 reveals that the model is memorizing training sequences rather than learning generalizable patterns. However, even a partially overfitted temporal model beats the static models, for the following reasons:
<br>- Temporal patterns (even from 4 examples) are better than static features
<br>- LSTM captures "how failures develop" not just "what failed sensors look like"
<br>- Regularization (dropout, early stopping) limits overfitting damage
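
The early-stopping rule mentioned above can be sketched independently of any deep learning framework. The patience of 5 is an assumption (it would be consistent with a best epoch of 8 and a stop at epoch 13), and the loss values below are invented for illustration.

```python
import math

def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss; if that never happens, it runs to the end.
    """
    best_loss = math.inf
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation curve: improves for 3 epochs, then plateaus.
hypothetical_val = [10.0, 8.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]
stop_epoch, best_epoch = early_stopping_epoch(hypothetical_val)
print(stop_epoch, best_epoch)  # 8 3
```

In practice the weights from `best_epoch` are restored after stopping, so the later, more overfitted epochs never reach the final model.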

Comparing the models, we observe that the LSTM achieves the best performance of all models, with an MAE of 139.38 hours.

<img src="modelcomparison.png" width = "550" style="float: left; margin-right: 10px;" >

LSTM vs GBR (171.44h): 18.7% better
<br>LSTM vs Ensemble (216.02h): 35.5% better
<br>LSTM vs Cox (292.69h): 52.4% better
<br>LSTM vs Baseline (576.58h): 75.8% better
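
These percentages are relative reductions in MAE, $100 \cdot (\text{MAE}_{ref} - \text{MAE}_{LSTM}) / \text{MAE}_{ref}$, with the LSTM's 139.38-hour MAE. The arithmetic can be checked with a few lines (the helper name is hypothetical):

```python
def improvement_pct(reference_mae: float, model_mae: float) -> float:
    """Relative MAE reduction of `model_mae` versus `reference_mae`, in percent."""
    return 100.0 * (reference_mae - model_mae) / reference_mae

LSTM_MAE = 139.38  # hours, as reported in this document
REFERENCE_MAES = {"GBR": 171.44, "Ensemble": 216.02, "Cox": 292.69, "Baseline": 576.58}

for name, ref in REFERENCE_MAES.items():
    print(f"LSTM vs {name}: {improvement_pct(ref, LSTM_MAE):.1f}% better")
```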

#### Regarding Classification Method
During our experiments we tried reframing the task as a classification problem rather than a regression problem. Instead of asking at each hour “how much RUL is left?” (seeking some positive real number), we ask “will a failure happen in the next N hours?” (seeking a yes or no). At first glance this may appear to simplify the situation, but the scarcity of failure events makes such a classification extremely difficult because the dataset is so heavily dominated by negative samples. For example, if we take N=100 and downsample to 1-hour intervals over a 7-month (~5100 hour) period, we get ~5000 unique 100-hour windows, of which ~300 contain a failure event (given that failure event #3 happens a little over 100 hours after event #2). A trivial classifier that always predicts the negative class is therefore correct (5000-300)/5000 = 94% of the time: high accuracy, but terrible precision and recall.

Taking N smaller exacerbates this issue (e.g. with N=50 the trivial model predicts with ~97% accuracy). Taking N large means asking a decreasingly informative version of our original question: if we ask whether a failure event will happen in the next N hours for large N, we may as well try to predict RUL continuously. This is especially true given how poorly our classifiers predicted: no classifier we tried produced better overall metrics than the trivial classifier, even with undersampling.
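
The accuracy figures for the trivial classifier can be reproduced with the approximate window counts quoted above. In this sketch the count of ~150 positive windows for N=50 is our assumption (half of the N=100 count), and the function name is hypothetical.

```python
def always_negative_metrics(n_windows: int, n_positive: int) -> dict:
    """Metrics for a classifier that predicts 'no failure' for every window."""
    true_negatives = n_windows - n_positive
    return {
        "accuracy": true_negatives / n_windows,
        # No positives are ever predicted, so none are recovered.
        "recall": 0.0 if n_positive else None,
        # Precision is undefined (0 / 0) without any positive predictions.
        "precision": None,
    }

n100 = always_negative_metrics(n_windows=5000, n_positive=300)
n50 = always_negative_metrics(n_windows=5000, n_positive=150)
print(n100["accuracy"], n50["accuracy"])  # 0.94 0.97
```

This is why accuracy alone is misleading here: a useful classifier has to be judged on recall and precision for the rare positive class.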

#### Final Takeaways
The analysis indicates that LSTM-based temporal deep learning models outperform all other approaches we tried for Remaining Useful Life (RUL) prediction on the MetroPT-3 dataset. The LSTM achieved a 139.38-hour MAE, outperforming Gradient Boosting Regression, Cox Proportional Hazards, and ensemble models, and delivering a 75.8% improvement over the baseline. This suggests that modeling sequential sensor behavior is important for capturing early degradation signals in industrial equipment. By engineering 233 window-based features and integrating techniques such as dropout, early stopping, and batch normalization, the model achieved greater reliability and consistency.

Visualizations of rolling statistics, degradation patterns, and feature importances provided strong evidence that short-term temporal windows carry the most predictive signal, while cumulative or long-horizon features either leaked information or added noise. We also developed several tools to support analysis, including an automated RUL-labeling pipeline, a feature-generation module for rolling and window-based statistics, and sequence builders for the LSTM model. A temporal train/test split was used to avoid data leakage, ensuring the model only learned from past data and was evaluated on future unseen time periods, as it would be in real deployment. Visual tools like correlation heatmaps, sensor-trend plots, failure-aligned timelines, and feature-importance charts highlighted the sensors with the strongest short-term patterns. These visualizations were crucial for interpreting the dataset and confirming our modeling choices.

However, the results also reveal a significant structural limitation in the dataset: only four true failure events exist across the entire time range. This scarcity severely constrains generalization and explains the model’s overfitting behavior (training loss ~58k vs. validation loss ~380k). All models showed weakened performance on censored data and struggled with long-term prediction due to weak degradation trends in the sensors themselves. Attempts to generate synthetic data did not improve performance because synthetic samples replicated the same weak failure signal as the real data. Despite these limitations, the models consistently produced actionable early-warning windows, often more than 100 hours in advance, demonstrating clear business value. Our findings suggest that the modeling approach is sound but that dataset size is the bottleneck; scaling to 30+ failure cycles is essential to reach production-grade reliability (<50h MAE). Future improvements such as uncertainty quantification, transfer learning, and hybrid real-time systems can further enhance robustness, but the current results already support the feasibility and potential impact of predictive maintenance using temporal modeling.


# Appendix
#### Roles and Contributions
James - Machine learning models (GBR, Cox regressor, ensembles, classifiers, others not included), synthetic data, corresponding sections of slides and report

Datta - Data cleaning, feature engineering, LSTM, model comparison & experimentation

Heba - EDA, report coordination & editing

Faaizah - EDA

Lasya - Result interpretation

#### Peer Assessment