<a href="https://colab.research.google.com/github/Ogunfool/Prognostics-Strategies-An-Aero-engine-Use-case/blob/main/Residual_Similarity_Based_RUL_Estimation_on_test_dataset_(FD002_NASA_testset).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The health indices (HIs) produced by the different models have already been collected in the master dataframe of HIs, so this notebook only requires:


1.   The test_RUL dataframe: provided on the NASA website. It contains the true RULs of the test set, which were not used for training or validation.
2.   The master dataframe (newcomp_df): contains the HIs from the different models for the train and test sets.

The following are also required:

1.   The HI prediction or construction models (supervised or unsupervised learning), which are needed to obtain the HI for the test data. Recall that before the best model is used to predict the test data HI, the preprocessing, feature selection and data preparation operations have to be applied to the test dataset.
2.   The maximum cycle dataframe, which is required by the residual-similarity-based RUL model.







Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import math
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.decomposition import PCA
import seaborn as sns

Helper Functions

In [2]:
np.set_printoptions(suppress=True, linewidth=100, precision=2)

In [3]:
# Checkpoints - array and list checkpoints stored as .npy files
# np.save() saves a single array in binary NumPy format; pass filename without the '.npy' extension
def checkpoints(filename, checkpoint_data):
  np.save(filename, checkpoint_data)
  checkpoint_variable = np.load(filename + '.npy') # Reload so that we always have an on-hand version of the checkpoint
  return checkpoint_variable

# List checkpoint (allow_pickle=True so that object arrays can be stored and reloaded)
def list_checkpoints(filename, checkpoint_data):
  np.save(filename, checkpoint_data, allow_pickle=True)
  checkpoint_variable = np.load(filename + '.npy', allow_pickle=True) # Reload so that we always have an on-hand version of the checkpoint
  return checkpoint_variable

Load Data

In [None]:
from google.colab import files
uploaded = files.upload()

In [5]:
# Test Data and true RUL.
test_RUL = pd.read_csv('/content/RUL_FD002.txt', header=None)

In [6]:
# Combined df and test df
newcomp_df = pd.read_csv('/content/newcomp-df')
HItest_df = pd.read_csv('/content/HItest-df')

In [7]:
max_cycle_df = pd.read_csv('/content/max-cycle-df (1)')

In [None]:
# Import model (requires TensorFlow, which is not imported in the cells above)
import tensorflow as tf
CNN_model = tf.keras.models.load_model('/content/best_model_CNN.h5')

In [8]:
test_RUL.head()

Unnamed: 0,0
0,18
1,79
2,106
3,110
4,15


In [9]:
newcomp_df.head()

Unnamed: 0,Unit_id,Time,ori_HI,Maxcycle,RUL,HI_PCA,apprx_HI,apprx_HI_CNN
0,1,1,1.0,149,148,-0.880739,0.556799,0.0
1,1,2,0.993243,149,147,-0.754394,0.586015,0.0
2,1,3,0.986486,149,146,-0.553637,0.607174,0.0
3,1,4,0.97973,149,145,-1.370404,0.731363,0.0
4,1,5,0.972973,149,144,-0.670824,0.573055,0.725099


In [10]:
HItest_df.head()

Unnamed: 0,Unit_id,time,RUL,ori_HI,HI_PCA,apprx_HI,apprx_HI_CNN
0,1,1,275,0.996377,-1.014529,0.621829,0.0
1,1,2,274,0.992754,-0.997621,0.676785,0.0
2,1,3,273,0.98913,-1.082384,0.662697,0.0
3,1,4,272,0.985507,-0.266505,0.48449,0.0
4,1,5,271,0.981884,-1.59329,0.735416,0.733132


In [11]:
max_cycle_df.head()

Unnamed: 0,max_cycle
0,149
1,269
2,206
3,235
4,154


Prepare max_cycle_df dataframe for use.

In [14]:
# Maximum cycle (run-to-failure length) of each training instance.
# The RUL estimate later takes the median over the max cycles of the nearest neighbours.
max_cycle_df = newcomp_df.groupby('Unit_id')['Time'].max().reset_index().rename(columns={'Time':'max_cycle'})
max_cycle_df.head()

Unnamed: 0,Unit_id,max_cycle
0,1,149
1,2,269
2,3,206
3,4,235
4,5,154


In [15]:
max_cycle_df.set_index('Unit_id',inplace=True)
max_cycle_df.head()

Unnamed: 0_level_0,max_cycle
Unit_id,Unnamed: 1_level_1
1,149
2,269
3,206
4,235
5,154


# Supervised learning HI Approximation Models Evaluation on test data.

Test Data RUL Estimation - Use all train set, including validation data.

In [16]:
# Residual between a test trajectory and every training trajectory.
# lent (the test history length) can exceed some training instances' max cycle,
# so shorter training trajectories are zero-padded up to lent.
# Assumes Unit_id values are the consecutive integers 1..no_instances.
def residual_func(train_df, test_df, lent, name):
  unit_ids = train_df['Unit_id'].unique()
  no_instances = unit_ids.shape[0]
  res_mat = np.zeros((lent, no_instances))
  cur_val = test_df[name][:lent]
  for unit_id in unit_ids:
    # First lent HI values of this training instance
    cur_train = train_df.loc[train_df['Unit_id'] == unit_id, name].to_numpy()[:lent]
    # An instance that failed before lent cycles keeps zeros in its remaining rows
    res_mat[:cur_train.shape[0], unit_id-1] = cur_train

  # Root-mean-square residual per training instance
  diff = res_mat - cur_val.to_numpy().reshape((-1,1))
  residual = np.sqrt(np.mean(diff**2, axis=0))
  return residual

def neighbors(residual,nearest):
  n_neighbors = np.argsort(residual)[:nearest]
  return n_neighbors
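
To make the residual step concrete, here is a minimal, self-contained sketch on invented data. The three training trajectories and the test trajectory are hypothetical, and the loop is a simplified stand-in for `residual_func`, not a copy of it. It shows the same ideas: truncate each training HI to the test length, zero-pad units that failed earlier, and rank units by root-mean-square residual.

```python
import numpy as np

# Hypothetical HI trajectories for three training units (keys are 1-based Unit_id).
# Unit 2 is shorter than the test history, so it gets zero-padded.
train_his = {
    1: np.array([1.00, 0.90, 0.80, 0.70, 0.60]),
    2: np.array([1.00, 0.80, 0.60]),
    3: np.array([1.00, 0.95, 0.90, 0.85, 0.80]),
}
test_hi = np.array([1.00, 0.92, 0.81, 0.72])  # hypothetical observed HI of one test unit
lent = len(test_hi)

res_mat = np.zeros((lent, len(train_his)))
for unit_id, hi in train_his.items():
    window = hi[:lent]
    res_mat[:len(window), unit_id - 1] = window  # remaining rows stay zero

# Root-mean-square residual between the test history and each training unit
residual = np.sqrt(np.mean((res_mat - test_hi.reshape(-1, 1)) ** 2, axis=0))

# Two nearest training units, converted back to 1-based Unit_id
nearest_units = np.argsort(residual)[:2] + 1
print(residual.round(3), nearest_units)
```

In this toy case unit 1 is the closest match and unit 3 the second closest. The zero-padding gives unit 2 a large residual, which pushes already-failed units away from the neighbour set.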

In [17]:
# The function for new (test) data differs slightly from the one used for validation data
def RUL_estimator(train_df, test_df, max_cycle_df, nearest, name, id):
  # Length of the observed test trajectory
  lent = test_df.shape[0]

  # Residual between the test trajectory and every training instance
  residual = residual_func(train_df, test_df, lent, name)

  # Nearest neighbours - `nearest` sets how many are kept
  n_neighbors = neighbors(residual,nearest)

  # RUL estimation: each neighbour's remaining life is its max cycle minus the elapsed test cycles
  # (n_neighbors holds 0-based positions, Unit_id is 1-based)
  ens_RULs = max_cycle_df.loc[n_neighbors+1] - lent

  # True RUL of this test instance (relies on the global test_RUL dataframe)
  true_RUL = test_RUL.iloc[id-1]

  # Neighbours that failed 10 or more cycles before the current test length are set to 0
  m = ens_RULs[ens_RULs>-10]
  m[m.isna()] = 0

  est_RUL = m['max_cycle'].median()

  return (est_RUL, true_RUL, ens_RULs)
  # return lent, residual, n_neighbors
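
The ensemble step can be traced by hand on a small example. The sketch below reuses the first five max-cycle values shown earlier. The test length `lent` and the neighbour positions are invented for illustration, and `DataFrame.where` stands in for the mask-then-fill logic used in `RUL_estimator`.

```python
import numpy as np
import pandas as pd

# First five max-cycle values from max_cycle_df.head(), indexed by Unit_id
max_cycle_toy = pd.DataFrame(
    {"max_cycle": [149, 269, 206, 235, 154]},
    index=pd.Index([1, 2, 3, 4, 5], name="Unit_id"),
)

lent = 160                         # hypothetical number of observed test cycles
n_neighbors = np.array([1, 3, 0])  # hypothetical 0-based positions of the nearest units

# Remaining life implied by each neighbour: its max cycle minus elapsed test cycles
ens_RULs = max_cycle_toy.loc[n_neighbors + 1] - lent

# Values of -10 or below (neighbour failed well before lent) are replaced by 0
kept = ens_RULs.where(ens_RULs > -10, 0)

est_RUL = kept["max_cycle"].median()
print(ens_RULs["max_cycle"].tolist(), est_RUL)
```

Units 2, 4 and 1 imply remaining lives of 109, 75 and -11 cycles. The last value is replaced by 0, so the median estimate is 75 cycles.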

In [18]:
def RUL_Estimator_coll(train_df, test_df, max_cycle_df, nearest, name):
  val_inst = []
  # test_data unique ID
  inst_id = test_df['Unit_id'].unique()

  # Loop through all test_instances
  for k in inst_id: 
    cur_val_df = test_df[test_df['Unit_id'] == k]
    val_ins = RUL_estimator(train_df=train_df, test_df=cur_val_df, max_cycle_df=max_cycle_df, nearest=nearest, name=name, id=k)
    val_inst.append(val_ins)

  return val_inst

In [19]:
ll = RUL_Estimator_coll(train_df=newcomp_df, test_df=HItest_df, max_cycle_df=max_cycle_df, nearest=50, name='ori_HI')

In [20]:
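# Asymmetric exponential score; errors are estimated RUL minus true RUL.
# Returns the two partial sums separately: [negative errors, non-negative errors].
def score(errors):
  a1=10
  a2=13
  s1=0
  s2=0
  for err in errors:
    if err < 0:
      s1 += np.exp(-err/a1) - 1
    elif err >= 0:
      s2 += np.exp(err/a2) - 1
  return [s1, s2]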
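
The following sketch illustrates how this score behaves. `toy_score` is a hypothetical vectorised mirror of `score` that uses the same constants as this notebook (`a1=10` for negative errors, `a2=13` for non-negative errors). It is meant only to show the asymmetry and the exponential growth of the penalty.

```python
import numpy as np

def toy_score(errors, a1=10, a2=13):
    """Vectorised mirror of score(); errors are estimated RUL minus true RUL."""
    errors = np.asarray(errors, dtype=float)
    early = errors[errors < 0]    # underestimated RUL
    late = errors[errors >= 0]    # overestimated (or exact) RUL
    s1 = np.sum(np.exp(-early / a1) - 1)
    s2 = np.sum(np.exp(late / a2) - 1)
    return [s1, s2]

s_under, _ = toy_score([-20])      # 20 cycles too low
_, s_over = toy_score([20])        # 20 cycles too high
s_under_big, _ = toy_score([-40])  # doubling the error

print(round(s_under, 2), round(s_over, 2), round(s_under_big, 2))
```

With these constants, an error of -20 cycles costs more than an error of +20 cycles, and doubling the error to -40 multiplies the penalty several times over. A handful of large errors can therefore dominate the summed score, even when RMSE and MAE look moderate.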

In [21]:
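def RUL_metrics(a):
  RMSE = []
  MAE = []
  SCORE = []

  # Collect estimated (r0) and true (r1) RULs into arrays for easy computation
  r0 = np.zeros(len(a))
  r1 = np.zeros(len(a))
  for idx, res in enumerate(a):
    r0[idx] = res[0]
    r1[idx] = np.ravel(res[1])[0]  # res[1] may be a one-element Series

  errors = r0-r1  # estimated minus true
  # Taking np.sqrt of the MSE avoids the `squared` argument, which is deprecated in recent scikit-learn releases
  RMSE.append(np.sqrt(mean_squared_error(r1, r0)))
  MAE.append(mean_absolute_error(r1, r0))
  SCORE.append(score(errors))

  return RMSE, MAE, SCORE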


Using the original HI.

*   This assumes that the original HI is known, or that we can find a model that can approximate it better. The result below shows that the best (lowest) score we can possibly get from a residual-similarity model with degradation-degree or health-index estimation on this dataset is around 3500.
*   Some publications in the literature have obtained better results with unsupervised learning models and direct RUL prediction models. So rather than spending more time on the supervised learning approach to estimating the health index, we will focus our effort on improving the unsupervised learning models and getting better results with direct RUL approximation methods.

In [22]:
# L is Full
val_inst = RUL_Estimator_coll(train_df=newcomp_df, test_df=HItest_df, max_cycle_df=max_cycle_df, nearest=50, name='ori_HI')
RMSE, MAE, SCORE = RUL_metrics(a=val_inst)

In [23]:
# With Complete data
print(RMSE,MAE,SCORE)

[12.15298237861722] [6.498069498069498] [[3436.777901150471, 41.852459293933336]]


Test score with Linear Regression.

In [24]:
# L is Full
val_inst = RUL_Estimator_coll(train_df=newcomp_df, test_df=HItest_df, max_cycle_df=max_cycle_df, nearest=50, name='apprx_HI')
RMSE, MAE, SCORE = RUL_metrics(a=val_inst)
print(RMSE,MAE,SCORE)

[30.693528048966026] [23.482625482625483] [[48588.99722556637, 1642.56755761662]]


Test scores with CNN.

In [25]:
# L is Full
val_inst = RUL_Estimator_coll(train_df=newcomp_df, test_df=HItest_df, max_cycle_df=max_cycle_df, nearest=50, name='apprx_HI_CNN')
RMSE, MAE, SCORE = RUL_metrics(a=val_inst)
print(RMSE,MAE,SCORE)

[30.710741018406146] [24.063706563706564] [[29676.863436682877, 1923.5920971820747]]


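Comment: The CNN model's score is substantially better (lower) than that of the linear regression model, mainly through the first score component (about 29,677 versus 48,589), while RMSE and MAE are essentially unchanged (RMSE is about 30.7 in both cases; MAE is 24.1 versus 23.5). There is still a long way to go: we need a more sophisticated model to get close to the baseline score of about 3500 obtained with the original HI.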
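
For a like-for-like comparison, the two score components can be summed per model. The snippet below is plain arithmetic on the values printed above. The dictionary keys are labels chosen here for readability.

```python
# Score components [s1, s2] copied from the printed outputs above
reported = {
    "ori_HI": (3436.777901150471, 41.852459293933336),
    "apprx_HI (linear regression)": (48588.99722556637, 1642.56755761662),
    "apprx_HI_CNN": (29676.863436682877, 1923.5920971820747),
}

# Total score per model and its ratio to the original-HI baseline
totals = {name: s1 + s2 for name, (s1, s2) in reported.items()}
baseline = totals["ori_HI"]

for name, total in totals.items():
    print(f"{name}: total score {total:.1f} ({total / baseline:.1f}x baseline)")
```

On these numbers the CNN total is roughly 9 times the baseline and the linear regression total roughly 14 times the baseline, which quantifies the remaining gap.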







# Unsupervised learning HI Construction Models Evaluation on test data.

Autoencoders test