The goal of this project is to develop an ML model for Remaining Useful Life (RUL) prediction based on the NASA Bearing dataset from the Kaggle platform. See data/sources.txt for the data sources and for feature engineering and selection ideas. The final result is an sklearn pipeline. RUL is expressed as the number of remaining rotations (not time).

In this analysis, feature engineering is limited to feature extraction. The remaining steps (feature selection, scaling, etc.) are handled by AutoML.
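
To make the label definition concrete, the sketch below converts measurement timestamps into remaining rotations at a constant shaft speed. It is only an illustration: the helper name `rul_in_rotations` is hypothetical, and it assumes that the last measurement of a run marks the failure. The actual labels come from `extract_features`, whose internals are not shown in this notebook.

```python
import pandas as pd

def rul_in_rotations(timestamps, shaft_rpm):
    """Remaining rotations until the last timestamp (assumed to be the failure)."""
    timestamps = pd.to_datetime(pd.Series(timestamps))
    # Time left until the final measurement, in seconds
    remaining_seconds = (timestamps.max() - timestamps).dt.total_seconds()
    # Constant shaft speed assumed: rotations = seconds * rpm / 60
    return remaining_seconds * shaft_rpm / 60.0

# Three made-up measurements taken 10 minutes apart
stamps = ["2004-02-12 10:32:39", "2004-02-12 10:42:39", "2004-02-12 10:52:39"]
rul = rul_in_rotations(stamps, shaft_rpm=2000)
print(rul.tolist())
```

Expressing RUL in rotations rather than time keeps the label independent of shaft speed, as long as the speed used for the conversion matches the test rig.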

1. Data preparation - feature extraction

Because only some of the recorded bearings actually failed, only their data is used in model development.

1.1 Extract features and labels from the dataset and save them to a separate file to avoid recalculation on each run

In [None]:
import pandas as pd
from feature_extraction.feature_extraction import extract_features

directories_list = ['data/1st_test/1st_test', 'data/2nd_test/2nd_test', 'data/3rd_test/4th_test/txt']
columns_indices_list = [[4,5,6,7], [0], [2]]
time_format = r'%Y.%m.%d.%H.%M.%S'
sampling_freq = 20000
sampling_time = 1
shaft_rpm = 2000

bearing_properties = {'roll_elem_diam'  : 0.331,
                      'pitch_diam'      : 2.815,
                      'roll_elem_count' : 16,
                      'contact_angle'   : 15.17}

rul_rotations_df_list = []
time_df_list = []
orders_df_list = []

for directory, column_indices in zip(directories_list, columns_indices_list):
    rul_rotations, time_features, orders_features = extract_features(
        directory,
        column_indices,
        time_format,
        sampling_freq,
        sampling_time,
        shaft_rpm,
        bearing_properties['roll_elem_diam'],
        bearing_properties['pitch_diam'],
        bearing_properties['roll_elem_count'],
        bearing_properties['contact_angle'],
    )
    rul_rotations_df_list.append(rul_rotations)
    time_df_list.append(time_features)
    orders_df_list.append(orders_features)

cummulated_rul_rotations_df = pd.concat(rul_rotations_df_list, ignore_index=True, axis=0)
cummulated_time_features_df = pd.concat(time_df_list, ignore_index=True, axis=0)
cummulated_orders_features_df = pd.concat(orders_df_list, ignore_index=True, axis=0)

extracted_data = pd.concat((cummulated_rul_rotations_df, cummulated_time_features_df, cummulated_orders_features_df), axis=1)
extracted_data.to_csv('extracted_data', index=False)
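
The bearing geometry passed to `extract_features` is the kind of input typically used to locate characteristic fault frequencies in the order spectrum. As a hedged illustration, the sketch below evaluates the textbook formulas for those frequencies as multiples of shaft speed (orders), assuming a stationary outer race and a rotating inner race. The function name `bearing_fault_orders` is hypothetical, and whether `extract_features` uses exactly these formulas is an assumption.

```python
import math

def bearing_fault_orders(roll_elem_diam, pitch_diam, roll_elem_count, contact_angle_deg):
    """Textbook bearing fault frequencies as multiples of shaft speed (orders).

    Assumes the outer race is stationary and the inner race rotates with the shaft.
    """
    ratio = roll_elem_diam / pitch_diam * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF": 0.5 * (1 - ratio),                                    # cage (fundamental train)
        "BPFO": 0.5 * roll_elem_count * (1 - ratio),                 # outer race defect
        "BPFI": 0.5 * roll_elem_count * (1 + ratio),                 # inner race defect
        "BSF": 0.5 * pitch_diam / roll_elem_diam * (1 - ratio ** 2), # rolling element spin
    }

# Geometry and shaft speed as defined in the cell above
orders = bearing_fault_orders(0.331, 2.815, 16, 15.17)
shaft_hz = 2000 / 60
fault_hz = {name: order * shaft_hz for name, order in orders.items()}
print(orders)
print(fault_hz)
```

With this geometry the outer race order comes out near 7.09, which corresponds to roughly 236 Hz at 2000 rpm. Working in orders rather than Hz keeps such features comparable if the shaft speed changes.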

In [None]:
import pandas as pd

extracted_data = pd.read_csv('extracted_data')

2. Preparation of train and test sets

In [None]:
y_raw = extracted_data['RUL_rotations']
X_raw = extracted_data.drop('RUL_rotations', axis=1)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.1)

3. Definition of a custom scorer. The motivation is as follows: 95% of predictions should deviate from the true value by no more than 5%

In [None]:
import numpy as np
from sklearn.metrics import make_scorer

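def custom_scorer_fn(y, y_pred, **kwargs):
    # 95th percentile of the absolute relative error: 95% of predictions
    # deviate from the true value by no more than the returned fraction.
    error = y - y_pred
    relative_error = error / y
    abs_relative_error = np.abs(relative_error)
    model_error = np.percentile(abs_relative_error, 95)
    return model_error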

custom_scorer = make_scorer(custom_scorer_fn, greater_is_better=False)
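
A small worked example may help in reading the metric. The sketch below applies the same 95th-percentile idea to four made-up samples. One deliberate difference from `custom_scorer_fn`: samples with a true RUL of zero are skipped, because dividing by zero would otherwise make the percentile infinite or NaN. That filtering is an assumption for illustration and is not part of the scorer used for training; the name `p95_abs_relative_error` is hypothetical.

```python
import numpy as np

def p95_abs_relative_error(y, y_pred):
    """95th percentile of |y - y_pred| / y, ignoring samples where y == 0."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y != 0  # assumption: drop zero-RUL samples to avoid division by zero
    abs_rel = np.abs((y[nonzero] - y_pred[nonzero]) / y[nonzero])
    return np.percentile(abs_rel, 95)

y_true = np.array([100.0, 200.0, 400.0, 0.0])
y_hat = np.array([104.0, 190.0, 400.0, 10.0])

score = p95_abs_relative_error(y_true, y_hat)
print(score)          # fraction, not percent
print(score <= 0.05)  # True when the 5% / 95% target is met on this sample
```

A value of 0.05 or less means the target from the heading is met. Because the scorer is created with `greater_is_better=False`, `make_scorer` negates this value so that a search which maximizes the score ends up minimizing the error.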

4. Training using AutoML library - TPOT

In [None]:
from sklearn.model_selection import KFold
from tpot import TPOTRegressor

new_cv = KFold(n_splits=9, shuffle=True, random_state=42)

tpotregr = TPOTRegressor(scoring=custom_scorer, cv=new_cv, n_jobs=10, max_time_mins=720, random_state=21, warm_start=True, early_stop=100)
tpotregr.fit(X_train, y_train.values.ravel())

In [None]:
tpotregr.export('rul_tpot_pipeline')

5. Retraining of the model on all training data (no cross-validation)

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor

pipeline = make_pipeline(
    StackingEstimator(estimator=ExtraTreesRegressor(bootstrap=False, max_features=1.0, min_samples_leaf=7, min_samples_split=3, n_estimators=100)),
    XGBRegressor(learning_rate=0.01, max_depth=10, min_child_weight=5, n_estimators=100, n_jobs=-1, objective="reg:squarederror", subsample=0.3, verbosity=0)
)

In [None]:
pipeline.fit(X_train, y_train)

6. Analysis of model performance

The measures of performance are:
1. the result of the custom scoring function
2. the signed prediction error as a percentage of remaining useful life

6.1 Custom scoring function

In [None]:
import numpy as np

y_pred = pipeline.predict(X_test)
y_test = np.asarray(y_test).ravel()  # works for a Series or an array, so the cell can be re-run
custom_scorer_fn(y_test, y_pred)

6.2 Signed prediction error as percentage of remaining useful life

In [None]:
error = y_test - y_pred
relative_error = error / y_test
relative_error[np.isinf(relative_error)] = np.nan
relative_error_percentage = relative_error * 100

indices = np.argsort(y_test)
y_test_arr_sorted = y_test[indices]
rel_err_perc_sorted = relative_error_percentage[indices]

In [None]:
import plotly.express as px

fig = px.line(x=y_test_arr_sorted, y=rel_err_perc_sorted)

fig.update_layout(
    xaxis_title='True remaining useful life [rotations]',
    yaxis_title='Relative model error [%]',
)
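
The plot shows the error sample by sample. The 5% / 95% target from section 3 can also be checked as a single number: the share of predictions whose relative error stays within the tolerance. The sketch below uses made-up values; the helper name `fraction_within_tolerance` is hypothetical, and excluding zero-RUL samples is an assumption made to avoid division by zero.

```python
import numpy as np

def fraction_within_tolerance(y_true, y_pred, tol_percent=5.0):
    """Share of predictions whose |relative error| is at most tol_percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0  # assumption: zero-RUL samples are excluded
    rel_err_percent = (y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero] * 100
    return float(np.mean(np.abs(rel_err_percent) <= tol_percent))

y_true_demo = np.array([100.0, 200.0, 400.0, 1000.0])
y_pred_demo = np.array([103.0, 230.0, 390.0, 1000.0])

share = fraction_within_tolerance(y_true_demo, y_pred_demo)
print(share)  # the target is met when this is at least 0.95
```

On the real data this would be called with `y_test` and `y_pred` from section 6.1.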

7. Conclusions

The performance of the trained model is unsatisfactory, so this experiment is considered unsuccessful.