# Predict Turbofan Degradation

In this example, we build a machine learning application that predicts turbofan engine degradation. This application is structured into three important steps:

* Prediction Engineering
* Feature Engineering
* Machine Learning

In the first step, we generate new labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, we generate features for the labels by using [Featuretools](https://docs.featuretools.com/). In the third step, we search for the best machine learning pipeline by using [EvalML](https://evalml.alteryx.com/). 
After working through these steps, you will learn how to build machine learning applications for real-world problems like predictive maintenance. Let's get started.

In [None]:
from demo.predict_rul import load_sample
from matplotlib.pyplot import subplots
import composeml as cp
import featuretools as ft
import evalml

We will use a dataset provided by NASA that simulates turbofan engine degradation. In this dataset, engines are monitored over time. Each engine has operational settings and sensor measurements recorded for each cycle. The remaining useful life (RUL) is the number of cycles an engine has left before it needs maintenance. What makes this dataset special is that the engines run all the way until failure, giving us precise RUL information for every engine at every point in time.

In [None]:
records = load_sample()

records.head()

## Prediction Engineering

> Which range is the RUL of a turbofan engine in?

In this prediction problem, we want to group the RUL into ranges and then predict which range the RUL is in. We can vary the ranges to create different prediction problems. For example, the ranges can be manually defined (0 - 150, 150 - 300, etc.) or based on the quartiles from historical observations. These variations can be made by simply binning the RUL. This helps us explore different scenarios, which is crucial for making better decisions.
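
As a rough illustration of the two binning variations, the sketch below assigns a handful of RUL values to ranges, first with manually defined edges and then with quartile cut points. The values and the helper `bin_by_edges` are hypothetical and use only the Python standard library; this is not how Compose bins labels internally.

```python
from bisect import bisect_right
from statistics import quantiles


def bin_by_edges(values, edges):
    """Return the index of the range each value falls in.

    Range i covers edges[i] <= value < edges[i + 1].
    """
    return [bisect_right(edges, value) - 1 for value in values]


# Hypothetical RUL values, for illustration only.
rul_values = [10, 40, 75, 120, 160, 210, 280, 290]

# Manually defined ranges: [0, 150) and [150, 300).
manual_bins = bin_by_edges(rul_values, edges=[0, 150, 300])

# Quartile-based ranges: three cut points split the values into four groups.
cut_points = quantiles(rul_values, n=4)
quartile_bins = [bisect_right(cut_points, value) for value in rul_values]

print(manual_bins)
print(quartile_bins)
```

Manual edges keep the ranges easy to interpret, while quartile cut points tend to produce groups of similar size.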

### Defining the Labeling Process

Let's start by defining the labeling function, which calculates the RUL of an engine. Given that engines run all the way until failure, the RUL is just the remaining number of observations.

In [None]:
def rul(ds):
    return len(ds) - 1
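
To see why the function subtracts one, the sketch below applies it to a hypothetical engine with six recorded cycles. It assumes that the slice passed to the labeling function starts at the current observation and runs to the engine's final record; plain Python lists stand in for the data frame.

```python
def rul(ds):
    # The slice includes the current observation, so it is excluded
    # from the count of remaining observations.
    return len(ds) - 1


# Hypothetical engine with six recorded cycles; the last one is the failure.
cycles = [1, 2, 3, 4, 5, 6]

# Assumption: each slice runs from the current cycle to the final record.
labels = {cycle: rul(cycles[i:]) for i, cycle in enumerate(cycles)}

print(labels)
```

Under this assumption, the first cycle has five cycles remaining and the final cycle has none.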

### Representing the Prediction Problem

Then, let's represent the prediction problem by creating a label maker with the following parameters:

* The `target_entity` as the column for the engine ID, since we want to process records for each engine.
* The `labeling_function` as the function we defined previously.
* The `time_index` as the column for the event time.

In [None]:
lm = cp.LabelMaker(
    target_entity='engine_no',
    labeling_function=rul,
    time_index='time',
)

### Finding the Training Examples

Now, let's run a search to get the training examples by using the following parameters:

* The data sorted by the event time.
* The `num_examples_per_instance` as the number of training examples to find for each engine.
* The `minimum_data` as the number of records to skip at the start before beginning the search. In this data sample, engines generally don't fail before 5 cycles, so we start after 5 cycles.
* The `gap` as the number of records to skip between examples. This is done to cover different points in time of an engine.

We can easily tweak these parameters and run more searches for training examples as the requirements of our model change.
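
The interplay of these parameters can be pictured with a simplified model of the search for a single engine. The helper `cutoff_positions` below is hypothetical and only approximates the behavior described above, assuming all three parameters are expressed as record counts; it is not Compose's implementation.

```python
def cutoff_positions(num_records, num_examples_per_instance, minimum_data, gap):
    """Row positions where a training example would start (simplified model)."""
    # Skip the first `minimum_data` records, then step forward by `gap`.
    positions = range(minimum_data, num_records, gap)
    # Keep at most `num_examples_per_instance` examples for this engine.
    return list(positions)[:num_examples_per_instance]


# Hypothetical engine with 130 records, using the settings described above.
positions = cutoff_positions(
    num_records=130,
    num_examples_per_instance=20,
    minimum_data=5,
    gap=20,
)

print(positions)
```

In this model, an engine with few records yields fewer examples than the requested maximum, because the search runs out of data first.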

In [None]:
lt = lm.search(
    records.sort_values('time'),
    num_examples_per_instance=20,
    minimum_data=5,
    gap=20,
    verbose=False,
)

lt.head()

The output from the search is a label times table with three columns:

* The engine ID associated with the records.
* The event time of the engine. This is also known as a cutoff time for building features. Only data that existed beforehand is valid to use for predictions.
* The value of the RUL. This is calculated by our labeling function.

At this point, we only have continuous values of the RUL. As a helpful reference, we can print out the search settings that were used to generate the labels.

In [None]:
lt.describe()

We can also get a better look at the values by plotting the distribution and the cumulative count across time.

In [None]:
%matplotlib inline
fig, ax = subplots(nrows=2, ncols=1, figsize=(6, 8))
lt.plot.distribution(ax=ax[0])
lt.plot.count_by_time(ax=ax[1])
fig.tight_layout(pad=2)

With continuous values, we can explore different ranges without running the search again. Here, we will use quartiles to bin the values into ranges.

In [None]:
lt = lt.bin(4, quantiles=True, precision=0)

When we print out the settings again, we can see that the description of the labels has been updated and reflects the recent changes.

In [None]:
lt.describe()

Let's see the new label distribution and the cumulative count across time.

In [None]:
%matplotlib inline
fig, ax = subplots(nrows=2, ncols=1, figsize=(6, 8))
lt.plot.distribution(ax=ax[0])
lt.plot.count_by_time(ax=ax[1])
fig.tight_layout(pad=2)

## Feature Engineering

In the previous step, we generated the labels. The next step is to generate the features.

### Representing the Data

We will represent the data using an entity set. We currently have a single table of records where one engine can have many records. This one-to-many relationship can be represented in an entity set by normalizing an entity for the engines. The same can be done for engine cycles. Let's start by structuring the entity set.

In [None]:
es = ft.EntitySet('observations')

es.entity_from_dataframe(
    dataframe=records.reset_index(),
    entity_id='records',
    index='id',
    time_index='time',
)

es.normalize_entity(
    base_entity_id='records',
    new_entity_id='engines',
    index='engine_no',
)

es.normalize_entity(
    base_entity_id='records',
    new_entity_id='cycles',
    index='time_in_cycles',
)

es.plot()
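
To make the one-to-many relationship concrete, the sketch below derives parent tables from a flat list of records in plain Python. The rows and the helper `normalize` are hypothetical and only mimic the idea of normalizing an entity; Featuretools also tracks relationships and time information that this sketch omits.

```python
# Hypothetical flat records: one row per engine per cycle.
records = [
    {"id": 0, "engine_no": 1, "time_in_cycles": 1},
    {"id": 1, "engine_no": 1, "time_in_cycles": 2},
    {"id": 2, "engine_no": 2, "time_in_cycles": 1},
]


def normalize(rows, index):
    """Build a parent table with one row per unique value of `index`."""
    parents = {}
    for row in rows:
        # Keep the first occurrence of each value, preserving order.
        parents.setdefault(row[index], {index: row[index]})
    return list(parents.values())


engines = normalize(records, "engine_no")
cycles = normalize(records, "time_in_cycles")

print(engines)
print(cycles)
```

Each row in `records` still carries its `engine_no` and `time_in_cycles`, which is what links a child record back to its parent rows.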

### Calculating the Features

Now, we can generate features by using a method called Deep Feature Synthesis (DFS). This will automatically build features by stacking and applying mathematical operations called primitives across relationships in an entity set. The more structured an entity set is, the better DFS can leverage the relationships to generate better features. Let’s run DFS using the following parameters:

* The `entityset` as the entity set we structured previously.
* The `target_entity` as the engines, since we want to generate features for each engine. 
* The `cutoff_time` as the label times that we generated in the previous step.

In [None]:
fm, fd = ft.dfs(
    entityset=es,
    target_entity='engines',
    agg_primitives=['sum'],
    trans_primitives=[],
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)

fm.head()


There are two outputs from DFS: a feature matrix and feature definitions. The feature matrix is a table that contains the feature values based on the cutoff times from our labels. The feature definitions are a list of features that can be stored and reused later to calculate the same set of features on future data.
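
The role of the cutoff time can be sketched with a single hand-written aggregation. The example below sums a hypothetical `sensor` column over only those records observed strictly before the cutoff, which mirrors the intent of `include_cutoff_time=False`. The data and the helper `sum_before_cutoff` are assumptions for illustration, not Featuretools internals.

```python
from datetime import datetime

# Hypothetical sensor readings for one engine, one record per cycle.
records = [
    {"engine_no": 1, "time": datetime(2000, 1, 1, 0, 0), "sensor": 2.0},
    {"engine_no": 1, "time": datetime(2000, 1, 1, 0, 10), "sensor": 3.0},
    {"engine_no": 1, "time": datetime(2000, 1, 1, 0, 20), "sensor": 5.0},
]


def sum_before_cutoff(rows, engine_no, cutoff, column):
    """Sum `column` over an engine's records observed strictly before `cutoff`."""
    return sum(
        row[column]
        for row in rows
        if row["engine_no"] == engine_no and row["time"] < cutoff
    )


# The record stamped exactly at the cutoff is excluded from the feature.
cutoff = datetime(2000, 1, 1, 0, 20)
feature_value = sum_before_cutoff(records, 1, cutoff, "sensor")

print(feature_value)
```

Restricting each feature to data available before the cutoff keeps information from the future out of the training examples.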

## Machine Learning

In the previous steps, we generated the labels and features. The final step is to build the machine learning pipeline.

### Splitting the Data

Let's start by extracting the labels from the feature matrix and splitting the data into a training set and holdout set.

In [None]:
y = fm.pop('rul').cat.codes
splits = evalml.preprocessing.split_data(fm, y, test_size=0.2, random_state=0)
X_train, X_holdout, y_train, y_holdout = splits
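
For intuition about what a holdout split does, here is a minimal shuffle-based sketch in plain Python. The helper `holdout_split` is hypothetical and does not reproduce the strategy used by `evalml.preprocessing.split_data`; it only shows how rows and labels stay aligned when 20% of the examples are set aside.

```python
import random


def holdout_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle example indices and set aside a fraction as the holdout set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_holdout = round(len(indices) * test_size)
    holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (
        pick(rows, train_idx),
        pick(rows, holdout_idx),
        pick(labels, train_idx),
        pick(labels, holdout_idx),
    )


# Hypothetical feature rows and labels, for illustration only.
rows = [[float(i)] for i in range(10)]
labels = [i % 4 for i in range(10)]

rows_train, rows_holdout, labels_train, labels_holdout = holdout_split(rows, labels)

print(len(rows_train), len(rows_holdout))
```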

### Finding the Best Model

Then, let's run a search on the training set for the best machine learning pipeline.

In [None]:
automl = evalml.AutoMLSearch(problem_type='multiclass', objective='f1_macro')
automl.search(X_train, y_train, data_checks=None, show_iteration_plot=False)

We can print out the details and steps from the best pipeline found.

In [None]:
automl.best_pipeline.describe()
automl.best_pipeline.graph()

Now, let's score the model performance by evaluating predictions on the holdout set.

In [None]:
best_pipeline = automl.best_pipeline.fit(X_train, y_train)
score = best_pipeline.score(X_holdout, y_holdout, objectives=['f1_macro'])
dict(score)
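
For reference, the macro-averaged F1 score used as the objective can be computed by hand: calculate the F1 score of each class separately, then take the unweighted mean. The sketch below uses hypothetical labels and a hypothetical helper `f1_macro`; it follows the standard definition rather than EvalML's code.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical true and predicted RUL ranges, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

macro_score = f1_macro(y_true, y_pred)
print(round(macro_score, 4))
```

Because every class contributes equally to the mean, a pipeline cannot score well by predicting only the most common range.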