In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
import re
import os
from ipywidgets import register
import cufflinks as cf
cf.go_offline(connected=True)
cf.set_config_file(colorscale='plotly', world_readable=True)
%matplotlib inline
from IPython.core.display import HTML
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

from starter_kits.visualization import vis_plotly_widgets as vpw
import visualizations as v
import support as sp
from support import LstmApp, get_test_scores

# Fixing the used seeds for reproducibility of the experiments
np.random.seed(1234)  
os.environ['PYTHONHASHSEED'] = '0'  # only affects Python processes started after this point
display(HTML("<style>.container { width:95% !important; }</style>"))

# Starter Kit 1.2.1: Remaining useful life estimation

## Business context

Maintenance is an important part of an asset's lifecycle. In the past, corrective and preventive maintenance were the norm:
- *corrective* maintenance was performed when an asset had failed and involved tasks for identifying the fault and rectifying it so that the asset could resume normal operation
- *preventive* maintenance was performed to avoid asset failure and involved executing maintenance tasks at regular intervals, e.g. when an asset was used for a certain period of time or when it executed a pre-determined number of cycles.

Nowadays, ever more assets are equipped with sensors and ever more data is gathered. This data can be exploited to realize *predictive* maintenance, which promises cost savings over corrective or preventive maintenance because maintenance is only performed when warranted. Predictive maintenance encompasses a variety of topics, such as failure prediction, remaining useful lifetime (RUL) estimation, failure detection, failure diagnosis (root cause analysis) and recommendation of mitigation or maintenance actions after failure.

## Business goal

The business goal addressed by this Starter Kit is the **estimation of the remaining useful lifetime** of an asset. 'Lifetime' in this context is expressed in terms of the evolution of a particular quantity, such as the distance travelled, fuel consumed, repetition cycles performed, number of transactions experienced, etc. Estimating the remaining useful lifetime is non-trivial, as it is influenced by a multitude of internal and external factors, e.g. operating conditions, weather conditions, usage scenarios, etc.


## Application contexts

Estimating the remaining useful lifetime of an asset can be beneficial for several purposes in a variety of industrial contexts, for example:

- to better schedule maintenance operations, e.g. for offshore wind turbines, for which maintenance can only be performed under the right weather conditions.
- to optimise the operational efficiency of an asset, by more optimally planning the use of that asset before its end of life, or adapting its operational use to extend its useful lifetime.
- to avoid unplanned downtime, e.g. in manufacturing settings, where a downtime of the production line results in several operational problems and associated costs.
- to anticipate and avoid safety-critical situations, e.g. for aircraft or vehicles.
- ...


## Data characteristics and requirements

Typically, the data to be exploited for remaining useful lifetime estimation has the following characteristics:
- it consists of time-series data, in particular for the quantity used to express the lifetime of the asset but also for all the factors that influence that quantity (e.g. weather and operating conditions).
- temporal patterns such as trends, cycles, seasonality, day/night or week/weekend patterns can be present in the data.
- random fluctuation (noise) can also be present, as a result of random variation or unexplained causes. In a large number of cases the data is measured using sensors, therefore some noise is inherently present on the measured signal.

Moreover, the following requirements are imposed on the dataset:
- a sufficiently large amount of historical data needs to be available. Two aspects are important in this respect:
     - the implicit assumption is that an asset has a continuous degradation pattern, which is reflected in the asset's sensor measurements. Degradation is often slow, however, so the monitoring period should be sufficiently long so as to include a clear presence of the degradation pattern. 
     - a sufficient number of examples of assets that reached their end of life needs to be present. 
- the data needs to include explicit information on when an asset has reached its end of life, or should allow this to be derived easily (e.g. the lifetime quantity has reached a particular threshold).


## Starter Kit outline

The data-driven approach for remaining useful life estimation will be illustrated on the industrially-relevant use case of **predicting the remaining useful lifetime of aircraft engines based on run-to-failure data**.  We will use this dataset to illustrate:
* How to appropriately preprocess the data and construct training, test and validation sets to learn and evaluate a model able to predict whether an engine will fail within a certain number of operational cycles.
* How to use deep learning to construct a model for remaining useful lifetime prediction. In particular, we apply a deep learning approach that uses long short-term memory ([LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)) networks and allows the construction of a predictive model without the time-consuming feature engineering step, which requires one to manually extract the characteristics from which the algorithm can learn.
* How to experimentally validate the resulting model and objectively compare it with other approaches.

The resulting model achieves a high accuracy. The approach is simple and flexible enough to be applied to a wide range of RUL problems. 

## Data understanding

The time series data we will use in this Starter Kit is simulated run-to-failure data from aircraft engines. Engines in the data are assumed to start with varying degrees of wear and manufacturing variation. This information is unknown to the user. Furthermore, in this simulated data, engines are assumed to be operating normally at the beginning, and start to degrade at some point during operation. The degradation progresses and grows in magnitude. When a predefined threshold is reached, the engine is considered unsafe for further operation. In other words, the last operational cycle of the engine can be considered as the failure point of the corresponding engine.

In machine learning experiments, a dataset is often split into a training set and a test set, which allows one to quickly evaluate the performance of an algorithm on the problem at hand. The training dataset is used to prepare a model, to train it. We pretend the test dataset is new, unseen data where the output values (in this case, whether or not a specific engine is going to fail within a predefined number of cycles) are withheld from the algorithm.

We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set, the so-called *ground truth*. Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.

For this reason, the dataset is divided into three parts, namely the training, test and ground truth datasets:
- The *training data* will be used by the algorithm to learn the prediction model. It consists of multiple time series with "cycle" as the time unit, and 21 sensor readings for each cycle. Each sequence of cycles can be assumed to be generated by a different engine (identified by the engine ID) of the same type. Since the algorithm needs to be able to learn when the engine will fail, **the last time period does represent the failure point**. 
- The *test data* has the same data schema as the training data and is used to test the quality of the resulting model. The only difference is that the data does not indicate when the failure occurs, or put differently, the last time period does not represent the failure point. 
- The *ground truth data* provides the number of remaining working cycles for the engines in the test data, which will only be used to assess the quality of the results, but will not be given as input to the learned model.

More details on this dataset can be found in \[1\].

In [None]:
train_df, test_df, truth_df = sp.load_data()

The table below shows an excerpt of the training data, with the following attributes:
* id: the identifier of the engine
* cycle: the current cycle of the engine from which the data in the respective row is originating, i.e. the current flight of the aircraft
* setting 1-3: the values of 3 different operational settings of the engine, e.g. an operational setting could correspond to a cruise setting when the aircraft is flying at cruise speed
* s1-21: 21 sensor readings monitoring various parameters of the engine, e.g. temperature or vibration

Since the input file is space-separated, and two spurious spaces appear at the end of each line, we remove the last two columns.

In [None]:
train_df.head()
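
As a minimal sketch of the loading step described above: the implementation of `sp.load_data()` is not shown in this notebook, so the snippet below is an assumption about how such a file could be read with pandas. The two-row sample and the shortened column list are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt: space-separated, with two trailing spaces per line.
raw = "1 1 -0.0007 518.67  \n1 2 0.0019 518.67  \n"

sample_df = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# The trailing separators produce columns that contain only NaN; drop them.
sample_df = sample_df.dropna(axis=1, how="all")
sample_df.columns = ["id", "cycle", "setting1", "s1"]
print(sample_df)
```

Dropping columns that are entirely empty, rather than dropping by position, keeps the sketch independent of the exact number of columns in the file.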

The test data looks similar to the training data, with the difference that it does not indicate when the failure occurs, or put differently, the last time period does not represent the failure point. Taking the sample test data shown in the table below as an example, the engine with id=1 runs from cycle 1 through cycle 31. It is not shown how many more cycles this engine can last before it fails.

In [None]:
test_df[25:31]

The ground truth data provides the number of remaining working cycles for the engines in the test dataset. Taking the sample ground truth data shown in the following table as an example, the engine with id 1 in the test data can run for another 112 cycles before it fails.

In [None]:
display(truth_df.set_index(truth_df.index +1).head(3))

## Data preprocessing
Before we can start training a model to predict the RUL, a number of preparatory steps need to be taken.

1. As could be seen from the training data sample above, the data currently does not include a column indicating the RUL, which is the target value we want to predict. As the algorithm needs this information to learn from, the first step will be to create such a column.

2. A second observation that can be made on the training data sample is that the scale of the values differs across columns. This difference could cause problems when the model needs to calculate the similarity between different cycles (i.e., the rows in the table) during modeling. To address this problem, we will normalise the data. The goal of normalisation is to change the values of the numeric columns in the dataset, in order to have a common scale and avoid distorting differences in the ranges of values or losing information. The normalised values maintain the general distribution and ratios in the source data, while keeping values within a scale applied across all numeric columns used in the model.

#### RUL label creation
The first preprocessing step is to generate a label indicating the Remaining Useful Lifetime. In this Starter Kit, we will not predict the exact number of cycles the engine has left before failure, but rather try to answer the question 'Is a specific engine going to fail within *w1* cycles?', where *w1* is a predefined threshold used during the label creation process.

##### Training data
In order to determine the remaining number of cycles at each time point, the maximum number of cycles per engine is first determined. Subsequently, the current cycle number is subtracted from the maximum number of cycles to obtain the number of cycles remaining at a particular time point. In the following sample, the column `RUL` indicates the remaining number of cycles for that particular point in time (i.e. the row).

In [None]:
rul = pd.DataFrame(train_df.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
train_df = train_df.merge(rul, on=['id'], how='left')
train_df['RUL'] = train_df['max'] - train_df['cycle']
train_df.drop('max', axis=1, inplace=True)
Xtrain = train_df.copy()
train_df[['id','cycle','RUL']].head()

In order to arrive at a binary label indicating whether an engine is going to fail within *w1* cycles, the next step is to test whether the number of remaining cycles in the `RUL` column is smaller than or equal to the predefined threshold *w1*, currently set to 30 cycles. This is illustrated in the sample below, where for the first row, the value in the column `RUL_label` is 1, because the engine will fail within 4 cycles.

In [None]:
w1 = 30
train_df['RUL_label'] = (train_df['RUL'] <= w1).astype(int)
train_df[['id','cycle','RUL','RUL_label']].tail()
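
To make the two labelling steps concrete, here is a small sketch on made-up data. The toy DataFrame and the threshold `w_toy` are illustrative assumptions; `groupby(...).transform('max')` is used as a compact alternative to the merge shown above.

```python
import pandas as pd

# Toy run-to-failure data: engine 1 fails after 5 cycles, engine 2 after 3.
toy = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 2, 2, 2],
    "cycle": [1, 2, 3, 4, 5, 1, 2, 3],
})

# RUL = last observed cycle of the engine minus the current cycle.
toy["RUL"] = toy.groupby("id")["cycle"].transform("max") - toy["cycle"]

# Binary label: 1 if the engine fails within w_toy cycles, 0 otherwise.
w_toy = 2
toy["RUL_label"] = (toy["RUL"] <= w_toy).astype(int)
print(toy)
```

In this toy example, every row of engine 2 is labelled 1, because its whole lifetime lies within the threshold.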

##### Test data
A similar approach is adopted for the test set. Note that, in order to generate the labels for the test data, we need to use the ground truth dataset, because the test data does not indicate when the failure occurs, or put differently, the last time period does not represent the failure point.

In [None]:
# Generate column max for test data
rul = pd.DataFrame(test_df.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
truth_df.columns = ['more']
truth_df['id'] = truth_df.index + 1
truth_df['max'] = rul['max'] + truth_df['more']
truth_df.drop('more', axis=1, inplace=True)

# Generate RUL for test data
test_df = test_df.merge(truth_df, on=['id'], how='left')
test_df['RUL'] = test_df['max'] - test_df['cycle']
test_df.drop('max', axis=1, inplace=True)

# Generate label columns RUL and RUL_label for test data
test_df['RUL_label'] = np.where(test_df['RUL'] <= w1, 1, 0 )
test_df[['id','cycle','RUL','RUL_label']];
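
The same calculation can be sketched on made-up test data. The toy frames below are assumptions for illustration only; the point is that the failure cycle of a test engine is its last observed cycle plus the remaining cycles given by the ground truth.

```python
import pandas as pd

# Toy test data: both engines are observed for only part of their life.
toy_test = pd.DataFrame({"id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})
# Toy ground truth: remaining cycles after the last observed cycle, one row per engine.
toy_truth = pd.DataFrame({"id": [1, 2], "more": [4, 1]})

# Failure cycle = last observed cycle + remaining cycles from the ground truth.
last_cycle = toy_test.groupby("id")["cycle"].max().rename("last").reset_index()
toy_truth = toy_truth.merge(last_cycle, on="id")
toy_truth["max"] = toy_truth["last"] + toy_truth["more"]

# RUL at each row = failure cycle - current cycle.
toy_test = toy_test.merge(toy_truth[["id", "max"]], on="id", how="left")
toy_test["RUL"] = toy_test["max"] - toy_test["cycle"]
print(toy_test[["id", "cycle", "RUL"]])
```

As a sanity check, the RUL in the last observed row of each engine equals the value given in the ground truth.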

#### Normalisation using Min-Max scaling
As explained before, the goal of *normalisation* is to change the values of numeric columns in the dataset to a common scale. In this notebook, to put it simply, normalisation is performed to avoid variables on a higher scale affecting the outcome in bigger measure only because they are on a higher scale.

One of the most popular normalisation techniques is *Min-Max normalisation*, which we also use here. It scales the values of a variable to a range between 0 and 1 using the following formula, where $X$ represents the value to be normalised, $X_{min}$ is the minimum value of the variable in that column and $X_{max}$ is the maximum value for that variable:

$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$

Note: an alternative approach to Min-Max scaling is Z-score normalisation (or *standardisation*) in which the variables will be rescaled so that they have the properties of a standard normal distribution with zero mean and unit variance.
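
As a small worked example of the formula, the sketch below applies Min-Max normalisation by hand to made-up values and compares the result with scikit-learn's `MinMaxScaler`. The arrays are illustrative assumptions. The example also shows that a scaler fitted on training data can map test values outside the range between 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 600.0]])

# Manual Min-Max normalisation, column by column.
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
manual = (X_train - x_min) / (x_max - x_min)

# MinMaxScaler learns the same minima and maxima from the training data...
scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(X_train)
# ...and reuses them for the test data, so test values may fall outside [0, 1].
scaled_test = scaler.transform(X_test)

print(scaled_train)
print(scaled_test)
```

Fitting the scaler on the training data only, and merely applying it to the test data, avoids leaking information from the test set into the preprocessing.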

For both the training set and the test set, we normalise all columns except `id`, `cycle`, and `RUL`, as we want to keep the original values for these columns. The following sample shows a portion of the normalised training data.

In [None]:
train_df['cycle_norm'] = train_df['cycle']
cols_normalize = train_df.columns.difference(['id','cycle','RUL'])
min_max_scaler = preprocessing.MinMaxScaler()
norm_train_df = pd.DataFrame(min_max_scaler.fit_transform(train_df[cols_normalize]), 
                             columns=cols_normalize, 
                             index=train_df.index)
join_df = train_df[train_df.columns.difference(cols_normalize)].join(norm_train_df)
train_df = join_df.reindex(columns = train_df.columns)
train_df.describe().loc[['mean','min','max']]

As can be seen, both the settings variables and the sensor variables now only contain values between 0 and 1.

In [None]:
test_df['cycle_norm'] = test_df['cycle']
norm_test_df = pd.DataFrame(min_max_scaler.transform(test_df[cols_normalize]), 
                            columns=cols_normalize, 
                            index=test_df.index)
test_join_df = test_df[test_df.columns.difference(cols_normalize)].join(norm_test_df)
test_df = test_join_df.reindex(columns = test_df.columns)
test_df = test_df.reset_index(drop=True);

## Data Modeling and Analysis

Traditional machine learning models are usually based on manually extracting distinguishing factors that characterise the phenomenon to be learned using domain expertise (the so-called *feature engineering* step). This usually makes these models hard to reuse since feature engineering is specific to the problem scenario and the available data, and therefore a labour-intensive process. In contrast, deep learning algorithms can automatically extract the right features from the data, eliminating the need for manual feature engineering. Deep learning refers to machine learning approaches involving artificial neural networks (ANNs) that contain a multitude of so-called hidden layers.

One particular type of ANN is the Long Short-Term Memory network, or LSTM. One critical advantage of LSTMs is their ability to remember information from long-term sequences, which is hard to achieve with traditional feature engineering. When manually engineering features for this problem, one would for example compute the rolling average per window of 50 cycles. This averaging may however lead to a loss of information due to the smoothing and abstracting of values over such a long period. Instead, using all 50 values as input may provide better results. While performing feature engineering over large window sizes may complicate things for traditional machine learning methods, LSTMs are naturally able to handle larger window sizes and use all the information in the window as input.
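
The loss of information caused by averaging can be illustrated with a small sketch. The two signals below are made up for illustration: they have the same rolling average over a window of 50 cycles, although the excursion occurs at opposite ends of the window.

```python
import numpy as np
import pandas as pd

window = 50

# Toy sensor signal: flat, with a short excursion at the end of the window.
late = pd.Series(np.zeros(window))
late.iloc[45:] = 1.0

# Second toy signal: the same excursion, but at the start of the window.
early = pd.Series(np.zeros(window))
early.iloc[:5] = 1.0

# Feature engineering: a single rolling mean summarises the whole window.
late_mean = late.rolling(window).mean().iloc[-1]
early_mean = early.rolling(window).mean().iloc[-1]

print(late_mean, early_mean)                         # same summary value
print(np.array_equal(late.values, early.values))     # the raw windows differ
```

A model that only sees the rolling average cannot distinguish these two situations, whereas a model that receives all 50 values can.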

#### Algorithm-specific data preparation
As a first step in the modeling phase, we will prepare the data to serve as input for the LSTM network.

When using LSTMs in the time-series domain, one important parameter to pick is the sequence length, which is the size of the window within which the LSTM needs to remember information. This is similar to picking a window size when calculating time series features using a rolling window. The idea of using LSTMs is to let the model extract abstract features from the sequence of sensor values in the window rather than engineering those manually. The expectation is that if there is a pattern in these sensor values within the window prior to failure, the pattern should be extracted and learned by the LSTM.

The window size chosen for the training data set strongly influences the prediction results. In the animation below, you can see how differently the single variables for a given engine behave. Furthermore, the time at which the degradation becomes visible is different for different engines. 

When using the initial settings in the animation below, i.e. `Variable` equals `s11`, `ID` equals 1 and `period` equals 50, we see a comparatively short line fluctuating around a constant value. When selecting the IDs 3 and 13 instead and increasing the `period` to 130, one sees a clear deviation from the median value indicated by the horizontal gray dashed line. For both engines, the deviation starts at around index 110. When additionally selecting the engine with `ID` 36, the deviation already starts at approximately index 80, which is significantly earlier. Looking at `Variable` `s12`, the deviation from the median starts at the same times as for `s11` for the engines under inspection.

When selecting the training period for the model, we need to take the different starting points of the deviation into account. We need to find a good balance between two opposing influences: on the one hand, the longer the training data sequence, the better the results; on the other hand, when selecting too long a training period, the model may learn that a slight deviation is acceptable in several engines and consequently fail to predict the remaining useful lifetime. 

The period length will be used in the remainder of the notebook to calculate the model sequences.

In [None]:
date_range_slider = v.interactive_plot(test_df)

Based on the figures above, it is very hard to visually spot any trends that are influential for the remaining useful life, let alone to manually extract features from them. As indicated above, the advantage of deep learning algorithms is that they can automatically extract the right features from the data, eliminating the need for manual feature engineering.

In the following, we pick the feature columns (`sensor_cols`) to use for modelling as well as the columns identifying the single sequences (`sequence_cols`).

In [None]:
# Sensor columns are named s1 ... s21; fullmatch avoids accidental partial matches
sensor_cols = [col for col in test_df.columns if re.fullmatch(r's\d+', col)]
sequence_cols = ['setting1', 'setting2', 'setting3', 'cycle_norm']
sequence_cols.extend(sensor_cols)

[Keras](https://keras.io/) is a deep learning library written in Python, which naturally interfaces with [TensorFlow](https://www.tensorflow.org), an open-source software library for machine learning. Its [LSTM](https://keras.io/layers/recurrent/) layer expects an input in the shape of a 3-dimensional matrix of (samples, time steps, features) (more specifically a 3-dimensional [NumPy](https://www.numpy.org) array) where 'samples' is the number of training sequences, 'time steps' is the look back window or sequence length and 'features' is the number of features of each sequence at each time step. During this transformation, we also select the columns from the input data that can serve as features. In this case, we selected all sensor values, as well as the 3 settings parameters and the (normalised) current cycle the machine is in at this point.

In [None]:
def gen_sequence(id_df, seq_length, seq_cols):
    """
    Function to reshape the data into a 3-dimensional (samples, time steps, features) matrix 
    """
    data_array = id_df[seq_cols].values
    num_elements = data_array.shape[0]
    for start, stop in zip(range(0, num_elements-seq_length), range(seq_length, num_elements)):
        yield data_array[start:stop, :]
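
To see what this reshaping produces, the sketch below applies the same windowing logic to a made-up engine with 6 cycles and 2 features. The helper `sliding_windows` is a hypothetical standalone equivalent of `gen_sequence` that works directly on a NumPy array.

```python
import numpy as np

def sliding_windows(data, seq_length):
    """Yield each window of seq_length consecutive rows, except the one ending at the last row."""
    for start in range(data.shape[0] - seq_length):
        yield data[start:start + seq_length, :]

# Toy engine: 6 cycles (rows), 2 features (columns).
toy_engine = np.arange(12, dtype=np.float32).reshape(6, 2)

windows = np.stack(list(sliding_windows(toy_engine, seq_length=4)))
print(windows.shape)  # (samples, time steps, features)
```

An engine with 6 cycles and a sequence length of 4 yields 2 windows, covering rows 0 to 3 and rows 1 to 4. The last row is never part of a window, and an engine with no more cycles than the sequence length yields no windows at all.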

For each `id` in the data set, a data sequence will be created. As initial sequence length, we will use 50 time steps.

In [None]:
register(date_range_slider)
sequence_length = 50
seq_gen = (list(gen_sequence(train_df[train_df['id'] == id_], sequence_length, sequence_cols))
           for id_ in train_df['id'].unique())

seq_array = np.concatenate(list(seq_gen)).astype(np.float32)
printmd(f'''This results in a 3-dimensional input matrix with {seq_array.shape[0]} samples, 
{seq_array.shape[1]} time steps and {seq_array.shape[2]} features (including 21 sensor values, 
the 3 settings parameters and the normalised current cycle).''') 

For the training data, the labels are also added to this data structure, so that the LSTM can learn from it.

In [None]:
def gen_labels(id_df, seq_length, label):
    data_array = id_df[label].values
    num_elements = data_array.shape[0]
    return data_array[seq_length:num_elements, :]
def handler(x):
    pass

sequence_length = date_range_slider.get_interact_value()
label_gen = [gen_labels(train_df[train_df['id'] == id_], sequence_length, ['RUL_label'])
                     for id_ in train_df['id'].unique()]
label_array = np.concatenate(label_gen).astype(np.float32)

v.plot_label_frequency(label_array)

The majority of data points, roughly 80 %, are labeled as zero, indicating that no failure was observed. 
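
The sketch below illustrates, on a made-up engine with 7 cycles, how the labels line up with the input windows and how the share of positive labels can be computed. The toy threshold of 2 cycles and the sequence length of 3 are assumptions chosen to keep the example small.

```python
import numpy as np

seq_length = 3

# Toy engine with 7 cycles; the RUL counts down to 0 at the last cycle.
rul = np.arange(6, -1, -1)                 # 6, 5, 4, 3, 2, 1, 0
labels = (rul <= 2).astype(np.float32)     # toy threshold of 2 cycles

# Window k covers rows k .. k + seq_length - 1 ...
window_rows = [list(range(k, k + seq_length)) for k in range(len(rul) - seq_length)]
# ... and its label is taken from the row that follows the window.
window_labels = labels[seq_length:]

positive_fraction = window_labels.mean()
print(window_rows)
print(window_labels, positive_fraction)
```

Because the first `seq_length` rows of every engine never serve as labels, the class balance of the sequence labels can differ from that of the `RUL_label` column.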

## LSTM network

*Note: we refer the interested reader to Appendix A for a brief introduction to LSTM networks.*

In this section, we guide you through the general process of building, training, evaluating and testing an LSTM model. This will be done by means of an interactive interface. In order to see the code used for these steps, please expand the fields below.

### Exercise 1

#### Network construction

Let us start by building an LSTM network. In our setup, we will experiment with a stacked structure. In order to learn from different architectures of the LSTM model, it is possible to change the model variables in the animation below. 

We ask the reader to test different model scenarios by applying the following steps:

1. Select the number of intermediate layers. For the first experiment, please select a single intermediate layer only. 
2. Set the size of the intermediate layer to 30. This corresponds to a rather small network.
3. Set the number of epochs for training to 15. Epochs define the number of times the model iterates over the training data arrays, the aim being for the model to improve during each training epoch. In general, the more epochs the better the results, until we reach the given model's limit. 
4. Similarly as before, the training sequence length defines the number of data points to take into account for training per engine. Please select a sequence length of 50.
5. Furthermore, we add a dropout layer after each LSTM layer. Dropout consists of randomly setting a fraction of the input units (the dropout rate) to 0 at each update during training, which helps prevent overfitting. *Overfitting* means that a model fits a limited set of data points too closely. This typically results in an overly complex model that fits every small detail of the data under study. Please choose a value of 0.2 for this first experiment. 
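
The effect of dropout can be sketched in plain NumPy. This is a simplified illustration rather than the Keras implementation: a random mask zeroes roughly a fraction `rate` of the units, and the remaining units are scaled by `1 / (1 - rate)`, as documented for the Keras `Dropout` layer, so that the expected sum is unchanged. The array of ones is a made-up input.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.2
activations = np.ones(10_000)

# Keep each unit with probability 1 - rate, zero it otherwise.
keep_mask = rng.random(activations.shape) >= rate
# Scale the kept units so that the expected sum stays unchanged.
dropped = activations * keep_mask / (1.0 - rate)

print(f"fraction zeroed: {1 - keep_mask.mean():.3f}")
print(f"mean after dropout: {dropped.mean():.3f}")
```

Dropout is only active during training; when the model makes predictions, all units are used.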

#### Model training 

Now that we have fixed the network design, we can fit the network to the training data by pushing the button `Train model`. After a successful training, a green badge will appear. 
The fitting step is influenced by the following parameters, which we will keep fixed during our experiments:

- Batch size: the number of samples used per gradient update, i.e. per adjustment of the model parameters during training. We will use a batch size of 200.
- Validation split: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. We will use a validation split of 5%.

Furthermore, the fitting function will stop early if the validation loss (which is the quality criterion that is used to evaluate the model performance) is not decreasing anymore. To parametrise the early stopping function, the following parameters can be handled by TensorFlow:
- Minimal delta: minimum change in the monitored quantity as given by the validation loss to qualify as an improvement, i.e. an absolute change of less than min_delta will count as no improvement. 
- Patience: number of epochs with no improvement after which training will be stopped.

In the experiments defined in this Starter Kit, we keep both minimal delta and patience at zero.
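
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the two parameters described above, not the Keras implementation, and the function name `stopping_epoch` as well as the loss values are made up for this example.

```python
def stopping_epoch(val_losses, min_delta=0.0, patience=0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            # The validation loss improved by more than min_delta.
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated.
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [0.50, 0.40, 0.41, 0.30]
print(stopping_epoch(losses))               # patience=0
print(stopping_epoch(losses, patience=2))   # tolerate one bad epoch
```

With both parameters at zero, training in this sketch stops at the first epoch in which the validation loss does not decrease, even though a later epoch would have improved on it. A larger patience lets the training continue past such a temporary setback.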

#### Model evaluation and testing

Now that we have a fitted model, we can evaluate its performance. We will first evaluate the model performance on the training data and subsequently on the test data. If both evaluations result in approximately the same score, this gives an indication that the model is generalisable to unseen datasets and thus is not overfitting on the training data.

In the interface, please switch to the tab `Evaluation` at the top and press the button `Evaluate model`. This will lead to the evaluation of the last trained model. 
The model can be evaluated using different measures. On the one hand, we will evaluate the model in terms of *accuracy*, which measures the fraction of all instances that are correctly categorized; it is the ratio of the number of correct classifications to the total number of instances to classify. On the other hand, we will show the so-called *confusion matrix*. The confusion matrix for the model that we just trained shows that the model is able to correctly classify that a specific engine is not going to fail within w1 cycles for almost all of the 12531 samples. Vice versa, the model is able to correctly classify for almost all of the 3100 cases that a specific engine is going to fail within w1 cycles. \
Based on the confusion matrix, several other measures such as *precision* and *recall* can be calculated. Intuitively, precision is the ability of the classifier not to label a sample that is negative as positive. When instantiated for the RUL prediction problem we are currently considering, precision indicates how many of the engines that the model predicted to reach their end of life within 30 cycles also actually reach their end of life within this amount of cycles. Recall is the ability of the classifier to find all the positive samples, i.e., all engines that reach their end of life within 30 cycles. The respective results are given in the model evaluation summary table.
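
As a worked example, the sketch below derives accuracy, precision and recall from the confusion matrix of ten made-up predictions. The label arrays are illustrative assumptions and do not come from the trained model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = fails within the threshold).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted failures that are real failures
recall = tp / (tp + fn)      # share of real failures that were found

print(accuracy, precision, recall)
```

Here 8 of the 10 predictions are correct, 3 of the 4 predicted failures are real, and 3 of the 4 real failures are found.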

A unique id will be assigned to each evaluated model in order to compare the results for different models. Note that the evaluation is performed on the training data. In order to evaluate it on the test data (that is unknown to the model), please continue to the tab `Testing`. By pushing the button `Evaluate model`, the model selected in the dropdown menu is evaluated on the test data.

#### Questions:

- Is there a significant difference in accuracy when evaluating the model on the validation and test data?
- Do you think the model is overfitting?

### Exercise 2:

The LSTM network built in Exercise 1 performs well on the evaluation data set, with an accuracy of about 95%, but achieves a lower accuracy on the test data. This gives us a hint that the model might be overfitting on the training data. Therefore, in this second experiment, we add a second LSTM layer to the network. Please increase the number of layers to 2 and set the following layer sizes:
- Intermediate Layer 1: 200
- Intermediate Layer 2: 50

Keep the remaining parameters as they are. Subsequently repeat the steps taken in the previous exercise:
1. Train the model
2. Evaluate it on the validation data
3. Evaluate it on the test data


#### Questions:
- What difference do you observe? Compare the results in the tables in both tabs, evaluation and testing.
- When comparing the two models, which one would you use and why?

### Exercise 3:

Use the interface to experiment with alternative architectures and see how the parameters influence the quality of the model. You can try to answer questions such as:
- Will the model improve when the number of epochs is increased?
- Is it possible to reach an accuracy of 98 % on both data sets (validation and test)?
- Does adding a third LSTM layer improve the results even further?

### LSTM interface
Note: if the current port is already in use, change it to any port between 8050 and 8060

In [None]:
PORT = 8050
v.start_dashboard(Xtrain, sequence_cols, test_df, PORT)

## Code

In what follows, you can find the most prominent functions that are called in the interactive interface to construct and train the LSTM model, and subsequently evaluate it. 

##### Model building and compilation

```python
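from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

def compile_model(sizes, num_layers=2):
    # Assumption: `sizes` is padded with a trailing None (e.g. [200, 50, None]),
    # and seq_len, nb_features, dropout and nb_out are defined in the enclosing scope.
    model = Sequential(name='Sequential_Layer')

    for i in range(num_layers):
        if sizes[i]:
            # Return the full sequence only if another LSTM layer follows
            return_sequences = sizes[i + 1] is not None
            if i == 0:
                model.add(LSTM(
                    input_shape=(seq_len, nb_features),
                    units=sizes[i],
                    return_sequences=return_sequences))
            else:
                model.add(LSTM(
                    units=sizes[i],
                    return_sequences=return_sequences))
            model.add(Dropout(dropout))

    # Sigmoid output for the binary RUL label
    model.add(Dense(units=nb_out, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model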
```

##### Model training 
```python
model.fit(seq_array, label_array, epochs=epochs, batch_size=200,
          validation_split=0.05, verbose=0,
          callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto')])

```

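##### Make predictions and compute confusion matrix
```python
from sklearn.metrics import confusion_matrix

# predict_classes is no longer available in recent TensorFlow/Keras releases;
# thresholding the sigmoid output at 0.5 yields the same class labels
y_pred = (model.predict(seq_array, verbose=1, batch_size=200) > 0.5).astype('int32')
y_true = label_array
cm = confusion_matrix(y_true, y_pred)
```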
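
The confusion matrix is the starting point for the usual classification metrics (accuracy, precision, recall and F1-score). The sketch below shows how they can be derived from it. It assumes the scikit-learn layout, in which rows are true classes and columns are predicted classes; the helper `binary_metrics` and the example matrix are hypothetical and are not part of the interface code or of the results in this notebook.

```python
import numpy as np

def binary_metrics(cm):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    Assumes rows are true classes, columns are predicted classes,
    and class 1 is the positive class (scikit-learn layout).
    """
    cm = np.asarray(cm, dtype=float)
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up confusion matrix for illustration only (not a result of this notebook)
example_cm = [[50, 10],
              [5, 35]]
scores = binary_metrics(example_cm)
print(scores)
```

In a predictive maintenance setting, recall is often the metric to watch most closely, since a missed failure (false negative) is typically more costly than an unnecessary inspection.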



#### Comparison of performance against other methods

To get an idea of the quality of the LSTM model, we compare it with the best performing model from [[1]](#PMT). In that experiment, four binary classification models are compared: a Two-Class Logistic Regression model, a Two-Class Boosted Decision Tree, a Two-Class Decision Forest, and a Two-Class Neural Network. Note that these models required a manual feature engineering step, whereas the LSTM model learns the relevant features automatically from the input sequences.

The table below compares the best performing model, namely a Two-Class Neural Network, with the LSTM approach, in terms of accuracy, precision, recall, and F1-score.

In [None]:
results_df = get_test_scores()
results_df

## Conclusion

In this Starter Kit, we covered the basics of using deep learning for remaining useful life prediction, which is one of the central topics in predictive maintenance. We illustrated the approach on the industrially-relevant use case of predicting the remaining useful lifetime of aircraft engines based on run-to-failure data. In doing so, we have shown:

* How to appropriately preprocess the data and construct training, validation and test sets to learn and evaluate a model that predicts whether an engine will fail within a certain number of operational cycles.
* How to use deep learning to construct a model for remaining useful life prediction. In particular, we applied a deep learning approach called long short-term memory (LSTM) networks, which allows the construction of a predictive model without the time-consuming feature engineering step in which one manually extracts the characteristics from which the algorithm can learn.
* How to experimentally validate the resulting model and objectively compare it with other approaches.

Comparing the results above on the test set, we see that the LSTM results are comparable to those of the best performing model in [[1]](#PMT), with the advantage of not requiring a manual feature engineering step. Note, however, that the data set used here is very small, and deep learning models are known to perform better with large data sets, so a fairer comparison would require larger data sets. The approach is simple and flexible enough to be applied to a wide range of RUL problems.

#### Possible next steps

An important attention point when working with deep learning techniques is tuning the model's parameters, such as the window size. While the presented approach already gives promising results, several improvements are still possible:

- Trying different window sizes.
- Trying different architectures with a different number of layers and nodes.
- Trying to tune the hyperparameters of the network.
- Trying to cast the RUL prediction problem as a regression or multi-class classification task instead of as a binary classification task.
- Trying on a larger dataset with more instances.
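
As an illustration of recasting the task, the sketch below derives binary, multi-class and regression targets from the same remaining-useful-life values. The function name `make_labels` and the thresholds `w1` and `w0` are hypothetical choices made for this example and may differ from the values used elsewhere in this Starter Kit.

```python
import numpy as np

def make_labels(rul, w1=30, w0=15):
    """Build three alternative targets from remaining-useful-life values.

    w1 and w0 (with w0 < w1) are assumed thresholds in operational cycles.
    """
    rul = np.asarray(rul)
    # Binary: 1 if the asset fails within w1 cycles, else 0
    binary = (rul <= w1).astype(int)
    # Multi-class: 0 = healthy, 1 = fails within w1 cycles, 2 = fails within w0 cycles
    multiclass = binary + (rul <= w0).astype(int)
    # Regression: predict the remaining number of cycles directly
    regression = rul.astype(float)
    return binary, multiclass, regression

rul = np.array([100, 31, 30, 16, 15, 0])
binary, multiclass, regression = make_labels(rul)
print(binary)
print(multiclass)
```

For the multi-class variant the output layer would need one unit per class with a softmax activation, and for the regression variant a single linear unit with a regression loss such as mean squared error.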

## Appendix A: A brief introduction to LSTM networks
In the following, we briefly explain the intuition behind an LSTM network. For a more detailed explanation, we refer the interested reader to the original publication [[2]](#Hochreiter1997) or to more intuitive descriptions (e.g. [[3]](#Olah), [[4]](#Yan)).

A single-layer LSTM consists of several *blocks*. Each block takes three inputs: 1) the input of the current time step, 2) the output from the previous LSTM block (of the previous time step) and 3) the "memory" of the previous block. Each block also provides output for the current time step, and produces an updated memory of the current block. Therefore, a single block makes a decision by considering the current input, previous output and previous memory, generates a new output and alters its memory. This information flow is graphically represented in the following figure (taken from [[4]](#Yan)):

![LSTM with multiple blocks (taken from [4])](images/LSTM_blocks.png "LSTM with multiple blocks (taken from [4])")

The way the internal memory changes is controlled by different *gates* that regulate the information flow. Each gate uses a logistic function to compute a value between 0 and 1. The data is multiplied by this value to partially allow or deny information to flow into or out of the memory. An *input gate* controls the extent to which a new value flows into the memory. A *forget gate* controls the extent to which a value remains in memory. An *output gate* controls the extent to which the value in memory is used to compute the output activation of the block.

An LSTM thus keeps two pieces of information as it propagates through time:
- *A cell state*, which is the memory the LSTM accumulates through time using its forget and input gates,
- *A hidden state*, which is the output of the previous time step.

An important parameter when constructing LSTM networks is the size of the LSTM's hidden state. In the Keras `LSTM` layer used in the code above, this parameter is the `units` argument; TensorFlow's lower-level LSTM cell API calls it `num_units`. To make the name more intuitive, you can think of it as the number of hidden units in the LSTM block, or the number of memory units in the block.
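
To make the roles of the gates and of the size parameter concrete, here is a minimal NumPy sketch of a single LSTM block processing one time step, following the standard LSTM equations. It is an illustration only and not the Keras implementation: the function `lstm_step`, the stacked weight layout and the random weights are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM block.

    x_t:    current input, shape (n_features,)
    h_prev: previous output, shape (units,)
    c_prev: previous memory, shape (units,)
    W, U, b: weights for the four internal transformations, stacked in the
             (assumed) order input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information enters
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old memory is kept
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the memory
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much memory is exposed
    c_t = f * c_prev + i * g              # updated memory
    h_t = o * np.tanh(c_t)                # output of the block
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, units, seq_len = 3, 4, 5      # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * units, n_features))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)                       # initial output
c = np.zeros(units)                       # initial memory
sequence = rng.normal(size=(seq_len, n_features))
for x_t in sequence:                      # unroll the block over the time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

Both the output and the memory are vectors of length `units`, which is why this single parameter determines the capacity of the layer.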

The explanation above describes the structure of a single LSTM layer. However, an LSTM network usually consists of multiple *layers*, which can be stacked or merged. When the layers are stacked, the output of the previous layer serves as input for the next layer.

![Multi-layer LSTM](images/multilayerLSTM.png "Multi-layer LSTM")

## References

1. <span id="PMT">[Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3)</span>
2. <span id="Hochreiter1997">Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.</span>
3. <span id="Olah">Olah, Christopher. ["Understanding LSTM Networks"](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)</span>
4. <span id="Yan">Yan, Shi. ["Understanding LSTM and its diagrams"](https://medium.com/@shiyan/understanding-lstm-and-its-diagrams-37e2f46f1714)</span>
5. <span id="MathWorks">MathWorks, ["Models for Predicting Remaining Useful Life"](https://nl.mathworks.com/help/predmaint/ug/models-for-predicting-remaining-useful-life.html)</span>

## Additional information

This notebook is based on the ['Deep Learning for Predictive Maintenance' notebook](https://github.com/Azure/lstms_for_predictive_maintenance) by Fidan Boylu Uz.

*MIT License*

*Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:*

*The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.*