> **Copyright &copy; 2020 CertifAI Sdn. Bhd.**<br>
 **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). \
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0**

 # 05 - Predictive Maintenance Binary Classification
 Predictive maintenance techniques help determine the condition of in-service equipment in order to
 estimate when maintenance should be performed. Predictive maintenance can be modeled in several ways:
 1. Predict the Remaining Useful Life (RUL), or Time to Failure (TTF)
 2. Predict whether the asset will fail within a given time frame
 3. Predict the critical level of the asset within a given time frame

 In this example we look at the second modeling strategy, which is to predict whether the asset is going to fail. The target variable is `label1`.
 This label takes the values 0 and 1: 0 means the asset is working fine and 1 means it requires maintenance.
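
As an illustration of the second strategy, a binary failure label can be derived from RUL by thresholding. The sketch below is an assumption, not the documented procedure for this dataset: the time frame `w` of 30 cycles and the toy `RUL` values are hypothetical.

```python
import pandas as pd

# Toy data: remaining useful life over the last cycles of one hypothetical machine
toy = pd.DataFrame({"RUL": [45, 31, 30, 29, 1, 0]})

w = 30  # hypothetical time frame in cycles; the real threshold is not stated in this notebook

# The label is 1 when the machine is expected to fail within the next w cycles
toy["label"] = (toy["RUL"] <= w).astype(int)

print(toy["label"].tolist())
```

A smaller `w` gives fewer positive rows and therefore a more imbalanced target.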

## Notebook Description
This tutorial shows how to frame predictive maintenance as a binary classification problem and solve it with a recurrent neural network (RNN) built in PyTorch. An exercise section is attached for you to practice and hone your skills. Do make good use of it.

By the end of this tutorial, you will be able to:

1. Prepare a multivariate sensor dataset as fixed-length sequences that can be fed into a recurrent model
2. Build and train an RNN classifier with the PyTorch API
3. Evaluate the classifier and recognize how class imbalance affects accuracy

## Notebook Outline
Below is the outline for this tutorial:
1. Importing Data
2. Initial Data Exploration
3. Data Pre-processing
4. Train-Test Split
5. Model Configuration
6. Model Building
7. Model Evaluation
8. Exercise
9. [Reference](#reference)

In [1]:
from collections import Counter
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import sys
from torch.utils.data import Dataset, DataLoader, TensorDataset, IterableDataset
import torch
from torch import nn, optim
import torch.nn.functional as F

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

### Importing Data

In [2]:
# reading dataset
df_train = pd.read_csv("../../datasets/predictive_maintenance/train.csv")
df_test = pd.read_csv("../../datasets/predictive_maintenance/test.csv")

### Initial Data Exploration

In [3]:
df_train.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,cycle_norm,RUL,label1,label2
0,1,1,0.45977,0.166667,0,0,0.183735,0.406802,0.309757,0,1,0.726248,0.242424,0.109755,0,0.369048,0.633262,0.205882,0.199608,0.363986,0,0.333333,0,0,0.713178,0.724662,0.0,191,0,0
1,1,2,0.609195,0.25,0,0,0.283133,0.453019,0.352633,0,1,0.628019,0.212121,0.100242,0,0.380952,0.765458,0.279412,0.162813,0.411312,0,0.333333,0,0,0.666667,0.731014,0.00277,190,0,0
2,1,3,0.252874,0.75,0,0,0.343373,0.369523,0.370527,0,1,0.710145,0.272727,0.140043,0,0.25,0.795309,0.220588,0.171793,0.357445,0,0.166667,0,0,0.627907,0.621375,0.00554,189,0,0
3,1,4,0.54023,0.5,0,0,0.343373,0.256159,0.331195,0,1,0.740741,0.318182,0.124518,0,0.166667,0.889126,0.294118,0.174889,0.166603,0,0.333333,0,0,0.573643,0.662386,0.00831,188,0,0
4,1,5,0.390805,0.333333,0,0,0.349398,0.257467,0.404625,0,1,0.668277,0.242424,0.14996,0,0.255952,0.746269,0.235294,0.174734,0.402078,0,0.416667,0,0,0.589147,0.704502,0.01108,187,0,0


The class distribution of the target variable determines which evaluation metrics are suitable: accuracy is only informative when the classes are roughly balanced. So, we will check the distribution of the target variable first.

In [4]:
print('Classes in train dataset:', Counter(df_train["label1"])) # show the class distribution, and here we observe
                                                                # high imbalance with ratio of around 5.7 : 1
df_train.isna().sum() # shows that there are no missing values

Classes in train dataset: Counter({0: 17531, 1: 3100})


id            0
cycle         0
setting1      0
setting2      0
setting3      0
s1            0
s2            0
s3            0
s4            0
s5            0
s6            0
s7            0
s8            0
s9            0
s10           0
s11           0
s12           0
s13           0
s14           0
s15           0
s16           0
s17           0
s18           0
s19           0
s20           0
s21           0
cycle_norm    0
RUL           0
label1        0
label2        0
dtype: int64
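
To make the imbalance concrete, the counts printed above can be turned into a ratio and into the accuracy of a trivial classifier that always predicts the majority class. This is a small sketch using only the two counts shown in the output; the variable names are illustrative.

```python
from collections import Counter

# Class counts as printed above for the training set
counts = Counter({0: 17531, 1: 3100})

total = sum(counts.values())
majority_class, majority_count = counts.most_common(1)[0]

imbalance_ratio = counts[0] / counts[1]     # negatives per positive
baseline_accuracy = majority_count / total  # accuracy of always predicting the majority class

print(f"Imbalance ratio: {imbalance_ratio:.2f} : 1")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

On the training set, any model has to beat roughly 85% accuracy before accuracy says anything about its ability to detect failures.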

Not all features are useful for model building. We will first remove some unnecessary features before we proceed with model building.
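
One common reason a feature is unnecessary is that it never changes. The sketch below shows how constant columns can be found on a hypothetical toy frame (not the real dataset); whether this is the criterion behind the drop list used in this notebook is an assumption.

```python
import pandas as pd

# Hypothetical toy frame: "s_const" never changes, so it carries no information
toy = pd.DataFrame({
    "s_const": [0.0, 0.0, 0.0, 0.0],
    "s_vary": [0.18, 0.28, 0.34, 0.35],
})

# A column with a single unique value is constant
constant_columns = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_columns)

toy_reduced = toy.drop(columns=constant_columns)
```

Columns that hold other targets, such as `RUL` and `label2`, also have to be removed, otherwise they leak the answer into the features.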

### Data Pre-processing

In [5]:
# let's first remove the unnecessary columns
drop_columns_list = ["setting3", "s1", "s5", "s10", "s16", "s18", "s19", "RUL", "label2"]
df_train = df_train.drop(drop_columns_list, axis=1)
df_test = df_test.drop(drop_columns_list, axis=1)

df_train.head()

Unnamed: 0,id,cycle,setting1,setting2,s2,s3,s4,s6,s7,s8,s9,s11,s12,s13,s14,s15,s17,s20,s21,cycle_norm,label1
0,1,1,0.45977,0.166667,0.183735,0.406802,0.309757,1,0.726248,0.242424,0.109755,0.369048,0.633262,0.205882,0.199608,0.363986,0.333333,0.713178,0.724662,0.0,0
1,1,2,0.609195,0.25,0.283133,0.453019,0.352633,1,0.628019,0.212121,0.100242,0.380952,0.765458,0.279412,0.162813,0.411312,0.333333,0.666667,0.731014,0.00277,0
2,1,3,0.252874,0.75,0.343373,0.369523,0.370527,1,0.710145,0.272727,0.140043,0.25,0.795309,0.220588,0.171793,0.357445,0.166667,0.627907,0.621375,0.00554,0
3,1,4,0.54023,0.5,0.343373,0.256159,0.331195,1,0.740741,0.318182,0.124518,0.166667,0.889126,0.294118,0.174889,0.166603,0.333333,0.573643,0.662386,0.00831,0
4,1,5,0.390805,0.333333,0.349398,0.257467,0.404625,1,0.668277,0.242424,0.14996,0.255952,0.746269,0.235294,0.174734,0.402078,0.416667,0.589147,0.704502,0.01108,0


We can take note of a few things here. First, the `id` column indicates which machine the data was collected from, while the `cycle` column indicates the time step at which the measurement was taken. The remaining variables all lie on roughly the same scale, between 0 and 1, which suggests that Min-Max normalization has already been applied. Nonetheless, we will apply it again later, which also brings `id` and `cycle` into the same range.
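
For reference, Min-Max normalization rescales each column with `(x - min) / (max - min)`. The sketch below checks this formula against scikit-learn's `MinMaxScaler` on a made-up toy matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 3 samples, 2 features on very different scales
toy_train = np.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [3.0, 500.0]])

# Manual min-max: (x - min) / (max - min), computed per column
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual = (toy_train - col_min) / (col_max - col_min)

# Same result with scikit-learn
scaled = MinMaxScaler().fit_transform(toy_train)

print(np.allclose(manual, scaled))
```

Applying the scaler to a column that already spans exactly 0 to 1 leaves that column unchanged, so re-applying it to pre-normalized features is harmless.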

In [6]:
# let's also check for the data types
df_train.dtypes

id              int64
cycle           int64
setting1      float64
setting2      float64
s2            float64
s3            float64
s4            float64
s6              int64
s7            float64
s8            float64
s9            float64
s11           float64
s12           float64
s13           float64
s14           float64
s15           float64
s17           float64
s20           float64
s21           float64
cycle_norm    float64
label1          int64
dtype: object

### Train-Test Split

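The data already comes as separate training and test files, so no random split is needed. We only need to separate the features from the target variable in each set, so that we can build the model on the training set and evaluate it on the test set.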

In [7]:
# separate out features and target variable
y_train = df_train['label1'].to_numpy() # convert from pandas to NumPy so that it can easily be converted to a torch tensor
x_train = df_train.drop('label1', axis=1).to_numpy()
y_test = df_test['label1'].to_numpy()
x_test = df_test.drop('label1', axis=1).to_numpy()

print('Shape for y of train dataset: ', y_train.shape)
print('Shape for x of train dataset: ', x_train.shape)
print('Shape for y of test dataset: ', y_test.shape)
print('Shape for x of test dataset: ', x_test.shape)



# data pre-processing: min-max normalization (fit on the training set only, then applied to the test set)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


Shape for y of train dataset:  (20631,)
Shape for x of train dataset:  (20631, 20)
Shape for y of test dataset:  (13096,)
Shape for x of test dataset:  (13096, 20)


In [8]:
# convert our target variable to a Torch tensor of int64 class indices
y_train = torch.tensor(y_train, dtype=torch.int64)
y_test = torch.tensor(y_test, dtype=torch.int64)


# The targets are now 1-D tensors holding one class index (0 or 1) per row,
# which is the target format nn.CrossEntropyLoss expects
print(y_train.shape)
print(y_test.shape)

torch.Size([20631])
torch.Size([13096])


Recall that a recurrent model expects sequences as input, so we need to slice our dataset into fixed-length windows of consecutive rows. We will write a helper function to perform this.

In [9]:
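def data_processor(x_data, y_data, sequence_length):
    """
    Helper function to slice the data into overlapping sub-sequences (sliding windows).
    x_data must be a NumPy array; y_data must be indexable by row.
    Note: windows are taken over the whole array, so some of them span two machines.
    """
    x, y = [], []

    # Slide a window of `sequence_length` rows over the data, one row at a time
    for i in range(x_data.shape[0] - sequence_length):

        # copy the window starting at this index
        x.append(x_data[i:i + sequence_length])
        # the target is the label of the row right after the window
        y.append(y_data[i + sequence_length])
    
    return x, y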

In [10]:
# let's change the data to a suitable format for model to train on
x_sequence_train, y_sequence_train = data_processor(x_train, y_train, 30)
x_sequence_test, y_sequence_test = data_processor(x_test, y_test, 30)

# let's do a sanity check too
print("Total samples for X train: " + str(len(x_sequence_train)))
print("Total samples for y train: " + str(len(y_sequence_train)))
print("Total samples for X test: " + str(len(x_sequence_test)))
print("Total samples for y test: " + str(len(y_sequence_test)))

Total samples for X train: 20601
Total samples for y train: 20601
Total samples for X test: 13066
Total samples for y test: 13066
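
The helper above slides one window over the whole array, so a few windows mix the last rows of one machine with the first rows of the next. A per-machine variant could look like the sketch below; `windows_per_machine` and the toy arrays are hypothetical, and the sketch assumes that an array of machine ids is kept alongside the features.

```python
import numpy as np

def windows_per_machine(x_data, y_data, ids, sequence_length):
    """Build sliding windows separately for each machine id (hypothetical helper)."""
    x, y = [], []
    for machine in np.unique(ids):
        rows = np.where(ids == machine)[0]  # row indices of this machine, in order
        x_m, y_m = x_data[rows], y_data[rows]
        # same rule as data_processor: the label is the row right after the window
        for i in range(len(rows) - sequence_length):
            x.append(x_m[i:i + sequence_length])
            y.append(y_m[i + sequence_length])
    return np.array(x), np.array(y)

# Toy data: 2 machines with 5 rows each, 3 features per row
ids = np.repeat([1, 2], 5)
x_toy = np.arange(30, dtype=float).reshape(10, 3)
y_toy = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1])

x_win, y_win = windows_per_machine(x_toy, y_toy, ids, sequence_length=3)
print(x_win.shape, y_win.tolist())
```

Each machine with `n` rows contributes `n - sequence_length` windows, so the toy data yields 4 windows instead of the 7 that a single pass over all 10 rows would produce.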


In [11]:
# Let us use the Dataset object to instantiate our dataset, this way it enables the use of len and indexing
# This is the preferred way of preparing data in Pytorch
class MaintenanceDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, x, y):
        self.x = torch.Tensor(x)
        self.y = y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

We will also do a sanity check after using Dataset class of PyTorch to ensure that we have prepared our data into the right format before we feed them into the model.

In [12]:
# creating train and test datasets
train_ds = MaintenanceDataset(x_sequence_train, y_sequence_train)
test_ds = MaintenanceDataset(x_sequence_test, y_sequence_test)

# let us print out the first two rows of train dataset and check the shape of dataset too
print(train_ds[:2])
print(len(train_ds))          # print out number of samples
print(len(train_ds[0]))       # print out number of elements per sample: (x, y)
print(len(train_ds[0][0]))    # print out number of sequences / sequence length
print(len(train_ds[0][0][0])) # print out number of features 

(tensor([[[0.0000, 0.0000, 0.4598,  ..., 0.7132, 0.7247, 0.0000],
         [0.0000, 0.0028, 0.6092,  ..., 0.6667, 0.7310, 0.0028],
         [0.0000, 0.0055, 0.2529,  ..., 0.6279, 0.6214, 0.0055],
         ...,
         [0.0000, 0.0748, 0.3621,  ..., 0.6744, 0.5384, 0.0748],
         [0.0000, 0.0776, 0.5690,  ..., 0.6124, 0.6428, 0.0776],
         [0.0000, 0.0803, 0.3736,  ..., 0.7054, 0.7136, 0.0803]],

        [[0.0000, 0.0028, 0.6092,  ..., 0.6667, 0.7310, 0.0028],
         [0.0000, 0.0055, 0.2529,  ..., 0.6279, 0.6214, 0.0055],
         [0.0000, 0.0083, 0.5402,  ..., 0.5736, 0.6624, 0.0083],
         ...,
         [0.0000, 0.0776, 0.5690,  ..., 0.6124, 0.6428, 0.0776],
         [0.0000, 0.0803, 0.3736,  ..., 0.7054, 0.7136, 0.0803],
         [0.0000, 0.0831, 0.5805,  ..., 0.6202, 0.6091, 0.0831]]]), [tensor(0), tensor(0)])
20601
2
30
20


In [13]:
# let us print out the shape of the dataset
print(len(test_ds))          # print out number of samples
print(len(test_ds[0]))       # print out number of elements per sample: (x, y)
print(len(test_ds[0][0]))    # print out number of sequences / sequence length
print(len(test_ds[0][0][0])) # print out number of features 

13066
2
30
20


Next, with the data in the right format, we wrap each dataset in a DataLoader so that we can iterate over it in batches.

In [14]:
# Now, we are ready to create iterator using DataLoader
train_loader = torch.utils.data.DataLoader(dataset=train_ds,
                                          batch_size=200,
                                          shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test_ds,
                                          batch_size=1,
                                          shuffle=False)
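
As a small illustration of how batching works, the sketch below builds a loader over made-up zero tensors with the same layout as our data and inspects the batches it yields. `TensorDataset` is used here only for brevity; the toy sizes are arbitrary.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with the same layout as above: (samples, sequence_length, features)
x_toy = torch.zeros(450, 30, 20)
y_toy = torch.zeros(450, dtype=torch.int64)
toy_loader = DataLoader(TensorDataset(x_toy, y_toy), batch_size=200, shuffle=False)

# The loader yields ceil(n_samples / batch_size) batches; the last one may be smaller
print(len(toy_loader), math.ceil(450 / 200))

batch_shapes = [tuple(xb.shape) for xb, _ in toy_loader]
print(batch_shapes)
```

By the same rule, 20601 training sequences with a batch size of 200 give 104 batches per epoch, the last of which holds a single sequence.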

In [15]:
# As a check, we can load a single batch from the iterator
next(iter(test_loader))

[tensor([[[0.0000, 0.0000, 0.6322, 0.7500, 0.5452, 0.3107, 0.2694, 1.0000,
           0.6522, 0.2121, 0.1276, 0.2083, 0.6461, 0.2206, 0.1322, 0.3090,
           0.3333, 0.5581, 0.6618, 0.0000],
          [0.0000, 0.0028, 0.3448, 0.2500, 0.1506, 0.3796, 0.2223, 1.0000,
           0.8052, 0.1667, 0.1467, 0.3869, 0.7399, 0.2647, 0.2048, 0.2132,
           0.4167, 0.6822, 0.6868, 0.0028],
          [0.0000, 0.0055, 0.5172, 0.5833, 0.3765, 0.3466, 0.3222, 1.0000,
           0.6860, 0.2273, 0.1581, 0.3869, 0.6994, 0.2206, 0.1556, 0.4586,
           0.4167, 0.7287, 0.7213, 0.0055],
          [0.0000, 0.0083, 0.7414, 0.5000, 0.3705, 0.2852, 0.4080, 1.0000,
           0.6795, 0.1970, 0.1057, 0.2560, 0.5736, 0.2500, 0.1701, 0.2570,
           0.2500, 0.6667, 0.6621, 0.0083],
          [0.0000, 0.0111, 0.5805, 0.5000, 0.3916, 0.3521, 0.3320, 1.0000,
           0.6940, 0.1667, 0.1024, 0.2738, 0.7377, 0.2206, 0.1528, 0.3009,
           0.1667, 0.6589, 0.7164, 0.0111],
          [0.0000, 0.0139, 0.5

### Model Configuration

We will perform some needed configuration for the model here.

In [16]:
# this is just to configure model hyperparameters
# Input configurations
input_size = 20      # since one row has 20 features, we are reading one row at a time
sequence_length = 30 # each sample is a window of 30 consecutive rows
num_layers = 2       # stack 2 RNN layers together

# Hyperparameter
hidden_size = 128 # number of features in the hidden state of each RNN layer
num_classes = 2
epochs = 5
# batch_size = 200
learning_rate = 0.001

random_seed = 42

torch.manual_seed(random_seed) # to ensure reproducibility

<torch._C.Generator at 0x20ac7e0c510>

### Model Building

In [17]:
# Let's instantiate a model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__() # Python 2 compatible form, equivalent to super().__init__()
        self.hidden_size = hidden_size # this is to set attribute of the instance
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True) # in this order, so (20, 128, 2)
                                                                                 # batch_first=True makes batch the 1st dimension
        # x -> (batch_size, seq, input_size) [the shape needed for the tensor]
        self.fc = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size) # initiate zeros tensor with (num_layers, batch_size, hidden_size)
        
        out, _ = self.rnn(x, h0)
        # output shape: batch_size, seq_length, hidden_size
        # out (N, 30, 128)
        out = out[:, -1, :] # keep only the output of the last time step
        # out (N, 128)
        out = self.fc(out)
        return out
    
model = RNN(input_size, hidden_size, num_layers, num_classes)
print(model)

# let's set loss function and optimizer
criterion = nn.CrossEntropyLoss() # takes the 2 raw class scores (logits) and integer class labels
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# finally we can start to train
n_total_steps = len(train_loader)
print(n_total_steps)

for epoch in range(epochs):
    for i, (x, y) in enumerate(train_loader):          
        # forward pass
        outputs = model(x)
        loss = criterion(outputs, y)
        
        # backward pass
        optimizer.zero_grad() # clear the gradients left over from the previous step, done before each backward pass
        loss.backward()
        optimizer.step()
        
        # print out the training performance
        if (i+1) % 100 == 0:
            print (f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{n_total_steps}], Loss: {loss.item():10.4f}')
            

RNN(
  (rnn): RNN(20, 128, num_layers=2, batch_first=True)
  (fc): Linear(in_features=128, out_features=2, bias=True)
)
104
Epoch [1/5], Step [100/104], Loss:     0.1841
Epoch [2/5], Step [100/104], Loss:     0.1500
Epoch [3/5], Step [100/104], Loss:     0.1377
Epoch [4/5], Step [100/104], Loss:     0.1326
Epoch [5/5], Step [100/104], Loss:     0.1304
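
To see how the tensor shapes change inside the forward pass, the sketch below pushes a dummy batch through layers of the same sizes as the model above. The batch size of 4 and the zero-valued input are arbitrary choices for illustration.

```python
import torch
from torch import nn

# Same layer sizes as the model above; the batch size of 4 is arbitrary
rnn = nn.RNN(input_size=20, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 2)

x = torch.zeros(4, 30, 20)  # (batch, sequence_length, input_size)
out, h_n = rnn(x)           # h0 defaults to zeros when omitted

print(out.shape)            # outputs of the last layer at every time step
print(h_n.shape)            # final hidden state of every layer

last_step = out[:, -1, :]   # output at the last time step
logits = fc(last_step)
print(logits.shape)

# For a plain RNN, the last time step of `out` equals the final hidden state of the top layer
print(torch.allclose(last_step, h_n[-1]))
```

Because the initial hidden state defaults to zeros, passing an explicit `h0` of zeros, as the model does, gives the same result.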


### Model Evaluation

In [18]:
# Let's evaluate our model
# Remember, we don't need to compute gradients as it is not required (and save some precious memory too!)
with torch.no_grad():
    n_correct = 0
    n_samples = 0
    for x, y in test_loader:
        outputs = model(x)
        # max returns (value, index); the index is the predicted class
        _, predicted = torch.max(outputs, 1)
        n_samples += y.size(0)
        n_correct += (predicted == y).sum().item()
        
acc = 100.0 * n_correct / n_samples
print(f"Accuracy of the network on the {n_samples} test sequences: {acc:10.2f} %")

Accuracy of the network on the 13066 test sequences:      98.45 %


Our model reaches a high accuracy, but this alone does not mean that it is a good model. Because the classes are imbalanced, a model that mostly predicts the majority class also scores a high accuracy. We should therefore also check other classification metrics such as precision, recall, and F1-score before drawing conclusions.
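
The sketch below shows why accuracy can mislead on imbalanced data. It uses made-up toy labels, not the predictions of the model above, and scores a "model" that always predicts the majority class with scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: 8 healthy samples (0) and 2 that require maintenance (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive is predicted
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The toy model scores 80% accuracy while detecting none of the failures. To apply the same metrics to the trained network, the `predicted` and `y` values from the evaluation loop can be collected into two lists and passed to these functions.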

## Exercise

Please perform a binary classification task using the same dataset and features, but with "label2" as the target variable instead. Feel free to experiment with other features or feature engineering techniques if you have an adventurous spirit.

In [None]:
# import libraries

# read dataset

# initial data exploration

# data pre-processing

# model configuration

# model building

# model evaluation


## <a name="reference">Reference</a>:
1. [Deep Learning for tabular data using Pytorch](https://jovian.ai/aakanksha-ns/shelter-outcome)
2. [pytorch custom dataset: DataLoader returns a list of tensors rather than tensor of a list](https://stackoverflow.com/questions/62208904/pytorch-custom-dataset-dataloader-returns-a-list-of-tensors-rather-than-tensor)
3. [Predictive maintenance - NASA Turbofan Dataset](https://www.kaggle.com/c/predictive-maintenance)