# Step 2: Model Building & Evaluation
Using the training and test data sets constructed in the `Code/1_data_ingestion_and_preparation.ipynb` Jupyter notebook, this notebook builds an LSTM network for the scenario described in the [Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3) to predict failure in aircraft engines. We will store the model for deployment in an Azure web service, which we build in the `Code/3_operationalization.ipynb` Jupyter notebook.

In [1]:

# import the libraries
import os

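from azureml.core import (Workspace, Run, VERSION,
                          Experiment, Datastore)
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.exceptions import ComputeTargetException

from azureml.train.dnn import PyTorch
# explicit imports instead of a wildcard, so the names used below are traceable
from azureml.train.hyperdrive import (RandomParameterSampling, BanditPolicy,
                                      HyperDriveRunConfig, PrimaryMetricGoal,
                                      uniform, choice)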

from azureml.widgets import RunDetails

print('SDK version', VERSION)

SDK version 1.0.2


## Azure ML workspace

In [2]:
project_folder = os.getcwd()
exp_name = "deep_pred"

ws = Workspace.from_config()
print('Workspace loaded:', ws.name)

Found the config file in: /home/sasuke/dev/amlsamples/deep_predictive_maintenance/aml_config/config.json
Workspace loaded: vienna


## Load feature data set

We previously created the labeled data set in the `Code/1_data_ingestion_and_preparation.ipynb` Jupyter notebook and stored it in the default data store of the AML workspace.

Here we get a reference to that data store and download its contents to the project folder.

In [3]:


ds = Datastore.get(ws,'workspaceblobstore')
ds.download(project_folder, overwrite=True, show_progress = True)

data_path = "data"
ds_path = ds.path(data_path)
print(ds_path)



$AZUREML_DATAREFERENCE_a6d6fe1311244220b9c3d274f72304c1


## Modelling

Traditional predictive maintenance machine learning models are based on feature engineering, the manual construction of variables using domain expertise and intuition. This usually makes these models hard to reuse, as the features are specific to the problem scenario and the available data may vary between customers. Perhaps the most attractive advantage of deep learning models is that they automatically perform feature engineering from the data, eliminating the need for a manual feature engineering step.

When using LSTMs in the time-series domain, one important parameter is the sequence length, the window to examine for a failure signal. This is comparable to picking a `window_size` (e.g., 5 cycles) for calculating the rolling features in the [Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3). The rolling features there include the rolling mean and rolling standard deviation over the 5 cycles for each of the 21 sensor values. In deep learning, we let the LSTMs extract abstract features from the sequence of sensor values within the window. The expectation is that patterns within these sensor values will be automatically encoded by the LSTM.
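
For comparison, the hand-engineered rolling features mentioned above can be sketched as follows. This is a minimal illustration rather than the template's actual code: the function name `rolling_features`, the toy sensor series, the use of a shorter history for the first few cycles, and the choice of the population standard deviation are all assumptions.

```python
import numpy as np

def rolling_features(values, window_size=5):
    """Return (rolling_mean, rolling_std) for a 1-D sensor series.

    Position t summarizes the last `window_size` readings up to and
    including t. The first window_size - 1 positions use whatever
    shorter history is available (an assumption of this sketch).
    """
    values = np.asarray(values, dtype=float)
    means = np.empty_like(values)
    stds = np.empty_like(values)
    for t in range(len(values)):
        window = values[max(0, t - window_size + 1): t + 1]
        means[t] = window.mean()
        stds[t] = window.std()  # population standard deviation (ddof=0)
    return means, stds

# Hypothetical readings of a single sensor over six cycles
sensor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
means, stds = rolling_features(sensor, window_size=5)
print(means[4], means[5])  # mean of cycles 1-5, then of cycles 2-6
```

Each sensor would contribute two such columns, so with 21 sensors this approach yields 42 hand-built features per cycle, whereas the LSTM consumes the raw window directly.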

Another critical advantage of LSTMs is their ability to remember information across long sequences (large window sizes), which is hard to achieve with traditional feature engineering. For example, computing rolling averages over a window size of 50 cycles may lose information due to smoothing over such a long period. LSTMs can work with larger window sizes and use all the information in the window as input.
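
The smoothing effect can be made concrete with a small numeric sketch. The series below is invented for illustration: a flat 50-cycle window containing one short-lived anomaly.

```python
import numpy as np

# Hypothetical 50-cycle window of one sensor: flat signal with a single spike
window = np.zeros(50)
window[45] = 10.0

# A 50-cycle rolling average collapses the window into one number,
# diluting the spike by a factor of 50
rolling_mean = window.mean()

# The raw window, which is what the LSTM receives, still contains the spike
peak = window.max()

print(rolling_mean, peak)
```

The average alone cannot distinguish this window from one in which every reading is slightly elevated, while the full sequence keeps both the size and the position of the spike.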

More details on LSTM networks are available in [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

This notebook illustrates the LSTM approach to binary classification, using a `sequence_length` of 50 cycles to predict the probability of engine failure within 30 days.

In [4]:
training_dir = './train'
os.makedirs(training_dir, exist_ok=True)

# choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=6)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

Found existing compute target.


## LSTM Network

Building a neural network requires determining the network architecture. In this scenario we build an LSTM network using the PyTorch [estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch).

Hyperparameter tuning of the network is done with [Hyperdrive](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters).

The files in the train directory are used as follows:

 - utils.py: contains the data preparation code that reads the csv files and transforms them into LSTM-ready 3D tensors.
 - network.py: defines the LSTM network in PyTorch.
 - train.py: entry script for the estimator; contains the training script.
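
The contents of utils.py are not shown in this notebook. As a rough sketch of what "3D tensors" means here, the following slices one engine's per-cycle sensor readings into overlapping windows of shape `(n_windows, sequence_length, n_features)`, which is the layout an LSTM with `batch_first=True` expects. The function name `make_windows` and the toy data are hypothetical and are not taken from utils.py.

```python
import numpy as np

def make_windows(features, sequence_length):
    """Slice a (n_cycles, n_features) array into overlapping windows.

    Returns an array of shape (n_windows, sequence_length, n_features),
    where consecutive windows are shifted by one cycle.
    """
    features = np.asarray(features, dtype=np.float32)
    n_cycles, n_features = features.shape
    if n_cycles < sequence_length:
        # Not enough history for even one full window
        return np.empty((0, sequence_length, n_features), dtype=np.float32)
    windows = [features[start:start + sequence_length]
               for start in range(n_cycles - sequence_length + 1)]
    return np.stack(windows)

# Hypothetical engine with 60 cycles and 3 sensor columns
engine = np.arange(60 * 3).reshape(60, 3)
X = make_windows(engine, sequence_length=50)
print(X.shape)  # (11, 50, 3)
```

An engine with fewer cycles than `sequence_length` yields no windows in this sketch; a real pipeline would have to decide whether to pad or drop such engines.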

In [5]:
%%writefile ./train/network.py

import torch
import torch.nn as nn

class Network(nn.Module):
    
    def __init__(self,device, input_size, hidden_size, nb_layers, dropout, nb_classes=2):
        super(Network, self).__init__()
        
        self.device = device
        self.hidden_size = hidden_size
        self.nb_layers = nb_layers
        self.dropout = nn.Dropout(dropout)
        self.lstm0 = nn.LSTM(input_size, hidden_size, nb_layers, batch_first=True)
        self.lstm1 = nn.LSTM(hidden_size, hidden_size//2, nb_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size//2, nb_classes)
        self.activation = nn.ReLU()
        
    
    def forward(self, x):
        
        # Set initial hidden and cell states 
        h0 = torch.zeros(self.nb_layers, x.size(0), self.hidden_size).to(self.device)
        c0 = torch.zeros(self.nb_layers, x.size(0), self.hidden_size).to(self.device)
        
        h1 = torch.zeros(self.nb_layers, x.size(0), self.hidden_size//2).to(self.device)
        c1 = torch.zeros(self.nb_layers, x.size(0), self.hidden_size//2).to(self.device)
        
        # Forward propagate LSTM
        self.lstm0.flatten_parameters()
        out, _ = self.lstm0(x, (h0, c0))
        out = self.activation(out)
        out = self.dropout(out)
        
        self.lstm1.flatten_parameters()
        out, _ = self.lstm1(out, (h1, c1))
        out = self.activation(out)
        
        # retrieve hidden state of the last time step
        out = self.fc(out[:, -1, :])
       
        return out


Overwriting ./train/network.py


In [6]:
%%writefile ./train/train.py


import argparse
import os

import torch
import torch.nn as nn
import torch.utils.data as utils
from azureml.core import Run
import numpy as np
import pandas as pd
# local module train/utils.py (distinct from the torch.utils.data alias above)
from utils import tensorize, to_tensors
from network import Network
from sklearn.metrics import (recall_score,
                             precision_score,
                             accuracy_score)

run = Run.get_context()

def train( dataloader, learning_rate,
          device,input_size, 
          hidden_size, nb_layers,
          dropout, nb_classes,
         X_val,y_val):
    
    print("Start training")
    run.log('learning rate', learning_rate)
    run.log('dropout', dropout)
    
    network = Network(device, input_size,
                      hidden_size, nb_layers, dropout, 
                      nb_classes).to(device)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate)
    

    
    # Train the model
    for epoch in range(nb_epochs):
        for i, (X, y) in enumerate(dataloader):
            X = X.reshape(-1, X.shape[1], input_size).to(device)
            y = y.to(device)

            # Forward pass
            y_pred = network(X)
            loss = criterion(y_pred, y)

            # Backprop
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Log the training loss every 100 batches
            if (i+1) % 100 == 0:
                run.log('loss', loss.item())
                
        evaluate(X_val,y_val, network, device)

    return network

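def evaluate(X_test, y_test, network, device):
    
    '''
        Evaluate the model on a held-out set and log the metrics
        
        params:
            X_test: numpy array of input sequences
            y_test: numpy array of labels
            network: network to evaluate
            device: device on which to run inference
    '''

    X_test = torch.from_numpy(X_test).to(device)
    y_test = torch.from_numpy(y_test).to(device)

    # Disable dropout and gradient tracking while evaluating
    was_training = network.training
    network.eval()
    with torch.no_grad():
        y_pred = network(X_test)
    # Restore the previous mode so that training can continue afterwards
    network.train(was_training)
    
    y_pred_np = y_pred.to('cpu').numpy()
    y_test_np = y_test.to('cpu').numpy()
    # Predicted class is the index of the larger of the two logits
    y_pred_np = np.argmax(y_pred_np, axis=1)
    
    accuracy = accuracy_score(y_test_np, y_pred_np)
    precision = precision_score(y_test_np, y_pred_np)
    recall = recall_score(y_test_np, y_pred_np)
    
    run.log('accuracy', accuracy)
    run.log('precision', precision)
    run.log('recall', recall)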

if __name__ == '__main__':
    
    print('Pytorch version', torch.__version__)
    
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--epochs', type=int, default=2,
                        help='number of epochs to train')
    parser.add_argument('--learning_rate', type=float,
                        default=1e-3, help='learning rate')
    parser.add_argument('--dropout', type=float,
                        default=.2, help='drop out')
    parser.add_argument('--layers', type=int,
                        default=1, help='number of layers')
    parser.add_argument('--data_path', type=str, 
                        help='path to training-set file')
    parser.add_argument('--output_dir', type=str, 
                        help='output directory')
    
    args = parser.parse_args()
    nb_epochs = args.epochs
    learning_rate = args.learning_rate
    dropout = args.dropout
    data_path = args.data_path
    output_dir = args.output_dir
    nb_layers = args.layers
    
    hidden_size = 128
    nb_classes = 2
    batch_size = 32
    
    
    
    os.makedirs(data_path, exist_ok = True)
    training_file = os.path.join(data_path, 'preprocessed_train_file.csv')
    
    X_train, y_train = to_tensors(training_file)
    input_size = X_train.shape[2]
    X_train = torch.from_numpy(X_train)
    y_train = torch.from_numpy(y_train)
    
    val_file_path = os.path.join(data_path, 'preprocessed_test_file.csv')
    X_val,y_val = to_tensors(val_file_path, istest = True)
    
    dataset = utils.TensorDataset(X_train, y_train)
    # Pass the batch size defined above; the DataLoader default is 1
    dataloader = utils.DataLoader(dataset, batch_size=batch_size)
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    
    network = train(dataloader,learning_rate,
                    device, input_size,
                    hidden_size, nb_layers,
                    dropout, nb_classes,
                    X_val,y_val)
    
    
    evaluate(X_val,y_val, network, device)
    
    os.makedirs(output_dir, exist_ok = True)
    model_path = os.path.join(output_dir, 'network.pth')
    torch.save(network, model_path)
    run.register_model(model_name = 'network.pth', model_path = model_path)

Overwriting ./train/train.py
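
The network emits two raw scores (logits) per window, and `evaluate` takes their argmax as the predicted class. To read the output as a failure probability instead, a softmax can be applied to the logits. The sketch below does this in plain Python; the helper name `failure_probability` is hypothetical, and treating class index 1 as "failure" is an assumption about the label encoding.

```python
import math

def failure_probability(logits):
    """Softmax over two logits; returns the probability of class 1.

    Assumes index 0 means 'no failure' and index 1 means 'failure'.
    """
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[1] / sum(exps)

print(failure_probability([0.0, 0.0]))   # equal logits give equal probabilities
print(failure_probability([-1.0, 2.0]))  # larger class-1 logit, higher probability
```

Taking the argmax of the logits is equivalent to thresholding this probability at 0.5; a different threshold could be used to trade precision against recall.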


## Estimator

Here, we define the PyTorch estimator.

In [7]:
script_params = {
    '--epochs': 2,
    '--data_path': ds_path,
    '--output_dir': './outputs'
}

estimator = PyTorch(source_directory = training_dir, 
                    conda_packages = ['pandas', 'numpy', 'scikit-learn'],
                    pip_packages = ['torch==0.4.1','torchvision'],
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='train.py',
                    use_gpu=True)

## Hyperdrive

Here, we define the Hyperdrive configuration.

In [8]:
param_sampling = RandomParameterSampling( {
        'learning_rate': uniform(1e-6, 1e-2),
        'dropout': uniform(.4,.8),
        'layers': choice(1,2,3)
    }
)

termination_policy = BanditPolicy(slack_factor=.1, evaluation_interval=1, delay_evaluation=1)

hyperdrive_run = HyperDriveRunConfig(estimator=estimator,
                                     hyperparameter_sampling=param_sampling,
                                     policy=termination_policy,
                                     primary_metric_name='accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=24,
                                     max_concurrent_runs=6)
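
The early-termination rule behind `BanditPolicy` can be sketched in a few lines. For a metric that is maximized, a run is stopped when its best value so far falls below the best run's value divided by `1 + slack_factor`. The function below is an illustration of that rule only, not the service's implementation; its name and the example metric values are hypothetical.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Slack-factor rule for a maximized metric.

    A run is terminated when its metric is below
    best_metric / (1 + slack_factor).
    """
    return run_metric < best_metric / (1.0 + slack_factor)

best_accuracy = 0.80  # hypothetical best accuracy among all runs so far
threshold = best_accuracy / 1.1
print(round(threshold, 4))
for accuracy in (0.70, 0.75):
    print(accuracy, bandit_should_terminate(accuracy, best_accuracy, 0.1))
```

With `slack_factor=.1` as configured above, a run must stay within roughly 91% of the best accuracy seen so far to keep running.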

We submit the experiment for execution and render the run progress through the widget.

In [9]:
experiment = Experiment(workspace=ws, name=exp_name)
run = experiment.submit(hyperdrive_run)


In [10]:
RunDetails(run).show()
