# Step 2: Model Building & Evaluation
Using the training and test data sets we constructed in the `Code/1_data_ingestion_and_preparation.ipynb` Jupyter notebook, this notebook builds an LSTM network for the scenario described in the [Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3) to predict failure in aircraft engines. We will store the model for deployment in an Azure web service, which we build in the `Code/3_operationalization.ipynb` Jupyter notebook.

In [148]:
%load_ext autoreload
%autoreload 2

import os

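from azureml.core import (Workspace, Run, VERSION,
                          Experiment, Datastore)
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.exceptions import ComputeTargetException

from azureml.train.dnn import PyTorch
# import only the Hyperdrive names this notebook uses
from azureml.train.hyperdrive import (RandomParameterSampling, BanditPolicy,
                                      HyperDriveRunConfig, PrimaryMetricGoal,
                                      uniform, choice)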
from azureml.widgets import RunDetails

print('SDK version', VERSION)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
SDK version 1.0.6


## Azure ML workspace

In [149]:
project_folder = os.getcwd()
exp_name = "deep_pred"

ws = Workspace.from_config()
print('Workspace loaded:', ws.name)

Found the config file in: /home/sasuke/dev/amlsamples/deep_predictive_maintenance/aml_config/config.json
Workspace loaded: vienna


## Load feature data set

We previously created the labeled data set in the `Code/1_data_ingestion_and_preparation.ipynb` Jupyter notebook and stored it in the default data store of the AML workspace.

Here, we call the `path` method, which returns a [data reference](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.data_reference.datareference?view=azure-ml-py) instance that will be passed to the training script during run execution.

In [150]:
ds = Datastore.get(ws,'workspaceblobstore')
data_path = "data"
ds_path = ds.path(data_path)
print(ds_path)

$AZUREML_DATAREFERENCE_25e3a14768454138838074bee5dc070e


## Compute target

Here, we provision the AML Compute cluster that will be used to execute the training script. If a cluster with the given name already exists, it is reused.

In [151]:
training_dir = './train'
os.makedirs(training_dir, exist_ok=True)

# choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=6)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

Found existing compute target.


## Modelling

Traditional predictive maintenance machine learning models are based on feature engineering, the manual construction of variables using domain expertise and intuition. This usually makes these models hard to reuse, as the features are specific to the problem scenario and the available data may vary between customers. Perhaps the most attractive advantage of deep learning models is that they learn features automatically from the data, eliminating the need for a manual feature engineering step.

When using LSTMs in the time-series domain, one important parameter is the sequence length, the window to examine for a failure signal. This may be viewed as picking a `window_size` (e.g. 5 cycles) for calculating the rolling features in the [Predictive Maintenance Template](https://gallery.cortanaintelligence.com/Collection/Predictive-Maintenance-Template-3). The rolling features there included the rolling mean and rolling standard deviation over the 5 cycles for each of the 21 sensor values. In deep learning, we let the LSTM extract abstract features from the sequence of sensor values within the window. The expectation is that patterns within these sensor values will be automatically encoded by the LSTM.

Another critical advantage of LSTMs is their ability to retain information across long sequences (large window sizes), which is hard to achieve with traditional feature engineering. Computing rolling averages over a window size of 50 cycles may lead to loss of information due to smoothing over such a long period. LSTMs can work with larger window sizes and use all the information in the window as input.

For more details on LSTM networks, see [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

This sample illustrates the LSTM approach to binary classification, using a `sequence_length` of 50 cycles to predict the probability of engine failure within 30 days.
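The windowing idea described above can be sketched with NumPy. This is only an illustration under stated assumptions, not the contents of the project's `utils.py`: the function name `make_windows` and the random engine data are hypothetical, and the 25 features per cycle are chosen to match the test tensor shape printed later in this notebook.

```python
import numpy as np

def make_windows(sensor_values, sequence_length):
    """Slice one engine's (n_cycles, n_features) array into overlapping windows.

    Returns an array of shape (n_windows, sequence_length, n_features),
    where n_windows = n_cycles - sequence_length + 1.
    """
    n_cycles, n_features = sensor_values.shape
    if n_cycles < sequence_length:
        # History too short: no complete window can be built.
        return np.empty((0, sequence_length, n_features),
                        dtype=sensor_values.dtype)
    windows = [sensor_values[start:start + sequence_length]
               for start in range(n_cycles - sequence_length + 1)]
    return np.stack(windows)

# Hypothetical engine with 60 cycles and 25 features per cycle.
engine = np.random.rand(60, 25).astype(np.float32)
X = make_windows(engine, sequence_length=50)
print(X.shape)  # (11, 50, 25)
```

Each window keeps every raw reading in its 50 cycles, so nothing is averaged away before the LSTM sees it.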

## Implementation and hyperparameter tuning

Building a neural network requires determining the network architecture. In this scenario we will build an LSTM network using the PyTorch [estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch).

Hyperparameter tuning of the network is done with [Hyperdrive](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters).

The files in the train directory are used as follows:

 - utils.py: data preparation code that reads the csv files and transforms them into LSTM-ready 3D tensors.
 - network.py: defines the LSTM network in PyTorch.
 - train.py: contains the training and evaluation loops.
 - entry.py: entry script for the estimator; parses the arguments, runs the training and saves the model.

In [152]:
%%writefile ./train/network.py

import torch 
import torch.nn as nn

class Network(nn.Module):
    
    def __init__(self,batch_size,input_size, 
                 hidden_size, nb_layers, dropout):
        super(Network, self).__init__()
        
        
        self.hidden_size = hidden_size
        self.nb_layers = nb_layers
        self.rnn = nn.LSTM(input_size, hidden_size, 
                           nb_layers, batch_first=True,
                           dropout = dropout)
        self.fc = nn.Linear(hidden_size, 2)
        self.activation = nn.ReLU()
        self.softmax = nn.LogSoftmax(dim=1)  # normalize over the class dimension
        
    
    def forward(self, sequence):
        
        out,_ = self.rnn(sequence)
        out = self.activation(out)
        out = self.fc(out[:, -1, :])
        likelihood = self.softmax(out)
        return out, likelihood


Overwriting ./train/network.py
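The network above returns two values: the raw scores of the final linear layer and their log-softmax. The second value is a log-probability, so it has to be exponentiated before it can be read as a probability. The following is a minimal pure-Python sketch of that relationship; the `log_softmax` helper and the score values are hypothetical and stand in for `nn.LogSoftmax` applied to one row of output.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw class scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]

# Hypothetical output of the final linear layer for one sequence
# (index 0 = no failure, index 1 = failure is an assumption).
raw_scores = [1.2, -0.4]

log_probs = log_softmax(raw_scores)
probs = [math.exp(lp) for lp in log_probs]   # back to probabilities
predicted_class = probs.index(max(probs))

print(predicted_class, [round(p, 3) for p in probs])
```

The predicted class is the same whether it is taken from the raw scores, the log-probabilities or the probabilities, since both transformations preserve ordering.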


In [153]:
%%writefile ./train/train.py

import numpy as np

import torch
import torch.nn as nn
import torch.utils.data as utils

from network import Network
from sklearn.metrics import precision_score,recall_score,f1_score



def train(X_train,y_train, 
          X_val,y_val,weight_decay,
          learning_rate,batch_size,
          hidden_size,dropout,
          nb_epochs, run):
    
    print("Start training....")
    print('learning rate', learning_rate)
    print("L2 regularization", weight_decay)
    print('dropout', dropout)
    print('batch_size', batch_size)
    print('hidden_units', hidden_size)
    
    dataset = utils.TensorDataset(torch.from_numpy(X_train),
                                  torch.from_numpy(y_train)) 
    dataloader = utils.DataLoader(dataset, batch_size = batch_size,
                                  shuffle = True)
    
    val_dataset = utils.TensorDataset(torch.from_numpy(X_val),
                                      torch.from_numpy(y_val))
    val_dataloader = utils.DataLoader(val_dataset)
    
    use_gpu = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    input_size = X_train.shape[2] #features dimension
    nb_layers = 2 # lstm layers
    
    network = Network(batch_size, 
                      input_size,hidden_size,
                      nb_layers,dropout).to(use_gpu)
    

    cost_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate,
                                weight_decay=weight_decay)
    
    # Train the model
    for epoch in range(nb_epochs):
        for i, (X, y) in enumerate(dataloader):
            optimizer.zero_grad()
            y_pred, _ = network(X.to(use_gpu))
            loss = cost_fn(y_pred, y.to(use_gpu))
            loss.backward()
            optimizer.step()
            if (i+1) % 100 == 0:
                run.log('loss', loss.item())
                
        # end of epoch evaluate      
        evaluate(val_dataloader, network, use_gpu, run)
        network.train()

    return network

def evaluate(dataloader, network, use_gpu, run):
    
    '''
        Evaluate model on validation set
        
        params:
            dataloader: dataloader
            network: model
            use_gpu: device
            run: AML RUN
    '''
    
    
    
    network.eval()  # disable dropout while evaluating

    y_pred_lst = []
    y_truth_lst = []
    with torch.no_grad():
        for X, y in dataloader:
            output, _ = network(X.to(use_gpu))
            y_pred = output.to('cpu').numpy().argmax(axis=1)

            y_pred_lst.append(y_pred)
            y_truth_lst.append(y.numpy())

    # flatten the per-batch arrays into 1-D label vectors
    y_pred_np = np.concatenate(y_pred_lst)
    y_truth_np = np.concatenate(y_truth_lst)

    precision = precision_score(y_truth_np, y_pred_np)
    recall = recall_score(y_truth_np, y_pred_np)
    f1 = f1_score(y_truth_np, y_pred_np)

    run.log('precision', round(precision, 2))
    run.log('recall', round(recall, 2))
    run.log('f1', round(f1, 2))

Overwriting ./train/train.py


In [154]:
%%writefile ./train/entry.py


import argparse
import os
import numpy as np
import pandas as pd

import torch

from utils import to_tensors
from train import train

from sklearn.model_selection import train_test_split
from azureml.core import Run

if __name__ == '__main__':
    
    
    
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--epochs', type=int, default=2,
                        help='number of epochs to train')
    parser.add_argument('--learning_rate', type=float,
                        default=1e-3, help='learning rate')
    parser.add_argument('--l2', type=float,
                        default=0.0, help='Weight decay')
    parser.add_argument('--dropout', type=float,
                        default=.2, help='drop out')
    parser.add_argument('--hidden_units', type=int,
                        default=16, help='number of neurons')
    parser.add_argument('--batch_size', type=int,
                        default=16, help='Mini batch size')
    parser.add_argument('--data_path', type=str, 
                        help='path to training-set file')
    parser.add_argument('--output_dir', type=str, 
                        help='output directory')
    
    args = parser.parse_args()
    nb_epochs = args.epochs
    learning_rate = args.learning_rate
    weight_decay = args.l2
    dropout = args.dropout
    data_path = args.data_path
    output_dir = args.output_dir
    batch_size = args.batch_size
    hidden_size = args.hidden_units
    
    SEED = 123
    torch.manual_seed(SEED)
    np.random.seed(SEED)
    
    
    print('Pytorch version', torch.__version__)
    
    run = Run.get_context()
    
    os.makedirs(data_path, exist_ok = True)
    training_file = os.path.join(data_path, 'preprocessed_train_file.csv')
    
    X, y = to_tensors(training_file)
    X_train, X_test, y_train, y_test = train_test_split(
                             X, y, test_size=0.15, random_state=SEED)
    

    
    network = train( X_train,y_train, 
                    X_test,y_test, weight_decay,
                    learning_rate,batch_size,
                    hidden_size,dropout,
                    nb_epochs, run)
    
    os.makedirs(output_dir, exist_ok = True)
    model_path = os.path.join(output_dir, 'network.pth')
    
    torch.save(network, model_path)
    run.register_model(model_name = 'network.pth', model_path = model_path)

Overwriting ./train/entry.py


## Estimator

Here, we define the PyTorch estimator.

In [155]:
script_params = {
    '--epochs': 2,
    '--data_path': ds_path,
    '--output_dir': './outputs'
}

estimator = PyTorch(source_directory = training_dir, 
                    conda_packages = ['pandas', 'numpy', 'scikit-learn'],
                    pip_packages = ['torch==1.0.0','torchvision'],
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script='entry.py',
                    use_gpu=True)

## Hyperparameters tuning using Hyperdrive

Here, we define the Hyperdrive configuration. Since we want the predicted failures to be true equipment failures, we configure Hyperdrive to optimize the precision metric.

For completeness, we will track recall and F1 as well.
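To make the three tracked metrics concrete, here is a small worked example in plain Python. It is a sketch rather than the code used in `train.py`, which relies on scikit-learn; the labels are made up, and treating `1` as the failure class is an assumption.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical labels: 3 real failures, 3 predicted failures, 2 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

Optimizing for precision favors models that raise few false alarms; recall shows how many real failures are missed in exchange.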

In [156]:
param_sampling = RandomParameterSampling( {
        'learning_rate':uniform(1e-4, 1e-2),
        'l2':uniform(1e-5, 1e-3),
        'dropout':uniform(.5,.7),
        'batch_size':choice(16,32,64),
        'hidden_units':choice(4,6,8)
    }
)

termination_policy = BanditPolicy(slack_factor=.1, evaluation_interval=1, delay_evaluation=1)

hd_run_config = HyperDriveRunConfig(estimator=estimator,
                                            hyperparameter_sampling=param_sampling, 
                                            policy=termination_policy,
                                            primary_metric_name='precision',
                                            primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                            max_total_runs=4,
                                            max_concurrent_runs=2)
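The `slack_factor` of the Bandit policy can be illustrated with a little arithmetic. As the Azure ML documentation describes it, for a metric that is maximized, a run is stopped at an evaluation interval when its metric falls below the best run's metric divided by `1 + slack_factor`. The helpers below are hypothetical names written for this sketch, not part of the SDK, and the precision values are made up.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being stopped,
    for a primary metric that is being maximized."""
    return best_metric / (1.0 + slack_factor)

def should_terminate(run_metric, best_metric, slack_factor):
    """True when the run falls outside the allowed slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)

best_precision = 0.80   # hypothetical best precision reported so far
slack_factor = 0.1      # same value as in the configuration above

print(round(bandit_cutoff(best_precision, slack_factor), 3))   # 0.727
print(should_terminate(0.70, best_precision, slack_factor))    # True
print(should_terminate(0.75, best_precision, slack_factor))    # False
```

With only four total runs, as configured here, early termination saves little; the policy matters more when `max_total_runs` is increased.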

We submit the experiment for execution and monitor the run through the widget.

In [157]:
experiment = Experiment(workspace=ws, name=exp_name)
run = experiment.submit(hd_run_config)


In [158]:
RunDetails(run).show()


## Model registration

Finally, we register the best model found by Hyperdrive, as ranked by the primary metric we selected.

In [159]:
best_run = run.get_best_run_by_primary_metric()

model = best_run.register_model(model_name='deep_pdm', model_path='outputs/network.pth')
print(model.name, 'saved')

deep_pdm saved


## Model operationalization


We are now ready to operationalize the model and deploy the web service. For testing purposes, we will use ACI (Azure Container Instances) to serve predictions.

For more details on the model deployment workflow in Azure Machine Learning service, click [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where#deployment-workflow).

In [164]:
%%writefile ./score/score.py

import os
import json
import numpy as np
import torch
from azureml.core.model import Model

def init():
    global model
    model_path =Model.get_model_path('deep_pdm')
    print("model loaded:",model_path)
    model = torch.load(model_path, map_location=torch.device('cpu'))
    model.eval()
    

def run(input_data):
    x_input = torch.tensor(json.loads(input_data)['input_data'])
    with torch.no_grad():
        score, log_proba = model(x_input)
    output = score.numpy().argmax(axis=1)[0]
    # the network returns log-probabilities; exponentiate to get a probability
    likelihood = np.exp(log_proba.numpy()).max(axis=1)[0]
    return json.dumps({'prediction': int(output), 'likelihood': float(likelihood)})

Overwriting ./score/score.py
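The scoring script exchanges JSON strings with its caller. The sketch below shows the shape of that exchange without contacting any service: the request wraps one window in a batch of size 1 under the `input_data` key, and the response carries `prediction` and `likelihood`. The helper names, the all-zero window and the sample response are hypothetical.

```python
import json

def build_payload(window):
    """Wrap one (sequence_length, n_features) window in a batch of size 1."""
    return json.dumps({'input_data': [window]})

def parse_response(raw):
    """Decode a JSON string in the format returned by run() in score.py."""
    result = json.loads(raw)
    return int(result['prediction']), float(result['likelihood'])

# Hypothetical window: 50 cycles x 25 features, all zeros.
window = [[0.0] * 25 for _ in range(50)]
payload = build_payload(window)

decoded = json.loads(payload)['input_data']
print(len(decoded), len(decoded[0]), len(decoded[0][0]))  # 1 50 25

# Hypothetical response, not an actual service output.
print(parse_response('{"prediction": 1, "likelihood": 0.87}'))  # (1, 0.87)
```

Sending one window per request keeps the payload small, at the cost of one HTTP round trip per engine.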


In [165]:
from azureml.core.conda_dependencies import CondaDependencies 

conda_env = CondaDependencies.create(conda_packages=['numpy'],
                                    pip_packages=['azureml-defaults','torch', 'torchvision'])

with open("./score/myenv.yml","w") as f:
    f.write(conda_env.serialize_to_string())

## Image creation

Here, we instantiate an image configuration object and then create the image.

In [166]:
from azureml.core.image import ContainerImage


cur_dir = os.getcwd()
os.chdir(os.path.join(cur_dir,'score')) 
# image_configuration requires the current directory to be the one where score.py resides
print("Switched current directory to",os.getcwd())

image_config = ContainerImage.image_configuration(execution_script = "score.py",
                                                 runtime = "python",
                                                 conda_file = "myenv.yml",
                                                 dependencies = ["network.py"],
                                                 description = "Image of predictive maintenance model",
                                                 tags = { "type": "lstm_classifier"}
                                                 )

image = ContainerImage.create(name = "dpm-image", 
                              models = [model], 
                              image_config = image_config,
                              workspace = ws
                              )

image.wait_for_creation(show_output=True)


os.chdir(cur_dir)
print("Reverted to root directory")

Switched current directory to /home/sasuke/dev/amlsamples/deep_predictive_maintenance/score
Creating image
Running......................................................................
SucceededImage creation operation finished for image dpm-image:10, operation "Succeeded"
Reverted to root directory


In [167]:
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice


aci_config = AciWebservice.deploy_configuration(cpu_cores=2, 
                                               memory_gb=2, 
                                               tags={"type":"deep predictive maintenance"}, 
                                               description='Predict equipment failure')

service = Webservice.deploy_from_image(workspace=ws,
                                       name='predictive-maintenance-svc',
                                       deployment_config=aci_config,
                                       image = image)

service.wait_for_deployment(show_output=True)

Creating service
Running................................
SucceededACI service creation operation finished, operation "Succeeded"


In [168]:
from utils import to_tensors
import json
import numpy as np
import pandas as pd

path = os.path.join(os.getcwd(), 'data/preprocessed_test_file.csv')
X,y,engine_ids = to_tensors(path, is_test = True)

print(X.shape)

(93, 50, 25)


In [172]:
output_df= pd.DataFrame(columns = ['engine ID', 'prediction', 'likelihood'])

for i,x in enumerate(X):
    output =service.run(json.dumps({'input_data': x[np.newaxis,:].tolist()}))
    print(output)
    output = json.loads(output)
    output_df.loc[i] = [engine_ids[i], output['prediction'], round(output['likelihood'],2)]


output_df

WebserviceException: Received bad response from service:
Response Code: 502
Headers: {'Connection': 'keep-alive', 'Content-Length': '47', 'Content-Type': 'text/html; charset=utf-8', 'Date': 'Tue, 01 Jan 2019 20:01:08 GMT', 'Server': 'nginx/1.10.3 (Ubuntu)', 'X-Ms-Request-Id': '0c78cfd4-8618-4933-87dd-0bacdea2110c', 'X-Ms-Run-Function-Failed': 'True'}
Content: b'not enough values to unpack (expected 2, got 1)'