# Step 3: Model Building

Using the labeled feature data set constructed in the `Code/2_feature_engineering.ipynb` Jupyter notebook, this notebook loads the feature data from an Azure Blob container and splits it into training and test data sets. We then build two machine learning models, a decision tree classifier and a random forest classifier, to predict when different components within our machine population will fail. We compare the two models and store the better performing model for deployment in an Azure web service. We will prepare and build the web service in the `Code/4_operationalization.ipynb` Jupyter notebook.

**Note:** This notebook will take about 3-5 minutes to execute all cells, depending on the compute configuration you have set up.

In [1]:
# import the libraries
import os
import glob

# for creating pipelines and model
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, VectorIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql import SparkSession

# For some data handling
import pandas as pd

# For Azure blob storage access
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess

spark = SparkSession.builder.getOrCreate()

# Load feature data set

We have previously created the labeled feature data set in the `Code/2_feature_engineering.ipynb` Jupyter notebook. Since the Azure Blob storage account name and account key are not passed between notebooks, you'll need to enter your credentials here again.

In [2]:
# Enter your Azure blob storage details here 
ACCOUNT_NAME = "<your blob storage account name>"

# You can find the account key under the Access Keys link in the
# Azure Portal (portal.azure.com) page for your Azure storage account.
ACCOUNT_KEY = "<your blob storage account key>"

#-------------------------------------------------------------------------------------------
# The data from the feature engineering notebook is stored in the feature engineering container.
CONTAINER_NAME = "featureengineering"

# Connect to your blob service     
az_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)

# We will store and read each of these data sets in blob storage in an 
# Azure Storage Container on your Azure subscription.
# See https://github.com/Azure/ViennaDocs/blob/master/Documentation/UsingBlobForStorage.md
# for details.

# This is the final feature data file.
FEATURES_LOCAL_DIRECT = 'featureengineering_files.parquet'

# This is where we store the final model data file.
LOCAL_DIRECT = 'model_result.parquet'

Load the data and dump a short summary of the resulting DataFrame.

In [3]:
# load the previously created final dataset into the workspace
# create a local path where we store results
if not os.path.exists(FEATURES_LOCAL_DIRECT):
    os.makedirs(FEATURES_LOCAL_DIRECT)
    print('DONE creating a local directory!')

# download the entire parquet result folder to local path for a new run 
for blob in az_blob_service.list_blobs(CONTAINER_NAME):
    if FEATURES_LOCAL_DIRECT in blob.name:
        local_file = os.path.join(FEATURES_LOCAL_DIRECT, os.path.basename(blob.name))
        az_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)

feat_data = spark.read.parquet(FEATURES_LOCAL_DIRECT)

feat_data.limit(10).toPandas().head(10)

DONE creating a local directory!


| | machineID | dt_truncated | volt_rollingmean_3 | rotate_rollingmean_3 | pressure_rollingmean_3 | vibration_rollingmean_3 | volt_rollingmean_24 | rotate_rollingmean_24 | pressure_rollingmean_24 | vibration_rollingmean_24 | ... | error5sum_rollingmean_24 | comp1sum | comp2sum | comp3sum | comp4sum | model | age | model_encoded | failure | label_e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27 | 2016-01-01 06:00:00 | 147.813753 | 410.546469 | 103.110374 | 39.881874 | 166.464991 | 449.92176 | 100.608971 | 40.31356 | ... | 0.0 | 504.0 | 564.0 | 444.0 | 399.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 1 | 27 | 2016-01-01 03:00:00 | 161.893907 | 457.866635 | 106.67166 | 42.281086 | 167.917852 | 459.85011 | 99.954524 | 40.198525 | ... | 0.0 | 504.0 | 564.0 | 444.0 | 399.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 2 | 27 | 2016-01-01 00:00:00 | 159.216094 | 466.617543 | 102.92824 | 39.135677 | 169.175332 | 456.416658 | 99.402692 | 39.688645 | ... | 0.0 | 504.0 | 564.0 | 444.0 | 399.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 3 | 27 | 2015-12-31 21:00:00 | 173.141342 | 466.089834 | 102.410363 | 40.737921 | 170.269608 | 453.365861 | 97.793726 | 39.614332 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 4 | 27 | 2015-12-31 18:00:00 | 173.328305 | 445.790528 | 96.623228 | 39.30975 | 168.427467 | 452.489297 | 96.946852 | 39.826918 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 5 | 27 | 2015-12-31 15:00:00 | 177.571978 | 413.721281 | 101.407082 | 43.319996 | 166.300368 | 453.315787 | 97.547494 | 39.738175 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 6 | 27 | 2015-12-31 12:00:00 | 167.647979 | 445.761921 | 92.272371 | 39.157228 | 166.66717 | 453.45099 | 98.124989 | 39.282556 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 7 | 27 | 2015-12-31 09:00:00 | 165.986147 | 479.098619 | 99.443408 | 39.040782 | 164.224087 | 456.703897 | 99.921575 | 39.200765 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 8 | 27 | 2015-12-31 06:00:00 | 164.557066 | 503.854515 | 97.879841 | 38.605762 | 163.387522 | 454.477489 | 100.385824 | 39.571223 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |
| 9 | 27 | 2015-12-31 03:00:00 | 171.953747 | 430.399022 | 102.257001 | 38.202043 | 163.563755 | 447.991413 | 101.242371 | 39.515151 | ... | 0.0 | 503.0 | 563.0 | 443.0 | 398.0 | model2 | 9 | (0.0, 0.0, 1.0) | 0.0 | 0.0 |


# Prepare the Training/Testing data

When working with time-stamped data such as the telemetry and errors in this example, the data must be split into training, validation, and test sets carefully to avoid overestimating model performance. In predictive maintenance, features are usually generated using lagging aggregates, so consecutive examples that fall into the same time window may have similar feature values. If training and testing data are split randomly, some of these similar examples may be selected for training while the rest leak into the testing data. A random split can also place training examples later in time than validation and testing examples, whereas predictive models should be trained on historical data and validated and tested on future data. For both reasons, validation and testing based on random sampling may give overly optimistic results. Because random sampling is not a viable approach here, cross-validation methods that rely on random samples, such as k-fold cross validation, are not useful either.

For predictive maintenance problems, a time-dependent splitting strategy is often a better way to estimate performance: the model is validated and tested on examples that are later in time than the training examples. For a time-dependent split, we pick a point in time, train the model on examples up to that point, and validate on the examples after it, assuming that data after the splitting point is not known. However, this affects the labelling of feature records that fall into the labelling window right before the split, because failure information is assumed to be unknown beyond the splitting cut-off. Those feature records cannot be labeled and are not used. Dropping them also prevents the leakage problem at the splitting point.

Validation can be performed by picking different split points and examining the performance of the models trained on different time splits. In the following, we use a single split point to train the model and look at performance on the other side of the split in the evaluation section.
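The time-dependent split described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code (the notebook splits with Spark date filters): the `time_split` helper, the toy 3-hourly records, and the 24-hour labelling window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def time_split(records, split_time, label_window):
    """Split records by time, dropping those inside the labelling window.

    Records earlier than (split_time - label_window) go to training,
    records at or after split_time go to testing, and records in between
    are discarded because their labels would need post-split information.
    """
    train_cutoff = split_time - label_window
    train = [r for r in records if r["dt"] < train_cutoff]
    test = [r for r in records if r["dt"] >= split_time]
    return train, test


# Hypothetical 3-hourly feature records for a single machine.
start = datetime(2015, 9, 28)
records = [
    {"machineID": 1, "dt": start + timedelta(hours=3 * i)} for i in range(32)
]

split_time = datetime(2015, 9, 30)
label_window = timedelta(hours=24)  # assumed labelling horizon

train, test = time_split(records, split_time, label_window)
dropped = len(records) - len(train) - len(test)
print("train=%d dropped=%d test=%d" % (len(train), dropped, len(test)))
```

Every training record is separated from every test record by more than the labelling window, so no label in the training set depends on information from the test period.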

In [4]:
# define list of input columns for downstream modeling - note model variable was removed as string was not supported
input_features = [
'volt_rollingmean_3',
'rotate_rollingmean_3',
'pressure_rollingmean_3',
'vibration_rollingmean_3',
'volt_rollingmean_24',
'rotate_rollingmean_24',
'pressure_rollingmean_24',
'vibration_rollingmean_24',
'volt_rollingstd_3',
'rotate_rollingstd_3',
'pressure_rollingstd_3',
'vibration_rollingstd_3',
'volt_rollingstd_24',
'rotate_rollingstd_24',
'pressure_rollingstd_24',
'vibration_rollingstd_24',
'error1sum_rollingmean_24',
'error2sum_rollingmean_24',
'error3sum_rollingmean_24',
'error4sum_rollingmean_24',
'error5sum_rollingmean_24',
'comp1sum',
'comp2sum',
'comp3sum',
'comp4sum',
'age' #,
#'model_encoded'    
]

label_var = ['label_e']
key_cols = ['machineID', 'dt_truncated']


Spark models require a vectorized data frame. We transform the dataset here and then split the data into a training and test set. We use this split data to train the model on 9 months of data (training data), and evaluate on the remaining 3 months (test data) going forward.

In [5]:
# assemble features
va = VectorAssembler(inputCols=input_features, outputCol='features')
feat_data = va.transform(feat_data).select('machineID','dt_truncated','label_e','features')

# set maxCategories so features with > 10 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", 
                               outputCol="indexedFeatures", 
                               maxCategories=10).fit(feat_data)

# fit on whole dataset to include all labels in index
labelIndexer = StringIndexer(inputCol="label_e", outputCol="indexedLabel").fit(feat_data)

# split the data into train/test based on date
training = feat_data.filter(feat_data.dt_truncated > "2015-01-01").filter(feat_data.dt_truncated < "2015-09-30")
testing = feat_data.filter(feat_data.dt_truncated > "2015-09-30")

print(training.count())
print(testing.count())

2174000
747000


# Classification models

In predictive maintenance, machine failures are usually rare occurrences in the lifetime of the assets compared to normal operation. This causes an imbalance in the label distribution, which usually leads to poor performance: algorithms tend to classify majority class examples better at the expense of minority class examples, because the total misclassification error improves most when the majority class is labeled correctly. The result is low recall even though accuracy can be high, which becomes a larger problem when the cost of missed failures to the business is very high. To help with this problem, sampling techniques such as oversampling of the minority examples are usually used, along with more sophisticated techniques that are not covered in this notebook.

Also, due to the class imbalance problem, it is important to look at evaluation metrics other than accuracy alone and compare those metrics to the baseline metrics which are computed when random chance is used to make predictions rather than a machine learning model. The comparison will bring out the value and benefits of using a machine learning model better.
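To make the baseline comparison concrete, the sketch below computes two simple baselines from the test-set class counts, which are the row totals of the confusion matrices shown in this notebook. The choice of baselines (always predicting the majority class, and guessing at random in proportion to the class frequencies) is an assumption for illustration; the notebook itself does not compute them.

```python
# Test-set class counts, taken as row totals of the confusion matrices
# in this notebook. Index 0 is "no failure"; 1-4 are the failure labels.
class_counts = [734802, 4702, 3630, 2143, 1723]
total = sum(class_counts)
priors = [n / total for n in class_counts]

# Baseline 1: always predict the majority class.
majority_accuracy = max(priors)

# Baseline 2: guess at random in proportion to the class priors.
# The expected accuracy is the sum of squared priors, and the expected
# recall for each class equals that class's prior.
random_accuracy = sum(p * p for p in priors)
random_recall = priors

print("Majority-class accuracy = %g" % majority_accuracy)
print("Random-guess accuracy   = %g" % random_accuracy)
for label, recall in enumerate(random_recall):
    print("Random-guess recall for label %d = %g" % (label, recall))
```

Both baselines reach high accuracy while finding almost none of the failures, which is why accuracy alone says little about a model trained on this data.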

We will build and compare two models, a Random Forest Classifier and a Decision Tree Classifier. To compare these models, we compute weighted precision, weighted recall, and F1 score along with accuracy.

## Decision Tree Classifier

Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

Remember, we build the model by training on the training data set, then evaluate the model using the testing data set.

In [6]:
# train a DT model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# chain indexers and tree in a Pipeline
pipeline_dt = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# train model.  This also runs the indexers.
model_dt = pipeline_dt.fit(training)

To evaluate this model, we predict the component failures over the test data set. The standard method of viewing this evaluation is with a _confusion matrix_ shown below.

In [7]:
# make predictions.
predictions_dt = model_dt.transform(testing)

predictions_dt.stat.crosstab('indexedLabel', 'prediction').show()

+-----------------------+------+----+----+----+----+
|indexedLabel_prediction|   0.0| 1.0| 2.0| 3.0| 4.0|
+-----------------------+------+----+----+----+----+
|                    0.0|734418|   4| 116|   1| 263|
|                    1.0|     4|4698|   0|   0|   0|
|                    2.0|     0|   0|3630|   0|   0|
|                    3.0|   275|   0|   0|1868|   0|
|                    4.0|    11|   0|   0|  28|1684|
+-----------------------+------+----+----+----+----+



The confusion matrix lists the true labels in rows and the predicted labels in columns. Label 0.0 corresponds to no failure, and labels 1.0-4.0 correspond to the four components in the machine. Because `StringIndexer` orders labels by descending frequency by default, `labelIndexer.labels` shows which original `label_e` value each index maps to.

So, the second number in the top row indicates how many records we predicted as a component 1 failure when no component actually failed. The second number in the second row indicates how many records we correctly predicted as a component 1 failure.

We read the numbers along the diagonal of the confusion matrix as correctly classified records. Off-diagonal numbers in the top row count false alarms, where the model predicted a failure when none occurred. Off-diagonal numbers in the first column count missed failures, where the model predicted no failure for a record with the indicated component failure. The remaining off-diagonal numbers count records where a failure was predicted for the wrong component.

When evaluating classification models, it is convenient to reduce the results in the confusion matrix to a single performance statistic. However, no single statistic suits every problem space. Below, we calculate four such statistics.

- **Accuracy**: reports how often we correctly predicted the labeled data. Unfortunately, when there is a class imbalance (a large number of one of the labels relative to others), this measure is biased towards the largest class, in this case the non-failure records.

Because of the class imbalance inherent in predictive maintenance problems, it is better to look at the remaining statistics instead. Here, positive predictions indicate a failure.

- **Weighted Precision**: Precision is the fraction of records predicted as positive that are truly positive. Precision drops when negative records are falsely classified as positive. The weighted version averages the per-class precision, weighting each class by its number of true records.

- **Weighted Recall**: Recall is the fraction of truly positive records that the model finds. Recall drops when positive records are falsely classified as negative. The weighted version averages the per-class recall with the same weights.

- **F1**: F1 considers both the precision and the recall: it is the harmonic mean of the two. An F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.
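As a cross-check on these definitions, the statistics can be computed by hand from the decision tree confusion matrix shown above. The sketch below is plain Python and independent of Spark; the `weighted_metrics` helper is hypothetical, and it assumes each per-class statistic is weighted by that class's share of the true labels.

```python
# Decision tree confusion matrix from this notebook:
# rows are true labels, columns are predicted labels.
confusion = [
    [734418, 4, 116, 1, 263],
    [4, 4698, 0, 0, 0],
    [0, 0, 3630, 0, 0],
    [275, 0, 0, 1868, 0],
    [11, 0, 0, 28, 1684],
]


def weighted_metrics(cm):
    """Return accuracy and support-weighted precision, recall and F1."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    w_precision = w_recall = w_f1 = 0.0
    for i in range(k):
        tp = cm[i][i]
        support = sum(cm[i])                         # true records of class i
        predicted = sum(cm[r][i] for r in range(k))  # records predicted as i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support / total
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return accuracy, w_precision, w_recall, w_f1


accuracy, w_precision, w_recall, w_f1 = weighted_metrics(confusion)
print("Accuracy = %g" % accuracy)
print("Weighted Precision = %g" % w_precision)
print("Weighted Recall = %g" % w_recall)
print("F1 = %g" % w_f1)
```

With this weighting, weighted recall always equals accuracy, because each class's recall is multiplied by its share of the true labels. The results can be compared with the values that `MulticlassClassificationEvaluator` reports for the same predictions.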

Below we calculate these evaluation statistics for the decision tree classifier.

In [8]:
# select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction")
print("Accuracy = %g" % evaluator.evaluate(predictions_dt, {evaluator.metricName: "accuracy"}))
print("Weighted Precision = %g" % evaluator.evaluate(predictions_dt, {evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall = %g" % evaluator.evaluate(predictions_dt, {evaluator.metricName: "weightedRecall"}))
print("F1 = %g" % evaluator.evaluate(predictions_dt, {evaluator.metricName: "f1"}))
print("")

Accuracy = 0.99906
Weighted Precision = 0.9991
Weighted Recall = 0.99906
F1 = 0.999061



Remember that this is a simulated data set. We would expect a model built on real world data to behave very differently. The accuracy may still be close to one, but the precision and recall numbers would be much lower.

## Random Forest Classifier

A random forest is an ensemble of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.

In [9]:
# train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=100)

# chain indexers and forest in a Pipeline
pipeline_rf = Pipeline(stages=[labelIndexer, featureIndexer, rf])

# train model.  This also runs the indexers.
model_rf = pipeline_rf.fit(training)

We again predict on the test data and show the confusion matrix as before.

In [10]:
# make predictions.
predictions_rf = model_rf.transform(testing)

predictions_rf.stat.crosstab('indexedLabel', 'prediction').show()

+-----------------------+------+----+----+----+---+
|indexedLabel_prediction|   0.0| 1.0| 2.0| 3.0|4.0|
+-----------------------+------+----+----+----+---+
|                    0.0|734747|   3|  52|   0|  0|
|                    1.0|     0|4702|   0|   0|  0|
|                    2.0|   576|   1|3053|   0|  0|
|                    3.0|    54|   0|   0|2089|  0|
|                    4.0|  1664|   0|   0|   0| 59|
+-----------------------+------+----+----+----+---+



And calculate the performance statistics.

In [11]:
# select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction")
print("Accuracy = %g" % evaluator.evaluate(predictions_rf, {evaluator.metricName: "accuracy"}))
print("Weighted Precision = %g" % evaluator.evaluate(predictions_rf, {evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall = %g" % evaluator.evaluate(predictions_rf, {evaluator.metricName: "weightedRecall"}))
print("F1 = %g" % evaluator.evaluate(predictions_rf, {evaluator.metricName: "f1"}))
print("")

Accuracy = 0.996854
Weighted Precision = 0.996852
Weighted Recall = 0.996854
F1 = 0.995783



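Comparing these statistics to those for the decision tree classifier above, the two models score within a fraction of a percent of each other. In the run shown here the decision tree is marginally ahead on all four statistics (for example, an F1 of 0.999061 against 0.995783 for the random forest), and results may differ between runs. We store the random forest model in a serialized `Spark` model file for use in the next notebook.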
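Weighted statistics are dominated by the non-failure class, so it can also help to compare the two models label by label. The sketch below computes per-label recall from the two confusion matrices printed in this notebook; the `per_class_recall` helper is hypothetical and not part of the original pipeline.

```python
# Confusion matrices from this notebook:
# rows are true labels, columns are predicted labels.
cm_dt = [
    [734418, 4, 116, 1, 263],
    [4, 4698, 0, 0, 0],
    [0, 0, 3630, 0, 0],
    [275, 0, 0, 1868, 0],
    [11, 0, 0, 28, 1684],
]
cm_rf = [
    [734747, 3, 52, 0, 0],
    [0, 4702, 0, 0, 0],
    [576, 1, 3053, 0, 0],
    [54, 0, 0, 2089, 0],
    [1664, 0, 0, 0, 59],
]


def per_class_recall(cm):
    """Return recall for each label: diagonal count divided by row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


recall_dt = per_class_recall(cm_dt)
recall_rf = per_class_recall(cm_rf)

print("label  DT recall  RF recall")
for label, (r_dt, r_rf) in enumerate(zip(recall_dt, recall_rf)):
    print("%5d  %9.4f  %9.4f" % (label, r_dt, r_rf))
```

A per-label view like this shows differences that the weighted averages hide, such as a label that one model almost never predicts.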

In [12]:
# save model
model_rf.write().overwrite().save(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY']+'pdmrfull.model')
print("Model saved")

Model saved


## Conclusion

In the next Jupyter notebook, `Code/4_operationalization.ipynb`, we will create the functions needed to operationalize and deploy any model to get real-time predictions.
