# Model Training

In the previous notebook, we featurized the raw telemetry dataset and, using failure records from the event logs, assigned a failure class to each entry (with 'None' denoting no failure). In this notebook, we train the Random Forest classifier<sup>[[1]](#ref_1)</sup> from Spark MLlib. (The choice of Random Forest and its tuning parameters was adopted from [this sample](https://github.com/Azure/MachineLearningSamples-PredictiveMaintenance/blob/master/Code/3_model_building.ipynb).)

In [1]:
%matplotlib inline
import os
import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf, mean, lit, stddev, col, expr, when
from pyspark.sql.types import DoubleType, ArrayType, ShortType, LongType, IntegerType
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer, IndexToString

STORAGE_ACCOUNT_SUFFIX = 'core.windows.net'
STAGING_STORAGE_ACCOUNT_NAME = os.getenv('STAGING_STORAGE_ACCOUNT_NAME')
STAGING_STORAGE_ACCOUNT_KEY = os.getenv('STAGING_STORAGE_ACCOUNT_KEY')
# os.path.join gives the same path whether or not the variable ends with a separator.
AZUREML_NATIVE_SHARE_DIRECTORY = os.path.join(os.getenv('AZUREML_NATIVE_SHARE_DIRECTORY'), 'Solution1')

### Loading the feature data with labels

In [2]:
# Reuse the active Spark session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()

# Configure Hadoop so that Spark can read from the staging storage account.
hc = spark.sparkContext._jsc.hadoopConfiguration()
hc.set("avro.mapred.ignore.inputs.without.extension", "false")
hc.set("fs.azure.account.key.{0}.blob.{1}".format(STAGING_STORAGE_ACCOUNT_NAME, STORAGE_ACCOUNT_SUFFIX),
       STAGING_STORAGE_ACCOUNT_KEY)

# Location of the labeled feature data in the 'intermediate' container.
wasbUrlInput = "wasb://{0}@{1}.blob.{2}/features.parquet".format(
            'intermediate',
            STAGING_STORAGE_ACCOUNT_NAME,
            STORAGE_ACCOUNT_SUFFIX)

labeled_features_df = spark.read.parquet(wasbUrlInput)
labeled_features_df.printSchema()

root
 |-- machineID: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- ambient_temperature: double (nullable = true)
 |-- ambient_pressure: double (nullable = true)
 |-- speed: double (nullable = true)
 |-- temperature: double (nullable = true)
 |-- pressure: double (nullable = true)
 |-- f0: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- a0: double (nullable = true)
 |-- a1: double (nullable = true)
 |-- a2: double (nullable = true)
 |-- temperature_n: double (nullable = true)
 |-- pressure_n: double (nullable = true)
 |-- f0_n: double (nullable = true)
 |-- f1_n: double (nullable = true)
 |-- f2_n: double (nullable = true)
 |-- a0_n: double (nullable = true)
 |-- a1_n: double (nullable = true)
 |-- a2_n: double (nullable = true)
 |-- failure: string (nullable = true)



### Transforming individual features into a vector column

We use *VectorAssembler* here to combine all the features into a single feature vector. Vectors are what ML models such as logistic regression and decision trees expect as their input. Note that we also sort the feature column names alphabetically before passing them to *VectorAssembler*, so that individual elements of the feature vector are easier to identify.

In [3]:
features_sorted = sorted([c for c in labeled_features_df.columns if c not in ['machineID', 'timestamp', 'failure']])
va = VectorAssembler(inputCols=features_sorted, outputCol='features')
vectorized_features_df = va.transform(labeled_features_df)
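
Because the column names are sorted before assembly, position `i` in the feature vector corresponds to the `i`-th name in sorted order. The following plain-Python sketch reproduces that ordering from the schema printed above, without needing a Spark session. The name `feature_index` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
# Column names copied from the schema printed above.
columns = [
    'machineID', 'timestamp',
    'ambient_temperature', 'ambient_pressure', 'speed', 'temperature', 'pressure',
    'f0', 'f1', 'f2', 'a0', 'a1', 'a2',
    'temperature_n', 'pressure_n',
    'f0_n', 'f1_n', 'f2_n', 'a0_n', 'a1_n', 'a2_n',
    'failure',
]

# Same filtering and sorting as in the cell above.
excluded = {'machineID', 'timestamp', 'failure'}
features_sorted = sorted(c for c in columns if c not in excluded)

# Hypothetical lookup table: feature name -> position in the assembled vector.
feature_index = {name: i for i, name in enumerate(features_sorted)}

for i, name in enumerate(features_sorted):
    print(i, name)
```

A lookup like this can be useful later, for example when relating per-feature values reported by a model back to column names.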

### Indexers

In [4]:
featureIndexer = VectorIndexer(inputCol="features", 
                               outputCol="indexedFeatures", 
                               maxCategories=10).fit(vectorized_features_df)

labelIndexer = StringIndexer(inputCol="failure", outputCol="indexedLabel").fit(vectorized_features_df)

deIndexer = IndexToString(inputCol = "prediction", outputCol = "predictedFailure", labels = labelIndexer.labels)
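
*StringIndexer* maps each string label to a numeric index, by default giving index 0 to the most frequent label, and *IndexToString* maps predicted indices back to the original strings. The sketch below imitates that round trip in plain Python. It is an illustration of the idea, not Spark's implementation: the sample labels are made up, `fit_label_index` is a hypothetical helper, and the alphabetical tie-break is an assumption.

```python
from collections import Counter


def fit_label_index(values):
    """Return labels ordered by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


# Made-up sample of the 'failure' column, for illustration only.
sample = ['None', 'None', 'F01', 'None', 'F02', 'F01', 'None']

labels = fit_label_index(sample)                      # plays the role of labelIndexer.labels
to_index = {label: float(i) for i, label in enumerate(labels)}

indexed = [to_index[v] for v in sample]               # what 'indexedLabel' would hold
restored = [labels[int(i)] for i in indexed]          # what IndexToString would recover

print(labels)
print(indexed)
print(restored == sample)
```

Passing `labelIndexer.labels` to *IndexToString*, as the cell above does, is what keeps the forward and reverse mappings consistent.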

### 80/20 Train/Test split

In [5]:
training, test = vectorized_features_df.randomSplit([0.8, 0.2], seed=12345)
print('Training dataset: {0} records'.format(training.count()))
print('Test dataset: {0} records'.format(test.count()))

Training dataset: 161866 records
Test dataset: 40609 records


### Training

In [7]:
classifier = RandomForestClassifier(labelCol="indexedLabel",
                                    featuresCol="indexedFeatures",
                                    maxDepth=15,
                                    maxBins=32,
                                    minInstancesPerNode=1,
                                    minInfoGain=0.0,
                                    impurity="gini",
                                    numTrees=50,
                                    featureSubsetStrategy="sqrt",
                                    subsamplingRate = 0.632)
 
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classifier, deIndexer])

fitted_pipeline = pipeline.fit(training)
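
The notebook does not explain the choice of `subsamplingRate = 0.632`. One plausible reading, stated here as an assumption, is that it approximates 1 - 1/e, the expected fraction of distinct rows in a bootstrap sample drawn with replacement. The short sketch below checks that arithmetic; `expected_unique_fraction` is a hypothetical helper name.

```python
import math


def expected_unique_fraction(n):
    """Expected fraction of distinct rows when n rows are drawn with replacement from n rows."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# The last value is the training set size reported above.
for n in (10, 100, 161866):
    print(n, round(expected_unique_fraction(n), 4))

# Limit of the expression above as n grows.
limit = 1.0 - math.exp(-1.0)
print(round(limit, 4))
```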

### Testing (confusion matrix)

In [8]:
predictions = fitted_pipeline.transform(test)
conf_table = predictions.stat.crosstab('failure', 'predictedFailure')
confuse = conf_table.toPandas().sort_values(by=['failure_predictedFailure'])
confuse.head()

| | failure_predictedFailure | F01 | F02 | None |
|---|---|---|---|---|
| 0 | F01 | 4099 | 0 | 0 |
| 2 | F02 | 0 | 3958 | 123 |
| 1 | None | 0 | 53 | 32376 |
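
The confusion matrix can be summarized as overall accuracy plus per-class precision and recall. The sketch below does this in plain Python from the counts above, with rows as actual classes and columns as predicted classes. It reads the third row as the 'None' class, which is consistent with the row totals summing to the 40609 test records. The variable names are illustrative and not part of the original notebook.

```python
# Rows: actual failure class. Columns: predicted failure class.
labels = ['F01', 'F02', 'None']
confusion = {
    'F01':  {'F01': 4099, 'F02': 0,    'None': 0},
    'F02':  {'F01': 0,    'F02': 3958, 'None': 123},
    'None': {'F01': 0,    'F02': 53,   'None': 32376},
}

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in labels) / total

metrics = {}
for c in labels:
    true_pos = confusion[c][c]
    actual = sum(confusion[c].values())               # row total
    predicted = sum(confusion[a][c] for a in labels)  # column total
    metrics[c] = {
        'precision': true_pos / predicted if predicted else 0.0,
        'recall': true_pos / actual if actual else 0.0,
    }

print('accuracy: {0:.4f}'.format(accuracy))
for c in labels:
    print('{0}: precision={1:.4f} recall={2:.4f}'.format(
        c, metrics[c]['precision'], metrics[c]['recall']))
```

Per-class recall is worth reading alongside accuracy here, since the 'None' class makes up most of the test set and dominates the overall figure.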


### Persisting the model

In [9]:
model_path = os.path.join(AZUREML_NATIVE_SHARE_DIRECTORY, 'model')
model_archive_path = os.path.join(AZUREML_NATIVE_SHARE_DIRECTORY, 'model.tar.gz')

fitted_pipeline.write().overwrite().save(model_path)

import tarfile

# Bundle the saved model directory into a gzip-compressed archive.
with tarfile.open(model_archive_path, "w:gz") as tar:
    tar.add(model_path, arcname="model")
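
The `arcname="model"` argument stores the directory under the relative name `model` inside the archive, regardless of where it lives on disk. The following self-contained sketch shows that effect using a temporary directory and a placeholder file. The directory contents here are hypothetical stand-ins, not the actual files Spark writes.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the saved pipeline directory (placeholder contents).
    model_path = os.path.join(workdir, 'some', 'nested', 'model')
    os.makedirs(os.path.join(model_path, 'metadata'))
    with open(os.path.join(model_path, 'metadata', 'part-00000'), 'w') as f:
        f.write('{"placeholder": true}')

    # Archive the directory under the relative name 'model'.
    archive_path = os.path.join(workdir, 'model.tar.gz')
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(model_path, arcname='model')

    # List the archive members; none of them carry the original absolute path.
    with tarfile.open(archive_path, 'r:gz') as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

A consumer can therefore unpack the archive anywhere and pass the resulting `model` directory to `PipelineModel.load`.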

## References

<a name="ref_1"></a>1. [Random Forests](https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests). Machine Learning Library (MLlib) Guide.