![LearnAI Header](https://coursematerial.blob.core.windows.net/assets/LearnAI_header.png)


# Getting started with Machine Learning for Predictive Maintenance

In this lab, we will create our first Machine Learning solution for predictive maintenance. We will rely on a simple but powerful algorithm: [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression).

## Reading the data

We begin by reading the data that we finished pre-processing in a prior Notebook.

> *Note:* If you get an error message about a non-existent file, run the *feature_engineering* notebook from day 1 once more. Unfortunately, this will take a couple of minutes.

In [5]:
df = spark.read.parquet("dbfs:/FileStore/tables/preprocessed").cache()
display(df)

machineID,datetime,age,diff_error_0,diff_error_1,diff_error_2,diff_error_3,diff_error_4,diff_fail_0,diff_fail_1,diff_fail_2,diff_fail_3,diff_maint_0,diff_maint_1,diff_maint_2,diff_maint_3,pressure_ma_3,pressure_sd_3,rotate_ma_3,rotate_sd_3,vibration_ma_3,vibration_sd_3,volt_ma_3,volt_sd_3,y_0,y_1,y_2,y_3
16,2015-06-10T23:00:00.000+0000,3,489.0,318.0,257.0,617.0,2142.0,1673.0,233.0,593.0,3957.0,953.0,233.0,593.0,1313.0,103.36172205524504,7.328405874766052,470.40251318654725,92.20329334384516,36.4696066569045,3.306350914248545,162.77538146117274,17.15059004541141,0,0,0,0
16,2015-06-11T00:00:00.000+0000,3,490.0,319.0,258.0,618.0,2143.0,1674.0,234.0,594.0,3958.0,954.0,234.0,594.0,1314.0,99.363131214606,10.871060485258129,446.91811499211394,82.57552287693811,38.15925421485703,5.68406531953323,170.811166158514,1.9468855435949943,0,0,0,0
16,2015-06-11T01:00:00.000+0000,3,491.0,320.0,259.0,619.0,2144.0,1675.0,235.0,595.0,3959.0,955.0,235.0,595.0,1315.0,101.39131534989146,10.023193445589618,440.0497420926933,78.79381696420629,41.11166660967236,4.009084955041661,172.64899000407274,5.118941409799331,0,0,0,0
16,2015-06-11T02:00:00.000+0000,3,492.0,321.0,260.0,620.0,2145.0,1676.0,236.0,596.0,3960.0,956.0,236.0,596.0,1316.0,98.36866553478744,8.587414855405909,472.6304420269753,36.17558170016903,40.805403622712525,4.297997479151865,168.56396337307723,11.543020480874953,0,0,0,0
16,2015-06-11T03:00:00.000+0000,3,493.0,322.0,261.0,621.0,2146.0,1677.0,237.0,597.0,3961.0,957.0,237.0,597.0,1317.0,94.53421765518716,5.968726222941755,460.0051519863225,14.962694219838603,40.318859899470176,4.948653154935858,165.95955799828124,11.578897552357208,0,0,0,0
16,2015-06-11T04:00:00.000+0000,3,494.0,323.0,262.0,622.0,2147.0,1678.0,238.0,598.0,3962.0,958.0,238.0,598.0,1318.0,95.50008347465094,4.557688829884692,463.0520936189128,9.315773330681766,40.23520369738577,4.834785987328766,169.3007077087945,14.443743556629142,0,0,0,0
16,2015-06-11T05:00:00.000+0000,3,495.0,324.0,263.0,623.0,2148.0,1679.0,239.0,599.0,3963.0,959.0,239.0,599.0,1319.0,93.33858647635662,2.6168933071748706,469.417713108793,15.782893006930236,39.08330534417315,4.326818066247427,165.246881515914,12.606379112554771,0,0,0,0
16,2015-06-11T06:00:00.000+0000,3,496.0,325.0,264.0,624.0,2149.0,1680.0,240.0,600.0,3964.0,960.0,240.0,600.0,1320.0,93.63201011705628,3.189767394885411,477.4132658130537,20.319239222031907,40.20824684814615,4.330782174852976,172.3030821213965,10.968488570133442,0,0,0,0
16,2015-06-11T07:00:00.000+0000,3,497.0,326.0,265.0,625.0,2150.0,1681.0,241.0,601.0,3965.0,961.0,241.0,601.0,1321.0,96.48519840591158,5.848380359115432,472.9917483316395,23.23167780183143,41.5679305797079,2.66313041318923,176.98691452605274,8.8624264766146,0,0,0,0
16,2015-06-11T08:00:00.000+0000,3,498.0,327.0,266.0,626.0,2151.0,1682.0,242.0,602.0,3966.0,962.0,242.0,602.0,1322.0,100.6964862682261,6.751963714111103,474.7697043013748,21.139977606679174,39.69481984381637,1.8619248647472573,174.94751064292976,8.013683276232097,0,0,0,0


In [6]:
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml import Pipeline

keys = ['machineID', 'datetime']
X_keep = ['diff_maint_1', 'diff_error_1', 'volt_sd_3', 'diff_fail_3', 'pressure_ma_3', 'pressure_sd_3', 'diff_fail_1', 'diff_fail_0', 'age', 'vibration_ma_3', 'rotate_ma_3', 'diff_error_2', 'diff_fail_2', 'diff_error_3', 'diff_maint_2', 'volt_ma_3', 'diff_maint_0', 'vibration_sd_3', 'diff_maint_3', 'rotate_sd_3', 'diff_error_0', 'diff_error_4']
Y_keep = ['y_0', 'y_1', 'y_2', 'y_3']

vassembler = VectorAssembler(inputCols = X_keep, outputCol = "features")
stndscaler = StandardScaler(inputCol = "features", outputCol = "norm_features")

pipeline = Pipeline(stages = [vassembler, stndscaler])
df_norm = pipeline.fit(df).transform(df).select(keys + ["norm_features"] + Y_keep)
display(df_norm)

machineID,datetime,norm_features,y_0,y_1,y_2,y_3
16,2015-06-10T23:00:00.000+0000,"List(1, 22, List(), List(0.20952993119406246, 0.3348105818258993, 2.9340970661577592, 1.5719932532053056, 15.161634257041639, 1.8693599215101442, 0.12001661058038639, 0.7754073254874088, 0.5147900362625834, 11.540620742484096, 15.802437535752096, 0.2566995351267684, 0.22541394137012827, 0.5941540693074987, 0.555498311305963, 19.261202705974416, 0.9069586781562903, 1.6933771813998613, 1.1116810754241957, 4.729865538607657, 0.6097039463397917, 1.223219863982335))",0,0,0,0
16,2015-06-11T00:00:00.000+0000,"List(1, 22, List(), List(0.21042920128502413, 0.3358634452907606, 0.3330702411101798, 1.5723905221598684, 14.575100183654946, 2.7730348349603267, 0.12053170332965843, 0.7758708086466959, 0.5147900362625834, 12.07530108160157, 15.013515867331735, 0.2576983660027481, 0.22579406606046573, 0.5951170418671543, 0.5564350706842193, 20.212076705271276, 0.9079103661711447, 2.9111448721925743, 1.1125277479873519, 4.235977976747721, 0.6109507846758649, 1.223790928344605))",0,0,0,0
16,2015-06-11T01:00:00.000+0000,"List(1, 22, List(), List(0.21132847137598576, 0.33691630875562195, 0.8757407723324422, 1.5727877911144312, 14.872604767108934, 2.5567574221353495, 0.12104679607893047, 0.776334291805983, 0.5147900362625834, 13.009576903233503, 14.782783610014254, 0.2586971968787277, 0.22617419075080322, 0.5960800144268099, 0.5573718300624755, 20.429546308533265, 0.9088620541859992, 2.053288703236145, 1.1133744205505083, 4.041983165661366, 0.6121976230119381, 1.224361992706875))",0,0,0,0
16,2015-06-11T02:00:00.000+0000,"List(1, 22, List(), List(0.2122277414669474, 0.3379691722204832, 1.974762526412058, 1.573185060068994, 14.429226792434449, 2.1905131121833645, 0.12156188882820251, 0.7767977749652701, 0.5147900362625834, 12.912661545379086, 15.87728132452455, 0.25969602775470735, 0.2265543154411407, 0.5970429869864655, 0.5583085894407318, 19.946165312632008, 0.9098137422008536, 2.201257835502328, 1.1142210931136645, 1.85574322800626, 0.6134444613480113, 1.224933057069145))",0,0,0,0
16,2015-06-11T03:00:00.000+0000,"List(1, 22, List(), List(0.21312701155790903, 0.33902203568534456, 1.980900321665731, 1.573582329023557, 13.86677006113971, 1.522527241845776, 0.12207698157747456, 0.7772612581245574, 0.5147900362625834, 12.758697269388929, 15.453154429694273, 0.260694858630687, 0.2269344401314782, 0.5980059595461211, 0.559245348818988, 19.637986155549584, 0.9107654302157081, 2.5344969570888316, 1.1150677656768206, 0.7675596954136652, 0.6146912996840845, 1.225504121431415))",0,0,0,0
16,2015-06-11T04:00:00.000+0000,"List(1, 22, List(), List(0.2140262816488707, 0.3400748991502059, 2.47101385326268, 1.57397959797812, 14.008448276294258, 1.1625940182485999, 0.12259207432674661, 0.7777247412838445, 0.5147900362625834, 12.732224691549103, 15.555511673702012, 0.26169368950666666, 0.22731456482181567, 0.5989689321057767, 0.5601821081972443, 20.033344232843074, 0.9117171182305626, 2.4761788691612816, 1.115914438239977, 0.47788266171745436, 0.6159381380201577, 1.226075185793685))",0,0,0,0
16,2015-06-11T05:00:00.000+0000,"List(1, 22, List(), List(0.21492555173983233, 0.3411277626150672, 2.1566803166002755, 1.5743768669326828, 13.691388669660423, 0.6675279113763612, 0.12310716707601865, 0.7781882244431316, 0.5147900362625834, 12.36771234148782, 15.7693547156615, 0.2626925203826463, 0.22769468951215316, 0.5999319046654323, 0.5611188675755006, 19.553655183215596, 0.912668806245417, 2.2160185568558455, 1.1167611108031332, 0.809634439570637, 0.6171849763562309, 1.226646250155955))",0,0,0,0
16,2015-06-11T06:00:00.000+0000,"List(1, 22, List(), List(0.21582482183079396, 0.3421806260799285, 1.876472474043178, 1.5747741358872456, 13.734429573335383, 0.8136589906231091, 0.1236222598252907, 0.7786517076024188, 0.5147900362625834, 12.723694334300845, 16.037952817565746, 0.2636913512586259, 0.22807481420249065, 0.6008948772250878, 0.5620556269537569, 20.38861504616414, 0.9136204942602714, 2.2180488105195746, 1.1176077833662896, 1.042340960735642, 0.618431814692304, 1.227217314518225))",0,0,0,0
16,2015-06-11T07:00:00.000+0000,"List(1, 22, List(), List(0.2167240919217556, 0.3432334895447898, 1.516170548956176, 1.5751714048418084, 14.152950051148023, 1.491828923766658, 0.12413735257456274, 0.779115190761706, 0.5147900362625834, 13.153959305988138, 15.889418845372498, 0.26469018213460555, 0.22845493889282814, 0.6018578497847434, 0.5629923863320131, 20.942852699162145, 0.9145721822751258, 1.3639460510233137, 1.1184544559294458, 1.1917438982265376, 0.6196786530283772, 1.227788378880495))",0,0,0,0
16,2015-06-11T08:00:00.000+0000,"List(1, 22, List(), List(0.21762336201271726, 0.34428635300965116, 1.3709688429175013, 1.5755686737963714, 14.770683628433092, 1.722318683536753, 0.12465244532383478, 0.7795786739209931, 0.5147900362625834, 12.561222981328413, 15.949146498532203, 0.26568901301058523, 0.2288350635831656, 0.602820822344399, 0.5639291457102693, 20.701530140188122, 0.9155238702899803, 0.9536014661530919, 1.1193011284926022, 1.084443385290901, 0.6209254913644504, 1.228359443242765))",0,0,0,0
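
The values in `norm_features` above are all positive because Spark's `StandardScaler`, with its default settings (`withStd=True`, `withMean=False`), only divides each feature by its sample standard deviation and does not subtract the mean. The following is a minimal plain-Python sketch of that scaling on a hypothetical toy column; `scale_to_unit_std` and `ages` are illustrative names, not part of the notebook.

```python
from statistics import stdev

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation, without centering."""
    sd = stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / sd for v in values]

ages = [2.0, 4.0, 6.0]            # hypothetical toy column, not the lab data
scaled = scale_to_unit_std(ages)  # the spread becomes 1, the mean stays non-zero
print(scaled)
```

If zero-centered features are wanted, `StandardScaler` also accepts `withMean=True`.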


Let's begin by dividing the data into training and test sets. With time-series data, we usually split on a time cut-off, and to avoid **leakage** we also leave a gap (two weeks in this case) between the training and test data. The data is collected hourly; if we do not wish to model at such a high frequency, we can additionally sample every n-th row of the data.

In [8]:
from datetime import datetime  # `pandas.datetime` was an alias for this class and is gone from recent pandas
from pyspark.sql.functions import col, hour

# Train on everything before 2015-10-01 and test on everything after 2015-10-15 (two-week gap).
# To keep only every 3rd hourly row, enable the commented-out `hour` condition below.
df_train = df_norm.filter((col('datetime') < datetime(2015, 10, 1))) # & (hour(col('datetime')) % 3 == 0))
df_test = df_norm.filter(col('datetime') > datetime(2015, 10, 15))
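
To make the split logic concrete without a Spark cluster, here is a small plain-Python sketch of the same idea: a time cut-off, a two-week gap, and optional sampling of every n-th hour. The function `split_with_gap`, its `every_nth_hour` parameter, and the generated `hourly` timestamps are hypothetical and only mirror the Spark filters above.

```python
from datetime import datetime, timedelta

TRAIN_END = datetime(2015, 10, 1)    # same cut-off as the Spark filter
TEST_START = datetime(2015, 10, 15)  # two weeks later, to limit leakage

def split_with_gap(timestamps, every_nth_hour=1):
    """Split timestamps on a cut-off with a gap; optionally thin the training set."""
    train = [t for t in timestamps
             if t < TRAIN_END and t.hour % every_nth_hour == 0]
    test = [t for t in timestamps if t > TEST_START]
    return train, test

# Hypothetical hourly readings covering 2015-09-30 through 2015-10-19.
hourly = [datetime(2015, 9, 30) + timedelta(hours=h) for h in range(24 * 20)]
train, test = split_with_gap(hourly, every_nth_hour=3)

# Nothing between the cut-offs ends up in either set.
print(len(train), len(test), min(test) - max(train))
```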

Let's look at some summary statistics for the labels in the data.

In [10]:
display(df_train.describe())

summary,machineID,y_0,y_1,y_2,y_3
count,654600.0,654600.0,654600.0,654600.0,654600.0
mean,50.5,0.0146379468377635,0.0188496791934005,0.0108203483043079,0.0150733272227314
stddev,28.866092096380104,0.1200987068394558,0.1359941066395186,0.1034566803921299,0.121844756591689
min,1.0,0.0,0.0,0.0,0.0
max,100.0,1.0,1.0,1.0,1.0


We now build a classifier for `y_0` (failure in the first component). We drop the other labels and the key columns, and rename `y_0` to `error`.

In [12]:
df_train = df_train.drop("y_1","y_2","y_3","datetime", "machineID")
df_train = df_train.withColumnRenamed("y_0", "error")
df_train.cache()

df_test = df_test.drop("y_1","y_2","y_3","datetime", "machineID")
df_test = df_test.withColumnRenamed("y_0", "error")
df_test.cache()

Let's make sure we don't have any null values in our DataFrame.

In [14]:
recordCount = df_train.count()
noNullsRecordCount = df_train.na.drop().count()

print("We have {} records that contain null values.".format(recordCount - noNullsRecordCount))

In [15]:
display(df_train.groupBy("error").count())

error,count
1,9582
0,645018
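
The counts above show that the classes are heavily imbalanced. As a quick arithmetic check, the sketch below derives the share of positive rows and the accuracy that a trivial "never predict a failure" model would reach on the training data. The variable names are illustrative; the two counts are copied from the output above.

```python
# Label counts copied from the groupBy output above.
counts = {1: 9582, 0: 645018}

total = sum(counts.values())
positive_rate = counts[1] / total               # share of rows labeled as failures
majority_baseline_accuracy = counts[0] / total  # accuracy of always predicting 0

print("rows: {}".format(total))
print("positive rate: {:.4f}".format(positive_rate))
print("majority-class accuracy: {:.4f}".format(majority_baseline_accuracy))
```

Because a model that never predicts a failure would already score very high on accuracy, a ranking metric such as the area under the ROC curve, used later in this notebook, is more informative here.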


## Train a Logistic Regression Model

Before applying a logistic regression model, we would normally need to do some data preparation, such as one-hot encoding categorical variables using `StringIndexer` and `OneHotEncoderEstimator` (renamed `OneHotEncoder` in Spark 3.0).

Let's take a look at the schema to check whether any columns are categorical. Since `df_train` now holds only the `norm_features` vector and the `error` label, no encoding is needed here.

In [17]:
df_train.printSchema()

## Setting up the model

We set the label column of the `LogisticRegression` model to `error`, and the features column to `norm_features`.

In [19]:
from pyspark.ml.classification import LogisticRegression

lr = (LogisticRegression()
     .setLabelCol("error")
     .setFeaturesCol("norm_features"))

### Hands-on lab
Create a pipeline that contains a single stage for the model we created above. Fit the pipeline to the training data, then use the fitted model to `transform` the test data.

In [21]:
# Solution:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [lr])
assert len(pipeline.getStages()) == 1 # make sure it's one stage only
print(pipeline.getStages())

lr_model = pipeline.fit(df_train)

df_pred = lr_model.transform(df_test) # apply the model to our held-out test set
display(df_pred)

norm_features,error,rawPrediction,probability,prediction
"List(1, 22, List(), List(2.2832467609516076, 2.51529081755369, 1.519700411830627, 2.774129109712572, 14.65828837599718, 1.4499966578532955, 1.6786872698775932, 1.3436376787734596, 0.5147900362625834, 12.71478234851845, 14.398613631067803, 0.25270421162284984, 0.007222369116412204, 0.04140782006519035, 0.017798428186868966, 21.80336393439427, 0.7032974429774381, 1.602436692749368, 1.844899515117534, 3.1141913325801536, 0.6296533597169628, 2.951260624211348))",0,"List(1, 2, List(), List(3.9330722076386344, -3.9330722076386344))","List(1, 2, List(), List(0.9807927282668616, 0.019207271733138492))",0.0
"List(1, 22, List(), List(2.2841460310425696, 2.5163436810185513, 1.2612647514705175, 2.7745263786671353, 13.913312489502026, 2.6689505781880656, 1.6792023626268653, 1.3441011619327468, 0.5147900362625834, 11.97281894929179, 15.280736027161465, 0.25370304249882947, 0.007602493806749688, 0.04237079262484594, 0.018735187565125228, 21.929706767556553, 0.7042491309922926, 2.917787940331837, 1.8457461876806904, 1.5608621270179244, 0.630900198053036, 2.951831688573618))",0,"List(1, 2, List(), List(3.695768493561573, -3.695768493561573))","List(1, 2, List(), List(0.9757731472741494, 0.024226852725850578))",0.0
"List(1, 22, List(), List(2.285045301133531, 2.5173965444834128, 1.555753196551649, 2.774923647621698, 14.28985610522001, 3.426542448341677, 1.6797174553761374, 1.3445646450920339, 0.5147900362625834, 11.369869738948456, 15.052497472245376, 0.25470187337480915, 0.007982618497087172, 0.04333376518450153, 0.01967194694338149, 21.61932052145031, 0.705200819007147, 1.8959090643538812, 1.8465928602438466, 1.8293690246135266, 0.6321470363891092, 2.952402752935888))",0,"List(1, 2, List(), List(3.9905525322847843, -3.9905525322847843))","List(1, 2, List(), List(0.9818461600830883, 0.018153839916911697))",0.0
"List(1, 22, List(), List(2.285944571224493, 2.5184494079482738, 3.117309020907023, 2.775320916576261, 13.661653347346657, 3.322156181515168, 1.6802325481254095, 1.345028128251321, 0.5147900362625834, 11.54527264471829, 14.777740364249428, 0.2557007042507888, 0.008362743187424656, 0.04429673774415712, 0.020608706321637752, 20.702627995836455, 0.7061525070220015, 2.295240674600147, 1.847439532807003, 2.3446082047151386, 0.6333938747251824, 2.952973817298158))",0,"List(1, 2, List(), List(4.881757811923947, -4.881757811923947))","List(1, 2, List(), List(0.9924734075696523, 0.007526592430347835))",0.0
"List(1, 22, List(), List(2.2868438413154544, 2.519502271413135, 2.0559155633955006, 2.7757181855308235, 13.347413778390939, 3.5722335442979083, 1.6807476408746813, 1.3454916114106081, 0.5147900362625834, 12.020891484985283, 14.606011399943256, 0.2566995351267684, 0.008742867877762142, 0.045259710303812706, 0.02154546569989401, 19.84231744673155, 0.7071041950368558, 2.8322702762557097, 1.8482862053701592, 1.9409373106097214, 0.6346407130612556, 2.953544881660428))",0,"List(1, 2, List(), List(5.931536390031717, -5.931536390031717))","List(1, 2, List(), List(0.9973526265302755, 0.002647373469724598))",0.0
"List(1, 22, List(), List(2.287743111406416, 2.5205551348779967, 2.268692880665532, 2.7761154544853865, 14.438656460947767, 3.8674317308246002, 1.6812627336239534, 1.3459550945698953, 0.5147900362625834, 12.207976982600718, 15.102024163892331, 0.2576983660027481, 0.009122992568099626, 0.0462226828634683, 0.022482225078150272, 19.957793549907436, 0.7080558830517103, 2.317763587651063, 1.8491328779333156, 2.853218066875978, 0.6358875513973288, 2.954115946022698))",0,"List(1, 2, List(), List(5.938064918009907, -5.938064918009907))","List(1, 2, List(), List(0.997369808376144, 0.0026301916238561304))",0.0
"List(1, 22, List(), List(2.2886423814973775, 2.5216079983428576, 2.222769142084244, 2.7765127234399496, 14.834933256855367, 4.800617627948995, 1.6817778263732255, 1.3464185777291826, 0.5147900362625834, 12.586585283114518, 14.609559098008353, 0.2586971968787277, 0.00950311725843711, 0.04718565542312389, 0.023418984456406534, 19.53907663469123, 0.7090075710665648, 2.3139422205274944, 1.8499795504964718, 3.6833554643739888, 0.637134389733402, 2.954687010384968))",0,"List(1, 2, List(), List(6.495596577377121, -6.495596577377121))","List(1, 2, List(), List(0.998492202816935, 0.001507797183064909))",0.0
"List(1, 22, List(), List(2.289541651588339, 2.522660861807719, 1.6902614525680717, 2.776909992394512, 15.35628034672665, 4.24545874851396, 1.6822929191224976, 1.3468820608884697, 0.5147900362625834, 11.720466899174284, 15.282207458608928, 0.25969602775470735, 0.009883241948774595, 0.048148627982779475, 0.024355743834662796, 20.30806516734302, 0.7099592590814192, 3.1077810889533657, 1.850826223059628, 3.391448381124417, 0.6383812280694752, 2.955258074747238))",0,"List(1, 2, List(), List(5.625193058232512, -5.625193058232512))","List(1, 2, List(), List(0.9964070886337475, 0.0035929113662524197))",0.0
"List(1, 22, List(), List(2.290440921679301, 2.5237137252725805, 1.9433522833806502, 2.7773072613490752, 16.309309924938415, 2.1725545289907076, 1.6828080118717696, 1.3473455440477569, 0.5147900362625834, 11.245491113698566, 15.892438367774947, 0.260694858630687, 0.010263366639112079, 0.04911160054243507, 0.02529250321291906, 20.90102293257184, 0.7109109470962737, 2.430729206710306, 1.8516728956227844, 4.1927599492263194, 0.6396280664055484, 2.955829139109508))",0,"List(1, 2, List(), List(4.958751376635604, -4.958751376635604))","List(1, 2, List(), List(0.9930272705362095, 0.006972729463790526))",0.0
"List(1, 22, List(), List(2.2913401917702627, 2.524766588737442, 3.000567477793781, 2.7777045303036383, 16.79313389883875, 2.7591888884564613, 1.6833231046210417, 1.347809027207044, 0.5147900362625834, 12.097660727426952, 15.224640483490687, 0.26169368950666666, 0.010643491329449563, 0.05007457310209065, 0.02622926259117532, 19.86051363916432, 0.7118626351111281, 2.974517467842894, 1.8525195681859405, 4.0053836364158135, 0.6408749047416216, 2.956400203471778))",0,"List(1, 2, List(), List(6.242338583901006, -6.242338583901006))","List(1, 2, List(), List(0.9980584758988666, 0.0019415241011334144))",0.0


### End of lab

In [23]:
df_pred.printSchema()
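
The `rawPrediction`, `probability`, and `prediction` columns in the output above are related in a simple way for binary logistic regression: each probability is the logistic (sigmoid) function of the corresponding raw score, and the predicted class is the one with the larger probability (assuming the default threshold of 0.5). The sketch below checks this in plain Python against the first row of the output; `sigmoid`, `raw`, and `prob` are illustrative names.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# rawPrediction of the first row in the output above.
raw = [3.9330722076386344, -3.9330722076386344]

prob = [sigmoid(z) for z in raw]                         # compare with `probability`
predicted = max(range(len(prob)), key=prob.__getitem__)  # class with larger probability

print(prob, predicted)
```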

## Evaluate the Model

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print(evaluator.explainParams())

In [26]:
evaluator.setLabelCol("error")
evaluator.setRawPredictionCol('rawPrediction')

metricName = evaluator.getMetricName()
metricVal = evaluator.evaluate(df_pred)

print("{}: {}".format(metricName, metricVal))

We could wrap this into a function to make it easier to get the output of multiple metrics.

In [28]:
evaluator = BinaryClassificationEvaluator()
evaluator.setLabelCol("error")
evaluator.setRawPredictionCol("rawPrediction")

auroc = evaluator.setMetricName("areaUnderROC").evaluate(df_pred)

print("AUROC: {}".format(auroc))
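
The wrapper function mentioned above could look like the following sketch. The name `evaluate_metrics` and its signature are assumptions made for illustration; it relies only on the `setMetricName(...).evaluate(...)` chaining already used in this notebook and on the two metric names that `BinaryClassificationEvaluator` supports, `areaUnderROC` and `areaUnderPR`.

```python
def evaluate_metrics(evaluator, df, metric_names=("areaUnderROC", "areaUnderPR")):
    """Return a dict of metric name -> value for the given predictions.

    `evaluator` is expected to behave like BinaryClassificationEvaluator:
    `setMetricName` returns the evaluator itself and `evaluate` returns a float.
    Note that the evaluator's metric name is changed as a side effect.
    """
    return {name: evaluator.setMetricName(name).evaluate(df)
            for name in metric_names}
```

In the notebook it would be called as `evaluate_metrics(evaluator, df_pred)`, using the evaluator configured in the cell above.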

## Conclusion
Hmmmm... our results are not great yet. We'll look into how to improve our results later.

In [30]:
# You can ignore this code, we use it for testing our notebooks.
assert auroc > .8

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.