# 3D Technologies
# Reduce maintenance cost through predictive techniques
# August 2017
# Harsh Jain
# Import Libraries and Load Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
data = pd.read_csv('device_failure.csv')
# Explore and Visualize Data
data.head()
data.info()
# Observations
# No categorical features.
# No missing data (verified below). Probably has 0's or other 'filler' values.
# Homogeneous data types across attributes.
# Date field may need to be taken into account. Correlation?
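# Quick check backing the 'no missing data' observation above (added for verification):
print(data.isnull().sum())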
data.describe()
# Observations
# Most attributes have a min of 0.
# Attributes 2, 3, 4, 7, 8, and 9 have 25th, 50th, and 75th percentiles equal to 0: this is a very sparse dataset, with most values = 0.
# This is validated next.
# Zero count to validate the sparsity assumption
a = [np.count_nonzero(data['attribute%d' % i]) for i in range(1, 10)]
print('Attribute non-zero counts:', a)
zero_count = [len(data) - x for x in a]
print('Attribute zero counts:', zero_count)
plt.bar(np.arange(1, 10), zero_count, alpha=.5, color='r', label='zero count')
plt.yscale('log')
plt.xticks(np.arange(1, 10))
plt.xlabel('Attributes')
plt.ylabel('Log Count of Zero values')
plt.legend(loc='best')
plt.show()
#Target variable exploration - Class imbalance?
plt.hist(data.failure, bins=2, rwidth=.75, alpha=.5, align='mid')
plt.xticks([0,1])
plt.yscale('log')
plt.show()
fail_count = np.count_nonzero(data.failure)
form = lambda x: "%.3f" % x
print('Count of failures:', fail_count)
print('Rate of failure:', form(fail_count / float(len(data)) * 100), '%')
#Exploration by Device
x_device_count = data.device.nunique()
y_device_obs = data.device.value_counts()
plt.plot(np.arange(0,x_device_count), y_device_obs)
plt.xlabel('Unique Devices')
plt.ylabel('Count of Observations by device')
plt.show()
print "Number of unique devices: ", x_device_count
print "Percent of Device failure:", form((x_device_count/fail_count)), '%'
# Unique Dates and Number of obervations by Date [pre-sorted]
x_date_count = data.date.nunique()
y_day_plot = data.date.value_counts()
plt.plot(np.arange(0,x_date_count), y_day_plot)
plt.xlabel('Unique Dates (ordered)')
plt.ylabel('Count of Observations')
plt.show()
bydate = data.date.groupby([data.failure[data.failure == 1], data.date]).count()
bydate_count = bydate[bydate > 2].count()
print "Number of unique dates: ", x_date_count
print "Number of days with more than 2 Failures:", bydate_count
# Observations
#
# All attributes other than 1, 5, and 6 are mostly 0 values.
# Very high class imbalance in the dataset.
# The number of failures is 106, a 0.85% failure rate.
# < 1% minority class: models will need tuning to account for such rare true-positive events.
# However, the share of devices that ever fail is not insignificant, at roughly 9%.
#
# Date is highly correlated with the number of observations in the dataset. However, only 3 days have more than 2 failures.
# - No correlation of dates to failure.
# - The correlation of observation counts to dates can be addressed by shuffling when creating cross-validation sets.
#
# Need an algorithm that can handle highly skewed classes and sparse data.
# Also need stratified k-fold CV to provide the largest possible dataset for training.
# Use the confusion matrix and AUC to measure; minimize false positives and false negatives.
# Build a Benchmark model
# Uses a subset of features and a simple train/test split, with no hyperparameter tuning.
#Subset and split
bm_dtree = DecisionTreeClassifier()
bm_x = data[['attribute1', 'attribute5','attribute6']]
bm_y = data['failure']
bm_x_train, bm_x_test, bm_y_train, bm_y_test = train_test_split(bm_x, bm_y, test_size=.25, stratify=bm_y, random_state=412)
# Fit and Predict
bm_dtree.fit(bm_x_train, bm_y_train)
bm_predict = bm_dtree.predict_proba(bm_x_test)[:, 1]
bm_predict_label = (bm_predict >= 0.5).astype(int)  # hard labels for the confusion matrix and report
print('Average probability of failure is', form(np.mean(bm_predict) * 100), '%')
print('Number of positive predictions:', np.count_nonzero(bm_predict_label))
print('Accuracy:', bm_dtree.score(bm_x_test, bm_y_test))
print('Confusion Matrix \n', confusion_matrix(bm_y_test, bm_predict_label))
bm_fpr, bm_tpr, bm_threshold = roc_curve(bm_y_test, bm_predict)
bm_report = classification_report(bm_y_test, bm_predict_label, labels=[1], digits=3)
print(bm_report)
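# For reference, the benchmark AUC mentioned in the observations below can be
# computed directly from the predicted probabilities (added for completeness):
bm_auc = roc_auc_score(bm_y_test, bm_predict)
print('Benchmark AUC:', form(bm_auc))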
# Observations
# - Even though the "average probability" and "number of positive predictions" look reasonable and accuracy is 99.8%, these are not good measures here, because this is a highly skewed dataset with over 99% majority class.
# - Measure instead with the confusion matrix, precision, recall, F1 score, and AUC; these are much better indicators of performance.
# - The confusion matrix and AUC make it evident that the model completely fails to predict positive events in the test set.
# - With a minority class of < 1%, an AUC of ~50% is not surprising and is expected given the challenging nature of the data.
# Build a Robust Model
#
# Techniques to build a tuned, robust model and optimize performance:
# - Use the full attribute list.
# - Try multiple models to find which performs best.
# - Use GridSearchCV to estimate hyperparameters for the various models.
# - Use Stratified K-Fold to split and cross-validate the data, providing the largest possible training/test sets.
# - Shuffle folds to minimize the impact of the date correlation.
# - Use class weights to give importance to the positive class, via the parameter class_weight='balanced'. This weight is calculated as n_samples / (n_classes * np.bincount(y)) and comes to about 1:585 here; a quick sanity check follows the feature split below. [Reference](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
# - Assuming the misclassification cost of false positives equals that of false negatives, the best indicator of model quality is a measure that takes both precision and recall into account.
# - To evaluate model performance, F1 score and AUC should be used.
# - GridSearchCV is set to optimize scoring='f1'.
# - The models used don't require standardization, so no scaling is implemented.
#
# Experiment using supervised learning algorithms:
# - Decision Tree - simple classification model; works well with sparse datasets.
# - Random Forest - ensemble method to improve decision-tree performance and reduce individual feature variance.
# - Logistic Regression - classic classifier that handles sparse datasets, outputs probabilities, and has a tunable 'C' parameter.
# - Gaussian Naive Bayes - well-documented 'baseline' classifier, especially for high-bias problems like spam detection.
#
# Determine which algorithms, if any, succeed at predicting true positives, then explore improvement techniques.
# Split Features and Response variable
x = data.drop(['date','device','failure'], axis=1)
y = data['failure']
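# Quick sanity check of the 'balanced' class weight cited above (an added
# illustration mirroring the scikit-learn formula; it should come out near the
# ~1:585 ratio mentioned in the notes):
balanced_weights = len(y) / (2 * np.bincount(y))
print('Balanced class weights [no-failure, failure]:', balanced_weights)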
#Instantiate the various Classification algorithms to try
dtree = DecisionTreeClassifier(class_weight='balanced')
randf = RandomForestClassifier(class_weight='balanced')
logreg = LogisticRegression(class_weight='balanced', solver='liblinear')  # liblinear supports the l1 penalty searched below
gnb = GaussianNB()
#Hyperparameter grid
dtree_paramgrid = {'max_depth': range(4,10)}
randf_paramgrid = {'n_estimators': range(4,20,2), 'max_depth': range(4,10)}
c_space = np.logspace(-3, 3, 7)
logreg_paramgrid = {'C': c_space, 'penalty' : ('l1', 'l2')}
x_cv_dtree = GridSearchCV(dtree, dtree_paramgrid, cv = 3, scoring='f1')
x_cv_randf = GridSearchCV(randf, randf_paramgrid, cv = 3, scoring='f1')
x_cv_logreg = GridSearchCV(logreg, logreg_paramgrid, cv = 3, scoring='f1')
x_cv_gnb = gnb
# Initialize counters
i = 0
k = 5  # Folds
y_test = y_dtree = y_randf = y_logreg = y_gnb = np.array([])
# Stratified K-Fold Cross Validation
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=412)
for train, test in skf.split(x, y):
    x_train, x_test = x.iloc[train], x.iloc[test]
    y_train, y_test_fold = y.iloc[train], y.iloc[test]
    # Fit and Predict
    i += 1
    print('Split:', i)
    # Decision Tree
    print('Running Decision Tree...')
    x_cv_dtree.fit(x_train, y_train)
    y_pred_dtree_proba = x_cv_dtree.predict_proba(x_test)[:, 1]
    # Random Forest
    print('Running Random Forest...')
    x_cv_randf.fit(x_train, y_train)
    y_pred_randf_proba = x_cv_randf.predict_proba(x_test)[:, 1]
    # Logistic Regression
    print('Running Logistic Regression...')
    x_cv_logreg.fit(x_train, y_train)
    y_pred_logreg_proba = x_cv_logreg.predict_proba(x_test)[:, 1]
    # Gaussian Naive Bayes
    print('Running Naive Bayes...\n')
    x_cv_gnb.fit(x_train, y_train)
    y_pred_gnb_proba = x_cv_gnb.predict_proba(x_test)[:, 1]
    # Compile test and prediction results across folds
    y_test = np.append(y_test, y_test_fold)
    y_dtree = np.append(y_dtree, y_pred_dtree_proba)
    y_randf = np.append(y_randf, y_pred_randf_proba)
    y_logreg = np.append(y_logreg, y_pred_logreg_proba)
    y_gnb = np.append(y_gnb, y_pred_gnb_proba)
# Best Parameters
print('Best Decision Tree Parameters:', x_cv_dtree.best_params_)
print('Best Random Forest Parameters:', x_cv_randf.best_params_)
print('Best Logistic Regression Parameters:', x_cv_logreg.best_params_)
print('Best Naive Bayes Parameters: N/A')
# Observations
# In Logistic Regression, the L1 penalty lends itself to sparse datasets, so it makes sense that it led to the highest score; the coefficient check below illustrates this.
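# Added illustration (not part of the original analysis): with an L1 penalty,
# coefficients may shrink to exactly zero, which is what suits sparse inputs.
best_logreg = x_cv_logreg.best_estimator_
print('Logistic Regression non-zero coefficients:',
      np.count_nonzero(best_logreg.coef_), 'of', best_logreg.coef_.size)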
# Measure
# Since the goal is not to maximize true-positive predictions but to produce a 'probability of failure' while minimizing both false positives and false negatives, the best F1 scores and the AUC of the models determine the winner.
# Hard labels at the default 0.5 threshold
y_pred = [y_dtree.round(), y_randf.round(), y_logreg.round(), y_gnb.round()]
#Confusion matrices
conf_dtree = confusion_matrix(y_test, y_pred[0])
conf_randf = confusion_matrix(y_test, y_pred[1])
conf_logreg = confusion_matrix(y_test, y_pred[2])
conf_gnb = confusion_matrix(y_test, y_pred[3])
print('Decision Tree:\n', conf_dtree)
print('\nRandom Forest:\n', conf_randf)
print('\nLogistic Regression:\n', conf_logreg)
print('\nNaive Bayes:\n', conf_gnb)
# Classification Report
report_dtree = classification_report(y_test, y_pred[0], labels=[1], digits=3)
report_randf = classification_report(y_test, y_pred[1], labels=[1], digits=3)
report_logreg = classification_report(y_test, y_pred[2], labels=[1], digits=3)
report_gnb = classification_report(y_test, y_pred[3], labels=[1], digits=3)
print('Decision Tree Report:\n', report_dtree)
print('Random Forest Report:\n', report_randf)
print('Logistic Regression Report:\n', report_logreg)
print('Naive Bayes Report:\n', report_gnb)
# Observations
# Best precision: Naive Bayes has a much higher precision than the other models, at 0.034.
# Best recall: Decision Tree had the highest recall, at 0.632.
# Best F1 score: the highest overall performance based on the confusion matrix was Naive Bayes, with an F1 score of 0.047.
# Plot a histogram of probabilities
plt.hist(y_dtree, bins=20, alpha=.3, label='Decision Tree')
plt.hist(y_gnb, bins=20, alpha=.3, label='Naive Bayes')
plt.yscale('log')
plt.xlabel('Predicted Probability of Failure')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Plot a histogram of probabilities
plt.hist(y_randf, bins=20, alpha=.3, label='Random Forest')
plt.hist(y_logreg, bins=20, alpha=.3, label='Logistic Regression')
plt.yscale('log')
plt.xlabel('Predicted Probability of Failure')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Observations
# Decision Tree and Naive Bayes predictions take only a few distinct probability values (low cardinality), as the check below illustrates.
# Random Forest and Logistic Regression have a much smoother distribution of probabilities.
# If varying the threshold to trade precision against recall is required, Random Forest or Logistic Regression should be used.
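# Added sketch of threshold tuning (the 0.3 cut-off is arbitrary, chosen only
# to show the mechanics, not a recommendation). First, the cardinality check:
print('Distinct predicted probabilities - Decision Tree:', np.unique(y_dtree).size,
      '| Logistic Regression:', np.unique(y_logreg).size)
# Lowering the threshold trades precision for recall:
y_logreg_03 = (y_logreg >= 0.3).astype(int)
print('Logistic Regression @ threshold 0.3:\n', confusion_matrix(y_test, y_logreg_03))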
#Plot the ROC curve
fpr_dtree, tpr_dtree, threshold_dtree = roc_curve(y_test, y_dtree)
fpr_randf, tpr_randf, threshold_randf = roc_curve(y_test, y_randf)
fpr_logreg, tpr_logreg, threshold_logreg = roc_curve(y_test, y_logreg)
fpr_gnb, tpr_gnb, threshold_gnb = roc_curve(y_test, y_gnb)
plt.plot(fpr_dtree, tpr_dtree, label='Decision Tree')
plt.plot(fpr_randf, tpr_randf, label='Random Forest')
plt.plot(fpr_logreg, tpr_logreg, label='Logistic Regression')
plt.plot(fpr_gnb, tpr_gnb, label='Naive Bayes')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
# Calculate the Area Under the Curve
auc_dtree = roc_auc_score(y_test, y_dtree)
auc_randf = roc_auc_score(y_test, y_randf)
auc_logreg = roc_auc_score(y_test, y_logreg)
auc_gnb = roc_auc_score(y_test, y_gnb)
print('AUC Decision Tree:', auc_dtree)
print('AUC Random Forest:', auc_randf)
print('AUC Logistic Regression:', auc_logreg)
print('AUC Naive Bayes:', auc_gnb)
# Results
# A high AUC is the most useful indicator for problems with high class imbalance.
# Given one positive and one negative event, it measures the model's propensity to assign a higher probability to the positive one (illustrated in the sketch after this block).
# The ROC curve also explains the discrepancy between F1 score and AUC:
# even though Naive Bayes has a higher F1 score, its predictions are barely affected by the threshold, leading to a low overall AUC.
#
# The best performing models here are Logistic Regression and, surprisingly, Decision Tree (Random Forest wasn't far behind).
# Logistic Regression AUC - 0.80
# Decision Tree AUC - 0.77
#
# Therefore the best models to consider and improve are Logistic Regression and Decision Trees (e.g., gradient-boosted trees such as XGBoost).
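# Added illustration of the rank interpretation of AUC: estimate the
# probability that a randomly drawn positive scores above a randomly drawn
# negative (ties count as half); this should land near the AUC printed above.
rng = np.random.RandomState(412)
pos = y_logreg[y_test == 1]
neg = y_logreg[y_test == 0]
p = pos[rng.randint(len(pos), size=100000)]
n = neg[rng.randint(len(neg), size=100000)]
print('Pairwise AUC estimate (Logistic Regression):', form(np.mean((p > n) + 0.5 * (p == n))))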
# Conclusion
#
# The goal of the model was to predict the probability of device failure.
# - Initial data exploration found that the dataset (~124k observations with 9 attributes) is mostly sparse, has no missing or categorical data, but has a significant class imbalance.
# - After a quick benchmark model, a more robust model was built with multiple algorithms, grid-searched hyperparameters, and stratified k-fold cross-validation.
# - The results from the models were compared using F1 score and AUC.
# - Logistic Regression and Decision Tree yielded the most accurate results, with AUC scores of approximately 0.80 and 0.77 respectively.
# - Additional techniques that could improve the model:
# - XGBoost (a brief sketch follows below).
# - Downsampling the majority class.
# - Treating the problem as unsupervised anomaly detection.
# - SVM-based approaches.
# - Neural-network-based techniques.
# - Autoencoders.
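# A minimal sketch of the XGBoost suggestion above, assuming the optional
# xgboost package is installed. scale_pos_weight (negatives / positives) is
# the usual lever for class imbalance. Untuned, fit and scored on the last CV
# fold's split from the loop above, for illustration only.
from xgboost import XGBClassifier
spw = (len(y) - np.count_nonzero(y)) / np.count_nonzero(y)
xgb = XGBClassifier(n_estimators=100, max_depth=4, scale_pos_weight=spw)
xgb.fit(x_train, y_train)
print('XGBoost AUC (last fold):', form(roc_auc_score(y_test_fold, xgb.predict_proba(x_test)[:, 1])))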