# Using AI to Predict Server Hard Drive Failure

Project by Samy Djemaï during the LOG6309E course at Polytechnique Montréal.

Inspired by [*Predicting Disk Replacement Towards Reliable Data Centers* (Botezatu et al., 2016)](https://www.kdd.org/kdd2016/papers/files/adf0849-botezatuA.pdf).

## Abstract

Server hardware failures can cause unexpected crashes and downtime, which are costly for companies. Hard disk drive (HDD) failure is one such failure. This project aims to predict upcoming hard drive failures: the SMART statistics reported by the drives themselves are fed, as a multivariate time series, into a neural network, so that server operators can be warned of necessary disk replacements.

## Dataset Analysis

We first assess the dataset to find out which values are meaningful for our analysis.


In [95]:
import pandas as pd

df = pd.read_csv("data/Q1_2020/2020-03-31.csv")
df


Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2020-03-31,Z305B2QN,ST4000DM000,4000787030016,0,114.0,72273728.0,,,91.0,...,,,,,,,,,,
1,2020-03-31,ZJV0XJQ4,ST12000NM0007,12000138625024,0,84.0,226044912.0,,,89.0,...,,,,,,,,,,
2,2020-03-31,ZJV0XJQ3,ST12000NM0007,12000138625024,0,83.0,207232920.0,,,98.0,...,,,,,,,,,,
3,2020-03-31,ZJV0XJQ0,ST12000NM0007,12000138625024,0,82.0,176197928.0,,,93.0,...,,,,,,,,,,
4,2020-03-31,PL1331LAHG1S4H,HGST HMS5C4040ALE640,4000787030016,0,100.0,0.0,134.0,103.0,143.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132334,2020-03-31,ZA10MCEQ,ST8000DM002,8001563222016,0,68.0,6560712.0,,,92.0,...,,,,,,,,,,
132335,2020-03-31,ZCH0CRTK,ST12000NM0007,12000138625024,0,82.0,176676744.0,,,97.0,...,,,,,,,,,,
132336,2020-03-31,ZA13ZBCT,ST8000DM002,8001563222016,0,83.0,204034056.0,,,89.0,...,,,,,,,,,,
132337,2020-03-31,PL1331LAHGD9NH,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,134.0,100.0,100.0,...,,,,,,,,,,


The dataset contains 63 SMART attributes.
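
As a minimal sketch of how such a count can be obtained, the snippet below extracts the distinct SMART attribute IDs from column names. The `columns` list is a shortened, made-up stand-in for the real header, and the regular expression assumes the `smart_<id>_raw` / `smart_<id>_normalized` naming seen in the table above.

```python
import re

# Hypothetical, shortened stand-in for the real column header.
columns = [
    "date", "serial_number", "model", "capacity_bytes", "failure",
    "smart_1_normalized", "smart_1_raw",
    "smart_2_normalized", "smart_2_raw",
    "smart_255_normalized", "smart_255_raw",
]

# Each SMART attribute appears twice (raw and normalized), so we
# deduplicate on the numeric ID embedded in the column name.
pattern = re.compile(r"^smart_(\d+)_(?:raw|normalized)$")
attribute_ids = sorted(
    {int(m.group(1)) for c in columns if (m := pattern.match(c))}
)

print(attribute_ids)
print(len(attribute_ids), "distinct SMART attributes")
```

On the real frame, `columns` would simply be `df.columns`.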

In [106]:
df.isnull().sum().sort_values(ascending=False).head(72)


smart_255_raw           132339
smart_250_normalized    132339
smart_15_raw            132339
smart_15_normalized     132339
smart_13_raw            132339
                         ...  
smart_22_raw            118912
smart_183_normalized    112270
smart_183_raw           112270
smart_8_raw              94583
smart_8_normalized       94583
Length: 72, dtype: int64

Some SMART attributes are not reported by any drive in the dataset. We can drop around 35 SMART attributes (= 70 columns) without losing too much information: with the columns sorted by descending `NaN` count, we remove every SMART attribute that has more `NaN` values than `SMART_8`. In practice, this means keeping only the columns in which at least 20% of the rows are non-null.


In [105]:
df1 = df.dropna(axis=1, thresh=int(0.2 * df.shape[0]))
df1


Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_199_normalized,smart_199_raw,smart_200_normalized,smart_200_raw,smart_240_normalized,smart_240_raw,smart_241_normalized,smart_241_raw,smart_242_normalized,smart_242_raw
0,2020-03-31,Z305B2QN,ST4000DM000,4000787030016,0,114.0,72273728.0,,,91.0,...,200.0,0.0,,,100.0,37370.0,100.0,5.445219e+10,100.0,2.796870e+11
1,2020-03-31,ZJV0XJQ4,ST12000NM0007,12000138625024,0,84.0,226044912.0,,,89.0,...,200.0,0.0,100.0,0.0,100.0,13860.0,100.0,5.294202e+10,100.0,1.640811e+11
2,2020-03-31,ZJV0XJQ3,ST12000NM0007,12000138625024,0,83.0,207232920.0,,,98.0,...,200.0,0.0,100.0,0.0,100.0,11332.0,100.0,5.228804e+10,100.0,7.951115e+10
3,2020-03-31,ZJV0XJQ0,ST12000NM0007,12000138625024,0,82.0,176197928.0,,,93.0,...,200.0,0.0,100.0,0.0,100.0,15015.0,100.0,5.540053e+10,100.0,1.339848e+11
4,2020-03-31,PL1331LAHG1S4H,HGST HMS5C4040ALE640,4000787030016,0,100.0,0.0,134.0,103.0,143.0,...,200.0,0.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132334,2020-03-31,ZA10MCEQ,ST8000DM002,8001563222016,0,68.0,6560712.0,,,92.0,...,200.0,0.0,,,100.0,33084.0,100.0,8.139620e+10,100.0,1.917751e+11
132335,2020-03-31,ZCH0CRTK,ST12000NM0007,12000138625024,0,82.0,176676744.0,,,97.0,...,200.0,0.0,100.0,0.0,100.0,18576.0,100.0,6.776489e+10,100.0,1.413963e+11
132336,2020-03-31,ZA13ZBCT,ST8000DM002,8001563222016,0,83.0,204034056.0,,,89.0,...,200.0,0.0,,,100.0,29184.0,100.0,7.229272e+10,100.0,1.965452e+11
132337,2020-03-31,PL1331LAHGD9NH,HGST HMS5C4040BLE640,4000787030016,0,100.0,0.0,134.0,100.0,100.0,...,200.0,0.0,,,,,,,,
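
To make the `thresh` argument used above concrete, here is a minimal sketch on a toy frame (column names and values are made up for illustration). `thresh` is the minimum number of non-null values a column needs in order to be kept, so `int(0.2 * n_rows)` keeps the columns that are at least 20% populated.

```python
import numpy as np
import pandas as pd

# Toy data: three hypothetical SMART columns with different NaN ratios.
toy = pd.DataFrame({
    "smart_a_raw": [1.0, 2.0, 3.0, 4.0, 5.0],               # fully populated
    "smart_b_raw": [1.0, np.nan, np.nan, np.nan, np.nan],   # 20% populated
    "smart_c_raw": [np.nan] * 5,                            # never reported
})

# Fraction of missing values per column, most incomplete first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Keep columns having at least 20% non-null rows (here: at least 1 of 5).
min_non_null = int(0.2 * toy.shape[0])
kept = toy.dropna(axis=1, thresh=min_non_null)

print(missing_ratio)
print(list(kept.columns))
```

Only the never-reported column is dropped here; the 20%-populated one sits exactly on the threshold and is kept.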


We now want to know which drive models report the most SMART values. To do so, we keep only the raw values and drop the `date` and `capacity_bytes` columns.

In [152]:
df2 = df1.drop(df1.filter(like="normalized", axis=1).columns, axis=1)
df2.drop(labels=["date", "capacity_bytes"], axis=1, inplace=True)
df2


Unnamed: 0,serial_number,model,failure,smart_1_raw,smart_2_raw,smart_3_raw,smart_4_raw,smart_5_raw,smart_7_raw,smart_8_raw,...,smart_194_raw,smart_195_raw,smart_196_raw,smart_197_raw,smart_198_raw,smart_199_raw,smart_200_raw,smart_240_raw,smart_241_raw,smart_242_raw
0,Z305B2QN,ST4000DM000,0,72273728.0,,0.0,19.0,0.0,5.774153e+07,,...,22.0,,,0.0,0.0,0.0,,37370.0,5.445219e+10,2.796870e+11
1,ZJV0XJQ4,ST12000NM0007,0,226044912.0,,0.0,9.0,0.0,7.462914e+08,,...,27.0,226044912.0,,0.0,0.0,0.0,0.0,13860.0,5.294202e+10,1.640811e+11
2,ZJV0XJQ3,ST12000NM0007,0,207232920.0,,0.0,2.0,0.0,2.378736e+08,,...,31.0,207232920.0,,0.0,0.0,0.0,0.0,11332.0,5.228804e+10,7.951115e+10
3,ZJV0XJQ0,ST12000NM0007,0,176197928.0,,0.0,6.0,0.0,7.546835e+08,,...,24.0,176197928.0,,0.0,0.0,0.0,0.0,15015.0,5.540053e+10,1.339848e+11
4,PL1331LAHG1S4H,HGST HMS5C4040ALE640,0,0.0,103.0,543.0,12.0,0.0,0.000000e+00,42.0,...,29.0,,0.0,0.0,0.0,0.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132334,ZA10MCEQ,ST8000DM002,0,6560712.0,,0.0,4.0,8.0,3.544600e+09,,...,24.0,6560712.0,,0.0,0.0,0.0,,33084.0,8.139620e+10,1.917751e+11
132335,ZCH0CRTK,ST12000NM0007,0,176676744.0,,0.0,3.0,0.0,4.197787e+08,,...,22.0,176676744.0,,0.0,0.0,0.0,0.0,18576.0,6.776489e+10,1.413963e+11
132336,ZA13ZBCT,ST8000DM002,0,204034056.0,,0.0,6.0,0.0,3.351355e+09,,...,33.0,204034056.0,,0.0,0.0,0.0,,29184.0,7.229272e+10,1.965452e+11
132337,PL1331LAHGD9NH,HGST HMS5C4040BLE640,0,0.0,100.0,459.0,7.0,0.0,0.000000e+00,42.0,...,33.0,,0.0,0.0,0.0,0.0,,,,


In [153]:
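# int() truncates the per-model null fraction, so a column only counts (as 1) when a model never reports it.
df2.groupby(by='model').agg(lambda x: int(x.isnull().mean())).sum(axis=1).sort_values(ascending=True)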

model
ST10000NM0086                              3
ST6000DM001                                4
ST6000DM004                                4
ST6000DX000                                4
ST8000DM004                                4
ST8000DM005                                4
ST8000NM0055                               4
ST8000DM002                                4
ST500LM021                                 5
ST4000DM000                                5
ST4000DM005                                5
ST500LM030                                 5
ST12000NM0007                              6
ST12000NM0008                              6
ST16000NM001G                              7
WDC WD5000BPKT                             8
TOSHIBA MQ01ABF050M                        9
TOSHIBA MQ01ABF050                         9
TOSHIBA MG08ACA16TA                        9
TOSHIBA HDWE160                            9
TOSHIBA HDWF180                            9
ST500LM012 HN                              9
TOSH

Each number is the count of columns that a model never reports, so lower is better. Seagate drives report the most SMART attributes, followed by Western Digital, Toshiba and Hitachi (HGST) drives. Seagate BarraCuda SSDs and DELLBOSS VD drives do not provide useful information.
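
The per-model count can also be written with boolean reductions, which may be easier to read than truncating a mean. The sketch below runs on a toy frame with made-up model names and values; it is one possible equivalent formulation, not the code used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: two hypothetical models reporting different attributes.
toy = pd.DataFrame({
    "model": ["A", "A", "B", "B"],
    "smart_1_raw": [1.0, 2.0, 3.0, 4.0],
    "smart_2_raw": [np.nan, np.nan, 5.0, np.nan],
    "smart_3_raw": [np.nan, np.nan, np.nan, np.nan],
})

# For each model, flag the columns that are null for every drive,
# then count those columns per model.
unreported = (
    toy.set_index("model")
       .isnull()
       .groupby(level="model")
       .all()
       .sum(axis=1)
       .sort_values()
)

print(unreported)
```

Model `B` reports `smart_2_raw` on at least one drive, so it misses one column, while model `A` misses two.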

In [154]:
df2.groupby(by='model').agg({'serial_number': "count"}).sort_values(by="serial_number", ascending=False)

| model | serial_number |
| --- | --- |
| ST12000NM0007 | 36997 |
| ST4000DM000 | 19142 |
| ST8000NM0055 | 14464 |
| HGST HMS5C4040BLE640 | 12744 |
| ST12000NM0008 | 10876 |
| HGST HUH721212ALN604 | 10847 |
| ST8000DM002 | 9793 |
| TOSHIBA MG07ACA14TA | 7200 |
| HGST HMS5C4040ALE640 | 2896 |
| HGST HUH721212ALE600 | 1560 |


In [158]:
df2.groupby(by='model').agg({'failure': "sum"}).sort_values(by="failure", ascending=False)

| model | failure |
| --- | --- |
| DELLBOSS VD | 0 |
| HGST HDS5C4040ALE630 | 0 |
| ST8000DM002 | 0 |
| ST8000DM004 | 0 |
| ST8000DM005 | 0 |
| ST8000NM0055 | 0 |
| Seagate BarraCuda 120 SSD ZA250CM10003 | 0 |
| Seagate BarraCuda SSD ZA2000CM10002 | 0 |
| Seagate BarraCuda SSD ZA250CM10002 | 0 |
| Seagate BarraCuda SSD ZA500CM10002 | 0 |
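
The drive count and the failure count per model can be read side by side by computing both in a single `groupby`. The following is a minimal sketch using pandas named aggregation on a toy frame; the serial numbers, model names, and the output column names `drives` and `failures` are made up for illustration.

```python
import pandas as pd

# Toy data: five hypothetical drives across two models.
toy = pd.DataFrame({
    "serial_number": ["s1", "s2", "s3", "s4", "s5"],
    "model": ["A", "A", "A", "B", "B"],
    "failure": [0, 1, 0, 0, 0],
})

# One row per model: number of drives and number of failures that day.
summary = (
    toy.groupby("model")
       .agg(drives=("serial_number", "count"),
            failures=("failure", "sum"))
       .sort_values("failures", ascending=False)
)

print(summary)
```

Seeing both columns together makes it easier to tell whether a model with zero failures is reliable or simply rare in the fleet.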
