![Backblaze logo](assets/Backblaze_logo.png)

## Objective
* Complete an Exploratory Data Analysis (EDA).
* Predict hard drive failure using additional SMART statistics.

## Background Information
* Data is an integral component of our society. From the calories tracked by an Apple Watch to the viewing history in a Netflix account, data is used in a myriad of applications. With such an abundance of data being used daily, how is it stored? The solution is computer backup or cloud storage services, and Backblaze is a world leader in computer backup and storage. Since 2013, Backblaze has published statistics and insights based on the hard drives in its data center [1]. In this study, we'll explore various features in a hard drive dataset to predict hard drive failure.

## Process:
* Exploratory Data Analysis conducted with various Python packages (NumPy, Matplotlib, Pandas, and Plotly).

* Binary Classification Algorithms (scikit-learn)
    * Logistic Regression

## Table of Contents:
* Part I: Exploratory Data Analysis
    * EDA
* Part II: Failure Prediction
    * Binary Classification

In [None]:
from plotly.subplots import make_subplots
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import seaborn as sns
import matplotlib.pyplot as plt
import glob
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Dataset

The original dataset is separated into four folders, one for each quarter of the year. Each quarter contains daily CSV files, which hold information on all the hard drives in operation. When a hard drive fails, it is removed from the following day's file. There are roughly 130 features in the dataset, and most are SMART stats, which are industry-standard statistics used by hard drive manufacturers for quality assurance. Although every SMART stat is available, many columns contain a large number of null entries because each manufacturer records only the SMART stats it deems necessary. Preprocessing is split into two distinct phases: EDA and Prediction.

    1. Create EDA dataframes
        a. Merge daily CSVs into quarterly CSVs (4 CSVs)
        b. Merge quarterly CSVs into a yearly CSV (1 CSV)
        c. Track the frequency of hard drive failures (1 CSV)
    2. Create Prediction dataframe
        a. Preprocess the yearly CSV file (1 CSV)
            i. Remove null entries
            ii. Scale columns by feature
# Exploratory Data Analysis

The EDA explores drive size, manufacturer, and age because these attributes are recorded for most drives: size comes from `capacity_bytes`, manufacturer from the `model` name, and age from the SMART 9 statistic (`smart_9_raw`).

### Daily and Quarterly Hard Drive and Failure Frequency Plot

Across the four quarters, there is an increasing trend in both the number of hard drives and the number of failures. The maximum number of hard drives and failures is 125K and 678, respectively. The minimum number of hard drives and failures is 108K and 444, respectively. There is an outlier visible in November, when many hard drives were taken down for maintenance and were not recorded.
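
As a compact way to reproduce the per-quarter figures, the daily table can be grouped by calendar quarter. The sketch below is an illustration only: the function name `quarterly_summary` is hypothetical, and it assumes the `date`, `hard_drives`, and `failures` columns found in `data/daily_hd_failures.csv`, with dates that `pd.to_datetime` can parse.

```python
import pandas as pd


def quarterly_summary(daily):
    """Summarise a daily table with `date`, `hard_drives` and `failures` columns."""
    daily = daily.copy()
    # Parse the date strings and build labels such as 'Q1' from the calendar quarter
    daily['Quarter'] = 'Q' + pd.to_datetime(daily['date']).dt.quarter.astype(str)
    # Maximum daily drive count and total failures within each quarter
    return daily.groupby('Quarter').agg(
        max_hard_drives=('hard_drives', 'max'),
        total_failures=('failures', 'sum'),
    )
```

Deriving the label from the parsed date avoids hard-coding the 2019 quarter boundaries, so the same sketch would work for another year of data.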

In [None]:
# Daily and Quarterly Hard Drive and Failure Frequency Plot

## Daily objects
daily_hd_failures = pd.read_csv('data/daily_hd_failures.csv')
dates = daily_hd_failures['date']
hard_drives = daily_hd_failures['hard_drives']
failures = daily_hd_failures['failures']

## Quarterly objects
### Total number of failures per quarter
def parse_quarters(x):
    """Label the 2019 quarter for an ISO-formatted (YYYY-MM-DD) date string."""
    if x <= '2019-03-31':
        return 'Q1'
    elif '2019-04-01' <= x <= '2019-06-30':
        return 'Q2'
    elif '2019-07-01' <= x <= '2019-09-30':
        return 'Q3'
    else:
        return 'Q4'

daily_hd_failures['Quarter'] = daily_hd_failures['date'].apply(parse_quarters)

### Max hard drives observed in each quarter
def quarterly_hard_drives(quarters):
    """Gather the maximum daily hard drive count observed in each quarter."""
    quarter_hd_list = [
        daily_hd_failures[daily_hd_failures['Quarter'] == x]['hard_drives'].max()
        for x in quarters
    ]
    # Return a DataFrame so the counts stay numeric; np.array would cast them to strings
    return pd.DataFrame({'Quarter': quarters, 'hard_drives': quarter_hd_list})

quarter_df = quarterly_hard_drives(['Q1', 'Q2', 'Q3', 'Q4'])
q_hard_drives = quarter_df['hard_drives'].values
q_failures = daily_hd_failures.groupby(['Quarter'])['failures'].sum().values

### Since the x-axis uses dates, select the midpoint of each quarter for visualization
quarters = ["2019-02-15", "2019-05-15",
            "2019-08-15", "2019-11-15"]
quarter_periods = ['Q1', 'Q2', 'Q3', 'Q4']

## Create figure
fig = go.Figure()

## Add traces
### Daily Failures
fig.add_trace(go.Scatter(
    x = dates,
    y = failures,
    yaxis = "y",
    name = "Daily Failures",
    text = failures,
    marker_color = 'red',
    hoverinfo = "name+x+text",
    line = {"width": 0.5},
    marker = {"size": 8},
    mode  =  "lines+markers",
    opacity  =  0.4,
    showlegend  =  True 
))

### Quarterly Failures
fig.add_trace(go.Scatter(
    x  =  quarters,     
    y  =  q_failures,
    yaxis  =  'y',
    marker_color  =  'red',
    name  =  'Quarterly_Failures',
    text  =  quarter_periods,
    hoverinfo  =  "name+text+y",
    line  =  {"width": 0.5},
    marker  =  {"size": 12},
    mode  =  "lines+markers",
    opacity  =  1.0,
    showlegend  =  True                  
))

### Daily Hard Drives
fig.add_trace(go.Scatter(
    x  =  dates,
    y  =  hard_drives,
    yaxis  =  "y2",
    name  =  "Daily Hard Drives",
    text  =  hard_drives,
    marker_color  =  'blue',
    hoverinfo  =  "name+x+text",
    line  =  {"width": 0.5},
    marker  =  {"size": 8},
    mode  =  "lines+markers",
    opacity  =  0.4,
    showlegend  =  True
))

### Quarterly Hard Drives
fig.add_trace(go.Scatter(
    x  =  quarters,
    y  =  q_hard_drives,
    marker_color  =  'Blue',
    name  =  'Quarterly_Hard Drives',
    yaxis  =  'y2',
    text  =  quarter_periods,
    hoverinfo  =  "name+text+y",
    line  =  {"width": 0.8},
    marker  =  {"size": 12},
    opacity  =  1.0,
    mode  =  "lines+markers",
    showlegend  =  True            
))

### Add annotations
fig.update_layout(
    annotations  =  [
        dict(
            x = "2019-02-15",
            y = 0,
            arrowcolor = "rgba(63, 81, 181, 0.2)",
            arrowsize = 0.3,
            ax = 0,
            ay = 30,
            text = "Q1",
            xref = "x",
            yanchor = "bottom",
            yref = "y"
        ),
        dict(
            x = "2019-05-15",
            y = 0,
            arrowcolor = "rgba(76, 175, 80, 0.1)",
            arrowsize = 0.3,
            ax = 0,
            ay = 30,
            text = "Q2",
            xref = "x",
            yanchor = "bottom",
            yref = "y"
        ),
        dict(
            x = "2019-08-15",
            y = 0,
            arrowcolor = "rgba(76, 175, 80, 0.1)",
            arrowsize = 0.3,
            ax = 0,
            ay = 30,
            text = "Q3",
            xref = "x",
            yanchor = "bottom",
            yref = "y"
        ),
        dict(
            x = "2019-11-15",
            y = 0,
            arrowcolor = "rgba(76, 175, 80, 0.1)",
            arrowsize = 0.3,
            ax = 0,
            ay = 30,
            text = "Q4",
            xref = "x",
            yanchor = "bottom",
            yref = "y"
        ),
        
    ],
)

### Add in fill shapes for the quarters
fig.update_layout(
    shapes = [
        dict(
            fillcolor = "rgba(63, 81, 181, 0.2)",
            line = {"width": 0},
            type = "rect",
            x0 = "2019-01-01",
            x1 = "2019-03-31",
            xref = "x",
            y0 = 0,
            y1 = 0.95,
            yref = "paper"
        ),
        dict(
            fillcolor = "rgba(76, 175, 80, 0.1)",
            line = {"width": 0},
            type = "rect",
            x0 = "2019-04-01",
            x1 = "2019-06-30",
            xref = "x",
            y0 = 0,
            y1 = 0.95,
            yref = "paper"
        ),
        dict(
            fillcolor = "rgba(231, 127, 255, 0.1)",
            line = {"width": 0},
            type = "rect",
            x0 = "2019-07-01",
            x1 = "2019-09-30",
            xref = "x",
            y0 = 0,
            y1 = 0.95,
            yref = "paper"
        ),
        dict(
            fillcolor = "rgba(0, 255, 255, 0.1)",
            line = {"width": 0},
            type = "rect",
            x0 = "2019-10-01",
            x1 = "2019-12-31",
            xref = "x",
            y0 = 0,
            y1 = 0.95,
            yref = "paper"
        ),
        
    ]
)

### Update axes - include rangeslider and autorange
fig.update_layout(
    xaxis = dict(
        autorange = True,
        range = ["2019-01-01", "2019-12-31"],
        rangeslider = dict(
            autorange = True,
            range = ["2019-01-01", "2019-12-31"]
        ),
        type = "date"
    ),
    yaxis = dict(
        anchor = "x",
        autorange = True,
        domain = [0, 0.5],
        linecolor = "red",
        mirror = True,
        showline = True,
        side = "left",
        tickfont = {"color": "red"},
        tickmode = "auto",
        ticks = "",
        title  =  dict(text = 'Failures', font = {"color": "red"}),
        type = "linear",
        zeroline = False
    ),
    yaxis2 = dict(
        anchor = "x",
        autorange = True,
        domain = [0.5, 1],
        linecolor = "blue",
        mirror = True,
        showline = True,
        side = "left",
        tickfont = {"color": "blue"},
        tickmode = "auto",
        ticks = "",
        title  =  dict(text = 'Hard Drives', font = {"color": "blue"}),
        type = "linear",
        zeroline = False
    ),

)


### Update x axes ticks
fig.update_xaxes( ticks  =  'outside',
    ticktext  =  ['Jan', 'Feb',
                'Mar', 'Apr',
                'May', 'Jun',
                'Jul', 'Aug',
                'Sep', 'Oct',
                'Nov', 'Dec'],
    tickvals  =  ['2019-01-01', '2019-02-01',
                '2019-03-01', '2019-04-01',
                '2019-05-01', '2019-06-01',
                '2019-07-01', '2019-08-01',
                '2019-09-01', '2019-10-01',
                '2019-11-01', '2019-12-01'],
)

# Update layout
fig.update_layout(
    dragmode = "zoom",
    hovermode = "x",
    height =  600,
    template = "plotly_white",
    title_text  =  'Daily and Quarterly Hard Drive and Failure Frequency',
)

fig.show()

### Drive Size

Across the four quarters, there is a trend toward larger hard drive sizes such as 12 TB, 14 TB, and 16 TB. Hard drive sizes range from <1 TB to 16 TB. As of quarter 4, 16 TB hard drives were added to the data centers. The hard drive size with the highest failure rate is <1 TB, with a failure rate of 7.668% and a count of 190 hard drives. The hard drive size with the greatest failure frequency is 12 TB, with a failure rate of 2.09% and a count of 1200 hard drives.
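
The two rankings in this paragraph (highest failure rate versus highest failure count) both come from one per-group summary. A minimal sketch of that summary is shown below; the function name `failure_rate_by` is hypothetical, and it assumes a 0/1 `failure` column like the one in the yearly dataframe.

```python
import pandas as pd


def failure_rate_by(df, column):
    """Count drives and failures per value of `column` and derive the failure rate (%)."""
    summary = df.groupby(column)['failure'].agg(count='size', failures='sum')
    summary['failure_rate_pct'] = (summary['failures'] / summary['count'] * 100).round(3)
    # Keep only groups with at least one failure, most failures first
    return summary[summary['failures'] > 0].sort_values('failures', ascending=False)
```

With such a table, `summary['failure_rate_pct'].idxmax()` gives the group with the highest failure rate and `summary['failures'].idxmax()` the group with the most failures. Because the rate is computed inside the group-by, each rate stays matched to its own group label.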

In [None]:
def drive_sizes(x):
    """Return the hard drive size in TB as a string label."""
    if x in range(0, 100):  # whole-number sizes
        return str(x) + ' TB'
    elif x < 1:
        return '<1 TB'
    else:
        return 'NAN'

def quarterly_hard_drive_sizes(csv):
    """Read a CSV of drive records and add a drive_size label column."""
    df = pd.read_csv(csv)
    # Drop rows with the placeholder capacity -1; copy to avoid SettingWithCopyWarning
    df = df[df['capacity_bytes'] != -1].copy()
    # Convert bytes to decimal terabytes (10^12 bytes)
    df['drive_size'] = (df['capacity_bytes'] / (10 ** 12)).round(2)
    df['drive_size'] = df['drive_size'].apply(drive_sizes)
    return df

# Load in yearly hard drive data
full_df = quarterly_hard_drive_sizes('data/Full-2019.csv')

#### Quarterly Frequency of Hard Drive Sizes

In [None]:
# Quarterly Frequency of Hard Drive Sizes

## Pandas Dataframe
### Instantiate dataframes
q1_df = quarterly_hard_drive_sizes('data/Q1-2019.csv')
q2_df = quarterly_hard_drive_sizes('data/Q2-2019.csv')
q3_df = quarterly_hard_drive_sizes('data/Q3-2019.csv')
q4_df = quarterly_hard_drive_sizes('data/Q4-2019.csv')

## Plotly parameters
fig = go.Figure()

### Quarter 1 Trace
fig.add_trace(go.Histogram(
    x = q1_df['drive_size'],
    marker_color = 'crimson',
    name = 'Quarter 1'
))

### Quarter 2 Trace
fig.add_trace(go.Histogram(
    x = q2_df['drive_size'],
    marker_color = 'blue',
    name = 'Quarter 2'
))

### Quarter 3 Trace
fig.add_trace(go.Histogram(
    x = q3_df['drive_size'],
    marker_color = 'green',
    name = 'Quarter 3'
))

### Quarter 4 Trace
fig.add_trace(go.Histogram(
    x = q4_df['drive_size'],
    marker_color = 'orange',
    name = 'Quarter 4'))


### X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive size')

### Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 type = 'log', title_text = 'Count')
    
### Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    xaxis = dict(
        categoryorder = 'total descending'
    ),
    barmode = 'group',
    title_text = 'Quarterly Frequency of Hard Drive Size'
)

fig.show()

#### Boxplot of Quarterly Drive Size

In [None]:
# Boxplot of Quarterly Drive Size

## Quarter Labels
q1_df['Quarter'] = 'Quarter 1'
q2_df['Quarter'] = 'Quarter 2'
q3_df['Quarter'] = 'Quarter 3'
q4_df['Quarter'] = 'Quarter 4'

## Plotly parameters
fig = go.Figure()

### Quarter 1 Trace
fig.add_trace(go.Box(
    x = q1_df['Quarter'],
    y = q1_df['capacity_bytes'],
   # marker_color = 'crimson',
    name = 'Quarter 1'
))

### Quarter 2 Trace
fig.add_trace(go.Box(
    x = q2_df['Quarter'],
    y = q2_df['capacity_bytes'],
   # marker_color = 'crimson',
    name = 'Quarter 2'
))

### Quarter 3 Trace
fig.add_trace(go.Box(
    x = q3_df['Quarter'],
    y = q3_df['capacity_bytes'],
   # marker_color = 'crimson',
    name = 'Quarter 3'
))

### Quarter 4 Trace
fig.add_trace(go.Box(
    x = q4_df['Quarter'],
    y = q4_df['capacity_bytes'],
   # marker_color = 'crimson',
    name = 'Quarter 4'
))

### X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,)

### Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive size (bytes)')
    
### Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    title_text = 'Boxplot of Quarterly Drive Size'
)

fig.show()

#### Frequency of Hard Drive Size Failure

In [None]:
# Measure the number of failures per hard drive size
tmp_failure = full_df[full_df['failure'] == 1]['drive_size'].value_counts()

# Count all drives per size, then drop hard drive sizes that have zero failures
data =  {'drive_size': full_df['drive_size'].value_counts().index.tolist(),
         'count': full_df['drive_size'].value_counts().values.tolist(),
        }

tmp_full_df = pd.DataFrame(data)
tmp_full_df = tmp_full_df[tmp_full_df['drive_size'].isin(tmp_failure.index.tolist())]

# Calculate the failure rate for each hard drive size
# Counts are matched to failures by drive size label, in the order of tmp_failure
counts = tmp_full_df.set_index('drive_size')['count'].reindex(tmp_failure.index.tolist())
tmp_list = np.round(tmp_failure.values / counts.values * 100, 3).tolist()

failure_rate = []
for x in tmp_list:
    failure_rate.append(str(x) + "%")


# Frequency of hard drive sizes that failed

fig = px.bar(    
    x = full_df[full_df['failure'] == 1]['drive_size'].value_counts().index,
    y = full_df[full_df['failure'] == 1]['drive_size'].value_counts().values,
    color = full_df[full_df['failure'] == 1]['drive_size'].value_counts().index,
    color_discrete_sequence = ['lightpink', 'blue',
                               'red', 'purple',
                               'lightblue', 'gold',
                               'gray'],
    text= failure_rate,
)

## X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive size')

## Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Count')

## Update traces 
fig.update_traces( textposition = 'outside')

## Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    title_text = 'Frequency of Hard Drive Size Failures'
)

fig.show()

### Drive Manufacturer

Ranked by hard drive count from highest to lowest, the four manufacturers are Seagate, Hitachi, Toshiba, and Western Digital. Seagate is the only drive manufacturer that has a complete boxplot. The highest failure rate is 4.884%, from Western Digital, with a count of ~20 hard drives. The largest failure count belongs to Seagate, with a failure rate of 2.172% and a count of 2000 hard drives.

In [None]:
def manufacturer_labels(x):
    """Label a drive's manufacturer from the first two characters of its model name."""
    tmp = x[:2]
    if tmp in ('ST', 'Se'):
        return 'Seagate'
    elif tmp in ('HG', 'Hi'):
        return 'Hitachi'
    elif tmp == 'TO':
        return 'Toshiba'
    elif tmp == 'WD':
        return 'Western Digital'
    elif tmp == 'DE':
        return 'Dell'
    return None  # unrecognised prefix

full_df['manufacturer'] = full_df['model'].apply(manufacturer_labels)

# Drop Dell entries, they are all NAN values
full_df = full_df[(full_df['manufacturer'] != 'Dell')] 
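
The labelling rule used here (the first two characters of the `model` string decide the manufacturer) can also be written as a vectorised lookup. The sketch below is one possible variant, not the code used for the results in this notebook; the names `PREFIX_TO_MANUFACTURER` and `label_manufacturers` are hypothetical.

```python
import pandas as pd

# Two-character model prefixes, taken from the labelling rules in this notebook
PREFIX_TO_MANUFACTURER = {
    'ST': 'Seagate', 'Se': 'Seagate',
    'HG': 'Hitachi', 'Hi': 'Hitachi',
    'TO': 'Toshiba',
    'WD': 'Western Digital',
    'DE': 'Dell',
}


def label_manufacturers(models):
    """Map a Series of model strings to manufacturer names (NaN when unknown)."""
    # Slice the first two characters of every model name, then look them up
    return models.str[:2].map(PREFIX_TO_MANUFACTURER)
```

A lookup table keeps the prefixes in one place, and unknown prefixes surface as missing values that can be counted with `isna()` before any rows are dropped.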

#### Frequency of Hard Drive Manufacturers and Drive Sizes

In [None]:
# Frequency of Hard Drive Manufacturers and Drive Sizes

fig = go.Figure()

### Seagate
fig.add_trace(go.Histogram(
    x = full_df[full_df['manufacturer'] == 'Seagate']['drive_size'],
    marker_color = 'crimson',
    name = 'Seagate'
))

### Hitachi
fig.add_trace(go.Histogram(
    x = full_df[full_df['manufacturer'] == 'Hitachi']['drive_size'],
    marker_color = 'blue',
    name = 'Hitachi'
))

### Toshiba
fig.add_trace(go.Histogram(
    x = full_df[full_df['manufacturer'] == 'Toshiba']['drive_size'],
    marker_color = 'green',
    name = 'Toshiba'
))

### Western Digital
fig.add_trace(go.Histogram(
    x = full_df[full_df['manufacturer'] == 'Western Digital']['drive_size'],
    marker_color = 'orange',
    name = 'Western Digital'))



## X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive size')

## Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Count', type = 'log' )

### Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    xaxis = dict(
        categoryorder = 'total descending'
    ),
    barmode = 'group',
    title_text = 'Frequency of Hard Drive Manufacturers and Drive Sizes'
)

fig.show()

#### Boxplot of Drive Manufacturers and Size

In [None]:
# Box plot of drive capacity by manufacturer

## Plotly parameters
fig = px.box(    
    x = full_df['manufacturer'],
    y = full_df['capacity_bytes'],
    color = full_df['manufacturer'],
    boxmode = "overlay",
)

fig.update_traces(quartilemethod="linear") # "linear" is the default; "exclusive" and "inclusive" are the alternatives

### X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive manufacturer')

### Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive size (bytes)')
    
## Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    xaxis = dict(
        categoryorder = 'total descending'
    ),
    
    title_text = 'Boxplot of Drive Manufacturer and Size'
)

fig.show()

#### Frequency of Hard Drive Manufacturer Failures

In [None]:
# Measure the number of failures per manufacturer
tmp_failure = full_df[full_df['failure'] == 1]['manufacturer'].value_counts()

# Count all drives per manufacturer, then drop manufacturers that have zero failures
data =  {'manufacturer': full_df['manufacturer'].value_counts().index.tolist(),
         'count': full_df['manufacturer'].value_counts().values.tolist(),
        }

tmp_full_df = pd.DataFrame(data)
tmp_full_df = tmp_full_df[tmp_full_df['manufacturer'].isin(tmp_failure.index.tolist())]

# Calculate the failure rate for each manufacturer
# Counts are matched to failures by manufacturer label, in the order of tmp_failure
counts = tmp_full_df.set_index('manufacturer')['count'].reindex(tmp_failure.index.tolist())
tmp_list = np.round(tmp_failure.values / counts.values * 100, 3).tolist()

failure_rate = []
for x in tmp_list:
    failure_rate.append(str(x) + "%")


# Frequency of failed hard drives by manufacturer

fig = px.bar(    
    x = full_df[full_df['failure'] == 1]['manufacturer'].value_counts().index,
    y = full_df[full_df['failure'] == 1]['manufacturer'].value_counts().values,
    color = full_df[full_df['failure'] == 1]['manufacturer'].value_counts().index,
    color_discrete_sequence = ['lightpink', 'blue',
                               'red', 'purple',
                               'lightblue', 'gold',
                               'gray'],
    text= failure_rate,
)

## X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Manufacturer')

## Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Count')

## Update traces 
fig.update_traces( textposition = 'outside')

## Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    title_text = 'Frequency of Hard Drive Manufacturer Failures'
)

fig.show()

### Drive Age

Hard drive ages range from 0 to 2500 days. Most of the older hard drives are Western Digital, and the most recent hard drives are Toshiba. The highest failure rate is 6.249%, in the 400-599 days interval. The highest failure count is in the 1000+ days interval, with a failure rate of 1.169% and a count of 550. From a drive manufacturer's perspective, longer lifetimes before failure are ideal. Shorter lifetimes in certain models could be attributed to defects or poor design.
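
The age intervals quoted above are 200-day bins with an open-ended final bin. A minimal sketch of that binning follows, assuming `smart_9_raw` holds power-on hours (SMART attribute 9). The function name `bin_drive_age`, its parameters, and the unbounded upper edge are assumptions for illustration and differ from the fixed upper edge used in the cell below.

```python
import pandas as pd


def bin_drive_age(power_on_hours, width=200, cap=1000):
    """Convert power-on hours to days and bin them into `width`-day intervals."""
    age_days = power_on_hours / 24
    # Edges 0, 200, ..., cap, followed by an unbounded final edge
    edges = list(range(0, cap + width, width)) + [float('inf')]
    labels = [f'{lo}-{lo + width - 1}' for lo in range(0, cap, width)] + [f'{cap}+']
    # right=False makes each bin [lo, lo + width), which matches labels like '0-199'
    return pd.cut(age_days, bins=edges, labels=labels, right=False)
```

With `right=False`, a drive that is exactly 200 days old lands in `200-399` rather than `0-199`, and a brand-new drive with 0 hours is kept in the first bin instead of becoming a missing value.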

#### Frequency of Hard Drive Age Failures

In [None]:
# Frequency of Hard Drive Age Failures
# Drive age in days: smart_9_raw (power-on hours) divided by 24
full_df['drive_age'] = full_df['smart_9_raw'] / 24

# Create Bins
bins = [i for i in range(0, 1200, 200)] + [2475]
labels = [str(i) + '-' + str(i + 199) for i in range(0, 1000, 200)] + ['1000+']
full_df['drive_age_bins'] = pd.cut(full_df['drive_age'], bins = bins, labels = labels)

# Measure the number of failures per hard drive age bin
tmp_failure = full_df[full_df['failure'] == 1]['drive_age_bins'].value_counts()

# Count all drives per age bin, keeping only the bins that appear in tmp_failure
data =  {'drive_age_bins': full_df['drive_age_bins'].value_counts().index.tolist(),
         'count': full_df['drive_age_bins'].value_counts().values.tolist(),
        }

tmp_full_df = pd.DataFrame(data)
tmp_full_df = tmp_full_df[tmp_full_df['drive_age_bins'].isin(tmp_failure.index.tolist())]

# Calculate the failure rate for each hard drive age bin
# Counts are matched to failures by bin label, in the order of tmp_failure
counts = tmp_full_df.set_index('drive_age_bins')['count'].reindex(tmp_failure.index.tolist())
tmp_list = np.round(tmp_failure.values / counts.values * 100, 3).tolist()
failure_rate = []
for x in tmp_list:
    failure_rate.append(str(x) + "%")

## Plotly parameters
fig = px.bar(    
    x = full_df[full_df['failure'] == 1]['drive_age_bins'].value_counts().index,
    y = full_df[full_df['failure'] == 1]['drive_age_bins'].value_counts().values,
    color = full_df[full_df['failure'] == 1]['drive_age_bins'].value_counts().index,
    text = failure_rate,
)

### X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive age (days)')

### Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Count')
    
## Update traces 
fig.update_traces( textposition = 'outside')
    
## Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    xaxis = dict(
        categoryorder = 'total descending'
    ),
    
    title_text = 'Frequency of Hard Drive Age Failures'
)

fig.show()

#### Boxplot of Drive Manufacturers and Age

In [None]:
# Box plot of drive age by manufacturer

## Plotly parameters
fig = px.box(    
    x = full_df['manufacturer'],
    y = full_df['drive_age'],
    color = full_df['manufacturer'],
    boxmode = "overlay",
)

### X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive manufacturer')

### Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive age (days)')
    
## Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    xaxis = dict(
        categoryorder = 'total descending'
    ),
    title_text = 'Boxplot of Drive Manufacturer and Age'
)

fig.show()

#### Boxplot of Drive Age and Size

In [None]:
# Box plot of drive age by drive size

## Plotly parameters
fig = px.box(
    x = full_df['drive_size'],
    y = full_df['drive_age'],
    color = full_df['drive_size'],
    boxmode = "overlay",
    color_discrete_sequence = ['red', 'blue',
                               'purple', 'lightpink',
                               'lightblue', 'gray',
                               'gold', 'orange',
                               'cyan', 'magenta']
)

### X Axis Parameters
fig.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', autorange = True,
                 mirror = True, gridcolor = 'LightPink', automargin = True, 
                 zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive size')

### Y Axis Parameters
fig.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                 mirror = True, gridcolor = 'LightPink',autorange = True,
                 zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                 ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
                 title_text = 'Drive age (days)')
    
## Layout   
fig.update_layout(
    legend = dict(
        x = 1,
        y = 1.003,
        traceorder = "normal",
        font = dict(
            family = "sans-serif",
            size = 14,
            color = "black"
        ),
        bgcolor = "#e5ecf6",
        bordercolor = "Black",
        borderwidth = 2
    ), 
    xaxis = dict(
        categoryorder = 'total descending'
    ),
    
    title_text = 'Boxplot of Hard Drive Age and Hard Drive Size'
)

fig.show()

# Prediction

In the prediction stage, the goal is to identify additional features that help classify hard drive failures. To characterize performance, we use the receiver operating characteristic (ROC) curve, with the area under the curve (AUC) as the main metric. Our baseline algorithm is logistic regression. We select pertinent SMART statistics from the dataset by researching their functions, then report their performance in our model [2].
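
One reason to prefer AUC over accuracy can be sketched with made-up labels. The example below is not drawn from the Backblaze data; it assumes failures are a small minority of rows and scores a hypothetical "model" that gives every drive the same failure score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 990 healthy drives (0) and 10 failed drives (1)
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that assigns the same score to every drive
constant_scores = np.zeros(len(y_true))
constant_labels = np.zeros(len(y_true), dtype=int)

# Accuracy rewards always predicting the majority class
accuracy = accuracy_score(y_true, constant_labels)

# AUC measures ranking quality, so an uninformative score earns no credit
auc_value = roc_auc_score(y_true, constant_scores)

print(f"Accuracy: {accuracy:.3f}")
print(f"ROC AUC: {auc_value:.3f}")
```

Under these assumptions, accuracy looks high even though the scores carry no information, while AUC stays at chance level.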

Through manual investigation, two additional features were selected for predicting hard drive failure.

Baseline Model:
* SMART 5 – Reallocated_Sector_Count.
* SMART 187 – Reported_Uncorrectable_Errors.
* SMART 188 – Command_Timeout.
* SMART 197 – Current_Pending_Sector_Count.
* SMART 198 – Offline_Uncorrectable.

Final Model:
* SMART 5 – Reallocated_Sector_Count.
* SMART 187 – Reported_Uncorrectable_Errors.
* SMART 188 – Command_Timeout.
* SMART 197 – Current_Pending_Sector_Count.
* SMART 198 – Offline_Uncorrectable.
* **SMART 199 – UDMA_CRC_Error_Count.**
* **SMART 242 – Total_LBAs_Read(Good).**
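
The two feature lists above can be turned into dataframe column names. The sketch below assumes the `smart_<id>_raw` naming that this notebook uses when loading the sample sets; `smart_raw_columns` is a hypothetical helper, not part of the original analysis.

```python
# SMART attribute IDs taken from the baseline and final model lists
baseline_ids = [5, 187, 188, 197, 198]
added_ids = [199, 242]


def smart_raw_columns(ids):
    """Hypothetical helper: build raw-value column names from SMART IDs."""
    return [f"smart_{smart_id}_raw" for smart_id in ids]


baseline_features = smart_raw_columns(baseline_ids)
final_features = smart_raw_columns(baseline_ids + added_ids)

print(baseline_features)
print(final_features)
```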

Feature Descriptions:
    

First, we generate the sample sets for prediction. We load the completed dataframe, remove null entries, and scale the columns.

In [None]:
def quarterly_hard_drive_sizes(csv):
    """Read a quarterly CSV, drop rows with an invalid capacity, and add a drive_size feature derived from the capacity in TB"""
    df = pd.read_csv(csv)
    df = df[df['capacity_bytes'] != -1]
    df['drive_size'] = round((df['capacity_bytes'] / (10 ** 12)),2)
    df['drive_size'] = df['drive_size'].apply(drive_sizes)
    return df

In [None]:
def sample_set(x_features, y_feature):
    """Create the sample sets for prediction (scaled X and y training/test sets)"""
    full_df = quarterly_hard_drive_sizes('data/Full-2019.csv')

    # Drop columns that have fewer than 92043 non-null entries
    sparse_columns = [col for col in full_df.columns if full_df[col].count() < 92043]
    full_df = full_df.drop(columns = sparse_columns)

    # Keep only rows where SMART 187 was recorded
    full_df = full_df[full_df['smart_187_normalized'].notna()]

    # Create Training and Testing Sets
    X = full_df[x_features]
    y = full_df[y_feature]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10, stratify = y)

    # Scale every feature column; the scaler is fit on the training set only
    ct = ColumnTransformer([
            ('scaler', StandardScaler(), list(X_train.columns))
        ], remainder = 'passthrough')
    X_train_scaled = ct.fit_transform(X_train)
    X_test_scaled = ct.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test

In [None]:
# Load in Baseline Set
X_train_scaled_baseline, X_test_scaled_baseline, y_train, y_test = sample_set(['smart_5_raw', 'smart_187_raw',
                                                                               'smart_188_raw', 'smart_197_raw',
                                                                               'smart_198_raw'],
                                                                              'failure')

# Load in Final Model Set
X_train_scaled_final, X_test_scaled_final, y_train, y_test = sample_set(['smart_5_raw', 'smart_187_raw',
                                                                               'smart_188_raw', 'smart_197_raw',
                                                                               'smart_198_raw', 'smart_199_raw',
                                                                               'smart_242_raw'],
                                                                              'failure')

Next, we tune the hyperparameters of our logistic regression model with a pipeline that applies recursive feature elimination followed by grid search.
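
The two-step idea can be sketched on synthetic data before running it on the full dataset. Everything in this example is an assumption made for illustration: the data comes from `make_classification`, the folds are reduced to 3, and the parameter grid is shortened so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the scaled SMART features (not real drive data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)

base_model = LogisticRegression(solver="liblinear", random_state=0)
folds = StratifiedKFold(n_splits=3)

# Step 1: remove one feature at a time, keeping the subset with the best CV AUC
selector = RFECV(estimator=base_model, step=1, cv=folds, scoring="roc_auc")

# Step 2: search penalty type and strength on the surviving features
search = GridSearchCV(base_model,
                      param_grid={"penalty": ["l1", "l2"], "C": [0.01, 1, 10]},
                      cv=folds, scoring="roc_auc")

demo_pipeline = Pipeline([("select", selector), ("search", search)])
demo_pipeline.fit(X_demo, y_demo)

# Read the fitted steps back from the pipeline
n_kept = int(demo_pipeline.named_steps["select"].get_support().sum())
best_params = demo_pipeline.named_steps["search"].best_params_

print("Features kept:", n_kept)
print("Best parameters:", best_params)
```

Note that in this arrangement the feature selection runs once, before the grid search, so the selected subset does not change with the hyperparameters being tried.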

In [None]:
# Logistic Regression Pipeline
## Instantiate logistic regression model
clf_LR = LogisticRegression(random_state = 0, solver = 'liblinear')

## First step of pipeline recursive feature elimination with cross validation (10-fold)
rfecv = RFECV(estimator = clf_LR, 
              step = 1, 
              cv = StratifiedKFold(10), 
              scoring = 'roc_auc')

## Second step of pipeline grid search using cross validation (10-fold)
CV_rfc = GridSearchCV(clf_LR, 
                      param_grid = {'penalty': ['l2', 'l1'],
                                    'C': [0.001,.009,0.01,.09,.5, 0.8, 1,5,10,25]},
                      cv = StratifiedKFold(10), scoring = 'roc_auc')

## Instantiate Pipeline
pipeline = Pipeline([('feature_sele', rfecv),('clf_LR_cv', CV_rfc)])
pipeline.fit(X_train_scaled_baseline, y_train)

## Assign model predictions on the baseline test set
y_pred_acc = pipeline.predict(X_test_scaled_baseline)

## Best parameters and features for the model
print(rfecv.get_support()) 
print('Best Penalty:', CV_rfc.best_estimator_.get_params()['penalty'])
print('Best C:', CV_rfc.best_estimator_.get_params()['C'])
print(CV_rfc.best_params_)

The optimal hyperparameters:
* C = 25
* Penalty = l2
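
In scikit-learn, `C` is the inverse of the regularization strength, so a value such as 25 means the L2 penalty is relatively weak. A minimal sketch on synthetic data (an assumption; these are not the SMART features) shows the coefficients growing as `C` increases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data used only to illustrate the effect of C
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

coef_norms = {}
for C in [0.001, 1, 25]:
    model = LogisticRegression(solver="liblinear", penalty="l2",
                               C=C, random_state=0)
    model.fit(X_demo, y_demo)
    # Euclidean norm of the fitted coefficient vector
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C}: coefficient norm = {coef_norms[C]:.3f}")
```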

In [None]:
# Tuned Hyperparameter
clf_LR = LogisticRegression(random_state = 0, solver = 'liblinear',
                            C = 25, penalty = 'l2')

Next, we obtain the performance metrics of our baseline and final models.
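
Accuracy, precision, recall, and F1 all derive from the confusion matrix. The sketch below uses ten made-up labels (not Backblaze data) to compute each metric by hand and compare it with scikit-learn's value.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions (1 = failure)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # share of predicted failures that were real
recall_manual = tp / (tp + fn)      # share of real failures that were caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Manual :", accuracy_manual, precision_manual, recall_manual, f1_manual)
print("sklearn:", accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```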

In [None]:
def performance_metrics(model, X_train, X_test, y_train, y_test):
    "Reports the classification metrics given the sample sets"
    ## Get the ROC AUC by 10-fold cross validation on the training set
    cv_scores = cross_val_score(model, X_train,
                                y_train, cv = StratifiedKFold(10),
                                scoring = 'roc_auc')
    print('cv_scores mean:{}'.format(np.mean(cv_scores)))

    ## Fit the model
    model.fit(X_train, y_train)

    # Get the model predictions
    y_pred_acc = model.predict(X_test)

    # Classification metrics
    print('Accuracy Score : ' + str(accuracy_score(y_test, y_pred_acc)))
    print('Precision Score : ' + str(precision_score(y_test, y_pred_acc)))
    print('Recall Score : ' + str(recall_score(y_test, y_pred_acc)))
    print('F1 Score : ' + str(f1_score(y_test, y_pred_acc)))
    print('ROC Score : ' + str(roc_auc_score(y_test, model.predict_proba(X_test)[:,1])))

    # Confusion matrix (rows: actual class, columns: predicted class)
    print(confusion_matrix(y_test, y_pred_acc))

In [None]:
# Baseline Model Metrics
performance_metrics(clf_LR, X_train_scaled_baseline, X_test_scaled_baseline, y_train, y_test)

In [None]:
# Final Model Metrics
performance_metrics(clf_LR, X_train_scaled_final, X_test_scaled_final, y_train, y_test)

Finally, we plot the ROC curves for both the baseline and final models.
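
As a quick illustration of what goes into such a plot, the sketch below traces a ROC curve for four made-up scores (an assumption, unrelated to the drive data) and integrates it with `sklearn.metrics.auc`.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Four made-up samples: true labels and model scores
y_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (false-positive rate, true-positive rate) point
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_demo, scores_demo)

# Area under the piecewise-linear curve through those points
area_demo = auc(fpr_demo, tpr_demo)

for x, y in zip(fpr_demo, tpr_demo):
    print(f"FPR = {x:.2f}, TPR = {y:.2f}")
print(f"AUC = {area_demo:.2f}")
```

One negative sample (score 0.4) outranks one positive sample (score 0.35), so three of the four positive-negative pairs are ordered correctly, which gives an AUC of 0.75.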

In [None]:
def roc_auc_scores(model, X_train, X_test, y_train, y_test ):
    """Given a classification algorithm, it returns the ROC curve with the AUC score,
    inspired from plotly-dash plots. """
    # Fit the model
    model.fit(X_train, y_train)
    
    # Decision scores for the test set (signed distance from the decision boundary)
    decision_test = model.decision_function(X_test)
    
    # Obtain the false-positive rate, true-positive rate, and thresholds from scikit-learn's roc_curve
    fpr, tpr, threshold = roc_curve(y_test, decision_test)

    # AUC Score
    auc_score = roc_auc_score(y_true = y_test, y_score = decision_test)
    
    return fpr, tpr, threshold, auc_score

In [None]:
# Obtain FPR, TPR, Threshold, and AUC Scores

## Baseline
fpr_baseline, tpr_baseline, threshold_baseline, auc_score_baseline = roc_auc_scores(clf_LR, X_train_scaled_baseline, X_test_scaled_baseline, y_train, y_test)

## Final Model
fpr_final, tpr_final, threshold_final, auc_score_final = roc_auc_scores(clf_LR, X_train_scaled_final, X_test_scaled_final, y_train, y_test)


# Figure
## Create our trace using a plotly go object
trace0 = go.Scatter(
    x = fpr_baseline, y = tpr_baseline,
    mode = "lines", name = "Baseline Model",
    marker = {"color": "#ff0000"}
)

## Create our trace using a plotly go object
trace1 = go.Scatter(
    x = fpr_final, y = tpr_final,
    mode = "lines", name = "Final Model",
    marker = {"color": "#13c6e9"},
)

## Adjust the figure parameters 
layout = go.Layout(
    title = dict(
        text = 'Receiver Operating Characteristics',
    ),
    xaxis = dict(
        title = "False Positive Rate",
        showline = True, linewidth = 1, linecolor = 'black',
        mirror = True, gridcolor = 'LightPink', automargin = True, 
        zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
        ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
        # Fixed axis range with a small margin around [0, 1]
        title_font = dict(size = 22), range = (-0.05, 1.05) ),

    yaxis = dict(
        title = "True Positive Rate",
        showline = True, linewidth = 2, linecolor = 'black', 
        mirror = True, gridcolor = 'LightPink',
        zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
        ticks = "outside", tickwidth = 2, tickcolor = 'LightPink', ticklen = 10,
        # Fixed axis range with a small margin around [0, 1]
        title_font = dict(size = 22), range = (-0.05, 1.05) ),

    margin = dict(
        l = 100, r = 10,
        t = 40, b = 40),
    
    font = dict(size = 18),
    title_font = dict(size = 22),
)

## Plug in our parameters above to the plotly go figure objects to create our plots
data = [trace0, trace1]
roc_auc_curve_fig = go.Figure(data = data, layout = layout)

## Add in annotations
roc_auc_curve_fig.add_annotation(
        x=0.15,
        y=0.75,
        xref="x",
        yref="y",
        text= f"AUC = {auc_score_baseline:.3f}",
)

roc_auc_curve_fig.add_annotation(
        x=0.5,
        y=0.90,
        xref="x",
        yref="y",
        text= f"AUC = {auc_score_final:.3f}",
)

roc_auc_curve_fig.show()

# References

[1] "Backblaze Hard Drive Stats", Backblaze.com, 2020. [Online]. Available: https://www.backblaze.com/b2/hard-drive-test-data.html. [Accessed: 27-Apr-2020].

[2] "SMART Drive and Failure Rates", Backblaze.com, 2020. [Online]. Available: https://www.backblaze.com/blog-smart-stats-2014-8.html. [Accessed: 29-Apr-2020].

[3] "Backup Software & Data Protection Solutions - Acronis", Acronis.com, 2020. [Online]. Available: https://www.acronis.com/en-us/. [Accessed: 01-May-2020].
