# Hanes Brand Predictive Maintenance

In this notebook, we attempt to correlate the machine sensor data with the failure data to identify when the machine has failed. In addition, we aggregate the sensor data from 2-minute intervals to 1-hour intervals and create a new indicator variable that signifies whether the machine failed within the next hour.

## Customize Environment

In [1]:
import pandas as pd
import numpy as np
import mysql.connector
from sqlalchemy import create_engine
import datetime
import pickle
from pprint import pprint
import matplotlib.pyplot as plt
from collections import Counter

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Load Data

##### We will connect to the MySQL database and query the tables directly

In [3]:
# create a connection to the database
engine = create_engine('mysql+mysqlconnector://champt9:champt9@130.39.81.34:3306/hanes', echo=False)

# query the database for dryer 3 data
#dryer3_df = pd.read_sql_query("SELECT * FROM dryer3 WHERE Quality = 192", con=engine)
#dryer3_nonpm_df = pd.read_sql_query("SELECT * FROM dryer3_nonpm", con=engine)
#dryer3_pm_df = pd.read_sql_query("SELECT * FROM dryer3_pm", con=engine)

##### We will also store the data in a .pickle file so that we can refer to it later without needing to be connected to the LSU network

In [4]:
# pickle dataset to work with later
#pickle.dump(dryer3_df, open('dryer3.p', 'wb'))
#pickle.dump(dryer3_nonpm_df, open('dryer3_nonpm.p', 'wb'))
#pickle.dump(dryer3_pm_df, open('dryer3_pm.p', 'wb'))

# load pickle file
dryer3_df = pickle.load(open('dryer3.p', 'rb'))
dryer3_nonpm_df = pickle.load(open('dryer3_nonpm.p', 'rb'))
dryer3_pm_df = pickle.load(open('dryer3_pm.p', 'rb'))
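
The query-then-pickle pattern above relies on commenting lines in and out. As a rough sketch only, the same idea can be wrapped in a helper that reads the cache when it exists and queries otherwise. The names `load_or_query`, `fake_query`, and `demo_path` are hypothetical and not part of the original notebook; `fake_query` stands in for a real `pd.read_sql_query` call.

```python
import os
import pickle
import tempfile

import pandas as pd


def load_or_query(path, query_fn):
    """Return the DataFrame cached at `path`; if absent, run `query_fn` and cache it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    df = query_fn()
    with open(path, 'wb') as f:
        pickle.dump(df, f)
    return df


def fake_query():
    # Hypothetical stand-in for pd.read_sql_query("SELECT ...", con=engine)
    return pd.DataFrame({'Run': [1, 0, 1]})


# Demo in a temporary directory so no real cache file is touched
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.p')
first = load_or_query(demo_path, fake_query)   # runs the query, writes the cache
second = load_or_query(demo_path, fake_query)  # reads the cache, no query
```

Using `with open(...)` also closes the file handles, which the bare `open(...)` calls above leave to the garbage collector.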

## Feature Engineering

### Attribute Downtime

##### To properly attribute the reason the machine is not running, we need to identify when each downtime begins, how long it lasts, and when it ends. Once we have this information, we can use a matching rule to classify each downtime as shift change, preventative maintenance, or failure.

#### Identify when Run changes

In [5]:
# find when the variable 'Run' changes 
run_change = dryer3_df['Run'].diff()

In [6]:
# get the amount of time between each sensor reading
step_length = dryer3_df['Datetime'].diff()

In [7]:
# calculate, at each sensor reading, the number of minutes elapsed since the variable Run last changed
# thanks for the help at StackExchange #155111
since_change = []
current_delta = 0
for is_change, delta in zip(run_change, step_length):
    current_delta = 0 if is_change != 0 else \
        current_delta + delta.total_seconds() / 60.0
    since_change.append(current_delta)

In [8]:
# add these new variables back into the data frame
dryer3_df['Run_Change'] = run_change
dryer3_df['Step_Length'] = step_length
dryer3_df['Time_Since_Change'] = pd.Series(since_change).values

In [9]:
# show a sample of the data
dryer3_df[['Datetime', 'Run', 'Run_Change', 'Step_Length', 'Time_Since_Change']].head(5)

Unnamed: 0,Datetime,Run,Run_Change,Step_Length,Time_Since_Change
0,2015-01-01 00:00:00,1,,NaT,0.0
1,2015-01-01 00:02:00,1,0.0,00:02:00,2.0
2,2015-01-01 00:04:00,1,0.0,00:02:00,4.0
3,2015-01-01 00:06:00,1,0.0,00:02:00,6.0
4,2015-01-01 00:08:00,1,0.0,00:02:00,8.0
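
The loop above can also be expressed without explicit iteration. The following is a minimal sketch on made-up sample data, not the notebook's actual method: it labels each unbroken stretch of `Run` with a segment number and measures each reading against the first timestamp in its segment. The `segment` and `segment_start` names are introduced here for illustration.

```python
import pandas as pd

# Made-up sample: six readings two minutes apart, with Run changing twice
df = pd.DataFrame({
    'Datetime': pd.date_range('2015-01-01', periods=6, freq='2min'),
    'Run': [1, 1, 1, 0, 0, 1],
})

# A new segment starts wherever Run differs from the previous reading
segment = (df['Run'] != df['Run'].shift()).cumsum()

# First timestamp of the segment each reading belongs to
segment_start = df.groupby(segment)['Datetime'].transform('min')

# Minutes elapsed since Run last changed
df['Time_Since_Change'] = (df['Datetime'] - segment_start).dt.total_seconds() / 60.0
```

Like the loop, this yields 0 on the reading where `Run` changes and then accumulates the elapsed minutes.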


#### Group readings based on Run

In [10]:
# convert df to a list of dicts
dryer3_dict = dryer3_df.to_dict('records')

In [11]:
# set the ID for the first record equal to 0
dryer3_dict[0]['GroupId'] = 0

# create an auto-incrementing GroupId that updates with change in Run status
for i in range(1, len(dryer3_dict)):
    if dryer3_dict[i]['Run'] == dryer3_dict[i-1]['Run']:
        dryer3_dict[i]['GroupId'] = dryer3_dict[i-1]['GroupId']
    else:
        dryer3_dict[i]['GroupId'] = dryer3_dict[i-1]['GroupId'] + 1

In [12]:
# create a dict with the keys as Group Id
dryer3 = {}
for line in dryer3_dict:
    dryer3[line['GroupId']] = {'groupId' : line['GroupId'],
                               'zEvents' : []}

# add sensor reads into the dict based on their GroupId
for line in dryer3_dict:
    dryer3[line['GroupId']]['zEvents'].append(line)

In [13]:
# enrich dict with min, max, duration
for line in dryer3:
    dryer3[line]['Run'] = dryer3[line]['zEvents'][0]['Run']
    dryer3[line]['startDatetime'] = min([item['Datetime'] for item in dryer3[line]['zEvents']])
    dryer3[line]['endDatetime'] = max([item['Datetime'] for item in dryer3[line]['zEvents']])
    dryer3[line]['duration'] = (dryer3[line]['endDatetime'] - dryer3[line]['startDatetime']).total_seconds() / 60.0

In [14]:
# create reduced dict without sensor readings
dryer3_reduced = []
for line in dryer3:
    row = {'GroupId' : dryer3[line]['groupId'], 
           'Run' : dryer3[line]['Run'],
           'endDatetime' : dryer3[line]['endDatetime'],
           'startDatetime' : dryer3[line]['startDatetime'],
           'duration' : dryer3[line]['duration']}
    dryer3_reduced.append(row)

In [15]:
# show a sample of the data
pd.DataFrame(dryer3_reduced).head()

Unnamed: 0,GroupId,Run,duration,endDatetime,startDatetime
0,0,1,158.0,2015-01-01 02:38:00,2015-01-01 00:00:00
1,1,0,0.0,2015-01-01 02:40:00,2015-01-01 02:40:00
2,2,1,254.0,2015-01-01 06:56:00,2015-01-01 02:42:00
3,3,0,0.0,2015-01-01 06:58:00,2015-01-01 06:58:00
4,4,1,106.0,2015-01-01 08:46:00,2015-01-01 07:00:00
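
For comparison, the grouping and summary steps above can be sketched with a pandas `groupby`. This is an illustrative alternative on made-up data, assuming the same column names (`Run`, `Datetime`) as the sensor table; `summary` is a hypothetical name.

```python
import pandas as pd

# Made-up sample: the machine stops for a single reading at 00:06
df = pd.DataFrame({
    'Datetime': pd.date_range('2015-01-01', periods=6, freq='2min'),
    'Run': [1, 1, 1, 0, 1, 1],
})

# Auto-incrementing GroupId, starting at 0, that advances when Run changes
df['GroupId'] = (df['Run'] != df['Run'].shift()).cumsum() - 1

# One row per group: run state, first reading, last reading
summary = df.groupby('GroupId').agg(
    Run=('Run', 'first'),
    startDatetime=('Datetime', 'min'),
    endDatetime=('Datetime', 'max'),
).reset_index()

# Duration in minutes between the first and last reading of the group
summary['duration'] = (
    summary['endDatetime'] - summary['startDatetime']
).dt.total_seconds() / 60.0
```

Because duration is measured from the first to the last reading in a group, a downtime captured by a single reading has a duration of 0, which matches the `0.0` rows in the sample output above.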


#### Classify downtime

In [16]:
# get unique dates for preventative maintenance and failure
pm_dates = set([line.date() for line in dryer3_pm_df['Completed'].tolist()])
failure_dates = set([line.date() for line in dryer3_nonpm_df['Assigned'].tolist()])

# define the start and end times for shift change #1 and shift change #2
shift_1_start = datetime.time(7, 30, 0)
shift_1_end = datetime.time(9, 30, 0)
shift_2_start = datetime.time(19, 30, 0)
shift_2_end = datetime.time(21, 30, 0)

# iterate through the aggregated data
for line in dryer3_reduced:
    
    # create variables for easy reference
    run = line['Run']
    start_date = line['startDatetime'].date()
    start_time = line['startDatetime'].time()
    end_date = line['endDatetime'].date()
    end_time = line['endDatetime'].time()
    duration = line['duration']
    
    # until proven otherwise, set all reasons the machine is down to False
    line['shift_change'] = False
    line['fail_only'] = False
    line['pm_or_fail'] = False
    
    # look for when the machine is down
    if line['Run'] == 0:
        
        # if the downtime started within either shift-change window and lasted more than 0 but less than 120 minutes...
        if ((shift_1_start <=  start_time <= shift_1_end) or (shift_2_start <=  start_time <= shift_2_end)) and (0 < duration < 120):
            line['shift_change'] = True
            
        # else if the downtime was not shift change, lasted more than 4 minutes, started on a failure date, and ended on a PM date...
        elif duration > 4 and start_date in failure_dates and end_date in pm_dates:
            line['pm_or_fail'] = True
        
        # else if the downtime was not shift change, lasted more than 4 minutes, started on a failure date, and did not end on a PM date...
        elif duration > 4 and start_date in failure_dates and end_date not in pm_dates:
            line['fail_only'] = True
            
        # otherwise, we cannot attribute downtime
        else:
            pass

# print the number of classifications made
print("Number of failures: {}".format(len([line for line in dryer3_reduced if line['fail_only'] == True])))
print("Number of failures or maintenance: {}".format(len([line for line in dryer3_reduced if line['pm_or_fail'] == True])))

Number of failures: 95
Number of failures or maintenance: 12
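
To make the matching rule easier to test in isolation, it can be pulled into a small function. The sketch below mirrors the conditions in the cell above and assumes it is only called for intervals where `Run` is 0; the function name `classify_downtime`, the `SHIFT_WINDOWS` constant, and the example dates are hypothetical.

```python
import datetime

# The two shift-change windows used in the cell above
SHIFT_WINDOWS = [
    (datetime.time(7, 30), datetime.time(9, 30)),
    (datetime.time(19, 30), datetime.time(21, 30)),
]


def classify_downtime(start, end, duration, failure_dates, pm_dates):
    """Return 'shift_change', 'pm_or_fail', 'fail_only', or None if unattributed."""
    in_window = any(lo <= start.time() <= hi for lo, hi in SHIFT_WINDOWS)
    if in_window and 0 < duration < 120:
        return 'shift_change'
    if duration > 4 and start.date() in failure_dates:
        return 'pm_or_fail' if end.date() in pm_dates else 'fail_only'
    return None


# Made-up dates, for illustration only
failure_dates = {datetime.date(2015, 1, 5)}
pm_dates = {datetime.date(2015, 1, 6)}

a = classify_downtime(datetime.datetime(2015, 1, 5, 8, 0),
                      datetime.datetime(2015, 1, 5, 8, 30), 30,
                      failure_dates, pm_dates)
b = classify_downtime(datetime.datetime(2015, 1, 5, 13, 0),
                      datetime.datetime(2015, 1, 5, 14, 0), 60,
                      failure_dates, pm_dates)
c = classify_downtime(datetime.datetime(2015, 1, 5, 23, 0),
                      datetime.datetime(2015, 1, 6, 1, 0), 120,
                      failure_dates, pm_dates)
d = classify_downtime(datetime.datetime(2015, 1, 7, 13, 0),
                      datetime.datetime(2015, 1, 7, 13, 10), 10,
                      failure_dates, pm_dates)
```

Note that the shift-change test is evaluated first, so a short downtime that starts inside a shift window is labeled shift change even on a recorded failure date.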


In [17]:
# load downtime into a df
df = pd.DataFrame(dryer3_reduced)

# reorder columns
col_order = ['GroupId', 'startDatetime', 'endDatetime', 'duration', 'Run', 'shift_change', 'fail_only', 'pm_or_fail']
df = df.reindex(columns=col_order)

In [18]:
# show a sample of the data
df.head()

Unnamed: 0,GroupId,startDatetime,endDatetime,duration,Run,shift_change,fail_only,pm_or_fail
0,0,2015-01-01 00:00:00,2015-01-01 02:38:00,158.0,1,False,False,False
1,1,2015-01-01 02:40:00,2015-01-01 02:40:00,0.0,0,False,False,False
2,2,2015-01-01 02:42:00,2015-01-01 06:56:00,254.0,1,False,False,False
3,3,2015-01-01 06:58:00,2015-01-01 06:58:00,0.0,0,False,False,False
4,4,2015-01-01 07:00:00,2015-01-01 08:46:00,106.0,1,False,False,False


#### Identify failures in sensor data

In [19]:
# get the start and end datetimes for the identified failures
# THIS EXCLUDES THE OUTLIER WHERE THE MACHINE WAS DOWN FOR 10 DAYS
# THIS ALSO EXCLUDES TIMES WHEN WE CANNOT DISTINGUISH BETWEEN PREVENTATIVE MAINTENANCE AND FAILURE
failure = df[(df['fail_only'] == True) & (df['duration'] < 14000)]
print(len(failure))

94


In [20]:
# convert to dict
failure_dict = failure.to_dict('records')

# create a set of tuples with starttime and endtime for faster look ups
fail_times = set([(line['startDatetime'], line['endDatetime']) for line in failure_dict])

In [21]:
# look through the dryer3 sensor data, and if the datetime of the reading is in the set of fail times, write fail indicator
fail_dates = []
for line in dryer3_dict:
    for item in fail_times:
        if item[0] <= line['Datetime'] <= item[1]:
            line['FAILURE'] = 1
            fail_dates.append(line['Datetime'])

# if it wasn't found as a failure, then code it as a 0
for line in dryer3_dict:
    if 'FAILURE' not in line.keys():
        line['FAILURE'] = 0
        
# print the number of failures tagged
print("Number of failures at 2 minute intervals: {}".format(len(fail_dates)))

Number of failures at 2 minute intervals: 2808
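
The nested loop above compares every reading against every failure window. As a hedged alternative on made-up data, the same tagging can be done on a DataFrame with one boolean mask per window; `readings` and `fail_windows` are hypothetical names standing in for the sensor data and the `fail_times` set.

```python
import pandas as pd

# Made-up sample: eight readings two minutes apart
readings = pd.DataFrame({
    'Datetime': pd.date_range('2015-01-05 09:20', periods=8, freq='2min'),
})

# Made-up failure window as a (start, end) tuple
fail_windows = [
    (pd.Timestamp('2015-01-05 09:26'), pd.Timestamp('2015-01-05 09:30')),
]

# Default every reading to 0, then flag those inside any window
readings['FAILURE'] = 0
for start, end in fail_windows:
    # between() is inclusive on both ends, like start <= x <= end
    mask = readings['Datetime'].between(start, end)
    readings.loc[mask, 'FAILURE'] = 1
```

This loops once per window rather than once per reading-window pair, which should matter when there are hundreds of thousands of readings and fewer than a hundred windows.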


#### Look at the match-up between classified failures and recorded failures

In [23]:
# get the dates of the first and last sensor readings
d1 = min([line['Datetime'] for line in dryer3_dict]).date()
d2 = max([line['Datetime'] for line in dryer3_dict]).date()

# find the difference between the dates
delta = d2 - d1

# create a dict with one entry for every day
dates = {}
for i in range(delta.days + 1):
    dates[(d1 + datetime.timedelta(days=i))] = {'recorded_failure' : [],
                                                'identified_failure' : []}

In [24]:
# read in failure dates
fail_record_dates = [line.date() for line in dryer3_nonpm_df['Assigned'].tolist()]

# go through the recorded failures and add the dates of the failures to the dict
for line in fail_record_dates:
    dates[line]['recorded_failure'].append(line)
    
# go through the failures that we identified and add them to the dict
for line in fail_times:
    dates[line[0].date()]['identified_failure'].append(line[0])

In [25]:
# count the recorded and identified failures for each date
for key in dates:
    dates[key]['num_recorded_failure'] = len(dates[key]['recorded_failure'])
    dates[key]['num_identified_failure'] = len(dates[key]['identified_failure'])

In [26]:
f_df = pd.DataFrame(dates).T

In [27]:
f_df.to_csv("dates.csv")

In [28]:
f_df.head()

Unnamed: 0,identified_failure,num_identified_failure,num_recorded_failure,recorded_failure
2015-01-01,[],0,0,[]
2015-01-02,[],0,0,[]
2015-01-03,[],0,0,[]
2015-01-04,[],0,0,[]
2015-01-05,[2015-01-05 09:26:00],1,1,[2015-01-05]


## Aggregate Results

Because our goal is to identify the indicators that lead up to a failure, we are going to aggregate our 2-minute sensor readings up to the hour level.

In [22]:
# filter to only include the first failure if a failure happens
first_fail = []
for line in dryer3_dict:
    if line['FAILURE'] == 0:
        first_fail.append(line)
    elif line['FAILURE'] == 1 and line['Time_Since_Change'] == 0:
        first_fail.append(line)

In [23]:
# load dataset into dataframe
data = pd.DataFrame(first_fail)

In [24]:
# count the number of failures
data.FAILURE.value_counts()

0    339517
1        94
Name: FAILURE, dtype: int64

In [25]:
# create a new variable that anchors the datetime to the start of its hour...used for grouping next
# thanks to StackOverflow #27031169
data['Datetime_hour'] = data.Datetime.values.astype('<M8[h]')

# drop datetime as it's no longer needed
data.drop('Datetime', inplace=True, axis=1)

In [26]:
# group the data by the hour level
grouped = data.groupby(by='Datetime_hour')

In [27]:
# create function to return ratio
def ratio(arr):
    return float(arr.sum()) / len(arr)

In [28]:
# decide which variables are binary and which are continuous
binary = ['LintSysAuto', 'LintSysEnable', 'PleviaAuto', 'Run', 'QA']

continuous = ['CircFan1', 'CircFan2', 'CircFan3', 'CircFan4', 'CircFan5', 'CircFan6', 
              'CircFanAct1', 'CircFanAct2', 'CircFanAct3', 'CircfanAct4', 'CircFanAct5', 'CircFanAct6',
              'Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6',
              'TempSet1', 'TempSet2', 'TempSet3', 'TempSet4', 'TempSet5', 'TempSet6',
              'Valve1', 'Valve2', 'Valve3', 'Valve4', 'Valve5', 'Valve6', 
              'EntrySpeed', 'ExitCnvySpeed', 'FeedConvySpeed', 'FolderSpeed', 'LowerCnvySpeed', 'MiddleCnvySpeed', 'HMISpeed',
              'EntryRatio', 'ExitCnvyRatio', 'FeedCnvyRatio', 'FolderRatio', 'LowerCnvyRatio', 'MiddleCnvyRatio', 
              'ExhaustFan', 'ExhaustFanAct', 'ExhaustFanMan', 'HeatRecAct', 'HeatRecSet', 'PSum', 'Plevia', 'Speed']

# combine binary and continuous
all_vars = []
all_vars.extend(binary)
all_vars.extend(continuous)

In [29]:
# create dict for agg arguments
args = {}
for var in all_vars:
    if var in continuous:
        args[var] = {var : {str(var + '_mean') : 'mean',
                            str(var + '_std')  : 'std',
                            str(var + '_min')  : 'min',
                            str(var + '_max')  : 'max'}}
    if var in binary:
        args[var] = {var : {str(var + '_ratio') : ratio}}
        
# add failure indicator
args['FAILURE'] = {'FAILURE' : {'FAILURE' : 'max'}}

In [30]:
# run the arguments on the grouped data
results = grouped.agg(args)

In [31]:
# drop the MultiIndex
results.columns = results.columns.droplevel()

In [32]:
# count how many failures were identified in this grouped data
results['FAILURE'].value_counts()

0    11315
1       87
Name: FAILURE, dtype: int64
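
The nested-dictionary form of `agg` used above was deprecated in later pandas releases, so the cell may not run on a current installation. As a sketch only, on made-up data with one continuous and one binary variable, the same hourly summary can be written with named aggregation (available since pandas 0.25). The `named` and `hourly` names are hypothetical, and the binary ratio is computed with `'mean'`, which for 0/1 values equals the sum divided by the count.

```python
import pandas as pd

# Made-up sample: two hours of readings at 2-minute intervals
data = pd.DataFrame({
    'Datetime': pd.date_range('2015-01-01 00:00', periods=60, freq='2min'),
    'Temp1': [100.0] * 30 + [90.0] * 30,
    'Run': [1] * 45 + [0] * 15,
    'FAILURE': [0] * 59 + [1],
})
continuous = ['Temp1']
binary = ['Run']

# Build {output_column: (input_column, function)} pairs
named = {}
for var in continuous:
    for stat in ('mean', 'std', 'min', 'max'):
        named[var + '_' + stat] = (var, stat)
for var in binary:
    named[var + '_ratio'] = (var, 'mean')  # mean of 0/1 values is the ratio of 1s
named['FAILURE'] = ('FAILURE', 'max')

# Group by the hour each reading falls in and aggregate
hourly = data.groupby(data['Datetime'].dt.floor('h')).agg(**named)
```

The result has flat column names such as `Temp1_mean` and `Run_ratio`, so no `droplevel` step is needed afterwards.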

## Advance Failure Indicator

Because we want to identify indicators of a failure before it happens, we are going to shift the indicator to the hour before the failure happens. Ideally, this time frame will look significantly different from other times in the data.

In [33]:
# capture when the status of FAILURE changes
results['FAIL_CHANGE'] = results['FAILURE'].diff()

In [34]:
# find datetimes where multiple failures happen in a row and drop all but the first
results = results.drop(results[(results['FAILURE'] == 1) & (results['FAIL_CHANGE'] != 1.0)].index)

In [35]:
# find index values for when the failure happened
fail_dt = results[results['FAILURE'] == 1].index

In [36]:
# subtract 1 hour from the time of failure
new_fail_dts = set()
for dt in fail_dt:
    new_fail_dts.add((dt - pd.to_timedelta(1, unit='h'), dt))

In [37]:
# set new indicator variable for upcoming failure
results['NEW_FAIL'] = 0
for i in results.index:
    for dt in new_fail_dts:
        if dt[0] <= i <= dt[1]:
            results.at[i, 'NEW_FAIL'] = 1
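
The loop above checks every hourly row against every failure window. As an illustrative alternative on made-up data, and assuming the index is a `DatetimeIndex` of hour starts as produced by the grouping step, the same flag can be set with index membership tests. The `hourly`, `fail_hours`, and `hour_before` names are hypothetical.

```python
import pandas as pd

# Made-up hourly frame with failures in the 02:00 and 05:00 hours
idx = pd.date_range('2015-01-01 00:00', periods=6, freq='h')
hourly = pd.DataFrame({'FAILURE': [0, 0, 1, 0, 0, 1]}, index=idx)

# Hours in which a failure occurred, and the hour just before each one
fail_hours = hourly.index[hourly['FAILURE'] == 1]
hour_before = fail_hours - pd.Timedelta(hours=1)

# Flag the failure hour and the hour leading up to it
hourly['NEW_FAIL'] = (
    hourly.index.isin(fail_hours) | hourly.index.isin(hour_before)
).astype(int)
```

On an index made of whole hours, this flags the same rows as the `dt[0] <= i <= dt[1]` comparison; the later cells then keep only the first flagged hour of each pair.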

In [38]:
# look for when new_fail changes and only keep first one
results['FAIL_1_HOUR'] = results.NEW_FAIL.diff()

In [39]:
# find datetimes where multiple new failures happen in a row and keep first
results = results.drop(results[(results['NEW_FAIL'] == 1) & (results['FAIL_1_HOUR'] == 0)].index)

In [40]:
# drop unnecessary columns
results.drop(["FAIL_CHANGE", "FAIL_1_HOUR", "FAILURE"], axis=1, inplace=True)

In [41]:
# rename failure indicator
results = results.rename(columns={'NEW_FAIL' : 'FAIL'})

In [42]:
# sort columns in alphabetical order, with FAIL at the beginning
cols = ['FAIL']

for i in sorted(results.columns):
    if i not in cols:
        cols.append(i)
        
# reorder columns
results = results[cols]

In [43]:
# show a sample of the data
results.head()

Datetime_hour,FAIL,CircFan1_max,CircFan1_mean,CircFan1_min,CircFan1_std,CircFan2_max,CircFan2_mean,CircFan2_min,CircFan2_std,CircFan3_max,...,Valve4_min,Valve4_std,Valve5_max,Valve5_mean,Valve5_min,Valve5_std,Valve6_max,Valve6_mean,Valve6_min,Valve6_std
2015-01-01 00:00:00,0,100,100.0,100,0.0,100,100.0,100,0.0,100,...,100,0.0,100,100.0,100,0.0,100,100.0,100,0.0
2015-01-01 01:00:00,0,100,100.0,100,0.0,100,100.0,100,0.0,100,...,98,0.461133,100,100.0,100,0.0,100,100.0,100,0.0
2015-01-01 02:00:00,0,100,100.0,100,0.0,100,100.0,100,0.0,100,...,81,4.944404,100,99.5,91,1.943158,100,100.0,100,0.0
2015-01-01 03:00:00,0,100,100.0,100,0.0,100,100.0,100,0.0,100,...,94,1.381736,100,100.0,100,0.0,100,100.0,100,0.0
2015-01-01 04:00:00,0,100,100.0,100,0.0,100,100.0,100,0.0,100,...,92,1.695498,100,100.0,100,0.0,100,100.0,100,0.0


In [45]:
# write new df to SQL
results.to_sql("dryer3_1_hour_before_fail", con=engine, index=True, if_exists="replace", chunksize=2500)