# Kaggle - Mpred
## Predicitive Maintenance Challenge
### https://www.kaggle.com/c/mpred-datascience-challenge
#### By: 55thSwiss


## Introduction

A major problem faced by businesses in asset-heavy industries such as manufacturing is the significant 
costs that are associated with delays in the production process due to mechanical problems. Most of these 
businesses are interested in predicting these problems in advance so that they can proactively prevent the 
problems before they occur which will reduce the costly impact caused by downtime.

The business problem for this example is about predicting problems caused by component failures such that 
the question “What is the probability that a machine will fail in the near future due to a failure of a 
certain component” can be answered. The problem is formatted as a multi-class classification problem and 
a machine learning algorithm is used to create the predictive model that learns from historical data collected 
from machines.

The goal is to predict when a given machine will fail within 24 hours due to a failure of a given component .

## Sources

Please refer to the playbook [1] for predictive maintenance for a detailed explanation of common use cases 
in predictive maintenance and modelling approaches.

Original data set can be found at [2]

[1] https://docs.microsoft.com/fr-fr/azure/machine-learning/team-data-science-process/cortana-analytics-playbook-predictive-maintenance

[2] https://www.kaggle.com/yuansaijie0604/xinjiang-pm/data

## Data Description

In this competition, you are asked to estimate the probability that a machine will fail in the near future due to a 
failure of a certain component. More specifically, the goal is to compute the probability that a machine will fail in 
the next 24 hours due to a certain component failure (component 1,2,3 or 4) hence you are asked to classify the 
observations into 5 categories : comp1, comp2,comp3,comp4 and none (e.g : The machine will fail due to 
component 1... 4 or will not fail in the next 24h).

## TRAIN/SPLIT

The data is made of 5 different data set : Errors, Failures, Machines features, Maintenance history and Telemetry which contains the historical data collected from 100 different machines

The training set contains the data of 70 machines while the test set contains the data of the 30 remaining machines

The data is collected every hour for a year. For simplicity, you are asked to group the data into 3 hours windows.

## File descriptions

__telemetry.csv__ The first data source is the telemetry time-series data which consists of voltage, rotation, pressure and vibration measurements collected from 70 machines in real time averaged over every hour collected during the year 2015

__errors.csv__ The errors logs are non-breaking errors thrown while the machine is still operational and do not constitute as failures. The error date and times are rounded to the closest hour since the telemetry data is collected at an hourly rate.

__maint.csv__ This file contains the scheduled and unscheduled maintenance records which correspond to both regular inspection of components as well as failures. A record is generated if a component is replaced during the scheduled inspection or replaced due to a break down. The records that are created due to break downs will be called failures which is explained in the later sections. Maintenance data has both 2014 and 2015 records.

__machines.csv__ This data set includes some information about the machines which are model type and years in service.

__failures.csv__ These are the records of component replacements due to failures. Each record has a date and time, machine ID and failed component type.

## Data fields

__volt, rotate, pressure, vibration__ Voltage, rotation, pressure and vibration measurements collected from the machines

__machineID__ The ID of a machine

__datetime__ The date

__errorID__ The ID of an error

__comp__ The component replaced during the scheduled maintenance

__model__ Model of the machine

__age__ Years in service

__errorID__ The ID of an error

__failure__ Failed component type

### First assessment
There has been a problem defined that forms the objective of the project and we've also been given the data to work with. Typically this would be steps one __(1)__ and two __(2)__ respectively, gaining an understanding of the business and defining the problem (an absolute precursor to deciding requirements and method of solution) as well as collecting the data, in this case from the manufacturing floor.

### Wrangling or preparing the data for consumption
There are a few requirements from the introduction on the formatting of the data and how it should be proceesed. In this step the data will be formatted as well as examined for it's architecture in regards to storage and processing, and an initial cleaning for aberrant, missing, duplicate, or outlier data points within each dataset. This is a good point to determine possible independent and dependent variables.  If a key feature has been identified, the dataframes will be easily concatenated, otherwise it will be done "manually". Lastly, the cumulative dataframe can be split into training, test, and validation tables. 

In [165]:
#libraries
import sys
import os
import pandas as pd 
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import matplotlib.ticker as ticker
import seaborn as sns
import sklearn
from sklearn.preprocessing import minmax_scale
import datetime
import warnings
warnings.filterwarnings('ignore')

#vizualization defaults
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

The first dataset I tried looking at was telemetry.csv, but an error kicked up when loading the CSV:

```ParserError: Error tokenizing data. C error: Expected 6 fields in line 92490, saw 9```

This likely kicked up from missing information, meaning cells that were void of any value. Opening in a spreadsheet editor like Excel often fixes this by replacing the empty cells with a NaN status. 

In [166]:
#import data
sensorData = pd.read_csv('PdM_telemetry.csv')
maintenanceData = pd.read_csv('PdM_maint.csv')
originsData = pd.read_csv('PdM_machines.csv')
failureData = pd.read_csv('PdM_failures.csv')
errorData = pd.read_csv('PdM_errors.csv')

We'll do a first run through starting with 'telemetry.csv' since this is the largest dataset, then finish up the rest quickly. After opening the csv in Excel we're able to import it into the notebook but it has three columns of entirely null values. I've confirmed this is just an anomaly and the csv is infact empty in those columns (seems to be a formatting error, one row contained a shifted column heading in the data cells), they will be dropped. 

### Telemetry Data:

This is is going to be the most valuable dataset inidivudally, as it contains a vast amount of sensor readings from one hundred different machines. This will probably be the foundation dataset for merging.

In [167]:
#first look
sensorData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 968589 entries, 0 to 968588
Data columns (total 9 columns):
datetime      968589 non-null object
machineID     968589 non-null int64
volt          968589 non-null float64
rotate        968589 non-null object
pressure      968589 non-null object
vibration     968589 non-null object
Unnamed: 6    1 non-null object
Unnamed: 7    1 non-null object
Unnamed: 8    1 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 66.5+ MB


In [168]:
sensorData.sample(3)

Unnamed: 0,datetime,machineID,volt,rotate,pressure,vibration,Unnamed: 6,Unnamed: 7,Unnamed: 8
679148,12/18/2015 15:00,67,183.023081,520.022,122.401,36.0168,,,
773945,10/14/2015 1:00,78,180.971126,473.997,85.9261,38.5643,,,
357116,3/17/2015 3:00,31,165.795501,435.993,107.514,39.716,,,


In [169]:
# row index 92488 was the problem data from above, drop the row and clean up columns
sensorData.drop(92488, inplace = True)
sensorData.drop(['Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'], axis=1, inplace=True)
sensorData['datetime'] = pd.to_datetime(sensorData['datetime'], infer_datetime_format = True)
# change the datatype of 'rotate', 'pressure', and 'vibration'
sensorData['rotate'] = sensorData['rotate'].astype(float)
sensorData['pressure'] = sensorData['pressure'].astype(float)
sensorData['vibration'] = sensorData['vibration'].astype(float)
sensorData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 968588 entries, 0 to 968588
Data columns (total 6 columns):
datetime     968588 non-null datetime64[ns]
machineID    968588 non-null int64
volt         968588 non-null float64
rotate       968588 non-null float64
pressure     968588 non-null float64
vibration    968588 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 51.7 MB


In [170]:
# check out some statistics on the dataset
sensorData.describe()

Unnamed: 0,machineID,volt,rotate,pressure,vibration
count,968588.0,968588.0,968588.0,968588.0,968588.0
mean,46.230764,170.774119,446.591205,100.841515,40.383648
std,30.450409,15.507488,52.701718,11.028763,5.366198
min,1.0,97.333604,138.432075,51.237106,14.877054
25%,18.0,160.298184,412.278255,93.492369,36.778303
50%,45.0,170.597611,447.555368,100.414332,40.237433
75%,73.0,181.006196,482.154486,107.535244,43.781329
max,100.0,255.124717,695.020984,185.951998,76.791072


It looks like every feature has data points that are well outside three standard deviations of their respective mean. The outliers need to be looked at and dealt with if they're determined to be noise in the dataset, although these could be triggers corresponding to the failure mode in so they won't be removed just yet.

In [171]:
sensorData.shape

(968588, 6)

In [172]:
# start organizing the data by date to see how it flows
sensorData.sort_values(by=['datetime'], inplace = True)

In [173]:
mask = sensorData['machineID'] == 1
sensorData[mask].head(5)

Unnamed: 0,datetime,machineID,volt,rotate,pressure,vibration
0,2015-01-01 06:00:00,1,176.217853,418.504078,113.077935,45.087686
92489,2015-01-01 06:00:00,1,176.217853,418.504078,113.077935,45.087686
92490,2015-01-01 07:00:00,1,162.879223,402.74749,95.460525,43.413973
1,2015-01-01 07:00:00,1,162.879223,402.74749,95.460525,43.413973
2,2015-01-01 08:00:00,1,170.989902,527.349825,75.237905,34.178847


In [174]:
# looking at the results above, their appears to be duplicated data based on the 'datetime' feature, investigate a litte here:
sum(sensorData[mask].duplicated('datetime'))

8761

In [175]:
# cleaning the duplicate data and check
sensorData.drop_duplicates(subset=['datetime', 'machineID', 'volt'], inplace = True)
sum(sensorData[mask].duplicated('datetime'))

0

In [176]:
# looks better, check a 24 hour period of another machine for consistancy
mask = sensorData['machineID'] == 2
sensorData[mask].head(24)

Unnamed: 0,datetime,machineID,volt,rotate,pressure,vibration
8761,2015-01-01 06:00:00,2,176.558913,424.624162,76.005332,43.767049
8762,2015-01-01 07:00:00,2,158.282044,432.37296,110.907806,37.267114
101252,2015-01-01 08:00:00,2,168.242028,454.629639,97.877007,39.709461
101253,2015-01-01 09:00:00,2,180.280316,438.391022,84.44043,40.490443
101254,2015-01-01 10:00:00,2,169.719531,473.055664,110.395683,41.229578
101255,2015-01-01 11:00:00,2,191.257247,369.738792,101.223451,45.616543
8767,2015-01-01 12:00:00,2,186.282977,483.698416,115.061863,50.690561
101257,2015-01-01 13:00:00,2,179.367188,450.943961,94.378019,38.684815
101258,2015-01-01 14:00:00,2,168.893782,494.876313,101.910022,34.566681
101259,2015-01-01 15:00:00,2,158.595797,427.282619,92.470163,32.160232


In [177]:
# organize the dataframe by 'machineID' first, then 'datetime'
sensorData.sort_values(['machineID', 'datetime'], inplace = True)

In [178]:
# remove after
mask = sensorData['machineID'] == 2
sensorData[mask].head(3)

Unnamed: 0,datetime,machineID,volt,rotate,pressure,vibration
8761,2015-01-01 06:00:00,2,176.558913,424.624162,76.005332,43.767049
8762,2015-01-01 07:00:00,2,158.282044,432.37296,110.907806,37.267114
101252,2015-01-01 08:00:00,2,168.242028,454.629639,97.877007,39.709461


In [179]:
# aggregate a moving average every three hours of the sensor data and populate new features 'key_ma'
# fill the NaN cells created in the first two rows from '.rolling()' with the mean to 'key_ma'
#for key in ['volt', 'rotate', 'pressure', 'vibration']:
#    movingAv = (key + '_ma')
#    sensorData[movingAv] = sensorData[key].rolling(3).mean()
#    colMean = sensorData[movingAv].mean()
#    sensorData[movingAv].fillna((colMean), inplace = True)
    
# select every third row per train/split requirements
# semi working...
#i = 1
#sensorData['machineID'] = sensorData['machineID'].astype(str)
#for it, mach in enumerate(sensorData['machineID']):
    #i = (i + 1)
#    if int(mach) == i:
#        print(mach, i)
#        i = (i + 1)             
        
        

# original code:
#sensorData.loc[sensorData['machineID'] == mach]
#sensorData = sensorData.iloc[::3, :]

In [180]:
sensorData.shape

(876100, 6)

Telemetry sensor data looks clean enough for the first pass, there are no missing values, the features have been converted to appropriate data types, and the dataframe has been reduced into three hour incremenets with averaged sensor data.

### Maintenance Data

I'm not sure how usefule the maintenanceData is going to be, from the description this csv contains components that were changed due to scheduled maintenance and failures. At the same time, the failures are logged in a separate dataset, which will likely be more pertinent to our model. After cleaning all the datasets we can compare the scheduled changes and failures to the failureData csv and check for duplication, completeness, etc.

In [181]:
maintenanceData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3286 entries, 0 to 3285
Data columns (total 3 columns):
datetime     3286 non-null object
machineID    3286 non-null int64
comp         3286 non-null object
dtypes: int64(1), object(2)
memory usage: 77.1+ KB


In [182]:
maintenanceData.shape

(3286, 3)

In [183]:
maintenanceData.describe()

Unnamed: 0,machineID
count,3286.0
mean,50.284236
std,28.914478
min,1.0
25%,25.25
50%,50.0
75%,75.0
max,100.0


In [184]:
maintenanceData.sample(3)

Unnamed: 0,datetime,machineID,comp
1664,2015-05-31 06:00:00,51,comp2
1026,2015-11-18 06:00:00,31,comp3
2809,2014-07-01 06:00:00,86,comp4


In [185]:
# change 'datetime' to_datetime :)
maintenanceData['datetime'] = pd.to_datetime(maintenanceData['datetime'], infer_datetime_format = True)
# remove the begnning characters from 'comp' column and # change 'comp' to an integer
maintenanceData['comp'] = maintenanceData['comp'].str[4:].astype(int)
# change 'comp' to an integer
#maintenanceData['comp'] = maintenanceData['comp'].astype(int)

maintenanceData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3286 entries, 0 to 3285
Data columns (total 3 columns):
datetime     3286 non-null datetime64[ns]
machineID    3286 non-null int64
comp         3286 non-null int32
dtypes: datetime64[ns](1), int32(1), int64(1)
memory usage: 64.3 KB


In [186]:
# check for duplicated rows by 'datetime' and 'machineID' to consolidate rows
maintenanceData.duplicated(['datetime', 'machineID']).sum()

758

In [187]:
#This first concatenates your dataframe and the output of the get_dummies call, then it 
#groups the result according to the relevant columns, takes the sum of those columns among 
#those groups and then resets the index so you don't have to deal with a multi-index data 
#frame. The result looks like:
maintenanceData = pd.concat([maintenanceData, pd.get_dummies(maintenanceData.comp)], 1).groupby(['datetime','machineID']).sum().reset_index()
maintenanceData.drop(['comp'], axis=1, inplace=True)

In [188]:
maintenanceData.duplicated(['datetime', 'machineID']).sum()

0

In [189]:
maintenanceData.columns = ['datetime', 'machineID', 'mComp_1', 'mComp_2', 'mComp_3', 'mComp_4']
maintenanceData.sample(1)

Unnamed: 0,datetime,machineID,mComp_1,mComp_2,mComp_3,mComp_4
1321,2015-06-16 06:00:00,4,0,0,0,1


### Origins Data

The vintage of each machine should add valuable correlation to the frequency of scheduled and failure component repairs.

In [190]:
originsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
machineID    100 non-null int64
model        100 non-null object
age          100 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.4+ KB


In [191]:
# remove the begnning characters from 'model' column and change to integer data type
originsData['model'] = originsData['model'].str[5:].astype(int)

In [192]:
originsData.shape

(100, 3)

In [193]:
originsData['age'].describe()

count    100.000000
mean      11.330000
std        5.856974
min        0.000000
25%        6.750000
50%       12.000000
75%       16.000000
max       20.000000
Name: age, dtype: float64

In [194]:
originsData.sample(5)

Unnamed: 0,machineID,model,age
5,6,3,7
20,21,2,14
99,100,4,5
68,69,2,19
18,19,3,17


In [195]:
# quantity of each machine by model
originsData.model.value_counts().sort_index()

1    16
2    17
3    35
4    32
Name: model, dtype: int64

In [196]:
# quantity of each machine by age
originsData.age.value_counts().sort_index()

0      1
1      3
2      6
3      4
4      3
5      4
6      4
7      6
8      1
9      5
10    10
11     2
12     2
14    14
15     6
16     5
17     7
18     6
19     4
20     7
Name: age, dtype: int64

### Failure Data

This will obviously contain critical features to predicting future failures.

In [197]:
failureData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 761 entries, 0 to 760
Data columns (total 3 columns):
datetime     761 non-null object
machineID    761 non-null int64
failure      761 non-null object
dtypes: int64(1), object(2)
memory usage: 17.9+ KB


In [198]:
failureData.sample(3)

Unnamed: 0,datetime,machineID,failure
603,2015-08-29 06:00:00,83,comp1
226,2015-12-05 06:00:00,30,comp4
695,2015-05-23 06:00:00,94,comp4


In [199]:
failureData['failure'] = failureData['failure'].str[4:].astype(int)
failureData['datetime'] = pd.to_datetime(failureData['datetime'], infer_datetime_format = True)

In [200]:
failureData.head(3)

Unnamed: 0,datetime,machineID,failure
0,2015-01-05 06:00:00,1,4
1,2015-03-06 06:00:00,1,1
2,2015-04-20 06:00:00,1,2


In [201]:
failureData['failure'].describe()

count    761.000000
mean       2.390276
std        1.102084
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max        4.000000
Name: failure, dtype: float64

In [202]:
# check for duplicated rows by 'datetime' and 'machineID' to consolidate rows
failureData.duplicated(['datetime', 'machineID']).sum()

42

In [203]:
#This first concatenates your dataframe and the output of the get_dummies call, then it 
#groups the result according to the relevant columns, takes the sum of those columns among 
#those groups and then resets the index so you don't have to deal with a multi-index data 
#frame. The result looks like:
failureData = pd.concat([failureData, pd.get_dummies(failureData.failure)], 1).groupby(['datetime','machineID']).sum().reset_index()
failureData.drop(['failure'], axis=1, inplace=True)

In [204]:
# confirm
failureData.duplicated(['datetime', 'machineID']).sum()

0

In [205]:
failureData.columns = ['datetime', 'machineID', 'fComp_1', 'fComp_2', 'fComp_3', 'fComp_4']
failureData.head()

Unnamed: 0,datetime,machineID,fComp_1,fComp_2,fComp_3,fComp_4
0,2015-01-02 03:00:00,16,1,0,1,0
1,2015-01-02 03:00:00,17,0,0,0,1
2,2015-01-02 03:00:00,22,1,0,0,0
3,2015-01-02 03:00:00,35,1,0,0,0
4,2015-01-02 03:00:00,45,1,0,0,0


### Error Data

We're going to see if there is a correlation between the errors generated on the machines, and eventual component maintenance / failure.

In [206]:
errorData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3919 entries, 0 to 3918
Data columns (total 3 columns):
datetime     3919 non-null object
machineID    3919 non-null int64
errorID      3919 non-null object
dtypes: int64(1), object(2)
memory usage: 91.9+ KB


In [207]:
errorData.sample(3)

Unnamed: 0,datetime,machineID,errorID
1947,2015-07-10 12:00:00,51,error1
1242,2015-10-26 14:00:00,32,error2
650,2015-09-21 21:00:00,17,error3


In [208]:
errorData['errorID'] = errorData['errorID'].str[5:].astype(int)
errorData['datetime'] = pd.to_datetime(errorData['datetime'], infer_datetime_format = True)

In [209]:
errorData.sample(3)

Unnamed: 0,datetime,machineID,errorID
2317,2015-10-06 23:00:00,60,2
1796,2015-11-29 13:00:00,47,1
1447,2015-04-16 06:00:00,38,5


In [210]:
errorData.duplicated(['datetime', 'machineID']).sum()

303

In [211]:
#This first concatenates your dataframe and the output of the get_dummies call, then it 
#groups the result according to the relevant columns, takes the sum of those columns among 
#those groups and then resets the index so you don't have to deal with a multi-index data 
#frame. The result looks like:
errorData = pd.concat([errorData, pd.get_dummies(errorData.errorID)], 1).groupby(['datetime','machineID']).sum().reset_index()
errorData.drop(['errorID'], axis=1, inplace=True)

In [212]:
# confirm
errorData.duplicated(['datetime', 'machineID']).sum()

0

In [213]:
errorData.columns = ['datetime', 'machineID', 'error_1', 'error_2', 'error_3', 'error_4', 'error_5']
errorData.head()

Unnamed: 0,datetime,machineID,error_1,error_2,error_3,error_4,error_5
0,2015-01-01 06:00:00,24,1,0,0,0,0
1,2015-01-01 06:00:00,73,0,0,0,1,0
2,2015-01-01 06:00:00,81,1,0,0,0,0
3,2015-01-01 07:00:00,43,0,0,1,0,0
4,2015-01-01 08:00:00,14,0,0,0,1,0


### Merging, Feature Engineering, and more Cleaning

In [214]:
singleDF = pd.merge(sensorData, maintenanceData, how = 'left', copy = False, on = ['datetime', 'machineID'])
singleDF = pd.merge(singleDF, errorData, how = 'left', copy = False, on = ['datetime', 'machineID'])
singleDF = pd.merge(singleDF, failureData, how = 'left', copy = False, on = ['datetime', 'machineID'])
singleDF = pd.merge(singleDF, originsData, how = 'left', copy = False, on = ['machineID'])
#singleDF.drop(['volt_ma', 'rotate_ma', 'pressure_ma', 'vibration_ma'], axis=1, inplace=True)
singleDF.fillna(0, inplace=True)
for f in ['mComp_1', 'mComp_2', 'mComp_3', 'mComp_4', 'fComp_1', 'fComp_2', 'fComp_3', 'fComp_4', 'error_1', 'error_2', 'error_3', 'error_4', 'error_5']:
    #
    singleDF[f] = singleDF[f].astype(int)

In [215]:
# sensorData was our base dataset, this confirms we're ending with the same number of rows we've started with
singleDF.shape

(876100, 21)

In [216]:
singleDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 876100 entries, 0 to 876099
Data columns (total 21 columns):
datetime     876100 non-null datetime64[ns]
machineID    876100 non-null int64
volt         876100 non-null float64
rotate       876100 non-null float64
pressure     876100 non-null float64
vibration    876100 non-null float64
mComp_1      876100 non-null int32
mComp_2      876100 non-null int32
mComp_3      876100 non-null int32
mComp_4      876100 non-null int32
error_1      876100 non-null int32
error_2      876100 non-null int32
error_3      876100 non-null int32
error_4      876100 non-null int32
error_5      876100 non-null int32
fComp_1      876100 non-null int32
fComp_2      876100 non-null int32
fComp_3      876100 non-null int32
fComp_4      876100 non-null int32
model        876100 non-null int32
age          876100 non-null int64
dtypes: datetime64[ns](1), float64(4), int32(14), int64(2)
memory usage: 100.3 MB


In [217]:
# looking at a sample / head / tail the dataframe seemed pretty empty for the components, failures, and errors. I thought
# there may have been an issue during the merge but doing a few checks has revealed the indicators are there.
a = singleDF['fComp_1'].value_counts()
b = singleDF['mComp_1'].value_counts()
c = singleDF['error_1'].value_counts()
print(a, b, c)

0    875908
1       192
Name: fComp_1, dtype: int64 0    875396
1       704
Name: mComp_1, dtype: int64 0    875090
1      1010
Name: error_1, dtype: int64


When I first saw a data set for performed maintenance, failures, and errors I assumed they should go together pretty simply. After looking at this last data set though, there are five different error messages, but we only have four unqiue components. Going back to look at the information provided at the beginning of this notebook, there is nothing useful to define what these errors are or how they correlate at the maintenance and component level. 

The errors will be summarized into a quantitative field of 0 = no error, and 1 - 5 = errors occured.

In [218]:
# creating a general column 'errors' with quantities for that specific 'datetime' and 'machineID'
singleDF['errors'] = singleDF['error_1'] + singleDF['error_2'] + singleDF['error_3'] + singleDF['error_4'] + singleDF['error_5']
singleDF.drop(['error_1', 'error_2', 'error_3', 'error_4', 'error_5'], axis=1, inplace=True)
#mask = singleDF['errors'] > 3
#singleDF[mask].head(50)

Per the requirements in 'TRAIN/SPLIT' the data needs to be grouped into three hour windows.
Just performing a function like '.rolling()' on the entire data set is going to lead to inconsistant
datetime entries for each group of machines. I'd like them to be uniform (for instance each machine 
should start at 2015-01-01 06:00:00, where if we just roll through the entire data set dropped 2/3 rows it won't be uniform). Here we'll break down the dataframe by 'machineID' into a list of dataframes 
so they can be manipulated separately and loop through to take an average of every three rows of sensor data as 
well as sum the maintenance and failed components in that three hour window hopefully giving a good summary of the three hour period.


I'M UNSURE AVERAGING THE SENSOR VALUES IS THE BEST METHOD

In [219]:
singleDF.sort_values(['machineID', 'datetime'], inplace = True)

In [220]:
singleDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 876100 entries, 0 to 876099
Data columns (total 17 columns):
datetime     876100 non-null datetime64[ns]
machineID    876100 non-null int64
volt         876100 non-null float64
rotate       876100 non-null float64
pressure     876100 non-null float64
vibration    876100 non-null float64
mComp_1      876100 non-null int32
mComp_2      876100 non-null int32
mComp_3      876100 non-null int32
mComp_4      876100 non-null int32
fComp_1      876100 non-null int32
fComp_2      876100 non-null int32
fComp_3      876100 non-null int32
fComp_4      876100 non-null int32
model        876100 non-null int32
age          876100 non-null int64
errors       876100 non-null int32
dtypes: datetime64[ns](1), float64(4), int32(10), int64(2)
memory usage: 86.9 MB


In [None]:
# create an empty list to add DataFrames to
#machineDF = []
# create a range for the one hundred machine
#for key in range(0, 99):
    # group rows by 'machineID'
#    df = singleDF.loc[singleDF['machineID'] == key]
    # append the df of a single machine to the list of df's
#    machineDF.append(df)
    # iterate through the columns below
#    for feature in ['volt', 'rotate', 'pressure', 'vibration']:
        # for each machineDF and column above, take the average of every three rows
#        machineDF[key][feature] = machineDF[key][feature].rolling(3).mean()
#    for feature in ['mComp_1', 'mComp_2', 'mComp_3', 'mComp_4', 'fComp_1', 'fComp_2', 'fComp_3', 'fComp_4', 'errors']:
#        machineDF[key][feature + 'Rolling'] = machineDF[key][feature].rolling(3).sum()  
    
    #for feature in ['mComp_1', 'mComp_2', 'mComp_3', 'mComp_4', 'fComp_1', 'fComp_2', 'fComp_3', 'fComp_4', 'errors']:
        #machineDF[key][feature] = machineDF[key][feature].rolling(3).sum()
        
        # add every three rows for columns above
        #machineDF[key][feature] = machineDF[key][feature].groupby(machineDF[key][feature].index // 3 * 3).sum()
        
    # drop the first two rows of each machineDF, there's empty sensor data from the '.rolling(3)'
    #machineDF[key] = machineDF[key].iloc[2:]
    #print(machineDF[key].head(1))
    #print(machineDF[key].iloc[0,2])
    # drop every 2 of 3 rows
#    machineDF[key] = machineDF[key].iloc[::3, :]
    # drop the original component columns

    # rename the rollings component columns
    
#convert the list back to a single df
#df_1 = pd.concat(machineDF).fillna(0).sort_index().reset_index(drop = True)



# starts a dictionary
#machineDF = {}

# naming the columns as you go, add this to for loop. Don't know if it needs to be a dictionary
#machineID[str(key) + 'machine']

In [227]:
# create an empty list to add DataFrames to
machineDF = []
# create a range for the one hundred machine
for key in range(0, 99):
    # group rows by 'machineID'
    df = singleDF.loc[singleDF['machineID'] == key]
    # append the df of a single machine to the list of df's
    machineDF.append(df)
    # iterate through the columns below
    for feature in ['volt', 'rotate', 'pressure', 'vibration']:
        # for each machineDF and column above, take the average of every three rows
        machineDF[key][feature] = machineDF[key][feature].rolling(3).mean()
    for feature in ['mComp_1', 'mComp_2', 'mComp_3', 'mComp_4', 'fComp_1', 'fComp_2', 'fComp_3', 'fComp_4', 'errors']:
        machineDF[key][feature + 'Rolling'] = machineDF[key][feature].rolling(3).sum()  
    # drop every 2 of 3 rows
    machineDF[key] = machineDF[key].iloc[::3, :]
    # drop the original component columns

    # rename the rollings component columns
    
#convert the list back to a single df
df_1 = pd.concat(machineDF).fillna(0).sort_index().reset_index(drop = True)

In [236]:
a = df_1.errors.value_counts().sort_index()
b = df_1.errorsRolling.value_counts().sort_index()
c = singleDF.errors.value_counts().sort_index()
print(a)
print('-'*15)
print(b)
print('-'*15)
print(c)

0    284636
1      1357
2       236
3        29
Name: errors, dtype: int64
---------------
0.0    282735
1.0      3246
2.0       246
3.0        31
Name: errorsRolling, dtype: int64
---------------
0    872484
1      3342
2       245
3        29
Name: errors, dtype: int64


In [None]:
df_1.info()

In [None]:
mask = df_1['mComp_4Rolling'] > 0
df_1[mask].head()

### Analysis

In [None]:
# time for a first look

#correlation heatmap
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(18, 16))
    #colormap = sns.cubehelix_palette(start = 0, n_colors = 6, reverse = True)
    
    _ = sns.heatmap(
        df.corr(), 
        #cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':10 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=20)

correlation_heatmap(singleDF)

#### a look at indivual machines vs model vs age
sns.set(style="ticks", color_codes=True)
sns.pairplot(data = originsData)

#### look at the component usage graphically
plot = maintenanceData.plot.hist(alpha = .75, bins = 4, x = 'machineID', y = ['comp'],
                                 edgecolor='black', linewidth=1.2, xticks = [1,2,3,4],
                                 fontsize = 14, legend = True, figsize = (7, 5))
plt.title('Component Maintenance Distrobution', size = 18)
plt.xlabel('Component', size = 16)
plt.ylabel("Quantity", size = 16)
plt.legend(prop={'size': 12})
plt.xticks(range(1, 5))
plt.xlim([.25,4.75])

####  JUST FOOLING AROUND
sensorGraphData = sensorData
sensorGraphData[['volt_ma', 'rotate_ma', 'pressure_ma', 'vibration_ma']] = minmax_scale(sensorGraphData[['volt_ma', 'rotate_ma', 'pressure_ma', 'vibration_ma']])
mask = sensorGraphData['machineID'] == 55
sensorGraphData = sensorGraphData[mask]
sensorGraphData.drop(['volt', 'rotate', 'pressure', 'vibration', 'machineID'], axis = 1, inplace = True)
sensorGraphData = sensorGraphData.iloc[::730, :]
sensorGraphData.shape

####  JUST FOOLING AROUND
plot = sensorGraphData.plot.line(x = 'datetime', sharey = ['volt_ma', 'rotate_ma', 'pressure_ma', 'vibration_ma'], 
                                 title = 'Mach 55 Twelve Month Normalized Averages', legend = True)
plt.xlabel('Year')
plt.ylabel("Sensor Average")
L = plt.legend()
L.get_texts()[0].set_text('Voltage')
L.get_texts()[1].set_text('Rotation')
L.get_texts()[2].set_text('Pressure')
L.get_texts()[3].set_text('Vibration')