# Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive.

The sections below build features from four data sets:

- a) Telemetry data
- b) Error data
- c) Maintenance data
- d) Machine data

**Error data** The error log contains non-breaking errors recorded while the machine is still operational. These errors are not considered failures, though they may be predictive of a future failure event. The error datetime field is rounded to the nearest hour because the telemetry data (loaded later) is collected hourly.

**Failure data** correspond to component replacements within the maintenance log. Each record contains the Machine ID, component type, and replacement datetime. These records will be used to create the machine learning labels we will be trying to predict.
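
The rounding of error timestamps mentioned above can be sketched as follows. This is a minimal illustration on made-up timestamps, not the preprocessing that produced the data files; the variable names `raw` and `rounded` are hypothetical.

```python
import pandas as pd

# Hypothetical raw error timestamps with minute-level precision.
raw = pd.Series(pd.to_datetime(["2015-01-01 06:12", "2015-01-01 06:47"]))

# Round each timestamp to the nearest hour so it lines up with hourly telemetry.
rounded = raw.dt.round("h")
print(rounded.tolist())
```

After rounding, error records and telemetry records share the same hourly keys and can be joined directly.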

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from multiprocessing import Pool

import gc
import time

pd.set_option('display.max_columns', 999)

In [2]:
file_failure = 'pdm_failures_data.csv'
file_error = 'pdm_errors_data.csv'
file_maintenance = 'pdm_maint_data.csv'
file_machine = 'pdm_machines_data.csv'
file_telemetry = 'pdm_telemetry_data.csv'

In [3]:
start = '2015-01-01 00:00:00'
end = '2016-01-02 00:00:00'

In [4]:
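# Note: `closed=` was deprecated in pandas 1.4 and removed in 2.0; on newer versions pass inclusive='left' instead.
rng_H = pd.date_range(start=start, end=end, freq='H', closed='left', name='datetime')
rng_12H = pd.date_range(start=start, end=end, freq='12H', closed='left', name='datetime')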
machines = np.arange(1, 1001)
idx = [(d, m) for d in rng_H for m in machines]
print(len(idx))

8784000
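
The (datetime, machineID) grid built above with a list comprehension can also be created directly as a `MultiIndex`. The sketch below uses a small hypothetical grid (48 hours, 3 machines) and the names `hours_toy`, `machines_toy`, and `grid`, none of which appear in the notebook.

```python
import numpy as np
import pandas as pd

# Small stand-ins for rng_H and machines.
hours_toy = pd.date_range("2015-01-01", periods=48, freq="h", name="datetime")
machines_toy = np.arange(1, 4)

# Cartesian product of the two levels; the first level varies slowest,
# which matches the ordering of the nested list comprehension.
grid = pd.MultiIndex.from_product(
    [hours_toy, machines_toy], names=("datetime", "machineID")
)
print(len(grid))
```

For the full grid of 8,784 hours and 1,000 machines this avoids materializing millions of Python tuples first, which may reduce memory use.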


## a) Telemetry data set

- Compute the rolling mean and standard deviation of each sensor over 12-, 24-, and 36-hour windows
- Downsample from 1H to 12H
- **output** is df_telemetry_feat
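
To make the two steps above concrete, here is a minimal sketch on a made-up hourly signal for a single machine. The frame `toy` and the grid `grid_12h` are hypothetical; the column names follow the notebook's naming pattern.

```python
import numpy as np
import pandas as pd

# Toy hourly signal for one hypothetical machine: 48 readings with values 0..47.
idx = pd.date_range("2015-01-01", periods=48, freq="h", name="datetime")
toy = pd.DataFrame({"volt": np.arange(48, dtype=float)}, index=idx)

# Trailing 12-hour window; the first 11 rows have too few points and are NaN.
toy["volt_rollingmean_12"] = toy["volt"].rolling(12).mean()
toy["volt_rollingstd_12"] = toy["volt"].rolling(12).std(ddof=0)

# Downsample by keeping only the rows whose timestamp lies on a 12-hour grid.
grid_12h = pd.date_range("2015-01-01", periods=4, freq="12h", name="datetime")
toy_12h = toy.loc[toy.index.isin(grid_12h)]
print(toy_12h)
```

Each retained row summarizes the hours leading up to it, so the downsampled frame still carries information from the hourly readings that were dropped.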

In [5]:
df_telemetry = pd.read_csv(file_telemetry, index_col='datetime', parse_dates=True, encoding='utf-8')
df_telemetry.sort_index(inplace=True)
print(df_telemetry.shape)
df_telemetry.head(3)


datetime,machineID,volt,rotate,pressure,vibration
2015-01-01 06:00:00,1,151.919999,530.813578,101.788175,49.604013
2015-01-01 06:00:00,343,143.835177,511.149503,101.831177,48.986133
2015-01-01 06:00:00,342,205.069011,468.370137,79.274415,36.174677


In [6]:
rolling_features = ['volt','rotate', 'pressure', 'vibration']
lags = [12, 24, 36]

In [7]:
cpu_count = 6  # number of worker processes (CPU cores) to use

def tmpFunc(params):
    (name, df) = params
    df.sort_index(inplace=True)
    dfx = pd.DataFrame(index=df.index)
    dfx['machineID'] = name
    for lag in lags:
        for feat in rolling_features:
            col_name = feat + "_rollingmean_" + str(lag)
            dfx[col_name] = df[feat].rolling(lag).mean()
            col_name = feat + "_rollingstd_" + str(lag)
            dfx[col_name] = df[feat].rolling(lag).std(ddof=0)
    return dfx

def applyParallel(dfGrouped, func):
    with Pool(cpu_count) as p:
        ret_list = p.map(func, [params for params in dfGrouped])
    return pd.concat(ret_list, axis=0)


start_time = time.time()  # count time
grouped = df_telemetry.groupby('machineID')
df_tmp = applyParallel(grouped, tmpFunc)
elapsed_time = time.time() - start_time # count time

print("\nComplete", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))


Complete 00:00:29
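
As a point of comparison with the multiprocessing approach above, pandas can compute per-group rolling statistics in a single process with `groupby(...).rolling(...)`. The sketch below uses a made-up two-machine frame; it is an alternative formulation, not the notebook's method, and it may be slower or faster depending on data size.

```python
import pandas as pd

# Two hypothetical machines with four hourly readings each.
idx = pd.date_range("2015-01-01", periods=4, freq="h", name="datetime")
toy = pd.DataFrame(
    {
        "machineID": [1] * 4 + [2] * 4,
        "volt": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    },
    index=idx.append(idx),
)

# Windows are evaluated inside each machine's rows and never span two machines.
rolled = toy.groupby("machineID")["volt"].rolling(2).mean()
print(rolled)
```

The result is indexed by (machineID, datetime), so the first reading of every machine is NaN rather than being averaged with the previous machine's last reading.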


In [8]:
df_tmp.groupby('machineID').get_group(1).iloc[10:14]

datetime,machineID,volt_rollingmean_12,volt_rollingstd_12,rotate_rollingmean_12,rotate_rollingstd_12,pressure_rollingmean_12,pressure_rollingstd_12,vibration_rollingmean_12,vibration_rollingstd_12,volt_rollingmean_24,volt_rollingstd_24,rotate_rollingmean_24,rotate_rollingstd_24,pressure_rollingmean_24,pressure_rollingstd_24,vibration_rollingmean_24,vibration_rollingstd_24,volt_rollingmean_36,volt_rollingstd_36,rotate_rollingmean_36,rotate_rollingstd_36,pressure_rollingmean_36,pressure_rollingstd_36,vibration_rollingmean_36,vibration_rollingstd_36
2015-01-01 16:00:00,1,,,,,,,,,,,,,,,,,,,,,,,,
2015-01-01 17:00:00,1,170.401969,16.264585,472.783012,62.920803,101.304629,8.436731,42.194259,4.928213,,,,,,,,,,,,,,,,
2015-01-01 18:00:00,1,170.407953,16.257796,469.3795,60.75718,100.459373,8.844209,41.725764,4.445085,,,,,,,,,,,,,,,,
2015-01-01 19:00:00,1,169.484001,16.312698,464.108608,57.443573,98.88104,8.076374,42.306164,4.818784,,,,,,,,,,,,,,,,


In [9]:
df_telemetry_feat = pd.DataFrame(index=rng_12H)
print("before join", len(df_telemetry_feat))
df_telemetry_feat = df_telemetry_feat.join(df_tmp, how='inner')
print("after join", len(df_telemetry_feat))
df_telemetry_feat.fillna(0, inplace=True)
df_telemetry_feat.groupby('machineID').get_group(1).iloc[10:14]

before join 732
after join 730000


datetime,machineID,volt_rollingmean_12,volt_rollingstd_12,rotate_rollingmean_12,rotate_rollingstd_12,pressure_rollingmean_12,pressure_rollingstd_12,vibration_rollingmean_12,vibration_rollingstd_12,volt_rollingmean_24,volt_rollingstd_24,rotate_rollingmean_24,rotate_rollingstd_24,pressure_rollingmean_24,pressure_rollingstd_24,vibration_rollingmean_24,vibration_rollingstd_24,volt_rollingmean_36,volt_rollingstd_36,rotate_rollingmean_36,rotate_rollingstd_36,pressure_rollingmean_36,pressure_rollingstd_36,vibration_rollingmean_36,vibration_rollingstd_36
2015-01-06 12:00:00,1,168.457307,16.551622,425.819863,38.619073,99.32854,12.08771,39.527937,4.449407,168.510008,13.781363,432.033429,41.346017,98.87182,9.273188,39.026092,3.793255,170.912938,13.655744,429.674761,44.614149,96.676314,10.021537,38.972309,3.339102
2015-01-07 00:00:00,1,170.921659,14.888984,447.573772,26.194959,99.220138,11.406823,39.763909,4.513878,169.689483,15.790418,436.696817,34.743524,99.274339,11.752323,39.645923,4.483311,169.313892,14.205763,437.213543,37.710139,98.987926,10.036276,39.272031,4.062658
2015-01-07 12:00:00,1,175.086814,14.190837,442.941872,40.9917,99.043822,11.251366,39.980969,5.299124,173.004236,14.692446,445.257822,34.476242,99.13198,11.329704,39.872439,4.923381,171.488593,15.486286,438.778502,37.060963,99.1975,11.588254,39.757605,4.773389
2015-01-08 00:00:00,1,175.503304,14.814961,460.420049,39.443437,104.062077,5.298274,39.421584,5.109683,175.295059,14.507751,451.680961,41.163379,101.55295,9.144845,39.701277,5.212775,173.837259,14.780422,450.311898,36.906501,100.775346,10.016653,39.722154,4.990783


In [10]:
# Clean up objects
del df_tmp, df_telemetry, grouped
gc.collect()

50

## b) Error data set

- Compute the rolling mean of hourly error counts, per error type, over a 24-hour window
- **output** is df_error_feat
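
The idea can be sketched on a made-up error log for one machine. The sketch uses `pd.crosstab` and `reindex` rather than the `pivot_table` and `join` calls used in the cells below, and the names `log`, `counts`, and `rolling_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical error log for one machine: two error2 events and one error5 event.
log = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2015-01-01 03:00", "2015-01-01 20:00", "2015-01-02 01:00"]
        ),
        "errorID": ["error2", "error5", "error2"],
    }
)

# One column per error type, one row per hour in which an error occurred.
counts = pd.crosstab(log["datetime"], log["errorID"])

# Reindex onto a complete hourly grid so hours without errors count as zero.
hours = pd.date_range("2015-01-01", periods=48, freq="h", name="datetime")
counts = counts.reindex(hours, fill_value=0)

# Trailing 24-hour mean; multiplying by 24 recovers the error count in the window.
rolling_mean = counts.rolling(24).mean()
print(rolling_mean.loc[pd.Timestamp("2015-01-02 02:00")])
```

A single error inside the window gives 1/24, roughly 0.041667, which is the value that appears in the notebook's error feature tables.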


In [11]:
df_error = pd.read_csv(file_error, index_col='datetime', parse_dates=['datetime'], encoding='utf-8')
df_error.sort_index(inplace=True)
df_error.head(3)

datetime,machineID,errorID
2015-01-01 06:00:00,613,error2
2015-01-01 06:00:00,233,error5
2015-01-01 06:00:00,935,error2


In [12]:
# pivot
error_feats = df_error['errorID'].unique()
df1 = pd.pivot_table(df_error, values='machineID', index=['datetime', 'machineID'], columns='errorID', aggfunc=len).fillna(0)
df1.reset_index('machineID', inplace=True)
df1.head(3)

datetime,machineID,error1,error2,error3,error4,error5
2015-01-01 06:00:00,183,0.0,0.0,0.0,1.0,0.0
2015-01-01 06:00:00,186,0.0,1.0,1.0,0.0,0.0
2015-01-01 06:00:00,233,0.0,0.0,0.0,0.0,1.0


In [13]:
cpu_count = 6

def tmpFunc(params):
    (name, df) = params
    dfx = df.join(pd.DataFrame(index=rng_H), how='right')
    dfx['machineID'] = name
    dfx.fillna(0, inplace=True)
    dfx.sort_index(inplace=True)
    for feat in error_feats:
        col_name = feat + "_rollingmean24"
        dfx[col_name] = dfx[feat].rolling(24).mean()
    return dfx

def applyParallel(dfGrouped, func):
    with Pool(cpu_count) as p:
        ret_list = p.map(func, [params for params in dfGrouped])
    return pd.concat(ret_list, axis=0)


start_time = time.time()
grouped = df1.groupby('machineID')
df_error_feat = applyParallel(grouped, tmpFunc)

elapsed_time = time.time() - start_time

print("\nComplete", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))

In [14]:
df_error_feat.groupby('machineID').get_group(801).head(6)

datetime,machineID,error1,error2,error3,error4,error5,error2_rollingmean24,error5_rollingmean24,error1_rollingmean24,error3_rollingmean24,error4_rollingmean24
2015-01-01 00:00:00,801,0.0,0.0,0.0,0.0,0.0,,,,,
2015-01-01 01:00:00,801,0.0,0.0,0.0,0.0,0.0,,,,,
2015-01-01 02:00:00,801,0.0,0.0,0.0,0.0,0.0,,,,,
2015-01-01 03:00:00,801,0.0,0.0,0.0,0.0,0.0,,,,,
2015-01-01 04:00:00,801,0.0,0.0,0.0,0.0,0.0,,,,,
2015-01-01 05:00:00,801,0.0,0.0,0.0,0.0,0.0,,,,,


In [15]:
df_error_feat = df_error_feat.join(pd.DataFrame(index=rng_12H), on='datetime', how='right').fillna(0)
df_error_feat.groupby('machineID').get_group(744).head(4)

datetime,machineID,error1,error2,error3,error4,error5,error2_rollingmean24,error5_rollingmean24,error1_rollingmean24,error3_rollingmean24,error4_rollingmean24
2015-01-01 00:00:00,744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-01-01 12:00:00,744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-01-02 00:00:00,744,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0
2015-01-02 12:00:00,744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0


In [16]:
del df1, df_error
gc.collect()

21

## c) Maintenance Data

- Compute the accumulated days since each component was last replaced
- **output** is df_maint_feat
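
The "days since last replacement" computation relies on forward-filling the timestamp of the most recent event. A minimal sketch for one machine and one component, using made-up data and the hypothetical names `flags`, `timestamps`, and `since_last`:

```python
import pandas as pd

# Hypothetical hourly flags for one machine: 1.0 marks an hour with a replacement.
hours = pd.date_range("2015-01-01", periods=96, freq="h", name="datetime")
flags = pd.Series(0.0, index=hours, name="comp1")
flags.loc[pd.Timestamp("2015-01-01 06:00")] = 1.0

# Keep the timestamp where a replacement happened, NaT everywhere else.
timestamps = hours.to_series()
event_time = timestamps.where(flags.eq(1))

# Carry the last replacement time forward and subtract it from the current time.
# Hours before the first replacement have no history and are set to 0 days.
elapsed = timestamps - event_time.ffill()
since_last = elapsed.fillna(pd.Timedelta(0)).dt.days
print(since_last.loc[pd.Timestamp("2015-01-03 12:00")])
```

Note that `.dt.days` truncates to whole days, so the counter stays at 0 for the first 23 hours after a replacement. Rows before the first observed replacement are also 0, which makes "just replaced" and "no history" indistinguishable in this feature.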

In [17]:
df_maint = pd.read_csv(file_maintenance, index_col='datetime', parse_dates=True, encoding='utf-8')
df_maint.sort_index(inplace=True)
print(df_maint.shape)
df_maint.head(3)

(32592, 2)


datetime,machineID,comp
2014-06-01 06:00:00,479,comp3
2014-06-01 06:00:00,128,comp2
2014-06-01 06:00:00,128,comp4


In [18]:
df1 = pd.pivot_table(df_maint, values='machineID', index=['datetime', 'machineID'], columns='comp', aggfunc=len).fillna(0)
df1.sort_index(inplace=True)
df1.head(5)

datetime,machineID,comp1,comp2,comp3,comp4
2014-06-01 06:00:00,3,1.0,1.0,0.0,0.0
2014-06-01 06:00:00,4,1.0,0.0,0.0,0.0
2014-06-01 06:00:00,17,1.0,0.0,0.0,0.0
2014-06-01 06:00:00,22,0.0,1.0,0.0,0.0
2014-06-01 06:00:00,23,0.0,0.0,1.0,0.0


In [19]:
start_time = time.time()

comp_feat = df_maint['comp'].unique()
midx = pd.MultiIndex.from_tuples(idx, names=('datetime', 'machineID'))
df1 = df1.join(pd.DataFrame(index=midx), on=['datetime', 'machineID'],how='right').fillna(0)
df1.reset_index('machineID', inplace=True) # move machineID from index to normal column
df1.sort_index(inplace=True)
    
elapsed_time = time.time() - start_time
print("\nComplete", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))


Complete 00:00:08


In [20]:
start_time = time.time()

for feat in comp_feat:
    col_name = "sincelast" + feat
    df1[col_name] = df1.index.where(df1[feat].eq(1))
    df1[col_name] = (df1.index - df1.groupby('machineID')[col_name].ffill()).fillna(pd.Timedelta(0)).dt.days
    print("Done "+ feat)
    
elapsed_time = time.time() - start_time
print("\nComplete", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))

Done comp3
Done comp2
Done comp4
Done comp1

Complete 00:00:07


In [21]:
df_maint_feat = df1.join(pd.DataFrame(index=rng_12H), how='inner').fillna(0)
df_maint_feat.groupby('machineID').get_group(1).iloc[7:15]

datetime,machineID,comp1,comp2,comp3,comp4,sincelastcomp3,sincelastcomp2,sincelastcomp4,sincelastcomp1
2015-01-04 12:00:00,1,0.0,0.0,0.0,0.0,0,0,0,0
2015-01-05 00:00:00,1,0.0,0.0,0.0,0.0,0,0,0,0
2015-01-05 12:00:00,1,0.0,0.0,0.0,0.0,0,0,0,0
2015-01-06 00:00:00,1,0.0,0.0,0.0,0.0,0,0,0,0
2015-01-06 12:00:00,1,0.0,0.0,0.0,0.0,0,0,0,1
2015-01-07 00:00:00,1,0.0,0.0,0.0,0.0,0,0,0,1
2015-01-07 12:00:00,1,0.0,0.0,0.0,0.0,0,0,0,2
2015-01-08 00:00:00,1,0.0,0.0,0.0,0.0,0,0,0,2


In [22]:
del df1, df_maint
gc.collect()

46

## d) Machine Data
- One-hot encode the machine model
- **output** df_machine_feat
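
The cells below store the one-hot vector as a tuple in a single column. An alternative, sketched here on a made-up machine table, is one indicator column per model via `pd.get_dummies`. The `is_` prefix and the names `machines_toy` and `encoded` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical machine table with the same columns as the notebook's machine data.
machines_toy = pd.DataFrame(
    {
        "machineID": [1, 2, 3],
        "model": ["model2", "model4", "model2"],
        "age": [18, 6, 4],
    }
)

# One indicator column per model value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(machines_toy["model"], prefix="is", dtype=int)
encoded = pd.concat([machines_toy.drop(columns="model"), dummies], axis=1)
print(encoded)
```

Separate numeric columns survive a round trip through CSV unchanged, whereas a tuple column is written out as a string such as "(0, 0, 1, 0)" and has to be parsed again on load.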

In [23]:
df_machine_feat = pd.read_csv(file_machine, encoding='utf-8')
print(df_machine_feat['model'].unique())
df_machine_feat.head(3)

['model4' 'model2' 'model3' 'model1']


Unnamed: 0,machineID,model,age
0,501,model4,6
1,502,model4,4
2,1,model2,18


In [24]:
repl_index = {'model1':(0, 0, 0, 1), 'model2':(0, 0, 1, 0), 'model3':(0, 1, 0, 0), 'model4':(1, 0, 0, 0)}
df_machine_feat['model_encoded'] = df_machine_feat['model'].map(repl_index)
df_machine_feat.head(3)

Unnamed: 0,machineID,model,age,model_encoded
0,501,model4,6,"(1, 0, 0, 0)"
1,502,model4,4,"(1, 0, 0, 0)"
2,1,model2,18,"(0, 0, 1, 0)"


## Merge Features
- **output** is df_feat

In [25]:
df_error_feat.reset_index(inplace=True)
df_maint_feat.reset_index(inplace=True)
df_telemetry_feat.reset_index(inplace=True)

In [26]:
df_feat = pd.merge(df_error_feat, df_maint_feat, how='left', left_on=['datetime', 'machineID'], right_on = ['datetime', 'machineID'])
print(df_feat.shape)
df_feat = pd.merge(df_feat, df_machine_feat, how='left', left_on=['machineID'], right_on = ['machineID'])
print(df_feat.shape)
df_feat = pd.merge(df_feat, df_telemetry_feat, how='right', left_on=['datetime', 'machineID'], right_on = ['datetime', 'machineID'])
print(df_feat.shape)
df_feat.head(1)

(732000, 20)
(732000, 23)
(730000, 47)
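
The row count drops from 732000 to 730000 in the last merge because a right join keeps only the (datetime, machineID) keys present in the telemetry features. The sketch below shows the same effect on two made-up frames, and uses the `validate` and `indicator` options of `pd.merge` as optional sanity checks; the frames `left` and `right` are hypothetical.

```python
import pandas as pd

# Left frame has two time steps, right frame only the second one.
left = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2015-01-01 00:00", "2015-01-01 12:00"]),
        "machineID": [1, 1],
        "error1_rollingmean24": [0.0, 0.0],
    }
)
right = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2015-01-01 12:00"]),
        "machineID": [1],
        "volt_rollingmean_12": [170.4],
    }
)

# validate raises if a key is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
merged = pd.merge(
    left,
    right,
    how="right",
    on=["datetime", "machineID"],
    validate="one_to_one",
    indicator=True,
)
print(merged)
```

Checking the `_merge` column after a join is a cheap way to confirm that no rows were silently left unmatched.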


Unnamed: 0,datetime,machineID,error1,error2,error3,error4,error5,error2_rollingmean24,error5_rollingmean24,error1_rollingmean24,error3_rollingmean24,error4_rollingmean24,comp1,comp2,comp3,comp4,sincelastcomp3,sincelastcomp2,sincelastcomp4,sincelastcomp1,model,age,model_encoded,volt_rollingmean_12,volt_rollingstd_12,rotate_rollingmean_12,rotate_rollingstd_12,pressure_rollingmean_12,pressure_rollingstd_12,vibration_rollingmean_12,vibration_rollingstd_12,volt_rollingmean_24,volt_rollingstd_24,rotate_rollingmean_24,rotate_rollingstd_24,pressure_rollingmean_24,pressure_rollingstd_24,vibration_rollingmean_24,vibration_rollingstd_24,volt_rollingmean_36,volt_rollingstd_36,rotate_rollingmean_36,rotate_rollingstd_36,pressure_rollingmean_36,pressure_rollingstd_36,vibration_rollingmean_36,vibration_rollingstd_36
0,2015-01-01 12:00:00,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,model2,18,"(0, 0, 1, 0)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# code to check the result

#df_error_feat.loc[(df_error_feat['datetime'] == '2015-01-03 00:00:00') & (df_error_feat['machineID'] == 801)]
#df_maint_feat.loc[(df_maint_feat['datetime'] == '2015-01-03 00:00:00') & (df_maint_feat['machineID'] == 801)]
#df_feat.loc[(df_feat['datetime'] == '2015-01-03 00:00:00') & (df_feat['machineID'] == 801)]

In [28]:
df_feat.drop(['error1', 'error2', 'error3', 'error4','error5', 'comp1', 'comp2', 'comp3', 'comp4', 'model'], axis=1, inplace=True)
df_feat.shape

(730000, 37)

## Label Failures


Note that some machines have multiple failure records at the same timestamp, for example:

                    datetime  machineID failure
    2475 2015-11-28 06:00:00         90   comp2
    6725 2015-11-28 06:00:00         90   comp1

In [29]:
df_failure = pd.read_csv(file_failure, index_col='datetime', parse_dates=True, encoding='utf-8')
print(df_failure.shape)
df_failure.head(3)

(6726, 2)


datetime,machineID,failure
2015-10-25 06:00:00,800,comp1
2015-04-05 06:00:00,702,comp1
2015-11-24 06:00:00,800,comp2


In [30]:
df1 = pd.pivot_table(df_failure, values='machineID', index=['datetime', 'machineID'], columns='failure', aggfunc=len)
df1 = df1.join(pd.DataFrame(index=midx), on=['datetime', 'machineID'],how='right')
df1.sort_index(inplace=True)
df1.reset_index('machineID', inplace=True) # move machineID from index to normal column
df1.head(5)

datetime,machineID,comp1,comp2,comp3,comp4
2015-01-01,1,,,,
2015-01-01,2,,,,
2015-01-01,3,,,,
2015-01-01,4,,,,
2015-01-01,5,,,,


In [31]:
# backfill 7 days
win = 7* 24
df1 = df1.groupby('machineID').bfill(limit=win)
df1.fillna(0, inplace=True)

In [32]:
dfx = df1.groupby('machineID').get_group(90)
print("test on machineID 90, 2015-11-28 06:00:00 failure on comp1 and comp2")
print("tail record")
print(dfx.loc['2015-11-28 02:00:00':'2015-11-28 07:00:00'])
print("head record")
print(dfx.loc['2015-11-21 04:00:00':'2015-11-21 06:00:00']) # 7 days before

test on machineID 90, 2015-11-28 06:00:00 failure on comp1 and comp2
tail record
                     machineID  comp1  comp2  comp3  comp4
datetime                                                  
2015-11-28 02:00:00         90    1.0    1.0    0.0    0.0
2015-11-28 03:00:00         90    1.0    1.0    0.0    0.0
2015-11-28 04:00:00         90    1.0    1.0    0.0    0.0
2015-11-28 05:00:00         90    1.0    1.0    0.0    0.0
2015-11-28 06:00:00         90    1.0    1.0    0.0    0.0
2015-11-28 07:00:00         90    0.0    0.0    0.0    0.0
head record
                     machineID  comp1  comp2  comp3  comp4
datetime                                                  
2015-11-21 04:00:00         90    0.0    0.0    0.0    0.0
2015-11-21 05:00:00         90    0.0    0.0    0.0    0.0
2015-11-21 06:00:00         90    1.0    1.0    0.0    0.0
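
The check above shows the label set to 1 from exactly 7 days before the failure up to the failure hour itself. The same behavior can be reproduced on a made-up series for a single machine and component; the names `label` and `window` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hourly series that is NaN everywhere except at the failure timestamp.
hours = pd.date_range("2015-11-20", periods=10 * 24, freq="h", name="datetime")
label = pd.Series(np.nan, index=hours, name="comp1")
label.loc[pd.Timestamp("2015-11-28 06:00")] = 1.0

# Back-fill at most 7 * 24 hourly rows before the failure, then mark the rest as 0.
window = 7 * 24
label = label.bfill(limit=window).fillna(0)
print(int(label.sum()))
```

The labeled span covers 169 rows: the 168 hours before the failure plus the failure hour. A row labeled 1 therefore means "this component fails within the next 7 days".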


In [33]:
print(len(df1))
df1.rename(columns={"comp1": "failure_comp1", "comp2": "failure_comp2", "comp3": "failure_comp3", "comp4": "failure_comp4"}, inplace=True)
df1 = df1.join(pd.DataFrame(index=rng_12H), how='right')
df1.reset_index(inplace=True)
print(len(df1))
df_feat = pd.merge(df_feat, df1, how='left', left_on=['datetime', 'machineID'], right_on = ['datetime', 'machineID'])
df_feat.head(1)

8784000
732000


Unnamed: 0,datetime,machineID,error2_rollingmean24,error5_rollingmean24,error1_rollingmean24,error3_rollingmean24,error4_rollingmean24,sincelastcomp3,sincelastcomp2,sincelastcomp4,sincelastcomp1,age,model_encoded,volt_rollingmean_12,volt_rollingstd_12,rotate_rollingmean_12,rotate_rollingstd_12,pressure_rollingmean_12,pressure_rollingstd_12,vibration_rollingmean_12,vibration_rollingstd_12,volt_rollingmean_24,volt_rollingstd_24,rotate_rollingmean_24,rotate_rollingstd_24,pressure_rollingmean_24,pressure_rollingstd_24,vibration_rollingmean_24,vibration_rollingstd_24,volt_rollingmean_36,volt_rollingstd_36,rotate_rollingmean_36,rotate_rollingstd_36,pressure_rollingmean_36,pressure_rollingstd_36,vibration_rollingmean_36,vibration_rollingstd_36,failure_comp1,failure_comp2,failure_comp3,failure_comp4
0,2015-01-01 12:00:00,1,0.0,0.0,0.0,0.0,0.0,0,0,0,0,18,"(0, 0, 1, 0)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
del df1, df_failure, dfx
gc.collect()

29

In [35]:
df_feat.to_csv('featured_data.csv', index=False)

In [36]:
df_feat.isnull().any()

datetime                    False
machineID                   False
error2_rollingmean24        False
error5_rollingmean24        False
error1_rollingmean24        False
error3_rollingmean24        False
error4_rollingmean24        False
sincelastcomp3              False
sincelastcomp2              False
sincelastcomp4              False
sincelastcomp1              False
age                         False
model_encoded               False
volt_rollingmean_12         False
volt_rollingstd_12          False
rotate_rollingmean_12       False
rotate_rollingstd_12        False
pressure_rollingmean_12     False
pressure_rollingstd_12      False
vibration_rollingmean_12    False
vibration_rollingstd_12     False
volt_rollingmean_24         False
volt_rollingstd_24          False
rotate_rollingmean_24       False
rotate_rollingstd_24        False
pressure_rollingmean_24     False
pressure_rollingstd_24      False
vibration_rollingmean_24    False
vibration_rollingstd_24     False
volt_rollingme

In [37]:
g = df_feat.groupby('machineID').get_group(744)
g.head()

Unnamed: 0,datetime,machineID,error2_rollingmean24,error5_rollingmean24,error1_rollingmean24,error3_rollingmean24,error4_rollingmean24,sincelastcomp3,sincelastcomp2,sincelastcomp4,sincelastcomp1,age,model_encoded,volt_rollingmean_12,volt_rollingstd_12,rotate_rollingmean_12,rotate_rollingstd_12,pressure_rollingmean_12,pressure_rollingstd_12,vibration_rollingmean_12,vibration_rollingstd_12,volt_rollingmean_24,volt_rollingstd_24,rotate_rollingmean_24,rotate_rollingstd_24,pressure_rollingmean_24,pressure_rollingstd_24,vibration_rollingmean_24,vibration_rollingstd_24,volt_rollingmean_36,volt_rollingstd_36,rotate_rollingmean_36,rotate_rollingstd_36,pressure_rollingmean_36,pressure_rollingstd_36,vibration_rollingmean_36,vibration_rollingstd_36,failure_comp1,failure_comp2,failure_comp3,failure_comp4
743,2015-01-01 12:00:00,744,0.0,0.0,0.0,0.0,0.0,0,0,0,0,17,"(0, 1, 0, 0)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1743,2015-01-02 00:00:00,744,0.0,0.0,0.0,0.041667,0.0,0,0,0,0,17,"(0, 1, 0, 0)",172.359146,14.702504,452.498343,41.51221,102.298172,10.678776,39.840334,6.010805,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2743,2015-01-02 12:00:00,744,0.0,0.0,0.0,0.041667,0.0,0,0,0,0,17,"(0, 1, 0, 0)",168.71098,16.084044,438.122511,37.010838,102.978242,9.197898,39.815509,4.791518,170.535063,15.516357,445.310427,39.977481,102.638207,9.971681,39.827921,5.435473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3743,2015-01-03 00:00:00,744,0.0,0.0,0.0,0.0,0.0,0,0,0,0,17,"(0, 1, 0, 0)",176.394996,14.067841,437.034543,53.778074,99.921798,10.535467,42.035151,3.811634,172.552988,15.590424,437.578527,46.165329,101.45002,10.006706,40.92533,4.469374,172.488374,15.30045,442.551799,45.218504,101.732738,10.24344,40.563665,5.061791,0.0,0.0,0.0,1.0
4743,2015-01-03 12:00:00,744,0.0,0.0,0.0,0.0,0.0,0,0,0,0,17,"(0, 1, 0, 0)",174.314376,11.212228,435.527267,45.118387,95.080414,6.009388,42.993126,8.485824,175.354686,12.76289,436.280905,49.643157,97.501106,8.911462,42.514139,6.595325,173.140117,14.30507,436.894774,45.829209,99.326818,9.370681,41.614595,6.186284,0.0,0.0,0.0,1.0
