# CAICT Wind Turbine Icing Prediction - Data Preprocessing

2017/07/12  
http://www.industrial-bigdata.com/competition/competitionAction!showDetail.action?competition.competitionId=1

<a id='home'></a>
## Outline
[Load Data and Modules](#load)  
[Load raw data](#rawdata)  
[time to datetime](#datetime)    
[Map turbine parameters to turbine status](#merge)  
[Mark invalid data](#invalid)  
[save ready-for-analysis data to csv](#csv)

<a id='load'></a>
### Load Data and Modules

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import datetime
import os
import sys
import lzma
import pickle
import csv
from IPython.core.interactiveshell import InteractiveShell
pd.set_option('display.max_columns', None)
InteractiveShell.ast_node_interactivity = "all"

In [2]:
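def timerangeCheck(t, start, end):
    # True where start <= t <= end (both bounds inclusive); works on scalars and Series
    return (t>=start) & (t<=end)

def updateProgress(msg):
    # Overwrite the current console line with msg (carriage return, no newline)
    sys.stdout.write('\r')
    sys.stdout.write(msg)
    sys.stdout.flush()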
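
Because `timerangeCheck` combines its two comparisons with `&`, it can be applied to a whole pandas Series as well as to a single timestamp. A minimal sketch with made-up timestamps (the sample values are illustrative and not taken from the competition data):

```python
import pandas as pd

def timerangeCheck(t, start, end):
    # Same helper as in the notebook: inclusive on both bounds
    return (t >= start) & (t <= end)

# Illustrative timestamps, not real turbine data
times = pd.Series(pd.to_datetime([
    "2017-01-01 00:00:00",
    "2017-01-01 00:05:00",
    "2017-01-01 00:10:00",
]))
start = pd.Timestamp("2017-01-01 00:05:00")
end = pd.Timestamp("2017-01-01 00:10:00")

# One elementwise comparison over the whole Series
mask = timerangeCheck(times, start, end)
print(mask.tolist())
```

The result is a boolean Series that can be passed directly to `.loc`.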


<a id='rawdata'></a>
** Load raw data ** 

In [3]:
data_M15 = pd.read_csv('data/ice1/train/15/15_data.csv')
normlabel_M15 = pd.read_csv('data/ice1/train/15/15_normalInfo.csv')
abnormlabel_M15 = pd.read_csv('data/ice1/train/15/15_failureInfo.csv')

data_M21 = pd.read_csv('data/ice1/train/21/21_data.csv')
normlabel_M21 = pd.read_csv('data/ice1/train/21/21_normalInfo.csv')
abnormlabel_M21 = pd.read_csv('data/ice1/train/21/21_failureInfo.csv')

data_M08 = pd.read_csv('data/ice1/test/08/08_data.csv')

column_desc=pd.read_csv('doc/columndesc.csv')

- ** Abnormal (icing) time intervals for turbine 15 **

In [16]:
#abnormlabel_M15

- ** Normal time intervals for turbine 15 **

In [5]:
#normlabel_M15

<a id='datetime'></a>
** Convert time to datetime to simplify the time comparisons that follow. ** 

In [5]:
data_M15['time']=pd.to_datetime(data_M15['time'])
data_M15 = data_M15.sort_values(by='time')
data_M15['timestamp'] = data_M15['time'].apply(lambda x: x.timestamp())

data_M21['time']=pd.to_datetime(data_M21['time'])
data_M21 = data_M21.sort_values(by='time')
data_M21['timestamp'] = data_M21['time'].apply(lambda x: x.timestamp())

normlabel_M15['startTime']=pd.to_datetime(normlabel_M15['startTime'])
normlabel_M15['endTime']=pd.to_datetime(normlabel_M15['endTime'])
normlabel_M15 = normlabel_M15.sort_values(by='startTime')

normlabel_M21['startTime']=pd.to_datetime(normlabel_M21['startTime'])
normlabel_M21['endTime']=pd.to_datetime(normlabel_M21['endTime'])
normlabel_M21 = normlabel_M21.sort_values(by='startTime')

abnormlabel_M15['startTime']=pd.to_datetime(abnormlabel_M15['startTime'])
abnormlabel_M15['endTime']=pd.to_datetime(abnormlabel_M15['endTime'])
abnormlabel_M15 = abnormlabel_M15.sort_values(by='startTime')

abnormlabel_M21['startTime']=pd.to_datetime(abnormlabel_M21['startTime'])
abnormlabel_M21['endTime']=pd.to_datetime(abnormlabel_M21['endTime'])
abnormlabel_M21 = abnormlabel_M21.sort_values(by='startTime')
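
The same three steps (parse `startTime`, parse `endTime`, sort) are repeated for each label table, so they could be folded into a helper. The sketch below is one possible refactoring, not the notebook's own code: `parse_intervals` and `epoch_seconds` are hypothetical names, and the toy frame stands in for the real label files. `epoch_seconds` computes seconds since 1970-01-01 for a whole column at once instead of calling `.timestamp()` row by row, assuming tz-naive times.

```python
import pandas as pd

def parse_intervals(labels):
    """Return a copy with startTime/endTime parsed, sorted by startTime."""
    out = labels.copy()
    out["startTime"] = pd.to_datetime(out["startTime"])
    out["endTime"] = pd.to_datetime(out["endTime"])
    return out.sort_values(by="startTime")

def epoch_seconds(times):
    """Seconds since 1970-01-01 for a tz-naive datetime Series."""
    return (times - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)

# Toy label table (illustrative values only), deliberately out of order
toy = pd.DataFrame({
    "startTime": ["2017-01-02 00:00:00", "2017-01-01 00:00:00"],
    "endTime": ["2017-01-02 01:00:00", "2017-01-01 01:00:00"],
})
parsed = parse_intervals(toy)

secs = epoch_seconds(pd.Series(pd.to_datetime(["1970-01-01 00:00:10"])))
```

With such helpers, each label table would need a single call, e.g. `normlabel_M15 = parse_intervals(normlabel_M15)`.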

<a id='merge'></a>
** Map turbine parameters to turbine status ** 

In [6]:
%%timeit -n 1 -r 1 -o
for i in range(abnormlabel_M15.shape[0]):
    startTime, endTime = abnormlabel_M15.iloc[i]
    subset=data_M15['time'].apply(lambda x: timerangeCheck(x,startTime,endTime))
    data_M15.loc[subset, 'label']=1

for i in range(normlabel_M15.shape[0]):
    startTime, endTime = normlabel_M15.iloc[i]
    subset=data_M15['time'].apply(lambda x: timerangeCheck(x,startTime,endTime))
    data_M15.loc[subset, 'label']=0
    
for i in range(abnormlabel_M21.shape[0]):
    startTime, endTime = abnormlabel_M21.iloc[i]
    subset=data_M21['time'].apply(lambda x: timerangeCheck(x,startTime,endTime))
    data_M21.loc[subset, 'label']=1

for i in range(normlabel_M21.shape[0]):
    startTime, endTime = normlabel_M21.iloc[i]
    subset=data_M21['time'].apply(lambda x: timerangeCheck(x,startTime,endTime))
    data_M21.loc[subset, 'label']=0
    
    

2min 29s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


<TimeitResult : 2min 29s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
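
The loops above call a Python lambda once per row for every interval, which is likely where much of the recorded 2min 29s goes. Below is a sketch of the same inclusive interval labeling done with one whole-column comparison per interval. `label_intervals` is a hypothetical helper name, and the toy frames are illustrative stand-ins for `data_M15` and the label tables.

```python
import numpy as np
import pandas as pd

def label_intervals(data, intervals, value):
    """Set data['label'] = value where time falls in any [startTime, endTime]."""
    if "label" not in data.columns:
        data["label"] = np.nan  # unlabeled rows stay NaN
    for start, end in zip(intervals["startTime"], intervals["endTime"]):
        # Vectorized comparison over the whole column, inclusive on both ends
        mask = (data["time"] >= start) & (data["time"] <= end)
        data.loc[mask, "label"] = value
    return data

# Toy data (illustrative values only)
data = pd.DataFrame({"time": pd.to_datetime([
    "2017-01-01 00:00:00",
    "2017-01-01 01:00:00",
    "2017-01-01 02:00:00",
    "2017-01-01 03:00:00",
])})
icing = pd.DataFrame({
    "startTime": pd.to_datetime(["2017-01-01 00:30:00"]),
    "endTime": pd.to_datetime(["2017-01-01 01:30:00"]),
})
normal = pd.DataFrame({
    "startTime": pd.to_datetime(["2017-01-01 02:00:00"]),
    "endTime": pd.to_datetime(["2017-01-01 02:30:00"]),
})

label_intervals(data, icing, 1)   # icing intervals -> 1
label_intervals(data, normal, 0)  # normal intervals -> 0
```

Rows outside every interval keep `NaN` in `label`, matching what the original loops leave behind.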

<a id='invalid'></a>
** Mark invalid data **  
(Rows covered by neither a normal interval nor an icing interval are treated as invalid data.)

In [8]:
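# Rows outside every labeled interval still have NaN in 'label'; mark them as 2 (invalid).
# Fill only the label column so that missing sensor readings are not overwritten with 2.
data_M15['label'] = data_M15['label'].fillna(2)
data_M21['label'] = data_M21['label'].fillna(2)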
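
A quick sanity check after this step is to count rows per label value (0 normal, 1 icing, 2 invalid). The sketch below uses a toy frame with a hypothetical `wind_speed` column, and fills only `label` so that a missing sensor reading is not replaced by 2.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative); 'wind_speed' is a hypothetical sensor column
data = pd.DataFrame({
    "wind_speed": [1.2, np.nan, 0.7, 0.9],
    "label": [1.0, 0.0, np.nan, np.nan],
})

# Fill only the label column: unlabeled rows become 2 (invalid)
data["label"] = data["label"].fillna(2).astype(int)

# Number of rows per class, ordered by label value
counts = data["label"].value_counts().sort_index()
print(counts)
```

A large share of label 2 would indicate that much of the recording falls outside the provided intervals.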

<a id='csv'></a>
** save ready-for-analysis data to csv **

In [9]:
data_M15.to_csv('data/goodformat_15_FRESH.csv',index=False)
data_M21.to_csv('data/goodformat_21_FRESH.csv',index=False)
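
CSV stores `time` as text, so it has to be parsed again when the file is read back. The sketch below shows a round trip on a toy frame written to a temporary directory; the file name `goodformat_toy.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy frame (illustrative values only)
frame = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-01 00:00:00", "2017-01-01 00:00:07"]),
    "label": [0, 2],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "goodformat_toy.csv")  # hypothetical file name
    frame.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store
    restored = pd.read_csv(path, parse_dates=["time"])

print(restored.dtypes)
```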

[home](#home)

***