
# Automated Machine Learning End To End
## - Time Series Forecasting
_**Energy Demand Forecasting**_

## Contents
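1. [Concepts](#Concepts)
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [DataPreparation](#DataPreparation)
1. [TimeSeriesTrain1](#TimeSeriesTrain1)
1. [BestModelExtraction](#BestModelExtraction)
1. [UsingLagsAndRollingWindowFeatures](#UsingLagsAndRollingWindowFeatures)
1. [TimeSeriesTrain2](#TimeSeriesTrain2)
1. [DeployingToAKS](#DeployingToAKS)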

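## Concepts
An automated time-series experiment is treated as a multivariate regression problem: past values of the time series are used as regression inputs together with the other predictors.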

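### Time-series training models
**1. Prophet**  
  : A time-series forecasting model created by Facebook, available in Python and R. It is open source and fits an additive model made up of trend, seasonality, and holiday components.  
  
**2. Auto ARIMA**  
  : Combines AR (autoregression: the current value depends on its own past values), differencing to remove trend, and MA (moving average: the current value depends on past forecast errors). It is one of the most widely used time-series models, and Auto ARIMA selects the model orders automatically.  
  
**3. ForecastTCN**   
  : A deep-learning-based model (temporal convolutional network)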
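
To make the autoregressive (AR) idea concrete, the sketch below simulates a first-order AR process and recovers its coefficient by least squares. It is an illustration only and is not how AutoML or Auto ARIMA fit models; the function names `simulate_ar1` and `estimate_ar1` and the coefficient 0.8 are assumptions chosen for the example.

```python
import numpy as np

def simulate_ar1(phi, n, seed=0):
    """Simulate y[t] = phi * y[t-1] + noise[t] with standard normal noise."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + noise[t]
    return y

def estimate_ar1(y):
    """Least-squares estimate of phi from the pairs (y[t-1], y[t])."""
    prev, curr = y[:-1], y[1:]
    return float(np.dot(prev, curr) / np.dot(prev, prev))

series = simulate_ar1(phi=0.8, n=5000)
phi_hat = estimate_ar1(series)
print(round(phi_hat, 2))  # should be close to the true coefficient 0.8
```

The estimate approaches the true coefficient as the series gets longer, which is the sense in which an AR model "learns" how strongly each value depends on the previous one.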


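## Introduction
This example shows how to use AutoML to forecast a single time series in an energy demand application.

Process:
1. Create an experiment in an existing workspace
2. Configure AutoML for a simple time-series model and run it locally
3. View the engineered features and forecast results
4. Configure AutoML for a time-series model with lag and rolling window features and run it locally
5. Estimate feature importance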

## Setup

- Import modules

In [15]:
import azureml.core
import pandas as pd
import numpy as np
import logging
import warnings

# Suppress warning messages in the output
warnings.showwarning = lambda *args, **kwargs: None

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

* Workspace setup - when configuring it for the first time, go to the sign-in page you are directed to and enter the code shown.

In [17]:
ws = Workspace.from_config()

# If an experiment with this name already exists, new runs are added to it
experiment_name = 'automl-energydemandforecasting'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Run History Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

| | |
|-|-|
|SDK version|1.0.72|
|Subscription ID|3fda2f18-4b0e-4ca1-aaea-72aa8f954bd3|
|Workspace|htmlws|
|Resource Group|mlsrv-rg-ht|
|Location|koreacentral|
|Run History Name|automl-energydemandforecasting|


## DataPreparation

* Sample data - New York City energy consumption data
* Contents - hourly energy demand and basic weather data
* Format - CSV
* The timeStamp column is parsed on import using parse_dates

In [18]:
data = pd.read_csv("energy_data/nyc_energy.csv", parse_dates=['timeStamp'])
data.head()

| |timeStamp|demand|precip|temp|
|-|-|-|-|-|
|0|2012-01-01 00:00:00|4937.5|0.0|46.13|
|1|2012-01-01 01:00:00|4752.1|0.0|45.89|
|2|2012-01-01 02:00:00|4542.6|0.0|45.04|
|3|2012-01-01 03:00:00|4357.7|0.0|45.03|
|4|2012-01-01 04:00:00|4275.5|0.0|42.61|


### The target column demand contains NaN values

In [19]:
data.describe()

| |demand|precip|temp|
|-|-|-|-|
|count|49124.0|48975.0|49019.0|
|mean|6067.45|0.0|55.52|
|std|1285.61|0.02|17.7|
|min|2859.6|0.0|0.33|
|25%|5133.86|0.0|41.41|
|50%|6020.07|0.0|56.26|
|75%|6684.3|0.0|70.54|
|max|11456.0|0.91|97.26|


In [20]:
data.count()

timeStamp    49205
demand       49124
precip       48975
temp         49019
dtype: int64

In [21]:
data[pd.isnull(data['demand'])].count()

timeStamp    81
demand       0 
precip       81
temp         81
dtype: int64

In [22]:
data[pd.isnull(data['demand'])].head()

| |timeStamp|demand|precip|temp|
|-|-|-|-|-|
|49124|2012-03-11 02:00:00|NaN|0.0|37.78|
|49125|2013-03-10 02:00:00|NaN|0.0|38.18|
|49126|2014-03-09 02:00:00|NaN|0.0|40.86|
|49127|2015-03-08 02:00:00|NaN|0.0|36.96|
|49128|2015-03-11 11:00:00|NaN|0.0|49.95|


In [23]:
print(type(data['timeStamp']))

<class 'pandas.core.series.Series'>


#### Define the dataset schema
* y (dependent variable, the target) - demand
* x (independent variables) - precip, temp
* Time column - timeStamp

In [24]:
# schema
time_column_name = 'timeStamp'
target_column_name = 'demand'

### Forecast Horizon

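  * In addition to the data schema, specify the forecast horizon.  
  * The forecast horizon is the length of time after the latest date in the training data that the model should predict.  
  * The horizon is expressed in units of the time-series sampling frequency. The NYC energy demand data is hourly, so the unit here is hours; depending on the domain, other forecasting scenarios may predict weeks or months ahead.  
  * In this example the horizon is set to 48 hours.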
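
As a small illustration of what a 48-step horizon on hourly data covers, the sketch below lists the timestamps that follow a forecast origin. The helper name `forecast_timestamps` is hypothetical; the origin `2017-08-08 05:00:00` is the split time computed later in this notebook.

```python
from datetime import datetime, timedelta

def forecast_timestamps(origin, horizon, step):
    """Return the `horizon` timestamps that follow `origin`, spaced by `step`."""
    return [origin + step * k for k in range(1, horizon + 1)]

origin = datetime(2017, 8, 8, 5)  # last timestamp whose target is treated as known
stamps = forecast_timestamps(origin, horizon=48, step=timedelta(hours=1))
print(stamps[0], stamps[-1])
```

The first and last timestamps line up with the test window used below, which runs from one hour after the split time to 48 hours after it.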

In [25]:
max_horizon = 48

### Train/test split

* Split the data into train and test sets so that model performance can be evaluated

In [26]:
data[time_column_name].min()

Timestamp('2012-01-01 00:00:00')

In [27]:
# timeStamp values of the rows where the target column demand is not null
print(data[~pd.isnull(data[target_column_name])][time_column_name].head())

# latest timestamp with a known (non-null) demand value - latest_known_time
latest_known_time = data[~pd.isnull(data[target_column_name])][time_column_name].max()

# max_horizon = 48 hours
# split_time = latest_known_time - 48 hours
split_time = latest_known_time - pd.Timedelta(hours=max_horizon)

print("latest_known_time : {}, split_time : {}".format(latest_known_time, split_time))

0   2012-01-01 00:00:00
1   2012-01-01 01:00:00
2   2012-01-01 02:00:00
3   2012-01-01 03:00:00
4   2012-01-01 04:00:00
Name: timeStamp, dtype: datetime64[ns]
latest_known_time : 2017-08-10 05:00:00, split_time : 2017-08-08 05:00:00


In [28]:
# Split into train and test sets at split_time.
X_train = data[data[time_column_name] <= split_time] # 2012-01-01 ~ 2017-08-08 
X_test = data[(data[time_column_name] > split_time) & (data[time_column_name] <= latest_known_time)] # 2017-08-08 ~ 2017-08-10

In [29]:
X_train.head()

| |timeStamp|demand|precip|temp|
|-|-|-|-|-|
|0|2012-01-01 00:00:00|4937.5|0.0|46.13|
|1|2012-01-01 01:00:00|4752.1|0.0|45.89|
|2|2012-01-01 02:00:00|4542.6|0.0|45.04|
|3|2012-01-01 03:00:00|4357.7|0.0|45.03|
|4|2012-01-01 04:00:00|4275.5|0.0|42.61|


In [30]:
X_test.head()

| |timeStamp|demand|precip|temp|
|-|-|-|-|-|
|49076|2017-08-08 06:00:00|5590.99|0.0|66.17|
|49077|2017-08-08 07:00:00|6147.03|0.0|66.29|
|49078|2017-08-08 08:00:00|6592.43|0.0|66.72|
|49079|2017-08-08 09:00:00|6874.53|0.0|67.37|
|49080|2017-08-08 10:00:00|7010.54|0.0|68.3|


In [31]:
print("train data set : {} ~ {}".format(X_train.timeStamp.min(), X_train.timeStamp.max()))
print("test data set : {} ~ {}".format(X_test.timeStamp.min(), X_test.timeStamp.max()))

train data set : 2012-01-01 00:00:00 ~ 2017-08-08 05:00:00
test data set : 2017-08-08 06:00:00 ~ 2017-08-10 05:00:00


In [32]:
y_train = X_train.pop(target_column_name).values
y_test = X_test.pop(target_column_name).values

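## TimeSeriesTrain1

* Instantiate an AutoMLConfig object.
* It defines the settings and data used to run the experiment.
* For forecasting tasks, additional configuration describing the time-series data schema and the forecasting context is supplied.
* Only the time column name and the maximum forecast horizon are required.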

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|The metric to optimize. <br> Time-series metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iterations**|Number of iterations. In each iteration, AutoML trains one pipeline on the given data.|
|**iteration_timeout_minutes**|Time limit for each iteration, in minutes.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], target values.|
|**n_cross_validations**|Number of cross-validation splits. Rolling Origin Validation is used to split the time series in a temporally consistent way.|
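
The rolling-origin idea mentioned for `n_cross_validations` can be sketched as follows. This is a simplified illustration under the assumption that each validation fold spans exactly one horizon and that folds do not overlap; the exact fold layout AutoML uses may differ, and `rolling_origin_splits` is a hypothetical helper, not an SDK function.

```python
def rolling_origin_splits(n_samples, horizon, n_splits):
    """Yield (train_indices, validation_indices) pairs, oldest fold first.

    Each validation fold covers `horizon` consecutive points, and the training
    window always ends right before the fold starts, so no future data leaks
    into training.
    """
    for k in range(n_splits, 0, -1):
        valid_end = n_samples - (k - 1) * horizon
        valid_start = valid_end - horizon
        if valid_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield range(0, valid_start), range(valid_start, valid_end)

for train_idx, valid_idx in rolling_origin_splits(n_samples=240, horizon=48, n_splits=3):
    # training size, first validation index, last validation index
    print(len(train_idx), valid_idx[0], valid_idx[-1])
```

Unlike shuffled k-fold splitting, every training window here lies entirely before its validation window, which is what "temporally consistent" means in the table above.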

In [33]:
time_series_settings = {
    'time_column_name': time_column_name,# timeStamp
    'max_horizon': max_horizon 
}

automl_config = AutoMLConfig(task='forecasting', # time series - forecasting
                             debug_log='automl_nyc_energy_errors.log',
                             primary_metric='normalized_root_mean_squared_error',
                             blacklist_models = ['ExtremeRandomTrees'],
                             iterations=10,
                             iteration_timeout_minutes=5,
                             X=X_train,
                             y=y_train,
                             n_cross_validations=3,
                             verbosity = logging.INFO,
                             **time_series_settings)



* Calling experiment.submit() runs the experiment.
* Run time depends on the hardware; more capable hardware speeds up the process.
* The iteration currently being run is shown in the output.

In [34]:
local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_848818a5-7d18-41aa-a8d6-16129da8c749
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: DatasetFeaturization. Beginning to featurize the CV split.
Current status: DatasetFeaturizationCompleted. Completed featurizing the CV split.
Current status: DatasetFeaturization. Beginning to featurize the CV split.
Current status: DatasetFeaturizationCompleted. Completed featurizing the CV split.
Current status: DatasetFeaturization. Beginning to featurize the CV split.
Current status: DatasetFeaturizationCompleted. Completed featurizing the CV split.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summ

In [35]:
local_run

|Experiment|Id|Type|Status|Details Page|Docs Page|
|-|-|-|-|-|-|
|automl-energydemandforecasting|AutoML_848818a5-7d18-41aa-a8d6-16129da8c749|automl|Completed|Link to Azure Machine Learning studio|Link to Documentation|


## BestModelExtraction
* Identify the model with the best result among the iterations that were run.
* get_output() returns the best run and its fitted model.

In [36]:
best_run, fitted_model = local_run.get_output()
fitted_model.steps

[('timeseriestransformer', TimeSeriesTransformer(logger=None,
             pipeline_type=<TimeSeriesPipelineType.FULL: 1>)),
 ('stackensembleregressor',
  StackEnsembleRegressor(base_learners=[('6', Pipeline(memory=None,
       steps=[('standardscalerwrapper', <automl.client.core.runtime.model_wrappers.StandardScalerWrapper object at 0x7f4c4116d668>), ('lightgbmregressor', LightGBMRegressor(boosting_type='gbdt', class_weight=None,
           colsample_bytree=0.5, importance_type='split',
           learning_rate=0.126319473684...=0.825, silent=True, subsample=1,
           subsample_for_bin=200000, subsample_freq=7, verbose=-1))]))],
              meta_learner=ElasticNetCV(alphas=None, copy_X=True, cv='warn', eps=0.001,
         fit_intercept=True, l1_ratio=0.5, max_iter=1000, n_alphas=100,
         n_jobs=None, normalize=False, positive=False, precompute='auto',
         random_state=None, selection='cyclic', tol=0.0001, verbose=0),
              training_cv_folds=5))]

### Featurized data
* The list below shows the features generated for the data by the time-series featurization.

In [37]:
fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()

['precip',
 'temp',
 'precip_WASNULL',
 'temp_WASNULL',
 'year',
 'half',
 'quarter',
 'month',
 'day',
 'hour',
 'am_pm',
 'hour12',
 'wday',
 'qday',
 'week']

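### Testing the best fitted model

* For forecasting, `NaN` entries in y mark the region where the forecaster fills in values.
* The time of the last non-NaN value is the _forecast origin_, that is, the last time at which the target value is known.
* The forecast function produces forecasts using the shortest possible forecast horizon.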
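
The forecast-origin rule in the bullets above can be sketched in a few lines. The helper `forecast_origin_index` is hypothetical and only illustrates the definition; it is not part of the AutoML SDK.

```python
import math

def forecast_origin_index(y):
    """Index of the last non-NaN target value, or None if every value is NaN."""
    known = [i for i, value in enumerate(y) if not math.isnan(value)]
    return known[-1] if known else None

# Two known target values followed by two values to be forecast.
y_partial = [5590.992, 6147.033, float('nan'), float('nan')]
print(forecast_origin_index(y_partial))

# When every target is NaN, as in the cells below, there is no origin inside
# the query itself and the forecast starts from the end of the training data.
print(forecast_origin_index([float('nan')] * 3))
```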

In [38]:
#demand - y
y_query = y_test.copy().astype(float)
y_query

array([5590.992, 6147.033, 6592.425, 6874.533, 7010.542, 7078.158,
       7213.317, 7329.75 , 7426.25 , 7505.633, 7578.192, 7548.05 ,
       7357.117, 7131.433, 6986.575, 6869.292, 6587.058, 6194.442,
       5754.708, 5439.667, 5195.325, 5044.508, 5010.   , 5195.3  ,
       5651.033, 6240.392, 6774.967, 7140.267, 7348.917, 7516.775,
       7671.625, 7806.833, 7949.467, 8065.808, 8162.875, 8136.758,
       7852.642, 7535.067, 7360.883, 7207.583, 6917.65 , 6487.642,
       6053.458, 5714.258, 5497.025, 5360.583, 5333.775, 5534.683])

* Fill y with NaN, because at real prediction time the target y is unknown

In [39]:
y_query.fill(np.nan)

In [40]:
y_query

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan])

In [41]:
y_fcst, X_trans = fitted_model.forecast(X_test, y_query)

In [42]:
y_fcst # forecast

array([5312.59200183, 5792.94422385, 6183.25897702, 6183.25897702,
       6249.18005431, 6249.18005431, 6603.73506474, 6862.72171496,
       6862.72171496, 7019.38702712, 7296.52297361, 7359.57972533,
       7296.52297361, 7225.13950619, 7019.38702712, 6680.4384586 ,
       6456.22081801, 6329.05694677, 5264.82032403, 5198.89924674,
       5198.89924674, 5198.89924674, 5095.27803939, 5095.27803939,
       5281.39682025, 5761.74904228, 6139.89346776, 6453.0473194 ,
       6791.99588792, 7069.13183441, 7472.09455461, 7627.09291508,
       7627.09291508, 7627.09291508, 7627.09291508, 7627.09291508,
       7627.09291508, 7359.57972533, 7225.13950619, 6862.72171496,
       6532.92421187, 6532.92421187, 5468.68758912, 5468.68758912,
       5468.68758912, 5391.98419526, 5264.82032403, 5264.82032403])

In [43]:
y_test # actual

array([5590.992, 6147.033, 6592.425, 6874.533, 7010.542, 7078.158,
       7213.317, 7329.75 , 7426.25 , 7505.633, 7578.192, 7548.05 ,
       7357.117, 7131.433, 6986.575, 6869.292, 6587.058, 6194.442,
       5754.708, 5439.667, 5195.325, 5044.508, 5010.   , 5195.3  ,
       5651.033, 6240.392, 6774.967, 7140.267, 7348.917, 7516.775,
       7671.625, 7806.833, 7949.467, 8065.808, 8162.875, 8136.758,
       7852.642, 7535.067, 7360.883, 7207.583, 6917.65 , 6487.642,
       6053.458, 5714.258, 5497.025, 5360.583, 5333.775, 5534.683])
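
To compare the forecast with the actual values numerically, error metrics can be computed as sketched below. Only the first four values of each array above are used (rounded to three decimals) to keep the example short. Dividing RMSE by the range of the actual values is one common normalization and is an assumption here; it may not match exactly how AutoML computes `normalized_root_mean_squared_error`.

```python
import math

# First four actual and forecast values from the arrays above (rounded).
y_actual = [5590.992, 6147.033, 6592.425, 6874.533]
y_forecast = [5312.592, 5792.944, 6183.259, 6183.259]

errors = [a - f for a, f in zip(y_actual, y_forecast)]
mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
nrmse = rmse / (max(y_actual) - min(y_actual))              # RMSE scaled by the actual range

print('MAE   : {:.3f}'.format(mae))
print('RMSE  : {:.3f}'.format(rmse))
print('NRMSE : {:.3f}'.format(nrmse))
```

On the full 48-hour arrays, `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics` (imported at the top of this notebook) give the same quantities.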

X_trans shows which features were engineered from the data.

In [45]:
X_trans

|timeStamp|_automl_dummy_grain_col|precip|temp|precip_WASNULL|temp_WASNULL|year|half|quarter|month|day|hour|am_pm|hour12|wday|qday|week|_automl_target_col|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|2017-08-08 06:00:00|_automl_dummy_grain_col|0.0|66.17|0|0|2017|2|3|8|8|6|0|6|1|39|32|5312.59|
|2017-08-08 07:00:00|_automl_dummy_grain_col|0.0|66.29|0|0|2017|2|3|8|8|7|0|7|1|39|32|5792.94|
|2017-08-08 08:00:00|_automl_dummy_grain_col|0.0|66.72|0|0|2017|2|3|8|8|8|0|8|1|39|32|6183.26|
|2017-08-08 09:00:00|_automl_dummy_grain_col|0.0|67.37|0|0|2017|2|3|8|8|9|0|9|1|39|32|6183.26|
|2017-08-08 10:00:00|_automl_dummy_grain_col|0.0|68.3|0|0|2017|2|3|8|8|10|0|10|1|39|32|6249.18|
|2017-08-08 11:00:00|_automl_dummy_grain_col|0.0|68.89|0|0|2017|2|3|8|8|11|0|11|1|39|32|6249.18|
|2017-08-08 12:00:00|_automl_dummy_grain_col|0.0|70.6|0|0|2017|2|3|8|8|12|1|12|1|39|32|6603.74|
|2017-08-08 13:00:00|_automl_dummy_grain_col|0.0|72.83|0|0|2017|2|3|8|8|13|1|1|1|39|32|6862.72|
|2017-08-08 14:00:00|_automl_dummy_grain_col|0.0|73.33|0|0|2017|2|3|8|8|14|1|2|1|39|32|6862.72|
|2017-08-08 15:00:00|_automl_dummy_grain_col|0.0|74.89|0|0|2017|2|3|8|8|15|1|3|1|39|32|7019.39|


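## UsingLagsAndRollingWindowFeatures

The model built above did not use lags, so its predictions are effectively the result of a simple regression on the date, the grain, and any additional features. This often works well, because common time-series patterns such as seasonality and trend can be captured this way. Such a regression is horizon-less: it does not use past target values, so it makes no difference how far into the future it predicts. In the previous example, `max_horizon` was used only to split the data for cross-validation.

The model configured next uses lags, that is, past values of the target y are used as features to predict y. The forecast is then no longer horizon-less, so `max_horizon` must be specified so that the model learns to forecast that far ahead. 
`target_lags` specifies how far back the lags of the target variable are constructed.
`target_rolling_window_size` specifies the size of the rolling window over which summary features of the target, such as the minimum, maximum, and mean, are generated.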
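
The kind of features these two settings produce can be illustrated with plain pandas. This is a simplified sketch on a made-up six-point series with a lag of 2 and a window of 3; AutoML builds its lag and window features relative to the forecast origin and horizon, so its values will not match this naive construction exactly.

```python
import pandas as pd

demand = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 13.0], name='demand')

features = pd.DataFrame({
    'demand': demand,
    # value of the target two steps earlier
    'lag2': demand.shift(2),
    # shift(1) keeps each window strictly in the past of the row it describes
    'min_window3': demand.shift(1).rolling(3).min(),
    'max_window3': demand.shift(1).rolling(3).max(),
    'mean_window3': demand.shift(1).rolling(3).mean(),
})
print(features)
```

The leading rows contain NaN because there is not yet enough history to fill the lag or the window, which is also why lagged models need some recent target values at prediction time.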

This notebook uses the blacklist_models parameter to exclude some models that take a long time to train on this dataset. You can remove models from the blacklist_models list, but you may then need to increase the iteration_timeout_minutes parameter to get accurate results.

In [46]:
time_series_settings_with_lags = {
    'time_column_name': time_column_name,
    'max_horizon': max_horizon,
    'target_lags': 12,
    'target_rolling_window_size': 4
}

automl_config_lags = AutoMLConfig(task='forecasting',
                                  debug_log='automl_nyc_energy_errors.log',
                                  primary_metric='normalized_root_mean_squared_error',
                                  blacklist_models=['ElasticNet','ExtremeRandomTrees','GradientBoosting','XGBoostRegressor'],
                                  iterations=10,
                                  iteration_timeout_minutes=10,
                                  X=X_train,
                                  y=y_train,
                                  n_cross_validations=3,
                                  verbosity=logging.INFO,
                                  **time_series_settings_with_lags)



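## TimeSeriesTrain2

Now start a new local run with the lag and rolling window features. AutoML applies featurization in the setup stage, before iterating over ML models. Lag and rolling window features add complexity, so this run takes longer than the previous example, which did not use them.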

In [47]:
local_run_lags = experiment.submit(automl_config_lags, show_output=True)

Running on local machine
Parent Run ID: AutoML_1c671c68-858c-4878-be9a-0d2b7e11acb7
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: DatasetFeaturization. Beginning to featurize the CV split.
Current status: DatasetFeaturizationCompleted. Completed featurizing the CV split.
Current status: DatasetFeaturization. Beginning to featurize the CV split.
Current status: DatasetFeaturizationCompleted. Completed featurizing the CV split.
Current status: DatasetFeaturization. Beginning to featurize the CV split.
Current status: DatasetFeaturizationCompleted. Completed featurizing the CV split.

****************************************************************************************************
DATA GUARDRAILS SUMMARY:
For more details, use API: run.get_guardrails()

TYPE:         Memory Issues Detection
STATU
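
The next cell calls `align_outputs`, which is not defined anywhere in this notebook. A minimal sketch of such a helper is given here, assuming it should join each forecast with the matching raw test row and its actual value. The default `target_column_name='demand'`, the column name `predicted`, and the small demo frames are assumptions made for illustration, not SDK code.

```python
import numpy as np
import pandas as pd

def align_outputs(y_predicted, X_trans, X_test, y_test,
                  target_column_name='demand', predicted_column_name='predicted'):
    """Join forecasts (indexed like X_trans) with the raw test rows and actuals."""
    # forecast() returns y and X_trans in the same row order, so reuse the index.
    df_fcst = pd.DataFrame({predicted_column_name: np.asarray(y_predicted)},
                           index=X_trans.index).reset_index()

    # Attach the actual target values to a copy of the raw test features.
    X_test_full = X_test.copy()
    X_test_full[target_column_name] = y_test
    X_test_full = X_test_full.reset_index(drop=True)

    # Merge on the columns both frames share (here the time column).
    together = df_fcst.merge(X_test_full, how='right')

    # Keep only rows where both the forecast and the actual value are present.
    keep = together[[target_column_name, predicted_column_name]].notnull().all(axis=1)
    return together[keep]

# Tiny demo with made-up frames shaped like the notebook's data.
times = pd.to_datetime(['2017-08-08 06:00', '2017-08-08 07:00'])
X_test_demo = pd.DataFrame({'timeStamp': times, 'temp': [66.17, 66.29]})
X_trans_demo = pd.DataFrame(
    {'temp': [66.17, 66.29]},
    index=pd.MultiIndex.from_arrays([times, ['g', 'g']], names=['timeStamp', 'grain']))
aligned = align_outputs(np.array([5275.22, np.nan]), X_trans_demo, X_test_demo,
                        np.array([5590.99, 6147.03]))
print(aligned)
```

In the demo, the second row is dropped because its forecast is NaN, so only rows with both a prediction and an actual value remain for evaluation.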

In [48]:
best_run_lags, fitted_model_lags = local_run_lags.get_output()
y_fcst_lags, X_trans_lags = fitted_model_lags.forecast(X_test, y_query)
df_lags = align_outputs(y_fcst_lags, X_trans_lags, X_test, y_test)
df_lags.head()

| |timeStamp|_automl_dummy_grain_col|origin|predicted|precip|temp|demand|
|-|-|-|-|-|-|-|-|
|0|2017-08-08 06:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|5275.22|0.0|66.17|5590.99|
|1|2017-08-08 07:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|6004.24|0.0|66.29|6147.03|
|2|2017-08-08 08:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|6176.02|0.0|66.72|6592.43|
|3|2017-08-08 09:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|6699.5|0.0|67.37|6874.53|
|4|2017-08-08 10:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|6723.11|0.0|68.3|7010.54|


In [49]:
X_trans_lags

|timeStamp|_automl_dummy_grain_col|origin|horizon_origin|precip|precip_WASNULL|temp|temp_WASNULL|_automl_target_col_lag12H|_automl_target_col_min_window4H|_automl_target_col_max_window4H|_automl_target_col_mean_window4H|year|...|quarter|month|day|hour|am_pm|hour12|wday|qday|week|_automl_target_col|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|2017-08-08 06:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|1|0.0|0.0|66.17|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|6|0|6|1|39|32|5275.22|
|2017-08-08 07:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|2|0.0|0.0|66.29|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|7|0|7|1|39|32|6004.24|
|2017-08-08 08:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|3|0.0|0.0|66.72|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|8|0|8|1|39|32|6176.02|
|2017-08-08 09:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|4|0.0|0.0|67.37|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|9|0|9|1|39|32|6699.5|
|2017-08-08 10:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|5|0.0|0.0|68.3|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|10|0|10|1|39|32|6723.11|
|2017-08-08 11:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|6|0.0|0.0|68.89|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|11|0|11|1|39|32|6723.11|
|2017-08-08 12:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|7|0.0|0.0|70.6|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|12|1|12|1|39|32|6720.49|
|2017-08-08 13:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|8|0.0|0.0|72.83|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|13|1|1|1|39|32|6808.36|
|2017-08-08 14:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|9|0.0|0.0|73.33|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|14|1|2|1|39|32|7145.49|
|2017-08-08 15:00:00|_automl_dummy_grain_col|2017-08-08 05:00:00|10|0.0|0.0|74.89|0.0|6831.23|4867.02|5120.31|4955.81|2017|...|3|8|8|15|1|3|1|39|32|7151.54|


### Finding the most important features for the forecast
Engineered feature importance can be computed and visualized based on the forecast test data.

In [50]:
from azureml.train.automl.automl_explain_utilities import AutoMLExplainerSetupClass, automl_setup_model_explanations
automl_explainer_setup_obj = automl_setup_model_explanations(fitted_model, X=X_train.copy(), 
                                                             X_test=X_test.copy(), y=y_train, 
                                                             task='forecasting')

Current status: Setting up data for AutoML explanations
Current status: Setting up the AutoML featurizer
Current status: Setting up the AutoML featurization for explanations
Current status: Setting up the AutoML estimator
Current status: Generating a feature map for raw feature importance
Current status: Data for AutoML explanations successfully setup


#### Import modules for feature importance

In [51]:
from azureml.explain.model.mimic.models.lightgbm_model import LGBMExplainableModel
from azureml.explain.model.mimic_wrapper import MimicWrapper
explainer = MimicWrapper(ws, automl_explainer_setup_obj.automl_estimator, LGBMExplainableModel, 
                         init_dataset=automl_explainer_setup_obj.X_transform, run=best_run,
                         features=automl_explainer_setup_obj.engineered_feature_names, 
                         feature_maps=[automl_explainer_setup_obj.feature_map])

In [52]:
pip install azureml.contrib.interpret

Collecting azureml.contrib.interpret
  Using cached https://files.pythonhosted.org/packages/b6/87/778ca4d8b7dee885ae24b2d0f2d9a1f64e1246c2733c0895379dc5a41de6/azureml_contrib_interpret-1.0.72-py3-none-any.whl
Installing collected packages: azureml.contrib.interpret
Successfully installed azureml.contrib.interpret
Note: you may need to restart the kernel to use updated packages.


In [53]:
engineered_explanations = explainer.explain(['local', 'global'], eval_dataset=automl_explainer_setup_obj.X_test_transform)
print(engineered_explanations.get_feature_importance_dict())


from azureml.contrib.interpret.visualize import ExplanationDashboard
ExplanationDashboard(engineered_explanations, automl_explainer_setup_obj.automl_estimator, automl_explainer_setup_obj.X_test_transform)

{'temp': 509.33654853787044, 'hour': 372.9536072150579, 'week': 167.0417524343496, 'wday': 97.38256751245615, 'month': 15.638174957581187, 'quarter': 3.829809858745253, 'half': 2.435676016719605, 'hour12': 1.503223184030217, 'qday': 0.6703484519906663, 'day': 0.13349653794488933, 'year': 0.006992336820724452, 'precip': 0.0006646050599776902, 'am_pm': 0.0, 'temp_WASNULL': 0.0, 'precip_WASNULL': 0.0}


ExplanationWidget(value={'predictedY': [5312.592001825297, 5792.944223853444, 6183.258977023155, 6183.25897702…

<azureml.contrib.interpret.visualize.ExplanationDashboard.ExplanationDashboard at 0x7f4c40d57710>
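
The importance dictionary printed above is easier to read when ranked and expressed as shares. The sketch below does this for the five largest entries, with values rounded from that output; the shares are therefore relative to these five features only, not to the full feature set.

```python
# Five largest importances from the dictionary printed above (rounded).
importance = {'temp': 509.34, 'hour': 372.95, 'week': 167.04, 'wday': 97.38, 'month': 15.64}

total = sum(importance.values())
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
shares = [(name, value / total) for name, value in ranked]

for name, share in shares:
    print('{:<6} {:.1%}'.format(name, share))
```

Temperature and hour of day dominate, which is consistent with energy demand following both weather and the daily cycle.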

## DeployingToAKS

- register_model() - register the model in the Azure Machine Learning Workspace
- Create score_energy_demand.py - the script that runs the model
- Create the image
- Deploy the web service

In [54]:
fitted_model

ForecastingPipelineWrapper(pipeline=Pipeline(memory=None,
     steps=[('timeseriestransformer', TimeSeriesTransformer(logger=None,
           pipeline_type=<TimeSeriesPipelineType.FULL: 1>)), ('stackensembleregressor', StackEnsembleRegressor(base_learners=[('6', Pipeline(memory=None,
     steps=[('standardscalerwrapper', <automl.client.core.runtime.model_wrappe...   random_state=None, selection='cyclic', tol=0.0001, verbose=0),
            training_cv_folds=5))]),
              stddev=None)

## Register the local_run_lags model - model_id

In [56]:
model = local_run_lags.register_model(description = 'automated ml model for energy demand forecasting', tags = {'ml': "Forecasting", 'type': "automl"})
modelid = local_run_lags.model_id
print(local_run_lags.model_id) # This will be written to the script file later in the notebook.

AutoML1c671c688best


## Write the score Python script
- Run model.predict()
- Transform the input data

In [59]:
%%writefile score_energy_demand.py
import json
import pandas as pd
import azureml.train.automl
from azureml.core.model import Model

try:
    import joblib
except ImportError:  # older scikit-learn releases bundle joblib
    from sklearn.externals import joblib


def init():
    global model
    # `modelid` stands for the id of the registered model that we want to deploy
    model_path = Model.get_model_path(model_name = modelid)
    # deserialize the model file back into a fitted pipeline
    model = joblib.load(model_path)

def run(timestamp, precip, temp):
    try:
        # build a one-row frame with the same columns as the training data
        data = pd.DataFrame({'timeStamp': [pd.Timestamp(timestamp)],
                             'precip': [float(precip)],
                             'temp': [float(temp)]})
        result = model.predict(data)
    except Exception as e:
        return json.dumps({"error": str(e)})
    return json.dumps({"result": result.tolist()})

Overwriting score_energy_demand.py


## Check the dependencies needed to run the model in production

In [61]:
experiment_name = 'automl-energydemandforecasting'

from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

experiment = Experiment(ws, experiment_name)
ml_run = AutoMLRun(experiment = experiment, run_id = local_run.id)

dependencies = ml_run.get_run_sdk_dependencies(iteration = 0)

In [None]:
for p in ['azureml-train-automl', 'azureml-sdk', 'azureml-core']:
    print('{}\t{}'.format(p, dependencies[p]))

azureml-train-automl	1.0.72
azureml-sdk	1.0.72
azureml-core	1.0.72

## Define the modules to install in the Conda environment of the container to deploy, based on the modules above
- CondaDependencies.create(conda_packages=[])

In [62]:
from azureml.core.conda_dependencies import CondaDependencies 

myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn'], pip_packages=["azureml-train-automl"])
print(myenv.serialize_to_string())

conda_env_file_name = 'my_conda_env.yml'
myenv.save_to_file('.', conda_env_file_name)

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - azureml-train-automl==1.0.72.*
- numpy
- scikit-learn
channels:
- conda-forge



'my_conda_env.yml'

In [66]:
with open(conda_env_file_name, 'r') as cefr:
    content = cefr.read()

with open(conda_env_file_name, 'w') as cefw:
    cefw.write(content.replace(azureml.core.VERSION, dependencies['azureml-sdk']))

script_file_name = 'score_energy_demand.py'

with open(script_file_name, 'r') as cefr:
    content = cefr.read()

with open(script_file_name, 'w') as cefw:
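    # Replace the bare name `modelid` in the script with the registered model id as a string literal
    cefw.write(content.replace('model_name = modelid', 'model_name = {!r}'.format(local_run_lags.model_id)))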

In [68]:
pip install azureml.webservice_schema

Collecting azureml.webservice_schema
  Downloading https://files.pythonhosted.org/packages/8f/15/25d65ec84d595ffeb3cb73210e7fcf4402e70119bc32048cb65f6bc29634/azureml_webservice_schema-1.0.33-py3-none-any.whl
Collecting pyspark==2.3.1 (from azureml.webservice_schema)
  Downloading https://files.pythonhosted.org/packages/ee/2f/709df6e8dc00624689aa0a11c7a4c06061a7d00037e370584b9f011df44c/pyspark-2.3.1.tar.gz (211.9MB)

The schema file defines the REST API of the deployed web service, so that the service can be used through "Swagger".

In [69]:
from azureml.webservice_schema.sample_definition import SampleDefinition
from azureml.webservice_schema.data_types import DataTypes
from azureml.webservice_schema.schema_generation import generate_schema

schema_file_name = './schema.json'
def run(timestamp,precip,temp):
    return "OK"

import numpy as np
generate_schema(run, inputs={
    "timestamp" : SampleDefinition(DataTypes.STANDARD, '2012-01-01 00:00:00'),
    "precip" : SampleDefinition(DataTypes.STANDARD, '0.0'),
    "temp" : SampleDefinition(DataTypes.STANDARD, '0.0')}, 
    filepath=schema_file_name)



{'input': {'timestamp': {'internal': 'gANjYXp1cmVtbC53ZWJzZXJ2aWNlX3NjaGVtYS5fcHl0aG9uX3V0aWwKUHl0aG9uU2NoZW1hCnEAKYFxAX1xAlgJAAAAZGF0YV90eXBlcQNjYnVpbHRpbnMKc3RyCnEEc2Iu',
   'swagger': {'type': 'string', 'example': '2012-01-01 00:00:00'},
   'type': 0,
   'version': '1.0.33'},
  'precip': {'internal': 'gANjYXp1cmVtbC53ZWJzZXJ2aWNlX3NjaGVtYS5fcHl0aG9uX3V0aWwKUHl0aG9uU2NoZW1hCnEAKYFxAX1xAlgJAAAAZGF0YV90eXBlcQNjYnVpbHRpbnMKc3RyCnEEc2Iu',
   'swagger': {'type': 'string', 'example': '0.0'},
   'type': 0,
   'version': '1.0.33'},
  'temp': {'internal': 'gANjYXp1cmVtbC53ZWJzZXJ2aWNlX3NjaGVtYS5fcHl0aG9uX3V0aWwKUHl0aG9uU2NoZW1hCnEAKYFxAX1xAlgJAAAAZGF0YV90eXBlcQNjYnVpbHRpbnMKc3RyCnEEc2Iu',
   'swagger': {'type': 'string', 'example': '0.0'},
   'type': 0,
   'version': '1.0.33'}}}
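
A request body matching the three string inputs declared in this schema could be assembled as sketched below. The sample values are made up for illustration, and how the payload is sent depends on the deployed service, so no HTTP call is shown.

```python
import json

# All three inputs are declared as strings in the schema above.
request_fields = {
    'timestamp': '2017-08-10 06:00:00',  # hypothetical hour to forecast
    'precip': '0.0',
    'temp': '66.0',
}

payload = json.dumps(request_fields)
print(payload)

# Round-trip check: the service side would decode the same three fields.
decoded = json.loads(payload)
```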

### Create the container image

In [70]:
%%writefile docker_steps.dockerfile
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y build-essential gcc g++ python-dev unixodbc unixodbc-dev

Writing docker_steps.dockerfile


In [71]:
docker_file_name = "docker_steps.dockerfile"

In [72]:
from azureml.core.image import Image, ContainerImage

image_config = ContainerImage.image_configuration(runtime= "python",
                                 execution_script = script_file_name,
                                 docker_file = docker_file_name,
                                 schema_file = schema_file_name,
                                 conda_file = conda_env_file_name,
                                 tags = {'ml': "Forecasting", 'type': "automl"},
                                 description = "Image for automated ml energy demand forecasting predictions")

image = Image.create(name = "automlenergyforecasting",
                     models = [model],
                     image_config = image_config, 
                     workspace = ws)

image.wait_for_creation(show_output = True)

Creating image
Running...................................................................................................
Succeeded
Image creation operation finished for image automlenergyforecasting:1, operation "Succeeded"
