---
layout: post
title: "Time Series Data - LSTM"
author: "Chanjun Kim"
categories: Data분석
tags: [Data, TimeSeries, ARIMA, LSTM, BOOSTING, REGRESSION, 시계열데이터, 시계열분석]
image: 05_timeseries.png
---

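## **Learning Objectives**
Learn how to handle time series data, try several models for time series forecasting, and understand their characteristics.<br>
This post covers LSTM, a representative deep learning technique for time series data.
> This post is about LSTM, so EDA will be covered in a separate post.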

In [1]:
import os
import sys
import warnings
from tqdm import tqdm

import itertools
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import plotnine as p9
import seaborn as sns

import scipy
from scipy import stats
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, LSTM, GRU, RNN, Reshape
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

In [2]:
%matplotlib inline
warnings.filterwarnings("ignore")

In [3]:
mpl.rcParams['axes.unicode_minus'] = False
# fm._rebuild()
plt.rcParams["font.family"] = 'NanumMyeongjo'
plt.rcParams["figure.figsize"] = (10,10)

In [4]:
train = pd.read_csv("data/dacon/energy/train.csv", encoding = "cp949")
train.head()

|   | num | date_time | 전력사용량(kWh) | 기온(°C) | 풍속(m/s) | 습도(%) | 강수량(mm) | 일조(hr) | 비전기냉방설비운영 | 태양광보유 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020-06-01 00 | 8179.056 | 17.6 | 2.5 | 92.0 | 0.8 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 2020-06-01 01 | 8135.64 | 17.7 | 2.9 | 91.0 | 0.3 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 2020-06-01 02 | 8107.128 | 17.5 | 3.2 | 91.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 2020-06-01 03 | 8048.808 | 17.1 | 3.2 | 91.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 2020-06-01 04 | 8043.624 | 17.0 | 3.3 | 92.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122400 entries, 0 to 122399
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   num         122400 non-null  int64  
 1   date_time   122400 non-null  object 
 2   전력사용량(kWh)  122400 non-null  float64
 3   기온(°C)      122400 non-null  float64
 4   풍속(m/s)     122400 non-null  float64
 5   습도(%)       122400 non-null  float64
 6   강수량(mm)     122400 non-null  float64
 7   일조(hr)      122400 non-null  float64
 8   비전기냉방설비운영   122400 non-null  float64
 9   태양광보유       122400 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 9.3+ MB


In [6]:
test = pd.read_csv("data/dacon/energy/test.csv", encoding = "cp949")
test.head()

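|   | num | date_time | 기온(°C) | 풍속(m/s) | 습도(%) | 강수량(mm, 6시간) | 일조(hr, 3시간) | 비전기냉방설비운영 | 태양광보유 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020-08-25 00 | 27.8 | 1.5 | 74.0 | 0.0 | 0.0 |  |  |
| 1 | 1 | 2020-08-25 01 |  |  |  |  |  |  |  |
| 2 | 1 | 2020-08-25 02 |  |  |  |  |  |  |  |
| 3 | 1 | 2020-08-25 03 | 27.3 | 1.1 | 78.0 |  | 0.0 |  |  |
| 4 | 1 | 2020-08-25 04 |  |  |  |  |  |  |  |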


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10080 entries, 0 to 10079
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   num           10080 non-null  int64  
 1   date_time     10080 non-null  object 
 2   기온(°C)        3360 non-null   float64
 3   풍속(m/s)       3360 non-null   float64
 4   습도(%)         3360 non-null   float64
 5   강수량(mm, 6시간)  1680 non-null   float64
 6   일조(hr, 3시간)   3360 non-null   float64
 7   비전기냉방설비운영     2296 non-null   float64
 8   태양광보유         1624 non-null   float64
dtypes: float64(7), int64(1), object(1)
memory usage: 708.9+ KB


In [8]:
print(train.num.nunique())
print(test.num.nunique())
print(pd.concat([train.num.value_counts().sort_index(), test.num.value_counts()], axis = 1).head())

60
60
    num  num
1  2040  168
2  2040  168
3  2040  168
4  2040  168
5  2040  168


In [24]:
y_col = "전력사용량(kWh)"
length = 24

In [11]:
for i, x in tqdm(enumerate(train.num.unique())):
    # Inputs: building number and power usage; target: power usage
    X = train.loc[train.num == x].iloc[:, [0, 2]].reset_index(drop=True)
    Y = train.loc[train.num == x].iloc[:, [2]].reset_index(drop=True)
    data = np.array(X)
    targets = np.array(Y)
    # A single batch that holds every window of `length` hours for this building
    data_gen = TimeseriesGenerator(data, targets, length=length, batch_size=len(X) - length)

    # Stack the windows of all buildings into one training set
    if i == 0:
        Xs = data_gen[0][0]
        Ys = data_gen[0][1]
    else:
        Xs = np.concatenate((Xs, data_gen[0][0]))
        Ys = np.concatenate((Ys, data_gen[0][1]))

60it [00:01, 54.39it/s]
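
To make the windowing concrete, here is a minimal NumPy sketch of what `TimeseriesGenerator` produces for one building with the default stride and sampling rate: each sample is `length` consecutive rows, and its target is the row immediately after the window. The helper name `make_windows` and the toy series are hypothetical and used only for illustration.

```python
import numpy as np

def make_windows(data, targets, length):
    """Return (X, y) where X[k] = data[k : k + length] and y[k] = targets[k + length]."""
    n_samples = len(data) - length
    X = np.stack([data[k : k + length] for k in range(n_samples)])
    y = targets[length:]
    return X, y

# Toy series standing in for one building: 10 hourly rows, 2 features per row
toy = np.arange(20, dtype=float).reshape(10, 2)
toy_targets = toy[:, [1]]  # predict the second column, like y_col in the post

X, y = make_windows(toy, toy_targets, length=3)
print(X.shape, y.shape)  # (7, 3, 2) (7, 1)
```

A series of N rows yields N - `length` samples, which is why the loop above uses `len(X) - length` as the batch size to fetch every window in one batch.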


In [13]:
lstm_units=32
dropout=0.2
EPOCH=30
BATCH_SIZE=128

In [14]:
model=Sequential([
    LSTM(lstm_units, return_sequences=False, recurrent_dropout=dropout),
    Dense(1, kernel_initializer=tf.initializers.zeros())
])

In [15]:
# MAE is the training loss; MSE is tracked as an extra regression metric
model.compile(optimizer='adam', loss='mae', metrics=['mse'])

In [16]:
# Keep only the best model, i.e. the one with the lowest val_loss.
save_best_only=tf.keras.callbacks.ModelCheckpoint(filepath="lstm_model.h5", monitor='val_loss', save_best_only=True)

In [17]:
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)

In [18]:
print(Xs.shape)
print(Ys.shape)

(120960, 24, 2)
(120960, 1)
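
The first dimension follows from the counts printed earlier: each of the 60 buildings has 2,040 hourly rows, and a 24-hour window leaves 2,040 - 24 = 2,016 samples per building. A quick check, with the numbers copied from the outputs above:

```python
n_buildings = 60          # train.num.nunique()
rows_per_building = 2040  # rows per building from value_counts()
length = 24               # window size in hours

samples_per_building = rows_per_building - length
total_samples = n_buildings * samples_per_building
print(samples_per_building, total_samples)  # 2016 120960
```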


In [19]:
%%time
model.fit(Xs, Ys, epochs=EPOCH, batch_size=BATCH_SIZE, validation_split = 0.2, verbose=1,
          callbacks=[early_stop, save_best_only])

Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x2020a7a5ee0>

In [20]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (128, 32)                 4480      
_________________________________________________________________
dense (Dense)                (128, 1)                  33        
Total params: 4,513
Trainable params: 4,513
Non-trainable params: 0
_________________________________________________________________
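
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias. The sketch below assumes the standard Keras LSTM layout and 2 input features per time step (the last dimension of `Xs`).

```python
units = 32       # lstm_units
n_features = 2   # last dimension of Xs: building number and power usage

# 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (n_features * units + units * units + units)
# Dense(1): one weight per LSTM unit plus one bias
dense_params = units * 1 + 1

print(lstm_params, dense_params, lstm_params + dense_params)  # 4480 33 4513
```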


In [25]:
test[y_col] = np.nan

In [28]:
test = test.interpolate()
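
As a small illustration of what `interpolate()` does with its default linear method, the sketch below fills a gap between two temperature readings taken three hours apart, using the first values from `test`. The variable names `temp` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Temperatures reported every 3 hours, as in the first rows of test
temp = pd.Series([27.8, np.nan, np.nan, 27.3])
filled = temp.interpolate()  # default method is linear
print(filled.round(6).tolist())  # [27.8, 27.633333, 27.466667, 27.3]
```

Note that calling `interpolate()` on the whole frame fills gaps without regard to building boundaries; interpolating per building (for example after grouping by `num`) would be one way to avoid that, though it is not what the post does.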

In [30]:
test[["num", y_col]]

|   | num | 전력사용량(kWh) |
|---|---|---|
| 0 | 1 |  |
| 1 | 1 |  |
| 2 | 1 |  |
| 3 | 1 |  |
| 4 | 1 |  |
| ... | ... | ... |
| 10075 | 60 |  |
| 10076 | 60 |  |
| 10077 | 60 |  |
| 10078 | 60 |  |


In [None]:
result = np.zeros(len(test))
result_idx = 0
for i in tqdm(test.num.unique()):
    # Seed the window with the last `length` observed hours of this building
    predict_data = np.array(train[train.num == i][-length:][["num", y_col]])
    for _ in range(len(test[test.num == i])):
        # Predict the next hour from the most recent `length` rows
        next_ = model.predict(np.reshape(predict_data[-length:], (1, length, 2)))
        # Append the prediction so it feeds into the following step
        predict_data = np.concatenate([predict_data, np.array([[i, next_[0][0]]])])
        result[result_idx] = next_[0][0]
        result_idx += 1

  7%|███▍                                                | 4/60 [00:25<05:53,  6.32s/it]
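
The loop above is a recursive (autoregressive) forecast: each prediction is appended to the input window and then used to predict the following hour. The sketch below isolates that pattern with a hypothetical stand-in predictor, `mean_predictor`, in place of `model.predict`, so it runs without a trained model. The helper `recursive_forecast` is likewise illustrative, not part of the post's code.

```python
import numpy as np

def recursive_forecast(history, predict_fn, steps, length):
    """Forecast `steps` values, feeding each prediction back into the window."""
    window = list(history[-length:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_fn(np.array(window))
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward by one step
    return np.array(forecasts)

def mean_predictor(window):
    # Hypothetical stand-in for model.predict: the mean of the current window
    return float(window.mean())

history = np.array([1.0, 2.0, 3.0, 4.0])
forecast = recursive_forecast(history, mean_predictor, steps=3, length=2)
print(forecast)
```

Because every prediction becomes an input for later steps, errors can compound over a long horizon such as the 168 hourly steps per building in `test`.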

In [68]:
test[y_col] = result
test

|   | num | date_time | 기온(°C) | 풍속(m/s) | 습도(%) | 강수량(mm, 6시간) | 일조(hr, 3시간) | 비전기냉방설비운영 | 태양광보유 | 전력사용량(kWh) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020-08-25 00 | 27.800000 | 1.500000 | 74.000000 | 0.0 | 0.000000 |  |  | 415.436188 |
| 1 | 1 | 2020-08-25 01 | 27.633333 | 1.366667 | 75.333333 | 0.0 | 0.000000 |  |  | 415.436188 |
| 2 | 1 | 2020-08-25 02 | 27.466667 | 1.233333 | 76.666667 | 0.0 | 0.000000 |  |  | 415.436188 |
| 3 | 1 | 2020-08-25 03 | 27.300000 | 1.100000 | 78.000000 | 0.0 | 0.000000 |  |  | 415.436188 |
| 4 | 1 | 2020-08-25 04 | 26.900000 | 1.166667 | 79.666667 | 0.0 | 0.000000 |  |  | 415.436188 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10075 | 60 | 2020-08-31 19 | 28.633333 | 3.566667 | 66.000000 | 0.0 | 0.533333 | 1.0 | 1.0 | 0.000000 |
| 10076 | 60 | 2020-08-31 20 | 28.266667 | 3.833333 | 67.000000 | 0.0 | 0.266667 | 1.0 | 1.0 | 0.000000 |
| 10077 | 60 | 2020-08-31 21 | 27.900000 | 4.100000 | 68.000000 | 0.0 | 0.000000 | 1.0 | 1.0 | 0.000000 |
| 10078 | 60 | 2020-08-31 22 | 27.900000 | 4.100000 | 68.000000 | 0.0 | 0.000000 | 1.0 | 1.0 | 0.000000 |


---

References:
- https://dacon.io/competitions/official/235736/codeshare/2628?page=1&dtype=recent
- https://byeongkijeong.github.io/ARIMA-with-Python/
- https://otexts.com/fppkr/arima-estimation.html