# 1. Time Series Introduction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1BqHW6UuKS5I5QykIlRPo1V8w9qeb82Y2?usp=sharing)

In [3]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/DRZFhCBsGQY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


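Time series forecasting is the problem of predicting future values based on values observed in the past. Because it requires discovering patterns between past observations and future values, it can be framed as a supervised learning problem. In this chapter, we will therefore build a model that predicts future values through supervised learning based on neural network architectures.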

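Time series forecasting is needed in many fields. The most prominent example is the energy sector. Power plants must forecast future electricity demand in order to secure reserve power efficiently, and city gas companies need models that forecast future gas usage so they can take preemptive action against meter failures and meter cheating. Both problems have in fact been run as data competitions ([electricity](https://dacon.io/competitions/official/235606/overview/), [city gas](https://icim.nims.re.kr/platform/question/16)) to discover new models. Beyond energy, the retail sector is interested in forecasting item-level sales for efficient inventory management, and this problem has likewise been run as a data competition ([retail](https://www.kaggle.com/c/m5-forecasting-accuracy/overview)).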

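In this tutorial, we will use the [COVID-19 confirmed case data](https://github.com/CSSEGISandData/COVID-19) provided by the Center for Systems Science and Engineering at Johns Hopkins University to build a model that predicts future confirmed cases from past confirmed cases. Chapter 1 introduces the neural network architectures that can be used to build time series forecasting models and the evaluation metrics that can be used to assess model performance. Chapter 2 deepens our understanding of the COVID-19 confirmed case data through exploratory data analysis, and Chapter 3 shows how to convert time series data into a format suitable for supervised learning. Chapters 4 and 5 each use a deep learning model to predict future confirmed cases.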


## 1.1 Available Deep Learning Architecture

### 1.1.1 CNN

<p align="center"><img align="center" src="https://github.com/Pseudo-Lab/Tutorial-Book/blob/sungjin/pics/TS-ch1img01.PNG?raw=true"></p>


- Figure 1-1 CNN application example (Source: Lim et al. 2020. Time Series Forecasting With Deep Learning: A Survey)


CNNs are generally known for performing well on computer vision problems, but they can also be applied to time series forecasting. A one-dimensional convolution filter computes a weighted sum over the input sequence, and this weighted sum is used to calculate the predicted future value. However, the CNN structure does not take into account the temporal dependence between past and future data.
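As a rough illustration of the weighted sum described above, the sketch below slides a one-dimensional filter over a short series with NumPy. The series values and the filter weights are made-up assumptions; in an actual CNN the weights are learned during training.

```python
import numpy as np

# Hypothetical input sequence (e.g. daily counts); values are illustrative only.
series = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Hypothetical filter of width 3; a trained CNN would learn these weights.
weights = np.array([0.2, 0.3, 0.5])

# Slide the filter over the series: each output is a weighted sum of
# 3 consecutive inputs. mode="valid" keeps only the positions where the
# filter fully overlaps the series. np.correlate applies the weights in
# the order given (no kernel flipping).
outputs = np.correlate(series, weights, mode="valid")

print(outputs)  # roughly [23. 33. 43.]
```

Each output mixes only the three values inside the filter window, which is why the width of the filter determines how far back a single convolution can look.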

### 1.1.2 RNN

<p align="center"><img align="center" src="https://github.com/Pseudo-Lab/Tutorial-Book/blob/sungjin/pics/TS-ch1img02.PNG?raw=true"></p>


- Figure 1-2 RNN application example (Source: Lim et al. 2020. Time Series Forecasting With Deep Learning: A Survey) 

RNNs are frequently used in natural language processing. An RNN predicts the future using a hidden state that accumulates information from previous time steps, which is what allows past information to inform future forecasts. However, if the input sequence is too long, the vanishing gradient problem can occur and adversely affect model training. For this reason, the LSTM structure, which was designed to mitigate this problem, is commonly used, and we will use it in this tutorial.
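To make the idea of an accumulated hidden state concrete, here is a minimal sketch of a plain RNN recurrence in NumPy. The parameter names (`W_x`, `W_h`, `b`, `w_out`), the hidden size, and the randomly initialized weights are all assumptions for illustration; this is not the LSTM used later in the tutorial, and no training takes place.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

hidden_size = 4
# Hypothetical, randomly initialized parameters; a real model would learn them.
W_x = rng.normal(scale=0.5, size=(hidden_size, 1))            # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)
w_out = rng.normal(scale=0.5, size=hidden_size)               # hidden -> forecast

# Illustrative input sequence, scaled to a small range.
sequence = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

h = np.zeros(hidden_size)  # initial hidden state
for x_t in sequence:
    # The new hidden state mixes the current input with the previous hidden state.
    h = np.tanh(W_x[:, 0] * x_t + W_h @ h + b)

# One-step-ahead forecast read out from the final hidden state.
forecast = w_out @ h
print(h.shape, float(forecast))
```

Because the same `W_h` is applied at every step, gradients flowing back through a long sequence are multiplied repeatedly, which is where the vanishing gradient problem comes from.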

### 1.1.3 Attention Mechanism

<p align="center"><img align="center" src="https://github.com/Pseudo-Lab/Tutorial-Book/blob/sungjin/pics/TS-ch1img03.PNG?raw=true"></p>


- Figure 1-3 Application of Attention Mechanism (Source: Lim et al. 2020. Time Series Forecasting With Deep Learning: A Survey) 

When predicting the future from past information, some information is helpful and some is not. For example, if a retailer wants to predict weekend sales, sales from the same day one week earlier may be more informative than sales from the weekday immediately before. The attention mechanism makes it possible to capture this. It calculates how much each past time step influences the time point being predicted and uses these weights when predicting future values. Assigning more weight to the past values that are directly related to the time point being predicted enables more accurate prediction.
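The weighting step can be sketched as follows. The sales figures and relevance scores are invented for illustration; in an actual attention layer the scores are computed from learned query and key representations rather than written by hand.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical sales for the past 7 days (oldest first); values are made up.
past_values = np.array([120.0, 80.0, 85.0, 90.0, 88.0, 95.0, 130.0])

# Hypothetical relevance score of each past day for the day being predicted.
# The oldest entry is the same weekday one week earlier, so it scores highest.
scores = np.array([3.0, 0.1, 0.1, 0.2, 0.1, 0.5, 1.0])

weights = softmax(scores)                # non-negative weights that sum to 1
context = np.sum(weights * past_values)  # weighted summary of the past

print(weights.round(3))
print(context)
```

The resulting `context` is pulled toward the past values with the largest weights, which is how the model can emphasize the same weekday a week ago over the day immediately before.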

## 1.2 Evaluation Indicator

In this tutorial, we are going to build a model that predicts the number of confirmed COVID-19 cases. Since the number of confirmed cases takes continuous numeric values, the performance of the model can be assessed through the difference between the predicted and actual values. In this section, we'll look at various ways to calculate that difference. Before explaining the evaluation metrics, we first define several symbols.


> $y_i$: actual value to be predicted
>  $\hat{y}_i$: predicted value by model
>  $n$: size of test dataset


Sections 1.2.1 to 1.2.4 use the symbols above. Note that Section 1.2.5 defines the symbols differently.

### 1.2.1 MAE (Mean Absolute Error)

> $MAE=\frac{1}{n}\displaystyle\sum_{i=1}^{n} |y_i-\hat{y}_i|$

MAE, also called L1 loss, is obtained by taking the absolute value of the difference between each predicted value and actual value, summing the results, and dividing by the number of samples ($n$). Summing over all samples and dividing by the number of samples is the same as taking the average, so the metrics that follow are simply described as taking the average. Because MAE is on the same scale as the target variable being predicted, its value is easy to interpret intuitively. It can be implemented in code as follows:

In [None]:
import numpy as np # import the NumPy package

def MAE(true, pred):
    '''
    true: np.array 
    pred: np.array
    '''
    return np.mean(np.abs(true-pred))

TRUE = np.array([10, 20, 30, 40, 50])
PRED = np.array([30, 40, 50, 60, 70])

MAE(TRUE, PRED)

20.0

### 1.2.2 MSE (Mean Squared Error)

> $MSE=\frac{1}{n}\displaystyle\sum_{i=1}^{n} (y_i-\hat{y}_i)^2$

> $RMSE=\sqrt{\frac{1}{n}\displaystyle\sum_{i=1}^{n} (y_i-\hat{y}_i)^2}$


MSE, also called L2 loss, is calculated by squaring the difference between each predicted value and actual value and then averaging. Because the error is squared, MSE grows quadratically as the predicted value deviates further from the actual value. Squaring also means that MSE is not on the same scale as the target variable. To bring it back to the scale of the target variable, take the square root of MSE; this value is called RMSE. MSE can be implemented in code as follows:

In [None]:
def MSE(true, pred):
    '''
    true: np.array 
    pred: np.array
    '''
    return np.mean(np.square(true-pred))

TRUE = np.array([10, 20, 30, 40, 50])
PRED = np.array([30, 40, 50, 60, 70])

MSE(TRUE, PRED)

400.0
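RMSE is defined above but not implemented, so here is a minimal sketch that follows the same pattern as `MSE`. The `PRED_OUTLIER` array is a made-up example added to show how squaring weights a single large error more heavily than MAE does.

```python
import numpy as np

def RMSE(true, pred):
    '''
    true: np.array
    pred: np.array
    '''
    # Square root of MSE, which brings the error back to the scale of the target
    return np.sqrt(np.mean(np.square(true - pred)))

TRUE = np.array([10, 20, 30, 40, 50])
PRED = np.array([30, 40, 50, 60, 70])           # every prediction is off by 20
PRED_OUTLIER = np.array([10, 20, 30, 40, 150])  # hypothetical: one prediction is off by 100

rmse_even = RMSE(TRUE, PRED)
rmse_outlier = RMSE(TRUE, PRED_OUTLIER)
mae_outlier = np.mean(np.abs(TRUE - PRED_OUTLIER))

print('RMSE (even errors):', rmse_even)         # 20.0
print('MAE  (one large error):', mae_outlier)   # 20.0
print('RMSE (one large error):', rmse_outlier)  # about 44.72
```

Both prediction arrays have an MAE of 20, but RMSE more than doubles when the same total error is concentrated in one sample.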

### 1.2.3 MAPE (Mean Absolute Percentage Error)

> $MAPE=\frac{1}{n}\displaystyle\sum_{i=1}^{n} |\frac{y_i-\hat{y}_i}{y_i}|$


(Source: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error)

MAPE calculates the error relative to the actual value by dividing the difference between the actual value and the predicted value by the actual value. It then takes the absolute value of this ratio and averages it. Since the error is expressed as a percentage, it is easy to interpret the performance of the model intuitively, and it is easy to compare performance across variables when there are multiple target variables.

However, MAPE is undefined when an actual value is 0. In addition, even when the absolute error is the same, MAPE penalizes forecasts differently depending on whether the actual value is larger or smaller than the predicted value, placing a heavier penalty on overestimates ( [Makridakis, 1993](https://doi.org/10.1016/0169-2070(93)90079-3) ). Let's check this with the code below.

In [None]:
def MAPE(true, pred):
    '''
    true: np.array 
    pred: np.array
    '''
    return np.mean(np.abs((true-pred)/true))

TRUE_UNDER = np.array([10, 20, 30, 40, 50])
PRED_OVER = np.array([30, 40, 50, 60, 70])
TRUE_OVER = np.array([30, 40, 50, 60, 70])
PRED_UNDER = np.array([10, 20, 30, 40, 50])


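print('Comparison of MAE and MAPE by the relative size of actual and predicted values when the mean error is 20 \n')

print('When the actual value is smaller than the predicted value (overestimate)')
print('MAE:', MAE(TRUE_UNDER, PRED_OVER))
print('MAPE:', MAPE(TRUE_UNDER, PRED_OVER))


print('\nWhen the actual value is larger than the predicted value (underestimate)')
print('MAE:', MAE(TRUE_OVER, PRED_UNDER))
print('MAPE:', MAPE(TRUE_OVER, PRED_UNDER))


Comparison of MAE and MAPE by the relative size of actual and predicted values when the mean error is 20 

When the actual value is smaller than the predicted value (overestimate)
MAE: 20.0
MAPE: 0.9133333333333333

When the actual value is larger than the predicted value (underestimate)
MAE: 20.0
MAPE: 0.4371428571428571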


By the nature of its formula, MAPE converts the error into a ratio by dividing by the actual value $y$. The result therefore depends on $y$: even if the numerators are the same, a smaller denominator yields a larger error.

The code above confirms this with the pair (`TRUE_UNDER`, `PRED_OVER`), where the actual values are 20 smaller than the predicted values, and the pair (`TRUE_OVER`, `PRED_UNDER`), where they are 20 larger. MAE is 20 for both pairs. MAPE, however, yields 0.913 when the actual values are `TRUE_UNDER` and 0.437 when they are `TRUE_OVER`.
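The two limitations of MAPE can also be seen with single values. The sketch below uses made-up numbers: the same absolute error of 50 with the roles of actual and forecast swapped, and then an actual value of 0. The function simply restates the MAPE formula above so that the block is self-contained.

```python
import numpy as np

def MAPE(true, pred):
    return np.mean(np.abs((true - pred) / true))

# Same absolute error of 50; only the roles of actual and forecast are swapped.
mape_over = MAPE(np.array([100.0]), np.array([150.0]))   # forecast above actual
mape_under = MAPE(np.array([150.0]), np.array([100.0]))  # forecast below actual
print(mape_over)   # 0.5
print(mape_under)  # about 0.333

# An actual value of 0 makes the ratio undefined. NumPy returns inf
# (with a division warning, silenced here) rather than a usable score.
with np.errstate(divide='ignore'):
    mape_zero = MAPE(np.array([0.0, 10.0]), np.array([5.0, 10.0]))
print(mape_zero)   # inf
```

In practice this means series that contain zeros (for example, days with no confirmed cases) need a different metric or special handling before MAPE can be used.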

### 1.2.4 SMAPE (Symmetric Mean Absolute Percentage Error)


> $SMAPE=\frac{100}{n}\displaystyle\sum_{i=1}^{n} \frac{|y_i-\hat{y}_i|}{|y_i| + |\hat{y}_i|}$


(Source: https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)


SMAPE was devised to compensate for the limitation of MAPE shown in the examples above ( [Makridakis, 1993](https://doi.org/10.1016/0169-2070(93)90079-3) ). Let's check it with the code below.

In [None]:
def SMAPE(true, pred):
    '''
    true: np.array 
    pred: np.array
    '''
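    return np.mean((np.abs(true-pred))/(np.abs(true) + np.abs(pred))) # the constant 100 is omitted in this code

print('Comparison of MAE and SMAPE by the relative size of actual and predicted values when the mean error is 20 \n')

print('When the actual value is smaller than the predicted value (overestimate)')
print('MAE:', MAE(TRUE_UNDER, PRED_OVER))
print('SMAPE:', SMAPE(TRUE_UNDER, PRED_OVER))


print('\nWhen the actual value is larger than the predicted value (underestimate)')
print('MAE:', MAE(TRUE_OVER, PRED_UNDER))
print('SMAPE:', SMAPE(TRUE_OVER, PRED_UNDER))


Comparison of MAE and SMAPE by the relative size of actual and predicted values when the mean error is 20 

When the actual value is smaller than the predicted value (overestimate)
MAE: 20.0
SMAPE: 0.29

When the actual value is larger than the predicted value (underestimate)
MAE: 20.0
SMAPE: 0.29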


MAPE yielded different values of 0.913 and 0.437 for these two cases, whereas SMAPE yields 0.29 for both. However, because the predicted value $\hat{y}_i$ appears in the denominator, SMAPE depends on $\hat{y}_i$. When the model underestimates, the denominator becomes smaller and the calculated error increases. Let's check it with the code below.

In [None]:
TRUE2 = np.array([40, 50, 60, 70, 80])
PRED2_UNDER = np.array([20, 30, 40, 50, 60])
PRED2_OVER = np.array([60, 70, 80, 90, 100])

print('Comparison of MAE and SMAPE for underestimates and overestimates when the mean error is 20 \n')

print('Overestimate')
print('MAE:', MAE(TRUE2, PRED2_OVER))
print('SMAPE:', SMAPE(TRUE2, PRED2_OVER))

print('\nUnderestimate')
print('MAE:', MAE(TRUE2, PRED2_UNDER))
print('SMAPE:', SMAPE(TRUE2, PRED2_UNDER))

Comparison of MAE and SMAPE for underestimates and overestimates when the mean error is 20 

Overestimate
MAE: 20.0
SMAPE: 0.14912698412698414

Underestimate
MAE: 20.0
SMAPE: 0.21857142857142856


`PRED2_UNDER` and `PRED2_OVER` both differ from `TRUE2` by an average of 20, but SMAPE yields 0.218 for the underestimate `PRED2_UNDER` and 0.149 for the overestimate `PRED2_OVER`.

### 1.2.5 RMSSE (Root Mean Squared Scaled Error)

> $RMSSE=\sqrt{\displaystyle\frac{\frac{1}{h}\sum_{i=n+1}^{n+h} (y_i-\hat{y}_i)^2}{\frac{1}{n-1}\sum_{i=2}^{n} (y_i-y_{i-1})^2}}$

We begin by defining the symbols used in the RMSSE formula. Each symbol has the following meaning.

> $y_i$: actual value to be predicted
> $\hat{y}_i$: predicted value by model
> $n$: size of the training dataset
> $h$: size of test dataset

RMSSE is a modified form of the Mean Absolute Scaled Error (MASE) ( [Hyndman, 2006](https://doi.org/10.1016/j.ijforecast.2006.03.001) ) and addresses the problems of MAPE and SMAPE mentioned above. Because MAPE and SMAPE scale the absolute error using the actual and predicted values of the test data, they penalize forecasts unevenly depending on whether the forecast is an underestimate or an overestimate, even when the absolute error is the same.

RMSSE avoids this problem because it uses the training data to scale the MSE. The MSE on the test data is divided by the MSE that a naive forecast achieves on the training data, so the scaling term is not affected by whether the model's predictions are underestimates or overestimates. The naive forecast simply uses the most recent observation as the forecast and is defined as follows.

> $\hat{y}_i = y_{i-1}$

That is, the predicted value at time $i$ is the actual value at time $i-1$. Because the error is divided by the MSE of the naive forecast, an RMSSE greater than 1 means the model forecasts worse than the naive forecast, and an RMSSE less than 1 means it forecasts better. Let's implement RMSSE with the code below.

In [None]:
def RMSSE(true, pred, train): 
    '''
    true: np.array 
    pred: np.array
    train: np.array
    '''
    
    n = len(train)

    # Mean squared error over the test period (the 1/h * sum term in the formula)
    numerator = np.mean(np.square(true - pred))
    
    # Mean squared error of the naive (previous-value) forecast on the training data
    denominator = 1/(n-1)*np.sum(np.square((train[1:] - train[:-1])))
    
    msse = numerator/denominator
    
    return msse ** 0.5

In [None]:
TRAIN = np.array([10, 20, 30, 40, 50]) # arbitrary training dataset for the RMSSE calculation

In [None]:
print(RMSSE(TRUE_UNDER, PRED_OVER, TRAIN))
print(RMSSE(TRUE_OVER, PRED_UNDER, TRAIN))
print(RMSSE(TRUE2, PRED2_OVER, TRAIN))
print(RMSSE(TRUE2, PRED2_UNDER, TRAIN))

2.0
2.0
2.0
2.0


You can see that the four examples, which MAPE and SMAPE penalized unequally even though the absolute error was the same, all receive the same penalty under RMSSE, and the error has also been scaled.

So far, we have looked at the deep learning architectures and evaluation metrics that can be used for time series forecasting. In the next chapter, we will explore the COVID-19 confirmed case dataset that will be used to build the model.
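To illustrate how RMSSE values are read against the threshold of 1, the sketch below compares two forecasts on a small made-up series. The helper `rmsse`, the data, and both forecasts are hypothetical; the helper computes the numerator as a mean over the test period, following the RMSSE formula given earlier in this section.

```python
import numpy as np

def rmsse(true, pred, train):
    # Numerator: mean squared error of the forecast over the test period
    numerator = np.mean(np.square(true - pred))
    # Denominator: mean squared error of the one-step naive forecast on the training data
    denominator = np.mean(np.square(train[1:] - train[:-1]))
    return np.sqrt(numerator / denominator)

train = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 20.0])  # made-up training series
true = np.array([22.0, 23.0, 25.0])                     # made-up test values

pred_model = np.array([21.0, 24.0, 24.0])  # hypothetical model forecast
pred_flat = np.full(3, train[-1])          # repeat the last training value

rmsse_model = rmsse(true, pred_model, train)
rmsse_flat = rmsse(true, pred_flat, train)
print(rmsse_model)  # below 1
print(rmsse_flat)   # above 1
```

Here the hypothetical model scores below 1, while simply repeating the last training value over the whole test period scores above 1, because its errors grow as the series keeps rising.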