## Paper Explanation
#### Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks

#### Background
<font size=2>
    
Multivariate time series often exhibit both short- and long-term periodic patterns, which traditional methods such as **Autoregression** and **Gaussian Process** do not predict well. A new approach based on **neural networks** is needed to address this issue.

#### Core innovation: recurrent-skip
<font size=2>
    
Besides the usual RNN component for sequential data, they propose a new component to capture multi-periodic behavior. For example, in traffic data the occupancy rate of a road has both a daily and a weekly period: occupancy is high during the morning and evening peaks on workdays and low on weekends.

To track patterns that repeat with a known period, they add a component that focuses on specific time steps. For instance, to predict the occupancy of a road at 7:00 and learn the rule above (high on workdays, low on weekends), they set a skip parameter, the number of hidden cells skipped through, so that the recurrence only looks at the situation at 7:00 on each day.
    
<div>
<img src="paper_explanation_1.png" style="zoom:40%" />
</div>
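
The skip idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it uses a plain tanh recurrence instead of a GRU, and the names `recurrent_skip`, `skip_chain`, `W_x`, `W_h` and `p` are hypothetical. The point is only that step `t` reads the hidden state from step `t - p` rather than from `t - 1`.

```python
import numpy as np

def recurrent_skip(x, W_x, W_h, p):
    """Tanh recurrence in which step t reads the hidden state of step t - p.

    x: (T, n_features) inputs; W_x: (n_features, hidden); W_h: (hidden, hidden).
    p is the skip length (e.g. 24 for hourly data with a daily period).
    Assumption: a plain tanh cell stands in for the GRU used in the paper.
    """
    T = x.shape[0]
    hidden = W_h.shape[0]
    h = np.zeros((T, hidden))
    for t in range(T):
        # The first p steps have no skipped predecessor, so they start from zeros.
        prev = h[t - p] if t >= p else np.zeros(hidden)
        h[t] = np.tanh(x[t] @ W_x + prev @ W_h)
    return h

def skip_chain(t, p):
    """Time steps whose inputs can influence step t through the skip links."""
    return list(range(t % p, t + 1, p))

# Hourly data, skip length 24: step 55 is 7:00 on the third day,
# and it is linked only to 7:00 on the earlier days.
print(skip_chain(55, 24))  # [7, 31, 55]
```

With `p = 24` the hidden state at 7:00 depends only on the 7:00 inputs of previous days, which is the behavior described above.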

#### How they implement it
<font size=2>
   
They propose the Long- and Short-term Time-series Network (LSTNet), which consists of:

1. CNN component:
    
    extracts short-term patterns along the time dimension and local dependencies between variables;
    
2. RNN component:
    
    the main part of the network;
    
3. Recurrent-skip part:
    
    the newly proposed structure;
    counters the vanishing gradients of GRU/RNN in long-term prediction;
    
4. Autoregressive component:
    
    addresses a drawback of neural networks: the scale of the outputs is not sensitive to the scale of the inputs;
    focuses on this local scaling issue;
    
5. Prediction:
    
    decompose the final prediction into linear and non-linear parts;
    linear: autoregressive output;
    non-linear: neural-network output (RNN and recurrent-skip);
    
6. Frobenius norm or L1 loss as the error function
    
Together, these components extract short- and long-term repetitive behavior in the data.
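
The decomposition of the prediction into a linear and a non-linear part can be sketched as follows. This is a simplified illustration under stated assumptions: `ar_component` and `combine` are hypothetical names, the AR weights `w_ar` are assumed to be shared across variables, and `nn_output` stands in for whatever the neural-network part produces.

```python
import numpy as np

def ar_component(y_window, w_ar, b_ar):
    """Linear part: weighted sum of the last q observations, per variable.

    y_window: (q, n) most recent q observations of n variables.
    w_ar: (q,) weights, assumed shared across variables; b_ar: scalar bias.
    """
    return w_ar @ y_window + b_ar

def combine(nn_output, y_window, w_ar, b_ar):
    """Final prediction = non-linear (network) part + linear (AR) part."""
    return nn_output + ar_component(y_window, w_ar, b_ar)

y_window = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])
w_ar = np.array([0.2, 0.3, 0.5])
nn_output = np.zeros(2)  # placeholder for the network output

small = combine(nn_output, y_window, w_ar, 0.0)
large = combine(nn_output, 10 * y_window, w_ar, 0.0)
# With zero bias, scaling the input by 10 scales the linear part by 10.
print(small, large)
```

Because the AR part is linear, its output follows the scale of the recent inputs directly, which is the property the non-linear part lacks.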

#### How they evaluate
<font size=2>

1. Metrics

    Root Relative Squared Error (RSE): the lower, the better;

    Empirical Correlation Coefficient (CORR): the higher, the better;
    
2. Comparison group
    
    include a dataset (exchange-rate) for which LSTNet is not well suited, to show where LSTNet does and does not apply;
    run the other models on identical data to support the proposed model;
    
3. Ablation study
    
    remove one component at a time and rerun, to infer that component's positive contribution;
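
The two metrics listed above can be sketched in NumPy. This assumes predictions and targets are arrays of shape `(time, variables)` and follows the usual definitions: RSE is the root of the squared error divided by the root of the squared deviation from the overall mean, and CORR is the Pearson correlation computed per variable and then averaged. The function names `rse` and `corr` are illustrative.

```python
import numpy as np

def rse(y_true, y_pred):
    """Root relative squared error: lower is better (1.0 = predicting the mean)."""
    num = np.sqrt(np.sum((y_true - y_pred) ** 2))
    den = np.sqrt(np.sum((y_true - y_true.mean()) ** 2))
    return num / den

def corr(y_true, y_pred):
    """Empirical correlation coefficient: per-variable Pearson correlation, averaged."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    per_variable = (yt * yp).sum(axis=0) / np.sqrt(
        (yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0)
    )
    return per_variable.mean()

rng = np.random.default_rng(0)
y_true = rng.normal(size=(200, 3))
y_pred = y_true + 0.1 * rng.normal(size=(200, 3))
print(rse(y_true, y_pred), corr(y_true, y_pred))
```

A constant prediction equal to the overall mean gives an RSE of exactly 1, so values below 1 indicate a model that beats that trivial baseline.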

#### Results and Conclusion
<font size=2>

1.
LSTNet outperforms the other methods, especially at large horizons, in capturing both short- and long-term repeating patterns in the data;

2.
The neural-network component of LSTNet may not be sufficiently sensitive to violent scale fluctuations in the data, whereas the linear AR part is;

3.
Open problems:

How can the **skip-length** be chosen automatically?
    
How can variables of different dimensions be integrated in the CNN part? In the real world they usually have different attributes, but the paper currently treats them all equally.

#### Personal opinions on this paper
<font size=2>
    
1.
They not only emphasize the advantages of LSTNet but also give an example (the exchange-rate case) where LSTNet is not suitable, in other words, the limit of their innovation;

2.
The ablation study is clear and necessary to show the influence of each part of the network, although it only gives a qualitative, rather than quantitative, indication;

3.
The innovative **recurrent-skip** component also has a limit: its hyper-parameter, the **skip-length**, is chosen empirically and requires prior knowledge of the dataset. For example, the autocorrelation has to be checked beforehand to confirm whether the data has a multi-periodic pattern. Even once that is confirmed, how large the **skip-length** should be is still an empirical choice that may need several iterations to get right. Of course, if it is already clear what the data is about, as in the paper's traffic case, which obviously depends on workdays and working hours, the **skip-length** is easier to determine. In conclusion, this component is not as general as network structures like CNN and RNN;
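
The autocorrelation check mentioned above could look roughly like this. It is a heuristic sketch, not a procedure from the paper: `autocorrelation` and `candidate_skip_length` are hypothetical helpers, and picking the highest local peak of the autocorrelation is just one plausible way to propose a skip-length.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[:-lag], x[lag:]) / denom for lag in range(1, max_lag + 1)]
    )

def candidate_skip_length(x, max_lag):
    """Lag of the highest local peak of the autocorrelation, or None if there is none."""
    acf = autocorrelation(x, max_lag)  # acf[i] is the value at lag i + 1
    peaks = [i for i in range(1, len(acf) - 1) if acf[i - 1] < acf[i] > acf[i + 1]]
    if not peaks:
        return None
    best = max(peaks, key=lambda i: acf[i])
    return best + 1

# Synthetic hourly series with a daily period, two weeks long.
t = np.arange(24 * 14)
series = np.sin(2 * np.pi * t / 24)
print(candidate_skip_length(series, max_lag=60))  # 24
```

On real data the peaks are noisier, so the result is best treated as a starting point for the manual tuning described above rather than as an automatic answer.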

#### SOTA baseline methods
<font size=2>
    
1. AR: Autoregressive
    
    predict future values based on past values;
    Autoregressive models implicitly assume that the future will resemble the past;
    
2. LRidge: vector autoregression (VAR) model with L2-regularization
    
    VAR: relates the current observation of a variable to past observations of itself and of the other variables in the system;
    L2: penalizes large weights (the sum of their squares) to reduce overfitting;
    
3. LSVR: VAR model with Support Vector Regression (SVR) objective function
    
    SVR: similar in concept to SVM, but a regression model. It tries to find the best-fitting hyperplane within a threshold, which is the distance between the hyperplane and the boundaries. The hyperplane is defined by the support vectors, the points that lie on or outside the boundaries on either side of it, which define the margin. SVR can also use a kernel function, e.g. an RBF (radial basis function) kernel such as the Gaussian, to map the data into a higher dimension where a hyperplane can be found.
    
4. TRMF: AR model using temporal regularized matrix factorization
    
5. GP: Gaussian Process for time series modeling
    
6. VAR-MLP: Multilayer Perceptron (MLP) combined with AR
    
    MLP: a network with one input layer, one output layer, and one or more hidden layers. Each layer applies an activation function to produce its output. The nodes in each layer are fully connected to those of the previous layer.
    
7. RNN-GRU: RNN model using GRU cells.
    
    RNN: well suited to data that depends on time order;
    GRU: gated inner structure that reduces the vanishing gradients of a plain RNN
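
To make the LRidge baseline above concrete, here is a minimal closed-form sketch of a VAR model with L2 regularization. It is an illustration only, not the paper's baseline code: `fit_ridge_var` and `predict_next` are hypothetical names, and the lag order `q` and penalty `lam` are free parameters.

```python
import numpy as np

def fit_ridge_var(Y, q, lam):
    """Fit a VAR(q) model with an L2 penalty via the ridge normal equations.

    Y: (T, n) series. Returns W of shape (q * n, n), so that
    [Y[t-1], ..., Y[t-q]] @ W approximates Y[t].
    """
    T, n = Y.shape
    # Column block k holds the observations at lag k + 1 for each target row.
    X = np.hstack([Y[q - k - 1 : T - k - 1] for k in range(q)])
    target = Y[q:]
    return np.linalg.solve(X.T @ X + lam * np.eye(q * n), X.T @ target)

def predict_next(Y, W, q):
    """One-step-ahead forecast from the last q rows of Y."""
    x = np.hstack([Y[-1 - k] for k in range(q)])
    return x @ W

# Simulate a stable VAR(1) process and recover its coefficient matrix.
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.2],
              [-0.3, 0.4]])
Y = np.zeros((2000, 2))
for t in range(1, len(Y)):
    Y[t] = Y[t - 1] @ A + 0.1 * rng.normal(size=2)

W = fit_ridge_var(Y, q=1, lam=1e-3)
print(W)
print(predict_next(Y, W, q=1))
```

Increasing `lam` shrinks the weights toward zero, which is the overfitting control the L2 term provides.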
