# Review Report on Long Short Term Memory (Networks)

Github URL:

## Introduction

This paper attempts to solve a flaw in backpropagation when it is applied to recurrent networks. To understand the impact of the paper, it helps to first understand backpropagation itself. Backpropagation calculates the gradient of the loss function with respect to each weight in a neural network, and training uses these gradients to reduce the loss. The sign and size of the gradient decide the next step: if the gradient is negative the weight is increased, and if the gradient is positive the weight is decreased. Each weight is updated in this way, and the gradients are recomputed after every update, so that the total loss is progressively minimised. This mechanism underlies the purpose of the paper.
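The update rule described above can be sketched on a toy problem. The quadratic loss, learning rate and starting points below are illustrative assumptions chosen for this review, not anything taken from the paper.

```python
# Minimal sketch: gradient descent on a one-weight loss L(w) = (w - 3)^2.
# The loss, learning rate and starting points are illustrative assumptions.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw, derived analytically for this toy loss
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=50):
    history = [w]
    for _ in range(steps):
        # Negative gradient -> w increases; positive gradient -> w decreases.
        w = w - learning_rate * grad(w)
        history.append(w)
    return w, history

# Starting below the minimum the gradient is negative, so w steps up.
w_from_below, history_below = gradient_descent(0.0)
# Starting above the minimum the gradient is positive, so w steps down.
w_from_above, history_above = gradient_descent(10.0)
```

Both runs move towards the same minimum at w = 3, from opposite directions, which is the behaviour the sign rule is meant to produce.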


## Content

The paper proposes a new variant of the recurrent neural network. Named the long short-term memory network (LSTM), it is designed to address the problems of long time lags and decaying error that arise in recurrent backpropagation. In conventional backpropagation through time, the gradient is computed by repeatedly multiplying by the recurrent weight matrix (and the derivative of the activation function), once for every time step. Calculating gradients across many time steps in this way leads to two kinds of unstable gradient. The first is the exploding gradient: when the largest singular value of the weight matrix is greater than one, repeated multiplication can make the gradient grow exponentially. Such large gradients produce overly large weight updates, compromising the network's ability to adjust weights incrementally. The second is the vanishing gradient: when the largest singular value of the weight matrix is less than one, repeated multiplication drives the gradient towards zero. The resulting updates are negligible, so the weights train very slowly or not at all.

The LSTM reduces the likelihood and effect of exploding and vanishing gradients. As in a standard recurrent network, an initial hidden state is combined with a set of shared weights and, given an input x, produces a new hidden state. To regulate this hidden state, the LSTM adds a cell state together with gates that control what is written to the cell, what is erased from it and what is revealed from it. Backpropagation from one cell state to the previous one passes only through an addition and an element-wise multiplication by a gate, rather than through repeated multiplication by the weight matrix and a tanh nonlinearity. This additive path provides largely uninterrupted gradient flow. Because the local gradients are better preserved, the gradients that reach the weights allow them to be updated more effectively (Hochreiter and Schmidhuber 1997).
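The effect of repeated multiplication can be illustrated with a deliberately simplified case. The sketch below assumes a linear, scalar recurrence h_t = w * h_(t-1), so the single weight w plays the role of the largest singular value; the function name and the weights 0.9, 1.1 and 1.0 are hypothetical choices for illustration.

```python
def gradient_through_time(recurrent_weight, steps):
    """Return d h_T / d h_0 for the linear scalar recurrence h_t = w * h_(t-1)."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight  # chain rule: one factor of w per time step
    return grad

vanishing = gradient_through_time(0.9, 100)  # shrinks towards zero
exploding = gradient_through_time(1.1, 100)  # grows exponentially
preserved = gradient_through_time(1.0, 100)  # stays at exactly 1.0
```

Only a factor of exactly one leaves the gradient unchanged over 100 steps, which is the intuition behind routing the error through an additive cell state instead of through repeated weight multiplications.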

## Innovation

Rather than relying on a workaround such as gradient clipping, which only limits exploding gradients, the gate mechanism allows weights to be adjusted more accurately during training. The solution was original because using two states with gated access to the cell had not previously been attempted as a means of providing uninterrupted gradient flow. Earlier approaches were not elegant and did not solve the vanishing gradient problem. As a result, this solution was a highly effective way of reducing the probability and effect of impeded gradient flow during backpropagation.

The paper contributes a new method for computing and preserving gradients, implemented as a concrete algorithm. In the form described here, the LSTM uses four gates. Note that the forget gate belongs to the now-standard formulation of the LSTM; the 1997 paper itself describes input and output gates. The input gate decides how much to write into the cell, the forget gate decides how much of the previous timestep's cell state to erase, and the output gate decides how much of the cell to reveal to y(t). These three gates use a sigmoid, producing a vector of values between 0 and 1, where 0 means hide and 1 means show. The fourth, the g gate, decides what to write to the cell and uses a tanh, producing a vector of values between -1 and 1. The cell state of the current timestep (the current memory) is the sum of two element-wise products: the previous cell state multiplied by the forget gate, and the input gate multiplied by the g gate. Thus, part of the previous cell state can be forgotten, and new content is added according to what the g gate proposes and how much the input gate admits. After the cell state is computed it is squashed through a tanh function and multiplied element-wise by the output gate, whose values between 0 and 1 determine how much of it is shown (Hochreiter and Schmidhuber 1997).
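The gate computations described above can be sketched as a single forward step. This is a minimal illustration of the four-gate formulation in plain Python lists, not the authors' implementation; the helper names (`sigmoid`, `affine`, `lstm_step`), the `params` layout and the constant weights of 0.1 are all assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, vector):
    """Compute weights @ vector + bias with plain lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. `params` maps gate name -> (weights, bias)."""
    z = h_prev + x  # concatenate previous hidden state and current input
    i = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate, in (0, 1)
    f = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate, in (0, 1)
    o = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate, in (0, 1)
    g = [math.tanh(a) for a in affine(*params["g"], z)]  # g gate, in (-1, 1)
    # New cell state: keep part of the old cell, add gated new content.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: squash the cell and reveal it through the output gate.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical sizes and constant weights, chosen only to make the sketch run.
hidden_size, input_size = 2, 1
params = {name: ([[0.1] * (hidden_size + input_size) for _ in range(hidden_size)],
                 [0.0] * hidden_size)
          for name in ("i", "f", "o", "g")}

h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x_t, h, c, params)
```

The only operations between `c_prev` and `c` are an element-wise multiplication by the forget gate and an addition, which is why the gradient along the cell state avoids repeated multiplication by a weight matrix.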

## Technical quality

The paper is of high technical quality: across its experiments the LSTM solved the test tasks with high accuracy. The paper conducted six experiments to evaluate the LSTM's performance. The first experiment was designed to evaluate the performance of the LSTM relative to traditional backpropagation and recurrent learning. When the methods were compared on the same task, the LSTM solved it much faster than the others, as it required fewer training examples to classify correctly. It was also more reliable, failing only twice in 150 attempts. This experiment was well chosen because it involved a task that all three learning methods could solve, so relative performance could be evaluated. The result was reinforced in the second and third experiments, which showed high levels of performance on noise-free and noisy sequences. The paper was effective in designing tests that demonstrate the capabilities of LSTM networks. It provided insight into the advantages of the LSTM and was also transparent about its limitations. For example, the authors noted that improvements could be made in learning entire input strings, and that LSTMs give more weight to recent inputs (Hochreiter and Schmidhuber 1997).

## Application and X-factor

Long short-term memory networks are an interesting development because of their ability to read a sequence one input at a time and produce predictions. The authors accept that the LSTM needs to be applied to real-world data to properly determine its applications and limitations. However, they propose that it would be useful for time series prediction, music composition and speech processing. These application domains are appropriate given the mechanism of the LSTM: its ability to read inputs and preserve gradients allows it to make accurate predictions with less training. Because it can also prioritise more recent data, it is well suited to time series analysis of electricity price trends. The technology can be used in combination with sequence chunkers to preserve gradients more effectively. During training, an LSTM reads the first data point, outputs a prediction, and repeats this for each subsequent data point, with the weights and biases optimised from the prediction errors; this makes it very useful for market data such as electricity price forecasts. Using previous prices and other relevant data, such as previous demand, can allow accurate modelling of commodities in markets (Weron 2014).
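One common way to present previous prices to a sequence model is to slice the series into fixed-length windows, each paired with the value that follows it. The sketch below assumes this windowing approach; the function name `make_windows`, the window length and the price values are hypothetical and are not real market data.

```python
def make_windows(series, window):
    """Pair each run of `window` past values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets

# Made-up hourly prices for illustration only; not real market data.
prices = [41.0, 39.5, 38.0, 40.2, 45.1, 52.3, 50.8]
inputs, targets = make_windows(prices, window=3)
```

Each entry of `inputs` would be fed to the network one value at a time, with the matching entry of `targets` as the price to predict. Other variables such as demand could be added as extra values per time step.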

## Presentation

This paper is of excellent quality, as it proposes a meaningful solution to an existing problem in neural networks; however, it has some presentation issues. A high level of prior understanding of neural networks is needed to follow the mechanisms derived in the paper, because the derivations are complex and the gate mechanisms are difficult to conceptualise. Otherwise the paper was presented well: it outlined the problems it set out to solve, traced the development of earlier approaches leading up to the LSTM, and described the experiments that demonstrate its effectiveness.

## References

Hochreiter, S., Schmidhuber, J. 1997, ‘Long Short-Term Memory’, Neural Computation, vol. 9, no. 8, pp. 1735-80.

Weron, R. 2014, ‘Electricity price forecasting: A review of the state-of-the-art with a look into the future’, International Journal of Forecasting, vol. 30, no. 4, pp. 1030-81.
