# Intuition for LSTMs:

LSTMs are difficult to grasp intuitively. To better understand how they work, it helps to look at the values that the different signals flowing through the unit can take. With that, their semantics can be interpreted more easily.

## 1. Domains:

1. $x^{<t>}$ is usually a one-hot label. The domains of $a, \tilde c, c$ are yet to be found.

2. Since gates have sigmoidal activations, their output features will always be in $(0, 1)$. Therefore, element-wise multiplication by a gate never enlarges the domain, and has a **dampening, never amplifying, effect**, ranging from leaving the signal almost untouched to almost eliminating it.

3. Regarding $\tanh$: if the input of a tanh spans the whole of $\mathbb{R}$, the output range is $(-1, 1)$. If the input is restricted to $(-2, 2)$, the output still covers $(-0.964, 0.964)$. For $(-1, 1)$, the output range drops to $(-0.7616, 0.7616)$. Regarding the derivative $\frac{\partial}{\partial x} \tanh(x) = 1 - \tanh(x)^2$, its values at $x = (0, 1, 2)$ are respectively $(1, 0.42, 0.07)$.

4. The range of the candidate $\tilde c$ depends on the trained parameters $W_c, b_c$, which are unconstrained. Therefore the input to its $tanh$ is *a priori* unconstrained, and its output can take any value in $(-1, 1)$. This output, dampened by $\Gamma_u$, is added to $c^{<t-1>}$ dampened by $\Gamma_f$, defining $c^{<t>}$.

5. Given this, the domain of $c^{<t>}$ has no fixed bound: whatever the value of $c^{<t-1>}$, a candidate with domain $(-1, 1)$ is added to it. Therefore, **if the responses of $\Gamma_f, \Gamma_u, \tilde c$ to the input $[a^{<t-1>}, x^{<t>}]$ are consistently high, the range of $c$ can grow by up to a whole $(-1,1)$ at every step**. This means that at step $t$ the domain of $c$ is bounded by $(c_{min}^{<0>}-t, c_{max}^{<0>}+t)$, assuming $c_{min}^{<0>} \leq 0 \leq c_{max}^{<0>}$ (as with the usual zero initialization).

6. Now the domain of $a$ can be defined: it is a dampened and normalized $c$. As already discussed, the amount of dampening depends solely on the response of $\Gamma_o$ to $[a^{<t-1>}, x^{<t>}]$. And the wider the range of $c$, the closer to $(-1, 1)$ the range of $a$ will be (being somewhat saturated beyond $(-2, 2)$, as seen).

7. The output $y^{<t>}= S(W_y a^{<t>} + b_y)$ has a softmax activation layer, which acts like an exponential normalization into a probability distribution. This means that all outputs sum to 1. Therefore, since it is a discrete distribution, each component of $y$ lies in $[0, 1]$ (strictly inside the interval for finite inputs).
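
The figures quoted in the list above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `tanh_grad` and `cell_bound` and the near-saturated value `0.999` are assumptions made for illustration. It tracks a single cell feature starting at $c^{<0>} = 0$ in the worst case where both gates and the candidate respond close to their upper limit at every step.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Output value and gradient of tanh at a few input magnitudes
for x in (0.0, 1.0, 2.0):
    print(f"x={x}: tanh={math.tanh(x):.4f}, gradient={tanh_grad(x):.2f}")

def cell_bound(c0, steps, level=0.999):
    """Worst-case growth of one cell feature.

    Assumes (hypothetically) that Gamma_f, Gamma_u and the candidate all
    respond at `level`, i.e. just below their upper limit of 1.
    """
    c = c0
    history = []
    for _ in range(steps):
        gamma_f, gamma_u, c_tilde = level, level, level
        c = gamma_f * c + gamma_u * c_tilde  # c<t> = Gf * c<t-1> + Gu * c~<t>
        history.append(c)
    return history

history = cell_bound(c0=0.0, steps=10)
for t, c in enumerate(history, start=1):
    # With c<0> = 0, the value at step t stays inside (-t, t);
    # tanh(c) shows how quickly the normalized value saturates.
    print(f"t={t}: c={c:.3f}, tanh(c)={math.tanh(c):.3f}")
```

Since every gate is strictly below 1 and the candidate is strictly inside $(-1, 1)$, each step can add less than 1 to the magnitude of $c$, which is where the linear bound comes from.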

<img src="LSTM.png" alt="Drawing" style="width: 700px;"/>
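
For reference, the symbols used above can be read against the common LSTM formulation sketched below. This is an assumption about the unit depicted in the figure rather than something taken from it: the names $W_f, W_u, W_o, b_u, b_o$ are introduced here for illustration, $\sigma$ denotes the sigmoid, $S$ the softmax and $\odot$ the element-wise product.

$$
\begin{aligned}
\Gamma_f &= \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f) \\
\Gamma_u &= \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u) \\
\Gamma_o &= \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o) \\
\tilde c^{<t>} &= \tanh(W_c [a^{<t-1>}, x^{<t>}] + b_c) \\
c^{<t>} &= \Gamma_f \odot c^{<t-1>} + \Gamma_u \odot \tilde c^{<t>} \\
a^{<t>} &= \Gamma_o \odot \tanh(c^{<t>}) \\
y^{<t>} &= S(W_y a^{<t>} + b_y)
\end{aligned}
$$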


## 2. Interpretation:

* As in a fully connected layer, the linear combination $W[a^{<t-1>}, x^{<t>}] + b$ maps one semantic space to a different one (explicitly known or not). The contribution of each element of the input vector is independent of all the others, and the response of each output feature is the sum of those separate contributions. Also important, due to its linearity, **arbitrarily small inputs generate arbitrarily small responses, and proportionally big inputs generate proportionally big responses**. That is, up to the bias offset $b$, **scaling the input will scale the output by the same factor**. This kind of mapping is happening (and being trained) in the computation of all 3 gates, $\tilde c$ and $y$.

* Since $\tilde c, c, a$ have the same dimensionality, and the LSTM performs element-wise operations among the 3 of them, we could argue that **they share the same semantics: their $i^{th}$ index refers to the same feature**. Also important to observe, **a higher response value means both higher presence and higher survival chances**: presence is encoded by the one-hot format, which is positive only for present inputs and zero elsewhere. Regarding survival chances, the gates explicitly map higher activations to higher throughput. In $\tilde c, c, a, y$ this is not so evident: imagine that all the corresponding weights (and the target label) are multiplied by $-1$. The responses for $\tilde c, c, a$ would behave identically with the sign flipped, and with a minus-one-hot label this would have no impact on the LSTM's semantics. The difference comes with the softmax layer, which maps lower values to smaller probabilities, so flipping the sign would lead to a somewhat "complementary" probability distribution.


* Regarding the $c \rightarrow a\rightarrow y$ upper end:
  * The $a\rightarrow y$ connection works as in every logistic layer: it is trained to map the hidden features to the desired target. In the RNN case the hidden features are the "current hidden state", represented by $a^{<t>}$.
  * Recalling that $a^{<t>}$ is a normalized, element-wise dampened version of $c^{<t>}$: for the normalizing, it seems that a sigmoid instead of a tanh would also work well, but be "slower" with respect to $t$, the main difference being purely numerical. Since $\sigma(x) = \frac{1}{2} + \frac{1}{2}\tanh(\frac{x}{2})$, the sigmoid is a tanh stretched by a factor of 2 along the input axis and compressed into $(0, 1)$. The tanh behaves like $f(x)=x$ in roughly the $(-0.5, 0.5)$ range and starts saturating around $(-2, 2)$, whereas the sigmoid is close to $f(x) = 0.5 + \frac{1}{4}x$ in roughly the $(-1, 1)$ range, and starts saturating around $(-4, 4)$. The sigmoid's maximum slope is 4 times smaller, and it needs twice the input range to saturate. Hence, tanh leads to saturation much more quickly than sigmoid. This means that **the values of c beyond $(-2, 2)$ will lead to saturated feature responses**.
  * The output gate regulates how the relations between features in $c$ should propagate to the output: **ideally, every $x^{<t>}$ is mapped to a feature-rich $c^{<t>}$ that acts like a "histogram" capturing all possible semantics. The output gate then "prunes" the histogram in a way that encodes the current state so that the output layer can extract an accurate prediction, and the gates in the next time-step can also perform well** (e.g., a pair of $c$ features looking for USA president and car company names should both get triggered by Ford, but only one of them should make it to the output, depending on the current context). Note that this task of "pruning" and discriminating the activated features is somewhat shared by the $\Gamma_o$ and $y$ weights: the difference is that **the output of $\Gamma_o$ gets propagated to the next step, so it should still summarize the current state, whereas the $y$ output should refer exclusively to the current $t$**. In other words, the pruning by $\Gamma_o$ is more conservative.
  
* Regarding the feedback mechanism by $a$ vs. $c$:
 * The idea that $c$ is intended to be rich is reflected by the addition that happens at every step. Ideally, assuming nothing needs to be forgotten (enough model capacity and data consistency), $c$ will indeed be the result of adding every single candidate $\tilde c$. **$\Gamma_f$ should then be trained to correct skewed contributions to that histogram, and/or to redistribute the unit's capacity**.
 * On the other hand, **$a$ can shift rapidly without affecting the stability of $c$ at all**: assuming again that nothing needs to be forgotten in $c$ and that $c$ is comprehensive, it is easy to see how $\Gamma_o$ could filter out very different $a$ states in two consecutive time-steps, just based on a different input. If the weight columns selected by the two one-hot inputs are complementary, the resulting responses are linearly independent of each other and can differ arbitrarily while keeping numerical stability. **Instead of expressing $a^{<t>}$ as a long chain of transformations from the initial state, here a histogram is being steadily built up and refined, and $a^{<t>}$ is an "informed filtering" of that histogram**. This approach is numerically more stable.
 * **The candidate pipeline ($\Gamma_u$ and $\tilde c$) has the task of embedding the current $[a^{<t-1>}, x^{<t>}]$ input so that it provides a meaningful contribution to the $c$ histogram**.
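
The numerical difference between the two squashing functions discussed above can be sketched with the standard library alone. The helpers `sigmoid` and `slope` are assumptions made for illustration. The sketch relies on the identity $\sigma(x) = \frac{1}{2} + \frac{1}{2}\tanh(\frac{x}{2})$ and estimates the slopes by finite differences.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def slope(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Compare value and local slope of both functions at growing magnitudes
for x in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(
        f"x={x}: tanh={math.tanh(x):.3f} (slope {slope(math.tanh, x):.3f}), "
        f"sigmoid={sigmoid(x):.3f} (slope {slope(sigmoid, x):.3f})"
    )

# The sigmoid is a shifted, rescaled tanh evaluated at half the input
x = 4.0
print(sigmoid(x), 0.5 + 0.5 * math.tanh(x / 2))
```

Reading the printed slopes side by side shows tanh flattening out at roughly half the input magnitude at which the sigmoid does, which is the sense in which a tanh-normalized $c$ saturates "faster" as $c$ grows with $t$.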
 
 So at this stage we have developed some intuition of how an LSTM *could* work: what kind of numbers flow through it, what meaning they could have depending on their position and value, and what role the given operations would play when applied to them. This builds a good base to check related interpretations.

## 3. Related work

The first sections of http://proceedings.mlr.press/v37/jozefowicz15.pdf discuss the problems with, and the intuitions behind, hand-engineered recurrent units, specifically LSTMs:

* Architecture search over 10k+ models.

* LSTMs are best when dropout is used.

* The gradient of tanh is described as "better-behaved" than sigmoid. This seems to go against the intuitions developed so far.

* Initializing $b_f$ with a large value such as 1 or 2 (Gers, 2000) closes the gap between LSTM and more sophisticated units. This is because not starting with $\Gamma_f$ wide open will effectively cause vanishing gradients.

* $\Gamma_u$ (the input gate) is important, $\Gamma_o$ is unimportant (this probably holds if $c$ has enough dimensions to provide the y-layer with a sparse enough code). $\Gamma_f$ is extremely significant on all problems except language modelling (maybe language problems have orders of magnitude less latent dimensionality and/or noise in the data than "other problems"; that would decrease the need to correct the histogram).

* Two main operational semantics: building the histogram and decoding information from it. This aligns with the intuitions already developed.

* Also note that in this paper the output gate uses a tanh instead of a sigmoid activation. This also aligns with our thoughts so far.

* LSTMs don't feature attractor-based memory systems.

* GRUs are more recent (Cho et al. 2014) and outperformed LSTM on nearly all tasks except language modelling (but LSTMs almost matched GRU with $b_f^{<0>} \geq 1$).
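
The effect of the forget-gate bias mentioned in the list above can be illustrated with a small sketch. It assumes, hypothetically, that at initialization the weights are small enough for the forget-gate pre-activation to be roughly its bias $b_f$, and it only follows the direct path through the cell, where $\partial c^{<t>} / \partial c^{<0>}$ is the product of the forget-gate values. The function name `memory_retention` is introduced here for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def memory_retention(b_f, steps):
    """Product of forget-gate values over `steps` time-steps.

    Assumes the gate pre-activation equals the bias b_f at every step,
    so the gradient along the direct cell path is sigmoid(b_f) ** steps.
    """
    return sigmoid(b_f) ** steps

for b_f in (0.0, 1.0, 2.0):
    print(
        f"b_f={b_f}: gate={sigmoid(b_f):.3f}, "
        f"retention after 50 steps={memory_retention(b_f, 50):.2e}"
    )
```

Under these assumptions, a zero bias halves the signal at every step, while a bias of 1 or 2 keeps the gate mostly open, so far more of the gradient survives over the same number of steps.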