

# 1. Introduction.

- In the previous lecture we mostly discussed the fully connected (FC) layer.
    - A fully connected layer is a linear model.
- Sometimes the original data has a structure that must be taken into account when training a model. This also holds when we use neural networks (NNs).
    - Example: Speech.
        - A group of words arranged in a particular order forms a sentence, which has a meaning.
    - Example: An image.
        - There is a relationship between neighboring pixels.
- That is, we have to perform transformations that preserve the structure of the data.
    - An FC layer treats pixels independently of each other, but the meaning of a picture is not contained in each pixel separately; it lies in the way the pixels are grouped together. The same holds for speech.
    

- Phrase: **<font color=green>It was raining this morning, so I went to work with an</font><font color=red> umbrella</font>.**
- Context: **<font color=green>It was raining this morning, so I went to work with an</font>**.
    - The part of the text that we use to predict the next word.
    - Requirement 1: The context has to be encoded as a vector of fixed, finite dimension that does not depend on the context's length.
    - Requirement 2: The context has to be updatable with new information, i.e., we can (and need to) add a word to it.
- **Question:** How do we vectorize a text?
- **Answer:** We can think of a word as an element of the vector space generated by the dictionary. The dictionary is finite, so this space is finite-dimensional. Each word can then be represented as a one-hot vector.
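
The one-hot encoding described above can be sketched in a few lines. The toy `vocabulary` and the helper name `one_hot` are illustrative assumptions, not part of the lecture; a real dictionary would contain every known word.

```python
def one_hot(word, vocabulary):
    """Return the one-hot vector of `word` over an ordered `vocabulary`."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1  # a single 1 at the word's position
    return vector


# Hypothetical toy dictionary, used only for illustration.
vocabulary = ["<START>", "it", "rained", "umbrella", "<END>"]

encoded = [one_hot(word, vocabulary) for word in ["it", "rained"]]
print(encoded)
```

Every vector has the same length as the dictionary, regardless of how long the text is.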

- RNN workflow from layer $i$ to layer $i+1$:
    - Suppose $\text{context}_i$ is the context right before the $i$-th layer.
    - Concatenate the new $\text{input}$ to it, which gives $\text{context}_i^{\text{new}} = [\text{context}_i, \text{input}]$.
    - Perform a linear transformation $f(x) = Wx + b$, such that $f(\text{context}_i^{\text{new}})\text{.shape} = \text{context}_i\text{.shape}$.
    - Pass the result through a nonlinear activation function $g(\cdot) \implies  \text{context}_{i+1} = g\big(f(\text{context}_i^{\text{new}})\big)$.
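
The steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's implementation: the choice $g = \tanh$, the random weights, and the sizes are assumptions made only so that the example runs.

```python
import math
import random


def rnn_step(context, new_input, W, b):
    """One RNN step: concatenate, apply f(x) = Wx + b, then g = tanh."""
    x = context + new_input  # list concatenation: [context_i, input]
    linear = [sum(w * v for w, v in zip(row, x)) + bias
              for row, bias in zip(W, b)]  # f(x) = Wx + b
    return [math.tanh(v) for v in linear]  # context_{i+1} = g(f(x))


# Assumed toy sizes: a 3-dimensional context and a 5-word one-hot input.
random.seed(0)
hidden_size, input_size = 3, 5
W = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
     for _ in range(hidden_size)]
b = [0.0] * hidden_size

context = [0.0] * hidden_size
new_input = [0, 1, 0, 0, 0]
next_context = rnn_step(context, new_input, W, b)
print(len(next_context) == len(context))  # True: the shape is preserved
```

Because $W$ has `hidden_size` rows and `hidden_size + input_size` columns, the output always has the shape of the context, whatever the length of the text seen so far.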

![alt text](https://i.ibb.co/NLX107n/Screen-Shot-2020-11-28-at-01-39-24.png)

![alt text](https://i.ibb.co/qWBQRZ8/Screen-Shot-2020-11-28-at-01-53-41.png)

- $h_i$ - **_context_** vector or **_hidden state_**.
- $x_i$ - input vector (the new information).


![alt text](https://i.ibb.co/pbdRSRP/Screen-Shot-2020-11-28-at-01-59-03.png)

 - It seems that if at some point the network predicts a word wrongly, the subsequent predictions will also be affected, since that word is used for prediction in the following layers.
 - `<START>` and `<END>` mark the limits of the sentence (they are called **_tokens_**).
     - They control the points where the model starts and stops predicting.
     - This also allows us to take $h_0 = \vec{0}$.
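
The role of the two tokens, and the way an early mistake propagates, can be illustrated with a toy generation loop. Everything here is hypothetical: `toy_predict_next` is a lookup table standing in for a trained network (it hides the hidden state entirely), and `generate` is an illustrative helper name.

```python
def generate(predict_next, prefix=("<START>",), max_len=10):
    """Feed each predicted word back in until <END> is produced."""
    tokens = list(prefix)
    while tokens[-1] != "<END>" and len(tokens) < max_len:
        tokens.append(predict_next(tokens))
    return tokens


# Hypothetical stand-in for a trained model: the next word depends
# only on the last word, via a fixed lookup table.
toy_table = {"<START>": "it", "it": "rained", "rained": "<END>",
             "he": "left", "left": "<END>"}


def toy_predict_next(tokens):
    return toy_table.get(tokens[-1], "<END>")


correct = generate(toy_predict_next)
# Simulate a wrong first prediction ("he" instead of "it").
derailed = generate(toy_predict_next, prefix=("<START>", "he"))
print(correct)
print(derailed)
```

After the single wrong word, every later word differs from the original sentence, because each prediction is conditioned on the previous ones.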
     
----

# 2. RNN Drawbacks.

- $h_i$ changes at each step: each input perturbs the hidden state $\implies h_i$ has no chance of passing through a layer unchanged.
    - Example of this being a problem: speech with a lot of hesitation markers (_hmmm, eeh, este, o sea_). The markers don't add any meaningful new information to the context (in fact, they could be removed), but when dealing with an unprocessed transcription they will change the hidden state anyway.
        - This is because each _hmmm, eeh, este, o sea_ is a token fed to the network as input.
    - In general, the hidden state will change even if the input data contains only some small noise.

- $\implies$ we'd want the model to decide which inputs are more important than others. This is where LSTM comes to the rescue.
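
The perturbation described above can be seen in a scalar toy RNN. The weights `w_h`, `w_x` and the input values are arbitrary assumptions chosen for illustration.

```python
import math


def scalar_rnn_step(h, x, w_h=0.9, w_x=1.2, b=0.0):
    """Scalar version of the RNN update: h_new = tanh(w_h*h + w_x*x + b)."""
    return math.tanh(w_h * h + w_x * x + b)


h_meaningful = scalar_rnn_step(0.0, 1.0)        # after a meaningful input
h_filler = scalar_rnn_step(h_meaningful, 0.05)  # after a near-empty filler input
h_silent = scalar_rnn_step(h_meaningful, 0.0)   # even a zero input changes h
print(h_meaningful, h_filler, h_silent)
```

Neither the filler input nor the zero input leaves the hidden state where it was: a plain RNN has no mechanism for saying "ignore this step".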

# 3. LSTM.

![alt text](https://i.ibb.co/znv3nVX/Screen-Shot-2020-11-28-at-17-03-02.png)

<h1><center>$\Downarrow$</center></h1>

![alt text](https://i.ibb.co/zFPr3DK/Screen-Shot-2020-11-28-at-17-03-12.png)

- Here we use $\tanh$ instead of ReLU or LeakyReLU because the latter functions are not bounded, so the hidden states can take arbitrarily large values during the forward pass. We want to avoid this, since we'd like each layer to receive data with the same distribution.
    - $\tanh$ and $\sigma$ are bounded functions.
    - Terminology: the Russian term "врата" means *gate*.
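
The effect of boundedness can be checked numerically. In this sketch a scalar hidden state is fed back through the same activation 20 times; the recurrent weight 1.5 is an arbitrary assumption.

```python
import math


def relu(v):
    return max(0.0, v)


h_relu = h_tanh = 1.0
for _ in range(20):
    h_relu = relu(1.5 * h_relu)       # unbounded: grows like 1.5 ** 20
    h_tanh = math.tanh(1.5 * h_tanh)  # bounded: always stays in (-1, 1)

print(h_relu)
print(h_tanh)
```

With ReLU the value grows geometrically from step to step, while with $\tanh$ it remains in a fixed range no matter how many steps are taken.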

## 3.1. LSTM Overview.

### 3.1.0. Notation.

- The notation changes slightly.

![alt text](https://i.ibb.co/hgZ2SJn/Screen-Shot-2020-11-28-at-17-24-50.png)

- $h^{(t-1)}$ - hidden state coming from the previous layer.
- $x^{(t)}$ - input to the $t$-th layer.
- $h^{(t)}$ - updated hidden state with the $x^{(t)}$ input.

![alt text](https://i.ibb.co/s3qCKTV/Screen-Shot-2020-11-28-at-17-46-03.png)
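
The figures are not reproduced in text, so the following is only a sketch of the commonly used LSTM formulation, which may differ in detail from the diagrams. It assumes a separate cell state `c` alongside the hidden state, three $\sigma$ gates (forget, input, output) and a $\tanh$ candidate; the names `lstm_step` and `params` and all sizes are illustrative.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, x):
    """Compute Wx + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def lstm_step(h_prev, c_prev, x_t, params):
    """One step of a standard LSTM cell (assumed formulation)."""
    z = h_prev + x_t  # concatenation [h^(t-1), x^(t)]
    f = [sigmoid(v) for v in affine(*params["forget"], z)]       # what to keep
    i = [sigmoid(v) for v in affine(*params["input"], z)]        # what to write
    o = [sigmoid(v) for v in affine(*params["output"], z)]       # what to expose
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]  # new content
    c_t = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h_t = [ov * math.tanh(cv) for ov, cv in zip(o, c_t)]
    return h_t, c_t


def random_params(hidden_size, input_size, rng):
    def layer():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
             for _ in range(hidden_size)]
        return W, [0.0] * hidden_size
    return {name: layer() for name in ("forget", "input", "output", "candidate")}


rng = random.Random(0)
params = random_params(hidden_size=3, input_size=2, rng=rng)
h, c = [0.0] * 3, [0.0] * 3  # zero initial states
for x_t in ([1.0, 0.0], [0.0, 1.0]):
    h, c = lstm_step(h, c, x_t, params)
print(h)
```

In this formulation, a forget gate close to 1 combined with an input gate close to 0 lets the cell state pass through a step almost unchanged, which is exactly what the plain RNN could not do.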

# 4. Another drawback.

- Earlier we discussed the importance of having bounded activation functions. The problem is that for inputs of large magnitude (large positive or large negative), $\tanh$ and $\sigma$ saturate: their derivatives are close to zero, which leads to vanishing gradients.
- Specifically, during backpropagation, the further back we go towards the beginning of the network, the smaller the gradient becomes $\implies$ during optimization, the parameter updates practically ignore the contribution of the earlier layers.
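
Both effects can be checked numerically using the derivatives $\tanh'(x) = 1 - \tanh^2(x)$ and $\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$. The scalar recurrence $h_t = \tanh(w\,h_{t-1})$ with $w = 0.9$ and 30 steps is an assumed toy setup, not the lecture's model.

```python
import math


def tanh_grad(v):
    return 1.0 - math.tanh(v) ** 2


def sigmoid_grad(v):
    s = 1.0 / (1.0 + math.exp(-v))
    return s * (1.0 - s)


# Saturation: the derivatives shrink quickly as |x| grows.
for v in (0.0, 2.0, 5.0):
    print(v, tanh_grad(v), sigmoid_grad(v))

# Chain rule through 30 steps of h_t = tanh(w * h_{t-1}):
# dh_T/dh_0 is the product of the per-step factors w * tanh'(w * h_{t-1}).
w, h = 0.9, 0.5
grad = 1.0
for _ in range(30):
    pre_activation = w * h
    grad *= w * tanh_grad(pre_activation)
    h = math.tanh(pre_activation)
print(grad)
```

Each per-step factor is below 1 here, so the product shrinks geometrically with the number of steps, and the earliest steps receive almost no gradient signal.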

![alt text](https://i.ibb.co/dmhG42V/Screen-Shot-2020-11-28-at-18-45-15.png)