# LAS - Listen, Attend and Spell Neural Network

#### Arnaud Capitan

## Objective:

Give a formal description of the Listen, Attend and Spell (LAS) model.

Input: a sequence of filter bank spectra features (the speech signal): $\mathbf{x} = (x_1,...,x_T)$

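Output: a sequence of characters: $\mathbf{y} = (\langle\text{sos}\rangle,y_1,...,y_S,\langle\text{eos}\rangle)$, where $\langle\text{sos}\rangle$ and $\langle\text{eos}\rangle$ are the start-of-sentence and end-of-sentence tokens.

With $y_i \in \{a,b,c,...,z,0,...,9,\langle\text{space}\rangle,\langle\text{comma}\rangle,\langle\text{period}\rangle,\langle\text{apostrophe}\rangle,\langle\text{unknown}\rangle\}$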

___

### Listen module : 

The listener is a pyramidal bidirectional long short-term memory recurrent neural network, or pBLSTM RNN for short.

$$\mathbf{h} = \text{Listen}(\mathbf{x})$$

$h _i ^j = \text{pBLSTM}(h _{i-1} ^j, [h _{2i}^{j-1}, h_{2i+1}^{j-1}])$

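Here $h_i^j$ is the output of layer $j$ at time step $i$. Each pBLSTM layer concatenates the outputs of two consecutive time steps of the layer below, which halves the time resolution. Repeating this step for all $L$ pBLSTM layers of the Listen module reduces the input length $T$ by a factor of $2^L$, so the feature sequence $\mathbf{h} = (h_1,...,h_U)$ has length $U \approx T / 2^L$.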
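The time reduction above can be illustrated with a minimal sketch that keeps only the concatenation step $[h_{2i}^{j-1}, h_{2i+1}^{j-1}]$ and leaves out the BLSTM itself. The helper names `pyramid_concat` and `reduce_time` are hypothetical, and dropping a trailing odd frame is an assumption made here for simplicity.

```python
def pyramid_concat(frames):
    """Concatenate consecutive pairs of frames: [h_{2i}, h_{2i+1}].

    `frames` is a list of feature vectors (lists of floats).
    Assumption: a trailing odd frame is simply dropped.
    """
    return [frames[2 * i] + frames[2 * i + 1] for i in range(len(frames) // 2)]


def reduce_time(frames, num_layers):
    """Apply the pairwise concatenation once per pyramid layer.

    A real pBLSTM layer would also run a bidirectional LSTM over the
    concatenated frames; only the length reduction is shown here.
    """
    for _ in range(num_layers):
        frames = pyramid_concat(frames)
    return frames
```

With $T = 16$ input frames and $L = 3$ layers, this leaves $16 / 2^3 = 2$ frames, each one the concatenation of 8 original frames.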

___

### Attend and Spell module : 

The speller is an attention-based LSTM transducer.

At every output step, it produces a probability distribution over the next character $y_i$, conditioned on all the characters seen previously, using an attention mechanism :

- $c_i = \text{AttentionContext}(s_i,\mathbf{h})$ the context vector, based on the current decoder state $s_i$ and the listener features $\mathbf{h}$
- $s_i = \text{RNN}(s_{i-1},y_{i-1},c_{i-1})$ the current decoder state, based on the previous decoder state $s_{i-1}$, the previously emitted character $y_{i-1}$ and the context vector $c_{i-1}$
- $\mathbb{P} (y_i | \mathbf{x}, y_{<i}) = \text{CharacterDistribution}(s_i,c_i)$

Where :

- $\text{CharacterDistribution}$ is an MLP with softmax outputs over the characters
- $\text{RNN}$ is a 2-layer LSTM
- $\text{AttentionContext}$ function is the attention-based mechanism, described as follows :

$e_{i,u} = \langle \phi(s_i),\psi(h_u) \rangle$ the scalar energy for each time step $u$

$\alpha _{i,u} = \dfrac{\exp (e_{i,u})}{\sum _{u'} \exp (e_{i,u'})}$ the attention weight of time step $u$, obtained by a softmax over all time steps

$c_i = \sum _u \alpha _{i,u} h_u$ the context vector

Where $\phi$ and $\psi$ are MLPs.
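The three attention equations can be traced in a short pure-Python sketch. The function name `attention_context` is hypothetical, and $\phi$ and $\psi$ default to the identity here purely for illustration, whereas in the model they are learned MLPs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_context(s_i, h, phi=lambda v: v, psi=lambda v: v):
    """Return (c_i, alphas) for decoder state s_i and listener features h.

    Assumption: phi and psi are identity maps by default; they stand in
    for the MLPs of the model.
    """
    query = phi(s_i)
    energies = [dot(query, psi(h_u)) for h_u in h]   # e_{i,u}
    alphas = softmax(energies)                       # alpha_{i,u}
    dim = len(h[0])
    context = [sum(a * h_u[d] for a, h_u in zip(alphas, h))
               for d in range(dim)]                  # c_i
    return context, alphas
```

The weights $\alpha_{i,u}$ sum to 1 over $u$, so $c_i$ is a convex combination of the listener features $h_u$.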

### Learning :

The model is trained to maximize the log probability of the correct character sequence, where each character is conditioned on the input and on the ground-truth previous characters $y^*_{<i}$ :

$$ \max _{\theta} \sum _i \log \mathbb{P}(y_i | \mathbf{x},y^*_{<i},\theta) $$

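During training, 10% of the time the previous character fed to the decoder is sampled from the model's own character distribution instead of being taken from the ground truth. This makes the model more robust to its own prediction errors at decoding time, when no ground truth is available :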

- $\tilde{y}_i \sim \text{CharacterDistribution}(s_i,c_i)$
- $\max _{\theta} \sum _i \log \mathbb{P}(y_i|\mathbf{x},\tilde{y}_{<i},\theta)$
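The sampling step can be sketched as a small helper that decides which previous character is fed to the decoder. The function name `choose_previous_char`, its parameters, and the dictionary representation of the character distribution are assumptions made for this illustration.

```python
import random


def choose_previous_char(ground_truth_prev, prev_distribution,
                         sample_rate=0.1, rng=random):
    """Pick the previous character to feed to the decoder during training.

    ground_truth_prev: the ground-truth character y*_{i-1}.
    prev_distribution: dict mapping each character to its probability
        under the model's distribution at the previous step (assumed format).
    sample_rate: probability of sampling from the model instead of
        using the ground truth (10% in the text above).
    """
    if rng.random() < sample_rate:
        chars = list(prev_distribution)
        weights = [prev_distribution[c] for c in chars]
        # Draw one character according to the model's own distribution.
        return rng.choices(chars, weights=weights, k=1)[0]
    return ground_truth_prev
```

Setting `sample_rate=0.0` recovers the plain objective conditioned on $y^*_{<i}$.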

### Decoding :

Find the most likely character sequence given the input acoustics :

$$ \hat{\mathbf{y}} = \operatorname*{arg\,max} _{\mathbf{y}} \log \mathbb{P}(\mathbf{y}|\mathbf{x})$$
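The exact maximization over all character sequences is intractable, so in practice it is approximated. The following is a minimal sketch of a left-to-right beam search, one common approximation; the callback `next_char_probs`, the token strings, and the default parameter values are hypothetical and stand in for a trained model.

```python
import math

SOS, EOS = "<sos>", "<eos>"  # assumed string names for the special tokens


def beam_search(next_char_probs, beam_width=2, max_len=10):
    """Approximate argmax_y log P(y | x) with a left-to-right beam search.

    next_char_probs(prefix) must return a dict mapping each candidate next
    character to P(y_i | x, y_<i); it is a stand-in for the speller.
    Returns the best (sequence, log_probability) pair found.
    """
    beams = [([SOS], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for char, p in next_char_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + [char], score + math.log(p)))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])
```

A larger `beam_width` explores more hypotheses at a higher cost; `beam_width=1` reduces to greedy decoding.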
