# Convolutional Sequence to Sequence Learning

## Insights
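- <span class="mark">A convolution's output can be kept the same length as its input</span>
    - set the kernel size to $k$
    - pad $\lfloor k/2 \rfloor$ zeros on both the left and right sides
    - trim any surplus elements from the end of the output (for odd $k$ the output length $m + 2\lfloor k/2 \rfloor - k + 1$ already equals the input length $m$, so nothing is removed)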
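
The length bookkeeping above can be checked with a minimal pure-Python sketch. It assumes stride 1 and zero padding; `conv1d_valid` and `same_length_conv` are hypothetical helper names for illustration, not part of any library.

```python
def conv1d_valid(xs, kernel):
    """Plain 1-D convolution (cross-correlation) with no implicit padding."""
    k = len(kernel)
    return [sum(kernel[j] * xs[i + j] for j in range(k))
            for i in range(len(xs) - k + 1)]


def same_length_conv(xs, kernel):
    """Zero-pad floor(k/2) on both sides, then trim any surplus from the end."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    out = conv1d_valid(padded, kernel)
    return out[:len(xs)]  # no-op for odd k, drops one element for even k


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out_odd = same_length_conv(xs, [1.0, 1.0, 1.0])        # k = 3
out_even = same_length_conv(xs, [1.0, 1.0, 1.0, 1.0])  # k = 4
print(len(xs), len(out_odd), len(out_even))
```

Both kernels return as many elements as they were given; only the even kernel needs the trimming step.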

## Position Embeddings
- the input tokens $x=(x_1, \cdots, x_m)$
    - the corresponding word vectors $w=(w_1, \cdots, w_m)$
- position embeddings give the CNN a sense of which portion of the sequence it is currently dealing with
    - $p=(p_1, \cdots, p_m)$
- the input elements are represented as
    - $e=(w_1+p_1,\cdots,w_m+p_m)$
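
As a small illustration of $e = w + p$, the sketch below adds a position vector to each word vector. The table sizes and the random initialisation are assumptions made for the example; real embedding tables would be trained.

```python
import random

random.seed(0)
d = 4            # embedding dimension (illustrative value)
vocab_size = 10  # hypothetical vocabulary size
max_len = 8      # hypothetical maximum sequence length

# Random lookup tables stand in for the word and position embedding tables.
word_table = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(vocab_size)]
pos_table = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(max_len)]


def embed(token_ids):
    """Return e = (w_1 + p_1, ..., w_m + p_m) for a list of token ids."""
    return [[w + p for w, p in zip(word_table[tok], pos_table[pos])]
            for pos, tok in enumerate(token_ids)]


x = [3, 7, 3]  # the same token id appears at positions 0 and 2
e = embed(x)
print(e[0] != e[2])  # same word, different positions, different representations
```

The repeated token ends up with two distinct input vectors, which is what lets a convolution tell positions apart.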

## Convolutional Block Structure

- each block contains **a** one-dimensional convolution followed by a non-linearity
- stacking several blocks on top of each other increases the number of input elements represented in a state
    - like a pyramid: only by stacking can the convolution condense a long but fixed-length sequence into a single state or a few states

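*Without padding, each kernel-$k$ block shortens the sequence by $k-1$ elements, so a state on top of $L$ blocks covers $L(k-1)+1$ inputs; stacking 5 kernel-5 blocks reduces 21 inputs to a single output.*

So if we have 25 inputs, one more block (6 kernel-5 blocks in total) transforms these 25 inputs into a single state.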
```
|
||
|||
||||
|||||
```
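
The pyramid arithmetic can be sketched in a few lines, assuming stride 1 and no padding so that every kernel-$k$ block shortens the sequence by $k-1$. The function names are hypothetical.

```python
def receptive_field(num_blocks, k):
    """Number of input elements represented by one state on top of num_blocks blocks."""
    return num_blocks * (k - 1) + 1


def level_lengths(num_inputs, k):
    """Sequence length after each unpadded kernel-k block, for as long as the kernel fits."""
    lengths = [num_inputs]
    while lengths[-1] >= k:
        lengths.append(lengths[-1] - (k - 1))
    return lengths


print(receptive_field(5, 5))
print(receptive_field(6, 5))
print(level_lengths(25, 5))
```

Each entry of `level_lengths` is one level of the pyramid, from the input at the bottom to the top state.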

### GLU (gated linear unit)
- add non-linearities to allow the network to exploit the full input field, or to focus on fewer elements if needed
- each convolution kernel is parameterized as $W\in \mathbb{R}^{2d\times kd}$
    - $d$ the embedding dimension
    - $k$ the kernel size
    - so the output of each kernel has $2d$ dimensions, twice the input dimension
- GLU is performed on the $2d$ output as simple gated nonlinearities.

\begin{equation}
v([A,B]) = A \otimes\sigma(B)
\end{equation}

- $\sigma(B)$ is a gate
- $A$ and $B$ are the two $d$-dimensional halves that are concatenated to form the $2d$ output
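
A minimal pure-Python sketch of the gate, assuming the first half of the $2d$ vector is $A$ and the second half is $B$ (the split order is an assumption of this example):

```python
import math


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def glu(y):
    """v([A, B]) = A * sigmoid(B), element-wise, with A and B the two halves of y."""
    d = len(y) // 2
    a, b = y[:d], y[d:]
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]


y = [1.0, -2.0, 0.0, 100.0]  # d = 2: A = [1, -2], B = [0, 100]
gated = glu(y)
print(gated)
```

A gate input of 0 passes half of the corresponding $A$ value, while a large positive gate input passes it almost unchanged, so $\sigma(B)$ controls how much of $A$ is relevant.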

### residual connections to enable deep convolutional networks

\begin{equation}
h_i^l = v(W^l [h^{l-1}_{i-k/2},\cdots,h^{l-1}_{i+k/2}]+b^l_w)
+h^{l-1}_i
\end{equation}
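
The residual update can be sketched for a whole sequence as follows. This is an illustrative pure-Python version under stated assumptions: odd $k$, zero padding of $\lfloor k/2 \rfloor$ on both sides so every position has a full window, and randomly initialised weights. `conv_block` is a hypothetical name.

```python
import math
import random


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def glu(y):
    """v([A, B]) = A * sigmoid(B), with A and B the two halves of y."""
    d = len(y) // 2
    return [a * sigmoid(b) for a, b in zip(y[:d], y[d:])]


def conv_block(h_prev, W, b, k):
    """h_prev: m vectors of size d; W: 2d x kd matrix; b: 2d biases. Returns m vectors of size d."""
    m, d = len(h_prev), len(h_prev[0])
    pad = k // 2
    zero = [0.0] * d
    padded = [zero] * pad + list(h_prev) + [zero] * pad
    h_next = []
    for i in range(m):
        # Concatenate the k neighbouring states into one vector of length k*d.
        window = [x for vec in padded[i:i + k] for x in vec]
        # Linear map to 2d dimensions, then GLU back down to d dimensions.
        y = [sum(w * x for w, x in zip(row, window)) + bias
             for row, bias in zip(W, b)]
        # Residual connection: add the block's own input at position i.
        h_next.append([v + r for v, r in zip(glu(y), h_prev[i])])
    return h_next


random.seed(0)
m, d, k = 6, 3, 3
h0 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
W = [[random.gauss(0, 0.1) for _ in range(k * d)] for _ in range(2 * d)]
b = [0.0] * (2 * d)
h1 = conv_block(h0, W, b, k)
print(len(h1), len(h1[0]))
```

Because input and output have the same shape, blocks of this kind can be stacked directly, and with all-zero weights the block reduces to the identity, which is the property that makes deep stacks easier to train.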

### padding strategy
- pad the input with $k-1$ zero vectors on both the left and right sides
- remove $k$ elements from the end of the convolution output
- this is used in the decoder so that no future information is available to it
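
A sketch of why this padding hides future elements, in pure Python with stride 1. With $k-1$ zeros on each side the convolution produces $m+k-1$ outputs, so the sketch simply keeps the first $m$ of them; the exact number of trailing elements dropped is an assumption of this sketch. `causal_conv` is a hypothetical name.

```python
def conv1d_valid(xs, kernel):
    """Plain 1-D convolution (cross-correlation) with no implicit padding."""
    k = len(kernel)
    return [sum(kernel[j] * xs[i + j] for j in range(k))
            for i in range(len(xs) - k + 1)]


def causal_conv(xs, kernel):
    """Output i depends only on xs[i-k+1 .. i], never on later elements."""
    k = len(kernel)
    pad = [0.0] * (k - 1)
    full = conv1d_valid(pad + list(xs) + pad, kernel)  # length m + k - 1
    return full[:len(xs)]                              # keep the first m outputs


kernel = [0.5, 1.0, 2.0]  # k = 3
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_conv(xs, kernel)

# Changing a later input must leave all earlier outputs untouched.
changed = list(xs)
changed[4] = 100.0
out_changed = causal_conv(changed, kernel)
print(out[:4] == out_changed[:4])
```

Only the last output changes when the last input changes, so each position sees the current and previous elements only.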

## Multi-step Attention

- separate attention mechanism for each decoder layer
- combine the current decoder state $h^l_i$ with an embedding of the previous target element $g_i$

\begin{equation}
d^l_i = W^l_d h^l_i + b^l_d + g_i
\end{equation}

Each layer has separate attention weights, computed from $d^l_i$ and each output $z^u_j$ of the last encoder block $u$:

$$
a^l_{ij} = \frac{\exp\left(d^l_i \cdot z^u_j\right)}
{\sum_{t=1}^m \exp\left(d^l_i \cdot z_t^u\right)}
$$

The conditional input $c_i^l$ to the current decoder layer is a weighted sum of the **encoder outputs** $z_j^u$ as well as the **input element embeddings** $e_j$

$$
c^l_i=\sum_{j=1}^m a^l_{ij}\left( z_j^u+e_j\right)
$$
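
The three attention equations can be traced for a single decoder position in one layer with the sketch below. It is pure Python with toy values; the identity matrix for $W^l_d$ and the zero vectors for $b^l_d$ and $g_i$ are assumptions chosen so the numbers are easy to follow, and `attention_step` is a hypothetical name.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attention_step(h_i, g_i, W_d, b_d, z, e):
    """Return (attention weights a_i, conditional input c_i) for one decoder state."""
    # d_i = W_d h_i + b_d + g_i
    d_i = [dot(row, h_i) + b + g for row, b, g in zip(W_d, b_d, g_i)]
    # a_ij = softmax over source positions j of d_i . z_j
    a_i = softmax([dot(d_i, z_j) for z_j in z])
    # c_i = sum_j a_ij (z_j + e_j)
    dim = len(z[0])
    c_i = [sum(a * (z_j[t] + e_j[t]) for a, z_j, e_j in zip(a_i, z, e))
           for t in range(dim)]
    return a_i, c_i


W_d = [[1.0, 0.0], [0.0, 1.0]]            # toy projection (identity)
b_d = [0.0, 0.0]
g_i = [0.0, 0.0]                          # embedding of the previous target element
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoder outputs for m = 3 source elements
e = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]  # input element embeddings
h_i = [2.0, 0.0]                          # current decoder state

a_i, c_i = attention_step(h_i, g_i, W_d, b_d, z, e)
print(a_i, c_i)
```

Running `attention_step` once per decoder layer, each with its own `W_d` and `b_d`, gives every layer its own weights and its own conditional input.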

## Decoder

The decoder stacks multiple convolutional layers to model the target sequence. Each decoder layer has its own attention layer, which computes a context vector just for that layer.