# XLNet: Generalized Autoregressive Pretraining for Language Understanding

**References**:

[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)

[XLNet (Zihang Dai)](https://github.com/zihangdai/xlnet)

[XLNet (Hugging Face)](https://huggingface.co/transformers/model_doc/xlnet.html)

[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)

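**Questions to consider**:

1. What does "permutation" mean in a permutation language model? Does the word order of the input sequence not matter?

2. What is contextual information? Can a bidirectional language model represent contextual information? How does a permutation language model learn contextual information?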

## Abstract

`XLNet` is a generalized autoregressive pretraining method that:

1. learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order

2. overcomes the limitations of BERT thanks to its autoregressive formulation

![](./slides/slides_07.jpg)

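## 1. Pretrained Language Models

**Pretrained language models**: through self-supervised learning, they make full use of large amounts of unlabeled text and encode the linguistic knowledge it contains, which benefits downstream NLP tasks. The publication of `ELMo`, `GPT`, and `BERT` in 2018 opened a new era for NLP; work in 2019 such as `GPT-2`, `RoBERTa`, and `T5` showed that larger datasets and larger, more complex models keep improving the performance of pretrained language models.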

![](./slides/slides_02.jpg)

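Word-embedding models such as `word2vec` output static word vectors: each word has one fixed vector, regardless of context. Static word vectors cannot represent polysemy.

In contrast, pretrained models compute a word's vector from its surrounding context, so the vector is dynamic.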

## 2. Autoregressive Objective vs. Denoising Autoencoding Objective

![](./slides/slides_03.jpg)

![](./slides/slides_05.jpg)
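For reference, the two objectives compared in this section can be written as follows, using the notation of the XLNet paper. Here $\hat{\mathbf{x}}$ is the corrupted (masked) input, $\bar{\mathbf{x}}$ denotes the masked tokens, and $m_t = 1$ indicates that $x_t$ is masked.

Autoregressive (AR) language modeling:

$$
\max_{\theta} \; \log p_{\theta}(\mathbf{x}) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
$$

Denoising autoencoding (BERT):

$$
\max_{\theta} \; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
$$

The $\approx$ in the second objective reflects the assumption that the masked tokens are reconstructed independently of each other, whereas the AR objective is an exact factorization that only looks in one direction.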

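## 3. Permutation Language Modeling

If model parameters are shared across all factorization orders, then in expectation the model will learn to gather information from all positions on both sides.

**Permutation (orderless)**: this does not mean that the input sequence can be shuffled arbitrarily; it means that the model can fit the probability distribution regardless of the factorization order used to decompose it. The input keeps its original order and positional encodings, and the factorization order is realized through the attention mask.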
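The point that only the factorization order changes, not the input order, can be illustrated with a small NumPy sketch. The function name `permutation_attention_mask` and the example order are illustrative assumptions, not part of the official code; the sketch only shows which positions each target is allowed to condition on.

```python
import numpy as np

def permutation_attention_mask(order):
    """Return can_attend[i, j] = True if position i may condition on position j.

    `order` is a factorization order: order[t] is the 0-based position that is
    predicted at step t. The input sequence itself is never reordered.
    """
    seq_len = len(order)
    rank = np.empty(seq_len, dtype=np.int64)
    rank[np.asarray(order)] = np.arange(seq_len)  # rank[pos] = step at which pos is predicted
    # Position i may only use positions that come strictly earlier in the order.
    return rank[None, :] < rank[:, None]

order = [2, 1, 3, 0]  # 0-based version of the factorization order 3 -> 2 -> 4 -> 1
mask = permutation_attention_mask(order)
for step, pos in enumerate(order):
    context = np.flatnonzero(mask[pos]).tolist()
    print(f"step {step}: predict x[{pos}] from positions {context}")
```

With this order, `x[2]` is predicted from nothing, `x[1]` from position 2, `x[3]` from positions 1 and 2, and `x[0]` from positions 1, 2, and 3. Averaged over many sampled orders, every position ends up being predicted from context on both sides.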

![](./slides/slides_09.jpg)

![](./slides/slides_11.jpg)

![](./slides/slides_14.jpg)

![](./slides/slides_15.jpg)

![](./img/plm_obj.png)


## 3. Target-Position-Aware Representation

The target position is $z_{t}$: the position index of the token predicted at step $t$ of the permuted factorization order.
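The need for a target-position-aware representation can be stated with the two softmax formulations from the XLNet paper. With a standard Transformer representation $h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})$, the next-token distribution is

$$
p_{\theta}(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}) = \frac{\exp\left(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}{\sum_{x'} \exp\left(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}
$$

which does not depend on $z_t$, so two different target positions that share the same context would receive the same prediction. XLNet instead uses a representation $g_{\theta}$ that also takes the target position as input:

$$
p_{\theta}(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}) = \frac{\exp\left(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}{\sum_{x'} \exp\left(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
$$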

![](./slides/slides_22.jpg)

![](./slides/slides_24.jpg)

![](./slides/slides_28.jpg)

## 4. Two-Stream Self-Attention

![](./slides/slides_30.jpg)

![](./slides/slides_31.jpg)

![](./img/two_stream_self_attention_.png)

![](./img/two_stream_self_attention.png)

![](./slides/slides_32.jpg)
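The difference between the two streams comes down to their attention masks: the content stream $h$ may attend to its own position, while the query stream $g$ may not. The following NumPy sketch shows one simplified layer under stated assumptions: a single head, no learned projections, no relative positional encoding, and no memory. The helper names are hypothetical and are not taken from the official implementation.

```python
import numpy as np

def two_stream_masks(order):
    """Build boolean masks (True = may attend) for a factorization order."""
    seq_len = len(order)
    rank = np.empty(seq_len, dtype=np.int64)
    rank[np.asarray(order)] = np.arange(seq_len)
    content_mask = rank[None, :] <= rank[:, None]  # h_i sees earlier positions and itself
    query_mask = rank[None, :] < rank[:, None]     # g_i sees strictly earlier positions only
    return content_mask, query_mask

def masked_attention(q, k, v, can_attend):
    """Scaled dot-product attention; rows with nothing to attend to return zeros."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(can_attend, scores, -1e30)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * can_attend
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v

rng = np.random.default_rng(0)
order = [2, 1, 3, 0]
d_model = 8
h = rng.normal(size=(4, d_model))                   # content stream init: token embeddings
g = np.tile(rng.normal(size=(1, d_model)), (4, 1))  # query stream init: one shared vector

content_mask, query_mask = two_stream_masks(order)
h_next = masked_attention(h, h, h, content_mask)  # queries from h, keys/values from h
g_next = masked_attention(g, h, h, query_mask)    # queries from g, keys/values from h
```

Because position `i` is excluded from its own row of `query_mask`, `g_next[i]` cannot leak the content of the token it is asked to predict, while `h_next[i]` still encodes that token for use by later positions in the order.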


## 5. Implementation Details

### 5.1 Partial Prediction

![](./img/parial_pred.png)
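As described in the XLNet paper, partial prediction splits a factorization order $\mathbf{z}$ at a cutting point $c$ and only predicts the last tokens $\mathbf{z}_{>c}$, with a hyperparameter $K$ chosen so that roughly $1/K$ of the tokens are targets. A minimal sketch of that split follows; the function name is hypothetical, and the official code selects targets differently (through the `is_masked` argument of `_local_perm`).

```python
import numpy as np

def split_partial_prediction(order, K):
    """Split a factorization order into (context positions, target positions).

    Only about 1/K of the positions, those at the end of the order, are predicted.
    """
    seq_len = len(order)
    num_targets = max(1, seq_len // K)
    cut = seq_len - num_targets  # the cutting point c
    return order[:cut], order[cut:]

rng = np.random.default_rng(0)
order = rng.permutation(12).tolist()
context, targets = split_partial_prediction(order, K=6)
print("context positions:", context)
print("target positions: ", targets)
```

Targets at the end of the order have the longest context available, which is why predicting only those positions makes optimization easier than predicting every position.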

### 5.2 Transformer-XL

![](./img/transformer_xl_.png)

![](./img/transformer_xl.png)
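The segment-level recurrence of Transformer-XL can be sketched as follows: queries come from the current segment, while keys and values come from the cached hidden states of the previous segment concatenated with the current one (compare `cat = tf.concat([mems, h], 0)` in the appendix code). This is a simplified single-layer, single-head NumPy sketch with hypothetical helper names; it omits learned projections, relative positional encodings, and attention masks.

```python
import numpy as np

def attend_with_memory(h, mem):
    """Queries from the current segment; keys/values from [memory ; current segment]."""
    kv = h if mem is None else np.concatenate([mem, h], axis=0)
    scores = h @ kv.T / np.sqrt(h.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def update_memory(mem, h, mem_len):
    """Cache the most recent `mem_len` states; in training they receive no gradient."""
    cat = h if mem is None else np.concatenate([mem, h], axis=0)
    return cat[-mem_len:]

rng = np.random.default_rng(0)
segments = [rng.normal(size=(4, 8)) for _ in range(3)]  # three segments of 4 tokens each

mem = None
for seg in segments:
    out = attend_with_memory(seg, mem)        # can look back into the cached segment
    mem = update_memory(mem, seg, mem_len=6)  # reuse these states for the next segment
```

The cache lets each segment attend beyond its own boundary without recomputing earlier hidden states.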

### 5.3 Relative Segment Encodings

![](./img/rel_seg_enc.png)
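In the XLNet paper, relative segment encoding adds a bias $a_{ij} = (\mathbf{q}_i + \mathbf{b})^{\top} \mathbf{s}_{ij}$ to the attention score, where $\mathbf{s}_{ij}$ is one of two learnable vectors depending on whether positions $i$ and $j$ belong to the same segment. A small NumPy sketch for one head follows; the function and argument names are illustrative assumptions.

```python
import numpy as np

def relative_segment_bias(q, seg_ids, s_same, s_diff, b):
    """Return a[i, j] = (q[i] + b) . s_ij for a single attention head.

    s_ij is `s_same` when positions i and j share a segment, else `s_diff`.
    """
    same = seg_ids[:, None] == seg_ids[None, :]  # [seq_len, seq_len]
    a_same = (q + b) @ s_same                    # [seq_len], bias if j is in i's segment
    a_diff = (q + b) @ s_diff                    # [seq_len], bias otherwise
    return np.where(same, a_same[:, None], a_diff[:, None])

rng = np.random.default_rng(0)
d_head = 4
q = rng.normal(size=(4, d_head))  # query vectors for 4 positions
s_same, s_diff, b = (rng.normal(size=d_head) for _ in range(3))

bias = relative_segment_bias(q, np.array([0, 0, 1, 1]), s_same, s_diff, b)
swapped = relative_segment_bias(q, np.array([1, 1, 0, 0]), s_same, s_diff, b)
print(np.allclose(bias, swapped))
```

Swapping the segment labels leaves the bias unchanged, because only the relation between two positions matters and not the absolute segment id.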


## 6. Summary

![](./slides/slides_39.jpg)

## 7. Experiments

![](./slides/slides_41.jpg)

![](./slides/slides_43.jpg)

![](./slides/slides_44.jpg)

## Appendix

Code implementation (excerpted from [XLNet (Zihang Dai)](https://github.com/zihangdai/xlnet))

### Permutation and Masking

```python
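import tensorflow as tf  # TF 1.x API (e.g. tf.random_shuffle)

# SEP_ID and CLS_ID are module-level constants (special-token ids)
# defined alongside this function in the repository's data_utils.py.
def _local_perm(inputs, targets, is_masked, perm_size, seq_len):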
  """
  Sample a permutation of the factorization order, and create an
  attention mask accordingly.

  Args:
    inputs: int64 Tensor in shape [seq_len], input ids.
    targets: int64 Tensor in shape [seq_len], target ids.
    is_masked: bool Tensor in shape [seq_len]. True means being selected
      for partial prediction.
    perm_size: the length of longest permutation. Could be set to be reuse_len.
      Should not be larger than reuse_len or there will be data leaks.
    seq_len: int, sequence length.
  """

  # Generate permutation indices
  index = tf.range(seq_len, dtype=tf.int64)
  index = tf.transpose(tf.reshape(index, [-1, perm_size]))
  index = tf.random_shuffle(index)
  index = tf.reshape(tf.transpose(index), [-1])

  # `perm_mask` and `target_mask`
  # non-functional tokens
  non_func_tokens = tf.logical_not(tf.logical_or(
      tf.equal(inputs, SEP_ID),
      tf.equal(inputs, CLS_ID)))

  non_mask_tokens = tf.logical_and(tf.logical_not(is_masked), non_func_tokens)
  masked_or_func_tokens = tf.logical_not(non_mask_tokens)

  # Set the permutation indices of non-masked (& non-functional) tokens to the
  # smallest index (-1):
  # (1) they can be seen by all other positions
  # (2) they cannot see masked positions, so there won't be information leak
  smallest_index = -tf.ones([seq_len], dtype=tf.int64)
  rev_index = tf.where(non_mask_tokens, smallest_index, index)

  # Create `target_mask`: non-functional and masked tokens
  # 1: use mask as input and have loss
  # 0: use token (or [SEP], [CLS]) as input and do not have loss
  target_tokens = tf.logical_and(masked_or_func_tokens, non_func_tokens)
  target_mask = tf.cast(target_tokens, tf.float32)

  # Create `perm_mask`
  # `target_tokens` cannot see themselves
  self_rev_index = tf.where(target_tokens, rev_index, rev_index + 1)

  # 1: cannot attend if i <= j and j is not non-masked (masked_or_func_tokens)
  # 0: can attend if i > j or j is non-masked
  perm_mask = tf.logical_and(
      self_rev_index[:, None] <= rev_index[None, :],
      masked_or_func_tokens)
  perm_mask = tf.cast(perm_mask, tf.float32)

  # new target: [next token] for LM and [curr token] (self) for PLM
  new_targets = tf.concat([inputs[0: 1], targets[: -1]],
                          axis=0)

  # construct inputs_k
  inputs_k = inputs

  # construct inputs_q
  inputs_q = target_mask

  return perm_mask, new_targets, target_mask, inputs_k, inputs_q
```
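To see what `perm_mask` looks like on a concrete input, the masking arithmetic of `_local_perm` can be restated in NumPy. This is a sketch for inspection only, not the repository code: it takes the shuffled `index` as an argument instead of sampling it, and the boolean `is_func` array stands in for the `SEP_ID` / `CLS_ID` comparison.

```python
import numpy as np

def local_perm_masks(index, is_masked, is_func):
    """NumPy restatement of the mask logic; True in perm_mask means 'cannot attend'."""
    non_mask = ~is_masked & ~is_func
    masked_or_func = ~non_mask
    # Non-masked tokens get the smallest index so every position can see them.
    rev_index = np.where(non_mask, -1, index)
    # Targets are masked tokens that are not functional tokens.
    target = masked_or_func & ~is_func
    # Targets keep their index (so they cannot see themselves); others are shifted by one.
    self_rev_index = np.where(target, rev_index, rev_index + 1)
    perm_mask = (self_rev_index[:, None] <= rev_index[None, :]) & masked_or_func[None, :]
    return perm_mask, target

index = np.array([2, 0, 3, 1])                      # a sampled permutation index
is_masked = np.array([False, True, False, True])    # positions 1 and 3 are targets
is_func = np.zeros(4, dtype=bool)                   # no [SEP] / [CLS] in this toy input

perm_mask, target = local_perm_masks(index, is_masked, is_func)
print(perm_mask.astype(int))
```

In this toy input, row `i` of `perm_mask` marks the positions that `i` cannot attend to. The non-masked positions 0 and 2 cannot see either target; target 1, which comes first in the order, cannot see itself or target 3; target 3 can see target 1 but not itself.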

### Two-Stream Self-Attention

```python
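import tensorflow as tf  # TF 1.x API (tf.variable_scope)

# `head_projection`, `rel_attn_core` and `post_attention` are helper functions
# defined alongside this function in the repository's modeling.py.
def two_stream_rel_attn(h, g, r, mems, r_w_bias, r_r_bias, seg_mat, r_s_bias,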
                        seg_embed, attn_mask_h, attn_mask_g, target_mapping,
                        d_model, n_head, d_head, dropout, dropatt, is_training,
                        kernel_initializer, scope='rel_attn'):
  """Two-stream attention with relative positional encoding."""

  scale = 1 / (d_head ** 0.5)
  with tf.variable_scope(scope, reuse=False):

    # content based attention score
    if mems is not None and mems.shape.ndims > 1:
      cat = tf.concat([mems, h], 0)
    else:
      cat = h

    # content-based key head
    k_head_h = head_projection(
        cat, d_model, n_head, d_head, kernel_initializer, 'k')

    # content-based value head
    v_head_h = head_projection(
        cat, d_model, n_head, d_head, kernel_initializer, 'v')

    # position-based key head
    k_head_r = head_projection(
        r, d_model, n_head, d_head, kernel_initializer, 'r')

    ##### h-stream
    # content-stream query head
    q_head_h = head_projection(
        h, d_model, n_head, d_head, kernel_initializer, 'q')

    # core attention ops
    attn_vec_h = rel_attn_core(
        q_head_h, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat, r_w_bias,
        r_r_bias, r_s_bias, attn_mask_h, dropatt, is_training, scale)

    # post processing
    output_h = post_attention(h, attn_vec_h, d_model, n_head, d_head, dropout,
                              is_training, kernel_initializer)

  with tf.variable_scope(scope, reuse=True):
    ##### g-stream
    # query-stream query head
    q_head_g = head_projection(
        g, d_model, n_head, d_head, kernel_initializer, 'q')

    # core attention ops
    if target_mapping is not None:
      q_head_g = tf.einsum('mbnd,mlb->lbnd', q_head_g, target_mapping)
      attn_vec_g = rel_attn_core(
          q_head_g, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat, r_w_bias,
          r_r_bias, r_s_bias, attn_mask_g, dropatt, is_training, scale)
      attn_vec_g = tf.einsum('lbnd,mlb->mbnd', attn_vec_g, target_mapping)
    else:
      attn_vec_g = rel_attn_core(
          q_head_g, k_head_h, v_head_h, k_head_r, seg_embed, seg_mat, r_w_bias,
          r_r_bias, r_s_bias, attn_mask_g, dropatt, is_training, scale)

    # post processing
    output_g = post_attention(g, attn_vec_g, d_model, n_head, d_head, dropout,
                              is_training, kernel_initializer)

    return output_h, output_g
```
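The two `einsum` calls around `target_mapping` can be hard to read. Assuming `target_mapping` is a one-hot tensor of shape `[num_predict, seq_len, bsz]`, the first call scatters the `num_predict` query vectors to their sequence positions and the second gathers results back. The NumPy sketch below uses made-up shapes and target positions to show the round trip; in the real function the gather is applied to `attn_vec_g` after attention rather than directly to the queries.

```python
import numpy as np

seq_len, bsz, n_head, d_head = 5, 2, 3, 4
# Hypothetical target positions, shape [num_predict, bsz]; distinct within each example.
target_positions = np.array([[3, 1],
                             [0, 4]])
num_predict = target_positions.shape[0]

# target_mapping[m, l, b] = 1 if the m-th prediction of example b is at position l.
target_mapping = np.zeros((num_predict, seq_len, bsz))
for m in range(num_predict):
    for b in range(bsz):
        target_mapping[m, target_positions[m, b], b] = 1.0

rng = np.random.default_rng(0)
q_head_g = rng.normal(size=(num_predict, bsz, n_head, d_head))

# Scatter: [num_predict, bsz, n, d] -> [seq_len, bsz, n, d]
q_full = np.einsum('mbnd,mlb->lbnd', q_head_g, target_mapping)
# Gather: [seq_len, bsz, n, d] -> [num_predict, bsz, n, d]
q_back = np.einsum('lbnd,mlb->mbnd', q_full, target_mapping)

print(q_full.shape, q_back.shape)
```

Positions that are not prediction targets receive zero vectors after the scatter, and the gather recovers exactly one vector per target, so the query stream only needs to be computed for `num_predict` positions instead of the whole sequence.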
